<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

<channel>
	<title>Semantikoz</title>
	
	<link>http://blog.semantikoz.com</link>
	<description>Semantic Spaces in the Cloud</description>
	<lastBuildDate>Tue, 18 Oct 2011 22:29:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/Semantikoz" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="semantikoz" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Comparing Big Data</title>
		<link>http://blog.semantikoz.com/2011/10/14/comparing-big-data/</link>
		<comments>http://blog.semantikoz.com/2011/10/14/comparing-big-data/#comments</comments>
		<pubDate>Thu, 13 Oct 2011 15:03:34 +0000</pubDate>
		<dc:creator>Christian</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[TEAM]]></category>

		<guid isPermaLink="false">http://blog.semantikoz.com/?p=39</guid>
		<description><![CDATA[At Mendeley, we work with an ever-increasing document collection, currently on the order of 100,000,000 documents. Besides the documents themselves, we process related PDFs, extracted and user-generated meta-data, user information, user libraries and groups. Together, the data set and its application at Mendeley are large and complex. After closer inspection we can identify a core [...]]]></description>
			<content:encoded><![CDATA[<p>At <a title="Mendeley" href="http://www.mendeley.com/" target="_blank">Mendeley</a>, we work with an ever-increasing document collection, currently on the order of 100,000,000 documents. Besides the documents themselves, we process related PDFs, extracted and user-generated meta-data, user information, user libraries and groups. Together, the data set and its application at Mendeley are large and complex. On closer inspection, we can identify a core operation and challenge besides scale: in almost every feature or product, internal and client-facing, we have to compare data items. This basic operation is challenging not just because of scale. We deal with a noisy data set with different types of information coming from users, meta-data extraction and partner archives. In short, we have to compare items in a huge set efficiently and effectively. This is a core challenge for big data. Like Mendeley, most if not all real-world big data services face some kind of noise in their data and use comparisons extensively in their algorithms and products.</p>
<p>Three main classes of comparison come to mind in our context:</p>
<ol>
<li>Search &#8211; comparing patterns and frequencies within and across items. Example: a text query against documents.</li>
<li>Recommendation &#8211; comparing items based on their occurrence. Example: collaborative filtering of co-occurring items.</li>
<li>Classification/clustering &#8211; comparing items based on their features. Example: clustering and merging (near-)duplicate items.</li>
</ol>
<p>Products are available for each of these classes of comparison, e.g. <a title="Apache Lucene" href="http://lucene.apache.org/" target="_blank">Lucene</a> or <a title="Apache Solr" href="http://lucene.apache.org/solr/" target="_blank">Solr</a>. The problem is that each product specialises in one use case, for example search in the case of Lucene and Solr. This specialisation commonly focuses on one or a small subset of aspects of the information we have available, e.g. patterns or relationships. Some of the data are used only poorly or not at all, and comparison across types is often hard or impossible. Moreover, in many situations we have to perform similar comparisons internally, where using a specialised product is not always a sensible approach. Where we do use existing technologies and algorithms, we are limited by their abilities and insight (or lack thereof).</p>
<p>We pose these challenges:</p>
<ul>
<li>To unify the data comparison classes in one system to extract value from the full data set (patterns, frequencies, relationships, co-occurrence, &#8230;) and access it transparently from different services (search, recommendation, de-duplication) according to their needs.</li>
<li>To scale it for Mendeley (to 10^8 and beyond).</li>
<li>To be as effective as, or even better than, dedicated products.</li>
</ul>
<p>We will solve these challenges as part of the <a title="TEAM Project" href="http://team-project.tugraz.at" target="_blank">TEAM project</a>, applying and extending state-of-the-art research. The outcome will a) extend knowledge in the form of peer-reviewed research publications, and b) result in a real-world, working system at Mendeley.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.semantikoz.com/2011/10/14/comparing-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

