<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:cc="http://web.resource.org/cc/" xmlns="http://purl.org/rss/1.0/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">

<channel rdf:about="http://datamining.typepad.com/data_mining/">
<title>Data Mining: Text Mining, Visualization and Social Media</title>
<link>http://datamining.typepad.com/data_mining/</link>
<description />
<dc:language>en-US</dc:language>
<dc:creator />
<dc:date>2012-05-26T21:24:38-04:00</dc:date>
<admin:generatorAgent rdf:resource="http://www.typepad.com/" />


<items>
<rdf:Seq><rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/05/5-hidden-skills-for-big-data-scientists.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/05/zero-tolerance-search-24-year-old-neuroscientist.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/05/excellent-visualization-of-network-effect.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/04/graphing-twitter-attention.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/04/microsofts-windows-phone-8-problem-a-solution.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/04/finding-new-story-links-through-blog-clustering.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/04/local-lens-a-hyperlocal-retrospective.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/04/search-engine-lands-mediocre-post-on-local-search.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/04/tracking-microsoft-weekly-roundup.html" />
<rdf:li rdf:resource="http://datamining.typepad.com/data_mining/2012/04/lumia-review-cluster.html" />
</rdf:Seq>
</items>

<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rdf+xml" href="http://feeds.feedburner.com/DataMining" /><feedburner:info uri="datamining" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><geo:lat>40.468968</geo:lat><geo:long>-79.918639</geo:long><feedburner:emailServiceId>DataMining</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><feedburner:browserFriendly>This is an XML content feed. It is intended to be viewed in a newsreader or syndicated to another site, subject to copyright and fair use.</feedburner:browserFriendly></channel>

<item rdf:about="http://datamining.typepad.com/data_mining/2012/05/5-hidden-skills-for-big-data-scientists.html">
<title>5 Hidden Skills for Big Data Scientists</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/-sXPggJj_Os/5-hidden-skills-for-big-data-scientists.html</link>
<description>1. Be Clear: Is Your Problem Really A Big Data Problem? There are many big data problems out there requiring huge compute scale, innovations in computation paradigms, vast storage space and so on. But just because your data takes up...</description>
<content:encoded><![CDATA[<p>1. Be Clear: Is Your Problem Really A Big Data Problem?</p>
<p>There are many big data problems out there requiring huge compute scale, innovations in computation paradigms, vast storage space and so on. But just because your data takes up lots of disc space does not mean that you have a big data problem. Firstly, your data may be encoded in an inefficient format. XML, for example, can be incredibly verbose (all those close tags and human readable text). Secondly, if your data changes over time it may change very slowly, indicating that monitoring the difference between data sets is more important than importing complete data sets. Thirdly, you may be processing your information on a legacy architecture designed for low power CPUs or cores. Architecture should be data driven, meaning that you need to deeply understand the informational aspects of your data and not just the size of the data as it comes to you on disc.</p>
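<p>To make the second point concrete, here is a minimal sketch of tracking the difference between two snapshots instead of re-importing everything. It assumes each snapshot is a dictionary keyed by a stable record id; the function name and the snapshot layout are illustrative assumptions, not a description of any particular system.</p>

```python
def snapshot_delta(old, new):
    """Compare two snapshots (dicts keyed by record id) and return the changes.

    Assumption: record ids are stable between snapshots.
    """
    old_ids, new_ids = set(old), set(new)
    # Records that only exist in the new snapshot.
    added = {k: new[k] for k in new_ids - old_ids}
    # Ids that disappeared since the old snapshot.
    removed = sorted(old_ids - new_ids)
    # Records present in both snapshots whose value changed.
    changed = {k: new[k] for k in old_ids.intersection(new_ids) if old[k] != new[k]}
    return added, removed, changed
```

<p>If the three results are small relative to the full snapshot, the data is changing slowly and processing only the delta may be enough.</p>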
<p>2. Communicating About Your Data</p>
<p>Often, in large organizations (I work for Microsoft and have worked at IBM in the past), the product requirements for data deliverables are high level. For example: we need these variables to be 99% accurate. This simplistic view of data - that a level of quality can be delivered in a specified time frame - is ignorant of the highly opportunistic nature of processes that improve the quality of data. Consequently, a data scientist needs to aggressively manage the communication about projects which transform and improve data sets. Do as much research as possible to minimize unknowns, but don&#39;t sign contracts that involve both time and quality metrics!</p>
<p>3. Invest in Interactive Analytics, not Reporting</p>
<p>When you construct reports about your data products, you are answering a fixed set of questions. This is useful for monitoring, but it doesn&#39;t provide a way to get at the unknown unknowns. It is only through interactions with data (often called slicing and dicing) that pockets of interest (problems and opportunities) are discovered. Rich, interactive tools may be perceived as a low priority and never quite got to. Avoid this peril!</p>
<p>4. Understand the Role and Quality of Human Evaluations of Data</p>
<p>When trying to determine how good your data product is, it is often the case that we employ an array of human judges to evaluate a sample of the data. The higher up the management chain you go, the higher the degree of respect for human judgement you tend to find. There are many studies, however, that show that human judgements are not always as good as they are cracked up to be. In many cases, machines can do better than humans; they just tend to make different types of errors. On deeper inspection, human errors can be traced to the structure of incentives around the judgement process. Innovate in methods to compare data sets that help distinguish their relative quality without necessarily incurring the expense of human assessment.</p>
<p>5. Spend Time on the Plumbing</p>
<p>How does data get into your system? How does it flow? Are you sure every bit of information got in? With large scale data loading and processing systems, one doesn&#39;t want a small number of failures to tip over the entire run. However, silently failing components can cause big headaches down the line when you are reporting your summary findings. Make sure there are no leaks in your pipeline!</p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=-sXPggJj_Os:_Kpq09JV_Ec:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=-sXPggJj_Os:_Kpq09JV_Ec:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=-sXPggJj_Os:_Kpq09JV_Ec:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=-sXPggJj_Os:_Kpq09JV_Ec:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/-sXPggJj_Os" height="1" width="1"/>]]></content:encoded>



<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-05-26T21:24:38-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/05/5-hidden-skills-for-big-data-scientists.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/05/zero-tolerance-search-24-year-old-neuroscientist.html">
<title>Zero Tolerance Search : 24 year old neuroscientist</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/d4OAMfV1_Ko/zero-tolerance-search-24-year-old-neuroscientist.html</link>
<description>[The idea behind 'zero tolerance search' posts is to illustrate real life search interactions that show how far we have to go in leveraging the explicit and implicit data in the web and elsewhere.] Yesterday, I heard part of an...</description>
<content:encoded><![CDATA[<p>[The idea behind &#39;zero tolerance search&#39; posts is to illustrate real life search interactions that show how far we have to go in leveraging the explicit and implicit data in the web and elsewhere.]</p>
<p>Yesterday, I heard part of an interview on NPR. The interview was around a new book on determinism and neuroscience. The only thing I remember about the author was his young age. I wanted to recover the name of the author and the title of his new book so that I could comment on his argument against determinism (which was, essentially, &#39;I&#39;m afraid of determinism therefore it can&#39;t be right&#39;).</p>
<p>The query {24 year old neuroscientist} is very clear to a human reader. The goal is to find information about a person who functions in the role of &#39;neuroscientist&#39; and whose age is 24 years.</p>
<p>The text matching approaches of both Bing and Google essentially fail at this task. The errors they make include:</p>
<ul>
<li>Matching on &#39;year[s] old&#39; without recognizing the requirement that the result has to be about specifically a 24 year old.</li>
<li>Matching on the bare number &#39;24&#39;.</li>
<li>Articles about neuroscientists which also mention a &#39;24 year old donor&#39;</li>
<li>Matching on &#39;18-24 year old samples&#39; of the population.</li>
<li>A 24 year old who was studying neuroscience before having a hit song and making a career change.</li>
<li>&#39;24 year old&#39; in the body of the document versus &#39;neuroscientist&#39; in the &#39;about&#39; section for a blogger.</li>
</ul>
<p>Google edges out Bing by returning a single result in position 5 that does pertain to a 24 year old neuroscientist.</p>
<p>Given all the advances and trumpets employed in search these days, I am still, shall we say, interested in results that ignore simple elements of document structure (the bio of the author being mixed with the content of a blog post) and show inattention to elemental linguistics in the query (the &#39;24&#39; in &#39;24 year old&#39; really shouldn&#39;t match the &#39;24&#39; in &#39;March 24&#39;).</p>
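<p>As a rough illustration of the linguistic point, the sketch below treats the age expression as a single unit instead of three loose tokens. The pattern and function names are my own assumptions for illustration and say nothing about how Bing or Google work. It handles the &#39;March 24&#39; and &#39;18-24 year old&#39; cases, but it does not tie the age to the right person, so the &#39;24 year old donor&#39; error would need real linguistic analysis.</p>

```python
import re

# Hypothetical pattern: a number directly followed by "year old" (spaces or hyphens).
AGE_PATTERN = re.compile(r"\b(\d{1,3})[- ]year[- ]old\b", re.IGNORECASE)


def mentions_age(text, age):
    """Return True if text contains an age expression such as '24 year old'."""
    for m in AGE_PATTERN.finditer(text):
        # Skip ranges such as '18-24 year old', where the number is the end of a span.
        preceded_by_range = text[max(m.start() - 1, 0):m.start()] == "-"
        if int(m.group(1)) == age and not preceded_by_range:
            return True
    return False


def bare_token_match(text, age):
    """Naive baseline: True if the number appears anywhere as a digit token."""
    return str(age) in re.findall(r"\d+", text)
```

<p>For the text &#39;published March 24 by a neuroscientist&#39;, the naive baseline matches on 24 while the phrase-aware check does not.</p>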
<p>What would be a killer answer to this query would be returning a page about a person who was indeed a 24 year old neuroscientist but where the age and occupation of the individual were not present in the document.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=d4OAMfV1_Ko:8KWjPUCde54:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=d4OAMfV1_Ko:8KWjPUCde54:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=d4OAMfV1_Ko:8KWjPUCde54:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=d4OAMfV1_Ko:8KWjPUCde54:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/d4OAMfV1_Ko" height="1" width="1"/>]]></content:encoded>


<dc:subject>zerotolerancesearch</dc:subject>

<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-05-12T17:44:21-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/05/zero-tolerance-search-24-year-old-neuroscientist.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/05/excellent-visualization-of-network-effect.html">
<title>Excellent Visualization of Network Effect</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/TbXsPqrwqME/excellent-visualization-of-network-effect.html</link>
<description />
<content:encoded><![CDATA[<p><iframe frameborder="0" height="344" src="http://www.youtube.com/embed/GA8z7f7a2Pk?fs=1&amp;feature=oembed" width="459"></iframe>&#0160;</p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=TbXsPqrwqME:gf1mMFbEGQM:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=TbXsPqrwqME:gf1mMFbEGQM:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=TbXsPqrwqME:gf1mMFbEGQM:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=TbXsPqrwqME:gf1mMFbEGQM:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/TbXsPqrwqME" height="1" width="1"/>]]></content:encoded>



<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-05-12T17:28:24-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/05/excellent-visualization-of-network-effect.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/04/graphing-twitter-attention.html">
<title>Graphing Twitter Attention</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/HtEByuHycPs/graphing-twitter-attention.html</link>
<description>track // microsoft (and games and movies) now includes a simple graph indicating the attention being given to each cluster of posts. This graph shows the total of tweets per hour for all posts in the cluster. Below is an...</description>
<content:encoded><![CDATA[<p><a href="http://d8taplex.com/track/microsoft-widescreen.html" target="_self">track // microsoft</a> (and <a href="http://d8taplex.com/track/games-widescreen.html" target="_self">games</a> and <a href="http://d8taplex.com/track/movies-widescreen.html" target="_self">movies</a>) now includes a simple graph indicating the attention being given to each cluster of posts. This graph shows the total of tweets per hour for all posts in the cluster. Below is an example from the cluster around Steve Wozniak&#39;s positive comments for his Windows Phone.</p>
<p><a class="asset-img-link" href="http://datamining.typepad.com/.a/6a00d8341c994053ef016304f9da8a970d-pi" style="display: inline;"><img alt="Capture" class="asset  asset-image at-xid-6a00d8341c994053ef016304f9da8a970d" src="http://datamining.typepad.com/.a/6a00d8341c994053ef016304f9da8a970d-500wi" title="Capture" /></a></p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=HtEByuHycPs:8tCzLqyoX98:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=HtEByuHycPs:8tCzLqyoX98:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=HtEByuHycPs:8tCzLqyoX98:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=HtEByuHycPs:8tCzLqyoX98:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/HtEByuHycPs" height="1" width="1"/>]]></content:encoded>



<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-04-30T10:39:42-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/04/graphing-twitter-attention.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/04/microsofts-windows-phone-8-problem-a-solution.html">
<title>Microsoft's Windows Phone 8 Problem - A Solution</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/74WhxbNMItk/microsofts-windows-phone-8-problem-a-solution.html</link>
<description>Briefly - there is plenty of chatter (see track // microsoft) about the possibility that Microsoft won't be upgrading existing handsets to Windows Phone 8. However, Hal makes a very interesting point in his post on the topic. The problem...</description>
<content:encoded><![CDATA[<p>Briefly - there is plenty of chatter (see <a href="http://d8taplex.com/track/microsoft-widescreen.html" target="_self">track // microsoft</a>) about the possibility that Microsoft won&#39;t be upgrading existing handsets to Windows Phone 8. However, Hal makes a very interesting point in <a href="http://hal2020.com/2012/04/22/will-existing-phones-be-upgradeable-to-windows-phone-8/" target="_self">his post on the topic</a>. The problem is not the upgrade, it is the users. Simply give them all a new Windows Phone 8 hand set for free - problem solved.</p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=74WhxbNMItk:AHQmKkZGEa4:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=74WhxbNMItk:AHQmKkZGEa4:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=74WhxbNMItk:AHQmKkZGEa4:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=74WhxbNMItk:AHQmKkZGEa4:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/74WhxbNMItk" height="1" width="1"/>]]></content:encoded>



<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-04-22T11:08:02-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/04/microsofts-windows-phone-8-problem-a-solution.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/04/finding-new-story-links-through-blog-clustering.html">
<title>Finding New Story Links Through Blog Clustering</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/B1n5piJGJ34/finding-new-story-links-through-blog-clustering.html</link>
<description>The basic mechanism used in track // microsoft to cluster articles is similar to that used by Techmeme. A fixed set of blogs are crawled and clustered based on specific features such as link structure and content (and in the...</description>
<content:encoded><![CDATA[<p>The basic mechanism used in <a href="http://d8taplex.com/track/microsoft-widescreen.html" target="_self">track // microsoft</a> to cluster articles is similar to that used by <a href="http://www.techmeme.com" target="_self">Techmeme</a>. A fixed set of blogs are crawled and clustered based on specific features such as link structure and content (and in the case of Techmeme, additional human input). However, what about blogs that aren&#39;t known to the system?</p>
<p>I recently added a feature to <a href="http://d8taplex.com/track/microsoft-widescreen.html" target="_self">track // microsoft</a> which analyses clusters for popular urls and adds those to the bottom of the cluster. The title of the web page is used as a simple description of the popular page.</p>
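<p>The idea can be sketched roughly as follows. This is not the actual track // microsoft code: the post layout (a dictionary with a &#39;links&#39; list), the set of known hosts and the threshold are all assumptions made for the example.</p>

```python
from collections import Counter
from urllib.parse import urlparse


def popular_external_links(cluster_posts, known_hosts, min_count=2):
    """Rank URLs linked from a cluster that point outside the crawled blogs.

    cluster_posts: list of dicts, each with a 'links' list of URLs (assumed layout).
    known_hosts: set of host names already crawled by the system (assumed input).
    """
    counts = Counter()
    for post in cluster_posts:
        # Count each URL at most once per post so a single post cannot inflate it.
        for url in set(post["links"]):
            if urlparse(url).netloc not in known_hosts:
                counts[url] += 1
    # Keep only URLs that enough distinct posts in the cluster link to.
    return [(url, n) for url, n in counts.most_common() if n >= min_count]
```

<p>A page that many posts in the cluster link to, but that is not itself a crawled blog, ends up at the top of the returned list.</p>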
<p>In the recent story about Nuno Silva&#39;s mistaken comment regarding the future of Windows Phone devices, there were many links to <a href="http://blogs.msdn.com/b/nunos/archive/2012/04/19/my-comments-on-windows-phone.aspx" target="_self">Nuno&#39;s own blog post</a>. In addition to the large cluster of known blogs that were determined to be talking about the story, track // microsoft also surfaced Nuno&#39;s post through analysing the popular links discovered within the cluster.</p>
<p>This can be seen in this screen shot of the cluster currently appearing on the site.</p>
<p><a class="asset-img-link" href="http://datamining.typepad.com/.a/6a00d8341c994053ef0167656e0d6d970b-pi" style="display: inline;"><img alt="Apollo" class="asset  asset-image at-xid-6a00d8341c994053ef0167656e0d6d970b" src="http://datamining.typepad.com/.a/6a00d8341c994053ef0167656e0d6d970b-500wi" title="Apollo" /></a><br /><br /></p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=B1n5piJGJ34:L4AuOSXsutY:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=B1n5piJGJ34:L4AuOSXsutY:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=B1n5piJGJ34:L4AuOSXsutY:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=B1n5piJGJ34:L4AuOSXsutY:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/B1n5piJGJ34" height="1" width="1"/>]]></content:encoded>



<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-04-20T01:15:53-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/04/finding-new-story-links-through-blog-clustering.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/04/local-lens-a-hyperlocal-retrospective.html">
<title>Local Lens : A Hyperlocal Retrospective</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/_7U1ZQh58cs/local-lens-a-hyperlocal-retrospective.html</link>
<description>Thanks to a recent talk by Gonzalo, I discovered his video capturing the UX developed for our hyperlocal content aggregation and browsing application called Local Lens.</description>
<content:encoded><![CDATA[<p>Thanks to a recent talk by Gonzalo, I discovered his video capturing the UX developed for our hyperlocal content aggregation and browsing application called Local Lens.</p>
<p><iframe frameborder="0" height="281" src="http://www.youtube.com/embed/kMzL99LRPAA?fs=1&amp;feature=oembed" width="500"></iframe>&#0160;</p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=_7U1ZQh58cs:ygj_MFMVgsQ:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=_7U1ZQh58cs:ygj_MFMVgsQ:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=_7U1ZQh58cs:ygj_MFMVgsQ:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=_7U1ZQh58cs:ygj_MFMVgsQ:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/_7U1ZQh58cs" height="1" width="1"/>]]></content:encoded>



<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-04-19T09:57:52-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/04/local-lens-a-hyperlocal-retrospective.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/04/search-engine-lands-mediocre-post-on-local-search.html">
<title>Search Engine Land's Mediocre Post on Local Search</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/_pjSp3UO5SA/search-engine-lands-mediocre-post-on-local-search.html</link>
<description>A colleague brought to my attention a post on the influential search blog Search Engine Land which makes claims about the quality of local data found on search engines and local verticals: Yellow Pages Sites Beat Google In Local Data...</description>
<content:encoded><![CDATA[<p>A colleague brought to my attention a post on the influential search blog Search Engine Land which makes claims about the quality of local data found on search engines and local verticals: <a href="http://searchengineland.com/yellowpages-sites-beat-google-in-local-data-accuracy-test-118467" target="_self">Yellow Pages Sites Beat Google In Local Data Accuracy Test</a>. The author describes surprise at the outcome reported - that Yellow Pages sites are better at local search than Google. Rather, we should express surprise at how poorly this article is written and at the intentional misleading nature of the title.</p>
<p>The article describes an analysis done by Implied Intelligence. The analysis looks at 1, 000 local businesses in the US. Here is the first problem - these businesses exclude chains and franchises. In addition, if a website wasn&#39;t known for the business, it too was excluded. With some general assumptions about the definition of local business, it is safe to assert that firstly there are many instances of chains and franchises out there and secondly that many (if not most) businesses don&#39;t have a website (the distribution varies by category of course). Quite where the original sample of 1, 000 came from is not reported.</p>
<p>This biases the analysis - Google, like Bing is intersted in all local entities.</p>
<p>The initial part of the analysis is reasonable - looking at coverage (% in the sample found on the site) and quality (duplicates, phone number errors and adderss errors). Note, however, that this is a measure of the local data, not of local search. A search product includes a relevance component and it is quite possible that a well tuned relevance algorithm might suppress duplicates.</p>
<p>The last table in the analysis sees us swinging back to bad reporting. It describes the percentage of records that have a certain attribute: URL, Hours of Operation and &#39;additional info&#39;. Did you see what they did there? This is what we call the coverage of an attribute, and it tells us nothing as to the quality of the value. I can quite easily populate a local database with 100% coverage for all attributes. They might all be wrong, but the coverage could be 100%. Consequently, this table is reasonably close to meaningless. If they had included the precision of these values then coverage can be used to compute recall, but that wasn&#39;t done.</p>
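<p>A small sketch of the distinction, under the simplifying assumption that every business in the sample really does have a true value for the attribute (so recall is the fraction of all records carrying a correct value). The record layout and function name are illustrative only.</p>

```python
def attribute_metrics(records, truth, attribute):
    """Return (coverage, precision, recall) for one attribute.

    records: list of dicts with an 'id' and possibly the attribute (assumed layout).
    truth: dict mapping id to a dict of verified values (assumed ground truth).
    Assumption: every record has a true value for the attribute.
    """
    total = len(records)
    if total == 0:
        return 0.0, 0.0, 0.0
    # A record is covered if the attribute is present and non-empty.
    populated = [r for r in records if r.get(attribute)]
    # A populated value is correct only if it equals the verified value.
    correct = [r for r in populated if r[attribute] == truth[r["id"]][attribute]]
    coverage = len(populated) / total
    precision = len(correct) / len(populated) if populated else 0.0
    recall = len(correct) / total  # equals coverage * precision
    return coverage, precision, recall
```

<p>A database can score 100% on coverage while precision, and therefore recall, stays low, which is why coverage alone says little about quality.</p>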
<p>In summary, an important search publication has either written an intentionally misleading article, or has demonstrated that it doesn&#39;t really get data.</p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=_pjSp3UO5SA:hdyUYUpo_2M:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=_pjSp3UO5SA:hdyUYUpo_2M:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=_pjSp3UO5SA:hdyUYUpo_2M:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=_pjSp3UO5SA:hdyUYUpo_2M:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/_pjSp3UO5SA" height="1" width="1"/>]]></content:encoded>


<dc:subject>local</dc:subject>
<dc:subject>search</dc:subject>

<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-04-18T11:43:53-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/04/search-engine-lands-mediocre-post-on-local-search.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/04/tracking-microsoft-weekly-roundup.html">
<title>Tracking Microsoft - Weekly Roundup</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/MUcXUGhxoHY/tracking-microsoft-weekly-roundup.html</link>
<description>I thought I'd cull some stories from track // microsoft that summarize the week's news: Steve Wozniak - famous for standing in line for Apple products, tweeted his intention to get into the smartphone game with a Lumia 900 The...</description>
<content:encoded><![CDATA[<p>I thought I&#39;d cull some stories from <a href="http://bit.ly/GK4hyx" target="_self">track // microsoft</a> that summarize the week&#39;s news:</p>
<ul>
<li>Steve Wozniak - famous for standing in line for Apple products, <a href="http://www.phonearena.com/news/Steve-Wozniak-looking-to-pickup-a-Lumia-900-today_id29115" target="_self">tweeted his intention to get into the smartphone game with a Lumia 900</a></li>
<li>The <a href="http://www.phonearena.com/news/Girl-tells-her-BF-that-his-Samsung-Galaxy-S-II-is-s-after-getting-Blown-Away-by-Nokia-Lumia-800_id29122" target="_self">reaction of a girlfriend to the continuing Windows Phone / Lumia challenge failure of her boyfriend&#39;s Android phone</a></li>
<li><a href="http://www.redmondpie.com/this-app-brings-windows-8-metro-ui-to-the-ipad-for-everyone-to-try-video/" target="_self">Splashtop - an iPad app that emulates the Windows 8 Metro experience</a> to give developers and UX designers a head start</li>
<li>From the perennial department: <a href="http://www.neowin.net/news/rumor-microsoft-to-sell-bing-to-facebook" target="_self">a rumour that Facebook would acquire Bing</a></li>
<li>AT&amp;T states that <a href="http://wmpoweruser.com/att-president-lumia-sales-have-exceeded-expectations/" target="_self">Lumia sales have exceeded expectations</a></li>
</ul>
<p>Keep up to date on Microsoft news by visiting <a href="http://bit.ly/GK4hyx" target="_self">track // microsoft</a> - a site dedicated to discovering and surfacing stories from the blogosphere about my employer!</p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=MUcXUGhxoHY:30dXPuw6vBg:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=MUcXUGhxoHY:30dXPuw6vBg:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=MUcXUGhxoHY:30dXPuw6vBg:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=MUcXUGhxoHY:30dXPuw6vBg:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/MUcXUGhxoHY" height="1" width="1"/>]]></content:encoded>



<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-04-14T11:57:49-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/04/tracking-microsoft-weekly-roundup.html</feedburner:origLink></item>
<item rdf:about="http://datamining.typepad.com/data_mining/2012/04/lumia-review-cluster.html">
<title>Lumia Review Cluster</title>
<link>http://feedproxy.google.com/~r/DataMining/~3/WKSACAhPdS8/lumia-review-cluster.html</link>
<description>Briefly, track // microsoft (a buzz tracking site for Microsoft that I'm experimenting with) is currently sporting a large story cluster focused on the reviews for the new (today!) Nokia Lumia 900. [note - I updated this image due to...</description>
<content:encoded><![CDATA[<p>Briefly, <a href="http://d8taplex.com/track/microsoft-widescreen.html" target="_self">track // microsoft</a> (a buzz tracking site for Microsoft that I&#39;m experimenting with) is currently sporting a large story cluster focused on the reviews for the new (today!) Nokia Lumia 900.</p>
<p>[note - I updated this image due to fixing a bug I recently discovered with the clustering]</p>
<p><a class="asset-img-link" href="http://datamining.typepad.com/.a/6a00d8341c994053ef0168e9d4df29970c-pi" style="display: inline;"><img alt="Lumia" class="asset  asset-image at-xid-6a00d8341c994053ef0168e9d4df29970c" src="http://datamining.typepad.com/.a/6a00d8341c994053ef0168e9d4df29970c-500wi" title="Lumia" /></a><br /><br /><br /></p><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/DataMining?a=WKSACAhPdS8:FHRRjDOgXTI:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=WKSACAhPdS8:FHRRjDOgXTI:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/DataMining?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=WKSACAhPdS8:FHRRjDOgXTI:2mJPEYqXBVI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=2mJPEYqXBVI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/DataMining?a=WKSACAhPdS8:FHRRjDOgXTI:I9og5sOYxJI"><img src="http://feeds.feedburner.com/~ff/DataMining?d=I9og5sOYxJI" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/DataMining/~4/WKSACAhPdS8" height="1" width="1"/>]]></content:encoded>



<dc:creator>Matthew Hurst</dc:creator>
<dc:date>2012-04-08T11:25:44-04:00</dc:date>
<feedburner:origLink>http://datamining.typepad.com/data_mining/2012/04/lumia-review-cluster.html</feedburner:origLink></item>


</rdf:RDF><!-- ph=1 -->

