<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>The art of Information Engineering</title>
	
	<link>http://www.grok.in</link>
	<description>(ignorance killed the cat, curiosity was framed)</description>
	<lastBuildDate>Tue, 11 Aug 2009 06:30:18 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/grokin" /><feedburner:info uri="grokin" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Marketing your Search Engine</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/AvHtSCdcLXI/</link>
		<comments>http://www.grok.in/blog/2009/05/22/merketing-your-search-engine/#comments</comments>
		<pubDate>Fri, 22 May 2009 12:06:10 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Marketing]]></category>
		<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[zook]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=173</guid>
		<description><![CDATA[I recently came across an interesting post by Mark Johnson, senior product manager at Powerset:
While demoing Live Search at the Web 2.0 Expo, people continually asked the same questions: “What makes Live different?” or “Show me some features that will make me want to switch from my search engine” or the extremely confrontational “Why do [...]]]></description>
			<content:encoded><![CDATA[<p>I recently came across an <a title="How *not* to rate a search engine" href="http://deliberateambiguity.typepad.com/blog/2009/04/how-not-to-rate-a-search-engine.html">interesting post</a> by Mark Johnson, senior product manager at <a href="http://www.powerset.com%22/">Powerset</a>:</p>
<blockquote><p>While demoing Live Search at the Web 2.0 Expo, people continually asked the same questions: “What makes Live <strong>different</strong>?” or “Show me some <strong>features </strong>that will make me want to switch from my search engine” or the extremely confrontational “Why do you think you’re <strong>better </strong>than Google?”</p></blockquote>
<p>If only I had a dollar for every time someone asked me one of these questions when demo-ing <a title="Zook - Mobile Search Engine" href="http://www.zook.in/">Zook</a> (my company&#8217;s Mobile search engine) &#8230;I would have been richer by at least a couple of hundred dollars!</p>
<p><span style="text-decoration: line-through;">How do you convinve someone that your search engine is better than Google?</span> How do you convince someone that Google is not the do-all and be-all of search engines? Don&#8217;t get me wrong, I love Google. But I do resent the fact that, with their enormous market share, Google has moulded an expectation of what a search engine is and is not into people&#8217;s minds; so much so that if even Google itself introduced anything drastically different, it would probably get rejected! How do you convince someone of something when they are already convinced of the contrary? If you run a Web search engine, Mark Johnson offers some advice:</p>
<blockquote><p>So, after awhile, I started my demos with a caveat about the nature of a search engine: I implored my audience to <strong>try out Live Search for a week</strong> so that, in the words of the immortal <a href="http://powerset.com/explore/semhtml/LeVar_Burton">Lavar Burton</a> of <a href="http://www.youtube.com/watch?v=c6j8EiWIVZs">Reading Rainbow</a>, “But, you don’t have to take my word for it.”</p></blockquote>
<p>This is a great tactic if you are a generic Web search engine. But if you are a more specialised search engine which has been built to answer only a subset of the queries that people go to Google for — but answer them much better than Google does — you are out of luck: people are not too keen on juggling between different search engines for different needs; they want a single box into which they would like to type in a query and get the results. So what is the recourse?</p>
<p>If you have <strong><a title="Deep Web - Wikipedia" href="If you have Deep Web content — data that others (read as: Google) do not have access to, you are through.">Deep Web</a></strong> content — data that others (read as: Google) do not have access to — you are through. Can there be anything sweeter than a <em>monopoly</em>? This is usually achieved by search engines that own the data they are searching. Usually, most of this data is inaccessible to the regular search engines. Several <em>local search engines</em> are this way. YouTube, the <a title="YouTube surpasses Yahoo as world’s #2 search engine - TG Daily" href="http://www.tgdaily.com/content/view/39777/113/">second largest search engine</a> in US, also falls in this category. As do Amazon and eBay and Craigslist. But with the increasing domination of Google and its growing importance in driving traffic, many of these websites are resorting to <a title="SEO on Wikipedia" href="http://en.wikipedia.org/wiki/Search_engine_optimization">Search Engine Optimization (SEO)</a> techniques resulting in large portions of their content being made available to be indexed by Google.</p>
<p>If you are a <a title="Vertical Search - Wikipedia" href="http://en.wikipedia.org/wiki/Vertical_search">vertical search</a> engine, you can entice people with features that go well beyond mere listing of results. These features can include things like <strong>sorting</strong> (example: <em>sort by price</em> in <em>product search</em>), <strong>actionable results</strong> (example: <em>book tickets</em> in <em>movie search</em>), <strong><a title="Faceted search - Wikipedia" href="http://en.wikipedia.org/wiki/Faceted_search">faceted search</a></strong> (example: <em>narrow down by brands</em> in <em>product search</em> — useful for cameras, laptops etc.), <strong>community inputs</strong> (example: <em>user reviews &amp; ratings</em> in <em>restaurant search</em>) and so on. Often, several such perks are needed to get people to try out your search engine long enough for them to really experience it.</p>
<p>But if you are a drastically different search engine trying to bring in a whole new paradigm of search, you are facing a real tough battle. <a href="http://www.wolframalpha.com/">Wolfram|Alpha</a> learned it the hard way after its recent launch. <a href="http://www.powerset.com/">Powerset</a> faced similar hurdles when trying to convince people of they are worthy of attention. Several other such attempts have been made but none too successfully.</p>
<p>&#8211;</p>
<p>Developing a brand new search engine in this Google-dominated world is no more just about coming up with great technical ideas. The technical superiorities need to be market driven. The ideas need to come from a marketing perspective. This does not imply that there is any less scope for leaps of technical improvements, it just means that without a marketing plan to go with them, such improvements will find themselves in obscurity in a hurry.</p>
<p>We came to this realisation at Zook a long time ago. When we did, and started developing our marketing strategy, it was mere good fortune that we found that most of our development until that time would align quite well with the strategy. We were lucky.</p>
<p>&#8211;</p>
<p>We like to think of Zook as a <em>lateral search engine</em> i.e. we specialise in some kinds of content unlike a <em>horizontal search engine</em> (most Web search engines) but unlike most <em>vertical search engines</em> we pan across several verticals without going too deep into any one of them. Being this way, we are able to offer several of the features/benefits that are normally the privilege only of vertical search engines — actionable results (examples: <a title="Search for &quot;download rahman airtel jingle&quot; on Zook" href="http://www.zook.in/wap/search.jsp?browse=music&amp;query=download+rahman+airtle+jingle&amp;submit=Search&amp;latitude=&amp;longitude=#music">buy/download a song</a>, <a title="Search for &quot;restaurant reserve bombay post&quot; on Zook" href="http://www.zook.in/wap/search.jsp?query=restaurant+reserve+bombay+post&amp;submit=Search&amp;latitude=&amp;longitude=">reserve a table at a restaurant</a>, <a title="Search for &quot;stocks infosys bse&quot; on Zook" href="http://www.zook.in/wap/search.jsp?query=stocks+infosys+bse&amp;submit=Search&amp;latitude=&amp;longitude=">subscribe to alerts</a>), faceted search (examples: <a title="Search for &quot;restaurant bangalore&quot; on Zook" href="http://www.zook.in/wap/search.jsp?query=restaurant+bangalore&amp;submit=Search&amp;latitude=&amp;longitude=#restaurant">restaurant bangalore</a>, <a title="Browse &quot;ringtones&quot; on Zook" href="http://www.zook.in/wap/search.jsp?browse=ringtone">ringtones</a>, <a title="Browse &quot;movies&quot; on Zook" href="http://www.zook.in/wap/search.jsp?browse=movie">movies</a>) etc.</p>
<p>Another thing we have going for Zook is that people seem to be more open to trying out alternative search engines on their mobile than they are on their desktops; Google deosn&#8217;t <em>yet</em> have a stranglehold on mobile internet users. In most cases, the promise of &#8220;exact/precise information instead of a set of links that you have navigate yourself to find the information&#8221; gets people excited enough that they are willing to Zook a try.</p>
<p>Zook has a lot of Deep Web content that we source directly from the multitude of our <em>partners</em>. That helps too <img src='http://www.grok.in/wp/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=AvHtSCdcLXI:4UWf-UM_ILo:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=AvHtSCdcLXI:4UWf-UM_ILo:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=AvHtSCdcLXI:4UWf-UM_ILo:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2009/05/22/merketing-your-search-engine/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2009/05/22/merketing-your-search-engine/</feedburner:origLink></item>
		<item>
		<title>Cloud Computing</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/c3ngz9dRWy8/</link>
		<comments>http://www.grok.in/blog/2008/12/29/cloud-computing/#comments</comments>
		<pubDate>Mon, 29 Dec 2008 11:48:57 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[IaaS]]></category>
		<category><![CDATA[PaaS]]></category>
		<category><![CDATA[SaaS]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=83</guid>
		<description><![CDATA[A couple of weeks ago I participated in an interesting discussion   on Cloud Computing at an unconference in Bangalore. Though the discussion was to be on &#8220;whether Cloud Computing is inevitable or not&#8221;, we hardly got past defining it! That just about demonstrates the confusion that surrounds Cloud Computing — it isn&#8217;t even [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of weeks ago I participated in an interesting discussion   on <em>Cloud Computing</em> at an <a title="Osmosis08" href="http://barcamp.org/osmosis08">unconference</a> in Bangalore. Though the discussion was to be on &#8220;whether Cloud Computing is inevitable or not&#8221;, we hardly got past defining it! That just about demonstrates the confusion that surrounds Cloud Computing — it isn&#8217;t even clear what it&#8217;s supposed to be. It&#8217;s not for no reason that it has been referred to as <em>Haze Computing</em>!</p>
<p>I think everyone has a moral (:P) responsibility to add to the confusion. Only through such attempts can we achieve clarity. This post — an attempt to put in words, my understanding of what Cloud Computing is and is not — is a contribution to that end.</p>
<p><em><strong><a title="Cloud Computing - Notes" href="http://www.grok.in/notes/cloud-computing/">Read on&#8230;</a></strong></em></p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=c3ngz9dRWy8:ySA5er9l3hs:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=c3ngz9dRWy8:ySA5er9l3hs:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=c3ngz9dRWy8:ySA5er9l3hs:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/12/29/cloud-computing/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/12/29/cloud-computing/</feedburner:origLink></item>
		<item>
		<title>Softwares/Libraries for Full-text Search</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/Pr6Y71RzPVY/</link>
		<comments>http://www.grok.in/blog/2008/11/05/full-text-search/#comments</comments>
		<pubDate>Wed, 05 Nov 2008 13:47:15 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[full-text]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[sphinx]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=74</guid>
		<description><![CDATA[A lot of applications have a requirement to search the full-text of some content they have for some words it might contain. This kind of functionality is often referred to as full-text search. For example, a blogging software might need to provide a search functionality that searches the blog posts for the user entered query [...]]]></description>
			<content:encoded><![CDATA[<p>A lot of applications have a requirement to search the full-text of some content they have for some words it might contain. This kind of functionality is often referred to as <em>full-text search</em>. For example, a blogging software might need to provide a search functionality that searches the blog posts for the user entered query terms.</p>
<p>It is not possible to use the regular database indexes (usually B-Trees or Hashmaps) for this purpose because they require that you provide the full value of the column you are searching in; in essence they do an equality search. In the blogging software example, the user would then have to type in the entire blog post verbatim in order to find it; even if you could imagine the most patient of users, if s/he already knows the entire post by-heart, why would s/he be looking for it anyway?!</p>
<p><em><strong><a title="Softwares/Libraries for Full-text Search - Notes" href=" http://www.grok.in/notes/full-text-search/">Read on&#8230;</a></strong></em></p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=Pr6Y71RzPVY:X1rRQ7w9VxI:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=Pr6Y71RzPVY:X1rRQ7w9VxI:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=Pr6Y71RzPVY:X1rRQ7w9VxI:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/11/05/full-text-search/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/11/05/full-text-search/</feedburner:origLink></item>
		<item>
		<title>The Mother of All Database Normalization Debates</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/YOApiuTvOCA/</link>
		<comments>http://www.grok.in/blog/2008/07/25/the-mother-of-all-database-normalization-debates/#comments</comments>
		<pubDate>Fri, 25 Jul 2008 11:47:24 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[debate]]></category>
		<category><![CDATA[normalization]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=51</guid>
		<description><![CDATA[Over at &#8220;Coding Horror&#8221; blog, Jeff Atwood published an interesting article titled &#8220;Maybe Normalizing Isn&#8217;t Normal&#8220;.
But more than the article itself, the debate that ensued in the comments there is very interesting. The &#8220;High Scalability&#8221; blog published a compilation of some of the interesting quotes from the debate. This compilation provides a great overview of [...]]]></description>
			<content:encoded><![CDATA[<p>Over at &#8220;Coding Horror&#8221; blog, Jeff Atwood published an interesting article titled &#8220;<a href="http://www.codinghorror.com/blog/archives/001152.html">Maybe Normalizing Isn&#8217;t Normal</a>&#8220;.</p>
<p>But more than the article itself, the debate that ensued in the comments there is very interesting. The &#8220;High Scalability&#8221; blog <a href="http://highscalability.com/mother-all-database-normalization-debates-coding-horror">published a compilation of some of the interesting quotes from the debate</a>. This compilation provides a great overview of the (admittedly long) discussion.</p>
<p>I would recommend that you read the original article first and then the compilation of the quotes at the High Scalability blog.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=YOApiuTvOCA:yNDDlYUWS6c:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=YOApiuTvOCA:yNDDlYUWS6c:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=YOApiuTvOCA:yNDDlYUWS6c:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/07/25/the-mother-of-all-database-normalization-debates/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/07/25/the-mother-of-all-database-normalization-debates/</feedburner:origLink></item>
		<item>
		<title>Web Crawling with Perl</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/EB-N_A2uOpQ/</link>
		<comments>http://www.grok.in/blog/2008/06/09/web-crawling-with-perl/#comments</comments>
		<pubDate>Sun, 08 Jun 2008 19:13:25 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Crawling]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=49</guid>
		<description><![CDATA[If you are looking to write a web crawler, Perl, with all its great CPAN modules, is one of the best platforms you can pick. There are CPAN modules for most of the common components of a web crawler. Here, I&#8217;ll point to some of the modules that you would want to start out with.
Read [...]]]></description>
			<content:encoded><![CDATA[<p>If you are looking to write a web crawler, Perl, with all its great CPAN modules, is one of the best platforms you can pick. There are CPAN modules for most of the common components of a web crawler. Here, I&#8217;ll point to some of the modules that you would want to start out with.</p>
<p><em><strong><a title="Web Crawling with Perl - Notes" href="http://www.grok.in/notes/web-crawling-with-perl/">Read on&#8230;</a></strong></em></p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=EB-N_A2uOpQ:fbfCB1qEnEU:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=EB-N_A2uOpQ:fbfCB1qEnEU:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=EB-N_A2uOpQ:fbfCB1qEnEU:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/06/09/web-crawling-with-perl/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/06/09/web-crawling-with-perl/</feedburner:origLink></item>
		<item>
		<title>Introduction to Web Crawling</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/5ShRhzN2IRc/</link>
		<comments>http://www.grok.in/blog/2008/06/07/introduction-to-web-crawling/#comments</comments>
		<pubDate>Sat, 07 Jun 2008 16:40:17 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Crawling]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=35</guid>
		<description><![CDATA[In the context of the World Wide Web, crawling refers to gathering web pages, by following hyperlinks, starting from a small set of web pages, for the purposes of further processing. For example, a Web search engine needs to gather as many pages as possible before it indexes and makes them available for searching.
A program [...]]]></description>
			<content:encoded><![CDATA[<p>In the context of the World Wide Web, <em>crawling</em> refers to gathering web pages, by following hyperlinks, starting from a small set of web pages, for the purposes of further processing. For example, a Web search engine needs to gather as many pages as possible before it indexes and makes them available for searching.</p>
<p>A program which performs crawling is variably known as a <em>crawler</em>, a <em>spider</em>, a <em>robot</em> or simply a <em>bot</em>. The set of pages from which the crawler starts crawling is known as <em>seed list</em>.</p>
<p>Although it seems pretty straightforward, writing a good web crawler is not very much so. There are a good number of challenges which vary subtly depending on whether it&#8217;s a large-scale web crawler or a crawler for a handful of websites. These challenges include: ensuring politeness to the web servers (by observing the widely accepted <em>robots exclusion protocol</em>), <em>URL normalization</em>, <em>duplicate detection</em>, avoiding <em>spider traps</em>, maintaining a <em>queue</em> of un-fetched pages, maintaining a <em>repository</em> of crawled pages, <em>re-crawling</em> and a few more. For large-scale crawlers, one of the most important challenges is to increase the <em>throughput</em> by optimizing the <em>resource utilization</em>, because their coverage usually gets limited by this.</p>
<p><a title="Introduction to Web Crawling - Notes" href="http://www.grok.in/notes/full-text-search/">Read on&#8230;</a></p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=5ShRhzN2IRc:BHHiwQl47Os:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=5ShRhzN2IRc:BHHiwQl47Os:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=5ShRhzN2IRc:BHHiwQl47Os:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/06/07/introduction-to-web-crawling/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/06/07/introduction-to-web-crawling/</feedburner:origLink></item>
		<item>
		<title>More Bayes’ magic</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/Py9mJVC-i9c/</link>
		<comments>http://www.grok.in/blog/2008/05/04/more-bayes-magic/#comments</comments>
		<pubDate>Sat, 03 May 2008 18:49:26 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Probability & Statistics]]></category>
		<category><![CDATA[bayes]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=47</guid>
		<description><![CDATA[Statistics can be quite bewildering. Consider the following problem:
It is given that if a person having a disease takes a diagnostic test for the disease, the test returns a positive result 99% of the time, or with a probability of 0.99. Now, for some person picked at random, if the test returns a positive result, [...]]]></description>
			<content:encoded><![CDATA[<p>Statistics can be quite bewildering. Consider the following problem:</p>
<blockquote><p>It is given that if a person having a disease takes a diagnostic test for the disease, the test returns a positive result 99% of the time, or with a probability of 0.99. Now, for some person picked at random, if the test returns a positive result, what is the probability that s/he has the disease?</p></blockquote>
<p>You might think that the probability is <em>of course </em>0.99. But <em>of course</em> that isn&#8217;t so. If you did reach the naive conclusion, don&#8217;t worry: a lot of eminent scientists and doctors have been seen doing the same mistake (try it with your doctor!)</p>
<p><span id="more-47"></span></p>
<p>Actually, there isn&#8217;t enough information to give an answer to that question. To be able to get to the answer, we need to know how rare the disease is. Let&#8217;s add in this information and work out the answer.</p>
<p>Let&#8217;s say it is also known that 5 people in every 1000 suffer from this disease. Put another way, if you pick a person at random, the probability that s/he has the disease is 0.005. Let&#8217;s also say that if a person who does not have the disease takes the test, s/he tests positive 2% of the time, or with a probability of 0.02.</p>
<p>Now let&#8217;s get down to business.</p>
<blockquote><p>If we pick a person at random,</p>
<p>let A be the event that the person has the disease, P(A) = 0.005</p>
<p>let B be the event that the person tests positive for the disease, P(B) = ?</p>
<p>let A&#8217; be the event that the person does not have the disease, P(A&#8217;) = 1 &#8211; P(A) = 0.995</p>
<p>it is given that if a person has a disease, s/he will test positive with a probability 0.99, P(B | A) = 0.99</p>
<p>it is given that if a person does not have the disease, s/he will test positive with a probability 0.02, P(B | A&#8217;) = 0.02</p>
<p>We are looking to find P(A | B).</p>
<p>From Bayes&#8217; rule,</p>
<p>P(A | B) = ( P(A) * P(B | A) ) / P(B)</p></blockquote>
<p>We know everything on the right hand side except P(B). We can calculate P(B) as follows:</p>
<blockquote><p>P(B) = P(the person has the disease and tests positive) + P(the person does not have the disease and tests positive)</p>
<p>or P(B) = P(A) * P(B | A) + P(A&#8217;) * P(B | A&#8217;)</p>
<p>or P(B) = 0.005 * 0.99 + 0.995 * 0.02 = 0.02485</p></blockquote>
<p>Now let&#8217;s get P(A | B).</p>
<blockquote><p>P(A | B) = ( P(A) * P(B | A) ) / P(B)</p>
<p>P(A | B) = (0.005 * 0.99) / 0.02485 = 0.199</p></blockquote>
<p>That&#8217;s less than 2% chance that the person actally has the disease given s/he tests positive for it. And you thought it was 99%!</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=Py9mJVC-i9c:N9NtCbwO8hU:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=Py9mJVC-i9c:N9NtCbwO8hU:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=Py9mJVC-i9c:N9NtCbwO8hU:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/05/04/more-bayes-magic/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/05/04/more-bayes-magic/</feedburner:origLink></item>
		<item>
		<title>Probabilities, huh!</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/H0npJghsYcU/</link>
		<comments>http://www.grok.in/blog/2008/05/03/probabilities-huh/#comments</comments>
		<pubDate>Sat, 03 May 2008 08:05:54 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Probability & Statistics]]></category>
		<category><![CDATA[bayes]]></category>
		<category><![CDATA[probability]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=46</guid>
		<description><![CDATA[sanket asked a very interesting question in the comments to my previous post on Monty Hall Problem:
Assume that boys and girls are equally likely to be born. Let us say that a family has two children. Given that one of them is a boy, what is the probability that the other one is a boy [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://1stprinciples.wordpress.com/">sanket</a> asked a very interesting question in the comments to my previous <a title="grok.in - Monty Hall Problem" href="http://www.grok.in/blog/2008/04/21/to-switch-or-not-to-switch-that-is-the-question/">post on Monty Hall Problem</a>:</p>
<blockquote><p>Assume that boys and girls are equally likely to be born. Let us say that a family has two children. Given that one of them is a boy, what is the probability that the other one is a boy too? (Source: One of Scott Aaronson&#8217;s (http://scottaaronson.com/blog/) lecture notes.)</p></blockquote>
<p><strong>Update</strong>: It turns out that after stating this problem this way here, I solved a different problem altogether. Thanks, Nikhil, for pointing it out in the comments. To keep my life simple, I&#8217;ll state a modified problem below &#8212; the one that I <em>did</em> solve.</p>
<blockquote><p>Assume that boys and girls are equally likely to be born. Let us say that a family has two children. Given that one of them is a boy, what is the probability that the other one is a girl?</p></blockquote>
<p>Most people would jump out with 1/2 as the answer. Of course, if the answer was that obvious the question wouldn&#8217;t exist. The answer is 2/3. Here I will describe two different ways of arriving at this, as well as the common mistake that leads people to 1/2.</p>
<p><span id="more-46"></span></p>
<p>I will start by defining the problem in more formal terms. We are doing an experiment of giving birth to two children. Let GG, GB, BG and GG denote the outcomes that the children born are girl-girl, girl-boy, boy-girl and girl-girl respectively.</p>
<blockquote><p>So the sample space S is, S = { GG, GB, BG, BB }</p>
<p>Since each outcome is equally likely, the probabilities of the outcomes, P(GG) = P(GB) = P(BG) = P(BB) = 1/4</p>
<p>We are given that one of the child is a boy.</p>
<p>We are interested in the event E that the other child is a girl, E = { GB, BG }</p></blockquote>
<p>But before seeing the correct solutions, let us see one of the the <em><strong>flawed</strong></em> solutions that generally leads people to the conclusion that the the probability is 1/2.</p>
<blockquote><p>S = { GG, GB, BG, BB }</p>
<p>P(GG) = P(GB) = P(BG) = P(BB) = 1/4</p>
<p>E = { GB, BG }</p>
<p>P(E) = P(GB) + P(BG) = 1/4 + 1/4 = 1/2</p>
<p>Voila! Of course this is <em><strong>wrong</strong></em>, the first of correct solutions below should make it clear why it is so.</p></blockquote>
<p>Now the <em><strong>correct</strong></em> solutions:</p>
<p>1. Let&#8217;s just count.</p>
<blockquote><p>S = { GG, GB, BG, BB }</p>
<p>Since we are given that one of the children is a boy, our <em>sample space reduces</em> to</p>
<p>S&#8217; = { GB, BG, BB}</p>
<p>See how GG is conspicous by its absence? That is because the outcome GG cannot occur (wow, a boy cannot be girl!)</p>
<p>In this reduced sample space, our probabilities become</p>
<p>P(GB) = P(BG) = P(BB) = 1/3</p>
<p>E = { GB, BG }</p>
<p>P(E) = P(GB) + P(BG) = 1/3 + 1/3 = 2/3</p>
<p>Huh!</p></blockquote>
<p>2. Using Bayes&#8217; Rule.</p>
<blockquote><p>S = { GG, GB, BG, BB }</p>
<p>P(GG) = P(GB) = P(BG) = P(BB) = 1/4</p>
<p>E = { GB, BG }</p>
<p>Let F be the event that at least one of the child is a boy, F = { GB, BG, BB }</p>
<p>We are looking for probability of E given F, P(E | F)</p>
<p>Using Bayes&#8217; rule,</p>
<p>P(E | F) = (P(E) * P(F | E)) / P(F)</p>
<p>Now, P(E) = P(GB) + P(BG) = 1/4 + 1/4 = 1/2</p>
<p>P(F) = P(GB) + P(BG) + P(BB) = 1/4 + 1/4 + 1/4 = 3/4</p>
<p>Because E is the even that there is one boy and one girl and F is the even that there is at least one boy, P(F | E) = 1</p>
<p>So, P(E | F) = (1/2 * 1) / (3/4) = 2/3</p></blockquote>
<p>The observant might notice that we did not reduce the sample space in second solution. This is no trick: we <em>could</em> reduce the sample space and would still end up with the same answer; try it yourself. But there is no <em>need</em> to do any such thing when using the Bayes&#8217; rule because the fact that the sample space is actually reduced is inherently captured. In fact, many a time it is not possible to simply &#8220;reduce the sample space and proceed&#8221; &#8212; Bayes&#8217; rule really shines then.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=H0npJghsYcU:JqH7cRfUn8w:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=H0npJghsYcU:JqH7cRfUn8w:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=H0npJghsYcU:JqH7cRfUn8w:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/05/03/probabilities-huh/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/05/03/probabilities-huh/</feedburner:origLink></item>
		<item>
		<title>To switch or not to switch, that is the question</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/aAHKi5Cgbf4/</link>
		<comments>http://www.grok.in/blog/2008/04/21/to-switch-or-not-to-switch-that-is-the-question/#comments</comments>
		<pubDate>Mon, 21 Apr 2008 18:27:52 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Probability & Statistics]]></category>
		<category><![CDATA[paradox]]></category>
		<category><![CDATA[problem]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=45</guid>
		<description><![CDATA[I recently came across a very interesting problem known as &#8220;The Monty Hall Problem.&#8221; This is a statistical puzzle named after the host of an old television show &#8220;Let&#8217;s Make a Deal&#8221; which featured a similar problem albeit a little more involved than the basic version that mathematicians use. Here is a simple description of [...]]]></description>
			<content:encoded><![CDATA[<p>I recently came across a very interesting problem known as &#8220;The Monty Hall Problem.&#8221; This is a statistical puzzle named after the host of an old television show &#8220;Let&#8217;s Make a Deal&#8221; which featured a similar problem albeit a little more involved than the basic version that mathematicians use. Here is a simple description of the problem <a title="Wikipedia - Monty Hall problem" href="http://en.wikipedia.org/wiki/Monty_Hall_problem">from Wikipedia</a>:</p>
<blockquote><p>Suppose you&#8217;re on a game show, and you&#8217;re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what&#8217;s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, &#8220;Do you want to pick door No. 2?&#8221; Is it to your advantage to switch your choice?</p></blockquote>
<p><span id="more-45"></span>If you have not come across this problem before, intuition would make you say &#8220;it just won&#8217;t matter&#8221; (if not, either you are <em>really</em> clever or your intuition is screwed up; I&#8217;d put my money on the later.) It <em>does</em> matter. But I&#8217;m not going to tell you which is a better strategy (to switch or not to switch). New York Times has a good <a title="New York Times - The Monty Hall Problem" href="http://www.nytimes.com/2008/04/08/science/08monty.html">simulation of the game</a>, go play the game yourself a few times with different strategies and I assure you that you&#8217;ll be stunned. After you&#8217;ve played it a few times, click on &#8220;How it works&#8221; to get a decent understanding of why the best strategy is what it is. If you want a better explanation, both in words and mathematical (using Bayesian analysis), refer to the Wikipedia page linked above.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=aAHKi5Cgbf4:LklwSfzlriU:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=aAHKi5Cgbf4:LklwSfzlriU:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=aAHKi5Cgbf4:LklwSfzlriU:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/04/21/to-switch-or-not-to-switch-that-is-the-question/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/04/21/to-switch-or-not-to-switch-that-is-the-question/</feedburner:origLink></item>
		<item>
		<title>More data usually beats better algorithms?</title>
		<link>http://feedproxy.google.com/~r/grokin/~3/b9NgDGxaCfo/</link>
		<comments>http://www.grok.in/blog/2008/04/14/more-data-usually-beats-better-algorithms/#comments</comments>
		<pubDate>Mon, 14 Apr 2008 12:13:41 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[data]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=44</guid>
		<description><![CDATA[Anand   Rajaraman, who teaches a class onÂ  Machine Learning at Stanford, recently wrote an interesting blog post: More data usually beats better algorithms, he claimed. The post makes for an interesting read and so do the plethora of comments on it. He made a follow-up post, which is equally interesting.
I do agree with [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www-db.stanford.edu/%7Eanand/">Anand   Rajaraman</a>, who teaches a class onÂ  Machine Learning at Stanford, recently wrote an interesting blog post: <a title="More data usually beats better algorithms" href="http://anand.typepad.com/datawocky/2008/03/more-data-usual.html">More data usually beats better algorithms</a>, he claimed. The post makes for an interesting read and so do the plethora of comments on it. He made a <a href="http://anand.typepad.com/datawocky/2008/04/data-versus-alg.html">follow-up post</a>, which is equally interesting.</p>
<p>I do agree with a good number of the points he brings up, but at the same time believe that such a blanket statement is not warranted. I believe that adding more data to a given algorithm does give out better results, especially if the new data is independent and the algorithm is capable of utilizing such data appropriately. But to say that better data is more important than better algorithms most of the time is far-fetched.</p>
<p><span id="more-44"></span>Such generalizations usually fall flat in the absence of the unstated assumptions under which they have been made. In the case of Dr. Rajaraman&#8217;s post, these assumptions include (but are not limited to): the data that is already available is not representative enough so that more data could add value; the algorithm is capable of utilizing the additional data; the additional data is good, even better, than the existing data.</p>
<p>There is data. There is information. There is a small <a title="Differences between data and information" href="http://jmcsweeney.co.uk/computing/m150/differences.php">&#8217;semantic&#8217; difference</a> between the two. Data has an &#8216;informational value&#8217; which can be described as the information it brings to the system. Additional data, if it does not add any information to the system is more often than not useless. So, more data only helps if it adds to the information in the system. Even where there is data available that can add information to the system, the value of that information needs to be considered. If a lot of additional data adds a marginal value, it might not be worth using that data. That is because processing the additional data requires additional resources in terms of time and machine power, and, in machine learning applications we are usually trying to optimize the resource utilization.</p>
<p>Most simplistic algorithms need to be modified for them to be able to utilize more than one independent sets of data. So in this case, the algorithm is being improved, as well as more data being added. This is the case, for example, with Google&#8217;s PageRank: Google decided to use the hyperlink information for ranking web pages but the existing ranking algorithms could not utilize that information, so they improved on them to come up with PageRank. Now, this very example could have been stated in another way: Google came up with a better ranking algorithm that could utilize the social citations of web pages to rank them, this new algorithm needed new data and hyperlink information happened to be that data; it might as well have been the bookmarks of all the people, if cloud computing was invented before the Internet (sic).</p>
<p>At the end of the day, whether more/better data is more important or better algorithm is highly dependent on the particular application at hand.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/grokin?a=b9NgDGxaCfo:sm-92BSgFXo:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/grokin?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/grokin?a=b9NgDGxaCfo:sm-92BSgFXo:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/grokin?i=b9NgDGxaCfo:sm-92BSgFXo:D7DqB2pKExk" border="0"></img></a>
</div>]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/04/14/more-data-usually-beats-better-algorithms/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.grok.in/blog/2008/04/14/more-data-usually-beats-better-algorithms/</feedburner:origLink></item>
	</channel>
</rss>
