<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">

<channel>
	<title>AI and Social Science - Brendan O'Connor</title>
	
	<link>http://anyall.org/blog</link>
	<description>Cognition, systems, decisions, visualization, machine learning, etc.</description>
	<pubDate>Fri, 26 Jun 2009 21:46:28 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/BrendanOConnorsBlog" type="application/rss+xml" /><item>
		<title>Michael Jackson in Persepolis</title>
		<link>http://anyall.org/blog/2009/06/michael-jackson-in-persepolis/</link>
		<comments>http://anyall.org/blog/2009/06/michael-jackson-in-persepolis/#comments</comments>
		<pubDate>Fri, 26 Jun 2009 16:13:00 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=571</guid>
		<description><![CDATA[Michael Jackson just died while Iran is in turmoil.  I am reminded of a passage in Marjane Satrapi&#8217;s wonderful graphic novel Persepolis, a memoir of growing up in revolutionary Iran in the 80&#8217;s.

(Read the book to see how it ends.)
I wonder how much coincidences of news event timing can influence perceptions.  Clearly, large [...]]]></description>
			<content:encoded><![CDATA[<p>Michael Jackson just died while Iran is in turmoil.  I am reminded of a passage in Marjane Satrapi&#8217;s wonderful graphic novel <a href="http://en.wikipedia.org/wiki/Persepolis">Persepolis</a>, a memoir of growing up in revolutionary Iran in the 80&#8217;s.</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/06/persepolis-michael-jackson-3frames.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/06/persepolis-michael-jackson-3frames.png" alt="" title="persepolis-michael-jackson-3frames" width="500" height="192" class="aligncenter size-full wp-image-578" /></a></p>
<p>(Read the book to see how it ends.)</p>
<p>I wonder how much coincidences of news event timing can influence perceptions.  Clearly, large news stories can crowd out other ones.  Are there any other effects of joint appearances?  Celebrity deaths are fairly exogenous shocks &#8212; there might be a nice natural experiment somewhere here.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/06/michael-jackson-in-persepolis/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Psychometrics quote</title>
		<link>http://anyall.org/blog/2009/06/psychometrics-quote/</link>
		<comments>http://anyall.org/blog/2009/06/psychometrics-quote/#comments</comments>
		<pubDate>Sun, 14 Jun 2009 01:57:17 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=569</guid>
		<description><![CDATA[
It is rather surprising that systematic studies of human abilities were not undertaken until the second half of the last century&#8230; An accurate method was available for measuring the circumference of the earth 2,000 years before the first systematic measures of human ability were developed.
&#8211;Jum Nunnally, Psychometric Theory (1967)
(Social science textbooks from the 60&#8217;s and [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>
It is rather surprising that systematic studies of human abilities were not undertaken until the second half of the last century&#8230; An accurate method was available for measuring the circumference of the earth 2,000 years before the first systematic measures of human ability were developed.</p></blockquote>
<p>&#8211;Jum Nunnally, <i>Psychometric Theory</i> (1967)</p>
<p>(Social science textbooks from the 60&#8217;s and 70&#8217;s are rad.)</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/06/psychometrics-quote/feed/</wfw:commentRss>
		</item>
		<item>
		<title>June 4</title>
		<link>http://anyall.org/blog/2009/06/june-4/</link>
		<comments>http://anyall.org/blog/2009/06/june-4/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 08:17:44 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=562</guid>
		<description><![CDATA[BBC News - June 4, 1989, Tiananmen Square Massacre

Also worth reading: Nicholas Kristof&#8217;s riveting firsthand account.
]]></description>
			<content:encoded><![CDATA[<p>BBC News - June 4, 1989, Tiananmen Square Massacre</p>
<p><object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/XJBnHMpHGRY&#038;hl=en&#038;fs=1&#038;"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/XJBnHMpHGRY&#038;hl=en&#038;fs=1&#038;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object></p>
<p>Also worth reading: <a href="http://www.nytimes.com/2009/06/04/opinion/04kristof.html">Nicholas Kristof&#8217;s riveting firsthand account</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/06/june-4/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Where tweets get sent from</title>
		<link>http://anyall.org/blog/2009/05/where-tweets-get-sent-from/</link>
		<comments>http://anyall.org/blog/2009/05/where-tweets-get-sent-from/#comments</comments>
		<pubDate>Wed, 27 May 2009 23:05:29 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=554</guid>
		<description><![CDATA[Playing around with stream.twitter.com/spritzer, ggplot2 and maps / mapdata:

I think I like the top better, without the map lines, like those night satellite photos: pointwise ghosts of high-end human economic development.
This data is a fairly extreme sample of convenience: I&#8217;m only looking at tweets posted by certain types of iPhone clients, because they conveniently report [...]]]></description>
			<content:encoded><![CDATA[<p>Playing around with <a href="http://apiwiki.twitter.com/Streaming-API-Documentation#spritzer">stream.twitter.com/spritzer</a>, <a href="http://had.co.nz/ggplot2/">ggplot2</a> and <a href="http://cran.es.r-project.org/web/views/Spatial.html">maps / mapdata</a>:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/untitled1.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/untitled1.png" alt="" title="untitled1" width="500" height="559" class="aligncenter size-full wp-image-558" /></a></p>
<p>I think I like the top better, without the map lines, like those <a href="http://www.worldmapsonline.com/SatPosters/EarthAtNight.jpg">night satellite photos</a>: pointwise ghosts of high-end human economic development.</p>
<p>This data is a fairly extreme sample of convenience: I&#8217;m only looking at tweets posted by certain types of iPhone clients, because they conveniently report exact gps-derived latitude/longitude numbers.  (<a href="http://search.twitter.com/">search.twitter.com</a> has geographic proximity operators &#8212; which are very cool! &#8212; but they seem to usually use zip codes or other user information that&#8217;s not available in the per-tweet API data.)  So there&#8217;s only 30,000 messages out of 1.2 million <a href="http://apiwiki.twitter.com/Streaming-API-Documentation#spritzer">spritzer</a> tweets over ~3 days (itself only a small single-digit percentage sample of twitter).</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/05/where-tweets-get-sent-from/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Zipf’s law and world city populations</title>
		<link>http://anyall.org/blog/2009/05/zipfs-law-and-world-city-populations/</link>
		<comments>http://anyall.org/blog/2009/05/zipfs-law-and-world-city-populations/#comments</comments>
		<pubDate>Sun, 24 May 2009 22:52:01 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=536</guid>
		<description><![CDATA[Will Fitzgerald just wrote about an excellent article by Steven Strogatz on Zipf&#8217;s Law for the populations of cities.  If you look at the biggest city, then the next biggest city, etc., there tends to be an exponential fall-off in size.
I was wondering what this looks like so here&#8217;s the classic zipfian plot (log-size [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.entish.org/wordpress/?p=762">Will Fitzgerald just wrote about</a> an excellent <a href="http://judson.blogs.nytimes.com/2009/05/19/math-and-the-city/">article by Steven Strogatz</a> on Zipf&#8217;s Law for the populations of cities.  If you look at the biggest city, then the next biggest city, etc., there tends to be an exponential fall-off in size.</p>
<p>I was wondering what this looks like so here&#8217;s the classic zipfian plot (log-size vs. log-rank) for city population data from from <a href="http://www.populationdata.net/index2.php?option=palmares&#038;rid=4&#038;nom=grandes-villes">populationdata.net</a>:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/picture-4.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/picture-4.png" alt="" title="picture-4" width="500" height="433" class="aligncenter size-full wp-image-547" /></a></p>
<p>If you fit a power law &#8212; that is, a line on the above logsize-logrank plot &#8212; you can use rank to predict the sizes of smaller cities very accurately, according to Will&#8217;s analysis.  Larger cities are more problematic, lying off the line.</p>
<p>I was curious whether the power law holds <i>within</i> countries as well.  The above plot was only for the countries that had more than 10 cities in the dataset &#8212; just eight countries.  So here are those same cities again, but plotted against ranks within their respective countries.</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/picture-3.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/picture-3.png" alt="" title="picture-3" width="500" height="446" class="aligncenter size-full wp-image-538" /></a></p>
<p>The answer is &#8212; usually, yes, the power law looks like it holds within countries as well.  (Country names are French in this data &#8230; Etats-Unis = USA, Allemagne = Germany, etc.)  Russia seems to have the biggest difference between its head vs. tail cities.  The tail cities have the linear logsize-logrank relationship, but the top 3 cities (Moscow, St. Petersburg, Nizhny Novgorod) seem to have their own different slope.</p>
<p>If you randomly subsample out of a Zipf distribution, the samples will be Zipfian as well, so this isn&#8217;t too surprising.  If, on the other hand, you&#8217;re a fan of theories that power law population relationships might happen as a result of the structural dynamics of growth &#8212; for example, winners-win (i.e. rich-get-richer) growth patterns can sometimes result in zipf-distributed sizes &#8212; then there&#8217;s a case that these dynamics might be happening at both the world and country levels.</p>
<p>Also: this is the first time I&#8217;ve used <a href="http://had.co.nz/">Hadley Wickham</a>&#8217;s <a href="http://had.co.nz/ggplot2/">ggplot2</a> and it was great.  All of the fun of <a href="http://www.jstatsoft.org/v25/b02/paper">lattice</a> minus a lot of the pain, plus default display options that aren&#8217;t ugly as hell :)</p>
<p>Update: alternative view of those two above graphs.</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/picture-6.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/picture-6.png" alt="" title="picture-6" width="300" height="262" class="alignnone size-medium wp-image-551" /></a></p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/picture-7.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/picture-7.png" alt="" title="picture-7" width="300" height="265" class="alignnone size-medium wp-image-552" /></a></p>
<p>This was brought to you via the following R code: <span id="more-536"></span><br />
<i><br />
d=read.delim(&#8217;<a href="http://anyall.org/blog/wp-content/uploads/2009/05/cities.tsv">cities.tsv</a>&#8216;,header=T)<br />
bigs=names(table(d$country))[table(d$country) > 10]<br />
x=d[d$country %in% bigs,]; x=x[order(-x$pop),]<br />
plot(log10(pop) ~ log10(1:nrow(x)), data=x, main=&#8217;World city populations for 8 countries\nlogsize vs logrank&#8217;, col=&#8217;darkred&#8217;)<br />
text(x=log10(1:nrow(x)), y=log10(x$pop), labels=x$city, pos=ifelse(1:nrow(x) %% 2 == 1, 4, 2), cex=.5, col=&#8217;gray30&#8242;)<br />
# or better<br />
library(<a href="http://had.co.nz/ggplot2/">ggplot2</a>)<br />
qplot(log10(1:nrow(x)), log10(pop), data=x) + geom_text(hjust=ifelse(1:nrow(x) %% 2 == 1, 0, 1),label=sprintf(&#8221;    %s    &#8220;,x$city),size=2,colour=&#8217;darkblue&#8217;)<br />
library(<a href="http://had.co.nz/plyr/">plyr</a>)<br />
xr=ddply(x, .(country), function(x) { x=x[order(-x$pop),]; ranks=(1:nrow(x));  data.frame(x$city, logpop=log10(x$pop), logrank=log10(ranks)) })<br />
qplot(logrank, logpop, country, data=xr, facets=~country, main=&#8217;world city populations by ranks, for 8 countries&#8217;)<br />
# alternate views<br />
xr=ddply(x, .(country), transform, within_country_rank=rank(-pop))<br />
qplot(rank(-pop),pop, data=xr, log=&#8217;xy&#8217;,colour=country, main=&#8217;City population vs rank across countries&#8217;)<br />
qplot(within_country_rank,pop, data=xr, log=&#8217;xy&#8217;,colour=country,main=&#8217;City population vs rank within country&#8217;)<br />
</i></p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/05/zipfs-law-and-world-city-populations/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Announcing TweetMotif for summarizing twitter topics with a dash of NLP</title>
		<link>http://anyall.org/blog/2009/05/announcing-tweetmotif-for-summarizing-twitter-topics-with-a-dash-of-nlp/</link>
		<comments>http://anyall.org/blog/2009/05/announcing-tweetmotif-for-summarizing-twitter-topics-with-a-dash-of-nlp/#comments</comments>
		<pubDate>Mon, 18 May 2009 17:40:03 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=515</guid>
		<description><![CDATA[Last week, I, with my awesome friends David Ahn and Mike Krieger, finished hacking together an experimental prototype, TweetMotif, for exploratory search on Twitter.  If you want to know what people are thinking about something, the normal search interface search.twitter.com gives really cool information, but it&#8217;s hard to wade through hundreds or thousands of [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, I, with my awesome friends <a href="http://tweetmotif.com/about">David Ahn and Mike Krieger</a>, finished hacking together an experimental prototype, <a href="http://tweetmotif.com/">TweetMotif</a>, for exploratory search on Twitter.  If you want to know what people are thinking about something, the normal search interface <a href="http://search.twitter.com/">search.twitter.com</a> gives really cool information, but it&#8217;s hard to wade through hundreds or thousands of results.  We take tweets matching a query and group together similar messages, showing significant terms and phrases  that co-occur with the user query.  Try it out at <a href="http://tweetmotif.com">tweetmotif.com</a>.  Here&#8217;s an example for a current hot topic, <a href="http://tweetmotif.com/#%23wolframalpha">#WolframAlpha</a>:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/wa_top.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/wa_top.png" alt="" title="wa_top" width="500" height="85" class="aligncenter size-full wp-image-517" /></a><br />
<a href="http://anyall.org/blog/wp-content/uploads/2009/05/wa_bot.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/wa_bot.png" alt="" title="wa_bot" width="500" height="240" class="aligncenter size-full wp-image-518" /></a></p>
<p>It&#8217;s currently showing tweets that match both <a href="http://tweetmotif.com/#%23wolframalpha">#WolframAlpha</a> as well as two interesting bigrams: &#8220;queries failed&#8221; and &#8220;google killer&#8221;.  TweetMotif doesn&#8217;t attempt to derive the meaning or sentiment toward the phrases &#8212; NLP is hard, and doing this much is hard enough! &#8212; but it&#8217;s easy for you to look at the tweets themselves and figure out what&#8217;s going on.</p>
<p>Here&#8217;s another fun example right now, a query for <a href="http://tweetmotif.com/#Dollhouse">Dollhouse</a>:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/dollhouse_top.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/dollhouse_top.png" alt="" title="dollhouse_top" width="500" height="26" class="aligncenter size-full wp-image-519" /></a><br />
<a href="http://anyall.org/blog/wp-content/uploads/2009/05/dollhouse_bot.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/dollhouse_bot.png" alt="" title="dollhouse_bot" width="500" height="253" class="aligncenter size-full wp-image-520" /></a></p>
<p>I love that the #wolframalpha topic has &#8220;infected&#8221; the dollhouse space.  Someone pointed out a connection between them, but really they&#8217;re connected through bot spam.  TweetMotif&#8217;s duplicate detection algorithm found 22 messages here where each is basically a list of all the trending topics.  This seems to be a popular form of twitter spambots.</p>
<p>I learned a ton making this system, and I&#8217;ll try to write more about the technical details in a future post.  It&#8217;s interesting to hear people speculate on how it works; everyone gives a different answer.  I guess this goes to show you that search/NLP is still a pretty unsettled, not-completely-understood area.</p>
<p>There are lots of interesting TweetMotif examples.  More prosaic, less news-y queries like <a href="http://tweetmotif.com/#sandwich">sandwich</a> yield cool things like major ingredients of sandwiches and types of sandwiches.  (These are basically distributional similarity candidates for synonym and meronym acquisition, though a bit too noisy to use in its current form.)  And in a few cases, like for understanding currently unfolding events, TweetMotif might even be useful!  It would be nice to expand the set of usefully served queries.  We&#8217;re occasionally posting interesting queries at <a href="http://twitter.com/tweetmotif">twitter.com/tweetmotif</a>.</p>
<p>And oh yeah.  We have a beautiful iPhone interface!</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/05/astro_mike_iphone_cropped.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/05/astro_mike_iphone_cropped.png" alt="" title="astro_mike_iphone_cropped" width="200" height="300" class="aligncenter size-medium wp-image-523" /></a></p>
<p>Check it out folks.  This is a functional prototype, so you can play with it right now at <a href="http://tweetmotif.com/">tweetmotif.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/05/announcing-tweetmotif-for-summarizing-twitter-topics-with-a-dash-of-nlp/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Performance comparison: key/value stores for language model counts</title>
		<link>http://anyall.org/blog/2009/04/performance-comparison-keyvalue-stores-for-language-model-counts/</link>
		<comments>http://anyall.org/blog/2009/04/performance-comparison-keyvalue-stores-for-language-model-counts/#comments</comments>
		<pubDate>Wed, 22 Apr 2009 10:11:46 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=492</guid>
		<description><![CDATA[I&#8217;m doing word and bigram counts on a corpus of tweets.  I want to store and rapidly retrieve them later for language model purposes.  So there&#8217;s a big table of counts that get incremented many times.  The easiest way to get something running is to use an open-source key/value store; but which? [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m doing <a href="http://en.wikipedia.org/wiki/N-gram">word and bigram counts</a> on a corpus of tweets.  I want to store and rapidly retrieve them later for <a href="http://en.wikipedia.org/wiki/Language_model">language model</a> purposes.  So there&#8217;s a big table of counts that get incremented many times.  The easiest way to get something running is to use an open-source key/value store; but which?  There&#8217;s recently been some development in this area so I thought it would be good to revisit and evaluate some options.</p>
<p>Here are timings for a single counting process: iterate over 45,000 short text messages, tokenize them, then increment counters for their unigrams and bigrams.  (The speed of the data store is only one component of performance.)  There are about 17 increments per tweet: 400k unique terms and 750k total count.  This is substantially smaller than what I need, but it&#8217;s small enough to easily test.  I used several very different architectures and packages, explained below.</p>
<table align="center" cellspacing=0 cellpadding=3 border=1>
<tr>
<th>architecture
<th>name
<th>speed</p>
<tr>
<td>in-memory, within-process
<td>python dictionary
<td> 2700 tweets/sec</p>
<tr>
<td>on-disk, within-process
<td>berkeleydb hashtable
<td>  340 tweets/sec</p>
<tr>
<td>on-disk, within-process
<td>tokyo cabinet hashtable
<td>  1400 tweets/sec</p>
<tr>
<td>in-memory, over socket
<td>memcached
<td>  120 tweets/sec</p>
<tr>
<td>on-disk, over socket
<td>memcachedb
<td>0.5 tweets/sec</p>
<tr>
<td>in-memory, over socket
<td>tokyo tyrant, memcached protocol
<td> 85 tweets/sec</p>
<tr>
<td>on-disk, over socket
<td>tokyo tyrant, memcached protocol
<td> 85 tweets/sec</p>
<tr>
<td>on-disk, over socket
<td>tokyo tyrant, binary protocol
<td>225 tweets/sec<br />
</table>
<p>Eventually, <span id="more-492"></span> I&#8217;ll want a purely in-memory, distributed table.  That&#8217;s why I was interested in Memcached.  But for development purposes, it&#8217;s very convenient to use an on-disk database.  In the past I&#8217;ve used BerkeleyDB for this.  (An SQL database is also possible but seems like overkill.)  Ideally it would be nice to have a distributed key-value store with a heavy caching layer.  Check out <a href="http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores">Leonard Lin&#8217;s post</a> on the subject.  Unfortunately, these experiments were limited to single node with a small dataset.</p>
<p>More details on the options:</p>
<ul>
<li>Python dictionary: <a href="http://www.python.org/doc/2.5.2/lib/defaultdict-examples.html">defaultdict(int)</a> is the simplest and most obvious implementation.  It&#8217;s the baseline and the fastest.  This is the best option for many types of experimental NLP code, since it can just be serialized to disk for use later.  Only if you want many processes to build it concurrently and incrementally, or want many processes to access the model but not have to hold it in their own process space, do the other options start becoming relevant.</p>
<li><a href="http://www.oracle.com/technology/products/berkeley-db/index.html">BerkeleyDB</a>: a well-known key/value store that I&#8217;ve used for a while.  Unfortunately it&#8217;s been removed from the Python distribution, and there are often version hell issues every time I see people try to use it.  (Every Linux/Unix seems to carry a different version, and they&#8217;re all not compatible with each other.)
<li><a href="http://tokyocabinet.sourceforge.net/index.html">Tokyo Cabinet</a> is a newish key/value store that has some impressive benchmarks.  I just learned about it from <a href="http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores">Leonard&#8217;s post</a>, and I also found it to be excellent.  If Cabinet keeps being so awesome, I might never use BerkeleyDB again.  (Though installation issues are worse than BerkeleyDB since it&#8217;s new enough to not be a common package; e.g. I found it on MacPorts but not Debian.)
<li><a href="http://www.danga.com/memcached/">Memcached</a>: The most standard in-memory key/value for use over sockets.  Usually used for caching results from database queries for web applications &#8212; because in-memory caching is <i>way</i> faster than hitting disk on a database query.  All data in a Memcached disappears if you turn it off.  Clients talk to it via a plaintext protocol over sockets.
<ul>
<li>The fact it was slower than the dictionary or BDB or Cabinet means that the communication overhead was high.  The nice thing about Memcached for keeping running counts like this is that it should distribute well: have lots of different processes/machines processing data and asking a central Memcached cluster to increment counters.  It might be a little unfair to compare Memcached performance to BerkeleyDB or Cabinet, since it&#8217;s designed for the situation of communicating with many clients at once.  It&#8217;s usually considered a win if Memcached is faster than a parallel-ly accessed RDBMS, which is very much the case.</p>
<li>I wonder how this architecture would compare to a Hadoop/HDFS/MapReduce for batch-job term counting performance.  Jimmy Lin &amp; other Maryland folks <a href="http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf">wrote an interesting report (2009)</a> about using Memcached during a Hadoop job in a similar way for, among other things, this same language model use case.  In general, lots of machine learning algorithms really don&#8217;t parallelize very well in the MapReduce architecture; parameter updates in Gibbs sampling, EM, and any online algorithm (e.g. <a href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">SGD</a>) are other examples.  (An earlier paper on a better-than-mapreduce approach for EM parameters: Jason Wolfe et al. 2008; <a href="http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em-slides.pdf">slides</a>, <a href="http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf">paper</a>.)  A Memcached-like system could be a component of more client-server-ish parallel processing models for these use cases.
<li>Note of warning: there are actually 3 different Python libraries to talk to Memcached: (1) memcache.py aka <a href="http://www.tummy.com/Community/software/python-memcached/">python-memcached</a>; (2) <a href="http://gijsbert.org/cmemcache/">cmemcache</a> which wraps the C library <a href="http://people.freebsd.org/~seanc/libmemcache">libmemcache</a>, and (3) cmemcached.pyx aka <a href="http://code.google.com/p/python-libmemcached/">python-libmemcached</a> write wraps a <i>different</i> C library, <a href="http://tangent.org/552/libmemcached.html">libmemcached</a>.  For each one, the <i>X</i> in <i>import X</i> correlates quite poorly to the project&#8217;s name.  Bleah.  Option #3 seems newest, or at least has the best-maintained websites, so I used that.</ul>
<li><a href="http://memcachedb.org/">MemcacheDB</a> is a BerkeleyDB-backed, Memcached-protocol server.  Initially I had hoped it was just Memcached over BDB.  Unfortunately this is clearly not the case.  Its name is so similar yet its effectiveness is so different than Memcached!  As Leonard points out, there are lots of half-assed solutions out there.  It&#8217;s easy for anyone to create a system that works well for their needs, but it&#8217;s harder to make something more general.
<li><a href="http://tokyocabinet.sourceforge.net/tyrantdoc/">Tokyo Tyrant</a> is a server implemented on top of Cabinet that implements a similar key/value API except over sockets.  It&#8217;s incredibly flexible; it was very easy to run it in several different configurations.  The first one is to use an in-memory data store, and communicate using the memcached protocol.  This is, of course, *exactly* comparable to Memcached &#8212; behaviorally indistinguishable! &#8212; and it does worse.  The second option is to do that, except switch to an on-disk data store.  It&#8217;s pretty ridiculous that that&#8217;s still the same speed &#8212; communication overhead is completely dominating the time.  Fortunately, Tyrant comes with a binary protocol.  Using that substantially improves performance past Memcached levels, though less than a direct in-process database.  Yes, communication across processes incurs overhead.  No news here, I guess.
</ul>
<p>I can&#8217;t say this evaluation tells us too much about the server systems, since it&#8217;s all for a single process, which really isn&#8217;t their use case.  It is interesting, however, to see that memcached&#8217;s plaintext protocol causing a big performance hit compared to a binary one.  There&#8217;s a <a href=http://code.google.com/p/memcached/wiki/MemcacheBinaryProtocol>lot</a> of <a href=http://www.slideshare.net/tmaesaka/memcached-binary-protocol-in-a-nutshell-presentation>talk</a> and perhaps <a href=http://github.com/dustin/memcached/tree/binary>code</a> for a binary memcached protocol, but I couldn&#8217;t find any docs suggesting whether it currently works.  Tyrant seems to work great.</p>
<p>The biggest takeaway is that <a href="http://tokyocabinet.sourceforge.net/index.html">Tokyo Cabinet</a> is awesome.  It has very complete English language documentation &#8212; something sadly lacking in many otherwise fine Japanese open-source projects &#8212; and appears to be highly performant and very flexible.  <a href="http://tokyocabinet.sourceforge.net/tokyoproducts.pdf">This presentation by its author</a> (<a href="http://www.linkedin.com/pub/dir/mikio/hirabayashi">Mikio Hirabayashi</a>) shows a pretty impressive array of different things the suite of packages can do.  At the very least, I&#8217;ll probably abandon BerkeleyDB if Cabinet keeps working so well; and hopefully, distribution and remote access will be easy to add via Tyrant.</p>
<p>Final note: it&#8217;s interesting how many of these new low-latency datastore systems come out of open-sourced projects from social network companies.  Tokyo Cabinet/Tyrant is from <a href="http://mixi.jp/">Mixi</a>, a large Japanese social networking site; <a href="http://incubator.apache.org/cassandra">Cassandra</a> is from <a href="http://www.facebook.com/">Facebook</a>; and <a href="http://project-voldemort.com/">Voldemort</a> is from <a href="http://www.linkedin.com/">LinkedIn</a>.  (Hadoop HDFS, approximately from Yahoo, is another open-source non-rdbms distributed datastore, though it&#8217;s not really low-latency enough to be comparable.)  Then there are lots of commercial low-latency and distributed systems for data warehousing (<a href="http://www.oracle.com/solutions/business_intelligence/dw_home.html">oracle</a> <a href=http://www.greenplum.com/products/greenplum-database/>greenplum</a> <a href=http://www.vertica.com/>vertica</a> <a href=http://www.asterdata.com>aster</a>&#8230;) but all these large web companies seem happy open-sourcing their infrastructure.  This is great for me, but sucks to be a database company.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/04/performance-comparison-keyvalue-stores-for-language-model-counts/feed/</wfw:commentRss>
		</item>
		<item>
		<title>1 billion web page dataset from CMU</title>
		<link>http://anyall.org/blog/2009/04/1-billion-web-page-dataset-from-cmu/</link>
		<comments>http://anyall.org/blog/2009/04/1-billion-web-page-dataset-from-cmu/#comments</comments>
		<pubDate>Fri, 17 Apr 2009 02:57:50 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=482</guid>
		<description><![CDATA[This is fun &#8212; Jamie Callan&#8217;s group at CMU LTI just finished a crawl of 1 billion web pages.  It&#8217;s 5 terabytes compressed &#8212; big enough so they have to send it to you by mailing hard drives.
Link: ClueWeb09
One of their motivations was to have a corpus large enough such that research results on [...]]]></description>
			<content:encoded><![CDATA[<p>This is fun &#8212; <a href="http://www.cs.cmu.edu/~callan/">Jamie Callan</a>&#8217;s group at <a href="http://www.lti.cs.cmu.edu/">CMU LTI</a> just finished a crawl of 1 billion web pages.  It&#8217;s 5 terabytes compressed &#8212; big enough so they have to send it to you by mailing hard drives.</p>
<p><a href="http://boston.lti.cs.cmu.edu/Data/clueweb09/">Link: ClueWeb09</a></p>
<p>One of their motivations was to have a corpus large enough such that research results on it would be taken seriously by search engine companies.  To my mind, this begs the question whether academics should try to innovate in web search, when it&#8217;s a research area incredibly dependent on really large, expensive-to-acquire datasets.  And what&#8217;s the point?  To slightly improve Google someday?  Don&#8217;t they do that pretty well themselves?</p>
<p>On the other hand, having a billion web pages around sounds like a lot of fun.  Someone should get Amazon to add this to the <a href="http://aws.amazon.com/publicdatasets/">AWS Public Datasets</a>.  Then, to process the data, instead of paying to get 5 TB of data shipped to you, you instead pay Amazon to rent virtual computers that can access the data.  This costs less only to a certain point, of course.</p>
<p>It always seemed to me that a problem with Amazon&#8217;s public datasets program is that they want data that&#8217;s genuinely large enough you need to rent lots of computing power to work on it; but there are very few public datasets large enough to warrant that.  (For example, they have Freebase up there, but I think it&#8217;s slightly too small to justify that; e.g. I can fit all of freebase just fine on my laptop and run a grep on it in like 5 minutes flat.)  But 1 billion web pages is more arguably appropriate for this treatment.</p>
<p>The bigger problem with big-data research initiatives is that organizations with petabyte-scale data are always going to keep it private; e.g. from giant corporations &#8212; walmart retail purchase records, or the facebook friend graph, or google search query logs &#8212; or else from governments of course.  Maybe biology and computational genetics is the big exception to this tendency.  At least the public data situation for web research just got a lot better.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/04/1-billion-web-page-dataset-from-cmu/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Pirates killed by President</title>
		<link>http://anyall.org/blog/2009/04/pirates-killed-by-president/</link>
		<comments>http://anyall.org/blog/2009/04/pirates-killed-by-president/#comments</comments>
		<pubDate>Wed, 15 Apr 2009 08:02:03 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=472</guid>
		<description><![CDATA[A lesson in x-axis scaling, and choosing which data to compare.  Two current graphs making their rounds on the internet:


(about this.)
]]></description>
			<content:encoded><![CDATA[<p>A lesson in x-axis scaling, and choosing which data to compare.  Two current graphs making their rounds on the internet:</p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/04/oadqwwrabm98xg3mzqg8jbv2o1_500.jpg"><img class="aligncenter size-medium wp-image-473" title="oadqwwrabm98xg3mzqg8jbv2o1_500" src="http://anyall.org/blog/wp-content/uploads/2009/04/oadqwwrabm98xg3mzqg8jbv2o1_500.jpg" alt="" width="415" height="275" /></a></p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/04/6a00d83451c45669e20115701d788e970b-800wi.jpg"><img class="aligncenter size-medium wp-image-474" title="6a00d83451c45669e20115701d788e970b-800wi" src="http://anyall.org/blog/wp-content/uploads/2009/04/6a00d83451c45669e20115701d788e970b-800wi.jpg" alt="" width="415" height="265" /></a></p>
<p>(<a href="http://www.nytimes.com/2009/04/14/world/africa/14sniper.html?ref=global-home">about this</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/04/pirates-killed-by-president/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Binary classification evaluation in R via ROCR</title>
		<link>http://anyall.org/blog/2009/04/binary-classification-evaluation-in-r-via-rocr/</link>
		<comments>http://anyall.org/blog/2009/04/binary-classification-evaluation-in-r-via-rocr/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 06:33:58 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=429</guid>
		<description><![CDATA[A binary classifier makes decisions with confidence levels.  Usually it&#8217;s imperfect: if you put a decision threshold anywhere, items will fall on the wrong side &#8212; errors.  I made this a diagram a while ago for Turker voting; same principle applies for any binary classifier.

So there are a zillion ways to evaluate a [...]]]></description>
			<content:encoded><![CDATA[<p>A binary classifier makes decisions with confidence levels.  Usually it&#8217;s imperfect: if you put a decision threshold anywhere, items will fall on the wrong side &#8212; errors.  I made <a href="http://blog.doloreslabs.com/2008/06/aggregate-turker-judgments-threshold-calibration/">this a diagram</a> a while ago for Turker voting; same principle applies for any binary classifier.</p>
<p><center><a  href="http://blog.doloreslabs.com/2008/06/aggregate-turker-judgments-threshold-calibration/"><img src="http://blog.doloreslabs.com/wp-content/uploads/2008/06/vertthresh.png" width="318" height="266"></a></center></p>
<p>So there are a zillion ways to evaluate a binary classifier.  Accuracy?  Accuracy on different item types (sens, spec)?  Accuracy on different classifier decisions (prec, npv)?  And worse, over the years every field has given these metrics different names.  Signal detection, bioinformatics, medicine, statistics, machine learning, and more I&#8217;m sure.  But in R, there&#8217;s the excellent <a href="http://rocr.bioinf.mpi-sb.mpg.de/">ROCR package</a> to compute and visualize all the different metrics.</p>
<p>I wanted to have a small, easy-to-use function that calls ROCR and reports the basic information I&#8217;m interested in.  For <i>preds</i>, a vector of predictions (as confidence scores), and <i>labels</i>, the true labels for the instances, it works like this:</p>
<p>&gt; <b>binary_eval(preds, labels)</b></p>
<p><a href="http://anyall.org/blog/wp-content/uploads/2009/04/picture-4.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/04/picture-4.png" alt="" title="picture-4" width="500" height="485" class="aligncenter size-full wp-image-430" /></a></p>
<p>These are four graphs showing variation of classifier performance as the cutoff changes.  <span id="more-429"></span> There&#8217;s more than one because they convey different, interesting information.  There are four because then they arrange nicely into a square.  I wanted to balance having enough information vs. simplicity.  The graphs are:</p>
<table border="1" cellspacing="0" cellpadding="3">
<tr>
<td valign="top"><b>Accuracy</b> &#8212; how raw accuracy changes with different cutoffs.  The extreme left and right sides are baselines of always guessing positive vs always guessing negative &#8212; so their accuracies are the respective population rates.  (So you can see the data set is unbalanced; 55:45 pos:neg items.)</p>
<td valign="top"><b><a href="http://en.wikipedia.org/wiki/F-score">F-score</a></b> &#8212; I want to kill this graph and replace it with something more useful.  <a href="http://en.wikipedia.org/wiki/Cohen's_kappa">Kappa</a>?  But positionally it&#8217;s nice here: cutoff on the x-axis like the topleft, but it&#8217;s a precision/recall metric so goes along with the bottomright.</p>
<tr>
<td valign="top"><b><a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC aka Sensitivity-Specificity curve</a></b> &#8212; Left side is low cutoff (aggressive), right side is high cutoff (conservative).  Dashed line is for flipping weighted coins.  &#8220;AUC&#8221; is area under curve which is what it sounds like.  Is the <a href="http://en.wikipedia.org/wiki/Gini_coefficient">Gini coefficient</a> rescaled from [-1,+1] to [0,1].  Note that unlike all the other graphs, ROC &#038; AUC are invariant across different item label proportions.  (Given a trained classifier, precision and accuracy are influenced by the balanced-ness of the items; recall and specificity are not.)</p>
<td valign="top"><b><a href="http://en.wikipedia.org/wiki/Precision_and_recall">Precision-Recall curve</a></b> &#8212; Left side is high cutoff (conservative), right side is low cutoff (aggressive).  Note this directionality is opposite of the 3 other panels.  In particular, the recall axis is different than in the ROC plot.  I debated flipping the axes, but prec vs rec is standard practice elsewhere.  The green triangle is the point of best F-score.  (Other possible points to highlight might be prec-rec break-even, or argmaxes for F2, or F0.5&#8230; all these prec-rec metrics seem like ad-hoc hacks to me.)<br />
</table>
<p>The other thing binary_eval does is show one cutoff point across all the graphs &#8212; the blue circle &#8212; and display its associated performance metrics.  So here, it used the naive cutoff (0 for real-valued predictions; 0.5 for probabilities) and plotted that point with the blue circle on all the graphs.  Plus it reports for that cutoff:</p>
<pre>
Predictions seem to be real-valued scores, so using naive cutoff 0:
Acc = 0.531
  F = 0.338
 Prec = 0.780
  Rec = 0.216
 Spec = 0.924
Balanced Acc = 0.570
</pre>
<p>So it&#8217;s very apparent that the naive cutoff isn&#8217;t performing well here.  We can call it and supply a cutoff, or perhaps better, ask it to pick a cutoff that maximizes any metric that ROCR knows about.   For example, calibration for accuracy or F-score:</p>
<table border="1" cellspacing="0" cellpadding="3">
<tr>
<td><b>binary_eval(preds, labels, &#8216;acc&#8217;)</b></p>
<td><b>binary_eval(preds, labels, &#8216;f&#8217;)</b></p>
<tr>
<td>
<a href="http://anyall.org/blog/wp-content/uploads/2009/04/acc.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/04/acc.png" alt="" title="acc" width="250" height="230" class="alignnone size-full wp-image-432" /></a></p>
<td>
<a href="http://anyall.org/blog/wp-content/uploads/2009/04/f.png"><img src="http://anyall.org/blog/wp-content/uploads/2009/04/f.png" alt="" title="f" width="250" height="230" class="alignnone size-full wp-image-433" /></a><br />
</table>
<p>I added the function to my R utilities file, here: <a href="http://github.com/brendano/dlanalysis/blob/master/util.R">dlanalysis/util.R</a>.</p>
<p>Some people in machine learning like to use standalone programs like <a href="http://kodiak.cs.cornell.edu/kddcup/software.html">PERF</a> to compute all these metrics.  R is better if you know it.  You can calculate all sorts of things with it.  Above I was using R for the evaluation of the outputs of a command-line classifier, importing them easily with <i>scan()</i> and <i>scan(pipe(&#8221;cut -f1 &lt; data.svmlight_format&#8221;))</i>.</p>
<p>The best Wikipedia page to start reading about the zillions of all of these interrelated confusion matrix metrics might be the <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">the ROC page</a>.</p>
<p>If you want to see lots and lots more evaluation metrics, look at the <a href="http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html#app-report">&#8220;extended classifier evaluation&#8221; section of LingPipe&#8217;s sentiment analysis tutorial</a>.  And once you get into multiclass and such, all complexity hell breaks loose.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/04/binary-classification-evaluation-in-r-via-rocr/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
