<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">

<channel>
	<title>AI and Social Science - Brendan O'Connor</title>
	
	<link>http://anyall.org/blog</link>
	<description>Cognition, systems, decisions, visualization, machine learning, etc.</description>
	<pubDate>Sat, 07 Nov 2009 17:19:26 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/BrendanOConnorsBlog" type="application/rss+xml" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
		<title>Seeing how “art” and “pharmaceuticals” are linguistically similar in web text</title>
		<link>http://anyall.org/blog/2009/09/seeing-how-art-and-pharmaceuticals-are-linguistically-similar-in-web-text/</link>
		<comments>http://anyall.org/blog/2009/09/seeing-how-art-and-pharmaceuticals-are-linguistically-similar-in-web-text/#comments</comments>
		<pubDate>Sat, 26 Sep 2009 02:27:49 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=745</guid>
		<description><![CDATA[Earlier this week I asked the question,
How are &#8220;art&#8221; and &#8220;pharmaceuticals&#8221; similar?
People sent me lots of submissions!  Some are great, some are a bit of a stretch.

Overpriced by an order of magnitude.
The letters of &#8220;art&#8221; are found embedded, in order, in &#8220;pharmaceuticals&#8221;.
Search keywords that cost the most to advertise on?
&#8220;Wyeth&#8221;: I think this means [...]]]></description>
			<content:encoded><![CDATA[<p>Earlier this week I asked the question,</p>
<p><center>How are &#8220;art&#8221; and &#8220;pharmaceuticals&#8221; similar?</center></p>
<p>People sent me lots of submissions!  Some are great, some are a bit of a stretch.</p>
<ul>
<li>Overpriced by an order of magnitude.
<li>The letters of &#8220;art&#8221; are found embedded, in order, in &#8220;pharmaceuticals&#8221;.
<li>Search keywords that cost the most to advertise on?
<li>&#8220;Wyeth&#8221;: I think this means <a href="http://www.wyeth.com/">this</a>, and <a href="http://www.andrewwyeth.com/">this</a>.
<li>&#8220;Romeo and Juliet&#8221; famously includes both &#8220;art&#8221; (wherefore art thou) and pharmaceuticals (poison!)
<li>Some art has been created out of pharmaceuticals.
<li>Some art has been created under the influence of pharmaceuticals.
</ul>
<p>I was asking because I was playing around with a dataset of 100,000 noun phrases&#8217; appearances on the web, from the <a href="http://rtw.ml.cmu.edu/readtheweb.html">Reading the Web</a> project at CMU.  That is, for a noun like &#8220;art&#8221;, this data has a large list of phrases in which the word &#8220;art&#8221; is used, across some 200 million web pages.  For two noun concepts, we can see what they have in common and what&#8217;s different by looking at examples of how people use them when writing.  So, for &#8220;art&#8221; versus &#8220;pharmaceuticals&#8221;:</p>
<table>
<tr>
<th>common contexts for &#8220;art&#8221; but not &#8220;pharmaceuticals&#8221; [7394&nbsp;total]
<th>common contexts for both &#8220;art&#8221; and &#8220;pharmaceuticals&#8221; [165&nbsp;total]
<th>common contexts for &#8220;pharmaceuticals&#8221; but not &#8220;art&#8221; [206&nbsp;total]</tr>
<tr>
<td valign=top align=right style="padding:1em">
&#8216;m&nbsp;into&nbsp;_<br />
&#8217;s&nbsp;interested&nbsp;in&nbsp;_<br />
A&nbsp;collection&nbsp;of&nbsp;_<br />
_&nbsp;has&nbsp;been&nbsp;described&nbsp;by<br />
structure&nbsp;of&nbsp;_<br />
study&nbsp;in&nbsp;_<br />
_&nbsp;have&nbsp;been&nbsp;shown&nbsp;in<br />
The&nbsp;knowledge&nbsp;of&nbsp;_<br />
_&nbsp;is&nbsp;a&nbsp;commodity<br />
_&nbsp;is&nbsp;a&nbsp;creation<br />
_&nbsp;is&nbsp;a&nbsp;world<br />
an&nbsp;exhibition&nbsp;of&nbsp;_<br />
the&nbsp;commercialization&nbsp;of&nbsp;_<br />
the&nbsp;confinement&nbsp;of&nbsp;_<br />
_&nbsp;is&nbsp;cast&nbsp;in</p>
<td valign=top align=right style="padding:1em">
areas&nbsp;such&nbsp;as&nbsp;_<br />
prices&nbsp;of&nbsp;_<br />
storage&nbsp;of&nbsp;_<br />
producers&nbsp;of&nbsp;_<br />
_&nbsp;designed&nbsp;for<br />
the&nbsp;provision&nbsp;of&nbsp;_<br />
_&nbsp;sold&nbsp;in<br />
the&nbsp;same&nbsp;way&nbsp;as&nbsp;_<br />
_&nbsp;are&nbsp;among<br />
The&nbsp;production&nbsp;of&nbsp;_<br />
the&nbsp;analysis&nbsp;of&nbsp;_<br />
advances&nbsp;in&nbsp;_<br />
specialising&nbsp;in&nbsp;_<br />
a&nbsp;career&nbsp;in&nbsp;_<br />
_&nbsp;stolen&nbsp;from</p>
<td valign=top align=right style="padding:1em">
a&nbsp;greater&nbsp;amount&nbsp;of&nbsp;_<br />
standards&nbsp;for&nbsp;_<br />
marketer&nbsp;of&nbsp;_<br />
market&nbsp;for&nbsp;_<br />
prescriptions&nbsp;for&nbsp;_<br />
the&nbsp;supply&nbsp;of&nbsp;_<br />
the&nbsp;availability&nbsp;of&nbsp;_<br />
advertising&nbsp;for&nbsp;_<br />
the&nbsp;appropriate&nbsp;use&nbsp;of&nbsp;_<br />
shipment&nbsp;of&nbsp;_<br />
a&nbsp;cocktail&nbsp;of&nbsp;_<br />
classes&nbsp;of&nbsp;_<br />
a&nbsp;complete&nbsp;inventory&nbsp;of&nbsp;_<br />
_&nbsp;related&nbsp;downloads<br />
new&nbsp;generations&nbsp;of&nbsp;_<br />
</table>
<p>The middle column, showing ways in which people talk about both &#8220;art&#8221; and &#8220;pharmaceuticals&#8221;, makes it pretty clear.  What they have in common is that they&#8217;re both products: you can buy, sell, produce, and store them.  (There&#8217;s also an intellectual goods aspect: they both can be stolen.)  This really didn&#8217;t occur to me at first; silly me, I thought art was a thing of beauty removed from such mundane considerations.  A number of the submitted answers, though, center around the theme of them both being expensive &#8212; so we have positive agreement between <a href="http://en.wikipedia.org/wiki/Corpus_linguistics">corpus statistics</a> and human judgments!</p>
<p>Examining massive numbers of contexts like this follows what the infinitely wise Dinosaur Comics calls <a href="http://www.qwantz.com/index.php?comic=1541">&#8220;a statistically-based descriptivist approach to semantics.&#8221;</a>  Or as linguist <a href="http://en.wikipedia.org/wiki/John_Rupert_Firth">J.R. Firth</a> put it, &#8220;You shall know a word by the company it keeps.&#8221;  Many subtleties of the two concepts can be seen just in their context lists.  For example, in the left column, we see that only art &#8220;is a commodity&#8221;.  Well, certainly pharmaceuticals are a commodity too.  But that&#8217;s so obvious it&#8217;s not worth saying.  Proclaiming that &#8220;art is a commodity&#8221;, however, is interesting.  Maybe we think about this (possible) fact more.</p>
<p>As for the data: it comes from 200 million web pages (500 million sentences), and is filtered to contexts that appear more than five hundred times in the data.  It was collected as part of a research project that seeks to extract a database of knowledge from this information &#8212; <a href="http://rtw.ml.cmu.edu/readtheweb.html">&#8220;reading the web&#8221;</a>.  (Yes, <a href="http://hadoop.apache.org/">Hadoop</a> was involved.)  To make the table, I took the contexts&#8217; set differences and intersection and showed a random subsample from each.</p>
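<p>For the curious, the table-making step can be sketched in a few lines of Python.  This is a minimal illustration rather than the code used for this post: the function name <i>compare_contexts</i>, the sample size, the fixed seed, and the toy context sets are all assumptions made up for the example.</p>
<pre>
import random

def compare_contexts(contexts_a, contexts_b, sample_size=15, seed=0):
    """Split two context sets into only-A, shared, and only-B, then subsample each."""
    a, b = set(contexts_a), set(contexts_b)
    groups = {
        "only_a": a - b,             # contexts seen with A but not B
        "both": a.intersection(b),   # contexts seen with both nouns
        "only_b": b - a,             # contexts seen with B but not A
    }
    rng = random.Random(seed)        # fixed seed (an assumption) so reruns agree
    samples = {}
    for name, group in groups.items():
        ordered = sorted(group)      # sort first so the sample is reproducible
        k = min(sample_size, len(ordered))
        samples[name] = rng.sample(ordered, k)
    return groups, samples

# Toy data, invented for illustration only
art = {"an exhibition of _", "prices of _", "_ sold in"}
pharma = {"prescriptions for _", "prices of _", "_ sold in"}
groups, samples = compare_contexts(art, pharma, sample_size=2)
print(len(groups["only_a"]), len(groups["both"]), len(groups["only_b"]))  # prints: 1 2 1
</pre>
<p>The bracketed totals in the table headers correspond to the sizes of the three groups, and each column corresponds to one of the samples.</p>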
<p>A final note.  <a href="http://willwhim.wordpress.com/">Will</a> pointed out that in Alice in Wonderland, the Mad Hatter asks, <a href="http://www.straightdope.com/columns/read/1173/why-is-a-raven-like-a-writing-desk">&#8220;Why is a raven like a writing desk?&#8221;</a>  I tried that query on this data, but unfortunately, it didn&#8217;t contain many instances of &#8220;raven&#8221;.  However, it <i>does</i> include a proper name &#8220;Raven&#8221; &#8212; which turns out to be an <a href="http://www.animevice.com/raven/18-23399/">anime character</a>.  Not the first time I&#8217;ve seen the Internet&#8217;s massive amount of anime knowledge get in the way of a very serious semantic extraction system!</p>
<p>Many thanks to <a href="http://tr.ashcan.org/">Adam</a>, Joanna, <a href="http://willwhim.wordpress.com/">Will</a>, <a href="http://www.umiacs.umd.edu/~vikas/">Vikas</a>, and Michael for the submitted answers.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/09/seeing-how-art-and-pharmaceuticals-are-linguistically-similar-in-web-text/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Quiz: “art” and “pharmaceuticals”</title>
		<link>http://anyall.org/blog/2009/09/quiz-art-and-pharmaceuticals/</link>
		<comments>http://anyall.org/blog/2009/09/quiz-art-and-pharmaceuticals/#comments</comments>
		<pubDate>Sun, 20 Sep 2009 13:24:00 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=740</guid>
		<description><![CDATA[A lexical semantics question:
How are &#8220;art&#8221; and &#8220;pharmaceuticals&#8221; similar?
I have a data-driven answer, but am curious how easy it is to guess it, and in what sense it&#8217;s valid.  I&#8217;ll post my answer and supporting evidence on Tuesday.
]]></description>
			<content:encoded><![CDATA[<p>A lexical semantics question:</p>
<p><center>How are &#8220;art&#8221; and &#8220;pharmaceuticals&#8221; similar?</center></p>
<p>I have a data-driven answer, but am curious how easy it is to guess it, and in what sense it&#8217;s valid.  I&#8217;ll post my answer and supporting evidence on Tuesday.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/09/quiz-art-and-pharmaceuticals/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Don’t MAWK AWK - the fastest and most elegant big data munging language!</title>
		<link>http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/</link>
		<comments>http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/#comments</comments>
		<pubDate>Thu, 10 Sep 2009 04:17:38 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=652</guid>
		<description><![CDATA[When one of these newfangled &#8220;Big Data&#8221; sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like.  Once you&#8217;re dealing with hundreds of megabytes of data, even simple operations can take plenty of time.
For one recent ad-hoc task I had [...]]]></description>
			<content:encoded><![CDATA[<p>When one of these newfangled <a href="http://dataspora.com/blog/tipping-points-and-big-data/">&#8220;Big Data&#8221;</a> sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like.  Once you&#8217;re dealing with hundreds of megabytes of data, even simple operations can take plenty of time.</p>
<p>For one recent ad-hoc task I had &#8212; reformatting 1GB of textual feature data into a form Matlab and R can read &#8212;  I tried writing implementations in several languages, with help from my classmate <a href="http://www.cs.cmu.edu/~elijah/">Elijah</a>.  The results really surprised us:</p>
<table border="1" cellspacing="0" cellpadding="10px" width="100%">
<tr>
<th>&nbsp;Language&nbsp;
<th>&nbsp;Time (min:sec)&nbsp;
<th> Speed (vs. gawk)
<th>Lines of code
<th>Notes
<th>Type</tr>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/2num.awk">mawk</a>
<td align=center>1:06
<td align=center>7.8x
<td align=center>3
<td><a href="http://invisible-island.net/mawk/mawk.html">Mike Brennan&#8217;s Awk</a>, system default on Ubuntu/Debian Linux.
<td>VM</p>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/Formatter.java">java</a>
<td align=center>1:20
<td align=center>6.4x
<td align=center>32
<td>version 1.6 (-server didn&#8217;t matter)
<td>VM+<a href="http://en.wikipedia.org/wiki/Just-in-time_compilation">JIT</a></p>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/2num_c.cc">c-ish c++</a>
<td align=center>1:35
<td align=center>5.4x
<td align=center>42
<td>g++ 4.0.1 with -O3, using stdio.h
<td>Native</p>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/2num.py">python</a>
<td align=center>2:15
<td align=center>3.8x
<td align=center>20
<td>version 2.5, system default on OSX 10.5
<td>VM</p>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/2num.pl">perl</a>
<td align=center>3:00
<td align=center>2.9x
<td align=center>17
<td>version 5.8.8, system default on OSX 10.5
<td>VM</p>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/2num.awk">nawk</a>
<td align=center>6:10
<td align=center>1.4x
<td align=center>3
<td><a href="http://www.cs.princeton.edu/~bwk/">Brian Kernighan</a>&#8217;s <a href="http://www.cs.bell-labs.com/cm/cs/awkbook/">&#8220;One True Awk&#8221;</a>, system default on OSX, *BSD
<td>?</p>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/2num.cc">c++</a>
<td align=center>6:50
<td align=center>1.3x
<td align=center>48
<td>g++ 4.0.1 with -O3, using fstream, stringstream
<td>Native</p>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/2num.rb">ruby</a>
<td align=center>7:30
<td align=center>1.1x
<td align=center>22
<td>version 1.8.4, system default on OSX 10.5; also tried 1.9, but was slower
<td>Interpreted</p>
<tr>
<td><a href="http://github.com/brendano/awkspeed/blob/master/2num.awk">gawk</a>
<td align=center>8:35
<td align=center>1x
<td align=center>3
<td><a href="http://www.gnu.org/software/gawk/">GNU Awk</a>, system default on RedHat/Fedora Linux
<td>Interpreted<br />
</table>
<p>To be clear, <span id="more-652"></span> the problem is to take several files of (item name, feature name, value) triples, like:</p>
<pre>
000794107-10-K-19960401 limited 1
000794107-10-K-19960401 colleges 1
000794107-10-K-19960401 code 2
...
004334108-10-K-19961230 recognition 1
004334108-10-K-19961230 gross 8
...</pre>
<p>And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples.  Items should count up from inside each file; but features should be shared across files, so they need a shared counter.  Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.</p>
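<p>To make the spec concrete, here is a rough Python sketch of the renumbering logic.  It is an illustration under assumptions, not the 2num.py from the repository: the function name <i>renumber</i> is hypothetical, IDs start at 1 (on the assumption that Matlab and R want 1-based indices), input is taken as lists of lines rather than real files, and the sample lines above are split into two pretend files.</p>
<pre>
def renumber(files):
    """files: one iterable of "item feature value" lines per input file.
    Returns (list of (i, j, value) triples per file, feature names by ID)."""
    feature_ids = {}      # shared across all files
    feature_names = []    # feature_names[j - 1] is the name of feature ID j
    outputs = []
    for lines in files:
        item_ids = {}     # item numbering restarts inside each file
        triples = []
        for line in lines:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            item, feature, value = parts
            if item not in item_ids:
                item_ids[item] = len(item_ids) + 1
            if feature not in feature_ids:
                feature_ids[feature] = len(feature_ids) + 1
                feature_names.append(feature)
            triples.append((item_ids[item], feature_ids[feature], value))
        outputs.append(triples)
    return outputs, feature_names

# The sample lines from above, split into two pretend files for illustration
file1 = ["000794107-10-K-19960401 limited 1",
         "000794107-10-K-19960401 colleges 1",
         "000794107-10-K-19960401 code 2"]
file2 = ["004334108-10-K-19961230 recognition 1",
         "004334108-10-K-19961230 gross 8"]
outputs, feature_names = renumber([file1, file2])
print(outputs[1])      # prints: [(1, 4, '1'), (1, 5, '8')]
print(feature_names)   # prints: ['limited', 'colleges', 'code', 'recognition', 'gross']
</pre>
<p>Note how the item ID goes back to 1 in the second file while the feature IDs keep counting.</p>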
<p>This task is simple, but it&#8217;s representative of many data munging tasks out there.  It inputs and outputs textual data.  It&#8217;s probably one-off.  The algorithm is easy &#8212; especially since it&#8217;s a subtask of something larger &#8212; but still complex enough you&#8217;ll need a debug cycle or two.  You want to get it done fast so you can get on to the real work.  Complex programming tools, like debuggers, are of little use &#8212; you figure out what&#8217;s going on by inspecting the output.  Complex data processing environments, like Hadoop or an RDBMS, are also of little use &#8212; you have to munge in the first place to load data into them.</p>
<p>It turns out, this is a task <a href="http://en.wikipedia.org/wiki/AWK">AWK</a> was made for.  It&#8217;s a language dating from the original Bell Labs Unix era &#8212; circa 1977 &#8212; and it&#8217;s extremely specialized for processing delimited text files in a single pass.  Perl was created in part to supersede it, but for this core use case, Awk is much more elegant and clearer.  <a href="http://github.com/brendano/awkspeed/blob/master/2num.awk">The implementation here</a> is only 8 lines of code, expanded from <a href="http://github.com/brendano/awkspeed/blob/master/2num_3line.awk">merely 3</a> when I first wrote it.</p>
<p>Since it&#8217;s a standardized language, many implementations exist.  One of them, <a href="http://invisible-island.net/mawk/mawk.html">MAWK</a>, is incredibly efficient.  It outperforms <b><i>all</i></b> other languages, including statically typed compiled ones like Java and C++!  It wins on <i>both</i> LOC and performance criteria &#8212; a rare feat indeed, transcending the usual competition of slow-but-easy scripting languages versus fast-but-hard compiled languages.</p>
<p>[There's another big pro-VM story here: Java beat C++, both in LOC/ease-of-programming as well as performance.  C++ was, as usual, a total nightmare to write.  What you don't see in the LOC numbers is the sheer amount of time spent googling through every weird issue.  For example, apparently g++ <a href="http://www.gamedev.net/community/forums/topic.asp?topic_id=119766">requires you to define the hash function</a> in order to use a hash_map of string keys.  Or, there are a zillion different ways to <a href="http://stackoverflow.com/questions/236129/c-how-to-split-a-string">split a string</a>, none of them standard.  Then there are 2 different I/O and 2 different string libraries given its C heritage, and if you make the wrong choices, performance is terrible.  I'm totally sure that given some more rewriting, the C++ implementation can be made the fastest.  It's just a question of how much pain you go through to find the right rewrites.  All the other implementations were written in the most straightforward, simplest way possible.  C++ abjectly fails the "get it done quick" criterion.]</p>
<p>My most pleasant surprise learning Awk was its shortcuts for reading and writing files.  Like shell, there is no concept of a file handle &#8212; you don&#8217;t open and close files, you just specify the filename and the VM figures it out.  Even Perl, king of syntactic shorthand, doesn&#8217;t have this useful feature; and even Ruby, with its elegant <a href="http://www.ruby-doc.org/core/classes/Kernel.html#M005950">open()-block-cleanup</a> construct, is clunkier.  It sounds minor, but this eliminates a number of bugs you can make in scripts.</p>
<p>But what I most appreciate about Awk is <a href="http://www.vectorsite.net/tsawk_1.html">the discourse structure of its programs</a>: every clause is a potentially conditional action to be performed with the current record.  If you want actions to be taken upon program start or exit, you declare special clauses for that.  Awk manages all these features while staying incredibly small and simple &#8212; the advantages of being a <a href="http://en.wikipedia.org/wiki/Domain-specific_language">domain-specific language</a>.  I think it feels a little more like a super-flexible, index-challenged version of SQL than it does a standard scripting language.  I suspect Awk&#8217;s simplicity and specialization are part of why Mike Brennan was able to make Mawk so insanely fast.  If your only datatypes are strings and hashes, then compile-time type inference is pretty easy.</p>
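<p>For readers who have not seen Awk, the clause structure can be imitated in Python.  This is only a loose model, and every name in it (<i>run_clauses</i>, <i>start</i>, <i>add_value</i>, <i>count_row</i>) is hypothetical.  The toy program below corresponds roughly to the Awk program BEGIN { total = 0; rows = 0 } NF == 3 { total += $3 } { rows++ }.</p>
<pre>
def run_clauses(records, clauses, begin=None, end=None):
    """Apply every (condition, action) clause to every record, in order.
    A condition of None means the action runs for every record."""
    state = {}
    if begin is not None:
        begin(state)                     # like an Awk BEGIN clause
    for line in records:
        fields = line.split()            # whitespace-delimited fields
        for condition, action in clauses:
            if condition is None or condition(fields):
                action(fields, state)
    if end is not None:
        end(state)                       # like an Awk END clause
    return state

def start(state):
    state["total"] = 0
    state["rows"] = 0

def add_value(fields, state):
    state["total"] += int(fields[2])

def count_row(fields, state):
    state["rows"] += 1

clauses = [
    (lambda fields: len(fields) == 3, add_value),  # conditional clause
    (None, count_row),                             # unconditional clause
]
state = run_clauses(["a limited 1", "a code 2", "malformed"], clauses, begin=start)
print(state["total"], state["rows"])  # prints: 3 3
</pre>
<p>Everything outside the clause list is boilerplate that Awk supplies for free, which is much of why the Awk version is so short.</p>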
<p>Awk is also well-suited to the <a href="http://www.lexemetech.com/2008/03/disks-have-become-tapes.html">&#8220;Disk is the New Tape&#8221;</a> era.  That is, right now, hard drive sizes are rapidly growing &#8212;  allowing very large datasets &#8212; but random access seek times aren&#8217;t catching up.  In this setting, the only way to process data will be via linear scans, accessing one item of data at a time.  (E.g. <a href="http://www.johndcook.com/standard_deviation.html">running variance</a>, <a href="http://hunch.net/?p=277">online learning</a>, <a href="http://databasecolumn.vertica.com/2007/09/disk-trends.html">column stores</a>, etc.)   This is the core philosophy behind <a href="http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/oscon-part-1.pdf">Hadoop&#8217;s computation model</a> &#8212; and Awk&#8217;s.  If hard drives are like tape drives, then it&#8217;s worth looking into other blast-from-the-past technologies!  (Similar point <a href="http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/#comment-5714">about SAS</a>, in fact.)</p>
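<p>As a small example of the linear-scan style, here is a sketch of running variance in Python using the Welford update, which sees each value once and keeps only three numbers in memory.  The function name <i>running_variance</i> and the toy input are assumptions for illustration; the input is wrapped in an iterator to stand in for a stream that can only be read once.</p>
<pre>
def running_variance(values):
    """One-pass count, mean, and sample variance (Welford update)."""
    count = 0
    mean = 0.0
    m2 = 0.0    # running sum of squared deviations from the mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count == 0:
        return 0, None, None
    if count == 1:
        return count, mean, None    # variance is undefined for one value
    return count, mean, m2 / (count - 1)

# Toy stream, invented for illustration
count, mean, variance = running_variance(iter([2, 4, 4, 4, 5, 5, 7, 9]))
print(count, round(mean, 6), round(variance, 6))
</pre>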
<p>There are many Awk tutorials on the web.  <a href="http://www.vectorsite.net/tsawk_1.html">This one</a> is decent, though I strongly recommend Ken Church&#8217;s classic tutorial <a href="http://people.sslmit.unibo.it/~baroni/compling04/UnixforPoets.pdf">Unix for Poets</a>.  It shows how to do all sorts of great things with Unix text processing tools, including Awk.</p>
<p>All the code, results, and data can be obtained at <a href="http://github.com/brendano/awkspeed">github.com/brendano/awkspeed</a>.  I&#8217;d love to see results for more languages.  And I hope someday someone tries writing an <a href="http://llvm.org/">LLVM</a> Awk &#8212; will it be even faster?</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Patches to Rainbow, the old text classifier that won’t go away</title>
		<link>http://anyall.org/blog/2009/09/patches-to-rainbow-the-old-text-classifier-that-wont-go-away/</link>
		<comments>http://anyall.org/blog/2009/09/patches-to-rainbow-the-old-text-classifier-that-wont-go-away/#comments</comments>
		<pubDate>Tue, 08 Sep 2009 18:45:09 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=702</guid>
		<description><![CDATA[I&#8217;ve been reading several somewhat recent finance papers (Antweiler and Frank 2005, Das and Chen 2007) that use Rainbow, the text classification software originally written by Andrew McCallum back in 1996.  The last version is from 2002 and the homepage announces he isn&#8217;t really supporting it any more.
However, as far as I can tell, [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been reading several somewhat recent finance papers (<a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=878091">Antweiler and Frank 2005</a>, <a href="http://mansci.journal.informs.org/cgi/content/abstract/53/9/1375">Das and Chen 2007</a>) that use <a href="http://www.cs.cmu.edu/~mccallum/bow/">Rainbow</a>, the text classification software originally written by <a href="http://www.cs.umass.edu/~mccallum/">Andrew McCallum</a> back in 1996.  The last version is from 2002 and the homepage announces he isn&#8217;t really supporting it any more.</p>
<p>However, as far as I can tell, it might still be the easiest-to-use text classifier package out there.  You don&#8217;t have to program &#8212; just invoke commandline arguments &#8212; and it can accommodate reasonably sized datasets, does tokenization, stopword filtering, etc. for you, and has some useful feature selection and other options.  Based on my limited usage, it seems well-implemented.  If anyone knows of a better one I&#8217;d love to hear it.  I once looked at, among other things, <a href="http://gate.ac.uk/">GATE</a> and <a href="http://incubator.apache.org/uima/">UIMA</a>, and they seemed too hard to use if you wanted to download something that did simple text classification; or else, maybe they didn&#8217;t have documentation on how to use them in that manner.    <a href="http://www.cs.cmu.edu/~mccallum/bow/rainbow/">Rainbow does</a>.  If I had to recommend a text classifier to a social scientist today, I might say they should use Rainbow.</p>
<p>(GATE and UIMA call themselves &#8220;architectures&#8221;.  I usually don&#8217;t want an architecture, I want a program that does stuff.  <a href="http://alias-i.com/lingpipe/">LingPipe</a> was the only other system I found that had <a href="http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html">good web documentation</a> saying how to use it to do text classification.  It looks like a good option, if you&#8217;re willing to write some code.  There are numerous academic efforts to make automated content analysis systems that at a high level sound like the right sort of thing, but nearly all of them have poor web docs so it&#8217;s hard to tell whether they do what you want.)</p>
<p>In the meantime, the current Rainbow download has issues compiling on modern GCC and Mac OSX &#8212; some issues <a href="http://fugutabetai.com/?postid=170">documented here</a>.  I worked through them and put my patched version (only tested on GCC 4.0, OSX 10.5) up here: <a href="http://github.com/brendano/bow/">github.com/brendano/bow</a></p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/09/patches-to-rainbow-the-old-text-classifier-that-wont-go-away/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Another R flashmob today</title>
		<link>http://anyall.org/blog/2009/09/another-r-flashmob-today/</link>
		<comments>http://anyall.org/blog/2009/09/another-r-flashmob-today/#comments</comments>
		<pubDate>Tue, 08 Sep 2009 13:14:17 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=699</guid>
		<description><![CDATA[Dan Goldstein sends word they&#8217;re doing another Stackoverflow R flashmob today.  It&#8217;s a neat trick.  The R tag there is becoming pretty useful.
]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dangoldstein.com/">Dan Goldstein</a> sends word they&#8217;re doing <a href="http://www.decisionsciencenews.com/?p=1042">another Stackoverflow R flashmob today</a>.  It&#8217;s a neat trick.  The <a href="http://stackoverflow.com/questions/tagged/r">R tag there</a> is becoming pretty useful.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/09/another-r-flashmob-today/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Beautiful Data book chapter</title>
		<link>http://anyall.org/blog/2009/08/beautiful-data-book-chapter/</link>
		<comments>http://anyall.org/blog/2009/08/beautiful-data-book-chapter/#comments</comments>
		<pubDate>Wed, 12 Aug 2009 22:14:47 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Best Posts]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=634</guid>
		<description><![CDATA[ Today I received my copy of Beautiful Data, a just-released anthology of articles about, well, working with data.  Lukas and I contributed a chapter on analyzing social perceptions in web data.  See it here. After a long process of drafting, proofreading, re-drafting, and bothering the publishers under rather sudden deadlines, I&#8217;ve resolved to never [...]]]></description>
			<content:encoded><![CDATA[<p><img alt="" src="http://assets.doloreslabs.com/blog/beautiful-data.gif" class="alignright" width="85" height="112" /> Today I received my copy of <a href="http://oreilly.com/catalog/9780596157111/">Beautiful Data</a>, a just-released anthology of articles about, well, working with data.  <a href="http://blog.doloreslabs.com/2009/08/beautiful-data/">Lukas</a> and I contributed a chapter on analyzing social perceptions in web data.  <a href="http://anyall.org/bd">See it here.</a> After a long process of drafting, proofreading, re-drafting, and bothering the publishers under rather sudden deadlines, I&#8217;ve resolved to never use graphics again in anything I write :)</p>
<p>Here&#8217;s our final figure, a <a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means</a> clustering of face photos via perceived social attributes (social <a href="http://en.wikipedia.org/wiki/Concept_learning">concepts/types</a>? with <a href="http://en.wikipedia.org/wiki/Prototype_theory">exemplars</a>?):<br />
<a href="http://anyall.org/cluster_table.png"><img src="http://anyall.org/cluster_table.png" alt="" title="cluster_table" width="500" height="593" class="aligncenter size-full wp-image-637" /></a></p>
<p>I just started reading the rest of the book and it&#8217;s very fun.  <a href="http://norvig.com/">Peter Norvig</a>&#8217;s chapter on language models is gripping.  (It does word segmentation, ciphers, and more, in that lovely python-centric tutorial style extending his previous <a href="http://norvig.com/spell-correct.html">spell correction article</a>.)  There are also chapters by many other great researchers and practitioners (some of whom you may have seen around this blog or its neighborhood) like <a href="http://www.stat.columbia.edu/~gelman/">Andrew Gelman</a>, <a href="http://had.co.nz/">Hadley Wickham</a>, <a href="http://mike.teczno.com/">Michal Migurski</a>, <a href="http://jheer.org/">Jeffrey Heer</a>, and still more&#8230;  I&#8217;m impressed just by the talent-gathering-and-organizing operation.  Big kudos to editors <a href="http://kiwitobes.com/">Toby Segaran</a> and <a href="http://www.linkedin.com/in/jhammerb">Jeff Hammerbacher</a>, and O&#8217;Reilly&#8217;s <a href="http://twitter.com/jsteeleeditor">Julie Steele</a>.</p>
<p>I also have an apparently secret code that gets you a discount, so email me if you want it.  I wonder if I&#8217;m not supposed to give out many of them.  Hm.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/08/beautiful-data-book-chapter/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Haghighi and Klein (2009): Simple Coreference Resolution with Rich Syntactic and Semantic Features</title>
		<link>http://anyall.org/blog/2009/08/haghighi-and-klein-2009-simple-coreference-resolution-with-rich-syntactic-and-semantic-features/</link>
		<comments>http://anyall.org/blog/2009/08/haghighi-and-klein-2009-simple-coreference-resolution-with-rich-syntactic-and-semantic-features/#comments</comments>
		<pubDate>Sat, 08 Aug 2009 22:42:23 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=618</guid>
		<description><![CDATA[I haven&#8217;t done a paper review on this blog for a while, so here we go.
Coreference resolution is an interesting NLP problem.  (Examples.)  It involves honest-to-goodness syntactic, semantic, and discourse phenomena, but still seems like a real cognitive task that humans have to solve when reading text [1].  I haven&#8217;t read the whole literature, but I&#8217;ve always [...]]]></description>
			<content:encoded><![CDATA[<p>I haven&#8217;t done a paper review on this blog for a while, so here we go.</p>
<p><a href="http://en.wikipedia.org/wiki/Coreference">Coreference</a> resolution is an interesting NLP problem.  (<a href="http://anyall.org/files/coref_example2.pdf">Examples.</a>)  It involves honest-to-goodness syntactic, semantic, and discourse phenomena, but still seems like a real cognitive task that humans have to solve when reading text [1].  I haven&#8217;t read the whole literature, but I&#8217;ve always been puzzled by the crop of papers on it I&#8217;ve seen in the last year or two.  There&#8217;s a big focus on fancy graph/probabilistic/constrained optimization algorithms, but often these papers gloss over the linguistic features &#8212; the core information they actually make their decisions with [2].  I never understood why the latter isn&#8217;t the most important issue.  Therefore, it was a joy to read</p>
<ul>
<li>Aria Haghighi and Dan Klein, EMNLP-2009.  <a href="http://www.aclweb.org/anthology/D/D09/D09-1120.pdf">&#8220;Simple Coreference Resolution with Rich Syntactic and Semantic Features.&#8221;</a></li>
</ul>
<p>They describe a simple, essentially non-statistical system that outperforms previous unsupervised systems, and compares favorably to supervised ones, by using smart features.  It has two-ish modular components:</p>
<ul>
<li>Syntactic constraints: entity type agreement, appositives, and a few other things that get at syntactic salience.</li>
<li>Semantic filter:  non-pronoun mentions must be corpus-pattern-compatible with their antecedents; described below.</li>
</ul>
<p>For each mention, these constraints filter previous mentions to several possible antecedents.  If there&#8217;s more than one, the system picks the closest.  Entity clusters are formed in the simplest (dumbest) way possible, by taking the transitive closure of these pairwise mention-antecedent decisions.</p>
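<p>The decision procedure above is small enough to sketch.  This is only an illustration, not the authors&#8217; code: <code>resolve</code> and <code>compatible</code> are hypothetical names, <code>compatible(mention, candidate)</code> stands in for the whole syntactic and semantic filter, and "closest" is assumed to mean closest in linear document order.  A union-find structure gives the transitive closure of the pairwise links.</p>

```python
def resolve(mentions, compatible):
    """Cluster mentions (given in document order) by closest compatible antecedent.

    `compatible(mention, candidate)` is a hypothetical stand-in for the
    constraints; it returns True when `candidate` survives the filter.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, mention in enumerate(mentions):
        # Scan backwards so the first survivor is the closest antecedent.
        for j in range(i - 1, -1, -1):
            if compatible(mention, mentions[j]):
                parent[find(i)] = find(j)  # merging roots = transitive closure
                break

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())
```

<p>Each mention links to at most one antecedent, so all of the modelling effort sits inside the filter; that is the part the paper spends its time on.</p>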
<p>The lexical semantic filter is interesting.  They found that syntactic cues have recall issues for non-pronoun references, e.g. <em>the company</em> referring to <em>AOL</em>.  You need to know that these two words tend to be compatible with each other.  They create a very specific lexical resource, a list of word pairs that can plausibly corefer, by finding coreferent expressions via bootstrapping in a large unlabelled text corpus (Wikipedia abstracts and newswire articles.  But they say only 25k Wiki abstracts?  There are &gt;10 million total; how were they selected?).</p>
<p>Using a parsed corpus, they seeded with appositive and predicate-nominative patterns: I&#8217;m guessing, something like &#8220;Al Gore, vice-president&#8230;&#8221; and &#8220;Al Gore was vice-president&#8221;.  Then they extracted connecting paths on those pairs.  E.g., the text &#8220;Al Gore served as the vice-president&#8221; then yields the path-pattern &#8220;X served as Y&#8221;.  Then there&#8217;s one more iteration to extract more word pairs &#8212; pairs that appear a large number of times.</p>
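<p>Here is a toy version of that bootstrapping loop, as I understand it.  The big assumption: the paper works with syntactic paths in parsed text, while this sketch uses plain surface strings of the form "X connector Y".  All function names and the <code>min_count</code> threshold are made up for illustration.</p>

```python
from collections import Counter


def extract_pairs(sentences, connectors):
    """Count (X, Y) pairs in toy sentences of the form 'X<connector>Y'."""
    pairs = Counter()
    for sentence in sentences:
        for connector in connectors:
            left, found, right = sentence.partition(connector)
            if found and left and right:
                pairs[(left, right)] += 1
    return pairs


def extract_connectors(sentences, pairs):
    """Collect the text linking X to Y wherever a known pair brackets a sentence."""
    connectors = Counter()
    for sentence in sentences:
        for x, y in pairs:
            if sentence.startswith(x) and sentence.endswith(y):
                middle = sentence[len(x):len(sentence) - len(y)]
                if middle.strip():
                    connectors[middle] += 1
    return connectors


def bootstrap(sentences, seed_connectors, min_count=2):
    """Seed patterns, then learned patterns, then frequent pairs (one round)."""
    seeds = extract_pairs(sentences, seed_connectors)
    learned = extract_connectors(sentences, seeds)
    all_connectors = set(seed_connectors) | set(learned)
    counts = extract_pairs(sentences, all_connectors)
    # Keep only pairs seen often enough, mimicking the frequency cutoff.
    return {pair for pair, n in counts.items() if n >= min_count}
```

<p>Seeding with the connector <code>" was "</code> on sentences like "Al Gore was vice-president" and "Al Gore served as vice-president" lets the second sentence contribute the new connector <code>" served as "</code>, which can then pick up pairs the seed pattern never matched.</p>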
<p>They cite <a href="http://people.ischool.berkeley.edu/~hearst/papers/coling92.pdf">Hearst 1992</a>, <a href="http://ai.stanford.edu/~rion/papers/hypernym_nips05.pdf">Snow et al 2005</a>, and <a href="http://www.cs.utah.edu/~riloff/pdfs/ranlp07.pdf">Phillips and Riloff 2007</a>.  But note, they don&#8217;t describe their technique as trying to be a general hypernym finder; rather, its only goal is to find pairs of words (noun phrase heads) that might corefer in text that the final system encounters.  In fact, they describe the bootstrapping system as merely trying to find instances of noun phrase pairs that exhibit coreference.  I wonder if it&#8217;s fair to think of the bootstrapping system as a miniature coreference detector itself, but tuned for high precision by considering only very short syntactic paths (no longer than one sentence).  I also wonder if there are instances of non-pronoun coreference that <em>aren&#8217;t</em> hyponym-hypernym pairs; if not, my analysis here is silly.</p>
<p>Coverage seems to be good, or at least useful: many non-pronoun coreference recall errors are solved by using this data.  (I can&#8217;t tell whether it&#8217;s two-thirds of all recall errors after the syntactic system, or two-thirds of the errors in Table 1.)  And they claim word pairs are usually correct, with a few interesting types of errors (that look quite solvable).  As for coverage, I wonder if they tried WordNet hypernym/synonym information, and whether it was useful.  My bet is that WordNet&#8217;s coverage here would be significantly poorer than a bootstrapped system.</p>
<p>This paper was fun to read because it&#8217;s written very differently from the usual NLP paper.  Instead of presenting a slew of modelling, it cuts to the chase, using very simple algorithms and clear argumentation to illustrate why a particular set of approaches is effective.  There&#8217;s lots of error analysis motivating design decisions, as well as suggesting conclusions for future work.  In particular, they think discourse, pragmatics, and salience aren&#8217;t the most important issues; instead, better syntactic and semantic modelling would give the biggest gains.</p>
<p>There&#8217;s also something very nice about reading a paper that doesn&#8217;t have a single equation yet makes a point, and is easy to implement yourself to boot.  I think the machine learning approach to NLP research can really hurt insight.  Every paper is obsessed with held-out predictive accuracy.  If you&#8217;re lucky, a paper will list out all the features they used, then (only sometimes!) they make a cursory attempt at finding which features were important.  A simple hand-coded system lends itself to easily describing and motivating every feature by itself &#8212; better narrative explanations and insight.  Which type of research is more useful as science?</p>
<p>Final note: it&#8217;s not totally fair to consider this system a non-statistical one, because its syntactic and semantic subsystems rest on complicated statistical components that required boatloads of labelled training data &#8212; the <a href="http://nlp.stanford.edu/software/lex-parser.shtml">Stanford parser</a>, <a href="http://nlp.stanford.edu/software/CRF-NER.shtml">Stanford NER</a>, and the <a href="ftp://ftp.cs.brown.edu/pub/nlparser/">Charniak parser</a>.  (I wonder how sensitive performance is to these components.  Could rule-based parsing and NER work as well?)  Further, as they point out, more complicated structured approaches to the problem of forming entity partitions from features should improve performance.  (But how much?)</p>
<p>[1] As opposed to, say, the rarified activity of <a href="http://www.cis.upenn.edu/~treebank/">treebank</a>ing, a <a href="ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/root.ps.gz">318-page-complex</a> linguistic behavior that maybe several dozen people on Earth are capable of executing.  There&#8217;s a whole other rant here, on the topic of the behavioral reality of various linguistic constructs.  (Treebank parsers were extensively used in this paper, so maybe I shouldn&#8217;t hate too much&#8230;)</p>
<p>[2] There are certainly exceptions to this like <a href="file:///Users/brendano/Desktop/acad%20soup/emnlp08/pdf/EMNLP031.pdf">Bengtson and Roth</a>, or maybe <a href="file:///Users/brendano/Desktop/acad%20soup/emnlp08/pdf/EMNLP069.pdf">Denis and Baldridge</a> (both EMNLP-2008).  I should emphasize my impression of the literature is from a small subsample.</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/08/haghighi-and-klein-2009-simple-coreference-resolution-with-rich-syntactic-and-semantic-features/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Blogger to Wordpress migration helper</title>
		<link>http://anyall.org/blog/2009/08/blogger-to-wordpress-migration-helper/</link>
		<comments>http://anyall.org/blog/2009/08/blogger-to-wordpress-migration-helper/#comments</comments>
		<pubDate>Tue, 04 Aug 2009 15:50:49 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=616</guid>
		<description><![CDATA[A while ago I moved my blog from Blogger (socialscienceplusplus.blogspot.com) to a custom Wordpress installation here (anyall.org/blog).  Wordpress has a nice Blogger import feature, but I also wanted all the old URLs to redirect to their new equivalents.  This is tricky because Blogger doesn&#8217;t give you much control over their system.  I only found pretty [...]]]></description>
			<content:encoded><![CDATA[<p>A while ago I moved my blog from Blogger (socialscienceplusplus.blogspot.com) to a custom Wordpress installation here (anyall.org/blog).  Wordpress has a nice Blogger import feature, but I also wanted all the old URLs to redirect to their new equivalents.  This is tricky because Blogger doesn&#8217;t give you much control over their system.  I only found pretty hacky solutions online, so I wrote a new one that&#8217;s slightly better, and posted it here if anyone&#8217;s interested: <a href="http://gist.github.com/15594">gist.github.com/15594</a></p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/08/blogger-to-wordpress-migration-helper/feed/</wfw:commentRss>
		</item>
		<item>
		<title>R questions on StackOverflow</title>
		<link>http://anyall.org/blog/2009/07/r-questions-on-stackoverflow/</link>
		<comments>http://anyall.org/blog/2009/07/r-questions-on-stackoverflow/#comments</comments>
		<pubDate>Thu, 23 Jul 2009 17:54:53 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=609</guid>
		<description><![CDATA[R is notoriously hard to learn, but there was just an effort [1] [2] to populate the programming question-and-answer website StackOverflow with content for the R language.
Amusingly, one of the most useful intro questions is: How to search for &#8220;R&#8221; materials?
Mike Driscoll (who organized an in-person conference event to get this bootstrapped) pointed out that in many [...]]]></description>
			<content:encoded><![CDATA[<p>R is notoriously hard to learn, but there was just an effort <a href="http://www.meetup.com/R-Users/boards/thread/7315352/">[1]</a> <a href="http://blog.stackoverflow.com/2009/07/stack-overflow-flash-mobs/">[2]</a> to populate the programming question-and-answer website <a href="http://stackoverflow.com/questions/tagged/r">StackOverflow with content for the R language</a>.</p>
<p>Amusingly, one of the most useful intro questions is: <a href="http://stackoverflow.com/questions/102056/how-to-search-for-r-materials">How to search for &#8220;R&#8221; materials?</a></p>
<p><a href="http://dataspora.com/">Mike Driscoll</a> (who organized an in-person conference event to get this bootstrapped) pointed out that in many ways StackOverflow is a nicer forum for help than a mailing list.  (i.e. the impressive but hard-to-approach <a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a>.)  It&#8217;s more organized, easier to browse, and repetition and wrong answers can get downvoted.  (And <a href="http://www.johndcook.com/blog/2009/07/23/r-questions-answers/">more thoughts from John Cook</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/07/r-questions-on-stackoverflow/feed/</wfw:commentRss>
		</item>
		<item>
		<title>FFT: Friedman + Fortran + Tricks</title>
		<link>http://anyall.org/blog/2009/07/fft-friedman-fortran-tricks/</link>
		<comments>http://anyall.org/blog/2009/07/fft-friedman-fortran-tricks/#comments</comments>
		<pubDate>Wed, 22 Jul 2009 02:08:55 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://anyall.org/blog/?p=602</guid>
		<description><![CDATA[&#8230;is a tongue-in-cheek phrase from Trevor Hastie&#8217;s very fun to read useR-2009 presentation, from the merry trio of Hastie, Friedman, and Tibshirani, who brought us, among other things, the excellent Elements of Statistical Learning textbook.  It&#8217;s a joy to read sophisticated but well-presented work like this.
This comes from a slide explaining the impressive speed results for [...]]]></description>
			<content:encoded><![CDATA[<p>&#8230;is a tongue-in-cheek phrase from Trevor Hastie&#8217;s <a href="http://www.agrocampus-ouest.fr/math/useR-2009/slides/Hastie.pdf">very fun to read useR-2009 presentation</a>, from the merry trio of Hastie, Friedman, and Tibshirani, who brought us, among other things, the excellent <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">Elements of Statistical Learning textbook</a>.  It&#8217;s a joy to read sophisticated but well-presented work like this.</p>
<p>This comes from a slide explaining the impressive speed results for their <a href="http://www-stat.stanford.edu/~hastie/Papers/glmnet.pdf">glmnet</a> regression package.  Substantively, I&#8217;m interested in their observation that coordinate descent works well for sparse data &#8212; if you&#8217;re optimizing one feature at a time, and that feature is used in only a small percentage of instances, there are some neat optimizations!</p>
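<p>To make the sparse-data point concrete, here is a minimal pure-Python sketch of lasso coordinate descent.  It is not glmnet&#8217;s implementation, just the textbook update under my own assumptions: the objective is (1/2n)&#183;RSS plus <code>lam</code> times the L1 norm, each feature column is stored as a dict from row index to nonzero value, and the names <code>sparse_lasso</code> and <code>soft_threshold</code> are invented here.  The trick is that updating one coefficient only needs to visit the rows where that feature is nonzero.</p>

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso update)."""
    if value > threshold:
        return value - threshold
    if -value > threshold:
        return value + threshold
    return 0.0


def sparse_lasso(columns, y, lam, sweeps=100):
    """Lasso by cyclic coordinate descent over sparse columns.

    `columns[j]` maps row index to the nonzero value of feature j.
    """
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(sweeps):
        for j, column in enumerate(columns):
            if not column:
                continue
            # Only rows where feature j is nonzero are touched.
            rho = sum(v * (residual[i] + v * beta[j]) for i, v in column.items()) / n
            scale = sum(v * v for v in column.values()) / n
            new_beta = soft_threshold(rho, lam) / scale
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i, v in column.items():
                    residual[i] -= v * delta  # incremental residual update
                beta[j] = new_beta
    return beta
```

<p>The cost of one coefficient update is proportional to the number of nonzeros in that column rather than to the number of instances, which is where the savings for sparse data come from.</p>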
<p>But mostly, I had a fun time skimming the <a href="http://cran.r-project.org/web/packages/glmnet/index.html">glmnet code</a>.  It was written in 2008, but, yes, <a href="http://anyall.org/blog/wp-content/uploads/2009/07/glmnet.f90">the core algorithm is written entirely in Fortran</a>, complete with punchcard-style, fixed-width formatting!  (This seems gratuitous to me &#8212; I thought the modern <a href="http://en.wikipedia.org/wiki/Fortran#Fortran_90">Fortran-90</a> had done away with such things?)  I&#8217;ve felt clever enough making 10x-100x performance gains by switching from R or Python down to C++, but I&#8217;m told that this is nothing compared to Fortran with the proprietary <a href="http://en.wikipedia.org/wiki/Intel_Fortran_Compiler">Intel compiler</a> &#8212; still the fastest language in the world for numeric computing.</p>
<p>(Hat tip: <a href="http://blog.revolution-computing.com/2009/07/presentations-from-user-2009-online.html">Revolution Computing</a> pointed out the useR-2009 presentations.)</p>
]]></content:encoded>
			<wfw:commentRss>http://anyall.org/blog/2009/07/fft-friedman-fortran-tricks/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
