<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>AI and Social Science - Brendan O'Connor</title>
	
	<link>http://brenocon.com/blog</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Sat, 04 Feb 2012 05:17:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/BrendanOConnorsBlog" /><feedburner:info uri="brendanoconnorsblog" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Histograms — matplotlib vs. R</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/ha0W0TVsLFM/</link>
		<comments>http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 20:57:10 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1112</guid>
		<description><![CDATA[When possible, I like to use R for its really, really good statistical visualization capabilities. I&#8217;m doing a modeling project in Python right now (R is too slow, bad at large data, bad at structured data, etc.), and in comparison &#8230; <a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>When possible, I like to use R for its really, really good statistical visualization capabilities.  I&#8217;m doing a modeling project in Python right now (R is too slow, bad at large data, bad at structured data, etc.), and in comparison to base R, the matplotlib library is just painful.  I wrote a toy <a href="http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm">Metropolis</a> sampler for a <a href="http://en.wikipedia.org/wiki/Triangular_distribution">triangle distribution</a> and all I want to see is whether it looks like it&#8217;s working.  For the same dataset, here are histograms with default settings.  (Python: <em>pylab.hist(d)</em>, R: <em>hist(d)</em>)</p>
<p><a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/screen-shot-2012-02-02-at-3-30-30-pm/" rel="attachment wp-att-1113"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-02-at-3.30.30-PM.png" alt="" title="Screen shot 2012-02-02 at 3.30.30 PM" width="983" height="467" class="aligncenter size-full wp-image-1113" /></a></p>
<p>I want to know whether my Metropolis sampler is working; those two plots give a very different idea.  Of course, you could say this is an unfair comparison, since matplotlib is only using 10 bins, while R is using 18 here &#8212; and it&#8217;s always important to vary the bin size a few times when looking at histograms.  But R&#8217;s defaults really are better: it actually uses an adaptive bin size, and the heuristic worked, choosing a reasonable number for the data.  The <a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/hist.html">hist()</a> manual says it&#8217;s from Sturges (1926).  It&#8217;s hard to find other computer software that cites 100 year old papers for its design decisions &#8212; and where it matters.  (Old versions of R used to yell at you when you made a pie chart, citing perceptual studies that humans are really bad at interpreting them (<a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/pie.html">here</a>).  This is what originally made me love R.)</p>
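<p>For a rough sense of the heuristic: Sturges' rule suggests about \(\lceil \log_2 n + 1 \rceil\) bins for \(n\) observations.  Here is a minimal Python sketch of just that formula (the function name is made up for illustration).  As far as I understand, R treats this count only as a suggestion and then rounds the breakpoints to "pretty" values, so the number of bars actually drawn can differ from it.</p>

```python
import math

def sturges_bins(n):
    """Suggested number of histogram bins for n observations (Sturges' rule)."""
    return math.ceil(math.log2(n) + 1)

# The suggestion grows only logarithmically with the sample size.
for n in (100, 1000, 10000):
    print(n, sturges_bins(n))
```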
<p>Second, R is much smarter about breakpoints.  In the following plots, I&#8217;ve manually set the  number of bins to 10, and then 30 for each.</p>
<p><a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/screen-shot-2012-02-02-at-3-39-45-pm/" rel="attachment wp-att-1114"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-02-at-3.39.45-PM.png" alt="" title="Screen shot 2012-02-02 at 3.39.45 PM" width="672" height="250" class="aligncenter size-full wp-image-1114" /></a></p>
<p><a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/screen-shot-2012-02-02-at-3-40-48-pm/" rel="attachment wp-att-1115"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-02-at-3.40.48-PM.png" alt="" title="Screen shot 2012-02-02 at 3.40.48 PM" width="642" height="243" class="aligncenter size-full wp-image-1115" /></a></p>
<p>The second one is now OK for matplotlib &#8212; it&#8217;s good enough to figure out what&#8217;s going on &#8212; though still a little lame.  Why the gaps?</p>
<p>The problem is that my data are discrete &#8212; they&#8217;re all integers from 1 through 19 &#8212; and I think matplotlib is naively carving up that range into bins, which sometimes lumps together two integers, and sometimes gets zero of them.  I understand this is the simple naive implementation, and you could say it&#8217;s my fault for using the pylab histogram function on this type of data &#8212; but it&#8217;s really not as good as whatever R is doing, which works rather well here, and I didn&#8217;t have to waste time thinking about the internals of the algorithm.  For reference, here is the correct visualization of the data (R: <em>plot(table(d))</em>).  Note that R&#8217;s original Sturges breakpoints did make one error: the first two values got combined into one bin.<br />
<a href="http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/screen-shot-2012-02-02-at-4-06-28-pm/" rel="attachment wp-att-1144"><img src="http://brenocon.com/blog/wp-content/uploads/2012/02/Screen-shot-2012-02-02-at-4.06.28-PM.png" alt="" title="Screen shot 2012-02-02 at 4.06.28 PM" width="294" height="206" class="aligncenter size-full wp-image-1144" /></a></p>
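<p>The lumping and the gaps can be reproduced without any plotting library.  The sketch below is only my guess at what a naive equal-width binner does, not matplotlib's actual code; the helper name and the stand-in data (each integer from 1 through 19 exactly once) are assumptions for illustration.</p>

```python
from collections import Counter

def equal_width_counts(data, num_bins):
    """Counts for num_bins equal-width bins spanning [min, max] of integer data."""
    lo, hi = min(data), max(data)
    counts = [0] * num_bins
    for x in data:
        # Integer arithmetic avoids float rounding at bin edges;
        # the maximum value is folded into the last bin.
        idx = min((x - lo) * num_bins // (hi - lo), num_bins - 1)
        counts[idx] += 1
    return counts

# Stand-in data (not the actual sampler output): each integer 1..19 once.
data = list(range(1, 20))

ten = equal_width_counts(data, 10)      # most bins swallow two integers
thirty = equal_width_counts(data, 30)   # many bins receive nothing
exact = Counter(data)                   # one count per distinct value

print(ten)
print(thirty.count(0), "empty bins out of 30")
print(len(exact), "distinct values")
```

<p>With 10 bins the bin width is 1.8, so bins cannot line up with the integers; with 30 bins there are more bins than distinct values, so gaps are unavoidable.  Counting each distinct value directly sidesteps both problems.</p>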
<p>Lessons: (1) always vary the bin sizes for histograms, especially if you&#8217;re using naive breakpoint selection, and (2) don&#8217;t ignore a century&#8217;s worth of statistical research on these issues.  And since it&#8217;s hard to learn a century&#8217;s worth of statistics, just use R, where they&#8217;ve compiled it in for you.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2012/02/histograms-matplotlib-vs-r/</feedburner:origLink></item>
		<item>
		<title>Bayes update view of pointwise mutual information</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/F2Z1runI5R0/</link>
		<comments>http://brenocon.com/blog/2011/11/bayes-update-view-of-pointwise-mutual-information/#comments</comments>
		<pubDate>Sun, 13 Nov 2011 18:41:03 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1093</guid>
		<description><![CDATA[This is fun. Pointwise Mutual Information (e.g. Church and Hanks 1990) between two variable outcomes \(x\) and \(y\) is \[ PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)} \] It&#8217;s called &#8220;pointwise&#8221; because Mutual Information, between two (discrete) variables X and Y, is the &#8230; <a href="http://brenocon.com/blog/2011/11/bayes-update-view-of-pointwise-mutual-information/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<br />
This is fun.  Pointwise Mutual Information (e.g. <a href="http://acl.ldc.upenn.edu/J/J90/J90-1003.pdf">Church and Hanks 1990</a>) between two variable outcomes \(x\) and \(y\) is</p>
<p>\[ PMI(x,y) = \log \frac{p(x,y)}{p(x)p(y)} \]</p>
<p>It&#8217;s called &#8220;pointwise&#8221; because <a href="http://en.wikipedia.org/wiki/Mutual_information">Mutual Information</a>, between two (discrete) variables X and Y, is the expectation of PMI over possible outcomes of X and Y: \( MI(X,Y) = \sum_{x,y} p(x,y) PMI(x,y) \).</p>
<p>One interpretation of PMI is it&#8217;s measuring how much deviation from independence there is &#8212; since \(p(x,y)=p(x)p(y)\) if X and Y were independent, so the ratio is how non-independent they (the outcomes) are.</p>
<p>You can get another interpretation of this quantity if you switch into conditional probabilities.  Looking just at the ratio, apply the definition of conditional probability:</p>
<p>\[ \frac{p(x,y)}{p(x)p(y)} = \frac{p(x|y)}{p(x)} \]</p>
<p>Think about doing a Bayes update for your belief about \(x\).  Start with the prior \(p(x)\), then learn \(y\) and you update to the posterior belief \(p(x|y)\).  How much your belief changes is measured by that ratio; the log-scaled ratio is PMI.  (Positive PMI = increase belief, negative PMI = decrease belief.  Positive vs. negative associations.)</p>
<p>Interestingly, it&#8217;s symmetric (obvious from the original definition of PMI, sure):<br />
\[ \frac{p(x|y)}{p(x)} = \frac{p(y|x)}{p(y)} \]</p>
<p>So under this measurement of &#8220;amount of information you learn,&#8221; the amount you learn about \(x\) from \(y\) is actually the same as how much you learn about \(y\) from \(x\).</p>
<p>This is closer to the information gain view of mutual information, when you decompose it into relative and conditional entropies; the current Wikipedia page has some of the derivations back and forth for them.</p>
<p>Lots more about this stuff on the <a href="http://en.wikipedia.org/wiki/Mutual_information">MI</a> and <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL Divergence</a> Wikipedia pages.  And early chapters of the (free) <a href="http://www.inference.phy.cam.ac.uk/mackay/itila/book.html">MacKay 2003 textbook</a>.  There seems to be lots of recent work using PMI for association scores between words or concepts and such (I did this with Facebook &#8220;Like&#8221; data at my internship there, it is quite fun); it&#8217;s nice because with MLE or fixed-Dirichlet-MAP estimation it only requires simple counts and no optimization/sampling, so you can use it on very large datasets, and it seems to give good pairwise association results in many circumstances.</p>
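<p>To make the counts-only point concrete, here is a minimal Python sketch that computes PMI from maximum-likelihood estimates and checks the Bayes-update and symmetry identities above.  The counts and function names are hypothetical, and it assumes no zero counts (the log would blow up otherwise).</p>

```python
import math

# Hypothetical co-occurrence counts for outcomes of X and Y (illustration only).
counts = {("x1", "y1"): 30, ("x1", "y2"): 10,
          ("x2", "y1"): 20, ("x2", "y2"): 40}
total = sum(counts.values())

def p_xy(x, y):
    return counts[(x, y)] / total

def p_x(x):
    return sum(c for (xi, _), c in counts.items() if xi == x) / total

def p_y(y):
    return sum(c for (_, yi), c in counts.items() if yi == y) / total

def pmi(x, y):
    """log of p(x,y) / (p(x) p(y)), in nats, from maximum-likelihood estimates."""
    return math.log(p_xy(x, y) / (p_x(x) * p_y(y)))

def bayes_ratio_x(x, y):
    """Posterior over prior for x after learning y: p(x|y) / p(x)."""
    return (p_xy(x, y) / p_y(y)) / p_x(x)

def bayes_ratio_y(x, y):
    """Posterior over prior for y after learning x: p(y|x) / p(y)."""
    return (p_xy(x, y) / p_x(x)) / p_y(y)

# Mutual information is the expectation of PMI under the joint distribution.
mutual_information = sum(p_xy(x, y) * pmi(x, y) for (x, y) in counts)

print(pmi("x1", "y1"), math.log(bayes_ratio_x("x1", "y1")))
print(bayes_ratio_x("x1", "y1"), bayes_ratio_y("x1", "y1"))
print(mutual_information)
```

<p>The two printed ratios agree, which is the symmetry noted above, and the log of either one is the PMI.</p>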
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/11/bayes-update-view-of-pointwise-mutual-information/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/11/bayes-update-view-of-pointwise-mutual-information/</feedburner:origLink></item>
		<item>
		<title>Memorizing small tables</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/yv5Bx5k0Sjc/</link>
		<comments>http://brenocon.com/blog/2011/11/memorizing-small-tables/#comments</comments>
		<pubDate>Fri, 11 Nov 2011 18:13:49 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1066</guid>
		<description><![CDATA[Lately, I&#8217;ve been trying to memorize very small tables, especially for better intuitions and rule-of-thumb calculations. At the moment I have these above my desk: The first one is a few entries in a natural logarithm table. There are all &#8230; <a href="http://brenocon.com/blog/2011/11/memorizing-small-tables/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><br />
Lately, I&#8217;ve been trying  to memorize very small tables, especially for better intuitions and rule-of-thumb calculations.  At the moment I have these above my desk:</p>
<p><a href="http://brenocon.com/blog/2011/11/memorizing-small-tables/screen-shot-2011-11-11-at-1-04-28-pm-3/" rel="attachment wp-att-1074"><img src="http://brenocon.com/blog/wp-content/uploads/2011/11/Screen-shot-2011-11-11-at-1.04.28-PM1.jpg" alt="" title="Screen shot 2011-11-11 at 1.04.28 PM" width="1061" height="526" class="aligncenter size-full wp-image-1074" /></a></p>
<p>The first one is a few entries in a natural logarithm table.  There are all these stories about how in the slide rule era, people would develop better intuitions about the scale of logarithms because they physically engaged with them all the time.  I spend lots of time looking at log-likelihoods, log-odds-ratios, and logistic regression coefficients, so I think it would be nice to have quick intuitions about what they are.  (Though the <a href="http://www.stat.columbia.edu/~gelman/arm/">Gelman and Hill</a> textbook has an interesting argument against odds scale interpretations of logistic regression coefficients.)</p>
<p>The second one is a set of zsh filename manipulation <a href="http://www.rayninfo.co.uk/tips/zshtips.html">shortcuts</a>.  OK, this is more narrow than the others, but pretty useful for me at least.</p>
<p>The third one is a set of rough unit equivalencies for data rates over time.  I find this very important for quickly determining whether a long-running job is going to take a dozen minutes, or a few hours, or a few days.  In particular, many data transfer commands (scp, wget, s3cmd) immediately tell you a rate per second, which you then can scale up.  (And if you&#8217;re using a CPU-bound pipeline command, you can always use the amazing <a href="http://www.ivarch.com/programs/pv.shtml">pv</a> command to get a rate-per-second estimate.)  This table is inspired by the <a href="http://brenocon.com/dean_perf.html">&#8220;Numbers Everyone Should Know&#8221;</a> list.</p>
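<p>The scaling itself is just multiplication by 3,600 and 86,400.  A small Python sketch of the arithmetic, with made-up function names and an assumed steady rate of 1 MB/s as the example:</p>

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

def scale_rate(per_second):
    """Scale a per-second rate up to per-hour and per-day amounts."""
    return per_second * SECONDS_PER_HOUR, per_second * SECONDS_PER_DAY

def hours_to_finish(total, per_second):
    """Hours needed to process `total` units at a steady per-second rate."""
    return total / per_second / SECONDS_PER_HOUR

# Example: a transfer reporting 1 MB/s (1e6 bytes per second).
per_hour, per_day = scale_rate(1e6)
print(per_hour / 1e9, "GB/hour")
print(per_day / 1e9, "GB/day")
print(round(hours_to_finish(500e9, 1e6), 1), "hours for 500 GB")
```

<p>So 1 MB/s is a few GB per hour and a bit under 100 GB per day, which is enough to tell "minutes" from "days" at a glance.</p>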
<p>The fourth one is the <a href="http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval">Clopper-Pearson</a> binomial confidence interval.  Actually, the more useful ones to memorize are <a href="http://brenocon.com/blog/2011/04/rough-binomial-confidence-intervals/">Wald binomial intervals</a>, which are easy because they&#8217;re close to \(\pm 1/\sqrt{n}\).  Good party trick.  This sticky is actually the relevant R calls (type <a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/binom.test.html">binom.test</a> and press enter); I was using small-n binomial hypothesis testing a lot recently so wanted to get more used to it.  Maybe this one isn&#8217;t very useful.</p>
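<p>Why \(\pm 1/\sqrt{n}\) works: the 95% Wald half-width is \(1.96\sqrt{\hat{p}(1-\hat{p})/n}\), and \(\hat{p}(1-\hat{p})\) is at most \(1/4\), so the half-width is at most \(0.98/\sqrt{n}\).  A minimal Python sketch of that comparison (the function name and the default \(z = 1.96\) are my choices for illustration):</p>

```python
import math

def wald_half_width(p_hat, n, z=1.96):
    """Half-width of the Wald interval: z * sqrt(p_hat * (1 - p_hat) / n)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Worst case is p_hat = 0.5; compare it with the 1/sqrt(n) rule of thumb.
for n in (25, 100, 400):
    print(n, round(wald_half_width(0.5, n), 3), round(1 / math.sqrt(n), 3))
```

<p>The rule of thumb is slightly conservative at \(\hat{p} = 0.5\) and gets more conservative as \(\hat{p}\) moves toward 0 or 1, where the Wald interval itself becomes less trustworthy.</p>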
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/11/memorizing-small-tables/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/11/memorizing-small-tables/</feedburner:origLink></item>
		<item>
		<title>Be careful with dictionary-based text analysis</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/Keowz_pVIC0/</link>
		<comments>http://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/#comments</comments>
		<pubDate>Wed, 05 Oct 2011 16:15:36 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1050</guid>
		<description><![CDATA[OK, everyone loves to run dictionary methods for sentiment and other text analysis &#8212; counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus. In particular, this is often done &#8230; <a href="http://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>OK, everyone loves to run dictionary methods for sentiment and other text analysis &#8212; counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus.  In particular, this is often done for sentiment analysis: count positive and negative words (according to a sentiment polarity lexicon, which was derived from human raters or previous researchers&#8217; intuitions), and then proclaim the output yields sentiment levels of the documents.  More and more papers come out every day that do this.  <a href="http://brenocon.com/oconnor_balasubramanyan_routledge_smith.icwsm2010.tweets_to_polls.pdf">I&#8217;ve done this myself.</a>  It&#8217;s interesting and fun, but it&#8217;s easy to get a bunch of meaningless numbers if you don&#8217;t carefully validate what&#8217;s going on.  There are certainly good studies in this area that do further validation and analysis, but it&#8217;s hard to trust a study that just presents a graph with a few overly strong speculative claims as to its meaning.  This happens more than it ought to.</p>
<p>I was happy to see a similarly critical view in a nice working paper by <a href="http://www.justingrimmer.org/">Justin Grimmer</a> and <a href="http://www.gov.harvard.edu/people/brandon-stewart">Brandon Stewart</a>, <a href="http://stanford.edu/~jgrimmer/tad2.pdf">Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts</a>.</p>
<p>Since I think these arguments need to be more widely known, here&#8217;s a long quote from Section 4.1 &#8230; see also the paper for more details (and lots of other interesting stuff).  Emphases are mine.</p>
<blockquote><p>
For dictionary methods to work well, the scores attached to words must closely align with how the words are used in a particular context. If a dictionary is developed for a specific application, then this assumption should be easy to justify. But <strong>when dictionaries are created in one substantive area and then applied to another problems, serious errors can occur</strong>. Perhaps the clearest example of this is shown in Loughran and McDonald (2011).  Loughran and McDonald (2011) critique the increasingly common use of off the shelf dictionaries to measure the tone of statutorily required corporate earning reports in the accounting literature. They point out that many words that have a negative connotation in other contexts, like <em>tax</em>, <em>cost</em>, <em>crude</em> (oil) or <em>cancer</em>, may have a positive connotation in earning reports. For example, a health care company may mention cancer often and oil companies are likely to discuss crude extensively. And words that are not identified as negative in off the shelf dictionaries may have quite negative connotation in earning reports (<em>unanticipated</em>, for example).</p>
<p>Dictionaries, therefore, should be used with substantial caution. Scholars must either explicitly establish that word lists created in other contexts are applicable to a particular domain, or create a problem specific dictionary. In either instance, scholars must validate their results. But <strong>measures from dictionaries are rarely validated. Rather, standard practice in using dictionaries is to assume the measures created from a dictionary are correct and then apply them to the problem.</strong> This is due, in part, to the exceptional difficulties in validating dictionaries. Dictionaries are commonly used to establish granular scales of a particular kind of sentiment, such as tone. While this is useful for applications, the granular measures insure that it is essentially impossible to derive gold standard evaluations based on human coding of documents, because of the difficulty of establishing reliable granular scales from humans (Krosnick, 1999).</p>
<p>The consequence of domain specificity and lack of validation is that <strong>most analyses based on dictionaries are built on shaky foundations.</strong> <strong>Yes, dictionaries are able to produce measures that are claimed to be about tone or emotion, but the actual properties of these measures &#8211; and how they relate to the concepts their attempting to measure &#8211; are essentially a mystery.</strong> Therefore, for scholars to effectively use dictionary methods in their future work, advances in the validation of dictionary methods must be made. We suggest two possible ways to improve validation of dictionary methods. First, the classification problem could be simplified. If scholars use dictionaries to code documents into binary categories (positive or negative tone, for example), then validation based on human gold standards and the methods we describe in Section 4.2.4 is straightforward. Second, scholars could treat measures from dictionaries similar to how we validations from unsupervised methods are conducted (see Section 5.5). This would force scholars to establish that their measures of underlying concepts have properties associated with long standing expectations.
</p></blockquote>
<p>And after an example analysis,</p>
<blockquote><p>
&#8230; we reiterate our skepticism of dictionary based measures. As is standard in the use of dictionary measures (for example, Young and Soroka (2011)) the measures are presented here without validation.  This lack of validation is due in part because <strong>it is exceedingly difficult to demonstrate that our scale of sentiment precisely measures differences in sentiment expressed</strong> towards Russia.  Perhaps this is because <strong>it is equally difficult to define what would constitute these differences in scale</strong>.
</p></blockquote>
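<p>To see how easily this goes wrong, here is a toy Python sketch of a dictionary-based tone score.  The word lists are hypothetical stand-ins for an off-the-shelf lexicon (the negative words are the ones mentioned in the quote above), and the sentence is invented.</p>

```python
import re

# Hypothetical generic lexicon; a real one would be far larger.
NEGATIVE = {"tax", "cost", "crude", "cancer"}
POSITIVE = {"growth", "strong", "gain"}

def tone(text):
    """Positive-word count minus negative-word count for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg

# An invented, upbeat earnings-report sentence.
report = ("Crude output rose and our cancer drug posted strong growth "
          "despite higher tax cost.")
print(tone(report))
```

<p>The sentence is good news for the company, yet the score comes out negative because "crude", "cancer", "tax", and "cost" are counted as negative words.  Without validation against human judgments in the target domain, nothing in the output would reveal that.</p>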
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/10/be-careful-with-dictionary-based-text-analysis/</feedburner:origLink></item>
		<item>
		<title>Information theory stuff</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/AURRQGkFLcc/</link>
		<comments>http://brenocon.com/blog/2011/09/information-theory-stuff/#comments</comments>
		<pubDate>Sun, 25 Sep 2011 21:28:59 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=1010</guid>
		<description><![CDATA[Actually this post is mainly to test the MathJax installation I put into WordPress via this plugin. But information theory is great, why not? The probability of a symbol is \(p\). It takes \(\log \frac{1}{p}\) bits to encode one symbol &#8212; sometimes &#8230; <a href="http://brenocon.com/blog/2011/09/information-theory-stuff/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<br />
Actually this post is mainly to test the <a href="http://www.mathjax.org/">MathJax</a> installation I put into WordPress via <a href="http://wordpress.org/extend/plugins/mathjax-latex/">this plugin</a>.  But <a href="http://en.wikipedia.org/wiki/Information_theory">information theory</a> is great, why not?</p>
<p>The probability of a symbol is \(p\).</p>
<p>It takes \(\log \frac{1}{p} = -\log p\) bits to encode one symbol &#8212; sometimes called its &#8220;surprisal&#8221;.  Surprisal is 0 for a 100% probable symbol, and ranges up to \(\infty\) for extremely low probability symbols.  This is because you use a coding scheme that encodes common symbols as very short strings, and less common symbols as longer ones.  (e.g. <a href="http://en.wikipedia.org/wiki/Huffman_coding">Huffman</a> or <a href="http://en.wikipedia.org/wiki/Arithmetic_coding">arithmetic</a> coding.)  We should say logarithms are base-2 so information is measured in bits.\(^*\)</p>
<p>If you have a stream of such symbols and a probability distribution \(\vec{p}\) for them, where a symbol \(i\) comes at probability \(p_i\), then the average message size is the expected surprisal:</p>
<p>\[ H(\vec{p}) = \sum_i p_i \log \frac{1}{p_i} \]</p>
<p>this is the Shannon <b>entropy</b> of the probability distribution \( \vec{p} \), which is a measure of its uncertainty.  In fact, if you start with a few pretty reasonable axioms for how to design a measurement of uncertainty of a discrete probability distribution, you end up with the above equation as the only possible measure.  (I think. This is all in Shannon&#8217;s original paper.)</p>
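<p>As a quick check on the definitions, here is a minimal Python sketch of surprisal and entropy in bits.  The function names are mine, and zero-probability symbols are skipped, following the usual convention that they contribute nothing to the sum.</p>

```python
import math

def surprisal(p):
    """Bits needed to encode a symbol of probability p: log2(1/p)."""
    return math.log2(1 / p)

def entropy(dist):
    """Expected surprisal of a discrete distribution, in bits."""
    return sum(p * surprisal(p) for p in dist if p > 0)

print(surprisal(1.0))            # a certain symbol costs nothing
print(entropy([0.5, 0.5]))       # a fair coin
print(entropy([0.25] * 4))       # four equally likely symbols
print(entropy([0.9, 0.1]))       # a skewed coin is less uncertain
```

<p>A fair coin has 1 bit of entropy and four equally likely symbols have 2 bits; skewing the distribution lowers the entropy.</p>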
<p>Now, what if you have symbols at a distribution \( \vec{p} \) but you encode them with the wrong distribution \( \vec{q} \)?  You pay \(\log\frac{1}{q}\) bits per symbol but the expectation is under the true distribution \(\vec{p}\).  Then the average message size is called the <b>cross-entropy</b> between the distributions:</p>
<p>\[ H(\vec{p},\vec{q}) = \sum_i p_i \log \frac{1}{q_i} \]</p>
<p>How much worse is this coding compared to the optimal one?  (I.e. how much of a cost do you pay for encoding with the wrong distribution?)  The optimal one is size \( \sum -p_i \log p_i \) so the difference is just</p>
<p>\[ \begin{align}
H(\vec{p},\vec{q}) - H(\vec{p}) &amp;= \sum_i \left( -p_i \log q_i + p_i \log p_i \right) \\
&amp;= \sum_i p_i \log \frac{p_i}{q_i} = KL(\vec{p} \,\|\, \vec{q})
\end{align} \]</p>
<p>which is called the <b>relative entropy</b> or <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a>, and it&#8217;s a measurement of the dissimilarity of the distributions \(\vec{p}\) and \(\vec{q}\).  You can see it&#8217;s about dissimilarity because if \(\vec{p}\) and \(\vec{q}\) were the same, then the inner term \(\log\frac{p}{q}\) would always be 0 and the whole thing comes out to be 0.</p>
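<p>A small numeric check that the KL divergence really is cross-entropy minus entropy.  This is a Python sketch with made-up example distributions and function names; it assumes \(q_i &gt; 0\) wherever \(p_i &gt; 0\), otherwise the cost is infinite.</p>

```python
import math

def entropy(p):
    """H(p): average bits per symbol under the optimal code for p."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q): average bits when symbols follow p but the code is built for q."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q): extra bits per symbol paid for coding with q instead of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true distribution p and mismatched coding distribution q.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(entropy(p), cross_entropy(p, q))
print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
print(kl_divergence(p, p))
```

<p>The two values on the second line agree, and coding with the true distribution costs no extra bits.  Swapping the arguments generally gives a different number, since KL is not symmetric, although for this particular permuted pair the two directions happen to coincide.</p>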
<p>For more, I rather like the early chapters of the free online textbook by <a href="http://www.cs.toronto.edu/~mackay/itila/book.html">David MacKay: &#8220;Information Theory, Inference, and Learning Algorithms&#8221;</a>.  That&#8217;s where I picked up the habit of saying surprisal is \( \log \frac{1}{p} \) instead of \(-\log p\); the former seems more intuitive to me, and then you don&#8217;t have a pesky negative sign in the entropy and cross-entropy equations.  In general the book is great at making things intuitive.  Its main weakness is you can&#8217;t trust the insane negative things he says about frequentist statistics, but that&#8217;s another discussion.</p>
<p>\(^*\) You can use natural logs or whatever and it&#8217;s just different sized units: &#8220;nats&#8221;, as you can see in the fascinating Chapter 18 of MacKay on codebreaking, which features Bletchley Park, Alan Turing, and Nazis.</p>
]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/09/information-theory-stuff/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/09/information-theory-stuff/</feedburner:origLink></item>
		<item>
		<title>End-to-end NLP packages</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/T5SXUaxj2Fs/</link>
		<comments>http://brenocon.com/blog/2011/09/end-to-end-nlp-packages/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 00:31:30 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=995</guid>
		<description><![CDATA[What freely available end-to-end natural language processing (NLP) systems are out there, that start with raw text, and output parses and semantic structures? Lots of NLP research focuses on single tasks at a time, and thus produces software that does &#8230; <a href="http://brenocon.com/blog/2011/09/end-to-end-nlp-packages/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>What freely available end-to-end natural language processing (NLP) systems are out there, that start with raw text, and output parses and semantic structures?  Lots of NLP research focuses on single tasks at a time, and thus produces software that does a single task at a time.  But for various applications, it is nicer to have a full end-to-end system that just runs on whatever text you give it.</p>
<p>If you believe this is a worthwhile goal (see caveat at bottom), I will postulate there aren&#8217;t a ton of such end-to-end, multilevel systems.  Here are ones I can think of.  Corrections and clarifications welcome.</p>
<ul>
<li><a href="http://nlp.stanford.edu/software/corenlp.shtml">Stanford CoreNLP</a>.  Raw text to <a href="http://nlp.stanford.edu/software/stanford-dependencies.shtml">rich syntactic dependencies</a> (<a href="http://en.wikipedia.org/wiki/Lexical_functional_grammar">LFG</a>-inspired).  Also POS, NER, coreference.</li>
<li><a href="http://svn.ask.it.usyd.edu.au/trac/candc/wiki">C&amp;C tools</a>.  From (sentence-segmented, tokenized?) text to rich syntactic dependencies (<a href="http://en.wikipedia.org/wiki/Combinatory_categorial_grammar">CCG</a>-based) and also a semantic representation.  POS and chunks on the way.  Does anyone use this much?  It seems underappreciated relative to its richness.</li>
<li><a href="http://ml.nec-labs.com/senna/">Senna</a>.  Sentence-segmented text -> parse trees, plus POS, NER, chunks, and semantic role labeling.  This one is quite new; is it as good?  It doesn&#8217;t give syntactic dependencies, though for some applications semantic role labeling is similar or better (or worse?).  I&#8217;m a little concerned that its documentation seems overly focused on competing in evaluation datasets, as opposed to trying to ensure they&#8217;ve made something more broadly useful.  (To be fair, they&#8217;re focused on developing algorithms that could be broadly applicable to different NLP tasks; that&#8217;s a whole other discussion.)</li>
</ul>
<p>If you want to quickly get some sort of shallow semantic relations, a.k.a. high-level syntactic relations, one of the above packages might be your best bet.  Are there others out there?</p>
<p>Restricting oneself to these full end-to-end systems is also funny since you can mix-and-match components to get better results for what you want.  One example: if you have constituent parse trees and want dependencies, you could swap in the <a href="http://nlp.stanford.edu/software/stanford-dependencies.shtml">Stanford Dependency</a> extractor (or another one like <a href="http://nlp.cs.lth.se/software/treebank_converter/">pennconverter</a>?) to post-process the parses.  Or you could swap in the <a href="http://bllip.cs.brown.edu/resources.shtml">Charniak-Johnson</a> or <a href="http://code.google.com/p/berkeleyparser/">Berkeley</a> parser into the middle of the Stanford CoreNLP stack.  Or you could use a direct dependency parser (I think <a href="http://maltparser.org/">Malt</a> is the most popular?) and skip the phrase structure step.  Etc.</p>
<p>It&#8217;s worth noting several other NLP libraries that I see used a lot.  I believe that, unlike the above, they don&#8217;t focus on out-of-the-box end-to-end NLP analysis (though you can certainly use them to perform various parts of an NLP pipeline).</p>
<ul>
<li><a href="http://incubator.apache.org/opennlp/">OpenNLP</a> &#8212; I&#8217;ve never used it but lots of people like it.  Seems well-maintained now?  Does chunking, tagging, even coreference.</li>
<li><a href="http://alias-i.com/lingpipe/">LingPipe</a> &#8212; has lots of individual algorithms and high-quality implementations.  Only chunking and tagging (I think).  It&#8217;s only quasi-free.</li>
<li><a href="http://mallet.cs.umass.edu/">Mallet</a> &#8212; focuses on information extraction and topic modeling, so slightly different than the other packages listed here.</li>
<li><a href="http://www.nltk.org/">NLTK</a> &#8212; I always have a hard time telling what this actually does, compared to what it aims to teach you to do.  It seems to do various tagging and chunking tasks.  I use the nltk_data.zip archive all the time though (I can&#8217;t find a direct download link unfortunately), for its stopword lists and small toy corpora.  (Including the <a href="http://en.wikipedia.org/wiki/Brown_Corpus">Brown Corpus</a>!  I guess it now counts as a toy corpus since you can grep it in less than a second.)</li>
</ul>
<p>These packages are nice in terms of documentation and software engineering, but they don&#8217;t do any syntactic parsing or other shallow relational extraction.  (NLTK has some libraries that appear to do parsing and semantics, but it&#8217;s hard to tell how serious they are.)</p>
<p>Oh finally, there&#8217;s also <a href="http://uima.apache.org/">UIMA</a>, which isn&#8217;t really a tool, but rather a high-level API for integrating your tools.  <a href="http://gate.ac.uk/">GATE</a> also heavily emphasizes the framework aspect, but does come with some tools of its own.</p>
<img src="http://feeds.feedburner.com/~r/BrendanOConnorsBlog/~4/T5SXUaxj2Fs" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/09/end-to-end-nlp-packages/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/09/end-to-end-nlp-packages/</feedburner:origLink></item>
		<item>
		<title>CMU Twitter Part-of-Speech tagger 0.2</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/afU0TibuOXo/</link>
		<comments>http://brenocon.com/blog/2011/08/cmu-twitter-part-of-speech-tagger-0-2/#comments</comments>
		<pubDate>Sat, 27 Aug 2011 19:55:40 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=993</guid>
		<description><![CDATA[Announcement: We recently released a new version (0.2) of our part-of-speech tagger for English Twitter messages, along with annotations and interface. See the link for more details.]]></description>
			<content:encoded><![CDATA[<p>Announcement: We recently released a new version (0.2) of our <a href="http://www.ark.cs.cmu.edu/TweetNLP/">part-of-speech tagger for English Twitter messages</a>, along with annotations and interface.  See the link for more details.</p>
<img src="http://feeds.feedburner.com/~r/BrendanOConnorsBlog/~4/afU0TibuOXo" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/08/cmu-twitter-part-of-speech-tagger-0-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/08/cmu-twitter-part-of-speech-tagger-0-2/</feedburner:origLink></item>
		<item>
		<title />
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/sDzsTPftobo/</link>
		<comments>http://brenocon.com/blog/2011/07/987/#comments</comments>
		<pubDate>Mon, 04 Jul 2011 03:36:54 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=987</guid>
		<description><![CDATA[One last thing on the Norvig vs. Chomsky thing from a little while ago (http://norvig.com/chomsky.html), which (correctly) casts the issue as Shannon vs. Chomsky. The relevant seminal publications are: Shannon, &#8220;Mathematical Theory of Communication,&#8221; 1948 Chomsky, &#8220;Syntactic Structures,&#8221; 1957 One &#8230; <a href="http://brenocon.com/blog/2011/07/987/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>One last thing on the Norvig vs. Chomsky thing from a little while ago (http://norvig.com/chomsky.html), which (correctly) casts the issue as Shannon vs. Chomsky.</p>
<p>The relevant seminal publications are:</p>
<ul>
<li>Shannon, &#8220;Mathematical Theory of Communication,&#8221; 1948</li>
<li>Chomsky, &#8220;Syntactic Structures,&#8221; 1957</li>
</ul>
<p>One of those historical figures is still around and representing himself in 2011 &#8212; he should get credit just for still showing up to the fight. Are there any historical figures from the Shannon side still around?  What I would&#8217;ve given to see a Jelinek vs. Chomsky public debate.  Though I guess Pereira vs. Chomsky would be pretty great.</p>
<img src="http://feeds.feedburner.com/~r/BrendanOConnorsBlog/~4/sDzsTPftobo" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/07/987/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/07/987/</feedburner:origLink></item>
		<item>
		<title>Good linguistic semantics textbook?</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/1ifz8oWRSx8/</link>
		<comments>http://brenocon.com/blog/2011/06/good-linguistic-semantics-textbook/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 19:03:39 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=983</guid>
		<description><![CDATA[I&#8217;m looking for recommendations for a good textbook/handbook/reference on (non-formal) linguistic semantics.  My undergrad semantics course was almost entirely focused on logical/formal semantics, which is fine, but I don&#8217;t feel familiar with the breadth of substantive issues &#8212; for example, &#8230; <a href="http://brenocon.com/blog/2011/06/good-linguistic-semantics-textbook/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m looking for recommendations for a good textbook/handbook/reference on (non-formal) linguistic semantics.  My undergrad semantics course was almost entirely focused on logical/formal semantics, which is fine, but I don&#8217;t feel familiar with the breadth of substantive issues &#8212; for example, I&#8217;d be hard-pressed to explain why something like semantic/thematic role labeling should be useful for anything at all.</p>
<p>I somewhat randomly stumbled upon <a href="http://www.amazon.com/Linguistic-Semantics-William-Frawley/dp/0805810757/">Frawley 1992</a> (<a href="http://www.clres.com/online-papers/dsna94.html">review</a>) in a <a href="http://www.powells.com/">used bookstore</a> and it seemed pretty good &#8212; in particular, it cleanly separates itself from the philosophical study of semantics, and thus identifies issues that seem amenable to computational modeling.</p>
<p>I&#8217;m wondering what else is out there?  Here&#8217;s <a href="http://www.acsu.buffalo.edu/~jb77/review_Loebner2002_JB.pdf">a comparison of three textbooks</a>.</p>
<img src="http://feeds.feedburner.com/~r/BrendanOConnorsBlog/~4/1ifz8oWRSx8" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/06/good-linguistic-semantics-textbook/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/06/good-linguistic-semantics-textbook/</feedburner:origLink></item>
		<item>
		<title>How much text versus metadata is in a tweet?</title>
		<link>http://feedproxy.google.com/~r/BrendanOConnorsBlog/~3/v53dtmkqSNo/</link>
		<comments>http://brenocon.com/blog/2011/06/how-much-text-versus-metadata-is-in-a-tweet/#comments</comments>
		<pubDate>Tue, 14 Jun 2011 03:25:59 +0000</pubDate>
		<dc:creator>brendano</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://brenocon.com/blog/?p=980</guid>
		<description><![CDATA[This should have been a blog post, but I got lazy and wrote a plaintext document instead. Link For Twitter, context matters: 90% of a tweet is metadata and 10% is text.  That&#8217;s measured by (an approximation of) information content; &#8230; <a href="http://brenocon.com/blog/2011/06/how-much-text-versus-metadata-is-in-a-tweet/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This should have been a blog post, but I got lazy and wrote a plaintext document instead.</p>
<ul>
<li><a title="http://j.mp/jc1EjQ" rel="nofollow" href="http://t.co/vz6hYP5" target="_blank">Link</a></li>
</ul>
<p>For Twitter, context matters: 90% of a tweet is metadata and 10% is text.  That&#8217;s measured by (an approximation of) information content; by raw data size, it&#8217;s 95/5.</p>
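<p>For a rough feel of how such a comparison can be set up, here is a minimal sketch that uses zlib-compressed size as a stand-in for information content.  That proxy is my assumption here and not necessarily the method used in the linked document; the tweet records and field names are invented and far smaller than real API payloads, so the shares it prints will not match the 90/10 and 95/5 figures.</p>

```python
import json
import zlib

# Hypothetical tweet records: the field names and values are invented for
# illustration and are far smaller than real Twitter API payloads.
tweets = [
    {"id": 1, "created_at": "Tue Jun 14 03:25:59 +0000 2011",
     "user": {"screen_name": "example_a", "followers_count": 120, "lang": "en"},
     "text": "reading about histograms in matplotlib and R"},
    {"id": 2, "created_at": "Tue Jun 14 03:26:10 +0000 2011",
     "user": {"screen_name": "example_b", "followers_count": 7, "lang": "en"},
     "text": "is a tweet mostly metadata? apparently yes"},
]

def size_bytes(s):
    """Raw size of a string, in UTF-8 bytes."""
    return len(s.encode("utf-8"))

def compressed_bytes(s):
    """zlib-compressed size, used here as a rough proxy for information content."""
    return len(zlib.compress(s.encode("utf-8"), 9))

# The message text alone versus the full serialized records (text included).
all_text = "\n".join(t["text"] for t in tweets)
all_json = "\n".join(json.dumps(t, sort_keys=True) for t in tweets)

raw_text_share = size_bytes(all_text) / size_bytes(all_json)
compressed_text_share = compressed_bytes(all_text) / compressed_bytes(all_json)

print("text share by raw size:        %.2f" % raw_text_share)
print("text share by compressed size: %.2f" % compressed_text_share)
```

<p>On a sample this small, the compressor's fixed overhead distorts the numbers; the proxy only starts to mean something on a large batch of real tweets.</p>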
<img src="http://feeds.feedburner.com/~r/BrendanOConnorsBlog/~4/v53dtmkqSNo" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://brenocon.com/blog/2011/06/how-much-text-versus-metadata-is-in-a-tweet/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://brenocon.com/blog/2011/06/how-much-text-versus-metadata-is-in-a-tweet/</feedburner:origLink></item>
	</channel>
</rss>

