<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0"><channel><title>streamhacker.com</title> <link>http://streamhacker.com</link> <description>Weotta be Hacking</description> <lastBuildDate>Sat, 21 Aug 2010 00:33:21 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.0.1</generator>   <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/StreamHacker" /><feedburner:info uri="streamhacker" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://superfeedr.com/hubbub" /><feedburner:emailServiceId>StreamHacker</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item><title>Announcing Text Processing APIs</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/HcVf5lH-Kwc/</link> <comments>http://streamhacker.com/2010/08/20/announcing-text-processing-api/#comments</comments> <pubDate>Sat, 21 Aug 2010 00:33:21 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[Uncategorized]]></category> <category><![CDATA[api]]></category> <category><![CDATA[chunking]]></category> <category><![CDATA[classification]]></category> <category><![CDATA[nlp]]></category> 
<category><![CDATA[nltk]]></category> <category><![CDATA[sentiment]]></category> <category><![CDATA[stemming]]></category> <category><![CDATA[tagging]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1327</guid> <description><![CDATA[Text mining and natural language processing APIs for stemming, sentiment analysis, part-of-speech tagging, chunking, and information extraction.]]></description> <content:encoded><![CDATA[<p>If you liked the <a
title="Python NLTK Demos" href="http://text-processing.com/demo/">NLTK demos</a>, then you'll love the <a
title="Text Processing API Docs" href="http://text-processing.com/docs/">text processing APIs</a>. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage &amp; demand. If you'd like to do more, please fill out this <a
title="Natural Language Processing Services" href="/nltk-services/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">survey</a> to let me know what your needs are.</p> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/HcVf5lH-Kwc" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/08/20/announcing-text-processing-api/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/08/20/announcing-text-processing-api/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Announcing Python NLTK Demos</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/RS_GWtsAto8/</link> <comments>http://streamhacker.com/2010/08/02/announcing-python-nltk-demos/#comments</comments> <pubDate>Mon, 02 Aug 2010 17:00:47 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[python]]></category> <category><![CDATA[chunking]]></category> <category><![CDATA[classification]]></category> <category><![CDATA[ner]]></category> <category><![CDATA[nlp]]></category> <category><![CDATA[nltk]]></category> <category><![CDATA[parsing]]></category> <category><![CDATA[sentiment]]></category> <category><![CDATA[tagging]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1300</guid> <description><![CDATA[Python NLTK demonstrations of part-of-speech tagging, chunk extraction, named entity recognition, and sentiment analysis with text classification. Includes links to similar resources on the web, such as demos of the Stanford Parser and FreeLing.]]></description> <content:encoded><![CDATA[<p>If you want to see what NLTK can do, but don't want to go thru the effort of installation and learning how to use it, then check out my <a
href="http://text-processing.com/demo/">Python NLTK demos</a>.</p><p>They currently demonstrate the following functionality:</p><ul><li><a
title="Part of Speech Tagging with Python NLTK" href="http://text-processing.com/demo/tag/">part-of-speech tagging</a> with the default NLTK pos tagger</li><li><a
title="Chunk Extraction and Named Entity Recognition with Python NLTK" href="http://text-processing.com/demo/tag/">chunking and named entity recognition</a> with the default NLTK chunker</li><li><a
title="Sentiment Analysis with Python NLTK" href="http://text-processing.com/demo/sentiment/">sentiment analysis</a> with a combination of a <a
title="Sentiment Analysis with NaiveBayesClassifier from Python NLTK" href="/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">naive bayes classifier</a> and a <em>maximum entropy classifier</em>, both trained on the movie reviews corpus</li></ul><p>If you like it, <strong><a
title="Share on Bitly" href="http://bit.ly/http://text-processing.com/demo/">please share it</a></strong>. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this <a
href="/nltk-services/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">NLTK services survey</a>.</p><h2>Other Natural Language Processing Demos</h2><p>Here's a list of similar resources on the web:</p><ul><li>A demo of the <a
href="http://nlp.stanford.edu/software/lex-parser.shtml">Stanford Parser</a> with a javascript API: <a
href="http://nlp.naturalparsing.com/browserparser/parse">Natural-language Parsing For The Web</a></li><li>A demo of the <a
href="http://www.lsi.upc.edu/~nlp/freeling/">FreeLing</a> language analysis suite: <a
href="http://garraf.epsevg.upc.es/freeling/demo.php">FreeLing Demo</a></li><li>Emotional identification from text: <a
href="http://dtminredis.housing.salle.url.edu:8080/EmoLib/">EmoLib</a></li></ul> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/RS_GWtsAto8" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/08/02/announcing-python-nltk-demos/feed/</wfw:commentRss> <slash:comments>7</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/08/02/announcing-python-nltk-demos/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Text Classification for Sentiment Analysis – Eliminate Low Information Features</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/CEoX5CRsuzg/</link> <comments>http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/#comments</comments> <pubDate>Wed, 16 Jun 2010 20:19:57 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[python]]></category> <category><![CDATA[bigrams]]></category> <category><![CDATA[chi square]]></category> <category><![CDATA[classification]]></category> <category><![CDATA[dimensionality]]></category> <category><![CDATA[feature extraction]]></category> <category><![CDATA[information gain]]></category> <category><![CDATA[sentiment]]></category> <category><![CDATA[statistics]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1246</guid> <description><![CDATA[Reduce dimensionality of a classifier with high information feature selection to significantly increase accuracy, precision, and recall. Information gain with Chi Square is calculated with NLTK BigramAssocMeasures.chi_sq to identify highly informative words for filtering features.]]></description> <content:encoded><![CDATA[<p>When your classification model has hundreds or thousands of features, as is the case for <a
title="Feature Selection in Text Mining" href="http://ewinarko.staff.ugm.ac.id/blog/?p=17">text categorization</a>, it's a good bet that many (if not most) of the features are <em>low information</em>. These are features that are common across all classes, and therefore contribute little information to the classification process. Individually they are harmless, but in aggregate, <strong>low information features can decrease performance</strong>.</p><p>Eliminating low information features gives your model clarity by removing <a
title="Noisy text analytics" href="http://en.wikipedia.org/wiki/Noisy_text_analytics">noisy data</a>. It can save you from overfitting and the <a
title="Curse of dimensionality" href="http://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a>. When you use only the higher information features, you can <a
title="Feature selection" href="http://www.dataminingblog.com/feature-selection/">increase performance</a> while also decreasing the size of the model, which results in less memory usage along with faster training and classification. <a
title="Feature selection" href="http://en.wikipedia.org/wiki/Feature_selection">Removing features</a> may seem intuitively wrong, but wait till you see the results.</p><h2>High Information Feature Selection</h2><p>Using the same evaluate_classifier method as in the previous post on <a
title="Evaluating stopwords and bigram collocations" href="/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">classifying with bigrams</a>, I got the following results using the 10000 most informative words:</p><pre>evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0</pre><p>Contrast this with the results from the first article on <a
title="Naive Bayes classifier" href="/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">classification for sentiment analysis</a>, where we use all the words as features:</p><pre>evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0</pre><p>The <strong>accuracy is over 20 percentage points higher when using only the best 10000 words</strong> and <strong>pos precision has increased almost 24 points</strong> while <strong>neg recall improved over 40 points</strong>. These are huge increases with no reduction in pos recall and even a slight increase in neg precision. Here's the full code I used to get these results, with an explanation below.</p><pre class="brush: python;">
# Written for Python 3 and NLTK 3.
import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	# train on the first 3/4 of each class, test on the remaining 1/4
	negcutoff = len(negfeats) * 3 // 4
	poscutoff = len(posfeats) * 3 // 4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	# refsets holds the true labels, testsets the predicted labels
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
		refsets[label].add(i)
		observed = classifier.classify(feats)
		testsets[observed].add(i)

	print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
	print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
	print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
	print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
	print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
	classifier.show_most_informative_features()

def word_feats(words):
	return dict([(word, True) for word in words])

print('evaluating single word features')
evaluate_classifier(word_feats)

# overall word frequencies, and word frequencies within each label
word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
	word_fd[word.lower()] += 1
	label_word_fd['pos'][word.lower()] += 1

for word in movie_reviews.words(categories=['neg']):
	word_fd[word.lower()] += 1
	label_word_fd['neg'][word.lower()] += 1

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

# score each word by how strongly it is associated with either label
for word, freq in word_fd.items():
	pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
		(freq, pos_word_count), total_word_count)
	neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
		(freq, neg_word_count), total_word_count)
	word_scores[word] = pos_score + neg_score

# keep the 10000 highest scoring words
best = sorted(word_scores.items(), key=lambda ws: ws[1], reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
	return dict([(word, True) for word in words if word in bestwords])

print('evaluating best word features')
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	d = dict([(bigram, True) for bigram in bigrams])
	d.update(best_word_feats(words))
	return d

print('evaluating best words + bigram chi_sq word features')
evaluate_classifier(best_bigram_word_feats)
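
# Sketch added for illustration, not part of the original script: the
# chi-square score of one word, computed by hand from the same four counts
# that are passed to BigramAssocMeasures.chi_sq above. The function name and
# the toy counts are hypothetical. This is the textbook 2x2 formula, which
# should agree with NLTK's value but is not NLTK's implementation.
def chi_sq_by_hand(n_ii, n_ix, n_xi, n_xx):
	a = n_ii                       # this word, inside the label
	b = n_ix - n_ii                # this word, outside the label
	c = n_xi - n_ii                # other words, inside the label
	d = n_xx - n_ix - n_xi + n_ii  # other words, outside the label
	denom = float((a + b) * (c + d) * (a + c) * (b + d))
	if denom == 0:
		return 0.0
	return n_xx * (a * d - b * c) ** 2 / denom

# A word seen 30 times in 'pos' and 10 times in 'neg' (1000 words per label)
# gets a high score; a word split 20/20 scores zero and would be filtered out.
print('skewed word: %.4f' % chi_sq_by_hand(30, 40, 1000, 2000))
print('even word: %.4f' % chi_sq_by_hand(20, 40, 1000, 2000))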
</pre><h2>Calculating Information Gain</h2><p>To find the highest information features, we need to calculate information gain for each word. <a
title="Information gain in decision trees" href="http://en.wikipedia.org/wiki/Information_gain_in_decision_trees">Information gain</a> for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes. A word that occurs primarily in positive movie reviews and rarely in negative reviews is high information. For example, the presence of the word "magnificent" in a movie review is a strong indicator that the review is positive. That makes "magnificent" a high information word. Notice that the most informative features above did not change. That makes sense because the point is to use only the most informative features and ignore the rest.</p><p>One of the best metrics for information gain is <a
title="Chi-square distribution" href="http://en.wikipedia.org/wiki/Chi-square_distribution">chi square</a>. NLTK includes this in the <a
title="BigramAssocMeasures class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics.association.BigramAssocMeasures-class.html">BigramAssocMeasures class</a> in the <a
title="metrics module" href="http://nltk.googlecode.com/svn/trunk/doc/api/toc-nltk.metrics-module.html">metrics package</a>. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a <a
title="FreqDist class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html">FreqDist</a> for overall frequency of words, and a <a
title="ConditionalFreqDist class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.ConditionalFreqDist-class.html">ConditionalFreqDist</a> where the conditions are the class labels. Once we have those numbers, we can score words with the <a
title="chi_sq function" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics.association.BigramAssocMeasures-class.html#chi_sq">BigramAssocMeasures.chi_sq</a> function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our <a
title="Introduction to feature selection (part 1)" href="http://www.dataminingblog.com/introduction-to-feature-selection-part-1/">feature selection</a> function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.</p><h2>Significant Bigrams</h2><p>The code above also evaluates the inclusion of 200 <a
title="Classification with significant bigrams" href="/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">significant bigram collocations</a>. Here are the results:</p><pre>evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
       ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
          ('give', 'us') = True              neg : pos    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
    ('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0</pre><p>This shows that <strong>bigrams don't matter much when using only high information words</strong>. In this case, the best way to evaluate the difference between including bigrams or not is to look at <a
title="Classifier Precision and Recall" href="/2010/05/17/text-classification-sentiment-analysis-precision-recall/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">precision and recall</a>. With the bigrams, we get more uniform performance in each class. Without bigrams, precision and recall are less balanced. But the differences may depend on your particular data, so don't assume these observations are always true.</p><h2>Improving Feature Selection</h2><p>The big lesson here is that <strong>improving feature selection will improve your classifier</strong>. <a
title="Dimensionality reduction" href="http://en.wikipedia.org/wiki/Dimensionality_reduction">Reducing dimensionality</a> is one of the single best things you can do to improve classifier performance. It's ok to throw away data if that data is not adding value. And it's especially recommended when that data is actually making your model worse.</p> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/CEoX5CRsuzg" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/feed/</wfw:commentRss> <slash:comments>18</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Text Classification for Sentiment Analysis – Stopwords and Collocations</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/JtvVo5tESQI/</link> <comments>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/#comments</comments> <pubDate>Mon, 24 May 2010 13:40:45 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[python]]></category> <category><![CDATA[bayes]]></category> <category><![CDATA[bigrams]]></category> <category><![CDATA[classification]]></category> <category><![CDATA[collocation]]></category> <category><![CDATA[correlation]]></category> <category><![CDATA[feature extraction]]></category> <category><![CDATA[nlp]]></category> <category><![CDATA[nltk]]></category> <category><![CDATA[sentiment]]></category> <category><![CDATA[statistics]]></category> <category><![CDATA[stopwords]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1227</guid> <description><![CDATA[Evaluation of how filtering stopwords and including bigram collocations affect the accuracy, precision, and recall of a Naive Bayes classifier used for sentiment analysis. Uses python NLTK and the movie reviews corpus with various feature extraction methods to train the Naive Bayes classifier for text categorization of positive and negative sentiment.]]></description> <content:encoded><![CDATA[<p>Improving <a
title="Feature extraction" href="http://en.wikipedia.org/wiki/Feature_extraction">feature extraction</a> can often have a significant positive impact on classifier accuracy (and <a
title="Text Classification for Sentiment Analysis - Precision and Recall" href="/2010/05/17/text-classification-sentiment-analysis-precision-recall/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">precision and recall</a>). In this article, I'll be evaluating two modifications of the <code>word_feats</code> feature extraction method:</p><ol><li>filter out <a
title="Stop words" href="http://en.wikipedia.org/wiki/Stop_words">stopwords</a></li><li>include <a
title="Bigram Collocations in Tom Sawyer" href="http://www.briandonovan.info/projects/FoSNLP/chapters/chpt-1/1.4.4-Collocations/html/">bigram collocations</a></li></ol><p>To do this effectively, we'll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we'll use these features to train a <a
title="Text Classification for Sentiment Analysis - Naive Bayes Classifier" href="/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Naive Bayes Classifier</a>.</p><pre class="brush: python;">
# Written for Python 3 and NLTK 3.
import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	# featx turns the words of one file into a feature dictionary
	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	# train on the first 3/4 of each class, test on the remaining 1/4
	negcutoff = len(negfeats) * 3 // 4
	poscutoff = len(posfeats) * 3 // 4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	# refsets holds the true labels, testsets the predicted labels
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
		refsets[label].add(i)
		observed = classifier.classify(feats)
		testsets[observed].add(i)

	print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
	print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
	print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
	print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
	print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
	classifier.show_most_informative_features()
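
# Sketch added for illustration, not part of the original script: what the
# precision and recall calls above compute for one label, shown on toy data.
# The helper names and the index sets are hypothetical.
def toy_precision(reference, observed):
	# share of the items predicted as the label that really carry it
	return float(len(reference.intersection(observed))) / len(observed)

def toy_recall(reference, observed):
	# share of the items carrying the label that were predicted as such
	return float(len(reference.intersection(observed))) / len(reference)

toy_ref = set([0, 1, 2, 3])  # test items whose true label is 'pos'
toy_obs = set([0, 1, 4])     # test items the classifier called 'pos'
print('toy pos precision: %.3f' % toy_precision(toy_ref, toy_obs))
print('toy pos recall: %.3f' % toy_recall(toy_ref, toy_obs))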
</pre><h2>Baseline Bag of Words Feature Extraction</h2><p>Here's the baseline feature extractor for bag of words feature selection.</p><pre class="brush: python;">
def word_feats(words):
	return dict([(word, True) for word in words])

evaluate_classifier(word_feats)
</pre><p>The results are the same as in the <a
title="Text Classification for Sentiment Analysis - Precision and Recall" href="/2010/05/17/text-classification-sentiment-analysis-precision-recall/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">previous</a> <a
title="Text Classification for Sentiment Analysis - Naive Bayes Classifier" href="/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">articles</a>, but I've included them here for reference:</p><pre>accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0</pre><h2>Stopword Filtering</h2><p><em>Stopwords</em> are words that are generally considered <em>useless</em>. Most search engines ignore these words because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a <a
href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus-module.html#stopwords">stopwords corpus</a> that includes a list of 128 English stopwords. Let's see what happens when we filter out these words.</p><pre class="brush: python;">
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
	return dict([(word, True) for word in words if word not in stopset])

evaluate_classifier(stopword_filtered_word_feats)
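
# Sketch added for illustration, not part of the original script: one
# possible way stopword filtering can discard sentiment. The stop set below
# is a tiny hand-picked stand-in (hypothetical), not the NLTK list, and it
# assumes "not" is treated as a stopword.
toy_stopset = set(['the', 'was', 'not', 'a'])

def toy_filtered_feats(words):
	return dict([(word, True) for word in words if word not in toy_stopset])

toy_review = ['the', 'plot', 'was', 'not', 'great']
# the negation is filtered out, leaving only 'great' and 'plot' as features
print(sorted(toy_filtered_feats(toy_review)))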
</pre><p>And the results for a stopword filtered bag of words are:</p><pre>accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472</pre><p>Accuracy went down 0.2 percentage points, and <em>pos precision</em> and <em>neg recall</em> dropped as well! Apparently <strong>stopwords add information to sentiment analysis classification</strong>. I did not include the most informative features since they did not change.</p><h2>Bigram Collocations</h2><p>As mentioned at the end of the article on <a
title="Classification precision and recall" href="/2010/05/17/text-classification-sentiment-analysis-precision-recall/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">precision and recall</a>, it's possible that including <a
title="Collocations, Chi-Squared Independence, and N-gram Count Boundary Conditions" href="http://lingpipe-blog.com/2008/05/28/collocations-chi-squared-independence-and-n-gram-count-boundary-conditions/">bigrams</a> will <a
title="Phrases, natural language processing, and artificial intelligence" href="http://srispot.wordpress.com/2008/07/03/phrases-natural-language-processing-and-artificial-intelligence/">improve classification accuracy</a>. The hypothesis is that people say things like "not great", which is a negative expression that the <a
title="Bag of words model" href="http://en.wikipedia.org/wiki/Bag_of_words_model">bag of words model</a> could interpret as positive since it sees "great" as a separate word.</p><p>To find significant bigrams, we can use <a
title="Bigram Collocation Finder" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.collocations.BigramCollocationFinder-class.html">nltk.collocations.BigramCollocationFinder</a> along with <a
title="Bigram Assoc Measures" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics.association.BigramAssocMeasures-class.html">nltk.metrics.BigramAssocMeasures</a>. The BigramCollocationFinder maintains 2 internal <a
title="Freq Dist" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html">FreqDists</a>, one for individual word frequencies, another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as <a
title="Finding Phrases - Two Statistical Approaches" href="http://sujitpal.blogspot.com/2009/11/finding-phrases-two-statistical.html">chi-square</a>. These scoring functions <a
title="Measures of correlation" href="http://nlpers.blogspot.com/2008/05/measures-of-correlation.html">measure the collocation correlation</a> of 2 words, basically whether the bigram occurs about as frequently as each individual word.</p><pre class="brush: python;">
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

evaluate_classifier(bigram_word_feats)
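
# Aside, a minimal sketch (the textbook 2x2 Pearson chi-square, not NLTK's
# source): how a bigram can be scored from counts like those the finder keeps.
# The function name and the toy counts below are assumptions for illustration.
def toy_chi_sq(n_bigram, n_word1, n_word2, n_total):
	a = n_bigram                  # word1 followed by word2
	b = n_word1 - n_bigram        # word1 followed by something else
	c = n_word2 - n_bigram        # something else followed by word2
	d = n_total - a - b - c       # neither word in its slot
	num = float(n_total) * (a * d - b * c) ** 2
	return num / ((a + b) * (c + d) * (a + c) * (b + d))

# two words seen 10 times each among 100 bigrams
strong = toy_chi_sq(8, 10, 10, 100)   # together 8 times: high score
chance = toy_chi_sq(1, 10, 10, 100)   # together once, as chance predicts: 0.0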
</pre><p>After some experimentation, I found that using the 200 best bigrams from each file produced great results:</p><pre>accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
   ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
      ('give', 'us') = True              neg : pos    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0</pre><p>Yes, you read that right, <a
title="Matt Damon" href="http://en.wikipedia.org/wiki/Matt_Damon">Matt Damon</a> is apparently one of the best predictors for positive sentiment in movie reviews. But despite this chuckle-worthy result:</p><ul><li>accuracy is up almost 9%</li><li><code>pos</code> precision has increased over 10% with only a 4% drop in recall</li><li><code>neg</code> recall has increased over 21% with just under a 4% drop in precision</li></ul><p>So it appears that the bigram hypothesis is correct, and <strong>including significant bigrams can increase classifier effectiveness</strong>. Note that it's <em>significant bigrams</em> that <a
title="The use of bigrams to enhance text categorization" href="http://portal.acm.org/citation.cfm?id=603538">enhance effectiveness</a>. I tried using <a
title="bigrams" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#bigrams">nltk.util.bigrams</a> to include all bigrams, and the results were only a few points above baseline. This points to the idea that including only significant features can improve accuracy compared to using all features. In a future article, I'll try trimming down the single word features to only include significant words.</p> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/JtvVo5tESQI" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/feed/</wfw:commentRss> <slash:comments>12</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Text Classification for Sentiment Analysis – Precision and Recall</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/KiKyDaN3iQI/</link> <comments>http://streamhacker.com/2010/05/17/text-classification-sentiment-analysis-precision-recall/#comments</comments> <pubDate>Mon, 17 May 2010 14:45:56 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[python]]></category> <category><![CDATA[bayes]]></category> <category><![CDATA[classification]]></category> <category><![CDATA[feature extraction]]></category> <category><![CDATA[performance]]></category> <category><![CDATA[precision]]></category> <category><![CDATA[recall]]></category> <category><![CDATA[sentiment]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1201</guid> <description><![CDATA[How to use precision and recall to evaluate the effectiveness of a Naive Bayes Classifier used for sentiment analysis. Precision and recall provide more insight into classification performance than F-measure or accuracy, and are available in the Python NLTK metrics module.]]></description> <content:encoded><![CDATA[<p><a
title="Statistics to English Translation, Part 1: Accuracy Measures" href="http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/">Accuracy</a> is not the only metric for <a
title="Binary classification evaluation in R via ROCR" href="http://anyall.org/blog/2009/04/binary-classification-evaluation-in-r-via-rocr/">evaluating</a> the effectiveness of a <a
title="Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick)" href="http://blog.crowdflower.com/2008/06/aggregate-turker-judgments-threshold-calibration/">classifier</a>. Two other useful metrics are <a
title="Precision and recall" href="http://en.wikipedia.org/wiki/Precision_and_recall">precision and recall</a>. These two metrics can provide much greater insight into the <a
title="Evaluation of unranked retrieval sets" href="http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-unranked-retrieval-sets-1.html">performance characteristics</a> of a <a
title="Binary classification" href="http://en.wikipedia.org/wiki/Binary_classification">binary classifier</a>.</p><h2>Classifier Precision</h2><p><a
title="Accuracy and precision in binary classification" href="http://en.wikipedia.org/wiki/Accuracy_and_precision#Accuracy_and_precision_in_binary_classification">Precision</a> measures the exactness of a classifier. A higher precision means fewer <a
title="Type I Error" href="http://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Type_I_error">false positives</a>, while a lower precision means more false positives. This is often at odds with recall, as an easy way to <a
title="Precision-recall trade off ~ misclassification cost" href="http://non-non-sense.blogspot.com/2010/01/precision-recall-trade-off.html">improve precision</a> is to decrease recall.</p><h2>Classifier Recall</h2><p><em>Recall</em> measures the completeness, or <a
title="Sensitivity and specificity" href="http://en.wikipedia.org/wiki/Sensitivity_and_specificity">sensitivity</a>, of a classifier. Higher recall means fewer <a
title="Type II Error" href="http://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Type_II_error">false negatives</a>, while lower recall means more false negatives. Improving recall can often decrease precision because it gets increasingly hard to be precise as the sample space increases.</p><h2>F-measure Metric</h2><p>Precision and recall can be combined to produce a single metric known as <em>F-measure</em>, which is the weighted harmonic mean of precision and recall. I find F-measure to be about as <a
title="F-measure versus Accuracy" href="http://nlpers.blogspot.com/2007/10/f-measure-versus-accuracy.html">useful as accuracy</a>. Or in other words, compared to precision &amp; recall, <a
title="Building High Precision Classifiers" href="http://lingpipe-blog.com/2010/04/22/high-precision-classifiers-taggers-chunkers-spelling/">F-measure is mostly useless</a>, as you'll see below.</p><h2>Measuring Precision and Recall of a Naive Bayes Classifier</h2><p>The <a
title="NLTK metrics" href="http://nltk.googlecode.com/svn/trunk/doc/api/toc-nltk.metrics-module.html">NLTK metrics module</a> provides functions for calculating all three metrics mentioned above. But to do so, you need to build 2 sets for each classification label: a <em>reference set</em> of correct values, and a <em>test set</em> of observed values. Below is a modified version of the code from the previous article, where we trained a <a
title="Text Classification for Sentiment Analysis - Naive Bayes Classifier" href="/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Naive Bayes Classifier</a>. This time, instead of measuring accuracy, we'll <a
title="Python defaultdict collection" href="http://docs.python.org/library/collections.html#defaultdict-objects">collect</a> reference values and observed values for each label (pos or neg), then use those <a
title="Python sets" href="http://docs.python.org/library/stdtypes.html#set">sets</a> to calculate the <a
title="NLTK metrics precision" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics-module.html#precision">precision</a>, <a
title="NLTK metrics recall" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics-module.html#recall">recall</a>, and <a
title="NLTK metrics f_measure" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics-module.html#f_measure">F-measure</a> of the Naive Bayes classifier. The values collected are simply the index of each featureset, obtained using <a
title="Python enumerate function" href="http://docs.python.org/library/functions.html#enumerate">enumerate</a>.</p><pre class="brush: python;">
import collections
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
	return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# integer division keeps the cutoffs usable as slice indexes
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
# for each label, refsets holds the indexes of test instances that truly have
# it, and testsets holds the indexes the classifier assigned to it
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
	refsets[label].add(i)
	observed = classifier.classify(feats)
	testsets[observed].add(i)

print('pos precision: %s' % nltk.metrics.precision(refsets['pos'], testsets['pos']))
print('pos recall: %s' % nltk.metrics.recall(refsets['pos'], testsets['pos']))
print('pos F-measure: %s' % nltk.metrics.f_measure(refsets['pos'], testsets['pos']))
print('neg precision: %s' % nltk.metrics.precision(refsets['neg'], testsets['neg']))
print('neg recall: %s' % nltk.metrics.recall(refsets['neg'], testsets['neg']))
print('neg F-measure: %s' % nltk.metrics.f_measure(refsets['neg'], testsets['neg']))
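
# Aside, a minimal sketch (plain set arithmetic, not NLTK's source): what the
# three metrics compute from a reference set and a test set of indexes. The
# function names and the toy sets are assumptions for illustration, and the
# functions assume non-empty sets with a non-empty overlap.
def set_precision(reference, test):
	# share of the predicted items that are correct
	return float(len(reference.intersection(test))) / len(test)

def set_recall(reference, test):
	# share of the correct items that were predicted
	return float(len(reference.intersection(test))) / len(reference)

def set_f_measure(reference, test, alpha=0.5):
	# weighted harmonic mean of precision and recall
	p = set_precision(reference, test)
	r = set_recall(reference, test)
	return 1.0 / (alpha / p + (1 - alpha) / r)

toy_ref = set([0, 1, 2, 3])    # instances that truly carry the label
toy_test = set([2, 3, 4])      # instances the classifier gave the label
toy_precision = set_precision(toy_ref, toy_test)   # 2 of 3 predictions correct
toy_recall = set_recall(toy_ref, toy_test)         # 2 of 4 true items found
toy_f = set_f_measure(toy_ref, toy_test)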
</pre><h2>Precision and Recall for Positive and Negative Reviews</h2><p>I found the results quite interesting:</p><pre>pos precision: 0.651595744681
pos recall: 0.98
pos F-measure: 0.782747603834
neg precision: 0.959677419355
neg recall: 0.476
neg F-measure: 0.636363636364</pre><p>So what does this mean?</p><ol><li>Nearly every file that is pos is correctly identified as such, with 98% recall. This means very few <em>false negatives</em> in the pos class.</li><li>But a file given a pos classification is only 65% likely to be correct. Not-so-good precision leads to <strong>35% false positives</strong> for the pos label.</li><li>Any file that is identified as neg is 96% likely to be correct (high precision). This means very few <em>false positives</em> for the neg class.</li><li>But many files that are neg are incorrectly classified. Low recall causes <strong>52% false negatives</strong> for the neg label.</li><li><strong>F-measure provides no useful information</strong>. There's no insight to be gained from having it, and we wouldn't lose any knowledge if it were taken away.</li></ol><h2>Improving Results with Better Feature Selection</h2><p>One possible explanation for the above results is that people use normally positive words in negative reviews, but the word is preceded by "not" (or some other <a
title="Negative Words Dominate Language" href="http://abcnews.go.com/Technology/DyeHard/story?id=460987&amp;page=1">negative word</a>), such as "not great". And since the classifier uses the <a
title="Bag of words model" href="http://en.wikipedia.org/wiki/Bag_of_words_model">bag of words</a> model, which assumes every word is independent, it cannot learn that "not great" is negative. If this is the case, then these metrics should improve if we also train on <a
title="n-gram" href="http://en.wikipedia.org/wiki/Ngram">multiple words</a>, a topic I'll explore in a future article.</p><p>Another possibility is the abundance of naturally neutral words, the kind of words that are devoid of sentiment. But the classifier treats all words the same, and has to assign each word to either pos or neg. So maybe otherwise neutral or meaningless words are being placed in the pos class because the classifier doesn't know what else to do. If this is the case, then the metrics should improve if we eliminate the neutral or meaningless words from the featuresets, and only classify using <em>sentiment rich</em> words. This is usually done using the concept of <a
title="Quantized Information Gain (Conditional Entropy)" href="http://lingpipe-blog.com/2009/05/14/quantized-information-gain-for-real-count-valued-features/">information gain</a>, aka <a
title="Mutual information" href="http://en.wikipedia.org/wiki/Mutual_information">mutual information</a>, to improve <a
title="Feature Selection" href="http://redwriteshere.com/post/321476335/feature-selection">feature selection</a>, which I'll also explore in a future article.</p><p>If you have your own theories to explain the results, or ideas on how to improve precision and recall, please share in the comments.</p> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/KiKyDaN3iQI" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/05/17/text-classification-sentiment-analysis-precision-recall/feed/</wfw:commentRss> <slash:comments>9</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/05/17/text-classification-sentiment-analysis-precision-recall/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Text Classification for Sentiment Analysis – Naive Bayes Classifier</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/cSYlXaI7IPQ/</link> <comments>http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/#comments</comments> <pubDate>Mon, 10 May 2010 14:32:18 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[python]]></category> <category><![CDATA[bayes]]></category> <category><![CDATA[classification]]></category> <category><![CDATA[nlp]]></category> <category><![CDATA[nltk]]></category> <category><![CDATA[sentiment]]></category> <category><![CDATA[statistics]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1180</guid> <description><![CDATA[Sentiment analysis with python and NLTK using a Naive Bayes Classifier to classify text. The classifier is trained using supervised learning on a movie reviews corpus that has already been categorized into positive and negative polarity labels.]]></description> <content:encoded><![CDATA[<p><a
title="Sentiment Analysis / Opinion Mining" href="http://en.wikipedia.org/wiki/Sentiment_analysis">Sentiment analysis</a> is becoming a <a
title="5 ways sentiment analysis is ramping up in 2009" href="http://www.readwriteweb.com/archives/sentiment_analysis_is_ramping_up_in_2009.php">popular</a> area of <a
title="Everything I Need To Know About Sentiment Analysis" href="http://www.semanticweb.com/news/everything_i_need_to_know_about_sentiment_analysis_158453.asp">research</a> and <a
title="How Companies Can Use Sentiment Analysis to Improve Their Business" href="http://mashable.com/2010/04/19/sentiment-analysis/">social media analysis</a>, especially around <a
title="Mining the Web for Feelings, Not Facts" href="http://www.nytimes.com/2009/08/24/technology/internet/24emotion.html?_r=1">user reviews</a> and <a
title="5 Free Ways To Track Twitter Sentiment" href="http://smallbiztrends.com/2010/03/tracking-twitter-sentiment.html">tweets</a>. It is a special case of <a
title="Text (data) mining" href="http://en.wikipedia.org/wiki/Text_mining">text mining</a> generally focused on identifying <a
title="What Is Automated Sentiment Analysis Good For?" href="http://www.sevendayworkweek.com/web-analytics-tools/what-is-automated-sentiment-analysis-good-for/">opinion polarity</a>, and while it's often <a
title="Is Automated Sentiment Analysis Reliable?" href="http://www.marketingpilgrim.com/2009/08/why-sentiment-analysis-is-about-as-reliable-as-a-canary-in-a-coal-mine.html">not very accurate</a>, it can still be <a
title="Sentiment Analysis: Can you get it right by just automating it?" href="http://priyankmohan.blogspot.com/2010/04/sentiment-analysis-can-you-get-it-right.html">useful</a>. For simplicity (and because the training data is easily accessible) I'll focus on 2 possible sentiment <a
title="Learning to Classify Text" href="http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html">classifications</a>: <em>positive</em> and <em>negative</em>.</p><h2>NLTK Naive Bayes Classification</h2><p><a
title="Natural Language Toolkit" href="http://www.nltk.org/">NLTK</a> comes with all the pieces you need to get started on sentiment analysis: a <a
href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus-module.html#movie_reviews">movie reviews corpus</a> with reviews categorized into <em>pos</em> and <em>neg</em> categories, and a number of trainable <a
title="NLTK Classify module" href="http://nltk.googlecode.com/svn/trunk/doc/api/toc-nltk.classify-module.html">classifiers</a>. We'll start with a simple <a
title="NLTK Naive Bayes Classifier" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.naivebayes.NaiveBayesClassifier-class.html">NaiveBayesClassifier</a> as a baseline, using boolean word <a
title="Document Classification" href="http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html#document-classification">feature extraction</a>.</p><h2>Bag of Words Feature Extraction</h2><p>All of the NLTK classifiers work with <a
title="NLTK featstruct module" href="http://nltk.googlecode.com/svn/trunk/doc/api/toc-nltk.featstruct-module.html">featstructs</a>, which can be simple dictionaries mapping a <em>feature name</em> to a <em>feature value</em>. For text, we'll use a simplified <a
title="unordered collection of words" href="http://en.wikipedia.org/wiki/Bag_of_words_model">bag of words model</a> where every word is a feature name with a value of True. Here's the feature extraction method:</p><pre class="brush: python;">
def word_feats(words):
	return dict([(word, True) for word in words])
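
# Aside, a minimal sketch: the same mapping built inline for a made-up token
# list (hypothetical, not taken from the corpus). Repeated words collapse into
# one feature, so the model records presence rather than counts.
sample_tokens = ['a', 'great', 'great', 'movie']
sample_feats = dict([(word, True) for word in sample_tokens])
# sample_feats == {'a': True, 'great': True, 'movie': True}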
</pre><h2>Training Set vs Test Set and Accuracy</h2><p>The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the <a
title="Supervised machine learning" href="http://en.wikipedia.org/wiki/Supervised_learning">training set</a>, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier <a
title="NLTK NaiveBayesClassifier.train" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.naivebayes.NaiveBayesClassifier-class.html#train">training method</a> expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'pos' or 'neg'. For accuracy evaluation, we can use <a
href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.util-module.html#accuracy">nltk.classify.util.accuracy</a> with the test set as the gold standard.</p><h2>Training and Testing the Naive Bayes Classifier</h2><p>Here's the complete python code for training and testing a <a
title="Bayes Classifier" href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes Classifier</a> on the movie review corpus.</p><pre class="brush: python;">
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
	return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# one (feats, label) pair per review file
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# integer division keeps the cutoffs usable as slice indexes
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
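
# Aside, a minimal sketch of the 3/4 split on stand-in data. The toy lists
# below are hypothetical placeholders for negfeats and posfeats; each item has
# the (feats, label) shape that the training method expects.
toy_neg = [({'word%d' % i: True}, 'neg') for i in range(8)]
toy_pos = [({'word%d' % i: True}, 'pos') for i in range(8)]
toy_cutoff = len(toy_neg) * 3 // 4
toy_train = toy_neg[:toy_cutoff] + toy_pos[:toy_cutoff]
toy_test = toy_neg[toy_cutoff:] + toy_pos[toy_cutoff:]
# 8 * 3 // 4 == 6, so toy_train holds 12 instances and toy_test holds 4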
</pre><p>And the output is:</p><pre>train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0</pre><p>As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we're able to get almost 73% accuracy. This is somewhat near <a
title="Sentiment Analysis best done by humans" href="http://www.webmetricsguru.com/archives/2010/04/sentiment-analysis-best-done-by-humans/">human accuracy</a>, as apparently people agree on sentiment only <a
title="Is Sentiment Analysis an 80% Solution?" href="http://intelligent-enterprise.informationweek.com/channels/business_intelligence/showArticle.jhtml;jsessionid=CDFC5V5I1WXU5QE1GHPCKH4ATMY32JVN?articleID=224200667">around 80%</a> of the time. Future articles in this series will cover <a
title="Text Classification Precision &amp; Recall" href="/2010/05/17/text-classification-sentiment-analysis-precision-recall/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">precision &amp; recall metrics</a>, alternative classifiers, and techniques for improving accuracy.</p> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/cSYlXaI7IPQ" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/feed/</wfw:commentRss> <slash:comments>8</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Linguistic and Natural Language Processing Links</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/rhugLNEI_hw/</link> <comments>http://streamhacker.com/2010/05/01/nlp-linguistic-links/#comments</comments> <pubDate>Sat, 01 May 2010 18:00:48 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[links]]></category> <category><![CDATA[linguistics]]></category> <category><![CDATA[nlp]]></category> <category><![CDATA[nltk]]></category> <category><![CDATA[spelling]]></category> <category><![CDATA[stemming]]></category> <category><![CDATA[tagging]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1043</guid> <description><![CDATA[Natural language processing and linguistics links]]></description> <content:encoded><![CDATA[<p>A number of links related to natural language processing and linguistics:</p><ul><li><a
href="http://www.ideaeng.com/tabId/98/itemId/180/Whats-the-Difference-Between-Stemming-and-Lemmati.aspx">What’s the Difference Between Stemming and Lemmatization? - Ask Dr. Search</a></li><li><a
href="http://kmi.tugraz.at/staff/markus/datasets/">A List of Social Tagging Datasets Made Available for Research</a></li><li><a
href="http://www.cpdiehl.org/2010/04/social-signaling-and-language-use.html">Social Signaling and Language Use</a></li><li><a
href="http://datamining.typepad.com/data_mining/2009/02/lexical-growth-in-the-blogosphere.html">Lexical Growth in the Blogosphere</a></li><li><a
href="http://www.biais.org/blog/index.php/2007/01/31/25-spelling-correction-using-the-python-natural-language-toolkit-nltk">Spelling correction using the Python Natural Language Toolkit (nltk)</a></li><li><a
href="http://urd.let.rug.nl/tiedeman/OPUS/">OPUS - an open source parallel corpus</a></li><li><a
href="http://workproduct.wordpress.com/2008/11/07/evaluating-pos-taggers-the-contenders/">Evaluating POS Taggers: The Contenders</a></li><li><a
href="http://textanalytics.wikidot.com/">Text Analytics Wiki</a></li></ul> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/rhugLNEI_hw" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/05/01/nlp-linguistic-links/feed/</wfw:commentRss> <slash:comments>0</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/05/01/nlp-linguistic-links/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/B0HAWGY_MDU/</link> <comments>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/#comments</comments> <pubDate>Mon, 12 Apr 2010 15:24:41 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[python]]></category> <category><![CDATA[brill]]></category> <category><![CDATA[classification]]></category> <category><![CDATA[nlp]]></category> <category><![CDATA[nltk]]></category> <category><![CDATA[tagging]]></category> <category><![CDATA[treebank]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1116</guid> <description><![CDATA[Evaluating the accuracy of a Brill Tagger for pos tagging compared to a Classifier Based Tagger and NLTK's pre-trained tagger used by nltk.pos_tag.]]></description> <content:encoded><![CDATA[<p>In previous installments on <a
href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">part-of-speech tagging</a>, we saw that a <a
title="Part of Speech Tagging with BrillTagger" href="/2008/12/03/part-of-speech-tagging-with-nltk-part-3/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Brill Tagger</a> provides significant accuracy improvements over the <a
title="Part of Speech Tagging with UnigramTagger, BigramTagger and TrigramTagger" href="/2008/11/03/part-of-speech-tagging-with-nltk-part-1/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Ngram Taggers</a> combined with <a
title="Part of Speech Tagging with RegexTagger and AffixTagger" href="/2008/11/10/part-of-speech-tagging-with-nltk-part-2/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Regex and Affix Tagging</a>.</p><p>With the latest <a
title="NLTK downloads" href="http://code.google.com/p/nltk/downloads/list">2.0 beta releases</a> (2.0b8 as of this writing), <a
title="Natural Language Toolkit" href="http://www.nltk.org/">NLTK</a> has included a <a
title="Classifier Based Tagger Class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.sequential.ClassifierBasedTagger-class.html">ClassifierBasedTagger</a> as well as a pre-trained tagger used by the <a
title="pos_tag" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag-module.html#pos_tag">nltk.tag.pos_tag</a> function. Based on the name, the pre-trained tagger appears to be a <code>ClassifierBasedTagger</code> trained on the <a
title="The Penn Treebank Project" href="http://www.cis.upenn.edu/~treebank/">treebank corpus</a> using a <a
title="Maxent Classifier class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.maxent.MaxentClassifier-class.html">MaxentClassifier</a>. So let's see how a classifier tagger compares to the brill tagger.</p><h2>Training Sets</h2><p>For the <a
title="Brown Corpus Manual" href="http://icame.uib.no/brown/bcm.html">brown corpus</a>, I trained on 2/3 of the <em>reviews</em>, <em>lore</em>, and <em>romance</em> categories, and tested against the remaining 1/3. For conll2000, I used the standard <em>train.txt</em> vs <em>test.txt</em>. And for <a
title="Treebank corpus" href="http://en.wikipedia.org/wiki/Treebank">treebank</a>, I again used a 2/3 vs 1/3 split.</p><pre class="brush: python;">
import itertools
from nltk.corpus import brown, conll2000, treebank

# Floor division (//) keeps every cutoff an integer slice index.
brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 // 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 // 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 // 3

# First 2/3 of each category for training, remaining 1/3 for testing.
brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
	brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
	brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

# conll2000 ships with a standard train/test split.
conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')

treebank_cutoff = len(treebank.tagged_sents()) * 2 // 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]
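</pre><p>The same 2/3 vs 1/3 split can be wrapped in a small helper. This is only a sketch: <code>split_train_test</code> is a hypothetical name, not part of NLTK, and the stand-in list below takes the place of a real list of tagged sentences:</p><pre class="brush: python;">
def split_train_test(sents, numerator=2, denominator=3):
	# Floor division keeps the cutoff a valid integer slice index.
	cutoff = len(sents) * numerator // denominator
	return sents[:cutoff], sents[cutoff:]

# Stand-in data; any sliceable sequence of sentences splits the same way.
train, test = split_train_test(list(range(9)))
print('%d train, %d test' % (len(train), len(test)))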
</pre><h2>Classifier Taggers</h2><p>There are 3 new taggers referenced below:</p><ul><li><code>cpos</code> is an instance of <a
title="Classifier Based POS Tagger class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.sequential.ClassifierBasedPOSTagger-class.html">ClassifierBasedPOSTagger</a> using the default <a
title="Naive Bayes Classifier class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.naivebayes.NaiveBayesClassifier-class.html">NaiveBayesClassifier</a>. It was trained by doing <code>ClassifierBasedPOSTagger(train=train_sents)</code></li><li><code>craubt</code> is like <code>cpos</code>, but has the <code>raubt</code> tagger from <a
title="Part of Speech Tagging Part 2" href="/2008/11/10/part-of-speech-tagging-with-nltk-part-2/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">part 2</a> as a backoff tagger by doing <code>ClassifierBasedPOSTagger(train=train_sents,</code> <code>backoff=raubt)</code></li><li><code>bcpos</code> is a <a
title="Brill Tagger class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.brill.BrillTagger-class.html">BrillTagger</a> using <code>cpos</code> as its initial tagger instead of <code>raubt</code>.</li></ul><p>The <code>raubt</code> tagger is the same as from <a
title="Regex and Affix Taggers" href="/2008/11/10/part-of-speech-tagging-with-nltk-part-2/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">part 2</a>, and <code>braubt</code> is from <a
title="Brill Tagger" href="/2008/12/03/part-of-speech-tagging-with-nltk-part-3/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">part 3</a>.</p><p><code>postag</code> is NLTK's pre-trained tagger used by the <a
title="Using a Tagger with nltk.pos_tag" href="http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html#using-a-tagger">pos_tag</a> function. It can be loaded using <code>nltk.data.load(nltk.tag._POS_TAGGER)</code>.</p><h2>Accuracy Evaluation</h2><p>Tagger accuracy was determined by calling the <a
title="TaggerI.evaluate" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.api.TaggerI-class.html#evaluate">evaluate</a> method with the test set on each trained tagger. Here are the results:</p><p><img
src="http://chart.apis.google.com/chart?chxt=y,x&amp;chd=s:wxwwyM,122232,yy0009&amp;chxr=0,50,100&amp;chxtc=0,-600|1,-600&amp;chco=61380B,04B404,7401DF&amp;chs=500x400&amp;cht=lc&amp;chxl=0:|50|55|60|65|70|75|80|85|90|95|100|1:|raubt|braubt|cpos|craubt|bcpos|postag&amp;chls=1,1,0|1,1,0|1,1,0&amp;chdl=brown|conll2000|treebank" alt="brill vs classifier tagger accuracy chart" width="500" height="400" /></p><h2>Conclusions</h2><p>The above results are quite interesting, and lead to a few conclusions:</p><ol><li>Training data is hugely significant when it comes to accuracy. This is why <code>postag</code> takes a huge nose dive on <code>brown</code>, yet gets near 100% accuracy on <code>treebank</code>.</li><li>A <a
title="ClassifierBasedPOSTagger class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.sequential.ClassifierBasedPOSTagger-class.html">ClassifierBasedPOSTagger</a> does not need a <a
title="Sequential Backoff Tagger class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.sequential.SequentialBackoffTagger-class.html">backoff tagger</a>, since <code>cpos</code> accuracy is exactly the same as for <code>craubt</code> across all corpora.</li><li>The <a
title="Classifier Based POS Tagger class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.sequential.ClassifierBasedPOSTagger-class.html">ClassifierBasedPOSTagger</a> is not necessarily more accurate than the <code>braubt</code> tagger from <a
title="BrillTagger" href="/2008/12/03/part-of-speech-tagging-with-nltk-part-3/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">part 3</a> (at least with the default <a
title="ClassifierBasedPOSTagger.feature_detector" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.sequential-pysrc.html#ClassifierBasedPOSTagger.feature_detector">feature detector</a>). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.</li><li>Using <a
title="BrillTagger class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.brill.BrillTagger-class.html">BrillTagger</a> will nearly always increase the accuracy of your initial tagger, but not by much.</li></ol><p>I was also surprised at how much more accurate <code>postag</code> was compared to <code>cpos</code>. Thinking that <code>postag</code> was probably trained on the full treebank corpus, I did the same, and re-evaluated:</p><pre class="brush: python;">
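from nltk.tag.sequential import ClassifierBasedPOSTagger
# Train on the full treebank corpus, which includes the test sentences
cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)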
</pre><p>The result was 98.08% accuracy. So the remaining 2% difference must be due to the <a
title="MaxentClassifier class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.maxent.MaxentClassifier-class.html">MaxentClassifier</a> being more accurate than <a
title="NaiveBayesClassifier class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.naivebayes.NaiveBayesClassifier-class.html">NaiveBayesClassifier</a>, and/or the use of a different feature detector. I tried again with <code>classifier_builder=MaxentClassifier.train</code> and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.</p><h2>Efficiency</h2><p>On the <a
title="nltk-users mailing list" href="http://groups.google.com/group/nltk-users">nltk-users list</a>, there was a question about <a
href="http://groups.google.com/group/nltk-users/msg/a82768daa23a5932">which tagger is the most computationally economic</a>. I can't tell you the right answer, but I can definitely say that <a
title="Classifier Based POS Tagger class" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.sequential.ClassifierBasedPOSTagger-class.html">ClassifierBasedPOSTagger</a> is the wrong answer. During accuracy evaluation, I noticed that the <code>cpos</code> tagger took a lot longer than <code>raubt</code> or <code>braubt</code>. So I ran <a
title="Measure execution time of small code snippets" href="http://docs.python.org/library/timeit.html">timeit</a> on the <a
title="TaggerI.tag" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.api.TaggerI-class.html#tag">tag</a> method of each tagger, and got the following results:</p><table
border="1"><colgroup><col
width="38%"></col><col
width="63%"></col></colgroup><thead><tr><th>Tagger</th><th>secs/pass</th></tr></thead><tbody><tr><td>raubt</td><td>0.00005</td></tr><tr><td>braubt</td><td>0.00009</td></tr><tr><td>cpos</td><td>0.02219</td></tr><tr><td>bcpos</td><td>0.02259</td></tr><tr><td>postag</td><td>0.01241</td></tr></tbody></table><p>This was run with Python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3 GB of RAM, but the important thing is the relative times. <code>braubt</code> is <strong>over 246 times faster</strong> than <code>cpos</code>! To put it another way, since each pass tags a 6-word sentence, <code>braubt</code> can process over 66666 words/sec, whereas <code>cpos</code> can only do 270 words/sec and <code>postag</code> only 483 words/sec. So the lesson is: <strong>do not use a classifier based tagger if speed is an issue</strong>.</p><p>Here's the code for timing <code>postag</code>. You can time any other pickled tagger the same way: replace <code>nltk.tag._POS_TAGGER</code> with an <a
title="NLTK Data module" href="http://nltk.googlecode.com/svn/trunk/doc/api/toc-nltk.data-module.html">nltk.data accessible path</a> ending in <em>.pickle</em> when calling the <a
title="nltk.data.load" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load">load method</a>.</p><pre class="brush: python;">
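import nltk, timeit

# Tokenize once; the token list is interpolated into the timed statement.
text = nltk.word_tokenize('And now for something completely different')
# The setup string runs once and loads the pickled tagger.
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print('timing postag 1000 times')
spent = t.timeit(number=1000)
# Average time for a single call to tag()
print('took %.5f secs/pass' % (spent / 1000))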
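</pre><p>The words/sec figures quoted above can be reproduced from the secs/pass table, assuming each pass tags the 6 tokens of the test sentence. This is just a sketch of the arithmetic; the helper name <code>words_per_sec</code> is made up for illustration:</p><pre class="brush: python;">
# secs/pass values copied from the timing table above
secs_per_pass = {'raubt': 0.00005, 'braubt': 0.00009, 'cpos': 0.02219,
	'bcpos': 0.02259, 'postag': 0.01241}
num_tokens = 6  # 'And now for something completely different'

def words_per_sec(secs):
	# Tokens tagged per pass divided by seconds per pass
	return num_tokens / float(secs)

# Fastest tagger first
for name in sorted(secs_per_pass, key=secs_per_pass.get):
	print('%-6s %.0f words/sec' % (name, words_per_sec(secs_per_pass[name])))

# How many times faster braubt is than cpos
speedup = secs_per_pass['cpos'] / secs_per_pass['braubt']
print('braubt is %.1f times faster than cpos' % speedup)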
</pre><h3>File Size</h3><p>There's also a significant difference in the file size of the pickled taggers (trained on treebank):</p><table
border="1"><colgroup><col
width="60%"></col><col
width="40%"></col></colgroup><thead><tr><th>Tagger</th><th>Size</th></tr></thead><tbody><tr><td>raubt</td><td>272K</td></tr><tr><td>braubt</td><td>273K</td></tr><tr><td>cpos</td><td>3.8M</td></tr><tr><td>bcpos</td><td>3.8M</td></tr><tr><td>postag</td><td>8.2M</td></tr></tbody></table><h2>Fin</h2><p>I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.</p> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/B0HAWGY_MDU" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/feed/</wfw:commentRss> <slash:comments>12</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Python Logging Filters</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/WnhUngxhh8Y/</link> <comments>http://streamhacker.com/2010/04/08/python-logging-filters/#comments</comments> <pubDate>Thu, 08 Apr 2010 16:00:50 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[python]]></category> <category><![CDATA[filter]]></category> <category><![CDATA[logging]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1098</guid> <description><![CDATA[Create a python logging Filter object to filter log records, then add it to a custom Handler that can be specified in a logging configuration file.]]></description> <content:encoded><![CDATA[<p>The <a
title="Python Logging 101" href="http://plumberjack.blogspot.com/2009/09/python-logging-101.html">python logging</a> <a
title="Logging facility for Python" href="http://docs.python.org/library/logging.html">package</a> provides a <a
title="Filter objects" href="http://docs.python.org/library/logging.html#filter-objects">Filter</a> class that can be used for filtering <a
title="LogRecord Objects" href="http://docs.python.org/library/logging.html#logrecord-objects">log records</a>. This is a simple way to ensure that a <a
title="Logger Objects" href="http://docs.python.org/library/logging.html#logger-objects">logger</a> or <a
title="Handler Objects" href="http://docs.python.org/library/logging.html#handler-objects">handler</a> will only output desired log messages. Here's an example filter that only allows <a
title="Logging Levels" href="http://docs.python.org/library/logging.html#logging-levels">INFO</a> messages to be logged:</p><pre class="brush: python;">
import logging

class InfoFilter(logging.Filter):
	def filter(self, rec):
		return rec.levelno == logging.INFO
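</pre><p>A quick way to sanity-check a filter like this is to call it on hand-built records. The sketch below uses only the standard library; <code>logging.makeLogRecord</code> builds a record from a dict of attributes, and the messages are arbitrary:</p><pre class="brush: python;">
import logging

class InfoFilter(logging.Filter):
	def filter(self, rec):
		return rec.levelno == logging.INFO

info_filter = InfoFilter()
info_rec = logging.makeLogRecord({'levelno': logging.INFO, 'msg': 'hello'})
warn_rec = logging.makeLogRecord({'levelno': logging.WARNING, 'msg': 'uh oh'})

# Only the INFO record passes the filter
print(info_filter.filter(info_rec))
print(info_filter.filter(warn_rec))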
</pre><h3>Configuring Python Logging Filters</h3><p><a
title="Using the Python logging module" href="http://code.activestate.com/recipes/412552-using-the-logging-module/">Filters can be added to a logger instance or a handler instance</a> using the <a
title="logging.Logger.addFilter" href="http://docs.python.org/library/logging.html#logging.Logger.addFilter">addFilter(filt)</a> method. For a logger, the best time to do this is probably right after calling <a
title="logging.getLogger" href="http://docs.python.org/library/logging.html#logging.getLogger">getLogger</a>, like so:</p><pre class="brush: python;">
log = logging.getLogger()
log.addFilter(InfoFilter())
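</pre><p>One caveat worth knowing: a filter added to a logger only applies to records logged directly on that logger, not to records that propagate up from its descendants. The sketch below illustrates this; the logger names and the <code>ListHandler</code> class are made up for the demo:</p><pre class="brush: python;">
import logging

class InfoFilter(logging.Filter):
	def filter(self, rec):
		return rec.levelno == logging.INFO

class ListHandler(logging.Handler):
	# Hypothetical helper that collects formatted messages in a list
	def __init__(self):
		logging.Handler.__init__(self)
		self.messages = []

	def emit(self, record):
		self.messages.append(record.getMessage())

parent = logging.getLogger('demo')
parent.setLevel(logging.DEBUG)
parent.propagate = False  # keep the demo away from the root logger
parent.addFilter(InfoFilter())
handler = ListHandler()
parent.addHandler(handler)

parent.warning('dropped by the filter on parent')
parent.info('kept')
# The child has no filter of its own, so its warning still reaches the handler
logging.getLogger('demo.child').warning('child warning gets through')
print(handler.messages)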
</pre><p>What about adding a filter to a handler? If you're <a
title="The python logging module is much better than print statements" href="http://blog.tplus1.com/index.php/2007/09/28/the-python-logging-module-is-much-better-than-print-statements/">programmatically configuring handlers</a> with <a
title="logging.Logger.addHandler" href="http://docs.python.org/library/logging.html#logging.Logger.addHandler">addHandler(hdlr)</a>, then you can do the same thing by calling <a
title="logging.Handler.addFilter" href="http://docs.python.org/library/logging.html#logging.Handler.addFilter">addFilter(filt) on the handler</a> instance. But if you're using <a
title="logging.fileConfig" href="http://docs.python.org/library/logging.html#logging.fileConfig">fileConfig</a> to configure handlers and loggers, it's a little bit harder. Unfortunately, the <a
title="Configuration file format" href="http://docs.python.org/library/logging.html#configuration-file-format">logging configuration format</a> does not support adding filters. And it's not always clear which logger the handler instances are attached to in the logger hierarchy. So the simplest way to add a filter to a handler in this case is to subclass the handler:</p><pre class="brush: python;">
class InfoHandler(logging.StreamHandler):
	def __init__(self, *args, **kwargs):
		# Only the logging module is imported, so qualify the base class
		logging.StreamHandler.__init__(self, *args, **kwargs)
		self.addFilter(InfoFilter())
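</pre><p>To check such a handler without a config file, it can be pointed at a stand-in stream. This is only a sketch: <code>ListStream</code> and the logger name are hypothetical, and the classes are repeated so the snippet runs on its own:</p><pre class="brush: python;">
import logging

class InfoFilter(logging.Filter):
	def filter(self, rec):
		return rec.levelno == logging.INFO

class InfoHandler(logging.StreamHandler):
	def __init__(self, *args, **kwargs):
		logging.StreamHandler.__init__(self, *args, **kwargs)
		self.addFilter(InfoFilter())

class ListStream(object):
	# Minimal file-like object: StreamHandler only needs write() and flush()
	def __init__(self):
		self.chunks = []

	def write(self, text):
		self.chunks.append(text)

	def flush(self):
		pass

stream = ListStream()
log = logging.getLogger('infohandler.demo')
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(InfoHandler(stream))

log.debug('debug message')
log.info('info message')
log.error('error message')
# Only the INFO record should have been written
output = ''.join(stream.chunks)
print(output)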
</pre><p>Then in your file config, make sure to set the class value for your custom handler to a complete code path for import:</p><pre>[handler_infohandler]
class=mypackage.mylogging.InfoHandler
level=INFO</pre><p>Now your handler will only handle the log records that pass your custom filter. As long as your handlers aren't changing much, the above method is much more reusable than having to call <code>addFilter(filt)</code> every time a new logger is instantiated.</p> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/WnhUngxhh8Y" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/04/08/python-logging-filters/feed/</wfw:commentRss> <slash:comments>4</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/04/08/python-logging-filters/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> <item><title>Python Unicode Links</title><link>http://feedproxy.google.com/~r/StreamHacker/~3/P7QX3hs0Ps4/</link> <comments>http://streamhacker.com/2010/04/05/python-unicode-links/#comments</comments> <pubDate>Mon, 05 Apr 2010 16:00:22 +0000</pubDate> <dc:creator>Jacob</dc:creator> <category><![CDATA[links]]></category> <category><![CDATA[python]]></category> <category><![CDATA[i18n]]></category> <category><![CDATA[unicode]]></category> <category><![CDATA[utf8]]></category><guid isPermaLink="false">http://streamhacker.com/?p=1071</guid> <description><![CDATA[Links for python unicode explanations and utf8 encoding detection and conversion]]></description> <content:encoded><![CDATA[<h4>Links for understanding how to use unicode in python:</h4><ul><li><a
title="Red Mercury Labs" href="http://www.red-mercury.com/blog/eclectic-tech/python-unicode-fixing-utf-8-encoded-as-latin-1-iso-8859-1/">Python Unicode – Fixing UTF-8 encoded as Latin-1 / ISO-8859-1</a></li><li><a
title="Introduction to Unicode with Python" href="http://www.amk.ca/python/howto/unicode">Unicode HOWTO</a></li><li><a
title="Character encoding auto-detection in Python 2 and 3" href="http://chardet.feedparser.org/">Universal Encoding Detector</a></li></ul> <div class="feedflare">
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/P7QX3hs0Ps4" height="1" width="1"/>]]></content:encoded> <wfw:commentRss>http://streamhacker.com/2010/04/05/python-unicode-links/feed/</wfw:commentRss> <slash:comments>1</slash:comments> <feedburner:origLink>http://streamhacker.com/2010/04/05/python-unicode-links/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item> </channel> </rss>
