<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>StreamHacker</title>
	
	<link>http://streamhacker.com</link>
	<description>Weotta be Hacking</description>
	<lastBuildDate>Sun, 26 May 2013 19:51:52 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>

	
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/StreamHacker" /><feedburner:info uri="streamhacker" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://superfeedr.com/hubbub" /><feedburner:emailServiceId>StreamHacker</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Instant PyGame Book Review</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/me5L-vP5M5I/</link>
		<comments>http://streamhacker.com/2013/05/26/instant-pygame-book-review/#comments</comments>
		<pubDate>Sun, 26 May 2013 19:44:06 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[numpy]]></category>
		<category><![CDATA[pygame]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1885</guid>
		<description><![CDATA[A review of Instant Pygame for Python Game Development How-to, by Ivan Idris.]]></description>
				<content:encoded><![CDATA[<p><a href="http://link.packtpub.com/2Vrs5V"><img class="alignleft" alt="Pygame for Python Game Development How-to" src="http://www.packtpub.com/sites/default/files/2865OScov.jpg" width="130" height="160" /></a>This is a review of the book <a href="http://link.packtpub.com/2Vrs5V">Instant Pygame for Python Game Development How-to</a>, by <a href="http://ivanidris.net/wordpress/">Ivan Idris</a>. <a href="http://www.packtpub.com/">Packt</a> asked me to review the book, and I agreed because like many developers, I've thought about writing my own game, and I've been curious about the capabilities of <a href="http://www.pygame.org/">pygame</a>. It's a short book, ~120 pages, so this is a short review.</p>
<p>The book covers <a href="http://www.pygame.org/">pygame</a> basics like drawing images, rendering text, playing sounds, creating animations, and altering the mouse cursor. The author has helpfully posted some <a href="http://www.youtube.com/user/ivanidris">video demos</a> of some of the exercises, which are linked from the book. I think this is a great way to show what's possible, while also giving the reader a clear idea of what they are creating &amp; what should happen. After the basic intro exercises, I think the best content was how to manipulate pixel arrays with <a href="http://www.numpy.org/">numpy</a> (the author has also written two books on numpy: <a href="http://www.amazon.com/gp/product/B00CITNP76/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B00CITNP76&amp;linkCode=as2&amp;tag=streamhacker-20">NumPy Beginner's Guide</a> &amp; <a href="http://www.amazon.com/gp/product/B009X5KIH8/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B009X5KIH8&amp;linkCode=as2&amp;tag=streamhacker-20">NumPy Cookbook</a>), how to create &amp; use sprites, and how to make your own version of the <a href="http://www.youtube.com/watch?v=NNsU-yWTkXM">game of life</a>.</p>
<p>There were 3 chapters whose content puzzled me. When you've got such a short book on a specific topic, why bring up <a href="http://matplotlib.org/">matplotlib</a>, profiling, and debugging? These chapters seemed off-topic and just thrown in there randomly. The organization of the book could have been much better too, leading the reader from the basics all the way to a full-fledged game, with each chapter adding to the previous chapters. Instead, the chapters sometimes felt like unrelated low-level examples.</p>
<p>Overall, the book is a quick &amp; easy read that rapidly introduces you to basic pygame functionality and leads you on to more complex activities. My main takeaway is that pygame provides an easy-to-use, low-level framework for building simple games, and can be used to create more complex games (but probably not <a href="http://en.wikipedia.org/wiki/First-person_shooter">FPS</a> or similar graphically intensive games). The ideal games would probably be puzzle-based and/or dialogue-heavy, and only require simple interactions from the user. So if you're interested in building such a game in Python, you should definitely get a copy of <a href="http://link.packtpub.com/2Vrs5V">Instant Pygame for Python Game Development How-to</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2013/05/26/instant-pygame-book-review/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2013/05/26/instant-pygame-book-review/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Avogadro Corp Book Review / AI Speculation</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/4nY56iwD78k/</link>
		<comments>http://streamhacker.com/2013/04/28/avogadro-corp-book-review-ai-speculation/#comments</comments>
		<pubDate>Sun, 28 Apr 2013 21:43:15 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[ai]]></category>
		<category><![CDATA[email]]></category>
		<category><![CDATA[nlp]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1877</guid>
		<description><![CDATA[Speculation about how an email language optimization program could be created, based on the ELOPe program from Avogadro Corp: The Singularity Is Closer Than It Appears, by William Hertling]]></description>
				<content:encoded><![CDATA[<p><a title="Avogadro Corp" href="http://www.amazon.com/gp/product/B006ACIMQQ/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B006ACIMQQ&amp;linkCode=as2&amp;tag=streamhacker-20"><img class="alignleft" alt="Avogadro Corp" src="http://ws.assoc-amazon.com/widgets/q?_encoding=UTF8&amp;ASIN=B006ACIMQQ&amp;Format=_SL160_&amp;ID=AsinImage&amp;MarketPlace=US&amp;ServiceVersion=20070822&amp;WS=1&amp;tag=streamhacker-20" width="107" height="160" /></a><a title="Avogadro Corp" href="http://www.amazon.com/gp/product/B006ACIMQQ/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B006ACIMQQ&amp;linkCode=as2&amp;tag=streamhacker-20">Avogadro Corp: The Singularity Is Closer Than It Appears</a>, by <a href="http://www.williamhertling.com/">William Hertling</a>, is the first sci-fi book I've read with a semi-plausible <a href="https://en.wikipedia.org/wiki/Artificial_intelligence">AI</a> origin story. That's because the premise isn't so simple as "increased computing power -&gt; emergent AI". It's a much more well defined formula: ever increasing computing power + powerful language processing + never ending stream of training data + goal oriented behavior + deep integration into internet infrastructure -&gt; AI. The AI in the story is called <em>ELOPe</em>, which stands for <em>Email Language Optimization Program</em>, and its function is essentially to improve the quality of emails. <strong>WARNING</strong> there will be spoilers below, but only enough to describe ELOPe and speculate about how it might be implemented.</p>
<h2>What is ELOPe</h2>
<p>The idea behind ELOPe is to provide writing suggestions as a feature of a popular web-based email service. These writing suggestions are designed to improve the outcome of your email, whatever that may be. To take an example from the book, if you're requesting more compute resources for a project, then ELOPe's job is to offer writing suggestions that are most likely to get your request approved. By taking into account your own past writings, who you're sending the email to, and what you're asking for, it can go as far as completely re-writing the email to achieve the optimal outcome.</p>
<p>Taking the existence of ELOPe as a given, the author writes an enjoyable story that is (mostly) technically accurate, with plenty of detail, without being boring. If you liked <a href="http://www.amazon.com/gp/product/0451228731/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0451228731&amp;linkCode=as2&amp;tag=streamhacker-20">Daemon</a> by <a href="http://www.thedaemon.com/">Daniel Suarez</a>, or you work with any kind of natural language / text-processing technology, you'll probably enjoy the story. I won't get into how an email writing suggestion program goes from that to full AI &amp; takes over the world as a benevolent ghost in the wires; for that you need to read the <a href="http://www.amazon.com/gp/product/B006ACIMQQ/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B006ACIMQQ&amp;linkCode=as2&amp;tag=streamhacker-20">book</a>. What I do want to talk about is how this email optimization system could be implemented.</p>
<h2>How ELOPe might work</h2>
<p>Let's start by defining the high-level requirements. ELOPe is an email optimizer, so we have the sender, the receiver, and the email being written as inputs. The output is a re-written email that preserves the "voice" of the sender while using language that will be much more likely to achieve the sender's desired outcome, given who they're sending the email to. That means we need the following:</p>
<ol>
<li>ability to analyze the email to determine what outcome is desired</li>
<li>prior knowledge of how the receiver has responded to other emails with similar outcome topics, in order to know what language produced the best outcomes (and what language produced bad outcomes)</li>
<li>ability to re-write (or generate) an email whose language is consistent with the sender, while also using language optimized to get the best response from the receiver</li>
</ol>
<h2>Topic Analysis</h2>
<p>Determining the desired outcome for an email seems to me like a sophisticated combination of <a href="https://en.wikipedia.org/wiki/Topic_modeling">topic modeling</a> and <a href="https://en.wikipedia.org/wiki/Deep_linguistic_processing">deep linguistic parsing</a>. The goal would be to identify the core reason for the email: what is the sender asking for, and what would be an optimal response?</p>
<p>Being able to do this from a single email is probably impossible, but if you have access to thousands, or even millions of email chains, accurate topic modeling is much more do-able. Nearly every email someone sends will have some similarity to past emails sent by other people in similar situations. So you could create <a href="https://en.wikipedia.org/wiki/Feature_vector">feature vectors</a> for every email chain (using deep semantic parsing), then <a href="https://en.wikipedia.org/wiki/Cluster_analysis">cluster</a> the chains using feature similarity. Now you have topic clusters, and from that you could create training data for thousands of topic <a href="https://en.wikipedia.org/wiki/Classification_in_machine_learning">classifiers</a>. Once you have the classifiers, you can run those in parallel to determine the most likely topic(s) of a single email.</p>
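<p>To make the feature vector and clustering steps a bit more concrete, here's a minimal sketch in plain Python. It is only an illustration under heavy assumptions: word counts stand in for deep semantic parsing, a greedy single-pass grouping stands in for a real clustering algorithm, and the function names, the similarity threshold, and the four example emails are all made up. It also stops at the clusters and does not train any classifiers.</p>

```python
import math
import re
from collections import Counter

def features(text):
    """Word counts: a crude stand-in for deep semantic parsing."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: join the most similar cluster,
    or start a new one when nothing is similar enough."""
    clusters = []  # each cluster holds summed counts and member indexes
    for index, text in enumerate(texts):
        vec = features(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(index)
            best["centroid"].update(vec)  # summed counts act as the centroid
        else:
            clusters.append({"centroid": Counter(vec), "members": [index]})
    return [c["members"] for c in clusters]

# Hypothetical emails: two about compute resources, two about lunch.
emails = [
    "Can I get more compute servers for my project?",
    "Requesting additional compute servers for the search project",
    "Lunch on Friday? The new taco place looks good",
    "Are you free for lunch on Friday at the taco place?",
]
topic_clusters = cluster(emails)
print(topic_clusters)  # [[0, 1], [2, 3]]
```

<p>Each resulting group of email indexes could then be treated as labeled training data for one topic classifier, which is the step the paragraph above describes next.</p>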
<p>Obviously it would be very difficult to create accurate clusters, and even harder to do so at scale. Language is very fuzzy, humans are inconsistent, and a huge fraction of email is spam. But the core of the necessary technology exists, and can work very well in limited conditions. Parsing emails, extracting textual features, and clustering &amp; classifying feature vectors are all things you can do with at least a few modern programming libraries today (e.g. <a href="http://python.org/">Python</a>, <a href="http://nltk.org/">NLTK</a> &amp; <a href="http://scikit-learn.org/stable/">scikit-learn</a>). These are areas of software technology that are getting a lot of attention right now, and all signs indicate that attention will only increase over time, so it's quite likely that the difficulty level will decrease significantly over the next 10 years. Moving on, let's assume we can do accurate email topic analysis. The next hurdle is outcome analysis.</p>
<h2>Outcome Analysis</h2>
<p>Once you can determine topics, now you need to learn about outcomes. Two email chains about acquiring compute resources have the same topic, but one chain ends with someone successfully getting access to more compute resources, while the other ends in failure. How do you differentiate between these? This sounds like next-generation <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a>. You need to go deeper than simple failure vs. success, positive vs. negative, since you want to know which email chains within a given topic produced the best responses, and what language they have in common. In other words, you need a <a href="https://en.wikipedia.org/wiki/Language_model">language model</a> that weights successful outcome language much higher than failure outcome language. The only way I can think of doing this with a decent level of accuracy is massive amounts of human verified training data. Technically do-able, but very expensive in terms of time and effort.</p>
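<p>As a rough illustration of weighting successful outcome language above failure outcome language, here's a toy sketch in plain Python. It assumes you already have email chains for one topic that a human has labeled as success or failure. The scoring (a smoothed log ratio of word frequencies), the function names, and the example chains are my own hypothetical choices, nowhere near a real language model.</p>

```python
import math
from collections import Counter

def outcome_weights(chains, smoothing=1.0):
    """Score each word by the log ratio of its smoothed frequency in
    successful vs. failed chains. Positive means 'success language'."""
    success, failure = Counter(), Counter()
    for text, succeeded in chains:
        (success if succeeded else failure).update(text.lower().split())
    vocab = set(success) | set(failure)
    success_total = sum(success.values()) + smoothing * len(vocab)
    failure_total = sum(failure.values()) + smoothing * len(vocab)
    return {
        word: math.log((success[word] + smoothing) / success_total)
        - math.log((failure[word] + smoothing) / failure_total)
        for word in vocab
    }

def score(text, weights):
    """Sum the outcome weights of the known words in a draft email."""
    return sum(weights.get(word, 0.0) for word in text.lower().split())

# Hypothetical human-verified training data: (email text, did it succeed?)
labeled_chains = [
    ("could you please approve more servers thanks", True),
    ("please approve the budget thanks so much", True),
    ("i need more servers now", False),
    ("approve this now", False),
]
weights = outcome_weights(labeled_chains)
polite = score("please approve more servers", weights)
demanding = score("need more servers now", weights)
print(polite > demanding)  # True
```

<p>With real data you'd want far more than word frequencies, but the shape is the same: language that shows up in good outcomes gets pushed up, and language that shows up in bad outcomes gets pushed down.</p>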
<p>What really pushes the bounds of plausibility is that the language model can't be universal. Everyone has their own likes, dislikes, biases, and preferences. So you need language models that are specific to individuals, or clusters of individuals that respond similarly on the same topic. Since these clusters are topic specific, every individual would belong to many <code>(topic, cluster)</code> pairs. Given <code>N</code> topics and an average of <code>M</code> clusters within each topic, that's <code>N*M</code> language models that need to be created. And one of the major plot points of the book falls out naturally: ELOPe needs access to huge amounts of high end compute resources.</p>
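<p>To get a feel for how quickly <code>N*M</code> grows, here's a tiny sketch. The topic names, cluster names, and the larger numbers are purely hypothetical, picked only to show the arithmetic.</p>

```python
from itertools import product

# Hypothetical topics (N = 3) and receiver clusters per topic (M = 2).
topics = ["compute-request", "budget-approval", "hiring"]
clusters = ["responds-to-data", "responds-to-urgency"]

# One language model slot per (topic, cluster) pair.
models = {pair: None for pair in product(topics, clusters)}
print(len(models))  # 6

def estimated_models(n_topics, avg_clusters_per_topic):
    """N topics times an average of M clusters per topic."""
    return n_topics * avg_clusters_per_topic

print(estimated_models(10000, 100))  # 1000000
```

<p>Every one of those models has to be trained, stored, and kept up to date, which is why the compute requirements balloon.</p>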
<p>This is definitely the least do-able aspect of ELOPe, and I'm ignoring all the implicit conceptual knowledge that would be required to know what an optimal outcome is, but let's move on <img src='http://streamhacker.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<h2>Language Generation</h2>
<p>Assuming that we can do topic &amp; outcome analysis, the final step is using language models to generate more persuasive emails. This is perhaps the simplest part of ELOPe, assuming everything else works well. That's because <a href="https://en.wikipedia.org/wiki/Natural_language_generation">natural language generation</a> is the kind of technology that works much better with more data, and it already exists in various forms. <a href="http://translate.google.com/">Google translate</a> is a kind of language generator, <a href="https://en.wikipedia.org/wiki/Chatterbot">chatbots</a> have been around for decades, and spammers use software to <a href="https://en.wikipedia.org/wiki/Article_spinning">spin</a> new articles &amp; text based on existing writings. The differences in this case are that every individual would need their own language generator, and it would have to be parameterized with pluggable language models based on the topic, desired outcome, and receiver. But assuming we have good topic &amp; receiver specific outcome analysis, plus hundreds or thousands of emails from the sender to learn from, then generating new emails, or just new phrases within an email, seems almost trivial compared to what I've outlined above.</p>
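<p>For a taste of how simple the most basic form of language generation can be, here's a sketch of a bigram Markov chain generator in plain Python. To be clear, this is a swapped-in toy technique, not what ELOPe does in the book: it only learns which word tends to follow which in a sender's past emails, and it knows nothing about topics, outcomes, or receivers. The example emails and all names are made up.</p>

```python
import random
from collections import defaultdict

START, END = "^", "$"  # sentence boundary markers

def train_bigrams(sentences):
    """Map each word to the list of words that followed it."""
    followers = defaultdict(list)
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            followers[current].append(following)
    return followers

def generate(followers, rng, max_words=20):
    """Walk the chain from START until END or the word limit."""
    word, output = START, []
    while len(output) != max_words:
        word = rng.choice(followers[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)

# Hypothetical past emails from a single sender.
past_emails = [
    "thanks for the quick reply",
    "thanks for your help with the servers",
    "let me know if the servers are ready",
]
model = train_bigrams(past_emails)
sentence = generate(model, random.Random(42))
print(sentence)
```

<p>Every pair of adjacent words in the output was seen in the sender's own writing, which is a very crude way of preserving their "voice". Plugging in topic and outcome specific models is where the real work would be.</p>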
<h2>Final Words</h2>
<p>I'm still highly skeptical that <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">strong AI</a> will ever exist. We humans barely understand the mechanisms of our own intelligence, so to think that we can create comparable artificial intelligence smells of hubris. But it can be fun to think about, and the point of sci-fi is to tell stories about possible futures, so I have no doubt various forms of AI will play a strong role in sci-fi stories for years to come.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2013/04/28/avogadro-corp-book-review-ai-speculation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2013/04/28/avogadro-corp-book-review-ai-speculation/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Monetizing the Text-Processing API with Mashape</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/RvoYBnjUBYw/</link>
		<comments>http://streamhacker.com/2013/02/27/monetizing-textprocessing-api-mashape/#comments</comments>
		<pubDate>Thu, 28 Feb 2013 00:11:04 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[apis]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1862</guid>
		<description><![CDATA[Why I created the text-processing.com APIs, and how I monetized them with Mashape.]]></description>
				<content:encoded><![CDATA[<p>This is a short story about the <a href="http://text-processing.com/">text-processing.com</a> API, and how it became a profitable side-project, thanks to <a href="https://www.mashape.com/">Mashape</a>.</p>
<h2>Text-Processing API</h2>
<p>When I first created <a href="http://text-processing.com/">text-processing.com</a>, in the summer of 2010, my initial intention was to provide an online demo of <a href="http://nltk.org/">NLTK's</a> capabilities. I trained a bunch of models on various <a href="http://nltk.org/nltk_data/">NLTK corpora</a> using <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a>, then started making some simple <a href="https://www.djangoproject.com/">Django</a> forms to display the results. But as I was doing this, I realized I could fairly easily create an API based on these models. Instead of rendering HTML, I could just return the results as JSON.</p>
<p>I wasn't sure if anyone would actually use the API, but I knew the best way to find out was to just put it out there. So I did, initially making it completely open, with a rate limit of 1000 calls per day per IP address. I figured at the very least, I might get some PHP or Ruby users that wanted the power of <a href="http://nltk.org/">NLTK</a> without having to interface with Python. Within a month, people were regularly exceeding that limit, and I quietly increased it to 5000 calls/day, while I started searching for the simplest way to monetize the API. I didn't like what I found.</p>
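<p>For anyone curious what a per-IP daily rate limit can look like, here's a minimal sketch in plain Python. This is not the code behind text-processing.com, just an illustration: the class name is hypothetical, counts live in memory in a single process, and a real deployment would need shared storage and a decision about time zones.</p>

```python
from collections import defaultdict
from datetime import date

class DailyRateLimiter:
    """Allow at most `limit` calls per client IP per calendar day."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.day = None
        self.counts = defaultdict(int)

    def allow(self, ip, today=None):
        today = today or date.today()
        if today != self.day:
            # A new day has started: forget all of yesterday's counts.
            self.day = today
            self.counts.clear()
        if self.counts[ip] >= self.limit:
            return False
        self.counts[ip] += 1
        return True

# A tiny limit keeps the demonstration short.
limiter = DailyRateLimiter(limit=3)
day_one = date(2010, 7, 1)
results = [limiter.allow("10.0.0.1", day_one) for _ in range(4)]
print(results)  # [True, True, True, False]
```

<p>Raising the limit, like going from 1000 to 5000 calls/day, is then just a matter of changing one number.</p>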
<h2>Monetizing APIs</h2>
<p>Before Mashape, your options for monetizing APIs were either building a custom solution for authentication, billing, and tracking, or paying thousands of dollars a month for an "enterprise" solution from <a href="http://www.mashery.com/">Mashery</a> or <a href="http://apigee.com/about/">Apigee</a>. While I have no doubt Mashery &amp; Apigee provide quality services, they are not in the price range of most developers. And building a custom solution is far more work than I wanted to put into it. Even now, when companies like <a href="https://stripe.com/">Stripe</a> exist to make billing easier, you'd still have to do authentication &amp; call tracking. But Stripe didn't exist 2 years ago, and the best billing option I could find was <a href="https://www.paypal.com/">Paypal</a>, whose API documentation is great at inducing headaches. Lucky for me, Mashape was just opening up for beta testing, and appeared to be in the process of solving all of my problems <img src='http://streamhacker.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<h2>Mashape</h2>
<p>Mashape was just what I needed to monetize the <a href="https://www.mashape.com/japerk/text-processing">text-processing API</a>, and it's improved tremendously since I started using it. They handle all the necessary details, like integrated billing, plus a lot more, such as usage charts, latency &amp; uptime measurements, and automatic client library generation. This last is one of my favorite features, because the client libraries are generated using your API documentation, which provides a great incentive to accurately document the ins &amp; outs of your API. Once you've documented your API, downloadable libraries in 5 different programming languages are immediately available, making it that much easier for new users to consume your API. As of this writing, those languages are Java, PHP, Python, Ruby, and Objective C.</p>
<p>Here's a little history for the curious: Mashape originally did authentication and tracking by exchanging tokens thru an API call. So you had to write some code to call their token API on every one of your API calls, then check the results to see if the call was valid, or if the caller had reached their limit. They didn't have all of the nice charts they have now, and their billing solution was the CEO manually handling Paypal payments. But none of that mattered, because it worked, and from conversations with them, I knew they were focused on more important things: building up their infrastructure and positioning themselves as a kind of app-store for APIs.</p>
<p>Mashape has been out of beta for a while now, with automated billing, and a custom proxy server for authenticating, routing, and tracking all API calls. They're releasing new features on a regular basis, and sponsoring events like <a href="http://sf.musichackday.org/2013/index.php?page=Main+page">MusicHackDay</a>. I'm very impressed with everything they're doing, and on top of that, they're good hard-working people. I've been over to their "hacker house" in San Francisco a few times, and they're very friendly and accommodating. And if you're ever in the neighborhood, I'm sure they'd be open to a visit.</p>
<h2>Profit</h2>
<p>Once I had integrated Mashape, which was maybe 20 lines of code, the money started rolling in <img src='http://streamhacker.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . Just kidding, but using the typical definition of profit, when income exceeds costs, the <a href="https://www.mashape.com/japerk/text-processing">text-processing API</a> was profitable within a few months, and has remained so ever since. My only monetary cost is a single <a href="http://www.linode.com/?r=1b9fe1a4c29f9122ef178a2eb79af69badb68b9f">Linode</a> server, so as long as people keep paying for the API, <a href="http://text-processing.com/">text-processing.com</a> will remain online. And while it has a very nice profit margin, total monthly income barely approaches the cost of living in San Francisco. But what really matters to me is that <a href="http://text-processing.com/">text-processing.com</a> has become a self-sustaining excuse for me to experiment with natural language processing techniques &amp; data sets, test my models against the market, and provide developers with a simple way to integrate NLP into their own projects.</p>
<p>So if you've got an idea for an API, especially if it's something you could charge money for, I encourage you to build it and put it up on <a href="https://www.mashape.com/">Mashape</a>. All you need is a working API, a unique image &amp; name, and a Paypal account for receiving payments. Like other app stores, Mashape takes a 20% cut of all revenue, but I think it's well worth it compared to the cost of replicating everything they provide. And unlike some app stores, you're not locked in. Many of the APIs on Mashape also provide alternative usage options (including <a href="http://text-processing.com/docs/">text-processing</a>), but they're on Mashape because of the increased exposure, distribution, and additional features, like client library generation. <a href="https://en.wikipedia.org/wiki/Software_as_a_service">SaaS</a> APIs are becoming a significant part of modern computing infrastructure, and Mashape provides a great platform for getting started.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2013/02/27/monetizing-textprocessing-api-mashape/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2013/02/27/monetizing-textprocessing-api-mashape/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Text Classification for Sentiment Analysis – NLTK + Scikit-Learn</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/UQgbWEvAixs/</link>
		<comments>http://streamhacker.com/2012/11/22/text-classification-sentiment-analysis-nltk-scikitlearn/#comments</comments>
		<pubDate>Thu, 22 Nov 2012 18:10:01 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[sklearn]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1856</guid>
		<description><![CDATA[An analysis of scikit-learn algorithms using NLTK's SklearnClassifier and nltk-trainer's train_classifier.py. The best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC and NuSVC, using bigrams and feature scoring.]]></description>
				<content:encoded><![CDATA[<p>Now that <a href="http://nltk.org/">NLTK</a> versions <em>2.0.1</em> &amp; higher include the <a href="http://nltk.org/api/nltk.classify.html#nltk.classify.scikitlearn.SklearnClassifier">SklearnClassifier</a> (contributed by <a href="https://github.com/larsmans">Lars Buitinck</a>), it's much easier to make use of the excellent <a href="http://scikit-learn.org/">scikit-learn</a> library of algorithms for text classification. But how well do they work?</p>
<p>Below is a table showing both the accuracy &amp; <a href="http://en.wikipedia.org/wiki/F-measure">F-measure</a> of many of these algorithms using different feature extraction methods. Unlike the standard <a href="http://nltk.org/api/nltk.classify.html">NLTK classifiers</a>, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the <code>feats</code> column for each algorithm. <code>bow</code> means <a title="Naive Bayes Bag of Words Text Classification for Sentiment Analysis" href="/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">bag-of-words</a> feature extraction, where every word gets a 1 if present, or a 0 if not. <code>int</code> means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with <code>bow</code> it would still get a 1). And <code>tfidf</code> means the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TfidfTransformer</a> is used to produce a floating point number that measures the importance of a word, using the <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> algorithm.</p>
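<p>Here's a small plain Python sketch of what the three kinds of feature values look like for the same text. The two tiny "documents" are made up, and the <code>tfidf</code> function uses a bare textbook formula (term count times the log of inverse document frequency), so its numbers will not match what scikit-learn's TfidfTransformer produces with its own smoothing and normalization options.</p>

```python
import math
from collections import Counter

# Two hypothetical, already tokenized documents.
docs = [
    "great great movie".split(),
    "terrible movie".split(),
]

def bow(words):
    """Bag of words: 1 for every word that is present."""
    return {word: 1 for word in words}

def counts(words):
    """Integer features: how many times each word occurs."""
    return dict(Counter(words))

def tfidf(words, corpus):
    """Textbook tf-idf: term count * log(total docs / docs containing the word)."""
    result = {}
    for word, term_count in Counter(words).items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        result[word] = term_count * math.log(len(corpus) / doc_freq)
    return result

bow_feats = bow(docs[0])
int_feats = counts(docs[0])
tfidf_feats = tfidf(docs[0], docs)
print(bow_feats)  # {'great': 1, 'movie': 1}
print(int_feats)  # {'great': 2, 'movie': 1}
print(tfidf_feats)
```

<p>Notice that "movie" occurs in every document, so its tf-idf value is 0, while "great" gets a positive weight that grows with its count.</p>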
<p>All numbers were determined using <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a>, specifically, <code>python train_classifier.py movie_reviews <span class="pre">--no-pickle</span> <span class="pre">--classifier</span> sklearn.ALGORITHM <span class="pre">--fraction</span> 0.75</code>. For <code>int</code> features, the option <code><span class="pre">--value-type</span> int</code> was used, and for <code>tfidf</code> features, the options <code><span class="pre">--value-type</span> float <span class="pre">--tfidf</span></code> were used. This was with <em>NLTK 2.0.3</em> and <em>sklearn 0.12.1</em>.</p>
<table class="docutils" border="1">
<colgroup>
<col width="30%" />
<col width="10%" />
<col width="15%" />
<col width="22%" />
<col width="22%" /> </colgroup>
<thead valign="bottom">
<tr>
<th class="head">algorithm</th>
<th class="head">feats</th>
<th class="head">accuracy</th>
<th class="head">neg f-measure</th>
<th class="head">pos f-measure</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td>BernoulliNB</td>
<td>bow</td>
<td>82.2</td>
<td>82.7</td>
<td>81.6</td>
</tr>
<tr>
<td>BernoulliNB</td>
<td>int</td>
<td>82.2</td>
<td>82.7</td>
<td>81.6</td>
</tr>
<tr>
<td>BernoulliNB</td>
<td>tfidf</td>
<td>82.2</td>
<td>82.7</td>
<td>81.6</td>
</tr>
<tr>
<td>GaussianNB</td>
<td>bow</td>
<td>66.4</td>
<td>65.1</td>
<td>67.6</td>
</tr>
<tr>
<td>GaussianNB</td>
<td>int</td>
<td>66.8</td>
<td>66.3</td>
<td>67.3</td>
</tr>
<tr>
<td>MultinomialNB</td>
<td>bow</td>
<td>82.2</td>
<td>82.7</td>
<td>81.6</td>
</tr>
<tr>
<td>MultinomialNB</td>
<td>int</td>
<td>81.2</td>
<td>81.5</td>
<td>80.1</td>
</tr>
<tr>
<td>MultinomialNB</td>
<td>tfidf</td>
<td>81.6</td>
<td>83.0</td>
<td>80.0</td>
</tr>
<tr>
<td>LogisticRegression</td>
<td>bow</td>
<td>85.6</td>
<td>85.8</td>
<td>85.4</td>
</tr>
<tr>
<td>LogisticRegression</td>
<td>int</td>
<td>83.2</td>
<td>83.0</td>
<td>83.4</td>
</tr>
<tr>
<td>LogisticRegression</td>
<td>tfidf</td>
<td>82.0</td>
<td>81.5</td>
<td>82.5</td>
</tr>
<tr>
<td>SVC</td>
<td>bow</td>
<td>67.6</td>
<td>75.3</td>
<td>52.9</td>
</tr>
<tr>
<td>SVC</td>
<td>int</td>
<td>67.8</td>
<td>71.7</td>
<td>62.6</td>
</tr>
<tr>
<td>SVC</td>
<td>tfidf</td>
<td>50.2</td>
<td>0.8</td>
<td>66.7</td>
</tr>
<tr>
<td>LinearSVC</td>
<td>bow</td>
<td>86.0</td>
<td>86.2</td>
<td>85.8</td>
</tr>
<tr>
<td>LinearSVC</td>
<td>int</td>
<td>81.8</td>
<td>81.7</td>
<td>81.9</td>
</tr>
<tr>
<td>LinearSVC</td>
<td>tfidf</td>
<td>85.8</td>
<td>85.5</td>
<td>86.0</td>
</tr>
<tr>
<td>NuSVC</td>
<td>bow</td>
<td>85.0</td>
<td>85.5</td>
<td>84.5</td>
</tr>
<tr>
<td>NuSVC</td>
<td>int</td>
<td>81.4</td>
<td>81.7</td>
<td>81.1</td>
</tr>
<tr>
<td>NuSVC</td>
<td>tfidf</td>
<td>50.2</td>
<td>0.8</td>
<td>66.7</td>
</tr>
</tbody>
</table>
<p>As you can see, the best algorithms are <a href="http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB">BernoulliNB</a>, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB">MultinomialNB</a>, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression">LogisticRegression</a>, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC">LinearSVC</a>, and <a href="http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC">NuSVC</a>. Surprisingly, <code>int</code> and <code>tfidf</code> features either provide a very small performance increase, or significantly decrease performance. So let's see if we can improve performance with the same techniques used in previous articles in this series, specifically <a title="Bigrams and Collocation Features for Text Classification" href="/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">bigrams</a> and <a title="Eliminate Low Information Words for Text Classification" href="/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">high information words</a>.</p>
<h2>Bigrams</h2>
<p>Below is a table showing the accuracy of the top 5 algorithms using just <code>unigrams</code> (the default, i.e. single words), and using unigrams + <code>bigrams</code> (pairs of words) with the option <code><span class="pre">--ngrams</span> 1 2</code>.</p>
<table class="docutils" border="1">
<colgroup>
<col width="51%" />
<col width="26%" />
<col width="23%" /> </colgroup>
<thead valign="bottom">
<tr>
<th class="head">algorithm</th>
<th class="head">unigrams</th>
<th class="head">bigrams</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td>BernoulliNB</td>
<td>82.2</td>
<td>86.0</td>
</tr>
<tr>
<td>MultinomialNB</td>
<td>82.2</td>
<td>86.0</td>
</tr>
<tr>
<td>LogisticRegression</td>
<td>85.6</td>
<td>86.6</td>
</tr>
<tr>
<td>LinearSVC</td>
<td>86.0</td>
<td>86.4</td>
</tr>
<tr>
<td>NuSVC</td>
<td>85.0</td>
<td>85.2</td>
</tr>
</tbody>
</table>
<p>Only <code>BernoulliNB</code> &amp; <code>MultinomialNB</code> got a modest boost in accuracy (3.8 points), putting them on par with the rest of the algorithms. But we can do better than this using feature scoring.</p>
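<p>For reference, here's a rough sketch of what unigram + bigram feature extraction looks like. The <code>ngram_feats</code> function is a hypothetical helper for illustration; nltk-trainer's internal feature representation may well differ.</p>

```python
def ngram_feats(words, ngrams=(1, 2)):
    # unigrams are keyed by the word itself, longer ngrams by a tuple of words
    feats = {}
    for n in ngrams:
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            key = gram[0] if n == 1 else tuple(gram)
            feats[key] = True
    return feats

print(ngram_feats(['not', 'very', 'good']))
```

<p>The bigram <code>('not', 'very')</code> carries information that the individual words don't, which is one plausible reason the Naive Bayes classifiers benefit from it.</p>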
<h2>Feature Scoring</h2>
<p>As I've shown previously, <a title="Eliminate Low Information Features for Sentiment Analysis" href="/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">eliminating low information features</a> can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option <code><span class="pre">--min_score</span> SCORE</code> (and keeping the <code><span class="pre">--ngrams</span> 1 2</code> option to get bigram features).</p>
<table class="docutils" border="1">
<colgroup>
<col width="40%" />
<col width="20%" />
<col width="20%" />
<col width="20%" /> </colgroup>
<thead valign="bottom">
<tr>
<th class="head">algorithm</th>
<th class="head">score 1</th>
<th class="head">score 2</th>
<th class="head">score 3</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td>BernoulliNB</td>
<td>62.8</td>
<td>97.2</td>
<td>95.8</td>
</tr>
<tr>
<td>MultinomialNB</td>
<td>62.8</td>
<td>97.2</td>
<td>95.8</td>
</tr>
<tr>
<td>LogisticRegression</td>
<td>90.4</td>
<td>91.6</td>
<td>91.4</td>
</tr>
<tr>
<td>LinearSVC</td>
<td>89.8</td>
<td>91.4</td>
<td>90.2</td>
</tr>
<tr>
<td>NuSVC</td>
<td>89.4</td>
<td>90.8</td>
<td>91.0</td>
</tr>
</tbody>
</table>
<p><code>LogisticRegression</code>, <code>LinearSVC</code>, and <code>NuSVC</code> all get a nice gain of ~4-5%, but the most interesting results are from the <code>BernoulliNB</code> &amp; <code>MultinomialNB</code> algorithms, which drop down significantly at <code><span class="pre">--min_score</span> 1</code>, but then skyrocket up to 97% with <code><span class="pre">--min_score</span> 2</code>. The only explanation I can offer for this is that <a href="https://en.wikipedia.org/wiki/Bayes_classifier">Naive Bayes classification</a>, because it does not weight features, can be quite sensitive to changes in training data (see <a href="https://en.wikipedia.org/wiki/Bayesian_poisoning">Bayesian Poisoning</a> for an example).</p>
<h2>Scikit-Learn</h2>
<p>If you haven't yet tried using <a href="http://scikit-learn.org/">scikit-learn</a> for text classification, then I hope this article convinces you that it's worth learning. NLTK's <a href="http://nltk.org/api/nltk.classify.html#nltk.classify.scikitlearn.SklearnClassifier">SklearnClassifier</a> makes the process much easier, since you don't have to convert feature dictionaries to <a href="http://numpy.scipy.org/">numpy</a> arrays yourself, or keep track of all known features. The scikit-learn classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2012/11/22/text-classification-sentiment-analysis-nltk-scikitlearn/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2012/11/22/text-classification-sentiment-analysis-nltk-scikitlearn/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>NLTK 2 Release Highlights</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/Shm8rH6PBDA/</link>
		<comments>http://streamhacker.com/2012/06/03/nltk-2-release-highlights/#comments</comments>
		<pubDate>Sun, 03 Jun 2012 17:00:17 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[corpora]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1847</guid>
		<description><![CDATA[NLTK 2 includes an SVM Classifier, a scikit-learn classifier, and new taggers &#038; stemmers. The project has also moved to github, and the documentation has been updated to use Sphinx.]]></description>
				<content:encoded><![CDATA[<p>NLTK 2.0.1, a.k.a. <strong>NLTK 2</strong>, was recently released, and what follows are my favorite changes, new features, and highlights from the <a href=" https://raw.github.com/nltk/nltk/master/ChangeLog#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">ChangeLog</a>.</p>
<h2>New Classifiers</h2>
<p>The <a href="http://nltk.org/api/nltk.classify.html#module-nltk.classify.svm">SVMClassifier</a> adds <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machine</a> classification through <a href="http://svmlight.joachims.org/">SVMLight</a> with <a href="https://bitbucket.org/wcauchois/pysvmlight">PySVMLight</a>. This is a much-needed addition to the set of supported classification algorithms. But even more interesting...</p>
<p>The <a href="http://nltk.org/api/nltk.classify.html#module-nltk.classify.scikitlearn">SklearnClassifier</a> provides a general interface to text classification with <a href="http://scikit-learn.org/stable/">scikit-learn</a>. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular <a href="https://en.wikipedia.org/wiki/Machine_learning">machine learning</a> toolkits, and provides more advanced feature extraction methods for classification.</p>
<h2>Github</h2>
<p><a href="http://nltk.org/">NLTK</a> has moved development and hosting to <a href="https://github.com/nltk">github</a>, replacing <a href="https://code.google.com/">google code</a> and <a href="http://subversion.tigris.org/">SVN</a>. The primary motivation is to make new development easier, and already a <a href="https://github.com/kmike/nltk">Python 3 branch</a> is under active development. I think this is great, since github makes forking &amp; pull requests quite easy, and it's become the de-facto "social coding" site.</p>
<h2>Sphinx</h2>
<p>Coinciding with the github move, the documentation was updated to use <a href="http://sphinx.pocoo.org/">Sphinx</a>, the same documentation generator used by <a href="http://docs.python.org/">Python</a> and many other projects. While I personally like Sphinx and <a href="http://docutils.sourceforge.net/rst.html">reStructuredText</a> (which I used to write this post), I'm not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you're looking for, I worry that new/interested users will have a harder time getting started.</p>
<h2>New Corpora</h2>
<p>Since the 0.9.9 release, a number of new corpora and corpus readers have been added:</p>
<ul>
<li><a href="http://lilyx.net/nltk-japanese-corpus/">JEITA</a></li>
<li><a href="http://www.cse.unt.edu/~rada/downloads.html#semcor">SemCor</a></li>
<li><a href="http://borel.slu.edu/crubadan/">langid</a></li>
<li><a href="http://www.statmt.org/europarl">europarl</a></li>
<li><a href="http://nlp.cs.nyu.edu/meyers/NomBank.html">NomBank</a></li>
<li><a href="http://www.mimuw.edu.pl/polszczyzna/pl196x/index_en.htm">pl196x</a></li>
<li><a href="http://machado.mec.gov.br/">Machado</a></li>
<li><a href="http://language.dyndns.org/research/CHILDES/">CHILDES</a></li>
</ul>
<h2>ChangeLog Highlights</h2>
<p>And here are a few final highlights:</p>
<ul>
<li>The <a href="http://nltk.org/api/nltk.tag.html#module-nltk.tag.hunpos">HunposTagger</a>, which wraps <a href="https://code.google.com/p/hunpos/">hunpos</a>.</li>
<li>The <a href="http://nltk.org/api/nltk.tag.html#module-nltk.tag.stanford">StanfordTagger</a> plus 2 subclasses for <a href="https://en.wikipedia.org/wiki/Named_entity_recognition">NER</a> and <a href="https://en.wikipedia.org/wiki/Part-of-speech_tagging">POS tagging</a> with the <a href="http://nlp.stanford.edu/software/tagger.shtml">Stanford POS Tagger</a>.</li>
<li>The <a href="http://nltk.org/api/nltk.stem.html#nltk.stem.snowball.SnowballStemmer">SnowballStemmer</a>, which supports 13 different languages. You can try it out at my <a title="NLTK Stemming Demo" href="http://text-processing.com/demo/stem/">online stemming demo</a>.</li>
</ul>
<h2>The Future</h2>
<p>I think NLTK's ideal role is to be a standard interface between corpora and <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP algorithms</a>. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2012/06/03/nltk-2-release-highlights/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2012/06/03/nltk-2-release-highlights/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Recent Talks &amp; Presentations</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/AuODJiShMP0/</link>
		<comments>http://streamhacker.com/2012/04/12/talks-presentations/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 16:44:18 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[talks]]></category>
		<category><![CDATA[weotta]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[pycon]]></category>
		<category><![CDATA[strata]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1842</guid>
		<description><![CDATA[Links to recent presentations at Strata, PyCon, and the SF MongoDB meetup.]]></description>
				<content:encoded><![CDATA[<p>I've given a few talks &amp; presentations recently, so for anyone that doesn't <a href="http://twitter.com/#!/japerk">follow japerk on twitter</a>, here are some links:</p>
<ul>
<li><a href="http://www.slideshare.net/mongodb/weotta-presentation-at-sf-bay-area-mongodb-user-group-feb-21-2012">Weotta's MongoDB presentation</a> from Tuesday, Feb 21 at the <a href="http://www.meetup.com/San-Francisco-MongoDB-User-Group/events/45348472/">SF MongoDB meetup</a></li>
<li><a href="http://www.slideshare.net/japerk/corpus-bootstrapping-with-nltk">Corpus Bootstrapping with NLTK</a> from Tuesday, Feb 28, during the <a href="http://strataconf.com/strata2012/public/schedule/detail/22903">Deep Data</a> session at <a href="http://strataconf.com/strata2012/">Strata</a></li>
<li><a href="https://github.com/japerk/PyCon-NLTK-Tutorial">PyCon NLTK Tutorial code</a> from Thursday, March 8 at <a href="https://us.pycon.org/2012/">PyCon 2012</a></li>
</ul>
<p>I also want to recommend 2 books that helped me mentally prepare for these talks:</p>
<ul>
<li><a href="http://www.amazon.com/gp/product/0978577604/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0978577604">Even a Geek can Speak</a> by <a href="http://www.speechworks.net/blog/">Joey Asher</a></li>
<li><a href="http://www.amazon.com/gp/product/1449301959/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=1449301959">Confessions of a Public Speaker</a> by <a href="http://www.scottberkun.com">Scott Berkun</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2012/04/12/talks-presentations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2012/04/12/talks-presentations/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>PyCon NLTK Tutorial Assistants</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/ad79sSrm-M8/</link>
		<comments>http://streamhacker.com/2012/02/19/pycon-nltk-tutorial-assistants/#comments</comments>
		<pubDate>Sun, 19 Feb 2012 18:43:59 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[nltk]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1836</guid>
		<description><![CDATA[A request for tutorial assistants for Introduction to NLTK at PyCon 2012.]]></description>
				<content:encoded><![CDATA[<p>My <a href="https://us.pycon.org/2012/">PyCon</a> tutorial, <a href="https://us.pycon.org/2012/schedule/presentation/199/">Introduction to NLTK</a>, now has over 40 people registered. This is about twice as many people as I was expecting, but I'm glad so many people want to learn <a href="http://www.nltk.org/">NLTK</a> <img src='http://streamhacker.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />   Because of the large class size, it'd be really helpful to have a couple of assistants with at least some NLTK experience, including, but not limited to:</p>
<ul>
<li>installing NLTK</li>
<li>installing &amp; using NLTK on Windows</li>
<li>installing &amp; using <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a></li>
<li>creating custom corpora</li>
<li>using WordNet</li>
</ul>
<p>If you're interested in helping out, please read <a href="https://us.pycon.org/2012/tutorials/assistants/">Tutorial Assistants</a> and contact me, <em>japerk</em> -- at -- <em>gmail</em>. Thanks!</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2012/02/19/pycon-nltk-tutorial-assistants/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2012/02/19/pycon-nltk-tutorial-assistants/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Upcoming Talks</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/Zk8lUjVTaaQ/</link>
		<comments>http://streamhacker.com/2012/01/09/upcoming-talks/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 18:00:41 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[talks]]></category>
		<category><![CDATA[weotta]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[pycon]]></category>
		<category><![CDATA[strata]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1829</guid>
		<description><![CDATA[Upcoming talks include How Weotta uses MongoDB at 10gen's new SF office; an NLTK Jam Session at NICAR 2012 in St Louis, MO; Corpus Bootstrapping with NLTK at Strata 2012; and my PyCon 2012 tutorial: Introduction to NLTK.]]></description>
				<content:encoded><![CDATA[<p>At the end of February and the beginning of March, I'll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order...</p>
<h2>How Weotta uses MongoDB</h2>
<p><a href="http://www.crunchbase.com/person/grant-wernick">Grant</a> and I will be helping <a href="http://www.10gen.com/">10gen</a> celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking about <a href="http://www.meetup.com/San-Francisco-MongoDB-User-Group/events/45348472/">How Weotta uses MongoDB</a>. We'll cover some of our favorite features of <a href="http://www.mongodb.org/">MongoDB</a> and how we use it for local place &amp; events search. Then we'll finish with a preview of <a href="http://www.weotta.com/">Weotta's</a> upcoming MongoDB powered local search APIs.</p>
<h2>NLTK Jam Session at NICAR 2012</h2>
<p>On Thursday, February 23, in St Louis, MO, I'll be demonstrating how to use <a href="http://www.nltk.org/">NLTK</a> as part of the <a href="http://ire.org/conferences/nicar-2012/newscamp/">NewsCamp workshop</a> at <a href="http://ire.org/conferences/nicar-2012/">NICAR 2012</a>. This will be a version of my <a href="https://us.pycon.org/2012/schedule/presentation/199/">PyCon NLTK Tutorial</a> with a focus on news text and corpora like <em>treebank</em>.</p>
<h2>Corpus Bootstrapping with NLTK at Strata 2012</h2>
<p>As part of the <a href="http://strataconf.com/strata2012">Strata 2012</a> <a href="http://strataconf.com/strata2012/public/schedule/detail/22903">Deep Data program</a>, I'll talk about <a href="http://strataconf.com/strata2012/public/schedule/detail/22412">Corpus Bootstrapping with NLTK</a> on Tuesday, February 28. The premise of this talk is that while there are plenty of great algorithms and methods for <a href="http://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a>, most of them require a training corpus, and chances are the training corpus you really need doesn't exist. So how can you quickly create a quality corpus at minimal cost? I'll cover specific real-world examples to answer this question.</p>
<h2>NLTK Tutorial at PyCon 2012</h2>
<p><a href="https://us.pycon.org/2012/schedule/presentation/199/">Introduction to NLTK</a> will be a 3 hour tutorial at <a href="https://us.pycon.org/2012/">PyCon</a> on Thursday, March 8th. You'll get to know <a href="http://www.nltk.org/">NLTK</a> in depth, learn about corpus organization, and train your own models manually &amp; with <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a>. My goal is that you'll walk out with at least one new NLP superpower that you can put to use immediately.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2012/01/09/upcoming-talks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2012/01/09/upcoming-talks/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Fuzzy String Matching in Python</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/AS4JRWS2bhY/</link>
		<comments>http://streamhacker.com/2011/10/31/fuzzy-string-matching-python/#comments</comments>
		<pubDate>Mon, 31 Oct 2011 15:47:47 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[doctest]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[phonetic]]></category>
		<category><![CDATA[regex]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1815</guid>
		<description><![CDATA[Python fuzzy string matching using normalization, regular expressions, edit distance, and fuzzywuzzy. You can do your own fuzzy matching with Python NLTK by combining tokenization, stemming, and edit distance. Phonetic algorithms can also be used to match strings.]]></description>
				<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Approximate_string_matching">Fuzzy matching</a> is a general term for finding strings that are <em>almost</em> equal, or <em>mostly</em> the same. Of course <em>almost</em> and <em>mostly</em> are ambiguous terms themselves, so you'll have to determine what they really mean for your specific needs. The best way to do this is to come up with a list of test cases before you start writing any fuzzy matching code. These test cases should be pairs of strings that either should fuzzy match, or not. I like to create <a title="Test Driven Development in Python" href="http://streamhacker.com/2009/02/05/test-driven-development-in-python/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">doctests</a> for this, like so:</p>
<pre class="brush: python; title: ; notranslate">
def fuzzy_match(s1, s2):
	'''
	&gt;&gt;&gt; fuzzy_match('Happy Days', ' happy days ')
	True
	&gt;&gt;&gt; fuzzy_match('happy days', 'sad days')
	False
	'''
	# TODO: fuzzy matching code
	return s1 == s2
</pre>
<p>Once you've got a good set of test cases, then it's much easier to tailor your fuzzy matching code to get the best results.</p>
<h2>Normalization</h2>
<p>The first step before doing any string matching is <em>normalization</em>. The goal with normalization is to transform your strings into a normal form, which in some cases may be all you need to do. While <code>'Happy Days' != ' happy days '</code>, with simple normalization you can get <code>'Happy <span class="pre">Days'.lower()</span> == ' happy days '.strip()</code>.</p>
<p>The most basic normalization you can do is to <a href="http://docs.python.org/library/stdtypes.html#str.lower">lowercase</a> and <a href="http://docs.python.org/library/stdtypes.html#str.strip">strip</a> whitespace. But chances are you'll want to do more. For example, here's a simple normalization function that also removes all punctuation in a string.</p>
<pre class="brush: python; title: ; notranslate">
import string

def normalize(s):
	for p in string.punctuation:
		s = s.replace(p, '')

	return s.lower().strip()
</pre>
<p>Using this <code>normalize</code> function, we can make the above fuzzy matching function pass our simple tests.</p>
<pre class="brush: python; title: ; notranslate">
def fuzzy_match(s1, s2):
	'''
	&gt;&gt;&gt; fuzzy_match('Happy Days', ' happy days ')
	True
	&gt;&gt;&gt; fuzzy_match('happy days', 'sad days')
	False
	'''
	return normalize(s1) == normalize(s2)
</pre>
<p>If you want to get more advanced, keep reading...</p>
<h2>Regular Expressions</h2>
<p>Beyond just stripping whitespace from the ends of strings, it's also a good idea to replace all whitespace occurrences with a single space character. The <a title="python re module" href="http://docs.python.org/library/re.html">regex</a> function for doing this is <code>re.sub(r'\s+', ' ', s)</code>. This will replace every run of one or more spaces, newlines, tabs, etc. with a single space, essentially eliminating the significance of whitespace for matching.</p>
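<p>As a sketch, here's one way the whitespace substitution could be folded into the earlier <code>normalize</code> function. The precompiled <code>WHITESPACE</code> pattern is just a name I'm introducing for illustration.</p>

```python
import re
import string

# matches any run of spaces, tabs, newlines, etc.
WHITESPACE = re.compile(r'\s+')

def normalize(s):
    # remove punctuation first, as in the earlier normalize function
    for p in string.punctuation:
        s = s.replace(p, '')
    # collapse each whitespace run to one space, then lowercase and strip the ends
    return WHITESPACE.sub(' ', s).lower().strip()

print(normalize(' Happy\t\n Days!! '))
```

<p>With this version, strings that differ only in case, punctuation, or the amount of whitespace between words all normalize to the same value.</p>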
<p>You may also be able to use regular expressions for <em>partial fuzzy matching</em>. Maybe you can use regular expressions to identify significant parts of a string, or perhaps split a string into component parts for further matching. If you think you can create a <em>simple</em> regular expression to help with fuzzy matching, do it, because chances are, any other code you write to do fuzzy matching will be more complicated, less straightforward, and probably slower. You can also use more complicated regular expressions to handle specific edge cases. But beware of any expression that takes puzzling out every time you look at it, because you'll probably be revisiting this code a number of times to tweak it for handling new cases, and tweaking complicated regular expressions is a sure way to induce headaches and eyeball-bleeding.</p>
<h2>Edit Distance</h2>
<p>The <a href="http://en.wikipedia.org/wiki/Edit_distance">edit distance</a> (aka <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>) is the number of single character edits it would take to transform one string into another. Therefore, the smaller the edit distance, the more similar two strings are.</p>
<p>If you want to do edit distance calculations, check out the standalone <a href="http://www.mindrot.org/projects/py-editdist/">editdist</a> module. Its <code>distance</code> function takes 2 strings and returns the Levenshtein edit distance. It's also implemented in C, and so is quite fast.</p>
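<p>If you just want to see how the calculation works, here's a minimal pure-Python sketch of Levenshtein distance using the standard dynamic programming approach. It's my own illustration, not the editdist module's code, and it will be far slower than a C implementation.</p>

```python
def edit_distance(s1, s2):
    # prev[j] is the distance between the processed prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute (or keep) the character
        prev = curr
    return prev[-1]

print(edit_distance('kitten', 'sitting'))
```

<p>Only two rows of the table are kept at any time, so memory use grows with the length of the second string rather than the product of both lengths.</p>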
<h2>Fuzzywuzzy</h2>
<p><a href="https://github.com/seatgeek/fuzzywuzzy">Fuzzywuzzy</a> is a great all-purpose library for fuzzy string matching, built (in part) on top of Python's <a href="http://docs.python.org/library/difflib.html">difflib</a>. It has a number of different <a href="http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python">fuzzy matching functions</a>, and it's definitely worth experimenting with all of them. I've personally found <code>ratio</code> and <code>token_set_ratio</code> to be the most useful.</p>
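<p>To get a feel for the kind of score involved, here's a small sketch that uses <code>difflib.SequenceMatcher</code> from the standard library directly. The <code>similarity</code> helper is hypothetical, and this is not Fuzzywuzzy's own code, just the building block it draws on.</p>

```python
from difflib import SequenceMatcher

def similarity(s1, s2):
    # ratio() is 1.0 for identical strings and 0.0 when nothing matches
    return SequenceMatcher(None, s1, s2).ratio()

print(similarity('happy days', 'happy dayz'))
print(similarity('happy days', 'sad days'))
```

<p>You'd still need to pick a threshold that separates matches from non-matches, which is where a good set of test cases pays off.</p>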
<h2>NLTK</h2>
<p>If you want to do some custom fuzzy string matching, then <a href="http://www.nltk.org/">NLTK</a> is a great library to use. It has <a title="Python NLTK Word Tokenization Demo" href="http://text-processing.com/demo/tokenize/">word tokenizers</a>, <a title="Python NLTK Stemming and Lemmatization Demo" href="http://text-processing.com/demo/stem/">stemmers</a>, and even its own <a href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics-module.html#edit_distance">edit distance</a> implementation. Here's a way you could combine all 3 to create a fuzzy string matching function.</p>
<pre class="brush: python; title: ; notranslate">
from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()

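def normalize(s):
	# lowercase, trim, and split into word and punctuation tokens
	words = tokenize.wordpunct_tokenize(s.lower().strip())
	# stem each token so inflected forms of a word reduce to a common stem
	return ' '.join([stemmer.stem(w) for w in words])

def fuzzy_match(s1, s2, max_dist=3):
	# match when the normalized strings are within max_dist character edits
	return metrics.edit_distance(normalize(s1), normalize(s2)) &lt;= max_dist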
</pre>
<h2>Phonetics</h2>
<p>Finally, an interesting and perhaps non-obvious way to compare strings is with <a href="http://en.wikipedia.org/wiki/Phonetic_algorithm">phonetic algorithms</a>. The idea is that 2 strings that sound the same may be the same (or at least similar enough). One of the most well-known phonetic algorithms is <a href="http://en.wikipedia.org/wiki/Soundex">Soundex</a>, with a <a href="http://code.activestate.com/recipes/52213/">Python Soundex algorithm here</a>. Another is <a href="http://en.wikipedia.org/wiki/Double_Metaphone#Double_Metaphone">Double Metaphone</a>, with a <a href="http://www.atomodo.com/code/double-metaphone/metaphone.py/view">Python metaphone module here</a>. You can also find code for these and other phonetic algorithms in the <a href="https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/featx/phonetics.py">nltk-trainer phonetics module</a> (copied from a now-defunct SourceForge project called <a href="http://advas.sourceforge.net/">advas</a>). Using any of these algorithms, you get an encoded string, and then if 2 encodings compare equal, the original strings match. Theoretically, you could even do fuzzy matching on the phonetic encodings, but that's probably pushing the bounds of fuzziness a bit too far.</p>
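<p>To show what "encode, then compare" looks like, here is a small sketch of American Soundex as I understand the rules. It is not the code from the recipe or modules linked above, and the names <code>SOUNDEX_CODES</code> and <code>soundex</code> are my own.</p>

```python
# Map each coded consonant to its Soundex digit.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word):
    """Return the 4-character American Soundex code for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    last_digit = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit:
            # Skip a consonant that repeats the previous code.
            if digit != last_digit:
                encoded += digit
            last_digit = digit
        elif c not in "hw":
            # Vowels reset the previous code, so the same code on both
            # sides of a vowel is kept twice; h and w do not reset it.
            last_digit = ""
    # Pad with zeros or truncate to exactly 4 characters.
    return (encoded + "000")[:4]
```

<p>Both <code>soundex("Robert")</code> and <code>soundex("Rupert")</code> produce <code>R163</code>, so the two names would be treated as a match.</p>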
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/10/31/fuzzy-string-matching-python/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/10/31/fuzzy-string-matching-python/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>NLTK Overview at SF Python</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/ZY4H_FYegAI/</link>
		<comments>http://streamhacker.com/2011/09/06/nltk-overview-sf-python/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 16:00:34 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[nltk]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1807</guid>
		<description><![CDATA[Announcement of an NLTK overview talk at the San Francisco Python Meetup Group on September 14, 2011. The talk will be a quick overview of topics such as tokenization, part-of-speech tagging, chunking and named entity recognition, text classification, corpus readers, and using nltk-trainer to train custom models.]]></description>
				<content:encoded><![CDATA[<p>On September 14, 2011, I'll be giving a 20-minute overview of <a href="http://www.nltk.org/">NLTK</a> for the <a href="http://www.meetup.com/sfpython/events/29072421/">San Francisco Python Meetup Group</a>. Since it's only 20 minutes, I can't get into too much detail, but I plan to quickly cover the basics of:</p>
<ul>
<li><a href="http://text-processing.com/demo/tokenize/">tokenization</a> and why it's not as easy as <code>str.split()</code></li>
<li><a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html">part-of-speech tagging</a> and why it's important</li>
<li><a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html">chunking and named entity recognition</a></li>
<li><a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html">text classification</a> and how it works for <a href="http://text-processing.com/demo/sentiment/">sentiment analysis</a></li>
<li>training your own models with <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a></li>
</ul>
<p>I'll also be soliciting feedback for an <a href="http://streamhacker.com/2011/08/22/pycon-nltk-tutorial-suggestions/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">NLTK Tutorial at PyCon 2012</a>. So if you'll be at the meetup and are interested in attending an NLTK tutorial, come find me and tell me what you'd want to learn.</p>
<p><strong>Updated 9/15/2011</strong>: Slides from the talk are online - <a title="A sprint thru Python's Natural Language ToolKit" href="http://www.slideshare.net/japerk/nltk-in-20-minutes">NLTK in 20 minutes</a></p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/09/06/nltk-overview-sf-python/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/09/06/nltk-overview-sf-python/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
	</channel>
</rss>
