<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>StreamHacker</title>
	
	<link>http://streamhacker.com</link>
	<description>Weotta be Hacking</description>
	<lastBuildDate>Mon, 09 Jan 2012 18:00:41 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>

	
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/StreamHacker" /><feedburner:info uri="streamhacker" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://superfeedr.com/hubbub" /><feedburner:emailServiceId>StreamHacker</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Upcoming Talks</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/Zk8lUjVTaaQ/</link>
		<comments>http://streamhacker.com/2012/01/09/upcoming-talks/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 18:00:41 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[talks]]></category>
		<category><![CDATA[weotta]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[pycon]]></category>
		<category><![CDATA[strata]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1829</guid>
		<description><![CDATA[Upcoming talks include How Weotta uses MongoDB at 10gen's new SF office; a NLTK Jam Session at NICAR 2012 in St Louis, MO; Corpus Bootstrapping with NLTK at Strata 2012, and my PyCon 2012 tutorial: Introduction to NLTK.]]></description>
			<content:encoded><![CDATA[<p>At the end of February and the beginning of March, I'll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order...</p>
<h2>How Weotta uses MongoDB</h2>
<p><a href="http://www.crunchbase.com/person/grant-wernick">Grant</a> and I will be helping <a href="http://www.10gen.com/">10gen</a> celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking about<br />
<a href="http://www.meetup.com/San-Francisco-MongoDB-User-Group/events/45348472/">How Weotta uses MongoDB</a>. We'll cover some of our favorite features of <a href="http://www.mongodb.org/">MongoDB</a> and how we use it for local place &amp; events search. Then we'll finish with a preview of <a href="http://www.weotta.com/">Weotta's</a> upcoming MongoDB powered local search APIs.</p>
<h2>NLTK Jam Session at NICAR 2012</h2>
<p>On Thursday, February 23, in St Louis, MO, I'll be demonstrating how to use <a href="http://www.nltk.org/">NLTK</a> as part of the <a href="http://ire.org/conferences/nicar-2012/newscamp/">NewsCamp workshop</a> at <a href="http://ire.org/conferences/nicar-2012/">NICAR 2012</a>. This will be a version of my <a href="https://us.pycon.org/2012/schedule/presentation/199/">PyCon NLTK Tutorial</a> with a focus on news text and corpora like <em>treebank</em>.</p>
<h2>Corpus Bootstrapping with NLTK at Strata 2012</h2>
<p>As part of the <a href="http://strataconf.com/strata2012">Strata 2012</a> <a href="http://strataconf.com/strata2012/public/schedule/detail/22903">Deep Data program</a>, I'll talk about <a href="http://strataconf.com/strata2012/public/schedule/detail/22412">Corpus Bootstrapping with NLTK</a> on Tuesday, February 28. The premise of this talk is that while there are plenty of great algorithms and methods for <a href="http://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a>, most of them require a training corpus, and chances are the training corpus you really need doesn't exist. So how can you quickly create a quality corpus at minimal cost? I'll cover specific real-world examples to answer this question.</p>
<h2>NLTK Tutorial at PyCon 2012</h2>
<p><a href="https://us.pycon.org/2012/schedule/presentation/199/">Introduction to NLTK</a> will be a 3 hour tutorial at <a href="https://us.pycon.org/2012/">PyCon</a> on Thursday, March 8th. You'll get to know <a href="http://www.nltk.org/">NLTK</a> in depth, learn about corpus organization, and train your own models manually &amp; with <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a>. My goal is that you'll walk out with at least one new NLP superpower that you can put to use immediately.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2012/01/09/upcoming-talks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2012/01/09/upcoming-talks/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Fuzzy String Matching in Python</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/AS4JRWS2bhY/</link>
		<comments>http://streamhacker.com/2011/10/31/fuzzy-string-matching-python/#comments</comments>
		<pubDate>Mon, 31 Oct 2011 15:47:47 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[doctest]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[phonetic]]></category>
		<category><![CDATA[regex]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1815</guid>
		<description><![CDATA[Python fuzzy string matching using normalization, regular expressions, edit distance, and fuzzywuzzy. You can do your own fuzzy matching with Python NLTK by combining tokenization, stemming, and edit distance. Phonetic algorithms can also be used to match strings.]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Approximate_string_matching">Fuzzy matching</a> is a general term for finding strings that are <em>almost</em> equal, or <em>mostly</em> the same. Of course <em>almost</em> and <em>mostly</em> are ambiguous terms themselves, so you'll have to determine what they really mean for your specific needs. The best way to do this is to come up with a list of test cases before you start writing any fuzzy matching code. These test cases should be pairs of strings that either should fuzzy match, or not. I like to create <a title="Test Driven Development in Python" href="http://streamhacker.com/2009/02/05/test-driven-development-in-python/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">doctests</a> for this, like so:</p>
<pre class="brush: python; title: ; notranslate">
def fuzzy_match(s1, s2):
	'''
	&gt;&gt;&gt; fuzzy_match('Happy Days', ' happy days ')
	True
	&gt;&gt;&gt; fuzzy_match('happy days', 'sad days')
	False
	'''
	# TODO: fuzzy matching code
	return s1 == s2
</pre>
<p>Once you've got a good set of test cases, then it's much easier to tailor your fuzzy matching code to get the best results.</p>
<h2>Normalization</h2>
<p>The first step before doing any string matching is <em>normalization</em>. The goal with normalization is to transform your strings into a normal form, which in some cases may be all you need to do. While <code>'Happy Days' != ' happy days '</code>, with simple normalization you can get <code>'Happy <span class="pre">Days'.lower()</span> == ' happy days '.strip()</code>.</p>
<p>The most basic normalization you can do is to <a href="http://docs.python.org/library/stdtypes.html#str.lower">lowercase</a> and <a href="http://docs.python.org/library/stdtypes.html#str.strip">strip</a> whitespace. But chances are you'll want to do more. For example, here's a simple normalization function that also removes all punctuation in a string.</p>
<pre class="brush: python; title: ; notranslate">
import string

def normalize(s):
	for p in string.punctuation:
		s = s.replace(p, '')

	return s.lower().strip()
</pre>
<p>Using this <code>normalize</code> function, we can make the above fuzzy matching function pass our simple tests.</p>
<pre class="brush: python; title: ; notranslate">
def fuzzy_match(s1, s2):
	'''
	&gt;&gt;&gt; fuzzy_match('Happy Days', ' happy days ')
	True
	&gt;&gt;&gt; fuzzy_match('happy days', 'sad days')
	False
	'''
	return normalize(s1) == normalize(s2)
</pre>
<p>If you want to get more advanced, keep reading...</p>
<h2>Regular Expressions</h2>
<p>Beyond just stripping whitespace from the ends of strings, it's also a good idea to replace every run of whitespace with a single space character. The <a title="python re module" href="http://docs.python.org/library/re.html">regex</a> function for doing this is <code>re.sub('\s+', ' ', s)</code>. This will replace every occurrence of one or more spaces, newlines, tabs, etc, with one space, essentially eliminating the significance of whitespace for matching.</p>
<p>You may also be able to use regular expressions for <em>partial fuzzy matching</em>. Maybe you can use regular expressions to identify significant parts of a string, or perhaps split a string into component parts for further matching. If you think you can create a <em>simple</em> regular expression to help with fuzzy matching, do it, because chances are, any other code you write to do fuzzy matching will be more complicated, less straightforward, and probably slower. You can also use more complicated regular expressions to handle specific edge cases. But beware of any expression that takes puzzling out every time you look at it, because you'll probably be revisiting this code a number of times to tweak it for handling new cases, and tweaking complicated regular expressions is a sure way to induce headaches and eyeball-bleeding.</p>
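<p>As a minimal sketch of the whitespace normalization described above (the function name <code>collapse_whitespace</code> is just an illustrative choice, not something from a library), you could do something like this:</p>
<pre class="brush: python; title: ; notranslate">
import re

def collapse_whitespace(s):
	# replace each run of spaces, tabs, newlines, etc with one space,
	# then strip the single space that may be left at either end
	return re.sub(r'\s+', ' ', s).strip()
</pre>
<p>With this, <code>collapse_whitespace(' happy \t\n days ')</code> gives <code>'happy days'</code>, so it can be combined with lowercasing and punctuation removal in a <code>normalize</code> function.</p>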
<h2>Edit Distance</h2>
<p>The <a href="http://en.wikipedia.org/wiki/Edit_distance">edit distance</a> (aka <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>) is the number of single-character edits it would take to transform one string into another. Therefore, the smaller the edit distance, the more similar two strings are.</p>
<p>If you want to do edit distance calculations, check out the standalone <a href="http://www.mindrot.org/projects/py-editdist/">editdist</a> module. Its <code>distance</code> function takes 2 strings and returns the Levenshtein edit distance. It's also implemented in C, and so is quite fast.</p>
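<p>If you just want to see how the calculation works, it's short enough to write in pure Python. The following is a rough sketch of the standard dynamic programming approach to Levenshtein distance. It is not the editdist module's code, the name <code>levenshtein</code> is only illustrative, and you should expect it to be much slower than a C implementation.</p>
<pre class="brush: python; title: ; notranslate">
def levenshtein(a, b):
	# prev[j] holds the distance between the prefix of a seen so far and b[:j];
	# it starts as the distance from the empty string to each prefix of b
	prev = list(range(len(b) + 1))

	for i, ca in enumerate(a, 1):
		# turning a[:i] into the empty string takes i deletions
		curr = [i]

		for j, cb in enumerate(b, 1):
			cost = 0 if ca == cb else 1
			deletion = prev[j] + 1
			insertion = curr[j - 1] + 1
			substitution = prev[j - 1] + cost
			curr.append(min(deletion, insertion, substitution))

		prev = curr

	return prev[-1]
</pre>
<p>For example, <code>levenshtein('kitten', 'sitting')</code> is 3: two substitutions and one insertion.</p>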
<h2>Fuzzywuzzy</h2>
<p><a href="https://github.com/seatgeek/fuzzywuzzy">Fuzzywuzzy</a> is a great all-purpose library for fuzzy string matching, built (in part) on top of Python's <a href="http://docs.python.org/library/difflib.html">difflib</a>. It has a number of different <a href="http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python">fuzzy matching functions</a>, and it's definitely worth experimenting with all of them. I've personally found <code>ratio</code> and <code>token_set_ratio</code> to be the most useful.</p>
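<p>To get a feel for the kind of score these functions produce, here's a small sketch that uses only <code>difflib</code> from the standard library. It approximates the idea behind a simple similarity ratio and is not fuzzywuzzy's actual code; the function name <code>similarity</code> is made up for illustration.</p>
<pre class="brush: python; title: ; notranslate">
from difflib import SequenceMatcher

def similarity(s1, s2):
	# SequenceMatcher.ratio() returns a float between 0 and 1,
	# scaled here to an integer score between 0 and 100
	return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))
</pre>
<p>Identical strings score 100 and strings with nothing in common score 0, so you can pick a cutoff in between based on your test cases.</p>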
<h2>NLTK</h2>
<p>If you want to do some custom fuzzy string matching, then <a href="http://www.nltk.org/">NLTK</a> is a great library to use. It has <a title="Python NLTK Word Tokenization Demo" href="http://text-processing.com/demo/tokenize/">word tokenizers</a>, <a title="Python NLTK Stemming and Lemmatization Demo" href="http://text-processing.com/demo/stem/">stemmers</a>, and even its own <a href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics-module.html#edit_distance">edit distance</a> implementation. Here's a way you could combine all 3 to create a fuzzy string matching function.</p>
<pre class="brush: python; title: ; notranslate">
from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()

def normalize(s):
	words = tokenize.wordpunct_tokenize(s.lower().strip())
	return ' '.join([stemmer.stem(w) for w in words])

def fuzzy_match(s1, s2, max_dist=3):
	return metrics.edit_distance(normalize(s1), normalize(s2)) &lt;= max_dist
</pre>
<h2>Phonetics</h2>
<p>Finally, an interesting and perhaps non-obvious way to compare strings is with <a href="http://en.wikipedia.org/wiki/Phonetic_algorithm">phonetic algorithms</a>. The idea is that 2 strings that sound the same may be the same (or at least similar enough). One of the most well known phonetic algorithms is <a href="http://en.wikipedia.org/wiki/Soundex">Soundex</a>, with a <a href="http://code.activestate.com/recipes/52213/">python soundex algorithm here</a>. Another is <a href="http://en.wikipedia.org/wiki/Double_Metaphone#Double_Metaphone">Double Metaphone</a>, with a <a href="http://www.atomodo.com/code/double-metaphone/metaphone.py/view">python metaphone module here</a>. You can also find code for these and other phonetic algorithms in the <a href="https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/featx/phonetics.py">nltk-trainer phonetics module</a> (copied from a now defunct sourceforge project called <a href="http://advas.sourceforge.net/">advas</a>). Using any of these algorithms, you get an encoded string, and then if 2 encodings compare equal, the original strings match. Theoretically, you could even do fuzzy matching on the phonetic encodings, but that's probably pushing the bounds of fuzziness a bit too far.</p>
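<p>To make the idea concrete, here's a minimal sketch of American Soundex in pure Python. It follows the commonly described rules and is not the code from any of the modules linked above; the names <code>CODES</code> and <code>soundex</code> are just illustrative.</p>
<pre class="brush: python; title: ; notranslate">
# map each coded consonant to its Soundex digit; vowels, y, h and w have no code
CODES = {}
for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
		('l', '4'), ('mn', '5'), ('r', '6')):
	for ch in letters:
		CODES[ch] = digit

def soundex(name):
	letters = [c for c in name.lower() if c.isalpha()]

	if not letters:
		return ''

	# the first letter is kept as-is, but its code still counts as 'previous'
	encoded = letters[0].upper()
	prev = CODES.get(letters[0], '')

	for c in letters[1:]:
		code = CODES.get(c, '')
		# skip uncoded letters and repeats of the previous code
		if code and code != prev:
			encoded += code
		# h and w do not separate repeated codes, but vowels do
		if c not in 'hw':
			prev = code

	# pad with zeros and cut to one letter plus three digits
	return (encoded + '000')[:4]
</pre>
<p>Here <code>soundex('Robert')</code> and <code>soundex('Rupert')</code> both give <code>'R163'</code>, so the two names would be treated as a phonetic match.</p>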
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/10/31/fuzzy-string-matching-python/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/10/31/fuzzy-string-matching-python/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>NLTK Overview at SF Python</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/ZY4H_FYegAI/</link>
		<comments>http://streamhacker.com/2011/09/06/nltk-overview-sf-python/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 16:00:34 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[nltk]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1807</guid>
		<description><![CDATA[Announcement of a NLTK overview talk at the San Francisco Python Meetup Group on September 14, 2011. The talk will be a quick overview of topics such as tokenization, part-of-speech tagging, chunking and named entity recognition, text classification, corpus readers, and using nltk-trainer to train custom models.]]></description>
			<content:encoded><![CDATA[<p>On September 14, 2011, I'll be giving a 20 minute overview of <a href="http://www.nltk.org/">NLTK</a> for the <a href="http://www.meetup.com/sfpython/events/29072421/">San Francisco Python Meetup Group</a>. Since it's only 20 minutes, I can't get into too much detail, but I plan to quickly cover the basics of:</p>
<ul>
<li><a href="http://text-processing.com/demo/tokenize/">tokenization</a> and why it's not as easy as <code>str.split()</code></li>
<li><a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html">part-of-speech tagging</a> and why it's important</li>
<li><a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html">chunking and named entity recognition</a></li>
<li><a href="http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html">text classification</a> and how it works for <a href="http://text-processing.com/demo/sentiment/">sentiment analysis</a></li>
<li>training your own models with <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a></li>
</ul>
<p>I'll also be soliciting feedback for a <a href="http://streamhacker.com/2011/08/22/pycon-nltk-tutorial-suggestions/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">NLTK Tutorial at PyCON 2012</a>. So if you'll be at the meetup and are interested in attending a NLTK tutorial, come find me and tell me what you'd want to learn.</p>
<p><strong>Updated 9/15/2011</strong>: Slides from the talk are online - <a title="A sprint thru Python's Natural Language ToolKit" href="http://www.slideshare.net/japerk/nltk-in-20-minutes">NLTK in 20 minutes</a></p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/09/06/nltk-overview-sf-python/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/09/06/nltk-overview-sf-python/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Python 3 Web Development Review</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/HCmo1xpjUUE/</link>
		<comments>http://streamhacker.com/2011/08/27/python-3-web-development-review/#comments</comments>
		<pubDate>Sat, 27 Aug 2011 16:19:25 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[cherrypy]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[http]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[jquery]]></category>
		<category><![CDATA[sqlite]]></category>
		<category><![CDATA[templates]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1799</guid>
		<description><![CDATA[Python 3 Web Development Beginner's Guide should more accurately be called Web Framework Development from scratch using CherryPy and jQuery. When you have the right expectations, the book's approach makes a lot more sense. The only major drawback then is the inline HTML rendering, when the author really should have used one of the many Python templating engines.]]></description>
			<content:encoded><![CDATA[<p><a href="http://streamhacker.com/wp-content/uploads/2011/06/python3_web_dev_guide.png#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed"><img class="size-full wp-image-1630 alignright" title="Python 3 Web Development Beginner's Guide" src="http://streamhacker.com/wp-content/uploads/2011/06/python3_web_dev_guide.png" alt="Python 3 Web Development" width="125" height="152" /></a>The problem with <a href="http://www.amazon.com/gp/product/1849513740/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=1849513740">Python 3 Web Development Beginner's Guide</a>, by <a href="http://michelanders.blogspot.com/">Michel Anders</a>, is one of expectations (<strong>disclaimer</strong>: I received a free eBook from <a href="http://www.packtpub.com/">Packt</a> for review). Let's start with the title... First we have <em>Python 3 Web Development</em>. This immediately sets the wrong expectations because:</p>
<blockquote>
<ol>
<li>There's almost as much <a href="http://jquery.com/">jQuery</a> &amp; <a href="http://en.wikipedia.org/wiki/JavaScript">Javascript</a> as there is <a href="http://www.python.org/">Python</a>.</li>
<li>Most of the <a href="http://www.python.org/">Python</a> code is not <a href="http://docs.python.org/py3k/whatsnew/3.0.html">Python 3</a> specific, and the code that is could easily be translated to <a href="http://wiki.python.org/moin/Python2orPython3">Python 2</a>.</li>
<li>Much of the Python code either uses <a href="http://www.cherrypy.org/">CherryPy</a> or is for generating HTML. This is not immediately obvious, but becomes apparent in Chapter 3 (which is available as a free PDF download: <a href="http://www.packtpub.com/sites/default/files/3746OS-Chapter-3-Tasklist-I-Persistence.pdf?utm_source=packtpub&amp;utm_medium=free&amp;utm_campaign=pdf">Chapter 3 - Tasklist I Persistence</a>).</li>
</ol>
</blockquote>
<p>Second, this book is also supposed to be a <em>Beginner's Guide</em>, but that is definitely not the case. To really grasp what's going on, you need to already know the basics of HTML, <a href="http://www.amazon.com/gp/product/1847196705/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399369&amp;creativeASIN=1847196705">jQuery interaction</a>, and how <a href="http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol">HTTP</a> works. Chapter 1 is an excellent introduction to <a href="http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol">HTTP</a> and web application development, but the book as a whole is not beginner material. I think that anything that uses <a href="http://stackoverflow.com/questions/100003/what-is-a-metaclass-in-python">Python metaclasses</a> automatically becomes at least intermediate level, if not expert, and the main thrust of Chapter 7 is refactoring all your straightforward database code to use complicated <a href="http://jasonmbaker.com/python-metaclasses-in-depth">metaclasses</a>.</p>
<p>However, if you mentally rewrite the title to be "Web Framework Development from scratch using CherryPy and jQuery", then you've got the right idea. The book steps you through <a href="http://www.amazon.com/gp/product/1904811841/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=1904811841">web app development with CherryPy</a>, database models with <a href="http://www.sqlite.org/">sqlite3</a>, and plenty of HTML and jQuery for interface generation and interaction. While creating example applications, you slowly build up a re-usable framework. It's an interesting approach, but unfortunately it gets muddied up with inline HTML rendering. I never thought a language as simple and elegant as Python could be reduced to the ugliness of common <a href="http://www.php.net/">PHP</a>, but generating HTML with string interpolation inside the same functions that are accessing the database gets pretty close. I kept expecting the author to introduce <a href="http://en.wikipedia.org/wiki/Template_engine_(web)">template rendering</a>, which is a major part of most modern web development frameworks, but it never happened, despite the plethora of excellent <a href="http://wiki.python.org/moin/Templating">Python templating libraries</a>.</p>
<p>While reading this book, I often had the recurring thought "I'm so glad I use <a href="https://www.djangoproject.com/">Django</a>". If your aim is rapid application development, this is not the book for you. However, if you're interested in creating your own web development framework, or would at least like to understand how a framework like <a href="https://www.djangoproject.com/">Django</a> could be created, then buy a copy of <a href="http://link.packtpub.com/K7lyie">Python 3 Web Development</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/08/27/python-3-web-development-review/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/08/27/python-3-web-development-review/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>PyCon NLTK Tutorial Suggestions</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/ksck5m2NJ3U/</link>
		<comments>http://streamhacker.com/2011/08/22/pycon-nltk-tutorial-suggestions/#comments</comments>
		<pubDate>Mon, 22 Aug 2011 15:02:50 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[pycon]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1792</guid>
		<description><![CDATA[A request for feedback on a NLTK tutorial for PyCon 2012. What topics should be covered? And does anyone want to help co-host?]]></description>
			<content:encoded><![CDATA[<p><a href="http://us.pycon.org/2012/">PyCon 2012</a> just released a <a href="http://us.pycon.org/2012/cfp/">CFP</a>, and <a href="http://www.nltk.org/">NLTK</a> shows up 3 times in the suggested topics. While I've never done this before, I know stuff about <a title="Python Natural Language Processing with NLTK" href="http://www.amazon.com/gp/product/1849513600?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=1849513600">Text Processing with NLTK</a> so I'm going to submit a tutorial abstract. But I want your feedback: what exactly should this tutorial cover? If you could attend a 3 hour class on NLTK, what knowledge &amp; skills would you like to come away with? Here are a few specific topics I could cover:</p>
<ul>
<li>part-of-speech tagging &amp; chunking</li>
<li>text classification</li>
<li>creating a custom corpus and corpus reader</li>
<li>training custom models (manually and/or with <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a>)</li>
<li>bootstrapping a custom corpus for text classification</li>
</ul>
<p>Or I could do a high-level survey of many <a href="http://www.nltk.org/">NLTK</a> modules and corpora. Please let me know what you think in the comments, if you plan on going to <a href="http://us.pycon.org/2012/">PyCon 2012</a>, and if you'd want to attend a tutorial on NLTK. You can also <a href="http://text-processing.com/contact/">contact me directly</a> if you prefer.</p>
<h2>Co-Hosting</h2>
<p>If you've done this kind of thing before, have some teaching and/or speaking experience, and you feel you could add value (maybe you're a computational linguist or NLP'er and/or have used <a href="http://www.nltk.org/">NLTK</a> professionally), I'd be happy to work with a co-host. <a href="http://text-processing.com/contact/">Contact me</a> if you're interested, or leave a note in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/08/22/pycon-nltk-tutorial-suggestions/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/08/22/pycon-nltk-tutorial-suggestions/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Testing Command Line Scripts with Roundup</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/p1IZhW2a84o/</link>
		<comments>http://streamhacker.com/2011/07/25/testing-command-line-scripts-roundup/#comments</comments>
		<pubDate>Mon, 25 Jul 2011 16:00:40 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[roundup]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1783</guid>
		<description><![CDATA[Using roundup to test command line shell scripts, with examples from nltk-trainer.]]></description>
			<content:encoded><![CDATA[<p>As <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a> became more stable, I realized that I needed some way to test the command line scripts. My previous ad-hoc method of "test whatever script options I can remember" was becoming unwieldy and unreliable. But how do you make repeatable tests for a command line script? It doesn't really fit into the standard unit testing model.</p>
<p>Enter <a href="http://bmizerany.github.com/roundup/">roundup</a> by <a href="http://twitter.com/#!/bmizerany">Blake Mizerany</a>. (NOTE: do not try to do <code>apt-get install roundup</code>. You will get an <a href="http://www.roundup-tracker.org/">issue tracking system</a>, not a script testing tool).</p>
<p>Roundup provides a great way to <a href="http://itsbonus.heroku.com/p/2010-11-01-roundup">prevent shell bugs</a> by creating simple test functions within a shell script. Here's the first dozen lines of <a href="https://github.com/japerk/nltk-trainer/blob/master/tests/train_classifier.sh">train_classifier.sh</a>, which you can probably guess tests <a href="https://github.com/japerk/nltk-trainer/blob/master/train_classifier.py">train_classifier.py</a>:</p>
<pre class="brush: bash; title: ; notranslate">
#!/usr/bin/env roundup

describe &quot;train_classifier.py&quot;

it_displays_usage_when_no_arguments() {
	./train_classifier.py 2&gt;&amp;1 | grep -q &quot;usage: train_classifier.py&quot;
}

it_cannot_find_foo() {
	last_line=$(./train_classifier.py foo 2&gt;&amp;1 | tail -n 1)
	test &quot;$last_line&quot; &quot;=&quot; &quot;ValueError: cannot find corpus path for foo&quot;
}
</pre>
<p><code>describe</code> is like the name of a module or test case, and all test functions begin with <code>it_</code>. Within the test functions, you use standard shell commands that should produce no output on success (like <code>grep -q</code> or the <a href="http://linux.about.com/library/cmd/blcmdl1_test.htm">test</a> command). You can also match multiple lines of output, as in:</p>
<pre class="brush: bash; title: ; notranslate">
it_trains_movie_reviews_paras() {
	test &quot;$(./train_classifier.py movie_reviews --no-pickle --no-eval --fraction 0.5 --instances paras)&quot; &quot;=&quot; &quot;loading movie_reviews
2 labels: ['neg', 'pos']
1000 training feats, 1000 testing feats
training NaiveBayes classifier&quot;
}
</pre>
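<p>To see why these commands work as checks, here is a minimal sketch in plain <code>sh</code> that needs neither roundup nor nltk-trainer. It assumes a check counts as passing when its function exits with status 0. <code>fake_script</code> is a hypothetical stand-in for a real script, and the <code>for</code> loop is only a hand-rolled illustration, not how roundup itself runs tests or formats its report:</p>

```shell
#!/bin/sh
# Hypothetical stand-in for a real command line script: prints two lines.
fake_script() {
	printf 'loading data\n2 labels\n'
}

# grep -q prints nothing; it exits 0 only when the pattern is found.
it_mentions_labels() {
	fake_script | grep -q "2 labels"
}

# test "=" compares the whole captured output, embedded newline included.
it_prints_exact_output() {
	test "$(fake_script)" "=" "loading data
2 labels"
}

# Deliberately wrong expectation, to show a non-zero exit status.
it_fails_on_wrong_output() {
	test "$(fake_script)" "=" "something else"
}

# Call each function by hand and report its exit status.
for check in it_mentions_labels it_prints_exact_output it_fails_on_wrong_output; do
	if "$check"; then
		echo "$check: PASS"
	else
		echo "$check: FAIL"
	fi
done
```

<p>The first two checks should report PASS and the third FAIL. Command substitution strips the trailing newline from the captured output, which is why the expected multi-line string ends without one.</p>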
<p>Once you've got all your test functions defined, make sure your test script is executable and <a href="https://github.com/bmizerany/roundup/blob/master/INSTALLING#files">roundup is installed</a>, then run your test script. You'll get nice output that looks like:</p>
<pre>nltk-trainer$ tests/train_classifier.sh
train_classifier.py
  it_displays_usage_when_no_arguments:             [PASS]
  it_cannot_find_foo:                              [PASS]
  it_cannot_import_reader:                         [PASS]
  it_trains_movie_reviews_paras:                   [PASS]
  it_trains_corpora_movie_reviews_paras:           [PASS]
  it_cross_fold_validates:                         [PASS]
  it_trains_movie_reviews_sents:                   [PASS]
  it_trains_movie_reviews_maxent:                  [PASS]
  it_shows_most_informative:                       [PASS]
=========================================================
Tests:    9 | Passed:   9 | Failed:   0</pre>
<p>So far, <a href="http://bmizerany.github.com/roundup/">roundup</a> has been a perfect tool for testing all the <a href="https://github.com/japerk/nltk-trainer">nltk-trainer</a> scripts, and the only downside is the one-time manual installation. I highly recommend it for anyone writing custom commands and scripts, no matter what language you use to write them.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/StreamHacker?a=p1IZhW2a84o:1qxYHx70Y1M:cGdyc7Q-1BI"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=cGdyc7Q-1BI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=p1IZhW2a84o:1qxYHx70Y1M:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=p1IZhW2a84o:1qxYHx70Y1M:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=p1IZhW2a84o:1qxYHx70Y1M:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=p1IZhW2a84o:1qxYHx70Y1M:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=p1IZhW2a84o:1qxYHx70Y1M:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/p1IZhW2a84o" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/07/25/testing-command-line-scripts-roundup/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/07/25/testing-command-line-scripts-roundup/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Python Testing Cookbook Review</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/KbKaHkiAmNM/</link>
		<comments>http://streamhacker.com/2011/07/18/python-testing-cookbook-review/#comments</comments>
		<pubDate>Mon, 18 Jul 2011 16:00:03 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[bdd]]></category>
		<category><![CDATA[doctest]]></category>
		<category><![CDATA[nose]]></category>
		<category><![CDATA[tdd]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1768</guid>
		<description><![CDATA[Review of Python Testing Cookbook by Greg Turnquist. The book covers unit testing with unittest, running tests with nose, writing nose plugins, testable docstrings with doctest, behavior driven development and mock objects, acceptance testing, continuous integration, and smoke and load testing. Many python testing libraries are introduced with good usage examples.]]></description>
			<content:encoded><![CDATA[<p><a title="Python Testing Cookbook" href="http://link.packtpub.com/n1Izb8"><img class="size-full wp-image-1629 alignright" title="Python Testing Cookbook" src="http://streamhacker.com/wp-content/uploads/2011/06/python_testing_cookbook.png" alt="Python Testing Cookbook" width="125" height="152" /></a><a href="http://www.amazon.com/gp/product/1849514666/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=1849514666">Python Testing Cookbook</a>, by <a href="http://greg-turnquist.blogspot.com/">Greg L Turnquist</a> (<a href="http://twitter.com/#!/gregturn">@gregturn</a>), goes far beyond <a href="http://en.wikipedia.org/wiki/Unit_testing">Unit Testing</a>, but overall it's a mixed bag. Here's a breakdown for each chapter (<strong>disclaimer</strong>: I received a free eBook from <a href="http://www.packtpub.com/">Packt</a> for review):</p>
<ol>
<li>Basic introduction to testing with <a href="http://docs.python.org/library/unittest.html">unittest</a>, which is great if you're just starting with Python and testing.</li>
<li>Good coverage of <a href="http://packages.python.org/nose/">nose</a>. I was pleasantly surprised at how easy it is to write <a href="http://packages.python.org/nose/plugins/writing.html">nose plugins</a>.</li>
<li>Deep coverage of using <a href="http://docs.python.org/library/doctest.html">doctest</a> and writing <a href="http://pythontestingcookbook.posterous.com/mixing-docstrings-and-doctests-makes-python-a">testable docstrings</a>. You can <a href="http://www.packtpub.com/sites/default/files/4668OS-Chapter-3-Creating-Testable-Documentation-with-doctest.pdf?utm_source=packtpub&amp;utm_medium=free&amp;utm_campaign=pdf">download a free PDF of Chapter 3 here</a>.</li>
<li><a href="http://en.wikipedia.org/wiki/Behavior_Driven_Development">BDD</a> with a cool nose plugin, and how to use <a href="http://python-mock.sourceforge.net/">mock</a> or <a href="http://code.google.com/p/mockito-python/">mockito</a> for testing with <a href="http://en.wikipedia.org/wiki/Mock_object">mock objects</a>. I wish the author had expressed an opinion in favor of either <a href="http://python-mock.sourceforge.net/">mock</a> or <a href="http://code.google.com/p/mockito-python/">mockito</a>, but he didn't, so I will: use <a href="http://farmdev.com/projects/fudge/">Fudge</a>. Chapter 4 also covers the <a href="http://packages.python.org/lettuce/index.html">Lettuce DSL</a>, which I think is pretty neat, but since every step requires writing a function handler, I'm not convinced it's actually easier or better than writing doctests or unit tests.</li>
<li><a href="http://en.wikipedia.org/wiki/Acceptance_testing">Acceptance testing</a> with <a href="http://pyccuracy.org/">Pyccuracy</a> and <a href="http://code.google.com/p/robotframework/">Robot Framework</a>, which both give you a way to use <a href="http://seleniumhq.org/">Selenium</a> from Python. I thought the <a href="http://en.wikipedia.org/wiki/Domain-specific_language">DSLs</a> used seemed too "magic", but I that's probably because I didn't know the command words, and they weren't highlighted or adequately explained.</li>
<li>How to install and use <a href="http://jenkins-ci.org/">Jenkins</a> and <a href="http://www.jetbrains.com/teamcity/">TeamCity</a>, and how to display XML reports produced using <a href="http://nosexunit.sourceforge.net/">NoseXUnit</a>. This is a very useful chapter for anyone thinking about or setting up <a href="http://en.wikipedia.org/wiki/Continuous_integration">continuous integration</a>.</li>
<li>This chapter is supposed to be about test coverage, and does introduce <a href="http://pypi.python.org/pypi/coverage">coverage</a>, but the examples get needlessly complicated. Previous chapters used a simple shopping cart example, but this chapter uses network events, which really distracts from the tests. The author also writes unit tests that just print the results instead of actually testing results with assertions.</li>
<li>More network event complexity while trying to demonstrate <a href="http://en.wikipedia.org/wiki/Smoke_testing">smoke testing</a> and <a href="http://en.wikipedia.org/wiki/Load_testing">load testing</a>. This chapter would have made a lot more sense in a book about network programming and how to test network events. <a href="http://packages.python.org/Pyro4/">Pyro</a> is used with very little explanation, and <a href="http://www.mysql.com/">MySQL</a> and <a href="http://www.sqlite.org/">SQLite</a> are brought in too, increasing the complexity even more.</li>
<li>This chapter is filled with useful advice, but there are no actual code examples. Instead, the advice is shoehorned into the cookbook format, which I felt detracted from the otherwise great content.</li>
</ol>
<p>Throughout the book, the author presents a kind of "main script" that he updates at the end of many of the chapters. However, the script also contains stub functions that are never touched and barely explained, making their existence completely unnecessary. There are also far too many <code>import *</code> statements, which can make it difficult to understand the code. But I did learn enough new things that I think <a href="http://www.amazon.com/gp/product/1849514666/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=1849514666">Python Testing Cookbook</a> is worth buying and reading. Leaving out Chapters 7 and 8, I think the book is a great resource if you're just getting started with testing, you want to do <a href="http://en.wikipedia.org/wiki/Continuous_integration">continuous integration</a>, and/or you want to get non-programmers involved in the testing process. There's also a <a href="http://pythontestingcookbook.posterous.com/">blog about the book</a>, which has links to other reviews.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/StreamHacker?a=KbKaHkiAmNM:kYaMN3Erqi4:cGdyc7Q-1BI"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=cGdyc7Q-1BI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=KbKaHkiAmNM:kYaMN3Erqi4:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=KbKaHkiAmNM:kYaMN3Erqi4:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=KbKaHkiAmNM:kYaMN3Erqi4:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=KbKaHkiAmNM:kYaMN3Erqi4:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=KbKaHkiAmNM:kYaMN3Erqi4:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/KbKaHkiAmNM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/07/18/python-testing-cookbook-review/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/07/18/python-testing-cookbook-review/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Programming Collective Intelligence Review</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/QOJY0i2Kg1s/</link>
		<comments>http://streamhacker.com/2011/07/11/programming-collective-intelligence-review/#comments</comments>
		<pubDate>Mon, 11 Jul 2011 16:00:38 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[bayes]]></category>
		<category><![CDATA[machinelearning]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[svm]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1759</guid>
		<description><![CDATA[Programming Collective Intelligence is a great survey of machine learning algorithms using Python. It covers Naive Bayes, Artificial Neural Networks, Genetic Programming, Support Vector Machines, and more.]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.amazon.com/gp/product/0596529325/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399369&amp;creativeASIN=0596529325"><img class="alignleft" src="http://ecx.images-amazon.com/images/I/51pZYWZFkZL._SL110_.jpg" alt="Programming Collective Intelligence" width="84" height="110" /></a></p>
<p><a href="http://www.amazon.com/gp/product/0596529325/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399369&amp;creativeASIN=0596529325">Programming Collective Intelligence</a> is a great conceptual introduction to many common machine learning algorithms and techniques. It covers classification algorithms such as <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes</a> and <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Neural Networks</a>, and algorithmic optimization approaches like <a href="http://en.wikipedia.org/wiki/Genetic_programming">Genetic Programming</a>. The book also manages to pick interesting example applications, such as <a href="http://en.wikipedia.org/wiki/Stock_market_prediction">stock price prediction</a> and <a href="http://en.wikipedia.org/wiki/Topic_model">topic identification</a>.</p>
<p>There are two chapters in particular that stand out to me. First is Chapter 6, which covers <strong>Naive Bayes classification</strong>. What stood out was that the algorithm presented is an <a href="http://en.wikipedia.org/wiki/Online_machine_learning">online learner</a>, which means it can be updated as data comes in, unlike the <a href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.naivebayes.NaiveBayesClassifier-class.html">NLTK NaiveBayesClassifier</a>, which can be trained only once. Another thing that caught my attention was <a href="http://en.wikipedia.org/wiki/Fisher's_method">Fisher's method</a>, which is not implemented in <a href="http://www.nltk.org/">NLTK</a>, but could be with a little work. Apparently <strong>Fisher's method</strong> is great for <a href="http://en.wikipedia.org/wiki/Bayesian_spam_filtering">spam filtering</a>, and is used by the <a href="http://spambayes.sourceforge.net/">SpamBayes</a> Outlook plugin (which is also written in Python).</p>
<p>Second, I found Chapter 9, which covers <a href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Machines</a> and <a href="http://en.wikipedia.org/wiki/Kernel_methods">Kernel Methods</a>, to be quite intuitive. It explains the idea by starting with examples of <a href="http://en.wikipedia.org/wiki/Linear_classifier">linear classification</a> and its shortfalls. But then the examples show that by scaling the data in a particular way first, linear classification suddenly becomes possible. And the <a href="http://en.wikipedia.org/wiki/Kernel_trick">kernel trick</a> is simply a neat and efficient way to reduce the amount of calculation necessary to train a classifier on scaled data.</p>
<p>The final chapter summarizes all the key algorithms, and for many it includes commentary on their strengths and weaknesses. This seems like valuable reference material, especially for when you have a new data set to learn from, and you're not sure which algorithms will help get the results you're looking for. Overall, I found <a href="http://www.amazon.com/gp/product/0596529325/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399369&amp;creativeASIN=0596529325">Programming Collective Intelligence</a> to be an enjoyable read on my <a href="http://www.amazon.com/gp/product/B002FQJT3Q/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=B002FQJT3Q">Kindle 3</a>, and highly recommend it to anyone getting started with machine learning and Python, as well as anyone interested in a general survey of machine learning algorithms.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/StreamHacker?a=QOJY0i2Kg1s:t0LiYKElzYA:cGdyc7Q-1BI"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=cGdyc7Q-1BI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=QOJY0i2Kg1s:t0LiYKElzYA:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=QOJY0i2Kg1s:t0LiYKElzYA:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=QOJY0i2Kg1s:t0LiYKElzYA:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=QOJY0i2Kg1s:t0LiYKElzYA:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=QOJY0i2Kg1s:t0LiYKElzYA:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/QOJY0i2Kg1s" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/07/11/programming-collective-intelligence-review/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/07/11/programming-collective-intelligence-review/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Bay Area NLP Meetup</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/WzBZh149CMw/</link>
		<comments>http://streamhacker.com/2011/07/05/bay-area-nlp-meetup/#comments</comments>
		<pubDate>Wed, 06 Jul 2011 02:08:48 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1749</guid>
		<description><![CDATA[Bay Area NLP talk on NLTK: the Good, the Bad, and the Awesome. Will speak about natural language processing with NLTK, operating the text-processing.com APIs and demos, and NLP consulting.]]></description>
			<content:encoded><![CDATA[<p>This Thursday, July 7, 2011, will be the <a href="http://www.meetup.com/Bay-Area-NLP/events/16522295/">first meeting of the Bay Area NLP group</a>, at <a href="http://chomp.com/">Chomp</a> HQ in San Francisco, where I will be giving a talk on <a href="http://www.nltk.org/">NLTK</a> titled "NLTK: the Good, the Bad, and the Awesome". I'll be sharing some of the things I've learned using NLTK, operating <a title="Natural Language Processing APIs and Demos" href="http://text-processing.com/">text-processing.com</a>, and doing random consulting on natural language processing. I'll also explain why <a title="Train NLTK models for natural language processing" href="https://github.com/japerk/nltk-trainer">NLTK-Trainer</a> exists and how awesome it is for training NLP models. So if you're in the area and have some time Thursday evening, come by and say hi.</p>
<p><strong>Update on 07/10/2011</strong>: slides are online from my talk: <a title="NLTK: the Good, the Bad, and the Awesome" href="http://www.slideshare.net/japerk/nltk-the-good-the-bad-and-the-awesome-8556908">NLTK: the Good, the Bad, and the Awesome</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/StreamHacker?a=WzBZh149CMw:xj4OaW_SxIY:cGdyc7Q-1BI"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=cGdyc7Q-1BI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=WzBZh149CMw:xj4OaW_SxIY:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=WzBZh149CMw:xj4OaW_SxIY:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=WzBZh149CMw:xj4OaW_SxIY:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=WzBZh149CMw:xj4OaW_SxIY:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=WzBZh149CMw:xj4OaW_SxIY:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/WzBZh149CMw" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/07/05/bay-area-nlp-meetup/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/07/05/bay-area-nlp-meetup/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Upcoming Python Book Reviews</title>
		<link>http://feedproxy.google.com/~r/StreamHacker/~3/rfmAREOeAz8/</link>
		<comments>http://streamhacker.com/2011/07/04/upcoming-python-book-reviews/#comments</comments>
		<pubDate>Mon, 04 Jul 2011 18:23:16 +0000</pubDate>
		<dc:creator>Jacob</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[ebook]]></category>
		<category><![CDATA[machinelearning]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[webdev]]></category>

		<guid isPermaLink="false">http://streamhacker.com/?p=1721</guid>
		<description><![CDATA[Upcoming book reviews for Programming Collective Intelligence, Python Testing Cookbook, and Python 3 Web Development Beginner's Guide. All ebooks were read on a Kindle 3.]]></description>
			<content:encoded><![CDATA[<h4><a href="http://oreilly.com/catalog/9780596529321/"><img class="alignleft" style="clear: both;" src="http://ecx.images-amazon.com/images/I/51pZYWZFkZL._SL110_.jpg" alt="Programming Collective Intelligence" width="88" height="106" /></a>Programming Collective Intelligence</h4>
<p>I recently finished reading <a href="http://www.amazon.com/gp/product/0596529325/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399369&amp;creativeASIN=0596529325">Programming Collective Intelligence</a> and will be posting a review soon. The TL;DR review is: get it if you want a great introduction to machine learning with Python. It covers a lot of complex algorithms in a simple way, and provides some great example use cases.</p>
<h4><a href="http://www.amazon.com/gp/product/1849514666/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=1849514666"></a><a href="http://link.packtpub.com/n1Izb8"><img class="alignleft size-full wp-image-1728" style="clear: left;" src="http://streamhacker.com/wp-content/uploads/2011/07/python_testing_cookbook.png" alt="Python Testing Cookbook" width="88" height="106" /></a>Python Testing Cookbook</h4>
<p>Testing is something nearly every developer can do more of, and this <a href="http://www.amazon.com/gp/product/1849514666/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=1849514666">Python Testing Cookbook</a> looks to be full of techniques for integrating testing at various levels of a project. As a preview, you can download a <a href="http://www.packtpub.com/sites/default/files/4668OS-Chapter-3-Creating-Testable-Documentation-with-doctest.pdf?utm_source=packtpub&amp;utm_medium=free&amp;utm_campaign=pdf">PDF of Chapter 3 - Creating Testable Documentation with doctest</a>.</p>
<p><a href="http://link.packtpub.com/K7lyie"><img class="alignleft size-full wp-image-1729" style="clear: left; margin-top: 20px;" src="http://streamhacker.com/wp-content/uploads/2011/07/python3_web_dev_guide.png" alt="Python 3 Web Development" width="88" height="106" /></a></p>
<h4>Python 3 Web Development Beginner's Guide</h4>
<p>I haven't used <a href="http://docs.python.org/py3k/">Python 3</a> yet, so <a href="http://www.amazon.com/gp/product/1849513740/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=1849513740">Python 3 Web Development Beginner's Guide</a> is a good excuse to do so. I also haven't done any web development outside of <a href="https://www.djangoproject.com/">Django</a> in a few years, and I'm interested to see how it compares to doing it from scratch. As a preview, you can download a <a href="http://www.packtpub.com/sites/default/files/3746OS-Chapter-3-Tasklist-I-Persistence.pdf?utm_source=packtpub&amp;utm_medium=free&amp;utm_campaign=pdf">PDF of Chapter 3 - Tasklist I Persistence</a>.</p>
<p><a href="http://www.amazon.com/gp/product/B002FQJT3Q/ref=as_li_ss_il?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=B002FQJT3Q"><img class="alignleft" style="clear: left;" src="http://ws.assoc-amazon.com/widgets/q?_encoding=UTF8&amp;Format=_SL110_&amp;ASIN=B002FQJT3Q&amp;MarketPlace=US&amp;ID=AsinImage&amp;WS=1&amp;tag=streamhacker-20&amp;ServiceVersion=20070822" alt="Kindle 3" width="88" height="106" /></a></p>
<h4>Kindle 3</h4>
<p>I'm reading all of these on a <a href="http://www.amazon.com/gp/product/B002FQJT3Q/ref=as_li_ss_tl?ie=UTF8&amp;tag=streamhacker-20&amp;linkCode=as2&amp;camp=217145&amp;creative=399373&amp;creativeASIN=B002FQJT3Q">Kindle 3</a>, which has worked out surprisingly well. It's obviously not good for copying &amp; pasting code snippets, but that's generally a bad idea anyway. And if you don't want to type code in yourself, you can always download it from the publisher's site.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/StreamHacker?a=rfmAREOeAz8:ntx67Vqs8qE:cGdyc7Q-1BI"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=cGdyc7Q-1BI" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=rfmAREOeAz8:ntx67Vqs8qE:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/StreamHacker?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=rfmAREOeAz8:ntx67Vqs8qE:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=rfmAREOeAz8:ntx67Vqs8qE:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/StreamHacker?a=rfmAREOeAz8:ntx67Vqs8qE:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/StreamHacker?i=rfmAREOeAz8:ntx67Vqs8qE:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/StreamHacker/~4/rfmAREOeAz8" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://streamhacker.com/2011/07/04/upcoming-python-book-reviews/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://streamhacker.com/2011/07/04/upcoming-python-book-reviews/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
	</channel>
</rss>

