<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Evan Muehlhausen</title><link href="http://evanmuehlhausen.com/" rel="alternate"></link><link href="http://feeds.feedburner.com/feeds/main.xml" rel="self"></link><id>http://evanmuehlhausen.com/</id><updated>2013-02-13T00:00:00-05:00</updated><entry><title>Analyze Gchat transcripts in AWK</title><link href="http://evanmuehlhausen.com/analyze-gchat-transcripts-in-awk" rel="alternate"></link><updated>2013-02-13T00:00:00-05:00</updated><author><name>Evan Muehlhausen</name></author><id>tag:evanmuehlhausen.com,2013-02-13:analyze-gchat-transcripts-in-awk</id><summary type="html">&lt;p&gt;I learned about AWK when I first started using Linux. My exposure to the language
generally came in the form of one-liners that I would cut and paste from the web.
While it seemed like a powerful tool, I never saw it as a full-fledged programming
language and never took the time to learn to use it.&lt;/p&gt;
&lt;div class="section" id="why-awk"&gt;
&lt;h2&gt;Why AWK&lt;/h2&gt;
&lt;p&gt;While I've seen some &lt;a class="reference external" href="https://github.com/rupa/z/blob/master/z.sh"&gt;sophisticated applications&lt;/a&gt; of AWK in the wild, I mainly used it
for simple operations on log files. I wondered whether properly learning AWK
even made sense.&lt;/p&gt;
&lt;div class="section" id="the-book"&gt;
&lt;h3&gt;The book&lt;/h3&gt;
&lt;p&gt;Research on the topic led me to &lt;a class="reference external" href="http://stackoverflow.com/a/703174/544160"&gt;this Stack Overflow answer&lt;/a&gt; by &lt;a class="reference external" href="http://rhodesmill.org/brandon"&gt;Brandon Craig Rhodes&lt;/a&gt;. Mr. Rhodes is an &lt;a class="reference external" href="http://pyvideo.org/speaker/337/brandon-rhodes"&gt;avid speaker&lt;/a&gt; in the Python community and
I respect his opinion. He recommends learning AWK not only to increase mastery
at the command line, but as an excuse to read &lt;a class="reference external" href="http://www.amazon.com/AWK-Programming-Language-Alfred-Aho/dp/020107981X/?tag=evanmuehl-20"&gt;The AWK Programming Language&lt;/a&gt;
by the original authors of the language.&lt;/p&gt;
&lt;p&gt;Convinced, I acquired the book. While I'm still working my way through it, I've found
it succinct and comprehensive. It's a lot more than a manual for the language; it's
a discussion of many important programming concepts.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="staying-power"&gt;
&lt;h3&gt;Staying power&lt;/h3&gt;
&lt;p&gt;What also struck me about AWK is its staying power. Though it's around 40 years old,
a &lt;a class="reference external" href="https://www.google.com/search?q=awk+site%3Ausesthis.com"&gt;search on usesthis.com&lt;/a&gt;
reveals a lot of smart people who explicitly mention AWK as an important part of
their toolset. Even though many of these people also mention a high-level language
like Python or Ruby, AWK stays relevant.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="chat-transcripts"&gt;
&lt;h2&gt;Chat transcripts&lt;/h2&gt;
&lt;p&gt;Since reading &lt;a class="reference external" href="http://www.amazon.com/Most-Human-Artificial-Intelligence-Teaches/dp/0307476707?tag=evanmuehl-20"&gt;The Most Human Human&lt;/a&gt;,
I've been fascinated by chat transcripts. Since I don't have anyone recording and
transcribing my face-to-face conversations, my Gchat logs are the closest thing
I have to a record of a real-time interaction with other people.&lt;/p&gt;
&lt;p&gt;With that in mind, I wondered what interesting questions I could answer by analyzing
a transcript. Some of my ideas were:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Duration of interaction&lt;/li&gt;
&lt;li&gt;Total words/chars for each participant (who does all the talking?)&lt;/li&gt;
&lt;li&gt;Total time when no one was speaking (are we distracted?)&lt;/li&gt;
&lt;li&gt;Number of exchanges (how often does the active speaker change?)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Starting with this small set of data points, chat logs could reveal some interesting
dimensions of a relationship. By comparing interactions between different people, or
with the same person over time, trends might start to emerge. Very brief and terse
interactions might suggest a casual acquaintance, whereas very long (in duration and
words exchanged) and engaging (in number of exchanges) ones might suggest close friendship.&lt;/p&gt;
&lt;p&gt;To generate the source transcript for this post, I pulled up the chat log in Gmail,
pressed print and then simply cut and pasted it into a text file.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="awk-string-processing-made-easy"&gt;
&lt;h2&gt;AWK: string processing made easy&lt;/h2&gt;
&lt;p&gt;AWK is a data extraction language. While it has a rich set of features, enabling
a variety of applications, it's manipulating text and freeing the data within where it
really shines. In a few short lines, it can manage tasks that would take more work in
other languages. In Python, to open a text file and run a regular expression on it,
we require some boilerplate code to get started.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;data.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
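        &lt;span class="c"&gt;# search, not match: AWK patterns match anywhere in the line&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;[0-9]+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;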
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;AWK allows us to do this from the command line and get to the real work much more quickly.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;awk &lt;span class="s1"&gt;&amp;#39;/[0-9]+/&amp;#39;&lt;/span&gt; data.txt
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is a contrived example but it's meant to show that AWK makes some tasks very
easy.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="flocks-of-awks"&gt;
&lt;h2&gt;Flocks of AWKs&lt;/h2&gt;
&lt;p&gt;Since its creation in the 1970s, AWK implementations have proliferated. They differ
in their licensing, speed and feature set. The original implementation, the one
described in the seminal volume on the language, is known as nawk. This is the version
available by default in BSD operating systems and OS X. &lt;a class="reference external" href="http://www.freebsd.org/cgi/cvsweb.cgi/src/contrib/one-true-awk/FREEBSD-upgrade?rev=1.9&amp;amp;content-type=text/x-cvsweb-markup"&gt;FreeBSD calls it&lt;/a&gt;
&amp;quot;one true awk&amp;quot;.&lt;/p&gt;
&lt;p&gt;The GNU project provides an alternative implementation called &lt;a class="reference external" href="http://www.gnu.org/software/gawk/manual/html_node/index.html"&gt;gawk&lt;/a&gt;. It adds features
not included in the original language, including built-in date functions and true
multidimensional arrays. It's provided under the GPL, which may matter to some. For
me, the additional features justify the extra installation on OS X (&lt;cite&gt;brew install
gawk&lt;/cite&gt; did the trick). Gawk is required to run the code for these examples.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="parsing-the-transcript"&gt;
&lt;h2&gt;Parsing the transcript&lt;/h2&gt;
&lt;p&gt;Basic AWK programs are structured in blocks like&lt;/p&gt;
&lt;blockquote&gt;
condition { action }&lt;/blockquote&gt;
&lt;p&gt;AWK reads a target file line by line and, if the condition holds, it performs the
action then moves onto the next condition. When parsing this chat transcript, we have
four types of lines. Some indicate a speaker:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
me: hi!
&lt;/pre&gt;
&lt;p&gt;Others indicate the time:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
3:27 PM
&lt;/pre&gt;
&lt;p&gt;Or when a certain amount of time has elapsed between messages:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
5 minutes
&lt;/pre&gt;
&lt;p&gt;Some have no distinguishing features at all and are just lines of text. These need to
be attributed to the active speaker as indicated by the last speaker identifier line.&lt;/p&gt;
&lt;p&gt;My strategy for handling different cases is to look first for the time-related lines.
If we find one, we stop processing the line using &lt;cite&gt;next&lt;/cite&gt;. If we find a line
identifying a speaker, we store the active speaker then remove the speaker
designation e.g. &lt;cite&gt;me:&lt;/cite&gt; from the line, leaving only the raw chat content. Then, for
all remaining lines we simply count the words and characters and attribute them to
the active speaker.&lt;/p&gt;
&lt;p&gt;Here we handle a line that indicates a speaker.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="c1"&gt;#Speaker change line&lt;/span&gt;
&lt;span class="c1"&gt;# e.g. me: I love cats&lt;/span&gt;
&lt;span class="sr"&gt;/^[A-Za-z]+: /&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;speaker&lt;/span&gt; &lt;span class="o"&gt;!~&lt;/span&gt; &lt;span class="sr"&gt;/^me/&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="nx"&gt;other_speaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;changes&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove the speaker from the line&lt;/span&gt;
    &lt;span class="kr"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nb"&gt;FS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For some of the lines, it simplifies the script to use the three-argument form of the &lt;cite&gt;match&lt;/cite&gt;
function provided by gawk, which stores the captured segments of the string in an array for
processing. For example, to calculate how much dead time elapsed between messages
we do:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="c1"&gt;# Dead time line&lt;/span&gt;
&lt;span class="c1"&gt;# e.g. 5 minutes&lt;/span&gt;
&lt;span class="kr"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sr"&gt;/^([0-9]+) minutes/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;dead_time&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Don&amp;#39;t count this as chat content&lt;/span&gt;
    &lt;span class="kr"&gt;next&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This makes it easy to capture the number of minutes and add it to our total.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="issues"&gt;
&lt;h2&gt;Issues&lt;/h2&gt;
&lt;div class="section" id="regexes"&gt;
&lt;h3&gt;Regexes&lt;/h3&gt;
&lt;p&gt;This strategy presents a problem. Suppose a user sends a message
that reads:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
10 minutes
&lt;/pre&gt;
&lt;p&gt;This will get counted as dead air time. Getting around this would require parsing the
HTML version of the chat log. Since we want to use AWK for its plain-text goodness,
we will ignore this issue.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="multiple-speakers"&gt;
&lt;h3&gt;Multiple speakers&lt;/h3&gt;
&lt;p&gt;The program only works for two-party conversations. It could be modified to allow for
chats involving any number of parties.&lt;/p&gt;
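&lt;p&gt;As a rough illustration of that modification, here is a minimal Python sketch that keeps one tally per speaker name instead of assuming two parties. It is not the AWK script from this post: the &lt;cite&gt;tally&lt;/cite&gt; function, the clock-line pattern and the layout of the returned dictionary are assumptions made for the example, and it counts an exchange only when the speaker name actually changes.&lt;/p&gt;

```python
import re
from collections import defaultdict

# Patterns loosely mirror the AWK rules in this post; the names and the
# clock-line pattern are assumptions made for this sketch.
SPEAKER = re.compile(r"^([A-Za-z]+): (.*)$")
DEAD_TIME = re.compile(r"^([0-9]+) minutes$")
CLOCK = re.compile(r"^[0-9]{1,2}:[0-9]{2} [AP]M$")


def tally(lines):
    """Count words and characters per speaker for any number of speakers."""
    words = defaultdict(int)
    chars = defaultdict(int)
    dead_time = 0
    exchanges = 0
    speaker = None
    for raw in lines:
        line = raw.strip()
        dead = DEAD_TIME.match(line)
        if dead:
            dead_time += int(dead.group(1))
            continue  # not chat content
        if CLOCK.match(line):
            continue  # a timestamp such as "3:27 PM"
        found = SPEAKER.match(line)
        if found:
            # Count an exchange only when the active speaker changes.
            if found.group(1) != speaker:
                exchanges += 1
            speaker, line = found.group(1), found.group(2)
        if speaker is None:
            continue  # content before any speaker line cannot be attributed
        words[speaker] += len(line.split())
        chars[speaker] += len(line)
    return {"words": dict(words), "chars": dict(chars),
            "exchanges": exchanges, "dead_time": dead_time}
```

&lt;p&gt;Because the tallies are keyed by whatever name appears before the colon, a third or fourth participant needs no extra code. The regex ambiguity described above still applies.&lt;/p&gt;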
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="output"&gt;
&lt;h2&gt;Output&lt;/h2&gt;
&lt;p&gt;Using AWK's &lt;cite&gt;END&lt;/cite&gt; block, we print our results:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
me:  890 words  (46%),  4415 characters  (47%)
Jose: 1032 words  (53%),  4896 characters  (52%)
exchanges: 115
duration: 109 minutes
dead_time: 4 minutes
&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="section" id="impressions"&gt;
&lt;h2&gt;Impressions&lt;/h2&gt;
&lt;p&gt;AWK is a great tool and I think it's worth a programmer's time to learn it. That
said, it is not without its problems.&lt;/p&gt;
&lt;div class="section" id="readability"&gt;
&lt;h3&gt;Readability&lt;/h3&gt;
&lt;p&gt;AWK is good at what it does but I don't find the code I wrote very readable. Perhaps
this is my own inexperience. With a more complex project, this could lead to
maintenance issues. I'd be interested to know how more experienced AWKers deal with
this.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="data-structures"&gt;
&lt;h3&gt;Data structures&lt;/h3&gt;
&lt;p&gt;Lack of data structures (e.g. lists), as well as a limited set of built-in
functionality (esp. outside of gawk) can make things harder.&lt;/p&gt;
&lt;p&gt;Overall, I've enjoyed my foray into AWK. While I still wouldn't use it for anything
too complex, it's always good to learn new tools. Plus, I've already found myself
using it in cases where I would normally have to paste data into a spreadsheet.
Having these tasks in scripts saves time and adds flexibility.&lt;/p&gt;
&lt;p&gt;The code I wrote for this post is available on &lt;a class="reference external" href="https://github.com/evan2m/awk-gchat"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</summary><category term="AWK"></category><category term="data analysis"></category></entry><entry><title>Data mining local radio with Node.js</title><link href="http://evanmuehlhausen.com/data-mining-local-radio-with-nodejs" rel="alternate"></link><updated>2012-08-20T00:00:00-04:00</updated><author><name>Evan Muehlhausen</name></author><id>tag:evanmuehlhausen.com,2012-08-20:data-mining-local-radio-with-nodejs</id><summary type="html">&lt;div class="section" id="more-harpsicord"&gt;
&lt;h2&gt;More harpsichord?!&lt;/h2&gt;
&lt;p&gt;Seattle is lucky to have KINGFM, a local radio station dedicated to 100% classical
music. As one of the few classical music fans in their twenties, I listen
often enough. Over the past few years, I've noticed that when I tune to the station,
I always seem to hear the plinky sound of a harpsichord.&lt;/p&gt;
&lt;p&gt;Before I sent KINGFM an email, admonishing them for playing so much of an instrument
I dislike, I wanted to investigate whether my ears were deceiving me. Perhaps my own
distaste for the harpsichord increased its impact in my memory.&lt;/p&gt;
&lt;p&gt;This article outlines the details of this investigation and especially the process of
collecting the data.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="if-it-ain-t-baroque"&gt;
&lt;h2&gt;If it ain't baroque...&lt;/h2&gt;
&lt;p&gt;A harpsichord is in many ways similar to the piano. Pressing down and releasing one of
its keys triggers an internal mechanism that plucks a string inside. The resulting
vibration of the string produces the corresponding pitch. Because its strings are
plucked, the instrument has no dynamic range. Each note sounds at roughly the same
volume, however firmly or softly the player strikes the keys. Some harpsichords have
several &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Harpsichord#Multiple_choirs_of_strings"&gt;choirs&lt;/a&gt; of strings
that allow the player limited control of the volume and timbre.&lt;/p&gt;
&lt;p&gt;The harpsichord can sound tinny to modern ears. Thomas Beecham famously said, &amp;quot;The
sound of a harpsichord - two &lt;strong&gt;skeletons copulating&lt;/strong&gt; on a tin roof in
a thunderstorm.&amp;quot;&lt;/p&gt;
&lt;p&gt;At the start of the 18th century, the newly invented &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Fortepiano"&gt;fortepiano&lt;/a&gt; began to push both the harpsichord and its
close relative, the clavichord, out of favor. The new instrument worked more like the
modern piano in that its strings were struck with padded hammers. Compared to the
other keyboard instruments of its day, it had a more resonant sound and allowed the
player to control the dynamics of each note simply by altering the force with which
he struck the keys.&lt;/p&gt;
&lt;p&gt;The period before the fortepiano, during which the harpsichord had its heyday, is known
as the Baroque Era. The history of classical music is often divided into several
historical &amp;quot;eras&amp;quot; or &amp;quot;periods&amp;quot;. The dates that separate them are somewhat arbitrary,
with substantial overlap. I'll follow &lt;a class="reference external" href="http://en.wikipedia.org/wiki/List_of_Classical_era_composers"&gt;Wikipedia&lt;/a&gt; in defining these
boundaries because it has the most comprehensive composer data with the most
permissive licence.&lt;/p&gt;
&lt;p&gt;Wikipedia's dates differ little from those outlined by &lt;a class="reference external" href="http://www.naxos.com/education/brief_history.asp"&gt;Naxos&lt;/a&gt;, a well-respected music label
with an extremely comprehensive catalog. Unfortunately, the Naxos ToS are extremely
restrictive with respect to their composer data.&lt;/p&gt;
&lt;p&gt;These eras are:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Medieval (476–1400)&lt;/li&gt;
&lt;li&gt;Renaissance (1400–1600)&lt;/li&gt;
&lt;li&gt;Baroque (1600–1760)&lt;/li&gt;
&lt;li&gt;Classical era (1730–1820)&lt;/li&gt;
&lt;li&gt;Romantic era (1815–1910)&lt;/li&gt;
&lt;li&gt;20th century (1900–2000)&lt;/li&gt;
&lt;li&gt;21st century (since 2000)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Since King seems to play very little music from before 1600, I ignored the Medieval
and Renaissance eras in my analysis.&lt;/p&gt;
&lt;p&gt;Because of the dominance of the piano and its predecessors starting in the Classical
Era, one is less likely to find the sound of the harpsichord in modern recordings of
anything but Baroque music. Even then, music originally written for harpsichord is
often transcribed to the piano and recorded that way. Glenn Gould, perhaps the most
famous modern Bach interpreter, is well-known for recordings of such transcriptions.&lt;/p&gt;
&lt;p&gt;One exception is opera. The harpsichord was used for accompanying recitative all the way
into the late 18th century. For simplicity, we will blissfully ignore this fact.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="collecting-the-data"&gt;
&lt;h2&gt;Collecting the data&lt;/h2&gt;
&lt;p&gt;KINGFM posts its playlist daily to &lt;a class="reference external" href="http://www.king.org/Music-Schedule/4399266"&gt;its website&lt;/a&gt;. Scraping this data, I was able to
build the dataset.&lt;/p&gt;
&lt;div class="section" id="scraping-with-node"&gt;
&lt;h3&gt;Scraping with Node&lt;/h3&gt;
&lt;p&gt;Web scraping is normally a network-constrained task. Most of the execution time is
spent waiting for the server to respond. Node.js encourages an asynchronous style
that is well-suited to such tasks. Using the &lt;a class="reference external" href="https://github.com/mikeal/request"&gt;request&lt;/a&gt; module, it's easy to send non-blocking HTTP
requests and process each result in a callback as it's returned. Because nothing waits
for one response before the next request goes out, rate limiting yourself is important
when scraping more than a few pages. Otherwise,
the flood of requests you unleash is likely to get you blocked or interfere with
the target site.&lt;/p&gt;
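&lt;p&gt;The simplest form of rate limiting is a fixed pause between requests. The Python sketch below shows only that idea and is not the scraper used for this article: &lt;cite&gt;fetch_all&lt;/cite&gt; and its parameters are hypothetical, &lt;cite&gt;fetch&lt;/cite&gt; stands in for whatever HTTP client you use, and the requests run one after another rather than asynchronously.&lt;/p&gt;

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page body;
    it is a stand-in for a real HTTP client (an assumption of this sketch).
    `sleep` is injectable so the pause can be replaced in tests.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages
```

&lt;p&gt;With 30 pages and a one-second delay the whole scrape still finishes in about half a minute, which is a small price for not hammering the target site.&lt;/p&gt;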
&lt;p&gt;Another advantage of Node for this use case is that it brings existing client-side
libraries to the server. Great scraping tools exist in other languages (e.g. &lt;a class="reference external" href="http://scrapy.org/"&gt;scrapy&lt;/a&gt;). However, since many developers already have years of
experience using jQuery client-side to access the DOM, they may prefer to use
a familiar API instead of learning a new one.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cheerio"&gt;
&lt;h3&gt;Cheerio&lt;/h3&gt;
&lt;p&gt;Several npm packages are available to help us use jQuery in Node. &lt;a class="reference external" href="https://github.com/tmpvar/jsdom/"&gt;jsdom&lt;/a&gt; is a popular option that implements the full DOM
in JS, allowing us to use jQuery or almost any other client-side library on the server.&lt;/p&gt;
&lt;p&gt;However, &lt;a class="reference external" href="https://github.com/MatthewMueller/cheerio"&gt;cheerio&lt;/a&gt; better suits this
simple task. The project provides a re-implementation of the most important parts
of jQuery core. It's simpler to use and the author claims it's a much faster choice
compared to jsdom. Since much of the official jQuery source provides functionality
a scraper doesn't need, like AJAX and browser compatibility, a re-implementation that leaves
this bloat behind is preferable.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="an-example-scaper"&gt;
&lt;h3&gt;An example scraper&lt;/h3&gt;
&lt;p&gt;Especially when used in concert with CoffeeScript, Cheerio makes for readable
scrapers. By leveraging the superpower that is the jQuery selector we can often get
at our data with minimal code. As an example, let's use it to scrape the target URLs
from the front page of reddit using the selector &lt;cite&gt;#siteTable a.title&lt;/cite&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="nv"&gt;request = &lt;/span&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;request&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;cheerio = &lt;/span&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;cheerio&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;parse_page = &lt;/span&gt;&lt;span class="nf"&gt;(error, response, body) -&amp;gt;&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;statusCode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="c1"&gt;# Load the page into cheerio&lt;/span&gt;
    &lt;span class="nv"&gt;$ = &lt;/span&gt;&lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Iterate over the links on the front page&lt;/span&gt;
    &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;#siteTable a.title&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt; &lt;span class="nf"&gt;(k,v) =&amp;gt;&lt;/span&gt;

      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;href&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://www.reddit.com&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;parse_page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Using this technique, I quickly pulled the last 30 days of playlist data from KINGFM
and dumped it into a file.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-joys-of-data-normalization"&gt;
&lt;h2&gt;The joys of data normalization&lt;/h2&gt;
&lt;p&gt;Then came the hard part: associating composer names in the playlist data with
historical eras. This is more difficult than it seems because subtle differences in
the datasets could result in mismatches. King is mostly consistent in defining its
composers but it does so in a different format from Wikipedia. It lists &amp;quot;SCHUBERT&amp;quot;
instead of &amp;quot;Franz Schubert&amp;quot;. These cases are easy enough: simply convert to
lowercase and lop off the first name.&lt;/p&gt;
&lt;p&gt;The more difficult cases come in several types. In some, the database contains multiple
composers that share a last name, e.g. J.S. Bach and all of his sons. In these cases,
we need first names or initials to differentiate. Unfortunately, the formats differ
between the data sets. Wikipedia has &amp;quot;Carl Philipp Emanuel Bach&amp;quot; and KINGFM, &amp;quot;BACH,
C.P.E&amp;quot;. Other tough cases are those where a composer has multiple last names, e.g.
&amp;quot;Vaughan Williams&amp;quot;. Other annoying cases occur where diacritics do not match,
e.g. &amp;quot;Dvořák&amp;quot; and &amp;quot;DVORÁK&amp;quot; (no accented r). Since my data set is fairly small (3210
playlist items), I was able to handle these unfortunate cases with regular
expressions and frustration.&lt;/p&gt;
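&lt;p&gt;To make the easy cases and the diacritics problem concrete, here is a small Python sketch of one way to reduce both data sets to a common key. It is an illustration under assumptions, not the code used for this article: &lt;cite&gt;strip_diacritics&lt;/cite&gt; and &lt;cite&gt;composer_key&lt;/cite&gt; are hypothetical helpers, and the sketch deliberately leaves the shared-surname and multi-word-surname cases unsolved.&lt;/p&gt;

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks, so that "Dvořák" becomes "Dvorak"."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def composer_key(name):
    """Reduce a composer name to a lowercase, accent-free surname.

    Handles "Franz Schubert", "SCHUBERT" and "BACH, C.P.E". It is a
    hypothetical helper: it cannot tell the Bachs apart, and it would
    wrongly shorten "Vaughan Williams" to "williams".
    """
    cleaned = strip_diacritics(name).strip().lower()
    if "," in cleaned:
        # "LAST, INITIALS" format: the surname comes first.
        return cleaned.split(",")[0].strip()
    # "First Middle Last" format: the surname comes last.
    return cleaned.split()[-1]
```

&lt;p&gt;Decomposing to NFKD before dropping combining marks means both spellings of Dvořák collapse to the same key, whichever accents each source happens to include.&lt;/p&gt;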
&lt;div class="section" id="handling-overlap"&gt;
&lt;h3&gt;Handling overlap&lt;/h3&gt;
&lt;p&gt;In some cases, a composer belongs to multiple eras. For example, Beethoven's music is
said to span both the Classical and the Romantic eras. One way to handle these cases
would be to count every movement of Beethoven's as both Classical and Romantic. The
downside is that this would result in double counting a lot of the most popular
composers.&lt;/p&gt;
&lt;p&gt;Instead, I chose to sort the database alphabetically by composer name rather than by
era. In cases where there are two entries for the same name, the second one
overwrites the first. This is not ideal either but should help to randomize the era
into which transitional composers are placed. I did some editorializing here for
the most prominent composers. For instance, Beethoven was counted as Classical and
Schubert as Romantic.&lt;/p&gt;
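&lt;p&gt;The overwrite rule is easy to express in a few lines. The Python sketch below is only an illustration: &lt;cite&gt;era_by_composer&lt;/cite&gt;, the list of (composer, era) pairs and the &lt;cite&gt;overrides&lt;/cite&gt; argument are assumptions, not the data layout actually used here.&lt;/p&gt;

```python
def era_by_composer(entries, overrides=None):
    """Map each composer to a single era.

    `entries` is a list of (composer, era) pairs, a hypothetical layout.
    Entries are sorted by composer name only; when a name appears twice,
    the later entry overwrites the earlier one. `overrides` holds manual
    choices for prominent composers and is applied last.
    """
    eras = {}
    for composer, era in sorted(entries, key=lambda entry: entry[0]):
        eras[composer] = era  # a later duplicate replaces an earlier one
    eras.update(overrides or {})
    return eras
```

&lt;p&gt;Every track by a composer then counts toward exactly one era, so nothing is double counted.&lt;/p&gt;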
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="results"&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;div class="section" id="play-count"&gt;
&lt;h3&gt;Play count&lt;/h3&gt;
&lt;p&gt;2691 of the 3208 playlist items had matches in the database, leaving 472 unidentified
tracks. The results were distributed like this:&lt;/p&gt;
&lt;div class="section" id="era-distribution"&gt;
&lt;h4&gt;Era distribution&lt;/h4&gt;
&lt;!-- romantic: 712, --&gt;
&lt;!-- twentieth: 814, --&gt;
&lt;!-- undefined: 472, --&gt;
&lt;!-- baroque: 502, --&gt;
&lt;!-- classical: 712 --&gt;
&lt;img alt="" src="http://chart.apis.google.com/chart?chs=542x285&amp;amp;cht=p&amp;amp;chco=80C65A&amp;amp;chd=s:QSLQ&amp;amp;chl=Romantic|Twentieth|Baroque|Classical" /&gt;
&lt;/div&gt;
&lt;div class="section" id="top-composers"&gt;
&lt;h4&gt;Top composers&lt;/h4&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="52%" /&gt;
&lt;col width="48%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Composer&lt;/th&gt;
&lt;th class="head"&gt;Play count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;Mozart&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Bach&lt;/td&gt;
&lt;td&gt;188&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Haydn&lt;/td&gt;
&lt;td&gt;114&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Beethoven&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Schubert&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Chopin&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Debussy&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Mendelssohn&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Brahms&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Tchaikovsky&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="air-time"&gt;
&lt;h3&gt;Air time&lt;/h3&gt;
&lt;p&gt;Analyzing the total play count for each era is useful. But the more interesting
question for a listener is not how often tracks from each era are played. Rather,
it's what proportion of the airtime each era consumes. This is an important
distinction because some movements last less than a minute while others can last 30
minutes or longer.&lt;/p&gt;
&lt;p&gt;Top Composers by airtime (including only the top 16 composers or 48% of total):&lt;/p&gt;
&lt;img alt="" src="http://chart.apis.google.com/chart?chs=500x285&amp;amp;cht=p&amp;amp;chd=s:JGGGEDDCCCCCBBBB&amp;amp;chp=1.567&amp;amp;chl=Mozart|Beethoven|Haydn|Bach|Schubert|Schumann|Brahms|Mendelssohn|Dvor%C3%A1k|Tchaikovsky|Chopin|Debussy|Ravel|Handel|Telemann|Vivaldi" /&gt;
&lt;p&gt;This data highlights the importance of using airtime over play count. While King
plays J.S. Bach almost as often as Mozart, Mozart gets 42% more airtime: more
than 20 hours more per month than Bach.&lt;/p&gt;
&lt;div class="section" id="async-and-ordering"&gt;
&lt;h4&gt;Async and ordering&lt;/h4&gt;
&lt;p&gt;When analyzing airtime, we have to make sure all of our data is properly sorted.
Since we are scraping asynchronously and writing the results to disk as they are
returned, it's likely that our data will come back in an order different from the
order of the HTTP requests.&lt;/p&gt;
&lt;p&gt;This can occur because some pages are larger and so take longer to transfer than
others. A glance at the data shows that this did indeed happen:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="60%" /&gt;
&lt;col width="40%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Time&lt;/th&gt;
&lt;th class="head"&gt;Composer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;07/19/2012 11:51pm&lt;/td&gt;
&lt;td&gt;PURCELL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;07/19/2012 11:54pm&lt;/td&gt;
&lt;td&gt;SCHUBERT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;07/16/2012 12:02am&lt;/td&gt;
&lt;td&gt;CHOPIN&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Since we made non-blocking HTTP requests, the data from 7/19 arrived more
quickly and so was written to disk before the data from the 16th.&lt;/p&gt;
&lt;p&gt;Since we want to access this data as a JavaScript object anyway, we ought not to rely on
the default ordering of an object's properties. Property ordering is not part of the
ECMAScript spec. To remedy this, we will use &lt;a class="reference external" href="http://momentjs.com/"&gt;moment.js&lt;/a&gt; to
parse each date string and convert it to a UNIX timestamp.&lt;/p&gt;
&lt;p&gt;These timestamps will be stored in two separate data structures: a list and a hash. The
list will be sorted and used for ordering. The hash maps timestamps to composer
names. Iterating through the list, we look up the associated composer and use
subtraction to work out the total seconds of airtime for each track.&lt;/p&gt;
&lt;!-- Link to raw json file --&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="nv"&gt;fs = &lt;/span&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;fs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;moment = &lt;/span&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;moment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;composers_by_air_time = &lt;/span&gt;&lt;span class="nf"&gt;-&amp;gt;&lt;/span&gt;

  &lt;span class="nv"&gt;dates = &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="nv"&gt;map = &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nv"&gt;playlist = &lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;db/playlist.json&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;composer&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;playlist&lt;/span&gt;
    &lt;span class="c1"&gt;# Build an array of timestamps together with a mapping from&lt;/span&gt;
    &lt;span class="c1"&gt;# timestamp to composer&lt;/span&gt;
    &lt;span class="nv"&gt;unix_date = &lt;/span&gt;&lt;span class="nx"&gt;moment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;MM/DD/YYYY hh:mma&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;unix&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Push it onto our list of timestamps&lt;/span&gt;
    &lt;span class="nx"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;unix_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Map the composer whose work STARTED to this timestamp&lt;/span&gt;
    &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;unix_date&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;composer&lt;/span&gt;

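  &lt;span class="c1"&gt;# Sort the dates numerically so we can subtract adjacent members&lt;/span&gt;
  &lt;span class="nx"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;(a, b) -&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;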

  &lt;span class="nv"&gt;results = &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;dates&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="nx"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="c1"&gt;# Subtract each item from its predecessor, ignoring the first one&lt;/span&gt;
    &lt;span class="nv"&gt;prev_date = &lt;/span&gt;&lt;span class="nx"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nv"&gt;difference = &lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;prev_date&lt;/span&gt;
    &lt;span class="nv"&gt;composer = &lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;prev_date&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Group by composer name&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;composer&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
      &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;composer&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;difference&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
      &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;composer&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;difference&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="airtime-by-era"&gt;
&lt;h4&gt;Airtime by era&lt;/h4&gt;
&lt;p&gt;We can combine the airtime data with the composer era data to get the total airtime
by era:&lt;/p&gt;
&lt;!-- 24hr
romantic: 864180,
twentieth: 1299600,
undefined: 625080,
baroque: 579420,
classical: 860880 --&gt;
&lt;!-- Between 7-midnight only:
classical: 587640,
baroque: 589080,
romantic: 733260,
twentieth: 1347660,
undefined: 740880 } --&gt;
&lt;img alt="" src="http://chart.apis.google.com/chart?chs=542x285&amp;amp;cht=p&amp;amp;chco=80C65A&amp;amp;chd=s:GJEG&amp;amp;chl=Romantic|Twentieth|Baroque|Classical" /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="conclusions"&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;div class="section" id="blame-the-bias"&gt;
&lt;h3&gt;Blame the bias&lt;/h3&gt;
&lt;p&gt;The data shows that KINGFM is innocent of the charge of favoring Baroque music over
other eras. Indeed, they play less Baroque than anything else: less than half as much
as twentieth-century music. Looks like my own bias against harpsichord has affected my
statistical judgment. Good thing I actually checked before blaming the station.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="data-mining-in-node"&gt;
&lt;h3&gt;Data mining in Node&lt;/h3&gt;
&lt;p&gt;Part of my motivation for this post was to get more familiar with using Node and
CoffeeScript. This pair makes a convenient programming environment for tasks like web
scraping and networking applications.&lt;/p&gt;
&lt;p&gt;That said, JavaScript on its own is a poor candidate for data analysis. It has
a limited set of built-in data structures and no default support for parsing data
from common file formats. &lt;a class="reference external" href="https://github.com/stackd/gauss"&gt;Gauss&lt;/a&gt; looks like it
may help to fill this void, but it will likely be some time until the Node world has
something as full-featured as &lt;a class="reference external" href="http://pandas.pydata.org/"&gt;pandas&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For those interested, the simple scripts that I wrote in CoffeeScript for the
scraping and analysis are on &lt;a class="reference external" href="http://github.com/evan2m/node-kingfm/"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</summary><category term="node"></category><category term="coffeescript"></category><category term="data mining"></category><category term="music"></category></entry><entry><title>Saving Screenshots in Rails with url2png and Paperclip</title><link href="http://evanmuehlhausen.com/saving-screenshots-in-rails-with-url2png-and-paperclip" rel="alternate"></link><updated>2012-05-30T00:00:00-04:00</updated><author><name>Evan Muehlhausen</name></author><id>tag:evanmuehlhausen.com,2012-05-30:saving-screenshots-in-rails-with-url2png-and-paperclip</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="http://url2png.com/"&gt;url2png&lt;/a&gt; is a service for generating screenshots of websites.
Pass in a URL and some dimensions and it spits back a high-quality PNG capture of
that site.&lt;/p&gt;
&lt;p&gt;Unlike some competing services I've tried, it even does a decent job handling
sites that require client-side rendering.&lt;/p&gt;
&lt;p&gt;Someone has already built a &lt;a class="reference external" href="https://github.com/wout/url2png-gem"&gt;Ruby gem&lt;/a&gt; for
working with url2png. It provides a Rails helper for hot-linking url2png images in
your views. Perhaps a better name for that gem would be url2png-rails.&lt;/p&gt;
&lt;p&gt;Though useful, this is not what I was looking for. Instead, I wanted the ability to
save a local copy of the screenshots on my own server. Since I was already using
Paperclip for saving attachments, this turned out to be easy.&lt;/p&gt;
&lt;div class="section" id="an-api-wrapper"&gt;
&lt;h2&gt;An API wrapper&lt;/h2&gt;
&lt;p&gt;The url2png API is quite simple. Using it requires building the URL of the image by
generating a token. The following uses v3 of their API. As of this writing, this is the
version they use in their &lt;a class="reference external" href="http://url2png.com/code/"&gt;guide&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;digest/md5&amp;#39;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ScreenShot&lt;/span&gt;

  &lt;span class="no"&gt;KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;your key&amp;#39;&lt;/span&gt;
  &lt;span class="no"&gt;SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;your secret&amp;#39;&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="vi"&gt;@url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;
    &lt;span class="vi"&gt;@bounds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bounds&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;token&lt;/span&gt;
    &lt;span class="no"&gt;Digest&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;MD5&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;SECRET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;+&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="vi"&gt;@url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;img_url&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;http://api.url2png.com/v3/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="vi"&gt;@bounds&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="vi"&gt;@url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Using this, we can easily get the URL of a screenshot of the front page of reddit:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;shot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ScreenShot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;http://reddit.com&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;200x200&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;shot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;img_url&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="ss"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sr"&gt;//u&lt;/span&gt;&lt;span class="n"&gt;rl2png&lt;/span&gt;&lt;span class="o"&gt;.../&lt;/span&gt;&lt;span class="n"&gt;reddit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;png&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="saving-it-with-paperclip"&gt;
&lt;h2&gt;Saving it with Paperclip&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/thoughtbot/paperclip"&gt;Paperclip&lt;/a&gt; is a popular gem for managing
file attachments in Rails applications. Until now, I'd only used it to save files
that were passed in through a form. But it is not restricted to handling POST data
or files already on disk. Pass in any &lt;cite&gt;IO&lt;/cite&gt; and it will take care of the rest.&lt;/p&gt;
&lt;p&gt;Given a Website model with a &lt;cite&gt;url&lt;/cite&gt; attribute, we can fetch an image for that URL and
save an associated screenshot.&lt;/p&gt;
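&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;open-uri&amp;#39;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Website&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;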
  &lt;span class="n"&gt;has_attached_file&lt;/span&gt; &lt;span class="ss"&gt;:screenshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="ss"&gt;:styles&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:thumb&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;50x50&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:square&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;200x200&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gen_screenshot!&lt;/span&gt;
    &lt;span class="n"&gt;shot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ScreenShot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;200x200&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;img_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;save!&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Notice that we can pass the &lt;cite&gt;IO&lt;/cite&gt; returned by &lt;cite&gt;open&lt;/cite&gt; directly to Paperclip without
having to bother saving it to disk ourselves.&lt;/p&gt;
&lt;p&gt;If the image is small enough, behind the scenes &lt;cite&gt;open&lt;/cite&gt; will use a StringIO and hold
the image data in memory. This avoids the filesystem overhead of writing an extra
&lt;cite&gt;Tempfile&lt;/cite&gt;.&lt;/p&gt;
&lt;p&gt;We can attach the image to our Website model like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Website&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;http://reddit.com&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gen_screenshot!&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Paperclip will handle the messy details of thumbnail generation. When it's done, it
will move the files to the proper location on disk.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="better-as-a-gem"&gt;
&lt;h2&gt;Better as a Gem&lt;/h2&gt;
&lt;p&gt;Since this is such a small amount of code, bundling it as a gem may seem like
overkill. But I would argue that it is still the right move. Perhaps someday this
functionality could be integrated into &amp;#64;wout's url2png gem.&lt;/p&gt;
&lt;/div&gt;
</summary><category term="rails"></category><category term="ruby"></category></entry><entry><title>Simple Counters in Python (with Benchmarks)</title><link href="http://evanmuehlhausen.com/simple-counters-in-python-with-benchmarks" rel="alternate"></link><updated>2012-05-16T00:00:00-04:00</updated><author><name>Evan Muehlhausen</name></author><id>tag:evanmuehlhausen.com,2012-05-16:simple-counters-in-python-with-benchmarks</id><summary type="html">&lt;p&gt;It's sometimes necessary to count the number of distinct occurrences in a
collection.  For example, counting how many times each letter occurs in a block of
text. Or sorting a list by its most common member.&lt;/p&gt;
&lt;p&gt;If I were to do this sort of counting with SQL, I would generally use something like
this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;column&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This could easily be combined with an &lt;cite&gt;ORDER BY&lt;/cite&gt; to get the most common items.&lt;/p&gt;
&lt;p&gt;However, assuming you are working with some raw data, here are some strategies for
counting distinct occurrences in Python. Skip to the end to see which method performs
best.&lt;/p&gt;
&lt;div class="section" id="dict-and-in"&gt;
&lt;h2&gt;dict and &lt;cite&gt;in&lt;/cite&gt;&lt;/h2&gt;
&lt;p&gt;A plain dictionary works well as a counter. Though using it is verbose, it performs
surprisingly well and works in any Python version.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;foods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;dairy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;gluten&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;foods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="o"&gt;..&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;gluten&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;dairy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="defaultdict"&gt;
&lt;h2&gt;defaultdict&lt;/h2&gt;
&lt;p&gt;I've always loved the defaultdict. Used properly, it can cut out a lot of boilerplate
from your code. It has many applications, one of which is a counter.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;

&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;foods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;dairy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;gluten&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;foods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="o"&gt;..&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;
&lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;int&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;gluten&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;dairy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;By passing &lt;cite&gt;int&lt;/cite&gt; to the class, all empty keys default to zero. This allows you
to do += without setting the key first.&lt;/p&gt;
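&lt;p&gt;A minimal sketch of why this works, and of one side effect worth knowing about: merely
reading a missing key stores the default. The variable names here are only for
illustration.&lt;/p&gt;

```python
from collections import defaultdict

counter = defaultdict(int)

# int() called with no arguments returns 0, so it works as the default factory.
assert int() == 0

# Reading a missing key calls the factory and stores the result as a side effect.
before = 'soy' in counter      # False: nothing has been stored yet
value = counter['soy']         # 0, and the key now exists
after = 'soy' in counter       # True

# Use .get() to peek at a key without inserting it.
peek = counter.get('gluten')   # None, and 'gluten' is still absent

print(before, value, after, peek, sorted(counter))
```

&lt;p&gt;If stray zero-count keys would be a problem, check with &lt;cite&gt;in&lt;/cite&gt; or &lt;cite&gt;get&lt;/cite&gt; before reading.&lt;/p&gt;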
&lt;/div&gt;
&lt;div class="section" id="dict-and-setdefault"&gt;
&lt;h2&gt;dict and setdefault&lt;/h2&gt;
&lt;p&gt;Dictionaries have a setdefault method that allows you to set the default value
for a single key.&lt;/p&gt;
&lt;p&gt;According to the &lt;a class="reference external" href="http://docs.python.org/library/collections.html#collections.defaultdict"&gt;Python docs&lt;/a&gt;, running
setdefault on every key is slower than using defaultdict. The benchmark below
confirms this.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;foods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;dairy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;gluten&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;foods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;gluten&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;dairy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="collections-counter"&gt;
&lt;h2&gt;collections.Counter&lt;/h2&gt;
&lt;p&gt;Python 2.7 introduced &lt;strong&gt;collections.Counter&lt;/strong&gt; which makes this trivial.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;span class="n"&gt;foods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;dairy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;gluten&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;foods&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;..&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;
&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;gluten&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#39;dairy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When we pass a list to the Counter constructor, it does the grouping for us. It still
behaves like a dictionary, so we can still do things like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;soy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
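&lt;p&gt;Counter also covers the "most common member" case mentioned at the top of this post. A
short sketch using its &lt;cite&gt;most_common&lt;/cite&gt; method, with the same sample list as above:&lt;/p&gt;

```python
from collections import Counter

foods = ['soy', 'dairy', 'gluten', 'soy']
counter = Counter(foods)

# most_common(n) returns the n highest counts as (item, count) pairs.
top = counter.most_common(1)

# Missing keys read as 0 instead of raising KeyError, and are not stored.
missing = counter['cheese']

print(top, missing, 'cheese' in counter)
```

&lt;p&gt;Unlike defaultdict, reading a missing key from a Counter does not insert it.&lt;/p&gt;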
&lt;/div&gt;
&lt;div class="section" id="benchmarks"&gt;
&lt;h2&gt;Benchmarks&lt;/h2&gt;
&lt;p&gt;Here are some quick and dirty benchmarks for these methods. I used &lt;a class="reference external" href="https://gist.github.com/2706238"&gt;this code&lt;/a&gt; to generate the data. I took some text by The
Bard and counted the number of each letter and each word. There were a lot more
unique words than letters which resulted in slower times to count them.&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="11%" /&gt;
&lt;col width="17%" /&gt;
&lt;col width="23%" /&gt;
&lt;col width="32%" /&gt;
&lt;col width="17%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;Keys&lt;/th&gt;
&lt;th class="head"&gt;Counter&lt;/th&gt;
&lt;th class="head"&gt;defaultdict&lt;/th&gt;
&lt;th class="head"&gt;dict.setdefault&lt;/th&gt;
&lt;th class="head"&gt;dict.in&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;6691&lt;/td&gt;
&lt;td&gt;3.62&lt;/td&gt;
&lt;td&gt;1.97&lt;/td&gt;
&lt;td&gt;2.88&lt;/td&gt;
&lt;td&gt;1.95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;26727&lt;/td&gt;
&lt;td&gt;13.13&lt;/td&gt;
&lt;td&gt;4.31&lt;/td&gt;
&lt;td&gt;9.58&lt;/td&gt;
&lt;td&gt;7.17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These results show that while a plain dict with &lt;cite&gt;in&lt;/cite&gt; checks performs best for
a smaller number of keys, it's not significantly better than defaultdict. With
a larger number of distinct members, defaultdict did substantially better than any
other option.&lt;/p&gt;
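&lt;p&gt;For reference, here is a minimal sketch of the four counting approaches named in the
table headers. It is not the benchmark code linked above; the sample sentence and the
variable names are assumptions chosen for illustration.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
from collections import Counter, defaultdict

# Hypothetical sample input; the benchmark itself used a much larger text.
words = "to be or not to be".split()

# 1. Counter: tallies an iterable directly.
with_counter = Counter(words)

# 2. defaultdict(int): missing keys start at 0.
with_defaultdict = defaultdict(int)
for word in words:
    with_defaultdict[word] += 1

# 3. dict.setdefault: insert 0 if the key is missing, then increment.
with_setdefault = {}
for word in words:
    with_setdefault.setdefault(word, 0)
    with_setdefault[word] += 1

# 4. Plain dict with an `in` check on every key.
with_in = {}
for word in words:
    if word in with_in:
        with_in[word] += 1
    else:
        with_in[word] = 1

# All four approaches produce the same tallies.
assert dict(with_counter) == dict(with_defaultdict) == with_setdefault == with_in
print(with_in["to"])  # prints 2
```
&lt;/pre&gt;
&lt;p&gt;All four build the same tallies; they differ only in how a missing key is handled.&lt;/p&gt;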
&lt;/div&gt;
&lt;div class="section" id="use-defaultdict"&gt;
&lt;h2&gt;Use defaultdict&lt;/h2&gt;
&lt;p&gt;The takeaway is to stick to the defaultdict when you need a counter. Not only is it
performant, but it saves you from the boilerplate of operating on every key.&lt;/p&gt;
&lt;p&gt;While &lt;cite&gt;Counter&lt;/cite&gt; is shiny and convenient, it's slow. As an added bonus, defaultdict
works in Python 2.5. If you are stuck with Python 2.4 (upgrade!), running &lt;cite&gt;in&lt;/cite&gt; on
every key is your best option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt;: Updated in light of Philip's comment.&lt;/p&gt;
&lt;/div&gt;
</summary><category term="python"></category><category term="benchmarks"></category></entry><entry><title>A Few Static Blog Generators</title><link href="http://evanmuehlhausen.com/a-few-static-blog-generators" rel="alternate"></link><updated>2012-05-15T00:00:00-04:00</updated><author><name>Evan Muehlhausen</name></author><id>tag:evanmuehlhausen.com,2012-05-15:a-few-static-blog-generators</id><summary type="html">&lt;p&gt;Before finally choosing an engine for my own blog, I spent too much time comparing
the many available options. My goal in this post is to share what I learned about
some of the tools that are available. Hopefully this makes it easier for others to
publish their own writing.&lt;/p&gt;
&lt;div class="section" id="why-static"&gt;
&lt;h2&gt;Why Static?&lt;/h2&gt;
&lt;p&gt;Static website generators take content in a user-friendly markup language (e.g.
markdown) and compile it into flat HTML pages.&lt;/p&gt;
&lt;p&gt;Static sites, particularly static blogs, have become increasingly popular. This isn't
to say they are for everyone. They do require more effort to set up and generally have
fewer features than a mainstream blogging platform.&lt;/p&gt;
&lt;p&gt;The reasons for the increasing popularity of static websites can be found all over
the web. Here are some of the big ones for me.&lt;/p&gt;
&lt;div class="section" id="fewer-moving-parts"&gt;
&lt;h3&gt;Fewer Moving Parts&lt;/h3&gt;
&lt;p&gt;Simple websites shouldn't require the same stack as a full-blown web application. Why
depend on an app server and a database when all you really need are flat files and
a web server?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cheap-and-easy-deployments"&gt;
&lt;h3&gt;Cheap and Easy Deployments&lt;/h3&gt;
&lt;p&gt;Deploy your blog anywhere that can serve static assets. Some free/cheap options for
this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://pages.github.com/"&gt;Github pages&lt;/a&gt; (free)&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://aws.amazon.com/s3/"&gt;s3&lt;/a&gt; (practically free unless you are famous or
&lt;a class="reference external" href="http://www.behind-the-enemy-lines.com/2012/04/google-attack-how-i-self-attacked.html"&gt;ddos yourself&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;shared host (at least you don't have to use their database)&lt;/li&gt;
&lt;li&gt;a tiny VPS (my choice)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div class="section" id="security"&gt;
&lt;h3&gt;Security&lt;/h3&gt;
&lt;p&gt;Having had to clean up hacked WordPress installations, I know what a pain it is to
keep WordPress up to date and locked down. Any popular web app that can be deployed
to your own server is a natural target for attackers. By serving static files on
disk, we eliminate a wide range of attack vectors.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="my-requirements"&gt;
&lt;h2&gt;My Requirements&lt;/h2&gt;
&lt;p&gt;Having decided on a static blog, I still had a sea of options to sift through before
I could reach a decision. Salient features include:&lt;/p&gt;
&lt;div class="section" id="restructuredtext"&gt;
&lt;h3&gt;reStructuredText&lt;/h3&gt;
&lt;p&gt;Markdown is a very popular choice for web writers. While I do like markdown, I prefer
&lt;a class="reference external" href="http://sphinx.pocoo.org/rest.html#rst-primer"&gt;reStructuredText&lt;/a&gt; for a number of
reasons. It requires more upfront investment to learn its larger syntax. But its rich
feature set is well worth it.&lt;/p&gt;
&lt;p&gt;Some great reST features are footnotes and tables of contents. Also, .rst files look great in
plaintext (even without syntax highlighting).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="theming-support"&gt;
&lt;h3&gt;Theming Support&lt;/h3&gt;
&lt;p&gt;Themes give you a big head start when starting a site. Taking an existing theme and
customizing it to your needs is a lot faster than starting from scratch. Tools like
Twitter Bootstrap make this process easier. But they don't save you from having to
learn the names of all of the template variables and settings provided by your
static generator.&lt;/p&gt;
&lt;p&gt;Good theme support depends on a well-designed site theme API to enable customization.
But it also depends on a community that has already released themes worth using.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="extensible"&gt;
&lt;h3&gt;Extensible&lt;/h3&gt;
&lt;p&gt;Should I ever want to hack my own plugin or do some customization, I want to know
that the engine exposes a good API for extending its functionality. For this reason,
I considered only options written in Python, Ruby and JavaScript: the three
languages where I'm most comfortable.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="configurable"&gt;
&lt;h3&gt;Configurable&lt;/h3&gt;
&lt;p&gt;I should be able to tweak the most important features with a simple change to
a settings file. Important options for me are the ability to use arbitrary URL
structures and organize my content however I choose.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="of-tool-fetishes"&gt;
&lt;h2&gt;Of Tool Fetishes&lt;/h2&gt;
&lt;p&gt;As would-be bloggers, we are spoilt for choice when choosing a static blog engine. We can
start with many off-the-shelf themes and customize them to suit our needs.&lt;/p&gt;
&lt;p&gt;That said, the choice probably does not matter as much as we like to think. The most
important part of blogging is the content, not the presentation. It's not about the
minor differences between blog generators. Nor is it about reinventing yet another one. We
reach a point when our obsession with our tools becomes fetishistic (MUST blog with
vim!).&lt;/p&gt;
&lt;p&gt;The tools that you use for publishing only matter if you write regularly.
I would argue that we see the same fetishistic attitude during perennial
flamewars about text editors and web frameworks.&lt;/p&gt;
&lt;p&gt;The past few years have seen colossal duplication of efforts in the space of static
site generators. While no solution will suit everyone, more consolidation would be
nice.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-winner-is"&gt;
&lt;h2&gt;The winner is...&lt;/h2&gt;
&lt;p&gt;After almost going with Jekyll, I found &lt;a class="reference external" href="http://pelican.notmyidea.org/en/latest/"&gt;Pelican&lt;/a&gt;. It is built on top of software that
I consider best of breed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://jinja.pocoo.org/"&gt;Jinja2&lt;/a&gt; templates&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://sphinx.pocoo.org/rest.html#rst-primer"&gt;rst&lt;/a&gt; markup&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://sphinx.pocoo.org/rest.html#rst-primer"&gt;Sphinx&lt;/a&gt; documentation&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://python.org"&gt;Python&lt;/a&gt; implementation language&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Development is currently very active. Compared to the others, it has good
documentation. I generally prefer Sphinx docs to the combination of RDocs and
GitHub wikis popular among Ruby projects.&lt;/p&gt;
&lt;p&gt;Pelican has a dedicated script called &lt;cite&gt;pelican-themes&lt;/cite&gt; for managing themes. I liked
the default theme enough to take it as the starting point for my design.&lt;/p&gt;
&lt;p&gt;Its 'watch' mode for development has worked very well for me so far, even when
I left it running for long periods of time.&lt;/p&gt;
&lt;p&gt;A few &lt;a class="reference external" href="http://kennethreitz.com"&gt;influential&lt;/a&gt; python &lt;a class="reference external" href="http://pydanny.com"&gt;bloggers&lt;/a&gt; have
also switched over to Pelican.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="other-contenders"&gt;
&lt;h2&gt;Other Contenders&lt;/h2&gt;
&lt;p&gt;Here is a quick overview of the other choices that I evaluated before choosing
Pelican. In some cases, my evaluation was fairly superficial. I won't try to be
comprehensive. Instead, I hope this will serve as a good starting point for
someone trying to make the same choice.&lt;/p&gt;
&lt;div class="section" id="jekyll"&gt;
&lt;h3&gt;Jekyll&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://jekylrb.com"&gt;Jekyll&lt;/a&gt; is the engine behind GitHub pages. It's written in
Ruby and is easily the most popular of the options I considered. Its important enough
that its creation may have helped to bring about the resurgence of static websites in
general. As I mentioned it was my top choice behind Pelican and a well built piece of
software.&lt;/p&gt;
&lt;p&gt;Jekyll's large community comes with great benefits. Two popular projects built atop
Jekyll are &lt;a class="reference external" href="http://octopress.org/"&gt;Octopress&lt;/a&gt; and &lt;a class="reference external" href="http://jekyllbootstrap.com/"&gt;Jekyll-Bootstrap&lt;/a&gt;. Both attempt to provide a simple blogging
experience out of the box. Much of the Jekyll configuration is done for you. Each of
these projects has its own set of themes, making customization a snap.&lt;/p&gt;
&lt;p&gt;To someone technical enough to want a static blog, but who is still looking to hit
the ground running, I would recommend Jekyll-Bootstrap or Octopress along with
Pelican. Jekyll has the largest community and the benefits that come with that.&lt;/p&gt;
&lt;div class="section" id="plugins"&gt;
&lt;h4&gt;Plugins&lt;/h4&gt;
&lt;p&gt;Jekyll has a ton of plugins. One that made Jekyll a contender for me is
&lt;a class="reference external" href="https://github.com/xdissent/jekyll-rst"&gt;jekyll-rst&lt;/a&gt;. This allows you to write your
posts in rst instead of markdown. The plugin is a bit rough around the edges and
still requires you to install some Python packages.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/versapay/jekyll-s3"&gt;jekyl-s3&lt;/a&gt; will deploy it to S3 for you
which is nice if you you'd rather not mess with &lt;a class="reference external" href="http://s3tools.org/s3cmd"&gt;s3cmd&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="issues"&gt;
&lt;h4&gt;Issues&lt;/h4&gt;
&lt;p&gt;I found a number of things about Jekyll confusing. Its docs are decent if a bit
scattered.&lt;/p&gt;
&lt;p&gt;It uses &lt;a class="reference external" href="http://liquidmarkup.org/"&gt;Liquid templating language&lt;/a&gt;, which was &lt;a class="reference external" href="http://www.scribd.com/doc/57352264/The-Impact-of-Django"&gt;inspired
by Django's&lt;/a&gt; templating
language. Jinja2, also inspired by Django seems to me to be a much more mature
implementation.&lt;/p&gt;
&lt;p&gt;In general, Jekyll was not simple or inviting enough for a tool of its popularity.
I think this helps to explain the demand for &amp;quot;frameworks&amp;quot; like Octopress on top of Jekyll.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="hyde"&gt;
&lt;h3&gt;Hyde&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://ringce.com/hyde"&gt;Hyde&lt;/a&gt; started as a python port of Jekyll but has
become something quite distinct. At first glance, Hyde seemed like the python option
with the largest community. Since I still generally prefer Python to Ruby, Hyde was
the first option I considered when building a static site last year.&lt;/p&gt;
&lt;p&gt;The major problem with Hyde is that it has been between major versions for a long
time. &lt;a class="reference external" href="http://github.com/hyde/hyde"&gt;Hyde 1.0&lt;/a&gt; remains mostly undocumented. The new
version makes some welcome improvements like breaking its dependency on Django and
moving to Jinja2 for templating. But, as a new user, I had no idea where to start on
a Hyde project.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="cyrax"&gt;
&lt;h3&gt;Cyrax&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://github.com/piranha/cyrax"&gt;Cyrax&lt;/a&gt; was the next option I considered. It's
also written in Python and uses Jinja2 templates. The author writes in the readme:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
It's inspired from Jekyll and Hyde site generators and started when I realized that
I'm dissatisfied with both of them by different reasons.
&lt;/pre&gt;
&lt;p&gt;I found Cyrax to be generally well done. It's better suited to websites
than blogs, but it can do either. It allows you to use different layouts for different
page types. This is very helpful in cases where you need more than just generic pages
and blog posts.&lt;/p&gt;
&lt;p&gt;The largest problems with Cyrax are its documentation and community. Any would-be
contributors to Cyrax have hopefully found their way to Pelican.&lt;/p&gt;
&lt;p&gt;Cyrax has some rough spots. For instance, the development server does not have
a delay between refreshes. Rapidly editing lots of files can practically crash your
machine if your site has more than a few pages.&lt;/p&gt;
&lt;!-- TODO http://tinkerer.bitbucket.org/, http://mynt.mirroredwhite.com/quickstart/ --&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="new-ideas"&gt;
&lt;h2&gt;New Ideas&lt;/h2&gt;
&lt;p&gt;NIH syndrome in this space aside, some exciting new projects are appearing.&lt;/p&gt;
&lt;div class="section" id="punch"&gt;
&lt;h3&gt;Punch&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://laktek.github.com/punch/"&gt;Punch&lt;/a&gt; is a static website generator written in
Javascript. All metadata is stored as JSON except long-form writing which can be done
in markdown. The coolest part of punch is the ability to render pages on both the
server and the client using the same code.&lt;/p&gt;
&lt;p&gt;While I like the idea a lot, the project is just getting off the ground. Also, I'm
not sure this hits a sweet spot for any particular usecase.&lt;/p&gt;
&lt;p&gt;Someone who wants to serve pre-rendered content may not be happy having to enter
all of their metadata as JSON. YAML is a better choice here if the users are
supposed to hand-edit these files.&lt;/p&gt;
&lt;p&gt;On the other hand, someone who wants a fully client-side site will likely choose
a more full-featured build tool like &lt;a class="reference external" href="http://brunch.io"&gt;Brunch&lt;/a&gt;. Brunch provides
a framework that helps structure your code instead of your blog content. Or, if a
user is more minimalistic, he will manage the JavaScript and templates himself.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="ruhoh"&gt;
&lt;h3&gt;ruhoh&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="http://ruhoh.com/"&gt;ruhoh&lt;/a&gt; is the new project from &lt;a class="reference external" href="https://github.com/plusjade"&gt;plusjade&lt;/a&gt;, the creator or jekyll bootstrap. What's exciting
about this project is that, instead of allowing pluggable templating languages, its
allows for a plugable implementation language. While the ruhoh API has to date only
been implemented in &lt;a class="reference external" href="https://github.com/ruhoh/ruhoh.rb"&gt;Ruby&lt;/a&gt; , the plan is to
build implementations for many popular languages.&lt;/p&gt;
&lt;p&gt;The key insight here is that the choice of templating language should not be very
important. If ruhoh can define its entire API in any language, why not take care
of any preprocessing in your programming language of choice? &lt;a class="reference external" href="http://mustache.github.com/"&gt;Mustache&lt;/a&gt; can handle rendering the content and still maintain
a clear separation of concerns. Mustache is a good choice for this use case because it
already has bindings in most languages.&lt;/p&gt;
&lt;p&gt;True language independence sounds like a great goal. If we view the function of
a static site generator as a simple transformation of data and allow the proper hooks
for extensibility, ruhoh can provide something much more powerful than plugins for
a single platform. Instead, it could allow for a fully customizable experience.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="variations-on-a-build-tool"&gt;
&lt;h2&gt;Variations on a build tool&lt;/h2&gt;
&lt;p&gt;Since all dynamic elements in a statically generated site will require JavaScript,
users of static generators might appreciate JavaScript-specific features like
combining scripts and minifying them. These features exist in JavaScript build tools
like Sprockets or Brunch, and a tool that adds these features to a static site builder
may be exactly what is needed to build sites that are rich in both content and
client-side functionality.&lt;/p&gt;
&lt;p&gt;A complete build tool might seem like overkill when a make/fab/rake/cake file
combined with something like &lt;a class="reference external" href="https://github.com/guard/guard"&gt;guard&lt;/a&gt; to watch your
files and rebuild during development is all that's required. While this will
certainly work, it's a nontrivial problem since rebuilding everything from scratch
after every change is not feasible for sites with lots of content or scripts.&lt;/p&gt;
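&lt;p&gt;To make the incremental rebuild problem concrete, here is a rough sketch of one
possible approach: record a hash of each page's source and re-render only the pages
whose hash has changed. The names &lt;cite&gt;digest&lt;/cite&gt;, &lt;cite&gt;stale_pages&lt;/cite&gt; and &lt;cite&gt;build&lt;/cite&gt; are
hypothetical, and &lt;cite&gt;str.upper&lt;/cite&gt; merely stands in for a real renderer. None of this is
taken from any of the tools mentioned above.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
import hashlib


def digest(text):
    """Return a stable fingerprint of a page's source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_pages(sources, manifest):
    """Return the names of pages whose source changed since the last build.

    sources maps a page name to its source text; manifest maps a page name
    to the digest recorded when that page was last rendered.
    """
    return sorted(name for name, text in sources.items()
                  if manifest.get(name) != digest(text))


def build(sources, manifest, render):
    """Render only the stale pages and record their new digests."""
    rebuilt = {}
    for name in stale_pages(sources, manifest):
        rebuilt[name] = render(sources[name])
        manifest[name] = digest(sources[name])
    return rebuilt


# Hypothetical two-page site; str.upper is a placeholder renderer.
sources = {"about": "About me", "post": "First post"}
manifest = {}
first = build(sources, manifest, str.upper)   # everything is new
sources["post"] = "First post, edited"
second = build(sources, manifest, str.upper)  # only "post" changed
print(sorted(first), sorted(second))
```
&lt;/pre&gt;
&lt;p&gt;This sketch ignores dependencies between files. A change to a shared template or
script would still have to invalidate every page that uses it, which is a large part
of what makes the problem nontrivial.&lt;/p&gt;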
&lt;p&gt;In my opinion, we are still waiting for a build tool for modern, content-rich sites.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="see-also"&gt;
&lt;h2&gt;See also&lt;/h2&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="http://pydanny.com/choosing-a-new-python-based-blog-engine.html"&gt;http://pydanny.com/choosing-a-new-python-based-blog-engine.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.subspacefield.org/~travis/python_web_page_generators.html"&gt;http://www.subspacefield.org/~travis/python_web_page_generators.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.reddit.com/r/Python/comments/r5cuv/python_static_site_generators/"&gt;http://www.reddit.com/r/Python/comments/r5cuv/python_static_site_generators/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.quora.com/Whats-the-best-available-static-blog-website-generator-in-Python"&gt;http://www.quora.com/Whats-the-best-available-static-blog-website-generator-in-Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
</summary><category term="static sites"></category><category term="rst"></category><category term="python"></category></entry></feed>