Evan Muehlhausen

Analyze Gchat transcripts in AWK

2013-02-13T00:00:00-05:00

I learned about AWK when I first started using Linux. My exposure to the language generally came in the form of one-liners that I would cut and paste from the web. While it seemed like a powerful tool, I never saw it as a full-fledged programming language and never took the time to learn to use it.

Why AWK

While I've seen some sophisticated applications of AWK in the wild, I mainly used it for simple operations on log files. I wondered whether properly learning AWK even made sense.

The book

Research on the topic led me to this Stack Overflow answer by Brandon Craig Rhodes. Mr. Rhodes is an avid speaker in the Python community and I respect his opinion. He recommends learning AWK not only to increase mastery at the command line, but as an excuse to read The AWK Programming Language by the original authors of the language.

Convinced, I acquired the book. While I'm still working my way through it, I've found it succinct and comprehensive. It's a lot more than a manual for the language; it's a discussion of many important programming concepts.

Staying power

What also struck me about AWK is its staying power. Though it's around 40 years old, a search on usesthis.com reveals a lot of smart people who explicitly mention AWK as an important part of their toolset. Even though many of these people also mention a high-level language like Python or Ruby, AWK stays relevant.

Chat transcripts

Since reading The Most Human Human, I've been fascinated by chat transcripts. Since I don't have anyone recording and transcribing my face-to-face conversations, my Gchat logs are the closest thing I have to a record of a real-time interaction with other people.

With that in mind, I wondered what interesting questions I could answer by analyzing a transcript. Some of my ideas were:

Duration of interaction

Total words/chars for each participant (who does all the talking?)

Total time when no one was speaking (are we distracted?)

Number of exchanges (how often does the active speaker change?)

Starting with this small set of data points, chat logs could reveal some interesting dimensions of a relationship. By comparing interactions between different people, or with the same person over time, trends might start to emerge. Very brief and terse interactions might suggest a casual acquaintance, whereas very long (in duration and words exchanged) and engaging (in number of exchanges) interactions might suggest close friendship.

To generate the source transcript for this post, I pulled up the chat log in Gmail, pressed print and then simply cut and pasted it into a text file.

AWK: string processing made easy

AWK is a data extraction language. While it has a rich set of features, enabling a variety of applications, it's manipulating text and freeing the data within where it really shines. In a few short lines, it can manage tasks that would take more work in other languages. In Python, to open a text file and run a regular expression on it, we require some boilerplate code to get started.

import re

with open('data.txt','r') as f:
    for line in f:
        # re.search matches anywhere in the line, like awk's /regex/
        if re.search('[0-9]+', line):
            print(line, end='')

AWK allows us to do this from the command line and get to the real work much more quickly.

awk '/[0-9]+/' data.txt

This is a contrived example but it's meant to show that AWK makes some tasks very easy.

Flocks of AWKs

Since its creation in the 1970s, AWK implementations have proliferated. They differ in their licensing, speed and feature set. The original implementation, the one described in the seminal volume on the language, is known as nawk. This is the version available by default in BSD operating systems and OS X. FreeBSD calls it "one true awk".

The GNU project provides an alternative implementation called gawk. It adds features not included in the original language, including built-in date functions and true multidimensional arrays. It's provided under the GPL, which may matter to some. For me, the additional features justify the extra installation on OS X (brew install gawk did the trick). Gawk is required to run the code for these examples.

Parsing the transcript

Basic AWK programs are structured in blocks like

condition { action }

AWK reads a target file line by line and, if the condition holds, it performs the action and then moves on to the next condition. When parsing this chat transcript, we have four types of lines. Some indicate a speaker:

me: hi!

Others indicate the time:

3:27 PM

Or when a certain amount of time has elapsed between messages:

5 minutes

Some have no distinguishing features at all and are just lines of text. These need to be attributed to the active speaker as indicated by the last speaker identifier line.

My strategy for handling different cases is to look first for the time-related lines. If we find one, we stop processing the line using next. If we find a line identifying a speaker, we store the active speaker then remove the speaker designation e.g. me: from the line, leaving only the raw chat content. Then, for all remaining lines we simply count the words and characters and attribute them to the active speaker.

Here we handle a line that indicates a speaker.

#Speaker change line
# e.g. me: I love cats
/^[A-Za-z]+: / {
    speaker = $1
    if (speaker !~ /^me/){
        other_speaker = $1
    }
    changes++

    # Remove the speaker from the line
    sub($1 FS, "");
}

When parsing some of the lines, it simplifies the script to use the three-argument form of the match function provided by gawk. This makes it easier to capture segments of the string for processing. For example, when calculating how much dead time elapsed between messages, we can write:

# Dead time line
# e.g. 5 minutes
match($0, /^([0-9]+) minutes/, out) {
    dead_time += out[1]
    # Don't count this as chat content
    next
}

This makes it easy to capture the number of minutes and add it to our total.

Issues

Regexes

This strategy presents a problem. Suppose a user types a message, followed by a subsequent message that reads:

10 minutes

This will get counted as dead air time. Getting around this would require parsing the HTML version of the chat log. Since we want to use AWK for its plain-text goodness, we will ignore this issue.

Multiple speakers

The program only works for two-party conversations. It could be modified to allow for chats involving any number of parties.
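As a rough illustration of that modification, here is a sketch in Python rather than AWK. It is not the script from this post: the function name tally, the three regular expressions, and the assumption that time lines look exactly like "3:27 PM" are all mine. It keeps one tally per speaker in a dictionary, so the number of participants does not matter.

```python
import re
from collections import defaultdict

# Assumed line formats, based on the examples shown earlier in the post
SPEAKER = re.compile(r'^([A-Za-z]+): (.*)$')      # e.g. "me: hi!"
CLOCK = re.compile(r'^\d{1,2}:\d{2} [AP]M$')      # e.g. "3:27 PM"
DEAD_TIME = re.compile(r'^(\d+) minutes$')        # e.g. "5 minutes"


def tally(lines):
    """Count words and characters per speaker, plus dead time and speaker lines."""
    words = defaultdict(int)
    chars = defaultdict(int)
    dead_time = 0
    exchanges = 0
    speaker = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and clock lines; neither is chat content
        if not line or CLOCK.match(line):
            continue
        gap = DEAD_TIME.match(line)
        if gap:
            dead_time += int(gap.group(1))
            continue
        found = SPEAKER.match(line)
        if found:
            # Remember the active speaker and keep only the message text
            speaker, line = found.group(1), found.group(2)
            exchanges += 1
        if speaker is None:
            # Content before any speaker line cannot be attributed
            continue
        words[speaker] += len(line.split())
        chars[speaker] += len(line)
    return dict(words), dict(chars), dead_time, exchanges
```

Like the AWK version, this still misreads a chat message that consists only of something like "10 minutes" as dead time.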

Output

Using AWK's END block, we print our results:

me:  890 words  (46%),  4415 characters  (47%)
Jose: 1032 words  (53%),  4896 characters  (52%)
exchanges: 115
duration: 109 minutes
dead_time: 4 minutes

Impressions

AWK is a great tool and I think it's worth a programmer's time to learn it. That said, it is not without its problems.

Readability

AWK is good at what it does but I don't find the code I wrote very readable. Perhaps this is my own inexperience. With a more complex project, this could lead to maintenance issues. I'd be interested to know how more experienced AWKers deal with this.

Data structures

Lack of data structures (e.g. lists), as well as a limited set of built-in functionality (esp. outside of gawk) can make things harder.

Overall, I've enjoyed my foray into AWK. While I still wouldn't use it for anything too complex, it's always good to learn new tools. Plus, I've already found myself using it in cases where I would normally have to paste data into a spreadsheet. Having these tasks in scripts saves time and adds flexibility.

The code I wrote for this post is available on GitHub.

Data mining local radio with Node.js

2012-08-20T00:00:00-04:00

More harpsichord?!

Seattle is lucky to have KINGFM, a local radio station dedicated to 100% classical music. As one of the few classical music fans in their twenties, I listen often enough. Over the past few years, I've noticed that when I tune to the station, I always seem to hear the plinky sound of a harpsichord.

Before I sent KINGFM an email, admonishing them for playing so much of an instrument I dislike, I wanted to investigate whether my ears were deceiving me. Perhaps my own distaste for the harpsichord increased its impact in my memory.

This article outlines the details of this investigation and especially the process of collecting the data.

If it ain't baroque...

A harpsichord is in many ways similar to the piano. Pressing down and releasing one of its keys triggers an internal mechanism that plucks a string inside. The resultant vibration of the string produces the corresponding pitch. Because its strings are plucked, the instrument has no dynamic range. Each note sounds at roughly the same volume, however firmly or softly the player strikes the keys. Some harpsichords have several choirs of strings that allow the player limited control of the volume and timbre.

The harpsichord can sound tinny to modern ears. Thomas Beecham famously said, "The sound of a harpsichord - two skeletons copulating on a tin roof in a thunderstorm."

At the start of the 18th century, the newly invented fortepiano began to push both the harpsichord and its close relative, the clavichord, out of favor. The new instrument worked more like the modern piano in that its strings were struck with padded hammers. Compared to the other keyboard instruments of its day, it had a more resonant sound and allowed the player to control the dynamics of each note simply by altering the force with which he struck the keys.

The period before the fortepiano, during which the harpsichord had its heyday, is known as the Baroque Era. The history of classical music is often divided into several historical "eras" or "periods". The dates that separate them are somewhat arbitrary, with substantial overlap. I'll follow Wikipedia in defining these boundaries because it has the most comprehensive composer data with the most permissive licence.

Wikipedia's dates differ little from those outlined by Naxos, a well respected music label with an extremely comprehensive catalog. Unfortunately, the Naxos ToS are extremely restrictive with respect to their composer data.

These eras are:

Medieval (476–1400)

Renaissance (1400–1600)

Baroque (1600–1760)

Classical era (1730–1820)

Romantic era (1815–1910)

20th century (1900–2000)

21st century (since 2000)

Since King seems to play very little music from before 1600, I ignored the Medieval and Renaissance eras in my analysis.

Because of the dominance of the piano and its predecessors starting in the Classical Era, one is less likely to find the sound of the harpsichord in modern recordings of anything but Baroque music. Even then, music originally written for harpsichord is often transcribed to the piano and recorded that way. Glenn Gould, perhaps the most famous modern Bach interpreter, is well-known for recordings of such transcriptions.

One exception is opera. The harpsichord was used for accompanying recitative all the way into the late 18th century. For simplicity, we will blissfully ignore this fact.

Collecting the data

KINGFM posts its playlist daily to its website. By scraping this data, I was able to build the dataset.

Scraping with Node

Web scraping is normally a network-constrained task. Most of the execution time is spent waiting for the server to respond. Node.js encourages an asynchronous style that is well-suited to such tasks. Using the request module, it's easy to send non-blocking HTTP requests and process each result in a callback as it's returned. Because so many requests can be in flight at once, rate limiting yourself is important when scraping more than a few pages. Otherwise, the flood of requests you unleash is likely to get you blocked or interfere with the target site.

Another advantage of Node for this use case is that it brings existing client-side libraries to the server. Great scraping tools exist in other languages (e.g. scrapy). However, since many developers already have years of experience using jQuery client-side to access the DOM, they may prefer to use a familiar API instead of learning a new one.
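To make the rate-limiting idea concrete, here is a minimal sketch in Python. It is sequential rather than asynchronous, so it is not how the Node scraper in this post works. The names fetch_all and min_interval are hypothetical, and fetch stands in for whatever function actually downloads a page.

```python
import time


def fetch_all(urls, fetch, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, leaving at least min_interval seconds
    between the start of one request and the start of the next."""
    results = {}
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the interval since the last request
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results[url] = fetch(url)
    return results
```

The sleep and clock parameters exist only so the pacing can be checked without real delays. In normal use, the defaults are fine.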

Cheerio

Several npm packages are available to help us use jQuery in Node. jsdom is a popular option that implements the full DOM in JS, allowing us to use jQuery or most any other client-side library on the server.

However, cheerio better suits this simple task. The project provides a re-implementation of the most important parts of jQuery core. It's simpler to use and the author claims it's a much faster choice compared to jsdom. Since much of the official jQuery source provides unneeded functionality like AJAX and browser compatibility, a re-implementation that leaves this bloat behind is preferable.

An example scraper

Especially when used in concert with Coffeescript, Cheerio makes for readable scrapers. By leveraging the superpower that is the jQuery selector we can often get at our data with minimal code. As an example, let's use it to scrape the target URLs from the front page of reddit using the selector #siteTable a.title.

request = require('request')
cheerio = require('cheerio')

parse_page = (error, response, body) ->
  if(error or response.statusCode != 200)
    console.log(error)
  else
    # Load the page into cheerio
    $ = cheerio.load(body)

    # Iterate over the links on the front page
    $('#siteTable a.title').each (k,v) =>

      console.log($(v).attr('href'))

request.get("http://www.reddit.com", parse_page)

Using this technique, I quickly pulled the last 30 days of playlist data from KINGFM and dumped it into a file.

The joys of data normalization

Then came the hard part: associating composer names in the playlist data with historical eras. This is more difficult than it seems because subtle differences in the datasets could result in mismatches. King is mostly consistent in defining its composers but it does so in a different format from Wikipedia. It lists "SCHUBERT" instead of "Franz Schubert". These cases are easy enough: simply convert to lowercase and lop off the first name.

There are several types of more difficult cases. In some, the database contains multiple composers that share a last name, e.g. J.S. Bach and all of his sons. In these cases, we need first names or initials to differentiate. Unfortunately, the formats differ between the data sets. Wikipedia has "Carl Philipp Emanuel Bach" and KINGFM, "BACH, C.P.E". Other tough cases are those where a composer has multiple last names, e.g. "Vaughan Williams". Other annoying cases occur where diacritics do not match, e.g. "Dvořák" and "DVORÁK" (no accented r). Since my data set is fairly small (3210 playlist items), I was able to handle these unfortunate cases with regular expressions and frustration.
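The easy cases can be sketched in a few lines of Python. This is an illustration only, not the code used for this analysis, and the function names are hypothetical. It lowercases the name, strips diacritics, and keeps the surname. It deliberately does nothing clever for the hard cases: every Bach collapses to "bach" and "Vaughan Williams" becomes "williams", so those would still need a hand-written table.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_composer(name):
    """Reduce a composer name to a lowercase, accent-free surname key."""
    name = strip_diacritics(name).lower().strip()
    if ',' in name:
        # Playlist style, e.g. "BACH, C.P.E": the surname comes first
        return name.split(',')[0].strip()
    # Wikipedia style, e.g. "Franz Schubert": the surname comes last
    return name.split()[-1]
```

With this, "Franz Schubert" and "SCHUBERT" both map to the key schubert, and the two spellings of Dvořák map to the same key as well.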

Handling overlap

In some cases, a composer belongs to multiple eras. For example, Beethoven's music is said to span both the Classical and the Romantic eras. One way to handle these cases would be to count every movement of Beethoven's as both Classical and Romantic. The downside is that this would result in double counting a lot of the most popular composers.

Instead, I chose to sort the database alphabetically by composer name rather than by era. In cases where there are two entries for the same name, the second one overwrites the first. This is not ideal either but should help to randomize the era into which transitional composers are placed. I did some editorializing here for the most prominent composers. For instance, Beethoven was counted as Classical and Schubert as Romantic.
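A minimal Python sketch of this "last entry wins" approach, assuming the database is available as a list of (composer, era) pairs. The function name era_lookup and the overrides parameter are hypothetical; overrides stands in for the hand-picked choices mentioned above.

```python
def era_lookup(entries, overrides=None):
    """Build a composer -> era mapping where later duplicates overwrite earlier ones."""
    lookup = {}
    # sorted() is stable, so duplicate names keep their original relative order
    for composer, era in sorted(entries, key=lambda pair: pair[0]):
        lookup[composer] = era
    # Editorial choices for prominent composers take precedence over the data
    lookup.update(overrides or {})
    return lookup
```

Each composer ends up in exactly one era, which avoids the double counting described above.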

Results

Play count

2691 of the 3208 playlist items had matches in the database, leaving 472 unidentified tracks. The results were distributed like this:

Era distribution

Top composers

Composer	Play count
Mozart	191
Bach	188
Haydn	114
Beethoven	92
Schubert	83
Chopin	78
Debussy	66
Mendelssohn	63
Brahms	51
Tchaikovsky	46

Air time

Analyzing the total play count for each era is useful. But the more interesting question for a listener is not how often tracks from each era are played. Rather, it's what proportion of the airtime each era consumes. This is an important distinction because some movements last less than a minute while others can last 30 minutes or longer.

Top Composers by airtime (including only the top 16 composers or 48% of total):

This data highlights the importance of using airtime over play count. While King plays J.S. Bach almost as often as Mozart, Mozart gets 42% more airtime, more than 20 hours more per month compared to Bach.

Async and ordering

When analyzing airtime, we have to make sure all of our data is properly sorted. Since we are scraping asynchronously and writing the results to disk as they are returned, it's likely that our data will come back in an order different from the order of the HTTP requests.

This can occur because some pages are larger and so take longer to transfer than others. A glance at the data shows that this did indeed happen:

Time	Composer
07/19/2012 11:51pm	PURCELL
07/19/2012 11:54pm	SCHUBERT
07/16/2012 12:02am	CHOPIN

Since we made non-blocking HTTP requests, the data from 7/19 arrived more quickly and so was written to disk before the data from the 16th.

Since we want to access this data as a JavaScript object anyway, we ought not rely on the default ordering of an object's properties. Field ordering is not part of the ECMAScript spec. To remedy this, we will use moment.js to parse each date string and convert it to a UNIX timestamp.

These timestamps will be cast to two separate data structures, a list and hash. The list will be sorted and used for ordering. The hash maps timestamps to composer names. Iterating through the list, we look up the associated composer and use subtraction to work out the total seconds of airtime for each track.

fs = require('fs')
moment = require('moment')

composers_by_air_time = ->

  dates = []
  map = {}
  playlist = JSON.parse(fs.readFileSync('db/playlist.json'))

  for date, composer of playlist
    # Build an array of timestamps together with a mapping from
    # timestamp to composer
    unix_date = moment(date, 'MM/DD/YYYY hh:mma').unix()

    # Push it onto our list of timestamps
    dates.push(unix_date)

    # Map the composer whose work STARTED to this timestamp
    map[unix_date] = composer

  # Sort the dates numerically so we can subtract adjacent members
  # (the default sort compares elements as strings)
  dates.sort((a, b) -> a - b)

  results = {}

  for date, idx in dates when idx > 0
    # Subtract each item from its predecessor, ignoring the first one
    prev_date = dates[idx - 1]
    difference = date - prev_date
    composer = map[prev_date]

    # Group by composer name
    if results[composer] > 0
      results[composer] += difference
    else
      results[composer] = difference

  return results

Airtime by era

We can combine the airtime data with the composer era data to get the total airtime by era
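One way that combination might look, sketched in Python rather than CoffeeScript. It assumes the per-composer airtime is a dictionary of seconds and that a composer-to-era mapping already exists. The function name and the 'Unknown' bucket for unmatched composers are my own choices.

```python
from collections import defaultdict


def airtime_by_era(airtime_by_composer, era_of):
    """Sum per-composer airtime (in seconds) into per-era totals."""
    totals = defaultdict(int)
    for composer, seconds in airtime_by_composer.items():
        # Composers missing from the era data are grouped separately
        era = era_of.get(composer, 'Unknown')
        totals[era] += seconds
    return dict(totals)
```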

Conclusions

Blame the bias

The data shows that KINGFM is innocent of the charge of favoring Baroque music over other eras. Indeed, they play less Baroque than anything else: less than half as much as twentieth century music. Looks like my own bias against the harpsichord has affected my statistical judgment. Good thing I actually checked before blaming the station.

Data mining in Node

Part of my motivation for this post was to get more familiar with using Node and Coffeescript. This pair makes a convenient programming environment for tasks like web scraping and networking applications.

That said, JavaScript on its own is a poor candidate for data analysis. It has a limited set of built-in data structures and no default support for parsing data from common file formats. Gauss looks like it may help to fill this void but it will likely be some time until the node world has something as full-featured as pandas.

For those interested, the simple scripts that I wrote in CoffeeScript for the scraping and analysis are on GitHub.

Saving Screenshots in Rails with url2png and Paperclip

2012-05-30T00:00:00-04:00

url2png is a service for generating screenshots of websites. Pass in a URL and some dimensions and it spits back a high quality png capture of that site.

Unlike some competing services I've tried, it even does a decent job handling sites that require client-side rendering.

Someone has already built a ruby gem for working with url2png. It provides a rails helper for hot-linking url2png images in your views. Perhaps a better name for that gem would be url2png-rails.

Though useful, this is not what I was looking for. Instead, I wanted the ability to save a local copy of the screenshots on my own server. Since I was already using Paperclip for saving attachments, this turned out to be easy.

An API wrapper

The url2png API is quite simple. Using it requires building the URL of the image by generating a token. The following uses v3 of their API. As of writing, this is the version they use in their guide.

require 'digest/md5'

class ScreenShot

  KEY = 'your key'
  SECRET = 'your secret'

  def initialize(url, bounds)
    @url = url
    @bounds = bounds
  end

  def token
    Digest::MD5.hexdigest("#{SECRET}+#{@url}")
  end

  def img_url
    "http://api.url2png.com/v3/#{KEY}/#{token}/#{@bounds}/#{@url}"
  end
end

Using this, we can easily get the URL of a screenshot of the front page of reddit

>> shot = ScreenShot.new('http://reddit.com', '200x200')
>> shot.img_url
...
http://url2png.../reddit.png

Saving it with Paperclip

Paperclip is a popular gem for managing file attachments in rails applications. Until now, I'd only used it to save files that were passed in through a form. But, it is not restricted to handling POST data or files already on disk. Pass in any IO and it will take care of the rest.

Given a Website model with a url attribute, we can fetch an image for that URL and save an associated screenshot.

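# open-uri lets open() fetch HTTP URLs (as it did in the Ruby versions of this era)
require 'open-uri'

class Website < ActiveRecord::Base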
  has_attached_file :screenshot,
                    :styles => {:thumb => '50x50', :square => '200x200' }

  def gen_screenshot!
    shot = ScreenShot.new(url, '200x200')
    self.screenshot = open(shot.img_url)
    save!
  end
end

Notice that we can pass the IO returned by open directly to Paperclip without having to bother saving it to disk ourselves.

If the image is small enough, behind the scenes open will use a StringIO and hold the image data in memory. This avoids the filesystem overhead of writing an extra Tempfile.

We can attach the image to our Website model like this:

>> site = Website.new(url: 'http://reddit.com')
>> site.gen_screenshot!

Paperclip will handle the messy details of thumbnail generation. When it's done, it will move the files to the proper location on disk.

Better as a Gem

Since this is such a small amount of code, bundling it as a gem may seem like overkill. But I would argue that it is still the right move. Perhaps someday this functionality could be integrated into @wout's url2png gem.

Simple Counters in Python (with Benchmarks)

2012-05-16T00:00:00-04:00

It's sometimes necessary to count the number of distinct occurrences in a collection. For example, counting how many times each letter occurs in a block of text, or sorting a list by its most common member.

If I were to do this sort of counting with SQL, I would generally use something like this:

SELECT column, count(*)
FROM table
GROUP BY column

This could easily be combined with an ORDER BY to get the most common items.
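For the curious, the GROUP BY plus ORDER BY combination can be tried from Python using the standard library's sqlite3 module. This is only a sketch: the table name items, the column name name, and the function most_common are hypothetical.

```python
import sqlite3


def most_common(values):
    """Return (value, count) rows, most frequent first, using an in-memory SQLite table."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE items (name TEXT)')
    conn.executemany('INSERT INTO items (name) VALUES (?)', [(v,) for v in values])
    rows = conn.execute(
        'SELECT name, count(*) AS n FROM items '
        'GROUP BY name ORDER BY n DESC, name ASC'
    ).fetchall()
    conn.close()
    return rows
```

The secondary sort on name just makes the order of ties predictable.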

However, assuming you are working with some raw data, here are some strategies for counting distinct occurrences in Python. Skip to the end to see which method performs best.

dict and in

A plain dictionary works well as a counter. Though using it is verbose, it performs surprisingly well and works in any python version.

counter = dict()
foods = ['soy', 'dairy', 'gluten', 'soy']
for k in foods:
    if k not in counter:
        counter[k] = 1
    else:
        counter[k] += 1
..

>>> counter
{'soy': 2, 'dairy': 1, 'gluten': 1}

defaultdict

I've always loved the defaultdict. Used properly, it can cut out a lot of boilerplate from your code. It has many applications, one of which is a counter.

from collections import defaultdict

counter = defaultdict(int)
foods = ['soy', 'dairy', 'gluten', 'soy']
for k in foods:
    counter[k] += 1

..
>>> counter
defaultdict(<class 'int'>, {'soy': 2, 'dairy': 1, 'gluten': 1})

By passing int to the class, all empty keys default to zero. This allows you to do += without setting the key first.

dict and setdefault

Dictionaries have a setdefault method that allows you to set the default value for a single key.

According to the python docs, running setdefault on every key is slower than using defaultdict. The benchmark below confirms this.

counter = dict()
foods = ['soy', 'dairy', 'gluten', 'soy']
for k in foods:
    counter.setdefault(k, 0)
    counter[k] += 1
...

>>> counter
{'soy': 2, 'dairy': 1, 'gluten': 1}

collections.Counter

Python 2.7 introduced collections.Counter which makes this trivial.

from collections import Counter
foods = ['soy', 'dairy', 'gluten', 'soy']
counter = Counter(foods)

..

>>> counter
Counter({'soy': 2, 'gluten': 1, 'dairy': 1})

By passing a list to the Counter constructor, it does the grouping for us. It still behaves like a dictionary so we can still do stuff like

>>> counter['soy'] += 3
>>> counter['soy']
5
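Counter also helps with the other task mentioned at the top of this post, sorting a list by its most common member. A small sketch follows. The variable names top and ranked are mine, and breaking ties alphabetically is an arbitrary choice.

```python
from collections import Counter

foods = ['soy', 'dairy', 'gluten', 'soy']
counter = Counter(foods)

# The single most common item together with its count
top = counter.most_common(1)

# The original list reordered so the most frequent items come first;
# ties are broken alphabetically to keep the order predictable
ranked = sorted(foods, key=lambda food: (-counter[food], food))

# Unlike a plain dict, a missing key counts as zero instead of raising KeyError
missing = counter['cheese']
```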

Benchmarks

Here are some quick and dirty benchmarks for these methods. I used this code to generate the data. I took some text by The Bard and counted the number of each letter and each word. There were a lot more unique words than letters which resulted in slower times to count them.

Keys	Counter	defaultdict	dict.setdefault	dict.in
6691	3.62	1.97	2.88	1.95
26727	13.13	4.31	9.58	7.17

These results show that while a plain dict with in checks performs best for a smaller number of keys, it's not significantly better than defaultdict. With a larger number of distinct members, defaultdict did substantially better than any other option.
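To try a comparison like this yourself, something along these lines should work. It is a sketch using the standard timeit module, not the code that produced the table above. The function names are my own, and timings on a current interpreter may well rank the methods differently.

```python
import timeit
from collections import Counter, defaultdict


def count_in(items):
    counter = {}
    for k in items:
        if k not in counter:
            counter[k] = 1
        else:
            counter[k] += 1
    return counter


def count_defaultdict(items):
    counter = defaultdict(int)
    for k in items:
        counter[k] += 1
    return counter


def count_setdefault(items):
    counter = {}
    for k in items:
        counter.setdefault(k, 0)
        counter[k] += 1
    return counter


def benchmark(items, repeat=5):
    """Return the best wall-clock time in seconds for each counting strategy."""
    strategies = {
        'Counter': Counter,
        'defaultdict': count_defaultdict,
        'dict.setdefault': count_setdefault,
        'dict.in': count_in,
    }
    return {
        name: min(timeit.repeat(lambda: fn(items), number=1, repeat=repeat))
        for name, fn in strategies.items()
    }
```

Calling benchmark on a list of words or letters returns a dictionary of timings keyed by the same column names used in the table.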

Use defaultdict

The takeaway is to stick to the defaultdict when you need a counter. Not only is it performant, but it saves you from the boilerplate of operating on every key.

While Counter is shiny and convenient, it's slow. As an added bonus, defaultdict works in Python 2.5. If you are stuck with Python 2.4 (upgrade!), checking membership with in on every key is your best option.

Edit: Updated in light of Philip's comment.

A Few Static Blog Generators

2012-05-15T00:00:00-04:00

Before finally choosing an engine for my own blog, I spent too much time comparing the many available options. My goal in this post is to share what I learned about some of the tools that are available. Hopefully this makes it easier for others to publish their own writing.

Why Static?

Static website generators take content in a user-friendly markup language (e.g. markdown) and compile it into flat HTML pages.

Static sites, particularly static blogs, have become increasingly popular. This isn't to say they are for everyone. They do require more effort to set up and generally have fewer features than a mainstream blogging platform.

The reasons for the increasing popularity of static websites can be found all over the web. Here are some of the big ones for me.

Fewer Moving Parts

Simple websites shouldn't require the same stack as a full-blown web application. Why depend on an app server and a database when all you really need are flat files and a web server?

Cheap and Easy Deployments

Deploy your blog anywhere that can serve static assets. Some free/cheap options for this:

Github pages (free)

s3 (practically free unless you are famous or ddos yourself)

shared host (at least you don't have to use their database)

a tiny VPS (my choice)

Security

Having had to clean up hacked Wordpress installations, I know what a pain it is to keep Wordpress up to date and locked down. Any popular web app that can be deployed to your own server is a natural target for attackers. By serving static files on disk, we eliminate a wide range of attack vectors.

My Requirements

Having decided on a static blog, I still had a sea of options to sift through before I could reach a decision. Salient features include:

reStructuredText

Markdown is a very popular choice for web writers. While I do like markdown, I prefer reStructuredText for a number of reasons. It requires more upfront investment to learn its larger syntax. But its rich feature set is well worth it.

Some great reST features are footnotes and tables of contents. Also, .rst files look great in plaintext (even without syntax highlighting).

Theming Support

Themes give you a big head start when starting a site. Taking an existing theme and customizing it to your needs is a lot faster than starting from scratch. Tools like twitter bootstrap make this process easier. But they don't save you from having to learn the names of all of the template variables and settings provided by your static generator.

Good theme support depends on a well-designed site theme API to enable customization. But it also depends on a community that has already released themes worth using.

Extensible

Should I ever want to hack my own plugin or do some customization, I want to know that the engine exposes a good API for extending its functionality. For this reason, I considered only options written in Python, Ruby and JavaScript: the three languages where I'm most comfortable.

Configurable

I should be able to tweak the most important features with a simple change to a settings file. Important options for me are the ability to use arbitrary URL structures and organize my content however I choose.

of Tool Fetishes

As would-be bloggers, we are spoilt for choice when choosing a static blog engine. We can start with many off-the-shelf themes and customize them to suit our needs.

That said, the choice probably does not matter as much as we like to think. The most important part of blogging is the content, not the presentation. It's not about the minor differences between blog generators. Nor is it about reinventing yet another one. We reach a point when our obsession with our tools becomes fetishistic (MUST blog with vim!).

The tools that you use for publishing only matter if you write regularly. I would argue that we see the same fetishistic attitude during perennial flamewars about text editors and web frameworks.

The past few years have seen colossal duplication of efforts in the space of static site generators. While no solution will suit everyone, more consolidation would be nice.

The winner is...

After almost going with Jekyll, I found Pelican. It is built on top of software that I consider best of breed:

Jinja2 templates

rst markup

Sphinx documentation

Python implementation language

Development is currently very active. Compared to the others, it has good documentation. I generally prefer Sphinx docs to the combination of RDocs and GitHub wikis popular among Ruby projects.

Pelican has a dedicated script called pelican-themes for managing themes. I liked the default theme enough to take it as the starting point for my design.

Its 'watch' mode for development has worked very well for me so far, even when I left it running for long periods of time.

A few influential python bloggers have also switched over to Pelican.

Other Contenders

Here is a quick overview of the other choices that I evaluated before choosing Pelican. In some cases, my evaluation was fairly superficial. I won't try to be comprehensive. Instead, I hope this will serve as a good starting point for someone trying to make the same choice.

Jekyll

Jekyll is the engine behind GitHub pages. It's written in Ruby and is easily the most popular of the options I considered. Its important enough that its creation may have helped to bring about the resurgence of static websites in general. As I mentioned it was my top choice behind Pelican and a well built piece of software.

Jekyll's large community comes with great benefits. Two popular projects built atop Jekyll are Octopress and Jekyll-Bootstrap. Both attempt to provide a simple blogging experience out of the box. Much of the Jekyll configuration done for you. Each of these projects has its own set of themes, making customization a snap.

To someone who is technical enough to want a static blog but is still looking to hit the ground running, I would recommend Jekyll-Bootstrap or Octopress along with Pelican. Jekyll has the largest community and the benefits that come with that.

Plugins

Jekyll has a ton of plugins. One that made Jekyll a contender for me is jekyll-rst. This allows you to write your posts in rst instead of Markdown. The plugin is a bit rough around the edges and still requires you to install some Python packages.

jekyll-s3 will deploy your site to S3 for you, which is nice if you'd rather not mess with s3cmd.

Issues

I found a number of things about Jekyll confusing. Its docs are decent if a bit scattered.

It uses the Liquid templating language, which was inspired by Django's templating language. Jinja2, also inspired by Django, seems to me to be a much more mature implementation.

In general, Jekyll was not simple or inviting enough for a tool of its popularity. I think this helps to explain the demand for "frameworks" like Octopress on top of Jekyll.

Hyde

Hyde started as a Python port of Jekyll but has become something quite distinct. At first glance, Hyde seemed like the Python option with the largest community. Since I still generally prefer Python to Ruby, Hyde was the first option I considered when building a static site last year.

The major problem with Hyde is that it has been between major versions for a long time. Hyde 1.0 remains mostly undocumented. The new version makes some welcome improvements like breaking its dependency on Django and moving to Jinja2 for templating. But, as a new user, I had no idea where to start on a Hyde project.

Cyrax

Cyrax was the next option I considered. It's also written in Python and uses Jinja2 templates. The author writes in the readme:

It's inspired from Jekyll and Hyde site generators and started when I realized that
I'm dissatisfied with both of them by different reasons.

I found Cyrax to be generally well done. It's better suited to websites than blogs, but it can do either. It allows you to use different layouts for different page types. This is very helpful in cases where you need more than just generic pages and blog posts.

The largest problems with Cyrax are its documentation and community. Any would-be contributors to Cyrax have hopefully found their way to Pelican.

Cyrax has some rough spots. For instance, the development server does not have a delay between refreshes. Rapidly editing lots of files can practically crash your machine if your site has more than a few pages.
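The usual fix for this kind of problem is to debounce rebuilds: wait for a short quiet period after the last change before regenerating the site. What follows is a minimal sketch of that idea, not Cyrax's code. The Debouncer class, its method names and the delay value are all hypothetical and chosen only for illustration.

```python
import time


class Debouncer:
    """Collapse a burst of change events into a single rebuild (hypothetical helper)."""

    def __init__(self, delay, clock=time.monotonic):
        self.delay = delay        # quiet period (seconds) required before rebuilding
        self.clock = clock        # injectable clock, so the logic can be tested
        self.last_change = None   # time of the most recent change, if any is pending

    def notify(self):
        """Record that some source file changed."""
        self.last_change = self.clock()

    def should_rebuild(self):
        """Return True once, after `delay` seconds have passed with no new changes."""
        if self.last_change is None:
            return False
        if self.clock() - self.last_change < self.delay:
            return False
        self.last_change = None   # nothing pending until the next change arrives
        return True
```

A watch loop would call notify() whenever it sees a modified file and poll should_rebuild() on each tick, so saving twenty files in quick succession triggers one rebuild instead of twenty.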

New Ideas

NIH syndrome in this space aside, some exciting new projects are appearing.

Punch

Punch is a static website generator written in JavaScript. All metadata is stored as JSON except long-form writing, which can be done in Markdown. The coolest part of Punch is the ability to render pages on both the server and the client using the same code.

While I like the idea a lot, the project is just getting off the ground. Also, I'm not sure this hits a sweet spot for any particular use case.

Someone who wants to serve pre-rendered content may not be happy having to input all of their metadata as JSON. YAML is a better choice here if users are expected to hand-edit these files.

On the other hand, someone who wants a fully client-side site will likely choose a more full-featured build tool like Brunch. Brunch provides a framework that helps structure your code instead of your blog content. Or, if a user is more minimalistic, he will manage the JavaScript and templates himself.

ruhoh

ruhoh is the new project from plusjade, the creator of Jekyll-Bootstrap. What's exciting about this project is that, instead of allowing pluggable templating languages, it allows for a pluggable implementation language. While the ruhoh API has to date only been implemented in Ruby, the plan is to build implementations for many popular languages.

The key insight here is that the choice of templating language should not be very important. If ruhoh can define its entire API in any language, why not take care of any preprocessing in your programming language of choice? Mustache can handle rendering the content and still maintain a clear separation of concerns. Mustache is a good choice for this use case because it already has bindings in most languages.
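To make that separation concrete, here is a rough sketch in Python. It is not ruhoh's API and not a real Mustache implementation: the render function handles only plain {{name}} variable tags (no sections, partials or HTML escaping), and the preprocess function and its field names are hypothetical.

```python
import re

# Matches a plain Mustache-style variable tag such as {{title}} or {{ title }}.
TAG = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def preprocess(post):
    """Language-specific step: compute every value the template will need."""
    return {
        "title": post["title"].strip(),
        "word_count": str(len(post["body"].split())),
    }


def render(template, context):
    """Logic-free step: replace each tag with its value; unknown names become empty."""
    return TAG.sub(lambda match: context.get(match.group(1), ""), template)


context = preprocess({"title": "  Hello ", "body": "one two three"})
page = render("<h1>{{title}}</h1> ({{ word_count }} words)", context)
```

All of the logic lives in preprocess, which could be rewritten in any language, while the template stays the same. That is the property that makes a logic-free template format a natural fit for a language-independent generator.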

True language independence sounds like a great goal. If we view the function of a static site generator as a simple transformation of data and allow the proper hooks for extensibility, ruhoh can provide something much more powerful than plugins for a single platform. Instead, it could allow for a fully customizable experience.

Variations on a build tool

Since all dynamic elements in a statically generated site will require JavaScript, users of static generators might appreciate JavaScript-specific features like combining scripts and minifying them. These features exist in JavaScript build tools like Sprockets or Brunch, and a tool that adds these features to a static site builder may be exactly what is needed to build sites that are rich in both content and client-side functionality.

A complete build tool might seem like overkill when a make/fab/rake/cake file combined with something like guard to watch your files and rebuild during development is all that's required. While this will certainly work, it's a nontrivial problem, since rebuilding everything from scratch after every change is not feasible for sites with lots of content or scripts.
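The simplest way to avoid a full rebuild is the check that make performs: regenerate a page only when its output is missing or older than its source. Here is a minimal sketch of that check. The stale_sources function, the flat directory layout and the .rst/.html extensions are assumptions made for illustration.

```python
import os


def stale_sources(src_dir, out_dir, src_ext=".rst", out_ext=".html"):
    """Return source paths whose output file is missing or older than the source."""
    stale = []
    for name in sorted(os.listdir(src_dir)):
        base, ext = os.path.splitext(name)
        if ext != src_ext:
            continue  # ignore files that are not page sources
        src = os.path.join(src_dir, name)
        out = os.path.join(out_dir, base + out_ext)
        # Rebuild when there is no output yet, or the source was modified after it.
        if not os.path.exists(out) or os.path.getmtime(out) < os.path.getmtime(src):
            stale.append(src)
    return stale
```

This sketch also shows why the problem is nontrivial. A per-file timestamp check knows nothing about shared templates, index pages or tag archives, so a change to any of those still forces a wider rebuild unless the tool tracks dependencies.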

In my opinion, we are still waiting for a build tool for modern, content-rich sites.

See also