Coderholic

More invaluable command line tools for web developers

2012-11-15T00:00:00-08:00

This article is a follow up to Invaluable command line tools for web developers, and covers some more great tools that can make your life as a developer that little bit easier.

This post first appeared, combined with the first command line tools post, on Smashing Magazine.

netcat

Netcat, or nc, is a self-described networking Swiss Army knife. It's a very simple but also very powerful and versatile application that allows you to create arbitrary network connections. Here we see it being used as a port scanner:

$ nc -z example.com 20-100
Connection to example.com 22 port [tcp/ssh] succeeded!
Connection to example.com 80 port [tcp/http] succeeded!
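For comparison, here is a minimal Python sketch of the same kind of TCP connect scan. It is not how netcat is implemented; the function name scan_ports and the half-second timeout are assumptions made for illustration.

```python
import socket

def scan_ports(host, ports, timeout=0.5):
    """Return the ports in `ports` that accept a TCP connection on `host`."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports

# Roughly equivalent to `nc -z example.com 20-100`:
# scan_ports("example.com", range(20, 101))
```

The scan is sequential, so a large range of filtered ports will take roughly timeout multiplied by the number of ports.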

In addition to creating arbitrary connections netcat can also listen for incoming connections. Here we use this feature of nc, combined with tar, to very quickly and efficiently copy files between servers. On the server run:

$ nc -l 9090 | tar -xzf -

And on the client:

$ tar -czf - dir/ | nc server 9090

We can use netcat to expose any application over the network. Here we expose a shell over port 8080:

$ mkfifo backpipe
$ nc -l 8080  0<backpipe | /bin/bash > backpipe

You can now access the server from any client:

$ nc example.com 8080
uname -a
Linux li228-162 2.6.39.1-linode34 #1 SMP Tue Jun 21 10:29:24 EDT 2011 i686 GNU/Linux

While the last two examples are slightly contrived (in reality you'd be more likely to use tools such as rsync to copy files, and ssh to remotely access a server) they do show the power and flexibility of netcat, and hint at all of the different things you can achieve by combining netcat with other applications.

sshuttle

Sshuttle allows you to securely tunnel your traffic via any server you have SSH access to. It's extremely easy to set up and use, not requiring you to install any software on the server, or change any local proxy settings.

By tunnelling your traffic over SSH you secure yourself against tools like firesheep and dsniff when on unsecured public wifi or other untrusted networks. All network communication, including DNS requests, can be sent via your SSH server:

$ sshuttle -r <server> --dns 0/0

If you provide the --daemon argument sshuttle will run in the background as a daemon. Combined with some other options you can make aliases to simply and quickly start and stop tunnelling your traffic:

alias tunnel='sshuttle -D --pidfile=/tmp/sshuttle.pid -r <server> --dns 0/0'
alias stoptunnel='[[ -f /tmp/sshuttle.pid ]] && kill `cat /tmp/sshuttle.pid`'

You can also use sshuttle to get around the IP-based geolocation filters that are now used by many services, such as BBC's iPlayer, which requires you to be in the UK, and turntable.fm, which requires you to be in the US. You'll need access to a server in the target country. Amazon has a free tier of EC2 micro instances that are available in many countries, or you can find a cheap VPS in almost any country in the world.

In this scenario rather than tunnelling all of your traffic you might want to send only that for the service you are targeting. Unfortunately sshuttle only accepts IP address arguments rather than hostnames, so we need to make use of dig to first resolve the hostname:

$ sshuttle -r <server> `dig +short <hostname>`
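The same lookup-then-tunnel step can be sketched in Python. This is only an illustration: the helper names ipv4_addresses and sshuttle_command are made up here, and the /32 suffix simply restricts each entry to a single address.

```python
import socket

def ipv4_addresses(hostname):
    """Resolve `hostname` to a sorted list of unique IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def sshuttle_command(server, hostname):
    """Build an sshuttle argument list that tunnels only traffic for `hostname`."""
    subnets = [ip + "/32" for ip in ipv4_addresses(hostname)]
    return ["sshuttle", "-r", server] + subnets

# The resulting list could be passed to subprocess.run(), for example:
# subprocess.run(sshuttle_command("<server>", "<hostname>"))
```

Because the addresses are resolved once up front, the tunnel will miss any address the service starts using later.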

mitmproxy

mitmproxy is an SSL-capable man-in-the-middle HTTP proxy that allows you to inspect both HTTP and HTTPS traffic, and rewrite requests on the fly. The application has been behind quite a few different iOS app privacy scandals, including Path's address book upload one. Its ability to rewrite requests on the fly has also been used to target iOS, including setting a fake high score in GameCenter.

Far from only being useful to see what mobile apps are sending over the wire or for faking high scores, mitmproxy can help out with a whole range of web development tasks. For example, instead of constantly hitting F5 or clearing your cache to make sure you're seeing the latest content you can run

$ mitmproxy --anticache

which will automatically strip all cache control headers and make sure you always get fresh content. Unfortunately it doesn't automatically set up forwarding for you like sshuttle, so after starting mitmproxy you still need to change your system-wide or browser-specific proxy settings.

Another extremely handy feature of mitmproxy is the ability to record and replay HTTP interactions. The official documentation gives an example of a wireless network login. Exactly the same technique can be used as a basic web testing framework. For example, to confirm that your user signup flow works you can start recording the session:

$ mitmdump -w user-signup

Then go through the user signup process, which at this point should work as expected. Stop recording the session with Ctrl-c. At any point we can then replay what was recorded and check for the 200 status code:

$ mitmdump -c user-signup | tail -n1 | grep 200 && echo "OK" || echo "FAIL"

If the signup flow gets broken at any point we'll see a FAIL message, rather than an OK. You could create a whole suite of these tests and run them regularly to make sure you get notified if you ever accidentally break anything on your site.
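A suite of such checks could be driven from a small script. The sketch below is one possible shape, assuming mitmdump is on the PATH and accepts the -c flag used above; the function names are hypothetical, and the check is deliberately as loose as the `tail -n1 | grep 200` pipeline.

```python
import subprocess

def replay_output(recording):
    """Replay a recorded session with `mitmdump -c` and return its stdout."""
    result = subprocess.run(["mitmdump", "-c", recording],
                            capture_output=True, text=True)
    return result.stdout

def last_line_has_200(output):
    """Mirror `tail -n1 | grep 200`: look for "200" in the final line."""
    lines = output.splitlines()
    return bool(lines) and "200" in lines[-1]

def run_suite(recordings, replay=replay_output):
    """Map each recording name to "OK" or "FAIL"."""
    return {name: "OK" if last_line_has_200(replay(name)) else "FAIL"
            for name in recordings}

# Example: run_suite(["user-signup"])
```

The replay argument exists so the checking logic can be exercised without mitmdump installed, by passing in a function that returns canned output.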

2 Years of #HNLondon

2012-08-15T00:00:00-07:00

It's hard to believe that it's been 2 years since Dmitri and I started the Hacker News London meetup. We saw some amazing growth in the first year, going from 40 to almost 200 attendees.

Things didn't slow down from there. In the past year we've moved to a bigger venue and grown to around 500 regular attendees, making it one of the largest tech meetups in Europe! We've had some great sponsors, allowing us to ensure that everyone is well fed on Pizza, and suitably drunk on beer (although, sadly, we did once run out of beer!).

We've bagged some fantastic speakers too, including Joel Spolsky, Harjeet Taggar (partner at YC), Eben Upton of Raspberry Pi and others. You can view their talks, and all of the others, on the HNLondon Vimeo page. One of my personal favourites has been Dr Marc Warner's talk Quantum Computing for Hackers.

Sadly June's HNLondon event was the last time I'll personally be involved with the event. With the recent acquisition of Lightbox I'll be moving to the US later this summer to work from the Facebook HQ in Menlo Park. I'm hugely excited about the opportunity, but I'll definitely miss the HNLondon meetups and all of the great people I've met as a result of them.

I'm sure HNLondon will continue to go from strength to strength in Dmitri and Steve's hands, and I look forward to watching the videos online!

EasyDB: Simple SQL Persistence for Python

2012-03-10T00:00:00-08:00

Python provides some great tools for simple data persistence. For key value storage there's anydbm for when you have only string values, and shelve for when you've got more complex values (e.g. lists or objects).

Sometimes key value persistence doesn't cut it. For example, you might want to sort, filter or query your data efficiently. In those situations you can always use Python's built in support for SQLite - a serverless self-contained SQL database engine. With SQLite you've got to worry about database schemas, connections and transactions though. A far cry from simple persistence!
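To make that overhead concrete, here is a minimal sketch using only the standard library sqlite3 module. The table and column names are chosen purely for illustration, and an in-memory database is used so the snippet leaves no file behind.

```python
import sqlite3

# Open (or create) the database; ":memory:" keeps this sketch self-contained.
conn = sqlite3.connect(":memory:")

# The schema has to be declared by hand before anything can be stored.
conn.execute("CREATE TABLE IF NOT EXISTS users (username TEXT, comments TEXT)")

# Using the connection as a context manager commits on success
# and rolls back if an exception is raised.
with conn:
    conn.execute("INSERT INTO users (username, comments) VALUES (?, ?)",
                 ("ben", "Python coder"))

rows = conn.execute("SELECT * FROM users ORDER BY username").fetchall()
print(rows)

conn.close()
```

None of this is difficult, but the schema, the connection and the transaction all have to be handled explicitly.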

Enter EasyDB - essentially a really simple SQLite wrapper that saves you from having to worry about creating tables, managing connections or transactions. Here's a quick example:

from easydb import EasyDB
db = EasyDB('filename.db', {'users': ['username text', 'comments text']})

db.query("INSERT INTO users (username, comments) VALUES (?, ?)", ('ben', 'Python coder'))

for result in db.query("SELECT * FROM users"):
    print(result)
    
# => ('ben', 'Python coder')

If the filename being passed into the EasyDB constructor already exists it'll get reused. Otherwise it'll be created with the provided database structure. Notice that you don't need to worry about the correct SQL syntax for creating tables, you just provide the details as a Python dictionary in the following format (see the SQLite documentation for a list of available types):

{
    'table_name': ['column_name column_type', '…'],
    'table_name': ['column_name column_type', '…'],
}
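As a rough idea of what a wrapper can do with a dictionary in this format, the sketch below turns it into CREATE TABLE statements. This is an assumption about the general approach, not EasyDB's actual code, and create_table_statements is a hypothetical name.

```python
import sqlite3

def create_table_statements(schema):
    """Turn {'table': ['col type', ...]} into CREATE TABLE statements."""
    return ["CREATE TABLE IF NOT EXISTS %s (%s)" % (table, ", ".join(columns))
            for table, columns in schema.items()]

schema = {"users": ["username text", "comments text"]}
statements = create_table_statements(schema)

# Run the generated statements against a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
for statement in statements:
    conn.execute(statement)
conn.close()
```

Table and column names are interpolated straight into the SQL here, so a schema like this should only ever come from your own code, never from user input.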

EasyDB can also be used on an existing SQLite database by simply passing in the database filename:

from easydb import EasyDB
db = EasyDB('filename.db')
results = db.query("SELECT * FROM mytable").fetchall()

EasyDB is available on GitHub at https://github.com/coderholic/easydb, and can also be installed via pip.

My Startup Failed, But It's OK

2012-01-28T00:00:00-08:00

It's now been more than 18 months since I wrote about taking the leap from being an employee to working full time on my startup, Geomium. At the time I was full of mixed emotions. I was hugely excited about the possibilities that lay ahead, and also hugely apprehensive about leaving behind the safety net of employment to go all out on my own. Fame and fortune would obviously await me if it all went right, but what would happen if it all went wrong? I had a wife and a daughter to support, I couldn't let them down! You've got to feel the fear and do it anyway though, right?

Well, we did fail. Geomium had been in a state of limbo since April last year, but I finally pulled the plug in November when I emailed Geomium's 50,000 users to tell them that we were discontinuing the service. The website will live on in a different form, but the Geomium that we'd envisaged back when I wrote the post, the Geomium we'd been trying to achieve for all that time, the Geomium that we hoped we could create is dead.

Things didn't work out for a whole variety of reasons. I'm sure we didn't fail in any particularly new or exciting way, and probably made mistakes that have been made thousands of times before, so I'm not going to dwell on what went wrong or what we could have done differently. What I am going to focus on is two of the biggest takeaways...

Sometimes you have to learn things first hand

Not only were none of the mistakes we made novel, a lot of them were mistakes I'd read about before and was sure I'd avoid. I then went on to make them anyway. I think the reason is that when you read about mistakes that other people have made, or the right way to do things, it's often in an abstract way that won't apply directly to your own situation. It can therefore be difficult to recognise the advice you've read. However, once you've made the mistake yourself it's a lot more concrete, and more easily recognised in the future.

As valuable as reading all the books, reading all the blogs, and doing all your research is, sometimes it's just a case of getting hard earned experience. The only way to do that is to jump in, and get your hands dirty.

I've definitely got many more mistakes to make, and many more lessons to learn, but hopefully there'll be a few that I'll manage to avoid next time.

The worst case scenario isn't as bad as you think

By far the biggest lesson for me was that failure isn't anywhere near as bad as you think it might be. Even when everything goes wrong, it's actually OK.

Instead of ending up destitute and homeless a load of new opportunities came out of the experience. It was at Seedcamp, where Geomium were finalists in January 2011, that I met one of my partners on BusMapper, which is now a successful side project. Another opportunity came out of a VC pitch. We pitched Geomium to a lot of the top VC firms in London, and it was when pitching Index Ventures that I met Thai Tran. I'm now working full time for his startup Lightbox. Had I not taken the leap I'd have missed these opportunities, and others.

Keep On Failing, Keep On Trying

I'm still really disappointed that we couldn't make Geomium a success, but I'm pleased I tried, and I learnt a lot along the way that I'll be able to apply to other projects in the future.

If you're thinking of taking the leap, but worried about what happens if it all goes wrong, don't. Chances are it will go wrong, but you'll learn a lot along the way and expose yourself to even more opportunities. Once you're not afraid to fail there's no stopping you!

If you enjoyed this post please vote for it on Hacker News

Goodbye Wordpress, Hello Jekyll!

2012-01-20T00:00:00-08:00

I've finally given up on Wordpress, and moved this blog over to Jekyll, the "blog aware static site generator". My blog is now 100% static HTML. No database, no dynamic code, and hopefully no downtime the next time one of my posts makes it to the front page of Hacker News!

My primary motivation for the move was the fact that wordpress would die every time my blog got a decent amount of traffic, even with WP-SuperCache enabled. That's far from the only benefit though. I don't have to worry about keeping wordpress updated anymore, or it getting hacked (which has happened a few times). Tom Preston-Werner, GitHub co-founder and Jekyll author, talks about even more benefits in his article Blogging like a Hacker.

Before settling on Jekyll I evaluated a few different static site generators, including rstblog and Hyde. What tipped it in jekyll's favour was the integration with GitHub. Thanks to GitHub pages this blog is now served directly from GitHub, so I don't even need to worry about a server!

Migrating from Wordpress

Paul Stamatiou did a write up of his own experience of moving from wordpress to Jekyll. I took a slightly different approach. Rather than installing Ruby on my server and running the normal migration script I instead took an XML export of my posts from wordpress, and then ran a php script on my local machine.

The script took care of creating files for each of the posts, but those posts still had a load of links to images and scripts that would need to be updated. For me this is where the benefits of having your posts locally as flat files became really clear - with a little bit of command line magic I could update all my links, and download all the linked files!

As I was using wordpress I knew all of the links that would need updating would contain "wp-content", so I could pull them out with grep:

$ cd _posts
$ grep -Eho "[^\"']*?/wp-content/[^\"']*" *
http://www.coderholic.com/wp-content/uploads/google-index.png
http://www.coderholic.com/wp-content/uploads/twitter.png
http://www.coderholic.com/wp-content/uploads/technorati.png
…

I output these to a file, and then downloaded them all with wget:

$ mkdir files
$ cd files
$ cat files | xargs wget

Next I updated the URLs, converting something like http://www.coderholic.com/wp-content/uploads/google-index.png to /files/google-index.png:

find . -type f -exec sed -i -e "s/[^\"']*\/wp-content\/[^\"']*\/\([^\"']*\)/\/files\/\1/g" {} \;
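The grep and sed patterns above translate almost directly into Python's re module, which can be handy for trying them out on a single string first. This is a sketch rather than the script I actually ran; the function names are made up for illustration.

```python
import re

# Same idea as the grep pattern: a run of non-quote characters containing /wp-content/
WP_URL = re.compile(r"""[^"']*/wp-content/[^"']*""")
# Same idea as the sed expression: capture only the final path segment
WP_URL_WITH_NAME = re.compile(r"""[^"']*/wp-content/[^"']*/([^"']*)""")

def find_wp_urls(text):
    """Return every quoted-attribute style URL that points into wp-content."""
    return WP_URL.findall(text)

def rewrite_wp_urls(text):
    """Replace each wp-content URL with /files/<filename>."""
    return WP_URL_WITH_NAME.sub(r"/files/\1", text)

post = '<img src="http://www.coderholic.com/wp-content/uploads/google-index.png">'
print(find_wp_urls(post))
print(rewrite_wp_urls(post))
```

Unlike grep and sed, which work line by line, these patterns can match across newlines, so the sketch assumes the URLs sit inside quoted attributes as they do in the exported posts.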

All that was left to do then was to push it to GitHub, and let them take care of the rest!

My OSX Setup

2011-10-09T00:00:00-07:00

After over 10 years of Linux I recently switched to OSX on a MacBook Air. I can't say enough good things about the Air hardware, but it's taken me a few months to get completely comfortable with the operating system. Here are the apps I've settled on, which I think make for a fairly awesome OSX setup:

Alfred

Alfred is a powerful launcher, which allows you to start any application simply by pressing Alt-Space and then typing the application name. Alfred also has an inbuilt calculator, which is really handy, and can be used to quickly search google too.

Chrome

I did try Safari for a while but came back to Chrome, my browser of choice. It's fast, has a clean UI, and more and more great extensions are available all the time. My current extensions: Rapportive, Pretty Beautiful Javascript, YSlow, GitHub Inbox, and Screen Capture.

Cyberduck

Cyberduck is a handy little file transfer app that supports FTP, SCP, WebDAV and a load of cloud storage services such as Amazon's S3. It integrates well with Finder, so you can simply drag and drop files between computers.

Fluid

A lot of the applications that I use these days are actually web apps. For the ones I use most frequently (Google Calendar and WorkFlowy) I use Fluid to turn them into desktop-like apps in their own window, which makes them easier to access.

Homebrew

A package manager for OSX that makes it as easy as brew install <package> to install almost any UNIX package you can think of. For those it doesn't support it's relatively easy to add support yourself. Here are the apps installed with brew so far:

$ brew list
ack             fortune         jpeg            ngrep           pv              sqlite
android-sdk     gdbm            libevent        nmap            readline        unrar
cmake           git             libmemcached    pidof           redis           watch
ctags           graphviz        macvim          pil             siege           wget
curl            htop            memcached       pkg-config      solr

iTerm2

OSX ships with a terminal application, but iTerm comes with such a host of advanced features that it became my terminal app of choice. Some of those features include 256 colour support and allowing scrollback within screen. The killer feature for me though is the system wide hotkey. I've set it up so that whenever I press F12 iTerm drops down from the top of the screen, quake style, and I'm good to go.

MacVim

On Linux I used Vim within a terminal, but I'm really liking MacVim on OSX. It integrates really nicely with the system clipboard and is customisable enough that you can get most of the default GUI elements out of the way. I had to make some minor adjustments to my existing vimrc but almost everything I had working on vim under linux, including all of the plugins, just worked.

VLC

I find iTunes a bit too heavyweight for most of my media needs, and it also crashes a lot! I used VLC a lot on Linux, and the Mac client is just as good. I've yet to find an audio or video format it doesn't support.

Spaces

Initially the hardest part of the transition to OSX was the lack of a good virtual desktop manager. With Linux I'd constantly be changing virtual desktops, and moving windows around with the various keyboard shortcuts. OSX's manager, Spaces, required me to move windows around with the mouse though! It wasn't until I discovered the "always open this app on this desktop" feature that I really began to like Spaces. I have chrome always open on its own desktop, vim on another, workflowy on another, and everything else on another one again. I haven't upgraded to Lion yet, and I hear that Spaces has been replaced. Hopefully I'll get on with the replacement just as well.

What am I missing?

Let me know what great OSX apps I'm missing out on!

How I got the Turntable.fm Gorilla in less than 48 hours

2011-09-13T00:00:00-07:00

Turntable.fm has taken off in a big way. Launching in May, over 140,000 users signed up in the first month. Celebrities regularly hang out there, and Lady GaGa and Kanye West are even investing in the site.

For those that haven't tried it out yet (or perhaps can't, more on that shortly) turntable is somewhere that you can listen to great music and discover new artists and songs that you might not otherwise have come across.

The site features a number of rooms, each with a different theme (eg. chillout, dubstep, indie) and in each room there are up to 5 DJs. Every one in a room can "awesome" or "lame" each song that is played.

If you're the DJ you get a point for every awesome you receive. The more points you get the better the avatar you can select. The most prized avatar on the site, a gorilla, requires 1000 points. Currently that's something that has only been achieved by 4000 of the site's users, and something that has taken most of them weeks or even months of effort. Here's how I got the gorilla in less than 48 hours...

Getting in

Due to some licensing issues Turntable has been unavailable to anyone outside the US since the end of June. I'm based in the UK, so the first challenge I faced was simply getting into the site. I'd need to make it appear as though I was in the US. The sshuttle app makes it extremely easy to do exactly that. You just need a host in the target country (fortunately this blog is hosted in the US, so that's what I used), and the IP address of the target site. It gets a little more complicated if the target site has several IP addresses, but a quick check with dig shows that turntable currently has just the one:

$ dig -tA +short turntable.fm
50.16.229.9

Tunnelling all traffic to this IP via my US based server is as simple as running this command:

sshuttle -r coderholic.com 50.16.229.9/32

It'll now appear to turntable that I'm in the US, and I can log in to the site using my Facebook account.

Getting 1000 points

In popular rooms it can be really difficult to get a DJ spot. Even if you manage to get one it's not easy to pick songs that get a lot of points, and it can be hard to keep your spot. That's why it usually takes weeks if not months of effort to get the 1000 points required for the gorilla. My plan was to automate the process as much as possible.

There are already lots of scripts and plugins available for Turntable. I started digging into the code of frankielaguna's Auto-Awesome bookmarklet to see what was going on under the hood. The code starts with this:

//Attempt to find the room manager object
for (var prop in window) { 
    if (window.hasOwnProperty(prop) && window[prop] instanceof roommanager){ 
        ttObj = window[prop];
        break;
    } 
}

The "room manager object" sounded very interesting! Pasting the above code into the Chrome javascript console showed right away how much interesting stuff there really is in this object:

It contains details of who's DJing, how many DJ slots there are, callbacks for when you get points, and lots lots more. Certainly everything I would need was there.

My plan was to create a new room and login with 2 accounts, one my actual account, and one fake account. I'd have both the accounts DJ and automatically awesome each other. To speed the process up I made some changes to the auto-awesome code so that it would "awesome" every 5 seconds, and so that the current DJ would skip the rest of their song as soon as they received a point. I noticed a set_dj_points method in the room manager object, and overrode it like so:

// Override the set_dj_points function, so we skip to the next DJ as soon as we get a point
var set_dj_points = ttObj['set_dj_points'];
ttObj['set_dj_points'] = function(j) {
    if(ttObj.myuserid == ttObj.current_dj[0]) {
        console.log("I've got more points:", j);
        // We're done - skip to the next DJ
        ttObj.callback('stop_song');
    }
    set_dj_points(j);
}

After setting things in motion I discovered that turntable requires a variable amount of the song to be played before an awesome is actually counted, so it would usually take longer than 5 seconds for the current DJ to get a point. Not a huge problem, but it meant things would take longer than expected.

The next problem I ran into was a little more serious. After each DJ had played 40 songs they got kicked off. I could manually make them a DJ again, but that'd require me to check the site every so often. Instead I updated the script so that every 5 second iteration we check to see if we're the DJ, and if not then become one:

// Check to see if we're in the DJ queue or not
// (the parentheses matter: "!a in b" is evaluated as "(!a) in b")
if(!(ttObj.myuserid in ttObj.djs_uid)) {
    // We're not!! Become a DJ
    ttObj.callback('become_dj');
}

The next problem that I ran into was that turntable limits the number of awesomes you can get from a single user to 50! Therefore I had to sign up for more fake Facebook accounts and get them in on the act. Rather than signing up for one new account at a time I instead created several accounts and got them all into the room at the same time. I created a slightly different script for these other accounts, so that they'd just awesome the song rather than DJ. I also modified the DJ script to DJ for a fixed amount of time, rather than give up DJing after the first awesome. That way I could collect a few points for each song play.

Less than 48 hours and a load of fake facebook accounts later I'd managed to get the required 1000 points. The gorilla was mine! The complete code that I used is up on GitHub: https://github.com/coderholic/turntable.fm

Preventing it

I went through this process mostly out of curiosity. It's clear that not many people are employing the same kind of tactics, with only around 4000 users of the site having obtained the gorilla. It'd certainly be a problem for turntable if everyone started doing this though, so what could they do to prevent it?

It seems as though turntable are already doing quite a bit to make it difficult, by limiting the number of awesomes received from each profile, requiring the song to be played for a certain amount of time, and booting DJs off after they've played a certain number of songs. There's certainly more that they could do. For example, they could make it harder to get hold of the room manager object. Within the room manager object itself they could ignore any actions unless the browser has focus.

Ultimately the client code must communicate with the Turntable server though, so any client side changes would only make things harder. They wouldn't actually prevent anything. In fact Alain Gilbert has put together a node.js based Turntable client that does exactly that.

Comment or vote at Hacker News

Invaluable command line tools for web developers

2011-08-13T00:00:00-07:00

Life as a web developer can be hard when things start going wrong. The problem could be in any number of places. Is there a problem with the request you're sending, is the problem with the response, is there a problem with a request in a third party library you're using, is an external API failing? There are lots of different tools that can make our life a little bit easier. Here are some command line tools that I've found to be invaluable.

Curl

Curl is a network transfer tool that's very similar to wget, the main difference being that by default wget saves to file, and curl outputs to the command line. This makes it really simple to see the contents of a website. Here, for example, we can get our current IP from the ipinfo.io website:

$ curl ipinfo.io/ip
93.96.141.93

Curl's -i (show headers) and -I (show only headers) options make it a great tool for debugging HTTP responses and finding out exactly what a server is sending to you:

$ curl -I news.ycombinator.com
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Cache-Control: private
Connection: close
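If you want the same header check inside a script, a HEAD request with Python's standard http.client module gets you most of the way. This is a sketch, not a curl replacement: fetch_headers is a made-up name, the URL must include a scheme, and redirects are not followed (much like curl -I without -L).

```python
import http.client
from urllib.parse import urlsplit

def fetch_headers(url):
    """Send a HEAD request; return (status, reason, list of header pairs)."""
    parts = urlsplit(url)
    conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
    # A port of None makes http.client fall back to the scheme's default port
    conn = conn_class(parts.hostname, parts.port, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.reason, response.getheaders()
    finally:
        conn.close()

# Example: status, reason, headers = fetch_headers("http://news.ycombinator.com/")
```

The headers come back as a list of (name, value) pairs rather than a dictionary, so repeated headers such as Set-Cookie are preserved.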

The -L option is handy, and makes Curl automatically follow redirects. Curl has support for HTTP Basic Auth, cookies, manually setting headers, and much much more.

Siege

Siege is a HTTP benchmarking tool. In addition to the load testing features it has a handy -g option that is very similar to curl -iL except it also shows you the request headers. Here's an example with www.google.com (I've removed some headers for brevity):

$ siege -g www.google.com
GET / HTTP/1.1
Host: www.google.com
User-Agent: JoeDog/1.00 [en] (X11; I; Siege 2.70)
Connection: close

HTTP/1.1 302 Found
Location: http://www.google.co.uk/
Content-Type: text/html; charset=UTF-8
Server: gws
Content-Length: 221
Connection: close

GET / HTTP/1.1
Host: www.google.co.uk
User-Agent: JoeDog/1.00 [en] (X11; I; Siege 2.70)
Connection: close

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
X-XSS-Protection: 1; mode=block
Connection: close

What siege is really great at is server load testing. Just like ab (apache benchmark tool) you can send a number of concurrent requests to a site, and see how it handles the traffic. With the following command we test google with 20 concurrent connections for 30 seconds, and then get a nice report at the end:

$ siege -c20 www.google.co.uk -b -t30s
...
Lifting the server siege...      done.
Transactions:                    1400 hits
Availability:                 100.00 %
Elapsed time:                  29.22 secs
Data transferred:              13.32 MB
Response time:                  0.41 secs
Transaction rate:              47.91 trans/sec
Throughput:                     0.46 MB/sec
Concurrency:                   19.53
Successful transactions:        1400
Failed transactions:               0
Longest transaction:            4.08
Shortest transaction:           0.08

One of the most useful features of siege is that it can take a url file as input, and hit those urls rather than just a single page. This is great for load testing, because you can replay real traffic against your site and see how it performs, rather than just hitting the same URL again and again. Here's how you would use siege to replay your apache logs against another server to load test it with:

$ cut -d ' ' -f7 /var/log/apache2/access.log > urls.txt
$ siege -c<concurrency rate> -b -f urls.txt
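If you need to filter or deduplicate the URLs before handing them to siege, the cut step is easy to reproduce in Python. This is a sketch: request_paths is a hypothetical helper, the sample log line is made up, and it assumes the request path is the seventh space-separated field, as it is in Apache's common and combined log formats.

```python
def request_paths(log_lines):
    """Yield the 7th space-separated field of each line, like `cut -d ' ' -f7`."""
    for line in log_lines:
        fields = line.rstrip("\n").split(" ")
        # Unlike cut, skip lines that are too short to contain a request path
        if len(fields) >= 7:
            yield fields[6]

# A made-up log line in Apache's common log format
sample = ['127.0.0.1 - - [13/Aug/2011:10:00:00 +0100] "GET /about HTTP/1.1" 200 512']
print(list(request_paths(sample)))

# With a real log:
# with open("/var/log/apache2/access.log") as log:
#     paths = list(request_paths(log))
```

From here it is a small step to drop static assets or keep only GET requests before writing urls.txt.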

Ngrep

For serious network packet analysis there's Wireshark, with its thousands of settings, filters and different configuration options. There's also a command line version, tshark. For simple tasks I find wireshark can be overkill, so unless I need something more powerful, ngrep is my tool of choice. It allows you to do with network packets what grep does with files.

For web traffic you almost always want the -W byline option which preserves linebreaks, and -q is a useful argument which suppresses some additional output about non-matching packets. Here's an example that captures all packets that contain GET or POST:

ngrep -q -W byline "^(GET|POST) .*"

You can also pass in additional packet filter options, such as limiting the matched packets to a certain host, IP or port. Here we filter all traffic going to or coming from google.com, port 80, and that contains the term "search".

ngrep -q -W byline "search" host www.google.com and port 80

Hacker News London Meetup: A Year On

2011-05-29T00:00:00-07:00

Almost exactly a year ago I posted a comment to Hacker News about organizing a London meetup. Soon after that I met up with Dmitri, and over a beer we came up with a plan for the Hacker News London meetup. The first event was a lot of fun, with around 40 London based hackers turning up to a pub to drink beer and chat about what they were all hacking on.

We've had 7 more meetups since then, and in that time we've grown from 40 hackers to almost 200! We've changed the format slightly, starting the evening with 8 short 5 minute talks, and we've managed to bag a few sponsors who make sure everyone who attends is well fed with pizza, and never short of a beer. We're still toying with the format a bit, and at the most recent meetup we had a panel of YC alumni (Pete Smith and Phil Cowans from SongKick, Josh Buckley from MinoMonsters, and Colin Beattie from Tuxebo - see the picture below) that I think worked well.

The meetups are a great opportunity to meet like-minded people, discover new opportunities, and relax over a few beers! Our next event is going to be on June 23rd, so if you're a Hacker News reader, hacker, or simply interested in technology and startups and not too far from London then sign up on our meetup page and come along! If you're not in London then see if there's a HN meetup near you, and if not start one!

Scraping the web with Node.io

2011-04-15T00:00:00-07:00

Node.io is a relatively new screen scraping framework that allows you to easily scrape data from websites using Javascript, a language that I think is perfectly suited to the task. It's built on top of Node.js, but you don't need to know any Node.js to get started, and can run your node.io jobs straight from the command line.

The existing documentation is pretty good, and includes a few detailed examples, such as the one below that returns the number of google search results for some given keywords:

var nodeio = require('node.io');
var options = {timeout: 10};

exports.job = new nodeio.Job(options, {
    input: ['hello', 'foobar','weather'],
    run: function (keyword) {
        var self = this, results;
        this.getHtml('http://www.google.com/search?q=' + encodeURIComponent(keyword), function (err, $) {
            results = $('#resultStats').text.toLowerCase();
            self.emit(keyword + ' has ' + results);
        });
    }
});

Running this from the command line gives you the following output:

$ node.io google.js 
hello has about 878,000,000 results
foobar has about 2,630,000 results
weather has about 719,000,000 results
OK: Job complete

Scraping Multiple Pages

Unfortunately some of the documentation simply says "coming soon", so you're left to guess the best way to put together more advanced scraping workflows. For example, I wanted to scrape the search results from GitHub. If you search for "django" then you (currently) get 6067 results spread over 203 pages.

What I could figure out from the documentation is that a node.io job passes through several stages: input, run, reduce, and output. The documentation also mentions that multiple invocations of the run method can be run in parallel, so the logical thing to do seems to be to pass in the page number to run, and have it scrape the results from a single page. You can then scrape lots of different pages in parallel.
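The one-page-per-run split can be sketched in plain JavaScript, independent of node.io. This is only an illustration: the function names are made up for this example, and the page size of 30 is an assumption inferred from 6067 results being spread over 203 pages.

```javascript
// Assumption: 30 results per page, inferred from 6067 results over 203 pages.
var RESULTS_PER_PAGE = 30;

// Number of pages needed to show every result.
function pageCount(totalResults) {
    return Math.ceil(totalResults / RESULTS_PER_PAGE);
}

// Build the list of page numbers. Each entry is an independent unit of
// work, so each one could be handed to its own invocation of run.
function pageNumbers(totalResults) {
    var pages = [];
    var last = pageCount(totalResults);
    for (var i = 1; i <= last; i++) {
        pages.push(i);
    }
    return pages;
}

var pages = pageNumbers(6067);
console.log(pages.length);                       // 203
console.log(pages[0], pages[pages.length - 1]);  // 1 203
```

Because no page depends on any other, the order in which the pages are scraped doesn't matter, which is what makes running them in parallel safe.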

To calculate the total number of pages, and pass the page numbers to the run method, I implemented an input method. There's not much documentation on this, but the key thing is to make sure it returns false once you're done, otherwise it'll keep getting called again and again. The other key thing is that you need to pass your data to the run method via the callback function, and it needs to be wrapped in an array. Here's the complete GitHub search results scraper:

var nodeio = require('node.io');
exports.job = new nodeio.Job({benchmark: true, max: 50}, {
    input: function(start, num, callback) {
        if(start !== 0) return false; // We only want the input method to run once
        var self = this;

        this.getHtml('https://github.com/search?type=Repositories&language=python&q=django&repo=&langOverride=&x=0&y=0&start_value=1', function(err, $) {
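            if (err) return self.exit(err); // Stop here, $ is not usable after an error
            var total_pages = parseInt($('.pager_link').last().text, 10);
            for(var i = 1; i <= total_pages; i++) { // <= so the last page is included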
                callback([i]); // The page number will be passed to the run method
            }
            callback(null, false);
        });
    }, 
    run: function(page_number) {
        var self = this;
        this.getHtml('https://github.com/search?type=Repositories&language=python&q=django&repo=&langOverride=&x=0&y=0&start_value=' + page_number, function(err, $) {
            if (err) {
                console.log("ERROR", err);
                self.retry();
            }
            else {
                $('.result').each(function(listing) {
                    var project = {};
                    var title = $('h2 a', listing).fulltext;
                    project.author = title.substring(0, title.indexOf(" / "));
                    project.title = title.substring(title.indexOf(" / ") + 3);
                    project.link = "https://github.com" + $('h2 a', listing).attribs.href; 
                    var language = $('.language', listing).fulltext;
                    project.language = language.substring(1, language.length - 1); // Strip off leading and trailing brackets
                    project.description = $('.description', listing).fulltext;
                    self.emit(project);
                });
            }
        });
    }
});
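The string handling inside run can be pulled out into small helpers, which makes it easier to check on its own. The sketch below is plain JavaScript with hypothetical helper names, and the sample strings are made up for illustration; it assumes titles look like "author / title" and that the language text is wrapped in brackets, as the scraper above does.

```javascript
// Hypothetical helper: split "author / title" into its two parts.
function parseTitle(fullTitle) {
    var separator = ' / ';
    var index = fullTitle.indexOf(separator);
    if (index === -1) {
        // No separator found: treat the whole string as the title.
        return { author: '', title: fullTitle };
    }
    return {
        author: fullTitle.substring(0, index),
        title: fullTitle.substring(index + separator.length)
    };
}

// Hypothetical helper: drop the first and last character, e.g. "(Python)".
function stripBrackets(text) {
    return text.substring(1, text.length - 1);
}

var parsed = parseTitle('someuser / some-project'); // made-up sample value
console.log(parsed.author);             // someuser
console.log(parsed.title);              // some-project
console.log(stripBrackets('(Python)')); // Python
```

Unlike the inline version, parseTitle handles a title with no separator, where indexOf returns -1 and the substring offsets would otherwise be wrong.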

While my solution works, I'm sure it's not optimal. By implementing an input method there's no way to specify a search term from the command line, which is far from ideal. Hopefully I'll be able to improve the scraper once some additional documentation is written, or after I've dug through the node.io code some more.
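One step towards a configurable search term is to stop hard-coding the URL. The sketch below is a plain Node.js script rather than a node.io job: searchUrl is a hypothetical helper that rebuilds the URL used above, and reading the term from process.argv is an assumption about how you might supply it, not something node.io is documented here to do.

```javascript
// Hypothetical helper: rebuild the GitHub search URL used in the scraper
// for any query, language and page number.
function searchUrl(query, language, page) {
    return 'https://github.com/search?type=Repositories' +
        '&language=' + encodeURIComponent(language) +
        '&q=' + encodeURIComponent(query) +
        '&repo=&langOverride=&x=0&y=0' +
        '&start_value=' + page;
}

// Take the search term from the command line, falling back to "django".
var query = process.argv[2] || 'django';
console.log(searchUrl(query, 'python', 1));
```

With this in place the input and run methods could both call searchUrl instead of repeating the full URL, leaving only the question of how to get the search term into the job.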

There's lots more that node.io can do. It has built-in functions to do things like calculate the pagerank of a domain and resolve domain names to IPs, along with lots of other useful utilities. Like Node.js it also has full support for CoffeeScript. It's a fantastic tool to have in your toolbox!