mgutz.com Syndication

Shingles

2011-08-25T22:00:00Z

At Triposo we have a database of restaurants that we've collected from various open content sources. When we display this list in our mobile travel guides, one of the key pieces of information for our users is which cuisine the restaurant serves. We often have this available, but not all the time.

Screenshot from one of our Android travel guides

Sometimes you can guess the cuisine just by looking at the name:

"Sea Palace" - probably Chinese.
"Bavarian Biercafe" - probably German.
"Athena" - probably Greek.

A useful technique you will run into as soon as you start working with machine learning is popularly known as shingles (referring to roof shingles, not the disease).

Roof shingles

In machine learning, shingles are the set of overlapping character n-grams produced from a string of characters. It's common to add a start and an end indicator to the string so that characters at the start and end are treated specially. Hopefully this diagram explains how to produce the shingles from a string.

Shingles are easily generated with a one-liner Python list comprehension (n is the size of each n-gram, n=4 often works well):

[word[i:i + n] for i in range(len(word) - n + 1)]
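The start and end indicators fit into the same comprehension. Here is a minimal sketch, assuming '{' as the start marker (it appears in the '{ath' 4-gram further down), '}' as the end marker and lowercased input. The end marker, the lowercasing and the function name `shingles` are assumptions made for illustration.

```python
def shingles(word, n=4, start="{", end="}"):
    """Return the overlapping character n-grams of word.

    The start/end markers and the lowercasing are illustrative
    assumptions, not necessarily what the original pipeline does.
    """
    padded = start + word.lower() + end
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(shingles("Athena"))
# ['{ath', 'athe', 'then', 'hena', 'ena}']
```

With the markers in place, '{ath' only matches names that begin with "ath", which is what makes the start of a name count differently from its middle.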

Shingles are great when you want a measure of how similar two strings of characters are: the more n-grams they share, the more similar the strings are. Because n-grams can be looked up in a hash table instead of a slightly more expensive tree, this is an important tool when you're working with really large quantities of data. At Google O(1) rules.

Let's use shingles to guess the cuisine of a restaurant given its name.

First of all, let's index all the restaurant names where we know the cuisine. We want to know how likely it is that a name containing a particular 4-gram belongs to a particular cuisine.

We start with a hash map from 4-gram to observed cuisine probability (in Python collections.Counter is just great for this). We iterate through all the restaurants and for every 4-gram shingle we bump up that cuisine in the corresponding probability distribution.

Right now we have a mapping from each 4-gram to a number of observations, but we want to work with probabilities; the raw number of observations is irrelevant. So let's also normalize each histogram. This prevents common cuisines like American and Chinese from scoring really highly for common n-grams. (This was a problem I initially had.)
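The counting and normalizing steps could be sketched roughly like this. It reads "normalize each histogram" as dividing every 4-gram's counts by that 4-gram's total, which is one plausible interpretation rather than a confirmed detail. The training data, the '}' end marker and the names `shingles` and `build_model` are made up for the example.

```python
from collections import Counter, defaultdict


def shingles(word, n=4, start="{", end="}"):
    # Assumed padding markers and lowercasing, for illustration only.
    padded = start + word.lower() + end
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_model(restaurants, n=4):
    """Map each n-gram to a normalized distribution over cuisines."""
    counts = defaultdict(Counter)
    for name, cuisine in restaurants:
        for gram in shingles(name, n):
            counts[gram][cuisine] += 1  # bump this cuisine for this n-gram

    model = {}
    for gram, histogram in counts.items():
        total = sum(histogram.values())
        # Turn raw observation counts into probabilities that sum to 1.
        model[gram] = {c: k / total for c, k in histogram.items()}
    return model


# Hypothetical training data, not taken from the real database.
restaurants = [
    ("Athena", "greek"),
    ("Athens Grill", "greek"),
    ("Athletic Club Diner", "american"),
    ("Bavarian Biercafe", "german"),
]
model = build_model(restaurants)
print(model["{ath"])
```

In this toy data '{ath' is seen twice for Greek and once for American, so its distribution comes out as two thirds Greek and one third American.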

So our model looks something like this:

'bier' => {'german': 0.37, 'currywurst': 0.06, ...}
'{ath' => {'greek': 0.68, 'american': 0.04, 'pizza': 0.04, ...}
...

Then, to guess the cuisine of a restaurant name, we produce its 4-gram shingles and add together the cuisine probability distributions of each 4-gram.

lookup('bavarian biercafe')
[('german', 2.28), ('italian', 1.17), ('cafe coffee shop', 0.95)]
lookup('athena')
[('greek', 2.64), ('pizza', 0.37), ('italian', 0.36)]
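A lookup along these lines might look like the sketch below. The toy model holds only the two entries quoted above, without their elided tails, so its scores are far smaller than the real ones. The explicit `model` argument, the `top` parameter, the '}' end marker and the helper `shingles` are assumptions added to keep the example self-contained.

```python
from collections import Counter


def shingles(word, n=4, start="{", end="}"):
    # Assumed padding markers and lowercasing, for illustration only.
    padded = start + word.lower() + end
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def lookup(model, name, n=4, top=3):
    """Sum the cuisine distributions of every known n-gram in name."""
    scores = Counter()
    for gram in shingles(name, n):
        # n-grams never seen during indexing simply contribute nothing.
        for cuisine, probability in model.get(gram, {}).items():
            scores[cuisine] += probability
    return scores.most_common(top)


# Toy model: just the two partial entries shown in the text.
model = {
    "bier": {"german": 0.37, "currywurst": 0.06},
    "{ath": {"greek": 0.68, "american": 0.04, "pizza": 0.04},
}

print(lookup(model, "athena"))
print(lookup(model, "bavarian biercafe"))
```

Each 4-gram costs one hash lookup, so the work for a name is proportional to its length.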

The highest-scoring cuisine is our best guess, and the bigger its lead over the second-highest score, the more confident we are.
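One simple way to turn that lead into a decision is to accept a guess only when the gap between the top two scores clears a threshold. This is only a sketch: the function name `best_guess` and the `min_margin` value of 1.0 are hypothetical, not tuned numbers from the pipeline.

```python
def best_guess(ranked, min_margin=1.0):
    """Return (cuisine or None, margin) for a ranked list of (cuisine, score).

    min_margin is an arbitrary illustrative threshold.
    """
    if not ranked:
        return None, 0.0
    best, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    margin = best_score - runner_up_score
    return (best if margin >= min_margin else None), margin


# Scores copied from the two lookups shown above.
biercafe = [("german", 2.28), ("italian", 1.17), ("cafe coffee shop", 0.95)]
athena = [("greek", 2.64), ("pizza", 0.37), ("italian", 0.36)]

print(best_guess(biercafe))
print(best_guess(athena))
```

With these numbers "athena" leads by about 2.27 and "bavarian biercafe" by about 1.11, so both pass a threshold of 1.0, but the Greek guess is the more confident one.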

The neat thing about this technique is not only that its cost grows linearly with the size of the training set and the size of the inputs; it's also really easy to parallelize. This is why it's so popular when you're working at "web scale".
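The parallelism comes from the fact that counts are just sums: each worker can count its own slice of the data, and the partial results are merged by adding them up. A small sketch of that idea follows, run sequentially here for simplicity. The shard contents, the '}' end marker and the function names are invented for the example.

```python
from collections import Counter, defaultdict


def shingles(word, n=4, start="{", end="}"):
    # Assumed padding markers and lowercasing, for illustration only.
    padded = start + word.lower() + end
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def count_shard(restaurants, n=4):
    """Count n-gram/cuisine observations for one slice of the data."""
    counts = defaultdict(Counter)
    for name, cuisine in restaurants:
        for gram in shingles(name, n):
            counts[gram][cuisine] += 1
    return counts


def merge_counts(partials):
    """Add up per-shard counts; the order of the shards does not matter."""
    merged = defaultdict(Counter)
    for partial in partials:
        for gram, histogram in partial.items():
            merged[gram].update(histogram)
    return merged


# Hypothetical shards; in practice each could be counted on its own machine.
shard_a = [("Athena", "greek"), ("Sea Palace", "chinese")]
shard_b = [("Athens Grill", "greek"), ("Bavarian Biercafe", "german")]

merged = merge_counts([count_shard(shard_a), count_shard(shard_b)])
whole = count_shard(shard_a + shard_b)
print(merged == whole)
```

Normalizing into probabilities then happens once, after the merge.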

So, should we use this in our guides? I'm not entirely convinced. For high-confidence guesses we are usually correct, but a human is also very good at guessing the cuisine. That said, it provided a nice little example of the type of processing we do in our pipeline.

Not a bubble

2011-06-01T22:00:00Z

I'll probably regret arguing against The Economist, but I have to say I agree with Marc Andreessen here. There isn't a bubble as such; it's just a simple matter of Economics 101: price is defined by supply and demand. There's a big demand for investing in the new "social media" technology companies but there's almost no supply. High demand + low supply = high price.

But don't get me wrong here: LinkedIn, trading at several thousand times earnings, is still overvalued. Facebook is trading at several hundred times earnings, but chances are still high they will "pop" at the IPO and probably increase in value several times. Does that mean I would invest my own money in these companies? Hell no, I'm coming in way too late. All the capital that wants to invest in this strong new technology trend needs to go somewhere, and so far LinkedIn is the only stock it can go into. At some point more companies will go on the market and it will get easier to invest; supply will start to meet demand and the valuations of these companies will start going down.

What about all these startups with crazy valuations? Same thing. There is no big company there to mop up all the surplus capital, which means that early investment rounds are getting oversubscribed. Startups can ask for higher valuations and still get the capital they need.

I still don't think this is a bubble. A few overvalued companies are not enough to create a bubble in the same sense as the one in the late 90s (I was there). Those sorts of bubbles have a very significant impact on the world economy. I doubt this will happen now.

In fact, I am investing myself. I left a huge salary and a lot of unvested shares on the table when I left Google to start Triposo. I have no salary and no idea whether it's going to work out.

Why invest? I think we are at an inflection point in this industry where a lot of interesting things will happen. It's not only about technological developments: what we're seeing now is mainstream adoption of things that have been around for quite some time. At Triposo we're primarily betting on powerful mobile devices and high-quality open content. As technologies these things aren't exactly new, but they have significantly disrupted the industry by going mainstream (via the iPhone and Wikipedia).

Leaving Google

2011-04-28T22:00:00Z

Today is my last day at Google.

I joined Google about 3 years ago and most of that time I worked as one of the tech leads of Google Wave. It was fun, exciting and slightly crazy. We went from being the most hyped product on the web to slightly ridiculed. Then we were cancelled...

After Wave I worked for a little while on the Maps API until my 20% project Shared Spaces took off and became my 100% project. It was fun for a little while but we've now transitioned it to another team.

Douwe and I had been playing around with an idea for a startup for a little while, and we felt it was probably as good a time as any to have a go at it. Together with Douwe's two brothers we're starting Triposo. We're building travel guides for mobile phones. We crawl open content on the web, merge, score and generally massage it in a pipeline (currently running on the rather beefy workstations under our desks at home), then publish it to iPad, iPhone and Android (Windows Phone 7 to come). Here is, for example, our Barcelona guide in the App Store and in the Market. They're not bad actually, and we've barely started!

I'm also doing another startup. My son Samir was born last Friday (he's a week old today!), so here's the mandatory new-parent gratuitous baby shot: