HarveyNick.com

Suddenly, a Wild “Now” Page Appears

Nick Johnson — Tue, 07 Aug 2018 20:00:29 GMT

There’s a new page in the menu of this site. First of all I’ll make an admission: the idea is shamelessly stolen from Jacoby Young. Hopefully its meaning is fairly self-explanatory. It’s mostly there for my own self-reflection, which I’ve decided to do in public, apparently.

My working plan is that I’ll update this page about once a month. I suspect that the favourite podcasts list will stay mostly unchanged over time, but I can see almost everything else changing on a monthly basis.

Something else: right now it says I’m reading three books. That’s mostly because I found Yoon Ha Lee’s Ninefox Gambit hard enough going that I needed to take a break after the first few chapters. That book is work.

A Swing and a Miss: Trying to Reduce English Uncertainty in IMDB Review Classification

Nick Johnson — Sun, 22 Jul 2018 18:15:14 GMT

When I was following the natural language processing / recurrent neural networks^[1] section of Andrew Ng’s Deep Learning Specialisation, there was a detail which bothered me. Now, having reached the same subject area in fast.ai’s own deep learning course, the same detail is bothering me again. The structure of the fast.ai course is much more open ended, so I could indulge myself and try to figure it out.

In case the title wasn’t a big enough giveaway: I was largely unsuccessful. But I’m not convinced I’m wrong, and the specifics of my failure suggest that it might be worth pulling on this thread a little more. Also: I think it’s important to publish negative results as well as positive ones.

Before I get to the detail in question: some background.

How Neural Networks Understand Natural Language

In short: they don’t. They understand numbers; or lists of numbers; or lists of lists of numbers; and so on.

The simplest way of converting words into a form which can be understood by a neural network is called a “one hot encoding”. Let’s say all of the text you care about uses only the 1000 most common English words, as per Randall Munroe’s Thing Explainer. Then each word is represented by a list of 1000 numbers, exactly one of which is 1, whilst the rest are 0. If “a” is the first word, then it will be represented by a 1 followed by 999 0s.

This has the benefit of being simple. However, it has the disadvantage that it can’t generalise. A system trained against one hot encodings can only ever understand the words it was originally trained against. Furthermore, it has to explicitly learn their relationship to each other.
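As a minimal sketch of the idea, here is a one hot encoding over a toy five word vocabulary. The vocabulary and the helper name one_hot are made up for illustration. It also shows the generalisation problem: a word outside the vocabulary has no encoding at all.

```python
# Toy vocabulary, standing in for the 1000 most common words.
vocab = ["a", "film", "nice", "still", "the"]
word_to_idx = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    if word not in word_to_idx:
        # There is no sensible encoding for an unseen word.
        raise KeyError(f"{word!r} is not in the vocabulary")
    vec = [0] * len(vocab)
    vec[word_to_idx[word]] = 1
    return vec

print(one_hot("a"))     # [1, 0, 0, 0, 0]
print(one_hot("nice"))  # [0, 0, 1, 0, 0]
```

Calling one_hot("pleasant") raises a KeyError, even though "nice" is in the vocabulary.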

A more advanced alternative is to use word vectors. Here each word is represented by a smaller list of numbers of any value, though probably between 0 and 1. You can think of these as being coordinates in space (though it might be a space with over a hundred dimensions). Words with similar meanings are close together in this space. So the vectors which represent the words nice and pleasant are likely to be similar. If the word vectors are well tuned, simple maths should also be possible. For example v['king'] - v['male'] + v['female'] == v['queen'] should hold true, at least approximately.
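To make the arithmetic concrete, here is a small sketch using hand-made three dimensional vectors. The numbers are invented for illustration and are not real embeddings. Because learned vectors rarely match exactly, the sketch looks for the nearest word by cosine similarity rather than testing for equality.

```python
import math

# Hand-made vectors, purely illustrative.
# Dimensions: (royalty, maleness, femaleness).
v = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "male":   [0.1, 0.9, 0.1],
    "female": [0.1, 0.1, 0.9],
    "film":   [0.1, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(vec):
    """Return the vocabulary word whose vector is most similar to vec."""
    return max(v, key=lambda word: cosine(v[word], vec))

# king - male + female, computed element by element.
target = [k - m + f for k, m, f in zip(v["king"], v["male"], v["female"])]
print(nearest(target))  # queen
```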

I’m not going to go too deeply into this, or how the vectors themselves are created. If you want to find out more, check out word2vec, which is probably the most well known implementation.

The crux of this is that a language model trained against word vectors rather than one hot encodings has the potential to generalise. The word nice might not have appeared in the corpus it was trained against, but if its synonym pleasant did, then (given the embeddings) it can probably work with that. A really well trained model could potentially even cope with never having seen queen if king, male and female were all in the training corpus.

The Ambiguity of the English Language

Here’s an example from Groucho Marx:

One morning I shot an elephant in my pyjamas. How he got into my pyjamas I'll never know.

I’m not even going to go there. That’s at least a level of abstraction higher than the problem I have in mind. Namely: the syntactic and semantic ambiguity of individual words. For example: Still is a very ambiguous word, having 8 possible interpretations. These include:

As an adjective, it is a synonym for “unmoving”;
As a verb, it means “cause to be unmoving”, a rough synonym of “quieten”;
As a noun (here we go) it may refer to:
- A still image, especially one taken from a movie, so a rough synonym of “photograph”;
- Deep silence;
- The equipment used to make alcohol, as in “distillery”(!).

It might not even be the most troublesome example. Consider “seed”, which as well as being a noun has several possible interpretations as a verb. The first means, essentially, “to add seeds”. One of the others? “To remove seeds”^[2].

Put That Together and What Have You Got?

Hopefully you see the problem. For the word vectors to work the way they’re usually constructed, the vector which represents still somehow has to encode its similarity with the words unmoving, photograph and distillery. seed’s vector must encode two meanings which are essentially opposites.

I’m not saying that’s actually impossible, but that it makes it very likely that some important nuance will be lost. It seemed pretty odd to me that Andrew Ng didn’t mention this at all. Jeremy Howard has also not mentioned it thus far in the fast.ai course (disclaimer: I’m not finished with this course yet, so he still might).

I can think of three possible reasons for this:

It’s a new idea, or at least not well developed;
It’s been tried, and it doesn’t make any difference;
It’s a more advanced topic.

I made a conscious decision not to look into either of the latter two possibilities. I will later, but first I wanted to dive in and see what I found. Sometimes it’s worth just trying something with minimal background reading. If you don’t know how possible something is thought to be, you’re less likely to be limited by that knowledge. Of course you might just be repeating the mistakes of others, but that too can be a learning experience.

My Attempt at a Solution

My plan was pretty simple: don’t look up the word vector using just the word, use the part-of-speech tag as well. To put it another way: use different vectors for still when used as a verb, and still when used as an adjective.

The fast.ai lesson 4 Jupyter notebook begins by training a language modelling RNN from scratch against the IMDB reviews data-set, building new word vectors along the way. This RNN is then retrained as a classifier which identifies the sentiment of a particular review. This seemed like an ideal test case for my idea, on the surface at least.

The line of code responsible for building the vocabulary for the data-set looks like this:

TEXT = data.Field(lower=True, tokenize='spacy')

The tokenize parameter is ultimately resolved here, with code which looks like this:

import spacy
spacy_en = spacy.load('en')
return lambda s: [tok.text for tok in spacy_en.tokenizer(s)]

tokenize can be either a callable entity, or a string which represents a known tokeniser, as above^[3]. spaCy already performs part of speech tagging when its full pipeline is run over the text, so I can use the following to get the effect that I want:

import spacy
spacy_tok = spacy.load('en')
tok_pos = lambda s: [tok.text + "-" + tok.pos_.lower() for tok in spacy_tok(s)]
TEXT = data.Field(lower=True, tokenize=tok_pos)

Now, when tokenised, the phrase “Still the film still” becomes ['still-verb', 'the-det', 'film-noun', 'still-adv'] instead of ['still', 'the', 'film', 'still']. Alternatively, for a more fine-grained output, I can use this in the lambda:

[tok.text + "-" + tok.tag_.lower() for tok in spacy_tok(s)]

This yields more information about the part of speech. So whereas pos_ might yield “verb”, tag_ will provide the type of verb, e.g. “VBD” for a past tense verb. This might reduce the ambiguity even further, or it might just make it harder for the system to build good vectors.
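One side effect of this approach is that the vocabulary grows, because a single spelling can now map to several entries. The sketch below shows this using the tagged tokens from the “Still the film still” example. The pairs are copied from that example by hand, not produced by running spaCy here.

```python
# (text, part of speech) pairs, copied from the tokenised example
# above rather than generated by a tagger.
tagged_tokens = [
    ("still", "verb"),
    ("the", "det"),
    ("film", "noun"),
    ("still", "adv"),
]

# Vocabulary when only the word itself is used as the key.
plain_vocab = {text for text, _ in tagged_tokens}

# Vocabulary when the part of speech is appended to the word.
tagged_vocab = {f"{text}-{pos}" for text, pos in tagged_tokens}

print(sorted(plain_vocab))   # ['film', 'still', 'the']
print(sorted(tagged_vocab))  # ['film-noun', 'still-adv', 'still-verb', 'the-det']
```

Each extra entry needs its own vector, and each of those vectors sees fewer training examples than the single shared vector would have.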

The Effect on Building the Model

Both of my modified tokenisation strategies take a lot longer than the default. Rather than taking fifteen to twenty minutes to build the TEXT field, they take two to three hours. The resulting change in vocabulary size is significant, though.

Likely as a result of this increased vocabulary, the training time for the language modelling phase increases from around seven and a half minutes per epoch to around nine minutes per epoch.

As a first step, I followed identical training schedules for all three options. There are thirty-five epochs in fast.ai’s schedule so the training time increases from 260 minutes to 315 minutes, almost an hour more.

What did I have to show for these additional hours of tagging and training? Here are my final values for the training and validation loss:

Model          Training Loss   Validation Loss
Original       4.2471          4.2114
Simple Tags    4.3200          4.2542
Complex Tags   4.2976          4.2548

And here is the validation perplexity:

There’s nothing too surprising there. The task of predicting the next word and its part of speech is harder than predicting just the next word. A larger vocabulary is harder to model. Given the same amount of training time, the system gets a lower score on the more complex task.
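For reference, perplexity is conventionally the exponential of the cross entropy loss. Assuming the validation losses above are mean cross entropy values measured in nats (an assumption, since the loss function itself isn’t shown here), the conversion is short:

```python
import math

# Final validation losses, taken from the table above.
val_loss = {
    "Original": 4.2114,
    "Simple Tags": 4.2542,
    "Complex Tags": 4.2548,
}

# Perplexity = exp(loss), assuming the loss is in nats.
perplexity = {name: math.exp(loss) for name, loss in val_loss.items()}

for name, value in perplexity.items():
    print(f"{name}: {value:.2f}")
```

Because the exponential is monotonic, the ordering of the models by perplexity is the same as their ordering by loss.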

The Effect on the Sentiment Detection

This will be a very short section. It’s pretty easy to sum up: almost none. Which is disappointing, I have to admit. In all cases, training took around 1 minute per epoch. Also in all cases, the accuracy is about 93%. Across all three cases, the range of the final accuracy is about 0.2%, which I think is well within the margin of error. I’m not even going to bother charting it. The version with the simple tags scored lowest, the version with the complex tags scored highest. But the differences are so small that I suspect small changes in the distribution between training and validation sets would remove them. Given the random initialisation, there’s even a good chance that just repeating the experiment would give a different ordering.

Analysis

On the one hand: this didn’t make things any better, on the other: it didn’t make them any worse. The fact that the model which uses the parts of speech can do worse at the language modelling stage but still get the same score for sentiment detection is pretty interesting.

Howard notes in the lecture that once the training schedule in the notebook is complete the network isn’t really close to overfitting. If anything, I think the modified versions are even further away. There’s still lots of room to train all three versions of the model.

I think that’s most likely the best approach to carry forwards:

Train the language modelling system almost to overfitting;
Train the sentiment detection almost to overfitting;

Then repeat this with the versions which incorporate the part of speech tags. Once the three models are fully trained, a better comparison can probably be made.

Thus far I’ve been doing all of the training inside Jupyter notebooks. That’s great for interactivity, but highly suboptimal for long lived training. If the browser tab crashes or loses its connection to the kernel, hours of work can be lost. So if I do take another run at this I’ll probably also use it as a reason to learn how to use Paperspace’s Gradient system, which allows long lived jobs to be run remotely^[4].

Discuss this post on Hacker News.

[1] The sub-course in question is actually called “sequence models”.
[2] Apparently this is called a “Janus Word” or contronym.
[3] This sort of thing sometimes freaks me out about Python.
[4] It’s very similar to how Floydhub works, as far as I can tell.

Kaggle’s Yelp Restaurant Photo Classification Competition, Fast.ai Style: Part 2

Nick Johnson — Sat, 30 Jun 2018 09:35:20 GMT

Note: This post has a lot of JavaScript graphs, but if you’re using a feed reader or have JavaScript turned off you’ll just get basic tables. Sorry about that.

With that out of the way: I’m going to assume that if you’re reading this you’ve already read Part 1. As such, I’m just going to dive right back in where I left off.

Calculating the Per Business F1 on the Fly

Having now calculated the per business F1 at the end of the training run, I realised it would be useful (or at least interesting) to be able to see how it was changing during training. The per photo F1 makes for a decent heuristic, but isn’t guaranteed to actually correlate with the per business F1, which is what I actually care about.

I hit another snag here with the fast.ai library. At the end of each epoch, the metrics supplied by the user are calculated using the same batch size as training. The results for each batch are then averaged, and this is what is shown to the user.

That’s a problem for calculating the per business F1, as the entire dataset is needed to build the per business predictions. As a smaller and less obvious problem: it also means that the per photo F1 which is displayed at the end of the epoch will be even more inflated than I thought. This is because rather than finding the best threshold for the entire data set, a more specialised threshold will be found for each batch. It’s overfitting, essentially. The premature optimisation of the machine learning world.
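The inflation effect can be seen with a toy example. The sketch below uses a plain binary F1 (not the samples-averaged F1 used elsewhere in this post) and four invented predictions split into two batches. Picking the best threshold per batch and averaging gives a higher number than picking one threshold for the whole set.

```python
def f1(preds, targs, th):
    """Binary F1 for predictions thresholded at th."""
    labels = [p > th for p in preds]
    tp = sum(1 for l, t in zip(labels, targs) if l and t)
    fp = sum(1 for l, t in zip(labels, targs) if l and not t)
    fn = sum(1 for l, t in zip(labels, targs) if not l and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1(preds, targs, thresholds):
    """Best F1 achievable over the candidate thresholds."""
    return max(f1(preds, targs, th) for th in thresholds)

# Invented data: (predictions, targets) for two batches.
batches = [([0.4, 0.8], [0, 1]), ([0.3, 0.1], [1, 0])]
thresholds = [0.2, 0.5]

# A separate "best" threshold per batch, then averaged.
per_batch = sum(best_f1(p, t, thresholds) for p, t in batches) / len(batches)

# One threshold chosen for the whole data set.
all_preds = [p for preds, _ in batches for p in preds]
all_targs = [t for _, targs in batches for t in targs]
whole_set = best_f1(all_preds, all_targs, thresholds)

print(per_batch, whole_set)
```

Here each batch is scored perfectly by a different threshold, but no single threshold is perfect for all four examples.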

I could solve both problems if I could collate the per photo predictions before calculating the F1. If I had access to state which persisted between batches, I could use something like the method below to collate the predictions and target values.

def collate(preds, targs, data):
  multiplier = 1.0
  if preds.shape[0] != len(photo_idx_to_val_biz_idx):
    if data['preds'] is None:
      data['preds'] = preds
      data['targs'] = targs
      # The dataset is incomplete.
      return 0, None, None
    # Append the data to the known data.
    data['preds'] = torch.cat([data['preds'], preds])
    data['targs'] = torch.cat([data['targs'], targs])
    if data['preds'].shape[0] != len(photo_idx_to_val_biz_idx):
      # The dataset is incomplete.
      return 0, None, None
    # The dataset is complete.
    # See below for an explanation of the multiplier.
    multiplier = len(photo_idx_to_val_biz_idx) / preds.shape[0]
    preds = data['preds']
    targs = data['targs']
    data['preds'] = None
    data['targs'] = None
  return multiplier, preds, targs

That’s all well and good, but the metric calculations are supplied to the fast.ai library as a pure function, which is then run without additional state or context. Except… this is Python, and calling something a “pure” function is a sign, not a cop. Functions in Python are first class objects, and objects in Python have arbitrarily assignable state. So there is a way around this problem…
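In isolation, the trick looks like the sketch below. The function name running_mean and its state dictionary are invented for this illustration. The point is only that an attribute assigned to a function object survives between calls.

```python
def running_mean(value):
    """Return the mean of every value passed in so far."""
    # The state lives on the function object itself.
    state = running_mean.state
    state["total"] += value
    state["count"] += 1
    return state["total"] / state["count"]

# Initialise the persistent state after the function is defined.
running_mean.state = {"total": 0.0, "count": 0}

print(running_mean(2.0))  # 2.0
print(running_mean(4.0))  # 3.0
```

From the caller’s point of view this still looks like a plain function taking one argument, which is exactly why it can be passed somewhere that expects one.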

Warning: If the Python code I admitted to using elsewhere in these posts bothers you, that below will almost certainly bother you even more. And it should. It’s a horrible hack, and nothing like it should ever get anywhere near a production environment. It should never even get near the critical path of a non production environment. But still. I’m not using it for either of those things.

def f1_biz_avg(preds, targs, start=0.24, end=0.50, step=0.01):
  # Collate the predictions and targets, using persistent
  # state attached to this function.
  multiplier, preds, targs = collate(preds, targs, f1_biz_avg.data)

  # A multiplier of 0 means that collation is incomplete.
  if multiplier == 0.0:
    return 0
    
  # Convert the per photo values to per business values.
  biz_preds, biz_targs = photo_to_biz(preds, targs, True)

  # Ignore warnings.
  with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    # Find the threshold which yields the best F1.
    mapping = {th : f1_score(biz_targs, (biz_preds > th), average='samples')
           for th in np.arange(start,end,step)}
    th = max(mapping.keys(), key=mapping.get)

    # Return the best F1, scaled so that the fast.ai library
    # will average it with the 0's to get the correct value. 
    return mapping[th] * multiplier

# Initialize the persistent state.
f1_biz_avg.data = {'preds': None, 'targs': None}

This works around two issues:

Collating the predictions before calculating the F1;
Scaling the output of the final batch so that when it’s averaged with the zeros returned for the other batches the correct value results.
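The scaling can be checked with a little arithmetic. The sketch below assumes the per batch metric values are averaged with each batch weighted by its size, which is what the multiplier in collate implies. The batch sizes and the F1 value are hypothetical.

```python
# Hypothetical validation batch sizes; the last batch is smaller.
batch_sizes = [64, 64, 32]

# Hypothetical F1 computed once the final batch completes the set.
true_f1 = 0.78

total = sum(batch_sizes)

# Same formula as in collate: whole set size / final batch size.
multiplier = total / batch_sizes[-1]

# Zeros for the incomplete batches, scaled F1 for the final one.
per_batch_metric = [0.0, 0.0, true_f1 * multiplier]

# Size-weighted average across batches (the assumed behaviour).
weighted_average = sum(
    m * n for m, n in zip(per_batch_metric, batch_sizes)
) / total

print(multiplier, weighted_average)
```

The weighted average comes back out as the original F1, up to floating point rounding. If the averaging were unweighted instead, the multiplier would need to be the number of batches.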

I’ll say it again: all of this is a horrible hack. I’m ashamed of it. I worry that if anyone who works for my employer sees this, I might be fired. And yet...

This is how the per photo F1 compares to the business F1 over the course of the training schedule:

Per Photo F1   Per Business F1
0.711957       0.74221
0.726165       0.735293
0.730785       0.737406
0.732486       0.739411
0.733644       0.736971
0.73573        0.742263
0.741325       0.750094
0.742113       0.754373
0.744895       0.761078
0.746609       0.77235
0.74738        0.76478
0.748527       0.761866
0.74936        0.768682
0.749511       0.768273
0.749174       0.764132
0.749718       0.766357
0.751661       0.777684
0.753739       0.776017
0.753325       0.776626
0.755435       0.785654
0.756214       0.780722
0.756346       0.787445
0.756536       0.784503

As you can see the per business F1 is consistently higher than the per photo, but less stable. The latter makes sense, given that the model is being trained against the individual photos, not the business. The former was a little surprising to me. I assume the mixed signals start to cancel each other out when you average the predictions together.

Comparing Different Architectures

An F1 of 0.7845 was actually pretty close to my original goal, but not quite there. The obvious next step was to try the same approach with a more advanced model. I also thought it might be interesting to compare the performance of a few different CNN architectures for my own information. So next I ran the exact same schedule, but using the ResNet-50 and ResNext-50 CNN architectures.

resnet34   resnet50   resnext50
0.711957   0.727451   0.72667
0.726165   0.73852    0.737242
0.730785   0.742298   0.740598
0.732486   0.744425   0.742447
0.733644   0.745335   0.743586
0.73573    0.745172   0.747723
0.741325   0.750352   0.753836
0.742113   0.751288   0.754403
0.744895   0.754363   0.75765
0.746609   0.75604    0.759276
0.74738    0.756608   0.759684
-          0.75674    0.760055
0.748527   0.758205   0.761691
0.74936    0.758833   0.762266
0.749511   0.758572   0.762637
0.749174   0.759194   0.762198
0.749718   0.759701   0.762108
0.751661   0.761914   0.761366
0.753739   0.763467   0.762696
0.753325   0.762945   0.762683
0.755435   0.764565   0.76418
0.756214   0.764523   0.764647
0.756346   0.765443   0.764441
0.756536   0.765464   0.765044

I was pretty sure that both 50 layer architectures would do consistently better than ResNet-34, but I also thought that ResNext-50 would do consistently better than ResNet-50. So I was half right.

One advantage ResNext-50 did have is that it trained more quickly. Stupidly, I didn’t record the training time for each architecture. I trained ResNet-34 over the course of a day. I’d say that ResNet-50 took about half as long again to train as ResNet-34^[1]. ResNext-50 seemed like it took about halfway between the two. But that could be my imagination. Next time I should actually record the timings…

Be that as it may, my initial goal was to achieve an F1 score of at least 0.8 against the validation set. 0.8082 is (just barely) higher than that, so: mission accomplished, I guess.

Right?

Trying Class Specific Thresholds

At this point I started to realise a few things which would have been obvious from the start if I’d had more experience. Firstly, after I started writing this post I realised it might be interesting to graph the proportion of businesses which belong to each class. For your convenience (and in the interest of making the following charts readable on mobile), here are the class names again:

Good for lunch;
Good for dinner;
Takes reservations;
Outdoor seating;
Restaurant is expensive;
Has alcohol;
Has table service;
Ambience is classy;
Good for kids.

And here is a graph of their proportions:

Class                     Proportion
good_for_lunch            0.3300
good_for_dinner           0.5125
takes_reservations        0.5275
outdoor_seating           0.5125
restaurant_is_expensive   0.2625
has_alcohol               0.6375
has_table_service         0.6850
ambience_is_classy        0.2825
good_for_kids             0.6075
average                   0.4842

Following on from this, I started to wonder whether my system was doing better on some classes rather than others. Which is when the obvious thought arrived: I was using the same threshold for each class, but I had no reason to assume that the sensitivity was the same. I could probably get better results by using different thresholds for each class.

I ran the following code against the inference output for the validation set to find the best individual threshold for each class:

import warnings

import numpy as np
from sklearn.metrics import f1_score

def per_class_threshholds(preds, targs, start=0.04, end=0.50, step=0.001):

  # Initialize the per class thresholds to 0.
  thresholds = np.zeros((preds.shape[1]))

  # Ignore warnings.
  with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    # Iterate 10 times, trying to improve the thresholds each
    # time.
    # Note: This is overkill, but runs quickly enough not to matter.
    # Some CPU time could be saved by stopping once the F1 is no
    # longer improving.
    for _ in range(10):
      # Try to improve the threshold for each class in turn.
      for i in range(thresholds.shape[0]):
        best_th = 0.0
        best_score = 0.0
        for th in np.arange(start, end, step):
          thresholds[i] = th
          score = f1_score(targs, (preds > thresholds), average='samples')
          if score > best_score:
            best_th = th
            best_score = score
        thresholds[i] = best_th

  return thresholds

Running this for each of three architectures individually gave me the per class F1 scores. You can see them in the graph below, which I’ve foreshortened to emphasise the differences in performance:

Class                     resnet34 F1   resnet50 F1   resnext50 F1   Ensemble F1
good_for_lunch            0.6824        0.6643        0.6053         0.6824
good_for_dinner           0.8326        0.8387        0.8372         0.8387
takes_reservations        0.8728        0.8874        0.8805         0.8874
outdoor_seating           0.6944        0.7449        0.7329         0.7449
restaurant_is_expensive   0.7510        0.7401        0.7570         -
has_alcohol               0.8905        0.9136        0.8869         0.9136
has_table_service         0.9252        0.9329        0.9310         0.9329
ambience_is_classy        0.7672        0.7679        0.7857         -
good_for_kids             0.8716        0.8835        0.8793         0.8835
average                   0.8097        0.8193        0.8106         0.8251

There’s actually more variation than I was expecting. Firstly between the per class scores. There’s some correlation with the per class proportions above, but not for every class. Accounting for that effect, “Good for lunch” and “takes reservations” appear to be the hardest classes to detect.

Secondly the best architecture is not consistent across the classes. ResNet-34 is actually a bit of a dark horse when it comes to detecting establishments which open for lunch. Who knew?

Speaking of ideas which occur to you after the fact: I’m willing to bet that the time stamp of the photo is probably a pretty solid signal for the “good for lunch” class.

At this point I didn’t trust the overall F1 scores these thresholds gave me against the validation set. It was time to run against the test set, submit to Kaggle, and find out what my real score was. I did this for each of the architectures, and also built an ensemble output by using the predictions of the architecture which got the best score for each class.
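The per class selection behind such an ensemble might be sketched as follows. This is one possible reading of the description, not the code that was actually used. The F1 values are three rows from the per class table, while the prediction arrays are invented placeholders.

```python
# Validation F1 per class and architecture (three rows from the table).
val_f1 = {
    "good_for_lunch": {"resnet34": 0.6824, "resnet50": 0.6643, "resnext50": 0.6053},
    "has_alcohol":    {"resnet34": 0.8905, "resnet50": 0.9136, "resnext50": 0.8869},
    "good_for_kids":  {"resnet34": 0.8716, "resnet50": 0.8835, "resnext50": 0.8793},
}

# Pick the architecture with the highest validation F1 for each class.
best_arch = {cls: max(scores, key=scores.get) for cls, scores in val_f1.items()}

# Invented per business predictions: one row per business, one
# column per class, in the same order as the keys of val_f1.
preds = {
    "resnet34":  [[0.70, 0.20, 0.90], [0.10, 0.80, 0.40]],
    "resnet50":  [[0.60, 0.30, 0.85], [0.20, 0.75, 0.35]],
    "resnext50": [[0.65, 0.25, 0.80], [0.15, 0.70, 0.45]],
}

classes = list(val_f1)
n_businesses = len(preds["resnet34"])

# For each class column, copy the predictions from the chosen architecture.
ensemble = [
    [preds[best_arch[cls]][row][col] for col, cls in enumerate(classes)]
    for row in range(n_businesses)
]

print(best_arch)
print(ensemble)
```

Since the architectures are chosen using validation scores, the selection step is itself a place where overfitting to the validation set can creep in.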

Having run inference on the test data, I used the following code to build per business predictions^[2] and generate the formatted output.

import numpy as np
import pandas as pd
from IPython.display import FileLink

# `predications` holds the per photo predictions from running
# inference on the test set.
test_photo_to_biz = f'{PATH}/test_photo_to_biz.csv'
test_photo_to_biz_data = pd.read_csv(test_photo_to_biz)

# Gather the individual business IDs, and the image counts
# for each business.
biz_counts = {}
for biz_id in test_photo_to_biz_data.business_id:
  biz_counts.setdefault(biz_id, 0)
  biz_counts[biz_id] += 1
biz_ids = list(biz_counts.keys())
biz_idxs = {biz_ids[i] : i for i in range(len(biz_ids))}

# Extract the photo IDs in order from the test image file names.
images_in_order = [v[9:-4] for v in learn.data.test_ds.fnames]
photo_idxs = {int(images_in_order[i]) : i for i in range(len(images_in_order))}

# Convert the per photo predictions to per business predictions
# by averaging the predictions for each business's photos.
biz_preds = np.zeros((len(biz_ids), predications.shape[1]))
for i in range(test_photo_to_biz_data.shape[0]):
  photo_id = test_photo_to_biz_data.photo_id[i]
  photo_idx = photo_idxs[photo_id]
  photo_preds = predications[photo_idx, :]

  biz_id = test_photo_to_biz_data.business_id[i]
  biz_idx = biz_idxs[biz_id]
  biz_count = biz_counts[biz_id]

  biz_preds[biz_idx, :] += photo_preds * (1.0 / biz_count)

# Convert the per business predictions into booleans.
biz_cls = biz_preds > threshholds

# Convert the booleans into lists of matched classes.
classes = []
for i in range(biz_cls.shape[0]):
  biz_cls_biz = biz_cls[i, :]
  biz_classes = " ".join([str(j) for j in range(biz_cls.shape[1]) if biz_cls_biz[j]])
  classes.append(biz_classes)

# Build a pandas data frame with the business IDs and
# matched classes.
data = np.array(list(zip(biz_ids, classes)), order = 'F')
output = pd.DataFrame(data=data, columns=['business_id', 'labels'])
# Write the data frame out to a CSV file.
csv_fn=f'{PATH}tmp/sub_{f_model.__name__}.csv'
output.to_csv(csv_fn, index=False)
# Display a link to the CSV file.
FileLink(csv_fn)

The code used to build the ensemble is left as an exercise for the reader. Obviously this is for educational purposes. Not just because I don’t want you to see my code and possibly judge me more harshly than you already do for the other code in this post. Ahem.

So, without further ado, here are the final scores against the public and private leaderboards for the competition:

Model        Public Score   Private Score
ResNet-34    0.7788         0.78823
ResNet-50    0.8009         0.8062
ResNext-50   0.7872         0.7830
Ensemble     0.7896         0.8001

As you can see, ResNet-50 wins both leaderboards pretty handily. That’s not a huge surprise, but I really wasn’t expecting ResNet-34 to beat ResNext-50 on the private leaderboard. The ensemble takes a respectable second place on both leaderboards. I would tend to blame overfitting for it not coming first. Overfitting is Blofeld. Absent another obvious villain, it’s usually the culprit.

Regardless, I’m definitely over my target F1 score of 0.8. Enough over it to get me inside the top 100 on the private leaderboard. Which would put me on the bronze podium. You know… if this competition hadn’t ended two years ago…

Ideas for Further Improvement

At this stage I have essentially four ideas to try and improve on these results.

The first is simple: try exactly the same approach with a more complex CNN architecture, such as ResNet-101. The reason I haven’t tried this one already is also simple: time. I already had to train ResNet-50 and ResNext-50 across multiple days. At a rough guess I’d expect ResNet-101 to take twice as long. If I was actually entering a competition here I would probably try it, but since I’m not it doesn’t seem worth the effort. There’s also no guarantee it would actually improve the results. It might just overfit.

Idea two is to tweak the loss function in order to minimise (if not eliminate) the mixed signals it currently sends out. My first thought was to use the standard multi-class loss functions, but de-emphasise or remove the term which punishes false negatives. That would probably lead to a model which just returns false for every class, though. Another thought I had was to add an additional class “none of the above” and then use a single class loss function with no punishment of false negatives. The model would then have to select a single class for each image, or explicitly select that none apply. Of course this would only work if there is at least one image which provides a good representation of each class for each business. As for why I haven’t tried this yet: I’m not experienced enough with PyTorch (which the fast.ai library is based on) and I don’t know how. Yet.

My third idea is to fiddle with the data loader. Each training example would become a business, rather than a photo. When loading the data, the data loader would randomly select (say) 4 of the images for that restaurant and return a composite of them. So it might return the following image (or any other permutation) for business 485, which I used as an example in Part 1:

Idea four isn’t actually my idea at all, but one which came from Hacker News user kaveh_h (aka Kaveh Hadjar) in this comment. It amounts to this: use a recurrent neural network (such as an LSTM) in place of the fully connected classification layers of the model. You would then batch together all of the images for a restaurant, run them all through the network sequentially, and then train / infer based on the final output. It would look something like this (art style shamelessly stolen from Stratechery):

RNNs are ideal for handling sequences of data, so there is some possibility that the order the images were supplied in could make a difference. I also have absolutely no idea how I’d go about putting something like that together, either using the fast.ai library or without it.

In summary: I have a lot to learn. Which is quite exciting, to be honest.

[1] Probably about 50/34 times as long, in fact.
[2] The photo to business prediction code I wrote for the validation set no longer works here. Thankfully something much simpler suffices.

Kaggle’s Yelp Restaurant Photo Classification Competition, Fast.ai Style: Part 1

Nick Johnson — Sun, 24 Jun 2018 18:07:13 GMT

Lectures 3 and 4 of fast.ai’s Practical Deep Learning for Coders MOOC focus in part on multi-label image classification. Teacher Jeremy Howard uses the Understanding the Amazon from Space Kaggle competition for teaching purposes^[1], and sets homework to try other similar image classification competitions.

The forums point to a template version of the Jupyter notebook used in the lecture, which suggests trying the Yelp Restaurant Photo Classification competition. On the surface this actually turned out to be a pretty suboptimal match for the techniques used in the lecture and the default setup of the fast.ai library. But it did give me an opportunity to dig a little deeper than I might have otherwise.

Howard’s guidance is that students should aim to get an evaluation score which would put them in the top 50% of the leaderboard for the competition. Taking a look at the leaderboard I decided to aim slightly higher: I wanted to get an F1 score of at least 0.8. Ideally this score would be against the test set, but if I managed it against the validation set I’d be happy enough with that.

Specifics of the Competition

This competition has a degree of separation between the input and the expected output. To explain what I mean by that, let’s compare the training data from the Yelp competition to that from the Amazon competition. In both cases the data is supplied as JPEG images and CSV files. For the Amazon competition there is a single CSV file, mapping images to labels. It looks like this:

image_name,tags
train_0,haze primary
train_1,agriculture clear primary water
train_2,clear primary
train_3,clear primary
train_4,agriculture clear habitation primary road
train_5,haze primary water

Nice and simple. The name of the image in the first column; the labels for the image in the second. Mapping satellite images to appropriate labels is, after all, the point of this competition. For the Yelp competition there are two CSV files. The first maps businesses (not images) to labels:

business_id,labels
1000,1 2 3 4 5 6 7
1001,0 1 6 8
100,1 2 4 5 6 7
...
485,1 2 3 4 5 6 7
...

Again this represents the point of the competition. We’re trying to learn the right labels for a particular restaurant. The images are a data tool we use in order to do so. So, there is a second CSV file which maps images to businesses:

photo_id,business_id
204149,3034
52779,2805
278973,485
195284,485
19992,485
80748,485
...

This is the degree of separation: no direct mapping between the input data (the images) and the desired output (the labels).

This presents two main problems. The first is small: the data needs to be merged into a format which can be used to train a neural network. Solving this leads to the second, much bigger issue: many of the resulting label to image mappings are inappropriate. But there isn’t enough information in the data set to do anything other than map every label for a business to every image for that business.
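To make the merge concrete, here is a minimal sketch in plain Python, using only rows copied from the two CSV excerpts above. The function name `labels_per_photo` is made up for illustration, and the real files are of course read from disk rather than from strings.

```python
import csv
import io

# A few rows copied from the two excerpts above.
BIZ_TO_LABELS = """business_id,labels
1000,1 2 3 4 5 6 7
1001,0 1 6 8
485,1 2 3 4 5 6 7
"""

PHOTO_TO_BIZ = """photo_id,business_id
278973,485
195284,485
19992,485
80748,485
"""


def labels_per_photo(biz_csv, photo_csv):
    """Give every photo the full label string of its business."""
    biz_labels = {row["business_id"]: row["labels"]
                  for row in csv.DictReader(io.StringIO(biz_csv))}
    return {row["photo_id"]: biz_labels[row["business_id"]]
            for row in csv.DictReader(io.StringIO(photo_csv))}


photo_labels = labels_per_photo(BIZ_TO_LABELS, PHOTO_TO_BIZ)
for photo_id, labels in photo_labels.items():
    print(photo_id, labels)
```

All four photos of business 485 end up with exactly the same labels, whatever they actually show. That is the second problem in a nutshell.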

Consider that there are 9 labels:

Good for lunch;
Good for dinner;
Takes reservations;
Outdoor seating;
Restaurant is expensive;
Has alcohol;
Has table service;
Ambience is classy;
Good for kids.
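Since each business can carry any combination of these labels, the usual training target is a list of nine 0s and 1s rather than a single class. The sketch below shows that conversion for the space-separated strings in the CSV. The helper name `to_multi_hot` is hypothetical, and I’m assuming the labels are numbered 0 to 8 in the order listed above.

```python
NUM_LABELS = 9  # assumed: labels 0 to 8, in the order listed above


def to_multi_hot(label_string, num_labels=NUM_LABELS):
    """Turn a string such as "0 1 6 8" into a list of 0s and 1s."""
    present = {int(token) for token in label_string.split()}
    return [1 if i in present else 0 for i in range(num_labels)]


biz_485 = to_multi_hot("1 2 3 4 5 6 7")
biz_1001 = to_multi_hot("0 1 6 8")
print(biz_485)
print(biz_1001)
```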

Now consider business 485 from the data above. It has every label apart from 0 (good for lunch) and 8 (good for kids). Associated with it are these 4 images:

Does each of those images demonstrate each of those labels? I can certainly see that the presence of wine glasses in the last suggests that alcoholic drinks are available. But to my eyes there’s nothing in any of the other three pictures which suggests booze is on the menu. Likewise I’m not sure the third image suggests any of the labels, yet in training it will be expected to match all of them.

That’s not all. From the description of the data:

Since Yelp is a community driven website, there are duplicated images in the dataset. They are mainly due to:

users accidentally upload the same photo to the same business more than once (e.g., this and this)

chain businesses which upload the same photo to different branches
Yelp is including these as part of the competition, since these are challenges Yelp researchers face every day.

So the same image might be in the training set multiple times, with entirely different labels each time. That’s a lot of mixed signals.
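One way to get a feel for how common this is would be to group the image files by a hash of their contents. This is a rough sketch rather than anything I ran for this post: the function name `find_duplicate_files` is made up, and it only catches byte-for-byte copies, not re-encoded or resized versions of the same photo.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(folder, pattern="*.jpg"):
    """Group files under `folder` whose bytes are identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]


# Demo on throwaway files standing in for photos.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.jpg").write_bytes(b"same pixels")
    Path(folder, "b.jpg").write_bytes(b"different pixels")
    Path(folder, "c.jpg").write_bytes(b"same pixels")
    duplicates = find_duplicate_files(folder)

print(duplicates)
```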

The upshot of this is that the problem is harder. This is borne out by the leaderboard results for the two competitions. The winning score for the Amazon competition is 0.93317 and 100th place has 0.92895. The winning score for the Yelp competition, however, is 0.83177, with 100th place getting 0.80087. Now, I want to stress that this is an apples to oranges comparison. The Yelp competition is graded using the F1 score, whereas the Amazon competition uses the F2 score, which punishes false negatives more harshly. Nevertheless, that’s a big difference and a larger drop off between 1st and 100th place.
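To show why the two metrics aren’t directly comparable, here is the general F-beta formula as a small function. F1 is the case where beta is 1 and F2 the case where beta is 2. The precision and recall values in the example are invented purely to illustrate the asymmetry.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# A model which finds every true label but is only right half the time...
high_recall_f1 = f_beta(0.5, 1.0, beta=1)
high_recall_f2 = f_beta(0.5, 1.0, beta=2)
# ...and one which is always right but finds only half of the true labels.
low_recall_f1 = f_beta(1.0, 0.5, beta=1)
low_recall_f2 = f_beta(1.0, 0.5, beta=2)

print(high_recall_f1, high_recall_f2)
print(low_recall_f1, low_recall_f2)
```

F1 treats the two models identically. F2 rewards the first and penalises the second, because missing a true label (a false negative) costs more than a spurious one.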

Harder doesn’t mean impossible, though. I was curious as to whether the fast.ai techniques would work anyway. Beyond that I wondered if there was anything I could tweak to make them work better.

Processing the Input and Picking the Validation Set

I originally screwed this up and wasted a good few hours of training. The crux of the matter is this: your validation set should be based on the individual restaurants, not the individual images. I knew that the first time around, but I didn’t fully understand the way the fast.ai library would handle it. The following is how I built my second, correct validation set.

Side note: I did (and continue to do) all my work for the fast.ai course using the fast.ai template at Paperspace, which I can highly recommend. If you want to try it out you can use my referral code to get $5 credit here.

First things first, I set the paths for the input CSV files and loaded them into pandas data frames:

PATH = 'data/yelp/'
photo_to_biz = f'{PATH}/train_photo_to_biz_ids.csv'
biz_to_labels = f'{PATH}/train.csv'
photo_to_biz_data = pd.read_csv(photo_to_biz)
biz_to_labels_data = pd.read_csv(biz_to_labels)

Next I selected the businesses which will be used for validation using fast.ai’s get_cv_idxs method. This provides a random but deterministic^[2] list of indices given a dataset size. I added a new column to the biz_to_labels_data data frame and set it to true for every business in the validation set.

val_biz_idxs = get_cv_idxs(biz_to_labels_data.shape[0])
val_biz_idxs_set = set(val_biz_idxs)

for index in range(biz_to_labels_data.shape[0]):
	biz_to_labels_data.loc[index, 'validation_set'] = index in val_biz_idxs_set
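If “random but deterministic” sounds contradictory, the sketch below shows the general idea using only the standard library. It is a stand-in, not fast.ai’s implementation: the function name, the 20% default and the seed value are all assumptions made for the example.

```python
import random


def pick_validation_indices(n, val_fraction=0.2, seed=42):
    """Pick a reproducible random subset of range(n)."""
    rng = random.Random(seed)  # private generator with a fixed seed
    return sorted(rng.sample(range(n), int(n * val_fraction)))


first = pick_validation_indices(100)
second = pick_validation_indices(100)
print(first == second, len(first))
```

Because the generator is seeded, every call with the same arguments returns the same indices, which is what lets a validation set be rebuilt after a notebook restart.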

You specify the validation set to the fast.ai library by giving it a list of the indices in the data set which are to be used for validation. But these indices must be based on the on-disk order of the input files, not the order they appear in the input CSV. Remember above when I said that I originally messed up the validation set? This point about how the fast.ai library interprets the validation set indices is where I did it. I didn’t look deeply enough at my original validation set, and that cost me a lot of time.

I joined the two data frames on the business_id field, then sorted the resulting data frame by photo_id. As the photo_id field corresponds to the filename of each image, sorting on it means the two orders are now the same. This done, the indices of the validation data are the row numbers of the items which have the validation_set column I created above set to True.

joined = pd.merge(photo_to_biz_data, biz_to_labels_data, on='business_id')
joined.sort_values(by='photo_id', inplace=True)
val_idxs = [i for i in range(joined.shape[0])
            if joined.iloc[i, -1]]
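Given how much time the first, broken validation set cost me, a cheap sanity check seems worthwhile. The sketch below looks for businesses which have photos on both sides of the split. The function name is hypothetical and the toy data is invented, reusing a few business ids from the excerpts above.

```python
def businesses_on_both_sides(business_ids, val_idxs):
    """Return business ids with photos in both training and validation.

    `business_ids[i]` is the business of the i-th photo, in the same
    order in which the training library will see the photos.
    """
    val_positions = set(val_idxs)
    val_biz = {b for i, b in enumerate(business_ids) if i in val_positions}
    train_biz = {b for i, b in enumerate(business_ids)
                 if i not in val_positions}
    return val_biz & train_biz


# Toy data: six photos from three businesses, already in photo order.
business_ids = [485, 3034, 485, 2805, 3034, 485]

good_split = [0, 2, 5]  # every photo of business 485
bad_split = [0, 1]      # built per image: splits 485 and 3034

leaks_good = businesses_on_both_sides(business_ids, good_split)
leaks_bad = businesses_on_both_sides(business_ids, bad_split)
print(leaks_good, leaks_bad)
```

With the data frame from this post, something like `joined['business_id'].tolist()` together with `val_idxs` should work as the inputs. An empty result means no restaurant leaks between the two sets.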

Finally what remains is to output just the photo_id and labels columns to a new CSV which can be read in by the fast.ai library:

photos_to_labels = f'{PATH}/train_photos_to_labels.csv'
joined.to_csv(photos_to_labels, columns=['photo_id', 'labels'],
			  index=False)

A key thing I learned here is that I need more experience with pandas. I’m pretty sure there are much more elegant and idiomatic ways of achieving the above. In the lecture, Howard recommends Python for Data Analysis which is written by the main author of pandas. That’s going on my todo list.

First Runs Through ResNet-34

I’m not going to go too deep into the nuts and bolts of actually training the neural network, nor talk about finding the learning rates. You can find pretty comprehensive notes and code samples for this in the fast.ai course forum here.

There are a few things which Andrew Ng’s Coursera Deep Learning Specialisation treats as advanced topics, but fast.ai bakes in from the outset. One of these is transfer learning. The starting point as taught by fast.ai is to use the ResNet-34 architecture with weights pre-trained against the ImageNet dataset. The trained weights are kept for the convolutional layers, but new fully connected classification layers are added to the end. Following the fast.ai recipe, I trained the new layers for 5 epochs, keeping the weights of the convolutional layers static. Then I unfroze the weights of the convolutional layers and continued training for a total of 7 epochs^[3].

Something included in fast.ai from the start but not present in Andrew Ng’s course at all is one of Howard’s tricks for avoiding overfitting. This comes now. I increased the size of the input images from 224px to 299px then repeated the above procedure. This makes the full regime:

5 epochs with an image size of 224px and the convolutional layers frozen;
7 epochs with an image size of 224px and the convolutional layers unfrozen;
5 epochs with an image size of 299px and the convolutional layers frozen;
7 epochs with an image size of 299px and the convolutional layers unfrozen.

Why 224px and 299px? It’s mentioned in the lectures that these are the standard sizes of images in the ImageNet dataset, which the ResNet was trained against. When I originally started playing with the data I tried a three stage progression from 64px to 128px to 256px, but found I was getting much better results more quickly by going directly to 224px and 299px. This may or may not be the case for other datasets. Figuring it out is definitely an art. I think Rick put it best:

The fast.ai library allows you to supply additional metrics when you train the network. These are entirely for the user’s feedback, and have no effect on the training itself. In order to get a better handle on how the training was actually going, I put together a function which returns the best case F1 value by picking the most effective decision boundary:

def f1(preds, targs, start=0.17, end=0.50, step=0.01):

	# Ignore warnings.
	with warnings.catch_warnings():
		warnings.simplefilter("ignore")

		# Find the threshold which yields the best F1.
		# Note: np.arange(...) is essentially range(...) for floats.
		mapping = {th : f1_score(targs, (preds > th), average='samples')
				   for th in np.arange(start,end,step)}
		th = max(mapping.keys(), key=mapping.get)
		
		# Return the F1 generated by this threshold.
		return mapping[th]
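For anyone wondering what `average='samples'` is doing, here is a dependency-free sketch of the calculation as I understand it: threshold each row of predictions, compute F1 for that row, then take the mean over rows. The function names and toy numbers are invented, and scoring an empty row as 0 is an assumption on my part.

```python
def samples_f1(preds, targs, threshold):
    """Mean of the per-row F1 scores after thresholding the predictions."""
    scores = []
    for pred_row, targ_row in zip(preds, targs):
        guess = [p > threshold for p in pred_row]
        tp = sum(1 for g, t in zip(guess, targ_row) if g and t)
        fp = sum(1 for g, t in zip(guess, targ_row) if g and not t)
        fn = sum(1 for g, t in zip(guess, targ_row) if not g and t)
        denom = 2 * tp + fp + fn
        # Assumption: a row with no predicted and no true labels scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_f1(preds, targs, thresholds):
    """Best-case F1 over a set of candidate decision thresholds."""
    return max(samples_f1(preds, targs, th) for th in thresholds)


preds = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]]
targs = [[1, 0, 1], [0, 1, 1]]

at_half = samples_f1(preds, targs, 0.5)
best = best_f1(preds, targs, [0.25, 0.5])
print(at_half, best)
```

Picking the threshold after seeing the targets is what makes the metric “best case”: it is useful for watching training progress, but it flatters the model.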

Running ResNet-34 with the above schedule gave me the following values for the training and validation losses, plus my highly optimistic per photo F1 metric.

Set  Epoch  Training Loss  Validation Loss  F1
1    1      0.618955       0.587201         0.711957
1    2      0.575491       0.555127         0.726165
1    3      0.551347       0.5452           0.730785
1    4      0.548335       0.540938         0.732486
1    5      0.538313       0.536894         0.733644
2    1      0.535068       0.535683         0.73573
2    2      0.516119       0.523915         0.741325
2    3      0.516364       0.522218         0.742113
2    4      0.508947       0.51653          0.744895
2    5      0.500656       0.514619         0.746609
2    6      0.490499       0.511588         0.74738
3    1      0.499408       0.509486         0.748527
3    2      0.489586       0.509366         0.74936
3    3      0.486146       0.5088           0.749511
3    4      0.489602       0.507902         0.749174
3    5      0.485432       0.507049         0.749718
4    1      0.483158       0.504774         0.751661
4    2      0.477729       0.500665         0.753739
4    3      0.484929       0.501044         0.753325
4    4      0.47441        0.497946         0.755435
4    5      0.475028       0.495898         0.756214
4    6      0.469361       0.495856         0.756346
4    7      0.475626       0.495669         0.756536

You’ll notice that there’s a data point missing at the end of the second set of epochs. The Jupyter notebook had a bit of an issue here, and though the training finished successfully, the loss and metric output didn’t make it to the screen. Frustrating, but this is one of the dangers of using Jupyter for long lived training runs.

Processing the Output

With the training runs finished, the next step was to test against the validation set. At this stage I needed per business, rather than per photo, F1. More processing was needed.

Remember before when I said my use of pandas was far from elegant and idiomatic? Well… look away now if that bothered you, because it’s about to get a lot worse. One of the dangers of Python is that it’s really easy to use it as a write only language. You can put a lot of power into a single line of code which makes no sense to you about an hour later.

Well... my quickly hacked together solution for matching photos to businesses in the validation set is one of those times. It uses a series of three dictionary comprehensions to map the index of each photo in the validation photo set to the index of the appropriate business in the validation business set.

# Map the ids of businesses in the business validation set to their
# index in that set. 
val_biz_ids = {biz_to_labels_data.loc[val_biz_idxs[i], 'business_id'] : i
			   for i in range(len(val_biz_idxs))}
# Map the ids of photos in the photo validation set to their index
# in that set.
val_photo_ids = {joined.iloc[val_idxs[i], -4] : i for i in range(len(val_idxs))}

# Map index in the photo validation set to index in the business validation
# set.
photo_idx_to_val_biz_idx = {val_photo_ids[joined.iloc[i, -4]] :
							val_biz_ids[joined.iloc[i, -3]] for i in val_idxs}

With that done, I wrote a new method which first builds the per business predictions. I originally tried two approaches to this: taking the maximum of the predicted values for each class; and taking the mean of the predicted values. After a little bit of experimentation, I found that the mean^[4] gave better results.

def photo_to_biz(preds, targs):
	
	# Initial storage for predictions and targets.
	biz_preds = np.zeros((len(val_biz_idxs), preds.shape[1]))
	biz_targs = np.zeros((len(val_biz_idxs), targs.shape[1]))
	# Counts of the number of photos observed for each business.
	# Used to calculate a rolling average.
	biz_counts = {}
	
	for val_idx in range(preds.shape[0]):
		biz_idx = photo_idx_to_val_biz_idx[val_idx]
		
		# Update the number of photos seen for this business.
		biz_count = biz_counts.get(biz_idx, 0) + 1
		biz_counts[biz_idx] = biz_count

		# Update the rolling mean of the predictions.
		frac = ((biz_count-1) / biz_count)
		biz_preds[biz_idx,:] = (biz_preds[biz_idx,:] * frac) + (preds[val_idx,:] / biz_count)
		
		# Use max to update the target values.
		# (Technically this only needs to be done once for each
		# business and could be precalculated).
		biz_targs[biz_idx,:] = np.maximum(biz_targs[biz_idx,:], targs[val_idx,:])

	return biz_preds, biz_targs
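For comparison, here is a simpler way to express the two aggregation strategies I tried, by grouping the photo predictions per business first. It is a plain Python sketch with invented names and toy numbers, not the code used for the results in this post.

```python
from collections import defaultdict


def aggregate_by_business(photo_preds, photo_biz, how="mean"):
    """Combine per-photo prediction rows into one row per business."""
    grouped = defaultdict(list)
    for row, biz in zip(photo_preds, photo_biz):
        grouped[biz].append(row)

    combined = {}
    for biz, rows in grouped.items():
        columns = list(zip(*rows))  # one tuple of values per label
        if how == "mean":
            combined[biz] = [sum(col) / len(col) for col in columns]
        elif how == "max":
            combined[biz] = [max(col) for col in columns]
        else:
            raise ValueError(f"unknown aggregation: {how}")
    return combined


# Toy predictions for two labels across three photos of two businesses.
photo_preds = [[0.75, 0.25], [0.25, 0.25], [0.5, 1.0]]
photo_biz = [485, 485, 3034]

by_mean = aggregate_by_business(photo_preds, photo_biz, "mean")
by_max = aggregate_by_business(photo_preds, photo_biz, "max")
print(by_mean)
print(by_max)
```

The max lets a single confident photo decide a label for the whole business, whereas the mean needs some agreement between photos.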

The output can then be fed into the F1 calculation above. Surprisingly (to me), the per-business F1 score actually came out higher than the per photo score: 0.7845 vs 0.7565, which is a notable improvement.

Not quite good enough to hit my goal, though.

Next time: Dirty hacks, improvements galore, submitting to Kaggle, and graphs. Lots of graphs. You can read it here.

As he is fond of doing. ↩︎
Meaning that it always returns the same output given the same input. ↩︎
It’s actually more complicated than that, but as I noted above: that’s not important right now. ↩︎
Again, this might not be the case for other datasets. I used a rolling calculation of the mean. This code could be made a little simpler by pre-counting the number of photos for each business. ↩︎

A Short Word About GDPR

Nick Johnson — Mon, 28 May 2018 14:11:48 GMT

If you’re not dead and you’ve been on the internet recently, you’ve probably heard about GDPR. Given the circumstances of you reading this article, I’m going to assume that both of those things are true. GDPR, which of course stands for ~~Google Democratic People’s Republic~~ General Data Protection Regulation, is a new law in the EU. The gist of it is this: your data belongs to you; you get to choose who has it; you get to choose how it is used.

Without fear of breaking my NDA, I think I can say that GDPR led to a lot of internal effort at Google. Perhaps not quite as much as you would think. Google does genuinely try to be a good steward of its users’ data. But still, there was effort, and possibly the biggest nexus around which that effort revolved is Google’s advertising business.

Most of the commentary on GDPR amounted to it being an annoyance for large businesses and a burden for smaller ones. Some smaller companies have stopped doing business in the EU permanently (or so they say). Some lightbulbs have stopped working in the EU (not kidding). Some US news websites are currently blocked^[1], and USA Today actually threw up an EU only ad free version of the site^[2].

There is also some effect on small blogs such as this one. At least: some effect on small blogs such as this one… which happen to use Google AdSense. Google has actually made it the publisher’s responsibility to get consent for showing personalised advertising. Why would a publisher want to do that? Well, let me illustrate by showing the “targeting type” distribution for the tiny amount of AdSense revenue I’ve made via this site:

Targeting Type   Estimated Earnings
Personalised     £4.44
Contextual       £3.03
Run of Network   £0.04
None             £0.01
Placement        £0.00

As you can see, well over half of this tiny amount of money has come from personalised ads^[3]. If this was my main source of income you could see why I might want to keep showing ads with personalised targeting. It isn’t, though. Also: in order to keep showing them I’d have to ask each user if they’re okay with it. Then I’d have to store that information somewhere, probably with a cookie. I do not want any part of using cookies.
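For the curious, the shares behind “well over half” are easy to check. This snippet just redoes the arithmetic on the figures from the table above.

```python
# Estimated earnings in pounds, copied from the table above.
earnings = {
    "Personalised": 4.44,
    "Contextual": 3.03,
    "Run of Network": 0.04,
    "None": 0.01,
    "Placement": 0.00,
}

total = sum(earnings.values())
shares = {name: amount / total for name, amount in earnings.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Personalised ads work out at roughly 59% of the total.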

Where I’m going with this, dear reader, is that I’ve gone ahead and made the decision on your behalf. Sincere apologies if you do wish to be tracked, and see adverts which are more relevant to you on all of the websites you visit. However I’ve decided that actually I’d prefer the readers of this site not be tracked. So, as per Google’s instructions, I’ve added the following line of code to the ad unit which is shown on this site:

(adsbygoogle=window.adsbygoogle||[]).requestNonPersonalizedAds=1

The upshot of this is that you should never see personalised ads when reading this site.

What the hell are you doing with our data, guys? ↩︎
It’s wonderful. Just HTML and CSS. I wish this is what all news websites were like. ↩︎
Personalised and contextual targeting are somewhat straightforward. “Run of network” is remarketing (you put these shoes in your basket but didn’t buy them), I think. Placement means someone specifically placed an ad on my site. I don’t have that enabled (I think?), so it makes sense that it’s zero. “None” is a puzzling one, though. It’s not included in the AdSense definitions. I guess there are circumstances under which an ad is selected randomly?! ↩︎

An iOS Developer’s Opinions of Flutter

Nick Johnson — Mon, 21 May 2018 17:18:24 GMT

Around a year ago I led a team which spent several months building a complete app using Flutter. For various reasons (none of which I can go into) this app did not ship, and I moved on to a different role. More recently, I spent a week participating in a hackathon. The team I was part of built a successful proof of concept inside an existing app using Flutter.

As a result, Flutter is kind of on my mind right now. I’d like to take the opportunity to put my thoughts down in words. As it turns out, my thoughts on this are quite lengthy. If two and a half thousand words sounds like too much, you can find the TLDR above.

Who Exactly am I?

Since I’m expressing an opinion, I think it’s worth laying out my experience in relevant areas. Presently I’m mostly a front end web developer at Google, where (at the time of writing) I have been employed for almost exactly 7 years. As such the current tools of my trade are JavaScript (using Google’s Closure compiler), CSS and HTML.

I originally started my Google career as a backend developer. About a year into that I incepted, coded, and shipped the Google AdSense App for iPhone. Up until the most recent releases (which have an additional contributor^[1]) I was responsible for every line of (non-library) code in the app. After that I moved to the Google Calendar team and helped ship the v1 of Google Calendar for iOS. At the time I left that team I was responsible for the architecture of about 80% of the UI, and all of the animations in the app. I also worked on several of Google’s internal UI libraries for iOS. The last time I looked I had contributed around 500k lines of Objective-C to Google’s codebase.

I have scant experience of coding for Android, but a lot of experience of using the Java programming language, having previously been a backend developer, and using it extensively during my PhD.

On the whole I enjoy coding for iOS a great deal. In fact I still play with coding iOS Apps in Swift during my spare time. I enjoy web development less, and Android development much less still. I also enjoy science fiction TV shows and American style barbecue, but that’s not important right now.

What is Flutter?

Flutter is essentially two things:

A cross platform framework which allows you to build an app once and then ship it on both Android and iOS;
The UI framework used by Google’s in development Fuchsia operating system.

You could probably make the case that it’s actually a cross platform UI framework which allows you to build for three operating systems, including one which doesn’t entirely exist yet.

To talk about it more deeply I’m going to break it down into three subareas: the programming language, the APIs, and the renderer. Small warning before we start: this is going to be something of a reverse shit sandwich. Perhaps not quite that extreme, but the filling is definitely a lot better than the bread.

Anatomy of Flutter Part 1: The Dart Programming Language

Dart was conceived at Google as, essentially, a better JavaScript. It was intended to replace JavaScript as the web language of choice. Thus its design goals included that it should:

run on a reasonably sized virtual machine (or VM);
transpile to sensible JavaScript for use in browsers which lacked a Dart VM;
appeal to JavaScript programmers.

Unfortunately, that “browsers which lack a Dart VM” part is what killed the original plan. The one browser the Dart team thought they could count on, Google Chrome, decided that it was already a little on the heavy side. The last thing it needed was to include a second virtual machine, especially on mobile platforms^[2].

That didn’t actually kill the use of the language, though. The “transpile to JavaScript” aspect remained alive, and it evolved into a peer of TypeScript and CoffeeScript. It remains in active use both inside and outside of Google.

Originating as a planned alternative to JavaScript, in its original form it was extremely dynamic. Type annotations were an option the programmer could choose to add. Later, more concrete static typing was added in the form of “strong mode”, which became the default as of the 2.0 release.

I would describe Dart as boring, but in a good way. It’s neat, productive, and pleasant to use. These are the only exciting things about it. Again: this is not a bad thing. Aside from some very neat constructor syntax and perhaps the cascade operator, you are unlikely to be surprised by the content of a Dart codebase.

When I move from Swift to Dart I find myself missing Swift’s optionals, enums, value types and weak references. On the other hand, Dart has language level asynchronous functionality in the form of async/await, which is currently only a proposal for Swift. I think Dart also has a shallower learning curve than Swift, as there is less to learn.

I would say I prefer Swift to Dart, but I prefer Dart to Java. I definitely prefer Dart to JavaScript. Dart is... fine. It’s a totally solid choice for a UI language. If you have a Dart based web codebase you can even share some business logic, in theory^[3].

I have no direct knowledge of this, but: it’s also a solid political choice. Using Java as the Android programming language has caused Google more than a little bit of trouble. I can see why a language which is both open source and stewarded by Google directly makes sense here.

Anatomy of Flutter Part 2: The Flutter Framework

The Flutter UI APIs are very different to both UIKit and the Android framework, being completely declarative in nature. In Flutter you build your UI as an immutable tree of widgets, branches of which are then rebuilt in response to changes in state.

Consider a simple static text label. Using Swift and iOS that would look something like this:

let label = UILabel()
label.text = "Hello world!"

In Flutter it would look quite similar:

final label = Text("Hello world!");

The differences become more obvious when we want the displayed text to change. Here’s one way of doing that in Swift:

let label = UILabel()
label.text = "Hello world!"

func setText(text: String) {
  label.text = text
}

UIKit’s labels are mutable, making this very easy. Flutter’s Text widget is stateless and immutable, though. To update its value we’d need to wrap it in a StatefulWidget and do something like this:

class TextWidget extends StatefulWidget {

  @override
  TextWidgetState createState() => TextWidgetState();
}

class TextWidgetState extends State<TextWidget> {
  String text = "hello world";

  void setText(String text) {
    // setState takes a function which updates the local state
    // as input, then rebuilds the tree from this point down.
    setState(() {
      this.text = text;
    });
  }

  @override
  Widget build(BuildContext context) => Text(text);
}

This is imperfect, however. There’s no good way to get hold of that setter on the state object. Instead we need to handle the change reactively, using something like a Dart Stream:

class TextWidget extends StatefulWidget {
  final Stream<String> textStream;

  // This is the neat constructor syntax I mentioned before.
  TextWidget(this.textStream);

  @override
  TextWidgetState createState() => TextWidgetState();
}

class TextWidgetState extends State<TextWidget> {
  String text = "hello world";

  @override
  void initState() {
    super.initState();
    // This will call [setText] whenever the contents of [textStream] changes.
    widget.textStream.listen(setText);
  }

  void setText(String text) {
    // [setState] takes a function which updates the local state
    // as input, then rebuilds the tree from this point down.
    setState(() {
      this.text = text;
    });
  }

  @override
  Widget build(BuildContext context) => Text(text);
}

Perhaps a better Swift comparison would be to use RxSwift. That would look something like this:

let label = UILabel()
label.text = "Hello world!"
let textObservable: Observable<String>
textObservable.bindTo(label.rx.text)

In either case: Flutter requires more code for these examples, but that’s because I’m leaving out some of the additional code which would be needed for UIKit. The Flutter code is self contained, and it makes the state transforms much more explicit. In the UIKit example, anything which can get a reference to the label can modify its state. Not so for Flutter. The Stream is the only means by which the text can be updated. The setState method is the only means by which the local state and child widgets of a StatefulWidget can be updated in turn. That removes whole classes of bugs.
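If Dart and Swift are both unfamiliar, the same pattern can be sketched in Python. To be clear, this is an analogy and not Flutter’s API: the `Text`, `Stream` and `TextState` classes below are toy stand-ins written for this example. The point it illustrates is that the description of the UI is immutable, and the only way to change what is shown is to push a value through the stream, which triggers a rebuild.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """An immutable description of a label, loosely like Flutter's Text."""
    value: str


class Stream:
    """A tiny stand-in for a Dart Stream: listeners receive each value."""

    def __init__(self):
        self._listeners = []

    def listen(self, callback):
        self._listeners.append(callback)

    def add(self, value):
        for callback in self._listeners:
            callback(value)


class TextState:
    """Holds the mutable state and rebuilds the Text when it changes."""

    def __init__(self, text_stream):
        self._text = "hello world"
        self.built = self.build()
        text_stream.listen(self._set_text)

    def _set_text(self, text):
        self._text = text
        self.built = self.build()  # rebuild rather than mutate

    def build(self):
        return Text(self._text)


stream = Stream()
state = TextState(stream)
before = state.built
stream.add("goodbye world")
print(before, state.built)
```

The old `Text` is never modified. A new one replaces it, which is why nothing holding a reference to the old widget can change what appears on screen.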

I’ve barely scratched the surface here. You can find much deeper and more complete explanations of how Flutter widgets work on the official website. There are also specific guides for Android, iOS and Web developers.

For the most part, I really like this approach to building apps. Some things (infinite scrolling, for example) can be quite hard to achieve with the built-in widgets. I’ve also found a few apparently simple layouts which required complex solutions with Flutter. But here’s where Flutter has a strong advantage: it’s open source all the way down. If one of the default widgets doesn’t quite do what you want, you can just fork the code and adjust it to meet your needs.

I find Flutter to be incredibly productive. I’m amazed at what the small team I worked with over the last week managed to get done. Across my experience of working with Flutter, I’ve also found that it does tend to produce fewer UI bugs. Whereas UIKit pushes you towards an MVC approach to building an app, Flutter pushes you towards a reactive approach. This reduces the number of ways data can flow through the app, and makes it easier to reason about.

Comparing it to UIKit directly, I would say Flutter is close to on-par in terms of API quality. It is, however, significantly less comprehensive. This is hardly surprising. iOS, and UIKit within it, will ship its 12.0 version this year. Flutter has not yet hit 1.0.

Anatomy of Flutter Part 3: The Flutter Renderer

Now we get to the more controversial aspect of Flutter: the part which makes some mobile developers exclaim “What?! Gross!”

In my experience, people tend to put Flutter into the same mental box as React Native. Both are cross platform and both use what are usually thought of as “web languages”. Both strongly push users towards a reactive approach to data flow (obviously). That’s more or less where the comparisons end, however. React Native UIs use the system’s native components. So UIKit on iOS and Android Framework on Android. Flutter does not. It has its own OpenGL based renderer, and creates its own UI components completely from scratch. It handles user interaction directly, and has its own gesture handlers.

Within this sandbox, it emulates the look and feel of the host operating system user interface. In fact, there is a Flutter demo which allows the look and feel to be “flipped”, using iOS iconography and physics on Android and vice versa.

Here is one of the downsides of Flutter: In my opinion, and when built for iOS, this facsimile is imperfect. The UI looks just a little off. The scroll physics aren’t quite right. Animations don’t move in quite the way I would expect. It’s climbing the cliff at the other side of the uncanny valley, but not quite out of it. I have less experience of Android, so find it hard to make a comparison there. But I’m told it does a much better job^[4].

Another point which I think is worth bringing up: Unless you make a specific effort to have it be otherwise, an app built with Flutter will look almost identical on iOS and Android. Flutter will make only minimal changes (e.g. default font, back button icon, title justification) by default. There are the “Cupertino” widgets which mimic the iOS design language, but they lag behind iOS. In fact at the time of writing they do not appear to be up to date with iOS 11, which is more than 6 months old. That being the case: you’re probably looking at shipping a Material Design app if you use Flutter. Now, I don’t think that’s necessarily a bad thing on iOS, but it is worth bearing in mind.

Summation

I have to admit that I like developing with Flutter a great deal. Aside from a few frustrations it’s a genuinely great framework, though I wish its underlying programming language was Swift, rather than Dart. I’m never completely happy with the app which results, however. I really, really wish Flutter produced native OS UI components, rather than its own OpenGL rendered widgets.

John Gruber’s little birds tell him Apple’s rumoured cross platform UI framework^[5] is based on a declarative paradigm. If that turns out to essentially be Flutter, but written with Swift and producing native OS UI components I will be absolutely thrilled. That would be a serious sweet spot, in my opinion.

If I was going to make recommendations to the reader, they would be:

Consider using Flutter if you need to build a cross platform app, especially if you need to build for both iOS and Android in a hurry. Remember, though, that the differences between the platforms are more than fonts and scroll physics;
If you’re building just an Android app it might also be worth considering. Android developers tell me the Flutter APIs are a step up from the Android UI APIs, and Android users tell me Flutter based Android apps feel totally solid;
If you’re building for iOS: beware. Flutter apps don’t quite feel native to the platform. UIKit and something like RxSwift might be a better option;
If you’re really forward looking and want to build for Fuchsia (or whatever it ends up being called), then yes: definitely use Flutter. I suspect that’s very forward looking, though.

Edit (22/05/2018): Added a link to the Flutter for iOS Developers resource, removed the new keyword from the Dart code and removed the e from Michał because spelling his name is even harder than I thought.

S’up Michał, if you’re reading this. I hope you appreciate the ł. ↩︎
Obviously I’m anthropomorphising here. Google Chrome is a piece of software. It does not make design decisions for itself. Yet. ↩︎
The biggest issue I’m aware of here is that certain mathematical operations are not guaranteed to be identical when transpiled to JavaScript. ↩︎
This isn’t a knock against either Android or Flutter itself. It stands to reason that a team within Google would have an easier time reproducing the Android look and feel than the iOS look and feel. Need to quantify the scroll physics used by Android? Just look at the code. That’s not really an option for iOS. ↩︎
Cross platform in this case meaning iOS and macOS, plus potentially tvOS and watchOS. The Apple ecosystem only. ↩︎

Fast.ai via iPad with Paperspace and Juno App

Nick Johnson — Wed, 09 May 2018 12:00:00 GMT

Note: This is a repost from my other blog.

Having started Fast.ai’s Practical Deep Learning for Coders course, the first thing I noticed is how much less structured it is than Andrew Ng’s Coursera Deep Learning Specialization (non affiliate link).

Fast.ai supplies you with the Jupyter notebooks needed for the assignments, but here a lot of the setup is down to you. At first I was a little frustrated by the extra work that Fast.ai was making me do. Then I came to the conclusion that it’s actually a good thing. In the first instance, the less controlled environment is better preparation for actual problems.

In the second, it means I can try doing the whole course via iPad. I’ve already noted that Jupyter in the browser is a pretty miserable experience on iPad. Happily there’s an excellent native Jupyter app called Juno, which solves that problem nicely. But a bit of extra work is needed to get it working well.

I decided to use Paperspace^[1] (Fast.ai’s recommended option) as my GPU cloud for this course. There are instructions for setting up Paperspace for fast.ai here. Once you’ve done that, your workflow will look something like this:

Start your instance via the Paperspace console;
Log in via ssh and start Jupyter;
Copy the URL with the magic token;
Paste it into your browser, replacing localhost with your instance’s public IP;
Hack hack hack;
Shut down your instance via the Paperspace console.
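
Step 4 is a mechanical string edit, so here is a minimal sketch of it in Python, in case you'd rather script it than retype it. The function name, the token and the IP address are made up for illustration; the only assumption is that Jupyter prints a URL of the usual `http://localhost:8888/?token=...` shape.

```python
from urllib.parse import urlsplit, urlunsplit


def point_at_instance(jupyter_url: str, public_ip: str) -> str:
    """Swap the host in a Jupyter URL for the instance's public IP.

    The scheme, port, path and token are left untouched.
    """
    parts = urlsplit(jupyter_url)
    port = f":{parts.port}" if parts.port else ""
    return urlunsplit(parts._replace(netloc=f"{public_ip}{port}"))


# Hypothetical token and IP, for illustration only.
print(point_at_instance("http://localhost:8888/?token=abc123", "203.0.113.7"))
```

With the made-up values above this prints `http://203.0.113.7:8888/?token=abc123`.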

Steps 3 and 4 don’t work so well for Juno, and step 2 is also pretty superfluous. We can eliminate these by turning on password authentication and automatically starting Jupyter on boot.

Password authentication comes first, which will make connecting via Juno a lot easier. I’m assuming you’ve followed the setup I linked to above. Start your instance and log in via the terminal. Now run this on the command line:

cd fastai
jupyter notebook password

Then give it your chosen password. Next: run Jupyter on startup. Type this on the command line:

crontab -e

Now add this to the bottom of the file which opens:

@reboot cd ~/fastai; source ~/.bashrc; ~/anaconda3/envs/fastai/bin/jupyter notebook >> ~/cronrun.log 2>&1

Even though Jupyter will now start automatically, there are still reasons to log in. You’re going to need to download additional datasets, for one thing. ssh would be the usual means of doing so, but from the iPad mosh (short for “Mobile Shell”) is a more robust option. I’m using an app called Blink for that.

Paperspace machines are not set up to allow the ports mosh uses by default. So you’ll need to open one, like so:

sudo ufw allow 60001

After that mosh should work just fine.

That’s an affiliate link which will get you $10 of credit. If you prefer a non-affiliate link there’s one here. If you go that route and still want the $10 credit, you can use my code, which is: AAGWLUH. ↩︎

Some Notes on Coursera’s Andrew Ng Deep Learning Specialisation

Nick Johnson — Mon, 07 May 2018 12:00:00 GMT

Note: This is a repost from my other blog.

As with my previous post on Coursera’s headline Machine Learning course, this is a set of observations rather than an explicit “review”. There’s a heavy dose of “your mileage may vary” here. I’m aiming to lay out a set of objective observations about the course to help the reader decide if the course will be useful to them. That said: There will be opinions here.

I’ll be using that same ML course as a reference for comparisons. I’ll also make a comparison to the Udacity “Introduction to Machine Learning” course I mentioned in the previous post. That’s a lot of “learning”, so I’ll be using the following acronyms to help maintain my sanity:

CML - Andrew Ng’s Coursera Machine Learning course, originally taught at Stanford University;
UIML - Sebastian Thrun and Katie Malone’s Udacity Introduction to Machine Learning course;
CDLS - The Coursera Deep Learning Specialisation by Andrew Ng’s DeepLearning.ai, i.e. the subject of this post.

If at any point I’m talking about a course but haven’t specified which: assume it’s CDLS.

Cost

Let’s get this out of the way first. Whereas CML could be fully “audited” for free, CDLS cannot. To be clear: you can get just about everything out of CML, including grading of assignments, without having to pay a penny to Coursera. If, at the end of the course, you want a digital certificate: that will cost you £60. But if you don’t care about that, you don’t have to spend the money^[1].

CDLS, on the other hand, is subscription based. At the time of writing it costs £37 per month. You can watch at least some of the lectures without paying that, but you can’t do any of the coding assignments or access the course forums. The charge is fair enough, in my opinion. The content is new, and you’re learning from a master. This course requires more support resources than CML, as well. I’ll get to that in the “Coding Assignments” section. In fairness, it’s also a business, not a charity.

Course Content

The specialisation is actually made up of five separate courses. In order, these are:

Neural Networks and Deep Learning (4 weeks);
Improving Deep Neural Networks: Hyperparameter Tuning, Regularisation and Optimisation (3 weeks);
Structuring Deep Learning Projects (2 weeks);
Convolutional Neural Networks, aka image processing (4 weeks);
Sequence Models, aka language processing (3 weeks).

You can choose to take them in any order, or to skip any you’re not interested in. Each builds on those before^[2], though, so my advice would be to take each course in the specified order. The length of each course in weeks is really just a guidance figure. Each “week” is actually a little under 2 hours of video lectures, plus graded assignments. You can take it as quickly or slowly as you like^[3].

The first three weeks of the first course overlap quite heavily with the parts of CML which teach neural networks. The third course also has some crossover with the sixth (Advice for Applying Machine Learning and Machine Learning System Design) and tenth (Large Scale Machine Learning) weeks of CML. The material has been updated, however, and made more applicable to deep neural networks.

Teaching Method

The teaching methodology is basically identical to CML. Ng talks to the camera, or he talks whilst annotating slides. It is, again, a pretty direct conversion from an in person classroom lecture to video format. There are a couple of exceptions to this in the main lectures, which show Ng interacting with an implemented system.

The other exceptions to this format are the optional “heroes of deep learning” interviews which are included at the end of five of the lectures in the first two courses. The subjects of the interviews are: Geoffrey Hinton, Pieter Abbeel, Ian Goodfellow, Yoshua Bengio, and Yuanqing Lin. In my opinion the Hinton interview is the one most worth your time.

Again as with CML, each week’s lecture is broken down into more focussed individual videos of between 5 and 15 minutes.

I have to mention the first quality issue here. In some places it’s actually really badly edited. Ng sometimes makes false starts and begins again. At first I thought the video was skipping, but then I noticed small changes in what Ng said at the times he repeated himself. This happens about once a video, on average. There are also occasional long pauses in the dialog, suggesting that Ng has lost his place in his notes. I find it pretty baffling that these glitches haven’t been edited out. Hopefully it’ll get fixed in an update to the course at some point.

Alongside the videos are the graded assignments. Every week there is a quiz, usually with 10 to 15 questions. A score of 80% or higher is required to pass, but you’re allowed to retake the quiz if you initially get some of the questions wrong. Most weeks also have coding assignments. This isn’t true of the very first week, which serves as an introduction. Nor is it true of the third course, Structuring Deep Learning Projects. This is assessed with longer quizzes, which it calls “machine learning flight simulators”.

Editing issues aside, the teaching worked really well for me. In particular, I feel like I came out of the sequence models course with a level of understanding I’ve failed to get from other sources.

Coding Assignments

The coding assignments are something I feel CDLS really gets right. To recap, both CML and UIML have you download datasets and outline code. You modify the code locally, run it to make sure it works, and then either submit it for online assessment or answer questions about its output. CML has you code (almost) everything from scratch using Matlab/Octave. For UIML you use Python, and mostly parameterise library implementations of the relevant algorithms from SciKit Learn.

CDLS uses the best parts of both of these approaches, in my opinion. The teaching language is Python. You begin in the first course by coding neural networks and optimisation algorithms completely from scratch. This is done using NumPy, which essentially adds most of the numerical computing features of Matlab/Octave to Python. The second course then introduces TensorFlow, a much higher level framework which does a lot of the work for you. Later in the fourth course, Keras (an even higher level framework) is introduced and used. Still, even during the later assignments you might occasionally use NumPy when it makes sense to teach you about an algorithm or technique.
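
To give a flavour of what “from scratch” means here, below is a minimal sketch of a single sigmoid unit trained by gradient descent. This is not the course’s assignment code: the assignments use NumPy and vectorised operations, whereas this sketch sticks to plain Python so it runs without dependencies. The function names, the toy OR-gate data and the hyperparameters are all my own assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, learning_rate=0.5, epochs=5000):
    """Train one sigmoid unit on (features, label) pairs using cross-entropy loss."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in samples:
            # Forward pass: weighted sum, then activation.
            a = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Backward pass: for sigmoid + cross-entropy, dLoss/dz = a - y.
            dz = a - y
            for j, xi in enumerate(x):
                grad_w[j] += dz * xi / m
            grad_b += dz / m
        # Gradient descent update, averaged over the whole batch.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, x):
    activation = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if activation >= 0.5 else 0


# Toy data (made up for illustration): the logical OR of two inputs.
or_gate = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(or_gate)
print([predict(weights, bias, x) for x, _ in or_gate])
```

A framework like TensorFlow or Keras hides the forward pass, the gradient calculation and the update step behind a few calls, which is why writing them out by hand first is useful.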

Rather than having you download the code and run it locally, CDLS instead uses Jupyter Notebooks hosted by Coursera. This is a web based IDE, which allows you to code in your browser. It mixes code, blocks of descriptive text, formulas, and images. This makes it an excellent teaching tool. Being browser based, you can also access it from anywhere without needing to download anything^[4]. I actually did one of the assignments from my iPad when I couldn’t use my laptop^[5].

Here, unfortunately, I also need to mention another quality issue. Two of the assignments in the last course had incorrect “expected output” values. The upshot of which was that I spent over an hour in total trying to figure out how to “fix” my code, when it was actually working perfectly. In both cases I eventually discovered “errata” forum posts detailing the issue. I guess the lesson here is: always check the forums first. Again: it’s pretty frustrating that this hasn’t been fixed in the notebook and I needed to refer to the forum at all.

It’s worth noting here that server time isn’t free. Running the Jupyter notebooks for the coding assignments is one of the things your £37 a month^[6] is paying for.

I do have a small complaint about the specifics of that server time, though: it’s running on CPU instances. This means that training times are much longer than they would be on a GPU instance. In the earlier courses this means you might need to occupy yourself for 15 minutes whilst your network trains. In the later courses training just becomes infeasible and you work with pre-trained models instead. I can only assume that using GPU instances would make the course prohibitively expensive, either for the user or Coursera themselves.

Prerequisites

Going into this course, you should already have some experience of coding in Python. Fortunately Python is an easy language to pick up. HackerRank is my favourite “coding dojo” for when I need to skill up on a new programming language. I think their “30 days of code” challenge is probably a good place to start for someone new to coding.

Additionally, I would recommend doing the CML course first. It’s not all directly relevant, but will give you some good intuition and background. For example: when Ng uses the phrase “large margin classifier” in CDLS, you’ll know exactly what he’s talking about right away.

What’s Next

Broadly I’m sticking to the plan I wrote at the end of my write up of CML. Next: Fast.ai’s Practical Deep Learning for Coders.

For the record, I did spend the money. In the first instance: curiosity about what the certificate would actually be. In the second: I thought the course was great and felt Coursera had earned the money. ↩︎
To a greater or lesser extent. The earlier courses are quite foundational, the latter ones less so. That said: you will miss some nuance in the Sequence Models course if you skip Convolutional Neural Networks, for example. ↩︎
Though if you take it too slowly the system will by default start to bug you with notifications and emails. ↩︎
Aside from the content of the web page. Obviously. You know what I mean. ↩︎
I don’t recommend that, though. Jupyter breaks the inertial scrolling in Mobile Safari, which can make it pretty frustrating to use. ↩︎
Or local equivalent. ↩︎

I Ain’t Dead

Nick Johnson — Sun, 06 May 2018 21:45:20 GMT

I know, it’s been pretty damn quiet around here. Aside, of course, from a single repost from my other blog. But I’d like to cite two pieces of evidence to prove that I haven’t been doing nothing at all:

This site has a shiny new theme;
I’ve actually been posting multiple times a week over at my Future Technology blog.

Let’s talk about the theme first. It’s called Cedar, and I paid actual money for it over at Theme Forest. Why pay money for a theme, when there are plenty of free ones available? I didn’t like any of the free ones enough, essentially. I tend to believe that the intersection of talent and compensation leads to high quality work. I was pretty happy to pay a fair price for a good design^[1].

I know just enough web development to be dangerous and I wrote this site’s previous theme. It was… functional, but I was never completely happy with the design. It also had a bunch of odd CSS issues, and I didn’t relish the thought of updating it if a new major version of Ghost breaks compatibility.

As suggested above, this site still runs on Ghost, whereas ftrsn.net is Wordpress based. Having used Wordpress pretty solidly for a few months, I’d like to talk a little about the things I like about it. Then I’d like to talk about why I have no plans to move this site over to it, and why a future project I’m working on will likely also use Ghost.

One point in Wordpress’ favour is that it’s ubiquitous. What I said before is still true now. Getting up and running with Wordpress can be both free and easy^[2]. This ubiquity means that there are plenty of themes available for it, and plugins to do just about anything you might want. I try to run as few plugins as possible on the site, on the basis that any of them could have a security flaw. But still, I currently have eight active plugins:

Disable Comments. I don’t use them on the site, this makes sure there isn’t even the option to accidentally enable them;
Jetpack by Wordpress. To add all of the missing features which otherwise would require hosting at Wordpress.com;
Google Analytics. To fill in the gaps Jetpack misses, although I am looking at alternatives;
Limit Login Attempts. For security;
miniOrange. Adds 2 factor authentication, again for security;
Really Simple SSL. To force https to be used instead of http when possible;
WP Super Cache. To try and pretend that Wordpress isn’t painfully slow;
WP to Twitter. Handles automatic posts to Twitter much better than Jetpack.

I also have 2 deactivated plugins which I keep around in order to work around a few missing capabilities of Wordpress:

Categories to Tags. Because otherwise I’d need to use database queries(!) to do this;
Post Type Switcher. As above.

As for themes: Wordpress makes it really easy. If the theme is available on Wordpress’ marketplace, you can install and activate it from the Wordpress console. Any configuration parameters can then be edited from the console as well, with the live UI to show you the changes. You can also edit the theme files directly, again from the Wordpress console. Don’t do that, though. In the first instance it might break. In the second: when you use Wordpress, your website is a sausage. You do not want to see how it gets made. In short, though: You can edit your site from your site.

One thing which I have nothing but good things to say about is the workflow I’ve been able to achieve with Wordpress. Ulysses, my editor of choice, has Wordpress export built in. I can seamlessly move between writing on my laptop and my iPad^[3]. The iPad, when coupled with a smart keyboard, turns out to be an absolutely spectacular machine for this. It’s light and small enough that I barely even notice if it’s in my bag. It easily lasts a week between charges with my usage. I was also able to build workflows which make the process of creating posts in the format I use for ftrsn.net incredibly easy. Once written, posts take two taps to publish.

Now it comes to what I don’t like about Wordpress. It essentially boils down to this: it’s gross. The Wordpress console might be powerful, but I never look forward to using it. It feels like a Frankenstein app, hacked together out of miscellaneous spare parts. When you activate a plugin, it’s free to do just about anything it likes. I’ve found at least three different places in the console UI where plugins might install their settings pane. It’s clunky, and slow. The version of the Wordpress console used on Wordpress.com is quite clean and pleasantly designed, but clearly Automattic are keeping that for themselves. But even then, I know it’s php under the covers.

But what of Ghost? I did say I was sticking with it and planning to use it again in the future.

Ghost tends to be more expensive to host (especially if you use Ghost’s own plans, which are now aimed at professional publishers^[4]). As for ease of setup, the new Ghost CLI has actually made that pretty easy. With it in place, I think I’d actually be more comfortable self hosting Ghost than I would Wordpress.

At the time of writing there is no such thing as a Ghost plugin. There is a thing called an “app”, but thus far only first party apps are available. Happily with Ghost I don’t really need much in the way of plugins at the moment. What I do need can be accomplished via code injection, IFTTT, or with the first party apps which are available.

Ghost also has no UI for editing themes. Themes are uploaded as .zip files, and applied to your site as is. This means that to configure Cedar to work the way I want I had to edit 1 JavaScript file and several Handlebars files directly. Nothing too onerous, but still more hassle than the Wordpress equivalent.

The workflow for uploading to Ghost isn’t as nice. I have to copy and paste into the web UI, rather than just clicking two buttons in Ulysses. But that’s fine for a blog I update every cough few weeks cough (as opposed to one I update a few times a week).

Despite all of the above, I really like Ghost. Workflow issues aside, it’s a pleasure to use. It’s fast and well designed. If I need to edit the theme of my site I’m not horrified by what I find there.

Going forward a lot of the long posts I had planned for ftrsn.net are going to be cross posted here, or posted here and linked to from there. They’re about my journey in trying to skill up on Machine Learning, and that feels like more of a good fit for this site.

It actually shook out to just under £20 once VAT and credit card fees were applied. ↩︎
Although there are options which are neither, if that’s what you’re looking for. ↩︎
Or even my iPhone, if I need to make a quick edit on the go. ↩︎
Thankfully they’re keeping me on a “legacy” plan, which is only costing me a few pounds a month. ↩︎

Some Notes on the “Andrew Ng” Coursera Machine Learning Course

Nick Johnson — Wed, 25 Apr 2018 08:30:13 GMT

Note: This is a repost from my other blog.

I was originally going to write this as a “review”, but this course is now considered such a foundational resource that writing a review would feel presumptuous and redundant. Then I was going to write it as a list of pros and cons, but I came to the conclusion that this would probably be subjective. So instead I’m writing a set of notes to be interpreted by the reader.

I originally started the Udacity Introduction to Machine Learning course in preference to Coursera’s Stanford University Machine Learning course^[1], for reasons which I’ll come to. As for why I switched, I’ll come to that as well. I’ll use the Udacity course as a point of comparison throughout. Please note, though, that I’ve only followed the first 4 weeks of the Udacity course.

Teachers

The Coursera course is taught by Andrew Ng, Professor at Stanford University, former chief scientist at Baidu and co-founder of the following things:

Needless to say, he knows his stuff. He also delivers it in a very direct, understandable and sometimes self-effacing manner.

The Udacity Course is taught by Sebastian Thrun and Katie Malone. Sebastian Thrun is also a Professor at Stanford University, as well as Georgia Tech. He led the team which won the DARPA Grand Challenge in 2005. He also co-founded:

Google X;
The self driving car project within Google X which became Waymo;
Udacity itself.

Katie Malone is currently the Director of Data Science Research and Development at Civis Analytics. A Stanford PhD, she was an intern at Udacity when the Intro to ML course was made (I think). She handles around 70% of the teaching in the course.

At the time I started the course, I had not heard of Andrew Ng, but was very aware of Sebastian Thrun. So that was (at the time) a point in favour of the Udacity course.

Looking back now that I know more about Ng, I’d say it’s quite hard to pick who has the more impressive CV. It feels^[2] as though Andrew Ng is a little more respected in the machine learning field, however.

Teaching Method

The teaching method of the Coursera course is a fairly direct conversion of a standard in-person lecture. Ng talks directly to the camera, or talks while digitally annotating his lecture slides.

The Udacity course plays with the format much more. Having two course leads means that there can be dialogue between them. Sections of some lectures are delivered from inside a self driving car. The leads also joke with each other at times. Making fun of each other’s taste in music, for example, as part of the explanation of a music recommendation system.

Your mileage may vary regarding which of these teaching methods works best for you. I found those used in the Udacity course to be more engaging. It felt as though it used more of the potential of an online course than the approach the Coursera course takes. At no point in the Coursera course does Andrew Ng sing “Let It Go”. I’ll let you decide whether this is a positive or a negative.

That said, I think I retained more of the knowledge from the Coursera course, so perhaps its explanations were clearer. I suspect it also has a lot to do with the next point.

Coding Assignments

The Udacity course uses Python as its teaching language. The Coursera course uses Matlab/Octave.

Superficially, this is a huge point in the favour of the Udacity course. Python is essentially the language of machine learning at this point. It also has a much bigger ecosystem surrounding it^[3].

For me, at the time, that was enough to make me choose to move forward with the Udacity course.

When I revisited the Coursera course I realised that there’s another significant difference. The Udacity course imports its implementations of the various algorithms from SciKit Learn. Most of your work in the first three assignments is to initialise the correct class from the library, set it training on the data, and wait.

For the Coursera course, on the other hand, you implement almost everything from scratch. As a result, you learn more about how the algorithms actually work.
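
As a rough illustration of the difference, here is a sketch of what implementing an algorithm from scratch looks like, using linear regression fitted by gradient descent. It’s written in Python rather than the Matlab/Octave the Coursera course uses, and it is not taken from either course: the function name, data and hyperparameters are assumptions of mine. With a library you would get the same result from a single `fit` call, without ever seeing the update rule.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step downhill.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the fitted values settle very close to a slope of 2 and an intercept of 1.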

In real world use the first methodology makes a lot of sense. There’s no need to reinvent the wheel. For educational purposes I think it’s preferable to learn the lower level nuts and bolts of the algorithms. Even if the choice of language and programming environment is somewhat... suboptimal, in my opinion^[4].

One further difference: the datasets are much larger in the Udacity assignments. Depending on how powerful the machine you’re using is, it might take 15 minutes or so to finish training the models for the assignments. You’ll also need to download a roughly 4GB zip with the data before starting the first assignment.

Course Content

The Udacity course is 10 weeks long, whereas the Coursera course is 16 weeks. That being the case, clearly the latter has more than 50% more room for content. Even so, the Udacity course teaches several shallow learning methods^[5] which are not present in the Coursera course. The breadth of the Coursera course is much larger, however, and it’s the only one of the two which covers neural networks^[6].

You can read the syllabus of both courses before enrolling, so it’s easy to see whether a particular technology of interest is present.

Prerequisites

In both cases, I would say that some programming experience is needed. Both Python and Matlab/Octave are reasonably easy to pick up, though. Based on the assignments I finished, the Udacity course requires less actual programming.

Please note: If your machine learning needs are limited to training and deploying existing models, you might only need to learn a bare minimum of coding in order to do so.

What’s Next?

This course was the first part of a syllabus I built for myself when I started trying to skill up on machine learning. It wasn’t part of my original plan, but Andrew Ng released his new Deep Learning Specialisation on Coursera just as I was finishing the last few weeks of his Machine Learning course. That seemed somewhat serendipitous, so that’s what I’m working on now. I’ll write a similar set of notes on that course after I’ve finished.

After that I plan to follow both of fast.ai’s deep learning courses. And after that I’m planning on following the Philosophy of Mind Series from the Great Courses. I might also finish up the Udacity Machine Learning course to help fill in the gaps in my shallow learning knowledge. That said: No battle plan survives contact with the enemy^[7], so I guess I’ll see.

On an entirely different strand of learning, I’m also following the Princeton University Bitcoin and Cryptocurrency course on Coursera.

Discuss this post on Hacker News.

Which tends to get referred to as “The Andrew Ng Machine Learning Course”, hence the title of this piece. ↩︎
To me, at least. ↩︎
It also numbers its arrays from zero, just like God intended. ↩︎
Matlab/Octave is not an environment you’re likely to use in production. It also numbers arrays from 1, where most other programming languages number from 0 (as I noted before). This difference can definitely lead to bugs if you’re not careful. ↩︎
Such as Naive Bayes and Decision Trees. ↩︎
Which is probably what you’re interested in if you’re starting a machine learning education today. ↩︎
Or “everyone has a plan until they get punched in the face”, if you prefer Mike Tyson’s trainer’s version. ↩︎

International Men’s Day

Nick Johnson — Sun, 19 Nov 2017 17:59:56 GMT

As promised back in my International Women’s Day post, I’m now writing about its counterpart. International Men’s Day is the day I’m posting this: the 19th of November. As noted in the previous post: some people would say that we don’t need this day, because the other 364 days of the year already serve that purpose. I say we do need it. In the first instance because that’s not the way it should be.

In the second instance I say we need it because the fact of the matter is: Men have problems too. I hope we can accept that without feeling like it devalues women’s problems^[1]. It would be laughable to claim that men’s problems approach the number or severity of those which women face. So please don’t think I’m doing that. But we, men, do nevertheless have problems. Beyond that fact, many of the problems women face boil down to one thing: men. Us. That too is our problem.

I’m going to be talking directly to other men in this post. If that’s not you, that doesn’t necessarily mean that it doesn’t apply to you. Especially if there are men in your life whom you care about. Conversely, I don’t claim that I’m speaking for all men. Just that I’m speaking as one.

The evidence would suggest that men (as a population) don’t do right by women (as a population). So as not to belabour the point that I’m not talking about individuals: you can assume from this point on, that whenever I say “men” or “women” I’m talking about populations. The thing is: men don’t do right by men, either. We need to work on both. Perhaps making progress in the latter will help us with the progress we need to make in the former.

When I wrote the IWD post, I barely even had to think about which of the women who’ve inspired me I’d talk about. It was obvious. As it turns out, I find it much harder to think of inspirational men to talk about on the same terms. I don’t want to underplay the role my father and both grandfathers have had in my life, because truly I would not be the person I am today without their support, their love, and most of all their example. Beyond them, though, it’s hard to think of anyone. There are historical figures I find inspirational. Men I’ve never met, such as Mahatma Gandhi, Alan Turing and Douglas Adams. But as I say: I’ve never met them. Their influence on me is indirect and vicarious.

Which raises the question: Where should we look for male role models in the here and now? Or if not the here, at least in the now? I’ll come back to this.

In the meantime, though. Let’s talk about Chester Bennington. The former lead singer of the rock band Linkin Park, who committed suicide earlier this year. You might not be a fan of his music. People certainly seemed to feel the need to tweet about it if they weren’t on the day he died. For me, he was one of my favourite singers. I have a real weakness for people who sing like it’s the only way to scare the demons away^[2]. The problem, though, is that sometimes the demons catch up.

Along with two of his band mates (and Dr Ken), Chester recorded an episode of “Carpool Karaoke” days before he died. His family asked that it be released. You should watch it. Not because it’s especially good, because in my opinion it’s not. It’s awkward and weird. The music and singing really aren’t that great. You should watch it because it’s a video of a man recorded days before he would take his own life. It’s there. In that video. Somewhere inside him. The thing which would kill him. And it’s almost impossible to see.

Even if you don’t care one iota about Chester Bennington, we need to talk about Chris Cornell. Robin Williams. The list goes on. I could list more but I just don’t want to.

Perhaps you know that rates of suicide are much higher for men than for women. At least, I hope you do. What you might not know, though, is that the rate of suicide attempts is much closer to even, and may even swing the other way. The main reason for this, it seems, is the choice of method. Men choose methods which end their lives there and then. Women choose methods which take time and allow them to be saved, either by themselves or by others. Women’s suicide attempts tend to be of a kind which gets labelled as a “cry for help”.

Look at that. Turn your head to the side. Squint your eyes. Eventually it starts to look like what’s happening here is that men don’t cry for help. We don’t even ask for it. We should start.

So, let’s loop back around to the question of role models. The short answer is: I can’t tell you who your role models should be. But I can perhaps suggest some places to look. Tim Ferriss has spoken openly about his own battle with depression and suicide. My first suggestion of a place to look is in the guests of his podcast. Even if you don’t like Ferriss himself, his list of interviewees is phenomenal. Start, I’d say, with Sebastian Junger and perhaps his book Tribe.

Masculinity gets a bad rap. It’s easy (and sometimes apt) to attach the word “toxic” as a prefix. After a while you can’t think the first without at least part of your mind adding the second. But for men, masculinity is basically unavoidable. Toxic masculinity is bad. But so, I think, is toxic lack of masculinity. The hard part is figuring out the right amount and, more importantly, the right kind. The hard part is figuring out what exactly that looks like. I’m honestly not certain that I can even give a satisfying definition for the word.

Another place to look is The Art of Manliness blog and podcast. It’s dedicated, more or less, to figuring that out.

That’s all I have for now really. I don’t think I’ve even scratched the surface of the issues here^[3]. I'm sorry it's so disjointed. Happy International Men’s Day, everyone.

If you don’t think that’s the case then I humbly submit that you might be part of the problem. ↩︎
See also: Lacey Sturm. ↩︎
Maybe next year we should talk about violence. ↩︎

Blogging on the Quartz Curve

Nick Johnson — Mon, 06 Nov 2017 12:00:00 GMT

A little while ago, over dinner, a good friend of mine introduced me to something called the “Quartz Curve”. Named for the online news magazine which coined it, it goes like this: if you plot the length of an article against user engagement, the resulting graph is bowl shaped. Specifically, the trough is between 500 and 800 words.

Shorter articles can be “Short, sharp creative takes on news stories that are creative and say something new”. Conversely, longer articles can provide “strong detailed narrative or insightful analysis”. Anything in between, though, will tend to be wishy-washy, and no one wants that.

In the words of Quartz editor-in-chief Kevin Delaney:

The place between 500 and 800 words is the place you don't want to be because it's not short and fast and focused and shareable, but it's not long enough to be a real pay-off for readers.

The standard of production for most traditional news organisations is still somewhere within that range. For a digitally native organisation there's an opportunity.

This got me thinking about this blog, the kind of articles I write here, and other blogs which I find to be worthwhile. Generally speaking, I aim for about 1000 words per article, which puts me (just) north of the Quartz Curve danger zone. I’ve been aiming for about one article every two weeks this year, but my average has been closer to one every three. Another point is that this blog has no real focus, aside from things I’m currently interested in.

When I think of other blogs (rather than bona fide news sites) which I consider really worthwhile, they do tend to mostly line up with the Quartz Curve. In terms of blogs which feature long, insightful (but somewhat infrequent) posts, two good examples are Ben Thompson’s Stratechery^[1] and Nick Szabo’s Unenumerated. Both sit at the intersection of technology and law. The first has a roughly weekly cadence and long but not unwieldy posts. The latter has posts as and when the author feels like it, with months going by between posts. The resulting posts are... comprehensive. It would be fair to say that Szabo only speaks when he has something to say.

Stratechery has quite a strong brand. It sits at its own domain. It also has a cohesive design, down to the “hand drawn”^[2] style of the diagrams Thompson uses. Unenumerated, on the other hand, sits at a vanilla blogspot sub-domain. It looks to use the default blogger template. Lastly, Szabo’s diagrams and images tend towards the... functional. So there are definitely similarities and differences between the two.

Another example of note which straddles both walls of the Quartz Curve is John Gruber’s Daring Fireball. It’s probably the most notable and popular example of the “linklog” blog format^[3]. Most posts are short commentary on external sources. The title of the post actually links to the original source and the same is true of the entries in the RSS feed (example). These posts are very frequent (usually Gruber posts multiple times a day), and range across the author’s interests^[4]. I often disagree with Gruber (he really can’t see straight when it comes to Google and other companies which compete with Apple), but he can also be very insightful. Particularly in his less frequent longer posts, such as this fantastic example from way back in 2004. That said, even at his most insightful, Gruber remains highly opinionated and lacks the critical detachment of Thompson or Szabo.

Which brings me back around to this blog and my own writing. I do plan to keep writing here. I’m also pretty keen on trying something a little more focussed. I want to try mixing together the short commentary with longer, more in depth and thought out writing. To that end I’ve started working on two other blogging projects.

The first I’ve launched already. It’s a blog about future technology called “Future Soon”^[5]. You can find it at ftrsn.net. Generally, it follows the “linklog” format. Originally I followed it exactly, but found some issues with that, so instead each post now begins with a “Source” attribution. I’m planning on posting these short referential posts with something like a daily cadence, and then writing occasional longer posts when I feel like I can add something. So Gruber’s format, with longer articles using Szabo’s cadence and Thompson’s style. Probably.

As to what issues I had with the linklog format... well, first I’ll have to tell you why I’m using Wordpress, despite previously expressing my distaste for it. It comes down to two things: price and ubiquity. I like Ghost (which this blog uses) a lot, but there’s no getting around the fact that it is more expensive. Compare the cost of hosting Ghost with its makers vs the equivalent options at Wordpress.com. For this experiment I actually went even cheaper than that. I’m paying about £1.25 a month for Wordpress hosting from TsoHost^[6]. This is much easier for me to justify to myself.

I could also self host Ghost, which is an option TsoHost offers for an extra £3 a month. Which is where the next issue with Ghost shows up: its lack of ubiquity. I trust Ghost themselves to support a Ghost blog. They’re the experts. I have less faith in a third party, and even less faith in a second party (meaning me).

I’m also planning to run ftrsn.net almost entirely from my phone and iPad. Wordpress has a really solid iOS app, plus good integration with both Ulysses and iA Writer. Running a Ghost blog entirely through the web interface on a smaller device creates far too much friction.

As I said before, I originally followed the linklog format exactly. This meant installing a plug-in and editing the theme to work correctly with it. It all worked fine, but it made me nervous. One of the main things I dislike about Wordpress is the hot mess which is its plug-in and theming system. One incompatible update could break everything. As it stands I’m using a slightly customised version of an existing theme, and very few plug-ins, which feels much safer. Should the blog go well enough that I don’t mind paying a bit more per month, I can easily transfer it to the safety of Wordpress.com later.

Likewise if using Ghost from an iOS device becomes a better experience I'd be tempted to pay the extra for Ghost hosting. It really is that much more pleasant.

Now... I did mention a second other blog project. That’s going to be a travel site, of sorts. More about it when it’s ready. Which it isn’t.

That’s stra-tech-ary, not strate-cherry. ↩︎
I’m guessing with an Apple Pencil. ↩︎
Nick Heer’s Pixel Envy is another good example. ↩︎
Chiefly Apple related tech news, but also baseball, plus Kubrick and James Bond movies, among other things. ↩︎
Yes, it is named after a Jonathan Coulton song. ↩︎
If you click through here, you can use the code “HarveyNick” for a 10% discount on hosting, which will kick a small amount back to me. ↩︎

Mostly an iPad Followup

Nick Johnson — Sun, 13 Aug 2017 19:57:14 GMT

Somewhat appropriately, I finished and edited my previous post about switching to an iPad for many of my computing use cases on an iPad. Luckily I have access to one which belongs to my employer and happens to be running the iOS 11 Beta. Since I was traveling at the time, I could also grab the Magic Keyboard which usually sits on my desk for typing purposes. Here are some observations.

First of all: this obviously wasn’t an exact facsimile. The iPad Air 2 I was using has a slightly smaller screen and significantly less processing power than the iPad Pro I’m considering. It also doesn’t support the Apple Pencil, so that’s not something I could test in the field. Lastly, much as I love the Magic Keyboard (it might be my favourite keyboard ever, in fact) it’s far from an exact replication of the iPad Pro Smart Keyboard. It’s a bit nicer to type on from what I can tell, but it’s nowhere near as mobile, and it doesn’t attach to the device for more laptop-like use.

With all of that in mind, the first thing I’d note is that the experience of typing text was really good. I did some work in iA Writer, then moved it into Ulysses in the interests of trying something new (for the record I’m typing this in Bear for much the same reason). All of this worked really nicely. Ulysses in particular came pretty close to pulling me all the way over.

When it came to editing it was great to be able to move to a chair, flip the device to portrait and change the context as much as possible. I found that the on-screen soft keyboard was more than equal to making minor edits here, especially with the improved keyboard in iOS 11.

One thing which didn’t work quite so well was moving large chunks of text around. This is pretty trivial with a mouse or trackpad. It’s also very easy with a hard keyboard (and its cursor keys). With iOS’s on-screen selection system I found this much more awkward. Hopefully this is just a matter of inexperience and I’m actually missing a key insight which will make it much easier. Perhaps there will be improvements when apps start using the new iOS 11 APIs.

Moving content between apps is still pretty awkward, but again I expect iOS 11 will improve this. Moving your attention between different apps I actually find to work really well on the iPad, though. The much more formal system of how apps appear on the screen seems to be easier for me to handle intuitively (or at least reflexively).

Ulysses has the built in functionality to export directly to Medium and Wordpress. It would be really good if it had the same functionality for Ghost, because the Ghost web UI really doesn’t scale well to the iPad sized screen. It seems to fall into something of a blind spot. Hopefully the new editor in Ghost 1.0 will improve this, but it doesn’t seem to have been rolled out to my blog yet. In fact thanks to Wordpress’s very solid mobile app, this whole process would have been much simpler if I was using Wordpress^[1].

I also spent some time using Affinity Photo to develop RAW files taken with my mirrorless camera. The short summary is that it works really, really well on the iPad. It was a little slow at times, but given I was using it on the lowest specced device it’s compatible with that’s quite understandable. If I was doing anything more delicate than developing I think the Apple Pencil would have been very nice to have. Doing this kind of work whilst comfortably sitting cross-legged on a sofa is very nice indeed.

The experience of getting the photos out of the mirrorless camera and onto the iPad was also pretty good. The iPad SD Card based connection kit does its job really well. I suspect the USB based version is probably the better option most of the time, though. I’ll probably switch to that and put up with also needing to carry an additional cable. Whilst it was straightforward it was a little slow. Wireless transfer of RAW files (which tend to be big) would be way more convenient, but slower still. It might be worth it.

A quick aside about the way in which the Photos app stores and displays RAW files: If you shoot in RAW+JPEG (so your camera outputs both a RAW file and a developed JPEG), Photos displays this as a single photo. Any changes you make are applied only to the JPEG. Photos itself barely even acknowledges that the RAW exists. Affinity will open the RAW file by default when you import from your photo library, but it would be awesome if it could also act as an extension and allow you to redevelop the RAW file from within Photos.

I also used it as a media device during the flight^[2]. So much nicer than the built in option. The Netflix UI is lightyears ahead of any of the grotesque “entertainment systems” I’ve encountered on a plane. The iPad screen is much better than any in-seat screen. As an added bonus: When the pilot or cabin crew decide it’s time to wax lyrical, you get to choose whether to listen, or stick with the entertainment you chose.

As final thoughts: mobility and battery life were both glorious. With a SIM card and a decent international plan it could be amazing.

In actual fact, thanks to really solid tools like the Working Copy git client, a lot of this might even have been easier using Jekyll and GitHub Pages. I still really like Ghost and intend to keep using it for this blog, but I can’t deny that frustrations like this keep piling up. I’m going to be looking at other options for some other side projects I have in mind. ↩︎
I watched the “Netflix Original” movie Spectral, in case you’re interested. I would recommend it if you (like me) are a fan of a) very well made B movies; or b) Doctor Who, with which it shares an attitude to “science”. ↩︎

Thinking About Going iPad... Mostly

Nick Johnson — Sun, 30 Jul 2017 12:00:00 GMT

Going “iPad only” is all the rage in some circles. I don't think it would really work for me. iOS still has too many limitations for that. But I do like the idea of using an iPad as my main “carry around” machine. Right now I use a first generation MacBook Adorable for this purpose. I take it on holiday and bring it with me to coffee shops (and occasionally the office) when I have use for it. But an iPad is smaller, lighter, and more versatile in form (if not in function). It also uses the same charger as my phone, which I consider a win.

The MacBook is also very small and light, but the issue is that a lot of the software built for it was not designed with the “small” part in mind. A lot of Mac apps were clearly designed to run on a 27" iMac screen and perhaps scale down to a 15" MacBook Pro. The 12" screen of the MacBook is not what the designers had in mind, and so the apps end up cramped and inelegant. Due to the windowing user interface, software for the Mac must be capable of being displayed at almost any size. It can't possibly be optimized for all of them.

It might seem natural that if the 12" screen of the MacBook feels cramped, then the 9" or 10.5" screen of the (non tea-tray sized) iPad would be even more so. In practice this is generally not the case. The limited sizes an iOS app can be displayed in are known at design time. Thus they can be taken into account from the get go, and iPad apps (even complicated ones) can achieve a high level of elegance.

I can think of four key use cases I have for my MacBook Adorable, which any possible replacement would have to cover:

Content consumption;
Writing;
Image Editing;
Coding.

Content consumption is one area in which the iPad is basically peerless. Technically there isn't anything you can do on an iPad which you can't do on a laptop, but oftentimes it's much more pleasant on the iPad.

If you want to watch Netflix or Amazon Video, for example, the iPad app gives a much cleaner interface than the browser, and allows you to download (some) TV shows and movies to watch offline later.

There are exceptions to this, of course. Some websites don't work well on Safari mobile and so can't be viewed on the iPad at all. Any website which uses Flash, for example, but that's increasingly rare. The Kindle app (and other eReaders, for that matter) on the iPad is fantastic for viewing cookbooks and textbooks^[1]. But an ePaper Kindle is significantly more comfortable when it comes to novels and other long form writing.

For the writing use case, I think a tiny laptop and an iPad with a keyboard are on a pretty equal footing. I use iA Writer for this purpose. It's available on both platforms, so no issue there. It's also very minimal, so screen real estate is more or less a non-issue. If Matt Gemmell can write novels on an iPad I can probably manage to write blog posts.

Using a trackpad/mouse is more convenient than tapping the screen for quickly jumping around the text. However the lack of this might not actually be such a bad thing. It would force me to rely on the keyboard more, which is likely to increase my productivity in the long run.

Another bonus is the ability to remove the keyboard and rotate into portrait mode when editing. In my experience changing your context as much as possible is a really good idea when editing. It stops me from seeing what I was trying to write, rather than what I actually wrote.

Image editing, when performed on a Mac, is one of those use cases which benefits from the largest screen possible. More pixels on the screen means you can see more of the pixels in your photo (obviously). Plus you have space for the myriad palettes and tools image editing tends to require.

I'm mainly thinking of two apps here: Adobe's ubiquitous Lightroom, and Serif’s plucky newcomer Affinity Photo. There's not an exact one to one mapping between the two. Affinity Photo is more of a competitor to Photoshop than Lightroom, but it can be purchased for a one off fee rather than requiring a subscription, so it's what I've tended to use when developing raw files and otherwise tweaking the images which come out of my camera.

As luck would have it, there are iPad versions of both of these apps. Lightroom Mobile is mostly a companion to the desktop version, but appears pretty fully functional. Affinity Photo for iPad is the whole shebang, though. It's seriously impressive. In both cases, the possibility of using an Apple Pencil for selection and making exact adjustments is very compelling.

The downside here is the lack of a good way of getting access to the image files for editing. Neither (currently) supports reading from external sources directly. Lightroom Mobile appears to work best if you first upload the pictures to Adobe’s cloud via the desktop app, then pull them back down to the iPad. It can pull RAW files from the iPad’s camera roll, but not elegantly. Affinity Photo for iPad doesn’t seem to be able to see those RAW files at all, though. It wants you to upload them separately to cloud storage, and then pull them down individually. The workflow here is really not appealing, but I’m hopeful that new features in iOS 11 will help here.

Coding is potentially a showstopper for the iPad. Much as I want Xcode for iPad to exist, it stubbornly persists in failing to. It's not that coding is completely impossible on the tablet. There exist very solid apps for programming in Python and Lua for example. Unfortunately I want to code in Swift. There is the increasingly impressive Swift Playgrounds app. Unfortunately I want to work on a full App.

One possible solution is to use Dringend, which provides a full App coding environment for Objective-C and Swift. However it also requires a Mac to be used as a build server in order to make this work. Another slight red flag is that the App hasn't been updated since January 2016. As things stand it wouldn't gain the benefits of iOS 11. But still, there is hope.

That said, if I have a decent PC back at home base then possibly I don't need to code on the iPad. Maybe I can work on coding side projects when I'm at home, but when I'm out and about I can do... other things. Like write, or work on photos. Or maybe just look up and enjoy the world. It's crazy, I know.

So... what's my conclusion? I really like the idea, and will probably go ahead with it... at some point. Right now I really don't want to spend the money, but when my MacBook Adorable hits end of life I think it might be the next thing for me. A reasonable iMac at home (maybe) and an iPad Pro for when I'm not at my desk. It seems like a good division of labour.

Anything with pictures, or that you might want to flick around in, essentially. ↩︎

Should Have Paid for the Delivery, or: Value Your Time

Nick Johnson — Sun, 16 Jul 2017 21:29:52 GMT

Just before the weekend I realised I needed a few items from Ikea at fairly short notice. Nothing complicated. A rug and some rails of the sort kitchen utensils dangle from. My in-progress kitchen would definitely remain in-progress without the latter. The obvious solution was the usual online shopping and home delivery combo. I fired up Ikea.com and lobbed everything I needed into the basket. I clicked checkout. Cost for delivery: £35, said the site. "Daylight robbery!" said I. I think you can see where this is going. More specifically: I think you can see where I was going.

Ikea is 60 minutes away from me by public transport. 30 minutes by car. Driving myself in a ZipCar was more logistics than I was looking for, so I decided public transport there, Uber back was my best option. Off I set. About two hours later I had a heavy yellow Ikea sack hanging from my shoulder (because obviously I also spotted a couple of other things we could really use). I started to manhandle the 2.5 metre rug off the shelf in the self service warehouse. I'm sure I knew how big the rug was beforehand, but it hadn't entirely registered until this point. It was beginning to dawn on me: "this is silly."

In the end the 30 minute Uber back cost me £17.14. The 60 minute trip on public transport to get there in the first place cost £6.60. I spent around an hour trudging around Ikea. So the final tally was £23.74. I saved £11.26, but I spent 2 and a half hours of my life. In other words I paid myself almost exactly £4.50 an hour for my time.
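
For anyone who wants to check the sums, here is a quick sketch in Python. The figures are the ones quoted above; the variable names are made up purely for illustration, and the 2.5 hours assumes an hour there, about an hour in the store, and half an hour back.

```python
# Figures quoted in the post; the variable names are illustrative only.
delivery_quote = 35.00    # Ikea's delivery charge, in pounds
uber_home = 17.14         # the 30 minute Uber back
transport_there = 6.60    # the 60 minute public transport trip
hours_spent = 2.5         # assumed: 1h there + ~1h in store + 0.5h back

diy_cost = uber_home + transport_there   # total spent doing it myself
saving = delivery_quote - diy_cost       # money kept versus paying for delivery
hourly_rate = saving / hours_spent       # what that saving works out to per hour

print(f"DIY cost: £{diy_cost:.2f}")                   # £23.74
print(f"Saving: £{saving:.2f}")                       # £11.26
print(f"Effective hourly rate: £{hourly_rate:.2f}")   # £4.50
```

Swap in your own delivery quote, travel costs and hours to see what rate you would be paying yourself.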

I really like to think that my time is worth more than that. It's certainly worth more than that to me. I generally have a rule that if I can pay money to increase quality of life for myself or my partner I will do so without hesitation. I broke that rule spectacularly here.

I can handle the economy cabin for a long flight, but if it's overnight I will tend to upgrade to premium. Business would be nice but I generally can't afford it when paying for myself. An overnight flight in economy will usually leave me completely destroyed the next day. It will be several days and a lot of coffee before I'm back up to speed. It's worth it to me to pay the extra money and avoid that.

Another example is taking my shirts to the laundrette. It takes me something like 30 minutes to do a crappy job of ironing a shirt. But for £1.50 each my local laundrette will wash and iron them. For me it's an absolute no-brainer^[1].

In a world of Amazon Prime and public transport which feels free at point of use, £35 sounded like a lot of money for delivery. In fact I was thinking about it in entirely the wrong way, and my maths was deeply flawed. You should value your time highly. Very highly. If someone is providing a service which is genuinely useful to you, then you should pay them for it. I should have been praising Ikea for charging for delivery, and not just disguising the cost elsewhere.

The longwinded upshot is this: Next time it comes up, I'm definitely paying Ikea for the damn delivery. Replace the words "Ikea" and "delivery" as appropriate.

Before someone says it: I'm not saying my time is worth more than the person ironing the shirts in the laundrette. I'm saying they'll iron the shirt in 3 minutes and do a much better job of it. ↩︎