<?xml version="1.0"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>foldl</title>
    <link>http://foldl.me</link>
    <atom:link href="http://foldl.me/rss.xml" rel="self" type="application/rss+xml" />
    <language>en-US</language>
    <pubDate>Fri, 23 Jan 2026 18:25:14 +0000</pubDate>
    <lastBuildDate>Fri, 23 Jan 2026 18:25:14 +0000</lastBuildDate>

    
    
    
    
    
    
    
    
    <item>
      <title>Motivating the rules of the game for adversarial example research</title>
      <link>http://foldl.me/2018/adversarial-examples/</link>
      <pubDate>Fri, 17 Aug 2018 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2018/adversarial-examples/</guid>
      <description>&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1807.06732&quot;&gt;&lt;em&gt;Motivating the Rules of the Game for Adversarial Example Research&lt;/em&gt;&lt;/a&gt; is one of the most level-headed things I&amp;rsquo;ve read on AI safety/security in a while. It&amp;rsquo;s 25 pages, which is long for a machine learning paper — but it&amp;rsquo;s worth it. My brief take-away from the paper, which I totally support:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Adversarial example research has been framed in two ways:&lt;/p&gt;

  &lt;ol&gt;
    &lt;li&gt;an experimental method for pure research which helps us better understand neural network architectures and their learned representations&lt;/li&gt;
    &lt;li&gt;a practical method for securing machine learning models against attacks from adversaries in the wild.&lt;/li&gt;
  &lt;/ol&gt;

  &lt;p&gt;Adversarial examples are the least of our problems in the latter practical framing. We ought to either (1) re-cast adversarial example work as a pure research problem, or (2) build better &amp;ldquo;rules of the game&amp;rdquo; which actually motivate popular adversarial defense methods as sufficient security solutions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here are some more extracts that I think summarize the thrust of the paper (emphasis mine):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern (1).&lt;/p&gt;

  &lt;p&gt;Much of the adversarial perturbation research arose based on observations that even small perturbations can cause significant mistakes in deep learning models, &lt;em&gt;with no security motivation attached&lt;/em&gt;. &amp;hellip; Goodfellow et al. intended $l_p$ adversarial examples to be a toy problem where evaluation would be easy, with the hope that the solution to this toy problem would generalize to other problems. &amp;hellip; Because solutions to this metric have not generalized to other settings, it is important to now find other, better, more realistic ways of evaluating classifiers in the adversarial [security] setting (20).&lt;/p&gt;

  &lt;p&gt;Exploring robustness to a whitebox adversary [i.e. $l_p$-norm attacks] should not come at the cost of ignoring defenses against high-likelihood, simplistic attacks such as applying random transformations or supplying the most difficult test cases. &amp;hellip; Work primarily motivated by security should first build a better understanding of the attacker action space (23).&lt;/p&gt;

  &lt;p&gt;An appealing alternative for the machine learning community would be to recenter defenses against restricted adversarial perturbations &lt;em&gt;as machine learning contributions and not security contributions&lt;/em&gt; (25).&lt;/p&gt;

  &lt;p&gt;To have the largest impact, we should both recast future adversarial example research as a contribution to core machine learning functionality and develop new abstractions that capture realistic threat models (25).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some other notes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The authors correctly point out that &amp;ldquo;content-preserving&amp;rdquo; perturbations are difficult to identify. $l_p$-norm is just a proxy (and a poor one at that!) for this criterion. If we try to formalize this briefly, it seems like a content-preserving perturbation $\delta_{O,T}(x)$ on an input $x$ for some task $T$ is one which does not push $x$ out of some perceptual equivalence class according to a &lt;em&gt;system-external observer&lt;/em&gt; $O$ who knows $T$.&lt;/p&gt;

    &lt;p&gt;If that&amp;rsquo;s right, then concretely defining $\delta$ for any domain requires that we construct the relevant perceptual equivalence classes for $O$ on $T$. Is this any easier than reverse-engineering the representations that $O$ uses to solve $T$ in the first place? If not, then posing the &amp;ldquo;correct&amp;rdquo; perturbation mechanism is just as difficult as learning the &amp;ldquo;correct&amp;rdquo; predictive model in the first place.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I think the definition of &amp;ldquo;adversarial example&amp;rdquo; begins to fall apart as we expand its scope. See e.g. this quote:&lt;/p&gt;

    &lt;blockquote&gt;
      &lt;p&gt;for many problem settings, the existence of non-zero test error implies the existence of adversarial examples for sufficiently powerful attacker models (17).&lt;/p&gt;
    &lt;/blockquote&gt;

    &lt;p&gt;This is true for a maximally broad notion of &amp;ldquo;adversarial example,&amp;rdquo; which just means &amp;ldquo;an example that the system gets wrong.&amp;rdquo; If we expand the definition that way, the line between a robust system (in the security sense) and a well-generalizing model begins to get fuzzy.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Conceptual issues in AI safety&#58; the paradigmatic gap</title>
      <link>http://foldl.me/2018/conceptual-issues-ai-safety-paradigmatic-gap/</link>
      <pubDate>Thu, 21 Jun 2018 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2018/conceptual-issues-ai-safety-paradigmatic-gap/</guid>
      <description>&lt;p&gt;&lt;small&gt;&lt;em&gt;tl;dr&lt;/em&gt;: I question the assumption that technical solutions to mid-term safety problems will be relevant to the long-horizon problems of AI safety. This assumption fails to account for a potential paradigmatic change in technology between now and the date at which these long-horizon problems will become pressing. I present a historical example of paradigmatic change and suggest that the same is possible for AI, and argue that our bets on the importance of present-day safety work ought to incorporate our beliefs over the strength of the current paradigm.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m back from a brief workshop on technical issues in AI safety, organized by the &lt;a href=&quot;https://www.openphilanthropy.org/blog/potential-risks-advanced-artificial-intelligence-philanthropic-opportunity&quot;&gt;Open Philanthropy Project&lt;/a&gt;. The workshop brought together the &lt;a href=&quot;https://www.openphilanthropy.org/focus/global-catastrophic-risks/potential-risks-advanced-artificial-intelligence/announcing-2018-ai-fellows&quot;&gt;new class of AI Fellows&lt;/a&gt; with researchers from industry labs, nonprofits, and academia to discuss actionable issues in AI safety.&lt;/p&gt;

&lt;p&gt;Discussions at the workshop have changed and augmented my views on AI safety in fundamental ways. Most importantly, they have revealed to me several critical conceptual issues at the foundation of AI safety research, involving work with both medium time horizons (e.g. adversarial attacks, interpretability) and much longer horizons (e.g. aligning the incentives of superintelligent AIs to match our own values). I believe that these are blocking issues for safety research: I don&amp;rsquo;t know how to value the various sorts of safety work until I arrive at satisfying answers to these questions. Over the next months, I&amp;rsquo;ll formalize these questions in separate single-authored and co-authored blog posts.&lt;/p&gt;

&lt;p&gt;This post addresses the first of these critical conceptual issues. This issue is the least technical – and possibly the least deep-cutting – of those which I want to raise. Because it touches on one of the most common safety arguments, though, I thought it&amp;rsquo;d be best to publish this one first.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;AI safety is a very diverse field, encompassing work targeted at vastly different time horizons. I identify three in this post:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Short-term:&lt;/strong&gt; This work involves immediately practical safety risks in deploying machine learning systems. These include &lt;a href=&quot;https://arxiv.org/abs/1706.03691&quot;&gt;data poisoning&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1802.08232&quot;&gt;training set inference&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1606.03490&quot;&gt;lack of model interpretability&lt;/a&gt;, and &lt;a href=&quot;https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing&quot;&gt;undesirable model bias&lt;/a&gt;.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mid-term:&lt;/strong&gt; This work targets potential safety risks of future AI systems that are more powerful and more broadly deployed than those used today. Relevant problems in this space include &lt;a href=&quot;https://arxiv.org/abs/1805.00899&quot;&gt;scalably specifying&lt;/a&gt; and &lt;a href=&quot;https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/&quot;&gt;supervising reward-based learning&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1606.06565&quot;&gt;preventing unwanted side effects&lt;/a&gt;, &lt;a href=&quot;https://deepmind.com/blog/specifying-ai-safety-problems/&quot;&gt;safely generalizing out of domain&lt;/a&gt;, and &lt;a href=&quot;https://arxiv.org/abs/1611.08219&quot;&gt;ensuring that systems remain under our control&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Long-term:&lt;/strong&gt; This theoretical work addresses the risks posed by artificially engineered (super)intelligences. It asks, for example, how we might ensure that a system is &lt;a href=&quot;https://intelligence.org/stanford-talk/&quot;&gt;aligned with our values&lt;/a&gt;, and &lt;a href=&quot;https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616&quot;&gt;proposes procedures for conserving this alignment&lt;/a&gt; while supporting recursive self-improvement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post is mainly concerned with the value of mid-term work.&lt;/p&gt;

&lt;h2 id=&quot;mid-term-ai-safety&quot;&gt;Mid-term AI safety&lt;/h2&gt;

&lt;p&gt;Many mid-term researchers assume that their work is well aligned with solving longer-horizon safety risks — that is, that technical solutions to mid-term problems will also help us make progress on the most concerning long-horizon risk scenarios. &lt;a href=&quot;https://paulfchristiano.com/&quot;&gt;Paul Christiano&lt;/a&gt; has made statements about the likely alignment of mid-term and long-term issues — see, for example, &lt;a href=&quot;https://ai-alignment.com/prosaic-ai-control-b959644d79c2&quot;&gt;his 2016 article on prosaic AI&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It now seems possible that we could build &amp;ldquo;prosaic&amp;rdquo; AGI, which can replicate human behavior but doesn’t involve qualitatively new ideas about &amp;ldquo;how intelligence works&amp;rdquo; …
If we build prosaic superhuman AGI, &lt;strong&gt;it seems most likely that it will be trained by reinforcement learning&lt;/strong&gt; … But &lt;strong&gt;we don’t have any shovel-ready approach to training an RL system to autonomously pursue our values.&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit. If we had very powerful RL systems, such a DAO might be able to outcompete human organizations at a wide range of tasks — producing and selling cheaper widgets, but also influencing government policy, extorting/manipulating other actors, and so on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This sort of argument is used to motivate mid-term technical work on controlling AI systems, aligning their values with our own, and so on. In particular, this argument is used to motivate technical work in small-scale synthetic scenarios which connect to these long-term concerns. &lt;a href=&quot;https://arxiv.org/abs/1711.09883&quot;&gt;Leike et al. (2017)&lt;/a&gt; propose minimal environments for checking the safety of reinforcement learning agents, for example, and justify the work as follows:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;While these [proposed grid-world safety environments] are highly abstract and not always intuitive, their simplicity has two advantages: it makes the learning problem very simple and it limits confounding factors in experiments. Such simple environments could also be considered as minimal safety checks: an algorithm that fails to behave safely in such simple environments is also unlikely to behave safely in real-world, safety-critical environments where it is much more complicated to test. Despite the simplicity of the environments, &lt;strong&gt;we have selected these challenges with the safety of very powerful artificial agents (such as artificial general intelligence) in mind.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These arguments gesture at the successes of modern machine learning technology—especially reinforcement learning—and recognize (correctly!) that we don’t yet have good procedures for ensuring that these systems behave the way that we want them to when they are deployed in the wild. We need to have safety procedures in place, they claim, well before more powerful longer-horizon systems arrive that can do much more harm in the real world. This argument rests on the assumption that our technical solutions to mid-term problems will be relevant at the long-horizon date when such systems arrive.&lt;/p&gt;

&lt;p&gt;This post questions that assumption. I claim that this assumption fails to account for a potential &lt;em&gt;paradigmatic change&lt;/em&gt; in our engineered AI systems between now and the date at which these long-horizon problems become pressing.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; A substantial paradigmatic change — which could entail a major change in the way we engineer AI systems, or the way AI is used and deployed by corporations and end users — may make irrelevant any mid-term work done now which aims to solve those long-horizon problems.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ll make the argument by historical analogy, and circle back to the issue of AI safety at the end of this post.&lt;/p&gt;

&lt;h2 id=&quot;paradigmatic-change-an-example&quot;&gt;Paradigmatic change: an example&lt;/h2&gt;

&lt;p&gt;At the end of the 19th century, some of the largest cities in the world relied on horses as a central mode of transportation. A city horse was tasked with driving everything from the private &lt;a href=&quot;https://en.wikipedia.org/wiki/Hansom_cab&quot;&gt;hansom cab&lt;/a&gt; (a Sherlock Holmes favorite) to the &lt;a href=&quot;https://en.wikipedia.org/wiki/Horsebus&quot;&gt;double-decker horsebus&lt;/a&gt;, which could tow dozens of passengers.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://c2.staticflickr.com/8/7362/9472641326_2ef9976ccc_z.jpg&quot; alt=&quot;A double-decker horsebus in Sydney, 1895.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;1890s New York City, for example, housed over a hundred thousand horses for transporting freight and humans. While this massive transportation industry helped to sustain an era of explosive city growth, it also posed some serious novel logistical problems. Many of those horses were housed directly in urban stables, taking up valuable city space. Rats and other city rodents flocked to the urban granaries established to support these stables.&lt;/p&gt;

&lt;p&gt;But the most threatening problem posed by this industry by far was the &lt;a href=&quot;https://cityroom.blogs.nytimes.com/2008/06/09/when-horses-posed-a-public-health-hazard/&quot;&gt;&lt;strong&gt;waste&lt;/strong&gt;&lt;/a&gt;. The massive horse population produced a similarly massive daily output of excrement and urine. Because the average city horse survived &lt;a href=&quot;https://web.archive.org/web/20080509133928/https://www.fathom.com/feature/121636/&quot;&gt;fewer than three years of work&lt;/a&gt;, horse carcasses would commonly be found abandoned in the streets.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;This sort of waste had the potential to doom New York and similar cities to an eventual crisis of public health. On dry days, piles of horse excrement left in the streets would turn to dust and pollute the air. Rainstorms and melting snow would precipitate floods of horse poop, meeting the especially unlucky residents of ground-floor apartments. In all climates, flies flocked to the waste and helped to spread typhoid fever.&lt;/p&gt;

&lt;p&gt;Enterprising Americans were quick to respond to the problem—or, rather, to the business opportunities posed by the problem. &lt;a href=&quot;https://enviroliteracy.org/environment-society/transportation/the-horse-the-urban-environment/&quot;&gt;&amp;ldquo;Crossing sweepers&amp;rdquo;&lt;/a&gt; paved the way through the muck for the classiest of street-crossers. Urban factories cropped up to &lt;a href=&quot;https://www.nytimes.com/1865/09/09/archives/the-boneboiling-nuisance.html&quot;&gt;process horse carcasses&lt;/a&gt;, producing glue, gelatin, and fertilizer. Services carted away as much horse poop as possible to pre-designated &amp;ldquo;manure blocks,&amp;rdquo; in order to keep at least part of the city presentable.&lt;/p&gt;

&lt;p&gt;Horse waste posed a major looming public health risk for the 19th-century city. I assume there were two clear roads forward here:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Reduce the horse population.&lt;/strong&gt; With cities around the world booming in population, banning or restricting horse-based transportation would stall a city&amp;rsquo;s growth. Not a viable option, then.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Eliminate the waste.&lt;/strong&gt; New technical solutions would need to be future-proof, robust even in the face of a continuously growing horse population. While the technical solutions of the day mitigated only some of the worst effects of the waste, this would have seemed like the only viable path to pursue.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I certainly would have voted for #2 as an 1890s technologist or urban planner. But neither of these solutions ended up saving New York City, London, and friends from their smelly 19th-century fates. What saved them?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/commons/thumb/2/23/Ford1903.jpg/640px-Ford1903.jpg&quot; alt=&quot;An ad for the Ford Model A, the first car produced by the Ford Motor Company.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The automobile&lt;/strong&gt;, of course. The internal combustion engine offered a fast, efficient, and increasingly cheap alternative to the horse. Urban horses were replaced somewhat slowly, only as market pressures forced horse owners to close up shop. By the final decade of the 19th century, most major cities had switched from horse-pulled streetcars to electrified trolleys. Over the following decades, increasingly economical engines replaced horses in buses, cabs, and personal vehicles. Automobiles introduced a novel technological paradigm, leading to entirely new &lt;a href=&quot;https://en.wikipedia.org/wiki/Assembly_line&quot;&gt;manufacturing methods&lt;/a&gt;, service jobs, and — most importantly — safety issues.&lt;/p&gt;

&lt;p&gt;The introduction of the automobile dealt a final blow to the previous transportation paradigm, and &lt;em&gt;rendered irrelevant&lt;/em&gt; the safety issues it had imposed on modern cities: automobiles did not leave excrement, urine, or horse carcasses in the streets. Automobiles introduced entirely new safety issues, no doubt, which still trouble us today: car exhaust pollutes our atmosphere, and drunk car drivers do far more damage to pedestrians than a drunk hansom driver ever could. But it&amp;rsquo;s critical to note for our purposes that technologists of the horse era &lt;em&gt;could not have foreseen&lt;/em&gt; such safety problems, let alone developed solutions to them.&lt;/p&gt;

&lt;h2 id=&quot;potential-paradigmatic-changes-in-ai&quot;&gt;Potential paradigmatic changes in AI&lt;/h2&gt;

&lt;p&gt;Modern machine learning and AI are likewise built within a relatively fixed paradigm, which specifies how systems are constructed and used. I want to suggest that substantial changes to the present paradigm might invalidate the assumed alignment between mid-term AI safety work and the longer-term goals. But first, I’ll identify the relevant features of the paradigm that contains modern ML work.&lt;/p&gt;

&lt;p&gt;What are some assumptions of the current paradigm which might change in the future? Any answer is bound to look silly in hindsight. In any case, here are a few candidate concepts which are currently central to machine learning, and to AI safety by association. I’ve heard all of these concepts questioned in conversations with reasonable machine learning researchers. Many of these assumptions have even been questioned/subverted in published papers. In short, I don&amp;rsquo;t think any of these concepts are set in stone.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;the train/test regime — the notion that a system is &amp;ldquo;trained&amp;rdquo; offline and then &amp;ldquo;deployed&amp;rdquo; to the real world&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;reinforcement learning; (discrete-time) &lt;a href=&quot;https://en.wikipedia.org/wiki/Markov_decision_process&quot;&gt;MDPs&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Stationary_process&quot;&gt;stationarity&lt;/a&gt; as a default assumption; &lt;a href=&quot;https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables&quot;&gt;IID&lt;/a&gt; data sampling as a default assumption&lt;/li&gt;
  &lt;li&gt;RL agents with discrete action spaces&lt;/li&gt;
  &lt;li&gt;RL agents with actions whose effects are pre-specified by the system’s designer&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Gradient_descent&quot;&gt;gradient-based learning&lt;/a&gt; / local parameter search&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;parametric models&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;the notion of discrete &amp;ldquo;tasks&amp;rdquo; or &amp;ldquo;objectives&amp;rdquo; that systems optimize&lt;/li&gt;
  &lt;li&gt;(heresy!) probabilistic inference as a framework for learning and inference&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I believe that, while many of the above axiomatic elements of modern machine learning seem foundational and unshakable, &lt;strong&gt;most&lt;/strong&gt; are likely to be obsolete within decades. Before you disagree with that last sentence, think of what futures a horse-drawn cab driver or an 1890s urban planner would have predicted. Consider also what sort of futures &lt;a href=&quot;https://ieeexplore.ieee.org/document/110446/?arnumber=110446&quot;&gt;expert systems&lt;/a&gt; developers and &lt;a href=&quot;https://en.wikipedia.org/wiki/Connection_Machine&quot;&gt;Lisp machine engineers&lt;/a&gt; from past decades of AI research would have sketched. (Would they have mentioned MDPs?)&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;You may not agree that &lt;em&gt;all&lt;/em&gt; or &lt;em&gt;most&lt;/em&gt; of the above concepts will be subverted any time soon. If you do agree that &lt;em&gt;any&lt;/em&gt; foundational axiom $A$ has a chance of disappearing, though, it is imperative to check that 1) your safety questions remain relevant, and 2) your technical solutions succeed both in a world where $A$ holds and in one where $\neg A$ holds.&lt;/p&gt;

&lt;h2 id=&quot;consequences-of-paradigmatic-change&quot;&gt;Consequences of paradigmatic change&lt;/h2&gt;

&lt;p&gt;The argument I am suggesting here is different from the standard &amp;ldquo;technical gap&amp;rdquo; argument.&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; I am instead pointing out a &lt;strong&gt;paradigmatic gap&lt;/strong&gt;: the technical solutions we develop now may be fatally attached to the current technological paradigm. Let $T_S$ be the future time at which long-horizon AI safety risks – say, prosaic AGI or superintelligence – become a reality. Here are two consequences of granting this as a possibility:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Our current technological paradigm may &lt;strong&gt;mislead us to consider safety problems that won&amp;rsquo;t be at all relevant&lt;/strong&gt; at $T_S$, due to paradigmatic change.&lt;/p&gt;

    &lt;p&gt;Excrement evacuation seemed like a pressing issue in the late 19th century; the problem is entirely irrelevant in the present-day automobile paradigm. We instead deal with an entirely different set of modern automobile safety issues.&lt;/p&gt;

    &lt;p&gt;The task of &lt;a href=&quot;https://arxiv.org/abs/1711.02827&quot;&gt;scalable reward specification&lt;/a&gt; likewise appears critically important to the mid-term and long-term AI safety crowds. Such a problem is only relevant, however, if many of the paradigmatic axioms from the previous section hold (at least #2–5).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Technical solutions developed now may be irrelevant&lt;/strong&gt; at $T_S$. Even if the pressing safety issues overlap with the pressing safety issues at $T_S$ (i.e., #1 above doesn&amp;rsquo;t hold), it&amp;rsquo;s possible that our technical solutions will still be fatally tied to elements of the current paradigm.&lt;/p&gt;

    &lt;p&gt;Pedestrians and riders alike faced collision risks in the horse era — &lt;a href=&quot;https://trove.nla.gov.au/newspaper/article/8428803&quot;&gt;runaway horses&lt;/a&gt; might kick and run over people in their way. But the technical solutions to horse collision look nothing like those which save lives today (for example, airbags and stop lights).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There&amp;rsquo;s room to disagree on this question of a &lt;strong&gt;paradigmatic gap&lt;/strong&gt;. But it certainly needs to be part of the AI safety discussion: our bets on the importance of present-day technical safety work ought to incorporate our beliefs over the strength of the current paradigm. Here are some concrete questions worth debating once we’ve granted the possibility of paradigmatic change:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;How much are different risks and research directions actually tied to the current paradigm?&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt; (How can we get more specific about this &amp;ldquo;fatal attachment&amp;rdquo;?)&lt;/li&gt;
  &lt;li&gt;Do our current paradigm-bets look good, or should we be looking to diversify across possible paradigm changes or weaken the connection to the current paradigm?
    &lt;ul&gt;
      &lt;li&gt;What does &amp;ldquo;diversify&amp;rdquo; mean here? Would it entail doing more or less work under the framing of AI safety?&lt;/li&gt;
      &lt;li&gt;We need to arrive at a consensus on the pessimistic meta-induction argument here (see footnote #8). Are we justified in assuming the current paradigm (or &lt;em&gt;any candidate future paradigm&lt;/em&gt;) is the right one in which to do mid-term safety work? Can empirical evidence help here? How can we get more concrete, in any case, about our uncertainty about the strength of a technological paradigm?&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Are there ways to do research that’s likely to survive a paradigm shift?&lt;sup id=&quot;fnref:11&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; (What are the safety problems which are likely to survive a paradigm shift?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Future posts will address some of the above questions in detail. For now, I look forward to the community’s response!&lt;/p&gt;

&lt;h2 id=&quot;follow-ups&quot;&gt;Follow-ups&lt;/h2&gt;

&lt;p&gt;Here I&amp;rsquo;ll respond to various comments from reviewers which I couldn&amp;rsquo;t fit nicely into the above narrative.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://stanford.edu/~aditir/&quot;&gt;&lt;strong&gt;Aditi&lt;/strong&gt;&lt;/a&gt; and &lt;a href=&quot;http://alexlew.net/&quot;&gt;&lt;strong&gt;Alex&lt;/strong&gt;&lt;/a&gt; suggested that AI safety work might be what actually brings about the paradigmatic change I&amp;rsquo;m talking about. Under this view, the safety objective motivates novel work which otherwise would have come more slowly (or not at all). I think that&amp;rsquo;s possible for some sorts of AI safety research — for example, the quest to build systems which are robust to open-ended / real-world adversarial attacks (&lt;a href=&quot;https://arxiv.org/abs/1707.08945&quot;&gt;stop sign graffiti&lt;/a&gt;) might end up motivating substantial paradigm changes. This is a possibility worth considering. My current belief is that much of this sort of safety research could just as well be branded as &amp;ldquo;doing machine learning better&amp;rdquo; or &amp;ldquo;better specifying the task.&amp;rdquo;
 In other words, the &amp;ldquo;safety&amp;rdquo; framing adds nothing new. At best, it’s distracting; at worst, it gives AI safety an undeserved poor reputation. (As &lt;a href=&quot;https://cs.brown.edu/people/mlittman/&quot;&gt;&lt;strong&gt;Michael&lt;/strong&gt;&lt;/a&gt; suggested: I’d rather say &amp;ldquo;I’m working on X because it makes AI more robust / ethical / fair&amp;rdquo; than &amp;ldquo;I’m working on X because it will help stave off an existential threat to the human race.&amp;rdquo;) This is a very compressed argument, and I&amp;rsquo;ll expand it in a future post in this series.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cs.brown.edu/people/mlittman/&quot;&gt;&lt;strong&gt;Michael&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://cs.stanford.edu/~jsteinhardt/&quot;&gt;&lt;strong&gt;Jacob&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://scholar.google.com/citations?user=o1qFlsgAAAAJ&amp;amp;hl=en&quot;&gt;&lt;strong&gt;Max&lt;/strong&gt;&lt;/a&gt;, and &lt;a href=&quot;https://paulfchristiano.com/&quot;&gt;&lt;strong&gt;Paul&lt;/strong&gt;&lt;/a&gt; suggested that mid- and long-term AI safety research might transfer across paradigm shifts. This is certainly true for the most philosophical parts of AI safety research. I am not convinced it applies to more &lt;a href=&quot;https://arxiv.org/abs/1711.09883&quot;&gt;mid-term&lt;/a&gt; &lt;a href=&quot;https://arxiv.org/abs/1611.08219&quot;&gt;work&lt;/a&gt;. I’m not certain about the answer here, but I am certain that this is a live question and ought to play an important role in debates over AI timelines.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cs.stanford.edu/~jsteinhardt/&quot;&gt;&lt;strong&gt;Jacob&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;http://www.mit.edu/~tomeru/&quot;&gt;&lt;strong&gt;Tomer&lt;/strong&gt;&lt;/a&gt;, and &lt;a href=&quot;http://www.danieldewey.net/&quot;&gt;&lt;strong&gt;Daniel&lt;/strong&gt;&lt;/a&gt; pointed out the possible link to &lt;a href=&quot;https://plato.stanford.edu/entries/thomas-kuhn/#3&quot;&gt;Kuhnian paradigm shifts&lt;/a&gt;. See footnote #2 for a response. In a future post, I intend to address the separate danger of failing to acknowledge dependence on the current scientific paradigm (i.e., on our present notion of &amp;ldquo;what intelligence is&amp;rdquo;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post benefited from many discussions at the Open Philanthropy AI safety workshop, as well as from reviews from colleagues across the world. Thanks to Paul Christiano, Daniel Dewey, Roger Levy, Alex Lew, Jessy Lin, João Loula, Chris Maddison, Maxinder Kanwal, Michael Littman, Thomas Schatz, Amir Soltani, Jacob Steinhardt, Tomer Ullman, and all of the &lt;a href=&quot;http://www.danieldewey.net/&quot;&gt;AI Fellows&lt;/a&gt; for enlightening discussions and feedback on earlier drafts of this post.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I prefer to separate these practical issues under the name &amp;ldquo;machine learning security,&amp;rdquo; which has a more down-to-earth ring compared to &amp;ldquo;AI safety.&amp;rdquo;&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I don’t intend to refer to &lt;a href=&quot;https://plato.stanford.edu/entries/thomas-kuhn/#3&quot;&gt;Kuhnian paradigm shifts&lt;/a&gt; by using this term. Kuhn makes the strong claim that shifts between &lt;em&gt;scientific&lt;/em&gt; paradigms (which establish standards of measurement, theory evaluation, common concepts, etc.) render theories incommensurable. I am referring to a much simpler sort of &lt;em&gt;technological&lt;/em&gt; paradigm (the toolsets and procedures we use to reach our engineering targets). This post is only concerned with the latter sort of paradigmatic change.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;From &lt;a href=&quot;http://www.hup.harvard.edu/catalog.php?isbn=9780674031296&quot;&gt;Greene (2008)&lt;/a&gt;, cited in &lt;a href=&quot;https://www.quora.com/Were-city-streets-filled-with-horse-manure-peoples-shoes-caked-with-horse-manure-before-the-car-was-invented/answer/Kingshuk-Bandyopadhyay&quot;&gt;this Quora answer&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;see e.g. online learning / lifelong learning&amp;nbsp;&lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;see e.g. &lt;a href=&quot;https://eng.uber.com/deep-neuroevolution/&quot;&gt;neuroevolution&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;see e.g. nonparametric models :)&amp;nbsp;&lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This belief is at present no more than an intuition from my experience as a computer scientist / member of the ML/NLP community / reading on the history of science and technology. I hope future posts and discussion can make these beliefs more concrete — though the only way to prove that the future will be radically different is to go ahead and make that future a reality! Arguments for and against &lt;a href=&quot;https://plato.stanford.edu/entries/scientific-realism/#PessIndu&quot;&gt;pessimistic meta-induction&lt;/a&gt; in the philosophy of science might be a good place to start for developing both the positive and negative views here. (Thanks to &lt;a href=&quot;http://joaoloula.github.io/&quot;&gt;João&lt;/a&gt; for the suggestion.)&amp;nbsp;&lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;AI safety skeptic: &amp;ldquo;We&amp;rsquo;re decades or centuries away from developing superintelligent machines. Why work on safety now?&amp;rdquo; AI safety non-skeptic: &amp;ldquo;We have no idea how to solve this issue, and it&amp;rsquo;s likely to take decades before we arrive at anything near robust. Thus we need to start now.&amp;rdquo;&amp;nbsp;&lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I’ve found it difficult to quantify the probability of a paradigm shift. Given the way I’ve presented paradigm shifts, they are by definition extremely difficult to imagine and develop. I’d very much like to figure out how to be more concrete about these ideas.&amp;nbsp;&lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See &lt;a href=&quot;http://effective-altruism.com/ea/1ca/my_current_thoughts_on_miris_highly_reliable/#s4&quot;&gt;Daniel Dewey’s evaluation of the MIRI HRAD approach&lt;/a&gt; for an example answer to this question.&amp;nbsp;&lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Do brains represent words?</title>
      <link>http://foldl.me/2018/word-representations/</link>
      <pubDate>Mon, 16 Apr 2018 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2018/word-representations/</guid>
      <description>&lt;p&gt;Jack Gallant&amp;rsquo;s group published &lt;a href=&quot;http://doi.org/10.1038/nature17637&quot;&gt;a Nature paper&lt;/a&gt; several years back which caused quite a buzz. It presented interactive &amp;ldquo;semantic maps&amp;rdquo; spanning the human cortex, mapping out how words of different semantic categories were represented in different places. From the abstract:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our results suggest that most areas&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; within the [brain&amp;rsquo;s] semantic system represent information about specific semantic domains, or groups of related concepts, and our atlas [an interactive web application] shows which domains are represented in each area. This study demonstrates that data-driven methods – commonplace in studies of human neuroanatomy and functional connectivity – provide a powerful and efficient means for mapping functional representations in the brain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The paper is worth a read, but is unfortunately behind a paywall. The group also produced the video below, which gives a brief introduction to the methods and results.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/k61nJkx5aDQ?rel=0&quot; frameborder=&quot;0&quot; allow=&quot;autoplay; encrypted-media&quot; allowfullscreen=&quot;&quot; style=&quot;display: block; margin: 0 auto;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;In extremely abbreviated form, here&amp;rsquo;s what happened: the authors of the paper put people in a &lt;a href=&quot;https://en.wikipedia.org/wiki/Functional_magnetic_resonance_imaging&quot;&gt;functional magnetic resonance imaging&lt;/a&gt; machine and took snapshots of their brain activity while they listened to podcasts. They tracked the exact moments at which each subject heard each word in a podcast recording, yielding a large dataset mapping individual words to the brain responses of subjects who heard those words.&lt;/p&gt;

&lt;p&gt;They combined this dataset with several fancy computational models to produce maps of &lt;em&gt;semantic selectivity&lt;/em&gt;, charting which parts of the brain respond especially strongly to which sorts of words. You can see the video for examples, or try out their online &lt;a href=&quot;http://gallantlab.org/huth2016/&quot;&gt;3D brain viewer&lt;/a&gt; yourself.&lt;/p&gt;

&lt;p&gt;This systems neuroscience paper managed to reach people all the way out in the AI community, as it seemed to promise a comprehensive account of actual neural word representations.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; There has since been plenty of criticism of the paper on multiple levels – in experimental design, in modeling choices, in scientific value, and so forth. In this post, I&amp;rsquo;ll raise a simple philosophical issue with the claims of the paper. That issue has to do with the central concept of &amp;ldquo;representation.&amp;rdquo; This paper&amp;rsquo;s claims to representation bring us to what I think is one of the most important open questions in the philosophy of cognitive science and neuroscience.&lt;/p&gt;

&lt;p&gt;&lt;small&gt;This post is intended to serve as a non-philosopher-friendly introduction to the problem of neural representation. Rather than advancing any new theory in this post, I&amp;rsquo;ll just chart out the problem and end with some links to further reading.&lt;/small&gt;&lt;/p&gt;

&lt;!--Unfortunately, while representation is such an important topic, it&apos;s also one of the most difficult to communicate. I&apos;ll do my best, grounding the discussion in the claims of this particular paper.--&gt;

&lt;h2 id=&quot;the-essential-argument&quot;&gt;The essential argument&lt;/h2&gt;

&lt;p&gt;The authors repeatedly allude to &amp;ldquo;(functional) representations&amp;rdquo; of words in the brain. This term is often bandied about in systems neuroscience, but it is much more philosophically troubling than you might think at first glance. Let&amp;rsquo;s spell out the high-level logic of the paper:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;We play subjects some podcasts and make sure they pay attention.&lt;/li&gt;
  &lt;li&gt;At the same time, we record physical traces of their brain activity.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;After we have collected our dataset matching words spoken in the podcasts to brain activity, we build a mathematical model relating the two. We find that we can predict the brain activity of a subject (in particular regions) based on the words that they heard at that moment.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;When we can predict the brain activity of a region with reasonable accuracy based solely on the identity of the word being heard, we can say that the region serves to &lt;em&gt;represent&lt;/em&gt; that word.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;Our derived semantic map shows how the brain represents words from different semantic domains in different areas.&lt;/li&gt;
&lt;/ol&gt;
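&lt;p&gt;Steps 3 and 4 above amount to fitting an &lt;em&gt;encoding model&lt;/em&gt;. As a rough sketch only (random stand-in data and plain ridge regression, not the authors&amp;rsquo; actual word features or fMRI pipeline), the logic looks like this:&lt;/p&gt;

```python
# Hedged illustration of the encoding-model logic, NOT the paper's code:
# predict each brain region's response from features of the word heard.
# All data below is synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(0)
n_words, n_features, n_voxels = 200, 50, 10

X = rng.normal(size=(n_words, n_features))        # hypothetical word features
true_W = rng.normal(size=(n_features, n_voxels))  # unknown "true" mapping
Y = X @ true_W + 0.1 * rng.normal(size=(n_words, n_voxels))  # "brain activity"

# Ridge regression: W = (X'X + lam*I)^{-1} X'Y
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)

# Step 4's criterion: a region "represents" a word insofar as its
# activity is predictable from that word's features.
pred = X @ W
r = np.corrcoef(pred[:, 0], Y[:, 0])[0, 1]
```

&lt;p&gt;When predicted and observed responses correlate well, the paper&amp;rsquo;s logic licenses the claim that the region &lt;em&gt;represents&lt;/em&gt; the word. Whether that inference is warranted is exactly what the rest of this post questions.&lt;/p&gt;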

&lt;p&gt;Let&amp;rsquo;s step back and put on our philosopher-hats here.&lt;/p&gt;

&lt;h2 id=&quot;things-bumping-around&quot;&gt;Things bumping around&lt;/h2&gt;

&lt;p&gt;What we &lt;em&gt;actually&lt;/em&gt; observe in this experiment are two different types of physical events. First, a word is played through a pair of headphones, vibrating the air around the ears of the subject in a particular way. Next, we see some neurons firing in the brain, spewing out neurotransmitters and demanding extra nutrients to replenish their strength.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; We find that there is some regular relationship between the way the air vibrates (that is, the particular words a subject hears) and the way particular populations of neurons respond.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s make an even higher-level gloss of the core logic in this spirit:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;We make some atoms bump around in pattern \( A \) near the subject&amp;rsquo;s ears.&lt;/li&gt;
  &lt;li&gt;We watch how some atoms bump around at the same time in a nearby area (the subject&amp;rsquo;s brain). Call this pattern \( B(A) \). Note that \( B \) is a function of \( A \) – we group the atom-bumps in the brain according to the particular patterns \( A \) presented to the subject.&lt;/li&gt;
  &lt;li&gt;We build a mathematical model relating the ear-atom-bumping \( A \) to the brain-atom-bumping \( B(A) \).&lt;/li&gt;
  &lt;li&gt;When our model accurately predicts the bumping \( B(A) \) given the bumping \( A \), we say that \( B(A) \) &lt;em&gt;represents&lt;/em&gt; some aspect of \( A \).&lt;/li&gt;
  &lt;li&gt;The brain activity pattern \( B(A) \) represents the ear-bumping pattern \( A \).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this level of abstraction—a level which might sound a little silly, but which preserves the essential moves of the argument—we might be able to draw out a strange logical leap. Point #4 takes a correlation between different bumping-patterns \( A \) and \( B(A) \) and concludes that \( B(A) \) &lt;em&gt;represents&lt;/em&gt; \( A \).&lt;/p&gt;

&lt;h2 id=&quot;correlation-as-representation&quot;&gt;Correlation as representation&lt;/h2&gt;

&lt;p&gt;That notion of representation captures the relevant relation in the paper. But it also captures quite a bit more – namely, any pair of physical events \( A \), \( B(A) \) for which some aspect of \( B(A) \) correlates with some aspect of \( A \). Here&amp;rsquo;s a random list of pairs of physical events or states which satisfy this requirement:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The length of a tree&amp;rsquo;s shadow (\( B(A) \)) and the time of day (\( A \))&lt;/li&gt;
  &lt;li&gt;My car&amp;rsquo;s engine temperature (\( B(A) \)) and the position of the key in my car&amp;rsquo;s ignition (\( A \))&lt;/li&gt;
  &lt;li&gt;The volume of a crowd in a restaurant (\( B(A) \)) and the number of eggs broken in the last hour of that restaurant&amp;rsquo;s kitchen (\( A \))&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In none of the above cases would we say that the atom/molecule/photon-bumps \( B(A) \) &lt;em&gt;represent&lt;/em&gt; an aspect of \( A \). So why do we make the claim so confidently when it comes to brains? Our model of the brain as an information-processor needs this notion of representation to be rather strong – to not also include random physical relationships between shadows and time, or volumes and egg-cracking.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h2 id=&quot;the-quest&quot;&gt;The quest&lt;/h2&gt;

&lt;p&gt;We could just declare by fiat, of course, that the relationships between the brain and the outside world are the ones we are interested in explaining. But as scientists we are interested in developing explanations that are maximally &lt;em&gt;observer-independent&lt;/em&gt;. The facts we discover – that region \( X \) of the brain exhibiting a pattern \( B(A) \) represents some aspect \( A \) of the outside world – ought to be true whether or not any scientist cares to investigate it. Our desired notion of representation should emerge &lt;em&gt;naturally&lt;/em&gt; from a description of how \( B(A) \) and \( A \) relate, without selecting the silly cases from above. For this reason, people generally think of this theoretical program as a quest for &lt;strong&gt;naturalistic&lt;/strong&gt; representation.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2018/escher.jpg&quot; alt=&quot;M.C. Escher &amp;mdash; Hand with Reflecting Sphere.&quot; /&gt;&lt;figcaption&gt;M.C. Escher &amp;mdash; Hand with Reflecting Sphere.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;A first response:&lt;/strong&gt; Sure, the details of \( B(A) \) can be used to infer the details of \( A \) in all of these cases, including the case of the Nature paper. The difference between the Nature paper and the silly examples given above is that the correlation between \( B(A) \) and \( A \) is &lt;em&gt;relevant&lt;/em&gt; or important in some sense. We&amp;rsquo;re capturing some actual mechanistic relationship in the case of the brain, whereas the other examples simply pick out chance correlations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A counter:&lt;/strong&gt; I don&amp;rsquo;t see a principled difference between your &amp;ldquo;mechanistic relationships&amp;rdquo; and your &amp;ldquo;chance correlations.&amp;rdquo; There are certainly &lt;a href=&quot;https://en.wikipedia.org/wiki/Trigonometry&quot;&gt;mechanistic explanations which link the length of a tree&amp;rsquo;s shadow and the time of day&lt;/a&gt;, or any of the other pairs given above. Why privilege the neural relationship with the label of &amp;ldquo;mechanism?&amp;rdquo;&lt;/p&gt;

&lt;p&gt;Our answer to that question can&amp;rsquo;t fall back on claims about the brain being a more &amp;ldquo;interesting&amp;rdquo; or &amp;ldquo;relevant&amp;rdquo; system of study in any respect. We need to find a naturalistic account of why the brain as a data-processor is any different than those (admittedly silly) examples above.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;This, then, is the critical problem of representation in the brain: we need to find some way to assert that the brain is doing something substantial in responding to its inputs, over and above the way a tree or a car engine &amp;ldquo;respond&amp;rdquo; to their &amp;ldquo;inputs.&amp;rdquo; (Why do we need scare-quotes in the second case, but not in the first?)&lt;/p&gt;

&lt;p&gt;Future posts on this blog will characterize some of the most popular responses
to this conceptual issue. In particular, I&amp;rsquo;ll explore notions of representation
which require an account of how content is &lt;em&gt;used&lt;/em&gt; or &lt;em&gt;consumed&lt;/em&gt;. For now, though, I&amp;rsquo;ll
link to some relevant writing:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;From neuroscientists:&lt;/strong&gt; &lt;a href=&quot;https://doi.org/10.1146/annurev.neuro.23.1.613&quot;&gt;deCharms &amp;amp; Zador (2000)&lt;/a&gt;, &lt;a href=&quot;https://doi.org/10.1146/annurev.neuro.21.1.227&quot;&gt;Parker &amp;amp; Newsome (1998)&lt;/a&gt; – more sophisticated operational definitions of neural representation.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;From philosophers:&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://doi.org/10.1080/0952813021000055630&quot;&gt;Ramsey (2003)&lt;/a&gt; – difficult, but very exciting, attack on the idea of neural representation.&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://doi.org/10.1007/s11098-013-0172-0&quot;&gt;Egan (2013)&lt;/a&gt;, see also &lt;a href=&quot;https://vimeo.com/groups/neuphi/videos/60800468&quot;&gt;video here&lt;/a&gt; – argues that talk of representational content is simply a useful &amp;ldquo;gloss&amp;rdquo; on actual theories. (Directed at mental representation, but applies just as well to neural representation.)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;&lt;/script&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Here &amp;ldquo;area&amp;rdquo; means a particular region of the cortex of the human brain.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is absolutely not the first paper on how words are represented neurally – see e.g. &lt;a href=&quot;https://doi.org/10.1146/annurev.psych.57.102904.190143&quot;&gt;Martin (2007)&lt;/a&gt;. It may be unique as of 2016, though, in its breadth and its spread into the AI community. The first author of the paper presented this work, for example, at the NIPS conference in the winter of 2016.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In this particular case, those traces consist of changes in blood flow to different regions of the brain, detected by a machine with an enormous magnet surrounding the person&amp;rsquo;s head. For more, check out the &lt;a href=&quot;https://en.wikipedia.org/wiki/Functional_magnetic_resonance_imaging&quot;&gt;Wikipedia article on functional magnetic resonance imaging (fMRI)&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Technical note: &amp;ldquo;at that moment&amp;rdquo; is not exactly correct, since fMRI data only tracks the pooled responses of samples of neurons over the span of several seconds.&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Another hedge: what we &lt;em&gt;actually&lt;/em&gt; observe is the flow of oxygenated and deoxygenated blood around the brain. I&amp;rsquo;ll stop making these technical hedges now; the neuroscientists can grant me a bit of loose language, and the non-neuroscientists nerdy enough to read these footnotes are hopefully motivated by this point to go &lt;a href=&quot;https://en.wikipedia.org/wiki/Functional_magnetic_resonance_imaging&quot;&gt;read about the details of fMRI&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;M.H. points out that this naïve notion of neural representation also fails to pick out cases we would call proper representation. Consider entertaining an arbitrary thought, which (presumably) activates neural populations in such a way that we&amp;rsquo;d say those activations &lt;em&gt;represent&lt;/em&gt; the thought. It&amp;rsquo;s not possible in this case to point out any stimulus or action correlated with that neural activity, since the actual represented content of the thought is unobservable to the scientist.&amp;nbsp;&lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>This is not an academic post</title>
      <link>http://foldl.me/2018/non-academic/</link>
      <pubDate>Thu, 29 Mar 2018 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2018/non-academic/</guid>
      <description>&lt;p&gt;A friend remarked recently that the majority of the recent posts on this blog were rather &amp;ldquo;academic.&amp;rdquo; Well, I&amp;rsquo;m an academic, aren&amp;rsquo;t I? I didn&amp;rsquo;t see any problem with this label at the time.&lt;/p&gt;

&lt;p&gt;But it turns out that academics are people, too—and, as genuine people, might benefit from exploring outside the ivory tower every once in a while. In celebration of my own human-ness, then, this week&amp;rsquo;s post has zero intellectual content.&lt;/p&gt;

&lt;p&gt;I was sitting earlier tonight at &lt;a href=&quot;https://cambridgezen.org/&quot;&gt;Cambridge Zen Center&lt;/a&gt; in a weekly community meeting. It&amp;rsquo;s a nice, humble get-together where Zen Center members and a few dozen people from the community come to talk about meditation and Buddhism.&lt;/p&gt;

&lt;blockquote class=&quot;instagram-media&quot; data-instgrm-captioned=&quot;&quot; data-instgrm-permalink=&quot;https://www.instagram.com/p/BTekS3gjTTI/&quot; data-instgrm-version=&quot;8&quot; style=&quot; background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:650px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);&quot;&gt;&lt;div style=&quot;padding:8px;&quot;&gt; &lt;div style=&quot; background:#F8F8F8; line-height:0; margin-top:40px; padding:50.0% 0; text-align:center; width:100%;&quot;&gt; &lt;div style=&quot; background:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACwAAAAsCAMAAAApWqozAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAAMUExURczMzPf399fX1+bm5mzY9AMAAADiSURBVDjLvZXbEsMgCES5/P8/t9FuRVCRmU73JWlzosgSIIZURCjo/ad+EQJJB4Hv8BFt+IDpQoCx1wjOSBFhh2XssxEIYn3ulI/6MNReE07UIWJEv8UEOWDS88LY97kqyTliJKKtuYBbruAyVh5wOHiXmpi5we58Ek028czwyuQdLKPG1Bkb4NnM+VeAnfHqn1k4+GPT6uGQcvu2h2OVuIf/gWUFyy8OWEpdyZSa3aVCqpVoVvzZZ2VTnn2wU8qzVjDDetO90GSy9mVLqtgYSy231MxrY6I2gGqjrTY0L8fxCxfCBbhWrsYYAAAAAElFTkSuQmCC); display:block; height:44px; margin:0 auto -44px; position:relative; top:-22px; width:44px;&quot;&gt;&lt;/div&gt;&lt;/div&gt; &lt;p style=&quot; margin:8px 0 0 0; padding:0 4px;&quot;&gt; &lt;a href=&quot;https://www.instagram.com/p/BTekS3gjTTI/&quot; style=&quot; color:#000; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none; word-wrap:break-word;&quot; target=&quot;_blank&quot;&gt;Springtime at Cambridge Zen Center. Always a good time for practice! 
#spring #springtime #cherry #cherryblossom #cambridge #zen #meditation #practice #rightnow #justdoit&lt;/a&gt;&lt;/p&gt; &lt;p style=&quot; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;&quot;&gt;A post shared by &lt;a href=&quot;https://www.instagram.com/cambridgezencenter/&quot; style=&quot; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px;&quot; target=&quot;_blank&quot;&gt; Cambridge Zen Center&lt;/a&gt; (@cambridgezencenter) on &lt;time style=&quot; font-family:Arial,sans-serif; font-size:14px; line-height:17px;&quot; datetime=&quot;2017-04-29T17:27:06+00:00&quot;&gt;Apr 29, 2017 at 10:27 PDT&lt;/time&gt;&lt;/p&gt;&lt;/div&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; defer=&quot;&quot; src=&quot;//www.instagram.com/embed.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;Each meeting begins with a five-minute meditation. We had an unusually large crowd tonight, and the room was packed as we settled in for our sit. A clap from the leader signaled the start of the five minutes, and the room fell silent.&lt;/p&gt;

&lt;p&gt;But that silence tonight was &lt;a href=&quot;/etc/sound/&quot;&gt;by no means the absence of sound&lt;/a&gt;. I&amp;rsquo;ve been going to these meetings for about 9 months, but somehow never noticed within the stillness all this &lt;em&gt;noise&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A man behind me pushed air back and forth slowly over tensed vocal cords, singing a high-frequency static like that of a distant sea. A girl to my side breathed quickly, occasionally voicing little falsetto squeaks. In front of me, a man exhaled in quick bursts, like a horse just after a gallop. Beneath these solos swayed a textured chorus of &lt;em&gt;in&lt;/em&gt;s and &lt;em&gt;out&lt;/em&gt;s, &lt;em&gt;in&lt;/em&gt;s and &lt;em&gt;out&lt;/em&gt;s.&lt;/p&gt;

&lt;p&gt;The symphony at the Zen Center cued a memory of a quiet forest, with the wind filtering through the leaves of the trees: &lt;em&gt;in&lt;/em&gt; and &lt;em&gt;out&lt;/em&gt;, &lt;em&gt;in&lt;/em&gt; and &lt;em&gt;out&lt;/em&gt;.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2018/suttama.650px.jpg&quot; alt=&quot;The forest behind Dhamma Suttama. Montebello, Québec.&quot; /&gt;&lt;figcaption&gt;The forest behind Dhamma Suttama. Montebello, Québec.&lt;time&gt;August 2017&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;It was a unique moment. After five precious minutes, we separated ourselves from our branched brethren and began to talk.&lt;/p&gt;

&lt;p&gt;This post shall have no conclusion attempting to induce any general lessons from the above story. Instead, without a whiff of conceptual analysis or other &amp;ldquo;academic&amp;rdquo; hullabaloo, it will simply end.&lt;/p&gt;

</description>
    </item>
    
    
    
    <item>
      <title>I saw a dog</title>
      <link>http://foldl.me/2018/dog/</link>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2018/dog/</guid>
      <description>&lt;p&gt;—as I walked down a quiet side-street in Cambridge, not far from Central
Square. I was glued to my phone and couldn&amp;rsquo;t make out so many details without
looking up, but I could see that it was middle-sized and black, facing me and angled to the north-east.&lt;/p&gt;

&lt;p&gt;I could tell this was a dog not only from its shape, but also from that
primitive &lt;em&gt;thwang&lt;/em&gt; that dogs trigger in my bones. I&amp;rsquo;m not afraid of dogs – I&amp;rsquo;ve spent most of my life around them – but I&amp;rsquo;m still wary around arbitrary canines on the street, leashed or not.&lt;/p&gt;

&lt;p&gt;I felt that &lt;em&gt;thwang&lt;/em&gt; as I registered the dog&amp;rsquo;s basic features. Black, medium size – maybe a black labrador. I raised my head, ready to step out of the way, smile at the owner, follow the basic program. But there was no dog in front of me.&lt;/p&gt;

&lt;p&gt;What was in front of me was not a black labrador, but a commuter bike locked to
a slightly oblique street sign. The bike had a thin black seat and narrow road
tires, with a rusty pannier rack framing its back wheel. Its handlebars – drop
bars, taped black – were angled away from me. No recognizable dog-features in sight, let alone an actual dog.&lt;/p&gt;

&lt;p&gt;How could my own experience of the world be so &lt;em&gt;wrong?&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Am I pathological? I don&amp;rsquo;t think so. I&amp;rsquo;ve been noticing more of these
experiences over the past few months. Sights, sounds, and sensations occasionally reveal themselves to be little fibs: reasonable, but ultimately inaccurate, pictures of what is &lt;em&gt;actually&lt;/em&gt; out there in the real world.&lt;/p&gt;

&lt;p&gt;There are at least two pictures of perception that such fib-experiences might suggest. Both views suggest that sensations and beliefs combine to produce our visual experience—this much is uncontroversial. They differ, though, on how much credit is assigned to each of those sources.&lt;/p&gt;

&lt;p&gt;In one picture, my brain takes in an &lt;a href=&quot;http://schwitzsplinters.blogspot.com/2018/02/is-consciousness-sparse-or-abundant.html&quot;&gt;&lt;strong&gt;abundant&lt;/strong&gt;&lt;/a&gt; amount of detail about the visual world at all times. On top of that abundant stream of information, some higher-level system sprinkles on the conceptual details: &lt;em&gt;that cube is a cardboard box&lt;/em&gt;, &lt;em&gt;that wiggling object is dangerous&lt;/em&gt;, and so on. Serious mistakes in those higher-level attributions – like the dog-percept presented above – can temporarily paint over my sensory inputs and cause me to see things that aren&amp;rsquo;t there.&lt;/p&gt;

&lt;p&gt;A second picture suggests that the sensory information reaching my brain at any moment is actually quite &lt;strong&gt;sparse&lt;/strong&gt;. On top of this sparse stream, most of the work of perception is performed at higher levels, with the mind &amp;ldquo;filling in&amp;rdquo; all of the gaps in my sensory data on the basis of beliefs and expectations. In this view, it&amp;rsquo;s not that the mind overwrites sensory data that is already there — rather, the mind is continuously tasked with &lt;em&gt;filling in&lt;/em&gt; perceptual information not present in the original sensory data.&lt;/p&gt;

&lt;p&gt;To further develop these two pictures, I&amp;rsquo;ll turn to some details on the human eye.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The human retina contains two major types of &lt;a href=&quot;https://en.wikipedia.org/wiki/Photoreceptor_cell&quot;&gt;light-sensitive cells&lt;/a&gt;:&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; rods and cones. Rods are responsible for vision in low light, and are not sensitive to color. Cones function well only in high-light situations, and uniquely support color vision.&lt;/p&gt;

&lt;p&gt;It turns out that these two types of cells are distributed unequally in the retina. Cones cluster around the area of the retina which maps to the center of our visual field (the &lt;a href=&quot;https://en.wikipedia.org/wiki/Fovea&quot;&gt;fovea&lt;/a&gt;), while rods dominate everywhere else.&lt;/p&gt;

&lt;figure&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Human_photoreceptor_distribution.svg#/media/File:Human_photoreceptor_distribution.svg&quot;&gt;&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/commons/3/3c/Human_photoreceptor_distribution.svg&quot; width=&quot;480&quot; height=&quot;480&quot; alt=&quot;Human photoreceptor distribution.svg&quot; /&gt;&lt;/a&gt;&lt;figcaption&gt;Spatial distribution of rods and cones in the human retina. From &lt;a href=&quot;//commons.wikimedia.org/wiki/User:Cmglee&quot; title=&quot;User:Cmglee&quot;&gt;Cmglee on Wikipedia&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This spatial distribution suggests that, at any moment, the majority of the color information my retina receives only picks out points in the very center of my visual field.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;This is one case, then, in which the brain seems to receive rather &lt;strong&gt;sparse&lt;/strong&gt; sensory information. That&amp;rsquo;s puzzling, because it doesn&amp;rsquo;t seem to map onto my experience.  I certainly don&amp;rsquo;t think that my color vision is limited to the very center of my visual field—I really hope yours isn&amp;rsquo;t, either.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;How is it that I perceive the world as fully colored, if my sensory machinery cannot possibly yield such an image? If that underlying hardware is yielding only a sparse picture of the real world, why does color feel so abundant in my visual experience?&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;https://www.tandfonline.com/doi/abs/10.1080/13506280701295453&quot;&gt;Balas &amp;amp; Sinha (2007)&lt;/a&gt; present a simple experiment which will help us better draw out this sparse view. Their results offer behavioral evidence that some higher-level mental module actively fills in color information, turning what is originally a rather sparse sensory representation into an abundant visual experience.&lt;/p&gt;

&lt;p&gt;(Unfortunately, the paper is not open-access, and the publisher demands hundreds of dollars for the rights to reproduce the figures. So I&amp;rsquo;ll do my best to summarize the procedure and results here.)&lt;/p&gt;

&lt;p&gt;The authors prepared modified images of natural scenes like the one in the figure below. They took full-color images and imposed a randomly sized color mask, such that a circle in the center of the image remained in color while the rest of the image appeared in grayscale.&lt;/p&gt;

&lt;figure&gt;&lt;a href=&quot;/uploads/2018/chimera.jpg&quot;&gt;&lt;img src=&quot;/uploads/2018/chimera.jpg&quot; alt=&quot;Partially-colored chimera like those used in Balas &amp;amp; Sinha (2007).&quot; /&gt;&lt;/a&gt;&lt;figcaption&gt;Partially-colored chimera image like those used in Balas &amp;amp; Sinha (2007).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;These &amp;ldquo;chimera&amp;rdquo; images were rapidly presented to adult subjects, and the subjects  were asked after each presentation to report whether the image they had just seen was in grayscale, full color, or a mix of the two.&lt;/p&gt;

&lt;p&gt;The crucial metric of interest here is the rate of color &amp;ldquo;false alarms&amp;rdquo; — that is, how often subjects perceive an image with only its center in color as a fully colored picture. These false alarms would be evidence of the brain filling in color percepts.&lt;/p&gt;

&lt;p&gt;What would we expect to find? We know that the actual sensory data is rather sparse — recall that the majority of color-sensitive photoreceptors cluster in the fovea, in the center of the visual field. We might guess, then, that so long as the region of the image perceived by this color-sensitive area is appropriately colored, the brain would fill in the rest of the percept.&lt;/p&gt;

&lt;!--&lt;figure&gt;&lt;a href=&quot;/uploads/2018/balas_sinha_2007-results.simple.png&quot;&gt;&lt;img src=&quot;/uploads/2018/balas_sinha_2007-results.simple.png&quot; alt=&quot;False alarm rates from Balas &amp;amp; Sinha (2007).&quot;&gt;&lt;/a&gt;&lt;figcaption&gt;False alarm rates from Balas &amp; Sinha (2007): false alarms increase as the color mask covers more of the center of the subjects&apos; visual fields. Reproduced from Figure 2.&lt;/figcaption&gt;
&lt;/figure&gt;--&gt;

&lt;p&gt;This is what Balas &amp;amp; Sinha find in their main result: &lt;strong&gt;even when nontrivial portions of the image are presented in grayscale, people are likely to report that the entire image is in color.&lt;/strong&gt; For example, when the color mask covers 17.5 degrees of the visual field, subjects report that the entire image is colored almost 40% of the time. These false alarm rates reach 60% as the size of the color mask increases.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s much more to the paper: the authors present further experiments attempting to work out the source of the information used to fill in color. For our purposes, though, this headline result is already interesting.&lt;/p&gt;

&lt;p&gt;We have evidence from both directions for the sparse view, then:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;At the neural level we can see that the hardware to support color vision is clustered around a small central area of the retina, yielding rather sparse information about color in the rest of the visual field.&lt;/li&gt;
  &lt;li&gt;At the behavioral level we see that people often perceive these partially-colored images as fully colored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It seems, then, that higher-level mechanisms are doing quite a bit of work in the brain to &amp;ldquo;fill in&amp;rdquo; the sparse information which manages to filter through the retina.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Why am I writing about this? I think the &amp;ldquo;filling in&amp;rdquo; metaphor is a useful tool for the mental toolbox.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; While this sort of phenomenon shows up again and again in psychology, I feel like I&amp;rsquo;ve only just begun to internalize it — to start to actually &lt;em&gt;see&lt;/em&gt; footprints of the process in my own experience.&lt;/p&gt;

&lt;p&gt;This is likely due only in small part to my intellectual understanding of the process. It&amp;rsquo;s more likely, I think, that regular meditation and introspection are what is actually helping me see my own experience more clearly.&lt;/p&gt;

&lt;p&gt;In any case, it&amp;rsquo;s quite the thrilling ride. I am catching my mind for the regular fibber that it is, as it paints pretty pictures over messy and sparse arrays of input from my sensory hardware. Happy hallucinating!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Fun fact. There is actually a third type with quite a long name: &lt;a href=&quot;https://en.wikipedia.org/wiki/Intrinsically_photosensitive_retinal_ganglion_cells&quot;&gt;intrinsically photosensitive retinal ganglion cells&lt;/a&gt;. These cells (a ~1% minority in the retina) help regulate circadian rhythms and contribute to melatonin production/suppression. They were first hypothesized after scientists discovered that supposedly blind mice were still able to respond to changes in their visual environment.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is not exactly correct, of course. We rapidly and subconsciously &lt;a href=&quot;https://www.tandfonline.com/doi/abs/10.1080/13506280701295453&quot;&gt;microsaccade&lt;/a&gt;, even when we feel we are fixating our eyes on one position in our visual field. It&amp;rsquo;s possible that these microsaccades function in part to gather information about colors and forms in our periphery. I don&amp;rsquo;t pretend to cover all my bases as a vision scientist here – I only hope to get the broad strokes of this argument correct.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I also don&amp;rsquo;t think that my peripheral vision is especially acute in low-light conditions.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Psychologists and cognitive scientists might be reminded of terms like &amp;ldquo;top-down processing&amp;rdquo; and &amp;ldquo;predictive processing.&amp;rdquo; I&amp;rsquo;m not sure this metaphor adds anything on top of those, but it does sound quite a bit more intuitive. Anyway, the point of this post is to share some fun facts and ideas, not to present a novel metaphor.&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>LeCun&#58; Language is the next frontier for AI—or not</title>
      <link>http://foldl.me/2018/language-is-the-next-frontier/</link>
      <pubDate>Fri, 23 Feb 2018 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2018/language-is-the-next-frontier/</guid>
      <description>&lt;p&gt;&lt;a href=&quot;http://www.abigailsee.com/&quot;&gt;Abi See&lt;/a&gt; recently hosted a debate between Yann LeCun (Facebook/NYU) and
Chris Manning (Stanford) on the importance of linguistic structure and innate
priors in systems for natural language understanding. The video, along with a
nice write-up from Abi herself, is available &lt;a href=&quot;http://www.abigailsee.com/2018/02/21/deep-learning-structure-and-innate-priors.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I used to have strong opinions on the use of linguistic structure in NLP
systems.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; I&amp;rsquo;m no longer so passionate about that debate, but I still found
something interesting in this particular discussion. Yann made a striking
remark near the end of the debate (beginning in the video at &lt;a href=&quot;https://youtu.be/fKk9KhGRBdI?t=59m54s&quot;&gt;59:54&lt;/a&gt;):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Language is sort of an epiphenomenon [of human intelligence] &amp;hellip; it&amp;rsquo;s
not that complicated &amp;hellip; There is a lot to intelligence that has
absolutely nothing to do with language, and that&amp;rsquo;s where we should attack
things first. &amp;hellip; [Language] is number 300 in the list of 500 problems
that we need to face.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wow. Those words come from the man who regularly claimed a few years ago (circa
2014) that language was the &amp;ldquo;next frontier&amp;rdquo; of artificial intelligence.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Four years is quite a long time in the world of AI — enough to legitimately
change your mind in the light of evidence (or lack thereof). Recall that NLP in
2014 was awash in the first exciting results in neural machine translation.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;
LeCun rode the wave, too: in 2015 he and one of his students put up a rather
ambitiously titled preprint called &lt;a href=&quot;https://arxiv.org/abs/1502.01710&quot;&gt;&amp;ldquo;Text Understanding from Scratch.&amp;rdquo;&lt;/a&gt;
(No, they didn&amp;rsquo;t solve &amp;ldquo;text understanding.&amp;rdquo;)&lt;/p&gt;

&lt;p&gt;Yann seems to have had a change of heart since those brave words. I think the
natural language processing community as a whole has begun to brush off the
deep learning hype, too.&lt;/p&gt;

&lt;p&gt;One can hope.&lt;/p&gt;

&lt;!--Chris actually took part in a similar debate in 2014 with Andrew Ng, back when
he was still around at Stanford. In that (private) discussion Andrew used the
success of end-to-end ASR systems to argue that the notion of &quot;phonemes&quot; was no
longer relevant to--&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Heck, it was a central motivation for &lt;a href=&quot;/2016/spinn-hybrid-tree-sequence-models&quot;&gt;my research&lt;/a&gt; at that time. I suppose that was a natural consequence of being directly advised by Chris. :)&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The &amp;ldquo;we&amp;rdquo; in this quote is ambiguous. I&amp;rsquo;d guess from context that he was referring to Facebook AI, but he could have also meant to refer to the larger AI research community.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I recall this distinctive phrasing from several public talks, but we also have some text records. Source 1, &lt;a href=&quot;https://www.reddit.com/r/MachineLearning/comments/25lnbt/ama_yann_lecun/chif3ys/&quot;&gt;Yann&amp;rsquo;s response to a 2014 AMA&lt;/a&gt;: &amp;ldquo;The next frontier [sic] for deep learning are language understanding, video, and control/planning.&amp;rdquo; Source 2, quoted in a &lt;a href=&quot;https://www.wired.com/2015/06/ais-next-frontier-machines-understand-language/&quot;&gt;2015 article from Cade Metz&lt;/a&gt;: &amp;ldquo;Yann LeCun &amp;hellip; calls natural language processing &amp;lsquo;the next frontier.&amp;rsquo;&amp;rdquo;&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See e.g. &lt;a href=&quot;https://www.semanticscholar.org/paper/Recurrent-Continuous-Translation-Models-Kalchbrenner-Blunsom/4b9b7eed30feee37db3452b74503d0db9f163074&quot;&gt;Kalchbrenner &amp;amp; Blunsom (2013)&lt;/a&gt;; &lt;a href=&quot;https://arxiv.org/abs/1409.3215&quot;&gt;Sutskever, Vinyals, &amp;amp; Le (2014)&lt;/a&gt;; and &lt;a href=&quot;https://arxiv.org/abs/1409.0473&quot;&gt;Bahdanau, Cho, &amp;amp; Bengio (2014)&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    
    
    <item>
      <title>How to prepare for a Vipassana retreat</title>
      <link>http://foldl.me/2017/vipassana-preparation/</link>
      <pubDate>Mon, 11 Sep 2017 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2017/vipassana-preparation/</guid>
      <description>&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2017/vipassana/meditation-hall.jpg&quot; alt=&quot;The meditation hall at Dhamma Suttama in Montebello, Québec.&quot; /&gt;&lt;figcaption&gt;The meditation hall at Dhamma Suttama in Montebello, Québec.&lt;time&gt;August 2017&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Several weeks ago I sat a 10-day Vipassana meditation retreat at the &lt;a href=&quot;http://suttama.dhamma.org&quot;&gt;Dhamma
Suttama&lt;/a&gt; in Montebello, Québec. In an explicitly secular setting, a team of
volunteer teachers demonstrated to around 100 of us students the basics of a
meditation technique recovered directly from the teachings of the Buddha.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; These
10 days were excruciating – painful both physically and mentally. But out of
the pain and maddening boredom emerged an inner stillness and peace I had never
felt before.&lt;/p&gt;

&lt;p&gt;I could write a lot more about my personal experience at the retreat. I could
document how days of sitting in silence brought about such heightened sensory
perception that I felt I had acquired superpowers. I could recall my dreams
during those 10 days, more vivid and bizarre than I had ever experienced.
Unsurprisingly, there is already plenty of writing like this all over the
Internet.  &lt;strong&gt;And I suggest you read none of it.&lt;/strong&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Meditation is fundamentally a solitary practice. The mission, after all, is to
discover a new way to see the world through your own eyes. This can&amp;rsquo;t be
accomplished by reading blog posts or books, or by discussing your practice
with teachers or fellow students. You can only find your own way.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve been reading other people&amp;rsquo;s accounts of Vipassana retreats after
returning from my own, and frankly, a lot of this content underserves the
experience. It&amp;rsquo;s not that there is a dearth of skilled writers discussing the
topic. It&amp;rsquo;s simply that extended meditation yields effects that really can&amp;rsquo;t be
conveyed through language, no matter the skill of the writer.&lt;/p&gt;

&lt;p&gt;I can sympathize if that last sentence sounds a little loosey-goosey to you. My
analytical-philosopher-mind of early 2017 would have thought the same. But try
to think of meditation as a method of practicing a skill – the skill of seeing
clearly into your own, first-person experience.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; It&amp;rsquo;s a skill just like
improvisational jazz, or cooking, or racecar driving. If you ask a master of
any of these skills to describe &lt;em&gt;what it&amp;rsquo;s like&lt;/em&gt; to engage their mastery, you
might be able to get a detailed description of the sights, sounds, tastes,
      smells, and thoughts involved in the experience. But hearing such a
      description is not the same thing as having these feelings first-hand, no
      matter how lengthy or beautifully worded.&lt;/p&gt;

&lt;p&gt;The special thing about meditation as skill practice, then, is that the skill
we are practicing is &lt;em&gt;necessary&lt;/em&gt; for grasping the original lessons of the
Buddha. It&amp;rsquo;s quite fun to study Buddhist notions of &lt;a href=&quot;https://en.wikipedia.org/wiki/Dukkha&quot;&gt;suffering&lt;/a&gt;,
    &lt;a href=&quot;https://en.wikipedia.org/wiki/Impermanence&quot;&gt;impermanence&lt;/a&gt;, and &lt;a href=&quot;https://en.wikipedia.org/wiki/Anatta&quot;&gt;&amp;ldquo;not-self&amp;rdquo;&lt;/a&gt; at an intellectual level. But we
    can&amp;rsquo;t fully grasp these concepts until we directly experience their
    effects — until we sit down and listen.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;So go sit down and listen for yourself! Retreats organized by the international
&lt;a href=&quot;http://www.dhamma.org&quot;&gt;Dhamma&lt;/a&gt; organization are free of charge.&lt;/p&gt;

&lt;h2 id=&quot;but-am-i-prepared&quot;&gt;But am I prepared?&lt;/h2&gt;

&lt;p&gt;Good question. I was anxiously asking this question as my own retreat
approached. By this point, I had a daily meditation practice but had never
participated in any retreat longer than several hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are not prepared&lt;/strong&gt; if you&amp;rsquo;re like I was. But this is no reason at all to
despair. Even those students returning for their second or third retreat feel
they are not physically or mentally prepared for the experience. I suppose the
only way to really be prepared for this sort of experience is to achieve
enlightenment. Unless you think you can accomplish that before your trip, I think
you ought to settle for less.&lt;/p&gt;

&lt;p&gt;That being said, I would like to end the post with some actual advice on how to
be ready for your own retreat.&lt;/p&gt;

&lt;h3 id=&quot;physical-preparation&quot;&gt;Physical preparation&lt;/h3&gt;

&lt;p&gt;First: resign yourself to the fact that &lt;strong&gt;you are not—and will not
be—physically prepared&lt;/strong&gt;.
Experienced meditators and newbies alike complain of extreme physical pain
during the one-hour sits over the ten days. You are no different.&lt;/p&gt;

&lt;p&gt;Having said that, here are some basic guidelines I think it&amp;rsquo;d be wise to follow in
order to minimize your risk of injury.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Research &lt;a href=&quot;https://www.mindful.org/find-right-meditation-posture-body/&quot;&gt;proper meditation posture&lt;/a&gt;&lt;/strong&gt;. There are lots of ways to sit in
meditation, and you need to find one (or two, or more) which work well for
your body. Whichever position you pick, &lt;em&gt;make sure that it is safe&lt;/em&gt;! Your
best bet is probably to visit a local meditation group and rely on the
teacher&amp;rsquo;s advice here. An unsafe posture can do real damage to your muscles,
bones, and nerves. Be safe!&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Sit!&lt;/strong&gt; Establish a daily meditation practice. You can sit for 5 minutes or
50 minutes at first – just make sure you keep it up every day. As your
retreat approaches, begin increasing your daily sit time. You should be
comfortable sitting in some position &amp;ndash; whichever works for you &amp;ndash; for at
least 20 or 30 minutes.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;mental-preparation&quot;&gt;Mental preparation&lt;/h3&gt;

&lt;p&gt;Again: resign yourself to the fact that &lt;strong&gt;you are not—and will not
be—mentally prepared&lt;/strong&gt;. But you can do your best with the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Remove stressors.&lt;/strong&gt; Try to leave for your retreat on good terms with your
family, friends, and colleagues. Finish big work projects or life projects.
Pay off your bills.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Keep a journal.&lt;/strong&gt; Write about your daily experience and think about what
you want to get out of your meditation practice.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Share your journey.&lt;/strong&gt; Tell your friends and family about your plans and
let them interrogate you. Some will be surprised, some won&amp;rsquo;t understand at
all, and some might want to come with you. You can use your social network
to help work out for yourself what the retreat is &lt;em&gt;for&lt;/em&gt; in the first place.&lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;p&gt;It&amp;rsquo;s up to you to begin the exploration. Good luck!&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2017/vipassana/bridge.jpg&quot; alt=&quot;A wooden bridge in the forest near the Dhamma Suttama meditation center in Montebello, Québec.&quot; /&gt;&lt;figcaption&gt;A wooden bridge in the forest near the Dhamma Suttama meditation center in Montebello, Québec.&lt;time&gt;August 2017&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Wikipedia has fairly good coverage on the &lt;a href=&quot;https://en.wikipedia.org/wiki/Vipassana_movement&quot;&gt;Vipassana movement&lt;/a&gt; and its relation to Buddhism. We were given rather orthodox lessons in the tradition of &lt;a href=&quot;https://en.wikipedia.org/wiki/Theravada&quot;&gt;Theravāda Buddhism&lt;/a&gt;, with lectures drawing almost entirely on content from the &lt;a href=&quot;https://en.wikipedia.org/wiki/P%C4%81li_Canon&quot;&gt; Pāli Canon&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In Pali, &lt;em&gt;vipassana&lt;/em&gt; actually means something like &amp;ldquo;seeing into [your own experience].&amp;rdquo; Those interested can read an entire article on &lt;a href=&quot;https://tricycle.org/magazine/vipassana-meditation/&quot;&gt;the meaning of &lt;em&gt;vipassana&lt;/em&gt;&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Sunday links</title>
      <link>http://foldl.me/2017/sunday-links-13/</link>
      <pubDate>Sun, 26 Feb 2017 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2017/sunday-links-13/</guid>
      <description>&lt;ol&gt;
  &lt;li&gt;I like plants. Plants are nice. But are they conscious? Do they think?
Follow this very beginner-friendly series on the Brains Blog:
&lt;a href=&quot;http://philosophyofbrains.com/2017/02/22/remembering-plants.aspx&quot;&gt;Can plants remember?&lt;/a&gt; &lt;a href=&quot;http://philosophyofbrains.com/2017/02/21/perceiving-plants.aspx&quot;&gt;Perceive?&lt;/a&gt; &lt;a href=&quot;http://philosophyofbrains.com/2017/02/23/representing-plants.aspx&quot;&gt;Represent?&lt;/a&gt; &lt;a href=&quot;http://philosophyofbrains.com/2017/02/19/do-plants-have-minds.aspx&quot;&gt;Think?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;How is cognitive function distributed across the human brain? A meta-study
&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756684/&quot;&gt;applies a simple entropy measure&lt;/a&gt; over distributions of activations for
brain regions across tasks. The findings: both cortical and subcortical
regions vary greatly in their task diversity. An &lt;a href=&quot;https://en.wikipedia.org/wiki/Assortativity&quot;&gt;assortativity&lt;/a&gt; measure
also varies greatly across cortex; some regions are functionally cohesive,
while others seem to yield a patchwork of different functions.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Bridging principles</title>
      <link>http://foldl.me/2017/bridging-principles/</link>
      <pubDate>Thu, 23 Feb 2017 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2017/bridging-principles/</guid>
      <description>&lt;p&gt;It&amp;rsquo;s been quiet here for a while! I have a new series of posts coming up soon, in which I&amp;rsquo;ll try to specify some desiderata for agents which &lt;em&gt;understand&lt;/em&gt; language. That word &lt;em&gt;understand&lt;/em&gt; gets thrown around a lot in my circles, with too little critical thought about it&amp;rsquo;s meaning.&lt;/p&gt;

&lt;p&gt;But enough foreshadowing—for now, I wanted to just share some philosophizing thoughts which have been rolling around in my head for the past few days.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;http://www.foldl.me/2016/situated-language-learning/&quot;&gt;Language is defined by its use.&lt;/a&gt; The meaning of the words we use can be derived from the way that they are deployed and the way that people react to them in conversation.&lt;/p&gt;

&lt;p&gt;But it&amp;rsquo;s certainly necessary that this language &lt;em&gt;bottom out&lt;/em&gt; in uses that are nonlinguistic. Language doesn&amp;rsquo;t happen in a vacuum: we deploy it as a tool in all sorts of situations in order to &lt;em&gt;get things done&lt;/em&gt;. Take the standard &lt;a href=&quot;https://en.wikipedia.org/wiki/Language-game_(philosophy)&quot;&gt;Wittgensteinian language game&lt;/a&gt; example:&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The language is meant to serve for communication between a builder A and an assistant B. A is building with building-stones: there are blocks, pillars, slabs and beams. B has to pass the stones, and that in the order in which A needs them. For this purpose they use a language consisting of the words &amp;ldquo;block&amp;rdquo;, &amp;ldquo;pillar&amp;rdquo;, &amp;ldquo;slab&amp;rdquo;, &amp;ldquo;beam&amp;rdquo;. A calls them out;—B brings the stone which he has learnt to bring at such-and-such a call.——Conceive this as a complete primitive language.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think Wittgenstein—and many others—would readily agree that this game is not just a game of words but a game of words &lt;em&gt;causing&lt;/em&gt; things and of other things &lt;em&gt;causing&lt;/em&gt; words. We can&amp;rsquo;t fully define the meaning of a word like &amp;ldquo;slab&amp;rdquo; without referring to the physical actions of A and B. In this way, linguistic meaning has to &lt;em&gt;bottom out&lt;/em&gt; at some point in nonlinguistic facts.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h3 id=&quot;bridging-principles&quot;&gt;Bridging principles&lt;/h3&gt;

&lt;p&gt;Call the above argument a defense of a &lt;strong&gt;bridging principle&lt;/strong&gt;. Generally speaking, a bridging principle is some statement which links entities from a domain or mode A to a domain or mode B, and thereby gives items in B some new sort of meaning. In the case above, nonlinguistic things—grounded objects, physical actions, nonlinguistic cognitive states, etc.—exist in a domain &lt;em&gt;A&lt;/em&gt;, and link to words, sentences, etc. in domain &lt;em&gt;B&lt;/em&gt;, thereby giving them their meaning.&lt;/p&gt;

&lt;p&gt;I wanted to make this post simply to point out that this search for bridging principles is by no means unique to language. There are at least three parallels within philosophy that I can think of off the top of my head:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A central open question in &lt;strong&gt;moral philosophy&lt;/strong&gt; asks whether normative/evaluative statements (&amp;ldquo;you should be politically involved,&amp;rdquo; &amp;ldquo;it is bad to kill people,&amp;rdquo; …; domain &lt;em&gt;B&lt;/em&gt;) &lt;em&gt;bottom out&lt;/em&gt; in non-normative statements (physical states of our brain, etc.; domain &lt;em&gt;A&lt;/em&gt;). Some people believe that this non-normative domain is the only thing that we can actually use to make our normative statements meaningful. Concretely, the bridge here is from &lt;em&gt;non-normative&lt;/em&gt; to &lt;em&gt;normative&lt;/em&gt; statements.&lt;/li&gt;
  &lt;li&gt;In &lt;strong&gt;epistemology&lt;/strong&gt;, we ask whether our (inferential) justified beliefs (&amp;ldquo;I see a rock over there,&amp;rdquo; &amp;ldquo;I am in pain,&amp;rdquo; …; domain &lt;em&gt;B&lt;/em&gt;) might &lt;em&gt;bottom out&lt;/em&gt; in things that are not beliefs at all (the perceptual experience of seeing a rock, the sense of pain; domain &lt;em&gt;A&lt;/em&gt;). Concretely, the bridge here is from &lt;em&gt;&lt;a href=&quot;https://plato.stanford.edu/entries/epistemology/#FOU&quot;&gt;nondoxastic&lt;/a&gt; experiences&lt;/em&gt; to &lt;em&gt;justified beliefs&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;In &lt;strong&gt;philosophy of mind&lt;/strong&gt;, theories of intentional representation attempt to explain how our &lt;a href=&quot;https://plato.stanford.edu/entries/content-causal/&quot;&gt;items of mental content&lt;/a&gt; (thinking of the color blue, wanting pizza; domain &lt;em&gt;B&lt;/em&gt;) represent things in the real world (blue things, the state of wanting pizza; domain &lt;em&gt;A&lt;/em&gt;). These theories explain how our representations &lt;em&gt;bottom out&lt;/em&gt; in the real world by some direct causal chain, normative conditions, etc. Concretely, the bridge here is from &lt;em&gt;real-world things&lt;/em&gt; to &lt;em&gt;representations of those things&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The case of linguistic meaning is certainly very close to #3, though I&amp;rsquo;m not yet sure how to unify the two (or if they can be unified).&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;I&amp;rsquo;m not sure what to do next with this information. In any case, I find it pleasing to recognize that a pile of nominally separate disciplines are actually all engaged in rather similar activities at a high level.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;em&gt;Philosophical Investigations&lt;/em&gt;, §2&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;John Searle calls this system of nonlinguistic facts a cognitive &amp;ldquo;Background.&amp;rdquo; Where we locate the Background — whether in the brain or in the real world — is not very relevant for the purposes of this post.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Sunday links</title>
      <link>http://foldl.me/2016/sunday-links-12/</link>
      <pubDate>Sun, 06 Nov 2016 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2016/sunday-links-12/</guid>
      <description>&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;http://digitalcollections.library.cmu.edu/awweb/awarchive?type=file&amp;amp;item=355202&quot;&gt;Whither Speech Recognition?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://philosophyofbrains.com/2016/10/26/is-the-mind-just-an-accident-of-the-universe.aspx&quot;&gt;&lt;strong&gt;Is the mind just an accident of the universe?&lt;/strong&gt;&lt;/a&gt; A very brief
introduction to panpsychism.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://schwitzsplinters.blogspot.com/2016/11/three-ways-to-be-not-quite-free-of.html&quot;&gt;&lt;strong&gt;Three ways you might still be racist.&lt;/strong&gt;&lt;/a&gt; Eric Schwitzgebel spells out
how eliminating consistent overt racist behavior is not the same thing as
not being a racist. I think it&amp;rsquo;s very good to clearly designate these other
less obvious moral violations. As a result, we&amp;rsquo;ll be forced to recognize our
own &amp;ldquo;moral mediocrity&amp;rdquo; (Eric&amp;rsquo;s phrase). The unacceptable alternative is to
pretend we are morally perfect and not leave any room to keep learning. This
is the default state of the non-critical modern liberal.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Situated language learning</title>
      <link>http://foldl.me/2016/situated-language-learning/</link>
      <pubDate>Fri, 07 Oct 2016 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2016/situated-language-learning/</guid>
      <description>&lt;p&gt;&lt;em&gt;Update: this content (and more) is now available as &lt;a href=&quot;https://arxiv.org/abs/1610.03585&quot;&gt;a more thorough abstract
on arXiv&lt;/a&gt;, co-authored with my OpenAI colleague Igor Mordatch&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve been really pleased with the response to my last post, &lt;a href=&quot;http://www.foldl.me/2016/solving-language/&quot;&gt;&lt;em&gt;On &amp;ldquo;solving
language.&amp;rdquo;&lt;/em&gt;&lt;/a&gt; While I certainly wasn&amp;rsquo;t saying anything revolutionary, it
does seem that I managed to capture some very common sentiment floating around
in the AI community today. I think the post has served as a clear checkpoint
for me and for people with similar interests: it&amp;rsquo;s time to focus on language
in a situated, interactive setting!&lt;/p&gt;

&lt;p&gt;Since that time in mid-August, I&amp;rsquo;ve been working on a paradigm for simulating
situated language acquisition. This post will give a brief overview of the
motivating ideas, and I&amp;rsquo;ll follow up shortly with more concrete details on some
experiments I&amp;rsquo;ve been doing recently.&lt;/p&gt;

&lt;p&gt;&lt;small&gt;(Before I get started: this space is rapidly increasing in activity,
        which is certainly a good thing for science! Facebook Research just
        released their &lt;a href=&quot;https://github.com/facebookresearch/CommAI-env&quot;&gt;Environment for Communication-based AI&lt;/a&gt;, and there
        have been murmurs of other similar environments around the Internets.)
&lt;/small&gt;&lt;/p&gt;

&lt;h2 id=&quot;the-paradigm&quot;&gt;The paradigm&lt;/h2&gt;

&lt;p&gt;One of the key points of &lt;a href=&quot;http://www.foldl.me/2016/solving-language/&quot;&gt;&lt;em&gt;&amp;ldquo;Solving language&amp;rdquo;&lt;/em&gt;&lt;/a&gt; was that natural language
dialogue is necessarily situated in some grounded context. We use language (and
other tools) to accomplish real-world goals, which are themselves often not
linguistic. The reference-game example in that post gave one instance of
linguistic behavior that was strongly tied to nonlinguistic world knowledge —
something we can&amp;rsquo;t solve as a language problem in isolation.&lt;/p&gt;

&lt;p&gt;If we&amp;rsquo;re interested in building language agents which can eventually cooperate
with us via language in similarly grounded contexts, then the learning tasks we
design should reflect this goal.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve followed this idea through to design a general paradigm for situated
language acquisition. In this paradigm, cooperative agents teach or learn a
language in order to accomplish some nonlinguistic goal. Here are the details:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A &lt;em&gt;child&lt;/em&gt; agent lives in some grounded world and has some goal which is
&lt;strong&gt;nonlinguistic&lt;/strong&gt; (e.g. reach a goal region, get food, etc.).&lt;/li&gt;
  &lt;li&gt;The child has only partial observations of its environment, and can take
only a subset of the necessary actions to reach its goal.&lt;/li&gt;
  &lt;li&gt;A &lt;em&gt;parent&lt;/em&gt; agent also exists in this world. The parent speaks some fixed
language and wants to cooperate with the child (to help it reach its goal).&lt;/li&gt;
  &lt;li&gt;The parent has full observations from the environment, and can take actions
which the child cannot take on its own.&lt;/li&gt;
  &lt;li&gt;The child and parent can communicate via a language channel.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The environment is designed such that the child cannot accomplish the goal on
its own; it must employ the help of its parent. The child acquires language
&lt;strong&gt;as a side effect of accomplishing its grounded goal&lt;/strong&gt;: it is the most
efficient (or perhaps the only efficient) mechanism for reaching its main goal.&lt;/p&gt;
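
&lt;p&gt;&lt;small&gt;(As a rough illustration—this is a toy sketch with made-up class and method names, not the actual experimental code—the five points above might be wired together like this:)&lt;/small&gt;&lt;/p&gt;

```python
import random

class Parent:
    """Fully observes the world and speaks a fixed language (points 3-4)."""
    def __init__(self, lexicon):
        self.lexicon = lexicon  # goal object -> word, fixed in advance

    def name_goal(self, goal):
        return self.lexicon[goal]

class Child:
    """Partially observing learner (points 1-2): starts with no lexicon."""
    def __init__(self, objects):
        self.objects = objects
        self.meanings = {}  # word -> object, learned only from task feedback

    def interpret(self, word):
        # Use a learned meaning if available; otherwise guess uniformly.
        return self.meanings.get(word, random.choice(self.objects))

    def learn(self, word, obj, success):
        # Language acquisition as a side effect: associate the word with
        # the object only when doing so helped reach the goal.
        if success:
            self.meanings[word] = obj

def run_episode(parent, child, goal):
    """Language channel (point 5): parent names the goal, child acts."""
    word = parent.name_goal(goal)
    choice = child.interpret(word)
    success = (choice == goal)
    child.learn(word, choice, success)
    return success
```

&lt;p&gt;&lt;small&gt;(The point of the sketch is the reward structure: the child never optimizes a linguistic objective directly; its lexicon grows only when an utterance helps it accomplish its grounded goal.)&lt;/small&gt;&lt;/p&gt;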

&lt;h2 id=&quot;philosophizing&quot;&gt;Philosophizing&lt;/h2&gt;

&lt;p&gt;To clearly restate: a critical and distinguishing factor of this framework is
that the child acquires language only as a side effect of striving for some
grounded, nonlinguistic goal.&lt;/p&gt;

&lt;p&gt;The environment is designed in particular to avoid &lt;strong&gt;reifying&lt;/strong&gt; &amp;ldquo;language.&amp;rdquo; I
think it is misleading to see language as some sort of unitary &lt;em&gt;thing&lt;/em&gt; to be
solved — as just one of a few isolated tools in the toolbox of cognition that
need to be picked up on the way to general artificial intelligence.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Language-game_(philosophy)&quot;&gt;&lt;strong&gt;Language is defined by its use.&lt;/strong&gt;&lt;/a&gt; Language-enabled agents are not identified by
their next-word prediction perplexity or their part-of-speech tag confusion
matrix, but by their ability to cooperate with other agents through language.
We shouldn&amp;rsquo;t expect the latter to magically emerge from hill-climbing on any of
the former.&lt;/p&gt;

&lt;p&gt;As I&amp;rsquo;ll show in my next post, it&amp;rsquo;s within our reach to design simple environments
that let us directly hill-climb on this objective of cooperation through language.
Stay tuned!&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And please get in touch! I always enjoy hearing new ideas from my readers. (All four of you. ;) )&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2016/sunday-links-11/</link>
      <pubDate>Sun, 18 Sep 2016 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2016/sunday-links-11/</guid>
      <description>&lt;ul&gt;
  &lt;li&gt;Writing about writing about writing about language: John McWhorter helps kick
off Vox&amp;rsquo;s new &amp;ldquo;Big Idea&amp;rdquo; section with &lt;a href=&quot;http://www.vox.com/the-big-idea/2016/9/14/12910180/noam-chomsky-tom-wolfe-linguist&quot;&gt;a review of Tom Wolfe&amp;rsquo;s evaluation of
Chomskyan linguistics&lt;/a&gt;. This is a nice opinionated summary of a
decades-long debate.&lt;/li&gt;
  &lt;li&gt;C.A.R. Hoare reflects on software engineering experiences in &lt;a href=&quot;http://zoo.cs.yale.edu/classes/cs422/2010/bib/hoare81emperor.pdf&quot;&gt;&amp;ldquo;The Emperor&amp;rsquo;s
Old Clothes&amp;rdquo;&lt;/a&gt;:
    &lt;blockquote&gt;Our main failure was overambition. &quot;The goals which we have
attempted have obviously proved to be far beyond our grasp.&quot; ...  What was
amazing was that a large team of highly intelligent programmers could labor
so hard and so long on such an unpromising project. You know, you shouldn’t
trust us intelligent programmers. We can think up such good arguments for
convincing ourselves and each other of the utterly absurd. Especially don’t
believe us when we promise to repeat an earlier success, only bigger and
better next time.&lt;/blockquote&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://nlpers.blogspot.com/2016/08/debugging-machine-learning.html&quot;&gt;Debugging machine learning.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.reddit.com/r/askscience/comments/4v6rhu/whats_going_on_in_my_head_when_im_thinking_of_an/&quot;&gt;What&amp;rsquo;s going on in my head when I&amp;rsquo;m thinking of an image?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    
    
    <item>
      <title>On "solving language"</title>
      <link>http://foldl.me/2016/solving-language/</link>
      <pubDate>Tue, 16 Aug 2016 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2016/solving-language/</guid>
      <description>&lt;p&gt;Some of my more ambitious machine-learning colleagues often have a tendency to
get misty-eyed and talk about language as the &amp;ldquo;next frontier&amp;rdquo; of artificial
intelligence. I can completely agree at a high level: building practical
language-enabled agents which can use real language with real people will be a
huge breakthrough for our field.&lt;/p&gt;

&lt;p&gt;But too frequently people seem to see language as a mere box to be checked
along the way to general artificial intelligence. As soon as we break through
some perplexity floor on our language models, apparently, we&amp;rsquo;ll have dialogue
agents that can communicate just like we do.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve been exposed to this exhausting view often enough&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; that I&amp;rsquo;ve decided to
start working to actively counter it.&lt;/p&gt;

&lt;h2 id=&quot;what-does-it-mean-to-solve-language&quot;&gt;What does it mean to &amp;ldquo;solve language?&amp;rdquo;&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;m not even sure what people are trying to say when they talk about
&amp;ldquo;solving language.&amp;rdquo; It&amp;rsquo;s as vague and underspecified as the quest to &amp;ldquo;solve
AI&amp;rdquo; as a whole. We already have &lt;a href=&quot;https://www.rebellionresearch.com/&quot;&gt;systems that build investment strategies by
reading the news&lt;/a&gt;, &lt;a href=&quot;http://www.apple.com/ios/siri/&quot;&gt;intelligent personal assistants&lt;/a&gt;, and &lt;a href=&quot;https://x2.ai/&quot;&gt;automated
psychologists which aid the clinically depressed&lt;/a&gt;. Are we done?&lt;/p&gt;

&lt;p&gt;It depends on your measure of done-ness, I suppose. I&amp;rsquo;m partial to a
utilitarian stopping criterion here. That is, we&amp;rsquo;ve &amp;ldquo;solved language&amp;rdquo; &amp;mdash; we&amp;rsquo;ve
built an agent which &amp;ldquo;understands language&amp;rdquo; &amp;mdash; when the agent can embed itself
into any &lt;strong&gt;real-world&lt;/strong&gt; environment and interact with us humans via language to
make us more productive.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;On that measure we&amp;rsquo;ve actually made some good progress, as shown by the linked
examples above. But there&amp;rsquo;s plenty of room for improvement which will likely
amount to decades of collaborative work.&lt;/p&gt;

&lt;p&gt;If you can show me how your favorite NLP/NLU task connects directly to this
measure of progress, then that&amp;rsquo;s great. I unfortunately don&amp;rsquo;t think this is the
case for much current work, including several of the tasks popular in deep
learning for NLP.&lt;/p&gt;

&lt;p&gt;&lt;small&gt;(See the &lt;a href=&quot;#addenda&quot;&gt;addenda&lt;/a&gt; to the post for some juicy follow-up to
        the claims in this section.)&lt;/small&gt;&lt;/p&gt;

&lt;h2 id=&quot;situated-language-use&quot;&gt;Situated language use&lt;/h2&gt;

&lt;p&gt;If I believe in the definition of &amp;ldquo;solving language&amp;rdquo; given above, I am led to
focus on tasks of &lt;strong&gt;situated language use&lt;/strong&gt;: cases where agents are
influencing or being influenced by other actors in grounded scenarios through
language.&lt;/p&gt;

&lt;p&gt;This setting seems extremely fertile for new language problems that aren&amp;rsquo;t
currently seriously considered as major tasks of understanding. I&amp;rsquo;ll end this
post with a simple thought experiment demonstrating how far behind we are on
real situated tasks &amp;mdash; or rather, how much exciting work there still is to be
done!&lt;/p&gt;

&lt;h2 id=&quot;generalization-in-reference-games&quot;&gt;Generalization in reference games&lt;/h2&gt;

&lt;p&gt;Here&amp;rsquo;s a simple example of situated learning that should reveal the great
complexity of language we have not yet really begun to solve.&lt;/p&gt;

&lt;p&gt;Consider a &lt;a href=&quot;https://en.wikipedia.org/wiki/Language-game_(philosophy)&quot;&gt;Wittgensteinian reference game&lt;/a&gt; with two agents, &lt;em&gt;Alexa&lt;/em&gt; and
&lt;em&gt;Bill&lt;/em&gt;. Alexa is using words that Bill doesn&amp;rsquo;t know to pick out referents in
their environment. It&amp;rsquo;s a cooperative game: Alexa tries to use words that will
make Bill most successful, Bill knows that Alexa is cooperating, Alexa knows
that Bill knows that she is cooperating, and so on.&lt;/p&gt;

&lt;h3 id=&quot;round-1&quot;&gt;Round 1&lt;/h3&gt;

&lt;p&gt;Alexa selects some object in the environment and speaks a word to Bill, say,
&lt;em&gt;BLOOP&lt;/em&gt;. Here&amp;rsquo;s what Bill has observed in this first round of the game:&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;h4&gt;Possible referents:&lt;/h4&gt;
&lt;div style=&quot;margin-top:10px&quot;&gt;
&lt;div style=&quot;width:50%;float:left;&quot;&gt;&lt;img src=&quot;http://www.foldl.me/uploads/2016/reference-game/glass.png&quot; style=&quot;border:none&quot; /&gt;&lt;/div&gt;
&lt;div style=&quot;width:50%;float:left;&quot;&gt;&lt;img src=&quot;http://www.foldl.me/uploads/2016/reference-game/pen.png&quot; style=&quot;border:none;&quot; /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;br style=&quot;clear:left&quot; /&gt;
&lt;h4&gt;Alexa said: &lt;em&gt;BLOOP&lt;/em&gt;&lt;/h4&gt;
&lt;/div&gt;

&lt;p&gt;Now Bill has to select the object he thinks Alexa was referring to. The
dialogue is a &amp;ldquo;success&amp;rdquo; if Alexa and Bill pick the same object.&lt;/p&gt;

&lt;p&gt;Suppose Bill has to predict some distribution over the objects given Alexa&amp;rsquo;s
utterance \(\pi(o \mid u)\). Since Bill doesn&amp;rsquo;t know any of Alexa&amp;rsquo;s words
in the first round, we would expect his distribution then to be roughly
uniform:&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me/uploads/2016/reference-game/round1-policy.png&quot; alt=&quot;&quot; /&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Let&amp;rsquo;s say he randomly chooses the glass, and the two are informed that they have
succeeded. That means Alexa used &lt;em&gt;BLOOP&lt;/em&gt; to refer to the glass. Both internalize
that information and proceed to the next round.&lt;/p&gt;

&lt;h3 id=&quot;round-2&quot;&gt;Round 2&lt;/h3&gt;

&lt;p&gt;Now suppose we begin round 2 with different objects. Alexa, still trying to
maximize the chance that Bill understands what she is referring to, makes
&lt;strong&gt;the same utterance&lt;/strong&gt;: &lt;em&gt;BLOOP&lt;/em&gt;.&lt;/p&gt;

&lt;div style=&quot;text-align:center&quot;&gt;
&lt;h4&gt;Possible referents:&lt;/h4&gt;
&lt;div style=&quot;margin-top:10px&quot;&gt;
&lt;div style=&quot;width:50%;float:left;&quot;&gt;&lt;img src=&quot;http://www.foldl.me/uploads/2016/reference-game/mug.png&quot; style=&quot;border:none&quot; /&gt;&lt;/div&gt;
&lt;div style=&quot;width:50%;float:left;&quot;&gt;&lt;img src=&quot;http://www.foldl.me/uploads/2016/reference-game/rabbit.png&quot; style=&quot;border:none;&quot; /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;br style=&quot;clear:left&quot; /&gt;
&lt;h4&gt;Alexa said: &lt;em&gt;BLOOP&lt;/em&gt;&lt;/h4&gt;
&lt;/div&gt;

&lt;p&gt;I would argue that, despite the fact that &lt;strong&gt;both objects are novel&lt;/strong&gt; (at least
at a perceptual level), Bill&amp;rsquo;s probabilities would look something like this:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me/uploads/2016/reference-game/round2-policy.png&quot; alt=&quot;&quot; /&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;What happened? Bill learned something about Alexa&amp;rsquo;s language in round 1: that
her word &lt;em&gt;BLOOP&lt;/em&gt; can refer to a glass. In round 2, he was forced to
&lt;strong&gt;generalize&lt;/strong&gt; that information and use it to pick the most glass-like object
among the referents.&lt;/p&gt;
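
&lt;p&gt;&lt;small&gt;(One crude way to model the two rounds—a toy sketch, where the similarity numbers are invented stand-ins for a real learned world model: treat Bill as a listener who is uniform over referents for an unknown word, and who otherwise weights referents by their nonlinguistic similarity to the word&amp;rsquo;s single known referent.)&lt;/small&gt;&lt;/p&gt;

```python
def listener(word, referents, memory, similarity):
    """Return Bill's distribution pi(o | u) over candidate referents.

    memory maps a word to its previously observed referent;
    similarity(a, b) is a stand-in for a learned world model.
    """
    if word not in memory:
        # Round 1: no information about the word, so a uniform guess.
        n = len(referents)
        return {o: 1.0 / n for o in referents}
    # Round 2: generalize from the single previous example by
    # functional similarity, not perceptual identity.
    known = memory[word]
    scores = {o: similarity(known, o) for o in referents}
    total = sum(scores.values())
    return {o: s / total for o, s in scores.items()}

# Invented toy world model: mug and glass are similar because
# both are used for drinking; mug and rabbit are not.
SIM = {("glass", "mug"): 0.9, ("glass", "rabbit"): 0.1}

def similarity(a, b):
    return SIM.get((a, b), SIM.get((b, a), 0.5))
```

&lt;p&gt;&lt;small&gt;(With candidates (glass, pen) and empty memory, this yields the uniform round-1 distribution; with memory mapping &lt;em&gt;BLOOP&lt;/em&gt; to the glass and candidates (mug, rabbit), the mug gets 0.9 of the probability mass. All of the interesting work is hidden inside &lt;code&gt;similarity&lt;/code&gt;—exactly the world-model component we don&amp;rsquo;t yet know how to learn in general.)&lt;/small&gt;&lt;/p&gt;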

&lt;h3 id=&quot;what-just-happened&quot;&gt;What just happened?&lt;/h3&gt;

&lt;p&gt;How do we model this? There&amp;rsquo;s a lot going on. Here are the three most important
properties I can recognize.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;As is often the case in reference game examples, Bill had to reason
&lt;strong&gt;pragmatically&lt;/strong&gt; in round 2 in order to understand what Alexa might mean.&lt;/li&gt;
  &lt;li&gt;This pragmatic reasoning relied on a &lt;strong&gt;model of the world&lt;/strong&gt;. Bill had to
reason that the mug was similar to the glass, not because they look alike but
because they are both used for drinking.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;Bill made his inference using &lt;strong&gt;a single previous example&lt;/strong&gt;. This inductive
inference relied mostly on his model of the world, as opposed to an enormous
dataset of in-domain examples.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are three basic properties of Bill the language agent, and three
properties that we haven&amp;rsquo;t come anywhere close to replicating in a general way.&lt;/p&gt;

&lt;p&gt;(These goals have been recognized in the past, of course; there&amp;rsquo;s been some
 &lt;a href=&quot;https://arxiv.org/abs/1604.00562&quot;&gt;good&lt;/a&gt; &lt;a href=&quot;http://www.mit.edu/~rplevy/papers/potts-levy-2015-bls.pdf&quot;&gt;recent&lt;/a&gt; &lt;a href=&quot;https://papers.nips.cc/paper/4929-learning-and-using-language-via-recursive-pragmatic-reasoning-about-other-agents&quot;&gt;work&lt;/a&gt; on the specific problems above. But I think
 these objectives should get much more focus under the utilitarian motivation
 underlying this post.)&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;That rather long example was my first stab at picking out &lt;strong&gt;situated&lt;/strong&gt; learning
problems, where language is a means rather than an end in itself. It&amp;rsquo;s just one
sample from what I see as a large space of underexplored problems. Any language
researcher worth her salt could engineer a quick solution to this particular
scenario. The real challenge is to tackle the whole problem class with an
integrated solution. Why don&amp;rsquo;t we start working on &lt;em&gt;that?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Keep posted for updates in this space &amp;mdash; I&amp;rsquo;m working hard every day
on projects related to this goal at &lt;a href=&quot;https://openai.com&quot;&gt;OpenAI&lt;/a&gt;. If you&amp;rsquo;re interested in
collaborating or sharing experiences, feel free to get in touch via the
comments below or via email.&lt;/p&gt;

&lt;h4 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h4&gt;

&lt;p&gt;&lt;small&gt;
I&amp;rsquo;ve been mulling these ideas over for most of the summer, and a whole lot of
people from several institutions have helped me to sharpen my thinking: Gabor
Angeli, Phil Blunsom, Sam Bowman, Arun Chaganty, Danqi Chen, Kevin Clark,
    Prafulla Dhariwal, Chris Dyer, Jonathan Ho, Rafal Jozefowicz, Nal
    Kalchbrenner, Andrej Karpathy, Percy Liang, Alireza Makhzani, Christopher
    Manning, Igor Mordatch, Allen Nie, Craig Quiter, Alec Radford, Zain Shah,
    Ilya Sutskever, Sida Wang, and Keenon Werling.
&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;
Special thanks to Gabor Angeli, Sam Bowman, Roger Levy, Christopher Manning,
Sida Wang, and a majority of the OpenAI team for reviewing early drafts of
this post.
&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;
The sketch images in the post are taken from the &lt;a href=&quot;http://cybertron.cg.tu-berlin.de/eitz/projects/classifysketch/&quot;&gt;TU Berlin human sketching
dataset&lt;/a&gt;.
&lt;/small&gt;&lt;/p&gt;

&lt;h3 id=&quot;addenda&quot;&gt;Addenda&lt;/h3&gt;

&lt;p&gt;Many of my trusted colleagues who reviewed this post made interesting and
relevant points which are worth mentioning. I&amp;rsquo;ve included them in a separate
section in order to prevent the main post from getting too long and full of
hedging clauses.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;I claimed in this post that work in natural language understanding often
seems too disconnected from the real downstream utilitarian goal&amp;mdash;in the
case of our field, to actually &lt;strong&gt;use language&lt;/strong&gt; in order to cooperate with
human beings. &lt;a href=&quot;http://www.mit.edu/~rplevy/&quot;&gt;Roger&lt;/a&gt; and &lt;a href=&quot;https://www.nyu.edu/projects/bowman/&quot;&gt;Sam&lt;/a&gt; pointed out that many people in
the field do feel this to be true at a broader scale &amp;mdash; that is, there are
several such related divergences (e.g. the divergence between modern NLP and
its computational linguistics roots) that ought to be unified.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://nlp.stanford.edu/manning/&quot;&gt;Chris&lt;/a&gt; pointed out that there are other uses of language which could
be used to likewise argue for different lines of research. For example,
language also functions as an information storage mechanism, and my
situated approach doesn&amp;rsquo;t capture this. We can actually use this
information-storage view to motivate many of the tasks central to NLP or,
more specifically, information extraction.&lt;/li&gt;
&lt;/ol&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot;&gt;
MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } } });
&lt;/script&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It&amp;rsquo;s a view that&amp;rsquo;s quite hard to escape in Silicon Valley for sure. I actually wasn&amp;rsquo;t able to find the clarity to write this post until now, after a week of travel and late-night conference discussions in Europe.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I&amp;rsquo;m aware this is not only a utilitarian aim but also an &lt;em&gt;anthropocentric&lt;/em&gt; one. I&amp;rsquo;m not sure it&amp;rsquo;s totally right, and am certainly open to belief updates from my readers.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Interestingly, if both Alexa and Bill have English as a native language, I would guess that phonaesthetic effects would lead Bill to prefer the round object over the long, pointy one. That&amp;rsquo;s how I would behave, anyway. Don&amp;rsquo;t ask me how to model that.&amp;nbsp;&lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Importantly, this is more than a linguistic model. The facts which Bill exploits are nonlinguistic properties learned from embodied experience.&amp;nbsp;&lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Hybrid tree-sequence neural networks with SPINN</title>
      <link>http://foldl.me/2016/spinn-hybrid-tree-sequence-models/</link>
      <pubDate>Wed, 22 Jun 2016 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2016/spinn-hybrid-tree-sequence-models/</guid>
      <description>&lt;p&gt;We&amp;rsquo;ve finally published a neural network model which has been under development
for over a year at Stanford.  I&amp;rsquo;m proud to announce &lt;strong&gt;SPINN&lt;/strong&gt;: the
&lt;strong&gt;S&lt;/strong&gt;tack-augmented &lt;strong&gt;P&lt;/strong&gt;arser-&lt;strong&gt;I&lt;/strong&gt;nterpreter &lt;strong&gt;N&lt;/strong&gt;eural &lt;strong&gt;N&lt;/strong&gt;etwork. The
project fits into a long-standing Stanford research program: mixing deep
learning methods with principled approaches inspired by linguistics. It is the
result of a substantial collaborative effort also involving &lt;a href=&quot;https://www.nyu.edu/projects/bowman/&quot;&gt;Sam Bowman&lt;/a&gt;,
Abhinav Rastogi, Raghav Gupta, and our advisors &lt;a href=&quot;http://nlp.stanford.edu/manning/&quot;&gt;Christopher Manning&lt;/a&gt; and
&lt;a href=&quot;http://web.stanford.edu/~cgpotts/&quot;&gt;Christopher Potts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post is a brief introduction to the SPINN project from a particular angle,
one which is likely of interest to researchers both inside and outside of the
NLP world. I&amp;rsquo;ll focus here on the core SPINN theory and how it enables a
&lt;strong&gt;hybrid tree-sequence architecture&lt;/strong&gt;.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; This architecture blends the otherwise
separate paradigms of &lt;a href=&quot;https://en.wikipedia.org/wiki/Recursive_neural_network&quot;&gt;recursive&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Recurrent_neural_network&quot;&gt;recurrent&lt;/a&gt; neural networks into a
structure that is stronger than the sum of its parts.&lt;/p&gt;

&lt;p style=&quot;text-align:center;font-size:88%&quot;&gt;(quick links: &lt;a href=&quot;#model&quot;&gt;model description&lt;/a&gt;,
&lt;a href=&quot;/uploads/papers/acl2016-spinn.pdf&quot;&gt;full paper&lt;/a&gt;,
&lt;a href=&quot;https://github.com/stanfordnlp/spinn&quot;&gt;code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Our task, broadly stated, is to build a model which outputs compact,
sufficient&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; representations of natural language. We will use these
representations in downstream language applications that we care about.&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;
Concretely, for an input sentence \(\mathbf x\), we want to learn a powerful
representation function \(f(\mathbf x)\) which maps to a vector-valued
representation of the sentence. Since this is a deep learning project,
\(f(\mathbf{x})\) is of course parameterized by a neural network of some
sort.&lt;/p&gt;

&lt;p&gt;Voices from Stanford have been suggesting for a long time that basic linguistic
theory might help to solve this representation problem. &lt;a href=&quot;https://en.wikipedia.org/wiki/Recursive_neural_network&quot;&gt;Recursive neural
networks&lt;/a&gt;, which combine simple grammatical analysis with the power of
&lt;a href=&quot;https://en.wikipedia.org/wiki/Recurrent_neural_network&quot;&gt;recurrent neural networks&lt;/a&gt;, were strongly supported here by &lt;a href=&quot;http://www.socher.org/&quot;&gt;Richard
Socher&lt;/a&gt;, &lt;a href=&quot;http://nlp.stanford.edu/manning/&quot;&gt;Chris Manning&lt;/a&gt;, and colleagues. SPINN has been developed in
this same spirit of merging basic linguistic facts with powerful neural network
tools.&lt;/p&gt;

&lt;h2 id=&quot;model&quot;&gt;Model&lt;/h2&gt;

&lt;p&gt;Our model is based on an insight about representation: the same
tree-structured computation can be re-encoded as a sequence. Recursive neural networks
are centered around tree structures (usually binary &lt;a href=&quot;https://en.wikipedia.org/wiki/Parse_tree#Constituency-based_parse_trees&quot;&gt;constituency trees&lt;/a&gt;)
like the following:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me//uploads/2016/tree.png&quot; alt=&quot;&quot; /&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;In a standard recursive neural network implementation, we compute the
representation of a sentence (equivalently, the root node &lt;em&gt;S&lt;/em&gt;) as a recursive
function of its two children, and so on down the tree. The recursive function
is specified like this, for a parent representation \(\vec p\) with child
representations \(\vec c_1, \vec c_2\):
\[\vec p = \sigma(W [\vec c_1, \vec c_2])\]
where \(\sigma\) is some nonlinearity such as the \(\tanh\) or sigmoid
function. The obvious way to implement this recurrence is to visit each triple
of a parent and two children, and compute the representations bottom-up.  The
graphic below demonstrates this computation order.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me//uploads/2016/tree-recursive.gif&quot; alt=&quot;The computation defined by a standard recursive neural network. We compute representations bottom-up, starting at the leaves and moving to nonterminals.&quot; /&gt;&lt;figcaption&gt;The computation defined by a standard recursive neural network. We compute representations bottom-up, starting at the leaves and moving to nonterminals.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;This is a nice idea, because it allows linguistic structure to &lt;strong&gt;guide
computation&lt;/strong&gt;. We are using our prior knowledge of sentence structure to
simplify the work left to the deep learning model.&lt;/p&gt;
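&lt;p&gt;As a concrete illustration, the bottom-up recursive feedforward can be sketched in a few lines of NumPy. (This is a minimal sketch, not our implementation: the dimensionality, the \(\tanh\) nonlinearity, the random embeddings, and the tuple encoding of trees are all illustrative assumptions.)&lt;/p&gt;

```python
import numpy as np

# Illustrative assumptions: small dimensionality, random parameters.
D = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(D, 2 * D))   # composition weights

def compose(c1, c2):
    """p = sigma(W [c1, c2]) with sigma = tanh."""
    return np.tanh(W @ np.concatenate([c1, c2]))

def encode(tree, embeddings):
    """Recursively encode a binary tree given leaf embeddings.
    A tree is either a token string or a (left, right) pair."""
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    return compose(encode(left, embeddings),
                   encode(right, embeddings))

vocab = {w: rng.normal(size=D)
         for w in ["The", "man", "picked", "the", "vegetables"]}
tree = (("The", "man"), ("picked", ("the", "vegetables")))
sentence_vec = encode(tree, vocab)           # root representation of S
```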

&lt;p&gt;One substantial practical problem with this recursive neural network, however,
is that it can&amp;rsquo;t easily be batched. Each input sentence has its own unique
computation defined by its parse tree. At any given point, then, each
example will want to compose triples in different memory locations.
Batching, by contrast, is exactly what gives recurrent neural networks a
serious speed advantage. At each
timestep, we merely feed a big batch of memories through a matrix
multiplication. This work can be easily farmed out on a GPU, leading to
order-of-magnitude speedups. Recursive neural networks unfortunately don&amp;rsquo;t work
like this. We can&amp;rsquo;t retrieve a single batch of contiguous data at each
timestep, since each example has different computation needs throughout the
process.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h3 id=&quot;shift-reduce-parsing&quot;&gt;Shift-reduce parsing&lt;/h3&gt;

&lt;p&gt;The fix comes from the change in representation foreshadowed earlier. To make
that change, I need to introduce a parsing formalism popular in natural
language processing, originally stolen from the compiler/PL crowd.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Shift-reduce_parser&quot;&gt;&lt;strong&gt;Shift-reduce parsing&lt;/strong&gt;&lt;/a&gt; is a method for building parse structures from
sequence inputs in linear time. It works by exploiting an auxiliary &lt;em&gt;stack&lt;/em&gt;
structure, which stores partially-parsed subtrees, and a &lt;em&gt;buffer&lt;/em&gt;, which stores
input tokens which have yet to be parsed.&lt;/p&gt;

&lt;p&gt;The parser applies a sequence of &lt;em&gt;transitions&lt;/em&gt;, each of which
either moves an item from the buffer to the stack or combines two stack
elements into one. In the parser&amp;rsquo;s initial state, the stack is empty and
the buffer contains the tokens of an input sentence. There are just two legal
transition types:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Shift&lt;/strong&gt; pulls the next token from the buffer and pushes it onto the stack.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reduce&lt;/strong&gt; combines the top two elements of the stack into a single element,
producing a new subtree. The top two elements of the stack become the left
and right children of this new subtree.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The animation below shows how these two transitions can be used to construct
the entire parse tree for our example sentence.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me//uploads/2016/tree-shift-reduce.gif&quot; alt=&quot;A shift-reduce parser produces the pictured constituency tree. Each timestep is visualized before and then after the transition is taken. The text at the top right shows the transition at each timestep, and yellow highlights indicate the data involved in the transition. The table at the right displays the stack contents before and after each transition.&quot; /&gt;&lt;figcaption&gt;A shift-reduce parser produces the pictured constituency tree. Each timestep is visualized before and then after the transition is taken. The text at the top right shows the transition at each timestep, and yellow highlights indicate the data involved in the transition. The table at the right displays the stack contents before and after each transition.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Rather than running a standard bottom-up recursive computation, then, we can
execute this table-based method on transition sequences. Here&amp;rsquo;s the buffer and
accompanying transition sequence we used for the sentence above. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; denotes a
shift transition and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;R&lt;/code&gt; denotes a reduce transition.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Buffer: The, man, picked, the, vegetables
Transitions: S, S, R, S, S, S, R, R, R
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Every binary tree has a unique corresponding shift-reduce transition sequence.
For a sentence with \(n\) tokens, we can produce its parse with a
shift-reduce parser in exactly \(2n - 1\) transitions.&lt;/p&gt;
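&lt;p&gt;To make the table-based method concrete, here is a minimal sketch of a parser that executes such a transition sequence and rebuilds the tree. (The string transition codes and tuple tree encoding are assumptions for illustration.)&lt;/p&gt;

```python
def run_transitions(tokens, transitions):
    """'S' shifts the next token from the buffer onto the stack;
    'R' reduces the top two stack items into a (left, right) subtree.
    Returns the finished parse."""
    buffer = list(tokens)
    stack = []
    for t in transitions:
        if t == "S":
            stack.append(buffer.pop(0))
        else:  # "R"
            right = stack.pop()
            left = stack.pop()
            stack.append((left, right))
    assert len(stack) == 1 and not buffer
    return stack[0]

tokens = ["The", "man", "picked", "the", "vegetables"]
transitions = ["S", "S", "R", "S", "S", "S", "R", "R", "R"]
parse = run_transitions(tokens, transitions)
# 2n - 1 transitions for n tokens:
assert len(transitions) == 2 * len(tokens) - 1
```

&lt;p&gt;Running this on the buffer and transition sequence above reproduces the constituency tree from the animation.&lt;/p&gt;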

&lt;p&gt;All we need to do is build a shift-reduce parser that combines &lt;strong&gt;vector
representations&lt;/strong&gt; rather than subtrees. This system is a pretty simple
extension of the original shift-reduce setup:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Shift&lt;/strong&gt; pulls the next &lt;em&gt;word embedding&lt;/em&gt; from the buffer and pushes it onto
the stack.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reduce&lt;/strong&gt; combines the top two elements of the stack \(\vec c_1, \vec
c_2\) into a single element \(\vec p\) via the standard recursive
neural network feedforward: \(\vec p = \sigma(W [\vec c_1, \vec c_2])\).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we have a shift-reduce parser, deep-learning style.&lt;/p&gt;
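&lt;p&gt;In code, the change is tiny: the same transition loop now pushes word embeddings and reduces with the recursive feedforward. (A minimal sketch; the dimensionality, nonlinearity, and random parameters are illustrative assumptions, not our trained configuration.)&lt;/p&gt;

```python
import numpy as np

# Illustrative assumptions: small dimensionality, random parameters.
D = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(D, 2 * D))
embeddings = {w: rng.normal(size=D)
              for w in ["The", "man", "picked", "the", "vegetables"]}

def spinn_feedforward(tokens, transitions):
    buffer = [embeddings[w] for w in tokens]
    stack = []
    for t in transitions:
        if t == "S":      # Shift: push the next word embedding
            stack.append(buffer.pop(0))
        else:             # Reduce: p = tanh(W [c1, c2])
            c2 = stack.pop()
            c1 = stack.pop()
            stack.append(np.tanh(W @ np.concatenate([c1, c2])))
    return stack.pop()    # the sentence representation

vec = spinn_feedforward(["The", "man", "picked", "the", "vegetables"],
                        ["S", "S", "R", "S", "S", "S", "R", "R", "R"])
```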

&lt;p&gt;This is really cool for several reasons. The first is that this shift-reduce
recurrence &lt;strong&gt;computes the exact same function&lt;/strong&gt; as the recursive neural network
we formulated above. Rather than making the awkward bottom-up tree-structured
computation, then, we can just run a recurrent neural network over these
shift-reduce transition sequences.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;If we&amp;rsquo;re back in recurrent neural network land, that means we can make use of
all the batching goodness that we were excited about earlier. It gains us quite
a bit of speed, as the figure below from our paper demonstrates.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me//uploads/2016/spinn-speed.png&quot; alt=&quot;Massive speed-ups over a competitive recursive neural network implementation (from Irsoy and Cardie, 2014). A baseline RNN implementation, which ignores parse information, is also shown. The y-axis shows feedforward speed on random input sequence data.&quot; /&gt;&lt;figcaption&gt;Massive speed-ups over a competitive recursive neural network implementation (from &lt;a href=&quot;http://www.cs.cornell.edu/~oirsoy/files/nips14drsv.pdf&quot;&gt;Irsoy and Cardie, 2014&lt;/a&gt;). A baseline RNN implementation, which ignores parse information, is also shown. The &lt;em&gt;y&lt;/em&gt;-axis shows feedforward speed on random input sequence data.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;That&amp;rsquo;s &lt;a href=&quot;https://docs.google.com/spreadsheets/d/17BRX32FQjP2Blk3zNSZyGpUWYAr4h0xhwec23dnQaCM/pubhtml&quot;&gt;up to a 25x improvement&lt;/a&gt; over our comparison recursive neural
network implementation. We&amp;rsquo;re between two and five times slower than a recurrent
neural network, and it&amp;rsquo;s worth discussing why. Though we are able to batch
examples and run an efficient GPU implementation, this computation is
fundamentally divergent &amp;mdash; at any given timestep, some examples require a
&amp;ldquo;shift&amp;rdquo; operation, and other examples require a &amp;ldquo;reduce.&amp;rdquo; When computing
results for all examples in bulk, we&amp;rsquo;re fated to throw away at least half of
our work.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m excited about this big speedup. Recursive neural networks have often been
dissed as too slow and &amp;ldquo;not batchable,&amp;rdquo; and this development proves both points
wrong. I hope it will make new research on this model class a practical
opportunity.&lt;/p&gt;

&lt;h3 id=&quot;hybrid-tree-sequence-networks&quot;&gt;Hybrid tree-sequence networks&lt;/h3&gt;

&lt;p&gt;I&amp;rsquo;ve been hinting throughout this post that our new shift-reduce feedforward is
really just a recurrent neural network computation. To be clear, here&amp;rsquo;s the
&amp;ldquo;sequence&amp;rdquo; that the recurrent neural network traverses when it reads in our
example tree:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me//uploads/2016/tree-shift-reduce-with-trace.gif&quot; alt=&quot;Visualization of the post-order tree traversal performed by a shift-reduce parser.&quot; /&gt;&lt;figcaption&gt;Visualization of the post-order tree traversal performed by a shift-reduce parser.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;This is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Tree_traversal#Post-order&quot;&gt;post-order&lt;/a&gt; tree traversal, where for a given parent node we
recurse through the left subtree, then the right, and then finally visit the
parent.&lt;/p&gt;

&lt;p&gt;Looking at this diagram gave us a simple idea with a big result: why not
have a &lt;strong&gt;recurrent&lt;/strong&gt; neural network follow along this path of arrows?&lt;/p&gt;

&lt;p&gt;Concretely, that means that at every timestep, we update some RNN memory
regardless of the shift-reduce transition. We call this the &lt;strong&gt;tracking
memory&lt;/strong&gt;. We can write out the algorithm mathematically for clarity. At any
given timestep \(t\), we compute a new tracking value \(\vec m_t\) by
combining the top two elements of the stack \(\vec c_1, \vec c_2\), the top
of the buffer \(\vec b_1\), and the previous tracking memory \(\vec
m_{t-1}\):
\begin{equation}
\vec m_t = \text{Track}(\vec m_{t-1}, \vec c_1, \vec c_2, \vec b_1)
\end{equation}
We can then pass this tracking memory onto the recursive composition function,
via a simple extension like this:
\begin{equation}
\vec p = \sigma(W [\vec c_1; \vec c_2; \vec m_t])
\end{equation}
What have we done? We&amp;rsquo;ve just interwoven a recurrent neural network into a
recursive neural network computation. The recurrent memories are used to
augment the recursive computation (\(\vec m_t\) is passed to the recursive
composition function) and vice versa (the recurrent memories are a function of
the recursively computed values on the stack).&lt;/p&gt;
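&lt;p&gt;The interleaving can be sketched by extending the transition loop with a tracking update at every timestep. (Here \(\text{Track}\) is a plain \(\tanh\) layer for illustration; our paper&amp;rsquo;s actual cell, dimensions, and parameters differ.)&lt;/p&gt;

```python
import numpy as np

# Illustrative assumptions: small dimensionality, random parameters,
# tanh layers in place of the paper's actual Track cell.
D = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(D, 3 * D))        # now also sees m_t
W_track = rng.normal(scale=0.1, size=(D, 4 * D))  # Track(m, c1, c2, b1)
embeddings = {w: rng.normal(size=D)
              for w in ["The", "man", "picked", "the", "vegetables"]}
ZERO = np.zeros(D)                                # padding for empty slots

def track(m_prev, c1, c2, b1):
    return np.tanh(W_track @ np.concatenate([m_prev, c1, c2, b1]))

def spinn_with_tracking(tokens, transitions):
    buffer = [embeddings[w] for w in tokens]
    stack, m = [], ZERO
    for t in transitions:
        c1 = stack[-2] if len(stack) > 1 else ZERO   # left child
        c2 = stack[-1] if stack else ZERO            # right child
        b1 = buffer[0] if buffer else ZERO           # top of buffer
        m = track(m, c1, c2, b1)     # updated at *every* timestep
        if t == "S":
            stack.append(buffer.pop(0))
        else:                         # p = tanh(W [c1; c2; m_t])
            stack.pop(); stack.pop()
            stack.append(np.tanh(W @ np.concatenate([c1, c2, m])))
    return stack.pop()

vec = spinn_with_tracking(["The", "man", "picked", "the", "vegetables"],
                          ["S", "S", "R", "S", "S", "S", "R", "R", "R"])
```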

&lt;p&gt;We show in &lt;a href=&quot;http://www.foldl.me/uploads/papers/acl2016-spinn.pdf&quot;&gt;our paper&lt;/a&gt; how these two paradigms turn out to have
&lt;strong&gt;complementary&lt;/strong&gt; power on our test data. By combining the recurrent and
recursive models into a single feedforward, we get a model that is more
powerful than the sum of its parts.&lt;/p&gt;

&lt;p&gt;What we&amp;rsquo;ve built is a new way to construct a representation \(f(\mathbf x)\) for
an input sentence \(\mathbf x\), like we discussed at the beginning of this
post. In our paper, we use this representation to reach a high-accuracy result
on the &lt;a href=&quot;http://nlp.stanford.edu/projects/snli/&quot;&gt;Stanford Natural Language Inference dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post managed to cover about one section of our full paper. If you&amp;rsquo;re
interested in more details about how we implemented and applied this model,
related work, or a more formal description of the algorithm discussed here,
&lt;a href=&quot;http://www.foldl.me/uploads/papers/acl2016-spinn.pdf&quot;&gt;take a read&lt;/a&gt;. You can also check out &lt;a href=&quot;https://github.com/stanfordnlp/spinn&quot;&gt;our code repository&lt;/a&gt;, which contains
several implementations of SPINN and related models that you can run to
reproduce or extend our results.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re continuing active work on this project in order to learn better
end-to-end models for natural language processing. I always enjoy hearing ideas
from my readers &amp;mdash; if this project interests you, get in touch via email or in
the comment section below.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;I have to first thank my collaborators, of course &amp;mdash; this was a team of strong
researchers with nicely complementary skills, and I look forward to pushing
this further together with them in the future.&lt;/p&gt;

&lt;p&gt;The SPINN project has been supported by a Google Faculty Research Award, the
Stanford Data Science Initiative, and the National Science Foundation under
grant numbers &lt;a href=&quot;http://www.nsf.gov/awardsearch/showAward?AWD_ID=1456077&quot;&gt;BCS 1456077&lt;/a&gt; and &lt;a href=&quot;http://www.nsf.gov/awardsearch/showAward?AWD_ID=1514268&quot;&gt;IIS 1514268&lt;/a&gt;. Some of the Tesla K40s
used for this research were donated to Stanford by the NVIDIA Corporation.
&lt;a href=&quot;http://kelvinguu.com/&quot;&gt;Kelvin Gu&lt;/a&gt;, &lt;a href=&quot;http://cocolab.stanford.edu/ndg.html&quot;&gt;Noah Goodman&lt;/a&gt;, and many others in the &lt;a href=&quot;http://nlp.stanford.edu&quot;&gt;Stanford NLP
Group&lt;/a&gt; contributed helpful comments during development. &lt;a href=&quot;https://twitter.com/crizcraig&quot;&gt;Craig Quiter&lt;/a&gt;
and &lt;a href=&quot;https://www.nyu.edu/projects/bowman/&quot;&gt;Sam Bowman&lt;/a&gt; helped review this blog post.&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot;&gt;
MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } } });
&lt;/script&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is only a brief snapshot of the project focusing on modeling and algorithms. For details on the task / data, training, related work etc., check out &lt;a href=&quot;http://www.foldl.me/uploads/papers/acl2016-spinn.pdf&quot;&gt;our full paper&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I mean &lt;em&gt;sufficient&lt;/em&gt; here in a formal sense &amp;mdash; i.e., powerful enough to answer questions of interest in isolation, without looking back at the original input value.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;In this first paper, we use the model to answer questions from the &lt;a href=&quot;http://nlp.stanford.edu/projects/snli/&quot;&gt;Stanford Natural Language Inference dataset&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A non-naïve approach might involve maintaining a queue of triples from an input batch and rapidly dequeuing them, batching together all of these dequeued values. This has already been pursued (of course) by colleagues at Stanford, and it shows some promising speed improvements on a CPU. I doubt, though, that the gains from this method will offset the losses on the GPU, since this method sacrifices all data locality that a &lt;em&gt;recurrent&lt;/em&gt; neural network enjoys on the GPU.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For a more formal and thorough definition of shift-reduce parsing, I&amp;rsquo;ll refer the interested reader to &lt;a href=&quot;http://www.foldl.me/uploads/papers/acl2016-spinn.pdf&quot;&gt;our paper&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The catch is that the recurrent neural network must maintain the per-example stack data. This is simple to implement in principle. We had quite a bit of trouble writing an efficient implementation in Theano, though, which is not really built to support complex data structure manipulation.&amp;nbsp;&lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2016/sunday-links-10/</link>
      <pubDate>Sun, 19 Jun 2016 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2016/sunday-links-10/</guid>
      <description>&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Look_Who%27s_Back&quot;&gt;&lt;strong&gt;&lt;em&gt;Er ist wieder da&lt;/em&gt; (Look Who&amp;rsquo;s Back)&lt;/strong&gt;&lt;/a&gt; is available on Netflix.
It&amp;rsquo;s based on a popular German book of the same name. Adolf Hitler is
plopped into a modern Berlin and becomes a comedy star as he reproduces his
tirades on television shows. The country receives him as a talented
impressionist who sometimes says things that sound sort of accurate.&lt;/p&gt;

    &lt;p&gt;It&amp;rsquo;s supposed to be a comedy. It&amp;rsquo;s funny, sure, but it&amp;rsquo;s also a very timely
reminder in the current political climates of Europe and of the US. I&amp;rsquo;ll
translate two scenes below. These are translated from the subtitles; the
actual spoken dialogue is much better.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Two colleagues (C) from a TV station bring home the Hitler &amp;ldquo;impostor,&amp;rdquo; and
they run into their German grandmother (G). The grandmother launches into a
tirade: &lt;table&gt;&lt;tr&gt;&lt;td&gt;G&lt;/td&gt;&lt;td&gt;Sie haben alle vergast!&lt;/td&gt;&lt;td&gt;You gassed
everyone!&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;C&lt;/td&gt;&lt;td&gt;Oma, das ist Satire.&lt;/td&gt;&lt;td&gt;Grandma,
it&apos;s just satire.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;G&lt;/td&gt;&lt;td&gt;Er sieht aus wie
früher.&lt;/td&gt;&lt;td&gt;He looks just like he used to.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Er
sagt die gleichen Sachen wie früher.&lt;/td&gt;&lt;td&gt;He&apos;s saying the same things he
used to.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Damals lachten die Leute anfangs
auch.&lt;/td&gt;&lt;td&gt;Back then, people laughed at first, too.&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;The protagonist (P) has realized that the impostor Hitler is the real
Hitler, and holds a pistol at Hitler&amp;rsquo;s face.
&lt;table&gt;&lt;tr&gt;&lt;td&gt;H&lt;/td&gt;&lt;td&gt;Haben Sie sich nie gefragt, warum die
Leute mir folgen?&lt;/td&gt;&lt;td&gt;Have you never asked yourself why people follow
me?&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Weil sie im Kern genauso sind wie
ich.&lt;/td&gt;&lt;td&gt;Because they&apos;re at the core just like
me.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Sie teilen die gleichen Werte.&lt;/td&gt;&lt;td&gt;They share the
same values.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Darum schießen Sie auch
nicht.&lt;/td&gt;&lt;td&gt;That&apos;s why you won&apos;t shoot me.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;3&quot;
style=&quot;text-align: center&quot;&gt;&lt;em&gt;P shoots. Hitler collapses and vanishes, reappearing
behind P.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;H&lt;/td&gt;&lt;td&gt;Sie können mich nicht
loswerden.&lt;/td&gt;&lt;td&gt;You can&apos;t get rid of me.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Ich bin
ein Teil von Ihnen.&lt;/td&gt;&lt;td&gt;I am a part of you.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Von
euch allen.&lt;/td&gt;&lt;td&gt;Of all of you.&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Worth a watch.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Tom Vanderbilt tells his story of discovering cycling in &lt;a href=&quot;https://www.1843magazine.com/features/the-long-and-winding-road&quot;&gt;&lt;strong&gt;The Long and
Winding Road&lt;/strong&gt;&lt;/a&gt;. He recounts his humble road-biker beginnings, and
accurately describes the pain and glory on the way to a life of serious
pedaling. It&amp;rsquo;s a nice tribute to the crazy, maybe-masochistic sport I&amp;rsquo;ve come
to love. The long article ends with a brief tour from San Francisco to Santa
Barbara.&lt;/li&gt;
  &lt;li&gt;An introduction to the &lt;a href=&quot;https://prufrocksdilemma.wordpress.com/2013/08/29/seeking-shostakovich-introduction/&quot;&gt;&lt;strong&gt;work of composer Dmitri Shostakovich&lt;/strong&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Life update</title>
      <link>http://foldl.me/2016/life-update/</link>
      <pubDate>Tue, 14 Jun 2016 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2016/life-update/</guid>
      <description>&lt;p&gt;Hello world! I&amp;rsquo;m coming back up to the surface after over a year of silence on
this blog. As always, I have things about technology and my work to rant
about&amp;hellip; but I&amp;rsquo;d like to first make a (public) review of what&amp;rsquo;s happened in all
this time. This will be a smorgasbord of professional and personal happenings,
probably too strangely interwoven to be of interest to anyone in particular. I
nevertheless feel I should get some words out on this blog, and update the,
ahem, public record.&lt;/p&gt;

&lt;p&gt;Just about the time of my last post in &lt;strong&gt;March 2015&lt;/strong&gt;, the Stanford NLP Group
went on its first lab retreat in a long time to &lt;a href=&quot;https://en.wikipedia.org/wiki/Point_Reyes&quot;&gt;Point Reyes&lt;/a&gt;. The lead
picture of this post shows a beautiful section of the Pacific Coast Highway I
captured on the way north.&lt;/p&gt;

&lt;p&gt;At this lab retreat I spoke with &lt;a href=&quot;https://www.nyu.edu/projects/bowman/&quot;&gt;Sam Bowman&lt;/a&gt; and &lt;a href=&quot;http://kelvinguu.com/&quot;&gt;Kelvin Gu&lt;/a&gt; about
end-to-end deep learning systems for NLP. Inspired by a recent debate
between Andrew Ng and Christopher Manning, I tried to argue an aggressive
(though not original) thesis about our long-term approach to language
understanding systems. My claim, which now sounds rather stale in 2016, was
that we should drop tasks like parsing and part-of-speech tagging. Natural
language processing should solve language understanding tasks that non-experts
actually care about. In short, we should focus on &amp;ldquo;end-to-end&amp;rdquo; systems where
the output is something of interest to real people.&lt;/p&gt;

&lt;p&gt;The counter to that argument (then and still now) is that intermediate
representations like constituency parses &lt;em&gt;actually work&lt;/em&gt; to produce useful
results downstream. They help us practically reason about extremely complicated
linguistic patterns. I can point at a parse or a part-of-speech tag and
recognize that it tells me something meaningful about a piece of language.
Building deep-learning systems that learn to model such complexity, while
theoretically feasible, is pretty darn hard.&lt;/p&gt;

&lt;p&gt;But &amp;ldquo;pretty darn hard&amp;rdquo; had never stopped any of us before. We started to discuss
ways to simplify the end-to-end learning problem by adding structure (less
constraining than the structure of a parse) to the computation problem. That
discussion was the basis for our recently published &lt;a href=&quot;/uploads/papers/acl2016-spinn.pdf&quot;&gt;SPINN model&lt;/a&gt;. More on
that later.&lt;/p&gt;

&lt;p&gt;The Stanford Symbolic Systems Program is often described as the study of &amp;ldquo;minds
and machines.&amp;rdquo; My spring 2015 quarter at Stanford turned out to be rather
mind-heavy in this sense. I took a seminar on theoretical neuroscience with
&lt;a href=&quot;https://web.stanford.edu/dept/app-physics/cgi-bin/person/surya-gangulijanuary-2012/&quot;&gt;Surya Ganguli&lt;/a&gt; which opened my eyes to the deep connections between modern
machine learning and what we know so far about systems neuroscience. A class by
&lt;a href=&quot;https://psychology.stanford.edu/awagner&quot;&gt;Anthony Wagner&lt;/a&gt; introduced me to theories of human learning, memory, and
attention.  The synergies between these theories and popular models in neural
networks are impressive, but there is still plenty of room for more cross-talk.
I am now convinced that we do have much more to learn from cognitive
neuroscience.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2016/google.jpg&quot; alt=&quot;A summer at Google. I did some work, too &amp;mdash; not pictured here.&quot; /&gt;&lt;figcaption&gt;A summer at Google. I did some work, too &amp;mdash; not pictured here.&lt;time&gt;August 2015&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I spent my summer as a research intern at &lt;a href=&quot;https://research.google.com/teams/brain/&quot;&gt;Google Brain&lt;/a&gt;, co-advised by &lt;a href=&quot;http://www.cs.toronto.edu/~ilya/&quot;&gt;Ilya
Sutskever&lt;/a&gt; and &lt;a href=&quot;http://research.google.com/pubs/OriolVinyals.html&quot;&gt;Oriol Vinyals&lt;/a&gt;.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; The Brain team was an outstanding place
for young aspiring researchers like myself. Around 15 interns came to work with
the motivated and creative members of the team, using some seriously beefy
Google infrastructure to attack challenging problems in machine learning.&lt;/p&gt;

&lt;p&gt;As a first non-work-related note, it&amp;rsquo;s certainly worth mentioning that I
discovered a new road biking obsession that summer. I joined up with a team of
Googlers riding &lt;a href=&quot;http://www.wavestowine.org&quot;&gt;Waves to Wine&lt;/a&gt; and trained with them for the whole summer.
I learned to climb mountains and speed down descents, leveling up from
newbie-cyclist to slightly-less-newbie cyclist.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2016/bikes.jpg&quot; alt=&quot;Yay, bikes!&quot; /&gt;&lt;figcaption&gt;Yay, bikes!&lt;time&gt;October 2015 / July 2015&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Cycling was a great release for me. While I loved working on the Brain team, I
found the work environment on the Google campus rather stifling. It&amp;rsquo;s truly a
paradise in Mountain View, for better and for worse. With a road bike I was able
to escape that sprawling land of perfection and explore everywhere from &lt;a href=&quot;https://www.strava.com/activities/390877625&quot;&gt;China
Camp in the North Bay&lt;/a&gt; to &lt;a href=&quot;https://www.strava.com/activities/380561190&quot;&gt;Santa Cruz on the opposite end&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I kept biking when I got back to Stanford in the fall. It&amp;rsquo;s become a reliable
way for me to escape the messy, mathy world of artificial intelligence.  The
Stanford cycling community is very supportive of motivated beginners, and I was
able to level up again alongside some seriously skilled riders.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2016/stanford-cycling.jpg&quot; alt=&quot;A fraction of the Stanford group on a ride in Southern California.&quot; /&gt;&lt;figcaption&gt;A fraction of the Stanford group on &lt;a href=&quot;https://www.strava.com/activities/471935405&quot;&gt;a ride in Southern California&lt;/a&gt;.&lt;time&gt;January 2016&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;In the fall, &lt;a href=&quot;https://www.nyu.edu/projects/bowman/&quot;&gt;Sam&lt;/a&gt; and I started working seriously on what would later be
called SPINN (the &lt;strong&gt;S&lt;/strong&gt;tack-augmented &lt;strong&gt;P&lt;/strong&gt;arser-&lt;strong&gt;I&lt;/strong&gt;nterpreter &lt;strong&gt;N&lt;/strong&gt;eural
&lt;strong&gt;N&lt;/strong&gt;etwork). The idea began as a very simple one &amp;mdash; we wanted to exploit a
&lt;strong&gt;stack&lt;/strong&gt; as a structured memory model for end-to-end deep learning for NLP
tasks, possibly using existing linguistic data to guide exactly how the stack
was used. We saw this approach&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; as a first step toward a totally parse-free,
fully-differentiable end-to-end system.&lt;/p&gt;

&lt;p&gt;We designed a model that would follow a shift-reduce parse sequence (generated
from a gold constituency parse) and build vector representations of the parse
nodes along the way. At some point in front of a whiteboard, Sam realized that
this model computed exactly the same function as a recursive neural network.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;
That was when we knew we were onto something interesting. Sam ran some
preliminary experiments on toy data and the results suggested we should move
forward.&lt;/p&gt;
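&lt;p&gt;The whiteboard realization is easy to see in a toy sketch (mine, not the SPINN code; strings stand in for vectors, and the composition function just parenthesizes so the tree structure is visible): a shift-reduce pass over the token sequence builds exactly the same bottom-up composition as a recursive network evaluated over the binary parse tree.&lt;/p&gt;

```python
def compose(left, right):
    """Stand-in for a learned composition layer over two child vectors."""
    return f"({left} {right})"

def recursive_net(tree):
    """Bottom-up evaluation of a binary parse tree (nested tuples)."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    return compose(recursive_net(left), recursive_net(right))

def spinn(tokens, ops):
    """Shift-reduce evaluation: SHIFT pushes the next token onto the stack;
    REDUCE pops two nodes and pushes their composition."""
    stack, buf = [], list(tokens)
    for op in ops:
        if op == "SHIFT":
            stack.append(buf.pop(0))
        else:  # REDUCE
            right, left = stack.pop(), stack.pop()
            stack.append(compose(left, right))
    return stack[0]

tree = (("the", "cat"), ("sat", "down"))
ops = ["SHIFT", "SHIFT", "REDUCE", "SHIFT", "SHIFT", "REDUCE", "REDUCE"]
# Both routes compute "((the cat) (sat down))".
assert spinn(["the", "cat", "sat", "down"], ops) == recursive_net(tree)
```

&lt;p&gt;Swap &lt;code&gt;compose&lt;/code&gt; for a learned layer over vector pairs and the shift-reduce trace computes the same function as the recursive network &amp;mdash; the stack merely makes the recursion iterative.&lt;/p&gt;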

&lt;p&gt;Near the end of the year, Abhinav Rastogi and Raghav Gupta, two masters
students, joined us on the project. With double the manpower, we charged ahead
and sicced various evolved forms of the model on Sam&amp;rsquo;s baby, the &lt;a href=&quot;http://nlp.stanford.edu/projects/snli/&quot;&gt;Stanford
Natural Language Inference dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We wrote the whole thing in Theano, and ended up having some serious problems
with training runtime performance. Months of toil in code optimization,
Theano-hacking, and writing CUDA code taught us a good lesson: if you&amp;rsquo;re doing
anything remotely unusual in terms of neural network structure, you should
probably ditch abstractions like Theano and work closer to the metal.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2016/nips-keenon.jpg&quot; alt=&quot;Keenon Werling and I in Montréal during NIPS 2015.&quot; /&gt;&lt;figcaption&gt;Keenon Werling and I in Montréal during NIPS 2015.&lt;time&gt;December 2015&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I escaped Stanford for a bit in the winter to attend NIPS 2015. This was my
first machine learning conference, and I really enjoyed socializing and
participating in casual brainstorm sessions with people from across the world.
The conference itself was pretty exhausting &amp;mdash; there were way, way too many
people, and we all had to squeeze into a single gigantic conference room for a
single-track oral presentation schedule. In any case, it was nice to slip out of
California for a while and revive my seriously broken French in Montréal.&lt;/p&gt;

&lt;p&gt;After a comfortable holiday break at home with the family, I returned to the
bike heaven / ridiculous economic bubble / home of an artificial intelligence
renaissance that is the San Francisco Bay Area. I made a big life decision that
had been sitting around in the queue for a while and &lt;strong&gt;became a vegetarian&lt;/strong&gt;.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2016/veggies.jpg&quot; alt=&quot;A yummy vegetarian meal from Ricker, a Stanford dining hall near the Gates CS building. Stanford has been voted the &apos;Favorite Vegan-Friendly Large College&apos; by peta2.&quot; /&gt;&lt;figcaption&gt;A yummy vegetarian meal from Ricker, a Stanford dining hall near the Gates CS building. Stanford has been voted the &lt;a href=&quot;http://www.stanforddaily.com/2015/04/28/stanford-voted-most-vegan-friendly-campus/&quot;&gt;&apos;Favorite Vegan-Friendly Large College&apos;&lt;/a&gt; by &lt;a href=&quot;http://www.peta2.com/&quot;&gt;peta2&lt;/a&gt;.&lt;time&gt;February 2016&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;This isn&amp;rsquo;t the place for me to wage an anti-meat campaign &amp;mdash; interested readers
can just take a good look at e.g. &lt;a href=&quot;http://www.ncifap.org/reports/&quot;&gt;the report of the Pew Commission on
Industrial Farm Animal Production&lt;/a&gt; to see why this might be a reasonable
thing to do.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; I had seen these reports and become acquainted with the facts
long before 2016. But, like many people, I had settled into living with the
cognitive dissonance of enjoying meat while knowing it to be a harmful
and ethically wrong habit. That changed after I read Jonathan Safran Foer&amp;rsquo;s
&lt;a href=&quot;https://www.amazon.com/gp/product/0316069884/ref=as_li_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0316069884&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&amp;amp;linkId=579d4d5ea33b87f32c49a996a47fce36&quot;&gt;&lt;em&gt;Eating Animals&lt;/em&gt;&lt;/a&gt;. Jonathan&amp;rsquo;s situation was much the same: before writing
his book, he was a meat eater with a guilty conscience who had made several
failed attempts at vegetarianism. While I was already familiar with the
high-level facts of the modern meat industry, Jonathan&amp;rsquo;s personal story
motivated me even further &amp;mdash; enough to actually move forward and make the
change.&lt;/p&gt;

&lt;p&gt;The SPINN team &amp;mdash; now six strong, including our advisors &amp;mdash; toiled throughout
the winter to develop the model theory and evaluate it on natural language data.
When not outside on a bike saddle, I spent the season cooped up in my office in
the Gates building, writing code, reading relevant papers, and hoping for
success. It slowly became clear during the quarter that things were going to
work out in our favor. We submitted a report of our work to the Association for
Computational Linguistics in March, and will present it at the conference this
summer.&lt;/p&gt;

&lt;p&gt;And then I was off to Berlin!&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2016/berlin.jpg&quot; alt=&quot;Riding a vintage roadie in central Berlin.&quot; /&gt;&lt;figcaption&gt;Riding a vintage roadie in central Berlin.&lt;time&gt;April 2016&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I had long ago applied to the Stanford overseas study &lt;a href=&quot;https://undergrad.stanford.edu/programs/bosp/explore/berlin&quot;&gt;program in Berlin&lt;/a&gt;.
After several years of mostly tunnel-vision intellectual focus on artificial
intelligence, I knew that my brain needed some novel content in order to keep
functioning. I wanted to return to my dormant love of language learning. I had
obsessively taught myself languages during high school, but mostly dropped the
activity at university, where time for leisure-learning was rather scarce. A
language-immersion trip to Germany was a perfect way to slip back into the old
hobby &amp;mdash; and, no less important, to slip out of the fast-paced world of AI
research for a short while.&lt;/p&gt;

&lt;p&gt;A generous and welcoming family hosted me in the peaceful city quarter of
&lt;a href=&quot;https://en.wikipedia.org/wiki/Friedenau&quot;&gt;Friedenau&lt;/a&gt;. I studied German at a Stanford satellite in Berlin and spent my
free time exploring the city and cycling, sometimes &lt;a href=&quot;https://www.strava.com/activities/571930933&quot;&gt;with the Freie Universität
Berlin riding group&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2016/lutherstadt-wittenberg.jpg&quot; alt=&quot;Lutherstadt Wittenberg, in Saxony-Anhalt. Taken at the halfway point on a bike trip to Leipzig.&quot; /&gt;&lt;figcaption&gt;Lutherstadt Wittenberg, in Saxony-Anhalt. Taken at the halfway point on &lt;a href=&quot;https://www.strava.com/activities/582949905&quot;&gt;a bike trip to Leipzig&lt;/a&gt;.&lt;time&gt;May 2016&lt;/time&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The immersion experience was a wonderful challenge, though I probably wouldn&amp;rsquo;t
have chosen the adjective &amp;ldquo;wonderful&amp;rdquo; as I was going through it. I learned (the
hard way!) to trust myself to speak fluently in another language &amp;mdash; to be
willing to make mistakes and fail in front of other people in order to
convey what I wanted to say. This was a fully engaging intellectual task:
every day quickly filled up with conversations, literature, television, and
movies, and everything that might have been banal or boring in English became
an important and interesting part of my studies in Berlin.&lt;/p&gt;

&lt;p&gt;Of course, there is more to Berlin than the German language. The city has a dark
and complex past, and the modern Berlin, no longer physically divided, still
shows scars of its turbulent history. Exploring this history in person &amp;mdash;
through conversation and travel &amp;mdash; added a new dimension to formerly dry
textbook knowledge in my mind.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s a happy coincidence that this year&amp;rsquo;s &lt;a href=&quot;http://acl2016.org/&quot;&gt;ACL conference&lt;/a&gt; takes place in
Berlin in a few months. I have an excuse, then, to go back and ensure that my
German hasn&amp;rsquo;t rusted!&lt;/p&gt;

&lt;p&gt;But for now, I&amp;rsquo;m off to San Francisco for what is sure to be an exciting summer.
I&amp;rsquo;m joining the rapidly growing research team at &lt;a href=&quot;https://openai.com/about/&quot;&gt;OpenAI&lt;/a&gt;, which has already managed
to release a swathe of papers on generative models and a platform for open
reinforcement learning research. I believe I&amp;rsquo;m expected to add some
linguistic nuance and NLP fun to the mix. We&amp;rsquo;ll see what I come up with.&lt;/p&gt;

&lt;p&gt;Well, I&amp;rsquo;ve reached the present day, so this brings my whirlwind review of over a
year to a close. I certainly wouldn&amp;rsquo;t have managed to make this progress without
the support of my family and friends. With thanks to them and to my readers,
I&amp;rsquo;ll end the review here. Until next time.&lt;/p&gt;


&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I could point out in particular that our current memory models and representation choices are extremely limited compared to what is available in the human brain. Some (but not enough) work has been done on differential-speed learning models (comparable to e.g. complementary learning systems theory in neuroscience).&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Things move quickly in today&amp;rsquo;s world of AI research. Ilya is now at OpenAI, and Oriol works at Google DeepMind in London.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;very similar to e.g. &lt;a href=&quot;http://www.aclweb.org/anthology/P15-1033&quot;&gt;Chris Dyer&amp;rsquo;s work from ACL&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I think this is a good example of how &amp;ldquo;nonlinear&amp;rdquo; research work often feels. We were totally surprised by this finding, and it certainly changed the direction of the project. These sort of discontinuous jumps are often omitted in the terse final write-ups, where it generally pays off to appear as if you knew the whole thing from the start&amp;hellip;&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And why eating vegan may also be reasonable, if you have more willpower than I do.&amp;nbsp;&lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Conditional generative adversarial networks for face generation</title>
      <link>http://foldl.me/2015/conditional-gans-face-generation/</link>
      <pubDate>Tue, 17 Mar 2015 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2015/conditional-gans-face-generation/</guid>
      <description>&lt;p&gt;This week marks the end of the winter quarter at Stanford, and with it ends the
class &lt;a href=&quot;http://cs231n.stanford.edu/index.html&quot;&gt;CS 231N: Convolutional Neural Networks for Visual Recognition&lt;/a&gt;. The
teaching team, led by &lt;a href=&quot;http://cs.stanford.edu/people/karpathy/&quot;&gt;Andrej Karpathy&lt;/a&gt; and &lt;a href=&quot;http://vision.stanford.edu/feifeili/&quot;&gt;Fei-Fei Li&lt;/a&gt;, did an
outstanding job putting together a course on neural networks and CNNs from
scratch.&lt;/p&gt;

&lt;p&gt;This post is a high-level overview of the project I submitted for the course,
titled &lt;em&gt;Conditional generative adversarial networks for convolutional face
generation&lt;/em&gt;. For those interested in a technical deep dive, check out
&lt;a href=&quot;/uploads/2015/conditional-gans-face-generation/paper.pdf&quot;&gt;my full paper&lt;/a&gt; and the &lt;a href=&quot;https://github.com/hans/adversarial&quot;&gt;code on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p style=&quot;text-align:center;font-size:88%&quot;&gt;(jump to: &lt;a href=&quot;#introduction&quot;&gt;introduction&lt;/a&gt;, &lt;a href=&quot;#model&quot;&gt;model&lt;/a&gt;)&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;a href=&quot;/uploads/2015/conditional-gans-face-generation/axis_incremental.png&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2015/conditional-gans-face-generation/axis_incremental.png&quot; alt=&quot;Example of faces sampled from the generative model. We draw random faces in the first row. In the second row we ask the model to &apos;age&apos; the faces, and in the third row we ask the model to add a smile.&quot; /&gt;&lt;/a&gt;&lt;figcaption&gt;Example of faces sampled from the generative model. We draw random faces in the first row. In the second row we ask the model to &apos;age&apos; the faces, and in the third row we ask the model to add a smile.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;A major task in machine learning is to learn &lt;a href=&quot;http://en.wikipedia.org/wiki/Probability_density_function&quot;&gt;&lt;em&gt;density models&lt;/em&gt;&lt;/a&gt; of
particular distributions. Ideally, we want to have a machine that accepts
arbitrary inputs and says either &amp;ldquo;Yes, that&amp;rsquo;s an &lt;em&gt;x&lt;/em&gt;,&amp;rdquo; or &amp;ldquo;No, that&amp;rsquo;s not an
&lt;em&gt;x&lt;/em&gt;.&amp;rdquo; We might name this density model mathematically as \(p(\mathbf{x})\),
where \(\mathbf{x}\) is the data we&amp;rsquo;re interested in modeling.&lt;/p&gt;

&lt;p&gt;In this project, we learn a density model of &lt;strong&gt;human faces&lt;/strong&gt;. Given some image
\(\mathbf{x}\), our precise task here is to determine whether \(\mathbf{x}\)
is a picture of a face or not. Every seeing human has such a model in her brain,
and uses it effortlessly every day. Like many tasks in computer vision and
artificial intelligence, what is ridiculously simple for humans turns out to be
notoriously difficult for computers to crack.&lt;/p&gt;

&lt;p&gt;There are two important extensions in this project:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We want to be able to &lt;em&gt;sample&lt;/em&gt; from the model &amp;mdash; to ask it to &amp;ldquo;imagine&amp;rdquo; new
faces that we haven&amp;rsquo;t ever showed it. (Again, this is something that humans
can do easily.)&lt;/li&gt;
  &lt;li&gt;We want the model to &lt;em&gt;condition&lt;/em&gt; on external data. This means that we should
be able to specify particular facial attributes as we sample. (Once again,
this is trivial for humans. Imagining an old white male with a mustache takes
little apparent cognitive effort.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Images are traditionally represented in digital form as large matrices of
numbers. Our density model in particular is expected to deal with images with
3072 different dimensions of variation.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Our task, then (as in
much of &lt;a href=&quot;http://en.wikipedia.org/wiki/Unsupervised_learning&quot;&gt;unsupervised learning&lt;/a&gt;), is to find and exploit structure in the
data that helps us reason efficiently and accurately about what is and isn&amp;rsquo;t a
face.&lt;/p&gt;

&lt;h3 id=&quot;the-project&quot;&gt;The project&lt;/h3&gt;

&lt;p&gt;We train on human face images like these:&lt;/p&gt;

&lt;p style=&quot;text-align:center&quot;&gt;
&lt;img style=&quot;width: 100px; display: inline;&quot; src=&quot;/uploads/2015/conditional-gans-face-generation/lfwcrop/Akbar_Al_Baker_0001.jpg&quot; /&gt;
&lt;img style=&quot;width: 100px; display: inline;&quot; src=&quot;/uploads/2015/conditional-gans-face-generation/lfwcrop/Catherine_Zeta-Jones_0004.jpg&quot; /&gt;
&lt;img style=&quot;width: 100px; display: inline;&quot; src=&quot;/uploads/2015/conditional-gans-face-generation/lfwcrop/Igor_Ivanov_0005.jpg&quot; /&gt;
&lt;img style=&quot;width: 100px; display: inline;&quot; src=&quot;/uploads/2015/conditional-gans-face-generation/lfwcrop/Milo_Maestrecampo_0001.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;The above images are samples from a dataset called
&lt;a href=&quot;http://vis-www.cs.umass.edu/lfw/&quot;&gt;Labeled Faces in the Wild&lt;/a&gt;, which contains about 13,000 images of random
people in uncontrolled settings.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;As mentioned before, we intend to build a density model in this project that is
&lt;em&gt;conditional&lt;/em&gt;. Rather than answering the question &amp;ldquo;Is this an &lt;em&gt;x&lt;/em&gt;,&amp;rdquo; we now
answer &amp;ldquo;Is this an &lt;em&gt;x&lt;/em&gt; given &lt;em&gt;y&lt;/em&gt;?&amp;rdquo; Formally, we build a density model
\(p(\mathbf{x} \mid \mathbf{y})\). \(\mathbf{y}\) is the &amp;ldquo;conditional
information&amp;rdquo; &amp;mdash; any external information that might cue us on what we should be
looking for in the provided data \(\mathbf{x}\).&lt;/p&gt;

&lt;p&gt;Concretely, in this project \(\mathbf{y}\) specifies facial attributes, such
as the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Age: baby, youth, middle-aged, senior, &amp;hellip;&lt;/li&gt;
  &lt;li&gt;Emotion: frowning, smiling, &amp;hellip;&lt;/li&gt;
  &lt;li&gt;Race: Asian, Indian, black, white, &amp;hellip;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Informally, when we ask about \(p(\mathbf{x} \mid \mathbf{y})\) in this
setting, we ask the question: If we&amp;rsquo;re looking for faces of type
\(\mathbf{y}\) (e.g. frowning people with mustaches), should we accept
\(\mathbf{x}\) as a good example or not?&lt;/p&gt;

&lt;h3 id=&quot;why-is-this-interesting&quot;&gt;Why is this interesting?&lt;/h3&gt;

&lt;p&gt;Good question! Recall that we&amp;rsquo;re learning a &lt;em&gt;generative&lt;/em&gt; model of faces while we
do this density modeling. The interesting implication is that we can &lt;strong&gt;sample
brand-new faces&lt;/strong&gt; from the learned density model. Like these ones:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2015/conditional-gans-face-generation/samples_cgan_fixed.png&quot; alt=&quot;&quot; /&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The faces above are created by the model from scratch. These faces are entirely
new, and don&amp;rsquo;t resemble faces in the training data provided to the model. That&amp;rsquo;s
right &amp;mdash; once our model learns what a face looks like, it can &lt;strong&gt;learn to draw
new ones&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Hooked? Let&amp;rsquo;s get into the model.&lt;/p&gt;

&lt;h2 id=&quot;model&quot;&gt;Model&lt;/h2&gt;

&lt;p&gt;The model used in this project is an extension of the &lt;em&gt;generative adversarial
network&lt;/em&gt;, proposed by &lt;a href=&quot;http://www-etud.iro.umontreal.ca/~goodfeli/&quot;&gt;Ian Goodfellow&lt;/a&gt; and colleagues (see their &lt;a href=&quot;http://papers.nips.cc/paper/5423-generative-adversarial-nets&quot;&gt;paper&lt;/a&gt;
and associated &lt;a href=&quot;https://github.com/goodfeli/adversarial&quot;&gt;GitHub repo&lt;/a&gt;). Here&amp;rsquo;s the basic pitch:&lt;/p&gt;

&lt;p&gt;Suppose you want to build a generative model of some dataset with data points
called \(\mathbf{x}\). Let&amp;rsquo;s create two players and set them against each
other in an adversarial game:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A &lt;strong&gt;discriminator&lt;/strong&gt; &amp;ndash; call it &lt;em&gt;D&lt;/em&gt;. &lt;em&gt;D&lt;/em&gt;&amp;rsquo;s job is to accept an input
\(\mathbf{x}\) and determine whether the input came from the dataset, or
whether it was simply made up. &lt;em&gt;D&lt;/em&gt; wins points when he detects real
dataset values correctly, and loses points when he approves fake values or
denies real dataset values.&lt;/li&gt;
  &lt;li&gt;A &lt;strong&gt;generator&lt;/strong&gt; &amp;ndash; call it &lt;em&gt;G&lt;/em&gt;. &lt;em&gt;G&lt;/em&gt;&amp;rsquo;s job is to &lt;em&gt;make up&lt;/em&gt; new values
\(\mathbf{x}\). &lt;em&gt;G&lt;/em&gt; wins points when he tricks &lt;em&gt;D&lt;/em&gt; into thinking that his
made-up values are real.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We now let &lt;em&gt;D&lt;/em&gt; and &lt;em&gt;G&lt;/em&gt; take turns in the game, and teach both how to correct
their mistakes after each turn. Here&amp;rsquo;s what we expect to happen:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;G&lt;/em&gt; begins as a completely stupid generator. He outputs some random noise in
a weak attempt to trick &lt;em&gt;D&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;D&lt;/em&gt; quickly learns to make a lazy distinction between &lt;em&gt;G&lt;/em&gt;&amp;rsquo;s random noise and
things that look like human faces. But &lt;em&gt;D&lt;/em&gt; trains only long enough to make a
basic distinction &amp;mdash; to look for a skin tone, for example.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;G&lt;/em&gt; learns from its mistakes and starts producing images with skin tone color
in them.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;D&lt;/em&gt; picks up on basic facial structure, and uses this to distinguish between
real face images and &lt;em&gt;G&lt;/em&gt;&amp;rsquo;s fake data.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;G&lt;/em&gt; follows &lt;em&gt;D&lt;/em&gt;&amp;rsquo;s cue, and learns to draw face shapes (and perhaps some basic
features like noses and eye-holes).&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;D&lt;/em&gt; notices other features in the real face images that distinguish them from
&lt;em&gt;G&lt;/em&gt;&amp;rsquo;s data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process continues on forever, with &lt;em&gt;D&lt;/em&gt; learning new discriminative features
and &lt;em&gt;G&lt;/em&gt; promptly learning to copy them.&lt;/p&gt;
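&lt;p&gt;The alternating game can be sketched in a few lines of Python (a minimal 1-D toy of my own, not the convolutional model from the project): &lt;em&gt;G&lt;/em&gt; is a single shift parameter applied to input noise, &lt;em&gt;D&lt;/em&gt; is a logistic classifier on the scalar input, and each takes a gradient step on its own objective in turn.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_real(n):
    """Toy 'dataset': real samples come from a Gaussian centered at 4."""
    return rng.normal(4.0, 0.5, n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = 0.0      # G's only parameter: a shift applied to its input noise
w, b = 0.0, 0.0  # D is logistic regression on the scalar input
lr, n = 0.05, 128

for step in range(2000):
    # D's turn: ascend mean log D(real) + mean log(1 - D(fake)).
    real = sample_real(n)
    fake = theta + rng.normal(0.0, 0.5, n)
    p_real, p_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    w += lr * (np.mean((1 - p_real) * real) - np.mean(p_fake * fake))
    b += lr * (np.mean(1 - p_real) - np.mean(p_fake))

    # G's turn: ascend mean log D(fake), i.e. learn to trick D.
    fake = theta + rng.normal(0.0, 0.5, n)
    p_fake = sigmoid(w * fake + b)
    theta += lr * np.mean((1 - p_fake) * w)

# G's samples should now be centered near the real mean of 4.
print(round(theta, 2))
```

&lt;p&gt;Because &lt;em&gt;G&lt;/em&gt;&amp;rsquo;s gradient flows &lt;em&gt;through&lt;/em&gt; &lt;em&gt;D&lt;/em&gt;, the generator chases wherever the discriminator currently assigns high scores &amp;mdash; the same dance as in the numbered story above, just in one dimension.&lt;/p&gt;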

&lt;p&gt;What we end up with (after several hours of training on GPUs) is a &lt;strong&gt;generative
model&lt;/strong&gt; &lt;em&gt;G&lt;/em&gt; which can make convincing images of human faces like the ones
presented earlier. Ideally, this model &lt;em&gt;G&lt;/em&gt; can also serve as a generative
density model as described earlier.&lt;/p&gt;

&lt;h3 id=&quot;conditional-data&quot;&gt;Conditional data&lt;/h3&gt;

&lt;p&gt;But there&amp;rsquo;s more! I mentioned earlier that our key extension is to add a
&lt;em&gt;conditioning&lt;/em&gt; feature. In the setting of face images, this means we can specify
particular facial attributes. Both the generator &lt;em&gt;G&lt;/em&gt; and the discriminator &lt;em&gt;D&lt;/em&gt;
learn to operate in certain &lt;em&gt;modes&lt;/em&gt;. For example, with a particular conditional
information input \(\mathbf{y}\), we might ask the generator &lt;em&gt;G&lt;/em&gt; to generate a
face with a smile, and likewise ask the discriminator &lt;em&gt;D&lt;/em&gt; whether a particular
image contains a face with a smile.&lt;/p&gt;
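&lt;p&gt;Mechanically, conditioning just means both networks receive \(\mathbf{y}\) alongside their usual input. A minimal sketch (hypothetical attribute names; the real model uses the dataset&amp;rsquo;s attribute annotations) concatenates a one-hot code for \(\mathbf{y}\) onto &lt;em&gt;G&lt;/em&gt;&amp;rsquo;s noise vector and onto &lt;em&gt;D&lt;/em&gt;&amp;rsquo;s flattened image input:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attribute vocabulary; stand-in for the dataset's labels.
ATTRS = ["senior", "smiling", "mustache"]

def one_hot(attrs):
    """Encode the requested attributes as the conditioning vector y."""
    y = np.zeros(len(ATTRS))
    for a in attrs:
        y[ATTRS.index(a)] = 1.0
    return y

def generator_input(z_dim, attrs):
    """G sees [z ; y]: random noise concatenated with the condition."""
    z = rng.normal(size=z_dim)
    return np.concatenate([z, one_hot(attrs)])

def discriminator_input(image, attrs):
    """D likewise sees the flattened image concatenated with y."""
    return np.concatenate([image.ravel(), one_hot(attrs)])

x = generator_input(100, ["senior", "smiling"])
print(x.shape)  # 100 noise dimensions plus 3 condition dimensions
```

&lt;p&gt;Everything downstream of the concatenation is unchanged, which is what makes the conditional extension so cheap to add to the original adversarial setup.&lt;/p&gt;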

&lt;figure class=&quot;image&quot;&gt;&lt;a href=&quot;/uploads/2015/conditional-gans-face-generation/axis_incremental.png&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/2015/conditional-gans-face-generation/axis_incremental.png&quot; alt=&quot;Demonstration of deterministic control of image samples. We tweak conditional information to first make the sampled faces age, then again to make them smile.&quot; /&gt;&lt;/a&gt;&lt;figcaption&gt;Demonstration of deterministic control of image samples. We tweak conditional information to first make the sampled faces age, then again to make them smile.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The final consequence of all of this is that we can directly control the output
of the generator &lt;em&gt;G&lt;/em&gt;. The image above shows a figure from the paper. We begin
with a random row of faces sampled from the model. (Note that these are not
faces from the training data.) We then tweak \(\mathbf{y}\) in two ways:
first, along an axis that corresponds to old age, and second, along an axis that
corresponds to smiling. You can see in the second and third rows that the
subjects first grow older and then put on a smile.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;ll end my brief overview here. There&amp;rsquo;s much more to say &amp;mdash; for example, on
training dynamics, on evaluation, and on the role of convolution in both &lt;em&gt;G&lt;/em&gt; and
&lt;em&gt;D&lt;/em&gt; &amp;mdash; and if you&amp;rsquo;ve reached this far in the blog post you would probably enjoy
reading &lt;a href=&quot;/uploads/2015/conditional-gans-face-generation/paper.pdf&quot;&gt;the paper&lt;/a&gt; in full.&lt;/p&gt;

&lt;p&gt;It is still early days for generative models, and I&amp;rsquo;m excited to explore both
new architectures and new possibilities in model applications. Watch this space
for more soon, and let me know if you&amp;rsquo;re working on similar ideas!&lt;/p&gt;

&lt;h3 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h3&gt;

&lt;p&gt;Thanks to my colleagues &lt;a href=&quot;http://keenon.github.io&quot;&gt;Keenon Werling&lt;/a&gt; and &lt;a href=&quot;http://cs.stanford.edu/~danqi&quot;&gt;Danqi Chen&lt;/a&gt;, who put up with
persistent requests for advice on the project throughout the quarter. None of
this would have happened without &lt;a href=&quot;http://cs.stanford.edu/people/karpathy/&quot;&gt;Andrej&amp;rsquo;s&lt;/a&gt; advice at the start of the
quarter, which set me off in the right direction.&lt;/p&gt;

&lt;p&gt;This project would have taken twice as long were it not for the developers of
&lt;a href=&quot;http://deeplearning.net/software/pylearn2/&quot;&gt;Pylearn2&lt;/a&gt; and &lt;a href=&quot;http://deeplearning.net/software/theano/&quot;&gt;Theano&lt;/a&gt;. This kind of machine learning framework
development is certain to accelerate the progress of the field &amp;mdash; exciting
stuff! Of course, &lt;a href=&quot;http://www-etud.iro.umontreal.ca/~goodfeli/&quot;&gt;Ian Goodfellow&lt;/a&gt; and his co-authors deserve credit for the original
generative adversarial net code, published with their NIPS &amp;lsquo;14 paper.&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot;&gt;
MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } } });
&lt;/script&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;3072 = 32 by 32 by 3. Our images are 32 by 32, with three RGB color channels.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;These images were cropped by &lt;a href=&quot;http://conradsanderson.id.au/lfwcrop/&quot;&gt;Conrad Sanderson&lt;/a&gt; at NICTA.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Machine learning and technical debt</title>
      <link>http://foldl.me/2015/machine-learning-technical-debt/</link>
      <pubDate>Mon, 12 Jan 2015 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2015/machine-learning-technical-debt/</guid>
      <description>&lt;p&gt;&lt;a href=&quot;http://research.google.com/pubs/archive/43146.pdf&quot;&gt;&lt;em&gt;Machine Learning: The High Interest Credit Card of Technical Debt&lt;/em&gt;&lt;/a&gt; is a
great non-technical paper from this past NIPS conference that is being passed
around online and off among people in my network. For me, the paper centers
on the claim that machine learning systems are fundamentally different from
traditional software in how they are developed and used:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Arguably the most important reason for using a machine learning system is
precisely that &lt;em&gt;the desired behavior cannot be effectively implemented in
software logic without dependency on external data.&lt;/em&gt; (p. 2, italics theirs)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Machine learning models are useful because they can encode complex world
knowledge that would be near-impossible to handle in a deterministic, rule-based
setting. But this precise distinction is what makes them enormously brittle in
the face of change. Repeated throughout the paper is the principle of &amp;ldquo;Changing
Anything Changes Everything&amp;rdquo; (CACE):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;To make this concrete, imagine we have a system that uses features
&lt;em&gt;x&lt;sub&gt;1&lt;/sub&gt;, &amp;hellip; x&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt; in a model. If we change the
input distribution of values in &lt;em&gt;x&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt;, the importance,
weights, or use of the remaining &lt;em&gt;n - 1&lt;/em&gt; features may all change
&amp;hellip; &lt;strong&gt;No inputs are ever really independent.&lt;/strong&gt; (p. 2)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CACE can be a problem for all sorts of reasons, and if you&amp;rsquo;ve gotten this far in
my post I strongly suggest you read &lt;a href=&quot;http://research.google.com/pubs/archive/43146.pdf&quot;&gt;the full paper&lt;/a&gt; to understand the
implications of building and depending on software which can often be
fundamentally opaque and unstable by design.&lt;/p&gt;

&lt;p&gt;This is useful to me as a researcher in helping me to understand the divide
between research and practice, and to recognize which elements of my methods
might be incompatible with those of a practitioner.&lt;/p&gt;

</description>
    </item>
    
    
    
    <item>
      <title>A GloVe implementation in Python</title>
      <link>http://foldl.me/2014/glove-python/</link>
      <pubDate>Wed, 24 Sep 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/glove-python/</guid>
      <description>&lt;p&gt;&lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/&quot;&gt;GloVe (&lt;strong&gt;Glo&lt;/strong&gt;bal &lt;strong&gt;Ve&lt;/strong&gt;ctors for Word Representation)&lt;/a&gt; is a tool recently
released by Stanford NLP Group researchers &lt;a href=&quot;http://stanford.edu/~jpennin/&quot;&gt;Jeffrey Pennington&lt;/a&gt;,
&lt;a href=&quot;http://www.socher.org&quot;&gt;Richard Socher&lt;/a&gt;, and &lt;a href=&quot;http://nlp.stanford.edu/manning/&quot;&gt;Chris Manning&lt;/a&gt; for learning continuous-space vector
representations of words.&lt;/p&gt;

&lt;p style=&quot;text-align:center;font-size:88%&quot;&gt;(jump to: &lt;a href=&quot;#theory&quot;&gt;theory&lt;/a&gt;, &lt;a href=&quot;#implementation&quot;&gt;implementation&lt;/a&gt;)&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;These real-valued word vectors have proven to be useful for all sorts of natural
language processing tasks, including parsing,&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; named entity recognition,&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;
and (very recently!) machine translation.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s been shown (and widely shared by this point) that these word
vectors exhibit interesting &lt;strong&gt;semantic and syntactic regularities&lt;/strong&gt;. For
example, we find that claims like the following hold for the associated
word vectors:&lt;/p&gt;

&lt;p&gt;\[\begin{align*}\text{king} - \text{man} + \text{woman} &amp;\approx \text{queen} \\ \text{brought} - \text{bring} + \text{seek} &amp;\approx \text{sought}\end{align*}\]&lt;/p&gt;
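&lt;p&gt;We can check such an analogy with plain vector arithmetic and cosine similarity. The three-dimensional vectors below are made up purely for illustration (real learned vectors have hundreds of dimensions), but the idea is the same: the offset king minus man plus woman should land nearest to queen:&lt;/p&gt;

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product of the normalized vectors.
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-dimensional "word vectors" for illustration only.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 1.0]),
}

# king - man + woman should be closest to queen.
query = vectors["king"] - vectors["man"] + vectors["woman"]
nearest = max(vectors, key=lambda w: cosine(query, vectors[w]))
```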

&lt;p&gt;There&amp;rsquo;s quite a bit of buzz around the tools which build these word
vectors at the moment, so I figured it would be worthwhile to provide some
down-to-earth coverage of GloVe, one of the newest methods.&lt;/p&gt;

&lt;p&gt;The GloVe authors &lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/glove.pdf&quot;&gt;present some results&lt;/a&gt; which suggest that their
tool is competitive with Google&amp;rsquo;s popular &lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;word2vec&lt;/code&gt;&lt;/a&gt; package. In
order to better understand how GloVe works and to make available a nice
learning resource, I decided to port the open-source (yay!) but somewhat
difficult-to-read (no!) &lt;a href=&quot;http://www-nlp.stanford.edu/software/glove.tar.gz&quot;&gt;GloVe source code&lt;/a&gt; from C to Python.&lt;/p&gt;

&lt;p&gt;In this post I&amp;rsquo;ll give an explanation by intuition of how the GloVe method
works&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; and then provide a quick overview of the implementation in Python. You
can find the complete Python code (just 187 SLOC, including command-line
argument processing, IO, etc.) in &lt;a href=&quot;http://github.com/hans/glove.py&quot;&gt;the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glove.py&lt;/code&gt; GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A quick disclaimer before we begin: I wrote this code for tutorial purposes. It
is nowhere near production-ready in terms of efficiency. If you would like to
parallelize and optimize it as an exercise, be my guest &amp;mdash; just be sure to
share the results!&lt;/p&gt;

&lt;h2 id=&quot;theory&quot;&gt;Theory&lt;/h2&gt;

&lt;p&gt;The GloVe model learns word vectors by examining &lt;em&gt;word co-occurrences&lt;/em&gt; within a
text corpus. Before we train the actual model, we need to construct a
&lt;em&gt;co-occurrence matrix&lt;/em&gt; \(X\), where a cell \(X_{ij}\) is a &amp;ldquo;strength&amp;rdquo; which
represents how often the word \(i\) appears in the context of the word
\(j\). We run through our corpus just once to build the matrix \(X\), and
from then on use this co-occurrence data in place of the actual corpus. We will
construct our model based only on the values collected in \(X\).&lt;/p&gt;
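&lt;p&gt;To make the co-occurrence idea concrete, here is a minimal sketch of counting co-occurrences with a left-context window over a toy corpus. It ignores the inverse-distance weighting and sparse-matrix storage that the real &lt;code&gt;build_cooccur&lt;/code&gt; function below uses, and it keys on word strings rather than integer IDs:&lt;/p&gt;

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window_size=2):
    # counts[(i, j)] accumulates how often word j appears within
    # window_size words of word i. Each left-window pair is recorded
    # symmetrically, as in the full implementation.
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for center_i, center in enumerate(tokens):
            left_context = tokens[max(0, center_i - window_size):center_i]
            for context in left_context:
                counts[(center, context)] += 1.0
                counts[(context, center)] += 1.0
    return counts

corpus = ["the cat sat on the mat"]
counts = cooccurrence_counts(corpus)
```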

&lt;p&gt;Once we&amp;rsquo;ve prepared \(X\), our task is to decide vector values in continuous
space for each word we observe in the corpus. We will produce vectors with a
soft constraint that for each pair of words \(i\) and \(j\),&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;\[\begin{equation}\vec{w}_i^T \vec{w}_j + b_i + b_j = \log X_{ij},\end{equation}\]&lt;/p&gt;

&lt;p&gt;where \(b_i\) and \(b_j\) are scalar bias terms associated with words
\(i\) and \(j\), respectively. Intuitively speaking, we want to build word
vectors that retain some useful information about how every pair of words
\(i\) and \(j\) co-occur.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ll do this by minimizing an objective function \(J\), which evaluates the
sum of all squared errors based on the above equation, weighted with a function
\(f\):&lt;/p&gt;

&lt;p&gt;\[\begin{equation}J = \sum_{i=1}^V \sum_{j=1}^V \; f\left(X_{ij}\right) \left( \vec{w}_i^T \vec{w}_j + b_i + b_j - \log X_{ij} \right)^2 \end{equation}\]&lt;/p&gt;

&lt;p&gt;We choose an \(f\) that helps prevent common word pairs (i.e., those with
large \(X_{ij}\) values) from skewing our objective too much:&lt;/p&gt;

&lt;p&gt;\[\begin{equation}f\left(X_{ij}\right) = \left\{ \begin{array}{cl}\left(\frac{X_{ij}}{x_{\text{max}}}\right)^\alpha &amp; \text{if } X_{ij} &lt; x_{\text{max}} \\ 1 &amp; \text{otherwise.} \end{array}\right. \end{equation} \]&lt;/p&gt;

&lt;p&gt;When we encounter extremely common word pairs (where \(X_{ij} &gt;
x_{\text{max}}\)) this function will cut off its normal output and simply
return \(1\). For all other word pairs, we return some weight in the range
\((0, 1)\), where the distribution of weights in this range is decided by
\(\alpha\).&lt;/p&gt;
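&lt;p&gt;As a standalone sketch (separate from the implementation below), here is the weighting function \(f\) and one weighted squared-error term of \(J\). The defaults \(x_{\text{max}} = 100\) and \(\alpha = 3/4\) are the values suggested by Pennington et al.; the vectors, biases, and co-occurrence value in the demo are made up:&lt;/p&gt;

```python
import numpy as np

def weight(x_ij, x_max=100.0, alpha=0.75):
    # (x_ij / x_max) ** alpha reaches 1 exactly when x_ij reaches
    # x_max, so capping with min() reproduces the piecewise f.
    return min((x_ij / x_max) ** alpha, 1.0)

# One weighted squared-error term of J for a single word pair,
# with illustrative vectors w_i, w_j and biases b_i, b_j.
w_i, w_j = np.array([0.1, 0.2]), np.array([0.3, 0.4])
b_i = b_j = 0.0
x_ij = 20.0
cost_ij = weight(x_ij) * (w_i.dot(w_j) + b_i + b_j - np.log(x_ij)) ** 2
```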

&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;Now for the code! I&amp;rsquo;ll skip the boring parts which do things like model saving
and argument parsing, and focus on the three meatiest functions in the code:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build_cooccur&lt;/code&gt;, which accepts a corpus and yields a list of
co-occurrence blobs (the \(X_{ij}\) values). It calculates
co-occurrences by moving a &lt;em&gt;sliding n-gram window&lt;/em&gt; over each
sentence in the corpus.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train_glove&lt;/code&gt;, which prepares the parameters of the model and manages
training at a high level, and&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_iter&lt;/code&gt;, which runs a single parameter update step.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build_cooccur&lt;/code&gt; function accepts a vocabulary (mapping words to
integer word IDs), a corpus (a simple iterator over sentences), and some
optional parameters: a context window size and a minimum count (used to
drop rare word co-occurrence pairs). We&amp;rsquo;ll start by building a sparse
matrix for collecting cooccurrences \(X_{ij}\) and some simple helper
data.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;build_cooccur&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;corpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;window_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id2word&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vocab&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iteritems&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Collect cooccurrences internally as a sparse matrix for
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# passable indexing speed; we&apos;ll convert into a list later
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;cooccurrences&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sparse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lil_matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                                      &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For each line in the corpus, we&amp;rsquo;ll conjure up a sequence of word IDs:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;corpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;token_ids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now for each word ID \(i\) in the sentence, we&amp;rsquo;ll extract a window of context
words to the left of the word.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center_id&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;token_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Collect all word IDs in left window of center word
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;context_ids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;token_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center_i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;window_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                                    &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;contexts_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;context_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For each word ID \(j\) in the context, we&amp;rsquo;ll add weight to the
cell \(X_{ij}\). The increment for the word pair
is inversely related to the distance between the two words in
question. This means word instances which appear next to each other
see higher \(X_{ij}\) increments than word
instances which appear with many words in between.&lt;/p&gt;

&lt;p&gt;One last technical point: we build this matrix \(X_{ij}\)
&lt;em&gt;symmetrically&lt;/em&gt;. This means that we treat word co-occurrences where the
context word is to the left of the main word exactly the same as
co-occurrences where the context word is to the right of the main word.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left_id&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;context_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Distance from center word
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;contexts_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left_i&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Weight by inverse of distance between words
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;increment&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Build co-occurrence matrix symmetrically (pretend
&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;# we are calculating right contexts as well)
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;cooccurrences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;center_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;increment&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;cooccurrences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;increment&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;That&amp;rsquo;s about it &amp;mdash; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build_cooccur&lt;/code&gt; finishes with
&lt;a href=&quot;https://github.com/hans/glove.py/blob/582549ddeeeb445cc676615f64e318aba1f46295/glove.py#L171-182&quot;&gt;a bit more code to yield co-occurrence pairs from this sparse matrix&lt;/a&gt;, but
I won&amp;rsquo;t bother to show it here.&lt;/p&gt;

&lt;p&gt;Next, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;train_glove&lt;/code&gt; initializes the model parameters given the fully
constructed co-occurrence data. We expect the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vocab&lt;/code&gt; object as
before as a first parameter. The second parameter, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cooccurrences&lt;/code&gt;,
is a co-occurrence iterator produced in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build_cooccur&lt;/code&gt;, which
yields co-occurrence tuples of the form &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(main_word_id,
context_word_id, x_ij)&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x_ij&lt;/code&gt; is an \(X_{ij}\)
co-occurrence value as introduced above.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;train_glove&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cooccurrences&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vector_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;iterations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;25&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We next prepare the primary model parameters: the word vector matrix \(W\) and
a collection of bias scalars. Note that our word matrix has twice as many rows
as there are words in the vocabulary. We will see why when describing
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_iter&lt;/code&gt; function.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Word vector matrix. This matrix is (2V) * d, where V is the
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# size of the corpus vocabulary and d is the dimensionality of
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# the word vectors. All elements are initialized randomly in the
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# range (-0.5, 0.5]. We build two word vectors for each word:
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# one for the word as the main (center) word and one for the
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# word as a context word.
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;#
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# It is up to the client to decide what to do with the resulting
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# two vectors. Pennington et al. (2014) suggest adding or
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# averaging the two for each word, or discarding the context
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# vectors.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;W&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vector_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
         &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Bias terms, each associated with a single vector. An array of
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# size $2V$, initialized randomly in the range (-0.5, 0.5].
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;biases&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
              &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vector_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We will be training using adaptive gradient descent (AdaGrad),&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; and so we&amp;rsquo;ll
also need to initialize helper matrices for \(W\) and the bias vector which
track gradient histories. Note that these are all initialized as blocks of ones.
By starting with every gradient history equal to one, our first training step in
AdaGrad will simply use the global learning rate for each example. (See footnote
7&lt;sup id=&quot;fnref:7:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; to work this out from the AdaGrad definition.)&lt;/p&gt;
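&lt;p&gt;As a standalone sketch (not the post&amp;rsquo;s actual update code), a single AdaGrad step for one parameter array looks like the following; the learning rate and gradient values are illustrative:&lt;/p&gt;

```python
import numpy as np

def adagrad_step(param, grad, grad_squared_history, learning_rate=0.05):
    # Scale each coordinate's step by the inverse square root of its
    # accumulated squared-gradient history, then update the history.
    param = param - learning_rate * grad / np.sqrt(grad_squared_history)
    grad_squared_history = grad_squared_history + grad ** 2
    return param, grad_squared_history

# With histories initialized to ones, sqrt(history) is 1, so the very
# first step is a plain gradient step with the global learning rate.
param = np.array([1.0, 2.0])
grad = np.array([0.5, -0.5])
history = np.ones_like(param)
param, history = adagrad_step(param, grad, history, learning_rate=0.1)
```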

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Training is done via adaptive gradient descent (AdaGrad). To
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# make this work we need to store the sum of squares of all
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# previous gradients.
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;#
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Like `W`, this matrix is (2V) * d.
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;#
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Initialize all squared gradient sums to 1 so that our initial
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# adaptive learning rate is simply the global learning rate.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;gradient_squared&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ones&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vector_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                               &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Sum of squared gradients for the bias terms.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;gradient_squared_biases&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ones&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                      &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Next, we begin training by iteratively calling the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_iter&lt;/code&gt; function, passing it the co-occurrence data (pre-fetched into a list so that it can be shuffled on each iteration).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iterations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;run_iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_iter&lt;/code&gt; accepts this pre-fetched data and begins by shuffling it and
establishing a global cost for the iteration:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Iterate over data in random order
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;shuffle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now for every co-occurrence data tuple, we compute the weighted cost as
described in the above theory section. Each tuple has the following elements:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;v_main&lt;/code&gt;: the word vector for the main word in the co-occurrence&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;v_context&lt;/code&gt;: the word vector for the context word in the co-occurrence&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b_main&lt;/code&gt;: bias scalar for main word&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b_context&lt;/code&gt;: bias scalar for context word&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gradsq_W_main&lt;/code&gt;: a vector storing the squared gradient history for the main
word vector (for use in the AdaGrad update)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gradsq_W_context&lt;/code&gt;: the squared gradient history vector for the context word vector&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gradsq_b_main&lt;/code&gt;: the squared gradient history scalar for the main word bias&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gradsq_b_context&lt;/code&gt;: the squared gradient history scalar for the context word bias&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cooccurrence&lt;/code&gt;: the \(X_{ij}\) value for the co-occurrence pair, described
at length above&lt;/li&gt;
&lt;/ol&gt;
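Before entering the loop, it may help to see the weighting function \(f(X_{ij})\) from the theory section as a standalone helper. This is only a sketch; the defaults \(x_{\max} = 100\) and \(\alpha = 0.75\) are the values suggested in the GloVe paper, not constants defined elsewhere in this code.

```python
def cooccurrence_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function $f(X_{ij})$: downweights rare co-occurrence
    pairs and caps the influence of very frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

For example, a pair seen 50 times gets weight \(0.5^{0.75} \approx 0.59\), while any pair seen 100 or more times gets weight exactly 1.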

&lt;p&gt;We retain an intermediate &amp;ldquo;inner&amp;rdquo; cost (not squared or weighted) for
use in calculating the gradient in the next section.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gradsq_W_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;gradsq_W_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gradsq_b_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gradsq_b_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;cooccurrence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Calculate weight function $f(X_{ij})$
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cooccurrence&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;
                  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cooccurrence&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_max&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Compute inner component of cost function, which is used in
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# both overall cost calculation and in gradient calculation
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;#
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;#   $$ J&apos; = w_i^Tw_j + b_i + b_j - log(X_{ij}) $$
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;cost_inner&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                      &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
                      &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cooccurrence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Compute cost
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;#
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;#   $$ J = f(X_{ij}) (J&apos;)^2 $$
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cost_inner&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Add weighted cost to the global cost tracker
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cost&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With the cost calculated, we now need to compute gradients. From our
original cost function \(J\) we derive gradients with respect to the
relevant parameters \(\vec w_i\), \(\vec w_j\), \(b_i\), and \(b_j\).
(Note that since \(f(X_{ij})\) doesn&amp;rsquo;t depend on any of these
parameters, the derivations are quite simple.) The inner expression is a
scalar, so each word-vector gradient is just that scalar times the other
word vector. In the code that follows, the constant factor of 2 is folded
into the learning rate, as is conventional.&lt;/p&gt;

&lt;p&gt;\[\begin{align*}J &amp;= \sum_{i=1}^V \sum_{j=1}^V \; f\left(X_{ij}\right) \left( \vec{w}_i^T \vec{w}_j + b_i + b_j - \log X_{ij} \right)^2 \\ \nabla_{\vec{w}_i} J &amp;= 2 \sum_{j=1}^V f\left(X_{ij}\right) \left( \vec{w}_i^T \vec{w}_j + b_i + b_j - \log X_{ij}\right) \vec{w}_j \\ \frac{\partial J}{\partial b_i} &amp;= 2 \sum_{j=1}^V f\left(X_{ij}\right) \left(\vec w_i^T \vec w_j + b_i + b_j - \log X_{ij}\right) \end{align*} \]&lt;/p&gt;
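As a quick sanity check on the derivation (not part of the original implementation), we can compare the analytic gradient of a single co-occurrence term against a central finite-difference approximation. The factor of 2 from differentiating the square appears explicitly here; the training code conventionally folds it into the learning rate.

```python
import numpy as np

def glove_cost(v_main, v_context, b_main, b_context, x, weight):
    """Weighted squared cost for one co-occurrence pair."""
    inner = v_main.dot(v_context) + b_main + b_context - np.log(x)
    return weight * inner ** 2

rng = np.random.RandomState(0)
v_main = rng.randn(5) * 0.1
v_context = rng.randn(5) * 0.1
b_main, b_context, x, weight = 0.1, -0.2, 20.0, 0.8

# Analytic gradient w.r.t. v_main: 2 f(X_ij) (inner) v_context
inner = v_main.dot(v_context) + b_main + b_context - np.log(x)
grad_analytic = 2 * weight * inner * v_context

# Central finite differences along each coordinate of v_main
eps = 1e-6
grad_fd = np.zeros_like(v_main)
for k in range(5):
    e = np.zeros(5)
    e[k] = eps
    grad_fd[k] = (glove_cost(v_main + e, v_context, b_main, b_context, x, weight)
                  - glove_cost(v_main - e, v_context, b_main, b_context, x, weight)) / (2 * eps)

assert np.allclose(grad_analytic, grad_fd, atol=1e-5)
```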

&lt;p&gt;Now let&amp;rsquo;s put that in code! We use the earlier-calculated intermediate
value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cost_inner&lt;/code&gt;, which stores the value being squared and weighted in
the full cost function.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Compute gradients for word vector terms.
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;#
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# NB: `v_main` is only a view into `W` (not a copy), so our
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# modifications here will affect the global weight matrix;
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# likewise for v_context, biases, etc.
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;grad_main&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cost_inner&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v_context&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;grad_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cost_inner&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v_main&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Compute gradients for bias terms
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;grad_bias_main&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cost_inner&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;grad_bias_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cost_inner&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Finally, we update the parameters with AdaGrad&lt;sup id=&quot;fnref:7:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; and add the squared
gradients to the gradient history variables.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Now perform adaptive updates
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;v_main&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_main&lt;/span&gt;
                   &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gradsq_W_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;v_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_context&lt;/span&gt;
                      &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gradsq_W_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;b_main&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_bias_main&lt;/span&gt;
                   &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gradsq_b_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;b_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_bias_context&lt;/span&gt;
                      &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gradsq_b_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Update squared gradient sums
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;gradsq_W_main&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;square&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grad_main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;gradsq_W_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;square&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grad_context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;gradsq_b_main&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_bias_main&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;gradsq_b_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;grad_bias_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
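To see why dividing by the accumulated squared-gradient history matters, here is a minimal AdaGrad sketch on a toy one-dimensional objective \(f(x) = x^2\). This is not from the original post; as in the code above, the history is seeded at one to avoid division by zero.

```python
import numpy as np

x = 3.0               # parameter to optimize; objective is f(x) = x ** 2
gradsq = 1.0          # squared-gradient history, seeded at one as above
learning_rate = 0.1
steps = []

for _ in range(200):
    grad = 2 * x                               # gradient of f at x
    step = learning_rate * grad / np.sqrt(gradsq)
    x -= step
    gradsq += grad ** 2                        # accumulate after the update, as above
    steps.append(step)

# The effective step size shrinks as gradient history accumulates,
# while x steadily approaches the minimum at zero.
assert steps[0] > steps[-1]
assert 0 < x < 3.0
```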

&lt;p&gt;After we&amp;rsquo;ve processed all data for the iteration, we return the global cost and relax for a while.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c1&quot;&gt;# -- continued --
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;global_cost&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;hr /&gt;

&lt;p&gt;That&amp;rsquo;s it for code! If you&amp;rsquo;d like to see word vectors produced by this Python
code in action, check out &lt;a href=&quot;http://nbviewer.ipython.org/github/hans/glove.py/blob/master/demo/glove.py%20exploration.ipynb&quot;&gt;this IPython notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you found this all fascinating, I highly recommend digging into the
&lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/&quot;&gt;official GloVe documentation&lt;/a&gt;, especially the &lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/glove.pdf&quot;&gt;original paper&lt;/a&gt;, which is
due to be published at &lt;a href=&quot;http://emnlp2014.org&quot;&gt;this year&amp;rsquo;s EMNLP conference&lt;/a&gt;. For a high-quality general
survey of word representations and their uses, see Peter Turney and Patrick
Pantel&amp;rsquo;s paper,
&lt;a href=&quot;http://www.aaai.org/Papers/JAIR/Vol37/JAIR-3705.pdf&quot;&gt;&amp;ldquo;From frequency to meaning: Vector space models of semantics.&amp;rdquo;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distributed word representations such as those GloVe produces are
revolutionizing natural language processing. I&amp;rsquo;m excited to see what happens as
more and more tools of this sort are disseminated outside of academia and put to
real-world use.&lt;/p&gt;

&lt;p&gt;If you&amp;rsquo;re making use of GloVe or similar tools in your own projects, let me
know. Until next time, happy coding!&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot;&gt;
MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } } });
&lt;/script&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Richard Socher et al., &lt;a href=&quot;http://www.aclweb.org/anthology/P13-1045&quot;&gt;&amp;ldquo;Parsing with Compositional Vector Grammars,&amp;rdquo;&lt;/a&gt; in &lt;em&gt;Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt; (Sofia, Bulgaria: Association for Computational Linguistics, 2013), 455&amp;ndash;65.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Joseph Turian, Lev Ratinov, and Yoshua Bengio, &lt;a href=&quot;http://www.aclweb.org/anthology/P10-1040&quot;&gt;&amp;ldquo;Word Representations: A Simple and General Method for Semi-Supervised Learning,&amp;rdquo;&lt;/a&gt; in &lt;em&gt;Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics&lt;/em&gt; (Association for Computational Linguistics, 2010), 384&amp;ndash;94.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, &lt;a href=&quot;http://arxiv.org/abs/1409.0473.&quot;&gt;&amp;ldquo;Neural Machine Translation by Jointly Learning to Align and Translate,&amp;rdquo;&lt;/a&gt; &lt;em&gt;arXiv:1409.0473 [cs, Stat]&lt;/em&gt;, September 1, 2014.&lt;br /&gt;&lt;br /&gt;This is what I&amp;rsquo;m working on right now&amp;mdash;if this sounds interesting to you, get in touch!&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There is still quite a bit of debate, however, over the best way to construct these vectors. The popular tool &lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;word2vec&lt;/code&gt;&lt;/a&gt;, which has seen wide use and wide success in the past year, builds so-called &lt;em&gt;neural&lt;/em&gt; word embeddings, whereas GloVe and others construct word vectors based on &lt;em&gt;counts&lt;/em&gt;. I won&amp;rsquo;t get into the controversy in this post, but feel free to read up and pick a side.&lt;br /&gt;&lt;br /&gt;See e.g. Marco Baroni, Georgiana Dinu, and Germán Kruszewski, &lt;a href=&quot;http://www.aclweb.org/anthology/P14-1023&quot;&gt;&amp;ldquo;Don&amp;rsquo;t Count, Predict! A Systematic Comparison of Context-Counting vs. Context-Predicting Semantic Vectors,&amp;rdquo;&lt;/a&gt; in &lt;em&gt;Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt; (Baltimore, Maryland: Association for Computational Linguistics, 2014), 238&amp;ndash;47; Omer Levy and Yoav Goldberg, &lt;a href=&quot;http://www.aclweb.org/anthology/W14-1618&quot;&gt;&amp;ldquo;Linguistic Regularities in Sparse and Explicit Word Representations,&amp;rdquo;&lt;/a&gt; in &lt;em&gt;Proceedings of the Eighteenth Conference on Computational Natural Language Learning&lt;/em&gt; (Ann Arbor, Michigan: Association for Computational Linguistics, 2014), 171&amp;ndash;80.&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I hope this post is a useful supplement to the &lt;a href=&quot;http://www-nlp.stanford.edu/projects/glove/glove.pdf&quot;&gt;original paper&lt;/a&gt;. If you have the time, read the original too &amp;mdash; it has a lot of useful and well-stated insights about the task of word representations in general.&amp;nbsp;&lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I am skipping over a lot of interesting / beautiful details here &amp;mdash; please read the paper if you are interested in more than the implementation!&amp;nbsp;&lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;AdaGrad is a modified form of stochastic gradient descent which attempts to guide learning in the proper direction by weighting rarely occurring features more heavily than those that fire constantly. Briefly, for a gradient component \(g_{t,i}\) at training step \(t\), AdaGrad defines the gradient descent update to be \[x_{t+1, i} = x_{t, i} - \dfrac{\eta}{\sqrt{\sum_{t&apos;=1}^{t-1} g_{t&apos;, i}^2}} g_{t, i}.\] For more thorough coverage, see &lt;a href=&quot;http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf&quot;&gt;this AdaGrad tutorial&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;#fnref:7:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;#fnref:7:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Summarizing Spanish with Stanford CoreNLP</title>
      <link>http://foldl.me/2014/spanish-summarizer-corenlp/</link>
      <pubDate>Sat, 13 Sep 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/spanish-summarizer-corenlp/</guid>
      <description>&lt;p&gt;After a summer replete with feature-engineering and corpus processing, the
Stanford NLP Group has just released &lt;a href=&quot;http://nlp.stanford.edu/software/corenlp.shtml&quot;&gt;CoreNLP 3.4.1&lt;/a&gt;, which includes support
for Spanish-language text. In this post I&amp;rsquo;ll show how to make use of these tools
to make a dead-simple document summarizer.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Our end goal will be to take a news article of significant length and reduce it
to its two or three most important points. We&amp;rsquo;ll run through each sentence and
assign it a score based on two factors:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;tf&amp;ndash;idf weights.&lt;/strong&gt; The &lt;a href=&quot;http://en.wikipedia.org/wiki/Tf%E2%80%93idf&quot;&gt;tf&amp;ndash;idf metric&lt;/a&gt; measures how
important a particular word is within its containing document.
We&amp;rsquo;ll calculate the sum of tf&amp;ndash;idf scores for all nouns in each sentence, and
consider those sentences with the greatest sums to be the most important.&lt;/p&gt;

    &lt;p&gt;The tf&amp;ndash;idf metric is the product of two factors:&lt;/p&gt;

\[\text{tf–idf}_{t, d} = tf_{t, d} \; idf_t\]

    &lt;p&gt;The first is a &lt;em&gt;term frequency&lt;/em&gt; factor: some scaled version of the
number of times the word appears in the given document. We&amp;rsquo;ll use a
logarithmic form here:&lt;/p&gt;

\[\text{tf}_{t, d} = \log(1 + \text{count of $t$ in $d$})\]

    &lt;p&gt;The second is an &lt;em&gt;inverse document frequency&lt;/em&gt; (IDF) factor. This measures the
informativeness of the word based on how often it appears in total across an
entire corpus. The inverse document frequency factor is a logarithm as well:&lt;/p&gt;

\[\text{idf}_{t} = \log\left( \frac{\text{count of total documents}}{\text{count of documents containing $t$}} \right)\]

    &lt;p&gt;Note that IDF values will be exactly 0 for common words like &amp;ldquo;the,&amp;rdquo; as they
are likely to appear in every document in the corpus. Meaningful and less
common words like &amp;ldquo;transmogrify&amp;rdquo; and &amp;ldquo;incinerate&amp;rdquo; will yield higher IDF
values.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Positional weight.&lt;/strong&gt; For news articles, another easy measure of the
importance of a sentence is its position in the document: important sentences
tend to appear before less crucial ones. We can model this by scaling our
original tf&amp;ndash;idf score down as the sentence&amp;rsquo;s index in the document grows.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
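The two factors above can be combined into a small scoring sketch. This is Python rather than the post's Java, and the \(1/(\text{index} + 1)\) positional discount is one simple hypothetical choice, not necessarily the weighting the summarizer uses.

```python
import math
from collections import Counter

def sentence_score(nouns, term_counts, df_counter, num_documents, sentence_index):
    """Sum of tf-idf scores over a sentence's nouns, discounted by position.

    nouns: the nouns of one sentence; term_counts: noun counts over the whole
    document; df_counter: document frequencies from a training corpus.
    The 1 / (sentence_index + 1) discount is a hypothetical choice.
    """
    score = 0.0
    for term in nouns:
        tf = math.log(1 + term_counts[term])
        idf = math.log(num_documents / df_counter[term]) if df_counter[term] else 0.0
        score += tf * idf
    return score / (sentence_index + 1)

df = Counter({"el": 10, "gobierno": 2})     # toy document frequencies (10 docs)
counts = Counter({"el": 5, "gobierno": 3})  # toy per-document term counts

# "el" appears in every training document, so its idf (and score) is zero.
assert sentence_score(["el"], counts, df, 10, 0) == 0.0
# The same nouns in a later sentence score proportionally lower.
first = sentence_score(["gobierno"], counts, df, 10, 0)
third = sentence_score(["gobierno"], counts, df, 10, 2)
assert first > 0 and abs(third - first / 3) < 1e-12
```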

&lt;p&gt;With theory over, let&amp;rsquo;s get to the code. I&amp;rsquo;m going to walk through a Java class
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Summarizer&lt;/code&gt;, the full source code of which is available in a &lt;a href=&quot;https://github.com/hans/corenlp-summarizer&quot;&gt;GitHub repo&lt;/a&gt;.
Our only dependency here is &lt;a href=&quot;http://nlp.stanford.edu/software/corenlp.shtml&quot;&gt;Stanford CoreNLP 3.4.1&lt;/a&gt;. We begin by
instantiating the CoreNLP pipeline statically.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;nc&quot;&gt;Properties&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;props&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Properties&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// We need part-of-speech annotations (and tokenization /&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// sentence-splitting, which are required for POS tagging)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;props&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setProperty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;annotators&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;tokenize,ssplit,pos&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Tokenize using Spanish settings&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;props&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setProperty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;tokenize.language&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;es&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Load the Spanish POS tagger model (rather than the&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// default English model)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;props&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setProperty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pos.model&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StanfordCoreNLP&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;props&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As we discussed earlier, the summarizer depends upon document frequency data,
which must be precalculated from a corpus of Spanish text. In the constructor of
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Summarizer&lt;/code&gt;, we receive a prebuilt &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dfCounter&lt;/code&gt; and determine the total
number of documents in the training corpus.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Summarizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dfCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;dfCounter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dfCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;numDocuments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dfCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;__all__&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Our main routine, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summarize&lt;/code&gt;, accepts a document string and a number of
sentences to return.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;summarize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;document&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numSentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Process the document with the constructed pipeline; get&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// a list of tokenized sentences&lt;/span&gt;
  &lt;span class=&quot;nc&quot;&gt;Annotation&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;annotation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;document&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nc&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;annotation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;CoreAnnotations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SentencesAnnotation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Collect raw term frequencies from this document (method&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// not shown here)&lt;/span&gt;
  &lt;span class=&quot;nc&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tfs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getTermFrequencies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Rank sentences of the document by descending importance&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rankSentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tfs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Build a single string with our results&lt;/span&gt;
  &lt;span class=&quot;nc&quot;&gt;StringBuilder&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numSentences&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ret&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The method &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rankSentences&lt;/code&gt; sorts the provided sentence collection using a custom
comparator &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SentenceComparator&lt;/code&gt;, which contains the bulk of our actual logic for
sentence importance. Here&amp;rsquo;s the framework:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;rankSentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
                                    &lt;span class=&quot;nc&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tfs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nc&quot;&gt;Collections&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SentenceComparator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tfs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentences&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SentenceComparator&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Comparator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;termFrequencies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;SentenceComparator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;termFrequencies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;termFrequencies&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;termFrequencies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;compare&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;cm&quot;&gt;/**
   * Compute sentence score (higher is better).
   */&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;score&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentence&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;score&lt;/code&gt; and its helper methods are the heart of the summarizer. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;score&lt;/code&gt;
accepts a sentence and returns a floating-point value indicating the sentence&amp;rsquo;s
importance.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;score&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentence&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Get the sum of tf-idf weights for the nouns in this&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// sentence&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tfIdf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tfIDFWeights&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sentence&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Scale weight based on the position of this sentence in&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// its containing document&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentence&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreAnnotations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SentenceIndexAnnotation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// (the sentence index is zero-based, so offset it by one to&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// avoid dividing by zero on the first sentence)&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indexWeight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;5.0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Return a scaled tf-idf weight. Note that we multiply all scores&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// by 100 so that small score differences between sentences do&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// not round to zero (i.e., &quot;equal&quot;) in the comparator&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indexWeight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tfIdf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
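
&lt;p&gt;As an aside: instead of scaling scores by 100 and rounding their difference, Java&amp;rsquo;s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Double.compare&lt;/code&gt; sidesteps the tie problem entirely. A standalone sketch, not part of the repo above:&lt;/p&gt;

```java
// Standalone sketch: rounding a score difference collapses small (but
// real) differences to "equal", while Double.compare preserves the
// exact ordering.
public class TieSafeCompare {
    // Descending order by score, without the x100-and-round trick
    static int compareDescending(double score1, double score2) {
        return Double.compare(score2, score1);
    }

    public static void main(String[] args) {
        // Math.round collapses a 0.3 difference to 0 ("equal"):
        System.out.println((int) Math.round(0.4 - 0.1));  // prints 0
        // Double.compare still sees that 0.4 > 0.1:
        System.out.println(compareDescending(0.1, 0.4));  // prints 1
    }
}
```

&lt;p&gt;With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Double.compare&lt;/code&gt;, the multiply-by-100 scaling in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;score&lt;/code&gt; would no longer be needed for correct ordering.&lt;/p&gt;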

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;score&lt;/code&gt; calls a method &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfIDFWeights&lt;/code&gt;, which determines the total tf&amp;ndash;idf scores
for all the nouns in the given sentence:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;tfIDFWeights&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreMap&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentence&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nc&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreLabel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentence&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreAnnotations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;TokensAnnotation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreLabel&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cl&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreAnnotations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;PartOfSpeechAnnotation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// Nouns in the Spanish POS tagset begin with the letter&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// &quot;n.&quot;&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;boolean&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isNoun&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;startsWith&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;n&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;isNoun&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CoreAnnotations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;TextAnnotation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

      &lt;span class=&quot;c1&quot;&gt;// Calculate the tf-idf weight for this particular&lt;/span&gt;
      &lt;span class=&quot;c1&quot;&gt;// word, and add it to the sentence total&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tfIDFWeight&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/**
 * Calculate the tf-idf weight for a single word.
 */&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;tfIDFWeight&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Skip unknown words&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dfCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Scale the raw term frequency (stored in an instance&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// variable of the comparator)&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;termFrequencies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Scale the document frequency (pre-built with a Spanish&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// corpus)&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;log&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numDocuments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dfCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)));&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
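
&lt;p&gt;To make the weighting concrete, here is the same tf&amp;ndash;idf arithmetic extracted into a standalone method, with made-up counts standing in for the corpus-derived counters:&lt;/p&gt;

```java
// Standalone sketch of the tf-idf arithmetic in tfIDFWeight, decoupled
// from the CoreNLP Counter classes. The counts here are invented.
public class TfIdfSketch {
    static double weight(double termFreq, double docFreq, double numDocuments) {
        if (docFreq == 0) return 0;           // skip unknown words
        double tf = 1 + Math.log(termFreq);   // dampened term frequency
        double idf = Math.log(numDocuments / (1 + docFreq));
        return tf * idf;
    }

    public static void main(String[] args) {
        // A word appearing in nearly every document gets weight zero here...
        System.out.println(weight(1, 999, 1000));
        // ...while a word appearing in only 9 of 1000 documents scores high.
        System.out.println(weight(1, 9, 1000));
    }
}
```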

&lt;p&gt;That&amp;rsquo;s it for the code. You can see the entire class in
&lt;a href=&quot;https://github.com/hans/corenlp-summarizer/blob/master/src/me/foldl/corenlp_summarizer/Summarizer.java&quot;&gt;this public GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ll end with a quick unscientific test of the code. I built document-frequency
counts (using a helper &lt;a href=&quot;https://github.com/hans/corenlp-summarizer/blob/master/src/me/foldl/corenlp_summarizer/DocumentFrequencyCounter.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DocumentFrequencyCounter&lt;/code&gt; class&lt;/a&gt;) from the
&lt;a href=&quot;https://catalog.ldc.upenn.edu/LDC2011T12&quot;&gt;Spanish Gigaword&lt;/a&gt;, which contains about 1.5 billion words of Spanish. It
took several days (running on a 16-core machine) to POS-tag each sentence and
collect the nouns in a global counter.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
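
&lt;p&gt;For a sense of what that counting step produces, here is a hypothetical miniature version using a plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HashMap&lt;/code&gt; in place of CoreNLP&amp;rsquo;s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Counter&lt;/code&gt; (a sketch only; the real implementation is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DocumentFrequencyCounter&lt;/code&gt; class):&lt;/p&gt;

```java
import java.util.*;

// Hypothetical miniature of document-frequency counting: each noun is
// counted once per document it appears in, and a "__all__" sentinel
// records the total number of documents (as read by the Summarizer).
public class DfSketch {
    static Map<String, Integer> countDocumentFrequencies(List<Set<String>> docNouns) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> nouns : docNouns)
            for (String noun : nouns)
                df.merge(noun, 1, Integer::sum);
        df.put("__all__", docNouns.size());
        return df;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("galaxia", "universo")),
            new HashSet<>(Arrays.asList("galaxia", "deuda")));
        // "galaxia" appears in both documents, the others in one each
        System.out.println(countDocumentFrequencies(docs));
    }
}
```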

&lt;p&gt;I next tested with a few recent Spanish news articles, requesting a two-sentence
summary of each. Here&amp;rsquo;s the output summary of
&lt;a href=&quot;http://www.rtve.es/noticias/20140903/equipo-cientificos-definen-supercumulo-galaxias-esta-via-lactea/1005222.shtml&quot;&gt;an article on the Laniakea supercluster&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Las galaxias no están distribuidas al azar en todo el universo, sino que se
encuentran en grupos, al igual que nuestro propio Grupo Local, que contiene
docenas de galaxias, y en cúmulos masivos, que poseen cientos de galaxias,
todas interconectadas en una red de filamentos en la que se ensartan como
perlas. Estos expertos han bautizado al supercúmulo con el nombre de
&amp;lsquo;Laniakea&amp;rsquo;, que significa &amp;ldquo;cielo inmenso&amp;rdquo; en hawaiano, como informan en un
artículo de la edición de este jueves de Nature. Una galaxia entre dos
estructuras de este tipo puede quedar atrapada en un tira y afloja
gravitacional en el que el equilibrio de las fuerzas gravitacionales que
rodean las estructuras a gran escala determina el movimiento de la galaxia.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And &lt;a href=&quot;http://www.elmundo.es/economia/2014/09/03/54074ed5268e3ec7168b4595.html&quot;&gt;another on Argentinian debt&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;La inclusión de la capital de Francia como nueva jurisdicción para hacer
efectivos los desembolsos a los acreedores ha sido una iniciativa del bloque
&amp;lsquo;cristinista&amp;rsquo; para ganar los votos de algunos legisladores opositores. Por
ejemplo, los legisladores del Frente Renovador, también peronista pero no
&amp;lsquo;cristinista&amp;rsquo;, según la prensa, acordarían con la inclusión de París, por
considerar que allí los pagos estarían a salvo de los fondos especulativos o
&amp;lsquo;buitre&amp;rsquo;. Con esta iniciativa el gobierno de la presidenta Cristina Fernández,
viuda de Kirchner, pretende esquivar a la justicia de los Estados Unidos y a
los fondos especulativos o &amp;lsquo;buitre&amp;rsquo; que ganaron a Argentina un juicio y
colocaron al país en &amp;lsquo;default&amp;rsquo; parcial.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hope this code serves as a useful example for using basic CoreNLP tools in
Spanish. Feel free to follow up below in the comments or by email!&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;&lt;/script&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I won&amp;rsquo;t claim this will always give fantastic summarizations, but it&amp;rsquo;s definitely a quick and easy-to-grasp algorithm.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If you are interested in how this helper data is constructed, see the &lt;a href=&quot;https://github.com/hans/corenlp-summarizer/blob/master/src/me/foldl/corenlp_summarizer/DocumentFrequencyCounter.java&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DocumentFrequencyCounter&lt;/code&gt; class&lt;/a&gt; in the GitHub repo.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This probably could have been optimized quite a bit down to the level of hours &amp;ndash; but when you&amp;rsquo;ve got the time&amp;hellip;&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>On Metacademy and knowledge graphs</title>
      <link>http://foldl.me/2014/metacademy/</link>
      <pubDate>Thu, 14 Aug 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/metacademy/</guid>
      <description>&lt;p&gt;If there&amp;rsquo;s one complaint I have about starting work as a researcher, it&amp;rsquo;s about
information glut.&lt;/p&gt;

&lt;p&gt;Sure&amp;mdash;the work is scintillating and intellectually stimulating. But it is easy
to be overwhelmed by the sheer volume of knowledge lying ahead, and to struggle
to organize your own plan for learning. The acronyms and extended noun phrases
your colleagues drop all sound darn interesting, but it&amp;rsquo;s often unclear how
best to acquire the requisite knowledge you&amp;rsquo;re lacking.&lt;/p&gt;

&lt;p&gt;Enter &lt;a href=&quot;http://metacademy.org&quot;&gt;Metacademy&lt;/a&gt;, an open-source project for creating
&lt;a href=&quot;http://www.metacademy.org/about&quot;&gt;&amp;ldquo;dependency graphs of knowledge.&amp;rdquo;&lt;/a&gt; The site consists of an enormous list of
&lt;em&gt;concepts,&lt;/em&gt; all linked together in a single dependency graph. Each concept
consists of a list of prerequisite concepts and a collection of resources for
learning the concept itself. A picture is worth many words of explanation:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;a href=&quot;http://www.metacademy.org/graphs/concepts/sequential_monte_carlo#focus=sequential_monte_carlo&amp;amp;mode=explore&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/metacademy.png&quot; alt=&quot;Metacademy concept graph&quot; /&gt;&lt;/a&gt;&lt;figcaption&gt;Metacademy concept graph&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;This is an exciting way to think about knowledge acquisition. The graph
itself reminds me of &amp;ldquo;skill trees&amp;rdquo; in MMORPGs or of the &amp;ldquo;technology tree&amp;rdquo; in the
game &lt;a href=&quot;http://en.wikipedia.org/wiki/Civilization_(video_game)&quot;&gt;Civilization&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;&lt;img src=&quot;http://www.foldl.me//uploads/metacademy-civilization.png&quot; alt=&quot;Civilization technology tree&quot; /&gt;&lt;figcaption&gt;Civilization technology tree&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;With this kind of ontology defined, we can now think of learning as a slow and
deliberate traversal of a massive graph. I&amp;rsquo;ve been browsing and tracking my
learning on Metacademy for some time now, and I think it&amp;rsquo;s a useful way to
organize my knowledge.&lt;/p&gt;
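
&lt;p&gt;The dependency-graph view of learning maps neatly onto a standard data
structure. Here is a minimal Python sketch (the concept names below are my own
toy examples, not actual Metacademy data) in which a topological sort of a
prerequisite graph yields a valid study plan:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Toy prerequisite graph: each concept maps to the set of concepts it requires.
concepts = {
    "probability": set(),
    "bayes_rule": {"probability"},
    "importance_sampling": {"probability"},
    "particle_filter": {"bayes_rule", "importance_sampling"},
    "sequential_monte_carlo": {"particle_filter"},
}

# Any topological order is a valid study plan: every concept appears
# only after all of its prerequisites.
plan = list(TopologicalSorter(concepts).static_order())
print(plan)
```

&lt;p&gt;Marking a concept as learned then amounts to checking off a vertex;
everything just downstream of your known set is a candidate next step.&lt;/p&gt;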

&lt;h3 id=&quot;the-future&quot;&gt;The future&lt;/h3&gt;

&lt;p&gt;This is just the start for Metacademy. While the site content currently centers
around machine learning and artificial intelligence topics,&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; this won&amp;rsquo;t be
the case for much longer. The plan from the start has been to
&lt;a href=&quot;http://www.metacademy.org/about&quot;&gt;expand to cover all sorts of knowledge&lt;/a&gt;, from music to mathematics.&lt;/p&gt;

&lt;p&gt;In hopes of expanding in this way, the Metacademy founders have just begun a
private beta of a visual knowledge graph editor. This is good news for getting
non-technical visitors to contribute to the graphs for their own fields.&lt;/p&gt;

&lt;p&gt;Some other assorted opportunities that strike me as interesting:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Gamification of learning.&lt;/strong&gt; Provide personal and social incentives (à la
&lt;a href=&quot;http://duolingo.com&quot;&gt;Duolingo&lt;/a&gt;) for users to continue their walk through the knowledge graph.
Imagine &amp;ldquo;leveling up&amp;rdquo; after marking a certain concept as known, and having a
clear view of what your friends and colleagues are learning at the same time.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Inter-disciplinary links.&lt;/strong&gt; Visualize exactly how machine learning overlaps
with natural language processing, or identify the mathematical concepts most
crucial to understanding core artificial intelligence ideas.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Community-curated resources.&lt;/strong&gt; With all the content open and free, this site
has the chance to raise a Wikipedia-style community, where motivated
volunteers work to collect the best resources for each concept and ensure the
entire graph remains well-connected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I&amp;rsquo;m planning to extend the Metacademy database and increase its coverage in
natural language processing (and perhaps linguistics) topics. It&amp;rsquo;s exciting to
think about the opportunities this site will reveal for the thousands of
autodidacts&amp;mdash;students and workers alike&amp;mdash;who wish to continue learning.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Conveniently for me, the exact topics I should be learning!&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Where does ice cream come from?</title>
      <link>http://foldl.me/2014/nlu-world-knowledge/</link>
      <pubDate>Wed, 09 Jul 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/nlu-world-knowledge/</guid>
      <description>&lt;p&gt;Since I was first introduced to the field of natural language understanding
&lt;a href=&quot;http://cs224u.stanford.edu&quot;&gt;this spring&lt;/a&gt;, I&amp;rsquo;ve had an extra voice stuck in my head. It monitors
conversations closely, watching for syntactic, semantic, and &lt;a href=&quot;http://en.wikipedia.org/wiki/Pragmatics&quot;&gt;pragmatic&lt;/a&gt;
oddities. Its utterance is always some variation on the same theme, always
blurted out after I hear a particularly interesting construction:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;How the heck did I just understand what was said?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It&amp;rsquo;s a bit mind-boggling at times to examine what I hear and say, and to
acknowledge that this small brain of mine somehow processes these things
properly.&lt;/p&gt;

&lt;p&gt;A case in point: I was eating with friends at a Stanford dining hall recently.
Aforementioned blurting voice had been quiet all night. A friend returned with
an empty ice cream cone in hand and said the following:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Well, I got a cone, but no ice cream came out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My companions nodded in sympathy. But the voice in my head distracted me,
exploding: &lt;em&gt;World knowledge! World knowledge!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The above is an example of an utterance which requires &lt;em&gt;world knowledge&lt;/em&gt; to
understand correctly. To see this, try to make a literal reading of the quote:
my friend seems to suggest that she expected ice cream to magically emerge from
the cone she was holding, and that she was disappointed when this didn&amp;rsquo;t happen.
We know this interpretation isn&amp;rsquo;t correct, because there is a much more
plausible one: my friend fetched an ice cream cone but found that the ice cream
dispenser failed to dispense.&lt;/p&gt;

&lt;p&gt;Of course, my friend did not mention anything about an ice cream dispenser. The
others who heard this statement all used their own knowledge &amp;mdash; accrued over
years of ice cream consumption &amp;mdash; to infer the crucial details of the story.&lt;/p&gt;

&lt;p&gt;How can we expect an artificial intelligence to do the same? To interpret this
particular statement and feel sympathy (as it should!), an agent needs to
understand the following relatively obscure facts:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;ice cream is often served in cones,&lt;/li&gt;
  &lt;li&gt;dispensers are sometimes available to fill these cones, and&lt;/li&gt;
  &lt;li&gt;it is unfortunate when no ice cream comes out of such a dispenser.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the minimal world knowledge to get the gist of a &lt;em&gt;single&lt;/em&gt; casual
statement. Sure, Siri may be able to get you directions to the airport, but we
are far from &lt;strong&gt;complete&lt;/strong&gt; natural language understanding.&lt;/p&gt;

&lt;p&gt;Writing about this topic reminds me of Hector Levesque&amp;rsquo;s wonderful IJCAI paper
on natural language understanding and artificial intelligence,
&lt;a href=&quot;http://ijcai13.org/files/summary/hlevesque.pdf&quot;&gt;&lt;em&gt;On our best behavior&lt;/em&gt;&lt;/a&gt;. Do give it a read if you are interested in the
current state of AI. This paper merits some more discussion in a separate post!&lt;/p&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2014/sunday-links-9/</link>
      <pubDate>Sun, 20 Apr 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/sunday-links-9/</guid>
      <description>&lt;ul&gt;
  &lt;li&gt;Lots of reading for an NLP-themed week
(&lt;a href=&quot;http://swank.stanford.edu&quot;&gt;Stanford Workshop on AI and Knowledge&lt;/a&gt; was on Wednesday, and preparation
for a big NLU project is ramping up). I&amp;rsquo;ve been reading much of
&lt;a href=&quot;http://www.sics.se/~mange/publications.html&quot;&gt;&lt;strong&gt;Magnus Sahlgren&amp;rsquo;s work&lt;/strong&gt;&lt;/a&gt; on distributed word representations. Also
worth mentioning: &lt;a href=&quot;http://www.socher.org/uploads/Main/HuangSocherManning_ACL2012.pdf&quot;&gt;&lt;strong&gt;Huang, Socher, Manning and Ng (2012)&lt;/strong&gt;&lt;/a&gt; learn word
embeddings using both local and global context;
&lt;a href=&quot;http://www.cs.utexas.edu/~ml/papers/reisinger.naacl-2010.pdf&quot;&gt;&lt;strong&gt;Reisinger and Mooney (2010)&lt;/strong&gt;&lt;/a&gt; show how VSMs can handle polysemy.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.newyorker.com/talk/financial/2014/04/21/140421ta_talk_surowiecki&quot;&gt;&lt;strong&gt;&amp;ldquo;Shut up and deal&amp;rdquo;&lt;/strong&gt;&lt;/a&gt;: Tesla is restricted from selling in New Jersey
due to a statute which requires manufacturers to sell through dealers.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.vox.com/2014/4/14/5610548/the-enhanced-supplementary-leverage-ratio-is-your-new-bicycle&quot;&gt;&lt;strong&gt;&amp;ldquo;The Enhanced Supplementary Leverage Ratio is your new bicycle.&amp;rdquo;&lt;/strong&gt;&lt;/a&gt;
Leverage ratio requirements are being increased for some of the US&amp;rsquo;s
systemically important financial institutions. See &lt;a href=&quot;http://www.federalreserve.gov/newsevents/press/bcreg/20140408a.htm&quot;&gt;Fed press release&lt;/a&gt;,
&lt;a href=&quot;http://www.reuters.com/article/2014/04/08/us-financial-regulations-leverage-idUSBREA3709B20140408&quot;&gt;Reuters coverage&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.washingtonpost.com/lifestyle/style/the-recovery-puzzle-a-new-factory-in-ohio-struggles-to-match-jobs-to-job-seekers/2014/04/05/098d53ec-b44e-11e3-8cb6-284052554d74_story.html?hpid=z1&quot;&gt;&lt;strong&gt;&amp;ldquo;The recovery puzzle.&amp;rdquo;&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2014/sunday-links-9/</link>
      <pubDate>Mon, 14 Apr 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/sunday-links-9/</guid>
      <description>&lt;ul&gt;
  &lt;li&gt;Preliminary results from Mexico&amp;rsquo;s National Institute of Statistics and
Geography &lt;a href=&quot;http://mexicovoices.blogspot.com/2014/04/mexican-educational-budget-mismanaged.html&quot;&gt;indicate massive amounts of fraud&lt;/a&gt; in the country&amp;rsquo;s education
system. I can&amp;rsquo;t find credible English sources on this at the moment. See
&lt;a href=&quot;http://www.eluniversal.com.mx/primera-plana/2014/impreso/censo-de-sep-confirma-fallas-44861.html&quot;&gt;&amp;ldquo;Censo en escuelas descubre anomalías&amp;rdquo;&lt;/a&gt; and a campaign to crack down on
this fraud named &lt;a href=&quot;http://www.finalabuso.org/&quot;&gt;&amp;ldquo;Fin al abuso&amp;rdquo;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.amazon.com/gp/product/0375756787/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0375756787&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;em&gt;The Rise of Theodore Roosevelt&lt;/em&gt;&lt;/a&gt; details Roosevelt&amp;rsquo;s growth from an
awkward, squeaky boy obsessed with taxidermy and books into a man of
staggering power and charisma. Inspiring read. Recommended to me by
&lt;a href=&quot;http://www.artofmanliness.com/trunk/1795/some-book-reccomendations-in-honor-of-theodore-roosevelts-birthday/&quot;&gt;Art of Manliness&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.economist.com/news/essays/21600451-finance-not-merely-prone-crises-it-shaped-them-five-historical-crises-show-how-aspects-today-s-fina&quot;&gt;&amp;ldquo;The slumps that shaped modern finance&amp;rdquo;&lt;/a&gt; covers the history of the
financial industry from the late 18th century forward. Really fascinating
read. (For those reading on a desktop, the web reading interface for this
essay is also pretty nice!)&lt;/li&gt;
  &lt;li&gt;An &lt;a href=&quot;http://beta.slashdot.org/story/200403&quot;&gt;interview with John McAfee&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;


</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2014/sunday-links-8/</link>
      <pubDate>Sun, 06 Apr 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/sunday-links-8/</guid>
<description>&lt;p&gt;Happy April! I&amp;rsquo;ve been slacking on these link posts&amp;hellip; I&amp;rsquo;m hoping to launch
myself back into the habit with a varied collection of links today:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://noahpinionblog.blogspot.com/2014/04/no-one-really-knows-if-hft-is-good-or.html&quot;&gt;&lt;strong&gt;No one really knows if HFT is good or bad.&lt;/strong&gt;&lt;/a&gt; The econoblogosphere has
been on fire this past week with the release of Michael Lewis&amp;rsquo; book
&lt;a href=&quot;http://www.amazon.com/gp/product/0393244660/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0393244660&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;em&gt;Flash Boys&lt;/em&gt;&lt;/a&gt;, which exposes to the general public some of the more
nefarious aspects of high-frequency trading. This post by Noah Smith is a
candid admission of the fact that we really don&amp;rsquo;t have the tools at this point
to conclude whether many HFT practices (barring those obviously egregious
techniques like front-running) help or hinder the stock market. See also
&lt;a href=&quot;http://www.cepr.net/index.php/blogs/beat-the-press/high-speed-trading-and-slow-witted-economic-policy&quot;&gt;&amp;ldquo;High Speed Trading and Slow-Witted Economic Policy&amp;rdquo;&lt;/a&gt; and
&lt;a href=&quot;http://www.nytimes.com/2014/04/04/opinion/flash-boys-for-the-people.html&quot;&gt;&amp;ldquo;Flash Boys for the People&amp;rdquo;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.amazon.com/gp/product/0143124994/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0143124994&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;strong&gt;&lt;em&gt;The Alchemists: Three Central Bankers and a World On Fire&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; revealed
to me just how pivotal a role central bankers of the world have had in the
recent recession and the subsequent recovery. The amount of engineering that
these institutions&amp;mdash;most notably, the Federal Reserve, ECB, and Bank of
England&amp;mdash;perform in hopes of maintaining stable employment and moderate
inflation is just astounding.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.overcomingbias.com/2012/10/the-value-of-time-as-a-student.html&quot;&gt;&lt;strong&gt;&amp;ldquo;The value of time as a student&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; asks a question that has been
bothering me for some time: why do students of extraordinary intellectual
aptitude (and no immediate financial burdens) take on menial campus jobs?
Katja Grace suggests: &amp;ldquo;It seems that college students generally treat their
time as low value.&amp;rdquo; We trade this time for small immediate returns, perhaps
unable to conceive of how spending this same time on more positive engagements
(reading, meeting with professors, building things) could yield much greater
long-term returns.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.stanford.edu/~jurafsky/pubs/linguistic_change_lifecycle.pdf&quot;&gt;&lt;strong&gt;&amp;ldquo;No country for old members&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; models the linguistic change of
communities, and reaches some interesting conclusions about how language
change tells a story about the &amp;ldquo;age&amp;rdquo; or character of a community.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://arxiv.org/abs/cs/0304027&quot;&gt;&lt;strong&gt;&amp;ldquo;I&amp;rsquo;m sorry Dave, I&amp;rsquo;m afraid I can&amp;rsquo;t do that&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; is now my go-to resource
for explaining some of the struggles of natural language processing. The paper
gives a brief overview of the history of NLP and the challenges which it has
faced or has yet to overcome.&lt;/li&gt;
&lt;/ul&gt;


</description>
    </item>
    
    
    
    
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2014/sunday-links-7/</link>
      <pubDate>Sun, 02 Feb 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/sunday-links-7/</guid>
      <description>&lt;p&gt;Hello world! Well, &lt;a href=&quot;/2014/sunday-links-6/&quot;&gt;I had a feeling&lt;/a&gt; things would get quiet around here in
the early-quarter crunch, but I failed to predict just how much I would lose
control. The details aren&amp;rsquo;t especially important, but suffice it to say that I
drove myself slightly crazy in these first few weeks. I&amp;rsquo;m just resurfacing now,
and can affirm that my idea of &amp;ldquo;living intentionally&amp;rdquo; has gained entire
unforeseen layers of meaning. I&amp;rsquo;m tracking my progress in January in-depth in a
private log, and the relevant details will likely surface here come my winter
quarter review next month.&lt;/p&gt;

&lt;p&gt;For now, let&amp;rsquo;s focus on the reading! I&amp;rsquo;ve kept up my reading habits over these
past few weeks, though my focus has shifted from online posts and articles to
books.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://studiahumana.com/pliki/wydania/In%20Praise%20of%20Passivity.pdf&quot;&gt;&lt;strong&gt;&amp;ldquo;In Praise of Passivity&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; has been making quite the splash in
libertarian circles recently. Michael Huemer sums up decades of
pro-free-market discourse and asks us to consider the forgotten wisdom of
doing nothing &amp;mdash; of not &amp;ldquo;fighting&amp;rdquo; for the causes we &amp;ldquo;believe&amp;rdquo; in, and of not
pushing current social theory into undue prominence.&lt;/li&gt;
  &lt;li&gt;Michael Nielsen&amp;rsquo;s &lt;a href=&quot;http://www.michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/&quot;&gt;&lt;strong&gt;&amp;ldquo;How the Bitcoin protocol actually works&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; is a
fascinating intuitive explanation of the innards of the Bitcoin protocol. This
post added much to my understanding of the system, even though I had already
read (and thought I understood) the original Satoshi paper.&lt;/li&gt;
  &lt;li&gt;Friedrich Hayek contrasts two strongly opposed schools which both take the
same name in &lt;a href=&quot;http://mises.org/books/individualismandeconomicorder.pdf&quot;&gt;&lt;strong&gt;&amp;ldquo;Individualism: True and False&amp;rdquo;&lt;/strong&gt;&lt;/a&gt;. He scorns the
individualism which he labels as &amp;ldquo;Cartesian rationalism,&amp;rdquo; which supposes that
a society must promote the search for those pinnacles of human reason who can
serve as &amp;ldquo;wise legislators&amp;rdquo; to lead the rest of us. Hayek puts forth his own
idea of individualism, a measured and exceedingly humble recognition of any
individual&amp;rsquo;s fallibility and the consequent need for free interaction and
group consensus in a political system.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://econfaculty.gmu.edu/wew/articles/13/ThePopeAndCapitalism.htm&quot;&gt;Walter E. Williams defends capitalism&lt;/a&gt; in the face of Pope Francis&amp;rsquo; recent
harsh critique.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Kneser-Ney smoothing explained</title>
      <link>http://foldl.me/2014/kneser-ney-smoothing/</link>
      <pubDate>Sat, 18 Jan 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/kneser-ney-smoothing/</guid>
<description>&lt;p&gt;&lt;strong&gt;Language models&lt;/strong&gt; are an essential element of natural language processing,
central to tasks ranging from spellchecking to machine translation. Given an
arbitrary piece of text, a language model determines whether that text belongs
to a given language.&lt;/p&gt;

&lt;p&gt;We can give a concrete example with a &lt;strong&gt;probabilistic language model&lt;/strong&gt;, a
specific construction which uses probabilities to estimate how likely any given
string belongs to a language. Consider a probabilistic English language model
\( P_E \). We would expect the probability&lt;/p&gt;

&lt;p&gt;\[P_E(\text{I went to the store})\]&lt;/p&gt;

&lt;p&gt;to be quite high, since we can confirm this is valid English. On the other hand,
we expect the probabilities&lt;/p&gt;

&lt;p&gt;\[P_E(\text{store went to I the}), P_E(\text{Ich habe eine Katz})\]&lt;/p&gt;

&lt;p&gt;to be very low, since these fragments do not constitute proper English text.&lt;/p&gt;
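
&lt;p&gt;To make this concrete, here is a minimal (and deliberately naive)
maximum-likelihood bigram model in Python, trained on a toy corpus of my own
invention. Without smoothing, it assigns the scrambled sentence probability
zero &amp;mdash; which is exactly the problem smoothing methods address:&lt;/p&gt;

```python
from collections import Counter

# Toy training corpus (whitespace-tokenized for simplicity).
corpus = "i went to the store . i went to the park .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(sentence):
    """Maximum-likelihood probability of a sentence under the bigram model."""
    words = sentence.split()
    p = unigrams[words[0]] / sum(unigrams.values())
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(p_bigram("i went to the store"))  # nonzero: all bigrams were observed
print(p_bigram("store went to i the"))  # zero: unseen bigrams kill the product
```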

&lt;p&gt;I don&amp;rsquo;t aim to cover the entirety of language models at the moment &amp;mdash; that
would be an ambitious task for a single blog post. If you haven&amp;rsquo;t encountered
language models or &lt;em&gt;n&lt;/em&gt;-grams before, I recommend the following resources:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Language_model&quot;&gt;&amp;ldquo;Language model&amp;rdquo;&lt;/a&gt; on Wikipedia&lt;/li&gt;
  &lt;li&gt;Chapter 4 of Jurafsky and Martin&amp;rsquo;s &lt;a href=&quot;http://www.amazon.com/gp/product/0131873210/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0131873210&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;em&gt;Speech and Language Processing&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Chapter 7 of &lt;a href=&quot;http://www.amazon.com/gp/product/0521874157/ref=as_li_tf_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0521874157&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;em&gt;Statistical Machine Translation&lt;/em&gt;&lt;/a&gt; (see &lt;a href=&quot;http://www.statmt.org/book/slides/07-language-models.pdf&quot;&gt;summary slides&lt;/a&gt; online)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I&amp;rsquo;d like to jump ahead to a trickier subject within language modeling known as
&lt;strong&gt;Kneser-Ney smoothing&lt;/strong&gt;. This smoothing method is most commonly applied in an
&lt;em&gt;interpolated&lt;/em&gt; form,&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; and this is the form that I&amp;rsquo;ll present today.&lt;/p&gt;

&lt;p&gt;Kneser-Ney evolved from &lt;strong&gt;absolute-discounting interpolation&lt;/strong&gt;, which makes use
of both higher-order (i.e., higher-&lt;em&gt;n&lt;/em&gt;) and lower-order language models,
reallocating some probability mass from 4-grams or 3-grams to simpler unigram
models. The formula for absolute-discounting smoothing as applied to a bigram
language model is presented below:&lt;/p&gt;

&lt;p&gt;\[P_{abs}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w&apos;} c(w_{i-1} w&apos;)} + \alpha\; p_{abs}(w_i)\]&lt;/p&gt;

&lt;p&gt;Here \(\delta\) refers to a fixed &lt;strong&gt;discount&lt;/strong&gt; value, and \(\alpha\) is a
normalizing constant. The details of this smoothing are covered in
&lt;a href=&quot;http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf&quot;&gt;Chen and Goodman (1999)&lt;/a&gt;.&lt;/p&gt;
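
&lt;p&gt;As a sketch of the formula above, the following Python snippet implements
absolute-discounting interpolation for a bigram model (the toy corpus and the
fixed discount \(\delta = 0.75\) are illustrative choices on my part):&lt;/p&gt;

```python
from collections import Counter

corpus = "i went to the store . i went to the park .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
delta = 0.75  # fixed discount
total = sum(unigrams.values())

def p_abs(cur, prev):
    # Discounted bigram term: max(c(prev cur) - delta, 0) / c(prev)
    discounted = max(bigrams[(prev, cur)] - delta, 0) / unigrams[prev]
    # alpha redistributes the discounted mass over the unigram model
    n_continuations = len([b for b in bigrams if b[0] == prev])
    alpha = (delta / unigrams[prev]) * n_continuations
    return discounted + alpha * unigrams[cur] / total
```

&lt;p&gt;With \(\alpha\) chosen this way, \(P_{abs}(\cdot \mid w_{i-1})\) sums to
one over the vocabulary.&lt;/p&gt;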

&lt;p&gt;The essence of Kneser-Ney is in the clever observation that we can take
advantage of this interpolation as a sort of backoff model. When the first term
(in this case, the discounted relative bigram count) is near zero, the second
term (the lower-order model) carries more weight. Conversely, when the
higher-order model matches strongly, the second lower-order term has little
weight.&lt;/p&gt;

&lt;p&gt;The Kneser-Ney design retains the first term of absolute discounting
interpolation, but rewrites the second term to take advantage of this
relationship. Whereas absolute discounting interpolation in a bigram model would
simply default to a unigram model in the second term, Kneser-Ney depends upon
the idea of a &lt;em&gt;continuation probability&lt;/em&gt; associated with each unigram.&lt;/p&gt;

&lt;p&gt;This probability for a given token \(w_i\) is proportional to the &lt;strong&gt;number of
distinct bigrams which it completes&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;\[P_{\text{continuation}}(w_i) \propto \: \left&amp;#124; \{ w_{i-1} : c(w_{i-1}, w_i) &gt; 0 \} \right&amp;#124;\]&lt;/p&gt;

&lt;p&gt;This quantity is normalized by dividing by the total number of bigram types,
i.e., the number of distinct pairs \((w_{j-1}, w_j)\) observed in the training
data:&lt;/p&gt;

&lt;p&gt;\[P_{\text{continuation}}(w_i) = \dfrac{\left&amp;#124; \{ w_{i-1} : c(w_{i-1}, w_i) &gt; 0 \} \right&amp;#124;}{\left&amp;#124; \{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) &gt; 0\} \right&amp;#124;}\]&lt;/p&gt;

&lt;p&gt;The common example used to demonstrate the efficacy of Kneser-Ney is the phrase
&lt;em&gt;San Francisco&lt;/em&gt;. Suppose this phrase is abundant in a given training corpus.
Then the unigram probability of &lt;em&gt;Francisco&lt;/em&gt; will also be high. If we unwisely
use something like absolute discounting interpolation in a context where our
bigram model is weak, the unigram model portion may take over and lead to some
strange results.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=wtB00EczoCM&quot;&gt;Dan Jurafsky&lt;/a&gt; gives the following example context:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I can&amp;rsquo;t see without my reading &lt;strong&gt;_____&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A fluent English speaker reading this sentence knows that the word &lt;em&gt;glasses&lt;/em&gt;
should fill in the blank. But since &lt;em&gt;San Francisco&lt;/em&gt; is a common term,
absolute-discounting interpolation might declare that &lt;em&gt;Francisco&lt;/em&gt; is a better
fit: \(P_{abs}(\text{Francisco}) &gt; P_{abs}(\text{glasses})\).&lt;/p&gt;

&lt;p&gt;Kneser-Ney fixes this problem by asking a slightly harder question of our
lower-order model. Whereas the unigram model simply provides how likely a word
\(w_i\) is to appear, Kneser-Ney&amp;rsquo;s second term determines how likely a word
\(w_i\) is to appear in an unfamiliar bigram context.&lt;/p&gt;

&lt;p&gt;Kneser-Ney in whole follows:&lt;/p&gt;

\[P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w&apos;} c(w_{i-1} w&apos;)} + \lambda \dfrac{\left&amp;#124; \{ w_{i-1} : c(w_{i-1}, w_i) &gt; 0 \} \right&amp;#124;}{\left&amp;#124; \{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) &gt; 0\} \right&amp;#124;}\]

&lt;p&gt;where \(\lambda\) is a normalizing constant:&lt;/p&gt;

\[\lambda(w_{i-1}) = \dfrac{\delta}{c(w_{i-1})} \left&amp;#124; \{w&apos; : c(w_{i-1}, w&apos;) &gt; 0\} \right&amp;#124;.\]

&lt;p&gt;Note that the denominator of the first term can be simplified to a unigram count. Here is the final interpolated Kneser-Ney smoothed bigram model, in all its glory:&lt;/p&gt;

\[P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{c(w_{i-1})} + \lambda \dfrac{\left&amp;#124; \{ w_{i-1} : c(w_{i-1}, w_i) &gt; 0 \} \right&amp;#124;}{\left&amp;#124; \{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) &gt; 0\} \right&amp;#124;}\]
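
&lt;p&gt;The final formula transcribes almost directly into code. Here is a minimal
Python sketch on a toy corpus (the corpus and the fixed discount
\(\delta = 0.75\) are my own illustrative choices):&lt;/p&gt;

```python
from collections import Counter

corpus = "i went to the store . i went to the park .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
delta = 0.75
n_bigram_types = len(bigrams)

def p_continuation(cur):
    # Number of distinct words preceding cur, over the number of bigram types.
    return len([b for b in bigrams if b[1] == cur]) / n_bigram_types

def p_kn(cur, prev):
    # First term: discounted bigram estimate, denominator simplified to c(prev).
    discounted = max(bigrams[(prev, cur)] - delta, 0) / unigrams[prev]
    # Lambda: normalizing constant spreading the discounted mass.
    lam = (delta / unigrams[prev]) * len([b for b in bigrams if b[0] == prev])
    return discounted + lam * p_continuation(cur)
```

&lt;p&gt;The second term now rewards words that appear after many different
histories, which is exactly the fix for the &lt;em&gt;Francisco&lt;/em&gt; problem.&lt;/p&gt;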

&lt;h2 id=&quot;further-reading&quot;&gt;Further reading&lt;/h2&gt;

&lt;p&gt;If you enjoyed this post, here is some further reading on Kneser-Ney and other
smoothing methods:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Bill MacCartney&amp;rsquo;s &lt;a href=&quot;http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf&quot;&gt;smoothing tutorial&lt;/a&gt; (very accessible)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf&quot;&gt;Chen and Goodman (1999)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Section 4.9.1 in Jurafsky and Martin&amp;rsquo;s &lt;a href=&quot;http://www.amazon.com/gp/product/0131873210/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0131873210&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;em&gt;Speech and Language Processing&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&gt;&lt;/script&gt;


&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For the canonical definition of interpolated Kneser-Ney smoothing, see S. F. Chen and J. Goodman, &lt;a href=&quot;http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf&quot;&gt;&amp;ldquo;An empirical study of smoothing techniques for language modeling,&amp;rdquo;&lt;/a&gt; Computer Speech and Language, vol. 13, no. 4, pp. 359&amp;ndash;394, 1999.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2014/sunday-links-6/</link>
      <pubDate>Sun, 05 Jan 2014 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2014/sunday-links-6/</guid>
      <description>&lt;p&gt;Happy New Year! I&amp;rsquo;m really excited about what 2014 has in store, and interested
to see how a consistent blogging practice might make things better.&lt;/p&gt;

&lt;p&gt;This concludes the final week of Stanford&amp;rsquo;s winter break. I&amp;rsquo;ve been working hard
to make the most of my remaining relatively-free time.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; I made an effort to
read more seriously this week (mostly on more topics in economics), alongside
beginning work on several projects that are important to me, including
&lt;a href=&quot;/2013/german-2014&quot;&gt;learning German&lt;/a&gt;, &lt;a href=&quot;/2013/biphasic-sleep&quot;&gt;experimenting with biphasic sleep&lt;/a&gt;, researching
possible summer work opportunities, and writing a book (post forthcoming!).&lt;/p&gt;

&lt;p&gt;There will likely be an enormous time crunch in these first few weeks as I
re-adjust to the school environment and &amp;ldquo;shop&amp;rdquo; for classes. Expect it to be
quiet around here for a bit.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.amazon.com/gp/product/014311526X/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=014311526X&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;strong&gt;&lt;em&gt;Nudge: Improving Decisions About Health, Wealth, and Happiness&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; is a
friendly introduction to &amp;ldquo;libertarian paternalism,&amp;rdquo; the strategy of
constructing choice architectures in a way that counters the harmful effects
of acknowledged cognitive biases and allows individuals to better pursue their
own aims. The first several chapters are absolutely worth a read.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.amazon.com/gp/product/B00C8N4FNK/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=B00C8N4FNK&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;strong&gt;&lt;em&gt;The Motivation Hacker&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; is a fantastic condensed presentation of the
best points of &lt;a href=&quot;http://lesswrong.com&quot;&gt;LessWrong&lt;/a&gt;-style instrumental rationality: it is packed
with useful &amp;ldquo;hacks&amp;rdquo; which can be used to ramp up motivation and productivity.
This is not a self-help book but rather a descriptive work: an image of the
most promising techniques applied successfully in a single person&amp;rsquo;s life. A
great quick read without the condescension or single-minded focus on one
method that usually comes with this self-improvement genre.&lt;/li&gt;
  &lt;li&gt;Russ Roberts hosts an
&lt;a href=&quot;http://www.econtalk.org/archives/2013/12/richard_fisher.html&quot;&gt;outstanding EconTalk interview with Dallas Fed president Richard Fisher&lt;/a&gt;.
Fisher clearly and plainly states his opinion on the Fed&amp;rsquo;s support of &amp;ldquo;too big
to fail&amp;rdquo; banks and exhibits without restraint his strong disapproval of
current monetary policy. I&amp;rsquo;ve listened to this three times already and will be
taking notes on it as I listen again on the plane ride back to Stanford. Pure
gold. (See also &lt;a href=&quot;http://mathbabe.org/2013/12/31/dallas-feds-richard-fisher-talks-tbtf-on-econtalk-ows/&quot;&gt;Mathbabe&amp;rsquo;s summary and response&lt;/a&gt;.)&lt;/li&gt;
  &lt;li&gt;Alex Gaynor gives &lt;a href=&quot;http://alexgaynor.net/2013/dec/30/about-python-3/&quot;&gt;a realist&amp;rsquo;s view&lt;/a&gt; of the state of Python 3, pointing out
that only 2% of PyPI package downloads are for version-3 applications. He
suggests that Python 3 features be backported to a 2.8 release.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.tor.com/blogs/2013/12/bad-for-you-techno-panic-timeline&quot;&gt;&lt;strong&gt;&amp;ldquo;These New-Fangled Books Will Doom Us All!&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; is a fun infographic
that exposes thinkers since the 15th century decrying the horrible social
effects of new technologies. Critics of technology have lamented the impending
destruction of society due to new media since the printing press, but we still
seem to be doing okay.&lt;/li&gt;
  &lt;li&gt;&amp;ldquo;bunnie studios&amp;rdquo; &lt;a href=&quot;http://www.bunniestudios.com/blog/?p=3554&quot;&gt;pries open some SD cards&lt;/a&gt; and reveals some surprisingly
insecure designs.&lt;/li&gt;
  &lt;li&gt;Ben Bernanke gives what may be &lt;a href=&quot;http://www.federalreserve.gov/newsevents/speech/bernanke20140103a.htm&quot;&gt;his final speech&lt;/a&gt; as Fed chairman.&lt;/li&gt;
  &lt;li&gt;Eric Helgeson &lt;a href=&quot;http://erichelgeson.github.io/blog/2013/12/31/i-fought-my-isps-bad-behavior-and-won/&quot;&gt;discovers&lt;/a&gt; that his ISP injects affiliate links into web
pages.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I returned to &lt;a href=&quot;http://stremor.com&quot;&gt;Stremor&lt;/a&gt; for these three weeks, so work and commuting still took up a significant portion of each day.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2013/sunday-links-5/</link>
      <pubDate>Sun, 29 Dec 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/sunday-links-5/</guid>
      <description>&lt;ul&gt;
  &lt;li&gt;I have just discovered (via &lt;a href=&quot;http://benkuhn.net&quot;&gt;Ben Kuhn&lt;/a&gt;) a wonderful blog called
&lt;a href=&quot;http://ribbonfarm.com&quot;&gt;Ribbonfarm&lt;/a&gt;. While I&amp;rsquo;m still struggling to understand what the site is
really about,&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; I feel obligated to share an outstanding summary of the book
&lt;a href=&quot;http://www.amazon.com/gp/product/0300078153/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0300078153&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;em&gt;Seeing Like a State&lt;/em&gt;&lt;/a&gt; in the post
&lt;a href=&quot;http://www.ribbonfarm.com/2010/07/26/a-big-little-idea-called-legibility/&quot;&gt;&lt;strong&gt;&amp;ldquo;A Big Little Idea Called Legibility&amp;rdquo;&lt;/strong&gt;&lt;/a&gt;. Ribbonfarm&amp;rsquo;s author presents
the &amp;ldquo;authoritarian high-modernist recipe for failure,&amp;rdquo; an account of the process
by which leaders throughout history have repeatedly attempted utopian reform
and failed miserably. &lt;a href=&quot;http://www.amazon.com/gp/product/0300078153/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0300078153&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;em&gt;Seeing Like a State&lt;/em&gt;&lt;/a&gt; went
onto my reading list immediately after I finished this article.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.amazon.com/gp/product/B00CCGF81Q/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=B00CCGF81Q&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;strong&gt;&lt;em&gt;The Three Languages of Politics&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; presents a &amp;ldquo;three-axis model&amp;rdquo; which
is claimed to constrain modern political discourse, preventing the various
political &amp;ldquo;tribes&amp;rdquo; (in this model, progressives, conservatives, and
libertarians) from having real constructive debates. Actively detach yourself
from the axis on which your &lt;a href=&quot;http://www.amazon.com/gp/product/0374533555/ref=as_li_tf_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0374533555&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;System 1&lt;/a&gt; lounges, he suggests, and you&amp;rsquo;ll be
able to better understand the arguments of those who oppose you (and avoid
engaging in the self-reinforcing rhetoric of the tribe you might associate
with for any given issue).&lt;/li&gt;
  &lt;li&gt;Economist Bryan Caplan suggests we give those pushing a certain policy or
perspective an &lt;a href=&quot;http://econlog.econlib.org/archives/2011/06/the_ideological.html&quot;&gt;&lt;strong&gt;&amp;ldquo;ideological Turing test&amp;rdquo;&lt;/strong&gt;&lt;/a&gt;. Here&amp;rsquo;s the test: Take a
person of ideology A and place him in a room of people who strongly support
ideology B. If the person can&amp;rsquo;t convincingly argue for ideology B and blend
in, the person fails the test. Those who fail this kind of test make it
evident that they chose simply to argue for a side rather than examine all
possible lines of reasoning.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.ft.com/cms/s/2/8537d776-664f-11e3-aa10-00144feabdc0.html&quot;&gt;&lt;strong&gt;&amp;ldquo;Lunch with the FT: Peter Thiel&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; gives a brief peek inside Thiel&amp;rsquo;s
mind.&lt;/li&gt;
  &lt;li&gt;The Economist gives a really great overview of the Sapir-Whorf debate in
&lt;a href=&quot;http://www.economist.com/blogs/prospero/2013/11/multilingualism&quot;&gt;&lt;strong&gt;&amp;ldquo;Do different languages confer different personalities?&amp;rdquo;&lt;/strong&gt;&lt;/a&gt;. I&amp;rsquo;ll
definitely refer people to this article when the Whorfian question comes up in
future discussions.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The site&amp;rsquo;s tagline, &amp;ldquo;experiments in refactored perception,&amp;rdquo; doesn&amp;rsquo;t help me too much.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    
    
    <item>
      <title>The wacky economics of gift-giving</title>
      <link>http://foldl.me/2013/the-wacky-economics-of-gift-giving/</link>
      <pubDate>Tue, 24 Dec 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/the-wacky-economics-of-gift-giving/</guid>
      <description>&lt;p&gt;Christmas, birthdays, graduations, housewarming parties &amp;mdash; all wonderful times
to celebrate and reunite with family and friends. But for a certain class of
unlucky revelers, beneath that celebration lies that haunting specter of &lt;em&gt;the
gift.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m terrible at finding good gifts for people. Maybe I was one of those born
without the proper genes. Malls and department stores are scary places for
people of my type at this time of year, where everyone but you seems to be
running from store to store, picking out the perfect products without breaking a
sweat. For people like me, these celebrations always come with a side of dread
as I worry about what gifts to buy for friends and family.&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve often asked myself whether it would be appropriate to ditch the standard
plan and simply wrap some cash in an envelope along with a nice handwritten
letter. It makes perfect sense, I think: given that I lack any sort of skill in
selecting goods that other people might enjoy, it would be better to simply let
the giftee decide what he or she wants.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; With cash the recipient can choose
to buy any of the potential gifts I was considering, or &lt;em&gt;anything else&lt;/em&gt; that he
might prefer. Then everyone is better off: I don&amp;rsquo;t sink hours into fruitless
gift searches and end up buying the wrong thing, and the person receiving the
gift instead gets to buy exactly what he has always wanted.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; So cash is okay,
right? The answer, unfortunately for me, is almost universally no.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h3 id=&quot;why-dont-we-give-cash-gifts&quot;&gt;Why don&amp;rsquo;t we give cash gifts?&lt;/h3&gt;

&lt;p&gt;I began to understand why this is so just a few weeks ago as I listened to
&lt;a href=&quot;http://www.hoover.org/fellows/10516&quot;&gt;Russ Roberts&lt;/a&gt; and &lt;a href=&quot;http://www.michaelmunger.com/&quot;&gt;Michael Munger&lt;/a&gt; discuss the economics of gifts in an
&lt;a href=&quot;http://www.econtalk.org/archives/2006/04/ticket_scalping.html&quot;&gt;EconTalk episode&lt;/a&gt;. The two discuss the common scenario of a dinner party,
where guests are often expected to arrive with a small gift in hand. Most guests
would arrive with a bottle of wine, chocolates, or maybe some flowers. Wouldn&amp;rsquo;t
the most &amp;ldquo;thoughtful&amp;rdquo; guest be the one that arrives with a $20 bill, though &amp;mdash; the
guest that lets the host decide what would be best for himself? Obviously not,
as anyone with a modicum of training in manners can confirm. But why is this the
case?&lt;/p&gt;

&lt;p&gt;Munger uses Aristotle&amp;rsquo;s theory of value, which decomposes the concept of &amp;ldquo;value&amp;rdquo;
into a subjective and idiosyncratic &lt;em&gt;value in use&lt;/em&gt; and a more universal &lt;em&gt;value
in exchange&lt;/em&gt;, to justify why we are so repulsed by the $20 gift:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;There&amp;rsquo;s an ancient distinction, I think first and most importantly made by
Aristotle, between value in use and value in exchange. Something that you make
as an artisan, or something that you make for the specific purpose of using
it, or having someone else use it, is just better. He claimed it was morally
better. &amp;hellip; What he&amp;rsquo;s comparing that to is value in exchange. And he agreed
that things that we make to exchange have value, but it&amp;rsquo;s a morally much lower
kind of value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He implies that when we talk about whether a gift is &amp;ldquo;good&amp;rdquo; or &amp;ldquo;meaningful,&amp;rdquo; we
really are evaluating its value in use. While a $20 bill has a significant value
in exchange, it has little if any value in use. On the other hand, a bottle of
wine brought to a dinner party holds a certain value in use &amp;mdash; it can be put
directly into use at the party &amp;mdash; as well as some value in exchange.&lt;/p&gt;

&lt;h3 id=&quot;what-differentiates-gifts-of-the-same-value&quot;&gt;What differentiates gifts of the same value?&lt;/h3&gt;

&lt;p&gt;All right, so cash doesn&amp;rsquo;t make the cut. But outside of the restricted context
of the dinner party, picking the right gift for a person is just so difficult.
It&amp;rsquo;s possible to establish a fixed value in use and still find two items which
lead to radically different reactions from the recipient. I can&amp;rsquo;t help but use
Munger&amp;rsquo;s example:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Suppose that for your [Russ Roberts] wife&amp;rsquo;s birthday you bought her a new
vacuum cleaner. Let&amp;rsquo;s suppose it&amp;rsquo;s a really, really nice new vacuum cleaner.
It&amp;rsquo;s still a cold night at the Roberts house.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even the most state-of-the-art vacuum cleaner would still be an offensive gift
in this scenario. It seems in this case that the gift actually has &lt;em&gt;too much&lt;/em&gt;
direct utility. We want gifts, rather, that are something of a surprise:
something we didn&amp;rsquo;t know we wanted.&lt;/p&gt;

&lt;p&gt;Hold on a second &amp;mdash; we&amp;rsquo;re in a society where people try to &amp;ldquo;surprise&amp;rdquo; a
recipient by subverting his idea of what would maximize his own utility? And
the recipient wants (or expects) the surprise? Something is wrong here.&lt;/p&gt;

&lt;p&gt;Munger extends this theory, holding that the best gift is more
than just a signal: it is a &lt;a href=&quot;http://mungowitzend.blogspot.com/2008/07/is-costly-signal-such-obscure-concept.html&quot;&gt;&lt;em&gt;costly&lt;/em&gt; signal&lt;/a&gt;. In spending more money than is
likely necessary on a gift which we expect to surprise its recipient, we signal
multiple things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;I am willing to spend my free time searching for a &amp;ldquo;good&amp;rdquo; gift.&lt;/li&gt;
  &lt;li&gt;I am willing to spend more money than the recipient might consider reasonable
on the gift that I find.&lt;/li&gt;
  &lt;li&gt;I know the recipient well enough that I am confident this gift will be a good
surprise.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This act of costly signaling reinforces our altruistic image in the recipient&amp;rsquo;s
eyes, and also distinguishes the gift from what could otherwise be interpreted
as a compensation or bribe &amp;mdash; an idea that no one would want to support in
giving a gift to another.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s not clear that the task of gift-buying will be any less painful now that
the economic principles behind it have been demystified. But at least you can
know exactly what you&amp;rsquo;re talking about as you trudge through shopping malls and
grumble to yourself: these costly signals just cost too much!&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The incentive for gift-giving as posed here may have its roots in the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Rotten_kid_theorem&quot;&gt;rotten kid theorem&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;David Friedman suggests &lt;a href=&quot;http://daviddfriedman.blogspot.com/2006/12/why-do-we-give-gifts.html&quot;&gt;different signals&lt;/a&gt; that he claims make gift-giving
special.&lt;/li&gt;
  &lt;li&gt;Read more about &lt;a href=&quot;http://www.psych-it.com.au/Psychlopedia/article.asp?id=375&quot;&gt;costly signaling&lt;/a&gt;.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I&amp;rsquo;m operating under the assumption in this post that the primary purpose of giving a gift is to signal &lt;a href=&quot;http://en.wikipedia.org/wiki/Altruism&quot;&gt;altruism&lt;/a&gt; toward the recipient of the gift. In other words: we give gifts to let people know that we care about them.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And perhaps most importantly, I have signaled by giving this gift that I care about the giftee. So much so, in fact, that I&amp;rsquo;ve prized his/her interest above all in choosing to give a cash gift.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For the record, I have gone through with this strategy several times&amp;hellip; albeit with widely varying degrees of success.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;As this topic is far out of my field I wasn&amp;rsquo;t able to quickly find a canonical article or book on this subject. If anyone knows a better source, please let me know!&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2013/sunday-links-5/</link>
      <pubDate>Sun, 22 Dec 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/sunday-links-5/</guid>
      <description>&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.econlib.org/library/Essays/rdPncl1.html&quot;&gt;&lt;strong&gt;&amp;ldquo;I, Pencil&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; trumpets the self-organizing wonders of the free market
in parable form. Not one single person on this earth, it claims, knows how to
produce a pencil from scratch. It takes millions of people working together to
accomplish such a feat. Moreover, all these people working together toil not
in the interest of global pencil production rates but rather in the interest
of feeding their families. The output is a mere side effect!&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
  &lt;li&gt;Scott Alexander&amp;rsquo;s
&lt;a href=&quot;http://slatestarcodex.com/2013/12/08/a-something-sort-of-like-left-libertarianism-ist-manifesto/&quot;&gt;&lt;strong&gt;&amp;ldquo;A Something Sort of Like Left-Libertarianism-ist Manifesto&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; details
how free-market solutions can solve real social problems. I&amp;rsquo;ll have to read
more about this &amp;ldquo;bleeding-heart libertarianism&amp;rdquo; that he discusses.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://krugman.blogs.nytimes.com/2013/12/17/the-facebooking-of-economics/&quot;&gt;&lt;strong&gt;&amp;ldquo;The Facebooking of Economics&amp;rdquo;&lt;/strong&gt;&lt;/a&gt;&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; gives a view into how modern
economics discourse operates. The bleeding edge debates don&amp;rsquo;t always happen in
peer-reviewed journals anymore &amp;mdash; they&amp;rsquo;re also present on the blogs of
economists across the web!&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.jstor.org/stable/4165235&quot;&gt;&lt;strong&gt;&amp;ldquo;On the Folly of Rewarding A, While Hoping for B&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; gives some
interesting examples of perverse incentives in public policy and management.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.nytimes.com/2013/12/17/opinion/brooks-the-thought-leader.html?smid=pl-share&quot;&gt;&lt;strong&gt;&amp;ldquo;The Thought Leader&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; and (published a day later!)
&lt;a href=&quot;https://www.youtube.com/watch?v=wxb-zYthAOA&quot;&gt;&lt;strong&gt;&amp;ldquo;We are nothing (and that is beautiful)&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; air troubling concerns about
what our generation has been taught about success (and how our visions
actually play out in the long run). This topic merits substantial discussion,
but I&amp;rsquo;ll need to think more before putting forth any ideas on this medium.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Except for the pencil vendors, of course.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Yes, I just linked to a Liberty Fund site and a Krugman article in the same post.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Imperat aut servit&#58; Managing our knowledge inheritance</title>
      <link>http://foldl.me/2013/imperat-aut-servit/</link>
      <pubDate>Thu, 19 Dec 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/imperat-aut-servit/</guid>
      <description>&lt;p&gt;&lt;em&gt;This was submitted as my final research paper for
 &lt;a href=&quot;https://undergrad.stanford.edu/programs/special-focus-programs/esf&quot;&gt;Education as Self-Fashioning: The Active, Inquiring, Beautiful Life&lt;/a&gt;. See
 my &lt;a href=&quot;https://www.zotero.org/groups/jrgauthiers_public_library/items/collectionKey/2CCDMIRW&quot;&gt;Zotero folder&lt;/a&gt; documenting my research for this essay.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The learning of a Salmasius or a Burman, unless you are its master, will be
your tyrant. &lt;em&gt;&amp;lsquo;Imperat aut servit&amp;rsquo;;&lt;/em&gt; if you can wield it with a strong arm, it
is a great weapon; otherwise, &lt;em&gt;&amp;lsquo;Vis consili expers / Mole ruit suâ.&amp;rsquo;&lt;/em&gt; You will
be overwhelmed, like Tarpeia, by the heavy wealth which you have exacted from
tributary generations.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cardinal John Newman acknowledges a problem familiar to any student in this
quote from his seminal work, &amp;ldquo;The Idea of a University.&amp;rdquo; Every student is well
acquainted with that intimidating specter&amp;mdash;the constant worry of the weight of
his own reading, the sheer pressure imposed by the great minds of &amp;ldquo;tributary
generations.&amp;rdquo; The student is assailed throughout his undergraduate years with
assignments on the landmark publications of each field. The classics professor
encourages the student to read Homer, while Plato and Descartes are urged on him
by his philosophy lecturer. From the department of economics come aggressive
recommendations of Hayek, Keynes and Friedman. It is easy, and indeed common in
this environment, to feel overwhelmed by the heights we are expected to climb in
order to stand on the proverbial &amp;ldquo;shoulders of giants.&amp;rdquo; In our default
shortsighted mode of vision, it may appear simpler and more efficient to merely
speed through our reading tasks, satisfying short-term goals while neglecting
the long-term. It is indeed tempting to sacrifice those lofty aims when
suffocating under the weight of past minds, their words forever channeled at us
through book after book, article after article. To skim just a few pages, or to
read for raw information rather than for meaning, is an impulse often too
attractive to resist.&lt;/p&gt;

&lt;p&gt;Newman&amp;rsquo;s essay urges students to avoid this kind of fervent fact-gathering,
however, and instead remain &amp;ldquo;above [their] knowledge.&amp;rdquo; His most central
recommendations are brief and firm: he suggests that &amp;ldquo;we must generalize, we
must reduce to method, we must have a grasp of principles, and group and shape
our acquisitions by means of them.&amp;rdquo;&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; He refers to this entire process concisely
by employing a metaphor of &lt;em&gt;digestion:&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The enlargement [of mind] consists, not merely in the passive reception into
the mind of a number of ideas hitherto unknown to it &amp;hellip; it is a &lt;em&gt;digestion&lt;/em&gt;
of what we receive, into the substance of our previous state of thought; and
without this no enlargement is said to follow.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This &amp;ldquo;digestion,&amp;rdquo; according to Newman, is a necessary condition for any actual
improvement in intellect&amp;mdash;or, in his own terms, for any &amp;ldquo;enlargement of mind.&amp;rdquo;
The process at a high level is clear enough: a student wishing to best benefit
from his reading must always synthesize the new facts he acquires with his
existing knowledge, aggregating a cohesive whole of mastery rather than a mere
conglomeration of pieces of information. But Newman does not sufficiently detail
concrete methods for engaging in this &amp;ldquo;digestion.&amp;rdquo; He does not provide practical
suggestions for applying these ideas to any learning practice, let alone reading
in particular. It is the purpose of this paper to show that a reading practice
combined with a slow, thoughtful extraction of passages from the texts being
read satisfies the requirements for Newman&amp;rsquo;s process of &amp;ldquo;digestion.&amp;rdquo; The method
of integrated reading and writing we examine is by no means novel: it has
existed since at least the era of the Roman Empire, albeit under various
names.&lt;/p&gt;

&lt;p&gt;It will first be useful, then, to unify the various overlapping terms for the
products of the practice discussed in this essay. The most well-known names
still extant in discourse today are the &lt;em&gt;florilegium&lt;/em&gt;, the &lt;em&gt;hypomnēma&lt;/em&gt;, and the
commonplace book.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; These media, though published and discussed under distinct
names, are actually quite similar in their methods of composition and the
results of their use. This essay will focus on the properties and consequences
of the practice at the intersection of these three traditions, which will be
termed the &amp;ldquo;commonplace book&amp;rdquo; for convenience and consistency. This &amp;ldquo;commonplace
book&amp;rdquo; under discussion refers to a personal book containing collections of
quotations and passages from texts or dialogues of particular value or interest
to its owner. The book might also contain marginalia referring to the passages
quoted, or a prologue stating the intention of the author or summarizing the
texts included.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; Commonplace authors composed these books during their
reading or immediately after completing a text.&lt;/p&gt;

&lt;p&gt;Historical support for this practice of reading and writing intertwined appears
as far back as ancient Rome. Seneca, a major figure in the Stoic school of
philosophy, stresses in his works the importance of &amp;ldquo;continuous writing.&amp;rdquo; He
employs several oft-used analogies, first portraying scholars as bees traveling
from flower to flower and gathering the nectar of knowledge at each:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We also, I say, ought to copy these bees, and sift whatever we have gathered
from a varied course of reading &amp;hellip; then, by applying the supervising care
with which our nature has endowed us &amp;hellip; we should blend those several flavors
into one delicious compound.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Seneca portrays the collection of ideas as an extremely discerning process, far
from the simple mechanical procedure of facsimile that the description of
commonplacing might first suggest. We must imitate the bees, using our
&amp;ldquo;supervising care&amp;rdquo; to select the nectar most especially sweet to us&amp;mdash;those
passages which send the clearest and most striking messages. Seneca affirms that
writing is the most reliable method for gathering these notes. The various ideas
which readers collect from books, he says, must be &amp;ldquo;reduced to concrete form by
the pen.&amp;rdquo;&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; This conviction appears in equal force among medieval writings
under the name &lt;em&gt;florilegium&lt;/em&gt; (literally, the &amp;ldquo;collection of flowers.&amp;rdquo;)&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; John
of Wales, a 13th-century writer, quotes Seneca directly as he explains his own
practice of compiling quotes and forming &lt;em&gt;florilegia&lt;/em&gt;. He works to collect
&amp;ldquo;examples worthy to be imitated, which are all so many flowers.&amp;rdquo; John stresses
the process of selection in his own explanation, emphasizing that readers must
make selections carefully, avoiding &amp;ldquo;poisonous errors&amp;rdquo; that could otherwise
enter the collection of quotes.&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt; Several centuries later, Erasmus recommends
to his readers a specific method of processing and organizing what they take in
from books. After a student has prepared separate categories for the storage of
ideas, he should proceed to read with &amp;ldquo;a view to extrapolating.&amp;rdquo;&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; Indeed,
for these authors throughout history, reading without simultaneously writing was likely not a serious activity at all. As they struggled under the weight
of their reading, the commonplace book served as a natural recourse for
distilling and simplifying the massive input which these writers had to
manage.&lt;sup id=&quot;fnref:11&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt; Reading in this style was an eminently practical
engagement&amp;mdash;students combed through texts with the explicit purpose of
extracting useful morsels of information.&lt;/p&gt;

&lt;p&gt;The result of such extraction was a collection surprisingly personal and unique. This simple act of rewriting and annotating quotations by hand produced a work which evidenced the individuality of its owner. In selecting and arranging
choice extracts, a reader would construct a personalized image of his
studies&amp;mdash;a concrete projection of how the texts had affected his own ideas and
beliefs. Kevin Sharpe gives a view of how this practice was paralleled in early
modern England:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[T]hough what the compiler copied was extracted from a common storehouse of
wisdom, the manner in which extracts were copied, arranged, juxtaposed,
cross-referenced or indexed was &lt;em&gt;personal and individual&lt;/em&gt; &amp;hellip; The compiler
essentially rewrote, fashioned a new text, which was anything but common,
indeed was unique.&lt;sup id=&quot;fnref:12&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sharpe suggests that the way in which any given commonplace author chose to
unify heterogeneous entries from distinct sources was necessarily personal and
idiosyncratic. As Michel Foucault names the practice in his work &amp;ldquo;Self-Writing,&amp;rdquo;
this &amp;ldquo;subjectivation&amp;rdquo; of new knowledge&amp;mdash;the deliberate process by which a
learner would attempt through transcription and further writing to blend the
lessons from a text into his existing worldview&amp;mdash;was absolutely crucial for
proper understanding and assimilation.&lt;sup id=&quot;fnref:13&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;13&lt;/a&gt;&lt;/sup&gt; The same concept of subjectivation appears in Newman&amp;rsquo;s work as a necessary condition for proper &amp;ldquo;digestion&amp;rdquo; of information. Newman claims that such a fusion of ideas makes &amp;ldquo;the
objects of our knowledge subjectively our own.&amp;rdquo;&lt;sup id=&quot;fnref:14&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:14&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;14&lt;/a&gt;&lt;/sup&gt; The process of commonplace
book composition would yield something wholly unique, composed of equal parts
original text and individual impression. The result was so personal, in fact,
that Francis Bacon proposed it would be of little use to other readers: &amp;ldquo;I think
first in general that one man&amp;rsquo;s notes will little profit another, because one
man&amp;rsquo;s conceit doth so much differ from another&amp;rsquo;s; and also because the bare note
itself is nothing so much worth as the suggestion it gives to the reader.&amp;rdquo;&lt;sup id=&quot;fnref:15&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:15&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;15&lt;/a&gt;&lt;/sup&gt;
A man&amp;rsquo;s &amp;ldquo;conceit&amp;rdquo; derived from any text was personal and unique. In the end,
what readers remembered from their texts was what the works meant to them&amp;mdash;the
&amp;ldquo;suggestion&amp;rdquo; they felt&amp;mdash;rather than the exact content of any given paragraph.
By Bacon&amp;rsquo;s claim, then, a commonplace book contained extracts which, as a whole,
were of higher value to their creator than a simple sum of the values of the
book&amp;rsquo;s parts. The artifacts of these Baconian &amp;ldquo;suggestions&amp;rdquo; could exist in the
form of short pieces of commentary, or simply in the extra prominence commonplace writers chose to give certain quotes in their books. In allowing original text and impression to mix, whether consciously or
not, readers produced a unique view of their own exploration and learning
progress. These personal extracts would prove to be of real use in the weeks,
months, or years after a reader finished a text, when the passages could be
revisited.&lt;/p&gt;

&lt;p&gt;Gathering passages and notes from various works allowed readers to greatly
simplify the otherwise burdensome task of memorizing the most important lessons
they derived from their texts. By perusing their commonplace books, they were
able to not only recall the exact content of their past favorite quotes but also
experience again the impression which the text had made upon them. Through our
modern lens, it is evident that these revision activities were crucial to proper
memorization and assimilation of the ideas most important to a reader. There is
much scientific evidence today, in fact, which supports this idea of periodic
review of important reading content. The phenomenon known as the &lt;em&gt;spacing
effect&lt;/em&gt; is particularly relevant to this rereading practice. According to the
spacing effect, learners who sustain consistent review of content over long
periods of time can memorize more effectively than those who commit to large
amounts of memorization work in a single sitting.&lt;sup id=&quot;fnref:16&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:16&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;16&lt;/a&gt;&lt;/sup&gt; The tradition of the commonplace book encouraged exactly this kind of regular spaced review.
Historical support for this revision activity is just as plentiful: nearly every
author who suggests the composition of a commonplace book also stipulates its
constant inspection and revision. The most critical ideas were to be
consistently reread until finally learned by heart. This recall practice, which
slowly brought the most crucial passages of one&amp;rsquo;s reading closer to a
subconscious level, was of obvious utility for students.&lt;sup id=&quot;fnref:17&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:17&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;17&lt;/a&gt;&lt;/sup&gt; Historical sources
show that recall aided not only students but learners of all ages. Aspiring
essayists and professional writers alike benefited from methodically memorizing
the many short bodies of text which they found relevant or especially profound.
A typical citizen could deploy these memorized commonplaces in conversation or
correspondence. Writers could include their own preferred quotations in their
compositions on any topic.&lt;sup id=&quot;fnref:18&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:18&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;18&lt;/a&gt;&lt;/sup&gt; This tradition offered readers of every sort an exceptionally useful means of memorizing the passages they held most dear.&lt;/p&gt;

&lt;p&gt;While recall was the most visible and straightforward end in revisiting these
collections, the commonplace book also served as a tool to instill proper values
in its owner. Every commonplace book had its own distinctive selections,
carefully gathered and uniquely recorded by its author. Foucault
draws a strong contrast between the product of this writing practice and other
more superficial methods of collection, which he claims yield nothing more than
&amp;ldquo;memory cabinets.&amp;rdquo; Commonplace books were vessels, rather, for the most
meaningful and crucial ideas that readers wished to &amp;ldquo;deeply lodge in the
soul.&amp;rdquo;&lt;sup id=&quot;fnref:19&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:19&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;19&lt;/a&gt;&lt;/sup&gt; To return to this book and rediscover this collection of
commonplaces was a wholly solemn and private task. It offered readers an
opportunity to relive those solitary moments of deep insight which they had
experienced with texts in the past. Furthermore, it allowed them to see their
various transcriptions not as isolated morsels of wisdom but as small parts of a
larger system. Each quote and comment rested in between many others, and readers
who returned to their books for a certain passage could not avoid comparing this
entry with those nearby. This is Newman&amp;rsquo;s digestion forced into action: it is
the unification of new ideas with old beliefs. It trained the reader never to
contemplate any given concept, in Newman&amp;rsquo;s words, &amp;ldquo;without recollecting that it
is but a part [of a larger picture of knowledge], or without the associations
which spring from this recollection.&amp;rdquo;&lt;sup id=&quot;fnref:20&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:20&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;20&lt;/a&gt;&lt;/sup&gt; This process of revision and
subsequent synthesis helped the reader to piece together a more uniform self.&lt;/p&gt;

&lt;p&gt;Commonplace books had their highest end, then, in helping their owners to build
their own characters in the most literal sense: the composition of the book
paralleled the construction of the character. Readers strove through their
commonplace books to synthesize the heterogeneous opinions they acquired through
their reading into a single cohesive statement of identity. Of course, this
deliberate integration of different ideas could not be accomplished without
significant modification. Seneca compares the gathering of ideas to the bodily process of digestion, wherein no food can benefit us unless it is also changed
in the process:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is what we see nature doing in our own bodies without any labor on our
part; the food we have eaten, as long as it retains its original quality and
floats in our stomachs as an undiluted mass, is a burden; but &lt;em&gt;it passes into
tissue and blood only when it has been changed from its original form&lt;/em&gt;. So it
is with the food which nourishes our higher nature,&amp;mdash; we should see to it
that &lt;em&gt;whatever we have absorbed should not be allowed to remain unchanged,&lt;/em&gt; or
it will be no part of us. We must digest it; otherwise it will merely enter
the memory and not the reasoning power.&lt;sup id=&quot;fnref:21&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:21&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;21&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Again we return to the abundant metaphors used throughout history in discussion
of this practice. The &amp;ldquo;food which nourishes our higher nature&amp;rdquo; arrives to
readers in the form of well-crafted sentences, aligned in sequence across
sections, chapters, and volumes&amp;mdash;distilled and compressed knowledge, printed
onto lifeless paper. It is our task, suggests Seneca, to revive these ideas,
enliven them: that is, to analyze, transcribe and repeatedly modify them until
they are wholly our own, persisting in their changed state in our private
thoughts. In transcription and subsequent rereading of these copies their owners
could not avoid contemplating exactly how the passage affected them rather than
simply viewing the old ideas in a dry, objective manner.&lt;sup id=&quot;fnref:22&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:22&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;22&lt;/a&gt;&lt;/sup&gt; The process of
revisiting the book was intrinsically reflective and introspective. It was the
catalyst for a &amp;ldquo;personal construction of meaning,&amp;rdquo; says Sharpe&lt;sup id=&quot;fnref:23&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:23&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;23&lt;/a&gt;&lt;/sup&gt;&amp;mdash;a solemn
and meditative activity of recollection and review through which the reader
could &amp;ldquo;[constitute] his own identity.&amp;rdquo;&lt;sup id=&quot;fnref:24&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:24&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;24&lt;/a&gt;&lt;/sup&gt; It is this active review and
constant inquiry into past lessons learned that Newman promotes as a requirement
for &amp;ldquo;digestion.&amp;rdquo; He portrays the process as one of deliberate &amp;ldquo;locomotion&amp;rdquo;: a
&amp;ldquo;gravitat[ion]&amp;rdquo; toward some cohesive personal worldview.&lt;sup id=&quot;fnref:25&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:25&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;25&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Such constant gravitation brought readers ever closer toward a destined balance
in knowledge and identity through reading, writing, and meditation on that
writing. The result, to engage once more in the profusion of metaphor present
among all the writing on this topic, was a &amp;ldquo;choir&amp;rdquo; of past ideas, each memorized
and carefully aligned to fit the theme of the whole:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Let our minds aim at showing the finished product, but conceal all that has
helped to produce it &amp;hellip; take a choir: it consists, as you see, of many voices
and yet all those voices form a unity &amp;hellip; although individual voices do not
emerge, the voices of all are heard and from a number of different sounds
there comes a harmony.&lt;sup id=&quot;fnref:26&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:26&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;26&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The mind as constructed by the commonplace book was purely a synthesis of the
most prized thoughts of the past. After some slight modification, each idea fit
in as a voice in a choir, such that the group as a whole produced a single
harmony. While this &amp;ldquo;harmony&amp;rdquo; is always cited as a beneficial result of
commonplacing, sources across history disagree on the related consequences of
this practice of collection. Mary J. Carruthers observes that the medieval
citizen may have treasured the activity to a fault, seeing himself as precisely
the product of his readings and no more.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;One sometimes gets the impression that a medieval person &amp;hellip; could do nothing
(especially in duress) without rehearsing a whole series of exemplary stories,
the material of their experience built up board by board in memory &amp;hellip; so that
even in moments of stress the counsel of experience will constrain a turbulent
and willful mind.&lt;sup id=&quot;fnref:27&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:27&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;27&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a relatively radical perspective on the effects of commonplacing.
Readers were so extremely dependent on their personal books, Carruthers
suggests, that without those collected quotes we would see unleashed &amp;ldquo;turbulent
and willful mind[s],&amp;rdquo; unbridled and unanchored from their safe havens built up
from the writings of past authors. Some opponents of the practice criticized
this extreme form of the idea. They claimed that those who relied solely on
accepted ideas of the past doomed themselves to weak, impersonal
argumentation.&lt;sup id=&quot;fnref:28&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:28&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;28&lt;/a&gt;&lt;/sup&gt; This contention has merit, but only serves to refute the
particular conception of the practice that Carruthers presents. The portrayal is
indeed unfair to the majority of commonplace book authors, who were not so
absolutely dependent on their books. As we have already learned, commonplace
books were defined by their more primary use in deliberate reflection and
introspection rather than by their purpose as a simple reference. This
particular form of study did not yield some sort of unstable debater completely
dependent on his past notes, but rather, in Seneca&amp;rsquo;s words, a unique
intellectual child of the authors whose quotes he gathered.&lt;sup id=&quot;fnref:29&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:29&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;29&lt;/a&gt;&lt;/sup&gt; A commonplace writer
was the product of many disparate (and perhaps even disagreeing) voices, whose
precepts were each tweaked and re-aligned by the writer to form a holistic and
coherent compilation. The diligent commonplace book owner was not a lifeless
portrait of the inputs,&lt;sup id=&quot;fnref:30&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:30&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;30&lt;/a&gt;&lt;/sup&gt; but rather a living and breathing entity, carrying
treasures of accumulated wisdom and prepared to deploy them in situations both
trivial and novel.&lt;/p&gt;

&lt;p&gt;In such novel situations the benefit of commonplacing becomes even clearer. To
many writers of the past, those readers who worked to establish concrete moral
codes through their commonplace books proved to be the most resilient in the
face of new or resurgent problems. These types of problems require a stable,
robust mind: one which can rationally evaluate the root of an issue and resolve
it using a combination of past knowledge and new constructions. This thinking,
says Foucault, is patently impossible for a reader who does not maintain control
over the many distinct ideas he derives from his texts. &amp;ldquo;Endless reading&amp;rdquo;
without deliberate writing interspersed would lead to the &amp;ldquo;great deficiency&amp;rdquo; of
&lt;em&gt;stultitia&lt;/em&gt;&amp;mdash;a state of &amp;ldquo;mental agitation&amp;rdquo; and &amp;ldquo;distraction&amp;rdquo; in which the
reader is overwhelmed by the massive amounts of input which he must process,
without the assistance of any external tools.&lt;sup id=&quot;fnref:31&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:31&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;31&lt;/a&gt;&lt;/sup&gt; A program of intense reading
which omits these aids would do little for a student but fill his mind with a
hodgepodge of facts and likely contradictory ideas. Newman&amp;rsquo;s essay claims that
readers who simply hoard knowledge &amp;ldquo;they have not thought through&amp;rdquo; are thus
&amp;ldquo;only possessed &lt;em&gt;by&lt;/em&gt; [it], not possessed &lt;em&gt;of&lt;/em&gt; it; nay, in matter of fact they
are often even carried away by it, without any volition of their own.&amp;rdquo;&lt;sup id=&quot;fnref:32&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:32&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;32&lt;/a&gt;&lt;/sup&gt; The
function of the commonplace book from this perspective, then, is to allow
readers to establish an intellectual anchor&amp;mdash;some definite collection of
precepts and past arguments against which new ideas and new problems can be
compared&amp;mdash;so as to not be &amp;ldquo;carried away&amp;rdquo; by the weight of their own reading.&lt;/p&gt;

&lt;p&gt;The product of commonplacing was a cohesive composition far greater than the sum
of its parts, a physical realization of the identity and intellect of its
discerning author. The practice yielded after years a mind rigorously trained
and exceptionally prepared, forever supported by this anchor of precepts
deliberately assimilated and consciously reaffirmed over time. The commonplace
author was a perfect image of the product of Newman&amp;rsquo;s &amp;ldquo;digestion.&amp;rdquo; He used his
book to process and memorize important information from the texts he
encountered, while simultaneously working to construct a unique, robust self.
Over the years the author built an image of himself through his collection, and
benefited from a clear view into his own intellectual past.&lt;/p&gt;

&lt;p&gt;This is not to say that items once penned in a commonplace book were made
eternally true. On the contrary, the book offered detailed images of a reader&amp;rsquo;s
past thoughts and beliefs, many of which he was likely to see invalidated in the
future. These collections merely provided a snapshot of the reader&amp;rsquo;s worldview
of the past, which upon revisiting he could choose to reinforce or reject using
new evidence or personal experience. In returning to past extractions wielding
new evidence or ideas, the author continued that same process of synthesis and
modification begun during the book&amp;rsquo;s creation even after its pages had been
filled. As such a projection of the past, the book made for the reader a nearly
physical companion out of his former self&amp;mdash;a reification of his earlier
thoughts with whom he could converse and debate.&lt;sup id=&quot;fnref:33&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:33&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;33&lt;/a&gt;&lt;/sup&gt; These images of a past
self were invaluable to a commonplace writer in providing a map of the intellectual landscapes he had visited and those he had yet to explore. The book
was a depository, moreover, into which the suggestions triggered by each
individual work could be assembled, mixed, rearranged, and investigated as a
cohesive whole. In this way, the commonplace author found a method for managing
his knowledge inheritance&amp;mdash;for lodging deeply in his soul those precepts most
important to him, and thus learning to master the weight of past generations.
&lt;em&gt;Imperat aut servit:&lt;/em&gt; with the commonplace book at hand, the reader was finally
free to choose.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;John Henry Newman, &lt;em&gt;The Idea of a University: Defined and Illustrated&lt;/em&gt; (Oxford: Clarendon Press, 1976), 125. The Latin phrases translate as &amp;ldquo;master or slave&amp;rdquo; and &amp;ldquo;brute force bereft of wisdom / falls to ruin by its own weight,&amp;rdquo; respectively. Newman&amp;rsquo;s &lt;em&gt;Idea&lt;/em&gt;, the work on which this essay is based, has had a profound&amp;mdash;some might say even revolutionary&amp;mdash;effect on our modern view of education. See Cornwell, John. &lt;em&gt;Newman&amp;rsquo;s Unquiet Grave: The Reluctant Saint&lt;/em&gt;. New York: Continuum, 2010, 128.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;em&gt;Ibid.&lt;/em&gt;, 120.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;em&gt;Ibid.&lt;/em&gt;, 120. Emphasis added.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There are plenty of names for this practice which happen to have fallen out of fashion. A 1964 analysis lists 38 different names which authors of the past used to refer to the same concept of the &lt;em&gt;florilegium&lt;/em&gt;: see Henri-Marie Rochais, &amp;ldquo;Florilèges spirituels,&amp;rdquo; in &lt;em&gt;Dictionnaire de spiritualité ascétique et mystique, doctrine et histoire&lt;/em&gt;., vol. 5 (Paris: Beauchesne, 1964), 438.&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jacqueline Hamesse, &amp;ldquo;Les florilèges philosophiques du XIIIe au XVe siècle,&amp;rdquo; in &lt;em&gt;Les genres littéraires dans les sources théologiques et philosophiques médiévales&lt;/em&gt;, vol. 5, Textes, Études, Congrès 2 (Louvain-la-Neuve: Université catholique de Louvain, 1982), 184.&amp;nbsp;&lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Lucius Annaeus Seneca, &amp;ldquo;On Gathering Ideas,&amp;rdquo; in &lt;em&gt;Ad Lucilium epistulae morales&lt;/em&gt;, trans. Richard M. Gummere (Cambridge, Mass.: Harvard University Press, 1970), 279. This metaphor is often borrowed by later writers. See for example Ambrosius Aurelius Theodosius Macrobius, &lt;em&gt;The Saturnalia&lt;/em&gt;, trans. Percival Vaughan Davies (New York: Columbia University Press, 1969), 1.5.&amp;nbsp;&lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Seneca, &amp;ldquo;On Gathering Ideas,&amp;rdquo; 277.&amp;nbsp;&lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Alternate names included the &lt;em&gt;flores philosophorum&lt;/em&gt; and the &lt;em&gt;flores auctorum&lt;/em&gt;. See Ann Moss, &lt;em&gt;Printed Commonplace-Books and the Structuring of Renaissance Thought&lt;/em&gt; (Oxford: Clarendon Press, 1996), 24.&amp;nbsp;&lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Quoted in &lt;em&gt;ibid.&lt;/em&gt;, 30.&amp;nbsp;&lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;R. R. Bolgar, &lt;em&gt;The Classical Heritage and Its Beneficiaries&lt;/em&gt;. (Cambridge, England: University Press, 1954), 273&amp;ndash;274.&amp;nbsp;&lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Rochais, &amp;ldquo;Florilèges spirituels,&amp;rdquo; 457.&amp;nbsp;&lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Kevin Sharpe, &lt;em&gt;Reading Revolutions: The Politics of Reading in Early Modern England&lt;/em&gt; (New Haven, CT: Yale University Press, 2000), 278. Emphasis added.&amp;nbsp;&lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Michel Foucault, &lt;em&gt;Ethics: Subjectivity and Truth&lt;/em&gt;, vol. 1, The Essential Works of Michel Foucault, 1954&amp;ndash;1984 (New York: New Press, 1997), 210.&amp;nbsp;&lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:14&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Newman, &lt;em&gt;The Idea of a University&lt;/em&gt;, 120.&amp;nbsp;&lt;a href=&quot;#fnref:14&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:15&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Francis Bacon, &lt;em&gt;The Letters and the Life of Francis Bacon&lt;/em&gt; (Longmans, Green and Co., 1890), 25&amp;ndash;26.&amp;nbsp;&lt;a href=&quot;#fnref:15&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:16&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Douglas L. Hintzman, &amp;ldquo;Repetition and Memory,&amp;rdquo; in &lt;em&gt;Psychology of Learning and Motivation&lt;/em&gt;, by Gordon H. Bower, vol. 10 (Academic Press, 1976), 65.&amp;nbsp;&lt;a href=&quot;#fnref:16&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:17&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Freyja Cox Jensen, &lt;em&gt;Reading the Roman Republic in Early Modern England&lt;/em&gt;, Library of the Written Word v. 22 (Boston: Brill, 2012), 37.&amp;nbsp;&lt;a href=&quot;#fnref:17&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:18&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;em&gt;Ibid.&lt;/em&gt;, 91. John of Salisbury, a writer well known for his commonplacing practice, serves as evidence of the utility of commonplacing for writers. See Rochais, &amp;ldquo;Florilèges spirituels,&amp;rdquo; 462.&amp;nbsp;&lt;a href=&quot;#fnref:18&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:19&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Foucault, &lt;em&gt;Ethics: Subjectivity and Truth&lt;/em&gt;, 210.&amp;nbsp;&lt;a href=&quot;#fnref:19&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:20&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Newman, &lt;em&gt;The Idea of a University&lt;/em&gt;, 123.&amp;nbsp;&lt;a href=&quot;#fnref:20&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:21&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Seneca, &amp;ldquo;On Gathering Ideas,&amp;rdquo; 279&amp;ndash;281. Emphasis added.&amp;nbsp;&lt;a href=&quot;#fnref:21&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:22&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See Barbara M. Benedict, &lt;em&gt;Making the Modern Reader: Cultural Mediation in Early Modern Literary Anthologies&lt;/em&gt; (Princeton, NJ: Princeton University Press, 1996), 47; Susan Miller, &lt;em&gt;Assuming the Positions: Cultural Pedagogy and the Politics of Commonplace Writing&lt;/em&gt;, Pittsburgh Series in Composition, Literacy, and Culture (Pittsburgh, PA: University of Pittsburgh Press, 1998), 21, 24.&amp;nbsp;&lt;a href=&quot;#fnref:22&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:23&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Sharpe, &lt;em&gt;Reading Revolutions&lt;/em&gt;, 279.&amp;nbsp;&lt;a href=&quot;#fnref:23&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:24&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Foucault, &lt;em&gt;Ethics: Subjectivity and Truth&lt;/em&gt;, 214.&amp;nbsp;&lt;a href=&quot;#fnref:24&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:25&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Newman, &lt;em&gt;The Idea of a University&lt;/em&gt;, 121.&amp;nbsp;&lt;a href=&quot;#fnref:25&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:26&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Macrobius, &lt;em&gt;The Saturnalia&lt;/em&gt;, 1.9.&amp;nbsp;&lt;a href=&quot;#fnref:26&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:27&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Mary J. Carruthers, &amp;ldquo;Memory and the Ethics of Reading,&amp;rdquo; in &lt;em&gt;The Book of Memory: A Study of Memory in Medieval Culture&lt;/em&gt; (Cambridge, England: Cambridge University Press, 1990), 180.&amp;nbsp;&lt;a href=&quot;#fnref:27&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:28&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Moss, &lt;em&gt;Printed Commonplace-Books and the Structuring of Renaissance Thought&lt;/em&gt;, 21.&amp;nbsp;&lt;a href=&quot;#fnref:28&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:29&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Seneca, &amp;ldquo;On Gathering Ideas,&amp;rdquo; 281.&amp;nbsp;&lt;a href=&quot;#fnref:29&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:30&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;em&gt;Ibid.:&lt;/em&gt; &amp;ldquo;I would have you resemble [the authors from whom you collect ideas] &amp;hellip; not as a picture resembles its original, for a picture is a lifeless thing&amp;rdquo;&amp;nbsp;&lt;a href=&quot;#fnref:30&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:31&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Foucault, &lt;em&gt;Ethics: Subjectivity and Truth&lt;/em&gt;, 211&amp;ndash;212.&amp;nbsp;&lt;a href=&quot;#fnref:31&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:32&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Newman, &lt;em&gt;The Idea of a University&lt;/em&gt;, 126.&amp;nbsp;&lt;a href=&quot;#fnref:32&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:33&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Kenneth Lockridge, &amp;ldquo;Individual Literacy in Commonplace Books,&amp;rdquo; &lt;em&gt;Interchange&lt;/em&gt; 34, no. 2/3 (September 2003), 338&amp;ndash;339.&amp;nbsp;&lt;a href=&quot;#fnref:33&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2013/sunday-links-4/</link>
      <pubDate>Sun, 15 Dec 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/sunday-links-4/</guid>
      <description>&lt;p&gt;The quarter is over, and I&amp;rsquo;m back at home with the family. Tomorrow I&amp;rsquo;ll return
for a few weeks contracting at &lt;a href=&quot;http://stremor.com&quot;&gt;Stremor&lt;/a&gt;. Here&amp;rsquo;s what I&amp;rsquo;ve been reading this
week:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.jstor.org/stable/2138534&quot;&gt;&lt;strong&gt;Aaron (1994)&lt;/strong&gt;&lt;/a&gt; attacks the idea of treating the preferences and values
of an individual as stable, calling it more of an &amp;ldquo;axiom of religious faith&amp;rdquo; than a
&amp;ldquo;defensible scientific hypothesis&amp;rdquo; (6). He supports approaches such as that of
&lt;a href=&quot;https://en.wikipedia.org/wiki/Sugarscape&quot;&gt;Epstein and Axtell&lt;/a&gt;, which avoid the unrealistically oversimplified models of
utility that Aaron claims have plagued microeconomics for too long.&lt;/li&gt;
  &lt;li&gt;Bertrand Russell gives the &lt;a href=&quot;http://news.google.com/newspapers?id=NcJVAAAAIBAJ&amp;amp;sjid=tr0DAAAAIBAJ&amp;amp;pg=6918%2C334589&quot;&gt;&lt;strong&gt;10 commandments of &amp;ldquo;true liberalism&amp;rdquo;&lt;/strong&gt;&lt;/a&gt; &amp;mdash;
worth a read for any student, learner, thinker, etc.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://object.cato.org/sites/cato.org/files/serials/files/regulation/1983/5/v7n3-3.pdf&quot;&gt;&lt;strong&gt;Yandle (1983)&lt;/strong&gt;&lt;/a&gt; gives an outline of his catchy concept of &amp;ldquo;bootleggers
and baptists.&amp;rdquo; For those unacquainted with the bridge between public policy
and economics (read: me), this is a really enlightening read.&lt;/li&gt;
  &lt;li&gt;Rudolf Winestock&amp;rsquo;s &lt;a href=&quot;http://www.winestockwebdesign.com/Essays/Lisp_Curse.html&quot;&gt;&lt;strong&gt;The Lisp Curse&lt;/strong&gt;&lt;/a&gt; makes a poignant statement about
software development: developers may be shaped by their tools as much as
they shape their environment. Technology affects our social behavior.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    
    
    
    
    
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2013/sunday-links-3/</link>
      <pubDate>Sun, 08 Dec 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/sunday-links-3/</guid>
      <description>&lt;p&gt;It&amp;rsquo;s finals week here at Stanford. I&amp;rsquo;m just finishing up a research paper
entitled &amp;ldquo;&lt;em&gt;Imperat aut servit&lt;/em&gt;: Managing our knowledge inheritance&amp;rdquo; which will
likely be published here after the quarter has ended. Apart from writing, I&amp;rsquo;m
beginning to think about further independent research. I&amp;rsquo;ve been investigating
patterns in Romanian etymology and am considering engaging in some sort of
formal analysis. Hayek is also still on my mind &amp;mdash; I&amp;rsquo;ve been wondering about
how I can model a particularly interesting statement in &amp;ldquo;The Use of Knowledge in
Society&amp;rdquo; using genetic programming. More on all of this later, I&amp;rsquo;m sure.&lt;/p&gt;

&lt;p&gt;For now, here are the interesting parts of my (very sparse) reading this week:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;http://www.jstor.org/stable/27740557&quot;&gt;Hoxby (2009)&lt;/a&gt;&lt;/strong&gt; details a trend of &amp;ldquo;re-sorting&amp;rdquo; among college applicants
over the past decades which has led to a more efficient &amp;ldquo;matching&amp;rdquo; between
high-aptitude students and high-selectivity schools. She demonstrates that
per-student resources at highly selective universities have skyrocketed due to
this higher efficiency: as better-matched students arrive at top-tier
colleges, they demand the support they have been looking forward to.
Furthermore, the paper reveals a curious trend in the financials of premier
American colleges: the average student only pays around 20% of the value of
the resources offered to him, while the remainder is repaid by alumni
donations.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;http://cognet.mit.edu/library/books/view?isbn=0262550253&quot;&gt;&lt;em&gt;Growing Artificial Societies&lt;/em&gt;&lt;/a&gt;&lt;/strong&gt; by Epstein and Axtell has been useful
as I think about developing a system to model problems in economics (see
introduction).&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2013/sunday-links-2/</link>
      <pubDate>Sun, 01 Dec 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/sunday-links-2/</guid>
      <description>&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;http://www.jstor.org/stable/10.1086/468061&quot;&gt;Gneezy and Rustichini (2000)&lt;/a&gt;&lt;/strong&gt; examine the effect of imposing a fine on
parents who arrive late to pick up their children from Israeli daycare
centers. They find that parents actually arrive significantly later once this
fine is in place. When what was formerly a fuzzy unwritten social contract is
rewritten with an exact punishment, they suggest that parents feel free to
break the rules and pay the exact fine (where before they were unsure about
the potential consequences of their lateness).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;http://www.dailymail.co.uk/home/moslive/article-1326286/Facebook-Why-The-Social-Network-brilliant--completely-misses-point.html&quot;&gt;Lawrence Lessig argues&lt;/a&gt;&lt;/strong&gt; that most of the people who watch &lt;em&gt;The Social
Network&lt;/em&gt; will miss the really important point in the story of the rise of
Facebook: that is, no innovation of its kind would have been possible without
the Internet, which is not bogged down by monopolies or excessive regulation
(yet?).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;a href=&quot;http://www.wired.com/wiredenterprise/2013/11/bitcoin-and-deflation/&quot;&gt;&amp;ldquo;Bitcoin is Flawed, But It Will Still Take Over the World&amp;rdquo;&lt;/a&gt;&lt;/strong&gt; covers
some of the most recent controversy concerning the booming cryptocurrency. It
is still disputed whether the inevitable deflation of Bitcoin (the total
circulation is limited by design) will doom the
currency.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Free_rider_problem&quot;&gt;free rider problem&lt;/a&gt; will also
emerge when this limit is reached: those who currently &amp;ldquo;mine&amp;rdquo; Bitcoin are
actually in effect working to validate real digital transactions. When there
is no more coin to be mined, the incentive to continue supporting the network
in this way drops significantly. There is by design a mechanism by which
spenders can pay transaction fees to miners so that their exchanges are
verified efficiently, but it is not certain that this will be a stable or
sufficient incentive alone. The author waves this problem away, suggesting:
&amp;ldquo;They may run them just to keep the Bitcoin system going,
knowing that the system will reward them in other ways.&amp;rdquo;&lt;/li&gt;
  &lt;li&gt;More on Bitcoin:
&lt;strong&gt;&lt;a href=&quot;http://www.wired.com/wiredenterprise/2013/11/bitcoin_hearing/&quot;&gt;&amp;ldquo;As China Looms, the U.S. Ponders Ways Not to Destroy Bitcoin.&amp;rdquo;&lt;/a&gt;&lt;/strong&gt; Robert
McMillan of Wired Enterprise claims that current U.S. regulations are stifling
an industry with great potential. He suggests that as the involvement of the
Chinese with the currency skyrockets, a significant part of the Congressional
support for Bitcoin is fueled by fear of a China-dominated digital
currency.&lt;/li&gt;
  &lt;li&gt;In the early 20th century, William James&amp;rsquo; &lt;strong&gt;&lt;a href=&quot;https://en.wikisource.org/wiki/The_Ph.D._Octopus&quot;&gt;&amp;ldquo;The Ph.D. Octopus&amp;rdquo;&lt;/a&gt;&lt;/strong&gt; pointed
to a dangerous and popular trend of overdependence on credentials within
academia. I can&amp;rsquo;t say the problem has improved &amp;mdash; it&amp;rsquo;s survived, rather, and
spread into industry as well!&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Sunday Links</title>
      <link>http://foldl.me/2013/sunday-links-1/</link>
      <pubDate>Sun, 24 Nov 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/sunday-links-1/</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve decided to join the Internet-wide trend of posting regular updates on
articles, blog posts, etc. that I have read recently and find interesting. I see
multiple benefits in such a practice:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;As a commitment mechanism:&lt;/strong&gt; With the Internet watching, I will (hopefully)
feel more motivated to maintain good reading habits. Furthermore, I&amp;rsquo;ll have an
incentive to write about my insights on each piece (these posts would
otherwise be of no more value than any other news aggregation).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;As a more effective method of sharing:&lt;/strong&gt; Individual posts on social networks
often seem &amp;ldquo;drowned out&amp;rdquo; to me, quickly eclipsed by the slightly newer
articles displayed immediately above them. This more formal mode of publication
should counter that effect.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;As a log of my reading progress:&lt;/strong&gt; I envision that an archive of my reading
interests over the weeks / months / years&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; could be very useful to my
future self.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without further ado, here begins the Sunday Links series!&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Sean Trende from RealClearPolitics &lt;a href=&quot;http://www.realclearpolitics.com/articles/2013/11/14/is_the_value_of_campaign_spending_overstated_120667.html&quot;&gt;claims that House campaign spending&lt;/a&gt;
has significant effects only for challengers. This post is based on findings
from the late 80&amp;rsquo;s / early 90&amp;rsquo;s. &lt;a href=&quot;http://www.jstor.org/stable/2138764&quot;&gt;Levitt (1994)&lt;/a&gt; contradicts these ideas
and suggests that the marginal dollar of funding actually does little for both
the incumbent and the challenger, after controlling for candidate quality and
moving beyond the simple cross-sectional analysis that Trende relies on.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://arxiv.org/abs/1311.1213&quot;&gt;Varshney et al. (2013)&lt;/a&gt; describe a system for
&lt;a href=&quot;http://en.wikipedia.org/wiki/Computational_creativity&quot;&gt;computational creativity&lt;/a&gt;. The system combines many disparate datasets on
cuisine and food ingredients (containing data concerning e.g. regional cooking
practices, observed links between chemicals and &amp;ldquo;hedonic &lt;a href=&quot;http://en.wikipedia.org/wiki/Psychophysics&quot;&gt;psychophysical&lt;/a&gt;
effects,&amp;rdquo; olfactory properties, etc. etc.) and automatically designs and
assesses &amp;ldquo;creative&amp;rdquo; novel meal recipes.&lt;/li&gt;
  &lt;li&gt;F. A. Hayek&amp;rsquo;s article &lt;a href=&quot;http://www.econlib.org/library/Essays/hykKnw1.html&quot;&gt;&amp;ldquo;The Use of Knowledge in Society&amp;rdquo;&lt;/a&gt; explains how a
free market can resolve the problem of insufficient knowledge on the part of
any given individual through prices. Hayek claims that a system of central
planning would require a single entity with total knowledge of the
circumstances of all of its constituents, and argues that a decentralized
system which regulates itself by prices is superior.&lt;/li&gt;
  &lt;li&gt;David Luposchainsky &lt;a href=&quot;https://github.com/quchen/articles/blob/master/loeb-moeb.md&quot;&gt;provides examples&lt;/a&gt; of &lt;a href=&quot;https://en.wikipedia.org/wiki/Strange_loop&quot;&gt;strange loops&lt;/a&gt; in Haskell
with two simple functions, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loeb&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;moeb&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
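
&lt;p&gt;For readers who haven&amp;rsquo;t met &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loeb&lt;/code&gt;: the trick is a container of functions, each of which computes its own cell in terms of the final result, tied into a knot by laziness. Here is a rough transliteration to Python &amp;mdash; an illustrative sketch, not Luposchainsky&amp;rsquo;s code: memoized thunks stand in for Haskell&amp;rsquo;s laziness, and each cell function receives an accessor rather than the whole result list.&lt;/p&gt;

```python
# Sketch of the 'loeb' idea, transliterated to Python (illustrative only).
# Assumption: each entry of 'fs' computes one cell of the result, given an
# accessor to the other (lazily computed) cells. The Haskell original instead
# passes the whole result structure: loeb x = go where go = fmap ($ go) x.
def loeb(fs):
    cache = {}

    def cell(i):
        # Memoize so each cell is computed once; recursion between cells
        # terminates as long as there is no cyclic dependency.
        if i not in cache:
            cache[i] = fs[i](cell)
        return cache[i]

    return [cell(i) for i in range(len(fs))]

# Spreadsheet-style self-reference: later cells refer to earlier ones.
sheet = [
    lambda c: 1,            # A1 = 1
    lambda c: c(0) + 1,     # A2 = A1 + 1
    lambda c: c(0) + c(1),  # A3 = A1 + A2
]
print(loeb(sheet))  # [1, 2, 3]
```

&lt;p&gt;The memo table is doing the work that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap&lt;/code&gt; and lazy evaluation do in the one-line Haskell version.&lt;/p&gt;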

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Assuming I can make this a long-term habit, that is!&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Series name shamelessly stolen from &lt;a href=&quot;http://everydayutilitarian.com&quot;&gt;Peter Hurford&lt;/a&gt;, whose Sunday Links posts I always enjoy.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Marcus Aurelius and slavery in the Roman Empire</title>
      <link>http://foldl.me/2013/marcus-aurelius-and-slavery-in-the-roman-empire/</link>
      <pubDate>Tue, 05 Nov 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/marcus-aurelius-and-slavery-in-the-roman-empire/</guid>
      <description>&lt;p&gt;&lt;em&gt;Something of a departure from the usual content today &amp;mdash; what follows is my
 attempt to answer the question, &amp;ldquo;Why didn&amp;rsquo;t Marcus Aurelius the Stoic fight to
 end slavery?&amp;rdquo; I hope you enjoy the read. I&amp;rsquo;d appreciate any and all comments!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See a &lt;a href=&quot;https://www.zotero.org/groups/jrgauthiers_public_library/items/collectionKey/TVIFI78M&quot;&gt;Zotero folder&lt;/a&gt; documenting my research for this essay.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;In the &lt;em&gt;Meditations&lt;/em&gt; Marcus Aurelius extols the ideas of independence and
self-determination, echoing many of his Later Stoic intellectual ancestors and
contemporaries. In 1.14 he speaks of the treasures of a &amp;ldquo;balanced constitution&amp;rdquo;
and a &amp;ldquo;monarchy which values above all things the freedom of the subject.&amp;rdquo; It is
difficult to reconcile egalitarian precepts like these, though, with Marcus&amp;rsquo;
behavior as leader of the Roman Empire for 19 years. The Roman institution of
slavery, for example, seems to be in direct contradiction with his own ideals.
Although Aurelius likely interacted with or benefited from the work of slaves
daily while writing the Meditations on campaign, he makes little mention of this
practice in his work. Why would Aurelius not fight against slavery in the Roman
Empire, given his strong commitment to his philosophy and the significant power
he wielded? It may appear at first glance that Aurelius simply refused to
consider any sort of action. But we can&amp;rsquo;t draw firm conclusions by scanning
the &lt;em&gt;Meditations&lt;/em&gt; alone. A study of the emperor&amp;rsquo;s philosophical tradition
reveals a much more nuanced picture: the ethics and personal beliefs sourced
from his Stoic predecessors combined to present several substantial obstacles to
an aggressive campaign for the abolition of slavery.&lt;/p&gt;

&lt;p&gt;The practice of slavery was prevalent throughout the Roman Empire at the time of
Aurelius&amp;rsquo; ascension. About 30 percent of the population of the city of Rome
consisted of slaves. The duties of slaves in the empire varied widely. In urban
Rome, those in servitude might be employed by the city to maintain public
buildings or coordinate construction projects. Wealthy private citizens often
owned several slaves who acted as nurses, tutors, or housekeepers. Others would
be sent to work in factories or on farms.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Slaves acting as domestic servants
often had good chances of economic success or even freedom, while those working
in large groups away from the cities were likely forced to resign themselves to a lifetime
of subjugation.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; Together slaves played a crucial role in sustaining the
empire, supporting projects in both the public and private sectors.&lt;/p&gt;

&lt;p&gt;There is surprisingly little discussion in Later Stoic texts (of which
Aurelius&amp;rsquo; &lt;em&gt;Meditations&lt;/em&gt; forms a part) of a practice so important to the empire. The
Stoics remain curiously quiet on the social and political institution of slavery
in their time, but do make significant comments about how one should best treat
a slave. In his famous 47th letter to Lucilius, Seneca skirts around the larger
question of the norm of slavery and instead attempts to prescribe how slaves
should be treated: &amp;ldquo;But this is the kernel of my advice: Treat your inferiors as
you would be treated by your betters.&amp;rdquo;&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; All of the Later Stoic writing on
this specific topic stresses the equality of all men in a spiritual or cosmic
sense. While such a tenet is only implied in Seneca&amp;rsquo;s letter, Marcus Aurelius
states it more plainly in his Meditations. He asks us to &amp;ldquo;[c]onsider how [we]
stand in relation to [our companions], and how we were born to help one
another.&amp;rdquo;&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; Though a free Roman and a Roman slave obviously differed in their
social positions, the Stoics thought it important to recognize that both were
humans and both therefore deserved humane treatment.&lt;/p&gt;

&lt;p&gt;But the Stoics seem to feel less direct sympathy than we might expect for their
own fellow human beings, relegated to servitude under a human master. This is
perhaps because they were more concerned with a very different kind of slavery:
that of a free man to desire, emotion, or irrationality. &amp;ldquo;Slavery&amp;rdquo; for the
Stoics referred, rather, to an unacknowledged dependence on an external factor
for internal tranquility and peace. Seneca asks his contemporaries to turn
inward when contemplating slaves, and realize how they too are bound to their
own, more abstract masters:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;‘He is a slave.’ His soul, however, may be that of a freeman. ‘He is a slave.’
But shall that stand in his way? Show me a man who is not a slave; one is a
slave to lust, another to greed, another to ambition, and all men are slaves
to fear. … No servitude is more disgraceful than that which is
self-imposed.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A slavery &amp;ldquo;self-imposed&amp;rdquo; was more terrifying to the Stoics than any external
social circumstance. Such servitude led man after man astray from his duties,
suffering from a &amp;ldquo;disgraceful&amp;rdquo; irrationality and lack of wisdom. Following
Seneca, Marcus Aurelius uses the same metaphor to describe how one&amp;rsquo;s mind may be
dominated and enslaved by thoughts of an unhappy status quo or an uncertain
future: &amp;ldquo;No longer allow [your ruling center] to act as a slave &amp;hellip; no longer
allow it to be discontented with its present lot or flinch from what will fall
to it in the future.&amp;rdquo;&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; This other form of captivity, named &amp;ldquo;moral slavery&amp;rdquo; by
later scholars, was of far greater importance to Stoic thinkers. The Stoics
knew that this servitude to such invisible masters pervaded the minds of
plebeians and patricians alike, and they saw righting this malady as the more
pressing goal. Moreover, Aurelius and his companions were well aware that while
moral slaves could free themselves immediately of their own volition, this was
not the case with traditional slaves. The Stoics believed, then, that &amp;ldquo;[b]y
comparison with the slavery that was a condition of the soul, legal slavery was
of marginal importance. It was an external&amp;mdash;like health and illness, wealth and
poverty, high and low status&amp;mdash;over which we had no control. As such, it was
neither good nor bad but, rather, indifferent.&amp;rdquo;&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; The Stoics saw this
distinction between fixed external factors and mutable internal factors as
central to everyday life. They chose to regard the features of their lives not
under their direct control with a knowing indifference&amp;mdash;a relaxed and rational
concession of power over things outside of the individual.&lt;/p&gt;

&lt;p&gt;This indifference was prevalent throughout the writing of the Later Stoics. In
the context of enslavement, they would think it important to recognize the
random chance that could cast one person into slavery and another into a
sedentary life in the aristocracy. Both conditions would be bestowed at birth,
outside of an individual&amp;rsquo;s control. The result of this coin-flip decision by
what the Stoics dub &amp;ldquo;Fortune,&amp;rdquo; then, would ideally not have any bearing on the
happiness of the individual in life. What guides the course of a person&amp;rsquo;s
life&amp;mdash;the rational mind, or the &amp;ldquo;ruling center&amp;rdquo; in Aurelius&amp;rsquo;
terminology&amp;mdash;should be, in the Stoics&amp;rsquo; view, independent from and unhampered by
any physical or social circumstance.&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; Seneca claims that &amp;ldquo;it is a mistake to
imagine that slavery pervades a man&amp;rsquo;s whole being,&amp;rdquo; because our rational center
&amp;ldquo;cannot be transferred as a chattel.&amp;rdquo;&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt; Once we recognize the independence of
this ruling rationality in ourselves, suggest the Stoics, we are no longer
hindered by the emotions that any circumstance may trigger. Epictetus exhibits
the same sentiment in a more general context in his &lt;em&gt;Encheiridion&lt;/em&gt;, holding that
&amp;ldquo;[m]en are disturbed not by things, but by the views which they take of
things.&amp;rdquo;&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; Our happiness, according to the Stoics, can be independent from
our physical state. Thus in this view life as a well-treated slave could be just
as tranquil and happy as life as a commoner or an aristocrat.&lt;/p&gt;

&lt;p&gt;However much this philosophical reasoning allows us to excuse our Stoic ruler
from a fight against slavery, Marcus Aurelius did in fact strive to protect the
rights of slaves. Anthony R. Birley indicates a &amp;ldquo;consistent policy&amp;rdquo; throughout
Aurelius&amp;rsquo; reign of giving every slave &amp;ldquo;the maximum possible chance of attaining
freedom.&amp;rdquo; The jurist Marcellus attributed to Aurelius a &amp;ldquo;partiality for freedom&amp;rdquo;
with respect to cases involving manumission.&lt;sup id=&quot;fnref:11&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt; Other sources state the
leader&amp;rsquo;s preference for the freedom of slaves even more strongly. Arnold M. Duff
indicates a trend of legislating and ruling in the favor of slaves among several
of the Antonines:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Hadrian and his two successors, under the influence of the Stoics, began an
energetic campaign for the amelioration of slavery. &amp;hellip; Hadrian put an end to
the anomaly that provincial towns were not, like the state, allowed to free
their slaves; in the reign of Marcus Aurelius the right of manumission was
granted to collegia. &amp;hellip; [O]ne of the most striking evidences of the
humanitarian movement is the history of fideicommissary manumission which
evolved itself into legal form between Trajan and Marcus Aurelius. About
twenty &lt;em&gt;senatus-consulta&lt;/em&gt; and imperial &lt;em&gt;constitutiones&lt;/em&gt; are known to us with
reference to &lt;em&gt;fideicommissa&lt;/em&gt;. &amp;hellip; Of those twenty rescripts and decisions all
are in favour of the slave. If it was quite clear that the testator wanted a
certain slave to be freed, then he had to be freed, and no legal forms or
theories could prevent it.&lt;sup id=&quot;fnref:12&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Duff shows evidence that support for the institution of slavery was perhaps
weakening in a specific sector of law, through both decisions of the senate
(&lt;em&gt;senatus-consulta&lt;/em&gt;) and imperial proclamations (&lt;em&gt;constitutiones&lt;/em&gt;). But by
modern standards, this sort of piecemeal progress is maddeningly insufficient.
Marcus Aurelius ruled the Roman Empire for nearly two decades. Why did he not
take further steps to eradicate legal slavery as an institution?&lt;/p&gt;

&lt;p&gt;It is possible, in fact, that Marcus had such a plan in mind but deliberately
chose not to put it into action. Marcus&amp;rsquo; own writing clearly shows that he struggled to
balance ideas sourced from his personal philosophy with the expectations of his
Roman counterparts. Late in the &lt;em&gt;Meditations&lt;/em&gt; he makes an entry, seemingly
resigned to the unfairness of his post: &amp;ldquo;You should not hope for Plato&amp;rsquo;s ideal
state, but be satisfied to make even the smallest advance, and regard such an
outcome as nothing contemptible. For who can change the convictions of
others?&amp;rdquo;&lt;sup id=&quot;fnref:13&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;13&lt;/a&gt;&lt;/sup&gt; Here we see Marcus as something of a realist, spelling out the
struggle he sees between an ideal government and what his society currently
regards as an acceptable state. The most groundbreaking changes, he says, must
be made in the &amp;ldquo;smallest advance[s].&amp;rdquo; Surprising legislation or sudden imperial
action could lead to political turmoil or even large-scale revolt. Little could
be more groundbreaking, of course, than the abolition of slavery in the Roman
Empire. Aurelius evidently did his part to better the treatment of slaves, but
was likely wary of pursuing more involved social change for fear of upsetting
his political allies and the Roman public.&lt;/p&gt;

&lt;p&gt;Aurelius also mentions the hopelessness of attempting to &amp;ldquo;change the convictions
of others.&amp;rdquo; This is perfectly in line with Stoic thought: such a goal would be
marked as unwise and even dangerous by any of his Stoic companions as well,
simply because its success or failure lies outside the self. Marcus
would see the &amp;ldquo;convictions of others&amp;rdquo; as an external, a factor which deserved no
response but indifference. Though he may have held the strongest and most
unorthodox opinions about manumission and the treatment of slaves, it appears
that he kept them to himself, staying in line with his recommendation in 9.29.
Aurelius knew that imposing such a radical change so abruptly would have
wreaked havoc on the Roman economy, and that attempting to force his
&amp;ldquo;convictions&amp;rdquo; on others would be a breach of his own philosophical ideals.&lt;/p&gt;

&lt;p&gt;It is unfortunate that we have only a record of Marcus Aurelius&amp;rsquo; &lt;em&gt;Meditations&lt;/em&gt;
and not also of his unfiltered daily thoughts. But by reading the &lt;em&gt;Meditations&lt;/em&gt;
and understanding the opinions of his intellectual counterparts, we can begin to
glimpse the hard conclusions he must have had to reach on the topic of slavery.
While legal slavery was an institution obviously not befitting an &amp;ldquo;ideal state,&amp;rdquo;
Aurelius knew that such a prevalent practice could not simply be swept away in a
decade, or even a century. The Stoic view of this ancient form of slavery
further deterred the school&amp;rsquo;s thinkers from launching a full attack on the
institution. Through the Stoic lens, we imagine that &amp;ldquo;good&amp;rdquo; slaves would have
been content with their place, pleased with the cards that the proverbial
Fortune had dealt them. The more pressing societal problem in the Stoic view had
to do with the far more numerous &amp;ldquo;moral&amp;rdquo; slaves, bound to their own ephemeral
pleasures. This reasoning is by no means a complete excuse for Marcus Aurelius&amp;rsquo;
inaction. Through this wider historical and philosophical lens, however, we can
understand the enormous and intricate internal conflict which he must have faced
on this issue. Our philosopher-king was a mortal in struggle, permanently torn
between maintaining the status quo and campaigning for revolutionary
change.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jo-Ann Shelton, &lt;em&gt;As the Romans Did: A Sourcebook in Roman Social History&lt;/em&gt;, 2nd ed (New York: Oxford University Press, 1998), 166&amp;ndash;167.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;em&gt;Roman Social History: A Sourcebook&lt;/em&gt;, Routledge Sourcebooks for the Ancient World (London; New York: Routledge, 2007), 155.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Lucius Annaeus Seneca, &amp;ldquo;On Master and Slave,&amp;rdquo; in &lt;em&gt;Epistulae morales&lt;/em&gt;, trans. Richard M. Gummere (London Heinemann, 1917), &lt;a href=&quot;http://archive.org/details/adluciliumepistu01seneuoft&quot;&gt;[link]&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Marcus Aurelius, &lt;em&gt;Meditations&lt;/em&gt;, trans. Robin Hard, Oxford World&amp;rsquo;s Classics (Oxford; New York: Oxford University Press, 2011), 11.18.&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Seneca, &amp;ldquo;On Master and Slave.&amp;rdquo;&amp;nbsp;&lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Aurelius, &lt;em&gt;Meditations&lt;/em&gt;, 2.2.&amp;nbsp;&lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;em&gt;Hellenistic Constructs: Essays in Culture, History, and Historiography&lt;/em&gt;, Hellenistic Culture and Society v. 26 (Berkeley: University of California Press, 1997), 159.&amp;nbsp;&lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Miriam T. Griffin, &lt;em&gt;Seneca: A Philosopher in Politics&lt;/em&gt; (Oxford: Clarendon Press, 1976).&amp;nbsp;&lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Lucius Annaeus Seneca, &lt;em&gt;De Beneficiis&lt;/em&gt;, trans. John W. Basore (Cambridge, Mass.: Harvard University Press, 1989).&amp;nbsp;&lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Epictetus, &lt;em&gt;The Enchiridion&lt;/em&gt;, trans. Thomas W. Higginson, 2d ed. (Indianapolis: Bobbs-Merrill, 1955), 5.&amp;nbsp;&lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Anthony R. Birley, &amp;ldquo;Marcus&amp;rsquo; Life as Emperor,&amp;rdquo; in &lt;em&gt;A Companion to Marcus Aurelius&lt;/em&gt;, ed. Marcel van Ackeren, vol. 96 (Wiley-Blackwell, 2012), 160, &lt;a href=&quot;http://onlinelibrary.wiley.com/doi/10.1002/9781118219836.ch9/summary&quot;&gt;[link]&lt;/a&gt;.&amp;nbsp;&lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Arnold Mackay Duff, &lt;em&gt;Freedmen in the Early Roman Empire&lt;/em&gt; (Clarendon Press, 1928), 195&amp;ndash;196.&amp;nbsp;&lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Aurelius, &lt;em&gt;Meditations&lt;/em&gt;, 9.29.&amp;nbsp;&lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Review&#58; ZeroMQ&#58; Messaging for Many Applications by Pieter Hintjens</title>
      <link>http://foldl.me/2013/zeromq-book/</link>
      <pubDate>Sun, 30 Jun 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/zeromq-book/</guid>
      <description>&lt;p&gt;ZeroMQ is one of those technologies that attract a sizeable share of
breathless adherents. I had been aware of the hubbub over the open-source
messaging library for quite some time when I heard that the popular online
tutorial &amp;ndash; known simply as &lt;a href=&quot;http://zguide.zeromq.org/&quot;&gt;&amp;ldquo;The Guide&amp;rdquo;&lt;/a&gt;, written by Pieter Hintjens, an
author of ZeroMQ &amp;ndash; would be made available in print and ebook. I snagged my
chance to get a nice Kindle edition of the &lt;a href=&quot;http://www.amazon.com/gp/product/1449334067/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=1449334067&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;O&amp;rsquo;Reilly release&lt;/a&gt;. Apart from
some serious formatting problems with the ebook (read on), I was extremely
satisfied with the breadth and depth of this guide.&lt;/p&gt;

&lt;p&gt;Hintjens abandons all pretense at the very beginning of Chapter 1, acknowledging
the fervor of the community:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;How to explain ØMQ? Some of us start by saying all the wonderful things it
does. &lt;em&gt;It&amp;rsquo;s sockets on steroids. It&amp;rsquo;s like mailboxes with routing. It&amp;rsquo;s fast!&lt;/em&gt;
Others try to share their moment of enlightenment, that zap-pow-kaboom satori
paradigm-shift moment when it all became obvious. &lt;em&gt;Things just become simpler.
Complexity goes away. It opens the mind.&lt;/em&gt; Others try to explain by comparison.
&lt;em&gt;It&amp;rsquo;s smaller, simpler, but still looks familiar.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yes &amp;ndash; the whole book is like that. Our author has a wonderfully lucid and
light-hearted writing style&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; that keeps you focused during the long stretches
of code.&lt;/p&gt;

&lt;p&gt;And is there ever code! The majority of the book offers a tour through a dizzying
array of ØMQ network patterns, each accompanied by a cute name and often a
diagram. See, for example, the &amp;ldquo;Majordomo Pattern.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://github.com/imatix/zguide/raw/master/images/fig50.png&quot; alt=&quot;The Majordomo Pattern&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What follows this reasonably simple diagram is no less than 500 lines of C code.
&lt;strong&gt;Inline.&lt;/strong&gt; I appreciate this in some amount &amp;ndash; there&amp;rsquo;s nothing more practical
than a real implementation &amp;ndash; but I was blown away by (or rather, smothered under) the
piles of code in this book. The density of the code hindered my reading
experience, especially in the Kindle edition, where there were no bookmarks
within sub-chapter sections to help me easily jump around between the massive
code blocks. Many of the most important sections of the book that offered real,
usable patterns were difficult to scan and reference later on given the lack of
navigation aids.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;This publishing error, however serious, is my only major gripe with the book. I
learned quite a lot about the core of ZeroMQ, and am now interested in exploring
the bindings written for my everyday languages.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; I&amp;rsquo;m excited to see how I can
integrate the library at the core of horizontally scalable systems in the near
future.&lt;/p&gt;

&lt;p&gt;(Disclosure: I received an electronic copy of this book in exchange for writing
a review.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://ir-na.amazon-adsystem.com/e/ir?t=blog0cbb-20&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=1449334067&quot; width=&quot;1&quot; height=&quot;1&quot; border=&quot;0&quot; alt=&quot;&quot; style=&quot;border:none !important; margin:0px !important;&quot; /&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is likely something of a rarity, I&amp;rsquo;d assume, when it comes to guides on message-passing libraries.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;To be fair, this would be much less of an issue with a physical book (or in the online guide, where much of the code is held externally and simply referenced by hyperlink). I am still disappointed by O&amp;rsquo;Reilly&amp;rsquo;s apparent lack of concern for the usability of this work&amp;rsquo;s ebook format.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The book does make reference to the large amount of language bindings available, but keeps all code in C.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Review&#58; Clojure Programming by Chas Emerick et al.</title>
      <link>http://foldl.me/2013/programming-clojure/</link>
      <pubDate>Wed, 17 Apr 2013 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2013/programming-clojure/</guid>
      <description>&lt;p&gt;I often run into what you might call &lt;em&gt;closet functional programmers&lt;/em&gt; &amp;ndash; people
who seem to have a genuine interest in acquainting themselves with a new
paradigm, but just can&amp;rsquo;t manage to find the time to do it. Those who do
invest the time often end up on something like the &lt;a href=&quot;http://www.haskell.org/haskellwiki/Typeclassopedia&quot;&gt;Typeclassopedia&lt;/a&gt;&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;,
where the combined force of jargon and type signatures kills whatever interest
they began with.&lt;/p&gt;

&lt;p&gt;Thanks to &lt;a href=&quot;http://www.amazon.com/gp/product/1449394701/ref=as_li_tf_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=1449394701&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;&lt;em&gt;Clojure Programming&lt;/em&gt;&lt;/a&gt;, though, I&amp;rsquo;m happy to report that this will
no longer be a problem. This book gives hope to those who have championed Lisp
and/or functional programming in vain. Emerick et al. not only provide a
thorough tour of the language but also demonstrate the beauty and conciseness
of its solutions to common problems. The book dedicates an entire section
(&amp;ldquo;Practicum&amp;rdquo;) to describing how Clojure is idiomatically used in different
application domains.&lt;/p&gt;

&lt;p&gt;I was particularly pleased by the stellar coverage of some of Clojure&amp;rsquo;s most
compelling features:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Concurrency primitives (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ref&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;atom&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;agent&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;future&lt;/code&gt;, and friends)&lt;/li&gt;
  &lt;li&gt;The power of the JVM and easy Java interop&lt;/li&gt;
  &lt;li&gt;Lisp syntax (which makes for easy and &lt;em&gt;powerful&lt;/em&gt; metaprogramming)&lt;/li&gt;
  &lt;li&gt;The sequence abstraction&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These features are all explained in a bottom-up style (fitting for a Lisp!) &amp;ndash;
the authors build up a sizeable example by providing an implementation in small
increments, explaining along the way. This style is a nice parallel to the
nature of traditional Lisp programming.&lt;/p&gt;

&lt;p&gt;This book would best suit any of these three groups:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Java refugees.&lt;/strong&gt; Give me the JVM, hold the
  &lt;a href=&quot;http://static.springsource.org/spring/docs/2.5.x/api/org/springframework/aop/framework/AbstractSingletonProxyFactoryBean.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AbstractSingletonProxyFactoryBean&lt;/code&gt;&lt;/a&gt;. &lt;em&gt;Clojure Programming&lt;/em&gt; shows you
  how to take advantage of the vast Java ecosystem while avoiding some of the
  pitfalls of having static typing and OOP forced upon you. The authors make a
  good case for interactive programming with the Clojure REPL, which gives you
  a direct line to the JVM not usually available in Java-land.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Beginning functional programmers.&lt;/strong&gt; For those already acquainted with a
  scripting language like Python, Ruby, etc., your first Clojure programs will
  be a breeze. The book spends a chapter first easing you into Clojure syntax
  before presenting the basics of functional programming in all of their
  greatness. You&amp;rsquo;ll come to love the paradigm and appreciate how Clojure
  facilitates its use so effectively.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Lispers.&lt;/strong&gt; While Clojure is by no means a mainstream language, it provides a
  compelling case of a successful Lisp dialect. The later chapters, which
  provide examples of Clojure applications in all sorts of distinct domains,
  will definitely be of interest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beginners, intermediate users and masters alike will find something of use in
&lt;em&gt;Clojure Programming&lt;/em&gt;. It&amp;rsquo;ll be one of the first books I recommend from now on
to anyone curious about Lisp or functional programming.&lt;/p&gt;

&lt;p&gt;(Disclosure: I received an electronic copy of this book in exchange for writing
a review.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://www.assoc-amazon.com/e/ir?t=blog0cbb-20&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=1449394701&quot; width=&quot;1&quot; height=&quot;1&quot; border=&quot;0&quot; alt=&quot;&quot; style=&quot;border:none !important; margin:0px !important;&quot; /&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I&amp;rsquo;ve absolutely nothing against this document &amp;ndash; it&amp;rsquo;s a fascinating and wonderfully helpful piece of work &amp;ndash; but when the first few paragraphs include the words &amp;ldquo;category theory,&amp;rdquo; &amp;ldquo;monoid,&amp;rdquo; etc., etc., beginners will tend to get spooked!&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Parsing sound change rules with Parsec&#58; Part 2</title>
      <link>http://foldl.me/2012/sound-change-parsing-parsec-rules/</link>
      <pubDate>Mon, 17 Dec 2012 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2012/sound-change-parsing-parsec-rules/</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the second post in a tutorial series on applying Parsec in
 historical linguistics. We&amp;rsquo;ve begun by providing a more formal
 description of sound change rule grammars and will end by building a
 full-fledged sound change applier.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In my &lt;a href=&quot;/2012/sound-change-parsing-bnf-grammar&quot;&gt;last post&lt;/a&gt; we established a BNF grammar for files which
describe sound change rules:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;file&amp;gt;               ::= (&amp;lt;phoneme-class-defn&amp;gt; &amp;lt;EOL&amp;gt;)* (&amp;lt;rule&amp;gt; &amp;lt;EOL&amp;gt;)+
&amp;lt;phoneme-class-defn&amp;gt; ::= &amp;lt;phoneme-class&amp;gt; &quot;:&quot; &amp;lt;phoneme&amp;gt;+
&amp;lt;rule&amp;gt;               ::= &amp;lt;context&amp;gt; &quot;&amp;gt;&quot; &amp;lt;replacement&amp;gt; [&quot;/&quot; &amp;lt;condition&amp;gt;]
&amp;lt;condition&amp;gt;          ::= &amp;lt;context&amp;gt; &quot;_&quot; &amp;lt;context&amp;gt;
&amp;lt;context&amp;gt;            ::= (&amp;lt;phoneme&amp;gt; | &amp;lt;phoneme-class&amp;gt;)+
&amp;lt;phoneme&amp;gt;            ::= &amp;lt;lowercase-letter&amp;gt;
&amp;lt;phoneme-class&amp;gt;      ::= &amp;lt;uppercase-letter&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
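&lt;p&gt;To make the grammar concrete, here is a small hypothetical rules file it would accept. Both the class definition and the rule are invented for illustration: the first line defines a vowel class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V&lt;/code&gt;, and the second voices &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;g&lt;/code&gt; between vowels.&lt;/p&gt;

```plaintext
V: aeiou
k > g / V_V
```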

&lt;p&gt;Before we begin parsing, let&amp;rsquo;s set up some basic datatypes which can be
used to store the parse results.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;kr&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Data.Map&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- A single phoneme.&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Phoneme class storage, mapping from a single character (&apos;V&apos;, &apos;A&apos;,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- &apos;F&apos;, etc.) to a collection of phonemes.&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- A string of phonemes used to match a given context.&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- A complete sound change rule.&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Rule&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Rule&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;replacement&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
                   &lt;span class=&quot;n&quot;&gt;beforeContext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;n&quot;&gt;inContext&lt;/span&gt;     &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;n&quot;&gt;afterContext&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Context&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kr&quot;&gt;instance&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Show&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Rule&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;where&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;show&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Rule&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;show&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; &amp;gt; &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;show&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; / &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;show&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;
                          &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;_&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;show&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Referencing the BNF grammar, we can use these types as the return types
for our parsers. Let&amp;rsquo;s start with the simplest Parsec rules,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;anyPhoneme&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;anyPhonemeClass&lt;/code&gt;. Any uppercase character in sound
change rules should be interpreted as a phoneme class reference, and any
lowercase character must be a phoneme.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;kr&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Text.Parsec&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Parsec&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lower&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Parsec&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upper&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As evidenced by the given type annotations, our parsers (for the moment)
will have a stream type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;String&lt;/code&gt;, a user state type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;()&lt;/code&gt;, and a
return type that varies based on their purpose.&lt;/p&gt;

&lt;h2 id=&quot;our-first-lift&quot;&gt;Our first lift&lt;/h2&gt;

&lt;p&gt;Next we need to build the parser for phoneme class definitions. As a
first try, we could have our parser return a pair of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(Char,
[Phoneme])&lt;/code&gt;, matching an entry of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PhonemeClassMap&lt;/code&gt;. Let&amp;rsquo;s start:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Parsec&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;many1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This doesn&amp;rsquo;t work! What gives?&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s look at the type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(,)&lt;/code&gt;, a tuple constructor:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And check the type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;anyPhonemeClass&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;many1 anyPhoneme&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;many1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;These have the right types &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Char&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[Phoneme]&lt;/code&gt;, except they&amp;rsquo;re
contained within a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT&lt;/code&gt; type.&lt;/p&gt;

&lt;p&gt;Good news: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT s u m&lt;/code&gt; is a functor! This means that we can &amp;ldquo;lift&amp;rdquo;
functions into the context defined by the type. Check the type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap (,)&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;fmap&lt;/span&gt;     &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Functor&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fmap&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Functor&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You can check that the type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(,)&lt;/code&gt; corresponds with the type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap
(,)&lt;/code&gt;. (In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap&lt;/code&gt;&amp;rsquo;s type signature, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; corresponds to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b -&amp;gt; (a, b)&lt;/code&gt; from
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(,)&lt;/code&gt;&amp;rsquo;s type.)&lt;/p&gt;
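&lt;p&gt;To make the lift concrete, here&amp;rsquo;s a small illustration (an addition, not from the original parser code) using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Maybe&lt;/code&gt;, a simpler functor, showing the same partial application inside a context:&lt;/p&gt;

```haskell
-- Illustration only: the same lift, in the simpler Maybe functor.
-- fmap (,) (Just 'V') :: Maybe (b -> (Char, b))
pairUp :: Maybe (String -> (Char, String))
pairUp = fmap (,) (Just 'V')

-- Supplying the remaining argument inside the context:
main :: IO ()
main = print (fmap ($ "aeiou") pairUp)  -- prints Just ('V',"aeiou")
```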

&lt;p&gt;Let&amp;rsquo;s provide &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap (,)&lt;/code&gt; with that first argument &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f a&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f&lt;/code&gt; is
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT String () Identity&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Char&lt;/code&gt;. (This looks like the
type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;anyPhonemeClass&lt;/code&gt;!)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;fmap&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Great - just as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap&lt;/code&gt;&amp;rsquo;s signature described, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(,)&lt;/code&gt; was lifted into the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT String () Identity&lt;/code&gt; context and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; was clarified to be a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Char&lt;/code&gt;. We can make our expression look a bit nicer by using an infix
alias for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Control.Applicative&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;$&amp;gt;&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;applying-within-a-context&quot;&gt;Applying within a context&lt;/h2&gt;

&lt;p&gt;Looking at the types, we&amp;rsquo;re almost there: we want our final parser to
have a return type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(Char, [Phoneme])&lt;/code&gt; and the current parser has a
return type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b -&amp;gt; (Char, b)&lt;/code&gt;. How can we supply an argument of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;The answer comes from the fact that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT s u m&lt;/code&gt; is not only a
functor but an &lt;em&gt;applicative&lt;/em&gt; functor. This means that we can apply
functions already within the context (like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b -&amp;gt; (Char, b)&lt;/code&gt;!) to values
within the context (like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;anyPhoneme&lt;/code&gt;!).&lt;/p&gt;

&lt;p&gt;This contextual application is invoked by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Control.Applicative&lt;/code&gt;&amp;rsquo;s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;*&amp;gt;&lt;/code&gt;.
Compare its type with the type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$&lt;/code&gt;: the only difference is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;*&amp;gt;&lt;/code&gt;
operates within a context &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Applicative&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;                    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
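&lt;p&gt;The analogy can be checked in a familiar context. Here is a hedged sketch (an addition, not from the post) comparing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;*&amp;gt;&lt;/code&gt; over &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Maybe&lt;/code&gt;:&lt;/p&gt;

```haskell
-- ($) applies a bare function; (<*>) applies one inside a context.
plain :: (Char, String)
plain = (,) 'V' $ "aeiou"                       -- ('V',"aeiou")

inContext :: Maybe (Char, String)
inContext = (,) <$> Just 'V' <*> Just "aeiou"   -- Just ('V',"aeiou")

main :: IO ()
main = print (plain, inContext)
```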

&lt;p&gt;Let&amp;rsquo;s apply the lifted and partially applied function from the last
section to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;anyPhoneme&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Close: our return type is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(Char, Phoneme)&lt;/code&gt;. Let&amp;rsquo;s instead apply to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;many1
anyPhoneme&lt;/code&gt;, which produces a parser that accepts one or
more phonemes.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;many1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Great! Our parser returns the proper type. Let&amp;rsquo;s write the actual
implementation of our phoneme class definition rule before continuing:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;kr&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Control.Applicative&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;many1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We must do a bit of bookkeeping. In the original BNF, we stated that a
phoneme class definition was of the form&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;phoneme-class-defn&amp;gt; ::= &amp;lt;phoneme-class&amp;gt; &quot;:&quot; &amp;lt;phoneme&amp;gt;+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We need to account for the &amp;ldquo;useless&amp;rdquo; colon in this expression. It&amp;rsquo;s
useless in that it contributes nothing to the parse result. Using the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*&amp;gt;&lt;/code&gt; function from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Control.Applicative&lt;/code&gt;, we can consume a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&apos;:&apos;&lt;/code&gt;
character and discard its result:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;kr&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Control.Applicative&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(,)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt;
                          &lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;:&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;many1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
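&lt;p&gt;To see the discarding behavior in isolation, here is a short sketch (an addition; it assumes the parsec package&amp;rsquo;s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Text.Parsec&lt;/code&gt; is in scope, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;letter&lt;/code&gt; standing in for a phoneme parser):&lt;/p&gt;

```haskell
import Text.Parsec

-- (*>) runs both parsers in sequence but keeps only the right-hand
-- result; its mirror (<*) keeps only the left-hand one.
main :: IO ()
main = do
  print (parse (char ':' *> many1 letter) "" ":aeiou")   -- Right "aeiou"
  print (parse (many1 letter <* char ':') "" "aeiou:")   -- Right "aeiou"
```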

&lt;h2 id=&quot;modifying-user-state&quot;&gt;Modifying user state&lt;/h2&gt;

&lt;p&gt;There&amp;rsquo;s one significant problem left with this parser. True, it eats up
strings without a problem:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;V:aeiou&quot;&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;Right&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sc&quot;&gt;&apos;V&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;aeiou&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Our problem is that we need to reference these definitions in another
parser, specifically the &lt;em&gt;context parser&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;context&amp;gt; ::= (&amp;lt;phoneme&amp;gt; | &amp;lt;phoneme-class&amp;gt;)+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since Parsec has no idea what a phoneme class is, when we build this
parser we&amp;rsquo;ll need to know exactly what to look for in test words
when we see &amp;ldquo;V&amp;rdquo; or &amp;ldquo;A&amp;rdquo; in a rule. How can we make
the phoneme class definitions &amp;ldquo;carry over&amp;rdquo;?&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s simple using Parsec&amp;rsquo;s built-in &amp;ldquo;user state&amp;rdquo; feature. (It shows up
in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;u&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT s u m a&lt;/code&gt;.) Rather than using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;()&lt;/code&gt; as our user
state type, let&amp;rsquo;s carry along a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PhonemeClassMap&lt;/code&gt; as state. Each rule&amp;rsquo;s
type now needs to be redefined (though the implementations of rules
that don&amp;rsquo;t use the state need not change):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phonemeClassDefinition&lt;/code&gt; we&amp;rsquo;ll need to use Parsec&amp;rsquo;s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;modifyState&lt;/code&gt;
function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;modifyState&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Monad&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This type annotation does a great job of helping us understand what
exactly happens within the function. Given some user state modifier
(i.e., a function which takes an old user state of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;u&lt;/code&gt; and creates
a new one), a new parser is yielded which has a user state of type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;u&lt;/code&gt;
and returns nothing.&lt;/p&gt;

&lt;p&gt;Now &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phonemeClassDefinition&lt;/code&gt; will return nothing and instead modify the
parser&amp;rsquo;s state (i.e., add entries to the phoneme class map).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We want to modify this map by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;insert&lt;/code&gt;ing an entry that maps
the parsed class character to its list of phonemes. We run into a
familiar problem, however, since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;insert&lt;/code&gt; was
not built explicitly for use with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT s u m&lt;/code&gt; context:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Ord&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Let&amp;rsquo;s lift &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;insert&lt;/code&gt; into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT s u m&lt;/code&gt; functor:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Functor&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Ord&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a1&lt;/span&gt;
                                                   &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;
                                                  &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Close, as before: we can now provide &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmap insert&lt;/code&gt; with a first
argument in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParsecT s u m&lt;/code&gt; context, but the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a1&lt;/code&gt; in the type
annotation knows nothing about that context. Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;*&amp;gt;&lt;/code&gt; once more, we can fix
the problem:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;:&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;many1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt;
       &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Before continuing, let&amp;rsquo;s give a name to the parser created in this
section.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;modifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Char&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Phoneme&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;many1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Notice that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Map Char [Phoneme]&lt;/code&gt; is equivalent to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PhonemeClassMap&lt;/code&gt;, or
our parser&amp;rsquo;s user state &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;u&lt;/code&gt;. We just did all this work to lift and apply
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;insert&lt;/code&gt; within a context, but now, upon revisiting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;modifyState&lt;/code&gt;&amp;rsquo;s
type, we see we&amp;rsquo;ll need to head in the other direction:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modifier&lt;/span&gt;               &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modifyState&lt;/span&gt;            &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Monad&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;binding&quot;&gt;Binding&lt;/h2&gt;

&lt;p&gt;If we simplify the types here, the next step should be obvious. (This is
pseudo-Haskell.)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;modifier&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modifyState&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Monad&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We need some function that, given an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f a&lt;/code&gt; and an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a -&amp;gt; f b&lt;/code&gt;, derives an
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f b&lt;/code&gt;. This is exactly &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&amp;gt;=&lt;/code&gt;, the monadic bind operation!&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;                    &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Monad&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modifyState&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;That&amp;rsquo;s it &amp;ndash; we&amp;rsquo;ve found our definition for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phonemeClassDefinition&lt;/code&gt;!
With some reformatting:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;ParsecT&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;PhonemeClassMap&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Identity&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;phonemeClassDefinition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modifyState&lt;/span&gt;
                         &lt;span class=&quot;kr&quot;&gt;where&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;$&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhonemeClass&lt;/span&gt;
                                          &lt;span class=&quot;o&quot;&gt;&amp;lt;*&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;defn&lt;/span&gt;
                               &lt;span class=&quot;n&quot;&gt;defn&lt;/span&gt;     &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;:&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;many1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anyPhoneme&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We&amp;rsquo;ve finished with the hardest parser of the set. In the next post,
we&amp;rsquo;ll tackle the remaining parsers, most of which are simple
combinations of those constructed today.&lt;/p&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Parsing sound change rules with Parsec&#58; Part 1</title>
      <link>http://foldl.me/2012/sound-change-parsing-bnf-grammar/</link>
      <pubDate>Fri, 14 Dec 2012 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2012/sound-change-parsing-bnf-grammar/</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the first post in a tutorial series on applying Parsec in
 historical linguistics. We&amp;rsquo;ll begin by providing a more formal
 description of sound change rule grammars and end by building a
 full-fledged sound change applier.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Historical linguists&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; use a standard grammar to describe a language&amp;rsquo;s
&lt;a href=&quot;http://en.wikipedia.org/wiki/Sound_change&quot;&gt;sound change&lt;/a&gt; over time (&lt;a href=&quot;http://en.wikipedia.org/wiki/Diachronic_linguistics&quot;&gt;diachronically&lt;/a&gt;) or among different
speakers at the same time (&lt;a href=&quot;http://en.wikipedia.org/wiki/Synchronic_linguistics&quot;&gt;synchronically&lt;/a&gt;). Each individual change
can be explained by a simple replacement rule, but often requires a
certain context to occur. An example &lt;em&gt;unconditioned&lt;/em&gt; sound change rule
follows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;r &amp;gt; l
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This rule states that, in some language, the /r/ sound becomes /l/ no
matter the context of the /r/ sound. This rule alone can effectively
describe the change from a morph /fara/ to /fala/, or from /rata/ to
/lata/.&lt;/p&gt;
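&lt;p&gt;As a minimal sketch (with a hypothetical helper name, not part of any parser we will build), applying an unconditioned rule amounts to a context-free substitution over a word:&lt;/p&gt;

```haskell
-- Hypothetical sketch: apply an unconditioned rule such as "r > l" by
-- substituting every occurrence of the source phoneme, regardless of context.
applyUnconditioned :: Char -> Char -> String -> String
applyUnconditioned from to = map (\c -> if c == from then to else c)

-- applyUnconditioned 'r' 'l' "fara" yields "fala"
-- applyUnconditioned 'r' 'l' "rata" yields "lata"
```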

&lt;p&gt;Most sound changes in natural languages, however, are &lt;em&gt;conditional&lt;/em&gt;: they
occur only in certain contexts. We can describe required contexts with
an additional clause in sound change rules:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;r &amp;gt; l / a_o
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This rule states that /r/ changes to /l/ only when preceded by an /a/
and followed by an /o/. It describes a change from /taro/ to /talo/,
but not from /tar/ to /tal/ or /ero/ to /elo/.&lt;/p&gt;
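&lt;p&gt;A conditioned rule can be sketched the same way; this hypothetical helper hard-codes the rule above, rewriting /r/ only when it sits between /a/ and /o/:&lt;/p&gt;

```haskell
-- Hypothetical sketch: apply the conditioned rule "r > l / a_o".
-- Only an /r/ flanked by /a/ and /o/ is rewritten; the trailing /o/ is
-- kept in the input so it can participate in later matches.
applyConditioned :: String -> String
applyConditioned (a : r : o : rest)
  | [a, r, o] == "aro" = a : 'l' : applyConditioned (o : rest)
applyConditioned (c : cs) = c : applyConditioned cs
applyConditioned [] = []

-- applyConditioned "taro" yields "talo"
-- applyConditioned "tar"  yields "tar"
-- applyConditioned "ero"  yields "ero"
```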

&lt;p&gt;We can describe this sound change rule format as a simple
&lt;a href=&quot;http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form&quot;&gt;BNF grammar&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;rule&amp;gt;          ::= &amp;lt;context&amp;gt; &quot;&amp;gt;&quot; &amp;lt;replacement&amp;gt; [&quot;/&quot; &amp;lt;condition&amp;gt;]
&amp;lt;condition&amp;gt;     ::= &amp;lt;context&amp;gt;_&amp;lt;context&amp;gt;
&amp;lt;context&amp;gt;       ::= (&amp;lt;phoneme&amp;gt; | &amp;lt;phoneme-class&amp;gt;)+
&amp;lt;phoneme&amp;gt;       ::= &amp;lt;lowercase-letter&amp;gt;
&amp;lt;phoneme-class&amp;gt; ::= &amp;lt;uppercase-letter&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Close readers will notice that I included in the above grammar the
concept of a &lt;em&gt;phoneme class&lt;/em&gt;. Sound change appliers often accept a
list of phoneme class definitions as input alongside the sound change
rules. These definitions describe sets of phonemes; when a class is
referenced within a context, any of its members may appear in the
specified position.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;V: aeiou
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This line defines a phoneme class &lt;strong&gt;V&lt;/strong&gt; (presumably the &lt;em&gt;vowel&lt;/em&gt; class).
Whenever &lt;strong&gt;V&lt;/strong&gt; appears in a sound change rule, any of /a/, /e/, /i/,
/o/, or /u/ should match.&lt;/p&gt;
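&lt;p&gt;Once class definitions are parsed, expanding a class reference might look like the following sketch (the table and function names here are illustrative, not from the parser we will build):&lt;/p&gt;

```haskell
import Data.Map (Map, findWithDefault, fromList)

-- Hypothetical class table, as it might look after parsing "V: aeiou".
classes :: Map Char String
classes = fromList [('V', "aeiou")]

-- Expand a context symbol: an uppercase class name stands for any of its
-- members, while an ordinary phoneme matches only itself.
expandSymbol :: Map Char String -> Char -> String
expandSymbol table c = findWithDefault [c] c table

-- expandSymbol classes 'V' yields "aeiou"
-- expandSymbol classes 'r' yields "r"
```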

&lt;p&gt;Phoneme classes are extremely useful, since most sound changes
(synchronic and diachronic) apply only in certain phonological contexts
rather than holding universally. Let&amp;rsquo;s amend our grammar so that we can
describe an entire sound change collection:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;file&amp;gt;               ::= (&amp;lt;phoneme-class-defn&amp;gt; &amp;lt;EOL&amp;gt;)* (&amp;lt;rule&amp;gt; &amp;lt;EOL&amp;gt;)+
&amp;lt;phoneme-class-defn&amp;gt; ::= &amp;lt;phoneme-class&amp;gt; &quot;:&quot; &amp;lt;phoneme&amp;gt;+
&amp;lt;rule&amp;gt;               ::= &amp;lt;context&amp;gt; &quot;&amp;gt;&quot; &amp;lt;replacement&amp;gt; [&quot;/&quot; &amp;lt;condition&amp;gt;]
&amp;lt;condition&amp;gt;          ::= &amp;lt;context&amp;gt;_&amp;lt;context&amp;gt;
&amp;lt;context&amp;gt;            ::= (&amp;lt;phoneme&amp;gt; | &amp;lt;phoneme-class&amp;gt;)+
&amp;lt;phoneme&amp;gt;            ::= &amp;lt;lowercase-letter&amp;gt;
&amp;lt;phoneme-class&amp;gt;      ::= &amp;lt;uppercase-letter&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the next installment of this tutorial, we&amp;rsquo;ll use this grammar to
build a Parsec parser that can digest sound change rules.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And conlangers!&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Hillis β-reduction in Haskell</title>
      <link>http://foldl.me/2012/beta-reduction-in-haskell/</link>
      <pubDate>Sun, 26 Aug 2012 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2012/beta-reduction-in-haskell/</guid>
      <description>&lt;p&gt;I decided to rewrite my &lt;a href=&quot;/2012/hillis-beta-reduction-in-clojure&quot;&gt;Hillis β-reduction routine&lt;/a&gt; in Haskell. I
was very pleased to find that the rewrite yielded code much more concise
and less &amp;ldquo;hacky&amp;rdquo; than the original Clojure algorithm.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-haskell&quot; data-lang=&quot;haskell&quot;&gt;&lt;span class=&quot;kr&quot;&gt;module&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Beta&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kr&quot;&gt;where&lt;/span&gt;

&lt;span class=&quot;kr&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;Data.Map&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;alter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;empty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Used to insert or merge map values. Partially apply this function&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- with a merge function and an initial value, then use it in&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- `Data.Map.alter`.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;alterer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Maybe&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Maybe&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;alterer&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Nothing&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Just&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;alterer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Just&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Just&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- beta-reduce a list of keys and values with a given merge function.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Ord&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vals&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beta&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;empty&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vals&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Internal recursive function.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;beta&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;::&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Ord&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;beta&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;beta&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;beta&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ks&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;alterer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;
                            &lt;span class=&quot;kr&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beta&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;map&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ks&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vs&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
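&lt;p&gt;As an aside, the explicit recursion above is equivalent to a left fold over the zipped keys and values. Here is a sketch of the same reduction using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Data.Map.insertWith&lt;/code&gt;, which plays the role of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;alter&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;alterer&lt;/code&gt; pair:&lt;/p&gt;

```haskell
import Data.Map (Map, empty, insertWith)

-- Equivalent reduction via a left fold: for each (key, value) pair,
-- insert the value, merging with f when the key is already present.
-- insertWith applies f to the new value first, matching alterer above.
betaFold :: Ord k => (a -> a -> a) -> [k] -> [a] -> Map k a
betaFold f keys vals = foldl step empty (zip keys vals)
  where step m (k, v) = insertWith f k v m

-- betaFold (+) "aab" [1, 2, 3] yields fromList [('a',3),('b',3)]
```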

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;http://www.haskell.org/ghc/docs/latest/html/libraries/containers/Data-Map.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Data.Map&lt;/code&gt;&lt;/a&gt; turned out to be a lifesaver!&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Hillis beta reduction improvements</title>
      <link>http://foldl.me/2012/better-hillis-beta-reduction/</link>
      <pubDate>Sun, 08 Jul 2012 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2012/better-hillis-beta-reduction/</guid>
<description>&lt;p&gt;&lt;a href=&quot;/2012/hillis-beta-reduction-in-clojure/&quot;&gt;Last week&lt;/a&gt; I introduced the concept of Hillis beta reduction and provided an example implementation in Clojure. There were a few caveats to this implementation, however, mostly stemming from the fact that I &amp;beta;-reduced with sequences and vectors rather than the native &amp;ldquo;xectors&amp;rdquo; of Hillis&amp;rsquo; system. At the risk of adding even more complexity to the demonstration, I&amp;rsquo;d like to rectify some of these problems using a few extra tools to transform our data.&lt;/p&gt;

&lt;h2 id=&quot;xectors&quot;&gt;Xectors&lt;/h2&gt;

&lt;p&gt;I won&amp;rsquo;t provide much detail at all on the xector data type, as I will inevitably botch the majority of the facts. If you&amp;rsquo;re at all interested in parallel computing, I recommend checking out Hillis&amp;rsquo; book, &lt;a href=&quot;http://www.amazon.com/gp/product/0262580977/ref=as_li_tf_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0262580977&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;The Connection Machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For our purposes, we can consider a xector to be equivalent to a Clojure map&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. We can easily redesign our Hillis &amp;beta;-reduction function to take maps as input, but who would want to convert a sequence to a map every time the function is used?&lt;/p&gt;

&lt;h3 id=&quot;the-xector-monad&quot;&gt;The Xector monad&lt;/h3&gt;

&lt;p&gt;For this issue we can design a small &lt;a href=&quot;http://en.wikipedia.org/wiki/Monad_(functional_programming)&quot;&gt;monad&lt;/a&gt; which deliberately breaks the monad laws (cowboy monad?) for demonstration purposes&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. The monad will convert provided seqs into an internal map (xector) representation for use in the &amp;beta;-function and (here&amp;rsquo;s the law-breaking part) leave them as maps when returning results. We could be proper and return the same seq type that was provided, but that would essentially destroy the purpose of the &amp;beta;-reduction in the first place!&lt;/p&gt;

&lt;p&gt;Without further ado:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;use&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;&apos;clojure.algo.monads&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;defmonad&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector-m&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;;; Xector a -&amp;gt; a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
   &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m-result&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m-result-xector&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;vals&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

   &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;;; a -&amp;gt; (Xector b -&amp;gt; Xector c) -&amp;gt; Xector c&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
   &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m-bind&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;fn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m-bind-xector&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xec&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sequential?&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                        &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;into&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;map-indexed&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;vector&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Notice the cheating in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m-result&lt;/code&gt;: a singleton xector is unwrapped and returned as its plain constant value, but any larger xector (i.e., a map built from a seq) is kept as a map.&lt;/p&gt;

&lt;p&gt;In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m-bind&lt;/code&gt;, we convert any type of seq into a map, and keep any other constant value (in our case, we&amp;rsquo;ll use numbers) unmodified&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;β-reduction-redux&quot;&gt;&amp;beta;-reduction redux&lt;/h2&gt;

&lt;p&gt;We define a new multimethod &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xvals&lt;/code&gt; which dispatches on the result of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;map?&lt;/code&gt;&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. This aligns with the result of the monad bind we defined earlier: for any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Xector a&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; will be either a map or a constant.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;defmulti&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xvals&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;map?&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;defmethod&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xvals&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;vals&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;defmethod&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xvals&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;const&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;repeat&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;const&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;defn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;;; (a -&amp;gt; b) -&amp;gt; Xector c -&amp;gt; Xector d&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
     &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;;; (a -&amp;gt; b) -&amp;gt; Xector c -&amp;gt; Xector d -&amp;gt; Xector e&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
     &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;xvals&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;xvals&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
       &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;loop&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;or&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;nil?&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;nil?&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
           &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new-val&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;contains?&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                           &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                           &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
             &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;recur&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;assoc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new-val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))))))))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
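A quick REPL sanity check of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xvals&lt;/code&gt; on its own (restating the two defmethods from above) shows the two dispatch branches:

```clojure
(defmulti xvals map?)
(defmethod xvals true  [xec]   (vals xec))     ; map xector -> its values
(defmethod xvals false [const] (repeat const)) ; constant -> infinite stream

(xvals {0 1, 1 2, 2 5})  ; => (1 2 5)
(take 3 (xvals 7))       ; => (7 7 7)
```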

&lt;p&gt;Now we can perform &amp;beta;-reduction on seqs by binding them inside our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xector-m&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;domonad&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector-m&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Z&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;; =&amp;gt; {X 1, Z 7}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Traditional folds can now return the expected values without a wrapping map:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;domonad&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector-m&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;; =&amp;gt; 8&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
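For comparison, an ordinary fold over the same seq gives the same total:

```clojure
;; Beta reduction against a constant key collapses to a plain fold.
(reduce + '(1 2 5))  ; => 8
```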

&lt;h3 id=&quot;hillis-arity-function&quot;&gt;Hillis&amp;rsquo; arity function&lt;/h3&gt;

&lt;p&gt;Now that we have a better-ported version of the beta function, I can present a fascinating application that Hillis also describes later in his book. It uses both the one- and two-argument forms of the beta function: it simultaneously folds multiple maps into a single map, and then folds that single map to a single value. As always, the code speaks more clearly than I can:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;c1&quot;&gt;;; Return the highest arity of the sequence (i.e., the number of times&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;;; the most often occurring element appears).&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;defn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arity&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;coll&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;domonad&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector-m&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
           &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;coll&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
           &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In the inner &amp;beta;-reduction, every element of the provided seq becomes a key, each paired with the constant value 1. When duplicate keys (duplicate occurrences of a value in the seq) are found, one value of 1 is combined with another using the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt;, forming a count of 2! This process repeats until the entire seq has been digested. Here&amp;rsquo;s a look at the inner &amp;beta;-reduction by itself (notice that its output matches that of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;frequencies&lt;/code&gt;!):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;domonad&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xector-m&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;; =&amp;gt; {2 4, 3 1, 5 1, 8 2}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
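Indeed, the same element-to-count map falls out of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clojure.core/frequencies&lt;/code&gt;, and the outer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(beta max ...)&lt;/code&gt; step then amounts to taking the largest count:

```clojure
;; Standard-library equivalent of the inner beta-reduction:
(frequencies '(2 3 8 2 2 5 8 2))
;; => a map from element to count, equal to {2 4, 3 1, 8 2, 5 1}

;; ...and the outer reduction with max amounts to:
(apply max (vals (frequencies '(2 3 8 2 2 5 8 2))))  ; => 4
```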

&lt;h3 id=&quot;where-from-here&quot;&gt;Where from here?&lt;/h3&gt;

&lt;p&gt;To be frank: not a clue! I am having trouble thinking of names for the process, let alone applications. It will most likely remain nothing more than a &amp;ldquo;thought experiment,&amp;rdquo; as I said in my previous post. Let me know in the comments below if you have any thoughts!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Ignoring all the parallel-processing fun that comes along with xectors, yes.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I still don&amp;rsquo;t fully understand monads, so I may actually be breaking more laws than I intend.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I experimented with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ConstantXector&lt;/code&gt; type and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:constant&lt;/code&gt; metadata key, but both of these methods proved much less elegant than simply leaving the value alone.&amp;nbsp;&lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Dispatches on &amp;ldquo;mappiness?&amp;rdquo;&amp;nbsp;&lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Hillis beta reduction in Clojure</title>
      <link>http://foldl.me/2012/hillis-beta-reduction-in-clojure/</link>
      <pubDate>Tue, 03 Jul 2012 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2012/hillis-beta-reduction-in-clojure/</guid>
      <description>&lt;p&gt;Danny Hillis&amp;rsquo; seminal work &lt;a href=&quot;http://www.amazon.com/gp/product/0262580977/ref=as_li_tf_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0262580977&amp;amp;linkCode=as2&amp;amp;tag=blog0cbb-20&quot;&gt;The Connection Machine&lt;/a&gt; introduced, among many other things, the concept of &amp;ldquo;beta reduction&amp;rdquo; on vectors&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; (I dub this &amp;ldquo;Hillis beta reduction&amp;rdquo; so as not to confuse the term with traditional &lt;a href=&quot;http://en.wikipedia.org/wiki/Lambda_calculus#Beta_reduction&quot;&gt;beta reduction&lt;/a&gt; in the lambda calculus). I found this particular idea fascinating and still applicable today, if only as a quick thought experiment.&lt;/p&gt;

&lt;p&gt;Hillis asserted that the everyday &lt;a href=&quot;http://en.wikipedia.org/wiki/Fold_(higher-order_function)&quot;&gt;fold / reduce routines&lt;/a&gt; that we have all come to know and love are merely a subset of a larger scheme of operations that can be performed on ordered data structures. The overarching process is named the &amp;ldquo;beta function,&amp;rdquo; and accepts one function and two vectors as arguments. The result of this function is the combination of the two vectors into a map, using the first vector to form the map&amp;rsquo;s values and the second vector to form the map&amp;rsquo;s keys. When duplicate keys are found, the provided function is used to &amp;ldquo;combine&amp;rdquo; the corresponding values. It&amp;rsquo;s an interesting process that&amp;rsquo;s much easier to understand given an example:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Z&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;; =&amp;gt; {X 1, Z 7}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Using the positions of each element to match up the keys and values of the map to be formed, the beta function pulls data together like a zip function. When the key &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Z&lt;/code&gt; is encountered a second time, its two values are combined using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt; function we provided, so the final value corresponding to the key &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Z&lt;/code&gt; in the map is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(+ 2 5)&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;7&lt;/code&gt;.&lt;/p&gt;
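&lt;p&gt;As a quick sanity check (my own aside, not part of Hillis&amp;rsquo; formulation), the same merge behavior can be sketched with Clojure&amp;rsquo;s built-in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;merge-with&lt;/code&gt;: build a one-entry map per key/value pair, then merge them all, combining colliding values with the supplied function:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;; one single-entry map per key/value pair: ({X 1} {Z 2} {Z 5}),
; then merge, combining values for duplicate keys with +
(apply merge-with + (map hash-map &apos;(X Z Z) &apos;(1 2 5)))  ; =&amp;gt; {X 1, Z 7}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;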

&lt;p&gt;What is traditional list folding, then? Why, it&amp;rsquo;s just beta reduction with a certain constant second argument:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;repeat&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;; =&amp;gt; {1 8}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When we provide the beta function with the same key for every value (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt;, in this case), all of the values are combined under that single key using the provided function. This is reduction in a different form!&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Below is an implementation of Hillis&amp;rsquo; beta function in Clojure. I included a shorthand two-argument arity which gives the same result as a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduce&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;; =&amp;gt; 8&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Feel free to play around!&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;defn&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
     &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;repeat&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
     &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;loop&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
       &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;or&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;nil?&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;nil?&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new-val&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;contains?&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                         &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
           &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;recur&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;assoc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;acc&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new-val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rest&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))))))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
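&lt;p&gt;If the explicit loop feels heavy, here&amp;rsquo;s a more compact sketch of the two-vector case using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduce&lt;/code&gt; over the zipped pairs. It should behave the same for inputs that don&amp;rsquo;t contain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt; (note that the loop above treats a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt; element as the end of input):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-clojure&quot; data-lang=&quot;clojure&quot;&gt;(defn beta-compact [f c1 c2]
  (reduce (fn [acc [v k]]
            (assoc acc k
                   (if (contains? acc k)
                     (f (get acc k) v)  ; duplicate key: combine values
                     v)))
          {}
          (map vector c1 c2)))  ; zip values with keys, stopping at the shorter

(beta-compact + &apos;(1 2 5) &apos;(X Z Z))  ; =&amp;gt; {X 1, Z 7}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;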


&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For simplicity&amp;rsquo;s sake, I deliberately ripped out Hillis&amp;rsquo; concept of beta reduction from its containing system of parallel processing with xectors. References to this particular domain have been shamelessly replaced with Clojure-specific terms.&amp;nbsp;&lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;You may notice that this alternate fold returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{1 8}&lt;/code&gt; rather than a simple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8&lt;/code&gt;. This is due to my somewhat-haphazard readaptation of Hillis&amp;rsquo; concept for Clojure: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{1 8}&lt;/code&gt; that is returned is technically correct within Hillis&amp;rsquo; system of xectors, but is not convenient for those of us living in a von Neumann world. I experimented in using a xector monad as a sort of shim to better integrate beta-reduction into Clojure, but did not develop it far enough to make it merit more than a mention in a footnote.&amp;nbsp;&lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    
    
    <item>
      <title>Disabling electric indenting in Emacs modes</title>
      <link>http://foldl.me/2012/disabling-electric-indent-mode/</link>
      <pubDate>Mon, 02 Jul 2012 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2012/disabling-electric-indent-mode/</guid>
      <description>&lt;p&gt;I am just getting settled in with &lt;a href=&quot;http://orgmode.org&quot;&gt;Org Mode&lt;/a&gt; for Emacs and am constantly amazed at its versatility and wide feature set. One problem has been bugging me in Org for quite a while now, though: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;electric-indent-mode&lt;/code&gt;, which I use for auto-indentation when programming, gets in the way by auto-indenting Org headers.&lt;/p&gt;

&lt;p&gt;My annoyance reached a critical point earlier this morning, and so I set off in search of a fix to disable electric indenting &amp;ldquo;mode-locally&amp;rdquo; &amp;mdash; that is, disable the mode in Org buffers but not in buffers of any other mode. The catch with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;electric-indent-mode&lt;/code&gt; is that it is a global minor mode &amp;mdash; something that is enabled once and assumed to be necessary for all buffers.&lt;/p&gt;

&lt;p&gt;After a bit of searching, however, I was glad to find that the author of the mode had left in a backdoor for customizing its functionality. Meet &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;electric-indent-functions&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Special hook run to decide whether to auto-indent.
Each function is called with one argument (the inserted char), with point right after that char, and it should return t to cause indentation, `no-indent&amp;rsquo; to prevent indentation or nil to let other functions decide.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Perfect! The default value for this variable (as of Emacs 24.0.92.1) is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt;, so I made the choice to recklessly overwrite this variable at a &lt;a href=&quot;http://www.gnu.org/software/emacs/manual/html_node/emacs/Locals.html&quot;&gt;buffer-local level&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Enough technoblabber; here&amp;rsquo;s the fix. Add the following code into an Emacs Lisp file that gets run on initialization:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cl&quot; data-lang=&quot;cl&quot;&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;add-hook&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&apos;org-mode-hook&lt;/span&gt;
          &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;make-local-variable&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&apos;electric-indent-functions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                 &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&apos;no-indent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
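&lt;p&gt;(A note for readers on newer builds: if I&amp;rsquo;m reading the NEWS right, Emacs 24.4 and later ship a buffer-local toggle, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;electric-indent-local-mode&lt;/code&gt;, which makes the hook a one-liner. I haven&amp;rsquo;t tested this on my 24.0 setup, so take it as a sketch:)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-cl&quot; data-lang=&quot;cl&quot;&gt;;; Emacs 24.4+: disable electric indent in Org buffers only
(add-hook &apos;org-mode-hook
          (lambda () (electric-indent-local-mode -1)))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;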

&lt;p&gt;You can find the latest version of my Org mode config in my &lt;a href=&quot;https://github.com/hans/dotfiles/blob/master/emacs.d/scripts/org.el&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dotfiles&lt;/code&gt; repo&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    
    
    
    <item>
      <title>Post excerpts in Jekyll</title>
      <link>http://foldl.me/2012/jekyll-excerpts/</link>
      <pubDate>Sat, 21 Jan 2012 00:00:00 +0000</pubDate>
      <author>Jon Gauthier</author>
      <guid>http://foldl.me/2012/jekyll-excerpts/</guid>
      <description>&lt;p&gt;There&amp;rsquo;s a &lt;a href=&quot;http://www.jacquesf.com/2011/03/creating-excerpts-in-jekyll-with-wordpress-style-more-html-comments/&quot;&gt;fairly simple method&lt;/a&gt; for creating WordPress-like post excerpts in Jekyll using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;!-- more --&amp;gt;&lt;/code&gt; tag, but unfortunately it requires the installation of a Jekyll plugin, a feature unavailable on a GitHub-hosted Jekyll instance.&lt;/p&gt;

&lt;p&gt;What&amp;rsquo;s a bored geek like me to do? Find another way! <!-- more --> The same excerpt functionality described in the article linked above can be replicated by composing a few &lt;a href=&quot;https://github.com/Shopify/liquid/wiki/Liquid-for-Designers&quot;&gt;Liquid filters&lt;/a&gt; in the site layout code, like so:&lt;/p&gt;

&lt;pre&gt;&amp;#123;{ post.content | split: &apos;&amp;lt;!-- more --&amp;gt;&apos; | first }}&lt;/pre&gt;

&lt;p&gt;This little bit of code will output the content of a post until it sees a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;!-- more --&amp;gt;&lt;/code&gt; tag inside the post content. Just insert the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;!-- more --&amp;gt;&lt;/code&gt; tag into your posts wherever you&amp;rsquo;d like them to be cut off.&lt;/p&gt;
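&lt;p&gt;For context, here&amp;rsquo;s roughly how the filter chain sits inside an index layout&amp;rsquo;s post loop (a sketch; adapt the markup to your own layout):&lt;/p&gt;

&lt;pre&gt;&amp;#123;% for post in site.posts %}
  &amp;lt;h2&amp;gt;&amp;lt;a href=&quot;&amp;#123;{ post.url }}&quot;&amp;gt;&amp;#123;{ post.title }}&amp;lt;/a&amp;gt;&amp;lt;/h2&amp;gt;
  &amp;#123;{ post.content | split: &apos;&amp;lt;!-- more --&amp;gt;&apos; | first }}
&amp;#123;% endfor %}&lt;/pre&gt;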

&lt;p&gt;You can see this snippet in action in my &lt;a href=&quot;/&quot;&gt;site index template&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    
    

  </channel>
</rss>
