In this post, I'm going to discuss an alternative resolution to this paradox that I've been playing around with. It started developing while writing a paper with Sharon Goldwater (which will appear soon in the Journal of Memory and Language). Briefly, several people have been exploring the proposal that human speech is adapted for communication that is efficient, in a way that is defined by information theory. Information theory defines certain limits on how quickly information can be transferred, and part of our motivation for the paper was the realization that, for natural language, there is a conflict between incremental speech (producing and comprehending speech one word or unit at a time) and information-theoretic efficiency.

The details of this realization were beyond the scope of that paper, but in this post I want to describe this conflict and explore some future directions that capitalize on it directly.

Let's be more specific about what "efficient in a way that is defined by information theory" means. The basic set-up in information theory is that we have some message composed of message characters, and we want to transfer it to our listener by translating the message into some signal composed of signal characters. For convenience, the signal characters are usually taken to be just ones and zeros. A good code in information theory has two fundamental properties:

- It is short: each message has, on average, a short signal (few signal characters).
- It has a low error rate: the message the receiver (listener) interprets is almost always the same as the one the sender (talker) intended.

We'll ignore the error rate in this post and just consider the first point of looking for short signals. Let's adopt the following notation. All of the possible message characters are denoted by a subscript index $i$; if our alphabet of message characters is the English alphabet, $m_1 = $ "A", $m_2 = $ "B", and so on. We'll denote the length of the signal assigned by the code to message character $m_i$ by $l_i$. For example, if we decide to encode the message character "A" with 00111, then $l_1 = 5$.

We can compute the expected length of a code using the usual formula for the expected value of a function:

$$E[l] = \sum_i p(m_i)\, l_i$$

Now we can ask: what is the shortest possible code? I.e., what is the optimal set of length assignments that minimizes the expected length?

It turns out that there is a single answer: $l_i = -\log p(m_i)$, where the base of the logarithm is the size of the signal alphabet (i.e., it is 2 if the signal alphabet is binary letters, it is about 40 if the signal alphabet is English phonemes, etc.). This quantity is variously called the surprisal or Shannon information of $m_i$, and the expected surprisal is the entropy of the probability distribution over message characters. Adjusting your code so that each $l_i$ matches its Shannon information is called *source coding*. Intuitively, this means our signals will be shorter on average if we use our shorter signals for more frequent messages. For example, if the phonological form of a word is much longer than its Shannon information, we could improve our efficiency by making that word shorter. Conversely, if the phonological form of a word is much shorter than its Shannon information, we could improve our efficiency by lengthening that word, thereby freeing up a short phonological form to be used by a different, more frequent word.
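To make this concrete, here is a minimal Python sketch of surprisal and entropy. The four-character distribution is made up for illustration, and deliberately chosen so the lengths come out as whole numbers (which, as we'll see next, is the exception rather than the rule):

```python
import math

# A hypothetical distribution over four message characters.
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

def surprisal(prob, base=2):
    """Optimal code length for a message character: -log_base p."""
    return -math.log(prob, base)

# Optimal lengths: frequent characters get short codes.
lengths = {c: surprisal(q) for c, q in p.items()}
# A -> 1 bit, B -> 2 bits, C and D -> 3 bits each

# Entropy = expected surprisal = shortest achievable expected length.
entropy = sum(q * surprisal(q) for q in p.values())
# 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits per character
```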

At this point you may notice a problem: $-\log p(m_i)$ will almost never be an integer, and so we might have trouble approaching the Shannon information by adjusting the lengths of codes for individual message components, such as the phonological forms of those message components, since lengths are all integer values. For example, if the Shannon information of some message component is 2.4, and we need to encode it twice, we'll have to pick a codeword that is 3 characters long, for a total of 6 characters when the theoretical limit uses only 4.8.

There are approaches to coding that get around this limit. The simplest is called "block coding." In a block code, we encode *sequences* of message characters instead of individual message characters. In the example above, instead of providing one codeword for each message character, we can provide one codeword, of length 5, for the pair of characters. This new codeword is much closer to the theoretical limit, and we can approach the theoretical limit arbitrarily well (in the limit of long messages) by picking longer codewords. Using a block code has a downside, however, because we lose a degree of incrementality: we have to wait until the end of the block to interpret the message characters in the block.
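Using the 2.4-bit example from above, a quick sketch of how block size affects the per-character cost (the numbers are just the ones from the example):

```python
import math

# A message component whose Shannon information is 2.4 bits,
# encoded either one occurrence at a time or in blocks.
s = 2.4

one_at_a_time = math.ceil(s)      # 3 bits per occurrence
pair_block = math.ceil(2 * s)     # 5 bits for a block of two occurrences
per_char = pair_block / 2         # 2.5 bits per occurrence

# Longer blocks approach the 2.4-bit limit arbitrarily closely:
for n in (1, 2, 10, 100):
    print(n, math.ceil(n * s) / n)
```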

However, block coding is not the only way to circumvent this obstacle. Arithmetic coding, in fact, can be processed incrementally while approaching theoretical efficiency limits. Is it possible that natural language implements some other strategy, so there is no tension between incrementality and brevity?

Now I will describe my reason to think that natural language does not implement one of these other strategies. First, observe that a substantial portion of linguistic meaning is *compositional*: the meaning of the whole is analyzed into parts, each of which is then encoded into a signal. For example, a $\lambda$-expression like $\exists x[\mathrm{cat}(x) \wedge \mathrm{runs}(x)]$ is divided into sub-expressions:

$$\lambda P \lambda Q\, \exists x[P(x) \wedge Q(x)], \qquad \lambda x.\mathrm{cat}(x), \qquad \lambda x.\mathrm{runs}(x)$$
and each sub-expression is encoded into a word: Some, cat, runs. So the length of the signal for the message is the sum of the lengths of the signals for the parts of the message.

To make this precise, we need to introduce just a touch more notation, so we can talk about entire messages that are composed of message characters. A complete message is denoted by boldface $\mathbf{m}$, and we will index locations within the message with superscripts: $m^j$ is the $j$th character of message $\mathbf{m}$. Then, we can see that because language is compositional:

$$l(\mathbf{m}) = \sum_j l(m^j)$$
Let's call this the "sum of lengths" equation.

Ok, there's just one more concept. In the Entropy section, we looked at how to compute optimal lengths, given a probability distribution. However, with a little algebra, we can see how to go the other way. If I am *given* a set of signal lengths, what is the probability distribution for which they *would* be optimal? We call this the *implicit* probability distribution that the code assumes is true. Since we take a log to compute lengths from probabilities, we exponentiate to compute probabilities from lengths. Assuming a binary signal alphabet:

$$q(m_i) = \frac{1}{Z} 2^{-l_i}$$
(don't worry too much about the $Z$ term. It just ensures that $q$ sums to 1. $Z$ will be equal to 1 if our code is *complete*, i.e. there are no "accidental gaps" in the set of signals). Now, we can ask: What does natural language assume about the distribution over messages? For simplicity, let's assume a binary signal alphabet; we'll see the signal alphabet doesn't matter because it cancels. Because natural language is compositional, we can use the "sum of lengths" equation to show that natural language assumes that the probability of a message is a product of the probabilities of each part of the message:

$$q(\mathbf{m}) = \frac{1}{Z} 2^{-l(\mathbf{m})} = \frac{1}{Z} 2^{-\sum_j l(m^j)} = \frac{1}{Z} \prod_j 2^{-l(m^j)} \propto \prod_j q(m^j)$$
In other words, because natural language is compositional, it assumes that different components of a message are statistically independent! However, different parts of linguistic messages are highly dependent. Thus, natural language implicitly relies on a family of probability distributions that is far too constrained to capture the true distribution over messages for optimal source coding.
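A small sketch of this point, with made-up codeword lengths: because lengths add under compositionality, the implicit probabilities are forced to multiply.

```python
# Hypothetical codeword lengths (in bits) for three message components;
# the specific numbers are made up for illustration.
lengths = {"some": 2, "cat": 3, "runs": 4}

def implicit_prob(l):
    """The probability for which a length-l binary codeword would be
    optimal, assuming a complete code (so the Z term is 1)."""
    return 2 ** (-l)

# Compositionality: the length of the whole message is the sum of the
# lengths of its parts...
total_length = sum(lengths.values())

# ...so the implicit probability of the whole message is the product of
# the implicit probabilities of its parts.
prob_product = 1.0
for l in lengths.values():
    prob_product *= implicit_prob(l)

assert prob_product == implicit_prob(total_length)
```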

This view suggests that language is far from optimal in the sense of information theory. It does not, however, mean that there is not any pressure in the direction of optimal behavior (indeed, in my paper with Sharon Goldwater we conclude that there is evidence of pressure in the direction of optimal behavior). This view does make some suggestions about what such moves towards optimal behavior should look like. We could, for example, use the same block coding trick as before: instead of encoding grammatical elements as separate message components, encode ensembles of grammatical elements simultaneously. Now, however, we are dealing with a much larger space of message components, such as all the words in the language, so we'd want to get a lot of "bang for our buck": the increase in fit of our low-dimensional approximation must be very large relative to the additional parameters introduced by the larger block. A likely example here is the reduction of "going to" to "gonna." This happens only when "to" is part of an infinitive, never when "to" is part of a prepositional phrase:

- I'm gonna school you in Chess!
- *I'm gonna school tomorrow.

In "gonna", we see message components that were previously encoded in two separate words encoded in one amalgamated and overall shorter block.

This is a process called grammaticalization, and Joan Bybee and others have long proposed that usage-based processes drive these kinds of changes. What does this extra information-theoretic framework give us? Well, first, it opens up a library of theoretically-motivated measures for a broader range of data-driven tests. Secondly, I still have a lot of reading to do in the grammaticalization literature, but, as far as I can tell, the frequency effects are assumed to be given, as some kind of basic feature of how brains work. This framing suggests that the role of frequency effects in grammaticalization may itself reflect an adaptation towards some balance between incrementality and brevity.


The cooler solution is to use various networking systems to get access when not actually in the office. At Edinburgh, I used a combination of Virtual Private Network (VPN) and Andrew File System (AFS). Specifically, with openafs (available in the Arch User Repository), I was able to mount my network share by starting the openafs-client service, authenticating with Kerberos using `kinit`, and passing the Kerberos authentication ticket to openafs with `aklog` (I've attached my kerberos configuration file krb5.conf to this post). I was also able to `ssh` into Informatics servers without a password by using `ssh -K`, which uses the kerberos ticket. So that let me access my files on the network transparently, and sign into my desktop by first signing into the Informatics server and then signing into my desktop machine.

Next, I set up openvpn to sign in to my desktop immediately without first `ssh`'ing into the Informatics ssh server. I've attached my configuration file to this post, named "Informatics-via-Forum.conf", along with the certificate file "EdUniRootCA.crt" and tls authorization file tls.auth. There is another file pass.txt that should have your username on the first line and password on the second line (kerberos authentication doesn't seem to work with openvpn). With this set up, I could ssh into my desktop machine by just typing `ssh <desktop_IP>` directly.

It's also worth noting that kerberos authentication is secure even when you don't trust your connection. When I've visited China and wanted to access the internet in a secure way or bypass the firewall, I'd authenticate with kerberos, and then set up a socks5 proxy using `ssh`:

`ssh -C2qTnN -D 8080 -K <studentid>@student.ssh.inf.ed.ac.uk`

(I aliased this command to infproxy in my .bashrc).

While this command is running, I can start up chromium to tunnel everything through the proxy with `chromium --proxy-server="socks5://localhost:8080"`, use foxyproxy to be fine-grained about what gets tunneled with firefox, and use `proxychains` to tunnel arbitrary command line programs through the proxy. (Use this proxy sparingly though, so you don't overload your ssh server!)

Macquarie doesn't seem to have AFS set up, but there is a VPN called OneNet Anywhere. This VPN does not use the open-source openvpn but rather Microsoft's proprietary pptp protocol. To set it up, I installed the open-source client pptp, which is available in Arch's core repository, and configured it with the attached options.pptp, Macquarie_OneNet_Anywhere, and chap-secrets files. You should modify Macquarie_OneNet_Anywhere and chap-secrets to contain your mq id and password.

Once it's configured, you can start it up with (the initial "`#`" means with root permissions):

`# pon Macquarie_OneNet_Anywhere`

and turn it off with

`# poff`

This opens the connection, but you still have to specify what goes to the VPN and what goes through your normal connection. I really only want to access my workstation, so currently I modify my routing table to use the VPN only for packets that want to go to my workstation's IP address:

`# ip route add <workstation_IP> dev ppp0`

This uses iproute2 to add a new route for packets to go through the ppp0 interface, which is the VPN's interface, if their final destination is my workstation's IP.

Finally, you can use system monitoring programs like gkrellm or conky to check that packets are being routed properly. In this screenshot from my laptop, you can see towards the bottom of gkrellm that ppp0, the VPN interface, has a small spike every minute or so because I've ssh'd in but haven't been doing anything, while there's a lot more activity along the wlan0 interface, my usual wireless interface.

And now you too can slip in and out of your office invisibly!

Here are all of the configuration files I mentioned in this post.


To start, let's talk a bit about the very basics of what neurons have to do with computation, and then turn to Hebbian learning. Neurons "fire" by sending out an electrical charge, and a neuron is more likely to fire when it receives electrical charges from the neurons it is connected to. However, not all neural connections are the same: some are strong and some are weak, and a neuron may not fire if it receives a charge from many weak connections, but may fire if it receives a charge from a few strong connections.

We can understand this better by imagining a very small artificial neural network, with two "input" neurons $x_1$ and $x_2$ that are both connected to one "output" neuron $y$:

(This kind of network would actually not really be suitable for Hebbian learning; I'm just trying to show how different weights can encode different functions that might be useful for cognition.)

As a simplification, let's say that, at any given time, if an input neuron fires, it sends a 1 along the connection to the output neuron, and otherwise it sends a 0. To model the relevance of strong and weak connections, whatever an input neuron sends out is multiplied by the weight of its connection to the output neuron. So at any given time, the output neuron receives as *activation* whether the first input neuron fires times its weight, plus whether the second input neuron fires times its weight:

$$a = w_1 x_1 + w_2 x_2$$
We can then say that the output neuron fires if its incoming activation is greater than or equal to one. If we set $w_1$ and $w_2$ both to 1, then the output neuron fires if one or both input neurons fire: the network computes $x_1$ OR $x_2$. Alternatively, we can set $w_1$ and $w_2$ both to 0.5, and the output neuron fires only if *both* input neurons fire: the network computes $x_1$ AND $x_2$. Finally, we can set $w_1$ to 1 and $w_2$ to -1, in which case the output neuron fires only when the first input fires and the second does not: the network computes $x_1$ AND NOT $x_2$. So we can see that networks of neurons can in principle compute different functions of the input if all we do is change the strength of their connections. In fact, if we allow multiple layers of neurons, with output neurons that in turn connect to other neurons, this kind of network is capable of representing *any* function of the input with enough neurons.
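For concreteness, the toy network and its weight settings can be sketched in a few lines of Python (the function name and the specific weights shown are just illustrative choices):

```python
def output_fires(x1, x2, w1, w2, threshold=1.0):
    """Output neuron fires iff incoming activation meets the threshold."""
    return w1 * x1 + w2 * x2 >= threshold

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# w1 = w2 = 1: fires if either input fires (OR)
assert [output_fires(a, b, 1, 1) for a, b in inputs] == [False, True, True, True]

# w1 = w2 = 0.5: fires only if both inputs fire (AND)
assert [output_fires(a, b, 0.5, 0.5) for a, b in inputs] == [False, False, False, True]

# w1 = 1, w2 = -1: fires only if the first fires and the second doesn't
assert [output_fires(a, b, 1, -1) for a, b in inputs] == [False, False, True, False]
```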

Under the proposal that cognition is computation, then, this all speaks towards the view that the biological implementation of learning comes down to finding the right weights. Brains have immense numbers of neurons and astronomical numbers of connections: the usually-reported numbers are 100 billion neurons with about 10,000 connections each. Each. It is hard to imagine how these weights could be encoded in the comparatively small human genome, and so most work has looked for biologically-plausible learning rules that could apply throughout the brain.

Hebb discovered, appropriately, Hebbian learning. The intuition is simple: when two neurons fire at the same time, any connections between them become stronger. Computationally, this allows the network to encode correlations in the environment. More importantly, biological neurons behave in this way, with a telling wrinkle. In the simple intuition of Hebbian learning stated above, connections only become stronger. For biological neurons, however, there are physical limits on how strong connections can be. This leads to a "saturation" problem where, eventually, all connections are as strong as physically possible. It turns out, however, that biological neurons also *weaken* connections that rarely fire together. These two rules, one for strengthening and one for weakening, allow effective Hebbian learning within fixed limits on biological connections.
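A minimal sketch of this two-part rule; the learning rates and limits here are illustrative, not taken from the papers:

```python
# Illustrative constants: strengthening rate, weakening rate, physical limit.
ETA_UP, ETA_DOWN, W_MAX = 0.2, 0.05, 1.0

def update(w, pre_fired, post_fired):
    if pre_fired and post_fired:
        w += ETA_UP                     # co-firing strengthens the connection
    else:
        w -= ETA_DOWN                   # otherwise it slowly weakens
    return min(max(w, 0.0), W_MAX)      # connection strength has physical limits

# Correlated firing drives the weight up; the weakening term keeps it from
# simply saturating at W_MAX and staying there.
w = 0.5
for pre, post in [(1, 1), (1, 1), (0, 1), (1, 1), (0, 0)]:
    w = update(w, pre, post)
```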

Ok, on to the results of the actual papers that inspired this post. Most work has ignored the question of which neurons connect to which neurons. Computational models of neural networks typically have pretty simple structure, with every neuron either connected to every neuron (or maybe to every neuron in a different layer) or connected to random neurons. This *tabula rasa* approach assumes that the only interesting structure in neural connections is learned structure in the weights. The papers look for, and find, interesting structure in neural connections that is not learned (at least not in a Hebbian way).

Specifically, they take slices of brains from 14-day-old rats, and look at structure not in the strength of connections but whether or not any connection exists at all, finding a number of interesting predictors. First, as you would expect, neurons are, broadly, more likely to be connected if they are near each other. However, this trend holds only for distances greater than 100 micrometers; neurons that are 50 micrometers apart are less likely to be connected than neurons that are 100 micrometers apart, even though they are closer to each other. So distance matters, but something else also matters, at least for the first 150 micrometers or so.

That something else is whether the two neurons under consideration are also connected to other neurons. So, if we want to know if A and B are connected, then we should check how far apart they are, and also see how many other neurons they are both connected to. If they are connected to many other neurons, then there is a much higher chance that they are also connected to each other. Even more interestingly, it turns out that there are recurring patterns in the actual shape of the connections, which the paper calls *motifs*.

These results together indicate that neurons do not form connections randomly but focus on forming these small (2-6 neuron) motif networks that then connect to each other. But here's the real kicker: remember that saturation problem with Hebbian learning, where connection strengths hit physical limits? According to the paper, connections within these motifs *are* saturated, so meaningful Hebbian learning only occurs *between* motifs.

The authors of these two papers talk about these motifs as "Lego-like building blocks for memory and perception," with each motif encoding a fundamentally different way of integrating input activations, and with motifs joining together to form larger motifs. Now, I think the authors are speculating at this point, but they say that the stability of these motifs throughout a species provides a stable basic toolkit for high-level perception and memory, explaining why different members of a given species appear to perceive inputs in similar ways. That is, if perception and memory were strictly based on learning from the environment, we would expect enormous individual variation throughout a species. While some variation certainly exists, perception is remarkably stable throughout a species. They propose that this stability is the result of different individuals relying on similar innately specified relative prevalences of these motifs.

So what might this have to do with language? At this point, it doesn't necessarily have *anything* to do with language. These motifs are extremely low-level structures, involving literally fewer than 10 cells, and it's entirely possible that the computations involved in language happen at a scale that is so much larger that this low-level detail is not relevant. However, this is my blog so I can do what I want, and right now I want to speculate wildly on what these motifs might potentially have to do with language.

As I mentioned in my Principles & Parameters (P&P) post, one of the central questions of modern linguistics, and specifically those parts that involve the acquisition, representation, and processing of syntax, has been what kinds of grammatical regularities exist across languages. For example, languages tend to put syntactic heads either at the beginning of syntactic phrases or at the ends of syntactic phrases. In my P&P post, I described how P&P tries to exploit these regularities by genetically-specifying an auxiliary Parameter space that explicitly represents each Parameter, and then exploring this space (that is exponentially large in the number of parameters) by way of a random walk. I also described how data-driven approaches similarly exploit a smaller space, but, in this case, the smaller space is a subspace of the observed and unobserved variables whose shape is determined by both the observed data and the innate functional form of the assumed statistical dependencies.

I'm not sure how these motifs/lego blocks would fit into the P&P approach, but they look an awful lot like they could provide an inventory of innate functional forms for statistical dependencies. Hebbian learning algorithms, at least in the simple idealized forms I'm familiar with that don't worry too much about the neural hardware, implicitly rely on linear dependencies in a continuous space (i.e. Hebbian learning basically does Principal Components Analysis, which assumes linear components).

I see three basic ways to exploit non-linear dependencies in a high-dimensional space to find that useful lower-dimensional manifold. First, we can just fit lots of linear functions and locally approximate the manifold with just the linear functions. Second, we can move to more general ``manifold-embedding'' methods that build a mesh in the high-dimensional space based on the observed data points that should trace out the manifold on which they lie. Third, we can augment our original high-dimensional space with even more dimensions that are non-linear transforms of the original dimensions, and hope that the non-linear relationships become linear once we get these extra dimensions (this is the basic idea behind kernel methods). Motifs, and their compositions, might provide a good enough set of non-linear transforms to enable language learning and processing.

I think a fully worked-out model would concretely relate the shapes of these motifs to, say, syntactic subtrees, and compute parse probabilities in terms of these subtrees, perhaps relying on these subtrees in proportion to the prevalence of the corresponding motif in areas of the brain that process syntax. We could then evaluate variants of this neurally-inspired model by seeing to what extent it produced accurate parses and predicted measures of syntactic processing, such as reading times.

So what do y'all think of my wild speculation?


The authors of the original paper performed their entire analysis in a spreadsheet program, specifically Excel. There are reasons to avoid Excel in particular for serious analysis, but I think their two errors resulted from characteristics of spreadsheet programs in general. The first, and clearest, error was their accidental exclusion of countries with high debt but average economic performance from the economic growth average. This kind of mistake is very easy to make in a spreadsheet, because spreadsheets do not clearly differentiate values from formulas. That is, if you look at a cell, it is impossible to tell whether the number in it is directly typed in, or computed from some other cells, and, if the latter, how the computation was performed. To check the formula, one has to click in the cell. Moreover, spreadsheet programs typically contain logic that tries to guess how formulas should be changed when the spreadsheet is changed, when to update cells, and what to do when some value or computation is missing or invalid. All of this behind-the-scenes sleight-of-hand makes spreadsheet programs more "user-friendly" in the sense that users have to do less to get a number, but at the expense of obscuring the relationships between the data and the statistical techniques used.

It also looks like the second error resulted from the authors' reliance on spreadsheet-style analysis. Specifically, they compute the average economic performance by first computing an average performance per country, and then averaging the countries' averages. This is inappropriate because the data are unbalanced: there are many data points for some countries, but very few for others. Thus, we should be confident about the within-country average for countries with many data points, but less confident about the within-country average for countries with few data points. One of the commenters on Konczal's article put it well: their method is like taking one batter with 100 at-bats and a 0.200 average and another batter with 1 successful at-bat, and concluding that, together, they bat 0.600 on average.
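The batting analogy is easy to check in a couple of lines of Python; the player names and at-bat records are just the hypothetical ones from the comment:

```python
# 100 at-bats with a .200 average, versus 1 successful at-bat.
hits = {"player_a": [0] * 80 + [1] * 20,
        "player_b": [1]}

# Average of per-player averages: weights 1 at-bat as heavily as 100.
avg_of_avgs = sum(sum(h) / len(h) for h in hits.values()) / len(hits)
# (0.200 + 1.000) / 2 = 0.600

# Pooled average: weights every at-bat equally.
pooled = sum(sum(h) for h in hits.values()) / sum(len(h) for h in hits.values())
# 21 / 101, roughly 0.208
```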

Why would they weight countries in the average in this way? Well, I suspect it's because it is easy to do averages of averages in a spreadsheet: select the cells containing the averages, and click the "Average" option. As a non-economist with statistical training, my instinct for analyzing this data is to perform a mixed-effects regression of economic performance against debt-to-GDP ratio, with random effects (ideally both random intercepts and random slopes) for country, maybe country region (i.e. Europe, Asia, and so on), and a control fixed-effect of year. Such a model would estimate how the debt-to-GDP ratio predicted economic growth, accounting for the fact that the relationship may vary according to country and across time. However, spreadsheets typically do not contain such interesting models.

So: should we just avoid spreadsheets? I don't think so. Spreadsheets can be good for eyeballing data and doing preliminary analyses. I often use `gnumeric` to do just this. The problem comes from relying entirely on spreadsheets, or indeed any computational tool (although the "user-friendliness" of spreadsheets makes them more often instruments of abuse). All computational tools have drawbacks, and we should be familiar with a variety of tools. On the natural language toolkit (NLTK) mailing list, for example, I've seen e-mails asking how to build frequency lists from English text files. If you're building a larger application using NLTK, it might make sense to do this using NLTK (because then it will already be loaded into the program), but if all you want is a frequency dictionary, and your text is tokenized by spaces, then standard `unix` command-line tools are all you need. For example, given penn treebank-style strings in a file strings.txt:

this is an example string . here is another string .

we can get one word per line with the command:

$ cut -d " " --output-delimiter "
" -f 1- strings.txt
this
is
an
example
string
.
here
is
another
string
.

(the double quotes allow the space and carriage return to be interpreted as characters rather than a tokenizer of the string or command execution, respectively). To get a frequency list, pipe the output into `sort` and `uniq -c`:

$ cut -d " " --output-delimiter "
" -f 1- strings.txt | sort | uniq -c
      2 .
      1 an
      1 another
      1 example
      1 here
      2 is
      2 string
      1 this

We can sort by decreasing frequency with another pipe:

$ cut -d " " --output-delimiter "
" -f 1- strings.txt | sort | uniq -c | sort -nr
      2 string
      2 is
      2 .
      1 this
      1 here
      1 example
      1 another
      1 an

If there are potentially many files scattered across several directories, there's no need to cook up a directory-traversal function yourself; use `find`:

$ find . -name "*.txt" -exec cut -d " " --output-delimiter "
" -f 1- {} \; | sort | uniq -c | sort -nr
      2 string
      2 is
      2 .
      1 this
      1 here
      1 example
      1 another
      1 an

Text files that are not penn-treebank formatted may need a bit more piping and processing, but a frequency dictionary in and of itself does not demand a full-blown NLP library. Being proficient with several tools allows us to build this frequency dictionary more quickly. More importantly, even if we do end up building the frequency dictionary inside the python code, we have a variety of techniques for *double-checking* our results. Because those checks are relatively fast and easy, we are more likely to actually do them.
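For instance, if we do build the frequency dictionary in Python, `collections.Counter` gets us the same list as the shell pipeline; the function name here is my own, and the sketch assumes space-tokenized text as in the treebank example:

```python
from collections import Counter

def freq_list(lines):
    """Count space-separated tokens, like the cut | sort | uniq -c pipeline."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts.most_common()   # sorted by decreasing frequency

# The treebank-style example from above:
result = freq_list(["this is an example string . here is another string ."])
# To run over a file instead: freq_list(open("strings.txt"))
```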

Finally, using command-line tools in particular is nice because it is easier to keep a record of what you did. In R, for example, you can use `txtStart(file="logfile.log")` and `txtStop()` from the TeachingDemos package to keep a text record of your entire session. Or, more generally, if you do your work in gnu `screen`, you can open a special window with `screen -L` that logs everything that happens inside the session to a file `screenlog.#`, where `#` is the number of the logged window, in the `screen` session's root directory.

Ok, I think I've gone on long enough.

First, notice that, in the opening, I wrote "Generative" with a capital 'G'. Strictly speaking, a "generative grammar" is simply a grammar that is itself finite, but produces a (potentially) infinite set of strings. For example, a grammar with the three rules:

- N → Adj N
- Adj → blue
- N → cat

generates the infinite set containing the strings "blue cat", "blue blue cat", "blue blue blue cat", and so on. Generativism got its start by proposing that human languages are infinite, but, broadly, can be generated by this kind of limited grammar in this kind of way. Such a generative (little 'g') grammar is useful for encoding broad generalizations about linguistic regularities, such as the fact that English nouns can be preceded by multiple adjectives. Most work that tries to address syntax relies on, or is at least compatible with, generative grammar, including work outside of the Generativist (big 'G') tradition.

The Generativist (big 'G') tradition takes this machinery literally: human languages *are* infinite sets of strings. People don't actually produce most of those strings. For example, people don't say sentences that are 100,000,000,000 words long, but, under the Generativist view, those long strings are part of the language. Formally, they argue that such strings are part of a speaker's "competence," while the kinds of strings a speaker actually says constitute their "performance." A speaker's performance reflects an effort to match competence given limited processing resources. Performance may differ from competence on shorter sentences as well. For example, English allows center-embedding:

- The rat the cat hunted ate the cheese.

Where "the cat hunted" is embedded within "the rat ate the cheese" to identify *which* rat ate the cheese. However, English speakers have a very difficult time understanding:

- The rat the cat the dog chased hunted ate the cheese.

Where "the dog chased" is embedded within "the cat hunted" to identify which cat was doing the hunting. So, English allows center-embedding: but how much? One potential solution is that unlimited center-embedding is part of the competence of English, but that speakers are only able to process one level in practice.

The explanatory power of a particular proposed grammar comes from its ability to describe the content of a language (subject to our ability to discern which strings are actually part of the competence). For Generativists, this has meant that a proposed grammar should successfully discriminate strings that are part of the infinite set of strings of a particular human language from strings that are not part of that infinite set (although they may be part of a different human language). (Other kinds of generative grammar take a different notion of what constitutes the "content of a language." For example, Combinatory Categorial Grammar has a very tight relationship between syntax and semantics, and focuses not on describing sentences as strings only but as strings and meanings.)

Perhaps most famously, Generativists have gone beyond making explanatory generalizations about individual languages, and sought to make explanatory generalizations about all human languages. Just as they propose that the set of possible strings of a particular human language is characterized by a generative grammar, they propose that the set of possible human grammars is characterized by Universal Grammar (UG). These constraints on possible grammars are supposed to rule out languages that are conceivable but impossible as a human language. I won't get into the details of the constraints; they are often highly technical and fairly specific to grammar formalisms (and so when a grammar formalism is revised or abandoned, the details of UG constraints must change). Proposed UG constraints are also rarely evaluated at scale, so it's usually unclear (at least to me) if they actually rule out the kinds of strings they're supposed to.

One of the central motivations for UG has been the fact that children can learn a particular language, which Generativists take to involve identifying the right infinite set of strings. If a language is an infinite set of strings, then the task is difficult: a child only observes a finite number of strings, and must determine, on the basis of this finite set, which infinite set is correct. Since every finite set of strings is a subset of infinitely many infinite sets of strings, the finite set of strings a child observes will never bring down the set of potential infinite languages to one. This has been called "the logical problem of language acquisition." However, UG (supposedly) rules out huge numbers of infinite sets of strings, and the idea is that children manage to identify a single infinite set of strings on the basis of a finite set of strings (the utterances they hear) by using their innately-specified knowledge of UG to eliminate the other infinite sets that are consistent with the finite evidence they have.

This innate knowledge of UG is traditionally taken to be implemented in terms of innately-specified parameters that express broad typological divisions among the world's languages. For example, a child may try to decide whether her language typically puts main verbs in the middle of the sentence, like English, or somewhere else, like Japanese (which typically puts them at the end). This framework is called "Principles and Parameters," or P&P, and, as far as I've seen, the parameters that people actually propose are binary-valued. See Sakas & Fodor (2012) for a fairly comprehensive overview of this basic approach, along with computational simulations with artificial languages.

So, what is the deep relation to NLP-style approaches I promised to bring up? Let's first assess the UG and P&P account in terms of Marr's levels of analysis. The computational-level goal is to identify the single infinite set of strings that is consistent with both the finite set of strings a child can see and the innately-specified constraints of UG. The first of these constraints is relatively easy: an infinite set of strings is consistent with a finite set of strings if and only if the finite set is a subset of the infinite set (presumably with some mechanism to identify performance errors in the input). The second constraint is hard: we need to consider every generative grammar that generates an infinite set of strings that both includes the evidence and does not violate a UG constraint. Because two generative grammars may generate the same set of strings (such grammars are said to be "weakly equivalent"), finding that an infinite set is generated by a UG-forbidden grammar is not enough to exclude it. An infinite set of strings is ruled out only if it is generated *only* by UG-forbidden grammars. Thus, the search space is enormous: we cannot rule out an infinite set of strings unless every grammar that generates it violates UG.

P&P essentially provides a representation that is suitable for exploring this space efficiently. It does this by constructing two auxiliary spaces. First, rather than searching in the space of languages directly, P&P borrows the space of Generative grammars originally devised for descriptive adequacy. This will be an extremely high-dimensional space with, e.g., one dimension for each possible generative rule. A dimension is 1 if that rule is in the target grammar, and 0 if that rule is not. Each conceivable grammar is then a point in this space. If the number of possible Generative rules is finite, like the number of rules in a specific grammar (I'm not actually sure what Generativists think on this point), then this space will clearly be smaller than the infinite space of infinite languages, thus providing the infant with a finite space to explore.

The second auxiliary space constructed by P&P is the Parameter space, which informs the exploration of the Grammar space. This is supposed to be a space with much lower dimensionality, on the order of 10 to 100 dimensions, with each dimension indicating the setting of one of the Parameters. The Parameter space informs the exploration of the Grammar space because each parameter setting rules out most of the Grammar space. Note that the Parameter space is useful for exploring the Grammar space because the infant is equipped with innate hard correlations between the parameter space and the Grammar space: given a Parameter setting, some grammars (and their generated languages) become *impossible*, and the infant moreover knows *which* grammars become impossible.

Algorithmically, learning in the Parameter space is supposed to proceed by way of a random walk prompted by "triggers." Specifically, if a child can provide some analysis, the Parameters remain unchanged, but if the parse fails, the infant changes a Parameter. The details of this change vary from proposal to proposal. (It's interesting that this essentially looks like the Win-Stay Lose-Shift algorithm, except most accounts forbid changing a Parameter once it has been set. Win-Stay Lose-Shift is a kind of particle filtering, which is in turn a resource-limited approximation to Bayesian inference.) Thus, UG and P&P fundamentally propose children succeed at language acquisition because they can explore innately-specified and built-for-purpose spaces that are low-dimensional compared to the space of infinite sets of strings.
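As a rough sketch of the trigger idea (the Parameters and the parsing test here are invented for illustration, and real proposals differ in the details): settings stay put when a sentence can be analyzed, and one unset Parameter is flipped, and then frozen, when analysis fails.

```python
import random

def trigger_learn(sentences, can_parse, n_params=3, seed=0):
    """Win-Stay Lose-Shift over binary Parameters, freezing each once set."""
    rng = random.Random(seed)
    params = [None] * n_params       # None = not yet set
    for s in sentences:
        if can_parse(s, params):
            continue                 # Win: keep current settings
        unset = [i for i, p in enumerate(params) if p is None]
        if not unset:
            continue                 # every Parameter is already frozen
        i = rng.choice(unset)        # Lose: shift one unset Parameter
        params[i] = rng.choice([0, 1])
    return params

# toy "sentences" are (Parameter index, required value) triggers
sentences = [(0, 1), (0, 1), (1, 0), (2, 1)]
can_parse = lambda s, p: p[s[0]] == s[1]
print(trigger_learn(sentences, can_parse))
```

Because a Parameter is frozen as soon as it is guessed, an early wrong flip is permanent; this is one reason the details of the change vary so much from proposal to proposal.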

NLP-style grammar induction does fundamentally the same thing, but picks different kinds of spaces and correlations. Specifically, remember that the DMV that I mentioned in the previous post generates an unobserved parse tree and sentences. It computes the probability of the parse tree as a product of the probability of each arc in the tree (along with some other stuff). It computes the probability of each arc by paying attention to the head word, the dependent word, the arc direction, and the arc valence (whether or not the arc is the first arc of that head word in that direction). This defines a high-dimensional grammar space, analogous to the grammar space of UG and P&P. However, rather than explicitly posit a built-for-purpose and innate Parameter space to reduce the dimensionality of the space to be explored, the DMV relies on statistical correlations.

The figure below illustrates the basic idea. We see a three-dimensional plot with several points that have a strong linear correlation. Even though there are three dimensions, the location of each point can mostly be specified by indicating its location along the best-fit line, and completely specified by also indicating its distance from the best-fit line along the best-fit plane. The *intrinsic* dimensionality of this surface is thus between 1 and 2, even though it is situated in a 3-dimensional space. It is not more than two because the location of each point can be exactly specified with two numbers (location along the best-fit line, distance from the best-fit line along the best-fit plane), and it is close to one because the location of each point can be almost exactly specified with one number (location along the best-fit line). A surface of arbitrary dimensionality is in general called a *manifold*.

In this example, assuming correlational forms in terms of best-fit lines and best-fit planes allowed us to reduce the dimensionality from three to between one and two. The DMV essentially builds a space of projective dependency trees, situates utterances in that space, and assumes the utterances are correlated in terms of arc heads, directions, and valences. These correlations allow us to deal with a lower-dimensional grammatical manifold in the same way. This kind of space is hard to picture because the correlated objects are discrete, their locations in the space are probabilistic rather than hard, and the functional form is defined in terms of arcs of projective trees, but the same basic principle applies.
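The intrinsic-dimensionality point can be checked numerically (a sketch with made-up data): generate points near a line in 3-D and let PCA report how much variance each direction carries.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, size=200)                      # position along the line
line = np.outer(t, np.array([1.0, 2.0, 3.0]))         # a line through the origin
points = line + rng.normal(scale=0.05, size=(200, 3)) # small off-line noise

# PCA via SVD of the centered data
centered = points - points.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(explained)  # the first component carries almost all the variance
```

Even though each point lives in three coordinates, one number (position along the first component) recovers it almost exactly, which is what it means for the intrinsic dimensionality to be near one.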

Thus, just like UG and P&P-based approaches, NLP-style approaches are fundamentally concerned with reducing the dimensionality of a search space. They differ in that UG and P&P-based approaches propose an explicit and innate low-dimensional Parameter space with innately-specified hard correlations between the Parameter space and the Grammar space, while the DMV proposes a latent and learned low-dimensional space with only the correlational form innately specified. The assumed correlational form can be adjusted by changing the likelihood function, and, indeed, people have explored adjusting the likelihood function to enable correlations in terms of multi-arc subtrees and different amounts of structural and lexical information. The final set of models in my dissertation (submitted, but not yet defended) came down to enabling correlations that involve the spoken duration of the head and dependent word.

So far I've argued that UG/P&P-based approaches differ from more data-driven approaches only in that UG/P&P approaches posit that the reduced search space is 1) innate and 2) related to the grammar space by way of innate, hard correlations. This in and of itself doesn't mean that UG/P&P is "right" or "wrong" as an account of what happens in children's heads during language acquisition. However, it does have certain methodological implications.

The crucial point is to notice that the UG/P&P-based approach is a special case of NLP-style data-driven approaches. If the likelihood function of a data-driven approach allows the right correlations, then that data-driven approach is capable of representing the Parameter space of P&P as a low-dimensional manifold within the grammar space. If a proposed Parameter space actually does explain the observed distributional regularities well, then the data-driven approach *will* recover that Parameter space; and if the proposed Parameter space explains the observed regularities much better than other potential latent variables, then the data-driven approach will probably recover only the proposed Parameter space.

We can see an example of this in Kwiatkowski et al (2012) (described in more detail in Kwiatkowski's PhD dissertation), who built a (data-driven) online unsupervised learner for Combinatory Categorial Grammar (CCG). Two of the most often proposed Parameters are "head-direction" and "V2." These combine to describe the canonical ordering of Subjects, Verbs, and Objects in a language. English, for example, is typically Subject-Verb-Object: SVO. One of the interesting results from Kwiatkowski et al is that their system rapidly determined that English is SVO, despite having no clear prior bias towards SVO, no explicit switch related to canonical ordering, and not even advance knowledge about which words are verbs and which words are nouns. Rather, they analyzed what the learner believed about likely sentences, and found that SVO sentences dominated fairly early in learning.

Hard-coded latent spaces make different empirical predictions from learned latent spaces in terms of the rate of acquisition. If the reduced space is not hard-coded, then children can exploit correlations in the latent space only to the extent that they are supported by the data. First, this means that a hard-coded latent space potentially enables faster learning (although there is a tradeoff, because an elaborate latent space, even if it is hard-coded and correct, will still require substantial data before being useful). Second, if the latent space is learned, the developmental trajectory of children should be largely determined by when different parts of the latent space become evident. Both of these considerations motivate the use of data-driven, NLP-style structured probabilistic models: we cannot know if children exploit regularities before they are evident without measuring when they are evident, and structured probabilistic models measure to what extent latent variables have been identified by evidence.

Working with learned latent spaces also makes it easier to explore potentially useful correlations not only within syntactic systems but between syntax and other systems. Specifically, a lot of people have explored the possibility that children use correlations between syntax and semantics, or syntax and prosody, to learn one or both systems (that Kwiatkowski et al paper actually learns syntax and semantics simultaneously). These accounts, called "bootstrapping" accounts, fundamentally exploit correlations between systems in exactly the same way, and structured probabilistic models provide a unified framework for exploring how correlations within syntax interact with correlations between syntax and other linguistic systems.

Thus, UG/P&P approaches are about dimensionality reduction, just as data-driven approaches are, except UG/P&P approaches assume that the latent space must be hard-coded and related to grammar space by way of hard innate correlations. I've argued that it makes more sense methodologically to explore data-driven models, because they provide a more solid foundation for determining whether or not grammatical forms are easy to identify.

I'll close with some reflection on the goals of each style of grammar induction. UG/P&P, despite its explicit emphasis on "grammar," takes as its goal not the induction of grammatical knowledge from evidence but the induction of an infinite set of strings from a finite set of strings. The functional role of the grammar formalism and associated machinery is just to limit the infinite sets of strings a learner considers. Data-driven approaches take the grammar itself more seriously. For example, Kwiatkowski et al (2012) use the grammar as a mapping from the linear order of words to the semantic composition of those words, and assess the grammar not only in terms of the word strings it produces but also in terms of the accuracy of the meanings it produces. In my own work (e.g. Pate & Goldwater (2013), and a more qualitative and theoretical article in progress), I evaluate the accuracy of the parses against hand-annotated parses, and examine the posterior distribution over parses to identify what kinds of evidence different aspects of the input provide. Other parts of the field of NLP take the syntax seriously in other ways, such as for translating between languages and identifying out-of-vocabulary words.

By emphasizing the evidence for grammatical structures themselves, rather than the evidence for infinite sets of strings, data-driven approaches avoid the "logical problem of language acquisition." We can never see an infinite set of strings, but we can encounter recurring grammatical structures many times, and, on the basis of structures we are confident in, make inferences about new grammatical structures we haven't seen.

Why is there this neglect of grammatical structures themselves in the Generative tradition? I think at least part of it has to do with the Generativists', or at least Chomsky's, view of language as a kind of Platonic ideal rather than an adaptation for communication. Evaluating grammatical structures in terms of their adequacy for specifying meanings tacitly assumes that the functional role of language is communication. This may seem obvious to the layperson, but it's apparently controversial among more traditional Generativists. I was at the 2011 Cog Sci conference when Chomsky spoke, and was astonished to hear him deny that language was an evolutionary adaptation for communication. I don't want to put words in his mouth, but he talked about language as some kind of fundamental ability to manipulate and combine symbols, and as having more to do with the human capacity for thought and logic. Its suitability for communication is, according to him, merely a quite convenient side-effect.

I think this view of language is too limiting. I like, in principle, a view of grammar as "a talker's knowledge about language," and see no reason to restrict ourselves to considering only generated strings. It can be useful to focus on only part of language, such as the linear order of words, or linear order plus meaning, or morphology, or whatever, in a particular investigation, but I don't like ruling out certain interactions *a priori*.

I do think language is about communication. Maybe I'll talk about that more later (we'll get to discuss information theory! yay!), but for now I've gone on enough. Thanks!


The interview is best understood in the context of the historical development of the field of linguistics, and the rise of the field of cognitive science. Way back in the 1950s, the behaviorist pioneer B. F. Skinner wrote a book purporting to provide a psychological theory of language entitled "Verbal Behavior." As a behaviorist account, this book sought to explain language as the result of "stimulus-response" type behavior, just like Pavlov's dogs. Chomsky showed, in a scathing review of "Verbal Behavior," that straightforward "stimulus-response" type accounts that did not appeal to some notion of linguistic structure are inadequate for language.

Here's the basic idea behind the kinds of arguments Chomsky made. We can distinguish between sentences that are syntactically well-formed but meaningless and random sequences of words, as demonstrated by the (now-famous) example: "colorless green ideas sleep furiously." This sentence is acceptable in a way that "green sleep ideas furiously colorless" is not. However, the sentence doesn't have any clear meaning, and so it's not obvious how this stimulus could receive a response that is systematically different from "green sleep ideas furiously colorless;" at least, prior to decades of linguistic commentary on this particular meaningless sentence. Chomsky argued that "colorless green ideas sleep furiously" is acceptable because it has the right kinds of words in the right order to provide the subject and verb phrase that constitute a sentence, but "green sleep ideas furiously colorless" does not have the words in the kind of order that allows a sentence to be constructed. So the two word sequences can be distinguished only by reference to some kind of psychologically-real internal structure.

Behaviorism, however, expressly forbids theories from referencing psychologically-real internal structure. Behaviorists argued that internal psychological processes are not objectively measurable, and good science only uses data that are measurable, so psychological theories should not speculate on what happens inside people's heads. If "colorless green ideas sleep furiously" is different from "green sleep ideas furiously colorless" only by virtue of psychologically-real internal structure, then behaviorist approaches are inherently broken. Chomsky argued that, yes, behaviorism is completely hopeless as a paradigm for understanding language, and good linguists should abandon behaviorism and focus on elucidating linguistic structures. This rejection of behaviorism bled over into other parts of psychology, leading to the "Cognitive Revolution" in which several subfields of psychology started proposing, and testing experimentally, psychologically-real structures.

Artificial Intelligence is, broadly put, the endeavor to make computers capable of doing what people can do. AI projects are often classified as "strong" or "weak" AI. Strong AI projects seek to build systems that do the things people can do *in the same way that people do them*, while weak AI projects seek to build systems using whatever we can get to work. Thus, strong AI projects can be viewed as trying to implement a psychological theory, but weak AI is more of an engineering field. Because decades of research following up on the Cognitive Revolution have shown that we can make at least some empirically-supported, substantive claims about the internal structures and processes of human cognition, strong AI should presumably be looking to implement non-behaviorist theories.

In the interview, Chomsky is essentially arguing that behaviorism (or "associationism") has snuck back into strong AI in the form of statistical models, and, since behaviorism is a fundamentally flawed paradigm for human cognition, this is bad. Computational models for human language have increasingly relied on statistical cues, especially in the last twenty years. In fact, the term "Language Model" almost always refers to what's called an *n*-gram model, where you compute the probability of a sequence of words by looking at how common portions of that sequence are. So a trigram model (an *n*-gram model with *n* set to 3) would determine the goodness of "colorless green ideas sleep furiously" by seeing how common "colorless green ideas," "green ideas sleep," and "ideas sleep furiously" are in some dataset. As long as the dataset doesn't contain too many papers by linguists (har har), the trigram model should give the sentence a very low probability. Thus, the trigram model makes exactly the same kind of mistake as the behaviorist approach, for the same reason: it maintains no notion of linguistic structure. Moreover, the trigram model can be viewed as a kind of behaviorist theory, wherein the subject's response to a stimulus of "x" is "increase the probability of trigram x."
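For concreteness, here's a minimal sketch of the trigram idea with a made-up two-sentence corpus and no smoothing (real language models smooth their counts and work with probabilities rather than raw counts):

```python
from collections import Counter

def trigram_counts(sentence, corpus):
    """Count how often each trigram of `sentence` occurs in `corpus`."""
    counts = Counter()
    for sent in corpus:
        words = sent.split()
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    words = sentence.split()
    return [counts[tuple(words[i:i + 3])] for i in range(len(words) - 2)]

corpus = ["the dog chased the cat", "the cat hunted the rat"]
print(trigram_counts("the dog chased the rat", corpus))
# [1, 1, 0]: the unseen trigram ("chased", "the", "rat") gets count 0,
# so an unsmoothed trigram model assigns the whole sentence probability 0
```

Note that nothing in the model knows that "the rat" is a noun phrase or that "chased" takes an object; it sees only surface co-occurrence, which is exactly the criticism.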

Chomsky explicitly targets Bayesian approaches. Remember from my last post that a Bayesian approach explains observations (the coin flips) by averaging over possible hypotheses (the bias of the coin). For natural language syntax, this comes down to explaining sequences of words (or morphemes, or speech sounds) by averaging over possible grammars and syntactic analyses. In the interview, Chomsky portrays Bayesian approaches as simply trying to regenerate the observations, and says that a system that simply regenerates what you saw does not improve your scientific understanding. I agree that a system that merely regenerates what you saw does not improve your scientific understanding, but most Bayesian approaches do not just regenerate the observations. A Bayesian approach that only regenerates our word strings would pick an *n*-gram model to be the likelihood function. Remember that a likelihood function is the term P(d | θ). If we've picked a trigram likelihood function, our observed data d would be triples of words, and θ would just be a table that tells us the probability of each trigram. We would then compute the probability of our observed data by averaging over all possible entries in the table θ.

I agree with Chomsky that this is a boring model, but there's no reason to confine ourselves to such boring models. I like to work with a likelihood function called the Dependency Model with Valence, or DMV (originally devised by Dan Klein and Christopher Manning). It generates not just the observed words but also a *tree structure for them*:

This kind of likelihood function has two kinds of generated variables: observed variables (the words) and hidden variables (different parts of the tree), and the DMV is one definition for how to compute the likelihood function P(d, t | θ), where t is the tree. The hidden variables represent the psychologically-real internal structure the cognitive agent is proposed to be manipulating. Empirical evaluations of the trees from this kind of model show that higher probability trees tend to correspond more closely to linguist intuitions about syntactic trees than would be expected by chance and various baselines, bolstering the notion that they can be interpreted as *syntactic trees*. The probability of a tree and words under this kind of likelihood function will be lower than the probability of the words under the trigram likelihood function, so this approach isn't just trying to reproduce noisy data, as Chomsky alleges. Instead, it uses results from linguistic theory, laboratory experiments, and corpus studies to guide the kind of hidden structure we build, and the form of the likelihood function we rely on.
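The arc-factored idea behind this kind of likelihood can be sketched as follows (the probability table is invented for illustration; the actual DMV also generates stop/continue decisions and conditions on valence in a richer way):

```python
import math

# (head, dependent, direction, is_first_arc_of_head_in_that_direction): prob
arc_prob = {
    ("ate", "rat", "left", True): 0.4,
    ("ate", "cheese", "right", True): 0.5,
    ("rat", "the", "left", True): 0.9,
    ("cheese", "the", "left", True): 0.9,
}

def tree_logprob(arcs):
    """Log-probability of a dependency tree as a sum of log arc probabilities."""
    return sum(math.log(arc_prob[a]) for a in arcs)

# a toy parse of "the rat ate the cheese"
tree = [("ate", "rat", "left", True), ("ate", "cheese", "right", True),
        ("rat", "the", "left", True), ("cheese", "the", "left", True)]
print(tree_logprob(tree))
```

The key property is the factorization: the tree's probability depends only on local head/dependent/direction/valence facts, which is what defines the correlational form the learner can exploit.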

Chomsky is not unreasonable in his critique because there are people who focus on models that only generate or represent fairly superficial aspects of linguistic data. However, it is unreasonable to dismiss all Bayesian methods as efforts to fit noise.

Computational modeling of this kind is important, especially in the context of language acquisition. Elaborate morphological and syntactic structures are important to explain adult competence, but, during learning, a structure that has more parameters (entries in θ) than are supported by the amount of data available will be too ambiguous to be useful, *even if the structure is correct*. Bayesian approaches allow us to measure the ambiguity of different kinds of assumed structure directly, by specifying different likelihood functions, regardless of the learning strategy the child actually uses.

Alright, back to marking. Thanks!


So we considered in the previous post a situation where we see a coin flipped three times, and get two heads. I tried to build the intuition that this result is fairly consistent with coins that are fair and with coins that are biased for or against heads. It's easiest to see this by plotting the likelihood P(d | θ). If we assume that the coin flips are independent (that is, the outcome of one flip does not affect the next flip), then we have for the sequence H,H,T:

P(H,H,T | θ) = θ · θ · (1 - θ) = θ²(1 - θ)

This is the probability of *one* sequence of flips. But I didn't tell you the order of the flips; maybe the one tail came first, maybe it was in the second flip, or maybe in the third. So to get the probability of two heads and one tail, ignoring order, we have to add up all three possibilities:

P(2 heads, 1 tail | θ) = 3θ²(1 - θ)

We can see this probability distribution in the next plot, where the horizontal axis is the probability of heads and the vertical axis is the probability of two heads and one tail for each value of θ:

We can see that the peak of the distribution is at θ = 2/3 (and we can see that I made a mistake in the previous post when I said this peak was probability 0.375... *sigh*), but other values of θ are pretty high as well. This is because two heads in three flips is pretty likely for a coin of almost any bias. Let's see what happens if we take thirty flips and get twenty heads; the same ratio, but more total flips:

We can see that the peak is still at 2/3, but a much tighter range of possible values of θ get appreciable probability. For both of these sets of observations, though, the maximum-likelihood estimate (MLE) is the same: θ = 2/3.
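Both likelihood curves can be reproduced numerically (a quick grid-search sketch): the binomial likelihood for 2 heads in 3 flips is 3θ²(1 - θ), and both curves peak at θ = 2/3.

```python
import numpy as np
from math import comb

theta = np.linspace(0, 1, 10001)

# 2 heads in 3 flips: P(d | theta) = comb(3, 2) * theta^2 * (1 - theta)
lik3 = comb(3, 2) * theta**2 * (1 - theta)
# 20 heads in 30 flips: the same heads/tails ratio, ten times the evidence
lik30 = comb(30, 20) * theta**20 * (1 - theta)**10

mle3 = theta[np.argmax(lik3)]
mle30 = theta[np.argmax(lik30)]
print(mle3, mle30)  # both land on the grid point nearest 2/3
```

Evaluating `lik3` at θ = 0.5 also recovers the fair coin's 37.5% chance of this outcome, while the 30-flip curve is visibly narrower around its peak.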

In the previous post, I mentioned that, in a Bayesian approach, we can model our joint distribution over data and parameters in terms of a likelihood P(d | θ) and a prior P(θ): P(d, θ) = P(d | θ) P(θ). Let's have a look at what a sensible prior might look like. Remember, for a prior expectation of a fair coin, we want a prior that gives high probability to θ near 0.5. Here's one possibility:

This prior concentrates probability mass on θ near 0.5. Technically, this is a Beta prior with both hyperparameters set to 10, but the details aren't important. Also, since θ is a continuous variable, plots of the probability of θ are probability *densities*, and we can only get probabilities that θ is in a certain range by taking the *integral* over that range. Because of this, the "probability" of θ will sometimes go above 1 (this can happen any time θ is not on the right side of the vertical bar).

Now let's look at the joint distribution over θ and our observations d:

After three flips, the peak of the joint distribution with respect to θ is only 0.52! The three flips have barely changed our opinion of the fairness of the coin. Now let's see the joint distribution after thirty flips, using the second likelihood function from above:

We can see in this plot that the peak, with respect to θ, of the joint distribution is 0.6 after seeing thirty flips, and almost all of the probability mass is to the right of a perfectly fair (θ = 0.5) coin. So, as we gather more evidence, our opinion of the fairness of the coin depends less on the prior distribution.
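Those two peaks can be checked with the conjugacy of the Beta prior (assuming, as above, a Beta prior with both hyperparameters set to 10): with a Beta(a, b) prior and h heads in n flips, the posterior is Beta(a + h, b + n - h), whose mode is (a + h - 1)/(a + b + n - 2).

```python
def posterior_mode(a, b, heads, flips):
    """Mode of the Beta posterior after observing `heads` in `flips`."""
    return (a + heads - 1) / (a + b + flips - 2)

print(round(posterior_mode(10, 10, 2, 3), 2))    # 0.52 after 3 flips
print(round(posterior_mode(10, 10, 20, 30), 2))  # 0.6 after 30 flips
```

With only three flips the prior's 18 pseudo-counts dominate, pulling the mode to 11/21 ≈ 0.52; after thirty flips the data pull it most of the way toward the MLE of 2/3.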

In fact, if we can get *infinite* evidence, the Bayesian distribution over θ is an infinitely thin spike at the maximum-likelihood estimate. Of course, we never have infinite evidence, and cognitive tasks involve events that are almost always *rare*. This is relevant in two different ways. First, some researchers believe humans use strategies that compute Bayesian posteriors. For example, in a daily conversation (or blog post), you'll encounter sentences that you've never seen before; in fact, most of the sub-parts of those sentences will be rare as well. The proposal that humans use (approximations to) Bayesian strategies provides an intuitive and mathematically satisfying explanation of why people are able to deal with all these rare events.

The second way that Bayesian reasoning is relevant has to do with the intention of my previous post. Probabilistic computational-level models essentially come down to proposing that "This likelihood function together with this prior distribution recover cognitively-relevant structure in the data." Maximum-Likelihood Estimates do not test the likelihood function or prior distribution directly; rather, they test it subject to the requirement that every single parameter in the likelihood function is supported by the data. Since we know data in cognitive domains is often sparse, this is a harmful requirement that interferes with our ability to assess the likelihood function *per se*. Bayesian approaches allow us to test the extent to which likelihood functions of interest recover cognitively-relevant structure directly.

Ok, I've just talked a lot about "likelihood functions of interest," but all we've talked about so far are coin flips. Coin flipping is pretty boring, I know. I promise to talk about more interesting likelihood functions soon.

Thanks,

In this post about computational modeling, I promised to talk about how a computational model could improve our understanding of cognition without asserting that people actually did what the computational model did. The basic point is that we can use probabilistic modeling to explore the shape of our data, which in turn constrains what kinds of strategies a human learner could successfully use. In this post, I'll discuss how Bayesian modeling allows us to explore the statistical structure of data and characterize what any (decent) probabilistic algorithm would try to do with that data.

It's easiest to understand the motivation behind Bayesian modelling by comparing with Maximum-Likelihood modelling.

Bayesian models of cognition propose that some aspect of cognitive behavior can be explained in terms of a probability distribution over what the agent has seen (the observed data) and what the agent believes about the observed data (the hypotheses). Concretely, if you flip a coin three times and get two heads, a Bayesian model would propose you are maintaining two kinds of beliefs. First, you have a belief about whether the coin is fair or it is biased to give one result; this is the belief about the *hypothesis*. Second, you have a belief about how likely you are to get two heads in three flips under each possible kind of coin; this is the belief about the *data*.

A Maximum-Likelihood model, by contrast, proposes that we only need to pay attention to the probability over the observed data. Here's an example:

Flipping a coin three times doesn't give us very much evidence about the coin. A perfectly fair coin gives this result 37.5% of the time (in three flips), and a coin that's biased *against* heads 2:1 gives this result just over 22% of the time. Moreover, we have an odd number of flips, so it's not actually possible to obtain the same number of heads and tails. Intuitively, most people would continue to believe that the coin is probably fair after these three flips.
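These numbers are easy to check with a quick sketch in Python (the function name is mine):

```python
from math import comb

# Probability of exactly 2 heads in 3 flips for a coin with P(heads) = p
def two_heads_in_three(p):
    return comb(3, 2) * p**2 * (1 - p)

print(two_heads_in_three(0.5))    # fair coin: 0.375, i.e. 37.5%
print(two_heads_in_three(1 / 3))  # coin biased 2:1 against heads: ~0.222
```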

What does the maximum-likelihood estimate (MLE) about the coin say? Well, remember that it only cares about the probability of the observed data. Specifically, it proposes that people believe the hypothesis that makes the data most probable. If we call our belief about the coin $\theta$ (which is traditional), and our observations, evidence, or data $d$ (which is just the outcome of the three flips), the MLE for $\theta$ is:

$$\hat{\theta} = \operatorname*{argmax}_\theta P(d \mid \theta)$$

So what is the $\theta$ that makes two heads and one tail most likely? Well, it's a coin that comes up heads $2/3$ of the time. The maximum-likelihood model proposes that people commit immediately to the data, and ignore all but the best explanation. In this case, that's not very realistic; we have so little data that other explanations are also pretty good.
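A simple grid search makes the point concrete; this is just a sketch, using a discretized range of $\theta$ values rather than calculus:

```python
from math import comb

# Likelihood of 2 heads in 3 flips under a coin with P(heads) = theta
def likelihood(theta, heads=2, flips=3):
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

# Search a grid of candidate biases for the one that maximizes the likelihood
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=likelihood)
print(mle)  # 0.667, i.e. the coin that comes up heads 2/3 of the time
```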

A Bayesian model maintains uncertainty about the coin. Let's take a brief digression to establish two basic facts in probability theory:

First, let's talk about *conditional probabilities*. If $U$ is a variable indicating whether I'm using an umbrella ($U$ can be yes or no), and $R$ is a variable indicating whether it's raining ($R$ can also be yes or no), then $P(U \mid R)$ is the probability that I'm using an umbrella, *given* that it's raining, i.e. if we only consider times when it's raining. We can get a probability distribution over whether I'm using an umbrella *and* it's raining by taking the probability that I'm using an umbrella *given* that it's raining and multiplying it by the probability that it's raining:

$$P(U, R) = P(U \mid R)\,P(R)$$

$P(U \mid R)$ assumes we know $R$, but $P(U, R)$ maintains uncertainty over both $U$ and $R$.

The other important thing to understand is *marginalization*. This is a fancy word for using addition in a particular context. We might know the probability that I'm using an umbrella and it is raining, $P(U=\text{yes}, R=\text{yes})$, and the probability that I'm using an umbrella and it is not raining, $P(U=\text{yes}, R=\text{no})$. If we only care about the probability that I'm using an umbrella (perhaps you want to borrow it, and don't know if it's raining), then we can get a probability distribution over $U$ alone by adding up the two probabilities: $P(U=\text{yes}) = P(U=\text{yes}, R=\text{yes}) + P(U=\text{yes}, R=\text{no})$. In general, as long as we include *all* possible values of $R$, the following equation holds:

$$P(U) = \sum_R P(U, R)$$

If $R$ is a continuous variable, like height, or the bias of a coin, then we have to integrate:

$$P(U) = \int P(U, R)\,dR$$

Putting these equations together, we have the *rule of total probability* for discrete variables:

$$P(U) = \sum_R P(U \mid R)\,P(R)$$

and continuous variables:

$$P(U) = \int P(U \mid R)\,P(R)\,dR$$
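As a quick numeric check of the discrete rule, here is a sketch with made-up numbers (the 0.3, 0.9, and 0.05 below are hypothetical, chosen only for illustration):

```python
# Hypothetical numbers: P(R = rain) = 0.3,
# P(U = yes | rain) = 0.9, P(U = yes | no rain) = 0.05
p_rain = 0.3
p_umbrella_given_rain = 0.9
p_umbrella_given_dry = 0.05

# Rule of total probability: P(U) = sum over values of R of P(U | R) * P(R)
p_umbrella = p_umbrella_given_rain * p_rain + p_umbrella_given_dry * (1 - p_rain)
print(round(p_umbrella, 3))  # 0.305
```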

Now, back to coin flipping. Remember that I said a Bayesian model maintains a probability distribution over both data $d$ and hypotheses $\theta$. That just means we care about the joint distribution over these variables:

$$P(d, \theta) = P(d \mid \theta)\,P(\theta)$$

The first term on the right-hand side, $P(d \mid \theta)$, is the same thing that the MLE approach above maximized, and is called the *likelihood* of the data or the evidence under a particular hypothesis (which is why the MLE approach is called "maximum-likelihood"). The second term, $P(\theta)$, is called the *prior*, and reflects a prior distribution over all hypotheses $\theta$. Now, after seeing two heads in three flips, we've already pointed out that $P(d \mid \theta)$ is pretty large for a wide range of $\theta$. Since the likelihood doesn't change much across possible values for $\theta$, the prior over $\theta$ has a strong effect on the joint distribution. In our daily lives, the coins we encounter (when we bother to flip them) are pretty fair, so our prior will give much higher probability to the hypothesis that the coin is probably fair.

Accordingly, the MLE approach incorrectly predicts that people will feel a coin is strongly biased after seeing two heads in three flips, while the Bayesian approach both makes the correct prediction and provides an intuitively appealing separation between prior biases and evidence.
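To make this concrete, here is a small sketch that discretizes $\theta$ and applies Bayes' rule; the exact prior shape (a bump concentrated near 0.5) is my assumption, chosen only to encode "coins are usually fair":

```python
from math import comb, exp

# Grid of possible coin biases theta = P(heads)
grid = [i / 100 for i in range(101)]

# Assumed prior: a bump concentrated near theta = 0.5 ("coins are usually fair")
unnorm_prior = [exp(-((t - 0.5) ** 2) / 0.005) for t in grid]
z = sum(unnorm_prior)
prior = [p / z for p in unnorm_prior]

# Likelihood of the observed data (2 heads in 3 flips) under each theta
def likelihood(theta):
    return comb(3, 2) * theta**2 * (1 - theta)

# Posterior is proportional to likelihood * prior
unnorm_post = [likelihood(t) * p for t, p in zip(grid, prior)]
z = sum(unnorm_post)
posterior = [p / z for p in unnorm_post]

# The MLE jumps to theta = 2/3, but the posterior mode stays at the fair coin
mle = max(grid, key=likelihood)
mode = grid[posterior.index(max(posterior))]
print(mle, mode)  # 0.67 0.5
```

With this prior, three flips barely move the belief about the coin, matching the intuition described above.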

Now let's dig a little deeper. I have previously talked about David Marr's levels of analysis. What level of analysis do these two approaches target? We can probe people's beliefs about the coin and the data by asking what would happen if the coin were flipped again. So now we have $d_{old}$, which is our old data (the three flips we already saw), and $d_{new}$, which is the new data we are making a prediction about, and we want to make a prediction about $d_{new}$ given the old data: $P(d_{new} \mid d_{old})$. Now, under the MLE approach, we assume people see three flips, fix their opinion of $\theta$ at the single best value $\hat{\theta}$, and then make a guess about $d_{new}$ purely on the basis of $\hat{\theta}$:

$$P(d_{new} \mid d_{old}) \approx P(d_{new} \mid \hat{\theta}), \qquad \hat{\theta} = \operatorname*{argmax}_\theta P(d_{old} \mid \theta)$$

That is, the MLE approach bases its guess about what happens next not on what happened before directly, but only on the best explanation of what happened before. Under the Bayesian approach, however, since we maintain uncertainty about $\theta$, we can model $P(d_{new} \mid d_{old})$ exactly:

$$P(d_{new} \mid d_{old}) = \sum_\theta P(d_{new}, \theta \mid d_{old})$$

$$= \sum_\theta P(d_{new} \mid \theta, d_{old})\,P(\theta \mid d_{old})$$

The first line is just the marginalization we talked about earlier, and the second line results from the definition of a conditional probability discussed earlier. If we assume that the likelihood function doesn't change over time (and we have to make *some* assumption about how it does or doesn't change over time), we have $P(d_{new} \mid \theta, d_{old}) = P(d_{new} \mid \theta)$, producing:

$$P(d_{new} \mid d_{old}) = \sum_\theta P(d_{new} \mid \theta)\,P(\theta \mid d_{old})$$

So the Bayesian approach bases its guess about what happens next on what happened before, weighting each hypothesis's prediction $P(d_{new} \mid \theta)$ by its posterior probability $P(\theta \mid d_{old})$.
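Continuing the earlier sketch (with the same assumed prior concentrated near 0.5), we can compare the MLE plug-in prediction for the next flip with the Bayesian posterior predictive:

```python
from math import comb, exp

# Same discretized setup as before (prior shape is an assumption for illustration)
grid = [i / 100 for i in range(101)]
unnorm_prior = [exp(-((t - 0.5) ** 2) / 0.005) for t in grid]
z = sum(unnorm_prior)
prior = [p / z for p in unnorm_prior]

def likelihood(theta):  # P(2 heads in 3 flips | theta)
    return comb(3, 2) * theta**2 * (1 - theta)

unnorm_post = [likelihood(t) * p for t, p in zip(grid, prior)]
z = sum(unnorm_post)
posterior = [p / z for p in unnorm_post]

# MLE plug-in: fix theta at its single best value, then predict from it alone
theta_hat = max(grid, key=likelihood)
mle_pred = theta_hat  # P(next flip is heads | theta_hat) = theta_hat

# Bayesian: average each hypothesis's prediction, weighted by the posterior
bayes_pred = sum(t * p for t, p in zip(grid, posterior))

print(mle_pred)    # 0.67
print(bayes_pred)  # stays close to 0.5
```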

The contrast is important when understanding these kinds of models in the context of Marr's levels of analysis. The Bayesian approach characterizes $P(d_{new} \mid d_{old})$ in terms of a likelihood function $P(d \mid \theta)$ and a prior $P(\theta)$. As long as the likelihood function expresses the dependencies within the data well, and the prior represents prior knowledge well, the Bayesian approach characterizes what *any* probabilistic algorithm will be trying to compute. Accordingly, the Bayesian approach is squarely targeted at the computational level of analysis, and brings in a minimum of assumptions *per se*.

The MLE approach, however, approximates $P(\theta \mid d_{old})$ with a single "best" hypothesis $\hat{\theta}$ about the data, which is known to break down in the face of small data. Of course, it's possible that people actually adopt this strategy in some contexts. However, people still behave sensibly in the face of small data (by not changing their belief about the coin very much after only three flips, for example). Additionally, the MLE approach assumes that people actually adopt this strategy, and, in doing so, targets the algorithmic level of analysis at least to some extent.

So, while the Bayesian approach appears more complicated (omg integrals), it is actually less complex from a theoretical viewpoint because it imposes fewer assumptions about what people do. By abstaining from such assumptions, we can explore how different kinds of likelihood functions and prior biases behave with the data that humans see. When we find a likelihood function and prior bias that behave similarly to humans, we have evidence that whatever algorithm humans use is taking advantage of the statistical structure assumed by that likelihood function and prior bias. For the coin example, the likelihood function is very simple: each flip is independent, and comes up heads with probability $\theta$. Figuring out the details of the prior function is a little more tricky, but the basic shape (lots of probability for a fair coin, little probability for a biased coin) is fairly clear. How do people store, represent, and access these probabilities? Well, that's a different question that a computational-level model does not try to answer.

Thanks,
