<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Stanford NLP Research Blog</title>
  <id>https://nlp.stanford.edu/blog/recent.atom</id>
  <updated>2019-02-19T18:33:07Z</updated>
  <link href="http://nlp.stanford.edu/blog/" />
  <link href="https://nlp.stanford.edu/blog/recent.atom" rel="self" />
  <generator>Werkzeug</generator>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">Reading Group Blog -- Semantically Equivalent Adversarial Rules for Debugging NLP Models (ACL 2018)</title>
    <id>https://nlp.stanford.edu/blog/reading-group-blog-semantically-equivalent-adversarial-rules-for-debugging-nlp-models-acl-2018-</id>
    <updated>2019-02-19T18:33:07Z</updated>
    <link href="https://nlp.stanford.edu/blog/reading-group-blog-semantically-equivalent-adversarial-rules-for-debugging-nlp-models-acl-2018-" />
    <author>
      <name>&lt;a href=&quot;http://anie.me/&quot;&gt;Allen Nie&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;In the second post, we will focus on this paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. &quot;&lt;a href=&quot;http://aclweb.org/anthology/P18-1079&quot;&gt;Semantically equivalent adversarial rules for debugging nlp models&lt;/a&gt;.&quot; &lt;em&gt;Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)&lt;/em&gt;. Vol. 1. 2018.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Robustness is a central concern in engineering. Our suspension bridges need to stand against strong wind so they won't collapse like the Tacoma Narrows Bridge [&lt;a href=&quot;https://commons.wikimedia.org/w/index.php?title=File%3ATacoma_Narrows_Bridge_destruction.ogv&quot;&gt;video&lt;/a&gt;]. Our nuclear reactors need to be fault tolerant so that incidents like Fukushima Daiichi [&lt;a href=&quot;https://en.wikipedia.org/wiki/Fukushima_Daiichi_nuclear_disaster&quot;&gt;link&lt;/a&gt;] won't happen again.&lt;/p&gt;
&lt;p&gt;As we become increasingly reliant on a technology -- suspension bridges, nuclear power, or, in this case, NLP models -- we must be able to place greater trust in it. Robustness is precisely the requirement we need to place on such systems.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/en/2/2e/Image-Tacoma_Narrows_Bridge1.gif&quot; style=&quot;width:50%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Early work from Jia &amp;amp; Liang (2017)&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a class=&quot;footnote-ref&quot; href=&quot;#fn:1&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; shows that NLP models are not immune to small text perturbations that humans find negligible -- a simple addition or deletion can break the model and force it to produce nonsensical answers. Other work, such as Belinkov &amp;amp; Bisk&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a class=&quot;footnote-ref&quot; href=&quot;#fn:2&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; and Ebrahimi et al.&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a class=&quot;footnote-ref&quot; href=&quot;#fn:3&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, showed systematically that dropping or replacing a single character is sufficient to break a model. Introducing noise to sequence data is not always bad: earlier work by Xie et al.&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a class=&quot;footnote-ref&quot; href=&quot;#fn:4&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; shows that training machine translation or language models with word/character-level perturbation (noising) actually improves performance.&lt;/p&gt;
&lt;p&gt;However, it is hard to call these perturbed examples &quot;adversarial examples&quot; in the original conception of Ian Goodfellow&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a class=&quot;footnote-ref&quot; href=&quot;#fn:5&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. This paper proposes to characterize an adversarial example in text with two properties:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic equivalence&lt;/strong&gt; of two sentences: \( \text{SemEq}(x, x') \)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Perturbed label prediction&lt;/strong&gt;: \( f(x) \not= f(x') \)&lt;/p&gt;
&lt;p&gt;In our discussion, people pointed out that, from a linguistic point of view, it is very difficult to define &quot;semantic equivalence&quot; because we don't have a precise and objective definition of &quot;meaning&quot;. That is, even though two sentences might elicit the same effect for a particular task, they need not be synonymous. A more nuanced discussion of paraphrases in English can be found in &lt;em&gt;What Is a Paraphrase?&lt;/em&gt; [&lt;a href=&quot;https://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00166&quot;&gt;link&lt;/a&gt;] by Bhagat &amp;amp; Hovy (2012). In this paper, semantic equivalence is operationalized as what humans (MTurkers) judged to be &quot;equivalent&quot;.&lt;/p&gt;
&lt;h3&gt;Semantically Equivalent Adversaries (SEAs)&lt;/h3&gt;
&lt;p&gt;Ribeiro et al. argue that only a sequence satisfying both conditions is a true adversarial example in text. They translate this criterion into a conjunctive form using an indicator function:&lt;/p&gt;
&lt;p&gt;&lt;mathjax&gt;$$
\text{SEA}(x, x') = \unicode{x1D7D9}[\text{SemEq}(x, x') \wedge f(x) \not= f(x')] \tag{1}
$$&lt;/mathjax&gt;&lt;/p&gt;
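&lt;p&gt;The indicator can be written directly in code. A minimal sketch, where &lt;code&gt;sem_eq&lt;/code&gt; stands in for the human (or model-based) equivalence judgment and &lt;code&gt;f&lt;/code&gt; is the classifier under attack -- both are toy stand-ins, not the paper's actual components:&lt;/p&gt;

```python
def sea(x, x_prime, f, sem_eq):
    """Indicator for a Semantically Equivalent Adversary (SEA):
    1 only when x' preserves the meaning of x AND flips the prediction."""
    return int(sem_eq(x, x_prime) and f(x) != f(x_prime))

# Toy stand-ins for illustration: a "model" that keys on the word "movie",
# and an equivalence oracle that accepts one fixed synonym swap.
f = lambda s: "positive" if "movie" in s else "negative"
sem_eq = lambda a, b: a.replace("movie", "film") == b

print(sea("Great movie!", "Great film!", f, sem_eq))  # 1: equivalent AND flips
print(sea("Great movie!", "Bad film!", f, sem_eq))    # 0: not equivalent
```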
&lt;p&gt;In this paper, semantic equivalence is measured by paraphrase likelihood, as defined in the multilingual multipivot paraphrasing paper by Mallinson et al. (2017)&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a class=&quot;footnote-ref&quot; href=&quot;#fn:6&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. Pivoting is a technique from statistical machine translation proposed by Bannard and Callison-Burch (2005)&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a class=&quot;footnote-ref&quot; href=&quot;#fn:7&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;: if two English strings \(e_1\) and \(e_2\) can be translated into the same French string \(f\), then they can be assumed to have the same meaning.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt; &lt;img src=&quot;https://github.com/windweller/windweller.github.io/blob/master/images/pivot-gen.png?raw=true&quot; style=&quot;width: 20%&quot; /&gt; &lt;img src=&quot;https://github.com/windweller/windweller.github.io/blob/master/images/multipivot-gen.png?raw=true&quot; style=&quot;width: 30%&quot; /&gt; &lt;/p&gt;
&lt;p&gt;The pivot scheme is depicted by the generative model on the left, which assumes conditional independence between \(e_1\) and \(e_2\) given \(f\): \(p(e_2 \vert e_1, f) = p(e_2 \vert f)\). Multipivot is depicted by the model on the right: it translates one English sentence into multiple French sentences, then translates them back to generate the paraphrase. The back-translation step of multipivoting can be a simple decoder average -- each decoder takes one French string, and the overall output probability for the next English token is the weighted sum of the probabilities from every decoder.&lt;/p&gt;
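&lt;p&gt;The decoder-averaging step can be illustrated with a toy computation; the distributions and weights below are invented for illustration, whereas in the real model each distribution comes from an NMT decoder conditioned on one French pivot:&lt;/p&gt;

```python
# Each pivot decoder yields a distribution over the next English token;
# multipivot back-translation averages them (here with uniform weights).
decoder_dists = [
    {"film": 0.6, "movie": 0.3, "show": 0.1},  # decoder reading pivot f1
    {"film": 0.5, "movie": 0.4, "show": 0.1},  # decoder reading pivot f2
]
weights = [0.5, 0.5]  # assumed uniform; other weightings are possible

avg = {w: round(sum(wt * d[w] for wt, d in zip(weights, decoder_dists)), 3)
       for w in decoder_dists[0]}
print(avg)  # {'film': 0.55, 'movie': 0.35, 'show': 0.1}
```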
&lt;h4&gt;Paraphrase Probability Reweighting&lt;/h4&gt;
&lt;p&gt;Let \(\phi(x' \vert x)\) be the unnormalized score (logit) from the paraphrasing model, and let \(\Pi_x\) be the set of paraphrases the model could generate given \(x\). The probability of a particular paraphrase can then be written as:&lt;/p&gt;
&lt;p&gt;&lt;mathjax&gt;$$
p(x'|x) = \frac{\phi(x'|x)}{\sum_{i \in \Pi_x} \phi(i|x)}
$$&lt;/mathjax&gt;&lt;/p&gt;
&lt;p&gt;Note that in the denominator, all sentences the model can generate (including the original sentence itself) share the probability mass. If a sentence has many easy-to-generate paraphrases (indicated by high \(\phi\) values), then \(p(x \vert x)\) will be small, as will every other \(p(x' \vert x)\); dividing \(p(x' \vert x)\) by \(p(x \vert x)\) then yields a large value (close to 1). For a sentence that is difficult to paraphrase, \(p(x \vert x)\) will be large compared to \(p(x' \vert x)\), so the ratio will be much smaller (close to 0).&lt;/p&gt;
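&lt;p&gt;The normalization and the ratio's behavior are easy to sketch; the \(\phi\) scores below are invented, standing in for the paraphrasing model's outputs:&lt;/p&gt;

```python
def paraphrase_probs(phi_scores):
    """Normalize raw scores phi(.|x) over the candidate set, which
    includes the original sentence x itself."""
    total = sum(phi_scores.values())
    return {s: v / total for s, v in phi_scores.items()}

# Hypothetical scores for an easy-to-paraphrase sentence: mass is spread
# out, so p(x|x) is small and the ratio p(x'|x)/p(x|x) is close to 1.
easy = paraphrase_probs({"x": 1.0, "x1": 0.9, "x2": 0.8})
print(round(easy["x1"] / easy["x"], 2))  # 0.9

# A hard-to-paraphrase sentence: x dominates, so the ratio is near 0.
hard = paraphrase_probs({"x": 5.0, "x1": 0.2})
print(round(hard["x1"] / hard["x"], 2))  # 0.04
```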
&lt;p&gt;Based on this intuition, Ribeiro et al. propose computing a semantic score \(S(x, x')\) as a measure of paraphrase quality:&lt;/p&gt;
&lt;p&gt;&lt;mathjax&gt;$$
S(x, x') = \min(1, \frac{p(x'|x)}{p(x|x)}) \\
\text{SemEq}(x, x') = \unicode{x1D7D9}[S(x, x') \geq \tau]
$$&lt;/mathjax&gt;&lt;/p&gt;
&lt;p&gt;A simple scheme to generate adversarial sentences satisfying Equation 1 is: ask the paraphrase model to generate paraphrases of a sentence \(x\), then test whether any paraphrase changes the model prediction: \(f(x') \not= f(x)\).&lt;/p&gt;
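&lt;p&gt;This generate-and-test loop can be sketched as follows. All names here are assumptions for illustration: &lt;code&gt;paraphrases&lt;/code&gt; stands in for the multipivot model, and the toy classifier keys on a single word:&lt;/p&gt;

```python
def semantic_score(p_xprime, p_x):
    """S(x, x') = min(1, p(x'|x) / p(x|x))."""
    return min(1.0, p_xprime / p_x)

def find_seas(x, f, paraphrases, tau=0.8):
    """Return paraphrases of x that stay above the equivalence
    threshold tau AND flip the model prediction."""
    original_label = f(x)
    seas = []
    for x_prime, p_xprime, p_x in paraphrases(x):
        if semantic_score(p_xprime, p_x) >= tau and f(x_prime) != original_label:
            seas.append(x_prime)
    return seas

# Toy stand-ins for illustration only.
f = lambda s: "what" in s.lower().split()
paraphrases = lambda x: [("Which color is the tray?", 0.09, 0.10),
                         ("The tray, what is its color?", 0.01, 0.10)]
print(find_seas("What color is the tray?", f, paraphrases))
# ['Which color is the tray?']
```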
&lt;h3&gt;Semantically Equivalent Adversarial Rules (SEARs)&lt;/h3&gt;
&lt;p&gt;SEAs are adversarial examples generated independently for each example. In this step, the authors lay out how to convert these local SEAs into global rules (SEARs). A rule in this paper is a simple discrete transformation \(r = (a \rightarrow c)\). For example, with \(r = (\text{movie} \rightarrow \text{film})\), we get \(r\)(&quot;Great movie!&quot;) = &quot;Great film!&quot;.&lt;/p&gt;
&lt;p&gt;Given a pair of texts \((x, x')\) with \(\text{SEA}(x, x') = 1\), Ribeiro et al. select the minimal contiguous span of text that turns \(x\) into \(x'\), include the immediate context (one word before and after the span), and annotate the sequence with POS (part-of-speech) tags. The last step is to generate the product of combinations of raw words and their POS tags. A step-wise example is the following:&lt;/p&gt;
&lt;p&gt;&quot;What color is the tray?&quot; -&amp;gt; &quot;Which color is the tray?&quot;&lt;/p&gt;
&lt;p&gt;Step 1: (What -&amp;gt; Which)&lt;/p&gt;
&lt;p&gt;Step 2: (What color -&amp;gt; Which color)&lt;/p&gt;
&lt;p&gt;Step 3: (What color -&amp;gt; Which color), (What NOUN -&amp;gt; Which NOUN), (WP color -&amp;gt; Which color), (WP NOUN -&amp;gt; Which NOUN)&lt;/p&gt;
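&lt;p&gt;The product in Step 3 can be sketched with &lt;code&gt;itertools.product&lt;/code&gt;. This is an illustrative reconstruction, with POS tags hard-coded rather than taken from a tagger:&lt;/p&gt;

```python
from itertools import product

def candidate_rules(span_src, span_tgt, tags, changed):
    """For each source token, allow either the raw word or its POS tag;
    unchanged context tokens are mirrored on the target side."""
    options = [(tok, tag) for tok, tag in zip(span_src, tags)]
    rules = []
    for combo in product(*options):
        lhs = " ".join(combo)
        rhs = " ".join(t if i in changed else c
                       for i, (t, c) in enumerate(zip(span_tgt, combo)))
        rules.append((lhs, rhs))
    return rules

# "What color" becomes "Which color": only token 0 actually changes.
for lhs, rhs in candidate_rules(["What", "color"], ["Which", "color"],
                                ["WP", "NOUN"], changed={0}):
    print(f"({lhs} -> {rhs})")
```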
&lt;p&gt;Since this process is applied to every pair \((x, x')\), and we assume humans are only willing to review \(B\) rules, Ribeiro et al. propose filtering the candidates so that \(\vert R \vert \leq B\). The criteria are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;High probability of producing semantically equivalent sentences&lt;/strong&gt;: this is measured by a population statistic \(E_{x \sim p(x)}[\text{SemEq}(x, r(x))] \geq 1 - \delta\). Simply put, applying the rule translates most \(x\) in the corpus into semantically equivalent paraphrases. In the paper, \(\delta = 0.1\).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High adversary count&lt;/strong&gt;: rule \(r\) must also generate paraphrases that alter the model's prediction, and the semantic similarity of those paraphrases should be high. This is measured by \(\sum_{x \in X} S(x, r(x)) \, \text{SEA}(x, r(x))\).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Non-redundancy&lt;/strong&gt;: rules should be diverse and cover as many \(x\) as possible.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To satisfy criteria 2 and 3, Ribeiro et al. propose a submodular optimization objective, which can be solved greedily with a theoretical guarantee of being within a constant factor of the optimum.&lt;/p&gt;
&lt;p&gt;&lt;mathjax&gt;$$
\max_{R, |R| \leq B} \sum_{x \in X} \max_{r \in R} S(x, r(x)) \text{SEA}(x, r(x))
$$&lt;/mathjax&gt;&lt;/p&gt;
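&lt;p&gt;A sketch of the greedy selection, where &lt;code&gt;scores[r][x]&lt;/code&gt; stands in for \(S(x, r(x)) \, \text{SEA}(x, r(x))\) and all values are invented:&lt;/p&gt;

```python
def greedy_sears(scores, instances, budget):
    """Greedy max-cover: the objective is, summed over x, the max over
    selected rules of scores[r][x]. Submodularity of this objective is
    what gives the greedy algorithm its constant-factor guarantee."""
    selected = []
    best = {x: 0.0 for x in instances}  # current max over selected rules
    for _ in range(budget):
        def gain(r):
            return sum(max(0.0, scores[r][x] - best[x]) for x in instances)
        r_star = max(scores, key=gain)
        if gain(r_star) == 0.0:
            break  # no remaining rule adds coverage
        selected.append(r_star)
        for x in instances:
            best[x] = max(best[x], scores[r_star][x])
    return selected

# Toy scores: rule "a" covers x1 well, "b" covers x2, "c" is redundant with "a".
scores = {"a": {"x1": 0.9, "x2": 0.0},
          "b": {"x1": 0.0, "x2": 0.8},
          "c": {"x1": 0.85, "x2": 0.0}}
print(greedy_sears(scores, ["x1", "x2"], budget=2))  # ['a', 'b']
```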
&lt;p&gt;The overall algorithm is described below:&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;https://github.com/windweller/windweller.github.io/blob/master/images/semadv-algorithm1.png?raw=true&quot; style=&quot;width:50%&quot; /&gt; &lt;/p&gt;
&lt;h3&gt;Experiment and Validation&lt;/h3&gt;
&lt;p&gt;The key metric Ribeiro et al. measure is the percentage of &lt;strong&gt;Flips&lt;/strong&gt;: the fraction of validation instances that are predicted correctly before, but incorrectly after, the application of a rule.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A comment raised during our discussion is that this metric does not indicate how many examples are affected by the rule.&lt;/strong&gt; For example, a rule that changes &quot;color&quot; to &quot;colour&quot; might have a &lt;strong&gt;Flips&lt;/strong&gt; rate of only 2.2% on the VQA dataset, but this might simply be because only 2.2% of instances in the VQA validation set contain the word &lt;strong&gt;color&lt;/strong&gt; -- in which case the rule actually succeeds at generating an adversarial example 100% of the time it applies.&lt;/p&gt;
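&lt;p&gt;The distinction can be made concrete with a helper that reports both the raw flip rate over the whole set and the flip rate among instances the rule actually touches (all names and numbers below are illustrative):&lt;/p&gt;

```python
def flip_stats(rule, instances, f):
    """Flip rate over the whole set vs. over instances the rule changes."""
    flipped = affected = 0
    for x in instances:
        x2 = rule(x)
        if x2 != x:
            affected += 1
            if f(x2) != f(x):
                flipped += 1
    n = len(instances)
    return flipped / n, flipped / affected if affected else 0.0

# A toy model that (spuriously) keys on the spelling "color".
f = lambda s: "color" in s
rule = lambda s: s.replace("color", "colour")
data = ["What color is it?", "Is it big?", "How many people?", "Nice hat."]
overall, among_affected = flip_stats(rule, data, f)
print(overall, among_affected)  # 0.25 1.0
```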
&lt;p&gt;The paper shows several effective discrete rules that generate adversarial text examples:&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt; &lt;img src=&quot;https://github.com/windweller/windweller.github.io/blob/master/images/semadv-sear-1.png?raw=true&quot; style=&quot;width: 50%&quot; /&gt; &lt;img src=&quot;https://github.com/windweller/windweller.github.io/blob/master/images/semadv-sear-2.png?raw=true&quot; style=&quot;width: 50%&quot; /&gt; &lt;/p&gt;
&lt;h3&gt;Human-in-the-loop&lt;/h3&gt;
&lt;p&gt;Ribeiro et al. also conducted experiments with human subjects. Bringing humans into the loop serves two purposes: humans can judge whether rules actually generate paraphrases (beyond the semantic scoring model of Mallinson et al.), and humans can decide whether the perturbations induced by rules are actually meaningful.&lt;/p&gt;
&lt;p&gt;They first judge the quality of &lt;strong&gt;SEA&lt;/strong&gt;: for 100 correctly predicted instances in the validation set, they create three comparison conditions: 1) adversaries created entirely by human MTurkers, referred to as &lt;strong&gt;humans&lt;/strong&gt;; 2) adversaries generated purely by the paraphrasing model described above, referred to as &lt;strong&gt;SEA&lt;/strong&gt;; 3) &lt;strong&gt;HSEA&lt;/strong&gt;, where SEAs are generated by the algorithm but the \(S(x, x')\) criterion is replaced with human judgments of similarity.&lt;/p&gt;
&lt;p&gt;They show that &lt;strong&gt;SEA&lt;/strong&gt; narrowly beats &lt;strong&gt;human&lt;/strong&gt; (18% vs. 16%), but combining with human judgments, &lt;strong&gt;HSEA&lt;/strong&gt; outperforms &lt;strong&gt;human&lt;/strong&gt; by a large margin (24% vs. 13%).&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt; &lt;img src=&quot;https://github.com/windweller/windweller.github.io/blob/master/images/semadv-hsea.png?raw=true&quot; style=&quot;width: 50%&quot; /&gt; &lt;/p&gt;
&lt;p&gt;Then they evaluate the global rules (&lt;strong&gt;SEARs&lt;/strong&gt;). This time, they invite &quot;experts&quot; to use an interactive web interface to create global rules, defining an expert as a student or faculty member who has taken at least one graduate-level NLP or ML class. Arguably, linguistics students would have been better suited as experts.&lt;/p&gt;
&lt;p&gt;Experts see immediate feedback on their rule creation: they know how many instances (out of 100) are perturbed by their rule, and how many instances have their predicted label flipped. For a fair comparison, they are asked to create as many rules as they want but to select the 10 best, and each expert is given roughly 15 minutes to create rules. They are also asked to evaluate SEARs and select the 10 rules that best preserve semantic equivalence.&lt;/p&gt;
&lt;p&gt;The results are not surprising: SEARs reach a much higher flip percentage, and the combined human-machine effort outperforms either alone. The authors also compare the average time (in seconds) it takes an expert to create rules versus to evaluate rules created by the machine.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt; &lt;img src=&quot;https://github.com/windweller/windweller.github.io/blob/master/images/semadv-hsear.png?raw=true&quot; style=&quot;width: 50%&quot; /&gt; &lt;/p&gt;
&lt;p&gt;Finally, the paper shows a simple method to fix these bugs: perturb the training set with the human-approved rules and retrain, which reduces the error rate from 12.6% to 1.4% on VQA, and from 12.6% to 3.4% on sentiment analysis.&lt;/p&gt;
&lt;h2&gt;Wrap up&lt;/h2&gt;
&lt;p&gt;This paper uses paraphrasing models both to measure semantic similarity and to generate semantically equivalent sentences. As mentioned in the text, machine-translation-based paraphrasing perturbs a sentence only locally, while humans generate semantically equivalent adversaries with more substantial perturbations.&lt;/p&gt;
&lt;p&gt;Another limitation is that gradient-based adversarial example generation is more directed, while the method proposed in this paper is essentially trial and error (keep generating paraphrases until one perturbs the model prediction). On the flip side, this method applies to blackbox models without access to gradients, and is thus more broadly applicable than gradient-based approaches.&lt;/p&gt;
&lt;p&gt;This paper provides a clear framework and proposes clear properties that adversarial text examples should abide by. This definition is very compatible with adversarial examples in computer vision. However, the framework only covers a specific type of adversarial example. An obvious kind not covered by this method is the addition or deletion of whole sentences, which is important for attacking QA models.&lt;/p&gt;
&lt;div class=&quot;footnote&quot;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&quot;fn:1&quot;&gt;
&lt;p&gt;Jia, Robin, and Percy Liang. &quot;Adversarial examples for evaluating reading comprehension systems.&quot; &lt;em&gt;arXiv preprint arXiv:1707.07328&lt;/em&gt; (2017).  &lt;a class=&quot;footnote-backref&quot; href=&quot;#fnref:1&quot; rev=&quot;footnote&quot; title=&quot;Jump back to footnote 1 in the text&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:2&quot;&gt;
&lt;p&gt;Belinkov, Yonatan, and Yonatan Bisk. &quot;Synthetic and natural noise both break neural machine translation.&quot; &lt;em&gt;arXiv preprint arXiv:1711.02173&lt;/em&gt; (2017).  &lt;a class=&quot;footnote-backref&quot; href=&quot;#fnref:2&quot; rev=&quot;footnote&quot; title=&quot;Jump back to footnote 2 in the text&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:3&quot;&gt;
&lt;p&gt;Ebrahimi, Javid, et al. &quot;HotFlip: White-Box Adversarial Examples for Text Classification.&quot; &lt;em&gt;arXiv preprint arXiv:1712.06751&lt;/em&gt; (2017).  &lt;a class=&quot;footnote-backref&quot; href=&quot;#fnref:3&quot; rev=&quot;footnote&quot; title=&quot;Jump back to footnote 3 in the text&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:4&quot;&gt;
&lt;p&gt;Xie, Ziang, et al. &quot;Data noising as smoothing in neural network language models.&quot; &lt;em&gt;arXiv preprint arXiv:1703.02573&lt;/em&gt; (2017).  &lt;a class=&quot;footnote-backref&quot; href=&quot;#fnref:4&quot; rev=&quot;footnote&quot; title=&quot;Jump back to footnote 4 in the text&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:5&quot;&gt;
&lt;p&gt;Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. &quot;Explaining and harnessing adversarial examples.&quot; &lt;em&gt;arXiv preprint arXiv:1412.6572&lt;/em&gt; (2014). &lt;a class=&quot;footnote-backref&quot; href=&quot;#fnref:5&quot; rev=&quot;footnote&quot; title=&quot;Jump back to footnote 5 in the text&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:6&quot;&gt;
&lt;p&gt;Mallinson, Jonathan, Rico Sennrich, and Mirella Lapata. &quot;Paraphrasing revisited with neural machine translation.&quot; &lt;em&gt;Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers&lt;/em&gt;. Vol. 1. 2017. &lt;a class=&quot;footnote-backref&quot; href=&quot;#fnref:6&quot; rev=&quot;footnote&quot; title=&quot;Jump back to footnote 6 in the text&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:7&quot;&gt;
&lt;p&gt;Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 597–604, Ann Arbor, Michigan. &lt;a class=&quot;footnote-backref&quot; href=&quot;#fnref:7&quot; rev=&quot;footnote&quot; title=&quot;Jump back to footnote 7 in the text&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">Reading Group Blog -- LSTMs Can Learn Syntax-Sensitive Dependencies Well, But Modeling Structure Makes Them Better (ACL 2018)</title>
    <id>https://nlp.stanford.edu/blog/reading-group-blog-lstms-can-learn-syntax-sensitive-dependencies-well-but-modeling-structure-makes-them-better-acl-2018-</id>
    <updated>2019-01-25T10:59:44Z</updated>
    <link href="https://nlp.stanford.edu/blog/reading-group-blog-lstms-can-learn-syntax-sensitive-dependencies-well-but-modeling-structure-makes-them-better-acl-2018-" />
    <author>
      <name>Robin Jia</name>
    </author>
    <content type="html">&lt;p&gt;Welcome to the Stanford NLP Reading Group Blog! Inspired by other groups, notably the &lt;a href=&quot;https://medium.com/uci-nlp&quot;&gt;UC Irvine NLP Group&lt;/a&gt;, we have decided to blog about the papers we read at our reading group.&lt;/p&gt;
&lt;p&gt;In this first post, we'll discuss the following paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kuncoro et al. &lt;a href=&quot;http://aclweb.org/anthology/P18-1132&quot;&gt;&quot;LSTMs Can Learn Syntax-Sensitive Dependencies Well, But Modeling Structure Makes Them Better.&quot;&lt;/a&gt; ACL 2018.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This paper builds upon the earlier work of Linzen et al.:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Linzen et al. &lt;a href=&quot;https://arxiv.org/pdf/1611.01368.pdf&quot;&gt;&quot;Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies.&quot;&lt;/a&gt; TACL 2016.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Both papers address the question, &lt;em&gt;&quot;Do neural language models actually learn to model syntax?&quot;&lt;/em&gt; As we'll see, the answer is yes, even for models like LSTMs that do not explicitly represent syntactic relationships. Moreover, models like RNN Grammars, which build representations based on syntactic structure, fare even better.&lt;/p&gt;
&lt;h1&gt;Subject-verb number agreement as an evaluation task&lt;/h1&gt;
&lt;p&gt;First, we must decide how to measure whether a language model has &quot;learned to model syntax.&quot; Linzen et al. propose using &lt;em&gt;subject-verb number agreement&lt;/em&gt; to quantify this. Consider the following four sentences:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;key is&lt;/strong&gt; on the table&lt;/li&gt;
&lt;li&gt;* The &lt;strong&gt;key are&lt;/strong&gt; on the table&lt;/li&gt;
&lt;li&gt;* The &lt;strong&gt;keys is&lt;/strong&gt; on the table&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;keys are&lt;/strong&gt; on the table&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Sentences 2 and 3 are invalid because the subject (&quot;key&quot;/&quot;keys&quot;) disagrees with the verb (&quot;are&quot;/&quot;is&quot;) in number (singular/plural). Therefore, a good language model should give higher probability to sentences 1 and 4.&lt;/p&gt;
&lt;p&gt;For these simple sentences, a simple heuristic can predict whether the singular or plural form of the verb is preferred (e.g., find the closest noun to the left of the verb, and check if it is singular or plural). However, this heuristic fails on more complex sentences. For example, consider:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;keys&lt;/strong&gt; to the &lt;em&gt;cabinet&lt;/em&gt; &lt;strong&gt;are&lt;/strong&gt; on the table.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here we must use the plural verb, even though the nearest noun (&quot;cabinet&quot;) is singular. What matters here is not linear distance in the sentence, but &lt;em&gt;syntactic distance&lt;/em&gt;: &quot;are&quot; and &quot;keys&quot; have a direct syntactic relationship (namely an nsubj arc). In general there may be many intervening nouns between the subject and verb (&quot;The keys to the cabinet in the room next to the kitchen...&quot;), making predicting the correct verb form very challenging. This is the key idea of Linzen et al.: we can measure whether a language model has learned about syntax by asking, &lt;strong&gt;How well does the language model predict the correct verb form on sentences where linear distance is a bad heuristic?&lt;/strong&gt;&lt;/p&gt;
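&lt;p&gt;The baseline heuristic is easy to write down, which makes its failure mode concrete. A sketch with hand-tagged tokens rather than a real POS tagger:&lt;/p&gt;

```python
def last_noun_number(tagged_prefix):
    """Naive heuristic: predict verb number from the nearest noun to the
    left of the verb (linear distance, not syntactic distance)."""
    for word, tag in reversed(tagged_prefix):
        if tag in ("NN", "NNS"):
            return "plural" if tag == "NNS" else "singular"
    return None

# "The keys to the cabinet ___": the attractor "cabinet" fools the heuristic.
prefix = [("The", "DT"), ("keys", "NNS"), ("to", "IN"),
          ("the", "DT"), ("cabinet", "NN")]
print(last_noun_number(prefix))  # singular (wrong: the subject is "keys")
```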
&lt;p&gt;Note how convenient it is that this &lt;em&gt;syntax-sensitive dependency&lt;/em&gt; exists in English: it allows us to draw conclusions about syntactic awareness of models that only make word-level predictions. Unfortunately, the downside is that this approach is limited to certain types of syntactic relationships. We might also want to see if language models can correctly predict where a prepositional phrase attaches, for example, but there is no analogue of number agreement involving prepositional phrases, so we cannot develop an analogous test.&lt;/p&gt;
&lt;h1&gt;LSTM language models learn syntax (but only if they are big enough)&lt;/h1&gt;
&lt;p&gt;Linzen et al. found that LSTM language models are not very good at predicting the correct verb form, in cases when linear distance is unhelpful. On a large test set of sentences from English Wikipedia, they measure how often the language model prefers to generate the verb with the correct form (&quot;are&quot;, in the above example) over the verb with the wrong form (&quot;is&quot;).  The language model is considered correct if&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;P(&quot;are&quot; | &quot;The keys to the cabinet&quot;) &amp;gt; P(&quot;is&quot; | &quot;The keys to the cabinet&quot;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a natural choice, although another possibility is to let the language model see the entire sentence before predicting. In this regime, the model would be considered correct if&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;P(&quot;The keys to the cabinet are on the table&quot;) &amp;gt; P(&quot;The keys to the cabinet is on the table&quot;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here, the model gets to use both the left and right context when deciding the correct verb. This puts it on equal footing with, for example, a syntactic parser, which can look at the entire sentence and generate a full parse tree. On the other hand, you could argue that because LSTMs generate from left to right, whatever is on the right hand side is irrelevant to whether it generates the correct verb during generation.&lt;/p&gt;
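&lt;p&gt;Either criterion reduces to comparing two probabilities under the language model. A sketch of the left-context-only version, with a lookup table standing in for a real autoregressive LM:&lt;/p&gt;

```python
def prefers_correct_verb(lm_prob, prefix, correct, incorrect):
    """Left-context-only criterion: the LM is credited when it assigns
    the correct verb form more mass than the incorrect one."""
    return lm_prob(correct, prefix) > lm_prob(incorrect, prefix)

# Stand-in "LM": a table of next-token probabilities (numbers invented).
table = {"The keys to the cabinet": {"are": 0.02, "is": 0.05}}
lm_prob = lambda tok, prefix: table[prefix].get(tok, 0.0)

print(prefers_correct_verb(lm_prob, "The keys to the cabinet", "are", "is"))
# False: this toy LM is fooled by the attractor "cabinet"
```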
&lt;p&gt;Using the &quot;left context only&quot; definition of correctness, Linzen et al. find that the language model does okay &lt;em&gt;on average&lt;/em&gt;, but it struggles on sentences in which there are nouns between the subject and verb with the opposite number as the subject (such as &quot;cabinet&quot; in the earlier example). The authors refer to these nouns as &lt;em&gt;attractors&lt;/em&gt;. The language model does reasonably well (7% error) when there are no attractors, but this jumps to 33% error on sentences with one attractor, and a whopping 70% error (worse than chance!) on very challenging sentences with 4 attractors. In contrast, an LSTM trained specifically to predict whether an upcoming verb is singular or plural is much better, with only 18% error when 4 attractors are present. Linzen et al. conclude that while the LSTM &lt;em&gt;architecture&lt;/em&gt; can learn about these long-range syntactic cues, the language modeling &lt;em&gt;objective&lt;/em&gt; forces it to spend a lot of model capacity on other things, resulting in much worse error rates on challenging cases.&lt;/p&gt;
&lt;p&gt;However, Kuncoro et al. re-examine these conclusions, and find that with careful hyperparameter tuning and more parameters, an LSTM language model can actually do a lot better. They use a 350-dimensional hidden state (as opposed to the 50-dimensional one of Linzen et al.) and are able to get 1.3% error with 0 attractors, 3.0% error with 1 attractor, and 13.8% error with 4 attractors. &lt;strong&gt;By scaling up LSTM language models, it seems we can get them to learn qualitatively different things about language!&lt;/strong&gt; This jibes with the work of &lt;a href=&quot;https://openreview.net/forum?id=ByJHuTgA-&quot;&gt;Melis et al.&lt;/a&gt;, who found that careful hyperparameter tuning makes standard LSTM language models outperform many fancier models.&lt;/p&gt;
&lt;h1&gt;Language model variants&lt;/h1&gt;
&lt;p&gt;Next, Kuncoro et al. examine variants of the standard LSTM word-level language model. Some of their findings include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A language model trained on a different dataset (1 Billion word benchmark, which is mostly news instead of Wikipedia) does slightly worse across the board, but still learns some syntax (20% error with 4 attractors)&lt;/li&gt;
&lt;li&gt;A character-level language model does about the same with 0 attractors, but is worse than the word-level model as more attractors are added (6% error as opposed to 3% with 1 attractor; 27.8% error as opposed to 13.8% with 4 attractors). When many attractors are present, the subject is &lt;em&gt;very&lt;/em&gt; far away from the verb in terms of number of characters, so the character-level model struggles.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the most important question Kuncoro et al. ask is whether incorporating syntactic information during training can actually improve language model performance at this subject-verb agreement task. As a control, they first try keeping the neural architecture the same (still an LSTM) but change the training data so that the model is trying to generate not only words in a sentence but also the corresponding constituency parse tree. They do this by linearizing the parse tree via a depth-first pre-order traversal, so that a tree like&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;tree&quot; src=&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAWMAAAC2CAYAAAAMRuzJAAAgAElEQVR4Xu2dB1RURxfH/yJdRFEsiIBRgh0VxYpgS5Ro7CYaNBoRW2yoWGKwBBvGGjW2T41RozGW2DU2xIoasMWOBcXeC1b0O3fWXXdhkd1ll327e+85HhXmzZv5zbz/3nfn7kyOd+/evQMbE2ACTIAJGJVADhZjo/LnmzMBJsAEBAEWY54ITIAJMAEJEGAxlsAgcBOYABNgAizGPAeYABNgAhIgwGIsgUHgJjABJsAEWIx5DjABJsAEJECAxVgCg8BNYAJMgAmwGPMcYAJMgAlIgACLsQQGgZvABJgAE2Ax5jnABJgAE5AAARZjCQwCN4EJMAEmwGLMc4AJMAEmIAECLMYSGARuAhNgAkyAxZjnABNgAkxAAgRYjCUwCNwEJsAEmACLMc8ByRK4cAEYORLYsQO4dw8oWBD4/HNgxAjAy0uyzeaGMQGdCLAY64SNLzI0gevXgYoVASsroH17wM0NuHYNWLwYsLYG4uOBIkUM3QqunwlkHwEW4+xjzXfSgsDYsTIPODER8PT8cOH580D58kDfvkB0tBYVclEmIHECLMYSHyBLbV5EBDB3LnD3LmBjo0phwgSgUCGgY0dLpcP9NkcCLMbmOKpm0Ke4OKBWLaBBA6BPH6BOHcDR0Qw6xl1gAhkQYDHmqSFZAmvWAP37A5cvy7xjPz+gUSOgWzdZDJmNCZgTARZjcxpNM+wLnV1Oi3XbtsmyKnbvBhwcgLVrZd4yGxMwFwIsxuYykmbWj7dvgTdvAFtb1Y5RRkXjxsDTp7LFPTYmYC4EWIzNZSTNrB/OzkCPHuozJmbOBHr1Ah49AqgcGxMwBwIsxuYwimbYhy+/BA4eBA4cALy9P3SQwhYtWgCHDgGUi8zGBMyFAIuxuYykmfXj9GmgZk3g1SugTRugdGng9m0gJkYWQ16yBAgJMbNOc3csmgCLsUUPv7Q7f+UKEBUlW7i7cQOwtwf8/QHKQaavRbMxAXMiwGJsTqPJfWECTMBkCbAYm+zQccOZABMwJwIsxuY0mtwXJsAETJYAi7HJDh03nAkwAXMiwGJsTqPJfWECTMBkCbAYm+zQmV/D582bh+nTp+PChQt48eIF3lFSsZIVL14cJUqUUPxR/r+Tk5P5AeEeWRQBFmOLGm7jd7Zdu3bYtWsX7t27hzf0feePmJWVFVxcXBAQEIC///4bFy9eRGJiouKP8v8dHR2RkVi78a5Cxh94bkGmBFiMM0XEBbQhsHLlSowbNw5nzpzB8+fP03m3ynXlyJEDdnZ28PT0RFhYGAYOHKjNrVTK3r59O0OxfvjwofCmMxLrnDlz6nxfvpAJ6IsAi7G+SFpQPaGhodi8eTPu3r2L169fZ+rd5s2bF/7+/vjtt99QuHDhbCdFHwrkUWfkWRctWlRt6IMEPE+ePNneXr6hZRJgMbbMcf9or3fs2IHIyEicPHkST58+zdS7tbW1hbu7O0JCQvDTTz+ZHNErV66oDX2QgJPXnFGcmkScjQnoiwCLsb5Imlg9/fv3x6pVq3Dz5k3h3aZdLFPuDsVuaYHMz88PU6dORYUKFUyst7o39/79+xnGqSk08rFFRfqQYmMCmhJgMdaUlImVO3bsGPr164eEhAQ8efIEb2mD4I8YCQeFEJo1a4ZffvnFxHprnOa+evXqo4uKBQsWzDBWnT9/fuM0mu8qWQIsxpIdmswbNnz4cCxduhTJyckgYfiYd0uLZeTdlitXDlFRUahfv37mN+ASWSJw7dq1DGPVlEkiD3+k9a69vLyydF++2DQJsBhLeNwohNCpUyccPnwYlBGQmXdrY2MDV1dXBAcHY/78+RLuGTft0aNHGcapr169mmGcmgTcgc6dYjM7AizGRh7SiRMngr7skJSUhJcvX2bq3d
KDWKpUKQwdOhStW7c2cuv59oYgQB+6GeVT088pOyWjWDWFRthMkwCLcTaMW/PmzbF37148ePAgU+/W2toaFE+sW7culi1blg2t41uYGoEbN25kGKtOSUn5aE61qfXVktrLYmzg0aZMBOVYLsVu7e3t4e3tjd69e4svO7AxAX0RoFTEjHKqO3TogBEjRujrVlyPngmwGOsZKFfHBJgAE9CFAIuxLtT4GibABJiAngmwGOsZKFfHBJgAE9CFAIuxLtT4GibABJiAngmwGOsZKFfHBJgAE9CFAIuxLtT4GibABJiAngmwGOsIdMsWIDgY6NoVmDMnfSVNmgBPnwIxMbLfBQQA+/aplnN0BHx8gO+/B7p00bEhfJlFE0hJAWhX0mbNgMWL1aOgrUbCw4HLl4GzZ4HPPlMtR9s5u7oC9eoB0dGAh4dFIzVa51mMdUQvF+McOYDdu4HatVUrUifGN28CvXp9KPf8ObBhA7B/P7BgAfDddzo2hi+zaAKdOwMrVwK3bwP29ulRVK8O0KlU27fL/pAYkwPg7S0rS3tIXb0KzJ0LuLsDJ04AdnYWjdQonWcx1hG7XIxp73HyTI4dU53A6sTY2vqDpyy/7atXQPnysoeI6mBjAtoSoLevunWB1auBFi1Ur05MlIkuec3t238Q4127gDp1VMsuWgR06qS+Hm3bxOW1J8BirD0zcYVcjGfNAnr2BIYNA6KiPlSmqRjTFaGhwIoVwJMnOjaGL7NoAnRua/HiQLVqwPLlqihor/+JEwF6K6OwmNwzVifG5B17esrKDxhg0UiN0nkWYx2xy8WYXulmzJCFGeLjgXLlZBVqI8YUT374EDh5UsfG8GUWT2D4cGDSJODOHZnoyq1UKaBWLUC+id/HxFj+u7/+AngPquyfUizGOjJXFmNa8ChTRrbwQfFfKyv1YkyHIcs9F/JmKMa3ZIlMzMeNA4YM0bExfJnFE7hwAfj0U9n8+vprGY4jRwB/fyA29sOahlxwaQ8qiiWTvXghixPT/KNF53PnAD76L/unFIuxjsyVxZi8YYrXtWoFTJsG9OmjXozTZlPQrWkBsG1bYOFCXjTRcSj4svcEyAMuVEg2F8kog2LdOoDixnKTi7E6aG5uAMWN02ZbMODsIcBirCPntGJM1bRsCWzbBpw6BfTokT61jUIRo0fLbkgiTCvcZcvKFgDZmEBWCVA2RN++sjeuXLkAOi+1e3eAQhhpxZjWN+QhNRsbgITY1xegRWY24xBgMdaRuzoxvn5dFq6gGDBZ2jxjddkUOt6eL2MC6Qg8eiT7YJ83Tyau5OFevAgUK5ZejNUt4DFS4xJgMdaRvzoxpqpmz5Z5xc7OQKVKql/6YDHWETZfpjEBihfT2gSFK86cAXbuVL30Ywt4Gt+ECxqEAIuxjlgzEmNamAsMBPbuBYKCWIx1xMuX6Uhg0ybgq69kzgAtCnfsyGKsI8psv4zFWEfkGYkxVXf6tMwrptVq5a9Ds2esI2y+TGMCqamyWDHlrN+6JYsdKxt7xhqjzPaCLMbZjpxvyASYABNIT4DFmGcFE2ACTEACBFiMJTAI3AQmwASYAIsxzwEmwASYgAQIsBhLYBC4CUyACTABFmMDzIG8efPiEWXgp7Fq1arh4MGDBrgjV8kEMiYQEBCAfe+/i1+yZEkMHjwY3/Hm2ZKbMizGehiS1q1bY9WqVelqypEjB96+fYvcuXPjKX0dT8kcHBwQGxuLKlWq6KEFXAUTSE/A0dERz+kEAwA2NjZ49eoVOnbsKOZqamoqGjdujLlz5yJfvnyMTwIEWIx1HARra2sxodNajRo1sJ+2bsvA6AHYvHkz3tG3Q5SsbNmyOMl7aOo4GnyZnECbNm2wko79eG/lypXDCdqSLY2tXr0ao0aNwvHjx1GsWDH07dsX/fr1Y5BGJMBirCH8woUL4xZl0acx8nBT6CAyHY28kgcPHqhcbWtri19//RWhtOs8GxPQgECePHnw+PFjUTJnzpyIiY
kBhSc0sZ49e2LZsmV49uwZGjRogOnTp6NEiRKaXMpl9EiAxTgDmMOHD0eU8tEdSuXWrVuHL7/8Uo/DIKuqU6dOWLx4sQhtKJuXlxcu02mSbExAiQB5stNoz9b35unpiStXrujMaMeOHfjhhx9w+PBhuLm5oXv37oiMjNS5Pr5QOwIsxkq8yCN9/fp1OoLe3t44f/68dmT1UFqdN05eDz0wP9F5OmwWSaBgwYK4Q0d6gA4ysBKC3Ev5pFs9UBk0aBB+++033Lt3D7Vr18bEiRN5fUMPXD9WhUWLsY+Pj1qRlS92GJi9VtWTpz527Nh0cepChQrhJh1wxmbWBGbNmiUEV/7WVKBAAdymjYsNbEeOHEFERIRYbM6fP794e5swYYKB72qZ1VuUGK9fvx5NmzZVO9L0OmZK3iYtuqR9JSUvqUOHDsKjYTMPAhSiSkpKUnSGFtqmTp1qlM5R2G727Nm4ceMG/P39hXNQv359o7TFHG9q9mKsnN6jPIDm5FHOnz8ftAhDqUvK5uLigvv375vjvDXrPu3duxd16tRRvAU5OzurzVs3FoTExET07t0b27dvBz1f33zzjVhwZssaAbMT45o1a+LAgQPpqFCs9Q3tum0BRulM//33n0pPKec5ODgYGzdutAACptnF8uXLq6Q3Uv76X3RUs4SN4tXkqdMCs6+vL0aMGIGWdP4Ym9YEzEKM6fU8bd4ukWjVqpVKzqXWdMzgAor5BQYGKpL/5V1ycnLCE9r0ls3oBJQXjrOaKmmsztAbWNeuXcWHPTk+9OwtotNN2TQmYPJiTB6f3CjX8iGd+smWIYHq1asjLi5O8Xt1H2KML/sIkHDRolytWrVA4QlzsIULFyI6Olqkx+2iw/bYNCJg8mKsUS+5EBNgAkxA4gRYjCU+QNw8JsAELIMAi7FljDP3kgkwAYkTYDGW+ABx85gAE7AMAizGljHO3EsmwAQkToDFWOIDxM1jAkzAMghIVoy3bAGCg4GuXYE5c9IPRpMmAO3XHhMj+x3tFvj+MANFYUdHwMcH+P57oEsXyxhQ6iV9t8XGRrW/VlZA7tyAnx9AewzJd1fcvh347DPVsjlzAq6uQL16QHQ04OFhOey4p5kT4PmVOSNdSkhejCmNePduoHZt1e6pE2PaL0d58yo65GDDBoD2el+wALCUk2bkD0vDhkCjRjJutJc9pWDTthW0LXNCAlC6NCAXY/rA8vaWlaUdPK9eBebOBdzdAdqb3M5Ol+nF15gjAZ5fhhlVyYtxnjxA4cLAsWOqgqBOjK2tP3jKcly0XUP58oC9vawOSzD5wzJiBDBypGqPExOBkiVlbwu0Fa5cjCk3v04d1bL0BapOnYDVq4EWLSyBHPdREwI8vzShpH0ZyYvxrFlAz57AsGGA8l7vmooxIaEDM1asACzl278fe1iIBx3iUKYMsH79x8WYvGNPT2DiRGDAAO0nlyVf8ejlI/y480esPbMWN5/ehIuDC5qWbIopDafAydYJR64fgf88fyxqvgj9t/YHfZM0rkscirsUx+YLmxG5MxInb59EgVwF0MWvCyIDI2GVw0oSSHl+GWYYJC/G9Io8Y4YszBAfD5QrJwOhjRhTfJRe0S3liLmPPSz0gVSoENC5s4zrxzxj+e9or5rWrQ0zAc211uClwfjv9n8YXW80iuQugoPXDmJkzEgMCxyGUXVGKcS4sFNhDKs9DI9fPsYPtX8QQtzkjyZoVrIZOlfqjLP3zgphDqschmmNPpzqYUxuPL8MQ98kxJgWkMiTo78p/kuLUerEmCbJ8uUyUBQjpb23lyyRic64ccCQIYaBKLVa5Q9L376A/IxJ+hmdyDN2LEAhCdoGoWbND2K8bBlQvbqsJy9eyOLExIsWSc+dAyhcxKYZgSevnuCLpV8gomaE8IblRj9LeZ2CmE4xCjH+qe5PwuuVW6U5lWBjZYO4sDjkgGzflfkJ89F1fVdc7ncZHs7GX03l+aXZPNC2lEmIMXnDFLds1U
oW5+zTR70Yp82mIBi0ANi2LbBwoeUsQqlb7ZZPDAcHYNQoICJC9hN12RTysm5uAMWN02ZbaDvJLLl80qMknL9/HiduncCMQzNE2OFA6AGFGK9rtw5f+sjOU3z44iFcol0wtv5YIeRyu5NyB0UmFcHvLX5HB98ORsfJ88swQ2AyYkzdp21St20DTp0CevRIn9pGoYjRo2WgSISdnICyZWULgJZk8ofl669lH0RklK7m4gJUrCjjIje5GFM8Xh4CorQ4EmJfX4AWRdm0J7D69GoM/GcgLj28hNy2uVGlSBXcenZL/Ptgl4MKMT4Udgj+RfzFDRIfJML7l/cpLWpuGd0gGoNqDdK+MXq+gueXnoG+r86kxPj6dVm4Qp4jmzbPWF02hWGwSbvWzBZYlFv/sZixtHsp3dbRwpvfHD98U/4bESP2zuctQg6tVrRC8uNkFTE+HHZYCDXZ3ZS7KPBzAQwPGq7wlpV76e7sDjcnN6N3nOeXYYbApMSYEMyeLfOKnZ2BSpVUv/TBYiybJPywGOZh0bTWBQkLELouFIl9EkV2BBnFisnr9cjjIbIm5NkUymJM5UrNKIWyBcti1VerFLc7dusYwreEY+LnE+Hn5qdpMwxWjueXYdCanBjTwlxgoGwBKiiIxVjdtOCHxTAPi6a1Hr15VHjG7X3bo1fVXsLjHbd3HPYl7RNCe6LHiQzF+K9Tf+Grv75Cp4qd0KZMG9x7fg/Ddw2Hg7UDEronwC6n8b99w/NL05mgXTmTE2Pq3unTMq+YVv+Vvw7NnjF7xtpNf8OVJu947J6xSH6SLEILTXyaiHAFxZFvRdxC4v1EkWec1jOmFlG8ecyeMSLPOK99XjQs0RDRn0VLIkTBb16GmzOSFWPDdZlrZgJMgAlIjwCLsfTGhFvEBJiABRJgMbbAQecuMwEmID0CLMbSGxNuERNgAhZIgMXYAgedu8wEmID0CJi0GNNOV3J7RzlvbBoRqFixIo4p7Sfq6+ur8n+NKuFCTCADAkOGDMGECRNAz2ThwoWxcuVK1KpVi3llQsAkxTitCLMoazbPbW1t8fr1a5UPMGV29PuXL19qVhmXYgJpCCh/yNvZ2eHFixcoXbo0zpw5g9y5c2PcuHH4njbSZlNLwKTE2MrKSnzakqX15qpUqYJ///1X/I7Kpaam8pC/J6AsuPQjdW8Rjo6OeE5Ho7y31atXowXvKM9zSAMCLi4ueEgbw4D2gSmMGzdupLsqKCgIe/bsgY2NDcLCwjCDtlJkUyFgEmKsPNg0mK/o+I4MzN7eXuHd5c+fH3fv3rXIIafJ3rt3b0XfnZyc8ESD3fUbN26MTZs2Ka7z8PBAUlKSRTLkTmdM4MCBAwgMDMQb+joe6LzEetixY0emyNq3b48///wTb9++RXBwMDbQuWhsgoCkxbh///6YMmWKYqi0iQsre4ODBw/G+PHjLWLI3d3dcZ12VHpvrVq1EjE7XUz5TYTfNnQhaH7XkJguXbpUdIzmxOzZs4Wnq62NGDECP//8s3gb8/PzU7zValuPOZWXrBjrKw6sr3qkPug5c+YU3obctPngyqxv+fLlw4MHDxTFBgwYgIl0FhObxRDw9vZGIh2gCCBXrlx4Slsm6sFWrVqFrl274v79+/Dy8kJsbCw86awvCzTJibGhxNNQ9Rp7zij3y9De68CBAzFp0iRFlyl8RA8Rm/kSoPDWs2fPRAdLlCiBCxcuGKSzFAqjuPLly5dBH/7kcbdp08Yg95JqpZIRY2VRCQ8Px+TJk/XOjFJuoqOjDeI96r2xH6mwdevWII9CbkWKFEFycnJ2NgHKnjiNnbJXnq0N4ZvpncC8efPQvXt3xZiGhIRgCZ1flk1WuXJlxMfHw8HBARERERhFR9NYgBldjJXTrfLmzavyOmwo/q6urrh3756oXp6CY6h76bNeSg9Sfj2cPn06evXqpc9baF0XvVJepWOk39sXX3yBjRs3al0PX2B8AvXr18fOnTtFQ6ytrUXIoE
aNGkZrWJMmTbB582Zxcnbbtm2z9QPBGJ02mhhXqFABx48fF302lmel7N3Rp/GRI0eMMQaZ3lOT1LRMKzFwgTVr1qAlnYv13sirSUlJMfBduXp9EHBzc8PNmzdFVdnlEGnTbsoKmjt3rsiRDwgIEB8S5mhGEWOpxW+l1h75RFNuV2YpfVKanPS2oZx+qM/FRCn109TbopwGSs7R0aNHJd2lmTNnYujQoSJFs1SpUjhNG5ubkWW7GMsFRooPqJTaJm+LKTwkGT0Pym8/UhxvM3qOte4KfbjTF6MGDRpkcmmf+/btA62b0IfJpUuXtO67VC/IdjGWKghuFxNgAkzAmARYjI1Jn+/NBJgAE3hPgMWYpwITYAJMQAIEWIwlMAjcBCbABJgAizHPASbABJiABAiwGEtgEKgJ7/AOOfBhs3yJNIubwQSYQDYRyJIYF5taDFceXfloUw+HHcbepL0Ysn0IXvz4Ilu6lXd8XgwJGCL+mIJNOjAJNlY26FOtjyk0V9HGLVuA4GCga1dgzpz0TW/SBKD9ZGJiZL8LCAD27VMt5+gI+PgAtOd4ly4m1X1urIEIWOq8ypIYL0hYgMcvH4shefLqCYbvGo5WpVshwDNAMUzflP8Gf5z4g8X4IxPX1D485F2RPzR0+tXu3UDt2qqdVCfG9EUv5W9w0372tKXt/v3AggXAd98Z6Annak2GgKXOqyyJsfLo3nx6E26T3DA9eDp6VVXdL2HqwaksxmYsxnny0AkPAB2rZ2f3oaPqxNja+oOnLC9JZwWULw/Y28vqYLNsAnIxtrR5la1ivPKrlRi8fTAS7yeilGspTG44GfU+qaeYeWfvncXAfwYi5nIMcubIiWalmoky+R3yazU7lT1NisWGrArBxvMbsePbHahSpIqIz04+MBmzDs/C1cdX8Wm+TzGyzki0LtNa3GfAPwOwMGEhbg68Cductop7f774czjbOYP6sf/qftGXhBsJokz94vVFWz2cPbRqq/1oe7xMlZ0755XHC5f7XRb/3nBuA0bHjsbJ2yeRxz4P6A0jqm4U7K3ttarfkIXlD82sWUDPnsCwYUBUlPZiTFeEhgIrVgAaHEZiyC6ZVd2n7pwSc5TmaurbVHxe4nNMajhJMUfJSfpf/P8woOYAjN0zFkmPklCmQBkxj+sWq2s0FpY6r7JNjPtv7Q+33G4YGjAUro6uGLNnjBj85P7JcLJ1QvKTZFScXRGFchXC8KDheJX6CpG7IkXZ/aH7RUxVU1MW475b+mLev/OwtcNW1PaUvUcP3TEUE/ZNwMCaAxHkFSSEmoSZRLZl6ZY4fus4KsyugHXt1uFLny/FNeT5F51cFKu+XiU+QDyneOKLT79Axwod8fDFQ/yw4wcUzFVQtFUbO3fvHKrMrYKwymHo6d8TJVxKYNGxRej0dyd0qthJfECcvnMaI2JGiPuub7dem+oNWlb+0Jw4AdCRZhRmiI8HypWT3VZTz5jKUjyZjlE7edKgTbaYyv+78x+q/686fAv5on+N/kh5nYIRu0Yg9V0qErolIJ9DPpAYk1iXLVAWUxtNFfM3fGs4DiUfEs+lsT74LXVeZZsY0yD/0+EffFb8M/FAxF6JRdBvQdjdaTcCvQLFJKBP6fO9z6OwU2FRhoSq9MzSWNxisfAMNTW5GNPE+2n3T0JUG5ZoKC6/k3IH7pPcxWLZxM8/nFbRYU0HxF2Lw7ne50Q5+mAoXaA0lrVaJv5PEzcqNgo3BtzAvzf+Rc35NRHfLR6VClcSv9+TtAf/JP4jPGzy6rWxtJ68x2QPVC5SGWvbrlVUs/j4Yny75lvEfher+FDR5h6GKKv80Hh4AGXKAPQ3xX+trNSLMR2Ztny5rDV0tuzt2wBtlUtiPm4cMMQ01lwNgVOvdYasDsHOSztxse9FOFg7iLoTHySi9IzSiAyKRGRgpJjT9NyROFcsXFGUOXL9CPzn+SOmU4xwVIxhljqvslWMX0W+Uni4159ch/tkdyE4TUs2Fe
JHIrzhG9UDCn1n+aKGRw3Mbzpf43lB4lbcpTgSbiYgpHwIlrT8sDH232f+Ros/W2Bf532o6l5VUefKUyvRblU7JIUnidc4CmOQZ3474jZy2eQS3mu1otUw84uZePTyEYpPKy76Qh8SwZ8Gi4mrHNLQuLG0baFS9gc9MN6/eGNpy6UqH0Cv374GhTRG1RmFHwN/1KZ6g5VVfmjIG169GmjVCpg2DejTR70Yp82moMbRAmDbtsDChaoxZ4M13AIq9pjigWDvYMz9cq5Kb2svrA27nHbY/u12IcYUknsd+RpWOaxEOQrb0Vvfxm82ijc/Y5ilzqtsE+O0qW3yBb81X69B81LNQZPn2uNraseeJtWmkA8nFmc2QUjcKMuDYmQ7Lu3Av13/Fa9rZPMT5qPLuoxzqOK6xAmRvvXslghL/N7id1R2q4ySM0riQOgBVC9aXdRDr4FjYseIEAfdy8XeBYMDBmNwrcGZNS/d75XFWO6Z0MNS/5P6KmVdJ7iic6XOmPDZBK3vYYgL0j40dA/a0njbNuDUKaBHj/SpbRSKGD1a1hoSYScnoGxZ2QIgm/4IOI11QniNcLHOoGytV7QW4cFDYYeEGKd9LukZpGeRwmFNfJror0Fa1GSp80oyYlzu13IolreYeM1Pa7Ro5pPfR+PhJHH7rtJ3iG4QjfKzyosFwH2h+8SXKsgDbvNXG+GRF8ldJF2dtLBIMWyyxn80FnEzEuOFRxeKEEpae/P2jcijnhY3DeR1k8dd06Omxm2lgspifPHBRZT4pUQ6z5hi6OQZj2swTifB16pBGhZW99DQwdQUrqAYMFnaPGN12RQa3o6LaUGAvNtG3o3SecYBCwKQ2y43NodsNikxtoR5JRkx7r6hOzad34Qzvc7A0cZRTLvnb56jxfIWaFe+nVgo09SUxW1r4lY0WiKblGF+YWKh0GuKF6Y0moLeVXsrqqSc6dWnV2NFmxWK+//5358IXRuKkq4l0axkM7GwSLb27Fp0Xd9VtJU8YjL68gt9CYZizG3LtdW0qaJc/gn5xWIiLW5SpoPZdIMAAA8XSURBVMfHYsZ7vtujkset1Y30XFidGNMtZs+WecXOzkClSqpf+mAx1vMgZFCdupgxfdDTG96IoBEi1GVKnrElzCvJiDHFSv3m+KFcwXIIrx4OO2s7TDkwBYevHxavVKVdS2s8i9N+iaL58uZigY3Es4BjAfTa1AskvvQNPYpHU/ZE5M5IIfrKsekXb16I3GnKlkjskyji0GQUwqCFEGorpQVZW1njl7hfxCo0LQDSPbQx8oQpa6RHlR4ig4K88M5rO4t/tynTBmfunhHZFPRlGgrXSOVr0xmJMS3MBQYCe/cCQUEsxtrMBX2VpTCa/1x/VHKrhAE1BohsipExI0Fz+liPY+Jt0dTE2NznlWTEmCYhiSKl2uy5skcIHOUEj60/VmWhTZPJmlaMLz28hDIzy+Crsl9hUfNFIr1n/N7xmB8/X3jKFK7o4NtBrDKnTaGjxb47z+5gb+e9KreOS47D0O1DxSIhhRAolvzzZz/Dz81PkyaqlCHxDd8SLhZRbkXcEm2gby2O2ztOZJRQyhEtRFIIx1jpRuo6lZEYU1k6EYe84urVWYy1nhB6uoAcGYoJH7h6AA42DmINhdYb5LnwpibG5j6v9CbGepo/kqqGvAjK+KDYcxc/3jhBUoPDjWECZkaAxVjNgD548UCEHXZd2oXTd0/jUt9LijiymY0/d4cJMAGJEGAxVjMQtHBIi3wUEljQbAEaFG8gkeHiZjABJmCuBFiMzXVkuV9MgAmYFAEWY5MaLm4sE2AC5kqAxdhcR1YC/Wrbti3+/PNPRUvmzJmDrrQTPRsTyCIBFxcXPHz4EIUKFcJN2iTbDIzF2AwGUWpdKFq0KJKTkxXNevfuHXLQd5/fW0hICJbQ7kBsTEBLAtWqVcOhQ4dgZWWF1NRU5MqVCykpKShfvjyOHz+uZW3SKs5iLK3xMOnW2NnZ4RXtFP
/eSITTmrIo+/v7iweLjQlkRmDmzJno9f6IGPp7+vTpikuWL1+O9u3bC3Fu3rw51qxZk1l1kvw9i7Ekh8W0GkVeilx45R5LZj1QFmVPT09cufLxsxQzq49/b74EbG1t8fr1a/j4+ODs2bMZdrRHjx6gUBhZVFQUhtFpByZkLMYmNFhSa6qyoDo7O+PRo0daN1FZyPPmzYsHDx5oXQdfYJ4E3NzcRDzYwcFBhCI0tcDAQOzZswck4ocPH4avr2zHRqkbi7HUR0hi7Zs7dy66deumaFWlSpUQT8d7ZNFsbGzwhnaeB52FZ4/ndFIpm0USaNasGdatWyfWGXbv3o3aaU+61ZBK8eLFcenSJeTJk0cs9kndWIylPkISaV/lypVVRNdQmRHkBb148UL02traWryeslkGAfJmg4KCRMiradOmWLv2w0k3WSEgz7z45JNPcPHixaxUZdBrWYwNitf0Kyev4vHjx4qOqFuUM0Qv5Q8Q1U0e0tu3bw1xG65TIgQcHR3F21DhwoVx48YNvbeKMi1owZgWmMnTjo2N1fs9slohi3FWCZrp9Tlz5lQIoDHF0MvLC0lJSdn+YWCmwyq5bpUsWRLnzp0DhamUM3EM1dAxY8YgMjJSeN/du3fHLDraXCLGYiyRgZBKM5QX5WgB5OXLl5JoWtWqVcVijNyyy0OXROfNsBG9e/fGDDqFFnQY7Qx8//332drLli1bihQ4cjoo552+oGRsYzE29ghI5P7KIuzu7o5r19SfR2js5lI+6dKlS1mUjT0QWbi//K2LPmDj4uKyUFPWL6VMixMnToDCJM+ePct6hVmogcU4C/DM5VK5EH/99degBHpTMOWsDvaSTWHEZG2kBVrKlpFaCiPFqukt0JjtYjE2nXnMLWUCTMCMCbAYm/HgcteYABMwHQIsxqYzVtxSJsAEzJgAi7EZDy53jQkwAdMhwGJsOmPFLWUCTMCMCbAYm/Hgyrv2Du+QAx/2E7aALuuti8xOO5TG5mXs+2tHS7U0i3FW6En82mevn6HXpl7oXqU7qrlXw9GbR1FpTiUcCD2A6kWrS7z1suZN3D8RI2NG4ukPTzNs79SDUzFk+xC8+FG2p4W+bNKBSbCxskGfan30VaXZ1iOFuXbh/gV029ANO77doTFnTebX+L3jQX8eDjHsZkMsxhoPm+kVTCu+5irGf5/5GwsSFmBdu3V6HaS84/NiSMAQ8Yft4wSkMNd0+VBmMeaZnS0EpPCAZLWjmjwsWb1HRtezGGtOVgpzjcVY8/HiktlI4OC1g6gxv4bijh0rdES/6v1EmOK35r/h92O/Y//V/SiYqyAG1Big8ir++OVjDN4+GKtOrcKTV09Qo2gNTGk0BRUKVdC6B/ef38egbYOw/tx6vHjzQoRLJjecjHIFy4m6ztw9g2E7hyHmcgyevHwCjzwe6Fa5GwbVGqQSppj75VxR7tbTWwgqFoSpjaaiZP6Sokzah5BENKpelKjzn8R/RKjh63Jfi2vsctqJayi2OPnAZMw6PAtXH1/Fp/k+xcg6I9G6TGvxe/vR9niZKtuXwyuPFy73u6x13y3lAinMNQojDN0xVIF8YbOF6FSxEzZf2Iyxe8Yi4UYCUt+lokyBMoiqG4UvPv1C4/mVNkyR2dzRddw5TKErOYlfR8K35cIWtPizBZa3Xo5Ar0AhZCTGDtYOCK8RDj83P/xx4g+sPr0asd/ForZnbTFhgxYG4dSdU0LQijoXxfS46Thy/QiOdj+KYnmLadzzN2/foOq8qrj+5Dp+qvsT3J3dMSZ2DC4/vIxT35+CVQ4r+Ez3QUnXkuhfoz9sc9pi2YllWHx8MXZ23Im6xeqKmDGJeT6HfBhTf4z4+8edPyLldQrO9z4Pe2t7tWL8KvUVIoMi0aZMG/x74198u+ZbjK43GhE1I0T76cGdsG8CBtYciCCvIGw8v1EI88qvVqJl6ZY4d+8cqsytgrDKYejp3xMlXEpo3G9LKyiFuUYf+jRXKM5/rP
sxuOV2w3+3/0PthbURWilUfMiSY0FjfvrOaVzrfw3Ods4aza+0YpzZ3NF1/FmMdSVnAtdl9Or4Q+0fMKbeGNGD52+ew3WCKwbXGozhQcOx9uxaNF/eHOvbrUcTnyaizOu3r1FqRinU/6Q+yEPV1OR1xXSKEYJHdvvZbSHQ85rOQy6bXOi7pa+I9bo5uYnfv333FnnG5xHeOnmq9IBFbIvA323/RrOSzUQZEsrSM0tjxhcz0KNKD7ViXKdYHXGN3Jr80QS0yLSr4y7cSbkD90nu4m1g4ucTFWU6rOmAuGtxONf7nPgZhyk0HWmkWxyWz73smmvU0rRvSL8e/hWbzm/Chm82KDpyKPkQqv2vmpgHNEc0mV/KYqzp3NGc3IeSLMa6UDORazIS420dtqFB8QaKXpB3SsJL4YN+W/phXvw8PBj8QHiucuu9ubd45U/sk6hx7ynUMefInExXocmDTnyQKEQ2/ka88F5IKMfVHyceluG7huPZsGcq6Xnk4VcsXBH0OqouTEGe/4igEYq2hq4LxYlbJ3Ao7BBowY/eGPZ13oeq7lUVZVaeWol2q9ohKTwJHs4eLMYaj3TGYpxdc02dGMubT29RZ++dFfNr+8Xt+F/8/7Cl/RY0LNFQo/mlLMaazh0t0CmKshjrQs1ErtF0UYW83kbejURMlbzDJceXqO0hhTdShml+MGTY+jDsvrxb4Wmqq3TU7lEidktx6kK5CiHAM0CIPoUGxjcYLx4W8nAu9lU9Lid4abDwore236pWjNNmQXRZ10V4b0e6HsH8hPmg/2dkcV3ihEizZ6z5RDf2XFMnxjSnum/ojr9O/YXUt6ko7lJcfICvOr0Km0M2izmvyfxSFmNN547m5Ngz1oWVyV2jywNCeck0WSlMkdZoq83KbpU15jDwn4FC+MjLVrZdl3fhk7yfYGviVvGwTGk4Be3KtxNiTJZ/Qn6E+YUpxJg85dsRt1XqoFCHdz5v/NHqD63FmDzgNn+1wdq2a1Ekd5F0/SnlWgpOtk4sxhqPdMaecdqcduUPfn3ONXViTG852xK3YUGzBaj7SV3kts2NE7dPwHeWr4oYZza/lMVY07mjBTr2jHWBZWrXyCfe/tD9IiMiozxj5Qdk+cnl4lWdFkF8C3044pzCFLTANunzSRpjoIXBVitaYc93e4THS0YLLUUmFcEvwb9gb9JekfFAYQG5ydtI2RTRDaIVMb2zvc7CJ7+PKHbz6U18Mu0TTGs0DV0rd9VajJOfJMNripfIEOldtbfi3pSrTG1e0WYFHG0cxYcCLfANDfiwSq9x5y2soLHnGuGefmg6yAF4+aMsC8ZziqeIC//e4nfFaFBIK3xrODaFbEKwd7BG80tZjDWdO7oMP4cpdKFmItdQyhZNyJDyIehYsSMKOBZQ+w08ZTGmxbpq86qJRa7IwEiRPUFxsllHZmFR80X4tsK3GvdeXtetZ7dEJoOroyui90aL7IqE7glYdHSRiFH//PnP4qGhDI5hO4bh2uNr+L7q95gePF2x2l3To6ZIyaO0IvKmkx8nI75bvBBNdTHjj4UpqAPklZH4UrkaHjVw/NZxRO6MFB76/KbzRR9L/FJCtJkWCSlNii1jAsaea9QyysKhrBnK3GlRqoVYa4i9Eos5X86BZx5PES+O2h0lFq0pa6ZV6VYaza+02RSazB1d5gqLsS7UTOgaEq7fjv4mFuxIENV9HVpZjKlr957fw+Btg7Hu7DqRDkT5vBG1IoSoa2t3U+4Kb4XqohgvpdhRbJrid7RwR78jb5zuQz8jT/dw8mEhzBTfpZje7COzEeoXKkT36aunYuGFMinkIQZdxJhS+Oghmx8/H+TtUF0dfDuIdDjKSyZbeHQhwreEi4XMWxG3FD/XloGllDf2XHv08hEaL22Mw9cPC0Gm8ey5sad4+6IxpBxjyhiitQxKeaRMGk3mV1ox1mTu6DLmLMa6UONrmAATYAJ6JsBirGegXB0TYAJMQBcCLMa6UONrmA
ATYAJ6JsBirGegXB0TYAJMQBcCLMa6UONrmAATYAJ6JsBirGegXB0TYAJMQBcCLMa6UONrmAATYAJ6JsBirGegXB0TYAJMQBcCLMa6UONrmAATYAJ6JsBirGegXB0TYAJMQBcCLMa6UONrmAATYAJ6JsBirGegXB0TYAJMQBcCLMa6UONrmAATYAJ6JsBirGegXB0TYAJMQBcC/wfisI1HfuwxFwAAAABJRU5ErkJggg==&quot; title=&quot;tree&quot; /&gt;&lt;/p&gt;
&lt;p&gt;becomes a sequence of tokens like&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;[&quot;(S&quot;, &quot;(NP&quot;, &quot;(NP&quot;, &quot;The&quot;, &quot;keys&quot;, &quot;)NP&quot;, &quot;(PP&quot;, &quot;to&quot;, &quot;(NP&quot;, ...]
&lt;/pre&gt;&lt;/div&gt;
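&lt;p&gt;To make the linearization concrete, here is a small sketch; the nested-tuple tree encoding and the &lt;code&gt;linearize&lt;/code&gt; helper are illustrative choices, not from the paper:&lt;/p&gt;

```python
# Sketch of flattening a parse tree into the bracketed token sequence shown
# above. The (label, children...) tuple format is an illustrative encoding.

def linearize(tree):
    """Emit an opening token, the flattened children, then a closing token."""
    if isinstance(tree, str):          # leaf: a plain word
        return [tree]
    label, children = tree[0], tree[1:]
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens

tree = ("S",
        ("NP",
         ("NP", "The", "keys"),
         ("PP", "to", ("NP", "the", "table"))),
        ("VP", "are", "on", "the", "shelf"))

print(linearize(tree)[:9])
# → ['(S', '(NP', '(NP', 'The', 'keys', ')NP', '(PP', 'to', '(NP']
```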
&lt;p&gt;The LSTM is trained just like a language model to predict sequences of tokens like these. At test time, the model gets the whole prefix, consisting of both words and parse tree symbols, and predicts what verb comes next. In other words, it computes&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;P(&quot;are&quot; | &quot;(S (NP (NP The keys )NP (PP to (NP the table )NP )PP )NP (VP&quot;).&lt;/p&gt;
&lt;/blockquote&gt;
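&lt;p&gt;The evaluation criterion behind this setup can be sketched as follows: the model is counted as correct when it assigns higher probability to the right verb form than to the wrong one. The &lt;code&gt;toy_lm&lt;/code&gt; scorer below is an illustrative stand-in, not the paper's LSTM:&lt;/p&gt;

```python
# Sketch of the agreement check: correct iff P(correct verb) beats
# P(incorrect verb) under the language model. `toy_lm` is a placeholder.

PLURAL_FORMS = {"are", "were", "have", "do"}

def toy_lm(prefix, candidate):
    """Placeholder for P(candidate | prefix); a real system would score the
    candidate by running an LSTM over the prefix tokens."""
    head_is_plural = any(tok.isalpha() and tok.endswith("s") for tok in prefix)
    agrees = (candidate in PLURAL_FORMS) == head_is_plural
    return 0.7 if agrees else 0.3

def agreement_correct(prefix, correct_verb, wrong_verb, lm=toy_lm):
    return lm(prefix, correct_verb) > lm(prefix, wrong_verb)

prefix = ["(S", "(NP", "(NP", "The", "keys", ")NP", "(PP", "to",
          "(NP", "the", "table", ")NP", ")PP", ")NP", "(VP"]
print(agreement_correct(prefix, "are", "is"))  # → True
```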
&lt;p&gt;You may be wondering where the parse tree tokens come from, since the dataset is just a bunch of sentences from Wikipedia with no associated gold-labeled parse trees. The parse trees were generated with an off-the-shelf parser. The parser gets to look at the whole sentence before predicting a parse tree, which technically leaks information about the words to the right of the verb--we'll come back to this concern in a little bit.&lt;/p&gt;
&lt;p&gt;Kuncoro et al. find that a plain LSTM trained on sequences of tokens like this does not do any better than the original LSTM language model. Changing the data alone does not seem to force the model to actually get better at modeling these syntax-sensitive dependencies.&lt;/p&gt;
&lt;p&gt;Next, the authors additionally change the model architecture, replacing the LSTM with an &lt;a href=&quot;https://arxiv.org/pdf/1602.07776.pdf&quot;&gt;RNN Grammar&lt;/a&gt;. Like the LSTM that predicts the linearized parse tree tokens, the RNN Grammar also defines a joint probability distribution over sentences and their parse trees. But unlike the LSTM, the RNN Grammar uses the tree structure of words seen so far to build representations of constituents compositionally. The figure below shows the RNN Grammar architecture:&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;RNN Grammar&quot; src=&quot;https://nlp.stanford.edu/robinjia/rnn_grammar.png&quot; title=&quot;RNN Grammar&quot; /&gt;&lt;/p&gt;
&lt;p&gt;On the left is the stack, consisting of all constituents that have either been opened or fully created. The embedding for a completed constituent (&quot;The hungry cat&quot;) is created by composing the embeddings for its children, via a neural network. An RNN then runs over the stack to generate an embedding of the current stack state. This, along with a representation of the history of past parsing actions \(a_{&amp;lt;t}\) is used to predict the next parsing action (i.e. to generate a new constituent, complete an existing one, or generate a new word). The RNN Grammar variant used by Kuncoro et al. ablates the &quot;buffer&quot; \(T_t\) on the right side of the figure.&lt;/p&gt;
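&lt;p&gt;A highly simplified sketch of the stack mechanics looks like this. The action names mirror the paper's (NT(X) opens a nonterminal, a word token is generated, REDUCE closes a constituent), but the averaging "composition function" and the toy hash-derived embeddings are illustrative stand-ins for the learned networks:&lt;/p&gt;

```python
# Toy sketch of RNN Grammar stack operations: open constituents, push words,
# and compose completed constituents from their children's embeddings.

def embed(symbol, dim=4):
    """Toy deterministic embedding: hash-derived floats, not learned."""
    return tuple(((hash((symbol, i)) % 1000) / 1000.0) for i in range(dim))

def compose(label_vec, child_vecs):
    """Stand-in for the neural composition over a completed constituent."""
    vecs = [label_vec] + child_vecs
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(len(label_vec)))

def run_actions(actions):
    stack = []  # entries are (kind, label, vector)
    for act in actions:
        if act.startswith("NT("):            # open a new constituent, e.g. NT(NP)
            label = act[3:-1]
            stack.append(("open", label, embed(label)))
        elif act == "REDUCE":                # close the most recent open constituent
            children = []
            while stack[-1][0] != "open":
                children.append(stack.pop()[2])
            _, label, vec = stack.pop()
            stack.append(("done", label, compose(vec, children[::-1])))
        else:                                # generate a word: push it
            stack.append(("word", act, embed(act)))
    return stack

stack = run_actions(["NT(NP)", "The", "hungry", "cat", "REDUCE"])
print(len(stack), stack[0][:2])  # → 1 ('done', 'NP')
```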
&lt;p&gt;The compositional structure of the RNN Grammar means that it is naturally encouraged to summarize a constituent based on words that are closer to the top-level, rather than words that are nested many levels deep. In our running example, &quot;keys&quot; is closer to the top level of the main NP, whereas &quot;cabinet&quot; is nested within a prepositional phrase, so we expect the RNN Grammar to lean more heavily on &quot;keys&quot; when building a representation of the main NP. This is exactly what we want in order to predict the correct verb form! Empirically, this inductive bias towards using &lt;em&gt;syntactic&lt;/em&gt; distance helps with the subject-verb agreement task: the RNN Grammar gets only 9.4% error on sentences with 4 attractors. &lt;strong&gt;Using syntactic information at training time does make language models better at predicting syntax-sensitive dependencies, but only if the model architecture makes smart use of the available tree structure.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As mentioned earlier, one important caveat is that the RNN Grammar gets to use the predicted parse tree from an external parser. What if the predicted parse of the prefix leaks information about the correct verb? Moreover, reliance on an external parser also leaks information from another model, so it is unclear whether the RNN Grammar itself has really &quot;learned&quot; about these syntactic relationships. Kuncoro et al. address these objections by re-running the experiments using a predicted parse of the prefix generated by the RNN Grammar itself. They use a beam search method proposed by &lt;a href=&quot;https://arxiv.org/pdf/1707.03058.pdf&quot;&gt;Fried et al.&lt;/a&gt; to estimate the most likely parse tree structure, according to the RNN Grammar, for the words before the verb. This predicted parse tree fragment is then used by the RNN Grammar to predict what the verb should be, instead of the tree generated by a separate parser. The RNN Grammar still does well in this setting; in fact, it does somewhat better (7.1% error with four attractors present). In short, the RNN Grammar does better than the LSTM baselines at predicting the correct verb, and it does so by first predicting the tree structure of the words before the verb, then using this tree structure to predict the verb itself.&lt;/p&gt;
&lt;p&gt;(Note: a previous version of this post incorrectly claimed that the above experiments used a separate incremental parser to parse the prefix.)&lt;/p&gt;
&lt;h1&gt;Wrap-Up&lt;/h1&gt;
&lt;p&gt;Neural language models with sufficient capacity can learn to capture long-range syntactic dependencies. This is true even for very generic model architectures like LSTMs, though models that explicitly use syntactic structure to form their internal representations do even better. We were able to quantify this by leveraging a particular type of syntax-sensitive dependency (subject-verb number agreement), and focusing on rare and challenging cases (sentences with one or more attractors), rather than the average case, which can be solved heuristically.&lt;/p&gt;
&lt;p&gt;There are many details I've omitted, such as a discussion in Kuncoro et al. of alternative RNN Grammar configurations. Linzen et al. also explore other training objectives besides just language modeling.&lt;/p&gt;
&lt;p&gt;If you've gotten this far, you might also enjoy these highly related papers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Gulordava et al. &lt;a href=&quot;https://arxiv.org/pdf/1803.11138.pdf&quot;&gt;&quot;Colorless green recurrent networks dream hierarchically.&quot;&lt;/a&gt; NAACL 2018. This paper actually came out a bit before Kuncoro et al., and has similar findings regarding LSTM size. But the main point of this paper is to determine whether the LSTM is actually learning syntax, or if it is using &lt;em&gt;collocational/frequency-based&lt;/em&gt; information. For example, given &quot;&lt;strong&gt;dogs&lt;/strong&gt; in the &lt;em&gt;neighborhood&lt;/em&gt; often &lt;strong&gt;bark/barks&lt;/strong&gt;,&quot; knowing that barking is something that dogs can do but neighborhoods can't is sufficient to guess the correct form. To test this, they construct a new test set where content words are replaced with other content words of the same type, resulting in nonce sentences with equivalent syntax. The LSTM language models do somewhat worse with this data but still quite well, again suggesting that they do learn about syntax.&lt;/li&gt;
&lt;li&gt;Yoav Goldberg. &lt;a href=&quot;https://arxiv.org/pdf/1901.05287.pdf&quot;&gt;&quot;Assessing BERT's Syntactic Abilities&quot;&lt;/a&gt;. With the recent success of &lt;a href=&quot;https://arxiv.org/pdf/1810.04805.pdf&quot;&gt;BERT&lt;/a&gt;, a natural question is whether BERT learns these same sorts of syntactic relationships. Impressively, it does very well on the verb prediction task, getting 3-4% error rates across the board for 1, 2, 3, or 4 attractors. It's worth noting that for various reasons, these numbers are not directly comparable with the numbers in the rest of the post (both due to BERT seeing the whole sentence and for data processing reasons).&lt;/li&gt;
&lt;/ol&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset</title>
    <id>https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset</id>
    <updated>2017-07-03T12:23:58Z</updated>
    <link href="https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset" />
    <author>
      <name>&lt;a href=&quot;http://www.mihaileric.com&quot;&gt; Mihail Eric &lt;/a&gt;</name>
    </author>
    <content type="html">&lt;style&gt;
table {
    font-family: arial, sans-serif;
    border-collapse: collapse;
    width: 100%;
}

td, th {
    border: 1px solid #dddddd;
    text-align: left;
    padding: 8px;
}

tr:nth-child(even) {
    background-color: #dddddd;
}
&lt;/style&gt;
&lt;p&gt;Task-oriented dialogue focuses on conversational agents that participate in user-initiated dialogues on domain-specific topics. Traditionally, the task-oriented dialogue community has often been hindered by a lack of sufficiently large and diverse datasets for training models across a variety of different domains. In an effort to help alleviate this problem, we release a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. Our dialogues are grounded through knowledge bases ensuring that they are versatile in their natural language without being completely free form. The dialogues include exchanges such as the following: &lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt; &lt;strong&gt; DRIVER &lt;/strong&gt; &lt;/td&gt;
&lt;td&gt; I need to find the time and parties attending my optometrist appointment. &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; &lt;strong&gt; CAR &lt;/strong&gt;&lt;/td&gt;
&lt;td&gt; I have 3 appointments scheduled, with Alex, your sister, and Jeff. Which are you referring to? &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; &lt;strong&gt; DRIVER &lt;/strong&gt;&lt;/td&gt;
&lt;td&gt; I want to know about the one that Alex is joining me at. &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; &lt;strong&gt;CAR &lt;/strong&gt;&lt;/td&gt;
&lt;td&gt; That optometrist appointment is at 4 pm. &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; &lt;strong&gt; DRIVER &lt;/strong&gt;&lt;/td&gt;
&lt;td&gt; Thanks. &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; &lt;strong&gt; CAR &lt;/strong&gt; &lt;/td&gt;
&lt;td&gt; No problem. &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Data Collection&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Our data was collected using a Wizard-of-Oz scheme inspired by that of &lt;a href=&quot;https://arxiv.org/abs/1604.04562&quot;&gt;Wen et al.&lt;/a&gt; In our scheme, users could take on one of two modes: &lt;em&gt;Driver&lt;/em&gt; and &lt;em&gt;Car Assistant&lt;/em&gt;. In the &lt;em&gt;Driver&lt;/em&gt; mode, users were presented with a task that listed certain information they were trying to extract from the &lt;em&gt;Car Assistant&lt;/em&gt; as well as the dialogue history exchanged between &lt;em&gt;Driver&lt;/em&gt; and &lt;em&gt;Car Assistant&lt;/em&gt; up to that point. An example task is presented in the &lt;em&gt;Driver Mode&lt;/em&gt; figure below. The &lt;em&gt;Driver&lt;/em&gt; was then only responsible for contributing a single line of dialogue that appropriately continued the discourse given the prior dialogue history and the task definition. &lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://nlp.stanford.edu/projects/kvret/driver_mode.png&quot; alt=&quot;Driver Mode&quot; style=&quot;width:550px;height:228px;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Tasks were randomly specified by selecting values (&lt;em&gt;5pm&lt;/em&gt;, &lt;em&gt;Saturday&lt;/em&gt;, &lt;em&gt;San Francisco&lt;/em&gt;, etc.) for three to five slots (&lt;strong&gt;time&lt;/strong&gt;, &lt;strong&gt;date&lt;/strong&gt;, &lt;strong&gt;location&lt;/strong&gt;, etc.) that depended on the domain type. Values specified for the slots were chosen according to a uniform distribution from a per-domain candidate set.&lt;/p&gt;
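&lt;p&gt;The task-generation procedure can be sketched as follows: choose three to five slots for a domain, then fill each with a uniformly sampled value. The slot names match the calendar domain described here, but the candidate value sets are small illustrative stand-ins for the real ones:&lt;/p&gt;

```python
# Sketch of random task specification: uniformly pick 3-5 slots, then a
# uniform value for each slot from a per-domain candidate set (illustrative).
import random

CANDIDATES = {
    "calendar": {
        "event": ["optometrist appointment", "dinner", "meeting"],
        "time": ["4 pm", "5 pm", "noon"],
        "date": ["Saturday", "Monday", "the 3rd"],
        "party": ["Alex", "your sister", "Jeff"],
        "room": ["conference room 50", "conference room 102"],
    },
}

def sample_task(domain, rng=random):
    """Uniformly pick 3-5 slots, then a uniform value for each."""
    slots = rng.sample(sorted(CANDIDATES[domain]), k=rng.randint(3, 5))
    return {slot: rng.choice(CANDIDATES[domain][slot]) for slot in slots}

print(sample_task("calendar"))
```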
&lt;p&gt;In the &lt;em&gt;Car Assistant&lt;/em&gt; mode, users were presented with the dialogue history exchanged up to that point in the running dialogue and a private knowledge base known only to the &lt;em&gt;Car Assistant&lt;/em&gt; with information that could be useful for satisfying the &lt;em&gt;Driver&lt;/em&gt; query. Examples of knowledge bases could include a calendar of event information, a collection of weekly forecasts for nearby cities, or a collection of nearby points-of-interest with relevant information. The &lt;em&gt;Car Assistant&lt;/em&gt; was then responsible for using this private information to provide a single utterance that progressed the user-directed dialogues. The &lt;em&gt;Car Assistant&lt;/em&gt; was also asked to fill in dialogue state information for mentioned slots and values in the dialogue history up to that point. We provide a screenshot of &lt;em&gt;Car Assistant Mode&lt;/em&gt; below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://nlp.stanford.edu/projects/kvret/car_assistant_mode.png&quot; alt=&quot;Car Assistant Mode&quot; style=&quot;width:505px;height:423px;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Each private knowledge base had six to seven distinct rows and five to seven attribute types. The private knowledge bases used were generated by uniformly selecting a value for a given attribute type, where each attribute type had a variable number of candidate values. Some knowledge bases intentionally lacked certain attributes to encourage diversity in discourse.&lt;/p&gt;
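&lt;p&gt;Knowledge-base generation can be sketched in the same spirit: six to seven rows, five to seven attribute types, values drawn uniformly per attribute, and some attributes absent entirely. The attribute names and value pools below are illustrative stand-ins for the real per-domain candidate sets:&lt;/p&gt;

```python
# Sketch of private knowledge-base generation: sample which attributes a KB
# has (so some KBs lack certain attributes), then fill rows uniformly.
import random

WEATHER_VALUES = {
    "location": ["San Francisco", "Boston", "Cleveland", "Alhambra"],
    "monday": ["raining", "sunny", "foggy"],
    "tuesday": ["overcast", "windy", "hail"],
    "wednesday": ["clear skies", "snow", "humid"],
    "thursday": ["sunny", "stormy", "dry"],
    "low_temp": ["20F", "40F", "60F"],
    "high_temp": ["50F", "70F", "90F"],
}

def make_kb(rng=random):
    """Build one private KB: 6-7 rows sharing 5-7 sampled attribute types."""
    attrs = rng.sample(sorted(WEATHER_VALUES), k=rng.randint(5, 7))
    return [{a: rng.choice(WEATHER_VALUES[a]) for a in attrs}
            for _ in range(rng.randint(6, 7))]

kb = make_kb()
print(len(kb), len(kb[0]))
```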
&lt;p&gt;While specifying the attribute types and values in each task presented to the &lt;em&gt;Driver&lt;/em&gt; allowed us to ground the subject of each dialogue with our desired entities, it would occasionally result in more mechanical discourse exchanges. To encourage more naturalistic, unbiased utterances, we had users record themselves saying commands in response to underspecified visual depictions of an action a car assistant could perform. These commands were transcribed and then inserted as the first exchange in a given dialogue on behalf of the &lt;em&gt;Driver&lt;/em&gt;. Roughly 1,500 of the dialogues employed this transcribed audio command first-utterance technique. &lt;/p&gt;
&lt;p&gt;A total of 241 unique workers from Amazon Mechanical Turk were anonymously recruited to use the interface we built over a period of about six days. &lt;/p&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Data Statistics&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Below we include statistics for our dataset:&lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt; Training Dialogues&lt;/td&gt;
&lt;td&gt; 2,425 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Validation Dialogues&lt;/td&gt;
&lt;td&gt; 302 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Test Dialogues&lt;/td&gt;
&lt;td&gt; 304 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Calendar Scheduling Dialogues&lt;/td&gt;
&lt;td&gt; 1,034 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Navigation Dialogues&lt;/td&gt;
&lt;td&gt; 1,000 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Weather Dialogues&lt;/td&gt;
&lt;td&gt; 997 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Avg. # Utterances Per Dialogue&lt;/td&gt;
&lt;td&gt; 5.25 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Avg. # Tokens Per Utterance&lt;/td&gt;
&lt;td&gt; 9 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Vocabulary Size&lt;/td&gt;
&lt;td&gt; 1,601 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; # of Distinct Entities &lt;/td&gt;
&lt;td&gt; 284 &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; # of Entity (or Slot) Types&lt;/td&gt;
&lt;td&gt; 15 &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;We also include some information regarding the type and number of slots per domain:&lt;/p&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
&lt;th&gt; &lt;/th&gt;
&lt;th&gt;Calendar Scheduling&lt;/th&gt;
&lt;th&gt;Weather Information Retrieval&lt;/th&gt;
&lt;th&gt;POI Navigation&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; Slot Types &lt;/td&gt;
&lt;td&gt;event, time, date, &lt;br /&gt; party, room, agenda&lt;/td&gt;
&lt;td&gt;location, weekly time, &lt;br /&gt; temperature, weather attribute&lt;/td&gt;
&lt;td&gt;POI name, traffic info, &lt;br /&gt; POI category, address, distance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt; # Distinct Slot Values &lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;140&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;Our dataset was designed so that each dialogue had the grounded world information that is often crucial for training task-oriented dialogue systems, while at the same time being sufficiently lexically and semantically versatile. We hope that this dataset will be useful in building diverse and robust task-oriented dialogue systems!&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Download&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Our data is made publicly available for download at the following link: &lt;a href=&quot;http://nlp.stanford.edu/projects/kvret/kvret_dataset_public.zip&quot;&gt;dataset&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you choose to use this dataset for your own work, please cite the following paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-Value Retrieval Networks for Task-Oriented Dialogue. In &lt;em&gt;Proceedings of the Special Interest Group on Discourse and Dialogue (SIGDIAL)&lt;/em&gt;. &lt;a href=&quot;https://arxiv.org/abs/1705.05414&quot;&gt;https://arxiv.org/abs/1705.05414&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">CS224n Competition on The Stanford Question Answering Dataset with CodaLab</title>
    <id>https://nlp.stanford.edu/blog/cs224n-competition-on-the-stanford-question-answering-dataset-with-codalab</id>
    <updated>2017-04-27T16:29:03Z</updated>
    <link href="https://nlp.stanford.edu/blog/cs224n-competition-on-the-stanford-question-answering-dataset-with-codalab" />
    <author>
      <name>&lt;a href=&quot;http://pranavrajpurkar.com&quot;&gt;Pranav Rajpurkar&lt;/a&gt;, &lt;a href=&quot;http://sckoo.net/&quot;&gt;Stephen Koo&lt;/a&gt;, and &lt;a href=&quot;https://cs.stanford.edu/~pliang/&quot;&gt;Percy Liang&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;The &lt;a href=&quot;http://stanford-qa.com&quot;&gt;Stanford Question Answering Dataset (SQuAD)&lt;/a&gt; is a reading comprehension benchmark with an active and highly-competitive leaderboard. Over 17 industry and academic teams have submitted their models (with executable code) since SQuAD’s release in June 2016, leading to the advancement of novel deep learning architectures which have outperformed baseline models by wide margins. As teams compete to build the best machine comprehension system, the challenge of rivaling human-level performance still remains open.&lt;/p&gt;
&lt;p&gt;SQuAD is a unique large-scale benchmark in that it uses a hidden test set for official evaluation of models. Teams submit their executable code, which is then run on a test set that is not publicly readable. Such a setup preserves the integrity of the test results. Models can be rerun on new test sets, either to get tighter confidence bounds on model performance, or to evaluate the ability of the model to generalize to new domains. Another advantage of having teams submit executable code is that models can be ensembled to further boost performance so that the weaknesses of one model are offset by the strengths of another. But having teams submit arbitrary code poses technical challenges: different programs expect different arguments and command-line options, they often require custom environments and library dependencies, and some models may also involve running multiple programs in a sequential pipeline.&lt;/p&gt;
&lt;blockquote class=&quot;imgur-embed-pub&quot; lang=&quot;en&quot; data-id=&quot;a/Sdkfy&quot;&gt;&lt;a href=&quot;//imgur.com/Sdkfy&quot;&gt;&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;async&quot; src=&quot;//s.imgur.com/min/embed.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;h2&gt;CodaLab for reproducibility&lt;/h2&gt;
&lt;p&gt;This is where &lt;a href=&quot;http://worksheets.codalab.org&quot;&gt;CodaLab&lt;/a&gt; comes in. CodaLab is an online platform for collaborative and reproducible computational research. With CodaLab Worksheets, you can run your jobs on a cluster, document and share your experiments, all while keeping track of full provenance. The system exposes a simple command-line interface, with which you can upload your code and data as well as submit jobs to run them (&lt;a href=&quot;https://worksheets.codalab.org/worksheets/0x62eefc3e64e04430a1a24785a9293fff/&quot;&gt;see SQuAD data worksheet here&lt;/a&gt;). A job consists of 1) a &lt;a href=&quot;https://www.docker.com/&quot;&gt;Docker&lt;/a&gt; image, containing the environment in which to run your code, 2) a set of dependencies, i.e. the code and data to load into the Docker container where your job is run, and 3) the shell command to run inside this container. The files generated by a job can then be loaded into subsequent jobs as dependencies themselves. All this metadata about your jobs not only allows you to maintain a record of how you ran your code, but also enables others to reproduce your experiments, or even to rerun your pipelines on new datasets by substituting in new dependencies.&lt;/p&gt;
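&lt;p&gt;As a rough illustration of that three-part job structure (the class and field names below are hypothetical, not the real CodaLab API), each job can be thought of as a small record whose dependencies may themselves be earlier jobs; chaining records this way is exactly what makes full provenance tracking possible:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical sketch of the metadata recorded for one job."""
    docker_image: str   # environment in which the command runs
    dependencies: dict  # name -> uploaded bundle or an earlier Job
    command: str        # shell command executed inside the container


# A two-step pipeline: train a model, then evaluate on the trainer's output.
train = Job(docker_image="codalab/default",
            dependencies={"src": "uploaded-code-bundle"},
            command="python src/train.py")
evaluate = Job(docker_image="codalab/default",
               dependencies={"src": "uploaded-code-bundle", "model": train},
               command="python src/eval.py model/")


def provenance(job):
    """Walk the dependency graph back from a job, earliest job first."""
    jobs = [job]
    for dep in job.dependencies.values():
        if isinstance(dep, Job):
            jobs = provenance(dep) + jobs
    return jobs


print([j.command for j in provenance(evaluate)])
# ['python src/train.py', 'python src/eval.py model/']
```

&lt;p&gt;Rerunning the pipeline on a new dataset then amounts to rebuilding the same records with one dependency swapped out.&lt;/p&gt;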
&lt;p&gt;These features allow us to run arbitrary code submissions for the SQuAD leaderboard, all while keeping the test set secret using CodaLab access control lists. Once a team has uploaded their code to CodaLab and successfully constructed jobs running the code on the public development dataset, we can reproduce the run by simply substituting the hidden test set for the development dataset. The results can then be queried using the CodaLab REST API to construct a live leaderboard on the web.&lt;/p&gt;
&lt;blockquote class=&quot;imgur-embed-pub&quot; lang=&quot;en&quot; data-id=&quot;a/bgLl4&quot;&gt;&lt;a href=&quot;//imgur.com/bgLl4&quot;&gt;&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;async&quot; src=&quot;//s.imgur.com/min/embed.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;h2&gt;Stanford NLP Class Competition&lt;/h2&gt;
&lt;p&gt;Last month, the CodaLab team and course staff organized a competition on SQuAD for Stanford’s popular &lt;a href=&quot;http://web.stanford.edu/class/cs224n/&quot;&gt;CS224N&lt;/a&gt; (Natural Language Processing with Deep Learning) course. 162 student teams (with 1-3 students in each) competed in a tight, four-week expedition to apply their knowledge of deep learning for natural language processing to a real-world challenge task: SQuAD. CodaLab was employed for automated running and evaluation of the student submissions on the hidden test set, and a real-time online leaderboard that interfaced with CodaLab was set up for instantaneous feedback on the submission.
Over the span of a short few weeks, many student teams managed to break a competitive EM/F1 score of 60/70, and the very top teams managed to rival entries on the external SQuAD leaderboard. The top student submission, at 77.5 F1, would have been a top 3 score on the leaderboard only 3 months ago -- not bad for a 4-week course!&lt;/p&gt;
&lt;blockquote class=&quot;imgur-embed-pub&quot; lang=&quot;en&quot; data-id=&quot;H9TEWsw&quot;&gt;&lt;a href=&quot;//imgur.com/H9TEWsw&quot;&gt;View post on imgur.com&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;async&quot; src=&quot;//s.imgur.com/min/embed.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;p&gt;&lt;em&gt;We are grateful to Microsoft for their support of CodaLab and for giving students free GPU computing resources on &lt;a href=&quot;https://azure.microsoft.com/&quot;&gt;Microsoft Azure&lt;/a&gt;, allowing them to build and test complex deep learning models on the large SQuAD dataset.&lt;/em&gt;&lt;/p&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">Interactive Language Learning</title>
    <id>https://nlp.stanford.edu/blog/interactive-language-learning</id>
    <updated>2016-12-14T12:55:31Z</updated>
    <link href="https://nlp.stanford.edu/blog/interactive-language-learning" />
    <author>
      <name>Nadav Lidor, Sida I. Wang</name>
    </author>
    <content type="html">&lt;p&gt;Today, natural language interfaces (NLIs) on computers or phones are often trained once and deployed, and users must just live with their limitations. Allowing users to demonstrate or teach the computer appears to be a central component to enable more natural and usable NLIs. Examining language acquisition research, there is considerable evidence suggesting that human children require interactions to learn language, as opposed to passively absorbing language, such as when watching TV (&lt;a href=&quot;http://faculty.washington.edu/losterho/kuhl_nature_neuroscience_reviews_2004.pdf&quot;&gt;Kuhl et al., 2003&lt;/a&gt;, &lt;a href=&quot;https://www.cambridge.org/core/journals/applied-psycholinguistics/article/language-learning-with-restricted-input-case-studies-of-two-hearing-children-of-deaf-parents/4F5BF799996DCD5977A94BC5F1233578&quot;&gt;Sachs et al., 1981&lt;/a&gt;). Research suggests that when learning a language, rather than consciously analyzing increasingly complex linguistic structures (e.g. sentence forms, word conjugations), humans advance their linguistic ability through meaningful interactions (&lt;a href=&quot;http://www.sdkrashen.com/content/books/principles_and_practice.pdf&quot;&gt;Kreshen, 1983&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In contrast, the standard machine learning dataset setting has no interaction. The feedback stays the same and does not depend on the state of the system or the actions taken. We think that interactivity is important, and that an interactive language learning setting will enable adaptive and customizable systems, especially for resource-poor languages and new domains where starting from close to scratch is unavoidable.&lt;/p&gt;
&lt;p&gt;We describe two attempts towards interactive language learning — an agent for manipulating blocks, and a calendar scheduler.&lt;/p&gt;
&lt;h2&gt;Language Games: A Blocks-World Domain to Learn Language Interactively&lt;/h2&gt;
&lt;p&gt;Inspired by the human language acquisition process, we investigated a simple setting where language learning starts from scratch. We explored the idea of language games, where the computer and the human user need to collaboratively accomplish a goal even though they do not initially speak a common language. Specifically, in our pilot we created a game called SHRDLURN, in homage to the seminal work of Terry Winograd. As shown in Figure 1a, the objective is to transform a start state into a goal state, but the only action the human can take is entering an utterance. The computer parses the utterance and produces a ranked list of possible interpretations according to its current model. The human scrolls through the list and chooses the intended one, simultaneously advancing the state of the blocks and providing feedback to the computer. Both the human and the computer wish to reach the goal state (only known to the human) with as little scrolling as possible. For the computer to be successful, it has to learn the human’s language quickly over the course of the game, so that the human can accomplish the goal more efficiently. Conversely, the human can also speed up progress by accommodating to the computer, by at least partially understanding what it can and cannot currently do.&lt;/p&gt;
&lt;p&gt;We model the computer as a semantic parser (&lt;a href=&quot;http://www.cs.columbia.edu/~mcollins/papers/uai05.pdf&quot;&gt;Zettlemoyer and Collins, 2005&lt;/a&gt;; &lt;a href=&quot;http://web.stanford.edu/~cgpotts/manuscripts/liang-potts-semantics.pdf&quot;&gt;Liang and Potts, 2015&lt;/a&gt;), which maps natural language utterances (e.g., ‘remove red’) into logical forms (e.g., remove(with(red))). The semantic parser has no seed lexicon and no annotated logical forms, so it just generates many candidate logical forms. From the human’s feedback, it learns by adjusting the parameters corresponding to simple and generic lexical features. It is crucial that the computer learns quickly, or users become frustrated and the system is less usable. In addition to feature engineering and tuning online learning algorithms, we achieved higher learning speed by incorporating pragmatics.&lt;/p&gt;
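&lt;p&gt;That learning loop can be caricatured in a few lines. The candidate space, features, and update rule below are deliberately toy versions (the actual system generates logical forms compositionally and uses richer features and learning algorithms), but the shape is the same: score candidates with generic lexical features, and shift weight toward the interpretation the user selects:&lt;/p&gt;

```python
import itertools

# A tiny hand-built candidate space for the utterance "remove red".
CANDIDATES = ["add(with(red))", "remove(with(blue))", "remove(with(red))"]
PREDICATES = ["remove", "add", "red", "blue"]


def features(utterance, logical_form):
    """Generic lexical features: an utterance word co-occurring with a predicate."""
    preds = [p for p in PREDICATES if p in logical_form]
    return {(w, p): 1.0 for w, p in itertools.product(utterance.split(), preds)}


def score(weights, utterance, lf):
    return sum(weights.get(f, 0.0) * v for f, v in features(utterance, lf).items())


def rank(weights, utterance):
    return sorted(CANDIDATES, key=lambda lf: -score(weights, utterance, lf))


def update(weights, utterance, chosen):
    """Perceptron-style update toward the interpretation the user selected."""
    predicted = rank(weights, utterance)[0]
    if predicted != chosen:
        for f, v in features(utterance, chosen).items():
            weights[f] = weights.get(f, 0.0) + v
        for f, v in features(utterance, predicted).items():
            weights[f] = weights.get(f, 0.0) - v


weights = {}
for _ in range(3):  # a few rounds of the user picking the intended parse
    update(weights, "remove red", "remove(with(red))")
print(rank(weights, "remove red")[0])
# remove(with(red))
```

&lt;p&gt;Because the features pair raw words with predicates rather than assuming any particular language, the same loop works whether the user types English, Arabic, Polish, or an invented language.&lt;/p&gt;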
&lt;p&gt;However, what is special here is the real-time nature of learning, in which the human also learns and adapts to the computer, thus making it easier to achieve good task performance. While the human can teach the computer any language - in our pilot, Mechanical Turk users tried English, Arabic, Polish, and a custom programming language - a good human player will choose to use utterances so that the computer is more likely to learn quickly. &lt;/p&gt;
&lt;p&gt;You can find more information in the &lt;a href=&quot;https://arxiv.org/abs/1606.02447&quot;&gt;SHRDLURN paper&lt;/a&gt;, a &lt;a href=&quot;http://shrdlurn.sidaw.xyz&quot;&gt;demo&lt;/a&gt;, code, data, and experiments on &lt;a href=&quot;https://worksheets.codalab.org/worksheets/0x9fe4d080bac944e9a6bd58478cb05e5e&quot;&gt;CodaLab&lt;/a&gt; and the &lt;a href=&quot;https://github.com/sidaw/shrdlurn/tree/acl16-demo&quot;&gt;client side code&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;alt text&quot; src=&quot;https://nlp.stanford.edu/~lidor/schedlurn/other/SHRDLURN%20photo.png&quot; title=&quot;SHRDLURN Interface&quot; /&gt; &lt;img alt=&quot;alt text&quot; src=&quot;https://nlp.stanford.edu/~lidor/schedlurn/other/SCHEDLURN%20photo.png&quot; title=&quot;SCHEDLURN Interface&quot; /&gt;&lt;/p&gt;
&lt;p&gt;1a SHRDLURN (top) and 1b SCHEDLURN (bottom)&lt;/p&gt;
&lt;p&gt;Figure 1: 1a: A pilot for learning language through user interaction. The system attempts an action in response to a user instruction and the user indicates whether it has chosen correctly. This feedback allows the system to learn word meaning and grammar. 1b: the interface for interactive learning in the calendars domain. &lt;/p&gt;
&lt;h2&gt;A Calendar Employing Community Learning with Demonstration&lt;/h2&gt;
&lt;p&gt;Many challenges remain if we want to advance to NLIs for broader domains. First, in order to scale to more open, complex action spaces, we need richer feedback signals that are both natural for humans and useful for the computer. Second, to allow for quick, generalizable data collection, we seek to support collective, rather than individual, languages, in a community-based learning framework. We now outline our first attempt at addressing these challenges and scaling the framework to a calendar setting. You can find a &lt;a href=&quot;https://youtu.be/PfW4_3tCiw0&quot;&gt;short video overview&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Event scheduling is a common yet unsolved task: while several available calendar programs allow limited natural language input, in our experience they all fail as soon as they are given something slightly complicated, such as ‘Move all the tuesday afternoon appointments back an hour’. We think interactive learning can give us a better NLI for calendars, which has more real-world impact than blocks world. Furthermore, aiming to expand our learning methodology from definition to demonstration, we chose this domain as most users are already familiar with the common calendar GUI and have an intuition for its manual manipulation. Additionally, as calendar NLIs are already deployed, particularly on mobile, we hoped users would naturally be inclined to use natural language style phrasing rather than a more technical language as we saw in the blocks world domain. Lastly, a calendar is a considerably more complex domain, with a wider set of primitives and possible actions, and will allow us to test our framework with a larger action space.&lt;/p&gt;
&lt;h3&gt;Learning from Demonstration and Community&lt;/h3&gt;
&lt;p&gt;In our pilot, user feedback was provided by scrolling and selecting the proper action for a given utterance - a process both unnatural and unscalable for large action spaces. Feedback signals in human communication include reformulation, paraphrases, repair sequences, etc. (&lt;a href=&quot;http://web.stanford.edu/~clark/1990s/Using%20language/Old%20versions/Clark.UsingLanguage.Ch12.96.pdf&quot;&gt;Clark, 1996&lt;/a&gt;). We expanded our system to receive feedback through demonstration, as it is 1) natural for people, especially using a calendar, allowing for easy data collection, and 2) informative for language learning and can be leveraged by current machine learning methods. In practice, if the correct interpretation is not among the top choices, the system falls back to a GUI and the user uses the GUI to show the system what they meant. Algorithms for learning from denotations are well-suited for this, where the interactivity can potentially help in the search for the latent logical forms.&lt;/p&gt;
&lt;p&gt;While learning and adapting to each user provided a clean setting for the pilot study, we would not expect good coverage if each person has to teach the computer everything from scratch. Despite individual variations, there should be much in common across users, which allows the computer to learn faster and generalize better. For our calendar, we abandoned the individualized user-specific language model for a collective community model, in which the model consists of a set of grammar rules and parameters collected across all users and interactions. Each user contributes to the expressiveness and complexity of the language, where jargon and conventions are invented, modified, or rejected in a distributed way.&lt;/p&gt;
&lt;h3&gt;Preliminary Results&lt;/h3&gt;
&lt;p&gt;Using Amazon Mechanical Turk (AMT), we paid 20 workers 2 dollars each to play with our calendar. Out of 356 total utterances, in 196 cases the worker selected a state out of the suggested ranked list as the desired calendar state, and 68 times the worker used the calendar GUI to manually modify and submit feedback by demonstration.&lt;/p&gt;
&lt;p&gt;A small subset of commands collected is displayed in figure 2. While a large percentage involved relatively simple commands (Basic), AMT workers did challenge the system with complex tasks using non-trivial phrasing (Advanced). As we hoped, users were highly inclined to use natural language, and did not develop a technical, artificial language. A small number of commands were questionable in nature, with unusual calendar commands (see Questionable).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Basic&lt;/th&gt;
&lt;th&gt;Advanced&lt;/th&gt;
&lt;th&gt;Questionable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;move &quot;ideas dinner tressider&quot; to Saturday&lt;/td&gt;
&lt;td&gt;change &quot;family room&quot; to &quot;game night&quot; and add location &quot;family room&lt;/td&gt;
&lt;td&gt;duplicate all calendar entries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cancel &quot;team lunch&quot; Friday between 12 pm and 1 pm&lt;/td&gt;
&lt;td&gt;Duplicate the &quot;family dinner&quot; event to 9pm today&lt;/td&gt;
&lt;td&gt;remove all appointments for the entire week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change &quot;golf lesson&quot; to 5pm&lt;/td&gt;
&lt;td&gt;remove all appointments on monday&lt;/td&gt;
&lt;td&gt;Remove all entries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schedule a &quot;meeting with Bob&quot; Tuesday at 10:30am&quot;&lt;/td&gt;
&lt;td&gt;change all &quot;team lunch&quot; to after 2 pm&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Figure 2. A categorized sample of commands collected in our experiment&lt;/p&gt;
&lt;p&gt;To assess learning performance, we measure the system’s ability to predict the correct calendar action given a natural language command. We see that the top-ranked action is correct about 60% of the time, and the correct meaning is in the top three system-ranked actions about 80% of the time.&lt;/p&gt;
&lt;h2&gt;Discussion&lt;/h2&gt;
&lt;p&gt;The key challenge is figuring out which feedback signals are both usable for the computer and natural for humans. We explored providing alternatives and learning from demonstration. We are also trying definitions and rephrasing. For example, when a user rephrases “my meetings tomorrow morning” as “my meetings tomorrow after 7 am and before noon”, we can infer the meaning of “morning”.&lt;/p&gt;
&lt;p&gt;Looking forward, we believe NLIs must learn through interaction with users, and improve over time. NLIs have the potential to replace GUIs and scripting for many tasks, and doing so can bridge the great digital divide of skills and enable all of us to better make use of computers.&lt;/p&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">In Their Own Words: The 2016 Graduates of the Stanford NLP Group</title>
    <id>https://nlp.stanford.edu/blog/in-their-own-words-the-2016-graduates-of-the-stanford-nlp-group</id>
    <updated>2016-07-06T13:02:57Z</updated>
    <link href="https://nlp.stanford.edu/blog/in-their-own-words-the-2016-graduates-of-the-stanford-nlp-group" />
    <author>
      <name>Stanford NLP</name>
    </author>
    <content type="html">&lt;p&gt;This year we have a true bumper crop of graduates from the NLP Group - ten people! We're sad to see them go but excited for all the wonderful things they're off to do. Thanks to them all for being a part of the group and for their amazing contributions! We asked all the graduates to give us a few words about what they did here and where they're headed - check it out!&lt;/p&gt;
&lt;hr /&gt;
&lt;h3&gt;PhD Students&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://www.eloquent.ai/~gabor/&quot;&gt;&lt;strong&gt;Gabor Angeli&lt;/strong&gt;&lt;/a&gt;&lt;br /&gt;
My Ph.D. focused on natural language understanding. Early in the program, I worked on semantic parsing for temporal expressions, before moving on to relation extraction -- I was actively involved in Stanford's Knowledge Base Population efforts -- and textual entailment. My thesis work was on applying natural logic -- a formal logic over the syntax of natural language -- to large-scale open-domain question answering tasks.  Now I'm off working on &lt;a href=&quot;https://www.eloquent.ai&quot;&gt;Eloquent Labs&lt;/a&gt; with my cofounder Keenon Werling, where we're building dialog AI for customer service.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.nyu.edu/projects/bowman/&quot;&gt;&lt;strong&gt;Sam Bowman&lt;/strong&gt;&lt;/a&gt;&lt;br /&gt;
My research has been on understanding and improving neural network models for encoding and reasoning with sentence meaning. In working on that, I've done a lot with the task of natural language inference (a.k.a. recognizing textual entailment), for which I led the creation of the Stanford NLI challenge dataset last spring. After Stanford, I'll be starting as an assistant professor in the Department of Linguistics and the Center for Data Science at NYU.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://cs.stanford.edu/~angelx&quot;&gt;&lt;strong&gt;Angel Chang&lt;/strong&gt;&lt;/a&gt;&lt;br /&gt;
At Stanford, I worked on temporal expression resolution (SUTime), entity linking, and text to 3D scene generation. At the moment I am working part time at Tableau Research on NLP for data visualization. This fall, I will be going to Princeton for a postdoc and will continue to work on projects connecting language with 3D scene understanding.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.thangluong.com/&quot;&gt;&lt;strong&gt;Thang Luong&lt;/strong&gt;&lt;/a&gt;&lt;br /&gt;
I spent the first half of my PhD wandering around, thinking of dropping out, and working on various research areas such as parsing, psycholinguistics, and word embedding learning. Then I fell in love with deep learning models, specifically neural machine translation which I wrote a thesis about. I'll be joining the Google Brain team and am excited to help contribute my part towards the future of AI.  &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Natalia Silveira&lt;/strong&gt;&lt;br /&gt;
At Stanford, I got a PhD in Linguistics and worked on dependency syntax for NLP, as a core contributor to the Universal Dependencies project. My focus is on how linguists' theoretical knowledge of syntax interacts with practical constraints for representing syntax for NLP applications, and how representation choices affect entire NLP pipelines. Next, I'm joining a research team at Apple.  &lt;/p&gt;
&lt;hr /&gt;
&lt;h3&gt;Masters Students&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://www.linkedin.com/in/klopyrev&quot;&gt;&lt;strong&gt;Konstantin Lopyrev&lt;/strong&gt;&lt;/a&gt;&lt;br /&gt;
I spent 2 quarters improving CodaLab and 1 quarter working on a reading comprehension project. I'm going back to Google where I'll be working on Google Now to improve the personalized news recommendations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Neha Nayak&lt;/strong&gt;&lt;br /&gt;
At Stanford I did work on lexical semantics and word embedding evaluation, and I'm now a software engineer at Google.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hieu Pham&lt;/strong&gt;&lt;br /&gt;
At Stanford, I worked on multilingual representation learning and neural machine translation. I'm spending the year 2016-17 at Google Brain before joining CMU's PhD program in Fall 2017.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.victorzhong.com/&quot;&gt;&lt;strong&gt;Victor Zhong&lt;/strong&gt;&lt;/a&gt;&lt;br /&gt;
At Stanford, I worked on applying deep learning methods to relation extraction and knowledge base population. I am now a research scientist at MetaMind/Salesforce.  &lt;/p&gt;
&lt;hr /&gt;
&lt;h3&gt;Undergraduates&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Keenon Werling&lt;/strong&gt;&lt;br /&gt;
I did research on semantic parsing and human-in-the-loop systems while at Stanford, and absolutely loved it. Now I'm off working on &lt;a href=&quot;https://www.eloquent.ai&quot;&gt;Eloquent Labs&lt;/a&gt; with my cofounder Gabor Angeli, where we're building dialog AI for customer service.&lt;/p&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">Hybrid tree-sequence neural networks with SPINN</title>
    <id>https://nlp.stanford.edu/blog/hybrid-tree-sequence-neural-networks-with-spinn</id>
    <updated>2016-06-23T16:10:38Z</updated>
    <link href="https://nlp.stanford.edu/blog/hybrid-tree-sequence-neural-networks-with-spinn" />
    <author>
      <name>Jon Gauthier</name>
    </author>
    <content type="html">&lt;p&gt;&lt;small&gt;This is a cross-post from &lt;a href=&quot;http://www.foldl.me/2016/spinn-hybrid-tree-sequence-models/&quot;&gt;my personal blog&lt;/a&gt;.&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;We’ve finally published a neural network model which has been under development
for over a year at Stanford.  I’m proud to announce &lt;strong&gt;SPINN&lt;/strong&gt;: the
&lt;strong&gt;S&lt;/strong&gt;tack-augmented &lt;strong&gt;P&lt;/strong&gt;arser-&lt;strong&gt;I&lt;/strong&gt;nterpreter &lt;strong&gt;N&lt;/strong&gt;eural &lt;strong&gt;N&lt;/strong&gt;etwork. The
project fits into a long-standing Stanford research program: mixing deep
learning methods with principled approaches inspired by linguistics. It is the
result of a substantial collaborative effort also involving &lt;a href=&quot;https://www.nyu.edu/projects/bowman/&quot;&gt;Sam Bowman&lt;/a&gt;,
Abhinav Rastogi, Raghav Gupta, and our advisors &lt;a href=&quot;http://nlp.stanford.edu/manning/&quot;&gt;Christopher Manning&lt;/a&gt; and
&lt;a href=&quot;http://web.stanford.edu/~cgpotts/&quot;&gt;Christopher Potts&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This post is a brief introduction to the SPINN project from a particular angle,
one which is likely of interest to researchers both inside and outside of the
NLP world. I’ll focus here on the core SPINN theory and how it enables a
&lt;strong&gt;hybrid tree-sequence architecture&lt;/strong&gt;.&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; This architecture blends the otherwise
separate paradigms of &lt;a href=&quot;https://en.wikipedia.org/wiki/Recursive_neural_network&quot;&gt;recursive&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Recurrent_neural_network&quot;&gt;recurrent&lt;/a&gt; neural networks into a
structure that is stronger than the sum of its parts.&lt;/p&gt;
&lt;p style=&quot;text-align:center;font-size:88%&quot;&gt;(quick links: &lt;a href=&quot;#model&quot;&gt;model description&lt;/a&gt;,
&lt;a href=&quot;http://nlp.stanford.edu/pubs/bowman2016spinn.pdf&quot;&gt;full paper&lt;/a&gt;,
&lt;a href=&quot;https://github.com/stanfordnlp/spinn&quot;&gt;code&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Our task, broadly stated, is to build a model which outputs compact,
sufficient&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; representations of natural language. We will use these
representations in downstream language applications that we care about.&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;
Concretely, for an input sentence \(\mathbf x\), we want to learn a powerful
representation function \(f(\mathbf x)\) which maps to a vector-valued
representation of the sentence. Since this is a deep learning project,
\(f(\mathbf{x})\) is of course parameterized by a neural network of some
sort.&lt;/p&gt;
&lt;p&gt;Voices from Stanford have been suggesting for a long time that basic linguistic
theory might help to solve this representation problem. &lt;a href=&quot;https://en.wikipedia.org/wiki/Recursive_neural_network&quot;&gt;Recursive neural
networks&lt;/a&gt;, which combine simple grammatical analysis with the power of
&lt;a href=&quot;https://en.wikipedia.org/wiki/Recurrent_neural_network&quot;&gt;recurrent neural networks&lt;/a&gt;, were strongly supported here by &lt;a href=&quot;http://www.socher.org/&quot;&gt;Richard
Socher&lt;/a&gt;, &lt;a href=&quot;http://nlp.stanford.edu/manning/&quot;&gt;Chris Manning&lt;/a&gt;, and colleagues. SPINN has been developed in
this same spirit of merging basic linguistic facts with powerful neural network
tools.&lt;/p&gt;
&lt;h2 id=&quot;model&quot;&gt;Model&lt;/h2&gt;
&lt;p&gt;Our model is based on an insight into representation. Recursive neural networks
are centered around tree structures (usually binary &lt;a href=&quot;https://en.wikipedia.org/wiki/Parse_tree#Constituency-based_parse_trees&quot;&gt;constituency trees&lt;/a&gt;)
like the following:&lt;/p&gt;
&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me/uploads/2016/tree.png&quot; alt=&quot;&quot; /&gt;&lt;figcaption style=&quot;font-size: 85%; margin-bottom: 10px&quot;&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;In a standard recursive neural network implementation, we compute the
representation of a sentence (equivalently, the root node &lt;em&gt;S&lt;/em&gt;) as a recursive
function of its two children, and so on down the tree. The recursive function
is specified like this, for a parent representation \(\vec p\) with child
representations \(\vec c_1, \vec c_2\):
\[\vec p = \sigma(W [\vec c_1, \vec c_2])\]
where \(\sigma\) is some nonlinearity such as the \(\tanh\) or sigmoid
function. The obvious way to implement this recurrence is to visit each triple
of a parent and two children, and compute the representations bottom-up.  The
graphic below demonstrates this computation order.&lt;/p&gt;
&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me/uploads/2016/tree-recursive.gif&quot; alt=&quot;The computation defined by a standard recursive neural network. We compute representations bottom-up, starting at the leaves and moving to nonterminals.&quot; /&gt;&lt;figcaption style=&quot;font-size: 85%; margin-bottom: 10px&quot;&gt;The computation defined by a standard recursive neural network. We compute representations bottom-up, starting at the leaves and moving to nonterminals.&lt;/figcaption&gt;&lt;/figure&gt;
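&lt;p&gt;A minimal sketch of this bottom-up computation, using toy dimensions, random parameters, and plain Python lists as vectors (a real implementation would use a matrix library and a richer composition function than the plain \(\tanh\) layer shown here):&lt;/p&gt;

```python
import math
import random

random.seed(0)
D = 4  # representation dimension

# Toy parameters: W maps the concatenation [c1; c2] (length 2D) to length D.
W = [[random.uniform(-0.5, 0.5) for _ in range(2 * D)] for _ in range(D)]
EMBED = {w: [random.uniform(-0.5, 0.5) for _ in range(D)]
         for w in ["the", "cat", "ate"]}


def compose(c1, c2):
    """Parent representation p = tanh(W [c1; c2])."""
    concat = c1 + c2
    return [math.tanh(sum(W[i][j] * concat[j] for j in range(2 * D)))
            for i in range(D)]


def encode(tree):
    """Visit each parent/children triple bottom-up; leaves are word embeddings."""
    if isinstance(tree, str):
        return EMBED[tree]
    left, right = tree
    return compose(encode(left), encode(right))


# The parse tree ((the cat) ate), encoded into a single D-dimensional vector.
sentence_vec = encode((("the", "cat"), "ate"))
print(len(sentence_vec))
# 4
```

&lt;p&gt;Note that the recursion itself is dictated entirely by the shape of the input tree, which is precisely the batching difficulty discussed next.&lt;/p&gt;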
&lt;p&gt;This is a nice idea, because it allows linguistic structure to &lt;strong&gt;guide
computation&lt;/strong&gt;. We are using our prior knowledge of sentence structure to
simplify the work left to the deep learning model.&lt;/p&gt;
&lt;p&gt;One substantial practical problem with this recursive neural network, however,
is that it can’t easily be batched. Each input sentence has its own unique
computation defined by its parse tree. At any given point, then, each
example will want to compose triples in different memory locations.
Recurrent neural networks, by contrast, enjoy a serious speed advantage. At each
timestep, we merely feed a big batch of memories through a matrix
multiplication. This work can be easily farmed out on a GPU, leading to
order-of-magnitude speedups. Recursive neural networks unfortunately don’t work
like this. We can’t retrieve a single batch of contiguous data at each
timestep, since each example has different computation needs throughout the
process.&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3 id=&quot;shift-reduce-parsing&quot;&gt;Shift-reduce parsing&lt;/h3&gt;
&lt;p&gt;The fix comes from the change in representation foreshadowed earlier. To make
that change, I need to introduce a parsing formalism popular in natural
language processing, originally stolen from the compiler/PL crowd.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Shift-reduce_parser&quot;&gt;&lt;strong&gt;Shift-reduce parsing&lt;/strong&gt;&lt;/a&gt; is a method for building parse structures from
sequence inputs in linear time. It works by exploiting an auxiliary &lt;em&gt;stack&lt;/em&gt;
structure, which stores partially-parsed subtrees, and a &lt;em&gt;buffer&lt;/em&gt;, which stores
input tokens which have yet to be parsed.&lt;/p&gt;
&lt;p&gt;We use a shift-reduce parser to apply a sequence of &lt;em&gt;transitions&lt;/em&gt;,
moving items from the buffer to the stack and combining multiple stack elements
into single elements. In the parser’s initial state, the stack is empty and the
buffer contains the tokens of an input sentence. There are just two legal
transitions in the parser transition sequence.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shift&lt;/strong&gt; pulls the next token from the buffer and pushes it onto the stack.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduce&lt;/strong&gt; combines the top two elements of the stack into a single element,
producing a new subtree. The top two elements of the stack become the left
and right children of this new subtree.&lt;/li&gt;
&lt;/ul&gt;
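&lt;p&gt;The transition loop itself is tiny. Here is an illustrative Python sketch (not our implementation) that replays a transition sequence to rebuild the binary tree:&lt;/p&gt;

```python
def shift_reduce(tokens, transitions):
    """Build a binary tree by replaying a shift-reduce transition sequence."""
    stack, buffer = [], list(tokens)
    for t in transitions:
        if t == "S":                       # shift: next token moves onto the stack
            stack.append(buffer.pop(0))
        else:                              # reduce: top two stack items -> subtree
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    assert len(stack) == 1 and not buffer  # a full parse consumes everything
    return stack[0]

tokens = ["The", "man", "picked", "the", "vegetables"]
tree = shift_reduce(tokens, ["S", "S", "R", "S", "S", "S", "R", "R", "R"])
print(tree)   # (('The', 'man'), ('picked', ('the', 'vegetables')))
```

&lt;p&gt;Note that the five-token sentence needs exactly nine transitions: one shift per token and one reduce per nonterminal.&lt;/p&gt;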
&lt;p&gt;The animation below shows how these two transitions can be used to construct
the entire parse tree for our example sentence.&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me/uploads/2016/tree-shift-reduce.gif&quot; alt=&quot;A shift-reduce parser produces the pictured constituency tree. Each timestep is visualized before and then after the transition is taken. The text at the top right shows the transition at each timestep, and yellow highlights indicate the data involved in the transition. The table at the right displays the stack contents before and after each transition.&quot; /&gt;&lt;figcaption style=&quot;font-size: 85%; margin-bottom: 10px&quot;&gt;A shift-reduce parser produces the pictured constituency tree. Each timestep is visualized before and then after the transition is taken. The text at the top right shows the transition at each timestep, and yellow highlights indicate the data involved in the transition. The table at the right displays the stack contents before and after each transition.&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Rather than running a standard bottom-up recursive computation, then, we can
execute this table-based method on transition sequences. Here’s the buffer and
accompanying transition sequence we used for the sentence above. &lt;code class=&quot;highlighter-rouge&quot;&gt;S&lt;/code&gt; denotes a
shift transition and &lt;code class=&quot;highlighter-rouge&quot;&gt;R&lt;/code&gt; denotes a reduce transition.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Buffer: The, man, picked, the, vegetables
Transitions: S, S, R, S, S, S, R, R, R
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Every binary tree has a unique corresponding shift-reduce transition sequence.
For a sentence with \(n\) tokens, we can produce its parse with a
shift-reduce parser in exactly \(2n - 1\) transitions.&lt;/p&gt;
&lt;p&gt;All we need to do is build a shift-reduce parser that combines &lt;strong&gt;vector
representations&lt;/strong&gt; rather than subtrees. This system is a pretty simple
extension of the original shift-reduce setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shift&lt;/strong&gt; pulls the next &lt;em&gt;word embedding&lt;/em&gt; from the buffer and pushes it onto
the stack.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduce&lt;/strong&gt; combines the top two elements of the stack \(\vec c_1, \vec
c_2\) into a single element \(\vec p\) via the standard recursive
neural network feedforward: \(\vec p = \sigma(W [\vec c_1, \vec c_2])\).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now we have a shift-reduce parser, deep-learning style.&lt;/p&gt;
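&lt;p&gt;Concretely, the only change from the tree-building parser is what the reduce transition does. A toy sketch with random weights and a tanh nonlinearity (illustrative assumptions, not our trained model):&lt;/p&gt;

```python
import numpy as np

D = 4
rng = np.random.default_rng(1)
W = rng.standard_normal((D, 2 * D)) * 0.1
embed = {w: rng.standard_normal(D)
         for w in ["The", "man", "picked", "the", "vegetables"]}

def encode(tokens, transitions):
    """Shift-reduce over vectors: the finished stack holds one sentence vector."""
    stack = []
    buffer = [embed[w] for w in tokens]
    for t in transitions:
        if t == "S":                           # shift a word embedding
            stack.append(buffer.pop(0))
        else:                                  # reduce: p = tanh(W [c1; c2])
            c2, c1 = stack.pop(), stack.pop()
            stack.append(np.tanh(W @ np.concatenate([c1, c2])))
    return stack.pop()

sentence = encode(["The", "man", "picked", "the", "vegetables"],
                  ["S", "S", "R", "S", "S", "S", "R", "R", "R"])
print(sentence.shape)   # (4,)
```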
&lt;p&gt;This is really cool for several reasons. The first is that this shift-reduce
recurrence &lt;strong&gt;computes the exact same function&lt;/strong&gt; as the recursive neural network
we formulated above. Rather than making the awkward bottom-up tree-structured
computation, then, we can just run a recurrent neural network over these
shift-reduce transition sequences.&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;If we’re back in recurrent neural network land, that means we can make use of
all the batching goodness that we were excited about earlier. It gains us quite
a bit of speed, as the figure below from our paper demonstrates.&lt;/p&gt;
&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me/uploads/2016/spinn-speed.png&quot; alt=&quot;Massive speed-ups over a competitive recursive neural network implementation (from Irsoy and Cardie, 2014). A baseline RNN implementation, which ignores parse information, is also shown. The y-axis shows feedforward speed on random input sequence data.&quot; /&gt;&lt;figcaption style=&quot;font-size: 85%; margin-bottom: 10px&quot;&gt;Massive speed-ups over a competitive recursive neural network implementation (from &lt;a href=&quot;http://www.cs.cornell.edu/~oirsoy/files/nips14drsv.pdf&quot;&gt;Irsoy and Cardie, 2014&lt;/a&gt;). A baseline RNN implementation, which ignores parse information, is also shown. The &lt;em&gt;y&lt;/em&gt;-axis shows feedforward speed on random input sequence data.&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;That’s &lt;a href=&quot;https://docs.google.com/spreadsheets/d/17BRX32FQjP2Blk3zNSZyGpUWYAr4h0xhwec23dnQaCM/pubhtml&quot;&gt;up to a 25x improvement&lt;/a&gt; over our comparison recursive neural
network implementation. We’re between two and five times slower than a recurrent
neural network, and it’s worth discussing why. Though we are able to batch
examples and run an efficient GPU implementation, this computation is
fundamentally divergent — at any given timestep, some examples require a
“shift” operation, and other examples require a “reduce.” When computing
results for all examples in bulk, we’re fated to throw away at least half of
our work.&lt;/p&gt;
&lt;p&gt;I’m excited about this big speedup. Recursive neural networks have often been
dissed as too slow and “not batchable,” and this development proves both points
wrong. I hope it will make new research on this model class a practical
opportunity.&lt;/p&gt;
&lt;h3 id=&quot;hybrid-tree-sequence-networks&quot;&gt;Hybrid tree-sequence networks&lt;/h3&gt;
&lt;p&gt;I’ve been hinting throughout this post that our new shift-reduce feedforward is
really just a recurrent neural network computation. To be clear, here’s the
“sequence” that the recurrent neural network traverses when it reads in our
example tree:&lt;/p&gt;
&lt;figure class=&quot;image&quot;&gt;&lt;img class=&quot;noborder&quot; src=&quot;http://www.foldl.me/uploads/2016/tree-shift-reduce-with-trace.gif&quot; alt=&quot;Visualization of the post-order tree traversal performed by a shift-reduce parser.&quot; /&gt;&lt;figcaption style=&quot;font-size: 85%; margin-bottom: 10px&quot;&gt;Visualization of the post-order tree traversal performed by a shift-reduce parser.&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;This is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Tree_traversal#Post-order&quot;&gt;post-order&lt;/a&gt; tree traversal, where for a given parent node we
recurse through the left subtree, then the right, and then finally visit the
parent.&lt;/p&gt;
&lt;p&gt;Looking at this diagram gave us a simple idea with a big result: why not
have a &lt;strong&gt;recurrent&lt;/strong&gt; neural network follow along this path of arrows?&lt;/p&gt;
&lt;p&gt;Concretely, that means that at every timestep, we update some RNN memory
regardless of the shift-reduce transition. We call this the &lt;strong&gt;tracking
memory&lt;/strong&gt;. We can write out the algorithm mathematically for clarity. At any
given timestep \(t\), we compute a new tracking value \(\vec m_t\) by
combining the top two elements of the stack \(\vec c_1, \vec c_2\), the top
of the buffer \(\vec b_1\), and the previous tracking memory \(\vec
m_{t-1}\):
\begin{equation}
\vec m_t = \text{Track}(\vec m_{t-1}, \vec c_1, \vec c_2, \vec b_1)
\end{equation}
We can then pass this tracking memory onto the recursive composition function,
via a simple extension like this:
\begin{equation}
\vec p = \sigma(W [\vec c_1; \vec c_2; \vec m_t])
\end{equation}
What have we done? We’ve just interwoven a recurrent neural network into a
recursive neural network computation. The recurrent memories are used to
augment the recursive computation (\(m_t\) is passed to the recursive
composition function) and vice versa (the recurrent memories are a function of
the recursively computed values on the stack).&lt;/p&gt;
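&lt;p&gt;Here is how the tracking memory slots into the toy shift-reduce sketch. The dimensions, weight shapes, and the zero vectors standing in for an empty stack or buffer are illustrative assumptions, not our exact model:&lt;/p&gt;

```python
import numpy as np

D = 4
rng = np.random.default_rng(2)
W_comp = rng.standard_normal((D, 3 * D)) * 0.1    # composition also sees m_t
W_track = rng.standard_normal((D, 4 * D)) * 0.1   # Track(m_prev, c1, c2, b1)
embed = {w: rng.standard_normal(D)
         for w in ["The", "man", "picked", "the", "vegetables"]}
ZERO = np.zeros(D)

def track(m_prev, c1, c2, b1):
    return np.tanh(W_track @ np.concatenate([m_prev, c1, c2, b1]))

def encode(tokens, transitions):
    stack, buffer = [], [embed[w] for w in tokens]
    m = ZERO                                        # tracking memory m_0
    for t in transitions:
        c1 = stack[-2] if len(stack) > 1 else ZERO  # top two stack items
        c2 = stack[-1] if stack else ZERO
        b1 = buffer[0] if buffer else ZERO          # top of the buffer
        m = track(m, c1, c2, b1)                    # updated at EVERY timestep
        if t == "S":
            stack.append(buffer.pop(0))
        else:                                       # p = tanh(W [c1; c2; m_t])
            stack.pop()
            stack.pop()
            stack.append(np.tanh(W_comp @ np.concatenate([c1, c2, m])))
    return stack.pop()

vec = encode(["The", "man", "picked", "the", "vegetables"],
             ["S", "S", "R", "S", "S", "S", "R", "R", "R"])
print(vec.shape)   # (4,)
```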
&lt;p&gt;We show in &lt;a href=&quot;http://nlp.stanford.edu/pubs/bowman2016spinn.pdf&quot;&gt;our paper&lt;/a&gt; how these two paradigms turn out to have
&lt;strong&gt;complementary&lt;/strong&gt; power on our test data. By combining the recurrent and
recursive models into a single feedforward, we get a model that is more
powerful than the sum of its parts.&lt;/p&gt;
&lt;p&gt;What we’ve built is a new way to build a representation \(f(\mathbf x)\) for
an input sentence \(\mathbf x\), like we discussed at the beginning of this
post. In our paper, we use this representation to reach a high-accuracy result
on the &lt;a href=&quot;http://nlp.stanford.edu/projects/snli/&quot;&gt;Stanford Natural Language Inference dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This post managed to cover about one section of our full paper. If you’re
interested in more details about how we implemented and applied this model,
related work, or a more formal description of the algorithm discussed here,
&lt;a href=&quot;http://nlp.stanford.edu/pubs/bowman2016spinn.pdf&quot;&gt;take a read&lt;/a&gt;. You can also check out &lt;a href=&quot;https://github.com/stanfordnlp/spinn&quot;&gt;our code repository&lt;/a&gt;, which has
several implementations of the SPINN model and models which you can run to
reproduce or extend our results.&lt;/p&gt;
&lt;p&gt;We’re continuing active work on this project in order to learn better
end-to-end models for natural language processing. I always enjoy hearing ideas
from my readers — if this project interests you, get in touch via email or in
the comment section below.&lt;/p&gt;
&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;I have to first thank my collaborators, of course — this was a team of strong
researchers with nicely complementary skills, and I look forward to pushing
this further together with them in the future.&lt;/p&gt;
&lt;p&gt;The SPINN project has been supported by a Google Faculty Research Award, the
Stanford Data Science Initiative, and the National Science Foundation under
grant numbers &lt;a href=&quot;http://www.nsf.gov/awardsearch/showAward?AWD_ID=1456077&quot;&gt;BCS 1456077&lt;/a&gt; and &lt;a href=&quot;http://www.nsf.gov/awardsearch/showAward?AWD_ID=1514268&quot;&gt;IIS 1514268&lt;/a&gt;. Some of the Tesla K40s
used for this research were donated to Stanford by the NVIDIA Corporation.
&lt;a href=&quot;http://kelvinguu.com/&quot;&gt;Kelvin Gu&lt;/a&gt;, &lt;a href=&quot;http://cocolab.stanford.edu/ndg.html&quot;&gt;Noah Goodman&lt;/a&gt;, and many others in the &lt;a href=&quot;http://nlp.stanford.edu&quot;&gt;Stanford NLP
Group&lt;/a&gt; contributed helpful comments during development. &lt;a href=&quot;https://twitter.com/crizcraig&quot;&gt;Craig Quiter&lt;/a&gt;
and &lt;a href=&quot;https://www.nyu.edu/projects/bowman/&quot;&gt;Sam Bowman&lt;/a&gt; helped review this blog post.&lt;/p&gt;
&lt;div class=&quot;footnotes&quot;&gt;
&lt;ol&gt;
&lt;li id=&quot;fn:1&quot;&gt;
&lt;p&gt;This is only a brief snapshot of the project focusing on modeling and algorithms. For details on the task / data, training, related work etc., check out &lt;a href=&quot;http://nlp.stanford.edu/pubs/bowman2016spinn.pdf&quot;&gt;our full paper&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:2&quot;&gt;
&lt;p&gt;I mean &lt;em&gt;sufficient&lt;/em&gt; here in a formal sense — i.e., powerful enough to answer questions of interest in isolation, without looking back at the original input value. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:7&quot;&gt;
&lt;p&gt;In this first paper, we use the model to answer questions from the &lt;a href=&quot;http://nlp.stanford.edu/projects/snli/&quot;&gt;Stanford Natural Language Inference dataset&lt;/a&gt;. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:3&quot;&gt;
&lt;p&gt;A non-naïve approach might involve maintaining a queue of triples from an input batch and rapidly dequeuing them, batching together all of these dequeued values. This has already been pursued (of course) by colleagues at Stanford, and it shows some promising speed improvements on a CPU. I doubt, though, that the gains from this method will offset the losses on the GPU, since this method sacrifices all data locality that a &lt;em&gt;recurrent&lt;/em&gt; neural network enjoys on the GPU. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:4&quot;&gt;
&lt;p&gt;For a more formal and thorough definition of shift-reduce parsing, I’ll refer the interested reader to &lt;a href=&quot;http://nlp.stanford.edu/pubs/bowman2016spinn.pdf&quot;&gt;our paper&lt;/a&gt;. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:6&quot;&gt;
&lt;p&gt;The catch is that the recurrent neural network must maintain the per-example stack data. This is simple to implement in principle. We had quite a bit of trouble writing an efficient implementation in Theano, though, which is not really built to support complex data structure manipulation. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">How to help someone feel better: NLP for mental health</title>
    <id>https://nlp.stanford.edu/blog/how-to-help-someone-feel-better-nlp-for-mental-health</id>
    <updated>2016-05-25T15:24:05Z</updated>
    <link href="https://nlp.stanford.edu/blog/how-to-help-someone-feel-better-nlp-for-mental-health" />
    <author>
      <name>&lt;a href=&quot;http://cs.stanford.edu/people/kevclark/&quot;&gt;Kevin Clark&lt;/a&gt; and &lt;a href=&quot;http://www.timalthoff.com/&quot;&gt;Tim Althoff&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;Natural language processing (NLP) allows us to &lt;a href=&quot;http://nlp.stanford.edu/software/tagger.shtml&quot;&gt;tag&lt;/a&gt;, &lt;a href=&quot;http://nlp.stanford.edu/software/lex-parser.shtml&quot;&gt;parse&lt;/a&gt;, and even &lt;a href=&quot;http://nlp.stanford.edu/software/openie.html&quot;&gt;extract information&lt;/a&gt; from text. But we believe it also has the potential to help address major challenges facing the world. Recently, we have been working on applying NLP to a serious global health issue: mental illness. In the U.S. alone, &lt;a href=&quot;http://www.nimh.nih.gov/health/statistics/prevalence/any-mental-illness-ami-among-us-adults.shtml&quot;&gt;43.6 million adults (18.1%)&lt;/a&gt; experience mental illness each year. Fortunately, mental health conditions can often be treated with counseling and psychotherapy, and in recent years there has been rapid growth in the availability of these treatments thanks to technology-mediated counseling.  The goal of our project was to better understand how to conduct counseling sessions, which we have done through a large-scale &lt;a href=&quot;http://arxiv.org/abs/1605.04462&quot;&gt;study&lt;/a&gt; of crisis counseling conversations. &lt;/p&gt;
&lt;p&gt;So far, most research on counseling has been small-scale and qualitative due to the difficulty of obtaining data. We partnered with a nonprofit organization that offers crisis counseling via text messages to apply techniques from data mining and NLP on a dataset of over 80,000 counseling sessions. In our analysis, we searched for linguistic aspects of conversations that were correlated with the outcomes of the conversations (whether the person texting felt better afterwards).&lt;/p&gt;
&lt;h1&gt;The Data&lt;/h1&gt;
&lt;p&gt;The text-based counseling service offers free, 24/7 counseling for anyone in crisis (depression, self-harm, suicidal thoughts, anxiety, etc.). Anyone who texts the public number will be matched with a counselor and undergo a counseling session completely via SMS. At the end of the session, the texter receives a follow-up question: “How are you feeling now? Better, same, or worse?” Texting-based counseling is particularly effective with teenagers, allows for privacy (nobody can overhear your conversation), and is much easier to access than other forms of counseling. Each day, the service conducts hundreds of conversations and (on average) initiates at least one active rescue of a texter who’s thought to be in immediate danger of suicide. Carefully anonymized data collected from these conversations is made available with the hope of facilitating research on counseling. You can learn more about accessing the data &lt;a href=&quot;http://snap.stanford.edu/counseling/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;Our Research&lt;/h1&gt;
&lt;p&gt;Our study was conducted on about 15,000 conversations (660,000 messages) that had a response to the follow-up question. On average, the conversations were 43 messages long with around 20 words per message. There are many questions that could be investigated with this data, but we were most interested in learning what characterizes a successful conversation. Although a counseling session is free-form and without strict rules, it involves many choices that could make a difference in someone’s life. To answer this question, we developed techniques to quantify aspects of the conversations and determine which ones were associated with successful counselors. There are five “strategies” we found more prevalent in successful counselors (i.e., those who have a higher rate of texters saying they felt better in the follow-up):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adaptability:&lt;/strong&gt; Successful counselors are aware of how the conversation is going and react accordingly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dealing with ambiguity:&lt;/strong&gt; Successful counselors clarify situations by writing more, reflecting back to check understanding, and making their conversation partner feel more comfortable through affirmation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Creativity:&lt;/strong&gt; Successful counselors respond in a creative way, not using too generic or “templated” responses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Making Progress:&lt;/strong&gt; Successful counselors are quicker to get to know the main issue and are faster to move on to collaboratively solving the problem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Change in Perspective:&lt;/strong&gt; We found that people in distress are more likely to be positive, think about the future, and consider others when the counselors bring up these concepts. This kind of perspective change is associated with positive conversations, a finding that is consistent with psychological theories of depression.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Although some of these are obvious in hindsight, this is, to the best of our knowledge, the first time anyone has been able to perform a large-scale analysis of these strategies. We hope that this research will lead to a better understanding of how to provide quality counseling services.&lt;/p&gt;
&lt;h1&gt;Some of our Findings&lt;/h1&gt;
&lt;p&gt;Here is a summary of some of our findings. See our &lt;a href=&quot;http://arxiv.org/pdf/1605.04462.pdf&quot;&gt;paper&lt;/a&gt; for the full set of experiments and analyses.  &lt;/p&gt;
&lt;p&gt;&lt;em&gt;Being Adaptable&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Thanks to the post-conversation question, we know the outcomes of the counseling sessions. But are the counselors themselves aware of how the conversation is going? And if a conversation is going badly do they react? We investigated this question by looking for language differences between positive (i.e. the texter says they feel better at the end) and negative conversations. In particular, we computed a distance measure between the language counselors use in positive conversations and the language counselors use in negative conversations and observed how this distance changes over time. The results are shown below.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Distance between counselor language in positive and negative conversations over the course of the conversation&quot; src=&quot;http://cs.stanford.edu/people/kevclark/resources/counseling_figures/adaptability.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;At the beginning of the conversation, the language used in positive and negative conversations is quite similar, but then the distance in language increases over time. This increase in distance is much larger for more successful counselors than less successful ones, suggesting they are more aware of when conversations are going poorly and adapt their counseling more in an attempt to remedy the situation. &lt;/p&gt;
&lt;p&gt;&lt;em&gt;Reacting to Ambiguity&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We also analyzed how counselors react to ambiguous situations. Ambiguity arises most at the beginning of conversations. We looked at the counselors' responses to the first long message by the texter (typically a response to a “Can you tell me more about what is going on?” question by the counselor). Based on counselor training materials, we hypothesized successful counselors would&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write more themselves             &lt;/li&gt;
&lt;li&gt;Use more check questions (statements that tell the conversation partner that you understand them while avoiding the introduction of any opinion or advice e.g., “that sounds like...”),                       &lt;/li&gt;
&lt;li&gt;Check for suicidal thoughts                       &lt;/li&gt;
&lt;li&gt;Thank the texter for showing the courage to talk to them                  &lt;/li&gt;
&lt;li&gt;Use more hedges (mitigating words used to lessen the impact of an utterance; e.g., “maybe”, “fairly”) &lt;/li&gt;
&lt;li&gt;Be less likely to respond with surprise (e.g., “oh, this sounds really awful”) &lt;/li&gt;
&lt;/ul&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;More successful counselors&lt;/td&gt;
&lt;td&gt;Less successful counselors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Counselor message length (in words)&lt;/td&gt;
&lt;td&gt;15.8&lt;/td&gt;
&lt;td&gt;11.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Counselor responds with check question&lt;/td&gt;
&lt;td&gt;12.6%&lt;/td&gt;
&lt;td&gt;4.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Counselor responds with suicide check&lt;/td&gt;
&lt;td&gt;13.5%&lt;/td&gt;
&lt;td&gt;10.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Counselor responds with thanks&lt;/td&gt;
&lt;td&gt;6.3%&lt;/td&gt;
&lt;td&gt;2.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Counselor responds with hedges&lt;/td&gt;
&lt;td&gt;41.4%&lt;/td&gt;
&lt;td&gt;36.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Counselor responds with surprise&lt;/td&gt;
&lt;td&gt;3.3%&lt;/td&gt;
&lt;td&gt;2.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;We found there to be statistically significant differences in all of these aspects except for showing surprise, suggesting these methods from counselor training do indeed help.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Being Creative&lt;/em&gt; &lt;/p&gt;
&lt;p&gt;Interestingly, although more successful counselors tend to more often use structured responses like check questions, their responses also tended to be more unique. We measured the uniqueness of responses by clustering counselor messages and then counting how many close neighbors the messages tended to have. Messages from more successful counselors tended to have fewer neighbors, suggesting they were being more creative or personalized in their responses. This tailoring of messages requires more effort from the counselor, which is consistent with the results in the above table showing that more successful counselors put in more effort in composing longer messages as well. &lt;/p&gt;
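&lt;p&gt;To make the neighbor-counting idea concrete, here is a heavily simplified toy sketch. The bag-of-words representation, cosine similarity, and fixed threshold below are illustrative choices, not the exact procedure from our paper:&lt;/p&gt;

```python
import numpy as np
from collections import Counter

def bow(msg, vocab):
    """Unit-normalized bag-of-words vector for a message."""
    c = Counter(msg.lower().split())
    v = np.array([c[w] for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

messages = [
    "that sounds really hard",                   # "templated" response...
    "that sounds really hard",                   # ...repeated verbatim
    "that sounds so hard",
    "tell me about the day your dog ran away",   # unique, personalized
]
vocab = sorted({w for m in messages for w in m.lower().split()})
vecs = [bow(m, vocab) for m in messages]

def neighbor_count(i, threshold=0.8):
    """Number of OTHER messages with cosine similarity above the threshold."""
    return sum(1 for j, v in enumerate(vecs)
               if j != i and float(vecs[i] @ v) > threshold)

counts = [neighbor_count(i) for i in range(len(messages))]
print(counts)   # [1, 1, 0, 0] -- generic messages have neighbors; unique ones don't
```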
&lt;p&gt;&lt;img alt=&quot;Uniqueness of counselor responses, measured by how many close neighbors each message has&quot; src=&quot;http://cs.stanford.edu/people/kevclark/resources/counseling_figures/creativity.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Facilitating Perspective Change&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Prior work on counseling suggests that certain perspectives are associated with depression, such as &lt;a href=&quot;http://psycnet.apa.org/journals/psp/52/5/994.pdf&quot;&gt;having a negative view of the future&lt;/a&gt; or &lt;a href=&quot;http://brainimaging.waisman.wisc.edu/~perlman/papers/Self/PyszcznskiGreenbergSelfRegPersev1987.pdf&quot;&gt;being self-focused&lt;/a&gt;. We quantified the concept of perspective change by measuring the frequency of different word categories (provided by &lt;a href=&quot;http://liwc.wpengine.com/&quot;&gt;LIWC&lt;/a&gt;) over time in the conversation. The results for time are shown below (we also explored focusing on oneself and having a positive or negative perspective).&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Frequency of future-oriented words over the course of the conversation&quot; src=&quot;http://cs.stanford.edu/people/kevclark/resources/counseling_figures/perspective_change_future.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Texters start explaining their issues largely in terms of the past and present, but over time switch to talking about the future. Additionally, texters writing more about the future are more likely to feel better after the conversation. This suggests that changing the perspective from issues in the past towards the future is associated with a higher likelihood of successfully working through the crisis. We also investigated whether counselors could instigate this perspective change, and found that texters were more likely to talk about the future if the counselor brought up the subject.&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;As NLP techniques become more effective and data becomes more available, it is becoming increasingly useful as a tool for investigating pressing issues that our societies face.  We think mental health is one such problem and we hope our research on counseling will inspire future work on the area, leading to new insights that could benefit treatments for mental illness. Such research could improve counselor training and lead to tools that help counselors be more successful. &lt;/p&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;Thanks to &lt;a href=&quot;https://cs.stanford.edu/people/jure/&quot;&gt;Jure Leskovec&lt;/a&gt; for advising the research and &lt;a href=&quot;http://nlp.stanford.edu/manning/&quot;&gt;Chris Manning&lt;/a&gt; for providing helpful feedback.&lt;/p&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">Maximum Likelihood Decoding with RNNs - the good, the bad, and the ugly</title>
    <id>https://nlp.stanford.edu/blog/maximum-likelihood-decoding-with-rnns-the-good-the-bad-and-the-ugly</id>
    <updated>2016-04-26T12:23:17Z</updated>
    <link href="https://nlp.stanford.edu/blog/maximum-likelihood-decoding-with-rnns-the-good-the-bad-and-the-ugly" />
    <author>
      <name>&lt;a href=&quot;http://russellsstewart.com&quot;&gt;Russell Stewart&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;&lt;link rel=&quot;stylesheet&quot; href=&quot;https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css&quot; /&gt;
&lt;script src=&quot;https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;
Training TensorFlow's large language model on the Penn Treebank yields a test perplexity of 82. With the code provided &lt;a href=&quot;https://www.tensorflow.org/versions/r0.8/tutorials/recurrent/index.html&quot;&gt;here&lt;/a&gt;, we used the large model for text generation, and got the following results depending on the temperature parameter used for sampling:
&lt;h4&gt;&lt;eqn&gt;\tau = 1.0&lt;/eqn&gt;&lt;/h4&gt;
&lt;ul&gt;
The big three auto makers posted a N N drop in early
fiscal first-half profit. The same question is how many
increasing cash administrative and financial institutions
might disappear in choosing. The man in the compelling
future was considered the city Edward H. Werner Noriega's
chief financial officer were unavailable for comment.
&lt;/ul&gt;
&lt;h4&gt;&lt;eqn&gt;\tau = 0.5&lt;/eqn&gt;&lt;/h4&gt;
&lt;ul&gt;
The proposed guidelines are expected to be approved by the
end of the year. The company said it will sell N N of its
common shares to the New York Stock Exchange. The New York
Stock Exchange's board approved the trading on the big
board to sell a N N stake in the company.
&lt;/ul&gt;
&lt;/p&gt;&lt;p&gt;
Which sample is better? It depends on your personal taste. The high temperature sample displays greater linguistic variety, but the low temperature sample is more grammatically correct. Such is the world of temperature sampling - lowering the temperature allows you to focus on higher probability output sequences and smooth over deficiencies of the model. But if you dig into the math, there's actually a lot more going on.
&lt;/p&gt;
&lt;h3&gt;The freezing function&lt;/h3&gt;
&lt;p&gt;
Temperature sampling works by increasing the probability of the most likely words before sampling. The output probability &lt;eqn&gt;p_i&lt;/eqn&gt; of each word is transformed by the freezing function &lt;eqn&gt;f&lt;/eqn&gt; to:
&lt;br /&gt;
&lt;div style=&quot;text-align: center; padding: 10px;&quot;&gt;
&lt;eqn&gt;\tilde{p}_i = f_\tau(p)_i = \frac{p_i^{\frac{1}{\tau}}}{\sum_j{p_j^{\frac{1}{\tau}}}}&lt;/eqn&gt;
&lt;/div&gt;
For &lt;eqn&gt;\tau = 1&lt;/eqn&gt;, the freezing function is just the identity function. For &lt;eqn&gt;\tau \rightarrow 0&lt;/eqn&gt;, the freezing function turns sampling into the argmax function, returning the most likely output word.
For &lt;eqn&gt;\tau = 0.5&lt;/eqn&gt;, the freezing function is equivalent to squaring the probability of each output word, and then renormalizing the sum of probabilities to &lt;eqn&gt;1&lt;/eqn&gt;. The typical perspective I hear is that a temperature like &lt;eqn&gt;0.5&lt;/eqn&gt; is supposed to make the model more robust to errors while maintaining some diversity that you'd miss out on with a greedy argmax sampler. 
&lt;br /&gt;
&lt;br /&gt;
But what if our model was fantastic and didn't make any errors? What would the effect of temperature sampling be in that case? If we look at a simple grammar where an LSTM won't make any mistakes, we can start to answer this question.
&lt;/p&gt;
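&lt;p&gt;The freezing function is a one-liner in practice. A quick NumPy sketch:&lt;/p&gt;

```python
import numpy as np

def freeze(p, tau):
    """Temperature-transform a distribution: raise to 1/tau, renormalize."""
    q = np.asarray(p, dtype=float) ** (1.0 / tau)
    return q / q.sum()

p = np.array([0.5, 0.3, 0.2])
print(freeze(p, 1.0))    # identity: [0.5 0.3 0.2]
print(freeze(p, 0.5))    # squares and renormalizes: sharper distribution
print(freeze(p, 0.05))   # approaches argmax: nearly all mass on the 0.5 entry
```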
&lt;h3&gt;What day of the week is it?&lt;/h3&gt;
&lt;p&gt;
Suppose you are asked what day of the week it is, and you have a 70% chance of knowing the answer. 30% of the time you respond &quot;I don't know&quot;. The remaining answers of &quot;Monday&quot;, &quot;Tuesday&quot;, etc. each occur with probability 10%. Your responses are recorded over a few months, and you want to train a recurrent neural network to generate them. Given the simplicity of the task, the neural network will learn the probability of each answer with high precision, and won't be expected to make any errors. If you use &lt;eqn&gt;\tau = 1.0&lt;/eqn&gt;, you'll get representative samples from the same 70/30 distribution from which you uttered them. 
&lt;br /&gt;
&lt;br /&gt;
But if you use &lt;eqn&gt;\tau = 0.5&lt;/eqn&gt;, will the network be more or less likely to know what day of the week it is? Temperature sampling biases samples towards more likely responses, but in this case, lowering the temperature will actually cause the chance that you know the answer to go down! Squaring the probability for each specific answer and renormalizing yields &lt;eqn&gt;\tilde{p}&lt;/eqn&gt; with a 6.25% chance of answering &quot;Monday&quot;, &quot;Tuesday&quot;, etc., and a 56.25% chance of responding &quot;I don't know&quot;. Maybe you think that this is an okay result. After all, &quot;I don't know&quot; was the single most likely response. But there is a different perspective under which we should expect the probability of the network knowing the day of the week to have gone up.
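The arithmetic here is easy to check directly (a quick pure-Python verification of the percentages above):

```python
# Day-of-week example: 30% "I don't know", 10% each for the seven days.
p = [0.3] + [0.1] * 7
squared = [x ** 2 for x in p]          # tau = 0.5 squares each probability
total = sum(squared)
p_tilde = [x / total for x in squared]
print(round(p_tilde[0], 4))   # "I don't know": 0.5625
print(round(p_tilde[1], 4))   # each specific day: 0.0625
```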
&lt;br /&gt;
&lt;div style=&quot;margin: 0 auto; width: 400px;&quot;&gt;
&lt;img src=&quot;http://russellsstewart.com/s/temp_sample.jpg&quot; width=&quot;400px&quot; /&gt;
&lt;/div&gt;
&lt;br /&gt;
What if instead of recording your answers verbatim, you had recorded your responses as simply knowing or not knowing what day of the week it was? We could go back and replace each instance of &quot;Monday&quot; or &quot;Tuesday&quot; etc. in the training set with &quot;I do know&quot;. After training that model, temperature sampling with &lt;eqn&gt;\tau = 0.5&lt;/eqn&gt; would cause the probability that the network knows the day of the week to go &lt;i&gt;up&lt;/i&gt; to 84.5%. To restore the original vocabulary, we could further go back and sample the day of the week you answered whenever you responded &quot;I do know&quot;, producing each answer of &quot;Monday&quot;, &quot;Tuesday&quot;, etc. with probability 12.1%. &quot;I don't know&quot; would be produced 15.5% of the time.
&lt;/p&gt;
&lt;div style=&quot;margin: 0 auto; width: 400px;&quot;&gt;
&lt;img src=&quot;http://russellsstewart.com/s/semantic_temp_sample.jpg&quot; width=&quot;400px&quot; /&gt;
&lt;figcaption&gt;Sampling for the semantic category before sampling at the word level decreases the probability of the &quot;I don't know&quot; response&lt;/figcaption&gt;

&lt;/div&gt;
&lt;h3&gt;Semantic temperature sampling&lt;/h3&gt;
&lt;p&gt;
Which of these two sampling methods is correct? Both have natural interpretations, but they give completely different results. In some cases, the latter two-stage sampling method may be more appropriate, and we define it formally here. Given two temperatures &lt;eqn&gt;\tau_1&lt;/eqn&gt; and &lt;eqn&gt;\tau_2&lt;/eqn&gt;, and a semantic partition &lt;eqn&gt;\phi: \text{words} \in [1 .. N] \rightarrow \text{categories} \in [1 .. k]&lt;/eqn&gt;, we define the semantic freezing function &lt;eqn&gt;h_{\tau_1, \tau_2, \phi}&lt;/eqn&gt; as follows:

&lt;div style=&quot;margin: 0 auto; width: 330px; background-color: white;&quot;&gt;
&lt;div style=&quot;text-align: left; padding: 10px;&quot;&gt;
&lt;eqn&gt;q_j = \sum_i{p_i * \mathbb{1}\{\phi(i) == j\}}&lt;/eqn&gt;
&lt;br /&gt;
&lt;eqn&gt;\tilde{q_j} = f_{\tau_1}(q)_j&lt;/eqn&gt;
&lt;br /&gt;
&lt;eqn&gt;r^{j}_{i} = Pr[i | \text{category} = j]= \frac{p_i * \mathbb{1}\{\phi(i) == j\}}{q_j}&lt;/eqn&gt;
&lt;br /&gt;
&lt;eqn&gt;\tilde{r}^j_i = f_{\tau_2}(r^j)_i&lt;/eqn&gt;
&lt;br /&gt;
&lt;eqn&gt;\tilde{p_i} = h(p)_i = \tilde{q}_{\phi(i)} * \tilde{r}^{\phi(i)}_i&lt;/eqn&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
That is, we partition our vocabulary into &lt;eqn&gt;k&lt;/eqn&gt; semantic categories. At each output step, we compute the probability that the output word is in each category, and sample from this distribution with temperature &lt;eqn&gt;\tau_1&lt;/eqn&gt;. Once a category is decided, we sample among words within that category with temperature &lt;eqn&gt;\tau_2&lt;/eqn&gt;. Note that semantic temperature sampling generalizes classical temperature sampling as we may choose to use only a single semantic category and let &lt;eqn&gt;\tau_2 = \tau&lt;/eqn&gt; (i.e. &lt;eqn&gt;f_\tau(p) = h_{1, \tau, \text{lambda }i : 1}(p)&lt;/eqn&gt;). Alternatively, we may also choose &lt;eqn&gt;N&lt;/eqn&gt; discrete categories, one for each word, and let &lt;eqn&gt;\tau_1 = \tau&lt;/eqn&gt; (i.e. &lt;eqn&gt;f_{\tau}(p) = h_{\tau, 1, \text{lambda }i: i}(p)&lt;/eqn&gt;).
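Putting the definition above into code, a self-contained pure-Python sketch (helper names mine) looks like this; running it on the day-of-week example reproduces the two-stage numbers:

```python
def freeze(p, tau):
    # The classical freezing function: power transform plus renormalization.
    powered = [x ** (1.0 / tau) for x in p]
    total = sum(powered)
    return [x / total for x in powered]

def semantic_freeze(p, tau1, tau2, phi):
    # phi maps a word index to a category index in [0 .. k-1].
    n = len(p)
    k = max(phi(i) for i in range(n)) + 1
    # Category-level probabilities q_j, frozen with tau1.
    q = [0.0] * k
    for i in range(n):
        q[phi(i)] += p[i]
    q_tilde = freeze(q, tau1)
    # Within-category distributions r^j, frozen with tau2, then recombined.
    out = [0.0] * n
    for j in range(k):
        idx = [i for i in range(n) if phi(i) == j]
        r_tilde = freeze([p[i] / q[j] for i in idx], tau2)
        for i, ri in zip(idx, r_tilde):
            out[i] = q_tilde[j] * ri
    return out

# Day-of-week example: index 0 = "I don't know", indices 1-7 = the days.
p = [0.3] + [0.1] * 7
phi = lambda i: 0 if i == 0 else 1
tilde = semantic_freeze(p, 0.5, 1.0, phi)
print(round(tilde[0], 4))   # "I don't know": 0.1552
print(round(tilde[1], 4))   # each specific day: 0.1207
```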
&lt;br /&gt;
&lt;br /&gt;
Returning to our original example, what kind of output do we get from semantic temperature sampling? We define &lt;eqn&gt;\phi&lt;/eqn&gt; by running k-means with 100 categories on the word vector projection matrix, and then sample as:

&lt;h4&gt;&lt;eqn&gt;\tau_1 = 0.5, \tau_2=1.0&lt;/eqn&gt;&lt;/h4&gt;
&lt;ul&gt;
The vague tone of the Seattle business has been first to be offset by the Oct. N massacre in the state. The president said that when the bank leaves economic development of a foreign contractor, it is not offered to find a major degree for the market. The Miller Metal Co. unit of the national airline, which publishes the caribbean and its latest trading network, will be the first time since the new company has completed the sale of New York City Bank in Pittsburgh.
&lt;/ul&gt;

In the above, we're heavily weighting the most likely categories, but then backing off and sampling less aggressively with &lt;eqn&gt;\tau_2=1.0&lt;/eqn&gt; within a category. It's not clear that this output is &lt;i&gt;better&lt;/i&gt; than any that could be achieved with traditional temperature sampling. Achieving lower perplexity on the Penn Treebank would be more impactful to that end. But we do see qualitative changes in the output when turning the new &lt;eqn&gt;\tau_1&lt;/eqn&gt; knob. The sampling regime above focuses on the stock market semantics so frequently found in the Wall Street Journal without overusing individual terms like &quot;company&quot; and &quot;New York Stock Exchange&quot; as in the original example.

&lt;h3&gt;Maximum likelihood decoding&lt;/h3&gt;
Armed with the tool of semantic temperature sampling, we can make a few more interesting connections within the realm of RNN decoding. Consider the case where both &lt;eqn&gt;\tau_1 \rightarrow 0&lt;/eqn&gt; and &lt;eqn&gt;\tau_2 \rightarrow 0&lt;/eqn&gt;. This decoding scheme corresponds to first picking the most likely semantic category, and then picking the best way of expressing those semantics. If the scheme fully achieved this claim, it would be quite satisfying. But the semantic categories we're thinking of here are constrained to be at the word level, not the sentence level. In our day of the week example above, these happen to be identical, but that need not be the case in general. 
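The double-argmax limit can be sketched concretely (a hypothetical helper of my own; it picks the most likely category, then the most likely word inside it):

```python
def argmax_semantic_decode(p, phi):
    # tau1, tau2 -> 0 limit: argmax over categories, then argmax within one.
    n = len(p)
    k = max(phi(i) for i in range(n)) + 1
    q = [0.0] * k
    for i in range(n):
        q[phi(i)] += p[i]
    best_cat = max(range(k), key=lambda j: q[j])
    candidates = [i for i in range(n) if phi(i) == best_cat]
    return max(candidates, key=lambda i: p[i])

# Day-of-week example: the single most likely word is "I don't know" (index 0),
# but the most likely category is "I do know", so a day is chosen instead.
p = [0.3] + [0.1] * 7
phi = lambda i: 0 if i == 0 else 1
print(argmax_semantic_decode(p, phi))   # a day index (1-7), never 0
```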
&lt;br /&gt;
&lt;br /&gt;
If the semantics do need to take place at the sentence level, there is no clear path forward in the general case. While we may use k-means as an attempt at word level semantics, it's unclear what kind of systematic strategies could be used for sentence level clustering. One could try sentence vectors, but those are not directly available from the task at hand. The idea of a sampler that first 1) figures out what semantics to respond with and then 2) figures out how to express those semantics is a nice abstraction. But word-level semantic temperature sampling as defined above only gives us an approximation.
&lt;br /&gt;
&lt;br /&gt;
What then are we to do? LSTM language models trained end-to-end give us a beautiful abstraction; minimizing perplexity on the training set produces an optimal word sampler for free. But if we want a maximum likelihood decoder, we have to define semantics and we're in trouble. If we don't define semantics, we'll just implicitly be assuming that all words have their own independent semantic category [1]. In the simplest case where word level semantics suffice, we can provide a &lt;eqn&gt;\phi&lt;/eqn&gt; function to the semantic temperature sampler. Sadly though, the choice of &lt;eqn&gt;\phi&lt;/eqn&gt; is not quantitatively justified, as it is not encoded in the training loss. As a result, the very idea of maximum likelihood sampling from a perplexity-trained language model is still somewhat dubious.
&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;
Temperature sampling is a standard technique for improving the quality of samples from language models. But temperature sampling also introduces semantic distortions in the process. We explored these distortions in the context of a simple grammar, and introduced semantic temperature sampling as a method to control them through the semantic function &lt;eqn&gt;\phi&lt;/eqn&gt;. We fall short of defining a meaningful objective function over which to compare different sampling regimes, and punt on a metric for comparing choices of &lt;eqn&gt;\phi&lt;/eqn&gt;.
&lt;br /&gt;
&lt;br /&gt;
Humans can disambiguate the advantages of varied sampling schemes because their conversational responses are ultimately derived from the evolutionary advantages of strong communication. Such an evolutionary pressure would likewise provide a principled objective function for machine conversational semantics in the general case.
&lt;/p&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;Thanks to &lt;a href=&quot;http://nlp.stanford.edu/manning/&quot;&gt;Chris Manning&lt;/a&gt; for advising this research. Thanks to 
    &lt;a href=&quot;http://web.stanford.edu/~jiweil/&quot;&gt;Jiwei Li&lt;/a&gt;,
    &lt;a href=&quot;http://www.cs.stanford.edu/~lmthang/&quot;&gt;Thang Luong&lt;/a&gt;,
    &lt;a href=&quot;http://cs.stanford.edu/people/karpathy/&quot;&gt;Andrej Karpathy&lt;/a&gt;,
    &lt;a href=&quot;http://www.cs.stanford.edu/~tachim/&quot;&gt;Tudor Achim&lt;/a&gt;,
    and &lt;a href=&quot;http://stanford.edu/~ankitk/&quot;&gt;Ankit Kumar&lt;/a&gt; for providing insightful feedback.&lt;/p&gt;
&lt;h3&gt;Notes&lt;/h3&gt;
&lt;p&gt;[1] Imagine inserting the alias &quot;zn&quot; for the word &quot;an&quot; throughout the corpus in 50% of &quot;an&quot; instances. How would this impact a maximum likelihood decoder trained on that corpus? Hint: how would this affect the ability of the &quot;an&quot; token to compete with the &quot;a&quot; token in the maximum likelihood sense? This one simple change could significantly decrease the presence of all words beginning with vowels in our samples.&lt;/p&gt;
&lt;script&gt;
  var eqns = document.getElementsByTagName('eqn');
  for (var i = 0; i != eqns.length; i++) {
    katex.render(eqns[i].innerHTML, eqns[i]);
  }
&lt;/script&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">WikiTableQuestions: a Complex Real-World Question Understanding Dataset</title>
    <id>https://nlp.stanford.edu/blog/wikitablequestions-a-complex-real-world-question-understanding-dataset</id>
    <updated>2016-02-11T15:15:05Z</updated>
    <link href="https://nlp.stanford.edu/blog/wikitablequestions-a-complex-real-world-question-understanding-dataset" />
    <author>
      <name>&lt;a href=&quot;http://cs.stanford.edu/~ppasupat/&quot;&gt;Ice Pasupat&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;&lt;img alt=&quot;Task: Learn to produce an answer y to a given question x according to a given table t&quot; src=&quot;http://nlp.stanford.edu/software/sempre/wikitable/images/task.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Natural language question understanding&lt;/strong&gt; has been one of the most important challenges in artificial intelligence. Indeed, eminent AI benchmarks such as the Turing test require an AI system to understand natural language questions, with various topics and complexity, and then respond appropriately. During the past few years, we have witnessed rapid progress in question answering technology, with virtual assistants like Siri, Google Now, and Cortana answering daily life questions, and IBM Watson beating human champions in &lt;em&gt;Jeopardy!&lt;/em&gt;. However, even the best question answering systems today still face &lt;em&gt;two main challenges&lt;/em&gt; that have to be solved &lt;strong&gt;simultaneously&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Question complexity (depth).&lt;/strong&gt; Many questions the systems encounter are simple lookup questions (e.g., &quot;Where is Chichen Itza?&quot; or &quot;Who's the manager of Man Utd?&quot;). The answers can be found by searching the &lt;a href=&quot;https://en.wikipedia.org/wiki/Chichen_Itza#Location&quot;&gt;surface&lt;/a&gt; &lt;a href=&quot;http://www.manutd.com/en/Players-And-Staff/Managers/&quot;&gt;forms&lt;/a&gt;. But occasionally users will want to ask questions that require multiple, non-trivial steps to answer (e.g., &quot;What's the cheapest bus to Chichen Itza leaving tomorrow?&quot; or &quot;How many times did Manchester United reach the final round of Premier League while Ferguson was the manager?&quot;). These questions require deeper understanding and cannot be answered just by retrieval.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Domain size (breadth). &lt;/strong&gt;Many systems are trained or engineered to work very well in a few specific domains such as managing calendar schedules or finding restaurants. Developing a system to handle questions in any topic from local weather to global military conflicts, however, is much more difficult.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While most systems can handle questions with either depth or breadth alone (e.g., by handling complex questions in a few domains and falling back to web search on the rest), they often struggle on ones that require both. To this end, we have decided to create a new dataset, &lt;strong&gt;WikiTableQuestions&lt;/strong&gt;, that addresses both challenges at the same time.&lt;/p&gt;
&lt;h1&gt;Task and Dataset&lt;/h1&gt;
&lt;p&gt;In the WikiTableQuestions dataset, each question comes with a table from Wikipedia. &lt;strong&gt;Given the question and the table, the task is to answer the question based on the table.&lt;/strong&gt; The dataset contains 2108 tables from a large variety of topics (&lt;strong&gt;more breadth&lt;/strong&gt;) and 22033 questions with different complexity (&lt;strong&gt;more depth&lt;/strong&gt;). Tables in the test set do not appear in the training set, so a system must be able to generalize to unseen tables.&lt;/p&gt;
&lt;p&gt;The dataset can be accessed from the &lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/&quot;&gt;project page&lt;/a&gt; or on &lt;a href=&quot;https://worksheets.codalab.org/bundles/0x38a4fe7a21bc4992a5d3725df48c0b28/&quot;&gt;CodaLab&lt;/a&gt;. The training set can also be browsed &lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/&quot;&gt;online&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We now give some examples that demonstrate the challenges of the dataset. Consider the following &lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/#204-622&quot;&gt;table&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Table from https://en.wikipedia.org/wiki/Piotr_K%C4%99dzia&quot; src=&quot;http://nlp.stanford.edu/software/sempre/wikitable/images/table.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The question is &lt;strong&gt;&quot;In what city did Piotr's last 1st place finish occur?&quot;&lt;/strong&gt; In order to answer the question, one might perform the following steps:&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Steps for finding the answer&quot; src=&quot;http://nlp.stanford.edu/software/sempre/wikitable/images/steps.gif&quot; /&gt;&lt;/p&gt;
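The steps above can be sketched as table operations in code. Here is a hedged pure-Python illustration on a hypothetical mini-table (rows and column names invented for this sketch, not taken from the actual Wikipedia table):

```python
# Toy table: each row is a dict, mirroring (Year, Venue, Position) columns.
table = [
    {"Year": 2001, "Venue": "Edmonton", "Position": "2nd"},
    {"Year": 2003, "Venue": "Paris",    "Position": "1st"},
    {"Year": 2005, "Venue": "Helsinki", "Position": "4th"},
    {"Year": 2007, "Venue": "Osaka",    "Position": "1st"},
]
first_places = [row for row in table if row["Position"] == "1st"]  # filter rows
last = first_places[-1]                                            # take the last one
print(last["Venue"])                                               # read off the city
```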
&lt;p&gt;With this example, we can observe several challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Schema mapping.&lt;/strong&gt; One fundamental challenge when working with messy real-world data is handling diverse and possibly unseen data schemas. In this case, the system must know that the word &quot;place&quot; refers to the &quot;Position&quot; column while the word &quot;city&quot; refers to the &quot;Venue&quot; column, even if the same table schema has not been observed before during training.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compositionality.&lt;/strong&gt; Natural language can express complex ideas thanks to the principle of compositionality: the ability to compose smaller phrases into bigger ones. Small phrases could correspond to different operations (e.g., locating the last item), which can be composed to get the final answer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Variety of operations.&lt;/strong&gt; To fully utilize a rich data source, it is essential to be able to perform different operations such as filtering data (&quot;1st place&quot;, &quot;in 1990&quot;), pinpointing data (&quot;the longest&quot;, &quot;the first&quot;), computing statistics (&quot;total&quot;, &quot;average&quot;, &quot;how many&quot;), and comparing quantities (&quot;difference between&quot;, &quot;at least 10&quot;). The WikiTableQuestions dataset contains questions with a large variety of operations, some of which can be observed in other questions for the &lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/#204-622&quot;&gt;table&lt;/a&gt; above:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;what was piotr's total number of 3rd place finishes?&lt;/li&gt;
&lt;li&gt;which competition did this competitor compete in next after the world indoor championships in 2008?&lt;/li&gt;
&lt;li&gt;how long did it take piotr to run the medley relay in 2001?&lt;/li&gt;
&lt;li&gt;which 4x400 was faster, 2005 or 2003?&lt;/li&gt;
&lt;li&gt;how many times has this competitor placed 5th or better in competition?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Common sense reasoning.&lt;/strong&gt; Finally, one of the most challenging aspects of natural language is that the meaning of some phrases must be inferred using the context and common sense. For instance, the word &quot;better&quot; in the last example (... placed 5th or better …) means &quot;Position ≤ 5&quot;, but in &quot;scored 5 or better&quot; it means &quot;Score ≥ 5&quot;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here are some other examples (cherry-picked from the first 50 examples) that show the variety of operations and topics of our dataset:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/#203-705&quot;&gt;how many people stayed at least 3 years in office?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/#203-116&quot;&gt;which players played the same position as ardo kreek?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/#204-475&quot;&gt;in how many games did the winning team score more than 4 points?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/#203-36&quot;&gt;what's the number of parishes founded in the 1800s?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/#203-95&quot;&gt;in 1996 the sc house of representatives had a republican majority. how many years had passed since the last time this happened?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://nlp.stanford.edu/software/sempre/wikitable/viewer/#204-920&quot;&gt;how many consecutive friendly competitions did chalupny score in?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;h1&gt;Comparison to Existing QA Datasets&lt;/h1&gt;
&lt;p&gt;Most QA datasets address only either breadth (domain size) or depth (question complexity). Early semantic parsing datasets such as &lt;a href=&quot;http://www.cs.utexas.edu/users/ml/nldata/geoquery.html&quot;&gt;GeoQuery&lt;/a&gt; and &lt;a href=&quot;https://catalog.ldc.upenn.edu/LDC94S19&quot;&gt;ATIS&lt;/a&gt; contain complex sentences (&lt;strong&gt;high depth&lt;/strong&gt;) in a focused domain (&lt;strong&gt;low breadth&lt;/strong&gt;). Here are some examples from GeoQuery, which contains questions on a US geography database:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;how many states border texas?&lt;/li&gt;
&lt;li&gt;what states border texas and have a major river?&lt;/li&gt;
&lt;li&gt;what is the total population of the states that border texas?&lt;/li&gt;
&lt;li&gt;what states border states that border states that border states that border texas?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;More recently, Facebook released the &lt;a href=&quot;https://research.facebook.com/researchers/1543934539189348&quot;&gt;bAbI&lt;/a&gt; dataset featuring 20 types of automatically generated questions with different complexity on simulated worlds. Here is an example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;John picked up the apple.&lt;br /&gt;
John went to the office.&lt;br /&gt;
John went to the kitchen.&lt;br /&gt;
John dropped the apple.&lt;br /&gt;
&lt;strong&gt;Question:&lt;/strong&gt; Where was the apple before the kitchen?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In contrast, many QA datasets contain questions spanning a variety of topics (&lt;strong&gt;high breadth&lt;/strong&gt;), but the questions are much simpler or retrieval-based (&lt;strong&gt;low depth&lt;/strong&gt;). For example, &lt;a href=&quot;http://www-nlp.stanford.edu/software/sempre/&quot;&gt;WebQuestions&lt;/a&gt; dataset contains factoid questions that can be answered using a structured knowledge base. Here are some examples:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;what is the name of justin bieber brother?&lt;/li&gt;
&lt;li&gt;what character did natalie portman play in star wars?&lt;/li&gt;
&lt;li&gt;where donald trump went to college?&lt;/li&gt;
&lt;li&gt;what countries around the world speak french?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Other knowledge base QA datasets include &lt;a href=&quot;http://knight.cis.temple.edu/~yates/open-sem-parsing/index.html&quot;&gt;Free917&lt;/a&gt; (also on Freebase) and &lt;a href=&quot;http://qald.sebastianwalter.org/&quot;&gt;QALD&lt;/a&gt; (on both knowledge bases and unstructured data).&lt;/p&gt;
&lt;p&gt;QA datasets that focus on information retrieval and answer selection (such as &lt;a href=&quot;http://trec.nist.gov/data/qamain.html&quot;&gt;TREC&lt;/a&gt;, &lt;a href=&quot;http://research.microsoft.com/apps/pubs/?id=252176&quot;&gt;WikiQA&lt;/a&gt;, &lt;a href=&quot;https://cs.umd.edu/~miyyer/qblearn/&quot;&gt;QANTA Quiz Bowl&lt;/a&gt;, and many &lt;a href=&quot;http://qr.ae/ROVWaK&quot;&gt;Jeopardy!&lt;/a&gt; questions) are also of this kind: while some questions in these datasets look complex, the answers can be mostly inferred by working with the surface form. Here is an example from QANTA Quiz Bowl dataset:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With the assistence of his chief minister, the Duc de Sully, he lowered taxes on peasantry, promoted economic recovery, and instituted a tax on the Paulette. Victor at Ivry and Arquet, he was excluded from succession by the Treaty of Nemours, but won a great victory at Coutras. His excommunication was lifted by Clement VIII, but that pope later claimed to be crucified when this monarch promulgated the Edict of Nantes. For 10 points, name this French king, the first Bourbon who admitted that &quot;Paris is worth a mass&quot; when he converted following the War of the Three Henrys.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Finally, there are several datasets that address both breadth and depth but in a different angle. For example, &lt;a href=&quot;http://qald.sebastianwalter.org/index.php?x=challenge&amp;amp;q=6&quot;&gt;QALD Hybrid QA&lt;/a&gt; requires the system to combine information from multiple data sources, and in &lt;a href=&quot;http://allenai.org/aristo/&quot;&gt;AI2 Science Exam Questions&lt;/a&gt; and &lt;a href=&quot;http://21robot.org/&quot;&gt;Todai Robot University Entrance Questions&lt;/a&gt;, the system has to perform common sense reasoning and logical inference on a large volume of knowledge to derive the answers.&lt;/p&gt;
&lt;h1&gt;State of the Art on WikiTableQuestions Dataset&lt;/h1&gt;
&lt;p&gt;In our paper, we present a semantic parsing system which learns to construct formal queries (&quot;logical forms&quot;) that can be executed on the tables to get the answers.&lt;/p&gt;
&lt;p&gt;&lt;img alt='Semantic parse and logical form of the utterance &quot;In what city did Piotr&amp;squot;s last 1st place finish occur?&quot;' src=&quot;http://nlp.stanford.edu/software/sempre/wikitable/images/piotr.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The system learns a statistical model that builds logical forms in a hierarchical fashion (&lt;strong&gt;more depth&lt;/strong&gt;) using parts that can be freely constructed from any table schema (&lt;strong&gt;more breadth&lt;/strong&gt;). The system achieves a test accuracy of 37.1%, which is higher than that of the previous semantic parsing system and an information retrieval baseline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We encourage everyone to play with the dataset, develop systems to tackle the challenges, and advance the field of natural language understanding!&lt;/strong&gt; For suggestions and comments on the dataset, please contact the author &lt;a href=&quot;http://cs.stanford.edu/~ppasupat/&quot;&gt;Ice Pasupat&lt;/a&gt;.&lt;/p&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">The Stanford NLI Corpus Revisited</title>
    <id>https://nlp.stanford.edu/blog/the-stanford-nli-corpus-revisited</id>
    <updated>2016-01-25T14:28:42Z</updated>
    <link href="https://nlp.stanford.edu/blog/the-stanford-nli-corpus-revisited" />
    <author>
      <name>&lt;a href=&quot;http://stanford.edu/~sbowman/&quot;&gt;Sam Bowman&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://nlp.stanford.edu/~sbowman/rbg_lunch.png&quot; title=&quot;An example from SNLI with photos.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Last September at EMNLP 2015, we released the &lt;a href=&quot;http://nlp.stanford.edu/projects/snli/&quot;&gt;Stanford Natural Language Inference (SNLI) Corpus&lt;/a&gt;. We're still excitedly working to build bigger and better machine learning models to use it to its full potential, and we sense that we're not alone, so we're using the launch of the lab's new website to share a bit of what we've learned about the corpus over the last few months.&lt;/p&gt;
&lt;h2&gt;What is SNLI?&lt;/h2&gt;
&lt;p&gt;SNLI is a collection of about half a million natural language inference (NLI) problems. Each problem is a pair of sentences, a premise and a hypothesis, labeled (by hand) with one of three labels: &lt;em&gt;entailment&lt;/em&gt;, &lt;em&gt;contradiction&lt;/em&gt;, or &lt;em&gt;neutral&lt;/em&gt;. An NLI model is a model that attempts to infer the correct label based on the two sentences.&lt;/p&gt;
&lt;p&gt;Here's a typical example randomly chosen from the development set:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Premise:&lt;/strong&gt; A man inspects the uniform of a figure in some East Asian country. &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hypothesis:&lt;/strong&gt; The man is sleeping.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Label&lt;/strong&gt;: &lt;em&gt;contradiction&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
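In code, an NLI problem is just a labeled sentence pair. A minimal representation (field names are mine, not SNLI's distribution format) might look like:

```python
from dataclasses import dataclass

@dataclass
class NLIExample:
    premise: str      # the given sentence
    hypothesis: str   # the sentence to judge against the premise
    label: str        # "entailment", "contradiction", or "neutral"

ex = NLIExample(
    premise="A man inspects the uniform of a figure in some East Asian country.",
    hypothesis="The man is sleeping.",
    label="contradiction",
)
print(ex.label)   # contradiction
```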
&lt;p&gt;The sentences in SNLI are all descriptions of scenes, and photo captions played a large role in data collection. This made it easy for us to collect reliable judgments from untrained annotators, and allowed us to solve the surprisingly difficult problem of coming up with a logically consistent definition of &lt;em&gt;contradiction&lt;/em&gt;, so it's what made the huge size of the corpus possible. However, using only that genre of text means that there are several important linguistic phenomena that don't show up in SNLI—things like tense and timeline reasoning or opinions and beliefs. We are interested in going back to collect another inference corpus that goes beyond just single scenes, so stay tuned.&lt;/p&gt;
&lt;h2&gt;What can I do with it?&lt;/h2&gt;
&lt;p&gt;We created SNLI with the goal of making the first high quality NLI dataset large enough to be able to serve as the sole training data set for low-bias machine learning models like neural networks. There are plenty of things one can do with it, but we think it's especially valuable for three things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Training practical NLI systems:&lt;/strong&gt; NLI is a major open problem in NLP, and many approaches to applied tasks like summarization, information retrieval, and question answering rely on high-quality NLI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Corpus semantics:&lt;/strong&gt; SNLI is unusual among corpora for natural language understanding tasks in that it was annotated by non-experts without any annotation manual, such that its labels reflect the intuitive judgments of the annotators about what each sentence means. This makes it well suited for work in quantitative corpus linguistics, and makes it one of few corpora that allow researchers in Linguistics to apply corpus methods to questions about what sentences mean, rather than just what kinds of sentences people use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluating sentence encoding models:&lt;/strong&gt; There has been a great deal of recent research on how best to build supervised neural network models that extract vector representations of sentences that capture their meanings. Since SNLI is large enough to serve as a training set for such models, and since modeling NLI within a neural network requires highly informative meaning representations (more so than previous focus tasks like sentiment analysis), we think that SNLI is especially well suited to be a target evaluation task for this kind of research.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What does it look like?&lt;/h2&gt;
&lt;p&gt;If you simply want to browse the corpus, the &lt;a href=&quot;http://nlp.stanford.edu/projects/snli/&quot;&gt;corpus page&lt;/a&gt; contains several examples and a download link. If you want to see the basic key statistics about the size of the corpus and how it was annotated, the &lt;a href=&quot;http://nlp.stanford.edu/pubs/snli_paper.pdf&quot;&gt;corpus paper&lt;/a&gt; has that information. For this post, we thought it would be helpful to do a quick quantitative breakdown of what kinds of phenomena tend to show up in the corpus.&lt;/p&gt;
&lt;p&gt;In particular, we tagged 100 randomly sampled sentence pairs from the test set by hand with labels denoting a handful of phenomena that we found interesting. These phenomena are not mutually exclusive, and the count of each phenomenon can be treated as a very rough estimate of its frequency in the overall corpus.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Full sentences and bare noun phrases:&lt;/strong&gt; SNLI is a mix of full sentences (&lt;em&gt;There is a duck&lt;/em&gt;) and bare noun phrases (&lt;em&gt;A duck in a pond&lt;/em&gt;). Using the labels from the Stanford parser, we found that full sentences are more common, and that noun phrases mostly occur in pairs with full sentences.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sentence–sentence pairs: &lt;strong&gt;71&lt;/strong&gt; (23 &lt;em&gt;ent.&lt;/em&gt;, 28 &lt;em&gt;neut.&lt;/em&gt;, 20 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Sentence–bare NP pairs (either order): &lt;strong&gt;27&lt;/strong&gt; (10 &lt;em&gt;ent.&lt;/em&gt;, 9 &lt;em&gt;neut.&lt;/em&gt;, 8 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Bare NP–bare NP pairs: &lt;strong&gt;3&lt;/strong&gt; (0 &lt;em&gt;ent.&lt;/em&gt;, 2 &lt;em&gt;neut.&lt;/em&gt;, 1 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Insertions:&lt;/strong&gt; One strategy for creating pairs that turned out to be especially popular among annotators trying to create &lt;em&gt;neutral&lt;/em&gt; pairs is to create a hypothesis that mostly draws text from the premise, but that adds a prepositional phrase (&lt;em&gt;There is a duck&lt;/em&gt; to &lt;em&gt;There is a duck &lt;strong&gt;in a pond&lt;/strong&gt;&lt;/em&gt;) or an adjective or adverb (&lt;em&gt;There is a duck&lt;/em&gt; to &lt;em&gt;There is a &lt;strong&gt;large&lt;/strong&gt; duck&lt;/em&gt;).&lt;/p&gt;
&lt;ul&gt;

&lt;li&gt;Insertions of a restrictive PP: &lt;strong&gt;4&lt;/strong&gt; (0 &lt;em&gt;ent.&lt;/em&gt;, 4 &lt;em&gt;neut.&lt;/em&gt;, 0 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Insertions of a restrictive adjective or adverb: &lt;strong&gt;5&lt;/strong&gt; (1 &lt;em&gt;ent.&lt;/em&gt;, 4 &lt;em&gt;neut.&lt;/em&gt;, 0 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lexical relations:&lt;/strong&gt; One of the key building blocks for logical inference systems like those studied in natural logic is the ability to reason about relationships like entailment or contradiction between individual words. In many examples of sentence-level entailment, this kind of reasoning makes up a substantial part of the problem, as in &lt;em&gt;There is a duck by the pond&lt;/em&gt;–&lt;em&gt;There is a bird near water&lt;/em&gt;. We measured the frequency of this phenomenon by counting the number of examples in which a pair of words falling into an entailment or contradiction relationship (in either direction) could be reasonably aligned between the premise and the hypothesis.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Aligned lexical entailment or contradiction pairs: &lt;strong&gt;28&lt;/strong&gt; (5 &lt;em&gt;ent.&lt;/em&gt;, 11 &lt;em&gt;neut.&lt;/em&gt;, 12 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
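&lt;p&gt;As a toy illustration of this kind of word-level check, one might pair a small relation lexicon with a naive all-pairs alignment. A real system would consult a resource like WordNet; the hand-built dictionaries here are for illustration only:&lt;/p&gt;

```python
# Toy relation lexicon, hand-built for illustration only; a real
# system would consult a resource like WordNet.
HYPERNYMS = {"duck": {"bird", "animal"}, "pond": {"water"}}
ANTONYMS = {"asleep": {"awake"}}

def aligned_relation(premise_tokens, hypothesis_tokens):
    """Return (label, premise word, hypothesis word) for the first aligned
    lexical entailment or contradiction found, else None."""
    for p in premise_tokens:
        for h in hypothesis_tokens:
            if h in HYPERNYMS.get(p, ()):
                return ("entailment", p, h)
            if h in ANTONYMS.get(p, ()):
                return ("contradiction", p, h)
    return None
```

&lt;p&gt;On the &lt;em&gt;duck&lt;/em&gt;/&lt;em&gt;bird&lt;/em&gt; example above, this finds the aligned entailment pair.&lt;/p&gt;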
&lt;p&gt;&lt;strong&gt;Commonsense world knowledge:&lt;/strong&gt; Unlike in the earlier FraCaS entailment data, SNLI contains many examples that can be difficult to judge without access to contingent facts about the world that go beyond lexical relationships, as in examples like &lt;em&gt;A girl makes a snow angel&lt;/em&gt;–&lt;em&gt;A girl is playing in snow&lt;/em&gt;, where it is necessary to know that snow angels are made by playing in snow.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Inferences requiring commonsense world knowledge: &lt;strong&gt;47&lt;/strong&gt; (17 &lt;em&gt;ent.&lt;/em&gt;, 18 &lt;em&gt;neut.&lt;/em&gt;, 12 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Multi-word expressions:&lt;/strong&gt; Multi-word expressions with non-compositional meanings (or, loosely speaking, &lt;em&gt;idioms&lt;/em&gt;) complicate the construction and evaluation of models like RNNs that take words as input. &lt;a href=&quot;http://clic.cimec.unitn.it/composes/sick.html&quot;&gt;SICK&lt;/a&gt;, the earlier dataset that inspired our work, explicitly excludes any such multi-word expressions. We did not find them to be especially common.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sentence pairs containing non-compositional multi-word expressions: &lt;strong&gt;2&lt;/strong&gt; (1 &lt;em&gt;ent.&lt;/em&gt;, 1 &lt;em&gt;neut.&lt;/em&gt;, 0 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Pronoun coreference/anaphora:&lt;/strong&gt; Reference (or anaphora) from a pronoun in the hypothesis to an expression in the premise, as in examples like &lt;em&gt;&lt;strong&gt;The duck&lt;/strong&gt; was swimming&lt;/em&gt;–&lt;em&gt;&lt;strong&gt;It&lt;/strong&gt; was in the water&lt;/em&gt;, can create additional complexity for inference systems, especially when there are multiple possible referents. We found only a handful of such cases.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Instances of pronoun coreference: &lt;strong&gt;3&lt;/strong&gt; (0 &lt;em&gt;ent.&lt;/em&gt;, 2 &lt;em&gt;neut.&lt;/em&gt;, 1 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Negation:&lt;/strong&gt; One simple way to create a hypothesis that contradicts some premise is to copy the premise and add any of several kinds of negation, as in &lt;em&gt;There is a duck&lt;/em&gt;–&lt;em&gt;There is &lt;strong&gt;not&lt;/strong&gt; a duck&lt;/em&gt;. This approach to creating contradictions is extremely easy to detect, and was somewhat common in the &lt;a href=&quot;http://clic.cimec.unitn.it/composes/sick.html&quot;&gt;SICK&lt;/a&gt; entailment corpus. We measured the frequency of this phenomenon by counting the number of sentence pairs in which the hypothesis and the premise can be at least loosely aligned and the hypothesis uses some kind of negation in a position that does not align to any negation in the premise.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Insertions of negation: &lt;strong&gt;1&lt;/strong&gt; (0 &lt;em&gt;ent.&lt;/em&gt;, 0 &lt;em&gt;neut.&lt;/em&gt;, 1 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
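&lt;p&gt;A loose, position-free version of this negation count can be sketched in a few lines; since it skips the alignment step, it will overcount slightly:&lt;/p&gt;

```python
# A small, non-exhaustive list of English negation markers.
NEGATIONS = {"not", "n't", "no", "never", "nobody", "nothing", "none", "neither"}

def inserted_negation(premise_tokens, hypothesis_tokens):
    """True if the hypothesis uses a negation word with no counterpart
    anywhere in the premise (a loose, position-free check)."""
    prem_negs = NEGATIONS.intersection(premise_tokens)
    hyp_negs = NEGATIONS.intersection(hypothesis_tokens)
    return bool(hyp_negs - prem_negs)
```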
&lt;p&gt;&lt;strong&gt;Common templates:&lt;/strong&gt; Besides what came up above, two other common techniques that annotators used to build sentence pairs were to come up with a complete non sequitur (usually marked &lt;em&gt;contradiction&lt;/em&gt;) or to pick out one entity from the premise and compose a sentence of the form &lt;em&gt;there {is, are} X&lt;/em&gt;. Together, these two templates make up a few percent of the corpus.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Non-sequitur/unrelated sentence pairs: &lt;strong&gt;2&lt;/strong&gt; (0 &lt;em&gt;ent.&lt;/em&gt;, 0 &lt;em&gt;neut.&lt;/em&gt;, 2 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;&quot;There {is, are} X&quot; hypotheses: &lt;strong&gt;3&lt;/strong&gt; (3 &lt;em&gt;ent.&lt;/em&gt;, 0 &lt;em&gt;neut.&lt;/em&gt;, 0 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
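&lt;p&gt;The second template is simple enough to catch with a regular expression; a sketch:&lt;/p&gt;

```python
import re

# Matches hypotheses of the form "There {is, are} X".
THERE_BE = re.compile(r"^there (is|are)\b", re.IGNORECASE)

def is_there_be_template(hypothesis):
    return bool(THERE_BE.match(hypothesis.strip()))
```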
&lt;p&gt;&lt;strong&gt;Mistakes:&lt;/strong&gt; The corpus wasn't edited for spelling or grammar, so it contains occasional typos and grammatical errors.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Examples with a single-word typo in either sentence: &lt;strong&gt;3&lt;/strong&gt; (0 &lt;em&gt;ent.&lt;/em&gt;, 3 &lt;em&gt;neut.&lt;/em&gt;, 0 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Examples with a grammatical error or nonstandard grammar in either sentence: &lt;strong&gt;9&lt;/strong&gt; (3 &lt;em&gt;ent.&lt;/em&gt;, 4 &lt;em&gt;neut.&lt;/em&gt;, 2 &lt;em&gt;contr.&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What is the state of the art right now?&lt;/h2&gt;
&lt;p&gt;Several papers evaluating models on SNLI have been released in recent months (we learned of most of them through Google Scholar), and we've collected all of the papers we're aware of on the &lt;a href=&quot;http://nlp.stanford.edu/projects/snli/&quot;&gt;corpus page&lt;/a&gt;. The overall state of the art right now is 86.1% classification accuracy from &lt;a href=&quot;http://arxiv.org/pdf/1512.08849v1.pdf&quot;&gt;Shuohang Wang and Jing Jiang&lt;/a&gt; at Singapore Management University, using a clever variant of a sequence-to-sequence neural network model with soft attention. &lt;a href=&quot;http://arxiv.org/pdf/1512.08422.pdf&quot;&gt;Lili Mou et al.&lt;/a&gt; at Peking University and Baidu Beijing deserve an honorable mention for creating the most effective model that reasons over a single fixed-size vector representation for each sentence, rather than constructing word-by-word alignments as with attention. They reach 82.1% accuracy. Two other papers on the corpus page offer their own insights about NLI modeling with neural networks, so have a look there before setting off on your own with the corpus.&lt;/p&gt;
&lt;p&gt;Google's Mat Kelcey has some simple experiments on SNLI posted as well &lt;a href=&quot;https://github.com/matpalm/snli_nn&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://github.com/matpalm/snli_nn_tf&quot;&gt;here&lt;/a&gt;. While these experiments don't reach the state of the art, they include Theano and TensorFlow code, and so may be a useful starting point for those building their own models.&lt;/p&gt;</content>
  </entry>
  <entry xml:base="https://nlp.stanford.edu/blog/recent.atom">
    <title type="text">Welcome to the new Stanford NLP Research Blog</title>
    <id>https://nlp.stanford.edu/blog/welcome-to-the-new-stanford-nlp-research-blog</id>
    <updated>2016-01-14T14:29:13Z</updated>
    <link href="https://nlp.stanford.edu/blog/welcome-to-the-new-stanford-nlp-research-blog" />
    <author>
      <name>&lt;a href=&quot;http://nlp.stanford.edu/robvoigt&quot;&gt;Rob Voigt&lt;/a&gt;</name>
    </author>
    <content type="html">&lt;p&gt;This page will hold the research blog for the Stanford Natural Language Processing group. Here group members will post descriptions of their research, tutorials, and other interesting tidbits. No posts yet, but stay tuned!&lt;/p&gt;</content>
  </entry>
</feed>
