The post New paper: “A formal solution to the grain of truth problem” appeared first on Machine Intelligence Research Institute.
]]>Future of Humanity Institute Research Fellow Jan Leike and MIRI Research Fellows Jessica Taylor and Benya Fallenstein have just presented new results at UAI 2016 that resolve a longstanding open problem in game theory: “A formal solution to the grain of truth problem.”
Game theorists have techniques for specifying agents that eventually do well on iterated games against other agents, so long as their beliefs contain a “grain of truth” — nonzero prior probability assigned to the actual game they’re playing. Getting that grain of truth was previously an unsolved problem in multiplayer games, because agents can run into infinite regresses when they try to model agents that are modeling them in turn. This result shows how to break that loop: by means of reflective oracles.
In the process, Leike, Taylor, and Fallenstein provide a rigorous and general foundation for the study of multiagent dilemmas. This work provides a surprising and somewhat satisfying basis for approximate Nash equilibria in repeated games, folding a variety of problems in decision and game theory into a common framework.
The paper’s abstract reads:
A Bayesian agent acting in a multiagent environment learns to predict the other agents’ policies if its prior assigns positive probability to them (in other words, its prior contains a grain of truth). Finding a reasonably large class of policies that contains the Bayesoptimal policies with respect to this class is known as the grain of truth problem. Only small classes are known to have a grain of truth and the literature contains several related impossibility results.
In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of policies that contains all computable policies as well as Bayesoptimal policies for every lower semicomputable prior over the class. When the environment is unknown, Bayesoptimal agents may fail to act optimally even asymptotically. However, agents based on Thompson sampling converge to play εNash equilibria in arbitrary unknown computable multiagent environments. While these results are purely theoretical, we show that they can be computationally approximated arbitrarily closely.
Traditionally, when modeling computer programs that model the properties of other programs (such as when modeling an agent reasoning about a game), the first program is assumed to have access to an oracle (such as a halting oracle) that can answer arbitrary questions about the second program. This works, but it doesn’t help with modeling agents that can reason about each other.
While a halting oracle can predict the behavior of any isolated Turing machine, it cannot predict the behavior of another Turing machine that has access to a halting oracle. If this were possible, the second machine could use its oracle to figure out what the first machineoracle pair thinks it will do, at which point it can do the opposite, setting up a liar paradox scenario. For analogous reasons, two agents with similar resources, operating in realworld environments without any halting oracles, cannot perfectly predict each other in full generality.
Game theorists know how to build formal models of asymmetric games between a weaker player and a stronger player, where the stronger player understands the weaker player’s strategy but not vice versa. For the reasons above, however, games between agents of similar strength have resisted full formalization. As a consequence of this, game theory has until now provided no method for designing agents that perform well on complex iterated games containing other agents of similar strength.
Usually, the way to build an ideal agent is to have the agent consider a large list of possible policies, predict how the world would respond to each policy, and then choose the best policy by some metric. However, in multiplayer games, if your agent considers a big list of policies that both it and the opponent might play, then the best policy for the opponent is usually some alternative policy that was not in your list. (And if you add that policy to your list, then the new best policy for the opponent to play is now a new alternative that wasn’t in the list, and so on.)
This is the grain of truth problem, first posed by Kalai and Lehrer in 1993: define a class of policies that is large enough to be interesting and realistic, and for which the best response to an agent that considers that policy class is inside the class.^{1}
Taylor and Fallenstein have developed a formalism that enables a solution: reflective oracles capable of answering questions about agents with access to equivalently powerful oracles. Leike has led work on demonstrating that this formalism can solve the grain of truth problem, and in the process shows that the Bayesoptimal policy generally does not converge to a Nash equilibrium. Thompson sampling, however, does converge to a Nash equilibrium — a result that comes out of another paper presented at UAI 2016, Leike, Lattimore, Orseau, and Hutter’s “Thompson sampling is asymptotically optimal in general environments.”
The key feature of reflective oracles is that they avoid diagonalization and paradoxes by randomizing in the relevant cases.^{2} This allows agents with access to a reflective oracle to consistently reason about the behavior of arbitrary agents that also have access to a reflective oracle, which in turn makes it possible to model agents that converge to Nash equilibria by their own faculties (rather than by fiat or assumption).
This framework can be used to, e.g., define games between multiple copies of AIXI. As originally formulated, AIXI cannot entertain hypotheses about its own existence, or about the existence of similarly powerful agents; classical Bayesoptimal agents must be larger and more intelligent than their environments. With access to a reflective oracle, however, Fallenstein, Soares, and Taylor have shown that AIXI can meaningfully entertain hypotheses about itself and copies of itself while avoiding diagonalization.
The other main novelty of this paper is that reflective oracles turn out to be limitcomputable, and so allow for approximation by anytime algorithms. The reflective oracles paradigm is therefore likely to be quite valuable for investigating gametheoretic questions involving generally intelligent agents that can understand and model each other.^{3}
Get notified every time a new technical paper is published.
The Bayesoptimal behavior is to cooperate until the posterior belief that the other agent defects in the time step after the next is greater than some constant (depending on the discount function) and then defect afterwards.
But this is itself a strategy in the class under consideration. If both players are Bayesoptimal, then both will have a grain of truth (i.e., their actual strategy is assigned nonzero probability by the other player) “and therefore they converge to a Nash equilibrium: either they both cooperate forever or after some finite time they both defect forever.”
Slightly expanding the list of policies an agent might deploy, however, can make it hard to find a policy class that contains a grain of truth. For example, if “tit for tat” is added to the policy class, then, depending on the prior, the grain of truth may be lost. In this case, if the first agent thinks that the second agent is very likely “always defect” but maybe “tit for tat,” then the best policy might be something like “defect until they cooperate, then play tit for tat,” but this policy is not in the policy class. The question resolved by this paper is how to find priors that contain a grain of truth for much richer policy classes.
The post New paper: “A formal solution to the grain of truth problem” appeared first on Machine Intelligence Research Institute.
]]>The post June 2016 Newsletter appeared first on Machine Intelligence Research Institute.
]]>
Research updates
General updates
News and links

The post June 2016 Newsletter appeared first on Machine Intelligence Research Institute.
]]>The post New paper: “Safely interruptible agents” appeared first on Machine Intelligence Research Institute.
]]>Abstract:
Reinforcement learning agents interacting with a complex environment like the real world are unlikely to behave optimally all the time. If such an agent is operating in realtime under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions—harmful either for the agent or for the environment—and lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button — which is an undesirable outcome.
This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator. We provide a formal definition of safe interruptibility and exploit the offpolicy learning property to prove that either some agents are already safely interruptible, like Qlearning, or can easily be made so, like Sarsa. We show that even ideal, uncomputable reinforcement learning agents for (deterministic) general computable environments can be made safely interruptible.
Orseau and Armstrong’s paper constitutes a new angle of attack on the problem of corrigibility. A corrigible agent is one that recognizes it is flawed or under development and assists its operators in maintaining, improving, or replacing itself, rather than resisting such attempts.
In the case of superintelligent AI systems, corrigibility is primarily aimed at averting unsafe convergent instrumental policies (e.g., the policy of defending its current goal system from future modifications) when such systems have incorrect terminal goals. This leaves us more room for approximate, trialanderror, and learningbased solutions to AI value specification.
Interruptibility is an attempt to formalize one piece of the intuitive idea of corrigibility. Utility indifference (in Soares, Fallenstein, Yudkowsky, and Armstrong’s “Corrigibility”) is an example of a past attempt to define a different piece of corrigibility: systems that are indifferent to programmers’ interventions to modify their terminal goals, and will therefore avoid trying to force their programmers either to make such modifications or to avoid such modifications. “Safely interruptible agents” instead attempts to define systems that are indifferent to programmers’ interventions to modify their policies, and will not try to stop programmers from intervening on their everyday activities (nor try to force them to intervene).
Here the goal is to make the agent’s policy converge to whichever policy is optimal if the agent believed there would be no future interruptions. Even if the agent has experienced interruptions in the past, it should act just as though it will never experience any further interruptions. Orseau and Armstrong show that several classes of agent are safely interruptible, or can be easily made safely interruptible.
Further reading:
Get notified every time a new technical paper is published.
The post New paper: “Safely interruptible agents” appeared first on Machine Intelligence Research Institute.
]]>The post May 2016 Newsletter appeared first on Machine Intelligence Research Institute.
]]>
Research updates
General updates
News and links

The post May 2016 Newsletter appeared first on Machine Intelligence Research Institute.
]]>The post A new MIRI research program with a machine learning focus appeared first on Machine Intelligence Research Institute.
]]>MIRI’s research in general can be viewed as a response to Stuart Russell’s question for artificial intelligence researchers: “What if we succeed?” There appear to be a number of theoretical prerequisites for designing advanced AI systems that are robust and reliable, and our research aims to develop them early.
Our general research agenda is agnostic about when AI systems are likely to match and exceed humans in general reasoning ability, and about whether or not such systems will resemble presentday machine learning (ML) systems. Recent years’ impressive progress in deep learning suggests that relatively simple neuralnetworkinspired approaches can be very powerful and general. For that reason, we are making an initial inquiry into a more specific subquestion: “What if techniques similar in character to presentday work in ML succeed in creating AGI?”.
Much of this work will be aimed at improving our highlevel theoretical understanding of taskdirected AI. Unlike what Nick Bostrom calls “sovereign AI,” which attempts to optimize the world in longterm and largescale ways, task AI is limited to performing instructed tasks of limited scope, satisficing but not maximizing. Our hope is that investigating task AI from an ML perspective will help give information about both the feasibility of task AI and the tractability of early safety work on advanced supervised, unsupervised, and reinforcement learning systems.
To this end, we will begin by investigating eight relevant technical problems:
1. Inductive ambiguity detection.
How can we design a general methodology for ML systems (such as classifiers) to identify when the classification of a test instance is underdetermined by training data?
For example: If an ambiguitydetecting classifier is designed to distinguish images of tanks from images of nontanks, and the training set only contains images of tanks on cloudy days and nontanks on sunny days, this classifier ought to detect that the classification of an image of a tank on a sunny day is ambiguous, and pose some query for its operators to disambiguate it and avoid errors.
While past and current work in active learning and statistical learning theory more broadly has made progress towards this goal, more work is necessary to establish realistic statistical bounds on the error rates and query rates of realworld systems in advance of their deployment in complex environments.
2. Informed oversight.
How might we train a reinforcement learner to output both an action and a “report” comprising information to help a human evaluate its action?
For example: If a human is attempting to train a reinforcement learner to output original stories, then in evaluating the story, the human will want to know some information about the story (such as whether it has been plagiarized from another story) that may be difficult to determine by looking at the story itself.
3. Safe training procedures for humanimitators.
How might we design a ML system that imitates humans performing some task that involves rich outputs (such as answering questions in natural language), to the best of the ML system’s abilities?
While there are existing approaches to imitation learning and generative models, these have some theoretical shortcomings that prevent them from fully solving the general problem. In particular, a generative adversarial model trained on human actions only has an incentive to imitate aspects of the human that the adversary can detect; thus, issues similar to the plagiarism problem from (2) can arise.
4. Conservative concepts.
How might we design a system that, given some positive examples of a concept, can synthesize new instances of the concept without synthesizing edge cases of it?
For example: If we gave the system detailed information about 100 humancreated burritos as training data, it should manufacture additional burritos while avoiding edge cases such as extremely small burritos (even though these could still be considered burritos).
By default, most objective functions will lead to such edge cases (say, because small burritos are cheaper to manufacture). Can we develop a general technique for avoiding this problem?
5. Specifying environmental goals using sensory data.
The goals of most ML systems are defined in terms of sensory data, such as discounted reward over time. While these sensory goals can be useful proxies for environmental goals, they are not identical to them: tricking your own sensors into perceiving a strawberry in the room is not the the same as actually having one there. How might we create systems that directly pursue useful goals in their environment, such as “make there be a strawberry in this room”, in a way that is robust to opportunities for the system to interfere with the proxies we used to specify the goal?
6. Lowimpact agents.
We would prefer a highly intelligent taskoriented system to generally avoid creating large unintendedbyus side effects in pursuit of its goal, and also to notify us of any large impacts that might result from achieving its goal. For example, if we ask it to build a house for a homeless family, it should know implicitly that it should avoid destroying nearby houses for materials — a large side effect — while still having the desirable large followon effect of improving the family’s socioeconomic situation. For any specific task we can specify adhoc cost functions for side effects like nearby house destruction, but since we cannot always anticipate such costs in advance, we want a quantitative understanding of how to generally limit an AI’s side effects (without also limiting its ability to have large positive intended impacts).
7. Mild optimization.
In the event that the low impact criteria we employ in (6) are flawed or missing some key considerations, we would like some way to tell our system “don’t try too hard” or “don’t exercise too much optimization power”, since intuitively, large impacts are more likely to result from large efforts (although the correspondence is imperfect).
Less capable AI systems tend to be less likely to have large unintended side effects, and we would like to know how to make even highly capable systems safe in a similar way, by having them apply less “effort”. With a satisfactory encoding of our notion of “effort”, we could then program a system to impose absolute limits on its total effort, or to use only as much effort as is needed to achieve its objective with reasonably high probability, thus potentially limiting impacts that we did not manage to measure in (6).
8. Averting instrumental pressures.
How might we design a system that is not motivated to pursue certain convergent instrumental goals — such as gaining additional resources — even when pursuing these goals would help it achieve its main objective?
In particular, we may wish to build a system that has no incentive to cause or prevent its own shutdown/suspension. This relates to (6) and (7) in that instrumental pressures like “ensure my continued operation” can incentivize large impacts/efforts. However, this is a distinct agenda item because it may be possible to completely eliminate certain instrumental incentives in a way that would apply even before solutions to (6) and (7) would take effect.
Having identified these topics of interest, we expect our work on this agenda to be timely. The idea of “robust and beneficial” AI has recently received increased attention as a result of the new wave of breakthroughs in machine learning. The kind of theoretical work in this project has more obvious connections to the leading paradigms in AI and ML than, for example, our recent work in logical uncertainty or in game theory, and therefore lends itself better to collaborations with AI/ML researchers in the near future.
Thanks to Eliezer Yudkowsky and Paul Christiano for seeding many of the initial ideas for these research directions, to Patrick LaVictoire, Andrew Critch, and other MIRI researchers for helping develop these ideas, and to Chris Olah, Dario Amodei, and Jacob Steinhardt for valuable discussion.
The post A new MIRI research program with a machine learning focus appeared first on Machine Intelligence Research Institute.
]]>The post New papers dividing logical uncertainty into two subproblems appeared first on Machine Intelligence Research Institute.
]]>The solutions for each subproblem are available in two new papers, based on work spearheaded by Scott Garrabrant: “Uniform coherence” and “Asymptotic convergence in online learning with unbounded delays.”^{1}
To give some background on the problem: Modern probability theory models reasoners’ empirical uncertainty, their uncertainty about the state of a physical environment, e.g., “What’s behind this door?” However, it can’t represent reasoners’ logical uncertainty, their uncertainty about statements like “this Turing machine halts” or “the twin prime conjecture has a proof that is less than a gigabyte long.”^{2}
Roughly speaking, if you give a classical probability distribution variables for statements that could be deduced in principle, then the axioms of probability theory force you to put probability either 0 or 1 on those statements, because you’re not allowed to assign positive probability to contradictions. In other words, modern probability theory assumes that all reasoners know all the consequences of all the things they know, even if deducing those consequences is intractable.
We want a generalization of probability theory that allows us to model reasoners that have uncertainty about statements that they have not yet evaluated. Furthermore, we want to understand how to assign “reasonable” probabilities to claims that are too expensive to evaluate.
Imagine an agent considering whether to use quicksort or mergesort to sort a particular dataset. They might know that quicksort typically runs faster than mergesort, but that doesn’t necessarily apply to the current dataset. They could in principle figure out which one uses fewer resources on this dataset, by running both of them and comparing, but that would defeat the purpose. Intuitively, they have a fair bit of knowledge that bears on the claim “quicksort runs faster than mergesort on this dataset,” but modern probability theory can’t tell us which information they should use and how.^{3}
What does it mean for a reasoner to assign “reasonable probabilities” to claims that they haven’t computed, but could compute in principle? Without probability theory to guide us, we’re reduced to using intuition to identify properties that seem desirable, and then investigating which ones are possible. Intuitively, there are at least two properties we would want logically nonomniscient reasoners to exhibit:
1. They should be able to notice patterns in what is provable about claims, even before they can prove or disprove the claims themselves. For example, consider the claims “this Turing machine outputs an odd number” and “this Turing machine outputs an even number.” A good reasoner thinking about those claims should eventually recognize that they are mutually exclusive, and assign them probabilities that sum to at most 1, even before they can run the relevant Turing machine.
2. They should be able to notice patterns in sentence classes that are true with a certain frequency. For example, they should assign roughly 10% probability to “the 10^{100}th digit of pi is a 7” in lieu of any information about the digit, after observing (but not proving) that digits of pi tend to be uniformly distributed.
MIRI’s work on logical uncertainty this past year can be very briefly summed up as “we figured out how to get these two properties individually, but found that it is difficult to get both at once.”
“Uniform coherence,” which I coauthored with Garrabrant, Benya Fallenstein, and Abram Demski, shows how to get the first property. The abstract reads:
While probability theory is normally applied to external environments, there has been some recent interest in probabilistic modeling of the outputs of computations that are too expensive to run. Since mathematical logic is a powerful tool for reasoning about computer programs, we consider this problem from the perspective of integrating probability and logic.
Recent work on assigning probabilities to mathematical statements has used the concept of coherent distributions, which satisfy logical constraints such as the probability of a sentence and its negation summing to one. Although there are algorithms which converge to a coherent probability distribution in the limit, this yields only weak guarantees about finite approximations of these distributions. In our setting, this is a significant limitation: Coherent distributions assign probability one to all statements provable in a specific logical theory, such as Peano Arithmetic, which can prove what the output of any terminating computation is; thus, a coherent distribution must assign probability one to the output of any terminating computation.
To model uncertainty about computations, we propose to work with approximations to coherent distributions. We introduce uniform coherence, a strengthening of coherence that provides appropriate constraints on finite approximations, and propose an algorithm which satisfies this criterion.
Given a series of provably mutually exclusive sentences, or a series of sentences where each provably implies the next, a uniformly coherent predictor’s probabilities eventually start respecting this pattern. This is true even if the predictor hasn’t been able to prove that the pattern holds yet; if it would be possible in principle to eventually prove each instance of the pattern, then the uniformly coherent predictor will start recognizing it “before too long,” in a specific technical sense, even if the proofs themselves are very long.
“Asymptotic convergence in online learning with unbounded delays,” which I coauthored with Garrabrant and Jessica Taylor, describes an algorithm with the second property. The abstract reads:
We study the problem of predicting the results of computations that are too expensive to run, via the observation of the results of smaller computations. We model this as an online learning problem with delayed feedback, where the length of the delay is unbounded, which we study mainly in a stochastic setting. We show that in this setting, consistency is not possible in general, and that optimal forecasters might not have average regret going to zero. However, it is still possible to give algorithms that converge asymptotically to Bayesoptimal predictions, by evaluating forecasters on specific sparse independent subsequences of their predictions. We give an algorithm that does this, which converges asymptotically on good behavior, and give very weak bounds on how long it takes to converge. We then relate our results back to the problem of predicting large computations in a deterministic setting.
The first property is about recognizing patterns about logical relationships between claims — saying “claim A implies claim B, so my probability on B must be at least my probability on A.” By contrast, the second property is about recognizing frequency patterns between similar claims — saying “I lack the resources to tell whether this claim is true, but 90% of similar claims have been true, so the base rate is 90%” (where part of the problem is figuring out what counts as a “similar claim”).
In this technical report, we model the latter task as an online learning problem, where a predictor observes the behavior of many small computations and has to predict the behavior of large computations. We give an algorithm that eventually assigns the “right” probabilities to every predictable subsequence of observations, in a specific technical sense.
Each paper is interesting in its own right, but for us, the exciting result is that we have teased apart and formalized two separate notions of what counts as “good reasoning” under logical uncertainty, both of which are compelling.
Furthermore, our approaches to formalizing these two notions are very different. “Uniform coherence” frames the problem in the traditional “unify logic with probability” setting, whereas “Asymptotic convergence in online learning with unbounded delays” fits more naturally into the online machine learning framework. The methods we found for solving the first problem don’t appear to help with the second problem, and vice versa. In fact, the two isolated solutions appear quite difficult to reconcile. The problem that these two papers leave open is: Can we get one algorithm that satisfies both properties at once?
Get notified every time a new technical paper is published.
The post New papers dividing logical uncertainty into two subproblems appeared first on Machine Intelligence Research Institute.
]]>