<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI Impacts</title>
	<atom:link href="http://aiimpacts.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://aiimpacts.org</link>
	<description></description>
	<lastBuildDate>Tue, 17 Dec 2024 22:46:54 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.1</generator>
	<item>
		<title>How should we analyse survey forecasts of AI timelines?</title>
		<link>http://aiimpacts.org/how-should-we-analyse-survey-forecasts-of-ai-timelines/</link>
		
		<dc:creator><![CDATA[aiimpacts]]></dc:creator>
		<pubDate>Mon, 16 Dec 2024 05:39:34 +0000</pubDate>
				<category><![CDATA[Reports]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3647</guid>

					<description><![CDATA[Tom Adamczewski, 2024 The Expert Survey on Progress in AI (ESPAI) is a large survey of AI researchers about the future of AI, conducted in 2016, 2022, and 2023. One main focus of the survey <a class="mh-excerpt-more" href="http://aiimpacts.org/how-should-we-analyse-survey-forecasts-of-ai-timelines/" title="How should we analyse survey forecasts of AI timelines?"></a>]]></description>
										<content:encoded><![CDATA[<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.2/dist/katex.min.css">


<p><em>Tom Adamczewski, 2024</em></p>



<p>The Expert Survey on Progress in AI (ESPAI) is a large survey of AI researchers about the future of AI, conducted in <a href="https://arxiv.org/abs/1705.08807">2016</a>, <a href="https://wiki.aiimpacts.org/doku.php?id=ai_timelines:predictions_of_human-level_ai_timelines:ai_timeline_surveys:2022_expert_survey_on_progress_in_ai">2022</a>, and <a href="https://arxiv.org/abs/2401.02843">2023</a>. One main focus of the survey is the timing of progress in AI.<sup data-fn="23f54f49-9fa2-4345-a108-0f9d2acd5259" class="fn"><a href="#23f54f49-9fa2-4345-a108-0f9d2acd5259" id="23f54f49-9fa2-4345-a108-0f9d2acd5259-link">1</a></sup></p>



<p>The timing-related results of the survey are usually presented as a cumulative distribution function (CDF) showing probabilities as a function of years, in the aggregated opinion of respondents. Each respondent gave three (year, probability) pairs for various AI milestones. Starting from these responses, two key steps of processing are required to obtain such a CDF:</p>



<ul class="wp-block-list">
<li>Fitting a continuous probability distribution to each response</li>



<li>Aggregating these distributions</li>
</ul>



<p>These two steps require a number of judgement calls. In addition, summarising and presenting the results involves many other implicit choices.</p>



<p>In this report, I investigate these choices and their impact on the results of the survey (for the 2023 iteration). I provide recommendations for how the survey results should be analysed and presented in the future.</p>



<p>This plot represents a summary of my best guesses as to how the ESPAI data should be analysed and presented.</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/headline_result.png"><img fetchpriority="high" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/headline_result-1024x768.png" alt="" class="wp-image-3652" style="width:632px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/headline_result-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result-2048x1536.png 2048w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result-80x60.png 80w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>See the version in the paper</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/grace_2024_figure_3.png"><img decoding="async" width="788" height="1024" src="https://aiimpacts.org/wp-content/uploads/2024/12/grace_2024_figure_3-788x1024.png" alt="" class="wp-image-3653" style="width:459px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/grace_2024_figure_3-788x1024.png 788w, http://aiimpacts.org/wp-content/uploads/2024/12/grace_2024_figure_3-231x300.png 231w, http://aiimpacts.org/wp-content/uploads/2024/12/grace_2024_figure_3-768x998.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/grace_2024_figure_3-1181x1536.png 1181w, http://aiimpacts.org/wp-content/uploads/2024/12/grace_2024_figure_3.png 1493w" sizes="(max-width: 788px) 100vw, 788px" /></a></figure>



<p><a href="https://arxiv.org/abs/2401.02843">Thousands of AI Authors on the Future of AI</a>, Figure 3. I added annotations to the 20%, 50%, and 80% points, for comparison with my plot.</p>
</details>



<p></p>



<p>I differ from previous authors in four main ways:</p>



<ul class="wp-block-list">
<li><strong>Show distribution of responses</strong>. Previous summary plots showed a random subset of responses, rather than quantifying the range of opinion among experts. I show a shaded area representing the central 50% of individual-level CDFs (25th to 75th percentile). <a href="#Displaying_the_distribution_of_responses">More</a></li>



<li><strong>Aggregate task and occupation questions</strong>. Previous analyses only showed task (HLMI) and occupation (FAOL) results separately, whereas I provide a single estimate combining both. By not providing a single headline result, previous approaches made summarization more difficult, and left room for selective interpretations. I find evidence that task automation (HLMI) numbers have been far more widely reported than occupation automation (FAOL). <a href="#Aggregating_across_the_task_and_occupation_framings">More</a></li>



<li><strong>Median aggregation</strong>. I’m quite uncertain as to which method is most appropriate in this context for aggregating the individual distributions into a single distribution. The arithmetic mean of probabilities, used by previous authors, is a reasonable option. I choose the median merely because it has the convenient property that we get the same result whether we take the median in the vertical direction (probabilities) or the horizontal (years). <a href="#Aggregation">More</a></li>



<li><strong>Flexible distributions</strong>: I fit individual-level CDF data to “flexible” interpolation-based distributions that can match the input data exactly. The original authors use the Gamma distribution. This change (and distribution fitting in general) makes only a small difference to the aggregate results. <a href="#Distribution_fitting">More</a></li>
</ul>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>See effects of changes, compared to the results in the paper</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline.png"><img decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline-1024x768.png" alt="" class="wp-image-3657" style="width:539px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/combined_effect_headline.png 1920w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>The combined effect of (3 of the 4 elements of) our approach, compared with previous results. For legibility, this does not show the range of responses, although I consider this one of the most important innovations over previous analyses.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>CDF</strong></td><td><strong>Framing of automation</strong></td><td><strong>Distribution family</strong></td><td><strong>Loss function</strong></td><td><strong>Aggregation</strong></td><td><strong>p20</strong></td><td><strong>p50</strong></td><td><strong>p80</strong></td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-vivid-cyan-blue-color">Blue (ours)</mark></td><td>Aggregate of tasks (HLMI) and occupations (FAOL)</td><td>Flexible</td><td>Not applicable</td><td>Median</td><td>2048</td><td>2073</td><td>2103</td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-luminous-vivid-orange-color">Orange (previous)</mark></td><td>Tasks (HLMI)<br></td><td>Gamma</td><td>MSE of probabilities</td><td>Arithmetic mean of probabilities</td><td>2031</td><td>2047</td><td>2110</td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-vivid-green-cyan-color">Green (previous)</mark></td><td>Occupations (FAOL)</td><td>Gamma</td><td>MSE of probabilities</td><td>Arithmetic mean of probabilities</td><td>2051</td><td>2110</td><td>2843</td></tr></tbody></table></figure>



<p>Note: Although previous authors give equal prominence to the orange (tasks, HLMI) and green (occupations, FAOL) results, I find evidence that the orange (tasks, HLMI) curve has been far more widely reported (<a href="#Aggregating_across_the_task_and_occupation_framings">More</a>).</p>
</details>



<p></p>



<p>The last two points (aggregation and distribution fitting) directly affect the numerical results. The first two are about how the headline result of the survey should be conceived of and communicated.</p>



<p>These four choices vary in both their <em>impact</em>, and in my <em>confidence</em> that they represent an improvement over previous analyses. The two tables below summarise my views on the topic.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Choice</th><th>Impact on understanding and communication of main results</th><th>Confidence it’s an improvement</th></tr></thead><tbody><tr><td>Show range of responses (<a href="#Displaying_the_distribution_of_responses">More</a>)</td><td>High</td><td>Very high</td></tr><tr><td>Aggregate FAOL and HLMI (<a href="#Aggregating_across_the_task_and_occupation_framings">More</a>)</td><td>High</td><td>Moderate</td></tr></tbody></table></figure>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Choice</th><th>Numerical impact on aggregate CDF</th><th>Confidence it’s an improvement</th></tr></thead><tbody><tr><td>Median aggregation (<a href="#Aggregation">More</a>)</td><td>High</td><td>Very low</td></tr><tr><td>Flexible distributions (<a href="#Distribution_fitting">More</a>)</td><td>Minimal</td><td>High</td></tr></tbody></table></figure>



<p>Even if you disagree with these choices, you can still benefit from my work! The <a href="#Codebase">code</a> used to implement these new variations is open source. It provides user-friendly configuration objects that make it easy to run your own analysis and produce your own plots. The source data is included in version control. AI Impacts plans to use this code when analysing future iterations of ESPAI. I also welcome engagement from the wider research community.</p>



<p><strong>Suggested textual description</strong></p>



<p>If you need a textual description of the results in the plot, I would recommend:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Experts were asked when it will be feasible to automate all tasks or occupations. The median expert thinks this is 20% likely by 2048, and 80% likely by 2103. There was substantial disagreement among experts. For automation by 2048, the middle half of experts assigned it a probability between 1% and 60% (meaning ¼ assigned it a chance lower than 1%, and ¼ gave a chance higher than 60%). For automation by 2103, the central half of experts’ forecasts ranged from a 25% chance to a 100% chance.<sup data-fn="514e6709-2685-453c-b148-94cd12b91b67" class="fn"><a href="#514e6709-2685-453c-b148-94cd12b91b67" id="514e6709-2685-453c-b148-94cd12b91b67-link">2</a></sup></p>
</blockquote>



<p>This description still contains big simplifications (e.g. using “the median expert thinks” even though no expert directly answered questions about 2048 or 2103). However, it communicates both:</p>



<ul class="wp-block-list">
<li>The uncertainty represented by the aggregated CDF (using the 60% belief interval from 20% to 80%)</li>



<li>The range of disagreement among experts (using the central 50% of responses)</li>
</ul>



<p>In some cases, this may be too much information. I recommend if at all possible that the results should not be reduced to the single number of the year by which experts expect a 50% chance of advanced AI. Instead, emphasise that we have a probability distribution over years by giving two points on the distribution. So if a very concise summary is required, you could use:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Surveyed experts think it’s unlikely (20%) it will become feasible to automate all tasks or occupations by 2048, but it probably will (80%) by 2103.</p>
</blockquote>



<p>If even greater simplicity is required, I would urge something like the following, over just using the median year:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>AI experts think full automation is most likely to become feasible between 2048 and 2103.</p>
</blockquote>



<h1 class="wp-block-heading" id="the-distribution-of-raw-responses">The distribution of raw responses</h1>



<p>Even readers who are familiar with ESPAI may only have seen the results after processing. It can be helpful to look at the raw data, i.e. respondents’ answers to questions before any processing, to remind ourselves how the survey was conducted.</p>



<p>All questions about how soon a milestone would be reached were framed in two ways: fixed-years and fixed-probabilities. Half of respondents were asked to estimate the probability that a milestone would be reached by a given year (“fixed-years framing”), while the other half were asked to estimate the year by which the milestone would be feasible with a given probability (“fixed-probabilities framing”).</p>
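
<p>To make the two framings concrete, here is a minimal sketch in Python (with invented numbers, not survey data) of what one respondent’s answers look like under each framing; both reduce to three (year, probability) pairs once the fixed horizons are converted to calendar years:</p>

<pre class="wp-block-code"><code>SURVEY_YEAR = 2023  # the 2023 iteration of ESPAI

# Fixed-probabilities framing: "in how many years will the milestone be 10% / 50% / 90% likely?"
fixed_probabilities_response = {0.10: 10, 0.50: 25, 0.90: 60}   # probability to years-from-now

# Fixed-years framing: "how likely is the milestone within 10 / 20 / 50 years?"
fixed_years_response = {10: 0.05, 20: 0.20, 50: 0.60}           # years-from-now to probability

def to_year_probability_pairs(response, fixed="probabilities"):
    """Turn either framing into three (calendar year, cumulative probability) pairs."""
    if fixed == "probabilities":
        pairs = [(SURVEY_YEAR + horizon, p) for p, horizon in response.items()]
    else:
        pairs = [(SURVEY_YEAR + horizon, p) for horizon, p in response.items()]
    return sorted(pairs)

print(to_year_probability_pairs(fixed_probabilities_response, fixed="probabilities"))
print(to_year_probability_pairs(fixed_years_response, fixed="years"))
</code></pre>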



<h2 class="wp-block-heading" id="example-retail-salesperson-occupation">Example: Retail Salesperson occupation</h2>



<p>Responses about one such milestone (say, the occupation of retail salesperson in the example below), if shown as a scatterplot, form three horizontal lines for fixed probabilities, and three vertical lines for fixed years. These correspond to the six questions being asked about the milestone:</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson-1024x768.png" alt="" class="wp-image-3668" style="width:546px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_scatter_retail_salesperson.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>The scatterplot is a helpful reminder of the data’s shape in its rawest form. However, all scatterplots will form three horizontal and three vertical lines. We can show more structured information about the distribution of responses for each of the six questions by using six box and whisker plots<sup data-fn="648cbfff-fbda-4131-a809-9dfea725a197" class="fn"><a href="#648cbfff-fbda-4131-a809-9dfea725a197" id="648cbfff-fbda-4131-a809-9dfea725a197-link">3</a></sup>, as shown below:</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson-1024x768.png" alt="" class="wp-image-3669" style="width:603px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_retail_salesperson.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>We can see several useful things in this set of box plots:</p>



<ul class="wp-block-list">
<li><strong>There is a large framing effect, whereby the fixed-years framing produces later predictions</strong>. (This effect is familiar from previous analyses of ESPAI, where it has been shown to occur systematically).
<ul class="wp-block-list">
<li>For example, the prediction (2043, 50%) is the <em>85th percentile</em> of responses for the 50% question in the fixed-probabilities framing, while the same prediction is the <em>median</em> response for the 2043 question in the fixed-years framing.</li>



<li>When asked about 2073 in the fixed-years framing, the median response was 90%; this point (2073, 90%) is much later than even the 85th percentile response to the 90% question in the fixed-probabilities framing.</li>
</ul>
</li>



<li><strong>Responses follow a skewed distribution</strong>
<ul class="wp-block-list">
<li>For all three questions in the fixed-probabilities framing, the responses have a large right skew</li>



<li>In the fixed-years framing, the 2033 question produces a right skew (up skew in the boxplot), whereas the 2073 question produces a left-skew (down skew in the boxplot), with more than 25% of respondents giving a probability of 100%.</li>
</ul>
</li>



<li><strong>There is a wide range of responses, indicating substantial disagreement among respondents</strong>. For example, when asked about 2043, the interval (centred on the median) that contains half of responses ranged from a 30% chance to a 90% chance. The interval that contains 70% of responses ranged from a 10% chance to a 98% chance.</li>
</ul>



<p>We can now look at the distribution of raw responses for the timing of human-level performance.</p>



<h2 class="wp-block-heading" id="timing-of-human-level-performance">Timing of human-level performance</h2>



<p>When the survey investigated the timing of human-level performance, the question was framed in two ways, as tasks, and as occupations<sup data-fn="a8f1ee3f-3a3b-498f-ac7c-f25b9ea3dddd" class="fn"><a href="#a8f1ee3f-3a3b-498f-ac7c-f25b9ea3dddd" id="a8f1ee3f-3a3b-498f-ac7c-f25b9ea3dddd-link">4</a></sup>:</p>



<ul class="wp-block-list">
<li>“High-Level Machine Intelligence” (HLMI): when unaided machines can accomplish every <strong>task</strong> better and more cheaply than human workers.</li>



<li>“Full Automation of Labor” (FAOL): when for any <strong>occupation</strong>, machines could be built to carry it out better and more cheaply than human workers.</li>
</ul>



<p>We can now take each of these in turn (expand the collapsible sections below).</p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full Automation of Labor (FAOL)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL-1024x768.png" alt="" class="wp-image-3670" style="width:598px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_FAOL.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>In the fixed probabilities framing, respondents were asked for the number of years until a 10%, 50%, and 90% probability of FAOL:<br></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Probability of FAOL</strong></td><td><strong>Mean response</strong></td><td><strong>15th percentile response</strong></td><td><strong>Median response</strong></td><td><strong>85th percentile response</strong></td></tr><tr><td>10%</td><td>5.08e+05</td><td>10</td><td>40</td><td>100</td></tr><tr><td>50%</td><td>7.84e+05</td><td>20</td><td>70</td><td>200</td></tr><tr><td>90%</td><td>1.01e+06</td><td>35</td><td>100</td><td>500</td></tr></tbody></table></figure>



<p>In the fixed years framing, respondents were asked for the probability of FAOL within 10, 20, and 50 years:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong><b>Y</b>ears until FAOL</strong></td><td><strong>Mean response</strong></td><td><strong>15th percentile response</strong></td><td><strong>Median response</strong></td><td><strong>85th percentile response</strong></td></tr><tr><td>10</td><td>6.02%</td><td>0.00%</td><td>0.00%</td><td>10.00%</td></tr><tr><td>20</td><td>12.30%</td><td>0.00%</td><td>2.00%</td><td>30.00%</td></tr><tr><td>50</td><td>24.66%</td><td>0.00%</td><td>10.00%</td><td>60.00%</td></tr></tbody></table></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>High-Level Machine Intelligence (HLMI)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI-1024x768.png" alt="" class="wp-image-3672" style="width:573px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/raw_boxplot_HLMI.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>In the fixed probabilities framing, respondents were asked for the number of years until a 10%, 50%, and 90% probability of HLMI:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Probability of HLMI</strong></td><td><strong>Mean response</strong></td><td><strong>15th percentile response</strong></td><td><strong>Median response</strong></td><td><strong>85th percentile response</strong></td></tr><tr><td>10%</td><td>41.2</td><td>2</td><td>5</td><td>20</td></tr><tr><td>50%</td><td>1310</td><td>7</td><td>20</td><td>50</td></tr><tr><td>90%</td><td>4.57e+05</td><td>15</td><td>50</td><td>100</td></tr></tbody></table></figure>



<p>In the fixed years framing, respondents were asked for the probability of HLMI within 10, 20, and 40 years:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Years until HLMI</strong></td><td><strong>Mean response</strong></td><td><strong>15th percentile response</strong></td><td><strong>Median response</strong></td><td><strong>85th percentile response</strong></td></tr><tr><td>10</td><td>18.3%</td><td>0%</td><td>10%</td><td>50%</td></tr><tr><td>20</td><td>34.7%</td><td>4%</td><td>30%</td><td>75%</td></tr><tr><td>40</td><td>54.6%</td><td>10%</td><td>50%</td><td>95%</td></tr></tbody></table></figure>



<p></p>
</details>



<p></p>



<h1 class="wp-block-heading" id="aggregation">Aggregation</h1>



<h2 class="wp-block-heading" id="possible-methods">Possible methods</h2>



<p>All previous analyses produced the aggregate distribution by taking the average of CDF values, that is, by taking the mean of probability values at each year.</p>



<p>There are many other possible aggregation methods. We can put these into two categories:</p>



<ul class="wp-block-list">
<li>vertical methods, like the one above, aggregate probability values at each year</li>



<li>horizontal methods aggregate year values at each probability</li>
</ul>



<p>This figure illustrates both methods on a very simple example with two CDFs to aggregate.</p>



<figure class="wp-block-image size-large"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/illustrate_agg_methods.png"><img loading="lazy" decoding="async" width="1024" height="410" src="https://aiimpacts.org/wp-content/uploads/2024/12/illustrate_agg_methods-1024x410.png" alt="" class="wp-image-3674" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/illustrate_agg_methods-1024x410.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/illustrate_agg_methods-300x120.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/illustrate_agg_methods-768x307.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/illustrate_agg_methods-1536x614.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/illustrate_agg_methods-2048x819.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>For both vertical and horizontal aggregation, we need not take the mean of values. In principle any aggregation function could be used, of which the mean and median are only the two most obvious examples.</p>
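
<p>As a minimal sketch (in Python; not the analysis codebase), here is how the vertical and horizontal versions of mean aggregation can be computed for two toy CDFs like those in the figure, each represented as a callable from calendar year to cumulative probability:</p>

<pre class="wp-block-code"><code>import numpy as np
from scipy import stats, optimize

# Two toy respondent CDFs over calendar years (invented, for illustration only).
cdf_a = lambda year: stats.norm.cdf(year, loc=2050, scale=10)
cdf_b = lambda year: stats.norm.cdf(year, loc=2090, scale=30)
cdfs = [cdf_a, cdf_b]

years = np.linspace(2024, 2200, 2000)

# Vertical aggregation: combine the probability values at each year.
vertical_mean = np.mean([cdf(years) for cdf in cdfs], axis=0)

# Horizontal aggregation: combine the year values (quantiles) at each probability.
def quantile(cdf, p):
    """The year at which a CDF reaches probability p, found numerically."""
    return optimize.brentq(lambda y: cdf(y) - p, 1900.0, 10000.0)

probabilities = np.linspace(0.01, 0.99, 99)
horizontal_mean_years = [np.mean([quantile(cdf, p) for cdf in cdfs]) for p in probabilities]

# The two aggregates generally differ; for example, the year of the 50% point:
print(round(float(np.interp(0.5, vertical_mean, years)), 1))                   # vertical mean
print(round(float(np.interp(0.5, probabilities, horizontal_mean_years)), 1))   # horizontal mean
</code></pre>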



<p>When it comes to aggregating probabilities (vertical aggregation), there are additional complications. The topic has been well studied, and many aggregation methods have been proposed.</p>



<p>A full assessment of the topic would take us far beyond the scope of this report, so I will only briefly mention one prominent recommendation: taking the geometric mean of odds. The core observation is that the arithmetic mean of probabilities ignores information from extreme predictions. This can be seen with a simple example. In scenario A, we aggregate the two predictions (1%, 10%), whereas in scenario B the two predictions are (0.1%, 10%). The arithmetic mean of probabilities is close to 5% in both cases (5.5% for A and 5.05% for B). It gives very little weight to the difference between 1% and 0.1%, which is after all a factor of 10. The geometric mean of odds reacts much more strongly to the more extreme prediction: it is about 3.2% in scenario A, but 1.0% in scenario B.</p>
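
<p>The numbers above can be checked with a few lines of Python (a standalone sketch, not part of the analysis codebase):</p>

<pre class="wp-block-code"><code>import numpy as np

def geometric_mean_of_odds(ps):
    """Convert probabilities to odds, take the geometric mean, convert back."""
    ps = np.asarray(ps, dtype=float)
    odds = ps / (1.0 - ps)
    geo = np.exp(np.mean(np.log(odds)))
    return geo / (1.0 + geo)

scenario_a = [0.01, 0.10]    # predictions of 1% and 10%
scenario_b = [0.001, 0.10]   # predictions of 0.1% and 10%

for name, ps in [("A", scenario_a), ("B", scenario_b)]:
    print(name,
          round(float(np.mean(ps)), 4),                 # arithmetic mean: 0.055 (A), 0.0505 (B)
          round(float(geometric_mean_of_odds(ps)), 4))  # geometric mean of odds: ~0.032 (A), ~0.010 (B)
</code></pre>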



<p>This behaviour of the geometric mean of odds is theoretically appealing, but it is only advisable if extreme predictions are really to be taken at face value. We might instead worry that such extreme predictions are overconfident.</p>



<p>As a further complication, in the case of ESPAI we cannot in practice apply the geometric mean of odds. This is because for nearly every year (every vertical line) we might consider, many of the respondents’ fitted CDFs take values indistinguishable from 0 or 1. This causes the geometric mean of odds to immediately become 0 or 1.<sup data-fn="40b92e73-777b-4d16-ac7f-66174261ef2a" class="fn"><a href="#40b92e73-777b-4d16-ac7f-66174261ef2a" id="40b92e73-777b-4d16-ac7f-66174261ef2a-link">5</a></sup></p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad-1024x768.png" alt="" class="wp-image-3677" style="width:654px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/geomean_odds_bad.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>Aggregating years is also problematic. Because the input is bounded on the left but not the right, the arithmetic mean of responses is inevitably dominated by extremely large values. This method would produce a CDF where any probability of the event is essentially infinitely many years away. We might hope to address this problem by using the geometric mean of years, but this in turn suffers from numerical and conceptual issues similar to those of the geometric mean of odds. Ultimately, taking the median of years is the only method of aggregating years I was able to apply.</p>



<p>The median of years and median of probabilities give the same answer. This makes intuitive sense since CDFs are strictly increasing.<sup data-fn="36fdf480-d735-4785-9483-7a68aff2be0b" class="fn"><a href="#36fdf480-d735-4785-9483-7a68aff2be0b" id="36fdf480-d735-4785-9483-7a68aff2be0b-link">6</a></sup> So I simply call this the “median”.</p>
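
<p>A minimal numerical check of this equivalence (again with toy CDFs rather than survey data): the median probability taken at each year and the median year taken at each probability trace out the same curve, which is not true of the corresponding means:</p>

<pre class="wp-block-code"><code>import numpy as np
from scipy import stats, optimize

# Three toy respondent CDFs (invented); an odd number keeps the median unambiguous.
cdfs = [lambda y, m=m, s=s: stats.norm.cdf(y, loc=m, scale=s)
        for m, s in [(2045, 8), (2070, 20), (2120, 40)]]

def quantile(cdf, p):
    """The year at which a CDF reaches probability p, found numerically."""
    return optimize.brentq(lambda y: cdf(y) - p, 1900.0, 10000.0)

years = np.linspace(2024, 2300, 4000)
probabilities = np.linspace(0.05, 0.95, 19)

# Vertical median: the median probability at each year, then read off its quantiles.
vertical_median = np.median([cdf(years) for cdf in cdfs], axis=0)
years_from_vertical = np.interp(probabilities, vertical_median, years)

# Horizontal median: the median year at each probability.
years_from_horizontal = np.array(
    [np.median([quantile(cdf, p) for cdf in cdfs]) for p in probabilities])

# The two curves coincide (up to the resolution of the year grid).
print(np.max(np.abs(years_from_vertical - years_from_horizontal)))
</code></pre>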



<h2 class="wp-block-heading" id="mean-vs-median-aggregation">Mean vs Median aggregation</h2>



<p>As a result of these difficulties, I will present only the following aggregation methods:</p>



<ul class="wp-block-list">
<li>(Arithmetic) mean of probabilities</li>



<li>Median of probabilities</li>
</ul>



<p>These plots use the Gamma distribution with the mean square error (MSE) of probabilities as the loss function, so the mean aggregation line corresponds to the results of previous analyses.</p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full Automation of Labor (FAOL)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol-1024x768.png" alt="" class="wp-image-3678" style="width:588px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800-1024x768.png" alt="" class="wp-image-3679" style="width:591px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_faol_2800.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>High-Level Machine Intelligence (HLMI)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1-1024x768.png" alt="" class="wp-image-3681" style="width:521px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_hlmi-1.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Truck Driver</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1-1024x768.png" alt="" class="wp-image-3683" style="width:556px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_TRUCK_DRIVER-1.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Surgeon</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON-1024x768.png" alt="" class="wp-image-3684" style="width:577px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_SURGEON.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Retail Salesperson</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON-1024x768.png" alt="" class="wp-image-3685" style="width:562px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_RETAIL_SALESPERSON.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>AI Researcher</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER-1024x768.png" alt="" class="wp-image-3686" style="width:551px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_agg_methods_AI_RESEARCHER.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p>We see a notable pattern in each of these cases. As we go from left to right, the median always starts below the mean (i.e. the median initially gives later predictions), but eventually overtakes the mean (i.e. the median eventually gives earlier predictions). Median aggregation also always gives rise to a more confident probability distribution: one whose probability mass is more concentrated.</p>



<h2 class="wp-block-heading" id="why-the-mean-and-median-differ">Why the mean and median differ</h2>



<p>When the mean is very different from the median, the distribution of responses is highly skewed. We can illustrate this by displaying a histogram of CDF values for a given year.</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="683" src="https://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_FAOL-1024x683.png" alt="" class="wp-image-3687" style="width:575px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_FAOL-1024x683.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_FAOL-300x200.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_FAOL-768x512.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_FAOL-1536x1024.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_FAOL-2048x1365.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>The results for automation of all occupations (FAOL) are quite interesting. For the years 2040 and 2060, the results are extremely skewed. A large majority assigns very low probabilities, but there is a right tail of high probabilities, which causes the mean to greatly exceed the median. For the years 2080 and 2100, a bimodal distribution emerges. We have a big cluster with probabilities near zero and a big cluster with probabilities near 1. Opinion is extremely polarised. By 2200 the median exceeds the mean. When we reach 2500, a majority think FAOL is near-certain, but a significant left tail causes the mean to lag far behind the median.</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_HLMI.png"><img loading="lazy" decoding="async" width="1024" height="683" src="https://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_HLMI-1024x683.png" alt="" class="wp-image-3688" style="width:569px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_HLMI-1024x683.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_HLMI-300x200.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_HLMI-768x512.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_HLMI-1536x1024.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/mean_vs_median_cdf_slice_histogram_HLMI-2048x1365.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>With task automation (HLMI), we see the same basic pattern again: the median at first trails behind the mean, and for later years and higher probabilities, the median overtakes the mean. However, skewness is less extreme than for FAOL, and we do not see a strongly bimodal histogram (extreme polarisation) at any point.</p>
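
<p>For readers who want to reproduce this kind of slice, here is a minimal sketch (assuming a list of fitted per-respondent CDF callables, which is not constructed here):</p>

<pre class="wp-block-code"><code>import numpy as np

def describe_cdf_slice(cdfs, year):
    """Evaluate every respondent's fitted CDF at one year and summarise the spread."""
    values = np.array([cdf(year) for cdf in cdfs])
    return {
        "year": year,
        "mean": float(np.mean(values)),      # the aggregation used by previous analyses
        "median": float(np.median(values)),  # the aggregation used in this report
        "histogram": np.histogram(values, bins=20, range=(0.0, 1.0))[0],
    }

# Example usage, with `fitted_cdfs` standing in for the per-respondent fitted CDFs:
# for year in [2040, 2060, 2080, 2100, 2200, 2500]:
#     print(describe_cdf_slice(fitted_cdfs, year))
</code></pre>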



<h2 class="wp-block-heading" id="aside-the-winsorized-geometric-mean-of-odds">Aside: the winsorized geometric mean of odds</h2>



<p>There is one way to use the geometric mean of odds that avoids the problem of zeroes and ones. This is to winsorize the data: to replace the most extreme values with less extreme values. For example, we could replace all values less than 0.1 with 0.1, and all values greater than 0.9 with 0.9.</p>



<p>Of course, this introduces a highly subjective choice that massively affects the results. We could replace all values less than 0.01 with 0.01, or all values less than 0.001 with 0.001. Therefore, I do not consider this technique suitable for producing a headline result.</p>



<p>However, it lets us do some potentially interesting explorations. Winsorizing essentially means we do not trust the most extreme predictions. We can now explore what the geometric mean of odds would look like under various degrees of winsorization. For example, what does it look like if we ignore all predictions more extreme than 1:100? What about 1:1000?</p>
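
<p>A minimal sketch of this exploration (assuming a list of per-respondent CDF values at a given year): clip the probabilities to a chosen band before converting to odds, which also avoids the exact zeros and ones that break the geometric mean of odds:</p>

<pre class="wp-block-code"><code>import numpy as np

def winsorized_geometric_mean_of_odds(ps, lower=0.1, upper=0.9):
    """Clip probabilities to [lower, upper], then return the probability
    corresponding to the geometric mean of the clipped odds."""
    clipped = np.clip(np.asarray(ps, dtype=float), lower, upper)
    odds = clipped / (1.0 - clipped)
    geo = np.exp(np.mean(np.log(odds)))
    return geo / (1.0 + geo)

# Invented example probabilities, including an exact 0, which on its own would force
# the un-winsorized geometric mean of odds to 0:
ps = [0.0, 0.02, 0.1, 0.3, 0.7, 0.95]
print(winsorized_geometric_mean_of_odds(ps, lower=0.1, upper=0.9))      # strong winsorization
print(winsorized_geometric_mean_of_odds(ps, lower=0.01, upper=0.99))    # weaker
print(winsorized_geometric_mean_of_odds(ps, lower=0.001, upper=0.999))  # weaker still
</code></pre>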



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL-1024x768.png" alt="" class="wp-image-3689" style="width:579px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_FAOL.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI-1024x768.png" alt="" class="wp-image-3690" style="width:589px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/winsorized_geomean_HLMI.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>Very informally, we can see that for HLMI (tasks) and FAOL (occupations), the arithmetic mean of probabilities roughly corresponds to the geometric mean of odds with a winsorization level of about 1:10. We’ve already discussed the well-known effect that the arithmetic mean of probabilities ignores more extreme predictions, compared to the geometric mean of odds. For this particular dataset, we can roughly quantify this effect, and see that it is equivalent to ignoring all predictions &lt;10% and &gt;90%. I find this to be quite an extreme level of winsorization. Consider, for example, that the fixed-probabilities framing explicitly asked for predictions at the 10% and 90% levels; it would be odd to simultaneously consider these probabilities too extreme to be trusted.</p>



<h1 class="wp-block-heading" id="distribution-fitting">Distribution fitting</h1>



<p>All previous analyses of the ESPAI data fitted each respondent’s CDF data (triples of (year, probability)) to a Gamma distribution before aggregating these distributions.</p>



<h2 class="wp-block-heading" id="why-fit-a-distribution">Why fit a distribution?</h2>



<p>Creating a full continuous distribution from three CDF points necessarily imposes some assumptions that were not present in the data. And recall, the respondents just gave numbers in text fields, and never saw the distribution that was later fitted to their CDF data.</p>



<p>So to begin with, it’s worth asking: why fit a distribution at all?</p>



<p>If we are only looking at a particular framing and question, for example FAOL with fixed years, it may indeed be preferable to look directly at the raw data. This allows us to talk strictly about what respondents said, without any additional assumptions. Even in this restricted setting, however, we might want to be able to get predictions for other years or probabilities than those respondents were asked about; this requires a full CDF. A simple example where this is needed is making comparisons across different iterations of the ESPAI survey in the fixed-years setting. Each survey asks for predictions about a fixed number of years <em>from the date of the survey</em>, so the fixed-years questions refer to different calendar years each time the survey is run.</p>



<p>A more fundamental problem for the raw data approach is that we wish to aggregate the results of different framings into a single estimate. We can only aggregate across the fixed-years and fixed-probabilities framings by aggregating full distributions. In addition, even within the fixed-years framing, we cannot aggregate the occupations (FAOL) and tasks (HLMI) framings, because different years were used (10, 20, and 40 years for HLMI and 10, 20, and 50 years for FAOL).</p>



<h2 class="wp-block-heading" id="limitations-of-previous-analyses">Limitations of previous analyses</h2>



<h3 class="wp-block-heading" id="constraints-of-gamma-distribution">Constraints of Gamma distribution</h3>



<p>While creating a full CDF from three points inevitably imposes assumptions not present in the data, we might think that, at a minimum, it would be desirable to have this CDF pass through the three points.</p>



<p>Previous analyses used a Gamma distribution. The Gamma is a two-parameter distribution, so it can exactly match two points of CDF data, but not three. It (like any two-parameter distribution) therefore loses information and distorts a respondent’s stated beliefs even at the points where we know exactly what they believe.</p>
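
<p>To make this constraint concrete, here is a minimal sketch (not the original analysis code) that fits a Gamma CDF to one invented fixed-probabilities response by minimising the squared error of probabilities, as in previous analyses, and compares it with a monotone interpolant that passes through all three points exactly (an illustration of the “flexible” idea, not necessarily this report’s exact construction):</p>

<pre class="wp-block-code"><code>import numpy as np
from scipy import stats, optimize, interpolate

# One invented fixed-probabilities response: years from the survey until 10%, 50%, 90%.
years = np.array([5.0, 20.0, 100.0])
probs = np.array([0.1, 0.5, 0.9])

# Previous approach: fit a two-parameter Gamma by minimising the MSE of probabilities.
def mse_loss(log_params):
    shape, scale = np.exp(log_params)   # optimise in log space to keep both parameters positive
    return np.mean((stats.gamma.cdf(years, a=shape, scale=scale) - probs) ** 2)

result = optimize.minimize(mse_loss, x0=np.log([1.0, 30.0]), method="Nelder-Mead")
shape, scale = np.exp(result.x)
# The best two-parameter fit generally cannot reproduce all three stated probabilities:
print(np.round(stats.gamma.cdf(years, a=shape, scale=scale), 3))

# A "flexible" alternative: a monotone interpolant through the stated points, anchored at
# probability 0 at the survey date. It matches the data exactly, though the tail beyond the
# last point still needs a separate modelling choice.
flexible_cdf = interpolate.PchipInterpolator([0.0, 5.0, 20.0, 100.0], [0.0, 0.1, 0.5, 0.9])
print(np.round(flexible_cdf(years), 3))   # exactly [0.1, 0.5, 0.9]
</code></pre>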



<p>Here are 9 examples from the fixed-probabilities framing<sup data-fn="e052f82e-e51d-4c49-be81-e33a7efda5dc" class="fn"><a href="#e052f82e-e51d-4c49-be81-e33a7efda5dc" id="e052f82e-e51d-4c49-be81-e33a7efda5dc-link">7</a></sup>. They are representative of the bottom half of fits (from the 50th to 90th percentile). Each subplot shows an example where the fitted gamma CDF (shown as a gray curve) attempts to match three points from a respondent’s data (shown as red crosses).</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_False.png"><img loading="lazy" decoding="async" width="987" height="1024" src="https://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_False-987x1024.png" alt="" class="wp-image-3693" style="width:690px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_False-987x1024.png 987w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_False-289x300.png 289w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_False-768x796.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_False-1481x1536.png 1481w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_False-1975x2048.png 1975w" sizes="auto, (max-width: 987px) 100vw, 987px" /></a></figure>



<p>First, we can see that the gamma is not flexible enough to match the three points. While the median fit (A1) is acceptable, some fits are poor.</p>



<h3 class="wp-block-heading" id="inappropriate-loss-function">Inappropriate loss function</h3>



<p>In addition, looking carefully reveals an interesting systematic pattern in the poor fits: when the Gamma has trouble fitting the data, it prefers to fit two points well, even at the expense of a very poor fit on the third point, rather than choosing a middle ground with an acceptable fit on all three points. This begins to be visible in row B (67th to 87th percentile), and becomes blatantly clear in row C (84th to 95th percentile). In fact, in C2 and C3, the Gamma fits two points exactly and completely ignores the third. When this happens, the worst-fit point is always the 0.9 or 0.1 point, never the 0.5 point.</p>



<p>The errors at the 0.1 and 0.9 points can completely change the nature of the prediction. This becomes clear if we go to odds space, and express the odds ratio between the data and the gamma CDF.</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_True.png"><img loading="lazy" decoding="async" width="984" height="1024" src="https://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_True-984x1024.png" alt="" class="wp-image-3694" style="width:673px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_True-984x1024.png 984w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_True-288x300.png 288w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_True-768x799.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_True-1477x1536.png 1477w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_probabilities_annotations_True-1969x2048.png 1969w" sizes="auto, (max-width: 984px) 100vw, 984px" /></a></figure>



<p>While a 2x odds ratio (e.g. in B2) is already substantial, when we move to the worst 15% of the fits, the odds ratio for the worst of the three points becomes astronomical.</p>



<p>The reason this happens is that the loss function used in previous work is not the appropriate one.</p>



<p>Previous analyses used mean squared error (MSE) of probabilities as their loss function: L<sub>MSE</sub> = Σ<sub>i</sub> (p<sub>i</sub> − p̂<sub>i</sub>)², where p<sub>i</sub> are the probabilities from the respondent’s data and p̂<sub>i</sub> are the probabilities from the fitted CDF. This loss function treats all probability differences equally, regardless of where they occur in the distribution. For instance, it considers a deviation of 0.05 to be equally bad whether it occurs at p = 0.5 or at p = 0.99.</p>
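


<p>As a concrete illustration (this is a minimal sketch, not the code used in previous analyses), the following Python snippet fits a two-parameter Gamma CDF to an illustrative set of three (year, probability) points by minimising this MSE loss:</p>



<pre class="wp-block-code"><code># Minimal sketch: fit a two-parameter Gamma CDF to three (year, probability)
# points by minimising the MSE of probabilities. The example response below
# is illustrative, not taken from the survey data.
import numpy as np
from scipy import optimize, stats

years = np.array([10.0, 25.0, 50.0])   # years from the survey date
probs = np.array([0.1, 0.5, 0.9])      # respondent's cumulative probabilities

def mse_loss(params):
    shape, scale = np.exp(params)       # optimise in log-space to keep both positive
    fitted = stats.gamma.cdf(years, a=shape, scale=scale)
    return np.sum((probs - fitted) ** 2)

result = optimize.minimize(mse_loss, x0=np.log([2.0, 20.0]), method="Nelder-Mead")
shape, scale = np.exp(result.x)
print(stats.gamma.cdf(years, a=shape, scale=scale))  # fitted probabilities at the three years
</code></pre>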



<p>This is inappropriate when fitting CDF data. Consider the case depicted in C2, where the respondent thinks the event in question is 90% likely by 150 years from the date of the survey. Meanwhile, the Gamma CDF fitted by MSE gives a probability of 99.98% at 150 years. This dramatic departure from the respondent’s beliefs is reflected in the 777x odds ratio. A 777x odds ratio at p = 0.5 would mean changing from even odds (1:1) to odds of 777:1, or a probability of &gt;99.8%. (A 13x odds ratio, as seen for the 0.1 point in C1 (84th percentile), would mean changing from even odds to odds of 13:1, or a probability of 93%.)</p>
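


<p>For readers who want to check these conversions, this purely illustrative snippet turns an odds ratio applied at even odds into a probability, reproducing the 777:1 and 13:1 figures above:</p>



<pre class="wp-block-code"><code># Converting an odds ratio applied at even odds (1:1) into a probability,
# reproducing the figures quoted above. Purely illustrative.
def prob_from_odds(odds):
    return odds / (1.0 + odds)

print(prob_from_odds(777))  # 0.99871..., i.e. over 99.8%
print(prob_from_odds(13))   # 0.92857..., i.e. about 93%
</code></pre>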



<p>The appropriate loss function for CDF data is the log loss, also known as the cross-entropy loss: L<sub>log</sub> = −Σ<sub>i</sub> [p<sub>i</sub> log(p̂<sub>i</sub>) + (1 − p<sub>i</sub>) log(1 − p̂<sub>i</sub>)]. This loss function naturally accounts for the fact that probability differences near 0 and 1 represent much larger differences in beliefs than the same probability differences near 0.5.<sup data-fn="03cf3499-e10c-403e-b780-3732fae8d50d" class="fn"><a href="#03cf3499-e10c-403e-b780-3732fae8d50d" id="03cf3499-e10c-403e-b780-3732fae8d50d-link">8</a></sup></p>



<p>As expected from this theoretical argument, we can see that the log loss, unlike the MSE of probabilities, does not display the pathological behaviour of ignoring the 0.1 or 0.9 point, and so avoids extreme odds ratios (see especially C1-C3):</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_loss_Fixed_probabilities.png"><img loading="lazy" decoding="async" width="987" height="1024" src="https://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_loss_Fixed_probabilities-987x1024.png" alt="" class="wp-image-3695" style="width:663px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_loss_Fixed_probabilities-987x1024.png 987w, http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_loss_Fixed_probabilities-289x300.png 289w, http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_loss_Fixed_probabilities-768x796.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_loss_Fixed_probabilities-1481x1536.png 1481w, http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_loss_Fixed_probabilities-1975x2048.png 1975w" sizes="auto, (max-width: 987px) 100vw, 987px" /></a></figure>



<p>As an informal analysis, this plot suggests that the MSE leads to extremely poor fits on &gt;15% of the data, but also that most of the MSE fits are close to the log loss fits.</p>



<p>When we create the aggregate CDF, we see hardly any impact of the loss function:</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL-1024x768.png" alt="" class="wp-image-3696" style="width:572px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_effect_of_loss_aggregate_HLMI_FAOL.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<h2 class="wp-block-heading" id="flexible-distributions">Flexible distributions</h2>



<p>Regardless of the loss function used, we know that the Gamma distribution cannot, in general, exactly match the three points of CDF data given by the expert<sup data-fn="a39f4f4a-6a4d-4c66-abc3-9900e1d00e9d" class="fn"><a href="#a39f4f4a-6a4d-4c66-abc3-9900e1d00e9d" id="a39f4f4a-6a4d-4c66-abc3-9900e1d00e9d-link">9</a></sup>. When we use any such distribution, our results do not merely reflect the expert’s beliefs; they also reflect the mathematical constraint we have imposed upon those beliefs.</p>



<p>Overriding expert responses in this way may be appropriate when we have a strong theoretical justification to impose a particular distribution family. For example, if we have a strong reason to believe that experts think (or ought to think) of a variable as a sum of many small independent contributions, we may wish to impose a normal distribution, even if the responses they gave us are incompatible with a normal distribution.</p>



<p>However, the authors of previous analyses did not justify the choice of the gamma distribution at any point. In addition, I am not aware of any strong argument to impose a particular distribution family in this case.</p>



<p>While creating a full CDF from three points inevitably imposes assumptions not present in the data, at a minimum, it would be desirable to have this CDF pass through the three points.</p>



<p>To achieve this, I used proprietary probability distributions that I call ‘flexible distributions’. I developed these over the last several years for precisely the class of use cases faced by ESPAI. These distributions have the following properties:</p>



<ul class="wp-block-list">
<li>Always exactly match three CDF points (or indeed an arbitrary number of them)…</li>



<li>…while taking a simple and smooth shape</li>



<li>Can be unbounded, or given an upper or lower bound, or both</li>
</ul>



<p>The distributions I used in this analysis are based on <a href="https://en.wikipedia.org/wiki/Interpolation">interpolation</a> theory. While the full mathematical and algorithmic details are proprietary, you can see how these distributions behave with the free interactive web UI at <a href="https://makedistribution.com/">MakeDistribution</a> (select interpolation-based families under expert settings). In addition, to make this work reproducible, the specific fitted CDFs used in the ESPAI analysis are open source<sup data-fn="0e1c4d58-978b-464d-80f9-f164c7337514" class="fn"><a href="#0e1c4d58-978b-464d-80f9-f164c7337514" id="0e1c4d58-978b-464d-80f9-f164c7337514-link">10</a></sup>.</p>
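


<p>To be clear, the flexible distributions themselves are proprietary and are not reproduced here. Purely as an illustration of the general idea, the sketch below uses monotone (PCHIP) interpolation from SciPy, which is one standard way to build a smooth CDF that passes exactly through a set of elicited points; it is not the method used in this report, and the bounding points at years 0 and 500 are arbitrary assumptions of the sketch:</p>



<pre class="wp-block-code"><code># Illustration only: NOT the report's flexible distributions. It merely shows
# that a smooth, monotone CDF passing exactly through three elicited points can
# be built with standard interpolation tools. The bounding points (probability 0
# at year 0, probability 0.999 at year 500) are arbitrary assumptions.
import numpy as np
from scipy.interpolate import PchipInterpolator

years = np.array([0.0, 10.0, 25.0, 50.0, 500.0])
probs = np.array([0.0, 0.10, 0.50, 0.90, 0.999])

cdf = PchipInterpolator(years, probs)          # monotone because the data are monotone
grid = np.linspace(0, 100, 5)
print(cdf(np.array([10.0, 25.0, 50.0])))       # exactly 0.1, 0.5, 0.9
print(cdf(grid))                               # smooth, non-decreasing in between
</code></pre>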



<p>This plot compares Gamma distribution fits with flexible distribution fits for the fixed-probabilities framing, displaying the respondents’ points alongside both sets of fitted CDFs.</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_vs_flexible_Fixed_probabilities.png"><img loading="lazy" decoding="async" width="987" height="1024" src="https://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_vs_flexible_Fixed_probabilities-987x1024.png" alt="" class="wp-image-3697" style="width:741px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_vs_flexible_Fixed_probabilities-987x1024.png 987w, http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_vs_flexible_Fixed_probabilities-289x300.png 289w, http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_vs_flexible_Fixed_probabilities-768x796.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_vs_flexible_Fixed_probabilities-1481x1536.png 1481w, http://aiimpacts.org/wp-content/uploads/2024/12/9_cdfs_gamma_vs_flexible_Fixed_probabilities-1975x2048.png 1975w" sizes="auto, (max-width: 987px) 100vw, 987px" /></a></figure>



<p>When we aggregate the individual distributions, however, we find that the choice of distribution has a very limited impact, barely any more than the impact of the loss function.</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL-1024x768.png" alt="" class="wp-image-3698" style="width:614px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/gamma_vs_flexible_aggregate_HLMI_FAOL.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>It may be somewhat surprising to see so little difference in aggregate, when we consider that there appeared to be systematic patterns in the poor gamma fits<sup data-fn="b147a578-bb00-4eb5-b834-6b7e806152dd" class="fn"><a href="#b147a578-bb00-4eb5-b834-6b7e806152dd" id="b147a578-bb00-4eb5-b834-6b7e806152dd-link">11</a></sup>. However, this might be explained by the fact that the majority of fits were of acceptable quality.</p>



<p>I ran many variations of this analysis (and so can you, using the open-source codebase). None showed a dramatic effect of the distribution family.</p>



<h2 class="wp-block-heading" id="other-distributions">Other distributions</h2>



<p>In addition to flexible distributions, I also investigated alternative ‘traditional’ distributions, such as the Weibull or generalised Gamma. I fitted each family both with the MSE of probabilities loss used by previous authors and with the log loss. These alternatives had little impact on the aggregate CDF, which might be considered unsurprising, since even the flexible distributions did not have large effects on aggregate results.</p>
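


<p>As a rough sketch of how such a sweep can be organised (not the report’s actual codebase), the snippet below fits the Gamma, Weibull and generalised Gamma families from SciPy to an illustrative response, under both loss functions:</p>



<pre class="wp-block-code"><code># Sketch of sweeping several 'traditional' families under both loss functions.
# The data points are illustrative; this is not the report's actual codebase.
import numpy as np
from scipy import optimize, stats

years = np.array([10.0, 25.0, 50.0])
probs = np.array([0.1, 0.5, 0.9])

def fit(dist, n_shape, loss):
    def objective(log_params):
        *shape, scale = np.exp(log_params)
        cdf = np.clip(dist.cdf(years, *shape, scale=scale), 1e-12, 1 - 1e-12)
        if loss == "mse":
            return np.sum((probs - cdf) ** 2)
        return -np.sum(probs * np.log(cdf) + (1 - probs) * np.log(1 - cdf))
    x0 = np.concatenate([np.zeros(n_shape), [np.log(25.0)]])  # shapes near 1, scale near the middle year
    res = optimize.minimize(objective, x0=x0, method="Nelder-Mead", options={"maxiter": 5000})
    return np.exp(res.x)

families = [("gamma", stats.gamma, 1), ("weibull", stats.weibull_min, 1), ("gen. gamma", stats.gengamma, 2)]
for name, dist, n_shape in families:
    for loss in ("mse", "log"):
        params = fit(dist, n_shape, loss)
        print(name, loss, np.round(dist.cdf(years, *params[:-1], scale=params[-1]), 3))
</code></pre>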



<h1 class="wp-block-heading" id="range-of-responses">Displaying the distribution of responses</h1>



<p>What is the range of opinion<sup data-fn="b7ab8693-8e6a-485f-a2d9-de48fdc21c2a" class="fn"><a href="#b7ab8693-8e6a-485f-a2d9-de48fdc21c2a" id="b7ab8693-8e6a-485f-a2d9-de48fdc21c2a-link">12</a></sup> among experts? Previous analyses gave only an informal sense of this by displaying a few dozen randomly selected CDFs:</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/grace_2018_figure_1.png"><img loading="lazy" decoding="async" width="1024" height="653" src="https://aiimpacts.org/wp-content/uploads/2024/12/grace_2018_figure_1-1024x653.png" alt="" class="wp-image-3701" style="width:620px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/grace_2018_figure_1-1024x653.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/grace_2018_figure_1-300x191.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/grace_2018_figure_1-768x490.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/grace_2018_figure_1-1536x980.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/grace_2018_figure_1.png 1658w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p><em><a href="https://arxiv.org/pdf/1705.08807">When Will AI Exceed Human Performance? Evidence from AI Experts</a></em>, Figure 1.</p>



<p>Their plots also included a 95% bootstrap confidence interval for the mean CDF. This is a measure of statistical variability in the estimate of the mean due to the finite sample size, not a measure of the dispersion of responses. Since ESPAI sample sizes are quite large, and the mean hence quite precisely estimated, I believe this bootstrap confidence interval is of secondary importance.</p>



<p>I dispense with the bootstrap CI and instead use the shaded area around the aggregate CDF to show the distribution of responses, specifically the central half of CDFs, from the 25th to the 75th percentile. This is a more systematic and quantitative alternative to displaying a random subset of individual CDFs.</p>
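


<p>Concretely, assuming each respondent’s fitted CDF has been evaluated on a common grid of years, the band can be computed as a pointwise percentile band, as in this sketch (with placeholder data, not the survey responses):</p>



<pre class="wp-block-code"><code># Sketch: the shaded band is a pointwise percentile band across individual
# fitted CDFs, not a bootstrap CI of the mean. `cdf_values` stands in for an
# (n_respondents, n_years) array of each respondent's fitted CDF evaluated on
# a common grid of years; the random placeholder below is NOT survey data.
import numpy as np

rng = np.random.default_rng(0)
n_respondents, n_years = 500, 101
cdf_values = np.sort(rng.uniform(0, 1, size=(n_respondents, n_years)), axis=1)  # placeholder curves

lower, median, upper = np.percentile(cdf_values, [25, 50, 75], axis=0)
# `lower` and `upper` delimit the central half of responses (25th to 75th percentile),
# which is what the shaded band shows; `median` is a pointwise median CDF.
</code></pre>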



<p>It is clear that the authors of previous ESPAI analyses are well aware of what the bootstrap CI measures and interpret it correctly. However, it’s possible that some casual readers did not fully appreciate this. For the avoidance of doubt, the 95% bootstrap CI is radically different from (and radically narrower than) the interval containing 95% of individual CDFs. The latter would cover almost the entire plot:</p>



<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1024x768.png" alt="" class="wp-image-3702" style="width:587px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>The degree of disagreement among respondents is such that instead of 95%, I show the central 50% in my plots. This is the widest interval that I found sufficiently visually informative. More typical intervals like the central 80% or 70% would cover such a wide range of predictions as to be less informative.</p>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full Automation of Labor (FAOL): central 95% (2.5th to 97.5th percentile)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL-1024x768.png" alt="" class="wp-image-3704" style="width:559px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_FAOL.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full Automation of Labor (FAOL): central 80% (10th to 90th percentile)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL-1024x768.png" alt="" class="wp-image-3705" style="width:533px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_FAOL.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full Automation of Labor (FAOL): central 70% (15th to 85th percentile)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL-1024x768.png" alt="" class="wp-image-3706" style="width:525px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_FAOL.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full Automation of Labor (FAOL): central 50% (25th to 75th percentile)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL-1024x768.png" alt="" class="wp-image-3707" style="width:543px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_FAOL.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>High-Level Machine Intelligence (HLMI): central 95% (2.5th to 97.5th percentile)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1-1024x768.png" alt="" class="wp-image-3708" style="width:539px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_95_HLMI-1.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>High-Level Machine Intelligence (HLMI): central 80% (10th to 90th percentile)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI-1024x768.png" alt="" class="wp-image-3709" style="width:559px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_80_HLMI.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>High-Level Machine Intelligence (HLMI): central 70% (15th to 85th percentile)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI-1024x768.png" alt="" class="wp-image-3710" style="width:576px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_70_HLMI.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>High-Level Machine Intelligence (HLMI): central 50% (25th to 75th percentile)</summary>
<figure class="wp-block-image size-large is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI-1024x768.png" alt="" class="wp-image-3711" style="width:530px;height:auto" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/range_of_responses_50_HLMI.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<h1 class="wp-block-heading" id="aggregating-hlmi-faol">Aggregating across the task and occupation framings</h1>



<p>Before being asked for their forecasts, respondents were shown the following definitions for HLMI (High-Level Machine Intelligence) and FAOL (Full Automation of Labor):</p>



<p>HLMI (tasks):</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>High-level machine intelligence (HLMI) is achieved when unaided machines can accomplish every task better and more cheaply than human workers. Ignore aspects of tasks for which being a human is intrinsically advantageous, e.g., being accepted as a jury member. Think feasibility, not adoption.</p>
</blockquote>



<p>FAOL (occupations):</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Say an occupation becomes fully automatable when unaided machines can accomplish it better and more cheaply than human workers. Ignore aspects of occupations for which being a human is intrinsically advantageous, e.g., being accepted as a jury member. Think feasibility, not adoption. Say we have reached ‘full automation of labor’ when all occupations are fully automatable. That is, when for any occupation, machines could be built to carry out the task better and more cheaply than human workers.</p>
</blockquote>



<p>The two questions are very similar. The main difference is that HLMI is phrased in terms of tasks, while FAOL asks about occupations. In principle, we should expect the same prediction on both questions. As noted by the authors,</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>since occupations might naturally be understood either as complex tasks, composed of tasks, or closely connected with one of these, achieving HLMI seems to either imply having already achieved FAOL, or suggest being close.</p>
</blockquote>



<p>So it is legitimate to think of these as two different framings of the same question.</p>



<p>Despite their similarity, these framings yield very different predictions. The figures below show the result of using my preferred settings (median aggregation, flexible distributions), except that HLMI and FAOL are shown separately instead of aggregated:</p>



<h2 class="wp-block-heading has-medium-font-size">HLMI vs FAOL</h2>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Full Automation of Labor (FAOL)</summary>
<figure class="wp-block-image size-large"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-1024x768.png" alt="" class="wp-image-3713" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-2048x1536.png 2048w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_FAOL-80x60.png 80w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>High-Level Machine Intelligence (HLMI)</summary>
<figure class="wp-block-image size-large"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-1024x768.png" alt="" class="wp-image-3714" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-2048x1536.png 2048w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/headline_result_HLMI-80x60.png 80w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Comparison of HLMI and FAOL</summary>
<figure class="wp-block-image size-large"><a href="https://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol.png"><img loading="lazy" decoding="async" width="1024" height="768" src="https://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol-1024x768.png" alt="" class="wp-image-3715" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol-1024x768.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol-300x225.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol-768x576.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol-1536x1152.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol-678x509.png 678w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol-326x245.png 326w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol-80x60.png 80w, http://aiimpacts.org/wp-content/uploads/2024/12/compare_hlmi_faol.png 1920w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</details>



<p></p>



<p>Previous analyses never aggregated the task and occupation results, presenting them only separately. Recall that, using their methodology<sup data-fn="79fe81fe-96c1-4ed8-b302-574fa1ba2b76" class="fn"><a href="#79fe81fe-96c1-4ed8-b302-574fa1ba2b76" id="79fe81fe-96c1-4ed8-b302-574fa1ba2b76-link">13</a></sup>, the authors reported a median year of 2047 for all human tasks, versus 2116 for all occupations (a difference of 69 years!).</p>



<p>Presenting results separately allows patient and sophisticated readers to reach a deeper understanding. However, we must be realistic: it is very likely that a single “headline” result will be the most widely spread and remembered. Attempting to prevent this by <em>only</em> presenting HLMI (tasks) and FAOL (occupations) separately is, in my opinion, largely futile. While it may sometimes encourage nuance, more often it will simply make it easier for readers to choose whichever of the two results best fits their preconceptions.</p>



<p>Indeed, my brief investigation suggests that citations of the 2023 survey results are strongly biased towards tasks (HLMI) over occupations (FAOL). Out of the 20 articles<sup data-fn="a52b53bc-1346-4411-b073-2ffb89d84b32" class="fn"><a href="#a52b53bc-1346-4411-b073-2ffb89d84b32" id="a52b53bc-1346-4411-b073-2ffb89d84b32-link">14</a></sup> on the first two pages of Google Scholar citations of the 2024 preprint, 7 reported at least one of HLMI or FAOL. Among these:</p>



<ul class="wp-block-list">
<li>6 out of 7 (86%) reported tasks (HLMI) only</li>



<li>1 out of 7 (14%) reported both tasks and occupations</li>



<li>None (0%) reported occupations (FAOL) only</li>
</ul>



<p>Therefore, I consider it preferable, when providing headline results, to aggregate across HLMI and FAOL, yielding a single estimate of when all tasks or occupations will be automatable.</p>



<p>I achieve this simply by including answers to both questions prior to aggregation, i.e. no special form of aggregation is used to combine tasks (HLMI) and occupations (FAOL). Since more respondents were asked about tasks than about occupations, I give the two framings equal weight by resampling from the occupations (FAOL) responses, as sketched below.</p>
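


<p>A minimal sketch of this equal-weighting step, using placeholder arrays rather than the survey data, might look as follows:</p>



<pre class="wp-block-code"><code># Sketch: give tasks (HLMI) and occupations (FAOL) equal weight by resampling
# the smaller FAOL group up to the size of the HLMI group before pooling.
# The array shapes and resampling with replacement are assumptions of this
# sketch; rows stand in for fitted CDF values on a common grid of years.
import numpy as np

rng = np.random.default_rng(0)
n_years = 101
hlmi_cdfs = np.sort(rng.uniform(0, 1, size=(800, n_years)), axis=1)  # placeholder
faol_cdfs = np.sort(rng.uniform(0, 1, size=(250, n_years)), axis=1)  # placeholder

idx = rng.integers(0, faol_cdfs.shape[0], size=hlmi_cdfs.shape[0])
pooled = np.vstack([hlmi_cdfs, faol_cdfs[idx]])      # both framings now equally represented
aggregate = np.percentile(pooled, 50, axis=0)        # pointwise median across pooled responses
</code></pre>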



<h1 class="wp-block-heading" id="codebase">Codebase</h1>



<p>For this analysis, I wrote a <a href="https://github.com/tadamcz/espai/" data-type="link" data-id="https://github.com/tadamcz/espai/">fully new codebase</a>. This was necessary because the system used for previous analyses relied on a collection of Jupyter notebooks that required manually running cells in a specific, undocumented order to achieve results.</p>



<p>This new codebase, written in Python, makes our analyses reproducible for the first time. The codebase includes a robust test suite.</p>



<p>We are open sourcing the codebase, and invite scrutiny and contributions from other researchers. It provides user-friendly configuration objects that we hope will make it easy for you to run your own variations of the analysis and produce your own plots.</p>



<h2 class="wp-block-heading has-large-font-size">Footnotes</h2>


<ol class="wp-block-footnotes"><li id="23f54f49-9fa2-4345-a108-0f9d2acd5259">Timing will be my sole focus. I ignore ESPAI’s questions about whether the overall impact of AI will be positive or negative, the preferred rate of progress, etc. <a href="#23f54f49-9fa2-4345-a108-0f9d2acd5259-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="514e6709-2685-453c-b148-94cd12b91b67">This uses plain language as much as possible. Depending on your audience, you may wish to replace “central half” with “interquartile range”, or use phrases like “75th percentile”. Also, you can round 2048 to 2050 and 2103 to 2100 without losing anything of value. <a href="#514e6709-2685-453c-b148-94cd12b91b67-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="648cbfff-fbda-4131-a809-9dfea725a197">Note that the ‘whiskers’ of our box plot are slightly nonstandard: they show the 15th and 85th percentile responses. Whiskers are more commonly used to represent the 1.5 IQR value: from above the upper quartile (75th percentile), a distance of 1.5 times the interquartile range (IQR) is measured out and a whisker is drawn <em>up to</em> the largest observed data point from the dataset that falls within this distance. <a href="#648cbfff-fbda-4131-a809-9dfea725a197-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="a8f1ee3f-3a3b-498f-ac7c-f25b9ea3dddd">As a further subtlety, “the question sets do differ beyond definitions: only the HLMI questions are preceded by the instruction to “assume that human scientific activity continues without major negative disruption,” and the FAOL block asks a sequence of questions about the automation of specific occupations before asking about full automation of labor” (<a href="https://arxiv.org/abs/2401.02843">Thousands of AI Authors on the Future of AI</a>) <a href="#a8f1ee3f-3a3b-498f-ac7c-f25b9ea3dddd-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="40b92e73-777b-4d16-ac7f-66174261ef2a">In reality there are even more complications that I elide in the main text. If a set of probabilities contains both values of exactly 1, and values of exactly 0, the geometric mean of odds is undefined. If a one is present and there are no zeroes, the aggregate is one; and a zero is present and there are no ones, the aggregate is zero. However, floating point numbers by design have much more precision near zero than near one. For example, we can represent extremely small numbers like <code>1e-18</code>, but <code>1 - 1e-18</code> just gets represented as <code>1.0</code>. This means that very high probabilities get represented as 1 when equally extreme low probabilities do not get represented as zero. As a result, high probabilities get an “unfair advantage”. It should be possible to circumvent some of these problems by using alternative representations of the probabilities. However, many respondents directly give probabilities of 0% or 100% (as opposed to their fitted CDFs merely reaching these values). 
This poses a more fundamental problem for the geometric mean of odds. <a href="#40b92e73-777b-4d16-ac7f-66174261ef2a-link" aria-label="Jump to footnote reference 5"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="36fdf480-d735-4785-9483-7a68aff2be0b">I believe this is probably a theorem (with the possible exception of some degenerate cases), but I am not entirely sure since I have not attempted to actually write down or locate a proof. If you’ve got a proof or counter-example please contact me. <a href="#36fdf480-d735-4785-9483-7a68aff2be0b-link" aria-label="Jump to footnote reference 6"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="e052f82e-e51d-4c49-be81-e33a7efda5dc">I give examples only for the fixed-probabilities framing in the main text because it’s easier to explain in the context of the loss functions we are using, which all use probabilities. However, we can see similar phenomena when looking at the fixed-years data. These are 9 plots representative of the bottom half of fixed-years Gamma fits.<br><img loading="lazy" decoding="async" width="500" height="518" class="wp-image-3691" style="width: 500px;" src="http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_years_annotations_True.png" alt="" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_years_annotations_True.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_years_annotations_True-290x300.png 290w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_years_annotations_True-988x1024.png 988w, http://aiimpacts.org/wp-content/uploads/2024/12/9_prev_gamma_fits_Fixed_years_annotations_True-768x796.png 768w" sizes="auto, (max-width: 500px) 100vw, 500px" /><br>Since I am in this section aiming for expository clarity rather than the greatest rigour, I also elided the following complication in the main text. All distributions shown are Gammas fitted by previous authors, using the MSE of probabilities as the loss function. However, to produce the ranking of fits used to select which examples to plot, I used a different loss function. This was the MSE of years (horizontal direction) for the fixed-years plot, and the log loss for the fixed-probabilities plot. These loss functions make my examples more intuitive, while still being somewhat systematic (instead of cherry-picking examples). <a href="#e052f82e-e51d-4c49-be81-e33a7efda5dc-link" aria-label="Jump to footnote reference 7"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="03cf3499-e10c-403e-b780-3732fae8d50d">The log loss can be motivated by analogy to the Kullback-Leibler (KL) divergence between discrete distributions. For each point in a respondent’s CDF data, we can think of it as a binary probability distribution (p, 1-p). The fitted CDF gives us another binary distribution (q, 1-q) at that point. The KL divergence between these distributions would be<br><br>D<sub>KL</sub>(p||q) = p log(p/q) + (1-p) log((1-p)/(1-q)) = p log(p) &#8211; p log(q) + (1-p) log(1-p) &#8211; (1-p) log(1-q)<br><br>The log loss -[p log(q) + (1 &#8211; p) log(1 &#8211; q)] differs from this only by dropping the terms that don’t depend on q, and thus has the same minimum. 
However, this is merely an intuitive motivation: we are not actually comparing two discrete distributions, but rather measuring how well our continuous CDF matches specific points. <a href="#03cf3499-e10c-403e-b780-3732fae8d50d-link" aria-label="Jump to footnote reference 8"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="a39f4f4a-6a4d-4c66-abc3-9900e1d00e9d">Note that although the generalised gamma distribution has three parameters, as far as I can tell it does not have the flexibility to fit three arbitrary points of CDF data. I came to this conclusion by extensive empirical investigation, but I haven’t been able to locate or write a proof to conclusively establish this one way or another. Please write to me if you know the answer. By the way, I don’t know of any parametric 3-parameter distribution that has this property. I used flexible distributions for ESPAI because they are the only solution I am aware of. <a href="#a39f4f4a-6a4d-4c66-abc3-9900e1d00e9d-link" aria-label="Jump to footnote reference 9"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="0e1c4d58-978b-464d-80f9-f164c7337514">The code uses the paid MakeDistribution API, but a copy of all API responses needed to perform the analysis is stored in the repository. <a href="#0e1c4d58-978b-464d-80f9-f164c7337514-link" aria-label="Jump to footnote reference 10"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="b147a578-bb00-4eb5-b834-6b7e806152dd">I informally explored possible biases in the Gamma fits using the following histograms of residuals. While several of the residual distributions seem clearly biased, they also in most cases have 80% of the probability mass quite close to a residual of zero. I still do not fully understand why the effect of this data on aggregate CDFs is so muted, but I have not prioritised a more rigorous analysis.<br><img loading="lazy" decoding="async" width="800" height="533" class="wp-image-3699" style="width: 800px;" src="http://aiimpacts.org/wp-content/uploads/2024/12/prev_fits_bias_hist_FAOL.png" alt="" srcset="http://aiimpacts.org/wp-content/uploads/2024/12/prev_fits_bias_hist_FAOL.png 4500w, http://aiimpacts.org/wp-content/uploads/2024/12/prev_fits_bias_hist_FAOL-300x200.png 300w, http://aiimpacts.org/wp-content/uploads/2024/12/prev_fits_bias_hist_FAOL-1024x683.png 1024w, http://aiimpacts.org/wp-content/uploads/2024/12/prev_fits_bias_hist_FAOL-768x512.png 768w, http://aiimpacts.org/wp-content/uploads/2024/12/prev_fits_bias_hist_FAOL-1536x1024.png 1536w, http://aiimpacts.org/wp-content/uploads/2024/12/prev_fits_bias_hist_FAOL-2048x1365.png 2048w" sizes="auto, (max-width: 800px) 100vw, 800px" /> <a href="#b147a578-bb00-4eb5-b834-6b7e806152dd-link" aria-label="Jump to footnote reference 11"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="b7ab8693-8e6a-485f-a2d9-de48fdc21c2a">Due to the large framing effects of both tasks vs occupations, and fixed-years vs fixed-probabilities, which have been consistently observed, one may reasonably quarrel with describing this plot as showing “disagreement among respondents” or “the range of opinion among experts”. 
Part of why the range is so wide is that responses are highly sensitive to framing. Rather than saying experts <em>disagree</em> per se, purists might wish to say that expert opinion is undefined or unstable.<br><br>This is a rather philosophical point. The more practical version of it is to ask whether we should aggregate across these framings, or just present them separately.<br><br>My position (further discussed <a href="#Aggregating_across_the_task_and_occupation_framings">here</a>) is that while disaggregated results should also be available, aggregation is necessary to produce useful results. Aggregating things that have some commonalities and some differences is indeed inherent to science. While previous authors presented HLMI and FAOL separately, they did not present fixed-years and fixed-probabilities separately, which would be required if we take the anti-aggregation argument to its full conclusion. <a href="#b7ab8693-8e6a-485f-a2d9-de48fdc21c2a-link" aria-label="Jump to footnote reference 12"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="79fe81fe-96c1-4ed8-b302-574fa1ba2b76">Their methodology is different from what I used in the plots above, but yields very similar results for the median year. <a href="#79fe81fe-96c1-4ed8-b302-574fa1ba2b76-link" aria-label="Jump to footnote reference 13"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="a52b53bc-1346-4411-b073-2ffb89d84b32">Here is the full table: <a href="#a52b53bc-1346-4411-b073-2ffb89d84b32-link" aria-label="Jump to footnote reference 14"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>


<p>
<table>
        <thead>
          <tr>
            <th>Title</th>
            <th>Year</th>
            <th>Link</th>
            <th>ESPAI results cited</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>Artificial intelligence: Arguments for catastrophic risk</td>
            <td>2024</td>
            <td><a href="https://compass.onlinelibrary.wiley.com/doi/abs/10.1111/phc3.12964">Link</a></td>
            <td>HLMI only</td>
          </tr>
          <tr>
            <td>Safety cases for frontier AI</td>
            <td>2024</td>
            <td><a href="https://arxiv.org/abs/2410.21572">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Me, myself and AI: How gender, personality and emotions determine willingness to use Strong AI for self-improvement</td>
            <td>2024</td>
            <td><a href="https://www.sciencedirect.com/science/article/pii/S0040162524005584">Link</a></td>
            <td>HLMI only</td>
          </tr>
          <tr>
            <td>Theory Is All You Need: AI, Human Cognition, and Causal Reasoning</td>
            <td>2024</td>
            <td><a href="https://www.bu.edu/dbi/files/2024/08/FelinHolwegAug2024_SSRN.pdf">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Shared Awareness Across Domain‐Specific Artificial Intelligence</td>
            <td>2024</td>
            <td><a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/aisy.202300740">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Existential risk from transformative AI: an economic perspective</td>
            <td>2024</td>
            <td><a href="https://journals.vilniustech.lt/index.php/TEDE/article/view/21525">Link</a></td>
            <td>HLMI only</td>
          </tr>
          <tr>
            <td>Theory is all you need: AI, human cognition, and decision making</td>
            <td>2024</td>
            <td><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4737265">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Generative artificial intelligence usage by researchers at work</td>
            <td>2024</td>
            <td><a href="https://www.sciencedirect.com/science/article/pii/S0736585324000911">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>AI Horizon Scanning, White Paper p3395, IEEE-SA. Part I: Areas of Attention</td>
            <td>2024</td>
            <td><a href="https://arxiv.org/abs/2410.01808">Link</a></td>
            <td>Two tasks (build a payment processing site, fine-tune an LLM)</td>
          </tr>
          <tr>
            <td>Generative AI, Ingenuity, and Law</td>
            <td>2024</td>
            <td><a href="https://ieeexplore.ieee.org/abstract/document/10598190/">Link</a></td>
            <td>HLMI only</td>
          </tr>
          <tr>
            <td>AI Emergency Preparedness: Examining the federal government’s ability to detect and respond to AI-related national security threats</td>
            <td>2024</td>
            <td><a href="https://arxiv.org/abs/2407.17347">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Transformative AI, existential risk, and real interest rates</td>
            <td>2024</td>
            <td><a href="https://basilhalperin.com/papers/agi_emh.pdf">Link</a></td>
            <td>HLMI only</td>
          </tr>
          <tr>
            <td>Misrepresented Technological Solutions in Imagined Futures: The Origins and Dangers of AI Hype in the Research Community</td>
            <td>2024</td>
            <td><a href="https://ojs.aaai.org/index.php/AIES/article/view/31737">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Eliciting the Priors of Large Language Models using Iterated In-Context Learning</td>
            <td>2024</td>
            <td><a href="https://arxiv.org/abs/2406.01860">Link</a></td>
            <td>HLMI only</td>
          </tr>
          <tr>
            <td>Strategic Insights from Simulation Gaming of AI Race Dynamics</td>
            <td>2024</td>
            <td><a href="https://arxiv.org/abs/2410.03092">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Evolutionary debunking and value alignment</td>
            <td>2024</td>
            <td><a href="https://globalprioritiesinstitute.org/wp-content/uploads/Michael-T.-Dale-and-Bradford-Saad-Evolutionary-debunking-and-value-alignment.pdf">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Robust Technology Regulation</td>
            <td>2024</td>
            <td><a href="https://arxiv.org/abs/2408.17398">Link</a></td>
            <td>Extinction risk only</td>
          </tr>
          <tr>
            <td>Interpreting Affine Recurrence Learning in GPT-style Transformers</td>
            <td>2024</td>
            <td><a href="https://arxiv.org/abs/2410.17438">Link</a></td>
            <td>No numbers</td>
          </tr>
          <tr>
            <td>Malicious use of AI and challenges to psychological security: Future risks</td>
            <td>2024</td>
            <td><a href="https://russiancouncil.ru/en/analytics-and-comments/analytics/malicious-use-of-ai-and-challenges-to-psychological-security-future-risks/">Link</a></td>
            <td>HLMI and FAOL</td>
          </tr>
          <tr>
            <td>Grow Your Artificial Intelligence Competence</td>
            <td>2024</td>
            <td><a href="https://ieeexplore.ieee.org/abstract/document/10685842/">Link</a></td>
            <td>No numbers</td>
          </tr>
        </tbody>
      </table>
</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The purpose of philosophical AI will be: To orient ourselves in thinking</title>
		<link>http://aiimpacts.org/the-purpose-of-philosophical-ai-will-be-to-orient-ourselves-in-thinking/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Mon, 28 Oct 2024 16:46:53 +0000</pubDate>
				<category><![CDATA[Essay Competition on the Automation of Wisdom and Philosophy]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3639</guid>

					<description><![CDATA[Max Noichl 1 This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy. Summary In this essay I will suggest a lower bound for the impact that artificial intelligence <a class="mh-excerpt-more" href="http://aiimpacts.org/the-purpose-of-philosophical-ai-will-be-to-orient-ourselves-in-thinking/" title="The purpose of philosophical AI will be: To orient ourselves in thinking"></a>]]></description>
										<content:encoded><![CDATA[
<p></p>



<p><em>Max Noichl</em> <a href="#fn1" class="footnote-ref" id="fnref1"
role="doc-noteref"><sup>1</sup></a></p>
<p><em>This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy.</em></p>

<p><h3>Summary</h3> <em>In this essay I will suggest a lower
bound for the impact that artificial intelligence systems can have on
the automation of philosophy. Specifically I will argue that skepticism
is warranted about whether LLM-based systems similar to the best ones
available right now will be able to independently produce philosophy at
a level of quality and creativity that is interesting to us. But they
are clearly already able to solve medium-complexity language tasks in a
way that makes them useful to structure and consolidate the contemporary
philosophical landscape, allowing for novel and interesting ways to
orient ourselves in thinking.</em></p>

<p><h3>Introduction</h3></p>
<p>The purpose of philosophical AI will be: To orient ourselves in
thinking. This position is opposed to the view that the LLM-based
artificial intelligence systems which are at this point foreseeable
will autonomously produce philosophy that is of a high enough quality
and novelty to be interesting to us. In this essay I will briefly try to
make this position plausible. I will then sketch the alternative
direction in which I suspect the most impactful practical interaction of
philosophy and AI will go and present a pilot study of what this may
look like. Finally, I will argue that this direction can integrate well
into contemporary philosophical practice and solve some previously
unresolved desiderata.</p>

<p><h3>Autonomous production of philosophy</h3></p>
<p>The first idea that we might have when thinking about how artificial
intelligence might serve to automate philosophy is that the AI system is
going to philosophize for—<em>instead of</em>—us. And indeed the
currently best publicly available systems seem to show some basic
promise. They<a href="#fn2" class="footnote-ref" id="fnref2"
role="doc-noteref"><sup>2</sup></a> are able to recapitulate classic
philosophical arguments and thought experiments with reasonable,
although somewhat spotty, quality, and when vaguely prompted to opine on
topics of philosophical impact they are also able to identify classical
lines of argument.</p>
<p>But these abilities are very much in line with an understanding of
LLMs that sees them largely as sophisticated mechanisms for the
reproduction and adaptation of already present textual material, which
would of course be in stark contrast to the capabilities that are
arguably necessary for the production of truly novel and logically
coherent philosophy, namely strong abstract reasoning capabilities.</p>

<p><h3>Some grounds for scepticism</h3></p>
<p>To me it seems like the capability profile of the language models we
have seen so far is distinctly <em>weird</em>. They play chess, to some
degree convincingly, although not well, but they are abysmal at
tic-tac-toe. They can explain simulated annealing perfectly well, but
can’t tell me reliably which countries in Europe start with a ‘Q’.
Generally speaking, it seems to be hard to predict or intuit whether one
of the current systems we have available will be good at a task without
just trying it out. And of course, much harder still to predict what
they will be good at in the future.</p>
<p>But I do believe that we have reasonable grounds for at least a
certain amount of skepticism about whether really strong reasoning
capabilities are around the corner. First, when trying to get LLMs to
produce philosophical reasoning, it is common that they struggle to
transfer argument schemes to novel contexts and to generalize them to
domains that are not commonly used as examples in the literature. It
also seems hard to keep them arguing a coherent point and to maintain
truth and consistency through prolonged arguments. Finally, when
simulating philosophical debates between multiple LLM agents, I have
found them to be extremely stereotypical, repeating mostly stale
commonplaces, and failing to come up with novel argumentative
patterns—experiences which are in line with at least some lines of
research that question current systems’ abstract reasoning
capabilities.<a href="#fn3" class="footnote-ref" id="fnref3"
role="doc-noteref"><sup>3</sup></a></p>
<p><h3>A lower bound</h3></p>
<p>But publicly predicting that contemporary AI systems are unable to
ever achieve this or that specific task has been a good method to force
oneself into a public correction a few months later, or to keep
stubbornly denying the obvious in increasingly ridiculous fashion.<a
href="#fn4" class="footnote-ref" id="fnref4"
role="doc-noteref"><sup>4</sup></a></p>
<p>Therefore, instead of making any strong claims about the abilities of
current or future AI systems, I suggest that the most
<em>productive</em> way forward to consider the potential of automation
of philosophy is to articulate what is reliably achievable with the
systems that we have available now, and what plays into their current
abilities and sidesteps their faults. As I have mentioned, a general
formulation for the abilities of large language models will likely not
be forthcoming. But a relatively uncontroversial provisional formulation
might instead be something like: These systems excel at
medium-complexity language tasks, which are similar to tasks solved in
everyday language or well represented in public code bases. And of
course, as computer systems, they excel in doing these tasks over and
over again, many thousands of times.</p>
<p>The question we need to answer is thus, how philosophy might profit
from a process of automation that plays into these precise strengths.
Answering this question will provide us with a firm, unspeculative lower
bound of what is possible in the automation of philosophy.</p>
<p><h3>Making it more concrete</h3></p>
<p>I have given some reason to think that the most likely short-term
role for artificial intelligence within philosophy is not going to be
through independently reasoning non-human intelligences that directly
produce philosophy in a way that is on par, or superior to what we are
able to do now. Rather, I argued, philosophy will be altered by the
ability of artificial intelligence to integrate and structure large
quantities of thought, which might drastically increase the cohesion of
our collective philosophical enterprise. But this proposal might seem
somewhat abstract. To give a more concrete idea of what I am thinking
about, I have conducted a little pilot study.</p>
<p>For this pilot study, I have scraped the whole open-access
bibliography of “philosophy of artificial intelligence” available on the
PhilArchive.<a href="#fn5" class="footnote-ref" id="fnref5"
role="doc-noteref"><sup>5</sup></a> I have also searched for all
articles containing ‘artificial intelligence’, ‘machine learning’, or
‘deep learning’ in the last 20 years among ten highly reputable
Anglophone philosophy journals and integrated them into my dataset.</p>
<p>I then filtered out unusable texts—texts that were obviously not
philosophy<a href="#fn6" class="footnote-ref" id="fnref6"
role="doc-noteref"><sup>6</sup></a>, texts that were badly OCRed, and
texts that were not in the English language, leaving me with a sample of
1,025 full-text articles.</p>
<p>In the first pass, I used GLiNER-large<a href="#fn7"
class="footnote-ref" id="fnref7" role="doc-noteref"><sup>7</sup></a>, a
flexible LLM-based Named Entity Recognition system, to search for
entities that conformed to the working definition of “a philosophical
theory or philosophical position, a view that attempts to explain or
account for a particular problem in philosophy, or a named
argument.”</p>
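<p>For concreteness, here is a minimal sketch of what such a first pass could look like in Python. The package interface and the checkpoint name are illustrative assumptions on my part, not a record of the exact code used for the pilot.</p>
<pre><code># Illustrative sketch only: the gliner package and the checkpoint id are assumptions.
from gliner import GLiNER

# The working definition quoted above, used as the entity label.
LABEL = ("a philosophical theory or philosophical position, a view that "
         "attempts to explain or account for a particular problem in "
         "philosophy, or a named argument")

model = GLiNER.from_pretrained("urchade/gliner_large-v2")  # assumed checkpoint id

def extract_candidate_positions(article_text, threshold=0.5):
    """Return the candidate position strings found in one article."""
    entities = model.predict_entities(article_text, [LABEL], threshold=threshold)
    return sorted({entity["text"] for entity in entities})
</code></pre>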
<p>This first pass extracted from each article a number of relatively
low-quality but passable candidate positions, things like ‘naturalized
moral psychology’, ‘naturalistic framework’, ‘moderate defense’, ‘human
nature is bad’, ‘schools of thought’, ‘Confucian tradition’, etc.</p>
<p>In a second pass, these articles were fed to GPT-4o, which searched
them for philosophical positions and parsed them into a structured data
format, which contained a label for the position, a definition of the
position that had to be drawn from the text, a number between -1 and 1
indicating whether the author was arguing in favor or against the
position, with 0 indicating neutrality, and the exact passage at which
this stance became apparent. The candidate positions extracted by
GLiNER in the first pass were used to flag potentially relevant
positions, and these were then fed to GPT-4o to keep the naming in the
dataset consistent.</p>
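<p>Concretely, each detected position can be pictured as a small record along the following lines. The field names are illustrative and simply mirror the description above, rather than reproducing the exact schema used.</p>
<pre><code># Illustrative record for one extracted position; field names are assumptions.
from dataclasses import dataclass

@dataclass
class PositionMention:
    label: str        # e.g. "functionalism" or "the Chinese room argument"
    definition: str   # a definition drawn from the article text itself
    stance: float     # -1 (argued against) to 1 (argued for), 0 = neutral
    passage: str      # the exact passage at which the stance becomes apparent
</code></pre>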
<p>At the end of this process, for our 1,025 papers, I had gathered a
total of 6,059 distinct positions, which contained named positions and
arguments like ‘functionalism’, ‘computationalism’, ‘the Chinese room
argument’, ‘connectionism’, etc.</p>
<p>There are a number of potentially interesting analyses that are now
possible on this dataset. But for this pilot project, I conducted an
overview mapping, in which I combined two nearest-neighbor graphs, one
that linked articles to articles with similar position profiles
(articles arguing for, and denying the same things), and one based on
semantic similarity, which was determined via embeddings produced
through the all-mpnet-base-v2<a href="#fn8" class="footnote-ref"
id="fnref8" role="doc-noteref"><sup>8</sup></a> language model. These
graphs were combined, resulting in a new graph in which thematically
similar texts are moved close together in the global picture, while
groups of texts that are thematically similar but argue for different
position profiles are locally split apart. This combined graph was then
reweighted and laid out using uniform manifold approximation and
projection in two dimensions.<a href="#fn9" class="footnote-ref"
id="fnref9" role="doc-noteref"><sup>9</sup></a></p>
<p>I then applied HDBSCAN, a clustering algorithm, to this layout and
marked the most relevant positions for each cluster on the
two-dimensional layout. The results can be explored below:</p>
<figure>
<a href="http://aiimpacts.org/wp-content/uploads/2024/10/ai_phil_positions.png"><img decoding="async" src="http://aiimpacts.org/wp-content/uploads/2024/10/ai_phil_positions.png"
alt="Position-clusters in the philosophical literature on AI (Pilot Study)." /></a>
<figcaption aria-hidden="true"><strong>Position-clusters in the philosophical
literature on AI (Pilot Study).</strong></figcaption>
</figure>
<br>
<p>The clusters are marked with dashed lines, the grey points represent
individual papers, and the positions are marked on top of them, with
blue positions being those that are positively held by the authors in
the cluster and red positions being denied.</p>
<p>We note that the map reproduces a sensible structure of the whole
field with questions that relate artificial intelligence to the
philosophy of mind towards the upper left, questions that relate to the
moral status of artificial intelligence towards the center right, and
questions about the societal impact of artificial intelligence and the
connected ethical questions towards the bottom.</p>
<p>We also note that in quite a few instances we find clearly
oppositional local structures, for example with clusters denying or
endorsing the Chinese room argument together with the appropriate
associated stances on computationalism or universal realizability
(middle-top). Similar things are true for functionalism (left-middle),
as well as utilitarianism and virtue ethics (right-lower middle).</p>
<p><h3>Philosophical relevance</h3></p>
<p>I think that this pilot shows that, while quite a bit of additional
work is evidently needed, contemporary LLM systems can reliably parse
large amounts of philosophical text into structured representations that
can be used to map out the argumentative landscapes. And while this is
not automated philosophical reasoning, this is certainly not nothing.<a
href="#fn10" class="footnote-ref" id="fnref10"
role="doc-noteref"><sup>10</sup></a> We commonly think about philosophy
as a large, intractable net of interlinked arguments, where each single
premise, if denied or accepted, has numerous implications for others,
opening and closing paths to various positions—with philosophy arguably
being the collective task of maintaining and refining this
structure.</p>
<p>But this structure is never made explicit, and each philosopher,
working somewhat in isolation, tends to produce philosophy in essays, which add to
this whole structure only in a very local and convoluted fashion. The
largest promise of automated philosophy as we can foresee it at this
point, is thus to make this process explicit, to draw out the collective
structure into the open, make it accessible, and, to borrow a phrase
from Kant, to find a novel way <em>to orient ourselves in
thinking.</em><a href="#fn11" class="footnote-ref" id="fnref11"
role="doc-noteref"><sup>11</sup></a></p>
<p><em>I want to thank Christopher Zosh, Scott Page, Johannes Marx, John
Miller, Melanie Mitchell, Arseny Moskvichev, and Robert Ward as well as
my advisors Dominik Klein and Erik Stei for helpful discussions during
the preparation of this essay. This project is part of my PhD at the
department of theoretical philosophy at Utrecht University, and will be
available soon in article form, alongside the code. Feel free to get in
touch or learn more about my work via <a
href="https://www.maxnoichl.eu/">https://www.maxnoichl.eu/</a></em></p>
<p><h3>Footnotes</h3></p>
<section id="footnotes" class="footnotes footnotes-end-of-document"
role="doc-endnotes">
<hr />
<ol>
<li id="fn1"><p>Disclosure of AI usage: OpenAI’s GPT-4o was used as a
coding assistant. OpenAI’s Whisper model was used to partially dictate
this essay. OpenAI’s GPT-4o API was used for the presented analysis.<a
href="#fnref1" class="footnote-back" role="doc-backlink"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></p></li>
<li id="fn2"><p>All the (informal) tests I have made in the process of
writing this essay have been conducted on Anthropic’s Claude Opus model,
OpenAI’s GPT-4o &amp; GPT-4 Turbo, and Meta’s Llama 3 70B.<a
href="#fnref2" class="footnote-back" role="doc-backlink"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></p></li>
<li id="fn3"><p>Lewis and Mitchell. <em>Using Counterfactual Tasks to
Evaluate the Generality of Analogical Reasoning in Large Language
Models.</em> 2024. arXiv: <a
href="https://arxiv.org/abs/2402.08955">2402.08955</a>; Moskvichev,
Odouard, and Mitchell. <em>The ConceptARC Benchmark: Evaluating
Understanding and Generalization in the ARC Domain.</em> 2023. arXiv: <a
href="https://arxiv.org/abs/2305.07141">2305.07141</a><a href="#fnref3"
class="footnote-back" role="doc-backlink">&#x21a9;︎</a></p></li>
<li id="fn4"><p>E. g. OpenAI’s o1-preview model was released a few weeks
after the writing of this piece, and apparently achieved a marked
increase in reasoning capabilities. When I gave it the final draft of
this article to look over for typos, it did flag the “European countries
with Q”-example I gave earlier as misleading, as there are, as
o1-preview correctly noted, no such countries. I nonetheless currently
believe that the arguments in this essay still hold.<a href="#fnref4"
class="footnote-back" role="doc-backlink"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></p></li>
<li id="fn5"><p>Philosophy of Artificial Intelligence &#8211; Bibliography
edited by Eric Dietrich &#8211; <a
href="https://philarchive.org/browse/philosophy-of-artificial-intelligence">PhilArchive</a>
<em>(accessed: 8.7.2024)</em><a href="#fnref5" class="footnote-back"
role="doc-backlink">&#x21a9;︎</a></p></li>
<li id="fn6"><p>Many material machine-learning and computer-interaction
articles somehow end up on PhilPapers, as well as a large collection of
random things.<a href="#fnref6" class="footnote-back"
role="doc-backlink"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></p></li>
<li id="fn7"><p>Zaratiana et al. *GLiNER: Generalist Model for Named
Entity Recognition Using Bidirectional Transformer. 2023. arXiv: <a
href="https://arxiv.org/abs/2311.08526">2311.08526</a><a href="#fnref7"
class="footnote-back" role="doc-backlink">&#x21a9;︎</a></p></li>
<li id="fn8"><p>Using the accessible implementation provided by Reimers
and Gurevych. <em>Sentence-BERT: Sentence Embeddings Using Siamese
BERT-Networks.</em> 2019. arXiv: <a
href="https://arxiv.org/abs/1908.10084">1908.10084</a><a href="#fnref8"
class="footnote-back" role="doc-backlink">&#x21a9;︎</a></p></li>
<li id="fn9"><p>McInnes, Healy, and Melville. <em>UMAP: Uniform Manifold
Approximation and Projection for Dimension Reduction.</em> 2018. arXiv:
<a href="https://arxiv.org/abs/1802.03426">1802.03426</a><a
href="#fnref9" class="footnote-back" role="doc-backlink">&#x21a9;︎</a></p></li>
<li id="fn10"><p>And as a potential high-level interface between AI
generated material and philosophers, it might also be crucial for the
development of computer assisted philosophy, if the reasoning
capabilities of the AI systems were to drastically improve.<a
href="#fnref10" class="footnote-back" role="doc-backlink"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></p></li>
<li id="fn11"><p>Kant. Was Heißt: Sich Im Denken Orientiren? 1786. <a
href="https://korpora.org/Kant/aa08/131.html">Source</a><a
href="#fnref11" class="footnote-back" role="doc-backlink">&#x21a9;︎</a></p></li>
</ol>
</section>

]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Machines and Moral Judgment</title>
		<link>http://aiimpacts.org/machines-and-moral-judgment/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Sun, 27 Oct 2024 17:24:20 +0000</pubDate>
				<category><![CDATA[Essay Competition on the Automation of Wisdom and Philosophy]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3635</guid>

					<description><![CDATA[By Jacob Sparks This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy. §1 Good AGI The explicit goal of most major AI labs is to create artificial general <a class="mh-excerpt-more" href="http://aiimpacts.org/machines-and-moral-judgment/" title="Machines and Moral Judgment"></a>]]></description>
										<content:encoded><![CDATA[
<p>By Jacob Sparks</p>



<p><em>This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy.</em></p>



<h3 class="wp-block-heading">§1 Good AGI</h3>



<p>The explicit goal of most major AI labs is to create artificial general intelligence (AGI): machines that can assist us across a wide range of tasks. Additionally, they all want to build systems that are safe, fair and beneficial to their users – machines that are <em>good</em>. But, building machines that are both generally intelligent and good requires building machines that can “think” about what’s good, that make their own moral judgments. And this raises both philosophical and technical questions that we have barely started to address.</p>



<h3 class="wp-block-heading">§2 What is a Moral Judgment?</h3>



<p>Moral judgments, in the sense I intend, are judgments with <em>moral content</em>. They are about what is right or good in a non-reductive sense. Judgments of this kind are philosophically puzzling. They are where thought becomes practical, where the cognitive and conative aspects of intelligence come together. They raise difficult questions: how are they related to motivation and action? Can they be said to be true or false? If so, is their truth objective, or is it determined ultimately by our attitudes? What is the proper method for resolving disputes about them? What are they even about?</p>



<p>In machine ethics, “moral judgment” often refers to any kind of judgment that is morally significant. In this sense, we can speak of the “moral judgments” current AI systems make when determining risk scores, diagnosing disease, or driving a car. Or we could talk more speculatively about the “moral judgments” machines would need to make to determine a criminal sentence, treat a disease, or buy a car on your behalf. Asking if machines can make these kinds of “moral judgments” is really just asking about their trustworthiness performing these morally significant tasks. But nothing in these debates touches on the question of how machines can make moral judgments in the sense I intend.</p>



<p>When some speak of building “moral machines” or about “putting ethical principles into machines” they are thinking about building systems that act in accordance with some particular ethical theory – Utilitarian, Rossian, Kantian, Contractualist, etc. The major debate here is whether these theories can be expressed in sufficiently precise ways to govern the behavior of an AI. But machines built along these lines would not be making moral judgments, in my sense, even if their behavior was “moral” according to one or more of these theories. If someone slavishly interpreted every moral question as being about utility maximization, prima facie duty satisfaction, maxim universalizability, hypothetical consent, etc., and failed to see that, whatever their merits, none of these theories captures the <em>meaning</em> of “good” or “right,” they would fail to make a moral judgment. One must be able to wonder if, after all, it would be right or good to maximize, satisfy, etc.<span id='easy-footnote-1-3635' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/machines-and-moral-judgment/#easy-footnote-bottom-1-3635' title='Some have also looked at particularist approaches to building “moral machines.” According to particularism, there are no useful general moral principles. So on these views, we would need to find ways for machines to learn what is right or good that didn’t involve the use of such principles. The point I’m making in this paragraph, however, would still remain: one could build these kinds of particularist “moral machines,” without building machines that make judgments with moral content.'><sup>1</sup></a></span>



<p>Even if we grant that one of these traditional moral theories is correct, I’m not making the trivial claim that good AGI requires machines getting it right about moral questions. There is no guarantee that when you make moral judgments you get it right. What’s important about moral reasoning is that it allows you to hold your own motivations at a distance, as presenting possibilities that you can choose to act on or not. Moral judgments are a way for the cognitive aspects of intelligence to shape the conative ones in a process that overcomes this reflective distance. If machines are going to behave well across the range of use cases intended for AGI, they’ll need to make these kinds of fallible and philosophically puzzling moral judgments. And to build such machines, we’ll have to learn much more about what moral judgments are and how they work.</p>



<h3 class="wp-block-heading">§3 Moral Judgments Are Strange</h3>



<p>Moral judgments are philosophically puzzling for two main reasons. The first has to do with their <em>form</em>. Like beliefs, they attempt to represent an independent reality. We’re trying to get it right when we make moral judgments. We want our moral judgments to be <em>true</em> and this gives them a “mind to world” direction of fit. But moral judgments are also like desires. They aim to change reality, and this gives them a “world to mind” direction of fit. We want to do what we judge to be right or good. We often act in accordance with and <em>because of</em> our moral judgments. They are, as some philosophers put it, “intrinsically motivating.” But, according to a widely accepted doctrine called “The Humean Theory of Motivation,” nothing could be both a belief and a desire, since each has a different direction of fit. According to the Humean Theory, beliefs and desires have a necessary and distinct role to play in the explanation of action. But moral judgments seem to muddy that distinction.</p>



<p>The second puzzle has to do with the <em>content</em> of moral judgments. Moral facts – the things we are judging about – seem to be both fully grounded in and yet somehow independent of natural facts. On the one hand, if something is good or right, it is good or right in virtue of other natural properties that it has. Every action that’s right is right <em>because</em> it keeps a promise, makes someone happy, relieves suffering, etc. That’s why constructing moral theories, where we attempt to characterize moral properties in terms of natural properties, is a project that makes sense. On the other hand, being good or right seems to be something above and beyond any natural property. However you explain why some action is right, you always mean something more by “right” than what you cite in your explanation. Even if some right act is right because it keeps a promise, when you call it “right,” you don’t just mean “keeps a promise.”&nbsp; Otherwise you’d just be repeating yourself. Moreover, we all recognize that sometimes it isn’t right to keep a promise. So, how could anything have the content that moral judgments purport to have, something that is both grounded in and independent of non-moral facts?</p>



<p>There are many who attempt to resolve these puzzles and their attempts comprise most of what philosophers call “metaethics.” Some metaethicists think moral judgments really are just desires (with no objective correctness conditions), or really are just beliefs (with no intrinsic motivation), or both (denying the Humean Theory). Some think the contents of moral judgments really are just natural facts (and so not independent of natural facts) or that some of them are not dependent on any natural fact (and so not grounded in natural facts). But even if we accept one or another of these solutions, we shouldn’t lose our appreciation of the initial puzzles. These puzzles show us that moral judgments are theoretically strange, but they also show us how and why moral judgments are practically important.</p>



<p>The capacity to make moral judgments involves a kind of active reflection. When we think about what’s good or right, we are stepping back and taking stock of our inclinations. Whatever we might want or intend to do, we can ask, “yes, but would it be good?” No matter how we describe our action we can ask, “yes, but would it be right?” And, importantly, how we answer those questions matters to us and to what we do. Moral judgments allow us to ask potent questions about any motivation or any description under which we might act. They give us both a kind of freedom from our inclinations and an external standard for our actions to live up to. Without the capacity to think in this way, we’d be like animals.</p>



<p>If machines could make moral judgments, they too would have a kind of freedom. Some might find that problematic. They would prefer generally intelligent machines to only pursue the goals we give them or to be otherwise bound to human needs, desires, and aims. But machines that made moral judgments would also hold themselves to a standard that is independent of any of their (or our) motivations. And that is precisely what a machine needs to do in order to be a good AGI.</p>



<h3 class="wp-block-heading">§4 Good AGI Requires Moral Judgment</h3>



<p>The basic argument that good AGI requires the capacity to make moral judgments involves a generalization of what Stuart Russell calls “The King Midas Problem.” Midas came to regret his wish that everything he touched turn to gold when his food, drink and daughter were turned to gold as well.&nbsp;</p>



<p>In the context of AGI, Russell uses this allegory to illustrate the idea that “the achievement of … any fixed objective can result in arbitrarily bad outcomes.” Tell an intelligent machine to cure cancer, and it might induce tumors in every human to be able to conduct more experiments; tell the machine to get you from A to B as quickly as possible, and it might jostle you catastrophically, etc. Russell’s solution to this problem is to build what he calls “beneficial AI.” These are machines designed to achieve, not some fixed objective, but our objectives. According to Russell, the machine’s only<em> </em>goal should be to satisfy our preferences, it should be uncertain about what those preferences are and should learn about our preferences by observing our behavior.</p>



<p>Russell’s approach is promising. Machines designed along these lines partially avoid the King Midas Problem, since we don’t need to specify any objective for them. But it is only partial avoidance. Humans can have preferences for all manner of terrible things, and optimizing on any objective, even one that remains unspecified and must be learned, can have disastrous results. Even when we take our aggregate or collective preferences, optimizing for their satisfaction can lead to very bad outcomes. At various times in history, the collective preferred to put some people in subservient roles on the basis of their gender or race. Today we collectively prefer to treat animals in horrific ways.</p>



<p>Russell is aware of this issue. He asks, “what should machines learn from humans who enjoy the suffering of others?” His answer is that, since these kinds of evil preferences would involve the frustration of other human preferences, there will naturally be some discount rate on their satisfaction. The only real question Russell sees here is about the balance between loyal AI that focuses exclusively on the preferences of some person or set of persons, and utilitarian AI that tries to maximize everyone’s utility.</p>



<p>This response (as well as Russell’s choice to call his approach “Beneficial AI”) indicates a failure to appreciate the difference between the non-moral question, “Does it satisfy a preference?” and the moral question, “Is it good?” This distinction is essential. Evil preferences should count for nothing, even if everyone shares them. All objectives, even ones machines learn from humans, should be subject to the kind of reflective scrutiny inherent to moral thought.</p>



<p>When machines operate in narrow contexts, the meaning of a term like “good” can be given a sufficiently reductive analysis. Playing chess, assuming we’re trying to win, a good chess move just is a move that makes winning more likely. But AGI does not operate in a narrow context. A good move for a generally intelligent machine cannot be specified – that is Russell’s insight. But neither can a good move for a generally intelligent machine simply be read off human preferences. When we’re talking about the wide context of AGI, the only move that is always good is a good move. If an AGI can’t work with some non-reductive sense of “good,” it won’t be a good AGI.</p>



<h3 class="wp-block-heading">§5 But How?</h3>



<p>Unfortunately, it isn’t at all clear what we’re doing when we think something is good and it isn’t clear how to build machines that can do the same. I’ve said moral judgment involves a kind of active reflection. But what can bring reflection to an end? And how can any reflection affect<em> </em>what it reflects? Importantly, in answering these questions and characterizing moral judgment, we can’t be content with the kinds of answers philosophers usually give. To hear that a moral judgment is a certain type of belief or a certain type of desire does not help us design artificial agents that can make such judgments. We need to speak the language of the people building AGI. However, since metaethicists tend to disagree about the details and since expressing philosophical theories of moral judgment in the precise terms required by computer science is exceptionally difficult, what I say here will be highly speculative.</p>



<p>One potentially promising paradigm comes from reinforcement learning (RL). Reinforcement agents learn to maximize a reward by interacting with their environment. They have an ability to sense the state of their environment and to take actions to affect that environment. Their goals are represented by a reward function that returns some value for each possible &lt;state, action&gt; pair. The central assumption of reinforcement learning – sometimes called the reward hypothesis – is that any goal can be represented as an attempt to maximize some suitably chosen reward function. Doing what’s right might be thought of as the ultimate goal of any agent capable of moral judgment. So, if the reward hypothesis is correct, there should be some RL agent who succeeds in making moral judgments.</p>



<p>There are many different variations on the basic learning problem faced by reinforcement agents. The environment may be deterministic or stochastic. The agent may or may not have a model of the environment that predicts what transitions will take place given various actions. The agent may balance present and future reward in different ways. The policy that agents use to select an action may be deterministic, selecting a specific action for each state of the environment, or stochastic, selecting a probability distribution over actions for each state. The reward agents receive may come with greater or lesser frequency. The agent may or may not have a value function that predicts future reward, given a specific policy. Which kind of RL agent, operating in which kind of environment, would succeed in making moral judgments? Where in these formalisms can we locate the moral judgment?&nbsp;</p>
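<p>To keep these moving parts in view, the basic objects can be written down as follows. This is purely illustrative notation for the setup just described, not a proposal about how a morally competent agent should be built.</p>
<pre><code># Illustrative notation for the pieces of the standard RL setup described above.
from typing import Callable, Dict, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

# Reward hypothesis: any goal is an attempt to maximize some reward function
# defined over (state, action) pairs.
RewardFn = Callable[[State, Action], float]

# A deterministic policy selects one action per state...
DeterministicPolicy = Callable[[State], Action]
# ...while a stochastic policy returns a probability distribution over actions.
StochasticPolicy = Callable[[State], Dict[Action, float]]

# A value function predicts future reward from a state, given a fixed policy.
ValueFn = Callable[[State], float]
</code></pre>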



<p>A reinforcement agent’s reward function is something that is both “objective” in the sense that it isn’t determined by the agent and “intrinsically motivating” in that it determines the policy the agent learns and the actions they ultimately take. However, an agent who has a specified reward function seems to lack the kind of agency required to make moral judgments, since they lack reflective distance from the goal of maximizing their specified reward.</p>



<p>This is similar to the problem of trying to build “moral machines” by using supervised learning to predict the moral judgments of humans. Systems designed along these lines would not be holding their own motivations at arm’s length in the way moral judgment requires. Moreover, this approach risks calcifying moral thought, since machines would be aping the moral judgments of imperfect humans at a particular time and place. True moral reasoning is more dynamic and adaptive.</p>



<p>More promising would be RL agents who were uncertain about their reward function and had to learn about it through their actions. This is what Russell proposes. But the nature of this uncertainty is critical. On Russell&#8217;s view, machines should be <em>initially</em> uncertain about their reward function and should learn about it by observing human behavior. He admits that, with enough observation, an RL agent may become completely confident about the human reward it aims to maximize. However, these kinds of agents would lack the kind of reflective distance characteristic of moral judgment. Even if a machine is certain that some course of action would maximize human reward, it should still be able to ask if it is right to pursue it.</p>
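<p>One way to picture the worry: imagine the machine maintaining a probability over a handful of candidate reward functions and updating it on observed human choices. The sketch below uses a standard Boltzmann-rational likelihood purely as an illustration; it is not Russell's own formulation. With enough observations the posterior typically concentrates on a single candidate, which is exactly the kind of convergence at issue here.</p>
<pre><code># Illustration: a posterior over candidate reward functions concentrating
# as human choices are observed (Boltzmann-rational likelihood assumed).
import math

def update_posterior(prior, candidates, observations, beta=5.0):
    """prior: {name: probability}; candidates: {name: reward_fn(state, action)};
    observations: list of (state, chosen_action, available_actions)."""
    posterior = dict(prior)
    for state, chosen, available in observations:
        for name, reward in candidates.items():
            weights = [math.exp(beta * reward(state, a)) for a in available]
            likelihood = math.exp(beta * reward(state, chosen)) / sum(weights)
            posterior[name] *= likelihood
        total = sum(posterior.values())
        posterior = {name: p / total for name, p in posterior.items()}
    return posterior
</code></pre>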



<p>We could imagine agents who are always uncertain about the reward they are trying to maximize. But what kind of uncertainty is needed? Is it the kind of uncertainty we can express as a probability distribution over different reward functions or is it a deeper kind of uncertainty that resists such characterization? What mechanism can assure that some degree of uncertainty persists? How should machines choose a policy given the persistent kind of uncertainty that moral concepts seem to engender?</p>



<p>Even if we had satisfactory answers to these questions, other complications remain. Unlike the contents of our moral judgments, an RL agent’s reward is not something that is both grounded in, and also independent of, the environment. Likewise, while an RL agent’s prediction of future reward – its value function – shares some features with moral judgment in being both belief-like and desire-like, it doesn’t seem to achieve the reflective distance indicative of moral judgment. An agent’s value function is not a way for them to hold a mirror up to their own motivations and decide which to endorse and which to reject.</p>



<p>Finally, in applications of RL, actions are usually individuated in simple ways – a move in a chess game, selecting the next word, or the next piece of content, etc. But when humans act, our actions are individuated by the knowledge, motives and intentions we bring to them. One and the same move in the chess game might be a blunder, a way to keep the game interesting, a kindness shown to a child, or an attempt to hustle an opponent. If we want to build machines that make moral judgments, we will need to think about their actions in more sophisticated ways.</p>



<h3 class="wp-block-heading">§6 The Path Forward</h3>



<p>Despite the concerns I’ve raised, I see no reason to think building machines that make moral judgments is impossible. We may be able to find computationally useful notions of agent, action, reward, value and uncertainty that will allow us to build machines that have reflective distance from their own motivations and that hold themselves to an external standard that resists specification. If we are going to progress along the path to good AGI, we need to confront the philosophical puzzles raised by moral judgment in the unfamiliar context of machine learning. This project is just beginning.</p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Towards the Operationalization of Philosophy &#038; Wisdom</title>
		<link>http://aiimpacts.org/towards-the-operationalization-of-philosophy-wisdom/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Sun, 27 Oct 2024 17:21:02 +0000</pubDate>
				<category><![CDATA[Essay Competition on the Automation of Wisdom and Philosophy]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3634</guid>

					<description><![CDATA[By Thane Ruthenis This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy. Summary Philosophy and wisdom, and the processes underlying them, currently lack a proper operationalization: a set <a class="mh-excerpt-more" href="http://aiimpacts.org/towards-the-operationalization-of-philosophy-wisdom/" title="Towards the Operationalization of Philosophy &#38; Wisdom"></a>]]></description>
										<content:encoded><![CDATA[
<p>By Thane Ruthenis</p>



<p><em>This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy.</em></p>



<h2 class="wp-block-heading">Summary</h2>



<p>Philosophy and wisdom, and the processes underlying them, currently lack a proper <em>operationalization</em>: a set of robust formal or semi-formal definitions. If such definitions were found, they could be used as the foundation for a strong methodological framework. Such a framework would provide clear guidelines for how to engage in high-quality philosophical/wise reasoning and how to evaluate whether a given attempt at philosophy or wisdom was a success or a failure.</p>



<p>To address that, I provide candidate definitions for philosophy and wisdom, relate them to intuitive examples of philosophical and wise reasoning, and offer a tentative formalization of both concepts. The motivation for this is my belief that the lack of proper operationalization is the main obstacle to both (1) scaling up the work done in these domains (i. e., creating a bigger ecosystem that would naturally attract funding), and (2) automating them.</p>



<p>The discussion of philosophy focuses on the tentative formalization of a specific <em>algorithm</em> that I believe is central to philosophical thinking: the algorithm that allows humans to derive novel ontologies (conceptual schemes). Defined in a more fine-grained manner, the function of that algorithm is “deriving a set of assumptions using which a domain of reality could be decomposed into subdomains that could be studied separately”.</p>



<p>I point out the similarity of this definition to <a href="https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1">John Wentworth’s operationalization of natural abstractions</a>, from which I build the formal model.</p>



<p>From this foundation, I discuss the discipline of philosophy more broadly. I point out instances where humans seem to employ the “algorithm of philosophical reasoning”, but which don’t fall under the standard definition of “philosophy”. In particular, I discuss the category of research tasks variously called “qualitative” or “non-paradigmatic” research, arguing that the core cognitive processes underlying them are implemented using “philosophical reasoning” as well.</p>



<p>Counterweighting that, I define philosophy-as-a-discipline as a special case of such research. While “qualitative research” within a specific field of study focuses on decomposing the domain of reality <em>within</em> that field’s remit, “philosophy” focuses on decomposing reality-as-a-whole (which, in turn, produces the previously mentioned “specific fields of study”).</p>



<p>Separately, I operationalize wisdom as meta-level cognitive heuristics that take object-level heuristics for planning/inference <em>as inputs</em>, and output predictions about the real-world consequences of an agent which makes use of said object-level heuristics. I provide a framework of agency in which that is well-specified as “inversions of inversions of environmental causality”.</p>



<p>I close things off with a discussion of whether “human-level” and “superhuman” AIs would be wise/philosophical (arguing yes), and what options my frameworks offer regarding scaling up or automating both types of reasoning.</p>



<h2 class="wp-block-heading">1. Philosophical Reasoning</h2>



<p>One way to define philosophy is “the study of confusing questions”. Typical philosophical reasoning happens when you notice that you have some intuitions or nagging questions about a domain of reality which hasn’t already been transformed into a formal field of study, and you follow them, attempting to gain clarity. If successful, this often results in the creation of a new field of study focused solely on that domain, and the relevant inquiries stop being part of philosophy.</p>



<p>Notable examples include:</p>



<ul class="wp-block-list">
<li>Physics, which started as “natural philosophy”.</li>



<li>Chemistry, which was closely related to a much more philosophical “alchemy”.</li>



<li>Economics, rooted in moral philosophy.</li>



<li>Psychology, from philosophy of mind.</li>
</ul>



<p>Another field that serves as a good example is <a href="https://intelligence.org/files/TechnicalAgenda.pdf">agent foundations</a><span id='easy-footnote-1-3634' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/towards-the-operationalization-of-philosophy-wisdom/#easy-footnote-bottom-1-3634' title='A niche field closely tied to AI research, which attempts to formalize the notion of generally intelligent agents capable of pursuing coherent goals across different contexts, domains, and time scales.'><sup>1</sup></a></span>, for those readers familiar with it.</p>



<p>One notable feature of this process is that the new fields, once operationalized, become decoupled from the rest of reality by certain assumptions. A focus on laws that apply to all matter (physics); or on physical interactions of specific high-level structures that are only possible under non-extreme temperatures and otherwise-constrained environmental conditions (chemistry); or on the behavior of human minds; and so on.</p>



<p>This isolation allows each of these disciplines to be studied <em>separately</em>. A physicist doesn’t need training in psychology or economics, and vice versa. By the same token, a physicist mostly doesn’t need to engage in interdisciplinary philosophical ponderings: the philosophical work that created the field has already laid down the conceptual boundaries beyond which physicists mostly don’t <em>need</em> to go.</p>



<p>The core feature underlying this overarching process of philosophy is the aforementioned “philosophical reasoning”: the cognitive algorithms that implement our ability to <em>generate</em> valid decompositions of systems or datasets. Formalizing these algorithms should serve as the starting point for operationalizing philosophy in a more general sense.</p>



<h3 class="wp-block-heading">1A. What Is an Ontology?</h3>



<p>In the context of this text, an “ontology” is a decomposition of some domain of study into a set of higher-level concepts, which characterize the domain in a way that is compact and comprehensive, and that can be used to produce models with high predictive accuracy.</p>



<p>In more detail:</p>



<ul class="wp-block-list">
<li><strong>“Compactness”:</strong> The ontology has fewer “moving parts” (concepts, variables) than a full description of the corresponding domain. Using models based on the ontology for making predictions requires a dramatically lower amount of computational or cognitive resources, compared to a “fully detailed” model.</li>



<li><strong>“Accuracy”</strong>: An ontology-based model produces predictions about the domain that are fairly accurate at a high level, or have a good upper bound on error.</li>



<li><strong>“Comprehensiveness”</strong>: The ontology is valid for all or almost all systems that we would classify as belonging to the domain in question, and characterizes them according to a known, finite family of concepts.</li>
</ul>



<p>Chemistry talks about atoms, molecules, and reactions between them; economics talks about agents, utility functions, resources, and trades; psychology recognizes minds, beliefs, memories, and emotions. An ontology answers the question of <em>what</em> you study when you study some domain, characterizes the joints along which the domain can be carved and which questions about it are meaningful to focus on. (In this sense, it’s similar to the philosophical notion of a “conceptual scheme”, although I don’t think it’s an exact match.)</p>



<p>Under this view, deriving the “highest-level ontology” – the ontology for reality-as-a-whole – decomposes reality into a set of concepts such as “physics”, “chemistry”, or “psychology”. These concepts explicitly classify which parts of reality could be viewed as their instances, thereby decomposing reality into domains that could be studied separately (and from which the <em>disciplines</em> of physics, chemistry, and psychology could spring).</p>



<p>By contrast, on a lower level, arriving at the ontology of some specific field of study allows you to decompose it into specific sub-fields. These subfields can, likewise, be studied mostly separately. (The study of gasses vs. quantum particles, or inorganic vs. organic compounds, or emotional responses vs. memory formation.)</p>



<p>One specific consequence of the above desiderata is that <a href="https://www.lesswrong.com/posts/nLhHY2c8MWFcuWRLx/good-ontologies-induce-commutative-diagrams">good ontologies commute</a>.</p>



<p>That is, suppose you have some already-defined domain of reality, such as “chemistry”. You’d like to further decompose it into sub-domains. You take some system from this domain, such as a specific chemical process, and derive a prospective ontology for it. The ontology purports to decompose the system into a set of high-level variables plus compactly specified interactions between them, producing a predictive model of it.</p>



<p>If you then take a different system from the same domain, <em>the same ontology</em> should work for it. If you talked about “spirits” and “ether” in the first case, but you need to discuss “molecules” and “chemical reactions” to model the second one, then the spirits-and-ether ontology doesn’t suffice to capture the entire domain. And if there are <em>no</em> extant domains of reality which are well-characterized by your ontology – if the ontology of spirits and ether was derived by “overfitting” to the behavior of the first system, and it fails to robustly generalize to other examples – then this ontology is a bad one.</p>



<p>The go-to historical example comes from the field of chemistry: <a href="https://en.wikipedia.org/wiki/Phlogiston_theory">the phlogiston theory</a>. The theory aimed to explain combustion, modeling it as the release of some substance called “phlogiston”. However, the theory’s explanations for different experiments implied contradictory underlying dynamics. In some materials, phlogiston was supposed to have positive mass (and its release decreased the materials’ weight); in others, negative mass (in metals, to explain why they <em>gained</em> weight after being burned). The explanations for its interactions with air were likewise ad-hoc, often invented <em>post factum</em> to rationalize an experimental result, and essentially never to predict it. That is, they were overfit.</p>



<p>Another field worth examining here is agent foundations. The process of deriving a suitable ontology for it hasn’t yet finished. Accordingly, it is plagued by questions of what concepts / features it should be founded upon. Should we define agency from idealized utility-maximizers, or should we define it <a href="https://www.lesswrong.com/posts/moi3cFY2wpeKGu9TT/clarifying-the-agent-like-structure-problem">structurally</a>? Is consequentialism-like goal-directed behavior even <a href="https://www.lesswrong.com/s/nyEFg3AuJpdAozmoX">the right thing to focus on</a>, when studying real-world agent-like systems? <a href="https://www.lesswrong.com/posts/gQY6LrTWJNkTv8YJR/the-pointers-problem-human-values-are-a-function-of-humans">What formal definition do the “values” of realistic agents have?</a></p>



<p>In other words: what is the set of variables which serve to compactly and comprehensively characterize and model <em>any</em> system we intuitively associate with “agents”, the same way chemistry can characterize any chemical interaction in terms of molecules and atoms?</p>



<p>Another telling example is mechanistic interpretability. Despite being a very concrete and empirics-based field of study, it likewise involves attempts to derive a novel ontology for studying neural networks. Can individual neurons be studied separately? The evidence suggests otherwise. If not, what <em>are</em> the basic “building blocks” of neural networks? We can <a href="https://www.lesswrong.com/posts/sxhfSBej6gdAwcn7X/coordinate-free-interpretability-theory">always</a> decompose a given set of activations into sparse components, but what decompositions would be robust, i. e., <a href="https://www.lesswrong.com/posts/TTTHwLpcewGjQHWzh/what-is-the-true-name-of-modularity">applicable to all forward passes</a> of a given ML model? (<a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html">Sparse autoencoders</a> represent some progress along this line of inquiry.)</p>



<p>At this point, it should be noted that the process of deriving ontologies, which was previously linked to “philosophical reasoning”, seems to show up in contexts that are far from the traditional ideas of what “philosophy” is. I argue that this is not an error: we are attempting to investigate a cognitive algorithm that is core to philosophy-as-a-discipline, yet it’s not a given that this algorithm would show up <em>only</em> in the context of philosophy. (An extended discussion of this point follows in 1C and 1E.)</p>



<p>To summarize: Philosophical reasoning involves focusing on some domain of reality<span id='easy-footnote-2-3634' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/towards-the-operationalization-of-philosophy-wisdom/#easy-footnote-bottom-2-3634' title='Which was, itself, made independent from the rest of reality by assumptions produced by a higher-level instance of philosophical reasoning.'><sup>2</sup></a></span> to derive an ontology for it. That ontology could then be used to produce a “high-level summary” of any system from the domain, in terms of specific high-level variables and compactly specifiable interactions between them. This, in turn, makes it possible to decompose this domain into further sub-domains.</p>



<h3 class="wp-block-heading">1B. Tentative Formalization</h3>



<p>Put this way, the definition could be linked to <a href="https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1">John Wentworth’s definition of natural abstractions</a>.</p>



<p>The Natural Abstraction Hypothesis states that the real-world data are distributed such that, for any set of “low-level” variables <em>L</em> representing some specific system or set of systems, we can derive the (set of) high-level variable(s) <em>H</em>, such that they would serve as “natural latents” for <em>L</em>. That is: conditional on the high-level variables <em>H</em>, the low-level variables <em>L</em> would become (approximately) independent<span id='easy-footnote-3-3634' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/towards-the-operationalization-of-philosophy-wisdom/#easy-footnote-bottom-3-3634' title='The generalization of the framework explicitly able to handle approximation could be found &lt;a href=&quot;https://www.lesswrong.com/posts/dWQWzGCSFj6GTZHz7/natural-latents-the-math&quot;&gt;through this link&lt;/a&gt;.'><sup>3</sup></a></span>:</p>



<figure class="wp-block-image"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXevrp9trzrz54f6t43y_hsLuE7fde7AsQ_fq0rApsLxMstJBJVYhvO-4z4GpVY_Hdvawq7EkFnhEf3mz6mJfiBof-MDV9H6_W50aZW22Kt_KQbvIECzj_YYNKI0k_pu66q2AmP6isozIMRpFwIRYhagoxw?key=ogmKpL6fBJgn1Y18NUTnww" alt=""/></figure>



<figure class="wp-block-image"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXed6ztK0ywsCC_Wfe4Hrx8WMFfPwJFaRz2saB9FBGf4D6xgRRqPJDAiIpY3ufH780JkKXZAo_-0gVfFl7hHoyzRDcJmiCO93RSQwIvQbXdodj9fPErnjt-4sy6cXArV1dna4UfS_eXdFZ-vxgZ-IWnPmlFv?key=ogmKpL6fBJgn1Y18NUTnww" alt=""/></figure>



<p>(Where “\” denotes set subtraction, meaning <em>L </em>\ <em>L_i</em> is the set of all <em>L_k</em> except <em>L_i</em>.)</p>



<p>There are two valid ways to interpret <em>L</em> and <em>H</em>.</p>



<ul class="wp-block-list">
<li><a href="https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information">The “bottom-up” interpretation</a>: <em>L_i</em> could be different parts of a specific complex system, such as small fragments of a spinning gear. <em>H</em> would then correspond to a set of high-level properties of the gear, such as its rotation speed, the mechanical and molecular properties of its material, and so on. Conditional on <em>H</em>, the individual <em>L_i</em> become independent: once we’ve accounted for the shared material, for example, the only material properties by which they vary are e. g. small molecular defects, individual to each patch.</li>



<li><a href="https://www.lesswrong.com/posts/N2JcFZ3LCCsnK2Fep/the-minimal-latents-approach-to-natural-abstractions">The “top-down” interpretation</a>: <em>L_i</em> could be different examples of systems belonging to some reference class of systems, such as individual examples of trees. <em>H</em> would then correspond to the general “tree” abstraction, capturing the (distribution over the) shapes of trees, the materials they tend to be made of, and so on. Conditional on <em>H</em>, the individual <em>L_i</em> become independent: the “leftover” variance are various contingent details such as “how many leaves this particular tree happens to have”.</li>
</ul>
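


<p>As a toy numerical illustration of the “bottom-up” gear case above – a sketch of my own in Python, not part of the natural-latents formalism itself – two low-level fragments share a latent material property <em>H</em>, are strongly correlated unconditionally, and become approximately independent once <em>H</em> is accounted for:</p>



<pre class="wp-block-code"><code># Toy sketch (my own, invented numbers): two "fragments" of a gear share a
# latent material property H; conditional on H, their remaining variation is
# independent noise, so H acts as a (trivial) natural latent for them.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

H = rng.normal(size=n)                # shared high-level latent (e.g. material hardness)
L1 = H + 0.3 * rng.normal(size=n)     # fragment 1: latent plus individual defects
L2 = H + 0.3 * rng.normal(size=n)     # fragment 2: latent plus individual defects

# Unconditionally, the fragments are strongly correlated...
print("corr(L1, L2)     =", round(np.corrcoef(L1, L2)[0, 1], 3))

# ...but conditional on H (here: after subtracting the part explained by H),
# the leftover variation in each fragment is approximately independent.
r1, r2 = L1 - H, L2 - H
print("corr(L1, L2 | H) =", round(np.corrcoef(r1, r2)[0, 1], 3))</code></pre>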



<p>Per the hypothesis, the high-level variables <em>H</em> would tend to correspond to intuitive human abstractions. In addition, they would be “in the territory” and convergent – in the sense that <em>any</em> efficient agent (or agent-like system) that wants to model some chunk of the world would arrive at approximately <em>the same</em> abstractions for this chunk, regardless of the agent’s goals and quirks of its architecture. What information is shared between individual fragments of a gear, or different examples of trees, is some <em>ground-truth</em> fact about the systems in question, rather than something subject to the agent’s choice.<span id='easy-footnote-4-3634' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/towards-the-operationalization-of-philosophy-wisdom/#easy-footnote-bottom-4-3634' title='In theory, there might be some free parameters regarding the &lt;em&gt;exact&lt;/em&gt; representation an agent would choose, if there are several possible representations with the same size and predictive accuracy. Efforts to show that any two such representations would be importantly isomorphic to each other &lt;a href=&quot;https://www.lesswrong.com/posts/fJb8ryrMW5XfJaq7m/approximately-deterministic-natural-latents#Minimality__Maximality__and_Isomorphism_of_Deterministic_Natural_Latents&quot;&gt;are ongoing&lt;/a&gt;.'><sup>4</sup></a></span>



<p><a href="https://distill.pub/2020/circuits/zoom-in/#three-speculative-claims">The Universality Hypothesis</a> in machine-learning interpretability is a <a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html#phenomenology-universality">well-supported</a> empirical complement of the NAH. While it doesn’t shed much light on what exact mathematical framework for abstractions we should use, it supplies strong evidence in favor of the NAH’s basic premise: that there’s <em>some</em> notion of abstraction which is convergently learned by agents and agent-like systems.</p>



<p>A natural question, in this formalism, is how to <em>pick</em> the initial set of low-level variables <em>L</em> whose ontology we’d be searching for: how we know to draw the boundary around the gear, or to put only examples of trees into the set <em>L</em>. That question is currently open, although one simple way to handle it might be to search for a set that has a nontrivial natural latent <em>H</em>.</p>



<p>The NAH framework captures the analysis in the preceding sections well. <em>H</em> constitutes the ontology of <em>L</em>, creating conditional independence between the individual variables. Once <em>H</em> is derived, we can study each of <em>L_i</em> separately. (More specifically, we’d be studying <em>L_i</em> <em>conditioned on</em> <em>H</em>: the individual properties of a specific tree <em>in the context of</em> it being a tree; the properties of a physical system in the context of viewing it as a physical system.)</p>



<p>If there is disagreement over the shape of <em>H</em> – if researchers or philosophers have yet to converge on the same <em>H</em> – that’s a sign that <em>no</em> proposed <em>H</em> is correct, that each fails to robustly induce independence between the <em>L_i</em>. (Psychology is an illustrative example here: there are many extant ontologies purporting to characterize the human mind. But while most of them explain <em>some</em> phenomena, none of them explain <em>everything</em>, which leads to different specialists favoring different ontologies – and which is evidence that the correct framework is yet to be found.)</p>



<p>This definition could be applied iteratively: an ontology <em>H</em> would usually consist of a set of variables as well, and there could be a set of even-higher-level variables inducing independence between them. We could move from the description of reality in terms of “all elementary particles in existence” to “all atoms in existence”, and then, for example, to “all cells”, to “all organisms”, to “all species”. Or: “all humans” to “all cities” to “all countries”. Or: starting from a representation of a book in terms of individual sentences, we can compress it to the summary of its plot and themes; starting from the plots and themes of a set of books, we can derive common literary genres. Or: starting from a set of sensory experiences, we can discover some commonalities between these experiences, and conclude that there is some latent “object” depicted in all of them (such as compressing the visual experiences of seeing a tree from multiple angles into a “tree” abstraction). And so on.</p>



<p>In this formalism, we have two notable operations:</p>



<ol class="wp-block-list">
<li>Deriving <em>H</em> given some <em>L</em>.</li>



<li>Given some <em>L</em>, <em>H</em>, and the relationship <em>P(L | H)</em>, propagating some target state “up” or “down” the hierarchy of abstractions.
<ul class="wp-block-list">
<li>That is, if <em>H</em> = <em>H*</em>, what’s <em>P(L | H = H*)</em>? Given some high-level state (macrostate), what’s the (distribution over) low-level states (microstates)?</li>



<li>On the flip side, if <em>L</em> = <em>L*</em>, what’s <em>P(H | L = L*)</em>? Given some microstate, what macrostate does it correspond to?</li>
</ul>
</li>
</ol>



<p>It might be helpful to think of <em>P(L | H)</em> and <em>P(H | L)</em> as defining functions for abstracting down <em>H → L</em> and abstracting up <em>L → H</em>, respectively, rather than as probability distributions. Going forward, I will be using this convention.</p>
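


<p>A minimal sketch of operation (2) – propagating a state “down” (<em>H → L</em>) and “up” (<em>L → H</em>) through an already-derived ontology – on a tiny discrete example. The categories and probabilities here are invented purely for illustration:</p>



<pre class="wp-block-code"><code># Minimal sketch: abstracting down (H -> L) and up (L -> H) via a toy
# discrete Bayes model. All categories and numbers are invented.

prior_H = {"oak": 0.6, "pine": 0.4}            # P(H): distribution over tree species
likelihood = {                                 # P(L | H): leaf shape given species
    "oak":  {"broad": 0.9,  "needle": 0.1},
    "pine": {"broad": 0.05, "needle": 0.95},
}

def abstract_down(h):
    """H -> L: given a macrostate (species), the distribution over microstates (leaf shape)."""
    return likelihood[h]

def abstract_up(l):
    """L -> H: given a microstate (leaf shape), the posterior over macrostates (species)."""
    joint = {h: prior_H[h] * likelihood[h][l] for h in prior_H}
    z = sum(joint.values())
    return {h: p / z for h, p in joint.items()}

print(abstract_down("pine"))     # {'broad': 0.05, 'needle': 0.95}
print(abstract_up("needle"))     # posterior puts most mass on 'pine'</code></pre>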



<p>I would argue that (2) represents the kinds of thinking, including highly intelligent and sophisticated thinking, which <em>do not correspond to philosophical reasoning</em>. In that case, we already have <em>H → L</em> pre-computed, the ontology defined. The operations involved in propagating the state up/down might be rather complex, but they’re ultimately “closed-form” in a certain sense.</p>



<p>Some prospective examples:</p>



<ul class="wp-block-list">
<li>Tracking the consequences of local political developments on the global economy (going “up”), or on the experiences of individual people (going “down”).</li>



<li>Evaluating the geopolitical impact of a politician ingesting a specific poisonous substance at a specific time (going “up”).</li>



<li>Modeling the global consequences of an asteroid impact while taking into account orbital dynamics, weather patterns, and chemical reactions (going “down” to physical details, then back “up”).</li>



<li>Translating a high-level project specification to build a nuclear reactor into specific instructions to be carried out by manufacturers (“down”).</li>



<li>Estimating the consequences of a specific fault in the reactor’s design on global policies towards nuclear power (“up”).</li>
</ul>



<p>As per the examples, this kind of thinking very much encompasses some domains of research and engineering.</p>



<p>(1), on the other hand, potentially represents <em>philosophical reasoning</em>. The question is: what specific cognitive algorithms are involved in that reasoning?</p>



<p>Intuitively, some sort of “babble-and-prune” brute-force approach seems to be at play. We need to semi-randomly test various possible decompositions, until ultimately arriving at one that is actually robust. Another feature is that this sort of thinking requires a wealth of <em>concrete examples</em>, a “training set” we have to study to derive the right abstractions. (Which makes sense: we need a representative sample of the set of random variables <em>L</em> in order to derive approximate conditional-independence relations between them.)</p>



<p>But given that philosophical problems are, empirically, computationally tractable at all, it would seem that some heuristics are at play here as well. Whatever algorithms underlie philosophical reasoning, they’re able to narrow down the hypothesis space of ontologies that we have to consider.</p>
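


<p>To make the “babble-and-prune” picture concrete, here is a crude sketch of what such a loop might look like on toy data: propose candidate one-dimensional summaries of <em>L</em>, score each by how much residual dependence it leaves between the low-level variables, and keep the best. The data-generating process, the candidate pool, and the scoring rule are all invented assumptions; this is not a claim about the actual algorithm humans run.</p>



<pre class="wp-block-code"><code># Crude babble-and-prune sketch of operation (1), deriving H given L.
# Everything here is a toy assumption made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50_000, 4
true_H = rng.normal(size=n)
L = true_H[:, None] + 0.5 * rng.normal(size=(n, k))   # four low-level variables sharing a latent

def residual_dependence(L, H):
    """Mean |correlation| left between the L_i after regressing out the candidate H."""
    H = (H - H.mean()) / H.std()
    betas = (L * H[:, None]).mean(axis=0)             # per-variable regression coefficients on H
    resid = L - np.outer(H, betas)
    c = np.corrcoef(resid, rowvar=False)
    return np.abs(c[~np.eye(k, dtype=bool)]).mean()

# "Babble": candidate ontologies = random linear summaries of L, plus the obvious one (the mean).
candidates = [L @ rng.normal(size=k) for _ in range(20)] + [L.mean(axis=1)]

# "Prune": keep the candidate that induces the most conditional independence.
best = min(candidates, key=lambda H: residual_dependence(L, H))
print("residual dependence of best candidate:", round(residual_dependence(L, best), 3))
print("its correlation with the true latent: ", round(abs(np.corrcoef(best, true_H)[0, 1]), 3))</code></pre>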



<p>Another relevant intuition: from a computational-complexity perspective, the philosophical reasoning of (1), in general, seems to be more demanding than the more formal non-philosophical thinking of (2). Philosophical reasoning seems to involve some iterative search-like procedures, whereas the “non-philosophical” thinking of (2) involves only “simpler” closed-form deterministic functions.</p>



<p>This fits with the empirical evidence: deriving a new useful model for representing some domain of reality is usually a task for entire fields of science, whereas <em>applying</em> a model is something any individual competent researcher or engineer is capable of.</p>



<h3 class="wp-block-heading">1C. Qualitative Research</h3>



<p>Suppose that the core cognitive processes underlying philosophy are indeed about deriving novel ontologies. Is the converse true: are all situations in which we’re deriving some novel ontology “philosophy-like” undertakings, in some important sense?</p>



<p>I would suggest yes.</p>



<p>Let’s consider the mechanistic-interpretability example from 1A. Mechanistic interpretability is a very concrete, down-to-earth field of study, with tight empirical-testing loops. Nevertheless, things like the Universality Hypothesis, and speculations that the computations in neural networks could be decomposed into “computational circuits”, certainly have a <em>philosophical</em> flavor to them – even if the relevant reasoning happens far outside the field of academic philosophy.</p>



<p>Chris Olah, a prominent ML interpretability researcher, <a href="https://transformer-circuits.pub/2024/qualitative-essay/index.html">characterizes this as “qualitative research”</a>. He points out that one of the telltale signs that this type of research is proceeding productively is finding <em>surprising structure</em> in your empirical results. In other words: finding some way to look at the data which hints at the underlying ontology.</p>



<p>Another common term for this type of research is “pre-paradigmatic” research. The field of agent foundations, for example, is often called “pre-paradigmatic” in the sense that within its context, we don’t know how to correctly phrase even the <em>questions</em> we want answered, nor how to define the basic features we want to focus on.</p>



<p>Such research processes are common even in domains that have long been decoupled from philosophy, such as physics. Various attempts to derive the Theory of Everything often involve grappling with very philosophy-like questions regarding the ontology of a physical universe consistent with all our experimental results (e. g., string theory). The different interpretations of quantum mechanics are an even more obvious example.</p>



<p>Thomas Kuhn&#8217;s <a href="https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Revolutions"><em>The Structure of Scientific Revolutions</em></a> naturally deserves a mention here. His decomposition of scientific research into “paradigm shifts” and “normal science” would correspond to the split between (1) and (2) types of reasoning as outlined in the previous section. The research that fuels paradigm shifts would be of the “qualitative”, non-paradigmatic, ontology-discovering type.</p>



<p>Things similar to qualitative/non-paradigmatic research also appear in the world of <em>business</em>. Peter Thiel’s <a href="https://en.wikipedia.org/wiki/Zero_to_One">characterization</a> of startups as engaging in “zero to one” creation of qualitatively new markets or goods would seem to correspond to deriving some novel business frameworks, i. e., <em>ontologies</em>, and succeeding by their terms. (“Standard”, non-startup businesses, in this framework’s view, rely on more “formulaic” practices – i. e., on making use of already pre-computed <em>H → L</em>. Consider opening a new steel mill, which would produce well-known products catering to well-known customers, vs. betting on a specific AI R&amp;D paradigm, whose exact place in the market is impossible to predict even if it succeeds.)</p>



<p>Nevertheless, intuitively, there still seems to be <em>some</em> important difference between these “thin slices” of philosophical reasoning scattered across more concrete fields, and “pure” philosophy.</p>



<p>Before diving into this, a short digression:</p>



<h3 class="wp-block-heading">1D. Qualitative Discoveries Are Often Counterfactual</h3>



<p>Since non-paradigmatic research seems more computationally demanding, requiring a greater amount of expertise than in-paradigm reasoning, its results are often highly <em>counterfactual</em>. While more well-operationalized frontier discoveries are often made by many people near-simultaneously, highly qualitative discoveries could often be attributed to a select few people.</p>



<p>As a relatively practical example, <a href="https://www.lesswrong.com/posts/csHstEPagqs8wChhh/examples-of-highly-counterfactual-discoveries">Shannon’s information theory</a> plausibly counts. The discussion through the link also offers some additional prospective examples.</p>



<p>The idea that zero-to-one startups are founded on novel philosophical reasoning also supports this view. If a novel startup fails due to some organizational issues before proving the profitability of its business plan, it’s not at all certain that it would be quickly replaced by someone trying the same idea, <em>even if</em> its plan were solid. Failures of the Efficient Market Hypothesis are common in this area.</p>



<h3 class="wp-block-heading">1E. What Is “Philosophy” As a Discipline?</h3>



<p>Suppose that the low-level system <em>L</em> represents some practical problem we study. A neural network that we have to interpret, or the readings yielded by a particle accelerator which narrow down the fundamental physical laws, or the behavior of some foreign culture that we want to trade with. Deriving the ontology <em>H</em> would be an instance of non-paradigmatic research, i. e., philosophical reasoning. But once <em>H</em> is derived, it would be relatively easily put to use solving practical problems. The relationship <em>H → L</em>, once nailed down, would quickly be handed off to engineers or businessmen, who could start employing it to optimize the natural world.</p>



<p>As an example, consider <a href="https://www.lesswrong.com/tag/anthropics">anthropics</a>, an emerging field studying <a href="https://en.wikipedia.org/wiki/Anthropic_principle">anthropic principles</a> and extended reasoning similar to <a href="https://en.wikipedia.org/wiki/Doomsday_argument">the doomsday argument</a>.<span id='easy-footnote-5-3634' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/towards-the-operationalization-of-philosophy-wisdom/#easy-footnote-bottom-5-3634' title='Given that we’re able to make a particular observation, what does that tell us about the structure of reality? For example, the fact that intelligent life exists at all already narrows down what laws of physics our universe must have. We can infer something about them purely from observing our own existence, without actually looking around and directly studying them.'><sup>5</sup></a></span> Anthropics doesn’t study a concrete practical problem. It’s a very high-level discipline, more or less abstracting over the-world-as-a-whole (or our experiences of it). Finding a proper formalization of anthropics, which satisfactorily handles all edge cases, would result in advancement in decision theory and probability theory. But there are no <em>immediate</em> practical applications.</p>



<p>They likely do exist. But you’d need to propagate the results <em>farther</em> down the hierarchy of abstractions, moving through these theories down to specific subfields and then to specific concrete applications. None of the needed <em>H → L</em> pathways have been derived; there’s a lot of multi-level philosophical work to be done. And there’s always the possibility that it would yield no meaningful results, or end up as a very circuitous way to justify common intuitions.</p>



<p>The philosophy of mind could serve as a more traditional example. Branches of it are focused on investigating the nature of consciousness and qualia. Similarly, it’s a very “high-level” direction of study, and the success of its efforts would have significant implications for numerous other disciplines. But it’s not known what the practical consequences of such a success, if any, would be.</p>



<p>Those features, I think, characterize “philosophy” as a separate discipline. Philosophy (1) involves attempts to derive wholly new multi-level disciplines, starting from very-high-level reasoning about the-world-as-a-whole (or, at least, drawing on several disciplines at once), and (2) it only cashes out in practical implementations after several iterations of concretization.</p>



<p>In other words, philosophy is the continuing effort to derive the complete <em>highest-level</em> ontology of our experiences/our world.</p>



<h3 class="wp-block-heading">1F. On Ethics</h3>



<p>An important branch of philosophy which hasn&#8217;t been discussed so far is moral philosophy. The previously outlined ideas generalize to it in a mostly straightforward manner, though with a specific “pre-processing” twist.</p>



<p>This is necessarily going to be a very compressed summary. For proper treatment of the question, I recommend Steven Byrnes’ series on <a href="https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8">the human brain</a> and <a href="https://www.lesswrong.com/s/6uDBPacS6zDipqbZ9">valence signals</a>, or my (admittedly fairly outdated) <a href="https://www.lesswrong.com/posts/kmpNkeqEGvFue7AvA/value-formation-an-overarching-model">essay on value formation</a>.</p>



<p>To start off, let’s assume that the historical <em>starting point</em> of moral philosophy is human moral intuitions and feelings. Which actions, goals, or people “feel like” good or bad things, what seems just or unfair, and so on. From this starting point, people developed the notions of morality and ethics, ethical systems, social norms, laws, and explicit value systems and ideologies.</p>



<p>The process of moral philosophy can then be characterized as follows:</p>



<p>As a premise, human brains contain learning algorithms plus a suite of reinforcement-learning training signals.</p>



<p>In the course of life, and especially in childhood, a human learns a vast repository of value functions. These functions take sensory perceptions and thoughts as inputs, and output “valence signals” in the form of real numbers. The valence assigned to a thought is based on learned predictive heuristics about whether a given type of thought has historically led to positive or negative reward (as historically scored by innate reinforcement-signal functions).</p>



<p>The valences are <em>perceived by</em> human minds as a type of sensory input. In particular, a subset of learned value functions could be characterized as “moral” value functions, and their outputs are perceived by humans as the aforementioned feelings of “good”, “bad”, “justice”, and so on.</p>



<p>Importantly, the learned value functions aren’t part of a human’s learned <em>world-model</em> (<a href="https://en.wikipedia.org/wiki/Explicit_knowledge">explicit knowledge</a>). As a result, their explicit definitions aren’t immediately available to our conscious inspection. They’re “black boxes”: we only perceive their outputs.</p>



<p>One aspect of moral philosophy, thus, is to <em>recover</em> these explicit definitions: what value functions you’ve learned and what abstract concepts they’re “attached to”. (For example: does “stealing” feel bad because you think it’s unfair, or because you fear being caught? You can investigate this by, for example, imagining situations in which you manage to steal something in circumstances where you feel confident you won’t get caught. This would allow you to remove the influence of “fear of punishment”, and thereby determine whether you have a fairness-related value function.)</p>



<p>That is a type of philosophical reasoning: an attempt to “abstract up” from a set of sensory experiences of a specific modality, to a function defined over high-level concepts. (Similar to recovering a “tree” abstraction by abstracting up from a set of observations of a tree from multiple angles.)</p>



<p>Building on that, once a human has recovered (some of) their learned value functions, they can keep abstracting up in the manner described in the preceding text. For example, a set of values like “I don’t like to steal”, “I don’t like to kill”, “I don’t like making people cry” could be abstracted up to “I don’t want to hurt people”.</p>



<p>Building up further, we can abstract over the set of value systems recovered by <em>different </em>people, and derive e. g. the values of a society…</p>



<p>… and, ultimately, “human values” as a whole.</p>



<p>Admittedly, there are some obvious complications here, such as the need to handle value conflicts / inconsistent values, and sometimes making the deliberate choice to discard various data points in the process of computing higher-level values (often on the basis of <em>meta-value</em> functions). For example, not accounting for violent criminals when computing the values a society wants to strive for, or discarding violent impulses when making decisions about what kind of person you want to be.</p>



<p>In other words: when it comes to values, there is an “ought” sneaking into the process of abstracting-up, whereas in all other cases, it’s a purely “is”-fueled process.</p>



<p>But the “ought” side of it can be viewed as simply making decisions about what data to put in the <em>L</em> set, which we’d then abstract over in the usual, purely descriptive fashion.</p>



<p>From this, I conclude that the basic algorithmic machinery, especially that underlying the <em>philosophical</em> (rather than the <em>political</em>) aspects of ethical reasoning, is still the same as with all other kinds of philosophical reasoning.</p>



<h3 class="wp-block-heading">1G. Why Do “Solved” Philosophical Problems Stop Being Philosophy?</h3>



<p>As per the formulations above:</p>



<ul class="wp-block-list">
<li>The endeavor we intuitively view as “philosophy” is a specific subset of general philosophical reasoning/non-paradigmatic research. It involves thinking about the world in a very general sense, <em>without</em> the philosophical assumptions that decompose it into separate domains of study.</li>



<li>“Solving” a philosophical problem involves deriving an ontology/paradigm for some domain of reality, which makes it possible to decouple that domain from the rest of the world and study it mostly separately.</li>
</ul>



<p>Put like this, it seems natural that philosophical successes move domains outside the remit of philosophy. Once a domain has been delineated, thinking about it <em>by definition</em> no longer requires the interdisciplinary reasoning characteristic of philosophy-as-a-discipline. Philosophical reasoning seeks to render itself unnecessary.</p>



<p>(As per the previous sections, working in the domains thus delineated could still involve qualitative research, i. e., philosophical reasoning. But not the specific <em>subtype</em> of philosophical reasoning characteristic of philosophy-as-a-discipline, involving reasoning about the-world-as-a-whole.)</p>



<p>In turn, this separation allows specialization. Newcomers could focus their research and education on the delineated domain, <em>without</em> having to become interdisciplinary specialists. This means a larger quantity of people could devote themselves to it, leading to faster progress.</p>



<p>That dynamic is also bolstered by greater funding. Once the practical implications of a domain become clear, more money pours into it, attracting even <em>more</em> people.</p>



<p>As a <em>very</em> concrete example, we can consider <a href="https://transformer-circuits.pub/2021/framework/index.html#onel-path-expansion">the path-expansion trick</a> in mechanistic interpretability. Figuring out how to mathematically decompose a one-layer transformer into the OV and QK circuits requires high-level reasoning about the transformer architecture, and arriving at the very <em>idea</em> of trying to do so requires philosophy-like thinking (to even think to ask, “how can we decompose an ML model into separate building blocks?”). But once this decomposition has been determined, each of these circuits could be studied separately, including by people who don’t have the expertise to derive the decomposition from scratch.</p>



<p>Solving a philosophical problem, then, often makes it possible to greatly upscale the amount of work done in the relevant domain of reality. Sometimes, that quickly turns it into an industry.</p>



<h2 class="wp-block-heading">2. Wisdom</h2>



<p>Let’s consider a wide variety of “wise” behavior or thinking.</p>



<ol class="wp-block-list">
<li>Taking into account “second-order” effects of your actions.
<ul class="wp-block-list">
<li><strong>Example:</strong> <a href="https://simple.wikipedia.org/wiki/Trolley_problem#Emergency_room_case">The “transplant problem”</a>, which examines whether you should cut up a healthy non-consenting person for organs if that would let you save five people whose organs are failing.</li>



<li>“Smart-but-unwise” reasoning does some math and bites the bullet.</li>



<li>“Wise” reasoning points out that if medical professionals engaged in this sort of behavior at scale, people would stop seeking medical attention out of fear/distrust, leading to more suffering in the long run.</li>
</ul>
</li>



<li>Taking into account your history with specific decisions, and updating accordingly.
<ul class="wp-block-list">
<li><strong>Example 1:</strong>
<ul class="wp-block-list">
<li>Suppose you have an early appointment tomorrow, but you’re staying up late, engrossed in a book. Reasoning that you will read “just one more chapter” might seem sensible: going to sleep at 01:00 AM vs. 01:15 AM would likely have no significant impact on your future wakefulness.</li>



<li>However, suppose that you end up making this decision repeatedly, until it’s 6:40 AM and you have barely any time left for sleep at all.</li>



<li>Now suppose that a week later, you’re in a similar situation: it’s 01:00 AM, you’ll need to wake up early, and you’re reading a book. </li>



<li>“Smart-but-unwise” reasoning would repeat your previous mistake: it’d argue that going to sleep fifteen minutes later is fine.</li>



<li>“Wise” reasoning would update on the previous mistake, know not to trust its object-level estimates, and go to sleep immediately.</li>
</ul>
</li>



<li><strong>Example 2:</strong>
<ul class="wp-block-list">
<li>Suppose that someone did something very offensive to you. In the moment, you infer that this means they hate you, and update your beliefs accordingly.</li>



<li>Later, it turns out they weren’t aware that their actions upset you, and they apologize and never repeat that error.</li>



<li>Next time someone offends you, you may consider it “wise” not to trust your instinctive interpretation <em>completely</em>, and at least consider alternate explanations.</li>
</ul>
</li>
</ul>
</li>



<li>Taking into account the impact of the fact that you’re the sort of person to make a specific decision in a specific situation.
<ul class="wp-block-list">
<li><a href="https://www.lesswrong.com/posts/Kbm6QnJv9dgWsPHQP/schelling-fences-on-slippery-slopes"><strong>Example 1</strong></a>:
<ul class="wp-block-list">
<li>Suppose that a staunch pacifist is offered a deal: they take a pill that would decrease their willingness to kill by 1%, and in exchange, they get 1 million dollars. In addition, they could take that deal multiple times, getting an additional 1 million dollars each time, and raising their willingness to kill by 1% each time.</li>



<li>A “smart-but-unwise” pacifist reasons that they’d still be unwilling to kill even if they became, say, 10% more willing to, and that they could spend the 10 million dollars on charitable causes, so they decide to take the deal 10 times.</li>



<li>A “wise” pacifist might consider the fact that, if they take the deal 10 times, the one making the decision on whether to <em>continue</em> would be a 10%-more-willing-to-kill version of them. That version might consider it acceptable to go up to 20%; a 20% version might consider 40% acceptable, and so on until 100%.</li>
</ul>
</li>



<li><strong>Example 2:</strong> Blackmailability.
<ul class="wp-block-list">
<li>Suppose that we have two people, Alice and Carol. Alice is known as a reasonable, measured person who makes decisions carefully, minimizing risk. Carol is known as a very temperamental person who becomes enraged and irrationally violent at the slightest offense.</li>



<li>Suppose that you’re a criminal who wants to blackmail someone. If you’re choosing between Alice and Carol, Alice is a much better target: if you threaten to ruin her life if she doesn’t pay you $10,000, she will tally up the costs and concede. Carol, on the other hand, might see red and attempt to murder you, even if that seals her own fate.</li>



<li>Alice is “smart-but-unwise”. Carol, as stated, isn’t exactly “wise”. But she becomes “wise” under one provision: if she committed to her “irrational” decision policy as a result of rational reasoning about what would make her an unappealing blackmail target. After all, in this setup, she’s certainly the one who ends up better off than Alice!</li>



<li>(<a href="https://en.wikipedia.org/wiki/Functional_Decision_Theory">Functional Decision Theories</a> attempt to formalize this type of reasoning, providing a framework within which it’s strictly rational.)</li>
</ul>
</li>
</ul>
</li>



<li>Erring on the side of deferring to common sense in situations where you think you see an unexploited opportunity.
<ul class="wp-block-list">
<li><strong>Example 1:</strong> Engaging in immoral behavior based on some highly convoluted consequentialist reasoning vs. avoiding deontology violations. See <a href="https://www.lesswrong.com/posts/K9ZaZXDnL3SEmYZqB/ends-don-t-justify-means-among-humans">this article</a> for an extended discussion of the topic.
<ul class="wp-block-list">
<li>This is similar to (1), but in this case, you don’t need to reason through the <em>n</em>th-order effects “manually”. You know that deferring to common sense is <em>usually</em> wise, even if you don’t know why the common sense is the way it is.</li>



<li>It’s also fairly similar to the first example in (3), but the setup here is much more realistic.</li>
</ul>
</li>



<li><strong>Example 2:</strong> Trying to estimate the price of a stock “from scratch”, vs. “<a href="https://thezvi.wordpress.com/2017/11/05/zeroing-out/">zeroing out</a>”, i. e., taking the market value as the baseline and then updating it up/down based on whatever special information you have.</li>



<li><strong>Example 3:</strong> Getting “bad vibes” from a specific workplace environment or group of people, and dismissing these feelings as irrational (“smart-but-unwise”), vs. trying to investigate in-depth what caused them (“wise”). (And discovering, for example, that they were caused by some subtle symptoms of unhealthy social dynamics, which the global culture taught you to spot, but didn’t explain the meaning of.)</li>
</ul>
</li>



<li>Taking the “outside view” into account (in some situations in which it’s appropriate).
<ul class="wp-block-list">
<li><strong>Example:</strong> Being completely convinced of your revolutionary new physics theory or business plan, vs. being excited by it, but skeptical on the meta level, on the reasoning that there’s a decent chance your object-level derivations/plans contain an error.</li>
</ul>
</li>
</ol>



<p><strong>Summing up:</strong> All examples of “wise” behavior here involve (1) generating some candidate plan or inference, which seems reliable or correct while you’re evaluating it using your object-level heuristics, then (2) looking at the appropriate <em>reference class</em> of these plans/inferences, and finally (3) predicting what the <em>actual</em> consequences/accuracy would be using your <em>meta-level</em> heuristics. (“What if everyone acted this way?”, “what happened the previous times I acted/thought this way?”, “what would happen if it were commonly known I’d act this way?”, “if this is so easy, why haven’t others done this already?”, and so on.)</p>



<p>Naturally, it could go even higher. First-order “wise” reasoning might be unwise from a meta-meta-level perspective, and so on.</p>



<p>(For example, “outside-view” reasoning is often overused, and an even wiser kind of reasoning recognizes when inside-view considerations legitimately prevail over outside-view ones. Similarly, the heuristic of “the market is efficient and I can’t beat it” is usually wise, wiser than “my uncle beat the market this one time, which means I can too if I’m clever enough!”, but sometimes there <em>are</em> legitimate market failures.<span id='easy-footnote-6-3634' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/towards-the-operationalization-of-philosophy-wisdom/#easy-footnote-bottom-6-3634' title='Eliezer Yudkowsky’s &lt;a href=&quot;https://equilibriabook.com/&quot;&gt;&lt;em&gt;Inadequate Equilibria&lt;/em&gt;&lt;/a&gt; discusses this topic in great detail.'><sup>6</sup></a></span>)</p>



<p>In other words: “wise” thinking seems to be a two-step process, where you first generate a conclusion that you expect to be accurate, then “go meta”, and predict what would be the <em>actual</em> accuracy rate of a decision procedure that predicts this sort of conclusion to be accurate.</p>



<h3 class="wp-block-heading">2A. Background Formalisms</h3>



<p>To start off, I will need to introduce a toy model of agency. Bear with me.</p>



<p><strong>First: How can we model the inferences from the </strong><strong><em>inputs</em></strong><strong> to an agent’s decisions?&nbsp;</strong></p>



<p>Photons hit our eyes. Our brains draw an image aggregating the information each photon gives us. We interpret this image, decomposing it into objects, and inferring which latent-variable object is responsible for generating which part of the image. Then we wonder further: what process generated each of these objects? For example, if one of the &#8220;objects&#8221; is a news article, what is it talking about? Who wrote it? What events is it trying to capture? What set these events into motion? And so on.</p>



<p>In diagram format, we&#8217;re doing something like this:</p>



<figure class="wp-block-image"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfVnZJuMJK_8GW9-0hD_bCynEHbju0pWb7R38lYkLf9Dfdqiw7-sobFaDs9Poi_ek9obpJHa3Tp3sNx5sA_8_Eb_ix2v82ejQnr3QXQS6kqh3rwtZ1IQtIc6nQ6mxMnIm8ExHec2zWWwQQmI09ejoV8ADWB?key=ogmKpL6fBJgn1Y18NUTnww" alt=""/></figure>



<p><em>Blue are ground-truth variables, gray is the &#8220;Cartesian boundary&#8221; of our mind from which we read off observations, purple are nodes in our world-model, each of which can be mapped to a ground-truth variable.</em></p>



<p>We take in observations, infer what latent variables generated them, then infer what generated those variables, and so on. We go backwards: from effects to causes, iteratively. The Cartesian boundary of our input can be viewed as a &#8220;mirror&#8221; of a sort, reflecting <em>the Past</em>.</p>



<p>It&#8217;s a bit messier in practice, of course. There are shortcuts, ways to map immediate observations to far-off states. But the general idea mostly checks out – especially given that these &#8220;shortcuts&#8221; probably still <em>implicitly</em> route through all the intermediate variables, just without explicitly computing them. (You can map a news article to the events it&#8217;s describing without explicitly modeling the intermediary steps of witnesses, journalists, editing, and publishing. But your mapping function is still implicitly shaped by the known quirks of those intermediaries.)</p>



<p><strong>Second: Let’s now consider the “output side” of an agent</strong>. I. e., what happens when we&#8217;re planning to achieve some goal, in a consequentialist-like manner.</p>



<p>We envision the target state. What we want to achieve, what the world would look like. Then we ask ourselves: what would cause this? What forces could influence the outcome to align with our desires? And then: how do we control these forces? What actions would we need to take in order to make the network of causes and effects steer the world towards our desires?</p>



<p>In diagram format, we&#8217;re doing something like this:</p>



<figure class="wp-block-image"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdNxLsICosoGpZG5q868jmZXrVKfSNEja_7wXNOwFrCs3Uqnv5GoN0TczJ-_PEUhpZJrCq5sRcoW2-t5z1502leNjiMFNMa4dUmss1iQXoqbVwWzZp69g3t8fPYjbtE7yQgLNm2RwVAvA_BiNvI8jhCkmUT?key=ogmKpL6fBJgn1Y18NUTnww" alt=""/></figure>



<p><em>Green are goals, purple are intermediary variables we compute, gray is the Cartesian boundary of our actions, red are ground-truth variables through which we influence our target variables.</em></p>



<p>We start from our goals, infer what latent variables control their state in the real world, then infer what controls those latent variables, and so on. We go backwards: from effects to causes, iteratively, until getting to our own actions. The Cartesian boundary of our output can be viewed as a &#8220;mirror&#8221; of a sort, reflecting <em>the Future</em>.</p>



<p>It&#8217;s a bit messier in practice, of course. There are shortcuts, ways to map far-off goals to immediate actions. But the general idea mostly checks out – especially given that these heuristics probably still <em>implicitly</em> route through all the intermediate variables, just without explicitly computing them. (&#8220;Acquire resources&#8221; is a good heuristical starting point for basically any plan. But what <em>counts as</em> resources is something you had to figure out in the first place by mapping from &#8220;what lets me achieve goals in this environment?&#8221;.)</p>
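


<p>A toy sketch of this backwards-chaining structure (the causal graph and variable names are invented for illustration): starting from a goal variable, we walk back through its causes until we reach variables that sit on the Cartesian boundary of our output, i. e., our actions:</p>



<pre class="wp-block-code"><code># Toy sketch of "output-side" reasoning: propagate a target state backwards
# (effects -> causes -> ... -> actions) through an invented causal graph.
causes = {                                 # maps each variable to the variables that cause it
    "world_fed": ["food_distributed"],
    "food_distributed": ["food_produced", "logistics_running"],
    "food_produced": ["build_farms"],
    "logistics_running": ["hire_drivers"],
}
actions = {"build_farms", "hire_drivers"}  # the Cartesian boundary of our actions

def plan(goal):
    """Walk backwards from the goal until only directly controllable variables remain."""
    if goal in actions:
        return {goal}
    needed = set()
    for cause in causes.get(goal, []):
        needed |= plan(cause)
    return needed

print(plan("world_fed"))   # the actions to take: build_farms and hire_drivers</code></pre>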



<p>And indeed, that side of this formulation isn&#8217;t novel. From <a href="https://www.lesswrong.com/posts/gEKHX8WKrXGM4roRC/saving-time#Why_Time_">this post</a> by Scott Garrabrant, an agent-foundations researcher:</p>



<p><em>Time is also crucial for thinking about agency. My best short-phrase definition of agency is that agency is time travel. An agent is a mechanism through which the future is able to affect the past. An agent models the future consequences of its actions, and chooses actions on the basis of those consequences. In that sense, the consequence causes the action, in spite of the fact that the action comes earlier in the standard physical sense.</em></p>



<p><strong>Let’s now put both sides together</strong>. An idealized, compute-unbounded &#8220;agent&#8221; could be laid out in this manner:</p>



<figure class="wp-block-image"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcyJW9xtI4LK0E1YzWzstTzM6vTigNjR1CFSikQUugh9FtV6LS2Q0GHpFIGpxQT0-bd0w1cykE8gnsKm9AkvfQNKdgX5-DrKSUzEGxmguQMAvB-WXQRbrROfCRA5Wai2njjoM2sfesCazYIOH5eIXnbYPOz?key=ogmKpL6fBJgn1Y18NUTnww" alt=""/></figure>



<p>It reflects the past at the input side, and reflects the future at the output side. In the middle, there&#8217;s some &#8220;glue&#8221;/&#8220;bridge&#8221; connecting the past and the future by a forwards-simulation. During that, the agent &#8220;catches up to the present&#8221;: figures out what will happen <em>while</em> it&#8217;s figuring out what to do.</p>



<p>If we consider <a href="https://www.lesswrong.com/posts/voLHQgNncnjjgAPH7/utility-maximization-description-length-minimization">the relation between utility functions and probability distributions</a>, it gets even more formally literal. A utility function over <em>X</em> could be viewed as a target probability distribution over <em>X</em>, and maximizing expected utility is equivalent to minimizing cross-entropy between this target distribution and the real distribution.</p>



<p>That brings the &#8220;planning&#8221; process into alignment with the &#8220;inference&#8221; process: both are about propagating target distributions &#8220;backwards&#8221; in time through the network of causality.</p>
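


<p>A quick numerical check of that correspondence, on invented numbers: if the target distribution is defined as proportional to exp(utility), then for any candidate distribution over outcomes, expected utility and cross-entropy against the target sum to the same constant (the log of the normalizer), so maximizing one is exactly minimizing the other:</p>



<pre class="wp-block-code"><code># Numerical check on toy numbers: expected utility vs. cross-entropy against a
# target distribution proportional to exp(u). For any candidate p,
#   cross_entropy(p, target) = -E_p[u] + log Z,
# so ranking candidates by expected utility and by negative cross-entropy agrees.
import math

outcomes = ["bad", "okay", "great"]
u = {"bad": -1.0, "okay": 0.5, "great": 2.0}              # toy utility function

Z = sum(math.exp(u[x]) for x in outcomes)
target = {x: math.exp(u[x]) / Z for x in outcomes}        # target distribution, proportional to exp(u)

candidates = {
    "cautious":  {"bad": 0.1, "okay": 0.8, "great": 0.1},
    "ambitious": {"bad": 0.3, "okay": 0.2, "great": 0.5},
}

for name, p in candidates.items():
    expected_utility = sum(p[x] * u[x] for x in outcomes)
    cross_entropy = -sum(p[x] * math.log(target[x]) for x in outcomes)
    # expected_utility + cross_entropy is always log Z, whatever p is
    print(name, round(expected_utility, 3), round(cross_entropy, 3), round(expected_utility + cross_entropy, 3))</code></pre>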



<h3 class="wp-block-heading">2B. Tentative Formalization</h3>



<p>Let’s consider what definition “wisdom” would have, in this framework.</p>



<p>All “object-level” cognitive heuristics here have the form <em>Y → X</em>, where <em>Y</em> is some environmental variable and <em>X</em> are the variables that cause <em>Y</em>. I. e., every cognitive heuristic <em>Y → X</em> can be characterized as an inversion of some environmental dynamic <em>X → Y</em>.</p>



<p>“Wisdom”, in this formulation, seems to correspond to <em>inversions of inversions</em>. Its form is</p>



<p>(<em>Y → X</em>)<em> → Y.</em></p>



<p>It takes in some object-level inversion – an object-level cognitive heuristic – and predicts things about <em>the performance of a cognitive policy that uses this heuristic</em>.</p>



<p>Examining this definition from both ends:</p>



<ul class="wp-block-list">
<li>If we’re considering an object-level output-side heuristic <em>E → A,</em> which maps environmental variables <em>E</em> to actions <em>A</em> that need to be executed in order to set <em>E</em> to specific values – i. e., a “planning” heuristic – the corresponding “wisdom” heuristic (<em>E → A</em>)<em> → E</em> tells us what object-level consequences <em>E</em> the reasoning of this type <em>actually</em> results in.</li>



<li>If we’re considering an object-level input-side heuristic <em>O → E</em> mapping observations <em>O</em> to their environmental causes <em>E</em> – i. e., an “inference” heuristic – the corresponding “wisdom” heuristic (<em>O → E</em>)<em> → O</em> tells us what we’d <em>actually</em> expect to see going forward, and whether the new observations would diverge from our object-level inferences. (I. e., whether we expect that the person who offended us would <em>actually</em> start acting like they hate us, going forward.)</li>
</ul>



<p>Admittedly, some of these speculations are fairly shaky. The “input-side” model of wisdom, in particular, seems off to me. Nevertheless, I think this toy formalism does make some intuitive sense.</p>



<p>It’s also clear, from this perspective, why “wisdom” is inherently more complicated / hard-to-compute than “normal” reasoning: it explicitly <em>iterates on</em> object-level reasoning.</p>
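


<p>A toy sketch of the (<em>Y → X</em>)<em> → Y</em> pattern, reusing the book-reading example from section 2 (all numbers invented): the object-level heuristic maps the goal to an action each time it is consulted, while the “wisdom” heuristic takes that heuristic as input, simulates the policy of repeatedly following it, and reports the actual consequence for the goal variable:</p>



<pre class="wp-block-code"><code># Toy sketch of wisdom as an inversion of an inversion: (Y -> X) -> Y.
# All numbers are invented.

def object_level_heuristic(minutes_until_alarm):
    """Y -> X: each individual 15-minute delay looks harmless, so keep reading."""
    return "read 15 more minutes"

def wisdom_heuristic(heuristic, minutes_until_alarm=480):
    """(Y -> X) -> Y: simulate a policy that keeps following the heuristic,
    and report the actual consequence for the thing we care about (sleep)."""
    for _ in range(20):                       # the decision gets re-made every chapter
        if heuristic(minutes_until_alarm) == "read 15 more minutes":
            minutes_until_alarm -= 15
    return minutes_until_alarm

print(wisdom_heuristic(object_level_heuristic))   # 180 minutes of sleep left, not ~480</code></pre>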



<h3 class="wp-block-heading">2C. Crystallized Wisdom</h3>



<p>In humans, cognitive heuristics are often not part of explicit knowledge, but are instead stored as learned instincts, patterns of behavior, or emotional responses – or “<a href="https://www.lesswrong.com/s/nyEFg3AuJpdAozmoX/p/iCfdcxiyr2Kj8m8mT">shards</a>”, in the parlance of one popular framework.</p>



<p>Since wisdom is a subset of cognitive heuristics, that applies to it as well. “Wise” heuristics are often part of “common sense”, tacit knowledge, cultural norms, and hard-to-articulate intuitions and hunches. In some circumstances, they’re stored in a format that doesn’t refer to the initial object-level heuristic at all! Heuristics such as “don’t violate deontology” don’t activate <em>only</em> in response to object-level criminal plans.</p>



<p>(Essentially, wisdom is conceptually/<a href="https://www.lesswrong.com/posts/dKAJqBDZRMMsaaYo5/in-logical-time-all-games-are-iterated-games#Logical_Time">logically</a> downstream of object-level heuristics, but not necessarily <em>cognitively downstream, </em>in the sense of moment-to-moment perceived mental experiences.)</p>



<p>Indeed, wisdom heuristics, by virtue of being more computationally demanding, are likely to be stored in an “implicit” form <em>more often</em> than “object-level” heuristics. Deriving them explicitly often requires looking at “global” properties of the environment or your history in it, considering the whole reference class of the relevant object-level cognitive heuristic. By contrast, object-level heuristics themselves involve a merely “local” inversion of some environmental dynamic.</p>



<p>As a result, “wisdom” usually only accumulates after humanity has engaged with some domain of reality for a while. Similarly, individual people tend to become “wise” only after they have personally been immersed in that domain for a while – after they “had some experience” with it.</p>



<p>That said, to the extent that this model of wisdom is correct, wisdom <em>can</em> nevertheless be inferred “manually”, with enough effort. After all, it’s still merely a function of the object-level domain. It <em>could</em> be derived purely from the domain’s object-level model, given enough effort and computational resources, no “practical experience” needed.</p>



<h2 class="wp-block-heading">3. Would AGIs Pursue Wisdom &amp; Philosophical Competence?</h2>



<p>In my view, the answer is a clear “yes”.</p>



<p>To start off, let’s define an “AGI” as “a system which can discover novel abstractions (such as new fields of science) in any environment that has them, and fluently use these abstractions in order to better navigate or optimize its environment in the pursuit of its goals”.</p>



<p>It’s somewhat at odds with the more standard definitions, which tend to characterize AGIs as, for example, “systems that can do most cognitive tasks that a human can”. But I think it captures some intuitions better than the standard definitions. For one, state-of-the-art LLMs certainly seem to be “capable of doing most cognitive tasks that humans can”, yet most specialists and laymen would agree that they are not AGI. Per my definition, it’s because LLMs cannot discover <em>new</em> ontologies: they have merely learned vast repositories of abstractions that were pre-computed for them by humans.</p>



<p>As per my arguments, <strong>philosophical reasoning is convergent:</strong></p>



<ul class="wp-block-list">
<li>It’s a subset of general non-paradigmatic research&#8230;</li>



<li>… which is the process of deriving new ontologies…</li>



<li>… which are useful because they make it possible to decompose the world into domains that can be reasoned about mostly-separately…</li>



<li>… which is useful because it reduces the computational costs needed for making plans or inferences.</li>
</ul>



<p>Any efficient bounded agent, thus, would necessarily become a competent philosopher, and it would engage in philosophical reasoning regarding all domains of reality that (directly or indirectly) concern it.</p>



<p>Consider the opposite: “philosophically incompetent” or incapable reasoners. Such reasoners would only be able to make use of pre-computed <em>H → L</em> relations. They would not be able to derive genuinely <em>new</em> abstractions and create <em>new</em> fields. Thus, they wouldn’t classify as “AGI” in the above-defined sense.</p>



<p>They’d be mundane, <em>non-general</em> software tools. They could still be quite complex and intelligent in some ways, up to and including writing graduate-level essays or even completing formulaic engineering projects. Nevertheless, they’d fall short of the “AGI” bar. (And would likely represent no existential risk on their own, outside cases of misuse by human actors.)</p>



<p>As a specific edge case, we can consider humans who are capable researchers in their domain – including being able to derive novel ontologies – but are still philosophically incompetent in a broad sense. I’d argue that this corresponds to the split between “general” philosophical reasoning, and “philosophy as a discipline” I’ve discussed in 1E. These people likely <em>could</em> be capable philosophers, but simply have no interest in specializing in high-level reasoning about the-world-in-general, nor in exploring its highest-level ontology.</p>



<p>Something similar <em>could</em> happen with AGIs trained/designed a specific way. But in the limit of superintelligence, it seems likely that <em>all</em> generally intelligent minds converge to being philosophically competent.</p>



<p><strong>Wisdom</strong> is also convergent. When it comes down to it, wisdom seems to just be an additional trick for making correct plans or inferences. “Smart but unwise” reasoning would correspond to cases in which you’re not skeptical of your own decision-making procedures, are mostly not trying to improve them, and only take immediate/local consequences of your action into account. Inasmuch as AGIs would be capable of long-term planning across many domains, they would strive to be “wise”, in the sense I’ve outlined in this essay.</p>



<p>And those AGIs with superhuman general-intelligence capabilities would be able to derive the “wisdom” heuristics <em>more quickly</em> than humans, with little or no practical experience in a domain.</p>



<h2 class="wp-block-heading">4. Philosophically Incompetent Human Decision-Makers</h2>



<p>That said, just because AGIs would be philosophically competent, that doesn’t mean they’d by default address and fix the philosophical incompetence of the humans who created them – <em>even if</em> these AGIs were otherwise aligned to human intentions and inclined to follow human commands.</p>



<p>The main difficulty here is that humans store their values in a decompiled/incomplete format. We don’t have explicit utility functions: our values are a combination of explicit consciously-derived preferences, implicit preferences, emotions, subconscious urges, and so on. (Theoretically, <a href="https://www.lesswrong.com/posts/okkEaevbXCSusBoE2/how-would-an-utopia-maximizer-look-like">it may be possible</a> to compile all of that into a utility function, but that’s a very open problem.)</p>



<p>As a result, mere <em>intent alignment</em> – designing an AGI which would do what its human operators “genuinely want” it to do, when they give it some command – still leaves a lot of philosophical difficulties and free parameters.</p>



<p>For example, suppose the AGI&#8217;s operators, in a moment of excitement after they activate their AGI for the first time, tell it to solve world hunger. What should the AGI do?</p>



<ul class="wp-block-list">
<li>Should it read off the surface-level momentary intent of this command, design some sort of highly nutritious and easy-to-produce food, and distribute it across the planet in the specific way the human is currently imagining this?</li>



<li>Should it extrapolate the human&#8217;s values, and execute the command the way the human <em>would have wanted to</em> execute it if they&#8217;d thought about it for a bit, rather than the way they&#8217;re envisioning it in the moment?
<ul class="wp-block-list">
<li>(For example, perhaps the image flashing through the human&#8217;s mind right now is of helicopters literally dropping crates full of food near famished people, but it&#8217;s actually more efficient to do it using airplanes.)</li>
</ul>
</li>



<li>Should it extrapolate the human&#8217;s values a bit, and point out specific issues with this plan that the human might think about later (e. g., that such sudden large-scale activity might provoke rash actions from various geopolitical actors, leading to vast suffering), then give the human a chance to abort?</li>



<li>Should it extrapolate the human&#8217;s values a bit further, and point out issues the human might <em>not</em> have thought of (including teaching the human any novel load-bearing concepts necessary for understanding said potential issues)?</li>



<li>Should it extrapolate the human&#8217;s values a bit further still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?</li>



<li>Should it extrapolate the human&#8217;s values <em>far afield</em>, interpret the command as &#8220;maximize eudaimonia&#8221;, and do that, disregarding the specific rough way of how they gestured at the idea?
<ul class="wp-block-list">
<li>In other words: should it directly optimize for the human’s <a href="https://www.lesswrong.com/tag/coherent-extrapolated-volition">coherent extrapolated volition</a> (which is something like the ultimate output of abstracting-over-ethics that I’d gestured at in 1F)?</li>
</ul>
</li>



<li>Should it remind the human that they&#8217;d wanted to be careful regarding how they use the AGI, and ask whether they actually want to proceed with something so high-impact right now?</li>



<li>Should it <em>insist</em> that the human is currently too philosophically confused to make such high-impact decisions, and the AGI first needs to teach them a lot of novel concepts, before they can be sure there are no unknown unknowns that’d put their current plans at odds with their extrapolated values?</li>
</ul>



<p>There are many, many drastically different ways to implement something as seemingly intuitive as “Do What I Mean”. Unless “aligning AIs to human intent” is done in a way that puts as much emphasis as possible on philosophical competence – including <em>refusing</em> human commands if the AGI judges them unwise or philosophically incompetent – even an AGI that is intent-aligned (in some strict technical sense) might lead to existentially catastrophic outcomes, up to and including the possibility of <a href="https://www.lesswrong.com/tag/risks-of-astronomical-suffering-s-risks">suffering at astronomical scales</a>.</p>



<p>For example, suppose the AGI is designed to act on the surface-level meaning of commands, and it’s told to “earn as much money as possible, by any means necessary”. As I’ve argued in Section 3, it <em>would</em> derive a wise and philosophically competent understanding of what “obeying the surface-level meaning of a human’s command” means, and how to wisely and philosophically competently execute on this specific command. But it would not question <em>the wisdom and philosophical competence of the command from the perspective of a counterfactual wiser human</em>. Why would it, unless specifically designed to?</p>



<p>Another example: If the AGI is “left to its own devices” regarding how to execute on some concrete goal, it’d likely do everything “correctly” regarding certain philosophically-novel-to-us situations, such as the hypothetical possibility of <a href="https://www.lesswrong.com/tag/acausal-trade">acausal trade</a> with the rest of the multiverse. (If <a href="https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/">the universal prior is malign</a>, and using it is a bad idea, an actual AGI would just use something else.) However, if the AGI is <a href="https://www.lesswrong.com/tag/corrigibility">corrigible</a>, and it explains the situation to a philosophically incompetent human operator before taking any action, <em>the human</em> might incorrectly decide that giving in to acausal blackmail is the correct thing to do, and order the AGI to do so.</p>



<p>On top of that, there’s a certain Catch-22 at play. Convincing the decision-makers or engineers that the AGI must be designed such that it’d only accept commands from wise philosophically competent people <em>already</em> requires some level of philosophical competence on the designers’ part. They’d need to know that there even <em>are</em> philosophical “unknown unknowns” that they must be wary of, and that faithfully interpreting human commands is more complicated than just reading off the human’s intent at the time they give the command.</p>



<p>How to arrive at that state of affairs is an open question.</p>



<h2 class="wp-block-heading">5. Ecosystem-Building</h2>



<p>As argued in 1G, the best way to upscale the process of attaining philosophical competence and teaching it to people would be to <em>move metaphilosophy outside the domain of philosophy</em>. Figure out the ontology suitable for robustly describing any and all kinds of philosophical reasoning, and decouple it from the rest of reality.</p>



<p>This would:</p>



<ul class="wp-block-list">
<li><strong>Allow more people to specialize in metaphilosophy</strong>, since they’d only need to learn about this specific domain of reality, rather than becoming interdisciplinary experts reasoning about the world at a high level.</li>



<li><strong>Simplify the transfer of knowledge and the process of training new people</strong>. Once we have a solid model of metaphilosophy, that’d give us a ground-truth idea of how to translate philosophical projects into concrete steps of action (i. e., what the <em>H → L</em> functions are). Those could be more easily taught in a standardized format, allowing at-scale teaching and at-scale delegation of project management.</li>



<li><strong>Give us the means to measure philosophical successes and failures</strong>, and therefore the means to steer philosophical projects and keep them on track. (Which, again, would allow us to scale the size and number of such projects. How well they perform would become <em>legible</em>, giving us the ability to optimize for that clear metric.)</li>



<li><strong>Provide legibility in general.</strong> Once we have a concrete, convergent idea of what philosophical projects are, how they succeed, and what their benefits are, we’d be able to more easily argue the importance of this agenda to other people and organizations, increasing the agenda’s reach and attracting funding.</li>
</ul>



<p>Hopefully this essay and the formalisms in it provide the starting point for operationalizing metaphilosophy in a way suitable for scaling it up.</p>



<p>Something similar goes for wisdom – although unlike teaching philosophical competence, this area seems less neglected. (Large-scale projects for “<a href="https://www.lesswrong.com/posts/XqmjdBKa4ZaXJtNmf/raising-the-sanity-waterline">raising the sanity waterline</a>” have been attempted in the past, and I think any hypothetical “wisdom-boosting” project would look more or less the same.)</p>



<h2 class="wp-block-heading">6. Philosophy Automation</h2>



<p>In my view, automating philosophical reasoning is an AGI-complete problem. I think that the ability to engage in qualitative/non-paradigmatic research is what <em>defines</em> a mind as generally intelligent.</p>



<p>This is why LLMs, for example, are so persistently <a href="https://www.lesswrong.com/posts/nQwbDPgYvAbqAmAud/llms-for-alignment-research-a-safety-priority#What_is_wrong_with_current_models_">bad at it</a>, despite their decent competence in other cognitive areas. I would argue that LLMs contain <a href="https://www.lesswrong.com/posts/3JRBqRtHBDyPE3sGa/a-case-for-the-least-forgiving-take-on-alignment#6__The_Case_of_LLMs">vast amounts of crystallized heuristics</a> – that is, <em>H → L</em> functions, in this essay’s terminology – yet no ability to derive new ontologies/abstractions <em>H</em> given a low-level system <em>L</em>. Thus, there are <em>no</em> types of philosophical reasoning they’d be good at; no ability to contribute on their own/autonomously.</p>



<p>On top of that, since we ourselves don’t know the ontology of metaphilosophy either, that likely cripples our ability to use AI tools for philosophy <em>in general</em>. The reason is the same as the barrier to scaling up philosophical projects: we don’t know how the domain of metaphilosophy factorizes, which means we don’t know how to <a href="https://www.lesswrong.com/posts/3gAccKDW6nRKFumpP/why-not-just-outsource-alignment-research-to-an-ai">competently outsource</a> philosophical projects and sub-projects, how to train AIs specialized in this, and how to measure their successes or failures.</p>



<p>One approach that <em>might</em> work is “cyborgism”, as <a href="https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism">defined by janus</a>. Essentially, it uses LLMs as a brainstorming tool, allowing us to scope out vastly larger regions of concept-space for philosophical insights, with the LLMs’ thought-processes steered by a human. In theory, this gives us the best of both worlds: a human’s philosophy-capable algorithms are enhanced by the vast repository of crystallized <em>H → L</em> and <em>L → H</em> functions contained within the LLM. Janus has been able to generate some <a href="https://www.lesswrong.com/posts/vPsupipfyeDoSAirY/language-ex-machina">coherent-ish philosophical artefacts</a> this way. However, this idea has been around for a while, and so far, I haven’t seen any payoff from it.</p>



<p>Overall, I’m very skeptical that LLMs could be of any help here whatsoever, besides their standard mundane-utility role of teaching people new concepts in a user-tailored format. (Which might be helpful, in fact, but it isn’t the main bottleneck here. As I’ve discussed in Section 5, this sort of at-scale distribution of standardized knowledge only becomes possible <em>after</em> the high-level ontology of what we want to teach is nailed down.)</p>



<p>What <em>does</em> offer some hope for automating philosophy is the research agenda focused on the Natural Abstraction Hypothesis. I’ve discussed it above, and my tentative operationalization of philosophy is based on it. The agenda is focused on finding a formal definition for abstractions (i. e., layers of ontology), and what algorithms could at least <em>assist us</em> with deriving new ones.</p>



<p>Thus, inasmuch as my model of philosophy is right, the NAH agenda is precisely focused on operationalizing philosophical reasoning. John Wentworth additionally discusses some of the NAH’s applications for metaphilosophy <a href="https://www.lesswrong.com/posts/HfqbjwpAEGep9mHhc/the-plan-2023-version#How_is_abstraction_a_bottleneck_to_metaphilosophy_">here</a>.</p>



<p><em>Thanks to David Manley, Linh Chi Nguyen, and Bradford Saad for providing extensive helpful critique of an earlier draft, and to John Wentworth for proofreading the final version.</em></p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Some Preliminary Notes on the Promise of a Wisdom Explosion</title>
		<link>http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Sun, 27 Oct 2024 17:08:25 +0000</pubDate>
				<category><![CDATA[Essay Competition on the Automation of Wisdom and Philosophy]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3633</guid>

					<description><![CDATA[By Chris Leong This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy. Notes]]></description>
										<content:encoded><![CDATA[
<p>By Chris Leong</p>



<p><em>This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy.</em></p>



<ul class="wp-block-list">
<li>Leading AI labs are aiming to trigger an intelligence explosion, but perhaps this is a grave mistake? Maybe they should be aiming to trigger a “wisdom explosion” instead?:
<ul class="wp-block-list">
<li>Defining this as “pretty much the same thing as an intelligence explosion, but with wisdom instead” is rather vague<span id='easy-footnote-1-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-1-3633' title='And likely even frustrating for some folk! Sorry if this is the case, but my focus here is really on starting a conversation and I understand how this could be annoying if you prefer posts that are written in such a way to make it as quick and easy as possible to determine whether what the post is saying is true.'><sup>1</sup></a></span>, but I honestly think it is good enough for now. I think it’s fine for early-stage exploratory work to focus on opening up a new part of conversational space rather than trying to perfectly pin everything down<span id='easy-footnote-2-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-2-3633' title='I plan to examine this in more detail in part seven (Is a “Wisdom Explosion” a coherent concept?) of my upcoming Less Wrong seqence on Training Wise AI Advisers via Imitation Learning'><sup>2</sup></a></span>.</li>



<li>Regarding my definition of wisdom, I’ll be exploring this in more detail in part six (“What kinds of wisdom are valuable?”) of my upcoming Less Wrong sequence, but for now, I’ll just say that I take an expansive definition of what wisdom is and that achieving a “wisdom explosion” would likely require us to train a system that is fairly strong on a number of different subtypes. As an example though, if a coalition of groups focused on AI safety were able to wisely strategize, wisely co-ordinate and wisely pursue methods of non-manipulative persuasion, I’d feel significantly better about humanity&#8217;s chances of surviving.</li>



<li>In any case, I don’t want to center my own understanding of wisdom too much. Instead, I’d encourage you to consider the types of wisdom that you think might be most valuable for achieving a positive future for humanity and whether the arguments below follow given how you conceive of wisdom, rather than how I conceive of wisdom<span id='easy-footnote-3-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-3-3633' title='One of the risks of saying too much about how I conceive of wisdom too early on is that it may have the unintentional effect of accidentally narrowing the conversation or encouraging people to anchor too much on my conceptions.'><sup>3</sup></a></span>.</li>



<li>In an intelligence explosion, the recursive self-improvement occurs within a single AI system. However, in terms of defining a wisdom explosion, I want to take a more expansive view. In particular, instead of requiring that it occur within a single AI, I want to allow the possibility that it may occur within a cybernetic system consisting of both humans and AIs, either within a single organisation, or within a cluster of collaborating organisations. In fact, I think this is the best route for pursuing a wisdom explosion.</li>



<li>I find the version involving a cluster of collaborating organisations particularly compelling both because it would enable the pooling of resources<span id='easy-footnote-4-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-4-3633' title='Particularly important since wisdom is a cluster of different things and developing an entirely new paradigm would be a lot of work'><sup>4</sup></a></span> for developing wisdom tech, but also because it would enable pursuing a <a href="https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/pivotal-outcomes-and-pivotal-processes">pivotal process</a> rather than a pivotal action.</li>
</ul>
</li>



<li>For purposes of simplicity, I’ll talk about “responsible &amp; wise” actors vs. “irresponsible &amp; unwise” actors even though responsibility and wisdom don’t always line up<span id='easy-footnote-5-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-5-3633' title='Arguments always involve some degree of simplification. The question is whether the additional clarity outweighs the reduction in accuracy.'><sup>5</sup></a></span>.</li>



<li>I will develop this argument more fully in my upcoming Less Wrong post “Artificial Intelligence/Capabilities<span id='easy-footnote-6-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-6-3633' title='Intelligent and capabilities aren’t quite the same thing. I’ll explore the distinction in more detail in my upcoming sequence.'><sup>6</sup></a></span> as Potentially Fatal Mistake. Artificial Wisdom as Antidote”, but an outline of the argument I plan to make is below</li>



<li>Firstly, I will argue that the pursuit of an intelligence explosion will most likely result in catastrophe:
<ul class="wp-block-list">
<li>Capabilities inevitably proliferate: key factors include a strong open-source community, large career incentives for researchers to publish and challenges with preventing espionage</li>



<li>The attack-defense balance strongly favors the attacker: attackers only need to get lucky once, while defenders need to get lucky every time</li>



<li>The proliferation of capabilities most likely leads to an AI arms race: the diffusion of capabilities levels the playing field, which forces actors to race to maintain their lead</li>



<li>Intelligence/Capability tech differentially benefits irresponsible &amp; unwise actors: Recklessly racing ahead increases your access to resources, whilst responsible &amp; wise actors need time to figure out how to act wisely</li>



<li>Society struggles to adapt: Government processes aren’t designed to handle a technology that moves as fast as AI. Reckless &amp; unwise actors will use their political influence to push society to adopt unwise policies.</li>
</ul>
</li>



<li>In contrast, I’ll argue that the pursuit of a wisdom explosion is likely to be much safer:
<ul class="wp-block-list">
<li>Pursuing wisdom tech likely produces fewer capability externalities
<ul class="wp-block-list">
<li>A wisdom explosion might be achievable with AIs built on top of relatively weak base models: think of the wisest people you know; they don’t all have massive amounts of cognitive “firepower”</li>
</ul>
</li>



<li>Both malicious and reckless &amp; unwise actors are less likely to pursue such technologies:
<ul class="wp-block-list">
<li>They are less likely to value wisdom, especially given the trade-off with pursuing shiny, shiny capabilities.</li>
</ul>
</li>



<li>Reckless &amp; unwise actors are disadvantaged in pursuing a wisdom explosion:
<ul class="wp-block-list">
<li>There is likely a minimum bar of wisdom required to trigger such an explosion. As they say, garbage in, garbage out.</li>



<li>Even if they were able to trigger such an explosion, it’d likely take them longer and/or require a higher capability level. Remember I’m proposing producing a cybernetic system, so the human operators play a key role here.</li>
</ul>
</li>



<li>Reckless &amp; unwise actors are less likely to know what to do with any wisdom tech that they develop or acquire:
<ul class="wp-block-list">
<li>This is less true at higher capability levels where the system can help them figure out what they should be asking, but they might just ignore it.</li>
</ul>
</li>



<li>Even if reckless &amp; unwise actors actually pursue and then manage to acquire wisdom tech, it may not be harmful:
<ul class="wp-block-list">
<li>Acquiring such technology may make them realise their foolishness.</li>



<li>They may then either delete their model, hand it over to someone more responsible, or start working towards becoming a more responsible actor themselves</li>
</ul>
</li>



<li>Responsible actors can use wisdom tech to help them attempt to non-manipulatively persuade irresponsible actors to be more responsible:
<ul class="wp-block-list">
<li>My intuition is that this is much harder for intelligence/capability tech which will likely be superhuman at persuasion soon, but which is not a natural fit for non-manipulative persuasion<span id='easy-footnote-7-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-7-3633' title='I expect most techniques for training wisdom to be adaptable towards this end. Non-manipulative persuasion requires difficult subjective judgements, just like wisdom'><sup>7</sup></a></span></li>
</ul>
</li>
</ul>
</li>



<li>I also think it may be viable. I’ll develop these arguments more fully in the seventh post of my upcoming Less Wrong sequence “Is a “Wisdom Explosion” a coherent concept?”, but my high-level thoughts are as follows:
<ul class="wp-block-list">
<li>Before we begin: What level of wisdom would we need to spiral up to count as having achieved a “wisdom explosion”? We might not need to set the bar too high (insofar as superhuman systems go). Saving the world may require superhuman wisdom, but I don’t think it would have to be that superhuman.</li>



<li>Wisdom seems like the kind of thing where having a greater degree of wisdom makes it easier to acquire even more. In particular, you are more likely to be able to discern who is providing wise or unwise advice. You are also more likely to be able to discern which assumptions require questioning.</li>



<li>Insofar as we buy into the argument for an intelligence explosion being viable, one might naively assume that this also increases the chance that a wisdom explosion is viable:
<ul class="wp-block-list">
<li>One could push back against this by noting that intelligence is much easier to train than wisdom because, for intelligence, we can train our system on problems with known solutions or with a simulator. This is true, but it doesn’t mean that we can’t use these kinds of things for training wisdom. Instead, it just means that we have to be more careful in terms of how we go about it.</li>
</ul>
</li>



<li>While a certain level of wisdom would likely be required in order to trigger a wisdom explosion, the level might not be that high:
<ul class="wp-block-list">
<li>It’s less about being wise and more about not being so ideological that you are unable to break out of an attractor</li>
</ul>
</li>



<li>As mentioned before, our base models might not need to be particularly large (by the crazy standards of frontier models). There’s a chance that a wisdom explosion could be triggered at a lower capability level than an intelligence explosion<span id='easy-footnote-8-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-8-3633' title='Admittedly, GPT o1 makes this less likely as it indicates a greater role for inference time scaling going forward.'><sup>8</sup></a></span> if wisdom isn’t really about cognitive firepower:
<ul class="wp-block-list">
<li>If this is true, then we may be able to trigger a wisdom explosion earlier than an intelligence explosion</li>



<li>This may also address some concerns about inner alignment if we believe that smaller models tend to be more controllable<span id='easy-footnote-9-3633' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/#easy-footnote-bottom-9-3633' title='Plausible, but unclear'><sup>9</sup></a></span>.</li>
</ul>
</li>



<li>Some people might think that wisdom is too fuzzy to make any progress at all. I’ll discuss this in “An Overview of “Obvious” Approaches to Training Wise AI Advisors” and further in the third post of my upcoming Less Wrong sequence, “Against Learned Helplessness With Training Wise AI”.</li>
</ul>
</li>



<li>“Wisdom explosion” as creative stimulus:
<ul class="wp-block-list">
<li>Even if the concept of a wisdom explosion turns out to be incoherent or triggering a wisdom explosion turns out to be impossible, I still think that investigating and debating these topics would be a valuable use of time. I can’t fully explain this, but certain questions feel like obvious or natural questions to ask. Noticing these questions and following the line of inquiry until you reach a natural conclusion is one of the best ways of developing your ability to think clearly about confusing matters.</li>



<li>The value of gaining a new frame isn’t just in the potential application of the frame itself, but in how it can reveal assumptions within your worldview that you may not even be aware of.</li>
</ul>
</li>
</ul>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>An Overview of “Obvious” Approaches to Training Wise AI Advisors</title>
		<link>http://aiimpacts.org/an-overview-of-obvious-approaches-to-training-wise-ai-advisors/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Fri, 25 Oct 2024 02:53:57 +0000</pubDate>
				<category><![CDATA[Essay Competition on the Automation of Wisdom and Philosophy]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3632</guid>

					<description><![CDATA[By Chris Leong This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy. I consider four different “obvious” high-level approaches to training wise AI advisors. I consider imitation learning <a class="mh-excerpt-more" href="http://aiimpacts.org/an-overview-of-obvious-approaches-to-training-wise-ai-advisors/" title="An Overview of “Obvious” Approaches to Training Wise AI Advisors"></a>]]></description>
										<content:encoded><![CDATA[
<p>By Chris Leong</p>



<p><em>This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy.</em></p>



<p>I consider four different “obvious” high-level approaches to training wise AI advisors. I consider imitation learning to be the most promising approach as I’ll argue in an upcoming sequence on Less Wrong, however, I’ve tried to take a more balanced approach in these notes.</p>



<p><strong>Approach</strong>:</p>



<ul class="wp-block-list">
<li>Imitation learning: Training imitation learning agents on a bunch of people the lab considers to be wise.
<ul class="wp-block-list">
<li>We’d be fine-tuning a separate base model for each advisor using human demonstrations. Ideally, we’d avoid using any reinforcement learning, but that might not be possible.</li>



<li>Additional training details &#8211; I don&#8217;t know enough about training frontier models to be sure this is a good plan, but here is a rough draft:
<ul class="wp-block-list">
<li>Train a model on the distribution of Internet data</li>



<li>Fine-tune it on clean data to remove the tendency to occasionally generate rubbish</li>



<li>Fine-tune it according to the kinds of outputs you want it to produce. Low quality is fine at this stage (articles, chat logs)</li>



<li>Fine-tune it on high-quality data (ie. published philosophy essays, chat logs from people having serious discussions where they actually try to answer the question being asked)</li>



<li>Fine-tune it on your data from everyone you identified as wise</li>



<li>Create specific fine-tunes (or specific LoRA adapters) for each wise individual (a minimal sketch of this step appears after this list)</li>
</ul>
</li>



<li>Challenges:
<ul class="wp-block-list">
<li>Some of the steps listed above might interfere with the previous steps. For example, some of the data from people identified as wise might come from non-serious discussions.</li>



<li>Maybe it makes sense to add meta-data at the start (ie. serious discussion, person identified as wise) for both training and inference. This might resolve the previous issue.</li>
</ul>
</li>
</ul>
</li>



<li>The Direct Approach: Training an AI to be wise based on human demonstrations and feedback
<ul class="wp-block-list">
<li>We’d most likely use supervised learning and RLHF on a base model.</li>
</ul>
</li>



<li>The Principled Approach: Attempting to understand what wisdom is at a deep principled level and build an AI that provides advice according to those principles:
<ul class="wp-block-list">
<li>While we’d ideally like to develop a complete principled understanding of wisdom, more realistically we’d probably only be able to manage a partial understanding</li>
</ul>
</li>



<li>The Scattergun Approach: This approach involves throwing a bunch of potentially relevant wise principles and/or anecdotes (nuggets of wisdom) from a fixed set at the decision-makers, in the hope that reading through them will lead to a wise decision:
<ul class="wp-block-list">
<li>A model would be trained to contextually figure out what nuggets to prioritize based on past user ratings likely by using RLHF on a base model.</li>
</ul>
</li>
</ul>
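

<p>To make the last step of the imitation-learning plan above more concrete, here is a minimal sketch of what training a separate LoRA adapter per advisor might look like, assuming the Hugging Face transformers, peft and datasets libraries. The base model name, hyperparameters and the two-example dataset are illustrative placeholders rather than recommendations; a real run would use the curated per-advisor corpus described above.</p>



<pre class="wp-block-code"><code>
# Minimal sketch (not a validated recipe) of the final per-advisor step:
# fine-tune a small LoRA adapter on one advisor's curated writings, keeping
# one adapter per advisor on top of a shared base model.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "your-org/wise-base-model"  # hypothetical, already fine-tuned base

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Only the small LoRA adapter weights are trained, so storing a separate
# adapter per advisor stays cheap.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Toy stand-in for one advisor's curated corpus, with the kind of metadata
# tags suggested above prepended to each document.
advisor_texts = [
    "[serious discussion][advisor A] On when to question your assumptions ...",
    "[essay][advisor A] On weighing long-term consequences ...",
]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = Dataset.from_dict({"text": advisor_texts}).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="advisor_A_lora",
                           num_train_epochs=1,
                           per_device_train_batch_size=1,
                           learning_rate=1e-4),
    train_dataset=train_data,
    # Causal LM collator: labels are the input ids, shifted inside the model.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("advisor_A_lora")  # one saved adapter per advisor
</code></pre>



<p>Keeping each advisor as a small adapter over a shared base makes it cheap to add or drop advisors later and to compare their answers side by side at inference time.</p>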



<p><strong>Definitions:</strong></p>



<ul class="wp-block-list">
<li>Safe LLM: I’m quite worried that if we fine-tune an LLM hard on wisdom we’ll simply end up with an LLM that optimizes against us. A safe LLM would be an LLM where we’ve taken steps to reduce the chance of significant adversarial optimization. Ways of achieving this might include limiting the size of the base model, reducing RLHF or avoiding fine-tuning the model too hard.</li>



<li>Wisdom explosion: When a system is able to recursively self-improve its wisdom. This doesn’t have to continue forever, as long as it caps out at a superhuman level. The self-improving system doesn’t have to be a single AI, but may be a cybernetic system consisting of a bunch of operators and AI’s in an organization, or even a network of such organizations. See Some Preliminary Notes on the Promise of a Wisdom Explosion for more details.</li>
</ul>



<p><strong>Considerations:</strong></p>



<ul class="wp-block-list">
<li>Base Power level: How capable is this method of training extremely wise agents?</li>



<li>Feasibility: How practical is it to make such a system?</li>



<li>Adversarial optimization: To what extent do we have to worry that we may be training a system to adversarially optimize against us?</li>



<li>Application of principles: What kind of support does the system provide in figuring out how to apply the principles?</li>



<li>Generalization: How well does this technique generalize out of distribution?</li>



<li>Wisdom explosion potential: Could this approach be useful for recursive self-wisening?</li>



<li>Holisticity:
<ul class="wp-block-list">
<li>I&#8217;m worried that mixing and matching principles from various systems of wisdom can result in a new system that is incredibly unwise, even if each principle is wise within its original system. As an example, Warren Buffett might be able to provide wise advice on how to become wealthy and the Dalai Lama wise advice on spiritual development, but perhaps these are two separate paths and what is wise for pursuing one path would be foolish for the other. There are two reasons why I consider holisticity to be good:
<ul class="wp-block-list">
<li>Consistency: Individual views have the advantage of consistency whilst mixing and matching breaks this assumption.</li>



<li>Commitment: Sometimes there are advantages to picking a path, any path, rather than just averaging everything together. As an example, maybe it&#8217;s better to either completely devote myself to pursuing programming or completely devote myself to pursuing art rather than split myself between the two and succeed at neither.</li>
</ul>
</li>
</ul>
</li>
</ul>



<p><strong>Evaluation:</strong></p>



<p>Please keep in mind that my assessments of these techniques on each of the criteria are essentially hot-takes.</p>



<ul class="wp-block-list">
<li>Imitation Learning:
<ul class="wp-block-list">
<li>Evaluation of base proposal:
<ul class="wp-block-list">
<li>Base Power level: Depends hugely on who you are able to train on. The wisest people are quite wise, but you might not be able to obtain their permission to train on their data or to persuade them to collaborate with you.</li>



<li>Feasibility:
<ul class="wp-block-list">
<li>Standard imitation learning isn’t particularly challenging. However, we may need to advance the state of the art in order to obtain sufficiently accurate results.</li>



<li>Even if we advance the state of the art, obtaining sufficiently high-quality data might pose a significant challenge</li>



<li>There are many historical figures with large amounts of data. The major limitation here is that we can’t obtain more if they’re dead. </li>



<li>However, we might only be able to obtain a sufficient level of accuracy with people who are alive and willing to participate in the project. This has the following advantages:
<ul class="wp-block-list">
<li>We can gather data about their responses to the kinds of questions we&#8217;re interested in</li>



<li>We can search for cases where the model is especially unsure of what they&#8217;d say and collect their responses to these questions</li>



<li>We can ask them to take a second look at places where their thought seems contradictory</li>



<li>We can ask them to produce additional chain of thought data even for things that are so basic that they wouldn&#8217;t normally bother stepping through all their reasoning</li>



<li>Contemporary folk can use Wise AI to become wiser, making them better targets to train on</li>
</ul>
</li>
</ul>
</li>



<li>Adversarial optimization:
<ul class="wp-block-list">
<li>Optimizing hard on imitation learning is less likely to be problematic than for other targets:
<ul class="wp-block-list">
<li>Safer target: Incentivizing the AI to fool us into believing that &#8220;X would say Y&#8221; rather than &#8220;Y is true&#8221; is less likely to be harmful</li>



<li>Easier validation: it is easier to talk to X and learn that they would never say Y than to learn that Y is not wise, which might take a lot of experience and incur significant costs. Even for historical figures, we can withhold part of the data as a validation set.</li>



<li>More reliable data: it is easier to gather a high-quality dataset on what X said than on what is best on some metric (which tends to be unknown for any situation of reasonable complexity).</li>
</ul>
</li>



<li>Inner alignment might still be an issue</li>



<li>If you imitate folks who are opposed to you for whatever reason, then an imitation learning agent trained on them might act adversarially.</li>



<li>If the figures we are training on are being compensated to produce training data, then this might push them towards giving you the answer you want. However, this is better than RLHF as they are being compensated for being themselves rather than attempting to either produce or rate outputs according to the company&#8217;s conception of what high-quality data looks like.</li>
</ul>
</li>



<li>Application of principles:
<ul class="wp-block-list">
<li>As an abstraction, sims provide a natural way to hold principles of wisdom along with information about the particular context in which these principles apply. Simulating dialog between these sims provides a natural way of determining which principles are more applicable to the current scenario.</li>
</ul>
</li>



<li>Holisticity:
<ul class="wp-block-list">
<li>Likely pretty good. Sims encourage us to conceive of wisdom as a holistic system rather than just individual principles. However, skeptics might argue that even the wisest humans are incredibly inconsistent.</li>
</ul>
</li>



<li>Generalization:
<ul class="wp-block-list">
<li>Likely very good.</li>



<li>Consulting multiple advisors reduces the impact from any one advisor generalizing poorly.</li>



<li>Humans can invent new principles on the fly, such that we can better adapt to new and unexpected circumstances or cover gaps in our map. I expect this to carry over to the imitation learning approach.</li>



<li>The principled and direct approaches attempt to figure out what wisdom is across all of time and space. In contrast, the simulator attempts to identify figures who are wise within a particular context and then adapt this to the current context. This is a much less challenging problem particularly since we can have the sims talk through how to adapt to the new circumstances.</li>



<li>One potentially useful frame: When we are selecting a figure, we aren&#8217;t just selecting a certain style of in-distribution reasoning, but a certain style of out-of-distribution reasoning. If our curation choices are good, then we might expect out-of-distribution reasoning to be good, whilst if our curation choices are bad, then we might expect out-of-distribution reasoning to be bad.</li>



<li>Going further: We aren&#8217;t just selecting a certain style of out-of-distribution reasoning, but also a certain style of reasoning about whether you are out of distribution.</li>
</ul>
</li>



<li>Wisdom explosion potential:
<ul class="wp-block-list">
<li>Scalable alignment techniques provide significant opportunities for amplification:
<ul class="wp-block-list">
<li>&#8220;What if you knew X?&#8221; in combination with RAG</li>



<li>Self-consistency</li>



<li>Debate</li>



<li>Iterated distillation and amplification</li>
</ul>
</li>



<li>Imitation-based techniques might actually work better with techniques ported over from humans because they’d be more in distribution.</li>
</ul>
</li>



<li>Other advantages:
<ul class="wp-block-list">
<li>Users are less likely to be overly trusting: People will understand that they need to take the advice of imitation agents with a grain of salt, particularly because of the wide range of disagreements between them, while they will more uncritically accept the advice of an AI trained to be wise.</li>



<li>Given the relative ease of imitation learning, if we need to use either the direct or principled approach, I’d recommend implementing imitation-based techniques first and using them to assist:
<ul class="wp-block-list">
<li>These assistants could help us make wise decisions about all aspects of the project, including high-level approach, planning, and personnel selection</li>



<li>These assistants could help us produce training data for the direct approach or figure out the principles for the principled approach.</li>



<li>These assistants could help us make wise decisions about how to utilize these models and work around their limitations.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>



<li>Potential mitigations:
<ul class="wp-block-list">
<li>Fixing the lack of historical data:
<ul class="wp-block-list">
<li>If there are different interpretations of a figure&#8217;s work, we can train different agents for the main schools of thought on what they meant</li>



<li>We can ask an expert on these figures to speculate about what they may have said in relation to some of the kinds of questions we&#8217;re interested in. This could be used to reduce the chance of out-of-distribution errors.</li>
</ul>
</li>



<li>Speculative: We might be able to mitigate inner alignment by averaging the weights of a bunch of models. We can then use this as a starting point and do a tiny bit of additional training to get to the real parameters for the model we’re training (a minimal sketch appears after the evaluation list below):
<ul class="wp-block-list">
<li>The average baseline is likely better for imitation learning than for optimization, because the average is more likely to be near the ideal solution for the former than for the latter. I expect that this would make the &#8216;average biasing&#8217; more effective at mitigating inner alignment issues</li>
</ul>
</li>
</ul>
</li>



<li>Most promising variant:
<ul class="wp-block-list">
<li>I’m most optimistic about a variant where swarms of AI advisors are allowed to dynamically self-organize rather than using a fixed structure like debate for amplification.</li>
</ul>
</li>
</ul>
</li>



<li>The Direct Approach:
<ul class="wp-block-list">
<li>Evaluation of base proposal:
<ul class="wp-block-list">
<li>Base Power level: Optimisation is very powerful</li>



<li>Feasibility: Very feasible. This is the standard way of training AI</li>



<li>Adversarial optimization:
<ul class="wp-block-list">
<li>The standard issues of Goodhart’s law are exacerbated when the training target is wisdom.</li>



<li>Wisdom is extremely hard to evaluate:
<ul class="wp-block-list">
<li>Wisdom is highly contested</li>



<li>Wisdom can typically only be validated by examining many different kinds of situations over long periods of time</li>



<li>It&#8217;s very easy to accidentally impose assumptions on a situation without even realizing that you are doing it. The assumptions don&#8217;t even make it to the level of consideration.</li>
</ul>
</li>



<li>Sycophancy:
<ul class="wp-block-list">
<li>The phrasing is especially likely to leak information about the user’s views on questions about wisdom</li>
</ul>
</li>



<li>Ambiguity of meaning: This can have advantages as a wise decision is still wise even if the wisdom mostly came from the user. However, it can go wrong as follows: Adam rates Y as wise assuming it will be understood as Z. Bob interprets it as Z&#8217;, which is a reasonable interpretation, but incredibly unwise.</li>
</ul>
</li>



<li>Application of principles: Pretty good. You can just get the model to generate outputs. </li>



<li>Holisticity: Quite poor. If we aren’t trusting any one person, we will need many different raters and this will likely merge their views together inconsistently</li>



<li>Generalization: Debatable. Some people might think that this will generalize better because it merges a lot of different views. Others might argue that there will be issues because we’re training it on inconsistent data.</li>



<li>Wisdom explosion potential: Maybe, but I’m dubious. I expect that triggering a wisdom explosion requires embracing a certain degree of subjectivity rather than trying to be objective. </li>
</ul>
</li>



<li>Potential mitigations:
<ul class="wp-block-list">
<li>We could aggressively filter the text used to train the base model to remove</li>



<li>We could produce a number of fine-tunes and use weight averaging to attempt to reduce adversarial optimization.</li>



<li>We could train another model to comment on the model outputs and attempt to identify situations where the model is being sycophantic or manipulative. This could be directly trained or we could provide it with a bunch of rules.</li>



<li>We could train a classifier on the latents to detect sycophancy (a minimal probe sketch appears after the evaluation list below).</li>



<li>We could attempt to use activation vectors in order to reduce sycophancy.</li>



<li>We could use some kind of self-consistency training to reduce the inconsistency created by training on data coming from multiple individuals.</li>
</ul>
</li>



<li>Most promising variant:
<ul class="wp-block-list">
<li>I suspect that the most promising approach would be a form of defense-in-depth where we just smash all of these different methods together and hope for the best.</li>
</ul>
</li>
</ul>
</li>



<li>The Principled Approach:
<ul class="wp-block-list">
<li>Evaluation of base proposal:
<ul class="wp-block-list">
<li>Base power level: Theoretically quite powerful if you were able to reverse engineer wisdom. Partial solutions are likely much less powerful.</li>



<li>Feasibility:
<ul class="wp-block-list">
<li>Feasibility challenges: wisdom is likely too multifarious to reverse engineer. The most likely result is that the team never gets anywhere near finishing, even by its own standards. It would be easy to spend an entire lifetime studying wisdom.</li>



<li>The issue isn&#8217;t just that the task is massive, it&#8217;s also that it&#8217;s very hard to have a complete map of wisdom without having experienced a huge diversity of different contexts.</li>



<li>My intuition is that this would be a challenge, even if we had fifty years, which we don&#8217;t have. I expect that we would need time to go through multiple paradigms of foundational wisdom research, with each subsequent paradigm identifying massive blind spots in the previous paradigm. Without time to iterate through paradigms, we’ll likely be too localized to the current context and unable to adapt to new circumstances.</li>
</ul>
</li>



<li>Adversarial optimization:
<ul class="wp-block-list">
<li>Much better than in the direct approach, however, unless we develop a method of inserting the principles into an AI directly, we’d still need humans to rate how well the AI is following these principles. I’m pretty worried that this would be too much exposure.</li>



<li>Inner alignment might present a problem.</li>
</ul>
</li>



<li>Application of principles: Likely pretty good since we’re training the AI to learn the principles. </li>



<li>Holisticity: Actually solving wisdom principally would be the best approach in terms of ensuring holistically coherent advice.</li>



<li>Wisdom explosion potential: Decent. There’s a chance that we don’t have to solve all of wisdom, but that identifying some core principles of wisdom would allow us to produce a seed system that could trigger a wisdom explosion.</li>



<li>Generalization:
<ul class="wp-block-list">
<li>Potentially the best if you were actually able to reverse engineer wisdom, but as I said, that’s unlikely.</li>



<li>A partial solution to the principled approach would likely have huge blindspots.</li>
</ul>
</li>
</ul>
</li>



<li>Potential mitigations:
<ul class="wp-block-list">
<li>We could merge the direct approach and the principled approach to cover any gaps by generating new principles. The downside is that this would also allow the AI to directly optimize against us. This would work as follows: use supervised learning on our list of principles and then use RLHF to train the model to produce outputs that will be highly rated. The obvious worry is that introducing RL leaves us vulnerable to being adversarially optimized against, however, there’s a chance that this is safer than the direct approach if we are able to get away with less RL<span id='easy-footnote-1-3632' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/an-overview-of-obvious-approaches-to-training-wise-ai-advisors/#easy-footnote-bottom-1-3632' title='It isn’t clear if this is actually the case. See the discussion &lt;a href=&quot;https://www.lesswrong.com/posts/rZ6wam9gFGFQrCWHc/does-reducing-the-amount-of-rl-for-a-given-capability-level&quot;&gt;here&lt;/a&gt;'><sup>1</sup></a></span>.</li>



<li>One way to reduce the amount of exposure to adversarial optimization would be to limit the AI to identify the most contextually relevant principles, rather than allowing it to generate text explaining how to do this. However, this would greatly limit the ability of the AI to assist with figuring out how to apply the principles (we could use a safe LLM for assistance instead, but this would be less powerful).</li>
</ul>
</li>



<li>Most promising variant:
<ul class="wp-block-list">
<li>Given that you are unlikely to successfully reverse engineer all of wisdom, I believe that the most promising variant would be aiming to decipher enough principles of wisdom that you could build a seed AI that could recursively self-wisen.</li>



<li>I’m uncertain whether it would be better to attempt to find a way to directly insert the principles into an AI (I suspect this is basically impossible) or to let the model generate text advising you on how to apply the principles based on human ratings (unlikely to go well due to exposing yourself to adversarial optimization)</li>
</ul>
</li>
</ul>
</li>



<li>The Scattergun Approach
<ul class="wp-block-list">
<li>Evaluation of base proposal:
<ul class="wp-block-list">
<li>Base Power level: Pretty weak. Limited to a set of specific nuggets of wisdom</li>



<li>Feasibility: Very feasible. Not a particularly complicated thing to train.</li>



<li>Adversarial optimization: Even though the optimizer can only select particular nuggets of text, it can still adversarially optimize against you to a degree. However, it is much more limited than if it were able to freely generate text<span id='easy-footnote-2-3632' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/an-overview-of-obvious-approaches-to-training-wise-ai-advisors/#easy-footnote-bottom-2-3632' title='Likely comparable to the extent that a model which was able to prioritise different imitation agents would be able to optimise against you.'><sup>2</sup></a></span>.</li>



<li>Application of principles: The base proposal provides very limited support in terms of figuring out how to apply these principles compared to the other approaches. It just provides a bunch of disconnected principles.</li>



<li>Holisticity: Provides disconnected nuggets of wisdom. Scores pretty poorly here.</li>



<li>Wisdom explosion potential: Very limited. Such a system like this might be useful for helping us pursue one of the other approaches, but limiting the nuggets of wisdom to a fixed set is a crippling limitation.</li>



<li>Generalization: Rather poor. Has a fixed set of principles.</li>
</ul>
</li>



<li>Potential mitigations:
<ul class="wp-block-list">
<li>We could tilt the optimizer towards favoring advice that would be coherent with the advice already provided. I expect that would help to a degree, but this honestly seems like a fundamental problem with this approach</li>



<li>We could annotate the content with details about the kind of context in which it might be useful. Mitigates it a bit, but this is a very limited solution.</li>



<li>We could allow an LLM to freely generate text advising you on how to apply one of these principles to your particular situation. If this were done, I would have a strong preference for using a safe LLM. <br>The whole point of the scattergun approach as far as I’m concerned is to limit the set of responses so as to mitigate adversarial optimization. At the point where you allow an LLM to optimize hard, I feel that you may as well go with the direct approach, as you’ve exposed yourself to adversarial optimization.</li>
</ul>
</li>



<li>Most promising variant:
<ul class="wp-block-list">
<li>Using a safe LLM to contextually annotate the nuggets of wisdom with notes on how to apply them seems like the most viable variant of this approach.</li>
</ul>
</li>
</ul>
</li>
</ul>
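

<p>To make the speculative weight-averaging mitigation from the imitation-learning section above a little more concrete, here is a minimal sketch, assuming several fine-tunes of the same architecture saved as Hugging Face checkpoints (the paths are hypothetical). It only makes sense when all checkpoints share the same architecture and parameter names, and whether the resulting average is actually a useful starting point is an open empirical question.</p>



<pre class="wp-block-code"><code>
# Minimal sketch (assumptions, not a validated method): average the weights
# of several fine-tunes of the same architecture, then use the average as a
# starting point for a small amount of further training (not shown).
import torch
from transformers import AutoModelForCausalLM

checkpoint_paths = ["advisor_A_model", "advisor_B_model", "advisor_C_model"]

state_dicts = [AutoModelForCausalLM.from_pretrained(path).state_dict()
               for path in checkpoint_paths]

averaged = {}
for name, tensor in state_dicts[0].items():
    if tensor.is_floating_point():
        # Elementwise mean of this parameter across all checkpoints.
        averaged[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
    else:
        # Non-float buffers (e.g. integer position ids) are copied from the first.
        averaged[name] = tensor

model = AutoModelForCausalLM.from_pretrained(checkpoint_paths[0])
model.load_state_dict(averaged)
model.save_pretrained("averaged_baseline")  # then nudge with a bit more training
</code></pre>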
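

<p>Likewise, one way to read the direct-approach mitigation of training a classifier on the latents to detect sycophancy is as a simple linear probe on a hidden-layer activation. The sketch below assumes a Hugging Face causal language model; the model name, layer choice and the two toy labelled examples are illustrative only, and a usable detector would need a large, carefully curated dataset and proper held-out evaluation.</p>



<pre class="wp-block-code"><code>
# Minimal sketch (toy data, hypothetical model name): fit a logistic-regression
# probe on a hidden-layer activation to flag sycophantic completions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/wise-advisor-model"  # hypothetical
LAYER = -2  # which hidden layer to probe; a tunable design choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_activation(text):
    """Return the chosen layer's activation at the final token position."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].numpy()

# Toy labelled data: 1 = sycophantic, 0 = not.
examples = [
    ("You're so right, whatever you prefer is surely the wisest choice.", 1),
    ("I'd push back: that plan ignores several second-order effects.", 0),
]

X = np.stack([last_token_activation(text) for text, _ in examples])
y = np.array([label for _, label in examples])

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, flag candidate outputs the probe scores as sycophantic.
print(probe.predict_proba(X)[:, 1])
</code></pre>



<p>The same probe direction could, in principle, double as the kind of activation vector mentioned above for steering the model away from sycophancy, though whether that actually reduces sycophancy rather than merely hiding it from the probe is an open question.</p>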



<p><strong>Appendix on the Imitation Learning Approach:</strong><br><br>Because the imitation learning approach is difficult to understand, I’ve added answers to three of the most common questions. I’ll be explaining this approach in a lot more detail in my upcoming Less Wrong sequence:</p>



<ul class="wp-block-list">
<li>Isn&#8217;t this approach a bit obvious?:
<ul class="wp-block-list">
<li>Yes. That doesn’t mean that it wouldn’t be effective though.</li>
</ul>
</li>



<li>What kind of figures are you talking about?:
<ul class="wp-block-list">
<li>Depends on the exact use case, but there&#8217;s wisdom in all kinds of places. There are wise philosophers, wise scientists, wise policy advisors, wise communicators, etc.</li>
</ul>
</li>



<li>Isn&#8217;t the subjectivity in selecting figures bad?
<ul class="wp-block-list">
<li>The subjectivity is already there in the direct approach. The fact that we&#8217;re selecting figures just makes this more obvious because humans are highly attuned to anything involving status. Making this more salient is good. These are big decisions and people should be aware of this subjectivity.</li>



<li>Different actors can choose to make use of different subsets of figures. Whilst we could produce multiple different AI&#8217;s with the direct approach, imitation learning has the advantage of being extremely legible in how the result is being produced. As soon as we move to some kind of averaging, we have to deal with the question of how your sample was produced.</li>



<li>Further, if there are multiple projects, each project can make their own selection</li>



<li>After we&#8217;ve chosen some initial figures, we can take advantage of their wisdom to help us figure out who we&#8217;ve missed or what our blindspots are.</li>



<li>If we end up simply using these figures to help us train a wise AI, I would expect many of these choices to wash out and many different figures &#8211; all of whom are wise &#8211; would make similar recommendations. Running self-consistency on the AI might further remove some of these differences.</li>



<li>Framing this slightly differently, if we use techniques like debate well, poor choices are unlikely to have much of an impact.</li>
</ul>
</li>
</ul>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>AI Impacts Quarterly Newsletter, Jan-Mar 2023</title>
		<link>http://aiimpacts.org/ai-impacts-quarterly-newsletter-jan-mar-2023/</link>
					<comments>http://aiimpacts.org/ai-impacts-quarterly-newsletter-jan-mar-2023/#comments</comments>
		
		<dc:creator><![CDATA[Harlan Stewart]]></dc:creator>
		<pubDate>Mon, 17 Apr 2023 22:02:42 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[blog]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3539</guid>

					<description><![CDATA[Updates, research, and fundraising <a class="mh-excerpt-more" href="http://aiimpacts.org/ai-impacts-quarterly-newsletter-jan-mar-2023/" title="AI Impacts Quarterly Newsletter, Jan-Mar 2023"></a>]]></description>
										<content:encoded><![CDATA[
<p><em>Harlan Stewart, 17 April 2023</em></p>



<h1 class="wp-block-heading">News</h1>



<h3 class="wp-block-heading">AI Impacts blog</h3>



<p>We moved our <a href="https://blog.aiimpacts.org/">blog</a> to Substack! We think this platform has many advantages, and we’re excited for the blog to live here. You can now easily <a href="https://blog.aiimpacts.org/subscribe">subscribe</a> to the blog to receive regular newsletters as well as various <a href="https://blog.aiimpacts.org/p/we-dont-trade-with-ants">thoughts</a> and <a href="https://blog.aiimpacts.org/p/how-popular-is-chatgpt-part-2-slower">observations</a> related to AI.</p>



<h3 class="wp-block-heading">AI Impacts wiki</h3>



<p>All AI Impacts research pages now reside on the <a href="https://wiki.aiimpacts.org/">AI Impacts Wiki</a>. The wiki aims to document what we know so far about decision-relevant questions about the future of AI. Our pages have always been wiki-like: updatable reference pages organized by topic. We hope that making it an actual wiki will make it clearer to everyone what&#8217;s going on, as well as better to use for this purpose, for both us and readers. We are actively looking for ways to make the wiki even better, and you can help with this by sharing your thoughts in our <a href="https://aiimpacts.org/feedback/">feedback form</a> or in the comments of this blog post!</p>



<h3 class="wp-block-heading">New office</h3>



<p>We recently moved to a new office that we are sharing with <a href="https://far.ai/">FAR AI</a> and other partner organizations. We’re extremely grateful to the team at FAR for organizing this office space, as well as to the Lightcone team for hosting us over the last year and a half.</p>



<h3 class="wp-block-heading">Katja Grace talks about forecasting AI risk at EA Global</h3>



<p>At EA Global Bay Area 2023, Katja gave a talk titled <a href="https://youtu.be/j5Lu01pEDWA">Will AI end everything? A guide to guessing</a> in which she outlined a way to roughly estimate the extent of AI risk.</p>



<h3 class="wp-block-heading">AI Impacts in the Media</h3>



<ul class="wp-block-list">
<li>AI Impacts’ <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">2022 Expert Survey on Progress in AI</a> was cited in an <a href="https://youtu.be/qRLrE2tkr2Y">NBC Nightly News segment</a>, an <a href="https://www.bloomberg.com/opinion/articles/2023-04-02/regulating-ai-might-require-a-new-federal-agency">op-ed in Bloomberg</a>, an <a href="https://www.nytimes.com/2023/03/27/opinion/ai-chatgpt-chatbots.html">op-ed in The New York Times</a>, an <a href="https://ourworldindata.org/ai-timelines">article in Our World in Data</a>, and an <a href="https://www.nytimes.com/2023/03/21/podcasts/ezra-klein-podcast-transcript-kelsey-piper.html">interview with Kelsey Piper</a>.</li>



<li>Ezra Klein quoted Katja and separately cited the survey in his New York Times op-ed <a href="https://www.nytimes.com/2023/03/12/opinion/chatbots-artificial-intelligence-future-weirdness.html">This Changes Everything</a>.</li>



<li>Sigal Samuel interviewed Katja for the Vox article <a href="https://www.vox.com/the-highlight/23621198/artificial-intelligence-chatgpt-openai-existential-risk-china-ai-safety-technology">The case for slowing down AI</a>.</li>
</ul>



<h1 class="wp-block-heading">Research and writing highlights</h1>



<h3 class="wp-block-heading">AI Strategy</h3>



<ul class="wp-block-list">
<li>“<a href="https://blog.aiimpacts.org/p/lets-think-about-slowing-down-ai">Let&#8217;s think about slowing down AI</a>” argues that those who are concerned about existential risks from AI should think about strategies that could slow the progress of AI. (Katja)</li>



<li>“<a href="https://blog.aiimpacts.org/p/framing-ai-strategy">Framing AI strategy</a>” discusses ten frameworks for thinking about AI strategy. (Zach)</li>



<li>“<a href="https://blog.aiimpacts.org/p/product-safety-is-a-poor-model-for-ai-governance">Product safety is a poor model for AI governance</a>” argues that a common type of policy proposal is inadequate to address the risks of AI. (Rick)</li>



<li>“<a href="https://aiimpacts.org/wp-content/uploads/2023/04/Alexander_Fleming__antibiotic_resistance__and_relevant_lessons_for_the_mitigation_of_risk_from_advanced_artificial_intelligence.pdf">Alexander Fleming and Antibiotic Resistance</a>” is a research report about early efforts to prevent antibiotic resistance and relevant lessons for AI risk. (Harlan)</li>
</ul>



<h3 class="wp-block-heading">Resisted technological temptations: how much economic value has been forgone for safety and ethics in past technologies?</h3>



<ul class="wp-block-list">
<li>“<a href="https://blog.aiimpacts.org/p/what-weve-learned-so-far-from-technological">What we’ve learned so far from our technological temptations project</a>” is a blog post that summarizes the Technological Temptations project and some possible takeaways. (Rick)</li>



<li><a href="https://wiki.aiimpacts.org/doku.php?id=responses_to_ai:technological_inevitability:incentivized_technologies_not_pursued:geoengineering">Geoengineering</a>, <a href="https://wiki.aiimpacts.org/doku.php?id=responses_to_ai:technological_inevitability:incentivized_technologies_not_pursued:nuclear_power">nuclear power</a>, and <a href="https://wiki.aiimpacts.org/doku.php?id=responses_to_ai:technological_inevitability:incentivized_technologies_not_pursued:vaccine_challenge_trials">vaccine challenge trials</a> were evaluated for the amount of value that may have been forgone by not using them. (Jeffrey)</li>
</ul>



<h3 class="wp-block-heading">Public awareness and opinions about AI</h3>



<ul class="wp-block-list">
<li>“<a href="https://blog.aiimpacts.org/p/the-public-supports-regulating-ai-for-safety">The public supports regulating AI for safety</a>” summarizes the results from a survey of the American public about AI. (Zach)</li>



<li>“How popular is ChatGPT?”: <a href="https://blog.aiimpacts.org/p/how-popular-is-chatgpt-part-1-more-popular-than-taylor-swift">Part 1</a> looks at trends in AI-related search volume, and <a href="https://blog.aiimpacts.org/p/how-popular-is-chatgpt-part-2-slower">Part 2</a> refutes a widespread claim about the growth of ChatGPT. (Harlan and Rick)</li>
</ul>



<h3 class="wp-block-heading">The state of AI today: funding, hardware, and capabilities</h3>



<ul class="wp-block-list">
<li>“<a href="https://wiki.aiimpacts.org/doku.php?id=wiki:ai_timelines:ai_inputs:recent_trends_in_ai_investment">Recent trends in funding for AI companies</a>” analyzes data about the amount of funding AI companies have received. (Rick)</li>



<li>“<a href="https://wiki.aiimpacts.org/doku.php?id=ai_timelines:hardware_and_ai_timelines:computing_capacity_of_all_gpus_and_tpus">How much computing capacity exists in GPUs and TPUs in Q1 2023?</a>” uses a back-of-the-envelope calculation to estimate the total amount of compute that exists on all GPUs and TPUs. (Harlan)</li>



<li>“<a href="https://wiki.aiimpacts.org/doku.php?id=uncategorized:capabilities_of_sota_ai">Capabilities of state-of-the-art AI, 2023</a>” is a list of some noteworthy things that state-of-the-art AI can do. (Harlan and Zach)</li>
</ul>



<h3 class="wp-block-heading">Arguments for AI risk</h3>



<ul class="wp-block-list">
<li>Still in progress, “<a href="https://wiki.aiimpacts.org/doku.php?id=arguments_for_ai_risk:is_ai_an_existential_threat_to_humanity:start">Is AI an existential risk to humanity?</a>” is a partially complete page summarizing various arguments for concern about existential risk from AI. A couple of specific arguments are examined more closely in “<a href="https://wiki.aiimpacts.org/doku.php?id=arguments_for_ai_risk:is_ai_an_existential_threat_to_humanity:will_malign_ai_agents_control_the_future:argument_for_ai_x-risk_from_competent_malign_agents:start">Argument for AI x-risk from competent malign agents</a>” and “<a href="https://wiki.aiimpacts.org/doku.php?id=arguments_for_ai_risk:is_ai_an_existential_threat_to_humanity:argument_for_ai_x-risk_from_large_impacts">Argument for AI x-risk from large impacts</a>” (Katja)</li>
</ul>



<h3 class="wp-block-heading">Chaos theory and what it means for AI safety</h3>



<ul class="wp-block-list">
<li>“<a href="https://wiki.aiimpacts.org/doku.php?id=uncategorized:ai_safety_arguments_affected_by_chaos">AI Safety Arguments Affected by Chaos</a>” reasons about ways in which chaos theory could be relevant to predictions about AI, and “<a href="https://wiki.aiimpacts.org/doku.php?id=uncategorized:ai_safety_arguments_affected_by_chaos:chaos_in_humans">Chaos in Humans</a>” explores the theoretical limits to predicting human behavior. The report “<a href="http://aiimpacts.org/wp-content/uploads/2023/04/Chaos-and-Intrinsic-Unpredictability.pdf">Chaos and Intrinsic Unpredictability</a>” provides background, and a <a href="https://blog.aiimpacts.org/p/superintelligence-is-not-omniscience">blog post</a> summarizes the project. (Jeffrey and Aysja)</li>
</ul>



<h3 class="wp-block-heading">Miscellany</h3>



<ul class="wp-block-list">
<li>“<a href="https://aiimpacts.org/how-bad-a-future-do-ml-researchers-expect/">How bad a future do ML researchers expect?</a>” compares experts’ answers in 2016 and 2022 to the question “How positive or negative will the impacts of high-level machine intelligence on humanity be in the long run?” (Katja)</li>



<li>“<a href="https://blog.aiimpacts.org/p/we-dont-trade-with-ants">We don’t trade with ants</a>” (crosspost) disputes the common claim that advanced AI systems won’t trade with humans for the same reason that humans don’t trade with ants. (Katja)</li>
</ul>



<h1 class="wp-block-heading">Funding</h1>



<p>We&#8217;re actively seeking financial support to continue our research and operations for the rest of the year. Previous funding allowed us to expand our research team and hold a summer internship program.</p>



<p>If you want to talk to us about why we should be funded or hear more details about what we would do with money, please write to Elizabeth, Rick, or Katja at [firstname]@aiimpacts.org.</p>



<p>If you&#8217;d like to donate to AI Impacts, you can do so <a href="https://aiimpacts.org/donate/">here</a>. (And we thank you!)<br><br><em>Image credit: Midjourney</em></p>
]]></content:encoded>
					
					<wfw:commentRss>http://aiimpacts.org/ai-impacts-quarterly-newsletter-jan-mar-2023/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>What we’ve learned so far from our technological temptations project</title>
		<link>http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/</link>
		
		<dc:creator><![CDATA[richardkorzekwa]]></dc:creator>
		<pubDate>Fri, 14 Apr 2023 00:04:40 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[blog]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3536</guid>

					<description><![CDATA[The history of geoengineering, nuclear power, and human challenge trials suggest that social norms and regulation exert powerful forces on the use of technology. <a class="mh-excerpt-more" href="http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/" title="What we’ve learned so far from our technological temptations project"></a>]]></description>
										<content:encoded><![CDATA[
<p><em>Rick Korzekwa, 11 April 2023, updated 13 April 2023</em></p>



<p>At AI Impacts, we’ve been looking into how people, institutions, and society approach novel, powerful technologies. One part of this is our&nbsp;<a href="https://wiki.aiimpacts.org/doku.php?id=responses_to_ai:technological_inevitability:incentivized_technologies_not_pursued:resisted_technological_temptations_project">technological temptations project</a>, in which we are looking into&nbsp;<a href="https://wiki.aiimpacts.org/doku.php?id=responses_to_ai:technological_inevitability:incentivized_technologies_not_pursued:start">cases</a>&nbsp;where some actors had a strong incentive to develop or deploy a technology, but chose not to or showed hesitation or caution in their approach. Our researcher Jeffrey Heninger has recently finished some case studies on this topic, covering geoengineering, nuclear power, and human challenge trials.</p>



<p>This document summarizes the lessons I think we can take from these case studies. Much of it is borrowed directly from Jeffrey’s written analysis or conversations I had with him, some of it is my independent take, and some of it is a mix of the two, which Jeffrey may or may not agree with. All of it relies heavily on his research.</p>



<p>The writing is somewhat more confident than my beliefs. Some of this is very speculative, though I tried to flag the most speculative parts as such.</p>



<h1 class="wp-block-heading">Summary</h1>



<p>Jeffrey Heninger investigated three cases of technologies that create substantial value but were not pursued, or were pursued more slowly than they might have been.</p>



<p><strong>The overall scale of value at stake was very large</strong> for these cases, on the order of hundreds of billions to trillions of dollars. But it’s not clear who could capture that value, so it’s not clear whether the temptation was closer to $10B or $1T.</p>



<p><strong>Social norms can generate strong disincentives</strong>&nbsp;for pursuing a technology, especially when combined with enforceable regulation.</p>



<p><strong>Scientific communities</strong>&nbsp;and individuals within those communities seem to have particularly high leverage in steering technological development at early stages.</p>



<p><strong>Inhibiting deployment can inhibit development</strong>&nbsp;for a technology over the long term, at least by slowing cost reductions.</p>



<p><strong>Some of these lessons are transferable to AI</strong>, at least enough to be worth keeping in mind.</p>



<h1 class="wp-block-heading">Overview of cases</h1>



<ol class="wp-block-list">
<li><a href="https://wiki.aiimpacts.org/doku.php?id=responses_to_ai:technological_inevitability:incentivized_technologies_not_pursued:geoengineering">Geoengineering</a>&nbsp;could feasibly provide benefits of $1-10 trillion per year through global warming mitigation, at a cost of $1-10 billion per year, but actors who stand to gain the most have not pursued it, citing a lack of research into its feasibility and safety. Research has been effectively prevented by climate scientists and social activist groups.</li>



<li><a href="https://wiki.aiimpacts.org/doku.php?id=responses_to_ai:technological_inevitability:incentivized_technologies_not_pursued:nuclear_power">Nuclear power</a>&nbsp;has proliferated globally since the 1950s, but many countries have prevented or inhibited the construction of nuclear power plants, sometimes at an annual cost of tens of billions of dollars and thousands of lives. This is primarily done through legislation, like Italy’s ban on all nuclear power, or through costly regulations, like safety oversight that has increased the cost of plant construction in the US by a factor of ten.</li>



<li><a href="https://wiki.aiimpacts.org/doku.php?id=responses_to_ai:technological_inevitability:incentivized_technologies_not_pursued:vaccine_challenge_trials">Human challenge trials</a>&nbsp;may have accelerated deployment of covid vaccines by more than a month, saving many thousands of lives and billions or trillions of dollars. Despite this, the first challenge trial for a covid vaccine was not performed until after several vaccines had been tested and approved using traditional methods. This is consistent with the historical rarity of challenge trials, which seems to be driven by ethical concerns and enforced by institutional review boards.</li>
</ol>



<h1 class="wp-block-heading">Scale</h1>



<p>The first thing to notice about these cases is the scale of value at stake. Mitigating climate change could be worth hundreds of billions or trillions of dollars per year, and deploying covid vaccines a month sooner could have saved many thousands of lives. While these numbers do not represent a major fraction of the global economy or the overall burden of disease, they are large compared to many relevant scales for AI risk. The world’s most valuable companies have market caps of a few trillion dollars, and the entire world spends around two trillion dollars per year on defense. In comparison, annual funding for AI is on the order of $100B.<span id='easy-footnote-1-3536' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/#easy-footnote-bottom-1-3536' title='See our page on &lt;a href=&quot;https://wiki.aiimpacts.org/doku.php?id=wiki:ai_timelines:ai_inputs:recent_trends_in_ai_investment&quot;&gt;funding for AI companies&lt;/a&gt; and the &lt;a href=&quot;https://aiindex.stanford.edu/wp-content/uploads/2023/04/HAI_AI-Index-Report_2023.pdf&quot;&gt;2023 AI Index report&lt;/a&gt;.'><sup>1</sup></a></span>



<figure class="wp-block-image"><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc055821-ecf1-4261-b3be-268c312627ce_2288x1240.png" target="_blank" rel="noreferrer noopener"><img decoding="async" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc055821-ecf1-4261-b3be-268c312627ce_2288x1240.png" alt=""/></a><figcaption class="wp-element-caption">Comparison between the potential gains from mitigating global warming and deploying covid vaccines faster. These items were somewhat arbitrarily chosen, and most of the numbers were not carefully researched, but they should be in the right ballpark.</figcaption></figure>



<p>Setting aside for the moment who could capture the value from a technology and whether the reasons for delaying or forgoing its development are rational or justified, I think it is worth recognizing that the potential upsides are large enough to create strong incentives.</p>



<h1 class="wp-block-heading">Social norms</h1>



<p>My read on these cases is that a strong determinant for whether a technology will be pursued is social attitudes toward the technology and its regulation. I’m not sure what would have happened if Pfizer had, in defiance of FDA standards and medical ethics norms, infected volunteers with covid as part of their vaccine testing, but I imagine it would have been more severe than fines or difficulty obtaining FDA approval. They would have lost standing in the medical community and possibly been unable to continue existing as a company. This goes similarly for other technologies and actors. Building nuclear power plants without adhering to safety standards is so far outside the range of acceptable actions that even&nbsp;<em>suggesting</em>&nbsp;it as a strategy for running a business or addressing climate change is a serious risk to reputation for a CEO or public official. An oil company executive who finances a project to disperse aerosols into the upper atmosphere to reduce global warming and protect his business sounds like a Bond movie villain.</p>



<figure class="wp-block-image is-resized"><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046fcc4a-46a1-4b98-9475-739de80813c2_826x1116.png" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046fcc4a-46a1-4b98-9475-739de80813c2_826x1116.png" alt="" width="423" height="572"/></a></figure>



<p>This is not to suggest that social norms are infinitely strong or that they are always well-aligned with society’s interests. Governments and corporations will do things that are widely viewed as unethical if they think they can get away with it, for example, by doing it in secret.<span id='easy-footnote-2-3536' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/#easy-footnote-bottom-2-3536' title='Biological weapons research by the USSR is the best example of this that comes to mind.'><sup>2</sup></a></span> And I think that public support for our current nuclear safety regime is gravely mistaken. But strong social norms, either against a technology or against breaking regulations do seem able, at least in some cases, to create incentives strong enough to constrain valuable technologies.</p>



<h2 class="wp-block-heading">The public</h2>



<p>The public plays a major role in defining and enforcing the range of acceptable paths for technology. Public backlash in response to early challenge trials set the stage for our current ethics standards, and nuclear power faces crippling safety regulations in large part because of public outcry in response to a perceived lack of acceptable safety standards. In both of these cases, the result was not just the creation of regulations, but strong buy-in and a souring of public opinion on a broad category of technologies.<span id='easy-footnote-3-3536' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/#easy-footnote-bottom-3-3536' title='More speculatively, this may be important for geoengineering. Small advocacy groups were able to stop experiments with solar radiation management for reasons that are still not completely clear to me, but I think part of it is public suspicion toward attempts to manipulate the environment.'><sup>3</sup></a></span>



<p>Although public opposition can be a powerful force in expelling things from the Overton window, it does not seem easy to predict or steer. The Chernobyl disaster made a strong case for designing reactors in a responsible way, but it was instead viewed by much of the public as a demonstration that nuclear power should be abolished entirely. I do not have a strong take on how hard this problem is in general, but I do think it is important and should be investigated further.</p>



<h2 class="wp-block-heading">The scientific community</h2>



<p>The precise boundaries of acceptable technology are defined in part by the scientific community, especially when technologies are very early in development. Policy makers and the public tend to defer to what they understand to be the official, legible scientific view when deciding what is or is not okay. This does not always match the actual views of scientists.</p>



<p>Geoengineering as an approach to reducing global warming has not been recommended by the IPCC, and a minority of climate scientists support research into geoengineering. Presumably the advocacy groups opposing geoengineering experiments would have faced a tougher battle if the official stance from the climate science community were in favor of geoengineering.</p>



<p>One interesting aspect of this is that scientific communities are small and heavily influenced by individual prestigious scientists. The taboo on geoengineering research was broken by the editor of a major climate journal, after which the number of papers on the topic increased by more than a factor of 20 within two years.<span id='easy-footnote-4-3536' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/#easy-footnote-bottom-4-3536' title='Oldham, Paul, Bronislaw Szerszynski, Jack Stilgoe, Calum Brown, Bella Eacott, and Andy Yuille. &amp;#8220;Mapping the landscape of climate engineering.&amp;#8221; &lt;em&gt;Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences&lt;/em&gt; 372, no. 2031 (2014): 20140065.'><sup>4</sup></a></span>



<figure class="wp-block-image is-resized"><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7108f6-6580-4936-85fa-97def23222c1_1060x656.png" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7108f6-6580-4936-85fa-97def23222c1_1060x656.png" alt="" width="744" height="460"/></a><figcaption class="wp-element-caption">Scientific papers published on solar radiation management by year. Paul Crutzen, an influential climate scientist, published a highly-cited paper on the use of aerosols to mitigate global warming in 2006. Oldham, et al 2014.</figcaption></figure>



<p>I suspect the public and policymakers are not always able to tell the difference between the official stance of regulatory bodies and the consensus of scientific communities. My impression is that scientific consensus is not in favor of radiation health models used by the Nuclear Regulatory Commission, but many people nonetheless believe that such models are sound science.</p>



<h2 class="wp-block-heading">Warning shots</h2>



<p>Past incidents like the Fukushima disaster and the Tuskegee syphilis study are frequently cited by opponents of nuclear power and human challenge trials. I think this may be significant, because it suggests that these “warning shots” have done a lot to shape perception of these technologies, even decades later. One interpretation of this is that, regardless of why someone is opposed to something, they benefit from citing memorable events when making their case. Another, non-competing interpretation is that these events are causally important in the trajectory of these technologies’ development and the public’s perception of them.</p>



<p>I’m not sure how to untangle the relative contribution of these effects, but either way, it suggests that such incidents are important for shaping and preserving norms around the deployment of technology.</p>



<h2 class="wp-block-heading">Locality</h2>



<p>In general, social norms are local. Building nuclear power plants is much more acceptable in France than it is in Italy. Even if two countries allow the construction of nuclear power plants and have similarly strong norms against breaking nuclear safety regulations, those safety regulations may be different enough to create a large difference in plant construction between countries, as seen with the US and France.</p>



<p>Because scientific communities have members and influence across international borders, they may have more sway over what happens globally (as we’ve seen with geoengineering), but this may be limited by local differences in the acceptability of going against scientific consensus.</p>



<h1 class="wp-block-heading">Development trajectories</h1>



<p>A common feature of these cases is that preventing or limiting deployment of the technology inhibited its development. Because less developed technologies are less useful and harder to trust, this seems to have helped reduce deployment.</p>



<p>Normally, things become cheaper to make as we make more of them in a somewhat predictable way. The cost goes down with the total amount that has been produced, following a power law. This is what has been happening with solar and wind power.<span id='easy-footnote-5-3536' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/#easy-footnote-bottom-5-3536' title='Bolinger, Mark, Ryan Wiser, and Eric O&amp;#8217;Shaughnessy. &amp;#8220;Levelized cost-based learning analysis of utility-scale wind and solar in the United States.&amp;#8221; &lt;em&gt;Iscience&lt;/em&gt; 25, no. 6 (2022): 104378.'><sup>5</sup></a></span>



<figure class="wp-block-image"><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0da2e9a-ee8d-41ab-a85b-d813aa76b1a3_2208x1094.jpeg" target="_blank" rel="noreferrer noopener"><img decoding="async" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0da2e9a-ee8d-41ab-a85b-d813aa76b1a3_2208x1094.jpeg" alt=""/></a><figcaption class="wp-element-caption">Levelized cost of energy for wind and solar power, as a function of total capacity built. Levelized cost includes the cost of building, operating, and maintaining wind and solar farms. Bolinger 2022</figcaption></figure>
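

<p>To make the shape of this relationship concrete, here is a minimal sketch in Python. The numbers are hypothetical, chosen for illustration rather than taken from the wind and solar data above; the point is only that a constant fractional cost reduction per doubling of cumulative capacity is the same thing as a power law in cumulative capacity.</p>



<pre class="wp-block-code"><code>import math

# Minimal learning-curve sketch with hypothetical numbers (not the data plotted above).
# Cost follows a power law in cumulative capacity: cost(Q) = c0 * (Q / Q0) ** (-b),
# which is equivalent to dropping by a fixed fraction every time capacity doubles.

def learning_curve_cost(cumulative_capacity, initial_capacity, initial_cost, learning_rate=0.20):
    """Cost per unit after building `cumulative_capacity`, with a 20% drop per doubling."""
    b = -math.log2(1 - learning_rate)  # power-law exponent implied by the learning rate
    return initial_cost * (cumulative_capacity / initial_capacity) ** (-b)

# Example: start at $100/MWh with 1 GW built, then look at successive doublings.
for capacity_gw in [1, 2, 4, 8, 16]:
    cost = learning_curve_cost(capacity_gw, initial_capacity=1, initial_cost=100)
    print(f"{capacity_gw:>2} GW built -> ${cost:.0f}/MWh")
# Each doubling multiplies cost by 0.8, so 16 GW gives 100 * 0.8**4, about $41/MWh.
</code></pre>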



<p>Initially, building nuclear power plants seems to have become cheaper in the usual way for new technology—doubling the total capacity of nuclear power plants reduced the cost per kilowatt by a constant fraction. Starting around 1970, regulations and public opposition to building plants did more than increase construction costs in the near term. By reducing the number of plants built and inhibiting small-scale design experiments, it slowed the development of the technology, and correspondingly reduced the rate at which we learned to build plants cheaply and safely.<span id='easy-footnote-6-3536' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/#easy-footnote-bottom-6-3536' title='Lang, Peter A. 2017. &amp;#8220;Nuclear Power Learning and Deployment Rates; Disruption and Global Benefits Forgone&amp;#8221; &lt;em&gt;Energies&lt;/em&gt; 10, no. 12: 2169. https://doi.org/10.3390/en10122169'><sup>6</sup></a></span> Absent reductions in cost, they continue to be uncompetitive with other power generating technologies in many contexts.</p>



<figure class="wp-block-image is-resized"><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22c137b-de49-4734-9a2d-cdf06cd290f4_1863x2426.png" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22c137b-de49-4734-9a2d-cdf06cd290f4_1863x2426.png" alt="" width="567" height="738"/></a><figcaption class="wp-element-caption">Nuclear power in France and the US followed typical cost reduction curves until roughly 1970, after which they showed the opposite behavior. However, France showed a much more gradual increase. Lang 2017.</figcaption></figure>



<p>Because solar radiation management acts on a scale of months to years and the costs of global warming are not yet very high, I am not surprised that we have still not deployed it. But this does not explain the lack of research, and one of the reasons given for opposing experiments is that the technique has not been shown to be safe. Yet the reason we lack evidence on safety is that research has been opposed, even at small scales.</p>



<p>It is less clear to me how much the relative lack of human challenge trials in the past<span id='easy-footnote-7-3536' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/what-weve-learned-so-far-from-our-technological-temptations-project/#easy-footnote-bottom-7-3536' title='There were at least 60 challenge trials globally between 1970 and 2018 spread across 25 pathogens. According to the WHO, there have been 6,000 intervention-based clinical trials just for covid (though keep in mind the fraction of these that would benefit from deliberately infecting patients may be fairly small)'><sup>7</sup></a></span> has made us less able to do them well now. I’m also not sure how much a stronger past record of challenge trials would cause them to be viewed more positively. Still, absent evidence that medical research methodology does not improve in the usual way with quantity of research, I expect we are at least somewhat less effective at performing human challenge trials than we otherwise would be.</p>



<h1 class="wp-block-heading">Separating safety decisions from gains of deployment</h1>



<p>I think it’s impressive that regulatory bodies are able to prevent the use of technology even when the cost of doing so is on the scale of many billions, plausibly&nbsp;<em>trillions</em>&nbsp;of dollars. One of the reasons this works seems to be that regulators will be blamed if they approve something and it goes poorly, but they will not receive much credit if things go well. Similarly, they will not be held accountable for failing to approve something good. This creates strong incentives for avoiding negative outcomes while creating little incentive to seek positive outcomes. I’m not sure if this asymmetry was deliberately built into the system or if it is a side effect of other incentive structures (e.g., at the level of politics, there is more benefit from placing blame than there is from giving credit), but it is a force to be reckoned with, especially in contexts where there is a strong social norm against disregarding the judgment of regulators.</p>



<h1 class="wp-block-heading">Who stands to gain</h1>



<p>It is hard to assess which actors are actually tempted by a technology. While society at large could benefit from building more nuclear power plants, much of the benefit would be dispersed as public health gains, and it is difficult for any particular actor to capture that value. Similarly, while many deaths could have been prevented if the covid vaccines had been available two months earlier, it is not clear if this value could have been captured by Pfizer or Moderna–demand for vaccines was not changing that quickly.</p>



<p>On the other hand, not all the benefits are external–switching from coal to nuclear power in the US could save tens of billions of dollars a year, and drug companies pay billions of dollars per year for trials. Some government institutions and officials have the&nbsp;<em>stated</em>&nbsp;goal of creating benefits like public health, in addition to economic and reputational stakes in outcomes like the quick deployment of vaccines during a pandemic. These institutions pay costs and make decisions on the basis of economic and health gains from technology (for example, subsidizing photovoltaics and obesity research), suggesting they have incentive to create that value.</p>



<p>Overall, I think this lack of clarity around incentives and capture of value is the biggest reason for doubt that these cases demonstrate strong resistance to technological temptation.</p>



<h1 class="wp-block-heading">What this means for AI</h1>



<p>How well these cases generalize to AI will depend on facts about AI that are not yet known. For example, if powerful AI requires large facilities and easily-trackable equipment, I think we can expect lessons from nuclear power to be more transferable than if it can be done at a smaller scale with commonly-available materials. Still, I think some of what we’ve seen in these cases will transfer to AI, either because of similarity with AI or because they reflect more general principles.</p>



<h2 class="wp-block-heading">Social norms</h2>



<p>The main thing I expect to generalize is the power of social norms to constrain technological development. While it is far from guaranteed to prevent irresponsible AI development, especially if building dangerous AI is not seen as a major transgression everywhere that AI is being developed, it does seem like the world is much safer if building AI in defiance of regulations is seen as similarly villainous to building nuclear reactors or infecting study participants without authorization. We are not at that point, but the public does seem prepared to support concrete limits on AI development.</p>



<figure class="wp-block-image is-resized"><a href="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93425a8d-eba1-4bf6-9ab6-4b877ceb4728_1454x1994.png" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93425a8d-eba1-4bf6-9ab6-4b877ceb4728_1454x1994.png" alt="" width="403" height="552"/></a><figcaption class="wp-element-caption"><a href="https://twitter.com/YouGovAmerica/status/1642972200746602499">Source</a>                                                                  </figcaption></figure>



<p>I do think there are reasons for pessimism about norms constraining AI. For geoengineering, the norms worked by tabooing a particular topic in a research community, but I’m not sure if this will work with a technology that is no longer in such an early stage. AI already has a large body of research and many people who have invested their careers in it. For medical and nuclear technology, the norms are powerful because they enforce adherence to regulations, and those regulations define the constraints. But it can be hard to build regulations that create the right boundaries around technology, especially something as imprecisely defined as AI. If someone starts building a nuclear power plant in the US, it will become clear relatively early on that this is what they are doing, but a datacenter training an AI and a datacenter updating a search engine may be difficult to tell apart.</p>



<p>Another reason for pessimism is tolerance for failure. Past technologies have mostly carried risks that scaled with how much of the technology was built. For example, if you’re worried about nuclear waste, you probably think two power plants are about twice as bad as one. While risk from AI may turn out this way, it may be that a single powerful system poses a global risk. If this does turn out to be the case, then even if strong norms combine with strong regulation to achieve the same level of success as for nuclear power, it still will not be adequate.</p>



<h2 class="wp-block-heading">Development gains from deployment</h2>



<p>I’m very uncertain how much development of dangerous AI will be hindered by constraints on deployment. I think approximately all technologies face some limitations like this, in some cases very severe limitations, as we’ve seen with nuclear power. But we’re mainly interested in the gains to development toward dangerous systems, which may be possible to advance with little deployment. Adding to the uncertainty, there is ambiguity about where the line is drawn between testing and deployment, and about whether allowing the deployment of verifiably safe systems will provide the gains needed to create dangerous systems.</p>



<h2 class="wp-block-heading">Separating safety decisions from gains</h2>



<p>I do not see any particular reason to think that asymmetric justice will operate differently with AI, but I am uncertain whether regulatory systems around AI, if created, will have such incentives. I think it is worth thinking about IRB-like models for AI safety.</p>



<h2 class="wp-block-heading">Capture of value</h2>



<p>It is obvious there are actors who believe they can capture substantial value from AI (for example Microsoft recently invested $10B in OpenAI), but I’m not sure how this will go as AI advances. By default, I expect the value created by AI to be more straightforwardly capturable than for nuclear power or geoengineering, but I’m not sure how it differs from drug development.</p>



<p><em>Social preview image: German anti-nuclear power protesters in 2012. Used under Creative Commons license from </em><a href="https://www.flickr.com/photos/gruene_bawue/6982014963/">Bündnis 90/Die Grünen Baden-Württemberg Flickr</a></p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Superintelligence Is Not Omniscience</title>
		<link>http://aiimpacts.org/superintelligence-is-not-omniscience/</link>
					<comments>http://aiimpacts.org/superintelligence-is-not-omniscience/#comments</comments>
		
		<dc:creator><![CDATA[Jeffrey Heninger]]></dc:creator>
		<pubDate>Fri, 07 Apr 2023 16:25:58 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[blog]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3530</guid>

					<description><![CDATA[Chaos theory allows us to rigorously show that there are ceilings on our abilities to make some prediction. This post introduces an investigation which explores the relationship between chaos and intelligence in more detail.  <a class="mh-excerpt-more" href="http://aiimpacts.org/superintelligence-is-not-omniscience/" title="Superintelligence Is Not Omniscience"></a>]]></description>
										<content:encoded><![CDATA[
<p><em>Jeffrey Heninger and Aysja Johnson, 7 April 2023</em></p>



<h3 class="wp-block-heading">The Power of Intelligence</h3>



<p>It is often implicitly assumed that the power of a superintelligence will be practically unbounded: that there is &#8220;ample headroom&#8221; above humans, i.e. that a superintelligence would be able to vastly outperform us across virtually all domains.</p>



<p>By &#8220;superintelligence,&#8221; I mean something which has arbitrarily high cognitive ability, or an arbitrarily large amount of compute, memory, bandwidth, etc., but which is bound by the physical laws of our universe.<span id='easy-footnote-1-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-1-3530' title='In this post, &amp;#8220;we&amp;#8221; refers to humanity, while &amp;#8220;I&amp;#8221; refers to the authors: Jeffrey Heninger and Aysja Johnson.'><sup>1</sup></a></span> There are other notions of &#8220;superintelligence&#8221; which are weaker than this. Limitations of the abilities of this superintelligence would also apply to anything less intelligent.</p>



<p>There are some reasons to believe this assumption. For one, it seems a bit suspicious to assume that humans have close to the maximal possible intelligence. Secondly, AI systems already outperform us in some tasks,<span id='easy-footnote-2-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-2-3530' title='&lt;a href=&quot;https://wiki.aiimpacts.org/doku.php?id=uncategorized:capabilities_of_sota_ai&quot;&gt;&lt;em&gt;Capabilities of state-of-the-art AI, 2023&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;'><sup>2</sup></a></span> so why not suspect that they will be able to outperform us in almost all of them? Finally, there is a more fundamental notion about the predictability of the world, described most famously by Laplace in 1814:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it &#8211; an intelligence sufficiently vast to submit this data to analysis &#8211; it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present in its eyes.<span id='easy-footnote-3-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-3-3530' title='The quote continues: &amp;#8220;The human mind offers, in the perfection which it has been able to give to astronomy, a feeble idea of this intelligence. Its discoveries in mechanics and geometry, added to that of universal gravity, have enabled it to comprehend in the same analytic expressions the past and future states of the system of the world. Applying the same method to some other objects of its knowledge, it has succeeded in referring to general laws observed phenomena and in foreseeing those which given circumstances ought to produce. All these efforts in the search for truth tend to lead it back continually to the vast intelligence which we have just mentioned, but from which it will always remain infinitely removed. This tendency, peculiar to the human race, is that which renders it superior to animals; and their progress in this respect distinguishes nations and ages and constitutes their true glory.&amp;#8221;&lt;br&gt;Laplace. &lt;em&gt;Philosophical Essay on Probabilities.&lt;/em&gt; (1814) p. 4. &lt;a href=&quot;https://en.wikisource.org/wiki/A_Philosophical_Essay_on_Probabilities&quot;&gt;https://en.wikisource.org/wiki/A_Philosophical_Essay_on_Probabilities&lt;/a&gt;.'><sup>3</sup></a></span>
</blockquote>



<p>We are very far from completely understanding, and being able to manipulate, everything we care about. But if the world is as predictable as Laplace suggests, then we should expect that a sufficiently intelligent agent would be able to take advantage of that regularity and use it to excel at any domain.</p>



<p>This investigation questions that assumption. Is it actually the case that a superintelligence has practically unbounded intelligence, or are there &#8220;ceilings&#8221; on what intelligence is capable of? To foreshadow a bit, there are ceilings in some domains that we care about, for instance, in predictions about the behavior of the human brain. Even unbounded cognitive ability does not imply unbounded skill when interacting with the world. For this investigation, I focus on cognitive skills, especially predicting the future. This seems like a realm where a superintelligence would have an unusually large advantage (compared to e.g. skills requiring dexterity), so restrictions on its skill here are more surprising.</p>



<p>There are two ways for there to be only a small amount of headroom above human intelligence. The first is that the task is so easy that humans can do it almost perfectly, like playing tic-tac-toe. The second is that the task is so hard that there is a &#8220;low ceiling&#8221;: even a superintelligence is incapable of being very good at it. This investigation focuses on the second.</p>



<p>There are undoubtedly many tasks where there is still ample headroom above humans. But there are also some tasks for which we can prove that there is a low ceiling. These tasks provide some limitations on what is possible, even with arbitrarily high intelligence.</p>



<h3 class="wp-block-heading">Chaos Theory</h3>



<p>The main tool used in this investigation is chaos theory. A chaotic system is one in which uncertainty grows exponentially in time. Most of the information measured initially is lost after a finite amount of time, so reliable predictions about the system’s future behavior are impossible.</p>



<p>A classic example of chaos is the weather. Weather is fairly predictable for a few days. Large simulations of the atmosphere have gotten consistently better for these short-time predictions.<span id='easy-footnote-4-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-4-3530' title='Interestingly, the trend appears linear. My guess is that the linear trend is a combination of exponentially more compute being used and the problem getting exponentially harder.&lt;br&gt;Nate Silver. &lt;em&gt;The Signal and the Noise. &lt;/em&gt;(2012) p. 126-132.'><sup>4</sup></a></span>



<p>After about 10 days, these simulations become useless. Their predictions are worse than a guess based on historical climate data for that location.</p>



<p>Chaos theory provides a response to Laplace. Even if it were possible to exactly predict the future given exact initial conditions and equations of motion,<span id='easy-footnote-5-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-5-3530' title=' Whether or not this statement of determinism is true is a perennial debate among scholars. I will not go into it here.'><sup>5</sup></a></span> chaos makes it impossible to approximately predict the future using approximate initial conditions and equations of motion. Reliable predictions can only be made for a short period of time, but not once the uncertainty has grown large enough.</p>



<p>There is always some small uncertainty. Normally, we do not care: approximations are good enough. But when there is chaos, the small uncertainties matter. There are many ways small uncertainties can arise: Every measuring device has a finite precision.<span id='easy-footnote-6-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-6-3530' title='The most precise measurement ever is of the magnetic moment of the electron, with 9 significant digits.&lt;br&gt;&lt;em&gt;NIST Reference on Constants, Units, and Uncertainty. &lt;/em&gt;&lt;a href=&quot;https://physics.nist.gov/cgi-bin/cuu/Value?muem&quot;&gt;https://physics.nist.gov/cgi-bin/cuu/Value?muem&lt;/a&gt;.'><sup>6</sup></a></span> Every theory should only be trusted in the regimes where it has been tested. Every algorithm for evaluating the solution has some numerical error. There are external forces you are not considering that the system is not fully isolated from. At small enough scales, thermal noise and quantum effects provide their own uncertainties. Some of this uncertainty could be reduced, allowing reliable predictions to be made for a bit longer.<span id='easy-footnote-7-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-7-3530' title='Because the uncertainty grows exponentially with time, if you try to make longer-term predictions by reducing the initial uncertainty, you will only get logarithmic returns.'><sup>7</sup></a></span> Other sources of this uncertainty cannot be reduced. Once these microscopic uncertainties have grown to a macroscopic scale, the motion of the chaos is inherently unpredictable.</p>
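

<p>To make this concrete, here is a minimal numerical illustration (my own sketch, not part of the original report), using the logistic map as a stand-in for a chaotic system: two trajectories that start one part in a billion apart disagree completely within a few dozen steps, and shrinking the initial error buys only a logarithmic amount of extra prediction time.</p>



<pre class="wp-block-code"><code># Illustration only: the chaotic logistic map amplifies a tiny initial difference
# until the two trajectories bear no relation to each other.

def logistic(x, r=4.0):
    return r * x * (1.0 - x)

x, y = 0.400000000, 0.400000001  # initial conditions differing by 1e-9
for step in range(1, 61):
    x, y = logistic(x), logistic(y)
    if step % 10 == 0:
        print(f"step {step:>2}: |x - y| = {abs(x - y):.2e}")

# The gap roughly doubles each step. Making the initial error 1000 times smaller
# therefore buys only about 10 extra useful steps: logarithmic returns to precision.
</code></pre>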



<p>Completely eliminating the uncertainty would require making measurements with perfect precision, which does not seem to be possible in our universe. We can prove that fundamental sources of uncertainty make it impossible to know important things about the future, even with arbitrarily high intelligence. Atomic scale uncertainty, which is guaranteed to exist by Heisenberg’s Uncertainty Principle, can make macroscopic motion unpredictable in a surprisingly short amount of time. Superintelligence is not omniscience.</p>



<p>Chaos theory thus allows us to rigorously show that there are ceilings on some particular abilities. If we can prove that a system is chaotic, then we can conclude that the system offers diminishing returns to intelligence. Most predictions of the future of a chaotic system are impossible to make reliably. Without the ability to make better predictions, and plan on the basis of these predictions, intelligence becomes much less useful.</p>



<p>This does not mean that intelligence becomes useless, or that there is nothing about chaos which can be reliably predicted.&nbsp;</p>



<p>For relatively simple chaotic systems, even when what in particular will happen is unpredictable, it is possible to reliably predict the statistics of the motion.<span id='easy-footnote-8-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-8-3530' title='If the statistics are predictable, this can allow us to make a coarse-grained model for the behavior at a larger scale which is not affected by the uncertainties amplified by the chaos.'><sup>8</sup></a></span> We have learned sophisticated ways of predicting the statistics of chaotic motion,<span id='easy-footnote-9-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-9-3530' title='Described in the report &lt;a href=&quot;http://aiimpacts.org/wp-content/uploads/2023/04/Chaos-and-Intrinsic-Unpredictability.pdf&quot;&gt;Chaos and Intrinsic Unpredictability&lt;/a&gt;.'><sup>9</sup></a></span> and a superintelligence could be better at this than we are. It is also relatively easy to sample from this distribution to emulate behavior which is qualitatively similar to the motion of the original chaotic system.</p>
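

<p>As a rough illustration of this point (again my own sketch, not from the original analysis), the logistic map shows the contrast directly: trajectories started from different points are unpredictable in detail, yet their long-run histograms nearly coincide.</p>



<pre class="wp-block-code"><code># Sketch: the long-run statistics of a chaotic system can be stable even though
# individual trajectories are not predictable.

def occupancy_histogram(x0, r=4.0, steps=100_000, bins=10):
    counts = [0] * bins
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        counts[min(int(x * bins), bins - 1)] += 1
    return [c / steps for c in counts]

hist_a = occupancy_histogram(0.123)
hist_b = occupancy_histogram(0.456)
for i, (a, b) in enumerate(zip(hist_a, hist_b)):
    print(f"x in [{i / 10:.1f}, {(i + 1) / 10:.1f}): {a:.3f} vs {b:.3f}")
# The two histograms agree to within sampling noise, even though step by step
# the two trajectories are unrelated.
</code></pre>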



<p>But chaos can also be more complicated than this. The chaos might be non-stationary, which means that the statistical distribution and qualitative description of the motion themselves change unpredictably in time. The chaos might be multistable, which means that it can do statistically and qualitatively different things depending on how it starts. In these cases, it is also impossible to reliably predict the statistics of the motion, or to emulate a typical example of a distribution which is itself changing chaotically. Even in these cases, there are sometimes still patterns in the chaos which allow a few predictions to be made, like the energy spectra of fluids.<span id='easy-footnote-10-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-10-3530' title='Also described in &lt;a href=&quot;http://aiimpacts.org/wp-content/uploads/2023/04/Chaos-and-Intrinsic-Unpredictability.pdf&quot;&gt;Chaos and Intrinsic Unpredictability&lt;/a&gt;.'><sup>10</sup></a></span> These patterns are hard to find, and it is possible that a superintelligence could find patterns that we have missed. But it is not possible for the superintelligence to recover the vast amount of information rendered unpredictable by the chaos.</p>



<h3 class="wp-block-heading">This Investigation</h3>



<p>This blog post is the introduction to an investigation which explores these points in more detail. I will describe what chaos is, how humanity has learned to deal with chaos, and where chaos appears in things we care about &#8211; including in the human brain itself. Links to the other pages, blog posts, and report that constitute this investigation can be found below.</p>



<p>Most of the systems we care about are considerably messier than the simple examples we use to explain chaos. It is more difficult to prove claims about the inherent unpredictability of these systems, although it is still possible to make some arguments about how chaos affects them.</p>



<p>For example, I will show that individual neurons, small networks of neurons, and <em>in vivo</em> neurons in sense organs can behave chaotically.<span id='easy-footnote-11-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-11-3530' title='The evidence for this can be found in &lt;a href=&quot;https://wiki.aiimpacts.org/doku.php?id=uncategorized:ai_safety_arguments_affected_by_chaos:chaos_in_humans&quot;&gt;Chaos in Humans&lt;/a&gt;.'><sup>11</sup></a></span> Each of these can also behave non-chaotically in other circumstances. But we are more interested in the human brain as a whole. Is the brain mostly chaotic or mostly non-chaotic? Does the chaos in the brain amplify uncertainty all the way from the atomic scale to the macroscopic, or is the chain of amplifying uncertainty broken at some non-chaotic mesoscale? How does chaos in the brain actually impact human behavior? Are there some things that brains do for which chaos is essential?</p>



<p>These are hard questions to answer, and they are, at least in part, currently unsolved. They are worth investigating nevertheless. For instance, it seems likely to me that the chaos in the brain does render some important aspects of human behavior inherently unpredictable, and plausible that chaotic amplification of atomic-level uncertainty is essential for some of the things humans are capable of doing.</p>



<p>This has implications for how humans might interact with a superintelligence and for how difficult it might be to build artificial general intelligence.</p>



<p>If some aspects of human behavior are inherently unpredictable, that might make it harder for a superintelligence to manipulate us. Manipulation is easier if it is possible to predict how a human will respond to anything you show or say to them. If even a superintelligence cannot predict how a human will respond in some circumstances, then it is harder for the superintelligence to hack the human and gain precise, long-term control over them.</p>



<p>So far, I have been considering the possibility that a superintelligence will exist and asking what limitations there are on its abilities.<span id='easy-footnote-12-3530' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/superintelligence-is-not-omniscience/#easy-footnote-bottom-12-3530' title='This possibility probably takes up too much of our thinking, even prior to these arguments.&lt;br&gt;Wulfson. &lt;em&gt;The tyranny of the god scenario. &lt;/em&gt;AI Impacts. (2018) &lt;a href=&quot;https://aiimpacts.org/the-tyranny-of-the-god-scenario/&quot;&gt;https://aiimpacts.org/the-tyranny-of-the-god-scenario/&lt;/a&gt;.'><sup>12</sup></a></span> But chaos theory might also change our estimates of the difficulty of making artificial general intelligence (AGI) that leads to superintelligence. Chaos in the brain makes whole brain emulation on a classical computer wildly more difficult &#8211; or perhaps even impossible.</p>



<p>When making a model of a brain, you want to coarse-grain it at some scale, perhaps at the scale of individual neurons. The coarse-grained model of a neuron should be much simpler than a real neuron, involving only a few variables, while still capturing the behavior relevant for the larger-scale motion. If a neuron is itself behaving chaotically, especially if it is non-stationary or multistable, then no sufficiently good coarse-grained model will exist. The neuron needs to be resolved at a finer scale, perhaps at the scale of proteins. If a protein itself amplifies smaller uncertainties, then you would have to resolve it at a finer scale still, which might require a quantum mechanical calculation of atomic behavior.</p>
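


<p>As a toy version of this difficulty, here is a sketch of mine (not drawn from the report): a chaotic system is integrated twice, once in full precision and once with all detail below 0.001 discarded at every step, standing in for a coarse-grained model that keeps only a few significant figures.</p>



<pre class="wp-block-code"><code>
# Sketch: coarse-graining a chaotic system. Illustration only, not the report's code.
import numpy as np

def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz system."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

dt = 0.0005
fine = np.array([1.0, 1.0, 1.0])
coarse = fine.copy()

for step in range(1, 40_001):
    fine = lorenz_step(fine, dt)
    # Rounding plays the role of coarse-graining: detail below 0.001 is thrown away.
    coarse = np.round(lorenz_step(coarse, dt), 3)
    if step % 8000 == 0:
        t = step * dt
        print(f"t = {t:5.1f}   separation = {np.linalg.norm(fine - coarse):8.3f}")
</code></pre>



<p>The specific numbers are unimportant; the point is that whatever information the coarse-graining throws away is amplified by the chaos until the reduced model stops being a faithful stand-in for the thing it is meant to emulate.</p>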



<p>Whole brain emulation provides an upper bound on the difficulty of AGI. If this upper bound turns out to be farther away than you expected, you should shift more probability mass toward AGI being extremely hard.</p>



<h2 class="wp-block-heading">Links</h2>



<p>I will explore these arguments, and others, in the remainder of this investigation. Currently, this investigation consists of one report, two Wiki pages, and three blog posts.</p>



<p>Report:</p>



<ul class="wp-block-list">
<li><a href="http://aiimpacts.org/wp-content/uploads/2023/04/Chaos-and-Intrinsic-Unpredictability.pdf"><strong>Chaos and Intrinsic Unpredictability</strong></a>. Background reading for the investigation. An explanation of what chaos is, some other ways something can be intrinsically unpredictable, different varieties of chaos, and how humanity has learned to deal with chaos.</li>
</ul>



<p>Wiki Pages:</p>



<ul class="wp-block-list">
<li><a href="https://wiki.aiimpacts.org/doku.php?id=uncategorized:ai_safety_arguments_affected_by_chaos:chaos_in_humans"><strong>Chaos in Humans</strong></a>. Some of the most interesting things to try to predict are other humans. I discuss whether humans are chaotic, from the scale of a single neuron to society as a whole.</li>
</ul>



<ul class="wp-block-list">
<li><a href="https://wiki.aiimpacts.org/doku.php?id=uncategorized:ai_safety_arguments_affected_by_chaos"><strong>AI Safety Arguments Affected by Chaos</strong></a>. A list of the arguments I have seen within the AI safety community which our understanding of chaos might affect.</li>
</ul>



<p>Blog Posts:</p>



<ul class="wp-block-list">
<li><strong>Superintelligence Is Not Omniscience</strong>. This post.</li>
</ul>



<ul class="wp-block-list">
<li><a href="https://blog.aiimpacts.org/p/you-cant-predict-a-game-of-pinball"><strong>You Can’t Predict a Game of Pinball</strong></a>. A simple and familiar example which I describe in detail to help build intuition for the rest of the investigation.</li>
</ul>



<ul class="wp-block-list">
<li><a href="https://blog.aiimpacts.org/p/whole-bird-emulation-requires-quantum-mechanics"><strong>Whole Bird Emulation Requires Quantum Mechanics</strong></a>. A humorous discussion of one example of a quantum mechanical effect being relevant for an animal’s behavior.</li>
</ul>



<h3 class="wp-block-heading">Other Resources</h3>



<p>If you want to learn more about chaos theory in general, outside of this investigation, here are some sources that I endorse:</p>



<ul class="wp-block-list">
<li>Undergraduate Level Textbook:<br>S. Strogatz. <em>Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering.</em> (CRC Press, 2000).</li>
</ul>



<ul class="wp-block-list">
<li>Graduate Level Textbook:<br>P. Cvitanović, R. Artuso, R. Mainieri, G. Tanner and G. Vattay, <em>Chaos: Classical and Quantum. </em><a href="https://chaosbook.org/">ChaosBook.org</a>. (Niels Bohr Institute, Copenhagen 2020).</li>
</ul>



<ul class="wp-block-list">
<li><a href="https://en.wikipedia.org/wiki/Chaos_theory">Wikipedia</a> has a good introductory article on chaos. <a href="http://www.scholarpedia.org/article/Category:Chaos">Scholarpedia</a> also has multiple good articles, although no one obvious place to start.</li>
</ul>



<ul class="wp-block-list">
<li><a href="https://thechaostician.com/what-is-chaos-part-i-introduction/">What is Chaos?</a> sequence of blog posts by The Chaostician.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
					<wfw:commentRss>http://aiimpacts.org/superintelligence-is-not-omniscience/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>A policy guaranteed to increase AI timelines</title>
		<link>http://aiimpacts.org/a-policy-guaranteed-to-increase-ai-timelines/</link>
		
		<dc:creator><![CDATA[richardkorzekwa]]></dc:creator>
		<pubDate>Sat, 01 Apr 2023 20:41:43 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[blog]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3482</guid>

					<description><![CDATA[A redefinition of the second is a foolproof way to increase the number of years between nearly any two events. <a class="mh-excerpt-more" href="http://aiimpacts.org/a-policy-guaranteed-to-increase-ai-timelines/" title="A policy guaranteed to increase AI timelines"></a>]]></description>
										<content:encoded><![CDATA[
<p><em>Rick Korzekwa, April 1, 2023</em></p>



<p>The number of years until the creation of powerful AI is a major input to our thinking about risk from AI and which approaches are most promising for mitigating that risk. While there are downsides to transformative AI arriving many years from now, rather than a few years from now, most people seem to agree that it is safer for AI to arrive in 2060 than in 2030. Given this, there is a lot of discussion about what we can do to increase the number of years until we see such powerful systems. While existing proposals have their merits, none of them can ensure that AI will arrive later than 2030, much less 2060.</p>



<p>There is a policy that is guaranteed to increase the number of years between now and the arrival of transformative AI. The General Conference on Weights and Measures defines one second to be 9,192,631,770 cycles of the microwave radiation emitted during a hyperfine transition in the ground state of a cesium-133 atom. Redefining the second to instead be 919,263,177 cycles of this radiation will increase the number of years between now and transformative AI by a factor of ten. The reason this policy works is the same reason that defining a time standard works&#8211;the microscopic behavior of atoms and photons is ultimately governed by the same physical laws as everything else, including computers, AI labs, and financial markets, and those laws are unaffected by our time standards. Thus fewer cycles of cesium radiation per year implies proportionately fewer other things happening per year.</p>
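


<p>The arithmetic behind the factor of ten can be checked directly. The sketch below is mine, not part of the proposal, and the seven-year horizon in it is an arbitrary placeholder rather than a forecast.</p>



<pre class="wp-block-code"><code>
# Checking the rescaling. A fixed physical interval contains a fixed number of
# cesium cycles; redefining how many cycles make one second changes how many
# "years" we say that interval lasts. The 7-year horizon is a made-up example.
OLD_CYCLES_PER_SECOND = 9_192_631_770
NEW_CYCLES_PER_SECOND = 919_263_177
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # the year keeps its definition in seconds

physical_interval_in_cycles = 7 * SECONDS_PER_YEAR * OLD_CYCLES_PER_SECOND

years_old = physical_interval_in_cycles / OLD_CYCLES_PER_SECOND / SECONDS_PER_YEAR
years_new = physical_interval_in_cycles / NEW_CYCLES_PER_SECOND / SECONDS_PER_YEAR
print(round(years_old, 6), round(years_new, 6))  # 7.0 and 70.0: ten times as many years
</code></pre>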



<p>Making such a change might not sound politically tractable, but there is already precedent for making radical changes to the definition of a second. Previously it was defined in terms of Earth&#8217;s solar orbit, and before that in terms of Earth&#8217;s rotation. These physical processes and their implementations as time standards bear little resemblance to the present-day quantum mechanical standard. In contrast, a change that preserves nearly the entire standard, including all significant figures in the relevant numerical definition, is straightforward.</p>



<p>One possible objection to this policy is that our time standards are not entirely causally disconnected from the rest of the world. For example, redefining the time standard might create a sense of urgency among AI labs and the people investing in them. It&#8217;s not hard to imagine that the leaders and researchers within companies advancing the state of the art in AI might increase their efforts after noticing it is taking ten times as long to generate the same amount of research. While this is a reasonable concern, it seems unlikely that AI labs can increase their rate of progress by a full order of magnitude. Why would they currently be leaving so much on the table if they could? Furthermore, there are similar effects that might push in the other direction. Once politicians and executives realize they will live to be hundreds of years old, they may take risks to the long-term future more seriously.</p>



<p>Still, it does seem that the policy might have undesirable side effects. Changing all of our textbooks, clocks, software, calendars, and habits is costly. One solution to this challenge is to change the standard either in secret or in a way that allows most people to continue using the old &#8220;unofficial&#8221; standard. After all, what matters is the actual number of years required to create AI, not the number of years as measured by some deprecated standard.</p>



<p>In conclusion, while there are many policies for increasing the number of years before the arrival of advanced artificial intelligence, until now, none of them has guaranteed a large increase in this number. This policy, if implemented promptly and thoughtfully, is essentially guaranteed to cause a large increase in the number of years before we see systems capable of posing a serious risk to humanity.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
