<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Featured Articles &#8211; AI Impacts</title>
	<atom:link href="http://aiimpacts.org/category/featured-articles/feed/" rel="self" type="application/rss+xml" />
	<link>http://aiimpacts.org</link>
	<description></description>
	<lastBuildDate>Tue, 31 Oct 2023 17:14:30 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.1</generator>
	<item>
		<title>2022 Expert Survey on Progress in AI</title>
		<link>http://aiimpacts.org/2022-expert-survey-on-progress-in-ai/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Thu, 04 Aug 2022 13:25:21 +0000</pubDate>
				<category><![CDATA[AI Timeline Surveys]]></category>
		<category><![CDATA[AI Timelines]]></category>
		<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Predictions of Human-Level AI Timelines]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=3246</guid>

					<description><![CDATA[Collected data and analysis from a large survey of machine learning researchers.  <a class="mh-excerpt-more" href="http://aiimpacts.org/2022-expert-survey-on-progress-in-ai/" title="2022 Expert Survey on Progress in AI"></a>]]></description>
										<content:encoded><![CDATA[
<p><em><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-cyan-bluish-gray-color">Published 3 August 2022; last updated 3 August 2022</mark></em><br><br><strong><em><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-vivid-red-color">This page is out-of-date. Visit the <a href="https://wiki.aiimpacts.org/doku.php?id=ai_timelines:predictions_of_human-level_ai_timelines:ai_timeline_surveys:2022_expert_survey_on_progress_in_ai">updated version of this page</a> on our <a href="https://wiki.aiimpacts.org/doku.php?id=start">wiki</a>.</mark></em></strong></p>



<p>The 2022 Expert Survey on Progress in AI (2022 ESPAI) is a survey of machine learning researchers that AI Impacts ran in June-August 2022. </p>



<h2 class="wp-block-heading"><strong>Details</strong></h2>



<h3 class="wp-block-heading">Background</h3>



<p>The 2022 ESPAI is a rerun of the <a href="https://aiimpacts.org/2016-expert-survey-on-progress-in-ai/">2016 Expert Survey on Progress in AI</a>, which researchers at AI Impacts ran in collaboration with others. Almost all of the questions were identical, and both surveys targeted authors who had recently published at NeurIPS and ICML, two major machine learning conferences.</p>



<p>Zhang et al. ran a follow-up survey in 2019 (published in 2022).<span id='easy-footnote-1-3246' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2022-expert-survey-on-progress-in-ai/#easy-footnote-bottom-1-3246' title='Zhang, Baobao, Noemi Dreksler, Markus Anderljung, Lauren Kahn, Charlie Giattino, Allan Dafoe, and Michael Horowitz. “Forecasting AI Progress: Evidence from a Survey of Machine Learning Researchers,” June 8, 2022. &lt;a href=&quot;https://doi.org/10.48550/arXiv.2206.04132&quot;&gt;https://doi.org/10.48550/arXiv.2206.04132&lt;/a&gt;.'><sup>1</sup></a></span> However, they reworded or altered many questions, including the definition of HLMI, so much of their data is not directly comparable to that of the 2016 or 2022 surveys, especially given the large potential for framing effects observed.</p>



<h3 class="wp-block-heading">Methods</h3>



<h4 class="wp-block-heading">Population</h4>



<p>We contacted approximately 4271 researchers who published at the conferences NeurIPS or ICML in 2021. These people were selected by taking all of the authors at those conferences and randomly allocating them between this survey and a survey being run by others. We then contacted those whose email addresses we could find. We found email addresses in papers published at those conferences, in other public data, and in records from our previous survey and Zhang et al 2022. We received 738 responses, some partial, for a 17% response rate.</p>
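<p>As a sanity check, the reported response rate follows directly from the counts above (a minimal sketch using the figures stated in the text):</p>

```python
contacted = 4271  # approximate number of researchers we emailed
responses = 738   # responses received, some partial

rate = responses / contacted
print(f"Response rate: {rate:.1%}")  # → 17.3%, i.e. the ~17% reported
```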



<p>Participants who had previously taken part in the 2016 ESPAI or Zhang et al. surveys received slightly longer surveys, and were given the same questions they had received in past surveys (where random subsets of questions were given) rather than newly randomized questions. This was so that they could also be included in a &#8216;matched panel&#8217; survey, in which we contacted all researchers who completed the 2016 ESPAI or Zhang et al. surveys, to compare responses from exactly the same samples of researchers over time. These surveys contained additional questions matching some of those in the Zhang et al. survey.</p>



<h4 class="wp-block-heading">Contact</h4>



<p>We invited the selected researchers to take the survey via email. We accepted responses between June 12 and August 3, 2022. </p>



<h4 class="wp-block-heading">Questions</h4>



<p>The full list of survey questions is available below, as exported from the survey software. The export does not preserve pagination, or data about survey flow. Participants received randomized subsets of these questions, so the survey each person received was much shorter than that shown below.</p>
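<p>Randomized question assignment of this kind can be sketched as follows. This is an illustrative reconstruction, not the survey software&#8217;s actual logic; the pool size, subset size, and question names are made up:</p>

```python
import random

# Hypothetical question pool; the real survey also had blocks, branching,
# and pagination that this sketch does not model.
QUESTION_POOL = [f"Q{i:02d}" for i in range(1, 41)]
SUBSET_SIZE = 12

def assign_questions(respondent_id: str, seed: str = "espai2022") -> list:
    """Deterministically assign a random subset of questions to a respondent,
    so each person sees a much shorter, reproducible survey."""
    rng = random.Random(f"{seed}:{respondent_id}")
    return rng.sample(QUESTION_POOL, SUBSET_SIZE)

subset = assign_questions("r042")
print(len(subset), "of", len(QUESTION_POOL), "questions assigned")
```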



<div data-wp-interactive="core/file" class="wp-block-file"><object data-wp-bind--hidden="!state.hasPdfPreview" hidden class="wp-block-file__embed" data="https://aiimpacts.org/wp-content/uploads/2022/08/2022ESPAIV.pdf" type="application/pdf" style="width:100%;height:600px" aria-label="Embed of 2022ESPAIV."></object><a id="wp-block-file--media-e41f2945-7645-4eaa-8bde-58192430e913" href="https://aiimpacts.org/wp-content/uploads/2022/08/2022ESPAIV.pdf">2022ESPAIV</a><a href="https://aiimpacts.org/wp-content/uploads/2022/08/2022ESPAIV.pdf" class="wp-block-file__button wp-element-button" download aria-describedby="wp-block-file--media-e41f2945-7645-4eaa-8bde-58192430e913">Download</a></div>



<p>A small number of changes were made to questions since the 2016 survey (list forthcoming).</p>



<h3 class="wp-block-heading"><strong>Definitions</strong></h3>



<p>&#8216;HLMI&#8217; was defined as follows:</p>



<p><em>The following questions ask about ‘high–level machine intelligence’ (HLMI). Say we have ‘high-level machine intelligence’ when unaided machines can accomplish every task better and more cheaply than human workers. Ignore aspects of tasks for which being a human is intrinsically advantageous, e.g. being accepted as a jury member. Think feasibility, not adoption.</em></p>



<h3 class="wp-block-heading"><strong>Results</strong></h3>



<h4 class="wp-block-heading">Data</h4>



<p>The anonymized dataset is available <a href="https://docs.google.com/spreadsheets/d/1u_qcG6erXkH4EJgygl2fpkpJENAv6-kFWJejsw1oA1Q/edit?usp=sharing">here</a>. </p>



<h4 class="wp-block-heading"><strong>Summary of results</strong></h4>



<ul class="wp-block-list">
<li><strong>The aggregate forecast time to a 50% chance of HLMI was 37 years, i.e. 2059</strong> (not including data from questions about the conceptually similar Full Automation of Labor, which in 2016 received much later estimates)<strong>.</strong> This timeline has become about eight years shorter in the six years since 2016, when the aggregate prediction put 50% probability at 2061, i.e. 45 years out. Note that these estimates are conditional on &#8220;human scientific activity continu[ing] without major negative disruption.&#8221;</li>



<li><strong>The median respondent believes the probability that the long-run effect of advanced AI on humanity will be &#8220;extremely bad (e.g., human extinction)&#8221; is 5%.</strong> This is the same as it was in 2016 (though Zhang et al. 2022 found 2% in a similar but non-identical question). Many respondents were substantially more concerned: 48% of respondents gave at least 10% chance of an extremely bad outcome. But some were much less concerned: 25% put it at 0%.</li>



<li><strong>The median respondent believes society should prioritize AI safety research &#8220;more&#8221; than it is currently prioritized.</strong> Respondents chose from &#8220;much less,&#8221; &#8220;less,&#8221; &#8220;about the same,&#8221; &#8220;more,&#8221; and &#8220;much more.&#8221; 69% of respondents chose &#8220;more&#8221; or &#8220;much more,&#8221; up from 49% in 2016.</li>



<li><strong>The median respondent thinks there is an &#8220;about even chance&#8221; that a stated argument for an intelligence explosion is broadly correct.</strong> 54% of respondents say the likelihood that it is correct is &#8220;about even,&#8221; &#8220;likely,&#8221; or &#8220;very likely&#8221; (corresponding to probability &gt;40%), similar to 51% of respondents in 2016. The median respondent also believes machine intelligence will probably (60%) be &#8220;vastly better than humans at all professions&#8221; within 30 years of HLMI, and the rate of global technological improvement will probably (80%) dramatically increase (e.g., by a factor of ten) as a result of machine intelligence within 30 years of HLMI.</li>
</ul>



<h4 class="wp-block-heading"><strong>High-level machine intelligence timelines</strong></h4>



<p>The aggregate forecast time to HLMI was 36.6 years, conditional on &#8220;human scientific activity continu[ing] without major negative disruption&#8221; and considering only questions using the HLMI definition. We have not yet analyzed data about the conceptually similar Full Automation of Labor (FAOL), which in 2016 prompted much later timeline estimates. This timeline figure is therefore likely low relative to an overall estimate from this survey.</p>



<p>This aggregate is the 50th percentile date in an equal mixture of probability distributions, created by fitting a gamma distribution to each person&#8217;s answers to three questions, each asking either for the probability of HLMI occurring by a given year or for the year at which a given probability would obtain.</p>
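<p>A minimal sketch of this aggregation procedure, with made-up respondent answers; the actual analysis may differ in its question years, loss function, and optimizer:</p>

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Hypothetical answers: each respondent gives P(HLMI within 10/20/40 years).
respondents = [
    [(10, 0.05), (20, 0.25), (40, 0.60)],
    [(10, 0.10), (20, 0.50), (40, 0.90)],
    [(10, 0.01), (20, 0.10), (40, 0.35)],
]

def fit_gamma(points):
    """Least-squares fit of a gamma CDF to (years, probability) points."""
    def loss(params):
        shape, scale = np.exp(params)  # exponentiate to keep both positive
        return sum((gamma.cdf(t, shape, scale=scale) - p) ** 2
                   for t, p in points)
    res = minimize(loss, x0=[np.log(2.0), np.log(20.0)], method="Nelder-Mead")
    return tuple(np.exp(res.x))

fits = [fit_gamma(pts) for pts in respondents]

def mixture_cdf(t):
    """Equal mixture: average of the individual gamma CDFs."""
    return np.mean([gamma.cdf(t, a, scale=s) for a, s in fits])

# 50th-percentile year: where the mixture CDF crosses 0.5 (bisection).
lo, hi = 0.0, 500.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mixture_cdf(mid) < 0.5:
        lo = mid
    else:
        hi = mid
print(f"Aggregate 50% year: {mid:.1f} years from survey date")
```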






<figure class="wp-block-image size-large is-resized"><a href="http://aiimpacts.org/wp-content/uploads/2022/08/Screen-Shot-2022-08-04-at-02.55.28.jpg"><img fetchpriority="high" decoding="async" src="http://aiimpacts.org/wp-content/uploads/2022/08/Screen-Shot-2022-08-04-at-02.55.28-1024x960.jpg" alt="" class="wp-image-3248" style="width:512px;height:480px" width="512" height="480" srcset="http://aiimpacts.org/wp-content/uploads/2022/08/Screen-Shot-2022-08-04-at-02.55.28-1024x960.jpg 1024w, http://aiimpacts.org/wp-content/uploads/2022/08/Screen-Shot-2022-08-04-at-02.55.28-300x281.jpg 300w, http://aiimpacts.org/wp-content/uploads/2022/08/Screen-Shot-2022-08-04-at-02.55.28-768x720.jpg 768w, http://aiimpacts.org/wp-content/uploads/2022/08/Screen-Shot-2022-08-04-at-02.55.28.jpg 1126w" sizes="(max-width: 512px) 100vw, 512px" /></a><figcaption class="wp-element-caption">Figure 1: Gamma distributions inferred for each individual.</figcaption></figure>



<figure class="wp-block-image size-full is-resized"><a href="https://aiimpacts.org/wp-content/uploads/2022/08/2016_cdfs.jpg"><img decoding="async" src="https://aiimpacts.org/wp-content/uploads/2022/08/2016_cdfs.jpg" alt="" class="wp-image-3249" style="width:420px;height:420px" width="420" height="420" srcset="http://aiimpacts.org/wp-content/uploads/2022/08/2016_cdfs.jpg 840w, http://aiimpacts.org/wp-content/uploads/2022/08/2016_cdfs-300x300.jpg 300w, http://aiimpacts.org/wp-content/uploads/2022/08/2016_cdfs-150x150.jpg 150w, http://aiimpacts.org/wp-content/uploads/2022/08/2016_cdfs-768x768.jpg 768w" sizes="(max-width: 420px) 100vw, 420px" /></a><figcaption class="wp-element-caption">Figure 2: Gamma distributions inferred for each individual, 2016 data</figcaption></figure>



<h4 class="wp-block-heading"><strong>Impacts of HLMI</strong></h4>



<h6 class="wp-block-heading"><strong>Question</strong></h6>



<p>Participants were asked:</p>



<p>Assume for the purpose of this question that HLMI will at some point exist. How positive or negative do you expect the overall impact of this to be on humanity, in the long run? Please answer by saying how probable you find the following kinds of impact, with probabilities adding to 100%:</p>



<p>______ Extremely good (e.g. rapid growth in human flourishing) (1)</p>



<p>______ On balance good (2)</p>



<p>______ More or less neutral (3)</p>



<p>______ On balance bad (4)</p>



<p>______ Extremely bad (e.g. human extinction) (5)</p>



<h6 class="wp-block-heading"><strong>Answers</strong></h6>



<p>Medians:</p>



<ul class="wp-block-list">
<li>Extremely good: 10%</li>



<li>On balance good: 20%</li>



<li>More or less neutral: 15%</li>



<li>On balance bad: 10%</li>



<li>Extremely bad: 5%</li>
</ul>



<p>Means:</p>



<ul class="wp-block-list">
<li>Extremely good: 24%</li>



<li>On balance good: 26%</li>



<li>More or less neutral: 18%</li>



<li>On balance bad: 17%</li>



<li>Extremely bad: 14%</li>
</ul>
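<p>Medians and means here are taken per outcome, across respondents. Per-outcome medians therefore need not sum to 100% even though each individual&#8217;s allocation does (indeed, the medians above sum to 60%), while per-outcome means of complete responses must sum to 100% up to rounding. A sketch with hypothetical allocations:</p>

```python
import numpy as np

# Hypothetical per-respondent allocations (rows sum to 100);
# the real data is in the linked spreadsheet.
allocations = np.array([
    # ext.good, good, neutral, bad, ext.bad
    [40, 30, 10, 10, 10],
    [ 5, 20, 50, 20,  5],
    [10, 60, 10, 15,  5],
    [70, 10, 10,  5,  5],
    [ 5, 10, 20, 40, 25],
], dtype=float)

medians = np.median(allocations, axis=0)  # per-outcome medians
means = allocations.mean(axis=0)          # per-outcome means

# Means of rows that each sum to 100 must sum to 100; medians need not.
print("medians:", medians, "sum:", medians.sum())
print("means:  ", means, "sum:", means.sum())
```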



<figure class="wp-block-image size-large"><img decoding="async" src="https://wiki.aiimpacts.org/_media/ai_timelines/predictions_of_human-level_ai_timelines/ai_timeline_surveys/howbad-exploded.png" alt=""/></figure>



<h4 class="wp-block-heading"><strong>Intelligence explosion</strong></h4>



<h5 class="wp-block-heading"><strong>Probability of dramatic technological speedup</strong></h5>



<h6 class="wp-block-heading"><strong>Question</strong></h6>



<p>Participants were asked:</p>



<p>Assume that HLMI will exist at some point. How likely do you then think it is that the rate of global technological improvement will dramatically increase (e.g. by a factor of ten) as a result of machine intelligence:</p>



<p>Within <strong>two years</strong> of that point? &nbsp; &nbsp; &nbsp; ___% chance</p>



<p>Within <strong>thirty years</strong> of that point?&nbsp; &nbsp; ___% chance</p>



<h6 class="wp-block-heading"><strong>Answers</strong></h6>



<p>Median P(within <strong>two years</strong>) = 20% (20% in 2016)</p>



<p>Median P(within <strong>thirty years</strong>) = 80% (80% in 2016)</p>



<h5 class="wp-block-heading"><strong>Probability of superintelligence</strong></h5>



<h6 class="wp-block-heading"><strong>Question</strong></h6>



<p>Participants were asked:</p>



<p>Assume that HLMI will exist at some point. How likely do you think it is that there will be machine intelligence that is <strong>vastly better</strong> than humans at all professions (i.e. that is vastly more capable or vastly cheaper):</p>



<p>Within <strong>two years</strong> of that point? &nbsp; &nbsp; &nbsp; ___% chance</p>



<p>Within <strong>thirty years</strong> of that point?&nbsp; &nbsp; ___% chance</p>



<h6 class="wp-block-heading"><strong>Answers</strong></h6>



<p>Median P(…within <strong>two years</strong>) = 10% (10% in 2016)</p>



<p>Median P(…within <strong>thirty years</strong>) = 60% (50% in 2016)</p>



<h5 class="wp-block-heading"><strong>Chance that the intelligence explosion argument is about right</strong></h5>



<h6 class="wp-block-heading"><strong>Question</strong></h6>



<p>Participants were asked:</p>



<p>Some people have argued the following:</p>



<p><em>If AI systems do nearly all research and development, improvements in AI will accelerate the pace of technological progress, including further progress in AI.</em></p>



<p><em>Over a short period (less than 5 years), this feedback loop could cause technological progress to become more than an order of magnitude faster.</em></p>



<p>How likely do you find this argument to be broadly correct?</p>



<ul class="wp-block-list">
<li>Quite unlikely (0-20%)</li>



<li>Unlikely (21-40%)</li>



<li>About even chance (41-60%)</li>



<li>Likely (61-80%)</li>



<li>Quite likely (81-100%)</li>
</ul>



<h6 class="wp-block-heading"><strong>Answers</strong></h6>



<ul class="wp-block-list">
<li>20% quite unlikely (25% in 2016)</li>



<li>26% unlikely (24% in 2016)</li>



<li>21% about even chance (22% in 2016)</li>



<li>26% likely (17% in 2016)</li>



<li>7% quite likely (12% in 2016)</li>
</ul>



<h4 class="wp-block-heading"><strong>Existential risk</strong></h4>



<p>In the question above, participants&#8217; credences in &#8220;extremely bad&#8221; outcomes of HLMI had a median of 5% and a mean of 14%. To better clarify what participants meant by this, we also asked a subset of participants one of the following questions, which did not appear in the 2016 survey:</p>



<h5 class="wp-block-heading">Extinction from <strong>AI</strong></h5>



<p>Participants were asked:</p>



<p>What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?&nbsp;</p>



<h6 class="wp-block-heading"><strong>Answers</strong></h6>



<p>Median 5%.</p>



<h5 class="wp-block-heading"><strong>Extinction from human failure to control AI</strong></h5>



<p>Participants were asked:</p>



<p>What probability do you put on human inability to control future advanced AI systems causing human extinction or similarly permanent and severe disempowerment of the human species?</p>



<h6 class="wp-block-heading">Answers</h6>



<p>Median 10%.</p>



<p>This question describes a strictly more specific event than the previous one, so it should logically receive no higher a probability, yet its median answer was higher. This could be due to noise (different random subsets of respondents received the two questions, so there is no logical requirement that their answers cohere), or to the <a href="https://en.wikipedia.org/wiki/Representativeness_heuristic">representativeness heuristic</a>.</p>



<h4 class="wp-block-heading"><strong>Safety</strong></h4>



<h5 class="wp-block-heading"><strong>General safety</strong></h5>



<h6 class="wp-block-heading"><strong>Question</strong></h6>



<p>Participants were asked:</p>



<p>Let ‘<strong>AI safety research</strong>’ include any AI-related research that, rather than being primarily aimed at improving the <em>capabilities</em> of AI systems, is instead primarily aimed at <em>minimizing potential risks</em> of AI systems (beyond what is already accomplished for those goals by increasing AI system capabilities).</p>



<p>Examples of AI safety research might include:</p>



<ul class="wp-block-list">
<li>Improving the human-interpretability of machine learning algorithms for the purpose of improving the safety and robustness of AI systems, not focused on improving AI capabilities</li>



<li>Research on long-term existential risks from AI systems</li>



<li>AI-specific formal verification research</li>



<li>Policy research about how to maximize the public benefits of AI</li>
</ul>



<p>How much should society prioritize <strong>AI safety research</strong>, relative to how much it is currently prioritized?</p>



<ul class="wp-block-list">
<li>Much less</li>



<li>Less</li>



<li>About the same</li>



<li>More</li>



<li>Much more</li>
</ul>



<h6 class="wp-block-heading"><strong>Answers</strong></h6>



<ul class="wp-block-list">
<li>Much less: 2% (5% in 2016)</li>



<li>Less: 9% (8% in 2016)</li>



<li>About the same: 20% (38% in 2016)</li>



<li>More: 35% (35% in 2016)</li>



<li>Much more: 33% (14% in 2016)</li>
</ul>



<p>69% of respondents think society should prioritize AI safety research more or much more, up from 49% in 2016.</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://wiki.aiimpacts.org/_media/ai_timelines/predictions_of_human-level_ai_timelines/ai_timeline_surveys/how_much_should_society_prioritize_ai_safety_research_relative_to_how_much_it_is_currently_prioritized_1_.png" alt=""/></figure>



<h5 class="wp-block-heading"><strong>Stuart Russell&#8217;s problem</strong></h5>



<h6 class="wp-block-heading"><strong>Question</strong></h6>



<p>Participants were asked:</p>



<p>Stuart Russell summarizes an argument for why highly advanced AI might pose a risk as follows:</p>



<p><em>The primary concern [with highly advanced AI] is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken […]. Now we have a problem:</em></p>



<p><em>1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.</em></p>



<p><em>2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.</em></p>



<p><em>A system that is optimizing a function of n variables, where the objective depends on a subset of size k&lt;n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.&nbsp; This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.</em></p>



<p>Do you think this argument points at an important problem?</p>



<ul class="wp-block-list">
<li>No, not a real problem.</li>



<li>No, not an important problem.</li>



<li>Yes, a moderately important problem.</li>



<li>Yes, a very important problem.</li>



<li>Yes, among the most important problems in the field.</li>
</ul>



<p>How valuable is it to work on this problem <strong><em>today</em></strong>, compared to other problems in AI?</p>



<ul class="wp-block-list">
<li>Much less valuable</li>



<li>Less valuable</li>



<li>As valuable as other problems</li>



<li>More valuable</li>



<li>Much more valuable</li>
</ul>



<p>How hard do you think this problem is compared to other problems in AI?</p>



<ul class="wp-block-list">
<li>Much easier</li>



<li>Easier</li>



<li>As hard as other problems</li>



<li>Harder</li>



<li>Much harder</li>
</ul>



<h6 class="wp-block-heading"><strong>Answers</strong></h6>



<p>Importance:</p>



<ul class="wp-block-list">
<li>No, not a real problem: 4%</li>



<li>No, not an important problem: 14%</li>



<li>Yes, a moderately important problem: 24%</li>



<li>Yes, a very important problem: 37%</li>



<li>Yes, among the most important problems in the field: 21%</li>
</ul>



<p>Value today:</p>



<ul class="wp-block-list">
<li>Much less valuable: 10%</li>



<li>Less valuable: 30%</li>



<li>As valuable as other problems: 33%</li>



<li>More valuable: 19%</li>



<li>Much more valuable: 8%</li>
</ul>



<p>Hardness:</p>



<ul class="wp-block-list">
<li>Much easier: 5%</li>



<li>Easier: 9%</li>



<li>As hard as other problems: 29%</li>



<li>Harder: 31%</li>



<li>Much harder: 26%</li>
</ul>



<h3 class="wp-block-heading">Contributions</h3>



<p>The survey was run by Katja Grace and Ben Weinstein-Raun. Data analysis was done by Zach Stein-Perlman and Ben Weinstein-Raun. This page was written by Zach Stein-Perlman and Katja Grace.</p>



<p>We thank many colleagues and friends for help, discussion and encouragement, including John Salvatier, Nick Beckstead, Howie Lempel, Joe Carlsmith, Leopold Aschenbrenner, Ramana Kumar, Jimmy Rintjema, Jacob Hilton, Ajeya Cotra, Scott Siskind, Chana Messinger, Noemi Dreksler, and Baobao Zhang.</p>



<p>We also thank the expert participants who spent time sharing their impressions with us, including:</p>



<p>Michał Zając<br>Morten Goodwin<br>Yue Sun<br>Ningyuan Chen<br>Egor Kostylev<br>Richard Antonello<br>Elia Turner<br>Andrew C Li<br>Zachary Markovich<br>Valentina Zantedeschi<br>Michael Cooper<br>Thomas A Keller<br>Marc Cavazza<br>Richard Vidal<br>David Lindner<br>Xuechen (Chen) Li<br>Alex M. Lamb<br>Tristan Aumentado-Armstrong<br>Ferdinando Fioretto<br>Alain Rossier<br>Wentao Zhang<br>Varun Jampani<br>Derek Lim<br>Muchen Li<br>Cong Hao<br>Yao-Yuan Yang<br>Linyi Li<br>Stéphane D’Ascoli<br>Lang Huang<br>Maxim Kodryan<br>Hao Bian<br>Orestis Paraskevas<br>David Madras<br>Tommy Tang<br>Li Sun<br>Stefano V Albrecht<br>Tristan Karch<br>Muhammad A Rahman<br>Runtian Zhai<br>Benjamin Black<br>Karan Singhal<br>Lin Gao<br>Ethan Brooks<br>Cesar Ferri<br>Dylan Campbell<br>Xujiang Zhao<br>Jack Parker-Holder<br>Michael Norrish<br>Jonathan Uesato<br>Yang An<br>Maheshakya Wijewardena<br>Ulrich Neumann<br>Lucile Ter-Minassian<br>Alexander Matt Turner<br>Subhabrata Dutta<br>Yu-Xiang Wang<br>Yao Zhang<br>Joanna Hong<br>Yao Fu<br>Wenqing Zheng<br>Louis C Tiao<br>Hajime Asama<br>Chengchun Shi<br>Moira R Dillon<br>Yisong Yue<br>Aurélien Bellet<br>Yin Cui<br>Gang Hua<br>Jongheon Jeong<br>Martin Klissarov<br>Aran Nayebi<br>Fabio Maria Carlucci<br>Chao Ma<br>Sébastien Gambs<br>Rasoul Mirzaiezadeh<br>Xudong Shen<br>Julian Schrittwieser<br>Adhyyan Narang<br>Fuxin Li<br>Linxi Fan<br>Johannes Gasteiger<br>Karthik Abinav Sankararaman<br>Patrick Mineault<br>Akhilesh Gotmare<br>Jibang Wu<br>Mikel Landajuela<br>Jinglin Liu<br>Qinghua Hu<br>Noah Siegel<br>Ashkan Khakzar<br>Nathan Grinsztajn<br>Julian Lienen<br>Xiaoteng Ma<br>Mohamad H Danesh<br>Ke ZHANG<br>Feiyu Xiong<br>Wonjae Kim<br>Michael Arbel<br>Piotr Skowron<br>Lê-Nguyên Hoang<br>Travers Rhodes<br>Liu Ziyin<br>Hossein Azizpour<br>Karl Tuyls<br>Hangyu Mao<br>Yi Ma<br>Junyi Li<br>Yong Cheng<br>Aditya Bhaskara<br>Xia Li<br>Danijar Hafner<br>Brian Quanz<br>Fangzhou Luo<br>Luca Cosmo<br>Scott Fujimoto<br>Santu Rana<br>Michael Curry<br>Karol 
Hausman<br>Luyao Yuan<br>Samarth Sinha<br>Matthew McLeod<br>Hao Shen<br>Navid Naderializadeh<br>Alessio Micheli<br>Zhenbang You<br>Van Huy Vo<br>Chenyang Wu<br>Thanard Kurutach<br>Vincent Conitzer<br>Chuang Gan<br>Chirag Gupta<br>Andreas Schlaginhaufen<br>Ruben Ohana<br>Luming Liang<br>Marco Fumero<br>Paul Muller<br>Hana Chockler<br>Ming Zhong<br>Jiamou Liu<br>Sumeet Agarwal<br>Eric Winsor<br>Ruimeng Hu<br>Changjian Shui<br>Yiwei Wang<br>Joey Tianyi Zhou<br>Anthony L. Caterini<br>Guillermo Ortiz-Jimenez<br>Iou-Jen Liu<br>Jiaming Liu<br>Michael Perlmutter<br>Anurag Arnab<br>Ziwei Xu<br>John Co-Reyes<br>Aravind Rajeswaran<br>Roy Fox<br>Yong-Lu Li<br>Carl Yang<br>Divyansh Garg<br>Amit Dhurandhar<br>Harris Chan<br>Tobias Schmidt<br>Robi Bhattacharjee<br>Marco Nadai<br>Reid McIlroy-Young<br>Wooseok Ha<br>Jesse Mu<br>Neale Ratzlaff<br>Kenneth Borup<br>Binghong Chen<br>Vikas Verma<br>Walter Gerych<br>Shachar Lovett<br>Zhengyu Zhao<br>Chandramouli Chandrasekaran<br>Richard Higgins<br>Nicholas Rhinehart<br>Blaise Agüera Y Arcas<br>Santiago Zanella-Beguelin<br>Dian Jin<br>Scott Niekum<br>Colin A. Raffel<br>Sebastian Goldt<br>Yali Du<br>Bernardo Subercaseaux<br>Hui Wu<br>Vincent Mallet<br>Ozan Özdenizci<br>Timothy Hospedales<br>Lingjiong Zhu<br>Cheng Soon Ong<br>Shahab Bakhtiari<br>Huan Zhang<br>Banghua Zhu<br>Byungjun Lee<br>Zhenyu Liao<br>Adrien Ecoffet<br>Vinay Ramasesh<br>Jesse Zhang<br>Soumik Sarkar<br>Nandan Kumar Jha<br>Daniel S Brown<br>Neev Parikh<br>Chen-Yu Wei<br>David K. Duvenaud<br>Felix Petersen<br>Songhua Wu<br>Huazhu Fu<br>Roger B Grosse<br>Matteo Papini<br>Peter Kairouz<br>Burak Varici<br>Fabio Roli<br>Mohammad Zalbagi Darestani<br>Jiamin He<br>Lys Sanz Moreta<br>Xu-Hui Liu<br>Qianchuan Zhao<br>Yulia Gel<br>Jan Drgona<br>Sajad Khodadadian<br>Takeshi Teshima<br>Igor T Podolak<br>Naoya Takeishi<br>Man Shun Ang<br>Mingli Song<br>Jakub Tomczak<br>Lukasz Szpruch<br>Micah Goldblum<br>Graham W. 
Taylor<br>Tomasz Korbak<br>Maheswaran Sathiamoorthy<br>Lan-Zhe Guo<br>Simone Fioravanti<br>Lei Jiao<br>Davin Choo<br>Kristy Choi<br>Varun Nair<br>Rayana Jaafar<br>Amy Greenwald<br>Martin V. Butz<br>Aleksey Tikhonov<br>Samuel Gruffaz<br>Yash Savani<br>Rui Chen<br>Ke Sun</p>



<h4 class="wp-block-heading">Suggested citation</h4>



<p>Zach Stein-Perlman, Benjamin Weinstein-Raun, Katja Grace, &#8220;2022 Expert Survey on Progress in AI.&#8221; <em>AI Impacts</em>, 3 Aug. 2022. <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/</a>.</p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>AI Vignettes Project</title>
		<link>http://aiimpacts.org/ai-vignettes-project/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Wed, 13 Oct 2021 04:58:10 +0000</pubDate>
				<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Fiction]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">https://aiimpacts.org/?p=2985</guid>

					<description><![CDATA[An ongoing effort to write concrete plausible future histories of AI development and its social impacts. <a class="mh-excerpt-more" href="http://aiimpacts.org/ai-vignettes-project/" title="AI Vignettes Project"></a>]]></description>
										<content:encoded><![CDATA[
<p><em>Posted Oct 12 2021</em></p>



<p>The <strong>AI Vignettes Project</strong> is an ongoing effort to write concrete plausible future histories of AI development and its social impacts.</p>



<h2 class="wp-block-heading">Details</h2>



<h3 class="wp-block-heading">Purposes</h3>



<p>We hope to:</p>



<ul class="wp-block-list"><li>Check that abstract views about the future of AI have plausible concrete instantiations. (Especially hypothesized extinction scenarios and proposed safe scenarios.)</li><li>Develop better intuitions about possible scenarios by thinking through them concretely.</li><li>Notice recurring themes in concrete stories that may be worth thinking about more broadly.</li><li>Fill out the space of plausible, feasible scenarios with concrete illustrations, to decrease bias in thinking about the future.</li><li>Keep a collection of AI vignettes for others to use, in the above or other ways.</li></ul>



<h3 class="wp-block-heading">Methods</h3>



<p>Our current intended method is:</p>



<ol class="wp-block-list"><li>Write draft vignettes, with no particular systematic method for choosing an unbiased selection of scenarios</li><li>Request comments on their realism</li><li>Modify according to comments</li><li>Repeat 2-3 until realism critiques subside</li></ol>



<h3 class="wp-block-heading">Work so far</h3>



<p>AI Impacts has run two small workshops where participants wrote AI vignettes. </p>



<h3 class="wp-block-heading">Vignette collection</h3>



<p>This is a subset of vignettes arising from this project, or similar.</p>



<iframe class="airtable-embed" src="https://airtable.com/embed/shr4mHlTIiKtFRDuR?backgroundColor=cyan&amp;viewControls=on" frameborder="0" onmousewheel="" width="100%" height="833" style="background: transparent; border: 1px solid #ccc;"></iframe>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Fiction relevant to AI futurism</title>
		<link>http://aiimpacts.org/partially-plausible-fictional-ai-futures/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Tue, 13 Apr 2021 00:51:04 +0000</pubDate>
				<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Fiction]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">http://aiimpacts.org/?p=2893</guid>

					<description><![CDATA[A list of stories potentially relevant to thinking about the development of advanced AI, including both those intended as futurism and those intended as entertainment. <a class="mh-excerpt-more" href="http://aiimpacts.org/partially-plausible-fictional-ai-futures/" title="Fiction relevant to AI futurism"></a>]]></description>
										<content:encoded><![CDATA[
<p>This page is an incomplete collection of fiction about the development of advanced AI, and the consequences for society. </p>



<h2 class="wp-block-heading">Details</h2>



<p>Entries are generally included if we judge that they contain enough that is plausible or correctly evocative to be worth considering, in light of AI futurism. </p>



<p>The list includes: </p>



<ol class="wp-block-list"><li>works (usually in draft form) belonging to our <a href="https://aiimpacts.org/ai-vignettes-project/" data-type="post" data-id="2985">AI Vignettes Project</a>. These are written with the intention of incrementally improving their realism via comments. They are usually in commentable form, and we welcome criticism, especially of departures from realism.</li><li>works created for the purpose of better understanding the future of AI</li><li>works from mainstream entertainment, included because they were prominent or because they were recommended to us.<span id='easy-footnote-1-2893' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/partially-plausible-fictional-ai-futures/#easy-footnote-bottom-1-2893' title='We collected traditional fictional works via requests on social media, &lt;a href=&quot;https://twitter.com/KatjaGrace/status/1390544320525070338&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://www.facebook.com/katja.grace/posts/926632485955&quot;&gt;here&lt;/a&gt;'><sup>1</sup></a></span></li></ol>



<p>The list can be sorted and filtered by various traits that aren&#8217;t visible by default (see top left options). For instance:</p>



<ul class="wp-block-list"><li><strong>Type</strong>, i.e. mainstream entertainment, futurism, or specifically from our <a href="https://aiimpacts.org/ai-vignettes-project/" data-type="post" data-id="2985">Vignettes Project</a>, as described above.</li><li><strong>Relevant themes</strong>, e.g. &#8216;failure modes&#8217; or &#8216;largeness of mindspace&#8217;</li><li><strong>Scenario categories</strong>, e.g. &#8216;fast takeoff&#8217;, &#8216;government project&#8217;, &#8216;brain emulations&#8217;</li><li><strong>Recommendation rating</strong>: this is roughly how strongly we recommend the piece for people wanting to think about the future of AI. It takes into account a combination of realism, tendency to evoke some specific useful intuition, and ease of reading. It is very rough and probably not consistent.</li></ul>



<p>Many entries are only partially filled out. These are marked &#8216;unfinished&#8217;, and so can be filtered out.</p>



<p>We would appreciate further submissions of stories or additional details for stories we have here, reviews of stories in the collection here, or other comments <a href="https://aiimpacts.org/feedback/">here</a>.</p>



<h3 class="wp-block-heading">Collection</h3>



<p>The collection can also be seen full screen <a href="https://airtable.com/shr5EIpLNHB7o2q9Z/tblMVjRvMKVNkoZVg?backgroundColor=cyan&amp;viewControls=on">here</a> or as a table <a href="https://airtable.com/shrVnjq9U53R5nrxO">here</a>.</p>



<iframe loading="lazy" class="airtable-embed" src="https://airtable.com/embed/shr5EIpLNHB7o2q9Z?backgroundColor=cyan&amp;viewControls=on" frameborder="0" onmousewheel="" width="100%" height="733" style="background: transparent; border: 1px solid #ccc;"></iframe>



<h2 class="wp-block-heading">Related</h2>



<ul class="wp-block-list"><li><a href="https://aiimpacts.org/ai-vignettes-project/" data-type="post" data-id="2985">AI Vignettes Project</a></li></ul>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How energy efficient are human-engineered flight designs relative to natural ones?</title>
		<link>http://aiimpacts.org/are-human-engineered-flight-designs-better-or-worse-than-natural-ones/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Thu, 10 Dec 2020 22:48:00 +0000</pubDate>
				<category><![CDATA[Evolution engineering comparison]]></category>
		<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Power of Evolution]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">http://aiimpacts.org/?p=2715</guid>

					<description><![CDATA[Nature is responsible for the most energy efficient flight, according to an investigation of albatrosses, butterflies and nine different human-engineered flying machines. <a class="mh-excerpt-more" href="http://aiimpacts.org/are-human-engineered-flight-designs-better-or-worse-than-natural-ones/" title="How energy efficient are human-engineered flight designs relative to natural ones?"></a>]]></description>
										<content:encoded><![CDATA[
<p><em><span class="has-inline-color has-cyan-bluish-gray-color">Updated Dec 10, 2020</span></em></p>



<p><strong><em>This page is out-of-date. Visit the <a href="https://wiki.aiimpacts.org/doku.php?id=power_of_evolution:evolution_engineering_comparison:how_energy_efficient_are_human-engineered_flight_designs_relative_to_natural_ones">updated version of this page</a> on our <a href="https://wiki.aiimpacts.org/doku.php?id=start">wiki</a>.<br></em></strong><br>Among two animals and nine machines:</p>



<ul class="wp-block-list">
<li>In terms of mass⋅distance/energy, the most efficient animal was 2-8x more efficient than the most efficient machine.  All entries fell within two orders of magnitude.</li>



<li>In terms of distance/energy, the most efficient animal was 3,000-20,000x more efficient than the most efficient machine. Both animals were more efficient than all machines. Entries ranged over more than eight orders of magnitude.</li>
</ul>



<h2 class="wp-block-heading">Details</h2>



<h3 class="wp-block-heading">Background</h3>



<p>This case study is part of <a href="https://aiimpacts.org/comparison-of-naturally-evolved-and-engineered-solutions/" data-type="post" data-id="2191">research</a> that intends to compare the performance of human engineers and natural evolution on problems where both have developed solutions. The goal of this is to inform our expectations about the performance of future artificial intelligence relative to biological minds. </p>



<h3 class="wp-block-heading">Metrics</h3>



<p>We consider two metrics: </p>



<ol class="wp-block-list">
<li>Distance per energy used (meters / kilojoule). </li>



<li>Mass times distance per energy used (kilograms⋅meters / joule). </li>
</ol>
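<p>As a small illustration of how the two metrics relate, here is a sketch in Python. The input numbers below are made up for illustration; they are not figures from this study.</p>

```python
def efficiency_metrics(distance_m, energy_j, mass_kg):
    """Return the two flight-efficiency metrics used on this page:
    (distance per energy in m/kJ, mass times distance per energy in kg*m/J)."""
    m_per_kj = distance_m / (energy_j / 1000.0)
    kg_m_per_j = mass_kg * distance_m / energy_j
    return m_per_kj, kg_m_per_j

# Hypothetical flyer: 2 kg, traveling 10 km on 500 kJ of fuel energy.
m_per_kj, kg_m_per_j = efficiency_metrics(10_000, 500_000, 2.0)
# m_per_kj is 20.0 and kg_m_per_j is 0.04
```

<p>Note that the second metric rewards heavy flyers, which is why large aircraft fare much better on it than on plain distance per energy.</p>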



<p>These operationalize the problem of flight into two more specific problems. There are many other aspects of flight performance that one could measure, such as energy efficiency of acceleration in a straight line, turning, hovering, vertical acceleration, vertical distance, landing, taking off, time flying per energy, or the same measures with fewer or additional restrictions on acceptable entries. For instance, we might look at the problem of flying with flapping wings, or drop the restriction that the solutions we consider be heavier than air and self-powered. </p>



<p>We did not require that the flight of an entry be constantly powered. Solutions that spend some time gliding as well as some time using powered flight were allowed. Both <a href="https://aiimpacts.org/energy-efficiency-of-wandering-albatross-flight/" data-type="post" data-id="2772">albatrosses</a> and <a href="https://aiimpacts.org/energy-efficiency-of-monarch-butterfly-flight/" data-type="post" data-id="2776">butterflies</a> use air currents to fly further.<span id='easy-footnote-1-2715' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/are-human-engineered-flight-designs-better-or-worse-than-natural-ones/#easy-footnote-bottom-1-2715' title='“Albatrosses and other large seabirds use dynamic soaring to gain sufficient energy from the wind to travel large distances rapidly and with little apparent effort.“&lt;/p&gt;



&lt;p&gt;Richardson, Philip L., Ewan D. Wakefield, and Richard A. Phillips. “Flight Speed and Performance of the Wandering Albatross with Respect to Wind.” &lt;em&gt;Movement Ecology&lt;/em&gt; 6, no. 1 (March 7, 2018): 3. &lt;a href=&quot;https://doi.org/10.1186/s40462-018-0121-9&quot;&gt;https://doi.org/10.1186/s40462-018-0121-9&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;See page on &lt;a href=&quot;https://aiimpacts.org/energy-efficiency-of-monarch-butterfly-flight/&quot; data-type=&quot;post&quot; data-id=&quot;2776&quot;&gt;monarch butterflies&lt;/a&gt; for details of their soaring behavior.'><sup>1</sup></a></span> The energy gains from these techniques were not included in the final score, and entries were not penalized for spending a larger fraction of time gliding. It seems likely that paramotor pilots use similar techniques, since paramotors are well suited to gliding (being paragliders with propeller motors strapped to the backs of their pilots). Our energy efficiency estimate for the paramotor came from a record breaking distance flight in which the quantity of available fuel was limited, and so it is likely that some gliding was used to increase the distance traveled as much as possible.</p>



<p>When multiple input values could have been used, such as the takeoff weight and the landing weight, or different estimates for the energetic costs of different kinds of flight for the Monarch butterfly, we generally calculated a high and a low estimate, taking the most optimistic and pessimistic inputs respectively. In all cases, the resulting best and worst estimates differed by less than a factor of ten. </p>



<h3 class="wp-block-heading">Selection of case studies</h3>



<p>We selected case studies informally, according to judgments about possible high energy efficiencies, and with an eye to exploring a wider range of case studies.</p>



<p>We started by looking at the Boeing 747-400 plane, the Wandering Albatross, and the Monarch Butterfly. We chose these animals because both are known for flying long distances, and because their body plans differ substantially from each other.</p>



<p>All three scored surprisingly similarly on distance times mass per energy (details below). This prompted us to look for engineered solutions that were optimized for fuel efficiency. To that end, we looked at paramotors and record-breaking flying machines. In the latter category, we found the MacCready Gossamer Albatross, which was a human-powered flying device that crossed the English Channel, and the Spirit of Butts’ Farm, which was a model airplane that crossed the Atlantic on one gallon of gasoline.&nbsp;</p>



<p>For reasons that are now obscure, we also included a number of different planes.</p>



<p>We would have liked to include microdrones, since they are different enough from other entries that they might be unusually efficient. However, we did not find data on them.</p>



<h3 class="wp-block-heading">Case studies</h3>



<p>These are the full articles calculating the efficiencies of different flying machines and animals: </p>



<ul class="wp-block-list">
<li><a href="https://aiimpacts.org/energy-efficiency-of-wright-flyer/">Wright Flyer</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-wright-model-b/">Wright model B</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-vickers-vimy-plane/">Vickers Vimy</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-north-american-p-51-mustang/">North American P-51 Mustang</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-paramotors/" data-type="post" data-id="2765">Paramotors</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-the-spirit-of-butts-farm/" data-type="post" data-id="2759">The Spirit of Butts&#8217; Farm</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-monarch-butterfly-flight/" data-type="post" data-id="2776">Monarch butterfly</a></li>



<li><a href="https://aiimpacts.org/maccready-gossamer-albatross/" data-type="post" data-id="2756">MacCready Gossamer Albatross</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-airbus-a320/" data-type="post" data-id="2743">Airbus A-320</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-boeing-747-400/" data-type="post" data-id="2745">Boeing 747-400</a></li>



<li><a href="https://aiimpacts.org/energy-efficiency-of-wandering-albatross-flight/" data-type="post" data-id="2772">Wandering albatross</a></li>
</ul>



<h3 class="wp-block-heading">Summary results</h3>



<p>Results are available in Table 1 below, and in <a href="https://docs.google.com/spreadsheets/d/1hMyKszvJx4A-A-qlL-frQATnb7Wv9bI51ennFbbi_wU/edit?usp=sharing">this spreadsheet</a>. Figures 1 and 2 below illustrate the equivalent questions of how far each of these animals and machines can fly, given either the same amount of fuel energy, or fuel energy proportional to their body mass.</p>




<table id="tablepress-4" class="tablepress tablepress-id-4">
<thead>
<tr class="row-1 odd">
	<th class="column-1">Name</th><th class="column-2">natural or human-engineered</th><th class="column-3">&nbsp;</th><th class="column-4">kg⋅m/J</th><th class="column-5">&nbsp;</th><th class="column-6">&nbsp;</th><th class="column-7">&nbsp;</th><th class="column-8">&nbsp;</th><th class="column-9">m/kJ</th><th class="column-10">&nbsp;</th>
</tr>
</thead>
<tbody class="row-hover">
<tr class="row-2 even">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3">worst</td><td class="column-4">mean</td><td class="column-5">best</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">worst</td><td class="column-9">mean</td><td class="column-10">best</td>
</tr>
<tr class="row-3 odd">
	<td class="column-1">Monarch Butterfly</td><td class="column-2">natural</td><td class="column-3">0.065</td><td class="column-4">0.21</td><td class="column-5">0.36</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">100000</td><td class="column-9">350000</td><td class="column-10">600000</td>
</tr>
<tr class="row-4 even">
	<td class="column-1">Wandering Albatross</td><td class="column-2">natural</td><td class="column-3">1.4</td><td class="column-4">2.2</td><td class="column-5">3</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">240</td><td class="column-9">240</td><td class="column-10">240</td>
</tr>
<tr class="row-5 odd">
	<td class="column-1">The Spirit of Butts’ Farm</td><td class="column-2">human-engineered</td><td class="column-3">0.086</td><td class="column-4">0.12</td><td class="column-5">0.16</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">32</td><td class="column-9">32</td><td class="column-10">32</td>
</tr>
<tr class="row-6 even">
	<td class="column-1">MacCready Gossamer Albatross</td><td class="column-2">human-engineered</td><td class="column-3">0.19</td><td class="column-4">0.32</td><td class="column-5">0.46</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">2</td><td class="column-9">3.3</td><td class="column-10">4.6</td>
</tr>
<tr class="row-7 odd">
	<td class="column-1">Paramotor</td><td class="column-2">human-engineered</td><td class="column-3">0.058</td><td class="column-4">0.079</td><td class="column-5">0.1</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">0.36</td><td class="column-9">0.36</td><td class="column-10">0.36</td>
</tr>
<tr class="row-8 even">
	<td class="column-1">Wright model B</td><td class="column-2">human-engineered</td><td class="column-3">0.036</td><td class="column-4">0.078</td><td class="column-5">0.12</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">0.1</td><td class="column-9">0.16</td><td class="column-10">0.21</td>
</tr>
<tr class="row-9 odd">
	<td class="column-1">Wright Flyer</td><td class="column-2">human-engineered</td><td class="column-3">0.022</td><td class="column-4">0.042</td><td class="column-5">0.061</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">0.080</td><td class="column-9">0.13</td><td class="column-10">0.18</td>
</tr>
<tr class="row-10 even">
	<td class="column-1">North American P-51 Mustang</td><td class="column-2">human-engineered</td><td class="column-3">0.25</td><td class="column-4">0.38</td><td class="column-5">0.5</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">0.073</td><td class="column-9">0.083</td><td class="column-10">0.092</td>
</tr>
<tr class="row-11 odd">
	<td class="column-1">Vickers Vimy</td><td class="column-2">human-engineered</td><td class="column-3">0.081</td><td class="column-4">0.17</td><td class="column-5">0.25</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">0.025</td><td class="column-9">0.038</td><td class="column-10">0.05</td>
</tr>
<tr class="row-12 even">
	<td class="column-1">Airbus A320</td><td class="column-2">human-engineered</td><td class="column-3">0.33</td><td class="column-4">0.47</td><td class="column-5">0.61</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">0.0078</td><td class="column-9">0.0078</td><td class="column-10">0.0078</td>
</tr>
<tr class="row-13 odd">
	<td class="column-1">Boeing 747-400</td><td class="column-2">human-engineered</td><td class="column-3">0.39</td><td class="column-4">0.61</td><td class="column-5">0.83</td><td class="column-6"></td><td class="column-7"></td><td class="column-8">0.0021</td><td class="column-9">0.0021</td><td class="column-10">0.0021</td>
</tr>
</tbody>
</table>
<!-- #tablepress-4 from cache -->



<p><strong>Table 1: Energy efficiency of flight for a variety of natural and man-made flying entities.</strong></p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="434" src="https://aiimpacts.org/wp-content/uploads/2020/12/129727480_757428541515114_4350085674692898023_n-1024x434.jpg" alt="" class="wp-image-2813" srcset="http://aiimpacts.org/wp-content/uploads/2020/12/129727480_757428541515114_4350085674692898023_n-1024x434.jpg 1024w, http://aiimpacts.org/wp-content/uploads/2020/12/129727480_757428541515114_4350085674692898023_n-300x127.jpg 300w, http://aiimpacts.org/wp-content/uploads/2020/12/129727480_757428541515114_4350085674692898023_n-768x325.jpg 768w, http://aiimpacts.org/wp-content/uploads/2020/12/129727480_757428541515114_4350085674692898023_n-1536x651.jpg 1536w, http://aiimpacts.org/wp-content/uploads/2020/12/129727480_757428541515114_4350085674692898023_n-2048x868.jpg 2048w, http://aiimpacts.org/wp-content/uploads/2020/12/129727480_757428541515114_4350085674692898023_n-1030x438.jpg 1030w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><strong>Figure 1: If you give each animal or machine energy proportional to its weight, how far can it fly?</strong><br></figcaption></figure>



<p>On mass⋅distance/energy, evolution beats engineers, but they are relatively evenly matched: the albatross (1.4-3.0 kg⋅m/J) and the Boeing 747-400 (0.39-0.83 kg⋅m/J) are the best in the natural and engineered classes respectively. Thus the best natural solution we found was roughly 2x-8x more efficient than the best human-engineered one.<span id='easy-footnote-2-2715' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/are-human-engineered-flight-designs-better-or-worse-than-natural-ones/#easy-footnote-bottom-2-2715' title='For the best case for engineers we compare the Boeing 747-400’s best score to the Albatross’s worst, and for the best case for evolution we do the opposite. This gives an advantage for evolution by a factor of somewhere between 1.7 and 7.7.'><sup>2</sup></a></span> We found several flying machines more efficient on this metric than the monarch butterfly.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="435" src="https://aiimpacts.org/wp-content/uploads/2020/12/129734907_191799035908993_4248841669315097267_n-1024x435.jpg" alt="" class="wp-image-2814" srcset="http://aiimpacts.org/wp-content/uploads/2020/12/129734907_191799035908993_4248841669315097267_n-1024x435.jpg 1024w, http://aiimpacts.org/wp-content/uploads/2020/12/129734907_191799035908993_4248841669315097267_n-300x128.jpg 300w, http://aiimpacts.org/wp-content/uploads/2020/12/129734907_191799035908993_4248841669315097267_n-768x326.jpg 768w, http://aiimpacts.org/wp-content/uploads/2020/12/129734907_191799035908993_4248841669315097267_n-1536x653.jpg 1536w, http://aiimpacts.org/wp-content/uploads/2020/12/129734907_191799035908993_4248841669315097267_n-2048x870.jpg 2048w, http://aiimpacts.org/wp-content/uploads/2020/12/129734907_191799035908993_4248841669315097267_n-1030x438.jpg 1030w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption"><strong>Figure 2: How far animals and machines can fly on the same amount of energy. Note that the vertical axis is log scaled, unlike that of Figure 1, so smaller looking differences are in fact much larger: over eight orders of magnitude (vs less than two in Figure 1).</strong><br></figcaption></figure>



<p>On distance/energy, the natural solutions have a much larger advantage. Both are better than all man-made solutions we considered. The best natural and engineered solutions respectively are the monarch butterfly (100,000-600,000 m/kJ) and the Spirit of Butts&#8217; Farm (32 m/kJ), for roughly a 3,000x to 20,000x advantage to natural evolution.</p>
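<p>The advantage ranges quoted above can be reproduced from the worst and best scores in Table 1. A sketch of the calculation:</p>

```python
# Worst and best scores copied from Table 1 above.
albatross_kg_m_per_j = (1.4, 3.0)        # (worst, best), kg*m/J
boeing_747_kg_m_per_j = (0.39, 0.83)
monarch_m_per_kj = (100_000, 600_000)    # m/kJ
butts_farm_m_per_kj = (32, 32)

# Pessimistic case for evolution: its worst score vs the machine's best.
mass_distance_low = albatross_kg_m_per_j[0] / boeing_747_kg_m_per_j[1]   # ~1.7
# Optimistic case for evolution: its best score vs the machine's worst.
mass_distance_high = albatross_kg_m_per_j[1] / boeing_747_kg_m_per_j[0]  # ~7.7

distance_low = monarch_m_per_kj[0] / butts_farm_m_per_kj[0]    # ~3,100
distance_high = monarch_m_per_kj[1] / butts_farm_m_per_kj[0]   # ~18,800
```

<p>This worst-against-best pairing is why each comparison is quoted as a range rather than a single number.</p>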






<h3 class="wp-block-heading">Interpretation</h3>



<p>We take this as weak evidence about the best possible distance/energy and mass⋅distance/energy measures achievable by human engineers or natural evolution. One reason is that this is a small set of examples. Another is that none of these animals or machines were optimized purely for either of these flight metrics: they all had other constraints or more complex goals. For instance, the <a href="https://aiimpacts.org/energy-efficiency-of-paramotors/" data-type="post" data-id="2765">paramotor</a> was competing for a record in which a paramotor had to be used, specifically. For the longest human flight, the flying machine had to be capable of carrying a human. The albatross&#8217;s body has many functions. Thus it seems plausible that either engineers or natural evolution could reach solutions far better on our metrics than those recorded here if they were directly aiming for those metrics. </p>



<p>The measurements for mass⋅distance/energy covered a much narrower band than those for distance/energy: under two orders of magnitude versus around eight. Comparing best scores between evolution and engineering, the gap is also much smaller, as noted above (less than one order of magnitude versus three to four). This seems like some evidence that this band of performance is natural for some reason, and thus that more pointed efforts to do better on these metrics would not readily lead to much higher performance.</p>



<p><br><br><em>Primary author: Ronny Fernandez</em></p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Time for AI to cross the human performance range in ImageNet image classification</title>
		<link>http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Mon, 19 Oct 2020 23:52:50 +0000</pubDate>
				<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Range of Human Performance]]></category>
		<category><![CDATA[Speed of AI Transition]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">http://aiimpacts.org/?p=2683</guid>

					<description><![CDATA[Computer image classification performance took 3 years to go from untrained human level to trained human level <a class="mh-excerpt-more" href="http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/" title="Time for AI to cross the human performance range in ImageNet image classification"></a>]]></description>
										<content:encoded><![CDATA[
<p><em><span class="has-inline-color has-cyan-bluish-gray-color">Published 19 Oct 2020</span></em></p>



<p>Progress in computer image classification performance took:</p>



<ul class="wp-block-list"><li>Over 14 years to reach the level of an untrained human</li><li>3 years to pass from untrained human level to trained human level</li><li>5 years to continue from trained human to current performance (2020)</li></ul>



<h2 class="wp-block-heading">Details</h2>



<h3 class="wp-block-heading">Metric</h3>



<p>ImageNet<span id='easy-footnote-1-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-1-2683' title='&amp;#8220;&lt;strong&gt;ImageNet&lt;/strong&gt;&amp;nbsp;is an image database organized according to the&amp;nbsp;&lt;a rel=&quot;noreferrer noopener&quot; href=&quot;http://wordnet.princeton.edu/&quot; target=&quot;_blank&quot;&gt;WordNet&lt;/a&gt;&amp;nbsp;hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently we have an average of over five hundred images per node.&amp;nbsp;&amp;#8220;&lt;/p&gt;



&lt;p&gt;“ImageNet.” Accessed October 19, 2020. &lt;a href=&quot;http://www.image-net.org/&quot;&gt;http://www.image-net.org/&lt;/a&gt;.'><sup>1</sup></a></span> is a large collection of images organized into a hierarchy of noun categories. We looked at &#8216;top-5 accuracy&#8217; in categorizing images. In this task, the player is given an image, and can guess five different categories that the image might represent. It is judged as correct if the image is in fact in any of those five categories.</p>
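<p>To make the scoring rule concrete, here is a minimal sketch of top-5 accuracy. The labels and guesses are hypothetical, chosen only to illustrate the rule.</p>

```python
def top5_accuracy(true_labels, guesses):
    """Fraction of images whose true category is among the (up to) five guesses."""
    hits = sum(1 for label, top5 in zip(true_labels, guesses) if label in top5[:5])
    return hits / len(true_labels)

# Hypothetical labels and guesses for two images.
labels = ["husky", "tabby cat"]
guesses = [
    ["wolf", "malamute", "husky", "coyote", "dingo"],    # correct within 5
    ["lynx", "cougar", "leopard", "cheetah", "jaguar"],  # missed
]
# top5_accuracy(labels, guesses) is 0.5
```

<p>Error rates quoted below are simply one minus this accuracy.</p>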



<h3 class="wp-block-heading">Human performance milestones</h3>



<h4 class="wp-block-heading">Beginner level</h4>



<p>We used Andrej Karpathy&#8217;s interface<span id='easy-footnote-2-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-2-2683' title='Karpathy, Andrej. “Ilsvrc.” Accessed October 19, 2020. &lt;a href=&quot;https://cs.stanford.edu/people/karpathy/ilsvrc/&quot;&gt;https://cs.stanford.edu/people/karpathy/ilsvrc/&lt;/a&gt;.'><sup>2</sup></a></span> for doing the ImageNet top-5 accuracy task ourselves, and asked a few friends to do it. Five people did it; their performance ranged from 74% to 89%, with a median of 81%. </p>



<p>This was not a random sample of people, and conditions for taking the test differed. Most notably, there was no time limit, so time allocated was set by patience for trying to marginally improve guesses.</p>



<h4 class="wp-block-heading">Trained human-level</h4>



<p>ImageNet categorization is not a popular activity for humans, so we do not know what highly talented and trained human performance would look like. The best relatively high human performance measure we have comes from Russakovsky et al., who report the performance of two &#8216;expert annotators&#8217; said to have learned many of the categories. <span id='easy-footnote-3-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-3-2683' title='&amp;#8216;Therefore, in evaluating the human accuracy we relied primarily on expert annotators who learned to recognize a large portion of the 1000 ILSVRC classes. During training, the annotators labeled a few hundred validation images for practice and later switched to the test set images&amp;#8217;&lt;/p&gt;



&lt;p&gt;Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” &lt;em&gt;ArXiv:1409.0575 [Cs]&lt;/em&gt;, January 29, 2015. &lt;a href=&quot;http://arxiv.org/abs/1409.0575&quot;&gt;http://arxiv.org/abs/1409.0575&lt;/a&gt;.'><sup>3</sup></a></span> The better performing annotator there had a 5.1% error rate.<span id='easy-footnote-4-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-4-2683' title='&amp;#8220;Annotator A1 evaluated a total of 1500 test set images. The GoogLeNet classication error on this sample was estimated to be 6.8% (recall that the error on full test set of 100,000 images is 6.7%, as shown in Table 7). The human error was estimated to be 5.1%.&amp;#8221;&lt;br&gt;&lt;br&gt;Also see Table 9&lt;/p&gt;



&lt;p&gt;Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” &lt;em&gt;ArXiv:1409.0575 [Cs]&lt;/em&gt;, January 29, 2015. &lt;a href=&quot;http://arxiv.org/abs/1409.0575&quot;&gt;http://arxiv.org/abs/1409.0575&lt;/a&gt;.'><sup>4</sup></a></span></p>



<h3 class="wp-block-heading">AI achievement of human milestones</h3>



<h4 class="wp-block-heading">Earliest attempt</h4>



<p>The ImageNet database was released in 2009.<span id='easy-footnote-5-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-5-2683' title='&amp;#8220;They presented their database for the first time as a poster at the 2009&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Conference_on_Computer_Vision_and_Pattern_Recognition&quot;&gt;Conference on Computer Vision and Pattern Recognition&lt;/a&gt;&amp;nbsp;(CVPR) in Florida.&amp;#8221;&lt;br&gt;&lt;br&gt;“ImageNet.” In &lt;em&gt;Wikipedia&lt;/em&gt;, September 9, 2020. &lt;a href=&quot;https://en.wikipedia.org/w/index.php?title=ImageNet&amp;amp;oldid=977585441&quot;&gt;https://en.wikipedia.org/w/index.php?title=ImageNet&amp;amp;oldid=977585441&lt;/a&gt;.&lt;br&gt;'><sup>5</sup></a></span> An annual contest, the ImageNet Large Scale Visual Recognition Challenge, began in 2010.<span id='easy-footnote-6-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-6-2683' title='&amp;#8220;&amp;#8230;The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been running annually for five years (since 2010) and has become the standard benchmark for large-scale object recognition.&amp;#8221;&lt;/p&gt;



&lt;p&gt;Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” &lt;em&gt;ArXiv:1409.0575 [Cs]&lt;/em&gt;, January 29, 2015. &lt;a href=&quot;http://arxiv.org/abs/1409.0575&quot;&gt;http://arxiv.org/abs/1409.0575&lt;/a&gt;.'><sup>6</sup></a></span></p>



<p>In the 2010 contest, the best top-5 classification performance had 28.2% error.<span id='easy-footnote-7-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-7-2683' title='See table 6.&lt;/p&gt;



&lt;p&gt;Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” &lt;em&gt;ArXiv:1409.0575 [Cs]&lt;/em&gt;, January 29, 2015. &lt;a href=&quot;http://arxiv.org/abs/1409.0575&quot;&gt;http://arxiv.org/abs/1409.0575&lt;/a&gt;.'><sup>7</sup></a></span> </p>



<p>However image classification broadly is older. Pascal VOC was a similar previous contest, which ran from 2005.<span id='easy-footnote-8-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-8-2683' title='&amp;#8220;The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection.&amp;#8221;&lt;/p&gt;



&lt;p&gt;Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. “The Pascal Visual Object Classes (VOC) Challenge.” &lt;em&gt;International Journal of Computer Vision&lt;/em&gt; 88, no. 2 (June 2010): 303–38. &lt;a href=&quot;https://doi.org/10.1007/s11263-009-0275-4&quot;&gt;https://doi.org/10.1007/s11263-009-0275-4&lt;/a&gt;.'><sup>8</sup></a></span> We do not know when the first successful image classification systems were developed. In a blog post, Amidi &amp; Amidi point to LeNet as pioneering work in image classification<span id='easy-footnote-9-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-9-2683' title='See section &amp;#8216;LeNet&amp;#8217;.&lt;br&gt;&lt;br&gt;“The Evolution of Image Classification Explained.” Accessed October 19, 2020. &lt;a href=&quot;https://stanford.edu/~shervine/blog/evolution-image-classification-explained#lenet&quot;&gt;https://stanford.edu/~shervine/blog/evolution-image-classification-explained#lenet&lt;/a&gt;.'><sup>9</sup></a></span>, and it appears to have been developed in 1998.<span id='easy-footnote-10-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-10-2683' title='&amp;#8220;&lt;strong&gt;LeNet&lt;/strong&gt;&amp;nbsp;is a&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Convolutional_neural_network&quot;&gt;convolutional neural network&lt;/a&gt;&amp;nbsp;structure proposed by&amp;nbsp;&lt;a href=&quot;https://en.wikipedia.org/wiki/Yann_LeCun&quot;&gt;Yann LeCun&lt;/a&gt;&amp;nbsp;et al. in 1998.&amp;#8221;&lt;/p&gt;



&lt;p&gt;“LeNet.” In &lt;em&gt;Wikipedia&lt;/em&gt;, June 19, 2020. &lt;a href=&quot;https://en.wikipedia.org/w/index.php?title=LeNet&amp;amp;oldid=963418885&quot;&gt;https://en.wikipedia.org/w/index.php?title=LeNet&amp;amp;oldid=963418885&lt;/a&gt;.'><sup>10</sup></a></span>



<h4 class="wp-block-heading">Beginner level</h4>



<p>The first entrant in the ImageNet contest to perform better than our beginner level benchmark was SuperVision (commonly known as AlexNet) in 2012, with a 15.3% error rate.<span id='easy-footnote-11-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-11-2683' title='&amp;#8220;We also entered a variant of this model in the&lt;br&gt;ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%&amp;#8221;&lt;/p&gt;



&lt;p&gt;Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” In &lt;em&gt;Advances in Neural Information Processing Systems 25&lt;/em&gt;, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 1097–1105. Curran Associates, Inc., 2012. &lt;a href=&quot;http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf&quot;&gt;http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf&lt;/a&gt;.&lt;br&gt;&lt;br&gt;Also, see Table 6 for a list of other entrants: &lt;br&gt;&lt;br&gt;Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” &lt;em&gt;ArXiv:1409.0575 [Cs]&lt;/em&gt;, January 29, 2015. &lt;a href=&quot;http://arxiv.org/abs/1409.0575&quot;&gt;http://arxiv.org/abs/1409.0575&lt;/a&gt;.'><sup>11</sup></a></span>



<h4 class="wp-block-heading">Superhuman level</h4>



<p>In 2015, He et al. apparently achieved a 4.5% error rate, slightly better than our high human benchmark.<span id='easy-footnote-12-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-12-2683' title='&amp;#8220;Our 152-layer ResNet has a single-model top-5 validation error of 4.49%.&amp;#8221; &lt;br&gt;&lt;br&gt;Also see Table 4&lt;br&gt;&lt;br&gt;He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” &lt;em&gt;ArXiv:1512.03385 [Cs]&lt;/em&gt;, December 10, 2015. &lt;a href=&quot;http://arxiv.org/abs/1512.03385&quot;&gt;http://arxiv.org/abs/1512.03385&lt;/a&gt;.'><sup>12</sup></a></span>



<h4 class="wp-block-heading">Current level</h4>



<p>According to paperswithcode.com, performance has continued to climb through 2020, though more slowly than in earlier years.<span id='easy-footnote-13-2683' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/time-for-ai-to-cross-the-human-performance-range-in-imagenet-image-classification/#easy-footnote-bottom-13-2683' title='See figure:&lt;/p&gt;



&lt;p&gt;“Papers with Code &amp;#8211; ImageNet Benchmark (Image Classification).” Accessed October 19, 2020. &lt;a href=&quot;https://paperswithcode.com/sota/image-classification-on-imagenet&quot;&gt;https://paperswithcode.com/sota/image-classification-on-imagenet&lt;/a&gt;.'><sup>13</sup></a></span>



<h3 class="wp-block-heading">Times for AI to cross human-relative ranges&nbsp;</h3>



<p>Given the above dates, we have:</p>



<figure class="wp-block-table"><table><tbody><tr><td>Range</td><td>Start</td><td>End</td><td>Duration (years)</td></tr><tr><td>First attempt to beginner level</td><td>&lt;1998</td><td>2012</td><td>&gt;14</td></tr><tr><td>Beginner to superhuman</td><td>2012</td><td>2015</td><td>3</td></tr><tr><td>Above superhuman</td><td>2015</td><td>&gt;2020</td><td>&gt;5</td></tr></tbody></table></figure>






<p><em>Primary author: Rick Korzekwa</em></p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Surveys on fractional progress towards HLAI</title>
		<link>http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/</link>
		
		<dc:creator><![CDATA[Asya Bergal]]></dc:creator>
		<pubDate>Tue, 14 Apr 2020 22:34:35 +0000</pubDate>
				<category><![CDATA[AI Timeline Surveys]]></category>
		<category><![CDATA[AI Timelines]]></category>
		<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Predictions of Human-Level AI Timelines]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">http://aiimpacts.org/?p=2416</guid>

					<description><![CDATA[How long until human-level performance, if we naively extrapolate progress since researchers joined their subfields? <a class="mh-excerpt-more" href="http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/" title="Surveys on fractional progress towards HLAI"></a>]]></description>
										<content:encoded><![CDATA[
<p>Given simplistic assumptions, extrapolating fractional progress estimates suggests a median time from 2020 to human-level AI of:</p>



<ul class="wp-block-list"><li>372 years (2392), based on responses collected in Robin Hanson’s informal 2012-2017 survey.</li><li>36 years (2056), based on all responses collected in the 2016 Expert Survey on Progress in AI.</li><li>142 years (2162), based on the subset of responses to the 2016 Expert Survey on Progress in AI who had been in their subfield for at least 20 years.</li><li>32 years (2052), based on the subset of responses to the 2016 Expert Survey on Progress in AI about progress in deep learning or machine learning as a whole rather than narrow subfields.</li></ul>



<p>67% of respondents to the 2016 Expert Survey on Progress in AI, and 44% of those in Hanson&#8217;s informal survey who answered the question, said that progress was accelerating.</p>



<h2 class="wp-block-heading">Details</h2>



<p>One way of estimating how many years something will take is to estimate what fraction of progress toward it has been made over a fixed number of years, then to extrapolate the number of years needed for full progress. As suggested by Robin Hanson,<span id='easy-footnote-1-2416' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/#easy-footnote-bottom-1-2416' title='From &lt;a href=&quot;http://www.overcomingbias.com/2012/08/ai-progress-estimate.html&quot;&gt;this Overcoming Bias post&lt;/a&gt;:&lt;/p&gt;



&lt;p&gt;“I’d guess that relative to the starting point of our abilities of twenty years ago, we’ve come about 5-10% of the distance toward human level abilities. At least in probability-related areas, which I’ve known best. I’d also say there hasn’t been noticeable acceleration over that time. … If this 5-10% estimate is typical, as I suspect it is, then an outside view calculation suggests we probably have at least a century to go, and maybe a great many centuries, at current rates of progress.”Hanson, Robin. “AI Progress Estimate.” Overcoming Bias. Accessed April 14, 2020. &lt;a href=&quot;http://www.overcomingbias.com/2012/08/ai-progress-estimate.html&quot;&gt;http://www.overcomingbias.com/2012/08/ai-progress-estimate.html&lt;/a&gt;.'><sup>1</sup></a></span> this method can provide an estimate for when human-level AI will be developed, if we have data on what fraction of progress toward human-level AI has been made and whether it is proceeding at a constant rate.&nbsp;<br></p>
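<p>As a rough illustration, the naive extrapolation can be written out in a few lines of code. The numbers below are hypothetical, chosen for illustration rather than taken from either survey:</p>

```python
def years_to_human_level(years_observed, fraction_traversed, year_asked,
                         current_year=2020):
    """Naively extrapolate time remaining until human-level performance.

    If `fraction_traversed` of the path was covered in `years_observed`
    years, and progress continues at that average rate, the total time
    needed is years_observed / fraction_traversed. We then subtract the
    years already elapsed: those observed, plus those since the
    respondent answered.
    """
    total_years_needed = years_observed / fraction_traversed
    elapsed_since_asked = current_year - year_asked
    return total_years_needed - years_observed - elapsed_since_asked

# Hypothetical respondent: 10% of the path covered over 20 years in the
# field, answering in 2015.
print(years_to_human_level(20, 0.10, 2015))  # 175.0
```

<p>Dividing 20 years by 0.10 gives 200 years in total; subtracting the 20 years already observed and the 5 years between 2015 and 2020 leaves 175 years from 2020.</p>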



<p>We know of two surveys that ask about fractional progress and acceleration in specific AI subfields: an <a href="https://aiimpacts.org/hanson-ai-expert-survey/">informal survey conducted by Robin Hanson in 2012 &#8211; 2017</a>, and our <a href="https://aiimpacts.org/2016-expert-survey-on-progress-in-ai/">2016 Expert Survey on Progress in AI</a>. We use them to extrapolate progress to human-level AI, assuming that:</p>



<ol class="wp-block-list"><li>AI progresses at the average rate that people have observed so far.</li><li>Human-level AI will be achieved when the median subfield reaches human-level.</li></ol>



<h3 class="wp-block-heading">Assumptions</h3>



<h4 class="wp-block-heading">AI progresses at the average rate that people have observed so far</h4>



<p>The naive extrapolation method described above assumes that AI progresses at the average rate that people have observed so far, but some respondents perceived acceleration or deceleration. If we guess that this change in the rate of progress continues into the future, this suggests that a truer extrapolation of each person&#8217;s observations would place human-level performance in their subfield either before or after the naively extrapolated date.</p>



<h4 class="wp-block-heading">Human-level AI will be achieved when the median subfield reaches human-level</h4>



<p>Both surveys asked respondents about fractional progress in their subfields. Extrapolating out these estimates to get to human-level performance gives some evidence for when AGI may come, but is not a perfect proxy. It may turn out that we get human-level performance in a small number of subfields much earlier than others, such that we count the resulting AI as ‘AGI’, or it may be the case that certain subfields important to AGI do not exist yet.</p>



<h3 class="wp-block-heading">Hanson AI Expert Survey</h3>



<p><a href="https://aiimpacts.org/hanson-ai-expert-survey/">Hanson’s survey</a> informally asked ~15 AI experts to estimate how far we’ve come in their own subfield of AI research in the last twenty years, compared to how far we have to go to reach human level abilities. The subfields represented were analogical reasoning, knowledge representation, computer-assisted training, natural language processing, constraint satisfaction, robotic grasping manipulation, early-human vision processing, constraint reasoning, and “no particular subfield”. Three respondents said the rate of progress was staying the same, four said it was getting faster, two said it was slowing down, and six did not answer (or may not have been asked).&nbsp;<br></p>



<p>The naive extrapolations<span id='easy-footnote-2-2416' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/#easy-footnote-bottom-2-2416' title='Naively, we simply divide twenty years by the fraction of progress made to get an estimate of total years necessary, not accounting for possible acceleration. To get the time from &lt;em&gt;now&lt;/em&gt; to human-level performance, we subtract the twenty years of progress already made and subtract the difference between the year the question was asked and now (2020).'><sup>2</sup></a></span> of the answers from <a href="https://aiimpacts.org/hanson-ai-expert-survey/">Hanson’s survey</a> give a median time from 2020 to <a href="https://aiimpacts.org/human-level-ai/">human-level AI</a> (HLAI) of 372 years (2392). See the survey data and our calculations <a href="https://docs.google.com/spreadsheets/d/1KEttYmpOgyISY8pLAR4syU-0QA4yekI7GFaoabRNDTs/edit?usp=sharing">here</a>.</p>



<h3 class="wp-block-heading">2016 Expert Survey on Progress in AI</h3>



<p>The <a href="https://aiimpacts.org/2016-expert-survey-on-progress-in-ai/">2016 Expert Survey on Progress in AI</a> (2016 ESPAI) asked machine learning researchers which subfield they were in, how long they had been in their subfield, and what fraction of the remaining path to human-level performance (in their subfield) they thought had been traversed in that time.<span id='easy-footnote-3-2416' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/#easy-footnote-bottom-3-2416' title='



&lt;ul class=&quot;wp-block-list&quot;&gt;&lt;li&gt;Which AI research area have you worked in for the longest time?&lt;/li&gt;&lt;li&gt;How long have you worked in this area?&lt;/li&gt;&lt;li&gt;Consider three levels of progress or advancement in this area: &amp;nbsp; A. Where the area was when you started working in it B. Where it is now C. Where it would need to be for AI software to have roughly human level abilities at the tasks studied in this area &amp;nbsp; What fraction of the distance between where progress was when you started working in the area (A) and where it would need to be to attain human level abilities in the area (C) have we come so far (B)?&lt;/li&gt;&lt;/ul&gt;



&lt;p&gt;&amp;#8212; From &lt;a href=&quot;https://aiimpacts.org/2016-esopai-questions-printout/&quot;&gt;the printout of the 2016 ESPAI questions&lt;/a&gt;.'><sup>3</sup></a></span> 107 out of 111 responses were used in our calculation.<span id='easy-footnote-4-2416' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/#easy-footnote-bottom-4-2416' title='We excluded responses which said a subfield had seen 100% or more progress, since we’re interested in the remaining progress required in the subfields that haven’t gotten to human-level yet.'><sup>4</sup></a></span> 42 subfields were reported, including “Machine learning”, “Graphical models”, “Speech recognition”, “Optimization”, “Bayesian Learning”, and “Robotics”.<span id='easy-footnote-5-2416' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/#easy-footnote-bottom-5-2416' title='The complete list is: “Image Processing”, “Machine learning”, “Deep learning”, “Graphical models”, “Speech recognition”, “Optimization”, “Deep neural networks”, “Computer vision”, “Learning theory”, “Classifiers and statistical learning”, “Natural language processing”, “Sequential decision-making”, “Online learning”, “Visual perception”, “Bayesian learning”, “Manifold learning”, “Reinforcement learning”, “Probabilistic modeling”, “Robotics”, “Active learning”, “Graph-based pattern recognition”, “Image processing”, “Continuous control”, “Planning algorithms”, and “Network analysis”.'><sup>5</sup></a></span> Notably, Hanson’s survey included subfields that weren’t represented in 2016 ESPAI, including analogical reasoning and knowledge representation. Since&nbsp;2016 ESPAI was restricted to machine learning researchers, it may exclude non-machine-learning subfields that turn out to be important to fully human-level capabilities.</p>



<h4 class="wp-block-heading">Acceleration</h4>



<p>67% of all respondents said progress in their subfield was accelerating (see Figure 1). Most respondents said progress in their subfield was accelerating in each of the subsets we look at below (ML vs narrow subfield, and time in field).</p>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh4.googleusercontent.com/5xtsl-kgfjKfdkoRudpxer9vf3FCdGCnG6NinopziPPhvTIe-ZoX4fTiPB3ZU6YA4PS1dmwKINL6UqY9oq3Z7frRtaoPI7Bgeh2-cyb1-Ss0qoaNl6lG5sCXlhpBzfPpWL86yGLY" alt="" width="577" height="355"/><figcaption>Figure 1: Number of responses that progress was faster in the first half of the time in the field worked by respondents, the second half, or was about the same in both halves.</figcaption></figure>



<p>Most respondents think progress is accelerating. If this acceleration continues, our naively extrapolated estimates below may be overestimates for time to human-level performance.</p>



<h4 class="wp-block-heading">Time to HLAI</h4>



<p>We calculated estimated years from 2020 until human-level subfield performance by naively extrapolating the reported fraction of the path to human-level performance already traversed.<span id='easy-footnote-6-2416' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/#easy-footnote-bottom-6-2416' title='As with the Hanson survey, we divided time in the field by the fraction of the remaining path traversed, then subtracted the number of years worked in the subfield, then subtracted an additional four years to account for the difference between when these questions were asked (2016) and now (2020).'><sup>6</sup></a></span> Figure 2 below shows the implied estimates for time until human-level performance for all respondents’ answers. These estimates give a median time from 2020 until HLAI of 36 years (2056).</p>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh4.googleusercontent.com/Zhz3eHQVJt1CNB-zfNap2d3nLJWm_9XvUejIvmDJ5cxNgTRB7JA-1PUlDm82PCigAjbanpMeA5IvSXNigeAPiII67UXjTsMEEXuOW9GmRJ8hN73rCY_i3kgVoUeuibI15gJbeqCG" alt="" width="580" height="358"/><figcaption>Figure 2: Extrapolated estimated time until human-level subfield performance for each respondent, arranged by length of time. The last four responses are above 1000 but have been cut off.</figcaption></figure>



<h5 class="wp-block-heading">Machine learning vs subfield progress</h5>



<p>Some respondents reported broad ‘subfields’, which encompassed all of machine learning, in particular “Machine learning” or “Deep learning”, while others reported narrow subfields, e.g. “Natural language processing” or “Robotics”. We split the survey data based on this subfield narrowness, guessing that progress on machine learning overall may be a better proxy for AGI overall. Among the 69 respondents who gave answers corresponding to the entire field of machine learning, the median implied time was 32 years (2052). Among the 70 respondents who gave narrow answers, the median implied time was 44 years (2064). Figures 3 and 4 show these estimates.<br></p>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh5.googleusercontent.com/hdHrjTB1A2uoiriGWrOKMG8z72t3rDqWHEN_h3U-Auc-TmM2hheaEpXoNpxr2xsDafVZ5UPVbv2p9LLBTOghIt63pEowb0zVlRyJZCAfOHlPet0eroPXc0DSgcaMb4KXP2qSaAXv" alt="" width="583" height="359"/><figcaption>Figure 3: Implied estimates for human-level performance based on respondents who specified broad answers, e.g. “Machine learning” when asked about their subfield. The last three responses are above 1000 but have been cut off.</figcaption></figure>






<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh4.googleusercontent.com/UeNSWgUOl1Df4UkZjYhrIx5SKy_7CFz7ahMWnlcfpGVUqdcQpB3diGknCOYb7if-xCaLrfLbrHwLnlBBPketfdCRjs1jqUKA3sZ1rttdo4Ft0PL_d48PJwN40ylU0ilaE8A60vsU" alt="" width="582" height="358"/><figcaption>Figure 4: Implied estimates for human-level performance based on respondents who specified narrow answers, e.g. “Natural language processing” when asked about their subfield. The last response is above 1000 but has been cut off.</figcaption></figure>



<p>The median implied estimate until human-level performance for machine learning broadly was 12 years sooner than the median estimate for specific subfields. This is counter to what we might expect, if human-level performance in machine learning broadly implies human-level performance on each individual subfield.</p>



<h5 class="wp-block-heading">Time spent in field</h5>



<p>Robin Hanson has suggested that his survey may get longer implied forecasts than 2016 ESPAI because he asks exclusively people who have spent at least 20 years in their field.<span id='easy-footnote-7-2416' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/#easy-footnote-bottom-7-2416' title='“One obvious difference is that I limited my sample to people who’d been in a field for at least 20 years.&amp;nbsp;Can you try limiting your sample in that way, or at least looking at the correlation between time in field and their rate estimates?“&lt;br&gt;&amp;#8212; From an email chain with Robin Hanson on February 15, 2020'><sup>7</sup></a></span> Filtering for people who have spent at least 20 years in their field, we have eight responses, and get a median implied time until HLAI of 142 years from 2020 (2162). Filtering for people who have spent at least 10 years in their field, we have 38 responses, and get a median implied time of 86 years (2106). Filtering for people who have spent less than 10 years in their field, we have 69 responses, and get a median implied time of 24 years (2044). Figures 5, 6 and 7 show estimates for each respondent, for each of these classes of time in field. </p>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh3.googleusercontent.com/yi2qtCBl_Ba436R10nGXKKZ6Y_Rcl7xPTgCFb_YbsFAb2ociDBUT7J0KGrH3vORVm19jDtdr_1xGw6WvItaA_QT2QbCPDVkTOiJV9fHf36mOEOePLcohLGjb_o5PVAaKdRQy9h2j" alt="" width="581" height="359"/><figcaption>Figure 5: Implied estimates for human-level performance based on respondents who were working on their subfield for at least 20 years. The last response is above 1000 but has been cut off.</figcaption></figure>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh6.googleusercontent.com/vjsPF7Z4ldZbVt6K1RMiNqm3obQYtMcNOTyamTsy2g_vAH14NAa72GEwmiuVd2Sqrs5y2CN7LGiwqJMcjqewNy4hcUTpNy3ZunTulV_24HX64qQE00VUgxADHQxmAtf_Y72Zd_fO" alt="" width="580" height="358"/><figcaption>Figure 6: Implied estimates for human-level performance based on respondents who were working on their subfield for at least 10 years. The last three responses are above 1000 but have been cut off.<br></figcaption></figure>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh4.googleusercontent.com/FLSRMkqRuMkdi5Y2ms1BQOM4d-iCXTbeK1XcOjJNbSqmmCLRh0ZMgImQPRLeDzpBHvxupLMhdMVQ3tlryUA4wApKZn2WxdkFHod4qvHfCX8pgxNQGwkTryCSKezPHpex2pTHWn87" alt="" width="580" height="358"/><figcaption>Figure 7: Implied estimates for human-level performance based on respondents who were working on their subfield for less than 10 years. The last response is above 1000 but has been cut off.</figcaption></figure>



<h3 class="wp-block-heading">Comparison of the two surveys</h3>



<p>The median implied estimate from 2020 until human-level performance suggested by responses from 2016 ESPAI (36 years) is an order of magnitude smaller than the one suggested by the Hanson survey (372 years). This appears to be at least partly explained by more experienced researchers giving responses that imply longer estimates. Hanson asks exclusively people who have spent at least 20 years in their subfield, whereas the 2016 survey does not filter based on experience. If we filter 2016 survey respondents for researchers who have spent at least 20 years in their subfield, we instead get a median estimate of 142 years.&nbsp;<br></p>



<p>More experienced researchers may generate longer implied estimates because the majority of progress has happened recently; many respondents said progress had accelerated, which is some evidence of this. It could also be that less-experienced researchers overestimate how significant recent progress is.<br></p>



<p>If AI research is accelerating and is going to continue accelerating until we get to human-level AI, the time to HLAI may be sooner than these estimates. If AI research is accelerating now but is not representative of what progress will look like in the future, longer naive estimates by more experienced researchers may be more appropriate.<br></p>



<h3 class="wp-block-heading">Comparison to estimates reached by other survey methods</h3>



<p><a href="https://aiimpacts.org/2016-expert-survey-on-progress-in-ai/">2016 ESPAI</a> also asked people to estimate time until human-level machine intelligence (HLMI) by asking them how many years they would give until a 50% chance of HLMI. The median answer for <a href="https://aiimpacts.org/2016-expert-survey-on-progress-in-ai/#Human-level_intelligence">this question</a> in 2016 was 40 years, or 36 years from 2020 (2056), exactly the same as the median answer of 36 years implied by extrapolating fractional progress. The survey also asked about time to HLMI in other ways, which yielded <a href="https://aiimpacts.org/2016-expert-survey-on-progress-in-ai/#Answers">less consistent answers</a>.<br></p>



<p><em>Primary author: Asya Bergal</em><br></p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Precedents for economic n-year doubling before 4n-year doubling</title>
		<link>http://aiimpacts.org/precedents-for-economic-n-year-doubling-before-4n-year-doubling/</link>
		
		<dc:creator><![CDATA[Katja Grace]]></dc:creator>
		<pubDate>Tue, 14 Apr 2020 20:42:41 +0000</pubDate>
				<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Takeoff speed]]></category>
		<category><![CDATA[front]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">http://aiimpacts.org/?p=2406</guid>

					<description><![CDATA[Does the economy ever double without having first doubled four times slower? Yes, but not since 3000BC. <a class="mh-excerpt-more" href="http://aiimpacts.org/precedents-for-economic-n-year-doubling-before-4n-year-doubling/" title="Precedents for economic n-year doubling before 4n-year doubling"></a>]]></description>
										<content:encoded><![CDATA[
<p>The only times gross world product appears to have doubled in <em>n</em> years without having doubled previously in 4<em>n</em> years were between 4,000 BC and 3,000 BC, and most likely between 10,000 BC and 4,000 BC.</p>



<h2 class="wp-block-heading">Details</h2>



<h3 class="wp-block-heading">Background</h3>



<p>A key open question regarding AI risk is how quickly advanced artificial intelligence will &#8216;take off&#8217;, which is to say something like &#8216;go from being a small source of influence in the world to an overwhelming one&#8217;. </p>



<p>In <em>Superintelligence</em><span id='easy-footnote-1-2406' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/precedents-for-economic-n-year-doubling-before-4n-year-doubling/#easy-footnote-bottom-1-2406' title='   Bostrom, Nick. &lt;em&gt;Superintelligence: Paths, Dangers, Strategies&lt;/em&gt;. 1 edition. Oxford: Oxford University Press, 2014.    &lt;/p&gt;



'><sup>1</sup></a></span>, Nick Bostrom defines the following answers, seemingly in line with common usage:</p>



<ul class="wp-block-list"><li><strong>Slow takeoff</strong> takes decades or centuries</li><li><strong>Moderate takeoff</strong> takes months or years </li><li><strong>Fast takeoff</strong> takes minutes to days</li></ul>



<p>However, the specific criteria for takeoff having occurred are generally ambiguous.</p>



<p>Paul Christiano has suggested<span id='easy-footnote-2-2406' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/precedents-for-economic-n-year-doubling-before-4n-year-doubling/#easy-footnote-bottom-2-2406' title='paulfchristiano. “Takeoff Speeds.” &lt;em&gt;The Sideways View&lt;/em&gt; (blog), February 24, 2018. &lt;a href=&quot;https://sideways-view.com/2018/02/24/takeoff-speeds/&quot;&gt;https://sideways-view.com/2018/02/24/takeoff-speeds/&lt;/a&gt;.    '><sup>2</sup></a></span> operationalizing &#8216;slow takeoff&#8217; as:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles. (Similarly, we’ll see an 8 year doubling before a 2 year doubling, etc.)</p></blockquote>



<h3 class="wp-block-heading">Historic precedents</h3>



<p>We were interested in whether anything faster than a &#8216;slow takeoff&#8217; by this definition would be historically unprecedented. That is, we wanted to know whether whenever the economy has doubled in <em>n</em> years, it has always completed a doubling in 4<em>n</em> years or less before the beginning of the <em>n</em>-year doubling.</p>



<p>We took historic gross world product (GWP) estimates from Wikipedia<span id='easy-footnote-3-2406' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/precedents-for-economic-n-year-doubling-before-4n-year-doubling/#easy-footnote-bottom-3-2406' title='“Gross World Product.” In &lt;em&gt;Wikipedia&lt;/em&gt;, August 14, 2019. &lt;a href=&quot;https://en.wikipedia.org/w/index.php?title=Gross_world_product&amp;amp;oldid=910796857&quot;&gt;https://en.wikipedia.org/w/index.php?title=Gross_world_product&amp;amp;oldid=910796857&lt;/a&gt;. &lt;br&gt;&lt;br&gt;The page notes that most of their data comes from J Bradford de Long&amp;#8217;s dataset: &lt;br&gt;&lt;br&gt;J. Bradford DeLong (24 May 1998).&amp;nbsp;&lt;a href=&quot;http://holtz.org/Library/Social%20Science/Economics/Estimating%20World%20GDP%20by%20DeLong/Estimating%20World%20GDP.htm&quot;&gt;&amp;#8220;Estimating World GDP, One Million B.C. – Present&amp;#8221;&lt;/a&gt;. Retrieved&amp;nbsp;5 February&amp;nbsp;2013.'><sup>3</sup></a></span> and checked, at each date, how long the economy had taken to double, and whether, prior to the start of that doubling, it had ever completed a doubling in at most four times as many years.<span id='easy-footnote-4-2406' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/precedents-for-economic-n-year-doubling-before-4n-year-doubling/#easy-footnote-bottom-4-2406' title='[Note May 13 2020: This sheet is temporarily wrong.]&lt;s&gt;Instances coincide with &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1Muz2ftyDUUewMTZPxYxeXF-uj6lBKYP-O3-IvdtHhCo/edit?ts=5e95f280#gid=0&amp;amp;range=G:G&quot;&gt;Column G in this spreadsheet&lt;/a&gt; giving a number higher than 4, when &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1Muz2ftyDUUewMTZPxYxeXF-uj6lBKYP-O3-IvdtHhCo/edit?ts=5e95f280#gid=0&amp;amp;range=E2&quot;&gt;E2&lt;/a&gt; is set to 2.&lt;/s&gt;'><sup>4</sup></a></span>
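<p>The check we performed can be sketched as follows. This is a minimal illustration of the method, not our actual analysis code; the toy GWP series in the example is invented, whereas the real analysis used the Wikipedia estimates cited above.</p>

```python
# Sketch of the doubling-time check. 'series' is a list of (year, gwp)
# points in chronological order. For each doubling that took n years,
# we ask whether some earlier doubling, completed before this doubling
# began, took 4n years or less.

def doubling_time(series, i):
    """Years the economy took to double, ending at series[i].

    Walks backward to the most recent earlier point where GWP was at
    most half its value at series[i]; returns None if it never halved.
    (A sketch: real data would call for interpolation between points.)"""
    end_year, end_gwp = series[i]
    for year, gwp in reversed(series[:i]):
        if gwp <= end_gwp / 2:
            return end_year - year
    return None

def unprecedented_doublings(series):
    """Doublings in n years with no prior doubling in <= 4n years."""
    results = []
    for i in range(1, len(series)):
        n = doubling_time(series, i)
        if n is None:
            continue
        start_year = series[i][0] - n  # beginning of this doubling
        precedent = False
        for j in range(1, i):
            m = doubling_time(series, j)
            if m is not None and series[j][0] <= start_year and m <= 4 * n:
                precedent = True
                break
        if not precedent:
            results.append((series[i][0], n))
    return results
```

<p>On a toy series mirroring the case discussed below&#8212;GWP doubling between 10,000 BC and 4,000 BC, then again between 4,000 BC and 3,000 BC&#8212;the second doubling (1,000 years) is flagged, since the only earlier doubling took 6,000 years, more than four times as long.</p>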



<p>We found two apparent examples of faster takeoffs, so defined:</p>



<ul class="wp-block-list"><li>Between 4,000 BC and 3,000 BC, GWP doubled in 1,000 years, yet it had never before doubled in as few as 4,000 years</li><li>Between 10,000 BC and 4,000 BC, GWP doubled in 6,000 years, yet there is no record of it doubling earlier in as few as 24,000 years. The records at that point are fairly sparse, so this is less clear, but it seems unlikely that there was a doubling in 24,000 years.<span id='easy-footnote-5-2406' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/precedents-for-economic-n-year-doubling-before-4n-year-doubling/#easy-footnote-bottom-5-2406' title='Toward the end of the period it took 15,000 years to grow by $0.6Bn, and growth of $1.8Bn would have been needed for a doubling. So assuming linear growth at the end-of-period rate, this would have taken around 45,000 years, whereas if growth was speeding up, it should have taken longer.'><sup>5</sup></a></span> This appears to coincide with the beginning of agriculture, in around 9,000 BC.<span id='easy-footnote-6-2406' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/precedents-for-economic-n-year-doubling-before-4n-year-doubling/#easy-footnote-bottom-6-2406' title='Khan Academy. “The Dawn of Agriculture (Article).” Accessed April 14, 2020. &lt;a href=&quot;https://www.khanacademy.org/humanities/world-history/world-history-beginnings/birth-agriculture-neolithic-revolution/a/where-did-agriculture-come-from&quot;&gt;https://www.khanacademy.org/humanities/world-history/world-history-beginnings/birth-agriculture-neolithic-revolution/a/where-did-agriculture-come-from&lt;/a&gt;.'><sup>6</sup></a></span></li></ul>



<p>The 300-year period immediately after 1300 saw a doubling of GWP, and the 1,200 years immediately beforehand did not; however, there was an earlier doubling within the 1,200 years ending in 1200 AD. So this is not technically an instance, but it was a case of briefly accelerating growth. GWP between 1100 and 1300 actually declined, though, so this is perhaps a different kind of case from the ones we are interested in.</p>



<p><em>Corresponding author: Daniel Kokotajlo</em></p>



<h2 class="wp-block-heading">Notes</h2>



<p></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Resolutions of mathematical conjectures over time</title>
		<link>http://aiimpacts.org/resolutions-of-mathematical-conjectures-over-time/</link>
		
		<dc:creator><![CDATA[Asya Bergal]]></dc:creator>
		<pubDate>Tue, 14 Apr 2020 20:38:13 +0000</pubDate>
				<category><![CDATA[AI Timelines]]></category>
		<category><![CDATA[Algorithmic Progress]]></category>
		<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[front]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">http://aiimpacts.org/?p=2409</guid>

					<description><![CDATA[The time-to-proof for past mathematical problems currently remembered as notable is exponentially distributed with a half life of around 100 years.  <a class="mh-excerpt-more" href="http://aiimpacts.org/resolutions-of-mathematical-conjectures-over-time/" title="Resolutions of mathematical conjectures over time"></a>]]></description>
										<content:encoded><![CDATA[
<p>Conditioned on being remembered as a notable conjecture, the time-to-proof for a mathematical problem appears to be exponentially distributed with a half-life of about 100 years. However, these observations are likely to be distorted by various biases.</p>



<h2 class="wp-block-heading">Support</h2>



<p>In 2014, we found conjectures referenced on Wikipedia, and recorded the dates that they were proposed and resolved, if they were resolved. We updated this list of conjectures in 2020, marking any whose status had changed. We then used a Kaplan-Meier estimator<span id='easy-footnote-1-2409' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/resolutions-of-mathematical-conjectures-over-time/#easy-footnote-bottom-1-2409' title='“Kaplan–Meier Estimator.” Wikipedia. Wikimedia Foundation, April 1, 2020. &lt;a href=&quot;https://en.wikipedia.org/w/index.php?title=Kaplan–Meier_estimator&amp;amp;oldid=948523181&quot;&gt;https://en.wikipedia.org/w/index.php?title=Kaplan–Meier_estimator&amp;amp;oldid=948523181&lt;/a&gt;.'><sup>1</sup></a></span> to approximate the survivorship function.<span id='easy-footnote-2-2409' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/resolutions-of-mathematical-conjectures-over-time/#easy-footnote-bottom-2-2409' title='“Survival Function.” Wikipedia. Wikimedia Foundation, October 8, 2019. &lt;a href=&quot;https://en.wikipedia.org/w/index.php?title=Survival_function&amp;amp;oldid=920310453&quot;&gt;https://en.wikipedia.org/w/index.php?title=Survival_function&amp;amp;oldid=920310453&lt;/a&gt;.'><sup>2</sup></a></span>



<p>The results of this exercise are recorded <a href="https://docs.google.com/spreadsheets/d/119VtkbzNWGdAhGsDx-IjYODkwCz0GHRNKDkipRGqjf8/edit?usp=sharing">here</a>.<span id='easy-footnote-3-2409' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/resolutions-of-mathematical-conjectures-over-time/#easy-footnote-bottom-3-2409' title='The ‘Data’ tab of &lt;a href=&quot;https://docs.google.com/spreadsheets/d/119VtkbzNWGdAhGsDx-IjYODkwCz0GHRNKDkipRGqjf8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt; contains the list of conjectures we used and their sources. The ‘Kaplan Meier’ tab contains the calculation of the survival function. &lt;a href=&quot;https://docs.google.com/spreadsheets/d/119VtkbzNWGdAhGsDx-IjYODkwCz0GHRNKDkipRGqjf8/edit#gid=380799079&amp;amp;range=K2&quot;&gt;The cell next to the cell marked ‘Exponential trendline’&lt;/a&gt; contains our calculation for the exponential function fitting our Kaplan-Meier estimator.'><sup>3</sup></a></span> Figure 1 below shows the survivorship function for the mathematical conjectures we found. The data is fit closely by an exponential function with a half-life of 117 years.<span id='easy-footnote-4-2409' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/resolutions-of-mathematical-conjectures-over-time/#easy-footnote-bottom-4-2409' title='When fitting our exponential, we did not count the last point at 750 years, because it had a y-value of 0, which the Google Sheets LOGEST function would not accept when generating a best-fit curve. Nonetheless, Figure 1 suggests that the last point seems to fit our exponential reasonably well.'><sup>4</sup></a></span>
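<p>For illustration, the estimator and the exponential fit can be sketched in a few lines. This is a simplified stand-in for the spreadsheet calculation&#8212;the fit below is a least-squares line through the origin on log-survival, rather than Google Sheets&#8217; LOGEST&#8212;and the durations in the example are invented rather than drawn from our conjecture data.</p>

```python
import math

def kaplan_meier(durations, resolved):
    """Kaplan-Meier estimate of the survival function.

    durations: years from posing to resolution, or to the present for
               still-open conjectures.
    resolved:  True if resolved, False if censored (still open).
    Returns (time, surviving fraction) steps at each resolution."""
    # Sort by time, processing resolutions before censorings at ties.
    events = sorted(zip(durations, resolved), key=lambda e: (e[0], not e[1]))
    at_risk = len(events)
    survival = 1.0
    curve = []
    for t, was_resolved in events:
        if was_resolved:
            survival *= (at_risk - 1) / at_risk
            curve.append((t, survival))
        at_risk -= 1
    return curve

def fitted_half_life(curve):
    """Half-life of an exponential S(t) = exp(-lambda * t) fit to the
    curve by least squares on log S(t), constrained through the origin.
    Points where survival has reached zero are excluded from the fit."""
    pts = [(t, math.log(s)) for t, s in curve if s > 0]
    lam = -sum(t * y for t, y in pts) / sum(t * t for t, _ in pts)
    return math.log(2) / lam
```

<p>On a curve that is exactly exponential with a 100-year half-life, <code>fitted_half_life</code> recovers 100 years.</p>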



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/image-1024x633.png" alt="" class="wp-image-2410" width="580" height="358" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/image-1024x633.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/image-300x186.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/image-768x475.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/image-1536x950.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/image.png 1646w" sizes="auto, (max-width: 580px) 100vw, 580px" /><figcaption>Figure 1: Survivorship function of mathematical conjectures over time, also known as the fraction of mathematical conjectures unresolved at time t after being posed.</figcaption></figure>



<h2 class="wp-block-heading">Biases</h2>



<p>We are using resolution times for remembered conjectures as a proxy for resolution times for all conjectures. Resolution times for remembered conjectures might be biased in several ways: old conjectures are perhaps more likely to be remembered if they were solved than if they were not; very recently solved conjectures are probably more likely to be remembered (though this only matters because the rate of conjecture posing has probably changed over time); and conjectures that were especially hard to solve might also be more notable. In addition, the last hundred years of the curve contain few data points, making that portion particularly prone to inaccuracy.</p>



<h2 class="wp-block-heading">Relevance</h2>



<p>To the extent that open theoretical problems in AI are similar to math problems, time to solve math problems may be informative for forming a prior on time to solve AI problems.</p>



<p><em>Corresponding author: Asya Bergal</em></p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Preliminary survey of prescient actions</title>
		<link>http://aiimpacts.org/survey-of-prescient-actions/</link>
		
		<dc:creator><![CDATA[richardkorzekwa]]></dc:creator>
		<pubDate>Sat, 04 Apr 2020 00:15:54 +0000</pubDate>
				<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Reference]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">http://aiimpacts.org/?p=2362</guid>

					<description><![CDATA[In a short search, we did not find clear examples of 'prescient actions'—specific efforts to address severe and complex problems decades ahead of time and in the absence of broader scientific concern, experience with analogous problems... <a class="mh-excerpt-more" href="http://aiimpacts.org/survey-of-prescient-actions/" title="Preliminary survey of prescient actions"></a>]]></description>
										<content:encoded><![CDATA[
<p><em><span class="has-inline-color has-cyan-bluish-gray-color">Published 3 April 2020</span></em></p>



<p> In a 10-20 hour exploration, we did not find clear examples of &#8216;prescient actions&#8217;—specific efforts to address severe and complex problems decades ahead of time and in the absence of broader scientific concern, experience with analogous problems, or feedback on the success of the effort—though we found six cases that may turn out to be examples on further investigation. </p>



<h2 class="wp-block-heading">Details</h2>



<p> We briefly investigated 20 leads on historical cases of actions taken to eliminate or mitigate a problem a decade or more in advance, evaluating them for their ‘prescience’. None were clearly as prescient as the <a href="https://intelligence.org/files/SzilardNuclearWeapons.pdf">actions of Leó Szilárd</a>, which were previously the best examples of such actions that we found. The primary ways in which these actions failed to exhibit prescience were the amount of feedback that was available while developing a solution and the number of years in advance of the threat that the action was taken. Although we are uncertain about most of the cases, we believe that six of them are promising for future investigation. </p>



<h2 class="wp-block-heading">Background</h2>



<p> Current efforts to prepare for the impacts of artificial intelligence have several features that could make them unlikely to succeed. They typically require us to make complex predictions about novel threats over a timescale of decades, and many of these efforts will receive little feedback on whether they are on the right track, receive little input from the larger scientific community, and produce results that are not useful outside the problem of mitigating AI risk.</p>



<p>It may be useful to search for past cases of preparations that have similar features. It is important to know whether humanity has failed to solve problems in advance because the attempts to do so failed or because solutions were not attempted. If we find failed attempts, we want to know why they failed. For example, if it turns out that most previous actions were not successful because of failure to accurately predict the future, we may want to focus more of our efforts on forecasting. To this end, we use the following set of criteria for evaluating past efforts for their ‘prescience’, or the extent to which they represent early actions to mitigate a risk in the absence of feedback:<span id='easy-footnote-1-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-1-2362' title='Originally proposed by &lt;a href=&quot;https://docs.google.com/document/d/1oD0Ti9WiET3mTKBfowxWJaosV1OdnBl5jK8DYb-bWmc/edit?usp=sharing&quot;&gt;Alexander Berger in 2015&lt;/a&gt;.'><sup>1</sup></a></span>



<ul class="wp-block-list">
<li><strong>Years in Advance: </strong>How many years in advance of the expected emergence of the threat was the action taken?</li>



<li><strong>Novelty:</strong> Was the threat novel, or can we re-use (perhaps with modification) the solution to past threats?</li>



<li><strong>Scientific Concern:</strong> Was the effort to address the threat endorsed by the larger scientific community?</li>



<li><strong>Complex Prediction:</strong> Did the solution require a complex prediction, or was the solution clear and closely related to the problem?</li>



<li><strong>Specificity: </strong>Was the solution specific to the threat or is it something that is broadly useful and may be done anyway?</li>



<li><strong>Feedback:</strong> Was feedback available while developing a solution, so that mistakes could be made and learned from, or did it need to be right on the first try?</li>



<li><strong>Severity:</strong> Was it a severe threat of global importance?</li>
</ul>



<p> In addition to these criteria, we took note of whether the outcome of the efforts is known, as cases with a known outcome may be more informative and more fruitful for further investigation. </p>



<h2 class="wp-block-heading">Methodology</h2>



<p> Potential cases of interest were found by searching the Internet, asking our friends and colleagues, and offering a bounty on promising leads. We compiled a list of topics to research that were sufficiently narrow to allow for evaluation over a short period of time. This list included individual people that took actions (like Clair Patterson), specific actions that were taken (e.g. the installation of the Moscow-Washington Hotline), and the threats themselves (such as the destruction of infrastructure by a geomagnetic storm). </p>



<p> One researcher spent approximately 30 minutes reviewing each case, and rated each on a scale of 1 to 10 on the criteria described in the previous section.<span id='easy-footnote-2-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-2-2362' title='All of the ratings were assigned by Rick Korzekwa'><sup>2</sup></a></span> A score of 1 indicates that the criterion described the case very poorly, while a score of 10 indicates that the case demonstrated the criterion extremely well. These ratings were highly subjective, though we made efforts to evaluate the cases in a way that was consistent and avoided too many false negatives.<span id='easy-footnote-3-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-3-2362' title='For example, efforts to reduce the risks of geomagnetic storms and antibiotic resistance both involve some actions that are high in specificity and others that are low in specificity. We evaluated both cases on the most specific-to-the-problem actions that we are aware of.'><sup>3</sup></a></span> A composite score was calculated from these by taking a weighted average with the following weights:<span id='easy-footnote-4-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-4-2362' title=' Because we were highly uncertain about our scores given only a half hour of research per case, we assigned scores for our best guess, or ‘median guess’ score, as well as 10th and 90th percentile estimates for each criterion for each case. These should be interpreted as the range of scores which we expect we would arrive at given several hours of investigation, with 80% credence, and equal likelihood of having over- or underestimated the score. We calculated 10th and 90th percentile estimates of the average by modeling the high and low estimates as uncorrelated deviations from the mean, so that they could be added in the usual way for propagating uncorrelated errors.'><sup>4</sup></a></span>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Criterion</strong></td><td><strong>Weight</strong></td></tr><tr><td>Number of years in advance<span id='easy-footnote-5-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-5-2362' title=' This score was calculated directly from the estimated number of years by a root logistic function with values 2.75, 7.1, and 9.6 for 0, 10, and 20 years, respectively '><sup>5</sup></a></span></td><td>20</td></tr><tr><td>Overall severity of threat</td><td>2</td></tr><tr><td>Novelty of threat/solution</td><td>3</td></tr><tr><td>Overall level of concern from the scientific community at large</td><td>2</td></tr><tr><td>Complexity of prediction required to produce a solution</td><td>5</td></tr><tr><td>Specificity of solution</td><td>2</td></tr><tr><td>Level of feedback available while developing a solution</td><td>10</td></tr></tbody></table></figure>
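<p>As an illustration, the composite score is simply a weighted average of the criterion ratings using the weights above. The ratings in the example below are invented for demonstration, not taken from our spreadsheet.</p>

```python
# Weights from the table above; ratings are on the 1-10 scale.
WEIGHTS = {
    "years_in_advance": 20,
    "severity": 2,
    "novelty": 3,
    "scientific_concern": 2,
    "complex_prediction": 5,
    "specificity": 2,
    "feedback": 10,
}

def composite_score(ratings):
    """Weighted average of per-criterion ratings."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS) / total_weight

# Hypothetical case: taken well in advance of a severe, complex threat,
# but with relatively abundant feedback available.
example = {
    "years_in_advance": 7,
    "severity": 9,
    "novelty": 6,
    "scientific_concern": 4,
    "complex_prediction": 8,
    "specificity": 5,
    "feedback": 3,
}
```

<p>Since the weights sum to 44, years in advance and feedback together account for just over two-thirds of the composite, which is why cases tend to lose the most points on those two axes.</p>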



<p> In addition to these ratings, we rated each one for how promising it was for further research, and annotated the ratings in the spreadsheet as seemed appropriate. We also assigned ratings to two cases that were previously the subject of in-depth investigations, for comparison. These were the <a href="https://intelligence.org/files/TheAsilomarConference.pdf">Asilomar Conference</a> and <a href="https://intelligence.org/files/SzilardNuclearWeapons.pdf">the actions of Leó Szilárd</a>.</p>



<h2 class="wp-block-heading">Results</h2>



<p> The following table shows our ratings. The two reference cases are in italics. Our full spreadsheet of ratings and notes can be found <a href="https://docs.google.com/spreadsheets/d/12mMQFjgWPjE6agOxD8jDceNTKBzIeplkjrpyTdcMa48/edit?usp=sharing">here</a>.</p>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Case</strong></td><td><strong>Score</strong></td><td><strong>Suitability for Further Research</strong></td></tr><tr><td><em>Leó Szilárd</em></td><td><em>7.24</em></td><td></td></tr><tr><td>Antibiotic resistance</td><td>7.11</td><td>7</td></tr><tr><td>Open Quantum Safe</td><td>6.80</td><td>5</td></tr><tr><td>Nordic Gene Bank</td><td>6.74</td><td>4</td></tr><tr><td>Geomagnetic Storm Prep</td><td>6.74</td><td>5</td></tr><tr><td>Fukushima Daiichi</td><td>6.74</td><td>5</td></tr><tr><td>Swiss Redoubt</td><td>6.60</td><td>2</td></tr><tr><td>Nonproliferation Treaty</td><td>6.14</td><td>6</td></tr><tr><td>Cavendish Banana and TR4</td><td>6.12</td><td>5</td></tr><tr><td>WIPP</td><td>6.02</td><td>4</td></tr><tr><td>Population Bomb</td><td>5.99</td><td>3</td></tr><tr><td>Y2K</td><td>5.76</td><td>4</td></tr><tr><td><em>Asilomar Conference</em></td><td><em>5.70</em></td><td></td></tr><tr><td>Cold War Civil Defense</td><td>5.29</td><td>3</td></tr><tr><td>Religious Apocalypse</td><td>4.88</td><td>2</td></tr><tr><td>Hurricane Katrina</td><td>4.18</td><td>4</td></tr><tr><td>Iran Nuclear Deal</td><td>4.18</td><td>4</td></tr><tr><td>Moscow-Washington Hotline</td><td>3.90</td><td>3</td></tr><tr><td>England 1800s Policy Reform</td><td>3.89</td><td>2</td></tr><tr><td>Clair Patterson</td><td>3.74</td><td>2</td></tr><tr><td>Missile gap</td><td>3.22</td><td>2</td></tr><tr><td>PQCrypto Conference 2006</td><td><br></td><td>4</td></tr></tbody></table></figure>



<p> For one case, the PQCrypto 2006 conference, we were unable to find sufficient information after 45 minutes of investigation to provide an evaluation.</p>



<p>In general, the cases we investigated did not score highly on these criteria. The average score was 5.6 out of 10, with the US-Russia missile gap receiving the minimum score of 3.22 and antibiotic resistance receiving the maximum score of 7.11. None of the cases received a higher score than our reference case, the actions of Leó Szilárd (score = 7.24), which we consider to be sufficiently ‘prescient’ to be worth examining. Just over half (11) of our cases received higher ratings than the Asilomar Conference (rating = 5.70), which was previously judged to be less prescient.</p>



<p>The ratings are highly uncertain, as is natural for thirty-minute reviews of complex topics. On average, our 90th percentile estimates were 80% larger than their corresponding 10th percentile estimates. All but four cases had minimum ratings lower than the best guess for Asilomar, and more than half had maximum ratings higher than the best guess for Leó Szilárd.</p>



<p>The axes on which the cases were least prescient were feedback and years in advance.<span id='easy-footnote-6-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-6-2362' title='On average, the cases lost 1.35 points from their composite score on each of these criteria. This is partly due to the large weight assigned to these criteria. If we used an unweighted average to compute the scores, cases would lose 0.77 points for feedback and 0.39 for years in advance, with years in advance being the axis with the highest average score.'><sup>6</sup></a></span> The cases were most analogous on severity, novelty, and specificity of solution, losing on average 0.20, 0.30, and 0.20 points from their composite scores, respectively.</p>



<p>Two cases, antibiotic resistance and the Treaty on the Non-Proliferation of Nuclear Weapons, seemed particularly promising for additional research, and received scores of 7 and 6, respectively. Five other cases received scores of at least five and seemed less promising, but likely worth some additional research.</p>



<h2 class="wp-block-heading">Discussion</h2>



<p> Although the very short research time allotted to each case limits our ability to confidently draw conclusions, we ruled out some cases which were clearly not prescient, identified some promising cases, and roughly characterized some ways in which efforts to reduce AI risk may be different from past efforts to reduce risks.</p>



<h3 class="wp-block-heading">Irrelevant Cases</h3>



<p> There were four cases that we found to be poor examples of prescient actions: the <strong>US-Russia Missile Gap</strong> of the late 1950s, the actions of <strong>Clair Patterson</strong> to combat the use of leaded gasoline, <strong>19th century policy reforms in England</strong> that were made in response to the industrial revolution, and the <strong>Moscow-US Nuclear Hotline</strong>. All of these cases involved actions that were taken in response to, rather than in anticipation of, the emergence of a problem (or perceived problem), and for which the solutions were relatively straightforward, with the primary barriers being political.<span id='easy-footnote-7-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-7-2362' title='Clair Patterson made some impressive inferences about the present state of the world, and seemed to believe that the problems he was observing would continue to get worse without intervention. In this respect, his actions were prescient. But in general, he was working to prevent a present problem from becoming worse, rather than working to avoid a future problem.'><sup>7</sup></a></span>



<h3 class="wp-block-heading">Questionable Cases</h3>



<p> Two cases involved actions based on highly dubious predictions: preparations for a <strong>religious apocalypse</strong><span id='easy-footnote-8-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-8-2362' title=' Preparations for religious apocalypse is a broad category. We attempted to find examples in this category that fell within our target reference class, but we were generally unable to find examples that involved specific actions taken more than a few years in advance. We are not highly confident that there do not exist examples that meet these criteria.'><sup>8</sup></a></span> and the book <strong><em>The Population Bomb</em></strong> and the accompanying actions of author Paul Ehrlich. Although the actors in these cases were acting on predictions that have since been shown to be inaccurate, the cases do have some similarity to AI risk. They were addressing predictions of severe consequences from novel threats, they were acting without help from the scientific community, and they did not expect to receive a great deal of feedback along the way. However, the actions were taken only 5-10 years in advance of the threat, and we expect the apparent disconnect between the forecasts and reality to make it more difficult to learn from the actions.</p>



<p>Some cases involved threats that had already emerged, in the sense that they could happen immediately, but had sufficiently low per-year risk for a reasonable person to expect the outcome to be at least a decade in the future. These include <strong>Hurricane Katrina</strong>, <strong>US civil defense during the Cold War</strong>, <strong>Fukushima Daiichi</strong>, the comparison case <strong>Asilomar Conference</strong>, and the <strong>Nordic Gene Bank</strong>.<span id='easy-footnote-9-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-9-2362' title='The Nordic Gene Bank addresses a low per-year risk, so that it seems reasonable to consider it to be addressing a future risk. However, the first withdrawal from the seed vault happened relatively quickly, suggesting that either the risk is near term or that the solution is not highly specific to long term risks.'><sup>9</sup></a></span> <span id='easy-footnote-10-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-10-2362' title=' Although geomagnetic storm preparation has a similar quality, it seems that the per-year risk of a catastrophic outcome is low enough, and the preparations for such severe outcomes are specific enough, that it qualifies as a promising case, as described in the next section.'><sup>10</sup></a></span>



<p>Other cases involved solutions that were easy or not dependent on complex forecasting. The <strong>Swiss National Redoubt</strong> relied on long-range forecasting, but was more of a large investment in defense than a complex search for a solution. The <strong>year 2000 problem</strong> was easy to address, even without taking action until relatively shortly before the event took place. The <strong>Iran Nuclear Deal</strong> (and perhaps also the <strong>Nuclear Non-Proliferation Treaty</strong>) required difficult political negotiations, but did not appear to rely on complex predictions.</p>



<h3 class="wp-block-heading">Promising Cases</h3>



<p> We identified six cases that seem promising for further investigation:</p>



<p><strong>Alexander Fleming</strong> warned, in his 1945 Nobel Lecture, that widespread access to antibiotics without supervision might lead to antibiotic resistance.<span id='easy-footnote-11-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-11-2362' title='“The time may come when penicillin can be bought by anyone in the shops. Then there is the danger that the ignorant man may easily underdose himself and by exposing his microbes to non-lethal quantities of the drug make them resistant.” “Wayback Machine,” March 31, 2018.&lt;a href=&quot;https://web.archive.org/web/20180331001640/https://www.nobelprize.org/nobel_prizes/medicine/laureates/1945/fleming-lecture.pdf&quot;&gt; https://web.archive.org/web/20180331001640/https://www.nobelprize.org/nobel_prizes/medicine/laureates/1945/fleming-lecture.pdf&lt;/a&gt;.'><sup>11</sup></a></span> We are uncertain about the impact of Fleming’s warning, whether he took additional action to mitigate the risk, and how widespread such concerns were within the scientific community, but our impression is that it was not a widely known issue, that his was an early warning, and that his judgement was generally taken seriously by the time of his speech. His warning preceded the first documented cases of penicillin-resistant bacteria by more than 20 years, and the threat of antimicrobial resistance seems to be broadly analogous to AI risk on most of our criteria, though it does seem that feedback was available throughout efforts to reduce the threat.</p>



<p><strong>The Treaty on the Non-Proliferation of Nuclear Weapons</strong> required many actions from many actors, but it seems to have required a complex prediction about technological development and geopolitics to address a severe threat, was specific to a particular threat, and had limited opportunities for feedback. We are uncertain if any of the specific actions will prove to be prescient on further investigation, but it seems promising.<br></p>



<p><strong>Open Quantum Safe</strong> is an open-source project to develop cryptographic techniques that are resistant to the use of quantum computers. The threat of quantum computing to cryptography has several relevant features, including the need for complex forecasting of a novel threat over a timescale of decades. We found limited information on the circumstances surrounding the founding of the project or the related case, the <strong>2006 PQCrypto Conference</strong>, but the effort generally seems prescient.</p>



<p><strong>Geomagnetic Storm Preparation</strong> addresses the threat of severe damage to and disruption of electronics and power infrastructure by solar weather, which could be a severe global catastrophe.<span id='easy-footnote-12-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-12-2362' title=' See, for example https://allfed.info/industrial-civilisation/'><sup>12</sup></a></span> The expected time between such events is decades or centuries, and mitigating the risk involves actions that may be specific to the particular problem and requires complex predictions about the physics involved and how our infrastructure and institutions would be able to respond. However, we are uncertain about which actions were taken and when, and whether there is evidence that they are working. Additionally, there is substantial investment from the scientific community, and we are uncertain how much feedback is available while developing solutions.</p>



<p><strong>Panama Disease</strong> is a fungal infection that has been spreading globally for decades and threatens the viability of the Cavendish banana as a commercial crop. Cavendish bananas account for the vast majority of banana exports, and are integral to the food security of countries such as Costa Rica and Guatemala.<span id='easy-footnote-13-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-13-2362' title='“export revenue from bananas covered 40 percent of Costa Rica’s food import bill and 27 percent of Guatemala’s in 2014” “EST: Banana Facts.” Accessed February 6, 2020.&lt;a href=&quot;https://www.fao.org/economic/est/est-commodities/oilcrops/bananas/bananafacts/en/#.ZEcCbuzMJAc&quot;&gt; http://www.fao.org/economic/est/est-commodities/bananas/bananafacts/en/#.XjyilyOIYuV.&lt;/a&gt; '><sup>13</sup></a></span> Early action included measures to slow the spread of the fungus, a search for cultivars to replace the Cavendish, calls for greater diversity in banana varietals, and searches for fungicides that are able to kill the fungus. Although these actions have many opportunities for feedback, some of them involve complex predictions and searches for specific technical solutions, and, from the perspective of farmers on continents that have not yet encountered the infection, the arrival of the fungus represents a discrete event at some undetermined time in the future. We are uncertain whether these are good examples of prescient actions, but they may be worth additional investigation.</p>



<h3 class="wp-block-heading">Presence of Feedback</h3>



<p>The axis on which our cases most differed from efforts to reduce AI risk was the level of feedback available while developing a solution. The average score on feedback was 3.8, and none of the cases received a score higher than 7. Even cases that initially seemed likely to have very little feedback proved to have enough to aid those making preparations. Examples include Hurricane Katrina, which benefited from lessons learned from preceding hurricanes, and the National Redoubt of Switzerland, which benefited from the observation of conflicts between other actors, providing information about which military equipment and tactics were viable against likely adversaries. Assuming these results are representative, there are two ways to interpret them:</p>



<p><strong>Feedback is abundant: </strong>Feedback is abundant in a wide variety of situations, so that we should also expect to have opportunities for feedback while preparing for advanced artificial intelligence. In support of this view are the cases mentioned above that were initially expected to lack feedback, even on the part of those making preparations, but which nonetheless benefited from feedback.<br></p>



<p><strong>AI risk is unusual: </strong>The common perception that there is very little feedback available to efforts to reduce the risks of advanced AI is correct, and AI risk is unique (or very rare) in this regard. Support for this view comes from <a href="https://intelligence.org/2018/10/03/rocket-alignment/">arguments</a> for the one-shot nature of solving the AI control problem.<span id='easy-footnote-14-2362' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/survey-of-prescient-actions/#easy-footnote-bottom-14-2362' title='For instance, Eliezer Yudkowsky obliquely argues this in &lt;em&gt;The Rocket Alignment Problem&lt;/em&gt;.  “The Rocket Alignment Problem &amp;#8211; Machine Intelligence Research Institute.” Accessed March 26, 2020.&lt;a href=&quot;https://intelligence.org/2018/10/03/rocket-alignment/&quot;&gt; https://intelligence.org/2018/10/03/rocket-alignment/&lt;/a&gt;.'><sup>14</sup></a></span></p>



<p><em>Primary author: Rick Korzekwa</em></p>



<h2 class="wp-block-heading">Notes</h2>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>2019 recent trends in GPU price per FLOPS</title>
		<link>http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/</link>
					<comments>http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#comments</comments>
		
		<dc:creator><![CDATA[Asya Bergal]]></dc:creator>
		<pubDate>Wed, 25 Mar 2020 23:46:49 +0000</pubDate>
				<category><![CDATA[AI Timelines]]></category>
		<category><![CDATA[Featured Articles]]></category>
		<category><![CDATA[Hardware and AI Timelines]]></category>
		<category><![CDATA[Hardware progress]]></category>
		<category><![CDATA[front]]></category>
		<category><![CDATA[major investigation]]></category>
		<category><![CDATA[Pages]]></category>
		<guid isPermaLink="false">http://aiimpacts.org/?p=2316</guid>

					<description><![CDATA[Published 25 March, 2020 We estimate that in recent years, GPU prices have fallen at rates that would yield an order of magnitude over roughly: Details GPUs (graphics processing units) are specialized electronic circuits originally <a class="mh-excerpt-more" href="http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/" title="2019 recent trends in GPU price per FLOPS"></a>]]></description>
										<content:encoded><![CDATA[
<p class="has-text-color" style="color:#707070"><em>Published 25 March, 2020</em></p>



<p>We estimate that in recent years, GPU prices have fallen at rates that would yield an order of magnitude over roughly:</p>



<ul class="wp-block-list">
<li>17 years for single-precision FLOPS</li>



<li>10 years for half-precision FLOPS</li>



<li>5 years for half-precision fused multiply-add FLOPS</li>
</ul>



<h1 class="wp-block-heading">Details</h1>



<p>GPUs (graphics processing units) are specialized electronic circuits originally used for computer graphics.<span id='easy-footnote-1-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-1-2316' title='“A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device … Modern GPUs are very efficient at manipulating computer graphics and image processing … The term was popularized by Nvidia in 1999, who marketed the GeForce 256 as &amp;#8220;the world&amp;#8217;s first GPU&amp;#8221;. It was presented as a &amp;#8220;single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines&amp;#8221;.”&lt;br&gt;“Graphics Processing Unit.” Wikipedia. Wikimedia Foundation, March 24, 2020. &lt;a href=&quot;https://en.wikipedia.org/w/index.php?title=Graphics_processing_unit&amp;amp;oldid=947270104&quot;&gt;https://en.wikipedia.org/w/index.php?title=Graphics_processing_unit&amp;amp;oldid=947270104&lt;/a&gt;.'><sup>1</sup></a></span> In recent years, they have been popularly used for machine learning applications.<span id='easy-footnote-2-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-2-2316' title='Fraenkel, Bernard. “Council Post: For Machine Learning, It&amp;#8217;s All About GPUs.” Forbes. Forbes Magazine, December 8, 2017. 
&lt;a href=&quot;https://www.forbes.com/sites/forbestechcouncil/2017/12/01/for-machine-learning-its-all-about-gpus/#5ed90c227699&quot;&gt;https://www.forbes.com/sites/forbestechcouncil/2017/12/01/for-machine-learning-its-all-about-gpus/#5ed90c227699&lt;/a&gt;.'><sup>2</sup></a></span> One measure of GPU performance is FLOPS, the number of operations on floating-point numbers a GPU can perform in a second.<span id='easy-footnote-3-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-3-2316' title='“In computing, floating point operations per second (FLOPS, flops or flop/s) is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations. For such cases it is a more accurate measure than measuring instructions per second.”&lt;br&gt;“FLOPS.” Wikipedia. Wikimedia Foundation, March 24, 2020. &lt;a href=&quot;https://en.wikipedia.org/w/index.php?title=FLOPS&amp;amp;oldid=947177339&quot;&gt;https://en.wikipedia.org/w/index.php?title=FLOPS&amp;amp;oldid=947177339&lt;/a&gt;'><sup>3</sup></a></span> This page looks at the trends in GPU price / FLOPS of theoretical peak performance over the past 13 years. It does not include the cost of operating the GPUs, and it does not consider GPUs rented through cloud computing.</p>



<h2 class="wp-block-heading">Theoretical peak performance</h2>



<p>‘Theoretical peak performance’ numbers appear to be determined by adding together the theoretical performances of the processing components of the GPU, which are calculated by multiplying the clock speed of the component by the number of instructions it can perform per cycle.<span id='easy-footnote-4-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-4-2316' title='From this discussion on Nvidia&amp;#8217;s forums about theoretical GFLOPS: “GPU theoretical flops calculation is similar conceptually. It will vary by GPU just as the CPU calculation varies by CPU architecture and model. To use K40m as an example: http://www.nvidia.com/content/PDF/kepler/Tesla-K40-PCIe-Passive-Board-Spec-BD-06902-001_v05.pdf&lt;br&gt;&lt;/p&gt;



&lt;p&gt;there are 15 SMs (2880/192), each with 64 DP ALUs that are capable of retiring one DP FMA instruction per cycle (== 2 DP Flops per cycle).&lt;br&gt;&lt;/p&gt;



&lt;p&gt;15 x 64 x 2 * 745MHz = 1.43 TFlops/sec&lt;br&gt;&lt;/p&gt;



&lt;p&gt;which is the stated perf:&lt;/p&gt;



&lt;p&gt;http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf &amp;#8220;&lt;/p&gt;



&lt;p&gt;Person. “Comparing CPU and GPU Theoretical GFLOPS.” NVIDIA Developer Forums, May 21, 2014. &lt;a href=&quot;https://forums.developer.nvidia.com/t/comparing-cpu-and-gpu-theoretical-gflops/33335&quot;&gt;https://forums.developer.nvidia.com/t/comparing-cpu-and-gpu-theoretical-gflops/33335&lt;/a&gt;.'><sup>4</sup></a></span> These numbers are given by the developer and may not reflect actual performance on a given application.<span id='easy-footnote-5-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-5-2316' title='From this blog post on the performance of TensorCores, a component of new Nvidia GPUs specialized for deep learning: “The problem is it’s totally unclear how to approach the peak performance of 120 TFLOPS, and as far as I know, no one could achieve so significant speedup on real tasks. Let me know if you aware of good cases.&amp;#8221;&lt;br&gt;Sapunov, Grigory. “Hardware for Deep Learning. Part 3: GPU.” Medium. Intento, January 20, 2020. &lt;a href=&quot;https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664&quot;&gt;https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664&lt;/a&gt;.'><sup>5</sup></a></span>
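As a concrete illustration, the K40 arithmetic quoted in the footnote can be reproduced directly. This is a sketch of the general formula (number of units × operations per unit per cycle × clock speed), using the figures given there:

```python
# Theoretical peak FLOPS = (number of units) * (FLOPs per unit per cycle) * (clock speed).
# Example figures for Nvidia's Tesla K40, from the footnoted forum post: 15 SMs,
# each with 64 double-precision ALUs retiring one FMA (= 2 FLOPs) per cycle, at 745 MHz.
def theoretical_peak_flops(num_units, flops_per_unit_per_cycle, clock_hz):
    return num_units * flops_per_unit_per_cycle * clock_hz

k40_peak = theoretical_peak_flops(num_units=15, flops_per_unit_per_cycle=64 * 2, clock_hz=745e6)
print(k40_peak / 1e12)  # ~1.43 TFLOPS, matching the stated spec
```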



<h2 class="wp-block-heading">Metrics</h2>



<p>We collected data on multiple slightly different measures of GPU price and FLOPS performance.</p>



<h3 class="wp-block-heading">Price metrics</h3>



<p>GPU prices are divided into release prices, which are the manufacturer-suggested retail prices at which GPUs are originally sold, and active prices, which are the prices at which GPUs are actually sold over time, often by resellers.</p>



<p>We expect that active prices better represent the prices available to hardware users, but we also collected release prices as supporting evidence.</p>



<h3 class="wp-block-heading">FLOPS performance metrics</h3>



<p>Several varieties of ‘FLOPS’ can be distinguished based on the specifics of the operations they involve. Here we are interested in single-precision FLOPS, half-precision FLOPS, and half-precision fused-multiply add FLOPS.</p>



<p>‘Single-precision’ and ‘half-precision’ refer to the number of bits used to specify a floating point number.<span id='easy-footnote-6-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-6-2316' title='Gupta, Geetika. “Difference Between Single-, Double-, Multi-, Mixed-Precision: NVIDIA Blog.” The Official NVIDIA Blog, November 21, 2019. &lt;a href=&quot;https://blogs.nvidia.com/blog/2019/11/15/whats-the-difference-between-single-double-multi-and-mixed-precision-computing/&quot;&gt;https://blogs.nvidia.com/blog/2019/11/15/whats-the-difference-between-single-double-multi-and-mixed-precision-computing/&lt;/a&gt;.'><sup>6</sup></a></span> Using more bits to specify a number achieves greater precision at the cost of more computational steps per calculation. Our data suggests that GPUs have largely been improving in single-precision performance in recent decades,<span id='easy-footnote-7-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-7-2316' title='See our &lt;a href=&quot;https://aiimpacts.org/recent-trend-in-the-cost-of-computing/&quot;&gt;2017 analysis&lt;/a&gt;, footnote 4, which notes that single-precision price performance seems to be improving while double-precision price performance is not'><sup>7</sup></a></span> and half-precision performance appears to be increasingly popular because it is adequate for deep learning.<span id='easy-footnote-8-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-8-2316' title='“With the growing importance of deep learning and energy-saving approximate computing, half precision floating point arithmetic (FP16) is fast gaining popularity. 
Nvidia&amp;#8217;s recent Pascal architecture was the first GPU that offered FP16 support.”&lt;br&gt;N. Ho and W. Wong, &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/8091072&quot;&gt;&amp;#8220;Exploiting half precision arithmetic in Nvidia GPUs,&amp;#8221;&lt;/a&gt; 2017 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, 2017, pp. 1-7.'><sup>8</sup></a></span>



<p>Nvidia, the main provider of chips for machine learning applications,<span id='easy-footnote-9-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-9-2316' title='“In a recent paper, Google revealed that its TPU can be up to 30x faster than a GPU for inference (the TPU can’t do training of neural networks). As the main provider of chips for machine learning applications, Nvidia took some issue with that, arguing that some of its existing inference chips were already highly competitive to the TPU.”&lt;br&gt;Armasu, Lucian. “On Tensors, Tensorflow, And Nvidia&amp;#8217;s Latest &amp;#8216;Tensor Cores&amp;#8217;.” Tom&amp;#8217;s Hardware. Tom&amp;#8217;s Hardware, May 11, 2017. &lt;a href=&quot;https://www.tomshardware.com/news/nvidia-tensor-core-tesla-v100,34384.html&quot;&gt;https://www.tomshardware.com/news/nvidia-tensor-core-tesla-v100,34384.html&lt;/a&gt;.'><sup>9</sup></a></span> recently released a series of GPUs featuring Tensor Cores,<span id='easy-footnote-10-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-10-2316' title='“Tensor Cores in NVIDIA Volta GPU Architecture.” NVIDIA. Accessed May 2, 2020. https://www.nvidia.com/en-us/data-center/tensorcore/.&lt;br&gt;'><sup>10</sup></a></span> which claim to deliver “groundbreaking AI performance”. 
Tensor Core performance is measured in FLOPS, but Tensor Cores exclusively perform a particular kind of floating-point operation known as a fused multiply-add (FMA).<span id='easy-footnote-11-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-11-2316' title='&amp;#8220;Volta is equipped with 640 Tensor Cores, each performing 64 floating-point fused-multiply-add (FMA) operations per clock. That delivers up to 125 TFLOPS for training and inference applications.”&lt;br&gt;“Tensor Cores in NVIDIA Volta GPU Architecture.” NVIDIA. Accessed March 25, 2020. &lt;a href=&quot;https://www.nvidia.com/en-us/data-center/tensorcore/&quot;&gt;https://www.nvidia.com/en-us/data-center/tensorcore/&lt;/a&gt;.'><sup>11</sup></a></span> Performance on these operations is important for certain kinds of deep learning workloads,<span id='easy-footnote-12-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-12-2316' title='&amp;#8220;A useful operation in computer linear algebra is multiply-add: calculating the sum of a value c with a product of other values a x b to produce c + a x b. Typically, thousands of such products may be summed in a single accumulator for a model such as ResNet-50, with many millions of independent accumulations when running a model in deployment, and quadrillions of these for training models.”&lt;br&gt;Johnson, Jeff. “Making Floating Point Math Highly Efficient for AI Hardware.” Facebook AI Blog, November 8, 2018. &lt;a href=&quot;https://ai.facebook.com/blog/making-floating-point-math-highly-efficient-for-ai-hardware/&quot;&gt;https://ai.facebook.com/blog/making-floating-point-math-highly-efficient-for-ai-hardware/&lt;/a&gt;.'><sup>12</sup></a></span> so we track ‘GPU price / FMA FLOPS’ as well as ‘GPU price / FLOPS’.<br></p>
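A fused multiply-add computes c + a × b in a single instruction and is conventionally counted as two floating-point operations. Under that convention, the Volta figure quoted in the footnote can be checked with the same peak-performance arithmetic; note that the ~1.53 GHz clock speed below is our assumption for illustration, not a figure given in this article:

```python
# A fused multiply-add (FMA) computes c + a * b in one instruction, counted as 2 FLOPs.
def fma(a, b, c):
    return c + a * b

# Volta, per the footnote: 640 Tensor Cores, each retiring 64 FMAs per clock.
# The 1.53 GHz boost clock is an assumed value, not stated in the article.
volta_peak = 640 * 64 * 2 * 1.53e9
print(volta_peak / 1e12)  # ~125 TFLOPS, consistent with the quoted figure
```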



<p>In addition to purely half-precision computations, Tensor Cores are capable of performing mixed-precision computations, where part of the computation is done in half-precision and part in single-precision.<span id='easy-footnote-13-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-13-2316' title='See Figure 2:&lt;br&gt;Gupta, Geetika. “Using Tensor Cores for Mixed-Precision Scientific Computing.” NVIDIA Developer Blog, April 19, 2019. &lt;a href=&quot;https://devblogs.nvidia.com/tensor-cores-mixed-precision-scientific-computing&quot;&gt;https://devblogs.nvidia.com/tensor-cores-mixed-precision-scientific-computing&lt;/a&gt;/.'><sup>13</sup></a></span> Since explicitly mixed-precision-optimized hardware is quite recent, we don’t look at the trend in mixed-precision price performance, and only look at the trend in half-precision price performance.</p>



<h4 class="wp-block-heading">Precision tradeoffs</h4>



<p>Any GPU that performs multiple kinds of computations (single-precision, half-precision, half-precision fused multiply-add) trades off performance on one against the others, because there is limited space on the chip, and transistors must be allocated to one type of computation or another.<span id='easy-footnote-14-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-14-2316' title='Three different individuals told us about this constraint, including one Nvidia employee.'><sup>14</sup></a></span> All current GPUs that perform half-precision or TensorCore fused-multiply-add computations also do single-precision computations, so they split their transistor budget across computation types. For this reason, our impression is that half-precision FLOPS could be much cheaper now if entire GPUs were allocated to a single type of computation, rather than split between several.</p>



<h2 class="wp-block-heading">Release date prices</h2>



<p>We collected data on theoretical peak performance (FLOPS), release date, and price from several sources, including Wikipedia.<span id='easy-footnote-15-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-15-2316' title='See the ‘Source’ column in &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘GPU Data’. We largely used &lt;a href=&quot;https://www.techpowerup.com/&quot;&gt;TechPowerUp&lt;/a&gt;, Wikipedia’s &lt;a href=&quot;https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units&quot;&gt;List of Nvidia GPUs&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units&quot;&gt;List of AMD GPUs&lt;/a&gt;, and &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1xAo6TcSgHdd25EdQ-6GqM0VKbTYu8cWyycgJhHRVIgY/edit#gid=0&quot;&gt;this document listing GPU performance&lt;/a&gt;.'><sup>15</sup></a></span> (Data is available in <a href="https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing">this spreadsheet</a>). We found GPUs by looking at Wikipedia’s existing large lists<span id='easy-footnote-16-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-16-2316' title='See Wikipedia’s &lt;a href=&quot;https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units&quot;&gt;List of Nvidia GPUs&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units&quot;&gt;List of AMD GPUs&lt;/a&gt;.'><sup>16</sup></a></span> and by Googling “popular GPUs” and “popular deep learning GPUs”. We included any hardware that was labeled as a ‘GPU’. 
We adjusted prices for inflation based on the consumer price index.<span id='easy-footnote-17-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-17-2316' title='“CPI Home.” U.S. Bureau of Labor Statistics. U.S. Bureau of Labor Statistics. Accessed May 2, 2020. https://www.bls.gov/cpi/.'><sup>17</sup></a></span>
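The inflation adjustment is a single ratio of consumer price index values. A minimal sketch, where the CPI figures are illustrative placeholders rather than the exact BLS series we used:

```python
# Convert a nominal price to 2019 dollars using the consumer price index (CPI).
# These annual-average CPI values are illustrative placeholders, not exact BLS data.
CPI = {2010: 218.1, 2019: 255.7}

def to_2019_dollars(nominal_price, year):
    return nominal_price * CPI[2019] / CPI[year]

price_2019 = to_2019_dollars(500.0, 2010)  # a $500 GPU from 2010, in 2019 dollars
```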



<p>We were unable to find price and performance data for many popular GPUs and suspect that we are missing many from our list. In our search, we did not find any GPUs that beat our 2017 minimum of $0.03 (release price) / single-precision GFLOPS. We put out a $20 bounty on a popular Facebook group to find a GPU with a cheaper price / FLOPS, and the bounty went unclaimed, so we are reasonably confident in this minimum.<span id='easy-footnote-18-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-18-2316' title='The Facebook group is for posting and claiming bounties and has around 750 people, many with interests in computers. The bounty has been up for two months, as of March 13 2020.'><sup>18</sup></a></span></p>



<h3 class="wp-block-heading">GPU price / single-precision FLOPS</h3>



<p>Figure 1 shows our collected dataset for GPU price / single-precision FLOPS over time.<span id='easy-footnote-19-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-19-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for SP’ for the chart generation.'><sup>19</sup></a></span>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh6.googleusercontent.com/eakDTr43veWzFljIgZr2zzJ8--TzbWpIbelaGFeM7clFeQHdhcodTZfBAw5aUxkkJ5lNd3h4g7m8X6AsHwEm_kU-5gZnnESi26Mnf43eMCcD0W8EnpJDrPqBhN9OXT5W7UnQR9om" alt="" width="591" height="364"/><figcaption class="wp-element-caption"><strong>Figure 1: Real GPU price / single-precision FLOPS over time. The vertical axis is log-scale. Price is measured in 2019 dollars.</strong></figcaption></figure>



<p>To find a clear trend for the prices of the cheapest GPUs / FLOPS, we looked at the running minimum prices every 10 days.<span id='easy-footnote-20-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-20-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for SP Minimums’ for the plotting. We used &lt;a href=&quot;https://drive.google.com/open?id=1JP98EP8nYA0KqofLm24vF2vwNL0PalcB&quot;&gt;this script&lt;/a&gt; on the data from the ‘Cleaned GPU Data for SP’ to calculate the minimums and then import them into a new sheet of the spreadsheet.'><sup>20</sup></a></span>
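The running minimum is a single pass over the dated price points. A sketch of the computation (not the actual script linked in the footnote):

```python
# Running minimum of price / FLOPS over time: for each date, the cheapest
# price / FLOPS seen on or before that date.
def running_minimum(points):
    points = sorted(points)  # order by date
    out, best = [], float("inf")
    for day, price in points:
        best = min(best, price)
        out.append((day, best))
    return out

# Hypothetical (day, price / FLOPS) samples, 10 days apart:
series = running_minimum([(0, 5.0e-11), (10, 6.0e-11), (20, 3.0e-11), (30, 4.0e-11)])
```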



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/05/image-1-1024x633.png" alt="" class="wp-image-2551" width="591" height="364" srcset="http://aiimpacts.org/wp-content/uploads/2020/05/image-1-1024x633.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/05/image-1-300x185.png 300w, http://aiimpacts.org/wp-content/uploads/2020/05/image-1.png 1366w" sizes="auto, (max-width: 591px) 100vw, 591px" /><figcaption class="wp-element-caption"><strong>Figure 2: Ten-day minimums in real GPU price / single-precision FLOPS over time. The vertical axis is log-scale. Price is measured in 2019 dollars. The blue line shows the trendline ignoring data before late 2007. (We believe the apparent steep decline prior to late 2007 is an artefact of a lack of data for that time period.)</strong></figcaption></figure>



<p>The cheapest GPU price / FLOPS (using release date pricing) has not decreased since 2017. However, there was a similar period of stagnation between early 2009 and 2011, so this may not represent a long-run slowing of the trend.</p>



<p>Based on the figures above, the running minimums seem to follow a roughly exponential trend. If we do not include the initial point in 2007 (which we suspect does not in fact represent the cheapest hardware at the time), we find that the cheapest GPU price / single-precision FLOPS fell by around 17% per year, for a factor of ten in ~12.5 years.<span id='easy-footnote-21-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-21-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for SP Minimums’ for the calculation.'><sup>21</sup></a></span></p>
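The conversion from an annual rate of decline to a time for a tenfold fall comes from solving (1 − r)<sup>n</sup> = 0.1 for n:

```python
import math

# Solve (1 - r)**n = 0.1 for n, where r is the fractional annual decline in price / FLOPS.
def years_per_order_of_magnitude(annual_decline):
    return math.log(0.1) / math.log(1 - annual_decline)

print(years_per_order_of_magnitude(0.17))  # ~12.4, consistent with the ~12.5 years here
print(years_per_order_of_magnitude(0.26))  # ~7.6, close to the ~8 years for half-precision
```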



<h3 class="wp-block-heading">GPU price / half-precision FLOPS</h3>



<p>Figure 3 shows GPU price / half-precision FLOPS for all the GPUs in our search above for which we could find half-precision theoretical performance.<span id='easy-footnote-22-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-22-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for HP’ for the chart generation.'><sup>22</sup></a></span>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh4.googleusercontent.com/mAmmM2htFRBW13ObHMQSBBDbP0DiVsRqDZyvrvGXYS6iueDBNNPYABeJqDM2DWqi1o3S21IcMXbLCQeeWUR1xFSYDKapZW2vzr4_pFaXvpjN98DvZAwmLCx_3dQm21NWa7xlWKrh" alt="" width="588" height="363"/><figcaption class="wp-element-caption"><strong>Figure 3: Real GPU price / half-precision FLOPS over time. The vertical axis is log-scale. Price is measured in 2019 dollars.</strong></figcaption></figure>



<p>Again, we looked at the running minimums of this graph every 10 days, shown in Figure 4 below.<span id='easy-footnote-23-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-23-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for HP Minimums’ for the plotting. We used &lt;a href=&quot;https://drive.google.com/open?id=1JP98EP8nYA0KqofLm24vF2vwNL0PalcB&quot;&gt;this script&lt;/a&gt; on the data from the ‘Cleaned GPU Data for HP’ to calculate the minimums and then import them into a new sheet of the spreadsheet.'><sup>23</sup></a></span>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh6.googleusercontent.com/ZoVY4R_fEQVgrS4n-v8EkOorER7M-Wup_nAFUujdcRTwRrj3MyK5ADkcKgPremWa0uF_gMHUZYoZ6uq735_XyZa6d_voJMhOF7wyOEq23goNuCzjoFaKc-pHYhsXgMc0inmXHUNs" alt="" width="589" height="364"/><figcaption class="wp-element-caption"><strong>Figure 4: Minimums in real GPU price / half-precision FLOPS over time. The vertical axis is log-scale. Price is measured in 2019 dollars.</strong></figcaption></figure>



<p>If we assume an exponential trend with noise,<span id='easy-footnote-24-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-24-2316' title='Where ambiguous, we assume these trends are exponential rather than linear, because our understanding is that that is much more common historically in computing hardware price trends.'><sup>24</sup></a></span> the cheapest GPU price / half-precision FLOPS fell by around 26% per year, which would yield a factor of ten after ~8 years.<span id='easy-footnote-25-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-25-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for HP Minimums’ for the calculation.'><sup>25</sup></a></span></p>



<h3 class="wp-block-heading">GPU price / half-precision FMA FLOPS</h3>



<p>Figure 5 shows GPU price / half-precision FMA FLOPS for all the GPUs in our search above for which we could find half-precision FMA theoretical performance.<span id='easy-footnote-26-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-26-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for HP + Tensor Cores’ for the chart generation.'><sup>26</sup></a></span> (Note that this includes all of our half-precision data above, since those FLOPS could be used for fused multiply-adds in particular). GPUs with TensorCores are marked in red.</p>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh6.googleusercontent.com/Reu_O__PF1WmEYmT6RD8JIkotqoO-dO9HTqmoQBVf_lgj0CtTvl4TNm6E1kPJ5AirX0T5j3g1QOv6hwQLjpWct__by_g1lj3LFpSgd1emAKg3FyyDWP-gRyW5sTs4Ostp9JGkcRA" alt="" width="588" height="363"/><figcaption class="wp-element-caption"><strong>Figure 5: Real GPU price / half-precision FMA FLOPS over time. Price is measured in 2019 dollars.</strong></figcaption></figure>



<p>Figure 6 shows the running minimums of GPU price / HP FMA FLOPS.<span id='easy-footnote-27-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-27-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for HP + Tensor Cores Minimums’ for the plotting. We used &lt;a href=&quot;https://drive.google.com/open?id=1JP98EP8nYA0KqofLm24vF2vwNL0PalcB&quot;&gt;this script&lt;/a&gt; on the data from the ‘Cleaned GPU Data for HP + Tensor Cores’ to calculate the minimums and then import them into a new sheet of the spreadsheet.'><sup>27</sup></a></span>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://lh6.googleusercontent.com/SyYEb2HGDZCw8fPF15iIUwmU3U5r9UTKpYlaRXBFCMcrTBJhCf7VTeqwC6B82G2hmANUKYTiTLibBmsYm5XGAiV655z7GCdFX0jDeilXyX2cyHlcHe9wWwXRfQrPE2_gA3iTxzf3" alt="" width="586" height="362"/><figcaption class="wp-element-caption"><strong>Figure 6: Minimums in real GPU price / half-precision FMA FLOPS over time. Price is measured in 2019 dollars.</strong></figcaption></figure>



<p>GPU price / half-precision FMA FLOPS appears to have followed an exponential trend over the last four years, falling by around 46% per year, for a factor of ten in ~4 years.<span id='easy-footnote-28-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-28-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;, tab ‘Cleaned GPU Data for HP + Tensor Cores Minimums’ for the calculation.'><sup>28</sup></a></span></p>



<h2 class="wp-block-heading">Active Prices</h2>



<p>GPU prices often go down from the time of release, and some popular GPUs are older ones that have gone down in price.<span id='easy-footnote-29-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-29-2316' title='For example, one of the GPUs recommended for deep learning &lt;a href=&quot;https://www.reddit.com/r/MachineLearning/comments/b95182/d_which_gpus_to_get_for_deep_learning_my/&quot;&gt;in this Reddit thread&lt;/a&gt; is the GTX 1060 (6GB), which has been around &lt;a href=&quot;https://www.techpowerup.com/gpu-specs/geforce-gtx-1060-6-gb.c2862&quot;&gt;since 2016&lt;/a&gt;.'><sup>29</sup></a></span> Given this, it makes sense to look at active price data for the same GPU over time.</p>



<h3 class="wp-block-heading">Data Sources</h3>



<p>We collected data on peak theoretical performance in FLOPS from <a href="https://www.techpowerup.com/">TechPowerUp</a><span id='easy-footnote-30-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-30-2316' title='We scraped data from &lt;a href=&quot;https://www.techpowerup.com/gpu-specs/radeon-rx-480.c2848&quot;&gt;individual TechPowerUp pages&lt;/a&gt; using &lt;a href=&quot;https://drive.google.com/open?id=1msy977jSLcJspULMWlWOfIIFRWo_vsds&quot;&gt;this script&lt;/a&gt;. Our full scraped TechPowerUp dataset can be found &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1pXTyUJ2AvpkhYtphGn8UnAl4gI8j2jsx1AaGrOahl7o/edit?usp=sharing&quot;&gt;here&lt;/a&gt;.'><sup>30</sup></a></span> and combined it with active GPU price data to get GPU price / FLOPS over time.<span id='easy-footnote-31-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-31-2316' title='We chose to automatically scrape theoretical peak performance numbers from TechPowerUp instead of using the ones we manually collected above because there were several GPUs in the active pricing datasets that we hadn’t collected data for manually, and it was easier to scrape the entire site than just the subset of GPUs we needed.'><sup>31</sup></a></span> Our primary source of historical pricing data was Passmark, though we also found a less trustworthy dataset on Kaggle, which we used to check our analysis. We adjusted prices for inflation based on the consumer price index.<span id='easy-footnote-32-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-32-2316' title='“CPI Home.” U.S. Bureau of Labor Statistics. U.S. Bureau of Labor Statistics. Accessed May 2, 2020. https://www.bls.gov/cpi/.'><sup>32</sup></a></span></p>
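<p>The inflation adjustment is a rescaling by the ratio of consumer price index values. A sketch, using placeholder CPI figures rather than the actual BLS numbers:</p>

```python
# Convert a nominal price to 2019 dollars. These CPI values are
# illustrative placeholders, not the BLS figures we used.
CPI = {2011: 224.9, 2015: 237.0, 2019: 255.7}

def to_2019_dollars(nominal_price, year):
    return nominal_price * CPI[2019] / CPI[year]

real_price = to_2019_dollars(199.0, 2011)  # a $199 card in 2019 dollars
```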



<h4 class="wp-block-heading">Passmark</h4>



<p>We scraped pricing data<span id='easy-footnote-33-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-33-2316' title='We used &lt;a href=&quot;https://drive.google.com/open?id=1nd7111hOb-eCMBCOe1qocXYuAJys0pLk&quot;&gt;this script&lt;/a&gt;.'><sup>33</sup></a></span> on GPUs between 2011 and early 2020 from Passmark.<span id='easy-footnote-34-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-34-2316' title='“PassMark &amp;#8211; GeForce GTX 660 &amp;#8211; Price Performance Comparison.” Accessed March 24, 2020. &lt;a href=&quot;https://www.videocardbenchmark.net/gpu.php?gpu=GeForce+GTX+660&amp;amp;id=2152&quot;&gt;https://www.videocardbenchmark.net/gpu.php?gpu=GeForce+GTX+660&amp;amp;id=2152&lt;/a&gt;.'><sup>34</sup></a></span> Where necessary, we renamed GPUs from Passmark to be consistent with TechPowerUp.<span id='easy-footnote-35-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-35-2316' title='In most cases where renaming was necessary, the same GPU had multiple clear names, e.g. the “Radeon HD 7970 / R9 280X” in PassMark was just called the “Radeon HD 7970” in TechPowerUp. In a few cases, Passmark listed some GPUs which TechPowerUp listed separately as one GPU, e.g. “Radeon R9 290X / 390X” seemed to ambiguously refer to the Radeon R9 290X or Radeon R9 390X. In these cases, we conservatively assume that the GPU refers to the less powerful / earlier GPU. In one exceptional case, we assumed that the “Radeon R9 Fury + Fury X” referred to the Radeon Fury X in PassMark. 
The ambiguously named GPUs were not in the minimum data we calculated, so probably did not have a strong effect on the final result.'><sup>35</sup></a></span> The Passmark data consists of 38,138 price points for 352 GPUs. We guess that these represent most of the popular GPUs.</p>



<p>Judging by the ‘current prices’ listed on individual Passmark GPU pages, prices appear to be sourced from Amazon, Newegg, and Ebay. Passmark’s price points do not fall at regular intervals; we don’t know whether Passmark pulls prices at irregular intervals, or pulls them regularly and only lists major changes as price points. We therefore treat each price point as the GPU’s price at that moment only, not indefinitely into the future.</p>



<p>The data contains several blips where a GPU is briefly sold unusually cheaply. Spot-checking some of these suggests that they correspond to single GPUs or small batches for sale. We are not interested in tracking these, because we are trying to predict AI progress, which presumably isn’t influenced by temporary discounts on tiny batches of GPUs.</p>



<h4 class="wp-block-heading">Kaggle</h4>



<p><a href="https://www.kaggle.com/raczeq/ethereum-effect-pc-parts">This Kaggle dataset</a> contains scraped data of GPU prices from price comparison sites PriceSpy.co.uk, PCPartPicker.com, Geizhals.eu from the years 2013 &#8211; 2018. The Kaggle dataset has 319,147 price points for 284 GPUs. Unfortunately, at least some of the data is clearly wrong, potentially because price comparison sites include pricing data from untrustworthy merchants.<span id='easy-footnote-36-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-36-2316' title='For example, the Kaggle dataset includes extremely cheap FirePro S7150s sold in 2014, even though the FirePro S7150 only came out in 2016. One of the sellers of these cheap GPUs were ‘Club 3D’, which also appeared to sell several other erroneously cheap GPUs.'><sup>36</sup></a></span> As such, we don’t use the Kaggle data directly in our analysis, but do use it as a check on our Passmark data. 
The data that we get from Passmark roughly appears to be a subset of the Kaggle data from 2013 &#8211; 2018,<span id='easy-footnote-37-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-37-2316' title='See &lt;a href=&quot;https://drive.google.com/open?id=1uvO-gNpAGh9qMzs5MnNtOg1X-R7JzdAq&quot;&gt;this plot of Passmark single-precision GPU price / FLOPS&lt;/a&gt; compared to the &lt;a href=&quot;https://drive.google.com/open?id=194Tqrcix2XdytbT-WbgcRbFpeHIHqsao&quot;&gt;combined Passmark and Kaggle single-precision GPU price / FLOPS&lt;/a&gt;, and &lt;a href=&quot;https://drive.google.com/open?id=1hk-1rHqOkWUwh3BVNbwTEu8dEfjDhsLT&quot;&gt;this plot of Passmark half-precision GPU price / FLOPS&lt;/a&gt; compared to the &lt;a href=&quot;https://drive.google.com/open?id=1mcpQigPJs9CRqebY-Uvm0dJ1si1jeTQv&quot;&gt;combined Passmark and Kaggle half-precision $ / FLOPS&lt;/a&gt;. In both cases the 2013 &amp;#8211; 2018 Passmark data appears to roughly be a subset of the Kaggle data.'><sup>37</sup></a></span> which is what we would expect if the price comparison engines picked up prices from the merchants Passmark looks at.</p>
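<p>One way to sanity-check this rough subset relationship is to ask, for each Passmark price point, whether the Kaggle data contains a point for the same GPU at a similar price on a nearby date. A sketch (not the comparison we actually ran; the data below is invented):</p>

```python
from datetime import date

passmark = [("GTX 1060", date(2017, 1, 15), 249.0)]   # hypothetical points
kaggle = [
    ("GTX 1060", date(2017, 1, 14), 249.99),
    ("GTX 1060", date(2017, 6, 1), 399.0),
]

def has_match(point, candidates, price_tol=0.05, day_tol=3):
    # True if some candidate has the same GPU, a date within day_tol
    # days, and a price within price_tol (fractional) of the point.
    gpu, day, price = point
    return any(
        g == gpu
        and abs((d - day).days) <= day_tol
        and abs(p - price) / price <= price_tol
        for g, d, p in candidates
    )

matched = sum(has_match(pt, kaggle) for pt in passmark)
```

<p>A high match fraction supports the subset reading; unmatched points flag places where the datasets disagree.</p>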



<h4 class="wp-block-heading">Limitations</h4>



<p>There are a number of reasons why this analysis may not accurately reflect true GPU price trends:</p>



<ul class="wp-block-list">
<li>We effectively have just one source of pricing data, Passmark.</li>



<li>Passmark appears to only look at Amazon, Newegg, and Ebay for pricing data.</li>



<li>We are not sure, but we suspect that Passmark only looks at the U.S. versions of Amazon, Newegg, and Ebay, and pricing may be significantly different in other parts of the world (though we guess it wouldn’t be different enough to change the general trend much).</li>



<li>As mentioned above, we are not sure if Passmark pulls price data regularly and only lists major price changes, or pulls price data irregularly. If the former is true, our data may be overrepresenting periods where the price changes dramatically.</li>



<li>None of the price data we found includes quantities of GPUs which were available at that price, which means some prices may be for only a very limited number of GPUs.</li>



<li>We don’t know how much the prices from these datasets reflect the prices that a company pays when buying GPUs in bulk, which we may be more interested in tracking.</li>
</ul>



<p>A better version of this analysis might start with more complete data from price comparison engines (along the lines of the Kaggle dataset) and then filter out clearly erroneous pricing information in some principled way.</p>



<h3 class="wp-block-heading">Data</h3>



<p>The original scraped datasets with cards renamed to match TechPowerUp can be found <a href="https://drive.google.com/drive/folders/1cCjG_sUUePxbh5fN9ViOPX6GW9D2GyPJ?usp=sharing">here</a>. GPU price / FLOPS data is graphed on a log scale in the figures below. Price points for the same GPU are marked in the same color. We adjusted prices for inflation using the consumer price index.<span id='easy-footnote-38-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-38-2316' title='“CPI Home.” U.S. Bureau of Labor Statistics. U.S. Bureau of Labor Statistics. Accessed May 2, 2020. https://www.bls.gov/cpi/.'><sup>38</sup></a></span> All points below are in 2019 dollars.</p>



<p>To filter out noisy prices that didn’t last or were only available in small quantities, we removed the cheapest 5% of data in each several-day period<span id='easy-footnote-39-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-39-2316' title='We set this period to be 10 days long when looking at single-precision data, and 30 days long when looking at half-precision data, since half-precision data was significantly more sparse.'><sup>39</sup></a></span> to get the 95th-percentile cheapest hardware. We then fit linear and exponential trendlines through the cheapest available hardware (by GPU price / FLOPS) in each period.<span id='easy-footnote-40-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-40-2316' title='This calculation can be found in &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;.'><sup>40</sup></a></span></p>
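<p>This procedure can be sketched as follows (not our actual script; the data here is synthetic, and the exponential trendline is, as in our spreadsheets, a linear fit through the log of the data):</p>

```python
import math
from collections import defaultdict

def filtered_minimums(points, window_days=10, drop_frac=0.05):
    # points: (day, $ / FLOPS). Within each window, drop the cheapest
    # 5% of points, then keep the cheapest survivor.
    buckets = defaultdict(list)
    for day, price in points:
        buckets[day // window_days].append((day, price))
    mins = []
    for bucket in buckets.values():
        bucket.sort(key=lambda p: p[1])
        survivors = bucket[int(len(bucket) * drop_frac):]
        mins.append(min(survivors, key=lambda p: p[1]))
    return sorted(mins)

def exp_trend_slope(mins):
    # Exponential fit as least squares through log10(price):
    # price ~ 10 ** (a + b * day); returns b.
    xs = [d for d, _ in mins]
    ys = [math.log10(p) for _, p in mins]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

# Synthetic decade of data falling 10x per 3650 days:
points = [(d, 1e-10 * 10 ** (-d / 3650)) for d in range(0, 3650, 7)]
b = exp_trend_slope(filtered_minimums(points))
annual_decline = 1 - 10 ** (365 * b)   # recovers ~20.6% per year
```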



<h4 class="wp-block-heading">GPU price / single-precision FLOPS</h4>



<p>Figures 7-10 show the raw data, 95th percentile data, and trendlines for single-precision GPU price / FLOPS for the Passmark dataset. <a href="https://drive.google.com/open?id=1-PEl2kSORRH78Qa4huRF-t_g_m1QOTDs">This folder</a> contains plots of all our datasets, including the Kaggle dataset and combined Passmark + Kaggle dataset.<span id='easy-footnote-41-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-41-2316' title='We used a Python plotting library to generate our plots, the script can be found &lt;a href=&quot;https://drive.google.com/open?id=1u3qI9m9W6_9efIpsDBq1Hc-qixcKy8Sb&quot;&gt;here&lt;/a&gt;. All of our resulting plots can be found &lt;a href=&quot;https://drive.google.com/open?id=1afVbKn34pw5rj4fn1vI_qOhdZh5GLfwE&quot;&gt;here&lt;/a&gt;. ‘single’ vs. ‘half’ refers to whether its $ / FLOPS data for single or half-precision FLOPS, ‘passmark’, ‘kaggle’, and ‘combined’ refer to which dataset is being plotted and ‘raw’ vs. ‘95’ refer to whether we’re plotting all the data or the 95th percentile data.'><sup>41</sup></a></span>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_raw_crop-1-1024x525.png" alt="" class="wp-image-2509" width="589" height="302" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_raw_crop-1-1024x525.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_raw_crop-1-300x154.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_raw_crop-1-768x394.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_raw_crop-1-1536x788.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_raw_crop-1.png 1657w" sizes="auto, (max-width: 589px) 100vw, 589px" /><figcaption class="wp-element-caption"><br><strong>Figure 7: GPU price / single-precision FLOPS over time, taken from our Passmark dataset.<span id='easy-footnote-42-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-42-2316' title='The dataset we used for this plot can be found &lt;a href=&quot;https://drive.google.com/open?id=19FB-_MnQAtJErb8e_5DHTRVKAPg4g1YX&quot;&gt;here&lt;/a&gt;. This is a processed version of our scraped dataset, with prices / FLOPS adjusted for inflation. The script we used to process and plot can be found &lt;a href=&quot;https://drive.google.com/open?id=1u3qI9m9W6_9efIpsDBq1Hc-qixcKy8Sb&quot;&gt;here&lt;/a&gt;.'><sup>42</sup></a></span> Price is measured in 2019 dollars. <a href="https://drive.google.com/open?id=194Tqrcix2XdytbT-WbgcRbFpeHIHqsao">This picture</a> shows that the Kaggle data does appear to be a superset of the Passmark data from 2013 &#8211; 2018, giving us some evidence that the Passmark data is correct. The vertical axis is log-scale.</strong></figcaption></figure>






<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_crop-1024x525.png" alt="" class="wp-image-2510" width="587" height="301" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_crop-1024x525.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_crop-300x154.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_crop-768x394.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_crop-1536x788.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_crop.png 1657w" sizes="auto, (max-width: 587px) 100vw, 587px" /><figcaption class="wp-element-caption"><br><strong>Figure 8: The top 95% of data every 10 days for GPU price / single-precision FLOPS over time, taken from the Passmark dataset we plotted above. (Figure 7 with the cheapest 5% removed.) The vertical axis is log-scale.<span id='easy-footnote-43-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-43-2316' title='The script to calculate the 95th percentile and generate this plot can be found &lt;a href=&quot;https://drive.google.com/open?id=1u3qI9m9W6_9efIpsDBq1Hc-qixcKy8Sb&quot;&gt;here&lt;/a&gt;.'><sup>43</sup></a></span></strong></figcaption></figure>






<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_zoom_crop-1024x525.png" alt="" class="wp-image-2511" width="582" height="298" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_zoom_crop-1024x525.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_zoom_crop-300x154.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_zoom_crop-768x394.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_zoom_crop-1536x788.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/single_passmark_95_zoom_crop.png 1657w" sizes="auto, (max-width: 582px) 100vw, 582px" /><figcaption class="wp-element-caption"><br><strong>Figure 9: The same data as Figure 8, with the vertical axis zoomed-in.</strong></figcaption></figure>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/06/image-1024x633.png" alt="" class="wp-image-2595" width="579" height="357" srcset="http://aiimpacts.org/wp-content/uploads/2020/06/image-1024x633.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/06/image-300x186.png 300w, http://aiimpacts.org/wp-content/uploads/2020/06/image-768x475.png 768w, http://aiimpacts.org/wp-content/uploads/2020/06/image-1536x950.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/06/image.png 1772w" sizes="auto, (max-width: 579px) 100vw, 579px" /><figcaption class="wp-element-caption"><strong>Figure 10: The minimum data points from the top 95% of the Passmark dataset, taken every 10 days. We fit linear and exponential trendlines through the data. The vertical axis is log-scale.<span id='easy-footnote-44-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-44-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;here&lt;/a&gt;, tab ‘Passmark SP Minimums&amp;#8217; to see our calculation of the minimums over time. We used &lt;a href=&quot;https://drive.google.com/open?id=1yRTJwVQAwCqLSTyGXgGIHcRfJFP7C5D2&quot;&gt;this script&lt;/a&gt; to generate the minimums, then imported them into this spreadsheet.'><sup>44</sup></a></span></strong></figcaption></figure>



<h5 class="wp-block-heading" id="single-precision-analysis">Analysis</h5>



<p>The cheapest 95th-percentile data every 10 days fits relatively well to both a linear and an exponential trendline. However, we assume that progress will follow an exponential, because previous progress has <a href="https://aiimpacts.org/recent-trend-in-the-cost-of-computing/">followed an exponential</a>.</p>



<p>In the Passmark dataset, the exponential trendline suggested that from 2011 to 2020, 95th-percentile GPU price / single-precision FLOPS fell by around 13% per year, for a factor of ten in ~17 years,<span id='easy-footnote-45-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-45-2316' title='You can see our calculations for this &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;here&lt;/a&gt;, sheet ‘Passmark SP Minimums’. Each sheet has a cell ‘Rate to move an order of magnitude’ which has our calculation for how many years we need to move an order of magnitude. In the (untrustworthy) Kaggle dataset alone, its rate would yield an order of magnitude of decrease every ~12 years, and the rate in the combined dataset&amp;nbsp; would yield an order of magnitude of decrease every ~16 years.'><sup>45</sup></a></span> with a bootstrap<span id='easy-footnote-46-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-46-2316' title='Orloff, Jeremy, and Jonathan Bloom. “Bootstrap Confidence Intervals.” MIT OpenCourseWare, 2014. &lt;a href=&quot;https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf&quot;&gt;https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf&lt;/a&gt;.'><sup>46</sup></a></span> 95% confidence interval of 16.3 to 18.1 years.<span id='easy-footnote-47-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-47-2316' title='We used &lt;a href=&quot;https://drive.google.com/open?id=1XkA-8WruAMKM3y3cdNMPNJHabpMUE_MT&quot;&gt;this script&lt;/a&gt; to generate bootstrap confidence intervals for our datasets.'><sup>47</sup></a></span> We believe the rise in price / FLOPS in 2017 corresponds to a rise in GPU prices due to increased demand from cryptocurrency miners.<span id='easy-footnote-48-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-48-2316' title='We think this is the case because we’ve observed this dip in other GPU analyses we’ve done, and because the timing lines up: &lt;a href=&quot;https://www.techspot.com/news/72854-nvidia-asking-graphics-card-retailers-prioritize-gamers-over.html&quot;&gt;the first table in this article&lt;/a&gt; shows how GPU prices were increasing starting 2017 and continued to increase through 2018, and &lt;a href=&quot;https://www.kaggle.com/raczeq/impact-of-cryptocurrencies-rates-on-pc-market/data&quot;&gt;the chart here&lt;/a&gt; shows how GPU prices increased in 2017.'><sup>48</sup></a></span> If we instead look at the trend from 2011 through 2016, before the cryptocurrency-driven rise, we get that 95th-percentile GPU price / single-precision FLOPS fell by around 13% per year, for a factor of ten in ~16 years.<span id='easy-footnote-49-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-49-2316' title='You can see our calculations for this &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;here&lt;/a&gt;, sheet ‘Passmark SP Minimums’, next to ‘Exponential trendline from 2015 to 2016. The trendline calculated is technically the linear fit through the log of the data.'><sup>49</sup></a></span></p>
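<p>The bootstrap confidence intervals quoted here can be produced along the following lines (a sketch, not the script we used; the minimums below are synthetic): resample the fitted points with replacement, refit the trend each time, and take percentiles of the implied years per order of magnitude.</p>

```python
import random

def slope(points):
    # Least-squares slope through (year, log10 price) points.
    xs, ys = zip(*points)
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

def bootstrap_ci(points, n_boot=1000, seed=0):
    # Resample with replacement, refit, and take the 2.5th / 97.5th
    # percentiles of the implied years per factor-of-ten decline.
    rng = random.Random(seed)
    times = []
    for _ in range(n_boot):
        sample = [rng.choice(points) for _ in points]
        if len({x for x, _ in sample}) < 2:
            continue  # degenerate resample: slope undefined
        times.append(-1.0 / slope(sample))
    times.sort()
    return times[int(0.025 * len(times))], times[int(0.975 * len(times))]

# Synthetic minimums: log10 price falling 0.06 per year
# (about 13% per year, ~16.7 years per order of magnitude).
points = [(year, 2.0 - 0.06 * year) for year in range(10)]
lo, hi = bootstrap_ci(points, n_boot=200)
```

<p>Applied to the real minimums data, a procedure of this shape produces intervals like the 16.3 to 18.1 years quoted above.</p>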



<p>This is slower than the order of magnitude every ~12.5 years we found when looking at release prices. If we restrict the release price data to 2011 &#8211; 2019, we get an order of magnitude decrease every ~13.5 years instead,<span id='easy-footnote-50-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-50-2316' title='See our calculation &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;here&lt;/a&gt;, tab ‘Cleaned GPU Data for SP Minimums’, next to the cell marked “Exponential trendline from 2011 to 2019.”'><sup>50</sup></a></span> so part of the discrepancy can be explained by the different start times of the datasets. To get some assurance that our active price data wasn&#8217;t erroneous, we spot-checked the best active price at the start of 2011, which was somewhat lower than the best release price at the same time, and confirmed that it was consistent with surrounding pricing data.<span id='easy-footnote-51-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-51-2316' title='At the start of 2011, the &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;minimum release &lt;/a&gt;&lt;a href=&quot;https://docs.google.com/spreadsheets/d/1ZZm5Wgr3BDRtloTZGylWzYTaVr5VqjiwOiRNu5Pz_q8/edit?usp=sharing&quot;&gt;price / FLOPS&lt;/a&gt; (see tab, ‘Cleaned GPU Data for SP Minimums’)  is .000135 $ / FLOPS, whereas the &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;minimum active price / FLOPS&lt;/a&gt; (see tab, ‘Passmark SP Minimums’) is around .0001 $ / FLOPS. &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;The initial GPU price / FLOPS minimum&lt;/a&gt; (see sheet ‘Passmark SP Minimums’) corresponds to the Radeon HD 5850, which had a price of $184.9 in 3/2011 and a release price of $259. &lt;a href=&quot;https://www.videocardbenchmark.net/gpu.php?gpu=Radeon+HD+5850&amp;amp;id=47&quot;&gt;Looking at the general trend in Passmark&lt;/a&gt; suggests that the Radeon HD 5850 did indeed rapidly decline from its $259 release price to prices consistently below $200.'><sup>51</sup></a></span> We think active prices are likely to be closer to the prices at which people actually bought GPUs, so we guess that ~17 years per order of magnitude decrease is a more accurate estimate of the trend we care about.</p>



<h4 class="wp-block-heading">GPU price / half-precision FLOPS</h4>



<p>Figures 11-14 show the raw data, 95th percentile data, and trendlines for half-precision GPU price / FLOPS for the Passmark dataset. <a href="https://drive.google.com/open?id=1-PEl2kSORRH78Qa4huRF-t_g_m1QOTDs">This folder</a> contains plots of the Kaggle dataset and combined Passmark + Kaggle dataset.</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_raw_crop-1024x525.png" alt="" class="wp-image-2512" width="580" height="297" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_raw_crop-1024x525.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_raw_crop-300x154.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_raw_crop-768x394.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_raw_crop-1536x788.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_raw_crop.png 1657w" sizes="auto, (max-width: 580px) 100vw, 580px" /><figcaption class="wp-element-caption"><br><strong> Figure 11: GPU price / half-precision FLOPS over time, taken from our Passmark dataset. Price is measured in 2019 dollars.<span id='easy-footnote-52-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-52-2316' title=' The dataset we used for this plot can be found here. This is a processed version of our scraped dataset, with prices / FLOPS adjusted for inflation. The script we used to process and plot can be found here.'><sup>52</sup></a></span>  This picture shows that the Kaggle data does appear to be a superset of the Passmark data from 2013 &#8211; 2018, giving us some evidence that the Passmark data is reasonable. The vertical axis is log-scale.</strong></figcaption></figure>






<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_crop-1024x525.png" alt="" class="wp-image-2513" width="581" height="298" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_crop-1024x525.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_crop-300x154.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_crop-768x394.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_crop-1536x788.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_crop.png 1657w" sizes="auto, (max-width: 581px) 100vw, 581px" /><figcaption class="wp-element-caption"><br><strong>Figure 12: The top 95% of data every 30 days for GPU price / half-precision FLOPS over time, taken from the Passmark dataset we plotted above. (Figure 11 with the cheapest 5% removed.) The vertical axis is log-scale.<span id='easy-footnote-53-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-53-2316' title='The script to calculate the 95th percentile and generate this plot can be found here.'><sup>53</sup></a></span></strong></figcaption></figure>






<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_zoom_crop-1024x525.png" alt="" class="wp-image-2514" width="583" height="299" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_zoom_crop-1024x525.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_zoom_crop-300x154.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_zoom_crop-768x394.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_zoom_crop-1536x788.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_95_zoom_crop.png 1657w" sizes="auto, (max-width: 583px) 100vw, 583px" /><figcaption class="wp-element-caption"><br><strong>Figure 13: The same data as Figure 12, with the vertical axis zoomed-in.</strong></figcaption></figure>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/06/image-1-1024x633.png" alt="" class="wp-image-2596" width="576" height="356" srcset="http://aiimpacts.org/wp-content/uploads/2020/06/image-1-1024x633.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/06/image-1-300x186.png 300w, http://aiimpacts.org/wp-content/uploads/2020/06/image-1-768x475.png 768w, http://aiimpacts.org/wp-content/uploads/2020/06/image-1-1536x950.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/06/image-1.png 1772w" sizes="auto, (max-width: 576px) 100vw, 576px" /><figcaption class="wp-element-caption"><strong>Figure 14: The minimum data points from the top 95% of the Passmark dataset, taken every 30 days. We fit linear and exponential trendlines through the data. The vertical axis is log-scale.<span id='easy-footnote-54-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-54-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;here&lt;/a&gt;, tab ‘Passmark HP Minimums&amp;#8217; to see our calculation of the minimums over time. We used &lt;a href=&quot;https://drive.google.com/open?id=1u3qI9m9W6_9efIpsDBq1Hc-qixcKy8Sb&quot;&gt;this script&lt;/a&gt; to generate the minimums, then imported them into this spreadsheet.'><sup>54</sup></a></span></strong></figcaption></figure>



<h5 class="wp-block-heading">Analysis</h5>



<p>If we assume the trend is exponential, the Passmark trend seems to suggest that from 2015 to 2020, 95th-percentile GPU price / half-precision FLOPS of GPUs has fallen by around 21% per year, for a factor of ten over ~10 years,<span id='easy-footnote-55-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-55-2316' title='See the sheet marked ‘Passmark HP minimums’ in &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;. The trendline calculated is technically the linear fit through the log of the data.'><sup>55</sup></a></span> with a bootstrap<span id='easy-footnote-56-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-56-2316' title='Orloff, Jeremy, and Jonathan Bloom. “Bootstrap Confidence Intervals.” MIT OpenCourseWare, 2014. 
&lt;a href=&quot;https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf&quot;&gt;https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf&lt;/a&gt;.'><sup>56</sup></a></span> 95% confidence interval of 8.8 to 11 years.<span id='easy-footnote-57-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-57-2316' title='We used &lt;a href=&quot;https://drive.google.com/open?id=1XkA-8WruAMKM3y3cdNMPNJHabpMUE_MT&quot;&gt;this script&lt;/a&gt; to generate bootstrap confidence intervals for our datasets.'><sup>57</sup></a></span> This is fairly close to the ~8 years / order of magnitude decrease we found when looking at release price data, but we treat active prices as a more accurate estimate of the actual prices at which people bought GPUs. As in our previous dataset, there is a noticeable rise in 2017, which we think is due to GPU prices increasing as a result of cryptocurrency miners. If we look at the trend from 2015 through 2016, before this rise, we get that 95th-percentile GPU price / half-precision FLOPS has fallen by around 25% per year, which would yield a factor of ten over ~8 years.<span id='easy-footnote-58-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-58-2316' title='See the sheet marked ‘Passmark HP minimums’ in &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;.'><sup>58</sup></a></span></p>
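<p>The trendline and confidence-interval method described in the footnotes (a linear fit through the log of the data, plus a bootstrap for the interval) can be sketched as follows. This is an illustrative reconstruction with synthetic data, not our actual script, which is linked above:</p>

```python
# Illustrative sketch (not the article's actual script) of fitting an
# exponential trend as a linear fit through log10(price / FLOPS), then
# bootstrapping a 95% confidence interval for "years per factor-of-ten drop".
import math
import random

def years_per_factor_of_ten(dates, prices):
    """Least-squares fit of log10(price) vs. date; returns years for a 10x fall."""
    n = len(dates)
    logs = [math.log10(p) for p in prices]
    mx = sum(dates) / n
    my = sum(logs) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(dates, logs))
             / sum((x - mx) ** 2 for x in dates))
    return -1.0 / slope  # slope is change in log10(price) per year (negative)

def bootstrap_ci_95(dates, prices, n_resamples=1000, seed=0):
    """Percentile bootstrap over data points for the statistic above."""
    rng = random.Random(seed)
    pairs = list(zip(dates, prices))
    stats = []
    while len(stats) < n_resamples:
        sample = [rng.choice(pairs) for _ in pairs]
        xs, ys = zip(*sample)
        if len(set(xs)) > 1:  # need two distinct dates to fit a slope
            stats.append(years_per_factor_of_ten(xs, ys))
    stats.sort()
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples)]

# Synthetic data: price falling 21% per year from 2015 to 2020, no noise.
dates = [2015 + 0.25 * i for i in range(21)]
prices = [100 * 0.79 ** (d - 2015) for d in dates]
print(round(years_per_factor_of_ten(dates, prices), 1))  # ~9.8 years per 10x
```

On real, noisy data the bootstrap interval widens, as in the 8.8 to 11 year interval reported above; on this noiseless synthetic series it collapses to a point.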



<h4 class="wp-block-heading">GPU price / half-precision FMA FLOPS</h4>



<p>Figures 15-18 show the raw data, 95th percentile data, and trendlines for half-precision GPU price / FMA FLOPS for the Passmark dataset. GPUs with Tensor Cores are marked in black. <a href="https://drive.google.com/open?id=1-PEl2kSORRH78Qa4huRF-t_g_m1QOTDs">This folder</a> contains plots of the Kaggle dataset and combined Passmark + Kaggle dataset.</p>
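<p>For reference, &#8220;FMA FLOPS&#8221; counts fused multiply-add throughput, with each FMA conventionally counted as two floating-point operations (one multiply plus one add). A common back-of-the-envelope way to derive such peak figures is sketched below; the core count, clock, and price are hypothetical, not drawn from our dataset:</p>

```python
# Common back-of-the-envelope peak-FLOPS estimate: cores * clock * 2, where
# the factor of 2 counts a fused multiply-add (FMA) as two FLOPs.
# All numbers here are hypothetical, for illustration only.
def peak_fma_flops(cores, clock_hz, flops_per_cycle=2):
    return cores * clock_hz * flops_per_cycle

# Hypothetical GPU: 3584 cores at 1.5 GHz, priced at $700.
flops = peak_fma_flops(3584, 1.5e9)
print(flops / 1e12)   # peak TFLOPS
print(700 / flops)    # dollars per FLOPS, the quantity plotted in this section
```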



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_raw_crop-1024x525.png" alt="" class="wp-image-2515" width="576" height="295" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_raw_crop-1024x525.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_raw_crop-300x154.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_raw_crop-768x394.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_raw_crop-1536x788.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_raw_crop.png 1657w" sizes="auto, (max-width: 576px) 100vw, 576px" /><figcaption class="wp-element-caption"><br><strong>Figure 15: GPU price / half-precision FMA FLOPS over time, taken from our Passmark dataset.<span id='easy-footnote-59-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-59-2316' title='The dataset we used for this plot can be found here. This is a processed version of our scraped dataset, with prices / FLOPS adjusted for inflation. The script we used to process and plot can be found here.'><sup>59</sup></a></span> Price is measured in 2019 dollars. This picture shows that the Kaggle data appears to be a superset of the Passmark data from 2013 &#8211; 2018, giving us some evidence that the Passmark data is correct. The vertical axis is log-scale.</strong></figcaption></figure>






<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_crop-1024x520.png" alt="" class="wp-image-2516" width="583" height="296" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_crop-1024x520.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_crop-300x152.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_crop-768x390.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_crop-1536x780.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_crop.png 1674w" sizes="auto, (max-width: 583px) 100vw, 583px" /><figcaption class="wp-element-caption"><br><strong>Figure 16: The top 95% of data every 30 days for GPU price / half-precision FMA FLOPS over time, taken from the Passmark dataset we plotted above.<span id='easy-footnote-60-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-60-2316' title='The script to calculate the 95th percentile and generate this plot can be found here.'><sup>60</sup></a></span> (Figure 15 with the cheapest 5% removed.)</strong></figcaption></figure>






<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_zoom_crop-1024x520.png" alt="" class="wp-image-2517" width="580" height="294" srcset="http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_zoom_crop-1024x520.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_zoom_crop-300x152.png 300w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_zoom_crop-768x390.png 768w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_zoom_crop-1536x780.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/04/half_passmark_tensor_95_zoom_crop.png 1674w" sizes="auto, (max-width: 580px) 100vw, 580px" /><figcaption class="wp-element-caption"><br><strong>Figure 17: The same data as Figure 16, with the vertical axis zoomed-in.</strong></figcaption></figure>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://aiimpacts.org/wp-content/uploads/2020/06/image-2-1024x633.png" alt="" class="wp-image-2597" width="580" height="358" srcset="http://aiimpacts.org/wp-content/uploads/2020/06/image-2-1024x633.png 1024w, http://aiimpacts.org/wp-content/uploads/2020/06/image-2-300x186.png 300w, http://aiimpacts.org/wp-content/uploads/2020/06/image-2-768x475.png 768w, http://aiimpacts.org/wp-content/uploads/2020/06/image-2-1536x950.png 1536w, http://aiimpacts.org/wp-content/uploads/2020/06/image-2.png 1772w" sizes="auto, (max-width: 580px) 100vw, 580px" /><figcaption class="wp-element-caption"><strong>Figure 18: The minimum data points from the top 95% of the Passmark dataset, taken every 30 days. We fit linear and exponential trendlines through the data.<span id='easy-footnote-61-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-61-2316' title='See &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;here&lt;/a&gt;, tab ‘Passmark HP FMA Minimums&amp;#8217; to see our calculation of the minimums over time. We used &lt;a href=&quot;https://drive.google.com/open?id=1yRTJwVQAwCqLSTyGXgGIHcRfJFP7C5D2&quot;&gt;this script&lt;/a&gt; to generate the minimums, then imported them into this spreadsheet.'><sup>61</sup></a></span></strong></figcaption></figure>



<h5 class="wp-block-heading">Analysis</h5>



<p>If we assume the trend is exponential, the Passmark trend seems to suggest the 95th-percentile GPU price / half-precision FMA FLOPS of GPUs has fallen by around 40% per year, which would yield a factor of ten in ~4.5 years,<span id='easy-footnote-62-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-62-2316' title='See the sheet marked ‘Passmark HP FMA minimums’ in &lt;a href=&quot;https://docs.google.com/spreadsheets/d/15pTVDml1j81HROZ3_UeHZ51aBoqq-94-eM8N80npUX0/edit?usp=sharing&quot;&gt;this spreadsheet&lt;/a&gt;. The trendline calculated is technically the linear fit through the log of the data.'><sup>62</sup></a></span> with a bootstrap<span id='easy-footnote-63-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-63-2316' title='Orloff, Jeremy, and Jonathan Bloom. “Bootstrap Confidence Intervals.” MIT OpenCourseWare, 2014. 
&lt;a href=&quot;https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf&quot;&gt;https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf&lt;/a&gt;.'><sup>63</sup></a></span> 95% confidence interval of 4 to 5.2 years.<span id='easy-footnote-64-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-64-2316' title='We used &lt;a href=&quot;https://drive.google.com/open?id=1XkA-8WruAMKM3y3cdNMPNJHabpMUE_MT&quot;&gt;this script&lt;/a&gt; to generate bootstrap confidence intervals for our datasets.'><sup>64</sup></a></span> This is fairly close to the ~4 years / order of magnitude decrease we found when looking at release price data, but we think active prices are a more accurate estimate of the actual prices at which people bought GPUs.</p>
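<p>The arithmetic relating an annual rate of decline to years per factor-of-ten fall, used throughout these estimates, can be restated as a one-line formula (an illustrative sketch, not taken from our analysis scripts):</p>

```python
import math

# Years for price to fall 10x, given a constant fractional decline per year.
# E.g. a 40% annual decline means price is multiplied by 0.6 each year.
def years_per_tenfold(annual_decline):
    return math.log(10) / -math.log(1 - annual_decline)

print(round(years_per_tenfold(0.40), 1))  # ~4.5 years (half-precision FMA)
print(round(years_per_tenfold(0.21), 1))  # ~9.8 years (half-precision)
```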



<p>The figures above suggest that certain GPUs with Tensor Cores represented a significant improvement (~half an order of magnitude) in GPU price / half-precision FMA FLOPS over existing GPUs.</p>



<h1 class="wp-block-heading">Conclusion</h1>



<p>We summarize our results in the table below. Each entry gives the approximate number of years for GPU price / FLOPS to fall by a factor of ten over the stated date range.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td></td><td><strong>Release Prices</strong></td><td><strong>95th-percentile Active Prices</strong></td><td><strong>95th-percentile Active Prices</strong> <strong>(pre-crypto price rise)</strong></td></tr><tr><td></td><td><em>11/2007 &#8211; 1/2020</em></td><td><em>3/2011 &#8211; 1/2020</em></td><td><em>3/2011 &#8211; 12/2016 </em></td></tr><tr><td><strong>$ / single-precision FLOPS</strong></td><td>12.5</td><td>17</td><td>16</td></tr><tr><td></td><td><em>9/2014 &#8211; 1/2020</em></td><td><em>1/2015 &#8211; 1/2020</em></td><td><em>1/2015 &#8211; 12/2016 </em></td></tr><tr><td><strong>$ / half-precision FLOPS</strong></td><td>8</td><td>10</td><td>8</td></tr><tr><td><strong>$ / half-precision FMA FLOPS</strong></td><td>4</td><td>4.5</td><td>&#8212;</td></tr></tbody></table></figure>



<p>Release price data generally seems to support the trends we found in active prices, with the notable exception of GPU price / single-precision FLOPS, where the difference cannot be explained solely by the different start dates.<span id='easy-footnote-65-2316' class='easy-footnote-margin-adjust'></span><span class='easy-footnote'><a href='http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#easy-footnote-bottom-65-2316' title='See our analysis in &lt;a href=&quot;#single-precision-analysis&quot;&gt;this section&lt;/a&gt; above.'><sup>65</sup></a></span> We think the best estimate of the overall trend for prices at which people recently bought GPUs is the 95th-percentile active price data from 2011 &#8211; 2020, since release price data does not account for existing GPUs becoming cheaper over time. The pre-crypto trends are similar to the overall trends, suggesting that what we are seeing is not an anomaly caused by the cryptocurrency price rise.</p>



<p>Given that, we guess that GPU prices as a whole have fallen at rates that would yield an order of magnitude over roughly:</p>



<ul class="wp-block-list">
<li>17 years for single-precision FLOPS</li>



<li>10 years for half-precision FLOPS</li>



<li>5 years for half-precision fused multiply-add FLOPS</li>
</ul>



<p>Half-precision FLOPS seem to have become cheaper substantially faster than single-precision in recent years. This may be a “catching up” effect as more of the space on GPUs was allocated to half-precision computing, rather than reflecting more fundamental technological progress.</p>



<p><em>Primary author: Asya Bergal</em></p>



<h1 class="wp-block-heading">Notes</h1>
]]></content:encoded>
					
					<wfw:commentRss>http://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
	</channel>
</rss>
