Jonathan Stray

To apply AI for good, think form extraction

Jonathan Stray — Sat, 24 Oct 2020 22:51:35 +0000

Folks who want to use AI/ML for good generally think of things like building predictive models, but smart methods for extracting data from forms would do more for journalism, climate science, medicine, democracy etc. than almost any other application. Since March, I’ve been working with a small team on applying deep learning to a gnarly campaign finance dataset that journalists have been struggling with for years. Today we are announcing Deepform, a baseline ML model, training data set, and public benchmark where anyone can submit their solution. I think this type of technology is important, not just for campaign finance reporters but for everyone.

This post has four parts:

Why form extraction is an important problem
Why it’s hard
The state of the art of deep learning for form extraction
The Deepform model and dataset, our contribution to this problem

Form extraction is incredibly useful

Form extraction could help climate science because a lot of old weather data is locked in forms. These forms come in a crazy variety of different formats from all over the world and across centuries.

Form extraction is useful for human rights work as well. In 2005, the National Police Historical Archive was discovered in Guatemala, opening up a window to government crimes committed during the 35-year-long civil war. Over 100,000 people were abducted, tortured, or murdered, and in many cases these fragile paper documents are the only record of what happened to someone. They are largely hand-written and come in a wide variety of formats dating back to 1870, some 80 million pages in all.

A single page from the Guatemalan National Police Archive, one of 80 million (source).

Form extraction comes up in medicine too, even for present-day documents. UCSF Hospital processes 500 faxed referrals each day, requiring 57 full time staff to type them in. These forms, such as the image below, can look like just about anything. The process has recently been partially automated.

An example referral form. If only they were standardized! (source)

There are plenty of gnarly forms in journalism too. Getting data out of forms can take months on a major investigative journalism project, like the recent FINCEN Files:

In the end, ICIJ and its partners launched a giant data-extraction effort: for more than a year, 85 journalists in 30 countries reviewed and extracted transaction information from assigned suspicious activity reports and manually entered it into Excel files … The effort resulted in 55,000 records of structured data and included details on more than 200,000 transactions flagged by the banks

I believe that this sort of data cleanup work is a far better application of AI to journalism than attempts to automate “finding the story in the data” or that sort of thing. The form extraction problem is well-specified, and data prep takes a lot of time, as I’ve argued at length in Making AI Work for Investigative Journalism.

Why is this hard?

Given that extracting data from paper forms is difficult, expensive, and widely useful, you’d think that this would be solved by now. And it sort of is, but not for the kinds of cases above.

Form extraction is so common in journalism that “how do I get this data out of a PDF?” is a FAQ for data journalists. There are a few open-source solutions for table extraction, such as Tabula. There are a number of proprietary form and table extraction products at various levels of complexity, ranging from the straightforward CometDocs to the sophisticated and expensive Monarch. There are also APIs that will turn documents into structured data, such as Amazon’s Textract and the Google Cloud Document API, which has a nice tutorial.

But these products all either require manual interaction to extract data (so they can’t be used on bulk documents) or they use a template to extract data from a single type of form (which necessitates setup work for each type.) They don’t do well when the forms are heterogeneous, even if they all contain the same type of data. Unfortunately, that’s the situation with all the cases above. The problem isn’t extracting data from a form, but extracting data from any form. This is why AI can help: the ability of machine learning models to generalize may make it possible to extract data from previously unseen forms.

This general form extraction problem is so hard that there are entire businesses based around it, like Expensify. And we are beginning to see more flexible AI-driven form extraction products that can learn form layouts, such as Rossum.ai. But this sort of technology is proprietary and often specialized to specific types of common business documents. If you need something else, or you need it open source, then it’s time to get to the research.

Generalized Form Extraction: The State of the Art

Here’s what I know about solving the form extraction problem. This section is for developers and researchers who want to try to improve on our benchmark results, or advance the open state of the art.

First, the basics. In many cases, including anything that started on paper, you’ll need OCR to convert images into text. Tesseract 4.0 is near-SOTA OCR in 100+ languages, so you probably want to start there. After you have PDFs with text, you’ll need to convert them into some representation suitable for machine learning. Deepform uses a tokens-plus-geometry format, where the document is transformed into a list of words (tokens) and their positions on the page. We use pdfplumber to do this, an amazing tool which will also do many other useful things (like simple table extraction).
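Here is a minimal sketch of the tokens-plus-geometry idea. It relies on pdfplumber's `extract_words()`, which returns one dict per word with `text`, `x0`, `x1`, `top`, and `bottom` keys. The function names and the choice to scale coordinates by page size are assumptions made for illustration, not a description of the Deepform code.

```python
from typing import Dict, Iterable, List


def normalize_words(words: Iterable[dict], page_num: int,
                    width: float, height: float) -> List[Dict]:
    """Turn pdfplumber-style word dicts into tokens with page-relative geometry."""
    tokens = []
    for w in words:
        tokens.append({
            "token": w["text"],
            "page": page_num,
            # Assumption: scale to [0, 1] so pages of different sizes are comparable.
            "x0": w["x0"] / width,
            "x1": w["x1"] / width,
            "top": w["top"] / height,
            "bottom": w["bottom"] / height,
        })
    return tokens


def pdf_to_tokens(path: str) -> List[Dict]:
    """Read every page of a text PDF into one flat token list."""
    import pdfplumber  # third-party; imported here so normalize_words has no dependencies

    tokens = []
    with pdfplumber.open(path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            tokens.extend(
                normalize_words(page.extract_words(), page_num, page.width, page.height)
            )
    return tokens
```

Calling `pdf_to_tokens("invoice.pdf")` on a PDF that already has a text layer (the filename is hypothetical) would give a list of dicts like `{"token": "Total", "page": 0, "x0": 0.1, ...}`, in the order pdfplumber returns them.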

From there, the problem is known as “multimodal form extraction.” A successful model will have to consider not just the text itself but each word’s position on the page, and even information like font size or bolded headings.

I know of only one open-source multimodal form extraction system that has been used in production. Fonduer, from a team at Stanford, starts from rich document structure including geometry, hierarchy, and fonts, then extracts data using a multi-stage process. First, candidate answers are selected through user-written matching functions. The user can also supply small “labelling” functions which supply extra information that can help choose the right answers. All these functions are small and simple, typically only a line or two of code.

Hand-written labelling functions for Fonduer (source)
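To give a feel for how small these functions can be, here is a plain-Python sketch of one matching function and one labelling function. This is not the Fonduer API: the function names, the token dicts (with `token` and `top` keys), and the +1 / 0 voting convention are all assumptions made for illustration.

```python
import re

# Assumed pattern for amounts such as "$1,234.56" or "1234.56".
DOLLAR = re.compile(r"^\$?\d[\d,]*(\.\d{2})?$")


def match_amount(token: str) -> bool:
    """Matching function: could this token be a dollar amount at all?"""
    return bool(DOLLAR.match(token))


def lf_total_on_same_line(candidate: dict, tokens: list) -> int:
    """Labelling function: vote +1 if a word starting with 'total' shares the
    candidate's line, otherwise abstain with 0."""
    for t in tokens:
        same_line = abs(t["top"] - candidate["top"]) < 0.005
        if same_line and t["token"].lower().startswith("total"):
            return 1
    return 0
```

Each function on its own is crude. The point of this style of system is that a trained model, not the author of the functions, decides how much weight each signal deserves.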

Then, a trained bidirectional-LSTM network uses all this information to choose the correct answers from the candidates.

The Fonduer pipeline (source)

Just in the last year, several more noteworthy techniques have appeared. This work from several Google researchers starts with tokens and geometry (exactly what pdfplumber outputs). The process again begins with candidates produced by hand-written matching functions (notice a theme?) followed by a unique classifier. The core idea here is to encode the neighborhood of each token, meaning all tokens within “a zone around the candidate extending all the way to the left of the page and about 10% of the page height above it.” The model also learns a separate embedding of a scalar-valued “field ID” which indicates whether the field is an amount_due, an invoice_id, a due_date, etc. The final layer of the network scores each candidate by computing the cosine similarity of the neighborhood embedding and the field type embedding.

General field extraction by learning to represent candidate neighborhoods (source)
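The neighborhood zone quoted above can be sketched as a simple geometric filter. This is one reading of that description, not the paper's code. It assumes tokens are dicts with `x0` and `top` coordinates scaled to the range 0 to 1, with `top` measured from the top of the page; the function name and the 0.10 default are illustrative.

```python
def neighborhood(candidate: dict, tokens: list, height_frac: float = 0.10) -> list:
    """Tokens to the left of the candidate, from its own line up to
    height_frac of the page height above it."""
    upper = candidate["top"] - height_frac  # smaller 'top' means higher on the page
    return [
        t for t in tokens
        if t is not candidate
        and t["x0"] < candidate["x0"]              # anywhere to the left, out to the page edge
        and upper <= t["top"] <= candidate["top"]  # on the same line or just above it
    ]
```

For a candidate like `$500` at the right edge of a line, this picks up the label to its left (say `Total`) and anything in the strip just above, while ignoring the page header and tokens further right.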

Using cosine similarity between the neighborhood and field type embeddings forces candidates with similar context to be “near” to each other in embedding space (and this is actually visualized in the paper.) You can think of each learned field type (field ID) embedding as the centroid of the corresponding cluster of similar neighborhoods. Each neighborhood would include, for example, a human-readable field name, so we should expect at least that much similarity in the context of the correct answer for each field type.
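The scoring step itself is short. Below is a sketch that assumes the embeddings have already been computed by some trained network and are plain lists of floats. The function names are hypothetical, and the real model learns these vectors rather than receiving them as input.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_candidate(field_embedding, neighborhood_embeddings):
    """Return (index, score) of the candidate whose neighborhood best matches the field."""
    scores = [cosine(field_embedding, n) for n in neighborhood_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Because cosine similarity ignores vector length, only the direction of a neighborhood embedding matters, which is what makes the "centroid of a cluster" picture a reasonable way to think about each field embedding.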

There have also been graph convolutional network approaches to form extraction, as in this helpful tutorial on scanning receipts. The idea here is to encode the geometrical relationship between tokens in the document as edges in a graph, e.g. the nodes representing two tokens get an edge between them if they are horizontally or vertically aligned.

A graph representation of a receipt for information extraction (source)
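As a rough illustration of the graph construction, the sketch below links every pair of tokens that share a row or a column. It assumes token dicts with `x0` and `top` coordinates scaled to the range 0 to 1; the tolerance value and the all-pairs approach are assumptions, and a real system might connect only nearest neighbors.

```python
from itertools import combinations


def alignment_edges(tokens, tol=0.01):
    """Connect token pairs that share a row (similar 'top') or a column (similar 'x0')."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(tokens), 2):
        if abs(a["top"] - b["top"]) <= tol:
            edges.append((i, j, "horizontal"))
        elif abs(a["x0"] - b["x0"]) <= tol:
            edges.append((i, j, "vertical"))
    return edges
```

On a two-column receipt, this joins each item to its price (same row) and each price to the prices above and below it (same column), which is the structure a graph convolution can then propagate information along.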

All of these methods so far are strictly supervised, and they’ve all been trained on sets of 10,000-ish documents with all fields labelled. It should be possible to start with a (much larger!) unlabelled corpus of forms to learn general facts about the structure of documents, much as modern transformer-based language models train on unlabelled text. The (public?) state-of-the-art in multimodal form extraction is LayoutLM, which uses text, geometry, and image embeddings in a BERT model.

The LayoutLM architecture pre-trains on text, geometry, and image information. Successive document tokens are horizontal, network layers are vertical (source)

LayoutLM achieves state-of-the-art performance because it can “pre-train” on large unlabelled collections of documents, some of which include millions of forms. It also incorporates image information directly into the multimodal inference, though it’s not clear to me how much this helps relative to just the text, position, and font size/weight.

Deepform: Extracting TV political advertising spending

Every election, every TV and cable station is required to disclose the political advertising they sell, but there is no requirement on how they disclose it. These PDF documents are known as the FCC Public Files, and even though every station is disclosing more or less the same information, different stations use different formats. Hundreds of different formats, maybe thousands.

One of 40,000 or so political advertising invoices from 2012. Doesn’t look so bad, but there are hundreds of different form types with the same data.

Searching through these forms for newsworthy items, or analyzing them for broader trends, has been a headache for journalists for many years. Past reporting projects have used massive amounts of crowdsourced volunteer labor (ProPublica, 2012) or hand-coded form layouts (Alex Byrnes, 2014) to produce usable datasets. The other option is to buy cleaned data from a political consulting firm for $100k or more, so solving this problem would save journalists tremendous amounts of time or money.

In 2019 I prototyped a deep learning system which could extract one field from the FCC documents with over 90% accuracy, suggesting that it was possible to create a model that could generalize between form types. Today we are announcing Deepform, the result of seven months of work extending this prototype, including:

A training data set built from 20,000 labelled FCC Public Files documents from the 2012, 2014, and 2020 elections.
A baseline model to extract five fields from each document: order number, advertiser name, flight dates (from and to), and gross amount.
A public benchmark, hosted by Weights & Biases, where you can submit your improved model.
Extracted data for the 2020 election, hosted on Overview.

Compared to the research above, our work uses a relatively simple model. We use pdfplumber to extract tokens and their bounding boxes, then a fully-connected network to score each token within a linear window. These windows, perhaps 20-50 tokens wide, are more or less equivalent to the neighborhoods used in other approaches. Like several of the systems above, we also found that including a few hand-built features based on string matching improved performance.

The Deepform model. This image is from the 2019 prototype, which only extracted the invoice total. The current version has five outputs per token, corresponding to the five fields we extract.
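The windowing step can be sketched in a few lines. This is an illustration under assumptions, not the Deepform implementation: the function name, the stride of one, and the handling of short documents are all choices made here, and how scores from overlapping windows get combined is left open.

```python
def sliding_windows(tokens, size=20, stride=1):
    """Return every run of `size` consecutive tokens, in reading order."""
    if len(tokens) <= size:
        return [list(tokens)]  # short document: a single window holding everything
    return [
        list(tokens[i:i + size])
        for i in range(0, len(tokens) - size + 1, stride)
    ]
```

With a stride of one, most tokens appear in `size` different windows, so a network that scores each token in each window sees every token in many slightly different contexts.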

The benchmark scores submissions based on average accuracy across five extracted fields. Right now, our baseline model achieves 70%, but this is a bit misleading: we achieve 90% or more on all fields except advertiser name, where we get 30-40%. This is because advertiser name is the only field where the answer spans multiple tokens, so we often get only part of the name, even if it’s still human recognizable. Actually, most of the models above are not designed to handle multiple token answers either.
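To make the scoring concrete, here is a sketch of per-field accuracy and its unweighted mean. The field keys, the exact-match rule, and the function names are assumptions for illustration; the hosted benchmark may compare answers differently.

```python
# Hypothetical keys for the five extracted fields.
FIELDS = ["order_number", "advertiser", "flight_from", "flight_to", "gross_amount"]


def field_accuracies(predictions, truths):
    """Exact-match accuracy per field over parallel lists of {field: value} dicts."""
    correct = {f: 0 for f in FIELDS}
    for pred, truth in zip(predictions, truths):
        for f in FIELDS:
            correct[f] += int(pred.get(f) == truth.get(f))
    n = max(len(truths), 1)  # avoid dividing by zero on an empty set
    return {f: correct[f] / n for f in FIELDS}


def benchmark_score(predictions, truths):
    """Unweighted mean of the per-field accuracies."""
    acc = field_accuracies(predictions, truths)
    return sum(acc.values()) / len(acc)
```

Under exact matching, a prediction that recovers only part of a multi-token advertiser name counts as wrong, even when a human would recognize it, which is one way a single field can pull the average down.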

The actual extracted data for 2020 is available on Overview. You can search, view the original documents and metadata, annotate and download nearly 100,000 invoices and orders from the FCC Public File for the 2020 election. Create an account and choose “copy an example document set,” then clone the fcc-data-2020 document set.

This data is a little late to be useful to reporters covering the 2020 election. Rather, the significance of Deepform is that it’s public progress on a difficult and important AI problem. But we have much further to go. If you are an ML engineer who would like to get involved in form-extraction-for-democracy, I’d encourage you to try your hand at the public benchmark, very kindly hosted by Weights & Biases. We’ve done all the hard work of building the data set and a baseline model, and any of the methods discussed above could well beat our work.

Finally, a big shoutout to the Deepform team: Gray Davidson, Daniel Fennelly, Andrea Lowe, Hugh Wimberly, and Stacey Svetlichnaya. None of this would have been possible without your dedication over the last seven months.

What tools do we have to combat disinformation?

Jonathan Stray — Mon, 24 Jun 2019 19:06:47 +0000

What types of defenses against disinformation are possible? And which of these would we actually want to use in a democracy, where approaches like censorship can impinge on important freedoms? To try to answer these questions, I looked at what three counter-disinformation organizations are actually doing today, and categorized their tactics.

The EU East StratCom Task Force is a contemporary government counter-propaganda agency. Facebook has made numerous changes to its operations to try to combat disinformation, and is a good example of what platforms can do. The Chinese information regime is a marvel of networked information control, and provokes questions about what a democracy should and should not do.

The result is the paper Institutional Counter-disinformation Strategies in a Networked Democracy (pdf). Here’s a video of me presenting this work at the recent Misinfoworkshop.

I should say from the start that this work is not about defining “disinformation.” Adjudicating which speech is harmful is a profound problem with millennia of history, and what sort of narratives are “false” is one of the major political battles of our time. Instead, my goal here is to describe methods: what kinds of responses are there, and how do they align with the values of an open society?

The core of my analysis is this chart, which organizes the tactics of the above organizations into six groups.

I’ll describe each of these strategies briefly; for more depth (and references) see the talk or the paper.

Refutation, rebuttal, or debunking might be the most obvious counter-strategy. It’s also well within the bounds of democracy, as it’s simply “more speech.” It’s most effective if it’s done consistently over the long term, and in any case it’s practiced by most counter-disinformation organizations.

Exposing inauthenticity combats one of the oldest and best-recognized forms of disinformation: pretending to be someone you are not. Bot networks, “astroturfing,” and undisclosed agendas or conflicts of interest could all be considered inauthentic communication. The obvious response is to discredit the source by exposing it.

Alternative narratives. A long line of experimentation suggests that merely saying that something is false is less effective than providing an alternative narrative, and the non-platforms in this analysis combat disinformation in part by promoting their own narrative.

Algorithmic filter manipulation. The rise of platforms creates a truly new way of countering disinformation: demote it by decreasing its ranking in search results and algorithmically generated feeds. Conversely, it is possible to promote alternative narratives by increasing their ranking.

Speech laws. The U.S. Supreme Court has held that the First Amendment generally protects lying; the major exceptions concern defamation and fraud. In Europe, the recent report of the High Level Expert Group on Fake News and Online Disinformation recommended against attempting to regulate disinformation. But in most democracies platforms are still legally liable for hosting certain types of content. For example, Germany requires platforms to remove Nazi-related material within 24 hours or face fines.

Censorship. One way of combatting disinformation is simply to remove it from public view. In the 20th century, censorship was sometimes possible through control over broadcast media. This is difficult with a free press, and it is even harder to eliminate information from a networked ecosystem. Yet platforms do have the power to remove content entirely and often do, both for their own reasons and as required by law. (This differs from speech laws because the latter may impose fines or require disclaimers or otherwise restrict speech without removing it.)

Despite their differences, there are many common patterns between the East StratCom Task Force, Facebook, and the Chinese government. Each of the methods they use has certain advantages and disadvantages in terms of efficacy and legitimacy — that is, alignment with the values of an open society.

A cross-sector response — both distributed and coordinated — is perhaps the biggest challenge. In societies with a free press there is no one with the power to direct all media outlets and platforms to refute or ignore or publish particular items, and it seems unlikely that people across different sectors of society would even agree on what is disinformation and what is not. In the U.S. the State Department, the Defense Department, academics, journalists, technologists and others have all launched their own more-or-less independent counter-disinformation efforts. In many countries, a coordinated response will require coming to terms with a deeply divided population.

But no matter what we collectively choose to do, citizens will require strong assurances that the strategies employed to counter disinformation are both effective and aligned with democratic values.

An Introduction to Algorithmic Bias and Quantitative Fairness

Jonathan Stray — Sat, 15 Jun 2019 20:50:32 +0000

There are many kinds of questions about discrimination, fairness, or bias where data is relevant. Who gets stopped on the road by the police? Who gets admitted to college? Who gets approved for a loan, and who doesn’t? The data-driven analysis of fairness has become even more important as we start to deploy algorithmic decision making across society.

I attempted to synthesize an introductory framework for thinking about what fairness means in a quantitative sense, and how these mathematical definitions connect to legal and moral principles and our real world institutions of criminal justice, employment, lending, and so on. I ended up with two talks.

This short talk (20 minutes), part of a panel at the Investigative Reporters & Editors conference, has no math. (Slides)

This longer talk (50 minutes), presented at Code for America SF, gets into a lot more depth, including the mathematical definitions of different types of fairness, and the whole tricky issue of whether or not algorithms should be “blinded” to attributes like race and gender. It also includes several case studies of real algorithmic systems, and discusses how we might design such systems to reduce bias. (Slides)

My favorite resources on these topics:

The Workbench workflow analyzing Massachusetts traffic ticket data.
Sandra Mayson, Bias In, Bias Out. One of my favorite overall discussions of algorithmic bias.
Megan Stevenson, Assessing Risk Assessment in Action. What happens with criminal justice risk assessment in the real world?
Corbett-Davies and Goel, The Measure and Mismeasure of Fairness. A well-done, more mathematical discussion of fairness measures.
Open Policing Project findings. A very clearly thought out analysis of US national traffic stop data.
Workbench Open Policing Project tutorial. An interactive introduction to working with this data.
Arvind Narayanan, 21 Definitions of Fairness and Their Politics. More on the connection between quantitative and political concepts of fairness.

Extracting campaign finance data from gnarly PDFs using deep learning

Jonathan Stray — Thu, 13 Jun 2019 19:52:05 +0000

Update, Oct 2020: we’ve done a lot more since this post! If you want to try working on this problem, Weights and Biases is very kindly hosting a public benchmark.

I’ve just completed an experiment to extract information from TV station political advertising disclosure forms using deep learning. In the process I’ve produced a challenging journalism-relevant dataset for NLP/AI researchers. Original data from ProPublica’s Free The Files project.

The resulting model achieves 90% accuracy extracting total spending from the PDFs in the (held out) test set, which shows that deep learning can generalize surprisingly well to previously unseen form types. I expect it could be made much more accurate through some feature engineering (see below.)

You can find the code and documentation here. Full thanks to my collaborator Nicholas Bardy of Weights & Biases.

Why?

TV stations are required to disclose their sale of political advertising, but there is no requirement that this disclosure is machine readable. Every election, tens of thousands of PDFs are posted to the FCC Public File, available at https://publicfiles.fcc.gov/. All of these contain essentially the same information, but in hundreds of different formats, like these:

In 2012, ProPublica ran the Free The Files project (you can read how it worked) and hundreds of volunteers hand-entered information for over 17,000 of these forms. That data drove a bunch of campaign finance coverage and is now available from their data store.

Can we replicate this data extraction using modern deep learning techniques? This project aimed to find out, and successfully extracted the easiest of the fields (total amount) at 90% accuracy using a relatively simple network.

How it works

I settled on a relatively simple design, using a fully connected three-layer network trained on 20-token windows of the data. Each token is hashed to an integer mod 500, then converted to a 1-hot representation and embedded into 32 dimensions. This embedding is combined with geometry information (bounding box and page number) and also some hand-crafted “hint” features, such as whether the token matches a regular expression for dollar amounts. For details, see the talk.
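The per-token features described above can be sketched as follows. This is an approximation under stated assumptions: `zlib.crc32` stands in for whatever hash the project uses (Python's built-in `hash` on strings changes between runs), the lowercasing and the dollar regex are guesses, and the learned 32-dimensional embedding is left out because it lives inside the network.

```python
import re
import zlib

VOCAB = 500  # tokens are hashed into this many buckets
DOLLAR = re.compile(r"^\$?\d[\d,]*\.\d{2}$")  # assumed pattern, e.g. "$1,250.00"


def token_id(token: str) -> int:
    """Hash a token to a stable integer in [0, VOCAB)."""
    return zlib.crc32(token.lower().encode("utf-8")) % VOCAB


def token_features(token: str, x0: float, top: float, x1: float,
                   bottom: float, page: int) -> list:
    """One-hot token id, then bounding box and page number, then a dollar-amount hint."""
    one_hot = [0.0] * VOCAB
    one_hot[token_id(token)] = 1.0
    hint = 1.0 if DOLLAR.match(token) else 0.0
    return one_hot + [x0, top, x1, bottom, float(page), hint]
```

Hashing means unrelated words sometimes land in the same bucket, but it also means the model never meets an out-of-vocabulary token, which helps with documents full of names and order numbers.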

Although 90% is a good result, it’s probably not high enough for production use. However, I believe this approach has lots of room for improvement. The advantage of this type of system is that it can elegantly integrate multiple manual extraction methods — the “hint” features — each of which can be individually crappy. The network actually learns when to trust each method. In ML speak this is “boosting over weak learners.”

A research data set

Are you an AI researcher looking for challenging research problems that are relevant to investigative journalism? Have I got a data set for you!

There is a great deal left to do on this extraction project. For example, we still need to try extracting the other fields such as advertiser and TV station call sign. This will probably be harder than totals as it’s harder to identify tokens which “look like” the correct answer.

There is also more data preparation work to do. We discovered that about 30% of the PDF documents still need OCR, which should increase our training data set from 9k to ~17k documents.

But even in its current form, this is a difficult data set that is very relevant to journalism, and improvements in technique will be immediately useful to campaign finance reporting.

The general problem is known as “knowledge base construction” in the research community, and the current state of the art is achieved by multimodal systems such as Fonduer.

I would love to hear from you! Contact me on twitter or here.

Ethical Software Engineering Lab Course

Jonathan Stray — Sun, 12 May 2019 00:38:27 +0000

There is now, at long last, wide concern over the negative effects of technology, along with calls to teach ethics to engineers. But critique is not enough. What tools are available to the working engineer to identify and mitigate the potential harms of their work?

I’ve been teaching the effects of technology on society for some time, and we cover a lot of it in my computational journalism course. This is an outline for a broader hands-on course, which I’m calling the Ethical Engineering Lab.

This eight-week course is a hands-on introduction to the practice of what you might call harm-aware software engineering. I’ve structured it around the Institute for the Future’s Ethical OS, a framework I’ve found useful for categorizing the places where technology intersects with personal and social harm. Each class is three hours long, split between lecture and lab time. Students must complete a project investigating actual or potential harms from technology, and their mitigations.

Each lecture is structured around a set of issues, cases where technology is or could be involved in harm, and tools, methods for mitigating these harms. The goal is to train students in the current state-of-the-art of these problems, which often requires a deep dive into both the social and technical perspectives. We will study both differential privacy algorithms and HIPAA health data privacy. In many cases there is disagreement over the potential for certain harms and their seriousness, so we will explore the tradeoffs of possible design choices.

Our hands-on exploration (lab time and final projects) will involve a combination of qualitative and quantitative methods. For example, we might read the EULAs of all the products we use and see if there are any surprises. Or we might use a Jupyter notebook with real data from the COMPAS criminal justice risk assessment algorithm to investigate the tradeoffs between different definitions of quantitative fairness. I’ve included example projects that students could do in each area.

Some technical background is required, as the goal is to teach the engineering aspects of these problems. Many but not all final projects will require coding. I particularly encourage students to choose a project related to their work.

This post is just a sketch to suggest the sort of material I’d want to include. Doubtless, a great many things are missing. What else should we cover? What references are especially good on these topics? Do you want me to teach this at your organization? Get in touch.

Truth, Disinformation, Propaganda

Issues

Overview of recent disinformation campaigns (2016 election and globally).
Disinformation spreads farther than truth.
Review of current state-of-the-art in audio, video, text, and photo deepfakes.
Defining “propaganda.” The ethics of persuasion.
What the most advanced chatbots can do today.

Tools

Contemporary institutional counter-disinformation practices.
Recommendation algorithms: design patterns for various social goals.
Moderation system design.
How Facebook responds to information operations.
Associated Press guidelines for identifying machine-written content.

Discussions

How could your technology be used as part of a disinformation campaign?

Example Projects

Build a chatbot that impersonates a person or company. See if you can fool your classmates.
Build a fake news classifier from one of the common fake news datasets. What signals does it end up learning? Can it be made to work at scale?

Addiction & The Dopamine Economy

Issues

Addiction psychology, through the example of gambling and casino design.
Defining “engagement” and the effects of optimizing for it.
Effects of screens on sleep.
“Ultra-FOMO”: What do constant images of perfection do to us?

Tools

“time well spent” metrics; well-being research
Screen time reports
Human and algorithmic approaches to evaluating content quality

Discussion

What would addiction look like on your platform?
How can your business make money without addiction?

Example Projects

Estimate quantitative effect of removing a particular addictive feature. Or implement a change to your product and find out.
Build a machine learning system that ranks content by “quality,” in the “time well spent” sense. What measure are you using, and why, and how does the classifier perform relative to this standard?

Economic and Asset Inequalities

Issues

Personalized pricing can charge poorer people more. This doesn’t have to be intentional; a very simple three line algorithm will do it.
Pricing AIs will collude to fix prices.
Auto insurance continues to be more expensive in minority neighborhoods, even after adjusting for risk.

Example Projects

Reproduce the simulation which showed that pricing algorithms will collude. Under what conditions will this happen? How can AIs be designed not to do this?
Analyze real lending data to determine the demographics of who gets a loan now, and how that would change if better prediction was available, as this notebook does.
Simulate personalized pricing, using a model to estimate the willingness to pay of different demographics (location, age, etc.). How will this affect the distribution of prices between different income levels?

Machine Ethics & Algorithmic Biases

Issues

A framework for thinking about analyzing data for evidence of bias.
Google shows ads for higher-paid jobs to men.
Sexism in word embeddings: “man is to programmer as woman is to housewife.” However, there was an instructive error in the research that led to this particular example.
Better loan payback prediction will increase disparities in interest rates.
Ethical problems with prediction in general in criminal justice.
Prediction feedback loops.

Tools

Introduction to quantitative fairness measures. Three classic types, their advantages and drawbacks. 1) Demographic parity: hire the same number of men and women. 2) Equal error rates: make sure the classifier fails the same amount for different races. 3) Calibration: ensure a prediction means the same for all groups.
Stanford’s Law, Bias, and Algorithms course notebooks.
Impossibility theorem: you can only have one of these at once when base rates differ between groups. Type of fairness is a policy choice.
Real world outcomes. After recidivism prediction was introduced in Kentucky, judges initially reduced detainment rates in accordance with computed risk scores but the effect gradually wore off. A detailed analysis of the effect of predicting which children will likely require intervention by child protection services.
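The three fairness measures in the list above can be computed from the same confusion-matrix counts. The sketch below is a simplified illustration: it assumes yes/no predictions, so "calibration" reduces to comparing precision across groups, and the function name and record format are hypothetical.

```python
def group_rates(records):
    """records: iterable of (group, predicted_positive, actual_positive), flags as bools."""
    counts = {}
    for group, pred, actual in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        key = ("t" if pred == actual else "f") + ("p" if pred else "n")
        c[key] += 1

    def ratio(a, b):
        return a / b if b else float("nan")

    out = {}
    for group, c in counts.items():
        n = sum(c.values())
        out[group] = {
            # Demographic parity compares this across groups.
            "positive_rate": ratio(c["tp"] + c["fp"], n),
            # Equal error rates compare these two across groups.
            "false_positive_rate": ratio(c["fp"], c["fp"] + c["tn"]),
            "false_negative_rate": ratio(c["fn"], c["fn"] + c["tp"]),
            # Calibration, for a yes/no prediction: does "yes" mean the same thing?
            "precision": ratio(c["tp"], c["tp"] + c["fp"]),
        }
    return out
```

Running this on data where the groups have different base rates is a quick way to see the tradeoff in practice: equalizing one of these rows across groups generally pushes another one apart.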

Example Projects

Quantify the tradeoffs between different types of fairness in the COMPAS criminal justice risk assessment data set, as in this notebook.
Simulate and analyze the feedback loops in predictive policing.

The Surveillance State

Issues

China’s inept “social credit system” and the much more sophisticated surveillance system used by police in Xinjiang.
Surveillance by landlords, including video cameras and social media monitoring, is being used to harass and evict tenants.
The potential costs of sharing your heart rate and other health data.
Western companies selling surveillance tech to authoritarian regimes, and Chinese products collecting data for the government.

Scenarios

Data mining tools used for investigative journalism are re-purposed for harassment
China’s social credit system grows up and is applied to users worldwide to enforce authoritarian norms.
Police facial recognition cameras effectively track every citizen’s location, bypassing 4th Amendment protections on tracking.

Discussions

What are the technical, legal, and social factors that prevent law enforcement from abusing mass surveillance – in each country? How will your technology interact with these factors?

Project Ideas

With their prior permission, investigate a classmate through public information only. What can you correctly infer about their life?
Publicly display your heart rate for a week and report your results.

Data Control & Monetization

Issues

Data privacy law primer, including GDPR and HIPAA.
Inadvertent collection of data. Google Wifi, Mixpanel passwords.
Data leaks due to mistakes and hacks.
The effect of making ostensibly “public” data more available or interpretable. E.g. Graffiti tracker, The Journal News’ gun map.

Tools

Redaction and minimization. Differential privacy, through the example of the new privacy changes for the 2020 Census.
Location data. How much it reveals, how easy it is to de-anonymize.
Health data. Correlations with life outcomes. Regulatory issues.
Issues of personalized recommendations and ads, e.g. targeting ads to pregnant women.
General effects of better prediction on the distribution of resources and risk. For example, if you had perfect information on someone’s future health, would that destroy the health insurance market?

Discussions

What data do you collect? Split into small groups, discuss and make a list, merge lists. Were any types of data not listed by a group because you were missing someone with a specific perspective?
What rights would your users want in regards to their data? What problems will they have if they don’t have these rights?

Example projects

Experiment with adding differential privacy to one of your APIs. How easy is it to learn personal information, via reconstructions from multiple API calls, before and after?
Reconstruct someone’s life from anonymized location data (someone in the class could give it to you, or you could use the NYC taxi data, or data from apps.)
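For the differential privacy experiment, a minimal sketch of the idea might look like the following. It uses the standard Laplace mechanism on a count query. The function name, the ages, and the epsilon value are all made up for illustration, and a real deployment would also need a privacy budget that limits repeated queries.

```python
import random

def private_count(values, predicate, epsilon, rng):
    """Return a count with Laplace noise of scale 1/epsilon added.

    Adding or removing one person changes a count by at most 1, so its
    sensitivity is 1 and the noise scale is sensitivity / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    scale = 1.0 / epsilon
    # The difference of two exponential draws is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

def over_40(age):
    return age > 40

rng = random.Random(0)  # seeded so the experiment is repeatable
ages = [23, 35, 41, 29, 52, 38, 61, 19]  # hypothetical user data

single = private_count(ages, over_40, epsilon=0.5, rng=rng)

# Averaging attack: with no query budget, repeated calls wash out the noise.
repeats = [private_count(ages, over_40, epsilon=0.5, rng=rng) for _ in range(2000)]
averaged = sum(repeats) / len(repeats)

print("one noisy answer:", round(single, 2))
print("average of 2000 noisy answers:", round(averaged, 2))
```

A single answer is noisy enough to hide any one individual, but the average over many calls lands close to the true count of 3. That is exactly the "reconstruction from multiple API calls" the project asks you to measure.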

Implicit Trust and User Understanding

Issues

No one reads EULAs
Unroll.me was reading your email for Lyft receipts and reporting to Uber
Dark Patterns in UI design

Projects:

Take one day of your browser history, re-visit every site. Read the EULAs and record anything that surprises you.
Document the dark patterns you encounter on these sites.

Hateful & Criminal Actors

Issues

The challenge of platform counter-terrorism, from Facebook’s point of view.
Bibliography of papers on online harassment and machine learning.
Attacks on image recognition algorithms: wear a t-shirt and confuse an AI.
It is now possible to run automated spear-phishing.
The Darkweb, anonymity, and security, as told through the story of The Silk Road.

Example Projects

Build a hate speech classifier. Does it work well enough to be useful? What have you learned about the complexity of this problem?
Estimate the percentage of bitcoin transactions which are used for criminal activity.
Pick a platform or product. Come up with a plan to use it for criminal activity, including the security measures you would take.

Introducing Workbench

Jonathan Stray — Thu, 02 Mar 2017 19:05:26 +0000

Some of you may have heard about my new data journalism project — The Computational Journalism Workbench. This is an integrated platform for data journalism, combining scraping, analysis, and visualization in one easy tool. It works by assembling simple modules into a “workflow,” a repeatable, sharable, automatically updating pipeline that produces a publishable chart or a live API endpoint.

I demonstrated a prototype at the NICAR conference. UPDATE: Workbench is now in production at workbenchdata.com and has now been used in teaching in dozens of schools.

I’ll be working on Workbench for at least the next few years. My previous large data journalism project is the Overview document mining system, which continues active development.

Defense Against the Dark Arts: Networked Propaganda and Counter-Propaganda

Jonathan Stray — Fri, 24 Feb 2017 22:00:42 +0000

In honor of MisinfoCon this weekend, it’s time for a brain dump on propaganda — that is, getting large numbers of people to believe something for political gain. Many of my journalist and technologist colleagues have started to think about propaganda in the wake of the US election, and related issues like “fake news” and organized trolling. My goal here is to connect this new wave of enthusiasm to history and research.

This post is about persuasion. I’m not going to spend much time on the ethics of these techniques, and even less on the question of who is actually right on any particular point. That’s for another conversation. Instead, I want to talk about what works. All of these methods are just tools, and some are more just than others. Think of this as Defense Against the Dark Arts.

Let’s start with the nation states. Modern intelligence services have been involved in propaganda for a very long time and they have many names for it: information warfare, political influence operations, disinformation, psyops. Whatever you want to call it, it pays to study the masters.

Russia: You don’t need to be true or consistent

Russia has a long history of organized disinformation, and their methods have evolved for the Internet era. The modern strategy has been dubbed “the firehose of falsehood” by RAND scholar Christopher Paul.

His recent report discusses this technique of pushing out diverse messages on a huge number of different channels, everything from obvious state sources like Russia Today to carefully obscured leaks of hacked material — leaks which are tailored to appeal to sympathetic journalists.

The experimental psychology literature suggests that, all other things being equal, messages received in greater volume and from more sources will be more persuasive. Quantity does indeed have a quality all its own. High volume can deliver other benefits that are relevant in the Russian propaganda context. First, high volume can consume the attention and other available bandwidth of potential audiences, drowning out competing messages. Second, high volume can overwhelm competing messages in a flood of disagreement. Third, multiple channels increase the chances that target audiences are exposed to the message. Fourth, receiving a message via multiple modes and from multiple sources increases the message’s perceived credibility, especially if a disseminating source is one with which an audience member identifies.

And as you might expect, there is a certain amount of outright fabrication — often mixed with the truth:

Contemporary Russian propaganda makes little or no commitment to the truth. This is not to say that all of it is false. Quite the contrary: It often contains a significant fraction of the truth. Sometimes, however, events reported in Russian propaganda are wholly manufactured, like the 2014 social media campaign to create panic about an explosion and chemical plume in St. Mary’s Parish, Louisiana, that never happened. Russian propaganda has relied on manufactured evidence—often photographic. … In addition to manufacturing information, Russian propagandists often manufacture sources.

But for me, the most surprising conclusion of this work is that a source can still be credible even if it repeatedly and blatantly contradicts itself:

Potential losses in credibility due to inconsistency are potentially offset by synergies with other characteristics of contemporary propaganda. As noted earlier in the discussion of multiple channels, the presentation of multiple arguments by multiple sources is more persuasive than either the presentation of multiple arguments by one source or the presentation of one argument by multiple sources. These losses can also be offset by peripheral cues that enforce perceptions of credibility, trustworthiness, or legitimacy. Even if a channel or individual propagandist changes accounts of events from one day to the next, viewers are likely to evaluate the credibility of the new account without giving too much weight to the prior, “mistaken” account, provided that there are peripheral cues suggesting the source is credible.

Orwell was right: “We have always been at war with Eastasia” really does work, if there are enough people repeating it.

Paul suggests that the counter-strategy is not to try to refute the message, but to reach the target audience first with an alternative. Fact checking, which is really after-the-fact-checking, may not be the most effective plan. He suggests instead that we “forewarn audiences of misinformation, or merely reach them first with the truth, rather than retracting or refuting false ‘facts.’” In this light, Facebook’s plan to show the fact check along with the article seems like a much better strategy than sending someone a fact checking link when they repeat a falsehood.

He also suggests that we “focus on guiding the propaganda’s target audience in more productive directions.” Which is exactly what China does.

China: Don’t argue, distract and disrupt

China is famous for its highly developed network censorship, from the Great Firewall to its carefully policed social media. The role of the government “public opinion guides,” China’s millions of paid commenters, has been murkier — until now.

https://www.youtube.com/watch?v=xi-B3_BsL-M

The Atlantic has a readable summary of recent research by Gary King, Jennifer Pan, and Margaret E. Roberts. They started with thousands of leaked Chinese government emails where commentators report on their work, which became the raw data for an accurate predictive model of which posts are government PR. A surprising twist: nearly 60% of paid commenters will just tell you they’re posting for the government when you ask them, which allowed these scholars to verify their country-wide model. But the core of the analysis is what these posters were doing.

From the paper:

We estimate that the government fabricates and posts about 448 million social media comments a year. In contrast to prior claims, we show that the Chinese regime’s strategy is to avoid arguing with skeptics of the party and the government, and to not even discuss controversial issues. We infer that the goal of this massive secretive operation is instead to regularly distract the public and change the subject, as most of these posts involve cheerleading for China, the revolutionary history of the Communist Party, or other symbols of the regime.

And here’s the breakdown of what these posters were doing. “Cheerleading” dominates for every sample of government accounts. Arguments are rare.

Note that this is only one half of the Chinese media control strategy. There is still massive censorship of political expression, especially of any post relating to organized protest, which is empirically good at toppling governments.

All of this without ever getting into an argument. This suggests that there is actually no need to engage the critics/trolls to get your message out (though it might still be worthwhile to distract and monitor them.) Just communicate positive messages to the masses while you quietly disable your detractors. A counter-strategy, if you are facing this type of opponent, is organized, visible resistance. Get into the streets and make it impossible to talk about something else — though note that recent experiments suggest that violent or extreme protest tactics will backfire.

But China has a tightly controlled media and the greatest censorship regime the world has ever seen. If you’re operating in a relatively free media environment, you have to manipulate the press instead.

Milo: Attention by any means necessary

The most insightful thing I have ever read about the wonder that was Milo Yiannopoulos comes from the man who wrote a book on manipulating the media, documenting the strategies he devised to market people like Tucker Max. Ryan Holiday writes,

We encouraged protests at colleges by sending outraged emails to various activist groups and clubs on campuses where the movie was being screened. We sent fake tips to Gawker, which dutifully ate them up. We created a boycott group on Facebook that acquired thousands of members. We made deliberately offensive ads and ran them on websites where they would be written about by controversy-loving reporters. After I began vandalizing some of our own billboards in Los Angeles, the trend spread across the country, with parties of feminists roving the streets of New York to deface them (with the Village Voice in tow).

But my favorite was the campaign in Chicago—the only major city where we could afford transit advertising. After placing a series of offensive ads on buses and the metro, from my office I alternated between calling in angry complaints to the Chicago CTA and sending angry emails to city officials with reporters cc’d, until ‘under pressure,’ they announced that they would be banning our advertisements and returning our money. Then we put out a press release denouncing this cowardly decision.

I’ve never seen so much publicity. It was madness.

. . .

The key tactic of alternative or provocative figures is to leverage the size and platform of their “not-audience” (i.e. their haters in the mainstream) to attract attention and build an actual audience. Let’s say 9 out of 10 people who hear something Milo says will find it repulsive and juvenile. Because of that response rate, it’s going to be hard for someone like Milo to market himself through traditional channels. His potential audience is too spread out, and doesn’t have that much in common. He can’t advertise, he can’t find them one by one. It’s just not going to scale.

But let’s say he can acquire massive amounts of negative publicity by pissing off people in the media? Well now all of a sudden someone is absorbing the cost of this inefficient form of marketing for him.

(Emphasis mine.) That one’s adversaries should be denied attention is not a new idea. Indeed, this is central to the “no-platforming” tactic. But no-platforming plays right into an outrage-based strategy if it results in additional attention (see also the Streisand effect). Worse, all the incentives for media makers are wrong. It’s going to be very hard for journalists and other media figures to wean themselves off of outrage, because strong emotional reactions get people to share information (1, 2, 3, etc.) and information sharing has become the basis of distribution, which is the basis of revenue. We are in dire need of new business models for news.

But this breakdown of the mechanics of outrage marketing does suggest a counter-strategy: before you get mad, or report on someone getting mad, do your homework. Holiday called to complain about his own content, put out false press releases, etc. A smart journalist might be able to uncover this deception. In a propaganda war, all journalists should be investigative journalists.

Attention is the currency of networked propaganda. Attention is the key. Be very careful who you give it to, and understand how your own emotions and incentives can be exploited.

But even if you’ve uncovered a deception, it’s not enough to say that someone else is lying. You have to tell a different story.

Debunking doesn’t work: provide an alternative narrative

Telling people that something they’ve heard is wrong may be one of the most pointless things you can do. A long series of experiments shows that it rarely changes belief. Brendan Nyhan is one of the main scholars here, with a series of papers on political misinformation. This is about human psychology; we simply don’t process information rationally, but instead employ a variety of heuristics and cognitive shortcuts (not necessarily maladaptive in general) that can be exploited. The classic experiment goes like this:

Participants in a study within this paradigm are told that there was a fire in a warehouse and that there were flammable chemicals in the warehouse that were improperly stored. When hearing these pieces of information in succession, people typically make a causal link between the two facts and infer that the fire was caused in some way by the flammable chemicals. Some subjects are then told that there were no flammable chemicals in the warehouse. Subjects who have received this corrective information may correctly answer that there were no flammable chemicals in the warehouse and separately incorrectly answer that flammable chemicals caused the fire. This seeming contradiction can be explained by the fact that people update the factual information about the presence of flammable chemicals without also updating the causal inferences that followed from the incorrect information they initially received.

Worse, repeating a lie in the process of refuting it may actually reinforce it! The counter strategy is to replace one narrative with another. Affirm, don’t deny:

Which of these headlines strikes you as the most persuasive:

“I am not a Muslim, Obama says.”

“I am a Christian, Obama says.”

The first headline is a direct and unequivocal denial of a piece of misinformation that’s had a frustratingly long life. It’s Obama directly addressing the falsehood.

The second option takes a different approach by affirming Obama’s true religion, rather than denying the incorrect one. He’s asserting, not correcting.

Which one is better at convincing people of Obama’s religion? According to recent research into political misinformation, it’s likely the latter.

The role of intelligence: Action not reaction

Let’s return to China for a moment. Here’s a chart, from the paper above, on the number of government social media postings over time:

Posts spiked around political events (CCP Congress) and emergencies that the government would rather citizens not talk about, such as riots and a rail explosion. This “cheerleading” propaganda wasn’t simply a regular diet of good news, but a precisely controlled strategy designed to drown out undesirable narratives.

One of the problems of a free press is that “the media” is a herd of cats. There really is no central authority — independence and diversity, huzzah! Similarly, distributed protest movements like Anonymous can be very effective for certain types of activities. But even Anonymous had central figures planning operations.

The most successful propagandists, like the most successful protest movements, are very organized. (Lost in the current “diversity of tactics” rhetoric is the historical fact that key battles in the civil rights movement were carefully planned.) Organization and planning requires intelligence. You have to know who your adversaries are and what they are doing. Intelligence involves basic steps like:

Pay attention to the details of every encounter. Who wrote that story or posted that comment?
Research the actors and their networks. Who are they connected to? What communication channels do they use to coordinate? Who directs operations?
Real-time monitoring. When a misinformation campaign begins, you need to get to your audience before they do (with something more than just a debunk, as above.)

Although there may be useful technological approaches to tracing networks, there is no magic here; anyone can keep a spreadsheet of actors, you can do real-time monitoring with little more than Tweetdeck, and investigative journalists already know how to investigate. But centralization may be important. The Russian approach of “many messages, many channels” suggests that an open, diverse network can succeed at individual propaganda actions, and I bet it would succeed at counter-propaganda actions too. But intelligence is different, and it’s an unanswered question whether the messy collection of journalists, NGOs, universities, and activists in a free society can do effective counter-propaganda intelligence, or even agree sufficiently on what that would be. I don’t think a distributed approach will work here; someone needs to own the database and run the show.

Update: The East StratCom Task Force seems to be exactly this sort of centralized actor for the EU.

But one way or another, you have to know what your propagandist adversary is doing, in detail and in real-time. If you don’t have that critical function taken care of, you’re going to be forever reactive, which means you’re probably going to lose.

PS: Up your security game

Hacking and leaking — which is one of the more effective ways to dox someone — has become a propaganda tactic. If you don’t want to be on the wrong end of this, I recommend immediately doing the following easy things:

Enable 2-step logins on your email and other important accounts.
Learn to recognize phishing.

I suspect this would prevent 70%-90% of hacking and doxxing attempts. It would have saved John Podesta. Here’s lots more on easy ways to protect yourself.

Stay safe out there, and good luck.

What do Journalists do with Documents?

Jonathan Stray — Wed, 02 Nov 2016 14:46:27 +0000

Many people have realized that natural language processing (NLP) techniques could be extraordinarily helpful to journalists who need to deal with large volumes of documents or other text data. But although there have been many experiments and much speculation, almost no one has built NLP tools that journalists actually use. In part, this is because computer scientists haven’t had a good description of the problems journalists actually face. This talk and paper, presented at the Computation + Journalism Symposium, are one attempt to remedy that. (Talk slides here.)

This all comes out of my experience both building and using Overview, an open source document mining system built specifically for investigative journalists. The paper summarizes every story completed with Overview, and also discusses the five cases I know where journalists used custom NLP code to get the story done.

The talk is more focussed on the lessons learned — all the things I wish I had known when I started writing NLP code for journalism six years ago. I recommend six research themes for computer scientists who want to help journalists:

Robust import. Preparing documents for analysis is a much bigger problem than is generally appreciated. Even structured data like email is often delivered on paper.

Robust analysis. Journalists routinely deal with unbelievably dirty documents. OCR error confounds classic algorithms. Shorthand and jargon break dictionaries and parsers.

Search, not exploration. Reporters are usually looking for something, but it may not be something that is easy to express in a keyword search. The ultimate example is “corruption,” which you can’t just type into a search box.

Quantitative summaries. Journalists have long produced stories by counting the number of documents of a certain type. How can we make this easy, flexible, and accurate?

Interactive methods. Even with NLP, document-based reporting requires extensive human reading. How do we best integrate machine and human intelligence in an interactive loop?

Clarity and Accuracy. Journalists are accountable to the public for their results. They must be able to explain how they got their answer, and how they know the answer is right.

I am currently compiling test sets of real-world documents that journalists have encountered, to help researchers who want to work on these problems. Contact me if you’re interested! I’d also like to take this opportunity to point out that Overview has an analysis plugin API, so if you’re doing work that you want journalists to use, this is one easy way to get a UI around it, and get it shipping with a widely-used tool.

The Dark Clouds of Financial Cryptography

Jonathan Stray — Tue, 11 Oct 2016 17:24:10 +0000

I feel we’re on the precipice of some delightfully weird and possibly very alarming developments at the intersection of code and money. There is something deep in the rules that is getting rewritten, only we can’t quite see how yet. I’ve had this feeling before, as a self-described Cypherpunk in the 1990s. We knew or hoped that encrypted communication would change global politics, but we didn’t quite know how yet. And then Wikileaks happened. As Bruce Sterling wrote at the time,

At last — at long last — the homemade nitroglycerin in the old cypherpunks blast shack has gone off.

That was exactly how I felt when that first SIGACT dump hit the net; by then I was a newly hired editor at the Associated Press. Now I’m studying finance, and I can’t shake the feeling that cryptocurrencies — and their abstracted cousins, “smart contracts” and other computational financial instruments — are another explosion of weirdness waiting to happen.

I’m hardly alone in this. Lots of technologists think the “block chain” pioneered by bitcoin is going to be consequential. But I think they think this for the wrong reasons. Bitcoin itself is never going to replace our current system of money transfer and clearing; it’s much slower than existing payment systems, often more expensive, uses far too much energy, and doesn’t scale well. Rather, bitcoin is just a taste, a hint: it shows that we can mix computers and money in surprising and consequential ways. And there are more ominous portents, such as contracts that are actually code and the very first “distributed autonomous organizations.” But we’ll get to that.

What is clear is that we are turning capitalism into code — trading systems, economic policy, financial instruments, even money itself — and this is going to change a lot of things.

The question I always come to is this: what do we want our money to do? Code is also policy, because it constrains what people can and cannot do, and monetary code is economic policy. But code is not all powerful, which is where the bitcoin techno-libertarian ethos goes wrong. What I’ve learned since my Cypherpunk days is that we need to decide now what happens when the code fails, because eventually there will be a situation that will have to be resolved by law and politics. We should design for this rather than trying to avoid it. And this time around, there’s an even weirder twist: when we start describing financial contracts in code, we lose the ability to predict what they’ll do!

What does bitcoin get you anyway?

We’ve had electronic cash since well before I was born. We use it every day: bank balances, credit cards, and all the rest. Here are some things that these systems do:

Transfer money without anything physical changing hands.
Security. It’s not possible to take back spent money unilaterally, or spend the same money twice.
Controlled money supply. You can’t mint your own.

These are the new features that bitcoin adds:

Pseudo-anonymity. Parties to a transaction are identified only by a public key.
A public ledger of all transactions that provides cryptographic proof of ownership. This is the “block chain.”
Decentralization. The security of the system does not depend on the honesty of any single authority, but only honest action by the majority of nodes.
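To see why a public ledger built this way is hard to rewrite, here is a toy hash-chained ledger. It is only a sketch of the chaining idea: the block layout and the names are invented for illustration, and it leaves out signatures, mining, and the network consensus that real bitcoin depends on.

```python
import hashlib
import json

def block_hash(block):
    """SHA-256 over a canonical JSON encoding of the block (a toy format)."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, transactions):
    """Each new block commits to the hash of the block before it."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})

def is_valid(chain):
    """Check that every block still points at the true hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = []
append_block(chain, [{"from": "alice_pubkey", "to": "bob_pubkey", "amount": 5}])
append_block(chain, [{"from": "bob_pubkey", "to": "carol_pubkey", "amount": 2}])
append_block(chain, [{"from": "carol_pubkey", "to": "alice_pubkey", "amount": 1}])

valid_before = is_valid(chain)
chain[0]["transactions"][0]["amount"] = 500  # try to rewrite history
valid_after = is_valid(chain)

print("valid before tampering:", valid_before)
print("valid after tampering:", valid_after)
```

Changing an old transaction changes that block's hash, which breaks the link stored in the next block. To hide the edit you would have to recompute every later block, and that is where the cost of mining comes in.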

This is private secure global digital money without governments or banks, and you never need to trust in the honesty and competence of any one institution! It’s a really neat trick, exactly the sort of magic that first drew me to cryptography. You can learn how the trick is done in places like here, but part of the trick is that it’s not just cryptography. It’s also clever alignment of incentives. It’s financial cryptography.

Here’s the core innovation: it is possible to use your computer to “mine” bitcoin — that is, create new money for yourself — and this mining operation simultaneously maintains the integrity of the global distributed ledger. This is a profound thing. It means that the distributed network operators (bitcoin “nodes”) get paid, and indeed bitcoin miners collectively made something like six billion dollars in the last year. It also means that control of the money supply is distributed, which makes it very unlike central bank money. This is the first place that the politics gets weird.

Control of the money supply

If you’re a government, you want to control your money. Traditionally, central banks do this to balance economic outcomes like unemployment and inflation. They’ve also created money for more drastic and often disreputable purposes like funding wars, inflating away debts or influencing the balance of trade. There’s a lot of destructive stuff you can do with the power to print money, which is one reason why states guard their monopoly closely. There are laws against counterfeiting.

But in the bitcoin scheme of things, money is created by anyone who can solve a specific type of computational puzzle. More specifically, you have to invert a hash function, a problem that can only really be solved by brute force guessing — a massive amount of guessing, something like 400 years on a standard PC. In other words, you pay for freshly minted bitcoins with computer time, an extremely capital and energy intensive process. This in no way challenges the primacy of accumulated wealth; it’s a fundamentally conservative amendment to capitalism.
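The puzzle can be sketched in a few lines. In practice the miner does not invert the hash completely; it searches for a "nonce" that makes the block's hash fall below a target value. The version below is a simplified illustration: the data format and function names are my own, it uses a single SHA-256 rather than bitcoin's double hash over a block header, and the difficulty is tiny so it finishes in a moment.

```python
import hashlib

def pow_digest(block_data, nonce):
    """Hash the block data together with a candidate nonce (toy encoding)."""
    return hashlib.sha256(f"{block_data}|{nonce}".encode("utf-8")).digest()

def mine(block_data, difficulty_bits):
    """Brute-force search for a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = pow_digest(block_data, nonce)
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
        nonce += 1

def verify(block_data, nonce, difficulty_bits):
    """Checking a solution takes one hash, however long the search took."""
    target = 1 << (256 - difficulty_bits)
    return int.from_bytes(pow_digest(block_data, nonce), "big") < target

# 12 bits means roughly 4,096 guesses on average; real difficulty is vastly higher.
nonce, digest_hex = mine("alice pays bob 5", difficulty_bits=12)
print("nonce:", nonce)
print("hash:", digest_hex)
print("verifies:", verify("alice pays bob 5", nonce, 12))
```

Each extra bit of difficulty doubles the expected number of guesses, while verification stays at one hash. That asymmetry is what makes the work expensive to do and cheap to check.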

The whole point of this is to put an inviolate limit on how fast new coin can be created. You need a lot of resources to mine a little bitcoin — just like mining gold. In fact, the protocol automatically adjusts the difficulty of the hash problem so that new coins always get created at about the same rate, which means blocks are added to the chain at a constant rate, about ten minutes per block, no matter how many computers people throw at the problem, or how fast our computers get. And today, the bitcoin mining industry uses data centers full of custom chips that collectively dwarf the largest supercomputers. All doing essentially nothing except being expensive, which turns out to be a foolproof method of trust-less distributed control.
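The difficulty adjustment is simple arithmetic. Here is a rough sketch, assuming bitcoin's rule of retargeting every 2016 blocks toward ten minutes per block, with the change capped at a factor of four in either direction. The "hash power" units and the helper that converts difficulty into block time are invented so the example is self-contained.

```python
TARGET_SECONDS_PER_BLOCK = 600   # ten minutes
BLOCKS_PER_ADJUSTMENT = 2016     # about two weeks of blocks at the target rate

def retarget(difficulty, actual_seconds):
    """Scale difficulty by how much faster or slower the last period was than planned."""
    expected = TARGET_SECONDS_PER_BLOCK * BLOCKS_PER_ADJUSTMENT
    # Cap the adjustment at 4x in either direction.
    actual = min(max(actual_seconds, expected / 4), expected * 4)
    return difficulty * expected / actual

def seconds_per_block(difficulty, hash_power):
    """Toy model: expected block time is proportional to difficulty / hash power."""
    return difficulty / hash_power

difficulty = 600.0
hash_power = 2.0  # miners have just doubled their hardware (it was 1.0)

before = seconds_per_block(difficulty, hash_power)
difficulty = retarget(difficulty, before * BLOCKS_PER_ADJUSTMENT)
after = seconds_per_block(difficulty, hash_power)

print("block time after hardware doubles:", before)
print("block time after retargeting:", after)
```

Doubling the hardware halves the block time for one adjustment period, then the protocol doubles the difficulty and blocks go back to ten minutes apart. More computers buy a larger share of the new coins, not more coins overall.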

Sometimes a little trust gets you a lot, like a stable money supply without using as much electricity as a small country. Using trust as a design element is a hard concept for hard-core cryptographers, whose protocols are suspicious by design. But of course finance has always run on trust; there would be no credit without it, and there’s no credit in bitcoin either. No one ever has a negative balance, or even a balance at all, just ownership of tokens.

For the moment, mining is a profitable business and both the quantity of mining and the price of bitcoin are increasing. Which is good news for bitcoin users, because the work of mining is also the work of keeping the transaction processing network running; that’s why anyone bothers to process your bitcoin payments. That’s the sort of incentive engineering you get to do when money and code mix.

So the whole system is releasing new coins at a more or less constant rate, no one can speed it up no matter how much they spend, and it would be impossible to stop it without shutting computers down all over the world. Given bitcoin’s libertarian leanings, perhaps it’s not surprising that this is very much in line with Milton Friedman’s theory that a steady percentage increase in the money supply is the best policy. Then again, a constant coin mining rate also means a constant transaction processing rate, so perhaps this is merely a convenient choice for a payments system.

Either way, this arrangement is economic policy written in code. If the whole world ran on bitcoin, there would be no way to manage recessions — for example to increase employment — by adjusting the money supply.

The geopolitics of the block chain

All cryptography has politics, and bitcoin is no exception. It appeals particularly to a certain sort of techno-libertarian: why should the banks say what is money and when we can trade it? Why should they make all of our financial privacy decisions for us? And why should we have to trust any one person with our money?

But then again, these may not be particularly motivating problems for most people. Although a wide variety of merchants will now accept bitcoin for a wide variety of goods and services, like cash it’s well suited for shady deals, especially given its global and anonymous nature. It’s impossible to say for sure, because that’s what anonymity is, but gray markets are likely the predominant use. But we do know that 70% of global trading volume and more than 50% of mining occur in China. This may be nothing more than peer-to-peer commerce, or it may indicate that bitcoin is at the center of a Renminbi-denominated hawala network of underground money transfers.

China, and other governments, have unsurprisingly taken steps to discourage bitcoin use, and bitcoin is now restricted or officially banned in many countries. There are potentially good reasons for this, such as the ability to prevent terrorist financing and money laundering, just as the international financial system has implemented progressively tighter controls. There are also potentially bad reasons to restrict bitcoin, depending on your politics: state mismanagement of the money supply, protections for incumbent banks, pernicious regulation of capital flows, or authoritarian surveillance of commerce.

But if you have an uncensored internet connection and the right software, no one can stop you from trading in bitcoin. Once again, the network proves to be a great equalizer between citizens and states. The Cypherpunks understood very early on that encrypted communication enabled uncensorable distributed coordination, and that this would challenge the power of states. But cryptographic money promises something even more revolutionary: state-free trade, economically significant transfers of cold hard currency. It’s a much bigger hammer.

What we didn’t think carefully enough about, back then, was who would be using these tools. Encrypted communication has supported the toppling of autocratic regimes, but it also supports terrorism. Bitcoin miners, too, might have diverse goals.

The cryptographic consensus algorithm currently in use by every bitcoin node dictates that the majority defines which transactions get added to the global ledger and hence validated. Which means that Chinese miners now effectively control bitcoin. In principle, all Chinese operators could collude to allow double spending of their coins. Or they could “hard fork” the protocol at any moment simply by adopting a new standard. Everyone else would have to go along, or their existing bitcoins would be worthless.

Thus the bitcoin protocol is already the subject of international diplomacy, as when American entrepreneurs visited China to lobby for capacity-enhancing changes (they failed). Running a specific version of the bitcoin software and maintaining a specific version of the ledger data is in effect a vote. However, this hasn’t prevented vigorous arguments and campaigns about how those votes should be cast.

Meanwhile the “bitcoin core” developers also have substantial but not absolute influence, as they maintain the standard open source implementation of the protocol. They can’t make anyone go along with changes, but it sure could be inconvenient if you didn’t want to. And what happens if they don’t all agree?

From cryptocurrency to crypto-contracts

The cryptographic innovations of bitcoin are public and easily copied, and profitable if you get in first on a successful new currency. So naturally there has been a dizzying array of “altcoin” implementations with varying degrees of adoption and stability. The most interesting altcoins add new features, such as extended capacity or new transaction types.

But there’s one altcoin that does something truly new and interesting: Ethereum allows software-defined transactions. That is: a transaction can contain code which executes to determine who gets paid what, or more generally to perform any computation and store the results in the public ledger (block chain). The Ethereum Foundation, a Swiss non-profit, says that Ethereum is a “decentralized platform that runs smart contracts: applications that run exactly as programmed without any possibility of downtime, censorship, fraud or third party interference.”

This is science-fiction stuff. First computational contracts, then AI lawyers, all executing on the open source Justice operating system… We’re not quite there yet, but Ethereum is the proof of concept. Like bitcoin, it extrapolates legitimately interesting technical innovation into a soaring anti-authoritarian dream. A “smart contract” is a financial contract defined by code. You cryptographically sign onto it, perhaps by making a payment, and the contract then executes on the network, does its calculations, and ultimately makes payouts. As long as the majority of the Ethereum network is operating honestly, you get paid exactly what the code says you will get paid; neither the seller nor anyone else can alter the terms after the fact. No courts are needed to enforce the terms, no intermediaries are involved, no trust is required.

If bitcoin is state-free money, smart contracts are state-free financial instruments that are fully transparent and make fraud impossible. Except no, of course not, just like encryption didn’t make anonymity easy, and for the same reason: there are systems outside the computer.

Cryptocurrencies are made of people

In one sense Ethereum’s libertarian promise is all true: just as bitcoin nodes validate transactions by consensus, all Ethereum nodes collectively enforce the code-as-contract guarantee. In another sense it’s completely bogus: the computers are still controlled by people, as a significant hack demonstrated.

The story goes like this: the Ethereum community crowd-funded an initial investment of $150 million to seed a “decentralized autonomous organization,” or DAO. This was one of the core visions that excited Ethereum proponents, and a DAO is probably the most cyberpunk entity of all time. As CoinDesk described it:

It’s likely best to think of The DAO — which is sometimes also referred to as The DAO Hub — as a tightly packed collection of smart contracts written on the Ethereum blockchain.

Taken collectively, the smart contracts amount to a series of by-laws and other founding documents that determine how its constituency — anyone around the world who has bought DAO tokens with ethers — votes on decisions, allocates resources and in theory, creates a wide-range of possible returns.

Unlike a traditional company that has a designated managerial structure, The DAO operates through collective voting, and is owned by everyone who’s purchased a DAO token. On top of that structure is a group of so-called Curators that can be elected or removed by DAO token holders.

Got that? It was a fund where the choice of investments, the election of officers, and other such matters were all done by submitting votes as transactions on the Ethereum block chain, to be interpreted by code previously placed there. And just to make it even more like a William Gibson novel, nobody knew exactly who created this entity! Of course there was an Ethereum address attached to code that created the DAO, but Ethereum addresses are anonymous.

And then it was hacked. Maybe. Depending on your point of view. What actually happened is that someone found and exploited a subtle bug in the DAO’s code, causing it to pay the equivalent of $60 million into their Ethereum account.

Software is a subtle thing, and it’s extremely difficult to write bug-free code on the first go. It’s even harder to write code that will stand up against a malicious attacker who stands to make a life-altering amount of money if they break your system. In the event, the bug involved a problem with re-entrancy in the payout function, for which the proposed solution was to guard your disbursement code with mutexes. If you’re not a computer scientist, this is just technical detail. If you are a computer scientist, your skin is crawling. Long experience has shown that correct reasoning about these types of problems is nearly impossible for mere mortals. If smart contracts require super-human intelligence to prevent fraud, we’re in trouble.
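To make the re-entrancy problem concrete, here is a toy sketch in Python. It is not the DAO’s actual code (real Ethereum contracts are written in languages such as Solidity), and every class and method name is invented for illustration. The bug is one of ordering: the fund hands control to the recipient’s code before it has updated its own books, so the recipient can ask for the same payout again and again.

```python
class NaiveFund:
    """Toy fund that pays out BEFORE updating its books (the bug)."""

    def __init__(self):
        self.balances = {}  # what each member is entitled to withdraw
        self.pool = 0       # total money actually held by the fund

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.pool += amount

    def withdraw(self, who, on_receive):
        amount = self.balances.get(who, 0)
        if amount == 0 or self.pool < amount:
            return
        self.pool -= amount
        on_receive(amount)      # control passes to the recipient's code...
        self.balances[who] = 0  # ...and only afterwards is the balance cleared


class SaferFund(NaiveFund):
    """Same fund, but the books are updated before any money leaves."""

    def withdraw(self, who, on_receive):
        amount = self.balances.get(who, 0)
        if amount == 0 or self.pool < amount:
            return
        self.balances[who] = 0  # clear the entitlement first
        self.pool -= amount
        on_receive(amount)      # a re-entrant call now finds a zero balance


class Attacker:
    """Recipient whose 'receive' hook immediately asks for another payout."""

    def __init__(self, fund, name):
        self.fund = fund
        self.name = name
        self.stolen = 0

    def receive(self, amount):
        self.stolen += amount
        self.fund.withdraw(self.name, self.receive)  # re-enter the fund
```

If an attacker deposits 10 into a NaiveFund that also holds 90 of someone else’s money, a single withdraw call drains the whole pool, because each nested call still sees the original balance of 10. Against SaferFund the same attacker gets back only their own 10. Reordering the bookkeeping is a different remedy from the mutex guard mentioned above, and neither makes this kind of code easy to reason about.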

What happened next is even more interesting. The human-language contract explaining the terms of buying into the DAO explicitly stated that the code was the final legal word. So perhaps the DAO code was buggy, but caveat emptor, the hacker played by the rules of the game and got paid. Maybe they weren’t even a hacker, but merely a savvy investor who understood a little-known clause in the contract. If the code is law, then whatever the code allows is legal by definition. Ultimately, the morality of this move is a question outside of the code itself. And that’s the problem.

The majority of people involved in Ethereum felt that investors should get their money back. But a sizable minority disagreed — they truly believed in this “code as law” model. And so both the community and the block chain split: there was a hard fork. To this day, there are two parallel Ethereum ledgers. In one world, the DAO hack was reversed. In the other universe, now called Ethereum Classic, the hacker got to keep their money.

Real financial contracts have lawyers and courts and precedents that provide a procedure for resolving disputes, and give investors reasonable safeguards. Discarding those institutional frameworks has a cost.

Turing’s Demons

Computer code can do arbitrarily subtle and weird things. This is a deep property of computation that we’ve known about since before the first electronic computers were actually built. Once a programming system reaches a minimal threshold of complexity, known as Turing completeness, there is a set of inter-related theorems that say, basically, you will never be able to tell what a program does without actually running it. It follows that it will always be possible to hide something malicious in financial code. If code is law, you’re going to get scammed. Legally.

One solution is to avoid general purpose code. There are strong parallels to recent computer security research that argues that all user inputs to a system need to be limited to very restricted, non-programmable languages. It’s just impossible to secure arbitrary code. And indeed, Wall Street already has purpose-specific languages for specifying financial contracts (such as derivatives) without invoking the disturbing power of general computation.

Turing completeness is a gateway to the dark powers. It freaks me out to imagine traders submitting contracts written in code to an exchange. It’s already tricky to untangle the web of counter-parties, derivatives, and legal triggers that can lead to cascading crashes in the financial system — just wait until we throw code in there.

But we’re not going to be able to avoid all code. Even if traders aren’t allowed to use it to create new contracts, we need it for infrastructure. Every stock market, every financial exchange has code, and algorithmic trading is an entire industry. An increasing fraction of global transactions are handled by computers without any human review or intervention. This has led to weird behavior such as flash crashes, which are still unexplained, even in very simple, usually very stable markets like US Treasury securities. There isn’t even a single master audit record of every trade made, and there won’t be for years.

It gets even weirder when you add incentivized humans to the mix: financial players are going to exploit every edge case they can find. There’s a passage from Michael Lewis’ Flash Boys which describes the difficulty in setting up an exchange that can’t be gamed:

Creating a new stock exchange is a bit like creating a casino: Its creator needs to ensure that the casino cannot in some way be exploitable by the patrons. Or, at worst, he needs to know exactly how his system might be exploited, so that he might monitor the exploitation— as a casino monitors card counting at the blackjack tables. “You are designing a system,” said Puz, “and you don’t want the system to be gameable.” … From the point of view of the most sophisticated traders, the stock market wasn’t a mechanism for channeling capital to productive enterprise but a puzzle to be solved.

The designers of this new exchange, now known as IEX, spent months studying every type of order that could be submitted to other stock markets, and how these were exploited to rip off productive investors to the benefit of high-frequency traders (HFT). This is of course a moral judgement, and a judgment about what type of investor to privilege — and there are massive ongoing arguments about whether the current market structure that allows HFT is “fair.” But even if you know what you want your code to do, there’s no guarantee you’re going to get it. IEX found it incredibly difficult to avoid loopholes that could advantage high-frequency traders.

We are now, today, in our lifetimes, undertaking the process of turning capitalism into code. The code running our markets determines, literally, what is possible and who gets paid. Already, the cutting-edge of finance is basically nothing like “investing” as we usually think of it. It’s far more like hacking: find the properties of a complex system that get you the most money. Anything the exchange lets you do is legal, more or less. There are laws against market “manipulation” such as spoofing, but these terms are poorly defined ideas of fairness that don’t have simple technical definitions. Anyway, that’s only a problem if lawyers and regulators get involved. The code allows it.

I want our financial markets to be stable, transparent and fair. I want them to reward something other than the clever manipulation of an abstract system. And so I would argue for extreme simplicity in our electronic markets. Even very simple rules can spiral into complex consequences. Chess has more complex rules than Go, but it took computers 20 years longer to beat humans at Go. Recent game-theoretic work on algorithmic trading suggests that it’s going to be very hard to stabilize even very simple programs interacting with each other and with greedy humans.

Politics always wins in the end

Bitcoin and Ethereum are a kind of counter-power to established systems of money and finance. In the sense that many things are wrong with the current system and the powers-that-be are very hard to challenge, this is exciting. But the mistake is to think that code is enough. Wikileaks was premised on using cryptographic anonymization to protect their sources, but then Manning confessed to a freelance journalist. And all the encryption in the world could not protect Snowden from the NSA’s long reach; that required the cooperation of the Russian government.

The modern, automated stock market has already been gamed. In April 2016, an individual from Pakistan uploaded a fake document to the SEC’s EDGAR website, where public companies post their legally-required disclosures. Automated bots read the document and immediately traded on the false information, moving the stock price. The “hacker” made $425,000 in a matter of minutes, before anyone realized what was going on.

The Pakistan case was straightforward fraud, but it’s only going to get weirder. I have the Cypherpunk premonition again. Crypto-contracts shift the balance of power, and once again, a small group of people — this time, financial cryptographers — is playing with home-made nitroglycerine. Eventually it will blow up. Eventually, someone will do something with global consequences. Maybe they’ll make off with billions; maybe they’ll crash the economy of an entire country, or the world.

This will start a really big fight. And the lesson I’ve learned is that the code, while powerful, never has the last word. Eventually, there will have to be a legal and political settlement about what code the global financial markets should run on. Ultimately, the code runs on people, not the other way around.

Yet code still has enormous influence. Financial technologists are now engaged in writing the code that will determine the future shape of the economy. Code is like architecture: it’s a built environment that determines where you can and cannot go. The code that the markets run on implicitly determines our economic policy. It sets the shape of financial hacking. It very literally decides who gets what.

So what economic policy do we want our code to embody? And given the complexity of computation, how can we be sure that this is what our code actually does? The answer is that we probably can’t, and the only solution is to get clear about our goals, and the legal and political mechanisms for resolving our arguments, before we inevitably discover that our software allows something we never intended.

I can do no better than to end with a quote from security researcher Eleanor Saitta:

Repeat after me: all technical problems of sufficient scope or impact are actually political problems first.

The Origin of Banking

Jonathan Stray — Tue, 27 Sep 2016 18:43:05 +0000

There is a just-so story that explains the existence of money. Before money, the story goes, we all had to barter for the goods we wanted. If I wanted wheat and had chickens, I needed to find someone who wanted chickens and had extra wheat. Money solves this “double coincidence” problem by letting me sell my chickens to buy your wheat. If we didn’t have money we’d invent it immediately.

The problem with this simple story is that it may not match history. There has never been a pure barter economy, according to anthropologists. Pre-money economies were organized in a variety of other ways, including central planning, informal gift economies, and IOUs denominated in cows.

Sir John Hicks’ classic A Market Theory of Money fills this gap. Hicks was a major figure in 20th-century economics who eventually won a Nobel, and here at last is a straightforward story that explains why we have banks at all. It’s still not clear to me that this account is historically grounded – or that we can understand what a modern bank does, or should do, on the basis of historical parable – but at least this account provides a better history than barter.

With that cautionary note, here’s Hicks’ story of banking. He begins in a world where money is already the usual form of payment, and breaks down a transaction into three pieces:

Buyer and seller reach an agreement on what is to be sold at what price
Seller delivers the goods
Buyer delivers the cash

Step 1 has to come first, but payment and delivery may come in any order at any time after that, depending on the agreement that the parties made. The gap between contracting and payment is credit. Credit is a very old idea, and central to modern economies. Hicks argues that “payment on the spot” is actually the uncommon case, at least for orders over a certain size:

I may pay spot for a newspaper as I walk along the street, but I may also give an order to a newsagent to deliver a copy to my house each morning. I should not then pay for each issue as I received it; I should wait until the end of the month when he sent in his bill. … It is probably true that only for small transactions – small that is, from the point of view of one or other of the parties concerned – that the spot method of payment is ordinarily preferred. People are not, and never have been, in the habit of carrying about them a sufficient quantity of coin or notes to pay for a house or pay for furnishing it.

The key observation is that credit is typical, not extraordinary. Any time we pay a bill – whether at a restaurant or for a credit card – we have been extended credit.

In the gap between contracting and payment there is debt, and debt is measured in money (at least on the buyer’s side; for the seller, debt is measured in goods or services owed.) There has long been argument over what exactly money is, or more usefully what it does, but in a credit-based theory of the economy it has two clear roles:

We seem thus to be left with two distinguishing functions of money: standard of value and medium of payment. Are they independent, or does one imply the other? It is not easy to see that there can be payment, of a debt expressed in money, unless money as a standard has already been implied in the debt that is to be paid. So money as a means of payment implies money as a standard. But could a debt expressed in money be discharged other than in money? Surely it could.

It could for instance be set off against another debt, the debt from A to B being cancelled against a debt from B to A.

This is Hicks’ entry into the concept of an IOU, which seems to be fundamental to modern finance – perhaps the fundamental idea, the notion underlying every financial instrument of every kind. Yes, you can pay money to settle a debt, but you can also cancel one debt against another, netting the debts. This means that a debt owed to you has monetary value! From there, it’s a small step to the idea that a third party debt can be used as a form of payment. Suppose B owes A a debt, and C owes B a debt of a different amount.

A is then asked to accept part payment in the form of a debt from C to B, which is to offset the balance of debt between A and B, a balance we take to be in favour of A. But A can hardly be expected to consent to such an arrangement unless he considers that C is to be trusted. So there is a question of trust, or confidence, as soon as a third party is brought in.

This short paragraph states a pattern that has been at the core of trade for centuries, and is at the core of finance today: the transferability of debts made possible by the assurance of good credit. This was a common pattern in the trade fairs of Renaissance Italy, where merchants would meet to settle tangled webs of IOUs with each other and with the banks. It happens today when a bank B lends money to A to buy a house, creating a mortgage debt from A to the bank, then sells the right to collect that debt to another bank C. For this to happen, B has to guarantee to C that A is creditworthy enough to repay. It’s less obvious, but equally applicable, when A pays B by check. B doesn’t have “money” when they have the check, but a promise from A’s bank to pay. But we’re not there yet. Here’s how Hicks builds up to tradable debt:

The quality of debt from a particular trader depends on his reputation: it will regularly be assessed more highly by those who are in the habit of dealing with him, and know that he is accustomed to keeping his promises, than by those who do not have the advantage of this information.

Thus the value of a debt is sensitive to information. It’s not clear to me whether anyone would have used this language in, say, Renaissance Italy. Hicks, writing in 1989, would have been influenced by recent, eventually Nobel-winning work on information in economics. The information view of value explains how formerly solid debt-based assets – for example, mortgage-backed securities – can evaporate almost instantly if there is a credible threat of non-payment. Although debt is tradable like money in the good times, in bad times everyone wants hard currency, not promises to pay currency.

Hicks says that this information problem – really a reputation problem – leads to the creation of a market for guarantees that a bill will be paid.

Thus we may think of each trader as having a circle of traders around him, who have a high degree of confidence in him, so that they are ready to accept his promises at full face value or near it; there is no obstacle to offsetting of debts within that circle from lack of confidence in promises being performed. If he wants to make purchases outside his circle he will not be so well placed. Circles however may overlap: though C is outside A’s circle, he may be within the circle of D, who himself is inside the circle of A. Then though A would not accept a debt from C if offered directly, he may be brought to accept it if it is guaranteed by D, whom he knows. D is then performing a service to A, for which he may be expected to charge.

This is a market for acceptances of bills of exchange, a financial instrument that seems to be uncommon today, but was quite common in 19th century merchant banking, as described by Bagehot. From the merchant’s point of view, the objective of all of this is to get paid sooner. Suppose you sell goods to B in exchange for a bill with a specified due date, say 60 days from now. That’s a debt, and one way to use it is just to wait 60 days until B pays cash. Or you could trade that debt immediately with any person C who thinks B trustworthy – perhaps for cash, perhaps to settle a debt with C, perhaps to make a purchase. If you wanted to trade with someone who didn’t know B, you could first buy an acceptance and then trade the bill together with the acceptance – the debt together with its guarantee.

An acceptance is basically insurance: a guarantee that a bill will be paid, in fact the promise to pay the bill if the debtor defaults. The guarantor is willing to do this because they know, or have some way to evaluate, the reputation of the debtor, and because they charge a fee for the service. The equivalent modern instrument would be something like a “credit default swap.”

Notice that we don’t have banks yet. Instead, this story tells of the creation of the first markets for debt, facilitated by intermediaries who guarantee repayment. And here’s where I start to hesitate. It’s a nice story, certainly an account of how this could have happened. But I’m not enough of a historian to know if this is how it did happen. Like barter, this could be an attractive myth. Still, it’s a useful account of a problem to be solved – how do I trade my debt assets outside the small circle of people who know the debtor personally? For if debt is tradable, then I suddenly have a lot more capital at my disposal: not just the cash I’m currently holding, but all of the cash that is owed me.

According to Hicks, the next step in the story of banking is the creation of two special kinds of intermediaries. The first sells guarantees (acceptances) on debt. They either know many debtors well, like a credit rating agency, or they know where to buy acceptances from someone who does know. The other kind of intermediary pays cash for debts along with their acceptances, at a discount from the face value of the debt, that is, a fee.

Until that point, the principal reason why the market value of one bill should differ from another is the difference in reliability; but bills, between which no difference in reliability is perceived, may still differ in maturity. A trader who is in need of cash needs it now, not (say) six months hence. So there is a discount on a prime bill which is a pure matter of time preference – a pure rate of interest.

Here Hicks is distinguishing between what would today be called credit risk, the risk of non-repayment which is solved by acceptances, and “time preference,” that is, the advantage of having cash now instead of later, which costs interest. These two components of price (and more besides, such as liquidity risk) are implicit any time debt is sold. Hicks believes that separating these out is a necessary step to explaining how banking arose:

The trouble is that the establishment of a competitive market for simple lending is not at all a simple matter. The lender is paying spot, for a promise the execution of which is, by definition, in the future. Some degree of confidence in the borrower’s creditworthiness – not just his intention to pay but his ability to pay, as it will be in the future – is thus essential to it. There cannot be a competitive market for loans without some of this assurance.

It’s a fine argument, and indeed there is always both credit risk and time preference in lending (or buying debt for cash, which is nearly the same thing.) But I’m not convinced this is a historical account. Did the biblical money lenders really distinguish between credit risk and time preference when they set their interest rates? I suspect the answer is no. Yet surely there were reasons that the first money lenders came into existence – that is, motivating problems that offer hints as to what banks do. It may be possible to offer an account of the creation of banking which is simpler, more motivating, and more historical than Hicks’ story of acceptances and discounts.
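The two components of price can be separated in a back-of-the-envelope calculation. The sketch below is an illustration, not Hicks’ own formula: it assumes simple (non-compounding) interest, assumes the holder recovers nothing if the debtor defaults, and uses made-up parameter names.

```python
def bill_price(face_value, days_to_maturity, annual_rate,
               default_probability=0.0):
    """Cash a discounter might pay today for a bill due in the future."""
    # Credit risk: on average, only part of the face value gets paid.
    # An acceptance (guarantee) would push default_probability toward zero.
    expected_payment = face_value * (1 - default_probability)
    # Time preference: cash later is worth less than cash now.
    time_discount = 1 + annual_rate * days_to_maturity / 365
    return expected_payment / time_discount


# A guaranteed ("prime") 60-day bill: the discount is pure interest.
prime = bill_price(1000, 60, annual_rate=0.05)

# The same bill without a guarantee, from a debtor with a 2% chance of default.
risky = bill_price(1000, 60, annual_rate=0.05, default_probability=0.02)
```

Here the gap between the face value and `prime` is the pure rate of interest, and the further gap between `prime` and `risky` is roughly what a guarantor could charge for an acceptance.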

From a network of discounters for bills – intermediaries who are willing to lend cash against a guaranteed debt – we finally come to banks proper. Hicks explains the origin of banking by asking how trade volume could ever increase when there’s a fixed supply of cash among merchants:

What then is to happen if trade expands, so that more bills are drawn, and more come in to be discounted? Where is the extra cash that is needed to come from? Any one of the dealers could get more cash by getting other dealers to discount bills that he holds. But the whole body of dealers could not get more that way. They must get cash from outside the market. They themselves must become borrowers.

The solution was to combine this business with another sort of business, which in the days of metallic money we know to have already made its appearance.

This other business is goldsmiths, or perhaps moneychangers, both of which would store a customer’s coins in their vault. There has long been a need for secure money storage, something better than cash under the mattress. The innovation of modern banking is to recognize that not every customer will withdraw all their coins all at once.

Then, once that happens, there will be a clear incentive to bring together the two activities – lending to the market, and ‘borrowing’ as a custodian from the general public – for the second provides the funds which in the first are needed. At that point the combined concern will indeed have been becoming a bank.

This, then, is the essence of banking: take deposits, make loans. Crucially, a loan is in fact borrowing from the depositors and lending at a different time scale (maturity transformation). The debtor pays back the loan at a later date, or perhaps little by little as with a mortgage, but the depositors can demand the whole of their account as cash at any moment. It is only by hoping that not everyone wants their cash back all at the same time that a bank can exist. A bank which did not borrow from its depositors would be incapable of extending credit, at least beyond the capital that its owners are putting in.

This is the “fractional reserve” system which has existed for centuries. Banks are by law allowed to lend at most some large fraction of their deposits, hedging against many depositors asking for their money back all at once (though note, today banks can always borrow more reserves from the central bank, so capital reserve requirements don’t really constrain lending.)

On this account, the central features of a bank are:

Taking deposits, which can be withdrawn at any time
Making loans, which are repaid slowly

In other words: short-term borrowing to finance long-term lending. There’s nothing surprising in this description, and it captures the inherent risk-taking in banking, since it may happen that everyone wants their deposits back as cash all at once. Today, US bank deposits are insured up to $250,000 per account by the FDIC, which simultaneously pays out if needed and makes payout less necessary since the guarantee makes bank runs less likely.
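A minimal sketch of this maturity mismatch follows. The class, its method names, and the 10% reserve requirement in the example are all hypothetical, and the bank here lends out physical cash, which is the simple picture at this stage of the story.

```python
class FractionalReserveBank:
    """Toy bank: deposits are withdrawable on demand, loans are not."""

    def __init__(self, reserve_percent):
        self.reserve_percent = reserve_percent  # share of deposits kept as cash
        self.deposits = {}          # what the bank owes each depositor
        self.vault_cash = 0         # cash actually on hand
        self.loans_outstanding = 0  # cash lent out, repaid only slowly

    def deposit(self, who, amount):
        self.deposits[who] = self.deposits.get(who, 0) + amount
        self.vault_cash += amount

    def lend(self, amount):
        required = sum(self.deposits.values()) * self.reserve_percent // 100
        if self.vault_cash - amount < required:
            raise ValueError("loan would breach the reserve requirement")
        self.vault_cash -= amount
        self.loans_outstanding += amount

    def withdraw(self, who, amount):
        if amount > self.deposits.get(who, 0):
            raise ValueError("depositor does not have that much on account")
        if amount > self.vault_cash:
            raise RuntimeError("bank run: not enough cash in the vault")
        self.deposits[who] -= amount
        self.vault_cash -= amount
```

With a 10% requirement, two deposits of 100 allow loans of up to 180, leaving 20 in the vault. The first 20 of withdrawals succeed. The next one fails even though the depositor’s balance is perfectly valid, which is the bank run in miniature.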

But something key is missing: the banks’ central role in the payment system allows them to create high quality tradeable debt — that is, bank deposits. For most people, bank balances are money, therefore a bank can create money. To explain how, Hicks examines the evolution of a bank’s debt to its depositors.

It would however always have happened that when cash was deposited in the bank, some form of receipt would be given by the bank. If the receipt were made transferrable, it could itself be used in payment of debt, and that should be safer [than moving cash around physically.]

Hicks uses this idea of a “receipt” to trace the development of the bank check, and from there the modern reality that bank deposits are money as far as you and I are concerned. Consider how one person pays another through the banking system:

It would at first be necessary for the payer to give an order to his bank, then to notify the payee that he had done so, then for the payee to collect from the bank. Later it was discovered that so much correspondence was not needed. A single document, sent by debtor to creditor, instructing the creditor to collect from the bank, would suffice. It would be the bank’s business to inform the creditor whether or not the instruction was accepted, whether (that is) the debtor had enough in his account in the bank to be able to pay.

The key point is that with a check – or with any of our modern means of electronic payment – no cash is ever withdrawn from the bank! We have simply rewritten the amounts owed by the banks to each customer. That is, if I pay you, my bank owes me less cash and your bank owes you more. It would be no problem if in fact my cash had been loaned out to someone else during this whole time.
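A payment of this kind can be sketched as nothing more than an edit to a ledger. The version below is a simplification that assumes both parties hold accounts at the same bank; the function name, account names, and amounts are hypothetical.

```python
def pay(balances, payer, payee, amount):
    """Return new balances after a check-style payment within one bank.

    The bank refuses the instruction if the payer's balance is too small,
    which is the acceptance step Hicks describes.
    """
    if balances[payer] < amount:
        raise ValueError("instruction refused: insufficient funds")
    updated = dict(balances)
    updated[payer] -= amount   # the bank now owes the payer less
    updated[payee] += amount   # and owes the payee more
    return updated


vault_cash = 20                      # assumed: less than total deposits
deposits = {"me": 50, "you": 30}     # what the bank owes each customer

after_payment = pay(deposits, "me", "you", 15)
```

Notice that `vault_cash` never appears inside `pay`. The payment goes through even though the vault holds far less than the total owed to depositors, because only the record of who is owed what has changed.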

In fact, banks don’t really loan cash either.

When the bank makes a loan it hands over money, getting a statement of debt (bill, bond, or other security) in return. The money might be taken from cash which the bank had been holding, and in the early days of banking that may have often happened. But it could be all the same to the borrower if what he received was a withdrawable deposit in the bank itself. The bank deposit is money from his point of view, so from his point of view there is nothing special about this transaction. But from the bank’s point of view, it has acquired the security without giving up any cash; the counterpart, in its balance-sheet, is an increase in its liabilities. … But from the point of view of the rest of the economy, the bank has ‘created’ money. This is not to be denied.

Bank deposits are not cash. They are debt to customers. But we are happy to have bank deposits because banks have become the way in which we pay each other. When someone owes us money, we are satisfied with “money in the bank” rather than cash in hand. This all goes back to the tradability of debt that Hicks started with: Banks create debt of high quality, that is, debt which can be traded from hand to hand nearly as well as cash, or perhaps better. In that very real sense, banks create money.
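The balance-sheet mechanics Hicks describes can be sketched in a few lines. This is an illustration under assumed figures, with a hypothetical `make_loan` helper, and it ignores reserve and capital rules entirely.

```python
def make_loan(sheet, amount):
    """Lend by crediting the borrower's deposit instead of handing over cash."""
    updated = dict(sheet)
    updated["loans"] += amount      # asset: the borrower's promise to repay
    updated["deposits"] += amount   # liability: a new withdrawable deposit
    return updated


# Assumed starting point: customers deposited 100 in cash, nothing lent yet.
before_loan = {"cash": 100, "loans": 0, "deposits": 100}
after_loan = make_loan(before_loan, 80)

# Deposits are what customers treat as money.
money_created = after_loan["deposits"] - before_loan["deposits"]
```

After the loan the bank still holds the same cash, but customers collectively hold 80 more in deposits than before. The books still balance, because the new liability is matched by the new asset.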

Hicks has tried to explain how this combination of features we call banking arose. The story he gives is a progression from debt trading, to acceptances, to loaning cash against bills, to merging with the money storage industry, to a central role in the payment system, to creating money by lending some large fraction of customer deposits. I am not convinced that banks actually arose in this sequence. However, it does highlight some of the needs and problems that led to the creation of banking. Namely: consumers need a place to store money, businesses need credit, and everyone needs a payment system.

But the advantage of Hicks’ telling is that it highlights the central role of credit/debt. As soon as debt can be freely traded to a third party, it is a kind of money. It is this trust that allows a bank to create money. How much money a bank should create is a different question, which Hicks’ parable cannot answer. Rather, I find this pattern of tradable debt useful for thinking about all the different forms that banking takes, including by the many financial institutions that don’t call themselves banks.