<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7118563403027467631</id><updated>2026-03-28T13:31:34.300-04:00</updated><category term="research"/><category term="mechanical turk"/><category term="crowdsourcing"/><category term="academia"/><category term="wisdom of the crowds"/><category term="odesk"/><category term="prediction markets"/><category term="online labor"/><category term="teaching"/><category term="peer reviewing"/><category term="statistics"/><category term="hcomp"/><category term="large datasets"/><category term="evaluation"/><category term="humor"/><category term="open access"/><category term="google"/><category term="efficient markets"/><category term="human computation"/><category term="presidential elections 2008"/><category term="reputation"/><category term="surveys"/><category term="amazon"/><category term="conference"/><category term="data mining"/><category term="incentives"/><category term="quality"/><category term="reviews"/><category term="spam"/><category term="wikipedia"/><category term="acm"/><category term="education"/><category term="market for lemons"/><category term="search"/><category term="advice"/><category term="dagstuhl"/><category term="demo"/><category term="economics"/><category term="fraud"/><category term="machine learning"/><category term="online markets"/><category term="pricing"/><category term="publishers"/><category term="ranked xml querying"/><category term="Hillary Clinton"/><category term="Mitt Romney"/><category term="Rudy Giuliani"/><category term="adsafe"/><category term="api"/><category term="bayesian"/><category 
term="cheating"/><category term="cloud computing"/><category term="clustering"/><category term="comic"/><category term="computer science"/><category term="embed"/><category term="frequentist"/><category term="image classification"/><category term="intellectual property"/><category term="mooc"/><category term="moocs"/><category term="online advertising"/><category term="power law"/><category term="probability"/><category term="tutorial"/><category term="visualization"/><category term="wikis"/><category term="FROC"/><category term="ROC"/><category term="aca"/><category term="adwords"/><category term="assembly line"/><category term="attack"/><category term="brand safety"/><category term="businessweek"/><category term="call for papers"/><category term="cds"/><category term="cfp"/><category term="charity"/><category term="cognitive dissonance"/><category term="csdm"/><category term="customer service"/><category term="data cleaning"/><category term="deduplication"/><category term="demographics"/><category term="dirichlet"/><category term="drm"/><category term="ec2012"/><category term="extreme value theory"/><category term="facebook"/><category term="funny"/><category term="get-another-label"/><category term="gmail"/><category term="google spreadsheet"/><category term="growth"/><category term="honda"/><category term="independence"/><category term="industry analysis"/><category term="information extraction"/><category term="information theory"/><category term="interviews"/><category term="intrade"/><category term="keyword bidding"/><category term="kyc"/><category term="lda"/><category term="majority voting"/><category term="markets"/><category term="mashape"/><category term="meetup"/><category term="merger"/><category term="microsoft"/><category term="mind maps"/><category term="minimum wage"/><category term="newsweek"/><category term="nyu"/><category term="online education"/><category term="outliers"/><category term="payment"/><category term="postdoc"/><category 
term="powerpoint"/><category term="presentation"/><category term="privacy"/><category term="propublica"/><category term="psychology"/><category term="readability"/><category term="reduced models"/><category term="regulations"/><category term="skills"/><category term="slides"/><category term="string matching"/><category term="structural models."/><category term="synonyms"/><category term="tagasauris"/><category term="trec"/><category term="typesetting"/><category term="web service"/><category term="wikisynonyms"/><category term="www2011"/><category term="yahoo"/><category term="youtube"/><title type='text'>A Computer Scientist in a Business School</title><subtitle type='html'>Random thoughts of a computer scientist who is working behind the enemy lines; and lately turned into a double agent.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://www.behind-the-enemy-lines.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>257</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-2166968507309243978</id><published>2026-03-28T12:08:00.005-04:00</published><updated>2026-03-28T12:12:25.163-04:00</updated><title type='text'>Taste Is Not Enough. Reality or Bust</title><content type='html'>&lt;p&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEi0D91beEXdv5TwZxaW67nbotfHHeDPxn8HJqgJxMVlrLxT67udP48cKUq738sX6pbUCTyT3fdM-uvXepnPm2AOZ0xyMaSXvCSh21M1BYvmuGiX-xx8mHnihnDWjsJJzfs4sO8DqV7P55z1mvijXMg-nrYgS0ylW_5LvMJtq5xMgbw4T8oWYfAwg-imhVs&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;559&quot; data-original-width=&quot;1024&quot; height=&quot;219&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEi0D91beEXdv5TwZxaW67nbotfHHeDPxn8HJqgJxMVlrLxT67udP48cKUq738sX6pbUCTyT3fdM-uvXepnPm2AOZ0xyMaSXvCSh21M1BYvmuGiX-xx8mHnihnDWjsJJzfs4sO8DqV7P55z1mvijXMg-nrYgS0ylW_5LvMJtq5xMgbw4T8oWYfAwg-imhVs=w400-h219&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;A &lt;a href=&quot;https://www.nature.com/articles/s41586-026-10265-5&quot;&gt;paper published in Nature on March 25, 2026&lt;/a&gt; describes &quot;The AI Scientist,&quot; a system built by Sakana AI that automates the full cycle of scientific research: idea generation, experiments, analysis, writeup, even peer review submission.&lt;/p&gt;
&lt;p&gt;The marginal cost of producing paper-shaped research output is collapsing. &lt;br /&gt;&lt;br /&gt;So, when paper production becomes cheap, what is next?&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;100,000 axiom systems and counting&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Stephen Wolfram spent years &lt;a href=&quot;https://www.wolframscience.com/metamathematics/axiom-systems-in-the-wild/&quot;&gt;systematically enumerating all possible axiom systems&lt;/a&gt;. Each axiom system defines a &quot;possible universe of mathematics&quot;: a different set of starting rules, a different universe of theorems. Most universes are empty or trivial. Of the ones that are not trivial, many are bizarre. Our entire familiar mathematics occupies a tiny corner of this space. Logic, specifically Boolean algebra, turns out to be &lt;a href=&quot;https://writings.stephenwolfram.com/2018/11/logic-explainability-and-the-future-of-understanding/&quot;&gt;perhaps the hundred thousandth axiom system&lt;/a&gt; you would encounter if you enumerated them by complexity.&lt;/p&gt;
&lt;p&gt;Wolfram found nothing obviously special about the axiom systems we actually use. His suspicion (not a proof) is that we study them for largely historical reasons: they are generalizations of arithmetic and geometry from ancient Babylon. The space of possible mathematics is vast, our explored corner is small, and we are in this particular corner because of history, not because of any intrinsic property of the systems themselves.&amp;nbsp;&lt;/p&gt;&lt;p&gt;Why do we use them? Because &lt;em&gt;humans decided they were interesting&lt;/em&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;The AI Scientist has the same problem, one level up&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The AI Scientist can generate research ideas, execute experiments, and write papers. Sakana &lt;a href=&quot;https://sakana.ai/ai-scientist/&quot;&gt;reports a cost&lt;/a&gt; of roughly $15 per paper. One of three papers &lt;a href=&quot;https://www.nature.com/articles/d41586-026-00899-w&quot;&gt;passed peer review at an ICLR workshop&lt;/a&gt; (not the main conference track), with &lt;i&gt;humans filtering the most promising outputs&lt;/i&gt; before submission. The system can produce formally structured research outputs. It cannot yet tell which ones matter.&lt;/p&gt;
&lt;p&gt;The problem is not just quality filtering. A &lt;a href=&quot;https://www.nature.com/articles/s41586-025-09922-y&quot;&gt;separate study in Nature&lt;/a&gt; earlier this year analyzed 41.3 million research papers and found that scientists using AI tools publish three times more papers and get five times more citations. Great for individuals. But collectively, AI-driven research covers less topical territory. It clusters around already-popular problems.&amp;nbsp;&lt;/p&gt;&lt;p&gt;In the Wolfram analogy: a machine that evaluates &quot;interesting&quot; by pattern-matching against known mathematics will keep steering you back to the hundred thousandth axiom system and its neighbors. Lots of exploit, much less explore.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;So what is the actual scarce resource?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This matches &lt;a href=&quot;https://www.behind-the-enemy-lines.com/2026/02/everybody-is-ceo-now-and-what-exactly.html&quot;&gt;my own experience&lt;/a&gt; using AI agents for research and teaching. The agents are shockingly good at execution. Give them a clear task with well-defined scope and they deliver something genuinely useful, fast.&lt;/p&gt;
&lt;p&gt;But &quot;work on the next most important task&quot; only works if someone figured out what the important tasks are. The agent does not decide which questions matter. The moment you ask it to define its own scope, you get the AI Scientist problem: lots of output, most of it predictable, much of it wrong in ways that require domain expertise to even detect.&lt;/p&gt;
&lt;p&gt;The scarce resource is judgment. The ability to look at a vast space of possibilities and say: &lt;em&gt;this one&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;That is the comforting answer, anyway. AI does the grunt work. We provide the taste, the direction, the vision. We stay at the center of the universe.&lt;/p&gt;
&lt;p&gt;Except: that story is cope. Rich Sutton&#39;s &quot;&lt;a href=&quot;http://www.incompleteideas.net/IncIdeas/BitterLesson.html&quot;&gt;Bitter Lesson&lt;/a&gt;&quot; showed that every time researchers tried to hand-code human knowledge into AI systems (chess heuristics, vision algorithms, Go strategies), brute-force scaling eventually crushed the hand-coded approach. Human judgment about what matters may just be the next ontology in line to be bypassed. But even if it is not, history suggests it was never as reliable as we like to think.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;But the world has a vote&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Wolfram&#39;s enumeration is purely abstract. The axiom systems just sit there, inert. But science interacts with data from the world we observe. You hypothesize, you collect data, and reality tells you whether you are wrong. And that feedback loop has a history of promoting &quot;useless&quot; systems to central importance, often over the explicit objections of the people who understood them best.&lt;/p&gt;
&lt;p&gt;Godfrey Hardy, a godfather of number theory, wrote in 1940 that number theory had a kind of supreme uselessness, that no one had discovered any warlike or practical purpose for it, and it seemed unlikely anyone ever would. And he was proud of that uselessness, as a sign of the supreme taste of a pure mathematician.&amp;nbsp;&lt;/p&gt;&lt;p&gt;Thirty-one years after his death, RSA encryption arrived, and modern cryptography now depends heavily on the number theory Hardy was so proud to call pointless.&lt;/p&gt;
&lt;p&gt;Maxwell predicted electromagnetic waves in 1865 as a mathematical consequence of his equations. Hertz demonstrated them physically in 1887, and when his students asked what the discovery was good for, he replied: &quot;It is of no use whatsoever. This is just an experiment that proves Maestro Maxwell was right.&quot;&amp;nbsp;&lt;/p&gt;&lt;p&gt;Marconi built the wireless telegraph less than a decade later.&lt;/p&gt;
&lt;p&gt;Notice what Hardy and Hertz have in common. They were not amateurs. They understood their own discoveries better than anyone alive. Their &lt;em&gt;taste&lt;/em&gt; was extraordinary: out of the vast space of possible mathematics and physics, they picked systems that turned out to be profoundly important. But their forecasts of usefulness were completely wrong. Hardy looked at number theory and said: &lt;em&gt;this is beautiful and this is deep&lt;/em&gt;. He was right about that. He was wrong about what the world would do with it. Hertz looked at electromagnetic waves and saw a confirmation of Maxwell. He was right about that too. He could not see the wireless telegraph.&lt;/p&gt;
&lt;p&gt;The distinction matters. Taste selected the right systems to study. But taste could not predict what those systems would be &lt;em&gt;for&lt;/em&gt;. That was decided later, by technologies and applications that did not yet exist. The world retroactively decided which &quot;useless&quot; formal systems had been important all along.&lt;/p&gt;
&lt;p&gt;So the comforting story (&quot;AI does execution, we provide the visionary taste, we stay at the center of the universe&quot;) is incomplete.&amp;nbsp;Taste is real but taste without reality is flying blind. Entire fields operate this way: elegant theory frameworks that survive for decades because they never invite reality to correct them. And the people with the best taste in history still could not see where their work would land.&lt;/p&gt;&lt;p&gt;The right question is not &quot;who has the best taste?&quot; It is &quot;what kind of feedback loop lets reality surface the value that taste alone cannot see?&quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Can AI close that loop?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In some fields, it already has. Twenty years ago, Mechanical Turk returned human judgments through API calls. Now, autonomous &lt;i&gt;wet labs&lt;/i&gt; (&lt;a href=&quot;https://www.emeraldcloudlab.com/&quot;&gt;Emerald Cloud Lab&lt;/a&gt;, &lt;a href=&quot;https://go.strateos.com/strateos-software&quot;&gt;Strateos&lt;/a&gt;, &lt;a href=&quot;https://www.anl.gov/autonomous-discovery/rapid200-autonomous-lab-for-materials-and-chemistry&quot;&gt;RAPID-200&lt;/a&gt;) accept experimental protocols via API and return physical results without human hands touching anything. An AI agent can already design an experiment, submit it to a cloud lab, and get data back. The loop with physical reality is not a future idea. It is existing infrastructure.&lt;/p&gt;
&lt;p&gt;And still, the narrowing problem persists. Ten thousand automated experiments over a weekend still require someone (or something) to decide what experiments to run. The labs automate verification, not direction. Reality is the slowest, most expensive API there is. A clinical trial takes years. Growing a test crop takes a season. AI generates hypotheses at near-zero marginal cost, but verifying them against the physical world still costs capital and time.&lt;/p&gt;
&lt;p&gt;So, the question is what happens when AI-generated ideas start getting corrected by the world. That is the difference between an AI that enumerates the space of possible mathematics and one that discovers non-Euclidean geometry because spacetime forced its hand.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2166968507309243978'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2166968507309243978'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2026/03/taste-is-not-enough-reality-or-bust.html' title='Taste Is Not Enough. Reality or Bust'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEi0D91beEXdv5TwZxaW67nbotfHHeDPxn8HJqgJxMVlrLxT67udP48cKUq738sX6pbUCTyT3fdM-uvXepnPm2AOZ0xyMaSXvCSh21M1BYvmuGiX-xx8mHnihnDWjsJJzfs4sO8DqV7P55z1mvijXMg-nrYgS0ylW_5LvMJtq5xMgbw4T8oWYfAwg-imhVs=s72-w400-h219-c" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-4799472414446547699</id><published>2026-03-17T13:00:00.007-04:00</published><updated>2026-03-17T13:56:02.599-04:00</updated><title type='text'>How I Stopped Being a Copy-Paster for My AI Agent: Claude Code, Google Cloud, and the Loop to Close</title><content type='html'>&lt;div style=&quot;background-color: #f3f0f7; border-left: 4px solid rgb(87, 6, 140); border-radius: 12px; margin-bottom: 24px; padding: 20px 24px;&quot;&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Your AI agent in Claude Code on the Web can use Google Cloud (or AWS/Azure) to store large datasets, run long computations, deploy web apps, and schedule recurring jobs. Once you have a cloud account and project, the repo-specific setup takes about five minutes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Set an encryption password in your environment settings (see Step 1 below). If you only use one cloud provider, name it &lt;code&gt;CLOUD_CREDENTIALS_KEY&lt;/code&gt;. For provider-specific setups, use &lt;code&gt;GCP_CREDENTIALS_KEY&lt;/code&gt; / &lt;code&gt;AWS_CREDENTIALS_KEY&lt;/code&gt; / &lt;code&gt;AZURE_CREDENTIALS_KEY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Tell the agent: &lt;em&gt;&quot;Install the cloud-bootstrap skill from &lt;a href=&quot;https://github.com/ipeirotis/cloud-bootstrap&quot;&gt;https://github.com/ipeirotis/cloud-bootstrap&lt;/a&gt; into this repo.&quot;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Tell the agent: &lt;em&gt;&quot;Set up GCP access for this project.&quot;&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The agent walks you through the rest, including one command you run in &lt;a href=&quot;https://shell.cloud.google.com/&quot;&gt;Cloud Shell&lt;/a&gt; to generate a temporary token.&lt;/p&gt;
&lt;/div&gt;
&lt;hr /&gt; 
&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEgxbnSKWGuCyJym6dyIUw8gzZavQHwEOUbUk9gs66_t64z5kGdTApIJVVH8wP1pgMD6Es2rtBUKWnroIbiOi4vC5IbQpjYRuFVQ1fTjnHBabDUxNmO_A0MEhDdYS0Ol4TxneAXhBlF_bLFWV1oZwTI73YICT6IftW72rmIv3FSBUySYV2MM1UI4DEU15VY&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;1024&quot; data-original-width=&quot;1536&quot; height=&quot;266&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEgxbnSKWGuCyJym6dyIUw8gzZavQHwEOUbUk9gs66_t64z5kGdTApIJVVH8wP1pgMD6Es2rtBUKWnroIbiOi4vC5IbQpjYRuFVQ1fTjnHBabDUxNmO_A0MEhDdYS0Ol4TxneAXhBlF_bLFWV1oZwTI73YICT6IftW72rmIv3FSBUySYV2MM1UI4DEU15VY=w400-h266&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/div&gt;&lt;em&gt;&lt;strong&gt;&lt;br /&gt;&lt;/strong&gt;&lt;/em&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;The moment I became a human copy-paster&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A few weeks ago, I was debugging data issues on the &lt;a href=&quot;https://demographics.mturk-tracker.com/&quot;&gt;mturk-tracker demographics site&lt;/a&gt;. Claude Code would write a diagnostic script. I would deploy it to the server. I would copy the output. I would paste it back into Claude. Claude would write the next script. I would deploy &lt;em&gt;that&lt;/em&gt; one. Copy. Paste. Deploy. Copy. Paste. Deploy.&lt;/p&gt;
&lt;p&gt;I was not managing an AI agent. I was its copy-paster. Claude did the thinking. I did the Ctrl-C, Ctrl-V.&lt;/p&gt;
&lt;p&gt;That was problem number one.&lt;/p&gt;
&lt;p&gt;Problem number two: I needed to collect data from several websites, a process that would take a day or two of continuous scraping. Claude started the work, but the sandbox kept timing out. The session would die, I would restart it, Claude would pick up where it left off, and then the session would die again. The only way to keep things moving was to babysit: Break the bigger task into smaller subtasks and then &quot;Do next task.&quot; &quot;Do next task.&quot; &quot;Do next task.&quot; Over and over. I was not reviewing or directing anything. I was just pressing the button to keep the machine running. I understand that this is our new role as humans, serving our new AI overlords, but... boooooring.&lt;/p&gt;
&lt;p&gt;Problem number three: I needed to train a model that required a GPU. The Claude Code sandbox does not have GPUs. So I had to manually launch a VM on Google Cloud, SSH into it, clone the repo, install the dependencies, start the training, and then remember to check back later and shut the machine down before it burned through my budget. Claude had written all the training code. But the last mile (getting it to actually &lt;em&gt;run&lt;/em&gt; somewhere with the right hardware) was entirely on me. The AI writes the code. The GPU does the math. And I am the guy who forgets to shut down the machine. Guess which component has the highest error rate.&lt;/p&gt;
&lt;p&gt;Three different problems. Same root cause. The sandbox is a walled garden. Claude can think, it can code, it can analyze. But it cannot reach the outside world. It cannot talk to a server, run something overnight, or spin up a machine with a GPU. Everything that requires infrastructure beyond a small ephemeral container? That is your job.&lt;/p&gt;
&lt;p&gt;The fix: give the agent a cloud account.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;What changes once the agent has cloud access&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Remember the mturk-tracker debugging? With cloud access, Claude deploys its own diagnostic scripts to Cloud Functions, runs them against the live data, reads the results, and iterates. No copying. No pasting. No human in the middle.&lt;/p&gt;
&lt;p&gt;The web scraping that required me to babysit? Claude deploys the scraper as a Cloud Function with a scheduler. It runs every 15 minutes, stores results in a Cloud Storage bucket, and I check in the next day. I literally went to sleep and woke up with the data collected.&lt;/p&gt;
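&lt;p&gt;Under the hood, that deployment is just a couple of &lt;code&gt;gcloud&lt;/code&gt; commands. A sketch of the shape (function name, region, project, and service-account email here are illustrative placeholders, not the exact ones the agent picks):&lt;/p&gt;

```shell
# Sketch: deploy an HTTP-triggered Cloud Function, then schedule it.
# Names, region, and URIs are illustrative.
gcloud functions deploy scrape-job \
  --gen2 --region=us-central1 --runtime=python312 \
  --source=. --entry-point=scrape \
  --trigger-http --no-allow-unauthenticated

# Invoke the function every 15 minutes, authenticating as the agent's
# service account rather than leaving the endpoint open to the world:
gcloud scheduler jobs create http scrape-every-15m \
  --location=us-central1 \
  --schedule="*/15 * * * *" \
  --uri="https://us-central1-my-project.cloudfunctions.net/scrape-job" \
  --oidc-service-account-email="claude-agent@my-project.iam.gserviceaccount.com"
```

&lt;p&gt;The function itself writes each run&#39;s results to a Cloud Storage bucket, so nothing depends on any session staying alive.&lt;/p&gt;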
&lt;p&gt;The GPU training? Claude launches a VM with the right specs (say, an &lt;code&gt;n1-standard-4&lt;/code&gt; with a T4 GPU), clones the repo, installs everything, starts training, and sets up a shutdown script that kills the machine when the job finishes. Results go to Cloud Storage. I went to dinner. When I came back, the model was trained, the results were in the bucket, and the VM was already off. The alternative was me manually SSH-ing into a machine, running &lt;code&gt;htop&lt;/code&gt; every twenty minutes, and hoping I remembered to shut it down before I went to bed. (Ask me how I know that &quot;hoping I remember to shut it down&quot; is not a reliable cost management strategy.)&lt;/p&gt;
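&lt;p&gt;The self-terminating VM is the part worth stealing even if you skip everything else in this post. A sketch of what the agent sets up (zone, image family, repo, and bucket are placeholders; the specs match the example above):&lt;/p&gt;

```shell
# Sketch: a GPU VM whose startup script trains, uploads results,
# and then deletes the machine itself. All names are illustrative.
gcloud compute instances create train-vm \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --metadata=startup-script='#! /bin/bash
cd /opt
git clone https://github.com/you/your-repo.git
cd your-repo
pip install -r requirements.txt
python train.py
gsutil cp -r results gs://my-bucket/results/
# The machine shuts itself down: no human needs to remember anything.
gcloud compute instances delete train-vm --zone=us-central1-a --quiet'
```

&lt;p&gt;The delete command at the end of the startup script is the whole cost-management strategy: the VM only bills while the job runs.&lt;/p&gt;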
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;The setup (yes, there is some setup)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I will walk through this using Google Cloud, since that is what I use (the concepts are the same for AWS and Azure). If you do not already have a Google Cloud account, go to &lt;a href=&quot;https://cloud.google.com/&quot;&gt;cloud.google.com&lt;/a&gt; and sign up.&lt;/p&gt;
&lt;p&gt;Once you have an account, create a &lt;strong&gt;project&lt;/strong&gt; in the &lt;a href=&quot;https://console.cloud.google.com/&quot;&gt;Cloud Console&lt;/a&gt;. A project is Google Cloud&#39;s way of organizing resources and billing. Click the project dropdown at the top, click &quot;New Project,&quot; give it a name, and note the &lt;strong&gt;project ID&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;You do not need to install anything on your own computer. When you need to generate a token, you will use &lt;strong&gt;&lt;a href=&quot;https://shell.cloud.google.com/&quot;&gt;Google Cloud Shell&lt;/a&gt;&lt;/strong&gt;: a browser-based terminal with everything pre-installed.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;My pattern: one repo, one cloud project, same name&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Every GitHub repo I work with gets its own dedicated Google Cloud project. And they get the same name. The repo &lt;code&gt;paper-oral-exams&lt;/code&gt; gets the Cloud project &lt;code&gt;paper-oral-exams&lt;/code&gt;. The repo &lt;code&gt;course-ai-pm&lt;/code&gt; gets the Cloud project &lt;code&gt;course-ai-pm&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Why? Mostly &lt;strong&gt;resource isolation.&lt;/strong&gt; The agent for the course repo cannot accidentally touch the research data. Each agent gets exactly the access it needs for its own project and nothing else. It also makes housekeeping easier: when everything for a project lives in one Cloud project, you can quickly spot which storage buckets, databases, and VMs are still needed and which are leftovers. No more &quot;wait, whose VM is this and why is it still running?&quot;&lt;/p&gt;
&lt;p&gt;Creating a Cloud project is free and takes 30 seconds.&lt;/p&gt;
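&lt;p&gt;If you want to script the pattern, it is two commands per repo (the project IDs are illustrative, and the billing account ID is your own):&lt;/p&gt;

```shell
# Sketch: create a project named after the repo, then attach billing.
gcloud projects create paper-oral-exams

gcloud billing projects link paper-oral-exams \
  --billing-account=XXXXXX-XXXXXX-XXXXXX
```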
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Service accounts: giving the agent its own keys (not yours)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When &lt;em&gt;you&lt;/em&gt; use Google Cloud, you log in with your Google account. But an AI agent is not you. And more importantly, &lt;strong&gt;it should not be you.&lt;/strong&gt; Your Google account has access to everything: your email, your billing, your entire cloud infrastructure. Giving all of that to an automated tool would be like handing your intern the keys to the building, your credit card, and your Netflix password. Just in case.&lt;/p&gt;
&lt;p&gt;Instead, you give the agent a &lt;strong&gt;service account&lt;/strong&gt;: a restricted identity designed specifically for automated tools. It has its own email address (something like &lt;code&gt;claude-agent@my-project.iam.gserviceaccount.com&lt;/code&gt;) and you decide exactly what it can do. Read from this storage bucket. Deploy this function. Query this database. Nothing more.&lt;/p&gt;
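&lt;p&gt;In &lt;code&gt;gcloud&lt;/code&gt; terms, this is roughly the following (project ID and role are illustrative; the agent proposes the actual permission set for your repo):&lt;/p&gt;

```shell
# Sketch: create the restricted identity...
gcloud iam service-accounts create claude-agent \
  --project=my-project \
  --display-name="Claude Code agent"

# ...and grant it one narrow role, nothing more.
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:claude-agent@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
```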
&lt;p&gt;&lt;i&gt;A caveat: the approach below (encrypting a service account key in the repo) is a pragmatic workaround for agent environments that do not yet support proper workload identity or secret stores. If the worst case is &#39;the agent ran up a $200 bill on a research project,&#39; you are fine. If the worst case involves production data or your personal credentials, use something else. When proper agent identity federation exists, this will get simpler. For now, it is the best approximation available.&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;The service account authenticates using a &lt;strong&gt;key file&lt;/strong&gt;: a JSON file that acts as its password. Whoever has this file can act as the service account. Which means this file needs to be protected.&lt;/p&gt;
&lt;p&gt;But here is the catch: Claude Code runs in a sandbox that resets after each session. The only thing that persists is the GitHub repo. So the key file needs to live in the repo somehow, but committing a plaintext credentials file to a repo is a classic security mistake. (It is so common that GitHub literally has automated scanning to catch people doing it.)&lt;/p&gt;
&lt;p&gt;The solution: &lt;strong&gt;encrypt the key file&lt;/strong&gt; and commit the encrypted version. The encryption password lives in an environment variable in Claude Code, which persists across sessions but never enters the repo. At the start of each session, a hook decrypts, authenticates, and deletes the plaintext immediately. The encrypted file is useless without the password. The password is useless without the encrypted file. And in the worst-case scenario, if the password does leak, it exposes only a service account with limited permissions, and you can always revoke and regenerate the credentials.&amp;nbsp;&lt;/p&gt;
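&lt;p&gt;There is nothing exotic in the encryption itself. A minimal &lt;code&gt;openssl&lt;/code&gt; sketch of the round trip (file names and the passphrase are placeholders; the skill automates all of this):&lt;/p&gt;

```shell
# Stand-in for the real key file, so the sketch is self-contained:
echo '{"type": "service_account"}' > service-account.json

# Encrypt with the passphrase held in the environment variable;
# commit only the .enc file to the repo, never the plaintext:
CLOUD_CREDENTIALS_KEY="your-strong-passphrase-here" \
  openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in service-account.json -out service-account.json.enc \
  -pass env:CLOUD_CREDENTIALS_KEY
rm service-account.json

# What the session-start hook does: decrypt, authenticate, delete.
CLOUD_CREDENTIALS_KEY="your-strong-passphrase-here" \
  openssl enc -d -aes-256-cbc -pbkdf2 \
  -in service-account.json.enc -out /tmp/sa.json \
  -pass env:CLOUD_CREDENTIALS_KEY
# gcloud auth activate-service-account --key-file=/tmp/sa.json
rm /tmp/sa.json
```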
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;The five-minute walkthrough&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You do this once per repo.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Set your encryption password.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In Claude Code, open the environment settings for your session and find the &quot;Environment Variables&quot; field. Add a new variable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&lt;span style=&quot;color: #660000;&quot;&gt;&lt;b&gt;CLOUD_CREDENTIALS_KEY=your-strong-passphrase-here
&lt;/b&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(If you work with multiple cloud providers across different repos, you can use provider-specific names like &lt;code&gt;GCP_CREDENTIALS_KEY&lt;/code&gt; or &lt;code&gt;AWS_CREDENTIALS_KEY&lt;/code&gt; instead.)&lt;/p&gt;
&lt;p&gt;&lt;i&gt;A caveat: Claude Code currently &lt;a href=&quot;https://github.com/anthropics/claude-code/issues/32733&quot;&gt;warns against putting secrets in environment variables&lt;/a&gt; because there is no dedicated secrets store yet. I am using this approach because the passphrase only protects an already-restricted service account, not your personal cloud credentials. When a proper secrets store ships, this workflow will use it.&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Install the skill.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Open your repo in Claude Code and tell the agent:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;b&gt;&lt;span style=&quot;color: #660000;&quot;&gt;&quot;Install the cloud-bootstrap skill from https://github.com/ipeirotis/cloud-bootstrap into this repo.&quot;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(For those comfortable with a terminal, you can also run &lt;code&gt;curl -sSL https://raw.githubusercontent.com/ipeirotis/cloud-bootstrap/main/install.sh | bash&lt;/code&gt; in any environment with access to the repo.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Tell the agent to set up cloud access.&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;b&gt;&lt;span style=&quot;color: #660000;&quot;&gt;&quot;Set up GCP access for this project.&quot;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent will ask you for your Google Cloud project ID. Then it will look at your repo and propose a minimal set of permissions: &quot;Based on this repo, I think the service account needs access to Cloud Storage and BigQuery. Here is why. Shall I proceed?&quot; You approve or adjust. For a new or empty repo, it will ask what you plan to do first.&lt;/p&gt;
&lt;p&gt;Then the agent will ask you to run a command in &lt;a href=&quot;https://shell.cloud.google.com/&quot;&gt;Cloud Shell&lt;/a&gt;. To open it, go to &lt;a href=&quot;https://shell.cloud.google.com/&quot;&gt;shell.cloud.google.com&lt;/a&gt; or click the &quot;&amp;gt;_&quot; icon in the top-right of the &lt;a href=&quot;https://console.cloud.google.com/&quot;&gt;Cloud Console&lt;/a&gt;. Make sure you are in the right project, and run:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;gcloud auth print-access-token
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You paste the result back. This gives the agent a temporary token (valid for one hour) to do the initial setup. The agent creates the service account, grants the approved permissions, generates a key, encrypts it, commits the encrypted file, and sets up an automatic authentication hook for future sessions. The temporary token expires. From this point on, every new session starts fully authenticated. You just start working.&lt;/p&gt;
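&lt;p&gt;For the curious, the setup the agent performs with that one-hour token boils down to a few &lt;code&gt;gcloud&lt;/code&gt; calls, sketched below. The account name, project ID, and role are illustrative, not the skill&#39;s exact choices:&lt;/p&gt;

```shell
# Illustrative sketch of the provisioning step; names are hypothetical.
gcloud iam service-accounts create claude-agent \
    --project=my-project-id --display-name="Claude Code agent"

# Grant only the roles you approved in the permission proposal:
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:claude-agent@my-project-id.iam.gserviceaccount.com" \
    --role="roles/bigquery.user"

# Generate the key that then gets encrypted and committed to the repo:
gcloud iam service-accounts keys create service-account.json \
    --iam-account=claude-agent@my-project-id.iam.gserviceaccount.com
```

&lt;p&gt;These are IAM configuration commands; they only make sense against a real project, which is why the agent runs them for you with the temporary token.&lt;/p&gt;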
&lt;p&gt;&lt;strong&gt;For teams:&lt;/strong&gt; each person gets their own encrypted key file with their own password. The &lt;a href=&quot;https://github.com/ipeirotis/cloud-bootstrap&quot;&gt;README&lt;/a&gt; has the details.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;What to do once the ceiling is gone&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Once cloud access is set up, the agent will start &lt;strong&gt;proactively suggesting cloud improvements&lt;/strong&gt; when it notices opportunities: &quot;Would it help if I moved this dataset to BigQuery so we do not have to re-process it every session?&quot; You can also prompt this explicitly: &lt;strong&gt;&quot;Can you improve your process, knowing that you have access to GCP?&quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I had a dataset too large to fit in the sandbox. The agent uploaded it to BigQuery. Now I query it conversationally: &quot;Show me the distribution of response times by condition.&quot; The agent writes the SQL, runs it, brings back the results. The data lives in the cloud permanently. No re-uploading, no re-processing.&lt;/p&gt;
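&lt;p&gt;For reference, a conversational request like that typically turns into a single query against the table. Everything below (project, dataset, table, and column names) is made up for illustration:&lt;/p&gt;

```shell
# Roughly what "show me the distribution of response times by condition"
# might translate to; all identifiers here are hypothetical.
bq query --use_legacy_sql=false '
  SELECT condition,
         APPROX_QUANTILES(response_time_ms, 4) AS quartiles,
         COUNT(*) AS n
  FROM `my-project-id.experiments.responses`
  GROUP BY condition
  ORDER BY condition'
```

&lt;p&gt;The agent writes and runs this for you; you only see the resulting table and a summary.&lt;/p&gt;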
&lt;p&gt;I needed to run a survey for a research study. The agent deployed a Cloud Function with a simple web form, backed by a database. Participants visit a URL, submit responses, the data lands in a table I can query later. No server to manage. No hosting to configure. Thirty minutes from &quot;I need a survey&quot; to a live URL that participants were already clicking on. I still have not fully processed how absurd that is.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What does it cost?&lt;/strong&gt; Less than you might think. Cloud Functions and BigQuery queries cost cents per run. A T4 GPU VM runs about $0.35/hour. My monthly bill for all of this is usually under $10, though a long GPU job will cost more. One practical tip: set up a &lt;a href=&quot;https://cloud.google.com/billing/docs/how-to/budgets&quot;&gt;billing budget alert&lt;/a&gt; in Google Cloud before giving the agent access. Agents can get stuck in loops, and a $10 budget alert is cheaper than finding out the hard way.&lt;/p&gt;
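&lt;p&gt;You can create the budget in the Cloud Console, or from Cloud Shell with something like the sketch below (the billing-account ID and display name are placeholders):&lt;/p&gt;

```shell
# Find your billing-account ID first:
gcloud billing accounts list

# A $10 monthly budget with alerts at 50% and 100% of spend
# (billing-account ID below is a placeholder):
gcloud billing budgets create \
    --billing-account=XXXXXX-XXXXXX-XXXXXX \
    --display-name="agent-spend-guardrail" \
    --budget-amount=10USD \
    --threshold-rule=percent=0.5 \
    --threshold-rule=percent=1.0
```

&lt;p&gt;This is billing configuration, so it has to run against your real account; do it once before handing the agent the keys.&lt;/p&gt;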
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;The bigger picture: finding the next loop to close&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;There is a trajectory here worth naming. First, the AI learned to &lt;em&gt;generate&lt;/em&gt;: write a script, draft a document, produce code. Then it learned to &lt;em&gt;execute&lt;/em&gt;: run the script, push the changes, create a pull request. Now it is learning &lt;em&gt;autonomy&lt;/em&gt;: spin up a server, run the job, shut down the server, and report back. Each step closes a loop where a human used to be the connector.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://www.behind-the-enemy-lines.com/2026/03/lets-work-on-next-task-claude-code.html&quot;&gt;previous post&lt;/a&gt; gave the agent memory and a workflow. This one gives it infrastructure. Same pattern: every time you find yourself doing grunt work to connect two things that the agent &lt;em&gt;should&lt;/em&gt; be able to connect on its own, that is a loop waiting to be closed.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;What comes next&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/ipeirotis/cloud-bootstrap&quot;&gt;cloud-bootstrap skill&lt;/a&gt; supports GCP, AWS, and Azure. It handles first-time setup, adding team members, and credential rotation (it tracks credential age and warns you after six months). It also supports multi-provider setups in the same repo and handles permission escalation gracefully: if the agent hits a permission wall, it stops and tells you exactly what role it needs and why. It never silently fails, and it never gives itself more access.&lt;/p&gt;
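&lt;p&gt;The six-month rotation warning is easy to reproduce yourself. A minimal sketch of the age check (the file name is hypothetical; the skill tracks its own metadata):&lt;/p&gt;

```shell
# Sketch of the age check behind the six-month rotation warning.
KEY_FILE="service-account.json.enc"
MAX_AGE_DAYS=180

# Demo setup: pretend the encrypted key was created long ago.
touch -d "2020-01-01" "$KEY_FILE"

# find prints the file only if its mtime is older than the threshold.
if [ -n "$(find "$KEY_FILE" -mtime +"$MAX_AGE_DAYS")" ]; then
  echo "WARNING: $KEY_FILE is older than $MAX_AGE_DAYS days; rotate the key."
fi
```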
&lt;p&gt;This is still early. The whole approach (encrypting credentials in a repo, pasting short-lived tokens) is a workaround, as I noted above. When proper agent identity federation arrives, this will simplify considerably. But right now it works, and for isolated research projects with tightly scoped permissions, the risk is manageable.&lt;/p&gt;
&lt;p&gt;But the agent can still only work inside the one repo it is connected to. It cannot clone a second repo, pull in a dataset from another project, or push results somewhere a collaborator can see. It can work inside one room but cannot walk between rooms. The next post will fix that: installing &lt;code&gt;gh&lt;/code&gt; and setting up a GitHub personal access token so the agent can move freely across repos. It is a much shorter setup than this one.&lt;/p&gt;
&lt;p&gt;After that: the &quot;master repo, satellite repos&quot; setup for coordinating work across multiple projects (which needs the GitHub token to work), MCP configuration for integrating Gmail and Google Calendar, and more on the &quot;council of LLMs&quot; approach I have been using for &lt;a href=&quot;https://www.behind-the-enemy-lines.com/2025/12/fighting-fire-with-fire-scalable-oral.html&quot;&gt;grading oral exams&lt;/a&gt;&amp;nbsp;and for reviewing my work.&lt;/p&gt;
&lt;p&gt;But start here. Give the agent a cloud account. And then go to dinner. When you get back, the agent will have collected the data, trained the model, shut down the GPU VM, cleaned up everything, and gone to sleep. Your kids, on the other hand, if they are like mine, will still be awake and making fun of the parental controls on their iPads, and the kitchen will be a mess.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/4799472414446547699'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/4799472414446547699'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2026/03/how-i-stopped-being-copy-paster-for-my.html' title='How I Stopped Being a Copy-Paster for My AI Agent: Claude Code, Google Cloud, and the Loop to Close'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEgxbnSKWGuCyJym6dyIUw8gzZavQHwEOUbUk9gs66_t64z5kGdTApIJVVH8wP1pgMD6Es2rtBUKWnroIbiOi4vC5IbQpjYRuFVQ1fTjnHBabDUxNmO_A0MEhDdYS0Ol4TxneAXhBlF_bLFWV1oZwTI73YICT6IftW72rmIv3FSBUySYV2MM1UI4DEU15VY=s72-w400-h266-c" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-1959906034502153555</id><published>2026-03-04T17:55:00.023-05:00</published><updated>2026-03-28T13:31:34.206-04:00</updated><title type='text'>&quot;Let&#39;s Work on the Next Task&quot;: Claude Code, GitHub, and the Most Diligent Project Manager I&#39;ve Ever Had</title><content type='html'>&lt;p align=&quot;center&quot;&gt;&lt;a 
href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEgB1phmXzV8Gt1sHQNAR2RgkRJp0S3WfNZz133l9cHhyV-0sXcu5fU_q6HwGTIuh1fpx_SRMMzytvBMifFq00J9jYu63FfoPropowNGsyH9hlSmJ5THmSgG_8BPXKnjz3wigJRIFtANWM2QscmaDyXB4BhSSJ9uQwhEud_4lq47sAWpuvGT5KFZlaK37P8&quot; style=&quot;margin-left: 1em; margin-right: 1em; text-align: center;&quot;&gt;&lt;img align=&quot;center&quot; alt=&quot;&quot; data-original-height=&quot;768&quot; data-original-width=&quot;1408&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEgB1phmXzV8Gt1sHQNAR2RgkRJp0S3WfNZz133l9cHhyV-0sXcu5fU_q6HwGTIuh1fpx_SRMMzytvBMifFq00J9jYu63FfoPropowNGsyH9hlSmJ5THmSgG_8BPXKnjz3wigJRIFtANWM2QscmaDyXB4BhSSJ9uQwhEud_4lq47sAWpuvGT5KFZlaK37P8=w640-h350&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In my &lt;a href=&quot;https://www.behind-the-enemy-lines.com/2026/02/everybody-is-ceo-now-and-what-exactly.html&quot;&gt;previous post&lt;/a&gt;, I described how working with AI agents felt like managing an infinitely large, infinitely diligent team. I wrote about pairing Claude with GitHub, giving it context files and task lists, and watching it come back with actual deliverables.&lt;/p&gt;&lt;p&gt;After that post, I got questions from a lot of people asking how to actually set this up. Even from people I assumed were already using this kind of workflow. Turns out it was far less common knowledge than I previously thought. (I guess I am spending too much time reading social media.)&lt;/p&gt;&lt;p&gt;So this post is a step-by-step guide for those who still use AI tools in the &quot;chat&quot; form and want to try a first &quot;agentic AI&quot; setup. In this case, the goal is not to get the AI to be a software engineer, but rather to get the AI to become your project manager and your team of research assistants. &lt;br /&gt;&lt;br /&gt;We will set up a GitHub repository, configure Claude Code on the Web, and build a workflow where AI plans (or does) the work and you do the reviewing.&lt;/p&gt;&lt;p&gt;One caveat: while you do &lt;b&gt;&lt;i&gt;not &lt;/i&gt;&lt;/b&gt;need to know how to code, familiarity with software development &lt;b&gt;&lt;em&gt;practices&lt;/em&gt; &lt;/b&gt;will help. Not the programming itself, but the process: how developers organize projects, track changes, review each other&#39;s work. This post will walk you through those practices.&lt;/p&gt;&lt;p&gt;First, though, let me explain &lt;em&gt;why&lt;/em&gt; this setup is so powerful.&lt;br /&gt;&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;The real trick: The repo is the context&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;Here is the problem with using AI through a regular chat interface. Every time you start a new conversation, you are starting from zero. 
You paste in your document, re-explain what the project is about, remind the AI where you left off, describe what needs to happen next. It is like hiring a brilliant contractor who gets amnesia every morning.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;GitHub solves this. When Claude Code connects to your repository, it does not just see your files. It sees everything: the project structure, the notes about what the project is, the task list, the record of what has already been done, the decisions you have made along the way... All of it, sitting right there in the repo, ready to be read.&lt;/p&gt;&lt;p&gt;This means your prompt for most interactions becomes absurdly simple:&lt;/p&gt;&lt;blockquote&gt;
&lt;p&gt;&quot;Let&#39;s work on the next most important task.&quot;&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;That is all. Claude reads your &lt;code&gt;CLAUDE.md&lt;/code&gt; to understand the project. It reads your &lt;code&gt;TASKS.md&lt;/code&gt; to figure out what needs doing. It looks at the existing files to understand the current state. And then it gets to work. No pasting. No re-explaining. No &quot;as I mentioned in our previous conversation...&quot; The repository &lt;em&gt;is&lt;/em&gt; the conversation. It &lt;em&gt;is&lt;/em&gt; the memory. It &lt;em&gt;is&lt;/em&gt; the context.&lt;br /&gt;&lt;br /&gt;Reading about CLAUDE.md and TASKS.md and worrying that this is some black magic? Nah, these are just regular text files, written in plain English. We will describe them next.&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;Wait, what is Claude Code on the Web?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;First, some context. Claude Code started as a command-line tool. You would install it on your computer, open a terminal, and type commands. Powerful, but intimidating if you are not a developer.&lt;/p&gt;&lt;p&gt;Then Anthropic launched &lt;a href=&quot;https://claude.ai/&quot;&gt;Claude Code on the Web&lt;/a&gt;. Now you can do the same thing directly from your browser. You connect a GitHub repository, give Claude a task, and it clones your repo, writes code (or documents, or reports, or whatever you need), and pushes the changes to a branch. You review the changes, approve them, and merge. All from a web interface. No installation.&lt;/p&gt;&lt;p&gt;Claude Code on the Web operates inside a real computing environment called the &quot;sandbox&quot;. It can read your files, create new ones, run scripts, and push changes to GitHub. It tends to write software to perform tasks, instead of just replying in plain text. It does work. Real work. 
The kind you would normally delegate to a research assistant or a junior colleague.&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;The 10-minute setup: GitHub + Claude Code&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;OK, let us build this from scratch. I will assume you have zero GitHub experience.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Step 1: Create a GitHub account and a repository.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Go to &lt;a href=&quot;https://github.com/&quot;&gt;github.com&lt;/a&gt; and sign up. Then create a new repository: click the green &quot;New&quot; button, give it a name (something like &lt;code&gt;my-research-project&lt;/code&gt; or &lt;code&gt;quarterly-report&lt;/code&gt;), &lt;strong&gt;make sure to set it to Private&lt;/strong&gt; (not Public, unless you want the whole internet reading your drafts), and check &quot;Add a README file.&quot; &lt;i&gt;That last part matters.&lt;/i&gt; Write a short description of your project in the README. Even a couple of sentences is fine. This initializes the repo so that Claude Code can actually work with it. (An empty, uninitialized repo will cause problems.)&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Step 2: Connect your repo to Claude Code.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Go to &lt;a href=&quot;https://claude.ai/&quot;&gt;claude.ai&lt;/a&gt; and open Claude Code (it is in the left sidebar, or you can go directly to &lt;a href=&quot;https://claude.ai/code&quot;&gt;claude.ai/code&lt;/a&gt;). Start a new session and connect your GitHub repository. You can paste your repo URL directly or use the built-in GitHub integration to browse your repositories. Claude will ask you to authenticate with GitHub the first time (a one-time OAuth flow) and install Claude on the GitHub repo (this allows Claude to write to the repo). 
Select the repo you just created.&lt;/p&gt;&lt;p&gt;Now Claude Code can see your files, and more importantly, it can &lt;em&gt;change&lt;/em&gt; them.&lt;br /&gt;&lt;br /&gt;At this point, you can upload files that you have about the project to the repo, or you can defer that step for later and move on to the next step.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Step 3: Let Claude set up your project.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;This is where it gets interesting. &lt;code&gt;CLAUDE.md&lt;/code&gt; is a special file that Claude reads at the start of every session. It is the project&#39;s &quot;master plan&quot;: what the project is about, how it is organized, what conventions to follow. But you do not need to know what it should look like. Just describe your project in plain language:&lt;/p&gt;&lt;blockquote&gt;
&lt;p&gt;&quot;&lt;i&gt;This repo contains the data and analysis from our AI-powered oral examination system, which I wrote up as a blog post. I want to turn this into a research paper for submission to Communications of the ACM. The data and some initial analysis scripts are already in the repo. Set up the project structure for a CACM submission and create a CLAUDE.md file&lt;/i&gt;.&quot;&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;Claude will read through the existing files, figure out what is there, organize everything into a sensible structure, and create a &lt;code&gt;CLAUDE.md&lt;/code&gt; that might look something like this:&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: AI-Powered Oral Examinations at Scale

## Overview
Research paper for Communications of the ACM describing our system
for conducting and grading oral examinations using conversational AI
agents and a multi-LLM grading approach.

## Submission Details
- **Journal**: Communications of the ACM
- **Format**: ACM `acmart` document class, `acmsmall` style
- **Page limit**: 12,000 words including references
- **Style**: Author-year citations (natbib)

## Structure
- `/paper/` - LaTeX source files and ACM style files
- `/data/` - Exam transcripts, grading data, survey responses
- `/analysis/` - Python scripts for statistical analysis
- `/figures/` - Generated plots (PDF format, generated from scripts)
- `/blog/` - Original blog post and supporting materials

## Conventions
- All figures must be generated from scripts in `/analysis/`,
  never created manually
- Use BibTeX for references (`references.bib`)
- Data files are never edited directly; all transformations
  happen through scripts in `/analysis/`
- Student data must be anonymized in all outputs

## Current Status
See TASKS.md for the current task list and priorities.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Notice: you did not write any of this. You described your project, and Claude produced the project master plan. You review it, maybe tweak a couple of things. Done.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Step 4: Create your TASKS.md file.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;This is your project&#39;s to-do list. But unlike a regular to-do list, it serves double duty: it tells Claude what needs to be done &lt;em&gt;and&lt;/em&gt; keeps a record of what has been completed. Ask Claude to create it:&lt;/p&gt;&lt;blockquote&gt;
&lt;p&gt;&quot;Create a TASKS.md file with the following initial tasks...&quot;&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;Here is what one might look like:&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Tasks

## In Progress
- [ ] E1. Expand blog analysis into formal experimental evaluation
- [ ] E2. Inter-rater reliability analysis (human vs. LLM council grades)

## To Do
- [ ] E3. Create Figure 1 (grade distribution across grading methods)
- [ ] R1. Write Related Work section (AI in assessment, LLM-as-judge)
- [ ] D2. Analyze anti-cheating detection rates
- [ ] Z3. Check word count against CACM 12,000-word limit

## Done
- [x] Z1. Set up project structure from blog post materials
- [x] D1. Anonymize student data
- [x] I1. Write Introduction draft
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now here is the magic. You can point Claude at a specific task and say: &quot;Work on the next task in TASKS.md.&quot; Claude reads the file, picks the next item, does the work, updates the task status, and creates a pull request with its changes. If you are not familiar with pull requests, more on those in a moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Give Claude a GitHub token (so you never have to learn git).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is one more thing worth setting up now, even though it will not feel essential until later. Claude Code on the Web can push changes and create pull requests through its built-in GitHub connection. But that connection is limited to the one repo you connected in Step 2. If you want Claude to handle &lt;em&gt;all&lt;/em&gt; the git operations fluently, and eventually work across multiple repos, you need to give it a &lt;strong&gt;personal access token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Go to &lt;a href=&quot;https://github.com/settings/tokens&quot;&gt;github.com/settings/tokens&lt;/a&gt;, click &quot;Generate new token (classic),&quot; give it a name like &lt;code&gt;claude-code&lt;/code&gt;, and select the &lt;code&gt;repo&lt;/code&gt; scope (which covers reading, writing, and managing repositories). Copy the token.&lt;/p&gt;

&lt;p&gt;Now go back to Claude Code, open the environment settings for your session, and add:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GITHUB_TOKEN=ghp_your_token_here&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then ask Claude to add these lines to the CLAUDE.md:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;## Global Tools
- `gh` (GitHub CLI) is available as a global tool, authenticated via the `GITHUB_TOKEN` environment variable.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;That is it. Next time a session starts, Claude will install &lt;code&gt;gh&lt;/code&gt; (the GitHub command-line tool) and authenticate using your token. From that point on, Claude handles all the git plumbing: committing, branching, creating pull requests, even cloning other repos when needed. You never run a git command. You never resolve a merge conflict. You click &quot;Merge&quot; on the pull request, and Claude takes care of everything else.&lt;/p&gt;
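&lt;p&gt;Concretely, the bootstrap Claude performs at session start looks roughly like the sketch below. The install command and the pull-request details are illustrative (the exact install step depends on the sandbox image); &lt;code&gt;gh&lt;/code&gt; picks up &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; from the environment automatically.&lt;/p&gt;

```shell
# Illustrative sketch; the exact install step varies by environment.
command -v gh || sudo apt-get install -y gh

# gh reads the GITHUB_TOKEN environment variable automatically:
gh auth status

# From here, Claude can do the git plumbing itself, for example:
gh pr create --title "Draft: expand Section 8 analysis" \
             --body "Scripts, figures, and rewritten text for review."
```

&lt;p&gt;You never type any of this yourself; it is what happens behind the scenes when Claude says it is &quot;creating a pull request.&quot;&lt;/p&gt;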

&lt;p&gt;Why does this matter? Because without it, you will eventually hit a moment where Claude says something like &quot;I&#39;ve made the changes but I need you to run &lt;code&gt;git pull&lt;/code&gt; and resolve a conflict.&quot; And suddenly you are googling git tutorials at 11pm, which is not the workflow we are going for.&lt;/p&gt;

&lt;p&gt;(This token also unlocks cross-repo operations, which become important once you start coordinating work across multiple projects. More on that in a future post about the &quot;master repo, satellite repos&quot; setup.)&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pull requests: Redlined documents for coders (and not only for them)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;Now the part that is unfamiliar to people who are not software engineers: the &quot;pull request&quot;.&lt;/p&gt;&lt;p&gt;If you have ever received a redlined document from a lawyer, or reviewed tracked changes in a Word file, you already understand pull requests. The concept is that simple: someone proposes changes, and you review them before they get incorporated into the main document.&lt;/p&gt;&lt;p&gt;In GitHub, it works like this:&lt;/p&gt;&lt;ol&gt;
&lt;li&gt;Claude does its work on a separate &lt;strong&gt;branch&lt;/strong&gt; (a parallel copy of your project).&lt;/li&gt;
&lt;li&gt;When it is done, it creates a &lt;strong&gt;pull request&lt;/strong&gt; (PR), which says: &quot;Here are the changes I made. Want to incorporate them?&quot;&lt;/li&gt;
&lt;li&gt;You see a clean &lt;strong&gt;diff view&lt;/strong&gt; showing exactly what was added, removed, or modified. Green lines are additions. Red lines are deletions.&lt;/li&gt;
&lt;li&gt;You review. You can approve, request modifications, or reject.&lt;/li&gt;
&lt;li&gt;If you approve, you click &quot;Merge&quot; and the changes become part of the main project.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;This is the standard process used by every software team in the world. And it works for any kind of knowledge work that relies on text. Research papers. Reports. Course materials. Business proposals. Anything that lives in files. Ideally, you want the files to be text files and not binary ones; LaTeX, good; PowerPoint files, not so much. In the future we may have better tooling for reviewing changes in Office files or other formats, but for now the process works best for text-based files.&lt;/p&gt;&lt;p&gt;Fair warning: the GitHub interface will look busy the first time you open a pull request. Do not panic. Just look for the &quot;Files changed&quot; tab to see the redlines, and the big green &quot;Merge pull request&quot; button when you are ready to accept.&lt;/p&gt;&lt;p&gt;The critical point: &lt;strong&gt;you never edit the files directly&lt;/strong&gt;. You describe what you want, Claude proposes changes, and you review and approve. You are the manager. Claude is the diligent employee who comes back with deliverables for you to inspect. And the audit trail is far better than &quot;Track Changes&quot; in Word ever was.&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;A real example: From CSV to submission-ready in two hours&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;Let me show you how this plays out in practice with a real example from last month.&lt;/p&gt;&lt;p&gt;I was working on a paper that had a case study section (say, Section 8) where we discussed results from a partner&#39;s dataset, but we only had the final business conclusions, not a full experimental analysis. The rest of the paper (say, Section 7) had a proper, thorough analysis on a different dataset: figures, tables, bootstrap confidence intervals, the works. By comparison, the case study in Section 8 was the weak sibling, and reviewers had flagged that. We had received a detailed dataset from our partners, but it required work. 
My &lt;code&gt;TASKS.md&lt;/code&gt; had this sitting in it:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;## Backlog
- [ ] F5. AML dataset analysis
- [ ] G1. Complete §8 rewrite with AML dataset
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I uploaded the CSV to the repo and told Claude:&lt;/p&gt;&lt;blockquote&gt;
&lt;p&gt;&quot;&lt;i&gt;Here is the AML dataset. Replicate the analysis from Section 7 but now for Section 8. Use the existing details from Section 8 as the background and framing, conduct the full experimental analysis, and generate a new Section 8.&lt;/i&gt;&quot;&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;Claude read Section 7 to understand the methodology. It read the existing Section 8 to understand the framing and context. It wrote Python scripts to process the AML data, generated four figures and three tables with bootstrap confidence intervals, wrote the new section text with all quantities pulled from the analysis scripts, and submitted a pull request with everything.&lt;/p&gt;&lt;p&gt;Less than an hour. I spent another hour reviewing the PR, checking the code, leaving comments (&quot;clarify this axis label,&quot; &quot;move this paragraph before the table&quot;, &quot;I do not think the conclusions follow from the results&quot;), and merging.&lt;/p&gt;&lt;p&gt;Two hours total. For a PhD student, this would have been a few days of work, easily. And here is the part that matters: every single number in that section was generated through a Python script. Every figure had a script that produced it. Reproducibility was built in from the start, not bolted on after the fact. The pull request showed me exactly what was added: the scripts, the outputs, the LaTeX changes. I could trace every claim back to the code that produced it.&lt;/p&gt;&lt;p&gt;Needless to say, I remain fully accountable for any bugs or errors. At the end of the day, I have reviewed the scripts, the results, and the text. What I can say is that even if there are errors, these are not &quot;hallucinations&quot; where the LLM filled in random numbers or references in the text. The figures are Python-generated from the raw data, the tables and the numbers in the text the same. The errors can come from bugs, or other oversights. But we should stop calling all AI errors &quot;hallucinations&quot;. 
At this point, the errors are not the errors of a &quot;bullshitter in chief&quot; (a title aptly earned by early LLMs); they are the same types of errors that a junior colleague may&amp;nbsp;make when carefully executing a well-defined task: misreading a specification, applying a method slightly outside its intended scope, or missing an edge case that a more seasoned eye would have caught.&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;Beyond software: Why this works for all knowledge work&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;I want to be explicit about something: &lt;strong&gt;this is not just for code&lt;/strong&gt;. GitHub repositories can hold any kind of file. Markdown documents, LaTeX papers, CSV data files, images, PDFs. The pull request workflow works for anything.&lt;/p&gt;&lt;p&gt;Writing a consulting report? Put the markdown draft in &lt;code&gt;/report/&lt;/code&gt;, the supporting analysis in &lt;code&gt;/data/&lt;/code&gt;, the charts in &lt;code&gt;/figures/&lt;/code&gt;. Claude generates the analysis, creates the figures, and drafts sections of the report, all as reviewable pull requests.&lt;/p&gt;&lt;p&gt;Same idea for course materials (I use this with my &lt;a href=&quot;https://www.behind-the-enemy-lines.com/2026/02/listening-to-my-students-at-scale-exit.html&quot;&gt;exit tickets workflow&lt;/a&gt;), business plans, grant proposals. You define the project structure, you maintain a task list, and you let the agent do the work while you review proposals. Standard software engineering practice, applied to everything.&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;Leveling up: More files for better project management&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;Once you get comfortable with &lt;code&gt;CLAUDE.md&lt;/code&gt; and &lt;code&gt;TASKS.md&lt;/code&gt;, you can add more structure. The files I have found most universally useful are these three:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SCHEDULE.md&lt;/strong&gt; — Deadlines and milestones. &quot;The submission deadline is March 15&quot; becomes a constraint that shapes which tasks get prioritized first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DECISIONS.md&lt;/strong&gt; — Key choices and their rationale. &quot;We decided to use three LLMs in the grading council instead of five because the marginal improvement was negligible.&quot; Prevents you and Claude from relitigating settled questions two weeks later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;STYLEGUIDE.md&lt;/strong&gt; — Your writing preferences. &quot;Never use em-dashes,&quot; &quot;Never use fluffy adjectives,&quot; &quot;Avoid claims not supported by data or citations.&quot; Good trick: give Claude a few pieces of your favorite writing and ask it to generate a style guide that mimics your voice. Then drop it in the repo.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Beyond these, there are files worth adding for specific situations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CHANGELOG.md&lt;/strong&gt; — Human-readable log of what changed each session. Especially useful when preparing a response to reviewers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BLOCKERS.md&lt;/strong&gt; — Things waiting on someone external. Makes it easy to send a collaborator a list of &quot;here is what I need from you.&quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FEEDBACK.md&lt;/strong&gt; — Running log of all feedback received, formal and informal, with status: pending, accepted, or rejected with rationale.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SOURCES.md&lt;/strong&gt; — Annotated bibliography: what each source is useful for, how reliable it is, which sections cite it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GLOSSARY.md&lt;/strong&gt; — Keeps terminology consistent across a long document. Claude consults it and adds new terms as they come up.&lt;/li&gt;
&lt;li class=&quot;whitespace-normal break-words pl-2&quot;&gt;&lt;strong&gt;DEPENDENCIES.md&lt;/strong&gt; — Maps how artifacts depend on each other. Lets Claude flag when an upstream change invalidates something downstream.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;You do not need all of these on day one. Start with &lt;code&gt;CLAUDE.md&lt;/code&gt; and &lt;code&gt;TASKS.md&lt;/code&gt;. Add &lt;code&gt;CHANGELOG.md&lt;/code&gt;&amp;nbsp;when editing a paper that came back with revisions. Add the rest as your project grows and you find yourself needing them.&lt;/p&gt;&lt;p&gt;To be fair, this is a bit of a hack. We are simulating standard project management tools using plain markdown files. Scanning text files for task lists and decisions is not exactly elegant. And I have serious doubts that this can scale for projects involving hundreds of people. But it works &lt;i&gt;for&amp;nbsp;&lt;/i&gt;&lt;em&gt;now&lt;/em&gt;, with tools that exist &lt;em&gt;today, &lt;/em&gt;for the projects that I am working on.&lt;/p&gt;&lt;p&gt;In the future, agents will have proper interfaces: structured databases, purpose-built PM tools designed for agents to read and write directly, not markdown files they have to parse every session. We are in the duct-tape-and-baling-wire phase. It is fine. The duct tape holds.&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;The awkward part (and why it is worth it)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;If you are not a software engineer, this workflow feels strange at first. You are used to opening a document and typing. Now you are writing instructions, waiting for an AI to propose changes, and clicking &quot;Merge&quot; on a pull request. It is indirect. It feels like you are adding a middleman.&lt;/p&gt;&lt;p&gt;But here is what happens after a week: you realize the middleman can do 80% of the work. And the 20% you are doing (reviewing, giving feedback, making decisions) is the work that you would have done with any apprentice. 
But you are not fixing typos, you are not formatting tables, you are not wrestling with matplotlib&#39;s axis labels. You are reading the output and deciding if it is good and trustworthy enough.&lt;/p&gt;&lt;hr /&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;Coming next&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;This post covered the basics: one repo, one project, Claude Code on the Web doing the work. The whole secret is that now the chatbots can write down what they have done, and look up the notes next time you start working together. And it is ridiculously powerful.&lt;/p&gt;&lt;p&gt;But this is just the beginning.&lt;/p&gt;&lt;p&gt;In upcoming posts, I will describe my &quot;master repo, satellite repos&quot; setup, where I maintain a central task management repository that coordinates work across multiple projects with different collaborators. Think of it as the command center.&lt;/p&gt;&lt;p&gt;Beyond that: &lt;a href=&quot;https://www.behind-the-enemy-lines.com/2026/03/how-i-stopped-being-copy-paster-for-my.html&quot;&gt;deploying resources on Google Cloud, spinning up virtual machines for heavy computation&lt;/a&gt;, and the &quot;council of LLMs&quot; approach where Claude, Gemini, and GPT deliberate together on evaluation tasks (something I have been using for &lt;a href=&quot;https://www.behind-the-enemy-lines.com/2025/12/fighting-fire-with-fire-scalable-oral.html&quot;&gt;grading oral exams&lt;/a&gt; and am now extending to research).&lt;/p&gt;&lt;p&gt;At some point (in the not so distant future, probably by the end of March or so) Claude will be scheduling my meetings, answering my emails, and assigning me tasks from my own task list. 
I am not entirely sure who is managing whom anymore.&lt;/p&gt;&lt;p&gt;&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/1959906034502153555'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/1959906034502153555'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2026/03/lets-work-on-next-task-claude-code.html' title='&quot;Let&#39;s Work on the Next Task&quot;: Claude Code, GitHub, and the Most Diligent Project Manager I&#39;ve Ever Had'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEgB1phmXzV8Gt1sHQNAR2RgkRJp0S3WfNZz133l9cHhyV-0sXcu5fU_q6HwGTIuh1fpx_SRMMzytvBMifFq00J9jYu63FfoPropowNGsyH9hlSmJ5THmSgG_8BPXKnjz3wigJRIFtANWM2QscmaDyXB4BhSSJ9uQwhEud_4lq47sAWpuvGT5KFZlaK37P8=s72-w640-h350-c" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-7912326023511533169</id><published>2026-02-15T10:24:00.001-05:00</published><updated>2026-02-20T21:15:39.710-05:00</updated><title type='text'>Listening to My Students at Scale: Exit Tickets, NotebookLM, and the Tightest Feedback Loop I&#39;ve Ever Built</title><content type='html'>&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjBJbbmOmSASNW6ipRcpiGjMmj44_VWILNPeaOfEI5UWkf-9jxTp-tpVl14ghHU9wjHk7P1YQzMG0LGNFdAvKwJupaoZAMlk4sYlPwnxEgnHAhTVvBvY1MGjvRQTxcxgfY2CeZaQ1nE21XOGen1UuXmxPeF1rfJepc7S-ruwJxQsyRYvBahpbxM6K4YG34&quot; style=&quot;margin-left: 1em; margin-right: 
1em;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;559&quot; data-original-width=&quot;1024&quot; height=&quot;175&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjBJbbmOmSASNW6ipRcpiGjMmj44_VWILNPeaOfEI5UWkf-9jxTp-tpVl14ghHU9wjHk7P1YQzMG0LGNFdAvKwJupaoZAMlk4sYlPwnxEgnHAhTVvBvY1MGjvRQTxcxgfY2CeZaQ1nE21XOGen1UuXmxPeF1rfJepc7S-ruwJxQsyRYvBahpbxM6K4YG34&quot; width=&quot;320&quot;/&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;It started at a teaching workshop last semester: &lt;a href=&quot;https://cims.nyu.edu/~kapp/&quot;&gt;Craig Kapp&lt;/a&gt; and &lt;a href=&quot;https://www.nyu.edu/life/information-technology/teaching-and-learning-services/instructional-data/learning-analytics.html&quot;&gt;Rob Egan&lt;/a&gt; presented a seminar at the NYU Center for Teaching and Learning called &quot;Real-Time Insights: Leveraging AI for Responsive Teaching in Large Classrooms.&quot; They (re-)introduced a deceptively simple concept: the &lt;strong&gt;exit ticket&lt;/strong&gt;. The idea is that at the end of every class session, you ask students three quick questions, each with a different shape metaphor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔵 &lt;strong&gt;Circle&lt;/strong&gt;: What is still circling in your mind? (What are you confused about?)&lt;/li&gt;
&lt;li&gt;🟥 &lt;strong&gt;Square&lt;/strong&gt;: What &quot;squared&quot; with your understanding? (What clicked today?)&lt;/li&gt;
&lt;li&gt;🔺 &lt;strong&gt;Triangle&lt;/strong&gt;: What are three key takeaways from today&#39;s session?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you take these answers and use LLMs to process them quickly, so you have feedback before the next session.&lt;/p&gt;

&lt;p&gt;Getting structured feedback from students &lt;em&gt;after every single session&lt;/em&gt;? Not at the end of the semester when it&#39;s too late to change anything, but right now, while you can still do something about it? I immediately wanted to try it.&lt;/p&gt;

&lt;p&gt;Below I describe the details of the approach presented by Craig and Rob, and my own adjustments to the recipe. Hope you will find it useful.&lt;/p&gt;

&lt;hr/&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;The setup: Making it required (and why that matters)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We start by setting up the exit ticket surveys as auto-graded quizzes on Brightspace (NYU&#39;s LMS). The auto-grading part is a nice little trick: one of the questions is simply &quot;Select True in this question to get your points.&quot; Students complete the survey and get their credit. No manual processing of ~50 submissions on my end.&lt;/p&gt;

&lt;p&gt;We do tell students upfront: write something substantive. Don&#39;t game the system. We reserve the right to deduct points if someone slacks through the exit tickets all semester. And here&#39;s the nice irony: since we&#39;re already running AI-powered analysis on the responses, identifying freeriders who type &quot;asdf&quot; every week is trivial. The same pipeline that processes the feedback also flags the people not providing any.&lt;/p&gt;

&lt;p&gt;The critical design decision: &lt;strong&gt;make it part of the grade, not optional.&lt;/strong&gt; Optional feedback gets ~30% response rates and self-selected complainers. Required feedback gets everyone. And because this is formative feedback (not evaluative), students have every reason to be honest and detailed. They&#39;re not rating me. They&#39;re telling me what they need.&lt;/p&gt;

&lt;p&gt;Compare this to the end-of-semester evaluation. Students fill it out in December, the professor reads it in January (maybe), and any changes happen next year for a completely different group of students. The feedback loop is so long that it barely qualifies as a loop. Exit tickets close that loop within days. Sometimes hours.&lt;/p&gt;

&lt;hr/&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;From exit ticket to next session: the processing pipeline&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So now I have all this feedback. ~50 students, after every session, telling me what confused them, what clicked, and what they&#39;re taking away. The question becomes: how do you actually &lt;em&gt;process&lt;/em&gt; all of that quickly enough to act on it?&lt;/p&gt;

&lt;p&gt;NYU IT built an official path for this, which Rob demonstrated in the seminar. You export the exit ticket responses into the &lt;a href=&quot;https://www.nyu.edu/life/information-technology/teaching-and-learning-services/instructional-data/learning-analytics.html&quot;&gt;Brightspace Insights Portal&lt;/a&gt; (which Rob&#39;s team manages) and run AI-powered analysis using a prompt like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;You are an expert Instructional Designer and Data Scientist assisting
a professor with the course &quot;AI/ML Product Management&quot; at NYU Stern
School of Business (undergraduate).

Your goal is to analyze student feedback survey data to improve course
delivery. The survey questions and student answers are provided below.
Please perform the following two steps:

### Step 1: Thematic Analysis
Analyze the responses to identify key themes. Do not just look for
keywords; look for semantic similarities and underlying sentiment. For
each theme, provide:
1. **Theme Name**: A concise title.
2. **Prevalence**: The approximate number of students who mentioned this.
3. **Explanation**: A brief summary of the sentiment or issue.
4. **Evidence**: A direct, representative quote from the data.

### Step 2: Actionable Pedagogy (Bloom&#39;s Taxonomy)
For each theme identified above, propose a short course activity.
* If the theme represents a **knowledge gap/pain point**, propose a
  remedial activity.
* If the theme represents a **strength/interest**, propose an activity
  to deepen understanding.
* **Constraint**: The activity must be supported by Bloom&#39;s Taxonomy.
  Explicitly state which level of Bloom&#39;s Taxonomy the activity targets
  (e.g., Application, Analysis, Evaluation).

**Format**:
Start the suggestion section for each theme with the label: &quot;PRACTICE IDEA&quot;.

I attach the survey data.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&#39;s a well-designed prompt. Thematic coding, prevalence counts, representative quotes, remedial activities aligned with Bloom&#39;s Taxonomy. The output is genuinely useful.&lt;/p&gt;

&lt;p&gt;But I prefer to do something slightly different. I use the same prompt from the Insights Portal, but I run it inside &lt;a href=&quot;https://notebooklm.google.com/&quot;&gt;NotebookLM&lt;/a&gt; with &lt;em&gt;just&lt;/em&gt; the student feedback as input. For those unfamiliar: NotebookLM is Google&#39;s AI-powered research assistant. You upload your own documents, and it generates analysis, summaries, explainer videos, and podcast-style audio overviews grounded entirely in your uploaded sources. NYU provides institutional access through Google Workspace, so the data never trains any AI models, which matters when you&#39;re working with student feedback.&lt;/p&gt;

&lt;p&gt;Why NotebookLM over the Insights Portal? Because the exit ticket analysis is just the &lt;em&gt;starting point&lt;/em&gt;. What I really need is to prepare the follow-up material. Once NotebookLM identifies the themes and suggests activities, I take those suggestions and combine them with my lecture slides, readings, and case studies (which are already loaded in the same notebook). Then I ask it to generate explainers, videos, infographics, and targeted activities that address the confusion, all grounded in my actual course content.&lt;/p&gt;

&lt;p&gt;The Insights Portal gives me a diagnosis. NotebookLM gives me the diagnosis &lt;em&gt;and&lt;/em&gt; helps me build the treatment.&lt;/p&gt;

&lt;p&gt;My workflow after every class:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Students complete the exit ticket on Brightspace (takes them 2-3 minutes)&lt;/li&gt;
&lt;li&gt;I export the responses and upload them into a NotebookLM notebook, together with the materials for that session&lt;/li&gt;
&lt;li&gt;NotebookLM identifies the themes: what&#39;s confusing people, what clicked, what they found most valuable&lt;/li&gt;
&lt;li&gt;Based on those themes, I generate explainer materials, short videos, and targeted activities for the next session&lt;/li&gt;
&lt;/ol&gt;
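&lt;p&gt;Step 2 is the only mechanical part, and it can be scripted. Below is a minimal Python sketch of the idea, assuming (purely for illustration; the real Brightspace export will have different column names) a CSV with one column per shape question. It groups the answers by question into a plain-text digest that can be uploaded as a single NotebookLM source:&lt;/p&gt;

```python
import csv
import io

# Hypothetical column names; adjust to match the actual Brightspace export.
QUESTIONS = ["circle", "square", "triangle"]

def digest(csv_text: str) -> str:
    """Group exit-ticket answers by question into a plain-text digest
    that can be uploaded as one NotebookLM source."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    sections = []
    for q in QUESTIONS:
        # Keep only non-empty answers for this question.
        answers = [r[q].strip() for r in rows if r.get(q, "").strip()]
        bullets = "\n".join(f"- {a}" for a in answers)
        sections.append(f"## {q.upper()} ({len(answers)} responses)\n{bullets}")
    return "\n\n".join(sections)
```

&lt;p&gt;Everything in the sketch is Python standard library, so it runs anywhere; the only part to adapt is the column names.&lt;/p&gt;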

&lt;p&gt;(As an example, &lt;a href=&quot;https://notebooklm.google.com/notebook/a12e4e7d-e715-4c7e-914d-d1fb14b14a44&quot;&gt;here is the NotebookLM notebook&lt;/a&gt; for the Zillow Offers case, which we use to discuss leading and lagging metrics, model and output monitoring, concept drift, adverse selection, and other product-management topics. Note: this notebook contains only course materials for preparing the case discussion, not student feedback data.)&lt;/p&gt;

&lt;p&gt;One small but annoying wrinkle: NotebookLM&#39;s default slide output has that unmistakable &quot;AI-generated&quot; aesthetic. You know the one. (Yes, they are visually gorgeous compared to my own slides, but after a while it starts feeling a bit like slop.) So I started uploading the &lt;a href=&quot;https://www.nyu.edu/employees/resources-and-services/media-and-communications/nyu-brand-guidelines.html&quot;&gt;NYU brand style guide&lt;/a&gt; as an additional source in my notebooks, and prompting NotebookLM to follow it when generating visual materials. The results are noticeably closer to proper NYU-branded slides. Not perfect, but much better than the generic AI look. I&#39;m still waiting for NotebookLM to support custom templates or branding natively, but that&#39;s a different story.&lt;/p&gt;

&lt;p&gt;The per-session overhead is maybe 15-20 minutes.&lt;/p&gt;

&lt;hr/&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Why this actually works&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The circle/square/triangle structure does something clever: it gives students permission to be confused. &quot;What is still circling in your mind?&quot; is a much less intimidating question than &quot;What don&#39;t you understand?&quot; And the three-takeaways question forces them to reflect, even briefly, which helps consolidate their learning.&lt;/p&gt;

&lt;p&gt;But the real reason students engage is that &lt;strong&gt;they see the results&lt;/strong&gt;. When I open the next class by saying &quot;Several of you mentioned you were confused about X, so let&#39;s spend 15 minutes on this before we move on,&quot; students learn that their feedback actually matters. It creates a virtuous cycle: they write thoughtful responses because they know I&#39;ll respond, and I can respond because NotebookLM makes processing all the responses feasible. Without the AI assist, no professor has time to synthesize free-text responses from 50 students after every class and create targeted follow-up materials. Definitely not after every single session. The economics just don&#39;t work.&lt;/p&gt;

&lt;p&gt;With NotebookLM doing the heavy lifting? The economics suddenly work beautifully.&lt;/p&gt;

&lt;p&gt;The exit ticket has been around for decades. Craig and Rob simply showed how to supercharge it with AI. The hard part was never getting students to talk. It was finding the time to listen. Once students realize someone is actually listening, they start saying things worth hearing. That&#39;s the loop. That&#39;s the whole trick.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/7912326023511533169'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/7912326023511533169'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2026/02/listening-to-my-students-at-scale-exit.html' title='Listening to My Students at Scale: Exit Tickets, NotebookLM, and the Tightest Feedback Loop I&#39;ve Ever Built'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEjBJbbmOmSASNW6ipRcpiGjMmj44_VWILNPeaOfEI5UWkf-9jxTp-tpVl14ghHU9wjHk7P1YQzMG0LGNFdAvKwJupaoZAMlk4sYlPwnxEgnHAhTVvBvY1MGjvRQTxcxgfY2CeZaQ1nE21XOGen1UuXmxPeF1rfJepc7S-ruwJxQsyRYvBahpbxM6K4YG34=s72-c" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-2283280348430966962</id><published>2026-02-11T14:04:00.012-05:00</published><updated>2026-03-04T18:02:21.411-05:00</updated><title type='text'>Everybody Is a CEO Now (And What Exactly Am I Doing Here?)</title><content type='html'>&lt;div class=&quot;separator&quot;&gt;&lt;p style=&quot;margin-left: 1em; margin-right: 1em; text-align: center;&quot;&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;1024&quot; 
data-original-width=&quot;1024&quot; height=&quot;400&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEhtGVul36sQGPLpzAvu_li6Q_FV69urAGbKbN41Ml7Kg9WVIat9UZxoYAGjR_IXDS5BVdMp9x7QHckc1dLJkS37UlCObq1hrqrxEboULU8jCWVcP7rYZT8kbiQoX-VtE6t0uYwskEPSfeM-tkAeUVsoozSbxPrgpAwb_IMht48MSxv449ftT2EuqnfcFZI=w400-h400&quot; width=&quot;400&quot; /&gt;&lt;/p&gt;&lt;/div&gt;

&lt;p&gt;It&#39;s hard to pinpoint the exact moment when something fundamentally shifts. There&#39;s no day when you wake up and say, &quot;Today, everything is different.&quot; It&#39;s more like boiling a frog. Except in this case, the frog is me, and the water feels &lt;em&gt;amazing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Over the last few weeks, a confluence of AI developments crossed an invisible threshold. None of them is dramatic on their own. All of them, together, are profoundly changing how I work, how I teach, and honestly, how I think about what comes next.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Claude stopped being a chatty know-it-all&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me start with the most concrete thing. Around December, Claude became... different. Not in some flashy, press-release way. It just started being &lt;em&gt;right&lt;/em&gt;. Consistently, reliably right. The suggestions were spot on. The reasoning was good. The writing did not feel like fluffy AI slop. The output needed minimal editing.&lt;/p&gt;

&lt;p&gt;I know, I know: &quot;AI is getting better&quot; isn&#39;t exactly breaking news. People have been saying this for years. But there&#39;s a qualitative difference between &quot;impressive compared to what we had before, but I still need to direct and edit this very carefully&quot; and &quot;I now trust this thing with real work.&quot; We crossed that line.&lt;/p&gt;

&lt;p&gt;Here&#39;s the moment it hit me. Yesterday, I had a brainstorming session with a student. We shared documents, exchanged ideas, sketched out some research directions. Normal academic stuff. Afterwards, I dumped my messy meeting notes into Claude and asked it to organize them.&lt;/p&gt;

&lt;p&gt;What came back was not just a cleaned-up document with better formatting.&lt;/p&gt;

&lt;p&gt;It was a &lt;em&gt;research program&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Legitimate research questions, well thought out, properly scoped, organized into a coherent agenda with clear methodological approaches. I sat there staring at my screen. I did not feel like a professor advising a student and making some progress. It felt more like we were two grad students who had been goofing around with half-baked ideas, and then our wise, respected senior professor walked into the room, sat down, and said: &quot;OK, here&#39;s how research is actually done. Here&#39;s how you think about this. Here&#39;s how you organize your work.&quot;&lt;/p&gt;

&lt;p&gt;Not a helpful assistant anymore. Claude was setting the agenda this time around. It was the senior colleague. It was the advisor.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;The Agent That Puts PhD Students to Shame&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And then there&#39;s the agent setup, which is where things get truly surreal.&lt;/p&gt;

&lt;p&gt;When you pair Claude with GitHub for memory, an &lt;code&gt;AGENTS.md&lt;/code&gt; file for context, and a &lt;code&gt;TODO.md&lt;/code&gt; for task tracking, something clicks. The AI labs have been saying for a while that their agents were reaching &quot;PhD student level.&quot; I&#39;ve supervised PhD students for 20 years. I love them. Truly. But let me be blunt: I have never worked with a PhD student this organized and this diligent.&lt;/p&gt;

&lt;p&gt;None of them has ever created a table mapping every data-driven claim in the LaTeX source to the specific code and data files that support it. None of them has had a full pipeline for the data analysis and the figures in a makefile, ready to reproduce everything if necessary. None of them has had a reproducibility package ready before we even sent out the first manuscript.&lt;/p&gt;

&lt;p&gt;The only downside? I will not be able to have drinks with this PhD student in the future and feel happy seeing them be so much more successful than I am.&lt;/p&gt;

&lt;p&gt;A paper is about to go out. I started writing in earnest on Saturday. It took a total of four days of work to get to a submittable manuscript. The experimental analysis, the writing, the polishing. Four days. This would have taken four &lt;em&gt;weeks&lt;/em&gt; minimum with a human collaborator, and that&#39;s being generous. And the quality isn&#39;t &quot;good enough for a draft.&quot; It&#39;s &quot;ready for submission with minor tweaks.&quot;&lt;/p&gt;

&lt;p&gt;I find myself glued to my screen all day, but I am not doing busywork. I write down what needs to be done, and the work happens behind the scenes. An hour later I get the next iteration back; I look at it, give feedback, we cross items off the TODO.md, and we move forward. This is real work being done. Not just coding: paper writing, report preparation. Coding practices leak into other types of work, and things are moving. My real work is getting done, not just my academic software prototypes.&lt;/p&gt;

&lt;p&gt;It&#39;s like having an infinite pool of employees, each one eager, competent, and ready to come back with actual deliverables. Not drafts that need to be rewritten. Not outlines that need to be fleshed out. Deliverables.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Teaching as Curation: The NotebookLM Story&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me tell you about another shift that&#39;s been happening in parallel, this one in our classroom.&lt;/p&gt;

&lt;p&gt;We teach an AI Product Management course at Stern, and starting in November, something strange happened to how we prepare. We stopped &lt;em&gt;creating&lt;/em&gt; content. We started &lt;em&gt;curating&lt;/em&gt; it.&lt;/p&gt;

&lt;p&gt;Here&#39;s our workflow now: After every class session, we collect student feedback. What clicked, what didn&#39;t, what questions came up, what topics generated the most energy. We dump all of this (the feedback forms, our own notes, relevant articles, the previous session&#39;s materials) into NotebookLM.&lt;/p&gt;

&lt;p&gt;And then we ask it to help us design the next session.&lt;/p&gt;

&lt;p&gt;NotebookLM digests the student feedback, identifies the gaps, suggests educational activities, and creates new explainer material that directly addresses what students found confusing or wanted to explore further. It connects themes across sessions that we might not have noticed. It proposes case studies that are relevant to the questions students actually asked, not the ones we assumed they&#39;d ask.&lt;/p&gt;

&lt;p&gt;The result? The course is absurdly adaptive. Every session builds on what students actually need, not on a syllabus we wrote in August. We&#39;re not creating lectures from scratch anymore. We&#39;re curating a learning experience, with AI as our editorial partner. The student feedback loop, which used to inform &lt;em&gt;maybe&lt;/em&gt; the next semester&#39;s version of the course, now informs &lt;em&gt;the next class&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We feel like careful curators, because we&#39;re still the ones making the final calls. For now. For how long? No idea. Perhaps in Summer even the curation will be something the AI does better than us.&lt;/p&gt;

&lt;p&gt;Education is changing. Bloom&#39;s two sigma problem, the finding that one-on-one tutoring outperforms classroom instruction by two standard deviations, is solvable. Now. What is our role? No clue. Perhaps the future of education does not need professors. But the future of education is bright. One day we will look back and not believe how bad today&#39;s methods were. It is almost like going from writing with a marker on transparencies to having an interactive demo of the concept. That transition took 30 years. Let&#39;s see where we will be in 30 months.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;So... Everybody&#39;s a CEO Now?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here&#39;s where I start to feel a little dizzy. The marginal cost of competence is hitting zero.&lt;/p&gt;

&lt;p&gt;If I can supervise an AI agent the way I&#39;d supervise a research team (giving it direction, reviewing output, iterating on results) and if this scales to writing papers, analyzing data, building prototypes, designing courses... then what am I? I&#39;m a manager. A director. A CEO of a one-person company with an arbitrarily large AI workforce.&lt;/p&gt;

&lt;p&gt;But here&#39;s the question: What happens when &lt;em&gt;everyone&lt;/em&gt; can do this?&lt;/p&gt;

&lt;p&gt;When every professor can produce research at 10x the speed. When every consultant can deliver analyses that used to require a team of five. When every entrepreneur can build and ship products without hiring engineers. When every student can produce work indistinguishable from an expert&#39;s.&lt;/p&gt;

&lt;p&gt;Do we still need employees? Is it even feasible for everyone to operate like a one-person business? And if so, who are the customers? If everyone is a CEO, who is buying?&lt;/p&gt;

&lt;p&gt;I don&#39;t have answers. The words people have been saying for the last few years, &quot;AI will change everything,&quot; &quot;this is the new industrial revolution,&quot; &quot;knowledge work will be transformed,&quot; those words haven&#39;t changed.&lt;/p&gt;

&lt;p&gt;But the &lt;em&gt;feeling&lt;/em&gt; has.&lt;/p&gt;

&lt;p&gt;It used to feel like a prediction. The prediction is here. You will feel it soon, if you have not felt it already. It will be a mix of awe and fear. Impostor syndrome to the fullest. What exactly am I adding here?&lt;/p&gt;

&lt;p&gt;I&#39;d love to tell you that the human role is now &quot;taste, judgment, direction-setting&quot; and that AI just handles the execution. That&#39;s the comforting version. But I just told you that Claude set the research agenda, not me. So even that may not hold for long.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Bye now&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And for now, if you&#39;ll excuse me, I need to go review the deliverables my AI team just submitted. Four papers in the queue, a course redesign in progress, and a blog post that, unlike this one, I didn&#39;t write myself.&lt;/p&gt;

&lt;p&gt;OK fine, I didn&#39;t write this one myself either.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Kidding. Mostly.)&lt;/em&gt;&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2283280348430966962'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2283280348430966962'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2026/02/everybody-is-ceo-now-and-what-exactly.html' title='Everybody Is a CEO Now (And What Exactly Am I Doing Here?)'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEhtGVul36sQGPLpzAvu_li6Q_FV69urAGbKbN41Ml7Kg9WVIat9UZxoYAGjR_IXDS5BVdMp9x7QHckc1dLJkS37UlCObq1hrqrxEboULU8jCWVcP7rYZT8kbiQoX-VtE6t0uYwskEPSfeM-tkAeUVsoozSbxPrgpAwb_IMht48MSxv449ftT2EuqnfcFZI=s72-w400-h400-c" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-3754572994329279659</id><published>2025-12-29T14:46:00.021-05:00</published><updated>2026-03-19T22:56:25.851-04:00</updated><title type='text'>Fighting Fire with Fire: Scalable Personalized Oral Exams with an ElevenLabs Voice AI Agent</title><content type='html'>&lt;div style=&quot;background-color: #f3f0f7; border-left: 4px solid rgb(87, 6, 140); border-radius: 12px; margin-bottom: 24px; padding: 20px 24px;&quot;&gt;
&lt;p&gt;&lt;strong&gt;📄 Paper:&lt;/strong&gt; This blog post has been expanded into a full paper: &lt;a href=&quot;https://arxiv.org/abs/2603.18221&quot;&gt;&lt;strong&gt;Scalable and Personalized Oral Assessments Using Voice AI&lt;/strong&gt;&lt;/a&gt; (Panos Ipeirotis and Konstantinos Rizakos, arXiv:2603.18221). The paper includes the complete system design, failure mode analysis, student experience data, and all prompts as appendices.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;It all started with cold calling.&lt;/p&gt;

&lt;p&gt;In our new &quot;AI/ML Product Management&quot; class (co-taught with &lt;a href=&quot;https://www.linkedin.com/in/rizakos/&quot;&gt;Konstantinos Rizakos&lt;/a&gt;), the &quot;pre-case&quot; submissions (short assignments meant to prepare students for class discussion) were looking suspiciously good. Not &quot;strong student&quot; good. More like &quot;this reads like a McKinsey memo that went through three rounds of editing&quot; good.&lt;/p&gt;

&lt;p&gt;And let&#39;s be clear: We have zero problems with students using AI for their work. (Banning AI in an AI course? That would be... special.) We actively &lt;em&gt;encourage&lt;/em&gt; it. But here&#39;s the distinction that matters: using AI to &lt;em&gt;enhance&lt;/em&gt; your thinking versus outsourcing your thinking entirely and learning nothing at the end. One of these is education. The other is expensive credential theater.&lt;/p&gt;

&lt;p&gt;So we started cold calling students randomly during class.&lt;/p&gt;

&lt;p&gt;The result was... illuminating. Many students who had submitted thoughtful, well-structured work could not explain basic choices in their own submission after two follow-up questions. Some could not participate at all. This gap was too consistent to blame on nerves or bad luck. &lt;strong&gt;If you cannot defend your own work live, then the written artifact is not measuring what you think it is measuring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://brianjabarian.org/&quot;&gt;Brian Jabarian&lt;/a&gt; has been doing &lt;a href=&quot;https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5395709&quot;&gt;interesting work on this problem&lt;/a&gt;, &lt;em&gt;showing that AI can actually be better than humans at conducting job interviews&lt;/em&gt;. Why? Humans get tired, carry biases, and are less consistent at following a script. His results both inspired us and gave us the confidence to try something that would have sounded absurd two years ago: running the final exam with a Voice AI agent.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;Why oral exams? And why now?&lt;/h2&gt;

&lt;p&gt;The core problem is almost embarrassingly simple: students now have immediate access to LLMs that can handle most exam questions we traditionally use for assessment. The old equilibrium—where take-home work could reliably measure understanding—is dead. Gone. Kaput.&lt;/p&gt;

&lt;p&gt;OK, so we go pen and paper in the classroom. We did exactly that for the midterm. Problem solved, right?&lt;/p&gt;

&lt;p&gt;Well, not quite. We also needed to ensure that students had done &lt;strong&gt;deep work&lt;/strong&gt; on their group projects. In the past, our worry was freeriding: students offloading their work to teammates. But then, &lt;em&gt;in the middle of our class&lt;/em&gt;, the AI landscape shifted dramatically. Gemini 3.0 dropped, and NotebookLM started generating flawless presentations. Suddenly, a student could deliver a polished, sophisticated presentation about a project they barely touched.&lt;/p&gt;

&lt;p&gt;And we&#39;d have no way to tell.&lt;/p&gt;

&lt;p&gt;Oral exams were the natural response. They force real-time reasoning, application to novel prompts, and defense of actual decisions. No LLM whispering in your ear. No &quot;let me just check something real quick&quot; while ChatGPT generates your answer. Just you, your knowledge, and an evaluator.&lt;/p&gt;

&lt;p&gt;The problem? Oral exams are a &lt;strong&gt;logistical nightmare&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With 36 students and two instructors, we could &lt;em&gt;maybe&lt;/em&gt; manage. But even at that scale, the accommodation requests started piling up immediately. &quot;I have a flight on the 15th.&quot; &quot;I have three other finals that day.&quot; &quot;I&#39;m traveling for a family event.&quot; All legitimate! But multiply that by a factor of ten for a larger class, and you&#39;re looking at a month-long hostage situation.&lt;/p&gt;

&lt;p&gt;So: oral exams don&#39;t scale. Everyone knows this. It&#39;s why we abandoned them in the first place.&lt;/p&gt;

&lt;p&gt;Unless you cheat.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;Enter the Voice Agent&lt;/h2&gt;

&lt;p&gt;We used &lt;a href=&quot;https://elevenlabs.io/conversational-ai&quot;&gt;ElevenLabs Conversational AI&lt;/a&gt; to build the examiner. The platform bundles the messy parts (speech-to-text, text-to-speech, turn-taking, interruption handling, …) into something usable. And here is the thing that surprised me: a basic version for a low-stakes setting (e.g., an assignment) can be up and running in literally minutes. Minutes. Just write a prompt describing what the agent should ask the student, and you are done.&lt;/p&gt;

&lt;p&gt;Two features mattered a lot for our setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dynamic variables:&lt;/strong&gt; pass the student&#39;s name, project details, and other per-student context into the conversation as parameters, to allow &lt;em&gt;&lt;strong&gt;personalized&lt;/strong&gt;&lt;/em&gt; exams&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workflows:&lt;/strong&gt; build a structured flow with sub-agents instead of a single &quot;chatty&quot; agent trying to do everything&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What the exam looked like&lt;/h2&gt;

&lt;p&gt;We ran a two-part oral exam.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 1: &quot;Talk me through your project.&quot;&lt;/strong&gt; The agent asks about the student&#39;s capstone project: goals, data, modeling choices, evaluation, failure modes. This is where the &quot;LLM did my homework&quot; strategy dies. You can paste an assignment into ChatGPT. It is much harder to improvise consistent answers about specific decisions when someone is drilling into details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 2: &quot;Now do a case.&quot;&lt;/strong&gt; The agent picks one of the cases we discussed in class and asks questions spanning the topics we covered: basically testing whether students absorbed the material or just showed up.&lt;/p&gt;

&lt;p&gt;To handle this structure, we split the exam into sub-agents in a workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Authentication agent:&lt;/strong&gt; Asks for the student&#39;s ID and refuses to proceed without a valid one. (In a more productized version, we would integrate with NYU SSO instead of checking against a list.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project discussion agent:&lt;/strong&gt; Gets project context injected via parameters. The prompt includes details of each project so the agent can ask informed questions. The next step is obvious: connect retrieval over the student&#39;s submitted slides and reports so the agent can quote and probe precisely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Case discussion agent:&lt;/strong&gt; Selects a case and runs structured questioning. Again, RAG would help with richer case details.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This &quot;many small agents&quot; approach is not just aesthetic. It prevents the system from drifting into unbounded conversation, and it makes debugging possible.&lt;/p&gt;

&lt;p&gt;If you want to try: &lt;a href=&quot;https://elevenlabs.io/app/talk-to?agent_id=agent_8101k9d1pq41f3rs8d9g9143dvvr&amp;amp;vars=eyJzdHVkZW50IjoiS29uc3RhbnRpbm9zIFJpemFrb3MiLCAibmV0aWQiOiJrcjg4OCIsICJwcm9qZWN0aWQiOiJCIn0=&quot;&gt;Link to try the voice agent&lt;/a&gt; &lt;em&gt;(use Konstantinos as the name and kr888 as the net id to authenticate; the project was a &quot;LinkedIn Recruiter, an agent that scans profiles and automatically sends personalized DMs to candidates on behalf of a recruiter. It engages in the first 3 turns of chat to answer basic questions (salary, location) before handing off to a human.&quot;)&lt;/em&gt;&lt;/p&gt;
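&lt;p&gt;The personalization in that link is nothing exotic: the per-student dynamic variables are just a base64-encoded JSON object in the &lt;code&gt;vars&lt;/code&gt; URL parameter. A minimal Python sketch of generating one such link per student (the helper name is ours; the URL shape mirrors the link above):&lt;/p&gt;

```python
import base64
import json
from urllib.parse import urlencode

def exam_url(agent_id: str, student: str, netid: str, projectid: str) -> str:
    """Build a personalized exam link: the per-student dynamic variables
    are base64-encoded JSON passed in the `vars` query parameter."""
    payload = json.dumps({"student": student, "netid": netid, "projectid": projectid})
    encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
    query = urlencode({"agent_id": agent_id, "vars": encoded})
    return "https://elevenlabs.io/app/talk-to?" + query

print(exam_url("agent_8101k9d1pq41f3rs8d9g9143dvvr",
               "Konstantinos Rizakos", "kr888", "B"))
```

&lt;p&gt;A short script over the roster then emails each student their own link.&lt;/p&gt;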

&lt;hr /&gt;

&lt;h2&gt;By the Numbers&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;36 students examined over 9 days&lt;/li&gt;
&lt;li&gt;25 minutes average (range: 9–64)&lt;/li&gt;
&lt;li&gt;65 messages per conversation on average&lt;/li&gt;
&lt;li&gt;0.42 USD per student (15 USD total), &lt;em&gt;but also the $99/month ElevenLabs subscription&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;89% of LLM grades within 1 point&lt;/li&gt;
&lt;li&gt;Shortest exam (9 min) → highest score (19/20)&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2&gt;The economics&lt;/h2&gt;

&lt;p&gt;Let&#39;s talk money.&lt;/p&gt;

&lt;p&gt;Total cost for 36 students: &lt;strong&gt;15 USD&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That&#39;s 8 USD for Claude (the chair and heaviest grader), 2 USD for Gemini, 0.30 USD for OpenAI, and roughly 5 USD for ElevenLabs voice minutes. Forty-two cents per student.&lt;/p&gt;

&lt;p&gt;The alternative? 36 students × 25-minute exam × 2 graders = 30 hours of human time. At TA rates (~$25/hour), that&#39;s $750. At faculty rates, it&#39;s &quot;we don&#39;t do oral exams because they don&#39;t scale.&quot;&lt;/p&gt;

&lt;p&gt;For $15, we got: real-time oral examination, a three-model grading council with deliberation, structured feedback with verbatim quotes, a complete audit trail, and—as you&#39;ll see—a diagnosis of our own teaching gaps.&lt;/p&gt;

&lt;p&gt;The unit economics clearly work. But as we will see next, the real benefit is the value delivered, not the 50x cost savings.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;What broke (and how we fixed it)&lt;/h2&gt;

&lt;p&gt;The first version had problems. Here is what we learned.&lt;/p&gt;

&lt;h3&gt;1) The voice was intimidating&lt;/h3&gt;

&lt;p&gt;A few students complained that the agent sounded severe. We had cloned Foster Provost&#39;s voice because, frankly, his clone was much more accurate than the clones of our own voices. But the students found it... intense. Here is an email from a student:&lt;/p&gt;

&lt;blockquote&gt;I had prepared thoroughly and felt confident in my understanding of the material, but the intensity of the interviewer&#39;s voice during the exam unexpectedly heightened my anxiety and affected my performance. The experience was more triggering than I anticipated, which made it difficult to fully demonstrate my knowledge. Throughout the course, I have actively participated and engaged with the material, and I had hoped to better demonstrate my knowledge in this interview.&lt;/blockquote&gt;

&lt;p&gt;And here is another:&lt;/p&gt;

&lt;blockquote&gt;Just got done with my oral exam. [...] I honestly didn&#39;t feel comfortable with it at all. The voice you picked was so condescending that it actually dropped my confidence. [...] I don&#39;t know why but the agent was shouting at me.&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; We are split on this one. We love FakeFoster. But next time we will A/B test other voices. At the end of the day, we want to optimize for comprehension, not charisma. &lt;a href=&quot;https://elevenlabs.io/docs/agents-platform/guides/elevenlabs-docs-agent&quot;&gt;ElevenLabs offers guidance on voice and personality tuning&lt;/a&gt;; they treat this as a product design problem, which is probably the right framing.&lt;/p&gt;

&lt;h3&gt;2) The agent stacked questions&lt;/h3&gt;

&lt;p&gt;This was the biggest real issue. The agent would ask something like: &quot;Explain your metric choice, and also tell me what baselines you tried, and why you did not use X, and what you would do next.&quot;&lt;/p&gt;

&lt;p&gt;That is not one question. That is four questions wearing a trench coat. The cognitive load for an oral exam is already high. Stacking questions makes it brutal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Hard rule in the prompt: &lt;em&gt;one question at a time&lt;/em&gt;. If you want multi-part probing, chain it across turns. For grading the exam, we included an &quot;interference protocol&quot;: students received full credit if the agent stacked questions on them and they answered only some of the parts.&lt;/p&gt;

&lt;h3&gt;3) Clarifications became moving targets&lt;/h3&gt;

&lt;p&gt;Student: &quot;Can you repeat the question?&quot;&lt;br /&gt;
Agent: &lt;em&gt;paraphrases the question in a subtly different way&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now the student is solving a different problem than the one they were asked. Very frustrating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Explicit instruction in the prompt: &lt;em&gt;repeat verbatim when asked to repeat&lt;/em&gt;. No paraphrasing. Same words.&lt;/p&gt;

&lt;h3&gt;4) The agent did not let students think&lt;/h3&gt;

&lt;p&gt;Humans rush to fill silence. Agents do too. Students would pause to think, and the agent would jump in with follow-up probes or worse: interpret the silence as confusion and move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Tell the agent to allow think-time without probing aggressively. It made the exam feel less like an interrogation. We also increased the time-out before the agent asks &quot;Are you there?&quot; from 5 to 10 seconds.&lt;/p&gt;

&lt;h3&gt;5) Lack of randomization&lt;/h3&gt;

&lt;p&gt;We asked the agent to &quot;randomly select&quot; a case study. It did not.&lt;/p&gt;

&lt;p&gt;From December 12–18, when Zillow was in the case list, the agent picked Zillow &lt;strong&gt;88% of the time&lt;/strong&gt;. After we removed Zillow from the prompt on December 18, the agent immediately latched onto Predictive Policing—picking it for &lt;strong&gt;16 out of 21 exams&lt;/strong&gt; on December 19 alone.&lt;/p&gt;

&lt;p&gt;LLMs are not random. They have implicit preferences and ordering biases. Asking an LLM to &quot;pick randomly&quot; is like asking a human to &quot;think of a number between 1 and 10&quot;—you&#39;re going to get a lot of 7s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Pass an explicit random number as a parameter and map it to cases deterministically. Do the randomization in code, not in the prompt.&lt;/p&gt;
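&lt;p&gt;A minimal sketch of that fix (the function name and the last two case names are illustrative): do the draw in ordinary code, then hand the result to the agent as a dynamic variable.&lt;/p&gt;

```python
import random

# Illustrative case list -- the real list lived in our course materials.
CASES = ["Zillow", "Predictive Policing", "Case C", "Case D"]

def pick_case(netid: str, seed=None) -> str:
    """Select the case in code, not in the prompt. Seeding with the
    student's netid makes the choice reproducible for auditing; pass an
    explicit seed to override."""
    rng = random.Random(netid if seed is None else seed)
    return rng.choice(CASES)

print(pick_case("kr888"))  # deterministic for a given netid
```

&lt;p&gt;The agent then receives the already-chosen case name as a parameter, so its &quot;preferences&quot; never enter the picture.&lt;/p&gt;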

&lt;hr /&gt;

&lt;h2&gt;Grading: the council deliberation actually worked&lt;/h2&gt;

&lt;p&gt;OK, so here is where things got interesting.&lt;/p&gt;

&lt;p&gt;We graded using a &quot;&lt;a href=&quot;https://github.com/karpathy/llm-council&quot; target=&quot;_blank&quot;&gt;council of LLMs&lt;/a&gt;&quot; approach, an idea we borrowed from Andrej Karpathy. Three models (Claude, Gemini, ChatGPT) assessed each transcript independently. Then they saw each other&#39;s assessments and revised. Finally, the chair (Claude) synthesized the final grade with evidence.&lt;/p&gt;
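&lt;p&gt;The control flow is simple to express. Below is a minimal, stubbed sketch of the protocol (the &lt;code&gt;call_model&lt;/code&gt; stub stands in for real LLM API calls, and its toy revision rule is ours, not what the models actually do):&lt;/p&gt;

```python
# A minimal sketch of the council protocol. `call_model` is a placeholder
# for a real LLM API call; here it is stubbed so the control flow is clear.

def call_model(model: str, transcript: str, peer_assessments=None) -> dict:
    # Placeholder: in the real system this sends the transcript (and, in
    # round 2, the other models' assessments) to the model's API.
    base = {"claude": 13, "gemini": 17, "openai": 14}[model]
    if peer_assessments:  # round 2: a toy revision toward the peers' grades
        peers = [a["grade"] for a in peer_assessments]
        base = round((base + sum(peers) / len(peers)) / 2)
    return {"model": model, "grade": base, "evidence": "..."}

def council_grade(transcript: str) -> dict:
    models = ["claude", "gemini", "openai"]
    # Round 1: each model grades independently.
    round1 = {m: call_model(m, transcript) for m in models}
    # Round 2: each model sees the others' assessments and revises.
    round2 = {m: call_model(m, transcript,
                            [round1[o] for o in models if o != m])
              for m in models}
    # The chair (Claude in our setup) synthesizes the final grade.
    final = round2["claude"]
    return {"round1": round1, "round2": round2, "final": final}
```

&lt;p&gt;Even with the toy revision rule, the structure shows the point of the design: the independent round surfaces disagreement, and the revision round shrinks it.&lt;/p&gt;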

&lt;p&gt;&lt;strong&gt;Round 1 was a mess&lt;/strong&gt;. When the models graded independently, agreement was poor: 0% of grades matched exactly, and only 23% were within 2 points. The average maximum disagreement was nearly 4 points on a 20-point scale.&lt;/p&gt;

&lt;p&gt;And here&#39;s the kicker: Gemini was a softie: It averaged 17/20. Claude averaged 13.4/20. That&#39;s a 3.6-point gap—the difference between a B+ and a B-.&lt;/p&gt;

&lt;p&gt;Meanwhile, Claude and OpenAI were already aligned: 70% of their grades were within 1 point of each other in Round 1.&lt;/p&gt;

&lt;table border=&quot;1&quot; cellpadding=&quot;8&quot; cellspacing=&quot;0&quot;&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Round 1 Mean&lt;/th&gt;
&lt;th&gt;Round 2 Mean&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;13.4/20&lt;/td&gt;
&lt;td&gt;13.9/20&lt;/td&gt;
&lt;td&gt;+0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;14.0/20&lt;/td&gt;
&lt;td&gt;14.0/20&lt;/td&gt;
&lt;td&gt;+0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;17.0/20&lt;/td&gt;
&lt;td&gt;15.0/20&lt;/td&gt;
&lt;td&gt;-2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Then came consultation. After each model saw the others&#39; assessments and evidence, agreement improved dramatically:&lt;/p&gt;

&lt;table border=&quot;1&quot; cellpadding=&quot;8&quot; cellspacing=&quot;0&quot;&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Round 1&lt;/th&gt;
&lt;th&gt;Round 2&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perfect agreement&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;td&gt;+21 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Within 1 point&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;+62 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Within 2 points&lt;/td&gt;
&lt;td&gt;23%&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;+62 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean max difference&lt;/td&gt;
&lt;td&gt;3.93 pts&lt;/td&gt;
&lt;td&gt;1.41 pts&lt;/td&gt;
&lt;td&gt;-2.52 pts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Gemini lowered its grades by an average of 2 points after seeing Claude&#39;s and OpenAI&#39;s more rigorous assessments. It couldn&#39;t justify giving 17s when Claude was pointing to specific gaps in the experimentation discussion.&lt;/p&gt;

&lt;img alt=&quot;Grade convergence chart&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEg5fX5d3GQ2cOHfJpbT5os_5WBQSnb5iLSULS76cNn0LGOFq3RxpjyAu4sGJ2DEFMpo8_mWe9Vum1yzpLbfI82qK-m2yXDFQApP_X01G_fO0xFz-utUW3dHelb8XsqEHP3sgHNjxKAEKnZoCJDZkpJVneeF_ej0T8sq_HGm9YtWeE42I_hWkpfDDB1jbd8&quot; width=&quot;624&quot; /&gt;

&lt;p&gt;But here&#39;s what&#39;s interesting: the disagreement wasn&#39;t random. Problem Framing and Metrics had &lt;strong&gt;100% agreement&lt;/strong&gt; within 1 point. Experimentation? Only &lt;strong&gt;57%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why? When students give clear, specific answers, graders agree. When students give vague hand-wavy answers, graders (human or AI) disagree on how much partial credit to give. The low agreement on experimentation reflects genuine ambiguity in student responses, not grader noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The grading was stricter than my own default.&lt;/strong&gt; That&#39;s not a bug. Students will be evaluated outside the university, and the world is not known for grade inflation. (In case you are wondering: I graded all exams myself, and the TA graded them as well. We mostly agreed with the LLM grades, and I aligned mostly with the softie, Gemini. But in the cases where my grades disagreed with the council&#39;s, the council was more consistent across students, and I often felt it graded more strictly but more fairly.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The feedback was better than any human would produce.&lt;/strong&gt; The system generated structured &quot;strengths / weaknesses / actions&quot; summaries with verbatim quotes from the transcript. Sample feedback from the highest scorer:&lt;/p&gt;

&lt;blockquote&gt;&quot;Your understanding of metric trade-offs and Goodhart&#39;s Law risks was exceptional—the hot tub example perfectly illustrated how optimizing for one metric can corrupt another.&quot;&lt;/blockquote&gt;

&lt;p&gt;Sample from a B- student:&lt;/p&gt;

&lt;blockquote&gt;&quot;Practice articulating complete A/B testing designs: state a hypothesis, define randomization unit, specify guardrail metrics, and establish decision criteria for shipping or rolling back.&quot;&lt;/blockquote&gt;

&lt;p&gt;Specific. Actionable. Tied to evidence. No human grader has the time to generate that for every student.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;It diagnosed our teaching gaps&lt;/h2&gt;

&lt;p&gt;Ha! This one stung.&lt;/p&gt;

&lt;img alt=&quot;Topic performance chart&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEh-ISSQRlDyrhz1aF2SKnjX-NTTBv5nbkq4TyhFkKOc1FcZ80wHMrZ9YU7X62-9j-SfTpHT2VQDxuiE1bkKaOVIJ8zSRvLTi0hSAkvdC6h2-E-YA-MokZMoeVat-CGW1XD6lkhof8BpYY2bCymGz_A5x0lSvPBvlpX5DiTtqSzaZafdAJ8rkv_lv-Tu2sg&quot; width=&quot;624&quot; /&gt;

&lt;p&gt;When we analyzed performance by topic, one bar stuck out like a sore thumb: &lt;strong&gt;Experimentation&lt;/strong&gt;. Mean score: 1.94 out of 4. Compare that to Problem Framing at 3.39.&lt;/p&gt;

&lt;p&gt;The breakdown was brutal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 students (8%) scored &lt;strong&gt;0&lt;/strong&gt;—couldn&#39;t discuss it at all&lt;/li&gt;
&lt;li&gt;7 students (19%) scored &lt;strong&gt;1&lt;/strong&gt;—superficial understanding&lt;/li&gt;
&lt;li&gt;15 students (42%) scored &lt;strong&gt;2&lt;/strong&gt;—basic understanding&lt;/li&gt;
&lt;li&gt;0 students scored &lt;strong&gt;4&lt;/strong&gt;—no one demonstrated mastery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We had rushed through A/B testing methodology in class. The external grader made it impossible to ignore.&lt;/p&gt;

&lt;p&gt;The grading output became a mirror reflecting our own weaknesses as instructors. Ooof.&lt;/p&gt;

&lt;h3&gt;Duration ≠ Quality&lt;/h3&gt;

&lt;p&gt;One finding I found strangely fascinating: exam duration had &lt;strong&gt;zero correlation&lt;/strong&gt; with score (r = -0.03). The shortest exam—9 minutes—got the highest score (19/20). The longest—64 minutes—scored 12/20.&lt;/p&gt;

&lt;p&gt;Taking longer doesn&#39;t mean you know more. If anything, it signals struggling to articulate. Confidence is efficient.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;Anti-cheating (or: trust but verify)&lt;/h2&gt;

&lt;p&gt;We asked students to &lt;strong&gt;record themselves&lt;/strong&gt; while taking the exam (webcam + audio). This discourages blatantly outsourcing the conversation, having multiple people in the room, or having an LLM in voice mode whispering answers. It also gives us a backup record in case something goes really badly.&lt;/p&gt;

&lt;p&gt;And here is an underrated benefit of this whole setup: &lt;strong&gt;the exam is powered by guidelines, not by secret questions&lt;/strong&gt;. We can publish exactly how the exam works—the structure, the skills being tested, the types of questions. No surprises. The LLM will pick the specific questions live, and the student will have to handle them.&lt;/p&gt;

&lt;p&gt;This reduces anxiety and pushes students toward actual preparation instead of guessing what the instructor &quot;wants.&quot; And it eliminates the leaked-exam problem entirely. Practice all you want—it will only make you better prepared.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;What the students said&lt;/h2&gt;

&lt;p&gt;We surveyed students before releasing grades to capture their experience. Some of the results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only 13% preferred the AI oral format. 57% wanted traditional written exams. 83% found it more stressful.&lt;/li&gt;
&lt;li&gt;But here&#39;s the thing: 70% agreed it tested their actual understanding: the highest-rated item. They accepted the assessment but not the delivery.&lt;/li&gt;
&lt;li&gt;At the same time, they almost universally liked the flexibility of taking the exam at their own place and time. Yes, many of them would have also preferred a take-home exam instead of the oral exam, but this format is dead now.&lt;/li&gt;
&lt;li&gt;The fix is clear: one question at a time, slower pacing, calmer tone. The concept works. The execution needs iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;img alt=&quot;Student survey results&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEioSGC7USjzXprYZmvEnHsw7YX4xupGXSpKcWxwW1q61PXBHOZyxWr6pO8yEZrjZ0oJQyu6YlidYODZIvg0gQcMkJWPtwwJxCHMJfTejXod4mmo1lDGeol3Wsk-OfhVpmYIOXNB48R-kfahm5MQw9upxYYTdJR7aZS2jIHBPb9Zea9Cbl9KQyOO9jvUKaA&quot; width=&quot;624&quot; /&gt;

&lt;hr /&gt;

&lt;h2&gt;Try it yourself&lt;/h2&gt;

&lt;p&gt;If you want to experiment with this approach, here are some resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://gist.github.com/ipeirotis/0d9d5747e6270cf6d65a6bf9d162e421&quot;&gt;Prompt for the voice agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://gist.github.com/ipeirotis/99418caa6afae72fb7eec63855632c68&quot;&gt;Prompt for the grading council&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://elevenlabs.io/app/talk-to?agent_id=agent_8101k9d1pq41f3rs8d9g9143dvvr&amp;amp;vars=eyJzdHVkZW50IjoiS29uc3RhbnRpbm9zIFJpemFrb3MiLCAibmV0aWQiOiJrcjg4OCIsICJwcm9qZWN0aWQiOiJCIn0=&quot;&gt;Link to try the voice agent&lt;/a&gt; &lt;em&gt;(use Konstantinos as the name and kr888 as the net id to authenticate; the project was a &quot;LinkedIn Recruiter, an agent that scans profiles and automatically sends personalized DMs to candidates on behalf of a recruiter. It engages in the first 3 turns of chat to answer basic questions (salary, location) before handing off to a human.&quot;)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2&gt;What I would change next time&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Slower pacing and a calmer voice:&lt;/strong&gt; We love you FakeFoster, but GenZ is not ready for you. Perhaps we will deploy FakePanos next time. Too bad ElevenLabs hasn&#39;t perfected thick accents yet to deliver a real Panos experience.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAG over student artifacts&lt;/strong&gt; (slides, reports, notebooks). ElevenLabs supports this directly. If the agent can quote the student&#39;s own submission, the exam becomes much harder to game and much more diagnostically useful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better case randomization&lt;/strong&gt; with explicit seeding and tracking. Randomness that &quot;feels random&quot; is not enough. Pass explicit parameters.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit triggers in grading.&lt;/strong&gt; If the LLM committee disagrees beyond a threshold, flag for human review. The point of a committee is not to pretend the result is always certain; it is to surface uncertainty.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accessibility defaults.&lt;/strong&gt; Offer practice runs, allow extra time, and provide alternatives when voice interaction creates unnecessary barriers.&lt;/li&gt;
&lt;/ol&gt;
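&lt;p&gt;The audit trigger in item 4 is only a few lines of code. A sketch (the 2-point threshold is our arbitrary choice for a 20-point scale, not a tuned value):&lt;/p&gt;

```python
def needs_human_review(grades: dict, threshold: float = 2.0) -> bool:
    """Flag a transcript for human review when the council's maximum
    disagreement exceeds the threshold."""
    return max(grades.values()) - min(grades.values()) > threshold

print(needs_human_review({"claude": 12, "gemini": 16, "openai": 13}))  # True
print(needs_human_review({"claude": 14, "gemini": 15, "openai": 14}))  # False
```

&lt;p&gt;Anything flagged goes to a human grader; everything else ships with the council&#39;s synthesized grade and feedback.&lt;/p&gt;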

&lt;hr /&gt;

&lt;h2&gt;The bigger point&lt;/h2&gt;

&lt;p&gt;Take-home exams are dead. Reverting to pen-and-paper exams in the classroom feels like a regression. In our case, we wanted to verify that the students who worked on the team projects actually contributed and understood what they submitted; we could not do that with pen-and-paper exams in the classroom.&lt;/p&gt;

&lt;p&gt;We need assessments that evolve towards formats that reward understanding, decision-making, and real-time reasoning. Oral exams used to be standard until they could not scale. Now, AI is making them scalable again.&lt;/p&gt;

&lt;p&gt;And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.&lt;/p&gt;

&lt;p&gt;Fight fire with fire.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Thanks to Brian Jabarian for the inspiration and for giving us confidence that these interviews will work, Foster Provost for lending his voice to create the FakeFoster agent (sorry, students found you intimidating!), and Andrej Karpathy for the council-of-LLMs idea.&lt;/em&gt;&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/3754572994329279659'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/3754572994329279659'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2025/12/fighting-fire-with-fire-scalable-oral.html' title='Fighting Fire with Fire: Scalable Personalized Oral Exams with an ElevenLabs Voice AI Agent'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEg5fX5d3GQ2cOHfJpbT5os_5WBQSnb5iLSULS76cNn0LGOFq3RxpjyAu4sGJ2DEFMpo8_mWe9Vum1yzpLbfI82qK-m2yXDFQApP_X01G_fO0xFz-utUW3dHelb8XsqEHP3sgHNjxKAEKnZoCJDZkpJVneeF_ej0T8sq_HGm9YtWeE42I_hWkpfDDB1jbd8=s72-c" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-4867891027624693338</id><published>2025-03-22T19:20:00.005-04:00</published><updated>2026-02-21T09:24:51.783-05:00</updated><title type='text'>Training LLaMA using LibGen: Hack, a Theft, or Just Fair Use?</title><content type='html'>&lt;p&gt;Imagine you&#39;re building a Large Language Model. You need data—lots of it. If you can find text data of high quality, vetted, truthful, and useful, it would be... great! 
So, naturally, you head online and find a treasure trove of books neatly indexed, conveniently downloadable, and completely free. The catch? You&#39;re looking at &lt;a href=&quot;https://libgen.mx/&quot;&gt;LibGen&lt;/a&gt;—one of the most infamous pirate libraries on the internet.&lt;/p&gt;

&lt;p&gt;This isn&#39;t hypothetical. Recently, Meta made headlines for allegedly training their flagship LLM, LLaMA, on content from LibGen. But—can you even do that?&lt;/p&gt;

&lt;p&gt;Let&#39;s unpack the legal mess behind the scenes, step-by-step.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;First: Is Using LibGen Even Legal?&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Short answer: Absolutely not. Downloading copyrighted books from LibGen is textbook piracy. Think of it like grabbing a handful of snacks at the supermarket without paying—it&#39;s convenient but totally illegal.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Second: Does Training an AI Change the Equation?&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Here&#39;s where it gets fuzzy. In the U.S., you can claim &quot;fair use&quot;—the idea that some copying is permissible if you&#39;re transforming the original work into something new and valuable. (&lt;a href=&quot;https://www.behind-the-enemy-lines.com/2025/02/copyright-fair-use-and-ai-training.html&quot; target=&quot;_blank&quot;&gt;We covered this in an earlier blog post.&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Remember the Google Books case? Google scanned millions of books without permission. Authors sued, but courts sided with Google, citing fair use. The logic was that indexing books for search purposes created something valuable without substituting the original.&lt;/p&gt;

&lt;p&gt;Consider another example: the Authors Guild v. HathiTrust case. Libraries scanned books to help visually impaired readers and enable text search. Courts also ruled this fair use, emphasizing the transformative nature and public benefit. However, both these cases involved legally acquired copies—not pirated ones.&lt;/p&gt;

&lt;p&gt;So, could Meta&#39;s training of LLaMA fall under the same umbrella? Possibly, yes, under the same fair use theory. There is a subtle difference: Google used legally accessible copies (from libraries), while Meta reportedly took a different route. Legally speaking, when we talk about &lt;b&gt;&lt;i&gt;copyright and fair use&lt;/i&gt;&lt;/b&gt; &lt;b&gt;&lt;i&gt;in the US&lt;/i&gt;&lt;/b&gt;, the source of the copyrighted data does not &lt;i&gt;directly&lt;/i&gt; affect the outcome. (Although it can affect the attitude of a jury or a judge if they believe that the defendant acted in bad faith.)&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Third: What About the EU?&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;If you thought U.S. law was tricky, the EU adds another layer of complexity. They don&#39;t have a broad &quot;fair use&quot; policy, but they&#39;ve introduced exceptions specifically for Text and Data Mining (TDM). Good news for researchers and AI developers, right? Except there&#39;s a big &quot;BUT&quot;: &lt;i&gt;EU law explicitly requires lawful access&lt;/i&gt;. Pirate libraries like LibGen don&#39;t qualify.&lt;/p&gt;

&lt;p&gt;In other words, in Europe, using LibGen isn&#39;t just risky—it&#39;s explicitly illegal.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Fourth: Is there a Legal Defense for using LibGen?&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;There is a &lt;i&gt;very&lt;/i&gt; reasonable argument that training an AI is transformative—after all, an LLM doesn&#39;t copy books; it learns from them. Consider also the LAION case from Germany. LAION, a nonprofit, scraped images from stock photo sites to train AI models. The court allowed it, but crucially because LAION had legitimate access and was a non-commercial entity. The outcome might differ sharply for a commercial giant sourcing pirated content.&lt;/p&gt;

&lt;p&gt;There is also a counterargument from authors and publishers: LLMs themselves create a market for licensing content, as LLM providers compete to secure exclusive, licensed content as a differentiating factor, much as streaming companies compete for exclusive access to films, shows, and TV series. It is a bit of a circular argument (without free training, could LLMs have become good enough to create a licensing market in the first place?), but we will have to wait for the courts to decide.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Fifth: What&#39;s the Risk Here?&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;For researchers at universities or small startups, casually using LibGen might seem harmless. The risks escalate quickly when you&#39;re a global company. Training on &quot;presumed free&quot; copyrighted data differs from &quot;willful infringement&quot;—the legal term for knowingly breaking copyright law.&lt;/p&gt;

&lt;p&gt;The fact that LLaMA is open source is a significant factor here, as the profit motive is less direct, but when the trainer is a trillion-dollar company, the courts may behave differently. We will see...&lt;/p&gt;

&lt;p&gt;&lt;i&gt;After all, while pirates make great movie characters, they&#39;re generally less popular in courtrooms.&lt;/i&gt;&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/4867891027624693338'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/4867891027624693338'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2025/03/training-llama-using-libgen-hack-theft.html' title='Training LLaMA using LibGen: Hack, a Theft, or Just Fair Use?'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-5003462814224629632</id><published>2025-02-24T23:32:00.004-05:00</published><updated>2026-02-21T09:25:00.493-05:00</updated><title type='text'>Copyright, Fair Use, and AI Training</title><content type='html'>&lt;p&gt;&lt;i&gt;[We tested the o1-pro model to give us a detailed analysis of the legal landscape around copyright and the use of copyrighted materials to train LLMs. &lt;a href=&quot;https://chatgpt.com/share/67bc8d93-ed64-8000-982c-f130c511ec69&quot; target=&quot;_blank&quot;&gt;The full discussion is available here.&lt;/a&gt; Below you will find a quick attempt to summarize the (much) longer report by o1-pro.]&lt;/i&gt;&lt;/p&gt;

&lt;h3&gt;What is Copyright? (And Why Should You Care?)&lt;/h3&gt;

&lt;p&gt;Imagine you spend months writing a book, composing a song, or designing a killer app—wouldn&#39;t you want some protection to stop someone from copying it and making money off your hard work? That&#39;s where copyright steps in! It grants the copyright holder exclusive rights to reproduce, distribute, and display their work. However, copyright isn&#39;t an all-powerful lock—there are important exceptions, like fair use, that allow for some unlicensed use, especially when it benefits society.&lt;/p&gt;

&lt;p&gt;Copyright laws are all about balance. Too much restriction, and we block innovation and education. Too little, and creators lose their incentive to make new things. Governments step in to help find that sweet spot—protecting creators&#39; rights while making sure knowledge, art, and innovation stay accessible.&lt;/p&gt;

&lt;h3&gt;The Fair Use Doctrine: When Borrowing is (Sometimes) Okay&lt;/h3&gt;

&lt;p&gt;Fair use is like the ultimate legal &quot;it depends&quot; clause in copyright law. It allows limited use of copyrighted materials without permission—whether for education, commentary, parody, or research. But how do you know if something qualifies as fair use? Courts consider these four big factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Purpose and Character of the Use&lt;/strong&gt; – Is the use transformative? Does it add new meaning or context? And is it for commercial gain or educational purposes?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nature of the Copyrighted Work&lt;/strong&gt; – Is the original work factual (easier to use under fair use) or highly creative (harder to justify copying)?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amount and Substantiality&lt;/strong&gt; – How much of the original is used, and is it the &quot;heart&quot; of the work?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Effect on the Market&lt;/strong&gt; – Does this use harm the copyright holder&#39;s ability to profit from their work?&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;What Do Past Cases Tell Us About Fair Use?&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Google Books Case (Authors Guild v. Google, 2015):&lt;/strong&gt; Google scanned millions of books to make them searchable, showing only small snippets of text. The Second Circuit ruled this was fair use because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It was &lt;strong&gt;highly transformative&lt;/strong&gt;—it helped people find books rather than replacing them.&lt;/li&gt;
&lt;li&gt;The snippets were &lt;strong&gt;not a market substitute&lt;/strong&gt;—nobody was reading full books this way.&lt;/li&gt;
&lt;li&gt;Instead of harming book sales, it actually helped readers find books to purchase.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Google Search Indexing (Perfect 10 v. Google, 2007):&lt;/strong&gt; Google&#39;s image search displayed thumbnail previews linking to full-size images. The Ninth Circuit ruled this was fair use because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It served a &lt;strong&gt;different function&lt;/strong&gt;—helping users find images, not replacing the originals.&lt;/li&gt;
&lt;li&gt;Any &lt;strong&gt;market harm was speculative&lt;/strong&gt;—there was no proof Google&#39;s thumbnails hurt sales.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn Scraping Case (hiQ Labs v. LinkedIn, 2019):&lt;/strong&gt; hiQ Labs scraped publicly available LinkedIn profiles to analyze workforce data. LinkedIn sued, claiming this violated its terms of service. The Ninth Circuit ruled that scraping publicly accessible data wasn&#39;t illegal under the Computer Fraud and Abuse Act (CFAA), but the case raised bigger questions about data ownership and fair use. This case matters for AI because it highlights the legal gray area of using publicly available content for AI training—does scraping data for machine learning function like search indexing (which courts favor) or unfairly compete with content creators?&lt;/p&gt;

&lt;h3&gt;When Courts Say &quot;Nope&quot; to Fair Use&lt;/h3&gt;

&lt;p&gt;When a copied work competes directly with the original, courts usually rule against fair use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Texaco Case (American Geophysical Union v. Texaco, 1994)&lt;/strong&gt; – Texaco photocopied journal articles for internal research. The court ruled this &lt;strong&gt;wasn&#39;t fair use&lt;/strong&gt; because Texaco could&#39;ve just bought the licenses, and widespread copying threatened the scientific journal market.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meltwater Case (Associated Press v. Meltwater)&lt;/strong&gt; – Meltwater, a news aggregation service, copied AP excerpts. The court ruled &lt;strong&gt;this wasn&#39;t fair use&lt;/strong&gt; because it replaced a &lt;strong&gt;licensable market&lt;/strong&gt; for news monitoring services.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;How Does This Apply to AI Training?&lt;/h3&gt;

&lt;p&gt;AI models like ChatGPT train on huge datasets, including copyrighted text. Courts will likely analyze this under fair use principles by asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Is AI training transformative?&lt;/strong&gt; AI companies argue that their models learn patterns rather than copying content. This mirrors &lt;em&gt;Google Books&lt;/em&gt;, where scanning books for search indexing was deemed transformative.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does AI-generated text replace the original?&lt;/strong&gt; If AI can generate news summaries or books, it might compete with the markets for journalism, books, or educational content—similar to &lt;em&gt;Meltwater&lt;/em&gt; replacing a paid service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is there a licensing market?&lt;/strong&gt; If publishers and authors start licensing data for AI training, unlicensed use could be seen as market harm—like in &lt;em&gt;Texaco&lt;/em&gt;, where academic publishers had a functioning licensing system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The outcome of ongoing lawsuits will determine how courts see AI&#39;s role in the content economy. If AI models start functioning as substitutes for original content, expect stricter copyright enforcement. If they&#39;re seen as research tools, fair use might hold up.&lt;/p&gt;

&lt;h3&gt;Industry-Specific Market Harm Considerations&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;News &amp;amp; Journalism&lt;/strong&gt; – AI-generated summaries may reduce clicks on original articles, hurting ad revenue and subscriptions (&lt;em&gt;New York Times v. OpenAI&lt;/em&gt; argues AI responses replace direct readership).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Book Publishing&lt;/strong&gt; – Authors claim AI-generated text could compete with traditional books and summaries (&lt;em&gt;Authors Guild v. OpenAI&lt;/em&gt; argues AI models reduce demand for original works).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Education &amp;amp; Academic Publishing&lt;/strong&gt; – AI-generated study materials could cut into textbook sales (&lt;em&gt;Pearson v. OpenAI&lt;/em&gt; claims AI-generated content could replace traditional textbooks).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creative Writing &amp;amp; Film&lt;/strong&gt; – AI-generated scripts or novels could impact demand for human writers (&lt;em&gt;Writers Guild v. OpenAI&lt;/em&gt; and &lt;em&gt;Martin v. OpenAI&lt;/em&gt; argue AI mimicking authors threatens their markets).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;The Future of AI and Copyright Law&lt;/h3&gt;

&lt;p&gt;Current lawsuits (&lt;em&gt;New York Times v. OpenAI&lt;/em&gt;, &lt;em&gt;Authors Guild v. OpenAI&lt;/em&gt;) will set precedents for AI copyright law. Possible outcomes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AI training as fair use&lt;/strong&gt; – If courts find AI models transformative and non-substitutive.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI training as infringement&lt;/strong&gt; – If courts rule that it undermines a viable licensing market.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New licensing systems&lt;/strong&gt; – Like how music royalties work, AI companies may have to pay creators.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Wrapping It Up&lt;/h3&gt;

&lt;p&gt;So, what&#39;s the big takeaway? AI and copyright law are in a messy, ongoing battle. Will AI companies get a free pass under fair use, or will copyright holders demand licensing fees? We don&#39;t know yet, but these decisions will shape the future of AI.&lt;/p&gt;

&lt;p&gt;My bet? AI companies will create new markets where content creators can contribute and get paid—like YouTube does for video creators. Instead of just scraping data, AI firms will likely find ways to reward quality content, making it a win-win for tech and creatives alike.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/5003462814224629632'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/5003462814224629632'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2025/02/copyright-fair-use-and-ai-training.html' title='Copyright, Fair Use, and AI Training'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-6182210352457251570</id><published>2024-09-06T10:32:00.008-04:00</published><updated>2026-02-21T09:26:15.914-05:00</updated><title type='text'>Developing Grading Rubrics using Docent</title><content type='html'>&lt;p&gt;When I explain the concept of Docent, a common first question I hear is if AI grades assignments, can&#39;t students just use AI to do their homework? They imagine a scenario where professors create assignments with AI, students complete them with AI, and graders assess them with AI as well.&lt;/p&gt;

&lt;p&gt;But let me clear that up—Docent doesn&#39;t work like that.&lt;/p&gt;

&lt;p&gt;While building Docent, we realized we needed two things to make AI grading effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &quot;gold answer,&quot; which is basically the perfect solution to the assignment.&lt;/li&gt;
&lt;li&gt;A &quot;grading rubric,&quot; which is a guide on how to deduct points for mistakes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &quot;gold answer&quot; is our way of ensuring that even if an AI is doing the grading, students can&#39;t just whip up another AI to spit out the right answers. &lt;i&gt;(For the future, we&#39;re considering adding a feature where Docent can tell if an assignment is easily solvable by an LLM, without having access to the gold answer.)&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;Now, developing a comprehensive grading rubric is a bit trickier. It&#39;s hard to anticipate all the ways students might slip up. In an &quot;AI-less setting,&quot; we usually refine the rubric incrementally, based on what we see after the assignment has run a couple of times.&lt;/p&gt;

&lt;p&gt;How can an LLM make our life easier? Docent is great at building these rubrics. Since it can grade hundreds of assignments at once, we can quickly spot the common mistakes: we ask Docent to grade the submissions and flag the errors it finds, review the flagged errors, and add the recurring ones to the rubric. After adjusting the rubric, we ask Docent for a re-grade, and voila! After a few rounds, we end up with a solid rubric that catches most errors.&lt;/p&gt;
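&lt;p&gt;As a rough illustration of the &quot;spot common mistakes, grow the rubric&quot; step (this is not Docent&#39;s actual code; the function name, data shapes, and deduction values are all made up), the idea can be sketched in Python like this:&lt;/p&gt;

```python
from collections import Counter

def update_rubric(rubric, graded_feedback, min_count=5):
    """Add frequently observed mistakes to the rubric.

    rubric: dict mapping mistake description -> point deduction.
    graded_feedback: list of lists; each inner list holds the mistake
    labels flagged in one submission. (Illustrative structures only.)
    """
    counts = Counter(m for submission in graded_feedback for m in submission)
    for mistake, count in counts.items():
        # Only codify mistakes common enough to deserve a rubric entry.
        if count >= min_count and mistake not in rubric:
            rubric[mistake] = -5  # placeholder deduction, tuned by hand later
    return rubric

# Toy run: one recurring mistake appears in 6 of 9 submissions.
rubric = {"missing GROUP BY": -10}
feedback = [["off-by-one in LIMIT"], ["off-by-one in LIMIT"], []] * 3
rubric = update_rubric(rubric, feedback, min_count=5)
```

After each such pass, the updated rubric is fed back for a re-grade, mirroring the iterate-and-re-grade loop described above.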

&lt;p&gt;One additional cool thing about this whole process? Docent can summarize the feedback from all the submissions, so we can create a report on the most common slip-ups. We take this back to the classroom to discuss the tricky parts of the assignment and help everyone learn better.&lt;/p&gt;

&lt;p&gt;It&#39;s like having a super-hard-working assistant who may not know how to grade at the beginning but is always willing and eager to help. It never complains if you ask it to regrade assignments, summarize findings, or provide feedback.&lt;/p&gt;

&lt;p&gt;Use Docent, be lazy, and teach smarter, not harder!&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/6182210352457251570'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/6182210352457251570'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2024/09/developing-grading-rubrics-using-docent.html' title='Developing Grading Rubrics using Docent'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-2561629231729769109</id><published>2024-09-05T17:09:00.003-04:00</published><updated>2026-02-21T09:22:24.315-05:00</updated><title type='text'>Grading with AI: Introducing Docent</title><content type='html'>&lt;h3&gt;TL;DR&lt;/h3&gt;

&lt;p&gt;An alpha version of Docent, our experimental AI-powered grading system, is now available at &lt;a href=&quot;https://get-docent.com/&quot;&gt;https://get-docent.com/&lt;/a&gt;. If you&#39;re interested in using the system, please contact us for support.&lt;/p&gt;

&lt;h3&gt;The Challenge of Grading&lt;/h3&gt;

&lt;p&gt;One thing that I find challenging when teaching is grading, especially in large classes with numerous assignments. The task is typically delegated to teaching assistants with varying levels of expertise and enthusiasm. One particular challenge is getting TAs to provide &lt;strong&gt;detailed, constructive feedback&lt;/strong&gt; on assignments.&lt;/p&gt;

&lt;h3&gt;Our Experiment with LLMs&lt;/h3&gt;

&lt;p&gt;With the introduction of LLMs, we began exploring their potential to enhance the grading process. Our primary goal wasn&#39;t to replace human graders but to provide students with detailed, personalized feedback—effectively offering an on-demand tutor and addressing &quot;&lt;a href=&quot;https://en.wikipedia.org/wiki/Bloom%27s_2_sigma_problem&quot;&gt;Bloom&#39;s two-sigma problem&lt;/a&gt;&quot;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&quot;The average student tutored one-to-one using mastery learning techniques performed two standard deviations better than students educated in a classroom environment.&quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To evaluate the effectiveness of LLMs in grading, we used a dataset of 12,546 student submissions from a Business Analytics course spanning six academic semesters. We used human-assigned grades as our benchmark.&lt;/p&gt;

&lt;h3&gt;Good Quantitative Results&lt;/h3&gt;

&lt;p&gt;Our findings revealed a remarkably low discrepancy between LLM-assigned and human grades. We tested various LLMs using different approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With and without fine-tuning&lt;/li&gt;
&lt;li&gt;Zero-shot and few-shot learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While fine-tuning and few-shot approaches showed slight improvements, we were amazed to find that GPT-4 with zero-shot learning &lt;strong&gt;achieved a median error of just 0.6% compared to human grading&lt;/strong&gt;. In practical terms, if a human grader assigned 80/100 to an assignment, the LLM&#39;s grade typically fell within the 79.5-80.5 range—a striking consistency with human grading.&lt;/p&gt;
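&lt;p&gt;For concreteness, here is how such a discrepancy can be computed—a minimal sketch with made-up grades on a 0-100 scale, not our actual evaluation code:&lt;/p&gt;

```python
import statistics

def median_discrepancy(human_grades, llm_grades):
    """Median absolute difference between paired human and LLM grades,
    in grade points (grades assumed to be on a 0-100 scale)."""
    diffs = [abs(h, ) if False else abs(h - l) for h, l in zip(human_grades, llm_grades)]
    return statistics.median(diffs)

# Toy data (not our dataset): most LLM grades land within half a
# point of the human grade.
human = [80, 92, 75, 88, 60]
llm = [80.5, 92, 74.8, 87.5, 63]
print(median_discrepancy(human, llm))  # → 0.5
```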

&lt;h3&gt;Qualitative Feedback: Where AI Shines&lt;/h3&gt;

&lt;p&gt;LLMs excel at providing qualitative feedback. For example, &lt;a href=&quot;https://chatgpt.com/share/029b46a9-c801-4d6d-999d-c082bd9097cb&quot;&gt;in this ChatGPT thread&lt;/a&gt;, you can see the detailed feedback the LLM provided for an SQL question in a database course. It is far better and more detailed than what any human grader would realistically provide.&lt;/p&gt;

&lt;h3&gt;Real-World Implementation: Docent&lt;/h3&gt;

&lt;p&gt;Encouraged by these results, we implemented Docent to assist human graders in our Spring and Summer 2024 classes. We also conducted a user study to assess the perceived helpfulness of LLM-generated comments. However, during deployment, we identified several areas for improvement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Excessive Feedback&lt;/strong&gt;: The LLM often provides too much feedback, striving to find issues even in nearly perfect assignments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Difficulty with Negation&lt;/strong&gt;: Despite clear grading guidelines, LLMs struggle to ignore specified minor shortcomings. See below :-)

&lt;p&gt;&lt;img alt=&quot;&quot; data-original-height=&quot;816&quot; data-original-width=&quot;970&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEghOqm31wY_ShY2JQy1r5MfF0vxzRXlIicg0OlpcAQZzhSGVdzx9ZmNXeQQsUMcmP4DtAXSMt5H01ACe5sToD7UF_bqCbnyxfyAKT3giP3VwwWUolEVIcqVWmrZvCAi-ISXj_rLX9wojxnsM8S5nFbv2ItYYtw-CmsbrMh_pH2JABloU1vlG8nhIzPrjWs&quot; width=&quot;285&quot;/&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Multi-Part Assignment Challenges&lt;/strong&gt;: For assignments with multiple questions, grading each question separately yields better results than assessing the entire assignment at once.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inconsistent Performance&lt;/strong&gt;: While median performance is excellent, about 5-10% of assignments receive imperfect grades (compared to a human), leading to student appeals.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Current Status and Recommendations&lt;/h3&gt;

&lt;p&gt;Based on our experiences, here are our current recommendations for using AI in grading:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Human Supervised Use&lt;/strong&gt;: Grading using LLMs is best used as a tool for teaching assistants, who should review and adjust the AI-generated grades and feedback before releasing them to students.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Caution in High-Stakes Scenarios&lt;/strong&gt;: We advise against using AI for high-stakes grading, such as final exams, until we achieve greater robustness across all submissions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ideal for Low-Stakes Assignments&lt;/strong&gt;: LLM-based feedback is well-suited for low-stakes assignments and practice questions, where even imperfect feedback improves the current status quo.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Try Docent&lt;/h3&gt;

&lt;p&gt;To facilitate experimentation with AI-assisted grading, we&#39;ve deployed an alpha version of Docent at &lt;a href=&quot;https://get-docent.com/&quot;&gt;https://get-docent.com/&lt;/a&gt;. If you&#39;re interested in using the system, please contact us for support and guidance.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2561629231729769109'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2561629231729769109'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2024/09/grading-with-ai-introducing-docent.html' title='Grading with AI: Introducing Docent'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEghOqm31wY_ShY2JQy1r5MfF0vxzRXlIicg0OlpcAQZzhSGVdzx9ZmNXeQQsUMcmP4DtAXSMt5H01ACe5sToD7UF_bqCbnyxfyAKT3giP3VwwWUolEVIcqVWmrZvCAi-ISXj_rLX9wojxnsM8S5nFbv2ItYYtw-CmsbrMh_pH2JABloU1vlG8nhIzPrjWs=s72-c" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-5077340030545275350</id><published>2024-01-18T15:21:00.004-05:00</published><updated>2026-02-21T09:25:07.063-05:00</updated><title type='text'>The PiP-AUC score for research productivity: A somewhat new metric for paper citations and number of papers</title><content type='html'>&lt;p&gt;Many years back, we &lt;a href=&quot;https://www.behind-the-enemy-lines.com/2018/11/distribution-of-paper-citations-over.html&quot; target=&quot;_blank&quot;&gt;conducted some analysis&lt;/a&gt; on how the number of citations for a paper evolves over time. 
We noticed that while the raw number of citations is difficult to predict, if we calculate the percentile of citations for each paper, based on the year of publication, we get a number that &lt;a href=&quot;https://arxiv.org/abs/2103.16025&quot; target=&quot;_blank&quot;&gt;stabilizes very quickly&lt;/a&gt;, even within 3 years of publication. That means we can estimate the future potential of a paper rather quickly by checking how it is doing against other papers of the same age. The percentile score of a paper is a very reliable indicator of its future standing.&lt;/p&gt;
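&lt;p&gt;For concreteness, a paper&#39;s percentile within its publication-year cohort can be computed with a simple empirical count (a sketch of the idea; our exact definition may differ in details such as tie handling):&lt;/p&gt;

```python
def citation_percentile(citations, cohort_citations):
    """Percentile of a paper's citation count among papers published
    the same year, using a simple empirical-CDF definition."""
    below = sum(1 for c in cohort_citations if c < citations)
    return 100.0 * below / len(cohort_citations)

# Toy cohort of ten papers published the same year:
cohort = [0, 2, 5, 10, 20, 40, 80, 160, 320, 640]
print(citation_percentile(100, cohort))  # → 70.0
```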

&lt;p&gt;To make it easy for everyone to check the percentile scores of their papers, we created a small app at&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;strong&gt;&lt;a href=&quot;https://scholar.ipeirotis.org/&quot;&gt;https://scholar.ipeirotis.org/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;that allows anyone to search for a Google Scholar profile and then calculate the percentile scores of each paper. We then take all the papers for an author, calculate their percentile scores, and sort them in descending order based on their scores. This generates a plot like this, with the paper percentile on the y-axis and the paper rank on the x-axis.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;img alt=&quot;&quot; data-original-height=&quot;808&quot; data-original-width=&quot;804&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiy2BlC9BfVlKRVI5ogoSPRnSuomBI5XMMWnSPRhAwgqde4DAdJXLTW6pOYkQyz8P_gIElgm92eSGi-NVp2FFuvliE-x8sFlltP9s50ypa-SoqpnmFD7LA9q1GMRDJfdWIquI1XbERlGTbIn7oyBpKWMTak3JDQrd1IxKtoC98kvN5iKdT87yLtiXyog0s&quot; width=&quot;239&quot;/&gt;
&lt;/div&gt;

&lt;p&gt;Then, an obvious next question came up: How can we also normalize the x-axis, which shows the number of papers?&lt;/p&gt;

&lt;p&gt;Older scholars have more years to publish, giving them more chances to write high-percentile papers. To control for that, we also calculated the percentiles for the number of papers published, by using a dataset of around 15,000 faculty members at top US universities. The plot below shows how the percentiles for the number of publications evolve over time.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;img alt=&quot;&quot; data-original-height=&quot;908&quot; data-original-width=&quot;1156&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEiyuT3QRzyWSzBaG34vWVB6YXSw4AGBo1EHUDuBWrMDck3dVnsZB_FO0ZpJxWedvtWxt6eoRPtT0DLR80b3HpdIaFR4qL9LmCDC8Jwk4V1hK4SaotG9TMDL382H0J2zZHMSzdwweun26arZAJ-WoKLudzDovwF_2_UdUbvjF7ysGJHgVHWtYegiL0dWa1o&quot; width=&quot;306&quot;/&gt;
&lt;/div&gt;

&lt;p&gt;Now, we can use the percentile scores for the number of papers published to normalize the x-axis as well. Instead of showing the raw number of papers on the x-axis, we normalize paper productivity against the percentile benchmark shown above. The result is a graph like this for the superstar Jure Leskovec:&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;img alt=&quot;&quot; data-original-height=&quot;812&quot; data-original-width=&quot;804&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjcO7NuKjr8wT1WRMFD2k-BQX2DDALQVkxpncuRoA_TSLqs0z76oGr2kH-rb2CpiPKYiXsemBR-K7vjsn8XxXPhf2M_2zJ4N4MofbuKv9DHxEZTBrwzZh4ElEPL-I-6b5b9Geku_lxMrH3khCG9c30VCcF6a54Y4FuBi7EuL-be9oCTecA6v5po-xsV1Ww&quot; width=&quot;238&quot;/&gt;
&lt;/div&gt;

&lt;p&gt;and a less impressive one for yours truly:&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;img alt=&quot;&quot; data-original-height=&quot;812&quot; data-original-width=&quot;812&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEjg_3k00YZu7wIZQOSinqJD4WlQv4YaNODVde-9vdLgcIivSyPQoVfDOImkEOwbIrCe3tw07yNz_WlkqXODQQmyTdi0ebXkv0PR7e-ssQePSLl59QhJFj64HQXGiBg9HfsjsqXwzd9mtLN4nKo-TYkpUZt2bytIDndiR0b8vllej-BPRd2YOx10Mqir8Lg&quot; width=&quot;240&quot;/&gt;
&lt;/div&gt;

&lt;p&gt;Now, with a graph like this, with both the x and y axes normalized between 0 and 1, we have a nice new score that we have given the thoroughly boring name &quot;Percentile in Percentile Area Under the Curve&quot; score, or PiP-AUC for short. It ranges between 0 and 1, and you can look up different researchers to see their scores.&lt;/p&gt;
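&lt;p&gt;A minimal sketch of the computation (my own reconstruction of the construction described above, not the app&#39;s code): with the y-axis holding paper citation percentiles sorted in descending order and the x-axis holding the normalized productivity percentile, PiP-AUC is just the trapezoid-rule area under that curve:&lt;/p&gt;

```python
def pip_auc(points):
    """Trapezoid-rule area under a curve of (x, y) points, both axes
    normalized to [0, 1]: x is the productivity percentile, y the
    citation percentile of the paper at that rank."""
    pts = sorted(points)  # ensure x is increasing
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# A scholar whose paper percentiles decline linearly from 1 to 0 over
# the full productivity range gets a PiP-AUC of 0.5:
curve = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
print(pip_auc(curve))  # → 0.5
```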

&lt;p&gt;&lt;strike&gt;At some point, we may also calculate the percentile scores of the PiP scores, but we will do that in the future. :-)&lt;/strike&gt; UPDATE: If you are also curious about the percentiles for the PiP-AUC scores, here is the distribution:&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;img alt=&quot;&quot; data-original-height=&quot;1306&quot; data-original-width=&quot;1308&quot; height=&quot;240&quot; src=&quot;https://blogger.googleusercontent.com/img/a/AVvXsEj6VCaQZJxRrmmoxGt8Z3zPeITrQmul73Lgbqb_PmGfuErYW8R51sXp7iQGGx_IhTbQUaquucuW65-mSpo90jHKuw3gNwCZf1JfT-kk78KFHmQjx2XOKLMDqEVreG8ycONkuntiLJac2z37er6isqx9MsfHSE92x9rLtauRZY8h60R9VpRm4mPT0Co9kn0&quot; width=&quot;240&quot;/&gt;
&lt;/div&gt;

&lt;p&gt;The x-axis shows the PiP-AUC score, and the y-axis shows the corresponding percentile. So, if you have a PiP-AUC score of 0.6, you are in the top 25% (i.e., 75% percentile) for that metric. With a score of 0.8, you are in the top 10% (i.e., 90% percentile), etc.&lt;/p&gt;

&lt;p&gt;In general, the tool is helpful when trying to understand the impact of newer work published in the last few years. Especially for people with many highly cited but old papers, the percentile scores are very helpful for quickly finding the newer gems. I also like the PiP-AUC scores and plots, as they offer a good balance of overall productivity and impact. Admittedly, it is a strict score, so it is not especially bragging-worthy most of the time :-)&lt;/p&gt;

&lt;p&gt;&lt;i&gt;(With thanks to Sen Tian and Jack Rao for their work.)&lt;/i&gt;&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/5077340030545275350'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/5077340030545275350'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2024/01/the-pip-auc-score-for-research.html' title='The PiP-AUC score for research productivity: A somewhat new metric for paper citations and number of papers'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/a/AVvXsEiy2BlC9BfVlKRVI5ogoSPRnSuomBI5XMMWnSPRhAwgqde4DAdJXLTW6pOYkQyz8P_gIElgm92eSGi-NVp2FFuvliE-x8sFlltP9s50ypa-SoqpnmFD7LA9q1GMRDJfdWIquI1XbERlGTbIn7oyBpKWMTak3JDQrd1IxKtoC98kvN5iKdT87yLtiXyog0s=s72-c" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-4471366048644520266</id><published>2022-10-18T22:53:00.004-04:00</published><updated>2026-02-21T09:24:19.847-05:00</updated><title type='text'> Tell these fucking colonels to get this fucking economist out of jail.</title><content type='html'>&lt;p&gt;Today is October 18th. It is 41 years since Greece voted for Andreas Papandreou with a 48% vote percentage to be elected as prime minister, fundamentally changing the course of history for Greece. Positively or negatively, this is still debated, but the change was real.&lt;/p&gt;

&lt;p&gt;On October 6th, Roy Radner passed away at the age of 95. He was a faculty member at our department and a famous microeconomist with a highly distinguished career. Many others have written about him and his accomplishments as an economist and academic, so I will not try to do the same.&lt;/p&gt;

&lt;p&gt;But Roy also played an important role in making that election in 1981 possible. Why? Let me tell you his story.&lt;/p&gt;

&lt;a name=&quot;more&quot;&gt;&lt;/a&gt;

&lt;p&gt;When I joined Stern in 2004, Roy Radner came to my office, telling me (lovingly) that he disliked data mining, but that I should not take it personally.&lt;/p&gt;

&lt;p&gt;🙂&lt;/p&gt;

&lt;p&gt;He also wanted to connect with me, so he shared a story. He started talking:&lt;/p&gt;

&lt;p&gt;Roy:&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&quot;I had a friend from Greece. But he died a few years back.&quot;&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;[...]&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;&quot;He hired me for my first job at Berkeley. A great economist and a great department chair. Strong Trotskyist. Back in the day, especially at Berkeley, economists were not afraid to declare their political views.&quot;&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;[...]&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;&quot;I visited him in Greece, coming from Italy by ferry and then driving a long way down the western part of Greece. He had a nice Polish mother and an American wife. He also had a young son; I loved playing with him.&quot;&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;[...]&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;&quot;At some point, he left Berkeley and returned to Greece to start a new economics research center after the prime minister invited him.&quot;&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;[...]&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;(NB: At this point, I understand he is talking about Andreas Papandreou, and I am starstruck listening to all the first-hand stories about him.)&lt;/p&gt;
&lt;p&gt;&lt;i&gt;[...]&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;&quot;Well, when the dictatorship came, they arrested him. And there were rumors that the colonels may execute him.&quot;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&quot;But he was a famous economist, very well-respected. The idea that a fellow academic might be executed because of his beliefs, in a Western, allied country, was unbelievable.&quot;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&quot;So, the unthinkable happened. For the first time in history, 250 economists agreed on something. We wrote a letter demanding that the dictators release Andreas Papandreou immediately.&quot;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&quot;We wrote a letter to the US President, Lyndon Johnson, asking him to intervene and get Andreas Papandreou out of jail.&quot;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&quot;As part of a committee, Kenneth Galbraith, Kenneth Arrow, and I went to the White House to deliver the message. Johnson agreed to see us for five minutes.&quot;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&quot;Panos, you may not be familiar with US Presidents, but Johnson was a rough Texan. He was not known for being gentle and polite, and his language was not exactly… presidential.&quot;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&quot;So, after we talked to Johnson, he rolled his eyes, he picked up the phone, and said:&quot;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&lt;b&gt;&quot;Tell these fucking colonels to get this fucking economist out of jail.&quot;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&lt;b&gt;(and the rest is history)&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;This is how Roy changed the history of Greece.&lt;/p&gt;

&lt;p&gt;By getting Johnson to tell the fucking colonels to get that fucking economist out of jail. So that the fucking economist could then become a three-time prime minister of Greece and one of the most consequential prime ministers of the modern Greek republic.&lt;/p&gt;


&lt;p&gt;NYTimes: &lt;a href=&quot;https://timesmachine.nytimes.com/timesmachine/1967/05/08/83601370.html?pageNumber=1&amp;fbclid=IwAR30IsWXH8qaWatSYCN4JoGuq5dBcZc41aQj_Sn9_W8KAb3cRXJvAezcCG4&quot; target=&quot;_blank&quot;&gt;Johnson to Appeal to Save Jailed Son of Papandreou&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;NYTimes: &lt;a href=&quot;https://timesmachine.nytimes.com/timesmachine/1967/05/03/107188981.html?pageNumber=43&amp;fbclid=IwAR0SzDUnGfo0-RCa2MILF6u7S7msr9SsQCRBubcqk9OQ-ZsPfJEbqSO8qu4&quot; target=&quot;_blank&quot;&gt;Letters to the Editor of The Times - Andreas Papandreou&lt;/a&gt;&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/4471366048644520266'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/4471366048644520266'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2022/10/tell-these-fucking-colonels-to-get-this.html' title=' Tell these fucking colonels to get this fucking economist out of jail.'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-6818347905956007046</id><published>2021-11-23T15:22:00.002-05:00</published><updated>2026-02-21T09:27:00.964-05:00</updated><title type='text'>&quot;Geographic Footprint of an Agent&quot; or one of my favorite data science interview questions</title><content type='html'>&lt;p&gt;Last week we wrote in the Compass blog how we &lt;a href=&quot;https://medium.com/compass-true-north/estimating-the-geographic-area-of-agent-18bfd45657c6&quot; target=&quot;_blank&quot;&gt;estimate the geographic footprint of an agent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At the very core, the technique is simple: Use the addresses of the houses that an agent has bought or sold in the past; get their longitude and latitude; and then apply a 2-dimensional kernel density estimation to find the areas where the agent is likely to be active. Doing the kernel density estimation is easy; the fundamentals of our approach are material that you can find in any KDE tutorial. There are two twists that make the approach more interesting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How can we standardize the &quot;geographic footprint&quot; &lt;b&gt;score&lt;/b&gt; to be interpretable? The density scores that come back from a kernel density application are very hard to interpret. Ideally, we want a score from 0 to 1, with 0 being &quot;completely outside of the area of activity&quot; and 1 being &quot;as important as it gets&quot;. We show how to use a percentile transformation of the likelihood values to create a score that is normalized, interpretable, and very well calibrated.&lt;/li&gt;
&lt;li&gt;What are the metrics for evaluating such a technique? We show how we can use the concept of &quot;recall-efficiency&quot; curves to provide a common way to evaluate the models.&lt;/li&gt;
&lt;/ol&gt;
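&lt;p&gt;The core idea can be sketched in a few lines. This is a toy illustration with simulated coordinates, not the production code; the sample data and the footprint_score helper are made up for the example:&lt;/p&gt;

```python
# Toy illustration of the footprint score (simulated coordinates, not the
# production code): fit a 2-D KDE on an agent's past transaction locations,
# then turn raw densities into percentile scores in [0, 1].
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Past transactions for one hypothetical agent: (longitude, latitude)
# pairs clustered around two neighborhoods.
past = np.vstack([
    rng.normal([-73.99, 40.73], 0.01, size=(40, 2)),
    rng.normal([-73.95, 40.78], 0.01, size=(20, 2)),
])

kde = gaussian_kde(past.T)        # 2-D kernel density estimate
train_density = kde(past.T)       # densities at the agent's own points

def footprint_score(lon, lat):
    """Percentile of the candidate's density among the training densities:
    ~0 means outside the activity area, ~1 means its core."""
    d = kde([[lon], [lat]])[0]
    return float(np.mean(train_density <= d))

print(footprint_score(-73.99, 40.73))  # inside a cluster: close to 1
print(footprint_score(-73.50, 40.50))  # far away: close to 0
```

&lt;p&gt;In this sketch, the percentile transformation is what makes the raw densities interpretable: a score of 0.9 means the candidate point sits at a higher density than 90% of the agent&#39;s own past transactions.&lt;/p&gt;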

&lt;p&gt;&lt;a href=&quot;https://medium.com/compass-true-north/estimating-the-geographic-area-of-agent-18bfd45657c6&quot; target=&quot;_blank&quot;&gt;You can read more in the blog post.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Despite its simplicity, this topic ended up being an amazing interview question. I think it is a great question for separating candidates who have a deeper knowledge of data science from those with only a superficial understanding.&lt;/p&gt;

&lt;p&gt;The typical question during the interview is:&lt;/p&gt;

&lt;blockquote&gt;&quot;You have a property where the owner wants to sell the house. How can you determine if the property is &lt;i&gt;geographically&lt;/i&gt; relevant to a particular agent? For each agent, we know all their past transactions, and we can assume that future behavior is captured well by their past behavior. For all their past transactions, you can get the address, zip code, longitude, and latitude. &lt;i&gt;Do not try, for now, to figure out if the agent is the &lt;b&gt;best&lt;/b&gt; one among many candidate agents; that is a harder problem.&lt;/i&gt; Just figure out if the agent is active in the location where the property lies.&quot;&lt;/blockquote&gt;

&lt;p&gt;I am very surprised by how many people start on the wrong foot by trying to shoehorn the problem into binary classification. Indicative answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&quot;I will find all the properties that the agent has &lt;b&gt;not&lt;/b&gt; transacted in the past, and treat them as negatives&quot;.&lt;/li&gt;
&lt;li&gt;&quot;I will keep showing properties to the agent and when they say no, I will mark them as negatives&quot;&lt;/li&gt;
&lt;li&gt;&quot;I will add features like bathrooms, bedrooms, price, and then predict if it is relevant or not&quot;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience, once people start like that, they rarely get to a solution that works. The key problem is that they have no &lt;i&gt;&lt;b&gt;easy&lt;/b&gt;&lt;/i&gt; way of evaluating the model they propose.&lt;/p&gt;

&lt;p&gt;So, a typical follow-up question is:&lt;/p&gt;

&lt;blockquote&gt;How would you evaluate your approach?&lt;/blockquote&gt;

&lt;p&gt;This now gets interesting. Common superficial answers are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I would use precision and recall, or AUC.&lt;/li&gt;
&lt;li&gt;I would measure how often agents accept my recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If they recommend precision and recall, I ask how they are going to measure each. Recall is relatively easy (hopefully, the answer is a temporal training/test split), but precision gets tricky, as there is no &quot;base rate&quot; and there are no clear &quot;negatives&quot;.&lt;/p&gt;

&lt;p&gt;If someone recommends measuring agent reactions, I acknowledge that this is correct in principle, but ask them to propose an offline evaluation, so that we can test the approach before releasing it to the agents.&lt;/p&gt;
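&lt;p&gt;As an illustration, here is one way such an offline evaluation could look, assuming a simple KDE-percentile footprint model; the data and the threshold below are simulated and made up:&lt;/p&gt;

```python
# Hypothetical offline evaluation via a temporal split (simulated data):
# fit the footprint on an agent's earlier transactions, then measure what
# fraction of their *later* transactions score above a cutoff (recall).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Simulated transactions in date order; the agent stays in one area.
all_tx = rng.normal([-73.99, 40.73], 0.01, size=(100, 2))
train, test = all_tx[:70], all_tx[70:]   # temporal split: past vs. future

kde = gaussian_kde(train.T)
train_density = kde(train.T)

def scores(points):
    """Percentile scores of candidate points against the fitted footprint."""
    return np.array([np.mean(train_density <= d) for d in kde(points.T)])

threshold = 0.05                         # "inside the footprint" cutoff
recall = float(np.mean(scores(test) > threshold))
print(f"recall at threshold {threshold}: {recall:.2f}")
```

&lt;p&gt;No negatives are needed: the held-out future transactions are the only ground truth, which is exactly why the temporal split sidesteps the missing &quot;base rate&quot; problem.&lt;/p&gt;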

&lt;p&gt;This is typically the point where the better candidates, even if they veered off track, will start to move towards a better solution (not necessarily KDE; it can also be clustering, convex hulls, histograms on zip codes as vector representations, etc.), while the one-trick ponies will reveal their lack of substance.&lt;/p&gt;

&lt;p&gt;Kind of sorry that I have to retire this question from my question bank. Still impressed that such a simple question had such a strong discriminatory power.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/6818347905956007046'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/6818347905956007046'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2021/11/geographic-footprint-of-agent-or-one-of.html' title='&quot;Geographic Footprint of an Agent&quot; or one of my favorite data science interview questions'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-7338868987949503414</id><published>2019-11-18T16:27:00.006-05:00</published><updated>2026-02-21T09:27:46.260-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><title type='text'>Mechanical Turk, 97 cents per hour, and common reporting biases</title><content type='html'>&lt;p&gt;The New York Times has an article about Mechanical Turk in today&#39;s print edition: &quot;&lt;a href=&quot;https://www.nytimes.com/interactive/2019/11/15/nyregion/amazon-mechanical-turk.html&quot;&gt;I Found Work on an Amazon Website. I Made 97 Cents an Hour&lt;/a&gt;&quot;. (You will find a couple of quotes from yours truly).&lt;/p&gt;

&lt;p&gt;The content of the article follows the current zeitgeist: Tech companies exploiting gig workers.&lt;/p&gt;

&lt;p&gt;While it is hard to deny that there are tasks on MTurk that are really bad, I think that the article paints an unfairly gloomy picture of the overall platform.&lt;/p&gt;

&lt;p&gt;Here are a few of the issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Availability and survivorship bias.&lt;/b&gt; While the article accurately describes the cesspool of low-paying tasks that are available on Mechanical Turk, it fails to convey that these tasks are available on the platform &lt;i&gt;&lt;b&gt;because nobody wants to work on them&lt;/b&gt;&lt;/i&gt;. The tasks that are easily available to everyone are the ones that nobody competes to grab: low-paying, badly designed tasks.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;The activity levels of workers follow a power-law&lt;/b&gt;. We have plenty of evidence that a significant part of the work on MTurk is done by a small minority of workers. While it is hard to have a truly accurate measurement of what percent of the workers do what percent of the tasks, &lt;a href=&quot;https://en.wikipedia.org/wiki/1%25_rule_(Internet_culture)&quot;&gt;the 1% rule&lt;/a&gt; is a good approximation. For example, in my demographic surveys, where I explicitly limit the participation to only once per month, 50% of the responses come from 5% of the participants. Expect the bias to be much stronger in other, more desirable tasks. Such a heavily biased propensity to participate introduces strong sampling problems when trying to find the right set of workers to interview.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Doing piecemeal work while untrained results in low pay&lt;/b&gt;. This is a pet peeve of mine, for all the articles of the type &quot;I tried working on MTurk / driving Uber / delivering packages / etc., and I got lousy pay&quot;. Well, if you work piecemeal on &lt;b&gt;&lt;i&gt;any&lt;/i&gt;&lt;/b&gt; task, the task will take a very long time initially, and the hourly wage will suck. This holds for Turking, coding, lawyering, or anything else. If someone decides to become a freelance journalist, the first few articles will result in abysmally bad hourly wages as well; expert freelance writers often charge 10x the rates that beginner freelance writers charge, if not more. I am 100% confident that the same applies to MTurk workers: &lt;b&gt;&lt;i&gt;experienced workers make 10x what beginners make.&lt;/i&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having said that, I do agree that Amazon could prohibit tasks that are obviously paying very little (as a rule of thumb, it is impossible to get paid more than minimum wage when the HIT is paying less than 5c/task). But I also think that regular workers are smart enough to know that and avoid such tasks.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/7338868987949503414'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/7338868987949503414'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2019/11/mechanical-turk-97-cents-per-hour-and.html' title='Mechanical Turk, 97 cents per hour, and common reporting biases'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-460454432910367367</id><published>2018-11-16T13:32:00.002-05:00</published><updated>2026-02-21T09:24:37.128-05:00</updated><title type='text'>Distribution of paper citations over time</title><content type='html'>&lt;p&gt;A few weeks ago we had a discussion about citations, and how we can compare the citation impact of papers that were published in different years. Obviously, older papers have an advantage as they have more time to accumulate citations.&lt;/p&gt;

&lt;p&gt;To compare papers, just for fun, we ended up opening the profile page of each paper in Google Scholar, and we analyzed the paper citations year by year to find the &quot;winner.&quot; (They were both great papers, by great authors, fyi. It was more of a &quot;Lebron vs. Jordan&quot; discussion, as opposed to anything serious.)&lt;/p&gt;

&lt;p&gt;This process got me curious though. Can we tell how a paper is doing at any given point in time? How can we compare a 2-year-old paper, published in 2016, with 100 citations against a 10-year-old paper, published in 2008, with 500 citations?&lt;/p&gt;

&lt;p&gt;To settle the question, we started with the profiles of faculty members in the top-10 US universities and downloaded about 1.5M publications, across all fields, and their citation histories over time.&lt;/p&gt;

&lt;p&gt;We then analyzed the citation histories of these publications, and, for each year, we ranked the papers based on the number of citations received over time. Finally, we computed the citation numbers corresponding to different percentiles of performance.&lt;/p&gt;
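&lt;p&gt;The computation itself is straightforward. Here is a minimal sketch on simulated citation histories (the real analysis ran over the 1.5M downloaded publications; the simulation parameters are made up):&lt;/p&gt;

```python
# Minimal sketch of the percentile computation on simulated citation
# histories (rates and sizes are made up, not the real 1.5M-paper data).
import numpy as np

rng = np.random.default_rng(0)
n_papers, n_years = 10_000, 10
# Each paper gets its own citation rate; yearly citations are Poisson.
rates = rng.gamma(2.0, 3.0, size=(n_papers, 1))
yearly = rng.poisson(rates, size=(n_papers, n_years))
cumulative = yearly.cumsum(axis=1)    # citations accumulated by each age

# Citation counts needed to sit at a given percentile, for each age.
for p in (50, 75, 90):
    print(p, np.percentile(cumulative, p, axis=0).round())
```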

&lt;p&gt;&lt;b&gt;Cumulative percentiles&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;The plot below shows the number of citations that a paper needs to have at different stages to be placed in a given percentile.&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; data-original-height=&quot;800&quot; data-original-width=&quot;1600&quot; height=&quot;320&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieNZ9OnfTGH6iayyv92b7V3B0yfSBP323qB-_gA1vtRRzvydwahPCGs9YqxLE4TbO5XaX-4qChkuOtXa7mjVwXulLrMaTi_qsIuI4IvHoY5wAnN_4WQ1xqPfJRm_g6N1Dxt70poaYJmN8/s640/cummulative-citations.png&quot; width=&quot;640&quot;/&gt;&lt;/div&gt;

&lt;p&gt;A few data points, focusing on certain age milestones: 5-years after publication, 10-years after publication, and lifetime.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;50% line:&lt;/b&gt; The performance of a &quot;median&quot; paper. The median paper gets around 20 citations 5 years after publication, 50 citations within 10 years, and around 90 citations in its lifetime. &lt;b&gt;&lt;i&gt;Milestone scores: 20, 50, 90&lt;/i&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;75% line:&lt;/b&gt; These papers perform &quot;better,&quot; citation-wise, than 75% of the remaining papers of the same age. Such papers get around 50 citations within 5 years, 100 citations within 10 years of publication, and around 200 citations in their lifetime. &lt;b&gt;&lt;i&gt;Milestone scores: 50, 100, 200&lt;/i&gt;&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;90% line:&lt;/b&gt; These papers perform better than 90% of the papers in their cohort. Around 90 citations within 5 years, 200 citations within 10 years, and 500 citations in their lifetime. &lt;b&gt;&lt;i&gt;Milestone scores: 90, 200, 500&lt;/i&gt;&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;Yearly percentiles and peak years&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;We also wanted to check at which point papers reach their peak and start collecting fewer citations. The plot below shows the percentiles based on the number of citations accumulated each year. The vast majority of papers reach their peak 5-10 years after publication, after which the number of yearly citations starts declining.&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; data-original-height=&quot;800&quot; data-original-width=&quot;1600&quot; height=&quot;320&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtt_fDOV9b7Vn-fN6FUHwYGc3HxLu47YFu_qrzS6FDq0NQn5VfijcIZbinB9uoy6IgXYkaM8a4oMcuWNvtUEiBIQSjSTd6kwW4iRmxTlPR8O0BvM2nUInSPZcEEHUDtofXCAifOEZIBBo/s640/yearly-citations.png&quot; width=&quot;640&quot;/&gt;&lt;/div&gt;

&lt;p&gt;Below is the plot of the peak year for a paper based on the paper percentile:&lt;/p&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; data-original-height=&quot;338&quot; data-original-width=&quot;631&quot; height=&quot;213&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilL3XfeLMMSsmDek5ZEChyphenhyphenOq93hWv1aFXP2hf5T8eruq72IuzVoZ9EucMWDM84wOCqbD9-8N9-nLFP-4MaEhQLlH_Be14HmMp1ElTgEcf293vcVWCUJnuMss85JjAaMBe3PtgsxH630pA/s400/best-year.png&quot; width=&quot;400&quot;/&gt;&lt;/div&gt;

&lt;p&gt;There is an interesting effect around the 97.5% percentile: After that level, it seems that a &#39;rich-gets-richer&#39; effect kicks in, and we effectively do not observe a peak year. The number of &lt;b&gt;&lt;i&gt;citations per year&lt;/i&gt;&lt;/b&gt; keeps increasing. You could call these papers the &quot;classics&quot;.&lt;/p&gt;

&lt;p&gt;What does it take to be a &quot;classic&quot;? 200 citations at 5 years or 500 citations at 10 years.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/460454432910367367'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/460454432910367367'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2018/11/distribution-of-paper-citations-over.html' title='Distribution of paper citations over time'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieNZ9OnfTGH6iayyv92b7V3B0yfSBP323qB-_gA1vtRRzvydwahPCGs9YqxLE4TbO5XaX-4qChkuOtXa7mjVwXulLrMaTi_qsIuI4IvHoY5wAnN_4WQ1xqPfJRm_g6N1Dxt70poaYJmN8/s72-c/cummulative-citations.png" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-8639672444443797391</id><published>2018-01-29T09:53:00.004-05:00</published><updated>2026-02-21T09:29:42.816-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><title type='text'>How many Mechanical Turk workers are there?</title><content type='html'>&lt;p&gt;&lt;i&gt;TL;DR: There are about 100K-200K unique workers on Amazon Mechanical Turk. On average, there are 2K-5K workers active on Amazon at any given time, which is equivalent to having 10K-25K full-time employees. On average, 50% of the worker population changes within 12-18 months. Workers exhibit widely different patterns of activity, with most workers being active only occasionally, and few workers being very active. 
Combining our results with the results from &lt;a href=&quot;https://arxiv.org/abs/1712.05796&quot;&gt;Hara et al&lt;/a&gt;, we see that MTurk has a yearly transaction volume of a few hundred million dollars.&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;For more details read below, or take a look &lt;a href=&quot;http://www.ipeirotis.com/?publication=demographics-and-dynamics-of-mechanical-turk-workers&quot;&gt;at our WSDM 2018 paper&lt;/a&gt;.&lt;/i&gt;&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;--&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Question&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;A topic that frequently comes up when discussing Mechanical Turk is &quot;how many workers are there on the platform&quot;?&lt;/p&gt;

&lt;p&gt;In general, this is a question that is very easy for Amazon to answer, but much harder for outsiders. Amazon claims that there are 500,000 workers on the platform. How can we check the validity of this statement?&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;--&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Basic capture-recapture model&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;A common approach to this problem is the &lt;b&gt;capture-recapture&lt;/b&gt; technique, which is widely used in ecology to measure the population of a species.&lt;/p&gt;

&lt;p&gt;The simplest possible technique is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Capture/marking&lt;/b&gt; phase: Capture \(n_1\) animals, mark them, and release them back.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Recapture&lt;/b&gt; phase: A few days later, capture \(n_2\) animals. Assuming there are \(N\) animals overall, \(n_1/N\) of them are marked. So, for each of the \(n_2\) captured animals, the probability that the animal is marked is \(n_1/N\) (from the capture/marking phase).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Calculation&lt;/b&gt;: On expectation, we expect to see \(n_2 \cdot \frac{n_1}{N}\) marked animals in the recapture phase. (Notice that we do not know \(N\).) So, if we actually see \(m\) marked animals during the recapture phase, we set \(m = n_2 \cdot \frac{n_1}{N}\) and we get the estimate that:&lt;div style=&quot;text-align: center;&quot;&gt;\(N = \frac{n_1 \cdot n_2}{m}\).&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;
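&lt;p&gt;In code, the estimator above (the classic Lincoln-Petersen index) is a one-liner; the counts below are made up for illustration:&lt;/p&gt;

```python
# The estimator above, known as the Lincoln-Petersen index, in code.
# The counts below are made up for illustration.
def lincoln_petersen(n1, n2, m):
    """Estimate population size N from n1 marked, n2 recaptured, m overlap."""
    return n1 * n2 / m

# 1,000 workers surveyed in the capture phase, 1,000 in the recapture
# phase, 100 workers seen in both -> an estimate of 10,000 workers.
print(lincoln_petersen(1000, 1000, 100))  # -> 10000.0
```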

&lt;p&gt;In our setting, we adapted the same idea, where &quot;capture&quot; and &quot;recapture&quot; correspond to participating in a demographics survey. In other words, we &quot;capture/mark&quot; the MTurk workers that complete the survey on one day. Then, on another day, we &quot;recapture&quot; by surveying more workers and see how many workers overlap between the two surveys.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;--&lt;/div&gt;

&lt;p&gt;&lt;b&gt;First (naive) attempt&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;We decided to apply this technique to estimate the size of the Mechanical Turk population. We considered as the &quot;capture&quot; period the set of surveys running over a period of 30 days, and as the &quot;recapture&quot; period the surveys that we ran over another 30-day period. The plot below shows the results.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;1017&quot; data-original-width=&quot;1580&quot; height=&quot;410&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0m23rFIOGqrEIgS6hdLVNX1U3UwZMHqFENXLcOFwdc5uaGepy8VJdFEvKThI2GmvJjVUe-z6F0yaJeDUpAxRT-oJdI7nWuNZsEljvwDdaR8ZnCNvZJuuATZeb9AfPZt1DtN9E_Og-YQc/s640/population+estimates.PNG&quot; width=&quot;640&quot;/&gt;&lt;/div&gt;

&lt;p&gt;The x-axis shows the beginning of the recapture period, and the y-axis the estimate of the number of workers. The color of each dot corresponds to the difference in time between the capture-recapture periods: black is a short time, and red is a long time.&lt;/p&gt;

&lt;p&gt;If we focus on the black-color dots (~60 days between the surveys), we get a (naive) estimate of around 10K-15K workers. &lt;i&gt;(Warning: this is incorrect.)&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;While we could stop here, we see some results that are not consistent with our model. Remember that color encodes the time between samples: black for a short time (~2 months), red for a long time (~2 yrs). Notice that, as the time between the two periods increases, the estimates become higher, and we get the &quot;rainbow cake&quot; effect in the plot. For example, for July 2017, our estimate is 12K workers if we compare with a capture from May 2017, but it goes up to 45K workers if we compare with a sample from May 2015. Our model, though, says that the time between captures should &lt;b&gt;&lt;i&gt;not&lt;/i&gt;&lt;/b&gt; affect the population estimates. This indicates that something is wrong with the model.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;--&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Assumptions of basic model&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;The basic capture-recapture estimation described above relies on a couple of assumptions. Both of these assumptions are violated when applying this technique to an online environment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Assumption of no arrivals / departures (&quot;closed population&quot;)&lt;/b&gt;: The vanilla capture-recapture scheme assumes that there are no arrivals or departures of workers between the capture and recapture phase.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Assumption of no selection bias (&quot;equal catchability&quot;)&lt;/b&gt;: The vanilla capture-recapture scheme assumes that every worker in the population is equally likely to be captured.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In ecology, the issue of closed populations has been examined under many different settings (birth-death of animals, immigration, spatial patterns of movement, etc.), and there are many research papers on the topic. Catchability has received comparatively less attention. This is reasonable: in ecology, the closed-population assumption is problematic in many settings, while assuming that the probability of capture is uniform among similar animals is reasonable. Typically, the focus is on segmenting the animals into groups (e.g., nesting females vs. hunting males) and assigning different catchability to each group (but not to individuals).&lt;/p&gt;

&lt;p&gt;In online settings, though, the assumption of equal catchability is more problematic. First, we have &lt;b&gt;activity bias&lt;/b&gt;: workers exhibit very different levels of activity. A worker who works every day is much more likely to see and complete a task than someone who works once a month. Second, we have &lt;b&gt;selection bias&lt;/b&gt;: some workers may like to complete surveys, while others may avoid such tasks.&lt;/p&gt;

&lt;p&gt;So, to improve our estimates, we need to use models that alleviate these assumptions.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;--&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Endowing workers with survival probabilities&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;We can extend the model by endowing each worker with a survival probability, allowing workers to &quot;disappear&quot; from the platform. In the plot above, the population estimate increases as the time between two samples increases. This hints that workers leave the platform, and the overlap between capture and recapture shrinks over time.&lt;/p&gt;

&lt;p&gt;If we account for that, we can get an estimate that the &quot;half-life&quot; of a Mechanical Turk worker is between 12-18 months. &lt;b&gt;In other words, approximately 50% of the Mechanical Turk population changes every 12-18 months.&lt;/b&gt;&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;--&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Endowing workers with propensity to participate&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;We can also extend the model by associating a certain &lt;i&gt;propensity&lt;/i&gt; for each worker. The propensity is the probability that a worker is active and willing to participate in a task, at any given time.&lt;/p&gt;

&lt;p&gt;In our work, we assumed that the underlying &quot;propensity to participate&quot; follows a Beta distribution across the worker population, with unknown parameters. Under this assumption, the number of times a worker participates in our surveys follows a &lt;a href=&quot;https://en.wikipedia.org/wiki/Beta-binomial_distribution&quot;&gt;Beta-Binomial distribution&lt;/a&gt;. Since we know how many workers participated k times in our surveys, it is then easy to estimate the underlying parameters of the Beta distribution.&lt;/p&gt;

&lt;p&gt;Notice that we had to depart from the simple &quot;two occasion&quot; model above, and instead use multiple capturing periods over time. Intuitively, workers that have high propensity to participate will appear many times in our results, while inactive workers will appear only a few times.&lt;/p&gt;
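&lt;p&gt;Here is a sketch of that estimation on simulated data (parameters and sample sizes are made up; a real analysis would also need a zero-truncated likelihood, since workers captured zero times are unobservable, which this sketch ignores):&lt;/p&gt;

```python
# Sketch of the propensity estimation on simulated data (made-up numbers).
# Workers get Beta-distributed propensities; we observe how many of K survey
# occasions each worker appears in, and recover the Beta parameters by
# maximum likelihood on the resulting Beta-Binomial counts. (Ignored here:
# workers captured zero times are unobservable in the real setting.)
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

rng = np.random.default_rng(0)
K = 50                                      # number of capture occasions
p = rng.beta(0.3, 20.0, size=20_000)        # per-worker propensities
k = rng.binomial(K, p)                      # times each worker was captured

def neg_loglik(log_params):
    a, b = np.exp(log_params)               # keep a, b positive
    return -betabinom.logpmf(k, K, a, b).sum()

res = minimize(neg_loglik, x0=np.log([0.5, 5.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(a_hat, b_hat)                         # should land near (0.3, 20)
```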

&lt;p&gt;This analysis shows that (as expected) the distribution of activity is highly skewed: a few workers are very active on the platform, while others are largely inactive. A nice property of the Beta distribution is its flexibility: its shape can be pretty much anything: uniform, Gaussian-like, bimodal, heavy-tailed... you name it.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; data-original-height=&quot;521&quot; data-original-width=&quot;978&quot; height=&quot;340&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgO6aQ7E-A8quBKIufXusElm8gEfEfNS7NdZsD16BkRPLxe3BICrIpTvybtkUUA30_MzU9TmD9nVRVTshMU2RLCJ0URFRm-OIQuitZ378_svgVlhlgDQFVjOTgwzddJ-F2GbaT_8MsbP8w/s640/beta-histogram.png&quot; width=&quot;640&quot;/&gt;&lt;/div&gt;

&lt;p&gt;In our analysis, we estimated that the propensity follows a Beta(0.3,20) distribution. The plot above shows the complementary CDF of the distribution, i.e., &quot;what percentage of the workers have propensity higher than x&quot;.&lt;/p&gt;

&lt;p&gt;As you can see, the propensity follows a familiar (and expected) pattern. Only 0.1% of the workers have propensity higher than 0.2, and only 10% have propensity higher than 0.05.&lt;/p&gt;
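&lt;p&gt;These percentages can be read directly off the survival function of the fitted Beta(0.3, 20) distribution; a quick check, assuming scipy is available:&lt;/p&gt;

```python
from scipy.stats import beta

a, b = 0.3, 20  # the estimated propensity distribution
for threshold in (0.05, 0.2):
    p = beta.sf(threshold, a, b)  # P(propensity higher than threshold)
    print(f"share of workers with propensity above {threshold}: {p:.4f}")
```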

&lt;p&gt;Intuitively, a propensity of 0.2 means that the worker is active and willing to participate 20% of the time; this is roughly a full-time level of activity, as full-time employees work around 2000 hrs per year out of the 24*365 hours available in a year. A propensity of 0.05 means that the worker is active and available approximately 24 hr * 0.05 ~ 1.2 hours per day.&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;--&lt;/div&gt;

&lt;p&gt;&lt;b&gt;How big is the platform?&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;So, how many workers are there? Under such a highly skewed activity distribution, quoting an exact worker count is rather futile. The best we can do is give a ballpark estimate and hope to be roughly correct on the order of magnitude. Our estimates indicate that &lt;b&gt;there are around 180K distinct workers on the MTurk platform&lt;/b&gt;. This is good news for anyone trying to reach a large number of distinct workers through the platform.&lt;/p&gt;

&lt;p&gt;Our analysis also allows us to estimate how many workers are active and willing to participate in our task at any given time. We estimate that around &lt;b&gt;2K to 5K workers are available at any given time&lt;/b&gt;. Converted to full-time-employee equivalents, this corresponds to roughly 10K-25K full-time workers.&lt;/p&gt;

&lt;p&gt;The latter part also allows us to give some low and high estimates on the transaction volume of MTurk.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Lower bound&lt;/b&gt;: Assuming 2K workers active at any given time, this is 2000*24*365=17,520,000 work hours in a year. If we assume that &lt;a href=&quot;https://arxiv.org/abs/1712.05796&quot;&gt;the median wage is $2/hr&lt;/a&gt;, this is roughly &lt;b&gt;$35M/yr&lt;/b&gt; transaction volume on Amazon Mechanical Turk (with Amazon netting ~$7M in fees).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Upper bound&lt;/b&gt;: Assuming 5K workers active at any given time, this is 5000*24*365=43,800,000 work hours in a year. If we assume an average wage of $12/hr, this is around &lt;b&gt;$525M/yr&lt;/b&gt; transaction volume (with Amazon netting ~$100M in fees).&lt;/li&gt;
&lt;/ul&gt;
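&lt;p&gt;The arithmetic behind these bounds, and the full-time-equivalent conversion, is straightforward (the 20% fee is consistent with the fee figures quoted above):&lt;/p&gt;

```python
HOURS_PER_YEAR = 24 * 365  # hours covered by one always-occupied worker "slot"
FTE_HOURS = 2000           # annual hours of one full-time employee

for label, workers, wage in [("lower", 2_000, 2.0), ("upper", 5_000, 12.0)]:
    hours = workers * HOURS_PER_YEAR
    volume = hours * wage
    fees = 0.20 * volume   # Amazon's cut, roughly 20% of the volume
    print(f"{label} bound: {hours:,} hrs/yr (~{hours / FTE_HOURS:,.0f} FTEs), "
          f"${volume / 1e6:.0f}M/yr volume, ~${fees / 1e6:.0f}M in fees")
```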

&lt;p&gt;I understand that a range of $35M to $500M may not be very helpful, but these are very rough estimates. If someone wanted my own educated guess, I would put it somewhere in the middle of the two, i.e., transaction volume of a few hundreds of millions of dollars.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/8639672444443797391'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/8639672444443797391'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2018/01/how-many-mechanical-turk-workers-are.html' title='How many Mechanical Turk workers are there?'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0m23rFIOGqrEIgS6hdLVNX1U3UwZMHqFENXLcOFwdc5uaGepy8VJdFEvKThI2GmvJjVUe-z6F0yaJeDUpAxRT-oJdI7nWuNZsEljvwDdaR8ZnCNvZJuuATZeb9AfPZt1DtN9E_Og-YQc/s72-c/population+estimates.PNG" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-8406874034751453347</id><published>2017-01-17T11:42:00.000-05:00</published><updated>2017-01-17T16:08:41.997-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="crowdsourcing"/><category scheme="http://www.blogger.com/atom/ns#" term="kyc"/><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><category scheme="http://www.blogger.com/atom/ns#" term="online markets"/><category scheme="http://www.blogger.com/atom/ns#" term="regulations"/><title type='text'>Why was my Amazon Mechanical Turk registration denied?</title><content 
type='html'>(&lt;a href=&quot;https://www.quora.com/Why-was-my-Amazon-Mechanical-Turk-registration-denied/answer/Panos-Ipeirotis?srid=wO&quot;&gt;This is my answer to a question posted on Quora&lt;/a&gt;)&lt;br /&gt;
&lt;br /&gt;
Mechanical Turk is a platform for work. Workers get paid, which now makes Amazon a payment processor. Payment processors move money on behalf of other people, and are therefore under heavy scrutiny from the US government for issues related to money laundering (AML), counter-terrorism, tax compliance, etc.&lt;br /&gt;
&lt;br /&gt;
One of the key requirements for financial institutions is to have a &lt;a href=&quot;https://en.wikipedia.org/wiki/Customer_Identification_Program&quot;&gt;“Customer Identification Program” (CIP),&lt;/a&gt; also known as a &lt;a href=&quot;https://en.wikipedia.org/wiki/Know_your_customer&quot;&gt;“Know Your Customer” (KYC) &lt;/a&gt;process. The CIP/KYC is a set of procedures that a financial institution follows to establish the true identity of each customer. The exact processes vary across institutions and are rarely made public, as they are considered security measures. Furthermore, the practices are regularly reviewed by regulators (OCC, Fed, FinCEN, etc.) and change over time to follow best practices.&lt;br /&gt;
&lt;br /&gt;
In your particular case, the most likely reason is that Amazon was not able to verify your identity.&lt;br /&gt;
&lt;br /&gt;
If you are in the US, Amazon can most probably obtain your SSN and other personal details and verify whether you are a real person. However, even if you live in the US, if you have no credit history, no bank accounts, and so on, the verification will come back with low confidence. Following standard risk-management processes, Amazon could plausibly reject such applications as part of its CIP processes: it is better to have a false negative (rejecting a legitimate account) than a false positive (e.g., accepting an account that will be involved in money laundering or tax-evasion schemes).&lt;br /&gt;
&lt;br /&gt;
For other countries, Amazon&#39;s ability to follow CIP/KYC processes that conform to US regulations varies. I assume, for example, that cooperation between US and UK or Australian authorities is much smoother than with, say, Chinese authorities. So, if you live outside the US, the probability of having your account approved depends on how robustly Amazon can verify individual identities in your country.&lt;br /&gt;
&lt;br /&gt;
Given that Amazon gets paid by requesters, I assume their focus is to establish CIP processes first in regions where potential requesters reside, which is not always where workers reside. This also means that you are more likely to be approved if you first register as a requester (assuming this is an option for you) and then try to create the worker account.&lt;br /&gt;
&lt;br /&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/8406874034751453347'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/8406874034751453347'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2017/01/why-was-my-amazon-mechanical-turk.html' title='Why was my Amazon Mechanical Turk registration denied?'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-9213085377694178361</id><published>2016-03-13T14:10:00.004-04:00</published><updated>2026-02-20T21:14:17.370-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="crowdsourcing"/><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><title type='text'>AlphaGo, Beat the Machine, and the Unknown Unknowns</title><content type='html'>&lt;p&gt;In Game 4, of the 5-game series between AlphaGo and Lee Sedol, the human Go champion, Lee Sedol &lt;a href=&quot;http://www.nytimes.com/aponline/2016/03/13/world/asia/ap-as-skorea-game-human-vs-computer.html&quot;&gt;managed to get his first win&lt;/a&gt;. According to the NY Times article:&lt;/p&gt;

&lt;blockquote&gt;
Lee had said earlier in the series, which began last week, that he was unable to beat AlphaGo because he could not find any weaknesses in the software&#39;s strategy. But after Sunday&#39;s match, the 33-year-old South Korean Go grandmaster, who has won 18 international championships, said he found two weaknesses in the artificial intelligence program. &lt;strong&gt;Lee said that when he made an unexpected move, AlphaGo responded with a move as if the program had a bug, indicating that the machine lacked the ability to deal with surprises.&lt;/strong&gt;
&lt;/blockquote&gt;

&lt;hr width=&quot;50%&quot;&gt;

&lt;p&gt;This part reminded me of one of my favorite papers: &lt;strong&gt;&lt;a href=&quot;http://www.ipeirotis.com/?publication=beat-the-machine-challenging-humans-to-find-a-predictive-models-unknown-unknowns&quot;&gt;Beat the Machine: Challenging Humans to Find a Predictive Model&#39;s &quot;Unknown Unknowns&quot;&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the paper, we tried to use humans to &quot;beat the machine&quot; and identify vulnerabilities in a machine learning system. The key idea was to reward humans whenever they identified cases where the machine failed while remaining confident that its answer was correct. In other words, we encouraged humans to find &quot;unexpected&quot; errors, not just cases where the machine was naturally going to be uncertain.&lt;/p&gt;

&lt;hr width=&quot;50%&quot;&gt;

&lt;p&gt;As an example case, consider a system that detects adult content on the web. Our baseline machine learning system had an accuracy of ~99%. Then, we asked Mechanical Turk workers to do the following task: Find web pages with adult content that the machine learning system classifies as non-adult with high confidence. The humans had no information about the system, and the only thing they could do was to submit a URL and get back an answer.&lt;/p&gt;

&lt;p&gt;The reward structure was the following: Humans get $1 for each URL that the machine misses, otherwise they get $0.001. In other words, we provided a strong incentive to find problematic cases.&lt;/p&gt;
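&lt;p&gt;As a sketch, the payment rule can be written as a simple function. The name and signature are ours, and the machine&#39;s decision is simplified to a boolean; in the actual task the full reward required a high-confidence miss:&lt;/p&gt;

```python
def btm_reward(page_is_adult: bool, machine_says_adult: bool) -> float:
    """$1 when the worker surfaces an adult page that the classifier
    labeled non-adult; $0.001 participation reward otherwise."""
    if page_is_adult and not machine_says_adult:
        return 1.0   # the machine missed: an "unknown unknown"
    return 0.001

# A missed adult page earns the full reward.
print(btm_reward(page_is_adult=True, machine_says_adult=False))  # -> 1.0
```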

&lt;p&gt;After some probing, humans were quick to uncover underlying vulnerabilities: For example, adult pages in Japanese, Arabic, etc., were classified by our system as non-adult, despite their obvious adult content. Similarly for other categories, such as hate speech, violence, etc. &lt;strong&gt;Humans were quickly able to &quot;beat the machine&quot; and identify the &quot;unknown unknowns&quot;.&lt;/strong&gt;&lt;/p&gt;

&lt;hr width=&quot;50%&quot;&gt;

&lt;p&gt;Simply put, humans were able to figure out which cases the system was likely to have missed during training. At the end of the day, the training data is provided by humans, and no system has access to all possible training data. We operate in an &quot;open world&quot;, while training data implicitly assumes a &quot;closed world&quot;.&lt;/p&gt;

&lt;p&gt;As we see from the AlphaGo example, since most machine learning systems rely on the existence of training data (or some immediate feedback for their actions), machines can run into trouble when they face examples unlike any they have processed in their training data.&lt;/p&gt;

&lt;p&gt;We designed our Beat The Machine system to encourage humans to discover such vulnerabilities early.&lt;/p&gt;

&lt;p&gt;In a sense, our BTM system is like hiring hackers to break into your network, to identify security vulnerabilities before they become a real problem. The BTM system applies this principle for machine learning systems, encouraging a period of intense probing for vulnerabilities, before deploying the system in practice.&lt;/p&gt;

&lt;p&gt;Well, perhaps Google hired Lee Sedol with the same idea: Get the human to identify cases where the machine will fail, and reward the human for doing so. Only in that case, AlphaGo managed to eat its cake (figure out a vulnerability) and have it too (beat Lee Sedol, and not pay the $1M prize) :-)&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/9213085377694178361'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/9213085377694178361'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2016/03/beat-machine.html' title='AlphaGo, Beat the Machine, and the Unknown Unknowns'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-1956540616207900176</id><published>2016-02-29T10:33:00.003-05:00</published><updated>2026-02-21T09:20:09.222-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="crowdsourcing"/><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><title type='text'>A Cohort Analysis of Mechanical Turk Requesters</title><content type='html'>&lt;p&gt;In my last post, I examined the number of &quot;active requesters&quot; on Mechanical Turk, and concluded that there is a significant decline in the numbers over the last year. The definition of &quot;active requester&quot; was: &quot;&lt;i&gt;A requester is active at time X if they have a HIT running at time X&lt;/i&gt;&quot;. A potential issue with this definition is that an improvement in the speed of HIT completion (e.g., due to increased labor supply) could drive down that number.&lt;/p&gt;

&lt;p&gt;For this reason, I decided to perform a proper &lt;a href=&quot;https://en.wikipedia.org/wiki/Cohort_analysis&quot;&gt;cohort analysis&lt;/a&gt; for the requesters on Mechanical Turk. In the cohort analysis that follows, we examine how many of the requesters that first appeared on the platform in a given month (say, September 2015) are still posting tasks in subsequent months.&lt;/p&gt;

&lt;p&gt;Here is the resulting &quot;layer cake plot&quot;, which shows what happens to each cohort. Each layer corresponds to the requesters first seen in a given month. (&lt;a href=&quot;https://gist.github.com/ipeirotis/ce26e0e76a5192f89c2e&quot;&gt;code&lt;/a&gt;, &lt;a href=&quot;https://gist.github.com/ipeirotis/6af638e971537b0f9524&quot;&gt;data&lt;/a&gt;) &lt;span style=&quot;font-size: x-small;&quot;&gt;(&lt;a href=&quot;http://tomblomfield.com/post/81105143223/customer-churn-can-kill-your-startup&quot;&gt;Read this post if you want a bit more background on what the plot should look like.&lt;/a&gt;)&lt;/span&gt;&lt;/p&gt;

&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Sca6QiHNYwXgPbRSR_QD-9-XhW8yjQgcRPwf_vEfiU8wniLNyFemUJOMcT-m6ooVwC4rSQOmhmc7uiuYQv1jaOpRnZwF53NF1tF5LSajDChvqLp4coiK1HRMmsvlxP3RhU3BkyAvfiA/s1600/mturk-cohort-analysis.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;390&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Sca6QiHNYwXgPbRSR_QD-9-XhW8yjQgcRPwf_vEfiU8wniLNyFemUJOMcT-m6ooVwC4rSQOmhmc7uiuYQv1jaOpRnZwF53NF1tF5LSajDChvqLp4coiK1HRMmsvlxP3RhU3BkyAvfiA/s640/mturk-cohort-analysis.png&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;

&lt;p&gt;For example, the bottom layer corresponds to all the requesters first seen in May 2014 (the first month that the new version of MTurk Tracker started collecting data). We can see that we had ~2700 &quot;new&quot; requesters that month. (The May-2014 cohort obviously contains all prior cohorts in our dataset, as we do not know when these requesters really started posting.) Out of these requesters, approximately 1700 also posted a task in June 2014 or later, approximately 1000 posted a task in March 2015 or later, and approximately 500 posted a task in February 2016.&lt;/p&gt;

&lt;p&gt;The layer on top (slightly darker blue) illustrates the evolution of the June 2014 cohort. By stacking them on top of each other, we can see the composition of the requesters that have been active in every single month.&lt;/p&gt;
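&lt;p&gt;The construction of these layers can be sketched in pandas. The toy activity log below is hypothetical (the gists linked above contain the actual code and data); each layer counts the requesters from a cohort that are active in a given month or later:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical activity log: one row per (requester, month-with-activity).
activity = pd.DataFrame({
    "requester": ["a", "a", "b", "b", "c", "d", "d"],
    "month": pd.PeriodIndex(["2014-05", "2014-07", "2014-05", "2014-06",
                             "2014-06", "2014-06", "2014-07"], freq="M"),
})

# Cohort = the first month each requester was seen.
cohort = activity.groupby("requester")["month"].min().rename("cohort")
activity = activity.join(cohort, on="requester")

# layers[m][c] = distinct requesters from cohort c active in month m or later.
months = pd.period_range(activity["month"].min(),
                         activity["month"].max(), freq="M")
layers = pd.DataFrame({
    m: activity[activity["month"] >= m].groupby("cohort")["requester"].nunique()
    for m in months
}).fillna(0).astype(int)

# A cohort contributes nothing before its first month.
for m in months:
    layers.loc[layers.index > m, m] = 0

print(layers)
```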

&lt;p&gt;As the plot makes obvious, until March 2015 the acquisition of new requesters every month compensated for the requesters lost from prior cohorts. Starting in March 2015, however, we see a decline in the overall numbers, as the loss of requesters from prior cohorts outpaces the acquisition of new ones. So, the cohort analysis supports the conclusions of the prior post, as the trends are very similar (it is always good to have a few robustness checks).&lt;/p&gt;

&lt;p&gt;Of course, a more comprehensive cohort analysis would also analyze the revenue generated by each cohort, and not just the number of active users. That requires a little bit more digging in the data, but I will do that in a subsequent post.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/1956540616207900176'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/1956540616207900176'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2016/02/a-cohort-analysis-of-mechanical-turk.html' title='A Cohort Analysis of Mechanical Turk Requesters'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Sca6QiHNYwXgPbRSR_QD-9-XhW8yjQgcRPwf_vEfiU8wniLNyFemUJOMcT-m6ooVwC4rSQOmhmc7uiuYQv1jaOpRnZwF53NF1tF5LSajDChvqLp4coiK1HRMmsvlxP3RhU3BkyAvfiA/s72-c/mturk-cohort-analysis.png" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-3987774592514572537</id><published>2016-02-26T23:31:00.003-05:00</published><updated>2026-02-21T09:23:54.145-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="crowdsourcing"/><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><title type='text'>The Decline of Amazon Mechanical Turk</title><content type='html'>&lt;p&gt;It seems that after years of neglect, Mechanical Turk is starting to lose its appeal. In our latest measurement, we see Mechanical Turk losing 50% of its requesters in a YoY measurement.&lt;/p&gt;

&lt;p&gt;A few days ago, Kristy Milland (aka SpamGirl) asked me if there is a way to see the active requesters on Mechanical Turk over time. I did not have this dashboard on Mechanical Turk tracker, but it was an important metric, so I decided to add it to the MTurk Tracker website.&lt;/p&gt;

&lt;p&gt;So, now MTurk Tracker has a tab called &quot;&lt;a href=&quot;http://www.mturk-tracker.com/#/activerequesters&quot;&gt;Active Requesters&lt;/a&gt;&quot;, which shows how many requesters are &quot;active&quot; on Mechanical Turk at any given time. A requester is &quot;active at time X&quot; if they had a task running on MTurk both before and after time X.&lt;/p&gt;

&lt;p&gt;Here is the chart for the active requesters between Jan 1, 2015 and February 28, 2016:&lt;/p&gt;

&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkU9t7DoEl3oJrwQkmuiFvIVxj3W-76NGpJyCxvWHLVLDWAWgejNy6XImMtF4GnLJCEg96gxk5iiM3AwuGZ9s2fTQrfvRNPD_xkYAHMRmRWDkKoWWb3Gf_ZiQ_lQkhwN5ZwWB228TzgMI/s1600/mturk-active-requesters.PNG&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;310&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkU9t7DoEl3oJrwQkmuiFvIVxj3W-76NGpJyCxvWHLVLDWAWgejNy6XImMtF4GnLJCEg96gxk5iiM3AwuGZ9s2fTQrfvRNPD_xkYAHMRmRWDkKoWWb3Gf_ZiQ_lQkhwN5ZwWB228TzgMI/s640/mturk-active-requesters.PNG&quot; width=&quot;640&quot;/&gt;&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;As you can see, starting in March 2015 (that is, before the announcement of the price increases), we see a decline in the number of active requesters. Interestingly, there is a small &quot;valley&quot; around the period when the fee increases were announced. The numbers remain stable until November, but after that we see a steady decline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall, we observe a YoY decline of almost 50% in terms of active requesters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What is driving the decline? Hard to tell. Perhaps requesters abandon crowdsourcing in favor of more automated solutions, such as deep learning. Perhaps requesters with long running jobs build their own workforce (e.g., using UpWork). Perhaps they use alternative platforms, such as Crowdflower. Or perhaps my own metric is flawed, and I need to revise it.&lt;/p&gt;

&lt;p&gt;But, unless we have a bug in the code, the future does not seem promising for Mechanical Turk. And this is a shame.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/3987774592514572537'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/3987774592514572537'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2016/02/the-decline-of-amazon-mechanical-turk.html' title='The Decline of Amazon Mechanical Turk'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkU9t7DoEl3oJrwQkmuiFvIVxj3W-76NGpJyCxvWHLVLDWAWgejNy6XImMtF4GnLJCEg96gxk5iiM3AwuGZ9s2fTQrfvRNPD_xkYAHMRmRWDkKoWWb3Gf_ZiQ_lQkhwN5ZwWB228TzgMI/s72-c/mturk-active-requesters.PNG" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-2524627210064013557</id><published>2015-06-10T12:38:00.003-04:00</published><updated>2026-02-21T09:22:22.184-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="api"/><category scheme="http://www.blogger.com/atom/ns#" term="crowdsourcing"/><category scheme="http://www.blogger.com/atom/ns#" term="demographics"/><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><category scheme="http://www.blogger.com/atom/ns#" term="research"/><title type='text'>An API for MTurk Demographics</title><content type='html'>&lt;p&gt;A few months back, I launched &lt;a href=&quot;http://demographics.mturk-tracker.com/&quot;&gt;demographics.mturk-tracker.com&lt;/a&gt;, a tool that runs continuous 
surveys of the Mechanical Turk worker population and displays live statistics about gender, age, income, country of origin, etc.&lt;/p&gt;

&lt;p&gt;Of course, there are many other reports and analyses that can be presented using the data. In order to make it easier for other people to use and analyze the data, we now offer a simple API for retrieving the raw survey data.&lt;/p&gt;

&lt;p&gt;Here is a quick example: We first call the API and get back the raw responses:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import requests
import json
import pandas as pd
from datetime import datetime
import time

# The API call that returns the last 10K survey responses
url = &quot;https://mturk-surveys.appspot.com/&quot; + \
    &quot;_ah/api/survey/v1/survey/demographics/answers?limit=10000&quot;
resp = requests.get(url)
data = json.loads(resp.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then we need to reformat the returned JSON object and transform the responses into a flat table:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# This function takes as input the response for a single survey, and transforms it into a flat dictionary
def flatten(item):
    fmt = &quot;%Y-%m-%dT%H:%M:%S.%fZ&quot;
    
    hit_answer_date = datetime.strptime(item[&quot;date&quot;], fmt)
    hit_creation_str = item.get(&quot;hitCreationDate&quot;)
    
    if hit_creation_str is None: 
        hit_creation_date = None 
        diff = None
    else:
        hit_creation_date = datetime.strptime(hit_creation_str, fmt)
        # convert to unix timestamp
        hit_date_ts = time.mktime(hit_creation_date.timetuple())
        answer_date_ts = time.mktime(hit_answer_date.timetuple())
        diff = int(answer_date_ts-hit_date_ts)
    
    result = {
        &quot;worker_id&quot;: str(item[&quot;workerId&quot;]),
        &quot;gender&quot;: str(item[&quot;answers&quot;][&quot;gender&quot;]),
        &quot;household_income&quot;: str(item[&quot;answers&quot;][&quot;householdIncome&quot;]),
        &quot;household_size&quot;: str(item[&quot;answers&quot;][&quot;householdSize&quot;]),
        &quot;marital_status&quot;: str(item[&quot;answers&quot;][&quot;maritalStatus&quot;]),
        &quot;year_of_birth&quot;: int(item[&quot;answers&quot;][&quot;yearOfBirth&quot;]),
        &quot;location_city&quot;: str(item.get(&quot;locationCity&quot;)),
        &quot;location_region&quot;: str(item.get(&quot;locationRegion&quot;)),
        &quot;location_country&quot;: str(item[&quot;locationCountry&quot;]),
        &quot;hit_answered_date&quot;: hit_answer_date,
        &quot;hit_creation_date&quot;: hit_creation_date,
        &quot;post_to_completion_secs&quot;: diff
    }
    return result

# We now transform our API answer into a flat table (Pandas dataframe)
responses = [flatten(item) for item in data[&quot;items&quot;]]
df = pd.DataFrame(responses)
df[&quot;gender&quot;]=df[&quot;gender&quot;].astype(&quot;category&quot;)
df[&quot;household_income&quot;]=df[&quot;household_income&quot;].astype(&quot;category&quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can then save the data to a vanilla CSV file and see what the raw data looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Let&#39;s save the file as a CSV
df.to_csv(&quot;data/mturk_surveys.csv&quot;)

!head -5 data/mturk_surveys.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;,gender,hit_answered_date,hit_creation_date,household_income,household_size,location_city,location_country,location_region,marital_status,post_to_completion_secs,worker_id,year_of_birth
0,male,2015-06-10 15:57:23.072000,2015-06-10 15:50:23,&quot;$25,000-$39,999&quot;,5+,kochi,IN,kl,single,420.0,4ce5dfeb7ab9edb7f3b95b630e2ad0de,1992
1,male,2015-06-10 15:57:01.022000,2015-06-10 15:35:22,&quot;Less than $10,000&quot;,4,?,IN,?,single,1299.0,cd6ce60cff5e120f3c006504bbf2eb86,1987
2,male,2015-06-10 15:21:53.070000,2015-06-10 15:20:08,&quot;$60,000-$74,999&quot;,2,?,US,?,married,105.0,73980a1be9fca00947c59b93557651c8,1971
3,female,2015-06-10 15:16:50.111000,2015-06-10 14:50:06,&quot;Less than $10,000&quot;,2,jacksonville,US,fl,married,1604.0,a4cdbe00c93728aefea6cdfb53b8c489,1992
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or we can take a peek at the top countries:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Let&#39;s see the top countries
country = df[&#39;location_country&#39;].value_counts()
country.head(20)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;US    5748
IN    1281
CA      30
PH      22
GB      16
ZZ      15
DE      14
AE      11
BR      10
RO      10
TH       7
AU       7
PE       7
MK       7
FR       6
IT       6
NZ       6
SG       6
RS       5
PK       5
dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I hope that the examples are sufficient to get people started using the API, and I am looking forward to seeing what analyses people will perform.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2524627210064013557'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2524627210064013557'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2015/06/an-api-for-mturk-demographics.html' title='An API for MTurk Demographics'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-2790049802771037523</id><published>2015-06-08T20:07:00.002-04:00</published><updated>2026-02-21T09:22:55.145-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="cds"/><category scheme="http://www.blogger.com/atom/ns#" term="crowdsourcing"/><category scheme="http://www.blogger.com/atom/ns#" term="nyu"/><category scheme="http://www.blogger.com/atom/ns#" term="postdoc"/><category scheme="http://www.blogger.com/atom/ns#" term="quality"/><title type='text'>Postdoc Position for Quality Control in Crowdsourcing</title><content type='html'>&lt;p&gt;The Center for Data Science at NYU invites applications for a post-doctoral fellowship in statistical methodology relating to evaluating rater quality for a new research program in the application of crowdsourcing ratings of human speech production.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Duties and Responsibilities&lt;/b&gt;: This is a two-year postdoctoral position affiliated with the NYU Center for Data Science. The successful candidate will join a dynamic group of researchers in several NYU Centers including PRIISM, MAGNET, the Stern School of Business, the NYU Medical School and the Department of Communicative Sciences and Disorders. We are seeking highly motivated individuals to develop and test novel statistical and computational methods for evaluating rater quality in crowdsourced tasks. Responsibilities will include development, testing and implementation of statistical algorithms, as well as preparation of manuscripts for academic publication. Advanced knowledge of R is preferred.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Position Qualifications&lt;/b&gt;: Candidates will ideally have a doctoral degree in Statistics, Biostatistics, Data Science, Computer Science, or a related field, as well as genuine interest and experience in interdisciplinary research that integrates the study of human speech, citizen science games, and computational statistics. Candidates will ideally have expertise in the following areas: Bayesian statistics, numerical methods and techniques, psychometrics, and/or knowledge of programming languages. Outstanding computing and communication skills are required.&lt;/p&gt;

&lt;p&gt;Please send CV, letter of intent, and three reference letters to Daphna Harel (daphna dot harel at nyu dot edu) by July 31, 2015.&lt;/p&gt;

&lt;p&gt;The position is for 2 years (subject to good research progress). The successful candidate will be based at the NYU Center for Data Science, under the primary supervision of NYU faculty members Panos Ipeirotis and Daphna Harel, and will closely work with a multidisciplinary team including NYU faculty members Tara McAllister Byun, R. Luke DuBois, and Mario Svirsky. The position will preferably start by September 2015 (start date negotiable).&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2790049802771037523'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/2790049802771037523'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2015/06/postdoc-position-for-quality-control-in.html' title='Postdoc Position for Quality Control in Crowdsourcing'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-7140947874993405531</id><published>2015-05-29T10:20:00.003-04:00</published><updated>2026-02-21T09:27:35.667-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><category scheme="http://www.blogger.com/atom/ns#" term="online labor"/><category scheme="http://www.blogger.com/atom/ns#" term="online markets"/><title type='text'>The World Bank Report on Online Labor</title><content type='html'>&lt;p&gt;I am often asked about statistics and data about the global population of &quot;crowdsourcing&quot; workers, going beyond Mechanical Turk. 
I am happy to say that from now on I will be able to point everyone to &lt;a href=&quot;https://www.dropbox.com/s/97css2nuiihbtx5/Global%20OO%20Study_WB%20Rpt%20Final2.pdf?dl=0&quot;&gt;a study from The World Bank&lt;/a&gt;, which I was fortunate to participate in. The report examines the global landscape of online labor, identifying the opportunities it presents and providing statistics about the market.&lt;/p&gt;

&lt;p&gt;The study will be officially released on Wednesday, June 3rd. For those of you who would like to attend the launch event through Webex, here is the information:&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;---&lt;/p&gt;

&lt;p&gt;&lt;b&gt;When&lt;/b&gt;:&lt;/p&gt;
&lt;p&gt;Wednesday, June 3, 2015, 9:00AM - 11:30AM EDT&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Where&lt;/b&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://worldbankgroup.webex.com/worldbankgroup/j.php?MTID=m4ac0151d13cd83426d5b5b3614c16f99&quot;&gt;Webex URL&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Meeting number: 730 125 194&lt;/p&gt;
&lt;p&gt;Meeting password: online1&lt;/p&gt;
&lt;p&gt;Audio connection: 1-650-479-3207 Call-in toll number (US/Canada)&lt;/p&gt;
&lt;p&gt;Access code: 730 125 194&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Title&lt;/b&gt;:&lt;/p&gt;
&lt;p&gt;The New Online Outsourcing Approach for Jobs, Youth and Women&#39;s Empowerment and Services Exports&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Abstract&lt;/b&gt;:&lt;/p&gt;
&lt;p&gt;This event will discuss the new online outsourcing (OO) phenomenon in the world today, its implications for developing countries, and how your clients can leverage it as an innovative approach for jobs, youth employment, and women&#39;s empowerment.&lt;/p&gt;

&lt;p&gt;OO refers to the contracting of third-party workers and providers (often overseas) to supply services or perform tasks via Internet-based marketplaces or platforms. Also known as paid crowdsourcing, online work, microwork, and other names, these technology-mediated channels allow clients to outsource their paid work to a large, distributed, global labor pool of remote workers, with performance, coordination, quality control, delivery, and payment of such services all handled online.&lt;/p&gt;

&lt;p&gt;The global OO marketplace today includes numerous emerging and growing platforms, such as Upwork (formerly Elance-oDesk), CrowdFlower, CloudFactory, Amazon Mechanical Turk, etc. A wide variety of services can be performed online, such as data entry, digitization, graphics rendering and design, programming and app development, and accounting and legal services. Workers in developing countries can access and perform jobs from all over the world, as long as they have a computer and Internet access. In addition to jobs and income, OO offers workers a flexible schedule and working environment, develops professional skills, and drives positive social change for youth and women.&lt;/p&gt;

&lt;p&gt;The event will share with participants the OO study, which comprehensively covers the definition and segments of the market, trends and market size, economic and non-financial impact on workers, and implications and policy recommendations. In addition, the event will show how you can apply the online toolkit to assess the readiness of your client countries for OO.&lt;/p&gt;

&lt;p&gt;The World Bank&#39;s ICT Unit is excited to share this new global study and toolkit, which was developed in partnership with the Rockefeller Foundation and Dalberg Global Development Advisors.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Who:&lt;/b&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Chair: Mavis Ampah, Lead ICT Policy Specialist and Practice Lead on Jobs, GTIDR&lt;/li&gt;
&lt;li&gt;Siou Chew Kuek, Senior ICT Specialist and TTL, GTIDR&lt;/li&gt;
&lt;li&gt;Cecilia Paradi-Guilford, ICT Innovation Specialist and Co-TTL, GTIDR&lt;/li&gt;
&lt;li&gt;Saori Imaizumi, ICT Innovation and Education Consultant, GTIDR&lt;/li&gt;
&lt;/ul&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/7140947874993405531'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/7140947874993405531'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2015/05/the-world-bank-report-on-online-labor.html' title='The World Bank Report on Online Labor'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-9111941576764023466</id><published>2015-04-06T15:17:00.002-04:00</published><updated>2026-02-20T21:13:44.783-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="mechanical turk"/><category scheme="http://www.blogger.com/atom/ns#" term="surveys"/><title type='text'>Demographics of Mechanical Turk: Now Live! (April 2015 edition)</title><content type='html'>&lt;p&gt;One of the most common questions that I receive is whether I have new data about the demographics of Mechanical Turk workers. The latest data that I had collected were back in 2010, and it was not clear how things have changed since then. The key problem was not that I could not run additional surveys; that would have been trivial. However, the results of the surveys were always changing over time: the aggregate data varied too much across surveys, so I refrained from publishing data that seemed to be unreliable.&lt;/p&gt;

&lt;p&gt;So, I thought of how to tackle two problems at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make it easy for people to see current data about the demographics of Mechanical Turk workers&lt;/li&gt;
&lt;li&gt;Make it easy to understand the inherent variability of the collected data, and potentially understand the source of the variability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For that reason, we built a new site:&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;http://demographics.mturk-tracker.com/&quot;&gt;http://demographics.mturk-tracker.com/&lt;/a&gt;&lt;/p&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;(please also check &lt;a href=&quot;http://www.behind-the-enemy-lines.com/2015/06/an-api-for-mturk-demographics.html&quot;&gt;the API&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The site displays live data about the demographics of the workers, based on a short 5-question survey that workers are asked to answer (for 5 cents each). To capture variability over time, we post one survey every 15 minutes, allowing us to observe how the answers change. We also restrict each worker to answering the survey only once per month.&lt;/p&gt;
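As an illustrative sketch, the once-per-month restriction amounts to a simple timestamp check on incoming responses. This is not our actual implementation; the function name and the in-memory `last_accepted` store are made up for illustration:

```python
from datetime import datetime, timedelta

# Last accepted response time per worker (hypothetical in-memory store).
last_accepted = {}

def accept_response(worker_id, when, min_gap=timedelta(days=30)):
    """Keep a response only if this worker has not answered within the last month."""
    prev = last_accepted.get(worker_id)
    if prev is not None and when - prev < min_gap:
        return False  # too soon: discard this response
    last_accepted[worker_id] = when
    return True
```

In practice the restriction could also be enforced on the platform side; the sketch only shows the requester-side filtering of repeat responses.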

&lt;p&gt;A few key results:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://demographics.mturk-tracker.com/#/countries&quot; target=&quot;_blank&quot;&gt;Country&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Overall, we see that approximately 80% of the Mechanical Turk workers are from the US and 20% are from India.&lt;/p&gt;

&lt;img border=&quot;0&quot; height=&quot;242&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm-KVHx5m-hphdg3cZ2IufZy0VBNwGtsW0XuYpF-ab3FuJzS1Fx4zAf2dbkkhPgv6V13QHE1Slf49NmX7bqiSp7cafM15jyjGp7M90WDSqHJWXP7cwiKjxXebV0rXTv5P4o3IV5eQA9Ys/s1600/countries.PNG&quot; width=&quot;640&quot;/&gt;

&lt;p&gt;However, this mix is not stable during the day. Around 8-10am UTC (i.e., 3am NYC time, 1:30pm India time), the proportion of workers from India is much higher (~50%); it then drops to around 5% at 8-10pm UTC.&lt;/p&gt;

&lt;img border=&quot;0&quot; height=&quot;254&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6SNL-AiJTG4Inos8tScmLloSeMNA07GtBl7febIMvmHwQ45jsefFAPOFiEIg80hb4gu4k5i7fw6xdTZyy4yCIrcUjYLscnnfP1yo47l0iaccdriR0dUVSfDqJmIUfZpyjZFdY7lUgg80/s1600/countries-hourly.PNG&quot; width=&quot;640&quot;/&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://demographics.mturk-tracker.com/#/gender&quot; target=&quot;_blank&quot;&gt;Gender&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gender participation appears balanced, with roughly 50% male and 50% female workers. The charts that examine variability by hour of day and day of week show no change in this pattern.&lt;/p&gt;

&lt;img border=&quot;0&quot; height=&quot;252&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZjjJzqY04fv9lEdZqY5e9gLCmMSwV8-BEIVYYOV6Lfj6LAmGbPZGntp82-DPw2ltCpACML7fdXHB5yWv4hyOUwfZlHd3r7YQUILKOBBMa17dlHyzYTjnwtCTq1QDLQgQwXQNIcIViCsM/s1600/gender.PNG&quot; width=&quot;640&quot;/&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://demographics.mturk-tracker.com/#/yearOfBirth&quot; target=&quot;_blank&quot;&gt;Year of birth&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Roughly 50% of the workers were born in the 1980s and are around 30 years old. Approximately 20% were born in the 1990s, and another 20% in the 1970s.&lt;/p&gt;

&lt;img border=&quot;0&quot; height=&quot;248&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYNfX5lWbXTRpgW61J1L1e35w_FAyOLVW2pFLWUcQRbeiYe34d5WEwT6W0iBGTfV-tvqhuyxej3pgBgwTrDPOfeGVpx5Dx0riX5U0292c-_FefMBpZzHCkzLQrYJQgUjijXTAywWiXpXQ/s1600/year-of-birth.PNG&quot; width=&quot;640&quot;/&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://demographics.mturk-tracker.com/#/maritalStatus&quot; target=&quot;_blank&quot;&gt;Marital Status&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Approximately 40% of the workers are single, 40% are married, and 10% are cohabiting.&lt;/p&gt;

&lt;img border=&quot;0&quot; height=&quot;232&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtM59YVnUSyJjxZMi1vGlW8LO3Dke4Ln-Vo_V6PXWadUj5nExSwZAqmx5fM_OAL10qCfdlG1m-cxENj84rSfF1jqLwgSI8VhdwoSka12LbWQmaolDWNQS4jruGyHZayJROUId_264AXl8/s1600/marital-status.PNG&quot; width=&quot;640&quot;/&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://demographics.mturk-tracker.com/#/householdSize&quot; target=&quot;_blank&quot;&gt;Household Size&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Approximately 15% of workers live alone, 25% live in a household of two, 25% in a household of three, around 25% in a household of four, and around 10% in a household of five or more.&lt;/p&gt;

&lt;img border=&quot;0&quot; height=&quot;262&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT3j4p3qvgh9ZaTGvTqe9_AA2jSypay2CA9HH9qk6txtzWFEa4-nZty1hqMG8SIIm8l_aN2X8_UAduHbc_muKE3kCxvKKT_mmekd0-oGHbUdSdfafOBZhIXMK5MPhr69C_e-jZBVgNCIg/s1600/hh-size.PNG&quot; width=&quot;640&quot;/&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://demographics.mturk-tracker.com/#/householdIncome&quot; target=&quot;_blank&quot;&gt;Income level&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The median household income of US Turkers is around $50K per year, on par with the overall US median household income. Indian workers report considerably lower household incomes, with most around $10K per year.&lt;/p&gt;

&lt;img border=&quot;0&quot; height=&quot;226&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiu5TbOGGPTFXgrMuu-HWvvKms-_D5UNIU_1qecmt7lYBvfG4GkysR5RYoDB5rvSuZJKO5ZQGUVbM4yHYbPgTLezy9lVqRGfmpTkTSwyu_agstXdu6HVKLiph-VfFxGv9za2tZoaoJnbgw/s1600/hh-income.PNG&quot; width=&quot;640&quot;/&gt;

&lt;p&gt;&lt;strong&gt;Next steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our next steps, we plan on making the (anonymized) survey responses available through an API, and potentially add a few more graphs of interest. If you have any idea or suggestion, please send it my way.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/9111941576764023466'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/9111941576764023466'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2015/04/demographics-of-mechanical-turk-now.html' title='Demographics of Mechanical Turk: Now Live! (April 2015 edition)'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm-KVHx5m-hphdg3cZ2IufZy0VBNwGtsW0XuYpF-ab3FuJzS1Fx4zAf2dbkkhPgv6V13QHE1Slf49NmX7bqiSp7cafM15jyjGp7M90WDSqHJWXP7cwiKjxXebV0rXTv5P4o3IV5eQA9Ys/s72-c/countries.PNG" height="72" width="72"/></entry><entry><id>tag:blogger.com,1999:blog-7118563403027467631.post-4502494331055736351</id><published>2014-06-09T16:07:00.003-04:00</published><updated>2026-02-21T09:29:27.913-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="crowdsourcing"/><category scheme="http://www.blogger.com/atom/ns#" term="education"/><category scheme="http://www.blogger.com/atom/ns#" term="mooc"/><category scheme="http://www.blogger.com/atom/ns#" term="moocs"/><category scheme="http://www.blogger.com/atom/ns#" term="peer reviewing"/><category scheme="http://www.blogger.com/atom/ns#" term="teaching"/><title type='text'>My Peer Grading Scheme</title><content type='html'>&lt;p&gt;One of the 
components that I use in my class is student presentations.&lt;/p&gt;

&lt;p&gt;While I like having students present, I always had a hard time grading the presentations. Moreover, many students seemed to target their presentations at me, trying to sound technical and advanced, which left the rest of the class bored and uninterested.&lt;/p&gt;

&lt;p&gt;For that reason, I adopted a peer-grading scheme: students present to the class and are rated by their classmates, not by me. (Although I still reserve a small degree of editorial judgment when assigning the final grades.) Here is how the scheme works, after a few years of refinement.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;b&gt;Rating scale:&lt;/b&gt; Students assign a grade from 0 to 10 to the presentations.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;No self-grading:&lt;/b&gt; Students do not grade their own presentations. (&lt;i&gt;Early on, some students assigned a 10 to themselves and lower grades to everyone else. Now they can still grade themselves if they want, but the grade is ignored&lt;/i&gt;.)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Normalization:&lt;/b&gt; Each student&#39;s assigned grades are normalized to zero mean and unit standard deviation. (&lt;i&gt;This normalization was introduced to defeat students who tried to game the system by assigning low grades to everyone else, hoping to lower the average rating of all other students&lt;/i&gt;.)&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Grade assignment:&lt;/b&gt; The presentation grade is the average of the assigned normalized scores. Formally, each student \(s\) assigns to presentation \(t\) a normalized grade \(z(s,t)\); the overall grade of the presentation is the mean value \(E[z(*,t)]\) of the \(z(s,t)\) grades.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Ensuring careful grading by asking students to estimate the class rating:&lt;/b&gt; One problem with the peer-grading scheme was that many students did not take it seriously and assigned random grades (typically, the same grade to everyone). To avoid indifferent grading, I decided to give credit (~10%) based on the correlation of a student&#39;s assigned grades \(z(s,t)\) with the mean values \(E[z(*,t)]\), across all presentations \(t\). This ensures that students at least try to figure out what grade the rest of the class will assign to each presentation, rather than grading at random.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Separate assigned and estimated grades:&lt;/b&gt; The problem with introducing the requirement to agree with the class was that some students believed themselves to be better assessors than the rest of the class; they felt that their own grade was the correct one, and did not like losing credit for assigning their own &quot;true&quot; grade. To address that issue, I now ask students to assign two grades: their own grade \(z_p(s,t)\), and an estimate of the class grade \(z_c(s,t)\). The personal grade \(z_p\) is used to compute \(E[z(*,t)]\) in Step 4, and the class estimate \(z_c\) is used to compute the correlation in Step 5.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Examine self-grading:&lt;/b&gt; Given that the class-estimate grades are not directly used to grade a presentation, students are also asked to provide an estimate of the class grade for their own presentation as part of Step 6. Effectively, students are encouraged to assess their own work accurately.&lt;/li&gt;
&lt;/ol&gt;
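To make Steps 3-5 concrete, here is a minimal sketch in Python. It is not the code I use in class; the function names and the data layout (nested dicts mapping each grader to per-presentation grades) are made up for illustration:

```python
from statistics import mean, pstdev

def normalize(grades):
    """Step 3: z-score one student's grades (zero mean, unit std dev)."""
    mu, sigma = mean(grades.values()), pstdev(grades.values())
    if sigma == 0:  # the student gave everyone the same grade
        return {t: 0.0 for t in grades}
    return {t: (g - mu) / sigma for t, g in grades.items()}

def presentation_grades(raw):
    """Step 4: mean normalized grade per presentation.

    raw maps each grader to {presentation: grade}, with the
    grader's own presentation excluded (Step 2).
    """
    z = {s: normalize(g) for s, g in raw.items()}
    talks = {t for g in z.values() for t in g}
    return {t: mean(z[s][t] for s in z if t in z[s]) for t in talks}

def grader_correlation(estimates, class_mean):
    """Step 5: Pearson correlation of one student's class-grade
    estimates against the class means, across presentations."""
    ts = [t for t in estimates if t in class_mean]
    x, y = [estimates[t] for t in ts], [class_mean[t] for t in ts]
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0
```

A grader who assigns the same estimate to every presentation gets a correlation of zero, which is exactly the "indifferent grading" that Step 5 penalizes.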

&lt;p&gt;The only thing that I have not tried so far is modifying Step 4 to take into account the correlations from Step 5, effectively weighting each student&#39;s grades by their agreement with the rest of the class. However, most students exhibit the same, moderate agreement with the class (typical correlations are in the 0.4-0.6 range, after rating 15-20 presentations), so in practice I do not expect to see a difference.&lt;/p&gt;
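For anyone who wants to experiment with that correlation-weighted variant, here is a hedged sketch. The inputs `z` (each student's normalized grades) and `corr` (each student's Step-5 correlation) are hypothetical names; flooring negative correlations at zero is a design choice of this sketch, not part of the scheme:

```python
def weighted_grades(z, corr):
    """Variant of Step 4: weight each student's normalized grades
    by their Step-5 correlation with the class, floored at zero."""
    talks = {t for g in z.values() for t in g}
    out = {}
    for t in talks:
        pairs = [(max(corr[s], 0.0), z[s][t]) for s in z if t in z[s]]
        total = sum(w for w, _ in pairs)
        out[t] = sum(w * g for w, g in pairs) / total if total else 0.0
    return out
```

With the observed correlations clustered in the 0.4-0.6 range, all weights are similar, so the weighted average stays close to the plain average of Step 4.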

&lt;p&gt;Overall, I am pretty happy with the scheme. Students indeed try to impress the class (and not me), and many presentations are interesting, interactive, and engaging. The grades are also very consistent with the overall feeling that I get for each presentation, so I did not have to practice my &quot;editorial oversight&quot; and adjust the grade very often (only in a couple of cases, where the students ran into technical problems during the presentation). I would be really interested to try this scheme in one of the big MOOC classes that use peer grading, and see if it can instill the same sense of responsibility in peer grading.&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/4502494331055736351'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7118563403027467631/posts/default/4502494331055736351'/><link rel='alternate' type='text/html' href='http://www.behind-the-enemy-lines.com/2014/06/my-peer-grading-scheme.html' title='My Peer Grading Scheme'/><author><name>Panos Ipeirotis</name><uri>http://www.blogger.com/profile/15283752183704062501</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry></feed>