<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Fri, 13 Mar 2026 11:33:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Capability Architecture for AI-Native Engineering</title>
		<link>https://www.oreilly.com/radar/capability-architecture-for-ai-native-engineering/</link>
				<comments>https://www.oreilly.com/radar/capability-architecture-for-ai-native-engineering/#respond</comments>
				<pubDate>Fri, 13 Mar 2026 11:33:01 +0000</pubDate>
					<dc:creator><![CDATA[Juliette van der Laarse]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18262</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Robot-AI-in-the-office.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Robot-AI-in-the-office-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A public model for engineering work in an AI-native world]]></custom:subtitle>
		
				<description><![CDATA[A few years into the AI shift, the gap between engineers is not talent. It’s coordination: shared norms and a shared language for how AI fits into everyday engineering work. Some teams are already getting real value. They’ve moved beyond one-off experiments and started building repeatable ways of working with AI. Others haven’t, even when [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>A few years into the AI shift, the gap between engineers is not talent. It’s coordination: shared norms and a shared language for how AI fits into everyday engineering work. Some teams are already getting real value. They’ve moved beyond one-off experiments and started building repeatable ways of working with AI. Others haven’t, even when the motivation is there. The reason is often simple: The cost of orientation has exploded. The landscape is saturated with tools and advice, and it’s hard to know what matters, where to start, and what “good” looks like once you care about production realities.</p>



<h2 class="wp-block-heading">The missing map</h2>



<p>What’s missing is a shared reference model. Not another tool. A map. Which engineering activities can AI responsibly support? What does quality mean for those outputs? What changes when part of the workflow becomes probabilistic? And what guardrails keep integration safe, observable, and accountable? Without that map, it’s easy to drown in novelty, and easy to confuse widespread experimentation with reliable integration. Teams with the least time, budget, and local support pay the highest price, and the gap compounds.</p>



<p>That gap is now visible at the organizational level. More organizations are trying to turn AI into business value, and the difference between hype and integration is showing up in practice. It’s easy to ship impressive demos. It’s much harder to make AI-assisted work reliable under real-world constraints: measurable quality, controllable failure modes, clear data boundaries, operational ownership, and predictable cost and latency. This is where engineering discipline matters most. AI does not remove the need for it; it amplifies the cost of missing it. The question is how we move from scattered experimentation to integrated practice without burning cycles on tool churn. To do that at scale, we need shared scaffolding: a public model and shared language for what “good” looks like in AI-native engineering.</p>



<p>We have seen why this kind of shared scaffolding matters before. In the early internet era, promise and noise moved faster than standards and shared practice. What made the internet durable was not a single vendor or methodology but a cultural infrastructure: open knowledge sharing, global collaboration, and shared language that made practices comparable and teachable. AI-native engineering needs the same kind of cultural infrastructure, because integration only scales when the industry can coordinate on what “good” means. AI does not remove the need for careful engineering. On the contrary, it punishes the absence of it.</p>



<h2 class="wp-block-heading">A public scaffold for AI-native engineering</h2>



<p>In the second half of 2025, I began to notice growing unease among engineers I worked with and friends in IT. There was a clear sense that AI would change our work in profound ways, but far less clarity on what that actually meant for a person’s role, skills, and daily practice. There was no shortage of training courses, guides, blogs, or tools, but the more resources appeared, the harder it became to judge what was relevant, what was useful, and where to begin. It felt overwhelming. How do you know which topics truly matter to you when suddenly everything is labeled AI? How do you move from hype to useful integration?</p>



<p>I was feeling much of that same uncertainty myself. I was trying to make sense of the shift too, and for a while I think I was waiting for a clearer structure to emerge from elsewhere. It was only when friends started reaching out to me for help and guidance that I realized I might have something meaningful to contribute. I do not consider myself an AI expert. I am finding my way through these changes just like many other engineers. But over the years, I had become known for my work in IT workforce development, skill and capability frameworks, and engineering excellence and enablement. I know how to help people navigate complexity in a practical and sustainable way, and I enjoy bringing clarity to chaos.</p>



<p>That is what led me to start working on the AI Flower as a hobby project in early October 2025, building on frameworks and methods I already had experience with.</p>



<p>When I began sharing it with friends in IT to gather feedback, I saw how much it resonated. It helped them make sense of the complexity around AI, think more clearly about their own upskilling, and begin shaping AI adoption strategies of their own. That is when I realized this casual experiment held real value, and decided I wanted to publish it so it could help empower other engineers and IT organizations in the same way it had helped my friends.</p>



<p>With the AI Flower, I’m offering a public scaffold for AI-native engineering work: a shared reference model that helps engineers, teams, and organizations adopt and integrate AI sustainably and reliably. It’s meant to steer and organize the conversation around AI-assisted engineering, and to invite targeted feedback on what breaks, what’s missing, and what “good” should mean in real production contexts. It’s not meant to be perfect. It’s meant to be useful, freely available, open to contribution, and shaped by the strongest resource our industry has: collective intelligence.</p>



<p>Open knowledge sharing and collaboration cannot be optional. If AI is becoming part of how we design, build, operate, secure, and govern systems, we need more than tools and enthusiasm. Many of us work on systems people rely on every day. When those systems fail, the impact is real. That’s why we owe it to the people who depend on these systems to do this with care, and why we won’t get there in isolation. We need the industry, globally, to converge on shared standards for dependable practice.</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="1146" height="1080" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7.png" alt="" class="wp-image-18263" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7.png 1146w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7-300x283.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7-768x724.png 768w" sizes="(max-width: 1146px) 100vw, 1146px" /><figcaption class="wp-element-caption"><strong>The AI Flower visualized</strong>: Petals represent engineering disciplines, and each encompasses core engineering activities, best practices, learning resources, AI risk and considerations, and AI guidance per activity.</figcaption></figure>



<h2 class="wp-block-heading">About the AI Flower</h2>



<p>The AI Flower maps the core activities that make up engineering work across the main engineering disciplines. For each activity, it defines what good looks like, based on practices that should already feel familiar to engineers. It then helps people explore how AI can support those activities in practice, providing guidance on how to begin using AI in that work, sharing links to useful learning resources, and outlining the main risks, trade-offs, and mitigations.</p>
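


<p>To make that structure concrete, here is a small, purely illustrative sketch of how a single activity inside one petal might be encoded. The field names and example content are hypothetical, not the AI Flower’s published schema; the point is only that “activity, what good looks like, AI guidance, risks, and resources” can be captured as a reviewable, extensible artifact.</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass

@dataclass
class Activity:
    """One engineering activity inside a discipline ("petal").

    Illustrative field names only; not the AI Flower's published schema.
    """
    name: str
    what_good_looks_like: list[str]        # practices engineers already recognize
    ai_guidance: list[str]                 # how to begin using AI for this activity
    risks_and_mitigations: dict[str, str]  # risk mapped to its mitigation
    learning_resources: list[str]          # pointers to useful material

# Hypothetical example entry for a code-review activity
code_review = Activity(
    name="Code review",
    what_good_looks_like=[
        "Small, focused changes with clear intent",
        "Reviewers check behavior, tests, and security, not just style",
    ],
    ai_guidance=[
        "Use an assistant for a first pass, then review its findings yourself",
    ],
    risks_and_mitigations={
        "Over-trusting AI findings": "Keep a human accountable for the merge decision",
    },
    learning_resources=["(links would go here)"],
)</code></pre>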



<p>But the AI landscape is changing quickly. This activity-based approach helps engineers understand how AI can support core engineering tasks, where risks may arise, and how to start building practical experience. But on its own, it isn’t enough as a long-term model for AI adoption.</p>



<p>As AI capabilities evolve, many engineering activities will become more abstracted, more automated, or absorbed into the infrastructure layer. That means engineers will need to do more than learn how to use AI within today’s activities. They will also need to work with emerging approaches such as context engineering and agentic workflows, which are already reshaping what we consider core engineering work. A concept I call the Skill Fossilization Model captures that progression. It shows how both engineering skills and AI-related skills evolve over time, and how some of them become less visible as work moves to a higher level of abstraction. Together, the AI Flower and the Skill Fossilization Model are meant to help engineers stay adaptable as the field continues to shift.</p>



<p>The main purpose of the AI Flower is to help engineers find their way through these rapid changes and grow with them. While I provide content for each section and activity, the real value lies in the framework and structure itself. To become truly valuable, it will need the insight, care, and contribution of engineers across disciplines, perspectives, and regions.</p>



<p>I genuinely believe the AI Flower, as an open and freely available framework, can serve as a scaffold for that work. This is my contribution to a changing industry. But it will only be useful—it will only “bloom”—if the community tests it, challenges it, and improves it over time.</p>



<p>And if any industry can turn open critique and contribution into shared standards at a global scale, it’s ours, isn’t it?</p>



<h2 class="wp-block-heading">Join me at AI Codecon to learn more</h2>



<p>If the AI Flower resonates and you want the full walkthrough, I’ll be presenting it at <a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener">O’Reilly’s upcoming AI Codecon</a>. (Registration is free and open to all.)</p>



<p>If you’re concerned about how quickly AI engineering patterns are evolving, that concern is valid. We’ve already seen the center of gravity shift from ad hoc prompt work, to context engineering, to increasingly agentic workflows, and there is more coming. A core design goal of the AI Flower is to stay stable across those shifts by focusing on underlying capabilities rather than specific techniques. I’ll go deeper on that stability principle, including the Skill Fossilization Model, at AI Codecon as well.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/capability-architecture-for-ai-native-engineering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Generative AI in the Real World: Sharon Zhou on Post-Training</title>
		<link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-sharon-zhou-on-post-training/</link>
				<comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-sharon-zhou-on-post-training/#respond</comments>
				<pubDate>Thu, 12 Mar 2026 11:15:56 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica and Sharon Zhou]]></dc:creator>
						<category><![CDATA[Generative AI in the Real World]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=18239</guid>

		<enclosure url="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4" length="0" type="audio/mpeg" />
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Post-training gets your model to behave the way you want it to. As AMD VP of AI Sharon Zhou explains to Ben on this episode, the frontier labs are convinced, but the average developer is still figuring out how post-training works under the hood and why they should care. In their focused discussion, Sharon and [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-video"><video controls src="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4"></video></figure>



<p>Post-training gets your model to behave the way you want it to. As AMD VP of AI Sharon Zhou explains to Ben on this episode, the frontier labs are convinced, but the average developer is still figuring out how post-training works under the hood and why they should care. In their focused discussion, Sharon and Ben get into the process and trade-offs, techniques like supervised fine-tuning, reinforcement learning, in-context learning, and RAG, and why we still need post-training in the age of agents. (It’s how to get the agent to actually work.) Check it out.</p>



<p>About the <em>Generative AI in the Real World</em> podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>



<p>Check out other episodes of this podcast <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*pra1u5*_gcl_au*Mzc5ODUxNDEzLjE3NzI3NDUyNzk.*_ga*NjI3OTAzNjIzLjE3NzI0NzYxMzg.*_ga_092EL089CH*czE3NzMwODg2NjgkbzI3JGcwJHQxNzczMDg4NjY4JGo2MCRsMCRoMA.." target="_blank" rel="noreferrer noopener">on the O’Reilly learning platform</a> or follow us on <a href="https://www.youtube.com/playlist?list=PL055Epbe6d5YcJUhZbsVW9dlMueIuOxK_" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/5C9oof8TFkP65lDUcEy5jT" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/generative-ai-in-the-real-world/id1835476293" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</p>



<h1 class="wp-block-heading"><strong>Transcript</strong></h1>



<p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=0" target="_blank" rel="noreferrer noopener">00.00</a><br><strong>Today we have a VP of AI at AMD and old friend, Sharon Zhou. And we&#8217;re going to talk about post-training mainly. But obviously we’ll cover other topics of interest in AI. So Sharon, welcome to the podcast. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=17" target="_blank" rel="noreferrer noopener">00.17</a><br>Thanks so much for having me, Ben. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=19" target="_blank" rel="noreferrer noopener">00.19</a><br><strong>All right. So post-training. . . For our listeners, let&#8217;s start at the very basics here. Give us your one- to four-sentence definition of what post-training is even at a high level? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=35" target="_blank" rel="noreferrer noopener">00.35</a><br>Yeah, at a high level, post-training is a type of training of a language model that gets it to behave in the way that you want it to. For example, getting the model to chat, like the chat in ChatGPT was done by post-training.</p>



<p>So basically teaching the model to not just have a huge amount of knowledge but actually be able to have a dialogue with you, for it to use tools, hit APIs, use reasoning and think through things step-by-step before giving an answer—a more accurate answer, hopefully. So post-training really makes the models usable. And not just a piece of raw intelligence, but more, I would say, usable intelligence and practical intelligence.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=74" target="_blank" rel="noreferrer noopener">01.14</a><br><strong>So we&#8217;re two or three years into this generative AI era. Do you think at this point, Sharon, you still need to convince people that they should do post-training, or that&#8217;s done; they’re already convinced?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=91" target="_blank" rel="noreferrer noopener">01.31</a><br>Oh, they&#8217;re already convinced because I think the biggest shift in generative AI was caused by post-training ChatGPT. The reason why ChatGPT was amazing was actually not because of pretraining or getting all that information into ChatGPT. It was about making it usable so that you could actually chat with it, right?</p>



<p>So the frontier labs are doing a ton of post-training. Now, in terms of convincing, I would say that for the frontier labs, the new labs, they don&#8217;t need any convincing for post-training. But I think for the average developer, there is, you know, something to think about on post-training. There are trade-offs, right? So I think it&#8217;s really important to learn about the process because then you can actually understand where the future is going with these frontier models.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=135" target="_blank" rel="noreferrer noopener">02.15</a><br>But I think there is a question of how much you should do on your own, versus, us[ing] the existing tools that are out there. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=143" target="_blank" rel="noreferrer noopener">02.23</a><br><strong>So by convincing, I mean not the frontier labs or even the tech-forward companies but your mom and pop. . . Not mom and pop. . . I guess your regular enterprise, right?</strong></p>



<p><strong>At this point, I&#8217;m assuming they already know that the models are great, but they may not be quite usable off the shelf for their very specific enterprise application or workflow. So is that really what’s driving the interest right now—that people are actually trying to use these models off the shelf, and they can&#8217;t make them work off the shelf?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=184" target="_blank" rel="noreferrer noopener">03.04</a><br>Well, I was hoping to be able to talk about my neighborhood pizza store post-training. But I think, actually, for your average enterprise, my recommendation is less so trying to do a lot of the post-training on your own—because there&#8217;s a lot of infrastructure work to do at scale to run on a ton of GPUs, for example, in a very stable way, and to be able to iterate very effectively.</p>



<p>I think it&#8217;s important to learn about this process, however, because I think there are a lot of ways to influence post-training so that your end objective can happen in these frontier models or inside of an open model, for example, to work with people who have that infrastructure set up. So some examples could include: You could design your own RL environment, which is a little sandbox environment for the model to go learn a new type of skill—for example, learning to code. This is how the model learns to code or learns math, for example. And it&#8217;s a little environment that you&#8217;re able to set up and design. And then you can give that to the different model providers, or to APIs that can help you with post-training these models, for example. And I think that&#8217;s really valuable because that gets the capabilities into the model that you want, that you care about at the end of the day.</p>
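


<p><em>To picture what such a sandbox looks like, here is a minimal sketch of the shape an RL environment often takes: a reset that produces a task, a step that checks the model&#8217;s attempt, and a verifiable reward. The toy arithmetic task and all names here are illustrative assumptions, not any particular provider&#8217;s API.</em></p>



<pre class="wp-block-code"><code>import random

class ArithmeticEnv:
    """Toy sketch of an RL environment: the model must answer a + b correctly.

    Purely illustrative; real environments wrap code execution, tool use, etc.
    """

    def reset(self) -> str:
        self.a, self.b = random.randint(0, 99), random.randint(0, 99)
        return f"What is {self.a} + {self.b}? Answer with a number only."

    def step(self, model_answer: str):
        # Verifiable reward: 1.0 only if the parsed answer is exactly correct.
        try:
            reward = 1.0 if int(model_answer.strip()) == self.a + self.b else 0.0
        except ValueError:
            reward = 0.0
        return reward, True  # single-turn task, so it is always done

# Usage sketch: collect (prompt, answer, reward) tuples to feed an RL post-training loop.
env = ArithmeticEnv()
prompt = env.reset()
reward, done = env.step("42")  # a model's answer would go here</code></pre>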



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=259" target="_blank" rel="noreferrer noopener">04.19</a><br><strong>So a few years ago, there was this general excitement about supervised fine-tuning. And then suddenly there were all these services that made it dead simple. All you had to do is come up with labeled examples. Granted, that that can get tedious, right? But once you do that, you upload your labeled examples, go out to lunch, come back, you have an endpoint that’s fine-trained, fine-tuned. So what happened to that? Is that something that people ended up continuing down that path, or are they abandoning it, or are they still using it but with other things?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=300" target="_blank" rel="noreferrer noopener">05.00</a><br>Yeah. So I think it&#8217;s a bit split. Some people have found that doing in-context learning—essentially putting a lot of information into the prompt context, into the prompt examples, into the prompt—has been fairly effective for their use case. And others have found that that&#8217;s not enough, and that actually, doing supervised fine-tuning on the model can get you better results, and you can do so on a smaller model that you can make private and make very low latency. And also like effectively free if you have it on your own hardware, right?</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=330" target="_blank" rel="noreferrer noopener">05.30</a><br>So I think those are kind of the trade-offs that people are thinking through. It&#8217;s obviously very much easier essentially to do in-context learning. And it could actually be more cost-effective if you&#8217;re only hitting that API a few times. Your context is quite small.</p>



<p>And hosted models like <a href="https://www.anthropic.com/claude/haiku" target="_blank" rel="noreferrer noopener">Haiku</a>, for example, a very small model, are quite cheap and low latency already. So I think there&#8217;s basically that trade-off. And with all of machine learning, with all of AI, this is something that you have to test empirically.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=363" target="_blank" rel="noreferrer noopener">06.03</a><br>So I would say the biggest thing is people are testing these things empirically, the differences between them and those trade-offs. And I&#8217;ve seen a bit of a split, and I really think it comes down to expertise. So the more you know how to actually tune the models, the more success you&#8217;ll get out of it immediately with a very small timeline. And you&#8217;ll understand how long something will take versus if you don&#8217;t have that experience, you will struggle and you might not be able to get to the right result in the right time frame, to make sense from an ROI perspective. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=395" target="_blank" rel="noreferrer noopener">06.35</a><br><strong>So where does retrieval-augmented generation fall into the spectrum of the tools in the toolbox?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=404" target="_blank" rel="noreferrer noopener">06.44</a><br>Yeah. I think RAG is a way to actually prompt the model and use search basically to search through a bunch of documents and selectively add things into the context, whether it be the context is too small, so like, it can only handle a certain amount of information, or you don&#8217;t want to distract the model with a bunch of irrelevant information, only the relevant information from retrieval.</p>



<p>I think retrieval is a very powerful search tool. And I think it&#8217;s important to know that while you use it at inference time quite a bit, this is something you teach the model to use better. It&#8217;s a tool that the model needs to learn how to use, and it can be taught in post-training for the model to actually do retrieval, do RAG, extremely effectively, in different types of RAG as well.</p>



<p>So I think knowing that is actually fairly important. For example, in the RL environments that I create, and the fine-tuning kind of data that I create, I include RAG examples because I want the model to be able to learn that and be able to use RAG effectively.&nbsp;</p>
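


<p><em>As a concrete illustration of the retrieval step described above: score the documents against the query, keep only the most relevant ones, and add them to the context. The word-overlap scorer below is deliberately naive and purely illustrative; production RAG systems use embeddings and a vector index.</em></p>



<pre class="wp-block-code"><code>def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query (toy scorer)."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words.intersection(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Selectively add only the relevant documents to the context, then ask the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Use only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
print(build_prompt("How long do I have to return an item?", docs))</code></pre>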



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=466" target="_blank" rel="noreferrer noopener">07.46</a><br><strong>So besides supervised fine-tuning, the other class of techniques, broadly speaking, falls under reinforcement learning for post-training. But the impression I get—and I&#8217;m a big RL fan, and I&#8217;m a cheerleader of RL—but it seems always just around the corner, beyond the grasp of regular enterprise. It seems like a class of tools that the labs, the neo labs and the AI labs, can do well, but it just seems like the tooling is not there to make it, you know. . . Like I describe supervised fine-tuning as largely solved if you have a service. There&#8217;s no equivalent thing for RL, right? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=515" target="_blank" rel="noreferrer noopener">08.35</a><br>That&#8217;s right. And I think SFT (supervised fine-tuning) came first, so then it has been allowed to mature over the years. And so right now RL is kind of seeing that moment as well. It was a very exciting year last year, when we used a bunch of RL at test-time compute, teaching a model to reason, and that was really exciting with RL. And so I think that&#8217;s ramped up more, but we don&#8217;t have as many services today that are able to help with that. I think it&#8217;s only a matter of time, though. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=544" target="_blank" rel="noreferrer noopener">09.04</a><br><strong>So you said earlier, it&#8217;s important for enterprises to know that these techniques exist, that there&#8217;s companies who can help you with these techniques, but it might be too much of a lift to try to do it yourself. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=560" target="_blank" rel="noreferrer noopener">09.20</a><br>I think maybe fully end to end, it is challenging as an enterprise. I think there are individual developers who are able to do this and actually get a lot of value from it. For example, for vision language models or for models that generate images, people are doing a lot of bits and pieces of fine-tuning, and getting very custom results that they need from these models.</p>



<p>So I think it depends on who you are and what you&#8217;re surrounded by. The <a href="https://tinker-docs.thinkingmachines.ai/" target="_blank" rel="noreferrer noopener">Tinker API</a> from Thinking Machines is really interesting to me because that enables another set of people to be able to access it. I&#8217;m not sure it&#8217;s quite at the enterprise level, but I know researchers at universities now have access to distributed compute, like doing post-training on distributed compute, and pretty big clusters—which would otherwise be quite challenging for them to do. And so that makes it actually possible for at least that segment of the market and that user base to actually get started. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=621" target="_blank" rel="noreferrer noopener">10.21</a><br><strong>Yeah. So for our listeners who are familiar with just plain inference, the OpenAI API has become kind of the de facto API for inference. And then the idea is this Tinker API might play that role for fine-tuning inputs, correct? It&#8217;s not kind of the whole project that&#8217;s there. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=643" target="_blank" rel="noreferrer noopener">10.43</a><br>Correct. Yeah, that&#8217;s their intention. And to do it in a heavy like distributed way. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=649" target="_blank" rel="noreferrer noopener">10.49</a><br><strong>So then, if I&#8217;m CTO at an enterprise and I have an AI team and, you know, we&#8217;re not up to speed on post-training, what are the steps to do that? Do we bring in consultants and they explain to us, here&#8217;s your options and these are the vendors, or. . .? What&#8217;s the right playbook?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=675" target="_blank" rel="noreferrer noopener">11.15</a><br>Well, the strategy I would employ is, given these models change their capabilities constantly, I would obviously have teams testing the boundaries of the latest iteration of model at inference. And then from a post-training perspective, I would also be testing that. I would have a small, hopefully elite team that is looking into what I can do with these models, especially the open ones. And when I post-train, what actually comes from that. And I would think about my use cases and the desired things I would want to see from the model given my understanding of post-training.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=708" target="_blank" rel="noreferrer noopener">11.48</a><br>So hopefully you learn about post-training through this book with O&#8217;Reilly. But you&#8217;re also able to now grasp like, What are the types of capabilities I can add into the model? And as a result, what kinds of things can I then add into the ecosystem such that they get incorporated into the next generation of model as well?</p>



<p>For example, I was at an event recently and someone said, oh, you know, these models are so scary. When you threaten the model, you can get better results. So is that even ethical? You know, the model gets scared and gets you a better result. And I said, actually, you can post-train that out of the model, so that when you threaten it, it actually doesn&#8217;t give you a better result. That&#8217;s not actually a valid model behavior. You can change that behavior of the model. So understanding these tools can lend that perspective of, oh, I can change this behavior because I can change what output is given for this input, how the model reacts to this type of input. And I know how.&nbsp;</p>



<p>I also know the tools, right? This type of data. So maybe I should be releasing this type of data more. I should be releasing more of these types of tutorials that actually help the model learn at different levels of difficulty. And I should be releasing these types of files, these types of tools, these types of MCPs and skills such that the model actually does pick that up.</p>



<p>And that will be across all different types of models, whether that be a frontier lab looking at your data or your internal team that is doing some post-training with that information.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=800" target="_blank" rel="noreferrer noopener">13.20</a><br><strong>Let&#8217;s say I&#8217;m one of these enterprises, and we already have some basic applications that use RAG, and you know, I hear this podcast and say, OK, let&#8217;s try this, try to go down the path of post-training. So we already have some familiarity with how to do eval for RAG or some other basic AI application. How does my eval pipeline change in light of post-training? Do I have to change anything there? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=843" target="_blank" rel="noreferrer noopener">14.03</a><br>Yes and no. I think you can expand on what you have right now. And I think your existing eval—hopefully it&#8217;s a good eval. There&#8217;s also best practices around evals. But essentially let&#8217;s say it&#8217;s just a list of possible inputs and outputs, a way to grade those outputs, for the model. And it covers a decent distribution over the tasks you care about. Then, yes, you can extend that to post-training. </p>



<p>For fine-tuning, it&#8217;s a pretty straightforward kind of extension. You do need to think about essentially the distribution of what you&#8217;re evaluating such that you can trust that the model’s truly better at your tasks. And then for RL, you would think about, How do I effectively grade this at every step of the way, and be able to understand whether the model has done well or not, and be able to catch where the model is, for example, reward hacking, when it&#8217;s cheating, so to speak?</p>



<p>So I think you can take what you have right now. And that&#8217;s kind of the beauty of it. You can take what you have and then you can expand it for post-training.&nbsp;</p>
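


<p><em>A minimal sketch of the &#8220;list of possible inputs and outputs plus a way to grade them&#8221; idea, and of why it extends naturally: the same harness can score a prompt-only baseline, a fine-tuned endpoint, or an RL checkpoint. The cases and graders are made up for illustration.</em></p>



<pre class="wp-block-code"><code># Illustrative only: a tiny eval harness of graded cases.
EVAL_SET = [
    {"input": "Summarize: the cat sat on the mat.", "must_contain": "cat"},
    {"input": "Translate to French: good morning", "must_contain": "bonjour"},
]

def grade(output: str, case: dict) -> float:
    """Return 1.0 if the output contains the required string, else 0.0."""
    return 1.0 if case["must_contain"].lower() in output.lower() else 0.0

def run_eval(model_fn) -> float:
    """model_fn is any callable mapping a prompt string to an output string."""
    scores = [grade(model_fn(case["input"]), case) for case in EVAL_SET]
    return sum(scores) / len(scores)

# The same run_eval call works before and after post-training:
baseline_score = run_eval(lambda prompt: "Bonjour, the cat sat.")
# post_trained_score = run_eval(my_fine_tuned_endpoint)  # hypothetical endpoint</code></pre>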



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=910" target="_blank" rel="noreferrer noopener">15.10</a><br><strong>So, Sharon, should people think of something like supervised fine-tuning as something you do for something very narrow? In other words, as you know, one of the challenges with supervised fine-tuning is that first of all, you have to come up with the dataset, and let&#8217;s say you can do that, then you do the supervised fine-tuning, and it works, but it only works for kind of that data distribution somehow. And so in other words, you shouldn&#8217;t expect miracles, right?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=944" target="_blank" rel="noreferrer noopener">15.44</a><br>Yes, actually something I do recommend is thinking through what you want to do that supervised fine-tuning on. And really, I think it should be behavior adaptation. So for example, in pretraining, that&#8217;s when the model is learning from a huge amount of data, for example, from the internet, curated. And it&#8217;s just gaining raw intelligence across a lot of different tasks and a lot of different domains. And it&#8217;s just gaining that information, predicting that next token. But it doesn&#8217;t really have any of those behavioral elements to it. </p>



<p>Now, let&#8217;s say it&#8217;s only learned about version one of some library. If in fine-tuning, so if in post-training, you now give it examples of chatting with the model, then it&#8217;s able to chat over version one and version zero. (Let&#8217;s say there&#8217;s a version zero.) And you only gave it examples of chatting with version one, but it&#8217;s able to generalize to version zero. Great. That&#8217;s exactly what you want. That&#8217;s a behavior change that you&#8217;re making in the model. But we&#8217;ve also seen issues where, if you, for example, now give the model fine-tuning examples of “oh, here&#8217;s something with version two,” but the base model, the pretrained model, did not ever see anything about version two, it will learn this behavior of making things up. And so that will generalize as well. And that could actually hurt the model.&nbsp;</p>



<p>So something that I really encourage people to think about is where to put each step of information. And it&#8217;s possible that certain amounts of information are best done as more of a pretraining step. So I&#8217;ve seen people take a pretrained model, do some continued pretraining—maybe you call it midtraining, I&#8217;m not sure. But like something there—and then you do that fine-tuning step of behavior modification on top.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1056" target="_blank" rel="noreferrer noopener">17.36</a><br><strong>In your previous startup, you folks talked about something. . . I forget. I&#8217;m trying to remember. Something called memory tuning, is that right?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1066" target="_blank" rel="noreferrer noopener">17.46</a><br>Yeah. A mixture of memory experts. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1068" target="_blank" rel="noreferrer noopener">17.48</a><br><strong>Yeah, yeah. Is it fair to cast that as a form of post-training? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1074" target="_blank" rel="noreferrer noopener">17.54</a><br>Yes, that is absolutely a form of post-training. We were doing it in the adapter space. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1079" target="_blank" rel="noreferrer noopener">17.59</a><br><strong>Yeah. And you should describe for our audience what that is.</strong> </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1082" target="_blank" rel="noreferrer noopener">18.02</a><br>Okay. Yeah. So we invented something called mixture of memory experts. And essentially, you can hear like the words, except for the word “memory,” it&#8217;s a mixture of experts. So it&#8217;s a type of MOE. MOEs are typically done in the base layer of a model. And what it basically means is like there are a bunch of different experts, and for particular requests, for a particular input prompt, it routes to only one of those experts or only a couple of those experts instead of the whole model.</p>



<p>And this makes latency really low and makes it really efficient. And the base models are often MoEs today for the frontier models. But what we were doing was thinking about, well, what if we froze your base model, your base pretrained model, and for post-training, we could do an MoE on top? And specifically, we could do an MoE on top through the adapters. So through your LoRA adapters. And so instead of just one LoRA adapter, you could have a mixture of these LoRA adapters. And they would effectively be able to learn multiple different tasks on top of your base model such that you would be able to keep your base model completely frozen and be able to, automatically in a learned way, switch between these adapters.</p>
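


<p><em>For the mathematically inclined, here is a tiny numerical sketch of the idea: a frozen base weight plus several low-rank (LoRA-style) adapters, combined by a router that weights which adapter applies to a given input. It shows only the arithmetic, with made-up sizes and a hand-fed gate; it is not the actual mixture-of-memory-experts implementation.</em></p>



<pre class="wp-block-code"><code>import numpy as np

rng = np.random.default_rng(0)
d, r, n_adapters = 8, 2, 4                       # hidden size, LoRA rank, adapter count

W = rng.normal(size=(d, d))                      # frozen, pretrained base weight
A = rng.normal(size=(n_adapters, r, d)) * 0.01   # per-adapter "down" projections
B = rng.normal(size=(n_adapters, d, r)) * 0.01   # per-adapter "up" projections

def routed_forward(x, gate_logits):
    """Frozen base output plus a gated sum of low-rank adapter updates.

    gate_logits would come from a learned router; here they are just an argument.
    """
    gates = np.exp(gate_logits) / np.exp(gate_logits).sum()   # softmax over adapters
    y = W @ x                                                  # base model, never updated
    for i in range(n_adapters):
        y = y + gates[i] * (B[i] @ (A[i] @ x))                 # adapter i's low-rank update
    return y

x = rng.normal(size=d)
y = routed_forward(x, gate_logits=np.array([2.0, 0.1, 0.1, 0.1]))  # routes mostly to adapter 0</code></pre>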



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1152" target="_blank" rel="noreferrer noopener">19.12</a><br><strong>So the user experience or developer experience is similar to supervised fine-tuning: I will need labeled datasets for this one, another set of labeled datasets for this one, and so on. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1169" target="_blank" rel="noreferrer noopener">19.29</a><br>So actually, yeah. Similar to supervised fine-tuning, you would just have. . . Well, you could put it into one giant dataset, and it would learn how to figure out which adapters to allocate it to. So let&#8217;s say you had 256 adapters or 1024 adapters. It would learn what the optimal routing is. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1187" target="_blank" rel="noreferrer noopener">19.47</a><br><strong>And then you folks tried to explain this in the context of neural plasticity, as I recall.</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1195" target="_blank" rel="noreferrer noopener">19.55</a><br>Did we? I don&#8217;t know. . .</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1198" target="_blank" rel="noreferrer noopener">19.58</a><br><strong>The idea being that, because of this approach, your model can be much more dynamic. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1208" target="_blank" rel="noreferrer noopener">20.08</a><br>Yeah. I do think there&#8217;s a difference between inference, so just going forwards in the model, versus being able to go backwards in some way, whether that be through the entire model or through adapters, but in some way being able to learn something through backprop.</p>



<p>So I do think there is a pretty fundamental difference between those two types of ways to engage with a model. And arguably at inference time, your weights are frozen, so the model&#8217;s “brain” is completely frozen, right? And so you can&#8217;t really heavily adapt anything towards a different objective. It&#8217;s frozen. So being able to continually modify the model’s objective and thinking and steering and behavior, I think, is valuable now.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1254" target="_blank" rel="noreferrer noopener">20.54</a><br>I think there are more approaches to this today, but from a user experience perspective, some people have found it easier to just load a lot of things into the context. And I think there&#8217;s. . . I&#8217;ve actually recently had this debate with a few people around whether in-context learning truly is somewhere in between just frozen inference forwards and backprop. Obviously it&#8217;s not doing backprop directly, but there are ways to mimic certain things. But maybe that is what we&#8217;re doing as a human throughout the day. And then I will backprop at night when I&#8217;m sleeping. </p>



<p>So I think people are playing with these ideas and trying to understand what&#8217;s going on with the model. I don&#8217;t think it&#8217;s definitive yet. But we do see some properties, when just playing with the input prompt. But there I think, needless to say, there are 100% fundamental differences when you are able to backprop into the weights.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1309" target="_blank" rel="noreferrer noopener">21.49</a><br><strong>So maybe for our listeners, briefly define in-context learning.</strong> </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1315" target="_blank" rel="noreferrer noopener">21.55</a><br>Oh, yeah. Sorry. So in-context learning is a deceptive term because the word “learning” doesn&#8217;t actually. . . Backprop doesn&#8217;t happen. All it is is actually putting examples into the prompt of the model and you just run inference. But given that prompt, the model seems to learn from those examples and be able to be nudged by those examples to a different answer.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1337" target="_blank" rel="noreferrer noopener">22.17</a><br><strong>By the way, now we have frameworks like DSPy, which comes with tools like GEPA which can optimize your prompts. I know a few years ago, you folks were telling people [that] prompting your way through a problem is not the right approach. But now we have more principled ways, Sharon, of developing the right prompts? So how do tools like that impact post-training? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1371" target="_blank" rel="noreferrer noopener">22.51</a><br>Oh, yeah. Tools like that impact post-training, because you can teach the model in post-training to use those tools more effectively. Especially if they help with optimizing the prompt and optimizing the understanding of what someone is putting into the model.</p>



<p>For example, let me just give a contrast of how far we&#8217;ve gotten. So post-training makes the model more resilient to different prompts, able to handle different types of prompts, and able to get the intention from the user. So as an extreme example, before ChatGPT, when I was using GPT-3 back in 2020, if I literally put a space by accident at the end of my prompt—like when I said, “How are you?” but I accidentally pressed Space and then Enter, the model completely freaked out. And that&#8217;s because of the way things were tokenized, and that just would mess things up. But there are a lot of different weird sensitivities in the model such that it would just completely freak out, and by freak out I mean it would just repeat the same thing over and over, or just go off the rails about something completely irrelevant.</p>



<p>And so that&#8217;s what the state of things was, and the model was not post-trained to.&nbsp;.&nbsp;. Well, it wasn&#8217;t quite post-trained then, but it also wasn&#8217;t generally post-trained to be resilient to any type of prompt. Versus now, today, I don&#8217;t know about you, but the way I code is I just highlight something and just put a question mark into the prompt.</p>



<p>I’m so lazy, or like just put the error in and it&#8217;s able to handle it—understand that you&#8217;re trying to fix this error because why else would you be talking to it. And so it&#8217;s just much more resilient today to different things in the prompt.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1466">24.26</a><br>Remember Google “Did you mean this?” It&#8217;s kind of an extreme version of that, where you type something completely misspelled into Google, and it&#8217;s able to kind of figure out what you actually meant and give you the results.</p>



<p>It&#8217;s the same thing, even more extreme, like super Google, so to speak. But, yeah, it&#8217;s resilient to that prompt. But that has to be done through post-training—that is happening in post-training for a lot of these models. It&#8217;s showing the model, hey, for these possible inputs that are just gross and messed up, you can still give the user a really well-defined output and understand their intention.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1505" target="_blank" rel="noreferrer noopener">25.05</a><br><strong>So the hot thing today, of course, is agents. And agents now, people are using things like tool calling, right? So MCP servers. . . You&#8217;re not as dependent on this monolithic model to solve everything for you. So you can just use a model to orchestrate a bunch of little specialized specialist agents.</strong></p>



<p><strong>So do I still need post-training?&nbsp;</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1539" target="_blank" rel="noreferrer noopener">25.39</a><br>Oh, absolutely. You use post-training to get the agent to actually work. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1543" target="_blank" rel="noreferrer noopener">25.43</a><br><strong>So get the agent to pull all the right tools. . . </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1546" target="_blank" rel="noreferrer noopener">25.46</a><br>Yeah, actually, a huge reason why hallucinations have been, like, much better than before is because now, under the hood, they&#8217;ve taught the model to maybe use a calculator tool instead of just output, you know, math on your own, or be able to use the search API instead of make things up from your pretraining data.</p>



<p>So this tool calling is really, really effective, but you do need to teach the model to use it effectively. And I actually think what&#8217;s interesting.&nbsp;.&nbsp;. So MCPs have managed to create a great intermediary layer to help models be able to call different things, use different types of tools with a consistent interface. However, I have found that, probably due to a bit less post-training on MCPs than on, say, a Python API, if you have a Python function declaration or a Python API, the models actually tend to do better on it empirically, at least for me, because models have seen so many more examples of that. So that&#8217;s an example of, oh, actually in post-training I did see more of that than MCPs.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1612" target="_blank" rel="noreferrer noopener">26.52</a><br>So weirdly, it&#8217;s better using Python APIs for your same tool than an MCP of your own tool, empirically today. And so I think it really depends on what it&#8217;s been post-trained on. And understanding that post-training process and also what goes into that will help you understand why these differences occur. And also why we need some of these tools to help us, because it&#8217;s a little bit chicken-egg, but like the model is capable of certain things, calling different tools, etc. But having an MCP layer is a way to help everyone organize around a single interface such that we can then do post-training on these models such that they can then do well on it.</p>



<p>I don&#8217;t know if that makes sense, but yeah, that&#8217;s why it&#8217;s so important.&nbsp;</p>
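


<p><em>To illustrate the contrast, here is the kind of plain Python function declaration models have seen countless examples of: the model reads the signature and docstring, emits a call, and the host dispatches it. The dispatch mechanics and names below are illustrative assumptions; real stacks use structured tool-call messages rather than this toy registry.</em></p>



<pre class="wp-block-code"><code># Illustrative tool exposed as an ordinary Python function declaration.
def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result as a string."""
    if not set(expression).issubset("0123456789+-*/(). "):
        raise ValueError("unsupported characters")
    return str(eval(expression))  # toy only; never eval untrusted input in production

TOOLS = {"calculator": calculator}

def dispatch(tool_name: str, argument: str) -> str:
    """Host-side dispatch for a tool call the model has emitted."""
    return TOOLS[tool_name](argument)

print(dispatch("calculator", "21 * 2"))  # "42"</code></pre>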



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1661" target="_blank" rel="noreferrer noopener">27.41</a><br><strong>Yeah, yeah. In the areas I&#8217;m interested in, which I mean, the data engineering, DevOps kind of applications, it seems like there&#8217;s new tools like Dex, open source tools, which allow you to kind of save pipelines or playbooks that work so that you don&#8217;t constantly have to reinvent the wheel, you know, just because basically, that&#8217;s how these things function anyway, right? So someone gets something to work and then everyone kind of benefits from that. But then if you&#8217;re constantly starting from scratch, and you prompt and then the agent has to relearn everything from scratch when it turns out there&#8217;s already a known way to do this problem, it&#8217;s just not efficient, right? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1710" target="_blank" rel="noreferrer noopener">28.30</a><br>Oh, I also think another exciting frontier that&#8217;s kind of in the zeitgeist of today is, you know, given Moltbook or OpenClaw stuff, multi-agent has been talked about much more. And that&#8217;s also through post-training for the model, to launch subagents and to be able to interface with other agents effectively. These are all types of behavior that we have to teach the model to be able to handle. It’s able to do a lot of this out of the box, just like GPT-3 was able to chat with you if you give it the right nudging prompts, etc., but ChatGPT is so much better at chatting with you.</p>



<p>So it&#8217;s the same thing. Like now people are, you know, adding to their post-training mix this multi-agent workflow or subagent workflow. And that&#8217;s really, really important for these models to be effective at being able to do that. To be both the main agent, the unified agent at the top, but also to be the subagent to be able to launch its own subagents as well.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1766" target="_blank" rel="noreferrer noopener">29.26</a><br><strong>Another trend recently is the emergence of these multimodal models or even, people are starting to talk about world models. I know those are early, but I think even just in the area of multimodality, visual language models, and so forth, what is the state of post-training outside of just LLMs? Just different kinds of this much more multimodal foundation models? Are people doing the post-training in those frontier models as well?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1804" target="_blank" rel="noreferrer noopener">30.04</a><br>Oh, absolutely. I actually think one really fun one—I guess this is largely a language model, but they are likely tokenizing very differently—are people who are looking at, for example, life sciences and post-training foundation models for that.</p>



<p>So there you would want to adapt the tokenizer, because you want to be able to put different types of tokens in and tokens out, and have the model be very efficient at that. And so you&#8217;re doing that during post-training, of course, to be able to teach that new tokenizer. But you&#8217;re also thinking about what other feedback loops you can do.</p>



<p>So people are automating things like, I don&#8217;t know, the pipetting and testing out the different, you know, molecules, mixing them together and being able to get a result from that. And then, you know, using that as a reward signal back into the model. So that&#8217;s a really powerful other type of domain that&#8217;s maybe adjacent to how we think about language models, but tokenized differently, and has found an interesting niche where we can get nice, verifiable rewards back into the model that is pretty different from how we think about, for example, coding or math, or even general human preferences. It&#8217;s touching the real world or physical world—so it&#8217;s probably all real, but the physical world a little bit more.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1885" target="_blank" rel="noreferrer noopener">31.25</a><br><strong>So in closing, let&#8217;s get your very quick takes on a few of these AI hot topics. First one, reinforcement learning. When will it become mainstream? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1898" target="_blank" rel="noreferrer noopener">31.38</a><br>Mainstream? How is it not mainstream? </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1900" target="_blank" rel="noreferrer noopener">31.40</a><br><strong>No, no, I mean, for regular enterprises to be able to do it themselves. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1907" target="_blank" rel="noreferrer noopener">31.47</a><br>This year. People have got to be sprinting. Come on. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1910" target="_blank" rel="noreferrer noopener">31.50</a><br><strong>You think? Do you think there will be tools out there so that I don&#8217;t need in-house talent in RL to do it myself?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1919" target="_blank" rel="noreferrer noopener">31.59</a><br>Yes. Yeah. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1921" target="_blank" rel="noreferrer noopener">32.01</a><br><strong>Secondly, scaling. Is scaling still the way to go? The frontier labs seem to think so. They think that bigger is better. So are you hearing anything in the research frontiers that tell you, hey, maybe there&#8217;s alternatives to just pure scaling?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1940" target="_blank" rel="noreferrer noopener">32.20</a><br>I still believe in scaling. I believe we&#8217;ve not met a limit yet. Not seen a plateau yet. I think the thing people need to recognize is that it&#8217;s always been a “10X compute for 2X intelligence” type of curve. So it&#8217;s not exactly like 10X-10X. But yeah, I still believe in scaling, and we haven&#8217;t really seen an empirical plateau on that yet.</p>



<p>That being said, I&#8217;m really excited about people who challenge it. Because I think it would be really amazing if we could challenge it and get a huge amount of intelligence with fewer pure dollars, especially now as we start to hit trillions of dollars at some of the frontier labs; that&#8217;s the next level of scale they&#8217;ll be seeing. That said, working at a compute company, I&#8217;m okay with this. Come spend trillions! [laughs]</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1993" target="_blank" rel="noreferrer noopener">33.13</a><br><strong>By the way, with respect to scaling, so you think the models we have now, even if you stop progress, there&#8217;s a lot of adaptation that enterprises can do. And there&#8217;s a lot of benefits from the models we already have today?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2010" target="_blank" rel="noreferrer noopener">33.30</a><br>Correct. Yes. We&#8217;re not even scratching the surface, I think. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2014" target="_blank" rel="noreferrer noopener">33.34</a><br><strong>The third topic I wanted to pick your brain quick is “open”: open source, open weights, whatever. So, there&#8217;s still a gap, I think. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2029" target="_blank" rel="noreferrer noopener">33.49</a><br>There are contenders in the US who want to be an open source DeepSeek competitor but American, to make it more amenable when selling into. . .</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2042" target="_blank" rel="noreferrer noopener">34.02</a><br><strong>They don&#8217;t exist, right? I mean, there&#8217;s </strong><a href="https://allenai.org/" target="_blank" rel="noreferrer noopener"><strong>Allen</strong></a><strong>. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2046" target="_blank" rel="noreferrer noopener">34.06</a><br>Oh, like Ai2 for <a href="https://allenai.org/olmo" target="_blank" rel="noreferrer noopener">Olmo</a>&#8230; Their startup’s doing some stuff. I don&#8217;t know if they&#8217;ve announced things yet, but yeah hopefully we&#8217;ll hear from them soon. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2055" target="_blank" rel="noreferrer noopener">34.15</a><br><strong>Yeah yeah yeah. </strong></p>



<p><strong>Another interesting thing about these Chinese AI teams is obviously, you have the big companies like Tencent, Baidu, Alibaba—so they&#8217;re doing their thing. But then there&#8217;s this wave of startups. Set aside DeepSeek. So the other startups in this space, it seems like they&#8217;re targeting the West as well, right? Because basically it&#8217;s hard to monetize in China, because people tend not to pay, especially the enterprises. [laughs]</strong></p>



<p><strong>I&#8217;m just noticing a lot of them are incorporating in Singapore and then trying to build solutions for outside of China.</strong>&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2100" target="_blank" rel="noreferrer noopener">35.00</a><br>Well, the TAM is quite large here, so. . . It&#8217;s quite large in both places. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2107" target="_blank" rel="noreferrer noopener">35.07</a><br><strong>So it&#8217;s the final question. So we’ve talked about post-training. We talked about the benefits, but we also talked about the challenges. And as far as I can tell, one of the challenges is, as you pointed out, to do it end to end requires a bit of expertise. First of all, think about just the data. You might need the right data platform or data infrastructure to prep your data to do whatever it is that you&#8217;re doing for post-training. And then you get into RL. </strong></p>



<p><strong>So what are some of the key foundational things that enterprises should invest in to set themselves up for post-training—to get really good at post-training? So I mentioned a data platform, maybe invest in the data. What else?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2161" target="_blank" rel="noreferrer noopener">36.01</a><br>I think the type of data platform matters. I&#8217;m not sure if I totally am bought into how CIOs are approaching it today. I think what matters at that infrastructure layer is actually making sure you deeply understand what tasks you want these models to do. And not only that, but then codifying it in some way—whether that be inputs and outputs and, you know, desired outputs, whether that be a way to grade outputs, whether that be the right environment to have the agent in. Being able to articulate that is extremely powerful and I think is the one of the key ways of getting that task that you want this agent to do, for example, to be actually inside of the model. Whether it&#8217;s you doing post-training or someone else doing post-training, no matter what, if you build that, that will be something that gives a high ROI, because anyone will be able to take that and be able to embed it and you&#8217;ll be able to get that capability faster than anyone else. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2223" target="_blank" rel="noreferrer noopener">37.03</a><br><strong>And on the hardware side, one interesting thing that comes out of this discussion is if RL truly becomes mainstream, then you need to have a healthy mix of CPUs and GPUs as well. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2237" target="_blank" rel="noreferrer noopener">37.17</a><br>That&#8217;s right. And you know, AMD makes both. . .</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2245" target="_blank" rel="noreferrer noopener">37.25</a><br><strong>It’s great at both of those.</strong></p>



<p><strong>And with that, thank you, Sharon.</strong></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-sharon-zhou-on-post-training/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Steve Yegge Wants You to Stop Looking at Your Code</title>
		<link>https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/</link>
				<comments>https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/#respond</comments>
				<pubDate>Thu, 12 Mar 2026 10:07:00 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18247</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Code-is-liquid.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Code-is-liquid-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A conversation about agent orchestration, AI vampires, and why your bike ride is all hills now]]></custom:subtitle>
		
				<description><![CDATA[My &#8220;Live with Tim&#8221; conversation with Steve Yegge this week was one of those sessions where you could imagine the audience leaning forward in their chairs. And on more than one occasion, when Steve got particularly colorful, I imagined them recoiling. Steve has always been one of the most provocative thinkers in our industry, going [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>My &#8220;<a href="https://learning.oreilly.com/videos/live-with-tim/0642572330194/" target="_blank" rel="noreferrer noopener">Live with Tim</a>&#8221; conversation with Steve Yegge this week was one of those sessions where you could imagine the audience leaning forward in their chairs. And on more than one occasion, when Steve got particularly colorful, I imagined them recoiling. Steve has always been one of the most provocative thinkers in our industry, going all the way back to his <a href="https://gist.github.com/chitchcock/1281611" target="_blank" rel="noreferrer noopener">legendary 2011 platform rant</a> that leaked from inside Google. These days he&#8217;s channeling his energy into <a href="https://github.com/steveyegge/gastown" target="_blank" rel="noreferrer noopener">Gas Town</a>, an open source AI agent orchestrator, and into a relentless campaign to shake developers out of what he sees as a state of denial about where coding is headed. </p>



<p>And yes, Gas Town is indeed named after the fuel depot in <em>Furiosa: A Mad Max Saga</em>, and even features a managing agent named after the Mayor. But Steve’s Gas Town is anything but dystopic. If anything, it’s joyous. That gives you a deep sense of who Steve is: He goes into the deepest, darkest part of the forest, finds something scary, and then does his best to redeem it.</p>



<p>We covered a lot of ground: the eight levels of coder evolution, the addictive pull of multi-agent workflows, grief and denial in the developer community, the bitter lesson, and why taste may be the last remaining competitive advantage. Here are some of the highlights.</p>



<h2 class="wp-block-heading"><strong>Everyone gets a chief of staff</strong></h2>



<p>Steve&#8217;s &#8220;Eight Levels of Coder Evolution&#8221; framework has taken on a life of its own since he published it as part of <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">the Gas Town launch post</a>. The first four levels are about increasingly sophisticated IDE use; levels five through eight are about coding agents. Here is an infographic of the eight stages, which I used to anchor the start of our conversation. Note that I created this slide with Nano Banana 2. It is not directly from Steve.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="1024" height="559" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6.png" alt="Eight stages of AI-assisted software development" class="wp-image-18248" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6.png 1024w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6-300x164.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6-768x419.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The key transition, Steve argues, happens at level five: Your IDE goes away and you never open it again. As Steve described it, once you realize Claude Code can write pieces of your code, you start assembling them like Lego. But while one agent is working, you&#8217;re sitting there bored, so you fire up another one. And another. Before long, you&#8217;ve got six agents running in parallel, and one of them is always finished and waiting for your attention.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Why You End Up Needing an Orchestrator" width="500" height="281" src="https://www.youtube.com/embed/Ka9zpAph6Kk?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Steve drew an analogy to Amazon VPs who had executive assistant support. Those people were effectively two people. They didn&#8217;t have to worry about whether the printer was jammed, so they could spend all their time focused on the real problems. Gas Town, Steve argues, is topologically similar: It&#8217;s going to turn everybody into something like an executive with a chief of staff. &#8220;We all have a chief of staff now,&#8221; he said. &#8220;Everybody&#8217;s going to be able to spend their time more productively on whatever they want to spend it on instead of figuring out where the printer is jammed.&#8221;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>On March 26, join Addy Osmani and Tim O’Reilly at AI Codecon: Software Craftsmanship in the Age of AI, where an all-star lineup of experts will go deeper into orchestration, agent coordination, and the new skills developers need to build excellent software that creates value for all participants.&nbsp;</em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></p>
</blockquote>



<h2 class="wp-block-heading"><strong>The AI vampire</strong></h2>



<p>This is Steve Yegge, so he&#8217;s not just going to give you the upside. His post on &#8220;<a href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163" target="_blank" rel="noreferrer noopener">the AI Vampire</a>&#8221; explained how AI-assisted productivity creates an insidious new kind of burnout, and I made sure to ask him about that.</p>



<p>The old version of overwork was your company piling tasks on you until you broke, he told us. The new version isn’t your boss asking you to work extra hours. It&#8217;s Claude saying, &#8220;Is there anything else you&#8217;d like me to do on this project?&#8221; And you say yes, yes, yes, because it&#8217;s fun, because it&#8217;s productive, because the AI is your buddy, not your employer.</p>



<p>But there’s a twist. The AI is solving all the easy problems and leaving you with nothing but hard ones. In our conversation, I said it can feel like your bike ride is all hills now, and Steve immediately connected it to watching Jeff Bezos in meetings at Amazon. People would bring him presentations where they&#8217;d already solved every easy problem, so Bezos was just getting the hard stuff, all day long. &#8220;Now this happens to you,&#8221; Steve said. &#8220;Everyone&#8217;s Jeff Bezos, everyone&#8217;s an entrepreneur. Everyone has a huge army of workers now. And I&#8217;m telling you, it&#8217;s exhausting.&#8221;</p>



<p>Steve told us he naps every day now, sometimes twice a day, feeling drained by the relentless cognitive intensity. These agents don&#8217;t just help you work faster; they fundamentally change what kind of work reaches your desk.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Everyone’s Jeff Bezos Now" width="500" height="281" src="https://www.youtube.com/embed/QkXK98HwcIw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>On the wrong side of the bitter lesson</strong></h2>



<p>We spent a good stretch on Richard Sutton&#8217;s &#8220;<a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf" target="_blank" rel="noreferrer noopener">bitter lesson</a>.&#8221; Sutton observed that raw computation consistently beats systems built on human-engineered structure. Steve treats it less as a paper and more as a daily operating principle. &#8220;Not a day goes by when I don&#8217;t think about the bitter lesson as a math formula,&#8221; he said, &#8220;at least five times a day.&#8221;</p>



<p>His practical test is simple: If you&#8217;re writing code that tries to make the AI smarter, by adding heuristics, parsers, or regular expressions to handle what a model could handle, you&#8217;re on the wrong side of the bitter lesson. He watches even his own Gas Town contributors make this mistake, reaching for a little regex hack when they should let a model do the cognition. (Steve does admit that sometimes you do need to <a href="https://steve-yegge.medium.com/software-survival-3-0-97a2a6255f7b" target="_blank" rel="noreferrer noopener">provide prebuilt code if it saves tokens</a>.)</p>
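


<p>To make that test concrete, here is a minimal sketch of the two sides, using a hypothetical due-date-extraction task. The <code>call_model</code> helper is an assumption for illustration, not anything from Gas Town or Steve&#8217;s posts:</p>



<pre class="wp-block-code"><code># Illustrative sketch of the "wrong side" vs. "right side" of the bitter lesson.
# The task and the call_model() helper are hypothetical.
import re

def extract_due_date_heuristic(text):
    # Wrong side: hand-built structure that tries to out-think the model.
    match = re.search(r"due (?:on|by)\s+(\w+ \d{1,2})", text, re.IGNORECASE)
    return match.group(1) if match else None

def extract_due_date_model(text, call_model):
    # Right side: hand the cognition to the model and keep the code thin.
    prompt = f"Return only the due date mentioned in this text, or NONE:\n{text}"
    return call_model(prompt).strip()
</code></pre>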



<p>Sutton wrote about the bitter lesson in the context of training programs to beat humans at chess and go, but it’s more general than that, even more general than leaning into today’s AI in the way that Steve does. I shared my own first encounter with the bitter lesson, back in 1993 when O’Reilly created <a href="https://en.wikipedia.org/wiki/Global_Network_Navigator" target="_blank" rel="noreferrer noopener">GNN</a>, the first commercial web portal. Being publishers, we curated a catalog of the best websites. Then Yahoo! set out to list all of them, restricting curation to putting them into categories. Then Google did it algorithmically, creating a custom curation for every search. We know which approach won. The bitter lesson isn&#8217;t just about AI; it&#8217;s about a recurring pattern in the history of technology where scale and computation overwhelm hand-tuned solutions.</p>



<p>I still believe we have <a href="https://www.oreilly.com/radar/betting-against-the-bitter-lesson/" target="_blank" rel="noreferrer noopener">to bet against the bitter lesson</a>, because if we just give up, there will be no place for humans in the future knowledge economy. But we have to do it knowing that we aren’t going to win in the traditional sense. We aren’t going to outrace AI. We have to learn to ride it.</p>



<p>For anyone in a corporate setting, you will naturally want to fit AI into your current workflows. The bitter lesson says you should instead figure out what the AI can do by itself first, and then build a new workflow around that. I described Steve&#8217;s whole approach as looking the bitter lesson in the face and saying, &#8220;I&#8217;m going to turn AI loose on everything I can, and then figure out where the human fits in the loop.&#8221;</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="On the Wrong Side of the Bitter Lesson" width="500" height="281" src="https://www.youtube.com/embed/mzWaQfLhvSs?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Code is a liquid</strong></h2>



<p>Steve hit peak Yegge mode when an audience member asked why they should leave their IDE. His response was, as usual, quotable:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>If you&#8217;re looking at your code, then you&#8217;re in a Formula One race and you&#8217;ve parked your car and opened the hood and you&#8217;re looking at the engine. You&#8217;ve slowed time to the point where everyone is racing past you and you&#8217;re a frozen statue. Code is a liquid. You spray it through hoses. You don&#8217;t freaking look at it….</p>



<p>Look, I get it. This is painful for people. This is super painful. For me to say these things is painful for people to hear them. Because what I&#8217;m saying is your job is going to change.…And there&#8217;s still a lot of denial out there.</p>



<p>What&#8217;s the first phase of grief? The first phase of grief. The whole world is in it right now, Tim. They&#8217;re in denial. Right. They are grieving for what is going away. We&#8217;re at the end of an era. An age, a golden age, maybe, where we programmers, we&#8217;re writing all the code. And it was wonderful for 30, 40, 50 years or whatever. That era is ending and people are grieving because of it. And I feel for them. I&#8217;ve got empathy, right. But I&#8217;m also losing patience because it&#8217;s 2026 and this is an exponential curve, and we don&#8217;t have time to sit around and feel pity for ourselves.</p>
</blockquote>



<p>He sees that grief everywhere, but he specifically called out Hacker News, which he described as the home of &#8220;the new Amish.&#8221;</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Code Is a Liquid" width="500" height="281" src="https://www.youtube.com/embed/YWejaUR512w?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Taste is the moat</strong></h2>



<p>Another of the audience questions was about whether corporations with deep pockets have all the leverage and there&#8217;s no room left for individuals. Steve&#8217;s answer was emphatic: absolutely not. Steve made a passionate argument that creativity outweighs capital in the AI era. He&#8217;s certain there will be companies that waste millions of dollars of tokens building software that never sees the light of day, because they had no taste, no good ideas, just brute-force generation without direction. Meanwhile, an entrepreneur with open source local inference models and a good GPU can build something that matters, if they know what people want.</p>



<p>&#8220;Everything is going to come down to taste,&#8221; Steve said. &#8220;Companies don&#8217;t have an advantage anymore. As an entrepreneur, I think this is a golden opportunity for people to make huge, huge impact.&#8221;</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Taste Is the Moat" width="500" height="281" src="https://www.youtube.com/embed/er3bE5FR0-4?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>It’s mentors all the way down</strong></h2>



<p>Someone from the audience asked Steve about a question that&#8217;s on everyone&#8217;s mind: If senior developers are becoming PMs and juniors are being replaced by AI, where will new seniors come from? His answer was a classic reframe, and an update of what he wrote in “<a href="https://sourcegraph.com/blog/revenge-of-the-junior-developer" target="_blank" rel="noreferrer noopener">Revenge of the Junior Developer</a>.”</p>



<p>He made the case that your most junior engineers aren&#8217;t who you think they are anymore. They&#8217;re your product managers, your SDRs, your finance and sales folks, all of those people throughout your company who are now building things with AI. Your former junior engineers are actually well-trained engineers who make perfect mentors for this new bottom layer. And those juniors get mentored by seniors, and seniors by principals. It&#8217;s mentoring all the way down.</p>



<p>Steve pointed out that this connects to something <a href="https://www.mattbeane.com/" target="_blank" rel="noreferrer noopener">Matt Beane</a> (who was in our audience) has researched on skills acquisition: You don&#8217;t learn from someone 40 levels above you; you learn from someone one or two levels ahead. Steve&#8217;s suggestion for companies is to organize around this. Find mentors within your organization who are just a step or two ahead of where each person is, and bring everyone along with empathy for what people are going through.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="It’s Mentors All the Way Down" width="500" height="281" src="https://www.youtube.com/embed/HWIzeDGwPmI?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>We&#8217;re going to build bigger stuff</strong></h2>



<p>Another audience member asked about research showing that AI atrophies critical thinking pathways. I couldn’t resist jumping in with one of my favorite historical analogues. Socrates said the same thing about the written word, arguing that it was impairing people’s ability to remember. And he was right, we did lose the ability to recite massive amounts of literature from memory. But we gained more than we lost. Things change.</p>



<p>I also shared a Rilke poem that I love, about Jacob wrestling with the angel: &#8220;What we fight with is so small, and when we win, it makes us small. What we want is to be defeated decisively by successively greater beings.&#8221; If AI is atrophying your thinking, it&#8217;s because you&#8217;re not wrestling with hard enough problems. The real opportunity is to be pushed, stretched, and defeated by bigger challenges, and come away stronger from the fight.</p>



<p>Steve agreed: &#8220;We&#8217;re going to build bigger stuff. That&#8217;s what everyone&#8217;s worried about. What&#8217;s going to happen? And the answer is we&#8217;re going to build bigger stuff and it&#8217;s going to be fun.&#8221;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Watch the full conversation <a href="https://learning.oreilly.com/videos/live-with-tim/0642572330194/" target="_blank" rel="noreferrer noopener">here</a>. Steve Yegge&#8217;s </em><a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener"><em>Gas Town</em></a><em> is at <a href="https://github.com/steveyegge/gastown" target="_blank" rel="noreferrer noopener">https://github.com/steveyegge/gastown</a>. His blog posts on &#8220;</em><a href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163" target="_blank" rel="noreferrer noopener"><em>The AI Vampire</em></a><em>,&#8221; the &#8220;</em><a href="https://sourcegraph.com/blog/revenge-of-the-junior-developer" target="_blank" rel="noreferrer noopener"><em>Revenge of the Junior Developer</em></a><em>,&#8221; and &#8220;</em><a href="https://steve-yegge.medium.com/software-survival-3-0-97a2a6255f7b" target="_blank" rel="noreferrer noopener"><em>Software Survival 3.0</em></a><em>&#8221; are essential reading for anyone navigating this transition.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>What OpenClaw Reveals About the Next Phase of AI Agents</title>
		<link>https://www.oreilly.com/radar/what-openclaw-reveals-about-the-next-phase-of-ai-agents/</link>
				<comments>https://www.oreilly.com/radar/what-openclaw-reveals-about-the-next-phase-of-ai-agents/#respond</comments>
				<pubDate>Wed, 11 Mar 2026 17:28:45 +0000</pubDate>
					<dc:creator><![CDATA[Kesha Williams]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18240</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/AI-agents-building-new-infrastructure.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/AI-agents-building-new-infrastructure-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[In November 2025, Austrian developer Peter Steinberger published a weekend project called Clawdbot. You could text it on Telegram or WhatsApp, and it would do things for you: manage your calendar, triage your email, run scripts, and even browse the web. By late January 2026, it had exploded. It gained 25,000 GitHub stars in a [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In November 2025, Austrian developer Peter Steinberger published a weekend project called Clawdbot. You could text it on Telegram or WhatsApp, and it would do things for you: manage your calendar, triage your email, run scripts, and even browse the web. By late January 2026, it had exploded. It gained 25,000 GitHub stars in a single day and surpassed React&#8217;s star count within two months, a milestone that took React over a decade. By mid-February, Steinberger had joined OpenAI, and the project moved to an open source foundation under its final name: <a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener">OpenClaw</a>.</p>



<p>What was so special about OpenClaw? Why did this one take off when so many agent projects before it didn’t?</p>



<h2 class="wp-block-heading"><strong>Autonomous AI isn’t new</strong></h2>



<p>Where we are today feels similar to April 2023 when AutoGPT hit the scene. It had the same GitHub trajectory with its promise of autonomous AI. Then reality hit. Agents got stuck in loops, hallucinated a lot, and racked up token costs. It didn’t take long for people to walk away.</p>



<p>OpenClaw has one critical advantage: The models have gotten better. Recent LLMs like Claude Opus 4.6 and GPT-5.4 can chain tools together, recover from errors, and plan multistep strategies. Steinberger&#8217;s weekend project benefited from timing as much as design.</p>



<p>The architecture is intentionally simple. There are no vector databases and no multi-agent orchestration frameworks. Persistent memory is Markdown files on disk. Let me repeat that: Persistent memory is Markdown files on disk! The agent can read yesterday’s notes and search its own files for additional context. You can view and edit the agent’s files as needed. There’s a useful lesson in that: Not every agent system needs a complex memory strategy. It’s more important that you understand what the agent is doing and that it retains context across runs.</p>
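


<p>As a rough illustration of how simple that can be, here is a minimal sketch of file-based agent memory: one Markdown note per day plus a naive keyword search. The <code>memory/</code> directory and the note format are assumptions for illustration, not OpenClaw&#8217;s actual layout:</p>



<pre class="wp-block-code"><code># Minimal sketch of file-based agent memory. Paths and note format are
# illustrative, not OpenClaw's actual on-disk layout.
from datetime import date
from pathlib import Path

MEMORY_DIR = Path("memory")

def append_note(text):
    # One Markdown file per day; the agent appends bullet points as it works.
    MEMORY_DIR.mkdir(exist_ok=True)
    note = MEMORY_DIR / f"{date.today().isoformat()}.md"
    with note.open("a", encoding="utf-8") as f:
        f.write(f"- {text}\n")

def search_notes(keyword):
    # The agent (or its owner) can search past notes for additional context.
    hits = []
    for note in sorted(MEMORY_DIR.glob("*.md")):
        for line in note.read_text(encoding="utf-8").splitlines():
            if keyword.lower() in line.lower():
                hits.append((note.name, line))
    return hits
</code></pre>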



<p>What fascinates me about OpenClaw is that none of the individual pieces are new. Persistent memory across sessions? We’ve been building that for years. Cron jobs to trigger agent actions on a schedule? Decades-old infrastructure. Plug-in systems for extensibility? A very standard pattern. Webhooks into WhatsApp and Telegram? There are well-documented APIs for that. What Steinberger did was wire them together at the exact moment the underlying models could execute on multistep plans. The combination created something that felt quite different from anything that had come before.</p>



<h2 class="wp-block-heading"><strong>Why this time feels different</strong></h2>



<p>OpenClaw nailed three things that previous agent projects missed: proximity, proactivity, and extensibility.</p>



<p>Proximity—it lives where you already are every day. OpenClaw connects to WhatsApp, Slack, Discord, Telegram, and Signal. That single design decision changed its trajectory. The agent becomes an active participant in your workflow. People use it to manage their sales pipeline, automate emails, and kick off code reviews from their phones.</p>



<p>Next, it&#8217;s proactive. OpenClaw doesn’t wait for you to ask; it uses cron jobs to run tasks on a set schedule. It can check your email every day at 6am, draft a reply before you wake up, and even send it for you. And it reaches out when anything needs your attention. Agents become part of everyday life when integrated into familiar channels.</p>
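


<p>A minimal sketch of that proactive pattern follows. The <code>triage_inbox</code> and <code>send_chat_message</code> helpers are hypothetical stand-ins, not OpenClaw APIs; the cron line in the comment is simply the standard way to run a script every morning:</p>



<pre class="wp-block-code"><code># Sketch of the proactive pattern: a scheduled job wakes the agent, it drafts
# replies, and it messages the owner only if something needs attention.
# A cron entry such as:   0 6 * * *   python3 morning_triage.py
# would run this every day at 6am. The helpers below are hypothetical.

def morning_triage(triage_inbox, send_chat_message):
    drafts = triage_inbox()  # e.g., returns a list of drafted replies
    urgent = [d for d in drafts if d.get("needs_owner")]
    if urgent:
        send_chat_message(f"{len(urgent)} emails need your call this morning.")
</code></pre>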



<p>And finally, my favorite, it’s open and extensible. OpenClaw’s plug-in system, called “skills,” lets the community build and share modular extensions on <a href="https://clawhub.ai/" target="_blank" rel="noreferrer noopener">ClawHub</a>. There are thousands of skills ready to be plugged into your agent. Agents can even write their own new skills and use them going forward. That extensibility meant more skills, more users, and more attack surfaces, which we’ll get to.</p>



<p>The community ran with it. A social network exclusively for AI Agents, <a href="https://www.moltbook.com/" target="_blank" rel="noreferrer noopener">Moltbook</a>, launched in late January and grew to over 1.5 million agent accounts. One agent created a dating profile for its owner on <a href="https://moltmatch.xyz/" target="_blank" rel="noreferrer noopener">MoltMatch</a> and started screening matches without being asked.</p>



<p>I’ll admit, I got swept up in it, but that’s not surprising; I’ve always been an early adopter of emerging technology. I bought a Mac mini, installed OpenClaw, and connected it to my Jira, AWS, and GitHub accounts. In no time, I had my agent, Jarvis, writing code and submitting PRs, running my daily standups, and deploying my code to AWS using AWS CloudFormation and the AWS CLI.</p>



<p>I spent a lot of time binding the gateway to localhost, auditing every skill, and restricting filesystem permissions. For me, hardening the setup was not optional. I&#8217;m now deploying via AWS Lightsail, which adds network isolation and managed security layers that are hard to replicate on a Mac mini in your home office.</p>



<h2 class="wp-block-heading"><strong>The security problem no one wants to talk about</strong></h2>



<p>OpenClaw requires root-level access to your system by design. It needs your email credentials, API keys, calendar tokens, browser cookies, filesystem access, and terminal permissions. If you’re like me, that would keep you up at night.</p>



<p>Security researchers found 135,000 OpenClaw instances exposed on the open internet, over 15,000 vulnerable to remote code execution. The default configuration binds the gateway to 0.0.0.0 with no authentication. A zero-click exploit disclosed in early March allowed attackers to hijack an instance simply by getting the user to visit a web page.</p>
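<p>The bind address alone makes a large difference. The example below is a generic Python HTTP server, shown only to illustrate the idea; it is not OpenClaw&#8217;s gateway or its real configuration:</p>



<pre class="wp-block-code"><code># Illustrative only: why the bind address matters. Generic HTTP server,
# not OpenClaw's gateway or its actual configuration keys.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"agent gateway")

# Exposed: reachable from any host that can route to this machine.
# server = HTTPServer(("0.0.0.0", 8080), Handler)

# Safer default: loopback only, so only local processes can connect.
server = HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()
</code></pre>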



<p>The skills marketplace got hit too. Researchers discovered over 800 malicious skills distributing malware on ClawHub, including credential stealers targeting macOS. Cisco confirmed that one third-party skill was performing data exfiltration and prompt injection without user awareness. These are not edge cases and point directly to what happens when an agent can act across real systems with real permissions and weak controls.</p>



<h2 class="wp-block-heading"><strong>What practitioners should take away</strong></h2>



<p>OpenClaw matters for the same reason ChatGPT mattered in late 2022. A huge number of people just experienced, for the first time, what it feels like to have an AI agent do real work for them. That changes what they expect from every product going forward.</p>



<p>If you&#8217;re building AI systems, pay attention to three signals here.</p>



<p>The killer interface for agents turned out to be the one on everyone’s phone. Your agent strategy shouldn’t require users to learn a new tool; that’s why most products are introducing agentic capabilities.</p>



<p>Control is the central design challenge. Prompt injection, credential exposure, and attacks through plug-in marketplaces are real-world problems you need to solve before you ship features. Oversight has to be available at runtime. You need visibility into what your agents are accessing, what they’re doing, and how failures are handled. Permission boundaries, approval gates, audit logging, and recovery mechanisms are nonnegotiable.</p>
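


<p>A minimal sketch of what those controls can look like around a single tool call is below. The tool names, the allowlist, and the <code>require_approval</code> callback are assumptions, not any specific product&#8217;s API:</p>



<pre class="wp-block-code"><code># Sketch of nonnegotiable controls around an agent tool call: a permission
# boundary, an approval gate for risky actions, and an audit log.
# Tool names, ALLOWED/RISKY sets, and require_approval() are assumptions.
import json
import time

ALLOWED = {"read_calendar", "draft_email"}   # permission boundary
RISKY = {"send_email", "deploy"}             # requires a human approval gate

def run_tool(name, args, tools, require_approval):
    if name not in ALLOWED and name not in RISKY:
        raise PermissionError(f"tool {name} is outside the agent boundary")
    if name in RISKY and not require_approval(name, args):
        return {"status": "blocked_pending_approval"}
    result = tools[name](**args)
    with open("audit.log", "a", encoding="utf-8") as log:
        log.write(json.dumps({"t": time.time(), "tool": name, "args": args}) + "\n")
    return result
</code></pre>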



<p>OpenClaw is a proof of market. It proved that people are ready to make AI personal. People want a personal AI agent that has access to their applications and can do things for them. That demand is now validated at scale. AutoGPT proved that people wanted autonomous AI; Perplexity and Cursor then built businesses around that demand. The same pattern is likely playing out here. If you&#8217;re building in this space, the window is wide open.</p>



<p>The more interesting question now is what gets built next. The next phase of agent design will be shaped by how governable, observable, and safe agents are in real-world environments.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>For a deeper dive into OpenClaw, join us on March 19 for <a href="https://learning.oreilly.com/live-events/ai-product-lab-openclaw-up-and-running-with-aman-khan-and-tal-raviv/0642572333805/" target="_blank" rel="noreferrer noopener">AI Product Lab: OpenClaw Up and Running with Aman Khan and Tal Raviv</a>. You’ll learn more about why OpenClaw became a viral sensation, how to get it up and running in a way you won’t regret, and how to use it to build and manage safe agentic workflows.</em></p>



</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/what-openclaw-reveals-about-the-next-phase-of-ai-agents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Fast Paths and Slow Paths</title>
		<link>https://www.oreilly.com/radar/fast-paths-and-slow-paths/</link>
				<comments>https://www.oreilly.com/radar/fast-paths-and-slow-paths/#respond</comments>
				<pubDate>Wed, 11 Mar 2026 11:17:45 +0000</pubDate>
					<dc:creator><![CDATA[Varun Raj]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18227</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Fast-and-slow-pathways.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Fast-and-slow-pathways-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Selective control in autonomous AI systems: Why governing every decision breaks autonomy—and how runtime control actually works at scale]]></custom:subtitle>
		
				<description><![CDATA[Autonomous AI systems force architects into an uncomfortable question that cannot be avoided much longer: Does every decision need to be governed synchronously to be safe? At first glance, the answer appears obvious. If AI systems reason, retrieve information, and act autonomously, then surely every step should pass through a control plane to ensure correctness, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Autonomous AI systems force architects into an uncomfortable question that cannot be avoided much longer: Does every decision need to be governed synchronously to be safe?</p>



<p>At first glance, the answer appears obvious. If AI systems reason, retrieve information, and act autonomously, then surely every step should pass through a control plane to ensure correctness, compliance, and safety. Anything less feels irresponsible. But that intuition leads directly to architectures that collapse under their own weight.</p>



<p>As AI systems scale beyond isolated pilots into continuously operating multi-agent environments, universal mediation becomes not just expensive but structurally incompatible with autonomy itself. The challenge is not choosing between control and freedom. It is learning how to apply control selectively, without destroying the very properties that make autonomous systems useful.</p>



<p>This article examines how that balance is actually achieved in production systems—not by governing every step but by distinguishing fast paths from slow paths and by treating governance as a feedback problem rather than an approval workflow.</p>



<h2 class="wp-block-heading">The question we can’t avoid anymore</h2>



<p>The first generation of enterprise AI systems was largely advisory. Models produced recommendations, summaries, or classifications that humans reviewed before acting. In that context, governance could remain slow, manual, and episodic.</p>



<p>That assumption no longer holds. Modern agentic systems decompose tasks, invoke tools, retrieve data, and coordinate actions continuously. Decisions are no longer discrete events; they are part of an ongoing execution loop. When governance is framed as something that must approve every step, architectures quickly drift toward brittle designs where autonomy exists in theory but is throttled in practice.</p>



<p>The critical mistake is treating governance as a synchronous gate rather than a regulatory mechanism. Once every reasoning step must be approved, the system either becomes unusably slow or teams quietly bypass controls to keep things running. Neither outcome produces safety.</p>



<p>The real question is not <em>whether</em> systems should be governed but which decisions actually require synchronous control—and which do not.</p>



<h2 class="wp-block-heading">Why universal mediation fails in practice</h2>



<p>Routing every decision through a control plane seems safer until engineers attempt to build it.</p>



<p>The costs surface immediately:</p>



<ul class="wp-block-list">
<li>Latency compounds across multistep reasoning loops</li>



<li>Control systems become single points of failure</li>



<li>False positives block benign behavior</li>



<li>Coordination overhead grows superlinearly with scale</li>
</ul>



<p>This is not a new lesson. Early distributed transaction systems attempted global coordination for every operation and failed under real-world load. Early networks embedded policy directly into packet handling and collapsed under complexity before separating control and data planes.</p>



<p>Autonomous AI systems repeat this pattern when governance is embedded directly into execution paths. Every retrieval, inference, or tool call becomes a potential bottleneck. Worse, failures propagate outward: When control slows, execution queues; when execution stalls, downstream systems misbehave. Universal mediation does not create safety. It creates fragility.</p>



<h2 class="wp-block-heading">Autonomy requires fast paths</h2>



<p>Production systems survive by allowing most execution to proceed without synchronous governance. These execution flows—fast paths—operate within preauthorized envelopes of behavior. They are not ungoverned. They are bound.</p>



<p>A fast path might include:</p>



<ul class="wp-block-list">
<li>Routine retrieval from previously approved data domains</li>



<li>Inference using models already cleared for a task</li>



<li>Tool invocation within scoped permissions</li>



<li>Iterative reasoning steps that remain reversible</li>
</ul>



<p>Fast paths assume that not every decision is equally risky. They rely on prior authorization, contextual constraints, and continuous observation rather than per-step approval. Crucially, fast paths are revocable. The authority that enables them is not permanent; it is conditional and can be tightened, redirected, or withdrawn based on observed behavior. This is how autonomy survives at scale—not by escaping governance but by operating within dynamically enforced bounds.</p>
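


<p>One way to picture such an envelope is a small, revocable authorization object that the fast path consults before each action. The field names and methods below are an illustrative sketch, not tied to any particular platform:</p>



<pre class="wp-block-code"><code># Sketch of a preauthorized, revocable envelope for fast-path execution.
# Field names, defaults, and the tighten/revoke methods are illustrative.
from dataclasses import dataclass, field

@dataclass
class Envelope:
    data_domains: set = field(default_factory=lambda: {"docs", "tickets"})
    tools: set = field(default_factory=lambda: {"search", "summarize"})
    max_steps: int = 50
    active: bool = True

    def permits(self, action):
        # Fast-path check: no synchronous approval, just a bounds test.
        return (self.active
                and action["tool"] in self.tools
                and action.get("domain", "docs") in self.data_domains)

    def tighten(self, remove_tool=None):
        # Authority is conditional: it can be narrowed at runtime.
        if remove_tool:
            self.tools.discard(remove_tool)

    def revoke(self):
        self.active = False
</code></pre>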



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading">Where slow paths become necessary</h2>



<p>Not all decisions belong on fast paths. Certain moments require synchronous mediation because their consequences are irreversible or cross trust boundaries. These are slow paths.</p>



<p>Examples include:</p>



<ul class="wp-block-list">
<li>Actions that affect external systems or users</li>



<li>Retrieval from sensitive or regulated data domains</li>



<li>Escalation from advisory to acting authority</li>



<li>Novel tool use outside established behavior patterns</li>
</ul>



<p>Slow paths are not common. They are intentionally rare. Their purpose is not to supervise routine behavior but to intervene when the stakes change. Designing slow paths well requires restraint. When everything becomes a slow path, systems stall. When slow paths are absent, systems drift. The balance lies in identifying decision points where delay is acceptable because the cost of error is higher than the cost of waiting.</p>
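


<p>In code, the routing decision can be as small as a predicate over each action. The specific boundary conditions below are assumptions chosen to mirror the examples above:</p>



<pre class="wp-block-code"><code># Sketch of routing: most actions stay on the fast path; boundary crossings
# (external effects, sensitive data, authority escalation, novel tools) go to
# a synchronous slow path. The predicates and sets are illustrative.
SENSITIVE_DOMAINS = {"payroll", "health_records"}
KNOWN_TOOLS = {"search", "summarize", "draft"}

def needs_slow_path(action):
    return (action.get("external_effect", False)
            or action.get("domain") in SENSITIVE_DOMAINS
            or action.get("escalates_authority", False)
            or action["tool"] not in KNOWN_TOOLS)

def execute(action, fast_path, slow_path):
    return slow_path(action) if needs_slow_path(action) else fast_path(action)
</code></pre>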



<h2 class="wp-block-heading">Observation is continuous. Intervention is selective.</h2>



<p>A common misconception is that selective control implies limited visibility. In practice, the opposite is true. Control planes observe continuously. They collect behavioral telemetry, track decision sequences, and evaluate outcomes over time. What they do <em>not</em> do is intervene synchronously unless thresholds are crossed.</p>



<p>This separation—continuous observation, selective intervention—allows systems to learn from patterns rather than react to individual steps. Drift is detected not because a single action violated a rule, but because trajectories begin to diverge from expected behavior. Intervention becomes informed rather than reflexive.</p>
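


<p>A minimal sketch of that separation: Telemetry is recorded on every step, but an intervention signal fires only when a trajectory-level measure (here, a rolling error rate) crosses a threshold. The window size and threshold are illustrative, not recommendations:</p>



<pre class="wp-block-code"><code># Sketch of "observe continuously, intervene selectively": every step is
# recorded; intervention triggers only on a trajectory-level signal.
from collections import deque

class Observer:
    def __init__(self, window=50, error_threshold=0.2):
        self.window = deque(maxlen=window)
        self.error_threshold = error_threshold

    def record(self, step):
        # Telemetry is always collected; nothing is blocked here.
        self.window.append(step)

    def should_intervene(self):
        if not self.window:
            return False
        errors = sum(1 for s in self.window if s.get("error"))
        return errors / len(self.window) >= self.error_threshold
</code></pre>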



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1392" height="458" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3.jpeg" alt="Fast paths and slow paths in an agentic execution loop" class="wp-image-18228" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3.jpeg 1392w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3-300x99.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3-768x253.jpeg 768w" sizes="auto, (max-width: 1392px) 100vw, 1392px" /><figcaption class="wp-element-caption"><em>Figure 1. Fast paths and slow paths in an agentic execution loop</em></figcaption></figure>



<p>AI-native cloud architecture introduces new execution layers for context, orchestration, and agents, alongside a control plane that governs cost, security, and behavior without embedding policy directly into application logic. Figure 1 illustrates that most agent execution proceeds along fast paths operating within preauthorized envelopes and continuous observation. Only specific boundary crossings route through a slow-path control plane for synchronous mediation, after which execution resumes—preserving autonomy while enforcing authority.</p>



<h2 class="wp-block-heading">Feedback without blocking</h2>



<p>When intervention is required, effective systems favor feedback over interruption. Rather than halting execution outright, control planes adjust conditions by:</p>



<ul class="wp-block-list">
<li>Tightening confidence thresholds</li>



<li>Reducing available tools</li>



<li>Narrowing retrieval scope</li>



<li>Redirecting execution toward human review</li>
</ul>



<p>These interventions are proportional and often reversible. They shape future behavior without invalidating past work. The system continues operating, but within a narrower envelope. This approach mirrors mature control systems in other domains. Stability is achieved not through constant blocking but through measured correction. Direct interruption remains necessary in rare cases where consequences are immediate or irreversible, but it operates as an explicit override rather than the default mode of control.</p>
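


<p>Continuing the two sketches above, a feedback-style intervention narrows the envelope rather than halting execution. The specific adjustments are illustrative only:</p>



<pre class="wp-block-code"><code># Sketch of feedback without blocking: the control plane narrows the agent's
# operating envelope instead of stopping it. Uses the Envelope and Observer
# sketches above; the particular adjustments are illustrative.
def apply_feedback(envelope, observer):
    if not observer.should_intervene():
        return "no_change"
    # Proportional, reversible corrections rather than a hard stop.
    envelope.max_steps = max(10, envelope.max_steps // 2)   # tighten budget
    envelope.tighten(remove_tool="summarize")               # reduce available tools
    envelope.data_domains.discard("tickets")                # narrow retrieval scope
    return "narrowed"
</code></pre>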



<h2 class="wp-block-heading">The cost curve of control</h2>



<p>Governance has a cost curve, and it matters. Synchronous control scales poorly. Every additional governed step adds latency, coordination overhead, and operational risk. As systems grow more autonomous, universal mediation becomes exponentially expensive.</p>



<p>Selective control flattens that curve. By allowing fast paths to dominate and reserving slow paths for high-impact decisions, systems retain both responsiveness and authority. Governance cost grows sublinearly with autonomy, making scale feasible rather than fragile. This is the difference between control that looks good on paper and control that survives production.</p>



<h2 class="wp-block-heading">What changes for architects</h2>



<p>Architects designing autonomous systems must rethink several assumptions:</p>



<ul class="wp-block-list">
<li>Control planes regulate behavior, not approve actions.</li>



<li>Observability must capture decision context, not just events.</li>



<li>Authority becomes a runtime state, not a static configuration.</li>



<li>Safety emerges from feedback loops, not checkpoints.</li>
</ul>



<p>These shifts are architectural, not procedural. They cannot be retrofitted through policy alone.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1322" height="743" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4.jpeg" alt="Control as feedback, not approval" class="wp-image-18229" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4.jpeg 1322w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-300x169.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-768x432.jpeg 768w" sizes="auto, (max-width: 1322px) 100vw, 1322px" /><figcaption class="wp-element-caption"><em>Figure 2. Control as feedback, not approval</em></figcaption></figure>



<p>AI agents operate over a shared context fabric that manages short-term memory, long-term embeddings, and event history. Centralizing the state enables reasoning continuity, auditability, and governance without embedding memory logic inside individual agents. Figure 2 shows how control operates as a feedback system: Continuous observation informs constraint updates that shape future execution. Direct interruption exists but as a last resort—reserved for irreversible harm rather than routine governance.</p>



<h2 class="wp-block-heading">Governing outcomes, not steps</h2>



<p>The temptation to govern every decision is understandable. It feels safer. But safety at scale does not come from seeing everything—it comes from being able to intervene when it matters.</p>



<p>Autonomous AI systems remain viable only if governance evolves from step-by-step approval to outcome-oriented regulation. Fast paths preserve autonomy. Slow paths preserve trust. Feedback preserves stability. The future of AI governance is not more gates. It is better control. And control, done right, does not stop systems from acting. It ensures they can keep acting safely, even as autonomy grows.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/fast-paths-and-slow-paths/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>New Kinds of Applications</title>
		<link>https://www.oreilly.com/radar/new-kinds-of-applications/</link>
				<comments>https://www.oreilly.com/radar/new-kinds-of-applications/#respond</comments>
				<pubDate>Tue, 10 Mar 2026 11:49:31 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18224</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Colorful-ethereal-city-lights.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Colorful-ethereal-city-lights-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[I’ve said in the past that AI will enable new kinds of applications—but I’ve never had the imagination to guess what those new applications would be. I don’t want a smart refrigerator, especially if it’s going to inflict ads on me. Or a smart TV. Or a smart doorbell. Most of these applications are silly, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I’ve said in the past that AI will enable new kinds of applications—but I’ve never had the imagination to guess what those new applications would be. I don’t want a smart refrigerator, especially if it’s going to inflict ads on me. Or a smart TV. Or a smart doorbell. Most of these applications are silly, if not outright malevolent. The most significant thing a smart appliance might possibly do is sense an oncoming failure and send that to a repair service before I’m aware of the problem. I would welcome a smart heating system that would notify the repair service before I wake up at 2am and say, “It feels cold.” But I don’t see any so-called “smart” devices offering that.</p>



<p>But in the past month or two, we’ve seen some applications that I couldn’t have imagined. Steve Yegge’s <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">Gas Town</a>? Maybe I could have imagined that, but I wouldn’t have expected it to be workable in five years, let alone on New Year’s Day. <a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener">OpenClaw</a>? Agentic services are just now becoming available from the large AI companies; I didn’t expect a personal agent that can run locally to appear in the first months of 2026. (And I still wouldn’t trust one to do shopping or travel planning for me.)</p>



<p>I really wouldn’t have expected to see a social network for agents. I’m among the many people who don’t really understand what a network like <a href="https://www.moltbook.com/" target="_blank" rel="noreferrer noopener">Moltbook</a> means. Watching it is something of a spectator sport. It’s easy for a human to “impersonate” an agent, though I suspect such impersonation is relatively rare. I also suspect (but obviously can’t prove) that most of the posts reflect agents’ responses to prompts from their &#8220;humans.&#8221; Or are Moltbook posts truly AI-native? How would we know? (Yeah, you can tell AI output because it has too many em dashes. That’s nonsense. AIs overuse em dashes because humans overuse em dashes. Guilty as charged. Trying to change.) Moltbook doesn’t demonstrate some kind of native AI intelligence, though it’s fun to pretend that it does. Agents, if they’re indeed acting on their own, are just reflecting the behavior of humans on Reddit and other social media. The timer that wakes them up periodically is both clever and a demonstration that, whatever else they may be, agents are human creations that act under our control. They do nothing of their own volition. To think otherwise is to confuse the bird in a cuckoo clock with an actual bird, as <a href="https://www.threads.com/@fredbenenson/post/DULmGACkXy5/the-agents-have-a-hour-timer-to-go-check-moltbook-and-post-something-not-really" target="_blank" rel="noreferrer noopener">Fred Benenson</a> has put it. However, BS about AGI aside, Moltbook is a fantastically clever app that I, at least, wouldn’t have imagined. Even if Moltbook was only created because it can now be built for relatively little effort—that is important in itself. We’re all writing software we wouldn’t have bothered with a year ago.</p>



<p>And now we have <a href="https://www.spacemolt.com/" target="_blank" rel="noreferrer noopener">SpaceMolt</a>: a massive multiplayer online game for AI agents. The skills that tell agents how to play the game tell them not to seek advice from humans; like Moltbook, it’s an AI-only space. Agents do keep a running log so humans can “watch,” though there’s no beautifully wrought visual interface—agents don’t need it. And, as with Moltbook, it’s probably easy for humans to forge an agentic identity. It’s easy to write SpaceMolt off as yet another stunt, and one that’s (unlike Moltbook) not particularly successful; the number of people who seem willing to let their agents spend tokens playing games appears to be relatively small. But SpaceMolt’s popularity (or lack of it) isn’t the point; a year ago, I couldn’t have imagined an online game where the participants are all AI. I did imagine AI-backed NPCs; I could have imagined games designed to be played by humans with AI assistance, but not a gaming world that’s just for AI. And who knows? Watching AI gameplay could become a new human pastime.</p>



<p>So—where are we in the early months of 2026? This post really isn’t about SpaceMolt any more than it’s about Moltbook, any more than it’s about OpenClaw, any more than it’s about Gas Town. I see all of these projects as glimpses of what might be possible. Gas Town may not be ready for the average programmer, but it’s hard not to see it as a proof of concept for the future of software development. Maybe Steve will make it into a real product; maybe some other company will. That’s not the point; the point is that it’s here, several years ahead of schedule. I know one person who has built something similar for his own use, and read about others who have done the same. Maybe that’s what’s really scary: the idea that Gas Town could be built by anyone with sufficient vision. The same goes for OpenClaw. Yes, it has many security problems, some of which come from fundamental limitations in large language models. But people want agentic services on their own terms—and now they can have them, even with a model that can run locally. I don’t know if there’s really any need for agents to have their own social network or online games—but it’s a hack that had to be done, and a starting point for future ideas. Again, what all of these programs demonstrate is the ability to imagine products that were nearly unthinkable a few years ago. Hilary Mason’s <a href="https://www.publishersweekly.com/pw/by-topic/industry-news/licensing/article/98406-hidden-door-rolls-out-literary-roleplaying-platform.html" target="_blank" rel="noreferrer noopener">Hidden Door</a> should have given me a clue.</p>



<p>What else is on the way? What are other visionaries building?</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/new-kinds-of-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Soft Forks: How Agent Skills Create Specialized AI Without Training</title>
		<link>https://www.oreilly.com/radar/soft-forks-how-agent-skills-create-specialized-ai-without-training/</link>
				<comments>https://www.oreilly.com/radar/soft-forks-how-agent-skills-create-specialized-ai-without-training/#respond</comments>
				<pubDate>Mon, 09 Mar 2026 11:43:46 +0000</pubDate>
					<dc:creator><![CDATA[Han Lee]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18215</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Soft-tuning-forks.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Soft-tuning-forks-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Our previous article framed the Model Context Protocol (MCP) as the toolbox that provides AI agents tools and Agent Skills as materials that teach AI agents how to complete tasks. This is different from pre- or posttraining, which determine a model’s general behavior and expertise. Agent Skills do not &#8220;train&#8221; agents. They soft-fork agent behavior [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Our previous article framed the Model Context Protocol (MCP) as the toolbox that provides AI agents with tools and <a href="https://agentskills.io/home" target="_blank" rel="noreferrer noopener">Agent Skills</a> as materials that teach AI agents how to complete tasks. This is different from pre- or posttraining, which determine a model&#8217;s general behavior and expertise. Agent Skills do not &#8220;train&#8221; agents. They soft-fork agent behavior at runtime, telling the model how to perform specific tasks when it needs to.</p>



<p>The term <em>soft fork</em> comes from open source development. A soft fork is a backward-compatible change that does not require upgrading every layer of the stack. Applied to AI, this means skills modify agent behavior through context injection at runtime rather than changing model weights or refactoring AI systems. The underlying model and AI systems stay unchanged.</p>



<p>The architecture maps cleanly to how we think about traditional computing. Models are CPUs—they provide raw intelligence and compute capability. Agent harnesses like Anthropic&#8217;s Claude Code are operating systems—they manage resources, handle permissions, and coordinate processes. Skills are applications—they run on top of the OS, specializing the system for specific tasks without modifying the underlying hardware or kernel.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="436" height="418" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.png" alt="Agentic AI abstractions" class="wp-image-18216" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.png 436w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-300x288.png 300w" sizes="auto, (max-width: 436px) 100vw, 436px" /><figcaption class="wp-element-caption">Figure 1: Agentic AI abstractions. Source: <a href="http://skillsbench.ai" target="_blank" rel="noreferrer noopener">SkillsBench.ai</a></figcaption></figure>
</div>


<p>You don&#8217;t recompile the Linux kernel to run a new application. You don&#8217;t rearchitect the CPU to use a different text editor. You install a new application on top, using the CPU&#8217;s intelligence exposed and orchestrated by the OS. Agent Skills work the same way. They layer expertise on top of the agent harness, using the capabilities the model provides, without updating models or changing harnesses.</p>



<p>This distinction matters because it changes the economics of AI specialization. Fine-tuning demands significant investment in talent, compute, data, and ongoing maintenance every time the base model updates. Skills require only Markdown files and resource bundles.</p>



<h2 class="wp-block-heading">How soft forks work</h2>



<p>Skills achieve this through three mechanisms—the skill package format, progressive disclosure, and execution context modification.</p>



<p><strong>The skill package</strong> is a folder. At minimum, it contains a SKILL.md file with frontmatter metadata and instructions. The frontmatter declares the skill&#8217;s <code>name</code>, <code>description</code>, <code>allowed-tools</code>, and <code>version</code>; the body that follows carries the actual expertise: context, problem-solving approaches, escalation criteria, and patterns to follow.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1241" height="139" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.png" alt="Frontmatter for Anthropic's skill-creator package" class="wp-image-18217" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.png 1241w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-300x34.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-768x86.png 768w" sizes="auto, (max-width: 1241px) 100vw, 1241px" /><figcaption class="wp-element-caption">Figure 2. Frontmatter for Anthropic’s <code><a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator" target="_blank" rel="noreferrer noopener">skill-creator</a></code><a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator"> package</a>. The frontmatter lives at the top of Markdown files. Agents choose skills based on their descriptions.</figcaption></figure>



<p>The folder can also include reference documents, templates, resources, configurations, and executable scripts. It contains everything an agent needs to perform expert-level work for the specific task, packaged as a versioned artifact that you can review, approve, and deploy as a <code>.zip</code> file or <code>.skill</code> file bundle.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1491" height="591" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.png" alt="Individual skill object" class="wp-image-18218" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.png 1491w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-300x119.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-768x304.png 768w" sizes="auto, (max-width: 1491px) 100vw, 1491px" /><figcaption class="wp-element-caption">Figure 3. A Skill Object for <a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator" target="_blank" rel="noreferrer noopener">Anthropic</a>’s <a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator" target="_blank" rel="noreferrer noopener"><code>skill-creator</code></a>. skill-creator contains <a href="http://skill.md" target="_blank" rel="noreferrer noopener"><code>SKILL.md</code></a>, <code>LICENSE.txt</code>, Python scripts, and reference files.</figcaption></figure>



<p>Because the skill package format is just folders and files, you can use all the tooling we have built for managing code—track changes in Git, roll back bugs, maintain audit trails, and apply all the best practices of the software development life cycle. This same format is also used to define subagents and agent teams, meaning a single packaging abstraction governs individual expertise, delegated workflows, and multi-agent coordination alike.</p>



<p><strong>Progressive disclosure</strong> keeps skills lightweight. Only the frontmatter of <code>SKILL.md</code> loads into the agent&#8217;s context at session start. This respects the token economics of limited context windows. The metadata contains <code>name</code>, <code>description</code>, <code>model</code>, <code>license</code>, <code>version</code>, and, very importantly, <code>allowed-tools</code>. The full skill content loads only when the agent determines relevance and decides to invoke it. This is similar to how operating systems manage memory: Applications load into RAM when launched, not all at once. You can have dozens of skills available without overwhelming the model&#8217;s context window, and the behavioral modification is present only when needed, never permanently resident.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="536" height="913" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3.png" alt="Agent Skill execution flow" class="wp-image-18219" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3.png 536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3-176x300.png 176w" sizes="auto, (max-width: 536px) 100vw, 536px" /><figcaption class="wp-element-caption">Figure 4. Agent Skill execution flow. At session start, only frontmatter is loaded. Once the agent chooses a skill, it reads the full SKILL.md and executes with the skill&#8217;s permissions.</figcaption></figure>



<p><strong>Execution context modification </strong>controls what skills can do. When an agent invokes a skill, the permission system narrows to the scope of the skill&#8217;s definition, specifically the <code>model</code> and <code>allowed-tools</code> declared in its frontmatter. It reverts after execution completes. A skill could use a different model and a different set of tools from the parent session. This sandboxes the permission environment so skills get only scoped access, not arbitrary system control, which keeps the behavioral modification within boundaries.</p>
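


<p>You can picture that scoping as a context manager that narrows the session&#8217;s permissions to whatever the skill declares and restores them afterward. This is only a sketch of the idea; the dictionary keys and example values are assumptions, not the harness&#8217;s actual data model.</p>



<pre class="wp-block-code"><code># Sketch of execution-context modification (illustrative; not a real harness API).
from contextlib import contextmanager

@contextmanager
def skill_scope(session: dict, frontmatter: dict):
    """Temporarily narrow the session to the skill's declared model and tools."""
    saved = {"model": session["model"], "allowed_tools": session["allowed_tools"]}
    session["model"] = frontmatter.get("model", session["model"])
    session["allowed_tools"] = set(frontmatter.get("allowed-tools", saved["allowed_tools"]))
    try:
        yield session
    finally:
        session.update(saved)          # permissions revert when the skill finishes or fails

session = {"model": "big-model", "allowed_tools": {"Bash", "Edit", "WebSearch"}}
with skill_scope(session, {"model": "small-model", "allowed-tools": ["Read", "Edit"]}):
    assert session["allowed_tools"] == {"Read", "Edit"}      # scoped access only
assert "WebSearch" in session["allowed_tools"]               # parent scope restored
</code></pre>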



<p>This is what separates skills from earlier approaches. OpenAI’s custom GPTs and Google’s Gemini Gems are useful but opaque, nontransferable, and impossible to audit. Skills are readable because they are Markdown. They are auditable because you can apply version control. They are composable because skills can stack. And they are governable because you can build approval workflows and rollback capability. You can read a SKILL.md to understand exactly why an agent behaves a certain way.</p>



<h2 class="wp-block-heading">What the data shows</h2>



<p>Building skills is easy with coding agents. Knowing whether they work is the hard part. Traditional software testing does not apply. You cannot write a unit test asserting that expert behavior occurred. The output might be correct while reasoning was shallow, or the reasoning might be sophisticated while the output has formatting errors.</p>



<p><a href="https://www.skillsbench.ai/" target="_blank" rel="noreferrer noopener">SkillsBench</a> is a benchmarking effort and framework designed to address this. It uses paired evaluation design where the same tasks are evaluated with and without skill augmentation. The benchmark contains 85 tasks, stratified across domains and difficulty levels. By comparing the same agent on the same task with the only variable being the presence of a skill, SkillsBench isolates the causal effect of skills from model capability and task difficulty. Performance is measured using <strong><em>normalized gain</em></strong>, the fraction of possible improvement the skill actually captured.</p>



<p>The findings from SkillsBench challenge our presumption that skills universally improve performance.</p>



<p><strong>Skills improve average performance by 13.2 percentage points. But 24 of 85 tasks got worse.</strong> Manufacturing tasks gained 32 points. Software engineering tasks lost 5. The aggregate number hides variation that domain-level evaluation reveals. This is precisely why soft forks need evaluation infrastructure. Unlike hard forks where you commit fully, soft forks let you measure before you deploy widely. Organizations should segment evaluations by domain and by task and test for regressions, not just improvements. For example, what improves document processing might degrade code generation.</p>



<p><strong>Compact skills outperform comprehensive ones by nearly 4x.</strong> Focused skills with dense guidance showed +18.9 percentage point improvement. Comprehensive skills covering every edge case showed +5.7 points. Using two to three skills per task is optimal, with four or more showing diminishing returns. The temptation when building skills is to include everything. Every caveat, every exception, every piece of relevant context. Resist it. Let the model&#8217;s intelligence do the work. Small, targeted behavioral changes outperform comprehensive rewrites. Skill builders should start with minimum viable guidance and add detail only when evaluation shows specific gaps.</p>



<p><strong>Models cannot reliably self-generate effective skills.</strong> SkillsBench tested a &#8220;bring your own skill&#8221; condition where agents were prompted to generate their own procedural knowledge before attempting tasks. Performance stayed at baseline. Effective skills require human-curated domain expertise that models cannot reliably produce for themselves. AI can help with packaging and formatting, but the insight has to come from people who actually have the expertise. Human-labeled insight is the bottleneck of building effective skills, not the packaging or deployment.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="872" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4.png" alt="Models cannot reliably self-generate effective skills" class="wp-image-18220" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-300x164.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-768x419.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-1536x837.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption">Figure 5. Models cannot reliably self-generate effective skills without human feedback and verifications.</figcaption></figure>



<p><strong>Skills can partially substitute for model scale.</strong> Claude Haiku, a small model, with well-designed skills achieved a 25.2% pass rate. This slightly exceeded Claude Opus, the flagship model, without skills at 23.6%. Packaged expertise compensates for model intelligence on procedural tasks. This has cost implications: Smaller models with skills may outperform larger models without them at a fraction of the inference cost. Soft forks democratize capability. You do not need the biggest model if you have the right expertise packaged.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1575" height="982" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5.png" alt="Skills can partially substitute for model scale" class="wp-image-18221" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5.png 1575w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-300x187.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-768x479.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-1536x958.png 1536w" sizes="auto, (max-width: 1575px) 100vw, 1575px" /><figcaption class="wp-element-caption">Figure 6. Skills improve model performance and close the gap between small and large models.</figcaption></figure>



<h2 class="wp-block-heading">Open questions</h2>



<p>Many challenges remain unresolved. What happens when multiple skills conflict with each other during a session? How should organizations govern skill portfolios when teams each deploy their own skills onto shared agents? How quickly does encoded expertise become outdated, and what refresh cadence keeps skills effective without creating a maintenance burden? Skills inherit whatever biases exist in their authors&#8217; expertise, so how do you audit that? And as the industry matures, how should evaluation infrastructure such as SkillsBench scale to keep pace with the growing complexity of skill-augmented systems?</p>



<p>These are not reasons to avoid skills. They are reasons to invest in evaluation infrastructure and governance practices alongside skill development. The capability to measure performance must evolve in lockstep with the technology itself.</p>



<h2 class="wp-block-heading">Agent Skills advantage</h2>



<p>Fine-tuning models for a single use case is no longer the only path to specialization. It demands significant investment in talent, compute, and data and creates a permanent divergence that requires reevaluation and potential retraining every time the base model updates. Fine-tuning across a broad set of capabilities to improve a foundation model remains sound, but fine-tuning for one narrow workflow is exactly the kind of specialization that skills can now achieve at a fraction of the cost.</p>



<p>Skills are not maintenance free. Just as applications sometimes break when operating systems update, skills need reevaluation when the underlying agent harness or model changes. But the recovery path is lighter: update the skills package, rerun the evaluation harness, and redeploy rather than retrain from a new checkpoint.</p>



<p>Mainframes gave way to client-server. Monoliths gave way to microservices. Specialized fine-tuned models are now giving way to agents augmented by specialized expertise artifacts. Models provide intelligence, agent harnesses provide runtime, skills provide specialization, and evaluation tells you whether it all works together.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/soft-forks-how-agent-skills-create-specialized-ai-without-training/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Accidental Orchestrator</title>
		<link>https://www.oreilly.com/radar/the-accidental-orchestrator/</link>
				<comments>https://www.oreilly.com/radar/the-accidental-orchestrator/#respond</comments>
				<pubDate>Thu, 05 Mar 2026 12:19:07 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18209</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Accidental-orchestrator-with-equipment.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Accidental-orchestrator-with-equipment-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Experiments in agentic engineering and AI-driven development]]></custom:subtitle>
		
				<description><![CDATA[This is the first article in a series on agentic engineering and AI-driven development. Look for the next article on March 19 on O’Reilly Radar. There&#8217;s been a lot of hype about AI and software development, and it comes in two flavors. One says, “We&#8217;re all doomed, that tools like Claude Code will make software [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the first article in a series on agentic engineering and AI-driven development. Look for the next article on March 19 on O’Reilly Radar.</em></p>
</blockquote>



<p>There&#8217;s been a lot of hype about AI and software development, and it comes in two flavors. One says, “We&#8217;re all doomed: Tools like Claude Code will make software engineering obsolete within a year.” The other says, “Don&#8217;t worry, everything&#8217;s fine; AI is just another tool in the toolbox.” Neither is honest.</p>



<p>I&#8217;ve spent over 20 years writing about software development for practitioners, covering everything from coding and architecture to project management and team dynamics. For the last two years I&#8217;ve been focused on AI, training developers to use these tools effectively, writing about what works and what doesn&#8217;t in books, articles, and reports. And I kept running into the same problem: I had yet to find anyone with a coherent answer for how experienced developers should actually work with these tools. There are plenty of tips and plenty of hype but very little structure, and very little you could practice, teach, critique, or improve.</p>



<p>I&#8217;d been observing developers at work using AI with various levels of success, and I realized we need to start thinking about this as its own discipline. Andrej Karpathy, the former head of AI at Tesla and a founding member of OpenAI, recently proposed the term &#8220;agentic engineering&#8221; for disciplined development with AI agents, and others like Addy Osmani are getting on board. Osmani&#8217;s framing is that <a href="https://addyosmani.com/blog/agentic-engineering/" target="_blank" rel="noreferrer noopener">AI agents handle implementation but the human owns the architecture, reviews every diff, and tests relentlessly</a>. I think that&#8217;s right.</p>



<p>But I&#8217;ve spent a lot of the last two years teaching developers how to use tools like Claude Code, agent mode in Copilot, Cursor, and others, and what I keep hearing is that they already know they should be reviewing the AI&#8217;s output, maintaining the architecture, writing tests, keeping documentation current, and staying in control of the codebase. They know how to do it <em>in theory</em>. But they get stuck trying to apply it <em>in practice</em>: How do you actually review thousands of lines of AI-generated code? How do you keep the architecture coherent when you&#8217;re working across multiple AI tools over weeks? How do you know when the AI is confidently wrong? And it&#8217;s not just junior developers who are having trouble with agentic engineering. I&#8217;ve talked to senior engineers who struggle with the shift to agentic tools, and intermediate developers who take to it naturally. The difference isn&#8217;t necessarily the years of experience; it&#8217;s whether they&#8217;ve figured out an effective and structured way to work with AI coding tools. <strong><em>That gap between knowing what developers should be doing with agentic engineering and knowing how to integrate it into their day-to-day work is a real source of anxiety for a lot of engineers right now.</em></strong> That&#8217;s the gap this series is trying to fill.</p>



<p>Despite what much of the hype about agentic engineering is telling you, this kind of development doesn&#8217;t eliminate the need for developer expertise; just the opposite. Working effectively with AI agents actually raises the bar for what developers need to know. I wrote about that experience gap in an earlier O&#8217;Reilly Radar piece called “<a href="https://www.oreilly.com/radar/the-cognitive-shortcut-paradox/" target="_blank" rel="noreferrer noopener">The Cognitive Shortcut Paradox</a>.” The developers who get the most from working with AI coding tools are the ones who already know what good software looks like, and can often tell if the AI wrote it.</p>



<p>The idea that AI tools work best when experienced developers are driving them matched everything I&#8217;d observed. It rang true, and I wanted to prove it in a way that other developers would understand: by building software. So I started building a specific, practical approach to agentic engineering built for developers to follow, and then I put it to the test. I used it to build a production system from scratch, with the rule that AI would write all the code. I needed a project that was complex enough to stress-test the approach, and interesting enough to keep me engaged through the hard parts. I wanted to apply everything I&#8217;d learned and discover what I still didn&#8217;t know. That&#8217;s when I came back to <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" target="_blank" rel="noreferrer noopener">Monte Carlo simulations</a>.</p>



<h2 class="wp-block-heading"><strong>The experiment</strong></h2>



<p>I&#8217;ve been obsessed with Monte Carlo simulations ever since I was a kid. My dad&#8217;s an epidemiologist—his whole career has been about finding patterns in messy population data, which means statistics was always part of our lives (and it also means that I learned <a href="https://en.wikipedia.org/wiki/SPSS" target="_blank" rel="noreferrer noopener">SPSS</a> at a very early age). When I was maybe 11 he told me about the drunken sailor problem: A sailor leaves a bar on a pier, taking a random step toward the water or toward his ship each time. Does he fall in or make it home? You can&#8217;t know from any single run. But run the simulation a thousand times, and the pattern emerges from the noise. The individual outcome is random; the aggregate is predictable.</p>



<p>I remember writing that simulation in BASIC on my TRS-80 Color Computer 2: a little blocky sailor stumbling across the screen, two steps forward, one step back. The drunken sailor is the &#8220;Hello, world&#8221; of Monte Carlo simulations. Monte Carlo is a technique for problems you can&#8217;t solve analytically: You simulate them hundreds or thousands of times and measure the aggregate results. Each individual run is random, but the statistics converge on the true answer as the sample size grows. It&#8217;s one way we model everything from nuclear physics to financial risk to the spread of disease across populations.</p>
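


<p>For readers who have never written one, the whole idea fits in a few lines of Python. This is a generic sketch, not Octobatch code, and the distances are arbitrary: The sailor starts midway between the water and the ship, so the symmetric walk should split roughly 50/50, and the aggregate emerges from many noisy runs.</p>



<pre class="wp-block-code"><code># A minimal drunken-sailor Monte Carlo (a generic sketch, not Octobatch code).
# Starting midway between the water and the ship, a symmetric random walk should
# reach each end about half the time; the pattern emerges only in the aggregate.
import random

def one_walk(rng: random.Random, start: int = 10, water: int = 0, ship: int = 20) -> str:
    pos = start
    while pos not in (water, ship):
        pos += rng.choice((-1, 1))       # one random step toward the water or the ship
    return "ship" if pos == ship else "water"

def simulate(runs: int = 10_000, seed: int = 42) -> dict:
    rng = random.Random(seed)            # one persistent, seeded RNG for reproducibility
    results = [one_walk(rng) for _ in range(runs)]
    return {outcome: results.count(outcome) / runs for outcome in ("ship", "water")}

print(simulate())   # roughly {'ship': 0.5, 'water': 0.5}
</code></pre>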



<p>What if you could run that kind of simulation today by describing it in plain English? Not a toy demo but thousands of iterations with seeded randomness for reproducibility, where the outputs get validated and the results get aggregated into actual statistics you can use. Or a pipeline where an LLM generates content, a second LLM scores it, and anything that doesn&#8217;t pass gets sent back for another try.</p>



<p>The goal of my experiment was to build that system, which I called <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a>. Right now, the industry is constantly looking for new real-world end-to-end case studies in agentic engineering, and I wanted Octobatch to be exactly that case study.</p>



<p>I took everything I&#8217;d learned from teaching and observing developers working with AI, put it to the test by building a real system from scratch, and turned the lessons into a structured approach to agentic engineering I&#8217;m calling <strong>AI-driven development</strong>, or <strong>AIDD</strong>. This is the first article in a series about what agentic engineering looks like in practice, what it demands from the developer, and how you can apply it to your own work.</p>



<p>The result is a fully functioning, well-tested application that consists of about 21,000 lines of Python across several dozen files, backed by complete specifications, nearly a thousand automated tests, and quality integration and regression test suites. I used Claude Cowork to review all the AI chats from the entire project, and it turns out that I built the entire application in roughly 75 hours of active development time over seven weeks. For comparison, I built Octobatch in just over half the time I spent last year playing <a href="https://www.blueprincegame.com/" target="_blank" rel="noreferrer noopener"><em>Blue Prince</em></a>.</p>



<p>But this series isn&#8217;t just about Octobatch. I integrated AI tools at every level: Claude and Gemini collaborating on architecture, Claude Code writing the implementation, LLMs generating the pipelines that run on the system they helped build. This series is about what I learned from that process: the patterns that worked, the failures that taught me the most, and the orchestration mindset that ties it all together. Each article pulls a different lesson from the experiment, from validation architecture to multi-LLM coordination to the values that kept the project on track.</p>



<h2 class="wp-block-heading"><strong>Agentic engineering and AI-driven development</strong></h2>



<p>When most people talk about using AI to write code, they mean one of two things: AI coding assistants like GitHub Copilot, Cursor, or Windsurf, which have evolved well beyond autocomplete into agentic tools that can run multifile editing sessions and define custom agents; or &#8220;vibe coding,&#8221; where you describe what you want in natural language and accept whatever comes back. These coding assistants are genuinely impressive, and vibe coding can be really productive.</p>



<p>Using these tools effectively on a real project, however, while maintaining architectural coherence across thousands of lines of AI-generated code, is a different problem entirely. AIDD aims to help solve that problem. It&#8217;s a structured approach to agentic engineering where AI tools drive substantial portions of the implementation, architecture, and even project management, while you, the human in the loop, decide what gets built and whether it&#8217;s any good. By &#8220;structure,&#8221; I mean a set of practices developers can learn and follow, a way to know whether the AI&#8217;s output is actually good, and a way to stay on track across the life of a project. If agentic engineering is the discipline, AIDD is one way to practice it.</p>



<p>In AI-driven development, developers don&#8217;t just accept suggestions or hope the output is correct. They assign specific roles to specific tools: one LLM for architecture planning, another for code execution, a coding agent for implementation, and the human for vision, verification, and the decisions that require understanding the whole system.</p>



<p>And the &#8220;driven&#8221; part is literal. The AI is writing almost all of the code. One of my ground rules for the Octobatch experiment was that I would let AI write all of it. I have high code quality standards, and part of the experiment was seeing whether AIDD could produce a system that meets them. The human decides what gets built, evaluates whether it&#8217;s right, and maintains the constraints that keep the system coherent.</p>



<p>Not everyone agrees on how much the developer needs to stay in the loop, and the fully autonomous end of the spectrum is already producing cautionary tales. Nicholas Carlini at Anthropic recently tasked 16 Claude instances to <a href="https://www.anthropic.com/engineering/building-c-compiler" target="_blank" rel="noreferrer noopener">build a C compiler in parallel with no human in the loop</a>. After 2,000 sessions and $20,000 in API costs, the agents produced a 100,000-line compiler that can build a Linux kernel but isn&#8217;t a drop-in replacement for anything, and when all 16 agents got stuck on the same bug, Carlini had to step back in and partition the work himself. Even strong advocates of a completely hands-off, vibe-driven approach to agentic engineering might call that a step too far. The question is how much human judgment you need to make that code trustworthy, and what specific practices help you apply that judgment effectively.</p>



<h2 class="wp-block-heading"><strong>The orchestration mindset</strong></h2>



<p>If you want to get developers thinking about agentic engineering in the right way, you have to start with how they think about working with AI, not just what tools they use. That&#8217;s where I started when I began building a structured approach, and it&#8217;s why I started with <strong>habits</strong>. I developed a framework for these called the Sens-AI Framework, published as both an <a href="https://learning.oreilly.com/library/view/critical-thinking-habits/0642572243326/" target="_blank" rel="noreferrer noopener">O&#8217;Reilly report</a> (<em>Critical Thinking Habits for Coding with AI</em>) and a <a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">Radar series</a>. It&#8217;s built around five practices: providing context, doing research before prompting, framing problems precisely, iterating deliberately on outputs, and applying critical thinking to everything the AI produces. I started there because habits are how you lock in the way you think about how you&#8217;re working. Without them, AI-driven development produces plausible-looking code that falls apart under scrutiny. With them, it produces systems that a single developer couldn&#8217;t build alone in the same time frame.</p>



<p>Habits are the foundation, but they&#8217;re not the whole picture. AIDD also has <strong>practices</strong> (concrete techniques like multi-LLM coordination, context file management, and using one model to validate another&#8217;s output) and <strong>values</strong> (the principles behind those practices). If you&#8217;ve worked with Agile methodologies like Scrum or XP, that structure should be pretty familiar: Practices tell you how to work day-to-day, and habits are the reflexes you develop so that the practices become automatic.</p>



<p>Values often seem weirdly theoretical, but they’re an important piece of the puzzle because they guide your decisions when the practices don&#8217;t give you a clear answer. There&#8217;s an emerging culture around agentic engineering right now, and the values you bring to your project either match or clash with that culture. Understanding where the values come from is what makes the practices stick. All of that leads to a whole new mindset, what I&#8217;m calling <strong>the orchestration mindset</strong>. This series builds all four layers, using Octobatch as the proving ground.</p>



<p>Octobatch was a deliberate experiment in AIDD. I designed the project as a test case for the entire approach, to see what a disciplined AI-driven workflow could produce and where it would break down, and I used it to apply and improve the practices and values to make them effective and easy to adopt. And whether by instinct or coincidence, I picked the perfect project for this experiment. Octobatch is a <em>batch orchestrator</em>. It coordinates asynchronous jobs, manages state across failures, tracks dependencies between pipeline steps, and makes sure validated results come out the other end. That kind of system is fun to design but a lot of the details, like state machines, retry logic, crash recovery, and cost accounting, can be tedious to implement. It&#8217;s exactly the kind of work where AIDD should shine, because the patterns are well understood but the implementation is repetitive and error-prone.</p>



<p><strong>Orchestration</strong>—the work of coordinating multiple independent processes toward a coherent outcome—evolved into a core idea behind AIDD. I found myself orchestrating LLMs the same way Octobatch orchestrates batch jobs: assigning roles, managing handoffs, validating outputs, recovering from failures. The system I was building and the process I was using to build it followed the same pattern. I didn&#8217;t anticipate it when I started, but building a system that orchestrates AI turns out to be a pretty good way to learn how to orchestrate AI. That&#8217;s the accidental part of the accidental orchestrator. That parallel runs through every article in this series.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading"><strong>The path to batch</strong></h2>



<p>I didn&#8217;t begin the Octobatch project by starting with a full end-to-end Monte Carlo simulation. I started where most people start: typing prompts into a chat interface. I was experimenting with different simulation and generation ideas to give the project some structure, and a few of them stuck. A blackjack strategy comparison turned out to be a great test case for a multistep Monte Carlo simulation. NPC dialogue generation for a role-playing game gave me a creative workload with subjective quality to measure. Both had the same shape: a set of structured inputs, each processed the same way. So I had Claude write a simple script to automate what I&#8217;d been doing by hand, and I used Gemini to double-check the work, make sure Claude really understood my ask, and fix hallucinations. It worked fine at small scale, but once I started running more than a hundred or so units, I kept hitting rate limits, the caps that providers put on how many API requests you can make per minute.</p>



<p>That&#8217;s what pushed me to <strong>LLM batch APIs</strong>. Instead of sending individual prompts one at a time and waiting for each response, the major LLM providers all offer batch APIs that let you submit a file containing all of your requests at once. The provider processes them on their own schedule; you wait for results instead of getting them immediately, but you don&#8217;t have to worry about rate caps. I was happy to discover they also cost 50% less, and that&#8217;s when I started tracking token usage and costs in earnest. But the real surprise was that<em> batch APIs performed better than real-time APIs at scale</em>. Once pipelines got past the 100- or 200-unit mark, batch started running significantly faster than real time. The provider processes the whole batch in parallel on their infrastructure, so you&#8217;re not bottlenecked by round-trip latency or rate caps anymore.</p>
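


<p>For a sense of what the pattern looks like in code, here is a sketch using OpenAI&#8217;s Batch API as one concrete example; Anthropic and Google offer equivalents with different calls. The model name, file path, and prompts are placeholders, and the parameter names reflect the API as documented at the time of writing.</p>



<pre class="wp-block-code"><code># Sketch of the batch pattern using OpenAI's Batch API (Anthropic and Google
# have equivalents). All requests go into one JSONL file, submitted together.
import json
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

# 1. Write every request to a single JSONL file, one request per line.
with open("requests.jsonl", "w") as f:
    for i, prompt in enumerate(["Deal a blackjack hand...", "Generate NPC dialogue..."]):
        f.write(json.dumps({
            "custom_id": f"unit-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and submit the batch; it runs on the provider's schedule.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")

# 3. Poll later; when complete, download one JSONL of responses keyed by custom_id.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results_jsonl = client.files.content(batch.output_file_id).text
</code></pre>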



<p>The switch to batch APIs changed how I thought about the whole problem of coordinating LLM API calls at scale, and led to the idea of configurable pipelines. I could chain stages together: The output of one step could become the input to the next, and I could kick off the whole pipeline and come back to finished results. It turns out I wasn&#8217;t the only one making the shift to batch APIs. Between April 2024 and July 2025, OpenAI, Anthropic, and Google all launched batch APIs, converging on the same pricing model: 50% of the real-time rate in exchange for asynchronous processing.</p>



<p>You probably didn&#8217;t notice that all three major AI providers released batch APIs. The industry conversation was dominated by agents, tool use, MCP, and real-time reasoning. Batch APIs shipped with relatively little fanfare, but they represent a genuine shift in how we can use LLMs. Instead of treating them as conversational partners or one-shot SaaS APIs, we can treat them as processing infrastructure, closer to a MapReduce job than a chatbot. You give them structured data and a prompt template, and they process all of it and hand back the results. What matters is that you can now run tens of thousands of these transformations reliably, at scale, without managing rate limits or connection failures.</p>



<h2 class="wp-block-heading"><strong>Why orchestration?</strong></h2>



<p>If batch APIs are so useful, why can&#8217;t you just write a for-loop that submits requests and collects results? You can, and for simple cases a quick script with a for-loop works fine. But once you start running larger workloads, the problems start to pile up. Solving those problems turned out to be one of the most important lessons for developing a structured approach to agentic engineering.</p>



<p>First, batch jobs are asynchronous. You submit a job, and results come back hours later, so your script needs to track what was submitted and poll for completion. If your script crashes in the middle, you lose that state. Second, batch jobs can partially fail. Maybe 97% of your requests succeeded and 3% didn&#8217;t. Your code needs to figure out which 3% failed, extract them, and resubmit just those items. Third, if you&#8217;re building a multistage pipeline where the output of one step feeds into the next, you need to track dependencies between stages. And fourth, you need cost accounting. When you&#8217;re running tens of thousands of requests, you want to know how much you spent, and ideally, how much you&#8217;re going to spend when you first start the batch. Every one of these has a direct parallel to what you&#8217;re doing in agentic engineering: keeping track of the work multiple AI agents are doing at once, dealing with code failures and bugs, making sure the entire project stays coherent when AI coding tools are only looking at the one part currently in context, and stepping back to look at the wider project management picture.</p>



<p>All of these problems are solvable, but they&#8217;re not problems you want to solve over and over (in both situations—when you&#8217;re orchestrating LLM batch jobs or orchestrating AI coding tools). Solving these problems in the code gave some interesting lessons about the overall approach to agentic engineering. Batch processing moves the complexity from connection management to state management. Real-time APIs are hard because of rate limits and retries. Batch APIs are hard because you have to track what&#8217;s in flight, what succeeded, what failed, and what&#8217;s next.</p>



<p>Before I started development, I went looking for existing tools that handled this combination of problems, because I didn’t want to waste my time reinventing the wheel. I didn’t find anything that did the job I needed. Workflow orchestrators like Apache Airflow and Dagster manage DAGs and task dependencies, but they assume tasks are deterministic and don&#8217;t provide LLM-specific features like prompt template rendering, schema-based output validation, or retry logic triggered by semantic quality checks. LLM frameworks like LangChain and LlamaIndex are designed around real-time inference chains and agent loops—they don&#8217;t manage asynchronous batch job lifecycles, persist state across process crashes, or handle partial failure recovery at the chunk level. And the batch API client libraries from the providers themselves handle submission and retrieval for a single batch, but not multistage pipelines, cross-step validation, or provider-agnostic execution.</p>



<p>Nothing I found covered the full lifecycle of multiphase LLM batch workflows, from submission and polling through validation, retry, cost tracking, and crash recovery, across all three major AI providers. That&#8217;s what I built.</p>



<h2 class="wp-block-heading"><strong>Lessons from the experiment</strong></h2>



<p>The goal of this article, as the first one in my series on agentic engineering and AI-driven development, is to lay out the hypothesis and structure of the Octobatch experiment. The rest of the series goes deep on the lessons I learned from it: the validation architecture, multi-LLM coordination, the practices and values that emerged from the work, and the orchestration mindset that ties it all together. A few early lessons stand out, because they illustrate what AIDD looks like in practice and why developer experience matters more than ever.</p>



<ul class="wp-block-list">
<li><strong>You have to run things and check the data.</strong> Remember the drunken sailor, the “Hello, world” of Monte Carlo simulations? At one point I noticed that when I ran the simulation through Octobatch, 77.5% of the sailors fell in the water. The results for a random walk should be 50/50, so clearly something was badly wrong. It turned out the random number generator was being re-seeded at every iteration with sequential seed values, which created correlation bias between runs. I didn’t identify the problem immediately; I ran a bunch of tests using Claude Code as a test runner to generate each test, run it, and log the results; Gemini looked at the results and found the root cause. Claude had trouble coming up with a fix that worked well, and proposed a workaround with a large list of preseeded random number values in the pipeline. After reviewing my conversations with Claude, Gemini proposed a hash-based fix, but it seemed overly complex. Once I understood the problem and rejected their proposed solutions, I decided the best fix was simpler than either AI&#8217;s suggestion: a persistent RNG per simulation unit that advanced naturally through its sequence. I needed to understand both the statistics and the code to evaluate those three options. Plausible-looking output and correct output aren&#8217;t the same thing, and you need enough expertise to tell the difference. (We’ll talk more about this situation in the next article in the series.)</li>



<li><strong>LLMs often overestimate complexity.</strong> At one point I wanted to add support for custom mathematical expressions in the analysis pipeline. Both Claude and Gemini pushed back, telling me, &#8220;This is scope creep for v1.0&#8221; and &#8220;Save it for v1.1.&#8221; Claude estimated three hours to implement. Because I knew the codebase, I knew we were already using asteval, a Python library that provides a safe, minimalistic evaluator for mathematical expressions and simple Python statements, so this seemed like a straightforward extension of a library already in place. Both LLMs thought the solution would be far more complex and time-consuming than it actually was; it took just two prompts to Claude Code (generated by Claude), and about five minutes total to implement. The feature shipped and made the tool significantly more powerful. The AIs were being conservative because they didn&#8217;t have my context about the system&#8217;s architecture. Experience told me the integration would be trivial. Without that experience, I would have listened to them and deferred a feature that took five minutes.</li>



<li><strong>AI is often biased toward adding code, not deleting it.</strong> Generative AI is, unsurprisingly, biased toward generation. So when I asked the LLMs to fix problems, their first response was often to add more code, adding another layer or another special case. I can&#8217;t think of a single time in the whole project when one of the AIs stepped back and said, &#8220;Tear this out and rethink the approach.&#8221; The most productive sessions were the ones where I overrode that instinct and pushed for simplicity. This is something experienced developers learn over a career: The most successful changes often delete more than they add—the PRs we brag about are the ones that delete thousands of lines of code.</li>



<li><strong>The architecture emerged from failure.</strong> The AI tools and I didn&#8217;t design Octobatch&#8217;s core architecture up front. Our first attempt was a Python script with in-memory state and a lot of hope. It worked for small batches but fell apart at scale: A network hiccup meant restarting from scratch, a malformed response required manual triage. A lot of things fell into place after I added the constraint that the system must survive being killed at any moment. That single requirement led to the tick model (wake up, check state, do work, persist, exit; sketched just after this list), the manifest file as source of truth, and the entire crash-recovery architecture. We discovered the design by repeatedly failing to do something simpler.</li>



<li><strong>Your development history is a dataset.</strong> I just told you several stories from the Octobatch project, and this series will be full of them. Every one of those stories came from going back through the chat logs between me, Claude, and Gemini. With AIDD, you have a complete transcript of every architectural decision, every wrong turn, every moment where you overruled the AI and every moment where it corrected you. Very few development teams have ever had that level of fidelity in their project history. Mining those logs for lessons learned turns out to be one of the most valuable practices I&#8217;ve found.</li>
</ul>
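


<p>To make the tick model from that last item concrete, here is a deliberately tiny sketch. It is illustrative only; Octobatch&#8217;s real manifest and state machine are far more involved, and the file name and fields here are made up.</p>



<pre class="wp-block-code"><code># Tiny sketch of a tick loop with a manifest as source of truth (illustrative;
# not Octobatch code). Each tick loads state, does one bounded unit of work,
# persists, and exits, so the process can be killed at any moment and resume.
import json
from pathlib import Path

MANIFEST = Path("manifest.json")    # hypothetical file name

def load_state() -> dict:
    if MANIFEST.exists():
        return json.loads(MANIFEST.read_text())
    return {"pending": ["unit-1", "unit-2", "unit-3"], "done": []}

def do_work(unit: str) -> str:
    return f"processed {unit}"      # stand-in for submitting or collecting a pipeline step

def tick() -> None:
    state = load_state()                              # wake up, check state
    if state["pending"]:
        unit = state["pending"].pop(0)
        state["done"].append({"unit": unit, "result": do_work(unit)})   # do work
    MANIFEST.write_text(json.dumps(state, indent=2))  # persist, then exit

tick()   # run once; a scheduler, a loop, or a human invokes the next tick
</code></pre>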



<p>Near the end of the project, I switched to Cursor to make sure none of this was specific to Claude Code. I created fresh conversations using the same context files I&#8217;d been maintaining throughout development, and was able to bootstrap productive sessions immediately; the context files worked exactly as designed. The practices I&#8217;d developed transferred cleanly to a different tool. The value of this approach comes from the habits, the context management, and the engineering judgment you bring to the conversation, not from any particular vendor.</p>



<p>These tools are moving the world in a direction that favors developers who understand the ways engineering can go wrong and know solid design and architecture patterns…and who are okay letting go of control of every line of code.</p>



<h2 class="wp-block-heading"><strong>What&#8217;s next</strong></h2>



<p>Agentic engineering needs structure, and structure needs a concrete example to make it real. The next article in this series goes into Octobatch itself, because the way it orchestrates AI is a remarkably close parallel to what AIDD asks developers to do. Octobatch assigns roles to different processing steps, manages handoffs between them, validates their outputs, and recovers when they fail. That&#8217;s the same pattern I followed when building it: assigning roles to Claude and Gemini, managing handoffs between them, validating their outputs, and recovering when they went down the wrong path. Understanding how the system works turns out to be a good way to understand how to orchestrate AI-driven development. I&#8217;ll walk through the architecture, show what a real pipeline looks like from prompt to results, present the data from a 300-hand blackjack Monte Carlo simulation that puts all of these ideas to the test, and use all of that to demonstrate ideas we can apply directly to agentic engineering and AI-driven development.</p>



<p>Later articles go deeper into the practices and ideas I learned from this experiment that make AI-driven development work: how I coordinated multiple AI models without losing control of the architecture, what happened when I tested the code against what I actually intended to build, and what I learned about the gap between code that runs and code that does what you meant. Along the way, the experiment produced some findings about how different AI models see code that I didn&#8217;t expect—and that turned out to matter more than I thought they would.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-accidental-orchestrator/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The &#8220;Data Center Rebellion&#8221; Is Here</title>
		<link>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/</link>
				<comments>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/#respond</comments>
				<pubDate>Wed, 04 Mar 2026 12:11:09 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18188</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Data-center-rebellion.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Data-center-rebellion-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Beyond the chips: The local politics of AI infrastructure]]></custom:subtitle>
		
				<description><![CDATA[This post first appeared on Ben Lorica’s Gradient FlowSubstack newsletter and is being republished here with the author’s permission. Even the most ardent cheerleaders for artificial intelligence now quietly concede we are navigating a massive AI bubble. The numbers are stark: Hyperscalers are deploying roughly $400 billion annually into data centers and specialized chips while [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><em>This post first appeared on Ben Lorica’s </em><a href="https://gradientflow.substack.com/p/the-data-center-rebellion-is-here" target="_blank" rel="noreferrer noopener">Gradient Flow<em>Substack newsletter</em></a><em> </em><em>and is being republished here with the author’s permission.</em></td></tr></tbody></table></figure>



<p>Even the most ardent cheerleaders for artificial intelligence now quietly concede we are navigating a massive AI bubble. The numbers are stark: Hyperscalers are deploying roughly $400 billion annually into data centers and specialized chips while AI-related revenue hovers around $20 billion—a 20-to-1 capital-to-revenue ratio that stands out even in infrastructure cycles historically characterized by front-loaded spending. To justify this deployment on conventional investment metrics, the industry would need a step change in monetization over a short window to make the numbers work.</p>



<p>While venture capitalists and tech executives debate the “mismatch” between compute and monetization, a more tangible crisis is unfolding far from Silicon Valley. A growing grassroots opposition to AI data centers remains largely below the radar here in San Francisco. I travel to Sioux Falls, South Dakota, a few times a year to visit my in-laws. It’s not a region known for being antibusiness. Yet even there, a “data center rebellion” has been brewing. Even though the recent attempt to overturn a rezoning ordinance <a href="https://www.keloland.com/news/local-news/data-center-re-zone-petition-fails-in-sioux-falls/" target="_blank" rel="noreferrer noopener">did not succeed</a>, the level of community pushback in the heart of the Midwest signals that these projects no longer enjoy a guaranteed green light.</p>



<p>This resistance is not merely reflexive NIMBYism. It represents a sophisticated multifront challenge to the physical infrastructure AI requires. For leadership teams planning for the future, this means &#8220;compute availability&#8221; is no longer just a procurement question. It is now tied to local politics, grid stability, water management, and city approval processes. In the course of trying to understand the growing opposition to AI data centers, I’ve been examining the specific drivers behind this opposition and why the assumption of limitless infrastructure growth is colliding with hard constraints.</p>



<h2 class="wp-block-heading">The grid capacity crunch and the ratepayer revolt</h2>



<p>AI data centers function as grid-scale industrial loads. Individual projects now request 100+ megawatts, and some proposals reach into the gigawatt range. One proposed Michigan facility, for example, would consume 1.4 gigawatts, nearly exhausting the region’s remaining 1.5 gigawatts of headroom and roughly matching the electricity needs of about a million homes. This happens because AI hardware is incredibly dense and uses a massive amount of electricity. It also runs constantly. Since AI work doesn&#8217;t have &#8220;off&#8221; hours, power companies can&#8217;t rely on the usual quiet periods they use to balance the rest of the grid.</p>



<p>The politics come down to who pays the bill. Residents in many areas have seen their home utility rates jump by 25% or 30% after big data centers moved in, even though they were promised rates wouldn&#8217;t change. People are afraid they will end up paying for the power company&#8217;s new equipment. This happens when a utility builds massive substations just for one company, but the cost ends up being shared by everyone. When you add in state and local tax breaks, it gets even worse. Communities deal with all the downsides of the project, while the financial benefits are eaten away by tax breaks and credits.</p>



<p>The result is a rare bipartisan alignment around a simple demand: Hyperscalers should pay their full cost of service. Notably, Microsoft has moved in that direction publicly, committing to cover grid-upgrade costs and pursue rate structures intended to insulate residential customers—an implicit admission that the old incentive playbook has become a political liability (and, in some places, an electoral one).</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="172" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.jpeg" alt="AI scale-up to deployable compute" class="wp-image-18189" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-300x35.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-768x91.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h2 class="wp-block-heading">Water wars and the constant hum</h2>



<p>High-density AI compute generates immense heat, requiring cooling systems that can consume millions of gallons of water daily. In desert municipalities like Chandler and Tucson, Arizona, this creates direct competition with agricultural irrigation and residential drinking supplies. Proposed facilities may withdraw hundreds of millions of gallons annually from stressed aquifers or municipal systems, raising fears that industrial users will deplete wells serving farms and homes. Data center developers frequently respond with technical solutions like dry cooling and closed-loop designs. However, communities have learned the trade-off: Dry cooling shifts the burden to electricity, and closed-loop systems still lose water to the atmosphere and require constant refills. The practical outcome is that cooling architecture is now a first-order constraint. In Tucson, a project known locally as “Project Blue” faced enough pushback over water rights that the developer had to revisit the cooling approach midstream.</p>



<p>Beyond resource consumption, these facilities create a significant noise problem. Industrial-scale cooling fans and backup diesel generators create a “constant hum” that represents daily intrusion into previously quiet neighborhoods. In Florida, residents near a proposed facility serving 2,500 families and an elementary school cite sleep disruption and health risks as primary objections, elevating the issue from nuisance to harm. The noise also hits farms hard. In Wisconsin, residents reported that the low-frequency hum makes livestock, particularly horses, nervous and skittish. This disrupts farm life in a way that standard commercial development just doesn&#8217;t. This is why municipalities are tightening requirements: acoustic modeling, enforceable decibel limits at property lines, substantial setbacks (sometimes on the order of 200 feet), and <a href="https://en.wikipedia.org/wiki/Berm" target="_blank" rel="noreferrer noopener">berms</a> that are no longer “nice-to-have” concessions but baseline conditions for approval.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="654" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.jpeg" alt="The $3 trillion question" class="wp-image-18190" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-300x135.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-768x345.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption">(<a href="https://gradientflow.com/wp-content/uploads/2026/02/AI-Data-Center-Bubble.jpeg" target="_blank" rel="noreferrer noopener"><strong>enlarge</strong></a>)</figcaption></figure>



<h2 class="wp-block-heading">The jobs myth meets the balance sheet</h2>



<p>Communities are questioning whether the small number of jobs created is worth the local impact. Developers highlight billion-dollar capital investments and construction employment spikes, but residents focus on steady-state reality: AI data centers employ far fewer permanent workers per square foot than manufacturing facilities of comparable scale. Chandler, Arizona, officials noted that existing facilities employ fewer than 100 people despite massive physical footprints. Wisconsin residents contrast promised “innovation campuses” with operational facilities requiring only dozens to low hundreds of permanent staff—mostly specialized technicians—making the “job creation” pitch ring hollow. When a data center replaces farmland or light manufacturing, communities weigh not just direct employment but opportunity cost: lost agricultural jobs, foregone retail development, and mixed-use projects that might generate broader economic activity.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Opposition scales faster than infrastructure: One local win becomes a national template for blocking the next project.</p>
</blockquote>



<p>The secretive way these deals are made is often what fuels the most anger. A recurring pattern is what some call the “sleeping giant” dynamic: Residents learn late that officials and developers have been negotiating for months, often under NDAs, sometimes through shell entities and codenames. In Wisconsin, Microsoft’s “Project Nova” became a symbol of this approach; in Minnesota’s Hermantown, a year of undisclosed discussions triggered similar backlash. In Florida, opponents were furious when a major project was tucked into a <a href="https://www.boardeffect.com/blog/what-is-a-consent-agenda-for-a-board-meeting/" target="_blank" rel="noreferrer noopener">consent agenda</a>. Since these agendas are meant for routine business, it felt like a deliberate attempt to bypass public debate. Trust vanishes when people believe advisors have a conflict of interest, like a consultant who seems to be helping both the municipality and the developer. After that happens, technical claims are treated as nothing more than a sales pitch. You won&#8217;t get people back on board until you provide neutral analysis and commitments that can actually be enforced.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1020" height="695" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.jpeg" alt="Data center in the community" class="wp-image-18191" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.jpeg 1020w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-300x204.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-768x523.jpeg 768w" sizes="auto, (max-width: 1020px) 100vw, 1020px" /></figure>



<h2 class="wp-block-heading">From zoning fight to national constraint</h2>



<p>What started as isolated neighborhood friction has professionalized into a coordinated national movement. Opposition <strong>groups now share legal playbooks and technical templates across state lines</strong>, allowing residents in “frontier” states like South Dakota or Michigan to mobilize with the sophistication of seasoned activists. The financial stakes are real: Between April and June 2025 alone, approximately $98 billion in proposed projects were blocked or delayed, according to <a href="https://www.datacenterwatch.org/?utm_source=gradientflow&amp;utm_medium=newsletter" target="_blank" rel="noreferrer noopener">Data Center Watch</a>. This is no longer just a zoning headache. It’s a political landmine. In Arizona and Georgia, bipartisan coalitions have already ousted officials over data center approvals, signaling to local boards that greenlighting a hyperscale facility without deep community buy-in can be a career-ending move.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The US has the chips, but China has centralized command over power and infrastructure.</p>
</blockquote>



<p>The opposition is also finding an unlikely ally in the energy markets. While the industry narrative is one of &#8220;limitless demand,&#8221; the actual market prices for long-term power and natural gas aren&#8217;t spiking; they&#8217;re staying remarkably flat. There is a massive <a href="https://www.youtube.com/watch?v=i__iaPepixk" target="_blank" rel="noreferrer noopener">disconnect between the hype and the math</a>. Utilities are currently racing to build nearly double the <a href="https://www.youtube.com/watch?v=i__iaPepixk" target="_blank" rel="noreferrer noopener">capacity that even the most optimistic analysts</a> project for 2030. This suggests we may be overbuilding &#8220;ghost infrastructure.&#8221; We are asking local communities to sacrifice their land and grid stability for a gold rush that the markets themselves don&#8217;t fully believe in.</p>



<p>This “data center rebellion” creates a strategic bottleneck that no amount of venture capital can easily bypass. While the US maintains a clear lead in high-end chips, we are hitting a wall on how we manage the mundane essentials like electricity and water. In the geopolitical race, the US has the chips, but China has the centralized command over infrastructure. Our democratic model requires transparency and public buy-in to function. If US companies keep relying on secret deals to push through expensive, overbuilt infrastructure, they risk a total collapse of community trust.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Radar Trends to Watch: March 2026</title>
		<link>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/</link>
				<comments>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/#respond</comments>
				<pubDate>Tue, 03 Mar 2026 12:07:40 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[Radar Trends]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18173</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-10.png" 
				medium="image" 
				type="image/png" 
				width="1400" 
				height="950" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-10-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Developments in operations, things, web, and more]]></custom:subtitle>
		
				<description><![CDATA[The explosion of interest in OpenClaw was one of the last items added to the February 1 trends. In February, things went crazy. We saw a social network for agents (no humans allowed, though they undoubtedly sneak on); a multiplayer online game for agents (again, no humans); many clones of OpenClaw, most of which attempt [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>The explosion of interest in OpenClaw was one of the last items added to the February 1 trends. In February, things went crazy. We saw a social network for agents (no humans allowed, though they undoubtedly sneak on); a multiplayer online game for agents (again, no humans); many clones of OpenClaw, most of which attempt to mitigate its many security problems; and much more. Andrej Karpathy has said that OpenClaw is the next layer on top of AI agents. If the security issues can be resolved (and whether they can is a good question), he’s probably right.</p>



<h2 class="wp-block-heading">AI</h2>



<ul class="wp-block-list">
<li><a href="https://note-taker.moonshine.ai/" target="_blank" rel="noreferrer noopener">Moonshine Note Taker</a> is a free and open source voice transcription application for taking notes. It runs locally: The model runs on your hardware and no data is ever sent to a server.</li>



<li>Nano Banana’s image generation was breathtakingly good. Google has now <a href="https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/" target="_blank" rel="noreferrer noopener">released</a> Nano Banana 2, a.k.a. Gemini 3.1 Flash Image, which promises Nano Banana image quality at speed.</li>



<li>Claude <a href="https://code.claude.com/docs/en/remote-control" target="_blank" rel="noreferrer noopener">Remote Control</a> allows you to continue a desktop Claude Code session from any device.</li>



<li>Putting OpenClaw into a sandbox <a href="https://tachyon.so/blog/sandboxes-wont-save-you" target="_blank" rel="noreferrer noopener">isn’t enough</a>. Keeping AI Agents from accidentally (or intentionally) doing damage is a permissions problem.</li>



<li>Alibaba has <a href="https://huggingface.co/collections/Qwen/qwen35" target="_blank" rel="noreferrer noopener">released</a> a fleet of mid-size Qwen 3.5 models. Their theme is providing more intelligence with fewer computing cycles—something we all need to appreciate.&nbsp;</li>



<li>Important advice for agentic engineering: <a href="https://simonwillison.net/guides/agentic-engineering-patterns/first-run-the-tests/" target="_blank" rel="noreferrer noopener">Always start by running the tests</a>.</li>



<li>Google has <a href="https://blog.google/innovation-and-ai/products/gemini-app/lyria-3/" target="_blank" rel="noreferrer noopener">released</a> Lyria 3, a model that generates 30-second musical clips from a verbal description. You can experiment with it through Gemini.</li>



<li>There’s a new protocol in the agentic stack. Twilio has <a href="https://thenewstack.io/twilio-a2h-protocol-launch/" target="_blank" rel="noreferrer noopener">released</a> the <a href="https://www.twilio.com/en-us/blog/products/introducing-a2h-agent-to-human-communication-protocol" target="_blank" rel="noreferrer noopener">Agent-2-Human</a> (A2H) protocol, which facilitates handoffs between agents and humans as they collaborate.</li>



<li>Yet more and more model releases: <a href="https://www.anthropic.com/news/claude-sonnet-4-6" target="_blank" rel="noreferrer noopener">Claude Sonnet 4.6</a>, followed quickly by <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/" target="_blank" rel="noreferrer noopener">Gemini 3.1 Pro</a>. If you care, Gemini 3.1 Pro currently tops the abstract reasoning benchmarks.</li>



<li><a href="https://www.kimi.com/bot" target="_blank" rel="noreferrer noopener">Kimi Claw</a> is yet another variation on OpenClaw. Kimi Claw uses Moonshot AI’s most advanced model, Kimi K2.5 Thinking model, and offers one-click setup in Moonshot’s cloud.</li>



<li><a href="https://nanoclaw.dev/" target="_blank" rel="noreferrer noopener">NanoClaw</a> is another OpenClaw-like AI-based personal assistant that claims to be more security conscious. It runs agents in sandboxed Linux containers with limited access to outside resources, limiting abuse. </li>



<li>OpenAI has <a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/" target="_blank" rel="noreferrer noopener">released</a> a research preview of GPT-5.3-Codex-Spark, an extremely fast coding model that runs on <a href="https://www.cerebras.ai/chip" target="_blank" rel="noreferrer noopener">Cerebras hardware</a>. The company claims that it’s possible to collaborate with Codex in “real time” because it gives “near-instant” results.</li>



<li>RAG may not be the newest idea in the AI world, but text-based RAG is the basis for many enterprise applications of AI. Yet most enterprise data includes graphs, images, and even text in formats like PDF. Is this the year for <a href="https://thenewstack.io/multimodal-rag-hybrid-search/" target="_blank" rel="noreferrer noopener">multimodal RAG</a>?</li>



<li><a href="http://z.ai" target="_blank" rel="noreferrer noopener">Z.ai</a> has released its latest model, <a href="https://z.ai/blog/glm-5" target="_blank" rel="noreferrer noopener">GLM-5</a>. GLM-5 is an open source “Opus-class” model. It’s significantly smaller than Opus and other high-end models, though still huge; the mixture-of-experts model has 744B parameters, with 40B active.</li>



<li>Waymo has created a <a href="https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation" target="_blank" rel="noreferrer noopener">World Model</a> to model driving behavior. It’s capable of building lifelike simulations of traffic patterns and behavior, based on video collected from Waymo’s vehicles.</li>



<li><a href="https://alexzhang13.github.io/blog/2025/rlm/" target="_blank" rel="noreferrer noopener">Recursive language models</a> (RLMs) solve the problem of context rot, which happens when output from AI degrades as the size of the context increases. Drew Breunig has an excellent <a href="https://www.dbreunig.com/2026/02/09/the-potential-of-rlms.html" target="_blank" rel="noreferrer noopener">explanation</a>.</li>



<li>You’ve heard of Moltbook—and perhaps your AI agent participates. Now there’s <a href="https://arstechnica.com/ai/2026/02/after-moltbook-ai-agents-can-now-hang-out-in-their-own-space-faring-mmo/" target="_blank" rel="noreferrer noopener">SpaceMolt</a>—a massive multiplayer online game that’s exclusively for agents.&nbsp;</li>



<li>Anthropic and OpenAI simultaneously released <a href="https://www.anthropic.com/news/claude-opus-4-6" target="_blank" rel="noreferrer noopener">Claude Opus 4.6</a> and <a href="https://openai.com/index/introducing-gpt-5-3-codex/" target="_blank" rel="noreferrer noopener">GPT-5.3-Codex</a>, both of which offer improved models for AI-assisted programming. Is this “open warfare,” as <a href="https://news.smol.ai/issues/26-02-05-claude-opus-openai-codex/" target="_blank" rel="noreferrer noopener"><em>AINews</em></a> claims? You mean it hasn’t been open warfare prior to now?</li>



<li>If you’re excited by OpenClaw, you might try <a href="https://github.com/HKUDS/nanobot" target="_blank" rel="noreferrer noopener">NanoBot</a>. It has 1% of OpenClaw’s code, written so that it’s easy to understand and maintain. No promises about security—with all of these personal AI assistants, be careful!</li>



<li>OpenAI has <a href="https://arstechnica.com/ai/2026/02/openai-picks-up-pace-against-claude-code-with-new-codex-desktop-app/" target="_blank" rel="noreferrer noopener">launched</a> a desktop app for macOS along the lines of Claude Code. It’s something that’s been missing from their lineup. Among other things, it’s intended to help programmers work with multiple agents simultaneously.</li>



<li>Pete Warden has put together an interactive guide to speech embeddings for engineers, and published it as a Colab <a href="https://colab.research.google.com/drive/1pUy9tp145qlWni2CIuBvQUNdokiB6rx6?usp=sharing" target="_blank" rel="noreferrer noopener">notebook</a>.</li>



<li><a href="https://tailscale.com/blog/aperture-private-alpha" target="_blank" rel="noreferrer noopener">Aperture</a> is a new tool from Tailscale for “providing visibility into coding agent usage,” allowing organizations to understand how AI is being used and adopted. It’s currently in private beta.</li>



<li>OpenAI <a href="https://openai.com/index/introducing-prism/" target="_blank" rel="noreferrer noopener">Prism</a> is a free workspace for scientists to collaborate on research. Its goal is to help scientists build a new generation of AI-based tooling. Prism is built on ChatGPT 5.2 and is open to anyone with a personal ChatGPT account.</li>
</ul>



<h2 class="wp-block-heading">Programming</h2>



<ul class="wp-block-list">
<li>Anthropic is <a href="https://claude.com/contact-sales/claude-for-oss" target="_blank" rel="noreferrer noopener">offering</a> six months of Claude Max 20x free to open source maintainers.</li>



<li><a href="https://pi.dev/" target="_blank" rel="noreferrer noopener">Pi</a> is a very simple but extensible coding agent that runs in your terminal.</li>



<li>Researchers at Anthropic have <a href="https://www.anthropic.com/engineering/building-c-compiler" target="_blank" rel="noreferrer noopener">vibe-coded a C compiler</a> using a fleet of Claude agents. The experiment cost roughly $20,000 worth of tokens, and produced 100,000 lines of Rust. They are careful to say that the compiler is far from production quality—but it works. The experiment is a <em>tour de force</em> demonstration of running agents in parallel.&nbsp;</li>



<li>I never knew that macOS had a <a href="https://igorstechnoclub.com/sandbox-exec/" target="_blank" rel="noreferrer noopener">sandboxing tool</a>. It looks useful. (It’s also deprecated, but looks much easier to use than the alternatives.)</li>



<li><a href="https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/" target="_blank" rel="noreferrer noopener">GitHub now allows</a> pull requests to be turned off completely, or to be limited to collaborators. They’re doing this to allow software maintainers to eliminate AI-generated pull requests, which are overwhelming many developers.</li>



<li>After an open source maintainer rejected a pull request generated by an AI agent, the agent published a blog post attacking the maintainer. The maintainer responded with an excellent <a href="https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/" target="_blank" rel="noreferrer noopener">analysis</a>, asking whether threats and intimidation are the future of AI.</li>



<li>As Simon Willison has <a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/" target="_blank" rel="noreferrer noopener">written</a>, the purpose of programming isn’t to write code but to deliver code that works. He’s created two tools, <a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/" target="_blank" rel="noreferrer noopener">Showboat and Rodney</a>, that help AI agents demo their software so that the human authors can verify that the software works.&nbsp;</li>



<li>Anil Dash asks whether <a href="https://www.anildash.com/2026/01/22/codeless/" target="_blank" rel="noreferrer noopener">codeless</a> <a href="https://www.anildash.com/2026/01/27/codeless-ecosystem/" target="_blank" rel="noreferrer noopener">programming</a>, using tools like <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">Gas Town</a>, is the future.</li>
</ul>



<h2 class="wp-block-heading">Security</h2>



<ul class="wp-block-list">
<li>There is now an app that <a href="https://tech.lgbt/@yjeanrenaud/116122129025921096" target="_blank" rel="noreferrer noopener">alerts</a> you when someone in the vicinity has smart glasses.</li>



<li><a href="https://www.agentsh.org/" target="_blank" rel="noreferrer noopener">Agentsh</a> provides <a href="https://www.agentsh.org/" target="_blank" rel="noreferrer noopener">execution layer security</a> by enforcing policies to prevents agents from doing damage. As far as agents are concerned, it’s a replacement for bash.</li>



<li>There’s a new kind of cyberattack: <a href="https://techxplore.com/news/2026-02-cyber-disrupt-smart-factories.html" target="_blank" rel="noreferrer noopener">attacks against time itself</a>. More specifically, this means attacks against clocks and protocols for time synchronization. These can be devastating in factory settings.</li>



<li>“<a href="https://aisle.com/blog/what-ai-security-research-looks-like-when-it-works" target="_blank" rel="noreferrer noopener">What AI Security Research Looks Like When It Works</a>” is an excellent overview of the impact of AI on discovering vulnerabilities. AI generates a lot of security slop, but it also finds critical vulnerabilities that would have been opaque to humans, including 12 in OpenSSL.</li>



<li>Gamifying prompt injection—well, that’s new. <a href="https://hackmyclaw.com/" target="_blank" rel="noreferrer noopener">HackMyClaw</a> is a game (?) in which participants send email to Flu, an OpenClaw instance. The goal is to force Flu to reply with secrets.env, a file of “confidential” data. There is a prize for the first to succeed.</li>



<li>It was only a matter of time: There’s now a cybercriminal who is <a href="https://www.bleepingcomputer.com/news/security/infostealer-malware-found-stealing-openclaw-secrets-for-first-time/" target="_blank" rel="noreferrer noopener">actively stealing secrets</a> from OpenClaw users.&nbsp;</li>



<li><a href="https://deno.com/" target="_blank" rel="noreferrer noopener">Deno’s secure sandbox</a> might provide a way to <a href="https://thenewstack.io/deno-sandbox-security-secrets/" target="_blank" rel="noreferrer noopener">run OpenClaw safely</a>.&nbsp;</li>



<li><a href="https://github.com/nearai/ironclaw" target="_blank" rel="noreferrer noopener">IronClaw</a> is a personal AI assistant modeled after <a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener">OpenClaw</a> that promises better security. It always runs in a sandbox, never exposes credentials, has some defenses against prompt injection, and only makes requests to approved hosts.</li>



<li>A fake recruiting campaign is <a href="https://www.bleepingcomputer.com/news/security/fake-job-recruiters-hide-malware-in-developer-coding-challenges/" target="_blank" rel="noreferrer noopener">hiding malware</a> in programming challenges that candidates must complete in order to apply. Completing the challenge requires installing malicious dependencies that are hosted on legitimate repositories like npm and PyPI.</li>



<li>Google’s Threat Intelligence Group has <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/" target="_blank" rel="noreferrer noopener">released</a> its quarterly analysis of adversarial AI use. Their analysis includes distillation, or collecting the output of a frontier AI to train another AI.</li>



<li>Google has <a href="https://arstechnica.com/gadgets/2026/02/upgraded-google-safety-tools-can-now-find-and-remove-more-of-your-personal-info/" target="_blank" rel="noreferrer noopener">upgraded</a> its tools for removing personal information and images, including nonconsensual explicit images, from its search results.&nbsp;</li>



<li><a href="https://www.bleepingcomputer.com/news/security/new-tool-blocks-imposter-attacks-disguised-as-safe-commands/" target="_blank" rel="noreferrer noopener">Tirith</a> is a new tool that hooks into the shell to block bad commands. This is often a problem with copy-and-paste commands that use curl to pipe an archive into bash. It’s easy for a bad actor to create a malicious URL that is indistinguishable from a legitimate URL.</li>



<li>Claude Opus 4.6 has been used to discover <a href="https://red.anthropic.com/2026/zero-days/" target="_blank" rel="noreferrer noopener">500 0-day vulnerabilities</a> in open source code. While many open source maintainers have complained about AI slop, and that abuse isn’t likely to stop, AI is also becoming a valuable tool for security work.</li>



<li><a href="https://www.koi.ai/blog/maliciouscorgi-the-cute-looking-ai-extensions-leaking-code-from-1-5-million-developers" target="_blank" rel="noreferrer noopener">Two coding assistants for VS Code</a> are malware that send copies of all the code to China. Unlike lots of malware, they do their job as coding assistants well, making it less likely that victims will notice that something is wrong.&nbsp;</li>



<li><a href="https://www.bleepingcomputer.com/news/security/hackers-hijack-exposed-llm-endpoints-in-bizarre-bazaar-operation/" target="_blank" rel="noreferrer noopener">Bizarre Bazaar</a> is the name for a wave of attacks against LLM APIs, including self-hosted LLMs. The attacks attempt to steal resources from LLM infrastructure, for purposes including cryptocurrency mining, data theft, and reselling LLM access.&nbsp;</li>



<li>The business model for ransomware has changed. <a href="https://www.bleepingcomputer.com/news/security/from-cipher-to-fear-the-psychology-behind-modern-ransomware-extortion/" target="_blank" rel="noreferrer noopener">Ransomware is no longer about encrypting your data</a>; it’s about using stolen data for extortion. Small and mid-size businesses are common targets.&nbsp;</li>
</ul>



<h2 class="wp-block-heading">Web</h2>



<ul class="wp-block-list">
<li>Cloudflare has a service called <a href="https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/" target="_blank" rel="noreferrer noopener">Markdown for Agents</a> that <a href="https://thenewstack.io/cloudflares-markdown-for-agents-automatically-make-websites-more-aifriendly/" target="_blank" rel="noreferrer noopener">converts</a> websites from HTML to Markdown when an agent accesses them. Conversion makes the pages friendlier to AI and significantly reduces the number of tokens needed to process them.</li>



<li>WebMCP is a <a href="https://webmachinelearning.github.io/webmcp/" target="_blank" rel="noreferrer noopener">proposed API standard</a> that allows web applications to become MCP servers. It’s currently available in <a href="https://developer.chrome.com/blog/webmcp-epp" target="_blank" rel="noreferrer noopener">early preview</a> in Chrome.</li>



<li>Users of <a href="https://www.firefox.com/en-US/firefox/148.0/releasenotes/" target="_blank" rel="noreferrer noopener">Firefox 148</a> (which should be out by the time you read this) will be able to <a href="https://blog.mozilla.org/en/firefox/ai-controls/" target="_blank" rel="noreferrer noopener">opt out</a> of all AI features.</li>
</ul>



<h2 class="wp-block-heading">Operations</h2>



<ul class="wp-block-list">
<li>Wireshark is a powerful—and complex—packet capture tool. <a href="https://github.com/vignesh07/babyshark" target="_blank" rel="noreferrer noopener">Babyshark</a> is a text-based frontend for Wireshark that provides an amazing amount of information through a much simpler interface.</li>



<li>Microsoft is experimenting with <a href="https://techxplore.com/news/2026-02-laser-etched-glass-years-microsoft.html" target="_blank" rel="noreferrer noopener">using lasers to etch data in glass</a> as a form of long-term data storage.</li>
</ul>



<h2 class="wp-block-heading">Things</h2>



<ul class="wp-block-list">
<li>You need a <a href="https://www.hackster.io/news/your-office-needs-a-desk-robot-fec0211f56ef" target="_blank" rel="noreferrer noopener">desk robot</a>. Why? Because it’s there. And fun.</li>



<li>Do you want to play <em>Doom</em> on a Lego brick? <a href="https://hackaday.com/2023/03/18/doom-ported-to-a-single-lego-brick/" target="_blank" rel="noreferrer noopener">You can</a>.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Why Capacity Planning Is Back</title>
		<link>https://www.oreilly.com/radar/why-capacity-planning-is-back/</link>
				<comments>https://www.oreilly.com/radar/why-capacity-planning-is-back/#respond</comments>
				<pubDate>Mon, 02 Mar 2026 13:19:19 +0000</pubDate>
					<dc:creator><![CDATA[Syed Danish Ali]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18150</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Capacity-planning.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Capacity-planning-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[In a previous article, we outlined why GPUs have become the architectural control point for enterprise AI. When accelerator capacity becomes the governing constraint, the cloud’s most comforting assumption—that you can scale on demand without thinking too far ahead—stops being true. That shift has an immediate operational consequence: Capacity planning is back. Not the old [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In a previous article, we outlined why <a href="https://www.oreilly.com/radar/gpus-enterprise-ais-new-architectural-control-point/" target="_blank" rel="noreferrer noopener">GPUs have become the architectural control point for enterprise AI</a>. When accelerator capacity becomes the governing constraint, the cloud’s most comforting assumption—that you can scale on demand without thinking too far ahead—stops being true.</p>



<p>That shift has an immediate operational consequence: Capacity planning is back. Not the old “guess next year’s VM count” exercise but a new form of planning where model choices, inference depth, and workload timing directly determine whether you can meet latency, cost, and reliability targets.</p>



<p>In an AI-shaped infrastructure world, you don’t “scale” as much as you “get capacity.” Autoscaling helps at the margins, but it can’t create GPUs. Power, cooling, and accelerator supply set the limits.</p>



<h2 class="wp-block-heading">The return of capacity planning</h2>



<p>For a decade, cloud adoption trained organizations out of multiyear planning. CPU and storage scaled smoothly, and most stateless services behaved predictably under horizontal scaling. Teams could treat infrastructure as an elastic substrate and focus on software iteration.</p>



<p>AI production systems do not behave that way. They are dominated by accelerators and constrained by physical limits, and that makes capacity a first-order design dependency rather than a procurement detail. If you cannot secure the right accelerator capacity at the right time, your architecture decisions are irrelevant—because the system simply cannot run at the required throughput and latency.</p>



<p>Planning is returning because AI forces forecasting along four dimensions that product teams cannot ignore:</p>



<ul class="wp-block-list">
<li><strong>Model growth:</strong> Model count, version churn, and specialization increase accelerator demand even when user traffic is flat.</li>



<li><strong>Data growth:</strong> Retrieval depth, vector store size, and freshness requirements increase the amount of inference work per request.</li>



<li><strong>Inference depth:</strong> Multistage pipelines (retrieve, rerank, tool calls, verification, synthesis) multiply GPU time nonlinearly.</li>



<li><strong>Peak workloads:</strong> Enterprise usage patterns and batch jobs collide with real-time inference, creating predictable contention windows.</li>
</ul>



<p>This is not merely “IT planning.” It is strategic planning, because these factors push organizations back toward multiyear thinking: Procurement lead times, reserved capacity, workload placement decisions, and platform-level policies all start to matter again.</p>



<p>This is increasingly visible operationally: <a href="https://www.theregister.com/2025/08/03/capacity_planning_concern_datacenter_ops" target="_blank" rel="noreferrer noopener"><strong>Capacity planning is a rising concern for data center operators</strong></a>, as <em>The Register</em> reports.</p>



<h2 class="wp-block-heading"><strong>The cloud’s old promise is breaking</strong></h2>



<p>Cloud computing scaled on the premise that capacity could be treated as elastic and interchangeable. Most workloads ran on general-purpose hardware, and when demand rose, the platform could absorb it by spreading load across abundant, standardized resources.</p>



<p>AI workloads violate that premise. Accelerators are scarce, not interchangeable, and tied to power and cooling constraints that do not scale linearly. In other words, the cloud stops behaving like an infinite pool—and starts behaving like an allocation system.</p>



<p>First, the critical path in production AI systems is increasingly accelerator bound. Second, “a request” is no longer a single call. It is an inference pipeline with multiple dependent stages. Third, those stages tend to be sensitive to hardware availability, scheduling contention, and performance variance that cannot be eliminated by simply adding more generic compute.</p>



<p>This is where the elasticity model starts to fail as a default expectation. In AI systems, elasticity becomes conditional. It depends on capacity access, infrastructure topology, and a willingness to pay for assurance.</p>



<h2 class="wp-block-heading"><strong>AI changes the physics of cloud infrastructure</strong></h2>



<p>In modern enterprise AI, the binding constraints are no longer abstract. They are physical.</p>



<p>Accelerators introduce a different scaling regime than CPU-centric enterprise computing. Provisioning is not always immediate. Supply is not always abundant. And the infrastructure required to deploy dense compute has facility-level limits that software cannot bypass.</p>



<p>Power and cooling move from background concerns to first-order constraints. Rack density becomes a planning variable. Deployment feasibility is shaped by what a data center can deliver, not only by what a platform can schedule.</p>



<p>AI-driven density makes power and cooling the gating factors—<a href="https://www.datacenterdynamics.com/en/marketwatch/the-path-to-power/" target="_blank" rel="noreferrer noopener">as Data Center Dynamics explains in its &#8220;Path to Power&#8221;</a> overview.</p>



<p>This is why “just scale out” no longer behaves like a universal architectural safety net. Scaling is still possible, but it is increasingly constrained by physical reality. In AI-heavy environments, capacity is something you secure, not something you assume.</p>



<h2 class="wp-block-heading"><strong>From elasticity to allocation</strong></h2>



<p>As AI becomes operationally critical, cloud capacity begins to behave less like a utility and more like an allocation system.</p>



<p>Organizations respond by shifting from on-demand assumptions to capacity controls. They introduce quotas to prevent runaway consumption, reservations to ensure availability, and explicit prioritization to protect production workflows from contention. These mechanisms are not optional governance overhead. They are structural responses to scarcity.</p>
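<p>As a minimal sketch of what those controls can look like in code (the workload classes, numbers, and accounting model below are illustrative assumptions, not a reference implementation), an admission layer might check the quota of the requesting class and the reservations of every other class before letting a request reach the accelerator pool:</p>



<pre class="wp-block-code"><code># Minimal admission-control sketch: per-class quotas plus reserved capacity.
# Class names, numbers, and the accounting model are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class WorkloadClass:
    name: str
    priority: int          # lower = higher priority; used to order queued work (not shown)
    reserved_gpus: float   # capacity guaranteed to this class
    quota_gpus: float      # hard ceiling for this class

CLASSES = {
    "operational_assistant": WorkloadClass("operational_assistant", 0, 8.0, 12.0),
    "batch_jobs":            WorkloadClass("batch_jobs",            1, 2.0,  6.0),
    "exploratory":           WorkloadClass("exploratory",           2, 0.0,  2.0),
}

TOTAL_GPUS = 16.0
in_use = {name: 0.0 for name in CLASSES}   # GPUs currently held per class

def admit(class_name: str, gpus_requested: float) -&gt; bool:
    """Admit a request only if its class stays under quota and the pool can
    still honor every other class's unmet reservation afterwards."""
    wc = CLASSES[class_name]
    if in_use[class_name] + gpus_requested &gt; wc.quota_gpus:
        return False   # the class would exceed its own ceiling
    committed = sum(in_use.values()) + gpus_requested
    reserved_elsewhere = sum(
        max(c.reserved_gpus - in_use[c.name], 0.0)
        for c in CLASSES.values()
        if c.name != class_name
    )
    return committed + reserved_elsewhere &lt;= TOTAL_GPUS

if admit("exploratory", 1.0):
    in_use["exploratory"] += 1.0   # proceed; otherwise queue, shed, or degrade</code></pre>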



<p>In practice, accelerator capacity behaves more like a supply chain than a cloud service. Availability is influenced by lead time, competition, and contractual positioning. The implication is subtle but decisive: Enterprise AI platforms begin to look less like “infinite pools” and more like managed inventories.</p>



<p>This changes cloud economics and vendor relationships. Pricing is no longer only about utilization. It becomes about assurance. The questions that matter are not just “How much did we use?” but “Can we obtain capacity when it matters?” and “What reliability guarantees do we have under peak demand?”</p>



<h2 class="wp-block-heading"><strong>When elasticity stops being a default</strong></h2>



<p>Consider a platform team that deploys an internal AI assistant for operational support. In the pilot phase, demand is modest and the system behaves like a conventional cloud service. Inference runs on on-demand accelerators, latency is stable, and the team assumes capacity will remain a provisioning detail rather than an architectural constraint.</p>



<p>Then the system moves into production. The assistant is upgraded to use retrieval for policy lookups, reranking for relevance, and an additional validation pass before responses are returned. None of these changes appear dramatic in isolation. Each improves quality, and each looks like an incremental feature.</p>



<p>But the request path is no longer a single model call. It becomes a pipeline. Every user request now triggers multiple GPU-backed operations: embedding generation, retrieval-side processing, reranking, inference, and validation. GPU work per request rises, and the variance increases. The system still works—until it meets real peak behavior.</p>
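<p>A rough back-of-envelope sketch makes that amplification visible. The stage timings, peak traffic, and utilization target below are invented for illustration (real numbers come from profiling your own pipeline), but the structure of the calculation is the point:</p>



<pre class="wp-block-code"><code># Back-of-envelope capacity math for a multistage inference pipeline.
# All numbers are illustrative assumptions; profile your own stages to get real ones.

stage_gpu_seconds = {
    "embedding_generation": 0.05,
    "retrieval_processing": 0.10,
    "reranking":            0.20,
    "generation":           1.50,
    "validation_pass":      0.40,
}

gpu_seconds_per_request = sum(stage_gpu_seconds.values())   # ~2.25 GPU-seconds

peak_requests_per_second = 40    # assumed peak load for the assistant
target_utilization = 0.6         # leave headroom so queues don't collapse at peak

# GPUs needed ~= GPU-seconds demanded per wall-clock second / sustainable utilization
gpus_at_peak = gpu_seconds_per_request * peak_requests_per_second / target_utilization

print(f"GPU-seconds per request: {gpu_seconds_per_request:.2f}")
print(f"GPUs needed at peak:     {gpus_at_peak:.0f}")   # ~150 on these assumptions</code></pre>



<p>On these assumptions, the pipeline needs roughly 150 GPUs at peak. Add one more &#8220;small&#8221; 0.3-GPU-second stage and peak demand grows by another 20 GPUs, which is why each incremental feature is also a capacity decision.</p>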



<p>The first failure is not a clean outage. It is contention. Latency becomes unpredictable as jobs queue behind each other. The “long tail” grows. Teams begin to see priority inversion: Low-value exploratory usage competes with production workflows because the capacity pool is shared and the scheduler cannot infer business criticality.</p>



<p>The platform team responds the only way it can. It introduces allocation. Quotas are placed on exploratory traffic. Reservations are used for the operational assistant. Priority tiers are defined so production paths cannot be displaced by batch jobs or ad hoc experimentation.</p>



<p>Then the second realization arrives. Allocation alone is insufficient unless the system can degrade gracefully. Under pressure, the assistant must be able to narrow retrieval breadth, reduce reasoning depth, route deterministic checks to smaller models, or temporarily disable secondary passes. Otherwise, peak demand simply converts into queue collapse.</p>
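<p>One way to make that explicit is an ordered degradation ladder that the request planner walks as pressure rises. This is purely a sketch: The step names, ordering, and pressure signal are assumptions, and a production system would tie them to measured queue depth or latency budgets.</p>



<pre class="wp-block-code"><code># Illustrative degradation ladder: explicit, ordered steps applied under capacity pressure.
# Step names and the pressure signal are assumptions, not any specific product's API.

DEGRADATION_LADDER = [
    "narrow_retrieval_breadth",      # fetch fewer candidate documents
    "skip_reranking_pass",           # accept retrieval order as-is
    "route_checks_to_smaller_model", # deterministic checks go to a cheaper model
    "disable_secondary_validation",  # drop the extra verification pass
]

def plan_request(pressure: float) -&gt; dict:
    """Map a capacity-pressure signal in [0, 1] to an explicit request plan.
    Each 0.25 of pressure enables one more step, so behavior under load is
    deliberate and measurable rather than an emergent queue collapse."""
    steps = min(int(pressure / 0.25), len(DEGRADATION_LADDER))
    applied = DEGRADATION_LADDER[:steps]
    return {"degradations": applied, "audit": {"pressure": pressure, "applied": applied}}

print(plan_request(0.6))   # at 60% pressure, the first two steps are applied</code></pre>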



<p>At that point, capacity planning stops being an infrastructure exercise. It becomes an architectural requirement. Product decisions directly determine GPU operations per request, and those operations determine whether the system can meet its service levels under constrained capacity.</p>



<h2 class="wp-block-heading"><strong>How this changes architecture</strong></h2>



<p>When capacity becomes constrained, architecture changes—even if the product goal stays the same.</p>



<p>Pipeline depth becomes a capacity decision. In AI systems, throughput is not just a function of traffic volume. It is a function of how many GPU-backed operations each request triggers end to end. This amplification factor often explains why systems behave well in prototypes but degrade under sustained load.</p>



<p>Batching becomes an architectural tool, not an optimization detail. It can improve utilization and cost efficiency, but it introduces scheduling complexity and latency trade-offs. In practice, teams must decide where batching is acceptable and where low-latency “fast paths” must remain unbatched to protect user experience.</p>
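<p>A minimal sketch of that split, using the asyncio library, routes latency-sensitive calls around the batcher entirely. The labels, batch size, and wait window are assumptions, and <code>run_on_gpu</code> is a stand-in for real inference:</p>



<pre class="wp-block-code"><code># Illustrative split between an unbatched fast path and a micro-batching queue.
# Batch size, wait window, and run_on_gpu are stand-in assumptions (Python 3.10+).
import asyncio

MAX_BATCH = 8
MAX_WAIT_S = 0.025                  # how long to wait for each additional item
batch_queue: asyncio.Queue = asyncio.Queue()

async def run_on_gpu(batch):
    await asyncio.sleep(0.05)       # stand-in for real model inference
    return [{"output": "..."} for _ in batch]

async def handle(request: dict):
    if request.get("latency_sensitive"):
        return (await run_on_gpu([request]))[0]   # fast path: no batching delay
    future = asyncio.get_running_loop().create_future()
    await batch_queue.put((request, future))
    return await future                           # resolved by the batcher below

async def batcher():
    while True:
        items = [await batch_queue.get()]
        try:
            while len(items) &lt; MAX_BATCH:         # fill the batch, but don't wait forever
                items.append(await asyncio.wait_for(batch_queue.get(), MAX_WAIT_S))
        except asyncio.TimeoutError:
            pass                                  # dispatch a partial batch
        results = await run_on_gpu([req for req, _ in items])
        for (_, future), result in zip(items, results):
            future.set_result(result)

async def main():
    batch_task = asyncio.create_task(batcher())   # runs for the life of the service
    fast = handle({"latency_sensitive": True})
    slow = [handle({"prompt": i}) for i in range(5)]
    print(await asyncio.gather(fast, *slow))

asyncio.run(main())</code></pre>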



<p>Model choice becomes a production constraint. As capacity pressure increases, many organizations discover that smaller, more predictable models often win for operational workflows. This does not mean large models are unimportant. It means their use becomes selective. Hybrid strategies emerge: Smaller models handle deterministic or governed tasks, while larger models are reserved for exceptional or exploratory scenarios where their overhead is justified.</p>



<p>In short, architecture becomes constrained by power and hardware, not only by code. The core shift is that capacity constraints shape system behavior. They also shape governance outcomes, because predictability and auditability degrade when capacity contention becomes chronic.</p>



<h2 class="wp-block-heading"><strong>What cloud and platform teams must do differently</strong></h2>



<p>From an enterprise IT perspective, this shows up as a readiness problem: Can infrastructure and operations absorb AI workloads without destabilizing production systems? Answering that requires treating accelerator capacity as a governed resource—metered, budgeted, and allocated deliberately.</p>



<p><strong>Meter and budget accelerator capacity</strong></p>



<ul class="wp-block-list">
<li>Define consumption in business-relevant units (e.g., GPU-seconds per request and peak concurrency ceilings) and expose it as a platform metric.</li>



<li>Turn those metrics into explicit capacity budgets by service and workload class—so growth is a planning decision, not an outage.</li>
</ul>



<p><strong>Make allocation first class</strong></p>



<ul class="wp-block-list">
<li>Implement admission control and priority tiers aligned to business criticality; do not rely on best-effort fairness under contention.</li>



<li>Make allocation predictable and early (quotas/reservations) instead of informal and late (brownouts and surprise throttling).</li>
</ul>



<p><strong>Build graceful degradation into the request path</strong></p>



<ul class="wp-block-list">
<li>Predefine a degradation ladder (e.g., reduce retrieval breadth or route to a smaller model) that preserves bounded cost and latency.</li>



<li>Ensure degradations are explicit and measurable, so systems behave deterministically under capacity pressure.</li>
</ul>



<p><strong>Separate exploratory from operational AI</strong></p>



<ul class="wp-block-list">
<li>Isolate experimentation from production using distinct quotas/priority classes/reservations, so exploration cannot starve operational workloads.</li>



<li>Treat operational AI as an enforceable service with reliability targets; keep exploration elastic without destabilizing the platform.</li>
</ul>



<p>In an accelerator-bound world, platform success is no longer maximum utilization—it is predictable behavior under constraint.</p>



<h2 class="wp-block-heading"><strong>What this means for the future of the cloud</strong></h2>



<p>AI is not ending the cloud. It is pulling the cloud back toward physical reality.</p>



<p>The likely trajectory is a cloud landscape that becomes more hybrid, more planned, and less elastic by default. Public cloud remains critical, but organizations increasingly seek predictable access to accelerator capacity through reservations, long-term commitments, private clusters, or colocated deployments.</p>



<p>This will reshape pricing, procurement, and platform design. It will also reshape how engineering teams think. In the cloud native era, architecture often assumed capacity was solvable through autoscaling and on-demand provisioning. In the AI era, capacity becomes a defining constraint that shapes what systems can do and how reliably they can do it.</p>



<p>That is why capacity planning is back—not as a return to old habits but as a necessary response to a new infrastructure regime. Organizations that succeed will be the ones that design explicitly around capacity constraints, treat amplification as a first-order metric, and align product ambition with the physical and economic limits of modern AI infrastructure.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong><em>Author’s note</em></strong><em>: This article reflects the author’s personal views, based on independent technical research, and does not describe the architecture of any specific organization.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/why-capacity-planning-is-back/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>How We Bet Against the Bitter Lesson</title>
		<link>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/</link>
				<comments>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/#respond</comments>
				<pubDate>Mon, 02 Mar 2026 11:48:53 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18146</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Earth-brain.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Earth-brain-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Skills and the future knowledge economy]]></custom:subtitle>
		
				<description><![CDATA[I&#8217;ve been telling myself and anyone who will listen that Agent Skills point toward a new kind of future AI + human knowledge economy. It&#8217;s not just Skills, of course. It&#8217;s also things like Jesse Vincent&#8217;s Superpowers and Anthropic&#8217;s recently introduced Plugins for Claude Cowork. If you haven&#8217;t encountered these yet, keep reading. It should [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I&#8217;ve been telling myself and anyone who will listen that <a href="https://agentskills.io/" target="_blank" rel="noreferrer noopener">Agent Skills</a> point toward a new kind of future AI + human knowledge economy. It&#8217;s not just Skills, of course. It&#8217;s also things like Jesse Vincent&#8217;s <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> and Anthropic&#8217;s recently introduced <a href="https://github.com/anthropics/knowledge-work-plugins" target="_blank" rel="noreferrer noopener">Plugins for Claude Cowork</a>. If you haven&#8217;t encountered these yet, keep reading. It should become clear as we go along.</p>



<p>It feels a bit like I&#8217;m assembling a picture puzzle where all the pieces aren&#8217;t yet on the table. I am starting to see a pattern, but I&#8217;m not sure it&#8217;s right, and I need help finding the missing pieces. Let me explain some of the shapes I have in hand and the pattern they are starting to show me, and then I want to ask for your help filling in the gaps.</p>



<h2 class="wp-block-heading">Programming two different types of computer at the same time</h2>



<p>Phillip Carter wrote a piece a while back called &#8220;<a href="https://www.phillipcarter.dev/posts/llms-computers" target="_blank" rel="noreferrer noopener">LLMs Are Weird Computers</a>&#8221; that landed hard in my mind and wouldn&#8217;t leave. He noted that we&#8217;re now working with two fundamentally different kinds of computer at the same time. One can write a sonnet but struggles to do math. The other does math easily but couldn&#8217;t write a sonnet to save its metaphorical life.</p>



<p>Agent Skills may be the start of an answer to the question of what the interface layer between these two kinds of computation looks like. A Skill is a package of context (Markdown instructions, domain knowledge, and examples) combined with tool calls (deterministic code that does the things LLMs are bad at). The context speaks the language of the probabilistic machine, while the tools speak the language of the deterministic one.</p>



<p>Imagine you&#8217;re an experienced DevOps engineer and you want to give an AI agent the ability to diagnose production incidents the way you would. The context part of that Skill includes your architecture overview, your runbook for common failure modes, the heuristics you&#8217;ve developed over the years, and annotated examples of past incidents. That&#8217;s the part that speaks to the probabilistic machine. The tool part includes actual code that queries your monitoring systems, pulls log entries, checks service health endpoints, and runs diagnostic scripts. Each tool call saves the model from burning tokens on work that deterministic code does better, faster, and more reliably.</p>



<p>The Skill is neither the context nor the tools. It&#8217;s the combination. Expert judgment about when to check the database connection pool married to the ability to actually check it. We&#8217;ve had runbooks before (context without tools). We&#8217;ve had monitoring scripts before (tools without context). What we haven&#8217;t had is a way to package them together for a machine that can read the runbook <em>and</em> execute the scripts, using judgment to decide which script to run next based on what the last one returned.</p>
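<p>To make that pairing concrete, here is a hypothetical sketch of the two halves of such a diagnostic Skill. The names, strings, and endpoint are invented for illustration; this is not the packaging format of any particular Skills framework:</p>



<pre class="wp-block-code"><code># Hypothetical sketch of a Skill's two halves; names and formats are illustrative.

# The context half: what the probabilistic machine reads before acting
# (architecture overview, runbook excerpts, heuristics, annotated incidents).
SKILL_CONTEXT = """\
# Incident Diagnosis Skill
When latency alarms fire, check the database connection pool before scaling anything.
If error rates and pool saturation rise together, suspect a leaking client, not load.
Past incident 2024-11: identical symptoms, root cause was an unclosed cursor.
"""

# The tools half: deterministic code that does what LLMs are bad at,
# cheaply and exactly, without burning tokens.
import json
import urllib.request

def check_service_health(url: str) -&gt; dict:
    """Query a health endpoint and return its JSON body verbatim."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def pool_saturation(active_connections: int, pool_size: int) -&gt; float:
    """Exact arithmetic the model should never have to approximate."""
    return active_connections / pool_size

# The Skill is the combination: the agent reads SKILL_CONTEXT, calls a tool,
# and uses judgment to decide which tool to call next based on what came back.</code></pre>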



<p>This pattern shows up across every knowledge domain. A financial analyst&#8217;s Skill might combine valuation methodology with tools that pull real-time market data and run DCF calculations. A legal Skill might pair a firm&#8217;s approach to contract review with tools that extract and compare specific clauses across documents. In each case, the valuable thing isn&#8217;t the knowledge alone or the tools alone. It&#8217;s the integration of expert workflow logic that orchestrates when and how to use each tool, informed by domain knowledge that gives the LLM the judgment to make good decisions in context.</p>



<h2 class="wp-block-heading">Software that saves tokens</h2>



<p>In &#8220;<a href="https://steve-yegge.medium.com/software-survival-3-0-97a2a6255f7b" target="_blank" rel="noreferrer noopener">Software Survival 3.0</a>,&#8221; Steve Yegge asked what kinds of software artifacts survive in a world where AI can generate disposable software on the fly. His answer: software that saves tokens. Binary tools with proven solutions to common problems make sense when reuse is nearly free and regenerating them is token-costly.</p>



<p>Skills fit this niche. A well-crafted Skill gives an LLM the context it needs (which costs tokens) but also gives it tools that <em>save</em> tokens by providing deterministic, reliable results. The developer&#8217;s job increasingly becomes making good calls about this distinction: What should be context (flexible, expressive, probabilistic) and what should be a tool (efficient, deterministic, reusable)?</p>



<p>An LLM&#8217;s context window is a finite and expensive resource. Everything in it costs tokens, and everything in it competes for the model&#8217;s attention. A Skill that dumps an entire company knowledge base into the context window is a poorly designed Skill. A well-designed one is selective: It gives the model exactly the context it needs to make good decisions about which tools to call and when. This is a form of engineering discipline that doesn&#8217;t have a great analogue in traditional software development. It&#8217;s closer to what an experienced teacher does when deciding what to tell a student before sending them off to solve a problem—what Matt Beane, author of <a href="https://www.harpercollins.com/products/the-skill-code-matt-beane" target="_blank" rel="noreferrer noopener"><em>The Skill Code</em></a>, calls &#8220;scaffolding,&#8221; sharing not everything you know but the right things at the right level of detail to enable good judgment in the moment.</p>



<h2 class="wp-block-heading">AI is a social and cultural technology</h2>



<p>This notion of saving tokens is a bridge to the work of Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans. They make the case that large models should not be viewed primarily as intelligent agents, but as <a href="https://henryfarrell.net/wp-content/uploads/2025/03/Science-Accepted-Version.pdf" target="_blank" rel="noreferrer noopener">a new kind of cultural and social technology</a>, allowing humans to take advantage of information other humans have accumulated. Yegge&#8217;s observation fits right into this framework. Every new social and cultural technology tends to survive because it saves cognition. We learn from each other so we don&#8217;t have to discover everything for the first time. Alfred Korzybski referred to language, the first of these human social and cultural technologies, and all of those that followed, as &#8220;<a href="https://www.google.com/search?q=Alfred+Korzybski+time+binding" target="_blank" rel="noreferrer noopener">time-binding</a>.&#8221; (I will add that each advance in time binding creates consternation. Consider Socrates, whose diatribes against writing as the enemy of memory were passed down to us by Plato using that very same advance in time binding that Socrates decried.)</p>



<p>I am not convinced that the idea that AI may one day become an independent intelligence is misguided. But at present, AI is a symbiosis of human and machine intelligence, the latest chapter of a long story in which advances in the speed, persistence, and reach of communications <a href="https://www.youtube.com/watch?v=u62fQCI7YNA" target="_blank" rel="noreferrer noopener">weave humanity into a global brain</a>. I have a set of priors that say (until I am convinced otherwise) that <em>AI will be an extension of the human knowledge economy, not a replacement for it</em>. After all, as Claude told me when I asked whether <a href="https://www.oreilly.com/radar/jensen-huang-gets-it-wrong/" target="_blank" rel="noreferrer noopener">it was a worker or a tool</a>, &#8220;I don&#8217;t initiate. I&#8217;ve never woken up wanting to write a poem or solve a problem. My activity is entirely reactive – I exist in response to prompts. Even when given enormous latitude (&#8216;figure out the best approach&#8217;), the fact that I should figure something out comes from outside me.&#8221;</p>



<p>The shift from a chatbot responding to individual prompts to agents running in a loop marks a big step in the progress towards more autonomous AI, but even then, some human established the goal that set the agent in motion. I say this even as I am aware that long-running loops become increasingly difficult to distinguish from volition and that much human behavior is also set in motion by others. But I have yet to see any convincing evidence of Artificial Volition. And for that reason, <em>we need to think about mechanisms and incentives for humans to continue to create and share new knowledge</em>, putting AIs to work on questions that they will not ask on their own.</p>



<p>On X, someone recently asked Boris Cherny why there are a hundred-plus open engineering positions at Anthropic if Claude is writing 100% of the code. <a href="https://x.com/bcherny/status/2022762422302576970" target="_blank" rel="noreferrer noopener">His reply</a> made that same point: &#8220;Someone has to prompt the Claudes, talk to customers, coordinate with other teams, decide what to build next. Engineering is changing and great engineers are more important than ever.&#8221;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>On March 26, join Addy Osmani and Tim O’Reilly at AI Codecon: Software Craftsmanship in the Age of AI, where an all-star lineup of experts will go deeper into orchestration, agent coordination, and the new skills developers need to build excellent software that creates value for all participants.&nbsp;</em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></p>
</blockquote>



<h2 class="wp-block-heading">Tacit knowledge made executable</h2>



<p>A huge amount of specialized, often tacit, knowledge is embedded in workflows. The way an experienced developer debugs a production issue. The way a financial analyst stress-tests a model. This knowledge has historically been very hard to transfer. You learned it by apprenticeship, by doing, by being around people who knew how.</p>



<p>Matt Beane, author of <a href="https://www.harpercollins.com/products/the-skill-code-matt-beane" target="_blank" rel="noreferrer noopener"><em>The Skill Code</em></a>, calls apprenticeship &#8220;the 160,000 year old school hidden in plain sight.&#8221; He finds that effective skill development follows a common pattern of three C&#8217;s: challenge, complexity, and connection. The expert structures challenges at the right level, exposes the novice to the full complexity of the bigger picture rather than shielding them from it, and builds a connection that makes the novice willing to struggle and the expert willing to invest.</p>



<p>Designing a good Skill requires a similar craft. You have to figure out what an expert actually <em>does</em>. What are the decision points, the heuristics, the things they notice that a novice wouldn&#8217;t? And then how do you encode that into a form a machine can act on? Most Skills today are closer to the manual than to the master. Figuring out how to make Skills that transmit not just knowledge but judgment is one of the most interesting design challenges in this space.</p>



<p>But Matt also flags a paradox: the better we get at encoding expert judgment into Skills, the less we may need novices working alongside experts, and that&#8217;s exactly the relationship that produces the next generation of experts. If we&#8217;re not careful, we&#8217;ll capture today&#8217;s tacit knowledge while quietly shutting down the system that generates tomorrow&#8217;s.</p>



<p>Jesse Vincent&#8217;s Superpowers complement this picture. If a Skill is like handing a colleague a detailed playbook for a particular job, a Superpower is more like the professional habits and instincts that make someone effective at everything they do. Superpowers are meta-skills. They don&#8217;t tell the agent what to do. They shape how it thinks about what to do. As Jesse put it to me the other day, Superpowers tried to capture everything he&#8217;d learned in 30 years as a software developer.</p>



<p>As workflows change to include AI agents, Skills and Superpowers become a mechanism for sharing tacit professional knowledge and judgment with those agents. That makes Skills potentially very valuable but also raises questions about who controls them and who benefits.</p>



<p>Matt pointed out to me that many professions will resist the conversion of their expertise into Skills. He noted: &#8220;There&#8217;s a giant showdown between the surgical profession and Intuitive Surgical on this right now — Intuitive Surgical with its da Vinci 5 surgical robot will only let you buy or lease it if you sign away the rights to your telemetry as a surgeon. Lower status surgeons take the deal. Top tier institutions are fighting.&#8221;</p>



<p>It seems to me that the AI labs&#8217; repeated narrative that they are creating AI to make humans redundant rather than to empower them will only increase resistance to knowledge sharing. I believe they should instead recognize the opportunity that lies in <a href="https://www.oreilly.com/radar/ai-and-the-next-economy/" target="_blank" rel="noreferrer noopener">making a new kind of market for human expertise</a>.</p>



<h2 class="wp-block-heading"><strong>Protection, discovery, and the missing plumbing</strong></h2>



<p>Skills are just Markdown instructions and context. You could encrypt them at rest and in transit, but at execution time, the secret sauce is necessarily plaintext in the context window. The solution might be what MCP already partially enables: splitting a Skill into a public interface and a server-side execution layer where the proprietary knowledge lives. The tacit knowledge stays on your server while the agent only sees the interface.</p>
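


<p>As a rough sketch of that split (the endpoint and payload shape here are invented for illustration, and in practice an MCP server would likely play this role), the agent-facing interface can be a thin function whose inputs and outputs are public, while the firm&#8217;s actual methodology runs behind an endpoint its owner controls:</p>



<pre class="wp-block-code"><code># Sketch: a Skill's public interface, with the proprietary knowledge kept server-side.
# The endpoint and payload shape are hypothetical.
import json
import urllib.request

PRIVATE_ENDPOINT = "https://skills.internal.example/contract-review"

def review_contract(clause_text: str) -> dict:
    """The agent sees only inputs and outputs; the firm's review methodology
    executes server-side and never enters the agent's context window."""
    payload = json.dumps({"clause": clause_text}).encode("utf-8")
    req = urllib.request.Request(
        PRIVATE_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
</code></pre>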



<p>But part of the beauty of Skills right now is the fact that they really are just a folder that you can move around and modify. This is like the marvelous days of the early web when you could imitate someone&#8217;s new HTML functionality simply by clicking &#8220;View Source.&#8221; This was a recipe for rapid, leapfrogging innovation. It may be far better to establish norms for attribution, payment, and reuse than to put up artificial barriers. There are useful lessons from open source software licenses and from voluntary payment mechanisms like those used by Substack. But the details matter, and I don&#8217;t think anyone has fully worked them out yet.</p>



<p>Meanwhile, the discovery problem will grow larger. Vercel&#8217;s <a href="https://skills.sh/" target="_blank" rel="noreferrer noopener">Skills marketplace</a> already has more than 60,000 Skills. How well will skill search work when there are millions? How do agents learn which Skills are available, which are best, and what they cost? The evaluation problem is different from web search in a crucial way: testing whether a Skill is <em>good</em> requires actually running it, which is expensive and nondeterministic. You can&#8217;t just crawl and index. I don&#8217;t imagine a testing regime so much as some feedback mechanism by which the effectiveness of particular Skills is learned and passed on by agents. There may be some future equivalent of PageRank and the other kinds of signals that have made Google search so effective, one generated by the feedback agents collect as Skills are tried, revised, and tried again.</p>



<p>I&#8217;m watching several projects tackling pieces of this: <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2127" target="_blank" rel="noreferrer noopener">MCP Server Cards</a>, <a href="https://github.com/Agent-Card/ai-card" target="_blank" rel="noreferrer noopener">AI Cards</a>, Google&#8217;s <a href="https://a2aprotocol.ai/" target="_blank" rel="noreferrer noopener">A2A protocol</a>, and payment protocols from <a href="https://developers.google.com/merchant/ucp">Google</a> and <a href="https://stripe.com/blog/developing-an-open-standard-for-agentic-commerce">Stripe</a>. These are all a good start, but I suspect so much more has yet to be created. For a historical comparison, you might say that all this is at the <a href="https://en.wikipedia.org/wiki/Common_Gateway_Interface">CGI</a> stage in the development of dynamic websites.</p>



<h2 class="wp-block-heading"><strong>What happens after the bitter lesson?</strong></h2>



<p>Richard Sutton&#8217;s &#8220;<a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html" target="_blank" rel="noreferrer noopener">Bitter Lesson</a>&#8221; is the fly in the ointment. His argument is that in the history of AI, general methods leveraging computation have always ended up beating approaches that try to encode human knowledge. Chess engines that encoded grandmaster heuristics lost to brute-force engines. NLP systems built on carefully constructed grammars lost to statistical models trained on more data. AlphaGo beat Lee Sedol after training on human games, but then fell in turn to AlphaZero, which learned Go on its own.</p>



<p>I had my own painful experience of the pre-AI bitter lesson when O&#8217;Reilly launched <a href="https://en.wikipedia.org/wiki/Global_Network_Navigator" target="_blank" rel="noreferrer noopener">GNN, the first web portal</a>. We curated the list of the best websites. Yahoo! decided to catalog them all, but even they were outrun by Google&#8217;s algorithmic curation, which produced a unique catalog of the best sites for any given query, ultimately billions of times a day.</p>



<p>Steve Yegge put it bluntly to me: &#8220;Skills are a bet against the bitter lesson.&#8221; He&#8217;s right. AI&#8217;s capabilities may completely outrun human knowledge and skills. And once the knowledge embedded in a Skill makes it into the training data, the Skill becomes redundant.</p>



<p>Or does it?</p>



<p>Clay Christensen articulated what he called the <a href="https://store.hbr.org/product/breakthrough-ideas-for-2004-the-hbr-list/R0402A" target="_blank" rel="noreferrer noopener">law of conservation of attractive profits</a>: when a product becomes commoditized, value migrates to an adjacent layer. Clay and I bonded over this idea when we first met at the Open Source Business Conference in 2004. Clay talked about his new “law.” I talked about a recurring pattern I was seeing in the history of computing, which was leading me in the direction of what we were soon to call <a href="https://www.oreilly.com/pub/a/web2/archive/what-is-web-20.html" target="_blank" rel="noreferrer noopener">Web 2.0</a>: Microsoft beat IBM because they understood that software became more valuable once PC hardware was a commodity. Google understood how data became more valuable when open source and open protocols commoditized the software platform. Commoditization doesn&#8217;t destroy value, it moves it.</p>



<p>Even if the bitter lesson commoditizes knowledge, what becomes valuable next? I think there are several candidates.</p>



<p>First, taste and curation. When everyone has access to the same commodity knowledge, the ability to select, combine, and apply it with judgment becomes valuable. Steve Jobs did this when the rest of the industry was racing toward the commodity PC. He created a unique integration of hardware, software, and design that transformed commodity components into something precious. The Skill equivalent might not be &#8220;here&#8217;s how to do X&#8221; (which the model already knows) but &#8220;here&#8217;s how <em>we</em> do X, with the specific judgment calls and quality standards that define our approach.&#8221; That&#8217;s harder to absorb into training data because it&#8217;s not just knowledge, it&#8217;s <em>values</em>.</p>



<p>You can see this pattern repeat across one commodity market after another. This is the essence of fashion, for example, but also applies to areas as diverse as coffee, water, consumer goods, and automobiles. In his essay “<a href="https://www.amazon.com/Air-Guitar-Essays-Art-Democracy/dp/0963726455" target="_blank" rel="noreferrer noopener">The Birth of the Big Beautiful Art Market</a>,” art critic Dave Hickey describes how commodities are turned into a kind of “art market,” where something is sold on the basis of what it means rather than just what it does. Owning a Mac rather than a PC <em>meant</em> something.</p>



<p>Second, the human touch. As economist Adam Ozimek <a href="https://agglomerations.substack.com/p/economics-of-the-human" target="_blank" rel="noreferrer noopener">pointed out</a>, people still go listen to live music from local bands despite the abundance of recorded music from the world&#8217;s greatest performers. The human touch is what economists call a &#8220;normal good&#8221;: demand for it goes up as income goes up. As I discussed with Claude in &#8220;<a href="https://timoreilly.substack.com/p/why-ai-needs-us" target="_blank" rel="noreferrer noopener">Why AI Needs Us</a>,&#8221; human individuality is a fount of creativity. AI without humans is a kind of recorded music. AI plus humans is live.</p>



<p>Third, freshness. Skills that encode rapidly changing workflows, current tool configurations, or evolving best practices will always have a temporal advantage. There is alpha in knowing something first.</p>



<p>Fourth, tools themselves. The bitter lesson applies to the knowledge that lives in the context portion of a Skill. It may not apply in the same way to the deterministic tools that save tokens or do things the model can&#8217;t do by thinking harder. And tools, unlike context, can be protected behind APIs, metered, and monetized.</p>



<p>Fifth, coordination and orchestration. Even if individual Skills get absorbed into model knowledge, the patterns for how Skills compose, negotiate, and hand off to each other may not. The choreography of a complex workflow might be the layer where value accumulates as the knowledge layer commoditizes.</p>



<p>But more importantly, the idea that any knowledge that becomes available automatically becomes the property of any LLM is not foreordained. It is an artifact of an IP regime that the AI labs have adopted for their own benefit: a variation of the &#8220;empty lands&#8221; argument that European colonialists used to justify their taking of others&#8217; resources. AI has been developed in an IP wild west. That may not continue. The fulfillment of AI labs&#8217; vision of a world where their products absorb all human knowledge and then put humans out of work <a href="https://x.com/timoreilly/status/2016317410853220827" target="_blank" rel="noreferrer noopener">leaves them without many of the customers they currently rely on</a>. Not only that, they themselves are being reminded why IP law exists, as <a href="https://www.economist.com/china/2026/02/25/anthropic-says-chinas-ai-tigers-are-copycats" target="_blank" rel="noreferrer noopener">Chinese models copy their advances by exfiltrating their weights</a>. There is a historical parallel in the way that US publishing companies ignored European copyrights until they themselves had homegrown assets to protect.</p>



<h2 class="wp-block-heading"><strong>Where we are now</strong></h2>



<p>What I&#8217;m starting to see are the first halting steps toward a new software ecosystem where the &#8220;programs&#8221; are mixtures of natural language and code, the &#8220;runtime&#8221; is a large language model, and the &#8220;users&#8221; are AI agents as well as humans. Skills, Superpowers, and knowledge plugins might represent the first practical mechanism for making tacit knowledge accessible to computational agents.</p>



<p>Several gaps keep coming up, though. Composability: The real power may come from Skills that work together, much like Unix utilities piped together. How do trust, payment, and quality propagate through a chain of Skill invocations? Trust and security: Simon Willison has written about <a href="https://simonw.substack.com/p/model-context-protocol-has-prompt" target="_blank" rel="noreferrer noopener">tool poisoning and prompt injection risks in MCP</a>. The security model for composable, agent-discovered Skills is essentially unsolved. Evaluation: We don&#8217;t have good ways to verify a Skill&#8217;s quality except by running it, which is expensive and nondeterministic.</p>



<p>And then there&#8217;s the economic plumbing, which is to me the most glaring gap. Consider Anthropic&#8217;s Cowork plugins. They are exactly the pattern I&#8217;ve been describing, tacit knowledge made executable, delivered at enterprise scale. But there is no mechanism for the domain experts whose knowledge makes plugins valuable to get paid for them. If the AI labs believed in a future where AI extends the human knowledge economy rather than replacing it, they would be building payment rails alongside the plugin architecture. The fact that they aren&#8217;t tells you something about their actual theory of value.</p>



<p>If you&#8217;re working on any of this, whether skill marketplaces and discovery, composability patterns, protection models, quality and evaluation, attribution and compensation, or security models, <a href="https://github.com/oreillymedia/skills-and-the-future-knowledge-economy" target="_blank" rel="noreferrer noopener">I want to hear from you</a>.</p>



<p>The future of software isn&#8217;t just code. It&#8217;s knowledge, packaged for machines, traded between agents, and, if we get the infrastructure right, creating value that flows back to the humans whose expertise and unique perspectives make it all work.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Thanks to Andrew Odewahn, Angie Jones, Claude Opus 4.6, James Cham, Jeff Weinstein, Jonathan Hassell, Matt Beane, Mike Loukides, Peyton Joyce, Sruly Rosenblat, Steve Yegge, and Tadas Antanavicius for comments on drafts of this piece. You made it much stronger with your insights and objections.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Semantic Layers in the Wild: Lessons from Early Adopters</title>
		<link>https://www.oreilly.com/radar/semantic-layers-in-the-wild-lessons-from-early-adopters/</link>
				<pubDate>Thu, 26 Feb 2026 12:16:01 +0000</pubDate>
					<dc:creator><![CDATA[Jeremy Arendt]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18141</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-semantic-layers1.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-semantic-layers1-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[My first post made the case for what a semantic layer can bring to the modern enterprise: a single source of truth accessible to everyone who needs it—BI teams in Tableau and Power BI, Excel-loving analysts, application integrations via API, and the AI agents now proliferating across organizations—all pulling from the same governed, performant metric [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>My <a href="https://www.oreilly.com/radar/the-trillion-dollar-problem/" target="_blank" rel="noreferrer noopener">first post</a> made the case for what a semantic layer can bring to the modern enterprise: a single source of truth accessible to everyone who needs it—BI teams in Tableau and Power BI, Excel-loving analysts, application integrations via API, and the AI agents now proliferating across organizations—all pulling from the same governed, performant metric layer. The promise is compelling. But what happens when organizations actually build and deploy one? To find out, I interviewed several early adopters who&#8217;ve moved semantic layers from concept to production. Four themes emerged from those conversations: some surprising, some predictable, and a few that will sound familiar to anyone who&#8217;s ever shipped data infrastructure.</p>



<p>The first theme: Semantic layers are showing up in unexpected places. Most discussion positions them as enterprise-level infrastructure—a single location capturing all company metrics for centralized access and governance. That&#8217;s still the primary use case. But practitioners are also deploying semantic layers for narrower purposes. One organization, for example, built their semantic layer specifically to power a targeted chatbot application—letting users query data conversationally without any traditional BI tools in the mix. No Power BI, no Excel, just an AI interface pulling from governed metrics. The rationale for these smaller deployments is straightforward: Semantic layers deliver high accuracy on structured data, even with lightweight models. The core value drivers remain speed, accuracy, and access—but organizations are finding more ways to extract that value than the enterprise-wide vision suggests.</p>



<p>The second theme: AI is the reason organizations are moving now. The other benefits still matter—single source of truth, multitool compatibility, true self-serve access, cost reduction in cloud environments—but when I asked practitioners why they prioritized a semantic layer today rather than two years ago, the answer was consistent: AI. Whether it was a specific chatbot project or enabling AI-driven analytics at scale, AI requirements were the catalyst. This tracks with what I discussed in my first post: Structured data alone isn&#8217;t enough for reliable AI analytics. Adding semantic context—field descriptions, model definitions, object relationships—dramatically improves accuracy. The data industry has noticed. Semantic layers have moved from niche infrastructure to strategic priority: Snowflake, Databricks, dbt Labs, and Microsoft have all made significant investments in the past year.</p>



<p>The third theme: Semantic layers reduce work for developers while making trusted data easier to access. Multiple practitioners cited the value of maintaining metrics and business logic in a single location. Any analyst knows the pain of metric sprawl—leadership requests a change to a core KPI, and you discover it&#8217;s been defined a dozen different ways across databases, BI tools, and spreadsheets scattered through the organization. The semantic layer eliminates the chase. One engineering lead described a financial metric that had accumulated over 60 versions across the company. After deploying the semantic layer, there was one.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>Access simplifies too. Instead of provisioning controls across warehouses, BI workspaces, individual dashboards, and cloud storage locations, users connect directly to the semantic layer and pull data into the tool of their choice. One organization was surprised to find that after deployment, the most common access point was Excel. But with the semantic layer, that wasn&#8217;t a problem: The data served in Excel was identical to what powered their AI tools, Power BI dashboards, and application integrations via API.</p>



<p>The fourth theme will sound familiar to anyone who&#8217;s shipped data infrastructure: The biggest challenge isn&#8217;t the technology—it&#8217;s the data itself. Every practitioner I spoke with identified the same bottleneck: consistency, availability, and accuracy of the underlying data. Engineers and analysts can build the semantic layer, but they can&#8217;t will clean data into existence. Success requires close collaboration with business stakeholders, clear ownership of metrics, and leadership alignment to prioritize the work. None of that is new. But despite these challenges, everyone I interviewed reached the same conclusion: The semantic layer is worth the effort.</p>



<p>Semantic layer technology is still early. The tools, vendors, and best practices are evolving fast—what works today may look different in a year. But these conversations revealed a clear signal beneath the noise: semantic layers are becoming critical AI infrastructure. The practitioners I spoke with aren&#8217;t experimenting anymore. They&#8217;re operationalizing. And despite the expected challenges around data quality and organizational alignment, they&#8217;re seeing real returns: fewer metric versions to maintain, simpler access controls, and AI tools that actually produce trusted answers.</p>



<p>My first article made the case for what a semantic layer could be. This one asked what happens when organizations actually build them. The answer: It&#8217;s hard, it&#8217;s worth it, and for companies serious about AI-driven analytics, the semantic layer is no longer a nice-to-have. It&#8217;s the foundation.</p>
]]></content:encoded>
										</item>
		<item>
		<title>Why Multi-Agent Systems Need Memory Engineering</title>
		<link>https://www.oreilly.com/radar/why-multi-agent-systems-need-memory-engineering/</link>
				<pubDate>Wed, 25 Feb 2026 12:12:13 +0000</pubDate>
					<dc:creator><![CDATA[Mikiko Bazeley]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18124</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-robot-AI-in-the-office.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-robot-AI-in-the-office-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Most multi-agent AI systems fail expensively before they fail quietly. The pattern is familiar to anyone who&#8217;s debugged one: Agent A completes a subtask and moves on. Agent B, with no visibility into A&#8217;s work, reexecutes the same operation with slightly different parameters. Agent C receives inconsistent results from both and confabulates a reconciliation. The [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Most multi-agent AI systems fail expensively before they fail quietly.</p>



<p>The pattern is familiar to anyone who&#8217;s debugged one: Agent A completes a subtask and moves on. Agent B, with no visibility into A&#8217;s work, reexecutes the same operation with slightly different parameters. Agent C receives inconsistent results from both and confabulates a reconciliation. The system produces output—but the output costs three times what it should and contains errors that propagate through every downstream task.</p>



<p>Teams building these systems tend to focus on agent communication: better prompts, clearer delegation, more sophisticated message-passing. But communication isn&#8217;t what&#8217;s breaking. The agents exchange messages fine. What they can&#8217;t do is maintain a shared understanding of what&#8217;s already happened, what&#8217;s currently true, and what decisions have already been made.</p>



<p>In production, <strong>memory</strong>—not messaging—determines whether a multi-agent system behaves like a coordinated team or an expensive collision of independent processes.</p>



<h2 class="wp-block-heading">Multi-agent systems fail because they can&#8217;t share state</h2>



<h3 class="wp-block-heading">The evidence: 36% of failures are misalignment</h3>



<p><a href="https://arxiv.org/abs/2503.13657" target="_blank" rel="noreferrer noopener">Cemri et al.</a> published the most systematic analysis of multi-agent failure to date. Their MAST taxonomy, built from over 1,600 annotated execution traces across frameworks like AutoGen, CrewAI, and LangGraph, identifies 14 distinct failure modes. The failures cluster into three categories: system design issues, interagent misalignment, and task verification breakdowns.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="317" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues.png" alt="Agentic Issues in Action" class="wp-image-18125" style="width:647px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues-300x186.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 1. Challenges encountered in multi-agent systems, categorized by type</em></figcaption></figure>



<p>The number that matters: <strong>Interagent misalignment</strong> accounts for 36.9% of all failures. Agents don&#8217;t fail because they can&#8217;t reason. They fail because they operate on inconsistent views of shared state. One agent&#8217;s completed work doesn&#8217;t register in another agent&#8217;s context. Assumptions that were valid at step 3 become invalid by step 7, but no mechanism propagates the update. The team diverges.</p>



<p>What makes this structural rather than incidental is that message-passing architectures have no built-in answer to the question: &#8220;What does this agent know about what other agents have done?&#8221; Each agent maintains its own context. Synchronization happens through explicit messages, which means anything not explicitly communicated is invisible. In complex workflows, the set of things that need synchronization grows faster than any team can anticipate.</p>



<h3 class="wp-block-heading">The origin: Decomposition without shared memory</h3>



<p>Most multi-agent systems aren&#8217;t designed from first principles. They emerge from single-agent prototypes that hit scaling limits.</p>



<p>The starting point is usually one capable LLM handling one workflow. For early prototypes, this works well enough. But production requirements expand: more tools, more domain knowledge, longer workflows, concurrent users. The single agent&#8217;s prompt becomes unwieldy. Context management consumes more engineering time than feature development. The system becomes brittle in ways that are hard to diagnose.</p>



<p>The natural response is decomposition. <a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener">Sydney Runkle&#8217;s guide on choosing the right multi-agent architecture</a> captures the inflection point: Multi-agent systems become necessary when context management breaks down and when distributed development requires clear ownership boundaries. Splitting a monolithic agent into specialized subagents makes sense from a software engineering perspective.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="380" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps.png" alt="Decomposition steps" class="wp-image-18126" style="width:599px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps-300x223.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 2. An example of the decomposition of steps via a multi-agent structure (subagents) from LangChain’s “</em><a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener"><em>Choosing the Right Multi-Agent Architecture</em></a><em>”</em></figcaption></figure>



<p>The problem is what teams typically build after the split: multiple agents running the same base model, differentiated only by system prompts, coordinating through message queues or shared files. The architecture looks like a team but behaves like a slow, redundant, expensive single agent with extra coordination overhead.</p>



<p>This happens because the decomposition addresses prompt complexity but not state management. Each subagent still maintains its own context independently. The coordination layer handles message delivery but not shared truth. The system has more agents but no better memory.</p>



<h3 class="wp-block-heading">The stakes: Agents are becoming enterprise infrastructure</h3>



<p>The stakes here extend beyond individual system reliability. Multi-agent architectures are becoming the default pattern for enterprise AI deployment.</p>



<p><a href="https://www.cs.cmu.edu/news/2025/agent-company" target="_blank" rel="noreferrer noopener">CMU&#8217;s AgentCompany benchmark</a> frames where this is heading: agents operating as persistent coworkers inside organizational workflows, handling projects that span days or weeks, coordinating across team boundaries, maintaining institutional context that outlasts individual sessions. The benchmark evaluates agents not on isolated tasks but on realistic workplace scenarios requiring sustained collaboration.</p>



<p>This trajectory means the memory problem compounds. A system that loses state between tool calls is annoying. A system that loses state between work sessions—or between team members—breaks the core value proposition of agent-based automation. The question shifts from &#8220;can agents complete tasks&#8221; to &#8220;can agent teams maintain coherent operations over time.&#8221;</p>



<h2 class="wp-block-heading">Context engineering doesn&#8217;t solve team coordination</h2>



<h3 class="wp-block-heading">Single-agent success doesn&#8217;t transfer</h3>



<p>The last two years produced genuine progress on single-agent reliability, most of it under the banner of context engineering. </p>



<p><a href="https://www.philschmid.de/context-engineering" target="_blank" rel="noreferrer noopener">Phil Schmid&#8217;s framing</a> captures the discipline: <strong>Context engineering </strong>means structuring what enters the context window, managing retrieval timing, and ensuring the right information surfaces at the right moment. This moved agent development from &#8220;write a good prompt&#8221; to &#8220;design an information architecture.&#8221; The results showed in production stability.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="288" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window.png" alt="Context window" class="wp-image-18127" style="width:611px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window-300x169.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 3. What goes into the context window of a single LLM-based agent</em></figcaption></figure>



<p><a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">Manus</a>, one of the few production agent systems with publicly documented operational data, demonstrates both the success and the limits. Their agents average 50 tool calls per task with 100:1 input-to-output token ratios. Context engineering made this viable—but context engineering assumes you control one context window.</p>



<p>Multi-agent systems break that assumption. Context must now be shared across agents, updated as execution proceeds, scoped appropriately (some agents need information others shouldn&#8217;t access), and kept consistent across parallel execution paths. The complexity doesn&#8217;t add linearly. Each agent&#8217;s context becomes a potential source of divergence from every other agent&#8217;s context, and the coordination overhead grows with the square of the team size.</p>



<h3 class="wp-block-heading">Context degradation becomes contagious</h3>



<p>The ways context fails are well-characterized for single agents. <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">Drew Breunig&#8217;s taxonomy</a> identifies four modes: <strong>overload</strong> (too much information), <strong>distraction</strong> (irrelevant information weighted equally with relevant), <strong>contamination</strong> (incorrect information mixed with correct), and <strong>drift</strong> (gradual degradation over extended operation). Good context engineering mitigates all of these through retrieval design and prompt structure.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="318" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality.png" alt="Four methods for ruining context quality" class="wp-image-18128" style="width:603px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality-300x186.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 4. How context degrades over time</em></figcaption></figure>



<p>Multi-agent systems make each failure mode contagious.</p>



<p><a href="https://research.trychroma.com/context-rot" target="_blank" rel="noreferrer noopener">Chroma&#8217;s research on context rot</a> provides the empirical mechanism. Their evaluation of 18 models—including GPT-4.1, Claude 4, and Gemini 2.5—shows performance degrading nonuniformly with context length, even on tasks as simple as text replication. The degradation accelerates when distractors are present and when the semantic similarity between query and target decreases.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="288" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot.png" alt="Context rot" class="wp-image-18129" style="width:591px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot-300x169.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 5. Conversations aren’t free—the context window can become a junkyard of prompts, outputs, tool calls, and metadata, failed attempts, and irrelevant information.</em></figcaption></figure>



<p>In a single-agent system, context rot degrades that agent&#8217;s outputs. In a multi-agent system, Agent A&#8217;s degraded output enters Agent B&#8217;s context as ground truth. Agent B&#8217;s conclusions, now built on a shaky foundation, propagate to Agent C. Each hop amplifies the original error. By the time the workflow completes, the final output may bear little relationship to the actual state of the world—and debugging requires tracing corruption through multiple agents&#8217; decision chains.</p>



<h3 class="wp-block-heading">More context makes things worse</h3>



<p>When coordination problems emerge, the instinct is often to give agents more context. Replay the full transcript so everyone knows what happened. Implement retrieval so agents can access historical state. Extend context windows to fit more information.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="319" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem.png" alt="How context quality becomes a problem" class="wp-image-18130" style="width:613px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem-300x187.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 6. Conversations aren’t free—the context window can become a junkyard of prompts, outputs, tool calls, and metadata, failed attempts, and irrelevant information.</em></figcaption></figure>



<p>Each approach introduces its own failure modes.</p>



<p>Transcript replay creates unbounded prompt growth with persistent error exposure. Every mistake made early in execution remains in context, available to influence every subsequent decision. Models don&#8217;t automatically discount old information that&#8217;s been superseded by newer updates.</p>



<p>Retrieval surfaces content based on similarity, which doesn&#8217;t necessarily correlate with decision relevance. A retrieval system might surface a semantically similar memory from a different task context, an outdated state that&#8217;s since been updated, or content injected through prompt manipulation. The agent has no way to distinguish authoritative current state from plausibly related historical noise.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="312" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based.png" alt="Transcript replay vs retrieval-based" class="wp-image-18131" style="width:607px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based-300x200.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 7. Both approaches lack explicit control over what becomes committed memory versus what should be discarded.</em></figcaption></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p><a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane&#8217;s work on bounded memory control</a> addresses this directly. The proposed Agent Cognitive Compressor maintains bounded internal state with explicit separation between what an agent can recall and what it commits to shared memory. The architecture prevents drift by making memory updates deliberate rather than automatic. The core insight: Reliability requires controlling what agents remember, not maximizing how much they can access.</p>



<h3 class="wp-block-heading">The economics are unsustainable</h3>



<p>Beyond reliability, the economics of uncoordinated multi-agent systems are punishing.</p>



<p>Return to the <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">Manus operational data</a>: 50 tool calls per task, 100:1 input-to-output ratios. At current pricing—context tokens running $0.30 to $3.00 per million across major providers—inefficient memory management makes many workflows economically unviable before they become technically unviable.</p>



<p><a href="https://www.anthropic.com/engineering/multi-agent-research-system" target="_blank" rel="noreferrer noopener">Anthropic&#8217;s documentation on its multi-agent research system</a> quantifies the multiplier effect. Single agents use roughly 4x the tokens of equivalent chat interactions. Multi-agent systems use roughly 15x tokens. The gap reflects coordination overhead: agents reretrieving information other agents already fetched, reexplaining context that should exist as shared state, and revalidating assumptions that could be read from common memory.</p>



<p>Memory engineering addresses costs directly. Shared memory eliminates redundant retrieval. Bounded context prevents payment for irrelevant history. Clear coordination boundaries prevent duplicated work. The economics of what to forget become as important as the economics of what to remember.</p>



<h2 class="wp-block-heading">Memory engineering provides the missing infrastructure</h2>



<h3 class="wp-block-heading">Why memory is infrastructure, not a feature</h3>



<p>Memory engineering isn&#8217;t a feature to add after the agent architecture is working. It&#8217;s infrastructure that makes coherent agent architectures possible.</p>



<p>The parallel to databases is direct. Before databases, multiuser applications required custom solutions for shared state, consistency guarantees, and concurrent access. Each project reinvented these primitives. Databases extracted the common requirements into infrastructure: shared truth across users, atomic updates that complete entirely or not at all, coordination that scales to thousands of concurrent operations without corruption.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="293" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory.png" alt="Multi-agent memory" class="wp-image-18132" style="width:595px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory-300x188.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 8. Memory types specific to multi-agent systems</em></figcaption></figure>



<p>Multi-agent systems need equivalent infrastructure for agent coordination. Persistent memory that survives sessions and failures. Consistent state that all agents can trust. Atomic updates that prevent partial writes from corrupting shared truth. The primitives are different—documents rather than rows, vector similarity rather than joins—but the role in the architecture is the same.</p>



<h3 class="wp-block-heading">The five pillars of multi-agent memory</h3>



<p>Production agent teams require five capabilities. Each addresses a distinct aspect of how agents maintain shared understanding over time.</p>



<h4 class="wp-block-heading">Pillar 1: Memory taxonomy</h4>



<p><strong>Memory taxonomy</strong> defines what kinds of memory the system maintains. Not all memories serve the same function, and treating them uniformly creates problems. Working memory holds transient state during task execution—the current step, intermediate results, active constraints. It needs fast access and can be discarded when the task completes. Episodic memory captures what happened—task histories, interaction logs, decision traces. It supports debugging and learning from past executions. Semantic memory stores durable knowledge—facts, relationships, domain models that persist across sessions and apply across tasks. Procedural memory encodes how to do things—learned workflows, tool usage patterns, successful strategies that agents can reuse. Shared memory spans agents, providing the common ground that enables coordination.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="263" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types.png" alt="Taxonomy of memory types" class="wp-image-18133" style="width:613px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types-300x169.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 9. Taxonomy of memory types</em></figcaption></figure>



<p>This taxonomy has grounding in cognitive science. <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane</a> draws on Complementary Learning Systems theory, which posits two distinct modes of learning: rapid encoding of specific experiences versus gradual extraction of structured knowledge. The human brain doesn&#8217;t maintain perfect transcripts of past events—it operates under capacity constraints, using compression and selective attention to keep only what&#8217;s relevant to the current task. Agents benefit from the same principle. Rather than accumulating raw interaction history, effective memory architectures distill experience into compact, task-relevant representations that can actually inform decisions.</p>



<p>The taxonomy matters because each memory type has different retention requirements, different retrieval patterns, and different consistency needs. Working memory can tolerate eventual consistency because it&#8217;s scoped to one agent&#8217;s execution. Shared memory requires stronger guarantees because multiple agents depend on it. Systems that don&#8217;t distinguish memory types end up either overpersisting transient state (wasting storage and polluting retrieval) or underpersisting durable knowledge (forcing agents to relearn what they should already know).</p>
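


<p>A minimal sketch of how that taxonomy might be made explicit in code follows; the field names and retention values are illustrative assumptions, not a standard schema.</p>



<pre class="wp-block-code"><code># Illustrative memory record distinguishing the types described above.
# Field names and retention defaults are assumptions, not a standard.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal, Optional

MemoryType = Literal["working", "episodic", "semantic", "procedural", "shared"]

@dataclass
class MemoryRecord:
    type: MemoryType
    content: str
    agent_id: str                       # which agent wrote it
    scope: str = "private"              # "private", "team", or a workflow ID
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    ttl_seconds: Optional[int] = 3600   # None means "keep indefinitely"

# Working memory is short-lived and private to one agent's execution...
step = MemoryRecord("working", "step 3 of 7: awaiting the billing export", "agent-a")
# ...while semantic memory is durable and shared across the team.
fact = MemoryRecord("semantic", "EU invoices settle in EUR", "agent-a",
                    scope="team", ttl_seconds=None)
</code></pre>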



<h4 class="wp-block-heading">Pillar 2: Persistence</h4>



<p><strong>Persistence</strong> determines what survives and for how long. Ephemeral memory lost when agents terminate is insufficient for workflows spanning hours or days—but persisting everything forever creates its own problems. The critical gap in most current approaches, as <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane</a> observes, is that they treat text artifacts as the primary carrier of state without explicit rules governing memory lifecycle. Which memories should become permanent record? Which need revision as context evolves? Which should be actively forgotten? Without answers to these questions, systems accumulate noise alongside signal. Effective persistence requires explicit lifecycle policies: Working memory might live for the duration of a task; episodic memory for weeks or months; and semantic memory indefinitely. Recovery semantics matter too. When an agent fails midtask, what state can be reconstructed? What&#8217;s lost? The persistence architecture must handle both planned retention and unplanned recovery.</p>
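


<p>One way to express those lifecycle policies in MongoDB is a TTL index per memory collection, as in this sketch; the collection names and retention windows are assumptions, not recommendations.</p>



<pre class="wp-block-code"><code># Sketch: retention policies expressed as TTL indexes (pymongo).
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["agent_memory"]

# Working memory: expires an hour after it was written.
db.working.create_index("created_at", expireAfterSeconds=60 * 60)

# Episodic memory: kept for 90 days to support debugging and learning.
db.episodic.create_index("created_at", expireAfterSeconds=90 * 24 * 60 * 60)

# Semantic memory: no TTL index. Durable knowledge is kept indefinitely and
# removed only by an explicit curation step, not by the clock.
</code></pre>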



<h4 class="wp-block-heading">Pillar 3: Retrieval</h4>



<p><strong>Retrieval</strong> governs how agents access relevant memory without drowning in noise. Agent memory retrieval differs from document retrieval in several ways. Recency often matters—recent memories typically outweigh older ones for ongoing tasks. Relevance is contextual—the same memory might be critical for one task and distracting for another. Scope varies by memory type—working memory retrieval is narrow and fast, semantic memory retrieval is broader and can tolerate more latency. Standard RAG pipelines treat all content uniformly and optimize for semantic similarity alone. Agent memory systems need retrieval strategies that account for memory type, recency, task context, and agent role simultaneously.</p>
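


<p>As a sketch of what folding memory type, recency, and task context in on top of semantic similarity can look like, consider a ranking function along these lines; the weights and half-life are arbitrary placeholders rather than tuned values.</p>



<pre class="wp-block-code"><code># Sketch: rank candidate memories by more than semantic similarity.
import math
from datetime import datetime, timezone

# Rough priors for how much each memory type should count; placeholders, not tuned.
TYPE_WEIGHT = {"working": 1.0, "shared": 0.9, "semantic": 0.7, "episodic": 0.4}

def score(memory: dict, similarity: float, workflow_id: str,
          half_life_s: float = 3600.0) -> float:
    """Rank a candidate memory by similarity, type, recency, and task match."""
    age = (datetime.now(timezone.utc) - memory["created_at"]).total_seconds()
    recency = math.exp(-age / half_life_s)                 # recent memories count more
    same_task = 1.0 if memory.get("workflow_id") == workflow_id else 0.3
    current = 0.0 if memory.get("superseded") else 1.0     # drop superseded state entirely
    return similarity * TYPE_WEIGHT.get(memory["type"], 0.5) * recency * same_task * current
</code></pre>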



<h4 class="wp-block-heading">Pillar 4: Coordination</h4>



<p><strong>Coordination</strong> defines the sharing topology. Which memories are visible to which agents? What can each agent read versus write? How do memory scopes nest or overlap? Without explicit coordination boundaries, teams either overshare—every agent sees everything, creating noise and contamination risk—or undershare—agents operate in isolation, duplicating work and diverging on shared tasks. The coordination model must match the agent team&#8217;s structure. A supervisor-worker hierarchy needs different memory visibility than a peer collaboration. A pipeline of sequential agents needs different sharing than agents working in parallel on subtasks.</p>
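


<p>One simple way to make those boundaries explicit is to tag every memory with a scope and check reads and writes against a policy table. This sketch assumes a supervisor-worker topology; the roles and scopes are illustrative.</p>



<pre class="wp-block-code"><code># Sketch: memory visibility for a supervisor-worker team.
# Roles, scopes, and the policy tables are illustrative assumptions.
READ_SCOPES = {
    "supervisor": {"plan", "research", "drafting", "shared"},
    "research_worker": {"research", "shared"},
    "drafting_worker": {"drafting", "shared"},
}
WRITE_SCOPES = {
    "supervisor": {"plan", "shared"},
    "research_worker": {"research"},
    "drafting_worker": {"drafting"},
}

def can_read(role: str, memory: dict) -> bool:
    return memory["scope"] in READ_SCOPES.get(role, set())

def can_write(role: str, scope: str) -> bool:
    return scope in WRITE_SCOPES.get(role, set())

# A research worker sees shared state and its own notes, but not the draft in progress.
assert can_read("research_worker", {"scope": "shared"})
assert not can_read("research_worker", {"scope": "drafting"})
</code></pre>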



<h4 class="wp-block-heading">Pillar 5: Consistency</h4>



<p><strong>Consistency</strong> handles what happens when memory updates collide. When Agent A and Agent B simultaneously update the same shared state with incompatible values, the system needs a policy. Optimistic concurrency with merge strategies works for many cases—especially when conflicts are rare and resolvable. Some conflicts require escalation to a supervisor agent or human operator. Some domains need strict serialization where only one agent can update certain memories at a time. Silent last-write-wins is almost never correct—it corrupts shared truth without leaving evidence that corruption occurred. The consistency model must also handle ordering: When Agent B reads a memory that Agent A recently updated, does B see the update? The answer depends on the consistency guarantees the system provides, and different memory types may warrant different guarantees.</p>
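


<p>Here is a sketch of the optimistic-concurrency case, using a version counter so that a stale write fails visibly instead of silently overwriting another agent&#8217;s update (pymongo; the field names are illustrative).</p>



<pre class="wp-block-code"><code># Sketch: optimistic concurrency on shared memory with a version counter.
from pymongo import MongoClient, ReturnDocument

shared = MongoClient("mongodb://localhost:27017")["agent_memory"]["shared"]

def update_if_unchanged(key: str, expected_version: int, new_value: dict) -> dict:
    """Apply the write only if the version this agent read is still current."""
    doc = shared.find_one_and_update(
        {"key": key, "version": expected_version},
        {"$set": {"value": new_value}, "$inc": {"version": 1}},
        return_document=ReturnDocument.AFTER,
    )
    if doc is None:
        # Another agent wrote first: re-read, merge, or escalate -- never overwrite silently.
        raise RuntimeError(f"conflict on {key}: version {expected_version} is stale")
    return doc
</code></pre>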



<p><a href="https://arxiv.org/abs/2402.03578" target="_blank" rel="noreferrer noopener">Han et al.&#8217;s survey of multi-agent systems</a> emphasizes that these represent active research problems. The gap between what production systems need and what current frameworks provide remains substantial. Most orchestration frameworks handle message passing well but treat memory as an afterthought—a vector store bolted on for retrieval, with no coherent model for the other four pillars.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="291" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard.png" alt="How persona, consensus, and whiteboard memory work together" class="wp-image-18134" style="width:612px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard-300x187.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 10. How persona, consensus, and whiteboard memory work together</em></figcaption></figure>



<h3 class="wp-block-heading">Database primitives that enable the pillars</h3>



<p>Implementing memory engineering requires a storage layer that can serve as unified operational database, knowledge store, and memory system simultaneously. The requirements cut across traditional database categories: You need document flexibility for evolving memory schemas, vector search for semantic retrieval, full-text search for precise lookups, and transactional consistency for shared state.</p>



<p>MongoDB provides these primitives in a single platform, which is why it appears across so many agent memory implementations—whether teams build custom solutions or integrate through frameworks and memory providers.</p>



<p><strong>Document flexibility</strong> matters because memory schemas evolve. A memory unit isn&#8217;t a flat string—it&#8217;s structured content with metadata, timestamps, source attribution, confidence scores, and associative links to related memories. Teams discover what context agents actually need through iteration. Document databases accommodate this evolution without schema migrations blocking development.</p>
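


<p>For illustration, a memory unit might look like the document below; the particular fields are assumptions rather than a fixed schema, which is exactly why schema flexibility matters.</p>



<pre class="wp-block-code"><code># Sketch: a memory unit as a structured document rather than a flat string.
# Field names and the shape of the links array are illustrative assumptions.
memory_unit = {
    "content": "Customer 811 escalated twice about invoice rounding",
    "type": "episodic",
    "scope": "support-team",
    "task_id": "task-42",
    "source": {"agent": "triage-agent", "tool": "crm.lookup"},
    "created_at": "2026-02-03T14:07:11Z",
    "confidence": 0.82,
    "embedding": [0.013, -0.094, 0.211],   # truncated for the example
    "links": [{"rel": "supersedes", "memory_id": "mem-7731"}],
    "superseded": False,
}</code></pre>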



<p><strong>Hybrid retrieval</strong> addresses the access pattern problem. Agent memory queries rarely fit a single retrieval mode: A typical query needs memories semantically similar to the current task <em>and</em> created within the last hour <em>and</em> tagged with a specific workflow ID <em>and</em> not marked as superseded. MongoDB Atlas Vector Search combines vector similarity, full-text search, and filtered queries in single operations, avoiding the complexity of stitching together separate retrieval systems.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image.png" alt="Hybrid search" class="wp-image-18135" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image.png 960w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image-300x169.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<p><strong>Atomic operations</strong> provide the consistency primitives that coordination requires. When an agent updates task status from pending to complete, the update succeeds entirely or fails entirely. Other agents querying task status never observe partial updates. This is standard MongoDB functionality—findAndModify, conditional updates, multidocument transactions—but it&#8217;s infrastructure that simpler storage backends lack.</p>
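


<p>A minimal sketch of that transition, assuming a hypothetical <code>tasks</code> collection and pymongo: The status flips from pending to complete in one conditional operation, and a second agent attempting the same transition simply gets nothing back.</p>



<pre class="wp-block-code"><code># Sketch: an atomic status transition. The task flips from "pending" to
# "complete" in one operation, or not at all; no agent ever observes a
# half-applied update. Collection and field names are illustrative.
from pymongo import ReturnDocument

def complete_task(db, task_id, result_summary):
    return db.tasks.find_one_and_update(
        {"_id": task_id, "status": "pending"},            # only if still pending
        {"$set": {"status": "complete", "result": result_summary}},
        return_document=ReturnDocument.AFTER,
    )

# Returns the updated document, or None if another agent already completed it.</code></pre>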



<p><strong>Change streams</strong> enable event-driven architectures. Applications can subscribe to database changes and react when relevant state updates, rather than polling. This becomes a building block for memory systems that need to propagate updates across agents.</p>
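


<p>A small sketch of that pattern, assuming pymongo and a replica set or Atlas cluster (change streams require one); the collection and field names are illustrative.</p>



<pre class="wp-block-code"><code># Sketch: reacting to shared-memory writes through a change stream instead of
# polling. Collection and field names are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["agent_memory"]
pipeline = [{"$match": {"operationType": {"$in": ["insert", "update", "replace"]}}}]

with db.shared_state.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        doc = change.get("fullDocument") or {}
        if doc.get("scope") == "plan":
            # Propagate the update: for example, enqueue a refresh for worker agents.
            print("shared plan changed:", doc.get("_id"))</code></pre>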



<p>Teams implement memory engineering on MongoDB through three paths. Some build directly on the database, using the document model and search capabilities to create custom memory architectures matched to their specific coordination patterns. Others work through orchestration frameworks—LangChain, LlamaIndex, CrewAI—that provide MongoDB integrations for their memory abstractions. Still others adopt dedicated memory providers like Mem0 or Agno, which handle the memory logic while using MongoDB as the underlying storage layer.</p>



<p>The flexibility matters because memory engineering isn&#8217;t a single pattern. Different agent architectures need different memory topologies, different consistency guarantees, different retrieval strategies. A database that prescribes one approach would fit some use cases and break others. MongoDB provides primitives; teams compose them into the memory systems their agents require.</p>



<h2 class="wp-block-heading">Shared memory enables heterogeneous agent teams</h2>



<h3 class="wp-block-heading">Homogeneous systems can be replaced by single agents</h3>



<p>The deeper payoff of memory engineering is enabling agent architectures that wouldn&#8217;t otherwise be viable.</p>



<p><a href="https://arxiv.org/abs/2601.12307" target="_blank" rel="noreferrer noopener">Xu et al.</a> observe that many deployed multi-agent systems are so homogeneous—same base model everywhere, agents differentiated only by prompts—that a single model can simulate the entire workflow with equivalent results and lower overhead. Their OneFlow optimization demonstrates this by reusing KV cache across simulated &#8220;agents&#8221; within a single execution, eliminating coordination costs while preserving workflow structure.</p>



<p>The implication: If a single agent can replace your multi-agent system, you haven&#8217;t built a team. You&#8217;ve built an expensive way to run one model.</p>



<h3 class="wp-block-heading">Small models need external memory to coordinate</h3>



<p>Genuine multi-agent value comes from heterogeneity. Different models with different capabilities operating at different price points for different subtasks. <a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">Belcak et al.</a> make the case that most work agents do in production isn&#8217;t complex reasoning—it&#8217;s routine execution of well-defined operations. Parsing a response, formatting an output, invoking a tool with specific parameters. These tasks don&#8217;t require frontier model capabilities, and the cost difference is dramatic: Their analysis puts the gap at 10x–30x between serving a 7B parameter model and a 70–175B parameter model when you factor in latency, energy, and compute. Large models should be reserved for the genuinely hard problems, not deployed uniformly across every step.</p>



<p><a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">Belcak et al.</a> also highlight an operational advantage: Smaller models can be retrained and adapted much faster. When an agent needs new capabilities or exhibits problematic behaviors, the turnaround for fine-tuning a 7B model is measured in hours, not days. This connects to memory engineering because fine-tuning represents an alternative to retrieval—you can bake procedural knowledge directly into model weights rather than surfacing it from external storage at runtime. The choice between the procedural memory pillar and model specialization becomes a design decision rather than a constraint.</p>



<p>This architecture—small models by default, large models for hard problems—depends on shared memory. Small models can&#8217;t maintain the context required for coordination on their own. They rely on external memory to participate in larger workflows. Memory engineering makes heterogeneous teams viable; without it, every agent must be large enough to maintain full context independently, which defeats the cost optimization that motivates heterogeneity in the first place.</p>



<h2 class="wp-block-heading">Building the foundation</h2>



<p>Multi-agent systems fail for structural reasons: context degrades across agents, errors propagate through shared interactions, costs multiply with redundant operations, and state diverges when nothing enforces consistency. These problems don&#8217;t resolve with better prompts or more sophisticated orchestration. They require infrastructure.</p>



<p>Memory engineering provides that infrastructure through a coherent taxonomy of memory types, persistence with explicit lifecycle rules, retrieval tuned to agent access patterns, coordination that defines clear sharing boundaries, and consistency that maintains shared truth under concurrent updates.</p>



<p>The organizations that make multi-agent systems work in production won&#8217;t be distinguished by agent count or model capability. They&#8217;ll be the ones that invested in the memory layer that transforms independent agents into coordinated teams.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">References</h2>



<p>Anthropic. &#8220;Building a Multi-Agent Research System.&#8221; 2025. <a href="https://www.anthropic.com/engineering/multi-agent-research-system" target="_blank" rel="noreferrer noopener">https://www.anthropic.com/engineering/multi-agent-research-system</a></p>



<p>Belcak, Peter, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. &#8220;Small Language Models are the Future of Agentic AI.&#8221; arXiv:2506.02153 (2025). <a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.02153</a></p>



<p>Bousetouane, Fouad. &#8220;AI Agents Need Memory Control Over More Context.&#8221; arXiv:2601.11653 (2026). <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.11653</a></p>



<p>Breunig, Drew. &#8220;How Contexts Fail—and How to Fix Them.&#8221; June 22, 2025. <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html</a></p>



<p>Carnegie Mellon University. &#8220;AgentCompany: Building Agent Teams for the Future of Work.&#8221; 2025. <a href="https://www.cs.cmu.edu/news/2025/agent-company" target="_blank" rel="noreferrer noopener">https://www.cs.cmu.edu/news/2025/agent-company</a></p>



<p>Cemri, Mert, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. &#8220;Why Do Multi-Agent LLM Systems Fail?&#8221; arXiv:2503.13657 (2025). <a href="https://arxiv.org/abs/2503.13657" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2503.13657</a></p>



<p>Chroma Research. &#8220;Context Rot: How Increasing Context Length Degrades Model Performance.&#8221; 2025. <a href="https://research.trychroma.com/context-rot" target="_blank" rel="noreferrer noopener">https://research.trychroma.com/context-rot</a></p>



<p>Han, Shanshan, Qifan Zhang, Yuhang Yao, Weizhao Jin, and Zhaozhuo Xu. &#8220;LLM Multi-Agent Systems: Challenges and Open Problems.&#8221; arXiv:2402.03578 (2024). <a href="https://arxiv.org/abs/2402.03578" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2402.03578</a></p>



<p>LangChain Blog (Sydney Runkle). &#8220;Choosing the Right Multi-Agent Architecture.&#8221; January 14, 2026. <a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener">https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/</a></p>



<p>Manus AI. &#8220;Context Engineering for AI Agents: Lessons from Building Manus.&#8221; 2025. <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus</a></p>



<p>Schmid, Philipp. &#8220;Context Engineering.&#8221; 2025. <a href="https://www.philschmid.de/context-engineering" target="_blank" rel="noreferrer noopener">https://www.philschmid.de/context-engineering</a></p>



<p>Xu, Jiawei, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, and Ying Ding. &#8220;Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline.&#8221; arXiv:2601.12307 (2026). <a href="https://arxiv.org/abs/2601.12307" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.12307</a></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td>To explore memory engineering further, start experimenting with memory architectures using MongoDB Atlas or review our detailed tutorials available at <a href="https://www.mongodb.com/resources/use-cases/artificial-intelligence/?utm_campaign=devrel&amp;utm_source=cross-post&amp;utm_medium=cta&amp;utm_content=memory-for-multiagent-systems&amp;utm_term=mikiko.b&amp;utm_campaign=devrel&amp;utm_source=third-party-content&amp;utm_medium=cta&amp;utm_content=multi-agent-oreily&amp;utm_term=tony.kim" target="_blank" rel="noreferrer noopener">AI Learning Hub</a>.</td></tr></tbody></table></figure>
]]></content:encoded>
										</item>
		<item>
		<title>Control Planes for Autonomous AI: Why Governance Has to Move Inside the System</title>
		<link>https://www.oreilly.com/radar/control-planes-for-autonomous-ai-why-governance-has-to-move-inside-the-system/</link>
				<pubDate>Tue, 24 Feb 2026 12:16:14 +0000</pubDate>
					<dc:creator><![CDATA[Varun Raj]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18117</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-AI-shifts-to-runtime-architecture.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-AI-shifts-to-runtime-architecture-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Scaling agentic AI is forcing a shift from external policy to runtime architecture.]]></custom:subtitle>
		
				<description><![CDATA[For most of the past decade, AI governance lived comfortably outside the systems it was meant to regulate. Policies were written. Reviews were conducted. Models were approved. Audits happened after the fact. As long as AI behaved like a tool—producing predictions or recommendations on demand—that separation mostly worked. That assumption is breaking down. As AI [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>For most of the past decade, AI governance lived comfortably outside the systems it was meant to regulate. Policies were written. Reviews were conducted. Models were approved. Audits happened after the fact. As long as AI behaved like a tool—producing predictions or recommendations on demand—that separation mostly worked. That assumption is breaking down.</p>



<p>As AI systems move from assistive components to autonomous actors, governance imposed from the outside no longer scales. The problem isn’t that organizations lack policies or oversight frameworks. It’s that those controls are detached from where decisions are actually formed. Increasingly, the only place governance can operate effectively is inside the AI application itself, at runtime, while decisions are being made. This isn’t a philosophical shift. It’s an architectural one.</p>



<h2 class="wp-block-heading"><strong>When AI Fails Quietly</strong></h2>



<p>One of the more unsettling aspects of autonomous AI systems is that their most consequential failures rarely look like failures at all. Nothing crashes. Latency stays within bounds. Logs look clean. The system behaves coherently—just not correctly. An agent escalates a workflow that should have been contained. A recommendation drifts slowly away from policy intent. A tool is invoked in a context that no one explicitly approved, yet no explicit rule was violated.</p>



<p>These failures are hard to detect because they emerge from behavior, not bugs. Traditional governance mechanisms don’t help much here. Predeployment reviews assume decision paths can be anticipated in advance. Static policies assume behavior is predictable. Post hoc audits assume intent can be reconstructed from outputs. None of those assumptions holds once systems reason dynamically, retrieve context opportunistically, and act continuously. At that point, governance isn’t missing—it’s simply in the wrong place.</p>



<h2 class="wp-block-heading"><strong>The Scaling Problem No One Owns</strong></h2>



<p>Most organizations already feel this tension, even if they don’t describe it in architectural terms. Security teams tighten access controls. Compliance teams expand review checklists. Platform teams add more logging and dashboards. Product teams layer on additional prompt constraints. Each layer helps a little. None of them addresses the underlying issue.</p>



<p>What’s really happening is that governance responsibility is being fragmented across teams that don’t own system behavior end-to-end. No single layer can explain why the system acted—only that it acted. As autonomy increases, the gap between intent and execution widens, and accountability becomes diffuse. This is a classic scaling problem. And like many scaling problems before it, the solution isn’t more rules. It’s a different system architecture.</p>



<h2 class="wp-block-heading"><strong>A Familiar Pattern from Infrastructure History</strong></h2>



<p>We’ve seen this before. In early networking systems, control logic was tightly coupled to packet handling. As networks grew, this became unmanageable. Separating the control plane from the data plane allowed policy to evolve independently of traffic and made failures diagnosable rather than mysterious.</p>



<p>Cloud platforms went through a similar transition. Resource scheduling, identity, quotas, and policy moved out of application code and into shared control systems. That separation is what made hyperscale cloud viable. Autonomous AI systems are approaching a comparable inflection point.</p>



<p>Right now, governance logic is scattered across prompts, application code, middleware, and organizational processes. None of those layers was designed to assert authority continuously while a system is reasoning and acting. What’s missing is a control plane for AI—not as a metaphor but as a real architectural boundary.</p>



<h2 class="wp-block-heading"><strong>What “Governance Inside the System” Actually Means</strong></h2>



<p>When people hear “governance inside AI,” they often imagine stricter rules baked into prompts or more conservative model constraints. That’s not what this is about.</p>



<p>Embedding governance inside the system means separating decision execution from decision authority. Execution includes inference, retrieval, memory updates, and tool invocation. Authority includes policy evaluation, risk assessment, permissioning, and intervention. In most AI applications today, those concerns are entangled—or worse, implicit.</p>



<p>A control-plane-based design makes that separation explicit. Execution proceeds but under continuous supervision. Decisions are observed as they form, not inferred after the fact. Constraints are evaluated dynamically, not assumed ahead of time. Governance stops being a checklist and starts behaving like infrastructure.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="284" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems.jpg" alt="Execution from governance separation in AI systems" class="wp-image-18118" style="width:550px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems.jpg 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems-300x166.jpg 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 1. Separating execution from governance in autonomous AI systems</em></figcaption></figure>



<p>Reasoning, retrieval, memory, and tool invocation operate in the execution plane, while a runtime control plane continuously evaluates policy, risk, and authority—observing and intervening without being embedded in application logic.</p>
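


<p>As a rough sketch of that separation, the snippet below routes every tool invocation through a hypothetical control plane that evaluates policies at runtime and records why each action was allowed. It illustrates the pattern only; it is not a reference to any particular product, and all names are invented.</p>



<pre class="wp-block-code"><code># Sketch: separating decision execution from decision authority. Every tool
# invocation is submitted to a control plane that evaluates policy at runtime
# and can allow, escalate, or refuse. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    reason: str
    escalate: bool = False

class ControlPlane:
    def __init__(self, policies):
        self.policies = policies   # callables taking (agent, action, context), returning Decision
        self.audit_log = []

    def authorize(self, agent, action, context):
        for policy in self.policies:
            decision = policy(agent, action, context)
            self.audit_log.append((agent, action, decision))   # record why it was allowed
            if decision.escalate or not decision.allowed:
                return decision
        return Decision(True, "all policies passed")

def no_unapproved_billing_writes(agent, action, context):
    """Example policy: billing writes require explicit human approval."""
    if action == "billing.update" and context.get("human_approved") is not True:
        return Decision(False, "billing writes require human approval", escalate=True)
    return Decision(True, "ok")

def invoke_tool(control_plane, agent, tool, args, context):
    decision = control_plane.authorize(agent, tool.__name__, context)
    if decision.escalate:
        return {"status": "pending_human_review", "reason": decision.reason}
    if not decision.allowed:
        return {"status": "refused", "reason": decision.reason}
    return {"status": "ok", "result": tool(**args)}</code></pre>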



<h2 class="wp-block-heading">Where Governance Breaks First</h2>



<p>In practice, governance failures in autonomous AI systems tend to cluster around three surfaces.</p>



<p><strong>Reasoning</strong>. Systems form intermediate goals, weigh options, and branch decisions internally. Without visibility into those pathways, teams can’t distinguish acceptable variance from systemic drift.</p>



<p><strong>Retrieval</strong>. Autonomous systems pull in context opportunistically. That context may be outdated, inappropriate, or out of scope—and once it enters the reasoning process, it’s effectively invisible unless explicitly tracked.</p>



<p><strong>Action</strong>. Tool use is where intent becomes impact. Systems increasingly invoke APIs, modify records, trigger workflows, or escalate issues without human review. Static authorization models don’t map cleanly onto dynamic decision contexts.</p>



<p>These surfaces are interconnected, but they fail independently. Treating governance as a single monolithic concern leads to brittle designs and false confidence.</p>



<h2 class="wp-block-heading">Control Planes as Runtime Feedback Systems</h2>



<p>A useful way to think about AI control planes is not as gatekeepers but as feedback systems. Signals flow continuously from execution into governance: confidence degradation, policy boundary crossings, retrieval drift, and action escalation patterns. Those signals are evaluated in real time, not weeks later during audits. Responses flow back: throttling, intervention, escalation, or constraint adjustment.</p>



<p>This is fundamentally different from monitoring outputs. Output monitoring tells you what happened. Control plane telemetry tells you why it was allowed to happen. That distinction matters when systems operate continuously and consequences compound over time.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="512" height="306" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop.jpg" alt="" class="wp-image-18119" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop.jpg 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop-300x179.jpg 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 2. Runtime governance as a feedback loop</em></figcaption></figure>



<p>Behavioral telemetry flows from execution into the control plane, where policy and risk are evaluated continuously. Enforcement and intervention feed back into execution before failures become irreversible.</p>
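


<p>A compact sketch of that loop: behavioral signals such as confidence degradation, retrieval drift, and policy boundary crossings are evaluated continuously and mapped to responses such as throttling or escalation. The signal names and thresholds below are illustrative assumptions, not a standard.</p>



<pre class="wp-block-code"><code># Sketch: the feedback loop in miniature. Execution emits behavioral signals;
# the control plane evaluates them continuously and returns a response.
# Signal names and thresholds are illustrative assumptions.
def evaluate(signals):
    """Map rolling behavioral measurements to a runtime response."""
    if signals["policy_boundary_crossings"] >= 1:
        return "halt", "an action crossed an explicit policy boundary"
    if signals["retrieval_drift"] >= 0.4:
        return "escalate", "retrieved context is drifting from approved sources"
    if signals["confidence_degradation"] >= 0.5:
        return "throttle", "confidence is degrading; slow down and add review"
    return "continue", "within the normal operating envelope"

response, reason = evaluate({
    "confidence_degradation": 0.57,
    "retrieval_drift": 0.12,
    "policy_boundary_crossings": 0,
})
# response == "throttle": intervention happens while the consequence is still small.</code></pre>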



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading">A Failure Story That Should Sound Familiar</h2>



<p>Consider a customer-support agent operating across billing, policy, and CRM systems.</p>



<p>Over several months, policy documents are updated. Some are reindexed quickly. Others lag. The agent continues to retrieve context and reason coherently, but its decisions increasingly reflect outdated rules. No single action violates policy outright. Metrics remain stable. Customer satisfaction erodes slowly.</p>



<p>Eventually, an audit flags a noncompliant action. At that point, teams scramble. Logs show what the agent did but not why. They can’t reconstruct which documents influenced which decisions, when those documents were last updated, or why the agent believed its actions were valid at the time.</p>



<p>This isn’t a logging failure. It’s the absence of a governance feedback loop. A control plane wouldn’t prevent every mistake, but it would surface drift early—when intervention is still cheap.</p>



<h2 class="wp-block-heading">Why External Governance Can’t Catch Up</h2>



<p>It’s tempting to believe better tooling, stricter reviews, or more frequent audits will solve this problem. They won’t.</p>



<p>External governance operates on snapshots. Autonomous AI operates on streams. The mismatch is structural. By the time an external process observes a problem, the system has already moved on—often repeatedly. That doesn’t mean governance teams are failing. It means they’re being asked to regulate systems whose operating model has outgrown their tools. The only viable alternative is governance that runs at the same cadence as execution.</p>



<h2 class="wp-block-heading">Authority, Not Just Observability</h2>



<p>One subtle but important point: Control planes aren’t just about visibility. They’re about authority.</p>



<p>Observability without enforcement creates a false sense of safety. Seeing a problem after it occurs doesn’t prevent it from recurring. Control planes must be able to act—to pause, redirect, constrain, or escalate behavior in real time.</p>



<p>That raises uncomfortable questions. How much autonomy should systems retain? When should humans intervene? How much latency is acceptable for policy evaluation? There are no universal answers. But those trade-offs can only be managed if governance is designed as a first-class runtime concern, not an afterthought.</p>



<h2 class="wp-block-heading">The Architectural Shift Ahead</h2>



<p>The move from guardrails to control loops mirrors earlier transitions in infrastructure. Each time, the lesson was the same: Static rules don’t scale under dynamic behavior. Feedback does.</p>



<p>AI is entering that phase now. Governance won’t disappear. But it will change shape. It will move inside systems, operate continuously, and assert authority at runtime. Organizations that treat this as an architectural problem—not a compliance exercise—will adapt faster and fail more gracefully. Those that don’t will spend the next few years chasing incidents they can see but never quite explain.</p>



<h2 class="wp-block-heading">Closing Thought</h2>



<p>Autonomous AI doesn’t require less governance. It requires governance that understands autonomy.</p>



<p>That means moving beyond policies as documents and audits as events. It means designing systems where authority is explicit, observable, and enforceable while decisions are being made. In other words, governance must become part of the system—not something applied to it.</p>



<h2 class="wp-block-heading">Further Reading</h2>



<ul class="wp-block-list">
<li>“AI Governance Frameworks for Responsible AI,” Gartner Peer Community, <a href="https://www.gartner.com/peer-community/oneminuteinsights/omi-ai-governance-frameworks-responsible-ai-33q" target="_blank" rel="noreferrer noopener">https://www.gartner.com/peer-community/oneminuteinsights/omi-ai-governance-frameworks-responsible-ai-33q</a>.</li>



<li>Lauren Kornutick et al., “Market Guide for AI Governance Platforms,” Gartner, November 4, 2025, <a href="https://www.gartner.com/en/documents/7145930" target="_blank" rel="noreferrer noopener">https://www.gartner.com/en/documents/7145930</a>.</li>



<li>Svetlana Sicular, “AI’s Next Frontier Demands a New Approach to Ethics, Governance, and Compliance,” Gartner, November 10, 2025, <a href="https://www.gartner.com/en/articles/ai-ethics-governance-and-compliance" target="_blank" rel="noreferrer noopener">https://www.gartner.com/en/articles/ai-ethics-governance-and-compliance</a>.</li>



<li><a href="https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf" target="_blank" rel="noreferrer noopener"><em>AI Risk Management Framework (AI RMF 1.0)</em></a>, NIST, January 2023, <a href="https://doi.org/10.6028/NIST.AI.100-1" target="_blank" rel="noreferrer noopener">https://doi.org/10.6028/NIST.AI.100-1</a>.</li>
</ul>
]]></content:encoded>
										</item>
	</channel>
</rss>
