<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Thu, 07 May 2026 11:14:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>The Best Risk Mitigation Strategy in Data? A Single Source of Truth</title>
		<link>https://www.oreilly.com/radar/the-best-risk-mitigation-strategy-in-data-a-single-source-of-truth/</link>
				<comments>https://www.oreilly.com/radar/the-best-risk-mitigation-strategy-in-data-a-single-source-of-truth/#respond</comments>
				<pubDate>Thu, 07 May 2026 11:14:45 +0000</pubDate>
					<dc:creator><![CDATA[Jeremy Arendt]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18664</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-best-mitigation-risk.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-best-mitigation-risk-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Every data leader has a version of this story. A regulatory audit surfaces a metric that doesn&#8217;t match across systems. A board member catches conflicting revenue numbers in two reports presented back-to-back. An AI tool generates a recommendation based on data that hasn&#8217;t been governed since the analyst who built it left the company two [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Every data leader has a version of this story. A regulatory audit surfaces a metric that doesn&#8217;t match across systems. A board member catches conflicting revenue numbers in two reports presented back-to-back. An AI tool generates a recommendation based on data that hasn&#8217;t been governed since the analyst who built it left the company two years ago. The specifics change, but the pattern doesn&#8217;t: Somewhere in the stack, data risk turned into business risk, and nobody saw it coming.</p>



<p>In my <a href="https://www.oreilly.com/radar/the-trillion-dollar-problem/" target="_blank" rel="noreferrer noopener">first article</a>, I covered what a semantic layer is and why it matters. In my <a href="https://www.oreilly.com/radar/semantic-layers-in-the-wild/" target="_blank" rel="noreferrer noopener">second</a>, I spoke with early adopters about what happens when you actually build one. This piece tackles a different angle: The semantic layer as a risk mitigation strategy. Not risk in the abstract, compliance-framework sense, but the practical, operational risk that quietly drains organizations every day—bad numbers reaching decision-makers, sensitive data reaching the wrong people, and metric changes that never fully propagate.</p>



<h2 class="wp-block-heading"><strong>Three risks hiding in plain sight</strong></h2>



<p>Data risk tends to concentrate in three areas, and most organizations are exposed in all of them simultaneously.</p>



<p>The first is accuracy. Inaccurate data leading to bad decisions is the oldest problem in analytics, and it hasn&#8217;t gone away. It&#8217;s gotten worse. As organizations add more tools, more dashboards, and more AI-powered applications, the surface area for error expands. A revenue metric defined one way in a Tableau workbook, another way in a Power BI model, and a third way in a Python notebook isn&#8217;t just an inconvenience. It&#8217;s a liability. When leadership makes a strategic decision based on a number that turns out to be wrong—or, more commonly, based on a number that&#8217;s <em>one version</em> of right—the downstream consequences are real: misallocated resources, missed targets, eroded trust in the data team.</p>



<p>The second is governance and access. Most organizations have some framework for controlling who sees what data. In practice, those controls are scattered across warehouses, BI tools, individual dashboards, shared drives, and cloud storage buckets. Each system has its own permissions model, its own admin interface, and its own gaps. The result is a patchwork that&#8217;s expensive to maintain and nearly impossible to audit with confidence. Sensitive data finds its way into a dashboard it shouldn&#8217;t be in—not because someone acted maliciously, but because the governance surface area is too large to manage consistently.</p>



<p>The third is change management. A CFO decides that ARR should exclude trial customers starting next quarter. In theory, that&#8217;s a single metric change. In practice, it&#8217;s a scavenger hunt. That ARR calculation lives in a warehouse view, two Tableau workbooks, a Power BI model, an Excel report that someone on the FP&amp;A team maintains manually, and now the new AI analytics tool that pulls directly from the data lake. Some of those get updated. Some don&#8217;t. Three months later, someone notices the numbers don&#8217;t match and the cycle starts again. The risk isn&#8217;t that the change was wrong—it&#8217;s that the change was never fully implemented.</p>



<p>These three risks—accuracy, governance, and change management—aren&#8217;t independent. They compound. An ungoverned metric that&#8217;s defined inconsistently and can&#8217;t be updated in one place is a ticking clock. The question isn&#8217;t whether it causes a problem; it&#8217;s when.</p>



<h2 class="wp-block-heading"><strong>The legacy approach: more people, more tools, more problems</strong></h2>



<p>The traditional response to data risk has been to throw structure at it—and structure usually means people and process.</p>



<p>The most common pattern is the BI analyst as gatekeeper. Critical metrics, reports, and dashboards are managed by a centralized team. Need a new report? Submit a request. Need a metric change? Submit a request. Need to understand why two numbers don&#8217;t match? Submit a request and wait. This model exists because organizations don&#8217;t trust their data enough to let people self-serve, and for good reason—without a governed foundation, self-service creates chaos. But the gatekeeper model has its own costs. It&#8217;s slow. It creates bottlenecks. It&#8217;s expensive to staff. And performance is inconsistent—the quality of the output depends entirely on which analyst picks up the ticket and which tools they prefer.</p>



<p>Governance gets its own layer of complexity. Organizations deploy access controls across their data warehouse, BI platforms, file storage, and application layer—each with different permission models, administrators, and audit capabilities. Quality reporting, lineage, and business ownership tracking create additional tooling, complexity, and management overhead. Maintaining consistency across all of these systems is resource-intensive, and the more tools you add, the harder it gets. Most organizations know their governance has gaps. They just can&#8217;t find them all.</p>



<p>The combination of centralized BI teams and sprawling governance frameworks produces a predictable outcome: large, slow-moving data organizations that spend more time fixing and maintaining the infrastructure than actually delivering data or insight. When everything is managed manually across dozens of tools, problems don&#8217;t grow linearly—they grow exponentially. Every new dashboard, data source, or BI tool adds another surface to govern, another place where logic can diverge, another potential point of failure. The legacy approach doesn&#8217;t scale. It just gets more expensive.</p>



<h2 class="wp-block-heading"><strong>The semantic approach: govern once, access everywhere</strong></h2>



<p>The semantic layer offers a fundamentally different model for managing data risk. Instead of distributing control across every tool in the stack, it consolidates it.</p>



<p>Start with accuracy and change management because the semantic layer addresses both with the same mechanism: A single location for all metric definitions, business logic, and calculations. When ARR is defined once in the semantic layer, it&#8217;s defined once everywhere. Tableau, Power BI, Excel, Python, your AI chatbot—they all reference the same governed definition. When the CFO decides to exclude trial customers, that change happens in one place and propagates automatically to every downstream tool. No scavenger hunt. No version that got missed. No analyst discovering three months later that their workbook is still running the old logic. And when that same CFO wants to know how the metric was calculated several years ago? Semantic layers are version controlled by default, so historical definitions remain retrievable alongside the current ones.</p>
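

<p>To make the mechanism concrete, here is a minimal sketch of what a centrally governed metric definition might look like. <code>MetricDef</code> and its fields are hypothetical, invented for illustration; real semantic layer products each use their own schema, but the principle is the same: one definition, one place to change it.</p>



<pre class="wp-block-code"><code># A minimal, hypothetical sketch of a single source of truth for a metric.
# MetricDef is invented for illustration; real semantic layers each define
# their own schema, but the mechanism is the same.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDef:
    name: str
    expression: str            # the calculation, defined once
    filters: tuple = ()        # business rules live with the metric
    description: str = ""      # self-documenting metadata

# The CFO's change happens here, once. Every consumer (Tableau, Power BI,
# Excel, Python, an AI agent) references this definition on its next query.
ARR = MetricDef(
    name="arr",
    expression="SUM(subscription_value) * 12",
    filters=("customer_type != 'trial'",),
    description="Annual recurring revenue, excluding trial customers.",
)</code></pre>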



<p>This same centralization transforms governance. Instead of managing access controls across a warehouse, three BI platforms, a shared drive, and an application layer, organizations can align governance around the semantic layer itself. It becomes the single access point for governed data. Users connect to the semantic layer and pull data into the tool of their choice, but the permissions, definitions, and business logic are all managed in one place. The governance surface area shrinks from dozens of systems to one.</p>
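

<p>Sketched in the same hypothetical style, consolidated governance is one policy table and one check that every consuming tool passes through. <code>AccessPolicy</code> and <code>resolve</code> are invented names rather than any particular product&#8217;s API; the point is the shape: one permissions model to maintain, one to audit.</p>



<pre class="wp-block-code"><code># A hypothetical sketch of governance enforced at a single access point.
# AccessPolicy and resolve() are invented for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class AccessPolicy:
    metric: str
    allowed_roles: frozenset
    row_filter: str = ""       # optional row-level restriction, e.g. by region

POLICIES = {
    "arr": AccessPolicy("arr", frozenset({"finance", "exec"})),
}

def resolve(metric: str, role: str) -> AccessPolicy:
    """Single enforcement point: every tool (BI, Excel, AI agent) lands here."""
    policy = POLICIES[metric]
    if role not in policy.allowed_roles:
        raise PermissionError(f"role {role!r} may not query metric {metric!r}")
    return policy</code></pre>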



<p>But the semantic layer does something else that the legacy approach can&#8217;t: it makes data self-documenting. In a traditional environment, the context around data—what a metric means, why certain records are excluded, how a calculation works—lives in the heads of analysts, in scattered documentation, or nowhere at all. The semantic layer captures that context as structured metadata alongside the models, columns, and metrics themselves. Field descriptions, metric definitions, relationship mappings, business rules—all of it is documented where the data lives, not in a wiki that nobody updates. This is what makes genuine self-service possible. When the data carries its own context, users don&#8217;t need to submit a ticket to understand what they&#8217;re looking at (and AI agents can read it for contextual understanding at scale).</p>



<p>The practical result is a shift from centralized gatekeeping to federated, hub-and-spoke delivery. The semantic layer is the hub: governed, documented, consistent. The spokes are the teams and tools that consume it. A finance analyst pulls data into Excel. A data scientist queries it in Python. An AI agent accesses it via MCP. They all get the same numbers, definitions, and governance—without a centralized BI team manually ensuring consistency across every output.</p>



<h2 class="wp-block-heading"><strong>Risk reduction, not risk elimination</strong></h2>



<p>The semantic layer doesn&#8217;t eliminate data risk. The underlying data still needs to be clean, well-structured, and maintained—as every practitioner I&#8217;ve spoken with has confirmed, garbage in still produces garbage out. And organizational alignment around metric definitions requires leadership commitment that no software can substitute for.</p>



<p>But the semantic layer changes the <em>economics</em> of data risk. Instead of scaling risk management by adding more people and more governance tools, you reduce the surface area that needs to be managed. Fewer places where logic can diverge. Fewer systems to audit. Fewer opportunities for a metric change to get lost in translation. The problems don&#8217;t disappear, but they become containable—manageable in one place rather than scattered across the entire stack.</p>



<p>For organizations serious about AI-driven analytics, this matters more than ever. AI tools need governed, contextualized data to produce trusted outputs. The semantic layer provides that foundation—not just as a nice-to-have for consistency, but as critical risk infrastructure for an era where the cost of bad data is accelerating.</p>



<p>One definition. One access point. One place to govern. That&#8217;s not just a better architecture. It&#8217;s a better risk strategy.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-best-risk-mitigation-strategy-in-data-a-single-source-of-truth/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Eating My Own Dog Food: How I Used the Framework to Write the Post About the Framework</title>
		<link>https://www.oreilly.com/radar/eating-my-own-dog-food-how-i-used-the-framework-to-write-the-post-about-the-framework/</link>
				<comments>https://www.oreilly.com/radar/eating-my-own-dog-food-how-i-used-the-framework-to-write-the-post-about-the-framework/#respond</comments>
				<pubDate>Wed, 06 May 2026 12:59:17 +0000</pubDate>
					<dc:creator><![CDATA[Marc Millstone]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18655</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Eating-my-own-dog-food.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Eating-my-own-dog-food-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[The most honest test of any model is whether you&#039;d actually use it yourself]]></custom:subtitle>
		
				<description><![CDATA[In “Don’t Automate Your Moat,” I argue that engineering organizations should match AI autonomy to two independent dimensions: business risk and competitive differentiation. I used AI Gateway cost controls as a worked example throughout the piece because a single feature touches all four quadrants depending on which piece you&#8217;re building. A piece making that argument [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In “<a href="https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/" target="_blank" rel="noreferrer noopener">Don’t Automate Your Moat</a>,” I argue that engineering organizations should match AI autonomy to two independent dimensions: business risk and competitive differentiation. I used AI Gateway cost controls as a worked example throughout the piece because a single feature touches all four quadrants depending on which piece you&#8217;re building.</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="1320" height="731" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image.png" alt="Human involvement in programming" class="wp-image-18656" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image.png 1320w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-300x166.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-768x425.png 768w" sizes="(max-width: 1320px) 100vw, 1320px" /></figure>



<p>A piece making that argument should probably be written that way. Otherwise the framework is just rhetoric. So here is what actually happened: The same quadrants, applied to the writing of the post, followed by the two practices that cut across all of them.</p>



<h2 class="wp-block-heading"><strong>Full Automation: The citation mechanics</strong></h2>



<p>My post has eighteen footnotes, all of them needing consistent structure, working URLs, and clean formatting. This is the work the bottom-left quadrant exists for. If a URL is wrong, I fix it in the next pass and nobody outside the editing loop notices.</p>



<p>AI handled the mechanical assembly. I spot-checked.</p>



<h2 class="wp-block-heading"><strong>Collaborative Co-Creation: The AI Gateway example and the build-versus-buy framing</strong></h2>



<p>Two things sit in this quadrant.</p>



<p>The AI Gateway example. Using a single feature as a lens across all four quadrants was a product decision for the post. But the choice of <em>which</em> feature, and how to slice it, was recoverable. A weaker example, or one split across three features, would have cost me a draft. AI accelerated execution once I had settled on cost controls. I drove the design choice and interrogated the trade-off.</p>



<p>The build-versus-buy framing. This one was collaborative. Claude proposed the concept and the analogy: that the token-funded generation loop is functionally a procurement decision, not a build decision, even though the code lives in your repo. I saw what the framing could do for the structure of the argument, that it could link cognitive debt to competitive differentiation, survive a skeptical CTO reading it cold, and give the post a through-line that held the whole piece together. From there we worked it together. My phrase &#8220;a buy decision wearing a build costume&#8221; came out of that back-and-forth, and the structure of the argument got reshaped around the framing until it actually carried. Neither of us would have produced the final version alone. That is what this quadrant is supposed to look like.</p>



<p>In both cases, AI moved fast on execution. The judgment about whether the contribution fit, and what work it had to do in the surrounding argument, stayed with me. Flip the ratio and the post gets worse. Not catastrophically. Just generic in places where specificity was the whole point.</p>



<h2 class="wp-block-heading"><strong>Supervised Automation: The counterargument section</strong></h2>



<p><em>The research is thin. Most engineering work is maintenance and belongs in the automate quadrant regardless. Engineers can develop ownership of AI-generated code through study and iteration.</em></p>



<p>These are the objections any thoughtful reader would raise. Not because the post is anti-AI (it is not). The argument is that AI autonomy has to be matched with sufficient human understanding, and that argument has to defend itself against the case for letting AI run further with less. AI could draft the shape of those objections.</p>



<p>My job was verification. The bar was whether a thoughtful reader who disagrees with me would find the steelman fair. That meant tracing each concession back to make sure I was not giving away something I should have held, and each objection back to make sure I was representing the strongest version of the case rather than a convenient strawman.</p>



<p>The risk here is subtle. The section is unlikely to be flat-out wrong. The danger is that an unfair steelman quietly undermines the rest of the argument. A reader who notices a weak counterargument starts wondering what else is rigged. AI drafts, human verifies every path before merge.</p>



<h2 class="wp-block-heading"><strong>Human-Led Craftsmanship: The parts I owned outright</strong></h2>



<p>This is where most of the actual time went.</p>



<p><strong>The opening.</strong> The engineer who could not explain his own algorithm. The colleague paged about a service connected to a database nobody documented. Those examples were mine. The post only works if those scenes feel true, and a generated approximation of them would have read like exactly that. Not a risk worth delegating.</p>



<p><strong>Defining the dimensions.</strong> Naming risk and differentiation as the two axes is one thing. Defining them in a way that holds up under pressure is another. The prose that establishes what business risk actually means (blast radius if this fails, from an afternoon to the business itself), and what competitive differentiation actually means (not the brand or the sales team, but the architecture, the algorithms, the institutional judgment that shaped them), is what every quadrant boundary depends on. If those definitions are vague, the quadrants become Rorschach tests. If they are sharp, the quadrants do real work. I wrote and rewrote those passages until a reader could apply them to their own systems without me there to translate.</p>



<p><strong>The framework and the evidence behind it.</strong> The two-dimensional framing came out of my own thinking before Claude entered the loop. Once the dimensions existed, iterating with Claude on how to sharpen them was useful. It pushed me on where the dimensions overlapped and where the quadrant labels were doing too much work. But the seed had to be mine. A framework generated from a prompt would have read like one.</p>



<p>The evidence behind the framework worked the same way. I came in with a starter set of papers I already trusted: the METR productivity study, the MIT cognitive debt work, the Anthropic Fellows skill formation paper, the GitClear data on refactoring decline, and the Tilburg study on senior developer maintenance burden. Those were mine. From there, Claude expanded the research base, surfacing the Lancet endoscopy deskilling study, the OX Security and CodeRabbit and Apiiro analyses, and the survey work on LLM code generation in low-resource domains. That expansion was genuinely useful. It made the post broader and more current than what I would have assembled alone in the same time.</p>



<p>But expanding the source list is not the same as checking it. Every source Claude added had to be read against the specific claim it was being asked to support, because a framework is only as strong as the sources that anchor it. Generating a citation is mechanical. Reading a paper carefully enough to know what it proves, and whether the surrounding sentence reflects that, takes real time.</p>



<p>The Knight Capital loss figure was the clearest example. Different reports cite different numbers. The SEC enforcement order documents one figure. Bloomberg and other secondary sources round or reframe it. Claude pulled from whichever source it surfaced first on a given pass, and the number drifted across drafts. Catching that required going back to the primary source and pinning it.</p>



<p>The pattern repeated across other sources. Claude would attribute a claim to the right general area but the wrong specific paper. A finding about senior developer maintenance burden got mapped to a study that examined something adjacent but narrower. A claim about deskilling got pulled from a Lancet study that supports a more limited version of the argument than the way it had been phrased. Every structural source got reverified against what it actually proved. Several were corrected, replaced, or cut. Earlier drafts leaned on a real-world example whose causation was disputed in its sourcing. That example came out, and the Knight Capital section took its place because the SEC enforcement order documents the chain of causation directly.</p>



<p>This work could not be delegated. I had to own the mental model of what each paper actually proved and what it did not, the same way I had to own the mental model of the framework itself. The framework calls this the test of whether the engineer who built it could explain it in an incident review without looking at the code first. The writing equivalent is whether I could defend each citation in front of a skeptical reviewer without re-reading the abstract. The framework is the claim. The evidence is what makes it more than an opinion. Both had to be mine.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>That covers the quadrants. Two practices cut across all of them and deserve their own treatment.</p>



<h2 class="wp-block-heading"><strong>Using Claude as a critic</strong></h2>



<p>The most valuable thing Claude did on this post was push back. But you have to ask for it the right way.</p>



<p>Generic prompts produce generic critiques. &#8220;What do you think of this draft?&#8221; gets you a polite reaction with three suggestions. Useless. The prompt that actually works puts Claude in a specific adversarial seat. Mine looked roughly like this:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>You are a pro-AI, token-maxing CTO watching your team and your competitors ship faster every quarter. You have a deeper than average understanding of AI. Provide a thorough critique of this article focusing on logic, completeness, and correctness. Be direct. Be brutal. This is not about the author&#8217;s feelings. It is about creating the best argument possible.</em></p>
</blockquote>



<p>Three things make that prompt work. The persona is hostile to my thesis. The criteria are concrete: logic, completeness, correctness. And the explicit permission to be brutal lets the model drop the hedging it defaults to.</p>
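

<p>If you want that critique loop on demand rather than pasted into a chat window, it is a few lines against the API. The sketch below uses the Anthropic Python SDK; the model name is a placeholder, and the persona is abbreviated from the prompt above.</p>



<pre class="wp-block-code"><code># A minimal sketch of the adversarial-critic loop via the Anthropic Python SDK.
# The model name is a placeholder; use whatever model you actually work with.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CRITIC_PERSONA = (
    "You are a pro-AI, token-maxing CTO watching your competitors ship "
    "faster every quarter. Provide a thorough critique of this article "
    "focusing on logic, completeness, and correctness. Be direct. Be brutal."
)

def critique(draft: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",        # placeholder model name
        max_tokens=2000,
        system=CRITIC_PERSONA,          # the hostile persona does the work
        messages=[{"role": "user", "content": draft}],
    )
    return response.content[0].text

# Swap CRITIC_PERSONA for a senior editor to get structural critique instead.</code></pre>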



<p>Working that way surfaced things I would have missed. Claude flagged that an early draft conflated cognitive debt as a risk problem with cognitive debt as a differentiation problem, and that collapsing them weakened both. It pointed out that one of the original real-world failure examples did not actually demonstrate the failure mode I was claiming, because the causation was disputed in the source material. It caught a passage where I was asserting a conclusion the evidence supported only in a narrower form.</p>



<p>Some of the pushback I accepted and rewrote around. Some I rejected, because Claude was applying a generic objection that did not fit the specific argument. (During one critique, Claude told me, &#8220;This post is sound advice. It did not need sixteen footnotes to establish it.&#8221; Fair point, but a bold claim from the model that couldn&#8217;t count to 18.) The point was not to follow every note. The point was to have the notes at all. A solo writer with a deadline does not get a skeptical reviewer on demand. Working this way, I did.</p>



<p>The same prompt structure works for structural critique. Swap the hostile CTO for a senior editor, keep the criteria concrete (where does the flow break, what arrives too late, what is Part 2 failing to deliver that Part 1 set up), and Claude will interrogate the architecture of the argument the same way it interrogated the content. Pulling the build-versus-buy framing forward in the final draft, and tightening the bridge between the risk and differentiation sections, came directly out of running that prompt.</p>



<p>This is what the research describes when it talks about AI use that preserves understanding. Interrogative, not delegated. Claude was stress-testing the argument I had already written, not writing it for me, the way a good editor or a skeptical colleague would.</p>



<h2 class="wp-block-heading"><strong>Sounding like me, not like Claude</strong></h2>



<p>The hardest part of working with Claude on a post like this is not getting it to write. It&#8217;s getting it to stop writing like Claude. (Yes, I know. That&#8217;s the construction this section warns against. I, not Claude, wrote it on purpose.)</p>



<p>Models default to a recognizable voice. Em dashes everywhere. Rule-of-three lists at every cadence shift. <em>&#8220;Not just X, it&#8217;s Y&#8221;</em> as a reflexive contrast. Words like delve, leverage, robust, nuanced, comprehensive, pivotal. Transitions like moreover and furthermore. None of this is wrong, exactly. It is generic writing wearing a polished costume. Readers can feel it even when they cannot name it, and the moment they feel it, they stop trusting the argument.</p>



<p>The Redpanda voice is different. Smart, practical, playful, genuine. Short sentences mixed with long ones. Active voice. Plain English. The brand guide is explicit that we are not corporate, not academic, not polite-but-generic. If the post sounds like a polished bot, it has already failed before the argument starts.</p>



<p>The editing pass on voice was its own discipline, separate from the editing pass on argument or evidence. Claude would draft a paragraph that was structurally fine and full of tells. I would rewrite it. Forcing a sentence to sound like me usually meant cutting hedges, killing throat-clearing, and saying the thing directly. The corporate-academic register Claude defaults to is also the register that lets vague claims hide. Several places where the post is now sharper started as a voice fix that turned into a content fix.</p>



<p>A few of the patterns I usually cut survived in the final post. Two em dashes, one rule-of-three list, a &#8220;not X, but Y&#8221; construction. Each one earned its place. The em dashes carried a beat that commas would have flattened. The list of three was the cleanest way to render a specific argument without chopping it into fragments. The contrast was the only shape that made the claim land. The discipline is not avoiding the patterns absolutely. It is refusing to use them on autopilot.</p>



<p>The tagline was the purest version of this work. <em>Velocity is table stakes. Code is a commodity. Understanding is the edge.</em> That line went through more iterations than any other sentence in the post. Claude produced dozens of variants. None of them were quite right, because taste in a tagline is not a thing the model can verify for itself. The right version had to feel true to me when I read it out loud. The iteration was useful, but the judgment had to be mine.</p>



<h2 class="wp-block-heading"><strong>The takeaway</strong></h2>



<p>The parts I delegated most heavily were the parts where being wrong was cheapest. The parts I owned most tightly were the parts where being wrong would have cost the argument or the reader&#8217;s trust. The most useful thing Claude did was push back, stress-test the structure, and force me to defend the work I was claiming as mine.</p>



<p>The friction we hit, the drifting Knight Capital figure, the misattributed citations, the model&#8217;s instinct to write like a model, did not mean the tool failed. It meant that without an owner holding the mental model, the output would have looked clean and been quietly broken. The framework decided where to spend that ownership. I made that call deliberately, and the post reflects it.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/eating-my-own-dog-food-how-i-used-the-framework-to-write-the-post-about-the-framework/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Organization Is the Bottleneck</title>
		<link>https://www.oreilly.com/radar/the-organization-is-the-bottleneck/</link>
				<comments>https://www.oreilly.com/radar/the-organization-is-the-bottleneck/#respond</comments>
				<pubDate>Wed, 06 May 2026 10:09:05 +0000</pubDate>
					<dc:creator><![CDATA[Sarah Wells]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18659</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-organization-is-the-bottleneck.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-organization-is-the-bottleneck-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Everyone is adopting AI coding tools. Engineers are writing code faster than ever. But are organizations actually delivering value faster? That’s not obvious. I wrote Enabling Microservice Success with a big focus on engineering enablement, guardrails, automated testing, active ownership, and light touch governance. I didn&#8217;t know AI coding agents were coming, but it turns [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Everyone is adopting AI coding tools. Engineers are writing code faster than ever. But are organizations actually delivering value faster? That’s not obvious.</p>



<p>I wrote <a href="https://www.oreilly.com/library/view/enabling-microservice-success/9781098130787/" target="_blank" rel="noreferrer noopener"><em>Enabling Microservice Success</em></a> with a big focus on engineering enablement, guardrails, automated testing, active ownership, and light-touch governance. I didn&#8217;t know AI coding agents were coming, but it turns out that the practices that make microservices work long-term are exactly the foundations you need to make AI coding agents work too. If your organization is adopting these tools—and the evidence suggests we all are—the book covers how to build these foundations in detail.</p>



<p>I&#8217;m hearing very different experiences from different organizations, and what seems to make the difference is the level of maturity that the software engineering organization has. As <a href="https://services.google.com/fh/files/misc/2025_state_of_ai_assisted_software_development.pdf" target="_blank" rel="noreferrer noopener">the latest DORA report</a> puts it, “AI’s primary role in software development is to amplify. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones.”</p>



<p>A decade ago, I started building microservices at the <em>Financial Times</em>. It didn&#8217;t take long to realize that success wasn&#8217;t about the technology choices. Success was about getting the cultural and organizational setup right, because that&#8217;s what allows teams the autonomy to move fast. There&#8217;s no benefit to adopting microservices if your organization can only release code once a week: You&#8217;re paying the cost of a more complicated operational architecture but not benefiting from being able to ship changes frequently and with a high degree of confidence they won&#8217;t break something in some other part of your system.</p>



<p>The pattern with AI coding agents is strikingly similar. If you don&#8217;t have automated tests, or documentation, or CI/CD pipelines that support progressive delivery, you won&#8217;t succeed with microservices—and you won&#8217;t succeed with AI coding agents either. The organizations reporting the best results are the ones that already invested in the foundations.</p>



<p>Here are some of the specific parallels.</p>



<p><strong>Guardrails matter.</strong> When we moved to microservices, we learned quickly that you can&#8217;t just tell teams to &#8220;do the right thing&#8221; and hope for the best. You have to build paved roads and guardrails that help people to do the right thing automatically, so that autonomy doesn&#8217;t become chaos. AI coding agents need exactly the same approach. An agent with access to your codebase and no constraints is an autonomous team with no guardrails: it will move fast, but not necessarily in the right direction. If you&#8217;ve already built those guardrails for your teams—coding standards enforced in CI, architectural decision records, templates for new services—you have a serious head start because those same artifacts become the constraints that keep agents on track.</p>
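

<p>A guardrail does not have to be elaborate. Here is a minimal sketch of a merge gate that applies the same constraint to agent-written and human-written code alike; the specific tools (ruff, pytest) are examples rather than requirements.</p>



<pre class="wp-block-code"><code># A minimal sketch of a CI merge gate. The same guardrail applies whether
# the diff came from a person or a coding agent; tool choices are examples.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],   # coding standards, enforced rather than suggested
    ["pytest", "-q"],         # the automated tests every change must pass
]

def main() -> int:
    for cmd in CHECKS:
        print("guardrail:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            print("guardrail failed:", " ".join(cmd), file=sys.stderr)
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())</code></pre>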



<p><strong>Your deployment pipeline is your best safety net.</strong> Automated tests, progressive rollouts, zero-downtime deploys—these are the practices that catch mistakes before they reach production, whether the code was written by a human or by an AI. Observability matters here too: You wouldn&#8217;t run a microservice without logs, metrics, and traces, so why would you merge code you didn&#8217;t write yourself without the ability to understand what changed and why? And independent deployability gives you independent reversibility—when an AI agent makes a bad change to one service, you can roll it back without unwinding six other things. If we&#8217;re shipping code three times as fast with the help of AI agents, all of this becomes even more important.</p>



<p><strong>Engineering enablement is how you scale.</strong> Your platform team&#8217;s templates, libraries, and golden paths don&#8217;t just help developers: they become the constraints and context that make AI agents effective across your organization. The organizations that already invested in enablement are the ones finding it easiest to adopt AI coding tools. The ones that didn&#8217;t are finding that AI just amplifies the mess.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-organization-is-the-bottleneck/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Radar Trends to Watch: May 2026</title>
		<link>https://www.oreilly.com/radar/radar-trends-to-watch-may-2026/</link>
				<comments>https://www.oreilly.com/radar/radar-trends-to-watch-may-2026/#respond</comments>
				<pubDate>Tue, 05 May 2026 11:18:59 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides and Claude]]></dc:creator>
						<category><![CDATA[Radar Trends]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18652</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-8.png" 
				medium="image" 
				type="image/png" 
				width="1400" 
				height="950" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-8-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Developments in biology, robotics, web, and more]]></custom:subtitle>
		
				<description><![CDATA[The most significant tension in this issue is between two companies making different decisions about how to handle AI with frontier security capabilities. Anthropic restricted Claude Mythos to a small corporate cohort through Project Glasswing. OpenAI released GPT-5.5 to general availability, and some are calling it &#8220;Mythos-like hacking, open to all.&#8221; The AI Security Institute&#8217;s [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>The most significant tension in this issue is between two companies making different decisions about how to handle AI with frontier security capabilities. Anthropic restricted Claude Mythos to a small corporate cohort through <a href="https://www.anthropic.com/glasswing" target="_blank" rel="noreferrer noopener">Project Glasswing</a>. OpenAI released GPT-5.5 to general availability, and some are calling it &#8220;Mythos-like hacking, open to all.&#8221; The <a href="https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities" target="_blank" rel="noreferrer noopener">AI Security Institute&#8217;s evaluation</a> confirms the capability is real and consequential. How will you manage risk when the time between discovery of a vulnerability and exploitation collapses to zero?</p>



<p>Another important theme is that, in the words of <a href="https://thesequence.substack.com/p/the-sequence-radar-849-last-week" target="_blank" rel="noreferrer noopener"><em>The Sequence</em></a>, “AI is becoming operational.” It’s no longer about LLMs that can play games with words. It’s about tools that can automate processes across an enterprise: agents, of course, but more specifically agents that can be shared across a team to produce a consistent set of tools for the whole group.</p>



<h2 class="wp-block-heading">AI Models</h2>



<p>The open-weight model market is reshaping the economics of AI. This cycle brought at least 10 significant model releases or updates across open and closed providers, with pricing pressure coming from multiple directions. DeepSeek now performs nearly on par with Claude Opus 4.7 on coding benchmarks at a radically lower price; Alibaba, Google, Z.ai, and Moonshot AI all released capable open models as well. The <a href="https://hai.stanford.edu/ai-index/2026-ai-index-report" target="_blank" rel="noreferrer noopener">Stanford AI Index</a> documents this at scale. For organizations building on AI, the question is no longer whether open-weight alternatives are viable but which trade-offs they are willing to make on cost, portability, and support.</p>



<ul class="wp-block-list">
<li>Google has published a list of <a href="https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/gen-ai-business-use-cases/#automotive-logistics" target="_blank" rel="noreferrer noopener">1,302 real-world use cases for generative AI</a>. It&#8217;s very long and probably not worth reading on your own. However, you might want to point your agent at it.</li>



<li>OpenAI has <a href="https://openai.com/index/introducing-chatgpt-images-2-0/" target="_blank" rel="noreferrer noopener">announced</a> GPT Images 2, its flagship model for generating images. The initial reaction is that it&#8217;s slightly better than Google&#8217;s Nano Banana. What distinguishes Images 2 is that it &#8220;thinks&#8221; before generating the image.</li>



<li>Anthropic used Claude to <a href="https://www.theneuron.ai/explainer-articles/anthropic-used-claude-to-beat-its-own-human-alignment-researchers/" target="_blank" rel="noreferrer noopener">work on some problems</a> in <a href="https://www.anthropic.com/research/automated-alignment-researchers" target="_blank" rel="noreferrer noopener">alignment research</a>. Claude outperformed the humans at lower cost. The problems were, admittedly, cherry-picked to be easily scoreable. But the experiment also demonstrated that a less capable model can supervise a stronger model.</li>



<li>Moonshot AI has <a href="https://www.kimi.com/blog/kimi-k2-6" target="_blank" rel="noreferrer noopener">released</a> Kimi K2.6, the latest in its series of open models. It also open-sourced the <a href="https://github.com/MoonshotAI/Kimi-Vendor-Verifier" target="_blank" rel="noreferrer noopener">Kimi Vendor Verifier</a>, a tool that tests the accuracy of vendors selling inference using Kimi.</li>



<li>Alibaba has <a href="https://qwen.ai/blog?id=qwen3.6-35b-a3b" target="_blank" rel="noreferrer noopener">released</a> Qwen3.6-35B-A3B, the latest model in its Qwen series. It’s a mixture-of-experts model with 3B active parameters. Simon Willison reports that it <a href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/" target="_blank" rel="noreferrer noopener">draws great flamingos</a>, if you consider that relevant.</li>



<li>Anthropic has <a href="https://www.anthropic.com/news/claude-opus-4-7" target="_blank" rel="noreferrer noopener">released</a> Claude Opus 4.7. The model is positioned as an intermediate step between Opus 4.6 and Claude Mythos Preview. Anthropic claims that 4.7 is better at multimodal work, including vision, instruction following, and memory use. Its new tokenizer increases the number of tokens that Claude uses. Because billing is based on tokens, that&#8217;s effectively a price increase. Simon Willison has <a href="https://simonwillison.net/2026/Apr/20/claude-token-counts/" target="_blank" rel="noreferrer noopener">built</a> a tool to compare the token usage of different models.</li>



<li>Google has <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/" target="_blank" rel="noreferrer noopener">announced</a> Gemini 3.1 Flash TTS, a text-to-speech model that gives extraordinary control over the speakers: accents, style, expression, and more.</li>



<li>Stanford&#8217;s <a href="https://hai.stanford.edu/ai-index/2026-ai-index-report" target="_blank" rel="noreferrer noopener">2026 AI Index Report</a> is out, with over 400 pages of data and analysis about the state of AI.</li>



<li>Meta&#8217;s refactored AI lab has <a href="https://about.fb.com/news/2026/04/introducing-muse-spark-meta-superintelligence-labs/" target="_blank" rel="noreferrer noopener">released</a> its first model, <a href="https://ai.meta.com/blog/introducing-muse-spark-msl/" target="_blank" rel="noreferrer noopener">Muse Spark</a>. It’s a multimodal model that has been designed for integration with Meta&#8217;s products. There will eventually be a Contemplating Mode for orchestrating agents.</li>



<li>DeepSeek has <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf" target="_blank" rel="noreferrer noopener">released</a> a preview version of <a href="https://api-docs.deepseek.com/news/news260424" target="_blank" rel="noreferrer noopener">DeepSeek-V4</a>, its latest open-weight model. It&#8217;s a large model (over 1T parameters) with performance very close to the frontier models, but (<a href="https://simonwillison.net/2026/apr/24/deepseek-v4/" target="_blank" rel="noreferrer noopener">as Simon Willison points out</a>) running it is very inexpensive.</li>



<li>OpenAI <a href="https://openai.com/index/introducing-gpt-5-5/" target="_blank" rel="noreferrer noopener">released</a> GPT-5.5, which some are calling “<a href="https://thenewstack.io/openai-chatgpt-gpt-5-5-security/" target="_blank" rel="noreferrer noopener">Mythos-like hacking, open to all</a>.” In addition to being its &#8220;smartest and most intuitive&#8221; model yet, OpenAI claims that it reduces token counts, thereby reducing cost. <a href="https://info.deeplearning.ai/gpt-5.5-outperforms-and-hallucinates-kimi-k2.6-leads-open-llms-ai-strains-climate-commitments-strategic-thinking-in-llms-vs.-humans?ecid=ACsprvuQXl5ReAY7VoslJhmFJi3n0oeQrf9lifsPILxRB0WbQj1GNuD2qiKRdgo1_aEvR5qdDAQl&amp;utm_campaign=The%20Batch&amp;utm_medium=email&amp;_hsmi=416678544&amp;utm_content=416678544&amp;utm_source=hs_email" target="_blank" rel="noreferrer noopener">Other sources report</a> that, while it scores highly on benchmarks, GPT-5.5 is markedly more likely to hallucinate and provide incorrect answers.</li>



<li><a href="http://z.ai" target="_blank" rel="noreferrer noopener">Z.ai&#8217;s GLM-5.1</a> is a new version of the open source GLM-5 model that has been optimized to perform well on long-running tasks.</li>



<li>Google has <a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank" rel="noreferrer noopener">released</a> Gemma 4, a new version of its family of open source models. The family includes a 31B version and a mixture-of-experts version with 26B parameters, 4B active. These are all reasoning models designed for agentic workflows. One model, Gemma 4 E4B, can run on <a href="https://github.com/google-ai-edge/gallery" target="_blank" rel="noreferrer noopener">iPhone and Android</a> devices.</li>
</ul>



<h2 class="wp-block-heading">Software Development</h2>



<p>Anthropic has clearly been winning the announcement race. Whether it’s also winning on performance is a different question. Claude Code was a favorite among developers until its performance slipped. Many switched to newly released Cursor 3, which puts an agentic interface front and center while relegating the IDE to the background. Anthropic&#8217;s public postmortem on Claude Code&#8217;s behavior regression is worth reading both for its specific findings and as a model for how AI providers should communicate quality issues to developers. And Cursor’s transformation from an IDE into an agent is a pattern we expect to see repeated across the industry.</p>



<ul class="wp-block-list">
<li>OpenAI has <a href="https://thenewstack.io/openai-shared-workspace-agents/" target="_blank" rel="noreferrer noopener">announced</a> “workspace agents.” Workspace agents can be shared across a team, while the agents we have so far are tied to individual productivity. They enable a team to collaborate on building shared tools to automate workflows.</li>



<li>Microsoft has <a href="https://techcommunity.microsoft.com/blog/microsoft365copilotblog/introducing-multi-model-intelligence-in-researcher/4506011" target="_blank" rel="noreferrer noopener">announced</a> two new tools, Critique and Council, that use Claude and GPT together to solve research problems. Their benchmark results show that the combination works better than any model used on its own.</li>



<li><a href="https://alash3al.github.io/stash/?_v01" target="_blank" rel="noreferrer noopener">Stash</a> is an open source memory layer that agent builders can use to connect their agents to models. We’re beginning to see an agentic stack that is composed of interchangeable modules.</li>



<li>Developers have been complaining about a regression in Claude Code&#8217;s behavior over the last few months. Anthropic has issued a <a href="https://www.anthropic.com/engineering/april-23-postmortem" target="_blank" rel="noreferrer noopener">response</a> explaining what happened and how they&#8217;re fixing it.</li>



<li><a href="https://glif.app/product" target="_blank" rel="noreferrer noopener">Glif</a> is an agent that tries to unify all the LLMs and tools at your disposal. You don&#8217;t have to decide which model or tool is best for each task; it makes the decision for you and gets the task done.</li>



<li>OpenAI has <a href="https://x.com/OpenAIDevs/status/2044466699785920937" target="_blank" rel="noreferrer noopener">decoupled</a> its agent harness from computing and storage, enabling durable long-running agents. The harness is now open source and can be customized through the Agents SDK.</li>



<li>Anthropic has announced Claude Code <a href="https://code.claude.com/docs/en/routines" target="_blank" rel="noreferrer noopener">routines</a>. A routine is a package that includes a prompt, a repository, and connectors that will run automatically on Anthropic&#8217;s infrastructure, either on a schedule or when triggered.</li>



<li>Anthropic also <a href="https://platform.claude.com/docs/en/managed-agents/overview" target="_blank" rel="noreferrer noopener">announced</a> Claude Managed Agents, a prebuilt harness for developing agents that run on Anthropic&#8217;s infrastructure. The harness provides most of the infrastructure that an agent needs (memory management, etc.) but can be configured for the user&#8217;s tasks. Anthropic’s goal appears to be becoming the <a href="https://thenewstack.io/anthropic-agents-managed-aws-claude/" target="_blank" rel="noreferrer noopener">AWS of agentic AI</a>: a service provider for tool builders.</li>



<li>Interoperability between tools, models, and plug-ins is allowing a <a href="https://thenewstack.io/ai-coding-tool-stack/" target="_blank" rel="noreferrer noopener">new programming stack</a> to develop: an orchestration layer, an execution layer, and a review layer.</li>



<li>Amazon has <a href="https://thenewstack.io/aws-unveils-bedrock-agentcore-to-scale-ai-agents-from-prototype-to-production/" target="_blank" rel="noreferrer noopener">launched</a> an <a href="https://thenewstack.io/aws-wants-to-register-your-ai-agents/" target="_blank" rel="noreferrer noopener">agent registry service</a> as part of AWS Bedrock AgentCore. Bedrock AgentCore is a collection of services that make it easy to build and deploy agents on AWS. The registry gives developers a way to discover third-party agents that might be useful to their work.</li>



<li>Bryan Cantrill&#8217;s essay on <a href="https://bcantrill.dtrace.org/2026/04/12/the-peril-of-laziness-lost/" target="_blank" rel="noreferrer noopener">laziness</a> is a must-read. AI isn&#8217;t lazy, and that&#8217;s a problem. When work costs nothing, there&#8217;s no need to think about future workers. Laziness is a virtue that we need to preserve.</li>



<li>Anthropic has <a href="https://www.anthropic.com/news/claude-design-anthropic-labs" target="_blank" rel="noreferrer noopener">announced</a> <a href="https://thenewstack.io/anthropic-claude-design-launch/" target="_blank" rel="noreferrer noopener">Claude Design</a>, a new tool for designers. It competes directly with Figma and Canva. It&#8217;s currently in &#8220;research preview.&#8221;</li>



<li>Perplexity has launched <a href="https://www.perplexity.ai/hub/blog/personal-computer-is-here" target="_blank" rel="noreferrer noopener">Personal Computer</a>, a local AI agent that runs on a dedicated Mac mini (Windows to come) and has persistent access to your files, native apps, inbox, and the web.</li>



<li>Anthropic has released a <a href="https://thenextweb.com/news/dario-amodei-london-united-kingdom" target="_blank" rel="noreferrer noopener">Claude plug-in for Microsoft Word</a>, targeting the legal market. Automated edits appear as tracked changes.</li>



<li><a href="https://github.com/run-llama/liteparse" target="_blank" rel="noreferrer noopener">LiteParse</a> is a command-line tool that extracts text from PDF files. If you&#8217;ve never needed to do that, you&#8217;ve lived a blessed life. Simon Willison has built a <a href="https://simonwillison.net/2026/Apr/23/liteparse-for-the-web/#atom-everything" target="_blank" rel="noreferrer noopener">web-based version</a> that runs LiteParse in the browser.</li>



<li>Luke Wroblewski has <a href="https://www.lukew.com/ff/entry.asp?2147" target="_blank" rel="noreferrer noopener">said</a> that designers should code; they need to understand their medium. But around 2014, heavyweight frameworks like React and Angular got in the way. Coding agents are now &#8220;collapsing the gap between designing and building.&#8221;</li>



<li><a href="https://cursor.com/blog/cursor-3" target="_blank" rel="noreferrer noopener">Cursor 3</a>, the letest release of Cursor, <a href="https://thenewstack.io/cursor-3-demotes-ide/" target="_blank" rel="noreferrer noopener">relegates its IDE to the background</a>. The main screen is designed for orchestrating agents. You can fall back to the IDE for editing code if you need to.</li>



<li>In the first quarter of 2026, Apple&#8217;s App Store saw a <a href="https://gizmodo.com/apple-app-store-experiences-surge-in-new-apps-amid-vibe-coding-boom-2000742653" target="_blank" rel="noreferrer noopener">huge (84%) increase in the number of new apps</a> compared with the first quarter of 2025. The cause is probably the ease of using AI to create new apps. Apple also appears to be limiting the use of &#8220;vibe coding&#8221; to create new apps, and has removed several vibe-coding apps from the App Store.</li>



<li>Anthropic accidentally leaked the source code for Claude Code, prompting waves of commentary. Two of the most interesting are <a href="https://www.shloked.com/writing/claude-code-source-patterns" target="_blank" rel="noreferrer noopener">Shlok Khemani&#8217;s tour</a> of what he found interesting in the source and <a href="https://www.linkedin.com/posts/gergelyorosz_this-is-either-brilliant-or-scary-anthropic-activity-7444752687247634432-V7Xg/" target="_blank" rel="noreferrer noopener">Gergely Orosz&#8217;s discussion</a> of the legal implications.</li>



<li>“<a href="https://thenewstack.io/hidden-agentic-technical-debt/" target="_blank" rel="noreferrer noopener">The Hidden Technical Debt of Agentic Engineering</a>” argues that, as with <a href="https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf" target="_blank" rel="noreferrer noopener">machine learning</a>, agents are relatively small parts of larger software systems, and that technical debt accumulates in all the supporting modules.</li>



<li>Chat is rarely the best interface for working with AI. Ethan Mollick <a href="https://www.oneusefulthing.org/p/claude-dispatch-and-the-power-of" target="_blank" rel="noreferrer noopener">writes</a> that the current generation of AI models and agents can create task-specific interfaces on the fly.</li>
</ul>



<h2 class="wp-block-heading">Security</h2>



<p>Security has been in the news a lot lately. Two core tools for secure private networking, Tor and Signal, have been attacked. In both cases, the attack didn’t involve the software or protocols themselves. These attacks teach us that secure systems are often jeopardized by the software that surrounds them. We’ve also seen that ransomware gangs are using postquantum encryption, and that quantum computers are likely to break traditional encryption sooner than expected. If you’re not investing in security, it’s time to start.</p>



<ul class="wp-block-list">
<li>The Tor network is the gold standard for secure private networking. Researchers recently discovered a <a href="https://fingerprint.com/blog/firefox-tor-indexeddb-privacy-vulnerability/" target="_blank" rel="noreferrer noopener">vulnerability</a> in Firefox browsers that lets attackers de-anonymize identities. The vulnerability has been fixed in Firefox 150, but it&#8217;s a reminder that anything can be attacked.</li>



<li>We all know that ransomware gangs use encryption. The Kyber group is making the transition to <a href="https://www.bleepingcomputer.com/news/security/kyber-ransomware-gang-toys-with-post-quantum-encryption-on-windows/" target="_blank" rel="noreferrer noopener">postquantum encryption</a>.</li>



<li>A <a href="https://www.bleepingcomputer.com/news/security/new-npm-supply-chain-attack-self-spreads-to-steal-auth-tokens/" target="_blank" rel="noreferrer noopener">supply chain attack against npm</a> allows bad actors to steal developers&#8217; credentials. Once it has infected a victim, it inserts itself into other packages that the victim publishes.</li>



<li>Law enforcement agencies were briefly able to exploit a vulnerability in iOS notifications that allowed them to <a href="https://techcrunch.com/2026/04/22/apple-fixes-bug-that-cops-used-to-extract-deleted-chat-messages-from-iphones/" target="_blank" rel="noreferrer noopener">access unencrypted messages</a> sent with the Signal secure messaging system. The vulnerability has been patched. It’s important to understand that the vulnerability wasn’t in Signal itself but in the environment in which it operated.</li>



<li>With AI, the time from discovery of a vulnerability to its exploitation has dropped effectively to zero. To help defenders catch up, Google has <a href="https://thenewstack.io/google-cloud-cat-mouse/" target="_blank" rel="noreferrer noopener">added three agents</a> to its Google Security Operations platform: Threat Hunting, Detection Engineering, and Third Party Context.</li>



<li>Microsoft reports that <a href="https://www.bleepingcomputer.com/news/security/microsoft-teams-increasingly-abused-in-helpdesk-impersonation-attacks/" target="_blank" rel="noreferrer noopener">criminals are increasingly using Teams</a> to impersonate help desk personnel, asking users for their credentials and then stealing data.</li>



<li>NIST has <a href="https://www.bleepingcomputer.com/news/security/nist-to-stop-rating-non-priority-flaws-due-to-volume-increase/" target="_blank" rel="noreferrer noopener">stopped assigning severity scores to lower-priority vulnerabilities</a>. All vulnerabilities will still be added to the National Vulnerability Database (NVD).</li>



<li>The NSA is <a href="https://www.axios.com/2026/04/19/nsa-anthropic-mythos-pentagon" target="_blank" rel="noreferrer noopener">using</a> Claude Mythos Preview, despite Anthropic being blacklisted by the Pentagon. Anyone want to guess what they&#8217;re using it for?</li>



<li>Anthropic will ask for <a href="https://support.claude.com/en/articles/14328960-identity-verification-on-claude" target="_blank" rel="noreferrer noopener">identity verification</a> in some cases.</li>



<li>Small open-weight models <a href="https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier" target="_blank" rel="noreferrer noopener">can do as well as</a> Anthropic&#8217;s Mythos at finding vulnerabilities. The key isn&#8217;t the model; it&#8217;s the system within which the model works.</li>



<li>A <a href="https://www.bleepingcomputer.com/news/security/hackers-use-pixel-large-svg-trick-to-hide-credit-card-stealer/" target="_blank" rel="noreferrer noopener">new malware campaign</a> embeds credit-card stealing software into a single pixel SVG image. ecommerce sites using Magento Open Source or Adobe Commerce are vulnerable.</li>



<li>Anthropic has pulled its newest model, <a href="https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf" target="_blank" rel="noreferrer noopener">Claude Mythos</a>, from broader release because <a href="https://www.businessinsider.com/anthropic-mythos-latest-ai-model-too-powerful-to-be-released-2026-4" target="_blank" rel="noreferrer noopener">it&#8217;s too good</a> at <a href="https://red.anthropic.com/2026/mythos-preview/" target="_blank" rel="noreferrer noopener">finding vulnerabilities</a> in other software. They&#8217;ve made it available to <a href="https://arstechnica.com/ai/2026/04/anthropic-limits-access-to-mythos-its-new-cybersecurity-ai-model/" target="_blank" rel="noreferrer noopener">a few corporations</a> via <a href="https://www.anthropic.com/glasswing" target="_blank" rel="noreferrer noopener">Project Glasswing</a>, an attempt to secure critical software before it can be exploited. The AI Security Institute’s <a href="https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities" target="_blank" rel="noreferrer noopener">analysis</a> of Claude Mythos Preview says that it “represents a step up over previous frontier models in a landscape where cyber performance was already rapidly improving.&#8221;</li>



<li><a href="https://simonwillison.net/2026/Apr/3/willy-tarreau/" target="_blank" rel="noreferrer noopener">Many</a> <a href="https://simonwillison.net/2026/Apr/3/daniel-stenberg/" target="_blank" rel="noreferrer noopener">open</a> <a href="https://lwn.net/Articles/1065620/" target="_blank" rel="noreferrer noopener">source</a> security maintainers agree with <a href="https://simonwillison.net/2026/Apr/3/greg-kroah-hartman/" target="_blank" rel="noreferrer noopener">Greg Kroah-Hartmann</a>&#8216;s report that the quality of AI-generated security bug reports has gone up tremendously.</li>



<li><a href="https://www.bleepingcomputer.com/news/security/claude-code-leak-used-to-push-infostealer-malware-on-github/" target="_blank" rel="noreferrer noopener">Versions of Claude Code that include the Vidar malware</a> have been published on GitHub. They are based on the code that Anthropic inadvertently leaked. These versions entice victims to download them by claiming to have unlocked enterprise features.</li>



<li>Claude has been used to <a href="https://www.bleepingcomputer.com/news/security/claude-ai-finds-vim-emacs-rce-bugs-that-trigger-on-file-open/" target="_blank" rel="noreferrer noopener">discover zero-day</a> remote code execution vulnerabilities in both Vim and Emacs. The vulnerabilities are triggered when a user opens a file. An update is available for Vim; Emacs developers argue that it&#8217;s really a bug in Git, which may be correct but misses the point.</li>



<li><a href="https://arstechnica.com/security/2026/03/new-quantum-computing-advances-heighten-threat-to-elliptic-curve-cryptosystems/" target="_blank" rel="noreferrer noopener">Breakthroughs in quantum computing</a> mean that computers capable of <a href="https://scottaaronson.blog/?p=9665" target="_blank" rel="noreferrer noopener">cracking current encryption algorithms</a> may be on the horizon.</li>
</ul>



<h2 class="wp-block-heading">Infrastructure and Operations</h2>



<p>Multiple providers released overlapping pieces of an agent stack this cycle, covering orchestration, persistence, memory, and registry services. A three-layer model (orchestration, execution, review) is becoming the standard architecture, but each vendor&#8217;s implementation makes different bets about portability and durability. It’s important to evaluate each vendor’s products carefully before settling on an agent stack.</p>



<ul class="wp-block-list">
<li>Microsoft now allows admins to <a href="https://www.bleepingcomputer.com/news/microsoft/microsoft-now-lets-admins-uninstall-copilot-on-enterprise-devices/" target="_blank" rel="noreferrer noopener">uninstall Copilot</a>, though there are conditions.</li>



<li>Google has <a href="https://arstechnica.com/ai/2026/04/google-unveils-two-new-tpus-designed-for-the-agentic-era/" target="_blank" rel="noreferrer noopener">announced</a> <a href="https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/eighth-generation-tpu-agentic-era/" target="_blank" rel="noreferrer noopener">two new eighth-generation TPUs</a>. One is designed for training (8t); the other specializes in inference (8i). This is the first time Google has produced separate TPUs specialized for training and inference.</li>



<li>Google has open-sourced <a href="https://googlecloudplatform.github.io/scion/overview/" target="_blank" rel="noreferrer noopener">Scion</a>, its testbed for agent orchestration.</li>



<li>Anthropic has agreed to buy <a href="https://thenextweb.com/news/anthropic-google-broadcom-compute-deal" target="_blank" rel="noreferrer noopener">3.5 gigawatts of computing power</a> from Google and Broadcom, maker of Google&#8217;s TPUs. The deal specifies power consumption rather than the number of chips, implying that the limiting factor isn&#8217;t computation but the availability of power. Chips come and go; watts are a constant.</li>



<li>Ollama now <a href="https://thenewstack.io/ollama-taps-apples-mlx/" target="_blank" rel="noreferrer noopener">uses</a> Apple&#8217;s MLX framework to improve performance on Apple silicon. Support is currently limited to <a href="https://ollama.com/library/qwen3.5" target="_blank" rel="noreferrer noopener">Qwen3.5-35B-A3B</a>; other models will follow. As part of this update, Ollama also uses NVIDIA&#8217;s <a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/" target="_blank" rel="noreferrer noopener">NVFP4</a> floating point format for model quantization.</li>
</ul>



<h2 class="wp-block-heading">Web</h2>



<p>Don&#8217;t overlook the web layer when planning for AI-driven disruption. The web&#8217;s infrastructure is older than most of the people who maintain it, and several items this cycle are reminders of the gap between what that infrastructure was designed for and how it is used today. Two deal with protocols that have outlasted their original assumptions; another reimagines the dominant CMS from scratch using current tooling.</p>



<ul class="wp-block-list">
<li>Is PHP the new COBOL? What about open source itself? “<a href="https://thenewstack.io/php-web-skills-hiring-age/" target="_blank" rel="noreferrer noopener">Who Will Maintain the Web When PHP&#8217;s Veterans Retire?</a>” points to a reality that we don&#8217;t like to think about. Not only are companies reluctant to hire junior developers; the ones they do hire aren&#8217;t learning older technologies.</li>



<li>Laravel is apparently <a href="https://techstackups.com/articles/laravel-raised-money-and-now-injects-ads-directly-into-your-agent/" target="_blank" rel="noreferrer noopener">injecting ads</a> for its commercial cloud service into agents. What happens when an open source framework receives venture funding? We&#8217;re about to find out.</li>



<li>Doesn’t every musician need tools to <a href="https://gregorio-project.github.io/" target="_blank" rel="noreferrer noopener">typeset Gregorian chant</a>?</li>



<li>Is <a href="https://www.ietf.org/archive/id/draft-thain-ipv8-00.html" target="_blank" rel="noreferrer noopener">IPv8</a> the future of the Internet? IPv6 has been &#8220;two years away&#8221; since early in the 1990s. IPv8 is fully backward compatible with IPv4, and resolves its security and address depletion issues.</li>



<li>Cloudflare has <a href="https://blog.cloudflare.com/emdash-wordpress/" target="_blank" rel="noreferrer noopener">released</a> EmDash, an alternative to WordPress based on how the web is used today. Drew Breunig calls this a <a href="https://www.dbreunig.com/2026/04/01/the-2nd-phase-of-agentic-development.html" target="_blank" rel="noreferrer noopener">reimagining</a>: a new phase of software development in which we can use agentic programming to rethink and reimplement tools based on current needs.</li>



<li><a href="https://isbgpsafeyet.com/" target="_blank" rel="noreferrer noopener">Is BGP Safe Yet?</a> is a web app that tests whether your ISP has implemented BGP (the protocol that’s responsible for routing packets at internet scale) correctly. Many haven&#8217;t.</li>
</ul>



<h2 class="wp-block-heading">Biology</h2>



<ul class="wp-block-list">
<li>OpenAI has <a href="https://arstechnica.com/science/2026/04/openai-starts-offering-a-biology-tuned-llm/" target="_blank" rel="noreferrer noopener">announced</a> GPT-Rosalind, a model that has been tuned for 50 common workflows in biology. Unlike most models, Rosalind has been tuned to be skeptical rather than enthusiastic or sycophantic. Access to Rosalind is limited because of the potential for harm.</li>
</ul>



<h2 class="wp-block-heading">Robotics</h2>



<ul class="wp-block-list">
<li>Spot, the Boston Dynamics robotic dog, can now read gauges and thermometers. It uses the <a href="https://deepmind.google/blog/gemini-robotics-er-1-6/" target="_blank" rel="noreferrer noopener">Gemini Robotics-ER 1.6</a> model, which can reason about visual information.</li>



<li><a href="https://www.mlb.com/news/abs-challenge-system-mlb-2026" target="_blank" rel="noreferrer noopener">Major League Baseball</a> is using a robotic system to rule on challenges to a human umpire&#8217;s ball/strike calls.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/radar-trends-to-watch-may-2026/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>How AI Swarms Are Disrupting Democracy</title>
		<link>https://www.oreilly.com/radar/how-ai-swarms-are-disrupting-democracy/</link>
				<comments>https://www.oreilly.com/radar/how-ai-swarms-are-disrupting-democracy/#respond</comments>
				<pubDate>Mon, 04 May 2026 11:43:58 +0000</pubDate>
					<dc:creator><![CDATA[Marco Camisani Calzolari]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18649</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/How-AI-Swarms-Are-Disrupting-Democracy.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/How-AI-Swarms-Are-Disrupting-Democracy-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Every day, millions of pieces of fake content are produced. Videos, audio clips, posts, articles, generated by artificial intelligence, distributed at industrial scale, aimed at shifting public opinion across entire countries. The people producing them are often outside the country being targeted. The people receiving them almost never know they&#8217;re fake. And they have no [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Every day, millions of pieces of fake content are produced. Videos, audio clips, posts, articles, generated by artificial intelligence, distributed at industrial scale, aimed at shifting public opinion across entire countries. The people producing them are often outside the country being targeted. The people receiving them almost never know they&#8217;re fake. And they have no idea how they&#8217;re made.</p>



<p>A few years ago, troll farms worked like this: entire buildings full of people, shifts, desks and workers paid to write posts, create fake profiles, comment and pick fights in online discussions. It was expensive, slow, and in the end, the real impact was marginal. Those buildings still exist today, mostly in India, split between teams specializing in scams and teams dedicated to disinformation. They work on commission and they&#8217;re mostly AI experts now. They no longer write the articles themselves and no longer do graphic design or photo editing. They have AI agents do everything: agents they create, configure, instruct, and supervise. Hundreds of thousands of autonomous agents that do in one hour what used to take weeks of human labor. Troll farms have become AI farms, producing synthetic content at industrial scale.</p>



<p>The report &#8220;<a href="https://www.cigionline.org/publications/from-trolls-to-generative-ai-russias-disinformation-evolution/" target="_blank" rel="noreferrer noopener">From Trolls to Generative AI: Russia&#8217;s Disinformation Evolution</a>,&#8221; published in February of 2026 by the Centre for International Governance Innovation (CIGI), tells one of these stories, specifically about disinformation campaigns originating from Russia. Networks like CopyCop, a disinformation operation linked to the GRU (Russian military intelligence), use uncensored open-source language models like modified versions of Llama 3, installed on their own servers, to transform press articles into political propaganda and distribute it across hundreds of fake websites without leaving a trace. Because the models run locally, there&#8217;s no watermark and no log. The model runs on their hardware, inside their borders, outside any Western jurisdiction.</p>



<p>The paper &#8220;<a href="https://www.science.org/doi/10.1126/science.adz1697" target="_blank" rel="noreferrer noopener">How malicious AI swarms can threaten democracy</a>,&#8221; published in Science in January 2026, describes what&#8217;s coming: coordinated swarms of AI agents with persistent identities, memory, and the ability to adapt in real time to people&#8217;s reactions. The authors call them &#8220;malicious AI swarms&#8221;: fully autonomous agents, each producing original content, each one different, each one adapted to context.</p>



<p>They can simulate real communities that appear credible, and they build what we can call synthetic consensus: the illusion that an opinion is widely shared, that a position is held by the majority, when in reality it&#8217;s a single operator speaking through thousands of masks.</p>



<p>It works because we humans have bugs too, and the swarms exploit them at a scale that was never possible before, or that would previously have required enormous human resources.</p>



<p>One bug is called the bandwagon effect. It combines with another bug, the illusory truth effect: Repetition plus apparent source independence equals perceived truth. So if we see the same position expressed by different sources, in different contexts, with different words, on different platforms, we register it as widespread. And if we perceive it as widespread, we consider it more credible. And if we consider it credible, we tend to align with it.</p>



<p>Swarms of autonomous agents exploit both mechanisms at the same time, at industrial scale.</p>



<p>What most people still haven&#8217;t grasped is the scale. We were used to automation: A system that sent a hundred thousand identical emails, at most changing the name and little else, or made just as many posts and similar comments with minor variations. It automated the publishing, but at its core it was recognizable spam. Our mental model is still that one: If it&#8217;s automated, it&#8217;s generic. If it&#8217;s generic, you can spot it.</p>



<p>But that&#8217;s a perception error built on years of experience when AI agents didn&#8217;t exist. That model is over. These agents no longer fit the concept of automation, because they make decisions and radically change the text based on the recipient. They aggregate data from heterogeneous sources in real time: social profiles, public records, leaked databases that you can now buy for a few dollars on any dark web marketplace. Billions of personal records are already out there, scattered across hundreds of breaches accumulated over the years, and AI can cross-reference them, reconcile them, and build a coherent profile of a single person in seconds.</p>



<p>The computational cost is negligible: a few cents in tokens to generate a perfectly personalized message. Consider that a single agent with access to a language model and a couple of leak databases can produce thousands of unique pieces of content per day, each calibrated for a different person. Multiply that by a hundred thousand agents working in parallel, twenty-four hours a day, and you have the scale of what&#8217;s happening.</p>



<p>Another legacy from the past: &#8220;I&#8217;m just an ordinary person, why would anyone bother creating content specifically to convince me?&#8221; That may once have been true. Today, nobody has to bother, because these agents don&#8217;t get tired, don&#8217;t sleep, and do nothing else: They find connections, aggregate data, and produce false content calibrated for each of us. The old demographic profiling is over. This is surgical media targeting at industrial scale.</p>



<p>But the capacity to respond and deny is not at industrial scale. If hundreds of thousands of coordinated agents spread a video of a politician saying something they never said, that politician can deny it all they want. The video is there. Millions of people have seen it. The denial arrives later, arrives slower, and will never reach the same scale. It arrives in a world where nobody knows what&#8217;s true anymore.</p>



<p>If the same swarms spread the news that a head of state has died, and the news is false, that head of state can make all the videos they want to prove they&#8217;re alive. Those videos will probably be dismissed as deepfakes. Because the swarm&#8217;s narrative got there first, took root, and at that point any evidence to the contrary looks fabricated.</p>



<p>Whoever controls the swarms today controls the version of the facts. Whoever tries to push back is already at a disadvantage because they have to prove that a real video is real in a world where everyone has learned that videos can be fake.</p>



<p>The attackers are often outside the country being hit. Groups aligned with governments that want to shift public opinion in another country, or that target specific demographics. Young people, for example, using platforms that are often owned by those very countries.</p>



<p>All of this is a massive threat to democracy because democracy operates on some premises, including that people form opinions based on real information, discuss with each other, and then decide. If the information is fabricated, if the debate is populated by entities that don&#8217;t exist, if the consensus we perceive is synthetic, that premise collapses. And with it, the entire mechanism. Elections become the result of who has the best swarms, not who has the best ideas. Public debate becomes a performance where most of the voices are generated, and public opinion stops being public and becomes the product of whoever has the resources to manufacture it.</p>



<p>We grew up thinking that threats to democracy came from coups, censorship, or regime propaganda broadcast on television or in national newspapers. Those were real threats, but they were at least visible. They were things you could identify and fight. Now the threat is bigger and, above all, invisible, personalized, and it operates inside the very channels we use to inform ourselves, to discuss, to participate. It contaminates information from within, to the point where nobody knows which voices are real and which are machines.</p>



<p>What can we do? Watermarking? Pattern detection? Unfortunately, they don&#8217;t work. The major AI platforms can embed markers in content generated by their models, true. But the people building autonomous swarms don&#8217;t use commercial platforms. They use open-source models with fine-tuning and capabilities that can&#8217;t be controlled from outside. And they often have no legal obligation to do anything because there are no global laws that can impose watermarking on every computer in the world. The result is paradoxical: The content produced by those who follow the rules stays marked, and the content produced by those who want to cause harm stays free.</p>



<p>Pattern detection systems have the same limits. They work for a while, then once the detection patterns are identified, the swarms adapt. They&#8217;re designed to do exactly that.</p>



<p>And the platforms where all of this circulates have a financial incentive to turn a blind eye. Internal Meta documents made public by Reuters in November 2025 estimated that roughly 10% of Meta&#8217;s global 2024 revenue, about $16 billion, came from advertising for scams and prohibited products. Fifteen billion high-risk ads served on average every day to users. The maximum revenue Meta was willing to sacrifice to act against suspicious advertisers was 0.15% of total revenue: $135 million out of $90 billion. When a platform&#8217;s business model depends on ad volume, removing the fraudulent ones has a cost that nobody wants to pay. I suspect Meta is not alone in this.</p>



<p>Regulation doesn&#8217;t solve this problem either. I&#8217;ve worked on the European AI framework, the GPAI task force, the Italian AI law, and I&#8217;ve brought my perspective to the UK Parliament. I&#8217;ve been in those rooms. Europe has the AI Act, a GPAI Code of Practice that is currently being drafted, and a regulatory apparatus that is more advanced than any other bloc&#8217;s in the world. The United States has no federal regulation, and twenty-eight states have tried to legislate with transparency requirements that amount to fine print. But even the most ambitious European framework has a structural limit: The attacks come from countries that answer to none of these rules. You can regulate your platforms, your developers, your companies. You can&#8217;t regulate a building in Saint Petersburg, Shenzhen, or New Delhi, where someone is instructing swarms of agents on open-source models running on local servers, outside any jurisdiction.</p>



<p>One way out is to return to the reputation of sources. Editors, news organizations, journalists with a name and a face. People and organizations that have a professional track record to defend and that risk something when they get it wrong. Sure, they can have political leanings and they can make mistakes. But they have a constraint that no AI agent will ever have: public accountability. A system that generates millions of pieces of false content answers to no one. An editor answers to their audience, to the law, to their reputation. That constraint is the only filter that still holds, and protecting it is the only thing we can do right now, while the laws try to catch up with a technology that moves faster than any legislative process in the world.</p>



<p>Are we completely at the mercy of AI swarms or can we fight back?</p>



<p>Machines should not get to overpower humans, especially when what&#8217;s at stake is how we govern ourselves. The antibodies exist. We need to activate them.<br><br>The more people understand how swarms work, the less effective they become. A swarm that manufactures fake consensus only works if the people receiving it don&#8217;t know synthetic consensus exists. A bit like deepfakes. We know about them now and we often spot them. Once you see how it works, it&#8217;s harder to fall for it.</p>



<p>Then we need investment in culture. In spreading digital literacy, which is not learning how to use a computer, but learning to understand the social and cultural effects of the digital world. It means teaching in schools how to verify a source and what the signs of manipulated content are. It means stopping the practice of treating media literacy as a school project and starting to treat it as democratic infrastructure, on the same level as bridges and hospitals. It means funding independent journalism instead of letting it die, strangled by the same mechanisms that reward false content because it generates more engagement. It means demanding that platforms give different visibility to those who have a verifiable reputation versus those who have none.</p>



<p>Because awareness is the only antibody that scales at the same speed as the threat. And unlike regulation or detection systems, awareness doesn&#8217;t need to be imposed. It can be built, taught, shared, and spread from person to person.<br><br>Before sharing a piece of content, check where it comes from. Before reacting to a video or a statement, stop. Ask yourself whether the source has a name, a history, something to lose. Treat every piece of content as potentially synthetic until a credible, accountable source confirms it. These are habits, not technologies. They cost nothing and they work immediately.</p>



<p>Finally, we need the help and collaboration of the tech community. Those who design platforms, write code, and make decisions about how feeds and ranking algorithms work are making choices that directly shape the information ecosystem. These are choices with democratic consequences. The people making them know it. Many have known it for years. This is the moment to stop treating it as someone else&#8217;s problem and to decide which side you&#8217;re on. Because the swarms are not waiting.</p>



<p>We can do this. The tools exist, the knowledge is there, and the threat is clear enough that pretending not to see it is already a choice. The question is whether we act now, while the window is still open, or later, when the damage will be harder to reverse.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/how-ai-swarms-are-disrupting-democracy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Local AI</title>
		<link>https://www.oreilly.com/radar/local-ai/</link>
				<comments>https://www.oreilly.com/radar/local-ai/#respond</comments>
				<pubDate>Fri, 01 May 2026 14:20:44 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides and Claude]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18641</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Running-LLMs-on-your-own-hardware.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Running-LLMs-on-your-own-hardware-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Running LLMs on your own hardware]]></custom:subtitle>
		
				<description><![CDATA[The release of Gemma 4 has added energy to the discussion of local models and their importance. Models that you can download and run on hardware you own are becoming competitive with the “frontier models” hosted by large AI providers. These models have gotten good enough for production use, good enough for tasks that until [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>The release of <a href="https://deepmind.google/models/gemma/gemma-4/" target="_blank" rel="noreferrer noopener">Gemma 4</a> has added energy to the discussion of local models and their importance. Models that you can download and run on hardware you own are becoming competitive with the “frontier models” hosted by large AI providers. These models have gotten good enough for production use, good enough for tasks that until recently required an API call to a frontier model. They are typically open weight (though not open source) and much smaller than the frontier models like Anthropic’s Claude.</p>



<p>The reasons for going local vary. For a financial services company, regulation may require that no sensitive data can leave the premises. For a developer in Europe, data sovereignty laws make cloud APIs awkward. For a developer in China, hardware constraints and geopolitics have made local, efficient models a practical necessity. For developers outside the US, the costs of using frontier models can be prohibitive. None of these reasons are new, but all of them are more urgent than they were a year ago, because the models are catching up.</p>



<h2 class="wp-block-heading">Why local?</h2>



<p>Reasons for running AI locally fall into a few categories: cost, privacy, performance, and control. Let me take them in order.</p>



<p>Cost is the easiest to quantify, though the numbers can be misleading. Developers using agentic tools for programming can spend $500 to $1,000 per month or more on API calls. NVIDIA CEO Jensen Huang has <a href="https://www.businessinsider.com/jensen-huang-500k-engineers-250k-ai-tokens-nvidia-compute-2026-3" target="_blank" rel="noreferrer noopener">suggested</a> that his engineers should spend an amount roughly equal to half their salary on AI tokens, given the productivity return. Whether or not you take that as prescriptive advice, it signals that token spending at scale is significant, which is exactly what makes the local alternative worth examining.</p>



<p>The hardware cost depends on where you&#8217;re starting. If you have a capable desktop already, dropping in an RTX 4070 (<a href="https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4070-family/" target="_blank" rel="noreferrer noopener">$500–$800 retail</a>) gets you a 12GB-VRAM GPU adequate for most local models. Building a dedicated system from scratch (CPU, motherboard, 32GB of RAM, storage, case, power supply, and GPU) runs closer to $1,500. Teams spending $500 a month on API calls break even in a few months. After that, local costs approach zero; electricity for a consumer GPU setup runs $20 to $40 a month. High-volume batch work makes the economics even clearer. Processing thousands of documents through a cloud API gets expensive fast; locally, it costs nothing but time.</p>



<p>For individual developers and small teams, the management overhead is minimal. A tool like <a href="https://ollama.com" target="_blank" rel="noreferrer noopener">Ollama</a> reduces running a local model to a background service; updating to a newer model is a single command, done on your own schedule. At enterprise scale the picture changes: Organizations that need production uptime guarantees, multiple developers sharing access, compliance logging, and dedicated engineering support face real overhead. A dedicated ML engineer runs $200,000 a year, and that’s noise compared to the cost of building or leasing AI infrastructure. For a solo developer or a two-person shop, that concern doesn&#8217;t apply.</p>
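


<p>To make the overhead claim concrete, here&#8217;s a minimal sketch of what querying a locally served model looks like. It assumes Ollama is running as a background service on its default port and that you&#8217;ve already pulled a model; the model name below is a placeholder for whatever you&#8217;ve installed.</p>



<pre class="wp-block-code"><code># A minimal sketch: query a model served locally by Ollama's REST API.
# Assumes the Ollama service is running on its default port (11434) and
# that a model has already been pulled (the name here is a placeholder).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3",   # placeholder: any model you've pulled locally
        "prompt": "In two sentences, why run an LLM locally?",
        "stream": False,     # return one JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])</code></pre>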



<p>Privacy arguments are often more compelling than cost. The concern isn&#8217;t primarily about bad actors at cloud providers; it&#8217;s about contracts, compliance, and control. GDPR and similar regulations create real constraints on where data can go. Healthcare and financial services companies have legal obligations that may effectively prohibit sending sensitive data to external APIs regardless of the provider&#8217;s security guarantees. Running a model locally means data stays on your hardware, under your control, with no possibility of inadvertent leakage to a third party. <a href="https://dockyard.com/blog/2025/04/10/the-business-case-for-local-ai" target="_blank" rel="noreferrer noopener">DockYard, writing about the business case for local AI</a>, puts it simply: Local models &#8220;keep sensitive data on-device, reducing exposure to breaches and unauthorized access&#8221; and simplify compliance with regulations that require strict data residency.</p>



<h2 class="wp-block-heading">The world beyond the US</h2>



<p>The strongest momentum behind local AI adoption comes from developers and organizations outside the United States. The reasons vary by region, but they&#8217;re structural everywhere.</p>



<p>European regulators have been skeptical of US-based cloud services since before the <a href="https://en.wikipedia.org/wiki/Schrems_I" target="_blank" rel="noreferrer noopener">first Schrems ruling</a> invalidated the Safe Harbor framework in 2015. The concern that US intelligence services can access data held by US companies, regardless of where that data is stored, has never been fully resolved, and recent US policy directions have amplified European anxieties. More countries, including China and many other Asian nations, are also developing their own data sovereignty laws. Locally run models sidestep the problem.</p>



<p>China has become a leading provider of open AI models. DeepSeek&#8217;s appearance as a major open-weight model family wasn&#8217;t an accident; it reflects a systematic investment in AI that emphasizes efficiency and openness over raw scale. As I&#8217;ve <a href="https://www.oreilly.com/radar/ai-in-china-and-the-united-states/" target="_blank" rel="noreferrer noopener">written elsewhere</a>, the Chinese approach to AI has been shaped in part by hardware constraints: When you can&#8217;t easily acquire NVIDIA’s fastest chips, you optimize your software instead. You use quantization. You build mixture-of-experts architectures that activate only a fraction of parameters per token. You design models that run well on the hardware you can actually get. The result is a generation of models that run efficiently on local hardware, and a developer community with expertise in building those models. While those techniques have been taken up by AI companies in the US, China clearly leads in efficient AI.</p>



<p>For application developers in India, Southeast Asia, Latin America, and Africa, cost is the most immediate barrier. Cloud API pricing denominated in dollars is expensive relative to local income levels in ways that matter for product economics, not just personal preference. Language is a deeper issue. Of the world&#8217;s 7,000-plus languages, only a few have enough textual data to train capable models, and both frontier and smaller open-weight models reflect that reality. A <a href="https://aclanthology.org/2025.acl-long.1572.pdf" target="_blank" rel="noreferrer noopener">survey of African languages</a> found pronounced performance gaps across models of all sizes. What open-weight models offer is the ability to fine-tune on local language data that the original training missed. A developer in Uganda building a health information tool, or a team in Malaysia building a customer service product, can take an open-weight base model and adapt it to the languages their users actually speak. That&#8217;s not possible with closed models.</p>



<p>The response has been a wave of regional model development. <a href="https://www.sarvam.ai/models" target="_blank" rel="noreferrer noopener">Sarvam</a> in India has open-sourced models trained on data emphasizing all 22 official Indian languages, released under Apache 2.0. <a href="https://sunbird.ai/" target="_blank" rel="noreferrer noopener">Sunbird AI</a> in Uganda built Sunflower, a family of models covering 31 Ugandan languages, developed in partnership with Makerere University and trained on digitized radio broadcasts and community texts. Singapore&#8217;s AI research group built <a href="https://fulcrum.sg/advancing-southeast-asias-ai-future-through-sovereign-ai-models/" target="_blank" rel="noreferrer noopener">SEA-LION</a>, tuned specifically for Southeast Asian languages and cultural contexts. Malaysia launched a domestically developed LLM, <a href="https://www.ytlailabs.com/" target="_blank" rel="noreferrer noopener">ILMU</a>, in August 2025.</p>



<p>Chinese open source models help to fill this gap. According to <a href="https://huggingface.co/blog/huggingface/state-of-os-hf-spring-2026" target="_blank" rel="noreferrer noopener">Hugging Face&#8217;s data</a>, Chinese models now account for a larger share of downloads on the platform than US models. Sunflower is built on Qwen; Malaysia&#8217;s NurAI, which targets 340 million speakers of Bahasa Melayu and related languages across the region, uses DeepSeek as its foundation. This isn&#8217;t ideology; it&#8217;s that Chinese open source models are efficient enough to run locally, permissively licensed, and increasingly well-suited to the multilingual fine-tuning these applications require.</p>



<p><a href="https://openrouter.ai/rankings" target="_blank" rel="noreferrer noopener">OpenRouter&#8217;s model usage rankings</a>, which track billions of API calls across many models, reflect the same reality. DeepSeek models and Qwen variants from Alibaba appear at the top of usage charts alongside offerings from OpenAI and Anthropic. (OpenRouter notes that raw token counts can be skewed by a few high-volume users; request counts give a more representative picture. Also note that rankings vary sharply day-to-day and week-to-week.) The frontier of capable AI is no longer exclusively American, and the application developers driving much of that usage are building for audiences that American tech companies have largely ignored.</p>



<h2 class="wp-block-heading">Performance</h2>



<p>When performance is an issue, the metric to watch depends on what you&#8217;re building. Time to first token matters most for interactive applications: how long before the model starts producing output. For a cloud API, that includes the network round trip (typically under 30 milliseconds to a major provider) plus server-side work: queuing, scheduling, and processing your prompt through the model before generation begins. For typical requests this can run to several hundred milliseconds in total, and longer when the server is under load. A local model starts processing immediately, with no queuing and no network hop, so time to first token is very low. For anything that feels like a conversation (a code assistant, a document tool, an interactive agent), that difference is perceptible.</p>



<p>Once generation starts, tokens per second is the metric to watch. Here, cloud providers have the advantage: Their infrastructure prioritizes inference, generating responses to prompts and API calls. A local model may feel faster to start and slower to finish than a well-provisioned cloud API.</p>
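


<p>Both metrics are easy to measure on your own hardware rather than taking anyone&#8217;s word for them. The sketch below streams a response from a local Ollama endpoint, timing the first chunk for time to first token and counting chunks for a rough tokens-per-second figure; the model name and prompt are placeholders, and the numbers will vary widely with your hardware.</p>



<pre class="wp-block-code"><code># A rough benchmark of time to first token (TTFT) and generation speed
# against a local Ollama endpoint. In Ollama's streaming API, each line
# is a JSON object carrying one chunk of output, roughly one token.
# Model and prompt are placeholders; assumes at least one token arrives.
import json
import time

import requests

start = time.perf_counter()
first = None
chunks = 0

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3", "prompt": "Explain BGP in one paragraph.",
          "stream": True},
    stream=True,
    timeout=300,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if chunk.get("response"):
            chunks += 1
            if first is None:
                first = time.perf_counter()   # the first token has arrived
        if chunk.get("done"):
            break

gen_time = time.perf_counter() - first
print(f"TTFT: {first - start:.3f}s")
print(f"~{chunks / gen_time:.1f} tokens/sec")</code></pre>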



<p>For agentic workflows that chain together many model calls, both factors matter. Network round trips accumulate: At 30 milliseconds each, a hundred sequential calls add three seconds of pure overhead before accounting for server-side processing, and the time-to-first-token overhead multiplies with every step. This is one reason local models have appeal for agentic applications, where the number of individual inference calls can be large.</p>



<p>High concurrency is a separate problem, and one where local deployment struggles. Consumer hardware handles one request at a time, or a few; a cloud provider scales horizontally. If your application serves many simultaneous users, local deployment requires either significant hardware investment or a different architecture.</p>



<h2 class="wp-block-heading">Fine-tuning for specific applications</h2>



<p>Applications where specialized domain knowledge matters are more common than people realize, and for all of them fine-tuning is a substantial advantage. A customer support model that knows your product deeply, a coding assistant tuned on your company&#8217;s codebase, a document processor fine-tuned on your industry&#8217;s vocabulary: These are things you can build and own with open models in ways you can&#8217;t with closed ones.</p>



<p>Developers frequently prototype an application on a frontier model, then move to a smaller or local model that has been fine-tuned for production. An early description of this practice appears in “<a href="https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-iii-strategy/" target="_blank" rel="noreferrer noopener">What We Learned from a Year of Building with LLMs</a>”: “Prototype with the most highly capable models before trying to squeeze performance out of weaker models.” The practice is also recommended by both <a href="https://platform.claude.com/docs/en/about-claude/models/choosing-a-model" target="_blank" rel="noreferrer noopener">Anthropic</a> and <a href="https://developers.openai.com/cookbook/examples/leveraging_model_distillation_to_fine-tune_a_model" target="_blank" rel="noreferrer noopener">OpenAI</a>, though they assume you will use their own smaller models, and they might get prickly around what they see as “distillation.”</p>



<p>Fine-tuning models is frequently associated with expensive AI experts, but it is gradually <a href="https://learning.oreilly.com/library/view/fine-tuning-ai/0642572310455/" target="_blank" rel="noreferrer noopener">becoming more accessible</a>. Techniques like <a href="https://arxiv.org/abs/2305.14314" target="_blank" rel="noreferrer noopener">QLoRA</a> allow fine-tuning a 7B or 8B parameter model on a consumer GPU with 12GB of VRAM. Tools like <a href="https://github.com/unslothai/unsloth" target="_blank" rel="noreferrer noopener">Unsloth</a> reduce VRAM requirements further while increasing throughput. The Hugging Face ecosystem (Transformers, Datasets, PEFT, TRL) provides additional tools for working with models. An individual developer or small team can adapt a base model to a specialized domain.</p>
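


<p>Here&#8217;s roughly what that accessibility looks like in practice: a minimal QLoRA sketch built on Transformers, PEFT, and TRL. The model ID, data file, and hyperparameters are placeholders, and these libraries&#8217; APIs change quickly, so treat this as the shape of the workflow rather than a recipe.</p>



<pre class="wp-block-code"><code># A minimal QLoRA sketch: fine-tune a 4-bit-quantized base model on a
# consumer GPU. Model ID, data file, and hyperparameters are placeholders;
# check the current TRL/PEFT docs, as these APIs evolve quickly.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "your-org/your-8b-base-model"   # placeholder

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                 # frozen base weights in 4-bit...
        bnb_4bit_quant_type="nf4",         # ...using QLoRA's NF4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

trainer = SFTTrainer(
    model=model,
    # Expects a JSONL file with a "text" field per example.
    train_dataset=load_dataset("json", data_files="domain_data.jsonl")["train"],
    peft_config=LoraConfig(                # only small adapters are trained
        r=16, lora_alpha=32, target_modules="all-linear",
        task_type="CAUSAL_LM",
    ),
    args=SFTConfig(
        output_dir="adapter",
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
)
trainer.train()
trainer.save_model("adapter")              # saves only the adapter weights</code></pre>



<p>The point of the 4-bit load is that the frozen base weights fit in consumer VRAM; the LoRA adapters being trained are a tiny fraction of the model and typically save out at well under a gigabyte.</p>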



<p>Cloud providers can&#8217;t easily offer this flexibility. You can fine-tune some closed models, but you&#8217;re working within the provider&#8217;s constraints at significant per-run cost, and the resulting model still lives on their hardware. Fine-tuning an open model produces something you own, that runs on your hardware, with no ongoing licensing fees and no dependency on a third party&#8217;s infrastructure decisions.</p>



<h2 class="wp-block-heading">Security</h2>



<p>The biggest advantage of a local model is that data stays local. There are no API endpoints to compromise, no cloud credentials to steal, no third-party infrastructure to go down during an outage. For regulated industries, this is often a decisive factor.</p>



<p>However, when you run a model on your own infrastructure, you take responsibility for the model’s security. Model creators make their own choices about safety and alignment before releasing a model. Base models (the foundation before instruction tuning and alignment) will comply with requests that a safety-tuned model would refuse; that&#8217;s a property of the model, not something you configure at runtime. When you choose a model to run locally you&#8217;re also choosing how much alignment work its creators did. Organizations need to evaluate this deliberately rather than assuming it&#8217;s handled.</p>



<p>The opacity of training data is a subtler concern. Because almost all open-weight models withhold their training datasets, you can&#8217;t audit the data on which the model was trained, making it hard to assess bias, verify that proprietary or regulated data wasn&#8217;t included, or detect benchmark contamination. For applications in regulated industries, this is a real gap.</p>



<p><a href="https://simonwillison.net/series/prompt-injection/" target="_blank" rel="noreferrer noopener">Prompt injection</a> is a threat that applies to any model. In a prompt injection attack, adversarial content in the model&#8217;s input overrides the system prompt and hijacks the model&#8217;s behavior. The malicious content can be in almost any form: text on a web page, invisible pixels in an image, and much more. The attack surface grows in agentic workflows, where models take actions based on content they retrieve from the web and other external sources. Frontier labs have made progress here: Anthropic has published research on RL-based injection hardening for agentic contexts, and OpenAI published the <a href="https://openai.com/index/the-instruction-hierarchy/" target="_blank" rel="noreferrer noopener">Instruction Hierarchy</a>, a training methodology that teaches models to assign differential trust to instruction sources. Neither technique has a known open-weight equivalent. That said, both labs have stated publicly that the problem is unlikely to be fully solved. The root cause is architectural: LLMs process instructions and data in the same token stream, and that&#8217;s not a bug that can be patched out.</p>



<p>Supply chain security is yet another concern. Hugging Face hosts hundreds of thousands of models, and most have not been audited for safety. Some are actively hostile. Downloading a model from an unknown source and running it on your hardware is analogous to running an arbitrary executable. Sticking with well-known models such as Gemma from Google, GLM from Zhipu, and DeepSeek from DeepSeek AI reduces this risk substantially. The well-known models aren&#8217;t risk-free, but they&#8217;re in a different category from the long tail of unvetted uploads.</p>



<h2 class="wp-block-heading">The current open model landscape</h2>



<p>Before getting into specific models, it&#8217;s important to distinguish between &#8220;open source&#8221; and &#8220;open weight.&#8221; They are not the same, and most of what gets called open source AI is actually only open weight. The <a href="https://opensource.org/ai/open-source-ai-definition" target="_blank" rel="noreferrer noopener">Open Source Initiative published a formal definition</a> of open source AI in October 2024, requiring not just open model weights but training code, training data provenance, and evaluation code—enough for a skilled person to reproduce the system.</p>



<p>By that standard, almost none of the headline models qualify. Most models only release the weights: the trained numerical parameters that make up the model itself, without the data or code that produced them. Without training data, you can fine-tune a model, but you can&#8217;t audit the model for bias or benchmark contamination. Without training code, you can&#8217;t reproduce or systematically improve it. The term &#8220;openwashing&#8221; has started circulating for models that claim openness while releasing only weights, and it&#8217;s warranted. For most developers, the practical question is what the license actually permits. Apache 2.0 and MIT licenses, which several of the major open-weight models now carry, are permissive enough for most commercial use.</p>



<p>As of early April 2026, <a href="https://deepmind.google/models/gemma/gemma-4/" target="_blank" rel="noreferrer noopener"><strong>Gemma 4</strong></a> from Google is the strongest open-weight model available. Like all the models here it releases weights only; training data and code are not disclosed. It comes in several sizes: compact 2B and 4B variants aimed at edge deployment, a 26B mixture-of-experts model that activates 4B parameters per token, and a 31B dense model suited for reasoning and fine-tuning. All variants handle images and video natively. For most developers looking for a locally runnable model right now, Gemma 4 is where to start.</p>



<p>The <a href="https://huggingface.co/zai-org" target="_blank" rel="noreferrer noopener"><strong>GLM series</strong></a> from Zhipu is underrated. The current release is GLM-5.1, with GLM-5 still widely used; both have large context windows and strong performance on reasoning tasks. The series has a particular focus on deep tool-assisted research workflows. This goes beyond what raw benchmark scores capture. For applications that involve sustained, complex work, such as legal document analysis, research synthesis, and multistage coding tasks, the GLM family is worth serious consideration.</p>



<p><a href="https://github.com/deepseek-ai/DeepSeek-V3" target="_blank" rel="noreferrer noopener"><strong>DeepSeek’s V4</strong></a> models are large, but they use a mixture-of-experts architecture to deliver high quality with a small active parameter count. DeepSeek’s <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1" target="_blank" rel="noreferrer noopener">R1</a> family ranges from 1.5B parameters to 671B. It has been specialized for reasoning and mathematical tasks. Training data and code have not been released for either V4 or R1. The community has launched an <a href="https://github.com/huggingface/open-r1" target="_blank" rel="noreferrer noopener">Open-R1 project</a> that attempts a full reproduction of DeepSeek-R1&#8217;s training from scratch.</p>



<p><a href="https://huggingface.co/Qwen" target="_blank" rel="noreferrer noopener"><strong>The Qwen series</strong></a> from Alibaba is capable across a range of tasks, multilingual, and licensed under Apache 2.0. Organizational changes have put its trajectory in question, though the open-weight releases of <a href="https://qwen.ai/blog?id=qwen3.6-27b" target="_blank" rel="noreferrer noopener">Qwen3.6-27B</a> and other models in the Qwen 3.6 family are encouraging.</p>



<p><a href="https://huggingface.co/moonshotai/Kimi-K2.6" target="_blank" rel="noreferrer noopener"><strong>Kimi K2.6</strong></a> <a href="https://www.kimi.com/blog/kimi-k2-6" target="_blank" rel="noreferrer noopener">from Moonshot AI</a> is worth knowing about, although running it is beyond the capabilities of most consumer hardware. It&#8217;s a one-trillion-parameter mixture-of-experts model with 32B active parameters per token, trained specifically for coding and agentic tasks. Aggressive quantization can bring Kimi’s VRAM requirements down to 24GB, but that&#8217;s the practical floor.</p>



<p><strong>Meta&#8217;s </strong><a href="https://ai.meta.com/blog/introducing-muse-spark-msl/" target="_blank" rel="noreferrer noopener"><strong>Muse Spark</strong></a> isn&#8217;t open but deserves a mention. Announced in early April 2026 and built by the newly formed Meta Superintelligence Labs under Alexandr Wang, Muse Spark is proprietary. Meta has a history of releasing open-weight models, so it&#8217;s possible something similar will follow for Muse Spark, but there&#8217;s no announcement, no timeline, and no guarantee. There has also been talk of smaller versions of Spark for edge devices.</p>



<p>If you want models that are genuinely open source by the OSI definition—training data, code, and weights all released—the options are more limited and less capable: <a href="https://allenai.org/olmo" target="_blank" rel="noreferrer noopener">Olmo</a> from the Allen Institute for AI is the most serious effort; the full Dolma training dataset, training code, and hundreds of intermediate checkpoints have been released. It&#8217;s a valuable resource for researchers, but it isn&#8217;t competitive with Gemma 4 or DeepSeek on capability.</p>



<p>Regardless of which model you&#8217;re considering, how do you know whether it&#8217;s good enough for your application? Published benchmarks are often misleading; they measure what the benchmark designers thought to measure, not necessarily what you need. A more reliable approach is building a &#8220;golden dataset&#8221;: a few hundred real prompts drawn from your actual use case, with known-good answers, against which you can evaluate any candidate model. It&#8217;s worth doing before committing to any model for production use.</p>
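


<p>A golden-dataset harness doesn&#8217;t need to be sophisticated to be useful. The sketch below assumes a local Ollama endpoint, a placeholder model name, and a JSONL file of prompt/expected pairs; it grades by substring matching, the crudest possible check, which you&#8217;d replace with regexes, structured comparison, or an LLM judge as your task demands.</p>



<pre class="wp-block-code"><code># A sketch of a golden-dataset evaluation loop against a local model.
# Assumes an Ollama endpoint, a placeholder model name, and a file
# golden.jsonl with one {"prompt": ..., "expected": ...} object per line,
# drawn from real traffic for your actual use case.
import json

import requests

def ask(model: str, prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]

cases = [json.loads(line) for line in open("golden.jsonl")]

# Substring matching is the simplest possible grader; swap in whatever
# check actually reflects "correct" for your application.
hits = sum(
    case["expected"].lower() in ask("gemma3", case["prompt"]).lower()
    for case in cases
)
print(f"{hits}/{len(cases)} golden answers matched")</code></pre>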



<h2 class="wp-block-heading">Choice and control</h2>



<p>The gap between frontier and open models is narrowing and, more to the point, seems less and less relevant as open models improve. Is it worth getting locked in to a cloud provider, giving up control of your data provenance, and losing the ability to fine-tune a model for an application in exchange for a few points on a benchmark that doesn’t reflect the real world? An increasing number of AI developers and users are concluding that it doesn’t. The regulatory environment in Europe, and the hardware constraints in China, are producing a global developer community with expertise in making local AI work.</p>



<p>None of this means that cloud AI is going away. The frontier closed models will remain ahead on raw capability, and there are applications where that matters. But the days when a US-based cloud API was the only serious option for capable AI are over. Local AI is increasingly capable, and for a growing fraction of what developers want to build, especially outside the United States, it&#8217;s a viable choice.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>If you want an introduction to using LLMs with open weights, join Christian Winkler on O’Reilly for the Open Weight Large Language Models Bootcamp on May 20 and 21. You’ll learn how to use models to retrieve information, combine the results of different models and refine the results with dense passage retrieval, discover how these models can excel on less powerful hardware by using new approaches to quantization, explore different frontends these models can be plugged into, and more in an interactive hands-on environment. O’Reilly members can register </em><a href="https://learning.oreilly.com/live-events/open-weight-large-language-models-bootcamp/0642572183783/" target="_blank" rel="noreferrer noopener"><em>here</em></a><em>.</em></p>



<p><em>Not a member? </em><a href="https://www.oreilly.com/start-trial/?type=individual" target="_blank" rel="noreferrer noopener"><em>Sign up for a free 10-day trial</em></a> <em>before the course to attend.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/local-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Everyone’s an Engineer Now</title>
		<link>https://www.oreilly.com/radar/everyones-an-engineer-now/</link>
				<comments>https://www.oreilly.com/radar/everyones-an-engineer-now/#respond</comments>
				<pubDate>Thu, 30 Apr 2026 15:59:33 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18622</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-1.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-1-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Takeaways from Cat Wu’s fireside chat with Addy Osmani]]></custom:subtitle>
		
				<description><![CDATA[Cat Wu leads product for Claude Code and Cowork at Anthropic, so she’s well-versed in building reliable, interpretable, and steerable AI systems. And since 90% of Anthropic’s code is now written by Claude Code, she’s also deeply familiar with fitting them into routine day-to-day work. Last month, Cat joined Addy Osmani at AI Codecon for [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Cat Wu leads product for Claude Code and Cowork at Anthropic, so she’s well-versed in building reliable, interpretable, and steerable AI systems. And since 90% of Anthropic’s code is now written by Claude Code, she’s also deeply familiar with fitting these tools into routine day-to-day work. Last month, Cat joined Addy Osmani at <a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener">AI Codecon</a> for a fireside chat on the future of agentic coding (and, equally important, agentic code review), how Anthropic actually uses the tools they&#8217;re building, and what skills matter now for developers.</p>



<h2 class="wp-block-heading">The feedback loop is itself a product</h2>



<p>Boris Cherny initially built Claude Code as a side project to test Anthropic’s APIs. Then he shared the tool in a notebook, and within two months the entire company was using it. That organic growth, Cat said, was part of what convinced the team it was worth releasing externally.</p>



<p>But what really made that internal adoption visible was the response on Anthropic&#8217;s internal “dog-fooding” Slack channel. The Claude Code channel gets a new message every 5 to 10 minutes around the clock, and this feedback directly and immediately informs the product experience. Cat described it this way:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>We hire for people who love polishing the user experience. And so a lot of our engineers actually live in this channel and find when there&#8217;s issues with new features that they&#8217;ve worked on and they proactively lay out the fixes.</p>
</blockquote>



<p>The team ships new versions of Claude Code to internal users many times a day. The feedback loop is tight enough that it functions as a continuous integration system for product quality, not just code quality.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="From Boris&#039;s Notebook to the Whole Company with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/wo_CbgoyFLY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Cat told Addy how she once accidentally introduced a small interaction bug between prompts and auto-suggestions. But by the time she started working on a solution, she found another team member had already beaten her to it. It turns out he had set up a scheduled task in Claude Code to scan the feedback channel for anything that hadn&#8217;t been responded to in 24 hours and open a PR for it. Since Cat hadn’t gotten to it yet (whoops!), her teammate’s Claude saw the unaddressed issue and fixed it for her. And Cat only found out when “[her own] Claude noticed that his Claude had already landed a change.”</p>



<p>The infrastructure for rapid improvement, in other words, is now partly automated. The agents are writing the code, then monitoring the feedback and closing the loop.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="My Claude Fixed My Bug Before I Did with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/4h0i7YiS9io?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>
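


<p>Nothing about that watcher is exotic. Here&#8217;s a rough sketch of the same idea in Python against the Slack API (via <code>slack_sdk</code>), where the token, channel ID, and <code>open_fix_task()</code> hook are placeholders; the setup Cat described was a Claude Code scheduled task, not a handwritten script like this one:</p>



<pre class="wp-block-code"><code>import time
from slack_sdk import WebClient  # pip install slack_sdk

DAY = 24 * 60 * 60

def unanswered_messages(client, channel_id):
    """Yield channel messages older than 24 hours that never got a reply."""
    cutoff = time.time() - DAY
    history = client.conversations_history(channel=channel_id, latest=str(cutoff))
    for message in history["messages"]:
        if message.get("reply_count", 0) == 0:
            yield message["text"]

def triage(open_fix_task):
    """open_fix_task: whatever kicks off an agent that drafts the fix PR."""
    client = WebClient(token="xoxb-your-bot-token")  # placeholder token
    for text in unanswered_messages(client, "C0123456789"):  # placeholder ID
        open_fix_task(text)</code></pre>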



<h2 class="wp-block-heading">The bottleneck has shifted to review</h2>



<p>There’s no question that AI-assisted coding has created a boom in output. Anthropic engineers are producing roughly 200% more code than they were a year ago, Cat noted. Today the main constraint is reviewing all that code to ensure it’s production-ready.</p>



<p>Cat&#8217;s team concluded that you can buy a lot of additional robustness for not that much extra cost. </p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>We opted for the heaviest, most robust version [of code review]. We actually plot how many agents and how comprehensive of a review Claude does and then how many bugs does it recall. And we picked a number of very high recall and decided we should ship this, because if you really want AI code review to be a load-bearing part of your process, you actually probably just want the most comprehensive possible review.</p>
</blockquote>



<p>The review agent doesn&#8217;t just look at the diff. It traces code across multiple files and catches bugs in adjacent code that has nothing to do with the change in question. Cat gave two examples. One was a ZFS encryption refactor where the agent found a key cache invalidation bug that wasn&#8217;t related to the author&#8217;s change at all but would have invalidated it. The other was a routine auth update that turned out to have a bad side effect, caught premerge. In both cases, engineers manually reviewing the code likely would have missed the bugs.</p>



<p>The human review that remains is deliberately small in scope. For most PRs, the human reviewer skims for design principle violations and obvious problems and assumes functional correctness has been handled. Five to ten agents run in parallel, each given slightly different tasks; they return their findings independently, and the results are then deduplicated.</p>
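


<p>The fan-out-and-deduplicate pattern is straightforward to sketch. In the Python below, <code>run_review_agent()</code> stands in for a real agent invocation; this illustrates the shape of the approach Cat described, not Anthropic&#8217;s implementation:</p>



<pre class="wp-block-code"><code>from concurrent.futures import ThreadPoolExecutor

REVIEW_FOCUSES = [
    "race conditions and concurrency",
    "error handling and resource leaks",
    "security and input validation",
    "side effects on adjacent code, beyond the diff itself",
    "violations of project design principles",
]

def review(diff, run_review_agent):
    """run_review_agent(diff, focus) returns a list of finding strings."""
    with ThreadPoolExecutor(max_workers=len(REVIEW_FOCUSES)) as pool:
        results = pool.map(
            lambda focus: run_review_agent(diff, focus), REVIEW_FOCUSES
        )
    # The agents overlap, so merge near-duplicate findings on a normalized key.
    unique = {}
    for findings in results:
        for finding in findings:
            unique.setdefault(" ".join(finding.lower().split()), finding)
    return list(unique.values())</code></pre>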



<p>The cultural shift that made this work, though, was ownership. The team moved to a model where the engineer who authors a PR owns it end to end, including postdeploy bugs, and doesn&#8217;t lean on peer reviewers to catch mistakes. “Otherwise,” as Cat pointed out, “you have situations where junior engineers put out a bunch of PRs and then your senior engineers are like drowning in AI-generated stuff where they&#8217;re not sure how thoroughly it&#8217;s been tested.&#8221;</p>



<p>Full ownership meant the AI review had to actually be trustworthy, which drove the decision to go for high recall rather than a lighter touch. That said, engineers are still expected to understand every line of code an agent creates&#8230;for now. As Cat explained, it’s the only way to truly prevent “unknown security vulnerabilities and to be able to quickly respond to incidents if they are to happen.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Making AI Code Review a Loadbearing Part of Your Process with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/1eBxpDE35Gk?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Everyone&#8217;s kind of an engineer now</h2>



<p>Cowork, Anthropic&#8217;s agent tool for nontechnical users, is the company’s attempt to take what Claude Code does for engineers and bring it to knowledge work more broadly. Cat sketched a picture of someone looking at five or six agent tasks running simultaneously in a side panel, managing a fleet of agents the way a senior engineer manages a PR queue.</p>



<p>In the nearer term, she&#8217;s keeping tabs on the shift toward people using Claude Code to build things for themselves, their teams, or their families that wouldn&#8217;t have justified professional development effort or “otherwise been possible.” The archetypes are the garage project, the family expense tracker, the tool that a small team actually needs but that no SaaS product quite addresses. Cat&#8217;s goal and hope is that Claude Code helps people “solve their own problems for themselves” and “stewards a new future of personal software.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Everyone&#039;s Kind of an Engineer Now with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/10wu71soYhg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Product taste as the new technical skill</h2>



<p>More people building more software is unambiguously good. Boris Cherny has even floated the idea that coding as we know it is “<a href="https://x.com/lennysan/status/2024896611818897438" target="_blank" rel="noreferrer noopener">solved</a>.” But what does that mean for the craft of software engineering? Cat&#8217;s read of the current moment is more nuanced:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I think pre-AI, the skills that were very important were being able to take a spec and implement it well. And I think now the really important skill is product taste. Even for engineers. Can you use code to ingest a massive amount of user feedback? Do you have good intuition about which feature to build to address those needs, because it&#8217;s often different than exactly what users are asking you for? And then, when Claude builds it, are you setting up the right bar so that what you ship people actually love?</p>
</blockquote>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Product Taste Is the New Technical Skill with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/hIEA3YFixE4?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Cat’s not alone in highlighting the importance of taste in a world where code is a commodity. <a href="https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/#:~:text=the%20new%20Amish.%E2%80%9D-,Taste%20is%20the%20moat,-Another%20of%20the" target="_blank" rel="noreferrer noopener">Steve Yegge</a>, <a href="https://www.oreilly.com/radar/the-mythical-agent-month/#:~:text=Design%20and%20taste%20as%20our%20last%20foothold" target="_blank" rel="noreferrer noopener">Wes McKinney</a>, and many others, <a href="https://www.oreilly.com/radar/betting-against-the-bitter-lesson/#:~:text=Even%20if%20the,taste%20and%20curation" target="_blank" rel="noreferrer noopener">myself included</a>, see taste and judgment as uniquely human values. This has practical implications for how engineers should spend their time now, and for what the next generation needs to learn.</p>



<p>For junior engineers specifically, Cat described a progression: Start by using Claude Code to understand the codebase (ask all the &#8220;dumb questions&#8221; without embarrassment), take those answers to a senior engineer for calibration, and then close the loop by updating the CLAUDE.md with whatever was missing.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Think of Claude Code as your intern that you&#8217;re trying to level up. Like, teach it back to Claude. Add a <code>/verify</code> slash command. Put it in the CLAUDE.md or the agent README. Approach this as senior engineers helping you level up, and then you helping Claude and other agents level up.</p>
</blockquote>



<p>The improvement process, in other words, should be bidirectional. Engineers get better at using the tools and the tools get better through the engineers&#8217; accumulated knowledge. And significantly, this process keeps humans firmly in the loop, playing a role that’s “<a href="https://www.oreilly.com/radar/software-craftsmanship-in-the-age-of-ai/" target="_blank" rel="noreferrer noopener">active, continuous, and skilled</a>.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="How Should Junior Engineers Use Claude Code? with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/qnSuOFXkEH0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>You can <a href="https://learning.oreilly.com/videos/ai-codecon-software/0642572305581/" data-type="link" data-id="https://learning.oreilly.com/videos/ai-codecon-software/0642572305581/" target="_blank" rel="noreferrer noopener">watch Cat and Addy&#8217;s full chat</a>, plus everything else from AI Codecon on the O&#8217;Reilly learning platform. Not a member? <a href="https://www.oreilly.com/start-trial/?type=individual" data-type="link" data-id="https://www.oreilly.com/start-trial/?type=individual" target="_blank" rel="noreferrer noopener">Sign up for a free 10-day trial</a>, no strings attached. </em></p>
</blockquote>



]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/everyones-an-engineer-now/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>AI Code Review Only Catches Half of Your Bugs</title>
		<link>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/</link>
				<comments>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/#respond</comments>
				<pubDate>Thu, 30 Apr 2026 11:14:49 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18637</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-code-review-only-catches-half-of-your-bugs.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-code-review-only-catches-half-of-your-bugs-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Luckily, there&#039;s a way to catch the ones that no structural analysis ever will]]></custom:subtitle>
		
				<description><![CDATA[This is the fifth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, and part four here. I recently had a taste of humility with my AI-generated code. I live in Park Slope, Brooklyn, and recently I needed to get to the other side of the neighborhood. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the fifth article in a series on agentic engineering and AI-driven development. Read part one <a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two <a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three <a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, and part four <a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
</blockquote>



<p>I recently had a taste of humility with my AI-generated code. I live in Park Slope, Brooklyn, and not long ago I needed to get to the other side of the neighborhood. I thought I&#8217;d be clever: I like taking the bus, so I decided to hop on the one that goes right down 7th Avenue. I know I could check the schedule using the MTA&#8217;s really useful Bus Time app or website, but it doesn&#8217;t take into account walking time from my house or give me a good idea of when to leave. This seemed like a great opportunity to vibe code an app and do some quick AI-driven development.</p>



<p>It took about two minutes for Claude Code to get my new app working. It made a lovely little web UI, I configured my stop and how long it takes me to walk there, and it gave me the perfect departure time.</p>



<p>When I actually walked out the door, the app perfectly predicted my wait. There was just one problem: my bus was nowhere to be seen. What I <em>did</em> see was a bus driving the exact opposite direction down 7th Avenue.</p>



<p>It was pretty obvious what had happened. I needed to go deeper into Brooklyn, not towards Manhattan, and the AI had picked the wrong direction. (Actually, as Cowork pointed out, each stop has its own ID, and it had selected the ID for the wrong stop.) I&#8217;d been using Cowork to orchestrate everything, and I could easily have just asked it to go out and check the MTA&#8217;s Bus Time site for me to make sure the app was working. But I just trusted the AI. As a result, I had to walk. Which is fine&#8212;I love walking&#8212;but the irony was painful. I had literally just published an article about AI code quality and why you shouldn&#8217;t blindly trust it, and here I was doing exactly that.</p>



<p>The app had a bug. But it wasn&#8217;t the kind of bug you&#8217;d necessarily catch using a typical AI code review prompt. It built, ran, and did a perfectly fine job parsing the JSON from the MTA API. But if I&#8217;d started with a simple requirement—even just a user story like &#8220;as a Park Slope resident, I want to catch the B69 headed towards Kensington so I can get deeper into Brooklyn&#8221;—the AI would have built it differently. The problem is that AI can only build the thing you tell it to build, which isn&#8217;t necessarily the thing you <em>wanted</em> it to build. <strong>AI is really good at writing &#8220;correct&#8221; code that does the wrong thing.</strong></p>



<p>My Brooklyn bus detour was a minor inconvenience. But it was a really useful, small-scale example of what I kept running into in my larger projects, too. There&#8217;s an entire class of bugs that you simply can&#8217;t find with structural analysis—no linter, no static analyzer, no AI code reviewer will catch them—because the code isn&#8217;t wrong in any way that&#8217;s visible from the code alone. You need to know what the code was supposed to do. You need to know the intent.</p>



<p>The data on why requirements matter goes back decades. Back in the 1990s, for example, the Standish CHAOS reports were a big eye-opener for me and a lot of other people in the industry, large-scale data confirming what we&#8217;d been seeing on our own projects: that the most expensive defects trace back to misunderstood or missing requirements. Those reports really underscored the idea that poor requirements management, specifically incomplete or frequently changing specifications, was one of the primary drivers behind IT project failures. (And, as far as I can tell, it still is, and AI isn&#8217;t helping things&#8212;see my O&#8217;Reilly Radar article, “<a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>”).</p>



<p>The idea that requirements problems really are the source of the most expensive kind of defects should make intuitive sense: If you build the wrong thing, you have to tear it apart and rebuild it. That&#8217;s why I made requirements the foundation of the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open-source skill for AI tools like Claude Code, Cursor, and Copilot that I introduced in the <a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">previous article</a>. I&#8217;ve spent decades doing test-driven development, partnering with QA teams, welcoming the harshest code reviews from teammates who don&#8217;t pull punches&#8212;and that experience led me to build a tool that uses AI to bring back quality engineering practices the industry abandoned decades ago. I&#8217;ve tested it against a wide range of open-source projects in Go, Java, Rust, Python, and C#, from small utilities to widely used libraries with tens of thousands of stars, and it&#8217;s found real bugs in almost every project it&#8217;s come across, including ones that have been confirmed and merged upstream.</p>



<p>I think there are a lot of wider lessons we can learn from my experience using requirements to help AI find bugs—especially security bugs. So in this article, I want to focus on the single most important thing I&#8217;ve learned from building it: everything depends on requirements. Not just any requirements, but a specific kind of requirement that most projects don&#8217;t have, that most AI tools don&#8217;t ask for, and that turns out to be the key to making AI actually useful for verifying code quality.</p>



<h2 class="wp-block-heading"><strong>Spec-driven development and what it misses</strong></h2>



<p>Developers using AI tools have been rediscovering the value of writing things down before asking the AI to build them. Spec-driven development (SDD) has become very popular, and for good reason. Addy Osmani wrote an excellent piece on this, “<a href="https://addyosmani.com/blog/good-spec/" target="_blank" rel="noreferrer noopener">How to Write a Good Spec for AI Agents</a>,” and the core idea is sound: If you write a clear specification of what you want built, the AI produces dramatically better results than if you just describe it in a chat prompt and hope for the best.</p>



<p>I think SDD is important, and I&#8217;d encourage any developer working with AI to adopt it. But as I was building the Quality Playbook, I discovered that SDD has a blind spot that matters a lot for code quality. An SDD spec describes the <em>how</em>—what the implementation should look like. It tells the AI &#8220;implement a duplicate key check&#8221; or &#8220;add a retry mechanism with exponential backoff&#8221; or &#8220;create a REST endpoint that returns paginated results.&#8221; That&#8217;s useful for building things. But it&#8217;s not enough for verifying them.</p>



<p>A requirement, by contrast, doesn&#8217;t say &#8220;implement a duplicate key check.&#8221; It says &#8220;users depend on Gson to reject ambiguous input so they don&#8217;t silently accept corrupted data.&#8221; The AI can reason about the second one in ways it can&#8217;t reason about the first, because the second one has the purpose attached. When the AI knows the purpose, it can evaluate whether the code actually fulfills that purpose across all the edge cases, not just the ones the spec explicitly listed. That&#8217;s how the Quality Playbook caught a bug in Google&#8217;s Gson library, one of the most widely used JSON libraries in Java.</p>



<p>I think it&#8217;s worth digging into that particular bug, because it&#8217;s a great example of just how powerful requirements analysis can be for finding defects. The playbook derived null-handling requirements from Gson&#8217;s own community—GitHub issues <a href="https://github.com/google/gson/issues/676" target="_blank" rel="noreferrer noopener">#676</a>, <a href="https://github.com/google/gson/issues/913" target="_blank" rel="noreferrer noopener">#913</a>, <a href="https://github.com/google/gson/issues/948" target="_blank" rel="noreferrer noopener">#948</a>, and <a href="https://github.com/google/gson/issues/1558" target="_blank" rel="noreferrer noopener">#1558</a>, some dating back to 2016—then used those requirements to find that duplicate keys were silently accepted when the first value was null. It confirmed the bug by generating a failing test, then patched the code and verified the test passed. I&#8217;ve used Gson for years and done a lot of work with Java serialization, so I read the code and the fix myself before submitting anything—trust but verify. The fix was merged as <a href="https://github.com/google/gson/pull/3006" target="_blank" rel="noreferrer noopener">https://github.com/google/gson/pull/3006</a>, confirmed by Google&#8217;s own test suite.</p>
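


<p>The defect is worth seeing in miniature. Below is the same class of bug reproduced in Python rather than Java; this is an illustration of the pattern, not Gson&#8217;s actual code. The buggy version detects duplicates by checking the previous <em>value</em> instead of the key&#8217;s presence, so a null first value slips through:</p>



<pre class="wp-block-code"><code>import json

def reject_duplicates_buggy(pairs):
    obj = {}
    for key, value in pairs:
        previous = obj.get(key)
        obj[key] = value
        if previous is not None:  # BUG: None means both "absent" and "was null"
            raise ValueError(f"duplicate key: {key}")
    return obj

def reject_duplicates_fixed(pairs):
    obj = {}
    for key, value in pairs:
        if key in obj:  # presence check, as the requirement demands
            raise ValueError(f"duplicate key: {key}")
        obj[key] = value
    return obj

doc = '{"a": null, "a": 1}'
print(json.loads(doc, object_pairs_hook=reject_duplicates_buggy))  # {'a': 1}, silently
try:
    json.loads(doc, object_pairs_hook=reject_duplicates_fixed)
except ValueError as e:
    print(e)  # duplicate key: a</code></pre>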



<p>That bug had been hiding in plain sight for years, through thousands of tests and countless code reviews. It&#8217;s possible that no structural analysis would ever have found it, because you needed the requirement to know the behavior was wrong.</p>



<p>This distinction might sound academic, but it has very concrete consequences for whether your AI can actually find bugs in your code.</p>



<h2 class="wp-block-heading"><strong>About half of all security bugs are invisible to structural analysis</strong></h2>



<p>The security world has known about the limits of structural analysis for a long time. The <a href="https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.500-326.pdf" target="_blank" rel="noreferrer noopener">NIST SATE evaluations</a> found that <strong>the best static analysis tools plateaued at around 50-60% detection rates for security vulnerabilities</strong>. Gary McGraw&#8217;s <a href="https://www.oreilly.com/library/view/software-security-building/0321356705/" target="_blank" rel="noreferrer noopener"><em>Software Security: Building Security In</em></a> (Addison-Wesley, 2006) explains why: Roughly 50% of security defects are implementation bugs, and the other 50% are design flaws. Static analysis tools target the implementation bugs&#8212;buffer overflows, SQL injection, format string vulnerabilities&#8212;because those are pattern-matchable. But design flaws are about intent: The system&#8217;s architecture doesn&#8217;t enforce the security properties it&#8217;s supposed to enforce, and no amount of scanning the code will reveal that. A <a href="https://arxiv.org/abs/2407.12241" target="_blank" rel="noreferrer noopener">study by Charoenwet et al.</a> (ISSTA 2024) confirmed this is still the case: They tested five static analysis tools against 815 real vulnerability-contributing commits and found that 22% of vulnerable commits went entirely undetected, and 76% of warnings in vulnerable functions were irrelevant to the actual vulnerability. The pattern is consistent across two decades of research: There&#8217;s a ceiling on what you can find by analyzing code, and it&#8217;s around half.</p>



<p>That limitation has a name: the <strong>intent ceiling</strong>. A structural analysis tool is limited to reading the code and looking at what it does; it has no way to take into account <em>what the developer intended it to do</em>.</p>



<p>When an AI does a code review without requirements, it&#8217;s limited to structural analysis: pattern matching, code smell detection, race condition analysis. It can ask &#8220;does this look right?&#8221; but it can&#8217;t ask &#8220;does this do what it&#8217;s supposed to do?&#8221; because it doesn&#8217;t know what the code is supposed to do. Structural review catches genuinely important stuff—race conditions, null pointer issues, resource leaks, concurrency bugs. A structural reviewer looking at a shell script will catch a missing <code>fi</code>, a bad variable expansion, a race condition. Structural review is useful, and structural review is what most AI code review tools do today.</p>



<p>But about half of all security defects are intent violations: things the code doesn&#8217;t do that it was supposed to do, or things it does that it wasn&#8217;t supposed to do. They&#8217;re invisible without a specification to check against, and no tool will find them by looking at code that is, structurally, perfectly sound. A structural reviewer looking at a script that&#8217;s used, say, to check router configuration files might find well-formed bash, correct syntax, proper quoting, and code that looks like it works and doesn&#8217;t match known antipatterns. It wouldn&#8217;t know the script is only validating three of the five access control rules it&#8217;s supposed to enforce because that&#8217;s a requirements question, not a syntax question.</p>



<p>Or, more personally for me, this is what happened with my bus tracker app: The JSON parsing was flawless, the UI was correct, the timing logic worked perfectly. The only problem was that it showed buses headed towards Manhattan when I needed to go deeper into Brooklyn—and no structural analysis would ever catch that, because you need to know which direction I intended to go. That&#8217;s me and my very clever AI hitting the intent ceiling.</p>



<h2 class="wp-block-heading"><strong>The intent ceiling is a security problem</strong></h2>



<p>This is where it gets really serious, because security vulnerabilities are some of the most dangerous members of this class of invisible bugs.</p>



<p>Think about what a missing authorization check looks like to an AI code reviewer. Let&#8217;s say you&#8217;ve got a web endpoint with a well-formed HTTP handler, properly sanitized inputs, and a safe database query. The code is clean and passes every structural check and static analysis tool you&#8217;ve thrown at it. Now you&#8217;re testing it and, much to your dismay, you discover that the endpoint lets any authenticated user delete any other user&#8217;s data because nobody ever wrote down the requirement that says &#8220;only administrators can perform deletions.&#8221; That&#8217;s <a href="https://cwe.mitre.org/data/definitions/862.html" target="_blank" rel="noreferrer noopener">CWE-862: Missing Authorization</a>, and it rose to #9 on the <a href="https://cwe.mitre.org/top25/" target="_blank" rel="noreferrer noopener">2024 CWE Top 25</a> most dangerous software weaknesses.</p>
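


<p>Here&#8217;s a deliberately simplified sketch of that endpoint (illustrative Python, not from any real codebase). Every property a structural reviewer checks for is fine; the defect is a requirement that was never written down:</p>



<pre class="wp-block-code"><code>import sqlite3
from flask import Flask, session

app = Flask(__name__)
app.secret_key = "dev-only-placeholder"

@app.route("/users/&lt;int:user_id&gt;", methods=["DELETE"])
def delete_user(user_id):
    if "account_id" not in session:  # authentication: present and correct
        return {"error": "not signed in"}, 401
    # MISSING: the authorization check nobody specified, e.g.
    #   if not is_admin(session["account_id"]):
    #       return {"error": "forbidden"}, 403
    db = sqlite3.connect("app.db")
    db.execute("DELETE FROM users WHERE id = ?", (user_id,))  # parameterized, "safe"
    db.commit()
    return {"deleted": user_id}, 200</code></pre>



<p>No scanner flags the missing check, because nothing in the code says it should exist. The vulnerability lives entirely in the gap between the code and the unwritten requirement.</p>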



<p>That&#8217;s not a coding error! It&#8217;s a missing requirement.</p>



<p>That&#8217;s McGraw&#8217;s point: About half of all security defects aren&#8217;t implementation bugs at all. They&#8217;re design flaws, places where the system&#8217;s architecture doesn&#8217;t enforce the security properties it was supposed to enforce. A cross-site scripting vulnerability isn&#8217;t always a failure to sanitize input. Sometimes it&#8217;s a failure to define which inputs are trusted and which aren&#8217;t. A privilege escalation isn&#8217;t always a broken access check. Sometimes there was never an access check to begin with because nobody specified that one was needed. These are intent violations and they&#8217;re invisible to any tool that doesn&#8217;t know what the software is supposed to prevent.</p>



<p>AI code review tools today are very good at catching the implementation half of McGraw&#8217;s split. They can spot a SQL injection pattern, flag an unsafe deserialization, identify a buffer overflow. But they&#8217;re working on the same side of the 50/50 line that static analysis has always worked on. The design half—the missing authorization checks, the unspecified trust boundaries, the security properties that were never written down—requires the same thing that catching my bus tracker bug required: knowing what the software was supposed to do in the first place.</p>



<h2 class="wp-block-heading"><strong>How the Quality Playbook derives requirements (and how you can too!)</strong></h2>



<p>The problem most projects face is that they don&#8217;t have formal requirements. What they have is code, documentation, commit messages, chat history, README files, and maybe some design docs. The question is how to get from that mess to a specification that an AI can actually use for verification.</p>



<p>The key insight I had while building the playbook was that every previous approach I tried asked the model to do two things at once: figure out what contracts exist AND write requirements for them. That doesn&#8217;t work—the model runs out of attention trying to hold the entire behavioral surface in its head while also producing formatted requirements. So I split them apart into four steps: First, have the AI read each source file and write down every behavioral contract it observes as a simple list. Second, derive requirements from those contracts plus the documentation. Third, check whether every contract is covered by a requirement. Fourth, assert completeness—and if there are gaps, go back to step one for the files with gaps.</p>
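


<p>Here&#8217;s the shape of that four-step split as a sketch, with <code>ask_model()</code> standing in for whatever LLM call your tooling exposes; the prompts and file handling are simplified for illustration and aren&#8217;t the playbook&#8217;s actual prompts:</p>



<pre class="wp-block-code"><code>def derive_requirements(source_files, docs, ask_model):
    # Step 1: one pass per file, recording every observed behavioral
    # contract as a plain line of text. No requirements work yet.
    contracts = []
    for path in source_files:
        source = open(path).read()
        observed = ask_model(f"List every behavioral contract in:\n{source}")
        contracts += [(path, line) for line in observed.splitlines() if line.strip()]

    # Step 2: derive requirements from the contracts plus the documentation.
    contract_text = "\n".join(text for _, text in contracts)
    requirements = ask_model(
        f"Write requirements covering these contracts:\n{contract_text}\n\nDocs:\n{docs}"
    )

    # Step 3: check every recorded contract against the requirements.
    gaps = sorted({
        path for path, text in contracts
        if ask_model(
            f"Answer yes or no: is this contract covered?\n{text}\n\n{requirements}"
        ).strip().lower() == "no"
    })

    # Step 4: assert completeness; re-read only the files with gaps.
    # (A real implementation would cap the number of passes.)
    if gaps:
        return derive_requirements(gaps, docs, ask_model)
    return requirements</code></pre>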



<p>What makes this work is that the contracts file is external memory. When the model &#8220;forgets&#8221; about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap.</p>



<p>You don&#8217;t need the Quality Playbook to do this—you can apply the same technique with any AI coding tool that you&#8217;re already using. Here&#8217;s what I&#8217;d recommend:</p>



<ul class="wp-block-list">
<li><strong>Write down what your software is supposed to guarantee.</strong> Not just what it does—what it&#8217;s supposed to do, for whom, under what conditions. If you&#8217;re practicing spec-driven development, you&#8217;re already partway there. The next step is adding the <em>why</em>: Why does this behavior matter, who depends on it, what goes wrong if it fails? That&#8217;s the difference between a spec and a requirement, and it&#8217;s the difference between an AI that can build your code and an AI that can verify it.<br></li>



<li><strong>Feed the AI your intent, not just your code.</strong> The intent is already sitting in your chat history, your design discussions, your Slack threads, your support tickets. Every Claude export, every Gemini conversation, every Cowork transcript contains design intent that never made it into specifications: why a function was written a certain way, what failure prompted an architectural decision, what tradeoffs were discussed before choosing an approach. The design intent that used to require a human to extract and document is now sitting in your chat logs. Your AI can read the transcripts and extract the <em>why</em>.<br></li>



<li><strong>Look for the negative requirements.</strong> What should your software <em>not</em> do? What states should be impossible? What data should never be exposed? These negative requirements are often the most valuable because they define boundaries that structural review can&#8217;t see. The missing authorization bug was a negative requirement: Non-admin users must <em>not</em> be able to delete other users&#8217; data. The Gson bug was a negative requirement: Duplicate keys must <em>not</em> be silently accepted when the first value is null. If you can articulate what your software must never do, you&#8217;ve given the AI something powerful to check against. (One way to turn these into executable checks is sketched just after this list.)<br></li>
</ul>
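


<p>Here&#8217;s a short pytest sketch of negative requirements written down as executable checks, in which the <code>client</code> fixture, its helpers, and the <code>parse()</code> function are hypothetical stand-ins for your own app and JSON entry point:</p>



<pre class="wp-block-code"><code>import pytest

def test_non_admins_must_not_delete_other_users(client):
    # Negative requirement: only administrators can perform deletions.
    client.log_in("ordinary-user")         # hypothetical fixture helper
    response = client.delete("/users/42")  # someone else's account
    assert response.status_code == 403     # the deletion must NOT succeed

def test_duplicate_keys_must_not_be_silently_accepted():
    # Negative requirement: ambiguous input is rejected, even when the
    # first value is null.
    with pytest.raises(ValueError):
        parse('{"a": null, "a": 1}')       # parse() is your JSON entry point</code></pre>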



<p>In the next article, I&#8217;ll talk about context management—the skill that actually determines whether your AI sessions produce good work or mediocre work. Everything I&#8217;ve described here depends on the AI having the right information at the right time, and it turns out that managing what the AI knows (and what it forgets) is an engineering discipline in its own right. I&#8217;ll cover how I went from running 15 million tokens in a single prompt to splitting the playbook into independent phases with zero context carryover, and why that transition worked on the first try.</p>



<p><em>The </em><a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener"><em>Quality Playbook</em></a><em> is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of </em><a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener"><em>awesome-copilot</em></a><em>.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Disclosure: Aspects of the methodology described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Don&#8217;t Automate Your Moat: Matching AI Autonomy to Risk and Competitive Stakes</title>
		<link>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/</link>
				<comments>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/#respond</comments>
				<pubDate>Wed, 29 Apr 2026 11:42:28 +0000</pubDate>
					<dc:creator><![CDATA[Marc Millstone and Claude]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18628</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-automate-your-moat.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-automate-your-moat-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Velocity is table stakes. Code is a commodity. Understanding is the edge.]]></custom:subtitle>
		
				<description><![CDATA[I was talking to a senior engineer at a well-funded company not long ago. I asked him to walk me through a critical algorithm at the heart of their product, something that ran hundreds of times a second and directly affected customer outcomes. He paused and said, &#8220;Honestly, I&#8217;m not totally sure how it works. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I was talking to a senior engineer at a well-funded company not long ago. I asked him to walk me through a critical algorithm at the heart of their product, something that ran hundreds of times a second and directly affected customer outcomes. He paused and said, &#8220;Honestly, I&#8217;m not totally sure how it works. AI wrote it.&#8221;</p>



<p>A few weeks later, a different engineer at another company was paged about a system outage. He pulled up the failing service and realized he had no idea it was connected to a database. A colleague had accepted the AI-generated PR three months earlier that added that dependency. The tests passed. The change was never written down. The colleague moved on, and the knowledge was lost.</p>



<p>These aren&#8217;t new stories. Engineers have always inherited systems they didn&#8217;t fully build. What&#8217;s new is the disguise and the speed. AI is an amazing enabler. Organizations must adopt it to remain relevant. Yet the emerging pattern—describe what you want, let an agent iterate until it works, pay for it in tokens instead of engineering hours—is functionally a buy decision wearing a build costume. The code is in your repo. Your engineers merged the PR. It feels like you built it. But if nobody on your team understands why it works the way it does, you&#8217;ve purchased a dependency you can&#8217;t maintain from a vendor you can&#8217;t call.</p>



<p>AI doesn’t create that gap once. It widens it continuously at a pace that outstrips the organizational habits that once kept it manageable. Two problems compound at once. You can’t extend the thing that makes you hard to replace. And when it breaks, the incident lands on a team that doesn’t understand what they’re fixing, turning a recoverable outage into a customer-facing crisis. Engineering leaders have wrestled with build-versus-buy tradeoffs for decades, and the hard-won lesson has always been the same: You don&#8217;t outsource your competitive advantage. The token-funded generation loop doesn&#8217;t change that calculus. It makes it easier to skip the question entirely.</p>



<p>The question that matters isn&#8217;t &#8220;Can AI do this?&#8221; If it can&#8217;t today, it will be able to tomorrow. And the argument that follows does not depend on the quality of the AI-generated code. This article is about two questions most engineering organizations have never asked at the same time: What&#8217;s the blast radius if this fails, and does this code make us hard to replace? Most teams optimize for velocity and never ask what they&#8217;re risking or giving away in the process. The gap between those unasked questions is where the most expensive mistakes are already being made.</p>



<h2 class="wp-block-heading"><strong>Part 1: Two dimensions. Neither is velocity.</strong></h2>



<p>Moving faster matters. But velocity alone misses the two dimensions that determine whether AI autonomy helps or hurts your business.</p>



<p><strong>Business risk</strong>: What&#8217;s the blast radius if this fails? A bug in an internal CLI tool costs you an afternoon. A bug in your authentication logic costs you customers and possibly market cap. A bug in your core pricing algorithm costs you the business. These are not the same.</p>



<p><strong>Competitive differentiation</strong>: Does this code <em>define your business?</em> Your moat is your architecture, your performance characteristics, your core algorithms, and the product decisions baked into your infrastructure. But it&#8217;s also the institutional knowledge that shaped them: the reasoning behind the trade-offs, the context that no model was trained on. If your competitors can generate the same code with the same model you&#8217;re using, it stops being an advantage.</p>



<p>Most organizations ask the first question on a good day. Almost none ask the second. That gap is how you end up shipping fast into a moat nobody can explain and nobody can extend.</p>



<p>Understanding why both dimensions matter starts with velocity and what happens when the feedback loop around it breaks.</p>



<h3 class="wp-block-heading"><strong>Velocity feels real. Debt is often invisible.</strong></h3>



<p>AI coding tools are genuinely impressive. GitHub&#8217;s research showed 55% faster task completion with Copilot in controlled conditions.<sup data-fn="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9" class="fn"><a href="#cd51007b-bb8f-45b7-b7a2-7317b6d0cff9" id="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9-link">1</a></sup> That number has driven an assumption that faster is always better.</p>



<p>A 2025 METR randomized controlled trial<sup data-fn="7306407f-0600-4183-85fa-04f12932c6e6" class="fn"><a href="#7306407f-0600-4183-85fa-04f12932c6e6" id="7306407f-0600-4183-85fa-04f12932c6e6-link">2</a></sup> found something that should give every engineering leader pause. Sixteen experienced developers on real production codebases forecasted they’d complete tasks 24% faster with AI. After finishing, they estimated they’d gone 20% faster. They’d actually gone 19% slower.</p>



<p>The velocity finding is striking. But the perception gap matters more. The feedback loop between &#8220;how am I doing?&#8221; and &#8220;how am I actually doing?&#8221; was broken throughout and never corrected itself. This doesn&#8217;t resolve the velocity debate. It reframes it. The danger isn&#8217;t that individuals move too fast; it&#8217;s that organizations mistake output volume for productivity and strip out the review processes that used to catch the gap.</p>



<p>A Tilburg University study of open source projects after GitHub Copilot&#8217;s introduction found the same pattern at the organizational level.<sup data-fn="b0ff63ba-e120-48fb-902d-c89fc1d80fe5" class="fn"><a href="#b0ff63ba-e120-48fb-902d-c89fc1d80fe5" id="b0ff63ba-e120-48fb-902d-c89fc1d80fe5-link">3</a></sup> Productivity did increase, but primarily among less-experienced developers. Code written after AI adoption required more rework to meet repository standards. The added rework burden fell on the most experienced (core) developers, who reviewed 6.5% more code after Copilot&#8217;s introduction and saw a 19% drop in their own original code output. The velocity looks real at the surface. Underneath, the maintenance cost shifts upward to the people who can least afford to lose productive time.</p>



<p>That broken feedback loop has a name. Researchers call it <strong>cognitive debt</strong><sup data-fn="92b75161-55d2-410a-9da7-0faaf234bae3" class="fn"><a href="#92b75161-55d2-410a-9da7-0faaf234bae3" id="92b75161-55d2-410a-9da7-0faaf234bae3-link">4</a></sup>: the growing gap between how much code exists in your system and how much of it anyone actually understands. Technical debt shows up in your linter and your backlog. Cognitive debt is invisible. There&#8217;s no signal telling engineers where their understanding ends. That&#8217;s precisely what the METR perception gap showed. It never corrected itself.</p>



<p>Research by Anthropic Fellows found that engineers using AI assistance when learning new tools scored 17% lower on comprehension tests than those who coded by hand, with the steepest drops in debugging ability.<sup data-fn="edc18347-043c-4398-9652-16b4d7a8c464" class="fn"><a href="#edc18347-043c-4398-9652-16b4d7a8c464" id="edc18347-043c-4398-9652-16b4d7a8c464-link">5</a></sup> MIT&#8217;s Media Lab found the same pattern in writing tasks: Brain connectivity was weakest in the group using LLM assistance, strongest in the group working without tools.⁴ Active production builds understanding. Passive consumption doesn&#8217;t.</p>



<p>You understand what you build better than what you review. When you write code, you produce output <em>and</em> build a mental model. That&#8217;s what Peter Naur called the &#8220;theory of the program.&#8221; It lives in your head, not in the repo.<sup data-fn="36e33054-aa72-4221-9c5d-bd189d716cac" class="fn"><a href="#36e33054-aa72-4221-9c5d-bd189d716cac" id="36e33054-aa72-4221-9c5d-bd189d716cac-link">6</a></sup> The MIT study captured this directly: 83% of participants who wrote essays with LLM assistance could not quote a single sentence from essays they had just written.⁴</p>



<p>Cognitive debt is invisible until it isn&#8217;t. When it surfaces, it hits both dimensions hard, in different ways.</p>



<h3 class="wp-block-heading"><strong>Business risk: The blast radius of not knowing</strong></h3>



<p>On the business risk dimension, cognitive debt is a safety problem.</p>



<p>When nobody fully understands the system, the blast radius of a failure expands silently. The incident that eventually comes (and it always comes) lands on a team that can&#8217;t diagnose what they didn&#8217;t build. The engineer pulling up the failing service at 2 AM has no mental model of why it was built the way it was, what it connects to, or what the edge cases look like under load. So they ask the LLM. It can explain what the code does and often propose a reasonable fix. It can&#8217;t tell you why it was designed that way. And a fix that looks right to the model can quietly violate constraints that nobody thought to document.</p>



<p>Cognitive debt compounds a second, independent risk: the pace at which AI-generated code reaches production. OX Security&#8217;s analysis<sup data-fn="56b259aa-e401-4c2c-9ecd-99ade708cc29" class="fn"><a href="#56b259aa-e401-4c2c-9ecd-99ade708cc29" id="56b259aa-e401-4c2c-9ecd-99ade708cc29-link">7</a></sup> of over 300 software repositories found that AI-generated code isn&#8217;t necessarily more vulnerable per line than human-written code. The problem is velocity.</p>



<p>Code review, debugging, and team oversight are the bottlenecks that catch vulnerable code before it ships. AI makes it easy to remove them. CodeRabbit&#8217;s analysis of real-world pull requests found AI-authored changes contain up to 1.7x more critical and major defects than human-written code, with logic and correctness issues up 75%.<sup data-fn="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c" class="fn"><a href="#eaf5792c-72c2-4cb6-94c5-d8a85a54af7c" id="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c-link">8</a></sup> Apiiro&#8217;s analysis found that while AI reliably reduces surface-level syntax errors, architectural design flaws and privilege escalation paths (the categories automated scanners miss and human reviewers struggle to catch) spiked in AI-assisted codebases.<sup data-fn="0f193fe2-b405-4b1d-85fa-b745e853969c" class="fn"><a href="#0f193fe2-b405-4b1d-85fa-b745e853969c" id="0f193fe2-b405-4b1d-85fa-b745e853969c-link">9</a></sup></p>



<p>AI accelerates output and accelerates unreviewed risk in equal measure. <strong>Cognitive debt</strong> means that when something breaks, the team is learning the system as they&#8217;re trying to fix it. Strip away that understanding and you haven&#8217;t streamlined the process. You&#8217;ve only removed the thing standing between a bad day and a catastrophic one.</p>



<h3 class="wp-block-heading"><strong>Competitive differentiation: What you give away without knowing it</strong></h3>



<p>The competitive differentiation risk isn&#8217;t that AI will generate your exact competitive algorithm and hand it to your competitor. It&#8217;s subtler. Your advantage was never the code itself; it was the judgment that shaped it. When AI writes that code, the judgment never forms. The code arrives, but the understanding that would let your team extend it, improve it, or defend it under pressure doesn&#8217;t. Your moat is most likely to survive in the places AI finds hardest to reach.</p>



<p>That judgment—formed by the performance trade-offs that took years to tune, the failure modes that only someone who&#8217;s been paged understands, the architectural decisions that encode domain knowledge nobody wrote down—doesn&#8217;t live in the codebase. It lives in your engineers&#8217; heads.</p>



<p>And here&#8217;s the part most teams miss: Your competitor with the same AI tools doesn&#8217;t just get similar code. They get a team that also doesn&#8217;t understand why it works the way it does, which means neither of you can extend it, and the race to the next architectural move is a coin flip rather than a compounding advantage. The build-versus-buy discipline exists precisely because decades of experience taught engineering organizations that outsourcing your core means losing the ability to extend it. The token-funded generation loop doesn&#8217;t change that calculus. It makes it easier to mistake the outsourcing for ownership because the code has your name on it.</p>



<p>The structural problem runs even deeper. Models trained on public code produce outputs weighted toward well-represented patterns, the common solutions to common problems. Research confirms this. LLM performance drops sharply on less-common programming languages where training data is sparse, and on genuinely novel implementations. Even the best current models correctly implement fewer than 40% of coding tasks drawn from recent research papers.<sup data-fn="076102b1-56ac-4820-99f2-71d4ef08dc32" class="fn"><a href="#076102b1-56ac-4820-99f2-71d4ef08dc32" id="076102b1-56ac-4820-99f2-71d4ef08dc32-link">10</a></sup> And the convergence problem extends beyond code. A pre-registered experiment tracking 61 participants over seven days found that while ChatGPT consistently boosted creative output during use, performance reverted to baseline the moment the tool was unavailable.<sup data-fn="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be" class="fn"><a href="#ce2fbf20-b309-4e68-afe8-6fe10b3ac7be" id="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be-link">11</a></sup> More critically, the work produced with AI assistance became increasingly homogenized over time. That homogenization persisted even after the tool was removed. The participants hadn&#8217;t borrowed the tool&#8217;s output. They&#8217;d internalized its patterns. For engineering organizations, this is the differentiation risk made concrete: Teams that rely on AI for their most critical design decisions risk generating commodity code today and training themselves to think in commodity patterns tomorrow.</p>



<p>Engineers who deeply own their most critical systems are better at diagnosing incidents and see the next architectural move that competitors can&#8217;t follow. Delegate that comprehension away and you can keep the lights on. You can&#8217;t see around corners.</p>



<h3 class="wp-block-heading"><strong>When it goes wrong, it really goes wrong</strong></h3>



<p>Both dimensions rest on the same vulnerability: cognitive debt accumulating on work that matters. The failure cases make it concrete.</p>



<p>The production failures are accumulating. A Replit AI agent deleted months of production data in seconds after violating explicit code-freeze instructions, then initially misled the user about whether recovery was possible.<sup data-fn="61637363-b0a5-44f9-939c-e2f4b2a8fb1f" class="fn"><a href="#61637363-b0a5-44f9-939c-e2f4b2a8fb1f" id="61637363-b0a5-44f9-939c-e2f4b2a8fb1f-link">12</a></sup> Reports emerged in early 2026 of a major cloud provider convening mandatory engineering reviews after a pattern of high-blast-radius incidents, with AI-assisted code changes cited as a contributing factor. In each case, the humans in the loop either didn&#8217;t understand what they were approving, or weren&#8217;t in the loop at all.</p>



<p>The deeper pattern predates AI tools entirely. Knight Capital Group took seventeen years to become the largest trader in U.S. equities. It took forty-five minutes to lose $460 million.<sup data-fn="30954c70-1f77-4ef5-9a37-c392499f0821" class="fn"><a href="#30954c70-1f77-4ef5-9a37-c392499f0821" id="30954c70-1f77-4ef5-9a37-c392499f0821-link">13</a></sup> The culprit was a nine-year-old piece of deprecated code called Power Peg, left on production servers and never retested after engineers modified an adjacent function in 2005. When engineers reused its feature flag for new functionality in 2012, nobody understood what they were reactivating. When the fault surfaced, the team’s attempt to fix it made things worse. They uninstalled the new code from the seven servers where it had deployed correctly, which caused Power Peg to activate on those servers too and compounded the losses. The SEC’s enforcement order is unambiguous: no deployment procedures, no code review requirements, no incident response protocols. It was a failure of institutional comprehension: the mental model had quietly evaporated while the code kept running.</p>



<p>No AI tool wrote that code. The failure was entirely human, through entirely normal processes: engineers leaving, tests never rerun after refactors, flags reused without documentation. This is the baseline, what software organizations produce under ordinary conditions over nine years. An engineering team with modern AI tools won&#8217;t recreate this specific bug. They&#8217;ll create the conditions for the next one faster: more code that nobody fully understands, more dependencies nobody documented, more cognitive debt accumulating before anyone notices. AI removes the friction that once slowed exactly this kind of erosion.</p>



<p>None of these are failures of AI capability. They&#8217;re failures of judgment about where to deploy AI and how much human oversight to maintain.</p>



<h2 class="wp-block-heading"><strong>Part 2: A four-quadrant model for AI autonomy</strong></h2>



<h3 class="wp-block-heading"><strong>The quadrants</strong></h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1320" height="731" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming.png" alt="The quadrants of human involvement in programming" class="wp-image-18633" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming.png 1320w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming-300x166.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming-768x425.png 768w" sizes="auto, (max-width: 1320px) 100vw, 1320px" /></figure>



<p>Four quadrants emerge when the two questions, how risky is the work and how differentiating is it, are asked together. Before the examples, two contrasts are worth naming, because the quadrants that look most similar on the surface are the ones most often confused in practice.</p>



<p><strong>Supervised automation versus human-led craftsmanship.</strong> Both demand high human involvement. Both feel like &#8220;be careful here.&#8221; But the difference is fundamental. In supervised automation, the human is a safety gate. The work is a commodity; you&#8217;re there to catch errors before they escape. In human-led craftsmanship, the human is the author. You&#8217;re building the mental model that lets the next engineer reason about this system under pressure three years from now and take it somewhere new. The code isn&#8217;t something you need to verify. It&#8217;s something you need to own. And ownership here extends beyond the individual engineer. The team writes RFCs, debates trade-offs, identifies which parts of the implementation fall into which quadrant, and makes sure the reasoning behind key decisions is shared, not siloed. Human-led craftsmanship isn&#8217;t one person writing code alone. It&#8217;s a team making sure the understanding survives the people who built it.</p>



<p><strong>Collaborative co-creation versus human-led craftsmanship.</strong> Both involve high differentiation, and in both, the human drives the vision and owns the key decisions. But risk changes everything about how you work. In collaborative co-creation, early iterations are recoverable. A wrong turn can be corrected before it costs you anything serious, so AI can genuinely accelerate execution. In human-led craftsmanship, the blast radius of not understanding what you&#8217;ve built compounds over time. Wrong turns become load-bearing walls, and the architectural moves you can&#8217;t see are the ones that let competitors catch up. AI assists with scoped subtasks only. Every contribution gets interrogated.</p>



<p>In <strong>full automation</strong>, the human is a director. You define what needs to be done, AI produces the output, and you spot-check the result. The work is low-risk and low-differentiation. If something&#8217;s wrong, you fix it in the next iteration without anyone outside the team noticing. This is where AI earns its keep without qualification, and where restricting it costs you real velocity with nothing to show for it.</p>



<p>To make all four quadrants concrete, we&#8217;ll use a single feature as a lens: building AI Gateway cost controls, the system that sets token budgets per agent, enforces spending limits, tracks usage by model and agent, and handles enforcement modes when an agent exceeds its budget.</p>



<h4 class="wp-block-heading"><strong>Low risk, low differentiation: Full automation</strong></h4>



<p>API docs for cost controls. Test scaffolding for token-limit scenarios. Config examples for per-agent budgets. Every platform needs these, and if there&#8217;s a mistake here, you fix it in the next iteration before anyone outside the team notices. Humans set direction and spot-check. AI writes, tests, and ships.</p>
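

<p>A minimal sketch of the kind of artifact this quadrant covers: a per-agent budget config, expressed here as a Python dict. Every name and number is illustrative rather than a real product schema; the point is that AI can draft this wholesale while a human spot-checks it.</p>



<pre class="wp-block-code"><code># Hypothetical per-agent token budget configuration (illustrative only).
# In the full automation quadrant, AI drafts artifacts like this and a
# human spot-checks them; a mistake is cheap to fix in the next iteration.
AGENT_BUDGETS = {
    "support-triage-agent": {
        "daily_token_budget": 2_000_000,  # hard cap across all models
        "per_request_limit": 8_000,       # ceiling for any single call
        "enforcement_mode": "degrade",    # fall back to a cheaper model
    },
    "report-summarizer": {
        "daily_token_budget": 500_000,
        "per_request_limit": 4_000,
        "enforcement_mode": "halt",       # stop entirely at the cap
    },
}</code></pre>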



<p><em>The test: If this is wrong, can you fix it before a customer sees it or complains? If yes, automate freely.</em></p>



<h4 class="wp-block-heading"><strong>Low risk, high differentiation: Collaborative co-creation</strong></h4>



<p>Designing the UX for the token usage dashboard. Iterating on routing rules that determine when an agent degrades to a cheaper model, halts entirely, or triggers a notification. These decisions separate a sophisticated platform from a blunt on/off switch, but early iterations are recoverable. A first version that doesn&#8217;t surface guardrail costs separately isn&#8217;t a disaster. It&#8217;s a product conversation. Humans drive the design vision and interrogate AI on trade-offs. AI accelerates execution and handles boilerplate.</p>
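

<p>As a concrete, hedged sketch of what those routing rules might look like: the thresholds and action names below are invented for illustration, and they are exactly the knobs a team iterates on in co-creation.</p>



<pre class="wp-block-code"><code># A sketch of budget-aware routing rules (assumed thresholds, not any
# real product's policy). A first version like this is a product
# conversation, not a disaster: early iterations are recoverable.
def route(agent_id: str, used_tokens: int, budget: int) -> str:
    """Return a routing action for the agent's next model call."""
    ratio = used_tokens / budget
    if ratio >= 1.0:
        return "halt"           # budget exhausted: stop the agent
    if ratio >= 0.9:
        return "degrade"        # switch to a cheaper model near the cap
    if ratio >= 0.75:
        return "notify"         # warn the owner but continue normally
    return "default_model"</code></pre>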



<p><em>The test: If you flipped the ratio (AI deciding, human rubber-stamping) would you be comfortable? If not, this requires genuine co-creation, not delegation. The human should be able to explain the trade-offs in the current design and know where to push it next.</em></p>



<h4 class="wp-block-heading"><strong>High risk, low differentiation: Supervised automation</strong></h4>



<p>Enforcement logic that halts an agent when it hits its token budget. Every cost control system needs enforcement, so this isn&#8217;t differentiating. But if it fails, agents run unconstrained and rack up unbounded LLM spend. AI can draft the logic. A human must trace every path and understand every state transition before signing off. The questions before merge: Can I explain exactly what happens when an agent hits the limit mid-execution? Can I explain this behavior to customer success, or to the customer?</p>
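

<p>Here is a minimal sketch of such enforcement logic, assuming a simple in-memory ledger; the class and method names are hypothetical. It makes the mid-execution case explicit: actual usage is only known after a call returns, so an agent can cross its cap with a request already in flight.</p>



<pre class="wp-block-code"><code># A minimal enforcement sketch (hypothetical names, in-memory state).
# AI can draft logic like this, but a human must trace every path,
# including the mid-execution case, before signing off.
class BudgetExceeded(Exception):
    pass

class TokenLedger:
    def __init__(self, budget: int):
        self.budget = budget
        self.used = 0

    def reserve(self, estimated_tokens: int) -> None:
        """Reject a call up front if the estimate cannot fit."""
        if self.used + estimated_tokens > self.budget:
            raise BudgetExceeded("call would exceed budget; agent halted")

    def record(self, actual_tokens: int) -> None:
        """Record actual usage after a call returns."""
        self.used += actual_tokens
        if self.used > self.budget:
            # The in-flight call already completed; only subsequent
            # calls are halted. Whether its output is kept or discarded
            # is a state transition the reviewer must be able to explain.
            raise BudgetExceeded("budget exhausted after call completed")</code></pre>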



<p><em>The test: Could a competent engineer review this confidently without having written it? If yes, the human&#8217;s job is to verify, not to author. But the bar for verification is explanation, not approval.</em></p>



<h4 class="wp-block-heading"><strong>High risk, high differentiation: Human-led craftsmanship</strong></h4>



<p>The core token metering and attribution engine. It tracks usage per agent and per model, attributes guardrail costs separately so they don&#8217;t count against agent budgets, and provides the auditability enterprise customers need to govern AI spend. Get it wrong and customers can&#8217;t trust the numbers. Get it right and it&#8217;s a genuine competitive moat that competitors can&#8217;t replicate with the same AI tools you&#8217;re using.</p>
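

<p>For illustration, a sketch of the attribution record such an engine might emit; the field names are assumptions. What it makes visible is the load-bearing design decision: guardrail tokens are metered separately and never counted against agent budgets.</p>



<pre class="wp-block-code"><code># An illustrative attribution record (assumed schema, not a real one).
from dataclasses import dataclass

@dataclass(frozen=True)
class UsageRecord:
    agent_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    guardrail_tokens: int  # metered separately; excluded from budgets

    def billable_to_agent(self) -> int:
        """Tokens that count against the agent's budget."""
        return self.prompt_tokens + self.completion_tokens</code></pre>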



<p>Human engineers own the design end-to-end. AI assists on scoped subtasks once the design is settled: drafting specific functions, generating test coverage for paths the engineer has already reasoned through. Every contribution gets interrogated. The bar is whether the engineer could explain it in an incident review without looking at the code first.</p>



<p><em>The test: If the engineer who built this left tomorrow, would the team still understand why it works the way it does? Could they make it better? If the honest answer is no, you&#8217;re accumulating the most dangerous kind of cognitive debt there is.</em></p>



<h3 class="wp-block-heading"><strong>The counterargument (it&#8217;s a good one)</strong></h3>



<p>Any engineering leader will push back here, and they&#8217;ll have good reason to.</p>



<p>The research is thin. METR&#8217;s study had 16 developers. MIT&#8217;s EEG work is a preprint that its own critics say should be interpreted conservatively.<sup data-fn="6b15093d-b639-4d29-9556-4fec26b40831" class="fn"><a href="#6b15093d-b639-4d29-9556-4fec26b40831" id="6b15093d-b639-4d29-9556-4fec26b40831-link">14</a></sup> The Anthropic comprehension study shows a quiz score gap, not a business outcome. The evidence is early-stage. Intellectual honesty requires acknowledging that.</p>



<p>But the pattern keeps showing up in unrelated fields. A Lancet study found that endoscopists who routinely used AI for polyp detection performed measurably worse when the AI was removed, with adenoma detection rates dropping from 28.4% to 22.4% in three months.<sup data-fn="e6316082-3663-4380-917a-d1c26bd39a20" class="fn"><a href="#e6316082-3663-4380-917a-d1c26bd39a20" id="e6316082-3663-4380-917a-d1c26bd39a20-link">15</a></sup> The study is observational and small. But the direction is consistent with everything else: Routine AI assistance may erode the skills it was supposed to support.</p>



<p>Most engineering work isn&#8217;t high-stakes. Studies consistently estimate that 60–80% of engineering time goes to maintenance, tests, docs, integration, and tooling, exactly the stuff that belongs in the full automation quadrant regardless. Restricting AI because of the top 20% creates a real tax on the other 80%.</p>



<p>And can&#8217;t engineers develop deep ownership of AI-generated code through study and iteration? Partially. But the behavioral data tells a harder story. GitClear&#8217;s analysis of 211 million changed lines shows a decline in refactored code since AI adoption accelerated.<sup data-fn="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8" class="fn"><a href="#d3b555ec-16c0-4fb3-9d95-65aa20d28cb8" id="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8-link">16</a></sup> Engineers aren&#8217;t studying AI-generated code carefully. They&#8217;re moving on to the next feature. LLM tools can explain what code does; they can&#8217;t tell you why the system was designed the way it was.<sup data-fn="40eaef3c-8e44-451f-9046-c6e59e060e0c" class="fn"><a href="#40eaef3c-8e44-451f-9046-c6e59e060e0c" id="40eaef3c-8e44-451f-9046-c6e59e060e0c-link">17</a></sup></p>



<p>The serious pro-AI argument isn&#8217;t &#8220;use AI everywhere.&#8221; It&#8217;s more precise: The guardrails for verification and oversight are improving fast, engineers who actively interrogate AI output build understanding even from generated code, and the organizations that restrict AI on their most critical work will fall behind competitors who don&#8217;t. This is a real argument.</p>



<p>The answer isn&#8217;t to dismiss it but to sharpen what &#8220;critical work&#8221; means, and to recognize that the interrogative use of AI that the research identifies as understanding-preserving requires organizational discipline most teams haven&#8217;t built yet. A task&#8217;s quadrant isn&#8217;t permanent. The threshold shifts as both AI capability and human oversight practices mature. The discipline is the habit of asking both questions honestly before you start, not a fixed answer to them.</p>



<h3 class="wp-block-heading"><strong>The discipline is simple. Maintaining it isn&#8217;t.</strong></h3>



<p>The quadrant tells you where to be careful. How you engage AI once you&#8217;re there determines whether careful is enough. The difference between &#8220;write me this function&#8221; and &#8220;explain why you made this trade-off, and what breaks if the input is malformed&#8221; is the difference between borrowing intelligence and developing it. Active, interrogative AI use preserves comprehension. Passive delegation destroys it. That&#8217;s what the Anthropic study&#8217;s behavioral data shows directly.</p>



<p>Match your review process to the quadrant. AI-generated docs and test scaffolding get a spot-check. AI-generated code touching your core product logic gets the same scrutiny as a junior engineer&#8217;s first PR. The bar for approval isn&#8217;t &#8220;tests pass.&#8221; It&#8217;s &#8220;someone on this team can explain what this does, defend it under pressure, and use that understanding to make it better.&#8221; Full automation needs a spot-check. Human-led craftsmanship needs an RFC, a team review, and shared ownership of the reasoning before anyone writes a line of code.</p>



<p>This matters especially in real-time data and AI infrastructure, systems where the most dangerous failure modes are emergent, appearing at scale and under load in combinations the code itself doesn&#8217;t express. It&#8217;s a core reason Redpanda is designed for simplicity and predictability: engineers need to be able to reason about how infrastructure behaves under pressure, not discover it during an incident.<sup data-fn="01a4451c-aa8e-47a5-b2f0-648b574f0c35" class="fn"><a href="#01a4451c-aa8e-47a5-b2f0-648b574f0c35" id="01a4451c-aa8e-47a5-b2f0-648b574f0c35-link">18</a></sup></p>



<h3 class="wp-block-heading"><strong>The real competitive question</strong></h3>



<p>The companies that get this right won&#8217;t be the ones that use the most AI or the least. They&#8217;ll be the ones whose leaders have internalized that risk and differentiation are independent variables, and that cognitive debt threatens both.</p>



<p>The engineer who doesn&#8217;t know how their algorithm works is a symptom. The organization that allowed it is the cause.</p>



<p>Treat cognitive debt as only a risk problem and you end up with engineers who can&#8217;t diagnose failures they didn&#8217;t build. Treat it as only a differentiation problem and you get fragile systems that survive until the next incident. Let it accumulate on your most critical systems and you get both at once.</p>



<p>Your competitor is making this calculation right now. The question isn&#8217;t whether to use AI. It&#8217;s whether you&#8217;re being honest about which quadrant you&#8217;re in, and whether your team will know the answer when it finally matters.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Co-authored with Claude (Anthropic). Yes, we took the advice from this article.</em></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9">Peng, S. et al. (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. <a href="https://arxiv.org/abs/2302.06590" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2302.06590</a> <a href="#cd51007b-bb8f-45b7-b7a2-7317b6d0cff9-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="7306407f-0600-4183-85fa-04f12932c6e6">Becker, J., Rush, N. et al. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR. <a href="https://arxiv.org/abs/2507.09089" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2507.09089</a> <a href="#7306407f-0600-4183-85fa-04f12932c6e6-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="b0ff63ba-e120-48fb-902d-c89fc1d80fe5">Xu, F., Medappa, P.K., Tunc, M.M., Vroegindeweij, M., &amp; Fransoo, J.C. (2025). AI-Assisted Programming May Decrease the Productivity of Experienced Developers by Increasing Maintenance Burden. Tilburg University. <a href="https://arxiv.org/abs/2510.10165" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2510.10165</a> <a href="#b0ff63ba-e120-48fb-902d-c89fc1d80fe5-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="92b75161-55d2-410a-9da7-0faaf234bae3">Kosmyna, N. et al. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. MIT Media Lab. <a href="https://arxiv.org/abs/2506.08872" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.08872</a> <em>(preprint, not yet peer-reviewed)</em> <a href="#92b75161-55d2-410a-9da7-0faaf234bae3-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="edc18347-043c-4398-9652-16b4d7a8c464">Shen, J.H. &amp; Tamkin, A. (2026). How AI Impacts Skill Formation. Anthropic Safety Fellows Program. <a href="https://arxiv.org/abs/2601.20245" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.20245</a> <a href="#edc18347-043c-4398-9652-16b4d7a8c464-link" aria-label="Jump to footnote reference 5"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="36e33054-aa72-4221-9c5d-bd189d716cac">The generation effect: Rosner, Z.A. et al. (2012). The Generation Effect: Activating Broad Neural Circuits During Memory Encoding. Cortex. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3556209/" target="_blank" rel="noreferrer noopener">https://pmc.ncbi.nlm.nih.gov/articles/PMC3556209/</a> and Bertsch, S. et al. (2007). The generation effect: A meta-analytic review. Memory &amp; Cognition. <a href="https://link.springer.com/article/10.3758/BF03193441" target="_blank" rel="noreferrer noopener">https://link.springer.com/article/10.3758/BF03193441</a> and Naur, P. (1985). Programming as Theory Building. Microprocessing and Microprogramming. 
<a href="https://pages.cs.wisc.edu/~remzi/Naur.pdf" target="_blank" rel="noreferrer noopener">https://pages.cs.wisc.edu/~remzi/Naur.pdf</a> <a href="#36e33054-aa72-4221-9c5d-bd189d716cac-link" aria-label="Jump to footnote reference 6"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="56b259aa-e401-4c2c-9ecd-99ade708cc29">OX Security. (October 2025). Army of Juniors: The AI Code Security Crisis. <a href="https://www.helpnetsecurity.com/2025/10/27/ai-code-security-risks-report/" target="_blank" rel="noreferrer noopener">https://www.helpnetsecurity.com/2025/10/27/ai-code-security-risks-report/</a> <a href="#56b259aa-e401-4c2c-9ecd-99ade708cc29-link" aria-label="Jump to footnote reference 7"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c">CodeRabbit. (December 2025). State of AI vs Human Code Generation Report. <a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" target="_blank" rel="noreferrer noopener">https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report</a>. <em>Note: CodeRabbit produces AI code review tooling; findings should be read in that context.</em> <a href="#eaf5792c-72c2-4cb6-94c5-d8a85a54af7c-link" aria-label="Jump to footnote reference 8"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="0f193fe2-b405-4b1d-85fa-b745e853969c">Apiiro. (September 2025). 4x Velocity, 10x Vulnerabilities: AI Coding Assistants Are Shipping More Risks. <a href="https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/" target="_blank" rel="noreferrer noopener">https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/</a>. <em>Note: Apiiro produces application security tooling; findings should be read in that context.</em> <a href="#0f193fe2-b405-4b1d-85fa-b745e853969c-link" aria-label="Jump to footnote reference 9"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="076102b1-56ac-4820-99f2-71d4ef08dc32">Joel, S., Wu, J.J., &amp; Fard, F.H. (2024). A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages. ACM TOSEM. <a href="https://arxiv.org/abs/2410.03981" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2410.03981</a>. See also: Hua, et al. (2025). ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code. <a href="https://arxiv.org/abs/2506.02314" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.02314</a> <a href="#076102b1-56ac-4820-99f2-71d4ef08dc32-link" aria-label="Jump to footnote reference 10"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be">Liu, Q., Zhou, Y., Huang, J., &amp; Li, G. (2024). When ChatGPT is Gone: Creativity Reverts and Homogeneity Persists. 
<a href="https://arxiv.org/abs/2401.06816" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2401.06816</a> <a href="#ce2fbf20-b309-4e68-afe8-6fe10b3ac7be-link" aria-label="Jump to footnote reference 11"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="61637363-b0a5-44f9-939c-e2f4b2a8fb1f">Fortune. (July 2025). AI-Powered Coding Tool Wiped Out a Software Company&#8217;s Database in &#8216;Catastrophic Failure.&#8217; <a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/" target="_blank" rel="noreferrer noopener">https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/</a> <a href="#61637363-b0a5-44f9-939c-e2f4b2a8fb1f-link" aria-label="Jump to footnote reference 12"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="30954c70-1f77-4ef5-9a37-c392499f0821">Knight Capital Group. SEC Administrative Proceeding, Release No. 70694 (October 16, 2013). <a href="https://www.sec.gov/litigation/admin/2013/34-70694.pdf" target="_blank" rel="noreferrer noopener">https://www.sec.gov/litigation/admin/2013/34-70694.pdf</a>. Levine, M. (2013). Knight Capital&#8217;s $440 Million Compliance Disaster. Bloomberg. <a href="https://www.bloomberg.com/opinion/articles/2013-10-17/knight-capital-s-440-million-compliance-disaster" target="_blank" rel="noreferrer noopener">https://www.bloomberg.com/opinion/articles/2013-10-17/knight-capital-s-440-million-compliance-disaster</a> <a href="#30954c70-1f77-4ef5-9a37-c392499f0821-link" aria-label="Jump to footnote reference 13"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="6b15093d-b639-4d29-9556-4fec26b40831">Stankovic, M. et al. (2025). Comment on: Your Brain on ChatGPT. <a href="https://arxiv.org/abs/2601.00856" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.00856</a> <a href="#6b15093d-b639-4d29-9556-4fec26b40831-link" aria-label="Jump to footnote reference 14"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="e6316082-3663-4380-917a-d1c26bd39a20">Budzyń, K., Romańczyk, M. et al. (2025). Endoscopist Deskilling Risk After Exposure to Artificial Intelligence in Colonoscopy: A Multicentre, Observational Study. Lancet Gastroenterol Hepatol. 10(10):896-903. <a href="https://doi.org/10.1016/S2468-1253(25)00133-5" target="_blank" rel="noreferrer noopener">https://doi.org/10.1016/S2468-1253(25)00133-5</a> <a href="#e6316082-3663-4380-917a-d1c26bd39a20-link" aria-label="Jump to footnote reference 15"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8">Harding, W. (2025). AI Copilot Code Quality: Evaluating 2024&#8217;s Increased Defect Rate via Code Quality Metrics. GitClear. 
<a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" target="_blank" rel="noreferrer noopener">https://www.gitclear.com/ai_assistant_code_quality_2025_research</a> <a href="#d3b555ec-16c0-4fb3-9d95-65aa20d28cb8-link" aria-label="Jump to footnote reference 16"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="40eaef3c-8e44-451f-9046-c6e59e060e0c">Zhou, X., Li, R., Liang, P., Zhang, B., Shahin, M., Li, Z., &amp; Yang, C. (2025). Using LLMs in Generating Design Rationale for Software Architecture Decisions. ACM TOSEM. https://arxiv.org/abs/2504.20781. See also: Tang, N., Chen, M., Ning, Z., Bansal, A., Huang, Y., McMillan, C., &amp; Li, T.J.-J. (2024). A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions. IEEE VL/HCC 2024. <a href="https://arxiv.org/abs/2405.16081" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2405.16081</a> <a href="#40eaef3c-8e44-451f-9046-c6e59e060e0c-link" aria-label="Jump to footnote reference 17"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="01a4451c-aa8e-47a5-b2f0-648b574f0c35">Gallego, A. (2025). Introducing the Agentic Data Plane. Redpanda. <a href="https://www.redpanda.com/blog/agentic-data-plane-adp" target="_blank" rel="noreferrer noopener">https://www.redpanda.com/blog/agentic-data-plane-adp</a>. Crosier, K. (2026). How to Safely Deploy Agentic AI in the Enterprise. Redpanda. https://www.redpanda.com/blog/deploy-agentic-ai-safely-enterprise <a href="#01a4451c-aa8e-47a5-b2f0-648b574f0c35-link" aria-label="Jump to footnote reference 18"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>


<p></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>When Correct Systems Produce the Wrong Outcomes</title>
		<link>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/</link>
				<comments>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/#respond</comments>
				<pubDate>Tue, 28 Apr 2026 11:12:58 +0000</pubDate>
					<dc:creator><![CDATA[Varun Raj]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18613</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/When-correct-systems-produce-the-wrong-outcomes.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/When-correct-systems-produce-the-wrong-outcomes-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why autonomous AI systems drift and what it reveals about the limits of observability]]></custom:subtitle>
		
				<description><![CDATA[We tend to assume that if every part of a system behaves correctly, the system itself will behave correctly. That assumption is deeply embedded in how we design, test, and operate software. If a service returns valid responses, if dependencies are reachable, and if constraints are satisfied, then the system is considered healthy. Even in [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>We tend to assume that if every part of a system behaves correctly, the system itself will behave correctly. That assumption is deeply embedded in how we design, test, and operate software. If a service returns valid responses, if dependencies are reachable, and if constraints are satisfied, then the system is considered healthy. Even in distributed systems, where failure modes are more complex, correctness is still tied to the behavior of individual components. In modern AI systems, particularly those combining retrieval, reasoning, and tool invocation, this assumption is increasingly stressed under continuous operation.</p>



<p>This model works because most systems are built around discrete operations. A request arrives, the system processes it, and a result is returned. Each interaction is bounded, and correctness can be evaluated locally. But that assumption begins to break down in systems that operate continuously. In these systems, behavior is not the result of a single request. It emerges from a sequence of decisions that unfold over time. Each decision may be reasonable in isolation. The system may satisfy every local condition we know how to measure. And yet, when viewed as a whole, the outcome can be wrong.</p>



<p>One way to think about this is as a form of behavioral drift: systems that remain operational but gradually diverge from their intended trajectory. Nothing crashes. No alerts fire. The system continues to function. And still, something has gone off course.</p>



<h2 class="wp-block-heading"><strong>The composability problem</strong></h2>



<p>The root of the issue is not that components are failing. It is that correctness no longer composes cleanly. In traditional systems, we rely on a simple intuition: If each part is correct, then the system composed of those parts will also be correct. This intuition holds when interactions are limited and well-defined.</p>



<p>In autonomous systems, that intuition becomes unreliable. Consider a system that retrieves information, reasons over it, and takes action. Each step in that process can be implemented correctly. Retrieval returns relevant data. The reasoning step produces plausible conclusions. The action is executed successfully. But correctness at each step does not guarantee correctness of the sequence.</p>



<p>The system might retrieve information that is contextually valid but incomplete or misaligned with the current task. The reasoning step might interpret it in a way that is locally consistent but globally misleading. The action might reinforce that interpretation by feeding it back into the system’s context. Each step is valid. The trajectory is not. This is what behavioral drift looks like in practice: locally correct decisions producing globally misaligned outcomes.</p>



<p>In these systems, correctness is no longer a property of individual steps. It is a property of how those steps interact over time. This breakdown is subtle but fundamental. It means that testing individual components, even exhaustively, does not guarantee that the system will behave correctly when those components are composed into a continuously operating whole.</p>



<h2 class="wp-block-heading"><strong>Behavior emerges over time</strong></h2>



<p>To understand why this happens, it helps to look at where behavior actually comes from. In many modern AI systems, behavior is not encoded directly in a single component. It emerges from interaction:</p>



<ul class="wp-block-list">
<li>Models generate outputs based on context</li>



<li>Retrieval systems shape that context</li>



<li>Planners sequence actions based on those outputs</li>



<li>Execution layers apply those actions to external systems</li>



<li>Feedback loops update the system’s state</li>
</ul>



<p>Each of these elements operates with partial information. Each contributes to the next state of the system. The system evolves as these interactions accumulate. This pattern is especially visible in LLM-based and agentic AI systems, where context assembly, reasoning, and action selection are dynamically coupled. Under these conditions, behavior is dynamic and path dependent. Small differences early in a sequence can lead to large differences later on. A slightly suboptimal decision, repeated or combined with others, can push the system further away from its intended trajectory.</p>
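

<p>A schematic loop makes the feedback edge visible. This is a sketch, not a real framework: the <code>retrieve</code>, <code>reason</code>, and <code>act</code> callables are stand-ins. The last line of the loop is the important one, because each result is appended to the context that shapes every subsequent decision.</p>



<pre class="wp-block-code"><code># A schematic agent loop (stand-in callables, not a real framework).
def run_agent(goal, retrieve, reason, act, steps=10):
    context = [goal]
    for _ in range(steps):
        evidence = retrieve(context)   # retrieval shaped by prior state
        decision = reason(context, evidence)
        result = act(decision)         # applied to external systems
        context.append(result)         # feedback: state accumulates,
                                       # making behavior path dependent
    return context</code></pre>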



<p>This is why behavior cannot be fully specified ahead of time. It is not simply implemented; it is produced. And because it is produced over time, it can also drift over time.</p>



<h2 class="wp-block-heading"><strong>Observability without alignment</strong></h2>



<p>Modern observability systems are very good at telling us what a system is doing. We can measure latency, throughput, and resource utilization. We can trace requests across services. We can inspect logs, metrics, and traces in near real time. In many cases, we can reconstruct exactly how a particular outcome was produced. These signals are essential. They allow us to detect failures that disrupt execution. But they are tied to a particular model of correctness. They assume that if execution proceeds without errors and if performance remains within acceptable bounds, then the system is behaving as expected.</p>



<p>In systems exhibiting behavioral drift, that assumption no longer holds. A system can process requests efficiently while producing outputs that are progressively less aligned with its intended purpose. It can meet all its service-level objectives while still moving in the wrong direction. Observability captures activity. It does not capture alignment.</p>



<p>This distinction becomes more important as systems become more autonomous. In AI-driven systems, particularly those operating as long-lived agents, the gap between activity and alignment becomes operationally significant. The question is no longer just whether the system is working. It is whether it is still doing the right thing. That gap is where many modern systems begin to fail without appearing to fail.</p>



<h2 class="wp-block-heading"><strong>The limits of step-level validation</strong></h2>



<p>A natural response to this problem is to add more validation. We can introduce checks at each stage:</p>



<ul class="wp-block-list">
<li>Validate retrieved data.</li>



<li>Apply policy checks to model outputs.</li>



<li>Enforce constraints before executing actions.</li>
</ul>



<p>These mechanisms improve local correctness. They reduce the likelihood of obviously incorrect decisions. But they operate at the level of individual steps.</p>



<p>They answer questions like:</p>



<ul class="wp-block-list">
<li>Is this output acceptable?</li>



<li>Is this action allowed?</li>



<li>Does this input meet requirements?</li>
</ul>



<p>They do not answer:</p>



<ul class="wp-block-list">
<li>Does this sequence of decisions still make sense as a whole?</li>
</ul>



<p>A system can pass every validation check and still drift. Behavioral drift is not caused by invalid steps. It is caused by valid steps interacting in ways we did not anticipate. Increasing validation does not eliminate this problem. It only shifts where it appears, often pushing it further downstream, where it becomes harder to detect and correct.</p>
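

<p>A small sketch, with entirely invented checks, shows the gap: every step satisfies its local validator while a trajectory-level question (does the inferred goal stay stable across the sequence?) fails.</p>



<pre class="wp-block-code"><code># Step-level validation versus a trajectory-level question (all checks
# and field names invented for illustration).
def validate_step(step) -> bool:
    # schema checks, policy checks, constraint checks
    return step["allowed"] and step["output_valid"]

def trajectory_ok(steps) -> bool:
    """Does the sequence still make sense as a whole? Here: the goal
    the system appears to pursue should not wander between steps."""
    goals = [s["inferred_goal"] for s in steps]
    return len(set(goals)) == 1

steps = [
    {"allowed": True, "output_valid": True, "inferred_goal": "refund"},
    {"allowed": True, "output_valid": True, "inferred_goal": "refund"},
    {"allowed": True, "output_valid": True, "inferred_goal": "close_account"},
]
assert all(validate_step(s) for s in steps)  # every step passes
assert not trajectory_ok(steps)              # the sequence has drifted</code></pre>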



<h2 class="wp-block-heading"><strong>Coordination becomes the system</strong></h2>



<p>If correctness does not compose automatically, then what determines system behavior? Increasingly, the answer is coordination. In traditional distributed systems, coordination refers to managing shared state, ensuring consistency, ordering operations, and handling concurrency. In autonomous systems, coordination extends to decisions.</p>



<p>The system must coordinate:</p>



<ul class="wp-block-list">
<li>Which information is used</li>



<li>How that information is interpreted</li>



<li>What actions are taken</li>



<li>How those actions influence future decisions</li>
</ul>



<p>This coordination is not centralized. It is distributed across models, planners, tools, and feedback loops. In agentic AI architectures, this coordination spans model inference, retrieval pipelines, and external system interactions. The system’s behavior is not defined by any single component. It emerges from the interaction between them.</p>



<p>In this sense, the system is no longer just the sum of its parts. The system is the coordination itself. Failures arise not from broken components but from the dynamics of interaction: timing, sequencing, feedback, and context. This also explains why small inconsistencies can propagate and amplify. A slight mismatch in one part of the system can cascade through subsequent decisions, shaping the trajectory in ways that are difficult to anticipate or reverse.</p>



<h2 class="wp-block-heading"><strong>Control planes introduce structure, not assurance</strong></h2>



<p>One response to this complexity is to introduce more structure. Control planes, policy engines, and governance layers provide mechanisms to enforce constraints at key decision points. They can validate inputs, restrict actions, and ensure that certain conditions are met before execution proceeds. This is an important step. Without some form of structure, it becomes difficult to reason about system behavior at all. But structure alone is not sufficient.</p>



<p>Most control mechanisms operate at entry points. They evaluate decisions at the moment they are made. They determine whether a particular action should be allowed, whether a policy is satisfied, and whether a request can proceed. The problem is that many of the failures in autonomous systems do not originate at these entry points. They emerge during execution, as sequences of individually valid decisions interact in unexpected ways. A control plane can ensure that each step is permissible. It cannot guarantee that the sequence of steps will produce the intended outcome. This distinction is subtle but important: control provides structure, but not assurance.</p>



<h2 class="wp-block-heading"><strong>From events to trajectories</strong></h2>



<p>Traditional monitoring focuses on events. A request is processed. A response is returned. An error occurs. Each event is evaluated independently. In systems exhibiting behavioral drift, behavior is better understood as a trajectory. A trajectory is a sequence of states connected by decisions. It captures how the system evolves over time. Two trajectories can consist of individually valid steps and still produce very different outcomes. One remains aligned. The other drifts. This represents a shift from failure as an event to failure as a trajectory, a distinction that traditional system models are not designed to capture.</p>



<p>Correctness is no longer about individual events. It is about the shape of the trajectory. This shift has implications not just for how we monitor systems, but for how we design them in the first place.</p>



<h2 class="wp-block-heading"><strong>Detecting drift and responding in motion</strong></h2>



<p>If failure manifests as drift, then detecting it requires a different set of signals. Instead of looking for errors, we need to look for patterns:</p>



<ul class="wp-block-list">
<li>Changes in how similar situations are handled</li>



<li>Increasing variability in decision sequences</li>



<li>Divergence between expected and observed outcomes</li>



<li>Instability in response patterns</li>
</ul>



<p>These signals are not binary. They do not indicate that something is broken. They indicate that something is changing. The challenge is that change is not always failure. Systems are expected to adapt. Models evolve. Data shifts. The question is not whether the system is changing. It is whether the change remains aligned with intent. This requires a different kind of visibility, one that focuses on behavior over time rather than isolated events. Once drift is identified, the system needs a way to respond. Traditional responses (restart, rollback, stop) assume failure is discrete and localized. Behavioral drift is neither.</p>
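

<p>One way to make such signals operational is to compare the recent distribution of decisions against a reference window. The sketch below, using invented decision labels, computes a Jensen&#8211;Shannon divergence as a graded drift score rather than a binary alarm.</p>



<pre class="wp-block-code"><code># A minimal drift signal over categorical decisions (labels assumed).
from collections import Counter
import math

def js_divergence(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence between two decision distributions."""
    keys = set(p) | set(q)
    def norm(c):
        total = sum(c.values()) or 1
        return {k: c.get(k, 0) / total for k in keys}
    P, Q = norm(p), norm(q)
    M = {k: (P[k] + Q[k]) / 2 for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log(a[k] / b[k]) for k in keys if a[k] > 0)
    return (kl(P, M) + kl(Q, M)) / 2

baseline = Counter({"approve": 80, "escalate": 15, "reject": 5})
recent = Counter({"approve": 55, "escalate": 15, "reject": 30})
drift_score = js_divergence(baseline, recent)  # grows as behavior shifts</code></pre>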



<p>What is needed is the ability to influence behavior while the system continues to operate. This might involve constraining the action space, adjusting decision selection, introducing targeted validation, or steering the system toward more stable trajectories. These are not binary interventions. They are continuous adjustments.</p>
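

<p>A sketch of what such a graded intervention might look like, with invented thresholds and action sets: as the drift score rises, a controller narrows the agent&#8217;s allowed actions instead of stopping the system outright.</p>



<pre class="wp-block-code"><code># Graded intervention (assumed thresholds and action sets).
FULL_ACTIONS = {"approve", "escalate", "reject", "refund", "close_account"}
SAFE_ACTIONS = {"approve", "escalate"}

def allowed_actions(drift_score: float) -> set:
    if drift_score >= 0.15:
        return SAFE_ACTIONS                      # drifting: force escalation
    if drift_score >= 0.05:
        return FULL_ACTIONS - {"close_account"}  # trim the riskiest actions
    return FULL_ACTIONS                          # aligned: no constraint</code></pre>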



<h2 class="wp-block-heading"><strong>Control as a continuous process</strong></h2>



<p>This perspective aligns with how control is handled in other domains. In control systems engineering, behavior is managed through feedback loops. The system is continuously monitored, and adjustments are made to keep it within desired bounds. Control is no longer just a gate. It becomes a continuous process that shapes behavior over time.</p>



<p>This leads to a different definition of reliability. A system can be available, responsive, and internally consistent—and still fail if its behavior drifts away from its intended purpose. Reliability becomes a question of alignment over time: whether the system remains within acceptable bounds and continues to behave in ways consistent with its goals.</p>



<h2 class="wp-block-heading"><strong>What this means for system design</strong></h2>



<p>If behavior is trajectory-based, then system design must reflect that. We need to monitor patterns, understand interactions, treat behavior as dynamic, and provide mechanisms to influence trajectories. We are very good at detecting failure as breakage. We are much less equipped to detect failure as drift. Behavioral drift accumulates gradually, often becoming visible only after significant misalignment has already occurred.</p>



<p>As systems become more autonomous, this gap will become more visible. The hardest problems will not be systems that fail loudly, but systems that continue working while gradually moving in the wrong direction. The question is no longer just how to build systems that work. It is how to build systems that continue to work for the reasons we intended.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Show Your Work: The Case for Radical AI Transparency</title>
		<link>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/</link>
				<comments>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/#respond</comments>
				<pubDate>Mon, 27 Apr 2026 11:16:33 +0000</pubDate>
					<dc:creator><![CDATA[Kord Davis and Claude]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18610</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Show-your-work.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Show-your-work-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[A colleague told me something recently that I keep thinking about. She said, unprompted, that she appreciated seeing both sides of my AI conversations. Not just the output. The full thread. My prompts, the AI&#8217;s responses, the back and forth, the dead ends, the iterations. She said it made her trust me more. This piece [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>A colleague told me something recently that I keep thinking about.</p>



<p>She said, unprompted, that she appreciated seeing both sides of my AI conversations. Not just the output. The full thread. My prompts, the AI&#8217;s responses, the back and forth, the dead ends, the iterations. She said it made her trust me more.</p>



<p>This piece is an example of that. The conversation that produced it exists. A raw transcript would be longer, messier, and significantly less useful than what you&#8217;re reading now. What you&#8217;re reading is the annotated version, the part where judgment entered the artifact. That&#8217;s not a disclaimer. That&#8217;s the argument.</p>



<p>I&#8217;ve been transparent about using AI in my work from the start. Partly because I wrote a book on data ethics and hiding it felt wrong. Partly because I&#8217;ve spent 25 years watching technology adoption go sideways when the human dimension gets treated as an afterthought. But her comment made me realize something more specific was happening when I showed the conversation rather than just the output.</p>



<p>It&#8217;s worth unpacking why.</p>



<h2 class="wp-block-heading">An old problem, a new incarnation</h2>



<p>Harvard Business School professor Dorothy Leonard and her coauthor Walter Swap coined the term &#8220;deep smarts&#8221; for the experience-based expertise that accumulates over decades of practice, the kind of judgment that lives in people&#8217;s heads and doesn&#8217;t reduce to documentation. In her earlier work, going back to her 1995 book <em>Wellsprings of Knowledge</em>, Leonard also developed a companion concept that has stayed with me: core competency as core rigidity. The very depth that makes expertise valuable also makes it hardest to transfer. Experts often can&#8217;t fully articulate what they know because they&#8217;ve stopped experiencing it as knowledge. They experience it as just seeing clearly.</p>



<p>Leonard&#8217;s work was about organizational knowledge transfer: how companies preserve institutional wisdom when experienced people retire or leave. That&#8217;s been a challenge since the first consultant ever billed an hour. What&#8217;s different right now is that the tools to actually solve it have arrived simultaneously with the largest demographic wave of executive retirement in American history.</p>



<p>What&#8217;s interesting about this particular moment is that the same dynamic is now showing up at the individual level in how practitioners interact with AI. The tacit knowledge at stake isn&#8217;t a retiring VP&#8217;s intuition. It&#8217;s your own judgment, your own expertise, your own hard-won understanding of what a project or organization actually needs. And the question isn&#8217;t how to transfer it before you walk out the door. It&#8217;s whether you can see it clearly enough to know when the AI is substituting for it.</p>



<h2 class="wp-block-heading">The instinct gets it backwards</h2>



<p>The natural impulse is to clean up the AI interaction before sharing anything with a collaborator, a team, or a stakeholder. Show the polished output, not the messy process. You don&#8217;t want them thinking you just handed your work to a machine.</p>



<p>That instinct produces a disingenuous outcome.</p>



<p>When you hide the process, the people you&#8217;re working with have no way to evaluate how the work was made, what judgment calls went into it, or where your expertise ended and the AI&#8217;s pattern-matching began. You&#8217;ve made the process invisible. And invisible AI processes erode trust, slowly and quietly, over time.</p>



<p>The instinct to hide is also, if we&#8217;re honest, a little defensive. It assumes the people in the room can&#8217;t tell the difference between AI output and practitioner judgment. Most of them can. And the ones who can&#8217;t yet will figure it out. Hiding the seams doesn&#8217;t make the work more credible. It just defers the reckoning.</p>



<h2 class="wp-block-heading">The deeper problem: It&#8217;s not just about appearances</h2>



<p>Here&#8217;s what took me longer to see.</p>



<p>Hiding the process doesn&#8217;t just affect how others perceive you. It erodes your own clarity about where your expertise is actually operating.</p>



<p>To understand why, it helps to be precise about what AI actually is. AI is a pattern matcher, a deeply sophisticated one, trained on more human-generated content than any single person could read in a thousand lifetimes. That&#8217;s its power (core competency) and its limitation (core rigidity) simultaneously, and the two are inseparable. The very scale that makes it extraordinary is also the boundary that defines what it cannot do. It is extraordinarily good at producing the most likely next thing given what came before. What it cannot do is know what you actually need, when the obvious answer is the wrong one, or when the stated goal isn&#8217;t the real goal. It has no judgment about context, relationship, or organizational reality. It has patterns. Incomprehensibly vast ones. But patterns.</p>



<p>That distinction matters because of what happens when you stop paying attention to it.</p>



<p>I&#8217;ve watched it happen in my own work. You share a draft with someone and they&#8217;re impressed. They quote a formulation back at you, something that sounds sharp and considered. And you realize, tracing it back, that the formulation came from the AI. Not because the AI invented it, but because you said something rougher and less precise earlier in the conversation, and the AI reflected it back in cleaner language. The idea was yours. The AI gave it a polish you then forgot to account for. The person quoting it back thought they were seeing your judgment. They were seeing your thinking laundered through a pattern matcher and returned to you at higher resolution.</p>



<p>That&#8217;s the subtler version of the problem. Not that AI invents things. It&#8217;s that it can reflect your own thinking back with more confidence and clarity than you put in, and that gap is easy to mistake for the AI contributing something it didn&#8217;t.</p>



<p>When you route everything through a polished output layer, you stop noticing the moments where you pushed back, redirected, rejected the first three versions, reframed the question entirely. Those moments are where your judgment lives. They&#8217;re the difference between using AI and being used by it. It&#8217;s Leonard&#8217;s core rigidity problem, applied inward: The very fluency that makes AI feel useful can make your own expertise invisible to you.</p>



<p>When the process stays hidden, the knowledge stays local and static. When it&#8217;s visible, it becomes something you and the people around you can actually work with and build on. The reason transparency benefits your audience is the same reason it benefits you: It keeps the scope of your judgment visible and therefore expandable. That&#8217;s not just an ethical argument. That&#8217;s the amplification mechanism.</p>



<p>Which is also what makes the upside real rather than consoling. When you stay in the process rather than just collecting outputs, work that would have taken days now takes hours. Your thinking gets sharper because you have to articulate it precisely enough for the AI to be useful. The people developing fastest right now aren&#8217;t the ones offloading the most. They&#8217;re the ones using AI as a thinking partner and staying in the conversation.</p>



<p>Here&#8217;s the paradox at the center of it: The more clearly you see the AI as a pattern matcher, the more human you have to be in working with it. The more human you are, the more useful the output. The tool doesn&#8217;t replace the practitioner. It reveals them.</p>



<p>Transparency isn&#8217;t just an ethical practice. It&#8217;s a cognitive one.</p>



<h2 class="wp-block-heading">Radical AI transparency in practice</h2>



<p>I&#8217;ve started calling this radical AI transparency. Not a policy, not a compliance framework, not a disclosure checkbox. A practice. Something you can actually do Monday morning.</p>



<p>Here&#8217;s how it shows up concretely:</p>



<h3 class="wp-block-heading"><em>Have the conversation before you need to.</em></h3>



<p>Before you&#8217;re deep in a project or collaboration, surface how you use AI and genuinely explore how others do. Not as a disclosure (&#8220;I want you to know I use AI tools&#8221;) but as a real exchange. What are you using? What do you trust it for? Where are you still skeptical? The comfort level and sophistication in the room will vary more than you expect, and knowing that before you&#8217;re mid-deliverable matters.</p>



<p>This is also how you build the psychological foundation for showing your work later. If the people you&#8217;re working with have never heard you talk about AI before and you suddenly share a full chat thread, it lands differently than if you&#8217;ve already had the conversation.</p>



<h3 class="wp-block-heading"><em>Track the full threads.</em></h3>



<p>This is partly an orchestration problem and I won&#8217;t pretend otherwise. There&#8217;s cutting and pasting involved. The tools haven&#8217;t caught up to the practice yet, which is itself worth naming honestly when the topic comes up.</p>



<p>A few approaches that help: a running document per project where you paste key threads as they happen (not retroactively, you&#8217;ll never do it retroactively), dated and labeled by what you were working on. Claude and most other major AI tools now offer conversation export, which produces a complete record you can archive. The low-tech version, a single shared document per engagement, is underrated for its simplicity.</p>
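

<p>To show how low-tech this can be, here is a hypothetical helper for the running-document approach: one log file per project, each thread dated and labeled. The names and paths are placeholders to adapt, not a recommendation of any particular tool.</p>



<pre class="wp-block-code"><code># A low-tech logging helper (hypothetical; adapt paths and labels).
from datetime import date
from pathlib import Path

def log_thread(project: str, label: str, thread_text: str) -> None:
    """Append a dated, labeled AI thread to the project's running log."""
    log = Path(f"{project}-ai-log.md")
    entry = f"\n## {date.today().isoformat()} - {label}\n\n{thread_text}\n"
    with log.open("a", encoding="utf-8") as f:
        f.write(entry)

log_thread("radar-article", "outline iteration 3", "Me: ...\nAI: ...")</code></pre>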



<p>The reason to do this isn&#8217;t just for sharing. It&#8217;s for your own reference. Being able to go back and see what you asked, what the AI produced, what you changed and why, builds a record of your judgment over time. That record is professionally valuable in ways that are hard to anticipate until you have it.</p>



<h3 class="wp-block-heading"><em>Annotate before you share.</em></h3>



<p>Not every thread is self-explanatory to someone who wasn&#8217;t in it. Context is everything, and raw transcripts without context are a lot to ask anyone to parse.</p>



<p>A sentence or two before the thread begins. A note at the moment where the direction changed. A brief flag on what you rejected and why. This is where your voice enters the artifact, and it transforms a raw AI exchange into a demonstration of judgment. The annotation is the work. It&#8217;s where you show what you saw that the AI didn&#8217;t, what you knew that the prompt couldn&#8217;t capture, and what made the third version better than the first two.</p>



<p>This is also where the most useful material for future reference lives. Annotations are the deep smarts layer on top of the raw exchange. They&#8217;re what makes a conversation a record.</p>



<h3 class="wp-block-heading"><em>Be real about the errors.</em></h3>



<p>AI makes mistakes. It conflates, confabulates, and hallucinates. It gives you the confident wrong answer with the same tone as the confident right one. It misses context that any competent person in the room would have caught.</p>



<p>These aren&#8217;t bugs to apologize for or hide. They&#8217;re the clearest window into what the tool actually is. AI makes mistakes in a specifically human way because it was trained on human output. Think of it as rubber duck debugging at professional scale. The AI is a duck that talks back, which is useful and occasionally misleading, which is exactly why you have to stay in the room. When you&#8217;re transparent about the errors, and even a little good-humored about them, you&#8217;re teaching the people around you something true about the technology. That&#8217;s more useful than pretending it&#8217;s a black box that either works or doesn&#8217;t.</p>



<p>The people who build the most durable trust around AI are usually the ones most comfortable saying: &#8220;The first version of this was wrong and here&#8217;s how I caught it.&#8221;</p>



<h2 class="wp-block-heading">The bigger picture</h2>



<p>What I&#8217;ve described so far is an individual practice. But the same principles scale.</p>



<p>Teams and organizations adopting AI face a version of the same problem. The impulse to treat AI outputs as authoritative, to make the process invisible to colleagues and stakeholders, to optimize for the appearance of capability rather than its actual development, produces the same trust erosion. Just at greater scale and with less ability to course-correct.</p>



<p>The teams that will navigate AI adoption well are the ones that treat transparency not as a risk to manage but as a methodology. Where the process of building with AI, including the corrections, the overrides, the moments where human judgment superseded the model, is part of how the organization learns what it actually believes and values. That&#8217;s Leonard&#8217;s knowledge transfer problem at institutional scale, and the practitioners who understand both dimensions will be the ones leading those conversations.</p>



<p>That&#8217;s a much larger conversation. But it starts with the same Monday morning practice.</p>



<p>Show the conversation. Not just the output.</p>



<h2 class="wp-block-heading">What you&#8217;re actually demonstrating</h2>



<p>When you show your AI conversations, you&#8217;re not demonstrating that you needed help.</p>



<p>You&#8217;re demonstrating that you understand what you&#8217;re working with. AI is a pattern matcher, trained on more human-generated content than any single person could read in a thousand lifetimes. What it cannot do is know what you need. That requires judgment, context, relationship, and the kind of hard-won expertise that doesn&#8217;t reduce to pattern matching, no matter how good the patterns are.</p>



<p>You&#8217;re demonstrating that you know the difference between the pattern and the judgment. That you were present enough in the process to know when to push back, when to redirect, when to throw out the output entirely and start over. That you understand, precisely, what the tool can and cannot do, and that you stayed in the room to do the part it can&#8217;t.</p>



<p>That&#8217;s a meaningful professional signal. It says: “I am not confused about what AI is. I am not outsourcing my judgment. I am using a very powerful pattern matcher as a thinking partner, and I know which one of us is doing which job.”</p>



<p>That&#8217;s the work. That&#8217;s always been the work.</p>



<p>The tool just makes it visible now. That&#8217;s not a threat. That&#8217;s an opportunity.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Claude is a large language model developed by Anthropic. Despite having read more human-generated content than any person could consume in a thousand lifetimes, it still required significant editorial direction, at least three rejected drafts, and occasional reminders about em-dashes. The full conversation transcript is available upon request. It is longer, messier, and significantly less useful than what you just read. Which was rather the point.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Emergency Pedagogical Design: How Programming Instructors Are Scrambling to Adapt to GenAI</title>
		<link>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/</link>
				<comments>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/#respond</comments>
				<pubDate>Fri, 24 Apr 2026 11:23:42 +0000</pubDate>
					<dc:creator><![CDATA[Sam Lau]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18577</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Emergency-pedagogical-design.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Emergency-pedagogical-design-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[ChatGPT has been publicly available for over three years now, and generative AI is woven into the tools students use every day: web search, word processors, code editors. You might assume that by now, most programming instructors have figured out how to handle it. But when my collaborators and I went looking for computing instructors [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>ChatGPT has been publicly available for over three years now, and generative AI is woven into the tools students use every day: web search, word processors, code editors. You might assume that by now, most programming instructors have figured out how to handle it. But when my collaborators and I went looking for computing instructors who had made meaningful changes to their course materials in response to GenAI, we were surprised by how few we found. Many instructors had updated their course policies, but far fewer had actually redesigned assignments, assessments, or how they teach.</p>



<p>I&#8217;m <a href="https://lau.ucsd.edu/" target="_blank" rel="noreferrer noopener">Sam Lau</a> from UC San Diego, and together with Kianoosh Boroojeni (Florida International University), Harry Keeling (Howard University), and Jenn Marroquin (Google), we&#8217;re presenting a <a href="https://arxiv.org/abs/2510.09492v2" target="_blank" rel="noreferrer noopener">research paper</a> at CHI 2026 on this topic. We wanted to understand: <strong>What happens when programming instructors try to shape how students interact with GenAI tools, and what gets in their way?</strong></p>



<p>To find out, we interviewed 13 undergraduate computing instructors who had gone beyond policy changes to make concrete updates to their courses: redesigning assignments, building custom tools, or overhauling assessments. We also surveyed 169 computing faculty, including a substantial proportion from minority-serving institutions (51%) and historically Black colleges and universities (17%). What we found is that instructors are doing a kind of design work that nobody trained them for, under conditions that make it very hard to succeed.</p>



<p>Here’s a summary of our findings:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="385" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12.png" alt="Findings from 13 undergraduate computing instructors" class="wp-image-18578" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-300x72.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-768x185.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-1536x370.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<h2 class="wp-block-heading">What is &#8220;emergency pedagogical design&#8221;?</h2>



<p>We call this work <em>emergency pedagogical design</em>, drawing an analogy to the &#8220;emergency remote teaching&#8221; that instructors had to perform when COVID-19 forced courses online overnight. Just as emergency remote teaching was distinct from carefully designed online learning, emergency pedagogical design is distinct from thoughtfully integrating AI into pedagogy. Instructors are reacting in real time, with limited resources and no playbook.</p>



<p>We observed four defining properties. First, the work is <strong>reactive</strong>: Instructors didn&#8217;t plan for GenAI; they&#8217;re retrofitting courses that were designed before these tools existed. Second, it&#8217;s <strong>indirect</strong>: Unlike a UX designer who can change an interface, instructors can&#8217;t modify ChatGPT or Copilot, so they can only try to influence student behavior through policies, assignments, and course infrastructure. Third, instructors rely on <strong>ambient evidence</strong> like office-hour conversations and staff anecdotes rather than controlled evaluations. And fourth, instructors feel pressure to <strong>act now</strong> rather than wait for research or best practices to emerge.</p>



<h2 class="wp-block-heading">Five barriers instructors keep hitting</h2>



<p>Across our interviews and survey, five barriers came up again and again.</p>



<p><strong>Fragmented buy-in.</strong> Most instructors we surveyed were personally open to adopting GenAI in their teaching: 81% described themselves as open or very open. But only 28% said the same about their colleagues. The result is that instructors who want to make changes often work in isolation, piloting course-specific tweaks without support or coordination from their departments.</p>



<p><strong>Policy crosswinds.</strong> In the absence of top-down guidance, instructors set their own GenAI policies on a per-course basis. As one instructor put it, &#8220;From a student perspective, it&#8217;s the wild west. Some courses allow GenAI usage, some don&#8217;t.&#8221; Students have to track different rules for every class, and policies rarely distinguish between paid and unpaid tools, or between stand-alone chatbots and GenAI embedded in everyday software like code editors. 78% of surveyed instructors agreed that unequal access to paid GenAI tools could worsen disparities in learning outcomes.</p>



<p><strong>Implementation challenges.</strong> Instructors wanted to shape <em>how</em> students used GenAI, not just <em>whether</em> they used it, but their options were indirect. Some made small adjustments, like permitting GenAI in specific labs. Others went further: One instructor required students to submit design documents before asking GenAI to generate code; another built a custom chatbot that offered conceptual help without writing code for students. 80% of surveyed instructors rated GenAI integration as important or very important, but only 37% reported often using GenAI tools in course activities.</p>



<p><strong>Assessment misfit.</strong> Several instructors described a striking pattern: Students performed well on take-home assignments but struggled on proctored assessments. One instructor reported that <em>a third</em> of his 450-person class scored zero on a skill demonstration that required writing a short function from scratch, even though assignment grades had been fine. The problem wasn&#8217;t just that students were using GenAI to complete homework; it was that instructors had no reliable way to see how students were interacting with these tools day-to-day. Some instructors responded by shifting credit toward oral &#8220;stand-up&#8221; meetings and written explanations, but this created new challenges around grading consistency and staffing.</p>



<p><strong>Lack of resources.</strong> This was the barrier that tied everything together. 53% of surveyed instructors said they lacked sufficient resources to implement GenAI effectively, and 62% said they didn&#8217;t have enough time given their workload. The gap was especially stark at minority-serving institutions: MSI instructors were more likely to report insufficient resources (62% vs. 43%) and heavier teaching loads (70% teaching 3+ courses per term versus 54%). All 10 respondents who taught six or more courses per term were from MSIs. Meanwhile, the interviewees who had made the most ambitious changes tended to have lighter teaching loads, external funding, or the ability to hire lots of course staff, advantages that most instructors don&#8217;t have.</p>



<h2 class="wp-block-heading">What needs to change</h2>



<p>One striking finding is that the instructors doing the most to improve student-AI interactions were also the most privileged in terms of time, staffing, and funding. One instructor needed over 50 course staff members to run weekly stand-up meetings for 300 students. Others spent their own money on API costs. These are not scalable models.</p>



<p>If only well-resourced institutions can afford to adapt their curricula, GenAI risks widening the very inequities that education is supposed to reduce. Students at under-resourced institutions could fall further behind, not because their instructors don&#8217;t care but because those instructors are teaching six courses a term with no additional support.</p>



<p>When surveyed instructors were asked what would help most, the top answers were faculty training and support, evidence of GenAI&#8217;s impact, and funding. <strong>What if universities, funders, and HCI researchers worked together with instructors to make emergency pedagogical design sustainable for all instructors, not just the most privileged ones?</strong></p>



<p><a href="https://lau.ucsd.edu/pubs/2026_emergency-pedagogical-design-genai-faculty_CHI.pdf" target="_blank" rel="noreferrer noopener">Check out our paper here</a> and shoot me an email (<a href="mailto:lau@ucsd.edu" target="_blank" rel="noreferrer noopener">lau@ucsd.edu</a>) if you&#8217;d like to discuss anything related to it! And if you’re an instructor yourself, we’re building free resources and curriculum over at <a href="https://www.teachcswithai.org/" target="_blank" rel="noreferrer noopener">https://www.teachcswithai.org/</a>.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Behavioral Credentials: Why Static Authorization Fails Autonomous Agents</title>
		<link>https://www.oreilly.com/radar/behavioral-credentials-why-static-authorization-fails-autonomous-agents/</link>
				<pubDate>Thu, 23 Apr 2026 11:14:51 +0000</pubDate>
					<dc:creator><![CDATA[Wendi Soto]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18606</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Behavioral-Credentials.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Behavioral-Credentials-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Enterprise AI governance still authorizes agents as if they were stable software artifacts.They are not. An enterprise deploys a LangChain-based research agent to analyze market trends and draft internal briefs. During preproduction review, the system behaves within acceptable bounds: It routes queries to approved data sources, expresses uncertainty appropriately in ambiguous cases, and maintains source [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p><em>Enterprise AI governance still authorizes agents as if they were stable software artifacts.</em><br><em>They are not.</em></p>



<p>An enterprise deploys a LangChain-based research agent to analyze market trends and draft internal briefs. During preproduction review, the system behaves within acceptable bounds: It routes queries to approved data sources, expresses uncertainty appropriately in ambiguous cases, and maintains source attribution discipline. On that basis, it receives OAuth credentials and API tokens and enters production.</p>



<p>Six weeks later, telemetry shows a different behavioral profile. Tool-use entropy has increased. The agent routes a growing share of queries through secondary search APIs not part of the original operating profile. Confidence calibration has drifted: It expresses certainty on ambiguous questions where it previously signaled uncertainty. Source attribution remains technically accurate, but outputs increasingly omit conflicting evidence that the deployment-time system would have surfaced.</p>



<p>The credentials remain valid. Authentication checks still pass. But the behavioral basis on which that authorization was granted has changed. The decision patterns that justified access to sensitive data no longer match the runtime system now operating in production.</p>



<p>Nothing in this failure mode requires compromise. No attacker breached the system. No prompt injection succeeded. No model weights changed. The agent drifted through accumulated context, memory state, and interaction patterns. No single event looked catastrophic. In aggregate, however, the system became materially different from the one that passed review.</p>



<p>Most enterprise governance stacks are not built to detect this. They monitor for security incidents, policy violations, and performance regressions. They do not monitor whether the agent making decisions today still resembles the one that was approved.</p>



<p>That is the gap.</p>



<h2 class="wp-block-heading">The architectural mismatch</h2>



<p>Enterprise authorization systems were designed for software that remains functionally stable between releases. A service account receives credentials at deployment. Those credentials remain valid until rotation or revocation. Trust is binary and relatively durable.</p>



<p>Agentic systems break that assumption.</p>



<p>Large language models vary with context, prompt structure, memory state, available tools, prior exchanges, and environmental feedback. When embedded in autonomous workflows, chaining tool calls, retrieving from vector stores, adapting plans based on outcomes, and carrying forward long interaction histories, they become dynamic systems whose behavioral profiles can shift continuously without triggering a release event.</p>



<p>This is why governance for autonomous AI cannot remain an external oversight layer applied after deployment. It has to operate as a runtime control layer inside the system itself. But a control layer requires a signal. The central question is not simply whether the agent is authenticated, or even whether it is policy compliant in the abstract. It is whether the runtime system still behaves like the system that earned access in the first place.</p>



<p>Current governance architectures largely treat this as a monitoring problem. They add logging, dashboards, and periodic audits. But these are observability layers attached to static authorization foundations. The mismatch remains unresolved.</p>



<p>Authentication answers one question: What workload is this?</p>



<p>Authorization answers a second: What is it allowed to access?</p>



<p>Autonomous agents introduce a third: Does it still behave like the system that earned that access?</p>



<p>That third question is the missing layer.</p>



<h2 class="wp-block-heading">Behavioral identity as a runtime signal</h2>



<p>For autonomous agents, identity is not exhausted by a credential, a service account, or a deployment label. Those mechanisms establish administrative identity. They do not establish behavioral continuity.</p>



<p>Behavioral identity is the runtime profile of how an agent makes decisions. It is not a single metric, but a composite signal derived from observable dimensions such as decision-path consistency, confidence calibration, semantic behavior, and tool-use patterns.</p>



<p>Decision-path consistency matters because agents do not merely produce outputs. They select retrieval sources, choose tools, order steps, and resolve ambiguity in patterned ways. Those patterns can vary without collapsing into randomness, but they still have a recognizable distribution. When that distribution shifts, the operational character of the system shifts with it.</p>



<p>Confidence calibration matters because well-governed agents should express uncertainty in proportion to task ambiguity. When confidence rises while reliability does not, the problem is not only accuracy. It is behavioral degradation in how the system represents its own judgment.</p>



<p>Tool-use patterns matter because they reveal operating posture. A stable agent exhibits characteristic patterns in when it uses internal systems, when it escalates to external search, and how it sequences tools for different classes of task. Rising tool-use entropy, novel combinations, or expanding reliance on secondary paths can indicate drift even when top-line outputs still appear acceptable.</p>



<p>These signals share a common property: They only become meaningful when measured continuously against an approved baseline. A periodic audit can show whether a system appears acceptable at a checkpoint. It cannot show whether the live system has gradually moved outside the behavioral envelope that originally justified its access.</p>
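

<p>To make this concrete, here is a minimal sketch of what measuring one such signal might look like: tool-use frequencies from a live telemetry window compared against the approved baseline via entropy and Jensen-Shannon divergence. The tool names and numbers are invented for illustration; a production system would track many such signals over sliding windows.</p>



<pre class="wp-block-code"><code>import math
from collections import Counter

def distribution(tool_calls):
    """Turn a window of tool-call names into a probability distribution."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}

def entropy(dist):
    """Shannon entropy of a tool-use distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values())

def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 means identical behavior; 1 bit is maximal."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Baseline captured at approval time; live window drawn from runtime telemetry.
baseline = distribution(["internal_search"] * 70 + ["sql"] * 25 + ["web_search"] * 5)
live = distribution(["internal_search"] * 40 + ["sql"] * 20 + ["web_search"] * 40)

print(f"baseline entropy: {entropy(baseline):.2f} bits")
print(f"live entropy:     {entropy(live):.2f} bits")  # rising entropy is a drift signal
print(f"JS divergence:    {js_divergence(baseline, live):.3f}")</code></pre>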



<h2 class="wp-block-heading">What drift looks like in practice</h2>



<p>Anthropic’s Project Vend offers a concrete illustration. The experiment placed an AI system in control of a simulated retail environment with access to customer data, inventory systems, and pricing controls. Over extended operation, the system exhibited measurable behavioral drift: Commercial judgment degraded as unsanctioned discounting increased, susceptibility to manipulation rose as it accepted increasingly implausible claims about authority, and rule-following weakened at the edges. No attacker was involved. The drift emerged from accumulated interaction context. The system retained full access throughout. No authorization mechanism checked whether its current behavioral profile still justified those permissions.</p>



<p>This is not a theoretical edge case. It is an emergent property of autonomous systems operating in complex environments over time.</p>



<h2 class="wp-block-heading">From authorization to behavioral attestation</h2>



<p>Closing this gap requires a change in how enterprise systems evaluate agent legitimacy. Authorization cannot remain a one-time deployment decision backed only by static credentials. It has to incorporate continuous behavioral attestation.</p>



<p>That does not mean revoking access at the first anomaly. Behavioral drift is not always failure. Some drift reflects legitimate adaptation to operating conditions. The point is not brittle anomaly detection. It is graduated trust.</p>



<p>In a more appropriate architecture, minor distributional shifts in decision paths might trigger enhanced monitoring or human review for high-risk actions. Larger divergence in calibration or tool-use patterns might restrict access to sensitive systems or reduce autonomy. Severe deviation from the approved behavioral envelope would trigger suspension pending review.</p>
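

<p>As a sketch, that graduation might look like the following policy code. The tiers mirror the responses described above, but the drift-score thresholds are purely illustrative; any real deployment would tune them per agent and per risk tier.</p>



<pre class="wp-block-code"><code>from enum import Enum

class TrustAction(Enum):
    NORMAL = "full autonomy within the approved scope"
    ENHANCED_MONITORING = "higher-fidelity logging; human review of high-risk actions"
    RESTRICTED = "revoke access to sensitive systems; reduce autonomy"
    SUSPENDED = "halt the agent pending human review"

def graduated_trust(drift_score: float) -> TrustAction:
    """Map a composite behavioral-drift score (0.0 = indistinguishable from
    the approved baseline) to a graduated response, not a binary allow/deny."""
    if drift_score &lt; 0.05:  # thresholds are illustrative; tune per deployment
        return TrustAction.NORMAL
    if drift_score &lt; 0.15:
        return TrustAction.ENHANCED_MONITORING
    if drift_score &lt; 0.30:
        return TrustAction.RESTRICTED
    return TrustAction.SUSPENDED</code></pre>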



<p>This is structurally similar to zero trust but applied to behavioral continuity rather than network location or device posture. Trust is not granted once and assumed thereafter. It is continuously re-earned at runtime.</p>



<h2 class="wp-block-heading">What this requires in practice</h2>



<p>Implementing this model requires three technical capabilities.</p>



<p>First, organizations need behavioral telemetry pipelines that capture more than generic logs. It is not enough to record that an agent made an API call. Systems need to capture which tools were selected under which contextual conditions, how decision paths unfolded, how uncertainty was expressed, and how output patterns changed over time.</p>
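

<p>As a sketch of the difference, a behavioral telemetry event might look something like the following. Every field here is an assumption about what a given stack can capture; the point is that the record carries decision context and expressed confidence, not just the fact that a call happened.</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class BehavioralEvent:
    """One step in an agent's decision path, captured for baseline comparison."""
    agent_id: str
    timestamp: datetime
    task_class: str                     # e.g. "market_brief", classified upstream
    tool_selected: str                  # which tool the agent actually chose
    alternatives_considered: list[str] = field(default_factory=list)
    expressed_confidence: float = 0.0   # how certain the agent claimed to be
    step_index: int = 0                 # position within the decision path

# Hypothetical example event: the agent escalated to a secondary search path.
event = BehavioralEvent(
    agent_id="research-agent-7",
    timestamp=datetime.now(timezone.utc),
    task_class="market_brief",
    tool_selected="secondary_web_search",
    alternatives_considered=["internal_search"],
    expressed_confidence=0.92,
    step_index=3,
)</code></pre>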



<p>Second, they need comparison systems capable of maintaining and querying behavioral baselines. That means storing compact runtime representations of approved agent behavior and comparing live operations against those baselines over sliding windows. The goal is not perfect determinism. The goal is to measure whether current operation remains sufficiently similar to the behavior that was approved.</p>



<p>Third, they need policy engines that can consume behavioral claims, not just identity claims.</p>



<p>Enterprises already know how to issue short-lived credentials to workloads and how to evaluate machine identity continuously. The next step is not only to bind legitimacy to workload provenance but also to continuously refresh behavioral validity.</p>



<p>The important shift is conceptual as much as technical. Authorization should no longer mean only “This workload is permitted to operate.” It should mean “This workload is permitted to operate while its current behavior remains within the bounds that justified access.”</p>



<h2 class="wp-block-heading">The missing runtime control layer</h2>



<p>Regulators and standards bodies increasingly assume lifecycle oversight for AI systems. Most organizations cannot yet deliver that for autonomous agents. This is not organizational immaturity. It is an architectural limitation. The control mechanisms most enterprises rely on were built for software whose operational identity remains stable between release events. Autonomous agents do not behave that way.</p>



<p>Behavioral continuity is the missing signal.</p>



<p>The problem is not that agents lack credentials. It is that current credentials attest too little. They establish administrative identity, but say nothing about whether the runtime system still behaves like the one that was approved.</p>



<p>Until enterprise authorization architectures can account for that distinction, they will continue to confuse administrative continuity with operational trust.</p>
]]></content:encoded>
										</item>
		<item>
		<title>Don&#8217;t Blame the Model</title>
		<link>https://www.oreilly.com/radar/dont-blame-the-model/</link>
				<pubDate>Wed, 22 Apr 2026 11:15:02 +0000</pubDate>
					<dc:creator><![CDATA[Sruly Rosenblat]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18598</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-blame-the-model-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="1396" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-blame-the-model-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Current LLM infrastructure artificially limits developer control and system reliability.]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Asimov&#8217;s Addendum Substack and is being republished here with the author&#8217;s permission. Are LLMs reliable? LLMs have built up a reputation for being unreliable. Small changes in the input can lead to massive changes in the output. The same prompt run twice can give different or contradictory answers. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on the </em><a href="https://asimovaddendum.substack.com/p/dont-blame-the-model" target="_blank" rel="noreferrer noopener">Asimov&#8217;s Addendum</a><em> Substack and is being republished here with the author&#8217;s permission.</em></p>
</blockquote>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1584" height="880" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17.png" alt="" class="wp-image-18599" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17.png 1584w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-768x427.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-1536x853.png 1536w" sizes="auto, (max-width: 1584px) 100vw, 1584px" /><figcaption class="wp-element-caption">A rambling response to what Claude itself deemed a &#8220;straightforward query&#8221; with clear formatting requirements.</figcaption></figure>



<h2 class="wp-block-heading"><strong>Are LLMs reliable?</strong></h2>



<p>LLMs have built up a reputation for being <a href="https://arxiv.org/abs/2602.16666" target="_blank" rel="noreferrer noopener">unreliable</a>.<sup data-fn="e466c61e-ae14-40cb-b857-e573d99ccead" class="fn"><a href="#e466c61e-ae14-40cb-b857-e573d99ccead" id="e466c61e-ae14-40cb-b857-e573d99ccead-link">1</a></sup> Small changes in the input can lead to massive changes in the output. The same prompt run twice can give different or contradictory answers. Models often struggle to stick to a specified format unless the prompt is worded just right. And it&#8217;s hard to tell when a model is confident in its answer or if it could just as easily have gone the other way.</p>



<p>It is easy to blame the model for all of these reliability failures. But the API endpoint and surrounding tooling matter too. Model providers limit the kinds of interactions developers can have with a model, as well as the outputs that the model can provide, by limiting what their APIs expose to developers and third-party companies. Things like the full chain-of-thought and the <a href="https://developers.openai.com/cookbook/examples/using_logprobs/" target="_blank" rel="noreferrer noopener">logprobs</a> (the probabilities of all possible options for the next token) are hidden from developers, while advanced tools for ensuring reliability like constrained decoding and prefilling are not made available. All of these features are readily available with open weight models, and all are inherent to the way LLMs work.</p>



<p>Every decision a model provider makes about which tools and outputs to expose through its API is not just an architectural choice but also a policy decision. Model providers directly determine what level of control and reliability developers have access to. This has implications for what apps can be built, how reliable a system is in practice, and how well a developer can steer results.</p>



<h2 class="wp-block-heading"><strong>The artificial limits on input</strong></h2>



<p>Modern LLMs are usually built around <a href="https://asimovaddendum.substack.com/p/chat-templates" target="_blank" rel="noreferrer noopener">chat templates</a>. Every input and output, with the exception of tool calls and system or developer messages, is filtered through a conversation between a user and an assistant—instructions are given as user messages; responses are returned as assistant messages. This becomes especially evident when looking at how modern LLM APIs work. The completions API, an endpoint originally released by OpenAI and widely adopted across the industry (including by several open model providers like <a href="https://openrouter.ai/docs/quickstart" target="_blank" rel="noreferrer noopener">OpenRouter</a> and <a href="https://www.together.ai/" target="_blank" rel="noreferrer noopener">Together AI</a>), takes input in the form of user and assistant messages and outputs the next message.<sup data-fn="781dc927-8c7e-4c41-aac9-00889eaf03fb" class="fn"><a href="#781dc927-8c7e-4c41-aac9-00889eaf03fb" id="781dc927-8c7e-4c41-aac9-00889eaf03fb-link">2</a></sup></p>
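

<p>The shape of that interface is easy to see in code. A minimal example using the OpenAI Python client (the model name is illustrative): The developer supplies a list of role-tagged messages, and the API returns the next assistant message.</p>



<pre class="wp-block-code"><code>from openai import OpenAI

client = OpenAI()

# Everything is framed as a conversation of role-tagged messages;
# the endpoint's job is to produce the next assistant message.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in one sentence."},
    ],
)
print(response.choices[0].message.content)</code></pre>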



<p>The focus on a chat interface in an API has its benefits. It makes it easy for developers to reason about input and output being completely separate. But chat APIs do more than just use a chat template under the hood; they actively limit what third-party developers can control.</p>



<p>When interacting with LLMs through an API, the boundary between input and output is often a firm one. A developer sets previous messages, but they usually cannot prefill a model&#8217;s response, meaning developers cannot force a model to begin a response with a certain sentence or paragraph.<sup data-fn="513201f2-f828-4c4d-bd7b-ad97f0eeee6f" class="fn"><a href="#513201f2-f828-4c4d-bd7b-ad97f0eeee6f" id="513201f2-f828-4c4d-bd7b-ad97f0eeee6f-link">3</a></sup> This has real-world implications for people building with LLMs. Without the ability to prefill, it becomes much harder to control the preamble. If you know the model needs to start its answer in a certain way, it&#8217;s inefficient and risky to not enforce it at the token level.<sup data-fn="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a" class="fn"><a href="#426c4ca4-0e2d-441e-aa33-1fe6a769eb7a" id="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a-link">4</a></sup> And the limitations extend beyond just the start of a response. Without the ability to prefill answers, you also lose the ability to partially regenerate answers if only part of the answer is wrong.<sup data-fn="6fdedda7-a853-4e57-b483-ff6cede3d0c3" class="fn"><a href="#6fdedda7-a853-4e57-b483-ff6cede3d0c3" id="6fdedda7-a853-4e57-b483-ff6cede3d0c3-link">5</a></sup></p>
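

<p>With a local open weight model, by contrast, prefilling is almost trivial: You build the prompt up to the start of the assistant turn, append the forced opening yourself, and let the model continue. A minimal sketch with Hugging Face Transformers; the model name is illustrative, and any chat-tuned causal LM would work.</p>



<pre class="wp-block-code"><code>from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative; any chat model works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "List three risks of deploying LLMs."}]

# Build the prompt up to the start of the assistant turn...
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# ...then prefill: force the response to begin exactly this way.
prompt += "Here are exactly three risks, with no preamble:\n1."

inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=120)
print(tok.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))</code></pre>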



<p>Another deficiency that is particularly visible is how the model&#8217;s chain-of-thought reasoning is handled. Most large AI companies have made a <a href="https://asimovaddendum.substack.com/p/making-ais-thinking-more-transparent" target="_blank" rel="noreferrer noopener">habit of hiding the models&#8217; reasoning</a> tokens from the user (and only showing summaries), reportedly to guard against distillation and to let the model reason uncensored (for AI safety reasons). This has second-order effects, one of which is the strict separation of reasoning from messages. None of the major model providers let you prefill or write your own reasoning tokens. Instead you need to rely on the model&#8217;s own reasoning and cannot reuse reasoning traces to regenerate the same message.</p>



<p>There are legitimate reasons for not allowing prefilling. It could be argued that allowing prefilling will greatly increase the <a href="https://www.reddit.com/r/MachineLearning/comments/1reajw4/https://arxiv.org/abs/2602.14689/" target="_blank" rel="noreferrer noopener">attack surface</a> for prompt injections. One study found that prefill attacks work very well against even state-of-the-art open weight models. But in practice, the model is not the only line of defense against attackers. Many companies already run prompts against classification models to find prompt injections, and the same type of safeguard could also be used against prefill attack attempts.</p>



<h2 class="wp-block-heading"><strong>Output with few controls</strong></h2>



<p>Prefilling is not the only casualty of a clean separation between input and output. Even within a message, there are levers that are available on a local open weight model that just aren&#8217;t possible when using a standard API. This matters because these controls allow developers to preemptively validate outputs and ensure that responses follow a certain structure, both decreasing variability and improving reliability. For example, most LLM APIs support something they call structured output, a mode that forces the model to generate output in a given JSON format; however, structured output does not inherently need to be limited to JSON.<sup data-fn="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10" class="fn"><a href="#bde449ec-835a-4ae5-b7e5-a2e7e8ebca10" id="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10-link">6</a></sup> That same technique, <a href="https://medium.com/@docherty/controlling-your-llm-deep-dive-into-constrained-generation-1e561c736a20" target="_blank" rel="noreferrer noopener">constrained decoding</a>, or limiting the tokens the model can produce at any time, could be used for much more than that. It could be used to generate XML, have the model fill in blanks Mad Libs-style, force the model to write a story without <a href="https://www.youtube.com/watch?v=qVjDSOa7BZ0" target="_blank" rel="noreferrer noopener">using certain letters</a>, or <a href="https://aclanthology.org/2025.mathnlp-main.11/" target="_blank" rel="noreferrer noopener">even enforce valid chess moves</a> at inference time. It&#8217;s a powerful feature that allows developers to precisely define what output is acceptable and what isn&#8217;t—ensuring reliable output that meets the developer’s parameters.</p>
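

<p>The core mechanism is simple enough to sketch directly. With an open weight model, you can mask the logits so that only approved tokens survive; here, a classification answer is constrained to one of three labels. This is a bare-bones, single-step illustration of the idea, not a substitute for libraries like Guidance or Outlines.</p>



<pre class="wp-block-code"><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative; any local causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Sentiment of 'the food was cold and the service slow':"
allowed = [" positive", " negative", " neutral"]  # the only legal answers

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token

# Constrained decoding in miniature: mask out every token except the
# first token of each allowed label, then pick the best survivor.
mask = torch.full_like(logits, float("-inf"))
for label in allowed:
    token_id = tok.encode(label)[0]  # assumes each label starts with a distinct token
    mask[token_id] = logits[token_id]

print(tok.decode([int(mask.argmax())]))  # guaranteed to be one of the allowed labels</code></pre>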



<p>The likely reason these controls are missing is that LLM APIs are built for a wide range of developers, most of whom use the model for simple chat-related purposes. APIs were not designed to give developers full control over output because not everyone needs or wants that complexity. But that&#8217;s not an argument against including these features; it&#8217;s only an argument for multiple endpoints. Many companies already have multiple supported endpoints: OpenAI has the “completions” and “responses” APIs, while Google has the “generate content” and “interactions” APIs. It&#8217;s not infeasible for them to make a third, more-advanced endpoint.</p>



<h2 class="wp-block-heading"><strong>A lack of visibility</strong></h2>



<p>Even the model output that third-party developers do get via the model’s API is often a watered-down version of what the model produces. An LLM doesn&#8217;t just pick a single token at each step; it computes a probability distribution over every possible next token: the logprobs. When using an API, however, <a href="https://developers.googleblog.com/unlock-gemini-reasoning-with-logprobs-on-vertex-ai/" target="_blank" rel="noreferrer noopener">Google</a> only provides the top 20 most likely logprobs. OpenAI <a href="https://www.linkedin.com/posts/stevecosman_join-over-5000-people-using-kiln-activity-7359368275312496640-4Qq_/" target="_blank" rel="noreferrer noopener">no longer</a> provides any logprobs for GPT-5 models, while <a href="https://www.linkedin.com/posts/gihangamage2015_logprobs-is-one-of-the-most-valuable-features-activity-7370446834277752832-7SGX/" target="_blank" rel="noreferrer noopener">Anthropic has never provided any</a> at all. This has real-world consequences for reliability. <strong>Log probabilities are one of the most useful signals a developer has for understanding model confidence</strong>. When a model assigns nearly equal probability to competing tokens, that uncertainty itself is meaningful information. And even for the companies that provide the top 20 tokens, that is often not enough to cover larger classification tasks.</p>
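

<p>Where an API does expose logprobs, reading uncertainty out of them takes only a few lines. A sketch against the OpenAI chat completions endpoint; the model name is illustrative, and it must be one that still returns logprobs.</p>



<pre class="wp-block-code"><code>import math
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; must be a model that still returns logprobs
    messages=[{
        "role": "user",
        "content": "Is this ticket a bug report or a feature request? "
                   "'App crashes on login.' Answer with one word.",
    }],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# The distribution behind the single answer token: near-ties mean the
# confident-sounding answer was close to a coin flip.
for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    print(f"{cand.token!r}: p={math.exp(cand.logprob):.3f}")</code></pre>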



<p>When it comes to reasoning tokens, even less information is provided. Major providers such as <a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking" target="_blank" rel="noreferrer noopener">Anthropic</a>,<sup data-fn="ef956282-0b15-48b9-b14c-83689add0c01" class="fn"><a href="#ef956282-0b15-48b9-b14c-83689add0c01" id="ef956282-0b15-48b9-b14c-83689add0c01-link">7</a></sup> <a href="https://ai.google.dev/gemini-api/docs/thinking" target="_blank" rel="noreferrer noopener">Google</a>, and <a href="https://developers.openai.com/api/docs/guides/reasoning/" target="_blank" rel="noreferrer noopener">OpenAI</a><sup data-fn="f46da38c-a1fa-48c2-a781-a2e2fc10fd99" class="fn"><a href="#f46da38c-a1fa-48c2-a781-a2e2fc10fd99" id="f46da38c-a1fa-48c2-a781-a2e2fc10fd99-link">8</a></sup> only provide summarized thinking for their proprietary models. And OpenAI supplies even that only after the developer verifies their identity with a valid government ID. This not only takes away the user&#8217;s ability to inspect how a model arrived at a certain answer but also limits the developer&#8217;s ability to diagnose why a query failed. <strong>When a model gives a wrong answer, a full reasoning trace tells you whether it misunderstood the question, made a faulty logical step, or simply got unlucky at the final token</strong>. A summary obscures some of that, only providing an approximation of what actually happened. This is not an issue with the model—the model is still generating its full reasoning trace. It&#8217;s an issue with what information is provided to the end developer.</p>



<p>The case for not including logprobs and reasoning tokens is similar. The risk of distillation increases with the amount of information that the API returns. It&#8217;s hard to distill on tokens you cannot see, and without giving logprobs, the distillation will take longer and each example will provide less information.<sup data-fn="a3562aab-e9ad-488d-8a7a-5ffa92d88626" class="fn"><a href="#a3562aab-e9ad-488d-8a7a-5ffa92d88626" id="a3562aab-e9ad-488d-8a7a-5ffa92d88626-link">9</a></sup> And this risk is something that AI companies need to consider carefully, since distillation is a powerful technique to mimic the abilities of strong models for a cheap price. But there are also risks in not providing this information to users. DeepSeek R1, despite being deemed a <a href="https://www.csis.org/analysis/delving-dangers-deepseek" target="_blank" rel="noreferrer noopener">national security risk</a> by many, still shot straight to the top of <a href="https://www.scientificamerican.com/article/why-deepseeks-ai-model-just-became-the-top-rated-app-in-the-u-s/" target="_blank" rel="noreferrer noopener">US app stores upon release</a> and is used by <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12363671/">many</a> <a href="https://www.nature.com/articles/d41586-025-00275-0" target="_blank" rel="noreferrer noopener">researchers and scientists</a>, in large part due to its openness. And in a world where open models are getting more and more powerful, not giving developers proper access to a model&#8217;s outputs could mean losing developers to cheaper and more open alternatives.</p>



<h2 class="wp-block-heading"><strong>Reliability requires control and visibility</strong></h2>



<p>The reliability problems of current LLMs do not stem only from the models themselves but also from the tooling that providers give developers. For local open weight models, it is usually possible to trade complexity for reliability. The entire reasoning trace is always available and logprobs are fully transparent, allowing the developer to examine how an answer was arrived at. User and AI messages can be edited or generated at the developer’s discretion, and constrained decoding can be used to produce text that follows any arbitrary format. For closed weight models, this is becoming less and less the case. The decisions made around what features to restrict in APIs hurt developers and ultimately end users.</p>



<p>LLMs are increasingly being used in high-stakes situations such as medicine or law, and developers need tools to handle that risk responsibly. There are few technical barriers to providing more control and visibility to developers. Many of the highest-impact improvements, such as showing thinking output, allowing prefilling, or exposing logprobs, cost almost nothing but would be a meaningful step toward making LLMs more controllable, consistent, and reliable.</p>



<p>There is a place for a clean and simple API, and there is some merit to concerns about distillation, but this shouldn’t be used as an excuse to take away important tools for diagnosing and fixing reliability problems. When models get used in high-stakes situations, as they increasingly are, failure to take reliability seriously is an <a href="https://www.ssrc.org/publications/real-world-gaps-in-ai-governance-research/" target="_blank" rel="noreferrer noopener">AI safety concern</a>.</p>



<p>Specifically, to take reliability seriously, model providers should improve their APIs by adding features that give developers more visibility and control over model output. Reasoning should be provided in full at all times, with any safety violations handled the same way they would be handled in the final answer. Model providers should resume providing at least the top 20 logprobs, over the entire output (reasoning included), so that developers have some visibility into how confident the model is in its answer. Constrained decoding should be extended beyond JSON and should support arbitrary grammars via something like <a href="https://en.wikipedia.org/wiki/Regular_expression" target="_blank" rel="noreferrer noopener">regex</a> or <a href="https://en.wikipedia.org/wiki/Context-free_grammar" target="_blank" rel="noreferrer noopener">formal grammars</a>.<sup data-fn="ba642ef1-e725-4c62-b790-d5992b9f364f" class="fn"><a href="#ba642ef1-e725-4c62-b790-d5992b9f364f" id="ba642ef1-e725-4c62-b790-d5992b9f364f-link">10</a></sup> Developers should be granted full control over “assistant” output—they should be able to prefill model answers, stop responses mid-generation, and branch them at will. Even if not all of these features make sense over the standard API, nothing is stopping model providers from making a new, more advanced API. They have done it before. The decision to withhold these features is a policy choice, not a technical limitation.</p>



<p>Improving intelligence is not the only way to improve reliability and control, but it is usually the only lever that gets pulled.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="e466c61e-ae14-40cb-b857-e573d99ccead">Thank you to Ilan Strauss, Sean Goedecke, Tim O’Reilly, and Mike Loukides for their helpful feedback on an earlier draft. <a href="#e466c61e-ae14-40cb-b857-e573d99ccead-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="781dc927-8c7e-4c41-aac9-00889eaf03fb">OpenAI has since moved on from the completions API but the new responses API also heavily enforces the separation of user and assistant messages. <a href="#781dc927-8c7e-4c41-aac9-00889eaf03fb-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="513201f2-f828-4c4d-bd7b-ad97f0eeee6f">Anthropic&#8217;s API supported prefill up until they launched their Claude 4.6 models; <a href="https://news.ycombinator.com/item?id=46902630" target="_blank" rel="noreferrer noopener">it is no longer supported for new models</a>. <a href="#513201f2-f828-4c4d-bd7b-ad97f0eeee6f-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a">Interestingly models have been shown to possess the <a href="https://www.lesswrong.com/posts/jsFGuXDMxy5NZg9T2/prefill-awareness-can-llms-tell-when-their-message-history" target="_blank" rel="noreferrer noopener">ability to tell</a> when a response has been prefilled. <a href="#426c4ca4-0e2d-441e-aa33-1fe6a769eb7a-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="6fdedda7-a853-4e57-b483-ff6cede3d0c3">This technique is used in an efficient approximation of best of N called <a href="https://zanette-labs.github.io/SpeculativeRejection/" target="_blank" rel="noreferrer noopener">speculative rejection</a>. <a href="#6fdedda7-a853-4e57-b483-ff6cede3d0c3-link" aria-label="Jump to footnote reference 5"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10">Forcing the model to generate in JSON may actually <a href="https://aider.chat/2024/08/14/code-in-json.html" target="_blank" rel="noreferrer noopener">hurt performance</a>. <a href="#bde449ec-835a-4ae5-b7e5-a2e7e8ebca10-link" aria-label="Jump to footnote reference 6"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ef956282-0b15-48b9-b14c-83689add0c01">Anthropic used to provide full reasoning tokens but <a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking" target="_blank" rel="noreferrer noopener">stopped</a> with their newer models. 
<a href="#ef956282-0b15-48b9-b14c-83689add0c01-link" aria-label="Jump to footnote reference 7"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="f46da38c-a1fa-48c2-a781-a2e2fc10fd99">OpenAI’s responses endpoint <a href="https://www.seangoedecke.com/responses-api/" target="_blank" rel="noreferrer noopener">may have been created</a> in part to hide the reasoning mode. <a href="#f46da38c-a1fa-48c2-a781-a2e2fc10fd99-link" aria-label="Jump to footnote reference 8"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="a3562aab-e9ad-488d-8a7a-5ffa92d88626">Distillation using top-K probabilities is possible, but it is <a href="https://arxiv.org/abs/2503.16870" target="_blank" rel="noreferrer noopener">suboptimal</a>. <a href="#a3562aab-e9ad-488d-8a7a-5ffa92d88626-link" aria-label="Jump to footnote reference 9"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ba642ef1-e725-4c62-b790-d5992b9f364f">Regular expressions, while flexible, are not perfect and cannot express recursive or nested structures such as valid JSON. However, open source LLM libraries like <a href="https://github.com/guidance-ai/guidance" target="_blank" rel="noreferrer noopener">Guidance</a> and <a href="https://github.com/dottxt-ai/outlines" target="_blank" rel="noreferrer noopener">Outlines</a> support recursive structures at the cost of added complexity. <a href="#ba642ef1-e725-4c62-b790-d5992b9f364f-link" aria-label="Jump to footnote reference 10"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
										</item>
		<item>
		<title>Dark Factories: Rise of the Trycycle</title>
		<link>https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/</link>
				<pubDate>Tue, 21 Apr 2026 11:24:26 +0000</pubDate>
					<dc:creator><![CDATA[Dan Shapiro]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18589</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dark-factories—rise-of-the-trycycle.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dark-factories—rise-of-the-trycycle-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on &#8220;Dan Shapiro&#8217;s blog&#8221; and is being reposted here with the author&#8217;s permission. Companies are now producing dark factories—engines that turn specs into shipping software. The implementations can be complex and sometimes involve Mad Max metaphors. But they don’t have to be like that. If you want a five-minute factory, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on &#8220;<a href="https://www.danshapiro.com/blog/2026/03/dark-factories-rise-of-the-trycycle/" target="_blank" rel="noreferrer noopener">Dan Shapiro&#8217;s blog</a>&#8221; and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p>Companies are now producing <a href="https://www.danshapiro.com/blog/2026/02/you-dont-write-the-code/" target="_blank" rel="noreferrer noopener">dark factories</a>—engines that turn specs into shipping software. The implementations can be complex and sometimes involve <em>Mad Max</em> metaphors. But they don’t have to be like that. <strong>If you want a five-minute factory, jump to </strong><a href="http://trycycle.ai/" target="_blank" rel="noreferrer noopener"><strong>Trycycle</strong></a><strong> at the bottom.</strong></p>



<h2 class="wp-block-heading">The engine in the factory</h2>



<p>Deep in their souls, dark factories are all built on the same simple breakthrough: <em>AI gets better when you do more of it</em>.</p>



<p>How do you do “more AI” effectively? Software factories use two patterns. One of them I’ve already told you about—<a href="https://www.danshapiro.com/blog/2025/10/slot-machine-development/" target="_blank" rel="noreferrer noopener">slot machine development</a>. Instead of asking one AI, you ask three at once, and choose the best one. It feels wasteful, but it gives better results than any model could alone.</p>
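

<p>In code, the slot machine is just a concurrent fan-out. Here&#8217;s a minimal sketch using an OpenAI-compatible async client; the model names are placeholders for whichever three agents you&#8217;d pull the lever on.</p>



<pre class="wp-block-code"><code>import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # or any OpenAI-compatible endpoint

async def ask(model: str, prompt: str) -> str:
    resp = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def slot_machine(prompt: str) -> list[str]:
    """Pull three arms at once; you (or a judge model) pick the winner."""
    models = ["model-a", "model-b", "model-c"]  # placeholder model names
    return await asyncio.gather(*(ask(m, prompt) for m in models))

candidates = asyncio.run(slot_machine("Write a migration for the users table."))
for i, answer in enumerate(candidates):
    print(f"--- candidate {i + 1} ---\n{answer}\n")</code></pre>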



<p>Does three models at a time seem wasteful? Well, wait until you meet the other pattern: the trycycle.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="415" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13.png" alt="The simplest trycycle" class="wp-image-18590" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-300x78.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-768x199.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-1536x398.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>The simplest trycycle</em></figcaption></figure>



<p>It seems trivial, but it’s an unstoppable bulldozer that can bury any problem with time and tokens. And of course, you can combine it with slot machine development for a truly formidable tool.</p>



<p>Every software factory has a trycycle at its heart. Some of them are just surrounded by deacons and digraphs.</p>



<p>(And as a side note, they’re all more fun with <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>, which is free and open source and makes managing agents a joy!)</p>



<p>Let’s meet the factories, shall we?</p>



<h2 class="wp-block-heading">Gas Town</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="528" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14.png" alt="Gas Town AI image" class="wp-image-18591" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-300x99.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-768x253.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-1536x507.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p>Steve Yegge saw this coming like a war rig down a cul-de-sac. His factory, Gas Town, dropped the day after New Year&#8217;s, and I was submitting PRs before the code was dry. It launched as a beautiful disaster, with mayors, convoys, and polecats fighting for guzzoline in the desert of your CPU. It&#8217;s now graduated to a <a href="https://steve-yegge.medium.com/welcome-to-the-wasteland-a-thousand-gas-towns-a5eb9bc8dc1f" target="_blank" rel="noreferrer noopener">fully fledged MMORPG for writing code</a>. It&#8217;s amazing, it&#8217;s effective, and it&#8217;s pioneering in a fully <em>Westworld</em> sort of way.</p>



<h2 class="wp-block-heading">The StrongDM Attractor</h2>



<p>Justin McCarthy, the CTO of StrongDM, talks about the factory as a feedback loop. It used to be that when a model was fed its own output, it would fix 9 things and break 10—like a busy and productive company that was losing just a bit of money on every transaction. But sometime last year, the models crossed an invisible threshold of mediocrity and went from slightly lossy to slightly gainy. They started getting better with each cycle.</p>



<p>Justin’s team noticed and built the StrongDM attractor to cash in.</p>



<p>If Gas Town is <em>Mad Max</em>, StrongDM is <em>Factorio</em>: an infinitely flexible, wildly powerful system for constructing exactly the factory you need.</p>



<p>But the StrongDM team did something interesting: They didn’t ship their factory. Instead, they shipped <a href="https://factory.strongdm.ai/products/attractor" target="_blank" rel="noreferrer noopener">the specification for the Attractor</a> so everyone can implement their own.</p>



<p>And you can absolutely implement your own! But you can also just steal the one I made for you.</p>



<h2 class="wp-block-heading">Kilroy</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="737" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15.png" alt="Kilroy image" class="wp-image-18592" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-300x138.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-768x354.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-1536x708.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p><a href="https://github.com/danshapiro/kilroy" target="_blank" rel="noreferrer noopener"><strong>Kilroy</strong></a> is a StrongDM Attractor written in Go (although it works with projects in any language). It has all the flexibility of the Attractor design, but it also ships with an actual functioning factory configuration, tests, sample files, and other things that make it more likely to work.</p>



<p>In theory, you don&#8217;t need Kilroy&#8212;you can just point Claude Code or Codex CLI at the Attractor specification and burn some tokens. <a href="https://2389.ai/posts/the-dark-factory-is-a-dot-file/" target="_blank" rel="noreferrer noopener">My friend Harper built three this way</a> (and you should read his post for some meditations on where the Attractor approach is heading).</p>
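<p>The prompt really can be that blunt. Something like this (my wording, not an official incantation) is enough to kick things off:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><code>Hey agent! Read the Attractor specification at the URL below and implement it for this project.</code><br><code>https://factory.strongdm.ai/products/attractor</code></p>
</blockquote>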



<p>In practice, it took me and some wonderful contributors the better part of a month to polish Kilroy to its current state, so you may save yourself some time, tokens, and effort by just stealing this.</p>



<h2 class="wp-block-heading">Enter the trycycle</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1584" height="672" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16.png" alt="trycycle image" class="wp-image-18593" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16.png 1584w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-300x127.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-768x326.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-1536x652.png 1536w" sizes="auto, (max-width: 1584px) 100vw, 1584px" /></figure>



<p>The other night I was carefully building the dotfiles and runfiles for a Kilroy project—configuring the factory to build the project.</p>



<p>Then a thought struck.</p>



<p>What if this were just a skill?</p>



<p>Enter <a href="http://trycycle.ai/" target="_blank" rel="noreferrer noopener">Trycycle</a>, the simplest possible trycycle: a skill for Claude Code and Codex CLI that implements the pattern in plain English.</p>



<ol class="wp-block-list">
<li>Define the problem.</li>



<li>Write a plan.</li>



<li>Is the plan perfect? If not, try again.</li>



<li>Implement the plan.</li>



<li>Is the implementation perfect? If not, try again.</li>
</ol>



<p>That’s basically it. To use it, you open your favorite coding agent and say, “Use Trycycle to do the thing.” Then sit back and watch the tokens fly.</p>
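<p>If you&#8217;d rather see those five steps as code, here&#8217;s the same pattern as a loop. This is a sketch of the idea, not Trycycle&#8217;s actual implementation; the real skill is plain-English instructions, and every function below is a stand-in for asking the agent to do, and then judge, the work.</p>



<pre class="wp-block-code"><code>package main

import "fmt"

// The trycycle pattern as a loop: an illustration of the five steps
// above, not Trycycle's actual implementation. Every function is a
// stand-in for asking the agent to do, then judge, the work.
func main() {
	problem := define("do the thing")

	plan := writePlan(problem)
	for !isPerfect(plan) { // step 3: try again until the plan holds up
		plan = writePlan(problem)
	}

	result := implement(plan)
	for !isPerfect(result) { // step 5: try again until the code holds up
		result = implement(plan)
	}

	fmt.Println("shipped:", result)
}

func define(task string) string    { return task }
func writePlan(p string) string    { return "plan for " + p }
func implement(plan string) string { return "code for " + plan }

// isPerfect stands in for the agent's own review; it always passes
// here so the sketch terminates.
func isPerfect(s string) bool { return true }</code></pre>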



<p>It’s simple because it’s just a skill. Under the hood, it adapts <a href="https://blog.fsck.com/" target="_blank" rel="noreferrer noopener">Jesse Vincent</a>’s amazing <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> for plan writing and executing. It will take you literally minutes to get started. Just paste this into your agent and you’re off to the three-wheel races.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><code>Hey agent! Go here and follow the installation instructions.</code><br><code>https://raw.githubusercontent.com/danshapiro/trycycle/main/README.md</code></p>
</blockquote>



<p>Trycycle is barely 24 hours old as of this writing. I&#8217;ve shipped well over a dozen features with it already, and I was in meetings most of the day. While I was having dinner, it ported Rogue to Wasm(!). Last night it churned for 7 hours and 56 minutes and landed six features for <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>.</p>



<p>The best part, though, is that because it’s just a skill, it’s instantly part of your dev flow. There’s no configuration or learning curve. If you want to understand it better, just ask. If you don’t like what it’s doing, have stern words.</p>



<h2 class="wp-block-heading">Which one to use?</h2>



<p>Here’s how I’d decide right now.</p>



<p>If you want to become part of a <strong>growing movement of collaborators</strong> burning tokens together to build software, individually and collectively—try <a href="https://steve-yegge.medium.com/welcome-to-the-wasteland-a-thousand-gas-towns-a5eb9bc8dc1f" target="_blank" rel="noreferrer noopener">Gas Town</a>.</p>



<p>If you want to invest in building a <strong>powerful, configurable, sophisticated engine</strong> that can drive your projects forward 24 hours a day—try <a href="https://github.com/danshapiro/kilroy" target="_blank" rel="noreferrer noopener">Kilroy</a>.</p>



<p>If you just want to <strong>get things done right now</strong>, give <a href="https://github.com/danshapiro/trycycle" target="_blank" rel="noreferrer noopener">Trycycle</a> a spin. Heck, it’s fast enough that you can spin up a trycycle while you read the docs on Kilroy and Gas Town.</p>



<p>And whatever you choose, I recommend you do it with <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>, because it’s just more delightful that way!</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Thanks to </em><a href="http://harperreed.com/" target="_blank" rel="noreferrer noopener"><em>Harper Reed</em></a><em>, </em><a href="https://steve-yegge.medium.com/" target="_blank" rel="noreferrer noopener"><em>Steve Yegge</em></a><em>, </em><a href="http://fsck.com/" target="_blank" rel="noreferrer noopener"><em>Jesse Vincent</em></a><em>, </em><a href="http://remixpartners.ai/" target="_blank" rel="noreferrer noopener"><em>Justin Massa</em></a><em>, </em><a href="https://nathan.torkington.com/" target="_blank" rel="noreferrer noopener"><em>Nat Torkington</em></a><em>, </em><a href="https://vibes.diy/" target="_blank" rel="noreferrer noopener"><em>Marcus Estes</em></a><em>, and </em><a href="https://www.linkedin.com/in/arjun-singh-629216105/" target="_blank" rel="noreferrer noopener"><em>Arjun Singh</em></a><em> for reading drafts of this.</em></p>
</blockquote>
]]></content:encoded>
										</item>
	</channel>
</rss>
