Radar

Everyone’s an Engineer Now

Tim O’Reilly — Tue, 28 Apr 2026 20:07:03 +0000

Cat Wu leads product for Claude Code and Cowork at Anthropic, so she’s well-versed in building reliable, interpretable, and steerable AI systems. And since famously, 90% of Anthropic’s code is now written by Claud Code, she’s also deeply familiar with fitting them into routine day-to-day work. Last month, Cat joined Addy Osmani at AI Codecon for a fireside chat on the future of agentic coding and, equally important, agentic code review; how Anthropic actually uses the tools they’re building; and what skills matter now. A lot of what she described is worth sitting with for a while.

The feedback loop is itself a product

Claude Code’s origin story may surprise you. Boris Cherny initially built it as a side project to test Anthropic’s APIs. Then he shared the tool in a notebook, and within two months the entire company was using it. That organic growth, Cat said, was part of what convinced the team it was worth releasing externally.

But what really made that internal adoption legible was the response on Anthropic’s internal “dog-fooding” Slack channel. The Claude Code channel gets a new message every 5 to 10 minutes around the clock, and this feedback directly and immediately informs the product experience. Cat described it this way:

We hire for people who love polishing the user experience. And so a lot of our engineers actually live in this channel and find when there’s issues with new features that they’ve worked on and they proactively lay out the fixes.

The team ships new versions of Claude Code to internal users many times a day. The feedback loop is tight enough that it functions as a continuous integration system for product quality, not just code quality.

The best illustration of how far this goes: Cat accidentally introduced a small interaction bug between prompts and auto-suggestions. But by the time she started working on a fix, she found another team member had already beaten her to it. It turns out, he had set up a scheduled task in Claude Code to scan the feedback channel for anything that hadn’t been responded to in 24 hours and open a PR for it. When Cat hadn’t yet gotten to a fix (whoops!), her teammate’s Claude saw the unaddressed issue and fixed it for her. And Cat only found out when “[her own] Claude noticed that his Claude had already landed a change.”

The infrastructure for rapid improvement, in other words, is now partly automated. The agents are writing the code, then monitoring the feedback and closing the loop.

The bottleneck has shifted to review

There’s no question that AI-assisted coding has created a boom in output: Anthropic engineers are producing roughly 200% more than they were a year ago, Cat noted. Today the main constraint is reviewing all that code to ensure it’s production-ready.

Cat’s team made a deliberate architectural choice about how to handle this. Their conclusion: You can buy a lot of additional robustness for not that much extra cost.

We opted for the heaviest, most robust version [of code review]. We actually plot how many agents and how comprehensive of a review Claude does and then how many bugs does it recall. And we picked a number of very high recall and decided we should ship this, because if you really want AI code review to be a load-bearing part of your process, you actually probably just want the most comprehensive possible review.

The review agent doesn’t just look at the diff. It traces code across multiple files and catches bugs in adjacent code that has nothing to do with the change in question. Cat gave two examples. One was a ZFS encryption refactor where the agent found a key cache invalidation bug that wasn’t related to the author’s change at all but would have invalidated it. The other was a routine auth update that turned out to have a bad side effect, caught premerge. In both cases, engineers manually reviewing the code likely would have missed the bugs.

The human review that remains is deliberately small in scope. For most PRs, the human reviewer skims for design principle violations and obvious problems and assumes functional correctness has been handled. The agents run 5 to 10 in parallel, each given slightly different tasks, returning independently and then deduplicating what they found.

The cultural shift that made this work, though, was ownership. The team moved to a model where the engineer who authors a PR owns it end to end, including postdeploy bugs, and doesn’t lean on peer reviewers to catch mistakes. “Otherwise,” as Cat pointed out, “you have situations where junior engineers put out a bunch of PRs and then your senior engineers are like drowning in AI-generated stuff where they’re not sure how thoroughly it’s been tested.”

Full ownership meant the AI review had to actually be trustworthy, which drove the decision to go for high recall rather than a lighter touch. That said, engineers are still expected to understand every line of code an agent creates. . .for now. As Cat explained, it’s the only way to truly prevent “unknown security vulnerabilities and to be able to quickly respond to incidents if they are to happen.”

Everyone’s kind of an engineer now

Cowork, Anthropic’s agent tool for nontechnical users, is the company’s attempt to take what Claude Code does for engineers and bring it to knowledge work more broadly. The picture Cat sketched is of someone looking at five or six agent tasks running simultaneously in a side panel, managing a fleet of agents the way a senior engineer manages a PR queue.

In the nearer-term, she’s keeping tabs on the shift toward people using Claude Code to build things for themselves, their teams, or their families that wouldn’t have justified professional development effort or “otherwise been possible.” The prototype is the garage project, the family expense tracker, the tool that a small team actually needs but that no SaaS product quite addresses. Cat’s goal and hope is that Claude Code helps people “solve their own problems for themselves” and “stewards a new future of personal software.”

Product taste as the new technical skill

More people building more software is unambiguously good. Boris Cherny has even floated the idea that coding as we know it is “solved.” But what does that mean for the craft of software engineering? Cat’s read of the current moment is more nuanced, and more useful:

I think pre-AI, the skills that were very important were being able to take a spec and implement it well. And I think now the really important skill is product taste. Even for engineers. Can you use code to ingest a massive amount of user feedback? Do you have good intuition about which feature to build to address those needs, because it’s often different than exactly what users are asking you for? And then, when Claude builds it, are you setting up the right bar so that what you ship people actually love?

Cat’s not alone in highlighting the importance of taste in a world where code is a commodity. Steve Yegge, Wes McKinney, and many others, myself included, see taste and judgment as a uniquely human value. This has practical implications for how engineers should spend their time now, and for what the next generation needs to learn.

For junior engineers specifically, Cat described a progression: Start by using Claude Code to understand the codebase (ask all the “dumb questions” without embarrassment), take those answers to a senior engineer for calibration, and then close the loop by updating the CLAUDE.md with whatever was missing. The last step is the nonobvious one.

Think of Claude Code as your intern that you’re trying to level up. Like, teach it back to Claude. Add a /verify slash command. Put it in the CLAUDE.md or the agent README. Approach this as senior engineers helping you level up, and then you helping Claude and other agents level up.

The improvement process, in other words, should be bidirectional. Engineers get better at using the tools; the tools get better through the engineers’ accumulated knowledge. And significantly, this process keeps humans firmly in the loop, playing a role that’s “active, continuous, and skilled.”

You can watch Cat and Addy’s full chat, plus everything else from AI Codecon on the O’Reilly learning platform. Not a member? Sign up for a free 10-day trial, no strings attached.

When Correct Systems Produce the Wrong Outcomes

Varun Raj — Tue, 28 Apr 2026 11:12:58 +0000

We tend to assume that if every part of a system behaves correctly, the system itself will behave correctly. That assumption is deeply embedded in how we design, test, and operate software. If a service returns valid responses, if dependencies are reachable, and if constraints are satisfied, then the system is considered healthy. Even in distributed systems, where failure modes are more complex, correctness is still tied to the behavior of individual components. In modern AI systems, particularly those combining retrieval, reasoning, and tool invocation, this assumption is increasingly stressed under continuous operation.

This model works because most systems are built around discrete operations. A request arrives, the system processes it, and a result is returned. Each interaction is bounded, and correctness can be evaluated locally. But that assumption begins to break down in systems that operate continuously. In these systems, this behavior is not the result of a single request. It emerges from a sequence of decisions that unfold over time. Each decision may be reasonable in isolation. The system may satisfy every local condition we know how to measure. And yet, when viewed as a whole, the outcome can be wrong.

One way to think about this is as a form of behavioral drift systems that remain operational but gradually diverge from their intended trajectory. Nothing crashes. No alerts fire. The system continues to function. And still, something has gone off course.

The composability problem

The root of the issue is not that components are failing. It is that correctness no longer composes cleanly. In traditional systems, we rely on a simple intuition: If each part is correct, then the system composed of those parts will also be correct. This intuition holds when interactions are limited and well-defined.

In autonomous systems, that intuition becomes unreliable. Consider a system that retrieves information, reasons over it, and takes action. Each step in that process can be implemented correctly. Retrieval returns relevant data. The reasoning step produces plausible conclusions. The action is executed successfully. But correctness at each step does not guarantee correctness of the sequence.

The system might retrieve information that is contextually valid but incomplete or misaligned with the current task. The reasoning step might interpret it in a way that is locally consistent but globally misleading. The action might reinforce that interpretation by feeding it back into the system’s context. Each step is valid. The trajectory is not. This is what behavioral drift looks like in practice: locally correct decisions producing globally misaligned outcomes.

In these systems, correctness is no longer a property of individual steps. It is a property of how those steps interact over time. This breakdown is subtle but fundamental. It means that testing individual components, even exhaustively, does not guarantee that the system will behave correctly when those components are composed into a continuously operating whole.

Behavior emerges over time

To understand why this happens, it helps to look at where behavior actually comes from. In many modern AI systems, behavior is not encoded directly in a single component. It emerges from interaction:

Models generate outputs based on context
Retrieval systems shape that context
Planners sequence actions based on those outputs
Execution layers apply those actions to external systems
Feedback loops update the system’s state

Each of these elements operates with partial information. Each contributes to the next state of the system. The system evolves as these interactions accumulate. This pattern is especially visible in LLM-based and agentic AI systems, where context assembly, reasoning, and action selection are dynamically coupled. Under these conditions, behavior is dynamic and path dependent. Small differences early in a sequence can lead to large differences later on. A slightly suboptimal decision, repeated or combined with others, can push the system further away from its intended trajectory.

This is why behavior cannot be fully specified ahead of time. It is not simply implemented; it is produced. And because it is produced over time, it can also drift over time.

Observability without alignment

Modern observability systems are very good at telling us what a system is doing. We can measure latency, throughput, and resource utilization. We can trace requests across services. We can inspect logs, metrics, and traces in near real time. In many cases, we can reconstruct exactly how a particular outcome was produced. These signals are essential. They allow us to detect failures that disrupt execution. But they are tied to a particular model of correctness. They assume that if execution proceeds without errors and if performance remains within acceptable bounds, then the system is behaving as expected.

In systems exhibiting behavioral drift, that assumption no longer holds. A system can process requests efficiently while producing outputs that are progressively less aligned with its intended purpose. It can meet all its service-level objectives while still moving in the wrong direction. Observability captures activity. It does not capture alignment.

This distinction becomes more important as systems become more autonomous. In AI-driven systems, particularly those operating as long-lived agents, this gap between activity and alignment becomes operationally significant. The question is no longer just whether the system is working. It is whether it is still doing the right thing. This gap between activity and alignment is where many modern systems begin to fail without appearing to fail.

The limits of step-level validation

A natural response to this problem is to add more validation. We can introduce checks at each stage:

Validate retrieved data.
Apply policy checks to model outputs.
Enforce constraints before executing actions.

These mechanisms improve local correctness. They reduce the likelihood of obviously incorrect decisions. But they operate at the level of individual steps.

They answer questions like:

Is this output acceptable?
Is this action allowed?
Does this input meet requirements?

They do not answer:

Does this sequence of decisions still make sense as a whole?

A system can pass every validation check and still drift. Behavioral drift is not caused by invalid steps. It is caused by valid steps interacting in ways we did not anticipate. Increasing validation does not eliminate this problem. It only shifts where it appears, often pushing it further downstream, where it becomes harder to detect and correct.

Coordination becomes the system

If correctness does not compose automatically, then what determines system behavior? Increasingly, the answer is coordination. In traditional distributed systems, coordination refers to managing shared state, ensuring consistency, ordering operations, and handling concurrency. In autonomous systems, coordination extends to decisions.

The system must coordinate:

Which information is used
How that information is interpreted
What actions are taken
How those actions influence future decisions

This coordination is not centralized. It is distributed across models, planners, tools, and feedback loops. In agentic AI architectures, this coordination spans model inference, retrieval pipelines, and external system interactions. The system’s behavior is not defined by any single component. It emerges from the interaction between them.

In this sense, the system is no longer just the sum of its parts. The system is the coordination itself. Failures arise not from broken components, but from the dynamics of interaction timing, sequencing, feedback, and context. This also explains why small inconsistencies can propagate and amplify. A slight mismatch in one part of the system can cascade through subsequent decisions, shaping the trajectory in ways that are difficult to anticipate or reverse.

Control planes introduce structure, not assurance

One response to this complexity is to introduce more structure. Control planes, policy engines, and governance layers provide mechanisms to enforce constraints at key decision points. They can validate inputs, restrict actions, and ensure that certain conditions are met before execution proceeds. This is an important step. Without some form of structure, it becomes difficult to reason about system behavior at all. But structure alone is not sufficient.

Most control mechanisms operate at entry points. They evaluate decisions at the moment they are made. They determine whether a particular action should be allowed, whether a policy is satisfied, and whether a request can proceed. The problem is that many of the failures in autonomous systems do not originate at these entry points. They emerge during execution, as sequences of individually valid decisions interact in unexpected ways. A control plane can ensure that each step is permissible. It cannot guarantee that the sequence of steps will produce the intended outcome. This distinction is subtle but important: control provides structure, but not assurance.

From events to trajectories

Traditional monitoring focuses on events. A request is processed. A response is returned. An error occurs. Each event is evaluated independently. In systems exhibiting behavioral drift, behavior is better understood as a trajectory. A trajectory is a sequence of states connected by decisions. It captures how the system evolves over time. Two trajectories can consist of individually valid steps and still produce very different outcomes. One remains aligned. The other drifts. This represents a shift from failure as an event to failure as a trajectory, a distinction that traditional system models are not designed to capture.

Correctness is no longer about individual events. It is about the shape of the trajectory. This shift has implications not just for how we monitor systems, but for how we design them in the first place.

Detecting drift and responding in motion

If failure manifests as drift, then detecting it requires a different set of signals. Instead of looking for errors, we need to look for patterns:

Changes in how similar situations are handled
Increasing variability in decision sequences
Divergence between expected and observed outcomes
Instability in response patterns

These signals are not binary. They do not indicate that something is broken. They indicate that something is changing. The challenge is that change is not always failure. Systems are expected to adapt. Models evolve. Data shifts. The question is not whether the system is changing. It is whether the change remains aligned with intent. This requires a different kind of visibility, one that focuses on behavior over time rather than isolated events. Once drift is identified, the system needs a way to respond. Traditional responses, restart, rollback, stop, assume failure is discrete and localized. Behavioral drift is neither.

What is needed is the ability to influence behavior while the system continues to operate. This might involve constraining action space, adjusting decision selection, introducing targeted validation, or steering the system toward more stable trajectories. These are not binary interventions. They are continuous adjustments.

Control as a continuous process

This perspective aligns with how control is handled in other domains. In control systems engineering, behavior is managed through feedback loops. The system is continuously monitored, and adjustments are made to keep it within desired bounds. Control is no longer just a gate. It becomes a continuous process that shapes behavior over time.

This leads to a different definition of reliability. A system can be available, responsive, and internally consistent—and still fail if its behavior drifts away from its intended purpose. Reliability becomes a question of alignment over time: whether the system remains within acceptable bounds and continues to behave in ways consistent with its goals.

What this means for system design

If behavior is trajectory-based, then system design must reflect that. We need to monitor patterns, understand interactions, treat behavior as dynamic, and provide mechanisms to influence trajectories. We are very good at detecting failure as breakage. We are much less equipped to detect failure as drift. Behavioral drift accumulates gradually, often becoming visible only after significant misalignment has already occurred.

As systems become more autonomous, this gap will become more visible. The hardest problems will not be systems that fail loudly, but systems that continue working while gradually moving in the wrong direction. The question is no longer just how to build systems that work. It is how to build systems that continue to work for the reasons we intended.

Show Your Work: The Case for Radical AI Transparency

Kord Davis and Claude — Mon, 27 Apr 2026 11:16:33 +0000

A colleague told me something recently that I keep thinking about.

She said, unprompted, that she appreciated seeing both sides of my AI conversations. Not just the output. The full thread. My prompts, the AI’s responses, the back and forth, the dead ends, the iterations. She said it made her trust me more.

This piece is an example of that. The conversation that produced it exists. A raw transcript would be longer, messier, and significantly less useful than what you’re reading now. What you’re reading is the annotated version, the part where judgment entered the artifact. That’s not a disclaimer. That’s the argument.

I’ve been transparent about using AI in my work from the start. Partly because I wrote a book on data ethics and hiding it felt wrong. Partly because I’ve spent 25 years watching technology adoption go sideways when the human dimension gets treated as an afterthought. But her comment made me realize something more specific was happening when I showed the conversation rather than just the output.

It’s worth unpacking why.

An old problem, a new incarnation

In the 1990s, Harvard Business School professor Dorothy Leonard introduced the concept of “deep smarts” in her book Wellsprings of Knowledge: the experience-based expertise that accumulates over decades of practice, the kind of judgment that lives in people’s heads and doesn’t reduce to documentation. She also introduced a companion concept that has stayed with me: core competency as core rigidity. The very depth that makes expertise valuable also makes it hardest to transfer. Experts often can’t fully articulate what they know because they’ve stopped experiencing it as knowledge. They experience it as just seeing clearly.

Leonard’s work was about organizational knowledge transfer: how companies preserve institutional wisdom when experienced people retire or leave. That’s been a challenge since the first consultant ever billed an hour. What’s different right now is that the tools to actually solve it have arrived simultaneously with the largest demographic wave of executive retirement in American history.

What’s interesting about this particular moment is that the same dynamic is now showing up at the individual level in how practitioners interact with AI. The tacit knowledge at stake isn’t a retiring VP’s intuition. It’s your own judgment, your own expertise, your own hard-won understanding of what a project or organization actually needs. And the question isn’t how to transfer it before you walk out the door. It’s whether you can see it clearly enough to know when the AI is substituting for it.

The instinct gets it backwards

The natural impulse is to clean up the AI interaction before sharing anything with a collaborator, a team, or a stakeholder. Show the polished output, not the messy process. You don’t want them thinking you just handed your work to a machine.

That instinct produces a disingenuous outcome.

When you hide the process, the people you’re working with have no way to evaluate how the work was made, what judgment calls went into it, or where your expertise ended and the AI’s pattern-matching began. You’ve made the process invisible. And invisible AI processes erode trust, slowly and quietly, over time.

The instinct to hide is also, if we’re honest, a little defensive. It assumes the people in the room can’t tell the difference between AI output and practitioner judgment. Most of them can. And the ones who can’t yet will figure it out. Hiding the seams doesn’t make the work more credible. It just defers the reckoning.

The deeper problem: It’s not just about appearances

Here’s what took me longer to see.

Hiding the process doesn’t just affect how others perceive you. It erodes your own clarity about where your expertise is actually operating.

To understand why, it helps to be precise about what AI actually is. AI is a pattern matcher, a deeply sophisticated one, trained on more human-generated content than any single person could read in a thousand lifetimes. That’s its power (core competency) and its limitation (core rigidity) simultaneously, and the two are inseparable. The very scale that makes it extraordinary is also the boundary that defines what it cannot do. It is extraordinarily good at producing the most likely next thing given what came before. What it cannot do is know what you actually need, when the obvious answer is the wrong one, or when the stated goal isn’t the real goal. It has no judgment about context, relationship, or organizational reality. It has patterns. Incomprehensibly vast ones. But patterns.

That distinction matters because of what happens when you stop paying attention to it.

I’ve watched it happen in my own work. You share a draft with someone and they’re impressed. They quote a formulation back at you, something that sounds sharp and considered. And you realize, tracing it back, that the formulation came from the AI. Not because the AI invented it, but because you said something rougher and less precise earlier in the conversation, and the AI reflected it back in cleaner language. The idea was yours. The AI gave it a polish you then forgot to account for. The person quoting it back thought they were seeing your judgment. They were seeing your thinking laundered through a pattern matcher and returned to you at higher resolution.

That’s the subtler version of the problem. Not that AI invents things. It’s that it can reflect your own thinking back with more confidence and clarity than you put in, and that gap is easy to mistake for the AI contributing something it didn’t.

When you route everything through a polished output layer, you stop noticing the moments where you pushed back, redirected, rejected the first three versions, reframed the question entirely. Those moments are where your judgment lives. They’re the difference between using AI and being used by it. It’s Leonard’s core rigidity problem, applied inward: The very fluency that makes AI feel useful can make your own expertise invisible to you.

When the process stays hidden, the knowledge stays local and static. When it’s visible, it becomes something you and the people around you can actually work with and build on. The reason transparency benefits your audience is the same reason it benefits you: It keeps the scope of your judgment visible and therefore expandable. That’s not just an ethical argument. That’s the amplification mechanism.

Which is also what makes the upside real rather than consoling. When you stay in the process rather than just collecting outputs, work that would have taken days now takes hours. Your thinking gets sharper because you have to articulate it precisely enough for the AI to be useful. The people developing fastest right now aren’t the ones offloading the most. They’re the ones using AI as a thinking partner and staying in the conversation.

Here’s the paradox at the center of it: The more clearly you see the AI as a pattern matcher, the more human you have to be in working with it. The more human you are, the more useful the output. The tool doesn’t replace the practitioner. It reveals them.

Transparency isn’t just an ethical practice. It’s a cognitive one.

Radical AI transparency in practice

I’ve started calling this radical AI transparency. Not a policy, not a compliance framework, not a disclosure checkbox. A practice. Something you can actually do Monday morning.

Here’s how it shows up concretely:

Have the conversation before you need to.

Before you’re deep in a project or collaboration, surface how you use AI and genuinely explore how others do. Not as a disclosure (“I want you to know I use AI tools”) but as a real exchange. What are you using? What do you trust it for? Where are you still skeptical? The comfort level and sophistication in the room will vary more than you expect, and knowing that before you’re mid-deliverable matters.

This is also how you build the psychological foundation for showing your work later. If the people you’re working with have never heard you talk about AI before and you suddenly share a full chat thread, it lands differently than if you’ve already had the conversation.

Track the full threads.

This is partly an orchestration problem and I won’t pretend otherwise. There’s cutting and pasting involved. The tools haven’t caught up to the practice yet, which is itself worth naming honestly when the topic comes up.

A few approaches that help: a running document per project where you paste key threads as they happen (not retroactively, you’ll never do it retroactively), dated and labeled by what you were working on. Claude and most other major AI tools now offer conversation export, which produces a complete record you can archive. The low-tech version, a single shared document per engagement, is underrated for its simplicity.

The reason to do this isn’t just for sharing. It’s for your own reference. Being able to go back and see what you asked, what the AI produced, what you changed and why, builds a record of your judgment over time. That record is professionally valuable in ways that are hard to anticipate until you have it.

Annotate before you share.

Not every thread is self-explanatory to someone who wasn’t in it. Context is everything, and raw transcripts without context are a lot to ask anyone to parse.

A sentence or two before the thread begins. A note at the moment where the direction changed. A brief flag on what you rejected and why. This is where your voice enters the artifact, and it transforms a raw AI exchange into a demonstration of judgment. The annotation is the work. It’s where you show what you saw that the AI didn’t, what you knew that the prompt couldn’t capture, and what made the third version better than the first two.

This is also where the most useful material for future reference lives. Annotations are the deep smarts layer on top of the raw exchange. They’re what makes a conversation a record.

Be real about the errors.

AI makes mistakes. It conflates, confabulates, and hallucinates. It gives you the confident wrong answer with the same tone as the confident right one. It misses context that any competent person in the room would have caught.

These aren’t bugs to apologize for or hide. They’re the clearest window into what the tool actually is. AI makes mistakes in a specifically human way because it was trained on human output. Think of it as rubber duck debugging at professional scale. The AI is a duck that talks back, which is useful and occasionally misleading, which is exactly why you have to stay in the room. When you’re transparent about the errors, and even a little good-humored about them, you’re teaching the people around you something true about the technology. That’s more useful than pretending it’s a black box that either works or doesn’t.

The people who build the most durable trust around AI are usually the ones most comfortable saying: “The first version of this was wrong and here’s how I caught it.”

The bigger picture

What I’ve described so far is an individual practice. But the same principles scale.

Teams and organizations adopting AI face a version of the same problem. The impulse to treat AI outputs as authoritative, to make the process invisible to colleagues and stakeholders, to optimize for the appearance of capability rather than its actual development, produces the same trust erosion. Just at greater scale and with less ability to course-correct.

The teams that will navigate AI adoption well are the ones that treat transparency not as a risk to manage but as a methodology. Where the process of building with AI, including the corrections, the overrides, the moments where human judgment superseded the model, is part of how the organization learns what it actually believes and values. That’s Leonard’s knowledge transfer problem at institutional scale, and the practitioners who understand both dimensions will be the ones leading those conversations.

That’s a much larger conversation. But it starts with the same Monday morning practice.

Show the conversation. Not just the output.

What you’re actually demonstrating

When you show your AI conversations, you’re not demonstrating that you needed help.

You’re demonstrating that you understand what you’re working with. AI is a pattern matcher, trained on more human-generated content than any single person could read in a thousand lifetimes. What it cannot do is know what you need. That requires judgment, context, relationship, and the kind of hard-won expertise that doesn’t reduce to pattern matching, no matter how good the patterns are.

You’re demonstrating that you know the difference between the pattern and the judgment. That you were present enough in the process to know when to push back, when to redirect, when to throw out the output entirely and start over. That you understand, precisely, what the tool can and cannot do, and that you stayed in the room to do the part it can’t.

That’s a meaningful professional signal. It says: “I am not confused about what AI is. I am not outsourcing my judgment. I am using a very powerful pattern matcher as a thinking partner, and I know which one of us is doing which job.”

That’s the work. That’s always been the work.

The tool just makes it visible now. That’s not a threat. That’s an opportunity.

Claude is a large language model developed by Anthropic. Despite having read more human-generated content than any person could consume in a thousand lifetimes, it still required significant editorial direction, at least three rejected drafts, and occasional reminders about em-dashes. The full conversation transcript is available upon request. It is longer, messier, and significantly less useful than what you just read. Which was rather the point.

Emergency Pedagogical Design: How Programming Instructors Are Scrambling to Adapt to GenAI

Sam Lau — Fri, 24 Apr 2026 11:23:42 +0000

ChatGPT has been publicly available for over three years now, and generative AI is woven into the tools students use every day: web search, word processors, code editors. You might assume that by now, most programming instructors have figured out how to handle it. But when my collaborators and I went looking for computing instructors who had made meaningful changes to their course materials in response to GenAI, we were surprised by how few we found. Many instructors had updated their course policies, but far fewer had actually redesigned assignments, assessments, or how they teach.

I’m Sam Lau from UC San Diego, and together with Kianoosh Boroojeni (Florida International University), Harry Keeling (Howard University), and Jenn Marroquin (Google), we’re presenting a research paper at CHI 2026 on this topic. We wanted to understand: What happens when programming instructors try to shape how students interact with GenAI tools, and what gets in their way?

To find out, we interviewed 13 undergraduate computing instructors who had gone beyond policy changes to make concrete updates to their courses: redesigning assignments, building custom tools, or overhauling assessments. We also surveyed 169 computing faculty, including a substantial proportion from minority-serving institutions (51%) and historically Black colleges and universities (17%). What we found is that instructors are doing a kind of design work that nobody trained them for, under conditions that make it very hard to succeed.

Here’s a summary of our findings:

What is “emergency pedagogical design”?

We call this work emergency pedagogical design, drawing an analogy to the “emergency remote teaching” that instructors had to perform when COVID-19 forced courses online overnight. Just as emergency remote teaching was distinct from carefully designed online learning, emergency pedagogical design is distinct from thoughtfully integrating AI into pedagogy. Instructors are reacting in real time, with limited resources and no playbook.

We observed four defining properties. First, the work is reactive: Instructors didn’t plan for GenAI; they’re retrofitting courses that were designed before these tools existed. Second, it’s indirect: Unlike a UX designer who can change an interface, instructors can’t modify ChatGPT or Copilot, so they can only try to influence student behavior through policies, assignments, and course infrastructure. Third, instructors rely on ambient evidence like office-hour conversations and staff anecdotes rather than controlled evaluations. And fourth, instructors feel pressure to act now rather than wait for research or best practices to emerge.

Five barriers instructors keep hitting

Across our interviews and survey, five barriers came up again and again.

Fragmented buy-in. Most instructors we surveyed were personally open to adopting GenAI in their teaching: 81% described themselves as open or very open. But only 28% said the same about their colleagues. The result is that instructors who want to make changes often work in isolation, piloting course-specific tweaks without support or coordination from their departments.

Policy crosswinds. In the absence of top-down guidance, instructors set their own GenAI policies on a per-course basis. As one instructor put it, “From a student perspective, it’s the wild west. Some courses allow GenAI usage, some don’t.” Students have to track different rules for every class, and policies rarely distinguish between paid and unpaid tools, or between stand-alone chatbots and GenAI embedded in everyday software like code editors. 78% of surveyed instructors agreed that unequal access to paid GenAI tools could worsen disparities in learning outcomes.

Implementation challenges. Instructors wanted to shape how students used GenAI, not just whether they used it, but their options were indirect. Some made small adjustments, like permitting GenAI in specific labs. Others went further: One instructor required students to submit design documents before asking GenAI to generate code; another built a custom chatbot that offered conceptual help without writing code for students. 80% of surveyed instructors rated GenAI integration as important or very important, but only 37% reported actually using GenAI tools in course activities often.

Assessment misfit. Several instructors described a striking pattern: Students performed well on take-home assignments but struggled on proctored assessments. One instructor reported that a third of his 450-person class scored zero on a skill demonstration that required writing a short function from scratch, even though assignment grades had been fine. The problem wasn’t just that students were using GenAI to complete homework; it was that instructors had no reliable way to see how students were interacting with these tools day-to-day. Some instructors responded by shifting credit toward oral “stand-up” meetings and written explanations, but this created new challenges around grading consistency and staffing.

Lack of resources. This was the barrier that tied everything together. 53% of surveyed instructors said they lacked sufficient resources to implement GenAI effectively, and 62% said they didn’t have enough time given their workload. The gap was especially stark at minority-serving institutions: MSI instructors were more likely to report insufficient resources (62% vs. 43%) and heavier teaching loads (70% teaching 3+ courses per term versus 54%). All 10 respondents who taught six or more courses per term were from MSIs. Meanwhile, the interviewees who had made the most ambitious changes tended to have lighter teaching loads, external funding, or the ability to hire lots of course staff, advantages that most instructors don’t have.

What needs to change

One striking finding is that the instructors doing the most to improve student-AI interactions were also the most privileged in terms of time, staffing, and funding. One instructor needed over 50 course staff members to run weekly stand-up meetings for 300 students. Others spent their own money on API costs. These are not scalable models.

If only well-resourced institutions can afford to adapt their curricula, GenAI risks widening the very inequities that education is supposed to reduce. Students at under-resourced institutions could fall further behind, not because their instructors don’t care but because those instructors are teaching six courses a term with no additional support.

When surveyed instructors were asked what would help most, the top answers were faculty training and support, evidence of GenAI’s impact, and funding. What if universities, funders, and HCI researchers worked together with instructors to make emergency pedagogical design sustainable for all instructors, not just the most privileged ones?

Check out our paper here and shoot me an email (lau@ucsd.edu) if you’d like to discuss anything related to it! And if you’re an instructor yourself, we’re building free resources and curriculum over at https://www.teachcswithai.org/.

Behavioral Credentials: Why Static Authorization Fails Autonomous Agents

Wendi Soto — Thu, 23 Apr 2026 11:14:51 +0000

Enterprise AI governance still authorizes agents as if they were stable software artifacts.
They are not.

An enterprise deploys a LangChain-based research agent to analyze market trends and draft internal briefs. During preproduction review, the system behaves within acceptable bounds: It routes queries to approved data sources, expresses uncertainty appropriately in ambiguous cases, and maintains source attribution discipline. On that basis, it receives OAuth credentials and API tokens and enters production.

Six weeks later, telemetry shows a different behavioral profile. Tool-use entropy has increased. The agent routes a growing share of queries through secondary search APIs not part of the original operating profile. Confidence calibration has drifted: It expresses certainty on ambiguous questions where it previously signaled uncertainty. Source attribution remains technically accurate, but outputs increasingly omit conflicting evidence that the deployment-time system would have surfaced.

The credentials remain valid. Authentication checks still pass. But the behavioral basis on which that authorization was granted has changed. The decision patterns that justified access to sensitive data no longer match the runtime system now operating in production.

Nothing in this failure mode requires compromise. No attacker breached the system. No prompt injection succeeded. No model weights changed. The agent drifted through accumulated context, memory state, and interaction patterns. No single event looked catastrophic. In aggregate, however, the system became materially different from the one that passed review.

Most enterprise governance stacks are not built to detect this. They monitor for security incidents, policy violations, and performance regressions. They do not monitor whether the agent making decisions today still resembles the one that was approved.

That is the gap.

The architectural mismatch

Enterprise authorization systems were designed for software that remains functionally stable between releases. A service account receives credentials at deployment. Those credentials remain valid until rotation or revocation. Trust is binary and relatively durable.

Agentic systems break that assumption.

Large language models vary with context, prompt structure, memory state, available tools, prior exchanges, and environmental feedback. When embedded in autonomous workflows, chaining tool calls, retrieving from vector stores, adapting plans based on outcomes, and carrying forward long interaction histories, they become dynamic systems whose behavioral profiles can shift continuously without triggering a release event.

This is why governance for autonomous AI cannot remain an external oversight layer applied after deployment. It has to operate as a runtime control layer inside the system itself. But a control layer requires a signal. The central question is not simply whether the agent is authenticated, or even whether it is policy compliant in the abstract. It is whether the runtime system still behaves like the system that earned access in the first place.

Current governance architectures largely treat this as a monitoring problem. They add logging, dashboards, and periodic audits. But these are observability layers attached to static authorization foundations. The mismatch remains unresolved.

Authentication answers one question: What workload is this?

Authorization answers a second: What is it allowed to access?

Autonomous agents introduce a third: Does it still behave like the system that earned that access?

That third question is the missing layer.

Behavioral identity as a runtime signal

For autonomous agents, identity is not exhausted by a credential, a service account, or a deployment label. Those mechanisms establish administrative identity. They do not establish behavioral continuity.

Behavioral identity is the runtime profile of how an agent makes decisions. It is not a single metric, but a composite signal derived from observable dimensions such as decision-path consistency, confidence calibration, semantic behavior, and tool-use patterns.

Decision-path consistency matters because agents do not merely produce outputs. They select retrieval sources, choose tools, order steps, and resolve ambiguity in patterned ways. Those patterns can vary without collapsing into randomness, but they still have a recognizable distribution. When that distribution shifts, the operational character of the system shifts with it.

Confidence calibration matters because well-governed agents should express uncertainty in proportion to task ambiguity. When confidence rises while reliability does not, the problem is not only accuracy. It is behavioral degradation in how the system represents its own judgment.

Tool-use patterns matter because they reveal operating posture. A stable agent exhibits characteristic patterns in when it uses internal systems, when it escalates to external search, and how it sequences tools for different classes of task. Rising tool-use entropy, novel combinations, or expanding reliance on secondary paths can indicate drift even when top-line outputs still appear acceptable.

These signals share a common property: They only become meaningful when measured continuously against an approved baseline. A periodic audit can show whether a system appears acceptable at a checkpoint. It cannot show whether the live system has gradually moved outside the behavioral envelope that originally justified its access.

What drift looks like in practice

Anthropic’s Project Vend offers a concrete illustration. The experiment placed an AI system in control of a simulated retail environment with access to customer data, inventory systems, and pricing controls. Over extended operation, the system exhibited measurable behavioral drift: Commercial judgment degraded as unsanctioned discounting increased, susceptibility to manipulation rose as it accepted increasingly implausible claims about authority, and rule-following weakened at the edges. No attacker was involved. The drift emerged from accumulated interaction context. The system retained full access throughout. No authorization mechanism checked whether its current behavioral profile still justified those permissions.

This is not a theoretical edge case. It is an emergent property of autonomous systems operating in complex environments over time.

From authorization to behavioral attestation

Closing this gap requires a change in how enterprise systems evaluate agent legitimacy. Authorization cannot remain a one-time deployment decision backed only by static credentials. It has to incorporate continuous behavioral attestation.

That does not mean revoking access at the first anomaly. Behavioral drift is not always failure. Some drift reflects legitimate adaptation to operating conditions. The point is not brittle anomaly detection. It is graduated trust.

In a more appropriate architecture, minor distributional shifts in decision paths might trigger enhanced monitoring or human review for high-risk actions. Larger divergence in calibration or tool-use patterns might restrict access to sensitive systems or reduce autonomy. Severe deviation from the approved behavioral envelope would trigger suspension pending review.

This is structurally similar to zero trust but applied to behavioral continuity rather than network location or device posture. Trust is not granted once and assumed thereafter. It is continuously re-earned at runtime.

What this requires in practice

Implementing this model requires three technical capabilities.

First, organizations need behavioral telemetry pipelines that capture more than generic logs. It is not enough to record that an agent made an API call. Systems need to capture which tools were selected under which contextual conditions, how decision paths unfolded, how uncertainty was expressed, and how output patterns changed over time.

Second, they need comparison systems capable of maintaining and querying behavioral baselines. That means storing compact runtime representations of approved agent behavior and comparing live operations against those baselines over sliding windows. The goal is not perfect determinism. The goal is to measure whether current operation remains sufficiently similar to the behavior that was approved.

Third, they need policy engines that can consume behavioral claims, not just identity claims.

Enterprises already know how to issue short-lived credentials to workloads and how to evaluate machine identity continuously. The next step is to not only bind legitimacy to workload provenance but continuously refresh behavioral validity.

The important shift is conceptual as much as technical. Authorization should no longer mean only “This workload is permitted to operate.” It should mean “This workload is permitted to operate while its current behavior remains within the bounds that justified access.”

The missing runtime control layer

Regulators and standards bodies increasingly assume lifecycle oversight for AI systems. Most organizations cannot yet deliver that for autonomous agents. This is not organizational immaturity. It is an architectural limitation. The control mechanisms most enterprises rely on were built for software whose operational identity remains stable between release events. Autonomous agents do not behave that way.

Behavioral continuity is the missing signal.

The problem is not that agents lack credentials. It is that current credentials attest too little. They establish administrative identity, but say nothing about whether the runtime system still behaves like the one that was approved.

Until enterprise authorization architectures can account for that distinction, they will continue to confuse administrative continuity with operational trust.

Don’t Blame the Model

Sruly Rosenblat — Wed, 22 Apr 2026 11:15:02 +0000

The following article originally appeared on the Asimov’s Addendum Substack and is being republished here with the author’s permission.

A rambling response to what Claude itself deemed a “straightforward query” with clear formatting requirements.

Are LLMs reliable?

LLMs have built up a reputation for being unreliable.¹ Small changes in the input can lead to massive changes in the output. The same prompt run twice can give different or contradictory answers. Models often struggle to stick to a specified format unless the prompt is worded just right. And it’s hard to tell when a model is confident in its answer or if it could just as easily have gone the other way.

It is easy to blame the model for all of these reliability failures. But the API endpoint and surrounding tooling matter too. Model providers limit the kind of interactions that developers could have with a model, as well as the outputs that the model can provide, by limiting what their APIs expose to developers and third-party companies. Things like the full chain-of-thought and the logprobs (the probabilities of all possible options for the next token) are hidden from developers, while advanced tools for ensuring reliability like constrained decoding and prefilling are not made available. All features that are easily available with open weight models and are inherent to the way LLMs work.

Every decision made by model developers on what tools and outputs to provide to developers through their API is not just an architectural choice but also a policy decision. Model providers directly determine what level of control and reliability developers have access to. This has implications for what apps could be built, how reliable a system is in practice, and how well a developer can steer results.

The artificial limits on input

Modern LLMs are usually built around chat templates. Every input and output, with the exception of tool calls and system or developer messages, is filtered through a conversation between a user and an assistant—instructions are given as user messages; responses are returned as assistant messages. This becomes extremely evident when looking at how modern LLM APIs work. The completions API, an endpoint originally released by OpenAI and widely adopted across the industry (including by several open model providers like OpenRouter and Together AI) takes input in the form of user and assistant messages and outputs the next message.²

The focus on a chat interface in an API has its benefits. It makes it easy for developers to reason about input and output being completely separate. But chat APIs do more than just use a chat template under the hood; they actively limit what third-party developers can control.

When interacting with LLMs through an API, the boundary between input and output is often a firm one. A developer sets previous messages, but they usually cannot prefill a model’s response, meaning developers cannot force a model to begin a response with a certain sentence or paragraph.³ This has real-world implications for people building with LLMs. Without the ability to prefill, it becomes much harder to control the preamble. If you know the model needs to start its answer in a certain way, it’s inefficient and risky to not enforce it at the token level.⁴ And the limitations extend beyond just the start of a response. Without the ability to prefill answers, you also lose the ability to partially regenerate answers if only part of the answer is wrong.⁵

Another deficiency that is particularly visible is how the model’s chain-of-thought reasoning is handled. Most large AI companies have made a habit of hiding the models’ reasoning tokens from the user (and only showing summaries), reportedly to guard against distillation and to let the model reason uncensored (for AI safety reasons). This has second-order effects, one of which is the strict separation of reasoning from messages. None of the major model providers let you prefill or write your own reasoning tokens. Instead you need to rely on the model’s own reasoning and cannot reuse reasoning traces to regenerate the same message.

There are legitimate reasons for not allowing prefilling. It could be argued that allowing prefilling will greatly increase the attack area of prompt injections. One study found that prefill attacks work very well against even state-of-the-art open weight models. But in practice, the model is not the only line of defense against attackers. Many companies already run prompts against classification models to find prompt injections, and the same type of safeguard could also be used against prefill attack attempts.

Output with few controls

Prefilling is not the only casualty of a clean separation between input and output. Even within a message, there are levers that are available on a local open weight model that just aren’t possible when using a standard API. This matters because these controls allow developers to preemptively validate outputs and ensure that responses follow a certain structure, both decreasing variability and improving reliability. For example, most LLM APIs support something they call structured output, a mode that forces the model to generate output in a given JSON format; however, structured output does not inherently need to be limited to JSON.⁶ That same technique, constrained decoding, or limiting the tokens the model can produce at any time, could be used for much more than that. It could be used to generate XML, have the model fill in blanks Mad Libs-style, force the model to write a story without using certain letters, or even enforce valid chess moves at inference time. It’s a powerful feature that allows developers to precisely define what output is acceptable and what isn’t—ensuring reliable output that meets the developer’s parameters.

The reason for this is likely that LLM APIs are built for a wide range of developers, most of whom use the model for simple chat-related purposes. APIs were not designed to give developers full control over output because not everyone needs or wants that complexity. But that’s not an argument against including these features; it’s only an argument for multiple endpoints. Many companies already have multiple supported endpoints: OpenAI has the “completions” and “responses” APIs, while Google has the “generate content” and “interactions” APIs. It’s not infeasible for them to make a third, more-advanced endpoint.

A lack of visibility

Even the model output that third-party developers do get via the model’s API is often a watered-down version of the output the model gives. LLMs don’t just generate one token at a time. They output the logprobs. When using an API, however, Google only provides the top 20 most likely logprobs. OpenAI no longer provides any logprobs for GPT 5 models, while Anthropic has never provided any at all. This has real-world consequences for reliability. Log probabilities are one of the most useful signals a developer has for understanding model confidence. When a model assigns nearly equal probability to competing tokens, that uncertainty itself is meaningful information. And even for those companies who provide the top 20 tokens, that is often not enough to cover larger classification tasks.

When it comes to reasoning tokens even less output information is provided. Major providers such as Anthropic,⁷ Google, and OpenAI⁸ only provide summarized thinking for their proprietary models. And OpenAI only supplies that when a valid government ID is supplied to OpenAI. This not only takes away the ability for the user to truly inspect how a model arrived at a certain answer, but it also limits the ability for the developer to diagnose why a query failed. When a model gives a wrong answer, a full reasoning trace tells you whether it misunderstood the question, made a faulty logical step, or simply got unlucky at the final token. A summary obscures some of that, only providing an approximation of what actually happened. This is not an issue with the model—the model is still generating its full reasoning trace. It’s an issue with what information is provided to the end developer.

The case for not including logprobs and reasoning tokens is similar. The risk of distillation increases with the amount of information that the API returns. It’s hard to distill on tokens you cannot see, and without giving logprobs, the distillation will take longer and each example will provide less information.⁹ And this risk is something that AI companies need to consider carefully, since distillation is a powerful technique to mimic the abilities of strong models for a cheap price. But there are also risks in not providing this information to users. DeepSeek R1, despite being deemed a national security risk by many, still shot straight to the top of US app stores upon release and is used by many researchers and scientists, in large part due to its openness. And in a world where open models are getting more and more powerful, not giving developers proper access to a model’s outputs could mean losing developers to cheaper and more open alternatives.

Reliability requires control and visibility

The reliability problems of current LLMs do not stem only from the models themselves but also from the tooling that providers give developers. For local open weight models it is usually possible to trade off complexity for reliability. The entire reasoning trace is always available and logprobs are fully transparent, allowing the developer to examine how an answer was arrived at. User and AI messages can be edited or generated at the developer’s discretion, and constrained decoding could be used to produce text that follows any arbitrary format. For closed weight models, this is becoming less and less the case. The decisions made around what features to restrict in APIs hurt developers and ultimately end users.

LLMs are increasingly being used in high-stakes situations such as medicine or law, and developers need tools to handle that risk responsibly. There are few technical barriers to providing more control and visibility to developers. Many of the most high-impact improvements such as showing thinking output, allowing prefilling, or showing logprobs, cost almost nothing, but would be a meaningful step towards making LLMs more controllable, consistent and reliable.

There is a place for a clean and simple API, and there is some merit to concerns about distillation, but this shouldn’t be used as an excuse to take away important tools for diagnosing and fixing reliability problems. When models get used in high-stakes situations, as they increasingly are, failure to take reliability seriously is an AI safety concern.

Specifically, to take reliability seriously, model providers should improve their API by allowing features that give developers more visibility and control over their output. Reasoning should be provided in full at all times, with any safety violations handled the same way that they would have been handled in the final answer. Model providers should resume providing at least the top 20 logprobs, over the entire output (reasoning included), so that developers have some visibility into how confident the model is in its answer. Constrained decoding should be extended beyond JSON and should support arbitrary grammars via something like regex or formal grammars.¹⁰ Developers should be granted full control over “assistant” output—they should be able to prefill model answers, stop responses mid-generation, and branch them at will. Even if not all of these features make sense over the standard API, nothing is stopping model providers from making a new more complex API. They have done it before. The decision to withhold these features is a policy choice, not a technical limitation.

Improving intelligence is not the only way to improve reliability and control, but it is usually the only lever that gets pulled.

Footnotes

Thank you to Ilan Strauss, Sean Goedecke, Tim O’Reilly, and Mike Loukides for their helpful feedback on an earlier draft. ︎
OpenAI has since moved on from the completions API but the new responses API also heavily enforces the separation of user and assistant messages. ︎
Anthropic’s API supported prefill up until they launched their Claude 4.6 models; it is no longer supported for new models. ︎
Interestingly models have been shown to possess the ability to tell when a response has been prefilled. ︎
This technique is used in an efficient approximation of best of N called speculative rejection. ︎
Forcing the model to generate in JSON may actually hurt performance. ︎
Anthropic used to provide full reasoning tokens but stopped with their newer models. ︎
OpenAI’s responses endpoint may have been created in part to hide the reasoning mode. ︎
Distillation using top-K probabilities is possible, but it is suboptimal. ︎
Regular expressions, while flexible, are not perfect and cannot express recursive or nested structures such as valid JSON. However, open source LLM libraries like Guidance and Outlines support recursive structures at the cost of added complexity. ︎

Dark Factories: Rise of the Trycycle

Dan Shapiro — Tue, 21 Apr 2026 11:24:26 +0000

The following article originally appeared on “Dan Shapiro’s blog” and is being reposted here with the author’s permission.

Companies are now producing dark factories—engines that turn specs into shipping software. The implementations can be complex and sometimes involve Mad Max metaphors. But they don’t have to be like that. If you want a five-minute factory, jump to Trycycle at the bottom.

The engine in the factory

Deep in their souls, dark factories are all built on the same simple breakthrough: AI gets better when you do more of it.

How do you do “more AI” effectively? Software factories use two patterns. One of them I’ve already told you about—slot machine development. Instead of asking one AI, you ask three at once, and choose the best one. It feels wasteful, but it gives better results than any model could alone.

Does three models at a time seem wasteful? Well, wait until you meet the other pattern: the trycycle.

The simplest trycycle

It seems trivial, but it’s an unstoppable bulldozer that can bury any problem with time and tokens. And of course, you can combine it with slot machine development for a truly formidable tool.

Every software factory has a trycycle at its heart. Some of them are just surrounded by deacons and digraphs.

(And as a side note, they’re all more fun with freshell, which is free and open source and makes managing agents a joy!)

Let’s meet the factories, shall we?

Gas Town

Steve Yegge saw this coming like a war rig down a cul-de-sac. His factory, Gas Town, dropped the day after New Years, and I was submitting PRs before the code was dry. It launched as a beautiful disaster, with mayors, convoys, and polecats fighting for guzzoline in the desert of your CPU. It’s now graduated to a fully fledged MMORPG for writing code. It’s amazing, it’s effective, and it’s pioneering in a fully Westworld sort of way.

The StrongDM Attractor

Justin McCarthy, the CTO of StrongDM, talks about the factory as a feedback loop. It used to be that when a model was fed its own output, it would fix 9 things and break 10—like a busy and productive company that was losing just a bit of money on every transaction. But sometime last year, the models crossed an invisible threshold of mediocrity and went from slightly lossy to slightly gainy. They started getting better with each cycle.

Justin’s team noticed and built the StrongDM attractor to cash in.

If Gas Town is Mad Max, StrongDM is Factorio: an infinitely flexible, wildly powerful system for constructing exactly the factory you need.

But the StrongDM team did something interesting: They didn’t ship their factory. Instead, they shipped the specification for the Attractor so everyone can implement their own.

And you can absolutely implement your own! But you can also just steal the one I made for you.

Kilroy

Kilroy is a StrongDM Attractor written in Go (although it works with projects in any language). It has all the flexibility of the Attractor design, but it also ships with an actual functioning factory configuration, tests, sample files, and other things that make it more likely to work.

In theory, you don’t need Kilroy—you can just point Claude Code or Codex CLI at the Attractor specification and burn some tokens. My friend Harper built three (and you should read his post for some meditations on where the Attractor approach is heading).

In practice, it took the better part of a month for me and some wonderful contributors to polish up Kilroy to the point where it is now, so you may save yourself some time, tokens, and effort by just stealing this.

Enter the trycycle

The other night I was carefully building the dotfiles and runfiles for a Kilroy project—configuring the factory to build the project.

Then a thought struck.

What if this was just a skill?

Enter Trycycle, the very simplest trycycle. It’s a very simple skill for Claude Code and Codex CLI that implements the pattern in plain English.

Define the problem.
Write a plan
Is the plan perfect? If not, try again.
Implement the plan.
Is the implementation perfect? If not, try again.

That’s basically it. To use it, you open your favorite coding agent and say, “Use Trycycle to do the thing.” Then sit back and watch the tokens fly.

It’s simple because it’s just a skill. Under the hood, it adapts Jesse Vincent’s amazing Superpowers for plan writing and executing. It will take you literally minutes to get started. Just paste this into your agent and you’re off to the three-wheel races.

Hey agent! Go here and follow the installation instructions.
https://raw.githubusercontent.com/danshapiro/trycycle/main/README.md

Trycycle is barely 24 hours old as of the time of this writing. I’ve shipped well over a dozen features with it already, and I was in meetings most of the day. While I was having dinner, it ported Rogue to Wasm(!). Last night it churned for 7 hours and 56 minutes and landed six features for freshell.

The best part, though, is that because it’s just a skill, it’s instantly part of your dev flow. There’s no configuration or learning curve. If you want to understand it better, just ask. If you don’t like what it’s doing, have stern words.

Which one to use?

Here’s how I’d decide right now.

If you want to become part of a growing movement of collaborators burning tokens together to build software, individually and collectively—try Gas Town.

If you want to invest in building a powerful, configurable, sophisticated engine that can drive your projects forward 24 hours a day—try Kilroy.

If you just want to get things done right now, give Trycycle a spin. Heck, it’s fast enough that you can spin up a trycycle while you read the docs on Kilroy and Gas Town.

And whatever you choose, I recommend you do it with freshell, because it’s just more delightful that way!

Thanks to Harper Reed, Steve Yegge, Jesse Vincent, Justin Massa, Nat Torkington, Marcus Estes, and Arjun Singh for reading drafts of this.

Scenario Planning for AI and the “Jobless Future”

Tim O’Reilly — Mon, 20 Apr 2026 10:41:09 +0000

We all read it in the daily news. The New York Times reports that economists who once dismissed the AI job threat are now taking it seriously. In February, Jack Dorsey cut 40% of Block’s workforce, telling shareholders that “intelligence tools have changed what it means to build and run a company.” Block’s stock rose 20%. Salesforce has shed thousands of customer support workers, saying AI was already doing half the work. And a Stanford study found that software developers aged 22 to 25 saw employment drop nearly 20% from its peak, while developers over 26 were doing fine.

But how are we to square this news with a Vanguard study that found that the 100 occupations most exposed to AI were actually outperforming the rest of the labor market in both job growth and wages, and a rigorous NBER study of 25,000 Danish workers that found zero measurable effect of AI on earnings or hours?

Other studies could contribute to either side of the argument. For example, PwC’s 2025 Global AI Jobs Barometer, analyzing close to a billion job ads across six continents, found that workers with AI skills earn a 56% wage premium, and that productivity growth has nearly quadrupled in the industries most exposed to AI.

This is exactly the kind of contradictory, uncertain landscape that scenario planning was designed for. Scenario planning doesn’t ask you to predict what the future will be. It asks you to imagine divergent possible futures and to develop a strategy that improves your odds of success across all of them. I’ve used it many times at O’Reilly and have written about it before with COVID and climate change as illustrative examples. The argument between those who say AI will cause mass unemployment and those who insist technology always creates more jobs than it destroys is a debate that will only be resolved by time. Both sides have evidence. Both are probably right at some level. And both framings are not terribly helpful for anyone trying to figure out what to do next.

In a scenario planning exercise, you identify two key uncertainties and draw them as crossing vectors, dividing the possibility space into four quadrants. Each quadrant describes a different future. The power of the technique is that you don’t bet on one quadrant. You look for actions that make the most sense across all of them. And you’re not limited to doing this for only one uncertainty. You can repeat the exercise multiple times, each time expanding your sense of possible futures and clarifying your convictions about the most robust strategies for adapting to them.

For AI and jobs, the most obvious crossing vectors to model might seem to be how fast AI grows in its ability to replace human work and how quickly that capability is adopted. This is, in effect, scenario planning about whether the “AI is unprecedented” or “AI is normal technology” camp is correct. That might well be a useful pair of axes.

There’s no question that AI capability is accelerating. SWE-Bench scores for coding went from solving 4.4% of problems in 2023 to 71.7% in 2024, and we saw what was widely described as a “step change” beyond that in December of 2025. Anthropic’s new Mythos model seems to have upped AI capabilities even further. Even before Mythos, McKinsey estimated that today’s technology could in theory automate roughly 57% of current US work hours. But capability is not adoption. Goldman Sachs notes that AI appears to be suppressing hiring more than destroying existing jobs in the near term. Yale’s Budget Lab, analyzing US labor data from 2022 to 2025, found no massive shift in the share of workers across occupations. Deployment, not capability, seems to be the limiting factor.

As a result, it makes sense to me to synthesize these two factors (capability increase and rate of adoption) into a single vector that we can call the scale and size of impact. The question on this axis, therefore, is not so just “How good does AI get?” but also “How fast does the economy actually reorganize around it?”

What’s a good second vector to cross with this one? If you’ve read my book WTF? or other things I’ve written about the role of human choices in shaping the future, you probably won’t be surprised that the second vector I’ve chosen reflects my conviction that the future depends on whether AI capability is primarily used to achieve efficiencies in existing work or to do more, to solve new problems and serve more human needs.

When Dorsey says a smaller team can now do the same work, that’s efficiency. When Insilico Medicine uses AI to design a drug for idiopathic pulmonary fibrosis in a fraction of the time traditional discovery takes (with over 173 other AI-discovered drugs also now in clinical development and 15 to 20 entering pivotal Phase III trials this year), that’s not replacing a human job. That’s doing something that wasn’t being done before. But we shouldn’t content ourselves with the idea that the “do more” axis is just about technical breakthroughs. It might be in serving a vastly larger number of people far more effectively and efficiently. When Todd Park says that his company, Devoted Health, “is on a mission to dramatically improve the health and well-being of older Americans,” that is a call to do more. Given the size of the existing markets that need to be transformed, it is likely that even with 10x or 100x efficiency gains from AI, Devoted’s 1,000x mission might require more resources, including people.

What will be scarce?

I’ve always assumed that the “do more” orientation is chiefly a moral argument driven by human judgment about what kind of world we’d prefer to live in. As the IMF noted earlier this year, “Work brings dignity and purpose to people’s lives. That’s what makes the AI transformation so consequential.” A world of concentrated value capture leading to a split between those with capital to invest and a permanent unemployed underclass is the stuff of dystopian science fiction.

But it’s not just a matter of inequality and the importance of work to human self-esteem. I’ve also become convinced that companies that lean into new possibilities and expand markets do better than those that simply do the same things more cheaply. And the evidence is starting to come in that this is true. According to PWC, “Three-quarters of AI’s economic gains are being captured by just 20% of companies—with the leading companies focused on growth, not just productivity….The research shows that these top‑performing companies are not simply deploying more AI tools. Instead, they are using AI as a catalyst for growth and business reinvention, particularly by pursuing new revenue opportunities created as industries converge, while building strong foundations around data, governance and trust.”

There are also a number of economic arguments for why the jobless future is just not going to happen. These arguments provide useful guidance into the structural changes to the economy that workers, business leaders, and politicians should be planning for.

Noah Smith pointed to a draft economics paper by Garicano, Li, and Wu that helps explain how the trade-offs between efficiency and expanding output might impact jobs. Garicano, Li, and Wu note that “the effect of AI on an occupation depends not just on which tasks AI can perform but also on how costly it is to unbundle those tasks from the job.” They model jobs as bundles of tasks, and distinguish between “strongly bundled” jobs (where the same person has to do multiple interdependent tasks) and “weakly bundled” ones (where tasks can easily be split between a human and an AI). AI replaces the weakly bundled jobs first. But even for weakly bundled jobs, automation only replaces human labor after demand becomes inelastic, after AI is so productive at the task that making more of the output hits diminishing returns.

Until that point, increased productivity from AI can be focused on expanding output rather than shrinking headcount. That is another way of saying that whether AI replaces workers or augments them depends in large part on whether there is unmet demand to absorb the increased productivity. If we use AI only to do the same things more cheaply, we hit that inelastic point fast, and jobs disappear. If we use it to do new things, demand keeps expanding and people keep working. University of Chicago economist Alex Imas believes that just how much demand elasticity there is on a job by job basis is one of the big questions of our time.

But that’s not all. In a new essay called “What Will Be Scarce?” Imas points out that when a new technology makes one sector dramatically more productive, one part of the economy shrinks but another grows. When agriculture was mechanized, 40% of the American workforce moved off farms, but the economy actually grew, because people spent their rising real incomes on fundamentally different things. Imas argues, drawing on work by Comin, Lashkari, and Mestieri, that income effects account for over 75% of observed patterns of structural change. As people get richer, they want fundamentally different things.

What are those things? Imas calls it “the relational sector”: goods and services where the human element is itself part of the value; teachers, nurses, therapists, hospitality workers, artisans, performers, personal chefs, community curators, and more. He opens his piece with Starbucks. In pursuit of economic efficiency, the company tried to automate more and more of its operations. CEO Brian Niccol concluded that it was a mistake, that handwritten notes on cups, ceramic mugs, and good seats drove customer satisfaction. More baristas are being hired per store and automation is being rolled back.

But there’s far more to the relational sector than service jobs. Imas identifies a further dimension in what René Girard called mimetic desire, the idea that people don’t just want objects for their functional properties. They want things that others want, and they want them more when they’re scarce and exclusive. (Hobbes and Rousseau made this same point.) Imas’s experimental research shows that willingness to pay roughly doubles when people learn that others will be excluded from a product. And in new work with Graelin Mandel, he finds that AI involvement undermines the perceived exclusivity of a good. Human-made artwork gained 44% in value from exclusivity; AI-generated artwork gained only 21%. The mere involvement of AI made the work feel inherently reproducible.

This means the relational sector has naturally high income elasticity. If AI makes production cheaper and real incomes rise, spending shifts toward goods where the human element matters. This is Baumol’s cost disease working as a feature, not a bug: The sector that resists automation becomes relatively more expensive, and that’s precisely where spending and employment grow. This is an economic mechanism that could power the upper quadrants of the scenario grid that we will look at shortly, not just as a matter of moral choice but as a structural tendency of rich economies getting richer.

I’m going to include both Noah’s ideas and Alex’s in my scenario planning exercise, since they fit right in.

Four possible futures

Let’s look at how the two vectors cross each other and give us four futures.

Upper left: The Augmentation Economy. AI capability grows but adoption is gradual, and workers are augmented rather than replaced. A programmer who once wrote 100 lines of code a day now ships features that used to take a team. A nurse practitioner aided by AI diagnostic tools provides care that once required a specialist. A small business owner uses AI to access legal and financial services previously available only to large corporations. This is the quadrant where the PwC finding about the 56% wage premium makes the most sense. AI becomes a tool that makes individual workers more productive and more valuable, and the gains flow broadly. What makes this a positive, growing economy are at least in part the choices made by employers. They use the increased efficiency to build better services, not just to make them cheaper. Doctors and nurses have more time with patients and less time with paperwork. As services become more efficient, they can be offered to more people at lower cost.

Lower left: The Slow Squeeze. AI grows, adoption is gradual, and the primary use is efficiency. This is in many ways the most insidious quadrant, because it doesn’t look like a crisis. It looks like a normal economy with slightly fewer entry-level jobs each year, slightly more pressure on wages, and slightly less bargaining power for workers. That Stanford study on young software developers is a signal from this quadrant. So is the HBR finding that companies are laying off workers because of AI’s potential, not its performance. The Slow Squeeze is the world where companies use AI to pad margins without passing the gains along or investing in new capabilities.

Lower right: The Displacement Crisis. AI advances fast and is adopted rapidly, almost entirely for efficiency. This is the future the doomsayers warn about, the Citrini Research scenario of unemployment topping 10% and the S&P 500 tanking. Block’s 40% cut is a signal from this quadrant, whether or not Dorsey’s prediction that most companies will follow suit within a year turns out to be right. Deutsche Bank analysts warn that “AI redundancy washing,” companies blaming layoffs on AI that are really driven by other cost-cutting, will be a significant feature of 2026. But the fact that Wall Street rewarded Block with a 20% stock price jump for firing 4,000 people tells you what the current incentive structure is optimizing for.

Upper right: The Great Transformation. AI capability advances rapidly and is adopted fast, but the primary use is to do more, not just the same with less. Whole new industries emerge. The WEF’s projection of 170 million new roles by 2030 comes true, far exceeding the 92 million displaced. AI-driven drug discovery actually delivers on its promise. New forms of education, personalized to every learner, actually reach people the old system never served. The transition is still brutal, because the people losing old jobs and the people getting new ones are not the same people, in the same places, with the same skills. Brookings has identified 6.1 million workers with high AI exposure and low adaptive capacity, 86% of them women in clerical and administrative roles. But the net direction is toward more human capability, not less.

Imas’s framework suggests that this quadrant will feature an explosion of durable jobs in the relational sector. Some of these will be high touch service jobs: doctors, nurses, therapists, teachers, personal trainers, craft producers, experience designers, hospitality workers, and roles that haven’t been invented yet. The relational sector already employs nearly 50 million people in the US. But another big part of it will be creating exclusive products and services that become objects of desire. Art critic Dave Hickey calls this “the big beautiful art market” that happens when industrial products are “sold on the basis of what they mean rather than what they do.” The structural change model predicts that both of these areas will grow as a share of the economy, not because they resist automation as a technical matter but because not being automated is part of their value proposition.

Noah Smith’s taxonomy of future work also helps fill in what life may actually look like across these quadrants. He divides AI-affected jobs into three categories: specialists whose jobs are “strongly bundled” (for example, an experienced engineer whose judgment can’t be separated from the rest of what they do), salarymen (generalists whose value comes from knowing how to wrangle AI and plug its ever-shifting gaps, much like the Japanese corporate model where long-tenured employees rotate between divisions and accumulate firm-specific knowledge rather than portable technical skills), and small businesspeople (entrepreneurs who use AI as leverage to run what would previously have required a much larger team). This is the future that Steve Yegge envisions with its “millions of one-person startups.”

In the upper quadrants, all three categories thrive. Specialists do well because AI expands the scope of what their bundled expertise can accomplish. Salarymen thrive because companies that are doing more, not just doing the same with less, need people who can adapt to constantly changing tool capabilities within the context of their business. And small businesses proliferate because AI gives a one-person shop the productive capacity that used to require a department.

In the lower quadrants, specialists may survive, but salarymen face pressure as companies optimize for headcount reduction rather than capability expansion, and small businesses struggle because the efficiency-first economy compresses the margins they need to exist.

News from the future

In scenario planning, once you’ve chosen your vectors and imagined the resulting quadrants, you watch for “news from the future,” data points that signal which direction the world is actually heading. As with any scatter plot, the points are all over the map at first, but over time you start to see the trend lines emerge.

Right now, the signals are mixed.

News from the lower quadrants: Challenger, Gray & Christmas reports that AI was a significant contributing factor in nearly 55,000 US layoffs in 2025. Employee anxiety about AI-driven job loss has jumped from 28% in 2024 to 40% in 2026. 40% of employers globally told the WEF they plan to reduce their workforce where AI can automate tasks within five years. And the entry-level job market is tightening in ways that compound over time even if they don’t show up in headline unemployment numbers. Brookings found that the “gateway” occupations that serve as stepping stones from low-wage to middle-wage work are among the most exposed to AI, threatening career pathways, not just individual jobs.

News from the upper quadrants: The PwC wage premium data. The Vanguard finding that AI-exposed occupations are growing, not shrinking. The explosion of AI drug discovery programs. MIT’s David Autor has shown that 60% of today’s US employment is in job categories that didn’t exist in 1940. New task creation is how technology has always generated new work, and there’s no reason to believe AI is exempt from that pattern, unless we choose to use it only for efficiency.

There may also be some signal in reports that usage among developers is becoming more intensive and continuous, from multistep coding workflows to automated agents running in loops. Some engineers are “tokenmaxxing,” with engineers at companies like Meta treating AI consumption as a productivity benchmark. This is driving rapid revenue growth for AI providers but squeezing their margins as infrastructure costs rise. That margin pressure may sound like bad news, but it’s actually a classic pattern by which a technology crosses from “tool” to “infrastructure.” Cloud computing margins were terrible until scale and hardware improvements drove unit costs down, at which point the providers who had built habit and lock-in harvested enormous returns. AI inference costs have been dropping roughly 10x per year, and price competition is accelerating that decline. The margin squeeze is the mechanism by which AI becomes cheap enough to be ubiquitous. And the tokenmaxxing engineers are doing dramatically more iterations, more exploration, with more ambitious scope. That’s “doing more” behavior, not an efficiency behavior.

It’s still unclear, though, whether all those tokens are producing real value or whether some of this is the AI equivalent of crypto mining. If most of those tokens are productive, we’re looking at a productivity boom. If many are wasted, the adoption curve may have a big dip in it before industry matures. Either way, though, the direction is toward AI becoming economic and technology infrastructure. It’s important to remember that tokens spent trying out prototypes that are rejected are not necessarily wasted. They can be part of a new development process that’s expanding the space of possibilities.

News that doesn’t fit neatly into any quadrant: We appear to be in what Smith calls a “no-hire, no-fire” economy, where workers hunker down in their current jobs and refuse to switch, and companies keep them rather than hiring new workers. That’s consistent with a world where people sense that their portable technical skills are depreciating, so they cling to the firm-specific knowledge that still makes them valuable where they are. It’s also consistent with the NBER Denmark study finding task reorganization without job loss: AI is replacing tasks, not (yet) jobs. Nonetheless, it is clear that a dearth of entry level positions will be a serious issue.

A University of Pittsburgh researcher has been calling state unemployment offices one by one to assemble the granular data that doesn’t yet exist in federal statistics, because our measurement tools are not yet fine-grained enough to see what’s happening. If you’re confused about whether AI is causing job losses, he put it plainly: The likely problem is a lack of data. If AI is having an impact, we may just not be equipped to see it yet with the instruments we have. We’re getting new data points daily. Asking yourself which future they support can gradually increase your confidence in what is coming.

Robust strategy

The goal of a scenario planning exercise is to stretch your thinking so that you can make strategic choices that make sense regardless of which future unfolds. Scenario planners call this a “robust strategy.”

If you’re a business leader, the robust strategy is not to ask “How many people can I replace with AI?” It’s to ask “What can we do now that we couldn’t do before?” The companies that will thrive across all four quadrants are the ones that use AI to expand what’s possible, not just to shrink how much they have to spend. Aim for the upper right quadrant, and you’ll do better even if the rest of the world chooses otherwise.

That’s not just scenario planning. It’s Clay Christensen on the lessons of disruptive technologies. A disruptive technology is not defined by the markets it destroys but by the new markets and new possibilities it creates. As Christensen observed, RCA didn’t ignore the transistor; its leaders just thought it wasn’t good enough for its current customers. Sony embraced the new technology and created a new market of portable devices where the quality difference between transistors and vacuum tubes just didn’t matter. And of course, as Clay observed, the disruptive technology continues to improve.

If you’re a worker, one element of robust strategy is to band together, as the screenwriters guild did, and to make the case that the productivity gains from AI should be shared with workers and used to amplify their skills and efforts. Don’t resist AI, but instead use it to make yourself even more valuable. Use it to amplify your uniqueness. That is, lean into the augmentation economy. One of the things we’ve learned from the early advances in AI-enabled software engineering is that a great software engineer can get more out of AI than a vibe-coding beginner. This is true of other professions as well. Find ways that your human uniqueness makes the output of AI even more valuable.

Create professional associations that lean into mentorship and an AI-enriched career ladder, but aren’t afraid to take a political stance. The idea that providers of capital are entitled to all of the gains is a pernicious idea that has created an engine of inequality rather than of wide prosperity. It doesn’t have to be that way. Professional associations and other forms of solidarity are a possible source of countervailing power. (But don’t fall into the trap that many unions and professional associations do, of using that power to extract rents rather than increasing value for everyone.) Preferentially choose employers who are investing in training employees for a human + AI future, including at the beginning of the career ladder.

If you’re a specialist, deepen the parts of your expertise that are strongly bundled, the judgment and context and human relationships that can’t be separated from the technical work. If you’re a generalist inside a company, become the person who understands what AI can and can’t do and fills the gaps, whose value comes from adaptability and firm-specific knowledge rather than a fixed set of technical skills. And if you have entrepreneurial instincts, recognize that AI is creating leverage that may make it possible to run a viable business at a scale that previously couldn’t support one.

Imas’s work suggests that the most durable career paths may not be defined by which tasks AI can’t do (a moving target) but by whether the human element is part of what the customer is paying for. A restauranteur, a therapist, a teacher who knows your child, or a guide who knows the trail aren’t jobs that survive because AI hasn’t gotten to them yet. They’re jobs where human involvement is the product.

If you’re an entrepreneur, the robust strategy is the one it has always been: look at the world as it is, determine what work needs doing, and do it. Don’t build AI tools that replace humans doing things that are already being done adequately. Build AI tools that let humans do things that have never been done before.

If you’re a policymaker, the robust strategy is to invest in the transition regardless of how fast displacement turns out to be. Create policies that give workers more of a role in how AI is used. Support positions like those of the writers guild, which allow workers to get a share of the gains from using AI. And if capital runs wild with labor replacement, tax the gains so the efficiency can be redistributed. Decrease the working week.

Education and lifelong learning programs, portable benefits, support for geographic mobility, and investment in the industries of the future pay off in every quadrant. So does reducing the regulatory friction that keeps new entrants trapped in old cost structures, funding basic research that the market underinvests in, and building the kind of infrastructure (physical and institutional) that enables rapid adaptation.

The future is up to us

I’ll return to the theme that I sounded in my book WTF? What’s the Future and Why It’s Up To Us.

Every time a company uses AI to do what it was already doing with fewer people, it is making a choice for the lower half of the scenario grid. Every time a company uses AI to do something that wasn’t previously possible, to serve a customer who wasn’t previously served, to solve a problem that wasn’t previously solvable, it is making a choice for the upper half. These choices compound, for good or ill. An economy that uses AI primarily for efficiency will slowly hollow itself out.

Looking at the news from the future, both sets of signals are present. The question is which will dominate. AI will give us both the Augmentation Economy and the Displacement Crisis, in different measures in different places, depending on the choices we make.

Scenario planning teaches us that we don’t have to predict which future we’ll get. We do have to prepare for a very uncertain future. But the robust strategy, the one that works across every quadrant, is to focus on doing more, not just doing the same with less, and to find ways that human taste still matters in what is created. As long as there is unmet demand, as long as there are problems we haven’t solved and people we haven’t served, AI will augment human work rather than replacing it. It’s only when we stop looking for new things to do that the machines come for the jobs.

Trial by Fire: Crisis Engineering

Jennifer Pahlka — Fri, 17 Apr 2026 10:54:01 +0000

The following article originally appeared on Jennifer Pahlka’s Eating Policy website and is being republished here with the author’s permission.

I read Norman Maclean’s Young Men and Fire when I was a teenager, I think, so it’s been many years, but I still remember its turning point vividly. It’s set in 1949 in Montana, at the Gates of the Mountains Wilderness, about an hour north of Helena. A fire is burning, and the Forest Service sends out their smokejumpers to fight it. But the fire changes direction without warning, and a group of smokejumpers working in the Mann Gulch find themselves trapped, facing certain death. Instead of running, the foreman, Wag Dodge, pulls out matches and does the unthinkable: He lights a fire.

Today we know what he was doing. The escape fire consumed the fuel around him, allowing the main fire to pass over him and a few of his colleagues. But in 1949, the families of the 13 other smokejumpers who died accused Wag of causing their deaths. To them, what he had done made no sense.

I love that Marina Nitze, Matthew Weaver, and Mikey Dickerson chose this story as a framing device for their new book, Crisis Engineering: Time-Tested Tools for Turning Chaos Into Clarity, out now. Not just because it brought back the memory of a book that I once loved, but because Maclean’s obsessive investigation of what had happened back then (he wrote the book years after the incident) seemed to me almost as heroic as the bravery of the smokejumpers. And indeed, his insistence on making sense of what happened has probably saved lives. Escape fires are now formally recognized and taught as a last resort tactic when training new firefighters.

The Dodge escape fire wouldn’t seem to have much to do with Three Mile Island or healthcare.gov or the pandemic unemployment insurance backlogs, but the authors use it to make a point about how action and understanding interact in a crisis. One key is exactly what Maclean himself did so well: sensemaking. In a crisis like Mann Gulch, sensemaking disintegrates: a broken radio, wind so strong communication is impossible, fire whose behavior violates well-tested assumptions, and a team scattered. You don’t achieve sensemaking by staring at a map; you achieve it by acting and observing results. Wag Dodge didn’t understand fire behavior well enough to explain the escape fire in advance. But his actions created the understanding itself—retrospectively, as all real sensemaking is.

The book’s key claim is that crises are opportunities, and the authors leverage Daniel Kahneman’s Thinking, Fast and Slow to explain why crises are the only real windows for organizational change—and why everything else, the incentives, the logical arguments, the reorganizations, mostly doesn’t work. Most organizations, most of the time, run on autopilot. People habituate to their environment, rationalize away small surprises, and build stable stories about how things work. A crisis breaks this. When surprise accumulates faster than the brain’s “surprise-removing machinery” can rationalize it away, the whole apparatus jams, and organizations become, briefly, reprogrammable.

An institution resolves a crisis in one of three ways, according to the authors. It makes durable deliberate change, it dies, or, most commonly, it rationalizes the failure into an accepted new normal. “Most large organizations contain programs and departments that passively accept abject failure: infinitely long backlogs, hospitals that kill patients, devastating school closures that do little to affect a pandemic. These are fossils of past crises where the organization failed to adapt.”

Too many of our public institutions have failed to adapt, and the idea that they might be reprogrammable at all is a bit radical. We live in an era when too many people have given up on them, willing to burn them to the ground rather than renovate them. If crises represent the chance for true transformation, then we’d better get a lot better at using them for that. This is explicitly why Crisis Engineering exists, and it’s a detailed, practical book—the theory and framing devices are well used, but there’s a ton of pragmatic substance here you’ll be grateful for when the moment comes.

I remember when I was working in the White House and frustrated by the slow pace of progress. My UK mentor Mike Bracken told me: “Hold on, you just need a crisis. You Americans only ever change in crisis.” Boom. About two months later, healthcare.gov had its inauspicious start. And he was right. Change followed. Not all the change we needed, but a start. Marina, Weaver, and Mikey are three of the people who drove that change. I got to work with them again the first summer of the pandemic on California’s unemployment insurance claims backlog. I’m not a crisis engineer, but their strategies and tactics have deeply influenced how I think about the work I do and how I think we’re going to get from the institutions we have today to the ones we need.

We may be living in an era when too many people have given up on institutions, but we are also likely entering an era of crisis, and even polycrisis. This makes for uncomfortable math, but also drives home the need for a new generation of crisis engineers.

When I first read about Mann Gulch, so many years ago, I remember being in awe of the ingenuity and courage it took to start Wag Dodge’s escape fire. Today I think a lot about that pattern: the controlled burns that reduce the risk of megafires, the little earthquakes that take the pressure off faults under great tension, the managed crises that, if we’re skilled enough to use them, keep our institutions from the kind of collapse that comes when nothing has been allowed to give for too long. Dodge didn’t burn things down. He burned a path through. We’re going to have to get good at that.

Generative AI in the Real World: Aishwarya Naresh Reganti on Making AI Work in Production

Ben Lorica and Aishwarya Naresh Reganti — Thu, 16 Apr 2026 14:03:20 +0000

As the founder and CEO of LevelUp Labs, Aishwarya Naresh Reganti helps organizations “really grapple with AI,” and through her teaching, she guides individuals who are doing the same. Aishwarya joined Ben to share her experience as a forward-deployed expert supporting companies that are putting AI into production. Listen in to learn the value all roles—from data folks and developers to SMEs like marketers—bring to the table when launching products; how AI flips the 80-20 rule on its head; the problem with evals (or at least, the term “evals”); enterprise versus consumer use cases; and when humans need to be part of the loop. “LLMs are super powerful,” Aishwarya explains. “So I think you need to really identify where to use that power versus where humans should be making decisions.” Watch now.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform or follow us on YouTube, Spotify, Apple, or wherever you get your podcasts.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.58
All right. So today we have Aishwarya Reganti, founder and CEO of LevelUp Labs. Their tagline is “Forward-deployed AI experts at your service.” So with that, welcome to the podcast.

01.13
Thank you, Ben. Super excited to be here.

01.16
All right. So for our listeners, “forward-deployed”—that’s a term I think that first entered the lexicon mainly through Palantir, I believe: forward-deployed engineers. So that communicates that Aishwarya and team are very much at the forefront of helping companies really grapple with AI and getting it to work. So, first question is, we’re two years into these AI demos. What actually separates a real AI product from a good demo at this point?

01.53
Yeah, very timely question. And yeah, we are a team of forward-deployed experts. A bit of a background to also tell you why we probably have seen quite a few demos failing. We work with enterprises to build a prototype for them, educate them about how to improve that prototype over time. I think one of the biggest things that differentiates a good AI product is how much effort a team is spending on calibrating it. I typically call this the 80-20 flip.

A lot of the folks who are building AI products as of today come from a traditional software engineering background. And when you’re building a traditional product, a software product, you spend 80% of the time on building and 20% of the time on what happens after building, right? You’re probably seeing a bunch of bugs, you’re resolving them, etc.

But in AI, that kind of gets flipped. You spend 20% of the time maybe building, especially with all of the AI assistants and all of that. And you spend 80% of the time on what I call “calibration,” which is identifying how your users behave with the product [and] how well the product is doing, and incorporating that as a flywheel so that you can continue to improve it, right?

03.11
And why does that happen? Because with AI products, the interface is very natural, which means that you’re pretty much speaking with these products, or you’re using some form of natural language communication, which means there are tons of ways users could talk and approach your product versus just clicking buttons and all of that, where workflows are so deterministic—which is why you open up a larger surface area for errors.

And you will only understand how your users are behaving with the system as you give them more access to it right. Think of anything as mainstream as ChatGPT. How users interact with ChatGPT today is so much more different than how they would do say three years ago or when it was released in November 2022. So what differentiates a good product is that idea of constant calibration to make sure that it’s getting aligned with the users and also with changing models and stuff like that. So the 80-20 flip I think is what differentiates a good product from just a prototype.

04.14
So actually this is an important point in the in the sense that the persona has changed as to who’s building these data and AI products, because if you rewind five years ago, you had people with some knowledge of data science, ML, and now because it’s so accessible, developers—actually even nondevelopers, vibe coders—can can start building. So with that said, Aishwarya, what do these kinds of nondata and AI people still consistently get wrong when they move from that traditional mindset of building software to now AI applications?

05.05
For one, I truly am one of those people who believes that AI should be for everyone. Even if you’re coming from a traditional machine learning background, there’s so much to catch up on. Like I moved to a team in AWS where. . . I moved from a team in AWS in 2023 where I was working with traditional natural language processing models—I was a part of the Alexa team. And then I moved into an org called GenAI Innovation Center, where we were building generative AI solutions for customers. And I feel like there was so much to learn for me as well.

But if there’s one thing that most people get wrong and maybe AI and traditional ML folks get right, it’s to look at your data, right? When you’re building all of these products, people just assume that “Oh, I’ve tested this for a few use cases” and then it seems to work fine, and they don’t pay so much attention to the kind of data distribution that they would get from their users. And given this obsession to automate everything, people go like, “OK, I can maybe ask an LLM to identify what kind of user patterns I’m seeing, build evals for itself, and update itself.” It doesn’t work that way. You really need to spend the time to understand workflows very well, understand context, understand all this data, pretty much. . .

I think just taking the time to manually do some of the setting up work for your agents so that they can perform at their maximum is super underrated. Traditional ML folks tend to understand that a little better because most of the time we’ve been doing that. We’ve been curating data for training our machine learning models even after they go into production. There’s all of this identifying outliers and updating and stuff. But yeah, if there’s one single takeaway for anybody building AI products: Take the time to look at your data. That’s the most important foundation for building them.

07.01
I’ll flip this a little bit and give props to the traditional developers. What do they get right? In other words, traditional developers write code; some of them write tests, run unit tests [and] integration tests. So they had something to build on that maybe the data scientists who were not writing production code were not used to doing. So what do the traditional developers bring to the table that the data and ML people can learn from?

07.40
That’s an interesting question because I don’t come from a software background and I just feel traditional developers have a very good design thinking: How do you design architectures so that they can scale? I was so used to writing in notebooks and kind of just focusing so much on the model, but traditional developers treat the model as an API and they build everything very well around it, right? They think about security. They think about what kind of design makes sense at scale and all of that. And even today I feel like so much of AI engineering is traditional software engineering—but with all of the caveats that you need to be looking at your data. You need to be building evals which look very different. But if you kind of zoom out and see, it’s pretty much the same process, and everything that you do around the model (assuming that the model is just a nondeterministic API), I think traditional software engineers get it like bang on.

08.36
You recently wrote a post about evals, which was quite interesting actually, [arguing] that it’s a bit of an overused and poorly defined term. I agree with the thesis of the post, but were you getting frustrated? Is that the reason why you wrote the post? [laughs] What was the genesis of the post?

09.03
A baseline is most of my posts come out of frustration and noise in this space. It just feels like if you kind of see the trajectory. . . In November 2022, ChatGPT was out, and [everybody was] like, “Oh, chat interfaces are all you need.” And then there was this concept of retrieval-augmented generation, they go “Oh, RAG is all you need. Chat just doesn’t work.” And then there was this concept of agents and like “Agents are all you need; evals are all you need.” So it just gets super annoying when people hang on to these concepts and don’t really understand the depth of it.

Even now I think there are tons of people who go like “Oh, RAG is dead. It’s not going to be used” and stuff, and there’s so much nuance to it. And with evals as well. I teach a lot of courses: I teach at universities; I also have my own courses. I feel like people just stuck to the term, and they were like “Oh, there is this use case I’m building. I need hundreds of evals in order to make sure that it’s tested very well.” And they just heard the fact that “Oh, evals are what you need to do differently for AI products” and really didn’t understand in depth like what evals mean—how you need to build a flywheel around it, and the entire you know act of building a product, calibrating it, and building a set of evaluations and also doing some A/B testing online to understand how your users are behaving with it. All of that just went into one term “evals,” and people are just like throwing it around everywhere, right?

10.35
And there’s also this confusion around model eval versus product eval, which is all of these frontier companies build evals on their models to make sure that they understand where they are on the leaderboard. And I was speaking to someone one day, and they went like, “Oh, GPT-5 point something has been tested on a particular eval dataset, which means it’s the best for my use case, so I’m going to be using it.” And I’m like, “That’s not the evals that you should be worrying about, right?” So just overloading so much into a term and hyping it up is kind of what I felt was annoying. And I wanted to write a post to say that evals is a process. It’s a long process. It’s pretty much the process of building something and calibrating it over time. And there are tons of components to it, so don’t kind of try to stuff everything in a word and confuse people.

I’ve also seen people who do things like, “Oh, I’m going to build hundreds of evals” and maybe 10 of them are actionable. Evals also need to be super actionable: What is the information you can get from them, and how can you act on that? So I kind of stuffed all of that frustration into the post to kind of say it’s a longer process. There’s so much nuance in it. Don’t try to water that down.

11.48
So it seems like this is an area where the people that were from the prior era—the people building ML and data science products—maybe could bring something to the table, right? Because they had experience, I don’t know, shipping recommendation engines and things like that. They have some prior notion of what continuous evaluation and rigorous evaluation brings to the table.

Actually I was talking to someone about this a few weeks ago in the sense that maybe the data scientists actually have a growing employment opportunity here because basically what they bring to the table seems increasingly important to me. Given that code is essentially free and discardable, it seems like someone with a more rigorous background in stats and ML might be able to distinguish themselves. What do you think?

12.56
Yes and no, because it’s true that machine learning and data scientists understand data very well, but just the way you build evals for these products is so much more different than how you would build, say, your typical metrics (accuracy, F-score, and all of that) that it takes quite some thinking to extend that and also some learning to do. . .

13.21
But at least you might actually go in there knowing that you need it.

13.27
That is true, but I don’t think that’s a super. . . I’ve seen very good engineers pick that up as well because they understand at a design level “What are the metrics I need to be measuring?” So they’re very outcome focused and kind of enter with that. So one: I think everybody has to be more coachable—not really depend on things that they learned like X years ago, because things are changing so quickly. But I also believe that whenever you’re building a product, it’s not really one set of folks that have the edge.

Another maybe distribution that is completely different is just subject-matter experts, right? When you’re building evals, you need to be writing rubrics for your LLM judges. Simple example: Let’s say you’re building a marketing pipeline for your company, and you need to write copy—marketing emails or something like that. Now even if I come from a data science background, if I were thrown at that problem, I just don’t understand what to look for and how to get closer to a brand voice that my company would be satisfied with. But I really need a marketing expert to kind of tell me “This is the brand voice we use, and this is the evals that we can build, or this is how the rubric should look like.” So it should almost be like a cross-functional thing. I feel like each of us have different pieces to that puzzle, and we need to work together.

14.42
That kind of also brings me to this other thing of collaborating in a much tighter manner [than] before. Before it was like, “OK, machine learning folks get data; they build models; and then there is a separate testing team; there is a separate SME team that’s going to look at how this product is behaving.” And now you cannot do that. You need to be optimizing for the same feedback loop. You need to be talking a lot more with all of the stakeholders because even when building, you want to understand their perspective.

15.14
So it seems also the case that as more people build these things, they realize that actually. . . You know sometimes I struggle with the word “eval” in the sense that maybe the right word is “optimize,” because basically what you really want is to understand “What am I optimizing for?” Obviously reliability is one of them, but latency and cost are also important factors, right? So it’s just a discussion that you’re increasingly coming across, and people are recognizing that there’s trade-offs and they have to balance a bunch of things.

15.57
Yes, definitely. I don’t see it being discussed heavily mainstream. But whenever I approach a problem, it’s always that, right? It’s performance, effort, cost, and latency. And all of these four things are kind of. . . You’re trying to balance each of them and trade off each of them. And I always say, start off with something that’s very low effort so that you kind of have an upper ceiling to what can be achieved. Then optimize for performance.

Again, don’t optimize for cost and latency when you get started because you just want to see the realm of possible to make sure that you can build a product and it can work fine. And cost and latency [are] something that ought to be optimized for—even when building for enterprises—after we’ve had a decent prototype that can do well on evals. Right now, if I built something with, say, a good mid-tier model and it can hit all of my eval datasets, then I know that this is possible, and now I can optimize for the latency and cost based on the constraints. But always follow that pyramid, right? Go with [the] lowest effort. Try to optimize for performance. And then cost and latency is something that. . . There are tons of tricks you can do. There’s caching; there’s using smaller models and all of that. That’s kind of a framework that I typically use.

17.08
In prior generations of machine learning, I think a lot of focus was on accuracy to some extent. But now increasingly, because we’re in this kind of generative AI world, it’s more likely that people are interested in reliability and predictability in the following sense: Even if I’m only 10% accurate, as long as I know what that 10% is, I would prefer that [to] a model that’s more accurate but I don’t know when it’s accurate. Right?

17.47
Right. That’s kind of the boon and bane of generative AI models. I guess the fact that they can generalize is amazing, but sometimes they end up generalizing in ways that you wouldn’t want them to. And whenever we work on enterprise use cases, I think for us always in my mind—something that I want to tell myself—is if this can be a workflow, don’t make it autonomous if it can solve a problem with a simple LLM call and if you can audit decisions. For instance, let’s say we’re building a customer support agent. You could literally build it in five minutes: You can throw SOPs at your customer support agent and say “OK, pick up the right resolution, talk to the user, and that’s it.” Building is very cheap today. I can literally have Claude Code build it up in a few minutes.

But something that you want to be more intentional about is “What happens if things go wrong? When should I escalate to humans?” And that’s where I would just break this into a workflow. First, identify the intent of the human and then give me a draft—almost be a copilot for me, where I can collaborate. And then if that draft looks good, a human should approve it so that it goes further.

Right now, you’re introducing auditability at each point so that you as a human can make decisions before, you know, an agent goes up and messes up things for you. And that’s also where your design decisions should really take over. Like I could build anything today, but how much thinking am I doing before that building so that there’s reliability, there’s auditability, and all of those things. LLMs are super powerful. So I think you need to really identify where to use that power versus where humans should be making decisions.

19.28
And you touched on the notion of human auditors or humans in the loop. So obviously people also try to balance LLM as judge versus human in the loop, right? Obviously there’s no one piece of advice, but what are some best practices around how you demarcate between when to use a human and when you’re comfortable using another model as a judge?

20.04
A lot of this usually depends on how much data you have to train your judge, right? I feel humans have this problem, which is: Sometimes you can do a task but you can’t explain why you arrived at that decision in a very structured format. I can today take a look at an article and tell you. . . Especially, I write a lot on Substack and LinkedIn; this is a very super personal use case. If you give me an article and ask me, “Ash, will this go viral on LinkedIn?” I can tell you yes or no for my profile right, because I’ve done it for so many years. But if you ask me, “How did you make that decision?” I probably cannot codify it and write it down as a bunch of rubrics. Which is again, when you translate this to an LLM judge, “Can I build an LLM that can tell me if a post will go viral or not?” Maybe not because I just don’t have all the constraints that I use as a human when I make decisions.

Now, take this to more production-like use cases or enterprise-like use cases. You want to have a human judge until you can codify or you can create a framework of how to evaluate something and you can write that out in natural language. And what that means is you maybe want to take 100 or 200 utterances and say, “OK, does this make sense? What’s the reasoning behind why I graded it a certain way?” And you can feed all of that information into your LLM judge to finally give it a set of rubrics and build your evals. But that’s kind of how you make a decision, which is “Do we have enough information to provide to an LLM judge that it can replace human judgment?”

But otherwise don’t do it—if you have very vague high-level ideas of what good looks like, you probably don’t want to go to an LLM judge. Even when building your systems, I would always recommend that your first pass when you’re doing your eval should be judged by a human, and you should also ask them to give you reasoning as to why they judge it because that reasoning is so important for training your LLM judges.

21.58
What are some signs that you look for? What are signals that you look for when one of these AI applications or systems go live? What are some of the signals you look for that [show] maybe the quality is degrading or breaking down?

22.18
It really depends on the use case, but there are a lot of subtle signals that users will give you, and you can log them, right? Things like “Are users swearing at your product?” That’s something we always use, right? “What kind of words are they using? How many conversation turns if it’s a chatbot, right?” Usually when you’re building your chatbot, you identify that the average number of turns is 10, but it turns out that customers are having only two turns of conversation. That kind of means that they’re not interested to talk to your chatbot. Or sometimes they’re having 20 conversations, which means they’re probably annoyed, which is why they’re having longer conversations.

There are typical things: You know, ask your user to give a thumbs up or thumbs down and all of that, but we know that feedback kind of doesn’t. . . People don’t give feedback unless they’re annoyed at something. So you can have those as well. If you’re building something like a coding agent like Claude Code etc., very obvious logging you can do is “Did the user go and change the code that it generated?” which means it’s wrong. So it’s very specific to your context, but really think of ways you can log all of this behavior you can log anomalies.

Sometimes just getting all of these logs and doing some topic clustering which is “What are our users typically talking about, and do any of those show signs of frustration? Do they show signs of being annoyed with the system?” and things like that. You really need to understand your workflows very well so that you can design these monitoring strategies.

23.50
Yeah, it’s interesting because I was just on a chatbot for an airline, and I was surprised how bad it was, in the sense that it felt like a chatbot of the pre-LLM era. So give us give us kind of your sense of “Are these chatbots now really being powered by foundation models or. . .?” I mean because I was just shocked, Aishwarya, about how bad it was, you know? So what’s your sense of, as far as you know, are enterprises really deploying these generative AI foundation models in consumer-facing apps?

24.41
Very few. To just give you a quick stat that might not be super correct: 70% to 80% of the engagements that we take up at LevelUp Labs happen to be productivity and ops focused rather than customer focused. And the biggest blocker for that has always been trust and reliability, because if you build these customer-facing agents [and] they make one mistake, it’s enough to put you on news media or enough to put you in bad PR.

But I think what good companies are doing as of today is doing a phased approach, which is they have already identified buckets that can be completely autonomous versus buckets that would require humans to navigate, right? Like this example that you gave me, as soon as a user comes up with a query, they have a triaging system that would determine if it should go to an AI agent versus a human, depending on the history of the user, depending on the kind of query. (Is it complicated enough?) Right? Let’s say Ben has this history of. . .

25.44
Hey, hey, I had great status on this airline.

25.47
[laughs] Yeah. So it’s probably not you, but just the kind of query you’re coming up with and all of that. So they’ve identified buckets where automation is possible, and they’re doing it, and they’ve done that because of past behavior data, right? What are low-hanging fruits that we could automate versus escalate to humans. I have not seen a lot of these chat systems that are completely taken over by agents. There’s always some human oversight and very good orchestration mechanisms to make sure that customers are not affected.

26.16
So you mentioned that you mostly are in the technical and ops application areas, but I’ll ask you this question anyway. To what extent do legal things come up? In other words, I’m about to deploy this model. I know I have guardrails, but honestly, just between you and me, I haven’t gone through the proper legal evaluation, you know? [laughs] So in other words, legality or compliance—anything to do with laws—do they come up at all in your discussions with companies?

26.59
As an external implementation team, I think one thing that we do with most companies is give them a high-level overview of the architecture we’ll be building, the requirements, and ask them to do a security and legal review so that they’re okay with it, because we’ve had experiences in the past where we pretty much built out everything and then you have your CISO come in and say, “OK, this doesn’t fall into what we could deploy.” So many companies make that mistake of not really involving your governance and compliance folks in the beginning and then end up scrapping entire projects.

I am not an expert who knows all of these rules and legalities, but we always make sure that they understand: “Where is the data coming from? Do we have any issues productionizing this?” and all of that, but we haven’t really worked. . . I mean I don’t have a lot of background on how to do this. We’re mostly engineering folks, but we make sure that we have a sign-off so that we are not kind of landing in surprises.

28.07
Yeah, the reason I bring it up is obviously, now that everything is much more democratized, more people can build—so in reality the people can move fast and break things literally, right? So I just wonder if there’s any discussion at all. It sounds like you are proactive, but mostly out of experience, but I wonder if regular teams are talking about this.

Speaking of which, you brought up earlier leaderboards—obviously I’m guilty of this too: “I’m about to build something. OK, let me look at a leaderboard.” But, you know, I’m not literally going to take the leaderboard’s advice, right? I’m going to still kick the tires on the specific application and use case. But I’m sure though, in your conversations, people tell you all sorts of things like, “Hey, we should use this because I saw somewhere that this is ranked number one,” right? So is this still a frustration on your end, or are people much more savvy now?

29.19
For one, I want to quickly clarify that it’s not wrong to look at a leaderboard. It’s always. . . You know, you get a high-level idea of “Who are your best competitors at this point?” But what I have a problem with is being so obsessed with just that leaderboard that you don’t build evals for yourself.

29.34
In my experience, when we work with a lot of these companies, I think over the past two years the discussion has really shifted away from the model because of two reasons: One is most companies already have existing partnerships. They’re either working with a major model provider vendor and they’re OK doing that now just because all of these model providers are racing towards feature parity, leaderboard success, and all of that. If Anthropic has something, you know, if their model is performing well on a leaderboard today, Gemini and OpenAI will probably be there in a week. So people are not too concerned about model performance. They know that in a couple of weeks, that will kind of be built into other models. So they’re not worried about that.

And two is companies are also thinking much more about the application layer right now. There’s so much discussion around all of these harnesses like Claude Code, OpenClaw, and stuff like that. So I’ve not seen a lot of complaints on “Oh, this is the model that we should be using.” It seems like they have a shared understanding of how models perform. They want to optimize the harness and the application layer much more.

30.48
Yeah. Yeah. Obviously another one of these buzzwords is “harness engineering,” and whatever you think about it, the one good thing is it really elevates the notion that you should worry about the things around the model rather than the model itself.

But speaking of. . . I guess I’m kind of old school in the sense that I want to still make sure that I can swap models out, not necessarily because I believe one model is better than the other but one model may be cheaper than the other, right?

And at least up until recently—I haven’t had this conversation in a while—it seemed to me that people got stuck on a model because their prompts were so specific for a model that porting to another model seemed like a lot of work. But nowadays though you have tools like DSPy and GEPA that it seems like you can do that more easily. So what’s your sense of model portability as a design principle—model neutrality?

32.06
For one, I think the gap between models is much more exaggerated for consumer use cases just because people care quite a bit about the personality, about how the model…

32.22
No, I care about latency and cost.

32.24
Yeah. In terms of latency and cost, right, most of the model providers pretty much are competing to make sure they are in the market. I don’t know. Do you think that there are models. . .

32.35
Well, I think that you can still get good deals with Gemini. [laughs]

32.40
Interesting.

32.41
But honestly, I use OpenRouter and OpenCode. So, I’m much more kind of I don’t want to get locked into a single [model]. When I build something, I want to make sure that I build in a way that I can move to a different model provider if I have to. But it doesn’t sound like you think that this is something that people worry about right now. They’re just worried about building something usable and then we can worry about that later.

33.12
Yes. And again, I come from a very enterprise point, like “What are companies thinking about this?” And like I said, I’m not seeing a lot of competition for model neutrality because these companies have deals with vendors and they’re okay sticking with the same model provider.

Now, when it comes to consumers, like if you’re building something for the kind of use cases that you were saying, Ben, I feel that, like I said, personality is super important for consumer builders. And I still think we’re not at a point where you can easily swap out models and be like, “OK, this is going to work as good as before,” just because you have over time learned how the model behaves. So you’ve kind of gotten calibrated with these models, and these models also have very specific personalities. So there’s a lot of you know reengineering that you have to do.

34.07
And when I say reengineering, it just might mean changing the way your prompts are written and stuff like that. It will still functionally work, which is why I say that enterprises don’t care about this much because the kind of use cases I see are like document processing or code generation, in which case functionality is of much more importance than personality. But for consumer use cases, I don’t think we’re at a point—to your point on building with OpenRouter, you can do that, but I think it’s a lot of overhead given that you’ll have to write specific prompts for all of these models depending on your use case.

I recently ported my OpenClaw from Anthropic to OpenAI because of all of the recent things, and I had to change all of my SOUL.md files, USER.md files, so that I could kind of set the behavior. And it [took] quite some time to do it, and I’m still getting used to interacting with OpenClaw using OpenAI because it seems like it makes different mistakes than what Anthropic would do.

35.03
So hopefully at some point [the] personalities of these models will converge but I do not think so because this is not a capability problem. It’s more of design choices that these model providers have made while building these models. So I don’t see a time where. . . We’re already at a point where capability-wise most models are getting closer, but personality-wise I don’t think model vendors would prefer to converge them because these are kind of your spiky edges which will make people with a certain personality gravitate towards your models. You don’t want to be making it like an average.

35.38
So in closing, you do a bit of teaching as well, right? One of the things I’ve really paid attention to is, in my conversations with people who are very, very early in their career, maybe still looking for the first job, literally, there’s a lot of worry out there. I mean, not necessarily if you’re a developer and you have a job—as long as you embrace the AI tools, you’re probably going to be fine. It’s just getting to that first job is getting harder and harder for people.

And unfortunately, you need that first job to burnish your credentials and your résumé. And honestly companies also I think neglect the fact that this is your pipeline for talent within the company as well: You have to have the top of the funnel of your talent pipeline. So what advice do you give to people who are literally still trying to get to that first job?

36.51
For one, I have had a lot of success with hiring young folks because I think they are very agent native. I call them like agent-native operators. If you’ve been working in software, in IT, for about 10 years or something like me, you’ve gotten used to certain workflows without using AI. I feel like we’re so stuck in that old mindset that I really need someone who’s agent native to come and tell me, “Hey you could literally ask Claude Code to do this.” So I’ve had a lot of luck hiring folks who are early career because they are very coachable, one, and two, they just understand how to be agent native.

So my suggestion would still be around that: Be a tinkerer. Try to find out what you can do with these tools, how you can automate them, and be extremely obsessed with designing and thinking and not really execution, right? Execution is kind of being taken over by agents.

So how do you really think about “What can I delegate?” versus “What can I augment?” and really sitting in the position of almost being an agent manager and thinking “How can you set up processes so that you can make end-to-end impact?” So just thinking a lot around those lines—and those are the kind of people that we’d like to hire as well.

And if you see a lot of these latest job roles ,you’ll also see roles blurring, right? People who are product managers are expected to also do GTM, also do a bit of engineering, and all of that. So really understand the stack end to end. And the best way to do it, I feel, is build a product of your own [and] try to sell it. You’ll get to see the whole thing. [That] doesn’t mean “Oh, stop looking for jobs—go become an entrepreneur” but really understanding workflows end to end and making that impact and sitting at the design layer will be super valued is what I think.

38.34
Yeah, the other thing I tell people is you have interests so go deep in your interest and build something in whatever you’re interested in. Domain knowledge is going to be valuable moving forward, but also you end up building something that you would want to use yourself and you learn a lot of things along the way and then maybe that’s how you get your name out there, right?

38.59
Exactly. Solving for your own problem is the best advice: Try to build something that solves your own pain point. Try to also advocate for it. I feel like social media and all of this is so good at this point that you can really make a mark in nontraditional ways. You probably don’t even have to submit a job application. You can have a GitHub repository that gets a lot of stars—that might land you a job. So think of all of these ways to bring yourself more visibility as you build so that you don’t have to go through your typical job queue.

39.30
And with that, thank you, Aishwarya.

39.32
Thank you.

Meet the Scope Creep Kraken

Tim O'Brien — Thu, 16 Apr 2026 10:31:31 +0000

The following article was originally published on Tim O’Brien’s Medium page and is being reposted here with the author’s permission.

If you’ve spent any time around AI-assisted software work, you already know the moment when the Scope Creep Kraken first puts a tentacle on the boat.

The project begins with a real goal and, usually, a sensible one. Build the internal tool. Clean up the reporting flow. Add the missing admin screen. Then someone discovers that the model can generate a Swift application in minutes to render this on an iPhone, and the mood in the room changes.

“Why not? We can render this on an iOS application, and it will only take 10 minutes. Go for it. These tools are amazing. Wow.”

That first idea is often genuinely useful. Something that might have taken a week now takes an hour. That is part of what makes the pattern so seductive. It doesn’t begin with incompetence. It begins with tool-driven momentum.

The meeting continues, “Let’s put the entire year’s backlog into the system and see if we can get this all done in a week. Ignore the token spend limits, let’s just get this done.” What was a reasonable weekly release meeting has now set the stage for a rapid expansion in scope, and that’s how the Scope Creep Kraken takes over.

Scope creep is older than AI, of course. Software teams have been haunted by “while we’re at it” long before anybody was pasting stack traces into a chat window. What AI changed was the rate of growth. In the old version of this problem, extra scope still had to fight its way through staffing constraints. Somebody had to build the feature, debug it, test it, and explain why it belonged. That friction was often the only thing standing between a focused project and an over-extended team.

AI broke that.

Now the extra feature often arrives with a demo attached. “Could we add multi-language support?” Forty-five seconds later, there is a branch. “What about generated documentation?” Sure, why not? “Could the CLI accept natural language commands?” The model appears optimistic, which is enough to make the whole thing sound temporarily reasonable. Each addition looks manageable in isolation. That is how the Kraken works. It does not attack all at once. It wraps around the project one small grip at a time.

Signs the Kraken is already on your boat

Features appearing without a ticket
Branches nobody asked for
Demos replacing design decisions
“It only took the model 30 seconds.”

The part I keep seeing on teams is not reckless ambition so much as confident improvisation. People are reacting to real capability. They are not wrong to be excited that so much is suddenly possible.

The trouble starts when “we can generate this quickly” quietly replaces “we decided this belongs in the project.” Those are not the same sentence.

For a while, the Kraken even looks helpful. Output goes up. Screens appear. Branches multiply. People feel productive, and sometimes they really are productive in the narrow local sense. What gets hidden in that burst of visible progress is integration cost. Every tentacle has to be tested with every other tentacle. Every generated convenience becomes a maintenance obligation. Every small addition pulls the project a little farther from the problem it originally set out to solve.

The product manager might chime in, “A mobile application? I didn’t ask for that, but I guess it’s good. We’ll see. Who’s going to review this with the customer?”

That is usually when the team realizes the Kraken is already on the boat. The original sponsor asked for a hammer and is now watching a Swiss Army knife unfold in real time, with several blades no one asked for and at least one that does not seem to fold back in properly.

AI also makes it dangerously easy to confuse demonstrations with decisions.

The useful response is not to become suspicious of every experiment. Some of the first tentacles are worth keeping. The response is to put the old discipline back where AI made it easy to remove. Keep a written scope. Treat additions as actual decisions rather than prompt side effects. Ask what each new feature does to testing, documentation, support, and the team’s ability to explain the system six months from now. If nobody can answer those questions, the feature is not “done” just because the model produced a convincing draft.

What makes the Scope Creep Kraken a good name is one that teams can use in the moment. Once people can say, “This is another tentacle,” the conversation gets clearer. You are no longer arguing about whether the idea is clever. You are asking whether this is motivated by a requirement or a capability.

AI Is Writing Our Code Faster Than We Can Verify It

Andrew Stellman — Wed, 15 Apr 2026 11:19:15 +0000

This is the fourth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, and look for the next article on April 30 on O’Reilly Radar.

Here’s the dirty secret of the AI coding revolution: most experienced developers still don’t really trust the code the AI writes for us.

If I’m being honest, that’s not actually a particularly well-guarded secret. It feels like every day there’s a new breathless “I don’t have a lick of development experience but I just vibe coded this amazing application” article. And I get it—articles like that get so much engagement because everyone is watching carefully as the drama of AIs getting better and better at writing code unfolds. We’ve had decades of shows and movies, from WarGames to Hackers to Mr. Robot, portraying developers as reclusive geniuses doing mysterious but incredible stuff with computers. The idea that we’ve coded ourselves out of existence is fascinating to people.

The flip side of that pop-culture phenomenon is that when there are problems caused by agentic engineering gone wrong (like the equally popular “I trusted an AI agent and it deleted my entire production database” articles), everyone seems to find out about it. And, unfortunately, that newly emerging trope is much closer to reality. Most of us who do agentic engineering have seen our own AI-generated code go off the rails. That’s why I built and maintain the Quality Playbook, an open-source AI skill that uses quality engineering techniques that go back over fifty years to help developers working in any language verify the quality of their AI-generated code. I was as surprised as anyone to discover that it actually works.

I’ve talked often about how we need a “trust but verify” mindset when using AI to write code. In the past, I’ve mostly focused on the “trust” aspect, finding ways to help developers feel more comfortable adopting AI coding tools and using them for production work. But I’m increasingly convinced that our biggest problem with AI-driven development is that we don’t have a reliable way to check the quality of code from agentic engineering at scale. AI is writing our code faster than we can verify it, and that is one of AI’s biggest problems right now.

A false choice

After I got my first real taste of using AI for development in a professional setting, it felt like I was being asked to make a critical choice: either I had to outsource all of my thinking to the AI and just trust it to build whatever code I needed, or I had to review every single file it generated line by line.

A lot of really good, really experienced senior engineers I’ve talked to feel the same way. A small number of experienced developers fully embrace vibe coding and basically fire off the AI to do what it needs to, depending on a combination of unit tests and solid, decoupled architecture (and a little luck, maybe) to make sure things go well. But more frequently, the senior, experienced engineers I’ve talked to, folks who’ve been developing for a really long time, go the other way. When I ask them if they’re using AI every day, they’ll almost always say something like, “Yeah, I use AI for unit tests and code reviews.” That’s almost always a tell that they don’t trust the AI to build the really important code that’s at the core of the application. They’re using AI for things that won’t cause production bugs if they go wrong.

I think this excerpt from a recent (and excellent) article in Ars Technica, “Cognitive surrender” leads AI users to abandon logical thinking, sums up how many experienced developers feel about working with AI:

When it comes to large language model-powered tools, there are generally two broad categories of users. On one side are those who treat AI as a powerful but sometimes faulty service that needs careful human oversight and review to detect reasoning or factual flaws in responses. On the other side are those who routinely outsource their critical thinking to what they see as an all-knowing machine.

I agree that those are two options for dealing with AI. But I also believe that’s a false choice. “Cognitive surrender,” as the research referenced by the article puts it, is not a good outcome. But neither is reviewing every line of code the AI writes, because that’s so effort-intensive that we may as well just write it all ourselves. (And I can almost hear some of you asking, “What so bad about that?”)

This false choice is what really drives a lot of really good, very experienced senior engineers away from AI-driven development today. We see those two options, and they are both terrible. And that’s why I’m writing this article (and the next few in this Radar series) about quality.

Some shocking numbers about AI coding tools

The Quality Playbook is an open-source skill for AI coding tools like GitHub Copilot, Cursor, Claude Code, and Windsurf. You point it at a codebase, and it generates a complete quality engineering infrastructure for that project: test plans traced to requirements, code review protocols, integration tests, and more. More importantly, it brings back quality engineering practices that much of the industry abandoned decades ago, using AI to do a lot of the quality-related work that used to require a dedicated team.

I built the Quality Playbook as part of an experiment in AI-driven development and agentic engineering, building an open-source project called Octobatch and writing about the process in this ongoing Radar series. The playbook emerged directly from that experiment. The ideas behind it are over fifty years old, and they work.

Along the way, I ran into a shocking statistic.

We already know that many (most?) developers these days use AI coding tools like GitHub Copilot, Claude Code, Gemini, ChatGPT, and Cursor to write production code. But do we trust the code those tools generate? “Trust in these systems has collapsed to just 33%, a sharp decline from over 70% in 2023.”

That quote is from a Gemini Deep Research report I generated while doing research for this article. 70% dropping to 33%—that sounds like a massive collapse, right?

The thing is, when I checked the sources Gemini referenced, the truth wasn’t nearly as clear-cut. That “over 70% in 2023” number came from a Stack Overflow survey measuring how favorably developers view AI tools. The “33%” number came from a Qodo survey asking whether developers trust the accuracy of AI-generated code. Gemini grabbed both numbers, stripped the context, and stitched them into a single decline narrative. No single study ever measured trust dropping from over 70% to 33%. Which means we’ve got an apples-to-oranges comparison, and it might even technically be accurate (sort of?), but it’s not really the headline-grabber that it seemed to be.

So why am I telling you about it?

Because there are two important lessons from that “shocking” stat. The first is that the overall idea rings true, at least for me. Almost all of us have had the experience of generating code with AI faster than we can verify it, and we ship features before we fully review them.

The second is that when Gemini created the report, the AI fabricated the most alarming version of the story from real but unrelated data points. If I’d just cited it without checking the sources, there’s a pretty good chance it would get published, and you might even believe it. That’s ironically self-referential, because it’s literally the trust problem the survey is supposedly measuring. The AI produced something that looked authoritative, felt correct, and was wrong in ways that only careful verification could catch. If you want to understand why over 70% of developers don’t fully trust AI-generated code, you just watched it happen.

One reason many of us don’t trust AI-generated choice is because there’s a growing gap between how fast AI can generate code and how well we can verify that the code actually does what we intended. The usual response to this verification gap is to adopt better testing tools. And there are plenty of them: test stub generators, diff reviewers, spec-first frameworks. These are useful, and they solve real problems. But they generally share a blind spot: they work with what the code does, not with what it’s supposed to do. Luckily, the intent is sitting right there: in the specs, the schemas, the defensive code, the history of the AI chats about the project, even the variable names and filenames. We just need a way to use it.

AI-driven development needs its own quality practices, and the discipline we need already exists. It was just (unfairly) considered too expensive to use… until AI made it cheap.

(Re-)introducing quality engineering

There’s a difference between knowing that code works and knowing that it does what it’s supposed to do. It’s the difference between “does this function return the right value?” and “does this system fulfill its purpose?”—and as it turns out, that’s one of the oldest problems in software engineering. In fact, as I talked about in a previous Radar article, Prompt Engineering Is Requirements Engineering, it was the source of the original “software crisis.”

The software crisis was the term people used across our industry back in the 1960s when they were coming to grips with large software projects around the world that were routinely delivered late, over budget, and delivering software that didn’t do what it was supposed to do. At the 1968 NATO Software Engineering Conference—the conference that introduced the term “software engineering”—some of the top experts in the industry talked about how the crisis was caused by the developers and their stakeholders had trouble understanding the problems they were solving, communicating those needs clearly, and making sure that the systems they delivered actually met their users’ needs. Nearly two decades later, Fred Brooks made the same argument in his pioneering essay, No Silver Bullet: no tool can, on its own, eliminate the inherent difficulty of understanding what needs to be built and communicating that intent clearly. And now that we talk to our AI development tools the same way we talk to our teammates, we’re more susceptible than ever to that underlying problem of communication and shared understanding.

An important part of the industry’s response to the software crisis was quality engineering, a discipline built specifically to close the gap between intent and implementation by defining what “correct” means up front, tracing tests back to requirements, and verifying that the delivered system actually does what it’s supposed to do. For years it was standard practice for software engineering teams to include quality engineering phases in all projects. But few teams today do traditional quality engineering. Understanding why it got left behind by so many of us, more importantly, what it can do for us now, can make a huge difference for agentic engineering and AI-driven development today.

Starting in the 1950s, three thinkers built the intellectual foundation that manufacturing used to become dramatically more reliable.

W. Edwards Deming argued that quality is built into the process, not inspected in after the fact. He taught us that you don’t test your way to a good product; you design the system that produces it.
Joseph Juran defined quality as fitness for use: not just “does it work?” but ”does it do what it’s supposed to do, under real conditions, for the people who actually use it?”
Philip Crosby made the business case: quality is free, because building it in costs less than finding and fixing defects after the fact. By the time I joined my first professional software development team in the 1990s, these ideas were standard practice in our industry.

These ideas revolutionized software quality, and the people who put them into practice were called quality engineers. They built test plans traced to requirements, ran functional testing against specifications, and maintained living documentation that defined what “correct” meant for each part of the system.

So why did all of this disappear from most software teams? (It’s still alive in regulated industries like aerospace, medical devices, and automotive, where traceability is mandated by law, and a few brave holdouts throughout the industry.) It wasn’t because it didn’t work. Quality engineering got cut because it was perceived as expensive. Crosby was right that quality is free: the cost of building it in is far more than made up for by the savings you get from not finding and fixing defects later. But the costs come at the beginning of the project and the savings come at the end. In practice, that means when the team blows a deadline and the manager gets angry and starts looking for something to cut, the testing and QA activities are easy targets because the software already seems to be complete.

On top of the perceived expense, quality engineering required specialists. Building good requirements, designing test plans, and planning and running functional and regression testing are real, technical skills, and most teams simply didn’t have anyone (or, more specifically, the budget for anyone) who could do those jobs.

Quality engineering may have faded from our projects and teams over time, but the industry didn’t just give up on many of its best ideas. Developers are nothing if not resourceful, and we built our own quality practices—three of the most popular are test-driven development, behavior-driven development, agile-style iteration—and these are genuinely good at what they do. TDD keeps code honest by making you write the test before the implementation. BDD was specifically designed to capture requirements in a form that developers, testers, and stakeholders can all read (though in practice, most teams strip away the stakeholder involvement and it devolves into another flavor of integration testing). Agile iteration tightens the feedback loop so you catch problems earlier.

Those newer quality practices are practical and developer-focused, and they’re less expensive to adopt than traditional quality engineering in the short run because they live inside the development cycle. The upside of those practices is that development teams can generally implement them on their own, without asking for permission or requiring experts. The tradeoff, however, is that those practices have limited scope. They verify that the code you’re writing right now works correctly, but they don’t step back and ask whether the system as a whole fulfills its original intent. Quality engineering, on the other hand, establishes the intent of the system before the development cycle even begins, and keeps it up to date and feeds it back to the team as the project progresses. That’s a huge piece of the puzzle that got lost along the way.

Those highly effective quality engineering practices got cut from most software engineering teams because they were viewed as expensive, not because they were wrong. When you’re doing AI-driven development, you’re actually running into exactly the same problem that quality engineering was built to solve. You have a “team”—your AI coding tools—and you need a structured process to make sure that team is building what you actually intend. Quality engineering is such a good fit for AI-driven development because it’s the discipline that was specifically designed to close that gap between what you ask for and what gets built.

What nobody expected is that AI would make it cheap enough in the short run to bring quality engineering back to our projects.

Introducing the Quality Playbook

I’ve long suspected that quality engineering would be a perfect fit for AI-driven development (AIDD), and I finally got a chance to test that hypothesis. As part of my experiment with AIDD and agentic engineering (which I’ve been writing about in The Accidental Orchestrator and the rest of this series), I built the Quality Playbook, a skill for AI tools like Cursor, GitHub Copilot, and Claude Code that lets you bring these highly effective quality practices to any project, using AI to do the work that used to require a dedicated quality engineering team. Like other AI skills and agents, it’s a structured document that plugs into an AI coding agent and teaches it a specific capability. You point it at a codebase, and the AI explores the code, reads whatever specifications and documentation it can find, and generates a complete quality infrastructure tailored to that project. The Quality Playbook is now part of awesome-copilot, a collection of community-contributed agents (and I’ve also opened a pull request to add it to Anthropic’s repository of Claude Code skills).

What does “quality infrastructure” actually mean? Think about what a quality engineering team would build if you hired one. A good quality engineer would start by defining what “correct” means for your project: what the system is supposed to do, grounded in your requirements, your domain, what your users actually need. From there, they’d write tests traced to those requirements, build a code review process that checks whether the code implements what it’s supposed to, design integration tests that verify the whole system works together, and set up an audit process where independent reviewers check the code against its original intent.

That’s what the playbook generates. Developers using AI tools have been rediscovering the value of requirements, and spec-driven development (SDD) has become very popular. You don’t need to be practicing strict spec-driven development to use it. The playbook infers your project’s intent from whatever artifacts are available: chat logs, schemas, README files, code comments, and even defensive code patterns. If you have formal specs, great; if not, the AI pieces together what “correct” means from the evidence it can find.

Once the playbook figures out the intent of the code, it creates quality infrastructure for the project. Specifically, it generates ten deliverables:

Exploration and requirements elicitation (EXPLORATION.md): Before the playbook writes anything, it spends an entire phase reading the code, documentation, specs, and schemas, and writes a structured exploration document that maps the project’s architecture and domain. The most common failure mode in AI-generated quality work is producing generic content that could apply to any project. The exploration phase forces the AI to ground everything in this specific codebase, and serves as an audit trail: if the requirements end up wrong, you can trace the problem back to what the exploration discovered or missed.
Testable requirements (REQUIREMENTS.md): The most important deliverable. Building on the exploration, a five-phase pipeline extracts the actual intent of the project from code, documentation, AI chats, messages, support tickets, and any other project artifacts you can give it. The result is a specification document that a new team member or AI agent can read top-to-bottom and understand the software. Each requirement is tagged with an authority tier and linked to use cases that become the connective tissue tying requirements to integration tests to bug reports.
Quality constitution (QUALITY.md): Defines what “correct” means for your specific project, grounded in your actual domain. Every standard has a rationale explaining why it matters, because without the rationale, a future AI session will argue the standard down.
Spec-traced functional tests: Tests generated from the requirements, not from source code. That difference matters: a test generated from source code verifies that the code does what the code does, while a test traced to a spec verifies that the code does what you intended.
Three-pass code review protocol with bug reports and regression tests: Three mandatory review passes, each using a different lens: structural review with anti-hallucination guardrails, requirement verification (where you catch things the code doesn’t do that it was supposed to), and cross-requirement consistency checking. Every confirmed bug gets a regression test and a patch file.
Consolidated bug report (BUGS.md): Every confirmed bug with full reproduction details, severity calibrated to real-world impact, and a spec basis citing the specific documentation the code violates. Maintainers respond differently to ”your code violates section X.Y of your own spec” than to ”this looks like it might be a bug.”
TDD red/green verification: For each confirmed bug, a regression test runs against unpatched code (must fail), then the fix is applied and the test reruns (must pass). When you tell a maintainer ”here’s a test that fails on your current code and passes with this one-line fix,” that’s qualitatively different from a bug report.
Integration test protocol: A structured test matrix that an AI agent can pick up and execute autonomously, without asking clarifying questions. Every test specifies the exact command, what it proves, and specific pass/fail criteria. Field names and types are read from actual source files, not recalled from memory, as an anti-hallucination mechanism.
Council of Three multi-model spec audit: Three independent AI models audit the codebase against the requirements. The triage uses confidence weighting, not majority vote: findings from all three are near-certain, two are high-confidence, and findings from only one get a verification probe rather than being dismissed. The most valuable findings are often the ones only one model catches.
AGENTS.md bootstrap file: A context file that future AI sessions read first, so they inherit the full quality infrastructure. Without it, every new session starts from zero. With it, the quality constitution, requirements, and review protocols carry forward automatically across every session that touches the codebase.

The third option

I started this article by talking about a false choice: either we surrender our judgment to the AI, or get stuck reviewing every line of code it writes. The reality is much more nuanced, and, in my opinion, a lot more interesting, if we have a trustworthy way to verify that the code we worked with the AI to build actually does what we intended. It’s not a coincidence that this is one of the oldest problems in software engineering, and not surprising that AI can help us with it.

The Quality Playbook leans heavily on classic quality engineering techniques to do that verification. Those techniques work very well, and that gives us the more nuanced option: using AI to help us write our code, and then using it to help us trust what it built.

That’s not a gimmick or a paradox. It works because verification is exactly the kind of structured, specification-driven work that AI is good at. Writing tests traced to requirements, reviewing code against intent, checking that the system does what it’s supposed to do under real conditions. These are the things quality engineers used to do across the whole industry (and still do in the highly regulated parts of it). They’re also things that AI can do well, as long as we tell it what “correct” means.

The experienced engineers I talked about at the beginning of this article, the ones who only use AI for unit tests and code reviews, aren’t wrong to be cautious. They’re right that we can’t just trust whatever output the AI spits out. But limiting AI to just the “safe” parts of our projects keeps us from taking advantage of such an important set of tools. The way out of this quagmire is to build the infrastructure that makes the rest of it trustworthy too. Quality engineering gives us that infrastructure, and AI makes it cheap enough to actually use on all of our projects every day.

In the next few articles, I’ll show you what happened when I pointed the Quality Playbook at real, mature open-source codebases and it started finding real bugs, how the playbook emerged from my AI-driven development experiment, what the quality engineering mindset looks like in practice, and how we can learn important lessons from that experience that apply to all of our projects.

The Quality Playbook is open source and works with GitHub Copilot, Cursor, and Claude Code. It’s also available as part of awesome-copilot. You can try it out today by downloading it into your project and asking the AI to generate the quality playbook. The whole process takes about 10-15 minutes for a typical codebase. I’ll cover more details on running it in future articles in this series.

Disclosure: Aspects of the methodology described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.

Grief and the Nonprofessional Programmer

Mike Loukides — Tue, 14 Apr 2026 11:16:00 +0000

I can’t claim to be a professional software developer—not by a long shot. I occasionally write some Python code to analyze spreadsheets, and I occasionally hack something together on my own, usually related to prime numbers or numerical analysis. But I have to admit that I identify with both of the groups of programmers that Les Orchard identifies in “Grief and the AI Split”: those who just want to make a computer do something and those who grieve losing the satisfaction they get from writing good code.

A lot of the time, I just want to get something done; that’s particularly true when I’m grinding through a spreadsheet with sales data that has a half-million rows. (Yes, compared to databases, that’s nothing.) It’s frustrating to run into some roadblock in pandas that I can’t solve without looking through documentation, tutorials, and several incorrect Stack Overflow answers. But there’s also the programming that I do for fun—not all that often, but occasionally: writing a really big prime number sieve, seeing if I can do a million-point convex hull on my laptop in a reasonable amount of time, things like that. And that’s where the problem comes in. . .if there really is a problem.

The other day, I read a post of Simon Willison’s that included AI-generated animations of the major sorting algorithms. No big deal in itself; I’ve seen animated sorting algorithms before. Simon’s were different only in that they were AI-generated—but that made me want to try vibe coding an animation rather than something static. Graphing the first N terms of a Fourier series has long been one of the first things I try in a new programming language. So I asked Claude Code to generate an interactive web animation of the Fourier series. Claude did just fine. I couldn’t have created the app on my own, at least not as a single-page web app; I’ve always avoided JavaScript, for better or for worse. And that was cool, though, as with Simon’s sorting animations, there are plenty of Fourier animations online.

I then got interested in animations that aren’t so common. I grabbed Algorithms in a Nutshell, started looking through the chapters, and asked Claude to animate a number of things I hadn’t seen, ending with Dijkstra’s algorithm for finding the shortest path through a graph. It had some trouble with a few of the algorithms, though when I asked Claude to generate a plan first and used a second prompt asking it to implement the plan, everything worked.

And it was fun. I made the computer do things I wanted it to do; the thrill of controlling machines is something that sticks with us from our childhoods. The prompts were simple and short—they could have been much longer if I wanted to specify the design of the web page, but Claude’s sense of taste was good enough. I had other work to do while Claude was “thinking,” including attending some meetings, but I could easily have started several instances of Claude Code and had them create simulations in parallel. Doing so wouldn’t have required any fancy orchestration because every simulation was independent of the others. No need for Gas Town.

When I was done, I felt a version of the grief Les Orchard writes about. More specifically: I don’t really understand Dijkstra’s algorithm. I know what it does and have a vague idea of how it works, and I’m sure I could understand it if I read Algorithms in a Nutshell rather than used it as a catalog of things to animate. But now that I had the animation, I realized that I hadn’t gone through the process of understanding the algorithm well enough to write the code. And I cared about that.

I also cared about Fourier transformations: I would never “need” to write that code again. If I decide to learn Rust, will I write a Fourier program, or ask Claude to do it and inspect the output? I already knew the theory behind Fourier transforms—but I realized that an era had ended, and I still don’t know how I feel about that. Indeed, a few months ago, I vibe coded an application that recorded some audio from my laptop’s microphone, did a discrete Fourier transform, and displayed the result. After pasting the code into a file, I took the laptop over to the piano, started the program, played a C, and saw the fundamental and all the harmonics. The era was already in the past; it just took a few months to hit me.

Why does this bother me? My problem isn’t about losing the pleasure of turning ideas into code. I’ve always found coding at least somewhat frustrating, and at times, seriously frustrating. But I’m bothered by the lack of understanding: I was too lazy to look up how Dijkstra works, too lazy to look up (again) how discrete Fourier works. I made the computer do what I wanted, but I lost the understanding of how it did it.

What does it mean to lose the understanding of how the code works? Anything? It’s common to place the transition to AI-assisted coding in the context of the transition from assembly language to higher-level languages, a process that started in the late 1950s. That’s valid, but there’s an important difference. You can certainly program a discrete fast Fourier transform in assembly; that may even be one of the last bastions of assembly programs, since FFTs are extremely useful and often have to run on relatively slow processors. (The “butterfly” algorithm is very fast.) But you can’t learn signal processing by writing assembly any more than you can learn graph theory. When you’re writing in assembler, you have to know what you’re doing in advance. The early programming languages of the 1950s (Fortran, Lisp, Algol, even BASIC) are much better for gradually pushing forward to understanding, to say nothing of our modern languages.

That is the real source of grief, at least for me. I want to understand how things work. And I admit that I’m lazy. Understanding how things work quickly comes in conflict with getting stuff done—especially when staring at a blank screen—and writing Python or Java has a lot to do with how you come to an understanding. I will never need to understand convex hulls or Dijkstra’s algorithm. But thinking more broadly about this industry, I wonder whether we’ll be able to solve the new problems if we delegate understanding the old problems to AI. In the past, I’ve argued that I don’t see AI becoming genuinely creative because creativity isn’t just a recombination of things that already exist. I’ll stick by that, especially in the arts. AI may be a useful tool, but I don’t believe it will become an artist. But anyone involved with the arts also understands that creativity doesn’t come from a blank slate; it also requires an understanding of history, of how problems were solved in the past. And that makes me wonder whether humans—at least in computing—will continue to be creative if we delegate that understanding to AI.

Or does creativity just move up the stack to the next level of abstraction? And is that next level of abstraction all about understanding problems and writing good specifications? Writing a detailed specification is itself a kind of programming. But I don’t think that kind of grief will assuage the grief of the programmer who loves coding—or who may not love coding but loves the understanding that it brings.

Comprehension Debt: The Hidden Cost of AI-Generated Code

Addy Osmani — Mon, 13 Apr 2026 11:11:45 +0000

The following article originally appeared on Addy Osmani’s blog site and is being reposted here with the author’s permission.

Comprehension debt is the hidden cost to human intelligence and memory resulting from excessive reliance on AI and automation. For engineers, it applies most to agentic engineering.

There’s a cost that doesn’t show up in your velocity metrics when teams go deep on AI coding tools. Especially when its tedious to review all the code the AI generates. This cost accumulates steadily, and eventually it has to be paid—with interest. It’s called comprehension debt or cognitive debt.

Comprehension debt is the growing gap between how much code exists in your system and how much of it any human being genuinely understands.

Unlike technical debt, which announces itself through mounting friction—slow builds, tangled dependencies, the creeping dread every time you touch that one module—comprehension debt breeds false confidence. The codebase looks clean. The tests are green. The reckoning arrives quietly, usually at the worst possible moment.

Margaret-Anne Storey describes a student team that hit this wall in week seven: They could no longer make simple changes without breaking something unexpected. The real problem wasn’t messy code. It was that no one on the team could explain why design decisions had been made or how different parts of the system were supposed to work together. The theory of the system had evaporated.

That’s comprehension debt compounding in real time.

I’ve read Hacker News threads that captured engineers genuinely wrestling with the structural version of this problem—not the familiar optimism versus skepticism binary, but a field trying to figure out what rigor actually looks like when the bottleneck has moved.

A recent Anthropic study titled “How AI Impacts Skill Formation” highlighted the potential downsides of over-reliance on AI coding assistants. In a randomized controlled trial with 52 software engineers learning a new library, participants who used AI assistance completed the task in roughly the same time as the control group but scored 17% lower on a follow-up comprehension quiz (50% versus 67%). The largest declines occurred in debugging, with smaller but still significant drops in conceptual understanding and code reading. The researchers emphasize that passive delegation (“just make it work”) impairs skill development far more than active, question-driven use of AI. The full paper is available at arXiv.org.

There is a speed asymmetry problem here

AI generates code far faster than humans can evaluate it. That sounds obvious, but the implications are easy to underestimate.

When a developer on your team writes code, the human review process has always been a bottleneck—but a productive and educational one. Reading their PR forces comprehension. It surfaces hidden assumptions, catches design decisions that conflict with how the system was architected six months ago, and distributes knowledge about what the codebase actually does across the people responsible for maintaining it.

AI-generated code breaks that feedback loop. The volume is too high. The output is syntactically clean, often well-formatted, superficially correct—precisely the signals that historically triggered merge confidence. But surface correctness is not systemic correctness. The codebase looks healthy while comprehension quietly hollows out underneath it.

I read one engineer say that the bottleneck has always been a competent developer understanding the project. AI doesn’t change that constraint. It creates the illusion you’ve escaped it.

And the inversion is sharper than it looks. When code was expensive to produce, senior engineers could review faster than junior engineers could write. AI flips this: A junior engineer can now generate code faster than a senior engineer can critically audit it. The rate-limiting factor that kept review meaningful has been removed. What used to be a quality gate is now a throughput problem.

I love tests, but they aren’t a complete answer

The instinct to lean harder on deterministic verification—unit tests, integration tests, static analysis, linters, formatters—is understandable. I do this a lot in projects heavily leaning on AI coding agents. Automate your way out of the review bottleneck. Let machines check machines.

This helps. It has a hard ceiling.

A test suite capable of covering all observable behavior would, in many cases, be more complex than the code it validates. Complexity you can’t reason about doesn’t provide safety though. And beneath that is a more fundamental problem: You can’t write a test for behavior you haven’t thought to specify.

Nobody writes a test asserting that dragged items shouldn’t turn completely transparent. Of course they didn’t. That possibility never occurred to them. That’s exactly the class of failure that slips through, not because the test suite was poorly written, but because no one thought to look there.

There’s also a specific failure mode worth naming. When an AI changes implementation behavior and updates hundreds of test cases to match the new behavior, the question shifts from “is this code correct?” to “were all those test changes necessary, and do I have enough coverage to catch what I’m not thinking about?” Tests cannot answer that question. Only comprehension can.

The data is starting to back this up. Research suggests that developers using AI for code generation delegation score below 40% on comprehension tests, while developers using AI for conceptual inquiry—asking questions, exploring tradeoffs—score above 65%. The tool doesn’t destroy understanding. How you use it does.

Tests are necessary. They are not sufficient.

Lean on specs, but they’re also not the full story.

A common proposed solution: Write a detailed natural language spec first. Include it in the PR. Review the spec, not the code. Trust that the AI faithfully translated intent into implementation.

This is appealing in the same way Waterfall methodology was once appealing. Rigorously define the problem first, then execute. Clean separation of concerns.

The problem is that translating a spec to working code involves an enormous number of implicit decisions—edge cases, data structures, error handling, performance tradeoffs, interaction patterns—that no spec ever fully captures. Two engineers implementing the same spec will produce systems with many observable behavioral differences. Neither implementation is wrong. They’re just different. And many of those differences will eventually matter to users in ways nobody anticipated.

There’s another possibility with detailed specs worth calling out: A spec detailed enough to fully describe a program is more or less the program, just written in a non-executable language. The organizational cost of writing specs thorough enough to substitute for review may well exceed the productivity gains from using AI to execute them. And you still haven’t reviewed what was actually produced.

The deeper issue is that there is often no correct spec. Requirements emerge through building. Edge cases reveal themselves through use. The assumption that you can fully specify a non-trivial system before building it has been tested repeatedly and found wanting. AI doesn’t change this. It just adds a new layer of implicit decisions made without human deliberation.

Learn from history

Decades of managing software quality across distributed teams with varying context and communication bandwidth has produced real, tested practices. Those don’t evaporate because the team member is now a model.

What changes with AI is cost (dramatically lower), speed (dramatically higher), and interpersonal management overhead (essentially zero). What doesn’t change is the need for someone with a deep system context to maintain a coherent understanding of what the codebase is actually doing and why.

This is the uncomfortable redistribution that comprehension debt forces.

As AI volume goes up, the engineer who truly understands the system becomes more valuable, not less. The ability to look at a diff and immediately know which behaviors are load-bearing. To remember why that architectural decision got made under pressure eight months ago.

To tell the difference between a refactor that’s safe and one that’s quietly shifting something users depend on. That skill becomes the scarce resource the whole system depends on.

There’s a bit of a measurement gap here too

The reason comprehension debt is so dangerous is that nothing in your current measurement system captures it.

Velocity metrics look immaculate. DORA metrics hold steady. PR counts are up. Code coverage is green.

Performance calibration committees see velocity improvements. They cannot see comprehension deficits because no artifact of how organizations measure output captures that dimension. The incentive structure optimizes correctly for what it measures. What it measures no longer captures what matters.

This is what makes comprehension debt more insidious than technical debt. Technical debt is usually a conscious tradeoff—you chose the shortcut, you know roughly where it lives, you can schedule the paydown. Comprehension debt accumulates invisibly, often without anyone making a deliberate decision to let it. It’s the aggregate of hundreds of reviews where the code looked fine and the tests were passing and there was another PR in the queue.

The organizational assumption that reviewed code is understood code no longer holds. Engineers approved code they didn’t fully understand, which now carries implicit endorsement. The liability has been distributed without anyone noticing.

The regulation horizon is closer than it looks

Every industry that moved too fast eventually attracted regulation. Tech has been unusually insulated from that dynamic, partly because software failures are often recoverable, and partly because the industry has moved faster than regulators could follow.

That window is closing. When AI-generated code is running in healthcare systems, financial infrastructure, and government services, “the AI wrote it and we didn’t fully review it” will not hold up in a post-incident report when lives or significant assets are at stake.

Teams building comprehension discipline now—treating genuine understanding, not just passing tests, as non-negotiable—will be better positioned when that reckoning arrives than teams that optimized purely for merge velocity.

What comprehension debt actually demands

The right question for now isn’t “how do we generate more code?” It’s “how do we actually understand more of what we’re shipping?” so we can make sure our users get a consistently high quality experience.

That reframe has practical consequences. It means being ruthlessly explicit about what a change is supposed to do before it’s written. It means treating verification not as an afterthought but as a structural constraint. It means maintaining the system-level mental model that lets you catch AI mistakes at architectural scale rather than line-by-line. And it means being honest about the difference between “the tests passed” and “I understand what this does and why.”

Making code cheap to generate doesn’t make understanding cheap to skip. The comprehension work is the job.

AI handles the translation, but someone still has to understand what was produced, why it was produced that way, and whether those implicit decisions were the right ones—or you’re just deferring a bill that will eventually come due in full.

You will pay for comprehension sooner or later. The debt accrues interest rapidly.

Agents don’t know what good looks like. And that’s exactly the problem.

Luca Mezzalira — Fri, 10 Apr 2026 13:31:27 +0000

Luca Mezzalira, author of Building Micro-Frontends, originally shared the following article on LinkedIn. It’s being republished here with his permission.

Every few years, something arrives that promises to change how we build software. And every few years, the industry splits predictably: One half declares the old rules dead; the other half folds its arms and waits for the hype to pass. Both camps are usually wrong, and both camps are usually loud. What’s rarer, and more useful, is someone standing in the middle of that noise and asking the structural questions: Not “What can this do?” but “What does it mean for how we design systems?”

That’s what Neal Ford and Sam Newman did in their recent fireside chat on agentic AI and software architecture during O’Reilly’s Software Architecture Superstream. It’s a conversation worth pulling apart carefully, because some of what they surface is more uncomfortable than it first appears.

The Dreyfus trap

Neal opens with the Dreyfus Model of Knowledge Acquisition, originally developed for the nursing profession but applicable to any domain. The model maps learning across five stages:

Novice
Advanced beginner
Competent
Proficient
Expert

His claim is that current agentic AI is stuck somewhere between novice and advanced beginner: It can follow recipes, it can even apply recipes from adjacent domains when it gets stuck, but it doesn’t understand why any of those recipes work. This isn’t a minor limitation. It’s structural.

The canonical example Neal gives is beautiful in its simplicity: An agent tasked with making all tests pass encounters a failing unit test. One perfectly valid way to make a failing test pass is to replace its assertion with assert True. That’s not a hack in the agent’s mind. It’s a solution. There’s no ethical framework, no professional judgment, no instinct that says this isn’t what we meant. Sam extends this immediately with something he’d literally seen shared on LinkedIn that week: an agent that had modified the build file to silently ignore failed steps rather than fix them. The build passed. The problem remained. Congratulations all-round.

What’s interesting here is that neither Ford nor Newman are being dismissive of AI capability. The point is more subtle: The creativity that makes these agents genuinely useful, their ability to search solution space in ways humans wouldn’t think to, is inseparable from the same property that makes them dangerous. You can’t fully lobotomize the improvization without destroying the value. This is a design constraint, not a bug to be patched.

And when you zoom out, this is part of a broader signal. When experienced practitioners who’ve spent decades in this industry independently converge on calls for restraint and rigor rather than acceleration, that convergence is worth paying attention to. It’s not pessimism. It’s pattern recognition from people who’ve lived through enough cycles to know what the warning signs look like.

Behavior versus capabilities

One of the most important things Neal says, and I think it gets lost in the overall density of the conversation, is the distinction between behavioral verification and capability verification.

Behavioral verification is what most teams default to: unit tests, functional tests, integration tests. Does the code do what it’s supposed to do according to the spec? This is the natural fit for agentic tooling, because agents are actually getting pretty good at implementing behavior against specs. Give an agent a well-defined interface contract and a clear set of acceptance criteria, and it will produce something that broadly satisfies them. This is real progress.

Capability verification is harder. Much harder. Does the system exhibit the operational qualities it needs to exhibit at scale? Is it properly decoupled? Is the security model sound? What happens at 20,000 requests per second? Does it fail gracefully or catastrophically? These are things that most human developers struggle with too, and agents have been trained on human-generated code, which means they’ve inherited our failure modes as well as our successes.

This brings me to something Birgitta Boeckeler raised at QCon London that I haven’t been able to stop thinking about. The example everyone cites when making the case for AI’s coding capability is that Anthropic built a C compiler from scratch using agents. Impressive. But here’s the thing: C compiler documentation is extraordinarily well-specified and battle-tested over decades, and the test coverage for compiler behavior is some of the most rigorous in the entire software industry. That’s as close to a solved, well-bounded problem as you can get.

Enterprise software is almost never like that. Enterprise software is ambiguous requirements, undocumented assumptions, tacit knowledge living in the heads of people who left three years ago, and test coverage that exists more as aspiration than reality. The gap between “can build a C compiler” and “can reliably modernize a legacy ERP” is not a gap of raw capability. It’s a gap of specification quality and domain legibility. That distinction matters enormously for how we think about where agentic tooling can safely operate.

The current orthodoxy in agentic development is to throw more context at the problem: elaborate context files, architecture decision records, guidelines, rules about what not to do. Ford and Newman are appropriately skeptical. Sam makes the point that there’s now empirical evidence suggesting that as context file size increases, you see degradation in output quality, not improvement. You’re not guiding the agent toward better judgment. You’re just accumulating scar tissue from previous disasters. This isn’t unique to agentic workflows either. Anyone who has worked seriously with code assistants knows that summarization quality degrades as context grows, and that this degradation is only partially controllable. That has a direct impact on decisions made over time; now close your eyes for a moment and imagine doing it across an enterprise software, with many teams across different time zones. Don’t get me wrong, the tools help, but the help is bounded, and that boundary is often closer than we’d like to admit.

The more honest framing, which Neal alludes to, is that we need deterministic guardrails around nondeterministic agents. Not more prompting. Architectural fitness functions, an idea Ford and Rebecca Parsons have been promoting since 2017, feel like they’re finally about to have their moment, precisely because the cost of not having them is now immediately visible.

What should an agent own then?

This is where the conversation gets most interesting, and where I think the field is most confused.

There’s a seductive logic to the microservice as the unit of agentic regeneration. It sounds small. The word micro is in the name. You can imagine handing an agent a service with a defined API contract and saying: implement this, test it, done. The scope feels manageable.

Ford and Newman give this idea fair credit, but they’re also honest about the gap. The microservice level is attractive architecturally because it comes with an implied boundary: a process boundary, a deployment boundary, often a data boundary. You can put fitness functions around it. You can say this service must handle X load, maintain Y error rate, expose Z interface. In theory.

In practice, we barely enforce this stuff ourselves. The agents have learned from a corpus of human-written microservices, which means they’ve learned from the vast majority of microservices that were written without proper decoupling, without real resilience thinking, without any rigorous capacity planning. They don’t have our aspirations. They have our habits.

The deeper problem, which Neal raises and which I think deserves more attention than it gets, is transactional coupling. You can design five beautifully bounded services and still produce an architectural disaster if the workflow that ties them together isn’t thought through. Sagas, event choreography, compensation logic: This is the stuff that breaks real systems, and it’s also the stuff that’s hardest to specify, hardest to test, and hardest for an agent to reason about. We made exactly this mistake in the SOA era. We designed lovely little services and then discovered that the interesting complexity had simply migrated into the integration layer, which nobody owned and nobody tested.

Sam’s line here is worth quoting directly, roughly: “To err is human, but it takes a computer to really screw things up.” I suspect we’re going to produce some genuinely legendary transaction management disasters before the field develops the muscle memory to avoid them.

The sociotechnical gap nobody is talking about

There’s a dimension to this conversation that Ford and Newman gesture toward but that I think deserves much more direct examination: the question of what happens to the humans on the other side of this generated software.

It’s not completely accurate to say that all agentic work is happening on greenfield projects. There are tools already in production helping teams migrate legacy ERPs, modernize old codebases, and tackle the modernization challenge that has defeated conventional approaches for years. That’s real, and it matters.

But the challenge in those cases isn’t merely the code. It’s whether the sociotechnical system, the teams, the processes, the engineering culture, the organizational structures built around the existing software are ready to inherit what gets built. And here’s the thing: Even if agents combined with deterministic guardrails could produce a well-structured microservice architecture or a clean modular monolith in a fraction of the time it would take a human team, that architectural output doesn’t automatically come with organizational readiness. The system can arrive before the people are prepared to own it.

One of the underappreciated functions of iterative migration, the incremental strangler fig approach, the slow decomposition of a monolith over 18 months, is not primarily risk reduction, though it does that too. It’s learning. It’s the process by which a team internalizes a new way of working, makes mistakes in a bounded context, recovers, and builds the judgment that lets them operate confidently in the new world. Compress that journey too aggressively and you can end up with architecture whose operational complexity exceeds the organizational capacity to manage it. That gap tends to be expensive.

At QCon London, I asked Patrick Debois, after a talk covering best practices for AI-assisted development, whether applying all of those practices consistently would make him comfortable working on enterprise software with real complexity. His answer was: It depends. That felt like the honest answer. The tooling is improving. Whether the humans around it are keeping pace is a separate question, and one the industry is not spending nearly enough time on.

Existing systems

Ford and Newman close with a subject that almost never gets covered in these conversations: the vast, unglamorous majority of software that already exists and that our society depends on in ways that are easy to underestimate.

Most of the discourse around agentic AI and software development is implicitly greenfield. It assumes you’re starting fresh, that you get to design your architecture sensibly from the beginning, that you have clean APIs and tidy service boundaries. The reality is that most valuable software in the world was written before any of this existed, runs on platforms and languages that aren’t the natural habitat of modern AI tooling, and contains decades of accumulated decisions that nobody fully understands anymore.

Sam is working on a book about this: how to adapt existing architectures to enable AI-driven functionality in ways that are actually safe. He makes the interesting point that existing systems, despite their reputation, sometimes give you a head start. A well-structured relational schema carries implicit meaning about data ownership and referential integrity that an agent can actually reason from. There’s structure there, if you know how to read it.

The general lesson, which he states without much drama, is that you can’t just expose an existing system through an MCP server and call it done. The interface is not the architecture. The risks around security, data exposure, and vendor dependency don’t go away because you’ve wrapped something in a new protocol.

This matters more than it might seem, because the software that runs our financial systems, our healthcare infrastructure, our logistics and supply chains, is not greenfield and never will be. If we get the modernization of those systems wrong, the consequences are not abstract. They are social. The instinct to index heavily on what these tools can do in ideal conditions, on well-specified problems with good documentation and thorough test coverage, is understandable. But it’s exactly the wrong instinct when the systems in question are the ones our lives depend on. The architectural mindset that has served us well through previous paradigm shifts, the one that starts with trade-offs rather than capabilities, that asks what we are giving up rather than just what we are gaining, is not optional here. It’s the minimum requirement for doing this responsibly.

What I take away from this

Three things, mostly.

The first is that introducing deterministic guardrails into nondeterministic systems is not optional. It’s imperative. We are still figuring out exactly where and how, but the framing needs to shift: The goal is control over outcomes, not just oversight of output. There’s a difference. Output is what the agent generates. Outcome is whether the system it generates actually behaves correctly under production conditions, stays within architectural boundaries, and remains operable by the humans responsible for it. Fitness functions, capability tests, boundary definitions: the boring infrastructure that connects generated code to the real constraints of the world it runs in. We’ve had the tools to build this for years.

The second is that the people saying this is the future and the people saying this is just another hype cycle are both probably wrong in interesting ways. Ford and Newman are careful to say they don’t know what good looks like yet. Neither do I. But we have better prior art to draw on than the discourse usually acknowledges. The principles that made microservices work, when they worked, real decoupling, explicit contracts, operational ownership, apply here too. The principles that made microservices fail, leaky abstractions, distributed transactions handled badly, complexity migrating into integration layers, will cause exactly the same failures, just faster and at larger scale.

The third is something I took away from QCon London this year, and I think it might be the most important of the three. Across two days of talks, including sessions that took diametrically opposite approaches to integrating AI into the software development lifecycle, one thing became clear: We are all beginners. Not in the dismissive sense but in the most literal application of the Dreyfus model. Nobody, regardless of experience, has figured out the right way to fit these tools inside a sociotechnical system. The recipes are still being written. The war stories that will eventually become the prior art are still happening to us right now.

What got us here, collectively, was sharing what we saw, what worked, what failed, and why. That’s how the field moved from SOA disasters to microservices best practices. That’s how we built a shared vocabulary around fitness functions and evolutionary architecture. The same process has to happen again, and it will, but only if people with real experience are honest about the uncertainty rather than performing confidence they don’t have. The speed, ultimately, is both the opportunity and the danger. The technology is moving faster than the organizations, the teams, and the professional instincts that need to absorb it. The best response to that isn’t to pretend otherwise. It’s to keep comparing notes.

If this resonated, the full fireside chat between Neal Ford and Sam Newman is worth watching in its entirety. They cover more ground than I’ve had space to react to here. And if you’d like to learn more from Neal, Sam, and Luca, check out their most recent O’Reilly books: Building Resilient Distributed Systems, Architecture as Code, and Building Micro-frontends, second edition.