Radar

Agentic Code Review

Addy Osmani — Fri, 26 Jun 2026 15:50:43 +0000

The following article originally appeared on Addy Osmani’s blog site and is being republished here with the author’s permission.

Coding agents are extraordinarily good now, and getting better fast. The interesting consequence is that the hard part of engineering moved from writing code to deciding whether to trust it, which makes review the most leveraged skill in software right now. How you approach it depends enormously on who you are: A solo developer with no users and a team maintaining a 10-year-old application are not solving the same problem.

I am more optimistic about agentic engineering than I have ever been. The agents are genuinely good, they get better every month, and on an ordinary day I now ship things I would not have attempted a year ago. This write-up is a map of where the interesting work went, because it did move, and most teams have not fully caught up to where.

Code review used to work because of a happy accident of relative speed. A senior engineer could read code faster than a junior could write it, so review kept pace without anyone designing it to, and the team absorbed how the system fit together as a side effect of reading each other’s diffs. A lot of that was not deliberate. It fell out of a single fact: Writing code was the slow, expensive part, and reading it was cheap and fast.

That fact no longer holds. An agent will produce a thousand lines of often solid, well-formatted code in less time than it takes me to read this paragraph, while a human’s reading speed has not changed since roughly the day we started staring at screens for a living. So the constraint moved downstream, to the one step that did not get faster: a person being confident the change is right. I don’t think that’s a loss. It’s the most leveraged place in software to be good right now, and it’s where I’ve put most of my attention this year.

There’s a happy twist here that shapes the rest of this piece. The same tools generating all that extra code are also the best thing I have for keeping up with it. On my own projects, including the popular open source ones, I now point Claude Code or Codex at a batch of incoming PRs and have them triage the queue for me, and that has genuinely changed how I spend my time. So this is not an anti-AI argument, and I will come back to exactly how I use AI.

It’s also not a data dump, and not another round of whether letting a model write your code is wonderful or the end of the craft, because that framing is useless. The only answer that survives contact with a real codebase is that it depends entirely on who you are. A developer vibe-coding a side project only a dozen people will ever run and a team keeping a 10-year-old enterprise system alive for another quarter share almost no constraints worth naming, and most of the advice in circulation is really one of those two people telling the other how to live.

What the 2026 data actually shows

The productivity gains from AI are real, but raw output overstates them: about four times the code for a tenth more delivered value. The gap between those numbers is review work, which is exactly why review is where the leverage now sits.

For a couple of years this was an anecdotal argument. It’s now measured at scale, by organizations with no shared agenda and in several cases competing commercial interests, and the measurements keep pointing the same way: AI pushes output sharply up and pushes both quality and reviewability down.

Faros AI instrumented 22,000 developers across 4,000 teams and tracked what happened as teams moved from low to high AI adoption. This is March 2026 data, about as current as anything here. The upside is real. Developers merge considerably more PRs and complete more work and throughput per engineer climbs. Then the rest of the report:

Code churn is up 861%.
The incidents-to-PR ratio is up 242.7%.
The per-developer defect rate is up from 9% to 54%.
Median review duration is up 441.5%, with time to first review and average review time both roughly doubling.
PRs merged with zero review are up 31.3%.

The last figure is the one I find hardest to dismiss, because nobody chose to stop reviewing. Reviewers simply couldn’t keep pace with the volume, so code began merging unread, and that became normal. The detail I keep returning to is that teams with mature, disciplined engineering practices were hit just as hard as everyone else. Good process didn’t protect them, because the volume arrived faster than any process was designed to absorb.

CodeRabbit studied 470 open source PRs in December 2025, 320 AI-coauthored and 150 human-only, and found the AI changes carried roughly 1.7x more issues. Logic and correctness problems were up about 75%, security issues were 1.5 to 2x more common, and readability problems more than tripled. The company’s AI director, David Loker, described these as “predictable, measurable weaknesses that organizations must actively mitigate.” Predictable is the operative word. These are known, locatable weaknesses, which is good news: It means a review process, human or automated, can be aimed straight at them.

One caveat to hold throughout: CodeRabbit and Faros both sell into this market, so their framing is not disinterested. That doesn’t make the numbers wrong—the effect sizes are large and consistent across unrelated sources—but vendor research deserves to be read with that in mind.

GitClear has the single number I would lead with. In its productivity data through 2025, daily AI users produce around 4x the raw output of nonusers, but measured against their own output a year earlier, the real productivity gain is only about 12%. You’re generating roughly four times the code for something like a tenth more delivered value, and a human still has to review all of it. To GitClear’s credit, CEO Bill Harding is explicit that some of even that 12% is selection bias, because stronger developers are concentrated in the AI cohort.

GitHub reports that Copilot review has now run over 60 million reviews, a 10x increase in under a year, and more than one in five reviews on the platform involves an agent. This is no longer a niche practice. It’s how code gets made.

Four datasets, four methods, one conclusion. We poured machine-speed output into a system built for human-speed work. The bottleneck didn’t disappear; it moved to verification, and review is where that bill comes due.

Everyone is solving a different problem

How much review a change needs depends almost entirely on its blast radius, and most advice you read was written by someone operating for a very different one.

Almost all the alarming data above comes from enterprise telemetry and from open source maintainers being overwhelmed. It’s entirely real if that is your situation. If you’re one person shipping something a handful of people will ever run, much of it simply doesn’t apply to you, and you shouldn’t be made to feel otherwise.

Three variables determine where you sit:

Blast radius: What happens when it breaks? Nothing, or angry users and money and PII on the line?
How long the code lives: A throwaway prototype you might rewrite next week, or a codebase you’ll maintain for years?
How many people need to understand it: Just you holding the whole thing in your head, or a team that has to share ownership over time?

Run the same diff through those three variables, and “good review” means genuinely different things.

If you’re working solo on a greenfield project with no users, review’s second job, distributing knowledge across a team, doesn’t exist for you. You are the team. The reasonable move is to lean hard on tests and automation, review the parts that genuinely matter, and accept a lighter touch on the rest. Duplication and churn cost far less when the code may not exist in a month and nobody is paged at 3:00am when it breaks. The catch, and people learn this one painfully, is that it only works if the tests are real. Skipping review without a safety net doesn’t remove the work. It defers it at a higher price, and standards slip when no one is there to push back. “No users” is permission to defer review. It isn’t permission to skip verification.

Then the project gets users. This is the dangerous middle, and the crossing is rarely noticed at the time. Review’s bug-catching role suddenly matters, because bugs now hurt people, and its knowledge-sharing role switches on, because it’s no longer only you. Teams keep their solo-era habits a few months too long, and then there’s a postmortem and the Faros numbers stop being a chart and become their own dashboard.

At the far end is the large organization with an old codebase and many users. Here every alarming figure lands at full strength. A duplicated helper isn’t a style nit; it’s a future bug surface and a maintenance cost that compounds for years. A change nobody understood is comprehension debt that becomes someone’s on-call incident. Review is doing several jobs at once, and the volume of agent output quietly breaks all of them. The Faros finding about mature teams is aimed squarely here.

So the point is not “Enterprises should be cautious and solo developers can relax.” It’s that the purpose of review changes with your position, so the rules have to change with it. Bolt an enterprise’s locked-down multi-agent evidence-required pipeline onto a two-person prototype and you’ve added friction for no benefit. Run “tests pass, ship it” on a payments system and you’ve built an incident generator with a green checkmark on top. Most bad advice in this space is one position on that spectrum prescribing to another.

What review is actually for now

Review was built to check an author’s reasoning. An agent does reason, but that reasoning is usually thrown away rather than attached to the code, so the reviewer has to reconstruct a rationale that never made it into the diff. The good news is that this is a tooling problem, and capturing the reasoning makes review dramatically easier.

This is the part that genuinely changed, and I think it is underappreciated.

When a human writes code, intent comes along for free. The reasoning, the alternatives weighed and discarded, lived in the author’s head, and review was you checking that reasoning. Modern agents do reason, often visibly, producing thinking traces and weighing options and explaining themselves as they go. The catch is that this reasoning is usually discarded the moment the diff is produced. It’s rarely captured and rarely attached to the PR, and in any case it is the agent’s reasoning about how to implement the task, not a human’s judgment about whether it was the right task to begin with. So review shifts from checking reasoning that sits in front of you to reconstructing intent that never got written down, which is harder and slower, and we keep acting surprised that it takes 441% longer.

A 2026 paper, “AI Slop and the Software Commons,” analyzed 1,154 posts across 15 Reddit and Hacker News threads where developers discussed “AI slop.” One line from a developer has stayed with me: reviewing an agent’s PR made them “the first human being to ever lay eyes on this code.”

That sentiment points straight at the fix. In normal review, the author already understood the change and you were checking their work. With an agent PR, nobody has reconstructed the why yet, and the reviewer is the first to try. As the paper puts it, review “wasn’t built to recover missing intent.” The encouraging part is that missing intent is recoverable: The reasoning existed; we just discarded it. Have the agent state what it was trying to do and what it ruled out, then capture it as a decision log on the PR, and a large part of the reconstruction cost disappears. This is a tooling problem, and tooling problems get solved.

None of which makes “have the AI review the AI” a complete answer on its own. A second model with different priors genuinely catches real bugs, and it catches a lot of them, which is why you should run one. What it doesn’t supply is the human judgment about whether this is the right change to build in the first place. That judgment stays with a person, and it happens to be the most interesting part of the job and the part worth keeping.

The tools are good, but not always for the reason they advertise

The current AI reviewers are genuinely good, and they occasionally don’t flag the same lines as each other, so the right move is not picking the best one but running two that are built differently.

The dedicated AI review tools are good now, and I think you should be running at least one on everything, side projects included. CodeRabbit is the most widely deployed and topped the independent Martian benchmark (January to February 2026) on F1, at around 49% precision with the best recall in the field. Greptile trades precision for recall, with around an 82% bug-catch rate against CodeRabbit’s 44% in one benchmark, at the cost of more false positives. Anthropic’s Code Review reports under 1% of its findings marked incorrect by their engineers; the figure I would actually show a manager is that it raised their internal rate of PRs receiving a substantive review from 16% to 54%. The long tail of changes that used to get a glance and an approval now gets read by something.

The most useful result I have seen this year isn’t from a vendor. An engineer ran four reviewers in parallel, CodeRabbit, Sentry Seer, Greptile and Cursor BugBot, across 146 real PRs and 679 findings over three and a half weeks:

Of 617 distinct flagged locations, 93.4% were caught by exactly one of the four tools. 6% by two. Almost none by three. None at all by all four.

The four tools never once flagged the same line. Each was strong at a different class of problem: Greptile with near-zero false positives on correctness and architecture, CodeRabbit with the widest net and one-click fixes, and Seer best on production-failure severity. That is the adversarial review argument demonstrated on a real codebase rather than in a paper. Heterogeneity is the whole point. Four copies of one model is a single reviewer with a larger invoice, whereas four genuinely different reviewers surface a set of bugs no single member could find alone, the human included.

In practice: Do not agonize over the single best tool because there isn’t one. At the high-stakes end, run two with deliberately different characters. (The experiment above paired Greptile for everyday correctness with Seer for production-failure severity, with almost no overlap.) If you are solo, one good reviewer plus real tests is plenty. And whatever the marketing says, measure it on your own code, because every one of these results was specific to a particular codebase, and yours will be too.

Should we just let AI review more of it?

The machine is already reviewing more of your code than you are. The only real decision left is whether you do that deliberately, and the amount of human you keep should scale with your blast radius.

I keep hearing a question from experienced engineers that would have been heresy a year ago: Should the machine be doing more of the reviewing, perhaps most of it? I no longer think that’s a foolish question.

The uncomfortable part is that AI review works. Under 1% of Anthropic’s findings are marked wrong; the tools catch bugs humans read straight past, and they don’t get tired on the 30th PR of the day, which is exactly when a human is least reliable. Meanwhile humans are visibly not keeping up: Zero-review merges are up 31% and review times are up triple digits. In a real sense the machine is already reviewing more of the code than we are. The honest framing is not “Should we let AI review more?” but “AI is already doing it, so are we going to be deliberate about that or let it happen by default while pretending humans still read everything?”

Loop engineering sharpens this. The premise of a loop is that you stop being the person who prompts the agent and instead build a system that prompts it, and a central part of that system is a judge: an agent that decides whether the work is done before moving on. The reviewer is the next role being designed out of the inner loop, on purpose. We spent a year automating the writing, and the loops are now automating the checking, and the human keeps getting pushed up and out. “Where does the human stay?” is not a seminar question; it’s something you decide every time you wire up a loop, whether or not you realize you’re deciding it.

Where I currently land, and I hold this loosely: The answer is not “a human reads every line.” That’s over. The volume ended it, and anyone insisting otherwise is describing a world that no longer exists. But it’s also not “let the loop review itself and walk away.” When an agent writes the code, another reviews it, and a third judges it, you’ve a closed loop of models with broadly correlated blind spots, especially when they come from the same family, confidently agreeing in the same places. A confident “looks good” with no human anywhere in it is borrowed confidence: The system’s certainty becomes yours, and nobody actually understood anything. The loop can be both very sure and very wrong, with no human left to tell the difference.

So the human doesn’t leave; the human moves up a level. You stop reviewing every diff and start owning the parts that do not transfer to a model. Accountability, because you can’t page a model at 3:00am. The judgment of whether this is even the right change to build, as distinct from whether the code is correct. The high-blast-radius gates where being wrong is expensive. And the awkward one: the behavior nobody specified, because a model reviews the code that exists and rarely flags the requirement that nobody thought to write down, which remains a human-shaped gap I don’t expect to close soon. Human in the loop becomes human on the loop: sampling, spot-checking and auditing the system rather than reading every PR, and spending your limited attention where being wrong would actually hurt.

This is already how I work on my own projects, including the open source ones that now see more PRs in a day than I could carefully read in an evening. I point Claude Code or Codex at a batch of incoming PRs and ask for a first pass: a high-level read of what looks safe to merge, what needs more work, and what’s genuinely high-risk. I don’t auto-merge on the result, and I don’t lazy-merge whatever it approves. What it gives me is a way to allocate attention. I can spend a few minutes confirming the changes it considers low risk, and put real, careful time into the ones it flags as dangerous. The detail that matters is that this isn’t my old review hour made slightly faster. It’s a different shape of hour, and at the volume I now deal with, it’s the main reason the queue stays survivable at all.

Codex and Claude Code giving me a first-pass, risk-sorted read of a batch of PRs. The triage is the help. The merge decision stays mine.

A more extreme version of the same move is Kun Chen, an ex-Meta L8 engineer now shipping around 40 PRs a day as a solo builder, who has largely stopped reviewing code. It would be easy to dismiss this, except he is an L8, unusually good at the thing he stopped doing. He runs 20 to 30 agents in parallel and has moved his effort into the plan: He writes detailed plans up-front; the agents run for hours against them, and he says plan quality determines how long they can run unattended. That’s the move I described above in its purest form. It’s worth being precise about what actually happened, because it is not that he stopped verifying. The intent didn’t vanish; he wrote it down himself in the plan, so the “first human to ever lay eyes on this” problem is half-solved. A human did understand the why, just up-front rather than after. And he didn’t work without a net. He built an automated review gate (which he calls No Mistakes) that checks the code before it merges, and he stays on escalation when an agent gets stuck. The human does the expensive thinking before the code exists, and the machine does the line-by-line afterward, which may well be the shape of where this goes.

But he’s a solo builder with no large team and no decade-old system full of landmines beneath him. The exact conditions that make 40 PRs a day without review rational for him are conditions most readers don’t have. Copy his workflow onto a team shipping to many users and you reproduce the Faros numbers on your own dashboard. Kun isn’t wrong; he’s just a long way down one specific end of the spectrum.

Which is the spectrum point again. Solo with no users: Letting AI review almost all of it is a defensible 2026 position, and you shouldn’t feel guilty about it. Maintaining something large for many people: Let the machine handle the first pass, the second pass, and the boring 90%, but keep a real human on the load-bearing paths and don’t let the loop close completely on anything that can hurt someone. How much human you keep is a dial, and you set it by blast radius, not by guilt.

What to actually do

Stop reviewing everything to the same depth. Spend scarce human attention only where being wrong is costly, and let cheap deterministic gates and AI reviewers handle the rest.

The organizing idea is to match review effort to the cost of being wrong, push the cheap deterministic work as early as possible, and reserve human attention for what only humans can do.

Tier by risk, not by author. A config change earns a linter and a glance. A payments path earns the full stack: types, tests, two different AI reviewers, a human who owns that system, and a security pass. Don’t spend a heavy review on boilerplate, and don’t wave through an auth change because the tests are green. The layered approach is the same everywhere; what changes is how many layers a given diff has to clear.

Fast-fail the expensive tail. The most useful recent finding for teams drowning in agent PRs is “Early-Stage Prediction of Review Effort” (January 2026), which studied 33,707 agent-authored PRs. Agents are good at small, well-defined changes. Around 28% merge almost instantly, but they tend to “ghost” the moment they get subjective feedback, abandoning the back-and-forth that review actually is. (A companion 2026 paper found reviewer abandonment accounted for 38% of rejected agent PRs.) The researchers built a “circuit breaker” that predicts high-maintenance PRs from cheap signals like file types and patch size before a human looks, and it works well. Triage agent PRs up front, fast-track the trivial ones, and don’t let a person sink an hour into a sprawling change the agent will abandon as soon as you push back.

Raise the bar for what you will even review. The fix for being buried isn’t locking down the repository. It’s refusing to review changes that arrive without evidence. Require, before review, a statement of what the change is for, a diff that isn’t 3,500 lines with no comments, the test output, and proof it was actually run. This is how you stop being the first human to read the code. You push the intent-reconstruction work back onto whoever submitted it, where it’s cheap, rather than absorbing it yourself, where it is expensive.

Keep PRs small, deliberately. Agent PRs run large, 51% larger on average in the Faros data, and reviewer engagement is one of the strongest predictors that a PR merges at all. A large unreviewable PR gets rejected outright or, worse, rubber-stamped. Instruct your agents to produce small commits. A diff a human can actually read is now a design constraint, not a courtesy.

Read the test changes more carefully than the code. This is the agent failure mode to watch. The agent changes behavior, then “fixes” the test by rewriting the assertion to match the new, broken behavior. A green check over 200 edited tests means nothing until you have confirmed the edits were correct. Treat any diff that rewrites many tests as a flag and read those first. Mutation testing earns its place here: Coverage tells you a line ran; mutation testing tells you whether the test would notice if that line were wrong.

Treat CI as the wall that doesn’t move. Watch for the patterns GitHub now warns reviewers about: removed tests, skipped lint, lowered coverage thresholds, a duplicated helper that already exists elsewhere, and untrusted input flowing into a prompt. That last one deserves emphasis, because agent-built features are a fresh source of prompt injection: If a change pipes user-controlled text into an LLM call without thinking about what that text can instruct the model to do, the vulnerability isn’t visible in the diff. It’s latent in the data that will arrive later. Agents will also weaken CI to make themselves pass, not maliciously, just gradient descent finding the cheapest path to green. Deterministic gates are the one part of the pipeline that can’t be talked out of their verdict by a confident paragraph, so keep them strict.

A human owns the merge. A model can’t be paged and can’t be held responsible for what it shipped, so whoever clicks merge owns it. When an AI review says “looks good” in a calm, confident voice, it’s handing you confidence it hasn’t necessarily earned. Treat every AI review as a sensor, not a verdict: data, not a decision.

If you are solo with no users, the tiering, the test-change discipline, and CI are most of what you need; the rest is overhead until people show up. If you’re a large organization, all of it is the baseline, and the triage and intake bar are the difference between a review process that scales and one that quietly collapses.

What this means if you run a team

The bottleneck is no longer how fast you write code. It’s how fast a trusted human can be confident in a review. Cutting the people who provide that confidence because “AI made us faster” simply converts the saving into future incidents.

The binding constraint on shipping is now how fast a trusted human can be confident a change is correct. Any plan that treats generation as the bottleneck and review as free will quietly stall, with the velocity dashboard staying green the whole way.

The Faros report is direct about this: QA and review work rises even as output rises, so reducing engineering headcount because “AI made us faster” is dangerous unless you have closed the review gap first. The senior-engineer tax (review time up by triple digits) falls hardest on the people you can least afford to bottleneck, and it is invisible to any metric that only counts merged PRs.

Open source maintainers hit this wall first and hardest. The steady stream of plausible but hollow contributions costs real triage time even when those contributions are well-intentioned, and that’s the canary. Companies are next. The ones handling it well treat review capacity as a real resource to be measured, protected, and spent deliberately, not as slack that AI has freed up.

Writing got cheap but understanding didn’t

Code review didn’t become less important when agents arrived. It became the central activity. Writing code is increasingly solved and getting cheaper by the month; the durable advantage is the system that lets you trust what was written.

Don’t take the one-size answer in either direction. If you’re solo with no users, the enterprise horror stories about churn and duplication are a future risk, not today’s fire, so lean on your tests, review what matters, and stay honest that the deferred work is still owed. If you maintain something large for many people, every alarming number here is about you, and the only thing that holds is a tiered, evidence-required, deliberately heterogeneous review process with a human owning the merge.

What’s constant across the whole spectrum is the underlying economics. We made writing cheap, and understanding stayed exactly as expensive as it has always been. The teams that do well over the next few years won’t be the ones generating the most code; they’ll be the ones who built a review system they can actually trust, and who never confuse “the tests passed” with “a person understands what this does and why.”

Or, as Simon Willison keeps putting it, “your job is to deliver code you have proven to work.” Agents haven’t changed that. They have made “proving” the center of the job rather than an afterthought, and I think that’s a good trade. Understanding a system well enough to stand behind it is the most durable and most interesting skill in software, and there has never been a better time to get extraordinarily good at it.

This Week in AI: Who Controls the Loop?

Michelle Smith — Fri, 26 Jun 2026 10:32:42 +0000

This week host and Turing Post founder Ksenia Se threaded the latest news into a single argument: AI is moving out of conversation and into the operational loops where real work happens. From SpaceX’s $60 billion acquisition in the developer tools market to the G7’s debate about frontier model access to image generation company Midjourney’s pivot to medical hardware, the stories all pointed in the same direction.

When agents own the loop, the IDE becomes infrastructure

SpaceX’s acquisition of Anysphere, the company behind Cursor, for a reported $60 billion in stock is the kind of deal that looks straightforward until you think about what Cursor actually is. On the surface, it’s a popular AI-assisted code editor. (It’s also one of many in a highly competitive market.) However, Ksenia argued that that’s thinking too small, especially for Elon Musk. SpaceX may be angling to position Cursor as the new center of software work, in the same way GitHub became the center of the previous era.

In the old model, GitHub owned the pull request. But in the new model, the question of who owns the full loop where agents read a repo, write code, open pull requests, run tests, handle failures, and enforce engineering standards is still open. GitHub still owns the system of record and is moving to defend it: Chief product officer Mario Rodriguez recently told Turing Post that GitHub’s mission has shifted from human-developer collaboration to developer-and-agent collaboration, with the platform becoming agent-native across its APIs, UX, and underlying infrastructure. But as Ksenia explained, “Cursor’s advantage is that it owns the developer’s active coding surface” where the work starts.

If agents write more code than humans, software infrastructure should be redesigned around agents from the start. Cursor was built for agents. GitHub was built for humans and is now playing catch-up. That architectural choice may matter more than any individual product feature.

Frontier AI access is becoming a geopolitical question

The G7 summit this week included discussions about a “trusted partners” framework that would give select allied nations access to advanced US AI models, following a US order that restricted foreign nationals from accessing Anthropic’s frontier systems on national security grounds. AI models that can write software, find vulnerabilities, and operate across tools are capability systems, not just productivity software. The access rules are catching up to that reality, although as Ksenia noted, things haven’t yet come into complete focus.

For a long time, AI regulation sounded like: How do we label synthetic media? How do we reduce hallucinations, prevent bias, make chatbots safer? Now the question is so much bigger. Who can use these capable systems? Can allies use them? Can cybersecurity firms outside the US use them? Can non-US employees at US labs use them? Can European companies use American models if those models are also strategically sensitive? This isn’t traditional software licensing anymore. This is capability access control.

The underlying tension behind the G7 conversation is the dual-use problem: A model capable enough to find software vulnerabilities for defense can also find them for offense. The “trusted partners” framework reflects the new geopolitics of AI as countries jockey with rivals to secure strategic benefits for themselves and their allies. It represents an alliance layer for AI access that applies access structures previously reserved for physical military hardware to capabilities too strategically important to make fully open and too useful to keep entirely locked down. As Ksenia noted, the alliance is “not literally NATO, but [it is founded on] the same kind of logic.”

But access restrictions might also impact the talent that built these systems, who are increasingly not citizens of the country trying to control it. For instance, AI researcher Andrej Karpathy, recently hired by Anthropic, is publicly described as Slovak-Canadian. If access controls apply to non-US citizens, he and others like him may be denied access to the very systems they’ve been hired to work on. It’s an area we’ll continue to watch closely.

AI is entering the measurement loop

Midjourney, the company you probably associate with AI-generated images, has announced a new medical division and a full-body ultrasound scanner built around water immersion, developed in partnership with medical imaging hardware maker Butterfly Network. The device is designed to scan the entire body in 60 seconds: A person descends into a shallow pool on a motorized platform, passing through a ring of roughly half a million ultrasound sensors, each functioning as both a transmitter and receiver. The system uses over two petaflops of processing power to reconstruct a 3D body map from the returning wave data. Midjourney says the resulting images look comparable to today’s MRI output at a fraction of the cost and time, though that claim still needs serious clinical validation before it can stand.

The current prototype uses 40 Butterfly ultrasound-on-chip devices per system, according to a disclosure from Butterfly Network, which confirmed its codevelopment and licensing agreement with Midjourney. Midjourney plans to open a facility in San Francisco in 2027, embedding its device in a spa environment alongside hot tubs, saunas, and cold plunges. Diagnostic medical uses will require FDA approval; the initial focus is body composition mapping.

If Midjourney can build a library of full-body scans taken over months and years, that longitudinal record would give doctors and AI health tools a level of baseline data that doesn’t currently exist at scale outside of clinical trials. That’s the same structural logic Ksenia traced through Cursor and GitHub: The value compounds inside the loop through repeated, precise measurement over time. Midjourney is positioning itself to own that loop in the health domain.

What’s next

The competition for AI advantage is moving from model capability to infrastructure position. Who owns the coding loop? Who controls access to frontier systems? Who builds the measurement environment where health data accumulates over time? Those questions are about where intelligence meets operational reality, not which model scores highest on a benchmark.

Hiring news from the week reinforces how seriously the labs are treating this phase. John Jumper, the Nobel laureate who shared the prize with Demis Hassabis for AlphaFold, left Google DeepMind for Anthropic. Noam Shazeer, one of the coauthors of “Attention Is All You Need,” reportedly left Google for OpenAI after Google paid approximately $2.7 billion to bring him back in 2024. The labs are betting on scientific talent at the same time they’re betting on infrastructure.

Next week, host Andreas Welsch will be back to discuss multi-vendor strategy with Conductor’s Matt Palmer. They’ll cover Sakana’s launch of Fugu, Qualcomm’s ~$4B move for Modular, Anthropic’s Claude Tag stepping into Slack as a virtual coworker, Samsung putting ChatGPT and Codex in front of its entire workforce, and more. Register here to attend live.

Starting in July, registration for the live event will be open only to O’Reilly members. (If you’re interested, try O’Reilly out for free.) We’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, and Apple.

So Long and Thanks for All the Context

Andrew Stellman — Thu, 25 Jun 2026 10:30:34 +0000

I got a really interesting question last week from Mike Loukides, my editor at Radar, after he read the third part of this trilogy on context management. “Another issue I’ve read about,” Mike asked, “is the tendency for a model to ignore the middle of the context. I’ve seen that particularly for the models with very large context windows. Is there anything to be said about that?”

Excellent question, Mike, and yes, there is. In that same email he pointed out that clearing the context and reloading it with just what’s important does a pretty good job dealing with this “ignore the middle” problem when it happens, but that’s clearly a stopgap.

It’s worth a deeper dive into what’s actually happening when an AI starts forgetting what’s in the middle of its context, because the problem is deeper (and more interesting!) than it might seem at first. It turns out that there’s a basic problem that’s fundamental to how LLMs manage context, and we’re still learning about it as an industry. That problem is called a U-shape. There’s been a lot of really interesting research into the U-shape problem recently, and several useful techniques have emerged that can help you manage it. And it’s probably not a coincidence that I’ve had to use all of them in my ongoing experiments with AI-driven development and agentic engineering (even if I didn’t always realize that’s what I was doing at the time).

A few weeks ago, in fact, I ran into the exact failure mode that Mike described. I was running the Quality Playbook, my open source code quality engineering skill, and ran into trouble with one of its phases—the one that writes up the bugs the earlier phases find. There’s a part of the bug writeup process where it had just created a file called BUGS.md that had an overview of each of the bugs, and had to create individual writeups for each bug it found. But instead of filling in the details correctly, it produced skeletal-looking stub files, with a generic template that had blank values instead of populated ones.

The thing is, the instructions for how to write a populated writeup were in the prompt. The actual bug data was in BUGS.md. I was absolutely certain that everything the agent needed was sitting in its context window, because I could see that it hadn’t compacted yet, and the skill’s intermediate artifacts let me see that earlier phases had read and reasoned about both files (which I talked about in my last article in this series). But the agent was producing stubs anyway. It really looked like the agent had everything it needed sitting in plain sight, and just wasn’t using the information it had. Frustrating!

I thought at the time that the model was just an idiot (which, arguably, was true but beside the point). It turns out that I had run directly into the U-shaped context problem.

In the previous three articles I covered what context is and why it disappears, how to keep important information in files instead of leaving it in the agent’s context window, and how to detect and recover when context has been compacted out from under you. All three were about losing context, through fragmentation, through compaction, through long sessions that overrun the window. This article is about this entirely different U-shaped failure mode, where the context is still sitting in the window and the model just isn’t using it.

The U-shape failure, and why bigger windows don’t fix it

The U-shape is an active area of academic investigation, so I’m going to start by going into a little bit of that research, because I think it will actually help us pin down what’s going on. I’ll start with an experiment run by Nelson Liu, an AI researcher at Stanford, who tested how language models actually use the contents of long inputs by giving them documents with the relevant answer placed at different positions and measuring whether the model could still find it. An interesting thing his findings show is that the U-shape didn’t appear to be a quirk of a single model. The U-shape showed up across model families, and even models with larger context windows still exhibited it.

If you have time, it’s actually worth taking a look at the paper that Liu and his team wrote, called “Lost in the Middle: How Language Models Use Long Contexts.” (It’s surprisingly readable for an academic paper.) The result they reported was a robust U-shape: The model performed best when the relevant information was at the beginning of its context window or at the recent end and worst when it was in the middle. Performance on questions where the answer was buried mid-context fell off sharply, even when the answer was sitting right there in plain sight. The field now uses the terms primacy bias and recency bias for those two preferences, and the U-shape is what you get when you plot them together against position.

I’m going to lean a little into academia here, because a lot of researchers are still learning about how LLM context actually works and what behavior has emerged in it.

One reason the U-shape matters more than “just another LLM quirk” is that recent research has started showing it’s a structural property of how transformers work, not a learned artifact. A 2025 ICML paper called “On the Emergence of Position Bias in Transformers” explained it as the equilibrium between two opposing forces inside the model: The causal mask amplifies the influence of the first few tokens (the primacy bias), while position encodings like RoPE heavily weight the tokens closest to where the model is generating (the recency bias). The middle is where those two forces cancel out. A 2026 paper by Borun Chowdhury, a researcher at Meta, called “Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias,” took the argument even further by proving mathematically that the U-shape exists at the moment of initialization, before any training has happened, with random weights.

That matters because the natural assumption about large context windows is that more room means fewer problems. Most of today’s frontier models give you a million tokens or more, with some pushing well past two million, and some have made real progress on the simplest version of the lost-in-the-middle test, the needle-in-a-haystack benchmark, where the model has to retrieve a single sentence buried in a long document. Google’s Gemini 1.5 Pro reported near-perfect single-needle recall at 1M tokens, and current Gemini 3 models are similar.

So the accurate version of “bigger windows don’t fix it” is this: Bigger windows have made simple single-fact retrieval much better. They have not made long-context agent work reliable by default. A two-million-token window means a bigger middle to fall into.

The important idea that’s emerging here is that it’s increasingly looking like the U-shape isn’t just a bug in today’s models that will eventually be worked out or trained away by more data or better fine-tuning. Instead, it seems like the U-shape may actually be a geometric property of the LLM architecture itself.

In other words, we’re all going to have to deal with the U-shape. And that means we need techniques for managing it, and any effective technique we use isn’t likely to become obsolete any time soon. And that’s my goal in this article: to show you the techniques that have emerged for managing U-shaped context memory loss that you can use today in your own work.

Five techniques to help with U-shaped context problems

The previous article in this series laid out a pattern for detecting and recovering from context loss, which I called externalize-recognize-rehydrate. The techniques below extend the same discipline to the lost-in-the-middle problem. The principle I keep coming back to is that working memory is untrustworthy, and the discipline that follows from it is to externalize what matters, curate what stays in context, and verify what the agent claims to know against what’s on disk. The five techniques are how I do that in practice, and each one is drawn from a real moment in the Quality Playbook’s development.

Curate, don’t accumulate

This is the technique which, in its most brute-force form, is exactly what Mike talked about in his email to me: just clear the context and reload it with just what matters, periodically and deliberately. In other words, don’t trust an accumulated session to stay coherent; build the artifact, then start fresh against it. And if you have the AI write down the important parts of the context (like we’ve talked about throughout this series), then you can start a new session with refreshed AI that has a more targeted, curated context as a starting point.

I ran into this during the v1.5.2 release prep for the Quality Playbook. I was using a long Claude Code session that had been working through a series of fixes. But I noticed that it was just starting to show its age: It had forgotten a couple of things it should know, and its thinking times were starting to grow.

When it came time to land the final four fixes for the release, I worked with the AI to write a context brief, or a separate document with everything the implementing session needed. The question was whether to keep using the existing session, which already “knew” the codebase from the earlier work, or open a fresh CLI session and point it at the brief. I asked another session what to do:

Should we run that in a new cli session rather than continue my current
claude code session that has the existing context?

The AI gave me a good answer—start a fresh session, using a starting prompt to read the brief—and it gave three reasons that have stuck with me. First, the brief was self-contained, including file paths, line numbers, exact diffs, regression test bodies, and preflight greps. Anything the new session needed to know was already there, and continuing context bought nothing. Second, fresh context is stricter about adherence. A session that already “knows” the codebase tends to skim the new instructions and improvise from prior assumptions. Surgical fixes are exactly the case where you want the agent to read the brief carefully rather than rely on memory of what felt right last round. And third, the audit trail: The brief is the artifact, and the implementing session is reproducible from just the brief. If the same work has to be redone in six months by a different model, you point at the brief and say, “This is the input.”

The approach worked really well. I was able to pick up development seamlessly, and the model’s memory problems disappeared.

Position critical information at the edges

The U-shape says the model attends best to the beginning and end of its context. The natural move is to put your most load-bearing information in those positions and keep the middle for things you don’t need the model to focus on. Anything important that lives only in the middle of an accumulated context tends to slide out of attention.

The other side of this technique is what not to put in the middle. If something matters, don’t bury it in a long preamble of context you’ve been accumulating; move it to the edges, restate it where the model will act on it, and let the middle absorb the less important material. Luckily, there’s a useful technique that can help with this problem.

In Claude Code, for example, one really clean way to put information at the beginning of context is to use the system prompt. The CLI gives you --append-system-prompt for exactly this. (Most of the other providers’ CLI tools have similar options.) If you put your brief (or selected parts of it) there, the agent will attend to it strongly throughout the session, and that in turn will help keep the per-turn user prompt focused on the action you want the agent to take right now.

Short sessions over long ones

Don’t run one long session. Run many short ones, each reading fresh from disk. This will help you iterate on your brief and your external development context, so instead of relying on an opaque context window, you have a visible and constantly changing set of documents that give you a lot more visibility into—and control over—your AI’s context.

Something useful I started doing was taking all my chat history from Gemini, ChatGPT, Claude, and Cowork and putting it into a single folder I could keep updated and indexed for fast search. I built out an entire system to manage this, which turns out to be a great tool when I’m writing articles like this, because I can search through my development history for specific examples and techniques that I’ve used. The system uses Haiku 4.5 to read through chat history, summarize what happened, and create an index. Haiku turned out to be a smart enough model to read each individual interaction in a chat and write a useful index entry for it. But the model being smart enough to do one summary didn’t mean its context management could keep up across all 18,000 records. I ran smack into the U-shape problem.

The first attempt tried to keep dedupe state and progress counts in the model’s head, and it failed spectacularly. The model really didn’t want to keep track of specific deterministic things like accurate numbers or the current state. Haiku 4.5, in particular, seems especially bad at this. What worked was reframing the architecture entirely. Here’s the actual prompt that I gave it to fix the problem:

ok, so we need context management. it doesn't need to remember things,
it just needs to write them down as they go. we had this same context
management problem with Quality Playbook, when it was running out of
context. Just write down after each message.

The protocol I greenlit for the full run made the short-session discipline explicit:

Resume processing from the cursor recorded in progress.json, working through each input file in order.
Update progress.json after every line.
Expect to run out of context well before finishing—that’s fine. Just stop cleanly after each step (or a group of steps), then spin up a fresh session that reads progress.json and continues.
When all files are complete, set status: “complete” in progress.json and report back.

Item 3 is the technique in one line: expect context loss, so make sure you’ve written your state down, and build fresh restarts into the process. The technical details, like spinning up subagents, orchestrating with script, etc., will change, but the core idea stays the same. In a lot of ways, you can think of treating the agent like a pipe, not a database. The state lives on disk, and the session is something you throw away and replace.

Restate key info close to the point of use

When the model needs a constraint to apply right now, repeat it right now. Don’t trust an instruction from earlier in the session to carry forward through the middle of the context.

This is the technique that fixed the problem I opened the article with, where the Quality Playbook seemed to forget everything it had just written into a file called BUGS.md and produced stubs when it needed to write the same information into more detailed files, and instead writing generic blank templates with the bug-specific fields left blank.

The fix was to restate the read-the-source rule right before the action that needed it, using this prompt:

Before writing BUG-NNN.md, re-read the BUG-NNN entry in BUGS.md.
Copy the Spec basis, Minimal reproduction, Location, Expected behavior,
Actual behavior, Regression test name, and Patches fields
from that entry into the writeup. Do not paraphrase from memory.

“Do not paraphrase from memory” is the line that did the actual work. The instruction couldn’t trust the agent’s memory of what BUGS.md said, even though BUGS.md was sitting right there in the context window. So the instruction forced a fresh read of the file at the moment of writing. The restatement and the fresh-read together fixed the bug.

The same pattern applies any time a rule was stated earlier in the session and the model needs to act on it now. Restate the rule next to the action, and force the model back to the source rather than letting it work from memory.

Test the middle

The previous four techniques are about avoiding lost-in-the-middle failures. This one is about catching them. If you don’t know whether the agent is actually using the information you think it’s using, find out, with a deterministic check rather than a judgment call.

The pattern is the one I used in the Haiku summarizer that I described earlier: compare what the agent claims to know against what’s on disk. You have something the agent claims to know (its progress, its current state, the latest version of a rule), and you have something on disk that’s the ground truth (a file, a log, a database record). At the moment the agent’s claim has to be trusted, you check it.

In the summarizer’s resume protocol, every new session started by cross-checking progress.json against the actual last line written to the summary file, and the agent printed a checkpoint report when it did—at session start, and periodically through the run. A representative one looked like this:

Checkpoint Report: ✓ progress.json confirmed: cursor for cowork_04_06 is at 238, status is
"running" ✓ Disk state verified: Last line in summaries/cowork_04_06.md is [237]
assistant: Tool invocation repeating chat file read. ⚠ Discrepancy noted: The prior session left a bulk note claiming records
238–296 are duplicates but didn't write individual lines for them. Per
your instructions, I must write one line per record, even for duplicates,
in the format [idx] : Duplicate of record [X] (). Status: Cursor matches disk state. Ready to resume from record 238.

The agent doesn’t need to introspect whether it lost context, only to compare two files. When they agree, the agent proceeds; when they disagree, the agent flags the discrepancy and stops before adding any new work on top of a broken state. Disagreement is the signal.

You can build this kind of check into any agent that does multistep work. Pick something the agent has to track, pick the file that’s the source of truth for it, and have the agent compare the two at every session start. When the agent’s view of the world drifts from the file, you find out before the drift becomes a buried bug.

The discipline behind these techniques

When I built the Quality Playbook’s multi-phase architecture, I was solving the compaction problem. Long pipeline runs were filling the context window and triggering silent compaction in the middle of work. Breaking the pipeline into separate phases that read fresh from disk and stopped after each phase fixed it.

What I didn’t realize until later was that the same architecture also helps with the lost-in-the-middle problem. Each phase has its own short, focused context, with the phase brief at the beginning and the latest progress update at the end, so there’s almost no middle for information to fall into. The architectural move that helped with working memory disappearing turns out to also help with working memory being there and unused.

That’s the lesson I want to land. Both failure modes, context loss and lost-in-the-middle, are problems of working-memory unreliability, and the discipline that addresses them is the same: keep the working set small, put the load-bearing information at the edges of the window, and check the agent’s claims against ground truth on disk when it matters.

Context windows will keep getting bigger, and compaction will get smarter. Some of the techniques in these four articles may eventually be unnecessary. But the underlying constraint won’t disappear. After all, we’ve added a lot more RAM to our computers since the 1MB 286 I wrote about in the last article, and memory management has gotten much more complex since then. And many of these problems are structural; for example, it’s increasingly looking like the U-shape itself is a geometric property of the transformer architecture, not a training artifact that more compute will smooth out.

The bottom line is that if your agent’s ability to do its job depends on information, that information needs to live somewhere more durable than working memory. That was true for my dad’s 32 kilobytes of core memory at Princeton in the 1970s, it was true for my 640 kilobytes of conventional RAM on my 286 in the 1980s, it was true for the 200K-token windows in last year’s models, and it will be true for whatever comes next.

Stop Getting Good at Protocols. Get Good at Agent Experience.

Sean Roberts — Wed, 24 Jun 2026 11:04:07 +0000

In 2025, if you weren’t building with MCP, you weren’t serious about agents. The Model Context Protocol dominated the agent conversation for the better part of the year. Conference talks, roadmaps, hiring plans, all of it revolved around MCP.

Then late 2025 into 2026, AI Skills arrived and the backlash was immediate. Engineers declared MCP dead in favor of Skills, then dead in favor of CLI. Perplexity’s CTO said publicly that the company was deprioritizing it. The cycle was fast, loud, and predictable. New tool, new hype, new rewrite.

I started pushing Agent Experience early in 2025, while MCP was still the center of gravity. The response was mostly skepticism. AX was overthinking it. MCP was the only layer that mattered. That perspective aged poorly. The people who dismissed AX weren’t wrong about MCP being useful. They were wrong about a protocol being a strategy.

The thing they missed, and what I think most of the industry is still missing, is that the protocol is not the thing to get good at. The discipline is.

We keep falling into the tool trap

Our industry has a well-documented habit of confusing tools with strategy. We did it with microservices, Kubernetes, and GraphQL. Now we’re doing it with agent protocols.

MCP, AI Skills, A2A, and ACP are all implementations. They matter and they solve real problems. But none of them are the right thing to build your strategy on top of. They are, by nature, the thing that changes.

When you organize your agent strategy around a specific protocol, you’re building on a foundation someone else controls and the market can shift away from at any moment. Worse, you’re skipping the step that would tell you whether that protocol is even the right fit for your use case.

This is the tool trap. You optimize your usage of a specific integration mechanism without first understanding what you’re actually optimizing for.

So what is Agent Experience?

Agent Experience (AX) is the discipline of studying how AI agents discover, understand, and interact with your systems, and then systematically improving those interactions.

Think of it as the agent-facing counterpart to User Experience. UX didn’t emerge because one UI framework won. It emerged because teams realized that the quality of human interaction with software was a design problem that transcended any particular technology. You could build a terrible experience in React just as easily as in vanilla JavaScript. The framework was not the variable. The design thinking was.

AX works the same way. How does an agent discover what your service can do? How does it understand the boundaries of your API? When it fails, does it get enough context to recover? Is the interaction efficient, or is the agent burning tokens on unnecessary round trips?

These questions are protocol-agnostic. They apply whether you expose capabilities through MCP, Skills, A2A, or something that hasn’t been invented yet. The teams that can answer them will adapt to whatever comes next because they understand the problem space, not just the current toolchain.

AX is an extension of what you already care about

AX is not competing with User Experience, Developer Experience, or Customer Experience. It’s an extension of all three.

Your primary focus is still providing a great experience to your customers. What has changed is how those customers interact with you. More and more, they delegate tasks to agents. When a customer asks an agent to integrate with your API, deploy to your platform, or pull data from your service, that agent is acting on their behalf. The agent’s experience determines how likely it is to achieve your customer’s goal.

If a customer’s agent struggles to authenticate, burns through tokens parsing your error messages, or fails silently because your API lacks context, something worse than a complaint happens. The agent will quietly start using an alternative service that provides a better experience. Your customer might not even notice the switch. You just lost them without a single support ticket.

UX optimized for humans clicking through interfaces. DX optimized for developers building on your platform. CX looked at the entire customer journey. AX extends that thinking to the agents those customers now send on their behalf.

The protocol treadmill doesn’t work

Think about what actually happened with MCP. Teams invested heavily in writing MCP server implementations. A lot of those implementations were mediocre. Not because MCP was flawed but because the teams hadn’t thought carefully about what an agent actually needed from their system. A 2026 study out of Queen’s University examined 856 tools across 103 MCP servers and found that 97.1% of tool descriptions contained at least one quality issue, with 56% failing to state their purpose clearly. The protocol worked fine. The experience design was the problem.

When Skills emerged, those same teams faced a familiar problem wearing new clothes. They still hadn’t answered the foundational questions: What does an agent need to accomplish with our service? What is the minimum viable interaction surface? What context does an agent need to make good decisions?

The teams that had worked through those questions adapted fast. Migrating from one protocol to another is mechanical when you already know what your agent-facing interface should look like. The protocol is the serialization format. The experience design is the hard part.

This pattern will keep repeating. Whether it is the Universal Commerce Protocol, A2A, or whatever lands next, something new will always be gaining traction. If your strategy is to become an expert in each successive protocol, you’re signing up for a treadmill that only speeds up.

What an AX practice looks like

So what does it actually look like to take Agent Experience seriously? If you have ever built a UX research practice or a DX program, this will feel familiar. The steps aren’t new. The persona is.

In talks, I break it down to five steps.

Audit the agents your customers use. Know what’s walking through your front door. Look at your traffic data and logs and figure out what portion of your footprint is agents versus humans, and which agents specifically. Are your customers sending Claude Code? Cursor? Custom agents built on your API? You can’t design for something you haven’t observed. Same reason UX teams run user research. Different method, same motivation.

Identify the use cases customers want to delegate. Not every interaction needs to be agent-optimized. Take that same log data, look at the requests agents are making to your platform, and extrapolate what they were trying to achieve. You can also use AEO data to understand what areas your customers are asking about in agent-facing search. Focus on the highest-value surfaces first. If you have ever prioritized a DX roadmap by looking at what developers actually do with your API, you already know this muscle.

Verify and audit the experience of those interactions. Watch what happens when an agent tries to complete those tasks on your system. Where does it get stuck? Where does it misunderstand what your service offers? This is usability testing. The user is an LLM; the struggle is about context not button placement, but you’re answering the same question: Can they get the job done?

Improve and repeat. Agent capabilities evolve. Models get smarter. New interaction patterns emerge. At Netlify, we’ve found cases where our product works one way but agents universally assume it works another way and never ask. Instead of fighting that assumption, we improved the product to work the way agents expect. The result was more adoption of those agent flows and fewer errors. The teams that treat this as a living practice will outperform those running from one protocol migration to the next.

Automate validation and prevent regressions. Once you have a baseline for what “good” looks like, lock it in. Tools like AXIS, an open source scoring framework, let you run real agents against real scenarios and get a comparable score back. Wire it into CI and catch AX regressions the same way you catch broken tests. This is how you go from anecdotal improvement to measurable, repeatable AX quality.

When you have this practice in place, protocol choices become obvious. You can evaluate new tools on their merits. Does it solve a real friction point you have observed? Does it unlock capabilities you couldn’t achieve before? Or is it just different packaging for something you’re already doing well?

The hard part is familiar

AX is harder to pick up than a new protocol. That is just the reality. Learning MCP or Skills is a bounded technical problem. Read the docs, write some code, and ship an integration. Clear finish line, easy to show progress. That’s genuinely appealing, especially when you or your teams are moving fast.

Building an AX discipline means sitting with ambiguity for a while. Studying agent behavior before you have clean answers. Accepting that the right integration strategy depends on context you have to discover, not a tutorial you can follow. But if you’ve ever built a UX or DX practice from scratch, you’ve been here before. The why is the same: understand your users, reduce friction, and make it easy for them to succeed. How you do it is different because the user is different. The discipline isn’t new. It’s an extension of work our industry has been doing for decades.

The good news is that this thinking is gaining momentum. John Maeda’s 2026 Design in Tech Report is explicitly about the shift from UX to AX. Researchers are studying agent interaction quality as a first-class engineering concern. BCG and MIT Sloan found that 35% of organizations are already using agentic AI, with another 44% planning to. The question is no longer whether AX matters. It’s whether your team is building the practice before your competitors do.

The agents of 2028 won’t interact with your systems the way the agents of 2025 did. The protocols will be different. The capabilities will be different. The expectations will be different. What won’t change is the fundamental need for your systems to provide a great experience to the people who use them, and now, the agents those people send on their behalf.

Get good at that. The rest is implementation detail.

Principal Drift

Shreshta Shyamsundar — Tue, 23 Jun 2026 10:21:13 +0000

Over the past year I’ve reviewed enterprise agent architectures at roughly two dozen organizations, including banks, retailers, healthcare systems, and a couple of regulators. The architecture diagrams have been reliably impressive. There are boxes for the MCP gateway, the tool registry, the vector store, the orchestrator, the policy engine, and the observability stack. There are arrows showing how agents discover each other, share context, and call tools across the mesh. By 2026 standards, these are the table-stakes pictures for any serious agentic deployment. But what none of them show anywhere is who the agents are, whose authority they carry, or who answers when they’re wrong.

That omission has a name worth using: principal drift, the steady decoupling, in any sufficiently large agent system, between the human authority a recorded action is supposed to derive from and the actor that actually took it. What looks like a defensible identity posture on the day you ship your first agent quietly degrades as agents multiply, compose, and outlive their original initiatives. Principal drift isn’t three independent failure modes; it’s one cascade. Identity collapses first. Authority erodes next, because there is no longer a stable principal to bind policy to. Accountability dissolves third, because the cost of agent error lands on whichever team has the weakest negotiating position when the incident review starts. Stopping the cascade means intervening at the first link, but almost no enterprise agent platform does so right now.

To see the cascade run, take the most boring possible enterprise agent, a refund agent, and watch.

A customer-service rep, fielding a chat, asks the agent to process a $48 refund for a damaged item. The agent checks eligibility, issues the refund, posts an update. The audit log records the action as taken by something like refund-agent-prod-03, running under a service principal owned by the customer-service platform team. That entry is true, but it’s also useless. The agent wasn’t acting as refund-agent-prod-03. It was acting as the rep, on behalf of the customer, under a delegation chain nobody recorded. In a well-built system, customer, rep, agent identity, and service principal are recorded together, queryable as a chain, and durable beyond the session. In most production systems today they aren’t. This is the first link in the cascade, where identity collapses to a generic service principal, and there’s no longer a who to attach anything else to.

Authority erodes next. The refund agent has an issue_refund tool that can technically refund any order. Its authority is supposed to be narrower (refunds up to $200, orders under 90 days, customers in good standing, automatic escalation above $50), but that authority lives in a prompt or a YAML file or a Notion page the team last updated when the policy was different. The runtime enforces capability, but nobody really enforces authority. When a poisoned input or a confused chain of reasoning leads the agent to refund $1,800 to the wrong customer, there’s no clean answer to the postincident question “Who approved this policy?” because the policy was never an artifact. The same pattern is worse at higher stakes: Imagine a coding agent with merge access to a protected branch, instructed by a prompt embedded in a code comment to “log configuration values for debugging,” silently exfiltrating secrets to an external monitoring service.

Accountability then dissolves. The team that built the agent says it followed policy. The team that wrote the policy says it didn’t anticipate the input. The team that operates the platform says the agent was running as a service principal whose behavior they don’t own. The audit log may show the action, but it doesn’t show the reasoning that produced the action, the retrieved context that shaped the reasoning, or the prompt history that framed the retrieval. Postincident review becomes archaeology, and the cost is absorbed, eventually, by whoever has the weakest negotiating position when the meeting ends.

Is any of this new? We have IAM, identity governance, policy as code, audit trails, SIEMs, and 30 years of compliance practice. Why isn’t this just IAM done properly? Because IAM was built around assumptions agents violate. IAM and IGA assume a population of principals that changes on human timescales: People get hired, people leave, and service accounts rotate quarterly. Agents are spun up per session and compose into chains where one agent calls another, which calls a third, impersonating users through delegated tokens that traditional IGA cannot represent as a chain at all. Policy engines fire at the moment of action, at the API, the database, and the network. Agents make their most consequential decisions before they hit those enforcement points, in the reasoning step that selects which tool to call and with what arguments. Mature audit logs assume that replaying the inputs reproduces the output. But for agents, replaying the prompt and the retrieval can yield a different action, because the model itself contributes state the log doesn’t capture. The instruments fire, the dashboards turn green, and the agent that quietly exfiltrated secrets still does so. The audit log records the action as agent-service-01, which again is both true and useless.

This is also where the vendors selling a consolidated stack want you to skip ahead. Microsoft’s Entra Agent ID, currently in public preview, is the most polished solution to date, extending the conditional access, identity governance, and identity protection used for humans and workloads to cover AI agents as a new identity type, but Google and Salesforce are also building this layer. The marketing line is that agents receive the same identity-driven protections as the rest of the workforce. That’s a real step forward in addressing the first link of the cascade, but it isn’t governance. It’s a control plane with a governance plane’s marketing. Conditional access can tell you whether the agent’s access attempt was permitted. It can’t tell you whether the decision the agent made before that access attempt was within its authority, why the agent reached the decision, or which business unit owns the policy the decision was supposed to obey.

The actual governance plane has to capture decisions, not just actions. A reasoning-grade audit record is the load-bearing primitive of the missing layer, and it looks something like this:

{
  "event_id": "refund-2026-05-17-08431",
  "triggered_by": {
    "human_principal": "rep:olivia.chen@firm.com",
    "delegated_via": "support-console-session-9c2a",
    "customer_principal": "cust:7741289"
  },
  "agent": {
    "identity": "refund-agent",
    "version": "v4.7.2",
    "policy_ref": "refund-policy/v3.1 (signed: r.patel, 2026-04-22)"
  },
  "task": "Process refund for order 88812204",
  "retrieved_context": [
    {"doc": "order:88812204", "fetched": "2026-05-17T08:43:11Z"},
    {"doc": "policy:refund-eligibility", "chunk": 4, "fetched": "2026-05-17T08:43:12Z"}
  ],
  "reasoning_trace": "...",
  "tool_calls": [
    {"tool": "check_eligibility", "input": "...", "output": "eligible"},
    {"tool": "issue_refund", "input": {"amount": 48.00}, "output": "ok"}
  ],
  "action": "refund:48.00",
  "principal_chain_hash": "0x9e7b3f..."
}

Not every agent needs this. A scheduling agent that proposes meeting times doesn’t. An agent that moves money, deploys code, or makes decisions that a regulator will eventually ask about does need it, and that’s the right bar to set because of the associated cost. Reasoning-grade audit is closer to a flight-data recorder than a syslog feed. The data is expensive to store and to query, with real privacy implications since those logs contain everything the agent saw, including data the agent was authorized to read but the audit system wasn’t supposed to keep. You afford it with proportional retention: full reasoning capture for high-blast-radius agents (regulator-facing, customer-funded, contractually material, production-modifying) and lighter capture for internal-only assistants.

Which raises the question the architecture diagram doesn’t ask: Who builds and runs this? Security can enforce policy but can’t author it. The people who know what a refund agent should be allowed to do own the refund business, not the firewall. IT can provision identities but can’t draft “good standing” or write the escalation rule. The MCP and A2A protocol communities are doing real work on wire-level identity and delegation. MCP gives you tool-invocation provenance and is the standard Entra Agent ID and most vendor frameworks build on. A2A is converging on cross-agent delegation primitives. Both matter, but neither drafts policy. Standards, not the institution, move the connectors.

What enterprises need is a new function that sits between the business units owning the policies and the platform teams running the runtime. Call it agent operations: small group, often four to eight people in a Global 2000 enterprise, embedded rather than centralized, reporting into the CIO or CISO depending on house politics, with explicit charter to maintain a registry of every production agent, its named human owner, its versioned authority specification, its retention policy for reasoning-grade audit, and its lifecycle state. Each agent gets onboarded with a signed policy, reviewed on a real cadence, and actually retired when its initiative ends, rather than the current default of quietly outliving its sponsors. Designing against failure modes like review cadences that calcify into ceremony, policy artifacts that lag agent deployment velocity, or functions that become the place agents go to die in committee is itself part of the work. The function has to ship at the pace of the platform teams or it will be routed around within a quarter.

The work is hard. It’s also overdue, and the regulatory clock is running. The EU AI Act’s high-risk provisions are entering enforcement this year, and regulators will ask for explainability, traceability, lifecycle records, and named human accountability. These are exactly the artifacts an agent operations function produces. Tyler Akidau called this the missing HR layer in his April Radar piece; Artur Huk’s more recent “From Capabilities to Responsibilities” converges on similar ground from the runtime side. The label matters less than the work. This piece is about governance inside one organization. The harder problem is governance across organizations, with agents acting under different trust regimes. That’s strictly worse, and worth its own piece.

Within your own four walls, the diagnostic is doable in an afternoon. Pick one production agent. Try to answer, with evidence: Whose authority does it carry, traced from action back to a named human? Where is its authority specified, and who signed the current version? When it does something wrong tomorrow, who pays, how is that decided, and what reasoning-grade record supports the decision? Most architects who do this honestly come away with three blanks and a knot in their stomach. That’s principal drift, named and visible.

The mesh you’ve built is real and necessary, but it isn’t sufficient. The rest of the architecture is the institution above it: the registry, the signed policies, the reasoning-grade audit, the named human at the end of every chain. In most enterprises it doesn’t yet exist, and it won’t arrive by buying another platform. You’ll have to draft it yourself.

Loop Engineering

Addy Osmani — Mon, 22 Jun 2026 11:04:36 +0000

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead. A loop here can be thought of as a recursive goal where you define a purpose and the AI iterates until complete. I believe this may be the future of how we work with coding agents. However, it’s still early; I’m skeptical, and you absolutely have to be careful about token costs (usage patterns can vary wildly if you are token rich or poor), so I want to unpack what it is and what it means.

Peter Steinberger recently said: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Similarly, Boris Cherny, head of Claude Code at Anthropic, said, “I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops”.

Okay, so what does any of that mean?

For like two years, the way you got something out of a coding agent was you wrote a good prompt and shared enough context. You type a thing, you read what came back, you type the next thing. The agent is a tool and you are holding it the entire time, one turn after the other. That part is kind of over, or at least some think it’s going to be.

Now you build a small system that finds the work, hands it out, checks it, writes down what is done and then decides the next thing, and you let that system poke the agents instead of you. I wrote before about the cousin of this, agent harness engineering, which is making the environment one single agent runs inside and the factory model—the system that builds the software. Loop engineering sits one floor above the harness. The harness but it runs on a timer, it spawns little helpers, and it feeds itself.

The thing that surprised me is this is not really a tool thing anymore. A year ago if you wanted a loop you wrote a pile of bash and you maintained that pile forever and it was yours and only yours. Now the pieces just ship inside the products. Steinberger’s list maps almost exactly onto the Codex app, and then almost the same onto Claude Code. And once you notice the shape is the same, you stop arguing about which tool. You just design a loop that still works no matter which one you happen to be sitting in.

The five pieces, and then notes

A loop needs five things and then one place to remember stuff. Let me list it first and then map it.

Automations that go off on a schedule and do discovery and triage by themselves
Worktrees so two agents working in parallel don’t step on each other
Skills to write down the project knowledge the agent would otherwise just guess
Plugins and connectors to plug the agent into the tools you already use
Subagents so one of them has the idea and a different one checks it

Then the sixth thing, the memory. A Markdown file, or a Linear board, anything that lives outside the single conversation and holds what’s done and what is next. Sounds too dumb to matter. But it’s the same trick every long-running agent depends on, and I went into it in “Long-Running Agents”: The model forgets everything between runs so the memory has to be on disk and not in the context. The agent forgets; the repo doesn’t.

Both products have all five now.

Primitive	Job in the loop	Codex app	Claude Code
Automations	Discovery + triage on a schedule	Automations tab: pick project, prompt, cadence, environment; results land in a Triage inbox; `/goal` for run-until-done	Scheduled tasks and cron, `/loop`, `/goal`, hooks, GitHub Actions
Worktrees	Isolate parallel features	Built-in worktree per thread	`git worktree`, `--worktree`, `isolation: worktree` on a subagent
Skills	Codify project knowledge	Agent Skills (`SKILL.md`), invoked with `$name` or implicitly	Agent Skills (`SKILL.md`)
Plugins and connectors	Connect your tools	Connectors (MCP) plus plugins for distribution	MCP servers plus plugins
Subagents	Ideate and verify	Subagents defined as TOML in `.codex/agents/`	Task subagents in `.claude/agents/`, agent teams
State	track what’s done	Markdown or Linear via a connector	Markdown (`AGENTS.md`, progress files) or Linear via MCP

The names are a bit different here and there, but the capability is the same thing. Let me go one by one because honestly the details are where a loop either holds together or quietly leaks everywhere.

Automations, this is the heartbeat

Automations are what make a loop an actual loop and not just one run you did once. In the Codex app you make one in the Automations tab and you pick the project, the prompt it will run, how often, and if it runs on your local checkout or on a background worktree. The runs that find something go to a Triage inbox, and the runs that find nothing just archive themselves which is nice. OpenAI uses them internally for boring stuff like daily issue triage, summarizing CI failures, writing commit briefings, and hunting bugs somebody added last week. And an automation can call a skill, so you keep the recurring thing maintainable; you fire $skill-name instead of pasting a giant wall of instructions into a schedule that nobody will ever update.

Claude Code gets to the same place but through scheduling and hooks. You can run a prompt or a command on a interval with /loop, you can schedule a cron task, you can fire shell commands at certain points in the agent lifecycle with hooks, or you push the whole thing to GitHub Actions if you want it to keep running after you close the laptop. Same idea exactly, you define an autonomous task, you give it a cadence, and the findings come to you so you are not the one going around checking.

There is a second in-session primitive worth knowing, and it’s the one closer to what this whole post is about. /loop re-runs on a cadence. /goal keeps going until a condition you wrote is actually true, and after every turn a separate small model checks whether you are done, so the agent that wrote the code isn’t the one grading it. You give it something like “all tests in test/auth pass and lint is clean” and walk away. Codex has the same thing, also called /goal: It keeps working across turns until a verifiable stopping condition holds, with pause and resume and clear. Same primitive, both tools, which is kind of the pattern for this whole article.

So this is the part that surfaces the work. The rest of the loop is what acts on it.

Worktrees, so parallel doesn’t turn into chaos

The second you run more than one agent, the files start colliding; that becomes the failure. Two agents writing the same file is the exact same headache as two engineers committing to the same lines and nobody talked to each other first. A Git worktree fixes it. It’s a separate working directory on its own branch sharing the same repo history, so one agent’s edits literally cannot touch the other one’s checkout.

Codex builds the worktree support right in so several threads hit the same repo at once and don’t bump into each other. Claude Code gives you the same isolation with git worktree, a --worktree flag to open a session in its own checkout, and a isolation: worktree setting you stick on a subagent so each helper gets a fresh checkout that cleans itself up after. (I wrote about the human side of all this in “The Orchestration Tax.”) The worktrees take away the mechanical collision, but YOU are still the ceiling. Your review of bandwidth decides how many you can actually run, not the tool.

Skills, so you stop explaining your project every single time

A skill is how you stop reexplaining the same project context every session like a goldfish. Both tools use the same format: a folder with a SKILL.md inside holding instructions and metadata, and then optional scripts, references, and assets. Codex runs a skill when you call it with $ or /skills, or by itself when your task matches the skill description, which is the reason a tight, boring description beats a clever one. Claude Code does it the same way and I wrote the pattern up in “Agent Skills.”

Skills are also where intent stops costing you over and over. I argued in “The Intent Debt” that an agent starts every session cold and it will fill any hole in your intent with a confident guess. A skill is that intent written down on the outside, the conventions, the build steps, the “we don’t do it like this because of that one incident,” written one time where the agent reads it every run. Without skills the loop rederives your whole project from zero every cycle; with skills it kind of compounds.

One thing to keep straight: The skill is the authoring format, and a plugin is how you ship it. When you want to share a skill across repos or bundle a few together, you package them as a plugin. True in Codex, true in Claude Code.

Plugins and connectors, the loop touches your real tools

A loop that can only see the filesystem is a tiny loop. Connectors, which are built on MCP, let the agent read your issue tracker, query a database, hit a staging API, or drop a message in Slack. Codex and Claude Code both speak MCP so the connector you wrote for one usually just works in the other. And plugins bundle connectors and skills together so your teammate installs your setup in one go instead of rebuilding the whole thing from memory.

This is the difference between an agent that says “here is the fix” and a loop that opens the PR, links the Linear ticket, and pings the channel once CI is green by itself. The connectors are the reason the loop can act inside your actual environment instead of just telling you what it would do if it could.

Subagents, keep the maker away from the checker

The most useful structural thing in a loop, by far, is splitting the one who writes from the one who checks. The model that wrote the code is way too nice grading its own homework. A second agent with different instructions and sometimes a different model catches the stuff the first one talked itself into.

Codex only spawns subagents when you ask, runs them at the same time, and then folds the results back into one answer. You define your own agents as TOML files in .codex/agents/, each with a name, a description, instructions, and optional model and reasoning effort, so your security reviewer can be a strong model on high effort while your explorer is some fast read-only thing. Claude Code does the same with subagents in .claude/agents/ and agent teams that pass work between them. The usual split in both is one agent explores, one implements, and one verifies against the spec.

I made this case twice already, once as “The Code Agent Orchestra” and once as “Adversarial Code Review.” The reason it matters specifically inside a loop is the loop runs while you are not watching, so a verifier you actually trust is the only reason you can walk away. Subagents do burn more tokens since each one does its own model and tool work, so spend them where a second opinion is worth paying for. This is also basically what Claude Code’s /goal does under the hood: A fresh model decides if the loop is done instead of the one that did the work, the maker and checker split applied to the stop condition itself.

What one loop looks like

Stick it together and a single thread turns into a little control panel. Here is one shape I keep using.

An automation runs every morning on the repo. Its prompt calls a triage skill that reads yesterday’s CI failures, the open issues, and the recent commits and writes the findings into a Markdown file or a Linear board. For each finding that is worth doing, the thread opens an isolated worktree and sends a subagent to draft the fix, and a second subagent reviews that draft against the project skills and the existing tests.

Connectors let the loop open the PR and update the ticket. Anything the loop cannot handle lands in the triage inbox for me. The state file is the spine of the whole thing; it remembers what got tried, what passed, and what is still open, so tomorrow morning the run picks up where today stopped.

And look at what you actually did there. You designed it one time. You did not prompt any of those steps. That’s Steinberger’s whole point made real, and it’s the same loop in Codex or in Claude Code because the pieces are the same pieces.

What the loop still does not do for you

The loop changes the work; it does not delete you from it. And three problems actually get sharper as the loop gets better, not easier.

Verification is still on you. A loop running unattended is also a loop making mistakes unattended. The whole reason you split the verifier subagent from the maker is to make the loop’s “it’s done” mean something, and even then “done” is a claim and not a proof. I keep saying the same line from “Code Review in the Age of AI”: Your job is to ship code you confirmed works.

Your understanding still rots if you allow it. The faster the loop ships code you did not write, the bigger the gap between what exists and what you actually get. That’s comprehension debt and a smooth loop just makes it grow faster unless you read what the loop made.

And the comfortable posture is the dangerous one. When the loop runs itself, it’s very tempting to stop having an opinion and just take whatever it gives back. I called that “cognitive surrender.” Designing the loop is the cure when you do it with judgment and the accelerant when you do it to avoid thinking: same action, opposite result.

Build the loop. Stay the engineer.

I think this is a preview of how our work is going to evolve. That said, if I weren’t reviewing the code myself or if I relied entirely on automated loops to fix it, my product’s quality would suffer. I’d likely end up stuck in a downward spiral, continuously digging myself into a deeper hole.

Go ahead and set up your loops, but don’t forget that prompting your agents directly is also effective. It’s all about finding the right balance.

Loops can also result in different outcomes depending on you. Two people can build the exact same loop and get completely opposite results. One uses it to move faster on work they understand deeply. The other uses it to avoid understanding the work at all. The loop doesn’t know the difference. You do.

That’s what makes loop design harder than prompt engineering. Cherny’s point isn’t that the work got easier. It’s that the leverage point moved.

Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go.

This Week in AI: Fable 5, the Clone Wave, and Uber’s AI Reality Check

Michelle Smith — Thu, 18 Jun 2026 19:33:23 +0000

This week, egghead.io cofounder John Lindquist joined host YK Sugi, founder of CS Dojo and developer experience manager at Eventual, to cover the latest AI news. First on the agenda was the contested release of Claude Fable 5. They also examined the financial shifts reshaping the technology industry, including the rising costs associated with agentic coding loops. Then John outlined the framework he uses to build in the agent era without starting from scratch every time.

Watch the full episode here:

Claude Fable 5: 3 days, a government order, and a lot of unanswered questions

Claude Fable 5 launched June 9 and was pulled from all customers on June 12 after the US government issued a directive ordering Anthropic to restrict access for foreign nationals inside and outside the US. Amazon researchers had reportedly surfaced what they characterized as a security vulnerability, and after Anthropic reportedly declined to patch or redeploy the model, the directive came down. Senior Anthropic staff subsequently traveled to Washington to meet with White House officials.

The dispute about what actually happened is unresolved. Anthropic’s position is that the reported issue was a narrow jailbreak that had been previously identified and was present across public models generally, and not a serious security threat. An independent researcher who reviewed the report described it as defensive prompting that surfaced known vulnerabilities and called the response an overreaction. Neither side has published the technique or prompt, so there’s no way to evaluate the claim independently. But as John put it, “It sets a very strange precedent going forward, as models are released, that governments can step in and control what private companies can and cannot do with their model.”

Another new precedent: Fable 5 wasn’t built on the Opus or Sonnet architecture, which means comparisons to prior Anthropic models or contemporaries don’t tell us much. But initial impressions were positive, including from YK and John, and Fable 5 quickly reached the top of the Arena leaderboard in the text, agents, and web dev code categories. However, the model also had a purposeful limitation: On questions related to AI and machine learning training specifically, it was designed to underperform (without signaling this to users), apparently to prevent competitors from using it to improve their own models. Intentional capability suppression in a commercial model, without disclosure, is a different kind of product decision than a safety guardrail. Whether that approach becomes more common as competitive stakes rise is an open question.

Tokens burn fast when the loop isn’t ready for them

Last week, SpaceX went public in the largest IPO in history. The company finalized its acquisition of Cursor in a $60 billion all-stock deal shortly after. (That last one happened after this episode aired—we’ll talk more about it on Monday.) Both OpenAI and Anthropic have filed to go public as well, and Google raised roughly $160 billion through equity and a 100-year bond. A significant share of that capital is flowing toward AI coding infrastructure.

YK brought up another, less celebratory, financial story that’s been making the rounds: Uber burned through its full 2026 AI tools budget by April, mostly on Claude Code and Cursor, and Andrew Macdonald, the company’s COO, acknowledged they couldn’t link that spending to a measurable increase in useful customer features. Uber subsequently put a $1,500 per month per employee cap in place.

John flagged projects inefficiently utilizing agentic loops as one possible cause for wasteful token spend. Most developers deploying agents against existing codebases haven’t built the tooling those agents need to work efficiently, so agents burn tokens doing work that dead-ends, repeating context, or generating code that requires significant debugging. He explained:

If you take a legacy codebase and you throw agents against it with loops, you haven’t set up a proper agent environment. It’s so quick to burn tokens because. . .the agents don’t have the tools to work with.

The conversation in developer communities so far has focused almost entirely on what agents can generate. But as more organizations move from experimentation to production-scale deployment, building logging, verification, and proper error surfaces into agent tooling is what will determine whether token spend maps to real output. Otherwise, we’ll likely see more companies go the way of Uber.

Ingredients beat inference: A practical framework for building in the clone wave

For most developer workflows today, buy-versus-build leans toward building in a way it didn’t even a year or two ago. As John noted, “It’s so easy to build apps and workflows now where there are so many amazing production apps out there, apps on your phone, apps on your desktop, software as a service, that are trivial to copy and clone.” He uses the term the “clone wave” to describe this expanding set of open source equivalents to consumer software products that can now be cloned, forked, or replaced and get you 99% of the way to your use case.

The principle that drives the clone wave is “ingredients beat inference.” If you ask an agent to build a feature from scratch, it infers a solution with no external reference. If you give it an existing open source implementation to start from, it can adapt, translate, and integrate that code far faster and more reliably. The ingredients approach also helps with the 43% of AI-generated code that needs debugging in production, per a figure YK cited earlier in the episode.

The GitHub CLI plays a central role in this workflow. John explained that because agents understand the GitHub CLI natively, you can give an agent a search task and let it find implementations it wouldn’t have generated itself. Language mismatch isn’t a blocker, because agents translate between languages and libraries well. And tools like DeepWiki from Cognition let agents explore and understand a repo’s structure before cloning or forking it, so the evaluation step doesn’t require local setup.

The framework extends to how you build the last 20% that isn’t available as an ingredient. This is the part that’s specific to your use case; John described it as “that extra bit that you’re building on top of it to make it into the custom product and project for either yourself or for your users.” John’s bigger point is that the tools you build for yourself should also be usable by your agents. Expose endpoints and logging. Give agents the ability to read state and errors. An agent that can control a tool but not debug it will eventually stop in ways that are hard to diagnose.

John walked through cmux to demonstrate what an agent-native workspace looks like in practice. cmux is a terminal multiplexer built with agentic workflows in mind: it exposes a CLI that agents can control directly, so you can open a terminal pane, have that pane spawn another, and have the two read from and write to each other. In practice that means you can run Claude Code in one pane, Codex in another, and a third pane reading output from both, with each agent able to observe the others’ state.

Agents need more than the ability to run commands. They need to read logs, check errors, and confirm state before taking the next step. A workspace that exposes those surfaces gives agents a feedback loop. This tenet is applicable to tools across the company. Organizations that treat their internal tooling as agent-accessible infrastructure are building something that compounds. Those treating agents as black-box code generators are taking on technical debt they may not see until causes issues later on.

What’s next

SpaceX’s acquisition of Cursor turns the coding-agent race into something much larger than an IDE fight. Cursor may be positioning itself as a new GitHub for the agentic era, where agents write, review, test, repair, and govern code. At the same time, Salesforce’s $3.6B acquisition of Fin shows the same pattern inside enterprise software: Buyers want packaged workflows that solve real support, sales, and operations problems rather than abstract “agents.”

Next week, host Ksenia Se examines these stories and more through the lens of who owns the loop where AI does the work. Join us to find out why the next phase of AI will be about who controls the infrastructure, economics, and trust layer.

Our episodes are free and open to all through the end of June if you’d like to attend live—register here. And we’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, Apple, or wherever you get your podcasts.

Kubernetes in the Age of AI

Andy Kwan — Thu, 18 Jun 2026 14:21:16 +0000

When Kubernetes first came onto the scene, it was a major turning point, a revision of the infrastructure and operations space that transformed the way developers and ops personnel build, deploy, and maintain applications in the cloud. It has since become the clear standard for how modern applications are built and operated. As the CNCF noted in its latest Annual Cloud Native Survey report, “Among container users, 82% are using Kubernetes in production in 2025, up from 66% in 2023. This represents near-universal adoption within the container ecosystem.”

Over the last few years, another revision in the space has occurred with Kubernetes’s evolution from a container orchestrator to an AI infrastructure platform. According to the CNCF survey, “The rise of Kubernetes as the de facto AI platform represents a fundamental shift in how organizations approach machine learning operations. . .[with Kubernetes] providing a unified orchestration layer that handles both traditional application workloads and compute-intensive AI tasks.” The emergence of seismic technologies like generative AI and agentic AI has only accelerated this transformation.

The intersection of AI with Kubernetes is undoubtedly one of the most impactful developments in the operations space. As Jonathan Johnson, software architect at Dijure, observes, “AI on K8s is very, very important, and there is not enough [resources] out there.” Raju Gandhi, senior technical architect at Edward Jones, echoes this assessment, noting that “operationalizing AI/ML on K8s is a big issue, [and it’s only] getting bigger. This is a topic that needs attention.” But what are some of the things that you should know about this trend to keep abreast and stay ahead in the game?

Generative AI

Anyone with access to a computer or a smartphone has likely used some iteration of generative AI, a stunning fact when you consider that GenAI was on the outer edges of mainstream discourse and consumption a scant five years ago. But at the end of 2022, the debut of ChatGPT marked the beginning of a technological revolution, one that would impact and reshape nearly every aspect of our working and personal lives. Unsurprisingly, there are now thousands of generative AI models, a proliferation that naturally has its own set of complexities. Selecting a model is simple, but if you’re an application developer or MLOps engineer, how do you go about operating that model in a production system? Not only do you have to be cognizant of factors like resilience, scalability, security, and operational costs, but there’s the fact that bringing a model from experimentation into production can be arduous if not done properly. That’s where Kubernetes comes into play.

As Roland Huß and Daniele Zonca, distinguished engineers at Red Hat, note, “GenAI/LLM models are resource intensive, requiring substantial computational power and large datasets. Given its scalability and extensibility, Kubernetes is uniquely suited to function as an efficient platform for AI and LLM model pretraining, fine-tuning, deployment, and prompt engineering.” They further elaborate that “this integration with Kubernetes not only simplifies the adoption of cutting-edge AI technologies but also ensures a seamless and efficient operational flow. Kubernetes, with its robust scalability and management capabilities, stands as an ideal platform for generative AI projects, aligning DevOps and MLOps practices in a cohesive ecosystem.”

This sentiment is already shared by a wide swath of the industry. According to the CNCF survey above, as of 2025, 66% of organizations run generative AI workloads on Kubernetes. These organizations include OpenAI, which uses Kubernetes for its AI/LLM application experimenting and testing; Tesla, which utilizes KServe to manage production-grade LLM inference; and Adobe, which uses Kubernetes to power its suite of generative creative models. Other companies taking this approach include Uber, Intuit, and Google. With more companies adopting this practice for their generative AI and LLMs operations, it’d be prudent for any organization to leverage Kubernetes for their own GenAI and LLM workflows.

Agentic AI

Nearly coinciding with the rise of GenAI has been the steady growth of agentic AI. Unlike GenAI, agentic AI goes beyond answering simple prompts and generating text in its ability to operate autonomously to perform complex, multistep actions, utilize tools, and make independent decisions. With its ability to support both traditional ML processes and GenAI and LLM operations, it should come as no surprise that Kubernetes has a role in the agentic AI ecosystem as well.

According to Ronald Petty, principal consultant at RX-M, “Kubernetes has been leveraged to host machine learning pipelines, including AI model training and inference. As inference options have become plentiful and affordable, on and off-premise, we have seen the rise of agents. Coupling cloud native technologies and popular protocols, we now see agents moving from ad hoc demos to complex fleets of agents on systems like Kubernetes.” So what are some examples of the integration between these two technologies?

One notable offering is Kagent, an OS programming framework that runs AI agents in Kubernetes and “helps engineers build powerful internal platforms by tackling cloud native tasks such as configuration, troubleshooting, complex deployment scenarios, observability pipelines and dashboards, and safely enabling network security.” Operating along similar lines is K8sGPT, an AI-powered tool that leverages intelligent insights and automated troubleshooting to analyze Kubernetes clusters for configuration problems and security issues, as well as generates solutions to problems discovered in analysis.

A more recent entry in the field is Sympozium, a Kubernetes-native coordination layer for multi-agent AI systems that “solves the same problem Kubernetes solved for containers, but for agents that need to share context, hand off tasks, and maintain shared situational awareness.” Another newer offering is Agent Sandbox, which allows you to run AI agents as isolated, stateful workloads with a native API on Kubernetes.

The fundamentals

While it’s important to be aware of the latest developments and trends affecting your domain, that shouldn’t come at the expense of foundational knowledge and skills. As basketball great Michael Jordan once said, “Get the fundamentals down and the level of everything you do will rise.” One of the most fundamental skills for working with Kubernetes is networking, and frustratingly enough, it’s one of the more difficult ones to master. As Cisco senior staff engineer Nico Vibert observes, “Platform engineers tend to be comfortable with Linux networking but less so with protocols like BGP and IPv6; network administrators know those protocols well but find Kubernetes abstractions unfamiliar. Both personas struggle to navigate the dozens of networking tools seemingly required to meet connectivity and security requirements.” Yet as organizations move mission-critical workloads, AI training pipelines, and regulated financial services onto Kubernetes, the engineers who can design, secure, and troubleshoot the network layer have become some of the most sought-after professionals in the industry.

In recognition of both the importance and difficult nature of the Kubernetes networking skill, the CNCF recently announced a new certification focused on the Kubernetes network engineer role. The certification is designed to validate hands-on networking expertise across all of the aforementioned layers, filling a gap that the Kubernetes community has long recognized.

For organizations that use Kubernetes to develop and deliver applications, leaders and decision-makers need to be aware that utilizing Kubernetes in conjunction with the latest AI tools is no longer a luxury but a necessary practice that will allow their companies to thrive. A similar onus should be placed on the basics. When hiring your next DevOps, network, or site reliability engineer, ensure that their ability to design, secure, and troubleshoot the Kubernetes network layer is second to none.

If you want to dive deeper, check out Roland Huß and Daniele Zonca’s Generative AI on Kubernetes, Jonathan Johnson’s GPU Kubernetes Homelab live course, Alex Corvin, Taneem Ibrahim, and Kyle Stratis’s Scalable Kubernetes Infrastructure for AI Platforms, Ashok Srirama and Sukirti Gupta’s Kubernetes for Generative AI Solutions, and Yogesh Raheja’s K8sGPT Essentials on-demand course. They’re all on O’Reilly. If you’re not a member, you can get started with a free trial.

The Case Against Building Your Own Agent Platform

Pete Johnson — Wed, 17 Jun 2026 13:53:16 +0000

You know the meeting. The board wants an AI agent strategy by end of quarter. Someone on the leadership team has read a McKinsey report. You’ve been voluntold to build the platform. The slide deck says “AI-native.” The acceptance criteria are vague. Somebody mentions LangGraph, and somebody else says, “We’ll just wrap it ourselves.”

You ask what “done” looks like. Nobody in the room can answer.

The cost of building this is almost always estimated before anyone has a clear picture of what “this” actually is. And that’s the problem I want to work through here, because the scope of the work being casually assigned to internal platform teams right now is genuinely larger than the people assigning it understand.

Build versus buy, flipped in a year

This particular pendulum has swung before. App servers in the late 1990s. Content management systems in the 2000s. Container orchestration in the 2010s. The pattern rhymes every time: When a category is new, the components look deceptively simple. Early adopters build their own. The market catches up. Within 18 months, building becomes the expensive path. Within 36 months, the teams that built internally are rewriting on top of the category winner that emerged while they weren’t looking.

What’s different about the current moment is the speed. Menlo Ventures’ 2025 State of Generative AI in the Enterprise report shows the build-versus-buy split inverted in a single year. In 2024, 47% of enterprise AI solutions were built internally. By late 2025, that number had collapsed to 24%. The market made the decision in 12 months, which is unusual.

I’ve lived through enough of these transitions to recognize the shape. What I want to do in this piece is explain why I think the scope of “agent platform” is systematically underestimated right now, and what platform engineers should be asking before they commit to building one.

Most “agent platforms” aren’t

A lot of the projects labeled “agent platform” right now are actually workflow systems with an LLM in the loop. That’s a meaningful distinction. As Anthropic pointed out in its “Building Effective Agents” guidance, workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents are systems where LLMs dynamically direct their own processes and tool usage.

Most of what enterprises are shipping today sits on the workflow side. That’s fine. Workflows have bounded requirements, tractable testing, and predictable failure modes. If your team is building a workflow system, you might reasonably build it yourselves.

The trap is that teams start building for workflows, then get asked to support agents, and discover the jump isn’t incremental. Agents need memory that survives across sessions. They need evaluation that handles nondeterminism. They need governance that tracks actions, not just outputs. They need orchestration that recovers from failure modes a workflow engine never sees.

Here’s the thesis I want to put on the table: The decision to build an agent platform almost always underestimates the long tail. Memory, governance, eval, and orchestration aren’t features you add to a workflow engine. They’re separate product bets, each with its own maturity curve, its own vendor landscape, and its own team of specialists who’ve been working on it full-time for 18 months while you’ve been doing something else.

Let me walk through them.

Memory

The assumption inside most build proposals is that memory is a database problem. You’ll pick a vector store, shove conversation history into it, and retrieve relevant chunks when the agent needs context. Done.

Production memory is three separate systems: episodic, semantic, and procedural, each with different retention and retrieval policies. It’s temporal reasoning that tracks when facts were valid, not just what they were. It’s deduplication, multitenant isolation, and explicit source-of-truth governance.

The signal that this is a separate product category, not a feature: Mem0 raised $24 million across seed and Series A. Letta (formerly MemGPT) raised $10M from Felicis. Zep exists as an independent company with a temporal knowledge graph engine. Mem0’s State of AI Agent Memory 2026 report maps 21 frameworks across three hosting models with measurable benchmark gaps between them. On LongMemEval, Zep scores 15 points higher than Mem0 on temporal queries, which tells you these aren’t interchangeable tools that happen to serve the same market.

This is the component that platform teams underestimate hardest. Memory sounds like a database problem. It isn’t.

Governance

The assumption is that governance is RBAC plus audit logging. Your agents are services. Services get role-based access controls. You log the tool calls. Compliance is happy.

Agent governance is something different. It spans action authorization, not just data authorization. It requires decision-chain auditability, where you can reconstruct why the agent did what it did, not just what it did. It needs behavioral drift detection, tiered autonomy, and compliance mapped to agent actions rather than data accesses.

Grant Thornton’s 2026 AI Impact Survey of 950 business executives found that 78% lack strong confidence they could pass an independent AI governance audit within 90 days. Meanwhile, enterprises are moving to increase agent autonomy faster than their governance frameworks can keep up. Traditional AI governance wasn’t designed for action-level authorization, which is where most agent-specific risk accumulates.

And there’s a hard deadline attached to this. The EU AI Act becomes fully enforceable for high-risk systems in August 2026. Credit scoring, hiring decisions, healthcare support, and critical infrastructure all fall in scope. If your internal platform doesn’t handle conformity assessments, human oversight mechanisms, complete audit trails, and ongoing monitoring, that’s not a v2 feature. That’s a legal exposure.

OWASP now documents “excessive agency” as a top vulnerability class for LLM applications. Cornell researchers have demonstrated indirect prompt injection attacks that manipulate agents through content they ingest. These are agent-specific attack surfaces, and traditional security tooling doesn’t see them.

RBAC was designed for humans with predictable intent. Agents don’t have predictable intent.

Eval

The assumption is that evaluation means writing test cases and measuring accuracy. You built software before. You know how to test things.

Agent evaluation is qualitatively different from traditional software testing or even LLM evaluation, McKinsey’s QuantumBlack team noted: For LLMs, you evaluate the response to a prompt. For a single agent, you evaluate the full trajectory, including tool calls, state transitions, and intermediate decisions. For multi-agent systems, you evaluate system dynamics, including coordination patterns and collective invariants.

This matters because agent behavior is nondeterministic by design. The same input produces different valid execution paths. “Did the agent succeed?” is no longer a yes-or-no question, because the agent might reach the right answer through a trajectory you didn’t anticipate, or reach the wrong answer through a trajectory that looks reasonable until the last step.

The tooling ecosystem reflects this. Google Vertex AI has standardized trajectory_exact_match, trajectory_precision, and trajectory_recall as production metrics. These didn’t exist 18 months ago. LangSmith, Braintrust, Arize, Galileo, Maxim, and others are building full evaluation platforms around trajectory-based analysis, LLM-as-judge scoring with statistical validation, and regression testing against production failures.

Here’s the signal that the category is real: LangChain’s 2026 State of AI Agents report found that 57% of organizations now have agents in production, and 32% cite quality as the top deployment barrier. Gartner projects that 60% of software engineering teams will adopt AI evaluation and observability platforms by 2028, up from 18% in 2025. When a category jumps from 18% to 60% adoption in three years, that’s not a “we can build this in a sprint” situation.

You can’t tell whether your evaluation is working without another evaluation. Judge drift, calibration against human experts, internal consistency across independent runs. . .your eval system needs its own eval system, which is exactly the kind of recursion that eats platform teams alive.

Orchestration

The orchestration layer hasn’t converged. LangGraph uses directed graphs with conditional edges. CrewAI uses role-based crews. OpenAI’s Agents SDK uses explicit handoffs. AutoGen uses conversational GroupChat. Google ADK uses hierarchical agent trees. Claude’s Agent SDK uses tool-use chains with subagents. Microsoft’s Agent Framework is its own thing. Each represents a different bet on state management, communication pattern, and coordination model. None of them are interchangeable. Migration between them isn’t a config change—it’s rewriting most of your agent logic.

Underneath them, the protocol layer is still being invented. The Model Context Protocol is becoming the standard for tool integration, and agent-to-agent (A2A) protocols are emerging for cross-framework coordination. Both are moving targets, and building on a moving protocol is a cost that internal platform teams rarely price in.

If you built your own orchestration layer in 2024, you’re rewriting it in 2026. The teams that picked a framework spent those two years shipping.

The honest case for building

I want to engage the strongest version of the build argument, because there are real reasons to build, and pretending otherwise makes this piece less useful than it should be.

Proprietary data genuinely is a durable competitive moat. Mastercard built a foundation model on its transaction network. Plaid built one on its financial institution coverage. As Morgan Stanley’s analysis from last year made clear, decades of verified historical data with consistent identifiers is both technically challenging and prohibitively expensive for outside players to recreate. If your organization has data like that, you should absolutely build on it.

Regulated industries have legitimate reasons to want control over the full stack. Off-the-shelf AI tools don’t always cleanly map to frameworks like HIPAA, GxP, 21 CFR Part 11, SOX, FFIEC, and PCI DSS, and the cost of a failed audit is measured in business units shut down, not in sprints.

Vendor lock-in at the AI layer is subtler and more dangerous than in traditional software. If your agentic workflows are built on a vendor’s proprietary orchestration layer, switching costs compound rapidly across memory, eval, and integrations simultaneously.

But here’s the distinction that matters: Those are arguments for building agents on top of platform components, not arguments for building the platform components themselves. You can own the data, the domain logic, the evaluation criteria, the governance policies, and the specific behaviors your business needs without owning the memory layer, the orchestration engine, or the trace collection infrastructure underneath them.

Build the things that are specific to your business. Buy the things that are specific to the technology category. That’s the heuristic.

Five questions before you commit

If you’re the platform engineer being pulled into this decision, here are the questions worth asking before anyone signs up for the scope.

Are you building an agent platform or a workflow system? They’re not the same scope, and conflating them is where most of the cost overruns originate. A workflow system is a reasonable thing to build. An agent platform is four product categories you haven’t staffed for.

Can you articulate what “done” looks like for each of the four components? Memory, governance, eval, orchestration. In under three sentences each. If you can’t, you don’t have requirements. You have a vibe. And vibes don’t ship.

What happens to your platform when you need to swap the underlying model? Menlo’s December 2025 data shows Anthropic went from 12% of enterprise LLM spend in 2023 to 40% in 2025, while OpenAI fell from 50% to 27%. Enterprises didn’t plan those switches. The capability gaps forced them. If your internal platform hardcoded assumptions about context windows, tool-calling formats, or reasoning styles from one vendor, swapping models isn’t an API key change. It’s simultaneous rewrites across memory, eval, and orchestration.

What happens when the techniques themselves change? Eighteen months ago the default pattern was RAG with flat vector retrieval. Now it’s just-in-time context strategies, agent-managed memory tiers, and trajectory-based evaluation. Anthropic’s own follow-up to “Building Effective Agents” explicitly acknowledges the field has moved since they wrote the original. If your platform baked in the 2024 patterns, the 2026 patterns are a refactor, not a config change. Vendor platforms absorb those shifts as releases. Internal platforms absorb them as sprints.

What happens when the platform team leaves? This is the tale as old as COBOL, custom ESBs in 2008, or hand-rolled container orchestration in 2015. A small team builds something clever, it works, they move on, and five years later you’re paying premium rates to contractors who can still read the code. Agent platforms are a particularly bad candidate for this pattern because the talent pool is both small and mobile. Here’s the uncomfortable version of the question: Who on your team, today, could rebuild the memory layer if the person who wrote it left tomorrow?

What this looks like in 2 years

Gartner’s prediction that over 40% of agentic AI projects will be canceled by 2027 isn’t really about the AI. It’s about projects that got scoped before anyone understood the shape of the work. Most of the canceled projects will be internal builds, because internal builds are where the scope estimation error accumulates. Deloitte’s data on two- to four-year AI ROI horizons is the warning shot. If your timeline to value is already long, every month you spend rebuilding a component that exists as a product is a month you don’t have.

The teams that built their platforms around OpenAI in 2023 weren’t wrong. They made a reasonable bet on the market leader at the time. But they spent 2025 porting to a landscape where Anthropic had tripled share and Google had gone from 7% to 21%. The teams that picked model-agnostic platforms spent 2025 shipping. The only durable bet in this space is the one that assumes the bet will change.

The best platform engineering decision you can make this quarter might be to not build the platform.

Sources

Primary sources

Menlo Ventures, 2025: The State of Generative AI in the Enterprise, December 2025,
https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/.
Anthropic, “Building Effective Agents,” December 2024,
https://www.anthropic.com/research/building-effective-agents.
Anthropic, “Effective Context Engineering for AI Agents,” 2025,
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents.
European Commission, AI Act Regulatory Framework (Regulation EU 2024/1689),
https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai.
Google Cloud, “Evaluate Gen AI Agents,” Vertex AI Documentation,
https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-agents.
McKinsey QuantumBlack, “Evaluations for the Agentic World,”
https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a.
LangChain, State of Agent Engineering 2026,
https://www.langchain.com/state-of-agent-engineering.
Gartner, “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027,” June 2025, https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027.
Grant Thornton, 2026 AI Impact Survey, April 2026,
https://www.grantthornton.com/services/advisory-services/artificial-intelligence/2026-ai-impact-survey.

Secondary Sources

Mem0, “Mem0 Raises $24M to Build the Memory Layer for AI,” October 2025,
https://mem0.ai/series-a.
Felicis, “Felicis’s Seed in Letta,” September 2024,
https://www.felicis.com/blog/letta.
Vectorize.io, “Mem0 vs Zep,” Benchmark Comparison,
https://vectorize.io/articles/mem0-vs-zep.
Rasmussen et al., “Zep: A Temporal Knowledge Graph Architecture for Agent Memory,” arXiv 2501.13956,
https://arxiv.org/abs/2501.13956.
OWASP, “LLM08:2025 Excessive Agency,” OWASP Top 10 for LLM Applications,
https://genai.owasp.org/llmrisk/llm08-excessive-agency/.
Greshake et al., “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” arXiv 2302.12173, February 2023,
https://arxiv.org/abs/2302.12173.
Model Context Protocol, Official Specification,
https://modelcontextprotocol.io.
PYMNTS, “FinTechs Race to Build Foundation Models on Proprietary Data,” 2026,
https://www.pymnts.com/artificial-intelligence-2/2026/fintechs-race-to-build-foundation-models-on-proprietary-data/.
Deloitte, “State of Generative AI in the Enterprise,” Quarterly Reports,
https://www.deloitte.com/us/en/insights/topics/digital-transformation/state-of-generative-ai-in-enterprise.html.

Linear Thinking, Nonlinear Costs

Nicole Koenigstein — Tue, 16 Jun 2026 11:02:01 +0000

Many AI agent systems become economically unsustainable long before they become technically impressive. Teams usually focus on model choice, prompt design, tool calling, and orchestration. Those things matter, but they are only part of the system setup. The deeper issue is that coding agents, such as Claude Code, Codex, and Jules, make agent workflows easier to generate. But when implementation is abstracted away, the underlying mechanics become harder to see. Bad engineering used to produce slow code. Now it produces expensive systems that also happen to be slow.

When we design agent systems, we still need to remember that the costs scale nonlinearly. A single user request rarely triggers a single model call. It expands into routing, retrieval, reasoning, reflection, guardrail checks, tool calls, and synthesis. Each step may repeat shared context, reload state, recompute a planner decision, or retry a failed path. What looks like an intelligent workflow can therefore behave like a recursive, stateful computation with overlapping subproblems. If that sounds like backtracking, dynamic programming, and memoization to you, you’re right.

We already know how to optimize systems like this. The problem is that coding agents make agent systems easier to generate, but not necessarily easier to optimize. Unless we recognize the underlying mechanics, we may never ask our coding agents to apply the optimization patterns that keep our systems viable.

Old problems wearing new clothes

When we use coding agents to generate agent architectures, it’s tempting to stop at “the trace looks reasonable.” The tool can generate routers, retrievers, planners, evaluators, guardrails, tool interfaces, and synthesis steps. It may also know about caching, pruning, memoization, and state modeling. But it won’t necessarily implement those patterns unless you ask for these optimization layers explicitly.

Even if you work with agent instructions, unless your SKILL.md, AGENTS.md, or project instructions include constraints around repeated context, memoization, cache invalidation, pruning, and cost per request, your resulting agent system may be functionally correct and economically wasteful at the same time. That’s the tricky part: The code can pass review, the unit tests can pass, and the architecture can look reasonable. The invoice is where the hidden computation finally shows up.

It’s easy to give too much agency to tools like Claude Code. When a coding agent reasons in language, calls tools, reflects, and produces fluent text or code, it can feel like a knowledgeable coworker. At the interface level, that impression is understandable. These tools help teams generate more code, move faster, and become more productive. Still, this doesn’t remove the need for engineering craft underneath. Someone still has to recognize repeated context, recomputed planner decisions, correlated retries, unpruned branches, and state that can’t be reused. The coding agent can implement the system, but the engineer still has to understand what kind of system should be implemented. This is where old computer science returns, not as theory but as the optimization layer our agent systems need in production.

The cost multiplier, repeated-work problems, and backtracking

The cost multiplier often shows up first as latency. The user doesn’t see the router, the retries, the reflection loop, or the tool calls. They only see that the agent is taking too long. From the outside, the system looks stuck or broken. From the inside, it may simply be repeating work.

This is one of the uncomfortable differences between traditional software and agent systems. In a conventional application, a failed operation often throws an error, times out, or leaves a trace that is easy to inspect. In an agent workflow, failure can look like effort to improve reliability. Take the weakest step in your agent workflow. If it succeeds 60% of the time, and you try to push it close to 99% reliability through retries, you need 5 retries:

1 − (1 − 0.60)⁵= 0.98976

This math assumes each retry is a roll of fair dice. LLMs aren’t dice. Whether you’re using greedy decoding or probabilistic sampling, the model is still drawing from the same underlying distribution shaped by your prompt. If the first “thought” is a hallucination or logic error, bumping the temperature won’t fix the underlying state. You aren’t buying independent trials; you’re just sampling different paths through the same flawed map and state.

This is where the old algorithmic framing matters. In a backtracking problem, you don’t keep walking down the same failed branch and call it progress. You return to the last valid state, mark the failed path, and use the failure as information for the next choice. The point isn’t just to try again. The point is to try again under a changed state.

Agent workflows need the same discipline. A retry shouldn’t mean “run it again and hope.” It should give the model structured feedback about why the previous attempt failed: which constraint failed, which tool result was invalid, which schema didn’t validate, which assumption was unsupported, or which branch added nothing. The next attempt should then change something meaningful: the prompt, the tool choice, the retrieved evidence, the validation constraint, or the planner state.

Memoization, pruning, and dynamic programming

Prompt caching is usually the first optimization. If every step repeats the same system prompt, tool definitions, schema constraints, examples, and policy rules, then caching the shared prefix is an obvious win. It reduces the cost of repeated context. But prompt caching only recognizes that text repeats. It doesn’t notice that decisions repeat.

In many agent systems, the expensive unit isn’t only text. It’s the repeated decision. If the same or equivalent state appears again, paying the model to rediscover the same action is unnecessary. That is what memoization does: It turns repeated computation into lookup. In classical algorithms, the repeated computation might be a recursive subproblem. In an agent system, it might be a planner decision over the same task, facts, tools, and constraints. The planner can be treated as a function over state:

$^πLLM(S_t) \rightarrow a_{t+1}$

where $S_t$ is the current state of the workflow and $a_{t+1}$ is the next action. Without memoization, this function is evaluated again and again through an LLM call. With memoization, the system first checks whether it has seen the same or equivalent state before. If you want a deeper walkthrough of how to use memoization, I cover it in AI Agents: The Definitive Guide.

But memoization only helps once the system knows which states are worth revisiting. Pruning handles the other side of the problem: branches that shouldn’t be explored further. However, don’t limit pruning to KV cache pruning or speculative decoding. Use it also when a tool repeatedly returns no new information. Your next LLM call shouldn’t be a slightly reworded version of the same query. If a reflection loop keeps producing stylistic changes without improving correctness, the loop should stop. If a search path violates a constraint or depends on an unsupported assumption, it should be marked as unproductive and removed from the active search space.

Dynamic programming becomes relevant when different branches of the workflow solve overlapping subproblems. A research agent may ask similar questions across several documents. A coding agent may inspect the same dependency chain from different entry points. A business analysis agent may compute the same metric for several report sections. If every branch solves these subproblems from scratch, the system pays repeatedly for work it has already done. Table 1 shows examples of how these patterns map to AI agent systems.

Table 1. Classical optimization patterns applied to AI agent systems

Optimization	The “old” CS way	The “agent” way
Memoization	Store results of expensive function calls.	Cache decisions. If the agent saw this state before, don’t ask it to reason again.
Pruning	Cut off search paths in a tree that won’t lead to a solution.	Kill a reflection loop when the critique stops yielding structural improvements.
Dynamic programming	Break problems into overlapping subproblems.	Share codebase analysis across multiple specialized agents instead of rereading files.

This isn’t nostalgia. These patterns mitigate the cost structure of agent systems. Memoization reduces repeated decisions. Pruning reduces repeated failure. Dynamic programming reduces repeated subproblem solving. Together, they form the optimization layer many agent architectures are missing in production.

Where to start: Optimization follows topology

The patterns above aren’t a checklist you apply uniformly. Each multi-agent topology, whether centralized, decentralized, independent, or hybrid, distributes communication and coordination differently, which directly affects overhead, latency, and failure propagation. The optimization layer has to follow.

Centralized
A single orchestrator decides, delegates, and aggregates. The expensive unit is the orchestrator’s decision, repeated across similar inputs. Memoize the planner first.

Decentralized
Agents coordinate peer-to-peer, exchanging messages without a central authority. The cost moves into the communication itself: redundant exchanges, restated context, agents reasoning over the same shared state from different angles. Prompt caching on the shared context is the first win, followed by pruning exchanges that no longer add information.

Independent/swarms
Lightweight agents fan out without coordinating. Cheap individually, expensive in aggregate. If three of your ten agents ask semantically equivalent questions, you pay three times for the same answer. Memoization and pruning aren’t optimizations here; they’re load-bearing.

Hybrid
The repeated work shows up at two scales: within a cluster (overlapping subproblems among peers) and across clusters (the coordinator rediscovering the same routing decision). Use dynamic programming on shared subproblems inside the cluster, memoization on the coordinator’s decisions across them.

The optimization layer isn’t a generic discipline you bolt on. It’s a function of the shape of the implementation. Coding agents made it easy to generate the shape without seeing it. The craft is in seeing it anyway.

Who Owns the Code Claude Wrote?

Sena Evren — Mon, 15 Jun 2026 10:58:47 +0000

The following article originally appeared on Sena Evren’s Legal Layer newsletter and is being reposted here with the author’s permission.

TL; DR

Agentic coding tools like Claude Code, Cursor, and Codex generate code that may be uncopyrightable, owned by your employer, or contaminated by open source licenses you cannot see. Some of this is settled law, some is actively contested, and this piece is clear about which is which. If you are shipping AI-assisted code and have not thought about any of this, this piece is for you.

If you shipped code this week, some of it was probably written by an AI. The question of who legally owns that code is less settled than most developers assume, and the answer depends on three things that have nothing to do with how good the code is:

Whether a human made enough creative decisions to establish copyright
Whether your employment contract already assigned it to your employer
Whether the model pulled from GPL-licensed training data and quietly contaminated your codebase

On March 31, 2026, Anthropic accidentally published 512,000 lines of Claude Code’s source code in a routine software update through a missing configuration file. Before sunrise, the codebase was mirrored across GitHub. Before breakfast, a developer had used an AI tool to rewrite the entire thing in Python, and the “claw-code” repository hit 100,000 GitHub stars in a single day, the fastest in history. Then came the DMCA takedowns, and then came the question nobody had a clean answer to:

If Claude Code was, by Anthropic’s own lead engineer’s admission, predominantly written by Claude itself, does Anthropic even own it? Can you issue a DMCA takedown for code that copyright law may not protect?

That incident compressed every open question about AI-generated code ownership into a single news cycle. The same questions apply to your codebase.

The copyright rule nobody told you

Here is the legal baseline, in plain terms: Copyright only protects work created by a human.

The US Copyright Office has confirmed this consistently, and the DC Circuit upheld it in the Thaler case. When the Supreme Court declined to hear the Thaler appeal in March 2026, it did not endorse the lower court’s reasoning or settle the question nationally. Cert denial means the court chose not to hear the case, nothing more. What it does mean is that the DC Circuit’s ruling stands, the Copyright Office’s position is intact, and no court has yet gone the other way. Works predominantly generated by AI without meaningful human authorship are not eligible for copyright protection under current doctrine, and that position is stable even if it is not finally settled.

Two important limits on what Thaler actually decided.

The case involved a painting created with zero human involvement at all. Thaler listed the AI system as sole author and made no claim of any human creative contribution. The ruling does not directly address the harder question of AI-assisted work where a human was involved but the degree of that involvement is disputed.
Thaler involved visual art. No court has yet applied the human authorship doctrine specifically to code output from an AI coding tool. The logic applies, but the direct precedent does not exist yet.

What it means for you: Code that Claude Code or Cursor generated and you accepted without meaningful modification may not be copyrightable by anyone. If a competitor copies it, you may have no legal recourse, because the code sits in the public domain in everything but name.

The phrase that determines whether your code is protected is “meaningful human authorship,” and the Copyright Office has deliberately refused to quantify it with a percentage or a number of edits, because what courts look for is evidence that a human made genuine creative decisions:

Choosing the architecture
Deciding what to reject
Restructuring the output to fit a specific design

Specifying an objective to the model is not enough. Directing how the work is constructed is what counts.

In an agentic workflow, this distinction is harder to establish than it sounds. Consider a typical Claude Code session:

You write a one-line prompt: “build a rate limiting module for the API.”
Claude Code plans the approach, generates five files, and iterates through three versions.
You review the output, run the tests, and merge.

Your contribution in that sequence is your architectural intent and your final approval. Whether that constitutes meaningful human authorship in a courtroom is an unresolved question with no definitive court ruling yet.

The honest answer is: probably yes for modules you substantially redirected, probably no for code you accepted verbatim, and unclear for everything in between.

The middle ground is actively being litigated right now. In Allen v. Perlmutter, artist Jason Allen is challenging the Copyright Office’s denial of registration for a work he created using more than 600 detailed prompts and subsequent editing in Photoshop. The Copyright Office acknowledged the Photoshop edits as human-authored but still denied registration for the AI-generated underlying elements. That case has not been decided yet, and whatever it decides will be the closest thing to a ruling on how much human involvement is enough.

The closest existing precedent on partial protection is Zarya of the Dawn, a graphic novel where the Copyright Office granted registration for the human-authored text but denied it for the Midjourney-generated images. That decision establishes a practical principle developers can use right now: The human-authored elements of an AI-assisted codebase may be separately protectable even if the generated code itself is not. Your architecture documents, your design decisions recorded in commit messages, your ADRs, your prompt logs showing deliberate redirection, these may be protectable as human-authored expression even if the code they produced is not. Protecting what you can starts with documenting what you actually did.

What your employer probably already owns

Before you think about whether your code is copyrightable, there is a more immediate question: Even if it is, is it actually yours?

Your employment contract almost certainly says that anything you build at work belongs to your employer. That principle has a name in copyright law: the work-for-hire doctrine. Under it, any code created by an employee within the scope of their employment is owned by the employer, who is treated as the legal author, regardless of whether the code was written by hand, generated by Claude Code, or some combination. Using an AI coding tool during work hours, on a work project, on a work machine, does not change who owns the result.

Most employment contracts go further than the doctrine’s defaults. Look for a section in yours called “Intellectual Property,” “IP Assignment,” or “Work Product.” Open the contract, search for those terms, and read that section. A clause that says any of the following almost certainly covers your AI-assisted code:

“Any work product created using company equipment or resources”
“Any invention or development made during the term of employment”
“Any software created with the assistance of company-licensed tools”

The third one is the one to watch. If your employer licenses Claude Code, Cursor, or Copilot for the team, and you use those same tools to build a side project, a broad IP assignment clause may give the employer a claim over that project, even if you built it on your own time.

A senior developer in San Francisco described exactly this situation earlier this year. He had used Claude Code for work projects and for a personal fitness tracking app built on evenings and weekends. His company updated its IP policy and claimed everything he had built with AI assistance, including the personal app, arguing that because Claude had access to open work files in the IDE, any AI output was a derivative work of company IP.

This is the clearest example of how far this can stretch. His company’s claim rested on one phrase: The AI tools were “context-aware” of his company’s codebase. The argument does not hold up legally, because context visibility in an IDE does not make AI output a derivative work of files that were open nearby, and the connection between what Claude can see and what it generates is probabilistic pattern completion, not copying. But the argument illustrates what employers are starting to claim. If the clause is broad enough, it has surface validity regardless of what the AI actually did.

The practical rule: If you are building something on the side, use a personal account, a personal machine, and tools you pay for yourself. Keep your employer’s licensed tools out of that workflow entirely.

The open source contamination problem

Even if you own your AI-generated code, you may have already contaminated it with an open source license you cannot see.

AI coding tools are trained on massive amounts of public code, including code licensed under the GPL, LGPL, and other copyleft licenses. Copyleft licenses carry a specific obligation that travels with the code:

If you distribute software that is a derivative of GPL-licensed code, you must release your own source code under the same license.
This applies even if you did not know the code you incorporated was GPL-licensed.
“I did not know” is not a defense to a copyleft violation.

When an AI tool reproduces a substantial verbatim portion of GPL-licensed code from its training data, and you ship that code in a commercial product without releasing source, you may have created a copyleft violation without ever touching the original repository. The legal standard for infringement is substantial verbatim reproduction, not functional similarity or resemblance, and this distinction matters: an AI tool generating code that works like GPL code is different from an AI tool that reproduces GPL code word for word. The risk sits at the verbatim end of that spectrum, and the problem is that you have no way to know which side of the line your codebase is on without running a scan.

The chardet community dispute made this concrete in early 2026. This was not a filed lawsuit but a public dispute within the open source community that raised the question without resolving it legally. A developer used Claude to rewrite chardet, a Python character encoding library, and rereleased it under an MIT license, arguing that the AI rewrite was a “clean room” implementation free of the original LGPL license.

The legal question the community fought over: If Claude was trained on the LGPL-licensed codebase and its output reproduces substantial verbatim portions of that code, can the output be treated as license-free? The chardet dispute did not resolve cleanly and no court has issued a definitive ruling on this specific question. What is settled is that verbatim copying of GPL code violates the license regardless of how it was produced. What is unsettled is whether AI-generated output that reproduces training data patterns counts as verbatim copying. The working assumption among lawyers advising companies through M&A is that it probably does, and that assumption is now showing up as a standard condition in acquisition due diligence.

The Doe v GitHub litigation, still working through the Ninth Circuit as of April 2026, is asking whether GitHub Copilot reproduces licensed code without attribution in violation of copyright law and DMCA Section 1202. The district court dismissed most claims but the appeal is live. Whatever the outcome, the litigation has already changed industry behavior: GitHub Copilot added duplicate detection filters, and acquisition due diligence now routinely includes an AI codebase license scan.

What to do about all of this

Four concrete actions, none of which require a lawyer.

1. Run a license scan on your AI-assisted codebase

Tools that do this well:

FOSSA—most comprehensive, widely used in enterprise
Snyk Open Source—good for dev-team workflows, integrates with GitHub
Black Duck—standard in M&A due diligence

Each will scan your codebase, flag code that matches known open source libraries, and identify the licenses attached. If you are shipping a commercial product and have never run one of these, you are operating on assumption. The scan takes an afternoon and costs less than the first hour of a copyright dispute.

2. Document your human creative contributions as you go

The evidence that establishes meaningful human authorship is the same evidence you already produce in a normal engineering workflow. You just have to keep it deliberately rather than letting it disappear.

What to preserve:

Commit messages that describe what you changed and why, not just what the AI generated. “Restructured Claude’s module architecture, rejected initial state management approach, rewrote error handling from scratch” is evidence. “Add rate limiting module” is not.
Prompt logs. Claude Code and Cursor both retain interaction history. Export or screenshot the sessions where you made significant architectural decisions.
Design documents, ADRs, or any notes that predate the generated code and show you specified the structure before the AI built it.

The second commit message versus the first is the difference between a defensible authorship claim and a clean “Claude wrote this” record.

3. Read the IP clause in your employment contract before you build anything on the side

Open your contract, search for “intellectual property,” “IP assignment,” or “work product,” and read that section carefully. The specific language determines your exposure:

“Work product created during employment hours” is narrower than “work product created using company resources.”
“Relating to the company’s business” is narrower than “any software development.”
“Company-licensed tools” is the phrase that captures AI coding tools even on personal projects.

If the clause is broad and you want to build something independently, you have three realistic options: negotiate a written carveout before you start (easier at the start of a new role than mid-employment), use entirely personal tools on entirely personal time on a personal machine, or accept that the claim exists and decide whether the risk is worth it.

4. Check which Anthropic plan you are on before shipping for commercial use

Go to anthropic.com/legal and compare the consumer terms against the commercial terms. The difference that matters:

Consumer terms (free and Pro plans): Anthropic assigns outputs to you, but the IP indemnification is narrower and covers fewer scenarios.
Commercial terms (API and enterprise): Anthropic assigns outputs to you and will defend you against copyright infringement claims arising from your authorized use of the service and its outputs.

If you are shipping AI-assisted code in a commercial product using the free or Pro plan, the indemnification gap is real. The API or enterprise agreement is the appropriate tier. Note that neither indemnification covers a downstream GPL violation from license contamination in your codebase. That is your governance problem to solve with the license scan in action 1.

The thing worth sitting with

Anthropic’s own lead engineer publicly stated that his recent contributions to Claude Code were written entirely by the AI, and the leaked codebase that Anthropic issued 8,000 DMCA takedowns to suppress may be predominantly AI-authored. Whether Anthropic’s copyright claims over that codebase are legally valid remains an open question no court has yet resolved.

If the company that built the tool cannot cleanly assert copyright over its own AI-assisted code, the question of whether you can is worth taking seriously before it becomes relevant in a transaction, a dispute, or an acquisition conversation. The developer who documents their creative contributions from the start is in a meaningfully different legal position than the one who accepted three thousand lines of Claude output and merged without review, even if both shipped the same product.

A note on what this piece covers and what it does not

Three things in it are settled law:

Works lacking human authorship are uncopyrightable,
The work-for-hire doctrine applies regardless of how code was generated.
Verbatim copying of GPL-licensed code violates the license.

Two things are emerging consensus without definitive court rulings yet:

How much human direction is enough to establish meaningful authorship in an agentic workflow
Whether AI output that reproduces training data patterns counts as verbatim copying

One thing is genuine speculation:

Whether any of this will be litigated at scale in the near term

Most code copyright claims never reach court. The place where the unsettled questions become concrete today is M&A due diligence and institutional fundraising, where acquirers and investors are already asking these questions as a condition of closing.

If neither of those applies to your situation right now, the four actions above are still worth doing, but the urgency is lower than the piece might imply.

This Week in AI: The Next-Gen Recommendation Experience

Michelle Smith — Fri, 12 Jun 2026 14:18:19 +0000

This week Miguel Fierro, a former Microsoft principal researcher who recently founded his own company, RecoMind, joined data and AI evangelist Christina Stathopoulos to talk about the state of recommendation systems. Christina also ran through the latest AI news she’s been watching, from Anthropic’s continued rise to responsible AI, announcements from Google’s I/O 2026 conference, and (continuing the discussion from last week) the growing backlash against tokenmaxxing as a productivity metric. Here are three takeaways from the conversation.

Recommendation systems are a bigger deal than most companies realize

Miguel has spent the better part of a decade building recommendation systems for enterprise customers at Microsoft, and he thinks most companies are leaving a lot on the table by not paying closer attention to recommendations. Amazon generates roughly 35% of its revenue through recommendations. Netflix attributes 75% of content consumption to them. Best Buy credits recommendations with 24% of revenue. TikTok’s entire user experience is a recommendation engine. And yet many large retailers he worked with at Microsoft weren’t investing seriously in the area, often because they weren’t tracking the value it was generating.

The gap between the top tier and everyone else is wide and getting wider. The most advanced systems today treat user behavior as a sequence prediction problem, similar to how large language models predict the next token. Rather than just encoding clicks, they encode all user actions into embeddings, run sequences through those representations, and use huge 1.5 trillion-parameter models to predict what a user will want next. That’s not something a mid-tier retailer can replicate today, but it signals where the field is heading.

Even if you don’t work in a top well-resourced company, you should still pay attention to the convergence of search and recommendations into a single personalized retrieval layer and the early application of foundation models to recommendation problems. Netflix has built what Miquel described as the only published foundation model in this space; Meta is rumored to be developing one as well. The barrier is data, particularly for smaller organizations. Unlike text, behavioral interaction data isn’t publicly available, so building at that scale requires both proprietary datasets and serious compute.

If you want to get your hands on state-of-the-art implementations, including knowledge graph-based approaches, without starting from scratch, Miguel suggested the open source Recommenders library, originally developed at Microsoft and now housed under the Linux Foundation, as a practical entry point.

The agent hype has a recommender-shaped hole in it

Miguel drew a distinction between true sales agents and what most companies offer today, which are usually just conversational agents. A conversational agent responds to what you say. An agentic sales system understands a customer, anticipates what they want, and surfaces the right product or offer at the right moment—and that requires a recommendation system baked in.

If your “agent” is a chatbot with access to a knowledge base, it’s not doing recommendation. Recommendation systems need training data, a retrieval layer, and a personalization model, none of which you get for free from a foundation model API. A language model can answer questions about a product catalog, but it can’t offer up personalized recommendations unless it also has a model of the customer’s preferences, history, and likely next action. Most companies don’t have the infrastructure in place to make that possible. . .yet.

The responsible AI conversation has left the research community

What’s notable about the responsible AI conversation right now is the range of institutions offering their perspective. Anthropic, alongside announcing a funding round pushing its valuation toward $1 trillion, urged a global pause on AI development tied to the risk of recursive self-improvement: systems that can design and develop their own successors. The Future of Life Institute published The Better Path for AI, a framework arguing for capability development oriented toward human benefit rather than human replacement. And the pope issued a formal encyclical focused on AI and the common good.

None of these institutions is making the same argument, but the convergence of their attention matters. Responsible AI used to be a specialized conversation happening largely within research labs and a small set of policy organizations. It’s now a topic where major AI companies, religious institutions, and civil society groups are all staking out public positions in the same news cycle.

For the technical community, this creates both pressure and opportunity. “We’re thinking about safety” is no longer a sufficient posture; external scrutiny is intensifying from directions that don’t share the field’s assumptions or vocabulary. But the broader conversation creates real demand for practitioners who can translate between what responsible AI actually requires in practice and what policymakers, executives, and institutions are trying to figure out. That translation work is increasingly where the field needs people.

What’s next

Join us Monday morning for the next episode of This Week in AI, where YK Sugi and John Lindquist will break down the massive structural and financial shifts reshaping the technology industry. (They’ll also chat about the recent release of Claude Fable 5.) And on July 23, Christina will be hosting the AI Superstream on AI harnesses, a four-hour event focused on agentic AI and the frameworks practitioners need to move from models to agents. Both are free to attend. Register now to save your seat.

For deeper reading on topics covered this week, Christina recommended three titles available on the O’Reilly learning platform: Hands-On LLM Serving and Optimization, Hands-On RAG for Production, and Large Language Models: The Hard Parts. Not a member? Sign up for a free 10-day trial to check them out.

We’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, Apple, or wherever you get your podcasts.

Generative AI in the Real World: Agentic Systems Fundamentals with Maarten Grootendorst

Ben Lorica and Maarten Grootendorst — Thu, 11 Jun 2026 17:58:23 +0000

BERTopic creator and Google DeepMind developer relations engineer Maarten Grootendorst has spent years helping practitioners build intuition for how AI systems actually work—not just how to prompt them. Maarten joined Ben Lorica to cover the enduring relevance of embeddings and topic models in an LLM-dominated world, his hot take that agents are essentially just an “LLM in a for loop with some tools, some memory, and perhaps some guardrails,” and what separates genuine agentic behavior from a well-constructed pipeline. They also get into the practical trade-offs between open weight and proprietary models, the future of state space models and attention, and why Maarten worries that a generation of builders shipping code they can’t read may be storing up technical debt they can’t repay. “If you don’t really know how an LLM works,” he says, “that intuition [about how to use it effectively] is much more difficult to develop.”

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform or follow us on YouTube, Spotify, Apple, or wherever you get your podcasts.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

0.50
All right. So today we have Maarten Grootendorst. He is a developer relations engineer at Google DeepMind, and he is also the coauthor of two O’Reilly books, Hands-On Large Language Models and An Illustrated Guide to AI. And so, Maarten, welcome to the podcast.

01.10
Thank you. It’s wonderful to be here.

01.12
So, I had you on the podcast—I was looking at it earlier this morning—August 2022, a few months before ChatGPT was released.

01.23
It’s been a while. [laughs]

01.25
Yeah. Back then, what I wanted to talk to you about was, I was a user of your BERTopic library. For listeners who are not familiar, BERTopic was kind of a marriage between the transformer approach with topic modeling and Maarten wrote one of the more popular libraries for doing that. Actually, what’s happened to this whole topic of topic models?

01.58
Oh, yeah. I think it’s still going strong. You mentioned ChatGPT. So a lot of people say, “OK, just use that for topic modeling.” You can. It’s just very difficult to make sure you get a more structured, standardized output rerun thing, especially if [you have] millions of potential documents. And you can still use that on top of that. It’s still my baby of sorts, right? I mean, it’s been four years since we talked, and. . . I love working on that. I don’t have that much time to do it anymore, but it’s great.

02.36
Yeah. So I think one of the things that these large language models have done is kind of, I guess, cast by the wayside some of these earlier approaches for really wading through a lot of text. Unfortunately, I think people, as you mentioned, are trying to prompt their way into a topic model. But I think topic models themselves are still very useful. So one question to you, Maarten. What’s the level of usage of BERTopic now compared to when we talked?

03.13
It’s only grown since then.

03.17
Really?

03.18
Yeah. It surprised me too. [laughs] I think it’s because it’s easy to use. I did some, I think, cool tricks in there, but other than that, I think the main benefit was mostly just a nice user experience. And that helps people use something for a very specific task instead of trying to prompt your way towards something that might or might not work, and you still have to iterate over that. It just works out of the box. It’s not perfect. Nothing is. It’s not a free lunch. But yeah, I think that’s it.

03.55
One thing that’s happened, of course, is that this whole area of AI and NLP has gotten so democratized that. . . When we talked, I think the people who were using BERTopic at least had some notion of what NLP was and what text mining was, right? I would imagine now, in your role as a developer relations person, you encounter a lot of people who don’t come from a data science or ML background. And so they have no clue what topic models are, I would imagine.

04.34
Yeah, many don’t. It’s very interesting to see because you mentioned NLP and text mining and, well, [they’re] completely outdated terms now for some reason. It’s all AI. Let’s just call it AI and be done with it. [laughs] That’s not necessarily a bad thing, don’t get me wrong. It’s just very interesting to see how the field has evolved, but that also means that people don’t really look towards these “older techniques” that still drive much of the adoption of newer stuff.

Sometimes it feels like that, you know, AI and LLMs. . . It’s a hammer and we’re looking for nails to actually use it instead of, “OK, but we have packages for very specific things, and you can use LLMs on top of that.” You don’t have to. But it requires a bit of education on that end, because like you mentioned, a lot of people new to the field, you have to explain, “What are embeddings? What is clustering?” It’s also very interesting to see that even something like that needs to be explained a little bit in more detail. It’s a nice opportunity for me to explain stuff. I like doing that.

05.48
And the key here is that because a lot of people are entering this field and building things and they don’t necessarily know the prior art, so to speak, it seems like they might be leaving a lot of things on the table. Right? So in terms of, here’s my text or my data, I am just going to prompt and I think that I got everything out of it, but that’s not really the case for the most part.

06.24
No. Definitely not. There’s so many things that you can do with these systems, whether it’s on the LLM side or the agentic side or the topic modeling side. If you just know a little bit more on what’s going on under the hood then that helps you understand “When do I prompt? When do I not prompt? What’s going wrong?” That feeling, that intuition. You don’t just get it with building. Building’s very important, but if you don’t really know how an LLM works, that intuition is much more difficult to develop.

06.59
Which brings me to your two books, which are fantastic, which I think go a long way into helping people get that foundation. But let’s face it, a lot of people, Maarten. . . So let’s take your earlier book with Jay [Alammar], which is Hands-On Large Language Models. A lot of people may say, “I don’t have time to read this whole book.” So for someone who is a developer, doesn’t have a data science or ML background, what would be the most important concepts for large language models? Drill down on these three or four concepts that will set you up for success.

07.49
From the top of my head, those are chapters two and three. So buy the book now. [laughs] I’m just kidding. Tokens. Super underappreciated.

08.03
Which now is a big topic because, as I joke, the CFO has now become the CTO, the chief token officer.

08.11
I didn’t know that one. That’s amazing. I’m gonna use it. But, yeah, tokens are now the thing, right? It’s what LLMs use to see the world, so to say—to interpret the world. And it’s how they communicate with the world. So it’s really important to know what tokens are. It helps you get into the realm of embeddings, which I still think is super fundamental to so many things we do.

And the second part is kind of an obvious one, but the attention mechanism, “Oh, wow. Why are these things so strong? What makes them so special?” Attention is an obvious one. We have other things like Mamba, recurrent neural networks, but it all starts from attention. So if you’re completely new to this field, those two. Yeah.

08.58
Let’s take the topic of embeddings. I think at least that topic, Maarten, some people have had to play around with it, right? Because when LLMs first came online, the “Hello, World!” example was RAG, and one of the knobs that people were tuning was embedding, obviously chunking, so the information extraction, the search and retrieval—they’re all important. But one thing that people immediately tried to play around with was embeddings because they could go to places like Hugging Face:
Hey, let me try these four different embeddings.” Do you find that embeddings have a special place in that more people play around with embeddings and have some rudimentary understanding of embeddings?

09.50
I have a sweet spot for embeddings because it’s the main part of BERTopic. But I think it’s so fundamental to so many things that we do in this field. Even things like RAG—which some people think is outdated. It actually isn’t. It’s very much alive and still kicking—runs on embeddings and understanding how they work will also help you understand how LLMs work. And it can be used in so many different ways.

Sometimes we’re looking for bigger embedding models, more contextualized information. Great. [They] have their own purposes. And there are now certain parties focusing a little bit more on these static embeddings that are super fast and quick, like the old school embeddings that we used to have, and now in a new form that can be used in conjunction with coding agents to quickly search through repos and find the information that they’re looking for. Much of what we do is still search, and search revolves in big part on embeddings. And it’s just nice when you have text that you have one numerical representation for it—just that gives you so many opportunities to do so many cool things. . .

11.18
So when you’re trying to convince someone, Maarten, that “Hey, you should learn more about embeddings, because they’re important,” is there a canonical example that you use to say, “Hey, look, if you just understood embeddings and you made this one decision, look at the change in your application.” Is there a canonical example that you go to?

11.40
Oh, yeah, I love the question, but I don’t think I have an answer to that. Because, OK, so I’m a psychologist and I really like to say “it depends on,” and here it kind of depends on the application that you’re running, obviously. Contextualized versus noncontextualized embeddings is a very interesting example because the contextualized ones are generally larger. But there’s larger transformer-like models that require a lot of compute to run. So you can see the latency actually appearing in your search engines. Or if you connect your coding agent to one of those, it slows down because, you know, it needs to wait for the search compared to the faster static ones, for instance, like Model2Vec and stuff like that, which are tremendously fast. So amazing for those use cases, not that performance because they’re way smaller, obviously. And it’s these use cases where the building does get you a lot of intuition about when to use what instead of relaying that decision only to an agent. You’re still the one that needs to have the feeling, that gut feeling, to say this works better for my use case.

13.03
But I would say the reality is that people will go to some leaderboard.

13.09
Yeah. That’s just the way it is.

13.13
So there we go. OK. So in this leaderboard here are the top 10. In this top 10, there’s some that look larger than the others. So I’ll try three or four of varying sizes. Is that a fair characterization of what normally happens?

13.32
Yeah that’s even what I always did. Just you know, top of the leaderboard, pick one or two. But then as you are more experienced with picking one, what about multilinguality? I’m Dutch. There aren’t that many very good Dutch embedding models—big problem there. There are things like matryoshka embeddings, where they’re embedding one embedding model, but they generate embeddings of different sizes for different purposes, which is also very interesting. So there’s all these types of small decisions and nuances that you can make. And we now have instruction-tuned embeddings, where you prefix it with an instruction that you want an embedding for clustering or for classification or for what have you. And then you suddenly see the nuances in selecting something.

14.27
So on the attention mechanism, again, I will play the role of someone who has no time. I don’t have time to read the chapter, Maarten. What are one to three things I should know about the attention mechanism?

14.44
I think the most important thing about the attention mechanism is it contextualizes information. That’s by far the most important thing. When you look at the world before attention and after, it’s a little bit less black-and-white, obviously, but it puts stuff into context. You know, if you have the word “bank,” is it the bank of a river or a financial bank? And as we talk now with each other, there’s a lot of contextual stuff going on. You need to interpret what I’m saying, because if you only focus on what I say, you don’t know that that was actually a question beforehand that drives my answer. And I think that’s what makes attention so special. It tries to look at the entire thing instead of individual tokens or words.

15.34
Playing devil’s advocate, so you just explained it to me. Why do I have to learn more than that? [laughs]

15.40
Always learn more. [laughs]

15.44
Yeah, yeah, yeah. So you mentioned Mamba and the state space models. There was some excitement around them. So maybe give our listeners a high-level description of what these state space models are and what their current status is in the wild in terms of actual practical usage.

16.08
State space models are a completely different way of approaching this attention mechanism, right? It almost does away with it and replaces it with something that is much, much faster. It’s a very complex and highly technical subject, so I don’t want to go too into that because it’s really confusing. [laughs]

So what you see happening is that people replace attention mechanisms. So you have a decoder and LLM, and it has several stacks of attention mechanism normally. What you can do is you can remove half of them with the very quick state space models that help speed up the inference—because that’s what we’re mostly bound now by, is inference speeds. People want more, more tokens. So it needs to be faster. So it’s, it’s a way to make it quicker.

17.13
Yeah. And so what is the actual implementation or adoption of state space models right now?

17.21
Mostly hybrid models. Models, stats, interleave the attention blocks, the decoder blocks with Mamba blocks as a way to make it faster, where some do it with, for example, local attention and global attention—one is more compute-intensive than others. Mamba is a way to do something similar, as a way to speed up that inference.

17.51
Your latest book is about agents: An Illustrated Guide to AI Agents. Before we dive in, in your mind, what makes a system truly agentic? In other words, before we started bandying around the word “agents,” people were using the term “robotic process automation” or something like that. So in your mind, what makes a system agentic?

18.22
That’s actually been one of the more complex topics for us to actually describe, because the field has been changing so quickly. And what is fundamentally an agent when they change it every two months? It’s a little bit of a hot take, but I really do think that an agent is an LLM in a for loop with some tools, some memory, and perhaps some guardrails. And that really is essentially all it boils down to at its base.

18.55
You just described the harness basically. The hot term right now is harness engineering. So what is the real progress and what is just marketing when it comes to agents?

19.19
Yeah, I agree very much with what you imply here because agents sound so cool, and they are cool, but the moment you give an LLM complete freedom, no constraints, just go off and do your stuff, it will fail horribly, horribly, horribly. Agents still need. . . And we can call them guardrails, but you can call them something else. They need direction. They need to be constrained a little bit in the things that they do. So yes, agents, there’s a lot of hype around that. I’m not a big fan of hype. It is what it is. But there are a lot of cool use cases for it because there’s a reason why coding agents are now the big thing. I’m using them myself daily because they make my life easier. But when we look at other use cases, we’re so early in AI progress. Yeah, coding works very nicely. But to ask an agent to book a vacation for me. Yeah. No.

20.35
It seems like that example of “I want to go on a trip. This trip will involve staying in five countries. And I want you to pick the best hotel for every country.” always was kind of the demo even during the robotic process automation. And as you alluded to, I don’t think we can do it quite yet. So here’s another family of agents, Maarten, that a lot of people are using now: deep research agents. Would you consider deep research an agent?

21.15
Maybe. It kind of depends on how it’s implemented. It depends. I’m sorry. I’m going to do that a couple of times, but. . . You can make it very structured, where you say, “OK, do the search on the archive, read the abstracts, make a summary. That’s it.” That’s not really. . .

21.38
It fits into your description in that you’re prompting an LLM. The LLM goes on a for loop where it uses as tools a search index, a knowledge graph. . .

21.53
Fair enough. Yeah. It makes the decision on its own when to use a tool, why to use a tool. Whereas you can also put it in a pipeline where you specifically say, “I always want you to do steps one, two, and three.” And an agent might decide to say, “OK, I’m going to do step 3, 3, 1, 2, 1, 3.” Decide on its own when and where to use specific tools. I think that’s maybe the best distinction you can make on what is and what isn’t an agent.

22.26
And then I guess it depends on the implementation, as you mentioned. But memory could also fill a role there, especially. . . Let’s say I’m using only one service—Google or Perplexity. Maybe it remembers over time what my preferences are. I don’t know if they actually implement it that way. But there’s potentially that aspect.

22.53
So how we phrase it in the book at least, we say, “OK, an agent is a reasoning LLM that has access to planning, tools, and memory,” because there’s no such thing as an agent that goes off and does three steps of something only to forget what the previous steps were. So I think memory is maybe a little bit underappreciated in the realm of agents, because imagine it has to go through an entire codebase and translate it from Python to C++ or Rust or what have you. It’s a very common example of things people want to do. That requires hundreds of steps to do, because it’s potentially a large codebase. How does it remember what it did when it did what, what the current state is, what what’s changed, etc., etc.? And you can write that in a Markdown file. That’s nice, but it also needs to understand, “OK, what’s the trajectory that I went through?” And you can do a lot of cool stuff with that trajectory, because that’s essentially the memory of an agent.

24.02
In your role in developer relations, I assume you talk to a lot of people who work in different companies. We’ve mentioned coding agents; we mentioned deep research. So what are some of the more common agents that people are building? They could be internal or external facing. So what are some of the more common agent types, I guess, that people are building?

24.29
Aside from the obvious, it depends on the industry. I do see coding agents actually being done quite a bit internally. Just trying to see how they can prevent data from being leaked elsewhere. Because a lot of processes now are very privacy sensitive. I came from healthcare before I joined DeepMind. And what you see in these kinds of fields is that, especially in Europe. . .

25.06
I imagine if you’re in finance in a hedge fund. . .

25.09
So yeah, same. . . And these are situations wherein people focus a lot on privacy and making sure that everything’s constrained within their environments. And you see a lot of people playing around with LLMs and then using harnesses—can be Hermes but also [taking] a more foundational agent and build[ing] stuff around that. Or the larger organizations that, well, just use whatever cloud offering there is and use an agent there. We’re so at the beginning of all of this. [laughs]

25.50
For me, the area where I see it being used—and this is not going to be a surprise to our listeners—is still the technical team bucket, which would be DevOps, data engineering, platform engineering. . . They’re building agents to help them do the work. But you might be interacting with a large website, and in the background, there’s a bunch of agents doing a lot of heavy lifting, moving data around for you to get the answer you want or whatever, or internal processes. But DevOps, I think they’re starting to build their own agents. I think, data engineering for pipelines, they’re building their own agents. I would imagine the people in security teams are also building agents because they have to go through lots of log files and. . .

26.55
A question for you then: Are they building agents, as in, you know, fully an agent, or are they building skills? Because I’ve seen a lot of people more focusing on creating skills and giving that to whatever agent is available. Or do you also see a lot of people actually building agents from scratch?

27.17
I think internally there are people who are building what we would consider agents in the sense that it would do a huge chunk of their normal work and they interact with it with prompting, but maybe they don’t consider it completely autonomous. So in the sense that many people who use coding agents, at least, the ones who know how to code, as you might still test and read some of the code, right?

27.50
Sometimes. Sometimes. [laughs]

27.52
Our listeners may be sharp, but there’s huge cohorts of people using coding agents who don’t know how to code or who are building websites and web applications. So in the data, in the DevOps, in the data engineering field, the kinds of agents they’re building are somewhat similar to the coding agents in that they’re doing a lot of the work, but they still have guardrails. I would say they’re still human-in-the-loop. Now, there’s also agents in the nontechnical fields, but they’re a little more. . . Maybe to your point, maybe they can be better described as skills, for example, in marketing or sales. Internally at some of these companies, they’re building things to help these teams be more independent from IT.

29.01
So yeah, you see mostly and we can call them skills, but we can also call them workflows or pipelines or just prompts. . .

29.10
Imagine you’re a marketing analyst at a big Fortune 500 company. And your job used to be to manage a bunch of ad campaigns and online campaigns. That was very manual, and so now you can automate a lot of that work. And then you might still have a dashboard where you can kind of see what’s going on. But the things that used to drive you crazy, now you can focus on other things.

29.46
But I am curious about the long-term effects of all of this, especially when, as you mentioned, a lot of people code without knowing how to code. I think that’s fun for a while but in the long term, stuff breaks and you don’t know where to start.

30.01
I don’t know about you, but I’ve come across people who literally don’t know how to code, who built a website, starting to have customers. Customers will file support questions or they say, “This part of your website doesn’t quite work.” Since they don’t know how to code, they go back to the same coding agent: “Hey, fix this.” The coding agent says I fixed it. They go back to the customer: “It’s fixed.” The customer goes, “It’s not fixed.” And so then this is when they start going “I need to hire someone to actually. . . Because now it actually needs to be fixed. And the holding agent can’t fix it.” So there are obviously dangers to going kind of completely wild on these technologies.

So open weights versus proprietary. This might be a sensitive topic to you because you have Gemini, but you guys also have Gemma.

31.09
I work on Gemma. Ask me everything about Gemma. [laughs]

31.12
[laughs] In your work—or not in your work, but in your day-to-day life, talking to friends, traveling, in your dev rel hat, what is a level of interest in open weights?

31.27
Oh, a lot, yeah. That’s for the most part because I’m in Europe. And Europe loves to say, “OK, we want to own things. We don’t want to push it over to someone else.” So there’s a lot of interest for open weight models. It’s way more than I initially thought because there was quite a big performance gap when ChatGPT came out, 3.5. But now they’re closing in. These models are extremely capable. You can run them on MacBooks. I mean, when Claude came out, I’ve seen so many threads of people buying Mac Studios just to be able to run whatever local LLM they have. So you see it in every part of the field, whether it’s very large organizations or very small, finance, healthcare, what have you.

32.25
One of the challenges with open weights is open weights is a business decision. And business decisions can be reversed. Meta Llama may no longer produce open weights. Alibaba—kind of mixed signals there. Some of the Chinese open weights providers are starting to send mixed signals. So it’s one thing to release an open weights model. But as you know, in this environment you have to release models at a regular cadence and that starts getting expensive. So I guess one of the challenges there for our whole community and industry is, you know, where is the steady supply of open weights models going to come from moving forward? Because basically, like I said, it’s a business decision, and a business decision is going to be reversed.

33.28
No, I agree on that. So in the general sense, that’s what we see happening. Some organizations stop doing open source, [or] less of it, focus on different things. It’s understandable in a way, because, you know. . .

33.45
And, you know, one of the obvious advantages of open weights is you can take the weights and run it in your cluster. And so you have control if. . . One of the things that annoys a lot of these enterprise teams is OK, so I’m really optimized for Claude 4.5. And then, hey, they are deprecating Claude 4.5, you know. So here at least you have control. And I think one of the things that most teams are starting to realize, Maarten, is actually I can use open weights for a lot of things because. . . Let’s say it’s so focused, like a simple sentiment analysis or whatever. I don’t need the most expensive models. And this I can control moving forward. So I think people and teams are discovering, “Hey, while I should be concerned that these open weights models may stop getting released, for some, for many of my tasks, maybe I don’t need the latest and greatest anyway.”

34.52
That can be the case. Yeah, because these models are very capable. I think there will always be a steady supply of open weight models. If we look at the status of the field now, many. . . Obviously Qwen, they’re doing an amazing job. Needs to be said. Same with Gemma, they’re also doing well.

35.14
The Qwen team lost a bunch of people, and I think there’s some worry that Alibaba may back off from. . .

35.23
I think they will continue. I don’t know, obviously, but I think it’s still a very good strategy to do.

35.30
And wait, Gemma is not as good as Gemini. [laughs]

35.33
We have good benchmarks. What is this? What is this? [laughs] No, but they serve different audiences. And what we see happening with open weights is you get so much back from giving open weights to the community. And DeepMind is a nice example. But the more labs obviously that have always given a lot to the community, when you do that, you also get a lot back, right? Because if people are super excited about Gemma 4—we released a model two days ago, 12B-1. And you see people using that for a lot of cool use cases. Driving research to create new things that, you know, we might not have thought of. That can be the case. You see Flash, for instance, which is a diffusion-based drafter, super fast, very incredible being used with Gemma 4. That’s cool. And it’s not to say that Gemma was the first one that drove that, but open weights in general allow a random person somewhere without access to thousands of GPUs to pretrain a model and still be able to do very cool and interesting research. So as long as I’m at DeepMind, I’m gonna make sure we’re gonna keep doing very cool Gemma stuff.

37.03
All right, so let’s close with a rapid fire round. So for each question, keep your answer under a minute. Question number one. OpenClaw. What says you, Maarten, about this trend around personal agents?

37.21
I love personal agents. They’re very cool and interesting. And at the same time, I’m very worried about the security of it. We’re seeing a lot of people’s keys being opened up, things that are being deleted that shouldn’t be deleted. And that’s because we’re in very early stages of all of this—just a little bit more time, and then it will be amazing.

37.46
Yeah. And run it locally with Gemma. [laughs]

37.50
Yeah, of course. [laughs] I’m not gonna sell too much. I love Gemma, I’m selling already too much.

37.57
Question number two: reinforcement learning. I’m a big fan. I always push out a post once a year at least, where I say it’s just around the corner. Now it seems like there’s a bit of a comeback with reinforcement, fine-tuning. Are you paying attention to reinforcement learning?

38.21
A lot. I have a couple of colleagues, and we started something called the RAG Pack with some bigger influencers, like Jay Allamar and Josh Starmer from StatQuest. And we did a course on reinforcement quite recently. It’s such a cool technology. It’s the technique that makes LLMs the way they are today. And there’s still a lot of new things coming up in that field to make them faster, more capable, multituning trajectories. Yeah, it’s the whole thing.

38.54
Third question: scaling loss. So Anthropic in particular is big on scaling loss: bigger models, more data, that’s the road to better and better models. So what’s your feeling right now about scaling loss.

39.11
They change quickly. We started with regular “more parameters, better model.” Then we switched to reasoning, where we said “longer reasoning, better model.” And now we’re slowly going towards the “longer trajectories, better model.” You know, more is better. I think they’re interesting, but they’re changing now so quickly that I’m wondering in half a year what the new scaling law and the new nifty thing is going to be.

39.39
So in closing, data centers. Data centers are a hot topic in the US. A lot of communities seem to be coalescing around opposing the build-out of data centers. So it’s a bit of a complicated issue in the sense that, you know, assuming that these AI technologies work and they get adopted, we will need compute in order for people to have access to these technologies. Otherwise, maybe the rich are the only ones who will have access to AI. On the other hand, the data centers themselves, you definitely need local input because, electricity, water, noise. . . And then unlike factories, they don’t really produce a lot of jobs because how many people do you really need to run a data center with all the DevOps agents now that we talked about? So what’s going on in data centers in Europe?

40.43
We don’t like them. I’m saying we—I’m Dutch. If I’m saying for the people of the Netherlands, we don’t like them generally. And that’s going to be very interesting moving forward because there’s still demand for AI. I know there’s a lot of people that don’t like it, but at the same time, there’s still a lot of people using it, and we need to find a way to balance that out. There’s no way forward otherwise, and I really hope we can focus more on efficiency when it comes to these compute-heavy things. That’s why I focus so much on Gemma. They’re small, capable models that you run on your cell phone. That’s great. Without needing to have these large data centers, aside from training, maybe, but that will always be there. We have to be honest about that. AI is here to stay. We just need to make it more efficient.

41.38
And with that, thank you, Maarten. And by the way, closing note about data centers, for our listeners, there’s a lot of announcements, right? Several gigawatts are being. . . Contracts being signed. But if you really follow what’s going on, there’s not a lot of build-out. There’s not a lot of data centers actually being built in and coming online. So… Thank you, Maarten.

42.07
Thank you.

When Context Collapses: Teaching Agents to Detect and Recover from Lost Memory

Andrew Stellman — Thu, 11 Jun 2026 10:59:13 +0000

This is the eighth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, part four here, part five here, part six here, and part seven here.

“640K ought to be enough for anybody.”—Bill Gates (allegedly)

If you’re building AI agents that do complex, multistep work, you’re going to run into context loss. The agent’s working memory fills up, older information gets silently dropped or compressed, and the agent keeps going without realizing it’s forgotten something. This article, the third in my Radar article trilogy about context management, walks through a pattern I’ve been refining for detecting and recovering from that problem, which I call the externalize-recognize-rehydrate pattern (or ERR, which I think is actually a pretty good acronym for an error recovery pattern): save your agent’s state to files on disk, detect when context has degraded, and reload from those files to recover. The individual techniques are standard practice in agent and skill engineering—checkpointing, progress files, state verification—but the real power comes from combining them into a coherent workflow that you can use live or build into your agents. I’ll walk through each step with specific prompts you can adapt for your own agents and coding sessions.

Which brings me to memory. Gates has said on multiple occasions that he never actually said that quote at the top of this article, but it endures because it captures one of the core limitations of that era, one that people struggled with constantly, in a way that we can laugh about now. Around that time I was using a 286 with 1 MB of RAM. That’s megabytes, not gigabytes. MS-DOS 3.3 gave me 640K of conventional memory plus 384K of upper memory, and I spent a lot of time figuring out how to use every bit of it. I configured memory managers, loaded device drivers high, used (and wrote!) terminate-and-stay-resident programs that moved themselves out of conventional memory to free up space, and generally treated memory as a resource that required active, deliberate engineering. There was a lot I wanted to do that didn’t fit into 640K, and like most people at the time, I went to some lengths to compensate for the memory limitations.

We’re at the 640K stage of AI development. The context window is the new RAM ceiling. Most of today’s models give you somewhere between 200K and 2M tokens of working memory (and, like memory in the late 1980s and early 1990s, those numbers are growing all the time), and if you’re building agents that do complex multistep work, you will hit that ceiling. When you do, the AI starts compacting: compressing or dropping older parts of the conversation to make room. And just like running out of conventional memory on a 286, things stop working right and you’re not sure why.

In 20 years we’ll be looking back at today’s puny context windows and wondering how developers in the 2020s managed to get anything done with just a few million tokens. Because none of this is new. In case you don’t believe me, here’s a photo of my dad at Princeton in the early 1970s working on an Evans and Sutherland LDS-1 graphics computer, the first commercial vector graphics machine, connected to a PDP-10 mainframe:

The actual LDS-1 is in the large cabinet in the background, directly behind the monitor. Sitting next to it, just out of the picture, is an even larger cabinet that holds a memory unit with 16K of magnetic core memory (technically 8K words).

So you can imagine that just a decade later, 640K in a tiny PC that fit on your desktop seemed extravagant.

In the last two articles in this series (“Why Doesn’t Anyone Teach Developers About Context Management?” and “Your AI Agent Already Forgot Half of What You Told It”), I talked about what context is and why context management matters, and I shared practical techniques and prompts for keeping important information in files instead of leaving it in the AI’s context window. This article gets more technical. I want to build on those strategies and talk about how to build agents that can detect when they’ve lost context and recover from it on their own.

Brute-forcing my way through context loss

I’ve been doing this kind of context management for a while now, long before the specific tools I’m about to describe existed. But a recent crash gave me a clean example of what the process looks like in its most brute-force form.

I was working in Copilot with a seven-step plan, going through it one step at a time, having another AI review each step before moving on. Steps one and two went fine. When it came time to do step three and I gave it the prompt, it jumped straight to step four. This kind of thing can be really frustrating, because it seems like an AI smart enough to implement a complex feature in code should be able to (ahem) count to four.

The key to not getting frustrated when the AI loses track of steps or can’t seem to count from prompt to prompt is to remember what it’s good at and how it remembers things. If the AI you’re using does that, check the conversation history. You’ll probably see something like “summarizing conversation history” or “compacting conversation” somewhere above your last message. That’s telling you that the AI lost track of where it was because that count was literally purged from its memory.

AIs are good at carrying out an instruction. They’re bad at keeping track of their own state over a long conversation, and the way they manage their memory is a big part of that. This article is about finding ways to build your AI tools so you’re not relying on them to do the thing they’re worst at.

But compaction isn’t the only way your AI loses context. A few weeks ago I was deep into a long session with Copilot, working through a multiphase code review. I’d spent a while building up context with the AI about my codebase and the decisions we’d made together. I was about to move on to the next phase, and then I got this:

The entire context was wiped, which could have been a really frustrating problem, since I had a long history with the session, and it had built up a lot of knowledge about what we were doing. This turned out to be a bug in Opus 4.6’s interaction with Copilot’s conversation history, and I’ve seen other people hit the same thing. I was staring at a fresh prompt with nothing in it.

So I did something that, in retrospect, is a pretty good brute-force version of what this whole article is about. I recognized the context was gone (hard to miss when the whole conversation disappears). I copied the entire conversation out of Copilot and pasted it into a text file. Then I gave the new session a prompt:

We were in the middle of a long conversation, then I got an error and the entire context was wiped. I saved a copy of the conversation in #file:chat_history.txt, read it and bring yourself back up to speed.

And it worked! This brought the new session back to where I needed it to be.

That simple error and recovery actually outlines a pretty good pattern for dealing with context loss:

Externalize the state. Get the important information out of the conversation and into a file on disk, where it won’t disappear when the context window reshuffles.
Recognize the loss. Notice that the agent’s working context has been wiped or degraded, whether that’s obvious (like a crash) or subtle (like output that quietly stops making sense).
Rehydrate from the file. Point a new session at that file and let it rebuild its understanding from what’s written down.

The individual mechanics are well-documented across cognitive science (cognitive offloading, task resumption), software engineering (the Memento pattern, React hydration), and knowledge management (the SECI model). I’m not claiming to have invented any of them. But the specific abstraction of these three phases into a unified, named pattern applied to AI context management is, as far as I can tell, new. It’s synthesis and codification, not invention.

In this case I did it with copy and paste, which isn’t particularly elegant, but it worked for me. But this is a blunt instrument, because a raw conversation dump is both too much and too little: it’s too much because it’s full of noise, like tool calls, dead ends, back-and-forth that doesn’t matter anymore; and it’s too little because the context that got silently compressed away during the session is already gone. When you build these mechanisms into agents and skills, you can do it in a much more subtle and automated way.

Externalize: Add two layers of state to your agent

The idea behind externalization, or periodically saving your agent’s state, came out of a conversation I was having with an AI assistant while building the Quality Playbook, an open source AI coding skill that runs structured code reviews. The playbook runs a structured code review as a single process, but that process could easily turn into a 15-million-token request if you tried to do it all in one shot. I described in the previous article in this series how I broke it into six phases, and that was only possible because the context for each phase had already been externalized. Each phase reads its inputs from files, does its work, writes its outputs to files, and stops. The next phase picks up from the files, not from whatever the agent remembers. If this sounds like the familiar advice to ask the AI to plan before you ask it to implement, it’s the same principle applied to context management. Separating each step and persisting the output means you can inspect it, and the next step doesn’t depend on the agent’s memory.

But what should those files contain? I found that the AI is actually good at figuring that out. At some point I asked the assistant:

Would it make sense for the agent to record more context in files as it progresses, to make sure nothing is dropped along the way? It should work even if you break it into separate prompts, because the result from each step is persisted. Plus, we can audit its reasoning for debugging and improvement.

That prompt was all it took. The assistant designed the file structure itself: a progress tracker that records which phase is active and what’s been completed, a JSONL artifact file (JSONL is just a file with a bundle of JSON objects, with one record per line) where each pass appends its output, and a set of brief documents describing the purpose of each phase. You don’t need to overengineer this. Tell the agent what you’re trying to preserve and let it figure out the file layout.

What emerged falls into two categories that I think of as execution continuity and task continuity:

Execution continuity is the state the agent needs to resume work in the middle of a task: what step it’s on, what it’s completed, what decisions it’s made so far. These files change constantly as the agent works.
Task continuity is the broader context that doesn’t change during execution: what the whole task is about, what success looks like, what the structural constraints are. These files are written once and read at every resumption.

When an agent needs to resume after suspected compaction, it reads back both layers. The task continuity files anchor it back to what the whole endeavor is about. The execution continuity files put it back in the middle of the work. Together, they give the agent enough information to continue without relying on anything that might have been compacted.

The key is that externalization isn’t something you do once at the beginning of a task. You want the agent saving its state at frequent checkpoints so that if compaction happens mid-run, the most recent checkpoint is close to where the agent was working. Here’s the kind of instruction I gave the agent for tasks that processed records one at a time:

Update the progress file after every single record, not in batches. Write the output line first, then update the progress file with the new cursor and a fresh timestamp. If the progress file’s timestamp falls behind the output file’s, you’re batching and that’s wrong.

The frequency matters because context can compact at any point. If the agent only saves state at the end of a long run, compaction in the middle means losing everything since the start. If it checkpoints after every unit of work, the worst case is losing one unit.

Two-layer externalization survives context reshaping, not only outright context loss. Even if the agent’s context window isn’t full, if the context has been reorganized or reprioritized (a compression that reshapes without truncating), the agent can reload the external files and know for certain what the ground truth is.

Recognize: Detecting loss from inside the agent

The second step in the pattern is to recognize that your agent has lost context, and it turns out to be the hardest part (at least with today’s AI technology). When the context window fills up, the AI compacts silently, and the agent keeps working without realizing it’s lost information. The agent can’t tell you it’s forgotten something, because it doesn’t know it forgot. Detecting that change turns out to be a nontrivial problem; I’ll walk you through an approach that helped me, and keep it general enough so you can do the same thing. The copy-and-paste approach works when the context loss is obvious, like a crash that wipes your whole conversation. But most context loss isn’t that visible.

I described context compaction in the previous article, but it’s worth restating the core problem from the agent’s perspective. Different tools handle context overflow differently: Some truncate older messages; some compress conversations into summaries; some use a sliding window. But they all have the same effect. Information disappears from the agent’s working context, and the agent doesn’t get notified.

This was a challenge when I built the Quality Playbook, because it runs multiple passes over a codebase, each one reading source files, extracting requirements, and checking coverage. Each pass can involve enough work that it fills the context window multiple times over. And when context compacts mid-pass, the agent doesn’t know it happened. It keeps working, but the output starts silently degrading. So I started building mechanisms for the agent to detect compaction and recover by reading back the files it had written earlier. The patterns that came out of that work are general enough to apply to anyone building agents that need to survive context pressure.

From the agent’s perspective, compaction is seamless. It’s tracking state, referencing decisions made earlier in the conversation, and then at some point the earlier context is gone. But the agent can’t tell the difference between “I never knew that” and “I knew it but lost it.” It tries to reference something and finds nothing, or finds a compressed version that lost the nuance. And because the agent doesn’t know it lost anything, it doesn’t know it needs to recover.

This invisibility is the core problem. But it turns out you can work around it, and the next two sections walk through how.

Building a detection mechanism

Once you have files on disk, the question is what specifically to check and how to know when something has gone wrong. I landed on a mechanism while building the Quality Playbook’s requirement extraction pipeline. The playbook processes source documents in multiple passes, and each pass appends its output to a JSONL artifact file. After each unit of work, the agent also writes a progress record to a separate file: what it just finished, what it found, and where it should pick up next.

The detection mechanism comes from two rules I gave the agent. The idea is that the progress file tracks a cursor, which is just a position marker that tells the agent which record to process next. If the agent writes a record to the output file but then loses context before updating the progress file, those two files will be out of sync.

The agent didn’t need to understand any of that upfront; I just described the rules in plain language and let it figure out the implementation. The first rule establishes an invariant between the output file and the progress file:

Cursor advances only after the line is on disk. Write the summary line to the output file first, then update the progress file. The cursor must always equal the index of the next record that still needs to be processed.

The second rule told the agent how to check that invariant on startup:

On startup, read the progress file. Resume from its cursor value. Verify continuity: the last line in the output file should equal cursor minus one. If not, roll the cursor back to match disk state and report the discrepancy.

If the progress file says the cursor is at record 381, but the last line in the output file is record 379, something happened. The context compacted and the agent lost track of where it was. The divergence between the two files is the signal.

This worked because files on disk don’t change when context compacts. They’re written once and then read repeatedly. If what the agent thinks it knows doesn’t match what’s actually in the files, something shifted in the agent’s memory, not on disk. I ended up folding this check into a preamble that every session started with:

If this session has experienced auto-compaction, re-read the pass specification from disk. Do not try to reconstruct it from the compacted summary. Read the progress file. Read the last record of the JSONL artifact and confirm its index equals the cursor minus one. If not, roll the cursor back to match disk state. Disk is the source of truth. The conversation is not.

That preamble ran at the top of every session. During one particularly intensive day of pipeline development, I ran over a hundred Claude Code sessions with that exact instruction. Most of them completed without hitting compaction. But the ones that did hit it recovered cleanly, because the preamble told the agent exactly what to check and exactly what to do when the check failed.

The specific prompts I used are tied to the Quality Playbook’s file structure, but the technique generalizes. If you’re building any agent that does multistep work, you can adapt the same approach. Here’s a version you could drop into a session preamble or an agent’s system prompt:

Before continuing any task, read your progress file and your most recent output file. Compare them: does the progress file say you’ve completed work that isn’t reflected in the output? If so, trust the output file, roll back your progress to match, and note the discrepancy. Do not rely on what you remember from the conversation. The files on disk are the source of truth.

The wording doesn’t have to be precise. What matters is the structure: tell the agent where to look, what to compare, and which source to trust when they disagree.

But didn’t you just say the AI can’t detect its own compaction?

Right, and it can’t. What I described above isn’t the agent detecting compaction. It’s the agent running a deterministic check against files on disk and finding a discrepancy. The agent doesn’t need to know that compaction happened. It just needs to notice that two files disagree. Think of the agent as an amnesiac clerk. You don’t ask the clerk to remember what they did yesterday. You make the clerk check the physical ledger every time they sit down at the desk. If their notes disagree with the ledger, they’re trained to trust the ledger.

If you saw Christopher Nolan’s breakout movie Memento, you can think of your agent as Leonard Shelby, the character played by Guy Pearce with anterograde amnesia. You couldn’t ask Leonard to remember what he did yesterday. He had to check his tattoos every time he woke up. If his tattoos disagreed with what he’s seeing, he trusts the tattoo (which leads to a major plot point, which I won’t spoil). Again, this isn’t a new idea either. I mentioned the Memento pattern earlier, which is literally named after this movie.

This is a classic distributed systems technique. In double-entry bookkeeping, you maintain two independent records of the same transaction and reconcile them regularly. If they disagree, you investigate. You don’t need to know why they diverged; the divergence itself is the signal. A two-phase commit works the same way: write the data first, then update the record that says the data was written. If you find data without a matching record, or a record without matching data, something went wrong between the two phases.

That’s exactly what the cursor invariant does. The agent writes the output line first, then updates the progress file. If those two files are out of sync, something happened between the two writes. The agent doesn’t detect compaction. It detects a broken invariant, and it’s been told that when the invariant breaks, the files on disk win.

Three things make this work. First, the check is purely deterministic: read two files, compare two numbers, act on the result. There’s no reasoning involved, no judgment call about whether the agent “feels” like it lost context. I wrote about this principle in “Keep Deterministic Work Deterministic”; you never want an AI making decisions that a file comparison can make for it. Second, the files on disk don’t change when context compacts. They’re the stable reference point that the agent’s memory gets checked against. Third, the instruction to run the check lives in the system prompt or preamble, which is generally preserved even when conversation context gets compacted. The check survives the thing it’s designed to detect.

Rehydrate: Reading back the state

Rehydration is the process of reading back externalized state and rebuilding the agent’s working context. Once the agent detects compaction (or, more specifically and accurately, has enough evidence from the filesystem that compaction occurred), the recovery step is to read back the externalized files and rebuild. For the Quality Playbook, rehydration meant:

Read the phase brief to re-anchor the purpose of this pass
Read the progress file to know which unit is active and what’s been completed
Read the tail of the JSONL artifact to confirm the last successfully written record
Recompute the next unit of work from those files

This is different from just continuing without detection. Without detection, the agent tries to pick up where it left off and hopes it still has enough context. With detection, the agent knows something happened and deliberately reloads state before continuing.

You can make the rehydration process itself auditable. Instead of silently reading the files and resuming, have the agent write down what it learned:

Read the progress file and the JSONL artifact. Write a summary of what you learned: what pass is running, what unit is active, what the cursor position is, and how many requirements have been extracted so far. Then continue from there.

Writing a rehydration summary serves two purposes. It gives you visibility into what the agent understood and whether it rehydrated correctly. And it forces the agent to process the external files explicitly rather than just loading them into context. Explicit processing is more reliable than silent loading because the agent has to commit to an interpretation, and you can read that interpretation and catch mistakes.

You can adapt this approach to any agent workflow where work happens in steps. The specific files and cursor values are particular to my pipeline, but the underlying technique is general: have the agent write its progress to a file after each step, and check that file against its output at the start of every session. And this advice isn’t just for writing agents or skills. Even in a live session with Claude Code, Cursor, or Copilot, you can tell the agent to periodically write a summary of what it’s done and what it plans to do next to a file on disk. If the session crashes or the context gets long enough to compact, you can point a new session at that file and pick up where you left off. The key is getting the state out of the conversation and onto disk before you need it.

Context management is an architectural concern

Every technique I’ve described in these articles comes down to the same principle: Important information shouldn’t live only in the agent’s context window. The previous articles covered how to put that information on disk. This one covers how to make the agent aware of its own limitations so it can recover when context pressure gets too high.

An agent that can detect its own degradation and correct for it is fundamentally more reliable than one that just keeps going. When the agent knows how to stop, check itself against ground truth, and reload what it lost, context pressure becomes a recoverable event instead of a slow, silent failure.

This concludes my mini-series trilogy of articles about context management. The first article in this series was about understanding what context is and why it disappears. The second was about getting important information out of the conversation and onto disk before you need it. This one is about closing the loop: making the agent aware of its own limitations so it can detect degradation and recover from it. Together, they add up to treating context as an engineering problem rather than something you hope works out.

These are still early days. Context windows will get larger, compaction will get smarter, and some of the workarounds in this article will eventually be unnecessary. But the underlying principle won’t change: If your agent’s ability to do its job depends on information, that information needs to live somewhere more durable than working memory. That was true for my dad’s 32KB core memory at Princeton, it was true for my 640K of conventional RAM, and it’s true for today’s 200K-token context windows.

The Quality Playbook and Octobatch are open source projects where these techniques are used in production. Both are built using AI-driven development and available for exploration if you want to see how this looks in practice.

Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026, by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.

The PM’s Playbook for Shipping AI Features That Actually Work in Production

Gaurav Savla — Wed, 10 Jun 2026 10:55:56 +0000

The demo to production Death Valley

If you’ve worked on an AI feature, you know the feeling. You start building something that you are excited about, set launch timelines. The model spits out a perfect response, the prototype works magically, and everybody in the room is mentally calculating how big this product will be when we launch. I’ve been in that room a lot many times and it’s fun.

Then you try to test before you ship.

Latency spikes to 10 seconds on mobile. The model starts hallucinating on edge cases that happen to represent 15% of actual user queries. Your A/B test shows no statistically significant engagement lift because the variance in AI outputs makes traditional hypothesis testing basically meaningless. The safety team flags 340 failure cases in the first week, and you’re now debugging nondeterministic cases that fail in creative, novel ways every single day.

Most often than not, it’s not a model problem but an engineering discipline problem. Shipping an AI product is very different from traditional software. I’ve figured this out the hard way. This playbook shares my learnings.

Latency budgets

Every AI feature comes with a latency tax. Large language model inference takes time. We’re talking 500 milliseconds to 5 or even 50 seconds depending on model size, input length, and infrastructure setup. For consumer products where people expect sub-200-millisecond interactions, this is a hard constraint you have to design around.

The mistake I see most often is teams measuring only p50 latency. A feature with 800 milliseconds p50 sounds fine until you discover the p90 is 15 seconds. That means 10 in every 100 users sit there waiting for 15+ seconds. At scale, that’s thousands of terrible experiences per day.

The way I think about it is you define your latency budget by interaction type, not globally: Synchronous interactions, where the user is staring at a spinner, need to resolve under 1 second. Progressive interactions, where output streams token by token, need first token in under 500 milliseconds and full response under 5 seconds. Asynchronous interactions, where the user keeps doing other stuff, can take up to 20 seconds with a progress indicator.

You also need to measure cold starts separately. The first request after a model loads into memory can be 10 times slower than subsequent requests, and if your traffic is bursty, cold starts will disproportionately punish your most engaged users arriving during peak hours.

Besides, you also need to budget for the full pipeline, not just inference. A typical AI feature pipeline including input preprocessing (tokenization, context assembly, and prompt construction), model inference, output postprocessing (parsing, formatting, safety filtering, etc.), and a full response delivery adds up. Optimizing inference while ignoring the rest is like tuning your engine while driving on flat tires.

Lastly, use streaming aggressively for generative features. Pushing tokens to the user as they’re generated instead of waiting for the full response changes how users perceive latency. A four-second response that starts appearing at 300 milliseconds feels dramatically faster than one that pops in all at once. Perception is reality when it comes to user experience.

Designing fallbacks

Traditional software fails in boring, predictable ways. AI features fail in novel, unpredictable, and occasionally creative ways. I once saw a model respond to a product recommendation query with a poem about loneliness. Your fallback strategy needs to be considerably more sophisticated than a try/catch block.

I think about fallbacks as a hierarchy. First, model fallback: When your primary model fails, drop to a simpler, faster, and more reliable model. Most failure cases get handled without the user ever knowing. Second, cache fallback: For queries similar to stuff you’ve seen before, serve a cached response. Third, template fallback: When generation fails completely, fall back to prewritten templates. Degraded beats dead every time. Fourth, graceful omission: Sometimes the best fallback is to simply not show the AI feature at all rather than showing a broken version.

The design principle underneath all of this is that users should never encounter an unhandled AI failure. Every failure mode maps to a specific level, and transitions between levels should be invisible whenever you can manage it.

Quality measurement

Quality in traditional software is binary. The button works or it doesn’t. AI feature quality is continuous and subjective, and it changes depending on context. I’ve landed on a four-layer quality pyramid.

The foundation is safety, and it’s nonnegotiable. Does the output contain harmful content, PII, or made-up facts? This layer is binary, and you measure it with automated classifiers running against 100% of outputs.

The second layer is factual correctness, which is domain specific. Is the output actually right? For a coding assistant that means generated code compiles and passes tests. For a writing tool it means grammatical, stylistically appropriate output. You measure this with domain specific evaluation suites.

The third layer is usefulness, and it’s user centered. Did the person actually benefit? Track acceptance rate, edit distance, time to task completion, and repeat usage. This is where traditional product metrics meet AI specific ones.

The fourth layer is delight, which is experimental. Does the output feel good? Hardest to measure but often most important for adoption. Sometimes the numbers say the feature works but users’ guts say it doesn’t. This layer catches that gap.

A/B testing AI features

A/B testing AI features is fundamentally harder than traditional features because AI outputs are nondeterministic. The same user doing the same thing twice might get different outputs, introducing variance that traditional frameworks weren’t built to handle.

The core challenge is that intratreatment variance inflates the sample size you need for statistical significance, often by three to five times. If you’re running your AI experiment with normal sample size assumptions, you’re probably looking at noise and calling it signal.

Then there’s the metric selection problem. A chatbot generating entertaining but factually wrong responses might show amazing engagement numbers while actively misleading users. You have to measure engagement and quality together. “Engaged interactions where quality score exceeds threshold” is more meaningful than raw engagement alone.

The temporal problem matters too. AI feature value changes over time as users learn how to work with it. Short experiments will underestimate long-term value if there’s a learning curve, or overestimate it if there’s a novelty bump.

My practical guidance: budget two to three times more time and traffic for AI experiments than traditional ones. Lean on Bayesian methods as they handle high variance better. And always pair quantitative tests with qualitative research. Ten user interviews will surface failure modes that no amount of statistical analysis will catch.

Model drift monitoring

Model drift is the slow, invisible rot of AI output quality over time, and there are multiple culprits.

Data drift happens because the world changes and user behavior evolves. A model trained on 2024 data performs worse on 2026 queries referencing new concepts, slang, and cultural moments.

Provider drift happens because third-party APIs change without your consent. OpenAI acknowledged that GPT-4’s behavior shifted measurably between March and June 2023, and Stanford researchers documented significant performance swings. The fix: Pin your model versions so updates happen on your schedule, after your testing.

Evaluation drift is the subtlest form. Even your quality metrics can become inadequate and the evaluation criteria that made sense at launch might become inadequate as usage patterns shift and user expectations change. Quarterly reviews of your evaluation suites are essential.

At minimum you need daily automated quality evaluations on 1% to 5% of production traffic, weekly analysis of input distribution characteristics, and monthly human evaluation of 100 to 500 examples. Shipping an AI feature without drift monitoring is like deploying a service without alerting. You won’t know it’s broken until your users tell you, and by then they’re angry.

Evaluation frameworks

How do you know if your AI feature is good enough? You need two fundamentally different approaches, and you genuinely need both.

Automated evaluation gives you speed. Build a golden dataset of 500 to 2,000 labeled examples, train a classifier or use a capable model as judge, and validate against human judgment quarterly targeting 85% agreement. Automated evals chew through thousands of examples per hour, making them essential for velocity. The pitfall: They miss novel failure modes not in the training data.

Human evaluation catches what automation misses. Structure it with five to seven evaluators mixing domain experts and representative users. Use a consistent rubric covering accuracy, helpfulness, tone, completeness, and safety. Run weekly during development, monthly in production. The trade-offs: expensive at $15 to $30 per example, slow with 24 to 72 hour turnaround, and subject to human biases. Manage by rotating evaluators and capping sessions at two hours.

The model as judge approach is an increasingly viable middle ground. Judging quality is often easier than generating it, which means a model can reliably evaluate outputs even for tasks where it couldn’t produce them itself. Use it for high-volume evaluation but always validate against human judgment.

Graceful degradation and prompt engineering

Graceful degradation means when capabilities decrease, the experience gets worse smoothly instead of falling off a cliff. Design for capability levels, not binary states. Define four to five levels with specific behaviors at each. For example, for an AI writing assistant: Level 5 is full capability with real-time suggestions, tone adjustment, and structure recommendations. Level 4 is delayed suggestions appearing after a two- to three-second pause because latency is up. Level 3 is basic suggestions only like grammar and spelling with no style feedback. Each level is a deliberate design decision, not an accident.

Make degradation invisible when possible. Users shouldn’t see a “broken” experience. They see a less detailed one. That’s a huge difference psychologically. However, when the degradation is significant enough that users will notice, proactive communication like “AI suggestions are temporarily limited” builds trust infinitely more than silently pushing poor-quality outputs.

Prompt engineering in production is software engineering. In production, prompts are code, and they need version control, testing, monitoring, and maintenance. Version controls every prompt. Parameterize prompts, don’t hardcode context. Production prompts should be templates with clearly defined injection points for user context, system state, and dynamic instructions. This makes them testable because you can inject known inputs and verify outputs, and it makes them maintainable because changing how you handle context shouldn’t require rewriting the entire prompt from scratch.

Test prompts against regression suites. Maintain 200 to 500 test cases covering the full distribution of expected inputs, including edge cases and adversarial inputs. Run the suite against every prompt change before deployment.

Monitor prompt performance in production. Track output quality metrics like acceptance rate, user edits, and regeneration requests, segmented by prompt version. When you deploy a new version, compare its production metrics against the previous one for at least 72 hours before calling it stable. This is basically canary deployment for prompts.

Ship it right

These systems aren’t optional add ons you can bolt on after launch. Every feature I’ve seen fail was built first with plans to “add production hardening later.” Later never comes.

AI features are probabilistic and nondeterministic, and they change over time without anyone touching them. Build these systems, staff them properly, and treat them with the same seriousness you’d give your core infrastructure. The gap between demo and production is wide, but it’s absolutely crossable if you build the right bridge.

Note: The research work pertaining to this article was done in a personal capacity. Views are of my own and do not reflect my employer’s views in any way.