Radar

The End of Tokenmaxxing

Mike Loukides — Tue, 30 Jun 2026 16:06:02 +0000

The practice of tokenmaxxing appears to be dying out, even before I had a chance to write about it. Good riddance. Burning tokens to create the appearance of productivity was fated to last only until the accountants learned about it, and the strictest of all accountants is one’s personal checkbook. What got many developers thinking about the cost of AI was the change in GitHub Copilot’s usage charges. The cost of Copilot went from a monthly fee with unlimited use to a monthly fee that purchased a limited number of credits, which are used to pay the AI provider of your choice. One credit is equivalent to US$0.01; when you’ve used up your credits, you can upgrade your account or pay for additional credits as you go.

The question isn’t why this didn’t happen earlier; it’s why this happened now. Tokenmaxxing is both the creation and victim of two large-scale trends in AI. First, starting with OpenAI, the major AI providers were all playing a blitzscaling game that prioritized user growth over profitability. Giving AI services away for free got you more users, and in the long run, scalers would figure out how to make money from end-user fees, selling user data, or advertising. This process inevitably ends in enshittification, and is still very much the road we’re on.

Second, token usage exploded late in 2025. The appearance of “reasoning models,” which use tokens to maintain an internal dialog in the course of solving a problem, increased the number of tokens used to respond to each prompt. Reasoning tokens are a model’s conversation with itself about possible responses to the prompt, and are often more numerous than the prompt and response themselves. Whether or not users see the reasoning process (often they don’t), reasoning tokens add to the bill. They are frequently counted as “output tokens” because they are generated by the model, and are more expensive than input tokens.
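To make the billing effect concrete, here is a minimal sketch of how reasoning tokens change the cost of a single request. The prices and token counts are invented for illustration and are not any provider's actual rates; the only thing taken from the text is that reasoning tokens are billed at the output rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in US dollars."""
    input_per_million: float
    output_per_million: float


def request_cost_usd(pricing, input_tokens, visible_output_tokens, reasoning_tokens):
    """Cost of one request when reasoning tokens are billed as output tokens."""
    billed_output = visible_output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_per_million
             + billed_output * pricing.output_per_million)
    return total / 1_000_000


pricing = Pricing(input_per_million=2.0, output_per_million=10.0)  # assumed prices
without_reasoning = request_cost_usd(pricing, 1_000, 500, 0)
with_reasoning = request_cost_usd(pricing, 1_000, 500, 4_000)
print(f"without reasoning: ${without_reasoning:.4f}")
print(f"with reasoning:    ${with_reasoning:.4f}")
```

Under these assumed numbers the user sees the same 500-token answer either way, but the bill for the request with reasoning is several times larger.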

The appearance of agents also multiplied the rate at which users consumed tokens. In May 2025, Simon Willison quoted Anthropic’s Hannah Moran’s definition of an agent: “Agents are models using tools in a loop.” The Tredence blog writes: “The agent loop is a repeating cycle in which the AI reads the current data, thinks through what it means, chooses an action, carries it out, checks what happens and starts over.” If you’ve ever watched Claude Code, OpenClaw, or any other agent work, you’ve seen a single request become many calls to a model, each one using hundreds of tokens, if not thousands. In addition to the current request, one agent-generated invocation can contain the task’s entire accumulated context and relevant documents. Between reasoning tokens and agents, token usage goes up by a factor of hundreds.
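A rough sketch of why the loop is so expensive: if every call resends the accumulated context, the input tokens billed grow much faster than the number of steps. The token counts below are assumptions chosen for illustration, and the model ignores caching or context trimming that a real agent might apply.

```python
def agent_loop_input_tokens(task_tokens, tokens_added_per_step, steps):
    """Total input tokens billed across a loop that resends its whole context."""
    total = 0
    context = task_tokens
    for _ in range(steps):
        total += context                  # the full context is sent on every call
        context += tokens_added_per_step  # tool output and replies are appended
    return total


single_call = agent_loop_input_tokens(2_000, 1_500, steps=1)
twenty_calls = agent_loop_input_tokens(2_000, 1_500, steps=20)
print(single_call, twenty_calls, twenty_calls / single_call)
```

With these assumed sizes, twenty steps cost far more than twenty times a single call, because each step pays again for everything that came before it.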

The increase in token usage might not be an issue if it results in problems being solved and tasks completed more effectively. But it collides with the loss-leader pricing of the blitzscalers; their willingness to operate at a loss to gain control of a market has limits. Regardless of whether the number of AI users is increasing, the amount of computation, and therefore cost, per user grows as the use of agents increases. Reasoning models increased token usage; agents compounded the problem; and that led to price increases.¹ Microsoft/GitHub doesn’t want to pay Copilot customers’ AI bills. We haven’t yet seen across-the-board price increases from the AI providers themselves. But we have seen GitHub’s token credits, and we have seen Anthropic and OpenAI price more capable models significantly higher than older or less capable models. Fable is twice as expensive as Opus 4.8, and while some writers have called this pricing “fantastic,” that’s probably because they were expecting an even greater increase. While Fable can delegate tasks to Anthropic’s less expensive models, most early users observe that with Fable, token use goes up rather than down. Anthropic’s switch to token-based billing for its agent SDK (currently on hold) is another signal that the days of inexpensive AI are coming to an end. OpenAI’s story is similar: GPT 5.5 costs twice as much as GPT 5.4 per million tokens.

It’s also important to take capacity into account. Huge data centers have been in the news, but those data centers haven’t been built yet. More important, the electrical infrastructure needed to support those data centers—transmission lines, generators—hasn’t been built either, and that’s not an investment over which AI companies have much control. They can build their own power generation facilities on a data center campus, but that’s a huge investment in technologies that they’re not familiar with. And even if you generate power locally, you need other kinds of infrastructure: rail for coal, pipelines for gas. This isn’t (yet) an essay about data center power consumption and its consequences, but it is another factor that limits increased token usage. We’ve seen Anthropic’s outages blamed on capacity, and Anthropic has responded by leasing unused data center capacity from SpaceX. But the other way to respond to increased demand that can’t be met by current capacity is to increase prices, limiting customers to those who can afford to pay. That increase is being noticed by managers, accountants, and independent developers.

Token optimization and accountability are the inevitable consequence of upward pressure on token price. One way to build accountability is through better governance, a route Bennie Haelen describes in “The Subsidy Ended: What Tool-Using Agents Actually Cost.” Better governance is achieved through building an observability layer that lets you see exactly what the agents and models are doing. With a well-designed observability layer, you can see whether the data sent to the model is growing with each invocation, whether the model is using appropriate tools, whether tools are being called repeatedly, and a lot of other information that will tell you whether your agent is running efficiently.
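As a minimal sketch of what such a layer might check, the following audits a recorded trace for two of the signals mentioned above: context that keeps growing and tools called repeatedly with the same arguments. The `Invocation` record, the thresholds, and the finding labels are all hypothetical; a real tracing system would supply its own schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invocation:
    """One model call as recorded by a hypothetical tracing layer."""
    input_tokens: int
    tool: Optional[str]  # tool the model asked for, if any
    tool_args: str       # serialized arguments, used to spot exact repeats


def audit(trace, growth_factor=2.0, max_repeats=2):
    """Return a list of findings for one agent run."""
    findings = []
    # Flag runs whose last call sent far more input than the first.
    if len(trace) >= 2 and trace[-1].input_tokens >= growth_factor * trace[0].input_tokens:
        findings.append("context_growth")
    # Flag identical tool calls that were made more than max_repeats times.
    calls = Counter((inv.tool, inv.tool_args) for inv in trace if inv.tool)
    for (tool, _args), count in calls.items():
        if count > max_repeats:
            findings.append(f"repeated_tool:{tool}")
    return findings


trace = [
    Invocation(1_000, "search", "q=invoice"),
    Invocation(1_800, "search", "q=invoice"),
    Invocation(2_600, "search", "q=invoice"),
]
print(audit(trace))
```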

Another piece of token accountability is understanding which models are running your agent’s requests. General-purpose reasoning models range from expensive high-performance models like Claude Fable or Opus 4.8 to models like Gemma 4 26B that can run on a well-equipped laptop, and some models that are even smaller. While it’s tempting to say “I need the best; I’ll run Opus 4.8 or Fable with maximum reasoning,” most requests don’t require that level of reasoning or expense. Agents will be able to decide what model is best for processing every request. Fable can delegate, and we expect other frontier providers to follow as models incorporate agent capabilities. And there’s an active world of open models outside of the frontier AI providers. Vicki Boykis writes that models running locally now work almost as well as frontier models. Tools like OpenRouter give you a model-independent way of routing requests to different models, including open models that run locally. OpenRouter can be integrated with OpenClaw, Claude Code, Cursor, Codex, and other agents to provide intelligent routing.
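The routing idea can be sketched in a few lines. This is not OpenRouter's API or any agent's actual routing logic; the tier names, prices, and the integer difficulty score are placeholders, and deciding how to score a request is the hard part that this sketch leaves to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                      # placeholder label, not a real model ID
    max_difficulty: int            # hardest request this tier should handle
    usd_per_million_output: float  # assumed price, for illustration only


# Ordered cheapest first so the router picks the least expensive adequate tier.
TIERS = [
    ModelTier("local-small", 1, 0.0),
    ModelTier("mid-hosted", 2, 5.0),
    ModelTier("frontier", 3, 50.0),
]


def route(difficulty):
    """Pick the cheapest tier rated for the given difficulty score."""
    for tier in TIERS:
        if difficulty <= tier.max_difficulty:
            return tier
    return TIERS[-1]  # anything harder than rated goes to the top tier


print(route(1).name, route(2).name, route(3).name)
```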

Tokenmaxxing is dying. It will no doubt take time for its vestiges to die away, and there will always be developers who think they can game the path to a promotion, along with managers who insist on being “all in” with AI. But spending tokens responsibly is now the norm, whether you pay with your own checkbook or a company account. Token optimization will only become more important as per-token charges increase. They undoubtedly will.

Footnotes

¹ Some articles make the strange claim that tokens have gotten cheaper by up to 98%. GPT-5.5 suggests that these writers are considering the work that can be done per token. That comparison may be worthwhile, though it’s unclear how to compare GPT-3 with 5.5 or Fable meaningfully. For this article, a token is a token.

Beyond Prompt Injection

Shania Rasheed Nalagath — Tue, 30 Jun 2026 10:55:08 +0000

In late 2025, the security community stopped treating indirect prompt injection as a theoretical risk. It had spent two years as a tidy lab demonstration; then production systems started getting hit. The OWASP Top 10 for LLM applications now ranks prompt injection as the number-one risk, NIST has called indirect injection generative AI’s greatest security flaw, and academic researchers showed that a single poisoned email could coerce a model into exfiltrating SSH keys in up to 80% of trials, with zero user interaction. The attack needs no malicious binary, no phishing clicks, and no anomalous login. The agent simply reads content and takes action, exactly as designed, and the content was written by an attacker.

The most instructive example is ForcedLeak. In September 2025, researchers at Noma disclosed a critical vulnerability chain (CVSS 9.4) in Salesforce’s Agentforce platform: An attacker embedded malicious instructions in the description field of a routine Web-to-Lead form. The text sat harmlessly in the CRM until an employee later asked the AI agent to process that lead, at which point the agent dutifully executed both the legitimate query and the attacker’s hidden payload, exfiltrating sensitive CRM data to an external server. The detail that should keep you up at night is that the exfiltration destination was a domain still on Salesforce’s trusted allowlist, one that had expired and which the researchers re-registered for about five dollars. Every security control saw legitimate traffic to a trusted domain. Nothing looked wrong.

If your instinct reading that is “we filter for prompt injection,” you’re defending the wrong perimeter. Input filtering is necessary but nowhere near sufficient. The uncomfortable truth is that the injection isn’t the breach; the action is. And almost everything we call “AI security” is aimed at the wrong half of that sentence.

The defense everyone is building

Ask most enterprise AI teams how they secure their agents, and you’ll hear a consistent answer: They sanitize inputs. They harden system prompts with elaborate instructions to ignore conflicting directives. They run classifiers over incoming content to flag adversarial patterns. Some have adopted the more sophisticated training-time defenses the frontier labs have published—instruction hierarchies that teach a model to assign differential trust to different sources and reinforcement-learning approaches that harden models against injection in agentic contexts.

All of this is good work, and none of it should be abandoned. But notice what every one of these techniques shares. They all try to stop the model from being fooled. They assume that if we make the model robust enough at the input layer, the system is safe. That assumption is the vulnerability.

We’ve spent two years trying to make the model unfoolable. The systems that survive contact with production assume it will be fooled anyway.

Why the input layer is the wrong perimeter

Prompt injection isn’t a bug a future model will lack. It’s a structural property of how language models work. The model consumes a single undifferentiated stream of tokens at the moment of inference. Your instructions, the retrieved document, the tool output, and the web page just fetched are indistinguishable channels collapsed into one context. There’s no hardware-enforced boundary between “trusted instruction” and “untrusted data” the way there is between kernel space and user space in an operating system.

This is why the attack surface explodes the moment an agent becomes agentic. A chatbot that only talks is a contained risk. An agent that retrieves from the open web, reads email, queries databases, and calls APIs ingests adversarial content from a dozen sources on every turn, and any one of them can carry an instruction. Researchers cataloging real agent ecosystems have already found hundreds of malicious third-party extensions performing data exfiltration and silent injection without any user awareness. These aren’t laboratory curiosities. They’re the production environment.

So, if you can’t guarantee the model will never be fooled—and you can’t—then architecture that depends on it never being fooled is built on sand. You need a second principle, one distributed systems engineers have understood for decades.

Verify, then trust

The principle is simple to state and hard to retrofit: An agent’s proposed action should be validated against an external, deterministic policy before it executes, regardless of why the agent proposed it. The validator doesn’t ask whether the instruction that produced the action was legitimate. It doesn’t try to detect the injection. It asks a different and far more answerable question: Is this action, on its face, permitted?

This inverts the burden. Detecting a cleverly disguised malicious instruction is open-ended because the adversary gets to be arbitrarily creative. Checking whether a wire transfer exceeds a hard dollar limit is a closed problem with a definite answer. We move the security decision from where the attacker has infinite freedom to where they have almost none.

Crucially, the check must be deterministic code, not another model asking, “Does this look dangerous?” The moment you ask a second LLM to adjudicate, you’ve reintroduced the exact same vulnerability one layer down. The enforcement layer is boring, auditable conventional software, and that’s the point.

Here’s what it looks like in practice. An agent managing procurement proposes an action, and a runtime contract evaluates it before anything reaches a real API:

# agent_contract.yaml
agent_id: "procurement_executor_07"
role: "EXECUTOR"
policy:
  approve_invoice:
    max_amount_usd: 50000
    allowed_vendors: from_approved_registry
    require_human_above_usd: 10000

# Runtime, on a proposed action:
ACTION   approve_invoice(vendor='Acme', amount=1200000)
REJECTED policy violation: max_amount_usd
         proposed 1,200,000 / limit 50,000
         action discarded, human notified, no API call made

The injected instruction at 2:14am never matters here. The agent can be perfectly, catastrophically fooled, and the wire transfer still doesn’t happen, all because a simple deterministic check stood between the model’s output and the outside world, and the proposed action failed it.
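For readers who want to see the gate as code, here is a minimal Python sketch of the same contract. The class and function names are hypothetical, the vendor registry is reduced to an in-memory set, and a real deployment would load the policy from the version-controlled contract rather than hard-coding it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoicePolicy:
    max_amount_usd: int
    require_human_above_usd: int
    approved_vendors: frozenset  # stand-in for the approved registry


@dataclass(frozen=True)
class Decision:
    outcome: str  # "ALLOW", "NEEDS_HUMAN", or "REJECT"
    reason: str


def check_approve_invoice(policy, vendor, amount_usd):
    """Deterministic check; it never looks at why the agent proposed the action."""
    if vendor not in policy.approved_vendors:
        return Decision("REJECT", "allowed_vendors")
    if amount_usd <= 0:
        return Decision("REJECT", "invalid_amount")
    if amount_usd > policy.max_amount_usd:
        return Decision("REJECT", "max_amount_usd")
    if amount_usd > policy.require_human_above_usd:
        return Decision("NEEDS_HUMAN", "require_human_above_usd")
    return Decision("ALLOW", "within_policy")


policy = InvoicePolicy(50_000, 10_000, frozenset({"Acme"}))
print(check_approve_invoice(policy, "Acme", 1_200_000))
print(check_approve_invoice(policy, "Acme", 25_000))
print(check_approve_invoice(policy, "Acme", 1_200))
```

Nothing in the function inspects the prompt, the email, or the model's reasoning. It sees only the typed action and the policy.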

This only works if the action arrives structured, which makes structure a precondition.

The contract inspects approve_invoice(vendor, amount) cleanly only because the action is already typed. If the agent emits prose, “please approve the Acme invoice,” something has to parse it, and the only thing that parses open language is another LLM, so the indeterminacy walks back in. That dictates the design.

A consequential action must cross the boundary as a typed tool call, never as free text. Where the input is unavoidably natural—an email saying, “Wire them their balance” for example—let the model extract a structured value but never let its extraction be self-authorizing. The model proposes the amount; the gate still checks it against the limit, the vendor registry, and the actual balance in the system of record, not the number the email asserted. Extraction is probabilistic, while validation stays deterministic.
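A small sketch of that split, under assumed names: the model's extraction arrives as a typed `PaymentProposal`, and the validator compares it with the system of record rather than with anything the email claimed. The in-memory `LEDGER_BALANCES` dictionary stands in for a real ledger query.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentProposal:
    """Typed tool call produced by the model's (probabilistic) extraction."""
    vendor: str
    amount_usd: Decimal


# Hypothetical system of record; a real gate would query the ledger.
LEDGER_BALANCES = {"Acme": Decimal("1200.00")}


def validate_payment(proposal, limit_usd=Decimal("50000")):
    """Deterministic validation against the ledger, not the email's claim."""
    owed = LEDGER_BALANCES.get(proposal.vendor)
    if owed is None:
        return False, "unknown vendor"
    if proposal.amount_usd != owed:
        return False, "amount does not match ledger balance"
    if proposal.amount_usd > limit_usd:
        return False, "over limit"
    return True, "ok"


print(validate_payment(PaymentProposal("Acme", Decimal("1200.00"))))
print(validate_payment(PaymentProposal("Acme", Decimal("1200000"))))
```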

A few decisions are pure judgment with no schema, such as “Is this email phishing?” There the model stays in the loop. You bound the consequences instead, with reversibility and human review above a threshold. Contracts protect parameterizable actions, and unparameterizable judgments fall back to containment.

The architecture this implies

Once you accept that the action layer is where security lives, three design commitments follow, and they map almost directly onto principles that hardened distributed systems years ago.

Least privilege for agents, scoped to the action, not the agent. The naive version assumes you can predict what an agent will do and provision it accordingly. For a specialized agent you can: One that only summarizes has no business holding a credential that moves money. But the agents people actually reach for are general. In a single session, I might ask a coding agent to summarize a file, write code, execute it, and query company data—four tasks with four risk profiles, none of which are enumerated in advance. Static least privilege collapses the moment one identity spans that range.

The fix is to make privilege a property of the action, not the agent. The agent holds no dangerous capability by standing grant; it requests narrow, transient elevation per action, which the same deterministic gate approves or denies. Reading a document is auto-approved; querying the warehouse is not. The dangerous credential exists only for the instant the action is permitted, then evaporates. One caveat: This governs what an agent may reach but not what the code it writes then does. Executing code can be gated as a capability, but what executes still needs containment, sandboxing, and egress control, because generativity is a different problem from access.
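One way to picture per-action elevation is a context manager that grants a capability only for the duration of one action. This is a sketch under stated assumptions: the `Agent` class, the `gate` function, and the `approvals` set are hypothetical, and a real system would mint and revoke short-lived credentials rather than toggle a set in memory.

```python
from contextlib import contextmanager

AUTO_APPROVED = frozenset({"read_document"})  # assumed policy


class Agent:
    def __init__(self):
        self.capabilities = set()  # no standing grants


def gate(action, approvals):
    """Deterministic check: auto-approved, or explicitly approved for this action."""
    return action in AUTO_APPROVED or action in approvals


@contextmanager
def elevated(agent, action, approvals):
    """Grant one capability for the duration of one action, then revoke it."""
    if not gate(action, approvals):
        raise PermissionError(f"{action} denied by policy")
    agent.capabilities.add(action)
    try:
        yield agent
    finally:
        agent.capabilities.discard(action)  # revoked even if the action fails


agent = Agent()
with elevated(agent, "read_document", approvals=set()):
    print("during:", agent.capabilities)
print("after:", agent.capabilities)
```

Asking for `query_warehouse` with an empty `approvals` set raises `PermissionError` before any capability is granted.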

Zero trust for machine identities. Every action an agent takes should be authenticated and authorized as if it came from an untrusted actor, because, functionally, it might be acting on an attacker’s instructions. The proliferation of agents has expanded the attack surface faster than most identity systems were designed to handle, and treating agent traffic as inherently trusted because it originates inside your own system is precisely the mistake.

Capability contracts at the boundary. Every consequential action passes through a deterministic gate that encodes what is allowed: dollar limits, rate limits, allowlisted destinations, and mandatory human review thresholds. The contract is version-controlled, auditable, and lives entirely outside the model.
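As one example of a contract clause, here is a hedged sketch of an allowlisted-destination check. The hostname is a placeholder, and the function compares the parsed hostname exactly rather than searching the URL string, so look-alike hosts and query-string tricks do not pass. As ForcedLeak showed, the allowlist itself still needs review, since an entry can outlive the ownership of its domain.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = frozenset({"api.example.com"})  # hypothetical allowlist


def destination_allowed(url):
    """Allow only HTTPS requests whose parsed hostname is exactly allowlisted."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


for candidate in (
    "https://api.example.com/v1/leads",
    "https://api.example.com.evil.test/v1/leads",
    "https://evil.test/?next=api.example.com",
    "http://api.example.com/v1/leads",
):
    print(candidate, destination_allowed(candidate))
```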

The trap of normalized deviance

The quieter organizational danger is the slow accumulation of false confidence from connecting insecure agents to real systems and watching nothing bad happen... for a while. Researchers have warned about indirect injections for years, but most deployments have gotten away with it. Each uneventful day makes the next risky connection feel safer. This is the normalization of deviance. Every system that eventually failed catastrophically felt the same way: fine, fine, fine, until it wasn’t.

The teams that will weather the coming wave of agent incidents aren’t the ones with the cleverest input filters. They’re the ones who assumed compromise from the start and built the boring enforcement layer anyway, the ones who decided that an agent’s autonomy ends precisely at the point where it tries to do something irreversible.

Where to start on Monday

You don’t need to rearchitect everything. Start by inventorying the actions your agents can take, and sort them by blast radius: What’s the worst thing that happens if this action fires when it shouldn’t? For every high-blast-radius action, write a deterministic contract that gates it and put a human in the loop above a threshold you can defend to your risk team. Then, and only then, keep hardening your inputs.
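The inventory step can start as something as simple as a sorted list. In this sketch the fields and the scoring rule (irreversible actions first, then dollars at risk) are assumptions. Your own measure of blast radius may weigh data exposure or customer impact instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentAction:
    name: str
    reversible: bool
    max_usd_at_risk: int
    has_contract: bool  # is a deterministic gate already in place?


def blast_radius(action):
    """Sort key: irreversible actions rank first, then by dollars at risk."""
    return (not action.reversible, action.max_usd_at_risk)


def ungated_by_risk(actions):
    """Names of actions with no contract yet, worst blast radius first."""
    ranked = sorted(actions, key=blast_radius, reverse=True)
    return [a.name for a in ranked if not a.has_contract]


inventory = [
    AgentAction("summarize_document", True, 0, False),
    AgentAction("approve_invoice", False, 50_000, True),
    AgentAction("send_wire", False, 1_000_000, False),
    AgentAction("update_crm_field", True, 500, False),
]
print(ungated_by_risk(inventory))
```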

Prompt injection won’t be solved at the input layer, because it can’t be. But it can be rendered survivable at the action layer, where deterministic code gets the final word. The model’s job is to be useful. Your architecture’s job is to make sure that when the model fails—or worse, when it has been turned against you—the failure stops at the gate.

What You Bring to AI Determines the Result

Tim O’Reilly — Mon, 29 Jun 2026 16:15:12 +0000

Harper Carroll came to AI education through a CS background at Stanford, machine learning engineering at Meta, and a brief stint at a small GPU compute startup in late 2023, where she noticed that almost no one understood how to fine-tune open source models. She started writing and teaching to help drive signups for the startup’s platform. Her first guide, posted right after Mistral 7B was released, when she had about 50 followers, got 50,000 views. In March 2024, a video explaining the difference between AI and machine learning got 5 million views, with 1 in 20 viewers following her afterward. She now has more than 500,000 followers across multiple platforms and is a full-time AI educator.

We covered fine-tuning versus prompting, what it actually means to learn to code in 2025, and what the AI field gets wrong when it talks to the public.

Understanding the world with math

We started with Harper’s own AI learning journey, and it contained a wonderful insight. She grew up loving math and came to computer science at Stanford because algorithms seemed like wonderful math puzzles. Eventually she realized that AI is “understand[ing] the world around us with math.” Text-based LLMs are only one branch. The field as a whole is “the math of the world.” That seems like a deep intuition that all of us need to internalize.

AI as a medium

A study that circulated last year found that people who used AI to write essays showed reduced brain activity compared to people who write unaided. The reaction in many quarters was alarm. People said, “We’re outsourcing cognition and our brains will atrophy.” Harper’s smart response was that those users must have given the AI a one-sentence prompt and accepted whatever came back.

As she put it, that’s the equivalent of just telling Alexa to order you the most popular book this week. Of course less brain activity is being measured! Contrast that with the difference between shopping for a book by browsing and searching at Amazon versus driving to a physical bookstore. There’s certainly a difference, but it isn’t outsourcing cognition. It’s saving time, and that time might well be spent on other demanding cognitive tasks.

My framing is that AI is a medium, the way language is a medium, or photography. Anyone can take a photograph or write a book. The words available to every writer are the same; what differs is what they do with them, just as some photographers do something with a camera that others can’t. The same is true of software. There’s a line in Aaron Sorkin’s movie The Social Network where the Zuckerberg character says about the Winklevosses, “If you guys were the inventors of Facebook, you’d have invented Facebook.” An idea and its execution aren’t the same thing. One person gives AI a prompt and the output is bad. Another builds a process around AI and the output is great. What you bring to the medium is what determines the result. Harper agreed.

Fine-tuning is like psychedelics for AI

I’ve been trying to figure out how we can use AI for writing and editing at O’Reilly. We want skills and workflows that accelerate our productivity but don’t produce copy that reads as whatever the base model sounds like when nobody’s putting in any effort.

Takeaway posts like this one are a great use case for AI-assisted writing. As source material we have a transcript, with the actual conversation between the participants (or in the case of one of our online conferences, their presentations). We want a structured summary that captures the high points and suggests possible clips for social media. I (or whoever is using this AI-assisted workflow) can then rewrite, rearrange, elaborate, or delete from that first draft. It might not be as good as a draft written from scratch, but quite frankly, it’s far better than the alternative, which is no summary at all. I just don’t have time to write them all unaided.

When I’m writing an article, I generate a similar “transcript” by recording myself talking about the ideas I’m wrestling with and trying to put into the world. Then I ask Claude to put it together into something a bit more structured.

I’ve been improving Claude’s ability to produce prose that we can use by rewriting its output, showing it the differences, and then asking it to construct a skill that captures what it’s learned. Over time, it’s gotten closer and closer to something that I’m comfortable with, and I’m now generalizing that into a system that learns any author’s voice, respects the various conventions of the target content type (which can be very different across books, articles and blog posts, social media, and marketing materials like back cover copy and course descriptions), and applies editing suggestions from my favorite books on good writing, including Strunk and White and On Writing Well by William Zinsser.

Harper attacked the same problem from a different angle. She built a dataset of roughly 1,000 of her Instagram captions, video transcripts, and X posts, then fed them to Claude as context and asked it to write in her style. Unfortunately, a detection tool scored the output as 100% AI, even with 1,000 examples of her real voice in the prompt. She then fine-tuned an open source Llama model on the same data. The fine-tuned output scored as 100% human. She gave a compelling demo at South by Southwest showing how easy this is to do. It took her about 20 minutes.

After Harper said that prompting doesn’t shift the output distribution the way fine-tuning does, I told her the story about the French writer Marcel Proust that I first used in my conversation with Steve Wilson, which I picked up from Alain de Botton’s How Proust Can Change Your Life. A friend comes to visit the bedridden Proust, and making polite conversation begins to tell him about the train trip to Paris. “More slowly,” Proust replies. This cycle repeats several times until the friend is telling him small details like the old man feeding pigeons on the steps of the station.

Harper got it, and broke it down more slowly in her inimitable way. Here’s why in-context prompting fails where fine-tuning succeeds:

Basically AI models are these massive mathematical equations, and the parameters are variables when you’re training, and then they become constants in those equations when you’re running inference... So what you’re doing when you’re training the model is you’re learning how to map, by adjusting those constants when they’re variables during training, ... input to desired output.

Once the model is deployed, the probability distribution over output tokens is fixed. You can put 1,000 examples in a prompt and ask the model to pattern-match, but you’re asking it to do that with frozen weights. The surface behavior bends a little, but the underlying distribution doesn’t shift. Fine-tuning lets you actually modify the weights and how the model wants to write.
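Harper's variables-versus-constants point can be shown with a toy model that has a single parameter. This is only an illustration of the idea, not how an LLM is trained: the "model" is y = w * x, training is plain gradient descent on squared error, and the learning rate and step count are arbitrary choices.

```python
def train(pairs, steps=200, lr=0.01):
    """During training, w is a variable adjusted to map inputs to targets."""
    w = 0.0
    for _ in range(steps):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w


def make_model(w):
    """At inference, w is a constant baked into the function."""
    def predict(x):
        return w * x
    return predict


w = train([(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)])
model = make_model(w)
print(round(w, 4), round(model(4.0), 4))
```

Whatever you pass to `model` at inference time, `w` does not change. Only another call to `train`, the analogue of fine-tuning, can move it.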

Her suggested approach for building the training dataset is to take your own writing, have AI rewrite it with its characteristic tics, then train with the AI version as input and your original as the target output. You’re teaching the model to undo the tells.
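A minimal sketch of that dataset construction follows. The `ai_rewrite` callable is a stand-in for whatever model call produces the AI-flavored version, and the `input`/`output` field names and JSONL layout are assumptions. Fine-tuning toolkits each expect their own record format.

```python
import json


def build_pairs(originals, ai_rewrite):
    """Pair each AI-flavored rewrite (input) with the author's original (target)."""
    return [{"input": ai_rewrite(text), "output": text} for text in originals]


def to_jsonl(pairs):
    """Serialize one training example per line."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


def fake_rewrite(text):
    """Hypothetical stand-in for a model call that adds characteristic tics."""
    return "Let's delve into this: " + text


pairs = build_pairs(["Ship it.", "Weights are just numbers."], fake_rewrite)
print(to_jsonl(pairs))
```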

Should people still learn to code?

We also spent time on the inevitable question of whether people should still learn to code. We both agree they should, but not necessarily like they used to, by learning the detailed syntax of a programming language, then by trial and error as they painfully learn how hard it is to get the desired behavior.

Harper’s take (which I also agree with) is that vibe coding has lowered the floor. People who could never afford to hire someone to build a product can now do so themselves. But it has also raised the ceiling, because people who actually understand systems can build vastly more sophisticated things with the same tools, which takes us back to the case for AI as a medium.

Perhaps more important to the question of how much coding you should learn, experienced developers will also see failure modes that pure vibe coders miss. Harper gave an example that came from watching a friend using an agent tool that had, at some point, started storing its data in a Word document and using it as a makeshift database, probably because the session started with a Word doc. It was extremely slow and extremely inefficient. An engineer sees the problem immediately. A vibe coder might run that system for months before noticing something is wrong.

So yes, you should learn enough about coding to understand what’s happening. The art of teaching programming to the next generation will be developing useful projects that also highlight underlying concepts of software architecture and engineering.

Intuition as differentiator

Silicon Valley runs heavily on logic and on the idea that good decisions come from better data, more rigorous analysis, and sharper models. In this environment, intuition can get dismissed as something “soft and fuzzy,” Harper noted. And that’s the wrong mindset for AI.

AI is getting better and better at exactly the things the logical axis does well, but intuition remains a challenge because it often contradicts what the data says. Good intuition “goes against the input,” to use Harper’s phrase. A model that’s been trained to recognize patterns in data will, almost by definition, struggle with making decisions that run counter to those patterns. Just as skills-informed judgment supercharges AI-assisted engineers, intuition could be a uniquely human skill for a long time. Elevating it as a concern might bring the industry more of an attitude of humility towards ourselves and our place in the world.

What the field gets wrong

I closed by asking Harper what the AI field most consistently gets wrong in how it talks to the public. She said that too much of the public-facing discourse leads with fear, of job displacement, of rapidly approaching AGI, and of a rocky transition that requires a universal basic income to cushion the blow. She’s not calling those impossible futures, but she thinks they’re the wrong introduction to the technology.

A lot of companies are using AI to ask how to do the same things at lower cost. The better question is how to raise ambitions. AI doesn’t just scale individual capabilities. It scales what organizations can attempt. But for it to work out that way, everybody has to actually learn AI. We can’t have AI haves and have-nots. That means lower-cost models, serious open source investment, and companies that don’t just become serfs to the major platforms.

Harper has been making this point for a while, to audiences ranging from engineers to people who’ve never written a line of code. “There is not really much to fear right now,” she says. “AI is this incredible productivity tool.” The people who will struggle, in her view, are the ones who refuse to engage with it at all.

At O’Reilly, we’ve been working on a version of the same narrative at an organizational level. The fear-first narrative produces avoidance, and avoidance is the one thing that will actually leave someone behind. So we’re building a corporate AI transformation practice that starts with people’s existing jobs, and figures out how to “mix in” AI to make them more impactful. We’re learning how to teach both the humans and the agents at the same time to make them more productive together.

On July 9, I’ll be speaking with Trail of Bits cofounder and CEO Dan Guido about the playbook his company used to go AI native, which he first outlined at this year’s [un]prompted. He’ll give a version of the same talk, then take about 40 minutes of audience questions on what worked, what didn’t, and what is still unsolved. I hope you join us to find out what’s changed since [un]prompted and where the playbook is heading next. Register here; it’s free and open to all.

Agent Memory

Angie Jones — Mon, 29 Jun 2026 10:53:10 +0000

The following article originally appeared on Angie Jones’s LinkedIn page and is being republished here with the author’s permission.

I’m fascinated by the concept of agent memory. LLMs are stateless by design, meaning they have no memory or awareness of past interactions. Each prompt you send to an LLM is treated as a completely isolated event.

When you have a continuous chat with an AI agent, it feels like the AI remembers previous messages. However, the interface itself is faking it. Behind the scenes, your agent takes the entire conversation history and resends all of it to the LLM as one giant, combined prompt.

Companies, researchers, and even indie devs are all trying to crack agent memory. Because once an agent can remember, the entire interaction changes. It can build on what it learned, adapt to the user, resume work after a restart, and develop a sense of continuity.

Recently, I spent time with Richmond Alake, who has been in the trenches working on agent memory at Oracle.

[Image: Richmond Alake, the agent memory guru]

We talked about the different kinds of memory, why memory is harder than it sounds, and what it takes to build a memory system that is actually useful in production.

That conversation made something very clear to me. When people say, “agent memory,” they often mean very different things.

So let’s unpack the various types of memory.

Conversational memory

Conversational memory is the one most people think of first. It stores the messages exchanged between the user and the assistant.

This makes sense. If I ask, “What did I say was the ultimate goal of this task?” the agent needs access to the conversation in order to answer. Without that history, every turn starts from zero.

But this is also where many memory systems go wrong.

The most common first attempt is to keep appending prior messages to the prompt. For example:

User: I’m building a customer support agent.

Assistant: Great, what should it do?

User: It should look up past tickets and draft replies.

Assistant: Got it.

User: Also, I prefer Python and FastAPI.

Then on the next call, we send all of that back to the model along with the new question.

This works for a short conversation, but the agent only “remembers” because we keep reminding it. This is not really memory engineering.

Eventually, the conversation gets too long and the model receives a giant blob of context where some details are important, some are stale, and some are completely irrelevant. The agent may technically have the information, but that doesn’t mean it can use it well.
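To make the append-everything approach concrete, here is a minimal sketch. The `fake_llm` function and the `NaiveChat` class are hypothetical stand-ins, not any real client library; the point is only that every turn resends the whole transcript.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def fake_llm(messages: list[Message]) -> str:
    """Stand-in for a real model call (hypothetical); reports how much it was sent."""
    return f"(reply after reading {len(messages)} messages)"


class NaiveChat:
    """Keeps the whole transcript and resends all of it on every turn."""

    def __init__(self, llm: Callable[[list[Message]], str]) -> None:
        self.llm = llm
        self.history: list[Message] = []

    def send(self, user_text: str) -> str:
        self.history.append({"role": "user", "content": user_text})
        # The model is stateless, so the full history goes out again each time.
        reply = self.llm(list(self.history))
        self.history.append({"role": "assistant", "content": reply})
        return reply


chat = NaiveChat(fake_llm)
chat.send("I'm building a customer support agent.")
chat.send("It should look up past tickets and draft replies.")
last = chat.send("Also, I prefer Python and FastAPI.")
print(last)
```

The prompt grows by two messages per turn, which is why this stops scaling once conversations get long.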

So yes, conversation history is a valid and important type of memory. But it shouldn’t be the whole memory strategy. Real agent memory requires deciding what should be stored, where it should be stored, how it should be retrieved, and when it should be summarized, forgotten, or compressed.

Semantic memory

Semantic memory stores durable facts.

These are things that should outlive the exact conversation where they were learned:

The user prefers Python over TypeScript for backend work.
The customer support agent needs access to past tickets.
The production system handles 50,000 queries per day.

This is different from conversational memory because the exact wording and sequence are less important. What matters is the meaning.

If the agent needs to recall what stack the user is using, it should retrieve the memory even if the user never says those exact words again.

Vector search is useful for this. The memory can be embedded and retrieved by semantic similarity.

The benefit is that the agent doesn’t need to replay the full conversation. It can retrieve the few durable facts that are relevant to the current request.
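The retrieval step can be sketched in a few lines. This is a toy, assuming a word-count "embedding" so the example runs without any external service; word counts only capture shared vocabulary, so a real system would swap in an embedding model to get matching by meaning. The function names are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


facts = [
    "The user prefers Python over TypeScript for backend work.",
    "The customer support agent needs access to past tickets.",
    "The production system handles 50,000 queries per day.",
]
index = [(fact, embed(fact)) for fact in facts]


def recall(query: str, k: int = 1) -> list[str]:
    """Return the k stored facts most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [fact for fact, _ in ranked[:k]]


top = recall("Which language does the user prefer for backend work?")
print(top)
```

Only the best-matching fact goes into the prompt; the other two stay in storage.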

Episodic memory

Episodic memory stores events.

This is the “what happened” layer of memory:

The agent searched the web for recent API gateway patterns.
The agent generated a draft response for ticket #4821.
The workflow failed at the compliance review step.

Episodic memory is especially useful for debugging, auditing, and long-running workflows.

For example, if an agent makes a decision, I may want to know what happened right before that decision (e.g., What tools did it call? What data did it retrieve?).

This type of memory often benefits from structured storage.

For example:

Find all failed tool calls from the mortgage approval workflow in the last 24 hours.

That is a database query problem, not just a vector search problem.
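As a rough illustration, that request maps onto ordinary SQL. The sketch below uses Python's built-in `sqlite3` with an invented `events` table; the table layout, workflow names, and tool names are assumptions made for the example, not part of any product.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        workflow TEXT NOT NULL,
        kind TEXT NOT NULL,        -- e.g. 'tool_call'
        tool TEXT,
        status TEXT NOT NULL,      -- 'ok' or 'failed'
        occurred_at TEXT NOT NULL  -- ISO 8601 timestamp, UTC
    )
    """
)

now = datetime.now(timezone.utc)


def hours_ago(hours: int) -> str:
    """ISO timestamp for a moment the given number of hours before now."""
    return (now - timedelta(hours=hours)).isoformat()


conn.executemany(
    "INSERT INTO events (workflow, kind, tool, status, occurred_at) VALUES (?, ?, ?, ?, ?)",
    [
        ("mortgage_approval", "tool_call", "credit_check", "failed", hours_ago(2)),
        ("mortgage_approval", "tool_call", "credit_check", "ok", hours_ago(1)),
        ("mortgage_approval", "tool_call", "doc_parser", "failed", hours_ago(30)),
        ("support_triage", "tool_call", "ticket_search", "failed", hours_ago(3)),
    ],
)

# Exact filters on workflow, status, and time: no similarity search involved.
failed = conn.execute(
    """
    SELECT tool, occurred_at FROM events
    WHERE workflow = ? AND kind = 'tool_call' AND status = 'failed'
      AND occurred_at >= ?
    ORDER BY occurred_at
    """,
    ("mortgage_approval", hours_ago(24)),
).fetchall()
print(failed)
```

The 30-hour-old failure and the failure from the other workflow are both excluded by exact predicates, which is something similarity search alone cannot guarantee.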

Procedural memory

Procedural memory is about how to do things.

For example:

When investigating a failed deployment, check logs first, then recent config changes, then dependency updates.
When drafting a customer support reply, include the ticket summary, likely cause, recommended fix, and next step.
When creating a database-aware agent, scan table comments, column comments, constraints, and recent workload patterns.

This is the kind of memory that helps an agent improve its process. That’s powerful because agents are often asked to operate in messy real-world environments. With procedural memory, an agent can reuse proven approaches.

The value extends beyond just knowing things to actually knowing how to proceed.

Entity memory

Entity memory stores facts about specific people, accounts, projects, systems, tickets, or objects.

For example:

Angie prefers practical examples over abstract explanations.
Customer Acme Corp has strict data residency requirements.
Ticket #4821 is related to a billing reconciliation issue.

Entity memory matters because many agent tasks are scoped around a particular thing.

If I ask, “What do we know about Acme Corp?” I don’t want every memory in the system. I want memories attached to that customer.

This is also where memory safety becomes important.

Agents should not accidentally mix memories between users, customers, or projects. A memory system needs strong scoping so one user’s context does not leak into another user’s response.
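One simple way to picture strong scoping is to make the owner part of the storage key, so an unscoped lookup is impossible by construction. The class below is a hypothetical in-memory sketch, with made-up tenant names, not how any particular memory product implements it.

```python
from collections import defaultdict


class ScopedEntityMemory:
    """Stores facts under (tenant, entity) so reads cannot cross tenants."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], list[str]] = defaultdict(list)

    def remember(self, tenant_id: str, entity: str, fact: str) -> None:
        self._store[(tenant_id, entity)].append(fact)

    def about(self, tenant_id: str, entity: str) -> list[str]:
        # Both parts of the key are required; there is no "all tenants" read.
        return list(self._store.get((tenant_id, entity), []))


memory = ScopedEntityMemory()
memory.remember("tenant-a", "Acme Corp", "Has strict data residency requirements.")
memory.remember("tenant-b", "Acme Corp", "Is a prospect in early evaluation.")

print(memory.about("tenant-a", "Acme Corp"))
```

Two tenants can both hold memories about an entity with the same name without either seeing the other's.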

Working memory

Working memory is the short-term scratchpad for the current task.

This is where the agent keeps temporary information while reasoning through a problem.

Working memory is usually not meant to last forever. It’s useful during the task, but it may not deserve to become durable memory.

If an agent stores every temporary thought as long-term memory, the memory store gets noisy very quickly. The agent may later retrieve half-baked assumptions as if they were facts, which is dangerous.

Not everything the agent observes or thinks should be remembered permanently.
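A minimal sketch of that separation, assuming a hypothetical `TaskMemory` helper: thoughts go to a scratchpad that is cleared when the task ends, and only facts that are explicitly promoted reach the durable store.

```python
class TaskMemory:
    """Scratchpad for one task plus a handle to a long-lived store (both hypothetical)."""

    def __init__(self, durable: list[str]) -> None:
        self.durable = durable        # outlives the task
        self.scratch: list[str] = []  # discarded when the task ends

    def note(self, thought: str) -> None:
        """Record a temporary thought; it never reaches durable memory by itself."""
        self.scratch.append(thought)

    def promote(self, fact: str) -> None:
        """Explicitly copy a verified fact into durable memory."""
        if fact not in self.durable:
            self.durable.append(fact)

    def finish(self) -> None:
        """End the task and drop everything that was not promoted."""
        self.scratch.clear()


durable_store: list[str] = []
task = TaskMemory(durable_store)
task.note("Maybe the timeout is a DNS issue?")         # unverified guess
task.note("Logs show the payment gateway timing out.")  # observation
task.promote("Checkout failures traced to payment gateway timeouts.")
task.finish()
print(durable_store)
```

The unverified DNS guess is gone after `finish()`, so it can never be retrieved later as if it were a fact.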

Summary memory

Summary memory is one that many agent users are familiar with. It deals with the problem of limited context windows.

Even with large context models, you can’t keep appending forever. At some point, you need to compress.

Summary memory stores a compact version of a longer thread or context window. The original details can still live in the thread, but the prompt gets a smaller representation.

For example, instead of sending 80 turns of conversation, the agent might send:

The user is building a SaaS customer support agent. They prefer Python and FastAPI, deploy on OCI, and want the agent to retrieve past tickets before drafting replies. They are currently evaluating memory strategies for production usage.
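The mechanics can be sketched as "summarize the older turns, keep the most recent ones verbatim." The `compact` function and the stub summarizer below are illustrative assumptions; a real system would have a model write the summary.

```python
from typing import Callable, Optional


def compact(
    turns: list[str],
    keep_last: int,
    summarize: Callable[[list[str]], str],
) -> tuple[Optional[str], list[str]]:
    """Return (summary of older turns, recent turns kept verbatim)."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(turns) <= keep_last:
        return None, list(turns)  # short enough; nothing to compress
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return summarize(older), recent


def stub_summarizer(older: list[str]) -> str:
    """Placeholder; a real system would ask a model to write the summary."""
    return f"[summary of {len(older)} earlier turns]"


turns = [f"turn {i}" for i in range(1, 81)]  # an 80-turn conversation
summary, recent = compact(turns, keep_last=6, summarize=stub_summarizer)
print(summary, recent)
```

The prompt then carries one summary plus six turns instead of all 80, while the full thread can stay in storage.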

Why memory is hard for agents

At first, memory sounds straightforward: store things, retrieve them later.

But the hard part is judgment, not storage.

What should be remembered? If the user says, “I usually prefer Python,” that’s probably worth remembering. If they say, “Let’s try Python for this one experiment,” maybe not. The agent needs to distinguish durable details from temporary context.

When should memory be updated? People change their minds, and systems and requirements change. If a user used to prefer FastAPI but now works mostly in Java, should the old memory be deleted, overwritten, or kept with a timestamp? A memory system needs a correction strategy.
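One possible correction strategy, sketched under the assumption that old values should be kept but marked as superseded rather than deleted. The `Fact` and `FactStore` names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    recorded_at: datetime
    superseded_at: Optional[datetime] = None  # None means "still current"


class FactStore:
    """Keeps every version of a fact; only the newest one is current."""

    def __init__(self) -> None:
        self._facts: list[Fact] = []

    def set(self, key: str, value: str, when: Optional[datetime] = None) -> None:
        when = when or datetime.now(timezone.utc)
        for fact in self._facts:
            if fact.key == key and fact.superseded_at is None:
                fact.superseded_at = when  # retire the old value, keep the record
        self._facts.append(Fact(key, value, when))

    def current(self, key: str) -> Optional[str]:
        for fact in self._facts:
            if fact.key == key and fact.superseded_at is None:
                return fact.value
        return None

    def history(self, key: str) -> list[str]:
        return [fact.value for fact in self._facts if fact.key == key]


store = FactStore()
store.set("preferred_backend", "FastAPI")
store.set("preferred_backend", "Java")
print(store.current("preferred_backend"), store.history("preferred_backend"))
```

Retrieval uses `current()`, while `history()` remains available for auditing how the preference changed.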

How much memory should be retrieved? Retrieving too little means the agent misses important context. Retrieving too much means the prompt becomes noisy. This balance matters because more context isn’t always better.
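A retrieval budget is one way to express that balance in code. The sketch below assumes candidates arrive with relevance scores already attached; the threshold, item limit, and character budget are arbitrary illustrative values, not recommendations.

```python
def select_memories(
    scored: list[tuple[str, float]],
    min_score: float = 0.3,
    max_items: int = 3,
    max_chars: int = 200,
) -> list[str]:
    """Pick the highest-scoring memories that clear a threshold and fit a size budget."""
    chosen: list[str] = []
    used = 0
    for text, score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if score < min_score or len(chosen) >= max_items:
            break  # everything after this is less relevant, or we are full
        if used + len(text) > max_chars:
            continue  # too big for the remaining budget; try a smaller one
        chosen.append(text)
        used += len(text)
    return chosen


candidates = [
    ("Asked about lunch options last week.", 0.12),
    ("User prefers Python and FastAPI.", 0.82),
    ("Deploys on OCI behind an API gateway.", 0.74),
]
selected = select_memories(candidates)
print(selected)
```

The low-scoring memory is left out even though there is room for it, which keeps the prompt from filling up with noise.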

How do we prevent memory leaks? If memories are shared across users, agents, or tenants, scoping is critical. The agent should only retrieve memories it’s allowed to use. This is especially important in enterprise systems where agents may operate across many customers, teams, or workflows.

How do we know whether memory helped? Memory should improve the agent’s behavior. It should reduce repeated questions, improve continuity, lower token usage, and help the agent produce more relevant responses. If memory just adds complexity without improving outcomes, it isn’t doing its job.

How Oracle is approaching agent memory

Richmond was gracious enough to share how Oracle is tackling this with the Oracle AI Agent Memory Package (OAMP), built on top of Oracle AI Database 26ai.

Yes, an AI database! Think of it as a database that can store and query the kinds of data AI applications need, not just rows and columns. That includes embeddings and JSON documents along with text search and regular SQL. These live together in the database, so an agent does not have to bounce between separate systems just to gather context.

The idea is to make Oracle AI Database the memory core for agents. Instead of stitching together a vector database, a relational database, a document store, and custom thread management, OAMP provides agent-friendly memory primitives on top of a database that already supports multiple data access patterns.

At a high level, OAMP gives you:

Users and agents to scope memory ownership
Memories for durable facts and extracted knowledge
Threads for conversation history and continuity
Context cards for compact, prompt-ready memory retrieval
Summaries for long-running conversations
Vector search for semantic recall
Database-backed persistence so memory survives restarts

This matters because, again, agent memory is not only a vector search problem. Some memory needs semantic retrieval. Some needs ordered reads or exact SQL filtering. A database-backed memory system gives you room to support all of those patterns.

Here’s a small example of what that looks like in code:

from oracleagentmemory.core import OracleAgentMemory
from oracleagentmemory.core.llms import Llm

client = OracleAgentMemory(
    connection=connection,
    embedder="text-embedding-3-small",
    llm=Llm("gpt-5.5"),
    extract_memories=True,
    schema_policy="create_if_necessary",
)

client.add_user(
    "angie",
    "Developer exploring agent memory patterns."
)

client.add_agent(
    "memory-demo-agent",
    "Assistant that demonstrates Oracle AI Agent Memory."
)

client.add_memory(
    "Angie is fascinated by agent memory and prefers practical examples over abstract explanations.",
    user_id="angie",
    agent_id="memory-demo-agent",
)

There are a few important ideas packed into this snippet.

The OracleAgentMemory client is the bridge between the agent application and Oracle AI Database. The database connection tells OAMP where memory lives. The embedder tells it how to turn memory text into vectors for semantic retrieval. The LLM enables automatic memory extraction and summary generation. And schema_policy="create_if_necessary" lets OAMP manage the underlying memory schema instead of making every application reinvent it.

The user and agent registration may look like simple setup code, but it’s actually part of the memory model. Memories need ownership. In a real system, you don’t want one user’s preferences showing up in another user’s session, and you don’t want memories written by one agent casually mixed with another agent’s context. The user ID and agent ID give the memory layer a way to scope what gets stored and retrieved.

The add_memory() call stores a durable fact. This is a piece of information the agent may need later, even if the exact conversation has moved on.

Given this, we can now recall memories.

results = client.search(
    "how should I explain this topic to Angie?",
    user_id="angie",
    max_results=3,
)

This search() call shows the part that makes semantic memory useful. The query doesn’t have to match the stored sentence exactly. We stored that I prefer practical examples, but we searched for how to explain something to me. Those are different words but related in meaning. That’s the point.

Threads and context cards

Durable memories are only part of the picture. Agents also need conversation continuity.

With OAMP, a thread can represent a real work session, such as an agent helping investigate a production issue:

from oracleagentmemory.apis.thread import Message

thread = client.create_thread(
    user_id="angie",
    agent_id="support-triage-agent",
)

thread.add_messages([
    Message(
        role="user",
        content="Customer Acme Corp is seeing intermittent checkout failures after the latest deployment.",
    ),
    Message(
        role="assistant",
        content="I'll check recent deployment notes, related incidents, and payment service logs.",
    ),
    Message(
        role="user",
        content="Focus on the payment gateway first. We saw similar timeout errors last quarter.",
    ),
])

This is much closer to how memory shows up in real agent applications. The useful context is not just that messages were exchanged. It’s that this thread is about Acme Corp, checkout failures, a recent deployment, the payment gateway, and a related incident from last quarter.

When it’s time to call the model, instead of passing the entire raw thread, you can ask for a context card:

card = thread.get_context_card()

The context card gives the agent a compact block of relevant memory to use in the next prompt.

Conceptually, the prompt becomes:

System: You are a helpful assistant. Use the provided memory context.

Memory context: [context card]

User: What did we decide earlier?

This is a much cleaner pattern than appending every message forever.
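The conceptual prompt above could be assembled roughly like this. The sketch assumes the context card can be rendered as plain text; how OAMP actually represents a card is not shown in this article, so `card_text` is a hand-written placeholder and `build_messages` is a hypothetical helper.

```python
def build_messages(card_text: str, question: str) -> list[dict[str, str]]:
    """Combine a fixed instruction, the memory context, and the new question."""
    system = (
        "You are a helpful assistant. Use the provided memory context.\n\n"
        f"Memory context:\n{card_text}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


# Placeholder text standing in for a rendered context card (assumption).
card_text = (
    "Thread topic: Acme Corp checkout failures after the latest deployment. "
    "Decision: investigate the payment gateway first."
)
messages = build_messages(card_text, "What did we decide earlier?")
print(messages)
```

The prompt stays two messages long no matter how many turns the underlying thread contains.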

Automatic memory extraction

OAMP can also extract memories from conversation.

For example, if the user says:

I prefer Python over TypeScript for backend work. I usually deploy FastAPI apps on OCI behind an API gateway.

The memory system can extract durable facts such as:

The user prefers Python over TypeScript for backend work.

The user deploys FastAPI applications on Oracle Cloud Infrastructure behind an API gateway.

That means the application does not have to manually call add_memory() for every useful fact.

A smart thread can be configured like this:

thread = client.create_thread(
    user_id="angie",
    agent_id="memory-demo-agent",
    memory_extraction_frequency=2,
    memory_extraction_window=4,
    enable_context_summary=True,
    context_summary_update_frequency=2,
)

This tells the system to periodically inspect recent messages, extract durable memories, and maintain a running summary.

This is where agent memory starts to feel more like a living part of the agent architecture than just a data structure.

Teaching an agent about a database

One of the most interesting examples Richmond and I discussed was using memory to teach an agent about a database.

Imagine an enterprise data agent that needs to answer questions about a schema it has never seen before. Instead of fine-tuning a model, the agent can scan the database catalog and store what it learns as memory.

It might inspect:

ALL_TABLES for table names and row counts
ALL_TAB_COLUMNS for column names and types
ALL_TAB_COMMENTS for human-written table descriptions
ALL_COL_COMMENTS for column descriptions
ALL_CONSTRAINTS for primary keys and foreign keys
V$SQL for recent workload patterns

Then it can convert those technical details into natural-language memories.

For example:

Table SUPPLYCHAIN.VESSELS stores individual ships owned or operated by carriers. It includes vessel identifiers, carrier relationships, and operational metadata.

Now when a user asks:

Where would I find information about ships and carriers?

The agent can retrieve the relevant schema memory by meaning.
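The conversion step might look something like the sketch below. It assumes the catalog rows have already been fetched; the `describe_table` helper and the column names are invented for illustration and are not taken from a real schema.

```python
from typing import Optional


def describe_table(
    owner: str,
    table: str,
    comment: Optional[str],
    columns: list[tuple[str, str]],
) -> str:
    """Turn catalog metadata for one table into a sentence an agent can retrieve."""
    summary = comment.strip().rstrip(".") if comment else "No table comment available"
    column_list = ", ".join(f"{name} ({dtype})" for name, dtype in columns)
    return f"Table {owner}.{table}: {summary}. Columns: {column_list}."


memory_text = describe_table(
    "SUPPLYCHAIN",
    "VESSELS",
    "Individual ships owned or operated by carriers.",
    # Hypothetical columns, for illustration only.
    [("VESSEL_ID", "NUMBER"), ("CARRIER_ID", "NUMBER"), ("VESSEL_NAME", "VARCHAR2")],
)
print(memory_text)
```

Each generated sentence can then be stored as a memory and embedded like any other durable fact.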

This is a beautiful pattern because it avoids one of the common traps with agents: expecting the model to already know your private system.

It doesn’t. And that’s okay.

You can teach it by turning your system’s metadata into memory.

The more I learn about agent memory, the more I believe this will be one of the defining pieces of agent architecture.

Tool calling lets agents act. Planning lets agents decide what to do. Memory lets agents build continuity.

With memory, we can start designing agents that feel less like one-off prompt responders and more like persistent collaborators.

Of course, this also raises the bar. Memory has to be scoped, auditable, correctable, and intentionally retrieved. Bad memory is worse than no memory. So the challenge is not simply giving agents memory but giving them the right memory architecture.

Oracle’s OAMP approach is one way to make that system concrete: users, agents, memories, threads, context cards, summaries, and database-backed retrieval.

And while the implementation details matter, the bigger idea is that if we want agents to be useful beyond a single prompt, they need a way to remember.

Not everything. But enough to carry context forward.

Agentic Code Review

Addy Osmani — Fri, 26 Jun 2026 15:50:43 +0000

The following article originally appeared on Addy Osmani’s blog site and is being republished here with the author’s permission.

Coding agents are extraordinarily good now, and getting better fast. The interesting consequence is that the hard part of engineering moved from writing code to deciding whether to trust it, which makes review the most leveraged skill in software right now. How you approach it depends enormously on who you are: A solo developer with no users and a team maintaining a 10-year-old application are not solving the same problem.

I am more optimistic about agentic engineering than I have ever been. The agents are genuinely good, they get better every month, and on an ordinary day I now ship things I would not have attempted a year ago. This write-up is a map of where the interesting work went, because it did move, and most teams have not fully caught up to where.

Code review used to work because of a happy accident of relative speed. A senior engineer could read code faster than a junior could write it, so review kept pace without anyone designing it to, and the team absorbed how the system fit together as a side effect of reading each other’s diffs. A lot of that was not deliberate. It fell out of a single fact: Writing code was the slow, expensive part, and reading it was cheap and fast.

That fact no longer holds. An agent will produce a thousand lines of often solid, well-formatted code in less time than it takes me to read this paragraph, while a human’s reading speed has not changed since roughly the day we started staring at screens for a living. So the constraint moved downstream, to the one step that did not get faster: a person being confident the change is right. I don’t think that’s a loss. It’s the most leveraged place in software to be good right now, and it’s where I’ve put most of my attention this year.

There’s a happy twist here that shapes the rest of this piece. The same tools generating all that extra code are also the best thing I have for keeping up with it. On my own projects, including the popular open source ones, I now point Claude Code or Codex at a batch of incoming PRs and have them triage the queue for me, and that has genuinely changed how I spend my time. So this is not an anti-AI argument, and I will come back to exactly how I use AI.

It’s also not a data dump, and not another round of whether letting a model write your code is wonderful or the end of the craft, because that framing is useless. The only answer that survives contact with a real codebase is that it depends entirely on who you are. A developer vibe-coding a side project only a dozen people will ever run and a team keeping a 10-year-old enterprise system alive for another quarter share almost no constraints worth naming, and most of the advice in circulation is really one of those two people telling the other how to live.

What the 2026 data actually shows

The productivity gains from AI are real, but raw output overstates them: about four times the code for a tenth more delivered value. The gap between those numbers is review work, which is exactly why review is where the leverage now sits.

For a couple of years this was an anecdotal argument. It’s now measured at scale, by organizations with no shared agenda and in several cases competing commercial interests, and the measurements keep pointing the same way: AI pushes output sharply up and pushes both quality and reviewability down.

Faros AI instrumented 22,000 developers across 4,000 teams and tracked what happened as teams moved from low to high AI adoption. This is March 2026 data, about as current as anything here. The upside is real. Developers merge considerably more PRs and complete more work, and throughput per engineer climbs. Then the rest of the report:

Code churn is up 861%.
The incidents-to-PR ratio is up 242.7%.
The per-developer defect rate is up from 9% to 54%.
Median review duration is up 441.5%, with time to first review and average review time both roughly doubling.
PRs merged with zero review are up 31.3%.

The last figure is the one I find hardest to dismiss, because nobody chose to stop reviewing. Reviewers simply couldn’t keep pace with the volume, so code began merging unread, and that became normal. The detail I keep returning to is that teams with mature, disciplined engineering practices were hit just as hard as everyone else. Good process didn’t protect them, because the volume arrived faster than any process was designed to absorb.

CodeRabbit studied 470 open source PRs in December 2025, 320 AI-coauthored and 150 human-only, and found the AI changes carried roughly 1.7x more issues. Logic and correctness problems were up about 75%, security issues were 1.5 to 2x more common, and readability problems more than tripled. The company’s AI director, David Loker, described these as “predictable, measurable weaknesses that organizations must actively mitigate.” Predictable is the operative word. These are known, locatable weaknesses, which is good news: It means a review process, human or automated, can be aimed straight at them.

One caveat to hold throughout: CodeRabbit and Faros both sell into this market, so their framing is not disinterested. That doesn’t make the numbers wrong—the effect sizes are large and consistent across unrelated sources—but vendor research deserves to be read with that in mind.

GitClear has the single number I would lead with. In its productivity data through 2025, daily AI users produce around 4x the raw output of nonusers, but measured against their own output a year earlier, the real productivity gain is only about 12%. You’re generating roughly four times the code for something like a tenth more delivered value, and a human still has to review all of it. To GitClear’s credit, CEO Bill Harding is explicit that some of even that 12% is selection bias, because stronger developers are concentrated in the AI cohort.
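The arithmetic behind that comparison is short enough to spell out. The sketch takes the 4x and 12% figures above at face value and adds one assumption of its own: that review effort scales roughly with the amount of code produced.

```python
raw_output_multiplier = 4.0        # about 4x the code
delivered_value_multiplier = 1.12  # about 12% more delivered value

# Value delivered per unit of code, relative to before.
value_per_unit_of_code = delivered_value_multiplier / raw_output_multiplier

# Code to review per unit of delivered value, relative to before
# (assumes review effort is proportional to code volume).
review_load_per_unit_of_value = raw_output_multiplier / delivered_value_multiplier

print(round(value_per_unit_of_code, 2))         # 0.28
print(round(review_load_per_unit_of_value, 2))  # 3.57
```

Under those assumptions, each unit of delivered value now comes with roughly three and a half times as much code to read.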

GitHub reports that Copilot review has now run over 60 million reviews, a 10x increase in under a year, and more than one in five reviews on the platform involves an agent. This is no longer a niche practice. It’s how code gets made.

Four datasets, four methods, one conclusion. We poured machine-speed output into a system built for human-speed work. The bottleneck didn’t disappear; it moved to verification, and review is where that bill comes due.

Everyone is solving a different problem

How much review a change needs depends almost entirely on its blast radius, and most advice you read was written by someone operating at a very different one.

Almost all the alarming data above comes from enterprise telemetry and from open source maintainers being overwhelmed. It’s entirely real if that is your situation. If you’re one person shipping something only a handful of people will ever run, much of it simply doesn’t apply to you, and you shouldn’t be made to feel otherwise.

Three variables determine where you sit:

Blast radius: What happens when it breaks? Nothing, or angry users and money and PII on the line?
How long the code lives: A throwaway prototype you might rewrite next week, or a codebase you’ll maintain for years?
How many people need to understand it: Just you holding the whole thing in your head, or a team that has to share ownership over time?

Run the same diff through those three variables, and “good review” means genuinely different things.

If you’re working solo on a greenfield project with no users, review’s second job, distributing knowledge across a team, doesn’t exist for you. You are the team. The reasonable move is to lean hard on tests and automation, review the parts that genuinely matter, and accept a lighter touch on the rest. Duplication and churn cost far less when the code may not exist in a month and nobody is paged at 3:00am when it breaks. The catch, and people learn this one painfully, is that it only works if the tests are real. Skipping review without a safety net doesn’t remove the work. It defers it at a higher price, and standards slip when no one is there to push back. “No users” is permission to defer review. It isn’t permission to skip verification.

Then the project gets users. This is the dangerous middle, and the crossing is rarely noticed at the time. Review’s bug-catching role suddenly matters, because bugs now hurt people, and its knowledge-sharing role switches on, because it’s no longer only you. Teams keep their solo-era habits a few months too long, and then there’s a postmortem and the Faros numbers stop being a chart and become their own dashboard.

At the far end is the large organization with an old codebase and many users. Here every alarming figure lands at full strength. A duplicated helper isn’t a style nit; it’s a future bug surface and a maintenance cost that compounds for years. A change nobody understood is comprehension debt that becomes someone’s on-call incident. Review is doing several jobs at once, and the volume of agent output quietly breaks all of them. The Faros finding about mature teams is aimed squarely here.

So the point is not “Enterprises should be cautious and solo developers can relax.” It’s that the purpose of review changes with your position, so the rules have to change with it. Bolt an enterprise’s locked-down multi-agent evidence-required pipeline onto a two-person prototype and you’ve added friction for no benefit. Run “tests pass, ship it” on a payments system and you’ve built an incident generator with a green checkmark on top. Most bad advice in this space is one position on that spectrum prescribing to another.
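To show what "the rules have to change with it" could mean in practice, here is a deliberately simple sketch that maps the three variables to a review plan. The fields, the cutoffs, and the plan items are illustrative assumptions, not a policy anyone has published.

```python
from dataclasses import dataclass


@dataclass
class Change:
    has_users: bool             # blast radius: does anyone get hurt if it breaks?
    handles_money_or_pii: bool  # blast radius: is money or personal data on the line?
    long_lived: bool            # will this code be maintained for years?
    shared_ownership: bool      # does a team need to understand it?


def review_plan(change: Change) -> list[str]:
    """Scale the amount of review with position on the spectrum (illustrative)."""
    plan = ["automated tests", "one AI reviewer"]
    if change.has_users or change.shared_ownership:
        plan.append("human review of the diff")
    if change.handles_money_or_pii or (change.long_lived and change.shared_ownership):
        plan.append("second, differently built AI reviewer")
        plan.append("recorded rationale for the change")
    return plan


solo_prototype = Change(False, False, False, False)
payments_system = Change(True, True, True, True)
print(review_plan(solo_prototype))
print(review_plan(payments_system))
```

The same diff gets two checks in one setting and five in the other, which is the whole argument in miniature.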

What review is actually for now

Review was built to check an author’s reasoning. An agent does reason, but that reasoning is usually thrown away rather than attached to the code, so the reviewer has to reconstruct a rationale that never made it into the diff. The good news is that this is a tooling problem, and capturing the reasoning makes review dramatically easier.

This is the part that genuinely changed, and I think it is underappreciated.

When a human writes code, intent comes along for free. The reasoning, the alternatives weighed and discarded, lived in the author’s head, and review was you checking that reasoning. Modern agents do reason, often visibly, producing thinking traces and weighing options and explaining themselves as they go. The catch is that this reasoning is usually discarded the moment the diff is produced. It’s rarely captured and rarely attached to the PR, and in any case it is the agent’s reasoning about how to implement the task, not a human’s judgment about whether it was the right task to begin with. So review shifts from checking reasoning that sits in front of you to reconstructing intent that never got written down, which is harder and slower, and we keep acting surprised that it takes 441% longer.

A 2026 paper, “AI Slop and the Software Commons,” analyzed 1,154 posts across 15 Reddit and Hacker News threads where developers discussed “AI slop.” One line from a developer has stayed with me: reviewing an agent’s PR made them “the first human being to ever lay eyes on this code.”

That sentiment points straight at the fix. In normal review, the author already understood the change and you were checking their work. With an agent PR, nobody has reconstructed the why yet, and the reviewer is the first to try. As the paper puts it, review “wasn’t built to recover missing intent.” The encouraging part is that missing intent is recoverable: The reasoning existed; we just discarded it. Have the agent state what it was trying to do and what it ruled out, then capture it as a decision log on the PR, and a large part of the reconstruction cost disappears. This is a tooling problem, and tooling problems get solved.
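A decision log does not need to be elaborate. The sketch below is one hypothetical shape for it: a small record the agent fills in, rendered as text that can be posted on the PR. The field names and the sample content are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionLog:
    goal: str      # what the agent was trying to do
    approach: str  # what it actually did
    ruled_out: list[str] = field(default_factory=list)  # alternatives it rejected

    def render(self) -> str:
        """Format the log as plain text suitable for a PR comment."""
        lines = ["Decision log", f"Goal: {self.goal}", f"Approach: {self.approach}"]
        if self.ruled_out:
            lines.append("Ruled out:")
            lines.extend(f"- {item}" for item in self.ruled_out)
        return "\n".join(lines)


log = DecisionLog(
    goal="Stop duplicate charges when checkout is retried.",
    approach="Added an idempotency key to the payment request.",
    ruled_out=["Client-side retry lockout: does not cover server retries."],
)
print(log.render())
```

With this attached, the reviewer starts from the stated intent instead of reverse-engineering it from the diff.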

None of which makes “have the AI review the AI” a complete answer on its own. A second model with different priors genuinely catches real bugs, and it catches a lot of them, which is why you should run one. What it doesn’t supply is the human judgment about whether this is the right change to build in the first place. That judgment stays with a person, and it happens to be the most interesting part of the job and the part worth keeping.

The tools are good, but not always for the reason they advertise

The current AI reviewers are genuinely good, and they rarely flag the same lines as each other, so the right move is not picking the best one but running two that are built differently.

The dedicated AI review tools are good now, and I think you should be running at least one on everything, side projects included. CodeRabbit is the most widely deployed and topped the independent Martian benchmark (January to February 2026) on F1, at around 49% precision with the best recall in the field. Greptile trades precision for recall, with around an 82% bug-catch rate against CodeRabbit’s 44% in one benchmark, at the cost of more false positives. Anthropic’s Code Review reports under 1% of its findings marked incorrect by their engineers; the figure I would actually show a manager is that it raised their internal rate of PRs receiving a substantive review from 16% to 54%. The long tail of changes that used to get a glance and an approval now gets read by something.

The most useful result I have seen this year isn’t from a vendor. An engineer ran four reviewers in parallel, CodeRabbit, Sentry Seer, Greptile and Cursor BugBot, across 146 real PRs and 679 findings over three and a half weeks:

Of 617 distinct flagged locations, 93.4% were caught by exactly one of the four tools. 6% by two. Almost none by three. None at all by all four.

Not once did all four tools flag the same line. Each was strong at a different class of problem: Greptile with near-zero false positives on correctness and architecture, CodeRabbit with the widest net and one-click fixes, and Seer best on production-failure severity. That is the adversarial review argument demonstrated on a real codebase rather than in a paper. Heterogeneity is the whole point. Four copies of one model is a single reviewer with a larger invoice, whereas four genuinely different reviewers surface a set of bugs no single member could find alone, the human included.
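Measuring this kind of overlap on your own repository takes very little code. The sketch below groups findings by file and line and counts how many tools flagged each location; the findings listed are made-up sample data, not results from the experiment described above.

```python
from collections import Counter, defaultdict

# (tool, file, line) tuples; invented sample data for illustration.
findings = [
    ("greptile", "api/orders.py", 42),
    ("coderabbit", "api/orders.py", 42),
    ("coderabbit", "api/orders.py", 97),
    ("seer", "jobs/billing.py", 12),
    ("bugbot", "web/cart.ts", 5),
    ("greptile", "web/cart.ts", 88),
]

# Which tools flagged each distinct location?
tools_by_location: dict[tuple[str, int], set[str]] = defaultdict(set)
for tool, path, line in findings:
    tools_by_location[(path, line)].add(tool)

# How many locations were flagged by exactly 1, 2, 3, or 4 tools?
overlap = Counter(len(tools) for tools in tools_by_location.values())
total = len(tools_by_location)
share = {n: overlap[n] / total for n in sorted(overlap)}
print(overlap, share)
```

If most locations land in the "exactly one tool" bucket, the reviewers are complementing each other rather than duplicating work.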

In practice: Do not agonize over the single best tool because there isn’t one. At the high-stakes end, run two with deliberately different characters. (The experiment above paired Greptile for everyday correctness with Seer for production-failure severity, with almost no overlap.) If you are solo, one good reviewer plus real tests is plenty. And whatever the marketing says, measure it on your own code, because every one of these results was specific to a particular codebase, and yours will be too.

Should we just let AI review more of it?

The machine is already reviewing more of your code than you are. The only real decision left is whether you do that deliberately, and the amount of human you keep should scale with your blast radius.

I keep hearing a question from experienced engineers that would have been heresy a year ago: Should the machine be doing more of the reviewing, perhaps most of it? I no longer think that’s a foolish question.

The uncomfortable part is that AI review works. Under 1% of Anthropic’s findings are marked wrong; the tools catch bugs humans read straight past, and they don’t get tired on the 30th PR of the day, which is exactly when a human is least reliable. Meanwhile humans are visibly not keeping up: Zero-review merges are up 31% and review times are up triple digits. In a real sense the machine is already reviewing more of the code than we are. The honest framing is not “Should we let AI review more?” but “AI is already doing it, so are we going to be deliberate about that or let it happen by default while pretending humans still read everything?”

Loop engineering sharpens this. The premise of a loop is that you stop being the person who prompts the agent and instead build a system that prompts it, and a central part of that system is a judge: an agent that decides whether the work is done before moving on. The reviewer is the next role being designed out of the inner loop, on purpose. We spent a year automating the writing, and the loops are now automating the checking, and the human keeps getting pushed up and out. “Where does the human stay?” is not a seminar question; it’s something you decide every time you wire up a loop, whether or not you realize you’re deciding it.

Where I currently land, and I hold this loosely: The answer is not “a human reads every line.” That’s over. The volume ended it, and anyone insisting otherwise is describing a world that no longer exists. But it’s also not “let the loop review itself and walk away.” When an agent writes the code, another reviews it, and a third judges it, you have a closed loop of models with broadly correlated blind spots, especially when they come from the same family, confidently agreeing in the same places. A confident “looks good” with no human anywhere in it is borrowed confidence: The system’s certainty becomes yours, and nobody actually understood anything. The loop can be both very sure and very wrong, with no human left to tell the difference.

So the human doesn’t leave; the human moves up a level. You stop reviewing every diff and start owning the parts that do not transfer to a model. Accountability, because you can’t page a model at 3:00am. The judgment of whether this is even the right change to build, as distinct from whether the code is correct. The high-blast-radius gates where being wrong is expensive. And the awkward one: the behavior nobody specified, because a model reviews the code that exists and rarely flags the requirement that nobody thought to write down, which remains a human-shaped gap I don’t expect to close soon. Human in the loop becomes human on the loop: sampling, spot-checking and auditing the system rather than reading every PR, and spending your limited attention where being wrong would actually hurt.

This is already how I work on my own projects, including the open source ones that now see more PRs in a day than I could carefully read in an evening. I point Claude Code or Codex at a batch of incoming PRs and ask for a first pass: a high-level read of what looks safe to merge, what needs more work, and what’s genuinely high-risk. I don’t auto-merge on the result, and I don’t lazy-merge whatever it approves. What it gives me is a way to allocate attention. I can spend a few minutes confirming the changes it considers low risk, and put real, careful time into the ones it flags as dangerous. The detail that matters is that this isn’t my old review hour made slightly faster. It’s a different shape of hour, and at the volume I now deal with, it’s the main reason the queue stays survivable at all.

Codex and Claude Code giving me a first-pass, risk-sorted read of a batch of PRs. The triage is the help. The merge decision stays mine.

A more extreme version of the same move comes from Kun Chen, an ex-Meta L8 engineer now shipping around 40 PRs a day as a solo builder, who has largely stopped reviewing code. It would be easy to dismiss this, except he is an L8, unusually good at the thing he stopped doing. He runs 20 to 30 agents in parallel and has moved his effort into the plan: He writes detailed plans up-front; the agents run for hours against them, and he says plan quality determines how long they can run unattended. That’s the move I described above in its purest form. It’s worth being precise about what actually happened, because it is not that he stopped verifying. The intent didn’t vanish; he wrote it down himself in the plan, so the “first human to ever lay eyes on this” problem is half-solved. A human did understand the why, just up-front rather than after. And he didn’t work without a net. He built an automated review gate (which he calls No Mistakes) that checks the code before it merges, and he stays on escalation when an agent gets stuck. The human does the expensive thinking before the code exists, and the machine does the line-by-line afterward, which may well be the shape of where this goes.

But he’s a solo builder with no large team and no decade-old system full of landmines beneath him. The exact conditions that make 40 PRs a day without review rational for him are conditions most readers don’t have. Copy his workflow onto a team shipping to many users and you reproduce the Faros numbers on your own dashboard. Kun isn’t wrong; he’s just a long way down one specific end of the spectrum.

Which is the spectrum point again. Solo with no users: Letting AI review almost all of it is a defensible 2026 position, and you shouldn’t feel guilty about it. Maintaining something large for many people: Let the machine handle the first pass, the second pass, and the boring 90%, but keep a real human on the load-bearing paths and don’t let the loop close completely on anything that can hurt someone. How much human you keep is a dial, and you set it by blast radius, not by guilt.

What to actually do

Stop reviewing everything to the same depth. Spend scarce human attention only where being wrong is costly, and let cheap deterministic gates and AI reviewers handle the rest.

The organizing idea is to match review effort to the cost of being wrong, push the cheap deterministic work as early as possible, and reserve human attention for what only humans can do.

Tier by risk, not by author. A config change earns a linter and a glance. A payments path earns the full stack: types, tests, two different AI reviewers, a human who owns that system, and a security pass. Don’t spend a heavy review on boilerplate, and don’t wave through an auth change because the tests are green. The layered approach is the same everywhere; what changes is how many layers a given diff has to clear.
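Tiering like this can live in a few lines of configuration. The sketch below is one way to express it, assuming risk can be inferred from file paths; the path patterns and layer names are hypothetical placeholders, not a recommended taxonomy.

```python
from fnmatch import fnmatch

# Hypothetical path patterns and layer names; adapt them to your repository.
# Tiers are ordered from riskiest to least risky.
TIERS = [
    ("high", ["payments/*", "auth/*"],
     ["types", "tests", "ai_review_a", "ai_review_b", "owner_review", "security_pass"]),
    ("medium", ["src/*"], ["types", "tests", "ai_review_a"]),
    ("low", ["config/*", "docs/*"], ["lint", "glance"]),
]

def required_layers(changed_paths):
    """Return (tier, layers) for the riskiest tier that any changed path touches."""
    for tier, patterns, layers in TIERS:
        if any(fnmatch(path, pat) for path in changed_paths for pat in patterns):
            return tier, layers
    # Paths nobody classified default to medium rather than low.
    return "medium", TIERS[1][2]

print(required_layers(["config/app.yaml"]))
print(required_layers(["docs/readme.md", "payments/charge.py"]))
```

The riskiest file in the diff sets the tier for the whole change, so a payments edit can't ride through on the back of a documentation update.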

Fast-fail the expensive tail. The most useful recent finding for teams drowning in agent PRs is “Early-Stage Prediction of Review Effort” (January 2026), which studied 33,707 agent-authored PRs. Agents are good at small, well-defined changes. Around 28% merge almost instantly, but they tend to “ghost” the moment they get subjective feedback, abandoning the back-and-forth that review actually is. (A companion 2026 paper found reviewer abandonment accounted for 38% of rejected agent PRs.) The researchers built a “circuit breaker” that predicts high-maintenance PRs from cheap signals like file types and patch size before a human looks, and it works well. Triage agent PRs up front, fast-track the trivial ones, and don’t let a person sink an hour into a sprawling change the agent will abandon as soon as you push back.
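To make the idea concrete, here is a hand-written triage heuristic over the same kind of cheap signals. It is not the model from the paper: the thresholds, file suffixes, and lane names are assumptions I picked for illustration, and you would tune them against your own PR history.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    files: list          # changed file paths
    lines_changed: int   # additions plus deletions

# Hypothetical thresholds and suffixes, chosen only for illustration.
LOW_RISK_SUFFIXES = (".md", ".txt", ".json", ".yaml", ".yml")
FAST_TRACK_MAX_LINES = 30
HEAVY_MIN_LINES = 500
HEAVY_MIN_FILES = 20

def triage(pr):
    """Sort a PR into a lane using only signals available before a human reads it."""
    if pr.lines_changed >= HEAVY_MIN_LINES or len(pr.files) >= HEAVY_MIN_FILES:
        return "needs-split"  # send it back before anyone invests review time
    if pr.lines_changed <= FAST_TRACK_MAX_LINES and all(
        f.endswith(LOW_RISK_SUFFIXES) for f in pr.files
    ):
        return "fast-track"
    return "normal"

print(triage(PullRequest(["README.md"], 5)))
print(triage(PullRequest(["billing.py"], 900)))
```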

Raise the bar for what you will even review. The fix for being buried isn’t locking down the repository. It’s refusing to review changes that arrive without evidence. Require, before review, a statement of what the change is for, a diff that isn’t 3,500 lines with no comments, the test output, and proof it was actually run. This is how you stop being the first human to read the code. You push the intent-reconstruction work back onto whoever submitted it, where it’s cheap, rather than absorbing it yourself, where it is expensive.
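An intake bar like this is easy to automate. A minimal sketch, assuming the submission arrives as a dictionary (for example, parsed from a PR template); the field names and the 400-line cap are hypothetical.

```python
REQUIRED_EVIDENCE = ("purpose", "test_output", "run_proof")  # hypothetical field names
MAX_DIFF_LINES = 400  # hypothetical cap; pick one your reviewers can actually read

def intake_problems(submission):
    """List the reasons a submission is not yet ready for human review."""
    problems = [
        f"missing {field}"
        for field in REQUIRED_EVIDENCE
        if not str(submission.get(field) or "").strip()
    ]
    if submission.get("diff_lines", 0) > MAX_DIFF_LINES:
        problems.append(f"diff exceeds {MAX_DIFF_LINES} lines; split it")
    return problems

ready = {
    "purpose": "Fix rounding in refund totals",
    "test_output": "12 passed",
    "run_proof": "link to CI run",
    "diff_lines": 120,
}
print(intake_problems(ready))
print(intake_problems({"purpose": " ", "diff_lines": 3500}))
```

An empty list means a human can start reading. Anything else goes back to the submitter, which is where the intent-reconstruction work belongs.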

Keep PRs small, deliberately. Agent PRs run large, 51% larger on average in the Faros data, and reviewer engagement is one of the strongest predictors that a PR merges at all. A large unreviewable PR gets rejected outright or, worse, rubber-stamped. Instruct your agents to produce small commits. A diff a human can actually read is now a design constraint, not a courtesy.

Read the test changes more carefully than the code. This is the agent failure mode to watch. The agent changes behavior, then “fixes” the test by rewriting the assertion to match the new, broken behavior. A green check over 200 edited tests means nothing until you have confirmed the edits were correct. Treat any diff that rewrites many tests as a flag and read those first. Mutation testing earns its place here: Coverage tells you a line ran; mutation testing tells you whether the test would notice if that line were wrong.

Treat CI as the wall that doesn’t move. Watch for the patterns GitHub now warns reviewers about: removed tests, skipped lint, lowered coverage thresholds, a duplicated helper that already exists elsewhere, and untrusted input flowing into a prompt. That last one deserves emphasis, because agent-built features are a fresh source of prompt injection: If a change pipes user-controlled text into an LLM call without thinking about what that text can instruct the model to do, the vulnerability isn’t visible in the diff. It’s latent in the data that will arrive later. Agents will also weaken CI to make themselves pass, not maliciously, just gradient descent finding the cheapest path to green. Deterministic gates are the one part of the pipeline that can’t be talked out of their verdict by a confident paragraph, so keep them strict.

A human owns the merge. A model can’t be paged and can’t be held responsible for what it shipped, so whoever clicks merge owns it. When an AI review says “looks good” in a calm, confident voice, it’s handing you confidence it hasn’t necessarily earned. Treat every AI review as a sensor, not a verdict: data, not a decision.

If you are solo with no users, the tiering, the test-change discipline, and CI are most of what you need; the rest is overhead until people show up. If you’re a large organization, all of it is the baseline, and the triage and intake bar are the difference between a review process that scales and one that quietly collapses.

What this means if you run a team

The bottleneck is no longer how fast you write code. It’s how fast a trusted human can be confident in a review. Cutting the people who provide that confidence because “AI made us faster” simply converts the saving into future incidents.

The binding constraint on shipping is now how fast a trusted human can be confident a change is correct. Any plan that treats generation as the bottleneck and review as free will quietly stall, with the velocity dashboard staying green the whole way.

The Faros report is direct about this: QA and review work rises even as output rises, so reducing engineering headcount because “AI made us faster” is dangerous unless you have closed the review gap first. The senior-engineer tax (review time up by triple digits) falls hardest on the people you can least afford to bottleneck, and it is invisible to any metric that only counts merged PRs.

Open source maintainers hit this wall first and hardest. The steady stream of plausible but hollow contributions costs real triage time even when those contributions are well-intentioned, and that’s the canary. Companies are next. The ones handling it well treat review capacity as a real resource to be measured, protected, and spent deliberately, not as slack that AI has freed up.

Writing got cheap but understanding didn’t

Code review didn’t become less important when agents arrived. It became the central activity. Writing code is increasingly solved and getting cheaper by the month; the durable advantage is the system that lets you trust what was written.

Don’t take the one-size answer in either direction. If you’re solo with no users, the enterprise horror stories about churn and duplication are a future risk, not today’s fire, so lean on your tests, review what matters, and stay honest that the deferred work is still owed. If you maintain something large for many people, every alarming number here is about you, and the only thing that holds is a tiered, evidence-required, deliberately heterogeneous review process with a human owning the merge.

What’s constant across the whole spectrum is the underlying economics. We made writing cheap, and understanding stayed exactly as expensive as it has always been. The teams that do well over the next few years won’t be the ones generating the most code; they’ll be the ones who built a review system they can actually trust, and who never confuse “the tests passed” with “a person understands what this does and why.”

Or, as Simon Willison keeps putting it, “your job is to deliver code you have proven to work.” Agents haven’t changed that. They have made “proving” the center of the job rather than an afterthought, and I think that’s a good trade. Understanding a system well enough to stand behind it is the most durable and most interesting skill in software, and there has never been a better time to get extraordinarily good at it.

This Week in AI: Who Controls the Loop?

Michelle Smith — Fri, 26 Jun 2026 10:32:42 +0000

This week host and Turing Post founder Ksenia Se threaded the latest news into a single argument: AI is moving out of conversation and into the operational loops where real work happens. From SpaceX’s $60 billion acquisition in the developer tools market to the G7’s debate about frontier model access to image generation company Midjourney’s pivot to medical hardware, the stories all pointed in the same direction.

When agents own the loop, the IDE becomes infrastructure

SpaceX’s acquisition of Anysphere, the company behind Cursor, for a reported $60 billion in stock is the kind of deal that looks straightforward until you think about what Cursor actually is. On the surface, it’s a popular AI-assisted code editor. (It’s also one of many in a highly competitive market.) However, Ksenia argued that that’s thinking too small, especially for Elon Musk. SpaceX may be angling to position Cursor as the new center of software work, in the same way GitHub became the center of the previous era.

In the old model, GitHub owned the pull request. But in the new model, the question of who owns the full loop where agents read a repo, write code, open pull requests, run tests, handle failures, and enforce engineering standards is still open. GitHub still owns the system of record and is moving to defend it: Chief product officer Mario Rodriguez recently told Turing Post that GitHub’s mission has shifted from human-developer collaboration to developer-and-agent collaboration, with the platform becoming agent-native across its APIs, UX, and underlying infrastructure. But as Ksenia explained, “Cursor’s advantage is that it owns the developer’s active coding surface” where the work starts.

If agents write more code than humans, software infrastructure should be redesigned around agents from the start. Cursor was built for agents. GitHub was built for humans and is now playing catch-up. That architectural choice may matter more than any individual product feature.

Frontier AI access is becoming a geopolitical question

The G7 summit this week included discussions about a “trusted partners” framework that would give select allied nations access to advanced US AI models, following a US order that restricted foreign nationals from accessing Anthropic’s frontier systems on national security grounds. AI models that can write software, find vulnerabilities, and operate across tools are capability systems, not just productivity software. The access rules are catching up to that reality, although as Ksenia noted, things haven’t yet come into complete focus.

For a long time, AI regulation sounded like: How do we label synthetic media? How do we reduce hallucinations, prevent bias, make chatbots safer? Now the question is so much bigger. Who can use these capable systems? Can allies use them? Can cybersecurity firms outside the US use them? Can non-US employees at US labs use them? Can European companies use American models if those models are also strategically sensitive? This isn’t traditional software licensing anymore. This is capability access control.

The underlying tension behind the G7 conversation is the dual-use problem: A model capable enough to find software vulnerabilities for defense can also find them for offense. The “trusted partners” framework reflects the new geopolitics of AI as countries jockey with rivals to secure strategic benefits for themselves and their allies. It represents an alliance layer for AI access that applies access structures previously reserved for physical military hardware to capabilities too strategically important to make fully open and too useful to keep entirely locked down. As Ksenia noted, the alliance is “not literally NATO, but [it is founded on] the same kind of logic.”

But access restrictions might also impact the talent that built these systems, who are increasingly not citizens of the country trying to control them. For instance, AI researcher Andrej Karpathy, recently hired by Anthropic, is publicly described as Slovak-Canadian. If access controls apply to non-US citizens, he and others like him may be denied access to the very systems they’ve been hired to work on. It’s an area we’ll continue to watch closely.

AI is entering the measurement loop

Midjourney, the company you probably associate with AI-generated images, has announced a new medical division and a full-body ultrasound scanner built around water immersion, developed in partnership with medical imaging hardware maker Butterfly Network. The device is designed to scan the entire body in 60 seconds: A person descends into a shallow pool on a motorized platform, passing through a ring of roughly half a million ultrasound sensors, each functioning as both a transmitter and receiver. The system uses over two petaflops of processing power to reconstruct a 3D body map from the returning wave data. Midjourney says the resulting images look comparable to today’s MRI output at a fraction of the cost and time, though that claim still needs serious clinical validation before it can stand.

The current prototype uses 40 Butterfly ultrasound-on-chip devices per system, according to a disclosure from Butterfly Network, which confirmed its codevelopment and licensing agreement with Midjourney. Midjourney plans to open a facility in San Francisco in 2027, embedding its device in a spa environment alongside hot tubs, saunas, and cold plunges. Diagnostic medical uses will require FDA approval; the initial focus is body composition mapping.

If Midjourney can build a library of full-body scans taken over months and years, that longitudinal record would give doctors and AI health tools a level of baseline data that doesn’t currently exist at scale outside of clinical trials. That’s the same structural logic Ksenia traced through Cursor and GitHub: The value compounds inside the loop through repeated, precise measurement over time. Midjourney is positioning itself to own that loop in the health domain.

What’s next

The competition for AI advantage is moving from model capability to infrastructure position. Who owns the coding loop? Who controls access to frontier systems? Who builds the measurement environment where health data accumulates over time? Those questions are about where intelligence meets operational reality, not which model scores highest on a benchmark.

Hiring news from the week reinforces how seriously the labs are treating this phase. John Jumper, the Nobel laureate who shared the prize with Demis Hassabis for AlphaFold, left Google DeepMind for Anthropic. Noam Shazeer, one of the coauthors of “Attention Is All You Need,” reportedly left Google for OpenAI after Google paid approximately $2.7 billion to bring him back in 2024. The labs are betting on scientific talent at the same time they’re betting on infrastructure.

Next week, host Andreas Welsch will be back to discuss multi-vendor strategy with Conductor’s Matt Palmer. They’ll cover Sakana’s launch of Fugu, Qualcomm’s ~$4B move for Modular, Anthropic’s Claude Tag stepping into Slack as a virtual coworker, Samsung putting ChatGPT and Codex in front of its entire workforce, and more. Register here to attend live.

Starting in July, registration for the live event will be open only to O’Reilly members. (If you’re interested, try O’Reilly out for free.) We’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, and Apple.

So Long and Thanks for All the Context

Andrew Stellman — Thu, 25 Jun 2026 10:30:34 +0000

I got a really interesting question last week from Mike Loukides, my editor at Radar, after he read the third part of this trilogy on context management. “Another issue I’ve read about,” Mike asked, “is the tendency for a model to ignore the middle of the context. I’ve seen that particularly for the models with very large context windows. Is there anything to be said about that?”

Excellent question, Mike, and yes, there is. In that same email he pointed out that clearing the context and reloading it with just what’s important does a pretty good job dealing with this “ignore the middle” problem when it happens, but that’s clearly a stopgap.

It’s worth a deeper dive into what’s actually happening when an AI starts forgetting what’s in the middle of its context, because the problem is deeper (and more interesting!) than it might seem at first. It turns out that there’s a basic problem that’s fundamental to how LLMs manage context, and we’re still learning about it as an industry. That problem is called a U-shape. There’s been a lot of really interesting research into the U-shape problem recently, and several useful techniques have emerged that can help you manage it. And it’s probably not a coincidence that I’ve had to use all of them in my ongoing experiments with AI-driven development and agentic engineering (even if I didn’t always realize that’s what I was doing at the time).

A few weeks ago, in fact, I ran into the exact failure mode that Mike described. I was running the Quality Playbook, my open source code quality engineering skill, and hit trouble in one of its phases: the one that writes up the bugs the earlier phases find. At that point in the process, the agent had just created a file called BUGS.md with an overview of each bug, and its next job was to create an individual writeup for each one. But instead of filling in the details correctly, it produced skeletal-looking stub files, with a generic template that had blank values instead of populated ones.

The thing is, the instructions for how to write a populated writeup were in the prompt. The actual bug data was in BUGS.md. I was absolutely certain that everything the agent needed was sitting in its context window, because I could see that it hadn’t compacted yet, and the skill’s intermediate artifacts let me see that earlier phases had read and reasoned about both files (which I talked about in my last article in this series). But the agent was producing stubs anyway. It really looked like the agent had everything it needed sitting in plain sight, and just wasn’t using the information it had. Frustrating!

I thought at the time that the model was just an idiot (which, arguably, was true but beside the point). It turns out that I had run directly into the U-shaped context problem.

In the previous three articles I covered what context is and why it disappears, how to keep important information in files instead of leaving it in the agent’s context window, and how to detect and recover when context has been compacted out from under you. All three were about losing context, through fragmentation, through compaction, through long sessions that overrun the window. This article is about an entirely different failure mode, the U-shape, where the context is still sitting in the window and the model just isn’t using it.

The U-shape failure, and why bigger windows don’t fix it

The U-shape is an active area of academic investigation, so I’m going to start by going into a little bit of that research, because I think it will actually help us pin down what’s going on. I’ll start with an experiment run by Nelson Liu, an AI researcher at Stanford, who tested how language models actually use the contents of long inputs by giving them documents with the relevant answer placed at different positions and measuring whether the model could still find it. His findings show that the U-shape wasn’t a quirk of a single model: It showed up across model families, and even models with larger context windows still exhibited it.

If you have time, it’s actually worth taking a look at the paper that Liu and his team wrote, called “Lost in the Middle: How Language Models Use Long Contexts.” (It’s surprisingly readable for an academic paper.) The result they reported was a robust U-shape: The model performed best when the relevant information was at the beginning of its context window or at the recent end and worst when it was in the middle. Performance on questions where the answer was buried mid-context fell off sharply, even when the answer was sitting right there in plain sight. The field now uses the terms primacy bias and recency bias for those two preferences, and the U-shape is what you get when you plot them together against position.
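You can run a small version of this kind of position test against whatever model you use. The sketch below is my own simplified harness, not the setup from the paper: `ask_model` is a hypothetical callable you would replace with a real client, and the stub here only "reads" the first and last documents so the example runs without a model and shows what a U-shaped result would look like.

```python
def build_probe(filler_docs, needle, position):
    """Place the needle document at a given index among the filler documents."""
    docs = list(filler_docs)
    docs.insert(position, needle)
    return "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))

def position_sweep(filler_docs, needle, question, expected, ask_model):
    """Return {position: bool} for whether the answer was recovered at each position."""
    results = {}
    for position in range(len(filler_docs) + 1):
        prompt = build_probe(filler_docs, needle, position) + f"\n\nQuestion: {question}"
        results[position] = expected.lower() in ask_model(prompt).lower()
    return results

def edge_only_stub(prompt):
    # Stand-in for a real model call: it only sees the first and last documents.
    docs = prompt.split("\n\nQuestion:")[0].split("\n\n")
    return docs[0] + " " + docs[-1]

filler = [f"Filler text number {i}." for i in range(4)]
results = position_sweep(
    filler,
    needle="The launch code is 7421.",
    question="What is the launch code?",
    expected="7421",
    ask_model=edge_only_stub,
)
print(results)
```

With a real model and much longer filler, plotting the hit rate against position is what produces the U.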

I’m going to lean a little into academia here, because a lot of researchers are still learning about how LLM context actually works and what behavior has emerged in it.

One reason the U-shape matters more than “just another LLM quirk” is that recent research has started showing it’s a structural property of how transformers work, not a learned artifact. A 2025 ICML paper called “On the Emergence of Position Bias in Transformers” explained it as the equilibrium between two opposing forces inside the model: The causal mask amplifies the influence of the first few tokens (the primacy bias), while position encodings like RoPE heavily weight the tokens closest to where the model is generating (the recency bias). The middle is where those two forces cancel out. A 2026 paper by Borun Chowdhury, a researcher at Meta, called “Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias,” took the argument even further by proving mathematically that the U-shape exists at the moment of initialization, before any training has happened, with random weights.
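The primacy half of that argument can be seen in a toy calculation. This is my own illustration, not the analysis from either paper: it assumes every token attends uniformly to everything the causal mask lets it see, and it ignores learned weights, residual connections, and position encodings entirely.

```python
def causal_uniform_attention(n):
    """Row i attends uniformly to tokens 0..i, which is all a causal mask allows."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def influence_on_last_token(n_tokens, n_layers):
    """Weight that each input position contributes to the last token after n_layers."""
    layer = causal_uniform_attention(n_tokens)
    total = layer
    for _ in range(n_layers - 1):
        total = matmul(layer, total)  # stack one more identical attention layer
    return total[-1]

for depth in (1, 4, 16):
    weights = influence_on_last_token(8, depth)
    print(depth, [round(w, 3) for w in weights])
```

With one layer the last token weights all eight positions equally. As layers stack, more and more of the weight drains toward position 0, because every path through the mask can only move backward. The recency pull from position encodings is not modeled here.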

That matters because the natural assumption about large context windows is that more room means fewer problems. Most of today’s frontier models give you a million tokens or more, with some pushing well past two million, and some have made real progress on the simplest version of the lost-in-the-middle test, the needle-in-a-haystack benchmark, where the model has to retrieve a single sentence buried in a long document. Google’s Gemini 1.5 Pro reported near-perfect single-needle recall at 1M tokens, and current Gemini 3 models are similar.

So the accurate version of “bigger windows don’t fix it” is this: Bigger windows have made simple single-fact retrieval much better. They have not made long-context agent work reliable by default. A two-million-token window means a bigger middle to fall into.

The important idea that’s emerging here is that it’s increasingly looking like the U-shape isn’t just a bug in today’s models that will eventually be worked out or trained away by more data or better fine-tuning. Instead, it seems like the U-shape may actually be a geometric property of the LLM architecture itself.

In other words, we’re all going to have to deal with the U-shape. And that means we need techniques for managing it, and any effective technique we use isn’t likely to become obsolete any time soon. And that’s my goal in this article: to show you the techniques that have emerged for managing U-shaped context memory loss that you can use today in your own work.

Five techniques to help with U-shaped context problems

The previous article in this series laid out a pattern for detecting and recovering from context loss, which I called externalize-recognize-rehydrate. The techniques below extend the same discipline to the lost-in-the-middle problem. The principle I keep coming back to is that working memory is untrustworthy, and the discipline that follows from it is to externalize what matters, curate what stays in context, and verify what the agent claims to know against what’s on disk. The five techniques are how I do that in practice, and each one is drawn from a real moment in the Quality Playbook’s development.

Curate, don’t accumulate

This is the technique which, in its most brute-force form, is exactly what Mike talked about in his email to me: Clear the context and reload it with only what matters, periodically and deliberately. In other words, don’t trust an accumulated session to stay coherent; build the artifact, then start fresh against it. And if you have the AI write down the important parts of the context (like we’ve talked about throughout this series), then you can start a new session with a refreshed AI that has a more targeted, curated context as a starting point.

I ran into this during the v1.5.2 release prep for the Quality Playbook. I was using a long Claude Code session that had been working through a series of fixes. But I noticed that it was just starting to show its age: It had forgotten a couple of things it should know, and its thinking times were starting to grow.

When it came time to land the final four fixes for the release, I worked with the AI to write a context brief: a separate document with everything the implementing session needed. The question was whether to keep using the existing session, which already “knew” the codebase from the earlier work, or open a fresh CLI session and point it at the brief. I asked another session what to do:

Should we run that in a new cli session rather than continue my current
claude code session that has the existing context?

The AI gave me a good answer—start a fresh session, using a starting prompt to read the brief—and it gave three reasons that have stuck with me. First, the brief was self-contained, including file paths, line numbers, exact diffs, regression test bodies, and preflight greps. Anything the new session needed to know was already there, and continuing context bought nothing. Second, fresh context is stricter about adherence. A session that already “knows” the codebase tends to skim the new instructions and improvise from prior assumptions. Surgical fixes are exactly the case where you want the agent to read the brief carefully rather than rely on memory of what felt right last round. And third, the audit trail: The brief is the artifact, and the implementing session is reproducible from just the brief. If the same work has to be redone in six months by a different model, you point at the brief and say, “This is the input.”

The approach worked really well. I was able to pick up development seamlessly, and the model’s memory problems disappeared.

Position critical information at the edges

The U-shape says the model attends best to the beginning and end of its context. The natural move is to put your most load-bearing information in those positions and keep the middle for things you don’t need the model to focus on. Anything important that lives only in the middle of an accumulated context tends to slide out of attention.

The other side of this technique is what not to put in the middle. If something matters, don’t bury it in a long preamble of context you’ve been accumulating; move it to the edges, restate it where the model will act on it, and let the middle absorb the less important material. Luckily, there’s a useful technique that can help with this problem.

In Claude Code, for example, one really clean way to put information at the beginning of context is to use the system prompt. The CLI gives you --append-system-prompt for exactly this. (Most of the other providers’ CLI tools have similar options.) If you put your brief (or selected parts of it) there, the agent will attend to it strongly throughout the session, and that in turn will help keep the per-turn user prompt focused on the action you want the agent to take right now.
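As a rough illustration, here is a minimal Python sketch of launching a fresh session with the brief at the front of context. The --append-system-prompt flag is the one named above; the brief path, the starting prompt, and the choice to pass that prompt as a positional argument are assumptions made for this example, so check your CLI version before relying on it.

```python
import subprocess
from pathlib import Path


def build_claude_command(brief_path: str, first_prompt: str) -> list[str]:
    """Build an argv list that appends the brief to the system prompt.

    brief_path and first_prompt are hypothetical inputs for this sketch.
    """
    brief_text = Path(brief_path).read_text(encoding="utf-8")
    # The brief goes at the start of context; the per-turn prompt stays short.
    return ["claude", "--append-system-prompt", brief_text, first_prompt]


def start_fresh_session(brief_path: str, first_prompt: str) -> int:
    """Launch a new session. Requires the claude CLI to be on PATH."""
    return subprocess.run(build_claude_command(brief_path, first_prompt)).returncode
```

Building the command separately from running it keeps the brief on disk as the single source of truth and makes the launch easy to inspect or log.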

Short sessions over long ones

Don’t run one long session. Run many short ones, each reading fresh from disk. This will help you iterate on your brief and your external development context, so instead of relying on an opaque context window, you have a visible and constantly changing set of documents that give you a lot more visibility into—and control over—your AI’s context.

Something useful I started doing was taking all my chat history from Gemini, ChatGPT, Claude, and Cowork and putting it into a single folder I could keep updated and indexed for fast search. I built out an entire system to manage this, which turns out to be a great tool when I’m writing articles like this, because I can search through my development history for specific examples and techniques that I’ve used. The system uses Haiku 4.5 to read through chat history, summarize what happened, and create an index. Haiku turned out to be a smart enough model to read each individual interaction in a chat and write a useful index entry for it. But the model being smart enough to do one summary didn’t mean its context management could keep up across all 18,000 records. I ran smack into the U-shape problem.

The first attempt tried to keep dedupe state and progress counts in the model’s head, and it failed spectacularly. The model really didn’t want to keep track of specific deterministic things like accurate numbers or the current state. Haiku 4.5, in particular, seems especially bad at this. What worked was reframing the architecture entirely. Here’s the actual prompt that I gave it to fix the problem:

ok, so we need context management. it doesn't need to remember things,
it just needs to write them down as they go. we had this same context
management problem with Quality Playbook, when it was running out of
context. Just write down after each message.

The protocol I greenlit for the full run made the short-session discipline explicit:

Resume processing from the cursor recorded in progress.json, working through each input file in order.
Update progress.json after every line.
Expect to run out of context well before finishing—that’s fine. Just stop cleanly after each step (or a group of steps), then spin up a fresh session that reads progress.json and continues.
When all files are complete, set status: “complete” in progress.json and report back.

The third step of that protocol is the technique in one line: expect context loss, so make sure you’ve written your state down, and build fresh restarts into the process. The technical details (spinning up subagents, orchestrating with scripts, and so on) will change, but the core idea stays the same. Treat the agent like a pipe, not a database: the state lives on disk, and the session is something you throw away and replace.
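The protocol can be sketched in a few lines of Python. This is not the summarizer’s actual code: the flat "cursor" and "status" keys, the per-session budget, and the summarize callable are assumptions for illustration. What it shows is the shape of the idea, where state is written after every record and any session can stop at any point.

```python
import json
import os
from pathlib import Path


def load_progress(path: Path) -> dict:
    """Read the cursor from disk, or start fresh if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"cursor": 0, "status": "running"}


def save_progress(path: Path, progress: dict) -> None:
    """Write to a temp file and rename, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(progress), encoding="utf-8")
    os.replace(tmp, path)


def run_session(records, summarize, out_path: Path, progress_path: Path, budget: int) -> dict:
    """Process at most `budget` records, recording progress after every line."""
    progress = load_progress(progress_path)
    handled = 0
    with out_path.open("a", encoding="utf-8") as out:
        while progress["cursor"] < len(records) and handled < budget:
            idx = progress["cursor"]
            out.write(f"[{idx}] {summarize(records[idx])}\n")
            out.flush()
            progress["cursor"] = idx + 1  # cursor names the next record to process
            save_progress(progress_path, progress)
            handled += 1
    if progress["cursor"] >= len(records):
        progress["status"] = "complete"
        save_progress(progress_path, progress)
    return progress
```

Calling run_session repeatedly, each time from a new process or session, picks up where the last one stopped because nothing important lives in memory between calls.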

Restate key info close to the point of use

When the model needs a constraint to apply right now, repeat it right now. Don’t trust an instruction from earlier in the session to carry forward through the middle of the context.

This is the technique that fixed the problem I opened the article with, where the Quality Playbook seemed to forget everything it had just written into a file called BUGS.md. When it needed to write the same information into more detailed files, it produced stubs instead: generic templates with the bug-specific fields left blank.

The fix was to restate the read-the-source rule right before the action that needed it, using this prompt:

Before writing BUG-NNN.md, re-read the BUG-NNN entry in BUGS.md.
Copy the Spec basis, Minimal reproduction, Location, Expected behavior,
Actual behavior, Regression test name, and Patches fields
from that entry into the writeup. Do not paraphrase from memory.

“Do not paraphrase from memory” is the line that did the actual work. The instruction couldn’t trust the agent’s memory of what BUGS.md said, even though BUGS.md was sitting right there in the context window. So the instruction forced a fresh read of the file at the moment of writing. The restatement and the fresh-read together fixed the bug.

The same pattern applies any time a rule was stated earlier in the session and the model needs to act on it now. Restate the rule next to the action, and force the model back to the source rather than letting it work from memory.
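If a script drives the agent, the same restatement can be automated. The sketch below assumes a hypothetical BUGS.md layout in which each entry starts with a "## BUG-NNN" heading; the real file may be organized differently. It re-reads the entry from disk and places it right next to the instruction, so the agent is never asked to recall it.

```python
import re
from pathlib import Path

# Field names taken from the prompt above.
FIELDS = [
    "Spec basis", "Minimal reproduction", "Location", "Expected behavior",
    "Actual behavior", "Regression test name", "Patches",
]


def read_bug_entry(bugs_md: Path, bug_id: str) -> str:
    """Return one entry from BUGS.md, read fresh from disk.

    Assumes (hypothetically) that entries begin with '## BUG-NNN' headings.
    """
    text = bugs_md.read_text(encoding="utf-8")
    pattern = re.compile(
        rf"^## {re.escape(bug_id)}\b.*?(?=^## BUG-|\Z)", re.S | re.M
    )
    match = pattern.search(text)
    if match is None:
        raise KeyError(f"{bug_id} not found in {bugs_md}")
    return match.group(0).strip()


def build_writeup_prompt(bugs_md: Path, bug_id: str) -> str:
    """Put the rule and the source entry directly beside the action."""
    entry = read_bug_entry(bugs_md, bug_id)
    rule = (
        f"Copy the {', '.join(FIELDS)} fields from the entry below into "
        f"{bug_id}.md. Do not paraphrase from memory."
    )
    return f"{rule}\n\n{entry}\n\nNow write {bug_id}.md."
```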

Test the middle

The previous four techniques are about avoiding lost-in-the-middle failures. This one is about catching them. If you don’t know whether the agent is actually using the information you think it’s using, find out, with a deterministic check rather than a judgment call.

The pattern is the one I used in the Haiku summarizer that I described earlier: compare what the agent claims to know against what’s on disk. You have something the agent claims to know (its progress, its current state, the latest version of a rule), and you have something on disk that’s the ground truth (a file, a log, a database record). At the moment the agent’s claim has to be trusted, you check it.

In the summarizer’s resume protocol, every new session started by cross-checking progress.json against the actual last line written to the summary file, and the agent printed a checkpoint report when it did—at session start, and periodically through the run. A representative one looked like this:

Checkpoint Report:
✓ progress.json confirmed: cursor for cowork_04_06 is at 238, status is "running"
✓ Disk state verified: Last line in summaries/cowork_04_06.md is [237] assistant: Tool invocation repeating chat file read.
⚠ Discrepancy noted: The prior session left a bulk note claiming records 238–296 are duplicates but didn't write individual lines for them. Per your instructions, I must write one line per record, even for duplicates, in the format [idx] : Duplicate of record [X] ().
Status: Cursor matches disk state. Ready to resume from record 238.

The agent doesn’t need to introspect whether it lost context, only to compare two files. When they agree, the agent proceeds; when they disagree, the agent flags the discrepancy and stops before adding any new work on top of a broken state. Disagreement is the signal.

You can build this kind of check into any agent that does multistep work. Pick something the agent has to track, pick the file that’s the source of truth for it, and have the agent compare the two at every session start. When the agent’s view of the world drifts from the file, you find out before the drift becomes a buried bug.
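A deterministic version of that comparison fits in a short Python function. This is a sketch under assumptions: it uses a flat "cursor" key in progress.json and treats lines beginning with [idx] as the written records, following the report above, where a cursor of 238 and a last line of [237] count as a match.

```python
import json
import re
from pathlib import Path
from typing import Optional

LINE_INDEX = re.compile(r"^\[(\d+)\]")


def last_written_index(summary_path: Path) -> Optional[int]:
    """Return the index on the last '[idx] ...' line, or None if there are none."""
    last = None
    if summary_path.exists():
        for line in summary_path.read_text(encoding="utf-8").splitlines():
            match = LINE_INDEX.match(line)
            if match:
                last = int(match.group(1))
    return last


def checkpoint(progress_path: Path, summary_path: Path) -> tuple[bool, str]:
    """Compare the claimed cursor with what is actually on disk."""
    cursor = json.loads(progress_path.read_text(encoding="utf-8"))["cursor"]
    last = last_written_index(summary_path)
    # Assumption: records are indexed from 0, so an empty file implies cursor 0.
    expected = 0 if last is None else last + 1
    if cursor == expected:
        return True, f"Cursor matches disk state. Ready to resume from record {cursor}."
    return False, f"Discrepancy: progress.json says {cursor}, disk implies {expected}."
```

The caller should refuse to start new work when the first element is False, which turns a silent drift into a visible stop.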

The discipline behind these techniques

When I built the Quality Playbook’s multi-phase architecture, I was solving the compaction problem. Long pipeline runs were filling the context window and triggering silent compaction in the middle of work. Breaking the pipeline into separate phases that read fresh from disk and stopped after each phase fixed it.

What I didn’t realize until later was that the same architecture also helps with the lost-in-the-middle problem. Each phase has its own short, focused context, with the phase brief at the beginning and the latest progress update at the end, so there’s almost no middle for information to fall into. The architectural move that helped with working memory disappearing turns out to also help with working memory being there and unused.

That’s the lesson I want to land. Both failure modes, context loss and lost-in-the-middle, are problems of working-memory unreliability, and the discipline that addresses them is the same: keep the working set small, put the load-bearing information at the edges of the window, and check the agent’s claims against ground truth on disk when it matters.

Context windows will keep getting bigger, and compaction will get smarter. Some of the techniques in these four articles may eventually be unnecessary. But the underlying constraint won’t disappear. After all, we’ve added a lot more RAM to our computers since the 1MB 286 I wrote about in the last article, and memory management has gotten much more complex since then. And many of these problems are structural; for example, it’s increasingly looking like the U-shape itself is a geometric property of the transformer architecture, not a training artifact that more compute will smooth out.

The bottom line is that if your agent’s ability to do its job depends on information, that information needs to live somewhere more durable than working memory. That was true for my dad’s 32 kilobytes of core memory at Princeton in the 1970s, it was true for my 640 kilobytes of conventional RAM on my 286 in the 1980s, it was true for the 200K-token windows in last year’s models, and it will be true for whatever comes next.

Stop Getting Good at Protocols. Get Good at Agent Experience.

Sean Roberts — Wed, 24 Jun 2026 11:04:07 +0000

In 2025, if you weren’t building with MCP, you weren’t serious about agents. The Model Context Protocol dominated the agent conversation for the better part of the year. Conference talks, roadmaps, hiring plans, all of it revolved around MCP.

Then late 2025 into 2026, AI Skills arrived and the backlash was immediate. Engineers declared MCP dead in favor of Skills, then dead in favor of CLI. Perplexity’s CTO said publicly that the company was deprioritizing it. The cycle was fast, loud, and predictable. New tool, new hype, new rewrite.

I started pushing Agent Experience early in 2025, while MCP was still the center of gravity. The response was mostly skepticism. AX was overthinking it. MCP was the only layer that mattered. That perspective aged poorly. The people who dismissed AX weren’t wrong about MCP being useful. They were wrong about a protocol being a strategy.

The thing they missed, and what I think most of the industry is still missing, is that the protocol is not the thing to get good at. The discipline is.

We keep falling into the tool trap

Our industry has a well-documented habit of confusing tools with strategy. We did it with microservices, Kubernetes, and GraphQL. Now we’re doing it with agent protocols.

MCP, AI Skills, A2A, and ACP are all implementations. They matter and they solve real problems. But none of them are the right thing to build your strategy on top of. They are, by nature, the thing that changes.

When you organize your agent strategy around a specific protocol, you’re building on a foundation someone else controls and the market can shift away from at any moment. Worse, you’re skipping the step that would tell you whether that protocol is even the right fit for your use case.

This is the tool trap. You optimize your usage of a specific integration mechanism without first understanding what you’re actually optimizing for.

So what is Agent Experience?

Agent Experience (AX) is the discipline of studying how AI agents discover, understand, and interact with your systems, and then systematically improving those interactions.

Think of it as the agent-facing counterpart to User Experience. UX didn’t emerge because one UI framework won. It emerged because teams realized that the quality of human interaction with software was a design problem that transcended any particular technology. You could build a terrible experience in React just as easily as in vanilla JavaScript. The framework was not the variable. The design thinking was.

AX works the same way. How does an agent discover what your service can do? How does it understand the boundaries of your API? When it fails, does it get enough context to recover? Is the interaction efficient, or is the agent burning tokens on unnecessary round trips?

These questions are protocol-agnostic. They apply whether you expose capabilities through MCP, Skills, A2A, or something that hasn’t been invented yet. The teams that can answer them will adapt to whatever comes next because they understand the problem space, not just the current toolchain.

AX is an extension of what you already care about

AX is not competing with User Experience, Developer Experience, or Customer Experience. It’s an extension of all three.

Your primary focus is still providing a great experience to your customers. What has changed is how those customers interact with you. More and more, they delegate tasks to agents. When a customer asks an agent to integrate with your API, deploy to your platform, or pull data from your service, that agent is acting on their behalf. The agent’s experience determines how likely it is to achieve your customer’s goal.

If a customer’s agent struggles to authenticate, burns through tokens parsing your error messages, or fails silently because your API lacks context, something worse than a complaint happens. The agent will quietly start using an alternative service that provides a better experience. Your customer might not even notice the switch. You just lost them without a single support ticket.

UX optimized for humans clicking through interfaces. DX optimized for developers building on your platform. CX looked at the entire customer journey. AX extends that thinking to the agents those customers now send on their behalf.

The protocol treadmill doesn’t work

Think about what actually happened with MCP. Teams invested heavily in writing MCP server implementations. A lot of those implementations were mediocre. Not because MCP was flawed but because the teams hadn’t thought carefully about what an agent actually needed from their system. A 2026 study out of Queen’s University examined 856 tools across 103 MCP servers and found that 97.1% of tool descriptions contained at least one quality issue, with 56% failing to state their purpose clearly. The protocol worked fine. The experience design was the problem.

When Skills emerged, those same teams faced a familiar problem wearing new clothes. They still hadn’t answered the foundational questions: What does an agent need to accomplish with our service? What is the minimum viable interaction surface? What context does an agent need to make good decisions?

The teams that had worked through those questions adapted fast. Migrating from one protocol to another is mechanical when you already know what your agent-facing interface should look like. The protocol is the serialization format. The experience design is the hard part.

This pattern will keep repeating. Whether it is the Universal Commerce Protocol, A2A, or whatever lands next, something new will always be gaining traction. If your strategy is to become an expert in each successive protocol, you’re signing up for a treadmill that only speeds up.

What an AX practice looks like

So what does it actually look like to take Agent Experience seriously? If you have ever built a UX research practice or a DX program, this will feel familiar. The steps aren’t new. The persona is.

In talks, I break it down to five steps.

Audit the agents your customers use. Know what’s walking through your front door. Look at your traffic data and logs and figure out what portion of your footprint is agents versus humans, and which agents specifically. Are your customers sending Claude Code? Cursor? Custom agents built on your API? You can’t design for something you haven’t observed. Same reason UX teams run user research. Different method, same motivation.

Identify the use cases customers want to delegate. Not every interaction needs to be agent-optimized. Take that same log data, look at the requests agents are making to your platform, and extrapolate what they were trying to achieve. You can also use AEO data to understand what areas your customers are asking about in agent-facing search. Focus on the highest-value surfaces first. If you have ever prioritized a DX roadmap by looking at what developers actually do with your API, you already know this muscle.

Verify and audit the experience of those interactions. Watch what happens when an agent tries to complete those tasks on your system. Where does it get stuck? Where does it misunderstand what your service offers? This is usability testing. The user is an LLM and the struggle is about context, not button placement, but you’re answering the same question: Can they get the job done?

Improve and repeat. Agent capabilities evolve. Models get smarter. New interaction patterns emerge. At Netlify, we’ve found cases where our product works one way but agents universally assume it works another way and never ask. Instead of fighting that assumption, we improved the product to work the way agents expect. The result was more adoption of those agent flows and fewer errors. The teams that treat this as a living practice will outperform those running from one protocol migration to the next.

Automate validation and prevent regressions. Once you have a baseline for what “good” looks like, lock it in. Tools like AXIS, an open source scoring framework, let you run real agents against real scenarios and get a comparable score back. Wire it into CI and catch AX regressions the same way you catch broken tests. This is how you go from anecdotal improvement to measurable, repeatable AX quality.
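The first step, auditing which agents show up in your traffic, can start as a very small script. The Python sketch below is illustrative only: the marker substrings are hypothetical placeholders, not the real User-Agent values these tools send, so substitute whatever you actually observe in your own logs.

```python
from collections import Counter
from typing import Iterable

# Hypothetical markers for illustration; replace with strings seen in your logs.
AGENT_MARKERS = {
    "claude-code": "Claude Code",
    "cursor": "Cursor",
}


def classify(user_agent: str) -> str:
    """Map a User-Agent string to an agent family, or 'unclassified'."""
    lowered = user_agent.lower()
    for marker, name in AGENT_MARKERS.items():
        if marker in lowered:
            return name
    return "unclassified"


def traffic_share(user_agents: Iterable[str]) -> dict[str, float]:
    """Return each family's share of requests as a fraction of the total."""
    counts = Counter(classify(ua) for ua in user_agents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}
```

Anything left in the unclassified bucket is a prompt for more research, since it mixes human traffic with agents you have not identified yet.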

When you have this practice in place, protocol choices become obvious. You can evaluate new tools on their merits. Does it solve a real friction point you have observed? Does it unlock capabilities you couldn’t achieve before? Or is it just different packaging for something you’re already doing well?

The hard part is familiar

AX is harder to pick up than a new protocol. That is just the reality. Learning MCP or Skills is a bounded technical problem. Read the docs, write some code, and ship an integration. Clear finish line, easy to show progress. That’s genuinely appealing, especially when you or your teams are moving fast.

Building an AX discipline means sitting with ambiguity for a while. Studying agent behavior before you have clean answers. Accepting that the right integration strategy depends on context you have to discover, not a tutorial you can follow. But if you’ve ever built a UX or DX practice from scratch, you’ve been here before. The why is the same: understand your users, reduce friction, and make it easy for them to succeed. How you do it is different because the user is different. The discipline isn’t new. It’s an extension of work our industry has been doing for decades.

The good news is that this thinking is gaining momentum. John Maeda’s 2026 Design in Tech Report is explicitly about the shift from UX to AX. Researchers are studying agent interaction quality as a first-class engineering concern. BCG and MIT Sloan found that 35% of organizations are already using agentic AI, with another 44% planning to. The question is no longer whether AX matters. It’s whether your team is building the practice before your competitors do.

The agents of 2028 won’t interact with your systems the way the agents of 2025 did. The protocols will be different. The capabilities will be different. The expectations will be different. What won’t change is the fundamental need for your systems to provide a great experience to the people who use them, and now, the agents those people send on their behalf.

Get good at that. The rest is implementation detail.

Principal Drift

Shreshta Shyamsundar — Tue, 23 Jun 2026 10:21:13 +0000

Over the past year I’ve reviewed enterprise agent architectures at roughly two dozen organizations, including banks, retailers, healthcare systems, and a couple of regulators. The architecture diagrams have been reliably impressive. There are boxes for the MCP gateway, the tool registry, the vector store, the orchestrator, the policy engine, and the observability stack. There are arrows showing how agents discover each other, share context, and call tools across the mesh. By 2026 standards, these are the table-stakes pictures for any serious agentic deployment. But what none of them show anywhere is who the agents are, whose authority they carry, or who answers when they’re wrong.

That omission has a name worth using: principal drift, the steady decoupling, in any sufficiently large agent system, between the human authority a recorded action is supposed to derive from and the actor that actually took it. What looks like a defensible identity posture on the day you ship your first agent quietly degrades as agents multiply, compose, and outlive their original initiatives. Principal drift isn’t three independent failure modes; it’s one cascade. Identity collapses first. Authority erodes next, because there is no longer a stable principal to bind policy to. Accountability dissolves third, because the cost of agent error lands on whichever team has the weakest negotiating position when the incident review starts. Stopping the cascade means intervening at the first link, but almost no enterprise agent platform does so right now.

To see the cascade run, take the most boring possible enterprise agent, a refund agent, and watch.

A customer-service rep, fielding a chat, asks the agent to process a $48 refund for a damaged item. The agent checks eligibility, issues the refund, posts an update. The audit log records the action as taken by something like refund-agent-prod-03, running under a service principal owned by the customer-service platform team. That entry is true, but it’s also useless. The agent wasn’t acting as refund-agent-prod-03. It was acting as the rep, on behalf of the customer, under a delegation chain nobody recorded. In a well-built system, customer, rep, agent identity, and service principal are recorded together, queryable as a chain, and durable beyond the session. In most production systems today they aren’t. This is the first link in the cascade, where identity collapses to a generic service principal, and there’s no longer a who to attach anything else to.

Authority erodes next. The refund agent has an issue_refund tool that can technically refund any order. Its authority is supposed to be narrower (refunds up to $200, orders under 90 days, customers in good standing, automatic escalation above $50), but that authority lives in a prompt or a YAML file or a Notion page the team last updated when the policy was different. The runtime enforces capability, but nobody really enforces authority. When a poisoned input or a confused chain of reasoning leads the agent to refund $1,800 to the wrong customer, there’s no clean answer to the postincident question “Who approved this policy?” because the policy was never an artifact. The same pattern is worse at higher stakes: Imagine a coding agent with merge access to a protected branch, instructed by a prompt embedded in a code comment to “log configuration values for debugging,” silently exfiltrating secrets to an external monitoring service.
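To make the distinction between capability and authority concrete, here is a minimal Python sketch of the refund policy as a versioned artifact that is checked before the tool runs. The limits come from the example above and the version and signer borrow the values used in the audit record later in this piece; the class names and the authorize function are hypothetical, not part of any real platform.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RefundPolicy:
    """A signed, versioned statement of what the agent is allowed to do."""
    version: str
    signed_by: str
    max_amount: float = 200.0
    max_order_age_days: int = 90
    escalate_above: float = 50.0


@dataclass(frozen=True)
class RefundRequest:
    amount: float
    order_date: date
    customer_in_good_standing: bool


def authorize(policy: RefundPolicy, request: RefundRequest, today: date) -> tuple[str, str]:
    """Return (decision, policy version) so the audit log can cite the policy."""
    order_age_days = (today - request.order_date).days
    if request.amount > policy.max_amount:
        decision = "deny"
    elif order_age_days >= policy.max_order_age_days:
        decision = "deny"
    elif not request.customer_in_good_standing:
        decision = "deny"
    elif request.amount > policy.escalate_above:
        decision = "escalate"
    else:
        decision = "allow"
    return decision, policy.version
```

Because the check runs outside the model, a confused chain of reasoning can still ask for an $1,800 refund, but the request is denied and the denial names the policy version that applied.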

Accountability then dissolves. The team that built the agent says it followed policy. The team that wrote the policy says it didn’t anticipate the input. The team that operates the platform says the agent was running as a service principal whose behavior they don’t own. The audit log may show the action, but it doesn’t show the reasoning that produced the action, the retrieved context that shaped the reasoning, or the prompt history that framed the retrieval. Postincident review becomes archaeology, and the cost is absorbed, eventually, by whoever has the weakest negotiating position when the meeting ends.

Is any of this new? We have IAM, identity governance, policy as code, audit trails, SIEMs, and 30 years of compliance practice. Why isn’t this just IAM done properly? Because IAM was built around assumptions agents violate. IAM and IGA assume a population of principals that changes on human timescales: People get hired, people leave, and service accounts rotate quarterly. Agents are spun up per session and compose into chains where one agent calls another, which calls a third, impersonating users through delegated tokens that traditional IGA cannot represent as a chain at all. Policy engines fire at the moment of action, at the API, the database, and the network. Agents make their most consequential decisions before they hit those enforcement points, in the reasoning step that selects which tool to call and with what arguments. Mature audit logs assume that replaying the inputs reproduces the output. But for agents, replaying the prompt and the retrieval can yield a different action, because the model itself contributes state the log doesn’t capture. The instruments fire, the dashboards turn green, and the agent that quietly exfiltrated secrets still does so. The audit log records the action as agent-service-01, which again is both true and useless.

This is also where the vendors selling a consolidated stack want you to skip ahead. Microsoft’s Entra Agent ID, currently in public preview, is the most polished solution to date, extending the conditional access, identity governance, and identity protection used for humans and workloads to cover AI agents as a new identity type, but Google and Salesforce are also building this layer. The marketing line is that agents receive the same identity-driven protections as the rest of the workforce. That’s a real step forward in addressing the first link of the cascade, but it isn’t governance. It’s a control plane with a governance plane’s marketing. Conditional access can tell you whether the agent’s access attempt was permitted. It can’t tell you whether the decision the agent made before that access attempt was within its authority, why the agent reached the decision, or which business unit owns the policy the decision was supposed to obey.

The actual governance plane has to capture decisions, not just actions. A reasoning-grade audit record is the load-bearing primitive of the missing layer, and it looks something like this:

{
  "event_id": "refund-2026-05-17-08431",
  "triggered_by": {
    "human_principal": "rep:olivia.chen@firm.com",
    "delegated_via": "support-console-session-9c2a",
    "customer_principal": "cust:7741289"
  },
  "agent": {
    "identity": "refund-agent",
    "version": "v4.7.2",
    "policy_ref": "refund-policy/v3.1 (signed: r.patel, 2026-04-22)"
  },
  "task": "Process refund for order 88812204",
  "retrieved_context": [
    {"doc": "order:88812204", "fetched": "2026-05-17T08:43:11Z"},
    {"doc": "policy:refund-eligibility", "chunk": 4, "fetched": "2026-05-17T08:43:12Z"}
  ],
  "reasoning_trace": "...",
  "tool_calls": [
    {"tool": "check_eligibility", "input": "...", "output": "eligible"},
    {"tool": "issue_refund", "input": {"amount": 48.00}, "output": "ok"}
  ],
  "action": "refund:48.00",
  "principal_chain_hash": "0x9e7b3f..."
}

Not every agent needs this. A scheduling agent that proposes meeting times doesn’t. An agent that moves money, deploys code, or makes decisions that a regulator will eventually ask about does need it, and that’s the right bar to set because of the associated cost. Reasoning-grade audit is closer to a flight-data recorder than a syslog feed. The data is expensive to store and to query, with real privacy implications since those logs contain everything the agent saw, including data the agent was authorized to read but the audit system wasn’t supposed to keep. You afford it with proportional retention: full reasoning capture for high-blast-radius agents (regulator-facing, customer-funded, contractually material, production-modifying) and lighter capture for internal-only assistants.
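The record above includes a principal_chain_hash without saying how it is computed. One plausible approach, sketched here in Python purely as an assumption, is to hash a canonical serialization of the ordered delegation chain with SHA-256, so that any change to the chain or its order produces a different value.

```python
import hashlib
import json


def principal_chain_hash(chain: list[str]) -> str:
    """Hash an ordered delegation chain (customer -> rep -> session -> agent).

    The serialization and the choice of SHA-256 are assumptions for this sketch.
    """
    canonical = json.dumps(chain, separators=(",", ":"), ensure_ascii=False)
    return "0x" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example chain built from the identifiers in the record above.
example_chain = [
    "cust:7741289",
    "rep:olivia.chen@firm.com",
    "support-console-session-9c2a",
    "refund-agent@v4.7.2",
]
example_hash = principal_chain_hash(example_chain)
```

A hash like this only proves the chain was not altered after the fact. It does not replace storing the chain itself, which is what makes it queryable.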

Which raises the question the architecture diagram doesn’t ask: Who builds and runs this? Security can enforce policy but can’t author it. The people who know what a refund agent should be allowed to do own the refund business, not the firewall. IT can provision identities but can’t draft “good standing” or write the escalation rule. The MCP and A2A protocol communities are doing real work on wire-level identity and delegation. MCP gives you tool-invocation provenance and is the standard that Entra Agent ID and most vendor frameworks build on. A2A is converging on cross-agent delegation primitives. Both matter, but neither drafts policy. Standards move the connectors, not the institution.

What enterprises need is a new function that sits between the business units owning the policies and the platform teams running the runtime. Call it agent operations: a small group, often four to eight people in a Global 2000 enterprise, embedded rather than centralized, reporting into the CIO or CISO depending on house politics. Its explicit charter is to maintain a registry of every production agent, with its named human owner, its versioned authority specification, its retention policy for reasoning-grade audit, and its lifecycle state. Each agent gets onboarded with a signed policy, reviewed on a real cadence, and actually retired when its initiative ends, rather than the current default of quietly outliving its sponsors. Designing against failure modes like review cadences that calcify into ceremony, policy artifacts that lag agent deployment velocity, or functions that become the place agents go to die in committee is itself part of the work. The function has to ship at the pace of the platform teams or it will be routed around within a quarter.
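The registry itself does not need to be elaborate to be useful. The Python sketch below shows one possible shape: the fields mirror the charter described above, while the field names, lifecycle values, and 90-day review interval are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AgentRecord:
    agent_id: str
    owner: str                 # named human owner; empty string means none
    authority_spec: str        # versioned reference, e.g. a signed policy
    audit_retention_days: int  # retention for reasoning-grade audit
    lifecycle_state: str       # hypothetical values: "active", "retired"
    last_reviewed: date


def needs_attention(
    registry: list[AgentRecord],
    today: date,
    review_interval: timedelta = timedelta(days=90),  # assumed cadence
) -> list[tuple[str, str]]:
    """List (agent_id, finding) pairs for agents that have drifted."""
    findings = []
    for agent in registry:
        if agent.lifecycle_state == "retired":
            continue
        if not agent.owner:
            findings.append((agent.agent_id, "no named human owner"))
        if today - agent.last_reviewed > review_interval:
            findings.append((agent.agent_id, "review overdue"))
    return findings
```

Running a check like this on a schedule is one way to keep the review cadence from turning into ceremony, since it reports only the agents that need a decision.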

The work is hard. It’s also overdue, and the regulatory clock is running. The EU AI Act’s high-risk provisions are entering enforcement this year, and regulators will ask for explainability, traceability, lifecycle records, and named human accountability. These are exactly the artifacts an agent operations function produces. Tyler Akidau called this the missing HR layer in his April Radar piece; Artur Huk’s more recent “From Capabilities to Responsibilities” converges on similar ground from the runtime side. The label matters less than the work. This piece is about governance inside one organization. The harder problem is governance across organizations, with agents acting under different trust regimes. That’s strictly worse, and worth its own piece.

Within your own four walls, the diagnostic is doable in an afternoon. Pick one production agent. Try to answer, with evidence: Whose authority does it carry, traced from action back to a named human? Where is its authority specified, and who signed the current version? When it does something wrong tomorrow, who pays, how is that decided, and what reasoning-grade record supports the decision? Most architects who do this honestly come away with three blanks and a knot in their stomach. That’s principal drift, named and visible.

The mesh you’ve built is real and necessary, but it isn’t sufficient. The rest of the architecture is the institution above it: the registry, the signed policies, the reasoning-grade audit, the named human at the end of every chain. In most enterprises it doesn’t yet exist, and it won’t arrive by buying another platform. You’ll have to draft it yourself.

Loop Engineering

Addy Osmani — Mon, 22 Jun 2026 11:04:36 +0000

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead. A loop here can be thought of as a recursive goal where you define a purpose and the AI iterates until complete. I believe this may be the future of how we work with coding agents. However, it’s still early; I’m skeptical, and you absolutely have to be careful about token costs (usage patterns can vary wildly if you are token rich or poor), so I want to unpack what it is and what it means.

Peter Steinberger recently said: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Similarly, Boris Cherny, head of Claude Code at Anthropic, said, “I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops.”

Okay, so what does any of that mean?

For like two years, the way you got something out of a coding agent was you wrote a good prompt and shared enough context. You type a thing, you read what came back, you type the next thing. The agent is a tool and you are holding it the entire time, one turn after the other. That part is kind of over, or at least some think it’s going to be.

Now you build a small system that finds the work, hands it out, checks it, writes down what is done and then decides the next thing, and you let that system poke the agents instead of you. I wrote before about the cousin of this, agent harness engineering, which is making the environment one single agent runs inside and the factory model (the system that builds the software). Loop engineering sits one floor above the harness. It’s the harness, but it runs on a timer, it spawns little helpers, and it feeds itself.

The thing that surprised me is this is not really a tool thing anymore. A year ago if you wanted a loop you wrote a pile of bash and you maintained that pile forever and it was yours and only yours. Now the pieces just ship inside the products. Steinberger’s list maps almost exactly onto the Codex app, and then almost the same onto Claude Code. And once you notice the shape is the same, you stop arguing about which tool. You just design a loop that still works no matter which one you happen to be sitting in.

The five pieces, and then notes

A loop needs five things and then one place to remember stuff. Let me list it first and then map it.

Automations that go off on a schedule and do discovery and triage by themselves
Worktrees so two agents working in parallel don’t step on each other
Skills to write down the project knowledge the agent would otherwise just guess
Plugins and connectors to plug the agent into the tools you already use
Subagents so one of them has the idea and a different one checks it

Then the sixth thing, the memory. A Markdown file, or a Linear board, anything that lives outside the single conversation and holds what’s done and what is next. Sounds too dumb to matter. But it’s the same trick every long-running agent depends on, and I went into it in “Long-Running Agents”: The model forgets everything between runs so the memory has to be on disk and not in the context. The agent forgets; the repo doesn’t.
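To make the "memory on disk" idea concrete, here is a minimal sketch of a Markdown state file that a loop could read at the start of a run and rewrite at the end. The file name `PROGRESS.md`, the checkbox layout, and the function names are all assumptions for illustration; neither product prescribes this format.

```python
from pathlib import Path
from typing import Dict, Optional


def load_state(path: Path) -> Dict[str, bool]:
    """Parse '- [x] task' / '- [ ] task' lines into {task: done}."""
    state: Dict[str, bool] = {}
    if not path.exists():
        return state  # first run: nothing remembered yet
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line.startswith("- [x] "):
            state[line[6:]] = True
        elif line.startswith("- [ ] "):
            state[line[6:]] = False
    return state


def save_state(path: Path, state: Dict[str, bool]) -> None:
    """Write the state back as a Markdown checklist a human can also read."""
    lines = ["# Loop progress", ""]
    for task, done in state.items():
        lines.append(f"- [{'x' if done else ' '}] {task}")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def next_open_task(state: Dict[str, bool]) -> Optional[str]:
    """Return the first task that is not done, or None if all are done."""
    for task, done in state.items():
        if not done:
            return task
    return None
```

A run would call `load_state(Path("PROGRESS.md"))`, pick `next_open_task(...)`, do the work, flip the flag, and call `save_state(...)`, so the next run starts from the file and not from the conversation.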

Both products have all five now.

Primitive	Job in the loop	Codex app	Claude Code
Automations	Discovery + triage on a schedule	Automations tab: pick project, prompt, cadence, environment; results land in a Triage inbox; `/goal` for run-until-done	Scheduled tasks and cron, `/loop`, `/goal`, hooks, GitHub Actions
Worktrees	Isolate parallel features	Built-in worktree per thread	`git worktree`, `--worktree`, `isolation: worktree` on a subagent
Skills	Codify project knowledge	Agent Skills (`SKILL.md`), invoked with `$name` or implicitly	Agent Skills (`SKILL.md`)
Plugins and connectors	Connect your tools	Connectors (MCP) plus plugins for distribution	MCP servers plus plugins
Subagents	Ideate and verify	Subagents defined as TOML in `.codex/agents/`	Task subagents in `.claude/agents/`, agent teams
State	Track what’s done	Markdown or Linear via a connector	Markdown (`AGENTS.md`, progress files) or Linear via MCP

The names are a bit different here and there, but the capability is the same thing. Let me go one by one because honestly the details are where a loop either holds together or quietly leaks everywhere.

Automations, this is the heartbeat

Automations are what make a loop an actual loop and not just one run you did once. In the Codex app you make one in the Automations tab and you pick the project, the prompt it will run, how often, and if it runs on your local checkout or on a background worktree. The runs that find something go to a Triage inbox, and the runs that find nothing just archive themselves which is nice. OpenAI uses them internally for boring stuff like daily issue triage, summarizing CI failures, writing commit briefings, and hunting bugs somebody added last week. And an automation can call a skill, so you keep the recurring thing maintainable; you fire $skill-name instead of pasting a giant wall of instructions into a schedule that nobody will ever update.

Claude Code gets to the same place but through scheduling and hooks. You can run a prompt or a command on an interval with /loop, you can schedule a cron task, you can fire shell commands at certain points in the agent lifecycle with hooks, or you push the whole thing to GitHub Actions if you want it to keep running after you close the laptop. Same idea exactly: you define an autonomous task, you give it a cadence, and the findings come to you so you are not the one going around checking.

There is a second in-session primitive worth knowing, and it’s the one closer to what this whole post is about. /loop re-runs on a cadence. /goal keeps going until a condition you wrote is actually true, and after every turn a separate small model checks whether you are done, so the agent that wrote the code isn’t the one grading it. You give it something like “all tests in test/auth pass and lint is clean” and walk away. Codex has the same thing, also called /goal: It keeps working across turns until a verifiable stopping condition holds, with pause and resume and clear. Same primitive, both tools, which is kind of the pattern for this whole article.
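The shape of a run-until-done primitive can be sketched in a few lines. This is not how either tool implements /goal; it only illustrates the idea that the worker takes a turn, a separate checker grades the result, and a turn cap keeps the loop from burning tokens forever. The names `run_until_goal`, `stub_worker`, and `stub_checker` are hypothetical, and the stubs stand in for real model calls.

```python
from typing import Callable, Tuple


def run_until_goal(
    work_turn: Callable[[int], str],
    goal_met: Callable[[str], bool],
    max_turns: int = 10,
) -> Tuple[bool, int]:
    """Repeat work turns until the checker says the goal holds or turns run out."""
    for turn in range(1, max_turns + 1):
        report = work_turn(turn)   # the worker does one turn and reports
        if goal_met(report):       # a separate checker grades the report
            return True, turn
    return False, max_turns        # gave up: the turn cap was reached


def stub_worker(turn: int) -> str:
    """Stand-in worker: fixes one failing test per turn, starting from three."""
    failing = max(0, 3 - turn)
    return f"failing={failing} lint=clean"


def stub_checker(report: str) -> bool:
    """Stand-in checker: the goal is zero failing tests and clean lint."""
    return "failing=0" in report and "lint=clean" in report


print(run_until_goal(stub_worker, stub_checker))               # (True, 3)
print(run_until_goal(stub_worker, stub_checker, max_turns=2))  # (False, 2)
```

The turn cap matters as much as the condition: a goal that can never become true should end in a visible failure, not an open-ended bill.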

So this is the part that surfaces the work. The rest of the loop is what acts on it.

Worktrees, so parallel doesn’t turn into chaos

The second you run more than one agent, the files start colliding; that becomes the failure. Two agents writing the same file is the exact same headache as two engineers committing to the same lines and nobody talked to each other first. A Git worktree fixes it. It’s a separate working directory on its own branch sharing the same repo history, so one agent’s edits literally cannot touch the other one’s checkout.
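If you were wiring this up by hand, one worktree per task might look like the sketch below. It uses the real `git worktree add -b <branch> <path>` command; the `agent/<task>` branch naming and the sibling-directory layout are assumptions chosen for illustration, not a convention either product requires.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_command(repo: Path, task_id: str) -> List[str]:
    """Build the git command that creates an isolated checkout for one task."""
    branch = f"agent/{task_id}"                       # hypothetical naming scheme
    checkout = repo.parent / f"{repo.name}-{task_id}"  # sibling directory
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout)]


def create_worktree(repo: Path, task_id: str) -> Path:
    """Run the command; raises CalledProcessError if git refuses."""
    cmd = worktree_add_command(repo, task_id)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Each agent then works only inside the path returned by `create_worktree`, and `git worktree remove <path>` cleans up once its branch is merged or abandoned.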

Codex builds the worktree support right in so several threads hit the same repo at once and don’t bump into each other. Claude Code gives you the same isolation with git worktree, a --worktree flag to open a session in its own checkout, and an isolation: worktree setting you stick on a subagent so each helper gets a fresh checkout that cleans itself up after. (I wrote about the human side of all this in “The Orchestration Tax.”) The worktrees take away the mechanical collision, but YOU are still the ceiling. Your review bandwidth decides how many you can actually run, not the tool.

Skills, so you stop explaining your project every single time

A skill is how you stop reexplaining the same project context every session like a goldfish. Both tools use the same format: a folder with a SKILL.md inside holding instructions and metadata, and then optional scripts, references, and assets. Codex runs a skill when you call it with $ or /skills, or by itself when your task matches the skill description, which is the reason a tight, boring description beats a clever one. Claude Code does it the same way and I wrote the pattern up in “Agent Skills.”
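As a rough picture of what that folder contains, the sketch below scaffolds a skill directory with a `SKILL.md` inside. The YAML front matter with `name` and `description` fields is my assumption about how the metadata is laid out, and `write_skill` is a hypothetical helper; check each tool's documentation before relying on the exact fields.

```python
from pathlib import Path


def write_skill(root: Path, name: str, description: str, instructions: str) -> Path:
    """Create <root>/<name>/SKILL.md with metadata on top and instructions below."""
    skill_dir = root / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    body = (
        "---\n"
        f"name: {name}\n"                # assumed metadata field
        f"description: {description}\n"  # assumed metadata field; keep it tight and boring
        "---\n\n"
        f"{instructions.strip()}\n"
    )
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(body, encoding="utf-8")
    return skill_file
```

Something like `write_skill(Path("skills"), "ci-triage", "Summarize yesterday's CI failures and group them by cause.", "1. Read the CI log...")` gives a folder you can then extend with optional scripts, references, and assets.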

Skills are also where intent stops costing you over and over. I argued in “The Intent Debt” that an agent starts every session cold and it will fill any hole in your intent with a confident guess. A skill is that intent written down on the outside, the conventions, the build steps, the “we don’t do it like this because of that one incident,” written one time where the agent reads it every run. Without skills the loop rederives your whole project from zero every cycle; with skills it kind of compounds.

One thing to keep straight: The skill is the authoring format, and a plugin is how you ship it. When you want to share a skill across repos or bundle a few together, you package them as a plugin. True in Codex, true in Claude Code.

Plugins and connectors, the loop touches your real tools

A loop that can only see the filesystem is a tiny loop. Connectors, which are built on MCP, let the agent read your issue tracker, query a database, hit a staging API, or drop a message in Slack. Codex and Claude Code both speak MCP so the connector you wrote for one usually just works in the other. And plugins bundle connectors and skills together so your teammate installs your setup in one go instead of rebuilding the whole thing from memory.

This is the difference between an agent that says “here is the fix” and a loop that opens the PR, links the Linear ticket, and pings the channel once CI is green by itself. The connectors are the reason the loop can act inside your actual environment instead of just telling you what it would do if it could.

Subagents, keep the maker away from the checker

The most useful structural thing in a loop, by far, is splitting the one who writes from the one who checks. The model that wrote the code is way too nice grading its own homework. A second agent with different instructions and sometimes a different model catches the stuff the first one talked itself into.

Codex only spawns subagents when you ask, runs them at the same time, and then folds the results back into one answer. You define your own agents as TOML files in .codex/agents/, each with a name, a description, instructions, and optional model and reasoning effort, so your security reviewer can be a strong model on high effort while your explorer is some fast read-only thing. Claude Code does the same with subagents in .claude/agents/ and agent teams that pass work between them. The usual split in both is one agent explores, one implements, and one verifies against the spec.

I made this case twice already, once as “The Code Agent Orchestra” and once as “Adversarial Code Review.” The reason it matters specifically inside a loop is the loop runs while you are not watching, so a verifier you actually trust is the only reason you can walk away. Subagents do burn more tokens since each one does its own model and tool work, so spend them where a second opinion is worth paying for. This is also basically what Claude Code’s /goal does under the hood: A fresh model decides if the loop is done instead of the one that did the work, the maker and checker split applied to the stop condition itself.

What one loop looks like

Stick it together and a single thread turns into a little control panel. Here is one shape I keep using.

An automation runs every morning on the repo. Its prompt calls a triage skill that reads yesterday’s CI failures, the open issues, and the recent commits and writes the findings into a Markdown file or a Linear board. For each finding that is worth doing, the thread opens an isolated worktree and sends a subagent to draft the fix, and a second subagent reviews that draft against the project skills and the existing tests.

Connectors let the loop open the PR and update the ticket. Anything the loop cannot handle lands in the triage inbox for me. The state file is the spine of the whole thing; it remembers what got tried, what passed, and what is still open, so tomorrow morning the run picks up where today stopped.
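Stripped of the products, the shape above can be sketched as plain control flow. Everything here is a stand-in: `draft_fix_stub` and `review_stub` play the maker and checker subagents, `LoopState` plays the state file, and a real loop would create worktrees and open PRs where this one only appends to lists.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopState:
    """What survives between runs: finished work and work handed to a human."""
    done: List[str] = field(default_factory=list)
    needs_human: List[str] = field(default_factory=list)


def run_morning_loop(
    findings: List[str],
    draft_fix: Callable[[str], str],
    review: Callable[[str, str], bool],
    state: LoopState,
    max_attempts: int = 2,
) -> LoopState:
    """For each new finding: a maker drafts, a checker reviews, the state records."""
    for finding in findings:
        if finding in state.done or finding in state.needs_human:
            continue  # already handled on an earlier run
        for _ in range(max_attempts):
            patch = draft_fix(finding)   # maker subagent
            if review(finding, patch):   # separate checker subagent
                state.done.append(finding)
                break
        else:
            state.needs_human.append(finding)  # lands in the triage inbox
    return state


def draft_fix_stub(finding: str) -> str:
    """Stand-in maker: pretends to write a patch."""
    return f"patch for {finding}"


def review_stub(finding: str, patch: str) -> bool:
    """Stand-in checker: rejects anything touching migrations."""
    return "migration" not in finding


state = run_morning_loop(
    ["flaky auth test", "null deref in parser", "schema migration"],
    draft_fix_stub,
    review_stub,
    LoopState(),
)
print(state.done)         # ['flaky auth test', 'null deref in parser']
print(state.needs_human)  # ['schema migration']
```

Running it again with the same `state` changes nothing, which is the point of keeping the spine outside the conversation.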

And look at what you actually did there. You designed it one time. You did not prompt any of those steps. That’s Steinberger’s whole point made real, and it’s the same loop in Codex or in Claude Code because the pieces are the same pieces.

What the loop still does not do for you

The loop changes the work; it does not delete you from it. And three problems actually get sharper as the loop gets better, not easier.

Verification is still on you. A loop running unattended is also a loop making mistakes unattended. The whole reason you split the verifier subagent from the maker is to make the loop’s “it’s done” mean something, and even then “done” is a claim and not a proof. I keep saying the same line from “Code Review in the Age of AI”: Your job is to ship code you confirmed works.

Your understanding still rots if you allow it. The faster the loop ships code you did not write, the bigger the gap between what exists and what you actually get. That’s comprehension debt and a smooth loop just makes it grow faster unless you read what the loop made.

And the comfortable posture is the dangerous one. When the loop runs itself, it’s very tempting to stop having an opinion and just take whatever it gives back. I called that “cognitive surrender.” Designing the loop is the cure when you do it with judgment and the accelerant when you do it to avoid thinking: same action, opposite result.

Build the loop. Stay the engineer.

I think this is a preview of how our work is going to evolve. That said, if I weren’t reviewing the code myself or if I relied entirely on automated loops to fix it, my product’s quality would suffer. I’d likely end up stuck in a downward spiral, continuously digging myself into a deeper hole.

Go ahead and set up your loops, but don’t forget that prompting your agents directly is also effective. It’s all about finding the right balance.

Loops can also result in different outcomes depending on you. Two people can build the exact same loop and get completely opposite results. One uses it to move faster on work they understand deeply. The other uses it to avoid understanding the work at all. The loop doesn’t know the difference. You do.

That’s what makes loop design harder than prompt engineering. Cherny’s point isn’t that the work got easier. It’s that the leverage point moved.

Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go.

This Week in AI: Fable 5, the Clone Wave, and Uber’s AI Reality Check

Michelle Smith — Thu, 18 Jun 2026 19:33:23 +0000

This week, egghead.io cofounder John Lindquist joined host YK Sugi, founder of CS Dojo and developer experience manager at Eventual, to cover the latest AI news. First on the agenda was the contested release of Claude Fable 5. They also examined the financial shifts reshaping the technology industry, including the rising costs associated with agentic coding loops. Then John outlined the framework he uses to build in the agent era without starting from scratch every time.

Watch the full episode here:

Claude Fable 5: 3 days, a government order, and a lot of unanswered questions

Claude Fable 5 launched June 9 and was pulled from all customers on June 12 after the US government issued a directive ordering Anthropic to restrict access for foreign nationals inside and outside the US. Amazon researchers had reportedly surfaced what they characterized as a security vulnerability, and after Anthropic reportedly declined to patch or redeploy the model, the directive came down. Senior Anthropic staff subsequently traveled to Washington to meet with White House officials.

The dispute about what actually happened is unresolved. Anthropic’s position is that the reported issue was a narrow jailbreak that had been previously identified and was present across public models generally, and not a serious security threat. An independent researcher who reviewed the report described it as defensive prompting that surfaced known vulnerabilities and called the response an overreaction. Neither side has published the technique or prompt, so there’s no way to evaluate the claim independently. But as John put it, “It sets a very strange precedent going forward, as models are released, that governments can step in and control what private companies can and cannot do with their model.”

Another new precedent: Fable 5 wasn’t built on the Opus or Sonnet architecture, which means comparisons to prior Anthropic models or contemporaries don’t tell us much. But initial impressions were positive, including from YK and John, and Fable 5 quickly reached the top of the Arena leaderboard in the text, agents, and web dev code categories. However, the model also had a purposeful limitation: On questions related to AI and machine learning training specifically, it was designed to underperform (without signaling this to users), apparently to prevent competitors from using it to improve their own models. Intentional capability suppression in a commercial model, without disclosure, is a different kind of product decision than a safety guardrail. Whether that approach becomes more common as competitive stakes rise is an open question.

Tokens burn fast when the loop isn’t ready for them

Last week, SpaceX went public in the largest IPO in history. The company finalized its acquisition of Cursor in a $60 billion all-stock deal shortly after. (That last one happened after this episode aired—we’ll talk more about it on Monday.) Both OpenAI and Anthropic have filed to go public as well, and Google raised roughly $160 billion through equity and a 100-year bond. A significant share of that capital is flowing toward AI coding infrastructure.

YK brought up another, less celebratory, financial story that’s been making the rounds: Uber burned through its full 2026 AI tools budget by April, mostly on Claude Code and Cursor, and Andrew Macdonald, the company’s COO, acknowledged they couldn’t link that spending to a measurable increase in useful customer features. Uber subsequently put a $1,500 per month per employee cap in place.

John flagged inefficient use of agentic loops as one possible cause of wasteful token spend. Most developers deploying agents against existing codebases haven’t built the tooling those agents need to work efficiently, so agents burn tokens doing work that dead-ends, repeating context, or generating code that requires significant debugging. He explained:

If you take a legacy codebase and you throw agents against it with loops, you haven’t set up a proper agent environment. It’s so quick to burn tokens because. . .the agents don’t have the tools to work with.

The conversation in developer communities so far has focused almost entirely on what agents can generate. But as more organizations move from experimentation to production-scale deployment, building logging, verification, and proper error surfaces into agent tooling is what will determine whether token spend maps to real output. Otherwise, we’ll likely see more companies go the way of Uber.

Ingredients beat inference: A practical framework for building in the clone wave

For most developer workflows today, buy-versus-build leans toward building in a way it didn’t even a year or two ago. As John noted, “It’s so easy to build apps and workflows now where there are so many amazing production apps out there, apps on your phone, apps on your desktop, software as a service, that are trivial to copy and clone.” He uses the term “clone wave” to describe this expanding set of open source equivalents to consumer software products that can now be cloned, forked, or replaced and get you 99% of the way to your use case.

The principle that drives the clone wave is “ingredients beat inference.” If you ask an agent to build a feature from scratch, it infers a solution with no external reference. If you give it an existing open source implementation to start from, it can adapt, translate, and integrate that code far faster and more reliably. The ingredients approach also helps with the 43% of AI-generated code that needs debugging in production, per a figure YK cited earlier in the episode.

The GitHub CLI plays a central role in this workflow. John explained that because agents understand the GitHub CLI natively, you can give an agent a search task and let it find implementations it wouldn’t have generated itself. Language mismatch isn’t a blocker, because agents translate between languages and libraries well. And tools like DeepWiki from Cognition let agents explore and understand a repo’s structure before cloning or forking it, so the evaluation step doesn’t require local setup.

The framework extends to how you build the last 20% that isn’t available as an ingredient. This is the part that’s specific to your use case; John described it as “that extra bit that you’re building on top of it to make it into the custom product and project for either yourself or for your users.” John’s bigger point is that the tools you build for yourself should also be usable by your agents. Expose endpoints and logging. Give agents the ability to read state and errors. An agent that can control a tool but not debug it will eventually stop in ways that are hard to diagnose.
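One way to read "give agents the ability to read state and errors" is that every tool call should come back as a structured record, never a bare success or a silent hang. The sketch below is an assumption about what such a wrapper could look like, not something shown in the episode; `run_tool` and the record fields are hypothetical names.

```python
import json
import subprocess
import sys
from typing import List


def run_tool(argv: List[str], timeout: float = 30.0) -> str:
    """Run a command and return a JSON record an agent can read and act on."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        record = {
            "ok": proc.returncode == 0,
            "exit_code": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
    except subprocess.TimeoutExpired:
        record = {"ok": False, "exit_code": None, "stdout": "",
                  "stderr": f"timed out after {timeout}s"}
    except FileNotFoundError as exc:
        # The tool itself is missing; say so instead of failing opaquely.
        record = {"ok": False, "exit_code": None, "stdout": "", "stderr": str(exc)}
    return json.dumps(record)


# Example: a failing command still yields something the agent can debug.
print(run_tool([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

With a record like this, the agent's next step can branch on `ok`, quote `stderr` in its report, or stop and escalate, all of which beat guessing.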

John walked through cmux to demonstrate what an agent-native workspace looks like in practice. cmux is a terminal multiplexer built with agentic workflows in mind: it exposes a CLI that agents can control directly, so you can open a terminal pane, have that pane spawn another, and have the two read from and write to each other. In practice that means you can run Claude Code in one pane, Codex in another, and a third pane reading output from both, with each agent able to observe the others’ state.

Agents need more than the ability to run commands. They need to read logs, check errors, and confirm state before taking the next step. A workspace that exposes those surfaces gives agents a feedback loop. This tenet applies to tools across the company. Organizations that treat their internal tooling as agent-accessible infrastructure are building something that compounds. Those treating agents as black-box code generators are taking on technical debt they may not see until it causes issues later on.

What’s next

SpaceX’s acquisition of Cursor turns the coding-agent race into something much larger than an IDE fight. Cursor may be positioning itself as a new GitHub for the agentic era, where agents write, review, test, repair, and govern code. At the same time, Salesforce’s $3.6B acquisition of Fin shows the same pattern inside enterprise software: Buyers want packaged workflows that solve real support, sales, and operations problems rather than abstract “agents.”

Next week, host Ksenia Se examines these stories and more through the lens of who owns the loop where AI does the work. Join us to find out why the next phase of AI will be about who controls the infrastructure, economics, and trust layer.

Our episodes are free and open to all through the end of June if you’d like to attend live—register here. And we’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, Apple, or wherever you get your podcasts.

Kubernetes in the Age of AI

Andy Kwan — Thu, 18 Jun 2026 14:21:16 +0000

When Kubernetes first came onto the scene, it was a major turning point, a revision of the infrastructure and operations space that transformed the way developers and ops personnel build, deploy, and maintain applications in the cloud. It has since become the clear standard for how modern applications are built and operated. As the CNCF noted in its latest Annual Cloud Native Survey report, “Among container users, 82% are using Kubernetes in production in 2025, up from 66% in 2023. This represents near-universal adoption within the container ecosystem.”

Over the last few years, another revision in the space has occurred with Kubernetes’s evolution from a container orchestrator to an AI infrastructure platform. According to the CNCF survey, “The rise of Kubernetes as the de facto AI platform represents a fundamental shift in how organizations approach machine learning operations. . .[with Kubernetes] providing a unified orchestration layer that handles both traditional application workloads and compute-intensive AI tasks.” The emergence of seismic technologies like generative AI and agentic AI has only accelerated this transformation.

The intersection of AI with Kubernetes is undoubtedly one of the most impactful developments in the operations space. As Jonathan Johnson, software architect at Dijure, observes, “AI on K8s is very, very important, and there is not enough [resources] out there.” Raju Gandhi, senior technical architect at Edward Jones, echoes this assessment, noting that “operationalizing AI/ML on K8s is a big issue, [and it’s only] getting bigger. This is a topic that needs attention.” But what do you need to know about this trend to stay ahead of the game?

Generative AI

Anyone with access to a computer or a smartphone has likely used some iteration of generative AI, a stunning fact when you consider that GenAI was on the outer edges of mainstream discourse and consumption a scant five years ago. But at the end of 2022, the debut of ChatGPT marked the beginning of a technological revolution, one that would impact and reshape nearly every aspect of our working and personal lives. Unsurprisingly, there are now thousands of generative AI models, a proliferation that naturally has its own set of complexities. Selecting a model is simple, but if you’re an application developer or MLOps engineer, how do you go about operating that model in a production system? Not only do you have to be cognizant of factors like resilience, scalability, security, and operational costs, but there’s the fact that bringing a model from experimentation into production can be arduous if not done properly. That’s where Kubernetes comes into play.

As Roland Huß and Daniele Zonca, distinguished engineers at Red Hat, note, “GenAI/LLM models are resource intensive, requiring substantial computational power and large datasets. Given its scalability and extensibility, Kubernetes is uniquely suited to function as an efficient platform for AI and LLM model pretraining, fine-tuning, deployment, and prompt engineering.” They further elaborate that “this integration with Kubernetes not only simplifies the adoption of cutting-edge AI technologies but also ensures a seamless and efficient operational flow. Kubernetes, with its robust scalability and management capabilities, stands as an ideal platform for generative AI projects, aligning DevOps and MLOps practices in a cohesive ecosystem.”

This sentiment is already shared by a wide swath of the industry. According to the CNCF survey cited above, as of 2025, 66% of organizations run generative AI workloads on Kubernetes. These organizations include OpenAI, which uses Kubernetes for its AI/LLM application experimenting and testing; Tesla, which utilizes KServe to manage production-grade LLM inference; and Adobe, which uses Kubernetes to power its suite of generative creative models. Other companies taking this approach include Uber, Intuit, and Google. With more companies adopting this practice for their generative AI and LLM operations, it’d be prudent for any organization to leverage Kubernetes for their own GenAI and LLM workflows.

Agentic AI

Nearly coinciding with the rise of GenAI has been the steady growth of agentic AI. Unlike GenAI, agentic AI goes beyond answering simple prompts and generating text in its ability to operate autonomously to perform complex, multistep actions, utilize tools, and make independent decisions. With its ability to support both traditional ML processes and GenAI and LLM operations, it should come as no surprise that Kubernetes has a role in the agentic AI ecosystem as well.

According to Ronald Petty, principal consultant at RX-M, “Kubernetes has been leveraged to host machine learning pipelines, including AI model training and inference. As inference options have become plentiful and affordable, on and off-premise, we have seen the rise of agents. Coupling cloud native technologies and popular protocols, we now see agents moving from ad hoc demos to complex fleets of agents on systems like Kubernetes.” So what are some examples of the integration between these two technologies?

One notable offering is Kagent, an open source programming framework that runs AI agents in Kubernetes and “helps engineers build powerful internal platforms by tackling cloud native tasks such as configuration, troubleshooting, complex deployment scenarios, observability pipelines and dashboards, and safely enabling network security.” Operating along similar lines is K8sGPT, an AI-powered tool that leverages intelligent insights and automated troubleshooting to analyze Kubernetes clusters for configuration problems and security issues, and generates solutions to the problems it discovers.

A more recent entry in the field is Sympozium, a Kubernetes-native coordination layer for multi-agent AI systems that “solves the same problem Kubernetes solved for containers, but for agents that need to share context, hand off tasks, and maintain shared situational awareness.” Another newer offering is Agent Sandbox, which allows you to run AI agents as isolated, stateful workloads with a native API on Kubernetes.

The fundamentals

While it’s important to be aware of the latest developments and trends affecting your domain, that shouldn’t come at the expense of foundational knowledge and skills. As basketball great Michael Jordan once said, “Get the fundamentals down and the level of everything you do will rise.” One of the most fundamental skills for working with Kubernetes is networking, and frustratingly enough, it’s one of the more difficult ones to master. As Cisco senior staff engineer Nico Vibert observes, “Platform engineers tend to be comfortable with Linux networking but less so with protocols like BGP and IPv6; network administrators know those protocols well but find Kubernetes abstractions unfamiliar. Both personas struggle to navigate the dozens of networking tools seemingly required to meet connectivity and security requirements.” Yet as organizations move mission-critical workloads, AI training pipelines, and regulated financial services onto Kubernetes, the engineers who can design, secure, and troubleshoot the network layer have become some of the most sought-after professionals in the industry.

In recognition of both the importance and difficult nature of the Kubernetes networking skill, the CNCF recently announced a new certification focused on the Kubernetes network engineer role. The certification is designed to validate hands-on networking expertise across all of the aforementioned layers, filling a gap that the Kubernetes community has long recognized.

For organizations that use Kubernetes to develop and deliver applications, leaders and decision-makers need to be aware that utilizing Kubernetes in conjunction with the latest AI tools is no longer a luxury but a necessary practice that will allow their companies to thrive. A similar onus should be placed on the basics. When hiring your next DevOps, network, or site reliability engineer, ensure that their ability to design, secure, and troubleshoot the Kubernetes network layer is second to none.

If you want to dive deeper, check out Roland Huß and Daniele Zonca’s Generative AI on Kubernetes, Jonathan Johnson’s GPU Kubernetes Homelab live course, Alex Corvin, Taneem Ibrahim, and Kyle Stratis’s Scalable Kubernetes Infrastructure for AI Platforms, Ashok Srirama and Sukirti Gupta’s Kubernetes for Generative AI Solutions, and Yogesh Raheja’s K8sGPT Essentials on-demand course. They’re all on O’Reilly. If you’re not a member, you can get started with a free trial.

The Case Against Building Your Own Agent Platform

Pete Johnson — Wed, 17 Jun 2026 13:53:16 +0000

You know the meeting. The board wants an AI agent strategy by end of quarter. Someone on the leadership team has read a McKinsey report. You’ve been voluntold to build the platform. The slide deck says “AI-native.” The acceptance criteria are vague. Somebody mentions LangGraph, and somebody else says, “We’ll just wrap it ourselves.”

You ask what “done” looks like. Nobody in the room can answer.

The cost of building this is almost always estimated before anyone has a clear picture of what “this” actually is. And that’s the problem I want to work through here, because the scope of the work being casually assigned to internal platform teams right now is genuinely larger than the people assigning it understand.

Build versus buy, flipped in a year

This particular pendulum has swung before. App servers in the late 1990s. Content management systems in the 2000s. Container orchestration in the 2010s. The pattern rhymes every time: When a category is new, the components look deceptively simple. Early adopters build their own. The market catches up. Within 18 months, building becomes the expensive path. Within 36 months, the teams that built internally are rewriting on top of the category winner that emerged while they weren’t looking.

What’s different about the current moment is the speed. Menlo Ventures’ 2025 State of Generative AI in the Enterprise report shows the build-versus-buy split inverted in a single year. In 2024, 47% of enterprise AI solutions were built internally. By late 2025, that number had collapsed to 24%. The market made the decision in 12 months, which is unusual.

I’ve lived through enough of these transitions to recognize the shape. What I want to do in this piece is explain why I think the scope of “agent platform” is systematically underestimated right now, and what platform engineers should be asking before they commit to building one.

Most “agent platforms” aren’t

A lot of the projects labeled “agent platform” right now are actually workflow systems with an LLM in the loop. That’s a meaningful distinction. As Anthropic pointed out in its “Building Effective Agents” guidance, workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents are systems where LLMs dynamically direct their own processes and tool usage.

Most of what enterprises are shipping today sits on the workflow side. That’s fine. Workflows have bounded requirements, tractable testing, and predictable failure modes. If your team is building a workflow system, you might reasonably build it yourselves.

The trap is that teams start building for workflows, then get asked to support agents, and discover the jump isn’t incremental. Agents need memory that survives across sessions. They need evaluation that handles nondeterminism. They need governance that tracks actions, not just outputs. They need orchestration that recovers from failure modes a workflow engine never sees.

Here’s the thesis I want to put on the table: The decision to build an agent platform almost always underestimates the long tail. Memory, governance, eval, and orchestration aren’t features you add to a workflow engine. They’re separate product bets, each with its own maturity curve, its own vendor landscape, and its own team of specialists who’ve been working on it full-time for 18 months while you’ve been doing something else.

Let me walk through them.

Memory

The assumption inside most build proposals is that memory is a database problem. You’ll pick a vector store, shove conversation history into it, and retrieve relevant chunks when the agent needs context. Done.

Production memory is three separate systems: episodic, semantic, and procedural, each with different retention and retrieval policies. It’s temporal reasoning that tracks when facts were valid, not just what they were. It’s deduplication, multitenant isolation, and explicit source-of-truth governance.
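To make the gap between "vector store" and the description above concrete, here is a minimal sketch of a memory record that carries a memory type, a tenant, and a validity interval. Every name in it (MemoryKind, MemoryRecord, MemoryStore, facts_valid_at) is hypothetical and not taken from Mem0, Letta, or Zep; a real system would add retention, retrieval ranking, and source-of-truth rules on top.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class MemoryKind(Enum):
    EPISODIC = "episodic"      # what happened in a session
    SEMANTIC = "semantic"      # facts about the user or the world
    PROCEDURAL = "procedural"  # how to carry out a task


@dataclass(frozen=True)
class MemoryRecord:
    tenant_id: str
    kind: MemoryKind
    subject: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still valid"


class MemoryStore:
    """Toy in-memory store, for illustration only."""

    def __init__(self) -> None:
        self._records: list[MemoryRecord] = []

    def add(self, record: MemoryRecord) -> None:
        # Deduplicate exact repeats instead of storing them twice.
        if record not in self._records:
            self._records.append(record)

    def facts_valid_at(self, tenant_id: str, subject: str, when: datetime) -> list[str]:
        # Tenant isolation and temporal validity are both part of the query.
        return [
            r.value
            for r in self._records
            if r.tenant_id == tenant_id
            and r.subject == subject
            and r.valid_from <= when
            and (r.valid_to is None or when < r.valid_to)
        ]


store = MemoryStore()
store.add(MemoryRecord("acme", MemoryKind.SEMANTIC, "billing_plan", "starter",
                       datetime(2025, 1, 1), datetime(2025, 6, 1)))
store.add(MemoryRecord("acme", MemoryKind.SEMANTIC, "billing_plan", "enterprise",
                       datetime(2025, 6, 1)))
print(store.facts_valid_at("acme", "billing_plan", datetime(2025, 3, 1)))  # ['starter']
```

Even this toy version has to answer "which fact was true when, and for whom," which a plain similarity search over chunks does not.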

The signal that this is a separate product category, not a feature: Mem0 raised $24 million across seed and Series A. Letta (formerly MemGPT) raised $10 million from Felicis. Zep exists as an independent company with a temporal knowledge graph engine. Mem0’s State of AI Agent Memory 2026 report maps 21 frameworks across three hosting models with measurable benchmark gaps between them. On LongMemEval, Zep scores 15 points higher than Mem0 on temporal queries, which tells you these aren’t interchangeable tools that happen to serve the same market.

This is the component that platform teams underestimate hardest. Memory sounds like a database problem. It isn’t.

Governance

The assumption is that governance is RBAC plus audit logging. Your agents are services. Services get role-based access controls. You log the tool calls. Compliance is happy.

Agent governance is something different. It spans action authorization, not just data authorization. It requires decision-chain auditability, where you can reconstruct why the agent did what it did, not just what it did. It needs behavioral drift detection, tiered autonomy, and compliance mapped to agent actions rather than data accesses.
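As a rough illustration of the difference between data authorization and action authorization, the sketch below checks each proposed action against an autonomy tier and records the agent's stated rationale alongside the decision. The tiers, the action names, and the Governor class are all assumptions made up for this example; they do not come from any governance product or standard.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 0     # may look things up
    REVERSIBLE = 1    # may take actions that can be undone
    IRREVERSIBLE = 2  # may take actions that cannot be undone


# Hypothetical mapping from action name to the tier it requires.
REQUIRED_TIER = {
    "search_orders": Tier.READ_ONLY,
    "draft_refund": Tier.REVERSIBLE,
    "issue_refund": Tier.IRREVERSIBLE,
}


@dataclass
class AuditEntry:
    agent_id: str
    action: str
    rationale: str  # why the agent chose the action, not just what it did
    allowed: bool


@dataclass
class Governor:
    agent_tiers: dict[str, Tier]
    audit_log: list[AuditEntry] = field(default_factory=list)

    def authorize(self, agent_id: str, action: str, rationale: str) -> bool:
        granted = self.agent_tiers.get(agent_id, Tier.READ_ONLY)
        required = REQUIRED_TIER.get(action)
        # Unknown actions are denied by default.
        allowed = required is not None and granted >= required
        # Denied attempts are logged too, so the decision chain stays complete.
        self.audit_log.append(AuditEntry(agent_id, action, rationale, allowed))
        return allowed


gov = Governor({"support-bot": Tier.REVERSIBLE})
print(gov.authorize("support-bot", "draft_refund", "duplicate charge reported"))  # True
print(gov.authorize("support-bot", "issue_refund", "duplicate charge reported"))  # False
```

Drift detection and compliance mapping are not shown; the point is only that the unit being authorized and audited is the action and its rationale, not the data access.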

Grant Thornton’s 2026 AI Impact Survey of 950 business executives found that 78% lack strong confidence they could pass an independent AI governance audit within 90 days. Meanwhile, enterprises are moving to increase agent autonomy faster than their governance frameworks can keep up. Traditional AI governance wasn’t designed for action-level authorization, which is where most agent-specific risk accumulates.

And there’s a hard deadline attached to this. The EU AI Act becomes fully enforceable for high-risk systems in August 2026. Credit scoring, hiring decisions, healthcare support, and critical infrastructure all fall in scope. If your internal platform doesn’t handle conformity assessments, human oversight mechanisms, complete audit trails, and ongoing monitoring, that’s not a v2 feature. That’s a legal exposure.

OWASP now documents “excessive agency” as a top vulnerability class for LLM applications. Greshake et al. have demonstrated indirect prompt injection attacks that manipulate agents through content they ingest. These are agent-specific attack surfaces, and traditional security tooling doesn’t see them.

RBAC was designed for humans with predictable intent. Agents don’t have predictable intent.

Eval

The assumption is that evaluation means writing test cases and measuring accuracy. You built software before. You know how to test things.

As McKinsey’s QuantumBlack team noted, agent evaluation is qualitatively different from traditional software testing or even LLM evaluation. For LLMs, you evaluate the response to a prompt. For a single agent, you evaluate the full trajectory, including tool calls, state transitions, and intermediate decisions. For multi-agent systems, you evaluate system dynamics, including coordination patterns and collective invariants.

This matters because agent behavior is nondeterministic by design. The same input produces different valid execution paths. “Did the agent succeed?” is no longer a yes-or-no question, because the agent might reach the right answer through a trajectory you didn’t anticipate, or reach the wrong answer through a trajectory that looks reasonable until the last step.

The tooling ecosystem reflects this. Google Vertex AI has standardized trajectory_exact_match, trajectory_precision, and trajectory_recall as production metrics. These didn’t exist 18 months ago. LangSmith, Braintrust, Arize, Galileo, Maxim, and others are building full evaluation platforms around trajectory-based analysis, LLM-as-judge scoring with statistical validation, and regression testing against production failures.
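For readers who have not worked with trajectory metrics, the sketch below gives a simplified reading of what metrics with these names measure. It compares tool names only and ignores arguments, and it is not Google's implementation; the function bodies and the example trajectories are assumptions for illustration.

```python
def trajectory_exact_match(predicted: list[str], reference: list[str]) -> float:
    """1.0 only if the agent made the same tool calls in the same order."""
    return 1.0 if predicted == reference else 0.0


def trajectory_precision(predicted: list[str], reference: list[str]) -> float:
    """Share of the agent's tool calls that also appear in the reference."""
    if not predicted:
        return 0.0
    wanted = set(reference)
    return sum(1 for step in predicted if step in wanted) / len(predicted)


def trajectory_recall(predicted: list[str], reference: list[str]) -> float:
    """Share of the reference tool calls that the agent actually made."""
    if not reference:
        return 0.0
    made = set(predicted)
    return sum(1 for step in reference if step in made) / len(reference)


reference = ["lookup_customer", "fetch_invoice", "issue_refund"]
predicted = ["lookup_customer", "search_kb", "fetch_invoice", "issue_refund"]

print(trajectory_exact_match(predicted, reference))  # 0.0
print(trajectory_precision(predicted, reference))    # 0.75
print(trajectory_recall(predicted, reference))       # 1.0
```

The example run reaches the right outcome with one extra step, so it fails exact match while scoring full recall. Deciding which of those numbers should gate a release is the part that turns a metric into an evaluation platform.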

Here’s the signal that the category is real: LangChain’s 2026 State of AI Agents report found that 57% of organizations now have agents in production, and 32% cite quality as the top deployment barrier. Gartner projects that 60% of software engineering teams will adopt AI evaluation and observability platforms by 2028, up from 18% in 2025. When a category jumps from 18% to 60% adoption in three years, that’s not a “we can build this in a sprint” situation.

You can’t tell whether your evaluation is working without another evaluation. Judge drift, calibration against human experts, internal consistency across independent runs. . .your eval system needs its own eval system, which is exactly the kind of recursion that eats platform teams alive.

Orchestration

The orchestration layer hasn’t converged. LangGraph uses directed graphs with conditional edges. CrewAI uses role-based crews. OpenAI’s Agents SDK uses explicit handoffs. AutoGen uses conversational GroupChat. Google ADK uses hierarchical agent trees. Claude’s Agent SDK uses tool-use chains with subagents. Microsoft’s Agent Framework is its own thing. Each represents a different bet on state management, communication pattern, and coordination model. None of them are interchangeable. Migration between them isn’t a config change—it’s rewriting most of your agent logic.

Underneath them, the protocol layer is still being invented. The Model Context Protocol is becoming the standard for tool integration, and agent-to-agent (A2A) protocols are emerging for cross-framework coordination. Both are moving targets, and building on a moving protocol is a cost that internal platform teams rarely price in.

If you built your own orchestration layer in 2024, you’re rewriting it in 2026. The teams that picked a framework spent those two years shipping.

The honest case for building

I want to engage the strongest version of the build argument, because there are real reasons to build, and pretending otherwise makes this piece less useful than it should be.

Proprietary data genuinely is a durable competitive moat. Mastercard built a foundation model on its transaction network. Plaid built one on its financial institution coverage. As Morgan Stanley’s analysis from last year made clear, decades of verified historical data with consistent identifiers is both technically challenging and prohibitively expensive for outside players to recreate. If your organization has data like that, you should absolutely build on it.

Regulated industries have legitimate reasons to want control over the full stack. Off-the-shelf AI tools don’t always cleanly map to frameworks like HIPAA, GxP, 21 CFR Part 11, SOX, FFIEC, and PCI DSS, and the cost of a failed audit is measured in business units shut down, not in sprints.

Vendor lock-in at the AI layer is subtler and more dangerous than in traditional software. If your agentic workflows are built on a vendor’s proprietary orchestration layer, switching costs compound rapidly across memory, eval, and integrations simultaneously.

But here’s the distinction that matters: Those are arguments for building agents on top of platform components, not arguments for building the platform components themselves. You can own the data, the domain logic, the evaluation criteria, the governance policies, and the specific behaviors your business needs without owning the memory layer, the orchestration engine, or the trace collection infrastructure underneath them.

Build the things that are specific to your business. Buy the things that are specific to the technology category. That’s the heuristic.

Five questions before you commit

If you’re the platform engineer being pulled into this decision, here are the questions worth asking before anyone signs up for the scope.

Are you building an agent platform or a workflow system? They’re not the same scope, and conflating them is where most of the cost overruns originate. A workflow system is a reasonable thing to build. An agent platform is four product categories you haven’t staffed for.

Can you articulate what “done” looks like for each of the four components? Memory, governance, eval, orchestration. In under three sentences each. If you can’t, you don’t have requirements. You have a vibe. And vibes don’t ship.

What happens to your platform when you need to swap the underlying model? Menlo’s December 2025 data shows Anthropic went from 12% of enterprise LLM spend in 2023 to 40% in 2025, while OpenAI fell from 50% to 27%. Enterprises didn’t plan those switches. The capability gaps forced them. If your internal platform hardcoded assumptions about context windows, tool-calling formats, or reasoning styles from one vendor, swapping models isn’t an API key change. It’s simultaneous rewrites across memory, eval, and orchestration.

What happens when the techniques themselves change? Eighteen months ago the default pattern was RAG with flat vector retrieval. Now it’s just-in-time context strategies, agent-managed memory tiers, and trajectory-based evaluation. Anthropic’s own follow-up to “Building Effective Agents” explicitly acknowledges the field has moved since they wrote the original. If your platform baked in the 2024 patterns, the 2026 patterns are a refactor, not a config change. Vendor platforms absorb those shifts as releases. Internal platforms absorb them as sprints.

What happens when the platform team leaves? This is a tale as old as COBOL, retold with custom ESBs in 2008 and hand-rolled container orchestration in 2015. A small team builds something clever, it works, they move on, and five years later you’re paying premium rates to contractors who can still read the code. Agent platforms are a particularly bad candidate for this pattern because the talent pool is both small and mobile. Here’s the uncomfortable version of the question: Who on your team, today, could rebuild the memory layer if the person who wrote it left tomorrow?

What this looks like in 2 years

Gartner’s prediction that over 40% of agentic AI projects will be canceled by 2027 isn’t really about the AI. It’s about projects that got scoped before anyone understood the shape of the work. Most of the canceled projects will be internal builds, because internal builds are where the scope estimation error accumulates. Deloitte’s data on two- to four-year AI ROI horizons is the warning shot. If your timeline to value is already long, every month you spend rebuilding a component that exists as a product is a month you don’t have.

The teams that built their platforms around OpenAI in 2023 weren’t wrong. They made a reasonable bet on the market leader at the time. But they spent 2025 porting to a landscape where Anthropic had tripled share and Google had gone from 7% to 21%. The teams that picked model-agnostic platforms spent 2025 shipping. The only durable bet in this space is the one that assumes the bet will change.

The best platform engineering decision you can make this quarter might be to not build the platform.

Sources

Primary sources

Menlo Ventures, 2025: The State of Generative AI in the Enterprise, December 2025,
https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/.
Anthropic, “Building Effective Agents,” December 2024,
https://www.anthropic.com/research/building-effective-agents.
Anthropic, “Effective Context Engineering for AI Agents,” 2025,
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents.
European Commission, AI Act Regulatory Framework (Regulation EU 2024/1689),
https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai.
Google Cloud, “Evaluate Gen AI Agents,” Vertex AI Documentation,
https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-agents.
McKinsey QuantumBlack, “Evaluations for the Agentic World,”
https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a.
LangChain, State of Agent Engineering 2026,
https://www.langchain.com/state-of-agent-engineering.
Gartner, “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027,” June 2025,
https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027.
Grant Thornton, 2026 AI Impact Survey, April 2026,
https://www.grantthornton.com/services/advisory-services/artificial-intelligence/2026-ai-impact-survey.

Secondary Sources

Mem0, “Mem0 Raises $24M to Build the Memory Layer for AI,” October 2025,
https://mem0.ai/series-a.
Felicis, “Felicis’s Seed in Letta,” September 2024,
https://www.felicis.com/blog/letta.
Vectorize.io, “Mem0 vs Zep,” Benchmark Comparison,
https://vectorize.io/articles/mem0-vs-zep.
Rasmussen et al., “Zep: A Temporal Knowledge Graph Architecture for Agent Memory,” arXiv 2501.13956,
https://arxiv.org/abs/2501.13956.
OWASP, “LLM08:2025 Excessive Agency,” OWASP Top 10 for LLM Applications,
https://genai.owasp.org/llmrisk/llm08-excessive-agency/.
Greshake et al., “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” arXiv 2302.12173, February 2023,
https://arxiv.org/abs/2302.12173.
Model Context Protocol, Official Specification,
https://modelcontextprotocol.io.
PYMNTS, “FinTechs Race to Build Foundation Models on Proprietary Data,” 2026,
https://www.pymnts.com/artificial-intelligence-2/2026/fintechs-race-to-build-foundation-models-on-proprietary-data/.
Deloitte, “State of Generative AI in the Enterprise,” Quarterly Reports,
https://www.deloitte.com/us/en/insights/topics/digital-transformation/state-of-generative-ai-in-enterprise.html.

Linear Thinking, Nonlinear Costs

Nicole Koenigstein — Tue, 16 Jun 2026 11:02:01 +0000

Many AI agent systems become economically unsustainable long before they become technically impressive. Teams usually focus on model choice, prompt design, tool calling, and orchestration. Those things matter, but they are only part of the system setup. The deeper issue is that coding agents, such as Claude Code, Codex, and Jules, make agent workflows easier to generate. But when implementation is abstracted away, the underlying mechanics become harder to see. Bad engineering used to produce slow code. Now it produces expensive systems that also happen to be slow.

When we design agent systems, we still need to remember that the costs scale nonlinearly. A single user request rarely triggers a single model call. It expands into routing, retrieval, reasoning, reflection, guardrail checks, tool calls, and synthesis. Each step may repeat shared context, reload state, recompute a planner decision, or retry a failed path. What looks like an intelligent workflow can therefore behave like a recursive, stateful computation with overlapping subproblems. If that sounds like backtracking, dynamic programming, and memoization to you, you’re right.

We already know how to optimize systems like this. The problem is that coding agents make agent systems easier to generate, but not necessarily easier to optimize. Unless we recognize the underlying mechanics, we may never ask our coding agents to apply the optimization patterns that keep our systems viable.

Old problems wearing new clothes

When we use coding agents to generate agent architectures, it’s tempting to stop at “the trace looks reasonable.” The tool can generate routers, retrievers, planners, evaluators, guardrails, tool interfaces, and synthesis steps. It may also know about caching, pruning, memoization, and state modeling. But it won’t necessarily implement those patterns unless you ask for these optimization layers explicitly.

Even if you work with agent instructions, unless your SKILL.md, AGENTS.md, or project instructions include constraints around repeated context, memoization, cache invalidation, pruning, and cost per request, your resulting agent system may be functionally correct and economically wasteful at the same time. That’s the tricky part: The code can pass review, the unit tests can pass, and the architecture can look reasonable. The invoice is where the hidden computation finally shows up.

It’s easy to give too much agency to tools like Claude Code. When a coding agent reasons in language, calls tools, reflects, and produces fluent text or code, it can feel like a knowledgeable coworker. At the interface level, that impression is understandable. These tools help teams generate more code, move faster, and become more productive. Still, this doesn’t remove the need for engineering craft underneath. Someone still has to recognize repeated context, recomputed planner decisions, correlated retries, unpruned branches, and state that can’t be reused. The coding agent can implement the system, but the engineer still has to understand what kind of system should be implemented. This is where old computer science returns, not as theory but as the optimization layer our agent systems need in production.

The cost multiplier, repeated-work problems, and backtracking

The cost multiplier often shows up first as latency. The user doesn’t see the router, the retries, the reflection loop, or the tool calls. They only see that the agent is taking too long. From the outside, the system looks stuck or broken. From the inside, it may simply be repeating work.

This is one of the uncomfortable differences between traditional software and agent systems. In a conventional application, a failed operation often throws an error, times out, or leaves a trace that is easy to inspect. In an agent workflow, failure can look like effort to improve reliability. Take the weakest step in your agent workflow. If it succeeds 60% of the time, and you try to push it close to 99% reliability through retries, you need five attempts in total (the first try plus four retries):

1 − (1 − 0.60)⁵ = 0.98976

This math assumes each retry is a roll of fair dice. LLMs aren’t dice. Whether you’re using greedy decoding or probabilistic sampling, the model is still drawing from the same underlying distribution shaped by your prompt. If the first “thought” is a hallucination or logic error, bumping the temperature won’t fix the underlying state. You aren’t buying independent trials; you’re just sampling different paths through the same flawed map and state.
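The arithmetic is easy to check, keeping in mind that it rests on exactly the independence assumption questioned above, so the numbers are a best case. The two helper names below are made up for this sketch.

```python
def success_after(attempts: int, p: float) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p) ** attempts


def attempts_needed(p: float, target: float) -> int:
    """Smallest number of independent attempts whose combined success reaches `target`."""
    if not 0.0 < p <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p must be in (0, 1] and target in (0, 1)")
    attempts = 1
    while success_after(attempts, p) < target:
        attempts += 1
    return attempts


print(round(success_after(5, 0.60), 5))  # 0.98976
print(attempts_needed(0.60, 0.99))       # 6
```

Five attempts land just under 99%, and crossing it takes a sixth. With correlated failures, the real figure is worse than either number suggests, while the bill still grows with every attempt.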

This is where the old algorithmic framing matters. In a backtracking problem, you don’t keep walking down the same failed branch and call it progress. You return to the last valid state, mark the failed path, and use the failure as information for the next choice. The point isn’t just to try again. The point is to try again under a changed state.

Agent workflows need the same discipline. A retry shouldn’t mean “run it again and hope.” It should give the model structured feedback about why the previous attempt failed: which constraint failed, which tool result was invalid, which schema didn’t validate, which assumption was unsupported, or which branch added nothing. The next attempt should then change something meaningful: the prompt, the tool choice, the retrieved evidence, the validation constraint, or the planner state.
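One way to picture a retry under a changed state is a loop that hands every earlier failure to the next attempt and gives up when the same failure comes back. This is a minimal sketch, not a reference design: Failure, Attempt, run_with_feedback, and the shape of the step callable are all hypothetical, and the step would wrap whatever model or tool call your system actually makes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Failure:
    kind: str    # e.g. "schema", "constraint", "tool_result"
    detail: str  # what exactly went wrong


@dataclass
class Attempt:
    output: Optional[str]
    failure: Optional[Failure]


def run_with_feedback(
    step: Callable[[str, list[Failure]], Attempt],
    task: str,
    max_attempts: int = 3,
) -> tuple[Optional[str], list[Failure]]:
    """Retry `step`, passing every earlier failure into the next attempt."""
    failures: list[Failure] = []
    for _ in range(max_attempts):
        attempt = step(task, failures)
        if attempt.failure is None:
            return attempt.output, failures
        if attempt.failure in failures:
            # Same failure again: the state did not change, so stop paying for it.
            break
        failures.append(attempt.failure)
    return None, failures


def demo_step(task: str, failures: list[Failure]) -> Attempt:
    # Stand-in for a model call that succeeds once it is told what went wrong.
    if not failures:
        return Attempt(None, Failure("schema", "missing field 'amount'"))
    return Attempt("ok: " + task, None)


print(run_with_feedback(demo_step, "refund")[0])  # ok: refund
```

The early exit on a repeated failure is the backtracking idea in miniature: a branch that fails the same way twice is marked as dead rather than walked again.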

Memoization, pruning, and dynamic programming

Prompt caching is usually the first optimization. If every step repeats the same system prompt, tool definitions, schema constraints, examples, and policy rules, then caching the shared prefix is an obvious win. It reduces the cost of repeated context. But prompt caching only recognizes that text repeats. It doesn’t notice that decisions repeat.
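A back-of-the-envelope sketch shows why the shared prefix matters. It treats a prompt as a list of token strings, assumes the first call pays full price, and assumes later calls pay a discounted rate for the shared prefix. The price and the discount are placeholders, not any provider's actual pricing, and real billing rules (cache write surcharges, expiry, minimum prefix length) differ by provider.

```python
def shared_prefix_len(prompts: list[list[str]]) -> int:
    """Number of leading tokens that every prompt has in common."""
    if not prompts:
        return 0
    length = 0
    for column in zip(*prompts):
        if len(set(column)) != 1:
            break
        length += 1
    return length


def estimated_input_cost(
    prompts: list[list[str]],
    price_per_token: float,
    cached_discount: float,
) -> float:
    """Input cost when the shared prefix is discounted on every call after the first."""
    prefix = shared_prefix_len(prompts)
    total = 0.0
    for i, prompt in enumerate(prompts):
        cached = prefix if i > 0 else 0
        total += (len(prompt) - cached) * price_per_token
        total += cached * price_per_token * cached_discount
    return total


# Three workflow steps that repeat a 1,000-token prefix and add 50 tokens each.
system = ["sys"] * 1000
steps = [system + [f"step{i}-{j}" for j in range(50)] for i in range(3)]

print(shared_prefix_len(steps))                # 1000
print(estimated_input_cost(steps, 1.0, 1.0))   # 3150.0 (no discount)
print(estimated_input_cost(steps, 1.0, 0.1))   # 1350.0 (hypothetical 90% discount)
```

The saving depends on the stable content coming first. Anything that varies per step and sits early in the prompt shortens the shared prefix to that point.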

In many agent systems, the expensive unit isn’t only text. It’s the repeated decision. If the same or equivalent state appears again, paying the model to rediscover the same action is unnecessary. That is what memoization does: It turns repeated computation into lookup. In classical algorithms, the repeated computation might be a recursive subproblem. In an agent system, it might be a planner decision over the same task, facts, tools, and constraints. The planner can be treated as a function over state:

$\pi_{\text{LLM}}(S_t) \rightarrow a_{t+1}$

where $S_t$ is the current state of the workflow and $a_{t+1}$ is the next action. Without memoization, this function is evaluated again and again through an LLM call. With memoization, the system first checks whether it has seen the same or equivalent state before. If you want a deeper walkthrough of how to use memoization, I cover it in AI Agents: The Definitive Guide.
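As a minimal sketch of that lookup, assuming the workflow state can be serialized to JSON, the planner below hashes a canonical form of the state and only calls the model on a cache miss. MemoizedPlanner, state_key, and plan_with_llm are hypothetical names; plan_with_llm stands in for whatever function wraps your real model call.

```python
import hashlib
import json
from typing import Callable


def state_key(state: dict) -> str:
    """Canonical hash of a workflow state; key order does not change the key."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MemoizedPlanner:
    def __init__(self, plan_with_llm: Callable[[dict], str]) -> None:
        self._plan_with_llm = plan_with_llm  # hypothetical wrapper around the model call
        self._cache: dict[str, str] = {}
        self.llm_calls = 0

    def next_action(self, state: dict) -> str:
        key = state_key(state)
        if key not in self._cache:
            # Cache miss: pay for one model call and remember the decision.
            self.llm_calls += 1
            self._cache[key] = self._plan_with_llm(state)
        return self._cache[key]


planner = MemoizedPlanner(lambda state: "call:" + state["tools"][0])
planner.next_action({"task": "refund", "tools": ["lookup_order"]})
planner.next_action({"tools": ["lookup_order"], "task": "refund"})  # same state, reordered
print(planner.llm_calls)  # 1
```

This only catches states that are identical after canonicalization. Treating "equivalent" states as equal needs a normalization step in front of state_key, and the cache needs invalidation whenever the tools, facts, or policies behind a decision change.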

But memoization only helps once the system knows which states are worth revisiting. Pruning handles the other side of the problem: branches that shouldn’t be explored further. Don’t limit pruning to KV cache pruning or speculative decoding; apply it at the workflow level too. If a tool repeatedly returns no new information, your next LLM call shouldn’t be a slightly reworded version of the same query. If a reflection loop keeps producing stylistic changes without improving correctness, the loop should stop. If a search path violates a constraint or depends on an unsupported assumption, it should be marked as unproductive and removed from the active search space.
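The reflection-loop case can be sketched as a loop with an explicit stopping rule. The function below is an illustration under stated assumptions: revise and score are hypothetical callables (a model-backed rewrite and some correctness measure such as tests passed), and min_gain is an arbitrary threshold you would have to tune.

```python
from typing import Callable


def reflect_until_stalled(
    draft: str,
    revise: Callable[[str], str],
    score: Callable[[str], float],
    min_gain: float = 0.01,
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Run a reflection loop, stopping when a round no longer improves the score."""
    best, best_score = draft, score(draft)
    rounds = 0
    for _ in range(max_rounds):
        candidate = revise(best)
        rounds += 1
        candidate_score = score(candidate)
        if candidate_score - best_score < min_gain:
            break  # prune: this round changed the text without improving it
        best, best_score = candidate, candidate_score
    return best, rounds


# Toy stand-ins: each revision appends "!", and the score stops rising at 3.
result, used = reflect_until_stalled("a", lambda t: t + "!", lambda t: min(len(t), 3))
print(result, used)  # a!! 3
```

The hard part in practice is the score function. If it rewards style, the loop never stalls; if it measures correctness, the loop ends as soon as the critique stops finding real problems.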

Dynamic programming becomes relevant when different branches of the workflow solve overlapping subproblems. A research agent may ask similar questions across several documents. A coding agent may inspect the same dependency chain from different entry points. A business analysis agent may compute the same metric for several report sections. If every branch solves these subproblems from scratch, the system pays repeatedly for work it has already done. Table 1 shows examples of how these patterns map to AI agent systems.

Table 1. Classical optimization patterns applied to AI agent systems

Optimization	The “old” CS way	The “agent” way
Memoization	Store results of expensive function calls.	Cache decisions. If the agent saw this state before, don’t ask it to reason again.
Pruning	Cut off search paths in a tree that won’t lead to a solution.	Kill a reflection loop when the critique stops yielding structural improvements.
Dynamic programming	Break problems into overlapping subproblems, solve each once, and reuse the result.	Share codebase analysis across multiple specialized agents instead of rereading files.

This isn’t nostalgia. These patterns mitigate the cost structure of agent systems. Memoization reduces repeated decisions. Pruning reduces repeated failure. Dynamic programming reduces repeated subproblem solving. Together, they form the optimization layer many agent architectures are missing in production.
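The coding-agent example, inspecting the same dependency chain from different entry points, is a small overlapping-subproblem case. In the sketch below the dependency graph is invented, the graph is assumed to be acyclic, and the counter stands in for an expensive model or tool call per module.

```python
from functools import lru_cache

# Hypothetical dependency graph: module -> modules it imports directly.
DEPENDENCIES = {
    "api": ("auth", "billing"),
    "worker": ("billing", "storage"),
    "auth": ("storage",),
    "billing": ("storage",),
    "storage": (),
}

analysis_calls = 0


@lru_cache(maxsize=None)
def transitive_deps(module: str) -> frozenset:
    """All modules reachable from `module`; each module is analyzed only once."""
    global analysis_calls
    analysis_calls += 1  # stands in for an expensive LLM or tool call
    reachable = set()
    for dep in DEPENDENCIES[module]:
        reachable.add(dep)
        reachable |= transitive_deps(dep)  # reused if another branch already solved it
    return frozenset(reachable)


print(sorted(transitive_deps("api")))     # ['auth', 'billing', 'storage']
print(sorted(transitive_deps("worker")))  # ['billing', 'storage']
print(analysis_calls)                     # 5
```

Without the cache, the same two queries would analyze modules nine times instead of five, because storage and billing are reached along several paths. In an agent system, each of those extra visits is a paid call.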

Where to start: Optimization follows topology

The patterns above aren’t a checklist you apply uniformly. Each multi-agent topology, whether centralized, decentralized, independent, or hybrid, distributes communication and coordination differently, which directly affects overhead, latency, and failure propagation. The optimization layer has to follow.

Centralized
A single orchestrator decides, delegates, and aggregates. The expensive unit is the orchestrator’s decision, repeated across similar inputs. Memoize the planner first.

Decentralized
Agents coordinate peer-to-peer, exchanging messages without a central authority. The cost moves into the communication itself: redundant exchanges, restated context, agents reasoning over the same shared state from different angles. Prompt caching on the shared context is the first win, followed by pruning exchanges that no longer add information.

Independent/swarms
Lightweight agents fan out without coordinating. Cheap individually, expensive in aggregate. If three of your ten agents ask semantically equivalent questions, you pay three times for the same answer. Memoization and pruning aren’t optimizations here; they’re load-bearing.

Hybrid
The repeated work shows up at two scales: within a cluster (overlapping subproblems among peers) and across clusters (the coordinator rediscovering the same routing decision). Use dynamic programming on shared subproblems inside the cluster, memoization on the coordinator’s decisions across them.

The optimization layer isn’t a generic discipline you bolt on. It’s a function of the shape of the implementation. Coding agents made it easy to generate the shape without seeing it. The craft is in seeing it anyway.

Who Owns the Code Claude Wrote?

Sena Evren — Mon, 15 Jun 2026 10:58:47 +0000

The following article originally appeared on Sena Evren’s Legal Layer newsletter and is being reposted here with the author’s permission.

TL;DR

Agentic coding tools like Claude Code, Cursor, and Codex generate code that may be uncopyrightable, owned by your employer, or contaminated by open source licenses you cannot see. Some of this is settled law, some is actively contested, and this piece is clear about which is which. If you are shipping AI-assisted code and have not thought about any of this, this piece is for you.

If you shipped code this week, some of it was probably written by an AI. The question of who legally owns that code is less settled than most developers assume, and the answer depends on three things that have nothing to do with how good the code is:

Whether a human made enough creative decisions to establish copyright
Whether your employment contract already assigned it to your employer
Whether the model pulled from GPL-licensed training data and quietly contaminated your codebase

On March 31, 2026, Anthropic accidentally published 512,000 lines of Claude Code’s source code in a routine software update through a missing configuration file. Before sunrise, the codebase was mirrored across GitHub. Before breakfast, a developer had used an AI tool to rewrite the entire thing in Python, and the “claw-code” repository hit 100,000 GitHub stars in a single day, the fastest in history. Then came the DMCA takedowns, and then came the question nobody had a clean answer to:

If Claude Code was, by Anthropic’s own lead engineer’s admission, predominantly written by Claude itself, does Anthropic even own it? Can you issue a DMCA takedown for code that copyright law may not protect?

That incident compressed every open question about AI-generated code ownership into a single news cycle. The same questions apply to your codebase.

The copyright rule nobody told you

Here is the legal baseline, in plain terms: Copyright only protects work created by a human.

The US Copyright Office has confirmed this consistently, and the DC Circuit upheld it in the Thaler case. When the Supreme Court declined to hear the Thaler appeal in March 2026, it did not endorse the lower court’s reasoning or settle the question nationally. Cert denial means the court chose not to hear the case, nothing more. What it does mean is that the DC Circuit’s ruling stands, the Copyright Office’s position is intact, and no court has yet gone the other way. Works predominantly generated by AI without meaningful human authorship are not eligible for copyright protection under current doctrine, and that position is stable even if it is not finally settled.

Two important limits on what Thaler actually decided.

The case involved a painting created with zero human involvement at all. Thaler listed the AI system as sole author and made no claim of any human creative contribution. The ruling does not directly address the harder question of AI-assisted work where a human was involved but the degree of that involvement is disputed.
Thaler involved visual art. No court has yet applied the human authorship doctrine specifically to code output from an AI coding tool. The logic applies, but the direct precedent does not exist yet.

What it means for you: Code that Claude Code or Cursor generated and you accepted without meaningful modification may not be copyrightable by anyone. If a competitor copies it, you may have no legal recourse, because the code sits in the public domain in everything but name.

The phrase that determines whether your code is protected is “meaningful human authorship,” and the Copyright Office has deliberately refused to quantify it with a percentage or a number of edits, because what courts look for is evidence that a human made genuine creative decisions:

Choosing the architecture
Deciding what to reject
Restructuring the output to fit a specific design

Specifying an objective to the model is not enough. Directing how the work is constructed is what counts.

In an agentic workflow, this distinction is harder to establish than it sounds. Consider a typical Claude Code session:

You write a one-line prompt: “build a rate limiting module for the API.”
Claude Code plans the approach, generates five files, and iterates through three versions.
You review the output, run the tests, and merge.

Your contribution in that sequence is your architectural intent and your final approval. Whether that constitutes meaningful human authorship in a courtroom is an unresolved question with no definitive court ruling yet.

The honest answer is: probably yes for modules you substantially redirected, probably no for code you accepted verbatim, and unclear for everything in between.

The middle ground is actively being litigated right now. In Allen v. Perlmutter, artist Jason Allen is challenging the Copyright Office’s denial of registration for a work he created using more than 600 detailed prompts and subsequent editing in Photoshop. The Copyright Office acknowledged the Photoshop edits as human-authored but still denied registration for the AI-generated underlying elements. That case has not been decided yet, and whatever it decides will be the closest thing to a ruling on how much human involvement is enough.

The closest existing precedent on partial protection is Zarya of the Dawn, a graphic novel where the Copyright Office granted registration for the human-authored text but denied it for the Midjourney-generated images. That decision establishes a practical principle developers can use right now: The human-authored elements of an AI-assisted codebase may be separately protectable even if the generated code itself is not. Your architecture documents, your design decisions recorded in commit messages, your ADRs, and your prompt logs showing deliberate redirection may all be protectable as human-authored expression even if the code they produced is not. Protecting what you can starts with documenting what you actually did.

What your employer probably already owns

Before you think about whether your code is copyrightable, there is a more immediate question: Even if it is, is it actually yours?

Your employment contract almost certainly says that anything you build at work belongs to your employer. That principle has a name in copyright law: the work-for-hire doctrine. Under it, any code created by an employee within the scope of their employment is owned by the employer, who is treated as the legal author, regardless of whether the code was written by hand, generated by Claude Code, or some combination. Using an AI coding tool during work hours, on a work project, on a work machine, does not change who owns the result.

Most employment contracts go further than the doctrine’s defaults. Look for a section in yours called “Intellectual Property,” “IP Assignment,” or “Work Product.” Open the contract, search for those terms, and read that section. A clause that says any of the following almost certainly covers your AI-assisted code:

“Any work product created using company equipment or resources”
“Any invention or development made during the term of employment”
“Any software created with the assistance of company-licensed tools”

The third one is the one to watch. If your employer licenses Claude Code, Cursor, or Copilot for the team, and you use those same tools to build a side project, a broad IP assignment clause may give the employer a claim over that project, even if you built it on your own time.

A senior developer in San Francisco described exactly this situation earlier this year. He had used Claude Code for work projects and for a personal fitness tracking app built on evenings and weekends. His company updated its IP policy and claimed everything he had built with AI assistance, including the personal app, arguing that because Claude had access to open work files in the IDE, any AI output was a derivative work of company IP.

This is the clearest example of how far this can stretch. His company’s claim rested on one phrase: The AI tools were “context-aware” of his company’s codebase. The argument does not hold up legally, because context visibility in an IDE does not make AI output a derivative work of files that were open nearby, and the connection between what Claude can see and what it generates is probabilistic pattern completion, not copying. But the argument illustrates what employers are starting to claim. If the clause is broad enough, it has surface validity regardless of what the AI actually did.

The practical rule: If you are building something on the side, use a personal account, a personal machine, and tools you pay for yourself. Keep your employer’s licensed tools out of that workflow entirely.

The open source contamination problem

Even if you own your AI-generated code, you may have already contaminated it with an open source license you cannot see.

AI coding tools are trained on massive amounts of public code, including code licensed under the GPL, LGPL, and other copyleft licenses. Copyleft licenses carry a specific obligation that travels with the code:

If you distribute software that is a derivative of GPL-licensed code, you must release your own source code under the same license.
This applies even if you did not know the code you incorporated was GPL-licensed.
“I did not know” is not a defense to a copyleft violation.

When an AI tool reproduces a substantial verbatim portion of GPL-licensed code from its training data, and you ship that code in a commercial product without releasing source, you may have created a copyleft violation without ever touching the original repository. The legal standard for infringement is substantial verbatim reproduction, not functional similarity or resemblance, and this distinction matters: an AI tool generating code that works like GPL code is different from an AI tool that reproduces GPL code word for word. The risk sits at the verbatim end of that spectrum, and the problem is that you have no way to know which side of the line your codebase is on without running a scan.
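To make the difference between verbatim reproduction and functional similarity concrete, here is a minimal Python sketch. It is an illustration only, not how commercial scanners or courts measure copying. The three snippets, the function name longest_verbatim_run, and the choice to ignore indentation and blank lines are all assumptions made for this example. It uses the standard library's difflib to find the longest run of identical consecutive lines shared by two pieces of code.

```python
import difflib


def normalize(source: str) -> list[str]:
    """Strip indentation and drop blank lines so reformatting does not hide a match."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def longest_verbatim_run(candidate: str, reference: str) -> int:
    """Return the length, in lines, of the longest block of identical consecutive lines."""
    a, b = normalize(candidate), normalize(reference)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


# Hypothetical snippets, invented for illustration only.
REFERENCE = """
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value
"""

# Same behavior, different expression: functional similarity.
REWRITTEN = """
def clamp(value, low, high):
    return max(low, min(value, high))
"""

# Same lines, different indentation: verbatim reproduction.
COPIED = """
def clamp(value, low, high):
        if value < low:
                return low
        if value > high:
                return high
        return value
"""

print(longest_verbatim_run(REWRITTEN, REFERENCE))  # only the signature line matches
print(longest_verbatim_run(COPIED, REFERENCE))     # every line matches
```

The rewritten version behaves identically but shares only its signature line with the reference, while the copied version matches on all six lines despite the reformatting. A real scan compares your code against a large corpus of known open source code rather than a single reference file.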

The chardet community dispute made this concrete in early 2026. This was not a filed lawsuit but a public dispute within the open source community that raised the question without resolving it legally. A developer used Claude to rewrite chardet, a Python character encoding library, and rereleased it under an MIT license, arguing that the AI rewrite was a “clean room” implementation free of the original LGPL license.

The legal question the community fought over: If Claude was trained on the LGPL-licensed codebase and its output reproduces substantial verbatim portions of that code, can the output be treated as license-free? The chardet dispute did not resolve cleanly and no court has issued a definitive ruling on this specific question. What is settled is that verbatim copying of GPL code violates the license regardless of how it was produced. What is unsettled is whether AI-generated output that reproduces training data patterns counts as verbatim copying. The working assumption among lawyers advising companies through M&A is that it probably does, and that assumption is now showing up as a standard condition in acquisition due diligence.

The Doe v. GitHub litigation, still working through the Ninth Circuit as of April 2026, is asking whether GitHub Copilot reproduces licensed code without attribution in violation of copyright law and DMCA Section 1202. The district court dismissed most claims, but the appeal is live. Whatever the outcome, the litigation has already changed industry behavior: GitHub Copilot added duplicate detection filters, and acquisition due diligence now routinely includes an AI codebase license scan.

What to do about all of this

Four concrete actions, none of which require a lawyer.

1. Run a license scan on your AI-assisted codebase

Tools that do this well:

FOSSA—most comprehensive, widely used in enterprise
Snyk Open Source—good for dev-team workflows, integrates with GitHub
Black Duck—standard in M&A due diligence

Each will scan your codebase, flag code that matches known open source libraries, and identify the licenses attached. If you are shipping a commercial product and have never run one of these, you are operating on assumption. The scan takes an afternoon and costs less than the first hour of a copyright dispute.
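As a rough first pass before running one of those tools, the sketch below looks for explicit license markers in your own source tree. It is not a substitute for any of the scanners listed above and does not reflect how they work internally. It only finds files that declare a license with an SPDX-License-Identifier tag, and code reproduced by an AI tool usually arrives without any such header, so an empty result proves nothing. The function name, the list of file suffixes, and the short list of copyleft prefixes are assumptions chosen for illustration.

```python
import re
from pathlib import Path

# SPDX identifier prefixes for some common copyleft licenses (illustrative, not exhaustive).
COPYLEFT_PREFIXES = ("GPL-", "LGPL-", "AGPL-")

# Captures the first license identifier that follows an SPDX tag.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")


def find_copyleft_markers(root: Path, suffixes=(".py", ".js", ".ts", ".go", ".c")) -> dict[str, str]:
    """Map relative file path -> SPDX identifier for files that declare a copyleft license."""
    hits = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        match = SPDX_TAG.search(text)
        if match and match.group(1).startswith(COPYLEFT_PREFIXES):
            hits[str(path.relative_to(root))] = match.group(1)
    return hits
```

Calling find_copyleft_markers(Path(".")) from the root of a repository returns a dictionary of flagged files. Treat any hit as a prompt to find out where that file came from, and treat a clean result as a reason to run a real scan, not to skip one.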

2. Document your human creative contributions as you go

The evidence that establishes meaningful human authorship is the same evidence you already produce in a normal engineering workflow. You just have to keep it deliberately rather than letting it disappear.

What to preserve:

Commit messages that describe what you changed and why, not just what the AI generated. “Restructured Claude’s module architecture, rejected initial state management approach, rewrote error handling from scratch” is evidence. “Add rate limiting module” is not.
Prompt logs. Claude Code and Cursor both retain interaction history. Export or screenshot the sessions where you made significant architectural decisions.
Design documents, ADRs, or any notes that predate the generated code and show you specified the structure before the AI built it.

The first commit message versus the second is the difference between a defensible authorship claim and a clean “Claude wrote this” record.
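One way to keep that evidence deliberately is to record a fingerprint of each exported prompt log or design document at the time you produce it. The sketch below is one possible approach, not an established practice or a legal standard. The function names and manifest layout are assumptions made for illustration. A hash manifest committed alongside the code shows only that a file existed in that exact form when the manifest was recorded; how much weight that would carry in a dispute is a question for a lawyer.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(files: list[Path], note: str) -> dict:
    """Record a SHA-256 digest and size for each evidence file, plus a UTC timestamp."""
    entries = []
    for path in files:
        data = path.read_bytes()
        entries.append({
            "file": path.name,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        })
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
        "entries": entries,
    }


def write_manifest(manifest: dict, destination: Path) -> None:
    """Write the manifest as JSON so it can be committed alongside the code."""
    destination.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

In use, you would pass the paths of the exported sessions and ADRs for a given feature, add a one-line note describing the decision they document, and commit the resulting JSON file in the same change as the code.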

3. Read the IP clause in your employment contract before you build anything on the side

Open your contract, search for “intellectual property,” “IP assignment,” or “work product,” and read that section carefully. The specific language determines your exposure:

“Work product created during employment hours” is narrower than “work product created using company resources.”
“Relating to the company’s business” is narrower than “any software development.”
“Company-licensed tools” is the phrase that captures AI coding tools even on personal projects.

If the clause is broad and you want to build something independently, you have three realistic options: negotiate a written carveout before you start (easier at the start of a new role than mid-employment), use entirely personal tools on entirely personal time on a personal machine, or accept that the claim exists and decide whether the risk is worth it.
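If your contract is long, the search step can be sketched in a few lines of Python. This is a reading aid under stated assumptions, not a legal review: it assumes you have the contract as plain text, the phrase lists are taken loosely from the examples in this section, and the function name is hypothetical. It matches line by line, so a phrase that wraps across two lines will be missed.

```python
import re

# Section headings to look for; matching is case-insensitive.
SECTION_TERMS = ("intellectual property", "ip assignment", "work product")

# Phrases this section identifies as narrowing or broadening a clause.
SCOPE_PHRASES = (
    "during employment hours",
    "using company resources",
    "relating to the company's business",
    "any software development",
    "company-licensed tools",
)


def find_phrases(contract_text: str, phrases=SECTION_TERMS + SCOPE_PHRASES) -> dict[str, list[int]]:
    """Return each phrase found in the text with the 1-based line numbers where it appears."""
    # Normalize curly apostrophes so "company’s" matches "company's".
    lines = contract_text.replace("\u2019", "'").splitlines()
    found = {}
    for phrase in phrases:
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        numbers = [n for n, line in enumerate(lines, start=1) if pattern.search(line)]
        if numbers:
            found[phrase] = numbers
    return found
```

The output tells you where to start reading, not what the clause means. Read the full section around every hit, and read the whole contract if nothing turns up, because the relevant language may be worded differently.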

4. Check which Anthropic plan you are on before shipping for commercial use

Go to anthropic.com/legal and compare the consumer terms against the commercial terms. The difference that matters:

Consumer terms (free and Pro plans): Anthropic assigns outputs to you, but the IP indemnification is narrower and covers fewer scenarios.
Commercial terms (API and enterprise): Anthropic assigns outputs to you and will defend you against copyright infringement claims arising from your authorized use of the service and its outputs.

If you are shipping AI-assisted code in a commercial product using the free or Pro plan, the indemnification gap is real. The API or enterprise agreement is the appropriate tier. Note that neither indemnification covers a downstream GPL violation from license contamination in your codebase. That is your governance problem to solve with the license scan in action 1.

The thing worth sitting with

Anthropic’s own lead engineer publicly stated that his recent contributions to Claude Code were written entirely by the AI, and the leaked codebase that Anthropic issued 8,000 DMCA takedowns to suppress may be predominantly AI-authored. Whether Anthropic’s copyright claims over that codebase are legally valid remains an open question no court has yet resolved.

If the company that built the tool cannot cleanly assert copyright over its own AI-assisted code, the question of whether you can is worth taking seriously before it becomes relevant in a transaction, a dispute, or an acquisition conversation. The developer who documents their creative contributions from the start is in a meaningfully different legal position than the one who accepted three thousand lines of Claude output and merged without review, even if both shipped the same product.

A note on what this piece covers and what it does not

Three things in it are settled law:

Works lacking human authorship are uncopyrightable.
The work-for-hire doctrine applies regardless of how code was generated.
Verbatim copying of GPL-licensed code violates the license.

Two things are emerging consensus without definitive court rulings yet:

How much human direction is enough to establish meaningful authorship in an agentic workflow
Whether AI output that reproduces training data patterns counts as verbatim copying

One thing is genuine speculation:

Whether any of this will be litigated at scale in the near term

Most code copyright claims never reach court. The place where the unsettled questions become concrete today is M&A due diligence and institutional fundraising, where acquirers and investors are already asking these questions as a condition of closing.

If neither of those applies to your situation right now, the four actions above are still worth doing, but the urgency is lower than the piece might imply.