Radar

Stop Getting Good at Protocols. Get Good at Agent Experience.

Sean Roberts — Wed, 24 Jun 2026 11:04:07 +0000

In 2025, if you weren’t building with MCP, you weren’t serious about agents. The Model Context Protocol dominated the agent conversation for the better part of the year. Conference talks, roadmaps, hiring plans, all of it revolved around MCP.

Then late 2025 into 2026, AI Skills arrived and the backlash was immediate. Engineers declared MCP dead in favor of Skills, then dead in favor of CLI. Perplexity’s CTO said publicly that the company was deprioritizing it. The cycle was fast, loud, and predictable. New tool, new hype, new rewrite.

I started pushing Agent Experience early in 2025, while MCP was still the center of gravity. The response was mostly skepticism. AX was overthinking it. MCP was the only layer that mattered. That perspective aged poorly. The people who dismissed AX weren’t wrong about MCP being useful. They were wrong about a protocol being a strategy.

The thing they missed, and what I think most of the industry is still missing, is that the protocol is not the thing to get good at. The discipline is.

We keep falling into the tool trap

Our industry has a well-documented habit of confusing tools with strategy. We did it with microservices, Kubernetes, and GraphQL. Now we’re doing it with agent protocols.

MCP, AI Skills, A2A, and ACP are all implementations. They matter and they solve real problems. But none of them are the right thing to build your strategy on top of. They are, by nature, the thing that changes.

When you organize your agent strategy around a specific protocol, you’re building on a foundation someone else controls and the market can shift away from at any moment. Worse, you’re skipping the step that would tell you whether that protocol is even the right fit for your use case.

This is the tool trap. You optimize your usage of a specific integration mechanism without first understanding what you’re actually optimizing for.

So what is Agent Experience?

Agent Experience (AX) is the discipline of studying how AI agents discover, understand, and interact with your systems, and then systematically improving those interactions.

Think of it as the agent-facing counterpart to User Experience. UX didn’t emerge because one UI framework won. It emerged because teams realized that the quality of human interaction with software was a design problem that transcended any particular technology. You could build a terrible experience in React just as easily as in vanilla JavaScript. The framework was not the variable. The design thinking was.

AX works the same way. How does an agent discover what your service can do? How does it understand the boundaries of your API? When it fails, does it get enough context to recover? Is the interaction efficient, or is the agent burning tokens on unnecessary round trips?

These questions are protocol-agnostic. They apply whether you expose capabilities through MCP, Skills, A2A, or something that hasn’t been invented yet. The teams that can answer them will adapt to whatever comes next because they understand the problem space, not just the current toolchain.

AX is an extension of what you already care about

AX is not competing with User Experience, Developer Experience, or Customer Experience. It’s an extension of all three.

Your primary focus is still providing a great experience to your customers. What has changed is how those customers interact with you. More and more, they delegate tasks to agents. When a customer asks an agent to integrate with your API, deploy to your platform, or pull data from your service, that agent is acting on their behalf. The agent’s experience determines how likely it is to achieve your customer’s goal.

If a customer’s agent struggles to authenticate, burns through tokens parsing your error messages, or fails silently because your API lacks context, something worse than a complaint happens. The agent will quietly start using an alternative service that provides a better experience. Your customer might not even notice the switch. You just lost them without a single support ticket.

UX optimized for humans clicking through interfaces. DX optimized for developers building on your platform. CX looked at the entire customer journey. AX extends that thinking to the agents those customers now send on their behalf.

The protocol treadmill doesn’t work

Think about what actually happened with MCP. Teams invested heavily in writing MCP server implementations. A lot of those implementations were mediocre. Not because MCP was flawed but because the teams hadn’t thought carefully about what an agent actually needed from their system. A 2026 study out of Queen’s University examined 856 tools across 103 MCP servers and found that 97.1% of tool descriptions contained at least one quality issue, with 56% failing to state their purpose clearly. The protocol worked fine. The experience design was the problem.

When Skills emerged, those same teams faced a familiar problem wearing new clothes. They still hadn’t answered the foundational questions: What does an agent need to accomplish with our service? What is the minimum viable interaction surface? What context does an agent need to make good decisions?

The teams that had worked through those questions adapted fast. Migrating from one protocol to another is mechanical when you already know what your agent-facing interface should look like. The protocol is the serialization format. The experience design is the hard part.

This pattern will keep repeating. Whether it is the Universal Commerce Protocol, A2A, or whatever lands next, something new will always be gaining traction. If your strategy is to become an expert in each successive protocol, you’re signing up for a treadmill that only speeds up.

What an AX practice looks like

So what does it actually look like to take Agent Experience seriously? If you have ever built a UX research practice or a DX program, this will feel familiar. The steps aren’t new. The persona is.

In talks, I break it down to five steps.

Audit the agents your customers use. Know what’s walking through your front door. Look at your traffic data and logs and figure out what portion of your footprint is agents versus humans, and which agents specifically. Are your customers sending Claude Code? Cursor? Custom agents built on your API? You can’t design for something you haven’t observed. Same reason UX teams run user research. Different method, same motivation.

Identify the use cases customers want to delegate. Not every interaction needs to be agent-optimized. Take that same log data, look at the requests agents are making to your platform, and extrapolate what they were trying to achieve. You can also use AEO data to understand what areas your customers are asking about in agent-facing search. Focus on the highest-value surfaces first. If you have ever prioritized a DX roadmap by looking at what developers actually do with your API, you already know this muscle.

Verify and audit the experience of those interactions. Watch what happens when an agent tries to complete those tasks on your system. Where does it get stuck? Where does it misunderstand what your service offers? This is usability testing. The user is an LLM; the struggle is about context not button placement, but you’re answering the same question: Can they get the job done?

Improve and repeat. Agent capabilities evolve. Models get smarter. New interaction patterns emerge. At Netlify, we’ve found cases where our product works one way but agents universally assume it works another way and never ask. Instead of fighting that assumption, we improved the product to work the way agents expect. The result was more adoption of those agent flows and fewer errors. The teams that treat this as a living practice will outperform those running from one protocol migration to the next.

Automate validation and prevent regressions. Once you have a baseline for what “good” looks like, lock it in. Tools like AXIS, an open source scoring framework, let you run real agents against real scenarios and get a comparable score back. Wire it into CI and catch AX regressions the same way you catch broken tests. This is how you go from anecdotal improvement to measurable, repeatable AX quality.

When you have this practice in place, protocol choices become obvious. You can evaluate new tools on their merits. Does it solve a real friction point you have observed? Does it unlock capabilities you couldn’t achieve before? Or is it just different packaging for something you’re already doing well?

The hard part is familiar

AX is harder to pick up than a new protocol. That is just the reality. Learning MCP or Skills is a bounded technical problem. Read the docs, write some code, and ship an integration. Clear finish line, easy to show progress. That’s genuinely appealing, especially when you or your teams are moving fast.

Building an AX discipline means sitting with ambiguity for a while. Studying agent behavior before you have clean answers. Accepting that the right integration strategy depends on context you have to discover, not a tutorial you can follow. But if you’ve ever built a UX or DX practice from scratch, you’ve been here before. The why is the same: understand your users, reduce friction, and make it easy for them to succeed. How you do it is different because the user is different. The discipline isn’t new. It’s an extension of work our industry has been doing for decades.

The good news is that this thinking is gaining momentum. John Maeda’s 2026 Design in Tech Report is explicitly about the shift from UX to AX. Researchers are studying agent interaction quality as a first-class engineering concern. BCG and MIT Sloan found that 35% of organizations are already using agentic AI, with another 44% planning to. The question is no longer whether AX matters. It’s whether your team is building the practice before your competitors do.

The agents of 2028 won’t interact with your systems the way the agents of 2025 did. The protocols will be different. The capabilities will be different. The expectations will be different. What won’t change is the fundamental need for your systems to provide a great experience to the people who use them, and now, the agents those people send on their behalf.

Get good at that. The rest is implementation detail.

Principal Drift

Shreshta Shyamsundar — Tue, 23 Jun 2026 10:21:13 +0000

Over the past year I’ve reviewed enterprise agent architectures at roughly two dozen organizations, including banks, retailers, healthcare systems, and a couple of regulators. The architecture diagrams have been reliably impressive. There are boxes for the MCP gateway, the tool registry, the vector store, the orchestrator, the policy engine, and the observability stack. There are arrows showing how agents discover each other, share context, and call tools across the mesh. By 2026 standards, these are the table-stakes pictures for any serious agentic deployment. But what none of them show anywhere is who the agents are, whose authority they carry, or who answers when they’re wrong.

That omission has a name worth using: principal drift, the steady decoupling, in any sufficiently large agent system, between the human authority a recorded action is supposed to derive from and the actor that actually took it. What looks like a defensible identity posture on the day you ship your first agent quietly degrades as agents multiply, compose, and outlive their original initiatives. Principal drift isn’t three independent failure modes; it’s one cascade. Identity collapses first. Authority erodes next, because there is no longer a stable principal to bind policy to. Accountability dissolves third, because the cost of agent error lands on whichever team has the weakest negotiating position when the incident review starts. Stopping the cascade means intervening at the first link, but almost no enterprise agent platform does so right now.

To see the cascade run, take the most boring possible enterprise agent, a refund agent, and watch.

A customer-service rep, fielding a chat, asks the agent to process a $48 refund for a damaged item. The agent checks eligibility, issues the refund, posts an update. The audit log records the action as taken by something like refund-agent-prod-03, running under a service principal owned by the customer-service platform team. That entry is true, but it’s also useless. The agent wasn’t acting as refund-agent-prod-03. It was acting as the rep, on behalf of the customer, under a delegation chain nobody recorded. In a well-built system, customer, rep, agent identity, and service principal are recorded together, queryable as a chain, and durable beyond the session. In most production systems today they aren’t. This is the first link in the cascade, where identity collapses to a generic service principal, and there’s no longer a who to attach anything else to.

Authority erodes next. The refund agent has an issue_refund tool that can technically refund any order. Its authority is supposed to be narrower (refunds up to $200, orders under 90 days, customers in good standing, automatic escalation above $50), but that authority lives in a prompt or a YAML file or a Notion page the team last updated when the policy was different. The runtime enforces capability, but nobody really enforces authority. When a poisoned input or a confused chain of reasoning leads the agent to refund $1,800 to the wrong customer, there’s no clean answer to the postincident question “Who approved this policy?” because the policy was never an artifact. The same pattern is worse at higher stakes: Imagine a coding agent with merge access to a protected branch, instructed by a prompt embedded in a code comment to “log configuration values for debugging,” silently exfiltrating secrets to an external monitoring service.

Accountability then dissolves. The team that built the agent says it followed policy. The team that wrote the policy says it didn’t anticipate the input. The team that operates the platform says the agent was running as a service principal whose behavior they don’t own. The audit log may show the action, but it doesn’t show the reasoning that produced the action, the retrieved context that shaped the reasoning, or the prompt history that framed the retrieval. Postincident review becomes archaeology, and the cost is absorbed, eventually, by whoever has the weakest negotiating position when the meeting ends.

Is any of this new? We have IAM, identity governance, policy as code, audit trails, SIEMs, and 30 years of compliance practice. Why isn’t this just IAM done properly? Because IAM was built around assumptions agents violate. IAM and IGA assume a population of principals that changes on human timescales: People get hired, people leave, and service accounts rotate quarterly. Agents are spun up per session and compose into chains where one agent calls another, which calls a third, impersonating users through delegated tokens that traditional IGA cannot represent as a chain at all. Policy engines fire at the moment of action, at the API, the database, and the network. Agents make their most consequential decisions before they hit those enforcement points, in the reasoning step that selects which tool to call and with what arguments. Mature audit logs assume that replaying the inputs reproduces the output. But for agents, replaying the prompt and the retrieval can yield a different action, because the model itself contributes state the log doesn’t capture. The instruments fire, the dashboards turn green, and the agent that quietly exfiltrated secrets still does so. The audit log records the action as agent-service-01, which again is both true and useless.

This is also where the vendors selling a consolidated stack want you to skip ahead. Microsoft’s Entra Agent ID, currently in public preview, is the most polished solution to date, extending the conditional access, identity governance, and identity protection used for humans and workloads to cover AI agents as a new identity type, but Google and Salesforce are also building this layer. The marketing line is that agents receive the same identity-driven protections as the rest of the workforce. That’s a real step forward in addressing the first link of the cascade, but it isn’t governance. It’s a control plane with a governance plane’s marketing. Conditional access can tell you whether the agent’s access attempt was permitted. It can’t tell you whether the decision the agent made before that access attempt was within its authority, why the agent reached the decision, or which business unit owns the policy the decision was supposed to obey.

The actual governance plane has to capture decisions, not just actions. A reasoning-grade audit record is the load-bearing primitive of the missing layer, and it looks something like this:

{
  "event_id": "refund-2026-05-17-08431",
  "triggered_by": {
    "human_principal": "rep:olivia.chen@firm.com",
    "delegated_via": "support-console-session-9c2a",
    "customer_principal": "cust:7741289"
  },
  "agent": {
    "identity": "refund-agent",
    "version": "v4.7.2",
    "policy_ref": "refund-policy/v3.1 (signed: r.patel, 2026-04-22)"
  },
  "task": "Process refund for order 88812204",
  "retrieved_context": [
    {"doc": "order:88812204", "fetched": "2026-05-17T08:43:11Z"},
    {"doc": "policy:refund-eligibility", "chunk": 4, "fetched": "2026-05-17T08:43:12Z"}
  ],
  "reasoning_trace": "...",
  "tool_calls": [
    {"tool": "check_eligibility", "input": "...", "output": "eligible"},
    {"tool": "issue_refund", "input": {"amount": 48.00}, "output": "ok"}
  ],
  "action": "refund:48.00",
  "principal_chain_hash": "0x9e7b3f..."
}

Not every agent needs this. A scheduling agent that proposes meeting times doesn’t. An agent that moves money, deploys code, or makes decisions that a regulator will eventually ask about does need it, and that’s the right bar to set because of the associated cost. Reasoning-grade audit is closer to a flight-data recorder than a syslog feed. The data is expensive to store and to query, with real privacy implications since those logs contain everything the agent saw, including data the agent was authorized to read but the audit system wasn’t supposed to keep. You afford it with proportional retention: full reasoning capture for high-blast-radius agents (regulator-facing, customer-funded, contractually material, production-modifying) and lighter capture for internal-only assistants.

Which raises the question the architecture diagram doesn’t ask: Who builds and runs this? Security can enforce policy but can’t author it. The people who know what a refund agent should be allowed to do own the refund business, not the firewall. IT can provision identities but can’t draft “good standing” or write the escalation rule. The MCP and A2A protocol communities are doing real work on wire-level identity and delegation. MCP gives you tool-invocation provenance and is the standard Entra Agent ID and most vendor frameworks build on. A2A is converging on cross-agent delegation primitives. Both matter, but neither drafts policy. Standards, not the institution, move the connectors.

What enterprises need is a new function that sits between the business units owning the policies and the platform teams running the runtime. Call it agent operations: small group, often four to eight people in a Global 2000 enterprise, embedded rather than centralized, reporting into the CIO or CISO depending on house politics, with explicit charter to maintain a registry of every production agent, its named human owner, its versioned authority specification, its retention policy for reasoning-grade audit, and its lifecycle state. Each agent gets onboarded with a signed policy, reviewed on a real cadence, and actually retired when its initiative ends, rather than the current default of quietly outliving its sponsors. Designing against failure modes like review cadences that calcify into ceremony, policy artifacts that lag agent deployment velocity, or functions that become the place agents go to die in committee is itself part of the work. The function has to ship at the pace of the platform teams or it will be routed around within a quarter.

The work is hard. It’s also overdue, and the regulatory clock is running. The EU AI Act’s high-risk provisions are entering enforcement this year, and regulators will ask for explainability, traceability, lifecycle records, and named human accountability. These are exactly the artifacts an agent operations function produces. Tyler Akidau called this the missing HR layer in his April Radar piece; Artur Huk’s more recent “From Capabilities to Responsibilities” converges on similar ground from the runtime side. The label matters less than the work. This piece is about governance inside one organization. The harder problem is governance across organizations, with agents acting under different trust regimes. That’s strictly worse, and worth its own piece.

Within your own four walls, the diagnostic is doable in an afternoon. Pick one production agent. Try to answer, with evidence: Whose authority does it carry, traced from action back to a named human? Where is its authority specified, and who signed the current version? When it does something wrong tomorrow, who pays, how is that decided, and what reasoning-grade record supports the decision? Most architects who do this honestly come away with three blanks and a knot in their stomach. That’s principal drift, named and visible.

The mesh you’ve built is real and necessary, but it isn’t sufficient. The rest of the architecture is the institution above it: the registry, the signed policies, the reasoning-grade audit, the named human at the end of every chain. In most enterprises it doesn’t yet exist, and it won’t arrive by buying another platform. You’ll have to draft it yourself.

Loop Engineering

Addy Osmani — Mon, 22 Jun 2026 11:04:36 +0000

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead. A loop here can be thought of as a recursive goal where you define a purpose and the AI iterates until complete. I believe this may be the future of how we work with coding agents. However, it’s still early; I’m skeptical, and you absolutely have to be careful about token costs (usage patterns can vary wildly if you are token rich or poor), so I want to unpack what it is and what it means.

Peter Steinberger recently said: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Similarly, Boris Cherny, head of Claude Code at Anthropic, said, “I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops”.

Okay, so what does any of that mean?

For like two years, the way you got something out of a coding agent was you wrote a good prompt and shared enough context. You type a thing, you read what came back, you type the next thing. The agent is a tool and you are holding it the entire time, one turn after the other. That part is kind of over, or at least some think it’s going to be.

Now you build a small system that finds the work, hands it out, checks it, writes down what is done and then decides the next thing, and you let that system poke the agents instead of you. I wrote before about the cousin of this, agent harness engineering, which is making the environment one single agent runs inside and the factory model—the system that builds the software. Loop engineering sits one floor above the harness. The harness but it runs on a timer, it spawns little helpers, and it feeds itself.

The thing that surprised me is this is not really a tool thing anymore. A year ago if you wanted a loop you wrote a pile of bash and you maintained that pile forever and it was yours and only yours. Now the pieces just ship inside the products. Steinberger’s list maps almost exactly onto the Codex app, and then almost the same onto Claude Code. And once you notice the shape is the same, you stop arguing about which tool. You just design a loop that still works no matter which one you happen to be sitting in.

The five pieces, and then notes

A loop needs five things and then one place to remember stuff. Let me list it first and then map it.

Automations that go off on a schedule and do discovery and triage by themselves
Worktrees so two agents working in parallel don’t step on each other
Skills to write down the project knowledge the agent would otherwise just guess
Plugins and connectors to plug the agent into the tools you already use
Subagents so one of them has the idea and a different one checks it

Then the sixth thing, the memory. A Markdown file, or a Linear board, anything that lives outside the single conversation and holds what’s done and what is next. Sounds too dumb to matter. But it’s the same trick every long-running agent depends on, and I went into it in “Long-Running Agents”: The model forgets everything between runs so the memory has to be on disk and not in the context. The agent forgets; the repo doesn’t.

Both products have all five now.

Primitive	Job in the loop	Codex app	Claude Code
Automations	Discovery + triage on a schedule	Automations tab: pick project, prompt, cadence, environment; results land in a Triage inbox; `/goal` for run-until-done	Scheduled tasks and cron, `/loop`, `/goal`, hooks, GitHub Actions
Worktrees	Isolate parallel features	Built-in worktree per thread	`git worktree`, `--worktree`, `isolation: worktree` on a subagent
Skills	Codify project knowledge	Agent Skills (`SKILL.md`), invoked with `$name` or implicitly	Agent Skills (`SKILL.md`)
Plugins and connectors	Connect your tools	Connectors (MCP) plus plugins for distribution	MCP servers plus plugins
Subagents	Ideate and verify	Subagents defined as TOML in `.codex/agents/`	Task subagents in `.claude/agents/`, agent teams
State	track what’s done	Markdown or Linear via a connector	Markdown (`AGENTS.md`, progress files) or Linear via MCP

The names are a bit different here and there, but the capability is the same thing. Let me go one by one because honestly the details are where a loop either holds together or quietly leaks everywhere.

Automations, this is the heartbeat

Automations are what make a loop an actual loop and not just one run you did once. In the Codex app you make one in the Automations tab and you pick the project, the prompt it will run, how often, and if it runs on your local checkout or on a background worktree. The runs that find something go to a Triage inbox, and the runs that find nothing just archive themselves which is nice. OpenAI uses them internally for boring stuff like daily issue triage, summarizing CI failures, writing commit briefings, and hunting bugs somebody added last week. And an automation can call a skill, so you keep the recurring thing maintainable; you fire $skill-name instead of pasting a giant wall of instructions into a schedule that nobody will ever update.

Claude Code gets to the same place but through scheduling and hooks. You can run a prompt or a command on a interval with /loop, you can schedule a cron task, you can fire shell commands at certain points in the agent lifecycle with hooks, or you push the whole thing to GitHub Actions if you want it to keep running after you close the laptop. Same idea exactly, you define an autonomous task, you give it a cadence, and the findings come to you so you are not the one going around checking.

There is a second in-session primitive worth knowing, and it’s the one closer to what this whole post is about. /loop re-runs on a cadence. /goal keeps going until a condition you wrote is actually true, and after every turn a separate small model checks whether you are done, so the agent that wrote the code isn’t the one grading it. You give it something like “all tests in test/auth pass and lint is clean” and walk away. Codex has the same thing, also called /goal: It keeps working across turns until a verifiable stopping condition holds, with pause and resume and clear. Same primitive, both tools, which is kind of the pattern for this whole article.

So this is the part that surfaces the work. The rest of the loop is what acts on it.

Worktrees, so parallel doesn’t turn into chaos

The second you run more than one agent, the files start colliding; that becomes the failure. Two agents writing the same file is the exact same headache as two engineers committing to the same lines and nobody talked to each other first. A Git worktree fixes it. It’s a separate working directory on its own branch sharing the same repo history, so one agent’s edits literally cannot touch the other one’s checkout.

Codex builds the worktree support right in so several threads hit the same repo at once and don’t bump into each other. Claude Code gives you the same isolation with git worktree, a --worktree flag to open a session in its own checkout, and a isolation: worktree setting you stick on a subagent so each helper gets a fresh checkout that cleans itself up after. (I wrote about the human side of all this in “The Orchestration Tax.”) The worktrees take away the mechanical collision, but YOU are still the ceiling. Your review of bandwidth decides how many you can actually run, not the tool.

Skills, so you stop explaining your project every single time

A skill is how you stop reexplaining the same project context every session like a goldfish. Both tools use the same format: a folder with a SKILL.md inside holding instructions and metadata, and then optional scripts, references, and assets. Codex runs a skill when you call it with $ or /skills, or by itself when your task matches the skill description, which is the reason a tight, boring description beats a clever one. Claude Code does it the same way and I wrote the pattern up in “Agent Skills.”

Skills are also where intent stops costing you over and over. I argued in “The Intent Debt” that an agent starts every session cold and it will fill any hole in your intent with a confident guess. A skill is that intent written down on the outside, the conventions, the build steps, the “we don’t do it like this because of that one incident,” written one time where the agent reads it every run. Without skills the loop rederives your whole project from zero every cycle; with skills it kind of compounds.

One thing to keep straight: The skill is the authoring format, and a plugin is how you ship it. When you want to share a skill across repos or bundle a few together, you package them as a plugin. True in Codex, true in Claude Code.

Plugins and connectors, the loop touches your real tools

A loop that can only see the filesystem is a tiny loop. Connectors, which are built on MCP, let the agent read your issue tracker, query a database, hit a staging API, or drop a message in Slack. Codex and Claude Code both speak MCP so the connector you wrote for one usually just works in the other. And plugins bundle connectors and skills together so your teammate installs your setup in one go instead of rebuilding the whole thing from memory.

This is the difference between an agent that says “here is the fix” and a loop that opens the PR, links the Linear ticket, and pings the channel once CI is green by itself. The connectors are the reason the loop can act inside your actual environment instead of just telling you what it would do if it could.

Subagents, keep the maker away from the checker

The most useful structural thing in a loop, by far, is splitting the one who writes from the one who checks. The model that wrote the code is way too nice grading its own homework. A second agent with different instructions and sometimes a different model catches the stuff the first one talked itself into.

Codex only spawns subagents when you ask, runs them at the same time, and then folds the results back into one answer. You define your own agents as TOML files in .codex/agents/, each with a name, a description, instructions, and optional model and reasoning effort, so your security reviewer can be a strong model on high effort while your explorer is some fast read-only thing. Claude Code does the same with subagents in .claude/agents/ and agent teams that pass work between them. The usual split in both is one agent explores, one implements, and one verifies against the spec.

I made this case twice already, once as “The Code Agent Orchestra” and once as “Adversarial Code Review.” The reason it matters specifically inside a loop is the loop runs while you are not watching, so a verifier you actually trust is the only reason you can walk away. Subagents do burn more tokens since each one does its own model and tool work, so spend them where a second opinion is worth paying for. This is also basically what Claude Code’s /goal does under the hood: A fresh model decides if the loop is done instead of the one that did the work, the maker and checker split applied to the stop condition itself.

What one loop looks like

Stick it together and a single thread turns into a little control panel. Here is one shape I keep using.

An automation runs every morning on the repo. Its prompt calls a triage skill that reads yesterday’s CI failures, the open issues, and the recent commits and writes the findings into a Markdown file or a Linear board. For each finding that is worth doing, the thread opens an isolated worktree and sends a subagent to draft the fix, and a second subagent reviews that draft against the project skills and the existing tests.

Connectors let the loop open the PR and update the ticket. Anything the loop cannot handle lands in the triage inbox for me. The state file is the spine of the whole thing; it remembers what got tried, what passed, and what is still open, so tomorrow morning the run picks up where today stopped.

And look at what you actually did there. You designed it one time. You did not prompt any of those steps. That’s Steinberger’s whole point made real, and it’s the same loop in Codex or in Claude Code because the pieces are the same pieces.

What the loop still does not do for you

The loop changes the work; it does not delete you from it. And three problems actually get sharper as the loop gets better, not easier.

Verification is still on you. A loop running unattended is also a loop making mistakes unattended. The whole reason you split the verifier subagent from the maker is to make the loop’s “it’s done” mean something, and even then “done” is a claim and not a proof. I keep saying the same line from “Code Review in the Age of AI”: Your job is to ship code you confirmed works.

Your understanding still rots if you allow it. The faster the loop ships code you did not write, the bigger the gap between what exists and what you actually get. That’s comprehension debt and a smooth loop just makes it grow faster unless you read what the loop made.

And the comfortable posture is the dangerous one. When the loop runs itself, it’s very tempting to stop having an opinion and just take whatever it gives back. I called that “cognitive surrender.” Designing the loop is the cure when you do it with judgment and the accelerant when you do it to avoid thinking: same action, opposite result.

Build the loop. Stay the engineer.

I think this is a preview of how our work is going to evolve. That said, if I weren’t reviewing the code myself or if I relied entirely on automated loops to fix it, my product’s quality would suffer. I’d likely end up stuck in a downward spiral, continuously digging myself into a deeper hole.

Go ahead and set up your loops, but don’t forget that prompting your agents directly is also effective. It’s all about finding the right balance.

Loops can also result in different outcomes depending on you. Two people can build the exact same loop and get completely opposite results. One uses it to move faster on work they understand deeply. The other uses it to avoid understanding the work at all. The loop doesn’t know the difference. You do.

That’s what makes loop design harder than prompt engineering. Cherny’s point isn’t that the work got easier. It’s that the leverage point moved.

Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go.

This Week in AI: Fable 5, the Clone Wave, and Uber’s AI Reality Check

Michelle Smith — Thu, 18 Jun 2026 19:33:23 +0000

This week, egghead.io cofounder John Lindquist joined host YK Sugi, founder of CS Dojo and developer experience manager at Eventual, to cover the latest AI news. First on the agenda was the contested release of Claude Fable 5. They also examined the financial shifts reshaping the technology industry, including the rising costs associated with agentic coding loops. Then John outlined the framework he uses to build in the agent era without starting from scratch every time.

Watch the full episode here:

Claude Fable 5: 3 days, a government order, and a lot of unanswered questions

Claude Fable 5 launched June 9 and was pulled from all customers on June 12 after the US government issued a directive ordering Anthropic to restrict access for foreign nationals inside and outside the US. Amazon researchers had reportedly surfaced what they characterized as a security vulnerability, and after Anthropic reportedly declined to patch or redeploy the model, the directive came down. Senior Anthropic staff subsequently traveled to Washington to meet with White House officials.

The dispute about what actually happened is unresolved. Anthropic’s position is that the reported issue was a narrow jailbreak that had been previously identified and was present across public models generally, and not a serious security threat. An independent researcher who reviewed the report described it as defensive prompting that surfaced known vulnerabilities and called the response an overreaction. Neither side has published the technique or prompt, so there’s no way to evaluate the claim independently. But as John put it, “It sets a very strange precedent going forward, as models are released, that governments can step in and control what private companies can and cannot do with their model.”

Another new precedent: Fable 5 wasn’t built on the Opus or Sonnet architecture, which means comparisons to prior Anthropic models or contemporaries don’t tell us much. But initial impressions were positive, including from YK and John, and Fable 5 quickly reached the top of the Arena leaderboard in the text, agents, and web dev code categories. However, the model also had a purposeful limitation: On questions related to AI and machine learning training specifically, it was designed to underperform (without signaling this to users), apparently to prevent competitors from using it to improve their own models. Intentional capability suppression in a commercial model, without disclosure, is a different kind of product decision than a safety guardrail. Whether that approach becomes more common as competitive stakes rise is an open question.

Tokens burn fast when the loop isn’t ready for them

Last week, SpaceX went public in the largest IPO in history. The company finalized its acquisition of Cursor in a $60 billion all-stock deal shortly after. (That last one happened after this episode aired—we’ll talk more about it on Monday.) Both OpenAI and Anthropic have filed to go public as well, and Google raised roughly $160 billion through equity and a 100-year bond. A significant share of that capital is flowing toward AI coding infrastructure.

YK brought up another, less celebratory, financial story that’s been making the rounds: Uber burned through its full 2026 AI tools budget by April, mostly on Claude Code and Cursor, and Andrew Macdonald, the company’s COO, acknowledged they couldn’t link that spending to a measurable increase in useful customer features. Uber subsequently put a $1,500 per month per employee cap in place.

John flagged projects inefficiently utilizing agentic loops as one possible cause for wasteful token spend. Most developers deploying agents against existing codebases haven’t built the tooling those agents need to work efficiently, so agents burn tokens doing work that dead-ends, repeating context, or generating code that requires significant debugging. He explained:

If you take a legacy codebase and you throw agents against it with loops, you haven’t set up a proper agent environment. It’s so quick to burn tokens because. . .the agents don’t have the tools to work with.

The conversation in developer communities so far has focused almost entirely on what agents can generate. But as more organizations move from experimentation to production-scale deployment, building logging, verification, and proper error surfaces into agent tooling is what will determine whether token spend maps to real output. Otherwise, we’ll likely see more companies go the way of Uber.

Ingredients beat inference: A practical framework for building in the clone wave

For most developer workflows today, buy-versus-build leans toward building in a way it didn’t even a year or two ago. As John noted, “It’s so easy to build apps and workflows now where there are so many amazing production apps out there, apps on your phone, apps on your desktop, software as a service, that are trivial to copy and clone.” He uses the term the “clone wave” to describe this expanding set of open source equivalents to consumer software products that can now be cloned, forked, or replaced and get you 99% of the way to your use case.

The principle that drives the clone wave is “ingredients beat inference.” If you ask an agent to build a feature from scratch, it infers a solution with no external reference. If you give it an existing open source implementation to start from, it can adapt, translate, and integrate that code far faster and more reliably. The ingredients approach also helps with the 43% of AI-generated code that needs debugging in production, per a figure YK cited earlier in the episode.

The GitHub CLI plays a central role in this workflow. John explained that because agents understand the GitHub CLI natively, you can give an agent a search task and let it find implementations it wouldn’t have generated itself. Language mismatch isn’t a blocker, because agents translate between languages and libraries well. And tools like DeepWiki from Cognition let agents explore and understand a repo’s structure before cloning or forking it, so the evaluation step doesn’t require local setup.

The framework extends to how you build the last 20% that isn’t available as an ingredient. This is the part that’s specific to your use case; John described it as “that extra bit that you’re building on top of it to make it into the custom product and project for either yourself or for your users.” John’s bigger point is that the tools you build for yourself should also be usable by your agents. Expose endpoints and logging. Give agents the ability to read state and errors. An agent that can control a tool but not debug it will eventually stop in ways that are hard to diagnose.

John walked through cmux to demonstrate what an agent-native workspace looks like in practice. cmux is a terminal multiplexer built with agentic workflows in mind: it exposes a CLI that agents can control directly, so you can open a terminal pane, have that pane spawn another, and have the two read from and write to each other. In practice that means you can run Claude Code in one pane, Codex in another, and a third pane reading output from both, with each agent able to observe the others’ state.

Agents need more than the ability to run commands. They need to read logs, check errors, and confirm state before taking the next step. A workspace that exposes those surfaces gives agents a feedback loop. This tenet is applicable to tools across the company. Organizations that treat their internal tooling as agent-accessible infrastructure are building something that compounds. Those treating agents as black-box code generators are taking on technical debt they may not see until causes issues later on.

What’s next

SpaceX’s acquisition of Cursor turns the coding-agent race into something much larger than an IDE fight. Cursor may be positioning itself as a new GitHub for the agentic era, where agents write, review, test, repair, and govern code. At the same time, Salesforce’s $3.6B acquisition of Fin shows the same pattern inside enterprise software: Buyers want packaged workflows that solve real support, sales, and operations problems rather than abstract “agents.”

Next week, host Ksenia Se examines these stories and more through the lens of who owns the loop where AI does the work. Join us to find out why the next phase of AI will be about who controls the infrastructure, economics, and trust layer.

Our episodes are free and open to all through the end of June if you’d like to attend live—register here. And we’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, Apple, or wherever you get your podcasts.

Kubernetes in the Age of AI

Andy Kwan — Thu, 18 Jun 2026 14:21:16 +0000

When Kubernetes first came onto the scene, it was a major turning point, a revision of the infrastructure and operations space that transformed the way developers and ops personnel build, deploy, and maintain applications in the cloud. It has since become the clear standard for how modern applications are built and operated. As the CNCF noted in its latest Annual Cloud Native Survey report, “Among container users, 82% are using Kubernetes in production in 2025, up from 66% in 2023. This represents near-universal adoption within the container ecosystem.”

Over the last few years, another revision in the space has occurred with Kubernetes’s evolution from a container orchestrator to an AI infrastructure platform. According to the CNCF survey, “The rise of Kubernetes as the de facto AI platform represents a fundamental shift in how organizations approach machine learning operations. . .[with Kubernetes] providing a unified orchestration layer that handles both traditional application workloads and compute-intensive AI tasks.” The emergence of seismic technologies like generative AI and agentic AI has only accelerated this transformation.

The intersection of AI with Kubernetes is undoubtedly one of the most impactful developments in the operations space. As Jonathan Johnson, software architect at Dijure, observes, “AI on K8s is very, very important, and there is not enough [resources] out there.” Raju Gandhi, senior technical architect at Edward Jones, echoes this assessment, noting that “operationalizing AI/ML on K8s is a big issue, [and it’s only] getting bigger. This is a topic that needs attention.” But what are some of the things that you should know about this trend to keep abreast and stay ahead in the game?

Generative AI

Anyone with access to a computer or a smartphone has likely used some iteration of generative AI, a stunning fact when you consider that GenAI was on the outer edges of mainstream discourse and consumption a scant five years ago. But at the end of 2022, the debut of ChatGPT marked the beginning of a technological revolution, one that would impact and reshape nearly every aspect of our working and personal lives. Unsurprisingly, there are now thousands of generative AI models, a proliferation that naturally has its own set of complexities. Selecting a model is simple, but if you’re an application developer or MLOps engineer, how do you go about operating that model in a production system? Not only do you have to be cognizant of factors like resilience, scalability, security, and operational costs, but there’s the fact that bringing a model from experimentation into production can be arduous if not done properly. That’s where Kubernetes comes into play.

As Roland Huß and Daniele Zonca, distinguished engineers at Red Hat, note, “GenAI/LLM models are resource intensive, requiring substantial computational power and large datasets. Given its scalability and extensibility, Kubernetes is uniquely suited to function as an efficient platform for AI and LLM model pretraining, fine-tuning, deployment, and prompt engineering.” They further elaborate that “this integration with Kubernetes not only simplifies the adoption of cutting-edge AI technologies but also ensures a seamless and efficient operational flow. Kubernetes, with its robust scalability and management capabilities, stands as an ideal platform for generative AI projects, aligning DevOps and MLOps practices in a cohesive ecosystem.”

This sentiment is already shared by a wide swath of the industry. According to the CNCF survey above, as of 2025, 66% of organizations run generative AI workloads on Kubernetes. These organizations include OpenAI, which uses Kubernetes for its AI/LLM application experimenting and testing; Tesla, which utilizes KServe to manage production-grade LLM inference; and Adobe, which uses Kubernetes to power its suite of generative creative models. Other companies taking this approach include Uber, Intuit, and Google. With more companies adopting this practice for their generative AI and LLMs operations, it’d be prudent for any organization to leverage Kubernetes for their own GenAI and LLM workflows.

Agentic AI

Nearly coinciding with the rise of GenAI has been the steady growth of agentic AI. Unlike GenAI, agentic AI goes beyond answering simple prompts and generating text in its ability to operate autonomously to perform complex, multistep actions, utilize tools, and make independent decisions. With its ability to support both traditional ML processes and GenAI and LLM operations, it should come as no surprise that Kubernetes has a role in the agentic AI ecosystem as well.

According to Ronald Petty, principal consultant at RX-M, “Kubernetes has been leveraged to host machine learning pipelines, including AI model training and inference. As inference options have become plentiful and affordable, on and off-premise, we have seen the rise of agents. Coupling cloud native technologies and popular protocols, we now see agents moving from ad hoc demos to complex fleets of agents on systems like Kubernetes.” So what are some examples of the integration between these two technologies?

One notable offering is Kagent, an OS programming framework that runs AI agents in Kubernetes and “helps engineers build powerful internal platforms by tackling cloud native tasks such as configuration, troubleshooting, complex deployment scenarios, observability pipelines and dashboards, and safely enabling network security.” Operating along similar lines is K8sGPT, an AI-powered tool that leverages intelligent insights and automated troubleshooting to analyze Kubernetes clusters for configuration problems and security issues, as well as generates solutions to problems discovered in analysis.

A more recent entry in the field is Sympozium, a Kubernetes-native coordination layer for multi-agent AI systems that “solves the same problem Kubernetes solved for containers, but for agents that need to share context, hand off tasks, and maintain shared situational awareness.” Another newer offering is Agent Sandbox, which allows you to run AI agents as isolated, stateful workloads with a native API on Kubernetes.

The fundamentals

While it’s important to be aware of the latest developments and trends affecting your domain, that shouldn’t come at the expense of foundational knowledge and skills. As basketball great Michael Jordan once said, “Get the fundamentals down and the level of everything you do will rise.” One of the most fundamental skills for working with Kubernetes is networking, and frustratingly enough, it’s one of the more difficult ones to master. As Cisco senior staff engineer Nico Vibert observes, “Platform engineers tend to be comfortable with Linux networking but less so with protocols like BGP and IPv6; network administrators know those protocols well but find Kubernetes abstractions unfamiliar. Both personas struggle to navigate the dozens of networking tools seemingly required to meet connectivity and security requirements.” Yet as organizations move mission-critical workloads, AI training pipelines, and regulated financial services onto Kubernetes, the engineers who can design, secure, and troubleshoot the network layer have become some of the most sought-after professionals in the industry.

In recognition of both the importance and difficult nature of the Kubernetes networking skill, the CNCF recently announced a new certification focused on the Kubernetes network engineer role. The certification is designed to validate hands-on networking expertise across all of the aforementioned layers, filling a gap that the Kubernetes community has long recognized.

For organizations that use Kubernetes to develop and deliver applications, leaders and decision-makers need to be aware that utilizing Kubernetes in conjunction with the latest AI tools is no longer a luxury but a necessary practice that will allow their companies to thrive. A similar onus should be placed on the basics. When hiring your next DevOps, network, or site reliability engineer, ensure that their ability to design, secure, and troubleshoot the Kubernetes network layer is second to none.

If you want to dive deeper, check out Roland Huß and Daniele Zonca’s Generative AI on Kubernetes, Jonathan Johnson’s GPU Kubernetes Homelab live course, Alex Corvin, Taneem Ibrahim, and Kyle Stratis’s Scalable Kubernetes Infrastructure for AI Platforms, Ashok Srirama and Sukirti Gupta’s Kubernetes for Generative AI Solutions, and Yogesh Raheja’s K8sGPT Essentials on-demand course. They’re all on O’Reilly. If you’re not a member, you can get started with a free trial.

The Case Against Building Your Own Agent Platform

Pete Johnson — Wed, 17 Jun 2026 13:53:16 +0000

You know the meeting. The board wants an AI agent strategy by end of quarter. Someone on the leadership team has read a McKinsey report. You’ve been voluntold to build the platform. The slide deck says “AI-native.” The acceptance criteria are vague. Somebody mentions LangGraph, and somebody else says, “We’ll just wrap it ourselves.”

You ask what “done” looks like. Nobody in the room can answer.

The cost of building this is almost always estimated before anyone has a clear picture of what “this” actually is. And that’s the problem I want to work through here, because the scope of the work being casually assigned to internal platform teams right now is genuinely larger than the people assigning it understand.

Build versus buy, flipped in a year

This particular pendulum has swung before. App servers in the late 1990s. Content management systems in the 2000s. Container orchestration in the 2010s. The pattern rhymes every time: When a category is new, the components look deceptively simple. Early adopters build their own. The market catches up. Within 18 months, building becomes the expensive path. Within 36 months, the teams that built internally are rewriting on top of the category winner that emerged while they weren’t looking.

What’s different about the current moment is the speed. Menlo Ventures’ 2025 State of Generative AI in the Enterprise report shows the build-versus-buy split inverted in a single year. In 2024, 47% of enterprise AI solutions were built internally. By late 2025, that number had collapsed to 24%. The market made the decision in 12 months, which is unusual.

I’ve lived through enough of these transitions to recognize the shape. What I want to do in this piece is explain why I think the scope of “agent platform” is systematically underestimated right now, and what platform engineers should be asking before they commit to building one.

Most “agent platforms” aren’t

A lot of the projects labeled “agent platform” right now are actually workflow systems with an LLM in the loop. That’s a meaningful distinction. As Anthropic pointed out in its “Building Effective Agents” guidance, workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents are systems where LLMs dynamically direct their own processes and tool usage.

Most of what enterprises are shipping today sits on the workflow side. That’s fine. Workflows have bounded requirements, tractable testing, and predictable failure modes. If your team is building a workflow system, you might reasonably build it yourselves.

The trap is that teams start building for workflows, then get asked to support agents, and discover the jump isn’t incremental. Agents need memory that survives across sessions. They need evaluation that handles nondeterminism. They need governance that tracks actions, not just outputs. They need orchestration that recovers from failure modes a workflow engine never sees.

Here’s the thesis I want to put on the table: The decision to build an agent platform almost always underestimates the long tail. Memory, governance, eval, and orchestration aren’t features you add to a workflow engine. They’re separate product bets, each with its own maturity curve, its own vendor landscape, and its own team of specialists who’ve been working on it full-time for 18 months while you’ve been doing something else.

Let me walk through them.

Memory

The assumption inside most build proposals is that memory is a database problem. You’ll pick a vector store, shove conversation history into it, and retrieve relevant chunks when the agent needs context. Done.

Production memory is three separate systems: episodic, semantic, and procedural, each with different retention and retrieval policies. It’s temporal reasoning that tracks when facts were valid, not just what they were. It’s deduplication, multitenant isolation, and explicit source-of-truth governance.

The signal that this is a separate product category, not a feature: Mem0 raised $24 million across seed and Series A. Letta (formerly MemGPT) raised $10M from Felicis. Zep exists as an independent company with a temporal knowledge graph engine. Mem0’s State of AI Agent Memory 2026 report maps 21 frameworks across three hosting models with measurable benchmark gaps between them. On LongMemEval, Zep scores 15 points higher than Mem0 on temporal queries, which tells you these aren’t interchangeable tools that happen to serve the same market.

This is the component that platform teams underestimate hardest. Memory sounds like a database problem. It isn’t.

Governance

The assumption is that governance is RBAC plus audit logging. Your agents are services. Services get role-based access controls. You log the tool calls. Compliance is happy.

Agent governance is something different. It spans action authorization, not just data authorization. It requires decision-chain auditability, where you can reconstruct why the agent did what it did, not just what it did. It needs behavioral drift detection, tiered autonomy, and compliance mapped to agent actions rather than data accesses.

Grant Thornton’s 2026 AI Impact Survey of 950 business executives found that 78% lack strong confidence they could pass an independent AI governance audit within 90 days. Meanwhile, enterprises are moving to increase agent autonomy faster than their governance frameworks can keep up. Traditional AI governance wasn’t designed for action-level authorization, which is where most agent-specific risk accumulates.

And there’s a hard deadline attached to this. The EU AI Act becomes fully enforceable for high-risk systems in August 2026. Credit scoring, hiring decisions, healthcare support, and critical infrastructure all fall in scope. If your internal platform doesn’t handle conformity assessments, human oversight mechanisms, complete audit trails, and ongoing monitoring, that’s not a v2 feature. That’s a legal exposure.

OWASP now documents “excessive agency” as a top vulnerability class for LLM applications. Cornell researchers have demonstrated indirect prompt injection attacks that manipulate agents through content they ingest. These are agent-specific attack surfaces, and traditional security tooling doesn’t see them.

RBAC was designed for humans with predictable intent. Agents don’t have predictable intent.

Eval

The assumption is that evaluation means writing test cases and measuring accuracy. You built software before. You know how to test things.

Agent evaluation is qualitatively different from traditional software testing or even LLM evaluation, McKinsey’s QuantumBlack team noted: For LLMs, you evaluate the response to a prompt. For a single agent, you evaluate the full trajectory, including tool calls, state transitions, and intermediate decisions. For multi-agent systems, you evaluate system dynamics, including coordination patterns and collective invariants.

This matters because agent behavior is nondeterministic by design. The same input produces different valid execution paths. “Did the agent succeed?” is no longer a yes-or-no question, because the agent might reach the right answer through a trajectory you didn’t anticipate, or reach the wrong answer through a trajectory that looks reasonable until the last step.

The tooling ecosystem reflects this. Google Vertex AI has standardized trajectory_exact_match, trajectory_precision, and trajectory_recall as production metrics. These didn’t exist 18 months ago. LangSmith, Braintrust, Arize, Galileo, Maxim, and others are building full evaluation platforms around trajectory-based analysis, LLM-as-judge scoring with statistical validation, and regression testing against production failures.

Here’s the signal that the category is real: LangChain’s 2026 State of AI Agents report found that 57% of organizations now have agents in production, and 32% cite quality as the top deployment barrier. Gartner projects that 60% of software engineering teams will adopt AI evaluation and observability platforms by 2028, up from 18% in 2025. When a category jumps from 18% to 60% adoption in three years, that’s not a “we can build this in a sprint” situation.

You can’t tell whether your evaluation is working without another evaluation. Judge drift, calibration against human experts, internal consistency across independent runs. . .your eval system needs its own eval system, which is exactly the kind of recursion that eats platform teams alive.

Orchestration

The orchestration layer hasn’t converged. LangGraph uses directed graphs with conditional edges. CrewAI uses role-based crews. OpenAI’s Agents SDK uses explicit handoffs. AutoGen uses conversational GroupChat. Google ADK uses hierarchical agent trees. Claude’s Agent SDK uses tool-use chains with subagents. Microsoft’s Agent Framework is its own thing. Each represents a different bet on state management, communication pattern, and coordination model. None of them are interchangeable. Migration between them isn’t a config change—it’s rewriting most of your agent logic.

Underneath them, the protocol layer is still being invented. The Model Context Protocol is becoming the standard for tool integration, and agent-to-agent (A2A) protocols are emerging for cross-framework coordination. Both are moving targets, and building on a moving protocol is a cost that internal platform teams rarely price in.

If you built your own orchestration layer in 2024, you’re rewriting it in 2026. The teams that picked a framework spent those two years shipping.

The honest case for building

I want to engage the strongest version of the build argument, because there are real reasons to build, and pretending otherwise makes this piece less useful than it should be.

Proprietary data genuinely is a durable competitive moat. Mastercard built a foundation model on its transaction network. Plaid built one on its financial institution coverage. As Morgan Stanley’s analysis from last year made clear, decades of verified historical data with consistent identifiers is both technically challenging and prohibitively expensive for outside players to recreate. If your organization has data like that, you should absolutely build on it.

Regulated industries have legitimate reasons to want control over the full stack. Off-the-shelf AI tools don’t always cleanly map to frameworks like HIPAA, GxP, 21 CFR Part 11, SOX, FFIEC, and PCI DSS, and the cost of a failed audit is measured in business units shut down, not in sprints.

Vendor lock-in at the AI layer is subtler and more dangerous than in traditional software. If your agentic workflows are built on a vendor’s proprietary orchestration layer, switching costs compound rapidly across memory, eval, and integrations simultaneously.

But here’s the distinction that matters: Those are arguments for building agents on top of platform components, not arguments for building the platform components themselves. You can own the data, the domain logic, the evaluation criteria, the governance policies, and the specific behaviors your business needs without owning the memory layer, the orchestration engine, or the trace collection infrastructure underneath them.

Build the things that are specific to your business. Buy the things that are specific to the technology category. That’s the heuristic.

Five questions before you commit

If you’re the platform engineer being pulled into this decision, here are the questions worth asking before anyone signs up for the scope.

Are you building an agent platform or a workflow system? They’re not the same scope, and conflating them is where most of the cost overruns originate. A workflow system is a reasonable thing to build. An agent platform is four product categories you haven’t staffed for.

Can you articulate what “done” looks like for each of the four components? Memory, governance, eval, orchestration. In under three sentences each. If you can’t, you don’t have requirements. You have a vibe. And vibes don’t ship.

What happens to your platform when you need to swap the underlying model? Menlo’s December 2025 data shows Anthropic went from 12% of enterprise LLM spend in 2023 to 40% in 2025, while OpenAI fell from 50% to 27%. Enterprises didn’t plan those switches. The capability gaps forced them. If your internal platform hardcoded assumptions about context windows, tool-calling formats, or reasoning styles from one vendor, swapping models isn’t an API key change. It’s simultaneous rewrites across memory, eval, and orchestration.

What happens when the techniques themselves change? Eighteen months ago the default pattern was RAG with flat vector retrieval. Now it’s just-in-time context strategies, agent-managed memory tiers, and trajectory-based evaluation. Anthropic’s own follow-up to “Building Effective Agents” explicitly acknowledges the field has moved since they wrote the original. If your platform baked in the 2024 patterns, the 2026 patterns are a refactor, not a config change. Vendor platforms absorb those shifts as releases. Internal platforms absorb them as sprints.

What happens when the platform team leaves? This is the tale as old as COBOL, custom ESBs in 2008, or hand-rolled container orchestration in 2015. A small team builds something clever, it works, they move on, and five years later you’re paying premium rates to contractors who can still read the code. Agent platforms are a particularly bad candidate for this pattern because the talent pool is both small and mobile. Here’s the uncomfortable version of the question: Who on your team, today, could rebuild the memory layer if the person who wrote it left tomorrow?

What this looks like in 2 years

Gartner’s prediction that over 40% of agentic AI projects will be canceled by 2027 isn’t really about the AI. It’s about projects that got scoped before anyone understood the shape of the work. Most of the canceled projects will be internal builds, because internal builds are where the scope estimation error accumulates. Deloitte’s data on two- to four-year AI ROI horizons is the warning shot. If your timeline to value is already long, every month you spend rebuilding a component that exists as a product is a month you don’t have.

The teams that built their platforms around OpenAI in 2023 weren’t wrong. They made a reasonable bet on the market leader at the time. But they spent 2025 porting to a landscape where Anthropic had tripled share and Google had gone from 7% to 21%. The teams that picked model-agnostic platforms spent 2025 shipping. The only durable bet in this space is the one that assumes the bet will change.

The best platform engineering decision you can make this quarter might be to not build the platform.

Sources

Primary sources

Menlo Ventures, 2025: The State of Generative AI in the Enterprise, December 2025,
https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/.
Anthropic, “Building Effective Agents,” December 2024,
https://www.anthropic.com/research/building-effective-agents.
Anthropic, “Effective Context Engineering for AI Agents,” 2025,
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents.
European Commission, AI Act Regulatory Framework (Regulation EU 2024/1689),
https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai.
Google Cloud, “Evaluate Gen AI Agents,” Vertex AI Documentation,
https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-agents.
McKinsey QuantumBlack, “Evaluations for the Agentic World,”
https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a.
LangChain, State of Agent Engineering 2026,
https://www.langchain.com/state-of-agent-engineering.
Gartner, “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027,” June 2025, https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027.
Grant Thornton, 2026 AI Impact Survey, April 2026,
https://www.grantthornton.com/services/advisory-services/artificial-intelligence/2026-ai-impact-survey.

Secondary Sources

Mem0, “Mem0 Raises $24M to Build the Memory Layer for AI,” October 2025,
https://mem0.ai/series-a.
Felicis, “Felicis’s Seed in Letta,” September 2024,
https://www.felicis.com/blog/letta.
Vectorize.io, “Mem0 vs Zep,” Benchmark Comparison,
https://vectorize.io/articles/mem0-vs-zep.
Rasmussen et al., “Zep: A Temporal Knowledge Graph Architecture for Agent Memory,” arXiv 2501.13956,
https://arxiv.org/abs/2501.13956.
OWASP, “LLM08:2025 Excessive Agency,” OWASP Top 10 for LLM Applications,
https://genai.owasp.org/llmrisk/llm08-excessive-agency/.
Greshake et al., “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” arXiv 2302.12173, February 2023,
https://arxiv.org/abs/2302.12173.
Model Context Protocol, Official Specification,
https://modelcontextprotocol.io.
PYMNTS, “FinTechs Race to Build Foundation Models on Proprietary Data,” 2026,
https://www.pymnts.com/artificial-intelligence-2/2026/fintechs-race-to-build-foundation-models-on-proprietary-data/.
Deloitte, “State of Generative AI in the Enterprise,” Quarterly Reports,
https://www.deloitte.com/us/en/insights/topics/digital-transformation/state-of-generative-ai-in-enterprise.html.

Linear Thinking, Nonlinear Costs

Nicole Koenigstein — Tue, 16 Jun 2026 11:02:01 +0000

Many AI agent systems become economically unsustainable long before they become technically impressive. Teams usually focus on model choice, prompt design, tool calling, and orchestration. Those things matter, but they are only part of the system setup. The deeper issue is that coding agents, such as Claude Code, Codex, and Jules, make agent workflows easier to generate. But when implementation is abstracted away, the underlying mechanics become harder to see. Bad engineering used to produce slow code. Now it produces expensive systems that also happen to be slow.

When we design agent systems, we still need to remember that the costs scale nonlinearly. A single user request rarely triggers a single model call. It expands into routing, retrieval, reasoning, reflection, guardrail checks, tool calls, and synthesis. Each step may repeat shared context, reload state, recompute a planner decision, or retry a failed path. What looks like an intelligent workflow can therefore behave like a recursive, stateful computation with overlapping subproblems. If that sounds like backtracking, dynamic programming, and memoization to you, you’re right.

We already know how to optimize systems like this. The problem is that coding agents make agent systems easier to generate, but not necessarily easier to optimize. Unless we recognize the underlying mechanics, we may never ask our coding agents to apply the optimization patterns that keep our systems viable.

Old problems wearing new clothes

When we use coding agents to generate agent architectures, it’s tempting to stop at “the trace looks reasonable.” The tool can generate routers, retrievers, planners, evaluators, guardrails, tool interfaces, and synthesis steps. It may also know about caching, pruning, memoization, and state modeling. But it won’t necessarily implement those patterns unless you ask for these optimization layers explicitly.

Even if you work with agent instructions, unless your SKILL.md, AGENTS.md, or project instructions include constraints around repeated context, memoization, cache invalidation, pruning, and cost per request, your resulting agent system may be functionally correct and economically wasteful at the same time. That’s the tricky part: The code can pass review, the unit tests can pass, and the architecture can look reasonable. The invoice is where the hidden computation finally shows up.

It’s easy to give too much agency to tools like Claude Code. When a coding agent reasons in language, calls tools, reflects, and produces fluent text or code, it can feel like a knowledgeable coworker. At the interface level, that impression is understandable. These tools help teams generate more code, move faster, and become more productive. Still, this doesn’t remove the need for engineering craft underneath. Someone still has to recognize repeated context, recomputed planner decisions, correlated retries, unpruned branches, and state that can’t be reused. The coding agent can implement the system, but the engineer still has to understand what kind of system should be implemented. This is where old computer science returns, not as theory but as the optimization layer our agent systems need in production.

The cost multiplier, repeated-work problems, and backtracking

The cost multiplier often shows up first as latency. The user doesn’t see the router, the retries, the reflection loop, or the tool calls. They only see that the agent is taking too long. From the outside, the system looks stuck or broken. From the inside, it may simply be repeating work.

This is one of the uncomfortable differences between traditional software and agent systems. In a conventional application, a failed operation often throws an error, times out, or leaves a trace that is easy to inspect. In an agent workflow, failure can look like effort to improve reliability. Take the weakest step in your agent workflow. If it succeeds 60% of the time, and you try to push it close to 99% reliability through retries, you need 5 retries:

1 − (1 − 0.60)⁵= 0.98976

This math assumes each retry is a roll of fair dice. LLMs aren’t dice. Whether you’re using greedy decoding or probabilistic sampling, the model is still drawing from the same underlying distribution shaped by your prompt. If the first “thought” is a hallucination or logic error, bumping the temperature won’t fix the underlying state. You aren’t buying independent trials; you’re just sampling different paths through the same flawed map and state.

This is where the old algorithmic framing matters. In a backtracking problem, you don’t keep walking down the same failed branch and call it progress. You return to the last valid state, mark the failed path, and use the failure as information for the next choice. The point isn’t just to try again. The point is to try again under a changed state.

Agent workflows need the same discipline. A retry shouldn’t mean “run it again and hope.” It should give the model structured feedback about why the previous attempt failed: which constraint failed, which tool result was invalid, which schema didn’t validate, which assumption was unsupported, or which branch added nothing. The next attempt should then change something meaningful: the prompt, the tool choice, the retrieved evidence, the validation constraint, or the planner state.

Memoization, pruning, and dynamic programming

Prompt caching is usually the first optimization. If every step repeats the same system prompt, tool definitions, schema constraints, examples, and policy rules, then caching the shared prefix is an obvious win. It reduces the cost of repeated context. But prompt caching only recognizes that text repeats. It doesn’t notice that decisions repeat.

In many agent systems, the expensive unit isn’t only text. It’s the repeated decision. If the same or equivalent state appears again, paying the model to rediscover the same action is unnecessary. That is what memoization does: It turns repeated computation into lookup. In classical algorithms, the repeated computation might be a recursive subproblem. In an agent system, it might be a planner decision over the same task, facts, tools, and constraints. The planner can be treated as a function over state:

$^πLLM(S_t) \rightarrow a_{t+1}$

where $S_t$ is the current state of the workflow and $a_{t+1}$ is the next action. Without memoization, this function is evaluated again and again through an LLM call. With memoization, the system first checks whether it has seen the same or equivalent state before. If you want a deeper walkthrough of how to use memoization, I cover it in AI Agents: The Definitive Guide.

But memoization only helps once the system knows which states are worth revisiting. Pruning handles the other side of the problem: branches that shouldn’t be explored further. However, don’t limit pruning to KV cache pruning or speculative decoding. Use it also when a tool repeatedly returns no new information. Your next LLM call shouldn’t be a slightly reworded version of the same query. If a reflection loop keeps producing stylistic changes without improving correctness, the loop should stop. If a search path violates a constraint or depends on an unsupported assumption, it should be marked as unproductive and removed from the active search space.

Dynamic programming becomes relevant when different branches of the workflow solve overlapping subproblems. A research agent may ask similar questions across several documents. A coding agent may inspect the same dependency chain from different entry points. A business analysis agent may compute the same metric for several report sections. If every branch solves these subproblems from scratch, the system pays repeatedly for work it has already done. Table 1 shows examples of how these patterns map to AI agent systems.

Table 1. Classical optimization patterns applied to AI agent systems

Optimization	The “old” CS way	The “agent” way
Memoization	Store results of expensive function calls.	Cache decisions. If the agent saw this state before, don’t ask it to reason again.
Pruning	Cut off search paths in a tree that won’t lead to a solution.	Kill a reflection loop when the critique stops yielding structural improvements.
Dynamic programming	Break problems into overlapping subproblems.	Share codebase analysis across multiple specialized agents instead of rereading files.

This isn’t nostalgia. These patterns mitigate the cost structure of agent systems. Memoization reduces repeated decisions. Pruning reduces repeated failure. Dynamic programming reduces repeated subproblem solving. Together, they form the optimization layer many agent architectures are missing in production.

Where to start: Optimization follows topology

The patterns above aren’t a checklist you apply uniformly. Each multi-agent topology, whether centralized, decentralized, independent, or hybrid, distributes communication and coordination differently, which directly affects overhead, latency, and failure propagation. The optimization layer has to follow.

Centralized
A single orchestrator decides, delegates, and aggregates. The expensive unit is the orchestrator’s decision, repeated across similar inputs. Memoize the planner first.

Decentralized
Agents coordinate peer-to-peer, exchanging messages without a central authority. The cost moves into the communication itself: redundant exchanges, restated context, agents reasoning over the same shared state from different angles. Prompt caching on the shared context is the first win, followed by pruning exchanges that no longer add information.

Independent/swarms
Lightweight agents fan out without coordinating. Cheap individually, expensive in aggregate. If three of your ten agents ask semantically equivalent questions, you pay three times for the same answer. Memoization and pruning aren’t optimizations here; they’re load-bearing.

Hybrid
The repeated work shows up at two scales: within a cluster (overlapping subproblems among peers) and across clusters (the coordinator rediscovering the same routing decision). Use dynamic programming on shared subproblems inside the cluster, memoization on the coordinator’s decisions across them.

The optimization layer isn’t a generic discipline you bolt on. It’s a function of the shape of the implementation. Coding agents made it easy to generate the shape without seeing it. The craft is in seeing it anyway.

Who Owns the Code Claude Wrote?

Sena Evren — Mon, 15 Jun 2026 10:58:47 +0000

The following article originally appeared on Sena Evren’s Legal Layer newsletter and is being reposted here with the author’s permission.

TL; DR

Agentic coding tools like Claude Code, Cursor, and Codex generate code that may be uncopyrightable, owned by your employer, or contaminated by open source licenses you cannot see. Some of this is settled law, some is actively contested, and this piece is clear about which is which. If you are shipping AI-assisted code and have not thought about any of this, this piece is for you.

If you shipped code this week, some of it was probably written by an AI. The question of who legally owns that code is less settled than most developers assume, and the answer depends on three things that have nothing to do with how good the code is:

Whether a human made enough creative decisions to establish copyright
Whether your employment contract already assigned it to your employer
Whether the model pulled from GPL-licensed training data and quietly contaminated your codebase

On March 31, 2026, Anthropic accidentally published 512,000 lines of Claude Code’s source code in a routine software update through a missing configuration file. Before sunrise, the codebase was mirrored across GitHub. Before breakfast, a developer had used an AI tool to rewrite the entire thing in Python, and the “claw-code” repository hit 100,000 GitHub stars in a single day, the fastest in history. Then came the DMCA takedowns, and then came the question nobody had a clean answer to:

If Claude Code was, by Anthropic’s own lead engineer’s admission, predominantly written by Claude itself, does Anthropic even own it? Can you issue a DMCA takedown for code that copyright law may not protect?

That incident compressed every open question about AI-generated code ownership into a single news cycle. The same questions apply to your codebase.

The copyright rule nobody told you

Here is the legal baseline, in plain terms: Copyright only protects work created by a human.

The US Copyright Office has confirmed this consistently, and the DC Circuit upheld it in the Thaler case. When the Supreme Court declined to hear the Thaler appeal in March 2026, it did not endorse the lower court’s reasoning or settle the question nationally. Cert denial means the court chose not to hear the case, nothing more. What it does mean is that the DC Circuit’s ruling stands, the Copyright Office’s position is intact, and no court has yet gone the other way. Works predominantly generated by AI without meaningful human authorship are not eligible for copyright protection under current doctrine, and that position is stable even if it is not finally settled.

Two important limits on what Thaler actually decided.

The case involved a painting created with zero human involvement at all. Thaler listed the AI system as sole author and made no claim of any human creative contribution. The ruling does not directly address the harder question of AI-assisted work where a human was involved but the degree of that involvement is disputed.
Thaler involved visual art. No court has yet applied the human authorship doctrine specifically to code output from an AI coding tool. The logic applies, but the direct precedent does not exist yet.

What it means for you: Code that Claude Code or Cursor generated and you accepted without meaningful modification may not be copyrightable by anyone. If a competitor copies it, you may have no legal recourse, because the code sits in the public domain in everything but name.

The phrase that determines whether your code is protected is “meaningful human authorship,” and the Copyright Office has deliberately refused to quantify it with a percentage or a number of edits, because what courts look for is evidence that a human made genuine creative decisions:

Choosing the architecture
Deciding what to reject
Restructuring the output to fit a specific design

Specifying an objective to the model is not enough. Directing how the work is constructed is what counts.

In an agentic workflow, this distinction is harder to establish than it sounds. Consider a typical Claude Code session:

You write a one-line prompt: “build a rate limiting module for the API.”
Claude Code plans the approach, generates five files, and iterates through three versions.
You review the output, run the tests, and merge.

Your contribution in that sequence is your architectural intent and your final approval. Whether that constitutes meaningful human authorship in a courtroom is an unresolved question with no definitive court ruling yet.

The honest answer is: probably yes for modules you substantially redirected, probably no for code you accepted verbatim, and unclear for everything in between.

The middle ground is actively being litigated right now. In Allen v. Perlmutter, artist Jason Allen is challenging the Copyright Office’s denial of registration for a work he created using more than 600 detailed prompts and subsequent editing in Photoshop. The Copyright Office acknowledged the Photoshop edits as human-authored but still denied registration for the AI-generated underlying elements. That case has not been decided yet, and whatever it decides will be the closest thing to a ruling on how much human involvement is enough.

The closest existing precedent on partial protection is Zarya of the Dawn, a graphic novel where the Copyright Office granted registration for the human-authored text but denied it for the Midjourney-generated images. That decision establishes a practical principle developers can use right now: The human-authored elements of an AI-assisted codebase may be separately protectable even if the generated code itself is not. Your architecture documents, your design decisions recorded in commit messages, your ADRs, your prompt logs showing deliberate redirection, these may be protectable as human-authored expression even if the code they produced is not. Protecting what you can starts with documenting what you actually did.

What your employer probably already owns

Before you think about whether your code is copyrightable, there is a more immediate question: Even if it is, is it actually yours?

Your employment contract almost certainly says that anything you build at work belongs to your employer. That principle has a name in copyright law: the work-for-hire doctrine. Under it, any code created by an employee within the scope of their employment is owned by the employer, who is treated as the legal author, regardless of whether the code was written by hand, generated by Claude Code, or some combination. Using an AI coding tool during work hours, on a work project, on a work machine, does not change who owns the result.

Most employment contracts go further than the doctrine’s defaults. Look for a section in yours called “Intellectual Property,” “IP Assignment,” or “Work Product.” Open the contract, search for those terms, and read that section. A clause that says any of the following almost certainly covers your AI-assisted code:

“Any work product created using company equipment or resources”
“Any invention or development made during the term of employment”
“Any software created with the assistance of company-licensed tools”

The third one is the one to watch. If your employer licenses Claude Code, Cursor, or Copilot for the team, and you use those same tools to build a side project, a broad IP assignment clause may give the employer a claim over that project, even if you built it on your own time.

A senior developer in San Francisco described exactly this situation earlier this year. He had used Claude Code for work projects and for a personal fitness tracking app built on evenings and weekends. His company updated its IP policy and claimed everything he had built with AI assistance, including the personal app, arguing that because Claude had access to open work files in the IDE, any AI output was a derivative work of company IP.

This is the clearest example of how far this can stretch. His company’s claim rested on one phrase: The AI tools were “context-aware” of his company’s codebase. The argument does not hold up legally, because context visibility in an IDE does not make AI output a derivative work of files that were open nearby, and the connection between what Claude can see and what it generates is probabilistic pattern completion, not copying. But the argument illustrates what employers are starting to claim. If the clause is broad enough, it has surface validity regardless of what the AI actually did.

The practical rule: If you are building something on the side, use a personal account, a personal machine, and tools you pay for yourself. Keep your employer’s licensed tools out of that workflow entirely.

The open source contamination problem

Even if you own your AI-generated code, you may have already contaminated it with an open source license you cannot see.

AI coding tools are trained on massive amounts of public code, including code licensed under the GPL, LGPL, and other copyleft licenses. Copyleft licenses carry a specific obligation that travels with the code:

If you distribute software that is a derivative of GPL-licensed code, you must release your own source code under the same license.
This applies even if you did not know the code you incorporated was GPL-licensed.
“I did not know” is not a defense to a copyleft violation.

When an AI tool reproduces a substantial verbatim portion of GPL-licensed code from its training data, and you ship that code in a commercial product without releasing source, you may have created a copyleft violation without ever touching the original repository. The legal standard for infringement is substantial verbatim reproduction, not functional similarity or resemblance, and this distinction matters: an AI tool generating code that works like GPL code is different from an AI tool that reproduces GPL code word for word. The risk sits at the verbatim end of that spectrum, and the problem is that you have no way to know which side of the line your codebase is on without running a scan.

The chardet community dispute made this concrete in early 2026. This was not a filed lawsuit but a public dispute within the open source community that raised the question without resolving it legally. A developer used Claude to rewrite chardet, a Python character encoding library, and rereleased it under an MIT license, arguing that the AI rewrite was a “clean room” implementation free of the original LGPL license.

The legal question the community fought over: If Claude was trained on the LGPL-licensed codebase and its output reproduces substantial verbatim portions of that code, can the output be treated as license-free? The chardet dispute did not resolve cleanly and no court has issued a definitive ruling on this specific question. What is settled is that verbatim copying of GPL code violates the license regardless of how it was produced. What is unsettled is whether AI-generated output that reproduces training data patterns counts as verbatim copying. The working assumption among lawyers advising companies through M&A is that it probably does, and that assumption is now showing up as a standard condition in acquisition due diligence.

The Doe v GitHub litigation, still working through the Ninth Circuit as of April 2026, is asking whether GitHub Copilot reproduces licensed code without attribution in violation of copyright law and DMCA Section 1202. The district court dismissed most claims but the appeal is live. Whatever the outcome, the litigation has already changed industry behavior: GitHub Copilot added duplicate detection filters, and acquisition due diligence now routinely includes an AI codebase license scan.

What to do about all of this

Four concrete actions, none of which require a lawyer.

1. Run a license scan on your AI-assisted codebase

Tools that do this well:

FOSSA—most comprehensive, widely used in enterprise
Snyk Open Source—good for dev-team workflows, integrates with GitHub
Black Duck—standard in M&A due diligence

Each will scan your codebase, flag code that matches known open source libraries, and identify the licenses attached. If you are shipping a commercial product and have never run one of these, you are operating on assumption. The scan takes an afternoon and costs less than the first hour of a copyright dispute.

2. Document your human creative contributions as you go

The evidence that establishes meaningful human authorship is the same evidence you already produce in a normal engineering workflow. You just have to keep it deliberately rather than letting it disappear.

What to preserve:

Commit messages that describe what you changed and why, not just what the AI generated. “Restructured Claude’s module architecture, rejected initial state management approach, rewrote error handling from scratch” is evidence. “Add rate limiting module” is not.
Prompt logs. Claude Code and Cursor both retain interaction history. Export or screenshot the sessions where you made significant architectural decisions.
Design documents, ADRs, or any notes that predate the generated code and show you specified the structure before the AI built it.

The second commit message versus the first is the difference between a defensible authorship claim and a clean “Claude wrote this” record.

3. Read the IP clause in your employment contract before you build anything on the side

Open your contract, search for “intellectual property,” “IP assignment,” or “work product,” and read that section carefully. The specific language determines your exposure:

“Work product created during employment hours” is narrower than “work product created using company resources.”
“Relating to the company’s business” is narrower than “any software development.”
“Company-licensed tools” is the phrase that captures AI coding tools even on personal projects.

If the clause is broad and you want to build something independently, you have three realistic options: negotiate a written carveout before you start (easier at the start of a new role than mid-employment), use entirely personal tools on entirely personal time on a personal machine, or accept that the claim exists and decide whether the risk is worth it.

4. Check which Anthropic plan you are on before shipping for commercial use

Go to anthropic.com/legal and compare the consumer terms against the commercial terms. The difference that matters:

Consumer terms (free and Pro plans): Anthropic assigns outputs to you, but the IP indemnification is narrower and covers fewer scenarios.
Commercial terms (API and enterprise): Anthropic assigns outputs to you and will defend you against copyright infringement claims arising from your authorized use of the service and its outputs.

If you are shipping AI-assisted code in a commercial product using the free or Pro plan, the indemnification gap is real. The API or enterprise agreement is the appropriate tier. Note that neither indemnification covers a downstream GPL violation from license contamination in your codebase. That is your governance problem to solve with the license scan in action 1.

The thing worth sitting with

Anthropic’s own lead engineer publicly stated that his recent contributions to Claude Code were written entirely by the AI, and the leaked codebase that Anthropic issued 8,000 DMCA takedowns to suppress may be predominantly AI-authored. Whether Anthropic’s copyright claims over that codebase are legally valid remains an open question no court has yet resolved.

If the company that built the tool cannot cleanly assert copyright over its own AI-assisted code, the question of whether you can is worth taking seriously before it becomes relevant in a transaction, a dispute, or an acquisition conversation. The developer who documents their creative contributions from the start is in a meaningfully different legal position than the one who accepted three thousand lines of Claude output and merged without review, even if both shipped the same product.

A note on what this piece covers and what it does not

Three things in it are settled law:

Works lacking human authorship are uncopyrightable,
The work-for-hire doctrine applies regardless of how code was generated.
Verbatim copying of GPL-licensed code violates the license.

Two things are emerging consensus without definitive court rulings yet:

How much human direction is enough to establish meaningful authorship in an agentic workflow
Whether AI output that reproduces training data patterns counts as verbatim copying

One thing is genuine speculation:

Whether any of this will be litigated at scale in the near term

Most code copyright claims never reach court. The place where the unsettled questions become concrete today is M&A due diligence and institutional fundraising, where acquirers and investors are already asking these questions as a condition of closing.

If neither of those applies to your situation right now, the four actions above are still worth doing, but the urgency is lower than the piece might imply.

This Week in AI: The Next-Gen Recommendation Experience

Michelle Smith — Fri, 12 Jun 2026 14:18:19 +0000

This week Miguel Fierro, a former Microsoft principal researcher who recently founded his own company, RecoMind, joined data and AI evangelist Christina Stathopoulos to talk about the state of recommendation systems. Christina also ran through the latest AI news she’s been watching, from Anthropic’s continued rise to responsible AI, announcements from Google’s I/O 2026 conference, and (continuing the discussion from last week) the growing backlash against tokenmaxxing as a productivity metric. Here are three takeaways from the conversation.

Recommendation systems are a bigger deal than most companies realize

Miguel has spent the better part of a decade building recommendation systems for enterprise customers at Microsoft, and he thinks most companies are leaving a lot on the table by not paying closer attention to recommendations. Amazon generates roughly 35% of its revenue through recommendations. Netflix attributes 75% of content consumption to them. Best Buy credits recommendations with 24% of revenue. TikTok’s entire user experience is a recommendation engine. And yet many large retailers he worked with at Microsoft weren’t investing seriously in the area, often because they weren’t tracking the value it was generating.

The gap between the top tier and everyone else is wide and getting wider. The most advanced systems today treat user behavior as a sequence prediction problem, similar to how large language models predict the next token. Rather than just encoding clicks, they encode all user actions into embeddings, run sequences through those representations, and use huge 1.5 trillion-parameter models to predict what a user will want next. That’s not something a mid-tier retailer can replicate today, but it signals where the field is heading.

Even if you don’t work in a top well-resourced company, you should still pay attention to the convergence of search and recommendations into a single personalized retrieval layer and the early application of foundation models to recommendation problems. Netflix has built what Miquel described as the only published foundation model in this space; Meta is rumored to be developing one as well. The barrier is data, particularly for smaller organizations. Unlike text, behavioral interaction data isn’t publicly available, so building at that scale requires both proprietary datasets and serious compute.

If you want to get your hands on state-of-the-art implementations, including knowledge graph-based approaches, without starting from scratch, Miguel suggested the open source Recommenders library, originally developed at Microsoft and now housed under the Linux Foundation, as a practical entry point.

The agent hype has a recommender-shaped hole in it

Miguel drew a distinction between true sales agents and what most companies offer today, which are usually just conversational agents. A conversational agent responds to what you say. An agentic sales system understands a customer, anticipates what they want, and surfaces the right product or offer at the right moment—and that requires a recommendation system baked in.

If your “agent” is a chatbot with access to a knowledge base, it’s not doing recommendation. Recommendation systems need training data, a retrieval layer, and a personalization model, none of which you get for free from a foundation model API. A language model can answer questions about a product catalog, but it can’t offer up personalized recommendations unless it also has a model of the customer’s preferences, history, and likely next action. Most companies don’t have the infrastructure in place to make that possible. . .yet.

The responsible AI conversation has left the research community

What’s notable about the responsible AI conversation right now is the range of institutions offering their perspective. Anthropic, alongside announcing a funding round pushing its valuation toward $1 trillion, urged a global pause on AI development tied to the risk of recursive self-improvement: systems that can design and develop their own successors. The Future of Life Institute published The Better Path for AI, a framework arguing for capability development oriented toward human benefit rather than human replacement. And the pope issued a formal encyclical focused on AI and the common good.

None of these institutions is making the same argument, but the convergence of their attention matters. Responsible AI used to be a specialized conversation happening largely within research labs and a small set of policy organizations. It’s now a topic where major AI companies, religious institutions, and civil society groups are all staking out public positions in the same news cycle.

For the technical community, this creates both pressure and opportunity. “We’re thinking about safety” is no longer a sufficient posture; external scrutiny is intensifying from directions that don’t share the field’s assumptions or vocabulary. But the broader conversation creates real demand for practitioners who can translate between what responsible AI actually requires in practice and what policymakers, executives, and institutions are trying to figure out. That translation work is increasingly where the field needs people.

What’s next

Join us Monday morning for the next episode of This Week in AI, where YK Sugi and John Lindquist will break down the massive structural and financial shifts reshaping the technology industry. (They’ll also chat about the recent release of Claude Fable 5.) And on July 23, Christina will be hosting the AI Superstream on AI harnesses, a four-hour event focused on agentic AI and the frameworks practitioners need to move from models to agents. Both are free to attend. Register now to save your seat.

For deeper reading on topics covered this week, Christina recommended three titles available on the O’Reilly learning platform: Hands-On LLM Serving and Optimization, Hands-On RAG for Production, and Large Language Models: The Hard Parts. Not a member? Sign up for a free 10-day trial to check them out.

We’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, Apple, or wherever you get your podcasts.

Generative AI in the Real World: Agentic Systems Fundamentals with Maarten Grootendorst

Ben Lorica and Maarten Grootendorst — Thu, 11 Jun 2026 17:58:23 +0000

BERTopic creator and Google DeepMind developer relations engineer Maarten Grootendorst has spent years helping practitioners build intuition for how AI systems actually work—not just how to prompt them. Maarten joined Ben Lorica to cover the enduring relevance of embeddings and topic models in an LLM-dominated world, his hot take that agents are essentially just an “LLM in a for loop with some tools, some memory, and perhaps some guardrails,” and what separates genuine agentic behavior from a well-constructed pipeline. They also get into the practical trade-offs between open weight and proprietary models, the future of state space models and attention, and why Maarten worries that a generation of builders shipping code they can’t read may be storing up technical debt they can’t repay. “If you don’t really know how an LLM works,” he says, “that intuition [about how to use it effectively] is much more difficult to develop.”

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform or follow us on YouTube, Spotify, Apple, or wherever you get your podcasts.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

0.50
All right. So today we have Maarten Grootendorst. He is a developer relations engineer at Google DeepMind, and he is also the coauthor of two O’Reilly books, Hands-On Large Language Models and An Illustrated Guide to AI. And so, Maarten, welcome to the podcast.

01.10
Thank you. It’s wonderful to be here.

01.12
So, I had you on the podcast—I was looking at it earlier this morning—August 2022, a few months before ChatGPT was released.

01.23
It’s been a while. [laughs]

01.25
Yeah. Back then, what I wanted to talk to you about was, I was a user of your BERTopic library. For listeners who are not familiar, BERTopic was kind of a marriage between the transformer approach with topic modeling and Maarten wrote one of the more popular libraries for doing that. Actually, what’s happened to this whole topic of topic models?

01.58
Oh, yeah. I think it’s still going strong. You mentioned ChatGPT. So a lot of people say, “OK, just use that for topic modeling.” You can. It’s just very difficult to make sure you get a more structured, standardized output rerun thing, especially if [you have] millions of potential documents. And you can still use that on top of that. It’s still my baby of sorts, right? I mean, it’s been four years since we talked, and. . . I love working on that. I don’t have that much time to do it anymore, but it’s great.

02.36
Yeah. So I think one of the things that these large language models have done is kind of, I guess, cast by the wayside some of these earlier approaches for really wading through a lot of text. Unfortunately, I think people, as you mentioned, are trying to prompt their way into a topic model. But I think topic models themselves are still very useful. So one question to you, Maarten. What’s the level of usage of BERTopic now compared to when we talked?

03.13
It’s only grown since then.

03.17
Really?

03.18
Yeah. It surprised me too. [laughs] I think it’s because it’s easy to use. I did some, I think, cool tricks in there, but other than that, I think the main benefit was mostly just a nice user experience. And that helps people use something for a very specific task instead of trying to prompt your way towards something that might or might not work, and you still have to iterate over that. It just works out of the box. It’s not perfect. Nothing is. It’s not a free lunch. But yeah, I think that’s it.

03.55
One thing that’s happened, of course, is that this whole area of AI and NLP has gotten so democratized that. . . When we talked, I think the people who were using BERTopic at least had some notion of what NLP was and what text mining was, right? I would imagine now, in your role as a developer relations person, you encounter a lot of people who don’t come from a data science or ML background. And so they have no clue what topic models are, I would imagine.

04.34
Yeah, many don’t. It’s very interesting to see because you mentioned NLP and text mining and, well, [they’re] completely outdated terms now for some reason. It’s all AI. Let’s just call it AI and be done with it. [laughs] That’s not necessarily a bad thing, don’t get me wrong. It’s just very interesting to see how the field has evolved, but that also means that people don’t really look towards these “older techniques” that still drive much of the adoption of newer stuff.

Sometimes it feels like that, you know, AI and LLMs. . . It’s a hammer and we’re looking for nails to actually use it instead of, “OK, but we have packages for very specific things, and you can use LLMs on top of that.” You don’t have to. But it requires a bit of education on that end, because like you mentioned, a lot of people new to the field, you have to explain, “What are embeddings? What is clustering?” It’s also very interesting to see that even something like that needs to be explained a little bit in more detail. It’s a nice opportunity for me to explain stuff. I like doing that.

05.48
And the key here is that because a lot of people are entering this field and building things and they don’t necessarily know the prior art, so to speak, it seems like they might be leaving a lot of things on the table. Right? So in terms of, here’s my text or my data, I am just going to prompt and I think that I got everything out of it, but that’s not really the case for the most part.

06.24
No. Definitely not. There’s so many things that you can do with these systems, whether it’s on the LLM side or the agentic side or the topic modeling side. If you just know a little bit more on what’s going on under the hood then that helps you understand “When do I prompt? When do I not prompt? What’s going wrong?” That feeling, that intuition. You don’t just get it with building. Building’s very important, but if you don’t really know how an LLM works, that intuition is much more difficult to develop.

06.59
Which brings me to your two books, which are fantastic, which I think go a long way into helping people get that foundation. But let’s face it, a lot of people, Maarten. . . So let’s take your earlier book with Jay [Alammar], which is Hands-On Large Language Models. A lot of people may say, “I don’t have time to read this whole book.” So for someone who is a developer, doesn’t have a data science or ML background, what would be the most important concepts for large language models? Drill down on these three or four concepts that will set you up for success.

07.49
From the top of my head, those are chapters two and three. So buy the book now. [laughs] I’m just kidding. Tokens. Super underappreciated.

08.03
Which now is a big topic because, as I joke, the CFO has now become the CTO, the chief token officer.

08.11
I didn’t know that one. That’s amazing. I’m gonna use it. But, yeah, tokens are now the thing, right? It’s what LLMs use to see the world, so to say—to interpret the world. And it’s how they communicate with the world. So it’s really important to know what tokens are. It helps you get into the realm of embeddings, which I still think is super fundamental to so many things we do.

And the second part is kind of an obvious one, but the attention mechanism, “Oh, wow. Why are these things so strong? What makes them so special?” Attention is an obvious one. We have other things like Mamba, recurrent neural networks, but it all starts from attention. So if you’re completely new to this field, those two. Yeah.

08.58
Let’s take the topic of embeddings. I think at least that topic, Maarten, some people have had to play around with it, right? Because when LLMs first came online, the “Hello, World!” example was RAG, and one of the knobs that people were tuning was embedding, obviously chunking, so the information extraction, the search and retrieval—they’re all important. But one thing that people immediately tried to play around with was embeddings because they could go to places like Hugging Face:
Hey, let me try these four different embeddings.” Do you find that embeddings have a special place in that more people play around with embeddings and have some rudimentary understanding of embeddings?

09.50
I have a sweet spot for embeddings because it’s the main part of BERTopic. But I think it’s so fundamental to so many things that we do in this field. Even things like RAG—which some people think is outdated. It actually isn’t. It’s very much alive and still kicking—runs on embeddings and understanding how they work will also help you understand how LLMs work. And it can be used in so many different ways.

Sometimes we’re looking for bigger embedding models, more contextualized information. Great. [They] have their own purposes. And there are now certain parties focusing a little bit more on these static embeddings that are super fast and quick, like the old school embeddings that we used to have, and now in a new form that can be used in conjunction with coding agents to quickly search through repos and find the information that they’re looking for. Much of what we do is still search, and search revolves in big part on embeddings. And it’s just nice when you have text that you have one numerical representation for it—just that gives you so many opportunities to do so many cool things. . .

11.18
So when you’re trying to convince someone, Maarten, that “Hey, you should learn more about embeddings, because they’re important,” is there a canonical example that you use to say, “Hey, look, if you just understood embeddings and you made this one decision, look at the change in your application.” Is there a canonical example that you go to?

11.40
Oh, yeah, I love the question, but I don’t think I have an answer to that. Because, OK, so I’m a psychologist and I really like to say “it depends on,” and here it kind of depends on the application that you’re running, obviously. Contextualized versus noncontextualized embeddings is a very interesting example because the contextualized ones are generally larger. But there’s larger transformer-like models that require a lot of compute to run. So you can see the latency actually appearing in your search engines. Or if you connect your coding agent to one of those, it slows down because, you know, it needs to wait for the search compared to the faster static ones, for instance, like Model2Vec and stuff like that, which are tremendously fast. So amazing for those use cases, not that performance because they’re way smaller, obviously. And it’s these use cases where the building does get you a lot of intuition about when to use what instead of relaying that decision only to an agent. You’re still the one that needs to have the feeling, that gut feeling, to say this works better for my use case.

13.03
But I would say the reality is that people will go to some leaderboard.

13.09
Yeah. That’s just the way it is.

13.13
So there we go. OK. So in this leaderboard here are the top 10. In this top 10, there’s some that look larger than the others. So I’ll try three or four of varying sizes. Is that a fair characterization of what normally happens?

13.32
Yeah that’s even what I always did. Just you know, top of the leaderboard, pick one or two. But then as you are more experienced with picking one, what about multilinguality? I’m Dutch. There aren’t that many very good Dutch embedding models—big problem there. There are things like matryoshka embeddings, where they’re embedding one embedding model, but they generate embeddings of different sizes for different purposes, which is also very interesting. So there’s all these types of small decisions and nuances that you can make. And we now have instruction-tuned embeddings, where you prefix it with an instruction that you want an embedding for clustering or for classification or for what have you. And then you suddenly see the nuances in selecting something.

14.27
So on the attention mechanism, again, I will play the role of someone who has no time. I don’t have time to read the chapter, Maarten. What are one to three things I should know about the attention mechanism?

14.44
I think the most important thing about the attention mechanism is it contextualizes information. That’s by far the most important thing. When you look at the world before attention and after, it’s a little bit less black-and-white, obviously, but it puts stuff into context. You know, if you have the word “bank,” is it the bank of a river or a financial bank? And as we talk now with each other, there’s a lot of contextual stuff going on. You need to interpret what I’m saying, because if you only focus on what I say, you don’t know that that was actually a question beforehand that drives my answer. And I think that’s what makes attention so special. It tries to look at the entire thing instead of individual tokens or words.

15.34
Playing devil’s advocate, so you just explained it to me. Why do I have to learn more than that? [laughs]

15.40
Always learn more. [laughs]

15.44
Yeah, yeah, yeah. So you mentioned Mamba and the state space models. There was some excitement around them. So maybe give our listeners a high-level description of what these state space models are and what their current status is in the wild in terms of actual practical usage.

16.08
State space models are a completely different way of approaching this attention mechanism, right? It almost does away with it and replaces it with something that is much, much faster. It’s a very complex and highly technical subject, so I don’t want to go too into that because it’s really confusing. [laughs]

So what you see happening is that people replace attention mechanisms. So you have a decoder and LLM, and it has several stacks of attention mechanism normally. What you can do is you can remove half of them with the very quick state space models that help speed up the inference—because that’s what we’re mostly bound now by, is inference speeds. People want more, more tokens. So it needs to be faster. So it’s, it’s a way to make it quicker.

17.13
Yeah. And so what is the actual implementation or adoption of state space models right now?

17.21
Mostly hybrid models. Models, stats, interleave the attention blocks, the decoder blocks with Mamba blocks as a way to make it faster, where some do it with, for example, local attention and global attention—one is more compute-intensive than others. Mamba is a way to do something similar, as a way to speed up that inference.

17.51
Your latest book is about agents: An Illustrated Guide to AI Agents. Before we dive in, in your mind, what makes a system truly agentic? In other words, before we started bandying around the word “agents,” people were using the term “robotic process automation” or something like that. So in your mind, what makes a system agentic?

18.22
That’s actually been one of the more complex topics for us to actually describe, because the field has been changing so quickly. And what is fundamentally an agent when they change it every two months? It’s a little bit of a hot take, but I really do think that an agent is an LLM in a for loop with some tools, some memory, and perhaps some guardrails. And that really is essentially all it boils down to at its base.

18.55
You just described the harness basically. The hot term right now is harness engineering. So what is the real progress and what is just marketing when it comes to agents?

19.19
Yeah, I agree very much with what you imply here because agents sound so cool, and they are cool, but the moment you give an LLM complete freedom, no constraints, just go off and do your stuff, it will fail horribly, horribly, horribly. Agents still need. . . And we can call them guardrails, but you can call them something else. They need direction. They need to be constrained a little bit in the things that they do. So yes, agents, there’s a lot of hype around that. I’m not a big fan of hype. It is what it is. But there are a lot of cool use cases for it because there’s a reason why coding agents are now the big thing. I’m using them myself daily because they make my life easier. But when we look at other use cases, we’re so early in AI progress. Yeah, coding works very nicely. But to ask an agent to book a vacation for me. Yeah. No.

20.35
It seems like that example of “I want to go on a trip. This trip will involve staying in five countries. And I want you to pick the best hotel for every country.” always was kind of the demo even during the robotic process automation. And as you alluded to, I don’t think we can do it quite yet. So here’s another family of agents, Maarten, that a lot of people are using now: deep research agents. Would you consider deep research an agent?

21.15
Maybe. It kind of depends on how it’s implemented. It depends. I’m sorry. I’m going to do that a couple of times, but. . . You can make it very structured, where you say, “OK, do the search on the archive, read the abstracts, make a summary. That’s it.” That’s not really. . .

21.38
It fits into your description in that you’re prompting an LLM. The LLM goes on a for loop where it uses as tools a search index, a knowledge graph. . .

21.53
Fair enough. Yeah. It makes the decision on its own when to use a tool, why to use a tool. Whereas you can also put it in a pipeline where you specifically say, “I always want you to do steps one, two, and three.” And an agent might decide to say, “OK, I’m going to do step 3, 3, 1, 2, 1, 3.” Decide on its own when and where to use specific tools. I think that’s maybe the best distinction you can make on what is and what isn’t an agent.

22.26
And then I guess it depends on the implementation, as you mentioned. But memory could also fill a role there, especially. . . Let’s say I’m using only one service—Google or Perplexity. Maybe it remembers over time what my preferences are. I don’t know if they actually implement it that way. But there’s potentially that aspect.

22.53
So how we phrase it in the book at least, we say, “OK, an agent is a reasoning LLM that has access to planning, tools, and memory,” because there’s no such thing as an agent that goes off and does three steps of something only to forget what the previous steps were. So I think memory is maybe a little bit underappreciated in the realm of agents, because imagine it has to go through an entire codebase and translate it from Python to C++ or Rust or what have you. It’s a very common example of things people want to do. That requires hundreds of steps to do, because it’s potentially a large codebase. How does it remember what it did when it did what, what the current state is, what what’s changed, etc., etc.? And you can write that in a Markdown file. That’s nice, but it also needs to understand, “OK, what’s the trajectory that I went through?” And you can do a lot of cool stuff with that trajectory, because that’s essentially the memory of an agent.

24.02
In your role in developer relations, I assume you talk to a lot of people who work in different companies. We’ve mentioned coding agents; we mentioned deep research. So what are some of the more common agents that people are building? They could be internal or external facing. So what are some of the more common agent types, I guess, that people are building?

24.29
Aside from the obvious, it depends on the industry. I do see coding agents actually being done quite a bit internally. Just trying to see how they can prevent data from being leaked elsewhere. Because a lot of processes now are very privacy sensitive. I came from healthcare before I joined DeepMind. And what you see in these kinds of fields is that, especially in Europe. . .

25.06
I imagine if you’re in finance in a hedge fund. . .

25.09
So yeah, same. . . And these are situations wherein people focus a lot on privacy and making sure that everything’s constrained within their environments. And you see a lot of people playing around with LLMs and then using harnesses—can be Hermes but also [taking] a more foundational agent and build[ing] stuff around that. Or the larger organizations that, well, just use whatever cloud offering there is and use an agent there. We’re so at the beginning of all of this. [laughs]

25.50
For me, the area where I see it being used—and this is not going to be a surprise to our listeners—is still the technical team bucket, which would be DevOps, data engineering, platform engineering. . . They’re building agents to help them do the work. But you might be interacting with a large website, and in the background, there’s a bunch of agents doing a lot of heavy lifting, moving data around for you to get the answer you want or whatever, or internal processes. But DevOps, I think they’re starting to build their own agents. I think, data engineering for pipelines, they’re building their own agents. I would imagine the people in security teams are also building agents because they have to go through lots of log files and. . .

26.55
A question for you then: Are they building agents, as in, you know, fully an agent, or are they building skills? Because I’ve seen a lot of people more focusing on creating skills and giving that to whatever agent is available. Or do you also see a lot of people actually building agents from scratch?

27.17
I think internally there are people who are building what we would consider agents in the sense that it would do a huge chunk of their normal work and they interact with it with prompting, but maybe they don’t consider it completely autonomous. So in the sense that many people who use coding agents, at least, the ones who know how to code, as you might still test and read some of the code, right?

27.50
Sometimes. Sometimes. [laughs]

27.52
Our listeners may be sharp, but there’s huge cohorts of people using coding agents who don’t know how to code or who are building websites and web applications. So in the data, in the DevOps, in the data engineering field, the kinds of agents they’re building are somewhat similar to the coding agents in that they’re doing a lot of the work, but they still have guardrails. I would say they’re still human-in-the-loop. Now, there’s also agents in the nontechnical fields, but they’re a little more. . . Maybe to your point, maybe they can be better described as skills, for example, in marketing or sales. Internally at some of these companies, they’re building things to help these teams be more independent from IT.

29.01
So yeah, you see mostly and we can call them skills, but we can also call them workflows or pipelines or just prompts. . .

29.10
Imagine you’re a marketing analyst at a big Fortune 500 company. And your job used to be to manage a bunch of ad campaigns and online campaigns. That was very manual, and so now you can automate a lot of that work. And then you might still have a dashboard where you can kind of see what’s going on. But the things that used to drive you crazy, now you can focus on other things.

29.46
But I am curious about the long-term effects of all of this, especially when, as you mentioned, a lot of people code without knowing how to code. I think that’s fun for a while but in the long term, stuff breaks and you don’t know where to start.

30.01
I don’t know about you, but I’ve come across people who literally don’t know how to code, who built a website, starting to have customers. Customers will file support questions or they say, “This part of your website doesn’t quite work.” Since they don’t know how to code, they go back to the same coding agent: “Hey, fix this.” The coding agent says I fixed it. They go back to the customer: “It’s fixed.” The customer goes, “It’s not fixed.” And so then this is when they start going “I need to hire someone to actually. . . Because now it actually needs to be fixed. And the holding agent can’t fix it.” So there are obviously dangers to going kind of completely wild on these technologies.

So open weights versus proprietary. This might be a sensitive topic to you because you have Gemini, but you guys also have Gemma.

31.09
I work on Gemma. Ask me everything about Gemma. [laughs]

31.12
[laughs] In your work—or not in your work, but in your day-to-day life, talking to friends, traveling, in your dev rel hat, what is a level of interest in open weights?

31.27
Oh, a lot, yeah. That’s for the most part because I’m in Europe. And Europe loves to say, “OK, we want to own things. We don’t want to push it over to someone else.” So there’s a lot of interest for open weight models. It’s way more than I initially thought because there was quite a big performance gap when ChatGPT came out, 3.5. But now they’re closing in. These models are extremely capable. You can run them on MacBooks. I mean, when Claude came out, I’ve seen so many threads of people buying Mac Studios just to be able to run whatever local LLM they have. So you see it in every part of the field, whether it’s very large organizations or very small, finance, healthcare, what have you.

32.25
One of the challenges with open weights is open weights is a business decision. And business decisions can be reversed. Meta Llama may no longer produce open weights. Alibaba—kind of mixed signals there. Some of the Chinese open weights providers are starting to send mixed signals. So it’s one thing to release an open weights model. But as you know, in this environment you have to release models at a regular cadence and that starts getting expensive. So I guess one of the challenges there for our whole community and industry is, you know, where is the steady supply of open weights models going to come from moving forward? Because basically, like I said, it’s a business decision, and a business decision is going to be reversed.

33.28
No, I agree on that. So in the general sense, that’s what we see happening. Some organizations stop doing open source, [or] less of it, focus on different things. It’s understandable in a way, because, you know. . .

33.45
And, you know, one of the obvious advantages of open weights is you can take the weights and run it in your cluster. And so you have control if. . . One of the things that annoys a lot of these enterprise teams is OK, so I’m really optimized for Claude 4.5. And then, hey, they are deprecating Claude 4.5, you know. So here at least you have control. And I think one of the things that most teams are starting to realize, Maarten, is actually I can use open weights for a lot of things because. . . Let’s say it’s so focused, like a simple sentiment analysis or whatever. I don’t need the most expensive models. And this I can control moving forward. So I think people and teams are discovering, “Hey, while I should be concerned that these open weights models may stop getting released, for some, for many of my tasks, maybe I don’t need the latest and greatest anyway.”

34.52
That can be the case. Yeah, because these models are very capable. I think there will always be a steady supply of open weight models. If we look at the status of the field now, many. . . Obviously Qwen, they’re doing an amazing job. Needs to be said. Same with Gemma, they’re also doing well.

35.14
The Qwen team lost a bunch of people, and I think there’s some worry that Alibaba may back off from. . .

35.23
I think they will continue. I don’t know, obviously, but I think it’s still a very good strategy to do.

35.30
And wait, Gemma is not as good as Gemini. [laughs]

35.33
We have good benchmarks. What is this? What is this? [laughs] No, but they serve different audiences. And what we see happening with open weights is you get so much back from giving open weights to the community. And DeepMind is a nice example. But the more labs obviously that have always given a lot to the community, when you do that, you also get a lot back, right? Because if people are super excited about Gemma 4—we released a model two days ago, 12B-1. And you see people using that for a lot of cool use cases. Driving research to create new things that, you know, we might not have thought of. That can be the case. You see Flash, for instance, which is a diffusion-based drafter, super fast, very incredible being used with Gemma 4. That’s cool. And it’s not to say that Gemma was the first one that drove that, but open weights in general allow a random person somewhere without access to thousands of GPUs to pretrain a model and still be able to do very cool and interesting research. So as long as I’m at DeepMind, I’m gonna make sure we’re gonna keep doing very cool Gemma stuff.

37.03
All right, so let’s close with a rapid fire round. So for each question, keep your answer under a minute. Question number one. OpenClaw. What says you, Maarten, about this trend around personal agents?

37.21
I love personal agents. They’re very cool and interesting. And at the same time, I’m very worried about the security of it. We’re seeing a lot of people’s keys being opened up, things that are being deleted that shouldn’t be deleted. And that’s because we’re in very early stages of all of this—just a little bit more time, and then it will be amazing.

37.46
Yeah. And run it locally with Gemma. [laughs]

37.50
Yeah, of course. [laughs] I’m not gonna sell too much. I love Gemma, I’m selling already too much.

37.57
Question number two: reinforcement learning. I’m a big fan. I always push out a post once a year at least, where I say it’s just around the corner. Now it seems like there’s a bit of a comeback with reinforcement, fine-tuning. Are you paying attention to reinforcement learning?

38.21
A lot. I have a couple of colleagues, and we started something called the RAG Pack with some bigger influencers, like Jay Allamar and Josh Starmer from StatQuest. And we did a course on reinforcement quite recently. It’s such a cool technology. It’s the technique that makes LLMs the way they are today. And there’s still a lot of new things coming up in that field to make them faster, more capable, multituning trajectories. Yeah, it’s the whole thing.

38.54
Third question: scaling loss. So Anthropic in particular is big on scaling loss: bigger models, more data, that’s the road to better and better models. So what’s your feeling right now about scaling loss.

39.11
They change quickly. We started with regular “more parameters, better model.” Then we switched to reasoning, where we said “longer reasoning, better model.” And now we’re slowly going towards the “longer trajectories, better model.” You know, more is better. I think they’re interesting, but they’re changing now so quickly that I’m wondering in half a year what the new scaling law and the new nifty thing is going to be.

39.39
So in closing, data centers. Data centers are a hot topic in the US. A lot of communities seem to be coalescing around opposing the build-out of data centers. So it’s a bit of a complicated issue in the sense that, you know, assuming that these AI technologies work and they get adopted, we will need compute in order for people to have access to these technologies. Otherwise, maybe the rich are the only ones who will have access to AI. On the other hand, the data centers themselves, you definitely need local input because, electricity, water, noise. . . And then unlike factories, they don’t really produce a lot of jobs because how many people do you really need to run a data center with all the DevOps agents now that we talked about? So what’s going on in data centers in Europe?

40.43
We don’t like them. I’m saying we—I’m Dutch. If I’m saying for the people of the Netherlands, we don’t like them generally. And that’s going to be very interesting moving forward because there’s still demand for AI. I know there’s a lot of people that don’t like it, but at the same time, there’s still a lot of people using it, and we need to find a way to balance that out. There’s no way forward otherwise, and I really hope we can focus more on efficiency when it comes to these compute-heavy things. That’s why I focus so much on Gemma. They’re small, capable models that you run on your cell phone. That’s great. Without needing to have these large data centers, aside from training, maybe, but that will always be there. We have to be honest about that. AI is here to stay. We just need to make it more efficient.

41.38
And with that, thank you, Maarten. And by the way, closing note about data centers, for our listeners, there’s a lot of announcements, right? Several gigawatts are being. . . Contracts being signed. But if you really follow what’s going on, there’s not a lot of build-out. There’s not a lot of data centers actually being built in and coming online. So… Thank you, Maarten.

42.07
Thank you.

When Context Collapses: Teaching Agents to Detect and Recover from Lost Memory

Andrew Stellman — Thu, 11 Jun 2026 10:59:13 +0000

This is the eighth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, part four here, part five here, part six here, and part seven here.

“640K ought to be enough for anybody.”—Bill Gates (allegedly)

If you’re building AI agents that do complex, multistep work, you’re going to run into context loss. The agent’s working memory fills up, older information gets silently dropped or compressed, and the agent keeps going without realizing it’s forgotten something. This article, the third in my Radar article trilogy about context management, walks through a pattern I’ve been refining for detecting and recovering from that problem, which I call the externalize-recognize-rehydrate pattern (or ERR, which I think is actually a pretty good acronym for an error recovery pattern): save your agent’s state to files on disk, detect when context has degraded, and reload from those files to recover. The individual techniques are standard practice in agent and skill engineering—checkpointing, progress files, state verification—but the real power comes from combining them into a coherent workflow that you can use live or build into your agents. I’ll walk through each step with specific prompts you can adapt for your own agents and coding sessions.

Which brings me to memory. Gates has said on multiple occasions that he never actually said that quote at the top of this article, but it endures because it captures one of the core limitations of that era, one that people struggled with constantly, in a way that we can laugh about now. Around that time I was using a 286 with 1 MB of RAM. That’s megabytes, not gigabytes. MS-DOS 3.3 gave me 640K of conventional memory plus 384K of upper memory, and I spent a lot of time figuring out how to use every bit of it. I configured memory managers, loaded device drivers high, used (and wrote!) terminate-and-stay-resident programs that moved themselves out of conventional memory to free up space, and generally treated memory as a resource that required active, deliberate engineering. There was a lot I wanted to do that didn’t fit into 640K, and like most people at the time, I went to some lengths to compensate for the memory limitations.

We’re at the 640K stage of AI development. The context window is the new RAM ceiling. Most of today’s models give you somewhere between 200K and 2M tokens of working memory (and, like memory in the late 1980s and early 1990s, those numbers are growing all the time), and if you’re building agents that do complex multistep work, you will hit that ceiling. When you do, the AI starts compacting: compressing or dropping older parts of the conversation to make room. And just like running out of conventional memory on a 286, things stop working right and you’re not sure why.

In 20 years we’ll be looking back at today’s puny context windows and wondering how developers in the 2020s managed to get anything done with just a few million tokens. Because none of this is new. In case you don’t believe me, here’s a photo of my dad at Princeton in the early 1970s working on an Evans and Sutherland LDS-1 graphics computer, the first commercial vector graphics machine, connected to a PDP-10 mainframe:

The actual LDS-1 is in the large cabinet in the background, directly behind the monitor. Sitting next to it, just out of the picture, is an even larger cabinet that holds a memory unit with 16K of magnetic core memory (technically 8K words).

So you can imagine that just a decade later, 640K in a tiny PC that fit on your desktop seemed extravagant.

In the last two articles in this series (“Why Doesn’t Anyone Teach Developers About Context Management?” and “Your AI Agent Already Forgot Half of What You Told It”), I talked about what context is and why context management matters, and I shared practical techniques and prompts for keeping important information in files instead of leaving it in the AI’s context window. This article gets more technical. I want to build on those strategies and talk about how to build agents that can detect when they’ve lost context and recover from it on their own.

Brute-forcing my way through context loss

I’ve been doing this kind of context management for a while now, long before the specific tools I’m about to describe existed. But a recent crash gave me a clean example of what the process looks like in its most brute-force form.

I was working in Copilot with a seven-step plan, going through it one step at a time, having another AI review each step before moving on. Steps one and two went fine. When it came time to do step three and I gave it the prompt, it jumped straight to step four. This kind of thing can be really frustrating, because it seems like an AI smart enough to implement a complex feature in code should be able to (ahem) count to four.

The key to not getting frustrated when the AI loses track of steps or can’t seem to count from prompt to prompt is to remember what it’s good at and how it remembers things. If the AI you’re using does that, check the conversation history. You’ll probably see something like “summarizing conversation history” or “compacting conversation” somewhere above your last message. That’s telling you that the AI lost track of where it was because that count was literally purged from its memory.

AIs are good at carrying out an instruction. They’re bad at keeping track of their own state over a long conversation, and the way they manage their memory is a big part of that. This article is about finding ways to build your AI tools so you’re not relying on them to do the thing they’re worst at.

But compaction isn’t the only way your AI loses context. A few weeks ago I was deep into a long session with Copilot, working through a multiphase code review. I’d spent a while building up context with the AI about my codebase and the decisions we’d made together. I was about to move on to the next phase, and then I got this:

The entire context was wiped, which could have been a really frustrating problem, since I had a long history with the session, and it had built up a lot of knowledge about what we were doing. This turned out to be a bug in Opus 4.6’s interaction with Copilot’s conversation history, and I’ve seen other people hit the same thing. I was staring at a fresh prompt with nothing in it.

So I did something that, in retrospect, is a pretty good brute-force version of what this whole article is about. I recognized the context was gone (hard to miss when the whole conversation disappears). I copied the entire conversation out of Copilot and pasted it into a text file. Then I gave the new session a prompt:

We were in the middle of a long conversation, then I got an error and the entire context was wiped. I saved a copy of the conversation in #file:chat_history.txt, read it and bring yourself back up to speed.

And it worked! This brought the new session back to where I needed it to be.

That simple error and recovery actually outlines a pretty good pattern for dealing with context loss:

Externalize the state. Get the important information out of the conversation and into a file on disk, where it won’t disappear when the context window reshuffles.
Recognize the loss. Notice that the agent’s working context has been wiped or degraded, whether that’s obvious (like a crash) or subtle (like output that quietly stops making sense).
Rehydrate from the file. Point a new session at that file and let it rebuild its understanding from what’s written down.

The individual mechanics are well-documented across cognitive science (cognitive offloading, task resumption), software engineering (the Memento pattern, React hydration), and knowledge management (the SECI model). I’m not claiming to have invented any of them. But the specific abstraction of these three phases into a unified, named pattern applied to AI context management is, as far as I can tell, new. It’s synthesis and codification, not invention.

In this case I did it with copy and paste, which isn’t particularly elegant, but it worked for me. But this is a blunt instrument, because a raw conversation dump is both too much and too little: it’s too much because it’s full of noise, like tool calls, dead ends, back-and-forth that doesn’t matter anymore; and it’s too little because the context that got silently compressed away during the session is already gone. When you build these mechanisms into agents and skills, you can do it in a much more subtle and automated way.

Externalize: Add two layers of state to your agent

The idea behind externalization, or periodically saving your agent’s state, came out of a conversation I was having with an AI assistant while building the Quality Playbook, an open source AI coding skill that runs structured code reviews. The playbook runs a structured code review as a single process, but that process could easily turn into a 15-million-token request if you tried to do it all in one shot. I described in the previous article in this series how I broke it into six phases, and that was only possible because the context for each phase had already been externalized. Each phase reads its inputs from files, does its work, writes its outputs to files, and stops. The next phase picks up from the files, not from whatever the agent remembers. If this sounds like the familiar advice to ask the AI to plan before you ask it to implement, it’s the same principle applied to context management. Separating each step and persisting the output means you can inspect it, and the next step doesn’t depend on the agent’s memory.

But what should those files contain? I found that the AI is actually good at figuring that out. At some point I asked the assistant:

Would it make sense for the agent to record more context in files as it progresses, to make sure nothing is dropped along the way? It should work even if you break it into separate prompts, because the result from each step is persisted. Plus, we can audit its reasoning for debugging and improvement.

That prompt was all it took. The assistant designed the file structure itself: a progress tracker that records which phase is active and what’s been completed, a JSONL artifact file (JSONL is just a file with a bundle of JSON objects, with one record per line) where each pass appends its output, and a set of brief documents describing the purpose of each phase. You don’t need to overengineer this. Tell the agent what you’re trying to preserve and let it figure out the file layout.

What emerged falls into two categories that I think of as execution continuity and task continuity:

Execution continuity is the state the agent needs to resume work in the middle of a task: what step it’s on, what it’s completed, what decisions it’s made so far. These files change constantly as the agent works.
Task continuity is the broader context that doesn’t change during execution: what the whole task is about, what success looks like, what the structural constraints are. These files are written once and read at every resumption.

When an agent needs to resume after suspected compaction, it reads back both layers. The task continuity files anchor it back to what the whole endeavor is about. The execution continuity files put it back in the middle of the work. Together, they give the agent enough information to continue without relying on anything that might have been compacted.

The key is that externalization isn’t something you do once at the beginning of a task. You want the agent saving its state at frequent checkpoints so that if compaction happens mid-run, the most recent checkpoint is close to where the agent was working. Here’s the kind of instruction I gave the agent for tasks that processed records one at a time:

Update the progress file after every single record, not in batches. Write the output line first, then update the progress file with the new cursor and a fresh timestamp. If the progress file’s timestamp falls behind the output file’s, you’re batching and that’s wrong.

The frequency matters because context can compact at any point. If the agent only saves state at the end of a long run, compaction in the middle means losing everything since the start. If it checkpoints after every unit of work, the worst case is losing one unit.

Two-layer externalization survives context reshaping, not only outright context loss. Even if the agent’s context window isn’t full, if the context has been reorganized or reprioritized (a compression that reshapes without truncating), the agent can reload the external files and know for certain what the ground truth is.

Recognize: Detecting loss from inside the agent

The second step in the pattern is to recognize that your agent has lost context, and it turns out to be the hardest part (at least with today’s AI technology). When the context window fills up, the AI compacts silently, and the agent keeps working without realizing it’s lost information. The agent can’t tell you it’s forgotten something, because it doesn’t know it forgot. Detecting that change turns out to be a nontrivial problem; I’ll walk you through an approach that helped me, and keep it general enough so you can do the same thing. The copy-and-paste approach works when the context loss is obvious, like a crash that wipes your whole conversation. But most context loss isn’t that visible.

I described context compaction in the previous article, but it’s worth restating the core problem from the agent’s perspective. Different tools handle context overflow differently: Some truncate older messages; some compress conversations into summaries; some use a sliding window. But they all have the same effect. Information disappears from the agent’s working context, and the agent doesn’t get notified.

This was a challenge when I built the Quality Playbook, because it runs multiple passes over a codebase, each one reading source files, extracting requirements, and checking coverage. Each pass can involve enough work that it fills the context window multiple times over. And when context compacts mid-pass, the agent doesn’t know it happened. It keeps working, but the output starts silently degrading. So I started building mechanisms for the agent to detect compaction and recover by reading back the files it had written earlier. The patterns that came out of that work are general enough to apply to anyone building agents that need to survive context pressure.

From the agent’s perspective, compaction is seamless. It’s tracking state, referencing decisions made earlier in the conversation, and then at some point the earlier context is gone. But the agent can’t tell the difference between “I never knew that” and “I knew it but lost it.” It tries to reference something and finds nothing, or finds a compressed version that lost the nuance. And because the agent doesn’t know it lost anything, it doesn’t know it needs to recover.

This invisibility is the core problem. But it turns out you can work around it, and the next two sections walk through how.

Building a detection mechanism

Once you have files on disk, the question is what specifically to check and how to know when something has gone wrong. I landed on a mechanism while building the Quality Playbook’s requirement extraction pipeline. The playbook processes source documents in multiple passes, and each pass appends its output to a JSONL artifact file. After each unit of work, the agent also writes a progress record to a separate file: what it just finished, what it found, and where it should pick up next.

The detection mechanism comes from two rules I gave the agent. The idea is that the progress file tracks a cursor, which is just a position marker that tells the agent which record to process next. If the agent writes a record to the output file but then loses context before updating the progress file, those two files will be out of sync.

The agent didn’t need to understand any of that upfront; I just described the rules in plain language and let it figure out the implementation. The first rule establishes an invariant between the output file and the progress file:

Cursor advances only after the line is on disk. Write the summary line to the output file first, then update the progress file. The cursor must always equal the index of the next record that still needs to be processed.

The second rule told the agent how to check that invariant on startup:

On startup, read the progress file. Resume from its cursor value. Verify continuity: the last line in the output file should equal cursor minus one. If not, roll the cursor back to match disk state and report the discrepancy.

If the progress file says the cursor is at record 381, but the last line in the output file is record 379, something happened. The context compacted and the agent lost track of where it was. The divergence between the two files is the signal.

This worked because files on disk don’t change when context compacts. They’re written once and then read repeatedly. If what the agent thinks it knows doesn’t match what’s actually in the files, something shifted in the agent’s memory, not on disk. I ended up folding this check into a preamble that every session started with:

If this session has experienced auto-compaction, re-read the pass specification from disk. Do not try to reconstruct it from the compacted summary. Read the progress file. Read the last record of the JSONL artifact and confirm its index equals the cursor minus one. If not, roll the cursor back to match disk state. Disk is the source of truth. The conversation is not.

That preamble ran at the top of every session. During one particularly intensive day of pipeline development, I ran over a hundred Claude Code sessions with that exact instruction. Most of them completed without hitting compaction. But the ones that did hit it recovered cleanly, because the preamble told the agent exactly what to check and exactly what to do when the check failed.

The specific prompts I used are tied to the Quality Playbook’s file structure, but the technique generalizes. If you’re building any agent that does multistep work, you can adapt the same approach. Here’s a version you could drop into a session preamble or an agent’s system prompt:

Before continuing any task, read your progress file and your most recent output file. Compare them: does the progress file say you’ve completed work that isn’t reflected in the output? If so, trust the output file, roll back your progress to match, and note the discrepancy. Do not rely on what you remember from the conversation. The files on disk are the source of truth.

The wording doesn’t have to be precise. What matters is the structure: tell the agent where to look, what to compare, and which source to trust when they disagree.

But didn’t you just say the AI can’t detect its own compaction?

Right, and it can’t. What I described above isn’t the agent detecting compaction. It’s the agent running a deterministic check against files on disk and finding a discrepancy. The agent doesn’t need to know that compaction happened. It just needs to notice that two files disagree. Think of the agent as an amnesiac clerk. You don’t ask the clerk to remember what they did yesterday. You make the clerk check the physical ledger every time they sit down at the desk. If their notes disagree with the ledger, they’re trained to trust the ledger.

If you saw Christopher Nolan’s breakout movie Memento, you can think of your agent as Leonard Shelby, the character played by Guy Pearce with anterograde amnesia. You couldn’t ask Leonard to remember what he did yesterday. He had to check his tattoos every time he woke up. If his tattoos disagreed with what he’s seeing, he trusts the tattoo (which leads to a major plot point, which I won’t spoil). Again, this isn’t a new idea either. I mentioned the Memento pattern earlier, which is literally named after this movie.

This is a classic distributed systems technique. In double-entry bookkeeping, you maintain two independent records of the same transaction and reconcile them regularly. If they disagree, you investigate. You don’t need to know why they diverged; the divergence itself is the signal. A two-phase commit works the same way: write the data first, then update the record that says the data was written. If you find data without a matching record, or a record without matching data, something went wrong between the two phases.

That’s exactly what the cursor invariant does. The agent writes the output line first, then updates the progress file. If those two files are out of sync, something happened between the two writes. The agent doesn’t detect compaction. It detects a broken invariant, and it’s been told that when the invariant breaks, the files on disk win.

Three things make this work. First, the check is purely deterministic: read two files, compare two numbers, act on the result. There’s no reasoning involved, no judgment call about whether the agent “feels” like it lost context. I wrote about this principle in “Keep Deterministic Work Deterministic”; you never want an AI making decisions that a file comparison can make for it. Second, the files on disk don’t change when context compacts. They’re the stable reference point that the agent’s memory gets checked against. Third, the instruction to run the check lives in the system prompt or preamble, which is generally preserved even when conversation context gets compacted. The check survives the thing it’s designed to detect.

Rehydrate: Reading back the state

Rehydration is the process of reading back externalized state and rebuilding the agent’s working context. Once the agent detects compaction (or, more specifically and accurately, has enough evidence from the filesystem that compaction occurred), the recovery step is to read back the externalized files and rebuild. For the Quality Playbook, rehydration meant:

Read the phase brief to re-anchor the purpose of this pass
Read the progress file to know which unit is active and what’s been completed
Read the tail of the JSONL artifact to confirm the last successfully written record
Recompute the next unit of work from those files

This is different from just continuing without detection. Without detection, the agent tries to pick up where it left off and hopes it still has enough context. With detection, the agent knows something happened and deliberately reloads state before continuing.

You can make the rehydration process itself auditable. Instead of silently reading the files and resuming, have the agent write down what it learned:

Read the progress file and the JSONL artifact. Write a summary of what you learned: what pass is running, what unit is active, what the cursor position is, and how many requirements have been extracted so far. Then continue from there.

Writing a rehydration summary serves two purposes. It gives you visibility into what the agent understood and whether it rehydrated correctly. And it forces the agent to process the external files explicitly rather than just loading them into context. Explicit processing is more reliable than silent loading because the agent has to commit to an interpretation, and you can read that interpretation and catch mistakes.

You can adapt this approach to any agent workflow where work happens in steps. The specific files and cursor values are particular to my pipeline, but the underlying technique is general: have the agent write its progress to a file after each step, and check that file against its output at the start of every session. And this advice isn’t just for writing agents or skills. Even in a live session with Claude Code, Cursor, or Copilot, you can tell the agent to periodically write a summary of what it’s done and what it plans to do next to a file on disk. If the session crashes or the context gets long enough to compact, you can point a new session at that file and pick up where you left off. The key is getting the state out of the conversation and onto disk before you need it.

Context management is an architectural concern

Every technique I’ve described in these articles comes down to the same principle: Important information shouldn’t live only in the agent’s context window. The previous articles covered how to put that information on disk. This one covers how to make the agent aware of its own limitations so it can recover when context pressure gets too high.

An agent that can detect its own degradation and correct for it is fundamentally more reliable than one that just keeps going. When the agent knows how to stop, check itself against ground truth, and reload what it lost, context pressure becomes a recoverable event instead of a slow, silent failure.

This concludes my mini-series trilogy of articles about context management. The first article in this series was about understanding what context is and why it disappears. The second was about getting important information out of the conversation and onto disk before you need it. This one is about closing the loop: making the agent aware of its own limitations so it can detect degradation and recover from it. Together, they add up to treating context as an engineering problem rather than something you hope works out.

These are still early days. Context windows will get larger, compaction will get smarter, and some of the workarounds in this article will eventually be unnecessary. But the underlying principle won’t change: If your agent’s ability to do its job depends on information, that information needs to live somewhere more durable than working memory. That was true for my dad’s 32KB core memory at Princeton, it was true for my 640K of conventional RAM, and it’s true for today’s 200K-token context windows.

The Quality Playbook and Octobatch are open source projects where these techniques are used in production. Both are built using AI-driven development and available for exploration if you want to see how this looks in practice.

Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026, by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.

The PM’s Playbook for Shipping AI Features That Actually Work in Production

Gaurav Savla — Wed, 10 Jun 2026 10:55:56 +0000

The demo to production Death Valley

If you’ve worked on an AI feature, you know the feeling. You start building something that you are excited about, set launch timelines. The model spits out a perfect response, the prototype works magically, and everybody in the room is mentally calculating how big this product will be when we launch. I’ve been in that room a lot many times and it’s fun.

Then you try to test before you ship.

Latency spikes to 10 seconds on mobile. The model starts hallucinating on edge cases that happen to represent 15% of actual user queries. Your A/B test shows no statistically significant engagement lift because the variance in AI outputs makes traditional hypothesis testing basically meaningless. The safety team flags 340 failure cases in the first week, and you’re now debugging nondeterministic cases that fail in creative, novel ways every single day.

Most often than not, it’s not a model problem but an engineering discipline problem. Shipping an AI product is very different from traditional software. I’ve figured this out the hard way. This playbook shares my learnings.

Latency budgets

Every AI feature comes with a latency tax. Large language model inference takes time. We’re talking 500 milliseconds to 5 or even 50 seconds depending on model size, input length, and infrastructure setup. For consumer products where people expect sub-200-millisecond interactions, this is a hard constraint you have to design around.

The mistake I see most often is teams measuring only p50 latency. A feature with 800 milliseconds p50 sounds fine until you discover the p90 is 15 seconds. That means 10 in every 100 users sit there waiting for 15+ seconds. At scale, that’s thousands of terrible experiences per day.

The way I think about it is you define your latency budget by interaction type, not globally: Synchronous interactions, where the user is staring at a spinner, need to resolve under 1 second. Progressive interactions, where output streams token by token, need first token in under 500 milliseconds and full response under 5 seconds. Asynchronous interactions, where the user keeps doing other stuff, can take up to 20 seconds with a progress indicator.

You also need to measure cold starts separately. The first request after a model loads into memory can be 10 times slower than subsequent requests, and if your traffic is bursty, cold starts will disproportionately punish your most engaged users arriving during peak hours.

Besides, you also need to budget for the full pipeline, not just inference. A typical AI feature pipeline including input preprocessing (tokenization, context assembly, and prompt construction), model inference, output postprocessing (parsing, formatting, safety filtering, etc.), and a full response delivery adds up. Optimizing inference while ignoring the rest is like tuning your engine while driving on flat tires.

Lastly, use streaming aggressively for generative features. Pushing tokens to the user as they’re generated instead of waiting for the full response changes how users perceive latency. A four-second response that starts appearing at 300 milliseconds feels dramatically faster than one that pops in all at once. Perception is reality when it comes to user experience.

Designing fallbacks

Traditional software fails in boring, predictable ways. AI features fail in novel, unpredictable, and occasionally creative ways. I once saw a model respond to a product recommendation query with a poem about loneliness. Your fallback strategy needs to be considerably more sophisticated than a try/catch block.

I think about fallbacks as a hierarchy. First, model fallback: When your primary model fails, drop to a simpler, faster, and more reliable model. Most failure cases get handled without the user ever knowing. Second, cache fallback: For queries similar to stuff you’ve seen before, serve a cached response. Third, template fallback: When generation fails completely, fall back to prewritten templates. Degraded beats dead every time. Fourth, graceful omission: Sometimes the best fallback is to simply not show the AI feature at all rather than showing a broken version.

The design principle underneath all of this is that users should never encounter an unhandled AI failure. Every failure mode maps to a specific level, and transitions between levels should be invisible whenever you can manage it.

Quality measurement

Quality in traditional software is binary. The button works or it doesn’t. AI feature quality is continuous and subjective, and it changes depending on context. I’ve landed on a four-layer quality pyramid.

The foundation is safety, and it’s nonnegotiable. Does the output contain harmful content, PII, or made-up facts? This layer is binary, and you measure it with automated classifiers running against 100% of outputs.

The second layer is factual correctness, which is domain specific. Is the output actually right? For a coding assistant that means generated code compiles and passes tests. For a writing tool it means grammatical, stylistically appropriate output. You measure this with domain specific evaluation suites.

The third layer is usefulness, and it’s user centered. Did the person actually benefit? Track acceptance rate, edit distance, time to task completion, and repeat usage. This is where traditional product metrics meet AI specific ones.

The fourth layer is delight, which is experimental. Does the output feel good? Hardest to measure but often most important for adoption. Sometimes the numbers say the feature works but users’ guts say it doesn’t. This layer catches that gap.

A/B testing AI features

A/B testing AI features is fundamentally harder than traditional features because AI outputs are nondeterministic. The same user doing the same thing twice might get different outputs, introducing variance that traditional frameworks weren’t built to handle.

The core challenge is that intratreatment variance inflates the sample size you need for statistical significance, often by three to five times. If you’re running your AI experiment with normal sample size assumptions, you’re probably looking at noise and calling it signal.

Then there’s the metric selection problem. A chatbot generating entertaining but factually wrong responses might show amazing engagement numbers while actively misleading users. You have to measure engagement and quality together. “Engaged interactions where quality score exceeds threshold” is more meaningful than raw engagement alone.

The temporal problem matters too. AI feature value changes over time as users learn how to work with it. Short experiments will underestimate long-term value if there’s a learning curve, or overestimate it if there’s a novelty bump.

My practical guidance: budget two to three times more time and traffic for AI experiments than traditional ones. Lean on Bayesian methods as they handle high variance better. And always pair quantitative tests with qualitative research. Ten user interviews will surface failure modes that no amount of statistical analysis will catch.

Model drift monitoring

Model drift is the slow, invisible rot of AI output quality over time, and there are multiple culprits.

Data drift happens because the world changes and user behavior evolves. A model trained on 2024 data performs worse on 2026 queries referencing new concepts, slang, and cultural moments.

Provider drift happens because third-party APIs change without your consent. OpenAI acknowledged that GPT-4’s behavior shifted measurably between March and June 2023, and Stanford researchers documented significant performance swings. The fix: Pin your model versions so updates happen on your schedule, after your testing.

Evaluation drift is the subtlest form. Even your quality metrics can become inadequate and the evaluation criteria that made sense at launch might become inadequate as usage patterns shift and user expectations change. Quarterly reviews of your evaluation suites are essential.

At minimum you need daily automated quality evaluations on 1% to 5% of production traffic, weekly analysis of input distribution characteristics, and monthly human evaluation of 100 to 500 examples. Shipping an AI feature without drift monitoring is like deploying a service without alerting. You won’t know it’s broken until your users tell you, and by then they’re angry.

Evaluation frameworks

How do you know if your AI feature is good enough? You need two fundamentally different approaches, and you genuinely need both.

Automated evaluation gives you speed. Build a golden dataset of 500 to 2,000 labeled examples, train a classifier or use a capable model as judge, and validate against human judgment quarterly targeting 85% agreement. Automated evals chew through thousands of examples per hour, making them essential for velocity. The pitfall: They miss novel failure modes not in the training data.

Human evaluation catches what automation misses. Structure it with five to seven evaluators mixing domain experts and representative users. Use a consistent rubric covering accuracy, helpfulness, tone, completeness, and safety. Run weekly during development, monthly in production. The trade-offs: expensive at $15 to $30 per example, slow with 24 to 72 hour turnaround, and subject to human biases. Manage by rotating evaluators and capping sessions at two hours.

The model as judge approach is an increasingly viable middle ground. Judging quality is often easier than generating it, which means a model can reliably evaluate outputs even for tasks where it couldn’t produce them itself. Use it for high-volume evaluation but always validate against human judgment.

Graceful degradation and prompt engineering

Graceful degradation means when capabilities decrease, the experience gets worse smoothly instead of falling off a cliff. Design for capability levels, not binary states. Define four to five levels with specific behaviors at each. For example, for an AI writing assistant: Level 5 is full capability with real-time suggestions, tone adjustment, and structure recommendations. Level 4 is delayed suggestions appearing after a two- to three-second pause because latency is up. Level 3 is basic suggestions only like grammar and spelling with no style feedback. Each level is a deliberate design decision, not an accident.

Make degradation invisible when possible. Users shouldn’t see a “broken” experience. They see a less detailed one. That’s a huge difference psychologically. However, when the degradation is significant enough that users will notice, proactive communication like “AI suggestions are temporarily limited” builds trust infinitely more than silently pushing poor-quality outputs.

Prompt engineering in production is software engineering. In production, prompts are code, and they need version control, testing, monitoring, and maintenance. Version controls every prompt. Parameterize prompts, don’t hardcode context. Production prompts should be templates with clearly defined injection points for user context, system state, and dynamic instructions. This makes them testable because you can inject known inputs and verify outputs, and it makes them maintainable because changing how you handle context shouldn’t require rewriting the entire prompt from scratch.

Test prompts against regression suites. Maintain 200 to 500 test cases covering the full distribution of expected inputs, including edge cases and adversarial inputs. Run the suite against every prompt change before deployment.

Monitor prompt performance in production. Track output quality metrics like acceptance rate, user edits, and regeneration requests, segmented by prompt version. When you deploy a new version, compare its production metrics against the previous one for at least 72 hours before calling it stable. This is basically canary deployment for prompts.

Ship it right

These systems aren’t optional add ons you can bolt on after launch. Every feature I’ve seen fail was built first with plans to “add production hardening later.” Later never comes.

AI features are probabilistic and nondeterministic, and they change over time without anyone touching them. Build these systems, staff them properly, and treat them with the same seriousness you’d give your core infrastructure. The gap between demo and production is wide, but it’s absolutely crossable if you build the right bridge.

Note: The research work pertaining to this article was done in a personal capacity. Views are of my own and do not reflect my employer’s views in any way.

The Subsidy Ended: What Tool-Using Agents Actually Cost

Bennie Haelen — Tue, 09 Jun 2026 11:09:17 +0000

On June 1, GitHub Copilot’s usage-based billing became active for all Copilot plans, and developers reacted quickly and loudly. A Pro plan still costs $10, but it now comes with a monthly pool of AI credits. Those credits are priced at a penny each, and they’re consumed according to the model used and the tokens processed, including input, output, and cached tokens. For a heavy agentic session running a frontier model, that makes spend feel very different from a flat subscription.

That’s the news, and it’s worth understanding, but it isn’t the important part. Nothing about the underlying cost of agentic work actually changed on June 1. The tokens were always being consumed, the loops were always running, and the tool calls were always expanding the context. What changed is that the meter became visible. A workload that had been quietly subsidized under a flat rate started showing up as an itemized bill.

Where the tokens go

To see why the bill landed so hard, it helps to compare two things that look similar and bill very differently. A chat completion is close to a single transaction. You send a prompt, the model sends an answer, and you pay roughly once for the input and once for the output. A tool-using agent doesn’t work that way at all. An agent doesn’t answer a question so much as work toward it, and it works by looping. It reasons about the task, calls a tool, reads the result, reasons again, calls another tool, and continues until it decides it’s finished.

Every pass through that loop carries a cost that’s easy to miss. In many agent harnesses, each turn carries forward a large share of the accumulated context: prior messages, tool descriptions, retrieved files, and tool results. Even when some of that context is cached, summarized, or pruned, the system is still doing metered work to preserve enough state for the next decision. The final answer you actually wanted is only a thin slice of what you paid for. The loop is the bill.

This is why agent cost doesn’t scale politely. It scales with the number of turns, and the number of turns scales with how much discovery the agent has to do, which in turn scales with how vague the request was and how much irrelevant context it’s dragging along. A clean, well-scoped task might finish in three turns, while the same task posed as an open-ended question might wander through 15, each carrying the cost of everything that came before it. Under a flat rate, that difference was invisible. Under usage-based billing, it’s the difference between a small interaction and an expensive one.

Tool design is now part of the cost model

I wrote recently about a hidden tax on Model Context Protocol servers: the way an overstuffed tool catalog quietly degrades a model’s ability to route to the right tool. Bloated descriptions, overlapping responsibilities, and vague parameters make the model’s job harder and its choices worse. That argument was about accuracy. The billing change adds a second invoice for the same bloat, and this one is denominated in dollars.

The tool catalog is often part of what gets carried through the agent’s loop. A tool described in three tight sentences and a tool described in three rambling paragraphs may both function, but the second one pays rent in the context window every time an agent has it loaded. Multiply that across a catalog of 40 tools and a workflow that runs a dozen turns, and the cost of verbose tool design stops being a rounding error. Tool design was already a correctness discipline. It’s now a cost discipline as well. The same audit that tightens routing accuracy tightens the bill.

Where prompt discipline runs out

There’s a layer of this that individual users can control, and it’s worth knowing because the savings are real and immediate. Two patterns matter most, and I’ve been handing both to the engineers on a pilot I run for a large healthcare organization. They aren’t magic tricks. They’re ways to keep the agent out of unnecessary discovery loops.

The first pattern is about input. Prompt the agent like a short requirement rather than a broad question. A request such as “look at the encounter data and tell me what you find” forces the agent into discovery mode, where it burns turns figuring out what you meant, and every one of those turns carries the full context forward. Compare that to a prompt that front-loads the specifics by naming the project and the table, naming the date field to filter on, stating the output shape you want, and calling out anything that should be excluded. A better prompt would be: “Using the curated clinical project and the silver-zone encounters table, show total encounters by month for calendar year 2025, use admission_date_time for inclusion, and return one row per month ordered chronologically.” The second prompt collapses the loop. The agent has what it needs on the first turn, so it does the work instead of interviewing you for it.

In practice, the difference isn’t just polish. The vague version forces the agent to discover the data model, infer the date semantics, choose an aggregation, and decide on a display format. The specific version turns the task into a bounded query. That difference shows up in accuracy, latency, and cost.

The second pattern is about output, and it’s the lever most people overlook. Ask for plain text or Markdown during the intermediate steps, and save rich HTML formatting for the final, confirmed deliverable. Formatted output is expensive to generate, and requirements shift. If you ask for a polished HTML report on the first pass and then change a filter, you pay full output-token freight to regenerate all that layout, often more than once. The cheaper habit is to validate the numbers in text and format only at the end.

These patterns work, and they also have a ceiling. Both of them put the entire burden of cost control on the user, and they hold only as long as every user exercises the discipline on every prompt. The day someone reverts to “tell me what you find,” the savings evaporate, and the only thing standing between the team and a surprise invoice is a budget cap that reports the overspend after it has already happened.

Cost is a governance problem, not a budgeting one

That fragility is the real lesson. A budget cap is a backstop rather than a control. It will stop a runaway, but it tells you that you overspent rather than why, and it does nothing to make the next run cheaper. Treating cost as a budgeting problem leaves you forever reacting to the meter, while treating it as an architecture problem lets you build the savings in once and stop relying on everyone’s good behavior.

That means the controls that matter belong on the platform rather than in individual prompts. By the platform I don’t mean the agent itself, the coding assistant or chat client a developer drives day-to-day, and I don’t mean the model or a router sitting beneath it. I mean the control plane that sits above the agents, the layer where an organization enforces policy, access, observability, and now cost across every agent and model its developers touch. An administrative console that gives IT visibility into who is doing what and which capabilities they can install is an early, narrow instance of it. A router that sends planning to a cheap model is one feature that belongs there. The platform is where the rules live, and the agent is a consumer of those rules rather than the place you set them. The platform should route models by task, using cheaper models for planning and reserving frontier models for work that earns the price. It should bound the loop, requiring the agent to check in after a fixed number of iterations. It should cap tool-result payloads so a careless query cannot dump a million rows into the context window. It should default intermediate work to plain text, making the cheap path the path of least resistance instead of something users have to remember.

Every one of those controls is something a user can approximate by hand and something the platform can simply guarantee. This is the same principle I keep returning to in the context of data access, where safe behavior cannot depend on the person at the keyboard remembering the rules. Prompts guide behavior. Guardrails make the cheaper and safer behavior the default. Cost governance is guardrails as control plane, with a dollar sign attached, enforced at the same layer where you already enforce who is allowed to see which row.

The pattern, not the vendor

It would be a mistake to read this as only a GitHub story. GitHub is the current example because its change is visible and recent, but usage-based billing for agentic work is the direction of travel for many AI tools. The economics under the hood are similar: Agentic workloads turn single answers into loops of model calls, tool calls, and context management. The flat-rate subsidy was always going to come under pressure once the workload shifted from autocomplete to autonomy.

The organizations that treat June 1 as a pricing event will optimize a few prompts, grumble, and move on until the next vendor changes its meter. The ones that treat it as an architecture signal will push the cost controls down into the platform, where they hold regardless of which provider is counting which token. That’s the more durable place to stand. The bill didn’t get bigger this month. It got honest, and an honest bill is the kind you can engineer against.

Long-Running Agents

Addy Osmani — Mon, 08 Jun 2026 15:59:06 +0000

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

A long-running AI agent can keep making progress over hours, days, or weeks. It can do this across many context windows and sandboxes, recover from failure, leave structured artifacts behind, and resume where it left off.

For two years the dominant image of an “AI agent” has been a chat window with a clever loop in it. You type a goal; the agent calls some tools; you watch tokens stream by; you stop watching when the work runs out of patience or the context window fills up. That paradigm got us a long way, but it has a ceiling. The model forgets. It declares “task complete” when it isn’t. It reintroduces a bug it fixed nine turns ago. The whole thing is structured around a single sitting.

Long-running agents are what comes next. The idea is easy to state: an agent that keeps making forward progress on a goal across many sessions and many sandboxes, possibly many days or weeks, while leaving the workspace clean enough that the next session can pick up where the last one left off. The engineering is harder. You have to solve for persistence, recovery, and verification in a way that doesn’t just paper over the cracks. You have to build a state layer that lives outside the model’s context window, and you have to design the handoff between sessions so the agent doesn’t lose its mind when it wakes up and finds itself in a different sandbox with a different context window.

This post is my attempt to lay out what’s changed, who’s pushing on it, and how an engineer can use long-running agents today without writing the whole thing from scratch.

What “long-running” actually means

“Long-running” used to mean at least three different things in practice, and it helps to keep them separate.

Long-horizon reasoning. The agent has to plan and execute over many dependent steps. This is mostly a model-quality story: coherence, planning, the ability to recover from a wrong turn 10 steps ago. METR has been tracking this with their time horizon metric, which estimates how long a task a frontier model can complete with 50% reliability. The headline finding is that the metric has been doubling roughly every seven months since 2019, and their TH1.1 update earlier this year doubled the count of eight-hour-plus tasks in the eval set. If that curve holds, frontier agents complete tasks at the day scale by 2028 and the year scale by 2034.

Long-running execution. The agent’s process runs for hours or days. Maybe it’s a coding job, maybe it’s a research sweep, maybe it’s a 24-7 monitoring service. The model might be invoked thousands of times across the run. This is mostly a harness story, and it’s the one this post is mostly about.

Persistent agency. The agent has an identity that outlives any single task. It accumulates memory, learns user preferences, and is always available. This is the Memory Bank flavor of long-running.

In practice the three blur together. A real production agent does long-horizon reasoning inside a long-running execution backed by persistent agency. But the engineering problems are different in each, and so are the products that solve them.

Why this matters

There are two reasons I believe this work matters a lot right now.

The first is a phase change in what’s economically feasible to delegate. An agent that runs for 10 minutes can answer a question, summarize a doc, fix a small bug. An agent that runs for 10 hours can own an entire feature, finish a migration that was on the backlog for six quarters, or do the kind of overnight research sweep that used to require a junior analyst. One of Anthropic’s Claude Sonnet announcements put concrete numbers on this last fall: 30+ hours of autonomous coding in internal tests, including one run that produced an 11,000-line Slack-style app. That’s already past the threshold where the answer to “Should I delegate this?” is no longer obvious.

The second is that persistence changes what the agent is. A stateless agent answers your question and disappears. A long-running one accumulates context: which competitor moved which way last week, which test flaked twice on Tuesday, what you usually mean by “the dashboard.” Anthropic’s Project Vend was the most public early demonstration of this. They had a Claude instance run an actual office vending business for a month, managing inventory, setting prices, talking to suppliers. It failed in informative ways, and the second phase ran much better, but the point wasn’t profitability. The point was watching what kinds of weird coherence problems show up when an agent has to maintain identity across weeks instead of turns.

Those are the same problems every team building production agents now hits.

The three walls every long-running agent hits

Three walls show up in basically every write-up I’ve read this year.

Finite context. Even a 1M-token window fills. And context rot, the steady degradation of model performance as the window gets full, kicks in well before the hard limit. A 24-hour run is not going to fit in any context window the field has on its roadmap. Something has to give.

No persistent state. A new session starts blank. Anthropic’s framing in their scientific computing post is the cleanest version I’ve seen: “Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Without an explicit persistence story, every shift change is a productivity disaster.

No self-verification. Models reliably skew positive when they grade their own work. Asked “Are you done?” they answer “yes” more often than they should. Without a separate signal that the work meets a bar, you get the agent that ships at 30% complete with full confidence.

Long-running agent designs are mostly answers to these three problems. The major labs have converged on similar shapes of answer, but with very different surface area.

The Ralph loop: One of the simpler practitioner versions of long-running agents

The Ralph loop (sometimes called the Ralph Wiggum technique) is one of “simpler” practitioner version of long-running agents, popularized by Geoffrey Huntley and Ryan Carson. The reference implementation is literally a bash script that loops:

Pick the next unfinished task from a list (prd.json or equivalent).
Build a prompt with the task, the relevant context, and any persistent notes.
Call the agent.
Run tests or other checks.
Append what happened to progress.txt.
Update the task list (done, failed, blocked).
Go back to step 1.

The reason it works is the same reason any of the harnesses below work: State lives outside the agent’s context. prd.json is the plan, progress.txt is the lab notes, and AGENTS.md is the rolling rulebook. The agent itself is amnesiac, but the filesystem isn’t. Each iteration starts fresh and reads enough state from disk to keep going. Carson’s Compound Product extends the idea by chaining multiple loops (an analysis loop that reads daily reports, a planning loop that emits a PRD, an execution loop that writes the code), which is roughly the open source version of the planner-generator-evaluator triad Anthropic landed on independently.

I went deeper on all of this in “Self-Improving Coding Agents”: task list structure, progress files, QA gates, monitoring, the failure modes you’ll actually hit. The short version is that you can build a working long-running agent in an evening with a bash script and a JSON file. Most of what Google and Anthropic have productized is the work of making this pattern recoverable, secure, and observable at scale.

The big-lab stories below are different ways of paying for that production-readiness.

Anthropic: Harnesses, then the brain/hands/session split

Anthropic has been the most public about the engineering. Two posts are worth reading end to end.

The first is “Effective Harnesses for Long-Running Agents,” which lays out a two-agent harness for autonomous full stack development. An initializer agent runs once at the start of a project to set up the environment, expand the prompt into a structured feature-list.json, and write an init.sh that future sessions will run on boot. A coding agent is then woken up over and over, each session asked to make incremental progress on one feature, run tests, leave a claude-progress.txt note, and commit. A test ratchet (“it is unacceptable to remove or edit tests because this could lead to missing or buggy functionality”) sits in the prompt to stop the very common failure of an agent deleting failing tests to “make them pass.” InfoQ’s writeup extends this into a planner, generator, and evaluator triad, on the same logic that separating generation from evaluation matters because models grade their own work too generously.

The second is “Scaling Managed Agents: Decoupling the Brain from the Hands,” the architectural post behind Claude Managed Agents (Anthropic’s hosted runtime, launched in early April). The argument is that an agent has three components that should be independently replaceable. The Brain is the model and the harness loop that calls it. The Hands are sandboxed, ephemeral execution environments where tools actually run. The Session is an append-only event log of every thought, tool call, and observation.

This sounds abstract, but it isn’t. Here’s Anthropic’s framing: “Every component in a harness encodes an assumption about what the model can’t do on its own.” When you couple them, an assumption that goes stale (e.g., the model used to need an explicit planner and now plans natively) means the whole system has to change at once. When you decouple them, the harness becomes stateless, sandboxes become cattle, not pets, and a brain crash doesn’t lose the run. A fresh container calls wake(sessionId) and reconstitutes the state from the log. They reported time-to-first-token dropped ~60% at p50 and over 90% at p95 just from being able to start inference before the sandbox is ready.

The session-as-event-log idea is the part most teams underappreciate. It is what makes a long-running agent recoverable. Without it, a container failure is a session failure and you’re debugging into a stale snapshot. With it, the agent’s memory is a queryable artifact that lives outside whatever process happens to be running at the moment.

For the scientific computing crowd, Anthropic’s “long-running Claude” post reduces all of this to a simpler stack: CLAUDE.md as a living plan the agent edits as it learns, CHANGELOG.md as portable lab notes, tmux plus SLURM plus git as the execution and coordination layer, and the Ralph loop, a for loop that kicks the agent back into context whenever it claims completion and asks if it’s really done. Their flagship case study is a Boltzmann solver Claude Opus 4.6 built over a few days that reached subpercent agreement with a reference CLASS implementation. Months to years of researcher time, compressed.

Same patterns across all three posts: an explicit plan file, an explicit progress file, structured handoffs between sessions, separate generation from evaluation, and a loop that refuses to let the agent stop early.

Cursor: Planners, workers, judges

Cursor’s “Scaling Long-Running Autonomous Coding” is the other essential read this year. They walked into walls that Anthropic mostly papered over.

Their first attempt was a flat coordination model: equal-status agents writing to shared files with locks. It became a bottleneck and made the agents risk averse, churning rather than committing. Their second attempt swapped locks for optimistic concurrency control, which removed the bottleneck but didn’t fix the coordination problem. The third design is what’s running in production now and what they describe as solving most of the problem:

Planners continuously explore the codebase and emit tasks. They can recursively spawn subplanners.
Workers are focused executors. They don’t coordinate with each other and they don’t worry about the big picture.
Judges decide when an iteration is finished and when to restart.

Two things stand out from the post. One: “A surprising amount of the system’s behavior comes down to how we prompt the agents” more than the harness or the model. Two: Different models slot into different roles. Their reported finding is that a GPT model was better than Opus for extended autonomous work specifically because Opus tended to stop early and take shortcuts. Same task, different role, different model. The matching is becoming part of the design surface.

This pairs with Composer 2 (their proprietary frontier coding model that ships in Cursor 3) and their background cloud agents: long-running tasks that run on Anysphere’s cloud infrastructure rather than your laptop. Eight-hour refactors and codebase-wide migrations survive a closed lid. You can start a task locally, hit run in cloud when you realize it’ll take 30 minutes, and reattach later from your phone. Each agent runs in an isolated Git worktree and merges back via PR. The handoff between local and remote is the part most teams haven’t figured out yet, and Cursor’s bet is that it has to be its own product surface.

The shape ends up close to Anthropic’s: Roles are split, sessions are durable, judges sit beside the worker, and a long task runs in a cloud sandbox with Git as the coordination substrate.

Google: Long-running agents on the Agent Platform

Google’s announcement at Cloud Next ’26 folded Vertex AI into the Gemini Enterprise Agent Platform and turned long-running agents into a named product, with named SLAs.

The pieces that matter for this post:

Agent Runtime supports agents that “run autonomously for days at a time” with sub-second cold starts and on-demand sandbox provisioning. The launch post’s example use case is a sales prospecting sequence that takes a week to play out, which is roughly the right shape for it.
Agent Sessions persist conversation and event history. You can pin them to a custom session ID that maps to your own CRM or DB record, so the agent’s state lives next to the business state instead of in a separate AI silo.
Agent Memory Bank is the persistent long-term memory layer, generally available as of Next ’26. It curates memories from sessions, scopes them to a user identity, and exposes a search API so the next agent invocation can pull what’s relevant. Payhawk reported that auto-submitting expenses through a Memory Bank-backed agent cut submission time by over 50%.
Agent Sandbox handles hardened code execution.
Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability, and Agent Simulation cover basically every operational concern you’d otherwise build by hand for a production fleet, including the cryptographic-identity-and-audit-log story enterprises actually need to ship.

Architecturally this is the same brain/hands/session split Anthropic described, just productized at platform scale and bundled with ADK (the code-first dev kit) and Agent Studio (the visual one). If you’re building inside Google Cloud, you don’t have to design a session log or a memory store from scratch anymore. You wire an ADK agent into Memory Bank and Sessions, deploy onto Agent Runtime, and the persistence question is answered.

Notice how much this looks like the pattern Anthropic and Cursor describe, just unbundled into named services with SLAs. Three years ago you’d have built all of this yourself. Now you pick which version of “decoupled brain, hands, and session” you want to rent.

Five patterns for long-running agents in production

Shubham Saboo and I wrote up five design patterns we’ve seen separate working long-running agents from demos. They aren’t Google-specific, but they map cleanly onto the primitives Agent Runtime now exposes, so it’s worth walking through them here in shortened form.

Checkpoint-and-resume. The most common multiday failure is context loss. An agent processes 200 documents over four hours, hits an error on document 201, and without a checkpoint you start from scratch. Treat the agent like a long-running server process: write intermediate state to disk, checkpoint every N units of work, recover from failures. The Agent Runtime sandbox gives you a persistent filesystem, but choosing the right checkpoint granularity (not every step, not only the end) is on you.

Delegated approval (human-in-the-loop). Most “human-in-the-loop” implementations are: serialize state to JSON, fire a webhook, hope someone responds. The state goes stale, the notification gets buried, the agent re-deserializes into a slightly different world. Long-running runtimes let the agent pause in place with full execution state intact: reasoning chain, working memory, tool history, pending action. Hours of human time pass, the agent consumes zero compute, and it resumes with subsecond latency. Mission Control is Google’s inbox for this. The pattern works regardless of vendor.

Memory-layered context. A seven-day agent needs more than session state. Memory Bank handles long-term curated memory, Memory Profiles add low-latency lookups, and the failure mode you’ll hit in production is memory drift: The agent learns a procedural shortcut from a few atypical interactions and starts applying it broadly. Govern memory like you govern microservices. Agent Identity controls who can read and write which banks. Agent Registry tracks which version of which agent is running. Agent Gateway enforces policy on the wire. The auditing question stops being “What are my agents doing?” and becomes “What are my agents remembering, and how is that changing their behavior?”

Ambient processing. Not every long-running agent talks to a human. Some sit on a Pub/Sub stream or a BigQuery table and act on events as they arrive: content moderation, anomaly detection, inbox triage. The architectural decision worth making early is to not hardcode policy into the agent. Define it in the Gateway and the fleet picks up policy changes without redeploys. Ambient agents run unsupervised for long stretches, and the only sane way to update a hundred of them is to update the policy layer once.

Fleet orchestration. In real systems, you rarely have one agent. A coordinator delegates subtasks to specialists (a Lead Researcher Agent, a Scoring Agent, an Outreach Agent), each running independently for different durations. Each specialist gets its own Identity (so the Outreach Agent can’t read financial data meant for Scoring), its own policy enforcement, its own Registry entry. This is the same coordinator/worker shape distributed systems have used for decades. What’s new is that ADK handles it declaratively with graph-based workflows, and a bad deployment in one specialist doesn’t cascade to the others.

The patterns compose. A compliance system might use checkpointing for document processing, delegated approval for review gates, memory layering for cross-session knowledge, and fleet orchestration to coordinate the specialists. The opening question is always the same: What’s the longest uninterrupted unit of work your agent needs to perform? Minutes, and you don’t need long-running agents. Hours or days, and these patterns are where to start. The full write-up with code samples covers each pattern in depth.

So how do you actually build one today?

This is the practical question, and it has a different answer depending on what you’re building.

You’re a developer who wants long-running coding work on your own repo. Just use Claude Code (or Antigravity, Cursor, or Codex). The harness is already there. Treat your AGENTS.md like a pilot’s checklist: short, every line earned by a real failure. Add hooks for typecheck and lint that surface failures back to the agent. Write a plan file before the agent starts. Use the Ralph loop when the agent claims it’s done and you don’t believe it. For multihour or overnight jobs, run in a worktree so a closed laptop doesn’t kill the run, and have it commit progress every meaningful unit of work. This is the path most people should take, and it’s where the most leverage is right now.

You’re building a hosted agent product. Don’t build the runtime. Pick a managed one. The three real options today: Google’s Agent Platform (Agent Engine + Memory Bank + Sessions), Claude Managed Agents, or roll something on top of ADK, the Claude Agent SDK, or Codex SDK and host it yourself. The trade-off is the usual one. Managed gets you the brain/hands/session split, observability, identity, and an audit trail out of the box. Self-hosted gets you control and the ability to use weird models for weird roles (Cursor’s pattern). For most teams, the right starting point is a managed runtime plus your own ADK or SDK code for the actual loop.

You’re doing something autonomous and operational (monitoring, research, ops). Memory Bank-style persistence is what you want, and it’s the part that doesn’t exist in Claude Code. ADK + Memory Bank + Cloud Run + Cloud Scheduler is the cleanest stack I’ve seen for “agent runs every N hours, accumulates state, alerts on a threshold.” This is also where Cursor’s planner/worker/judge split starts to matter more than it does for IDE coding, because the work is genuinely parallel and the failure modes are different.

A few things matter regardless of which path you take.

Write down the done condition before the agent starts. This is the single highest-leverage move for long runs. The Anthropic harness post calls it the feature list; Cursor calls it the planner’s task spec. Either way, it’s an external file with explicit, testable completion criteria, and it exists so the agent can’t quietly redefine done midrun.

Separate the evaluator from the generator. Self-grading is the failure mode. A planner/worker/judge pipeline, or a generator/evaluator pair, is a real architectural pattern, not a stylistic preference. Even if it’s the same model in different roles with different prompts.

Invest in the session log, not just the prompt. The append-only event log is what makes the agent recoverable, debuggable, and auditable. If you can’t reconstruct what the agent did in the last 24 hours from durable storage, what you have is a long-running shell script that happens to call an LLM, not a long-running agent.

Treat compaction and context resets as first class. Anthropic is explicit that summarization-as-compaction wasn’t enough for very long jobs; they had to do full context resets where the harness tears the session down and rebuilds it from a structured handoff file. It is essentially how humans onboard a new engineer.

There are some real limitations right now

A few things are still genuinely unsolved.

Cost. A 24-hour run with a frontier model and a few tools is not cheap. Without budgets, circuit breakers, and a hard cap on tool spend, an agent can quietly burn through a week’s API budget in an afternoon. This is solvable, but it’s an explicit step you have to take.

Security. A long-running agent with API keys, cloud access, and the ability to run shell commands has a much larger attack surface than a chat session. The brain/hands separation pattern matters here too: Credentials should be unreachable from the sandbox where model-generated code runs, which is one of the benefits Anthropic calls out for Managed Agents.

Alignment drift. Over many context windows, agents drift. The original goal gets summarized, then resummarized, then loses fidelity. This is the part hooks and judges exist to defend against. It is also the most common reason “the agent went off and did something I didn’t ask for.”

Verification. Auditing 24 hours of autonomous activity is a real human-time problem. Observability and structured artifacts (PRs, commits, briefings, test runs) are how you make this tractable. Without them, you’re scrolling logs and you’ll miss what matters.

The human role. This is the one I keep coming back to. Defining work crisply enough that an agent can run for a day on it is harder than doing the work yourself. The skill that’s appreciating in value isn’t writing code. It’s writing specs that survive contact with an autonomous executor.

Where this is going

Google, Anthropic, and Cursor have converged on roughly the same shape. Separate the model loop from the execution sandbox from the durable session log. Split planning from generation from evaluation. Bake in compaction, hooks, and context resets. Expose memory as a managed service that any agent invocation can query.

Surface area is what differs. Google’s Agent Platform is the enterprise-stack version, with the identity and audit trail story baked in. The patterns underneath are the same. Claude Managed Agents is “Anthropic’s harness, hosted.” Cursor’s background agents are “long-running coding, pulled out of the IDE and into the cloud.”

The harder problems for the next year aren’t in any of those layers individually. They’re in the coordination above them. Many long-running agents on a shared codebase. Agents that read their own traces and patch their own harnesses. Harnesses that assemble tools and context just in time for a task instead of being preconfigured at startup. That’s where the agent stops looking like a smarter chat window and starts looking like a colleague who’s been on the project longer than you have.

The model is still load-bearing. But the gap between a chat window and an agent you can leave running overnight is mostly in the state, sessions, and structured handoffs wrapped around it. That’s where I’d spend my learning time right now.

The AI Agents Stack (2026 Edition)

Paolo Perrone — Mon, 08 Jun 2026 10:56:59 +0000

The following article originally appeared on Paolo Perrone’s The AI Engineer Substack and is being reposted here with the author’s permission.

Your team picks LangGraph for a customer support chatbot. Three weeks in, you’ve got 14 nodes in a state graph, a custom checkpointer writing to Redis, and retry logic for tool calls that fail once a week. The agent answers refund questions. It calls one API. A 50-line script on the OpenAI SDK with two MCP servers would have done the same thing. But nobody mapped which layers the problem actually needed.

In November 2024, Letta published an AI agents stack diagram that became the default reference for half the engineering teams I talk to. If you’ve seen a “layers of an agent” visual on LinkedIn or pinned in a Slack channel, it probably traces back to that article.

That diagram is 14 months old now, and a lot has changed since. MCP didn’t exist yet. Memory was still treated as a subset of your vector database. Nobody was shipping provider-native agent SDKs. Eval wasn’t even on the map. The stack has six layers in 2026, and at least three of them didn’t exist as distinct categories when Letta drew the original.

So we drew it from scratch. This is the 2026 version.

TL;DR

That’s the starting stack. Add complexity when something specific breaks, not before.

What are we even mapping?

Before the stack, there was a loop. In “What Is an AI Agent?,” we defined an agent as the think-act-observe cycle: The model reasons about a task, takes an action (calls a tool, writes to memory), observes the result, and loops until the task is done. That loop is the atomic unit. Everything in this issue is infrastructure that makes that loop work reliably, at scale, in production.

The agent stack is not the LLM stack. A chatbot needs inference and maybe RAG. An agent needs state management across multistep execution, tool access governed by protocols, memory that persists across sessions, autonomous reasoning loops, and guardrails that constrain behavior in real time. That’s a fundamentally different set of infrastructure problems.

We’re mapping the six layers between your LLM and a production agent. We’re not covering training infrastructure, data pipelines, or model fine-tuning. Those are adjacent stacks. We covered RAG in depth in Issue #5. Today we’re zooming out to show where RAG fits in the bigger picture.

Three things redrew the map between 2024 and 2026. MCP standardized tool connectivity, and the entire tools layer is new because of it. Reasoning models changed what agents can do autonomously, with single-call agents replacing some multistep chains. And memory became a first-class architectural primitive, not an afterthought bolted onto a vector database.

How to evaluate each layer

When choosing tools at each layer, ask three questions. How much state do you need to manage? A stateless tool caller and a multi-session agent that learns over time are different engineering problems, and the layers where state management is hardest (memory, frameworks) are where most teams get stuck. How much vendor lock-in can you tolerate? MCP is an open standard, provider SDKs are not, and every tool choice either increases or decreases how painful your next migration will be. And how hard is it to go from demo to production? Some layers (model serving) have almost no gap, while others (eval, guardrails) have a massive one. The layer where you feel that gap most is the one to invest in first.

We take each layer from the bottom up, starting with the most stable and ending with the least mature.

Layer 1: Models and inference

How you run the model that powers your agent: call an API, use a managed open weight provider, or self-host.

The inference layer changed more in tone than in substance. Reasoning models like o1, o3, DeepSeek R1, and Claude with extended thinking shifted what agents can plan and execute. Agents that previously needed multistep chains can now solve problems in a single reasoning call. Open weight models like Llama 3.3, DeepSeek V3, and Qwen 2.5 closed the quality gap dramatically, so “always use the biggest closed model” is no longer default advice. The emerging pattern is to prototype on closed source and deploy on open weight.

The honest take: This layer is commoditizing. Model differences matter less each quarter. The real decision is the cost and latency trade-off, not which model is “smartest.”

On the evaluation side, API calls are stateless. Send a request, get a response. Nothing to manage. Lock-in risk runs high for closed APIs because each model reasons differently, so switching providers means retuning prompts, adjusting for different failure modes, and retesting your eval suite. It’s low for open weight, where you can swap the model and keep the infra. The prototype-to-production gap is the smallest of any layer. Your demo API call is the same as your production API call.

Self-host when your agent call volume makes API pricing untenable or when you need sub-100ms latency that API round-trips can’t deliver.

Layer 2: Protocols and tools

How your agent calls external tools and APIs: through MCP servers, browser automation, or agent-to-agent protocols.

This layer didn’t exist as a distinct category in 2024. Every framework had its own JSON schema for tool definitions. Now MCP is the standard, with 97M monthly SDK downloads, adoption by OpenAI, Google, and Microsoft, and a donation to the Linux Foundation.

Browser Use exploded in parallel, hitting 78K GitHub stars in under a year. Nobody was shipping browser agents in production in 2024. And agents can now talk to other agents. IBM launched ACP, and Google launched A2A. Neither is standard yet, but the problem they solve (agents coordinating with other agents) is real and growing.

Security is the open problem. Endor Labs analyzed 2,614 MCP servers and found 82% prone to path traversal and 67% to code injection.

The honest take: The protocol debate is over. MCP won. The only question left is how you lock down your MCP servers before someone exploits them.

State management is nonexistent here. Your agent calls a tool, gets a response, done. No session, no memory between calls. Lock-in risk is low because MCP is an open standard, so if you build MCP servers, any MCP-compatible agent can use them. The prototype-to-production gap is medium. Your demo MCP server works until someone sends a malicious tool description. Security and governance are the gap.

MCP standardized how agents use tools. It says nothing about how agents talk to each other. ACP and A2A are trying to solve that, but neither has reached critical mass. If you need multi-agent coordination today, you’re building it yourself at the framework layer. We covered MCP in depth in Issue #4.

Layer 3: Memory and knowledge

How your agent stores and retrieves what it knows: in-context state, vector search, or persistent memory across sessions.

All three tiers feed into the same place: The context window your agent sees on every call.

In 2024, memory meant “pick a vector database and do RAG.” In 2026, memory is a first-class architectural primitive with three distinct tiers. Context windows got massive. Gemini hit 1M+ tokens, Claude 200K. Bigger windows didn’t kill the need for memory. They changed the trade-off: What do you stuff in-context versus what do you retrieve on demand?

“Context engineering” replaced “prompt engineering” as the core discipline. Instead of writing a better prompt, you architect what information the agent sees on every call. Memory blocks appeared as named, structured fields in the context window that the agent can read and overwrite every turn. Instead of dumping everything into the system prompt, the agent manages its own state: what to keep, what to update, what to drop.

On the infrastructure side, pgvector became the default for teams that don’t need a dedicated vector database. It’s just Postgres with an extension. GraphRAG emerged as a second retrieval option: follow relationships between entities instead of matching embeddings, with Neo4j leading this space. Sleep-time compute, where agents process information during idle time, is research stage but signals where tier 3 is heading.

The honest take: Most teams overcomplicate memory. Start with conversation history in Postgres and a structured system prompt. Add vector search when your history exceeds context limits. Add agentic memory management only when your agent needs to learn across sessions.

This IS the state layer. You’re deciding what your agent remembers, how it retrieves it, and when it forgets. Highest complexity in the stack. Lock-in risk is medium. pgvector is portable because it’s just Postgres, while specialized tools like Mem0 or Zep are harder to migrate away from. The prototype-to-production gap is large. Demo memory works because context windows are big enough. Production memory breaks when conversations get long and your agent starts forgetting the important parts.

In-context memory breaks down when agents need to share memory across instances or maintain state across model provider switches. That’s where dedicated memory infrastructure like Letta, Zep, and Mem0 earns its keep.

Layer 4: Frameworks and SDKs

How you wire together the model calls, tool use, and control flow that make your agent work: a provider’s built-in toolkit (SDK), a graph-based framework like LangGraph, or raw code.

Every major AI lab now ships its own agent SDK. OpenAI has the Agents SDK (evolved from Swarm). Google released ADK. Microsoft has Semantic Kernel and AutoGen. Hugging Face built smolagents. Two years ago, LangChain was the only game. Now you pick between three camps: provider SDKs that are fast to start but locked to one model, graph-based frameworks like LangGraph that are portable but require more setup, or no framework at all. That choice didn’t exist in 2024.

LangGraph solidified as the graph-based orchestration leader with v1.0 released October 2025 and production deployments at Uber, JPMorgan, LinkedIn, and Klarna. LangChain agents are now built on LangGraph under the hood. Meanwhile, the “build it yourself” camp grew. Teams that tried LangChain in 2024 and fought the abstraction are now writing thin wrappers over provider APIs + MCP. No framework means full control. This works until your agent needs state management or complex branching.

A quick note on naming: “LangChain” and “LangGraph” are not the same thing. LangChain is the integration layer handling model connectors, tool calling, and prompt templates. LangGraph is the orchestration engine managing state, control flow, and graphs. Most production teams use both together, but LangGraph is where the agent logic lives.

The honest take: Most teams pick too much framework. If your agent calls a model and a few tools, you don’t need LangGraph. A provider SDK and a couple of tool calls will get you to production faster than any graph.

Provider SDKs manage state for you. LangGraph makes you define every state transition explicitly. Build-it-yourself means you roll your own. Lock-in risk is the highest in the stack. Your orchestration code doesn’t port. A LangGraph agent rewritten for CrewAI is a new codebase. Provider SDKs are worse because you’re locked to one model too. The prototype-to-production gap is large. Demo works because nothing goes wrong. Production means handling tool failures, retries, timeouts, and humans who need to approve before the agent acts.

The framework you pick determines your migration cost. Provider SDKs are fastest to start but lock you to one model. LangGraph is portable but complex. Building your own gives you full control until your agent outgrows your wrapper. MCP is the one layer that transfers across all three camps.

Layer 5: Eval and observability

How you measure whether your agent is doing its job: tracing runs, scoring outputs, and catching regressions before users do.

This layer barely existed in 2024. Now it’s the gap. LangChain’s State of Agent Engineering survey found 89% of teams with production agents have implemented observability, but only 52% have evals. That 37-point gap is where production quality dies.

“Evaluation as infrastructure” is converging on three tiers: fast checks on every PR (Did the agent call the right tools?), nightly regression suites that use an LLM to judge output quality, and continuous production monitoring that alerts when agent performance drifts. New agent-specific benchmarks have emerged too, including Context-Bench for memory management, Recovery-Bench for error recovery, and Terminal-Bench for coding agents.

The honest take: Most teams skip eval until something breaks in production. By then they’re debugging blind. The teams that don’t have this problem built evals before they deployed.

State management matters here because your agent runs 12 steps, step 3 picked the wrong tool, and steps 4–12 were doomed from there. If your eval only checks the final output, you’ll never know why. Lock-in risk is moderate. Most tools export OpenTelemetry traces, so switching observability providers is doable, but switching eval frameworks means rebuilding your test suites. The prototype-to-production gap is the biggest of any layer. Most prototypes have zero eval. You don’t feel the pain until production users find the failures for you.

Current eval tools are strongest for single-turn and tool-calling evaluation. Multi-agent evaluation, long-horizon task assessment, and evaluating agents that learn over time are all unsolved problems. If your agent does any of those, you’ll need custom eval infrastructure beyond what the platforms offer today.

Layer 6: Guardrails and safety

How you stop your agent from doing things it shouldn’t: filtering inputs, authorizing tool calls, and validating outputs.

Agent guardrails became a separate discipline from LLM guardrails. In 2024, guardrails meant input/output filters on a model. In 2026, your agent calls tools, spends money, and takes actions. Guardrails now means authorizing tool calls, enforcing rate limits, and validating what the agent actually did.

The “guardrails before action” pattern emerged from teams that learned the hard way. They now enforce authorization at the tool execution layer, not the output layer. By the time you filter the response, the agent already sent the email. OWASP published the MCP Top 10 (beta), which is the first real security checklist for tool-connected agents. Deployment is still DIY. LangGraph Cloud and Bedrock Agents exist, but most production teams are still deploying with FastAPI and their own infra. This layer is where you’ll spend the most unplanned engineering time.

The honest take: This is the least mature layer in the stack. No dominant framework, no established patterns. You’re writing policy code from scratch.

Guardrails need to know what the agent is doing right now to decide what it shouldn’t do next. That means tracking agent state in real time. Lock-in risk is low because most guardrails are custom policy code you write yourself. NeMo Guardrails is the closest thing to a framework, but you’ll still write most rules from scratch. The prototype-to-production gap is effectively infinite. Your demo has no guardrails because nobody’s trying to break it. Production will.

Current guardrails tools focus on single-agent systems. If you’re running multi-agent workflows where agents delegate to each other, guardrail propagation across agent boundaries is an unsolved problem. You’ll need custom authorization logic.

What are you building?

This is the decision that cuts through the framework confusion. The agent type determines which layers you invest in and which tools to pick at each one.

A stateless tool caller answers questions from a knowledge base, looks up an order, or checks inventory. You need a provider SDK, MCP, and Postgres. No framework, no vector database. This is a weekend project.

A multistep workflow processes a refund end to end, reviews a PR across five files, or triages and routes support tickets. Steps depend on each other, things fail in the middle, and humans need to approve before the agent acts. You need LangGraph, MCP, and eval. Build evals before you deploy because these agents break silently.

An agent that learns remembers your preferences across sessions, gets better at your codebase over time, or tracks project context across weeks. You need a memory-first architecture, a vector DB, and eval. Orchestration is the easy part. The hard part is deciding what to remember, what gets dropped, and how you stop old context from polluting new answers.

A multi-agent system has agents that delegate to other agents, split a research task across specialists, or run parallel workstreams. You need the full stack. Two agents passing context to each other is already hard to debug. Five is impossible without trace-level evals on every handoff. Build eval infrastructure before you build the second agent.

Coding agents: All 6 layers in action

Coding agents like Cursor, Claude Code, Codex, and Windsurf are the most proven application of the AI agents stack. All six layers, working together.

At the inference layer, these tools serve hundreds of millions of daily requests. Cursor routes between Claude, GPT-4, and its own fine-tuned models depending on the task. At the protocols layer, MCP servers connect to editors, terminals, filesystems, and Git, which is how the agent reads your code and runs commands. The memory layer uses codebase-aware retrieval with reranking. The agent doesn’t read your whole repo. It retrieves the files that matter for this specific edit.

At the framework layer, these are custom orchestration systems with RL loops. Not LangGraph, not a provider SDK. Purpose-built control flow for code generation, review, and iteration. At the eval layer, Cursor retrains its acceptance-rate model every 90 minutes based on whether users accept or reject suggestions. That’s eval running in production, continuously. And at the guardrails layer, sandboxed execution prevents runaway agents. The agent can write code and run it, but inside a container that limits what it can touch.

The AI agent stack cheat sheet

Every layer scored on the three questions from the evaluation framework: How much state do you need to manage? How much vendor lock-in can you tolerate? And how hard is it to go from demo to production?

The bigger picture

Most teams are building like it’s still 2024. They pick LangGraph before they know if they need state. They add a vector database before they’ve outgrown Postgres. They design multi-agent architectures before they’ve shipped one agent that works. The decision flowchart above exists because a tool-calling chatbot and a multi-agent research system share almost no infrastructure. Treat them the same and you’ll overbuild the first and underbuild the second.

The teams that got past this run evals on every deploy, not once a quarter. Their guardrails sit at the tool call layer, not the output layer. Their memory architecture was designed, not inherited from whatever the framework defaulted to. Most teams ship the opposite: no evals, output-only filtering, and a system prompt that grows until the context window chokes. The gap isn’t talent or budget. It’s knowing which layers matter for your specific agent instead of half-building all six.

The stack is going to collapse. Provider SDKs are already absorbing memory, tool calling, and basic eval into a single API. By early 2027, most teams won’t build each layer separately. They’ll get an increasingly opinionated stack from their model provider and that will be fine for 80% of use cases. The other 20%, agents at scale where the defaults break, will still build custom at every layer. But even then, when something fails in production, you need to know which layer failed. That’s what this article is for.

Sources

“The AI Agents Stack,” Letta, November 2024.
“Donating the Model Context Protocol and Establishing the Agentic AI Foundation,” Anthropic, December 2025.
“120+ Agentic AI Tools Mapped Across 11 Categories [2026],” StackOne, February 2026.
Henrik Plate and Darren Meyer, Dependency Management Report, Endor Labs, January 2026.
Jason Liu, Context Engineering Series: Building Better Agentic RAG Systems, August 2025.
“LangChain and LangGraph Agent Frameworks Reach v1.0 Milestones,” LangChain, October 2025.
State of Agent Engineering, LangChain, December 2025.
Yunfei Bai, Allie Colin, Kashif Imran, and Winnie Xiong, “Evaluating AI Agents: Real-World Lessons from Building Agentic Systems at Amazon,” Amazon, February 2026.
OWASP MCP Top 10, OWASP.

Radar

Stop Getting Good at Protocols. Get Good at Agent Experience.

We keep falling into the tool trap

So what is Agent Experience?

AX is an extension of what you already care about

The protocol treadmill doesn’t work

What an AX practice looks like

The hard part is familiar

Principal Drift

Loop Engineering

The five pieces, and then notes

Automations, this is the heartbeat

Worktrees, so parallel doesn’t turn into chaos

Skills, so you stop explaining your project every single time

Plugins and connectors, the loop touches your real tools

Subagents, keep the maker away from the checker

What one loop looks like

What the loop still does not do for you

Build the loop. Stay the engineer.

This Week in AI: Fable 5, the Clone Wave, and Uber’s AI Reality Check

Claude Fable 5: 3 days, a government order, and a lot of unanswered questions

Tokens burn fast when the loop isn’t ready for them

Ingredients beat inference: A practical framework for building in the clone wave

What’s next

Kubernetes in the Age of AI

Generative AI

Agentic AI

The fundamentals

The Case Against Building Your Own Agent Platform

Build versus buy, flipped in a year

Most “agent platforms” aren’t

Memory

Governance

Eval

Orchestration

The honest case for building

Five questions before you commit

What this looks like in 2 years

Sources

Primary sources

Secondary Sources

Linear Thinking, Nonlinear Costs

Old problems wearing new clothes

The cost multiplier, repeated-work problems, and backtracking

Memoization, pruning, and dynamic programming

Where to start: Optimization follows topology

Who Owns the Code Claude Wrote?

The copyright rule nobody told you

What your employer probably already owns

The open source contamination problem

What to do about all of this

1. Run a license scan on your AI-assisted codebase

2. Document your human creative contributions as you go

3. Read the IP clause in your employment contract before you build anything on the side

4. Check which Anthropic plan you are on before shipping for commercial use

The thing worth sitting with

A note on what this piece covers and what it does not

Further reading

This Week in AI: The Next-Gen Recommendation Experience

Recommendation systems are a bigger deal than most companies realize

The agent hype has a recommender-shaped hole in it

The responsible AI conversation has left the research community

What’s next

Generative AI in the Real World: Agentic Systems Fundamentals with Maarten Grootendorst

Transcript

When Context Collapses: Teaching Agents to Detect and Recover from Lost Memory

Brute-forcing my way through context loss

Externalize: Add two layers of state to your agent

Recognize: Detecting loss from inside the agent

Building a detection mechanism

But didn’t you just say the AI can’t detect its own compaction?

Rehydrate: Reading back the state

Context management is an architectural concern

The PM’s Playbook for Shipping AI Features That Actually Work in Production

The demo to production Death Valley

Latency budgets

Designing fallbacks

Quality measurement

A/B testing AI features

Model drift monitoring