Radar

When Context Collapses: Teaching Agents to Detect and Recover from Lost Memory

Andrew Stellman — Thu, 11 Jun 2026 10:59:13 +0000

This is the eighth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, part four here, part five here, part six here, and part seven here.

“640K ought to be enough for anybody.”—Bill Gates (allegedly)

If you’re building AI agents that do complex, multistep work, you’re going to run into context loss. The agent’s working memory fills up, older information gets silently dropped or compressed, and the agent keeps going without realizing it’s forgotten something. This article, the third in my Radar article trilogy about context management, walks through a pattern I’ve been refining for detecting and recovering from that problem, which I call the externalize-recognize-rehydrate pattern (or ERR, which I think is actually a pretty good acronym for an error recovery pattern): save your agent’s state to files on disk, detect when context has degraded, and reload from those files to recover. The individual techniques are standard practice in agent and skill engineering—checkpointing, progress files, state verification—but the real power comes from combining them into a coherent workflow that you can use live or build into your agents. I’ll walk through each step with specific prompts you can adapt for your own agents and coding sessions.

Which brings me to memory. Gates has said on multiple occasions that he never actually said that quote at the top of this article, but it endures because it captures one of the core limitations of that era, one that people struggled with constantly, in a way that we can laugh about now. Around that time I was using a 286 with 1 MB of RAM. That’s megabytes, not gigabytes. MS-DOS 3.3 gave me 640K of conventional memory plus 384K of upper memory, and I spent a lot of time figuring out how to use every bit of it. I configured memory managers, loaded device drivers high, used (and wrote!) terminate-and-stay-resident programs that moved themselves out of conventional memory to free up space, and generally treated memory as a resource that required active, deliberate engineering. There was a lot I wanted to do that didn’t fit into 640K, and like most people at the time, I went to some lengths to compensate for the memory limitations.

We’re at the 640K stage of AI development. The context window is the new RAM ceiling. Most of today’s models give you somewhere between 200K and 2M tokens of working memory (and, like memory in the late 1980s and early 1990s, those numbers are growing all the time), and if you’re building agents that do complex multistep work, you will hit that ceiling. When you do, the AI starts compacting: compressing or dropping older parts of the conversation to make room. And just like running out of conventional memory on a 286, things stop working right and you’re not sure why.

In 20 years we’ll be looking back at today’s puny context windows and wondering how developers in the 2020s managed to get anything done with just a few million tokens. Because none of this is new. In case you don’t believe me, here’s a photo of my dad at Princeton in the early 1970s working on an Evans and Sutherland LDS-1 graphics computer, the first commercial vector graphics machine, connected to a PDP-10 mainframe:

The actual LDS-1 is in the large cabinet in the background, directly behind the monitor. Sitting next to it, just out of the picture, is an even larger cabinet that holds a memory unit with 16K of magnetic core memory (technically 8K words).

So you can imagine that just a decade later, 640K in a tiny PC that fit on your desktop seemed extravagant.

In the last two articles in this series (“Why Doesn’t Anyone Teach Developers About Context Management?” and “Your AI Agent Already Forgot Half of What You Told It”), I talked about what context is and why context management matters, and I shared practical techniques and prompts for keeping important information in files instead of leaving it in the AI’s context window. This article gets more technical. I want to build on those strategies and talk about how to build agents that can detect when they’ve lost context and recover from it on their own.

Brute-forcing my way through context loss

I’ve been doing this kind of context management for a while now, long before the specific tools I’m about to describe existed. But a recent crash gave me a clean example of what the process looks like in its most brute-force form.

I was working in Copilot with a seven-step plan, going through it one step at a time, having another AI review each step before moving on. Steps one and two went fine. When it came time to do step three and I gave it the prompt, it jumped straight to step four. This kind of thing can be really frustrating, because it seems like an AI smart enough to implement a complex feature in code should be able to (ahem) count to four.

The key to not getting frustrated when the AI loses track of steps or can’t seem to count from prompt to prompt is to remember what it’s good at and how it remembers things. If the AI you’re using does that, check the conversation history. You’ll probably see something like “summarizing conversation history” or “compacting conversation” somewhere above your last message. That’s telling you that the AI lost track of where it was because that count was literally purged from its memory.

AIs are good at carrying out an instruction. They’re bad at keeping track of their own state over a long conversation, and the way they manage their memory is a big part of that. This article is about finding ways to build your AI tools so you’re not relying on them to do the thing they’re worst at.

But compaction isn’t the only way your AI loses context. A few weeks ago I was deep into a long session with Copilot, working through a multiphase code review. I’d spent a while building up context with the AI about my codebase and the decisions we’d made together. I was about to move on to the next phase, and then I got this:

The entire context was wiped, which could have been a really frustrating problem, since I had a long history with the session, and it had built up a lot of knowledge about what we were doing. This turned out to be a bug in Opus 4.6’s interaction with Copilot’s conversation history, and I’ve seen other people hit the same thing. I was staring at a fresh prompt with nothing in it.

So I did something that, in retrospect, is a pretty good brute-force version of what this whole article is about. I recognized the context was gone (hard to miss when the whole conversation disappears). I copied the entire conversation out of Copilot and pasted it into a text file. Then I gave the new session a prompt:

We were in the middle of a long conversation, then I got an error and the entire context was wiped. I saved a copy of the conversation in #file:chat_history.txt, read it and bring yourself back up to speed.

And it worked! This brought the new session back to where I needed it to be.

That simple error and recovery actually outlines a pretty good pattern for dealing with context loss:

Externalize the state. Get the important information out of the conversation and into a file on disk, where it won’t disappear when the context window reshuffles.
Recognize the loss. Notice that the agent’s working context has been wiped or degraded, whether that’s obvious (like a crash) or subtle (like output that quietly stops making sense).
Rehydrate from the file. Point a new session at that file and let it rebuild its understanding from what’s written down.

The individual mechanics are well-documented across cognitive science (cognitive offloading, task resumption), software engineering (the Memento pattern, React hydration), and knowledge management (the SECI model). I’m not claiming to have invented any of them. But the specific abstraction of these three phases into a unified, named pattern applied to AI context management is, as far as I can tell, new. It’s synthesis and codification, not invention.

In this case I did it with copy and paste, which isn’t particularly elegant, but it worked for me. But this is a blunt instrument, because a raw conversation dump is both too much and too little: it’s too much because it’s full of noise, like tool calls, dead ends, back-and-forth that doesn’t matter anymore; and it’s too little because the context that got silently compressed away during the session is already gone. When you build these mechanisms into agents and skills, you can do it in a much more subtle and automated way.

Externalize: Add two layers of state to your agent

The idea behind externalization, or periodically saving your agent’s state, came out of a conversation I was having with an AI assistant while building the Quality Playbook, an open source AI coding skill that runs structured code reviews. The playbook runs a structured code review as a single process, but that process could easily turn into a 15-million-token request if you tried to do it all in one shot. I described in the previous article in this series how I broke it into six phases, and that was only possible because the context for each phase had already been externalized. Each phase reads its inputs from files, does its work, writes its outputs to files, and stops. The next phase picks up from the files, not from whatever the agent remembers. If this sounds like the familiar advice to ask the AI to plan before you ask it to implement, it’s the same principle applied to context management. Separating each step and persisting the output means you can inspect it, and the next step doesn’t depend on the agent’s memory.

But what should those files contain? I found that the AI is actually good at figuring that out. At some point I asked the assistant:

Would it make sense for the agent to record more context in files as it progresses, to make sure nothing is dropped along the way? It should work even if you break it into separate prompts, because the result from each step is persisted. Plus, we can audit its reasoning for debugging and improvement.

That prompt was all it took. The assistant designed the file structure itself: a progress tracker that records which phase is active and what’s been completed, a JSONL artifact file (JSONL is just a file with a bundle of JSON objects, with one record per line) where each pass appends its output, and a set of brief documents describing the purpose of each phase. You don’t need to overengineer this. Tell the agent what you’re trying to preserve and let it figure out the file layout.

What emerged falls into two categories that I think of as execution continuity and task continuity:

Execution continuity is the state the agent needs to resume work in the middle of a task: what step it’s on, what it’s completed, what decisions it’s made so far. These files change constantly as the agent works.
Task continuity is the broader context that doesn’t change during execution: what the whole task is about, what success looks like, what the structural constraints are. These files are written once and read at every resumption.

When an agent needs to resume after suspected compaction, it reads back both layers. The task continuity files anchor it back to what the whole endeavor is about. The execution continuity files put it back in the middle of the work. Together, they give the agent enough information to continue without relying on anything that might have been compacted.

The key is that externalization isn’t something you do once at the beginning of a task. You want the agent saving its state at frequent checkpoints so that if compaction happens mid-run, the most recent checkpoint is close to where the agent was working. Here’s the kind of instruction I gave the agent for tasks that processed records one at a time:

Update the progress file after every single record, not in batches. Write the output line first, then update the progress file with the new cursor and a fresh timestamp. If the progress file’s timestamp falls behind the output file’s, you’re batching and that’s wrong.

The frequency matters because context can compact at any point. If the agent only saves state at the end of a long run, compaction in the middle means losing everything since the start. If it checkpoints after every unit of work, the worst case is losing one unit.

Two-layer externalization survives context reshaping, not only outright context loss. Even if the agent’s context window isn’t full, if the context has been reorganized or reprioritized (a compression that reshapes without truncating), the agent can reload the external files and know for certain what the ground truth is.

Recognize: Detecting loss from inside the agent

The second step in the pattern is to recognize that your agent has lost context, and it turns out to be the hardest part (at least with today’s AI technology). When the context window fills up, the AI compacts silently, and the agent keeps working without realizing it’s lost information. The agent can’t tell you it’s forgotten something, because it doesn’t know it forgot. Detecting that change turns out to be a nontrivial problem; I’ll walk you through an approach that helped me, and keep it general enough so you can do the same thing. The copy-and-paste approach works when the context loss is obvious, like a crash that wipes your whole conversation. But most context loss isn’t that visible.

I described context compaction in the previous article, but it’s worth restating the core problem from the agent’s perspective. Different tools handle context overflow differently: Some truncate older messages; some compress conversations into summaries; some use a sliding window. But they all have the same effect. Information disappears from the agent’s working context, and the agent doesn’t get notified.

This was a challenge when I built the Quality Playbook, because it runs multiple passes over a codebase, each one reading source files, extracting requirements, and checking coverage. Each pass can involve enough work that it fills the context window multiple times over. And when context compacts mid-pass, the agent doesn’t know it happened. It keeps working, but the output starts silently degrading. So I started building mechanisms for the agent to detect compaction and recover by reading back the files it had written earlier. The patterns that came out of that work are general enough to apply to anyone building agents that need to survive context pressure.

From the agent’s perspective, compaction is seamless. It’s tracking state, referencing decisions made earlier in the conversation, and then at some point the earlier context is gone. But the agent can’t tell the difference between “I never knew that” and “I knew it but lost it.” It tries to reference something and finds nothing, or finds a compressed version that lost the nuance. And because the agent doesn’t know it lost anything, it doesn’t know it needs to recover.

This invisibility is the core problem. But it turns out you can work around it, and the next two sections walk through how.

Building a detection mechanism

Once you have files on disk, the question is what specifically to check and how to know when something has gone wrong. I landed on a mechanism while building the Quality Playbook’s requirement extraction pipeline. The playbook processes source documents in multiple passes, and each pass appends its output to a JSONL artifact file. After each unit of work, the agent also writes a progress record to a separate file: what it just finished, what it found, and where it should pick up next.

The detection mechanism comes from two rules I gave the agent. The idea is that the progress file tracks a cursor, which is just a position marker that tells the agent which record to process next. If the agent writes a record to the output file but then loses context before updating the progress file, those two files will be out of sync.

The agent didn’t need to understand any of that upfront; I just described the rules in plain language and let it figure out the implementation. The first rule establishes an invariant between the output file and the progress file:

Cursor advances only after the line is on disk. Write the summary line to the output file first, then update the progress file. The cursor must always equal the index of the next record that still needs to be processed.

The second rule told the agent how to check that invariant on startup:

On startup, read the progress file. Resume from its cursor value. Verify continuity: the last line in the output file should equal cursor minus one. If not, roll the cursor back to match disk state and report the discrepancy.

If the progress file says the cursor is at record 381, but the last line in the output file is record 379, something happened. The context compacted and the agent lost track of where it was. The divergence between the two files is the signal.

This worked because files on disk don’t change when context compacts. They’re written once and then read repeatedly. If what the agent thinks it knows doesn’t match what’s actually in the files, something shifted in the agent’s memory, not on disk. I ended up folding this check into a preamble that every session started with:

If this session has experienced auto-compaction, re-read the pass specification from disk. Do not try to reconstruct it from the compacted summary. Read the progress file. Read the last record of the JSONL artifact and confirm its index equals the cursor minus one. If not, roll the cursor back to match disk state. Disk is the source of truth. The conversation is not.

That preamble ran at the top of every session. During one particularly intensive day of pipeline development, I ran over a hundred Claude Code sessions with that exact instruction. Most of them completed without hitting compaction. But the ones that did hit it recovered cleanly, because the preamble told the agent exactly what to check and exactly what to do when the check failed.

The specific prompts I used are tied to the Quality Playbook’s file structure, but the technique generalizes. If you’re building any agent that does multistep work, you can adapt the same approach. Here’s a version you could drop into a session preamble or an agent’s system prompt:

Before continuing any task, read your progress file and your most recent output file. Compare them: does the progress file say you’ve completed work that isn’t reflected in the output? If so, trust the output file, roll back your progress to match, and note the discrepancy. Do not rely on what you remember from the conversation. The files on disk are the source of truth.

The wording doesn’t have to be precise. What matters is the structure: tell the agent where to look, what to compare, and which source to trust when they disagree.

But didn’t you just say the AI can’t detect its own compaction?

Right, and it can’t. What I described above isn’t the agent detecting compaction. It’s the agent running a deterministic check against files on disk and finding a discrepancy. The agent doesn’t need to know that compaction happened. It just needs to notice that two files disagree. Think of the agent as an amnesiac clerk. You don’t ask the clerk to remember what they did yesterday. You make the clerk check the physical ledger every time they sit down at the desk. If their notes disagree with the ledger, they’re trained to trust the ledger.

If you saw Christopher Nolan’s breakout movie Memento, you can think of your agent as Leonard Shelby, the character played by Guy Pearce with anterograde amnesia. You couldn’t ask Leonard to remember what he did yesterday. He had to check his tattoos every time he woke up. If his tattoos disagreed with what he’s seeing, he trusts the tattoo (which leads to a major plot point, which I won’t spoil). Again, this isn’t a new idea either. I mentioned the Memento pattern earlier, which is literally named after this movie.

This is a classic distributed systems technique. In double-entry bookkeeping, you maintain two independent records of the same transaction and reconcile them regularly. If they disagree, you investigate. You don’t need to know why they diverged; the divergence itself is the signal. A two-phase commit works the same way: write the data first, then update the record that says the data was written. If you find data without a matching record, or a record without matching data, something went wrong between the two phases.

That’s exactly what the cursor invariant does. The agent writes the output line first, then updates the progress file. If those two files are out of sync, something happened between the two writes. The agent doesn’t detect compaction. It detects a broken invariant, and it’s been told that when the invariant breaks, the files on disk win.

Three things make this work. First, the check is purely deterministic: read two files, compare two numbers, act on the result. There’s no reasoning involved, no judgment call about whether the agent “feels” like it lost context. I wrote about this principle in “Keep Deterministic Work Deterministic”; you never want an AI making decisions that a file comparison can make for it. Second, the files on disk don’t change when context compacts. They’re the stable reference point that the agent’s memory gets checked against. Third, the instruction to run the check lives in the system prompt or preamble, which is generally preserved even when conversation context gets compacted. The check survives the thing it’s designed to detect.

Rehydrate: Reading back the state

Rehydration is the process of reading back externalized state and rebuilding the agent’s working context. Once the agent detects compaction (or, more specifically and accurately, has enough evidence from the filesystem that compaction occurred), the recovery step is to read back the externalized files and rebuild. For the Quality Playbook, rehydration meant:

Read the phase brief to re-anchor the purpose of this pass
Read the progress file to know which unit is active and what’s been completed
Read the tail of the JSONL artifact to confirm the last successfully written record
Recompute the next unit of work from those files

This is different from just continuing without detection. Without detection, the agent tries to pick up where it left off and hopes it still has enough context. With detection, the agent knows something happened and deliberately reloads state before continuing.

You can make the rehydration process itself auditable. Instead of silently reading the files and resuming, have the agent write down what it learned:

Read the progress file and the JSONL artifact. Write a summary of what you learned: what pass is running, what unit is active, what the cursor position is, and how many requirements have been extracted so far. Then continue from there.

Writing a rehydration summary serves two purposes. It gives you visibility into what the agent understood and whether it rehydrated correctly. And it forces the agent to process the external files explicitly rather than just loading them into context. Explicit processing is more reliable than silent loading because the agent has to commit to an interpretation, and you can read that interpretation and catch mistakes.

You can adapt this approach to any agent workflow where work happens in steps. The specific files and cursor values are particular to my pipeline, but the underlying technique is general: have the agent write its progress to a file after each step, and check that file against its output at the start of every session. And this advice isn’t just for writing agents or skills. Even in a live session with Claude Code, Cursor, or Copilot, you can tell the agent to periodically write a summary of what it’s done and what it plans to do next to a file on disk. If the session crashes or the context gets long enough to compact, you can point a new session at that file and pick up where you left off. The key is getting the state out of the conversation and onto disk before you need it.

Context management is an architectural concern

Every technique I’ve described in these articles comes down to the same principle: Important information shouldn’t live only in the agent’s context window. The previous articles covered how to put that information on disk. This one covers how to make the agent aware of its own limitations so it can recover when context pressure gets too high.

An agent that can detect its own degradation and correct for it is fundamentally more reliable than one that just keeps going. When the agent knows how to stop, check itself against ground truth, and reload what it lost, context pressure becomes a recoverable event instead of a slow, silent failure.

This concludes my mini-series trilogy of articles about context management. The first article in this series was about understanding what context is and why it disappears. The second was about getting important information out of the conversation and onto disk before you need it. This one is about closing the loop: making the agent aware of its own limitations so it can detect degradation and recover from it. Together, they add up to treating context as an engineering problem rather than something you hope works out.

These are still early days. Context windows will get larger, compaction will get smarter, and some of the workarounds in this article will eventually be unnecessary. But the underlying principle won’t change: If your agent’s ability to do its job depends on information, that information needs to live somewhere more durable than working memory. That was true for my dad’s 32KB core memory at Princeton, it was true for my 640K of conventional RAM, and it’s true for today’s 200K-token context windows.

The Quality Playbook and Octobatch are open source projects where these techniques are used in production. Both are built using AI-driven development and available for exploration if you want to see how this looks in practice.

Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026, by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.

The PM’s Playbook for Shipping AI Features That Actually Work in Production

Gaurav Savla — Wed, 10 Jun 2026 10:55:56 +0000

The demo to production Death Valley

If you’ve worked on an AI feature, you know the feeling. You start building something that you are excited about, set launch timelines. The model spits out a perfect response, the prototype works magically, and everybody in the room is mentally calculating how big this product will be when we launch. I’ve been in that room a lot many times and it’s fun.

Then you try to test before you ship.

Latency spikes to 10 seconds on mobile. The model starts hallucinating on edge cases that happen to represent 15% of actual user queries. Your A/B test shows no statistically significant engagement lift because the variance in AI outputs makes traditional hypothesis testing basically meaningless. The safety team flags 340 failure cases in the first week, and you’re now debugging nondeterministic cases that fail in creative, novel ways every single day.

Most often than not, it’s not a model problem but an engineering discipline problem. Shipping an AI product is very different from traditional software. I’ve figured this out the hard way. This playbook shares my learnings.

Latency budgets

Every AI feature comes with a latency tax. Large language model inference takes time. We’re talking 500 milliseconds to 5 or even 50 seconds depending on model size, input length, and infrastructure setup. For consumer products where people expect sub-200-millisecond interactions, this is a hard constraint you have to design around.

The mistake I see most often is teams measuring only p50 latency. A feature with 800 milliseconds p50 sounds fine until you discover the p90 is 15 seconds. That means 10 in every 100 users sit there waiting for 15+ seconds. At scale, that’s thousands of terrible experiences per day.

The way I think about it is you define your latency budget by interaction type, not globally: Synchronous interactions, where the user is staring at a spinner, need to resolve under 1 second. Progressive interactions, where output streams token by token, need first token in under 500 milliseconds and full response under 5 seconds. Asynchronous interactions, where the user keeps doing other stuff, can take up to 20 seconds with a progress indicator.

You also need to measure cold starts separately. The first request after a model loads into memory can be 10 times slower than subsequent requests, and if your traffic is bursty, cold starts will disproportionately punish your most engaged users arriving during peak hours.

Besides, you also need to budget for the full pipeline, not just inference. A typical AI feature pipeline including input preprocessing (tokenization, context assembly, and prompt construction), model inference, output postprocessing (parsing, formatting, safety filtering, etc.), and a full response delivery adds up. Optimizing inference while ignoring the rest is like tuning your engine while driving on flat tires.

Lastly, use streaming aggressively for generative features. Pushing tokens to the user as they’re generated instead of waiting for the full response changes how users perceive latency. A four-second response that starts appearing at 300 milliseconds feels dramatically faster than one that pops in all at once. Perception is reality when it comes to user experience.

Designing fallbacks

Traditional software fails in boring, predictable ways. AI features fail in novel, unpredictable, and occasionally creative ways. I once saw a model respond to a product recommendation query with a poem about loneliness. Your fallback strategy needs to be considerably more sophisticated than a try/catch block.

I think about fallbacks as a hierarchy. First, model fallback: When your primary model fails, drop to a simpler, faster, and more reliable model. Most failure cases get handled without the user ever knowing. Second, cache fallback: For queries similar to stuff you’ve seen before, serve a cached response. Third, template fallback: When generation fails completely, fall back to prewritten templates. Degraded beats dead every time. Fourth, graceful omission: Sometimes the best fallback is to simply not show the AI feature at all rather than showing a broken version.

The design principle underneath all of this is that users should never encounter an unhandled AI failure. Every failure mode maps to a specific level, and transitions between levels should be invisible whenever you can manage it.

Quality measurement

Quality in traditional software is binary. The button works or it doesn’t. AI feature quality is continuous and subjective, and it changes depending on context. I’ve landed on a four-layer quality pyramid.

The foundation is safety, and it’s nonnegotiable. Does the output contain harmful content, PII, or made-up facts? This layer is binary, and you measure it with automated classifiers running against 100% of outputs.

The second layer is factual correctness, which is domain specific. Is the output actually right? For a coding assistant that means generated code compiles and passes tests. For a writing tool it means grammatical, stylistically appropriate output. You measure this with domain specific evaluation suites.

The third layer is usefulness, and it’s user centered. Did the person actually benefit? Track acceptance rate, edit distance, time to task completion, and repeat usage. This is where traditional product metrics meet AI specific ones.

The fourth layer is delight, which is experimental. Does the output feel good? Hardest to measure but often most important for adoption. Sometimes the numbers say the feature works but users’ guts say it doesn’t. This layer catches that gap.

A/B testing AI features

A/B testing AI features is fundamentally harder than traditional features because AI outputs are nondeterministic. The same user doing the same thing twice might get different outputs, introducing variance that traditional frameworks weren’t built to handle.

The core challenge is that intratreatment variance inflates the sample size you need for statistical significance, often by three to five times. If you’re running your AI experiment with normal sample size assumptions, you’re probably looking at noise and calling it signal.

Then there’s the metric selection problem. A chatbot generating entertaining but factually wrong responses might show amazing engagement numbers while actively misleading users. You have to measure engagement and quality together. “Engaged interactions where quality score exceeds threshold” is more meaningful than raw engagement alone.

The temporal problem matters too. AI feature value changes over time as users learn how to work with it. Short experiments will underestimate long-term value if there’s a learning curve, or overestimate it if there’s a novelty bump.

My practical guidance: budget two to three times more time and traffic for AI experiments than traditional ones. Lean on Bayesian methods as they handle high variance better. And always pair quantitative tests with qualitative research. Ten user interviews will surface failure modes that no amount of statistical analysis will catch.

Model drift monitoring

Model drift is the slow, invisible rot of AI output quality over time, and there are multiple culprits.

Data drift happens because the world changes and user behavior evolves. A model trained on 2024 data performs worse on 2026 queries referencing new concepts, slang, and cultural moments.

Provider drift happens because third-party APIs change without your consent. OpenAI acknowledged that GPT-4’s behavior shifted measurably between March and June 2023, and Stanford researchers documented significant performance swings. The fix: Pin your model versions so updates happen on your schedule, after your testing.

Evaluation drift is the subtlest form. Even your quality metrics can become inadequate and the evaluation criteria that made sense at launch might become inadequate as usage patterns shift and user expectations change. Quarterly reviews of your evaluation suites are essential.

At minimum you need daily automated quality evaluations on 1% to 5% of production traffic, weekly analysis of input distribution characteristics, and monthly human evaluation of 100 to 500 examples. Shipping an AI feature without drift monitoring is like deploying a service without alerting. You won’t know it’s broken until your users tell you, and by then they’re angry.

Evaluation frameworks

How do you know if your AI feature is good enough? You need two fundamentally different approaches, and you genuinely need both.

Automated evaluation gives you speed. Build a golden dataset of 500 to 2,000 labeled examples, train a classifier or use a capable model as judge, and validate against human judgment quarterly targeting 85% agreement. Automated evals chew through thousands of examples per hour, making them essential for velocity. The pitfall: They miss novel failure modes not in the training data.

Human evaluation catches what automation misses. Structure it with five to seven evaluators mixing domain experts and representative users. Use a consistent rubric covering accuracy, helpfulness, tone, completeness, and safety. Run weekly during development, monthly in production. The trade-offs: expensive at $15 to $30 per example, slow with 24 to 72 hour turnaround, and subject to human biases. Manage by rotating evaluators and capping sessions at two hours.

The model as judge approach is an increasingly viable middle ground. Judging quality is often easier than generating it, which means a model can reliably evaluate outputs even for tasks where it couldn’t produce them itself. Use it for high-volume evaluation but always validate against human judgment.

Graceful degradation and prompt engineering

Graceful degradation means when capabilities decrease, the experience gets worse smoothly instead of falling off a cliff. Design for capability levels, not binary states. Define four to five levels with specific behaviors at each. For example, for an AI writing assistant: Level 5 is full capability with real-time suggestions, tone adjustment, and structure recommendations. Level 4 is delayed suggestions appearing after a two- to three-second pause because latency is up. Level 3 is basic suggestions only like grammar and spelling with no style feedback. Each level is a deliberate design decision, not an accident.

Make degradation invisible when possible. Users shouldn’t see a “broken” experience. They see a less detailed one. That’s a huge difference psychologically. However, when the degradation is significant enough that users will notice, proactive communication like “AI suggestions are temporarily limited” builds trust infinitely more than silently pushing poor-quality outputs.

Prompt engineering in production is software engineering. In production, prompts are code, and they need version control, testing, monitoring, and maintenance. Version controls every prompt. Parameterize prompts, don’t hardcode context. Production prompts should be templates with clearly defined injection points for user context, system state, and dynamic instructions. This makes them testable because you can inject known inputs and verify outputs, and it makes them maintainable because changing how you handle context shouldn’t require rewriting the entire prompt from scratch.

Test prompts against regression suites. Maintain 200 to 500 test cases covering the full distribution of expected inputs, including edge cases and adversarial inputs. Run the suite against every prompt change before deployment.

Monitor prompt performance in production. Track output quality metrics like acceptance rate, user edits, and regeneration requests, segmented by prompt version. When you deploy a new version, compare its production metrics against the previous one for at least 72 hours before calling it stable. This is basically canary deployment for prompts.

Ship it right

These systems aren’t optional add ons you can bolt on after launch. Every feature I’ve seen fail was built first with plans to “add production hardening later.” Later never comes.

AI features are probabilistic and nondeterministic, and they change over time without anyone touching them. Build these systems, staff them properly, and treat them with the same seriousness you’d give your core infrastructure. The gap between demo and production is wide, but it’s absolutely crossable if you build the right bridge.

Note: The research work pertaining to this article was done in a personal capacity. Views are of my own and do not reflect my employer’s views in any way.

The Subsidy Ended: What Tool-Using Agents Actually Cost

Bennie Haelen — Tue, 09 Jun 2026 11:09:17 +0000

On June 1, GitHub Copilot’s usage-based billing became active for all Copilot plans, and developers reacted quickly and loudly. A Pro plan still costs $10, but it now comes with a monthly pool of AI credits. Those credits are priced at a penny each, and they’re consumed according to the model used and the tokens processed, including input, output, and cached tokens. For a heavy agentic session running a frontier model, that makes spend feel very different from a flat subscription.

That’s the news, and it’s worth understanding, but it isn’t the important part. Nothing about the underlying cost of agentic work actually changed on June 1. The tokens were always being consumed, the loops were always running, and the tool calls were always expanding the context. What changed is that the meter became visible. A workload that had been quietly subsidized under a flat rate started showing up as an itemized bill.

Where the tokens go

To see why the bill landed so hard, it helps to compare two things that look similar and bill very differently. A chat completion is close to a single transaction. You send a prompt, the model sends an answer, and you pay roughly once for the input and once for the output. A tool-using agent doesn’t work that way at all. An agent doesn’t answer a question so much as work toward it, and it works by looping. It reasons about the task, calls a tool, reads the result, reasons again, calls another tool, and continues until it decides it’s finished.

Every pass through that loop carries a cost that’s easy to miss. In many agent harnesses, each turn carries forward a large share of the accumulated context: prior messages, tool descriptions, retrieved files, and tool results. Even when some of that context is cached, summarized, or pruned, the system is still doing metered work to preserve enough state for the next decision. The final answer you actually wanted is only a thin slice of what you paid for. The loop is the bill.

This is why agent cost doesn’t scale politely. It scales with the number of turns, and the number of turns scales with how much discovery the agent has to do, which in turn scales with how vague the request was and how much irrelevant context it’s dragging along. A clean, well-scoped task might finish in three turns, while the same task posed as an open-ended question might wander through 15, each carrying the cost of everything that came before it. Under a flat rate, that difference was invisible. Under usage-based billing, it’s the difference between a small interaction and an expensive one.

Tool design is now part of the cost model

I wrote recently about a hidden tax on Model Context Protocol servers: the way an overstuffed tool catalog quietly degrades a model’s ability to route to the right tool. Bloated descriptions, overlapping responsibilities, and vague parameters make the model’s job harder and its choices worse. That argument was about accuracy. The billing change adds a second invoice for the same bloat, and this one is denominated in dollars.

The tool catalog is often part of what gets carried through the agent’s loop. A tool described in three tight sentences and a tool described in three rambling paragraphs may both function, but the second one pays rent in the context window every time an agent has it loaded. Multiply that across a catalog of 40 tools and a workflow that runs a dozen turns, and the cost of verbose tool design stops being a rounding error. Tool design was already a correctness discipline. It’s now a cost discipline as well. The same audit that tightens routing accuracy tightens the bill.

Where prompt discipline runs out

There’s a layer of this that individual users can control, and it’s worth knowing because the savings are real and immediate. Two patterns matter most, and I’ve been handing both to the engineers on a pilot I run for a large healthcare organization. They aren’t magic tricks. They’re ways to keep the agent out of unnecessary discovery loops.

The first pattern is about input. Prompt the agent like a short requirement rather than a broad question. A request such as “look at the encounter data and tell me what you find” forces the agent into discovery mode, where it burns turns figuring out what you meant, and every one of those turns carries the full context forward. Compare that to a prompt that front-loads the specifics by naming the project and the table, naming the date field to filter on, stating the output shape you want, and calling out anything that should be excluded. A better prompt would be: “Using the curated clinical project and the silver-zone encounters table, show total encounters by month for calendar year 2025, use admission_date_time for inclusion, and return one row per month ordered chronologically.” The second prompt collapses the loop. The agent has what it needs on the first turn, so it does the work instead of interviewing you for it.

In practice, the difference isn’t just polish. The vague version forces the agent to discover the data model, infer the date semantics, choose an aggregation, and decide on a display format. The specific version turns the task into a bounded query. That difference shows up in accuracy, latency, and cost.

The second pattern is about output, and it’s the lever most people overlook. Ask for plain text or Markdown during the intermediate steps, and save rich HTML formatting for the final, confirmed deliverable. Formatted output is expensive to generate, and requirements shift. If you ask for a polished HTML report on the first pass and then change a filter, you pay full output-token freight to regenerate all that layout, often more than once. The cheaper habit is to validate the numbers in text and format only at the end.

These patterns work, and they also have a ceiling. Both of them put the entire burden of cost control on the user, and they hold only as long as every user exercises the discipline on every prompt. The day someone reverts to “tell me what you find,” the savings evaporate, and the only thing standing between the team and a surprise invoice is a budget cap that reports the overspend after it has already happened.

Cost is a governance problem, not a budgeting one

That fragility is the real lesson. A budget cap is a backstop rather than a control. It will stop a runaway, but it tells you that you overspent rather than why, and it does nothing to make the next run cheaper. Treating cost as a budgeting problem leaves you forever reacting to the meter, while treating it as an architecture problem lets you build the savings in once and stop relying on everyone’s good behavior.

That means the controls that matter belong on the platform rather than in individual prompts. By the platform I don’t mean the agent itself, the coding assistant or chat client a developer drives day-to-day, and I don’t mean the model or a router sitting beneath it. I mean the control plane that sits above the agents, the layer where an organization enforces policy, access, observability, and now cost across every agent and model its developers touch. An administrative console that gives IT visibility into who is doing what and which capabilities they can install is an early, narrow instance of it. A router that sends planning to a cheap model is one feature that belongs there. The platform is where the rules live, and the agent is a consumer of those rules rather than the place you set them. The platform should route models by task, using cheaper models for planning and reserving frontier models for work that earns the price. It should bound the loop, requiring the agent to check in after a fixed number of iterations. It should cap tool-result payloads so a careless query cannot dump a million rows into the context window. It should default intermediate work to plain text, making the cheap path the path of least resistance instead of something users have to remember.

Every one of those controls is something a user can approximate by hand and something the platform can simply guarantee. This is the same principle I keep returning to in the context of data access, where safe behavior cannot depend on the person at the keyboard remembering the rules. Prompts guide behavior. Guardrails make the cheaper and safer behavior the default. Cost governance is guardrails as control plane, with a dollar sign attached, enforced at the same layer where you already enforce who is allowed to see which row.

The pattern, not the vendor

It would be a mistake to read this as only a GitHub story. GitHub is the current example because its change is visible and recent, but usage-based billing for agentic work is the direction of travel for many AI tools. The economics under the hood are similar: Agentic workloads turn single answers into loops of model calls, tool calls, and context management. The flat-rate subsidy was always going to come under pressure once the workload shifted from autocomplete to autonomy.

The organizations that treat June 1 as a pricing event will optimize a few prompts, grumble, and move on until the next vendor changes its meter. The ones that treat it as an architecture signal will push the cost controls down into the platform, where they hold regardless of which provider is counting which token. That’s the more durable place to stand. The bill didn’t get bigger this month. It got honest, and an honest bill is the kind you can engineer against.

Long-Running Agents

Addy Osmani — Mon, 08 Jun 2026 15:59:06 +0000

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

A long-running AI agent can keep making progress over hours, days, or weeks. It can do this across many context windows and sandboxes, recover from failure, leave structured artifacts behind, and resume where it left off.

For two years the dominant image of an “AI agent” has been a chat window with a clever loop in it. You type a goal; the agent calls some tools; you watch tokens stream by; you stop watching when the work runs out of patience or the context window fills up. That paradigm got us a long way, but it has a ceiling. The model forgets. It declares “task complete” when it isn’t. It reintroduces a bug it fixed nine turns ago. The whole thing is structured around a single sitting.

Long-running agents are what comes next. The idea is easy to state: an agent that keeps making forward progress on a goal across many sessions and many sandboxes, possibly many days or weeks, while leaving the workspace clean enough that the next session can pick up where the last one left off. The engineering is harder. You have to solve for persistence, recovery, and verification in a way that doesn’t just paper over the cracks. You have to build a state layer that lives outside the model’s context window, and you have to design the handoff between sessions so the agent doesn’t lose its mind when it wakes up and finds itself in a different sandbox with a different context window.

This post is my attempt to lay out what’s changed, who’s pushing on it, and how an engineer can use long-running agents today without writing the whole thing from scratch.

What “long-running” actually means

“Long-running” used to mean at least three different things in practice, and it helps to keep them separate.

Long-horizon reasoning. The agent has to plan and execute over many dependent steps. This is mostly a model-quality story: coherence, planning, the ability to recover from a wrong turn 10 steps ago. METR has been tracking this with their time horizon metric, which estimates how long a task a frontier model can complete with 50% reliability. The headline finding is that the metric has been doubling roughly every seven months since 2019, and their TH1.1 update earlier this year doubled the count of eight-hour-plus tasks in the eval set. If that curve holds, frontier agents complete tasks at the day scale by 2028 and the year scale by 2034.

Long-running execution. The agent’s process runs for hours or days. Maybe it’s a coding job, maybe it’s a research sweep, maybe it’s a 24-7 monitoring service. The model might be invoked thousands of times across the run. This is mostly a harness story, and it’s the one this post is mostly about.

Persistent agency. The agent has an identity that outlives any single task. It accumulates memory, learns user preferences, and is always available. This is the Memory Bank flavor of long-running.

In practice the three blur together. A real production agent does long-horizon reasoning inside a long-running execution backed by persistent agency. But the engineering problems are different in each, and so are the products that solve them.

Why this matters

There are two reasons I believe this work matters a lot right now.

The first is a phase change in what’s economically feasible to delegate. An agent that runs for 10 minutes can answer a question, summarize a doc, fix a small bug. An agent that runs for 10 hours can own an entire feature, finish a migration that was on the backlog for six quarters, or do the kind of overnight research sweep that used to require a junior analyst. One of Anthropic’s Claude Sonnet announcements put concrete numbers on this last fall: 30+ hours of autonomous coding in internal tests, including one run that produced an 11,000-line Slack-style app. That’s already past the threshold where the answer to “Should I delegate this?” is no longer obvious.

The second is that persistence changes what the agent is. A stateless agent answers your question and disappears. A long-running one accumulates context: which competitor moved which way last week, which test flaked twice on Tuesday, what you usually mean by “the dashboard.” Anthropic’s Project Vend was the most public early demonstration of this. They had a Claude instance run an actual office vending business for a month, managing inventory, setting prices, talking to suppliers. It failed in informative ways, and the second phase ran much better, but the point wasn’t profitability. The point was watching what kinds of weird coherence problems show up when an agent has to maintain identity across weeks instead of turns.

Those are the same problems every team building production agents now hits.

The three walls every long-running agent hits

Three walls show up in basically every write-up I’ve read this year.

Finite context. Even a 1M-token window fills. And context rot, the steady degradation of model performance as the window gets full, kicks in well before the hard limit. A 24-hour run is not going to fit in any context window the field has on its roadmap. Something has to give.

No persistent state. A new session starts blank. Anthropic’s framing in their scientific computing post is the cleanest version I’ve seen: “Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Without an explicit persistence story, every shift change is a productivity disaster.

No self-verification. Models reliably skew positive when they grade their own work. Asked “Are you done?” they answer “yes” more often than they should. Without a separate signal that the work meets a bar, you get the agent that ships at 30% complete with full confidence.

Long-running agent designs are mostly answers to these three problems. The major labs have converged on similar shapes of answer, but with very different surface area.

The Ralph loop: One of the simpler practitioner versions of long-running agents

The Ralph loop (sometimes called the Ralph Wiggum technique) is one of “simpler” practitioner version of long-running agents, popularized by Geoffrey Huntley and Ryan Carson. The reference implementation is literally a bash script that loops:

Pick the next unfinished task from a list (prd.json or equivalent).
Build a prompt with the task, the relevant context, and any persistent notes.
Call the agent.
Run tests or other checks.
Append what happened to progress.txt.
Update the task list (done, failed, blocked).
Go back to step 1.

The reason it works is the same reason any of the harnesses below work: State lives outside the agent’s context. prd.json is the plan, progress.txt is the lab notes, and AGENTS.md is the rolling rulebook. The agent itself is amnesiac, but the filesystem isn’t. Each iteration starts fresh and reads enough state from disk to keep going. Carson’s Compound Product extends the idea by chaining multiple loops (an analysis loop that reads daily reports, a planning loop that emits a PRD, an execution loop that writes the code), which is roughly the open source version of the planner-generator-evaluator triad Anthropic landed on independently.

I went deeper on all of this in “Self-Improving Coding Agents”: task list structure, progress files, QA gates, monitoring, the failure modes you’ll actually hit. The short version is that you can build a working long-running agent in an evening with a bash script and a JSON file. Most of what Google and Anthropic have productized is the work of making this pattern recoverable, secure, and observable at scale.

The big-lab stories below are different ways of paying for that production-readiness.

Anthropic: Harnesses, then the brain/hands/session split

Anthropic has been the most public about the engineering. Two posts are worth reading end to end.

The first is “Effective Harnesses for Long-Running Agents,” which lays out a two-agent harness for autonomous full stack development. An initializer agent runs once at the start of a project to set up the environment, expand the prompt into a structured feature-list.json, and write an init.sh that future sessions will run on boot. A coding agent is then woken up over and over, each session asked to make incremental progress on one feature, run tests, leave a claude-progress.txt note, and commit. A test ratchet (“it is unacceptable to remove or edit tests because this could lead to missing or buggy functionality”) sits in the prompt to stop the very common failure of an agent deleting failing tests to “make them pass.” InfoQ’s writeup extends this into a planner, generator, and evaluator triad, on the same logic that separating generation from evaluation matters because models grade their own work too generously.

The second is “Scaling Managed Agents: Decoupling the Brain from the Hands,” the architectural post behind Claude Managed Agents (Anthropic’s hosted runtime, launched in early April). The argument is that an agent has three components that should be independently replaceable. The Brain is the model and the harness loop that calls it. The Hands are sandboxed, ephemeral execution environments where tools actually run. The Session is an append-only event log of every thought, tool call, and observation.

This sounds abstract, but it isn’t. Here’s Anthropic’s framing: “Every component in a harness encodes an assumption about what the model can’t do on its own.” When you couple them, an assumption that goes stale (e.g., the model used to need an explicit planner and now plans natively) means the whole system has to change at once. When you decouple them, the harness becomes stateless, sandboxes become cattle, not pets, and a brain crash doesn’t lose the run. A fresh container calls wake(sessionId) and reconstitutes the state from the log. They reported time-to-first-token dropped ~60% at p50 and over 90% at p95 just from being able to start inference before the sandbox is ready.

The session-as-event-log idea is the part most teams underappreciate. It is what makes a long-running agent recoverable. Without it, a container failure is a session failure and you’re debugging into a stale snapshot. With it, the agent’s memory is a queryable artifact that lives outside whatever process happens to be running at the moment.

For the scientific computing crowd, Anthropic’s “long-running Claude” post reduces all of this to a simpler stack: CLAUDE.md as a living plan the agent edits as it learns, CHANGELOG.md as portable lab notes, tmux plus SLURM plus git as the execution and coordination layer, and the Ralph loop, a for loop that kicks the agent back into context whenever it claims completion and asks if it’s really done. Their flagship case study is a Boltzmann solver Claude Opus 4.6 built over a few days that reached subpercent agreement with a reference CLASS implementation. Months to years of researcher time, compressed.

Same patterns across all three posts: an explicit plan file, an explicit progress file, structured handoffs between sessions, separate generation from evaluation, and a loop that refuses to let the agent stop early.

Cursor: Planners, workers, judges

Cursor’s “Scaling Long-Running Autonomous Coding” is the other essential read this year. They walked into walls that Anthropic mostly papered over.

Their first attempt was a flat coordination model: equal-status agents writing to shared files with locks. It became a bottleneck and made the agents risk averse, churning rather than committing. Their second attempt swapped locks for optimistic concurrency control, which removed the bottleneck but didn’t fix the coordination problem. The third design is what’s running in production now and what they describe as solving most of the problem:

Planners continuously explore the codebase and emit tasks. They can recursively spawn subplanners.
Workers are focused executors. They don’t coordinate with each other and they don’t worry about the big picture.
Judges decide when an iteration is finished and when to restart.

Two things stand out from the post. One: “A surprising amount of the system’s behavior comes down to how we prompt the agents” more than the harness or the model. Two: Different models slot into different roles. Their reported finding is that a GPT model was better than Opus for extended autonomous work specifically because Opus tended to stop early and take shortcuts. Same task, different role, different model. The matching is becoming part of the design surface.

This pairs with Composer 2 (their proprietary frontier coding model that ships in Cursor 3) and their background cloud agents: long-running tasks that run on Anysphere’s cloud infrastructure rather than your laptop. Eight-hour refactors and codebase-wide migrations survive a closed lid. You can start a task locally, hit run in cloud when you realize it’ll take 30 minutes, and reattach later from your phone. Each agent runs in an isolated Git worktree and merges back via PR. The handoff between local and remote is the part most teams haven’t figured out yet, and Cursor’s bet is that it has to be its own product surface.

The shape ends up close to Anthropic’s: Roles are split, sessions are durable, judges sit beside the worker, and a long task runs in a cloud sandbox with Git as the coordination substrate.

Google: Long-running agents on the Agent Platform

Google’s announcement at Cloud Next ’26 folded Vertex AI into the Gemini Enterprise Agent Platform and turned long-running agents into a named product, with named SLAs.

The pieces that matter for this post:

Agent Runtime supports agents that “run autonomously for days at a time” with sub-second cold starts and on-demand sandbox provisioning. The launch post’s example use case is a sales prospecting sequence that takes a week to play out, which is roughly the right shape for it.
Agent Sessions persist conversation and event history. You can pin them to a custom session ID that maps to your own CRM or DB record, so the agent’s state lives next to the business state instead of in a separate AI silo.
Agent Memory Bank is the persistent long-term memory layer, generally available as of Next ’26. It curates memories from sessions, scopes them to a user identity, and exposes a search API so the next agent invocation can pull what’s relevant. Payhawk reported that auto-submitting expenses through a Memory Bank-backed agent cut submission time by over 50%.
Agent Sandbox handles hardened code execution.
Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability, and Agent Simulation cover basically every operational concern you’d otherwise build by hand for a production fleet, including the cryptographic-identity-and-audit-log story enterprises actually need to ship.

Architecturally this is the same brain/hands/session split Anthropic described, just productized at platform scale and bundled with ADK (the code-first dev kit) and Agent Studio (the visual one). If you’re building inside Google Cloud, you don’t have to design a session log or a memory store from scratch anymore. You wire an ADK agent into Memory Bank and Sessions, deploy onto Agent Runtime, and the persistence question is answered.

Notice how much this looks like the pattern Anthropic and Cursor describe, just unbundled into named services with SLAs. Three years ago you’d have built all of this yourself. Now you pick which version of “decoupled brain, hands, and session” you want to rent.

Five patterns for long-running agents in production

Shubham Saboo and I wrote up five design patterns we’ve seen separate working long-running agents from demos. They aren’t Google-specific, but they map cleanly onto the primitives Agent Runtime now exposes, so it’s worth walking through them here in shortened form.

Checkpoint-and-resume. The most common multiday failure is context loss. An agent processes 200 documents over four hours, hits an error on document 201, and without a checkpoint you start from scratch. Treat the agent like a long-running server process: write intermediate state to disk, checkpoint every N units of work, recover from failures. The Agent Runtime sandbox gives you a persistent filesystem, but choosing the right checkpoint granularity (not every step, not only the end) is on you.

Delegated approval (human-in-the-loop). Most “human-in-the-loop” implementations are: serialize state to JSON, fire a webhook, hope someone responds. The state goes stale, the notification gets buried, the agent re-deserializes into a slightly different world. Long-running runtimes let the agent pause in place with full execution state intact: reasoning chain, working memory, tool history, pending action. Hours of human time pass, the agent consumes zero compute, and it resumes with subsecond latency. Mission Control is Google’s inbox for this. The pattern works regardless of vendor.

Memory-layered context. A seven-day agent needs more than session state. Memory Bank handles long-term curated memory, Memory Profiles add low-latency lookups, and the failure mode you’ll hit in production is memory drift: The agent learns a procedural shortcut from a few atypical interactions and starts applying it broadly. Govern memory like you govern microservices. Agent Identity controls who can read and write which banks. Agent Registry tracks which version of which agent is running. Agent Gateway enforces policy on the wire. The auditing question stops being “What are my agents doing?” and becomes “What are my agents remembering, and how is that changing their behavior?”

Ambient processing. Not every long-running agent talks to a human. Some sit on a Pub/Sub stream or a BigQuery table and act on events as they arrive: content moderation, anomaly detection, inbox triage. The architectural decision worth making early is to not hardcode policy into the agent. Define it in the Gateway and the fleet picks up policy changes without redeploys. Ambient agents run unsupervised for long stretches, and the only sane way to update a hundred of them is to update the policy layer once.

Fleet orchestration. In real systems, you rarely have one agent. A coordinator delegates subtasks to specialists (a Lead Researcher Agent, a Scoring Agent, an Outreach Agent), each running independently for different durations. Each specialist gets its own Identity (so the Outreach Agent can’t read financial data meant for Scoring), its own policy enforcement, its own Registry entry. This is the same coordinator/worker shape distributed systems have used for decades. What’s new is that ADK handles it declaratively with graph-based workflows, and a bad deployment in one specialist doesn’t cascade to the others.

The patterns compose. A compliance system might use checkpointing for document processing, delegated approval for review gates, memory layering for cross-session knowledge, and fleet orchestration to coordinate the specialists. The opening question is always the same: What’s the longest uninterrupted unit of work your agent needs to perform? Minutes, and you don’t need long-running agents. Hours or days, and these patterns are where to start. The full write-up with code samples covers each pattern in depth.

So how do you actually build one today?

This is the practical question, and it has a different answer depending on what you’re building.

You’re a developer who wants long-running coding work on your own repo. Just use Claude Code (or Antigravity, Cursor, or Codex). The harness is already there. Treat your AGENTS.md like a pilot’s checklist: short, every line earned by a real failure. Add hooks for typecheck and lint that surface failures back to the agent. Write a plan file before the agent starts. Use the Ralph loop when the agent claims it’s done and you don’t believe it. For multihour or overnight jobs, run in a worktree so a closed laptop doesn’t kill the run, and have it commit progress every meaningful unit of work. This is the path most people should take, and it’s where the most leverage is right now.

You’re building a hosted agent product. Don’t build the runtime. Pick a managed one. The three real options today: Google’s Agent Platform (Agent Engine + Memory Bank + Sessions), Claude Managed Agents, or roll something on top of ADK, the Claude Agent SDK, or Codex SDK and host it yourself. The trade-off is the usual one. Managed gets you the brain/hands/session split, observability, identity, and an audit trail out of the box. Self-hosted gets you control and the ability to use weird models for weird roles (Cursor’s pattern). For most teams, the right starting point is a managed runtime plus your own ADK or SDK code for the actual loop.

You’re doing something autonomous and operational (monitoring, research, ops). Memory Bank-style persistence is what you want, and it’s the part that doesn’t exist in Claude Code. ADK + Memory Bank + Cloud Run + Cloud Scheduler is the cleanest stack I’ve seen for “agent runs every N hours, accumulates state, alerts on a threshold.” This is also where Cursor’s planner/worker/judge split starts to matter more than it does for IDE coding, because the work is genuinely parallel and the failure modes are different.

A few things matter regardless of which path you take.

Write down the done condition before the agent starts. This is the single highest-leverage move for long runs. The Anthropic harness post calls it the feature list; Cursor calls it the planner’s task spec. Either way, it’s an external file with explicit, testable completion criteria, and it exists so the agent can’t quietly redefine done midrun.

Separate the evaluator from the generator. Self-grading is the failure mode. A planner/worker/judge pipeline, or a generator/evaluator pair, is a real architectural pattern, not a stylistic preference. Even if it’s the same model in different roles with different prompts.

Invest in the session log, not just the prompt. The append-only event log is what makes the agent recoverable, debuggable, and auditable. If you can’t reconstruct what the agent did in the last 24 hours from durable storage, what you have is a long-running shell script that happens to call an LLM, not a long-running agent.

Treat compaction and context resets as first class. Anthropic is explicit that summarization-as-compaction wasn’t enough for very long jobs; they had to do full context resets where the harness tears the session down and rebuilds it from a structured handoff file. It is essentially how humans onboard a new engineer.

There are some real limitations right now

A few things are still genuinely unsolved.

Cost. A 24-hour run with a frontier model and a few tools is not cheap. Without budgets, circuit breakers, and a hard cap on tool spend, an agent can quietly burn through a week’s API budget in an afternoon. This is solvable, but it’s an explicit step you have to take.

Security. A long-running agent with API keys, cloud access, and the ability to run shell commands has a much larger attack surface than a chat session. The brain/hands separation pattern matters here too: Credentials should be unreachable from the sandbox where model-generated code runs, which is one of the benefits Anthropic calls out for Managed Agents.

Alignment drift. Over many context windows, agents drift. The original goal gets summarized, then resummarized, then loses fidelity. This is the part hooks and judges exist to defend against. It is also the most common reason “the agent went off and did something I didn’t ask for.”

Verification. Auditing 24 hours of autonomous activity is a real human-time problem. Observability and structured artifacts (PRs, commits, briefings, test runs) are how you make this tractable. Without them, you’re scrolling logs and you’ll miss what matters.

The human role. This is the one I keep coming back to. Defining work crisply enough that an agent can run for a day on it is harder than doing the work yourself. The skill that’s appreciating in value isn’t writing code. It’s writing specs that survive contact with an autonomous executor.

Where this is going

Google, Anthropic, and Cursor have converged on roughly the same shape. Separate the model loop from the execution sandbox from the durable session log. Split planning from generation from evaluation. Bake in compaction, hooks, and context resets. Expose memory as a managed service that any agent invocation can query.

Surface area is what differs. Google’s Agent Platform is the enterprise-stack version, with the identity and audit trail story baked in. The patterns underneath are the same. Claude Managed Agents is “Anthropic’s harness, hosted.” Cursor’s background agents are “long-running coding, pulled out of the IDE and into the cloud.”

The harder problems for the next year aren’t in any of those layers individually. They’re in the coordination above them. Many long-running agents on a shared codebase. Agents that read their own traces and patch their own harnesses. Harnesses that assemble tools and context just in time for a task instead of being preconfigured at startup. That’s where the agent stops looking like a smarter chat window and starts looking like a colleague who’s been on the project longer than you have.

The model is still load-bearing. But the gap between a chat window and an agent you can leave running overnight is mostly in the state, sessions, and structured handoffs wrapped around it. That’s where I’d spend my learning time right now.

The AI Agents Stack (2026 Edition)

Paolo Perrone — Mon, 08 Jun 2026 10:56:59 +0000

The following article originally appeared on Paolo Perrone’s The AI Engineer Substack and is being reposted here with the author’s permission.

Your team picks LangGraph for a customer support chatbot. Three weeks in, you’ve got 14 nodes in a state graph, a custom checkpointer writing to Redis, and retry logic for tool calls that fail once a week. The agent answers refund questions. It calls one API. A 50-line script on the OpenAI SDK with two MCP servers would have done the same thing. But nobody mapped which layers the problem actually needed.

In November 2024, Letta published an AI agents stack diagram that became the default reference for half the engineering teams I talk to. If you’ve seen a “layers of an agent” visual on LinkedIn or pinned in a Slack channel, it probably traces back to that article.

That diagram is 14 months old now, and a lot has changed since. MCP didn’t exist yet. Memory was still treated as a subset of your vector database. Nobody was shipping provider-native agent SDKs. Eval wasn’t even on the map. The stack has six layers in 2026, and at least three of them didn’t exist as distinct categories when Letta drew the original.

So we drew it from scratch. This is the 2026 version.

TL;DR

That’s the starting stack. Add complexity when something specific breaks, not before.

What are we even mapping?

Before the stack, there was a loop. In “What Is an AI Agent?,” we defined an agent as the think-act-observe cycle: The model reasons about a task, takes an action (calls a tool, writes to memory), observes the result, and loops until the task is done. That loop is the atomic unit. Everything in this issue is infrastructure that makes that loop work reliably, at scale, in production.

The agent stack is not the LLM stack. A chatbot needs inference and maybe RAG. An agent needs state management across multistep execution, tool access governed by protocols, memory that persists across sessions, autonomous reasoning loops, and guardrails that constrain behavior in real time. That’s a fundamentally different set of infrastructure problems.

We’re mapping the six layers between your LLM and a production agent. We’re not covering training infrastructure, data pipelines, or model fine-tuning. Those are adjacent stacks. We covered RAG in depth in Issue #5. Today we’re zooming out to show where RAG fits in the bigger picture.

Three things redrew the map between 2024 and 2026. MCP standardized tool connectivity, and the entire tools layer is new because of it. Reasoning models changed what agents can do autonomously, with single-call agents replacing some multistep chains. And memory became a first-class architectural primitive, not an afterthought bolted onto a vector database.

How to evaluate each layer

When choosing tools at each layer, ask three questions. How much state do you need to manage? A stateless tool caller and a multi-session agent that learns over time are different engineering problems, and the layers where state management is hardest (memory, frameworks) are where most teams get stuck. How much vendor lock-in can you tolerate? MCP is an open standard, provider SDKs are not, and every tool choice either increases or decreases how painful your next migration will be. And how hard is it to go from demo to production? Some layers (model serving) have almost no gap, while others (eval, guardrails) have a massive one. The layer where you feel that gap most is the one to invest in first.

We take each layer from the bottom up, starting with the most stable and ending with the least mature.

Layer 1: Models and inference

How you run the model that powers your agent: call an API, use a managed open weight provider, or self-host.

The inference layer changed more in tone than in substance. Reasoning models like o1, o3, DeepSeek R1, and Claude with extended thinking shifted what agents can plan and execute. Agents that previously needed multistep chains can now solve problems in a single reasoning call. Open weight models like Llama 3.3, DeepSeek V3, and Qwen 2.5 closed the quality gap dramatically, so “always use the biggest closed model” is no longer default advice. The emerging pattern is to prototype on closed source and deploy on open weight.

The honest take: This layer is commoditizing. Model differences matter less each quarter. The real decision is the cost and latency trade-off, not which model is “smartest.”

On the evaluation side, API calls are stateless. Send a request, get a response. Nothing to manage. Lock-in risk runs high for closed APIs because each model reasons differently, so switching providers means retuning prompts, adjusting for different failure modes, and retesting your eval suite. It’s low for open weight, where you can swap the model and keep the infra. The prototype-to-production gap is the smallest of any layer. Your demo API call is the same as your production API call.

Self-host when your agent call volume makes API pricing untenable or when you need sub-100ms latency that API round-trips can’t deliver.

Layer 2: Protocols and tools

How your agent calls external tools and APIs: through MCP servers, browser automation, or agent-to-agent protocols.

This layer didn’t exist as a distinct category in 2024. Every framework had its own JSON schema for tool definitions. Now MCP is the standard, with 97M monthly SDK downloads, adoption by OpenAI, Google, and Microsoft, and a donation to the Linux Foundation.

Browser Use exploded in parallel, hitting 78K GitHub stars in under a year. Nobody was shipping browser agents in production in 2024. And agents can now talk to other agents. IBM launched ACP, and Google launched A2A. Neither is standard yet, but the problem they solve (agents coordinating with other agents) is real and growing.

Security is the open problem. Endor Labs analyzed 2,614 MCP servers and found 82% prone to path traversal and 67% to code injection.

The honest take: The protocol debate is over. MCP won. The only question left is how you lock down your MCP servers before someone exploits them.

State management is nonexistent here. Your agent calls a tool, gets a response, done. No session, no memory between calls. Lock-in risk is low because MCP is an open standard, so if you build MCP servers, any MCP-compatible agent can use them. The prototype-to-production gap is medium. Your demo MCP server works until someone sends a malicious tool description. Security and governance are the gap.

MCP standardized how agents use tools. It says nothing about how agents talk to each other. ACP and A2A are trying to solve that, but neither has reached critical mass. If you need multi-agent coordination today, you’re building it yourself at the framework layer. We covered MCP in depth in Issue #4.

Layer 3: Memory and knowledge

How your agent stores and retrieves what it knows: in-context state, vector search, or persistent memory across sessions.

All three tiers feed into the same place: The context window your agent sees on every call.

In 2024, memory meant “pick a vector database and do RAG.” In 2026, memory is a first-class architectural primitive with three distinct tiers. Context windows got massive. Gemini hit 1M+ tokens, Claude 200K. Bigger windows didn’t kill the need for memory. They changed the trade-off: What do you stuff in-context versus what do you retrieve on demand?

“Context engineering” replaced “prompt engineering” as the core discipline. Instead of writing a better prompt, you architect what information the agent sees on every call. Memory blocks appeared as named, structured fields in the context window that the agent can read and overwrite every turn. Instead of dumping everything into the system prompt, the agent manages its own state: what to keep, what to update, what to drop.

On the infrastructure side, pgvector became the default for teams that don’t need a dedicated vector database. It’s just Postgres with an extension. GraphRAG emerged as a second retrieval option: follow relationships between entities instead of matching embeddings, with Neo4j leading this space. Sleep-time compute, where agents process information during idle time, is research stage but signals where tier 3 is heading.

The honest take: Most teams overcomplicate memory. Start with conversation history in Postgres and a structured system prompt. Add vector search when your history exceeds context limits. Add agentic memory management only when your agent needs to learn across sessions.

This IS the state layer. You’re deciding what your agent remembers, how it retrieves it, and when it forgets. Highest complexity in the stack. Lock-in risk is medium. pgvector is portable because it’s just Postgres, while specialized tools like Mem0 or Zep are harder to migrate away from. The prototype-to-production gap is large. Demo memory works because context windows are big enough. Production memory breaks when conversations get long and your agent starts forgetting the important parts.

In-context memory breaks down when agents need to share memory across instances or maintain state across model provider switches. That’s where dedicated memory infrastructure like Letta, Zep, and Mem0 earns its keep.

Layer 4: Frameworks and SDKs

How you wire together the model calls, tool use, and control flow that make your agent work: a provider’s built-in toolkit (SDK), a graph-based framework like LangGraph, or raw code.

Every major AI lab now ships its own agent SDK. OpenAI has the Agents SDK (evolved from Swarm). Google released ADK. Microsoft has Semantic Kernel and AutoGen. Hugging Face built smolagents. Two years ago, LangChain was the only game. Now you pick between three camps: provider SDKs that are fast to start but locked to one model, graph-based frameworks like LangGraph that are portable but require more setup, or no framework at all. That choice didn’t exist in 2024.

LangGraph solidified as the graph-based orchestration leader with v1.0 released October 2025 and production deployments at Uber, JPMorgan, LinkedIn, and Klarna. LangChain agents are now built on LangGraph under the hood. Meanwhile, the “build it yourself” camp grew. Teams that tried LangChain in 2024 and fought the abstraction are now writing thin wrappers over provider APIs + MCP. No framework means full control. This works until your agent needs state management or complex branching.

A quick note on naming: “LangChain” and “LangGraph” are not the same thing. LangChain is the integration layer handling model connectors, tool calling, and prompt templates. LangGraph is the orchestration engine managing state, control flow, and graphs. Most production teams use both together, but LangGraph is where the agent logic lives.

The honest take: Most teams pick too much framework. If your agent calls a model and a few tools, you don’t need LangGraph. A provider SDK and a couple of tool calls will get you to production faster than any graph.

Provider SDKs manage state for you. LangGraph makes you define every state transition explicitly. Build-it-yourself means you roll your own. Lock-in risk is the highest in the stack. Your orchestration code doesn’t port. A LangGraph agent rewritten for CrewAI is a new codebase. Provider SDKs are worse because you’re locked to one model too. The prototype-to-production gap is large. Demo works because nothing goes wrong. Production means handling tool failures, retries, timeouts, and humans who need to approve before the agent acts.

The framework you pick determines your migration cost. Provider SDKs are fastest to start but lock you to one model. LangGraph is portable but complex. Building your own gives you full control until your agent outgrows your wrapper. MCP is the one layer that transfers across all three camps.

Layer 5: Eval and observability

How you measure whether your agent is doing its job: tracing runs, scoring outputs, and catching regressions before users do.

This layer barely existed in 2024. Now it’s the gap. LangChain’s State of Agent Engineering survey found 89% of teams with production agents have implemented observability, but only 52% have evals. That 37-point gap is where production quality dies.

“Evaluation as infrastructure” is converging on three tiers: fast checks on every PR (Did the agent call the right tools?), nightly regression suites that use an LLM to judge output quality, and continuous production monitoring that alerts when agent performance drifts. New agent-specific benchmarks have emerged too, including Context-Bench for memory management, Recovery-Bench for error recovery, and Terminal-Bench for coding agents.

The honest take: Most teams skip eval until something breaks in production. By then they’re debugging blind. The teams that don’t have this problem built evals before they deployed.

State management matters here because your agent runs 12 steps, step 3 picked the wrong tool, and steps 4–12 were doomed from there. If your eval only checks the final output, you’ll never know why. Lock-in risk is moderate. Most tools export OpenTelemetry traces, so switching observability providers is doable, but switching eval frameworks means rebuilding your test suites. The prototype-to-production gap is the biggest of any layer. Most prototypes have zero eval. You don’t feel the pain until production users find the failures for you.

Current eval tools are strongest for single-turn and tool-calling evaluation. Multi-agent evaluation, long-horizon task assessment, and evaluating agents that learn over time are all unsolved problems. If your agent does any of those, you’ll need custom eval infrastructure beyond what the platforms offer today.

Layer 6: Guardrails and safety

How you stop your agent from doing things it shouldn’t: filtering inputs, authorizing tool calls, and validating outputs.

Agent guardrails became a separate discipline from LLM guardrails. In 2024, guardrails meant input/output filters on a model. In 2026, your agent calls tools, spends money, and takes actions. Guardrails now means authorizing tool calls, enforcing rate limits, and validating what the agent actually did.

The “guardrails before action” pattern emerged from teams that learned the hard way. They now enforce authorization at the tool execution layer, not the output layer. By the time you filter the response, the agent already sent the email. OWASP published the MCP Top 10 (beta), which is the first real security checklist for tool-connected agents. Deployment is still DIY. LangGraph Cloud and Bedrock Agents exist, but most production teams are still deploying with FastAPI and their own infra. This layer is where you’ll spend the most unplanned engineering time.

The honest take: This is the least mature layer in the stack. No dominant framework, no established patterns. You’re writing policy code from scratch.

Guardrails need to know what the agent is doing right now to decide what it shouldn’t do next. That means tracking agent state in real time. Lock-in risk is low because most guardrails are custom policy code you write yourself. NeMo Guardrails is the closest thing to a framework, but you’ll still write most rules from scratch. The prototype-to-production gap is effectively infinite. Your demo has no guardrails because nobody’s trying to break it. Production will.

Current guardrails tools focus on single-agent systems. If you’re running multi-agent workflows where agents delegate to each other, guardrail propagation across agent boundaries is an unsolved problem. You’ll need custom authorization logic.

What are you building?

This is the decision that cuts through the framework confusion. The agent type determines which layers you invest in and which tools to pick at each one.

A stateless tool caller answers questions from a knowledge base, looks up an order, or checks inventory. You need a provider SDK, MCP, and Postgres. No framework, no vector database. This is a weekend project.

A multistep workflow processes a refund end to end, reviews a PR across five files, or triages and routes support tickets. Steps depend on each other, things fail in the middle, and humans need to approve before the agent acts. You need LangGraph, MCP, and eval. Build evals before you deploy because these agents break silently.

An agent that learns remembers your preferences across sessions, gets better at your codebase over time, or tracks project context across weeks. You need a memory-first architecture, a vector DB, and eval. Orchestration is the easy part. The hard part is deciding what to remember, what gets dropped, and how you stop old context from polluting new answers.

A multi-agent system has agents that delegate to other agents, split a research task across specialists, or run parallel workstreams. You need the full stack. Two agents passing context to each other is already hard to debug. Five is impossible without trace-level evals on every handoff. Build eval infrastructure before you build the second agent.

Coding agents: All 6 layers in action

Coding agents like Cursor, Claude Code, Codex, and Windsurf are the most proven application of the AI agents stack. All six layers, working together.

At the inference layer, these tools serve hundreds of millions of daily requests. Cursor routes between Claude, GPT-4, and its own fine-tuned models depending on the task. At the protocols layer, MCP servers connect to editors, terminals, filesystems, and Git, which is how the agent reads your code and runs commands. The memory layer uses codebase-aware retrieval with reranking. The agent doesn’t read your whole repo. It retrieves the files that matter for this specific edit.

At the framework layer, these are custom orchestration systems with RL loops. Not LangGraph, not a provider SDK. Purpose-built control flow for code generation, review, and iteration. At the eval layer, Cursor retrains its acceptance-rate model every 90 minutes based on whether users accept or reject suggestions. That’s eval running in production, continuously. And at the guardrails layer, sandboxed execution prevents runaway agents. The agent can write code and run it, but inside a container that limits what it can touch.

The AI agent stack cheat sheet

Every layer scored on the three questions from the evaluation framework: How much state do you need to manage? How much vendor lock-in can you tolerate? And how hard is it to go from demo to production?

The bigger picture

Most teams are building like it’s still 2024. They pick LangGraph before they know if they need state. They add a vector database before they’ve outgrown Postgres. They design multi-agent architectures before they’ve shipped one agent that works. The decision flowchart above exists because a tool-calling chatbot and a multi-agent research system share almost no infrastructure. Treat them the same and you’ll overbuild the first and underbuild the second.

The teams that got past this run evals on every deploy, not once a quarter. Their guardrails sit at the tool call layer, not the output layer. Their memory architecture was designed, not inherited from whatever the framework defaulted to. Most teams ship the opposite: no evals, output-only filtering, and a system prompt that grows until the context window chokes. The gap isn’t talent or budget. It’s knowing which layers matter for your specific agent instead of half-building all six.

The stack is going to collapse. Provider SDKs are already absorbing memory, tool calling, and basic eval into a single API. By early 2027, most teams won’t build each layer separately. They’ll get an increasingly opinionated stack from their model provider and that will be fine for 80% of use cases. The other 20%, agents at scale where the defaults break, will still build custom at every layer. But even then, when something fails in production, you need to know which layer failed. That’s what this article is for.

Sources

“The AI Agents Stack,” Letta, November 2024.
“Donating the Model Context Protocol and Establishing the Agentic AI Foundation,” Anthropic, December 2025.
“120+ Agentic AI Tools Mapped Across 11 Categories [2026],” StackOne, February 2026.
Henrik Plate and Darren Meyer, Dependency Management Report, Endor Labs, January 2026.
Jason Liu, Context Engineering Series: Building Better Agentic RAG Systems, August 2025.
“LangChain and LangGraph Agent Frameworks Reach v1.0 Milestones,” LangChain, October 2025.
State of Agent Engineering, LangChain, December 2025.
Yunfei Bai, Allie Colin, Kashif Imran, and Winnie Xiong, “Evaluating AI Agents: Real-World Lessons from Building Agentic Systems at Amazon,” Amazon, February 2026.
OWASP MCP Top 10, OWASP.

This Week in AI: Production Viability

Michelle Smith — Fri, 05 Jun 2026 15:55:20 +0000

On this week’s episode, host and the founder of AI advisory firm Intelligence Briefing Andreas Welsch brought together Maya Mikhailov, cofounder and CEO of Savvi AI, and Doug Shannon, generative AI and intelligent automation leader, to cover a handful of interconnected topics that practitioners are navigating right now: OpenAI’s push into personal finance, the role of metacognition in AI-assisted technical work, the growing backlash against token-based productivity metrics, and the new role of forward-deployed engineer. Together, these stories sketch a picture of an industry that’s good at generating output but is still figuring out what output is worth.

Why OpenAI wants your bank account data

When OpenAI announced it was analyzing users’ transaction data in partnership with financial institutions, the coverage focused on the consumer benefit: a smarter way to track spending, comparable to what Credit Karma or Mint offered but with a more conversational interface.

But that’s not all the company’s interested in, or even the main thing. Maya reframed the stakes: “What OpenAI wants to do is figure out consumer intent.” Being able to access users’ financial data is less about helping people manage their money and more about completing a profile the company can then monetize. OpenAI already builds a surprisingly accurate picture of users from their chat histories. Add transaction data and you get specifics that weren’t there before: what someone is saving for, what they’re anxious about, where their money is actually going. That’s a data asset worth a great deal to advertisers.

We’ve seen this pattern before, and as Andreas noted, companies have long held (and used) potentially invasive data to recommend products. The Target pregnancy prediction story is now more than a decade old, but it’s still being taught in business school, including by Andreas, precisely because it illustrates how behavioral data can be combined to infer things people haven’t explicitly disclosed—and spotlights the fine line between effective recommendations and those that feel too personalized, reminding consumers just how much information companies have on them. Companies’ profile-building capability hasn’t changed, but AI chat adds a new wrinkle, said Maya. A conversational interface makes disclosure feel natural, so the knowledge graph based on your chat history is very powerful. And these tools are also better positioned to share recommendations than traditional avenues. “By having this style that is agreeable, that is engaging,” Maya explained, “those recommendations are going to be a lot stickier than what a fragment of a sentence I type into a regular search engine.”

Metacognition as a professional skill

When you delegate thinking to a system that averages across a massive range of inputs to produce an answer, you need to know when that answer is good enough and when it isn’t.

“We’re essentially being averaged out,” Doug said. The model is doing many things behind the scenes to find a mean response. The human’s job is to ask questions about the questions, to push past the first answer, and to know whether their own judgment is still in the loop. That’s why Doug’s been pushing for a renewed interest in metacognition, or “thinking about thinking.” Offloading cognitive load that’s peripheral to your work is fine, Doug and Maya agreed. Offloading the reasoning that’s central to your job’s value—what Doug called cognitive surrender—is where organizations get into trouble.

The future advantage won’t come from access to AI. Everyone will have some kind of access to it. The advantage will come from knowing what to offload, what to question, and what should never leave human judgment. This is a skill-development question as much as a philosophical one. The people who’ll be most effective with AI tools aren’t the ones who use them most; they’re the ones who understand what to hand off and what to keep. That requires domain knowledge, judgment about when a model’s answer is plausible but wrong, and enough fluency with how these systems work to recognize when you’re being handed an average instead of an answer.

Tokenmaxxing and the wrong incentive

The tokenmaxxing debate seems to be coming to a head. Amazon abolished its AI productivity leaderboard after employees started gaming it by writing inefficient code to rack up token usage. And one company reportedly burned through $500M in Anthropic tokens in a single month after failing to set limits. The companies encouraging tokenmaxxing are incentivizing the wrong metrics, Maya argued. It’s like determining which bakery is best by the amount of flour it uses. The right question is “Are we making a quality product?”

Andreas shared his own vibe coding experience as an example of how token consumption and technical debt compound in practice. A developer starts with a modest plan and burns through their quota running agents in half an hour. They upgrade to a higher tier, paying five times more, but now the sunk-cost logic kicks in. As Andreas pointed out, now they feel like they “should also be getting five times more the value out of [their subscription],” so scope expands from a single tool into a unified business operating system. Three weeks later, the accumulated complexity has outpaced the ability to evaluate it: Repeated security audits keep surfacing new issues, each pass generating recommendations that require cybersecurity expertise most vibe coders don’t have. Here’s where Doug’s point about metacognition applies: The more a builder stays actively involved in understanding what the system is actually doing, the better their judgment about whether it is working. For less engaged users, the risk is accepting the output, shipping the debt, and discovering the consequences later.

Most of the misalignment originates in the gap between what executives expect from AI and what practitioners deal with day-to-day. Executives see a capability that could change the slope of productivity, Maya explained. Engineers and analysts live with the technical debt, the version control problems, and the regulatory constraints that don’t disappear because you have a better code completion tool. The leaderboard problem is a symptom of that disconnect.

GitHub’s recent shift from unlimited to usage-based pricing for Copilot is likely to realign these incentives faster than any internal policy change would. When more CFOs start seeing the actual bills, the leaderboards will all come down.

Doug identified a related problem emerging with the “cognitive surrender” to LLMs. When organizations encourage employees to pipe internal processes, proprietary logic, and institutional knowledge into foundation models without governance, they’re not just running up token bills. They’re giving away the operational knowledge that differentiates them. Process documentation, workflow logic, and institutional memory about why certain decisions were made are all forms of intellectual property, and once they’re encoded into a general-purpose model, the organization’s advantage from them diminishes.

Forward-deployed engineers aren’t enough on their own

Is the answer to these challenges to put a skilled engineer directly inside the customer environment to translate between what a model produces and what an organization actually needs? That’s the promise of the forward-deployed engineer (FDE) approach popularized by AI firms. Doug and Maya both had some criticisms of the model.

Maya’s objection was structural. Enterprise AI deployment isn’t a matter of adding capability on top of existing infrastructure. Organizations arrive with siloed data, legacy systems, and regulatory constraints that no forward-deployed engineer can resolve on technical skill alone. You can’t “just sprinkle some AI on it, and it’ll work just by a package of tokens,” she said. Engineers have to know the context behind why certain data can’t be used or why a particular model can’t be deployed in a regulated context. FDEs coming into an organization fresh don’t have this understanding and as a result may undo decisions that were made carefully and for reasons that aren’t written down anywhere obvious.

Doug’s concern was about communication. FDEs, in his experience, tend to arrive with strong technical instincts and limited organizational context. They get into the work quickly but struggle to communicate across the full stack of stakeholders involved. That’s why business analysts exist, to understand the customers’ problems and what the process actually is before engineers can address them. Skip that step and you get technically correct output that solves the wrong problem.

What both Maya and Doug were underscoring is that AI deployment at the enterprise level is fundamentally a context problem. The models are capable. What’s hard is knowing which capability to apply, where to do it, and with what constraints in place. That knowledge doesn’t live in the model; it lives in the people who’ve worked inside the organization long enough to know why things are the way they are.

The measurement problem

All the topics in this episode circle back to the same question: What are we actually measuring, and what incentives are we setting in place with those measurements? Token counts and lines of code don’t always correlate to the outcomes companies want. You need human expertise and a contextual knowledge of the business to figure out what goals you want to achieve and what to measure to ensure you get there.

On next Monday’s episode of This Week in AI, RecoMind founder Miguel Fierro joins host Christina Stathopoulos to discuss responsible AI, multimodal content creation, and more on how LLMs are changing personalization and user understanding. Miguel will also lead a live demo that offers a glimpse of the next generation of recommendation experiences—register here.

We’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, Apple, or wherever you get your podcasts.

I Let an AI Agent Run 40 Experiments While I Slept

Vanchhit Khare — Fri, 05 Jun 2026 10:27:18 +0000

I set up an AI agent on a rented GPU, pointed it at a training script, and went to bed. By morning it had run 40 experiments, improved validation loss by 5.9%, and cut memory usage from 44 GB to 17 GB. It also spent four hours chasing a bug that a linter introduced behind its back. The agent never flagged it. I only found out because the numbers stopped improving and I started reading logs.

The setup was based on Andrej Karpathy’s autoresearch project: Give an agent one file it can edit (train.py), one metric to optimize (validation bits per byte), a fixed five-minute training budget per experiment, and Git for checkpointing. If an experiment beats the current best, keep the commit. If not, revert. Loop forever. Karpathy’s own run produced 700 experiments and 20 genuine improvements across 48 hours, an 11% speedup on already-optimized code. Shopify’s Tobi Lütke pointed the same pattern at Liquid, their templating engine, and got 53% faster rendering from 93 automated commits. The pattern clearly works. The question is what breaks when you run it yourself.

The first failure: Agents fixing agents

Before running autoresearch, I had a separate problem. I had 15 custom skills for Claude Code (think reusable prompt templates with tool access, structured inputs, and specific behaviors). Most of them were broken when dispatched as parallel background agents. Vague descriptions meant the system couldn’t figure out when to invoke them. Missing tool permissions caused silent failures. Duplicate scopes between similar skills created routing confusion.

So I used the same pattern: dispatch background agents in parallel, one per skill, each tasked with reading the skill definition, identifying problems, and rewriting it. 13 out of 15 came back improved. Descriptions got specific. Dead references to nonexistent files were removed. Tool permissions were added. Two skills were left untouched because the agents couldn’t find anything wrong with them. The whole batch took under an hour.

But here’s what I didn’t expect. Three of the “improved” skills had subtle regressions. One agent removed an AskUserQuestion gate that was there for a reason, because the gate’s purpose wasn’t documented and the agent read it as unnecessary friction. Another agent rewrote a skill description so precisely that it stopped triggering on the fuzzy, misspelled queries real users actually type. I caught these during manual review, but if I had trusted the parallel output without checking, three skills would have silently degraded in production.

The second failure: The linter in the loop

Then I started the training loop. The agent worked through hyperparameters methodically. It halved the batch size early (experiment 4), which turned out to be the single biggest win: more gradient steps in the same five-minute window. It reduced model depth from eight to seven layers, dropped weight decay from 0.2 to 0.05, and tuned the learning rate schedule. Each change was small. The cumulative effect was a 5.9% improvement in validation loss and a 60% reduction in peak GPU memory.

Out of 40 experiments, the agent kept nine, discarded 28, and crashed three. That keep/discard ratio felt about right. Most ideas don’t work. The point of automation isn’t to have better ideas. It’s to try bad ones faster.

Then the numbers plateaued. Experiments 30 through 38 produced nothing worth keeping. I started digging through the logs and found something I hadn’t expected: A linter running on the remote machine had been silently modifying a hyperparameter in train.py. It changed SCALAR_LR from 0.5 to 0.3 every time the agent saved the file. The agent would set the value, commit, and run the experiment, but the linter would alter the file between the save and the execution. The agent had no way to detect this because it checked Git diffs, not the runtime state of the file. Every experiment after a certain point was running with a learning rate the agent never chose.

I lost roughly four hours of compute to this. The agent kept going, proposing new ideas, running experiments, logging results. From its perspective nothing was wrong. The experiments ran, produced numbers, and the numbers were plausible. There was no crash, no error, no alert.

Why this matters beyond my GPU bill

Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs and inadequate risk controls as the primary drivers. My overnight session was a toy example: a single GPU, a small model, and a low-stakes experiment. But the failure pattern scales. An agent that can’t detect when its inputs are being modified between decisions will make the same class of error whether it’s tuning hyperparameters or managing a production pipeline.

The autoresearch constraints are smart: one file, one metric, and Git for state. But they assume the environment is stable. Nobody checks whether something outside the loop is modifying the file between commits. The agent optimizes within its sandbox, and the sandbox has a hole in the wall that nobody thought to look for.

Anyone who has run distributed systems recognizes this. When the linter changed that hyperparameter, it was the equivalent of someone editing a database record between a read and a write. We solved that problem years ago with compare-and-swap, optimistic locking, checksums. We just haven’t brought any of it to autonomous AI workflows. The SkyPilot team recently scaled autoresearch to 16 GPUs and 910 experiments. At that scale, an undetected environment mutation doesn’t cost you four hours. It costs you a cluster.

Next time I run autoresearch, I’ll add a file integrity check before every experiment. It’s three lines of code, but it would have saved me four hours and produced a better final result. The agent did its job. The environment didn’t.

The Tidy House

Tim O’Reilly — Thu, 04 Jun 2026 16:25:11 +0000

DJ Patil has spent the past several months on a listening tour. Wherever he travels, he finds a local university, pings faculty and students and anyone else who wants to show up, and runs an AMA. He’s heard from grad students who can’t get callbacks, hospital administrators dealing with federal policy changes that land like a change in the laws of physics, and executives who can’t forecast their AI spending past six months. He’s trying to synthesize all of it and help reframe the wider conversation.

DJ co-coined the term “data scientist,” served as America’s first chief data scientist under President Obama, and was chief scientist at LinkedIn. He’s a longtime O’Reilly author, going back to Building Data Science Teams and Ethics and Data Science, and he’s on the founding team at Devoted Health, where he’s spent the past decade building the kind of data infrastructure most organizations are still struggling to put in place. He calls it “the tidy house.” He sat down with me to talk about “the broken promise” in the job market that is driving AI sentiment, and why weak data infrastructure is a big part of the gap between what AI can do and what most institutions can actually absorb.

The broken promise

What DJ keeps hearing on his tour is anger and angst. One word that keeps coming up is “terrified.” Workers are worried about layoffs. Meanwhile, students, including those from top-tier universities like MIT, Carnegie Mellon, and UC Berkeley, have been applying to 300+ internships and getting fewer than 10 callbacks. Many had zero offers going into the summer. And the industry’s response has been to tell them to learn more AI and burn more tokens. What it comes down to, DJ explained, is “effectively a broken promise”:

We said, “Go to college, get these things, you’re going to get an internship, you’re going to get job training, you’re going to pay off your student loans, and then you’re going to have all the other things that are part of that social contract.”

What the students are feeling for the first time [is]. . .“Wait, if I can’t get this internship, . . .I’m fundamentally off trajectory from getting this job.” And it doesn’t have to be a technical person. It could be someone that is in marketing. It could be someone that’s in the liberal arts. It could be a researcher. . . .There are plenty of students that I have talked to who are supposed to be going to a doctoral PhD program or a medical school or something like that. The slots aren’t there because of the overall budget impacts. And so whether you call it AI impact or economic reframing, the thing is broken.

This is where both DJ and I have been trying to build a counter narrative. The story coming from the AI labs is destructive: “We’re going to put all of you out of work, and we’ll figure out the rest once the intelligence explosion arrives.” That’s bad PR for AI, but it’s also magical thinking. An economy is a circulatory system. You can’t put your customers out of work and at the same time expect that the economy will hum along as usual. A catastrophic recession could easily interrupt the funding that keeps AI on its growth path and the concentration of value that they assume will fund universal basic income and an expanded safety net.

That’s why I’m a fan of mechanism design: start from the outcome you want, then figure out the rules of the game that produces it. Right now, they’ve designed a game that concentrates all the value in the hands of AI first movers. They could be designing a game that generates value throughout the economy. But they aren’t building affordances for that.

YouTube ContentID is a good example of mechanism design leading to economic value creation. When unauthorized music use by online video creators triggered a backlash from rights holders, YouTube replied to the takedown notices with a way for both the people who owned the music and the people who wanted to use it to get paid. A whole creator economy came out of that design choice. The labs have the same opportunity in front of them and mostly aren’t taking it.

DJ had one concrete mechanism in mind:

Imagine OpenAI and Anthropic and Microsoft. . .get together and [say], “If you’re building something for your local community, we’ll fully subsidize the token cost for some period of time.”. . .We’re talking about marginal token usage relatively on the spectrum of things, but the potential innovation and use of AI to help local communities could be astounding. You’re not putting anybody out of a job with that. . . .You’re filling the holes that already exist in the system.

The OpenAI Foundation just announced it will put $1 billion into public-benefit projects this year, including $250 million aimed at building economic futures. It’s a start. But it mostly seems designed to ameliorate the bad effects of AI rather than to forestall them by building a more inclusive AI future. If the labs start investing in the human-plus-AI economy rather than just studying the job losses, the payoff to local communities could be real.

A makerspace to bridge the internship gap

DJ’s plan is to build a bridge. He’s launching a program, basically a makerspace, for students who don’t have an internship this summer. Over two four-week sprints, an initial cohort will get mentors, speakers, and the space to explore whatever they’re interested in. It doesn’t have to be AI. Whether they’re doing investigative journalism, screenwriting, or building civic tech, participants will get some experience with current tools and produce a tangible asset they can use to prove what they know. As I told DJ in our conversation, I think he’s really on to something, and I’d love O’Reilly to be part of what he’s building.

There’s a kind of person who has always been at the center of the O’Reilly community and never waited for a job description. High school and college dropouts who started companies, built open source software packages, or otherwise took the future into their own hands. People who looked around, found something that needed doing, and did it. DJ is one of them. He’s a community college kid who learned from a good local library, from the books with the “funny animals” on the cover, and from open source. That path is still open. The early O’Reilly business came out of exactly this instinct. We were a tech-writing consulting shop, and when we ran out of paid work, we wrote manuals that didn’t exist yet but that we thought were needed. Later, when there were big conferences for every corporate technology and none for open source, we ran the first one for Perl. Conferences became a whole new business for us. You look for the gap and you fill it.

DJ pushes the same idea down to the level of the neighborhood:

If you want to feel rewarded, go fix something in your neighborhood. Go help out the food pantry. Go help out the local foster child care system. Go help out. . .parks and rec. Use those skills to go do something, and then you’re going to see. . .people respond in a different way. . . .The target-rich area for problems is massive. You just have to look.

I’ve never bought the jobless-future story. Back when I wrote WTF? in 2016, I pointed out that there is so much around us that needs to be made better. The constraint has never been a shortage of problems. AI gives us new tools for solving them. It should be a way to put people to work, not out of work.

The organization is the AI bottleneck

DJ has also been visiting hospitals and clinics and talking to CIOs and CTOs as part of the tour, and what he’s seeing is alarming.

The federal changes to Medicaid and the Affordable Care Act are landing on systems that were already near collapse. Hospitals that depended on outpatient procedures like colonoscopies for margin are watching volumes drop 20% to 30% because people can’t afford insurance. Some are running $1 million a day behind, a $300 to $400 million shortfall for the year.

At the same time, AI companies are telling those same hospitals to move into the new world, and partly because of the “you will soon be replaced” narrative from the AI labs, labor is responding the way the Kaiser nurses did in California, where any use of AI was off the table as a bargaining condition. As DJ pointed out, we can’t afford to disregard AI when it has the potential to automate the most painful parts of healthcare workers’ jobs and let them “do the job they’re trained for” without the administrative burden. Businesses need to change not just their narrative but their strategy. They need to be saying, “We’re going to use AI to help you do more for our customers. We’re going to make your job more human and let the machines deal with the BS.”

There’s a version of this where the efficiencies AI creates get plowed back into better patient care. There’s also the version that’s actually happening in most places, where private equity captures the savings as profit. The difference is institutional design, and that’s where reform isn’t happening. I saw this directly with a Code for America project called Clear My Record. A California initiative had turned a number of petty crimes into misdemeanors, but very few people were petitioning to have their status changed. We started using software to streamline an absurdly convoluted criminal record expungement process, but then we asked ourselves why we were helping people fill out forms that shouldn’t exist. The law had already changed the record. The process should have been a database update, not something that required a petition to the court. That’s the kind of problem AI was born to solve. It can help us refactor old stuck processes and move to something way better.

Done right, DOGE could have been an opportunity to carry out that kind of real institutional change at scale. Instead it became a wrecking ball, and it’s given the whole idea of institutional reform a bad name.

The Silicon Valley default assumes that incumbents will just get disrupted by startups, the way media was by Google and Meta and retail was by Amazon. There’s some truth to that. But disruption takes much longer than people think, and in a domain as central as healthcare or government services, the delay means real harm to real people. Healthcare is a third of the economy. You can’t just let it fail and rebuild it fresh while people depend on it for survival.

Data infrastructure is the competitive advantage

DJ’s term for the alternative he’s living with at Devoted is “the tidy house.” He built the boring infrastructure years before LLMs existed, and that’s why the company could move the moment AI arrived. People don’t think about having well organized, effective data infrastructure as the deep secret behind enterprise AI adoption, but DJ is right. As we work on O’Reilly’s own transformation and talk with our customers about what’s holding them back, it’s a huge part of the problem.

One of the ways we’ve tried to make this work is fundamentally still data 101, unified data environments, data flows that are clean, that have a lot of organization. . . .Because we invested so heavily in that infrastructure, the dumb, boring, painful parts of making sure you’ve got a really great data warehouse, great data engineering pipes, all of the metadata that goes with it, when AI shows up, you get to use it right away. Now you get to focus on the orchestration, the harness, all those pieces.

While other organizations are reconstructing ETL inside context windows and paying for it in GPU costs, Devoted’s team gets to work on the actual clinical problems. As DJ put it, transforming a healthcare system is “like walking and chewing gum while balancing bowling balls on your head and on a unicycle,” with the laws of physics changing on you the whole time. The organizations that come through it will be the ones that did the unglamorous work of keeping clean, flowing data with its lineage and metadata intact. The ones that didn’t will keep paying to reconstruct context they should have had all along.

The pharmacists who built their own agents

The tidy house pays off when you put the tools in the hands of people who already know the domain. At Devoted, clinicians are building things without waiting for a product manager to learn the problem first. These frontline workers have already spent decades understanding it.

A pharmacist. . .says, “Hey, you know what? I’m really worried when I see these kinds of drugs show up together. That’s not a good thing. . . .Why don’t I have an agent that alerts me every time this happens? I should just automate it because maybe one of the patients gets prescribed something by another provider and we don’t see it.” So the pharmacist [says,]. . .”I’m just going to build that agent.” Now I’ve got an agent always looking for bad drug interactions. And another pharmacist says, “I’ve got my own version of that.” . . .So I say, “Hey, agent, I want you to go ask all the pharmacists that we have a quick survey of what might be happening. . . .What are the universe of things that we should be watching out for?” Now I’ve got a robust medical layer. . .looking out and protecting all of our members from bad drug interactions. Having the right infrastructure makes it possible to act on decades of accumulated judgment distributed throughout the organization.

The histogram is still the most powerful product

You don’t need exotic tooling to get value out of data, and DJ punctured the assumption that you do.

Oftentimes, I tell people, the most powerful data product you can build is still a histogram. Just give me a distribution of what’s going on. . . .AI gives us a tremendous opportunity to let people [access this data quickly], but we’ve got to figure out the guardrails, so people don’t ask [questions] or get answers. . .[without realizing] that there’s a flaw in how they’re asking it.

Every time a new technology empowers employees to make innovative use of corporate data, there is resistance. We’ve been in this loop since the beginning of the data movement, DJ explained. The stewards of the data warehouse stand at the gate and say, “You shall not pass!” Then democratization breaks it open, and the gatekeepers reconstitute themselves in the next era. Hadoop did it last time. LLMs are doing it now, and the temptation to insist that only experts can use the tools correctly is as strong as it’s ever been. You do need ways to catch errors. But the goal should always be access.

The real opportunity is in the layers above AI models

DJ and I also talked about the new discipline forming inside computer science, engineering the trade-offs between conventional software and LLMs, when to reach for a local or open weight model, and understanding what inference actually costs against the value it returns.

Getting that right requires an expanded view of mechanism design. While this isn’t how economists talk about it, many advances in technology are really just that: redesigning the rules of a game to get better outcomes. Pay-per-click advertising started as a crude auction that sold to the highest bidder, and then Google refined it into something that worked. Rob McCool wired a web server to a database with CGI and ushered in a decade of invention of new mechanisms for data-driven websites. Or take Apache Kafka, which DJ reminded us began as a project to help LinkedIn rein in its Splunk bill and only later became the foundation for a company and an ecosystem.

We’re at the front of an architectural innovation cycle now, and the biggest opportunities are not in the models themselves but in the layers above them. That’s also where a renaissance of open source for the AI era could happen.

DJ and I are both, as he says, “this giant human LLM, summarizing and distilling all the things we’re hearing” from a lot of people. What we’re hearing is that the technology is mostly ready, but our institutions are not. What’s lagging is the organizational and economic infrastructure that lets universities, hospitals, data teams, and the labs themselves actually deploy what’s been built.

It’s time to get busy!

On June 10, Harper Reed, cofounder of 2389 Research, will join me to talk about why the future of software depends on creativity, serendipity, and building weird stuff. And on July 9, Trail of Bits cofounder and CEO Dan Guido will stop by to share his playbook for going AI native. You can register to attend them live here. You can also follow Live with Tim O’Reilly on YouTube, Spotify, Apple, or wherever you get your podcasts.

Predict, Don’t Enumerate

Michael Roytman — Thu, 04 Jun 2026 10:57:44 +0000

A third of the way into a security-operations guide that Anthropic published in April 2026, wedged between a recommendation to patch CISA’s Known Exploited Vulnerabilities list and a suggestion to automate your deployment pipeline is a small recommendation: “Use EPSS to prioritize the rest.” For anyone who has worked on a vulnerability backlog in the last decade, the sentence is an acknowledgment of a widely felt but often unspoken fact about security programs: They have become machine-scale problems of signal to noise.

EPSS (Exploit Prediction Scoring System) is a statistical model that takes a known software flaw, runs it through a set of signals about what attackers are actually doing across the internet, and returns a probability that the flaw will be exploited in the next 30 days. It isn’t an LLM, and it does no reasoning or prompt engineering. It predicts. The company endorsing it is the same company whose newest model can surface thousands of novel, exploitable vulnerabilities in production software, many of them two or three decades old, most of them still unpatched.

As far as we can tell, this is the first time a frontier AI lab has publicly endorsed a purpose-built predictive model as the right tool for a defensive problem. LLM labs usually recommend LLMs. That Anthropic did not is worth noting, but the recommendation itself isn’t news to the practitioners it’s aimed at. It’s a description of what they’ve been doing.

The quiet consensus

The volume problem isn’t new. Anyone running a scanner against a large enterprise estate in 2015 was already generating hundreds of thousands of findings per month. Anyone running one against a cloud environment in 2020 was generating millions. Enterprises have spent the better part of a decade staring at dashboards where the number of open critical findings was larger than the capacity of the team supposed to fix them. In other words, cybersecurity has become machine scale.

Risk-based vulnerability management, as a product category, has existed since around 2018. EPSS, as a public resource, has been usable since 2021. More than 120 vendors embed it today into their products. The field has had access to a predictive baseline for years.

What has been missing is an external justification to change the status quo recommendations from auditors, model risk management teams, and even boards. Auditors want a clear set of expectations, making grading more objective and therefore easier to evaluate. Compliance frameworks like CVSS (Common Vulnerability Scoring System) because CVSS is easy, but implementing something more efficient has historically required that aforementioned external push. A working CISO could tell you she had stopped treating every vulnerability scored a severity 9.8/10 by CVSS as an emergency in 2019, but she would also tell you she still kept CVSS in the report.

Anthropic’s guidance is useful because it makes the private consensus public. Patch what you know to be exploited, then use EPSS above a threshold based on the team’s capacity or risk tolerance. DHS CISA’s practice of publishing known exploited vulnerabilities since November of 2021 is just additional proof that the existing methodologies were being overwhelmed by scale and lack of signal.

Why prediction, stated plainly

In 2014, at Black Hat, Dan Geer, then the chief information security officer of In-Q-Tel, asked the first principles question: Are vulnerabilities in software sparse or dense? Sparse meant finite, meaning every fix measurably shrank the attack surface. Dense meant weeds in a field. Geer could not answer the question because the data were not in.

Eight years later, Jonathan Spring at Carnegie Mellon’s Software Engineering Institute tied vulnerability enumeration to the halting problem and showed, in theory, that for any sufficiently complex piece of deployed software, there are always more undiscovered flaws.

The AI-driven discovery results of the last 18 months have made the density argument impossible to wave off even in a compliance review. A 27-year-old bug in OpenBSD. A 16-year-old bug in FFmpeg that five million fuzzing runs never caught. Disclosed findings, by the developers’ own accounting, are less than 1% of what has been found. But again, the volume was already a problem. With the coming release of its newest model, Mythos, Anthropic is telling teams to plan for an order of magnitude more findings over the next 24 months.

Static severity scoring can’t survive the volume problem, because it’s a human-scale solution for a machine scale problem. Neither can any process that treats every critical finding as an emergency. The threshold for action has to be probabilistic, measurable, and defensible. That’s what a predictive model is for, and that’s what working teams have been using in noisy large enterprise environments.

Pointing machines and knowing machines

Geer returned to his 2014 question in the summer of 2025, writing with Dave Aitel in Lawfare. The piece gives the industry a vocabulary for a distinction it has been fudging:

A vulnerability in the code isn’t automatically a threat. A buffer overflow is a hazard. It becomes a risk only if an attacker can exploit it reliably, in this environment, against these controls, through this traffic. Bugs are abundant but the ability to weaponize a particular bug against a particular target is much rarer.

The industry, they wrote, has built a pointing machine. It enumerates.

Even children learn early to point and name—but knowing the word “dog” doesn’t reveal whether the animal might bite. In cybersecurity, we’ve built systems that similarly point and name vulnerabilities without understanding whether they’re truly dangerous. By embracing AI solely for pattern recognition, we’ve created a powerful “pointing machine” that identifies possible threats but does not comprehend their actual impact. What we need instead is a “knowing machine,” capable of understanding how code functions within complex, real-world environments, recognizing not just hazards but the full context of how and whether those hazards might become genuine risks.

A knowing machine is a system that understands how code behaves in a particular environment and recognizes the context that turns a hazard into a risk. A predictive model is how you build a knowing machine. EPSS is the clearest public example: It covers every published CVE and is updated daily.

Global isn’t local

EPSS is a global model. It sees what attackers are doing across the whole of the internet. It picks up patterns in exploitation activity that severity scores never could. What it can’t see is any particular organization’s environment. It doesn’t know which assets carry the data the business actually cares about. It doesn’t know what compensating controls are in place, where remediation is risky, or how your telemetry and history change the odds.

A 9.8 with a 97% global probability of exploitation and a 9.8 with a 0.1% probability are not the same animal. Neither are two organizations applying the same EPSS threshold to the same CVE on different assets. One has the vulnerable code path exposed to the internet, behind a web application firewall that doesn’t inspect the relevant protocol. The other has the same CVE on an internal system that accepts authenticated input from a single service account. A scanner can’t tell them apart. A global model can’t tell them apart. Their actual risk profiles are orders of magnitude apart.

Local context is where most security teams have been stuck the entire time, and where the next decade of the field is going to be fought.

What a local knowing machine actually requires

Pair a better pointing machine with a faster remediation engine and all you’ve done is increase the speed at which you produce churn, breakage and wasted effort. You’ll also spend a king’s ransom in agent tokens fixing vulnerabilities that were never dangerous in your environment.

In contrast to an omniscient scanner, a local model trains on the specific environment being defended: asset inventory, application topology, reachability, deployed controls, attack telemetry observed on-site, and the history of the organization’s own remediations and their outcomes. The model produces probabilities specific to the enterprise. Most organizations already have the inputs, scattered across CMDBs, endpoint agents, firewall logs, ticketing systems and scanner output. This context is precisely what attackers (whether they’re using good old fashioned metasploit or Mythos with an infinite budget) are lacking in their models. The context becomes an asymmetrical advantage for defenders, perhaps the only one that exists.

The policy shifts that actually matter

The interventions that will decide whether a security program survives the next 24 months aren’t purely technical. A CISO can put most of them in motion without buying anything.

Rewrite the SLA. Most vulnerability-management SLAs are organized by severity. Criticals in 15 days, highs in 30, mediums in 90. That structure was built for a world where the count of open criticals was small enough to matter. It’s now actively harmful, because it forces teams to spend the same effort on a 9.8 nobody is exploiting and a 7.5 that’s under active attack. SLAs should be rewritten in terms of probability of exploitation and asset exposure, not severity. A CISO who can’t get that past her GRC team can at least add a second tier that makes the probability-based cut enforceable alongside the severity-based one.

Change what the board sees. If the monthly security report counts the numbers of vulnerabilities, exposures or findings in different buckets (“critical,” “open past 30 days,” etc.), the organization is being managed to the wrong metric. The metric should be exploitability-weighted exposure over time, with a second line for predicted versus observed exploitation. Boards will accept this once somebody explains it. This beats showing them a number that has no relationship to risk and is growing exponentially as new LLM models are released. More to the point: A great team can do amazing volumes of remediation work, and risk can still rise because they’re measuring and remediating the wrong thing. An efficient, context-rich team can do far less work and meaningfully move the probability of an event down.

Invest in telemetry. The single most valuable instrument a security program can build is a feedback loop between what was prioritized and what was exploited. If the loop shows you were wrong, the model improves. If the loop does not exist, you will keep being wrong indefinitely (or just not being aware of misses).

Fix the compliance conversation. The reason CVSS survives is regulatory inertia. PCI, HIPAA, and most state breach-notification frameworks still reference severity. The CISOs who will come out of the next two years in the best shape are the ones who engage their auditors now, in writing, about what a probabilistic prioritization framework looks like under the existing rules.

Staff for the bottleneck, which isn’t scanning. The industry has spent a decade hiring people to find bugs. The bottleneck now is deciding which bugs matter, getting the fixes deployed, and measuring whether the prioritization was correct. The job descriptions should reflect this. A security-data engineer may be able to increase efficiency to meet SLAs more than increasing capacity would.

None of this requires a new product. All of it requires a CISO willing to say, out loud, that the old dogma is broken and that the new one will be managed by data and probabilities. That is the shift Anthropic’s five-word sentence was really announcing. The technology is available and the models are here—both the LLM-based ones to find the vulnerabilities and the predictive knowing machines to prioritize efficiently.

Context as Code

Artur Huk — Wed, 03 Jun 2026 11:00:14 +0000

As syntax becomes cheap and abundant, architectural control becomes the scarce resource. Effective governance starts upstream, where intent, constraints, and threat models shape the agent’s working context before generation begins. The goal isn’t better prompting but build-time boundaries that prevent structurally invalid code from entering the system.

The Frankenstein factories

The dark factories (as Dan Shapiro calls them) are running. Tokens fly through trycycles, features ship overnight, and codebases are ported before breakfast. The velocity is real. And comprehension debt (a term coined by Addy Osmani) is compounding in silence behind it.

What this era is producing, at scale, deserves its own name: Frankenstein factories. Not a critique of any single approach but a description of a structural condition—generation engines so effective at producing working syntax that they have industrialized the creation of architecturally ungovernable systems. The creature walks out of the laboratory impressive, functional, and alive on delivery day.

The crisis arrives the day someone must govern it. To govern a system means to hold it accountable to its design boundaries—the ability to look at it and reliably say why it works, what is permitted to touch what, and to categorically prevent forbidden state changes before they happen. Victor’s catastrophe was not the act of creation but the absent governing frame.

For prototyping or shipping features fast, unconstrained generation is a powerful tool. It optimizes for velocity, and it delivers. But for enterprise payment systems, insurance underwriting engines, logistics orchestrators, and regulated platforms, the question is not “Does the code ship?” but “Who is liable when it does the wrong thing?” Here, automating the word “YES” to every feature request does not solve the problem. It industrializes it.

Consider a standard Jira ticket: “Add an email notification after a successful payment.”

A junior developer might attempt to wedge the email-sending logic directly into the PaymentProcessor class. A senior architect catches this in code review: “No. Fire a PaymentSuccessEvent to the message bus.” That human friction—the architectural “No”—keeps the system maintainable.

Unconstrained AI agents lack this assertiveness. By default, they are the ultimate yes-men.

Hand that same ticket to a standard coding agent and it will not argue about bounded contexts. It will burn tokens until it produces 300 lines of syntactically perfect code, import an SMTP library directly into the core of your billing domain, and submit a pull request. The tests will pass; conventional feature tests make no assertion about bounded contexts. The CI pipeline will go green. And structurally, the system is now a disaster.

This happens not through malice but because of how agentic loops are built. Without explicit architectural constraints, the system’s emergent behavior is to fulfill immediate user intent. The agent is orchestrated to ship the feature, not to defend the architecture. Comprehension debt is the structural consequence: AI generates syntax faster than human beings can read or govern it. Expecting a probabilistic model to enforce structural integrity on its own is a category error. Without a governing frame, the agent will always take the path of least resistance to a “YES.”

You cannot fix code overproduction by hiring more people to read it nor by running the generation loop faster. The only scalable answer is to build a concrete riverbed before you turn on the water.

If the current era automates the word “YES,” we should automate the word “NO.”

Securing the runtime environment prevents the monster from escaping. But to prevent it from being built in the first place, we need to step back into the IDE and the CI/CD pipeline. We need to govern generation.

The great softening: Shifting risk from build time to runtime

Compilers never guaranteed correct software. You could write catastrophic logically broken systems in C, Java, or any other compiled language. But compilers served a crucial engineering purpose: They deterministically governed a specific layer of structural risk.

By enforcing hard execution constraints—syntax validity, type compatibility, linkage rules, and executable viability—the compiler acted as an automated boundary. It didn’t verify business intent, domain correctness, or architectural quality. What it did was eliminate an entire class of low-level structural failure before execution ever began.

That delegation of risk is one of the quiet triumphs of software engineering. Our discipline has always advanced by mechanizing one class of guarantees so humans can focus on the next layer of abstraction. We automated machine-level structural correctness so engineers could spend their cognitive energy on application logic. Later, we pushed more guarantees upward, into schemas, testing, static analysis, architectural patterns, and operational controls.

Over time, we also deliberately softened certain boundaries in exchange for speed. Dynamic languages, richer runtimes, reflection, and increasingly abstract frameworks all traded deterministic compile-time guarantees for developer velocity and flexibility. The newly exposed risk was absorbed elsewhere: runtime validation, automated testing, observability, and engineering discipline.

Today, with agentic AI, we are softening boundaries again, more radically than ever before.

Natural language has become a high-level control plane for software generation. Arbitrary text increasingly shapes executable behavior. And in that shift, we have blurred one of the oldest boundaries in computing: the separation between data and instructions.

Outside the model, that boundary still exists. Systems enforce permission scopes, schema contracts, sandboxing, and execution policies. But inside the inference context, those protections collapse into the same token stream.

System prompts, retrieved documents, user messages, tool outputs, and external content all flow through the same neural weights. There is no hard privilege boundary between instruction and input. Modern models may resist naive attacks like “Ignore previous instructions,” but they remain vulnerable to indirect injections disguised as legitimate operational context. A malicious instruction embedded in a customer email, a webpage, or a tool response is not processed as passive data. It can become behavioral influence.

Inside the context window, untrusted text can shape control flow. That is the real softening.

We are generating syntax at machine speed, but we have dissolved the structural gate that once constrained how systems were built. The result is a massive shift of risk from build time to runtime. Code that appears structurally sound during generation may violate architectural boundaries, introduce unsafe execution paths, or become behaviorally compromised the moment hostile context enters the loop.

The conclusion is straightforward: The fact that AI-generated code runs is no longer a meaningful proxy for system correctness.

Syntax is abundant. Execution is easy. Structural governance is what is missing.

We outsourced the writing of logic to machines, but we did not build a deterministic boundary that governs what those machines are allowed to generate.

If we want control back, we cannot rely on human code review at machine speed. We must rebuild the build-time gate.

From dependency bloat to tailor-made architecture

For decades, the industry’s default response to complexity was abstraction by accumulation: monolithic frameworks, sprawling dependency trees, and ever-thicker layers of indirection. Importing a 50-megabyte library to avoid repetitive boilerplate was a rational trade-off when developer time and cognitive bandwidth were the scarce resources. For AI agents, that trade-off changes.

This is not an argument against foundational infrastructure. Mature primitives—like SQLAlchemy in Python or Spring Boot in Java—remain essential precisely because their conventions are widely learned and predictable. The problem isn’t abstraction but opacity. When core business logic disappears behind proprietary decorators, internal frameworks, or custom orchestration layers, execution becomes a black box. An agent cannot safely reason about code it cannot trace. It needs direct visibility into causality: what changes state, what enforces invariants, and where responsibilities begin and end. Hidden flow degrades reasoning into guesswork; guesswork silently becomes architectural drift.

At the same time, AI drives the cost of procedural code toward zero. Boilerplate is no longer expensive. Clarity is. The design question shifts from “How much can we abstract away?” to “How much must remain explicit for safe reasoning?” The answer is tailor-made architecture: thin infrastructure, explicit domain logic, hard boundaries, and narrowly scoped components with visible contracts. The value is no longer in how much code you avoid writing but in how clearly the system declares its boundaries.

That same opacity also breaks verification. AI review can catch local defects, risky patterns, and implementation mistakes, but it remains blind to architectural drift and missing business intent unless those constraints are explicitly encoded. After all, if you ask a model to review code generated from the exact same vague Jira ticket, do you actually get verification, or do you just engineer a circular hallucination, where the AI politely revalidates its own blind spots?

Figure 1. Tailor-made architecture gives generated syntax a clear structure without dissolving system boundaries.

The Context Compilation Pattern

The Context Compilation Pattern governs generation in the IDE and the CI/CD pipeline before a single syntactically plausible line ever reaches a human reviewer. If the Decision Intelligence Runtime (DIR) is the vault door that protects execution in production, context compilation is the blueprint that prevents the monster from being built in the lab.

This is not “prompt engineering,” which merely asks a probabilistic model for a better answer. What we need is build-time governance: two layers of defense assembled before the LLM inference is even triggered. The first is structured context injection (assembling the prompt from prioritized artifacts). The second is postgeneration static verification (deterministic AST checks that enforce rules no probabilistic model can override). The prompt structure biases generation toward compliant solutions; the static checks make declared, machine-verifiable boundary violations impossible to merge.

Deterministic build-time governance is not a return to formal software specification (like UML), nor is it merely “prompt engineering disguised as Markdown.” It’s a mechanical constraint on the generation space that makes explicitly declared boundary violations rejectable by design. Context compilation does not eliminate architectural review or replace engineering judgment. Instead, it ensures that the agent operates within a defined riverbed of allowed structural invariants.

Engineering evolves whenever implicit rules become explicit declarations. Application development is now crossing that boundary. The senior engineer’s new job is declarative boundary engineering: explicitly declaring what the system is absolutely forbidden from doing.

The failure is not in the frameworks. The failure is in the process: pointing an unconstrained AI agent at a codebase full of invisible magic and expecting a CI/CD pipeline designed for human-generated code to catch what goes wrong. The answer is to build a compiler for the agent’s context.

The Context Compilation Pattern is the staged pipeline that makes this concrete.

Figure 2. The Context Compilation Pattern pipeline, enforcing build-time constraints through deterministic artifact assembly and dual verification.

Step 1: The context artifacts

The most strategically valuable code in your repository may no longer live in src/. It lives in /context. The pipeline consumes versioned artifacts such as intent.md, boundaries.md, and threat-model.md, each authored by a specialist before a single line of code is generated. (Ownership and role responsibilities are covered in “Artifact-Bound Roles and Accountability” below.) What matters here is that these files are the inputs to the compiler: Without them, there’s nothing to compile.

To prevent cognitive overlap, their roles must be fiercely separated: boundaries.md declares structural invariants (e.g., dependency direction, allowed communication paths, and event emission), whereas threat-model.md models adversarial constraints as declarative abuse scenarios (e.g., prompt injection and secrets exfiltration) that must be mechanically blocked.

boundaries.md warrants a precise definition, because it anchors the entire build-time governance model. In practice, boundaries are typically defined at module or bounded-context granularity (e.g., /billing/* or /risk/*), not per class or per repository. They are implemented using hybrid artifacts: a natural language document designed to constrain the LLM, tightly paired with a deterministic rule for the CI runner.

Consider this concrete example of how an architectural boundary is explicitly declared and enforced:

1. boundaries.md (for the LLM context)
This Markdown file is injected into the agent’s prompt. It defines the vocabulary, architectural constraints, and allowed interactions.

Module: Billing
Ontology: Order, Invoice, PaymentEvent
Rule: Zero external network I/O is allowed in this domain. You must NEVER import requests or smtplib.

2. semgrep-rule.yml (for the CI/CD runner)
This static file goes to the CI pipeline to mechanize the boundary. It ensures the code check is fully deterministic.

rules:
  # Block forbidden imports at the module boundary
  - id: block-external-io-in-billing
    patterns:
      - pattern-either:
          - pattern: import smtplib
          - pattern: import requests
    message: "Architecture Violation: External I/O is strictly forbidden in the billing domain."
    severity: ERROR
    languages: [python]
    paths:
      include: ["src/billing/**"]

  # Domain layer must not talk to DB driver directly
  - id: block-db-driver-in-domain
    patterns:
      - pattern-either:
          - pattern: import sqlalchemy
          - pattern: from sqlalchemy import ...
          - pattern: import psycopg2
          - pattern: from psycopg2 import ...
    message: "Architecture Violation: Domain layer must use Repository abstraction, not database drivers directly."
    severity: ERROR
    languages: [python]
    paths:
      include:
        - "src/billing/domain/**"

Crucially, these Semgrep/CI rules are human-authored (or human-reviewed) precommit artifacts. We don’t rely on an LLM to generate the security gates on the fly. The AI reads the Markdown to guide its generation; the CI runner executes the static YAML to enforce the boundary.

If these artifacts stay current, they actively govern the generated codebase. Stale or malformed context becomes context debt: The pipeline will enforce strictly whatever was declared, even if the declaration is wrong. Governance artifacts are production code. They require strict versioning, explicit ownership, and periodic review just like the executable logic they constrain. That’s why core artifacts like boundaries.md require rigorous peer review, not just casual updates.

Step 2: The context compiler

Dumping all Markdown files into the system prompt is sometimes acceptable for small projects and small artifacts. But as the codebase grows or the context window fills with too many competing constraints, models begin to suffer from “lost in the middle” degradation and silently ignore what matters most.

The term “context compiler” might sound like a magical enterprise heavy-lift, but the reality is entirely mundane. In its simplest form, it’s just a deterministic context assembly layer combined with a routing mechanism.

Instead of treating context as a flat pile of documents, the compiler assembles it into an ordered structure. Because different artifacts apply to different parts of the project, boundaries.md in the /billing module might enforce strict isolation, while the one in /frontend might be much more permissive.

In practice, the compiler may take one of these forms:

Manual selection: The developer simply points their IDE or agent to a structured set of Markdown files.

A mundane script: A basic Python or bash script that understands a directory structure. It concatenates the .md files to build the LLM’s system prompt and hands the .yml files directly to the CI runner.

Tool-mediated context protocols: Dedicated mechanisms (e.g., MCP) that allow the agent to query the workspace and dynamically assemble the required boundaries directly within the IDE, bypassing the need for manual script invocation.

Consider a practical directory structure:

/context
  /global
    coding-standards.md
  /domain
    /billing
      boundaries.md
      threat-model.md
      semgrep-rule.yml
    /risk
      boundaries.md
      threat-model.md
      semgrep-rule.yml
    /frontend
      boundaries.md
      threat-model.md
      semgrep-rule.yml

When generating code for the billing module, the script reads /global and /billing. The compiler simply scopes the rules based on the directory, perfectly focusing the agent’s attention on the boundaries that matter while wiring the corresponding YAML rules for deterministic CI verification.

Step 3: Strict boundary hierarchy (resolving conflicts)

When faced with conflicting instructions, LLMs don’t throw a compilation error. They hallucinate a dangerous compromise. The compiler prevents this by enforcing a deterministic precedence of declared constraints before the prompt is assembled:

Threat model > Boundaries > Coding standards > Intent + acceptance criteria

Security and architectural boundaries unconditionally overrule feature delivery. This operates at two levels. At the prompt level (soft enforcement), constraint ordering biases generation toward compliant solutions. At the postgeneration level (hard enforcement), deterministic code checks parse the generated syntax, verify structural invariants, and instantly fail the build on violation.

“Resolution” in this context does not mean an LLM philosophically negotiating between two Markdown files. It means deterministic rejection via CI. If the intent.md asks to “email a receipt to the user,” but boundaries.md forbids external network calls in the billing module, an unconstrained AI might try to generate an SMTP call. The conflict is mechanically “resolved” when the CI pipeline runs a static rule (derived from semgrep-rule.yml) and instantly fails the build. The developer (context orchestrator) must then intervene and change the design to use an event bus instead. The hierarchy is enforced by deterministic code analysis, not LLM reasoning. A rejected build is not necessarily a rejected business need; it’s a signal that declared boundaries and intended capability must be reconciled explicitly before regeneration. (This mechanical rejection physically executes during the adversarial verification phase in step 5).

We do not use AI for this validation. We use existing, proven AST tools and code linters like Semgrep, Bandit, or CodeQL to enforce these boundaries in CI/CD.

However, we must be precise about what this governance actually achieves. Deterministic checks enforce invariants, not the architecture as a whole. You can statically enforce forbidden imports, forbidden outbound I/O, strict layering, and schema conformance. You cannot statically enforce domain semantics, aggregate ownership correctness, subtle coupling, or conceptual cohesion. Deterministic verification doesn’t prove architectural correctness. It proves compliance with explicitly declared structural invariants.

Step 4: Generation

Context as code matters only if generated syntax is verified against the same boundaries that shaped it. With a compiled, conflict-free context hierarchy, the developer agent generates code inside an isolated user space sandbox. In this fleeting fraction of a second, the agent inside the developer’s IDE consumes the narrowed, precompiled system prompt and outputs the actual payment_service.py. Its role is constrained synthesis: translating the boundaries in boundaries.md and the imperatives in intent.md into code.

Step 5: Adversarial verification (negative space)

This phase checks whether the generated code crossed a forbidden boundary. Before the development cycle begins, the adversarial context provider defines threat vectors in threat-model.md. Because a Markdown file only guides the LLM softly, the governance platform engineer bridges the gap to determinism by translating those declarative threats into matching executable rules (like semgrep-rule.yml) wired into the CI gates. If the threat model identifies server-side request forgery or secrets exfiltration as a risk for the /frontend module, the corresponding CI rule parses the generated code and instantly fails the build if a known attack pattern or insecure execution sink is detected.

The pipeline doesn’t ask an LLM to read the Markdown and assess if the code is safe. It mechanically executes the prewritten rules derived from it. If a generative agent helps draft the rule set, it does so before the cycle in an isolated sandbox, and a human reviews the result before it enters CI. Step 5 doesn’t prove overall correctness; it proves that declared structural and security boundaries are enforced.

Like any static gate, deterministic boundary checks trade flexibility for safety and will occasionally reject valid implementations. That friction is intentional: Explicit override and artifact refinement are part of the governance loop.

AI code review may identify suspicious code, but it cannot certify that declared boundaries survived generation. Step 5 therefore relies on deterministic CI rules, not on a probabilistic model interpreting the pull request.

Step 6: Acceptance verification (positive space)

This phase checks whether the generated code solves the business problem. The acceptance-criteria.md defines the expected behavior not as a vague user story, but as a machine-executable contract (e.g., using Gherkin syntax):

Scenario: Successful payment emits notification
  Given a valid payment of 100 EUR
  When the transaction completes
  Then the PaymentSuccessEvent is published to the message bus

The CI pipeline parses this exact Markdown block and runs the corresponding test suite. Step 6 provides what step 5 cannot: verification against a declared delivery contract.

The code is approved only when it passes adversarial checks and satisfies the acceptance criteria. Without step 5, the system could violate structural boundaries. Without step 6, it could implement the wrong intent. Both contracts must hold.

Artifact-bound roles and accountability

The traditional SDLC is a linear cascade: Requirements flow to architecture, then to code, then to QA. In an era where a machine generates 10,000 lines of syntax in the time it takes to fetch a coffee, that handoff is a fatal bottleneck.

In the context matrix, specialists define parallel, independent constraint vectors before generation begins. The titles on business cards stay the same. The artifacts they produce change entirely.

Old role	New role	Artifact	Responsibility
Business analyst	Intent definer	`intent.md` + `acceptance-criteria.md`	Define the “what” and the deterministic proof that it was delivered
Software architect	World builder	`boundaries.md`	Define domain ontology, architectural invariants, and allowed interaction patterns
QA & security engineer	Adversarial context provider	`threat-model.md`	Define threat vectors and abuse paths before generation
Platform engineer/DevOps	Governance platform engineer	Compiler pipeline + CI gates (`semgrep-rule.yml`)	Operationalize declared constraints into nonbypassable enforcement gates
Developer	Context orchestrator	`coding-standards.md` + critical code	Resolve artifact conflicts, steer generation workflows, implement critical paths, and refine context quality

In this model, accountability is distributed and artifact bound. Rather than handing off work downstream, each role owns specific upstream activities and constraints.

The intent definer (formerly business analyst): Owns the business reality. They translate user needs into intent.md and define hard acceptance-criteria.md (like BDD scenarios or API contracts). Their job is to formulate requirements so strictly that the pipeline can automatically prove delivery, acting as the first line of defense against vague “vibe coding.”
The world builder (formerly software architect): Owns the structural gravity. They write boundaries.md to establish the domain ontology and hard architectural boundaries. Instead of reviewing pull requests for drift, their daily activity is defining what modules are allowed to communicate and declaring the structural invariants the generated code must respect.
The adversarial context provider (formerly QA and security): Owns the negative space. They anticipate failure modes and define threat vectors via threat-model.md. Their responsibility is identifying the precise abuse paths that the CI pipeline must block, ensuring an LLM never tests its own code.
The governance platform engineer (formerly platform engineer/DevOps): Owns the enforcement machinery. They build the context compiler pipeline and operationalize declared constraints into nonbypassable enforcement gates. Their responsibility is the deterministic enforcement pipeline that executes declared governance artifacts at precommit and CI/CD boundaries.
The context orchestrator (formerly developer): Owns generation orchestration and critical handwritten paths. This is a hybrid reality, not the end of programming. They write coding-standards.md, manually implement zero-trust paths, and resolve runtime exception requests. For the bulk of the system, their focus shifts to a meta-level: resolving conflicting constraints, tuning the prompt’s signal-to-noise ratio, and debugging why a given artifact failed to govern the agent properly.

When a failure occurs, the investigation shifts from “What was the agent thinking?” to “Which contract failed to govern?” Because the pipeline deterministically enforces what was explicitly declared, failures are no longer opaque hallucinations. They’re traceable collisions between artifact boundaries. A structural flaw cleanly points to an unbounded boundaries.md. When the pipeline is green and the contracts are honest, the orchestrator acts as a firewall against process failure, not a scapegoat for undocumented assumptions.

Figure 3. The decision boundary architecture: Context compilation governs generation, ROA structures intent, and DIR validates execution.

The economics of governance

Context compilation makes economic sense only when the cost of architectural failure exceeds the cost of explicit governance. It adds upfront design work and cognitive overhead, so its value depends on how expensive a wrong system decision would be.

For rapid prototyping, throwaway utility scripts, marketing sites, or low-stakes internal tools—where the worst-case consequence of a hallucination is a misaligned dashboard—let the generative engines run unconstrained. Velocity is the only thing that matters.

For safety-critical automation, trading platforms, healthcare orchestrators, and regulated enterprise systems, the economics invert. Velocity without deterministic boundaries is simply the speed at which you accumulate liability. A single unconstrained agent importing an insecure dependency into a payment core costs orders of magnitude more than the engineer-hours spent writing a boundaries.md contract.

You don’t build a bank vault door for a garden shed. You apply context compilation where the systemic cost of emergent architectural failure is catastrophic.

Automating the word “NO”

When code generation becomes cheap, architectural entropy tends to scale with it. That makes post hoc code review less effective, especially when reviewers spend their attention on machine-generated boilerplate. A more durable approach is context review: peer review of the declarative constraints that shape what the machine is allowed to build. A reviewed boundaries.md can guide many later development cycles. A reviewed pull request usually governs only a single change.

The discipline has shifted from imperative engineering of procedures to declarative engineering of boundaries.

Let’s return to the Jira ticket that started this discussion: “Add an email notification after a successful payment.”

The business analyst submits the intent.md. Before the developer agent sees the prompt, the context compiler activates—at the precommit gate or via tool-mediated context protocols (e.g., script or MCP) in the IDE—before a line is written. It retrieves the architect’s boundaries.md, which states, “The /domain module has zero external dependencies. No network calls.” The SMTP import collides with that boundary instantly. Even if the agent generates the import, the build will not survive it—the prompt biases generation toward compliant solutions, and the deterministic static check in step 5 rejects it at the declared boundary. The Frankenstein is caught in the pipeline, not discovered in production three release cycles later.

Code generation is becoming abundant. Architectural discipline is becoming scarce.

Context as code governs what may be generated. Responsibility-oriented agents govern what may be proposed. Decision Intelligence Runtime governs what may be executed. Three boundaries. One governing frame.

The highest-value engineering skill is no longer writing syntax. It’s engineering the conditions under which correct syntax can emerge.

That is the ability to automate the word “NO.”

This article concludes the three-part series on engineering boundaries in agentic AI. The repository at github.com/huka81/decision-intelligence-runtime contains an open source reference implementation of the concepts described in this series.

Radar Trends to Watch: June 2026

Mike Loukides — Tue, 02 Jun 2026 10:58:22 +0000

Coauthored with Claude

Agents are making the transition from performing tasks to running operations. The Cloudflare and Stripe partnership ships an agent that opens accounts, registers domains, and deploys an application on its own (details), while Stripe/Tempo and iWallet have each published machine-to-machine payment protocols to make that kind of work a standard. Office documents, browser sessions, and, in one announcement, the phone interface itself are next on the list. View the expanded role of agents as an opportunity for humans to accomplish more.

AI Models

The model menagerie keeps expanding in size and shape. Open weight contenders run at frontier capability on modest hardware, while specialist models for voice, conversation timing, and privacy filtering take over what used to be features inside one general chat model. Treat your prompts and skills as portable; the model behind them will change.

Anthropic has released Opus Claude 4.8. This model is not Mythos, which they expect to release soon. Opus 4.8 is a “modest improvement” that claims better results on coding and greater likelihood of informing users when it is uncertain about claims. Changes to the agents may be more important. Claude Code now has the ability to plan solutions to large problems involving hundreds of subagents (“dynamic workflows”); Cowork can control the effort put into solving a problem.
Cohere’s Command A+ is an open weight mixture-of-experts model with 218B parameters, 25B active. It’s competitive with frontier models and requires relatively little hardware to run: Two H100s isn’t small, but it’s not a data center either.
Google’s announcements at this year’s I/O conference include Omni, a new model that takes any kind of input (video, audio, image) and generates any kind of output; Gemini 3.5 Flash, a fast and efficient update to their coding model; Gemini Spark, a personal agent; and intelligent eyewear, another attempt at smart glasses.
Alibaba has announced Qwen3.7-Max, its most capable model.
Thinking Machines has announced a research preview of interaction models. These models support natural conversation flow. The model can wait for a speaker to finish, interrupt the speaker, respond when the speaker interrupts the model, and keep track of time.
OpenAI has released new voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They’re moving from call-and-response models to models that can take part in conversations, reason, and take actions.
OpenRouter published cost studies for both Claude Opus 4.7 and GPT-5.5. GPT-5.5 raised the token price but reduced the number of tokens in a typical conversation. Claude kept prices the same, but conversations tend to require more tokens. What’s the impact on your monthly bill?
Google has updated its Gemma 4 models, claiming that they triple token generation speed. They use a technique called multi-token prediction (MTP) to draft a sequence of tokens with a very small model and then approve those tokens with the large model.
IBM released Granite 4.1, a collection of small models (30B parameters and down).
An academic paper describes “the reasoning trap,” a phenomenon in which training models for increased reasoning also increases hallucinations about tool use.
Talkie is an LLM that was trained only on data from 1931 and earlier. If you want to know what it was like to live during the start of the Depression, this is the LLM to ask.
OpenAI has announced a privacy filter model. This is a small specialized model (1.5B) that can run on phones and other small devices. It removes personally identifiable information (PII) from text documents.

Software Development

We are beginning to see anecdotal evidence that the brief era of tokenmaxxing is coming to an end. Agents may increase productivity, but they can also use tokens at an astonishing rate. So can the latest models, like Anthropic’s Claude 4.8 with new features like dynamic workflows. Employers are realizing that the only way to measure productivity is to look at the quality of an employee’s work rather than relying on an artificial (and easily gameable) metric like token use. Teams that use AI effectively will be disciplined about token use; they’ll choose lower cost (or local) models where possible, reaching for expensive models like Claude 4.8 Opus only when necessary.

The Agentic AI Foundation is updating the MCP protocol, with a release candidate scheduled for July 28. Changes include making MCP a stateless protocol, adding a process for creating extensions, and aligning authorization with the OAuth and OpenID standards.
Google is dropping Gemini CLI and putting all of its effort behind Antigravity, its agentic software development platform. There are desktop and command line versions of Antigravity, but unlike Gemini CLI, neither are open source.
What shall we call Gas City, created by Julian Knutsen and Chris Sells? Gas Town 2.0? Steve Yegge says it’s an SDK for building your own “dark factories” by deploying teams of collaborating agents in any topology. It’s “a pivotal moment in the Mad Max school of agent orchestration.”
The problem with agentic programming is that agents serve individuals, not groups, and programming is a team sport. Is collaborative steering (context management for groups) an answer?
GitHub has released a preview of its Copilot app, a stand-alone desktop application for coding with AI. It’s completely integrated with GitHub; for example, you can launch tasks directly from GitHub issues.
If you think tokenmaxxing is your path to promotion, check out burn-baby-burn. It does what it says: burns lots of tokens, fast, using the LLM of your choice. We hope it’s a parody, but we bet it works.
Mitchell Hashimoto tweets that Anthropic’s rewrite of Bun from Zig to Rust demonstrates that programming languages are now fungible. Programming language lock-in has ended; programs can easily move from one language to another.
OpenShell is a runtime environment built with security in mind from the ground up. It’s intended to be used as a secure environment for running agents. Every agent runs in its own sandbox; an external gateway manages credentials and policies.
OpenAI is shutting down its API for fine-tuning its models. They say the current models are better and don’t require significant fine-tuning. As Latent Space points out, this doesn’t necessarily mean the end of fine-tuning as a discipline, particularly for open models. But it may be a signal. Drew Breunig writes about what this means for agents and harnesses.
Anthropic has released Claude for Office 365, allowing users to run sessions that cross Word, Excel, and PowerPoint. Integration with Outlook is coming, though Claude for Outlook is currently a separate product.
A plugin to Chrome allows Codex to use Chrome for browser tasks that require you to be logged in—for example, reading email.
Firecrawl is an API that agents can use to interact with websites in a human way. It enables agents to search for the latest data, interact with the site, and return the results at scale.
Drew Breunig’s “10 Lessons for Agentic Coding” is an invaluable list of tips, including “Implement to learn.” Letting an agent write all the code is easy, but when you really need to learn something, write it by hand first.
Deepclaude configures Claude’s autonomous agent loop to use DeepSeek V4 Pro rather than one of Anthropic’s models. It’s a good way to save (DeepSeek costs much less per token) and experiment with open models. (Fair warning: The name deepclaude may change.)
OpenAI has announced Codex for Work, an assistant that’s designed for office work rather than software development.
Kanwas is a new tool for sharing context across agents. It can be used by workgroups to collaborate on projects.
Mike is an open source AI trained for legal work and designed to run locally.
GitHub is transitioning to usage-based billing for Copilot.
OpenAI and Qualcomm are reportedly working on a phone where the user interface is an agent. There won’t be any apps; the agent will do everything.

Infrastructure and Operations

The infrastructure questions of the moment are whether agents can transact and deploy without humans, and whether the platforms that host open source can stay reliable enough to keep that work going. Watch for GitHub alternatives to become competitive. And watch AI Together, a cloud company that hosts hundreds of open source models.

TokenTuner helps control AI costs by identifying where companies can use lower-cost models productively. It attempts to match token usage to business outcomes, and evaluates individuals and teams on how effectively they use their token budget.
In partnership with Stripe, Cloudflare now has an agent that can create a new account, start a subscription, register a domain name with DNS, and deploy an application without human intervention aside from granting permission.
Stripe and Tempo have released the Machine Payments Protocol (MPP), and iWallet has laid out a roadmap for the Autonomous Settlement Protocol (ASP). These new protocols are designed to facilitate machine-to-machine transactions, transactions that have to be designed without a human in the loop.
The Inference Era is when inference, rather than training, drives AI usage, cost, and infrastructure. GPUs remain important, but the relative demand for CPUs increases.
GitHub is in danger of losing its place at the center of the open source ecosystem. Problems with uptime are causing projects to find homes elsewhere—most recently, Ghostty.
Together AI operates a cloud AI platform that’s designed specifically for inference rather than training and that provides API access to over 200 open weight models. As AI use increases, the ability to run models and provide answers efficiently becomes more important than the ability to train new models.

Security

The patch window is shrinking to zero, and the attacker’s toolkit and the defender’s toolkit now include the same AI models. Any vulnerability disclosed today is being exploited tonight. The good news is that defenders running these tools at scale can close gaps faster than ever; the bad news is that the race never ends.

FROST is a new technology for surreptitiously discovering what websites a user is visiting. It’s based on measuring the I/O operations on the user’s SSD. FROST requires no interaction from the user and runs entirely in the browser.
Regrettably, neither arcane prompt injection attacks nor cryptocurrency scams are news. But it warms a ham radio enthusiast’s heart to see Morse code used in a prompt injection to scam a crypto trading bot.
TeamPCP, a cybercriminal collective, has attacked GitHub by installing a poisoned extension to VS Code. GitHub announced that nearly 4,000 repositories have been compromised, all belonging to GitHub itself; no customer repositories have become victims. But anyone who installs corrupted code from GitHub’s own repositories is vulnerable.
No Security Meter for AI provides an excellent look into the state of AI security.
Cloudflare’s report on Project Glasswing and Claude Mythos is worth reading. Mythos is especially noteworthy for its ability to chain vulnerabilities. In real life, few vulnerabilities are exploitable on their own; they become vulnerable when they are used in combination with others.
Daniel Stenberg reports that Mythos found five potential vulnerabilities in curl, of which one was legitimate. The low count isn’t surprising, given the quality of the curl team’s work. What’s significant is that Mythos was able to find a legitimate vulnerability in software that had been thoroughly audited by humans, traditional tools, and AI.
Who showed up? A security researcher ran a honeypot with port 22 open for 54 days, and logged every attempt to log in: 269,000 connection attempts from 7,556 unique IP addresses.
GitHub’s dependency scanning service for its MCP server is now in public preview. It checks code changes for vulnerable dependencies before committing code or opening a pull request.
Copy.fail is a recently discovered Linux kernel vulnerability that allows unprivileged processes to escalate privileges, and it was exploited within a day of its release. Unlike most vulnerabilities, running infected programs in a container does not offer protection. The time from release of a zero-day to exploitation in the wild is indeed shrinking.
OpenAI’s Advanced Account Security requires a physical key or passkey for access; there are no passwords. Hardware keys are provided by Yubico or a compatible hardware token.
GPT-5.5 Cyber is a version of GPT-5.5 that has been trained as a security tool. As Anthropic did with Mythos, OpenAI is limiting access to a small group of trusted users.
The Firefox team has used Claude Mythos to find 271 previously unknown vulnerabilities in Firefox. While this finding is terrifying, they conclude that defenders now have the advantage. Once you know the vulnerabilities, it’s possible to close the gap between defenders and attackers.
Claude Code can leak credentials and other secrets to public repos and package registries. When you select “allow always” for a specific command, the command and its credentials are stored in a subdirectory of .claude. This directory can inadvertently be incorporated into a package.

Policy and Governance

The ArXiv preprint repository has clarified its code of conduct for AI users. Submitters are responsible for their papers and will be banned for a year if they submit papers that use AI-generated content inappropriately. This includes hallucinated content, references, and plagiarism.
Look to China for new approaches to data governance. China is treating data as a national resource and building the infrastructure for a data economy.

Web

At its I/O conference, Google announced that traditional search will be replaced by AI search, powered by Gemini 3.5 Flash. Both AI search and traditional search (which is really AI-powered) have proven useful. What happens when you eliminate one of the options?
Linux running in a PDF? The PDF format supports JavaScript, and C can be compiled to JavaScript.

Biology

Colossal Biosciences has developed a 3D-printed artificial eggshell that’s capable of raising chicks from embryos.
Brazil has invested heavily in vaccines and has created a single-shot vaccine against Dengue fever. The country is striving for “medical sovereignty,” a concept that’s clearly related to data sovereignty and AI sovereignty.

AI Sovereignty and the Architecture of Participation

Tim O’Reilly — Mon, 01 Jun 2026 16:05:58 +0000

Adam Tooze recently shared a piece from The Economist about Brazil’s push for what it calls “medical sovereignty,” the determination to make its own vaccines and the active ingredients that go into its medicines rather than depend on supply chains it doesn’t control. Brazil already produces a large share of its own medicines through public institutions like Fiocruz and Butantan, but a lot of the underlying inputs still come from abroad, and the pandemic made clear the cost of that dependence. So the country is trying to build the capacity to make the things it most needs to survive. The economist behind a lot of this thinking is Mariana Mazzucato, whose mission-oriented approach treats public procurement as a tool to build national capacity rather than just buy finished goods. (Foreign Policy has a good overview.)

I think we’re going to see a lot more of this, and not only in medicine. The same impulse is driving the quest for sovereign AI, as countries decide they don’t want their access to a foundational technology to run through a handful of American or Chinese companies. You can see it too in Europe’s and Japan’s new willingness to take responsibility for their own military destiny rather than assume the United States will always be there.

Most commentators describe all of this as decoupling, the unwinding of a connected world. That reading is too narrow.

Free trade was an architecture of participation that broke

Much like open source software and the World Wide Web, free trade was supposed to have what I call “an architecture of participation.” The most important thing about the web and open source wasn’t openness for its own sake. It was that there were no central gatekeepers. Anyone could add to the richness of the system without asking permission as long as they followed the rules of the communication protocols that allowed independently-developed pieces to work together. In addition, value circulated among the participants instead of being extracted to a center, and the system got better the more people used it. That is a very different thing from a system that is merely large and connected.

Free trade was also supposed to work like that. The theory, going back to Smith and Ricardo, was that specialization and exchange would make everyone better off, and that the connections would be mutual. What we actually got over the past few decades looks more like the platform dominance we see in big tech than the original vision of a commons built around shared exchange. A handful of large and powerful countries and firms set the terms and the smaller players are forced to take what is on offer. Despite the language of free trade, the experience for many countries was closer to colonialism, just with a new narrative.

Overall, under the neoliberal order (whose reign, as Gary Gerstle explains, is now ending), free trade became far less egalitarian, inclusive, and generative than it could have been. Less powerful countries ended up in roughly the position that small businesses occupy on Amazon, or developers occupy on the app stores: free to participate, on terms they don’t control, with much of the value they create flowing back to the hub.

Brazil’s response (and that of many others) should not be seen as a retreat from the world. It is a refusal to be participate only as a buyer, or as a source of raw materials.

That’s why decoupling is the wrong word. Decoupling means cutting the connections. What these countries seem to want is to stay connected but to build real capacity of their own, so that no single supplier can switch them off. That’s closer to federation than to separation. A federated system is still a system, and its nodes still interoperate. But no node is wholly at the mercy of another, and value circulates among them rather than collecting at the center. A trading order in which the gains pool at a few hubs is brittle and eventually illegitimate, in the same way that a platform economy that strip-mines its participants eventually provokes regulation and revolt.

I put the increasingly visible quest for sovereign AI, and the role of open source models and open source agentic protocols and harnesses in enabling that sovereignty, into the same bucket. I remember back in the early days of open source software when Michael Tiemann, whose pioneering open source company Cygnus Solutions had just been acquired by Red Hat, told me “What we really sell at Red Hat is control. The ability to control your own destiny.”

As companies are increasingly at the mercy of unexpected token pricing changes by the big centralized players, this same quest for sovereignty is playing out at the level of organizations. Open source AI, including not just open source and open weight models but open agentic protocols, agentic harnesses, and portable memory, are increasingly an essential part of the sovereignty toolkit.

The national technology sovereignty movements should take a lesson from the open source movement. The heart of open source is its architecture of participation. It is a force for innovation and value creation to the extent that it frees up the ability of people to solve their own problems and contribute their solutions to a low-friction global commons.

Is capture the inevitable fate of any architecture of participation?

The pattern of open architectures leading to a wave of innovation, winners emerging, consolidating their power and then turning to the dark side seems to be a natural part of the technology cycle. The web broke Microsoft’s dominance over the personal computer software ecosystem only to give rise to a new generation of gatekeepers. Cory Doctorow called this cycle “enshittification.” I’ve told my own version of that story using the language of economics in “Rising Tide Rents and Robber Baron Rents.”

The instinct after capture is to try to rebuild the thing that got captured, only this time with better rules. Mastodon and Bluesky tried to rebuild Twitter’s social layer with cleaner governance, and neither has succeeded at the scale they hoped for. Critics might say that it was because Mastodon stayed pure and never made itself easy enough to use, while Bluesky looked federated without really being so. But more importantly, reinventing what we used to have, or what we think we used to have, is rarely the path forward. You have to build something new.

Each country building its own answer to the latest frontier models is the Mastodon move. The winning move is to operate at a layer the centralized model structurally can’t reach. Open agent protocols that let services from different providers interoperate (the work that MCP and the emerging agent stack are beginning to do) are one such layer. AI accountable to local democratic and legal institutions is another such layer. Domain-specific AI built around problems the global market won’t serve (the tropical disease vaccine analogue) is another. None of these is a smaller copy of what the hyperscalers offer. But there’s one more important layer to consider: infrastructure.

Where are the servers?

Ilan Strauss made a useful point in our conversation about these ideas. Ilan noted that AI is one of the most global forms of capital we’ve ever built, trained on the whole of the internet and runnable more or less anywhere, and the sovereignty rhetoric is partly an attempt to give something inherently placeless a place. The technology wants to be everywhere at once. The people who live with its consequences want some say over it where they are.

The placelessness of AI is only half of the truth, though. The other half is that AI is physically place-bound. The model weights are placeless. The data centers, the chips, the electrical grid, and the water for cooling are very much somewhere.

The comparison with Brazil’s medical sovereignty reinforces this point. Brazil’s challenge isn’t to invent new drugs to compete with Pfizer, but to build the capacity to manufacture existing vaccines, and eventually to build the capacity to invent vaccines for diseases the West ignores. Fiocruz and Butantan matter not because they hold patents but because they are physical institutional capacity rooted in Brazilian soil: the labs, the cold chains, the regulatory capacity, the trained workforce, and access to the active pharmaceutical ingredients. That’s what medical sovereignty really means in practice. It is infrastructure plus the institutions that run it.

The same is becoming true for AI. Open weights matter. They’re closer, though, to the patent than to the lab. Even if Qwen, Kimi, DeepSeek, Llama, Gemma, Granite, and whatever comes next are fully open, running them at scale requires data centers that cost tens of billions to build, chips whose supply chains a handful of countries control, and electricity grids that have to be expanded substantially to carry the load. The countries pursuing sovereign AI seriously seem to understand this. The EU’s AI Gigafactories program, India’s IndiaAI mission, the Gulf compute buildouts, the Singapore and Japan strategies, are all infrastructure plays first and model plays second.

Infrastructure is the layer where capture is hardest to undo. You can distill or fine tune a model far more easily than you can build a new continent’s worth of data centers or conjure the necessary electricity from a fragile power grid. If the architecture of participation for AI is defined only at the model layer, the infrastructure layer below will quietly recapture, over years, everything that was won above. Open weights running on three companies’ servers is not sovereignty.

Building physical infrastructure capable of carrying a generation’s worth of economic activity is exactly the kind of mission the public sector used to take on, before we convinced ourselves the market would handle it. Mazzucato’s argument is that public procurement and public capacity-building are the real engines of foundational technology. AI sovereignty without industrial policy is wishful thinking.

Industrial policy should aim to reinvent 20th century infrastructure, not just copy it. Can we use the enormous rebuild of infrastructure for the AI era to leapfrog the past? The analogy with centralized power grids and decentralized solar reminds us that local control does not have to be a localized version of the hyperscaler pattern. Might we envision a future where there is an intelligence grid that seamlessly uses frontier models in massive data centers and local models controlled by the user as dictated by considerations like cost, privacy, specialized knowledge, and user preferences? Creating the software to manage such an interoperable intelligence grid should be a high priority for the AI open source community. We need an orchestrator not just for agents but also for models and even for data center capacity.

Could federated AI give us a new pattern for the economy?

In a previous piece about AI and markets, “The Third Artificial Intelligence” I picked up Richard Danzig’s argument that markets and the bureaucracies that underpin nation states are themselves artificial intelligences, information-processing mechanisms older than the machine kind. The question with all three is who designs and builds them, what they optimize for, and what feedback loops govern them.

We’re about to spend a lot of effort working out how AI should be organized both across nations and across organizations, whether it concentrates in a few firms and a few countries or whether it can be built as something more federated, where smaller players have genuine capacity and the value they create flows back to them. The choices we are now making about how AI is organized, at the model layer, the protocol layer, and the infrastructure layer, are also choices about how economic activity will be organized for at least a generation. If we manage to get that architecture right for AI, it may give us a working pattern for the thing we’ve so far failed to get right for trade. If we get it wrong, we’ll most likely reproduce, at the level of intelligence itself, the same concentration that free trade has produced in goods and the existing internet platforms produced online.

The technology wants to be everywhere at once. The people who live with its consequences want some say over it where they are. The infrastructure that resolves that tension will be a federation of models, a federation of protocols and code, and a federation of capacity. We need an architecture of participation all the way down the stack, and all the way up.

The final section of this piece benefited greatly from questions and comments raised by Ilan Strauss and Mike Loukides, as well as from previous conversations with Richard Danzig.

SaaS Is Not Dead Yet

Mike Loukides — Mon, 01 Jun 2026 11:01:35 +0000

With the rise of agents, many people have been proclaiming that the age of software as a service (SaaS) is over. Who needs to subscribe to a service when you can create your own software with a few English-language prompts and a few dollars spent on tokens? Your own software, most likely a skill that runs in an agent, will have exactly the features you want: no more, no less.

But whenever someone talks about the death of SaaS, there’s something wrong with the picture. It’s simply that work is about groups and teams, and so far, programming with agents is about individuals. A related challenge is that SaaS companies are good at building dashboards and generating reports for humans, but agents need the raw data, not a representation of the data.

Think about the teamwork required for a good sales team. Someone needs a database to keep track of their customer info. It’s easy to get Claude, Gemini, or GPT to build that, using SQLite for a backend and putting a reasonable web frontend on it. You could also do that fairly quickly with Ruby on Rails, but AI makes it even easier. But what about the salesperson at the next desk? She needs similar CRM software, and she can create it with Claude, Gemini, or GPT. No problem. But it won’t be exactly the same; it will reflect her needs and preferences. Soon you have a team of salespeople in which everyone has their own personal CRM. They’re all similar, but slightly different. They may use different backends (Filemaker, SQLite, MySQL, or maybe a corporate Oracle instance); they have similar-but-slightly-different schemas (one has a single field for customer address, another has separate street, city, state, and country fields); and they don’t interoperate.

That’s the simplest possible case. How do you generate company-wide reports if everyone has their own version of the data? How do you know if you’re succeeding or failing if everyone on the team has their own version of the metrics? Everyone has become their own silo.

The company is not paying subscription fees to a vendor like Salesforce, but is this really progress? If anything, we need to make sharing data and metrics easier, not more difficult. On top of that, a product like Salesforce has hundreds of features. Most people don’t need most of them, but there’s a good chance that almost everyone needs one feature that nobody else needs. And there’s always the features you don’t know you need, ways to get value from data that you haven’t thought of. There’s value in buying a bundle that goes beyond your immediate requirements.

There’s certainly a lot good about enabling people to develop their own tools. I guarantee that if we had Claude Code 30 years ago, I would have vibe-coded my own skills for managing the authors I was working with. I would have vibe-coded some of the crazy tools I wrote to translate from one document format to another. (WordPerfect to troff? Why?) Now that we have agentic programming, I may never write my own tools again. But the SaaS scenario highlights something missing from the agentic picture. We don’t have tools for sharing or collaboration. Nobody buys a Salesforce subscription for themselves. It’s a departmental or corporate resource, shared between many people. And the ability to share easily is precisely what agentic programming lacks. I’ve built some of my own Claude tools and skills, but it’s very difficult to share them with other people at O’Reilly. ChatGPT Skills for Business and Enterprise hints at the ability to share skills among team members and some ability to generate them collaboratively, though it’s hard to find evidence that it delivers. I think we’re seeing a symptom of technological overreach. It’s easy to assume something is “easy” when it isn’t: “You just generate a .md file and put it in the corporate GitHub.” That process has a lot of friction, particularly for users who aren’t technical.

To make skills really useful across a company, we need:

Sharing. This can be a Git server that’s registered as a private marketplace and then configured via a corporate administrative dashboard. Publishing skills to the marketplace would remain the province of Git-aware users, and that’s a problem.
Requirements. We don’t want everyone to build a personal toolset; that’s the problem we’re trying to solve. How do you resolve differences between users who want slightly different things? What does the PRD for a skill look like?
Collaboration. Aside from Google Docs, the current state of widely used collaboration tools is poor. Suffice it to say that working on different branches of a Git repo and merging changes may work for professional programmers, but not for anyone else.
Testing. Tests and evals for agents (related, but not the same) are topics that we don’t yet understand well. But if you’re going to empower users to use and create agentic tools for creating projections and writing reports, you need to know they won’t backfire. Skills also behave like any other AI application: They drift over time. Even after they’re published, they need to be evaluated regularly to see if they still perform correctly.
Versioning. Like any software—and we need to recognize that agentic tools and skills are software, even if they’re written in English—it will be important to update them as requirements change and as LLM behavior drifts. It’s important to keep track of versions and for users to update their skills to the latest version easily. Again, this is a matter of wrapping Git appropriately for nontechnical users.
Security. Security for intelligent agents is still poorly understood. We know about prompt injection, but we also know that it’s a problem that can’t be solved yet. And attackers are still finding novel ways to inject malicious prompts. What vulnerabilities might agentic skills and tools have if they can access corporate data?

While the democratization of programming doesn’t threaten SaaS companies, intelligent agents pose a deeper challenge. In “The Salesforce of Agents Won’t Be Salesforce, the Google of Agents Won’t Be Google,” Jesus Rodriguez points out that the future for services like Salesforce and Google isn’t web UIs and dashboards; it’s APIs that are designed for agents. These APIs require a different kind of data: not something that a human can glance at to get a quick feel for what’s happening, but “structured state, task objectives, relationship graphs, permissioned memory, machine-readable sales playbooks, and reliable APIs for updating intent.” Humans need the data compression that you get from a dashboard. Agents want the data itself, and they’ll take care of the compression. SaaS companies can become the system of record that is responsible for delivering accurate data. What they need to recognize is that their real customer may not be a human user; the customer will be an agent, and that will affect everything from marketing strategy and product design to pricing.

I wouldn’t claim that Salesforce or Google can’t or won’t build APIs to help companies access their own data. SaaS remains relevant, but it’s a different kind of SaaS than we have now. Companies like Salesforce know what data is available and how to work with it. Designing and building the data infrastructure that’s needed to provide next-generation SaaS isn’t trivial, and doing the programming in English rather than C++ doesn’t make it easier. Companies like Salesforce and Google know what needs to be built. They’re likely to offer their own collections of agentic skills as a starting point, alongside APIs. But large, established companies are ripe to be blindsided if they move slowly—and it’s difficult for large institutions to move quickly.

SaaS companies have momentum—or inertia, which to a physicist is the same thing. They have to change, but they aren’t threatened by AI, agents, and user-defined skills. Providing APIs that have been designed to provide data in formats that machines can use should be an obvious next step. If they die, it will be because they don’t adapt. But there’s nothing new about that.

Open Source Ecosystems

Ilan Strauss — Fri, 29 May 2026 11:00:08 +0000

The following article originally appeared on the Asimov’s Addendum Substack and is being reposted here with the author’s permission.

Bill Gurley has an excellent article on what he calls open source strategy, which we recommend reading. There is a lot to debate about his concluding argument in particular: that open-weight models are central to keeping the AI market rent-free. The limits of open-weight AI as the primary open source strategy are surely considerable though, if it still requires expensive hardware to run on, and if the architecture ultimately remains monolithic—rather than composable and protocol-centric.

A related consideration comes from Anthropic’s recent acquisition of Stainless—a startup that generates SDKs, command-line tools, and MCP servers from API specifications. This illustrates that open protocols like MCP, even when publicly governed,¹ remain exposed at their complementary layers to private actors capturing rents. (Protocol openness does not eliminate this and instead probably enables it, by enabling market growth).

We asked Claude to analyze this acquisition, going beyond the press releases. Its first pass overstated parts of the competitive-denial story; what follows is what survived it taking a closer look:

Complement capture, not protocol capture. MCP—the standard that lets AI agents talk to other software—remains open, and its governance has been handed to an independent foundation. What Anthropic bought is the company that turned that standard into something most developers could actually use. Stainless was the dominant tool for taking an ordinary business API (say, a hotel booking system or a customer database) and converting it into something an AI agent could call through MCP. The open standard is still open. The path most developers walked to use it has now been bought.
This isn’t a one-off—the whole layer is consolidating. Stainless wasn’t alone in this market. Its main competitor, Fern, was bought by Postman in January 2026. Anthropic bought Stainless four months later, in May 2026. That leaves Speakeasy as the only major independent player, plus an open-source fallback called OpenAPI Generator that most developers consider too rough for production use without significant manual work. In under five months, two of the three serious companies in this part of the market have been absorbed into larger platforms. The Stainless deal is more visible because of who bought it and why, but the broader pattern matters more: an entire layer of AI infrastructure is being pulled inside platform owners.
Moat migration. The gap in raw model capability between Anthropic, OpenAI, and Google has narrowed considerably and continues to close, and the implication is that model quality alone is unlikely to be the principal basis of competitive advantage over the next two years. What may distinguish the leading firms instead is the quality of the developer experience around their models: how easily a business or an engineer can build something useful on top of a given model, how cleanly the tooling integrates with existing systems, and how reliable the connectors are over time.

Stainless was founded by Alex Rattray, formerly of Stripe. Stripe built its market position largely on unusually well-designed developer tools, and Stainless was, in effect, an attempt to apply the same approach to the layer between AI APIs and the rest of the software economy. Anthropic has acquired the team that knows how to do this.

Pricing logic, with caveats on denial. Stainless was last valued at $150M in December 2025; at >$300M five months later, this is a roughly 2x strategic markup, not acqui-hire arithmetic. Removing a critical-path external dependency on Anthropic’s own SDKs, while denying it to a tight set of competitors, is rational at that price—but the denial logic is partial. Speakeasy is a viable substitute, and OpenAI was reportedly already migrating off Stainless. The friction tax falls hardest on smaller players who lack the engineering bench to absorb migration cost.

…The press release calls it “extending reach”; the InfoWorld read—“last-mile developer experience”—is closer, but the complement-capture component, even if partial, is real.

-*-

Now, while Claude might be overstating some of the market risks associated with this acquisition (you tell us?), it shows that open source’s impacts are highly conditional on its dependencies and should never be analyzed in isolation from the market’s software stack and architecture. This is equally true for open weight models—being dependent on data, compute, and distribution—as it is for open protocols like MCP, dependent on constant API translations and access. Tracking those interdependencies is what a full ecosystem view involves and is helpful to undertake in order to consider where chokepoints might arise, and in turn, where open source strategy might eventually fail or be captured.

Footnotes

In this case by the Agentic AI Foundation under the Linux Foundation ︎

Your AI Agent Already Forgot Half of What You Told It

Andrew Stellman — Thu, 28 May 2026 10:59:36 +0000

This is the seventh article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, part four here, part five here, and part six here.

This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of a turn I wasn’t expecting.

In my last article I talked about context and context management and I promised to give you some real practical tips for using it. It was originally meant to be about specific, practical context management techniques that were really helpful to me building Octobatch and the Quality Playbook, two open source projects where I work with AIs to plan and orchestrate all of the work and every line of code is written by AI tools like Claude Code and Cursor.

But as I was writing this, I found that I’d adapted those same techniques to my work writing articles like this one. Which is surprising! I’ve been doing all this work finding ways to help people developing AI skills improve context management, so their skills run more efficiently. It turns out that those same exact techniques apply to anyone using AI tools, even when you’re using chatbots like Claude.ai or ChatGPT.

Full disclosure: I use multiple AI tools to manage this article series. My primary tools are Claude Cowork for brainstorming and managing my article research, notes, and backlog and Gemini’s mobile app for reading drafts aloud and taking my notes while I’m away from my desk. And I want to tell you about something that happened while I was using those tools, because I think it really helps show why context management isn’t just a problem for developers.

While I was writing this article, I was using Gemini’s mobile app to read the draft aloud and take my notes. Partway through the session I asked it to go back and check whether there were earlier notes it hadn’t incorporated yet. It told me it didn’t have access to the previous notes, which seemed weird and insane, since we had just taken those notes a few prompts earlier in the session. I could scroll back up and see them earlier in the conversation, but somehow it didn’t “know” about them.

Here’s what happened. Gemini had compacted our conversation without telling me, and the notes from the first half of the session were just… gone.

If you’ve ever had a web chat AI just seem to forget things you talked about earlier, you’ve experienced context compaction, just like I did. Understanding even the basics of context and context windows can make a big difference in preventing that kind of frustration.

This all reminded me of something I wrote more than two decades ago in Applied Software Project Management (back in 2005!): “Important information is discovered during the discussion that the team will need to refer back to during the development process, and if that information is not written down, the team will have to have the discussion all over again.”

Jenny Greene and I wrote that about human teams and project meetings, but it applies to AI sessions just as well.

Which brings me back to context, which I wrote about in my last article, and which I’ll write more about in the next one, because it’s one of the most important concepts to keep top of mind when working with AI.

Context loss may be invisible, but that doesn’t make it any less frustrating

Context is everything the AI is holding in its working memory during a conversation: what you’ve told it, what it’s told you, any files or instructions it’s read, and whatever internal notes the system has made along the way. All of that lives in a fixed-size context window—think of that as your AI’s short-term memory, the stuff it’s thinking about right now—and when the window fills up, the AI has to start letting things go. Different tools handle this differently: Some truncate older messages, some compress the conversation into a summary (which means details get lost even though the summary looks complete), and some just start behaving inconsistently so you can’t tell whether the AI forgot something or never understood it in the first place. The result is the same: The AI loses track of things you told it, decisions you made together, or details it noticed earlier in the session. And it won’t tell you it forgot. It’ll just keep generating confident-sounding output based on whatever it still has.

Before we dive in a little deeper, I want to do a quick jargon check. If you’ve seen the terms “skills” and “agents” floating around but aren’t sure what they are, think of skills as libraries for AIs and agents as interactive executables. Those aren’t perfectly precise definitions, but if you’re a developer they’re close enough for this discussion.

When you’re coding skills and agents, you run into context problems quickly. The work you’re asking the AI to do is often complex enough that the context window fills up, and the AI has to start compacting: compressing or dropping older parts of the conversation to make room for new ones. Compaction always seems to happen at the most frustrating and inconvenient time, which makes sense when you think about it. You hit context limits precisely when you’ve put the most information into the conversation, which is exactly when losing that information costs you the most.

That’s why I think it can often help to think of AIs as having the same shortcomings that human teams do, except those shortcomings are exaggerated by their AI nature. A person who forgets something from a meeting last week might remember it when you remind them. An AI that lost something to context compaction won’t, because the information is gone. But there’s something you can do about it, and it turns out the techniques that help are the same whether you’re building autonomous AI skills or just trying to get a chatbot to remember what you told it 20 minutes ago.

I’ve landed on four techniques that I come back to over and over again. Each one exists because at some point the AI forgot something important and I responded by putting that thing in a file where it couldn’t be forgotten. None of them require special tooling. And to my surprise, all of these techniques have turned out to be useful for both building software and managing a writing project like this one, whether I’m chatting with Claude, ChatGPT, or Gemini, or using a desktop tool like Claude Cowork or Codex. These are the techniques I find most valuable:

Split discovery from documentation: Don’t ask the AI to figure something out and produce polished output in the same pass.
Use handoff documents, not continuation prompts: Before closing a stale session, have the AI write down everything the next session needs to know.
Give the AI an acceptance criterion, not a procedure: Tell it what “done” looks like instead of spelling out the steps.
Use spec documents as the bridge between AI tools: Make a shared document the single source of truth that all your tools read from.

Split discovery from documentation

When you ask an AI to do something complex, you’re often asking it to do two things at once without realizing it. You’re asking it to figure something out and produce polished output at the same time. The problem is that figuring things out takes attention, and producing output takes attention, and the model only has so much of it. When you combine both tasks in the same prompt, the model starts cutting corners on one of them, and you can’t tell which one it shortchanged.

I ran into this with the Quality Playbook, an open source AI coding skill I built that runs structured code reviews against any codebase. One of the things it does is derive requirements from source code: It reads through the code, identifies what the code promises to do (I call these behavioral contracts), and then produces a requirements document. Originally this all happened in a single pass. The problem was that single-pass requirement generation ran out of attention after about 70 requirements. The model forgot behavioral contracts it had noticed earlier in the code, and the forgetting was completely invisible. There was no stack trace or error message, just incomplete output and no way to know what was missing. I fixed it by splitting the work into two separate prompts:

Read each source file and write down every behavioral contract you observe as a simple list in CONTRACTS.md.

Read CONTRACTS.md and the documentation, then derive requirements from them and write REQUIREMENTS.md.

Then a third pass checks whether every contract has a corresponding requirement, and if there are gaps, goes back to step one for the files with gaps.

The key idea is that CONTRACTS.md is external memory. When the model “forgets” about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap. You can see what was forgotten and fix it.

The principle: Don’t ask the AI to figure out what exists and write formatted output in the same pass. The model runs out of attention trying to do both at once. Whenever you’re asking an AI to do something complex, consider whether you’re actually asking it to do two things at once. “Analyze this codebase and write a report” is two tasks. “Read this document and suggest improvements” is two tasks. Split them, and let the first pass write its observations to a file before the second pass starts working with them.

Use handoff documents, not continuation prompts

Anyone who’s spent a long session with an AI coding tool has felt the moment when the context starts to go stale. The AI stops tracking details it was handling fine an hour ago, or it contradicts something it said earlier. The session gets slow, and you’re often restarting because the AI seems to have gotten bogged down and filled up on what you told it. You get the sense that if you keep going, you’re going to spend more time correcting it than making progress.

Most developers respond to their session getting too long in one of two ways: They push through the problem, or they start a fresh one and try to reexplain everything from scratch. Both of those approaches can cause the AI to lose context. The first loses it to compaction; the second loses it to incomplete reexplanation. And both are frustrating! Specifically because you just spent so much time building up all that context with the AI.

There’s a third option. Before you close the session, ask the AI to write a handoff document: a file that captures everything the next session needs to know, written while the current session still has full context. The key is that you’re asking the AI to write this while the relevant details are still fresh in the working context, and in a way that it or another AI can read.

I built this into the Quality Playbook as a core part of how phases communicate. When I split the playbook from a single prompt to independent phases, I needed each phase to run as a completely independent session with no context carryover. So each phase got its own kickoff prompt as a standalone file. Here’s the structure each one follows:

Write a handoff document that a fresh session could use to pick up this work cold. Include everything it would need to know.

Every kickoff opens with what prior phases accomplished, includes explicit boundaries about what’s frozen, and names which future phase owns each piece of remaining work, because without it the AI will helpfully start doing Phase 3 work while you’re still in Phase 2. Each phase also ends with a required forward-looking handoff where the completing agent writes down what the next session needs to know.

The principle: Each handoff is a complete state snapshot. The incoming AI agent never needs to read prior kickoff prompts or chat history. Everything it needs is in the current handoff file: current state, uncommitted changes, immediate next task, pending tasks, file locations, and anything that was discovered during the prior session. A fresh AI session can pick it up cold.

If you’re deep into a Claude Code or Copilot session and you can feel the context getting stale, ask the AI to write a handoff document before you close the session. Tell it to include everything a fresh session would need to continue the work. Then start a new session and point it at that file. A fresh session with a good handoff document will usually outperform a stale session, because it’s starting with clean context instead of compacted, fragmented context.

Give the AI an acceptance criterion, not a procedure

When you give an AI a multistep task, the natural instinct is to spell out the steps. First do this, then do that, then combine the results. The problem is that step-by-step procedures are the first thing the AI forgets when the context window fills up. It’ll skip steps, merge phases, or quietly drop tasks, and there’s nothing in the procedure itself that would help the AI notice what it missed. The procedure tells the AI what to do, but it doesn’t tell the AI what “done” looks like.

I learned this the hard way with the Quality Playbook. The playbook runs multiple iteration passes over a codebase, and the results need to be cumulative. It keeps a list of all the bugs it finds in the code being tested in a file called BUGS.md. Early on, I gave the AI a procedure to run four times and then update that file:

First run the main pass, then run four iteration passes, then merge the findings into BUGS.md.

The AI did not respond well to that instruction.

It turns out that when you ask an AI to do a very complex task a specific number of times, it can lose count. In fact, from my experimentation, it seems that count is one of the first casualties of context compaction. Most of the time the AI decided three iterations was enough, or merged findings from only two passes, and no matter how many different ways I tried to rephrase that instruction, there was nothing I could come up with that prevented the problem.

However, everything changed when I replaced the “run four times” instruction with an acceptance criterion, or a specific condition that tells the AI when to stop looping:

You are done only when BUGS.md contains the cumulative findings from the main run plus all four itration passes.

Even when the AI lost track of intermediate steps, it could check the output against the criterion and know whether it was finished. And I could verify the output against the same criterion, which gave me a way to audit the agent’s work without watching every step.

In developer terms, the AI is really bad at loops like for (i = 0; i < 4; i++) because it loses track of the value of the iterator i when it compacts its context. But it’s really good at loops like while (!done) because it can check done based on the current state without relying on history.

The principle behind all this is that an acceptance criterion survives context pressure because the AI can always check “Am I done?” against a concrete test. This is actually the same principle behind test-driven development: write the test before the code so you know when you’re done. The acceptance criterion is the test for your AI session. When you’re giving an AI a task that has multiple steps, don’t describe the steps. Describe what “done” looks like, and let the AI figure out how to get there.

Use spec documents as the bridge between AI tools

Most developers working with AI don’t use just one tool. You might use Claude for design, Cursor for coding, and Copilot for quick edits. You might even use multiple models inside the same tool, like GPT-5.5 and Opus 4.7 in separate Copilot chats inside VS Code. It’s common to have one model for coding, another for review, and a third for orchestration and project management. The problem is that none of these tools or chats know what you told the others. Claude doesn’t know what you decided with Cursor. Two separate Copilot chats in the same editor don’t share context. You’re the one carrying context between them, and that’s exactly the kind of lossy handoff that causes drift. A design decision you made in one conversation gets lost or distorted by the time it reaches the tool that needs to implement it.

The fix is to make the spec document the single source of truth that all your AI tools read from. I used this when building a game prototype, where I had Claude handling design and planning and Cursor doing the coding. They never talked to each other directly, so the spec documents served as the shared contract: Claude wrote the specs, and Cursor read them. The rule I followed was simple:

Never tell the AI coder something that isn’t already in the specs. If you make a design decision in conversation, write it into the spec first, then point the coder at the spec.

If I made a design decision in a conversation with Claude, that decision had to be written into the spec before I told Cursor about it. If I discovered something during implementation, I wrote it into the appropriate doc first, then pointed the coder at it. The spec was always the single source of truth. When Claude and I changed the wound topology (removing one wound type, promoting another), we updated the docs first, then told Cursor to reread them. When we decided to add a new UI element, we wrote it into the UI spec first, then told Cursor to reread the doc.

The key was including rationale in the specs. Not just “show 5 progressive labels” but why: “The player shouldn’t be told what they’re fighting. They should discover it.” This helps the AI coder make better decisions when the spec doesn’t cover an edge case because it knows the intent behind the requirement.

The principle: The spec document is the shared context that all your tools can read. It prevents the drift that happens when design intent lives only in chat history that the other tool can’t see. This technique works any time you’re using more than one AI tool on the same project, which at this point is most projects.

How these techniques combine: Managing this article series

Those four practices came out of AI-driven development work, but they apply to almost any AI work. And while these techniques emerged for me while working on agents and skills, I think it’s valuable to demonstrate them in a nondevelopment context, so I’ll share an example from my work on the article series you’re reading now.

Over time, the process for how my AI assistant and I manage this article backlog evolved organically in conversation, but it was never written down anywhere except in the AI’s context window. Which means every time the session compacted or I started a fresh chat, the process was gone and I had to reexplain it. I caught this when the AI did something slightly wrong and I wanted to confirm we were on the same page. So I asked:

Every time I suggest a new article idea, you add an entry to the backlog, and then create a new markdown file with the source material, right?

That’s split discovery from documentation. I didn’t say “document our process.” I said “confirm what we do.” Discovery first, then documentation as a separate step. If I’d said “write up our process” without confirming first, the AI might have written something plausible but wrong, and I wouldn’t have caught the discrepancy.

Once we’d confirmed the process, I asked the AI to create two files. AGENTS.md is an emerging standard for AI-readable project context—a single file that tells any AI session what it needs to know about a project. You can learn more about the convention at agents.md. CONTEXT.md serves a similar role as a bootstrapping document—it’s less established as a standard, but the practice of asking the AI to dump everything it knows into a context file so the next session can pick it up cold has been one of the most valuable habits I’ve developed. Here’s the prompt I used:

Update the backlog file to explain what it is and how we maintain it. Create a CONTEXT.md with everything you’d need to bootstrap a new chat. Create an AGENTS.md to make it easy to bootstrap with a single-line prompt.

That prompt is a handoff document. I was explicitly asking the AI to write down everything it knew while it still had full context, specifically because I knew that context would be lost to compaction. The CONTEXT.md file is a handoff from this session to whatever fresh session picks up the work next week.

Notice what I didn’t say. I didn’t give step-by-step instructions for what should go in those files. I said “everything you would need to bootstrap this process again in case we lost it” and “a complete dump of all of the context you would need to bootstrap a new chat and get it to the point where this current chat is.” Those are acceptance criteria, not procedures. The AI had to figure out what belonged in those files. If I’d given it a procedure (“first write the publication history, then the voice rules, then the file locations”), it would have followed the list and missed anything I forgot to include. The acceptance criterion is harder to satisfy but more robust: the test is “Could a fresh session bootstrap from these files alone?”

And the AGENTS.md file itself is a spec document as a bridge between tools. It’s the shared contract that any AI session, whether it’s Claude, Gemini, Cowork, or a fresh chat, can read to get aligned with the project. This session wrote it; the next session reads it. The two sessions never communicate directly, so the spec file bridges the gap between them.

That’s all four practices in two prompts, applied to something as ordinary as managing a writing project. It didn’t require pipelines or codebases or batch orchestration. The practices work because they solve the same underlying problem regardless of the domain: important information living in the AI’s context window instead of on disk.

Context management is a development skill

Every practice I’ve described in this article and the last one is something developers have always been told to do: write things down, record your rationale, be deliberate about what you save and what you let go, write ADRs and design docs and inline comments explaining nonobvious choices. We’ve always known we should do more of it. When you’re working with AI, the cost of not doing it becomes immediate and visible.

The practices in this article all come down to the same thing: putting the important information in files where compaction can’t touch it, so you can see what the AI knows and verify that it matches reality. In the next article, I’ll go deeper on the debugging angle: how to use externalized files to understand what your AI is actually doing, with practical techniques that work even if you’re not building agents but are just using a chatbot.

The Quality Playbook is open source and works with GitHub Copilot, Cursor, and Claude Code. It’s also available as part of awesome-copilot.

Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.