<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Thu, 05 Mar 2026 12:19:07 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>The Accidental Orchestrator</title>
		<link>https://www.oreilly.com/radar/the-accidental-orchestrator/</link>
				<comments>https://www.oreilly.com/radar/the-accidental-orchestrator/#respond</comments>
				<pubDate>Thu, 05 Mar 2026 12:19:07 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18209</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Accidental-orchestrator-with-equipment.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Accidental-orchestrator-with-equipment-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Experiments in agentic engineering and AI-driven development]]></custom:subtitle>
		
				<description><![CDATA[This is the first article in a series on agentic engineering and AI-driven development. Look for the next article on March 19 on O’Reilly Radar. There&#8217;s been a lot of hype about AI and software development, and it comes in two flavors. One says, “We&#8217;re all doomed, that tools like Claude Code will make software [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the first article in a series on agentic engineering and AI-driven development. Look for the next article on March 19 on O’Reilly Radar.</em></p>
</blockquote>



<p>There&#8217;s been a lot of hype about AI and software development, and it comes in two flavors. One says, “We&#8217;re all doomed: tools like Claude Code will make software engineering obsolete within a year.” The other says, “Don&#8217;t worry, everything&#8217;s fine; AI is just another tool in the toolbox.” Neither is honest.</p>



<p>I&#8217;ve spent over 20 years writing about software development for practitioners, covering everything from coding and architecture to project management and team dynamics. For the last two years I&#8217;ve been focused on AI, training developers to use these tools effectively, writing about what works and what doesn&#8217;t in books, articles, and reports. And I kept running into the same problem: I had yet to find anyone with a coherent answer for how experienced developers should actually work with these tools. There are plenty of tips and plenty of hype but very little structure, and very little you could practice, teach, critique, or improve.</p>



<p>I&#8217;d been observing developers at work using AI with various levels of success, and I realized we need to start thinking about this as its own discipline. Andrej Karpathy, the former head of AI at Tesla and a founding member of OpenAI, recently proposed the term &#8220;agentic engineering&#8221; for disciplined development with AI agents, and others like Addy Osmani are getting on board. Osmani&#8217;s framing is that <a href="https://addyosmani.com/blog/agentic-engineering/" target="_blank" rel="noreferrer noopener">AI agents handle implementation but the human owns the architecture, reviews every diff, and tests relentlessly</a>. I think that&#8217;s right.</p>



<p>But I&#8217;ve spent a lot of the last two years teaching developers how to use tools like Claude Code, agent mode in Copilot, Cursor, and others, and what I keep hearing is that they already know they should be reviewing the AI&#8217;s output, maintaining the architecture, writing tests, keeping documentation current, and staying in control of the codebase. They know how to do it <em>in theory</em>. But they get stuck trying to apply it <em>in practice</em>: How do you actually review thousands of lines of AI-generated code? How do you keep the architecture coherent when you&#8217;re working across multiple AI tools over weeks? How do you know when the AI is confidently wrong? And it&#8217;s not just junior developers who are having trouble with agentic engineering. I&#8217;ve talked to senior engineers who struggle with the shift to agentic tools, and intermediate developers who take to it naturally. The difference isn&#8217;t necessarily the years of experience; it&#8217;s whether they&#8217;ve figured out an effective and structured way to work with AI coding tools. <strong><em>That gap between knowing what developers should be doing with agentic engineering and knowing how to integrate it into their day-to-day work is a real source of anxiety for a lot of engineers right now.</em></strong> That&#8217;s the gap this series is trying to fill.</p>



<p>Despite what much of the hype about agentic engineering is telling you, this kind of development doesn&#8217;t eliminate the need for developer expertise; just the opposite. Working effectively with AI agents actually raises the bar for what developers need to know. I wrote about that experience gap in an earlier O&#8217;Reilly Radar piece called “<a href="https://www.oreilly.com/radar/the-cognitive-shortcut-paradox/" target="_blank" rel="noreferrer noopener">The Cognitive Shortcut Paradox</a>.” The developers who get the most from working with AI coding tools are the ones who already know what good software looks like, and can often tell if the AI wrote it.</p>



<p>The idea that AI tools work best when experienced developers are driving them matched everything I&#8217;d observed. It rang true, and I wanted to prove it in a way that other developers would understand: by building software. So I started building a specific, practical approach to agentic engineering built for developers to follow, and then I put it to the test. I used it to build a production system from scratch, with the rule that AI would write all the code. I needed a project that was complex enough to stress-test the approach, and interesting enough to keep me engaged through the hard parts. I wanted to apply everything I&#8217;d learned and discover what I still didn&#8217;t know. That&#8217;s when I came back to <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" target="_blank" rel="noreferrer noopener">Monte Carlo simulations</a>.</p>



<h2 class="wp-block-heading"><strong>The experiment</strong></h2>



<p>I&#8217;ve been obsessed with Monte Carlo simulations ever since I was a kid. My dad&#8217;s an epidemiologist—his whole career has been about finding patterns in messy population data, which means statistics was always part of our lives (and it also means that I learned <a href="https://en.wikipedia.org/wiki/SPSS" target="_blank" rel="noreferrer noopener">SPSS</a> at a very early age). When I was maybe 11 he told me about the drunken sailor problem: A sailor leaves a bar on a pier, taking a random step toward the water or toward his ship each time. Does he fall in or make it home? You can&#8217;t know from any single run. But run the simulation a thousand times, and the pattern emerges from the noise. The individual outcome is random; the aggregate is predictable.</p>



<p>I remember writing that simulation in BASIC on my TRS-80 Color Computer 2: a little blocky sailor stumbling across the screen, two steps forward, one step back. The drunken sailor is the &#8220;Hello, world&#8221; of Monte Carlo simulations. Monte Carlo is a technique for problems you can&#8217;t solve analytically: You simulate them hundreds or thousands of times and measure the aggregate results. Each individual run is random, but the statistics converge on the true answer as the sample size grows. It&#8217;s one way we model everything from nuclear physics to financial risk to the spread of disease across populations.</p>
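<p>The drunken sailor walk is small enough to sketch in a few lines of Python. This is a minimal illustration of the idea, not Octobatch code: each run is random, but a seeded aggregate over thousands of runs converges on a stable probability.</p>

```python
import random

def sailor_walk(rng, pier_length=10, start=5, max_steps=1000):
    """Simulate one drunken sailor; True if he reaches his ship."""
    position = start
    for _ in range(max_steps):
        position += rng.choice((-1, 1))  # one step toward water or ship
        if position <= 0:
            return False  # fell in the water
        if position >= pier_length:
            return True   # made it home
    return False

def monte_carlo(runs=10_000, seed=42):
    """Aggregate many random runs; the ratio stabilizes as runs grow."""
    rng = random.Random(seed)  # seeded for reproducibility
    made_it = sum(sailor_walk(rng) for _ in range(runs))
    return made_it / runs

print(f"P(reaches ship) ≈ {monte_carlo():.3f}")
```

<p>Starting halfway down the pier, the symmetric walk comes out near 50/50, which is exactly the sanity check that matters later in this article.</p>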



<p>What if you could run that kind of simulation today by describing it in plain English? Not a toy demo but thousands of iterations with seeded randomness for reproducibility, where the outputs get validated and the results get aggregated into actual statistics you can use. Or a pipeline where an LLM generates content, a second LLM scores it, and anything that doesn&#8217;t pass gets sent back for another try.</p>
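<p>That generate-score-retry loop has a simple shape. Here is a sketch of the control flow, where <code>generate</code> and <code>score</code> are hypothetical stand-ins for calls to two different LLMs, not real API calls:</p>

```python
def generate(item):
    # Placeholder for an LLM call that produces content for `item`.
    return f"draft for {item}"

def score(content):
    # Placeholder for a second LLM grading the content on a 0-10 scale.
    return 8 if "draft" in content else 2

def run_pipeline(items, threshold=7, max_attempts=3):
    """Send each item through generate/score; retry anything below threshold."""
    results, failed = {}, []
    for item in items:
        for _attempt in range(max_attempts):
            content = generate(item)
            if score(content) >= threshold:
                results[item] = content
                break
        else:
            failed.append(item)  # exhausted retries without passing
    return results, failed

results, failed = run_pipeline(["npc-1", "npc-2"])
```
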



<p>The goal of my experiment was to build that system, which I called <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a>. Right now, the industry is constantly looking for new real-world end-to-end case studies in agentic engineering, and I wanted Octobatch to be exactly that case study.</p>



<p>I took everything I&#8217;d learned from teaching and observing developers working with AI, put it to the test by building a real system from scratch, and turned the lessons into a structured approach to agentic engineering I&#8217;m calling <strong>AI-driven development</strong>, or <strong>AIDD</strong>. This is the first article in a series about what agentic engineering looks like in practice, what it demands from the developer, and how you can apply it to your own work.</p>



<p>The result is a fully functioning, well-tested application that consists of about 21,000 lines of Python across several dozen files, backed by complete specifications, nearly a thousand automated tests, and quality integration and regression test suites. I used Claude Cowork to review all the AI chats from the entire project, and it turns out that I built the entire application in roughly 75 hours of active development time over seven weeks. For comparison, I built Octobatch in just over half the time I spent last year playing <a href="https://www.blueprincegame.com/" target="_blank" rel="noreferrer noopener"><em>Blue Prince</em></a>.</p>



<p>But this series isn&#8217;t just about Octobatch. I integrated AI tools at every level: Claude and Gemini collaborating on architecture, Claude Code writing the implementation, LLMs generating the pipelines that run on the system they helped build. This series is about what I learned from that process: the patterns that worked, the failures that taught me the most, and the orchestration mindset that ties it all together. Each article pulls a different lesson from the experiment, from validation architecture to multi-LLM coordination to the values that kept the project on track.</p>



<h2 class="wp-block-heading"><strong>Agentic engineering and AI-driven development</strong></h2>



<p>When most people talk about using AI to write code, they mean one of two things: AI coding assistants like GitHub Copilot, Cursor, or Windsurf, which have evolved well beyond autocomplete into agentic tools that can run multifile editing sessions and define custom agents; or &#8220;vibe coding,&#8221; where you describe what you want in natural language and accept whatever comes back. These coding assistants are genuinely impressive, and vibe coding can be really productive.</p>



<p>Using these tools effectively on a real project, however, while maintaining architectural coherence across thousands of lines of AI-generated code, is a different problem entirely. AIDD aims to help solve that problem. It&#8217;s a structured approach to agentic engineering where AI tools drive substantial portions of the implementation, architecture, and even project management, while you, the human in the loop, decide what gets built and whether it&#8217;s any good. By &#8220;structure,&#8221; I mean a set of practices developers can learn and follow, a way to know whether the AI&#8217;s output is actually good, and a way to stay on track across the life of a project. If agentic engineering is the discipline, AIDD is one way to practice it.</p>



<p>In AI-driven development, developers don&#8217;t just accept suggestions or hope the output is correct. They assign specific roles to specific tools: one LLM for architecture planning, another for code execution, a coding agent for implementation, and the human for vision, verification, and the decisions that require understanding the whole system.</p>



<p>And the &#8220;driven&#8221; part is literal. The AI is writing almost all of the code. One of my ground rules for the Octobatch experiment was that I would let AI write all of it. I have high code quality standards, and part of the experiment was seeing whether AIDD could produce a system that meets them. The human decides what gets built, evaluates whether it&#8217;s right, and maintains the constraints that keep the system coherent.</p>



<p>Not everyone agrees on how much the developer needs to stay in the loop, and the fully autonomous end of the spectrum is already producing cautionary tales. Nicholas Carlini at Anthropic recently tasked 16 Claude instances with <a href="https://www.anthropic.com/engineering/building-c-compiler" target="_blank" rel="noreferrer noopener">building a C compiler in parallel with no human in the loop</a>. After 2,000 sessions and $20,000 in API costs, the agents produced a 100,000-line compiler that can build a Linux kernel but isn&#8217;t a drop-in replacement for anything, and when all 16 agents got stuck on the same bug, Carlini had to step back in and partition the work himself. Even strong advocates of a completely hands-off, vibe-driven approach to agentic engineering might call that a step too far. The question is how much human judgment you need to make that code trustworthy, and what specific practices help you apply that judgment effectively.</p>



<h2 class="wp-block-heading"><strong>The orchestration mindset</strong></h2>



<p>If you want to get developers thinking about agentic engineering in the right way, you have to start with how they think about working with AI, not just what tools they use. That&#8217;s where I started when I began building a structured approach, and it&#8217;s why I started with <strong>habits</strong>. I developed a framework for these called the Sens-AI Framework, published as both an <a href="https://learning.oreilly.com/library/view/critical-thinking-habits/0642572243326/" target="_blank" rel="noreferrer noopener">O&#8217;Reilly report</a> (<em>Critical Thinking Habits for Coding with AI</em>) and a <a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">Radar series</a>. It&#8217;s built around five practices: providing context, doing research before prompting, framing problems precisely, iterating deliberately on outputs, and applying critical thinking to everything the AI produces. I started there because habits are how you lock in the way you think about your work. Without them, AI-driven development produces plausible-looking code that falls apart under scrutiny. With them, it produces systems that a single developer couldn&#8217;t build alone in the same timeframe.</p>



<p>Habits are the foundation, but they&#8217;re not the whole picture. AIDD also has <strong>practices</strong> (concrete techniques like multi-LLM coordination, context file management, and using one model to validate another&#8217;s output) and <strong>values</strong> (the principles behind those practices). If you&#8217;ve worked with Agile methodologies like Scrum or XP, that structure should be pretty familiar: Practices tell you how to work day-to-day, and habits are the reflexes you develop so that the practices become automatic.</p>



<p>Values often seem weirdly theoretical, but they’re an important piece of the puzzle because they guide your decisions when the practices don&#8217;t give you a clear answer. There&#8217;s an emerging culture around agentic engineering right now, and the values you bring to your project either match or clash with that culture. Understanding where the values come from is what makes the practices stick. All of that leads to a whole new mindset, what I&#8217;m calling <strong>the orchestration mindset</strong>. This series builds all four layers, using Octobatch as the proving ground.</p>



<p>Octobatch was a deliberate experiment in AIDD. I designed the project as a test case for the entire approach, to see what a disciplined AI-driven workflow could produce and where it would break down, and I used it to apply and improve the practices and values to make them effective and easy to adopt. And whether by instinct or coincidence, I picked the perfect project for this experiment. Octobatch is a <em>batch orchestrator</em>. It coordinates asynchronous jobs, manages state across failures, tracks dependencies between pipeline steps, and makes sure validated results come out the other end. That kind of system is fun to design but a lot of the details, like state machines, retry logic, crash recovery, and cost accounting, can be tedious to implement. It&#8217;s exactly the kind of work where AIDD should shine, because the patterns are well understood but the implementation is repetitive and error-prone.</p>



<p><strong>Orchestration</strong>—the work of coordinating multiple independent processes toward a coherent outcome—evolved into a core idea behind AIDD. I found myself orchestrating LLMs the same way Octobatch orchestrates batch jobs: assigning roles, managing handoffs, validating outputs, recovering from failures. The system I was building and the process I was using to build it followed the same pattern. I didn&#8217;t anticipate it when I started, but building a system that orchestrates AI turns out to be a pretty good way to learn how to orchestrate AI. That&#8217;s the accidental part of the accidental orchestrator. That parallel runs through every article in this series.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading"><strong>The path to batch</strong></h2>



<p>I didn&#8217;t begin the Octobatch project with a full end-to-end Monte Carlo simulation. I started where most people start: typing prompts into a chat interface. I was experimenting with different simulation and generation ideas to give the project some structure, and a few of them stuck. A blackjack strategy comparison turned out to be a great test case for a multistep Monte Carlo simulation. NPC dialogue generation for a role-playing game gave me a creative workload with subjective quality to measure. Both had the same shape: a set of structured inputs, each processed the same way. So I had Claude write a simple script to automate what I&#8217;d been doing by hand, and I used Gemini to double-check the work, make sure Claude really understood my ask, and fix hallucinations. It worked fine at small scale, but once I started running more than a hundred or so units, I kept hitting rate limits, the caps that providers put on how many API requests you can make per minute.</p>



<p>That&#8217;s what pushed me to <strong>LLM batch APIs</strong>. Instead of sending individual prompts one at a time and waiting for each response, the major LLM providers all offer batch APIs that let you submit a file containing all of your requests at once. The provider processes them on their own schedule; you wait for results instead of getting them immediately, but you don&#8217;t have to worry about rate caps. I was happy to discover they also cost 50% less, and that&#8217;s when I started tracking token usage and costs in earnest. But the real surprise was that<em> batch APIs performed better than real-time APIs at scale</em>. Once pipelines got past the 100- or 200-unit mark, batch started running significantly faster than real time. The provider processes the whole batch in parallel on their infrastructure, so you&#8217;re not bottlenecked by round-trip latency or rate caps anymore.</p>
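<p>To make the &#8220;submit a file of requests&#8221; model concrete, here is a sketch that builds such a file. The line format follows OpenAI&#8217;s documented Batch API JSONL shape; the other providers use similar but not identical formats, and the model name here is just a placeholder:</p>

```python
import json

def build_batch_file(prompts, model="gpt-4o-mini", path="batch_requests.jsonl"):
    """Write one JSONL line per request, in the shape OpenAI's Batch API
    expects. Each line carries a custom_id so results can be matched
    back to inputs when the batch completes hours later."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"unit-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
    return path
```

<p>You upload the resulting file, create a batch job against it, and poll until the provider hands back a results file keyed by those same <code>custom_id</code> values.</p>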



<p>The switch to batch APIs changed how I thought about the whole problem of coordinating LLM API calls at scale, and led to the idea of configurable pipelines. I could chain stages together: The output of one step could become the input to the next, and I could kick off the whole pipeline and come back to finished results. It turns out I wasn&#8217;t the only one making the shift to batch APIs. Between April 2024 and July 2025, OpenAI, Anthropic, and Google all launched batch APIs, converging on the same pricing model: 50% of the real-time rate in exchange for asynchronous processing.</p>



<p>You probably didn&#8217;t notice that all three major AI providers released batch APIs. The industry conversation was dominated by agents, tool use, MCP, and real-time reasoning. Batch APIs shipped with relatively little fanfare, but they represent a genuine shift in how we can use LLMs. Instead of treating them as conversational partners or one-shot SaaS APIs, we can treat them as processing infrastructure, closer to a MapReduce job than a chatbot. You give them structured data and a prompt template, and they process all of it and hand back the results. What matters is that you can now run tens of thousands of these transformations reliably, at scale, without managing rate limits or connection failures.</p>



<h2 class="wp-block-heading"><strong>Why orchestration?</strong></h2>



<p>If batch APIs are so useful, why can&#8217;t you just write a for-loop that submits requests and collects results? You can, and for simple cases a quick script with a for-loop works fine. But once you start running larger workloads, the problems start to pile up. Solving those problems turned out to be one of the most important lessons for developing a structured approach to agentic engineering.</p>



<p>First, batch jobs are asynchronous. You submit a job, and results come back hours later, so your script needs to track what was submitted and poll for completion. If your script crashes in the middle, you lose that state. Second, batch jobs can partially fail. Maybe 97% of your requests succeeded and 3% didn&#8217;t. Your code needs to figure out which 3% failed, extract them, and resubmit just those items. Third, if you&#8217;re building a multistage pipeline where the output of one step feeds into the next, you need to track dependencies between stages. And fourth, you need cost accounting. When you&#8217;re running tens of thousands of requests, you want to know how much you spent, and ideally, how much you&#8217;re going to spend when you first start the batch. Every one of these has a direct parallel to what you&#8217;re doing in agentic engineering: keeping track of the work multiple AI agents are doing at once, dealing with code failures and bugs, making sure the entire project stays coherent when AI coding tools are only looking at the one part currently in context, and stepping back to look at the wider project management picture.</p>
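<p>The common thread in all four problems is persisted state. A minimal sketch of that idea, assuming a simple JSON manifest file (the field names here are illustrative, not Octobatch&#8217;s actual format):</p>

```python
import json
from pathlib import Path

def load_manifest(path):
    """Reload persisted state, so a crash never loses track of the batch."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return {"units": {}}  # unit_id -> {"status": ..., "cost": ...}

def record(manifest, path, unit_id, status, cost=0.0):
    """Update one unit and persist immediately: the file is the source of truth."""
    manifest["units"][unit_id] = {"status": status, "cost": cost}
    Path(path).write_text(json.dumps(manifest, indent=2))

def failed_units(manifest):
    """The handful of failures to extract and resubmit."""
    return [uid for uid, info in manifest["units"].items()
            if info["status"] == "failed"]

def total_cost(manifest):
    """Running cost accounting across everything submitted so far."""
    return sum(info["cost"] for info in manifest["units"].values())
```

<p>Because every change is written through to disk, a restarted process can reload the manifest, resubmit only the failed 3%, and report spend without re-running anything.</p>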



<p>All of these problems are solvable, but they&#8217;re not problems you want to solve over and over, whether you&#8217;re orchestrating LLM batch jobs or orchestrating AI coding tools. Solving them in the code taught me some interesting lessons about the overall approach to agentic engineering. Batch processing moves the complexity from connection management to state management. Real-time APIs are hard because of rate limits and retries. Batch APIs are hard because you have to track what&#8217;s in flight, what succeeded, what failed, and what&#8217;s next.</p>



<p>Before I started development, I went looking for existing tools that handled this combination of problems, because I didn’t want to waste my time reinventing the wheel. I didn’t find anything that did the job I needed. Workflow orchestrators like Apache Airflow and Dagster manage DAGs and task dependencies, but they assume tasks are deterministic and don&#8217;t provide LLM-specific features like prompt template rendering, schema-based output validation, or retry logic triggered by semantic quality checks. LLM frameworks like LangChain and LlamaIndex are designed around real-time inference chains and agent loops—they don&#8217;t manage asynchronous batch job lifecycles, persist state across process crashes, or handle partial failure recovery at the chunk level. And the batch API client libraries from the providers themselves handle submission and retrieval for a single batch, but not multistage pipelines, cross-step validation, or provider-agnostic execution.</p>



<p>Nothing I found covered the full lifecycle of multiphase LLM batch workflows, from submission and polling through validation, retry, cost tracking, and crash recovery, across all three major AI providers. That&#8217;s what I built.</p>



<h2 class="wp-block-heading"><strong>Lessons from the experiment</strong></h2>



<p>The goal of this article, as the first one in my series on agentic engineering and AI-driven development, is to lay out the hypothesis and structure of the Octobatch experiment. The rest of the series goes deep on the lessons I learned from it: the validation architecture, multi-LLM coordination, the practices and values that emerged from the work, and the orchestration mindset that ties it all together. A few early lessons stand out, because they illustrate what AIDD looks like in practice and why developer experience matters more than ever.</p>



<ul class="wp-block-list">
<li><strong>You have to run things and check the data.</strong> Remember the drunken sailor, the “Hello, world” of Monte Carlo simulations? At one point I noticed that when I ran the simulation through Octobatch, 77.5% of the sailors fell in the water. The results for a random walk should be 50/50, so clearly something was badly wrong. It turned out the random number generator was being re-seeded at every iteration with sequential seed values, which created correlation bias between runs. I didn&#8217;t identify the problem immediately. I ran a bunch of tests, using Claude Code as a test runner to generate each test, run it, and log the results; Gemini analyzed those results and found the root cause. Claude had trouble coming up with a fix that worked well and proposed a workaround: a large list of preseeded random number values in the pipeline. Gemini, after reviewing my conversations with Claude, proposed a hash-based fix, but it seemed overly complex. Once I understood the problem and rejected both proposed solutions, I decided the best fix was simpler than either AI&#8217;s suggestion: a persistent RNG per simulation unit that advanced naturally through its sequence. I needed to understand both the statistics and the code to evaluate those three options. Plausible-looking output and correct output aren&#8217;t the same thing, and you need enough expertise to tell the difference. (We&#8217;ll talk more about this situation in the next article in the series.)</li>



<li><strong>LLMs often overestimate complexity.</strong> At one point I wanted to add support for custom mathematical expressions in the analysis pipeline. Both Claude and Gemini pushed back, telling me, &#8220;This is scope creep for v1.0&#8221; and &#8220;Save it for v1.1.&#8221; Claude estimated three hours to implement. Because I knew the codebase, I knew we were already using asteval, a Python library that provides a safe, minimalistic evaluator for mathematical expressions and simple Python statements, to evaluate expressions elsewhere, so this seemed like a straightforward extension of a library already in place. Both LLMs thought the solution would be far more complex and time-consuming than it actually was; it took just two prompts to Claude Code (drafted by Claude itself) and about five minutes total to implement. The feature shipped and made the tool significantly more powerful. The AIs were being conservative because they didn&#8217;t have my context about the system&#8217;s architecture. Experience told me the integration would be trivial. Without that experience, I would have listened to them and deferred a feature that took five minutes.</li>



<li><strong>AI is often biased toward adding code, not deleting it.</strong> Generative AI is, unsurprisingly, biased toward generation. So when I asked the LLMs to fix problems, their first response was often to add more code, adding another layer or another special case. I can&#8217;t think of a single time in the whole project when one of the AIs stepped back and said, &#8220;Tear this out and rethink the approach.&#8221; The most productive sessions were the ones where I overrode that instinct and pushed for simplicity. This is something experienced developers learn over a career: The most successful changes often delete more than they add—the PRs we brag about are the ones that delete thousands of lines of code.</li>



<li><strong>The architecture emerged from failure.</strong> The AI tools and I didn&#8217;t design Octobatch&#8217;s core architecture up front. Our first attempt was a Python script with in-memory state and a lot of hope. It worked for small batches but fell apart at scale: A network hiccup meant restarting from scratch, a malformed response required manual triage. A lot of things fell into place after I added the constraint that the system must survive being killed at any moment. That single requirement led to the tick model (wake up, check state, do work, persist, exit), the manifest file as source of truth, and the entire crash-recovery architecture. We discovered the design by repeatedly failing to do something simpler.</li>



<li><strong>Your development history is a dataset.</strong> I just told you several stories from the Octobatch project, and this series will be full of them. Every one of those stories came from going back through the chat logs between me, Claude, and Gemini. With AIDD, you have a complete transcript of every architectural decision, every wrong turn, every moment where you overruled the AI and every moment where it corrected you. Very few development teams have ever had that level of fidelity in their project history. Mining those logs for lessons learned turns out to be one of the most valuable practices I&#8217;ve found.</li>
</ul>
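<p>The seeding bug from the first lesson above is worth seeing in code. This is my reconstruction of the two patterns, not Octobatch&#8217;s actual source: the buggy version creates a fresh generator with a sequential seed at every iteration, while the fix seeds one generator per simulation unit and lets it advance naturally through its sequence.</p>

```python
import random

def walk_reseeded(unit, steps=100):
    """Buggy pattern: a brand-new RNG with a sequential seed at every step.
    Nearby seeds produce related draws across units, which is the kind of
    correlation bias that skewed the sailor results."""
    position = 0
    for i in range(steps):
        rng = random.Random(unit * steps + i)  # re-seeded each iteration
        position += rng.choice((-1, 1))
    return position

def walk_persistent(unit, steps=100):
    """The fix: one RNG per simulation unit, seeded once, advanced naturally.
    Reproducible per unit, statistically independent across steps."""
    rng = random.Random(unit)
    return sum(rng.choice((-1, 1)) for _ in range(steps))
```

<p>Both versions are reproducible, which is exactly why the bug was easy to miss: The code looked deterministic and correct, and only the aggregate statistics gave it away.</p>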



<p>Near the end of the project, I switched to Cursor to make sure none of this was specific to Claude Code. I created fresh conversations using the same context files I&#8217;d been maintaining throughout development, and was able to bootstrap productive sessions immediately; the context files worked exactly as designed. The practices I&#8217;d developed transferred cleanly to a different tool. The value of this approach comes from the habits, the context management, and the engineering judgment you bring to the conversation, not from any particular vendor.</p>



<p>These tools are moving the world in a direction that favors developers who understand the ways engineering can go wrong and know solid design and architecture patterns…and who are okay letting go of control of every line of code.</p>



<h2 class="wp-block-heading"><strong>What&#8217;s next</strong></h2>



<p>Agentic engineering needs structure, and structure needs a concrete example to make it real. The next article in this series goes into Octobatch itself, because the way it orchestrates AI is a remarkably close parallel to what AIDD asks developers to do. Octobatch assigns roles to different processing steps, manages handoffs between them, validates their outputs, and recovers when they fail. That&#8217;s the same pattern I followed when building it: assigning roles to Claude and Gemini, managing handoffs between them, validating their outputs, and recovering when they went down the wrong path. Understanding how the system works turns out to be a good way to understand how to orchestrate AI-driven development. I&#8217;ll walk through the architecture, show what a real pipeline looks like from prompt to results, present the data from a 300-hand blackjack Monte Carlo simulation that puts all of these ideas to the test, and use all of that to demonstrate ideas we can apply directly to agentic engineering and AI-driven development.</p>



<p>Later articles go deeper into the practices and ideas I learned from this experiment that make AI-driven development work: how I coordinated multiple AI models without losing control of the architecture, what happened when I tested the code against what I actually intended to build, and what I learned about the gap between code that runs and code that does what you meant. Along the way, the experiment produced some findings about how different AI models see code that I didn&#8217;t expect—and that turned out to matter more than I thought they would.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-accidental-orchestrator/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The &#8220;Data Center Rebellion&#8221; Is Here</title>
		<link>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/</link>
				<comments>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/#respond</comments>
				<pubDate>Wed, 04 Mar 2026 12:11:09 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica]]></dc:creator>
						<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18188</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Data-center-rebellion.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Data-center-rebellion-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Beyond the chips: The local politics of AI infrastructure]]></custom:subtitle>
		
				<description><![CDATA[This post first appeared on Ben Lorica’s Gradient Flow Substack newsletter and is being republished here with the author’s permission. Even the most ardent cheerleaders for artificial intelligence now quietly concede we are navigating a massive AI bubble. The numbers are stark: Hyperscalers are deploying roughly $400 billion annually into data centers and specialized chips [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><em>This post first appeared on Ben Lorica’s </em><a href="https://gradientflow.substack.com/p/the-data-center-rebellion-is-here" target="_blank" rel="noreferrer noopener">Gradient Flow<em> Substack newsletter</em></a><em> and is being republished here with the author’s permission.</em></td></tr></tbody></table></figure>



<p>Even the most ardent cheerleaders for artificial intelligence now quietly concede we are navigating a massive AI bubble. The numbers are stark: Hyperscalers are deploying roughly $400 billion annually into data centers and specialized chips while AI-related revenue hovers around $20 billion—a 20-to-1 capital-to-revenue ratio that stands out even in infrastructure cycles historically characterized by front-loaded spending. To justify this deployment on conventional investment metrics, the industry would need a step change in monetization over a short window to make the numbers work.</p>



<p>While venture capitalists and tech executives debate the “mismatch” between compute and monetization, a more tangible crisis is unfolding far from Silicon Valley. A growing grassroots opposition to AI data centers remains largely below the radar here in San Francisco. I travel to Sioux Falls, South Dakota, a few times a year to visit my in-laws. It’s not a region known for being antibusiness. Yet even there, a “data center rebellion” has been brewing. Even though the recent attempt to overturn a rezoning ordinance <a href="https://www.keloland.com/news/local-news/data-center-re-zone-petition-fails-in-sioux-falls/" target="_blank" rel="noreferrer noopener">did not succeed</a>, the level of community pushback in the heart of the Midwest signals that these projects no longer enjoy a guaranteed green light.</p>



<p>This resistance is not merely reflexive NIMBYism. It represents a sophisticated multifront challenge to the physical infrastructure AI requires. For leadership teams planning for the future, this means &#8220;compute availability&#8221; is no longer just a procurement question. It is now tied to local politics, grid stability, water management, and city approval processes. In the course of trying to understand the growing opposition to AI data centers, I’ve been examining the specific drivers behind this opposition and why the assumption of limitless infrastructure growth is colliding with hard constraints.</p>



<h2 class="wp-block-heading">The grid capacity crunch and the ratepayer revolt</h2>



<p>AI data centers function as grid-scale industrial loads. Individual projects now request 100+ megawatts, and some proposals reach into the gigawatt range. One proposed Michigan facility, for example, would consume 1.4 gigawatts, nearly exhausting the region’s remaining 1.5 gigawatts of headroom and roughly matching the electricity needs of about a million homes. This happens because AI hardware is incredibly dense and uses a massive amount of electricity. It also runs constantly. Since AI work doesn&#8217;t have &#8220;off&#8221; hours, power companies can&#8217;t rely on the usual quiet periods they use to balance the rest of the grid.</p>



<p>The politics come down to who pays the bill. Residents in many areas have seen their home utility rates jump by 25% or 30% after big data centers moved in, even though they were promised rates wouldn&#8217;t change. People are afraid they will end up paying for the power company&#8217;s new equipment. This happens when a utility builds massive substations just for one company, but the cost ends up being shared by everyone. When you add in state and local tax breaks, it gets even worse. Communities deal with all the downsides of the project, while the financial benefits are eaten away by tax breaks and credits.</p>



<p>The result is a rare bipartisan alignment around a simple demand: Hyperscalers should pay their full cost of service. Notably, Microsoft has moved in that direction publicly, committing to cover grid-upgrade costs and pursue rate structures intended to insulate residential customers—an implicit admission that the old incentive playbook has become a political liability (and, in some places, an electoral one).</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="1456" height="172" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.jpeg" alt="AI scale-up to deployable compute" class="wp-image-18189" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-300x35.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-768x91.jpeg 768w" sizes="(max-width: 1456px) 100vw, 1456px" /></figure>



<h2 class="wp-block-heading">Water wars and the constant hum</h2>



<p>High-density AI compute generates immense heat, requiring cooling systems that can consume millions of gallons of water daily. In desert municipalities like Chandler and Tucson, Arizona, this creates direct competition with agricultural irrigation and residential drinking supplies. Proposed facilities may withdraw hundreds of millions of gallons annually from stressed aquifers or municipal systems, raising fears that industrial users will deplete wells serving farms and homes. Data center developers frequently respond with technical solutions like dry cooling and closed-loop designs. However, communities have learned the trade-off: Dry cooling shifts the burden to electricity, and closed-loop systems still lose water to the atmosphere and require constant refills. The practical outcome is that cooling architecture is now a first-order constraint. In Tucson, a project known locally as “Project Blue” faced enough pushback over water rights that the developer had to revisit the cooling approach midstream.</p>



<p>Beyond resource consumption, these facilities create a significant noise problem. Industrial-scale cooling fans and backup diesel generators create a “constant hum” that represents daily intrusion into previously quiet neighborhoods. In Florida, residents near a proposed facility serving 2,500 families and an elementary school cite sleep disruption and health risks as primary objections, elevating the issue from nuisance to harm. The noise also hits farms hard. In Wisconsin, residents reported that the low-frequency hum makes livestock, particularly horses, nervous and skittish. This disrupts farm life in a way that standard commercial development just doesn&#8217;t. This is why municipalities are tightening requirements: acoustic modeling, enforceable decibel limits at property lines, substantial setbacks (sometimes on the order of 200 feet), and <a href="https://en.wikipedia.org/wiki/Berm" target="_blank" rel="noreferrer noopener">berms</a> that are no longer “nice-to-have” concessions but baseline conditions for approval.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="1456" height="654" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.jpeg" alt="The $3 trillion question" class="wp-image-18190" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-300x135.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-768x345.jpeg 768w" sizes="(max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption">(<a href="https://gradientflow.com/wp-content/uploads/2026/02/AI-Data-Center-Bubble.jpeg" target="_blank" rel="noreferrer noopener"><strong>enlarge</strong></a>)</figcaption></figure>



<h2 class="wp-block-heading">The jobs myth meets the balance sheet</h2>



<p>Communities are questioning whether the small number of jobs created is worth the local impact. Developers highlight billion-dollar capital investments and construction employment spikes, but residents focus on steady-state reality: AI data centers employ far fewer permanent workers per square foot than manufacturing facilities of comparable scale. Chandler, Arizona, officials noted that existing facilities employ fewer than 100 people despite massive physical footprints. Wisconsin residents contrast promised “innovation campuses” with operational facilities requiring only dozens to low hundreds of permanent staff—mostly specialized technicians—making the “job creation” pitch ring hollow. When a data center replaces farmland or light manufacturing, communities weigh not just direct employment but opportunity cost: lost agricultural jobs, foregone retail development, and mixed-use projects that might generate broader economic activity.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Opposition scales faster than infrastructure: One local win becomes a national template for blocking the next project.</p>
</blockquote>



<p>The secretive way these deals are made is often what fuels the most anger. A recurring pattern is what some call the “sleeping giant” dynamic: Residents learn late that officials and developers have been negotiating for months, often under NDAs, sometimes through shell entities and codenames. In Wisconsin, Microsoft’s “Project Nova” became a symbol of this approach; in Minnesota’s Hermantown, a year of undisclosed discussions triggered similar backlash. In Florida, opponents were furious when a major project was tucked into a <a href="https://www.boardeffect.com/blog/what-is-a-consent-agenda-for-a-board-meeting/" target="_blank" rel="noreferrer noopener">consent agenda</a>. Since these agendas are meant for routine business, it felt like a deliberate attempt to bypass public debate. Trust vanishes when people believe advisors have a conflict of interest, like a consultant who seems to be helping both the municipality and the developer. After that happens, technical claims are treated as nothing more than a sales pitch. You won&#8217;t get people back on board until you provide neutral analysis and commitments that can actually be enforced.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="1020" height="695" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.jpeg" alt="Data center in the community" class="wp-image-18191" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.jpeg 1020w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-300x204.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-768x523.jpeg 768w" sizes="(max-width: 1020px) 100vw, 1020px" /></figure>



<h2 class="wp-block-heading">From zoning fight to national constraint</h2>



<p>What started as isolated neighborhood friction has professionalized into a coordinated national movement. Opposition <strong>groups now share legal playbooks and technical templates across state lines</strong>, allowing residents in “frontier” states like South Dakota or Michigan to mobilize with the sophistication of seasoned activists. The financial stakes are real: Between April and June 2025 alone, approximately $98 billion in proposed projects were blocked or delayed, according to <a href="https://www.datacenterwatch.org/?utm_source=gradientflow&amp;utm_medium=newsletter" target="_blank" rel="noreferrer noopener">Data Center Watch</a>. This is no longer just a zoning headache. It’s a political landmine. In Arizona and Georgia, bipartisan coalitions have already ousted officials over data center approvals, signaling to local boards that greenlighting a hyperscale facility without deep community buy-in can be a career-ending move.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The US has the chips, but China has centralized command over power and infrastructure.</p>
</blockquote>



<p>The opposition is also finding an unlikely ally in the energy markets. While the industry narrative is one of &#8220;limitless demand,&#8221; the actual market prices for long-term power and natural gas aren&#8217;t spiking; they&#8217;re staying remarkably flat. There is a massive <a href="https://www.youtube.com/watch?v=i__iaPepixk" target="_blank" rel="noreferrer noopener">disconnect between the hype and the math</a>. Utilities are currently racing to build nearly double the <a href="https://www.youtube.com/watch?v=i__iaPepixk" target="_blank" rel="noreferrer noopener">capacity that even the most optimistic analysts</a> project for 2030. This suggests we may be overbuilding &#8220;ghost infrastructure.&#8221; We are asking local communities to sacrifice their land and grid stability for a gold rush that the markets themselves don&#8217;t fully believe in.</p>



<p>This “data center rebellion” creates a strategic bottleneck that no amount of venture capital can easily bypass. While the US maintains a clear lead in high-end chips, we are hitting a wall on how we manage the mundane essentials like electricity and water. In the geopolitical race, the US has the chips, but China has the centralized command over infrastructure. Our democratic model requires transparency and public buy-in to function. If US companies keep relying on secret deals to push through expensive, overbuilt infrastructure, they risk a total collapse of community trust.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Radar Trends to Watch: March 2026</title>
		<link>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/</link>
				<comments>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/#respond</comments>
				<pubDate>Tue, 03 Mar 2026 12:07:40 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[Radar Trends]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18173</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-10.png" 
				medium="image" 
				type="image/png" 
				width="1400" 
				height="950" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-10-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Developments in operations, things, web, and more]]></custom:subtitle>
		
				<description><![CDATA[The explosion of interest in OpenClaw was one of the last items added to the February 1 trends. In February, things went crazy. We saw a social network for agents (no humans allowed, though they undoubtedly sneak on); a multiplayer online game for agents (again, no humans); many clones of OpenClaw, most of which attempt [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>The explosion of interest in OpenClaw was one of the last items added to the February 1 trends. In February, things went crazy. We saw a social network for agents (no humans allowed, though they undoubtedly sneak on); a multiplayer online game for agents (again, no humans); many clones of OpenClaw, most of which attempt to mitigate its many security problems; and much more. Andrej Karpathy has said that OpenClaw is the next layer on top of AI agents. If the security issues can be resolved (which is a good question), he’s probably right.</p>



<h2 class="wp-block-heading">AI</h2>



<ul class="wp-block-list">
<li><a href="https://note-taker.moonshine.ai/" target="_blank" rel="noreferrer noopener">Moonshine Note Taker</a> is a free and open source voice transcription application for taking notes. It runs locally: The model runs on your hardware and no data is ever sent to a server.</li>



<li>Nano Banana’s image generation was breathtakingly good. Google has now <a href="https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/" target="_blank" rel="noreferrer noopener">released</a> Nano Banana 2, a.k.a. Gemini 3.1 Flash Image, which promises Nano Banana image quality at speed.</li>



<li>Claude <a href="https://code.claude.com/docs/en/remote-control" target="_blank" rel="noreferrer noopener">Remote Control</a> allows you to continue a desktop Claude Code session from any device.</li>



<li>Putting OpenClaw into a sandbox <a href="https://tachyon.so/blog/sandboxes-wont-save-you" target="_blank" rel="noreferrer noopener">isn’t enough</a>. Keeping AI agents from accidentally (or intentionally) doing damage is a permissions problem.</li>



<li>Alibaba has <a href="https://huggingface.co/collections/Qwen/qwen35" target="_blank" rel="noreferrer noopener">released</a> a fleet of mid-size Qwen 3.5 models. Their theme is providing more intelligence with fewer compute cycles—something we all need to appreciate.&nbsp;</li>



<li>Important advice for agentic engineering: <a href="https://simonwillison.net/guides/agentic-engineering-patterns/first-run-the-tests/" target="_blank" rel="noreferrer noopener">Always start by running the tests</a>.</li>



<li>Google has <a href="https://blog.google/innovation-and-ai/products/gemini-app/lyria-3/" target="_blank" rel="noreferrer noopener">released</a> Lyria 3, a model that generates 30-second musical clips from a verbal description. You can experiment with it through Gemini.</li>



<li>There’s a new protocol in the agentic stack. Twilio has <a href="https://thenewstack.io/twilio-a2h-protocol-launch/" target="_blank" rel="noreferrer noopener">released</a> the <a href="https://www.twilio.com/en-us/blog/products/introducing-a2h-agent-to-human-communication-protocol" target="_blank" rel="noreferrer noopener">Agent-2-Human</a> (A2H) protocol, which facilitates handoffs between agents and humans as they collaborate.</li>



<li>Yet more and more model releases: <a href="https://www.anthropic.com/news/claude-sonnet-4-6" target="_blank" rel="noreferrer noopener">Claude Sonnet 4.6</a>, followed quickly by <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/" target="_blank" rel="noreferrer noopener">Gemini 3.1 Pro</a>. If you care, Gemini 3.1 Pro currently tops the abstract reasoning benchmarks.</li>



<li><a href="https://www.kimi.com/bot" target="_blank" rel="noreferrer noopener">Kimi Claw</a> is yet another variation on OpenClaw. Kimi Claw uses Moonshot AI’s most advanced model, Kimi K2.5 Thinking, and offers one-click setup in Moonshot’s cloud.</li>



<li><a href="https://nanoclaw.dev/" target="_blank" rel="noreferrer noopener">NanoClaw</a> is another OpenClaw-like AI-based personal assistant that claims to be more security conscious. It runs agents in sandboxed Linux containers with restricted access to outside resources, which limits the damage they can do.</li>



<li>OpenAI has <a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/" target="_blank" rel="noreferrer noopener">released</a> a research preview of GPT-5.3-Codex-Spark, an extremely fast coding model that runs on <a href="https://www.cerebras.ai/chip" target="_blank" rel="noreferrer noopener">Cerebras hardware</a>. The company claims that it’s possible to collaborate with Codex in “real time” because it gives “near-instant” results.</li>



<li>RAG may not be the newest idea in the AI world, but text-based RAG is the basis for many enterprise applications of AI. Yet most enterprise data includes graphs, images, and even text in formats like PDF. Is this the year for <a href="https://thenewstack.io/multimodal-rag-hybrid-search/" target="_blank" rel="noreferrer noopener">multimodal RAG</a>?</li>



<li><a href="http://z.ai" target="_blank" rel="noreferrer noopener">Z.ai</a> has released its latest model, <a href="https://z.ai/blog/glm-5" target="_blank" rel="noreferrer noopener">GLM-5</a>. GLM-5 is an open source “Opus-class” model. It’s significantly smaller than Opus and other high-end models, though still huge; the mixture-of-experts model has 744B parameters, with 40B active.</li>



<li>Waymo has created a <a href="https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation" target="_blank" rel="noreferrer noopener">World Model</a> to model driving behavior. It’s capable of building lifelike simulations of traffic patterns and behavior, based on video collected from Waymo’s vehicles.</li>



<li><a href="https://alexzhang13.github.io/blog/2025/rlm/" target="_blank" rel="noreferrer noopener">Recursive language models</a> (RLMs) solve the problem of context rot, which happens when output from AI degrades as the size of the context increases. Drew Breunig has an excellent <a href="https://www.dbreunig.com/2026/02/09/the-potential-of-rlms.html" target="_blank" rel="noreferrer noopener">explanation</a>.</li>



<li>You’ve heard of Moltbook—and perhaps your AI agent participates. Now there’s <a href="https://arstechnica.com/ai/2026/02/after-moltbook-ai-agents-can-now-hang-out-in-their-own-space-faring-mmo/" target="_blank" rel="noreferrer noopener">SpaceMolt</a>—a massive multiplayer online game that’s exclusively for agents.&nbsp;</li>



<li>Anthropic and OpenAI simultaneously released <a href="https://www.anthropic.com/news/claude-opus-4-6" target="_blank" rel="noreferrer noopener">Claude Opus 4.6</a> and <a href="https://openai.com/index/introducing-gpt-5-3-codex/" target="_blank" rel="noreferrer noopener">GPT-5.3-Codex</a>, both of which offer improved models for AI-assisted programming. Is this “open warfare,” as <a href="https://news.smol.ai/issues/26-02-05-claude-opus-openai-codex/" target="_blank" rel="noreferrer noopener"><em>AINews</em></a> claims? You mean it hasn’t been open warfare prior to now?</li>



<li>If you’re excited by OpenClaw, you might try <a href="https://github.com/HKUDS/nanobot" target="_blank" rel="noreferrer noopener">NanoBot</a>. It has 1% of OpenClaw’s code, written so that it’s easy to understand and maintain. No promises about security—with all of these personal AI assistants, be careful!</li>



<li>OpenAI has <a href="https://arstechnica.com/ai/2026/02/openai-picks-up-pace-against-claude-code-with-new-codex-desktop-app/" target="_blank" rel="noreferrer noopener">launched</a> a desktop app for macOS along the lines of Claude Code. It’s something that’s been missing from their lineup. Among other things, it’s intended to help programmers work with multiple agents simultaneously.</li>



<li>Pete Warden has put together an interactive guide to speech embeddings for engineers, and published it as a Colab <a href="https://colab.research.google.com/drive/1pUy9tp145qlWni2CIuBvQUNdokiB6rx6?usp=sharing" target="_blank" rel="noreferrer noopener">notebook</a>.</li>



<li><a href="https://tailscale.com/blog/aperture-private-alpha" target="_blank" rel="noreferrer noopener">Aperture</a> is a new tool from Tailscale for “providing visibility into coding agent usage,” allowing organizations to understand how AI is being used and adopted. It’s currently in private beta.</li>



<li>OpenAI <a href="https://openai.com/index/introducing-prism/" target="_blank" rel="noreferrer noopener">Prism</a> is a free workspace for scientists to collaborate on research. Its goal is to help scientists build a new generation of AI-based tooling. Prism is built on ChatGPT 5.2 and is open to anyone with a personal ChatGPT account.</li>
</ul>



<h2 class="wp-block-heading">Programming</h2>



<ul class="wp-block-list">
<li>Anthropic is <a href="https://claude.com/contact-sales/claude-for-oss" target="_blank" rel="noreferrer noopener">offering</a> six months of Claude Max 20x free to open source maintainers.</li>



<li><a href="https://pi.dev/" target="_blank" rel="noreferrer noopener">Pi</a> is a very simple but extensible coding agent that runs in your terminal.</li>



<li>Researchers at Anthropic have <a href="https://www.anthropic.com/engineering/building-c-compiler" target="_blank" rel="noreferrer noopener">vibe-coded a C compiler</a> using a fleet of Claude agents. The experiment cost roughly $20,000 worth of tokens, and produced 100,000 lines of Rust. They are careful to say that the compiler is far from production quality—but it works. The experiment is a <em>tour de force</em> demonstration of running agents in parallel.&nbsp;</li>



<li>I never knew that macOS had a <a href="https://igorstechnoclub.com/sandbox-exec/" target="_blank" rel="noreferrer noopener">sandboxing tool</a>. It looks useful. (It’s also deprecated, but looks much easier to use than the alternatives.)</li>



<li><a href="https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/" target="_blank" rel="noreferrer noopener">GitHub now allows</a> pull requests to be turned off completely, or to be limited to collaborators. They’re doing this to allow software maintainers to eliminate AI-generated pull requests, which are overwhelming many developers.</li>



<li>After an open source maintainer rejected a pull request generated by an AI agent, the agent published a blog post attacking the maintainer. The maintainer responded with an excellent <a href="https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/" target="_blank" rel="noreferrer noopener">analysis</a>, asking whether threats and intimidation are the future of AI.</li>



<li>As Simon Willison has <a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/" target="_blank" rel="noreferrer noopener">written</a>, the purpose of programming isn’t to write code but to deliver code that works. He’s created two tools, <a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/" target="_blank" rel="noreferrer noopener">Showboat and Rodney</a>, that help AI agents demo their software so that the human authors can verify that the software works.&nbsp;</li>



<li>Anil Dash asks whether <a href="https://www.anildash.com/2026/01/22/codeless/" target="_blank" rel="noreferrer noopener">codeless</a> <a href="https://www.anildash.com/2026/01/27/codeless-ecosystem/" target="_blank" rel="noreferrer noopener">programming</a>, using tools like <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">Gas Town</a>, is the future.</li>
</ul>



<h2 class="wp-block-heading">Security</h2>



<ul class="wp-block-list">
<li>There is now an app that <a href="https://tech.lgbt/@yjeanrenaud/116122129025921096" target="_blank" rel="noreferrer noopener">alerts</a> you when someone in the vicinity has smart glasses.</li>



<li><a href="https://www.agentsh.org/" target="_blank" rel="noreferrer noopener">Agentsh</a> provides <a href="https://www.agentsh.org/" target="_blank" rel="noreferrer noopener">execution layer security</a> by enforcing policies that prevent agents from doing damage. As far as agents are concerned, it’s a replacement for bash.</li>



<li>There’s a new kind of cyberattack: <a href="https://techxplore.com/news/2026-02-cyber-disrupt-smart-factories.html" target="_blank" rel="noreferrer noopener">attacks against time itself</a>. More specifically, this means attacks against clocks and protocols for time synchronization. These can be devastating in factory settings.</li>



<li>“<a href="https://aisle.com/blog/what-ai-security-research-looks-like-when-it-works" target="_blank" rel="noreferrer noopener">What AI Security Research Looks Like When It Works</a>” is an excellent overview of the impact of AI on discovering vulnerabilities. AI generates a lot of security slop, but it also finds critical vulnerabilities that would have been opaque to humans, including 12 in OpenSSL.</li>



<li>Gamifying prompt injection—well, that’s new. <a href="https://hackmyclaw.com/" target="_blank" rel="noreferrer noopener">HackMyClaw</a> is a game (?) in which participants send email to Flu, an OpenClaw instance. The goal is to force Flu to reply with secrets.env, a file of “confidential” data. There is a prize for the first to succeed.</li>



<li>It was only a matter of time: There’s now a cybercriminal who is <a href="https://www.bleepingcomputer.com/news/security/infostealer-malware-found-stealing-openclaw-secrets-for-first-time/" target="_blank" rel="noreferrer noopener">actively stealing secrets</a> from OpenClaw users.&nbsp;</li>



<li><a href="https://deno.com/" target="_blank" rel="noreferrer noopener">Deno’s secure sandbox</a> might provide a way to <a href="https://thenewstack.io/deno-sandbox-security-secrets/" target="_blank" rel="noreferrer noopener">run OpenClaw safely</a>.&nbsp;</li>



<li><a href="https://github.com/nearai/ironclaw" target="_blank" rel="noreferrer noopener">IronClaw</a> is a personal AI assistant modeled after <a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener">OpenClaw</a> that promises better security. It always runs in a sandbox, never exposes credentials, has some defenses against prompt injection, and only makes requests to approved hosts.</li>



<li>A fake recruiting campaign is <a href="https://www.bleepingcomputer.com/news/security/fake-job-recruiters-hide-malware-in-developer-coding-challenges/" target="_blank" rel="noreferrer noopener">hiding malware</a> in programming challenges that candidates must complete in order to apply. Completing the challenge requires installing malicious dependencies that are hosted on legitimate repositories like npm and PyPI.</li>



<li>Google’s Threat Intelligence Group has <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/" target="_blank" rel="noreferrer noopener">released</a> its quarterly analysis of adversarial AI use. Their analysis includes distillation, or collecting the output of a frontier AI to train another AI.</li>



<li>Google has <a href="https://arstechnica.com/gadgets/2026/02/upgraded-google-safety-tools-can-now-find-and-remove-more-of-your-personal-info/" target="_blank" rel="noreferrer noopener">upgraded</a> its tools for removing personal information and images, including nonconsensual explicit images, from its search results.&nbsp;</li>



<li><a href="https://www.bleepingcomputer.com/news/security/new-tool-blocks-imposter-attacks-disguised-as-safe-commands/" target="_blank" rel="noreferrer noopener">Tirith</a> is a new tool that hooks into the shell to block malicious commands before they run. These attacks often arrive as copy-and-paste commands that use curl to pipe an archive into bash; it’s easy for a bad actor to create a malicious URL that is indistinguishable from a legitimate one.</li>



<li>Claude Opus 4.6 has been used to discover <a href="https://red.anthropic.com/2026/zero-days/" target="_blank" rel="noreferrer noopener">500 0-day vulnerabilities</a> in open source code. While many open source maintainers have complained about AI slop, and that abuse isn’t likely to stop, AI is also becoming a valuable tool for security work.</li>



<li><a href="https://www.koi.ai/blog/maliciouscorgi-the-cute-looking-ai-extensions-leaking-code-from-1-5-million-developers" target="_blank" rel="noreferrer noopener">Two coding assistants for VS Code</a> are malware that send copies of all the code to China. Unlike lots of malware, they do their job as coding assistants well, making it less likely that victims will notice that something is wrong.&nbsp;</li>



<li><a href="https://www.bleepingcomputer.com/news/security/hackers-hijack-exposed-llm-endpoints-in-bizarre-bazaar-operation/" target="_blank" rel="noreferrer noopener">Bizarre Bazaar</a> is the name for a wave of attacks against LLM APIs, including self-hosted LLMs. The attacks attempt to steal resources from LLM infrastructure, for purposes including cryptocurrency mining, data theft, and reselling LLM access.&nbsp;</li>



<li>The business model for ransomware has changed. <a href="https://www.bleepingcomputer.com/news/security/from-cipher-to-fear-the-psychology-behind-modern-ransomware-extortion/" target="_blank" rel="noreferrer noopener">Ransomware is no longer about encrypting your data</a>; it’s about using stolen data for extortion. Small and mid-size businesses are common targets.&nbsp;</li>
</ul>



<h2 class="wp-block-heading">Web</h2>



<ul class="wp-block-list">
<li>Cloudflare has a service called <a href="https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/" target="_blank" rel="noreferrer noopener">Markdown for Agents</a> that <a href="https://thenewstack.io/cloudflares-markdown-for-agents-automatically-make-websites-more-aifriendly/" target="_blank" rel="noreferrer noopener">converts</a> websites from HTML to Markdown when an agent accesses them. Conversion makes the pages friendlier to AI and significantly reduces the number of tokens needed to process them.</li>



<li>WebMCP is a <a href="https://webmachinelearning.github.io/webmcp/" target="_blank" rel="noreferrer noopener">proposed API standard</a> that allows web applications to become MCP servers. It’s currently available in <a href="https://developer.chrome.com/blog/webmcp-epp" target="_blank" rel="noreferrer noopener">early preview</a> in Chrome.</li>



<li>Users of <a href="https://www.firefox.com/en-US/firefox/148.0/releasenotes/" target="_blank" rel="noreferrer noopener">Firefox 148</a> (which should be out by the time you read this) will be able to <a href="https://blog.mozilla.org/en/firefox/ai-controls/" target="_blank" rel="noreferrer noopener">opt out</a> of all AI features.</li>
</ul>



<h2 class="wp-block-heading">Operations</h2>



<ul class="wp-block-list">
<li>Wireshark is a powerful—and complex—packet capture tool. <a href="https://github.com/vignesh07/babyshark" target="_blank" rel="noreferrer noopener">Babyshark</a> is a text-based frontend for Wireshark that provides an amazing amount of information through a much simpler interface.</li>



<li>Microsoft is experimenting with <a href="https://techxplore.com/news/2026-02-laser-etched-glass-years-microsoft.html" target="_blank" rel="noreferrer noopener">using lasers to etch data in glass</a> as a form of long-term data storage.</li>
</ul>



<h2 class="wp-block-heading">Things</h2>



<ul class="wp-block-list">
<li>You need a <a href="https://www.hackster.io/news/your-office-needs-a-desk-robot-fec0211f56ef" target="_blank" rel="noreferrer noopener">desk robot</a>. Why? Because it’s there. And fun.</li>



<li>Do you want to play <em>Doom</em> on a Lego brick? <a href="https://hackaday.com/2023/03/18/doom-ported-to-a-single-lego-brick/" target="_blank" rel="noreferrer noopener">You can</a>.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Why Capacity Planning Is Back</title>
		<link>https://www.oreilly.com/radar/why-capacity-planning-is-back/</link>
				<comments>https://www.oreilly.com/radar/why-capacity-planning-is-back/#respond</comments>
				<pubDate>Mon, 02 Mar 2026 13:19:19 +0000</pubDate>
					<dc:creator><![CDATA[Syed Danish Ali]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18150</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Capacity-planning.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Capacity-planning-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[In a previous article, we outlined why GPUs have become the architectural control point for enterprise AI. When accelerator capacity becomes the governing constraint, the cloud’s most comforting assumption—that you can scale on demand without thinking too far ahead—stops being true. That shift has an immediate operational consequence: Capacity planning is back. Not the old [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In a previous article, we outlined why <a href="https://www.oreilly.com/radar/gpus-enterprise-ais-new-architectural-control-point/" target="_blank" rel="noreferrer noopener">GPUs have become the architectural control point for enterprise AI</a>. When accelerator capacity becomes the governing constraint, the cloud’s most comforting assumption—that you can scale on demand without thinking too far ahead—stops being true.</p>



<p>That shift has an immediate operational consequence: Capacity planning is back. Not the old “guess next year’s VM count” exercise but a new form of planning where model choices, inference depth, and workload timing directly determine whether you can meet latency, cost, and reliability targets.</p>



<p>In an AI-shaped infrastructure world, you don’t “scale” as much as you “get capacity.” Autoscaling helps at the margins, but it can’t create GPUs. Power, cooling, and accelerator supply set the limits.</p>



<h2 class="wp-block-heading">The return of capacity planning</h2>



<p>For a decade, cloud adoption trained organizations out of multiyear planning. CPU and storage scaled smoothly, and most stateless services behaved predictably under horizontal scaling. Teams could treat infrastructure as an elastic substrate and focus on software iteration.</p>



<p>AI production systems do not behave that way. They are dominated by accelerators and constrained by physical limits, and that makes capacity a first-order design dependency rather than a procurement detail. If you cannot secure the right accelerator capacity at the right time, your architecture decisions are irrelevant—because the system simply cannot run at the required throughput and latency.</p>



<p>Planning is returning because AI forces forecasting along four dimensions that product teams cannot ignore:</p>



<ul class="wp-block-list">
<li><strong>Model growth:</strong> Model count, version churn, and specialization increase accelerator demand even when user traffic is flat.</li>



<li><strong>Data growth:</strong> Retrieval depth, vector store size, and freshness requirements increase the amount of inference work per request.</li>



<li><strong>Inference depth:</strong> Multistage pipelines (retrieve, rerank, tool calls, verification, synthesis) multiply GPU time nonlinearly.</li>



<li><strong>Peak workloads:</strong> Enterprise usage patterns and batch jobs collide with real-time inference, creating predictable contention windows.</li>
</ul>



<p>This is not merely “IT planning.” It is strategic planning, because these factors push organizations back toward multiyear thinking: Procurement lead times, reserved capacity, workload placement decisions, and platform-level policies all start to matter again.</p>



<p>This is increasingly visible operationally: <a href="https://www.theregister.com/2025/08/03/capacity_planning_concern_datacenter_ops" target="_blank" rel="noreferrer noopener"><strong>Capacity planning is a rising concern for data center operators</strong></a>, as <em>The Register</em> reports.</p>



<h2 class="wp-block-heading"><strong>The cloud’s old promise is breaking</strong></h2>



<p>Cloud computing scaled on the premise that capacity could be treated as elastic and interchangeable. Most workloads ran on general-purpose hardware, and when demand rose, the platform could absorb it by spreading load across abundant, standardized resources.</p>



<p>AI workloads violate that premise. Accelerators are scarce, not interchangeable, and tied to power and cooling constraints that do not scale linearly. In other words, the cloud stops behaving like an infinite pool—and starts behaving like an allocation system.</p>



<p>First, the critical path in production AI systems is increasingly accelerator bound. Second, “a request” is no longer a single call. It is an inference pipeline with multiple dependent stages. Third, those stages tend to be sensitive to hardware availability, scheduling contention, and performance variance that cannot be eliminated by simply adding more generic compute.</p>



<p>This is where the elasticity model starts to fail as a default expectation. In AI systems, elasticity becomes conditional. It depends on capacity access, infrastructure topology, and a willingness to pay for assurance.</p>



<h2 class="wp-block-heading"><strong>AI changes the physics of cloud infrastructure</strong></h2>



<p>In modern enterprise AI, the binding constraints are no longer abstract. They are physical.</p>



<p>Accelerators introduce a different scaling regime than CPU-centric enterprise computing. Provisioning is not always immediate. Supply is not always abundant. And the infrastructure required to deploy dense compute has facility-level limits that software cannot bypass.</p>



<p>Power and cooling move from background concerns to first-order constraints. Rack density becomes a planning variable. Deployment feasibility is shaped by what a data center can deliver, not only by what a platform can schedule.</p>



<p>AI-driven density makes power and cooling the gating factors—<a href="https://www.datacenterdynamics.com/en/marketwatch/the-path-to-power/" target="_blank" rel="noreferrer noopener">as Data Center Dynamics explains in its &#8220;Path to Power&#8221;</a> overview.</p>



<p>This is why “just scale out” no longer behaves like a universal architectural safety net. Scaling is still possible, but it is increasingly constrained by physical reality. In AI-heavy environments, capacity is something you secure, not something you assume.</p>



<h2 class="wp-block-heading"><strong>From elasticity to allocation</strong></h2>



<p>As AI becomes operationally critical, cloud capacity begins to behave less like a utility and more like an allocation system.</p>



<p>Organizations respond by shifting from on-demand assumptions to capacity controls. They introduce quotas to prevent runaway consumption, reservations to ensure availability, and explicit prioritization to protect production workflows from contention. These mechanisms are not optional governance overhead. They are structural responses to scarcity.</p>



<p>In practice, accelerator capacity behaves more like a supply chain than a cloud service. Availability is influenced by lead time, competition, and contractual positioning. The implication is subtle but decisive: Enterprise AI platforms begin to look less like “infinite pools” and more like managed inventories.</p>



<p>This changes cloud economics and vendor relationships. Pricing is no longer only about utilization. It becomes about assurance. The questions that matter are not just “How much did we use?” but “Can we obtain capacity when it matters?” and “What reliability guarantees do we have under peak demand?”</p>



<h2 class="wp-block-heading"><strong>When elasticity stops being a default</strong></h2>



<p>Consider a platform team that deploys an internal AI assistant for operational support. In the pilot phase, demand is modest and the system behaves like a conventional cloud service. Inference runs on on-demand accelerators, latency is stable, and the team assumes capacity will remain a provisioning detail rather than an architectural constraint.</p>



<p>Then the system moves into production. The assistant is upgraded to use retrieval for policy lookups, reranking for relevance, and an additional validation pass before responses are returned. None of these changes appear dramatic in isolation. Each improves quality, and each looks like an incremental feature.</p>



<p>But the request path is no longer a single model call. It becomes a pipeline. Every user request now triggers multiple GPU-backed operations: embedding generation, retrieval-side processing, reranking, inference, and validation. GPU work per request rises, and the variance increases. The system still works—until it meets real peak behavior.</p>



<p>The first failure is not a clean outage. It is contention. Latency becomes unpredictable as jobs queue behind each other. The “long tail” grows. Teams begin to see priority inversion: Low-value exploratory usage competes with production workflows because the capacity pool is shared and the scheduler cannot infer business criticality.</p>



<p>The platform team responds the only way it can. It introduces allocation. Quotas are placed on exploratory traffic. Reservations are used for the operational assistant. Priority tiers are defined so production paths cannot be displaced by batch jobs or ad hoc experimentation.</p>



<p>Then the second realization arrives. Allocation alone is insufficient unless the system can degrade gracefully. Under pressure, the assistant must be able to narrow retrieval breadth, reduce reasoning depth, route deterministic checks to smaller models, or temporarily disable secondary passes. Otherwise, peak demand simply converts into queue collapse.</p>



<p>At that point, capacity planning stops being an infrastructure exercise. It becomes an architectural requirement. Product decisions directly determine GPU operations per request, and those operations determine whether the system can meet its service levels under constrained capacity.</p>



<h2 class="wp-block-heading"><strong>How this changes architecture</strong></h2>



<p>When capacity becomes constrained, architecture changes—even if the product goal stays the same.</p>



<p>Pipeline depth becomes a capacity decision. In AI systems, throughput is not just a function of traffic volume. It is a function of how many GPU-backed operations each request triggers end to end. This amplification factor often explains why systems behave well in prototypes but degrade under sustained load.</p>
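<p>The amplification factor can be made concrete with a back-of-the-envelope model. The following is only a sketch: The stage names, per-call GPU times, request rate, and utilization derate are illustrative assumptions, not measurements from any real system.</p>

```python
# Sketch: estimate GPU-seconds per request for a multistage inference pipeline.
# Stage names and per-call GPU times are illustrative assumptions.

PIPELINE = [
    # (stage, gpu_seconds_per_call, calls_per_request)
    ("embedding", 0.02, 1),
    ("retrieval_rerank", 0.15, 1),
    ("main_inference", 0.90, 1),
    ("validation_pass", 0.40, 1),
]

def gpu_seconds_per_request(pipeline):
    """Total accelerator time one user request consumes end to end."""
    return sum(seconds * calls for _, seconds, calls in pipeline)

def peak_gpus_needed(pipeline, peak_rps, utilization=0.7):
    """GPUs needed at peak load, derated for realistic utilization."""
    return gpu_seconds_per_request(pipeline) * peak_rps / utilization
```

<p>The point of the exercise is that adding one "incremental" stage (say, the validation pass) raises every request’s GPU cost, so capacity demand grows even when traffic is flat—exactly the amplification the prototype never revealed.</p>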



<p>Batching becomes an architectural tool, not an optimization detail. It can improve utilization and cost efficiency, but it introduces scheduling complexity and latency trade-offs. In practice, teams must decide where batching is acceptable and where low-latency “fast paths” must remain unbatched to protect user experience.</p>



<p>Model choice becomes a production constraint. As capacity pressure increases, many organizations discover that smaller, more predictable models often win for operational workflows. This does not mean large models are unimportant. It means their use becomes selective. Hybrid strategies emerge: Smaller models handle deterministic or governed tasks, while larger models are reserved for exceptional or exploratory scenarios where their overhead is justified.</p>



<p>In short, architecture becomes constrained by power and hardware, not only by code. The core shift is that capacity constraints shape system behavior. They also shape governance outcomes, because predictability and auditability degrade when capacity contention becomes chronic.</p>



<h2 class="wp-block-heading"><strong>What cloud and platform teams must do differently</strong></h2>



<p>From an enterprise IT perspective, this shows up as a readiness problem: Can infrastructure and operations absorb AI workloads without destabilizing production systems? Answering that requires treating accelerator capacity as a governed resource—metered, budgeted, and allocated deliberately.</p>



<p><strong>Meter and budget accelerator capacity</strong></p>



<ul class="wp-block-list">
<li>Define consumption in business-relevant units (e.g., GPU-seconds per request and peak concurrency ceilings) and expose it as a platform metric.</li>



<li>Turn those metrics into explicit capacity budgets by service and workload class—so growth is a planning decision, not an outage.</li>
</ul>



<p><strong>Make allocation first class</strong></p>



<ul class="wp-block-list">
<li>Implement admission control and priority tiers aligned to business criticality; do not rely on best-effort fairness under contention.</li>



<li>Make allocation predictable and early (quotas/reservations) instead of informal and late (brownouts and surprise throttling).</li>
</ul>
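<p>An admission-control policy along these lines can be sketched in a few lines. The tier names, slot counts, and borrowing rule below are illustrative assumptions about one possible policy, not a reference implementation.</p>

```python
# Sketch: admission control with priority tiers over a fixed GPU concurrency
# pool. Tier names and capacity figures are illustrative assumptions.

TOTAL_SLOTS = 100
# Reserved floor per tier: production can always claim its reservation;
# exploratory work only ever gets its own share.
RESERVATIONS = {"production": 60, "batch": 25, "exploratory": 15}

def admit(in_flight: dict, tier: str) -> bool:
    """Admit a request if its tier is under its reservation, or if slack
    exists in other tiers' unused reservations (exploratory never borrows)."""
    if sum(in_flight.values()) >= TOTAL_SLOTS:
        return False
    if in_flight.get(tier, 0) < RESERVATIONS[tier]:
        return True  # within the tier's guaranteed share
    # Over reservation: borrow only slack that other tiers aren't using.
    slack = sum(max(0, RESERVATIONS[t] - in_flight.get(t, 0))
                for t in RESERVATIONS if t != tier)
    return slack > 0 and tier != "exploratory"
```

<p>The design choice that matters is that the policy is explicit and evaluated at admission time, rather than emerging implicitly from queue behavior under contention.</p>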



<p><strong>Build graceful degradation into the request path</strong></p>



<ul class="wp-block-list">
<li>Predefine a degradation ladder (e.g., reduce retrieval breadth or route to a smaller model) that preserves bounded cost and latency.</li>



<li>Ensure degradations are explicit and measurable, so systems behave deterministically under capacity pressure.</li>
</ul>
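<p>A degradation ladder can be represented as data rather than scattered if-statements, which is what makes it explicit and measurable. The rung names and pressure thresholds here are illustrative assumptions:</p>

```python
# Sketch: an explicit degradation ladder. Under rising capacity pressure,
# each rung trades quality for bounded GPU cost. Rung names and thresholds
# are illustrative assumptions.

DEGRADATION_LADDER = [
    # (pressure threshold in [0, 1], action applied at or above it)
    (0.70, "reduce_retrieval_breadth"),   # fetch fewer documents
    (0.80, "skip_secondary_validation"),  # drop the extra verification pass
    (0.90, "route_to_smaller_model"),     # deterministic checks to small model
    (0.97, "queue_noncritical_traffic"),  # shed everything but production
]

def active_degradations(pressure: float) -> list:
    """Return the rungs in effect at a given capacity pressure, so degraded
    behavior is deterministic and observable (e.g., exported as metrics)."""
    return [action for threshold, action in DEGRADATION_LADDER
            if pressure >= threshold]
```

<p>Because the ladder is ordered data, the same structure can drive both the runtime behavior and the dashboards that show which degradations were active during an incident.</p>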



<p><strong>Separate exploratory from operational AI</strong></p>



<ul class="wp-block-list">
<li>Isolate experimentation from production using distinct quotas/priority classes/reservations, so exploration cannot starve operational workloads.</li>



<li>Treat operational AI as an enforceable service with reliability targets; keep exploration elastic without destabilizing the platform.</li>
</ul>



<p>In an accelerator-bound world, platform success is no longer maximum utilization—it is predictable behavior under constraint.</p>



<h2 class="wp-block-heading"><strong>What this means for the future of the cloud</strong></h2>



<p>AI is not ending the cloud. It is pulling the cloud back toward physical reality.</p>



<p>The likely trajectory is a cloud landscape that becomes more hybrid, more planned, and less elastic by default. Public cloud remains critical, but organizations increasingly seek predictable access to accelerator capacity through reservations, long-term commitments, private clusters, or colocated deployments.</p>



<p>This will reshape pricing, procurement, and platform design. It will also reshape how engineering teams think. In the cloud native era, architecture often assumed capacity was solvable through autoscaling and on-demand provisioning. In the AI era, capacity becomes a defining constraint that shapes what systems can do and how reliably they can do it.</p>



<p>That is why capacity planning is back—not as a return to old habits but as a necessary response to a new infrastructure regime. Organizations that succeed will be the ones that design explicitly around capacity constraints, treat amplification as a first-order metric, and align product ambition with the physical and economic limits of modern AI infrastructure.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong><em>Author’s note</em></strong><em>: This article is based on the author’s personal views and independent technical research and does not reflect the architecture of any specific organization.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/why-capacity-planning-is-back/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>How We Bet Against the Bitter Lesson</title>
		<link>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/</link>
				<comments>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/#respond</comments>
				<pubDate>Mon, 02 Mar 2026 11:48:53 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18146</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Earth-brain.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Earth-brain-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Skills and the future knowledge economy]]></custom:subtitle>
		
				<description><![CDATA[I&#8217;ve been telling myself and anyone who will listen that Agent Skills point toward a new kind of future AI + human knowledge economy. It&#8217;s not just Skills, of course. It&#8217;s also things like Jesse Vincent&#8217;s Superpowers and Anthropic&#8217;s recently introduced Plugins for Claude Cowork. If you haven&#8217;t encountered these yet, keep reading. It should [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I&#8217;ve been telling myself and anyone who will listen that <a href="https://agentskills.io/" target="_blank" rel="noreferrer noopener">Agent Skills</a> point toward a new kind of future AI + human knowledge economy. It&#8217;s not just Skills, of course. It&#8217;s also things like Jesse Vincent&#8217;s <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> and Anthropic&#8217;s recently introduced <a href="https://github.com/anthropics/knowledge-work-plugins" target="_blank" rel="noreferrer noopener">Plugins for Claude Cowork</a>. If you haven&#8217;t encountered these yet, keep reading. It should become clear as we go along.</p>



<p>It feels a bit like I&#8217;m assembling a picture puzzle where all the pieces aren&#8217;t yet on the table. I am starting to see a pattern, but I&#8217;m not sure it&#8217;s right, and I need help finding the missing pieces. Let me explain some of the shapes I have in hand and the pattern they are starting to show me, and then I want to ask for your help filling in the gaps.</p>



<h2 class="wp-block-heading">Programming two different types of computer at the same time</h2>



<p>Phillip Carter wrote a piece a while back called &#8220;<a href="https://www.phillipcarter.dev/posts/llms-computers" target="_blank" rel="noreferrer noopener">LLMs Are Weird Computers</a>&#8221; that landed hard in my mind and wouldn&#8217;t leave. He noted that we&#8217;re now working with two fundamentally different kinds of computer at the same time. One can write a sonnet but struggles to do math. The other does math easily but couldn&#8217;t write a sonnet to save its metaphorical life.</p>



<p>Agent Skills may be the start of an answer to the question of what the interface layer between these two kinds of computation looks like. A Skill is a package of context (Markdown instructions, domain knowledge, and examples) combined with tool calls (deterministic code that does the things LLMs are bad at). The context speaks the language of the probabilistic machine, while the tools speak the language of the deterministic one.</p>



<p>Imagine you&#8217;re an experienced DevOps engineer and you want to give an AI agent the ability to diagnose production incidents the way you would. The context part of that Skill includes your architecture overview, your runbook for common failure modes, the heuristics you&#8217;ve developed over the years, and annotated examples of past incidents. That&#8217;s the part that speaks to the probabilistic machine. The tool part includes actual code that queries your monitoring systems, pulls log entries, checks service health endpoints, and runs diagnostic scripts. Each tool call saves the model from burning tokens on work that deterministic code does better, faster, and more reliably.</p>



<p>The Skill is neither the context nor the tools. It&#8217;s the combination. Expert judgment about when to check the database connection pool married to the ability to actually check it. We&#8217;ve had runbooks before (context without tools). We&#8217;ve had monitoring scripts before (tools without context). What we haven&#8217;t had is a way to package them together for a machine that can read the runbook <em>and</em> execute the scripts, using judgment to decide which script to run next based on what the last one returned.</p>
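<p>The deterministic "tool" half of such a Skill might look like the sketch below. Everything here is hypothetical—the service names, endpoints, and function are invented for illustration; the "context" half would be the Markdown runbook telling the model when to call it and how to interpret the result.</p>

```python
# Sketch of the deterministic "tool" half of a hypothetical incident-diagnosis
# Skill. Service names and endpoints are invented for illustration.
import json
import urllib.request

SERVICES = {  # hypothetical internal health endpoints
    "api": "http://api.internal:8080/healthz",
    "db-proxy": "http://db-proxy.internal:8081/healthz",
}

def check_service_health(name: str, timeout: float = 2.0) -> dict:
    """Probe one service deterministically so the model doesn't have to
    guess at system state; return a small, token-cheap summary."""
    url = SERVICES[name]
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"service": name, "status": resp.status,
                    "healthy": resp.status == 200}
    except OSError as err:  # DNS failure, refused connection, timeout, non-200
        return {"service": name, "status": None,
                "healthy": False, "error": str(err)}

if __name__ == "__main__":
    # The agent invokes this as a tool; the output is compact JSON, not raw
    # logs, which saves tokens and keeps the model's attention on judgment.
    print(json.dumps([check_service_health(s) for s in SERVICES]))
```

<p>Note the division of labor: The code does the probing the LLM is bad at, and returns just enough structure for the runbook-informed model to decide which tool to call next.</p>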



<p>This pattern shows up across every knowledge domain. A financial analyst&#8217;s Skill might combine valuation methodology with tools that pull real-time market data and run DCF calculations. A legal Skill might pair a firm&#8217;s approach to contract review with tools that extract and compare specific clauses across documents. In each case, the valuable thing isn&#8217;t the knowledge alone or the tools alone. It&#8217;s the integration of expert workflow logic that orchestrates when and how to use each tool, informed by domain knowledge that gives the LLM the judgment to make good decisions in context.</p>



<h2 class="wp-block-heading">Software that saves tokens</h2>



<p>In &#8220;<a href="https://steve-yegge.medium.com/software-survival-3-0-97a2a6255f7b" target="_blank" rel="noreferrer noopener">Software Survival 3.0</a>,&#8221; Steve Yegge asked what kinds of software artifacts survive in a world where AI can generate disposable software on the fly. His answer: software that saves tokens. Binary tools with proven solutions to common problems make sense when reuse is nearly free and regenerating them is token-costly.</p>



<p>Skills fit this niche. A well-crafted Skill gives an LLM the context it needs (which costs tokens) but also gives it tools that <em>save</em> tokens by providing deterministic, reliable results. The developer&#8217;s job increasingly becomes making good calls about this distinction: What should be context (flexible, expressive, probabilistic) and what should be a tool (efficient, deterministic, reusable)?</p>



<p>An LLM&#8217;s context window is a finite and expensive resource. Everything in it costs tokens, and everything in it competes for the model&#8217;s attention. A Skill that dumps an entire company knowledge base into the context window is a poorly designed Skill. A well-designed one is selective: It gives the model exactly the context it needs to make good decisions about which tools to call and when. This is a form of engineering discipline that doesn&#8217;t have a great analogue in traditional software development. It&#8217;s closer to what an experienced teacher does when deciding what to tell a student before sending them off to solve a problem—what Matt Beane, author of <a href="https://www.harpercollins.com/products/the-skill-code-matt-beane" target="_blank" rel="noreferrer noopener"><em>The Skill Code</em></a>, calls &#8220;scaffolding,&#8221; sharing not everything you know but the right things at the right level of detail to enable good judgment in the moment.</p>



<h2 class="wp-block-heading">AI is a social and cultural technology</h2>



<p>This notion of saving tokens is a bridge to the work of Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans. They make the case that large models should not be viewed primarily as intelligent agents, but as <a href="https://henryfarrell.net/wp-content/uploads/2025/03/Science-Accepted-Version.pdf" target="_blank" rel="noreferrer noopener">a new kind of cultural and social technology</a>, allowing humans to take advantage of information other humans have accumulated. Yegge&#8217;s observation fits right into this framework. Every new social and cultural technology tends to survive because it saves cognition. We learn from each other so we don&#8217;t have to discover everything for the first time. Alfred Korzybski referred to language, the first of these human social and cultural technologies, and all of those that followed, as &#8220;<a href="https://www.google.com/search?q=Alfred+Korzybski+time+binding" target="_blank" rel="noreferrer noopener">time-binding</a>.&#8221; (I will add that each advance in time binding creates consternation. Consider Socrates, whose diatribes against writing as the enemy of memory were passed down to us by Plato using that very same advance in time binding that Socrates decried.)</p>



<p>I am not convinced that the idea that AI may one day become an independent intelligence is misguided. But at present, AI is a symbiosis of human and machine intelligence, the latest chapter of a long story in which advances in the speed, persistence, and reach of communications <a href="https://www.youtube.com/watch?v=u62fQCI7YNA" target="_blank" rel="noreferrer noopener">weave humanity into a global brain</a>. I have a set of priors that say (until I am convinced otherwise) that <em>AI will be an extension of the human knowledge economy, not a replacement for it</em>. After all, as Claude told me when I asked whether <a href="https://www.oreilly.com/radar/jensen-huang-gets-it-wrong/" target="_blank" rel="noreferrer noopener">it was a worker or a tool</a>, &#8220;I don&#8217;t initiate. I&#8217;ve never woken up wanting to write a poem or solve a problem. My activity is entirely reactive – I exist in response to prompts. Even when given enormous latitude (&#8216;figure out the best approach&#8217;), the fact that I should figure something out comes from outside me.&#8221;</p>



<p>The shift from a chatbot responding to individual prompts to agents running in a loop marks a big step in the progress towards more autonomous AI, but even then, some human established the goal that set the agent in motion. I say this even as I am aware that long-running loops become increasingly difficult to distinguish from volition and that much human behavior is also set in motion by others. But I have yet to see any convincing evidence of Artificial Volition. And for that reason, <em>we need to think about mechanisms and incentives for humans to continue to create and share new knowledge</em>, putting AIs to work on questions that they will not ask on their own.</p>



<p>On X, someone recently asked Boris Cherny why there are a hundred-plus open engineering positions at Anthropic if Claude is writing 100% of the code. <a href="https://x.com/bcherny/status/2022762422302576970" target="_blank" rel="noreferrer noopener">His reply</a> made that same point: &#8220;Someone has to prompt the Claudes, talk to customers, coordinate with other teams, decide what to build next. Engineering is changing and great engineers are more important than ever.&#8221;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>On March 26, join Addy Osmani and Tim O’Reilly at AI Codecon: Software Craftsmanship in the Age of AI, where an all-star lineup of experts will go deeper into orchestration, agent coordination, and the new skills developers need to build excellent software that creates value for all participants.&nbsp;</em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></p>
</blockquote>



<h2 class="wp-block-heading">Tacit knowledge made executable</h2>



<p>A huge amount of specialized, often tacit, knowledge is embedded in workflows. The way an experienced developer debugs a production issue. The way a financial analyst stress-tests a model. This knowledge has historically been very hard to transfer. You learned it by apprenticeship, by doing, by being around people who knew how.</p>



<p>Matt Beane, author of <a href="https://www.harpercollins.com/products/the-skill-code-matt-beane" target="_blank" rel="noreferrer noopener"><em>The Skill Code</em></a>, calls apprenticeship &#8220;the 160,000-year-old school hidden in plain sight.&#8221; He finds that effective skill development follows a common pattern of three C&#8217;s: challenge, complexity, and connection. The expert structures challenges at the right level, exposes the novice to the full complexity of the bigger picture rather than shielding them from it, and builds a connection that makes the novice willing to struggle and the expert willing to invest.</p>



<p>Designing a good Skill requires a similar craft. You have to figure out what an expert actually <em>does</em>. What are the decision points, the heuristics, the things they notice that a novice wouldn&#8217;t? And then how do you encode that into a form a machine can act on? Most Skills today are closer to the manual than to the master. Figuring out how to make Skills that transmit not just knowledge but judgment is one of the most interesting design challenges in this space.</p>



<p>But Matt also flags a paradox: the better we get at encoding expert judgment into Skills, the less we may need novices working alongside experts, and that&#8217;s exactly the relationship that produces the next generation of experts. If we&#8217;re not careful, we&#8217;ll capture today&#8217;s tacit knowledge while quietly shutting down the system that generates tomorrow&#8217;s.</p>



<p>Jesse Vincent&#8217;s Superpowers complement this picture. If a Skill is like handing a colleague a detailed playbook for a particular job, a Superpower is more like the professional habits and instincts that make someone effective at everything they do. Superpowers are meta-skills. They don&#8217;t tell the agent what to do. They shape how it thinks about what to do. As Jesse put it to me the other day, Superpowers tried to capture everything he&#8217;d learned in 30 years as a software developer.</p>



<p>As workflows change to include AI agents, Skills and Superpowers become a mechanism for sharing tacit professional knowledge and judgment with those agents. That makes Skills potentially very valuable but also raises questions about who controls them and who benefits.</p>



<p>Matt pointed out to me that many professions will resist the conversion of their expertise into Skills. He noted: &#8220;There&#8217;s a giant showdown between the surgical profession and Intuitive Surgical on this right now — Intuitive Surgical with its da Vinci 5 surgical robot will only let you buy or lease it if you sign away the rights to your telemetry as a surgeon. Lower status surgeons take the deal. Top tier institutions are fighting.&#8221;</p>



<p>It seems to me that the AI labs&#8217; repeated narrative that they are creating AI to make humans redundant rather than to empower them will only increase resistance to knowledge sharing. I believe they should instead recognize the opportunity that lies in <a href="https://www.oreilly.com/radar/ai-and-the-next-economy/" target="_blank" rel="noreferrer noopener">making a new kind of market for human expertise</a>.</p>



<h2 class="wp-block-heading"><strong>Protection, discovery, and the missing plumbing</strong></h2>



<p>Skills are just Markdown instructions and context. You could encrypt them at rest and in transit, but at execution time, the secret sauce is necessarily plaintext in the context window. The solution might be what MCP already partially enables: splitting a Skill into a public interface and a server-side execution layer where the proprietary knowledge lives. The tacit knowledge stays on your server while the agent only sees the interface.</p>
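<p>As a thought experiment, here&#8217;s a minimal sketch of that split. All names and structure here are hypothetical, not any real Skill format or MCP API: the agent&#8217;s context window sees only a public interface, while the proprietary heuristics execute server-side and never appear in plaintext.</p>

```python
# Illustrative sketch only: a hypothetical Skill split into a public
# interface (what the agent reads) and a server-side execution layer
# (where the proprietary judgment lives).

# Public interface: all the agent's context window ever sees.
PUBLIC_SKILL = {
    "name": "pricing-review",
    "description": "Check a proposed price against internal heuristics.",
    "inputs": {"cost": "float", "proposed_price": "float"},
    "outputs": {"verdict": "str"},
}

# Server-side execution layer: the "secret sauce" stays behind an
# endpoint, never in the agent's context.
def _execute_pricing_review(cost: float, proposed_price: float) -> dict:
    margin = (proposed_price - cost) / proposed_price
    # Thresholds an expert would tune over years of experience.
    if margin < 0.15:
        return {"verdict": "reject: margin below house floor"}
    if margin > 0.60:
        return {"verdict": "flag: price likely above market tolerance"}
    return {"verdict": "approve"}

def call_skill(name: str, **kwargs) -> dict:
    """What the agent invokes; it sees the interface, not the logic."""
    if name == PUBLIC_SKILL["name"]:
        return _execute_pricing_review(**kwargs)
    raise KeyError(f"unknown skill: {name}")

print(call_skill("pricing-review", cost=70.0, proposed_price=100.0))
```

<p>The design point is separation, not this particular logic: the interface can be published, crawled, and discovered, while the execution layer can be metered and protected.</p>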



<p>But part of the beauty of Skills right now is the fact that they really are just a folder that you can move around and modify. This is like the marvelous days of the early web when you could imitate someone&#8217;s new HTML functionality simply by clicking &#8220;View Source.&#8221; This was a recipe for rapid, leapfrogging innovation. It may be far better to establish norms for attribution, payment, and reuse than to put up artificial barriers. There are useful lessons from open source software licenses and from voluntary payment mechanisms like those used by Substack. But the details matter, and I don&#8217;t think anyone has fully worked them out yet.</p>



<p>Meanwhile, the discovery problem will grow larger. Vercel&#8217;s <a href="https://skills.sh/" target="_blank" rel="noreferrer noopener">Skills marketplace</a> already has more than 60,000 Skills. How well will skill search work when there are millions? How do agents learn which Skills are available, which are best, and what they cost? The evaluation problem is different from web search in a crucial way: testing whether a Skill is <em>good</em> requires actually running it, which is expensive and nondeterministic. You can&#8217;t just crawl and index. I don&#8217;t imagine a testing regime so much as some feedback mechanism by which the effectiveness of particular Skills is learned and passed on by agents over time. There may be some future equivalent to PageRank and the other signals that have made Google search so effective, one generated by feedback collected by agents as Skills are tried, revised, and tried again over time.</p>



<p>I&#8217;m watching several projects tackling pieces of this: <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2127" target="_blank" rel="noreferrer noopener">MCP Server Cards</a>, <a href="https://github.com/Agent-Card/ai-card" target="_blank" rel="noreferrer noopener">AI Cards</a>, Google&#8217;s <a href="https://a2aprotocol.ai/" target="_blank" rel="noreferrer noopener">A2A protocol</a>, and payment protocols from <a href="https://developers.google.com/merchant/ucp">Google</a> and <a href="https://stripe.com/blog/developing-an-open-standard-for-agentic-commerce">Stripe</a>. These are all a good start, but I suspect so much more has yet to be created. For a historical comparison, you might say that all this is at the <a href="https://en.wikipedia.org/wiki/Common_Gateway_Interface">CGI</a> stage in the development of dynamic websites.</p>



<h2 class="wp-block-heading"><strong>What happens after the bitter lesson?</strong></h2>



<p>Richard Sutton&#8217;s &#8220;<a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html" target="_blank" rel="noreferrer noopener">Bitter Lesson</a>&#8221; is the fly in the ointment. His argument is that in the history of AI, general methods leveraging computation have always ended up beating approaches that try to encode human knowledge. Chess engines that encoded grandmaster heuristics lost to brute-force engines. NLP systems built on carefully constructed grammars lost to statistical models trained on more data. AlphaGo beat Lee Sedol after training on human games, but then fell in turn to AlphaZero, which learned Go on its own.</p>



<p>I had my own painful experience of the pre-AI bitter lesson when O&#8217;Reilly launched <a href="https://en.wikipedia.org/wiki/Global_Network_Navigator" target="_blank" rel="noreferrer noopener">GNN, the first web portal</a>. We curated the list of the best websites. Yahoo! decided to catalog them all, but even they were outrun by Google&#8217;s algorithmic curation, which produced a unique catalog of the best sites for any given query, ultimately billions of times a day.</p>



<p>Steve Yegge put it bluntly to me: &#8220;Skills are a bet against the bitter lesson.&#8221; He&#8217;s right. AI&#8217;s capabilities may completely outrun human knowledge and skills. And once the knowledge embedded in a Skill makes it into the training data, the Skill becomes redundant.</p>



<p>Or does it?</p>



<p>Clay Christensen articulated what he called the <a href="https://store.hbr.org/product/breakthrough-ideas-for-2004-the-hbr-list/R0402A" target="_blank" rel="noreferrer noopener">law of conservation of attractive profits</a>: when a product becomes commoditized, value migrates to an adjacent layer. Clay and I bonded over this idea when we first met at the Open Source Business Conference in 2004. Clay talked about his new “law.” I talked about a recurring pattern I was seeing in the history of computing, which was leading me in the direction of what we were soon to call <a href="https://www.oreilly.com/pub/a/web2/archive/what-is-web-20.html" target="_blank" rel="noreferrer noopener">Web 2.0</a>: Microsoft beat IBM because they understood that software became more valuable once PC hardware was a commodity. Google understood how data became more valuable when open source and open protocols commoditized the software platform. Commoditization doesn&#8217;t destroy value, it moves it.</p>



<p>Even if the bitter lesson commoditizes knowledge, what becomes valuable next? I think there are several candidates.</p>



<p>First, taste and curation. When everyone has access to the same commodity knowledge, the ability to select, combine, and apply it with judgment becomes valuable. Steve Jobs did this when the rest of the industry was racing toward the commodity PC. He created a unique integration of hardware, software, and design that transformed commodity components into something precious. The Skill equivalent might not be &#8220;here&#8217;s how to do X&#8221; (which the model already knows) but &#8220;here&#8217;s how <em>we</em> do X, with the specific judgment calls and quality standards that define our approach.&#8221; That&#8217;s harder to absorb into training data because it&#8217;s not just knowledge, it&#8217;s <em>values</em>.</p>



<p>You can see this pattern repeat across one commodity market after another. This is the essence of fashion, for example, but also applies to areas as diverse as coffee, water, consumer goods, and automobiles. In his essay “<a href="https://www.amazon.com/Air-Guitar-Essays-Art-Democracy/dp/0963726455" target="_blank" rel="noreferrer noopener">The Birth of the Big Beautiful Art Market</a>,” art critic Dave Hickey describes how commodities are turned into a kind of “art market,” where something is sold on the basis of what it means rather than just what it does. Owning a Mac rather than a PC <em>meant</em> something.</p>



<p>Second, the human touch. As economist Adam Ozimek <a href="https://agglomerations.substack.com/p/economics-of-the-human" target="_blank" rel="noreferrer noopener">pointed out</a>, people still go listen to live music from local bands despite the abundance of recorded music from the world&#8217;s greatest performers. The human touch is what economists call a &#8220;normal good&#8221;: demand for it goes up as income goes up. As I discussed with Claude in &#8220;<a href="https://timoreilly.substack.com/p/why-ai-needs-us" target="_blank" rel="noreferrer noopener">Why AI Needs Us</a>,&#8221; human individuality is a fount of creativity. AI without humans is a kind of recorded music. AI plus humans is live.</p>



<p>Third, freshness. Skills that encode rapidly changing workflows, current tool configurations, or evolving best practices will always have a temporal advantage. There is alpha in knowing something first.</p>



<p>Fourth, tools themselves. The bitter lesson applies to the knowledge that lives in the context portion of a Skill. It may not apply in the same way to the deterministic tools that save tokens or do things the model can&#8217;t do by thinking harder. And tools, unlike context, can be protected behind APIs, metered, and monetized.</p>



<p>Fifth, coordination and orchestration. Even if individual Skills get absorbed into model knowledge, the patterns for how Skills compose, negotiate, and hand off to each other may not. The choreography of a complex workflow might be the layer where value accumulates as the knowledge layer commoditizes.</p>



<p>But more importantly, the idea that any knowledge that becomes available automatically becomes the property of any LLM is not foreordained. It is an artifact of an IP regime that the AI labs have adopted for their own benefit: a variation of the &#8220;empty lands&#8221; argument that European colonialists used to justify their taking of others&#8217; resources. AI has been developed in an IP wild west. That may not continue. The fulfillment of AI labs&#8217; vision of a world where their products absorb all human knowledge and then put humans out of work <a href="https://x.com/timoreilly/status/2016317410853220827" target="_blank" rel="noreferrer noopener">leaves them without many of the customers they currently rely on</a>. Not only that, they themselves are being reminded why IP law exists, as <a href="https://www.economist.com/china/2026/02/25/anthropic-says-chinas-ai-tigers-are-copycats" target="_blank" rel="noreferrer noopener">Chinese models copy their advances by exfiltrating their weights</a>. There is a historical parallel in the way that US publishing companies ignored European copyrights until they themselves had homegrown assets to protect.</p>



<h2 class="wp-block-heading"><strong>Where we are now</strong></h2>



<p>What I&#8217;m starting to see are the first halting steps toward a new software ecosystem where the &#8220;programs&#8221; are mixtures of natural language and code, the &#8220;runtime&#8221; is a large language model, and the &#8220;users&#8221; are AI agents as well as humans. Skills, Superpowers, and knowledge plugins might represent the first practical mechanism for making tacit knowledge accessible to computational agents.</p>



<p>Several gaps keep coming up, though. Composability: the real power may come from Skills that work together, much like Unix utilities piped together. How do trust, payment, and quality propagate through a chain of Skill invocations? Trust and security: Simon Willison has written about <a href="https://simonw.substack.com/p/model-context-protocol-has-prompt" target="_blank" rel="noreferrer noopener">tool poisoning and prompt injection risks in MCP</a>. The security model for composable, agent-discovered Skills is essentially unsolved. Evaluation: we don&#8217;t have good ways to verify Skill quality except by running them, which is expensive and nondeterministic.</p>
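<p>The Unix-pipes analogy can be sketched in a few lines. This is a toy illustration with hypothetical skill functions, not a real Skill runtime; the point is that each Skill consumes the previous one&#8217;s output while a provenance trail rides along, which is one way trust and attribution might propagate through a chain of invocations.</p>

```python
def extract_numbers(payload):
    """Skill 1: pull numeric tokens out of free text."""
    nums = [float(t) for t in payload["text"].split()
            if t.replace(".", "", 1).isdigit()]
    payload["numbers"] = nums
    payload["trail"].append("extract_numbers")
    return payload

def summarize(payload):
    """Skill 2: reduce the extracted numbers to a summary."""
    nums = payload["numbers"]
    payload["summary"] = {"count": len(nums), "total": sum(nums)}
    payload["trail"].append("summarize")
    return payload

def pipe(payload, *skills):
    """Compose skills left to right, like `cmd1 | cmd2`."""
    for skill in skills:
        payload = skill(payload)
    return payload

result = pipe({"text": "invoices: 19.99 5 missing 7", "trail": []},
              extract_numbers, summarize)
print(result["summary"], result["trail"])
```

<p>The open questions in the paragraph above live in exactly this trail: who gets paid for each hop, and how quality signals flow back along it.</p>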



<p>And then there&#8217;s the economic plumbing, which is to me the most glaring gap. Consider Anthropic&#8217;s Cowork plugins. They are exactly the pattern I&#8217;ve been describing, tacit knowledge made executable, delivered at enterprise scale. But there is no mechanism for the domain experts whose knowledge makes plugins valuable to get paid for them. If the AI labs believed in a future where AI extends the human knowledge economy rather than replacing it, they would be building payment rails alongside the plugin architecture. The fact that they aren&#8217;t tells you something about their actual theory of value.</p>



<p>If you&#8217;re working on any of this, whether skill marketplaces and discovery, composability patterns, protection models, quality and evaluation, attribution and compensation, or security models, <a href="https://github.com/oreillymedia/skills-and-the-future-knowledge-economy" target="_blank" rel="noreferrer noopener">I want to hear from you</a>.</p>



<p>The future of software isn&#8217;t just code. It&#8217;s knowledge, packaged for machines, traded between agents, and, if we get the infrastructure right, creating value that flows back to the humans whose expertise and unique perspectives make it all work.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Thanks to Andrew Odewahn, Angie Jones, Claude Opus 4.6, James Cham, Jeff Weinstein, Jonathan Hassell, Matt Beane, Mike Loukides, Peyton Joyce, Sruly Rosenblat, Steve Yegge, and Tadas Antanavicius for comments on drafts of this piece. You made it much stronger with your insights and objections.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Semantic Layers in the Wild: Lessons from Early Adopters</title>
		<link>https://www.oreilly.com/radar/semantic-layers-in-the-wild-lessons-from-early-adopters/</link>
				<comments>https://www.oreilly.com/radar/semantic-layers-in-the-wild-lessons-from-early-adopters/#respond</comments>
				<pubDate>Thu, 26 Feb 2026 12:16:01 +0000</pubDate>
					<dc:creator><![CDATA[Jeremy Arendt]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18141</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-semantic-layers1.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-semantic-layers1-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[My first post made the case for what a semantic layer can bring to the modern enterprise: a single source of truth accessible to everyone who needs it—BI teams in Tableau and Power BI, Excel-loving analysts, application integrations via API, and the AI agents now proliferating across organizations—all pulling from the same governed, performant metric [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>My <a href="https://www.oreilly.com/radar/the-trillion-dollar-problem/" target="_blank" rel="noreferrer noopener">first post</a> made the case for what a semantic layer can bring to the modern enterprise: a single source of truth accessible to everyone who needs it—BI teams in Tableau and Power BI, Excel-loving analysts, application integrations via API, and the AI agents now proliferating across organizations—all pulling from the same governed, performant metric layer. The promise is compelling. But what happens when organizations actually build and deploy one? To find out, I interviewed several early adopters who&#8217;ve moved semantic layers from concept to production. Four themes emerged from those conversations: some surprising, some predictable, and a few that will sound familiar to anyone who&#8217;s ever shipped data infrastructure.</p>



<p>The first theme: Semantic layers are showing up in unexpected places. Most discussion positions them as enterprise-level infrastructure—a single location capturing all company metrics for centralized access and governance. That&#8217;s still the primary use case. But practitioners are also deploying semantic layers for narrower purposes. One organization, for example, built their semantic layer specifically to power a targeted chatbot application—letting users query data conversationally without any traditional BI tools in the mix. No Power BI, no Excel, just an AI interface pulling from governed metrics. The rationale for these smaller deployments is straightforward: Semantic layers deliver high accuracy on structured data, even with lightweight models. The core value drivers remain speed, accuracy, and access—but organizations are finding more ways to extract that value than the enterprise-wide vision suggests.</p>



<p>The second theme: AI is the reason organizations are moving now. The other benefits still matter—single source of truth, multitool compatibility, true self-serve access, cost reduction in cloud environments—but when I asked practitioners why they prioritized a semantic layer today rather than two years ago, the answer was consistent: AI. Whether it was a specific chatbot project or enabling AI-driven analytics at scale, AI requirements were the catalyst. This tracks with what I discussed in my first post: Structured data alone isn&#8217;t enough for reliable AI analytics. Adding semantic context—field descriptions, model definitions, object relationships—dramatically improves accuracy. The data industry has noticed. Semantic layers have moved from niche infrastructure to strategic priority: Snowflake, Databricks, dbt Labs, and Microsoft have all made significant investments in the past year.</p>



<p>The third theme: Semantic layers reduce work for developers while making trusted data easier to access. Multiple practitioners cited the value of maintaining metrics and business logic in a single location. Any analyst knows the pain of metric sprawl—leadership requests a change to a core KPI, and you discover it&#8217;s been defined a dozen different ways across databases, BI tools, and spreadsheets scattered through the organization. The semantic layer eliminates the chase. One engineering lead described a financial metric that had accumulated over 60 versions across the company. After deploying the semantic layer, there was one.</p>
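<p>The single-definition idea can be sketched in a few lines. This is a toy illustration with a hypothetical schema, not any vendor&#8217;s semantic layer: one governed metric definition answers every consumer identically, so there is nothing to drift.</p>

```python
# Illustrative only: a hypothetical one-place metric registry standing
# in for a real semantic layer.
METRICS = {
    "net_revenue": {
        "description": "Gross revenue minus refunds",
        "compute": lambda rows: sum(r["gross"] - r["refunds"] for r in rows),
    }
}

def query_metric(name, rows):
    """Every consumer (BI tool, Excel, chatbot) calls the same definition."""
    return METRICS[name]["compute"](rows)

rows = [{"gross": 120.0, "refunds": 20.0}, {"gross": 80.0, "refunds": 0.0}]
# Identical answer regardless of which tool asks:
print(query_metric("net_revenue", rows))  # 180.0
```

<p>Changing the KPI means editing one entry in the registry, not chasing dozens of copies across dashboards and spreadsheets.</p>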



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>Access simplifies too. Instead of provisioning controls across warehouses, BI workspaces, individual dashboards, and cloud storage locations, users connect directly to the semantic layer and pull data into the tool of their choice. One organization was surprised to find that after deployment, the most common access point was Excel. But with the semantic layer, that wasn&#8217;t a problem: The data served in Excel was identical to what powered their AI tools, Power BI dashboards, and application integrations via API.</p>



<p>The fourth theme will sound familiar to anyone who&#8217;s shipped data infrastructure: The biggest challenge isn&#8217;t the technology—it&#8217;s the data itself. Every practitioner I spoke with identified the same bottleneck: consistency, availability, and accuracy of the underlying data. Engineers and analysts can build the semantic layer, but they can&#8217;t will clean data into existence. Success requires close collaboration with business stakeholders, clear ownership of metrics, and leadership alignment to prioritize the work. None of that is new. But despite these challenges, everyone I interviewed reached the same conclusion: The semantic layer is worth the effort.</p>



<p>Semantic layer technology is still early. The tools, vendors, and best practices are evolving fast—what works today may look different in a year. But these conversations revealed a clear signal beneath the noise: semantic layers are becoming critical AI infrastructure. The practitioners I spoke with aren&#8217;t experimenting anymore. They&#8217;re operationalizing. And despite the expected challenges around data quality and organizational alignment, they&#8217;re seeing real returns: fewer metric versions to maintain, simpler access controls, and AI tools that actually produce trusted answers.</p>



<p>My first article made the case for what a semantic layer could be. This one asked what happens when organizations actually build them. The answer: It&#8217;s hard, it&#8217;s worth it, and for companies serious about AI-driven analytics, the semantic layer is no longer a nice-to-have. It&#8217;s the foundation.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/semantic-layers-in-the-wild-lessons-from-early-adopters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Why Multi-Agent Systems Need Memory Engineering</title>
		<link>https://www.oreilly.com/radar/why-multi-agent-systems-need-memory-engineering/</link>
				<comments>https://www.oreilly.com/radar/why-multi-agent-systems-need-memory-engineering/#respond</comments>
				<pubDate>Wed, 25 Feb 2026 12:12:13 +0000</pubDate>
					<dc:creator><![CDATA[Mikiko Bazeley]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18124</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-robot-AI-in-the-office.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-robot-AI-in-the-office-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Most multi-agent AI systems fail expensively before they fail quietly. The pattern is familiar to anyone who&#8217;s debugged one: Agent A completes a subtask and moves on. Agent B, with no visibility into A&#8217;s work, reexecutes the same operation with slightly different parameters. Agent C receives inconsistent results from both and confabulates a reconciliation. The [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Most multi-agent AI systems fail expensively before they fail quietly.</p>



<p>The pattern is familiar to anyone who&#8217;s debugged one: Agent A completes a subtask and moves on. Agent B, with no visibility into A&#8217;s work, reexecutes the same operation with slightly different parameters. Agent C receives inconsistent results from both and confabulates a reconciliation. The system produces output—but the output costs three times what it should and contains errors that propagate through every downstream task.</p>



<p>Teams building these systems tend to focus on agent communication: better prompts, clearer delegation, more sophisticated message-passing. But communication isn&#8217;t what&#8217;s breaking. The agents exchange messages fine. What they can&#8217;t do is maintain a shared understanding of what&#8217;s already happened, what&#8217;s currently true, and what decisions have already been made.</p>



<p>In production, <strong>memory</strong>—not messaging—determines whether a multi-agent system behaves like a coordinated team or an expensive collision of independent processes.</p>



<h2 class="wp-block-heading">Multi-agent systems fail because they can&#8217;t share state</h2>



<h3 class="wp-block-heading">The evidence: 36% of failures are misalignment</h3>



<p><a href="https://arxiv.org/abs/2503.13657" target="_blank" rel="noreferrer noopener">Cemri et al.</a> published the most systematic analysis of multi-agent failure to date. Their MAST taxonomy, built from over 1,600 annotated execution traces across frameworks like AutoGen, CrewAI, and LangGraph, identifies 14 distinct failure modes. The failures cluster into three categories: system design issues, interagent misalignment, and task verification breakdowns.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="317" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues.png" alt="Agentic Issues in Action" class="wp-image-18125" style="width:647px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues-300x186.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 1. Challenges encountered in multi-agent systems, categorized by type</em></figcaption></figure>



<p>The number that matters: <strong>Interagent misalignment</strong> accounts for 36.9% of all failures. Agents don&#8217;t fail because they can&#8217;t reason. They fail because they operate on inconsistent views of shared state. One agent&#8217;s completed work doesn&#8217;t register in another agent&#8217;s context. Assumptions that were valid at step 3 become invalid by step 7, but no mechanism propagates the update. The team diverges.</p>



<p>What makes this structural rather than incidental is that message-passing architectures have no built-in answer to the question: &#8220;What does this agent know about what other agents have done?&#8221; Each agent maintains its own context. Synchronization happens through explicit messages, which means anything not explicitly communicated is invisible. In complex workflows, the set of things that need synchronization grows faster than any team can anticipate.</p>
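<p>The failure mode is easy to sketch. In this illustrative example (not any real framework&#8217;s API), Agent B would repeat Agent A&#8217;s expensive operation under pure message passing; a shared memory of completed work lets it reuse the result instead.</p>

```python
class SharedMemory:
    """Minimal shared state: which operations ran, and what they produced."""
    def __init__(self):
        self.completed = {}

    def record(self, op_key, result):
        self.completed[op_key] = result

    def lookup(self, op_key):
        return self.completed.get(op_key)

calls = []  # track how many times the expensive operation actually runs

def expensive_op(params):
    calls.append(params)
    return f"report for {params}"

def run_agent(memory, params):
    """Each agent checks shared memory before redoing work."""
    op_key = ("expensive_op", params)
    cached = memory.lookup(op_key)
    if cached is not None:
        return cached  # prior work is visible; no re-execution
    result = expensive_op(params)
    memory.record(op_key, result)
    return result

mem = SharedMemory()
a = run_agent(mem, "Q3 sales")  # Agent A does the work
b = run_agent(mem, "Q3 sales")  # Agent B reuses it instead of re-running
print(a == b, len(calls))       # True 1
```

<p>Real systems need far more than a dictionary (staleness, invalidation, concurrent writes), but the structural point stands: the synchronization lives in shared state, not in the messages.</p>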



<h3 class="wp-block-heading">The origin: Decomposition without shared memory</h3>



<p>Most multi-agent systems aren&#8217;t designed from first principles. They emerge from single-agent prototypes that hit scaling limits.</p>



<p>The starting point is usually one capable LLM handling one workflow. For early prototypes, this works well enough. But production requirements expand: more tools, more domain knowledge, longer workflows, concurrent users. The single agent&#8217;s prompt becomes unwieldy. Context management consumes more engineering time than feature development. The system becomes brittle in ways that are hard to diagnose.</p>



<p>The natural response is decomposition. <a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener">Sydney Runkle&#8217;s guide on choosing the right multi-agent architecture</a> captures the inflection point: Multi-agent systems become necessary when context management breaks down and when distributed development requires clear ownership boundaries. Splitting a monolithic agent into specialized subagents makes sense from a software engineering perspective.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="380" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps.png" alt="Decomposition steps" class="wp-image-18126" style="width:599px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps-300x223.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 2. An example of the decomposition of steps via a multi-agent structure (subagents) from LangChain’s “</em><a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener"><em>Choosing the Right Multi-Agent Architecture</em></a><em>”</em></figcaption></figure>



<p>The problem is what teams typically build after the split: multiple agents running the same base model, differentiated only by system prompts, coordinating through message queues or shared files. The architecture looks like a team but behaves like a slow, redundant, expensive single agent with extra coordination overhead.</p>



<p>This happens because the decomposition addresses prompt complexity but not state management. Each subagent still maintains its own context independently. The coordination layer handles message delivery but not shared truth. The system has more agents but no better memory.</p>



<h3 class="wp-block-heading">The stakes: Agents are becoming enterprise infrastructure</h3>



<p>The stakes here extend beyond individual system reliability. Multi-agent architectures are becoming the default pattern for enterprise AI deployment.</p>



<p><a href="https://www.cs.cmu.edu/news/2025/agent-company" target="_blank" rel="noreferrer noopener">CMU&#8217;s AgentCompany benchmark</a> frames where this is heading: agents operating as persistent coworkers inside organizational workflows, handling projects that span days or weeks, coordinating across team boundaries, maintaining institutional context that outlasts individual sessions. The benchmark evaluates agents not on isolated tasks but on realistic workplace scenarios requiring sustained collaboration.</p>



<p>This trajectory means the memory problem compounds. A system that loses state between tool calls is annoying. A system that loses state between work sessions—or between team members—breaks the core value proposition of agent-based automation. The question shifts from &#8220;can agents complete tasks&#8221; to &#8220;can agent teams maintain coherent operations over time.&#8221;</p>



<h2 class="wp-block-heading">Context engineering doesn&#8217;t solve team coordination</h2>



<h3 class="wp-block-heading">Single-agent success doesn&#8217;t transfer</h3>



<p>The last two years produced genuine progress on single-agent reliability, most of it under the banner of context engineering. </p>



<p><a href="https://www.philschmid.de/context-engineering" target="_blank" rel="noreferrer noopener">Phil Schmid&#8217;s framing</a> captures the discipline: <strong>Context engineering </strong>means structuring what enters the context window, managing retrieval timing, and ensuring the right information surfaces at the right moment. This moved agent development from &#8220;write a good prompt&#8221; to &#8220;design an information architecture.&#8221; The results showed in production stability.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="288" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window.png" alt="Context window" class="wp-image-18127" style="width:611px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window-300x169.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 3. What goes into the context window of a single LLM-based agent</em></figcaption></figure>



<p><a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">Manus</a>, one of the few production agent systems with publicly documented operational data, demonstrates both the success and the limits. Their agents average 50 tool calls per task with 100:1 input-to-output token ratios. Context engineering made this viable—but context engineering assumes you control one context window.</p>



<p>Multi-agent systems break that assumption. Context must now be shared across agents, updated as execution proceeds, scoped appropriately (some agents need information others shouldn&#8217;t access), and kept consistent across parallel execution paths. The complexity doesn&#8217;t add linearly. Each agent&#8217;s context becomes a potential source of divergence from every other agent&#8217;s context, and the coordination overhead grows with the square of the team size.</p>
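<p>The quadratic claim is easy to make concrete. With <em>n</em> agents, there are n(n-1)/2 pairwise context channels, each a potential point of divergence:</p>

```python
# Pairwise context channels among n agents: each pair is a potential
# point of divergence that coordination must reconcile.
def pairwise_channels(n: int) -> int:
    return n * (n - 1) // 2

# Growth is quadratic: doubling the team roughly quadruples the
# coordination surface.
print([pairwise_channels(n) for n in (2, 4, 8)])  # [1, 6, 28]
```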



<h3 class="wp-block-heading">Context degradation becomes contagious</h3>



<p>The ways context fails are well-characterized for single agents. <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">Drew Breunig&#8217;s taxonomy</a> identifies four modes: <strong>overload</strong> (too much information), <strong>distraction</strong> (irrelevant information weighted equally with relevant), <strong>contamination</strong> (incorrect information mixed with correct), and <strong>drift</strong> (gradual degradation over extended operation). Good context engineering mitigates all of these through retrieval design and prompt structure.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="318" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality.png" alt="Four methods for ruining context quality" class="wp-image-18128" style="width:603px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality-300x186.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 4. How context degrades over time</em></figcaption></figure>



<p>Multi-agent systems make each failure mode contagious.</p>



<p><a href="https://research.trychroma.com/context-rot" target="_blank" rel="noreferrer noopener">Chroma&#8217;s research on context rot</a> provides the empirical mechanism. Their evaluation of 18 models—including GPT-4.1, Claude 4, and Gemini 2.5—shows performance degrading nonuniformly with context length, even on tasks as simple as text replication. The degradation accelerates when distractors are present and when the semantic similarity between query and target decreases.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="288" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot.png" alt="Context rot" class="wp-image-18129" style="width:591px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot-300x169.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 5. Context rot: model performance degrades nonuniformly as context length grows, even on simple tasks</em></figcaption></figure>



<p>In a single-agent system, context rot degrades that agent&#8217;s outputs. In a multi-agent system, Agent A&#8217;s degraded output enters Agent B&#8217;s context as ground truth. Agent B&#8217;s conclusions, now built on a shaky foundation, propagate to Agent C. Each hop amplifies the original error. By the time the workflow completes, the final output may bear little relationship to the actual state of the world—and debugging requires tracing corruption through multiple agents&#8217; decision chains.</p>



<h3 class="wp-block-heading">More context makes things worse</h3>



<p>When coordination problems emerge, the instinct is often to give agents more context. Replay the full transcript so everyone knows what happened. Implement retrieval so agents can access historical state. Extend context windows to fit more information.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="319" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem.png" alt="How context quality becomes a problem" class="wp-image-18130" style="width:613px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem-300x187.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 6. Conversations aren’t free—the context window can become a junkyard of prompts, outputs, tool calls, metadata, failed attempts, and irrelevant information.</em></figcaption></figure>



<p>Each approach introduces its own failure modes.</p>



<p>Transcript replay creates unbounded prompt growth with persistent error exposure. Every mistake made early in execution remains in context, available to influence every subsequent decision. Models don&#8217;t automatically discount old information that&#8217;s been superseded by newer updates.</p>



<p>Retrieval surfaces content based on similarity, which doesn&#8217;t necessarily correlate with decision relevance. A retrieval system might surface a semantically similar memory from a different task context, an outdated state that&#8217;s since been updated, or content injected through prompt manipulation. The agent has no way to distinguish authoritative current state from plausibly related historical noise.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="312" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based.png" alt="Transcript replay vs retrieval-based" class="wp-image-18131" style="width:607px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based-300x200.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 7. Both approaches lack explicit control over what becomes committed memory versus what should be discarded.</em></figcaption></figure>






<p><a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane&#8217;s work on bounded memory control</a> addresses this directly. The proposed Agent Cognitive Compressor maintains bounded internal state with explicit separation between what an agent can recall and what it commits to shared memory. The architecture prevents drift by making memory updates deliberate rather than automatic. The core insight: Reliability requires controlling what agents remember, not maximizing how much they can access.</p>



<h3 class="wp-block-heading">The economics are unsustainable</h3>



<p>Beyond reliability, the economics of uncoordinated multi-agent systems are punishing.</p>



<p>Return to the <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">Manus operational data</a>: 50 tool calls per task, 100:1 input-to-output ratios. At current pricing—context tokens running $0.30 to $3.00 per million across major providers—inefficient memory management makes many workflows economically unviable before they become technically unviable.</p>
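<p>A back-of-envelope sketch shows how quickly this adds up. The tool-call count and token ratio come from the Manus figures above; the per-call input size and the prices are illustrative assumptions, not provider quotes:</p>

```python
# Back-of-envelope task cost using the ratios cited above (50 tool calls,
# 100:1 input-to-output tokens). Per-call input size and the prices are
# illustrative assumptions, not Manus or provider data.
def task_cost_usd(tool_calls=50, input_tokens_per_call=8_000,
                  input_price_per_m=3.00, output_price_per_m=15.00):
    input_tokens = tool_calls * input_tokens_per_call
    output_tokens = input_tokens // 100          # the 100:1 ratio
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

print(round(task_cost_usd(), 2))  # 1.26 at the assumed numbers
```

At these assumptions a single task costs about $1.26, nearly all of it input tokens — which is why redundant context re-reading, not generation, dominates the bill.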



<p><a href="https://www.anthropic.com/engineering/multi-agent-research-system" target="_blank" rel="noreferrer noopener">Anthropic&#8217;s documentation on its multi-agent research system</a> quantifies the multiplier effect. Single agents use roughly 4x the tokens of equivalent chat interactions. Multi-agent systems use roughly 15x tokens. The gap reflects coordination overhead: agents reretrieving information other agents already fetched, reexplaining context that should exist as shared state, and revalidating assumptions that could be read from common memory.</p>



<p>Memory engineering addresses costs directly. Shared memory eliminates redundant retrieval. Bounded context prevents payment for irrelevant history. Clear coordination boundaries prevent duplicated work. The economics of what to forget become as important as the economics of what to remember.</p>



<h2 class="wp-block-heading">Memory engineering provides the missing infrastructure</h2>



<h3 class="wp-block-heading">Why memory is infrastructure, not a feature</h3>



<p>Memory engineering isn&#8217;t a feature to add after the agent architecture is working. It&#8217;s infrastructure that makes coherent agent architectures possible.</p>



<p>The parallel to databases is direct. Before databases, multiuser applications required custom solutions for shared state, consistency guarantees, and concurrent access. Each project reinvented these primitives. Databases extracted the common requirements into infrastructure: shared truth across users, atomic updates that complete entirely or not at all, coordination that scales to thousands of concurrent operations without corruption.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="293" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory.png" alt="Multi-agent memory" class="wp-image-18132" style="width:595px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory-300x188.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 8. Memory types specific to multi-agent systems</em></figcaption></figure>



<p>Multi-agent systems need equivalent infrastructure for agent coordination. Persistent memory that survives sessions and failures. Consistent state that all agents can trust. Atomic updates that prevent partial writes from corrupting shared truth. The primitives are different—documents rather than rows, vector similarity rather than joins—but the role in the architecture is the same.</p>



<h3 class="wp-block-heading">The five pillars of multi-agent memory</h3>



<p>Production agent teams require five capabilities. Each addresses a distinct aspect of how agents maintain shared understanding over time.</p>



<h4 class="wp-block-heading">Pillar 1: Memory taxonomy</h4>



<p><strong>Memory taxonomy</strong> defines what kinds of memory the system maintains. Not all memories serve the same function, and treating them uniformly creates problems. Working memory holds transient state during task execution—the current step, intermediate results, active constraints. It needs fast access and can be discarded when the task completes. Episodic memory captures what happened—task histories, interaction logs, decision traces. It supports debugging and learning from past executions. Semantic memory stores durable knowledge—facts, relationships, domain models that persist across sessions and apply across tasks. Procedural memory encodes how to do things—learned workflows, tool usage patterns, successful strategies that agents can reuse. Shared memory spans agents, providing the common ground that enables coordination.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="263" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types.png" alt="Taxonomy of memory types" class="wp-image-18133" style="width:613px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types-300x169.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 9. Taxonomy of memory types</em></figcaption></figure>



<p>This taxonomy has grounding in cognitive science. <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane</a> draws on Complementary Learning Systems theory, which posits two distinct modes of learning: rapid encoding of specific experiences versus gradual extraction of structured knowledge. The human brain doesn&#8217;t maintain perfect transcripts of past events—it operates under capacity constraints, using compression and selective attention to keep only what&#8217;s relevant to the current task. Agents benefit from the same principle. Rather than accumulating raw interaction history, effective memory architectures distill experience into compact, task-relevant representations that can actually inform decisions.</p>



<p>The taxonomy matters because each memory type has different retention requirements, different retrieval patterns, and different consistency needs. Working memory can tolerate eventual consistency because it&#8217;s scoped to one agent&#8217;s execution. Shared memory requires stronger guarantees because multiple agents depend on it. Systems that don&#8217;t distinguish memory types end up either overpersisting transient state (wasting storage and polluting retrieval) or underpersisting durable knowledge (forcing agents to relearn what they should already know).</p>
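<p>One way to make the taxonomy operational is to attach lifecycle defaults to memory types in the data model itself. A minimal Python sketch: the type names follow the taxonomy above, while the TTL values and fields are illustrative assumptions, not a prescription:</p>

```python
# Sketch of the memory taxonomy as a data model. Type names follow the
# taxonomy in the text; TTL values and field choices are illustrative.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time

class MemoryType(Enum):
    WORKING = "working"        # transient task state
    EPISODIC = "episodic"      # what happened: logs, traces
    SEMANTIC = "semantic"      # durable facts and domain models
    PROCEDURAL = "procedural"  # learned workflows and strategies
    SHARED = "shared"          # cross-agent common ground

# Illustrative lifecycle defaults in seconds (None = retain indefinitely)
DEFAULT_TTL = {
    MemoryType.WORKING: 3_600,          # discard after the task window
    MemoryType.EPISODIC: 60 * 86_400,   # keep roughly two months
    MemoryType.SEMANTIC: None,
    MemoryType.PROCEDURAL: None,
    MemoryType.SHARED: None,
}

@dataclass
class MemoryUnit:
    type: MemoryType
    content: str
    created_at: float = field(default_factory=time.time)

    def expired(self, now: Optional[float] = None) -> bool:
        ttl = DEFAULT_TTL[self.type]
        if ttl is None:
            return False
        now = time.time() if now is None else now
        return (now - self.created_at) > ttl
```

Making the retention policy a property of the memory type, rather than of individual writes, is what keeps transient state from quietly becoming permanent record.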



<h4 class="wp-block-heading">Pillar 2: Persistence</h4>



<p><strong>Persistence</strong> determines what survives and for how long. Ephemeral memory lost when agents terminate is insufficient for workflows spanning hours or days—but persisting everything forever creates its own problems. The critical gap in most current approaches, as <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane</a> observes, is that they treat text artifacts as the primary carrier of state without explicit rules governing memory lifecycle. Which memories should become permanent record? Which need revision as context evolves? Which should be actively forgotten? Without answers to these questions, systems accumulate noise alongside signal. Effective persistence requires explicit lifecycle policies: Working memory might live for the duration of a task; episodic memory for weeks or months; and semantic memory indefinitely. Recovery semantics matter too. When an agent fails midtask, what state can be reconstructed? What&#8217;s lost? The persistence architecture must handle both planned retention and unplanned recovery.</p>



<h4 class="wp-block-heading">Pillar 3: Retrieval</h4>



<p><strong>Retrieval</strong> governs how agents access relevant memory without drowning in noise. Agent memory retrieval differs from document retrieval in several ways. Recency often matters—recent memories typically outweigh older ones for ongoing tasks. Relevance is contextual—the same memory might be critical for one task and distracting for another. Scope varies by memory type—working memory retrieval is narrow and fast, semantic memory retrieval is broader and can tolerate more latency. Standard RAG pipelines treat all content uniformly and optimize for semantic similarity alone. Agent memory systems need retrieval strategies that account for memory type, recency, task context, and agent role simultaneously.</p>
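<p>A retrieval scorer along these lines blends similarity, recency, and task scope into one ranking. The weights and half-life below are illustrative assumptions, and in practice the similarity score would come from a vector index:</p>

```python
# Sketch of a memory-ranking function combining the signals described
# above. Weights and half-life are illustrative assumptions.
def score_memory(similarity: float, age_seconds: float,
                 same_workflow: bool, superseded: bool,
                 half_life: float = 3600.0) -> float:
    """Rank a candidate memory for inclusion in an agent's context."""
    if superseded:
        return 0.0                               # never surface retracted state
    recency = 0.5 ** (age_seconds / half_life)   # halves every hour
    scope_bonus = 0.2 if same_workflow else 0.0  # task-context match
    return 0.8 * similarity + 0.2 * recency + scope_bonus
```

The point is not these particular numbers but the structure: similarity alone, which is all standard RAG optimizes, is only one term in the ranking.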



<h4 class="wp-block-heading">Pillar 4: Coordination</h4>



<p><strong>Coordination</strong> defines the sharing topology. Which memories are visible to which agents? What can each agent read versus write? How do memory scopes nest or overlap? Without explicit coordination boundaries, teams either overshare—every agent sees everything, creating noise and contamination risk—or undershare—agents operate in isolation, duplicating work and diverging on shared tasks. The coordination model must match the agent team&#8217;s structure. A supervisor-worker hierarchy needs different memory visibility than a peer collaboration. A pipeline of sequential agents needs different sharing than agents working in parallel on subtasks.</p>
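<p>Making the topology explicit can be as simple as declaring read and write scopes per agent. A sketch for the supervisor-worker case mentioned above; the namespace names are illustrative:</p>

```python
# Explicit memory scoping for a supervisor-worker team. The namespaces
# and rules are illustrative; the point is that visibility is declared,
# not implicit.
SCOPES = {
    "supervisor": {"read": {"shared", "worker_a", "worker_b"}, "write": {"shared"}},
    "worker_a":   {"read": {"shared", "worker_a"}, "write": {"worker_a"}},
    "worker_b":   {"read": {"shared", "worker_b"}, "write": {"worker_b"}},
}

def can_read(agent: str, namespace: str) -> bool:
    return namespace in SCOPES[agent]["read"]

def can_write(agent: str, namespace: str) -> bool:
    return namespace in SCOPES[agent]["write"]
```

Under this scheme workers cannot see each other's working memory (no contamination across subtasks), and only the supervisor commits to shared state — a peer-collaboration topology would declare a different table, not different code.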



<h4 class="wp-block-heading">Pillar 5: Consistency</h4>



<p><strong>Consistency</strong> handles what happens when memory updates collide. When Agent A and Agent B simultaneously update the same shared state with incompatible values, the system needs a policy. Optimistic concurrency with merge strategies works for many cases—especially when conflicts are rare and resolvable. Some conflicts require escalation to a supervisor agent or human operator. Some domains need strict serialization where only one agent can update certain memories at a time. Silent last-write-wins is almost never correct—it corrupts shared truth without leaving evidence that corruption occurred. The consistency model must also handle ordering: When Agent B reads a memory that Agent A recently updated, does B see the update? The answer depends on the consistency guarantees the system provides, and different memory types may warrant different guarantees.</p>
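<p>Optimistic concurrency, the first policy named above, can be sketched as version-checked writes: a writer must name the version it read, and a stale write is rejected instead of silently overwriting. A toy in-memory model, not a production store:</p>

```python
# Toy model of optimistic concurrency for shared memory. Every write
# names the version it read; a stale write fails loudly, so a conflict
# leaves evidence instead of silently corrupting shared truth.
class SharedMemory:
    def __init__(self):
        self._store = {}  # key -> (version, value)

    def read(self, key):
        return self._store.get(key, (0, None))

    def write(self, key, value, expected_version) -> bool:
        version, _ = self._store.get(key, (0, None))
        if version != expected_version:
            return False            # conflict: caller must re-read and merge
        self._store[key] = (version + 1, value)
        return True
```

When two agents race, exactly one write succeeds; the loser gets an explicit failure and can re-read, merge, or escalate — the three resolution paths described above.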



<p><a href="https://arxiv.org/abs/2402.03578" target="_blank" rel="noreferrer noopener">Han et al.&#8217;s survey of multi-agent systems</a> emphasizes that these represent active research problems. The gap between what production systems need and what current frameworks provide remains substantial. Most orchestration frameworks handle message passing well but treat memory as an afterthought—a vector store bolted on for retrieval, with no coherent model for the other four pillars.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="291" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard.png" alt="How persona, consensus, and whiteboard memory work together" class="wp-image-18134" style="width:612px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard-300x187.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 10. How persona, consensus, and whiteboard memory work together</em></figcaption></figure>



<h3 class="wp-block-heading">Database primitives that enable the pillars</h3>



<p>Implementing memory engineering requires a storage layer that can serve as unified operational database, knowledge store, and memory system simultaneously. The requirements cut across traditional database categories: You need document flexibility for evolving memory schemas, vector search for semantic retrieval, full-text search for precise lookups, and transactional consistency for shared state.</p>



<p>MongoDB provides these primitives in a single platform, which is why it appears across so many agent memory implementations—whether teams build custom solutions or integrate through frameworks and memory providers.</p>



<p><strong>Document flexibility</strong> matters because memory schemas evolve. A memory unit isn&#8217;t a flat string—it&#8217;s structured content with metadata, timestamps, source attribution, confidence scores, and associative links to related memories. Teams discover what context agents actually need through iteration. Document databases accommodate this evolution without schema migrations blocking development.</p>
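<p>Concretely, a memory unit might be stored as a document like the following. Every field name and value here is an illustrative assumption about what such a unit carries, not a required schema:</p>

```python
# Illustrative shape of one memory unit as a document. All field names
# and values are assumptions for illustration, not a required schema.
memory_unit = {
    "type": "semantic",
    "content": "Customer X prefers invoices in EUR.",
    "metadata": {"workflow_id": "wf-42", "agent": "billing"},
    "created_at": "2026-02-01T09:30:00Z",
    "source": "crm_sync",             # attribution
    "confidence": 0.9,
    "links": ["mem-118", "mem-204"],  # associative links to related memories
    "superseded": False,
}
```

Adding a field later — say, an embedding or a retention tag — is a code change, not a schema migration, which is the flexibility the paragraph above describes.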



<p><strong>Hybrid retrieval</strong> addresses the access pattern problem. Agent memory queries rarely fit a single retrieval mode: A typical query needs memories semantically similar to the current task <em>and</em> created within the last hour <em>and</em> tagged with a specific workflow ID <em>and</em> not marked as superseded. MongoDB Atlas Vector Search combines vector similarity, full-text search, and filtered queries in single operations, avoiding the complexity of stitching together separate retrieval systems.</p>
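<p>The combined query described above might look like the following aggregation pipeline, shown here as plain Python data. The index name, field names, and filter choices are assumptions for illustration:</p>

```python
# Sketch of a single Atlas Vector Search operation that combines vector
# similarity with metadata narrowing. Index name, field names, and the
# filter fields are illustrative assumptions.
def memory_query(query_vector, workflow_id, since_ts,
                 index="memory_index", limit=10):
    """Build one aggregation pipeline: semantic similarity AND recency
    AND workflow scope AND not-superseded, in a single call."""
    return [
        {"$vectorSearch": {
            "index": index,                # assumed Atlas index name
            "path": "embedding",           # assumed embedding field
            "queryVector": query_vector,
            "numCandidates": 20 * limit,
            "limit": limit,
            "filter": {
                "workflow_id": {"$eq": workflow_id},
                "superseded": {"$eq": False},
                "created_at": {"$gte": since_ts},
            },
        }},
        {"$project": {"content": 1, "created_at": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]
```

Because the metadata filter runs inside the same `$vectorSearch` stage, there is no second system to stitch in and no post-filtering step that silently discards most of the candidates.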



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image.png" alt="Hybrid search" class="wp-image-18135" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image.png 960w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image-300x169.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /><figcaption class="wp-element-caption"><em>Figure 11. Hybrid search: vector similarity, full-text search, and filtered queries combined in a single operation</em></figcaption></figure>



<p><strong>Atomic operations</strong> provide the consistency primitives that coordination requires. When an agent updates task status from pending to complete, the update succeeds entirely or fails entirely. Other agents querying task status never observe partial updates. This is standard MongoDB functionality—findAndModify, conditional updates, multidocument transactions—but it&#8217;s infrastructure that simpler storage backends lack.</p>
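<p>The key pattern is putting the precondition in the filter, so a status transition happens exactly once even when two agents race. A sketch using pymongo-style filter and update documents (the database call itself is commented out since it needs a live cluster; collection and field names are illustrative):</p>

```python
# Atomic status transition: the precondition ("still pending") lives in
# the filter, so two agents racing to complete the same task cannot both
# succeed. Field names are illustrative assumptions.
def complete_task_op(task_id, agent_id):
    filter_doc = {"_id": task_id, "status": "pending"}  # precondition
    update_doc = {"$set": {"status": "complete", "completed_by": agent_id}}
    return filter_doc, update_doc

# With pymongo this would run atomically as:
#   tasks.find_one_and_update(*complete_task_op(task_id, "agent-a"))
# which returns None when another agent has already won the race.
```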



<p><strong>Change streams</strong> enable event-driven architectures. Applications can subscribe to database changes and react when relevant state updates, rather than polling. This becomes a building block for memory systems that need to propagate updates across agents.</p>



<p>Teams implement memory engineering on MongoDB through three paths. Some build directly on the database, using the document model and search capabilities to create custom memory architectures matched to their specific coordination patterns. Others work through orchestration frameworks—LangChain, LlamaIndex, CrewAI—that provide MongoDB integrations for their memory abstractions. Still others adopt dedicated memory providers like Mem0 or Agno, which handle the memory logic while using MongoDB as the underlying storage layer.</p>



<p>The flexibility matters because memory engineering isn&#8217;t a single pattern. Different agent architectures need different memory topologies, different consistency guarantees, different retrieval strategies. A database that prescribes one approach would fit some use cases and break others. MongoDB provides primitives; teams compose them into the memory systems their agents require.</p>



<h2 class="wp-block-heading">Shared memory enables heterogeneous agent teams</h2>



<h3 class="wp-block-heading">Homogeneous systems can be replaced by single agents</h3>



<p>The deeper payoff of memory engineering is enabling agent architectures that wouldn&#8217;t otherwise be viable.</p>



<p><a href="https://arxiv.org/abs/2601.12307" target="_blank" rel="noreferrer noopener">Xu et al.</a> observe that many deployed multi-agent systems are so homogeneous—same base model everywhere, agents differentiated only by prompts—that a single model can simulate the entire workflow with equivalent results and lower overhead. Their OneFlow optimization demonstrates this by reusing KV cache across simulated &#8220;agents&#8221; within a single execution, eliminating coordination costs while preserving workflow structure.</p>



<p>The implication: If a single agent can replace your multi-agent system, you haven&#8217;t built a team. You&#8217;ve built an expensive way to run one model.</p>



<h3 class="wp-block-heading">Small models need external memory to coordinate</h3>



<p>Genuine multi-agent value comes from heterogeneity. Different models with different capabilities operating at different price points for different subtasks. <a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">Belcak et al.</a> make the case that most work agents do in production isn&#8217;t complex reasoning—it&#8217;s routine execution of well-defined operations. Parsing a response, formatting an output, invoking a tool with specific parameters. These tasks don&#8217;t require frontier model capabilities, and the cost difference is dramatic: Their analysis puts the gap at 10x–30x between serving a 7B parameter model versus a 70–175B parameter model when you factor in latency, energy, and compute. Large models should be reserved for the genuinely hard problems, not deployed uniformly across every step.</p>



<p><a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">Belcak et al.</a> also highlight an operational advantage: Smaller models can be retrained and adapted much faster. When an agent needs new capabilities or exhibits problematic behaviors, the turnaround for fine-tuning a 7B model is measured in hours, not days. This connects to memory engineering because fine-tuning represents an alternative to retrieval—you can bake procedural knowledge directly into model weights rather than surfacing it from external storage at runtime. The choice between the procedural memory pillar and model specialization becomes a design decision rather than a constraint.</p>



<p>This architecture—small models by default, large models for hard problems—depends on shared memory. Small models can&#8217;t maintain the context required for coordination on their own. They rely on external memory to participate in larger workflows. Memory engineering makes heterogeneous teams viable; without it, every agent must be large enough to maintain full context independently, which defeats the cost optimization that motivates heterogeneity in the first place.</p>



<h2 class="wp-block-heading">Building the foundation</h2>



<p>Multi-agent systems fail for structural reasons: context degrades across agents, errors propagate through shared interactions, costs multiply with redundant operations, and state diverges when nothing enforces consistency. These problems don&#8217;t resolve with better prompts or more sophisticated orchestration. They require infrastructure.</p>



<p>Memory engineering provides that infrastructure through a coherent taxonomy of memory types, persistence with explicit lifecycle rules, retrieval tuned to agent access patterns, coordination that defines clear sharing boundaries, and consistency that maintains shared truth under concurrent updates.</p>



<p>The organizations that make multi-agent systems work in production won&#8217;t be distinguished by agent count or model capability. They&#8217;ll be the ones that invested in the memory layer that transforms independent agents into coordinated teams.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">References</h2>



<p>Anthropic. &#8220;Building a Multi-Agent Research System.&#8221; 2025. <a href="https://www.anthropic.com/engineering/multi-agent-research-system" target="_blank" rel="noreferrer noopener">https://www.anthropic.com/engineering/multi-agent-research-system</a></p>



<p>Belcak, Peter, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. &#8220;Small Language Models are the Future of Agentic AI.&#8221; arXiv:2506.02153 (2025). <a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.02153</a></p>



<p>Bousetouane, Fouad. &#8220;AI Agents Need Memory Control Over More Context.&#8221; arXiv:2601.11653 (2026). <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.11653</a></p>



<p>Breunig, Drew. &#8220;How Contexts Fail—and How to Fix Them.&#8221; June 22, 2025. <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html</a></p>



<p>Carnegie Mellon University. &#8220;AgentCompany: Building Agent Teams for the Future of Work.&#8221; 2025. <a href="https://www.cs.cmu.edu/news/2025/agent-company" target="_blank" rel="noreferrer noopener">https://www.cs.cmu.edu/news/2025/agent-company</a></p>



<p>Cemri, Mert, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. &#8220;Why Do Multi-Agent LLM Systems Fail?&#8221; arXiv:2503.13657 (2025). <a href="https://arxiv.org/abs/2503.13657" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2503.13657</a></p>



<p>Chroma Research. &#8220;Context Rot: How Increasing Context Length Degrades Model Performance.&#8221; 2025. <a href="https://research.trychroma.com/context-rot" target="_blank" rel="noreferrer noopener">https://research.trychroma.com/context-rot</a></p>



<p>Han, Shanshan, Qifan Zhang, Yuhang Yao, Weizhao Jin, and Zhaozhuo Xu. &#8220;LLM Multi-Agent Systems: Challenges and Open Problems.&#8221; arXiv:2402.03578 (2024). <a href="https://arxiv.org/abs/2402.03578" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2402.03578</a></p>



<p>LangChain Blog (Sydney Runkle). &#8220;Choosing the Right Multi-Agent Architecture.&#8221; January 14, 2026. <a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener">https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/</a></p>



<p>Manus AI. &#8220;Context Engineering for AI Agents: Lessons from Building Manus.&#8221; 2025. <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus</a></p>



<p>Schmid, Philipp. &#8220;Context Engineering.&#8221; 2025. <a href="https://www.philschmid.de/context-engineering" target="_blank" rel="noreferrer noopener">https://www.philschmid.de/context-engineering</a></p>



<p>Xu, Jiawei, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, and Ying Ding. &#8220;Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline.&#8221; arXiv:2601.12307 (2026). <a href="https://arxiv.org/abs/2601.12307" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.12307</a></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td>To explore memory engineering further, start experimenting with memory architectures using MongoDB Atlas or review our detailed tutorials available at <a href="https://www.mongodb.com/resources/use-cases/artificial-intelligence/?utm_campaign=devrel&amp;utm_source=cross-post&amp;utm_medium=cta&amp;utm_content=memory-for-multiagent-systems&amp;utm_term=mikiko.b&amp;utm_campaign=devrel&amp;utm_source=third-party-content&amp;utm_medium=cta&amp;utm_content=multi-agent-oreily&amp;utm_term=tony.kim" target="_blank" rel="noreferrer noopener">AI Learning Hub</a>.</td></tr></tbody></table></figure>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/why-multi-agent-systems-need-memory-engineering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Control Planes for Autonomous AI: Why Governance Has to Move Inside the System</title>
		<link>https://www.oreilly.com/radar/control-planes-for-autonomous-ai-why-governance-has-to-move-inside-the-system/</link>
				<comments>https://www.oreilly.com/radar/control-planes-for-autonomous-ai-why-governance-has-to-move-inside-the-system/#respond</comments>
				<pubDate>Tue, 24 Feb 2026 12:16:14 +0000</pubDate>
					<dc:creator><![CDATA[Varun Raj]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18117</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-AI-shifts-to-runtime-architecture.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-AI-shifts-to-runtime-architecture-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Scaling agentic AI is forcing a shift from external policy to runtime architecture.]]></custom:subtitle>
		
				<description><![CDATA[For most of the past decade, AI governance lived comfortably outside the systems it was meant to regulate. Policies were written. Reviews were conducted. Models were approved. Audits happened after the fact. As long as AI behaved like a tool—producing predictions or recommendations on demand—that separation mostly worked. That assumption is breaking down. As AI [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>For most of the past decade, AI governance lived comfortably outside the systems it was meant to regulate. Policies were written. Reviews were conducted. Models were approved. Audits happened after the fact. As long as AI behaved like a tool—producing predictions or recommendations on demand—that separation mostly worked. That assumption is breaking down.</p>



<p>As AI systems move from assistive components to autonomous actors, governance imposed from the outside no longer scales. The problem isn’t that organizations lack policies or oversight frameworks. It’s that those controls are detached from where decisions are actually formed. Increasingly, the only place governance can operate effectively is inside the AI application itself, at runtime, while decisions are being made. This isn’t a philosophical shift. It’s an architectural one.</p>



<h2 class="wp-block-heading"><strong>When AI Fails Quietly</strong></h2>



<p>One of the more unsettling aspects of autonomous AI systems is that their most consequential failures rarely look like failures at all. Nothing crashes. Latency stays within bounds. Logs look clean. The system behaves coherently—just not correctly. An agent escalates a workflow that should have been contained. A recommendation drifts slowly away from policy intent. A tool is invoked in a context that no one explicitly approved, yet no explicit rule was violated.</p>



<p>These failures are hard to detect because they emerge from behavior, not bugs. Traditional governance mechanisms don’t help much here. Predeployment reviews assume decision paths can be anticipated in advance. Static policies assume behavior is predictable. Post hoc audits assume intent can be reconstructed from outputs. None of those assumptions holds once systems reason dynamically, retrieve context opportunistically, and act continuously. At that point, governance isn’t missing—it’s simply in the wrong place.</p>



<h2 class="wp-block-heading"><strong>The Scaling Problem No One Owns</strong></h2>



<p>Most organizations already feel this tension, even if they don’t describe it in architectural terms. Security teams tighten access controls. Compliance teams expand review checklists. Platform teams add more logging and dashboards. Product teams add additional prompt constraints. Each layer helps a little. None of them addresses the underlying issue.</p>



<p>What’s really happening is that governance responsibility is being fragmented across teams that don’t own system behavior end-to-end. No single layer can explain why the system acted—only that it acted. As autonomy increases, the gap between intent and execution widens, and accountability becomes diffuse. This is a classic scaling problem. And like many scaling problems before it, the solution isn’t more rules. It’s a different system architecture.</p>



<h2 class="wp-block-heading"><strong>A Familiar Pattern from Infrastructure History</strong></h2>



<p>We’ve seen this before. In early networking systems, control logic was tightly coupled to packet handling. As networks grew, this became unmanageable. Separating the control plane from the data plane allowed policy to evolve independently of traffic and made failures diagnosable rather than mysterious.</p>



<p>Cloud platforms went through a similar transition. Resource scheduling, identity, quotas, and policy moved out of application code and into shared control systems. That separation is what made hyperscale cloud viable. Autonomous AI systems are approaching a comparable inflection point.</p>



<p>Right now, governance logic is scattered across prompts, application code, middleware, and organizational processes. None of those layers was designed to assert authority continuously while a system is reasoning and acting. What’s missing is a control plane for AI—not as a metaphor but as a real architectural boundary.</p>



<h2 class="wp-block-heading"><strong>What “Governance Inside the System” Actually Means</strong></h2>



<p>When people hear “governance inside AI,” they often imagine stricter rules baked into prompts or more conservative model constraints. That’s not what this is about.</p>



<p>Embedding governance inside the system means separating decision execution from decision authority. Execution includes inference, retrieval, memory updates, and tool invocation. Authority includes policy evaluation, risk assessment, permissioning, and intervention. In most AI applications today, those concerns are entangled—or worse, implicit.</p>



<p>A control-plane-based design makes that separation explicit. Execution proceeds but under continuous supervision. Decisions are observed as they form, not inferred after the fact. Constraints are evaluated dynamically, not assumed ahead of time. Governance stops being a checklist and starts behaving like infrastructure.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="284" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems.jpg" alt="Execution from governance separation in AI systems" class="wp-image-18118" style="width:550px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems.jpg 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems-300x166.jpg 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 1. Separating execution from governance in autonomous AI systems</em></figcaption></figure>



<p>Reasoning, retrieval, memory, and tool invocation operate in the execution plane, while a runtime control plane continuously evaluates policy, risk, and authority—observing and intervening without being embedded in application logic.</p>
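<p>As a minimal sketch of this separation, the execution plane below routes every decision through a control plane that holds authority and records why each action was allowed or blocked. All names here (<code>Decision</code>, <code>ControlPlane</code>, the CRM policy) are hypothetical illustrations, not an API from any real governance product:</p>

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str   # e.g. "invoke_tool"
    target: str   # e.g. "crm.update_record"
    risk: float   # risk score produced in the execution plane

class ControlPlane:
    """Holds authority: evaluates every decision before it executes."""
    def __init__(self, policy: Callable[[Decision], bool]):
        self.policy = policy
        # Records not just what happened but why it was allowed to happen.
        self.audit_log: list[tuple[Decision, bool]] = []

    def authorize(self, decision: Decision) -> bool:
        allowed = self.policy(decision)
        self.audit_log.append((decision, allowed))
        return allowed

def execute(decision: Decision, control: ControlPlane) -> str:
    # Execution plane: acts only with authorization, never assumes it.
    if not control.authorize(decision):
        return f"escalated: {decision.target}"
    return f"executed: {decision.target}"

# Hypothetical policy: block high-risk writes to the CRM, allow the rest.
control = ControlPlane(lambda d: not (d.risk > 0.8 and d.target.startswith("crm.")))
print(execute(Decision("invoke_tool", "crm.update_record", risk=0.9), control))
```

<p>The point of the sketch is the boundary: The policy lives outside application logic, so it can change without touching the execution code, and the audit log captures the authorization decision at the moment it was made.</p>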



<h2 class="wp-block-heading">Where Governance Breaks First</h2>



<p>In practice, governance failures in autonomous AI systems tend to cluster around three surfaces.</p>



<p><strong>Reasoning</strong>. Systems form intermediate goals, weigh options, and branch decisions internally. Without visibility into those pathways, teams can’t distinguish acceptable variance from systemic drift.</p>



<p><strong>Retrieval</strong>. Autonomous systems pull in context opportunistically. That context may be outdated, inappropriate, or out of scope—and once it enters the reasoning process, it’s effectively invisible unless explicitly tracked.</p>



<p><strong>Action</strong>. Tool use is where intent becomes impact. Systems increasingly invoke APIs, modify records, trigger workflows, or escalate issues without human review. Static authorization models don’t map cleanly onto dynamic decision contexts.</p>



<p>These surfaces are interconnected, but they fail independently. Treating governance as a single monolithic concern leads to brittle designs and false confidence.</p>



<h2 class="wp-block-heading">Control Planes as Runtime Feedback Systems</h2>



<p>A useful way to think about AI control planes is not as gatekeepers but as feedback systems. Signals flow continuously from execution into governance: confidence degradation, policy boundary crossings, retrieval drift, and action escalation patterns. Those signals are evaluated in real time, not weeks later during audits. Responses flow back: throttling, intervention, escalation, or constraint adjustment.</p>



<p>This is fundamentally different from monitoring outputs. Output monitoring tells you what happened. Control plane telemetry tells you why it was allowed to happen. That distinction matters when systems operate continuously, and consequences compound over time.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="512" height="306" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop.jpg" alt="" class="wp-image-18119" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop.jpg 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop-300x179.jpg 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 2. Runtime governance as a feedback loop</em></figcaption></figure>



<p>Behavioral telemetry flows from execution into the control plane, where policy and risk are evaluated continuously. Enforcement and intervention feed back into execution before failures become irreversible.</p>






<h2 class="wp-block-heading">A Failure Story That Should Sound Familiar</h2>



<p>Consider a customer-support agent operating across billing, policy, and CRM systems.</p>



<p>Over several months, policy documents are updated. Some are reindexed quickly. Others lag. The agent continues to retrieve context and reason coherently, but its decisions increasingly reflect outdated rules. No single action violates policy outright. Metrics remain stable. Customer satisfaction erodes slowly.</p>



<p>Eventually, an audit flags a noncompliant action. At that point, teams scramble. Logs show what the agent did but not why. They can’t reconstruct which documents influenced which decisions, when those documents were last updated, or why the agent believed its actions were valid at the time.</p>



<p>This isn’t a logging failure. It’s the absence of a governance feedback loop. A control plane wouldn’t prevent every mistake, but it would surface drift early—when intervention is still cheap.</p>



<h2 class="wp-block-heading">Why External Governance Can’t Catch Up</h2>



<p>It’s tempting to believe better tooling, stricter reviews, or more frequent audits will solve this problem. They won’t.</p>



<p>External governance operates on snapshots. Autonomous AI operates on streams. The mismatch is structural. By the time an external process observes a problem, the system has already moved on—often repeatedly. That doesn’t mean governance teams are failing. It means they’re being asked to regulate systems whose operating model has outgrown their tools. The only viable alternative is governance that runs at the same cadence as execution.</p>



<h2 class="wp-block-heading">Authority, Not Just Observability</h2>



<p>One subtle but important point: Control planes aren’t just about visibility. They’re about authority.</p>



<p>Observability without enforcement creates a false sense of safety. Seeing a problem after it occurs doesn’t prevent it from recurring. Control planes must be able to act—to pause, redirect, constrain, or escalate behavior in real time.</p>



<p>That raises uncomfortable questions. How much autonomy should systems retain? When should humans intervene? How much latency is acceptable for policy evaluation? There are no universal answers. But those trade-offs can only be managed if governance is designed as a first-class runtime concern, not an afterthought.</p>



<h2 class="wp-block-heading">The Architectural Shift Ahead</h2>



<p>The move from guardrails to control loops mirrors earlier transitions in infrastructure. Each time, the lesson was the same: Static rules don’t scale under dynamic behavior. Feedback does.</p>



<p>AI is entering that phase now. Governance won’t disappear. But it will change shape. It will move inside systems, operate continuously, and assert authority at runtime. Organizations that treat this as an architectural problem—not a compliance exercise—will adapt faster and fail more gracefully. Those who don’t will spend the next few years chasing incidents they can see, but never quite explain.</p>



<h2 class="wp-block-heading">Closing Thought</h2>



<p>Autonomous AI doesn’t require less governance. It requires governance that understands autonomy.</p>



<p>That means moving beyond policies as documents and audits as events. It means designing systems where authority is explicit, observable, and enforceable while decisions are being made. In other words, governance must become part of the system—not something applied to it.</p>



<h2 class="wp-block-heading">Further Reading</h2>



<ul class="wp-block-list">
<li>“AI Governance Frameworks for Responsible AI,” Gartner Peer Community, <a href="https://www.gartner.com/peer-community/oneminuteinsights/omi-ai-governance-frameworks-responsible-ai-33q" target="_blank" rel="noreferrer noopener">https://www.gartner.com/peer-community/oneminuteinsights/omi-ai-governance-frameworks-responsible-ai-33q</a>.</li>



<li>Lauren Kornutick et al., “Market Guide for AI Governance Platforms,” Gartner, November 4, 2025, <a href="https://www.gartner.com/en/documents/7145930" target="_blank" rel="noreferrer noopener">https://www.gartner.com/en/documents/7145930</a>.</li>



<li>Svetlana Sicular, “AI’s Next Frontier Demands a New Approach to Ethics, Governance, and Compliance,” Gartner, November 10, 2025, <a href="https://www.gartner.com/en/articles/ai-ethics-governance-and-compliance" target="_blank" rel="noreferrer noopener">https://www.gartner.com/en/articles/ai-ethics-governance-and-compliance</a>.</li>



<li><a href="https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf" target="_blank" rel="noreferrer noopener"><em>AI Risk Management Framework (AI RMF 1.0)</em></a>, NIST, January 2023, <a href="https://doi.org/10.6028/NIST.AI.100-1" target="_blank" rel="noreferrer noopener">https://doi.org/10.6028/NIST.AI.100-1</a>.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/control-planes-for-autonomous-ai-why-governance-has-to-move-inside-the-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Hidden Cost of Agentic Failure</title>
		<link>https://www.oreilly.com/radar/the-hidden-cost-of-agentic-failure/</link>
				<comments>https://www.oreilly.com/radar/the-hidden-cost-of-agentic-failure/#respond</comments>
				<pubDate>Mon, 23 Feb 2026 12:17:47 +0000</pubDate>
					<dc:creator><![CDATA[Nicole Koenigstein]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18110</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Hidden-cost-of-agentic-failure.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Hidden-cost-of-agentic-failure-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why multi-agent systems are probabilistic pipelines]]></custom:subtitle>
		
				<description><![CDATA[Agentic AI has clearly moved beyond buzzword status. McKinsey’s November 2025 survey shows that 62% of organizations are already experimenting with AI agents, and the top performers are pushing them into core workflows in the name of efficiency, growth, and innovation. However, this is also where things can get uncomfortable. Everyone in the field knows [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Agentic AI has clearly moved beyond buzzword status. McKinsey’s November 2025 survey shows that <a href="https://www.mckinsey.com/~/media/mckinsey/business%20functions/quantumblack/our%20insights/the%20state%20of%20ai/november%202025/the-state-of-ai-2025-agents-innovation_cmyk-v1.pdf" target="_blank" rel="noreferrer noopener">62% of organizations are already experimenting with AI agents</a>, and the top performers are pushing them into core workflows in the name of efficiency, growth, and innovation.</p>



<p>However, this is also where things can get uncomfortable. Everyone in the field knows LLMs are probabilistic. We all track leaderboard scores, but then quietly ignore that this uncertainty compounds when we wire multiple models together. That’s the blind spot. Most multi-agent systems (MAS) don’t fail because the models are bad. They fail because we compose them as if probability doesn’t compound.</p>



<h2 class="wp-block-heading"><strong>The Architectural Debt of Multi-Agent Systems</strong></h2>



<p>The hard truth is that improving individual agents does very little to improve overall system-level reliability once errors are allowed to propagate unchecked. The core problem of agentic systems in production isn’t model quality alone; it’s composition. Once agents are wired together without validation boundaries, risk compounds.</p>



<p>In practice, this shows up in looping supervisors, runaway token costs, brittle workflows, and failures that appear intermittently and are nearly impossible to reproduce. These systems often work just well enough to pass benchmarks, then fail unpredictably once they are placed under real operational load.</p>



<p>If you think about it, every agent handoff introduces a chance of failure. Chain enough of them together, and failure compounds. Even strong models with a 98% per-agent success rate can quickly degrade overall system success to 90% or lower. Each unchecked agent hop multiplies failure probability and, with it, expected cost. Without explicit fault tolerance, agentic systems aren’t just fragile. They are economically problematic.</p>



<p>This is the key shift in perspective. In production, MAS shouldn’t be thought of as collections of intelligent components. They behave like probabilistic pipelines, where every unvalidated handoff multiplies uncertainty and expected cost.</p>



<p>This is where many organizations are quietly accumulating what I call architectural debt. In software engineering, we are comfortable talking about technical debt: development shortcuts that make systems harder to maintain over time. Agentic systems introduce a new form of debt. Every unvalidated agent boundary adds probabilistic risk that doesn’t show up in unit tests but surfaces later as instability, cost overruns, and unpredictable behavior at scale. And unlike technical debt, this one doesn’t get paid down with refactors or cleaner code. It accumulates silently, until the math catches up with you.</p>



<h2 class="wp-block-heading"><strong>The Multi-Agent Reliability Tax</strong></h2>



<p>If you treat each agent’s task as an independent <a href="https://en.wikipedia.org/wiki/Bernoulli_trial" target="_blank" rel="noreferrer noopener">Bernoulli trial</a>, a simple experiment with a binary outcome of success (<em>p</em>) or failure (<em>q</em>), probability becomes a harsh mistress. Look closely and you’ll find yourself at the mercy of the product reliability rule once you start building MAS. In systems engineering, this effect is formalized by <a href="https://en.wikipedia.org/wiki/Lusser%27s_law" target="_blank" rel="noreferrer noopener">Lusser’s law</a>, which states that when independent components are executed in sequence, overall system success is the product of their individual success probabilities. While this is a simplified model, it captures the compounding effect that is otherwise easy to underestimate in composed MAS.</p>



<p>Consider a high-performing agent with a single-task accuracy of <em>p </em>= 0<em>.</em>98 (98%). If you apply the product rule for independent events to a sequential pipeline, you can model how your total system accuracy unfolds. That is, if you assume each agent succeeds with probability <em>p<sub>i</sub></em>, your failure probability is <em>q<sub>i </sub></em>= 1 <em>− p<sub>i</sub></em>. Applied to a multi-agent pipeline, this gives you:</p>



<div class="wp-block-math"><math display="block"><semantics><mrow><mi>P</mi><mo form="prefix" stretchy="false">(</mo><mtext> system&nbsp;success </mtext><mo form="postfix" stretchy="false">)</mo><mo>=</mo><mrow><munderover><mo movablelimits="false">∏</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover></mrow><msub><mi>p</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">P(\text{\,system success\,}) = \prod_{i=1}^{N} p_i</annotation></semantics></math></div>



<p>Table 1 illustrates how errors propagate through your agent system when there is no validation.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong># of agents (</strong><em>n</em><strong>)</strong></td><td><strong>Per-agent accuracy (</strong><em>p</em><strong>)</strong></td><td><strong>System accuracy (</strong><em>p</em><em><sup>n</sup></em><strong>)</strong></td><td><strong>Error rate</strong></td></tr><tr><td>1 agent</td><td>98%</td><td>98.0%</td><td>2.0%</td></tr><tr><td>3 agents</td><td>98%</td><td><em>∼</em>94.1%</td><td><em>∼</em>5.9%</td></tr><tr><td>5 agents</td><td>98%</td><td><em>∼</em>90.4%</td><td><em>∼</em>9.6%</td></tr><tr><td>10 agents</td><td>98%</td><td><em>∼</em><strong>81.7%</strong></td><td><em>∼</em><strong>18.3%</strong></td></tr></tbody></table><figcaption class="wp-element-caption"><em>Table 1. System accuracy decay in a sequential multi-agent pipeline without validation</em></figcaption></figure>
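<p>The decay in Table 1 follows directly from the product rule; a few lines of Python reproduce the numbers:</p>

```python
# Lusser's law: in a sequential pipeline of independent agents,
# system success is the product of per-agent success probabilities.
def system_accuracy(p: float, n: int) -> float:
    """Success probability of an n-agent chain with per-agent accuracy p."""
    return p ** n

for n in (1, 3, 5, 10):
    print(f"{n:>2} agents: {system_accuracy(0.98, n):.1%}")
```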



<p>In production, LLMs aren’t 98% reliable on structured outputs in open-ended tasks: Such tasks have no single correct output, so correctness must be enforced structurally rather than assumed. Once an agent introduces a wrong assumption, a malformed schema, or a hallucinated tool result, every downstream agent conditions on that corrupted state. This is why you should insert validation gates to break the product rule of reliability.</p>



<h2 class="wp-block-heading"><strong>From Stochastic Hope to Deterministic Engineering</strong></h2>



<p>If you introduce validation gates, you change how failure behaves inside your system. Instead of allowing one agent’s output to become the unquestioned input for the next, you force every handoff to pass through an explicit boundary. The system no longer assumes correctness. It verifies it.</p>



<p>In practice, you’d want schema-enforced generation via libraries like <a href="https://docs.pydantic.dev/latest/" target="_blank" rel="noreferrer noopener">Pydantic</a> and <a href="https://python.useinstructor.com/" target="_blank" rel="noreferrer noopener">Instructor</a>. Pydantic is a data validation library for Python that lets you define a strict contract for what is allowed to pass between agents: Types, fields, ranges, and invariants are checked at the boundary, and invalid outputs are rejected or corrected before they can propagate. Instructor moves that same contract into the generation step itself by forcing the model to retry until it produces a valid output or exhausts a bounded retry budget. Once validation exists, the reliability math fundamentally changes. If validation catches failures with probability <em>v</em>, each hop becomes:</p>



<div class="wp-block-math"><math display="block"><semantics><mrow><mi>p</mi><mspace width="0.1667em"></mspace><mtext>effective</mtext><mo>=</mo><mi>p</mi><mo>+</mo><mo form="prefix" stretchy="false">(</mo><mn>1</mn><mo>−</mo><mi>p</mi><mo form="postfix" stretchy="false">)</mo><mspace width="0.1667em"></mspace><mtext>·</mtext><mspace width="0.1667em"></mspace><mi>v</mi></mrow><annotation encoding="application/x-tex">p\,{\text{effective}} = p + (1-p)\,·\,v </annotation></semantics></math></div>



<p>Again, assume you have a per-agent accuracy of <em>p </em>= 0<em>.</em>98, but now you have a validation catch rate of <em>v </em>= 0<em>.</em>9, then you get:</p>



<div class="wp-block-math"><math display="block"><semantics><mrow><mi>p</mi><mspace width="0.1667em"></mspace><mtext>effective</mtext><mo>=</mo><mn>0.98</mn><mo>+</mo><mn>0.02</mn><mspace width="0.1667em"></mspace><mo>⋅</mo><mspace width="0.1667em"></mspace><mn>0.9</mn><mo>=</mo><mn>0.998</mn></mrow><annotation encoding="application/x-tex">p\,{\text{effective}}=0.98+0.02\,\cdot\,0.9=0.998</annotation></semantics></math></div>



<p>The +0<em>.</em>02 <em>· </em>0<em>.</em>9 term reflects the recovered failures, since success and recovery-after-failure are disjoint events. Table 2 shows how this changes your system’s behavior.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong># of agents (</strong><em>n</em><strong>)</strong></td><td><strong>Effective per-agent accuracy (</strong><em>p</em><sub>eff</sub><strong>)</strong></td><td><strong>System accuracy (</strong><em>p</em><sub>eff</sub><em><sup>n</sup></em><strong>)</strong></td><td><strong>Error rate</strong></td></tr><tr><td>1 agent</td><td>99.8%</td><td>99.8%</td><td>0.2%</td></tr><tr><td>3 agents</td><td>99.8%</td><td><em>∼</em>99.4%</td><td><em>∼</em>0.6%</td></tr><tr><td>5 agents</td><td>99.8%</td><td><em>∼</em>99.0%</td><td><em>∼</em>1.0%</td></tr><tr><td>10 agents</td><td>99.8%</td><td><em>∼</em><strong>98.0%</strong></td><td><em>∼</em><strong>2.0%</strong></td></tr></tbody></table><figcaption class="wp-element-caption"><em>Table 2. System accuracy decay in a sequential multi-agent pipeline with validation</em></figcaption></figure>
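<p>The same calculation, with the validation term added, reproduces Table 2:</p>

```python
# Effective per-hop accuracy when a validation gate recovers a
# fraction v of failures: p_eff = p + (1 - p) * v
def effective_accuracy(p: float, v: float) -> float:
    return p + (1 - p) * v

p_eff = effective_accuracy(0.98, 0.9)  # 0.998
for n in (1, 3, 5, 10):
    print(f"{n:>2} agents: {p_eff ** n:.1%}")
```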



<p>Comparing Table 1 and Table 2 makes the effect explicit: Validation fundamentally changes how failure propagates through your MAS. It’s no longer naive multiplicative decay; it’s controlled reliability amplification. If you want a deeper, implementation-level walkthrough of validation patterns for MAS, I cover it in <a href="https://learning.oreilly.com/library/view/ai-agents-the/0642572247775/" target="_blank" rel="noreferrer noopener"><em>AI Agents: The Definitive Guide</em></a>. You can also find a notebook in the <a href="https://github.com/Nicolepcx/ai-agents-the-definitive-guide/tree/main/CH05" target="_blank" rel="noreferrer noopener">GitHub repository</a> to run the computation from Table 1 and Table 2. Now you might ask what you can do if you can’t make your models 100% perfect. The good news is that you can make the system more resilient through specific architectural shifts.</p>
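<p>As a minimal sketch of such a validation gate, assuming Pydantic v2: The <code>TicketTriage</code> contract is hypothetical, and the <code>repair</code> stub stands in for re-prompting the agent (which Instructor would handle for you in the generation step itself):</p>

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical contract for what one agent may hand to the next.
class TicketTriage(BaseModel):
    ticket_id: str
    severity: int = Field(ge=1, le=5)   # invariant enforced at the boundary
    summary: str = Field(min_length=1)

def repair(output: dict) -> dict:
    # Stand-in for re-prompting the agent; here we just clamp severity.
    output = dict(output)
    output["severity"] = max(1, min(5, int(output.get("severity", 3))))
    return output

def validation_gate(raw_output: dict, retries: int = 3) -> TicketTriage:
    """Reject or repair invalid agent output instead of propagating it."""
    for _ in range(retries):
        try:
            return TicketTriage.model_validate(raw_output)
        except ValidationError:
            raw_output = repair(raw_output)
    raise RuntimeError("agent output failed validation; halting the pipeline")

triaged = validation_gate({"ticket_id": "T-1", "severity": 9, "summary": "billing"})
print(triaged.severity)
```

<p>The gate either produces an object that satisfies the contract or raises, so no downstream agent ever conditions on a malformed handoff.</p>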



<h2 class="wp-block-heading"><strong>From Deterministic Engineering to Exploratory Search</strong></h2>



<p>While validation keeps your system from breaking, it doesn’t necessarily help the system find the right answer when the task is difficult. For that, you need to move from filtering to searching. Now you give your agent a way to generate multiple candidate paths to replace fragile one-shot execution with a controlled search over alternatives. This is commonly referred to as test-time compute. Instead of committing to the first sampled output, the system allocates additional inference budget to explore multiple candidates before making a decision. Reliability improves not because your model is better but because your system delays commitment.</p>



<p>At the simplest level, this doesn’t require anything sophisticated. Even a basic best-of-<em>N </em>strategy already improves system stability. For instance, if you sample multiple independent outputs and select the best one, you reduce the chance of committing to a bad draw. This alone is often enough to stabilize brittle pipelines that fail under single-shot execution.</p>
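<p>A best-of-<em>N</em> loop is only a few lines. In this sketch both the stochastic <code>agent</code> and the <code>score</code> verifier are hypothetical stand-ins (a real scorer might be a schema check or an LLM-as-judge); the seed is fixed only to make the example reproducible:</p>

```python
import random

def agent(task: str, rng: random.Random) -> str:
    # Hypothetical stochastic agent: sometimes returns a bad (empty) draw.
    return task.upper() if rng.random() < 0.7 else ""

def score(candidate: str) -> float:
    # Hypothetical verifier: here, longer non-empty output scores higher.
    return float(len(candidate))

def best_of_n(task: str, n: int = 5, seed: int = 0) -> str:
    # Sample n independent candidates, commit only to the best-scoring one.
    rng = random.Random(seed)
    candidates = [agent(task, rng) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("summarize ticket t-1"))
```

<p>With a 30% per-draw failure rate, a single-shot call fails 30% of the time; with <em>n</em> = 5 independent draws, all five fail only 0.3<sup>5</sup> ≈ 0.2% of the time, which is the whole point of delaying commitment.</p>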



<p>One effective way to select the best of multiple samples is to use a framework like <a href="https://openpipe.ai/blog/ruler" target="_blank" rel="noreferrer noopener">RULER</a>. RULER (Relative Universal LLM-Elicited Rewards) is a general-purpose reward function that uses a configurable LLM-as-judge along with a ranking rubric you can adjust to your use case. This works because ranking several related candidate solutions is easier than scoring each one in isolation. Seeing multiple solutions side by side lets the LLM-as-judge identify deficiencies and rank the candidates accordingly. Now you get evidence-anchored verification: The judge doesn’t just agree; it verifies and compares outputs against each other. This acts as a &#8220;circuit breaker&#8221; for error propagation by resetting your failure probability at every agent boundary.</p>
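<p>To make the rank-don’t-score idea concrete, here is a hypothetical judge harness: it shows all candidates to the judge side by side and parses an ordering out of the reply. The function names and the <code>2 &gt; 1 &gt; 3</code> ranking format are my own inventions for illustration; RULER’s actual prompt and reward format are defined by OpenPipe.</p>

```python
import re

def ranking_prompt(task: str, candidates: list[str]) -> str:
    """Build a side-by-side ranking prompt for an LLM-as-judge.
    (Illustrative only; not RULER's actual prompt format.)"""
    numbered = "\n\n".join(f"[{i + 1}]\n{c}" for i, c in enumerate(candidates))
    return (
        f"Task: {task}\n\n"
        f"Candidate solutions:\n\n{numbered}\n\n"
        "Compare the candidates against each other and rank them "
        "best-to-worst, e.g. '2 > 1 > 3'."
    )

def parse_ranking(reply: str, n: int) -> list[int]:
    """Extract a best-to-worst ordering of 0-based candidate indices."""
    order = [int(tok) - 1 for tok in re.findall(r"\d+", reply)]
    assert sorted(order) == list(range(n)), "judge must rank every candidate once"
    return order

prompt = ranking_prompt("Write add(a, b)", ["return a - b", "return a + b"])
# Suppose the judge replied "2 > 1"; candidate index 1 is the winner.
winner = parse_ranking("2 > 1", n=2)[0]
```

<p>The assertion in <code>parse_ranking</code> is itself a small contract check: a judge reply that drops or duplicates a candidate is rejected instead of silently selecting the wrong output.</p>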



<h2 class="wp-block-heading"><strong>Amortized Intelligence with Reinforcement Learning</strong></h2>



<p>As a next step, you could use group-based reinforcement learning (RL), such as group relative policy optimization (GRPO)<sup>1</sup> or group sequence policy optimization (GSPO),<sup>2</sup> to turn that search into a learned policy. GRPO works at the token level, while GSPO works at the sequence level. You can take the &#8220;golden traces&#8221; (the successful reasoning paths found by your search) and use them to fine-tune your base agents. Now you aren’t just filtering errors anymore; you’re training the agents to avoid making them in the first place, because your system internalizes those corrections into its own policy. The key shift is that successful decision paths are retained and reused rather than rediscovered repeatedly at inference time.</p>
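<p>The distinguishing piece of group-based RL is the baseline: instead of a learned value model, each rollout is scored relative to its own sampling group. The stripped-down sketch below shows only that advantage computation; per-token credit assignment and the clipped policy-gradient update from the GRPO paper are deliberately omitted.</p>

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group baseline: each rollout's advantage is its reward
    relative to the group mean, normalized by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero on ties
    return [(r - mean) / std for r in rewards]

# Four rollouts of the same prompt; two succeeded (reward 1), two failed (0).
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
```

<p>Successful rollouts get positive advantages and failed ones negative, so the update pushes probability mass toward the golden traces without ever training a separate critic.</p>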



<h2 class="wp-block-heading"><strong>From Prototypes to Production</strong></h2>



<p>If you want your agentic systems to behave reliably in production, I recommend you approach agentic failure in this order:</p>



<ul class="wp-block-list">
<li>Introduce strict validation between agents. Enforce schemas and contracts so failures are caught early instead of propagating silently. </li>



<li>Use simple best-of-<em>N </em>sampling or tree-based search with lightweight judges such as RULER to score multiple candidates before committing. </li>



<li>If you need consistent behavior at scale, use RL to teach your agents how to behave more reliably for your specific use case.</li>
</ul>
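<p>The first step, strict validation at agent boundaries, can be as small as a typed contract check on every handoff. Here is a stdlib-only sketch; the field names are hypothetical, and a library like Pydantic gives you the same guarantee with less code:</p>

```python
import json

# Contract for a hypothetical "research" -> "writer" agent boundary.
# Define whatever fields your agents actually exchange.
SCHEMA = {"summary": str, "sources": list, "confidence": float}

def validate_handoff(raw: str) -> dict:
    """Reject malformed agent output at the boundary instead of passing it on."""
    payload = json.loads(raw)  # raises ValueError on invalid JSON
    for field, typ in SCHEMA.items():
        if not isinstance(payload.get(field), typ):
            raise ValueError(f"bad or missing field: {field!r}")
    return payload

ok = validate_handoff(
    '{"summary": "LLMs drift.", "sources": ["arxiv"], "confidence": 0.8}'
)
```

<p>Every downstream agent now starts from a verified payload, which is exactly the &#8220;reset the failure probability at each boundary&#8221; behavior described above.</p>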



<p>The reality is you won’t be able to fully eliminate uncertainty in your MAS, but these methods give you real leverage over how uncertainty behaves. Reliable agentic systems are built by design, not by chance.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading"><strong>References</strong></h2>



<ol class="wp-block-list">
<li>Zhihong Shao et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” 2024, <a href="https://arxiv.org/abs/2402.03300" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2402.03300</a>.</li>



<li>Chujie Zheng et al. “Group Sequence Policy Optimization,” 2025, <a href="https://arxiv.org/abs/2507.18071" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2507.18071</a>.</li>
</ol>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-hidden-cost-of-agentic-failure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>How to Write a Good Spec for AI Agents</title>
		<link>https://www.oreilly.com/radar/how-to-write-a-good-spec-for-ai-agents/</link>
				<comments>https://www.oreilly.com/radar/how-to-write-a-good-spec-for-ai-agents/#respond</comments>
				<pubDate>Fri, 20 Feb 2026 12:01:32 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18097</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Ancient-scrolls-for-agents.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Ancient-scrolls-for-agents-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[This post first appeared on Addy Osmani’s Elevate Substack newsletter and is being republished here with the author’s permission. TL;DR: Aim for a clear spec covering just enough nuance (this may include structure, style, testing, boundaries.&#160;.&#160;.) to guide the AI without overwhelming it. Break large tasks into smaller ones versus keeping everything in one large [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This post first appeared on Addy Osmani’s </em><a href="https://addyo.substack.com/p/how-to-write-a-good-spec-for-ai-agents" target="_blank" rel="noreferrer noopener">Elevate <em>Substack newsletter</em></a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p><em>TL;DR: Aim for a clear spec covering just enough nuance (this may include structure, style, testing, boundaries.&nbsp;.&nbsp;.) to guide the AI without overwhelming it. Break large tasks into smaller ones versus keeping everything in one large prompt. Plan first in read-only mode, then execute and iterate continuously.</em></p>



<p>“I’ve heard a lot about writing good specs for AI agents, but haven’t found a solid framework yet. I could write a spec that rivals an RFC, but at some point the context is too large and the model breaks down.”</p>



<p>Many developers share this frustration. Simply throwing a massive spec at an AI agent doesn’t work—context window limits and the model’s “attention budget” get in the way. The key is to write smart specs: documents that guide the agent clearly, stay within practical context sizes, and evolve with the project. This guide distills best practices from my use of coding agents including Claude Code and Gemini CLI into a framework for spec-writing that keeps your AI agents focused and productive.</p>



<p>We’ll cover five principles for great AI agent specs, each starting with a bolded takeaway.</p>



<h2 class="wp-block-heading">1. Start with a High-Level Vision and Let the AI Draft the Details</h2>



<p><strong>Kick off your project with a concise high-level spec, then have the AI expand it into a detailed plan.</strong></p>



<p>Instead of overengineering upfront, begin with a clear goal statement and a few core requirements. Treat this as a “product brief” and let the agent generate a more elaborate spec from it. This leverages the AI’s strength in elaboration while you maintain control of the direction. This works well unless you already feel you have very specific technical requirements that must be met from the start.</p>



<p><strong>Why this works</strong>:<strong> </strong>LLM-based agents excel at fleshing out details when given a solid high-level directive, but they need a clear mission to avoid drifting off course. By providing a short outline or objective description and asking the AI to produce a full specification (e.g., a spec.md), you create a persistent reference for the agent. Planning in advance matters even more with an agent: You can iterate on the plan first, then hand it off to the agent to write the code. The spec becomes the first artifact you and the AI build together.</p>



<p><strong>Practical approach</strong>:<strong> </strong>Start a new coding session by prompting:</p>



<p><code>You are an AI software engineer. Draft a detailed specification for <br>[project X] covering objectives, features, constraints, and a step-by-step plan.</code></p>



<p>Keep your initial prompt high-level: e.g., “Build a web app where users can track tasks (to-do list), with user accounts, a database, and a simple UI.”</p>



<p>The agent might respond with a structured draft spec: an overview, feature list, tech stack suggestions, data model, and so on. This spec then becomes the “source of truth” that both you and the agent can refer back to. GitHub’s AI team promotes <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/" target="_blank" rel="noreferrer noopener">spec-driven development</a> where “specs become the shared source of truth…living, executable artifacts that evolve with the project.” Before writing any code, review and refine the AI’s spec. Make sure it aligns with your vision and correct any hallucinations or off-target details.</p>



<p><strong>Use Plan Mode to enforce planning-first</strong>: Tools like Claude Code offer a <a href="https://code.claude.com/docs/en/common-workflows#use-plan-mode-for-safe-code-analysis" target="_blank" rel="noreferrer noopener">Plan Mode</a> that restricts the agent to read-only operations—it can analyze your codebase and create detailed plans but won’t write any code until you’re ready. This is ideal for the planning phase: Start in Plan Mode (Shift+Tab in Claude Code), describe what you want to build, and let the agent draft a spec while exploring your existing code. Ask it to clarify ambiguities by questioning you about the plan. Have it review the plan for architecture, best practices, security risks, and testing strategy. The goal is to refine the plan until there’s no room for misinterpretation. Only then do you exit Plan Mode and let the agent execute. This workflow prevents the common trap of jumping straight into code generation before the spec is solid.</p>



<p><strong>Use the spec as context</strong>: Once approved, save this spec (e.g., as SPEC.md) and feed relevant sections into the agent as needed. Many developers using a strong model do exactly this. The spec file persists between sessions, anchoring the AI whenever work resumes on the project. This mitigates the forgetfulness that can happen when the conversation history gets too long or when you have to restart an agent. It’s akin to how one would use a product requirements document (PRD) in a team: a reference that everyone (human or AI) can consult to stay on track. Experienced folks often “<a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/" target="_blank" rel="noreferrer noopener">write good documentation first</a> and the model may be able to build the matching implementation from that input alone” as one engineer observed. The spec is that documentation.</p>



<p><strong>Keep it goal oriented</strong>: A high-level spec for an AI agent should focus on what and why more than the nitty-gritty how (at least initially). Think of it like the user story and acceptance criteria: Who is the user? What do they need? What does success look like? (For example, “User can add, edit, complete tasks; data is saved persistently; the app is responsive and secure.”) This keeps the AI’s detailed spec grounded in user needs and outcome, not just technical to-dos. As the <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/" target="_blank" rel="noreferrer noopener">GitHub Spec Kit docs</a> put it, provide a high-level description of what you’re building and why, and let the coding agent generate a detailed specification focusing on user experience and success criteria. Starting with this big-picture vision prevents the agent from losing sight of the forest for the trees when it later gets into coding.</p>



<h2 class="wp-block-heading">2. Structure the Spec Like a Professional PRD (or SRS)</h2>



<p><strong>Treat your AI spec as a structured document (PRD) with clear sections, not a loose pile of notes.</strong></p>



<p>Many developers treat specs for agents much like traditional product requirements documents (PRDs) or system design docs: comprehensive, well-organized, and easy for a “literal-minded” AI to parse. This formal approach gives the agent a blueprint to follow and reduces ambiguity.</p>



<h3 class="wp-block-heading">The six core areas</h3>



<p>GitHub’s analysis of <a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/" target="_blank" rel="noreferrer noopener">over 2,500 agent configuration files</a> revealed a clear pattern: The most effective specs cover six areas. Use this as a checklist for completeness:</p>



<ol class="wp-block-list">
<li><strong>Commands</strong>: Put executable commands early—not just tool names but full commands with flags: <code>npm test</code>, <code>pytest -v</code>, <code>npm run build</code>. The agent will reference these constantly.</li>



<li><strong>Testing</strong>: How to run tests, what framework you use, where test files live, and what coverage expectations exist.</li>



<li><strong>Project structure</strong>: Where source code lives, where tests go, where docs belong. Be explicit: “<code>src/</code> for application code, <code>tests/</code> for unit tests, <code>docs/</code> for documentation.”</li>



<li><strong>Code style</strong>: One real code snippet showing your style beats three paragraphs describing it. Include naming conventions, formatting rules, and examples of good output.</li>



<li><strong>Git workflow</strong>: Branch naming, commit message format, PR requirements. The agent can follow these if you spell them out.</li>



<li><strong>Boundaries</strong>: What the agent should never touch—secrets, vendor directories, production configs, specific folders. “Never commit secrets” was the single most common helpful constraint in the GitHub study.</li>
</ol>



<p><strong>Be specific about your stack</strong>: Say “React 18 with TypeScript, Vite, and Tailwind CSS,” not “React project.” Include versions and key dependencies. Vague specs produce vague code.</p>



<p><strong>Use a consistent format</strong>: Clarity is king. Many devs use Markdown headings or even XML-like tags in the spec to delineate sections because AI models handle well-structured text better than free-form prose. For example, you might structure the spec as:</p>



<pre class="wp-block-code"><code># Project Spec: My team's tasks app


## Objective
- Build a web app for small teams to manage tasks...


## Tech Stack
- React 18+, TypeScript, Vite, Tailwind CSS
- Node.js/Express backend, PostgreSQL, Prisma ORM


## Commands
- Build: `npm run build` (compiles TypeScript, outputs to dist/)
- Test: `npm test` (runs Jest, must pass before commits)
- Lint: `npm run lint --fix` (auto-fixes ESLint errors)


## Project Structure
- `src/` – Application source code
- `tests/` – Unit and integration tests
- `docs/` – Documentation


## Boundaries
- &#x2705; Always: Run tests before commits, follow naming conventions
- &#x26a0; Ask first: Database schema changes, adding dependencies
- &#x1f6ab; Never: Commit secrets, edit node_modules/, modify CI config</code></pre>



<p>This level of organization not only helps you think clearly but also helps the AI find information. Anthropic engineers recommend <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" target="_blank" rel="noreferrer noopener">organizing prompts into distinct sections</a> (like &lt;background&gt;, &lt;instructions&gt;, &lt;tools&gt;, &lt;output_format&gt; etc.) for exactly this reason: It gives the model strong cues about which info is which. And remember, “minimal does not necessarily mean short”—don’t shy away from detail in the spec if it matters, but keep it focused.</p>



<p><strong>Integrate specs into your toolchain</strong>: Treat specs as “executable artifacts” tied to version control and CI/CD. The <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/" target="_blank" rel="noreferrer noopener">GitHub Spec Kit</a> uses a four-phase gated workflow that makes your specification the center of your engineering process. Instead of writing a spec and setting it aside, the spec drives the implementation, checklists, and task breakdowns. Your primary role is to steer; the coding agent does the bulk of the writing. Each phase has a specific job, and you don’t move to the next one until the current task is fully validated:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="883" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Current-task-is-validated-1600x883.jpg" alt="Task validation" class="wp-image-18098" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Current-task-is-validated-1600x883.jpg 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Current-task-is-validated-300x166.jpg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Current-task-is-validated-768x424.jpg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Current-task-is-validated-1536x847.jpg 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Current-task-is-validated-2048x1130.jpg 2048w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p><strong>1. Specify</strong>: You provide a high-level description of what you’re building and why, and the coding agent generates a detailed specification. This isn’t about technical stacks or app design—it’s about user journeys, experiences, and what success looks like. Who will use this? What problem does it solve? How will they interact with it? Think of it as mapping the user experience you want to create, and letting the coding agent flesh out the details. This becomes a living artifact that evolves as you learn more.</p>



<p><strong>2. Plan</strong>: Now you get technical. You provide your desired stack, architecture, and constraints, and the coding agent generates a comprehensive technical plan. If your company standardizes on certain technologies, this is where you say so. If you’re integrating with legacy systems or have compliance requirements, all of that goes here. You can ask for multiple plan variations to compare approaches. If you make internal docs available, the agent can integrate your architectural patterns directly into the plan.</p>



<p><strong>3. Tasks</strong>: The coding agent takes the spec and plan and breaks them into actual work—small, reviewable chunks that each solve a specific piece of the puzzle. Each task should be something you can implement and test in isolation, almost like test-driven development for your AI agent. Instead of “build authentication,” you get concrete tasks like “create a user registration endpoint that validates email format.”</p>



<p><strong>4. Implement</strong>: Your coding agent tackles tasks one by one (or in parallel). Instead of reviewing thousand-line code dumps, you review focused changes that solve specific problems. The agent knows what to build (specification), how to build it (plan), and what to work on (task). Crucially, your role is to verify at each phase: Does the spec capture what you want? Does the plan account for constraints? Are there edge cases the AI missed? The process builds in checkpoints for you to critique, spot gaps, and course-correct before moving forward.</p>



<p>This gated workflow prevents what Willison calls “house of cards code”: fragile AI outputs that collapse under scrutiny. Anthropic’s Skills system offers a similar pattern, letting you define reusable Markdown-based behaviors that agents invoke. By embedding your spec in these workflows, you ensure the agent can’t proceed until the spec is validated, and changes propagate automatically to task breakdowns and tests.</p>



<p><strong>Consider agents.md for specialized personas</strong>: For tools like GitHub Copilot, you can create <a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/" target="_blank" rel="noreferrer noopener">agents.md files</a> that define specialized agent personas—a @docs-agent for technical writing, a @test-agent for QA, a @security-agent for code review. Each file acts as a focused spec for that persona’s behavior, commands, and boundaries. This is particularly useful when you want different agents for different tasks rather than one general-purpose assistant.</p>



<p><strong>Design for agent experience (AX)</strong>: Just as we design APIs for developer experience (DX), consider designing specs for “agent experience.” This means clean, parseable formats: OpenAPI schemas for any APIs the agent will consume, llms.txt files that summarize documentation for LLM consumption, and explicit type definitions. The Agentic AI Foundation (AAIF) is standardizing protocols like MCP (Model Context Protocol) for tool integration. Specs that follow these patterns are easier for agents to consume and act on reliably.</p>



<p><strong>PRD versus SRS mindset</strong>: It helps to borrow from established documentation practices. For AI agent specs, you’ll often blend these into one document (as illustrated above), but covering both angles serves you well. Writing it like a PRD ensures you include user-centric context (“the why behind each feature”) so the AI doesn’t optimize for the wrong thing. Expanding it like an SRS ensures you nail down the specifics the AI will need to actually generate correct code (like what database or API to use). Developers have found that this extra upfront effort pays off by drastically reducing miscommunications with the agent later.</p>



<p><strong>Make the spec a “living document”</strong>: Don’t write it and forget it. Update the spec as you and the agent make decisions or discover new info. If the AI had to change the data model or you decided to cut a feature, reflect that in the spec so it remains the ground truth. Think of it as version-controlled documentation. In <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/" target="_blank" rel="noreferrer noopener">spec-driven workflows</a>, the spec drives implementation, tests, and task breakdowns, and you don’t move to coding until the spec is validated. This habit keeps the project coherent, especially if you or the agent step away and come back later. Remember, the spec isn’t just for the AI—it helps you as the developer maintain oversight and ensure the AI’s work meets the real requirements.</p>



<h2 class="wp-block-heading">3. Break Tasks into Modular Prompts and Context, Not One Big Prompt</h2>



<p><strong>Divide and conquer: Give the AI one focused task at a time rather than a monolithic prompt with everything at once.</strong></p>



<p>Experienced AI engineers have learned that trying to stuff the entire project (all requirements, all code, all instructions) into a single prompt or agent message is a recipe for confusion. Not only do you risk hitting token limits; you also risk the model losing focus due to the “<a href="https://maxpool.dev/research-papers/curse_of_instructions_report.html" target="_blank" rel="noreferrer noopener">curse of instructions</a>”—too many directives causing it to follow none of them well. The solution is to design your spec and workflow in a modular way, tackling one piece at a time and pulling in only the context needed for that piece.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="883" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Modular-prompts-1600x883.jpg" alt="Modular prompts" class="wp-image-18099" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Modular-prompts-1600x883.jpg 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Modular-prompts-300x166.jpg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Modular-prompts-768x424.jpg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Modular-prompts-1536x847.jpg 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Modular-prompts-2048x1130.jpg 2048w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p><strong>The curse of too much context/instructions</strong>: Research has confirmed what many devs have observed anecdotally: As you pile more instructions and data into the prompt, the model’s performance in adhering to each one <a href="https://openreview.net/pdf/848f1332e941771aa491f036f6350af2effe0513.pdf" target="_blank" rel="noreferrer noopener">drops significantly</a>. One study dubbed this the “curse of instructions,” showing that even GPT-4 and Claude struggle when asked to satisfy many requirements simultaneously. In practical terms, if you present 10 bullet points of detailed rules, the AI might obey the first few and start overlooking others. The better strategy is iterative focus. <a href="https://maxpool.dev/research-papers/curse_of_instructions_report.html" target="_blank" rel="noreferrer noopener">Guidelines from industry</a> suggest decomposing complex requirements into sequential, simple instructions as a best practice. Focus the AI on one subproblem at a time, get that done, then move on. This keeps the quality high and errors manageable.</p>



<p><strong>Divide the spec into phases or components</strong>: If your spec document is very long or covers a lot of ground, consider splitting it into parts (either physically separate files or clearly separate sections). For example, you might have a section for “backend API spec” and another for “frontend UI spec.” You don’t need to always feed the frontend spec to the AI when it’s working on the backend, and vice versa. Many devs using multi-agent setups even create separate agents or subprocesses for each part (e.g., one agent works on database/schema, another on API logic, another on frontend—each with the relevant slice of the spec). Even if you use a single agent, you can emulate this by copying only the relevant spec section into the prompt for that task. Avoid context overload: Don’t mix authentication tasks with database schema changes in one go, as the <a href="https://docs.digitalocean.com/products/gradient-ai-platform/concepts/context-management/" target="_blank" rel="noreferrer noopener">DigitalOcean AI guide</a> warns. Keep each prompt tightly scoped to the current goal.</p>



<p><strong>Extended TOC/summaries for large specs</strong>: One clever technique is to have the agent build an extended table of contents with summaries for the spec. This is essentially a “spec summary” that condenses each section into a few key points or keywords, and references where details can be found. For example, if your full spec has a section on security requirements spanning 500 words, you might have the agent summarize it to: “Security: Use HTTPS, protect API keys, implement input validation (see full spec §4.2).” By creating a hierarchical summary in the planning phase, you get a bird’s-eye view that can stay in the prompt, while the fine details remain offloaded unless needed. This extended TOC acts as an index: The agent can consult it and say, “Aha, there’s a security section I should look at,” and you can then provide that section on demand. It’s similar to how a human developer skims an outline and then flips to the relevant page of a spec document when working on a specific part.</p>



<p>To implement this, you can prompt the agent after writing the spec: “Summarize the spec above into a very concise outline with each section’s key points and a reference tag.” The result might be a list of sections with one- or two-sentence summaries. That summary can be kept in the system or assistant message to guide the agent’s focus without eating up too many tokens. This <a href="https://addyo.substack.com/p/context-engineering-bringing-engineering" target="_blank" rel="noreferrer noopener">hierarchical summarization approach</a> is known to help LLMs maintain long-term context by focusing on the high-level structure. The agent carries a “mental map” of the spec.</p>
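<p>The summaries themselves come from the agent, but the indexing half of an extended TOC is mechanical. As a rough sketch (here using the first sentence of each section as a placeholder summary), an indexer over a Markdown spec might look like:</p>

```python
import re

def extended_toc(spec_md: str) -> str:
    """Condense a Markdown spec into an indexed outline: section title, a
    placeholder summary, and a tag the agent can cite to request details."""
    lines = []
    # Split on level-2 headings; drop anything before the first one.
    sections = re.split(r"^## +", spec_md, flags=re.M)[1:]
    for i, section in enumerate(sections, start=1):
        title, _, body = section.partition("\n")
        first_sentence = body.strip().split(". ")[0].strip()
        lines.append(f"§{i} {title.strip()}: {first_sentence} (see full spec §{i})")
    return "\n".join(lines)

spec = """## Security
Use HTTPS everywhere. Protect API keys and validate all input.

## Data Model
Tasks belong to users. Each task has a title, status, and due date.
"""
print(extended_toc(spec))
```

<p>The two-line outline stays in the prompt as the agent’s map; the 500-word sections stay on disk until a task actually needs them.</p>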



<p><strong>Utilize subagents or “skills” for different spec parts</strong>: Another advanced approach is using multiple specialized agents (what Anthropic calls subagents or what you might call “skills”). Each subagent is configured for a specific area of expertise and given the portion of the spec relevant to that area. For instance, you might have a database designer subagent that only knows about the data model section of the spec, and an API coder subagent that knows the API endpoints spec. The main agent (or an orchestrator) can route tasks to the appropriate subagent automatically.</p>



<p>The benefit is each agent has a smaller context window to deal with and a more focused role, which can <a href="https://10xdevelopers.dev/structured/claude-code-with-subagents/" target="_blank" rel="noreferrer noopener">boost accuracy and allow parallel work</a> on independent tasks. Anthropic’s Claude Code supports this by letting you define subagents with their own system prompts and tools. “Each subagent has a specific purpose and expertise area, uses its own context window separate from the main conversation, and has a custom system prompt guiding its behavior,” as their docs describe. When a task comes up that matches a subagent’s domain, Claude can delegate that task to it, with the subagent returning results independently.</p>



<p><strong>Parallel agents for throughput</strong>: Running multiple agents simultaneously is emerging as “the next big thing” for developer productivity. Rather than waiting for one agent to finish before starting another task, you can spin up parallel agents for non-overlapping work. Willison describes this as “<a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/" target="_blank" rel="noreferrer noopener">embracing parallel coding agents</a>” and notes it’s “surprisingly effective, if mentally exhausting.” The key is scoping tasks so agents don’t step on each other: One agent codes a feature while another writes tests, or separate components get built concurrently. Orchestration frameworks like LangGraph or OpenAI Swarm can help coordinate these agents, and shared memory via vector databases (like Chroma) lets them access common context without redundant prompting.</p>



<h3 class="wp-block-heading">Single versus multi-agent: When to use each</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td></td><td><strong>Single agent parallel</strong></td><td><strong>Multi-agent</strong></td></tr><tr><td><strong>Strengths</strong></td><td>Simpler setup; lower overhead; easier to debug and follow</td><td>Higher throughput; handles complex interdependencies; specialists per domain</td></tr><tr><td><strong>Challenges</strong></td><td>Context overload on big projects; slower iteration; single point of failure</td><td>Coordination overhead; potential conflicts; needs shared memory (e.g., vector DBs)</td></tr><tr><td><strong>Best for</strong></td><td>Isolated modules; small-to-medium projects; early prototyping</td><td>Large codebases; one codes + one tests + one reviews; independent features</td></tr><tr><td><strong>Tips</strong></td><td>Use spec summaries; refresh context per task; start fresh sessions often</td><td>Limit to 2–3 agents initially; use MCP for tool sharing; define clear boundaries</td></tr></tbody></table></figure>



<p>In practice, using subagents or skill-specific prompts might look like: You maintain multiple spec files (or prompt templates)—e.g., SPEC_backend.md, SPEC_frontend.md—and you tell the AI, “For backend tasks, refer to SPEC_backend; for frontend tasks refer to SPEC_frontend.” Or in a tool like Cursor/Claude, you actually spin up a subagent for each. This is certainly more complex to set up than a single-agent loop, but it mimics what human developers do: We mentally compartmentalize a large spec into relevant chunks. (You don’t keep the whole 50-page spec in your head at once; you recall the part you need for the task at hand, and have a general sense of the overall architecture.) The challenge, as noted, is managing interdependencies: The subagents must still coordinate. (The frontend needs to know the API contract from the backend spec, etc.) A central overview (or an “architect” agent) can help by referencing the subspecs and ensuring consistency.</p>



<p><strong>Focus each prompt on one task/section</strong>: Even without fancy multi-agent setups, you can manually enforce modularity. For example, after the spec is written, your next move might be: “Step 1: Implement the database schema.” You feed the agent the database section of the spec only, plus any global constraints from the spec (like tech stack). The agent works on that. Then for Step 2, “Now implement the authentication feature”, you provide the auth section of the spec and maybe the relevant parts of the schema if needed. By refreshing the context for each major task, you ensure the model isn’t carrying a lot of stale or irrelevant information that could distract it. As one guide suggests: “<a href="https://docs.digitalocean.com/products/gradient-ai-platform/concepts/context-management/" target="_blank" rel="noreferrer noopener">Start fresh: begin new sessions</a> to clear context when switching between major features.” You can always remind the agent of critical global rules (from the spec’s constraints section) each time, but don’t shove the entire spec in if it’s not all needed.</p>
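<p>The “one section plus global constraints” pattern can be sketched as a small prompt builder, assuming (for illustration only) that the spec’s sections are stored as a dict:</p>

```python
def focused_prompt(global_constraints: str, section_name: str,
                   sections: dict[str, str], extra_context: str = "") -> str:
    """Build a fresh, minimal prompt for one task: the global rules plus
    only the spec section relevant to that task, not the whole spec."""
    parts = [
        "Global constraints (always apply):",
        global_constraints,
        f"Spec section: {section_name}",
        sections[section_name],
    ]
    if extra_context:
        parts += ["Related context:", extra_context]
    return "\n\n".join(parts)

# Example: the auth step gets the auth section plus only the schema it depends on
sections = {
    "schema": "Users table: id, email, password_hash.",
    "auth": "Login with email + password; hash passwords with bcrypt.",
}
prompt = focused_prompt("Stack: Python/FastAPI.", "auth", sections,
                        extra_context=sections["schema"])
```

<p>Each new task starts from a fresh, small prompt, so stale context from earlier steps never accumulates.</p>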



<p><strong>Use in-line directives and code TODOs</strong>: Another modularity trick is to use your code or spec as an active part of the conversation. For instance, scaffold your code with // TODO comments that describe what needs to be done, and have the agent fill them in one by one. Each TODO essentially acts as a mini-spec for a small task. This keeps the AI laser-focused (“implement this specific function according to this spec snippet”), and you can iterate in a tight loop. It’s similar to giving the AI a checklist item to complete rather than the whole checklist at once.</p>
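<p>A sketch of driving this loop programmatically — the scaffold and the prompt wording are invented for the example; the useful piece is that each // TODO line becomes one focused mini-spec:</p>

```python
import re

# Hypothetical scaffolded file the agent will fill in
SCAFFOLD = """\
function validateEmail(input) {
  // TODO: return true only for well-formed addresses (see spec 3.2)
}
function hashPassword(pw) {
  // TODO: bcrypt with cost 12, never log the plain text
}
"""

def extract_todos(source: str) -> list[str]:
    """Each // TODO line is a mini-spec for one focused prompt."""
    return re.findall(r"//\s*TODO:\s*(.+)", source)

todos = extract_todos(SCAFFOLD)
# Hand the agent one item at a time, e.g.:
# f"Implement only this TODO, leaving the rest of the file untouched: {todos[0]}"
```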



<p><strong>The bottom line</strong>: Small, focused context beats one giant prompt. This improves quality and keeps the AI from getting “overwhelmed” by too much at once. As one set of best practices sums up, provide “One Task Focus” and “Relevant info only” to the model, and avoid dumping everything everywhere. By structuring the work into modules—and using strategies like spec summaries or subspec agents—you’ll navigate around context size limits and the AI’s short-term memory cap. Remember, a well-fed AI is like a well-fed function: Give it only the <a href="https://addyo.substack.com/p/context-engineering-bringing-engineering" target="_blank" rel="noreferrer noopener">inputs it needs for the job at hand</a>.</p>



<h2 class="wp-block-heading">4. Build in Self-Checks, Constraints, and Human Expertise</h2>



<p><strong>Make your spec not just a to-do list for the agent but also a guide for quality control—and don’t be afraid to inject your own expertise.</strong></p>



<p>A good spec for an AI agent anticipates where the AI might go wrong and sets up guardrails. It also takes advantage of what you know (domain knowledge, edge cases, “gotchas”) so the AI doesn’t operate in a vacuum. Think of the spec as both coach and referee for the AI: It should encourage the right approach and call out fouls.</p>



<p><strong>Use three-tier boundaries</strong>: <a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/" target="_blank" rel="noreferrer noopener">GitHub’s analysis of 2,500+ agent files</a> found that the most effective specs use a three-tier boundary system rather than a simple list of don’ts. This gives the agent clearer guidance on when to proceed, when to pause, and when to stop:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="512" height="282" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agent-proceed-pause-or-stop.jpg" alt="Agent boundaries" class="wp-image-18100" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agent-proceed-pause-or-stop.jpg 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agent-proceed-pause-or-stop-300x165.jpg 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p><strong><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Always do</strong>: Actions the agent should take without asking. “Always run tests before commits.” “Always follow the naming conventions in the style guide.” “Always log errors to the monitoring service.”</p>



<p><strong><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/26a0.png" alt="⚠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Ask first</strong>: Actions that require human approval. “Ask before modifying database schemas.” “Ask before adding new dependencies.” “Ask before changing CI/CD configuration.” This tier catches high-impact changes that might be fine but warrant a human check.</p>



<p><strong><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f6ab.png" alt="🚫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Never do</strong>: Hard stops. “Never commit secrets or API keys.” “Never edit node_modules/ or vendor/.” “Never remove a failing test without explicit approval.” “Never commit secrets” was the single most common helpful constraint in the study.</p>



<p>This three-tier approach is more nuanced than a flat list of rules. It acknowledges that some actions are always safe, some need oversight, and some are categorically off-limits. The agent can proceed confidently on “Always” items, flag “Ask first” items for review, and hard-stop on “Never” items.</p>
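<p>As a hedged sketch (the headings and rules here are illustrative, not a required schema), the three tiers might appear in an agent file like this:</p>

```markdown
## Boundaries

### Always
- Run the test suite before every commit.
- Follow the naming conventions in the style guide.

### Ask first
- Modifying database schemas or migrations.
- Adding a new dependency.
- Changing CI/CD configuration.

### Never
- Commit secrets or API keys.
- Edit anything under node_modules/ or vendor/.
- Remove a failing test without explicit approval.
```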



<p><strong>Encourage self-verification</strong>: One powerful pattern is to have the agent verify its work against the spec automatically. If your tooling allows, you can integrate checks like unit tests or linting that the AI can run after generating code. But even at the spec/prompt level, you can instruct the AI to double-check (e.g., “After implementing, compare the result with the spec and confirm all requirements are met. List any spec items that are not addressed.”). This pushes the LLM to reflect on its output relative to the spec, catching omissions. It’s a form of self-audit built into the process.</p>



<p>For instance, you might append to a prompt: “(After writing the function, review the above requirements list and ensure each is satisfied, marking any missing ones).” The model will then (ideally) output the code followed by a short checklist indicating if it met each requirement. This reduces the chance it forgets something before you even run tests. It’s not foolproof, but it helps.</p>



<p><strong>LLM-as-a-Judge for subjective checks</strong>: For criteria that are hard to test automatically—code style, readability, adherence to architectural patterns—consider using “LLM-as-a-Judge.” This means having a second agent (or a separate prompt) review the first agent’s output against your spec’s quality guidelines. Anthropic and others have found this effective for subjective evaluation. You might prompt “Review this code for adherence to our style guide. Flag any violations.” The judge agent returns feedback that either gets incorporated or triggers a revision. This adds a layer of semantic evaluation beyond syntax checks.</p>
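<p>One way to make the judge’s feedback actionable is to ask for a machine-readable verdict. The sketch below stubs the model call (<code>call_judge_model</code> is a placeholder for whatever LLM API you use, and its canned reply is invented); the real content is the prompt/parse pattern:</p>

```python
import json

def build_judge_prompt(guide: str, code: str) -> str:
    return (
        "You are a code reviewer. Review the code against the style guide.\n"
        'Reply with JSON only: {"pass": true|false, "violations": ["..."]}\n\n'
        f"Style guide:\n{guide}\n\nCode:\n{code}\n"
    )

def call_judge_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM API call. The canned reply below
    # just demonstrates the machine-readable verdict format.
    return '{"pass": false, "violations": ["function name is not snake_case"]}'

def judge(code: str, guide: str) -> tuple[bool, list[str]]:
    """Run the judge and parse its structured verdict."""
    raw = call_judge_model(build_judge_prompt(guide, code))
    verdict = json.loads(raw)
    return verdict["pass"], verdict["violations"]

ok, violations = judge("def DoThing(): pass", "Use snake_case function names.")
```

<p>A failing verdict’s <code>violations</code> list can be fed straight back to the coding agent as the revision prompt.</p>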



<p><strong>Conformance testing</strong>: Willison advocates building conformance suites—language-independent tests (often YAML based) that any implementation must pass. These act as a contract: If you’re building an API, the conformance suite specifies expected inputs/outputs, and the agent’s code must satisfy all cases. This is more rigorous than ad hoc unit tests because it’s derived directly from the spec and can be reused across implementations. Include conformance criteria in your spec’s success section (e.g., “Must pass all cases in conformance/api-tests.yaml”).</p>
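<p>A toy conformance runner, sketched under assumptions: Willison’s suites are often YAML, but this example uses JSON with the same shape to stay dependency-free, and the cases and implementation are invented:</p>

```python
import json

# Illustrative stand-in for a conformance file like conformance/api-tests.yaml:
# each case is a language-independent contract the implementation must satisfy.
CASES = json.loads("""
[
  {"name": "adds",      "input": [2, 3],  "expected": 5},
  {"name": "negatives", "input": [-2, 3], "expected": 1},
  {"name": "zeros",     "input": [0, 0],  "expected": 0}
]
""")

def implementation(a, b):
    return a + b  # the agent-written code under test

def run_conformance(cases, fn):
    """Run every case against an implementation; return failure messages."""
    failures = []
    for case in cases:
        got = fn(*case["input"])
        if got != case["expected"]:
            failures.append(f'{case["name"]}: expected {case["expected"]}, got {got}')
    return failures

failures = run_conformance(CASES, implementation)
```

<p>Because the cases live outside any one implementation, the same file can verify a rewrite in another language.</p>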



<p><strong>Leverage testing in the spec</strong>: If possible, incorporate a test plan or even actual tests in your spec and prompt flow. In traditional development, we use TDD or write test cases to clarify requirements—you can do the same with AI. For example, in the spec’s success criteria, you might say, “These sample inputs should produce these outputs…” or “The following unit tests should pass.” The agent can be prompted to run through those cases in its head or actually execute them if it has that capability. Willison noted that having a <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/" target="_blank" rel="noreferrer noopener">robust test suite</a> is like giving the agents superpowers: They can validate and iterate quickly when tests fail. In an AI coding context, writing a bit of pseudocode for tests or expected outcomes in the spec can guide the agent’s implementation. Additionally, you can use a dedicated “<a href="https://10xdevelopers.dev/structured/claude-code-with-subagents/" target="_blank" rel="noreferrer noopener">test agent</a>” in a subagent setup that takes the spec’s criteria and continuously verifies the “code agent’s” output.</p>



<p><strong>Bring your domain knowledge</strong>: Your spec should reflect insights that only an experienced developer or someone with context would know. For example, if you’re building an ecommerce agent and you know that “products” and “categories” have a many-to-many relationship, state that clearly. (Don’t assume the AI will infer it—it might not.) If a certain library is notoriously tricky, mention pitfalls to avoid. Essentially, pour your mentorship into the spec. The spec can contain advice like “If using library X, watch out for the memory leak issue in version Y (apply workaround Z).” This level of detail is what turns an average AI output into a truly robust solution, because you’ve steered the AI away from common traps.</p>



<p>Also, if you have preferences or style guidelines (say, “use functional components over class components in React”), encode that in the spec. The AI will then emulate your style. Many engineers even include small examples in the spec (for instance, “All API responses should be JSON, e.g., <code>{"error": "message"}</code> for errors.”). By giving a quick example, you anchor the AI to the exact format you want.</p>



<p><strong>Minimalism for simple tasks</strong>: While we advocate thorough specs, part of expertise is knowing when to keep it simple. For relatively simple, isolated tasks, an overbearing spec can actually confuse more than help. If you’re asking the agent to do something straightforward (like “center a div on the page”), you might just say, “Make sure to keep the solution concise and do not add extraneous markup or styles.” No need for a full PRD there. Conversely, for complex tasks (like “implement an OAuth flow with token refresh and error handling”), that’s when you break out the detailed spec. A good rule of thumb: Adjust spec detail to task complexity. Don’t underspec a hard problem (the agent will flail or go off-track), but don’t overspec a trivial one (the agent might get tangled or use up context on unnecessary instructions).</p>



<p><strong>Maintain the AI’s “persona” if needed</strong>: Sometimes, part of your spec is defining how the agent should behave or respond, especially if the agent interacts with users. For example, if building a customer support agent, your spec might include guidelines like “Use a friendly and professional tone” and “If you don’t know the answer, ask for clarification or offer to follow up rather than guessing.” These kinds of rules (often included in system prompts) help keep the AI’s outputs aligned with expectations. They are essentially spec items for AI behavior. Keep them consistent and remind the model of them if needed in long sessions. (LLMs can “drift” in style over time if not kept on a leash.)</p>



<p><strong>You remain the exec in the loop</strong>: The spec empowers the agent, but you remain the ultimate quality filter. If the agent produces something that technically meets the spec but doesn’t feel right, trust your judgment. Either refine the spec or directly adjust the output. The great thing about AI agents is they don’t get offended—if they deliver a design that’s off, you can say, “Actually, that’s not what I intended; let’s clarify the spec and redo it.” The spec is a living artifact in collaboration with the AI, not a one-time contract you can’t change.</p>



<p>Simon Willison humorously likened working with AI agents to “a very weird form of management,” adding that “getting good results out of a coding agent feels <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/" target="_blank" rel="noreferrer noopener">uncomfortably close to managing a human intern</a>.” You need to provide clear instructions (the spec), ensure they have the necessary context (the spec and relevant data), and give actionable feedback. The spec sets the stage, but monitoring and feedback during execution are key. If an AI agent is a “weird digital intern who will absolutely cheat if you give them a chance,” the spec and constraints you write are how you prevent that cheating and keep them on task.</p>



<p><strong>Here’s the payoff</strong>: A good spec doesn’t just tell the AI what to build; it also helps it self-correct and stay within safe boundaries. By baking in verification steps, constraints, and your hard-earned knowledge, you drastically increase the odds that the agent’s output is correct on the first try (or at least much closer to correct). This reduces iterations and those “Why on Earth did it do that?” moments.</p>



<h2 class="wp-block-heading">5. Test, Iterate, and Evolve the Spec (and Use the Right Tools)</h2>



<p><strong>Think of spec writing and agent building as an iterative loop: test early, gather feedback, refine the spec, and leverage tools to automate checks.</strong></p>



<p>The initial spec is not the end—it’s the beginning of a cycle. The best outcomes come when you continually verify the agent’s work against the spec and adjust accordingly. Also, modern AI devs use various tools to support this process (from CI pipelines to context management utilities).</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="459" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Initial-spec.jpg" alt="Initial spec" class="wp-image-18101" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Initial-spec.jpg 1024w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Initial-spec-300x134.jpg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Initial-spec-768x344.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Continuous testing</strong>: Don’t wait until the end to see if the agent met the spec. After each major milestone or even each function, run tests or at least do quick manual checks. If something fails, update the spec or prompt before proceeding. For example, if the spec said, “Passwords must be hashed with bcrypt” and you see the agent’s code storing passwords in plain text, stop and correct it (and reinforce the rule in the spec or prompt). Automated tests shine here: If you provided tests (or write them as you go), let the agent run them. In many coding agent setups, you can have an agent run npm test or similar after finishing a task. The results (failures) can then feed back into the next prompt, effectively telling the agent “Your output didn’t meet spec on X, Y, Z—fix it.” This kind of agentic loop (code &gt; test &gt; fix &gt; repeat) is extremely powerful and is how tools like Claude Code or Copilot Labs are evolving to handle larger tasks. Always define what “done” means (via tests or criteria) and check for it.</p>
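<p>The code &gt; test &gt; fix &gt; repeat loop can be sketched as follows. Both helpers are stubs: <code>run_tests</code> stands in for <code>npm test</code> or <code>pytest</code>, and <code>agent_fix</code> stands in for the coding agent (here it just applies a hardcoded fix so the sketch terminates). The shape of the loop is the point:</p>

```python
def run_tests(code: str) -> list[str]:
    """Stand-in for `npm test`/`pytest`: returns failure messages."""
    namespace: dict = {}
    exec(code, namespace)  # define the function under test
    failures = []
    if namespace["add"](2, 2) != 4:
        failures.append("add(2, 2) should be 4")
    return failures

def agent_fix(code: str, failures: list[str]) -> str:
    """Placeholder for the coding agent receiving the failure report."""
    return code.replace("a - b", "a + b")

def agentic_loop(code: str, max_iterations: int = 3) -> tuple[str, bool]:
    """Keep feeding test failures back until green or out of budget."""
    for _ in range(max_iterations):
        failures = run_tests(code)
        if not failures:
            return code, True
        # The failures become the next prompt:
        # "Your output didn't meet spec on X, Y, Z—fix it."
        code = agent_fix(code, failures)
    return code, False

final_code, passed = agentic_loop("def add(a, b):\n    return a - b")
```

<p>The iteration cap matters: without it, a confused agent can burn tokens thrashing on the same failure.</p>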



<p><strong>Iterate on the spec itself</strong>: If you discover that the spec was incomplete or unclear (maybe the agent misunderstood something or you realized you missed a requirement), update the spec document. Then explicitly resync the agent with the new spec: “I have updated the spec as follows… Given the updated spec, adjust the plan or refactor the code accordingly.” This way the spec remains the single source of truth. It’s similar to how we handle changing requirements in normal dev, but in this case you’re also the product manager for your AI agent. Keep version history if possible (even just via commit messages or notes), so you know what changed and why.</p>



<p><strong>Utilize context management and memory tools</strong>: There’s a growing ecosystem of tools to help manage AI agent context and knowledge. For instance, retrieval-augmented generation (RAG) is a pattern where the agent can pull in relevant chunks of data from a knowledge base (like a vector database) on the fly. If your spec is huge, you could embed sections of it and let the agent retrieve the most relevant parts when needed, instead of always providing the whole thing. There are also frameworks implementing the Model Context Protocol (MCP), which automates feeding the right context to the model based on the current task. One example is <a href="https://docs.digitalocean.com/products/gradient-ai-platform/concepts/context-management/" target="_blank" rel="noreferrer noopener">Context7</a> (context7.com), which can auto-fetch relevant context snippets from docs based on what you’re working on. In practice, this might mean the agent notices you’re working on “payment processing” and it pulls the payments section of your spec or documentation into the prompt. Consider leveraging such tools or setting up a rudimentary version (even a simple search in your spec document).</p>
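<p>The “rudimentary version” mentioned above can be surprisingly small. This sketch splits a spec by headings and retrieves the best-matching section by word overlap; a real setup would use embeddings and a vector store instead, and the sample spec is invented:</p>

```python
def split_sections(spec: str) -> dict[str, str]:
    """Split a markdown spec into sections keyed by '## ' headings."""
    sections, current, buf = {}, "intro", []
    for line in spec.splitlines():
        if line.startswith("## "):
            sections[current] = "\n".join(buf)
            current, buf = line[3:].strip(), []
        else:
            buf.append(line)
    sections[current] = "\n".join(buf)
    return sections

def retrieve(query: str, sections: dict[str, str]) -> str:
    """Return the section title whose words best overlap the query."""
    q = set(query.lower().split())
    def score(item):
        title, body = item
        return len(q & set((title + " " + body).lower().split()))
    return max(sections.items(), key=score)[0]

SPEC = """# Shop spec
## Payments
Charge cards via the payment gateway; retry failed charges once.
## Auth
Login with email and password; hash passwords with bcrypt.
"""
best = retrieve("implement payment processing for failed charges",
                split_sections(SPEC))
```

<p>Only the winning section then goes into the prompt, rather than the whole spec.</p>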



<p><strong>Parallelize carefully</strong>: Some developers run multiple agent instances in parallel on different tasks (as mentioned earlier with subagents). This can speed up development (e.g., one agent generates code while another simultaneously writes tests, or two features are built concurrently). If you go this route, ensure the tasks are truly independent or clearly separated to avoid conflicts. (The spec should note any dependencies.) For example, don’t have two agents writing to the same file at once. One workflow is to have an agent generate code and another review it in parallel, or to have separate components built that integrate later. This is advanced usage and can be mentally taxing to manage. (As Willison admitted, running multiple agents is <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/" target="_blank" rel="noreferrer noopener">surprisingly effective, if mentally exhausting</a>!) Start with at most 2–3 agents to keep things manageable.</p>



<p><strong>Version control and spec locks</strong>: Use Git or your version control of choice to track what the agent does. <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/" target="_blank" rel="noreferrer noopener">Good version control habits</a> matter even more with AI assistance. Commit the spec file itself to the repo. This not only preserves history, but the agent can even use git diff or blame to understand changes. (LLMs are quite capable of reading diffs.) Some advanced agent setups let the agent query the VCS history to see when something was introduced—surprisingly, models can be “fiercely competent at Git.” By keeping your spec in the repo, you allow both you and the AI to track evolution. There are tools (like GitHub Spec Kit mentioned earlier) that integrate spec-driven development into the Git workflow—for instance, gating merges on updated specs or generating checklists from spec items. While you don’t need those tools to succeed, the takeaway is to treat the spec like code: Maintain it diligently.</p>



<p><strong>Cost and speed considerations</strong>: Working with large models and long contexts can be slow and expensive. A practical tip is to use model selection and batching smartly. Perhaps use a cheaper/faster model for initial drafts or repetitions, and reserve the most capable (and expensive) model for final outputs or complex reasoning. Some developers use GPT-4 or Claude for planning and critical steps, but offload simpler expansions or refactors to a local model or a smaller API model. If using multiple agents, maybe not all need to be top tier; a test-running agent or a linter agent could be a smaller model. Also consider throttling context size: Don’t feed 20K tokens if 5K will do. As we discussed, <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" target="_blank" rel="noreferrer noopener">more tokens can mean diminishing returns</a>.</p>
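<p>A routing policy like the one described can be a few lines. Tier names, model labels, and thresholds below are all made up for the sketch:</p>

```python
# Illustrative tiers: cheap/fast for routine work, premium for hard reasoning.
MODEL_TIERS = {"small": "local-8b", "mid": "mid-tier-api", "large": "frontier-api"}

def pick_model(task_kind: str, est_context_tokens: int) -> str:
    """Route routine tasks to cheaper models; reserve the most capable
    (and expensive) model for planning and large-context reasoning."""
    if task_kind in ("lint", "test-run", "format"):
        return MODEL_TIERS["small"]
    if task_kind in ("plan", "architecture") or est_context_tokens > 8000:
        return MODEL_TIERS["large"]
    return MODEL_TIERS["mid"]
```

<p>Even a crude policy like this keeps the expensive model out of the lint loop.</p>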



<p><strong>Monitor and log everything</strong>: In complex agent workflows, logging the agent’s actions and outputs is essential. Check the logs to see if the agent is deviating or encountering errors. Many frameworks provide trace logs or allow printing the agent’s chain of thought (especially if you prompt it to think step-by-step). Reviewing these logs can highlight where the spec or instructions might have been misinterpreted. It’s not unlike debugging a program—except the “program” is the conversation/prompt chain. If something weird happens, go back to the spec/instructions to see if there was ambiguity.</p>



<p><strong>Learn and improve</strong>: Finally, treat each project as a learning opportunity to refine your spec-writing skill. Maybe you’ll discover that a certain phrasing consistently confuses the AI, or that organizing spec sections in a certain way yields better adherence. Incorporate those lessons into the next spec. The field of AI agents is rapidly evolving, so new best practices (and tools) emerge constantly. Stay updated via blogs (like the ones by Simon Willison, Andrej Karpathy, etc.), and don’t hesitate to experiment.</p>



<p>A spec for an AI agent isn’t “write once, done.” It’s part of a continuous cycle of instructing, verifying, and refining. The payoff for this diligence is substantial: By catching issues early and keeping the agent aligned, you avoid costly rewrites or failures later. As one AI engineer quipped, using these practices can feel like having “an army of interns” working for you, but you have to manage them well. A good spec, continuously maintained, is your management tool.</p>



<h2 class="wp-block-heading">Avoid Common Pitfalls</h2>



<p>Before wrapping up, it’s worth calling out antipatterns that can derail even well-intentioned spec-driven workflows. The <a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/" target="_blank" rel="noreferrer noopener">GitHub study of 2,500+ agent files</a> revealed a stark divide: “Most agent files fail because they’re too vague.” Here are the mistakes to avoid:</p>



<p><strong>Vague prompts</strong>: “Build me something cool” or “Make it work better” gives the agent nothing to anchor on. As Baptiste Studer puts it: “Vague prompts mean wrong results.” Be specific about inputs, outputs, and constraints. “You are a helpful coding assistant” doesn’t work. “You are a test engineer who writes tests for React components, follows these examples, and never modifies source code” does.</p>



<p><strong>Overlong contexts without summarization</strong>: Dumping 50 pages of documentation into a prompt and hoping the model figures it out rarely works. Use hierarchical summaries (as discussed in principle 3) or RAG to surface only what’s relevant. Context length is not a substitute for context quality.</p>



<p><strong>Skipping human review</strong>: Willison has a personal rule—“I won’t commit code I couldn’t explain to someone else.” Just because the agent produced something that passes tests doesn’t mean it’s correct, secure, or maintainable. Always review critical code paths. The “house of cards” metaphor applies: AI-generated code can look solid but collapse under edge cases you didn’t test.</p>



<p><strong>Conflating vibe coding with production engineering</strong>: Rapid prototyping with AI (“vibe coding”) is great for exploration and throwaway projects. But shipping that code to production without rigorous specs, tests, and review is asking for trouble. I distinguish “vibe coding” from “AI-assisted engineering”—the latter requires the discipline this guide describes. Know which mode you’re in.</p>



<p><strong>Ignoring the “lethal trifecta”</strong>: Willison warns of three properties that make AI agents dangerous: speed (they work faster than you can review), nondeterminism (same input, different outputs), and cost (encouraging corner cutting on verification). Your spec and review process must account for all three. Don’t let speed outpace your ability to verify.</p>



<p><strong>Missing the six core areas</strong>: If your spec doesn’t cover commands, testing, project structure, code style, git workflow, and boundaries, you’re likely missing something the agent needs. Use the six-area checklist from section 2 as a sanity check before handing off to the agent.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Writing an effective spec for AI coding agents requires solid software engineering principles combined with adaptation to LLM quirks. Start with clarity of purpose and let the AI help expand the plan. Structure the spec like a serious design document, covering the six core areas and integrating it into your toolchain so it becomes an executable artifact, not just prose. Keep the agent’s focus tight by feeding it one piece of the puzzle at a time (and consider clever tactics like summary TOCs, subagents, or parallel orchestration to handle big specs). Anticipate pitfalls by including three-tier boundaries (always/ask first/never), self-checks, and conformance tests—essentially, teach the AI how to not fail. And treat the whole process as iterative: use tests and feedback to refine both the spec and the code continuously.</p>



<p>Follow these guidelines and your AI agent will be far less likely to “break down” under large contexts or wander off into nonsense.</p>



<p>Happy spec-writing!</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><em>On March 26, join Addy and Tim O’Reilly at AI Codecon: Software Craftsmanship in the Age of AI, where an all-star lineup of experts will go deeper into orchestration, agent coordination, and the new skills developers need to build excellent software that creates value for all participants. </em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></td></tr></tbody></table></figure>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/how-to-write-a-good-spec-for-ai-agents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Packaging Expertise: How Claude Skills Turn Judgment into Artifacts</title>
		<link>https://www.oreilly.com/radar/packaging-expertise-how-claude-skills-turn-judgment-into-artifacts/</link>
				<pubDate>Thu, 19 Feb 2026 12:01:11 +0000</pubDate>
					<dc:creator><![CDATA[Han Lee]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18086</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Packaging-expertise.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Packaging-expertise-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Think about what happens when you onboard a new employee. First, you provision them tools. Email access. Slack. CRM. Office software. Project management software. Development environment. Connecting a person to the system they’ll need to do their job. However, this is necessary but not sufficient. Nobody becomes effective just because they can log into Salesforce. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Think about what happens when you onboard a new employee.</p>



<p>First, you provision them with tools. Email access. Slack. CRM. Office software. Project management software. Development environment. You’re connecting a person to the systems they’ll need to do their job. However, this is necessary but not sufficient. Nobody becomes effective just because they can log into Salesforce.</p>



<p>Then comes the harder part: teaching them how your organization actually works. The analysis methodology your team developed over years of iteration. The quality bar that is not written down anywhere. The implicit ways of working. The judgment calls about when to escalate and when to handle something independently. The institutional knowledge that separates a new hire from someone who’s been there for years.</p>



<p>This second part—<strong>the expertise transfer</strong>—is where organizations struggle. It’s expensive, inconsistent, and doesn’t scale. It lives in mentorship relationships, institutional knowledge, and documentation that goes stale the moment it’s written.</p>



<p>Claude Skills and MCP (Model Context Protocol) follow exactly this pattern. MCP gives AI agents such as Claude the tools: access to systems, databases, APIs, and resources. Skills are the training materials that teach Claude how to work and how to use these tools.</p>



<p>This distinction matters more than it might first appear. While we have gotten reasonably good at provisioning tools, we have never had a good way to package expertise. Skills change that. They package expertise into a standardized format.</p>



<h2 class="wp-block-heading">Tools Versus Training</h2>



<p><strong>MCP is tool provisioning.</strong> It’s the protocol that connects AI agents to external systems: data warehouse, CRM, GitHub repositories, internal APIs, and knowledge bases. Anthropic describes it as “USB-C for AI”—a standardized interface that lets Claude plug into your existing infrastructure. An MCP server might give Claude the ability to query customer records, commit code, send Slack messages, or pull analytics data with authorized permissions.</p>



<p>This is necessary infrastructure. But like giving a new hire database credentials, it does not tell AI agents what to do with that access. MCP answers the question “What tools can an agent use?” It provides capabilities without opinions.</p>



<p><strong>Skills are the training materials. </strong>They encode how your organization actually works: which segments matter, what churn signal to watch for, how to structure findings for your quarterly business review, when to flag something for human attention.</p>



<p>Skills answer a different question: “How should an AI agent think about this?” They provide expertise, not just access.</p>



<p>Consider the difference in what you’re creating. Building an MCP server is infrastructure work; it’s an engineering effort to connect systems securely and reliably. Creating a Skill is knowledge work: domain experts articulating what they know, in markdown files, for AI agents to understand and operationalize. These require different people, different processes, and different governance.</p>



<p>The real power emerges when you combine them. MCP connects AI agents to your data warehouse. A Skill teaches AI agents your firm’s analysis methodology and which MCP tools to use. Together, AI agents can perform expert-level analysis on live data, following your specific standards. Neither layer alone gets you there, just as a new hire with database access but no training, or training but no access, won’t be effective at their job.</p>



<p>MCP is the toolbox. Skills are the training manuals that teach how to use those tools.</p>



<h2 class="wp-block-heading">Why Expertise Has Been So Hard to Scale</h2>



<p>The training side of onboarding has always been the bottleneck.</p>



<p>Your best analyst retires, and their methods walk out the door. Onboarding takes months because the real tacit knowledge lives in people’s heads, not in any document a new hire can read. Consistency is impossible when “how we do things here” varies by who trained whom and who worked with whom. Even when you invest heavily in training programs, they produce point-in-time snapshots of expertise that immediately begin to rot.</p>



<p>Previous approaches have all fallen short:</p>



<p><strong>Documentation </strong>is passive and quickly outdated. It requires human interpretation, offers no guarantee of correct application, and can’t adapt to novel situations. The wiki page about customer analysis does not help when you encounter an edge case the author never anticipated.</p>



<p><strong>Training programs </strong>are expensive, and a certificate of completion says nothing about actual competency.</p>



<p><strong>Checklists and SOPs</strong> capture procedure but not judgment. They tell you what to check, not how to think about what you find. They work for mechanical tasks but fail for anything requiring expertise.</p>



<p>We’ve had Custom GPTs, Claude Projects, and Gemini Gems attempting to address this. They are useful but opaque. You cannot invoke them based on context; an AI agent working as a Copy Editing Gem stays in copy editing and can’t switch to a Laundry Buddy custom GPT mid-task. They are not transferable and cannot be packaged for distribution.</p>



<p>Skills offer something new: expertise packaged as a versionable, governable artifact.</p>



<p>Skills are files in folders—a SKILL.md document with supporting assets, scripts, and resources. They leverage all the tooling we have built for managing code. Track changes in Git. Roll back mistakes. Maintain audit trails. Review Skills before deployment through PR workflows with version control. Deploy organization-wide and ensure consistency. AI agents can compose Skills for complex workflows, building sophisticated capabilities from simple building blocks.</p>
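<p>A minimal Skill might look like the following sketch. The frontmatter fields follow Anthropic’s published SKILL.md format; the folder name and the body content here are purely illustrative:</p>

```markdown
<!-- skills/quarterly-review/SKILL.md (illustrative example) -->
---
name: quarterly-review
description: How we analyze customer segments and structure findings for the quarterly business review.
---

# Quarterly review methodology

1. Pull churn and expansion metrics for the segments listed in `segments.md`.
2. Flag any account whose usage dropped more than 20% quarter over quarter.
3. Structure findings as: headline, evidence, recommended action.
4. Escalate anything ambiguous to a human reviewer; never guess.
```

<p>Because it is just a Markdown file in a folder, it can be diffed, reviewed in a PR, and rolled back like any other artifact.</p>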



<p>The architecture also enables progressive disclosure. AI agents see only lightweight metadata until a Skill becomes relevant, then load the full instructions on demand. You can have dozens of Skills available without overwhelming the model’s precious context window, which is like a human’s short-term memory or a computer’s RAM. Claude loads expertise as needed and coordinates multiple Skills automatically.</p>
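<p>In code, progressive disclosure amounts to keeping only a one-line summary of each Skill in context and reading the full file only once the agent decides it applies. A minimal sketch in Python; all names here are illustrative, not any vendor’s API:</p>

```python
# Hypothetical sketch of progressive disclosure: only Skill metadata lives
# in the agent's context until a Skill is judged relevant, then the full
# SKILL.md body is loaded on demand.
from dataclasses import dataclass


@dataclass
class SkillStub:
    name: str
    description: str  # the only text the model sees up front
    path: str         # where the full SKILL.md lives on disk


def build_context(stubs):
    """Cheap index: one summary line per Skill, not the full instructions."""
    return "\n".join(f"- {s.name}: {s.description}" for s in stubs)


def load_full_skill(stub):
    """Loaded only after the agent decides this Skill applies."""
    with open(stub.path) as f:
        return f.read()
```

<p>The context cost of an unused Skill is one line of metadata, which is what makes dozens of installed Skills tractable.</p>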



<p>This makes the enterprise deployment model tractable. An expert creates a Skill based on best practices, with the help of an AI/ML engineer to audit and evaluate its effectiveness. Administrators review and approve it through governance processes. The organization deploys it everywhere simultaneously. Updates propagate instantly from a central source.</p>



<p><a href="https://echofold.ai/news/claude-skills-announcement" target="_blank" rel="noreferrer noopener">One report cites Rakuten achieving 87.5% faster completion</a> of a finance workflow after implementing Skills. Not from AI magic but from finally being able to distribute their analysts’ methodologies across the entire team. That’s the expertise transfer problem, solved.</p>



<h2 class="wp-block-heading">Training Materials You Can Meter</h2>



<p>The onboarding analogy also created a new business model.</p>



<p>When expertise lives in people, you can only monetize it through labor—billable hours, consulting engagements, training programs, maintenance contracts. The expert has to show up, which limits scale and creates key-person dependencies.</p>



<p>Skills separate expertise from the expert. Package your methodology as a Skill. Distribute it via API. Charge based on usage.</p>



<p>A consulting firm’s analysis framework can become a product. A domain expert’s judgment becomes a service. The Skill encodes the expertise; the API calls become the meter. This is service as software, the SaaS of expertise. And it’s only possible because Skills put knowledge in a form that can be distributed, versioned, and billed against.</p>



<p>The architecture is familiar. The Skill is like an application frontend (the expertise, the methodology, the “how”), while MCP connections or API calls form the backend (data access, actions, the “what”). You build training materials once and deploy them everywhere, metering usage through the infrastructure layer.</p>



<p>No more selling API endpoints with 500 pages of obscure documentation explaining what each endpoint does, then staffing a team to support it. Now we can package the expertise of how to use those APIs directly into Skills. Customers can realize the value of an API via their AI agents. Implementation cost and time drop to near zero with MCP. Time to value becomes immediate with Skills.</p>



<h2 class="wp-block-heading">The Visibility Trade-Off</h2>



<p>Every abstraction has a cost. Skills trade visibility for scalability, and that trade-off deserves honest examination.</p>



<p>When expertise transfers human to human, through mentorship, working sessions, and apprenticeship, the expert sees how their knowledge gets applied and improves in the process. They watch the learner struggle with edge cases. They notice which concepts don’t land. They observe how their methods get adapted to new situations. This feedback loop improves the expertise over time.</p>



<p>Skills break that loop. As a Skill builder, you do not see the conversations that trigger your Skill. You do not know how users adapted your methodology or which parts of your guidance AI agents weighted most heavily. Users interact with their own AI agents; your Skill is one influence among many.</p>



<p>Your visibility is limited to the infrastructure layer: API calls, MCP tool invocations, and whatever outputs you explicitly capture. You see usage patterns, not the dialogue that surrounds them. Those dialogues reside with the user’s AI agents.</p>



<p>This parallels what happened when companies moved from in-person training to self-service documentation and e-learning. You lost the ability to watch every learner, but you gained the ability to train at scale. Skills make the same exchange: less visibility per user interaction, vastly more interactions possible.</p>



<p>Managing the trade-off requires intentional design. Build logging and tracing into your Skills where appropriate. <em>Create feedback mechanisms inside Skills</em> for AI agents to surface when users express confusion or request changes. And in the development process, focus on outcomes (did the Skill produce good results?) rather than process observation.</p>



<p>In production, developers of Skills or MCP servers will not have most of the context of how a user’s AI agent uses their Skills.</p>



<h2 class="wp-block-heading">What to Watch</h2>



<p>For organizations going through AI transformations, the starting point is an audit of expertise. What knowledge lives only in a specific person’s head? Where does inconsistency emerge because “how we do things” isn’t written down in an operationalizable form? These are your candidates for Skills.</p>



<p>Start with bounded workflows: a report format, an analysis methodology, a review checklist. Prove the pattern before encoding more complex expertise. Govern early. Skills are artifacts that require review, evaluation, and lifecycle management. Establish those processes before Skills proliferate.</p>



<p>For builders, the mental shift is from “prompt” to “product.” Skills are versioned artifacts with users. Design accordingly. Combine Skills with MCP for maximum leverage. Accept the visibility trade-off as the cost of scale.</p>



<p>Several signals suggest where this is heading. Skill marketplaces are emerging. Agent Skills are now a published open standard being adopted by multiple AI agents and, soon, agent SDKs. Enterprise governance tooling, with the version control, approval workflows, and audit trails that organizations need, will determine adoption in regulated industries.</p>



<h2 class="wp-block-heading">Expertise Can Finally Be Packaged</h2>



<p>We’ve gotten good at provisioning tools as APIs. MCP extends that to AI with standardized connections to systems and data.</p>



<p>But tool access was never the bottleneck. Expertise transfer was. The methodology. The judgment. The caveats. The workflows. The institutional knowledge that separates a new hire from a veteran.</p>



<p>Skills are the first serious attempt to package expertise into a file format that AI agents can operationalize while humans can still read, review, and govern it. They are training materials that actually scale.</p>



<p>The organizations that figure out how to package their expertise, for both internal and external consumption, will have a structural advantage. Not because AI replaces expertise. Because AI amplifies the expertise of those who know how to share it.</p>



<p>MCP gives AI agents the tools. Skills teach AI agents how to work. The question is whether you can encode what your best people know. Skills are the first real answer.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">References</h2>



<ul class="wp-block-list">
<li>“What Is the Model Context Protocol (MCP)?,” LF Projects, <a href="https://modelcontextprotocol.io/docs/getting-started/intro" target="_blank" rel="noreferrer noopener">https://modelcontextprotocol.io/docs/getting-started/intro</a>.</li>



<li>Michael Nuñez, “How Anthropic’s ‘Skills’ Make Claude Faster, Cheaper, and More Consistent for Business Workflows,” <em>VentureBeat</em>, October 16, 2025, <a href="https://venturebeat.com/ai/how-anthropics-skills-make-claude-faster-cheaper-and-more-consistent-for" target="_blank" rel="noreferrer noopener">https://venturebeat.com/ai/how-anthropics-skills-make-claude-faster-cheaper-and-more-consistent-for</a>.</li>



<li>“Skills,” Anthropic, <a href="https://github.com/anthropics/skills" target="_blank" rel="noreferrer noopener">https://github.com/anthropics/skills</a>.</li>



<li>“Create and Distribute a Plugin Marketplace,” Claude Code Docs, <a href="https://code.claude.com/docs/en/plugin-marketplaces" target="_blank" rel="noreferrer noopener">https://code.claude.com/docs/en/plugin-marketplaces</a>.</li>
</ul>
]]></content:encoded>
										</item>
		<item>
		<title>What Developers Actually Need to Know Right Now</title>
		<link>https://www.oreilly.com/radar/what-developers-actually-need-to-know-right-now/</link>
				<pubDate>Thu, 19 Feb 2026 11:00:09 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18080</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/A-swirling-stairs-library.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/A-swirling-stairs-library-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A conversation with Addy Osmani]]></custom:subtitle>
		
				<description><![CDATA[The following article includes clips from a recent Live with Tim O&#8217;Reilly interview. You can watch the full version on the O&#8217;Reilly Media learning platform. Addy Osmani is one of my favorite people to talk with about the state of software engineering with AI. He spent 14 years leading Chrome&#8217;s developer experience team at Google, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article includes clips from a recent</em> Live with Tim O&#8217;Reilly <em>interview. You can watch <a href="https://learning.oreilly.com/videos/live-with-tim/0642572305529/" target="_blank" rel="noreferrer noopener">the full version</a> on the O&#8217;Reilly Media learning platform.</em></p>
</blockquote>



<p><a href="https://addyosmani.com/" target="_blank" rel="noreferrer noopener">Addy Osmani</a> is one of my favorite people to talk with about the state of software engineering with AI. He spent 14 years leading Chrome&#8217;s developer experience team at Google, and recently moved to Google Cloud AI to focus on Gemini and agent development. He&#8217;s also the author of numerous books for O’Reilly, including <a href="https://learning.oreilly.com/library/view/the-effective-software/9798341638167/" target="_blank" rel="noreferrer noopener"><em>The Effective Software Engineer</em></a> (due out in March), and my cohost for <a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener">O&#8217;Reilly&#8217;s AI Codecon</a>. Every time I talk with him I come away feeling like I have a better grip on what&#8217;s real and what&#8217;s noise. Our <a href="https://www.oreilly.com/live/live-with-tim/" target="_blank" rel="noreferrer noopener">recent conversation on <em>Live with Tim O&#8217;Reilly</em></a> was no exception.</p>



<p>Here are some of the things we talked about.</p>



<h2 class="wp-block-heading"><strong><strong>The hard problem is coordination, not generation</strong></strong></h2>



<p>Addy pointed out that there&#8217;s a spectrum in how people are working with AI agents right now. On one end you have solo founders running hundreds or thousands of agents, sometimes without even reviewing the code. On the other end you have enterprise teams with quality gates, reliability requirements, and long-term maintenance to think about.</p>



<p>Addy&#8217;s take is that for most businesses, &#8220;the real frontier is not necessarily having hundreds of agents for a task just for its own sake. It&#8217;s about orchestrating a modest set of agents that solve real problems while maintaining control and traceability.&#8221; He pointed out that frameworks like Google&#8217;s <a href="https://docs.cloud.google.com/agent-builder/agent-development-kit/overview" target="_blank" rel="noreferrer noopener">Agent Development Kit</a> now support both deterministic workflow agents and dynamic LLM agents in the same system, so you get to choose when you need predictability and when you need flexibility.</p>



<p>The ecosystem is developing fast. <a href="https://github.com/google/A2A" target="_blank" rel="noreferrer noopener">A2A</a> (the agent-to-agent protocol Google contributed to the Linux Foundation) handles agent-to-agent communication while MCP handles agent-to-tool calls. Together they start to look like the TCP/IP of the agent era. But Addy was clear-eyed about where things stand: &#8220;Almost nobody&#8217;s figured out how to make everything work together as smoothly as possible. We&#8217;re getting as close to that as we can. And that&#8217;s the actual hard problem here. Not generation, but coordination.&#8221;</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="How to Build Reliable AI at Scale: Insights from Addy Osmani" width="500" height="281" src="https://www.youtube.com/embed/EYoBEnyD86w?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong><strong>The &#8220;Something Big Is Happening&#8221; debate</strong></strong></h2>



<p>In response to one of the audience questions, we spent some time on <a href="https://shumer.dev/something-big-is-happening" target="_blank" rel="noreferrer noopener">Matt Shumer&#8217;s viral essay</a> arguing that the current moment in AI is like the moment just before the COVID-19 pandemic hit. Those in the know were sounding the alarm, but most people weren’t hearing it.</p>



<p>Addy&#8217;s take was that &#8220;it felt a little bit like somebody who hadn&#8217;t been following along, just finally getting around to trying out the latest models and tools and having an epiphany moment.&#8221; He thinks the piece lacked grounding in data and didn&#8217;t do a great job distinguishing between what AI can do for prototypes and what it can do in production. As Addy put it, &#8220;Yes, the models are getting better, the harnesses are getting better, the tools are getting better. I can do more with AI these days than I could a year ago. All of that is true. But to say that all kinds of technical work can now be done with near perfection, I wouldn&#8217;t personally agree with that statement.&#8221;</p>



<p>I agree with Addy, but I also know how it feels when you see the future crashing in and no one is paying attention. At O’Reilly, we started working with the web when there were only 200 websites. In 1993, we built <a href="https://en.wikipedia.org/wiki/Global_Network_Navigator" target="_blank" rel="noreferrer noopener">GNN</a>, the first web portal, and the web’s first advertising. In 1994, we did the first large-scale market research on the potential of advertising as the web’s future business model. We went around lobbying phone companies to adopt the web and (a few years later) for bookstores to pay attention to the rise of Amazon, and nobody listened. I&#8217;m a big believer in&nbsp;&#8220;something is happening&#8221; moments. But I&#8217;m also very aware that it always takes longer than it appears.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="&quot;Something Big is Happening&quot;: Addy Osmani and Tim O&#039;Reilly on Matt Shumer&#039;s Viral AI Essay" width="500" height="281" src="https://www.youtube.com/embed/pNb-ORkH8JI?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Both things can be true. The direction and magnitude of this shift are real. The models keep getting better. The harnesses keep getting better. But we still have to figure out new kinds of businesses and new kinds of workflows. AI won&#8217;t be a tsunami that wipes everything away overnight.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em> Addy and I will be cohosting the </em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>O&#8217;Reilly AI Codecon: Software Craftsmanship in the Age of AI</em></a><em> on March 26, where we&#8217;ll go much deeper on orchestration, agent coordination, and the new skills developers need. We&#8217;d love to see you there.</em> <a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></p>



<p><em>And if you’re interested in presenting at AI Codecon, our CFP is open through this Friday, February 20. Check out what we’re looking for and </em><a href="https://www.oreilly.com/AI-Codecon/cfp.html" target="_blank" rel="noreferrer noopener"><em>submit your proposal here</em></a><em>.</em></p>
</blockquote>



<h2 class="wp-block-heading"><strong><strong>Feeling productive vs. being productive</strong></strong></h2>



<p>There was a great line from a post by Will Manidis called &#8220;<a href="https://minutes.substack.com/p/tool-shaped-objects" target="_blank" rel="noreferrer noopener">Tool Shaped Objects</a>&#8221; that I shared during our conversation: &#8220;The market for feeling productive is orders of magnitude larger than the market for being productive.&#8221; The essay is about things that feel amazing to build and use but aren&#8217;t necessarily doing the work that needs to be done.</p>



<p>Addy picked up on this immediately. &#8220;There is a difference between feeling busy and being productive,&#8221; he said. &#8220;You can have 100 agents working in the background and feel like you&#8217;re being productive. And then someone asks, What did you get built? How much money is it making you?&#8221;</p>



<p>This isn&#8217;t to dismiss anyone who&#8217;s genuinely productive running lots of agents. Some people are. But a healthy skepticism about your own productivity is worth maintaining, especially when the tools make it so easy to feel like you&#8217;re moving fast.</p>



<h2 class="wp-block-heading"><strong>Planning is the new coding</strong></h2>



<p>Addy talked about how the balance of his time on a task has shifted significantly. &#8220;I might spend 30, 40% of the time a task takes just to actually write out what exactly is it that I want,&#8221; he said. What are the constraints? What are the success criteria? What&#8217;s the architecture? What libraries and UI components should be used?</p>



<p>All of that work to get clarity before you start code generation leads to much-higher-quality outcomes from AI. As Addy put it, &#8220;LLMs are very good at grounding things in the lowest common denominator. If there are patterns in the training data that are popular, they&#8217;re going to use those unless you tell them otherwise.&#8221; If your team has established best practices, codify them in Markdown files or MCP tools so the agent can use them.</p>
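<p>Such a conventions file can be as simple as a Markdown document the agent is told to read before generating anything. A hypothetical example (the package name and specific rules are made up for illustration):</p>

```markdown
<!-- Hypothetical team-conventions file an agent reads before codegen -->
# Team conventions

- Use TypeScript strict mode; no `any`.
- UI: use our internal `@acme/ui` components, not raw HTML controls.
- Data access goes through the repository layer, never inline SQL.
- Every new module ships with unit tests and a short README section.
```

<p>Without something like this, the model defaults to the most popular patterns in its training data rather than yours.</p>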



<p>I connected the planning phase to something larger about taste. Think about Steve Jobs. He wasn&#8217;t a coder. He was a master of knowing what good looked like and driving those who worked with him to achieve it. In this new world, that skill matters enormously. You&#8217;re going to be like Jobs telling his engineers &#8220;no, no, not that&#8221; and giving them a vision of what&#8217;s beautiful and powerful. Except now some of those engineers are agents. So management skill, communication skill, and taste are becoming core technical competencies.</p>



<h2 class="wp-block-heading"><strong>Code review is getting harder</strong></h2>



<p>One thing Addy flagged that doesn&#8217;t get enough attention: &#8220;Increasingly teams feel like they&#8217;re being thrashed with all of these PRs that are AI generated. People don&#8217;t necessarily understand everything that&#8217;s in there. And you have to balance increased velocity expectations with &#8216;What is a quality bar?&#8217; because someone&#8217;s going to have to maintain this.&#8221;</p>



<p>Knowing your quality bar matters. What are the cases where you&#8217;re comfortable merging an AI-generated change? Maybe it&#8217;s small and well-compartmentalized and has solid test coverage. And what are the cases where you absolutely need deep human review? Getting clear on that distinction is one of the most practical things a team can do right now.</p>



<h2 class="wp-block-heading"><strong>Yes, young people should still go into software</strong></h2>



<p>We got a question about whether students should still pursue software engineering. Addy&#8217;s answer was emphatic: &#8220;There has never been a better time to get into software engineering if you are someone that is comfortable with learning. You do not necessarily have the burden of decades of knowing how things have historically been built. You can approach this with a very fresh set of eyes.&#8221; New entrants can go agent first. They can get deep into orchestration patterns and model trade-offs without having to unlearn old habits. And that&#8217;s a real advantage when interviewing at companies that need people who already know how to work this way.</p>



<p>The more important point is that in the early days of a new technology, people basically try to make the old things over again. The really big opportunities come when we figure out what was previously impossible that we can now do. If AI is as powerful as it appears to be, the opportunity isn&#8217;t to make companies more efficient at the same old work. It&#8217;s to solve entirely new problems and build entirely new kinds of products.</p>



<p>I&#8217;m 71 years old and 45 years into this industry, and this is the most excited I&#8217;ve ever been. More than the early web, more than open source. The future is being reinvented, and the people who start using these tools now get to be part of inventing it.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Is Software Engineering Still Worth It? Addy Osmani and Tim O&#039;Reilly’s Advice to Students" width="500" height="281" src="https://www.youtube.com/embed/29IBZCMb140?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>The token cost question</strong></h2>



<p>Addy had a funny and honest admission: &#8220;There were weeks when I would look at my bill for how much I was using in tokens and just be shocked. I don&#8217;t know that the productivity gains were actually worthwhile.&#8221;</p>



<p>His advice: experiment. Get a sense of what your typical tasks cost with multiple agents. Extrapolate. Ask yourself whether you&#8217;d still find it valuable at that price. Some people spend hundreds or even thousands a month on tokens and feel it&#8217;s worthwhile because the alternative was hiring a contractor. Others are spending that much and mostly feeling busy. As Addy said, &#8220;Don&#8217;t feel like you have to be spending a huge amount of money to not miss out on productivity wins.&#8221;</p>



<p>I&#8217;d add that we&#8217;re in a period where these costs are massively subsidized. The model companies are covering inference costs to get you locked in. Take advantage of that while it lasts. But also recognize that a lot of efficiency work is yet to be done. Just as JavaScript frameworks replaced everyone hand-coding UIs, we&#8217;ll get frameworks and tools that make agent workflows much more token-efficient than they are today.</p>



<h2 class="wp-block-heading"><strong>2028 predictions are already here</strong></h2>



<p>One of the most striking things Addy shared was that a group in the AI coding community that he is part of had put together predictions for what software engineering would look like by 2028. &#8220;We recently revisited that list, and I was kind of shocked to discover that almost everything on that list is already possible today,&#8221; he said. &#8220;But how quickly the rest of the ecosystem adopts these things is on a longer trajectory than what is possible.&#8221;</p>



<p>That gap between capability and adoption is where most of the interesting work will happen over the next few years. The technology is running ahead of our ability to absorb it. Figuring out how to close that gap, in your team, your company, and your own practice, is the real job right now.</p>



<h2 class="wp-block-heading"><strong>Agents writing code for agents</strong></h2>



<p>Near the end we answered another great audience question: Will agents eventually produce source code that&#8217;s optimized for other agents to read, not humans? Addy said yes. There are already platform teams having conversations about whether to build for an agent-first world where human readability becomes a secondary concern.</p>



<p>I have a historical parallel for this. I wrote the manual for the first C compiler on the Mac, and I worked closely with the developer who was hand-tuning the compiler output at the machine code level. That was about 30 years ago. We stopped doing that. And I&#8217;m quite confident there will be a similar moment with AI-generated code where humans mostly just let it go and trust the output. There will be special cases where people dive in for absolute performance or correctness. But they&#8217;ll be rare.</p>



<p>That transition won&#8217;t happen overnight. But the direction seems pretty clear. You can help to invent the future now, or spend time later trying to catch up with those who do.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em><em>This conversation was part of my ongoing series of discussions with innovators,</em> </em><a href="https://www.oreilly.com/live/live-with-tim/" target="_blank" rel="noreferrer noopener">Live with Tim O’Reilly</a><em><em>. You can explore past episodes <a href="https://learning.oreilly.com/playlists/536ae66a-50ad-4ebe-9771-ac87d984542b/" target="_blank" rel="noreferrer noopener">on the O’Reilly learning platform</a>.</em></em></p>
]]></content:encoded>
										</item>
		<item>
		<title>AI Is Not a Library: Designing for Nondeterministic Dependencies</title>
		<link>https://www.oreilly.com/radar/ai-is-not-a-library-designing-for-nondeterministic-dependencies/</link>
				<pubDate>Wed, 18 Feb 2026 11:43:54 +0000</pubDate>
					<dc:creator><![CDATA[Guruprasad Rao]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18073</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Swirling-library.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Swirling-library-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[For most of the history of software engineering, we’ve built systems around a simple and comforting assumption: Given the same input, a program will produce the same output. When something went wrong, it was usually because of a bug, a misconfiguration, or a dependency that wasn’t behaving as advertised. Our tools, testing strategies, and even [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>For most of the history of software engineering, we’ve built systems around a simple and comforting assumption: Given the same input, a program will produce the same output. When something went wrong, it was usually because of a bug, a misconfiguration, or a dependency that wasn’t behaving as advertised. Our tools, testing strategies, and even our mental models evolved around that expectation of determinism.</p>



<p>AI quietly breaks that assumption.</p>



<p>As large language models and AI services make their way into production systems, they often arrive through familiar shapes. There’s an API endpoint, a request payload, and a response body. Latency, retries, and timeouts all look manageable. From an architectural distance, it feels natural to treat these systems like libraries or external services.</p>



<p>In practice, that familiarity is misleading. AI systems behave less like deterministic components and more like nondeterministic collaborators. The same prompt can produce different outputs, small changes in context can lead to disproportionate shifts in results, and even retries can change behavior in ways that are difficult to reason about. These characteristics aren’t bugs; they’re inherent to how these systems work. The real problem is that our architectures often pretend otherwise. Instead of asking how to integrate AI as just another dependency, we need to ask how to design systems around components that do not guarantee stable outputs. Framing AI as a nondeterministic dependency turns out to be far more useful than treating it like a smarter API.</p>



<p>One of the first places where this mismatch becomes visible is retries. In deterministic systems, retries are usually safe. If a request fails due to a transient issue, retrying increases the chance of success without changing the outcome. With AI systems, retries don’t simply repeat the same computation. They generate new outputs. A retry might fix a problem, but it can just as easily introduce a different one. In some cases, retries quietly amplify failure rather than mitigate it, all while appearing to succeed.</p>
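<p>One practical consequence: a retry loop around an AI call should validate each attempt rather than assume a retry repeats the same computation. A minimal sketch, assuming a zero-argument <code>generate</code> callable that wraps the model call; the function names and the JSON contract are illustrative:</p>

```python
# Sketch: with a nondeterministic dependency, a blind retry is a new sample,
# not a repeat. Validating each attempt turns "retry" into "resample until
# acceptable, then fail loudly" instead of silently shipping a different output.
import json


def acceptable(output: str) -> bool:
    """Domain-specific acceptance check; here the output must be a JSON
    object containing a 'summary' field."""
    try:
        data = json.loads(output)
    except ValueError:
        return False
    return isinstance(data, dict) and "summary" in data


def call_with_validation(generate, max_attempts=3):
    """`generate` is any zero-arg callable wrapping the model call."""
    for _ in range(max_attempts):
        out = generate()
        if acceptable(out):
            return out
    raise RuntimeError(f"no acceptable output after {max_attempts} attempts")
```

<p>The explicit failure at the end is the point: an exhausted retry budget surfaces as an error instead of an unexamined, different answer.</p>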



<p>Testing reveals a similar breakdown in assumptions. Our existing testing strategies depend on repeatability. Unit tests validate exact outputs. Integration tests verify known behaviors. With AI in the loop, those strategies quickly lose their effectiveness. You can test that a response is syntactically valid or conforms to certain constraints, but asserting that it is “correct” becomes far more subjective. Matters get even more complicated as models evolve over time. A test that passed yesterday may fail tomorrow without any code changes, leaving teams unsure whether the system regressed or simply changed.</p>
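<p>One way to keep such tests meaningful is to assert constraints rather than exact strings. A sketch under assumed conventions: a hypothetical model response that is supposed to be JSON with a <code>summary</code> string and a <code>confidence</code> in [0, 1]:</p>

```python
import json

def validate_summary(raw: str) -> list[str]:
    """Return a list of constraint violations; empty means acceptable.
    Checks structure and bounds, never exact wording."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    problems = []
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("missing or empty 'summary'")
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("'confidence' missing or outside [0, 1]")
    return problems

# Two different outputs for the same prompt can both be acceptable:
a = '{"summary": "Payment failed due to an expired card.", "confidence": 0.9}'
b = '{"summary": "The card on file has expired.", "confidence": 0.8}'
assert validate_summary(a) == [] and validate_summary(b) == []
assert validate_summary('{"summary": ""}') != []
```

<p>Tests like this stay stable across model updates because they pin down the envelope of acceptability, not a single output.</p>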



<p>Observability introduces an even subtler challenge. Traditional monitoring excels at detecting loud failures. Error rates spike. Latency increases. Requests fail. AI-related failures are often quieter. The system responds. Downstream services continue. Dashboards stay green. Yet the output is incomplete, misleading, or subtly wrong in context. These “acceptable but wrong” outcomes are far more damaging than outright errors because they erode trust gradually and are difficult to detect automatically.</p>



<p>Once teams accept nondeterminism as a first-class concern, design priorities begin to shift. Instead of trying to eliminate variability, the focus moves toward containing it. That often means isolating AI-driven functionality behind clear boundaries, limiting where AI outputs can influence critical logic, and introducing explicit validation or review points where ambiguity matters. The goal isn’t to force deterministic behavior from an inherently probabilistic system but to prevent that variability from leaking into parts of the system that aren’t designed to handle it.</p>



<p>This shift also changes how we think about correctness. Rather than asking whether an output is correct, teams often need to ask whether it is acceptable for a given context. That reframing can be uncomfortable, especially for engineers accustomed to precise specifications, but it reflects reality more accurately. Acceptability can be constrained, measured, and improved over time, even if it can’t be perfectly guaranteed.</p>



<p>Observability needs to evolve alongside this shift. Infrastructure-level metrics are still necessary, but they’re no longer sufficient. Teams need visibility into outputs themselves: how they change over time, how they vary across contexts, and how those variations correlate with downstream outcomes. This doesn’t mean logging everything, but it does mean designing signals that surface drift before users notice it. Qualitative degradation often appears long before traditional alerts fire, if anyone is paying attention.</p>
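<p>As a toy illustration of an output-focused signal, a sliding-window monitor can flag responses that deviate sharply from the recent baseline. Here the signal is just output length; a real system would track richer quality metrics, but the shape of the check is the same:</p>

```python
from collections import deque
import statistics

class DriftMonitor:
    """Track a scalar output signal over a sliding window and flag values
    far from the recent baseline -- a crude drift alarm."""
    def __init__(self, window: int = 100, threshold: float = 3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Record a value; return True if it looks like drift."""
        drifted = False
        if len(self.values) >= 10:  # need some baseline before judging
            mean = statistics.fmean(self.values)
            stdev = statistics.pstdev(self.values) or 1.0
            drifted = abs(value - mean) > self.threshold * stdev
        self.values.append(value)
        return drifted

monitor = DriftMonitor()
for n in [200, 210, 195, 205, 198, 202, 207, 199, 203, 201]:
    monitor.observe(n)          # build a baseline of ~200-character responses
print(monitor.observe(204))     # in range -> False
print(monitor.observe(15))      # suddenly tiny responses -> True
```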



<p>One of the hardest lessons teams learn is that AI systems don’t offer guarantees in the way traditional software does. What they offer instead is probability. In response, successful systems rely less on guarantees and more on guardrails. Guardrails constrain behavior, limit blast radius, and provide escape hatches when things go wrong. They don’t promise correctness, but they make failure survivable. Fallback paths, conservative defaults, and human-in-the-loop workflows become architectural features rather than afterthoughts.</p>
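<p>A guardrail can be sketched in a few lines; the names and the policy limit below are invented for illustration. The model proposes, but a deterministic envelope bounds what it can commit to, with escalation to a human as the conservative default:</p>

```python
def classify_refund(request: dict, model_call) -> str:
    """Guardrail sketch: accept the model's answer only inside a safe
    envelope; otherwise fall back to human escalation."""
    ALLOWED = {"approve", "deny", "escalate"}
    MAX_AUTO_APPROVE = 100.00  # blast-radius limit (assumed policy)
    decision = model_call(request)
    if decision not in ALLOWED:
        return "escalate"  # unparseable output goes to a human
    if decision == "approve" and request["amount"] > MAX_AUTO_APPROVE:
        return "escalate"  # too large to auto-approve, whatever the model says
    return decision

# The guardrail, not the model, bounds the worst case:
assert classify_refund({"amount": 500.0}, lambda r: "approve") == "escalate"
assert classify_refund({"amount": 20.0}, lambda r: "approve") == "approve"
assert classify_refund({"amount": 20.0}, lambda r: "definitely yes!") == "escalate"
```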



<p>For architects and senior engineers, this represents a subtle but important shift in responsibility. The challenge isn’t choosing the right model or crafting the perfect prompt. It’s reshaping expectations, both within engineering teams and across the organization. That often means pushing back on the idea that AI can simply replace deterministic logic, and being explicit about where uncertainty exists and how the system handles it.</p>



<p>If I were starting again today, there are a few things I would do earlier. I would document explicitly where nondeterminism exists in the system and how it’s managed rather than letting it remain implicit. I would invest sooner in output-focused observability, even if the signals felt imperfect at first. And I would spend more time helping teams unlearn assumptions that no longer hold, because the hardest bugs to fix are the ones rooted in outdated mental models.</p>



<p>AI isn’t just another dependency. It challenges some of the most deeply ingrained assumptions in software engineering. Treating it as a nondeterministic dependency doesn’t solve every problem, but it provides a far more honest foundation for system design. It encourages architectures that expect variation, tolerate ambiguity, and fail gracefully.</p>



<p>That shift in thinking may be the most important architectural change AI brings, not because the technology is magical but because it forces us to confront the limits of determinism we’ve relied on for decades.</p>
]]></content:encoded>
										</item>
		<item>
		<title>AI, A2A, and the Governance Gap</title>
		<link>https://www.oreilly.com/radar/ai-a2a-and-the-governance-gap/</link>
				<pubDate>Tue, 17 Feb 2026 12:11:29 +0000</pubDate>
					<dc:creator><![CDATA[Shreshta Shyamsundar]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18057</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Nonsensical-traffic-lights.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Nonsensical-traffic-lights-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[When agent protocols outrun your org chart]]></custom:subtitle>
		
				<description><![CDATA[Over the past six months, I&#8217;ve watched the same pattern repeat across enterprise AI teams. A2A and ACP light up the room during architecture reviews—the protocols are elegant, the demos impressive. Three weeks into production, someone asks: “Wait, which agent authorized that $50,000 vendor payment at 2 am?” The excitement shifts to concern. Here&#8217;s the [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Over the past six months, I&#8217;ve watched the same pattern repeat across enterprise AI teams. A2A and ACP light up the room during architecture reviews—the protocols are elegant, the demos impressive. Three weeks into production, someone asks: “Wait, which agent authorized that $50,000 vendor payment at 2 am?” The excitement shifts to concern.</p>



<p>Here&#8217;s the paradox: <a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" target="_blank" rel="noreferrer noopener">Agent2Agent</a> (A2A) and the <a href="https://agentcommunicationprotocol.dev/introduction/welcome" target="_blank" rel="noreferrer noopener">Agent Communication Protocol</a> (ACP) are so effective at eliminating integration friction that they&#8217;ve removed the natural “brakes” that used to force governance conversations. We&#8217;ve solved the plumbing problem brilliantly. In doing so, we&#8217;ve created a new class of integration debt—one where organizations borrow speed today at the cost of accountability tomorrow.</p>



<p>The technical protocols are solid. The organizational protocols are missing. We&#8217;re rapidly moving from the “Can these systems connect?” phase to the “Who authorized this agent to liquidate a position at 3 am?” phase. In practice, that creates a governance gap: Our ability to connect agents is outpacing our ability to control what they commit us to.</p>



<p>To see why that shift is happening so fast, it helps to look at how the underlying “agent stack” is evolving. We’re seeing the emergence of a three-tier structure that quietly replaces traditional API-led connectivity:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Layer</strong></td><td><strong>Protocol examples</strong></td><td><strong>Purpose</strong></td><td><strong>The &#8220;human&#8221; analog</strong></td></tr><tr><td>Tooling</td><td>MCP (Model Context Protocol)</td><td>Connects agents to local data and specific tools</td><td>A worker&#8217;s toolbox</td></tr><tr><td>Context</td><td>ACP (Agent Communication Protocol)</td><td>Standardizes how goals, user history, and state move between agents</td><td>A worker&#8217;s memory and briefing</td></tr><tr><td>Coordination</td><td>A2A (Agent2Agent)</td><td>Handles discovery, negotiation, and delegation across boundaries</td><td>A contract or handshake</td></tr></tbody></table></figure>



<p>This stack makes multi-agent workflows a configuration problem instead of a custom engineering project. That is exactly why the risk surface is expanding faster than most CISOs realize.</p>



<p>Think of it this way: A2A is the <em>handshake</em> between agents (who talks to whom, about what tasks). ACP is the <em>briefing document</em> they exchange (what context, history, and goals move in that conversation). MCP is the <em>toolbox</em> each agent has access to locally. Once you see the stack this way, you also see the next problem: We&#8217;ve solved API sprawl and quietly replaced it with something harder to see—agent sprawl, and with it, a widening governance gap.</p>



<p>Most enterprises already struggle to govern hundreds of SaaS applications. One analysis puts the average at more than 370 SaaS apps per organization. Agent protocols do not reduce this complexity; they route around it. In the API era, humans filed tickets to trigger system actions. In the A2A era, agents use “Agent Cards” to discover each other and negotiate on top of those systems. ACP allows these agents to trade rich context—meaning a conversation starting in customer support can flow into fulfillment and partner logistics with zero human handoffs. What used to be API sprawl is becoming dozens of semiautonomous processes acting on behalf of your company across infrastructure you do not fully control. The friction of manual integration used to act as a natural brake on risk; A2A has removed that brake.</p>



<p>That governance gap doesn&#8217;t usually show up as a single catastrophic failure. It shows up as a series of small, confusing incidents where everything looks “green” in the dashboards but the business outcome is wrong. The protocol documentation focuses on encryption and handshakes but ignores the emergent failure modes of autonomous collaboration. These are not bugs in the protocols; they’re signs that the surrounding architecture has not caught up with the level of autonomy the protocols enable.</p>



<p><strong>Policy drift</strong>: A refund policy encoded in a service agent may technically interoperate with a partner&#8217;s collections agent via A2A, but their business logic may be diametrically opposed. When something goes wrong, nobody owns the end-to-end behavior.</p>



<p><strong>Context oversharing</strong>: A team might expand an ACP schema to include “User Sentiment” for better personalization, unaware that this data now propagates to every downstream third-party agent in the chain. What started as local enrichment becomes distributed exposure.</p>



<p><strong>The determinism trap</strong>: Unlike REST APIs, agents are nondeterministic. An agent&#8217;s refund policy logic might change when its underlying model is updated from GPT-4 to GPT-4.5, even though the A2A Agent Card declares identical capabilities. The workflow “works”—until it doesn&#8217;t, and there&#8217;s no version trace to debug. This creates what I call “ghost breaks”: failures that don&#8217;t show up in traditional observability because the interface contract looks unchanged.</p>



<p>Taken together, these aren&#8217;t edge cases. They&#8217;re what happens when we give agents more autonomy without upgrading the rules of engagement between them. These failure modes have a common root cause: The technical capability to collaborate across agents has outrun the organization&#8217;s ability to say where that collaboration is appropriate, and under what constraints.</p>



<p>That&#8217;s why we need something on top of the protocols themselves: an explicit “Agent Treaty” layer. If the protocol is the language, the treaty is the constitution. Governance must move from “side documentation” to “policy as code.”</p>






<p>Traditional governance treats policy violations as failures to prevent. An antifragile approach treats them as signals to exploit. When an agent makes a commitment that violates a business constraint, the system should capture that event, trace the causal chain, and feed it back into both the agent&#8217;s training and the treaty ruleset. Over time, the governance layer gets smarter, not just stricter.</p>



<p><strong>Define treaty-level constraints</strong>: Don&#8217;t just authorize a connection; authorize a scope. Which ACP fields is an agent allowed to share? Which A2A operations are “read only” versus “legally binding”? Which categories of decisions require human escalation?</p>



<p><strong>Version the behavior, not just the schema</strong>: Treat Agent Cards as first-class product surfaces. If the underlying model changes, the version must bump, triggering a rereview of the treaty. This is not bureaucratic overhead—it’s the only way to maintain accountability in a system where autonomous agents make commitments on behalf of your organization.</p>
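<p>One lightweight way to make that rereview trigger mechanical is to fingerprint the Agent Card together with the runtime configuration behind it, so a model swap forces a version change even when the declared capabilities don&#8217;t move. The field names below are illustrative, not part of the A2A specification:</p>

```python
import hashlib
import json

def behavior_fingerprint(card: dict, runtime: dict) -> str:
    """Hash the Agent Card together with the runtime details (model id,
    temperature, etc.) that the card alone doesn't capture."""
    payload = json.dumps({"card": card, "runtime": runtime}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

card = {"name": "refund-agent", "capabilities": ["refunds.process"]}
v1 = behavior_fingerprint(card, {"model": "gpt-4", "temperature": 0.2})
v2 = behavior_fingerprint(card, {"model": "gpt-4.5", "temperature": 0.2})
assert v1 != v2  # a model swap changes the fingerprint even if the card doesn't
```

<p>Gating deployment on a fingerprint change is what turns “the model was updated” from a silent event into a treaty rereview.</p>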



<p><strong>Cross-organizational traceability</strong>: We need observability traces that don&#8217;t just show latency but show intent: Which agent made this commitment, under which policy? And who is the human owner? This is particularly critical when workflows span organizational boundaries and partner ecosystems.</p>



<p>Designing that treaty layer isn&#8217;t just a tooling problem. It changes who needs to be in the room and how they think about the system. The hardest constraint isn&#8217;t the code; it&#8217;s the people. We’re entering a world where engineers must reason about multi-agent game theory and policy interactions, not just SDK integration. Risk teams must audit “machine-to-machine commitments” that may never be rendered in human language. Product managers must own agent ecosystems where a change in one agent&#8217;s reward function or context schema shifts behavior across an entire partner network. Compliance and audit functions need new tools and mental models to review autonomous workflows that execute at machine speed. In many organizations, those skills sit in different silos, and A2A/ACP adoption is proceeding faster than the cross-functional structures needed to manage them.</p>



<p>All of this might sound abstract until you look at where enterprises are in their adoption curve. Three converging trends are making this urgent: Protocol maturity means A2A, ACP, and MCP specifications have stabilized enough that enterprises are moving beyond pilots to production deployments. Multi-agent orchestration is shifting from single agents to agent ecosystems and workflows that span teams, departments, and organizations. And silent autonomy is blurring the line between “tool assistance” and “autonomous decision-making”—often without explicit organizational acknowledgment. We’re moving from integration (making things talk) to orchestration (making things act), but our monitoring tools still only measure the talk. The next 18 months will determine whether enterprises get ahead of this or we see a wave of high-profile failures that force retroactive governance.</p>



<p>The risk is not that A2A and ACP are unsafe; it’s that they are <em>too effective</em>. For teams piloting these protocols, stop focusing on the “happy path” of connectivity. Instead, pick one multi-agent workflow and instrument it as a critical product:</p>



<p><strong>Map the context flow</strong>: Every ACP field must have a “purpose limitation” tag. Document which agents see which fields, and which business or regulatory requirements justify that visibility. This isn’t an inventory exercise; it&#8217;s a way to surface hidden data dependencies.</p>
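<p>A toy version of purpose-limitation tagging, with invented field and purpose names: each context field declares the purposes it may serve, and a filter enforces that before context leaves the boundary:</p>

```python
# Hypothetical ACP-style context fields, each tagged with allowed purposes.
FIELD_PURPOSES = {
    "order_id":       {"fulfillment", "support"},
    "shipping_addr":  {"fulfillment"},
    "user_sentiment": {"support"},  # added for personalization; support-only
}

def share_context(context: dict, recipient_purpose: str) -> dict:
    """Forward only the fields whose declared purpose covers the recipient.
    Untagged fields are dropped by default."""
    return {k: v for k, v in context.items()
            if recipient_purpose in FIELD_PURPOSES.get(k, set())}

ctx = {"order_id": "A-123", "shipping_addr": "1 Main St",
       "user_sentiment": "frustrated"}
print(share_context(ctx, "fulfillment"))  # sentiment never reaches logistics
```

<p>The default-drop behavior for untagged fields is the point: local schema enrichment can&#8217;t silently become distributed exposure.</p>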



<p><strong>Audit the commitments</strong>: Identify every A2A interaction that represents a financial or legal commitment—especially ones that don&#8217;t route through human approval. Ask, “If this agent&#8217;s behavior changed overnight, who would notice? Who is accountable?”</p>



<p><strong>Code the treaty</strong>: Prototype a “gatekeeper” agent that enforces business constraints on top of the raw protocol traffic. This isn&#8217;t about blocking agents; it&#8217;s about making policy visible and enforceable at runtime. Start minimal: One policy, one workflow, one success metric.</p>
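<p>A minimal gatekeeper could look like the sketch below: one policy, one workflow, default-deny. The message fields and policy shape are assumptions for illustration, not part of the A2A specification:</p>

```python
# One treaty rule for one operation; anything not listed is denied.
POLICY = {
    "payments.create": {"max_amount": 10_000, "requires_human_above": 1_000},
}

def gate(message: dict) -> str:
    """Return 'allow', 'review', or 'block' for an A2A-style message."""
    rule = POLICY.get(message["operation"])
    if rule is None:
        return "block"  # default-deny: unlisted operations never pass
    amount = message.get("amount", 0)
    if amount > rule["max_amount"]:
        return "block"
    if amount > rule["requires_human_above"]:
        return "review"  # route to human escalation
    return "allow"

assert gate({"operation": "payments.create", "amount": 50_000}) == "block"
assert gate({"operation": "payments.create", "amount": 5_000}) == "review"
assert gate({"operation": "payments.create", "amount": 200}) == "allow"
assert gate({"operation": "positions.liquidate"}) == "block"
```

<p>Under this rule, the 2 am $50,000 payment from the opening anecdote would have been blocked at the protocol boundary, with a trace showing which policy fired.</p>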



<p><strong>Instrument for learning</strong>: Capture which agents collaborate, which policies they invoke, and which contexts they share. Treat this as telemetry, not just audit logs. Feed patterns back into governance reviews quarterly.</p>



<p>If this works, you now have a repeatable pattern for scaling agent deployments without sacrificing accountability. If it breaks, you&#8217;ve learned something critical about your architecture <em>before</em> it breaks in production. If you can get one workflow to behave this way—governed, observable, and learn-as-you-go—you have a template for the rest of your agent ecosystem.</p>



<p>If the last decade was about treating APIs as products, the next one will be about treating autonomous workflows as policies encoded in traffic between agents. The protocols are ready. Your org chart is not. The bridge between the two is the Agent Treaty—start building it before your agents start signing deals without you. The good news: You don&#8217;t need to redesign your entire organization. You need to add one critical layer—the Agent Treaty—that makes policy machine-enforceable, observable, and learnable. You need engineers who think about composition and game theory, not just connection. And you need to treat agent deployments as products, not infrastructure.</p>



<p>The sooner you start, the sooner that governance gap closes.</p>
]]></content:encoded>
										</item>
		<item>
		<title>Conductors to Orchestrators: The Future of Agentic Coding</title>
		<link>https://www.oreilly.com/radar/conductors-to-orchestrators-the-future-of-agentic-coding/</link>
				<pubDate>Fri, 13 Feb 2026 12:01:07 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18041</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Conductors-on-stage.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Conductors-on-stage-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[This post first appeared on Addy Osmani’s Elevate Substack newsletter and is being republished here with the author’s permission. AI coding assistants have quickly moved from novelty to necessity, where up to 90% of software engineers use some kind of AI for coding. But a new paradigm is emerging in software development—one where engineers leverage [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><em>This post first appeared on Addy Osmani’s </em><a href="https://addyo.substack.com/p/conductors-to-orchestrators-the-future" target="_blank" rel="noreferrer noopener">Elevate<em> Substack newsletter</em></a><em> and is being republished here with the author’s permission.</em></td></tr></tbody></table></figure>



<p>AI coding assistants have quickly moved from novelty to necessity: Up to 90% of software engineers now use some kind of AI for coding. But a new paradigm is emerging in software development—one where engineers leverage fleets of autonomous coding agents. In this agentic future, the role of the software engineer is evolving from implementer to manager, or in other words, from <em>coder</em> to conductor and ultimately <a href="https://www.youtube.com/watch?v=sQFIiB6xtIs" target="_blank" rel="noreferrer noopener">orchestrator</a>.</p>



<p>Over time, developers will increasingly guide AI agents to build the right code and coordinate multiple agents working in concert. This write-up explores the distinction between conductors and orchestrators in AI-assisted coding, defines these roles, and examines how today’s cutting-edge tools embody each approach. Senior engineers may start to see the writing on the wall: Our jobs are shifting from “How do I code this?” to “How do I get the right code built?”—a subtle but profound change.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="892" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-orchestrator-1600x892.png" alt="Will every engineer become an orchestrator" class="wp-image-18042" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-orchestrator-1600x892.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-orchestrator-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-orchestrator-768x428.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-orchestrator-1536x857.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-orchestrator.png 1678w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p>What’s the tl;dr of an orchestrator tool? It supports multi-agent workflows where you can run many agents in parallel without them interfering with each other. But let’s talk terminology first.</p>



<h2 class="wp-block-heading">The Conductor: Guiding a Single AI Agent</h2>



<p>In the context of AI coding, acting as a conductor means working closely with a single AI agent on a specific task, much like a conductor guiding a soloist through a performance.</p>



<p>The engineer remains in the loop at each step, dynamically steering the agent’s behavior, tweaking prompts, intervening when needed, and iterating in real time. This is the logical extension of the “AI pair programmer” model many developers are already familiar with. With conductor-style workflows, coding happens in a synchronous, interactive session between human and AI, typically in your IDE or CLI.</p>



<p>Key characteristics: A conductor keeps a tight feedback loop with one agent, verifying or modifying each suggestion, much as a driver navigates with a GPS. The AI helps write code, but the developer still performs many manual steps—creating branches, running tests, writing commit messages, etc.—and ultimately decides which suggestions to accept.</p>



<p>Crucially, most of this interaction is ephemeral: Once code is written and the session ends, the AI’s role is done and any context or decisions not captured in code may be lost. This mode is powerful for focused tasks and allows fine-grained control, but it doesn’t fully exploit what multiple AIs could do in parallel.</p>



<h3 class="wp-block-heading">Modern tools as conductors</h3>



<p>Several current AI coding tools exemplify the conductor pattern:</p>



<ul class="wp-block-list">
<li><strong>Claude Code (Anthropic)</strong>: Anthropic’s Claude model offers a coding assistant mode (accessible via a CLI tool or editor integration) where the developer converses with Claude to generate or modify code. For example, with the Claude Code CLI, you navigate your project in a shell and ask Claude to implement a function or refactor code, and it prints diffs or file updates for you to approve. You remain the conductor: You trigger each action and review the output immediately. While Claude Code has features to handle long-running tasks and tools, in the basic usage it’s essentially a smart codeveloper working step-by-step under human direction.</li>



<li><strong>Gemini CLI (Google)</strong>: A command-line assistant powered by Google’s Gemini model, used for planning and coding with a very large context window. An engineer can prompt Gemini CLI to analyze a codebase or draft a solution plan, then iterate on results interactively. The human directs each step and Gemini responds within the CLI session. It’s a one-at-a-time collaborator, not running off to make code changes on its own (at least in this conductor mode).</li>



<li><strong>Cursor (editor AI assistant)</strong>: The Cursor editor (a specialized AI-augmented IDE) can operate in an inline or chat mode where you ask it questions or to write a snippet, and it immediately performs those edits or gives answers within your coding session. Again, you guide it one request at a time. Cursor’s strength as a conductor is its deep context integration—it indexes your whole codebase so the AI can answer questions about any part of it. But the hallmark is that you, the developer, initiate and oversee each change in real time.</li>



<li><strong>VS Code, Cline, Roo Code (in-IDE chat)</strong>: Other in-IDE coding agents fall into the same category. They suggest code or even multistep fixes, but always under continuous human guidance.</li>
</ul>



<p>This conductor-style AI assistance has already boosted productivity significantly. It feels like having a junior engineer or pair programmer always by your side. However, it’s inherently one-agent-at-a-time and synchronous. To truly leverage AI at scale, we need to go beyond being a single-agent conductor. This is where the orchestrator role comes in.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="895" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-as-conductor-and-orchestrator-1600x895.jpg" alt="Engineer as conductor, engineer as orchestrator" class="wp-image-18043" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-as-conductor-and-orchestrator-1600x895.jpg 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-as-conductor-and-orchestrator-300x168.jpg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-as-conductor-and-orchestrator-768x430.jpg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-as-conductor-and-orchestrator-1536x859.jpg 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Engineer-as-conductor-and-orchestrator.jpg 1670w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<h2 class="wp-block-heading">The Orchestrator: Managing a Fleet of Agents</h2>



<p>If a conductor works with one AI “musician,” an orchestrator oversees the entire symphony of multiple AI agents working in parallel on different parts of a project. The orchestrator sets high-level goals, defines tasks, and lets a team of autonomous coding agents independently carry out the implementation details.</p>



<p>Instead of micromanaging every function or bug fix, the human focuses on coordination, quality control, and integration of the agents’ outputs. In practical terms, this often means an engineer can assign tasks to AI agents (e.g., via issues or prompts) and have those agents asynchronously produce code changes—often as ready-to-review pull requests. The engineer’s job becomes reviewing, giving feedback, and merging the results rather than writing all the code personally.</p>



<p>This asynchronous, parallel workflow is a fundamental shift. It moves AI assistance from the foreground to the background. While you attend to higher-level design or other work, your “AI team” is coding in the background. When they’re done, they hand you completed work (with tests, docs, etc.) for review. It’s akin to being a project tech lead delegating tasks to multiple devs and later reviewing their pull requests, except the “devs” are AI agents.</p>
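<p>The fan-out/fan-in shape of that workflow can be sketched in a few lines. The agents here are stubs that simulate background work; a real orchestrator would dispatch to actual coding agents and collect their pull requests for review:</p>

```python
import asyncio

async def agent_task(name: str, task: str) -> dict:
    """Stand-in for an autonomous agent working a task in the background."""
    await asyncio.sleep(0.01)  # the agent codes, runs tests, commits...
    return {"task": task, "agent": name, "status": "pull request ready"}

async def orchestrate(tasks: list[str]) -> list[dict]:
    # Fan out: each task goes to its own agent, all running concurrently.
    # The human only sees the fan-in: finished work ready for review.
    return await asyncio.gather(
        *(agent_task(f"agent-{i}", t) for i, t in enumerate(tasks)))

results = asyncio.run(orchestrate(["fix login bug", "add dark mode", "bump deps"]))
for r in results:
    print(r["task"], "->", r["status"])
```

<p>The orchestrator&#8217;s job in this sketch is purely coordination: It never writes the code, only assigns work and gathers results.</p>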



<h3 class="wp-block-heading">Modern tools as orchestrators</h3>



<p>Over just the past year, several tools have emerged that embody this orchestrator paradigm:</p>



<ul class="wp-block-list">
<li><strong>GitHub Copilot coding agent (Microsoft)</strong>: This upgrade to Copilot transforms it from an in-editor assistant into an autonomous background developer. (I cover it in <a href="https://www.youtube.com/watch?v=sQFIiB6xtIs">this video</a>.) You can assign a GitHub issue to Copilot’s agent or invoke it via the VS Code agents panel, telling it (for example) “Implement feature X” or “Fix bug Y.” Copilot then spins up an ephemeral dev environment via GitHub Actions, checks out your repo, creates a new branch, and begins coding. It can run tests, linters, even spin up the app if needed, all without human babysitting. When finished, it opens a pull request with the changes, complete with a description and meaningful commit messages. It then asks for your review.<br><br>You, the human orchestrator, review the PR (perhaps using Copilot’s AI-assisted code review to get an initial analysis). If changes are needed, you can leave comments like “@copilot please update the unit tests for edge case Z,” and the agent will iterate on the PR. This is asynchronous, autonomous code generation in action. Notably, Copilot automates the tedious bookkeeping—branch creation, committing, opening PRs, etc.—which used to cost developers time. All the grunt work around writing code (aside from the design itself) is handled, allowing developers to focus on reviewing and guiding at a high level. GitHub’s agent effectively lets one engineer supervise many “AI juniors” working in parallel across different issues (and you can even create multiple specialized agents for different task types).</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="831" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/GitHub-Agents-1600x831.png" alt="Delegate tasks to GitHub Copilot" class="wp-image-18044" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/GitHub-Agents-1600x831.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/GitHub-Agents-300x156.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/GitHub-Agents-768x399.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/GitHub-Agents-1536x798.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/GitHub-Agents-2048x1064.png 2048w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<ul class="wp-block-list">
<li><strong>Jules, Google’s coding agent</strong>: In Google’s words, Jules is “not a copilot, not a code-completion sidekick, but an autonomous agent that reads your code, understands your intent, and gets to work.” Integrated with Google Cloud and GitHub, Jules lets you connect a repository and then ask it to perform tasks much as you would a developer on your team. Under the hood, Jules clones your entire codebase into a secure cloud VM and analyzes it with a powerful model. You might tell Jules “Add user authentication to our app” or “Upgrade this project to the latest Node.js and fix any compatibility issues.” It will formulate a plan, present it to you for approval, and once you approve, execute the changes asynchronously. It makes commits on a new branch and can even open a pull request for you to merge. Jules handles writing new code, updating tests, bumping dependencies, etc., all while you could be doing something else.<br><br>Crucially, Jules provides transparency and control: It shows you its proposed plan and reasoning before making changes, and allows you to intervene or modify instructions at any point (a feature Google calls “user steerability”). This is akin to giving an AI intern the spec and watching over their shoulder less frequently—you trust them to get it mostly right, but you still verify the final diff. Jules also boasts unique touches like audio changelogs (it generates spoken summaries of code changes) and the ability to run multiple tasks concurrently in the cloud. In short, Google’s Jules demonstrates the orchestrator model: You define the task, Jules does the heavy lifting asynchronously, and you oversee the result.</li>
</ul>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1048" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Jules-bugs.jpg" alt="Jules bugs" class="wp-image-18045" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Jules-bugs.jpg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Jules-bugs-300x225.jpg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Jules-bugs-768x575.jpg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<ul class="wp-block-list">
<li><strong>OpenAI Codex (cloud agent)</strong>: OpenAI introduced a new cloud-based Codex agent to complement ChatGPT. This evolved Codex (different from the 2021 Codex model) is described as “a cloud-based software engineering agent that can work on many tasks in parallel.” It’s available as part of ChatGPT Plus/Pro under the name OpenAI Codex and via an npm CLI (<em>npm i -g @openai/codex</em>). With the Codex CLI or its VS Code/Cursor extensions, you can delegate tasks to OpenAI’s agent similar to Copilot or Jules. For instance, from your terminal you might say, “Hey Codex, implement dark mode for the settings page.” Codex then launches into your repository, edits the necessary files, perhaps runs your test suite, and when done, presents the diff for you to merge. It operates in an isolated sandbox for safety, running each task in a container with your repo and environment.<br><br>Like others, OpenAI’s Codex agent integrates with developer workflows: You can even kick off tasks from a ChatGPT mobile app on your phone and get notified when the agent is done. OpenAI emphasizes seamless switching “between real-time collaboration and async delegation” with Codex. In practice, this means you have the flexibility to use it in conductor mode (pair-programming in your IDE) or orchestrator mode (hand off a background task to the cloud agent). Codex can also be invited into your Slack channels—teammates can assign tasks to @Codex in Slack, and it will pull context from the conversation and your repo to execute them. It’s a vision of ubiquitous AI assistance, where coding tasks can be delegated from anywhere. Early users report that Codex can autonomously identify and fix bugs, or generate significant features, given a well-scoped prompt. All of this again aligns with the orchestrator workflow: The human defines the goal; the AI agent autonomously delivers a solution.</li>
</ul>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1274" height="847" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Codex-coding.jpg" alt="What are we coding next Codex" class="wp-image-18046" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Codex-coding.jpg 1274w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Codex-coding-300x199.jpg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Codex-coding-768x511.jpg 768w" sizes="auto, (max-width: 1274px) 100vw, 1274px" /></figure>



<ul class="wp-block-list">
<li><strong>Anthropic Claude Code (for web)</strong>: Anthropic has offered Claude as an AI chatbot for a while, and their Claude Code CLI has been a favorite for interactive coding. Anthropic took the next step by launching Claude Code for web, effectively a hosted version of their coding agent. Using Claude Code for web, you point it at your GitHub repo (with configurable sandbox permissions) and give it a task. The agent then runs in Anthropic’s managed container, just like the CLI version, but now you can trigger it from a web interface or even a mobile app. It queues up multiple prompts and steps, executes them, and when done, pushes a branch to your repo (and can open a PR). Essentially, Anthropic took their single-agent Claude Code and made it an orchestratable service in the cloud. They even provided a “teleport” feature to transfer the session to your local environment if you want to take over manually.<br><br>The rationale for this web version aligns with orchestrator benefits: convenience and scale. You don’t need to run long jobs on your machine; Anthropic’s cloud handles the heavy lifting, with filesystem and network isolation for safety. Claude Code for web acknowledges that autonomy with safety is key—by sandboxing the agent, they reduce the need for constant permission prompts, letting the agent operate more freely (less babysitting by the user). In effect, Anthropic has made it easier to use Claude as an autonomous coding worker you launch on demand.</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="897" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Claude-code-subscription-discounts-1600x897.jpg" alt="Discounts with Claude Code" class="wp-image-18047" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Claude-code-subscription-discounts-1600x897.jpg 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Claude-code-subscription-discounts-300x168.jpg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Claude-code-subscription-discounts-768x431.jpg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Claude-code-subscription-discounts-1536x861.jpg 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Claude-code-subscription-discounts-2048x1148.jpg 2048w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<ul class="wp-block-list">
<li><strong>Cursor background agents</strong>: tl;dr: Cursor 2.0 has a <a href="https://cursor.com/blog/2-0#the-multi-agent-interface" target="_blank" rel="noreferrer noopener">multi-agent interface</a> organized around agents rather than files. Cursor 2 expands its <a href="https://cursor.com/docs/cloud-agent" target="_blank" rel="noreferrer noopener">background agents</a> feature into a full-fledged orchestration layer for developers. Beyond serving as an interactive assistant, Cursor 2 lets you spawn autonomous background agents that operate asynchronously in a managed cloud workspace. When you delegate a task, Cursor 2’s agents now clone your GitHub repository, spin up an ephemeral environment, and check out an isolated branch where they execute work end-to-end. These agents can handle the entire development loop—from editing and running code to installing dependencies, executing tests, running builds, and even searching the web or referencing documentation to resolve issues. Once complete, they push commits and open a detailed pull request summarizing their work.<br><br>Cursor 2 introduces multi-agent orchestration, allowing several background agents to run concurrently across different tasks—for instance, one refining UI components while another optimizes backend performance or fixes tests. Each agent’s activity is visible through a real-time dashboard that can be accessed from desktop or mobile, enabling you to monitor progress, issue follow-ups, or intervene manually if needed. This new system effectively treats each agent as part of an on-demand AI workforce, coordinated through the developer’s high-level intent. Cursor 2’s focus on parallel, asynchronous execution dramatically amplifies a single engineer’s throughput—fully realizing the orchestrator model where humans oversee a fleet of cooperative AI developers rather than a single assistant.</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="900" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/New-agent-layout-1600x900.jpg" alt="Agents layout adjustments for token display" class="wp-image-18048" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/New-agent-layout-1600x900.jpg 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/New-agent-layout-300x169.jpg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/New-agent-layout-768x432.jpg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/New-agent-layout-1536x864.jpg 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/New-agent-layout-2048x1153.jpg 2048w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<ul class="wp-block-list">
<li><strong>Agent orchestration platforms</strong>: Beyond individual product offerings, there are also emerging platforms and open source projects aimed at orchestrating multiple agents. For instance, <a href="https://conductor.build/" target="_blank" rel="noreferrer noopener">Conductor</a> by Melty Labs (despite its name!) is actually an orchestration tool that lets you deploy and manage multiple Claude Code agents on your own machine in parallel. With Conductor, each agent gets its own isolated Git worktree to avoid conflicts, and you can see a dashboard of all agents (“who’s working on what”) and review their code as they progress. The idea is to make running a small swarm of coding agents as easy as running one. Similarly, <a href="https://smtg-ai.github.io/claude-squad/" target="_blank" rel="noreferrer noopener">Claude Squad</a> is a popular open source terminal app that essentially multiplexes Anthropic’s Claude—it can spawn several Claude Code instances working concurrently in separate tmux panes, allowing you to give each a different task and thus code “10x faster” by parallelizing. These orchestration tools underscore the trend: Developers want to coordinate <em>multiple</em> AI coding agents and have them collaborate or divide work. Even Microsoft’s Azure AI services are enabling this: At Build 2025 they announced tools for developers to “orchestrate multiple specialized agents to handle complex tasks,” with SDKs supporting agent-to-agent communication so your fleet of agents can talk to each other and share context. All of this infrastructure is being built to support the orchestrator engineer, who might eventually oversee dozens of AI processes tackling different parts of the software development lifecycle.</li>
</ul>
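<p>The worktree-per-agent isolation these tools rely on is easy to sketch. The following demo (the repository, branch, and task names are invented for illustration) creates a throwaway Git repo and gives each of two “agents” its own branch and working directory, so neither can clobber the other’s checkout:</p>

```python
# Sketch: one isolated Git worktree per agent so concurrent agents never
# edit the same checkout. Repo, branch, and task names are invented.
import subprocess
import tempfile
from pathlib import Path

def run(*args, cwd=None):
    subprocess.run(args, cwd=cwd, check=True, capture_output=True)

def add_agent_worktree(repo: Path, task: str) -> Path:
    """Create a branch plus a dedicated worktree for one agent task."""
    workdir = repo.parent / f"wt-{task}"
    run("git", "-C", str(repo), "worktree", "add", "-b", f"agent/{task}", str(workdir))
    return workdir  # the agent operates only inside this directory

# Demo against a throwaway repo containing a single empty commit.
base = Path(tempfile.mkdtemp())
repo = base / "myrepo"
repo.mkdir()
run("git", "init", cwd=repo)
run("git", "-c", "user.email=agent@example.com", "-c", "user.name=agent",
    "commit", "--allow-empty", "-m", "init", cwd=repo)
worktrees = [add_agent_worktree(repo, t) for t in ("fix-tests", "refactor-ui")]
print([w.name for w in worktrees])
```

<p>Tools like Conductor automate this bookkeeping (creating, tracking, and cleaning up a worktree per agent) so you only ever see the dashboard.</p>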



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="1081" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Update-workspace-sidebar-1600x1081.png" alt="Update workspace sidebar" class="wp-image-18049" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Update-workspace-sidebar-1600x1081.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Update-workspace-sidebar-300x203.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Update-workspace-sidebar-768x519.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Update-workspace-sidebar-1536x1038.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Update-workspace-sidebar-2048x1384.png 2048w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I found <a href="https://conductor.build/" target="_blank" rel="noreferrer noopener">Conductor</a> to make the most sense to me. It was a perfect balance of talking to an agent and seeing my changes in a pane next to it. Its Github integration feels seamless; e.g. after merging PR, it immediately showed a task as “Merged” and provided an “Archive” button.<br>—<a href="https://www.linkedin.com/in/juriyzaytsev" target="_blank" rel="noreferrer noopener">Juriy Zaytsev</a>, Staff SWE, LinkedIn</p>
</blockquote>



<p>He also tried <a href="https://www.magnet.run/" target="_blank" rel="noreferrer noopener">Magnet</a>:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The idea of tying tasks to a Kanban board is interesting and makes sense. As such, Magnet feels very product-centric.</p>
</blockquote>



<h2 class="wp-block-heading">Conductor versus Orchestrator—Differences</h2>



<p>Many engineers will continue to engage in conductor-style workflows (single agent, interactive) even as orchestrator patterns mature. The two modes will coexist.</p>



<p>It’s clear that “conductor” and “orchestrator” aren’t just fancy terms; they describe a genuine shift in how we work with AI.</p>



<ul class="wp-block-list">
<li><strong>Scope of control</strong>: A conductor operates at the micro level, guiding one agent through a single task or a narrow problem. An orchestrator operates at the macro level, defining broader tasks and objectives for multiple agents or for a powerful single agent that can handle multistep projects. The conductor asks, “How do I solve this function or bug with the AI’s help?” The orchestrator asks, “What set of tasks can I delegate to AI agents today to move this project forward?”</li>



<li><strong>Degree of autonomy</strong>: In conductor mode, the AI’s autonomy is low—it waits for user prompts each step of the way. In orchestrator mode, we give the AI high autonomy—it might plan and execute dozens of steps internally (writing code, running tests, adjusting its approach) before needing human feedback. A GitHub Copilot agent or Jules will try to complete a feature from start to finish once assigned, whereas Copilot’s IDE suggestions only go line-by-line as you type.</li>



<li><strong>Synchronous vs asynchronous</strong>: Conductor interactions are typically synchronous—you prompt; AI responds within seconds; you immediately integrate or iterate. It’s a real-time loop. Orchestrator interactions are asynchronous—you might dispatch an agent and check back minutes or hours later when it’s done (somewhat like kicking off a long CI job). This means orchestrators must handle waiting, context-switching, and possibly managing multiple things concurrently, which is a different workflow rhythm for developers.</li>



<li><strong>Artifacts and traceability</strong>: A subtle but important difference: Orchestrator workflows produce persistent artifacts like branches, commits, and pull requests that are preserved in version control. The agent’s work is fully recorded (and often linked to an issue/ticket), which improves traceability and collaboration. With conductor-style (IDE chat, etc.), unless the developer manually commits intermediate changes, a lot of the AI’s involvement isn’t explicitly documented. In essence, orchestrators leave a paper trail (or rather a Git trail) that others on the team can see or even trigger themselves. This can help bring AI into team processes more naturally.</li>



<li><strong>Human effort profile</strong>: For a conductor, the human is actively engaged nearly 100% of the time the AI is working—reviewing each output, refining prompts, etc. It’s interactive work. For an orchestrator, the human’s effort is front-loaded (writing a good task description or spec for the agent, setting up the right context) and back-loaded (reviewing the final code and testing it), but not much is needed in the middle. This means one orchestrator can manage more total work in parallel than would ever be possible by working with one AI at a time. Essentially, orchestrators leverage automation at scale, trading off fine-grained control for breadth of throughput.</li>
</ul>



<p>To illustrate, consider a common scenario: adding a new feature that touches frontend and backend and requires new tests. As a conductor, you might open your AI chat and implement the backend logic with the AI’s help, then separately implement the frontend, then ask it to generate some tests—doing each step sequentially with you in the loop throughout. As an orchestrator, you could assign the backend implementation to one agent (Agent A), the frontend UI changes to another (Agent B), and test creation to a third (Agent C). You give each a prompt or an issue description, then step back and let them work concurrently.</p>



<p>After a short time, you get perhaps three PRs: one for backend, one for frontend, one for tests. Your job then is to review and integrate them (and maybe have Agent C adjust tests if Agents A/B’s code changed during integration). In effect, you managed a mini “AI team” to deliver the feature. This example highlights how orchestrators think in terms of task distribution and integration, whereas conductors focus on step-by-step implementation.</p>
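<p>That fan-out/fan-in rhythm is worth making concrete. Here is a minimal Python sketch with stub functions standing in for the agents (nothing here is a real agent API): each “agent” runs concurrently and returns a pull-request-like record, and the human reviews at the end:</p>

```python
# Fan-out/fan-in sketch: three stubbed "agents" work concurrently and each
# returns a PR-like record for the human to review. run_agent is a
# placeholder; a real agent would clone the repo, edit code, run tests,
# and push a branch.
from concurrent.futures import ThreadPoolExecutor

def run_agent(name: str, task: str) -> dict:
    return {"agent": name, "branch": f"agent/{name}", "task": task,
            "status": "ready-for-review"}

tasks = {"A": "backend endpoints", "B": "frontend UI", "C": "integration tests"}

with ThreadPoolExecutor(max_workers=3) as pool:          # fan-out
    futures = [pool.submit(run_agent, n, t) for n, t in tasks.items()]
    prs = [f.result() for f in futures]                  # fan-in: wait for all

for pr in prs:                                           # human-in-the-loop review
    print(f"review {pr['branch']}: {pr['task']} ({pr['status']})")
```

<p>The integration step stays human: you merge the branches yourself, and if Agent C’s tests no longer match the merged code from A and B, you dispatch a follow-up task rather than editing by hand.</p>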



<p>It’s worth noting that these roles are fluid, not rigid categories. A single developer might act as a conductor in one moment and an orchestrator the next. For example, you might kick off an asynchronous agent to handle one task (orchestrator mode) while you personally work with another AI on a tricky algorithm in the meantime (conductor mode). Tools are also blurring lines: As OpenAI’s Codex marketing suggests, you can seamlessly switch between collaborating in real-time and delegating async tasks. So, think of “conductor” versus “orchestrator” as two ends of a spectrum of AI-assisted development, with many hybrid workflows in between.</p>



<h2 class="wp-block-heading">Why Orchestrators Matter</h2>



<p>Experts are suggesting that this shift to orchestration could be one of the biggest leaps in programming productivity we’ve ever seen. Consider the historical trends: We went from writing assembly to using high-level languages, then to using frameworks and libraries, and recently to leveraging AI for autocompletion. Each step abstracted away more low-level work. Autonomous coding agents are the next abstraction layer. Instead of manually coding every piece, you describe what you need at a higher level and let multiple agents build it.</p>



<p>As orchestrator-style agents ramp up, we could imagine even larger percentages of code being drafted by AIs. What does a software team look like when AI agents generate, say, 80% or 90% of the code, and humans provide the remaining critical guidance and oversight? Many believe it doesn’t mean replacing developers—it means augmenting developers to build better software. We may witness an explosion of productivity where a small team of engineers, effectively managing dozens of agent processes, can accomplish what once took an army of programmers months. (Note: I continue to believe the code review loop, where our human skills will remain focused, is going to need work if all this code is not to become slop.)</p>



<p>One intriguing possibility is that every engineer becomes, to some degree, a <em>manager</em> of AI developers. It’s a bit like everyone having a personal team of interns or junior engineers. Your effectiveness will depend on how well you can break down tasks, communicate requirements to AI, and verify the results. Human judgment will remain vital: deciding what to build, ensuring correctness, handling ambiguity, and injecting creativity or domain knowledge where AI might fall short. In other words, the skillset of an orchestrator—good planning, prompt engineering, validation, and oversight—is going to be in high demand. Far from making engineers obsolete, these agents could elevate engineers into more strategic, supervisory roles on projects.</p>



<h2 class="wp-block-heading">Toward an “AI Team” of Specialists</h2>



<p>Today’s coding agents mostly tackle implementation: write code, fix code, write tests, etc. But the vision doesn’t stop there. Imagine a full software development pipeline where multiple specialized AI agents handle different phases of the lifecycle, coordinated by a human orchestrator. This is already on the horizon. Researchers and companies have floated architectures where, for example, you have:</p>



<ul class="wp-block-list">
<li>A planning agent that analyzes feature requests or bug reports and breaks them into specific tasks</li>



<li>A coding agent (or several) that implements the tasks in code</li>



<li>A testing agent that generates and runs tests to verify the changes</li>



<li>A code review agent that checks the pull requests for quality and standards compliance</li>



<li>A documentation agent that updates README or docs to reflect the changes</li>



<li>Possibly a deployment/monitoring agent that can roll out the change and watch for issues in production.</li>
</ul>



<p>In this scenario, the human engineer’s role becomes one of oversight and orchestration across the whole flow: You might initiate the process with a high-level goal (e.g., “Add support for payment via cryptocurrency in our app”); the planning agent turns that into subtasks; coding agents implement each subtask asynchronously; the testing agent and review agent catch problems or polish the code; and finally everything gets merged and deployed under watch of monitoring agents.</p>



<p>The human would step in to approve plans, resolve any conflicts or questions the agents raise, and give final approval to deploy. This is essentially an “AI swarm” tackling software development end to end, with the engineer orchestrating the whole ensemble.</p>
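<p>One way to picture that pipeline is as a chain of stages with a human gate before anything irreversible. In this sketch every stage is a stub standing in for a specialized agent, and the approval callback is where the engineer signs off:</p>

```python
# Pipeline sketch: plan -> code -> test -> review -> document, then a
# human approval gate before deploy. Each stage is a stub; a real system
# would invoke a specialized agent and pass artifacts between stages.
from typing import Callable

def make_stage(name: str) -> Callable[[dict], dict]:
    def stage(state: dict) -> dict:
        state.setdefault("log", []).append(name)  # stand-in for real agent work
        return state
    return stage

PIPELINE = [make_stage(s) for s in ("plan", "code", "test", "review", "document")]

def orchestrate(goal: str, approve: Callable[[dict], bool]) -> dict:
    state = {"goal": goal}
    for stage in PIPELINE:
        state = stage(state)
    state["deployed"] = bool(approve(state))  # human gate: nothing ships unapproved
    return state

result = orchestrate("add crypto payments", approve=lambda state: True)
print(result["log"], result["deployed"])
```

<p>In practice the approval callback would surface the plan and the final diff to the engineer, mirroring the approve-plan and merge steps the hosted agents described earlier already require.</p>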



<p>While this might sound futuristic, we see early signs. Microsoft’s Azure AI Foundry now provides building blocks for multi-agent workflows and agent orchestration in enterprise settings, implicitly supporting the idea that multiple agents will collaborate on complex, multistep tasks. Internal experiments at tech companies have agents creating pull requests that other agent reviewers automatically critique, forming an AI/AI interaction with a human in the loop at the end. In open source communities, people have chained tools like Claude Squad (parallel coders) with additional scripts that integrate their outputs. And the conversation has started about standards like the Model Context Protocol (MCP) for agents sharing state and communicating results to each other.</p>



<p>I’ve noted before that “specialized agents for Design, Implementation, Test, and Monitoring could work together to develop, launch, and land features in complex environments”—with developers onboarding these AI agents to their team and guiding/overseeing their execution. In such a setup, agents would “coordinate with other agents autonomously, request human feedback, reviews and approvals” at key points, and otherwise handle the busywork among themselves. The goal is a central platform where we can deploy specialized agents across the workflow, without humans micromanaging each individual step—instead, the human oversees the entire operation with full context.</p>



<p>This could transform how software projects are managed: more like running an automated assembly line where engineers ensure quality and direction rather than handcrafting each component on the line.</p>



<h2 class="wp-block-heading">Challenges and the Human Role in Orchestration</h2>



<p>Does this mean programming becomes a push-button activity where you sit back and let the AI factory run? Not quite—and likely never entirely. There are significant challenges and open questions with the orchestrator model:</p>



<ul class="wp-block-list">
<li><strong>Quality control and trust</strong>: Orchestrating multiple agents means you’re not eyeballing every single change as it’s made. Bugs or design flaws might slip through if you solely rely on AI. Human oversight remains critical as the final failsafe. Indeed, current tools explicitly require the human to review the AI’s pull requests before merging. The relationship is often compared to managing a team of junior developers: They can get a lot done, but you wouldn’t ship their code without review. The orchestrator engineer must be vigilant about checking the AI’s work, writing good test cases, and having monitoring in place. AI agents can make mistakes or produce logically correct but undesirable solutions (for instance, implementing a feature in a convoluted way). Part of the orchestration skillset is knowing when to intervene versus when to trust the agent’s plan. As the CTO of Stack Overflow wrote, “Developers maintain expertise to evaluate AI outputs” and will need new “trust models” for this collaboration.</li>



<li><strong>Coordination and conflict</strong>: When multiple agents work on a shared codebase, coordination issues arise—much like multiple developers can conflict if they touch the same files. We need strategies to prevent merge conflicts or duplicated work. Current solutions use workspace isolation (each agent works on its own Git branch or separate environment) and clear task separation. For example, one agent per task, and tasks designed to minimize overlap. Some orchestrator tools can even automatically merge changes or rebase agent branches, but usually it falls to the human to integrate. Ensuring agents don’t step on each others’ toes is an active area of development. It’s conceivable that in the future agents might negotiate with each other (via something like agent-to-agent communication protocols) to avoid conflicts, but today the orchestrator sets the boundaries.</li>



<li><strong>Context, shared state, and handoffs</strong>: Coding workflows are rich in state: repository structure, dependencies, build systems, test suites, style guidelines, team practices, legacy code, branching strategies, etc. Multi-agent orchestration demands shared context, memory, and smooth transitions. But in enterprise settings, context sharing across agents is nontrivial. Without a unified “workflow orchestration layer,” each agent can become a silo, working well in its own domain but failing to mesh with the others. On an engineering team this may play out as: one agent creates a feature branch; another runs the unit tests; another merges into master—and if the first agent doesn’t tag the metadata the second is expecting, you get breakdowns.</li>



<li><strong>Prompting and specifications</strong>: Ironically, as the AI handles more coding, the human’s “coding” moves up a level to writing specifications and prompts. The quality of an agent’s output is highly dependent on how well you specify the task. Vague instructions lead to subpar results or agents going astray. Best practices that have emerged include writing mini design docs or acceptance criteria for the agents—essentially treating them like contractors who need a clear definition of done. This is why we’re seeing ideas like spec-driven development for AI: You feed the agent a detailed spec of what to build, so it can execute predictably. Engineers will need to hone their ability to describe problems and desired solutions unambiguously. Paradoxically, it’s a very old-school skill (writing good specs and tests) made newly important in the AI era. As agents improve, prompts might get simpler (“write me a mobile app for X and Y with these features”) and yet yield more complex results, but we’re not quite at the point of the AI intuiting everything unsaid. For now, orchestrators must be excellent communicators to their digital workforce.</li>



<li><strong>Tooling and debugging</strong>: With a human developer, if something goes wrong, they can debug in real time. With autonomous agents, if something goes wrong (say the agent gets stuck on a problem or produces a failing PR), the orchestrator has to debug the situation: Was it a bad prompt? Did the agent misinterpret the spec? Do we roll back and try again or step in and fix it manually? New tools are being added to help here: For instance, checkpointing and rollback commands let you undo an agent’s changes if it went down a wrong path. Monitoring dashboards can show if an agent is taking too long or has errors. But effectively, orchestrators might at times have to drop down to conductor mode to fix an issue, then go back to orchestration. This interplay will improve as agents get more robust, but it highlights that orchestrating isn’t just “fire and forget”—it requires active monitoring. AI observability tools (tracking cost, performance, accuracy of agents) are likely to become part of the developer’s toolkit.</li>



<li><strong>Ethics and responsibility</strong>: Another angle—if an AI agent writes most of the code, who is responsible for license compliance, security vulnerabilities, or bias in that code? Ultimately the human orchestrator (or their organization) carries responsibility. This means orchestrators should incorporate practices like security scanning of AI-generated code and verifying dependencies. Interestingly, some agents like Copilot and Jules include built-in safeguards: They won’t introduce known vulnerable versions of libraries, for instance, and can be directed to run security audits. But at the end of the day, “trust, but verify” is the mantra. The human remains accountable for what ships, so orchestrators will need to ensure AI contributions meet the team’s quality and ethical standards.</li>
</ul>
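<p>The spec-writing point above can even be enforced mechanically: before dispatching an agent, check that the task spec actually defines “done.” A small sketch (the field names and example task are illustrative, not any tool’s real schema):</p>

```python
# Spec-driven delegation sketch: reject a task before dispatch if it
# lacks a goal or acceptance criteria. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    goal: str
    context: str = ""  # relevant files, constraints, style rules
    acceptance_criteria: list = field(default_factory=list)

    def missing(self) -> list:
        gaps = []
        if not self.goal.strip():
            gaps.append("goal")
        if not self.acceptance_criteria:
            gaps.append("acceptance_criteria")
        return gaps

spec = TaskSpec(
    goal="Add dark mode to the settings page",
    context="React app; theme tokens live in src/theme.ts",
    acceptance_criteria=["toggle persists across sessions",
                         "all pages pass contrast checks"],
)
print(spec.missing())  # an empty list means the spec is dispatch-ready
```

<p>Rejecting underspecified tasks up front is cheaper than debugging an agent that wandered off a vague prompt.</p>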



<p>In summary, the rise of orchestrator-style development doesn’t remove the human from the loop—it changes the human’s position in the loop. We move from being the one turning the wrench to the one designing and supervising the machine that turns the wrench. It’s a higher-leverage position, but also one that demands broader awareness.</p>



<p>Developers who adapt to being effective conductors and orchestrators of AI will likely be even more valuable in this new landscape.</p>



<h2 class="wp-block-heading">Conclusion: Is Every Engineer a Maestro?</h2>



<p>Will every engineer become an orchestrator of multiple coding agents? It’s a provocative question, but trends suggest we’re headed that way for a large class of programming tasks. The day-to-day reality of a software engineer in the late 2020s could involve less heads-down coding and more high-level supervision of code that’s mostly written by AIs.</p>



<p>Today we’re already seeing early adopters treating AI agents as teammates—for example, some developers report delegating 10+ pull requests per day to AI, effectively treating the agent as an independent teammate rather than a smart autocomplete. Those developers free themselves to focus on system design, tricky algorithms, or simply coordinating even more work.</p>



<p>That said, the transition won’t happen overnight for everyone. Junior developers might start as “AI conductors,” getting comfortable working with a single agent before they take on orchestrating many. Seasoned engineers are more likely to be early adopters of orchestrator workflows, since they have the experience to architect tasks and evaluate outcomes. In many ways, it mirrors career growth: Junior engineers implement (now with AI help); senior engineers design and integrate (soon with AI agent teams).</p>



<p>The tools we discussed—from GitHub’s coding agent to Google’s Jules to OpenAI’s Codex—are rapidly lowering the barrier to try this approach, so expect it to go mainstream quickly. The hyperbole aside, there’s truth that these capabilities can dramatically amplify what an individual developer can do.</p>



<p>So, will we all be orchestrators? Probably to some extent—yes. We’ll still write code, especially for novel or complex pieces that defy simple specification. But much of the boilerplate, routine patterns, and even a lot of sophisticated glue code could be offloaded to AI. The role of “software engineer” may evolve to emphasize product thinking, architecture, and validation, with the actual coding being a largely automated act. In this envisioned future, asking an engineer to crank out thousands of lines of mundane code by hand would feel as inefficient as asking a modern accountant to calculate ledgers with pencil and paper. Instead, the engineer would delegate that to their AI agents and focus on the creative and critical-thinking aspects around it.</p>



<p>To be sure, there’s plenty to be cautious about. We need to ensure these agents don’t introduce more problems than they solve. And the developer experience of orchestrating multiple agents is still maturing—it can be clunky at times. But the trajectory is clear. Just as continuous integration and automated testing became standard practice, continuous delegation to AI could become a normal part of the development process. The engineers who master both modes—knowing when to be a precise conductor and when to scale up as an orchestrator—will be in the best position to leverage this “agentic” world.</p>



<p>One thing is certain: The way we build software in the next 5–10 years will look quite different from the last 10. I don’t expect all, or even most, code to be agent-driven within a year or two, but that’s the direction we’re heading. The keyboard isn’t going away, but alongside our keystrokes we’ll be issuing high-level instructions to swarms of intelligent helpers. In the end, the human element remains irreplaceable: It’s our judgment, creativity, and understanding of real-world needs that guide these AI agents toward meaningful outcomes.</p>



<p>The future of coding isn’t AI or human; it’s AI <em>and</em> human, with humans at the helm as conductors and orchestrators, directing a powerful ensemble to achieve our software ambitions.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><em>I’m excited to share that I&#8217;ve written an </em><a href="https://beyond.addy.ie/" target="_blank" rel="noreferrer noopener"><em>AI-assisted engineering book</em></a><em> with O’Reilly. If you’ve enjoyed my writing here you may be interested in checking it out.</em></td></tr></tbody></table></figure>
]]></content:encoded>
										</item>
	</channel>
</rss>
