<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Fri, 29 May 2026 11:00:21 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Open Source Ecosystems</title>
		<link>https://www.oreilly.com/radar/open-source-ecosystems/</link>
				<comments>https://www.oreilly.com/radar/open-source-ecosystems/#respond</comments>
				<pubDate>Fri, 29 May 2026 11:00:08 +0000</pubDate>
					<dc:creator><![CDATA[Ilan Strauss]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18814</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Open-source-ecosystems.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Open-source-ecosystems-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[When open strategy meets private tactics]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Asimov&#8217;s Addendum Substack and is being reposted here with the author&#8217;s permission. Bill Gurley&#160;has an excellent article on what he calls&#160;open source strategy,&#160;which we recommend reading. There is a lot to debate about his concluding argument in particular: that open-weight models are central to keeping the AI market [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on the</em> <a href="https://asimovaddendum.substack.com/p/open-source-ecosystems" target="_blank" rel="noreferrer noopener">Asimov&#8217;s Addendum</a> <em>Substack and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph"><a href="https://p3institute.substack.com/p/from-open-source-software-to-open" target="_blank" rel="noreferrer noopener">Bill Gurley</a>&nbsp;has an excellent article on what he calls&nbsp;<em>open source strategy,&nbsp;</em>which we recommend reading. There is a lot to debate about his concluding argument in particular: that open-weight models are central to keeping the AI market rent-free. The limits of open-weight AI as the primary open source strategy are surely considerable though, if it still requires expensive hardware to run on, and&nbsp;<a href="https://www.oreilly.com/pub/a/tim/articles/architecture_of_participation.html" target="_blank" rel="noreferrer noopener">if the architecture ultimately remains monolithic</a>—rather than composable and protocol-centric.</p>



<p class="wp-block-paragraph">A related consideration comes from Anthropic’s<a href="https://www.anthropic.com/news/anthropic-acquires-stainless" target="_blank" rel="noreferrer noopener">&nbsp;recent acquisition of Stainless</a>, a startup that generates SDKs, command-line tools, and MCP servers from API specifications. This illustrates that open protocols like MCP, even when publicly governed,<sup data-fn="6732a4b0-bcdf-41ae-a355-761cc861ab6b" class="fn"><a href="#6732a4b0-bcdf-41ae-a355-761cc861ab6b" id="6732a4b0-bcdf-41ae-a355-761cc861ab6b-link">1</a></sup>&nbsp;remain exposed at their complementary layers to private actors capturing rents. (Protocol openness does not eliminate this exposure; if anything, it probably enables it by growing the market.)</p>



<p class="wp-block-paragraph">We asked Claude to analyze this acquisition, going beyond the press releases. Its first pass overstated parts of the competitive-denial story; what follows is what survived a second, closer look:</p>



<ol class="wp-block-list">
<li><strong>Complement capture, not protocol capture.</strong>&nbsp;MCP—the standard that lets AI agents talk to other software—remains open, and its governance has been handed to an independent foundation. What Anthropic bought is the company that turned that standard into something most developers could actually use.&nbsp;<em>Stainless was the dominant tool for taking an ordinary business API</em>&nbsp;(say, a hotel booking system or a customer database) and converting it into something an AI agent could call through MCP. The open standard is still open. The path most developers walked to use it has now been bought.<br></li>



<li><strong>This isn’t a one-off—the whole layer is consolidating.</strong>&nbsp;Stainless wasn’t alone in this market. Its main competitor, Fern, was<a href="https://buildwithfern.com/post/stainless-pricing-alternatives" target="_blank" rel="noreferrer noopener">&nbsp;bought by Postman in January 2026</a>. Anthropic bought Stainless four months later, in May 2026. That leaves&nbsp;<a href="https://www.speakeasy.com/" target="_blank" rel="noreferrer noopener">Speakeasy</a>&nbsp;as the only major independent player, plus an open-source fallback called&nbsp;<a href="https://openapi-generator.tech/" target="_blank" rel="noreferrer noopener">OpenAPI Generator</a>&nbsp;that most developers consider too rough for production use without significant manual work. In under five months, two of the three serious companies in this part of the market have been absorbed into larger platforms.&nbsp;<em>The Stainless deal is more visible because of who bought it and why, but the broader pattern matters more: an entire layer of AI infrastructure is being pulled inside platform owners</em>.<br></li>



<li><strong>Moat migration.</strong> The gap in raw model capability between Anthropic, OpenAI, and Google has narrowed considerably and continues to close, and the implication is that model quality alone is unlikely to be the principal basis of competitive advantage over the next two years. What may distinguish the leading firms instead <em>is the quality of the developer experience around their models: how easily a business or an engineer can build something useful on top of a given model, how cleanly the tooling integrates with existing systems, and how reliable the connectors are over time.</em></li>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Stainless was founded by Alex Rattray, formerly of Stripe.&nbsp;<em>Stripe built its market position largely on unusually well-designed developer tools</em>, and Stainless was, in effect, an attempt to apply the same approach to the layer between AI APIs and the rest of the software economy. Anthropic has acquired the team that knows how to do this.</p>
</blockquote>



<li><strong>Pricing logic, with caveats on denial.</strong>&nbsp;Stainless was last valued at&nbsp;<a href="https://www.analyticsinsight.net/news/anthropic-acquires-stainless-for-over-300m-to-strengthen-ai-sdk-and-tool-access" target="_blank" rel="noreferrer noopener">$150M in December 2025</a>; at &gt;$300M five months later, this is a roughly 2x strategic markup, not acqui-hire arithmetic. Removing a critical-path external dependency on Anthropic’s own SDKs, while denying it to a tight set of competitors, is rational at that price—but the denial logic is partial.&nbsp;<em>Speakeasy is a viable substitute, and OpenAI was reportedly already migrating off Stainless. The friction tax falls hardest on smaller players who lack the engineering bench to absorb migration cost</em>.</li>
</ol>



<p class="wp-block-paragraph">…The press release calls it “extending reach”; the <em>InfoWorld</em> read—“last-mile developer experience”—is closer, but the complement-capture component, even if partial, is real.</p>



<p class="wp-block-paragraph">-*-</p>



<p class="wp-block-paragraph">Now, while Claude might be overstating some of the market risks associated with this acquisition (you tell us?), its analysis shows that open source’s impacts are highly conditional on its dependencies and should never be analyzed in isolation from the market’s software stack and architecture. This is as true for open-weight models, which depend on data, compute, and distribution, as it is for open protocols like MCP, which depend on constant API translation and access. Tracking those interdependencies is what a full ecosystem view involves, and it helps identify where chokepoints might arise and, in turn, where&nbsp;<em>open source strategy</em>&nbsp;might eventually fail or be captured.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="6732a4b0-bcdf-41ae-a355-761cc861ab6b">In this case by the<a href="https://www.linuxfoundation.org/press/agentic-ai-foundation" target="_blank" rel="noreferrer noopener"> Agentic AI Foundation under the Linux Foundation</a> <a href="#6732a4b0-bcdf-41ae-a355-761cc861ab6b-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/open-source-ecosystems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Your AI Agent Already Forgot Half of What You Told It</title>
		<link>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/</link>
				<comments>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/#respond</comments>
				<pubDate>Thu, 28 May 2026 10:59:36 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18803</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Your-AI-agent-already-forgot-half-of-what-you-told-it.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Your-AI-agent-already-forgot-half-of-what-you-told-it-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[How to keep agents and skills from losing track mid-workflow]]></custom:subtitle>
		
				<description><![CDATA[This is the seventh article in a series on agentic engineering and AI-driven development.&#160;Read part one&#160;here, part two&#160;here, part three&#160;here, part four&#160;here, part five&#160;here, and part six here. This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>This is the seventh article in a series on agentic engineering and AI-driven development.&nbsp;Read part one&nbsp;<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two&nbsp;<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three&nbsp;<a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, part four&nbsp;<a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">here</a>, part five&nbsp;<a href="https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/" target="_blank" rel="noreferrer noopener">here</a></em>, <em>and part six <a href="https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
</blockquote>



<p class="wp-block-paragraph">This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of a turn I wasn&#8217;t expecting.</p>



<p class="wp-block-paragraph">In my <a href="https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/" target="_blank" rel="noreferrer noopener">last article</a> I talked about context and context management, and I promised to give you some real practical tips for managing it. This article was originally meant to be about the specific context management techniques that really helped me build <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a> and the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, two open source projects where I work with AIs to plan and orchestrate all of the work and every line of code is written by AI tools like Claude Code and Cursor.</p>



<p class="wp-block-paragraph">But as I was writing this, I found that I&#8217;d adapted those same techniques to my work writing articles like this one. Which is surprising! I&#8217;ve been doing all this work finding ways to help people developing AI skills improve context management, so their skills run more efficiently. It turns out that those same exact techniques apply to anyone using AI tools, even when you&#8217;re using chatbots like Claude.ai or ChatGPT.</p>



<p class="wp-block-paragraph">Full disclosure: I use multiple AI tools to manage this article series. My primary tools are Claude Cowork for brainstorming and managing my article research, notes, and backlog and Gemini&#8217;s mobile app for reading drafts aloud and taking my notes while I&#8217;m away from my desk. And I want to tell you about something that happened while I was using those tools, because I think it really helps show why context management isn&#8217;t just a problem for developers.</p>



<p class="wp-block-paragraph">While I was writing this article, I was using Gemini&#8217;s mobile app to read the draft aloud and take my notes. Partway through the session I asked it to go back and check whether there were earlier notes it hadn&#8217;t incorporated yet. It told me it didn&#8217;t have access to the previous notes, which seemed weird and insane, since we had <em>just taken those notes a few prompts earlier in the session</em>. I could scroll back up and see them earlier in the conversation, but somehow it didn&#8217;t &#8220;know&#8221; about them.</p>



<p class="wp-block-paragraph">Here&#8217;s what happened. Gemini had compacted our conversation without telling me, and the notes from the first half of the session were just&#8230; gone.</p>



<p class="wp-block-paragraph">If you&#8217;ve ever had a web chat AI just seem to forget things you talked about earlier, you&#8217;ve experienced context compaction, just like I did. Understanding even the basics of context and context windows can make a big difference in preventing that kind of frustration.</p>



<p class="wp-block-paragraph">This all reminded me of something I wrote more than two decades ago in <em><a href="https://learning.oreilly.com/library/view/applied-software-project/0596009488/" target="_blank" rel="noreferrer noopener">Applied Software Project Management</a></em> (back in 2005!): &#8220;Important information is discovered during the discussion that the team will need to refer back to during the development process, and if that information is not written down, the team will have to have the discussion all over again.&#8221;</p>



<p class="wp-block-paragraph">Jenny Greene and I wrote that about human teams and project meetings, but it applies to AI sessions just as well.</p>



<p class="wp-block-paragraph">Which brings me back to context, which I wrote about in my last article, and which I&#8217;ll write more about in the next one, because it&#8217;s one of the most important concepts to keep top of mind when working with AI.</p>



<h3 class="wp-block-heading"><strong>Context loss may be invisible, but that doesn&#8217;t make it any less frustrating</strong></h3>



<p class="wp-block-paragraph"><strong>Context</strong> is everything the AI is holding in its working memory during a conversation: what you&#8217;ve told it, what it&#8217;s told you, any files or instructions it&#8217;s read, and whatever internal notes the system has made along the way. All of that lives in a fixed-size <strong>context window</strong>—think of that as your AI&#8217;s short-term memory, the stuff it&#8217;s thinking about right now—and when the window fills up, the AI has to start letting things go. Different tools handle this differently: Some truncate older messages, some compress the conversation into a summary (which means details get lost even though the summary looks complete), and some just start behaving inconsistently so you can&#8217;t tell whether the AI forgot something or never understood it in the first place. The result is the same: The AI loses track of things you told it, decisions you made together, or details it noticed earlier in the session. And it won&#8217;t tell you it forgot. It&#8217;ll just keep generating confident-sounding output based on whatever it still has.</p>
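
<p class="wp-block-paragraph">To make the truncation case concrete, here is a minimal sketch of a fixed-size window that silently drops its oldest messages. It is a toy model, not a description of how any particular product works: the <em>ToyContextWindow</em> class is hypothetical, and it assumes one whitespace-separated word counts as one token.</p>

```python
from collections import deque


class ToyContextWindow:
    """Illustrative only: a fixed token budget that drops the oldest messages."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.messages = deque()

    @staticmethod
    def count_tokens(text):
        # Assumption: one whitespace-separated word is one token.
        return len(text.split())

    def total_tokens(self):
        return sum(self.count_tokens(m) for m in self.messages)

    def add(self, text):
        """Append a message, then evict from the oldest end until it fits."""
        self.messages.append(text)
        dropped = []
        # Always keep the newest message, even if it alone exceeds the budget.
        while self.total_tokens() > self.max_tokens and len(self.messages) > 1:
            dropped.append(self.messages.popleft())
        return dropped
```

<p class="wp-block-paragraph">Nothing stored in the window records that a message was evicted. The caller only finds out because this sketch returns the dropped messages; a chat session typically gives you no such signal, which is what makes the loss invisible.</p>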



<p class="wp-block-paragraph">Before we dive in a little deeper, I want to do a quick jargon check. If you&#8217;ve seen the terms &#8220;skills&#8221; and &#8220;agents&#8221; floating around but aren&#8217;t sure what they are, think of skills as libraries for AIs and agents as interactive executables. Those aren&#8217;t perfectly precise definitions, but if you&#8217;re a developer they&#8217;re close enough for this discussion.</p>



<p class="wp-block-paragraph">When you&#8217;re coding skills and agents, you run into context problems quickly. The work you&#8217;re asking the AI to do is often complex enough that the context window fills up, and the AI has to start compacting: compressing or dropping older parts of the conversation to make room for new ones. Compaction always seems to happen at the most frustrating and inconvenient time, which makes sense when you think about it. You hit context limits precisely when you&#8217;ve put the most information into the conversation, which is exactly when losing that information costs you the most.</p>



<p class="wp-block-paragraph">That&#8217;s why I think it can often help to think of AIs as having the same shortcomings that human teams do, except those shortcomings are exaggerated by their AI nature. A person who forgets something from a meeting last week might remember it when you remind them. An AI that lost something to context compaction won&#8217;t, because the information is gone. But there&#8217;s something you can do about it, and it turns out the techniques that help are the same whether you&#8217;re building autonomous AI skills or just trying to get a chatbot to remember what you told it 20 minutes ago.</p>



<p class="wp-block-paragraph">I&#8217;ve landed on four techniques that I come back to over and over again. Each one exists because at some point the AI forgot something important and I responded by putting that thing in a file where it couldn&#8217;t be forgotten. None of them require special tooling. And to my surprise, all of these techniques have turned out to be useful for both building software and managing a writing project like this one, whether I&#8217;m chatting with Claude, ChatGPT, or Gemini, or using a desktop tool like Claude Cowork or Codex. These are the techniques I find most valuable:</p>



<ul class="wp-block-list">
<li><strong>Split discovery from documentation:</strong> Don&#8217;t ask the AI to figure something out and produce polished output in the same pass.</li>



<li><strong>Use handoff documents, not continuation prompts:</strong> Before closing a stale session, have the AI write down everything the next session needs to know.</li>



<li><strong>Give the AI an acceptance criterion, not a procedure:</strong> Tell it what &#8220;done&#8221; looks like instead of spelling out the steps.</li>



<li><strong>Use spec documents as the bridge between AI tools:</strong> Make a shared document the single source of truth that all your tools read from.</li>
</ul>



<h3 class="wp-block-heading"><strong>Split discovery from documentation</strong></h3>



<p class="wp-block-paragraph">When you ask an AI to do something complex, you&#8217;re often asking it to do two things at once without realizing it. You&#8217;re asking it to figure something out and produce polished output at the same time. The problem is that figuring things out takes attention, and producing output takes attention, and the model only has so much of it. When you combine both tasks in the same prompt, the model starts cutting corners on one of them, and you can&#8217;t tell which one it shortchanged.</p>



<p class="wp-block-paragraph">I ran into this with the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open source AI coding skill I built that runs structured code reviews against any codebase. One of the things it does is derive requirements from source code: It reads through the code, identifies what the code promises to do (I call these behavioral contracts), and then produces a requirements document. Originally this all happened in a single pass. The problem was that single-pass requirement generation ran out of attention after about 70 requirements. The model forgot behavioral contracts it had noticed earlier in the code, and the forgetting was completely invisible. There was no stack trace or error message, just incomplete output and no way to know what was missing. I fixed it by splitting the work into two separate prompts:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Read each source file and write down every behavioral contract you observe as a simple list in CONTRACTS.md.</em></p>



<p class="wp-block-paragraph"><em>Read CONTRACTS.md and the documentation, then derive requirements from them and write REQUIREMENTS.md.</em></p>
</blockquote>



<p class="wp-block-paragraph">Then a third pass checks whether every contract has a corresponding requirement, and if there are gaps, goes back to step one for the files with gaps.</p>



<p class="wp-block-paragraph">The key idea is that CONTRACTS.md is external memory. When the model &#8220;forgets&#8221; about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap. You can see what was forgotten and fix it.</p>
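
<p class="wp-block-paragraph">Because the contracts live in a file, the coverage check can also be done outside the model. The sketch below is one way to do that, not how the Quality Playbook does it (there, a third AI pass performs the check). It assumes each contract carries a hypothetical ID such as C-001 and that each requirement cites the IDs it covers; neither convention comes from the playbook.</p>

```python
import re

# Hypothetical ID convention: contracts are labeled C-001, C-002, and so on.
CONTRACT_ID = re.compile(r"\bC-\d+\b")


def uncovered_contracts(contracts_text, requirements_text):
    """Return contract IDs listed in CONTRACTS.md that no requirement cites."""
    listed = set(CONTRACT_ID.findall(contracts_text))
    covered = set(CONTRACT_ID.findall(requirements_text))
    return sorted(listed - covered)


if __name__ == "__main__":
    # Inline strings stand in for the contents of the two files.
    contracts = (
        "- C-001: parse() rejects empty input\n"
        "- C-002: save() is atomic\n"
        "- C-003: retries stop after 3 attempts\n"
    )
    requirements = "REQ-1 (covers C-001)\nREQ-2 (covers C-003)\n"
    print(uncovered_contracts(contracts, requirements))
```

<p class="wp-block-paragraph">Any ID the function returns is a contract that was observed in the first pass and then lost in the second, which is exactly the kind of gap that stays hidden in a single-pass run.</p>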



<p class="wp-block-paragraph">The principle: Don&#8217;t ask the AI to figure out what exists and write formatted output in the same pass. The model runs out of attention trying to do both at once. Whenever you&#8217;re asking an AI to do something complex, consider whether you&#8217;re actually asking it to do two things at once. &#8220;Analyze this codebase and write a report&#8221; is two tasks. &#8220;Read this document and suggest improvements&#8221; is two tasks. Split them, and let the first pass write its observations to a file before the second pass starts working with them.</p>



<h3 class="wp-block-heading"><strong>Use handoff documents, not continuation prompts</strong></h3>



<p class="wp-block-paragraph">Anyone who&#8217;s spent a long session with an AI coding tool has felt the moment when the context starts to go stale. The AI stops tracking details it was handling fine an hour ago, or it contradicts something it said earlier. The session gets slow, and you find yourself restarting because the AI seems bogged down by everything you&#8217;ve told it. You get the sense that if you keep going, you&#8217;re going to spend more time correcting it than making progress.</p>



<p class="wp-block-paragraph">Most developers respond to a session getting too long in one of two ways: They push through in the same session, or they start a fresh session and try to reexplain everything from scratch. Both approaches can cause the AI to lose context. The first loses it to compaction; the second loses it to incomplete reexplanation. And both are frustrating, specifically because you just spent so much time building up all that context with the AI.</p>



<p class="wp-block-paragraph">There&#8217;s a third option. Before you close the session, ask the AI to write a handoff document: a file that captures everything the next session needs to know, written while the current session still has full context. The key is that you&#8217;re asking the AI to write this while the relevant details are still fresh in the working context, and in a way that it or another AI can read.</p>



<p class="wp-block-paragraph">I built this into the Quality Playbook as a core part of how phases communicate. When I split the playbook from a single prompt into independent phases, I needed each phase to run as a completely independent session with no context carryover. So each phase got its own kickoff prompt as a standalone file. Here&#8217;s the structure each one follows:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Write a handoff document that a fresh session could use to pick up this work cold. Include everything it would need to know.</em></p>
</blockquote>



<p class="wp-block-paragraph">Every kickoff opens with what prior phases accomplished, includes explicit boundaries about what&#8217;s frozen, and names which future phase owns each piece of remaining work, because without those boundaries the AI will helpfully start doing Phase 3 work while you&#8217;re still in Phase 2. Each phase also ends with a required forward-looking handoff where the completing agent writes down what the next session needs to know.</p>



<p class="wp-block-paragraph">The principle: Each handoff is a complete state snapshot. The incoming AI agent never needs to read prior kickoff prompts or chat history. Everything it needs is in the current handoff file: current state, uncommitted changes, immediate next task, pending tasks, file locations, and anything that was discovered during the prior session. A fresh AI session can pick it up cold.</p>
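
<p class="wp-block-paragraph">One way to keep a handoff from quietly omitting a section is to give it a fixed shape. The following is a minimal sketch under that assumption; the <em>Handoff</em> class and its Markdown layout are hypothetical and are not the Quality Playbook&#8217;s actual file format. The field names simply mirror the items listed above.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical state snapshot a fresh session could pick up cold."""

    current_state: str
    immediate_next_task: str
    uncommitted_changes: list[str] = field(default_factory=list)
    pending_tasks: list[str] = field(default_factory=list)
    file_locations: list[str] = field(default_factory=list)
    discoveries: list[str] = field(default_factory=list)

    def to_markdown(self):
        def bullets(items):
            # An empty section is written explicitly, so "nothing to report"
            # can be told apart from "forgot to report".
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        sections = [
            ("Current state", self.current_state),
            ("Uncommitted changes", bullets(self.uncommitted_changes)),
            ("Immediate next task", self.immediate_next_task),
            ("Pending tasks", bullets(self.pending_tasks)),
            ("File locations", bullets(self.file_locations)),
            ("Discoveries from this session", bullets(self.discoveries)),
        ]
        return "\n\n".join(f"## {title}\n{body}" for title, body in sections) + "\n"
```

<p class="wp-block-paragraph">In practice the AI writes the content and you check the shape: every section is present in every handoff, and an empty section says so instead of disappearing.</p>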



<p class="wp-block-paragraph">If you&#8217;re deep into a Claude Code or Copilot session and you can feel the context getting stale, ask the AI to write a handoff document before you close the session. Tell it to include everything a fresh session would need to continue the work. Then start a new session and point it at that file. A fresh session with a good handoff document will usually outperform a stale session, because it&#8217;s starting with clean context instead of compacted, fragmented context.</p>



<h3 class="wp-block-heading"><strong>Give the AI an acceptance criterion, not a procedure</strong></h3>



<p class="wp-block-paragraph">When you give an AI a multistep task, the natural instinct is to spell out the steps. First do this, then do that, then combine the results. The problem is that step-by-step procedures are the first thing the AI forgets when the context window fills up. It&#8217;ll skip steps, merge phases, or quietly drop tasks, and there&#8217;s nothing in the procedure itself that would help the AI notice what it missed. The procedure tells the AI what to do, but it doesn&#8217;t tell the AI what &#8220;done&#8221; looks like.</p>



<p class="wp-block-paragraph">I learned this the hard way with the Quality Playbook. The playbook runs multiple iteration passes over a codebase, and the results need to be cumulative. It keeps a list of all the bugs it finds in the code being tested in a file called BUGS.md. Early on, I gave the AI a procedure to run four times and then update that file:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>First run the main pass, then run four iteration passes, then merge the findings into BUGS.md.</em></p>
</blockquote>



<p class="wp-block-paragraph">The AI did not respond well to that instruction.</p>



<p class="wp-block-paragraph">It turns out that when you ask an AI to do a very complex task a specific number of times, it can lose count. In fact, from my experimentation, it seems that count is one of the first casualties of context compaction. Most of the time the AI decided three iterations was enough, or merged findings from only two passes, and no matter how many different ways I tried to rephrase that instruction, there was nothing I could come up with that prevented the problem.</p>



<p class="wp-block-paragraph">However, everything changed when I replaced the &#8220;run four times&#8221; instruction with an <strong>acceptance criterion</strong>, or a specific condition that tells the AI when to stop looping:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>You are done only when BUGS.md contains the cumulative findings from the main run plus all four iteration passes.</em></p>
</blockquote>



<p class="wp-block-paragraph">Even when the AI lost track of intermediate steps, it could check the output against the criterion and know whether it was finished. And I could verify the output against the same criterion, which gave me a way to audit the agent&#8217;s work without watching every step.</p>



<p class="wp-block-paragraph">In developer terms, the AI is really bad at loops like <em>for (i = 0; i &lt; 4; i++)</em> because it loses track of the value of the iterator <em>i</em> when it compacts its context. But it&#8217;s really good at loops like <em>while (!done)</em> because it can check <em>done</em> based on the current state without relying on history.</p>
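
<p class="wp-block-paragraph">The same contrast can be sketched in Python. This is an illustration under stated assumptions, not the playbook&#8217;s code: it assumes BUGS.md has one heading per pass, and the heading names and the <em>run_until_done</em> helper are hypothetical. The point is that &#8220;done&#8221; is computed from the file as it exists now, so nothing has to be remembered between passes.</p>

```python
# Hypothetical headings: one for the main run, one per iteration pass.
REQUIRED_SECTIONS = ["## Main run"] + [f"## Iteration {n}" for n in range(1, 5)]


def missing_sections(bugs_md):
    """List the required headings that are absent from the BUGS.md text."""
    lines = {line.strip() for line in bugs_md.splitlines()}
    return [s for s in REQUIRED_SECTIONS if s not in lines]


def is_done(bugs_md):
    # The acceptance criterion: checked against current state, not history.
    return not missing_sections(bugs_md)


def run_until_done(read_state, do_next_pass, max_passes=10):
    """Loop on the criterion; max_passes is only a guard against runaway loops."""
    passes = 0
    while not is_done(read_state()):
        if passes >= max_passes:
            raise RuntimeError("acceptance criterion not met")
        do_next_pass(missing_sections(read_state()))
        passes += 1
    return passes
```

<p class="wp-block-paragraph">Here <em>read_state</em> returns the current text of BUGS.md and <em>do_next_pass</em> stands in for whatever runs one more pass. The same <em>is_done</em> check works for auditing the result afterward.</p>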



<p class="wp-block-paragraph">The principle behind all this is that an acceptance criterion survives context pressure because the AI can always check &#8220;Am I done?&#8221; against a concrete test. This is actually the same principle behind test-driven development: write the test before the code so you know when you&#8217;re done. The acceptance criterion is the test for your AI session. When you&#8217;re giving an AI a task that has multiple steps, don&#8217;t describe the steps. Describe what &#8220;done&#8221; looks like, and let the AI figure out how to get there.</p>



<h3 class="wp-block-heading"><strong>Use spec documents as the bridge between AI tools</strong></h3>



<p class="wp-block-paragraph">Most developers working with AI don&#8217;t use just one tool. You might use Claude for design, Cursor for coding, and Copilot for quick edits. You might even use multiple models inside the same tool, like GPT-5.5 and Opus 4.7 in separate Copilot chats inside VS Code. It&#8217;s common to have one model for coding, another for review, and a third for orchestration and project management. The problem is that none of these tools or chats know what you told the others. Claude doesn&#8217;t know what you decided with Cursor. Two separate Copilot chats in the same editor don&#8217;t share context. You&#8217;re the one carrying context between them, and that&#8217;s exactly the kind of lossy handoff that causes drift. A design decision you made in one conversation gets lost or distorted by the time it reaches the tool that needs to implement it.</p>



<p class="wp-block-paragraph">The fix is to make the spec document the single source of truth that all your AI tools read from. I used this when building a game prototype, where I had Claude handling design and planning and Cursor doing the coding. They never talked to each other directly, so the spec documents served as the shared contract: Claude wrote the specs, and Cursor read them. The rule I followed was simple:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Never tell the AI coder something that isn&#8217;t already in the specs. If you make a design decision in conversation, write it into the spec first, then point the coder at the spec.</em></p>
</blockquote>



<p class="wp-block-paragraph">If I made a design decision in a conversation with Claude, that decision had to be written into the spec before I told Cursor about it. If I discovered something during implementation, I wrote it into the appropriate doc first, then pointed the coder at it. The spec was always the single source of truth. When Claude and I changed the wound topology (removing one wound type, promoting another), we updated the docs first, then told Cursor to reread them. When we decided to add a new UI element, we wrote it into the UI spec first, then told Cursor to reread the doc.</p>



<p class="wp-block-paragraph">The key was including rationale in the specs. Not just &#8220;show 5 progressive labels&#8221; but why: &#8220;The player shouldn&#8217;t be told what they&#8217;re fighting. They should discover it.&#8221; This helps the AI coder make better decisions when the spec doesn&#8217;t cover an edge case because it knows the intent behind the requirement.</p>



<p class="wp-block-paragraph">The principle: The spec document is the shared context that all your tools can read. It prevents the drift that happens when design intent lives only in chat history that the other tool can&#8217;t see. This technique works any time you&#8217;re using more than one AI tool on the same project, which at this point is most projects.</p>
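


<p class="wp-block-paragraph">The &#8220;spec first, then point the coder at the spec&#8221; rule can be sketched as a small Python helper. This is an illustration under stated assumptions, not the workflow actually used on the game prototype: the function name <code>record_decision</code>, the <code>## Decision</code> entry layout, and the wording of the returned message are all hypothetical.</p>

```python
from pathlib import Path


def record_decision(spec_path: Path, requirement: str, rationale: str) -> str:
    """Append a decision and its rationale to the spec, then return the
    message to give the coding tool. The message only points at the spec;
    it deliberately does not restate the decision."""
    if not rationale.strip():
        # A requirement without its "why" leaves the coder guessing at edge cases.
        raise ValueError("a decision needs a rationale")
    entry = (
        "\n## Decision\n\n"          # hypothetical layout for a spec entry
        f"- Requirement: {requirement}\n"
        f"- Why: {rationale}\n"
    )
    with spec_path.open("a", encoding="utf-8") as spec:
        spec.write(entry)
    return f"{spec_path.name} has changed. Reread it before continuing."
```

<p class="wp-block-paragraph">Returning a message that contains only the file name is the design choice that matters here: the coder can learn about the decision in exactly one place, so the chat and the spec cannot drift apart.</p>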



<h3 class="wp-block-heading"><strong>How these techniques combine: Managing this article series</strong></h3>



<p class="wp-block-paragraph">Those four practices came out of AI-driven development work, but they apply to almost any AI work. And while these techniques emerged for me while working on agents and skills, I think it&#8217;s valuable to demonstrate them in a nondevelopment context, so I&#8217;ll share an example from my work on the article series you&#8217;re reading now.</p>



<p class="wp-block-paragraph">Over time, the process for how my AI assistant and I manage this article backlog evolved organically in conversation, but it was never written down anywhere except in the AI&#8217;s context window. Which means every time the session compacted or I started a fresh chat, the process was gone and I had to reexplain it. I caught this when the AI did something slightly wrong and I wanted to confirm we were on the same page. So I asked:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Every time I suggest a new article idea, you add an entry to the backlog, and then create a new markdown file with the source material, right?</em></p>
</blockquote>



<p class="wp-block-paragraph">That question is an example of splitting discovery from documentation. I didn&#8217;t say &#8220;document our process.&#8221; I said &#8220;confirm what we do.&#8221; Discovery first, then documentation as a separate step. If I&#8217;d said &#8220;write up our process&#8221; without confirming first, the AI might have written something plausible but wrong, and I wouldn&#8217;t have caught the discrepancy.</p>



<p class="wp-block-paragraph">Once we&#8217;d confirmed the process, I asked the AI to create two files. <strong>AGENTS.md</strong> is an emerging standard for AI-readable project context—a single file that tells any AI session what it needs to know about a project. You can learn more about the convention at <a href="https://agents.md/" target="_blank" rel="noreferrer noopener">agents.md</a>. <strong>CONTEXT.md</strong> serves a similar role as a bootstrapping document—it&#8217;s less established as a standard, but the practice of asking the AI to dump everything it knows into a context file so the next session can pick it up cold has been one of the most valuable habits I&#8217;ve developed. Here&#8217;s the prompt I used:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Update the backlog file to explain what it is and how we maintain it. Create a CONTEXT.md with everything you&#8217;d need to bootstrap a new chat. Create an AGENTS.md to make it easy to bootstrap with a single-line prompt.</em></p>
</blockquote>



<p class="wp-block-paragraph">That prompt is a handoff document. I was explicitly asking the AI to write down everything it knew while it still had full context, specifically because I knew that context would be lost to compaction. The CONTEXT.md file is a handoff from this session to whatever fresh session picks up the work next week.</p>



<p class="wp-block-paragraph">Notice what I didn&#8217;t say. I didn&#8217;t give step-by-step instructions for what should go in those files. I said &#8220;everything you would need to bootstrap this process again in case we lost it&#8221; and &#8220;a complete dump of all of the context you would need to bootstrap a new chat and get it to the point where this current chat is.&#8221; Those are acceptance criteria, not procedures. The AI had to figure out what belonged in those files. If I&#8217;d given it a procedure (&#8220;first write the publication history, then the voice rules, then the file locations&#8221;), it would have followed the list and missed anything I forgot to include. The acceptance criterion is harder to satisfy but more robust: the test is &#8220;Could a fresh session bootstrap from these files alone?&#8221;</p>
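


<p class="wp-block-paragraph">Whether a fresh session could bootstrap from the files alone can&#8217;t be fully automated, but one cheap part of that test can: every Markdown file the entry document points at should actually exist on disk. The Python sketch below is a rough illustration of that partial check. The function name <code>missing_references</code> and the simple pattern it uses to spot <code>.md</code> file names are assumptions, and the pattern will also pick up things that merely look like file names.</p>

```python
import re
from pathlib import Path

# Rough pattern for anything that looks like a Markdown file name or path.
MD_REF = re.compile(r"[\w./-]+\.md")


def missing_references(entry_file: Path) -> list[str]:
    """List the .md files mentioned in entry_file that are not on disk.

    Paths are resolved relative to the directory containing entry_file.
    """
    root = entry_file.parent
    text = entry_file.read_text(encoding="utf-8")
    refs = sorted(set(MD_REF.findall(text)))
    return [ref for ref in refs if not (root / ref).is_file()]
```

<p class="wp-block-paragraph">An empty result doesn&#8217;t prove the handoff is complete, but a nonempty one shows that the next session would be sent to a file that isn&#8217;t there.</p>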



<p class="wp-block-paragraph">And the AGENTS.md file itself is a spec document serving as the bridge between tools. It&#8217;s the shared contract that any AI session, whether it&#8217;s Claude, Gemini, Cowork, or a fresh chat, can read to get aligned with the project. This session wrote it; the next session reads it. The two sessions never communicate directly, so the spec file bridges the gap between them.</p>



<p class="wp-block-paragraph">That&#8217;s all four practices in two prompts, applied to something as ordinary as managing a writing project. It didn&#8217;t require pipelines or codebases or batch orchestration. The practices work because they solve the same underlying problem regardless of the domain: important information living in the AI&#8217;s context window instead of on disk.</p>



<h3 class="wp-block-heading"><strong>Context management is a development skill</strong></h3>



<p class="wp-block-paragraph">Every practice I&#8217;ve described in this article and the last one is something developers have always been told to do: write things down, record your rationale, be deliberate about what you save and what you let go, write ADRs and design docs and inline comments explaining nonobvious choices. We&#8217;ve always known we should do more of it. When you&#8217;re working with AI, the cost of not doing it becomes immediate and visible.</p>



<p class="wp-block-paragraph">The practices in this article all come down to the same thing: putting the important information in files where compaction can&#8217;t touch it, so you can see what the AI knows and verify that it matches reality. In the next article, I&#8217;ll go deeper on the debugging angle: how to use externalized files to understand what your AI is actually doing, with practical techniques that work even if you&#8217;re not building agents but are just using a chatbot.</p>



<p class="wp-block-paragraph"><em>The <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a> is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of <a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener">awesome-copilot</a>.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Get a Good Return on Your AI Investments</title>
		<link>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/</link>
				<comments>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/#respond</comments>
				<pubDate>Wed, 27 May 2026 16:52:37 +0000</pubDate>
					<dc:creator><![CDATA[Louise Corrigan]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18808</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Get-a-good-return-on-your-AI-investments.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Get-a-good-return-on-your-AI-investments-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Takeaways from Sam Newman&#039;s fireside chat with Nathen Harvey, DORA team lead at Google Cloud]]></custom:subtitle>
		
				<description><![CDATA[Last week, we had our first Infrastructure &#38; Ops superstream of 2026, Platform Engineering in the Age of AI. Our speakers explored a range of topics focused on supporting new AI workloads, each with unique infrastructure needs, unpredictable costs, and novel security concerns. Google Cloud’s Abdel Sghiouar took the audience through what a good platform [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Last week, we had our first Infrastructure &amp; Ops superstream of 2026, <a href="https://learning.oreilly.com/live-events/infrastructure-ops-superstream-platform-engineering-in-the-age-of-ai/0642572314507/0642572314491/" target="_blank" rel="noreferrer noopener">Platform Engineering in the Age of AI</a>. Our speakers explored a range of topics focused on supporting new AI workloads, each with unique infrastructure needs, unpredictable costs, and novel security concerns. Google Cloud’s Abdel Sghiouar took the audience through what a good platform for AI looks like, Cockroach Labs’ Jordan Lewis shared lessons learned rolling out a corporate AI platform, Syntasso’s Daniel Bryant outlined a three-layer model for building a good platform, technology leader Sarah Wells discussed the importance of governance and how to make it more manageable, and Thoughtworks’ Ben O&#8217;Mahony explained why evals should be part of your observability story. You can <a href="https://youtu.be/neycwJJmpG0" target="_blank" rel="noreferrer noopener">watch the highlights here</a>.</p>



<p class="wp-block-paragraph">The event concluded with a fireside chat between Sam Newman and Nathen Harvey, who leads the DORA team at Google Cloud. <a href="https://dora.dev/" target="_blank" rel="noreferrer noopener">DORA</a> has been tracking software delivery performance for over a decade, which means they&#8217;ve watched a lot of technology trends come through. Their center of gravity has always been the same question: How quickly and safely can a team move change into a running production application?</p>



<p class="wp-block-paragraph">AI hasn&#8217;t changed that question, although it has made answering it a bit harder. DORA recently released its <a href="https://cloud.google.com/resources/content/dora-roi-of-ai-assisted-software-development" target="_blank" rel="noreferrer noopener"><em>ROI of AI-Assisted Software Development</em> report</a> to show how AI is working for teams right now, and how that may or may not be contributing to organizations’ bottom lines. Nathen used the findings as a jumping-off point to dig into how AI is changing platform engineering and software development as a whole.</p>



<h2 class="wp-block-heading">The productivity gap</h2>



<p class="wp-block-paragraph">Sam started by pointing out one of the biggest headline findings from DORA’s 2025 data: Organizations saw about a 10% improvement in terms of actual code shipped to production systems. Even though developers likely felt that they were more productive, that doesn&#8217;t automatically carry through to production. DORA&#8217;s data shows higher throughput alongside higher instability. In other words, teams are shipping more but they’re also more frequently rolling back changes or implementing fixes. The gains at the individual level are real (and 10% is a pretty good number), but those gains aren’t “the dramatic improvements that you find in the headlines.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="The Productivity Gap with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/9jxMx1yHAZo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">AI amplifies good processes (and bad ones)</h2>



<p class="wp-block-paragraph">Nathen explained that AI is an amplifier and mirror that equally reflects the good and bad. On teams where shipping change is already easy, AI tends to keep things running well. On teams where getting change into production is painful, AI generates <em>more</em> change and makes the existing friction more acute. That said, his read on this outcome is cautiously optimistic: &#8220;If the pain is more acute, we maybe will invest in addressing that pain.&#8221;</p>



<p class="wp-block-paragraph">The rub is that the investment has to actually happen. Nathen noted that in lower-performing organizations, AI tools often arrive with a reset of expectations rather than an invitation to fix the process: Here&#8217;s your new tool. Now we expect more from you. Addressing this problem means reframing the question “Does AI make people more productive?” What we really should be asking is “Under what conditions will AI boost productivity, and who&#8217;s responsible for creating them?” And that falls on the organization, not the technology.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="AI Is an Amplifier and Mirror for Good Processes and Bad with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/5CzvrWpXBHg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Verification isn&#8217;t a checkbox</h2>



<p class="wp-block-paragraph">Trust is a big challenge with generative AI. About 30% of DORA survey respondents trust AI output little or not at all. Around 46% trust it &#8220;somewhat&#8221; (and Nathen is one of them). Despite all the advances in generative AI, these tools still make mistakes, and if you&#8217;ve multiplied your ability to generate code without doing anything to scale your ability to verify it, you&#8217;ve made your situation worse, not better.</p>



<p class="wp-block-paragraph">Nathen called this the verification tax, and it belongs in any honest accounting of AI&#8217;s productivity impact. Pipeline adaptation belongs there too: Is your delivery pipeline fit for purpose given the volume of change you&#8217;re now trying to push through? These costs don&#8217;t show up in the headlines about 10x developer productivity. They show up in your incident reports three months later.</p>



<p class="wp-block-paragraph">DORA recently published an <a href="https://dora.dev/ai/roi/calculator/#staff_size=500&amp;salary=176000&amp;revenue=100000000&amp;downtime_cost_per_hour=100000&amp;current_deployments_per_year=50&amp;current_features_per_year=50&amp;idea_success_rate=0.33&amp;revenue_impact_per_feature=0.005&amp;current_cfr=0.05&amp;current_fdrt=4&amp;time_saved_per_developer=0.125&amp;ai_license_cost_per_user=250&amp;additional_ai_cost_per_user=80&amp;additional_ai_infra_cost=100000&amp;training_cost_per_user=9600&amp;target_deployments_per_year=56&amp;target_features_per_year=56&amp;target_cfr=0.06&amp;j_curve_drop=0.15&amp;j_curve_duration=3" target="_blank" rel="noreferrer noopener">ROI framework and calculator</a> for AI-assisted software development. Nathen was clear that there&#8217;s no universal number to offer, and the calculator doesn&#8217;t pretend otherwise. What it does is give teams a way to model the real costs, including the learning investment, the verification overhead, and the pipeline changes required.</p>
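


<p class="wp-block-paragraph">The calculator link above carries its inputs in the URL fragment, so the assumptions behind a shared scenario can be read straight out of the link. The Python sketch below does that for a subset of the parameters, copied verbatim from the link. The function name <code>calculator_inputs</code> is hypothetical, and what each parameter means inside DORA’s model is not stated in this article, so the sketch only lists the values and does not try to reproduce the calculation.</p>

```python
from urllib.parse import parse_qsl, urlsplit

# A subset of the parameters from the calculator link, copied as-is.
CALCULATOR_LINK = (
    "https://dora.dev/ai/roi/calculator/"
    "#staff_size=500&salary=176000&time_saved_per_developer=0.125"
    "&ai_license_cost_per_user=250&training_cost_per_user=9600"
    "&target_cfr=0.06&j_curve_drop=0.15"
)


def calculator_inputs(link: str) -> dict[str, float]:
    """Parse the key=value pairs in the link's fragment into numbers."""
    fragment = urlsplit(link).fragment
    return {key: float(value) for key, value in parse_qsl(fragment)}


if __name__ == "__main__":
    for name, value in calculator_inputs(CALCULATOR_LINK).items():
        print(f"{name:28} {value}")
```

<p class="wp-block-paragraph">Listing the inputs this way makes it easy to see which numbers a scenario depends on before swapping in your own organization’s figures.</p>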



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="The Verification Tax with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/wGYLtVj8z0Q?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Context switching and burnout</h2>



<p class="wp-block-paragraph">With productivity on the upswing, AI-induced burnout is becoming a serious concern. (Steve Yegge calls this the “<a href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163" target="_blank" rel="noreferrer noopener">AI vampire</a>.”) DORA’s data for 2025 showed that AI adoption wasn’t strongly connected with burnout, with the caveat that about 64% of DORA survey respondents said they’d never worked in an agentic workflow. Both of those findings are likely to change significantly in 2026.</p>



<p class="wp-block-paragraph">Nathen highlighted one source of burnout he expects to escalate as agents become the norm: context switching. As he pointed out, software developers spent years arguing for protected focus time to do the deep work that requires them to maintain flow. Agentic workflows are now incentivizing those same developers to voluntarily run a dozen or more agents at once, forcing them to context-switch multiple times every hour. As he joked, “There&#8217;s plenty of research that supports the idea that all of us feel like we&#8217;re pretty good multitaskers and none of us are.” The consequences are coming, and we’re doing it to ourselves.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Burnout Will Go Up, and We’re Doing It to Ourselves with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/ibdw27MxQq0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The cognitive debt question</h2>



<p class="wp-block-paragraph">Sam Newman brought up the related notion of “cognitive debt,” and in particular, Margaret-Anne Storey’s discussion of it. (See “<a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" target="_blank" rel="noreferrer noopener">How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt</a>” and “<a href="https://arxiv.org/abs/2603.22106" target="_blank" rel="noreferrer noopener">From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI</a>.”) Here’s how Storey explains the problem in her blog post:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Debt compounded from going fast lives in the brains of the developers and affects their lived experiences and abilities to “go fast” or to make changes. Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do, how their intentions were implemented, or how to possibly change it.</p>
</blockquote>



<p class="wp-block-paragraph">And as Sam noted, this compounds across teams and organizations. As developers increasingly work in parallel with AI rather than with each other, they lose the shared understanding that comes from people building software together. Kent Beck once said that “<a href="https://tidyfirst.substack.com/p/self-team-product" target="_blank" rel="noreferrer noopener">software design is an exercise in human relationships</a>.” Agentic workflows are putting pressure on that in ways we&#8217;re only beginning to see.</p>



<p class="wp-block-paragraph">Nathen agreed that cognitive debt is what concerns him most, and that both your workers and your architecture will suffer for it. The ramifications of an architectural decision you made eight months ago take years of operation to surface, and AI doesn&#8217;t help with that at all.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Cognitive Debt and Long Feedback Loops with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/yiOsikXaQ7c?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Invest in your platform now</h2>



<p class="wp-block-paragraph">Considering what makes some AI-assisted teams high performers, Nathen explained, “It’s not <em>that</em> you’re using AI but <em>how</em> you’re using AI.” This observation led DORA to develop <a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-doras-inaugural-ai-capabilities-model" target="_blank" rel="noreferrer noopener">seven capabilities</a> that, when combined with AI adoption, lead to better outcomes. Nathen briefly ran through the list, ending on quality internal platforms. And here he made a claim about software engineering investment that was, in his words, “a little bit wild”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Every product engineer that you have in your organization, every engineer that&#8217;s focused on building features right now, should probably stop building features and focus on the platform.</p>
</blockquote>



<p class="wp-block-paragraph">His argument is that platforms matter more, not less, in an environment where AI makes it possible for almost anyone in an organization to build something. The people closest to customers and business problems can now generate working software. What they can&#8217;t do is ensure that software is durable, secure, and production-ready.</p>



<p class="wp-block-paragraph">Nathen suggested that the best leverage for software engineering investment today might be building platforms that provide those guardrails, that shift the complexity of production-readiness down into the infrastructure so that anyone building on top of it gets the safety net for free. He acknowledged that moving every product engineer to platform work might be overkill. But the direction of travel is real. The platform is also, as Sam pointed out, where you bring determinism back into a process that AI has made more nondeterministic.</p>



<p class="wp-block-paragraph">That’s something we’ve been hearing a lot here at O’Reilly. The expansion of who can build doesn&#8217;t reduce the need for deep engineering expertise. It changes where that expertise is most valuable, and platforms are a good answer to where.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="AI Capabilities and the Case for Platform Investment Now with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/CIFoHFTbIec?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">What DORA’s research tells us</h2>



<p class="wp-block-paragraph">The teams that are doing well are running experiments, learning from them, and spreading those lessons. The measure Nathen suggested is not how many tokens you&#8217;ve consumed but how many experiments you&#8217;ve run and how well you&#8217;re distributing what you&#8217;ve learned.</p>



<p class="wp-block-paragraph">The tools are moving fast enough that any organization locking in a fixed policy around specific tools will find itself stuck. What you want is the capacity to keep learning, which means building the culture and the processes that make learning visible and transferable.</p>



<p class="wp-block-paragraph">All of DORA&#8217;s research is freely available at <a href="https://dora.dev/" target="_blank" rel="noreferrer noopener">dora.dev</a>, including the 2025 annual report and the ROI framework. The <a href="https://dora.community/" target="_blank" rel="noreferrer noopener">DORA Community</a> provides a space for practitioners to work through these questions together. If you&#8217;re trying to navigate any of this with your team, you may want to spend some time there.</p>



<p class="wp-block-paragraph">And if you want to dive deeper into Nathen and Sam’s chat or explore the other sessions, you can <a href="https://learning.oreilly.com/videos/infrastructure-ops/0642572308308/" target="_blank" rel="noreferrer noopener">watch the entire Infrastructure &amp; Ops Superstream</a> on the O’Reilly learning platform. Our next event, on September 9, will cover agentic observability. <a href="https://www.oreilly.com/live/io-superstream-agentic-observability.html" target="_blank" rel="noreferrer noopener">Register for free here</a>, and check out all the other <a href="https://www.oreilly.com/live/free.html" target="_blank" rel="noreferrer noopener">free live events on O’Reilly</a>.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Agent Skills</title>
		<link>https://www.oreilly.com/radar/agent-skills/</link>
				<comments>https://www.oreilly.com/radar/agent-skills/#respond</comments>
				<pubDate>Wed, 27 May 2026 10:59:18 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18796</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A senior engineer’s job is mostly the parts that don’t show up in the diff. Specs. Tests. Reviews. Scope discipline. Refusing to ship what can’t be verified. AI coding agents skip those parts by default. Agent Skills is my attempt to make them not optional.]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission. The default behavior of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It doesn’t ask whether you have a spec, write a test before [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on <a href="https://addyosmani.com/blog/agent-skills/" target="_blank" rel="noreferrer noopener">Addy Osmani’s blog</a> and is being reposted here with the author’s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">The default behavior of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It doesn’t ask whether you have a spec, write a test before the implementation, consider whether the change crosses a trust boundary, or check what the PR will look like to a reviewer. It produces code, declares victory, and moves on.</p>



<p class="wp-block-paragraph">This is the same failure mode every senior engineer has spent their career learning to avoid. The senior version of any task includes work that doesn’t show up in the diff: surfacing assumptions, writing the spec, breaking the work into reviewable chunks, choosing the boring design, leaving evidence that the result is correct, sizing the change so a human can actually review it. Those steps are most of what separates engineers who ship reliable software at scale from people who push code that breaks.</p>



<p class="wp-block-paragraph">Agents skip those steps for the same reason any junior would. They’re invisible. The reward signal points at “task complete” not “task complete and the design doc exists.” So we have to bolt the senior-engineer scaffolding back on.</p>



<p class="wp-block-paragraph"><a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">Agent Skills</a> is my attempt at that scaffolding. It just crossed 27K stars, so apparently I’m not alone in wanting it. This post is the part the README doesn’t quite cover: why each design choice exists, how it maps onto standard SDLC and Google’s published engineering practices, and what you should steal from the project even if you never install a single skill.</p>



<h2 class="wp-block-heading">What a “skill” actually is</h2>



<p class="wp-block-paragraph">The word “skill” is doing a lot of work in the Claude Code/Anthropic vocabulary, and it helps to be precise. A skill is a Markdown file with front matter that gets injected into the agent’s context when the situation calls for it. Somewhere between a system-prompt fragment and a runbook.</p>



<p class="wp-block-paragraph">A skill is <em>not</em> reference documentation. It is not “everything you should know about testing.” It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.</p>



<p class="wp-block-paragraph">That distinction is the whole game. If you put a 2,000-word essay on testing best practices into the agent’s context, the agent reads it, generates plausible-looking text, and skips the actual testing. If you put a <em>workflow</em> there (write the failing test first, run it, watch it fail, write the minimum code to pass, watch it pass, refactor), the agent has something to do, and you have something to verify.</p>
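


<p class="wp-block-paragraph">To make the contrast concrete, here is a minimal Python sketch of a workflow expressed as data: Each step carries a check for the evidence it should leave behind, and the run stops at the first step that has none. The names <code>Step</code>, <code>TDD_WORKFLOW</code>, and <code>run_workflow</code>, along with the state keys, are assumptions for illustration. They are not taken from the repo, where skills are Markdown files rather than code.</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One workflow step plus the check that proves it happened."""
    name: str
    has_evidence: Callable[[dict], bool]


# Hypothetical TDD workflow: each step must leave evidence in the state dict.
TDD_WORKFLOW = [
    Step("write failing test", lambda s: s.get("test_written", False)),
    Step("watch it fail", lambda s: s.get("red_run", False)),
    Step("write minimum code", lambda s: s.get("code_written", False)),
    Step("watch it pass", lambda s: s.get("green_run", False)),
]


def run_workflow(steps: List[Step], state: dict) -> Tuple[bool, str]:
    """Return (done, name of the first step lacking evidence, or 'complete')."""
    for step in steps:
        if not step.has_evidence(state):
            return False, step.name
    return True, "complete"


# An agent that wrote a test and code but never ran the failing test is not done.
result = run_workflow(TDD_WORKFLOW, {"test_written": True, "code_written": True})
# result == (False, "watch it fail")</code></pre>



<p class="wp-block-paragraph">Because every step has its own check, a reviewer can see exactly where the agent stopped producing evidence instead of taking “done” on faith.</p>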



<p class="wp-block-paragraph">Process over prose. Workflows over reference. Steps with exit criteria over essays without them. That single distinction separates a useful skill from a pretty Markdown file. It also explains why so many “AI rules” repos end up doing nothing in practice. The rules are essays.</p>



<h2 class="wp-block-heading">The SDLC the skills encode</h2>



<p class="wp-block-paragraph">The 20 skills in the repo are organized around six lifecycle phases, with seven slash commands sitting on top. Define (<code>/spec</code>) is where you decide what you’re actually building. Plan (<code>/plan</code>) breaks the work down. Build (<code>/build</code>) implements it in vertical slices. Verify (<code>/test</code>) proves it works. Review (<code>/review</code>) catches what slipped through. Ship (<code>/ship</code>) gets it to users safely. The seventh command, <code>/code-simplify</code>, spans all six phases.</p>



<p class="wp-block-paragraph">This isn’t a coincidence. It’s the same SDLC every functioning engineering organization runs, just in different vocabulary. Google calls it design doc → review → implementation → readability review → launch checklist. Amazon calls it the Working Backwards memo and the bar raiser. Every healthy team has some version of this loop.</p>



<p class="wp-block-paragraph">What’s new with AI coding agents is that <em>most agents skip most of these phases by default</em>. You ask for a feature, you get an implementation, and the spec, plan, tests, review, and launch checklist all just don’t happen. Skills push the agent through the same phases a senior engineer forces themselves through, because shipping the code without them is how you produce incidents.</p>



<p class="wp-block-paragraph">A complex feature might activate eleven skills in sequence. A small bug fix might use three. The router (<code>using-agent-skills</code>) decides which apply. The point is that the workflow scales to the actual scope, not to the assumed scope.</p>



<h2 class="wp-block-heading">Five principles that are doing the work</h2>



<p class="wp-block-paragraph">Five design decisions in the project are the load-bearing ones. The rest of the system follows from them.</p>



<h3 class="wp-block-heading">1. Process over prose</h3>



<p class="wp-block-paragraph">Already covered. Workflows are agent-actionable; essays are not. The same is true for human teams. If your team handbook is 200 pages, no one reads it under time pressure. If it’s a small set of workflows with checkpoints, people actually run them.</p>



<h3 class="wp-block-heading">2. Anti-rationalization tables</h3>



<p class="wp-block-paragraph">This is the most distinctive design decision in the project, and the one I most want other teams to steal.</p>



<p class="wp-block-paragraph">Each skill includes a table of common excuses an agent (or a tired engineer) might use to skip the workflow, paired with a written rebuttal. A few examples close to the originals:</p>



<ul class="wp-block-list">
<li>“This task is too simple to need a spec.” → Acceptance criteria still apply. Five lines is fine. Zero lines is not.</li>



<li>“I’ll write tests later.” → Later is the load-bearing word. There is no later. Write the failing test first.</li>



<li>“Tests pass, ship it.” → Passing tests are evidence, not proof. Did you check the runtime? Did you verify user-visible behavior? Did a human read the diff?</li>
</ul>



<p class="wp-block-paragraph">The reason this works is that LLMs are excellent at rationalization. They will produce a plausible-sounding paragraph explaining why <em>this particular</em> task doesn’t need a spec or why <em>this particular</em> change is fine to merge without review. Anti-rationalization tables are prewritten rebuttals to lies the agent hasn’t yet told.</p>



<p class="wp-block-paragraph">The pattern is just as good for human teams. Most engineering decay isn’t anyone choosing to do bad work. It’s people accepting plausible-sounding justifications for skipping the parts they don’t feel like doing. A team that writes down its anti-rationalizations is a team that has fewer of them.</p>
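


<p class="wp-block-paragraph">In the skills themselves, these tables are plain Markdown that the agent reads. As a hedged illustration of the same idea in executable form, the sketch below stores the three excuse and rebuttal pairs quoted above as data and returns the rebuttals that apply to a message. The function name <code>rebuttals_for</code>, the shortened excuse keys, and the substring matching are all assumptions for illustration, not a description of how the repo works.</p>



<pre class="wp-block-code"><code># Excuse -> rebuttal pairs, shortened from the examples quoted above.
ANTI_RATIONALIZATIONS = {
    "too simple to need a spec":
        "Acceptance criteria still apply. Five lines is fine. Zero lines is not.",
    "write tests later":
        "There is no later. Write the failing test first.",
    "tests pass, ship it":
        "Passing tests are evidence, not proof.",
}


def rebuttals_for(message: str) -> list:
    """Return the rebuttal for every known excuse that appears in the message."""
    text = message.lower()
    return [
        rebuttal
        for excuse, rebuttal in ANTI_RATIONALIZATIONS.items()
        if excuse in text
    ]


# A hypothetical agent message that tries to defer testing.
hits = rebuttals_for("Small change, so I will write tests later.")
# hits == ["There is no later. Write the failing test first."]</code></pre>



<p class="wp-block-paragraph">Naive substring matching will miss paraphrases, which is part of why the tables work better as context the model reads than as a filter. The value of the data form is that a team can review and extend the list like any other code.</p>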



<h3 class="wp-block-heading">3. Verification is nonnegotiable</h3>



<p class="wp-block-paragraph">Every skill terminates in concrete evidence. Tests pass. Build output is clean. The runtime trace shows the expected behavior. A reviewer signs off. “Seems right” is never sufficient.</p>



<p class="wp-block-paragraph">This is the same principle that makes Anthropic’s harness recover from failures, that makes Cursor’s planner/worker/judge split actually catch bugs, that makes any <a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">long-running agent</a> recoverable. The agent is a generator. You need a separate signal that the work is done. Skills bake that signal into every workflow.</p>



<h3 class="wp-block-heading">4. Progressive disclosure</h3>



<p class="wp-block-paragraph">Do not load all 20 skills into context at session start. Activate them based on the phase. A small meta-skill (<code>using-agent-skills</code>) acts as a router that decides which skill applies to the current task.</p>



<p class="wp-block-paragraph">This is the <a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">harness engineering</a> lesson applied at skill granularity. Every token loaded into context degrades performance somewhere, so you load what’s relevant and leave the rest on disk. Progressive disclosure is how you get a 20-skill library into a 5K-token slot without poisoning the well.</p>
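


<p class="wp-block-paragraph">A rough sketch of that routing idea in Python follows. The real router is a Markdown meta-skill, so everything here is an assumption made for illustration: the phase-to-skill table, the function names <code>select_skills</code> and <code>build_context</code>, and the <code>load</code> callable that stands in for reading a skill file from disk. Only the skill names themselves come from the repo.</p>



<pre class="wp-block-code"><code># Hypothetical phase -> skill table; the grouping is an assumption.
SKILLS_BY_PHASE = {
    "build": ["test-driven-development", "api-and-interface-design"],
    "review": ["code-review-and-quality", "code-simplification"],
    "ship": ["ci-cd-and-automation", "git-workflow-and-versioning"],
}


def select_skills(phase: str) -> list:
    """Return only the skills relevant to the current phase."""
    return SKILLS_BY_PHASE.get(phase, [])


def build_context(phase: str, load) -> str:
    """Load the router plus the selected skills; the rest stays on disk."""
    names = ["using-agent-skills"] + select_skills(phase)
    return "\n\n".join(load(name) for name in names)


# Stand-in loader that returns a heading instead of reading a real file.
context = build_context("review", lambda name: "# " + name)</code></pre>



<p class="wp-block-paragraph">In this sketch a review session pays for three skills rather than twenty, and an unrecognized phase loads only the router.</p>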



<h3 class="wp-block-heading">5. Scope discipline</h3>



<p class="wp-block-paragraph">The meta-skill encodes a nonnegotiable I’d staple to every agent if I could: “touch only what you’re asked to touch.” Don’t refactor adjacent systems. Don’t remove code you don’t fully understand. Don’t brush against a TODO and decide to rewrite the file.</p>



<p class="wp-block-paragraph">This sounds obvious until you watch an agent decide that fixing one bug requires modernizing three unrelated files. Scope discipline is the single biggest determinant of whether an agent’s PR is mergeable or has to be unwound. It’s also the principle that maps most cleanly onto Google’s code review norms, where reviewers will block a PR for doing more than one thing.</p>
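


<p class="wp-block-paragraph">Scope discipline is also easy to check mechanically. The sketch below, which is an illustration and not part of the repo, flags changed files that fall outside the paths a task was allowed to touch. The function name <code>out_of_scope</code> and the example paths are hypothetical. Note that <code>fnmatch</code> patterns let <code>*</code> match path separators too, so <code>src/billing/*</code> covers nested directories.</p>



<pre class="wp-block-code"><code>from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical bug fix that was only supposed to touch billing code.
changed = ["src/billing/invoice.py", "src/auth/session.py"]
allowed = ["src/billing/*", "tests/billing/*"]
strays = out_of_scope(changed, allowed)
# strays == ["src/auth/session.py"]</code></pre>



<p class="wp-block-paragraph">A check like this could run in a hook or in CI, where a nonempty result means the PR is doing more than one thing.</p>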



<h2 class="wp-block-heading">The Google DNA</h2>



<p class="wp-block-paragraph">The skills are saturated with practices from <em><a href="https://learning.oreilly.com/library/view/software-engineering-at/9781492082781/" target="_blank" rel="noreferrer noopener">Software Engineering at Google</a></em> and Google’s public engineering culture. This is intentional. Most of what makes Google-scale software work is documented and public, and it is <em>exactly</em> the part agents are most likely to skip.</p>



<p class="wp-block-paragraph">A partial map of which skill encodes which practice:</p>



<ul class="wp-block-list">
<li><strong>Hyrum’s law in api-and-interface-design.</strong> Every observable behavior of your API will eventually be depended on by someone, so design with that in mind.</li>



<li><strong>The test pyramid (~80/15/5) and the Beyoncé rule in test-driven-development.</strong> “If you liked it, you should have put a test on it.” Infrastructure changes don’t catch bugs; tests do.</li>



<li><strong>DAMP over DRY in tests.</strong> Google’s testing philosophy is explicit that test code should read like a specification even at the cost of some duplication. Overabstracted tests are a known antipattern.</li>



<li><strong>~100-line PR sizing, with Critical/Nit/Optional/FYI severity labels in code-review-and-quality.</strong> Straight from Google’s code review norms. Big PRs don’t get reviewed; they get rubber-stamped.</li>



<li><strong>Chesterton’s Fence in code-simplification.</strong> Don’t remove a thing until you understand why it was put there.</li>



<li><strong>Trunk-based development and atomic commits in git-workflow-and-versioning.</strong></li>



<li><strong>Shift left and feature flags in ci-cd-and-automation.</strong> Catch problems as early as possible, decouple deploy from release.</li>



<li><strong>Code-as-liability in deprecation-and-migration.</strong> Every line you keep is one you have to maintain forever, so prefer the smaller surface.</li>
</ul>



<p class="wp-block-paragraph">None of these are new ideas. The point is that none of them are in the agent by default. A frontier model has read the phrase “Hyrum’s law” in its training data, but it does not apply Hyrum’s law when it’s designing your API at 3am. Skills are how you make sure it does.</p>



<h2 class="wp-block-heading">How to actually use it</h2>



<p class="wp-block-paragraph">Three modes, in roughly increasing commitment.</p>



<p class="wp-block-paragraph"><strong>Mode 1: Install via marketplace. </strong>If you’re using Claude Code:</p>



<pre class="wp-block-code"><code>/plugin marketplace add addyosmani/agent-skills
/plugin install agent-skills@addy-agent-skills</code></pre>



<p class="wp-block-paragraph">You get the slash commands (<code>/spec</code>, <code>/plan</code>, <code>/build</code>, <code>/test</code>, <code>/review</code>, <code>/ship</code>, <code>/code-simplify</code>) and the agent activates the relevant skills automatically based on context. This is the path I’d recommend most people start on.</p>



<p class="wp-block-paragraph"><strong>Mode 2: Drop the Markdown into your tool of choice.</strong> The skills are plain Markdown with front matter. Cursor users put them in <code>.cursor/rules/</code>. Gemini CLI has its own install path. Codex, Aider, Windsurf, OpenCode, anything that accepts a system prompt can read them. The tooling matters less than the workflow underneath.</p>



<p class="wp-block-paragraph"><strong>Mode 3: Read them as a spec. </strong>Even if you never install anything, the skills are a <em>documented description of what good engineering with AI agents looks like</em>. Read <code>code-review-and-quality.md</code> and apply the five-axis framework to your team’s review process. Read <code>test-driven-development.md</code> and use it to settle the next “do we need to write the test first” argument with a junior. Read the meta-skill and steal the five nonnegotiables for your own AGENTS.md.</p>



<p class="wp-block-paragraph">This third mode is where I’d actually start. Pick the four or five skills closest to your current pain. Decide which workflows you want enforced. Then install the runtime, or roll your own, to do the enforcing.</p>



<h2 class="wp-block-heading">What to steal even if you never install</h2>



<p class="wp-block-paragraph">A few patterns from the project I’d steal regardless of whether you use AI coding agents at all:</p>



<p class="wp-block-paragraph"><strong>Anti-rationalization as a team practice.</strong> Write down the lies your team tells itself. “We’ll fix the tests after launch.” “This change is too small for a design doc.” “It’s fine, we have monitoring.” Pair each with the rebuttal. Put it in your AGENTS.md or your engineering wiki. It will save you arguments and it will catch the next tired Friday-afternoon shortcut.</p>



<p class="wp-block-paragraph"><strong>Process over prose for anything you write internally.</strong> If you find yourself writing a 2,000-word doc titled “how we approach X” you’ve written reference material. Convert it to a workflow with checkpoints. The doc shrinks to 400 words and people actually run it. This applies as much to onboarding guides and runbooks as it does to agent skills.</p>



<p class="wp-block-paragraph"><strong>Verification as a hard exit criterion.</strong> Make “produce evidence” the exit step of every task. For agents, for engineers, for yourself. Evidence is whatever proves the work is done: a green test run, a screenshot, a log, a review approval. Without it, the task is not done. “Seems right” never closes the loop.</p>



<p class="wp-block-paragraph"><strong>Progressive disclosure for any rulebook.</strong> Do not write a 50-page handbook. Write a small router that points to the right small chapter for the situation. This is true for AGENTS.md, for runbooks, for incident playbooks, for anything anyone will read under time pressure.</p>



<p class="wp-block-paragraph">Five nonnegotiables, lifted from the meta-skill, that I’d put in any AGENTS.md tomorrow:</p>



<ol class="wp-block-list">
<li>Surface assumptions before building. Wrong assumptions held silently are the most common failure mode.</li>



<li>Stop and ask when requirements conflict. Don’t guess.</li>



<li>Push back when warranted. The agent (or engineer) is not a yes-machine.</li>



<li>Prefer the boring, obvious solution. Cleverness is expensive.</li>



<li>Touch only what you’re asked to touch.</li>
</ol>



<p class="wp-block-paragraph">That’s a worthwhile engineering culture in five lines, and you don’t need to install anything to adopt it.</p>



<h2 class="wp-block-heading">Where this fits in the harness</h2>



<p class="wp-block-paragraph">In the broader picture, skills are one layer of <a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">agent harness engineering</a>. The harness is the model plus everything you build around it; skills are the reusable workflow chunks that get progressively disclosed into the system prompt. They sit alongside <code>AGENTS.md</code> (the rolling rulebook), hooks (the deterministic enforcement layer), tools (the actions the agent can take), and the session log (the durable memory). Each layer has a specific job. Skills do the senior-engineer-process job.</p>



<p class="wp-block-paragraph">Skills matter more for <a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">long-running agents</a> than they do for chat-style ones, because long runs amplify every shortcut. An agent that skips the test in a 10-minute session produces one bug. An agent that skips the test in a 30-hour session produces a debugging archaeology project at the end of the run, when no one remembers what the original intent was. The longer the run, the more the senior-engineer scaffolding has to be enforced rather than suggested.</p>



<p class="wp-block-paragraph">The portability of the skills format matters too. The same SKILL.md file works in Claude Code, Cursor (with rules), Gemini CLI, Codex, and any other harness that accepts system-prompt content. Write the workflow once, and the runtime enforces it. That’s the thing the Markdown-with-front-matter format buys you that bespoke prompt engineering does not.</p>



<h2 class="wp-block-heading">Closing</h2>



<p class="wp-block-paragraph">The thing I most want people to take from this project, more than the skills themselves, is the framing.</p>



<p class="wp-block-paragraph">AI coding agents are extremely capable junior engineers with no instinct for the parts of the job that don’t show up in the diff. The senior-engineering work (surfacing assumptions, sizing changes, writing the spec, leaving evidence, refusing to merge what can’t be reviewed) is exactly what an agent will skip unless you make it impossible to skip. The job, increasingly, is to encode that discipline as something the agent cannot talk itself out of.</p>



<p class="wp-block-paragraph">Skills are one shape of that. Anti-rationalization tables. Progressive disclosure. Process over prose. Verification as the load-bearing exit criterion. The Google practices that already work, made portable.</p>



<p class="wp-block-paragraph">You can install <a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">my version</a>. You can roll your own. The lesson stands either way: The senior-engineer parts of the job are no longer optional, even when the engineer is a model.</p>



<p class="wp-block-paragraph"><em>The repo is at <a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">github.com/addyosmani/agent-skills</a> (MIT). For the broader scaffolding picture, see “<a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">Agent Harness Engineering</a>” and “<a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">Long-Running Agents</a>.”</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/agent-skills/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Who Authorized That? The Delegation Problem in Multi-Agent AI</title>
		<link>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/</link>
				<comments>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/#respond</comments>
				<pubDate>Tue, 26 May 2026 10:58:58 +0000</pubDate>
					<dc:creator><![CDATA[Sunil Prakash]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18793</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Who-Authorized-That.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Who-Authorized-That-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Securing access isn’t enough. As agents begin calling other agents, enterprises need to secure delegation too.]]></custom:subtitle>
		
				<description><![CDATA[Your AI agent booked a meeting, summarized a financial report, and emailed the highlights to three stakeholders. To do this, it called a calendar agent, a document analysis agent, and an email agent. Each accessed internal systems, made decisions about what to include, and acted on your behalf. Here’s the question your security team can’t [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Your AI agent booked a meeting, summarized a financial report, and emailed the highlights to three stakeholders. To do this, it called a calendar agent, a document analysis agent, and an email agent. Each accessed internal systems, made decisions about what to include, and acted on your behalf.</p>



<p class="wp-block-paragraph">Here’s the question your security team can’t answer: <strong>Who authorized the email agent to read that financial report?</strong></p>



<p class="wp-block-paragraph">In most current architectures, the honest answer is that no one did, at least not explicitly. The logs may show that a service called another service. But they can’t show that the delegation itself was authorized. The authorization didn’t fail loudly. It leaked silently through the chain.</p>



<p class="wp-block-paragraph">This is the delegation problem in multi-agent AI. As enterprises connect agents through protocols such as MCP and A2A, they’re solving the connectivity problem faster than they’re solving the authority problem. The result is a new security boundary that most enterprise architectures have not yet modeled, precisely because most organizations still treat it as orchestration rather than authorization.</p>



<h2 class="wp-block-heading">Agents are connecting faster than authorization is adapting</h2>



<p class="wp-block-paragraph">The agent ecosystem has moved fast over the past two years. Anthropic&#8217;s MCP gave model-powered applications a standard way to connect to tools, data sources, and services. Google&#8217;s A2A protocol gave agents a standard way to communicate and coordinate across systems. Frameworks and SDKs such as LangChain, CrewAI, and Google&#8217;s ADK made it easier to build multi-agent workflows where one agent orchestrates several others.</p>



<p class="wp-block-paragraph">What these protocols don’t yet provide, at least not as a mature common layer, is a delegation-aware authorization model.</p>



<p class="wp-block-paragraph">MCP describes a protected server as an OAuth 2.1 resource server, with the MCP client acting as an OAuth client making requests on behalf of a resource owner. That’s a familiar and well-understood pattern, but it was designed for a world where a human clicks &#8220;Allow&#8221; and a single client gets a scoped token. It doesn’t address what happens when Agent A receives that token, delegates a subtask to Agent B, and Agent B spawns Agent C to handle part of it. Each hop in that chain either reuses the original token (overprivileged) or has no token at all (untracked).</p>



<p class="wp-block-paragraph">A2A was built for interoperability: independent, potentially opaque agent systems communicating and coordinating actions across enterprise platforms. That’s the right problem to solve. But communication and delegation governance are different layers. A2A helps agents discover, describe, and communicate with one another. This is necessary infrastructure, but it isn’t the same as delegated authority. It doesn’t tell you whether a specific downstream action was legitimately derived from an upstream instruction.</p>



<p class="wp-block-paragraph">Static API keys are even weaker for this problem. A key grants access to a service. It says nothing about who is using it, what they’re using it for, or whether the entity presenting it is the same one it was issued to. Service accounts identify a workload, not an intent. When three agents share a service account, every action looks the same in your logs.</p>



<p class="wp-block-paragraph">None of these tools are broken. They solve different problems. The gap is structural. Authentication answers which agent is calling. Authorization defines what that agent may access. The harder question, and the one most enterprise architectures are not yet designed to answer, is whether a specific downstream action was legitimately derived from an upstream instruction, under narrowed constraints, with a verifiable chain back to a human decision. That’s the delegation question, and it sits in a layer that today&#8217;s stack doesn’t really have.</p>



<p class="wp-block-paragraph">In a clean version of this picture, privilege should sit only with the agent that touches the outside world. If a payer (A) asks a bookkeeper agent (B) to make a payment, and the bookkeeper asks a banking agent (C) to execute the transfer, only the banking agent needs banking authority. The bookkeeper doesn’t need to move money. It only needs to know the request came from an authorized payer. The banking agent only needs to know the request came from an authorized bookkeeper. This is the principle of least privilege, a concept the security community has lived with for decades, applied to delegation chains. The difficulty is that today&#8217;s agent stacks make it hard to enforce.</p>



<h2 class="wp-block-heading">What breaks in the chain</h2>



<p class="wp-block-paragraph">Consider a treasury reporting workflow in a regulated bank. A planning agent is allowed to read liquidity projections and produce a daily summary for senior finance users. To complete the task, it delegates chart generation to a visualization agent and narrative review to a communications agent. The visualization agent doesn’t need access to raw account-level data. The communications agent doesn’t need access to the underlying liquidity model. Yet unless the delegation layer attenuates permissions, both may receive more context than their task requires. The result isn’t a dramatic breach, but it is a quiet expansion of access that the access-control model never explicitly approved.</p>



<p class="wp-block-paragraph">The risk isn’t limited to internet-facing agents. Many delegation failures happen entirely inside the enterprise boundary. An internal agent may call another internal agent, which calls an internal tool, which sends data to an approved SaaS service. Every individual step may look acceptable. The risk appears in the composition: The final data movement or action may exceed the intent of the original authorization.</p>



<p class="wp-block-paragraph">This pattern creates three categories of failure that enterprises may have to explain to regulators, auditors, or customers.</p>



<p class="wp-block-paragraph"><strong>Ghost permissions. </strong>A finance analyst assistant has been given access to a customer transactions database to support quarterly reporting. It calls a summarization agent: &#8220;summarize recent transactions for these accounts.&#8221; The summarization agent now operates against customer records, even though no policy engine granted it that access. The analyst assistant&#8217;s privileges effectively traveled with the request. The permission is a ghost. It exists in practice but not in any authorization system.</p>



<p class="wp-block-paragraph"><strong>Scope drift.</strong> Even when an agent starts with narrow permissions, delegation tends to widen scope rather than narrow it. An agent authorized to read Q1 revenue data delegates to a charting agent, which calls an external rendering API, which now has the revenue figures. The data left the organization through three hops of implicit trust. Each agent acted within what it understood as its scope. The aggregate result exceeded what any human would have approved.</p>



<p class="wp-block-paragraph"><strong>Broken audit trails.</strong> Regulated industries require the ability to answer &#8220;who did what and why&#8221; for any consequential action. In a single-agent system, this is manageable. In a multi-agent chain, the audit trail fragments across agents, protocols, and services. When a compliance team asks why a particular customer communication was sent, the answer might involve four agents across two protocols, none of which logged the delegation chain. The action is traceable to a system but not to a decision.</p>



<p class="wp-block-paragraph">These aren’t edge cases. They’re a common outcome when delegation isn’t modeled explicitly. The delegation problem isn’t a bug in any particular framework. It’s a gap in the layer between them.</p>



<h2 class="wp-block-heading">What a delegation-aware model requires</h2>



<p class="wp-block-paragraph">A delegation-aware authorization model has to solve four things at once, which is part of why no existing layer covers it cleanly.</p>



<p class="wp-block-paragraph">The first is identity. The downstream agent needs a cryptographic credential that the receiving system can verify independently, not just a hostname or an API key. Hostnames lie. API keys travel. A real identity is one the calling system cannot fabricate.</p>



<p class="wp-block-paragraph">The second is attenuation. When an agent delegates a task, the subagent should receive strictly fewer permissions than the parent—never the same set, and certainly never more. This is the principle of least privilege applied to delegation chains, and almost no current tooling enforces it by default.</p>



<p class="wp-block-paragraph">The third is purpose. &#8220;Read this report to summarize liquidity exposure for the CFO&#8221; is a different authorization from &#8220;read this report and send selected figures to an external charting service.&#8221; The data and the agent may be the same, but these are two very different risk profiles. Without a purpose binding, the authorization layer has no way to distinguish them.</p>



<p class="wp-block-paragraph">The fourth is audit. The organization should be able to reconstruct, after the fact, who delegated what, under which constraints, and what evidence each agent produced at completion. Not just which systems were called but which decisions were made and on whose authority.</p>



<p class="wp-block-paragraph">It’s possible for agents to authenticate successfully even when they don’t have accountable authority. They can prove who they are and still execute actions that no human ever authorized.</p>



<h2 class="wp-block-heading">Emerging approaches</h2>



<p class="wp-block-paragraph">Several efforts address parts of this problem: workload identity standards, agent metadata in tokens, OAuth-based MCP authorization, A2A authentication patterns, and agent identity frameworks. These are useful building blocks, but identity is not the same as delegated authority. A signed agent card can help establish an agent&#8217;s declared identity and capabilities. An OAuth token can tell you what a client may access. Neither, by itself, proves that a specific downstream action was authorized by a specific upstream decision under narrowed constraints.</p>



<p class="wp-block-paragraph">One emerging pattern is delegation-bound capability tokens: short-lived credentials that bind an invocation to an agent identity, a constrained permission set, and a provenance record. One example is the <a href="https://datatracker.ietf.org/doc/draft-prakash-aip/" target="_blank" rel="noreferrer noopener">Agent Identity Protocol (AIP)</a>, which I’ve been working on as an Internet-Draft and <a href="https://sunilprakash.com/aip/" target="_blank" rel="noreferrer noopener">open source implementation</a>. AIP is still early, but it illustrates the shape of one possible answer: invocation-bound tokens that carry identity, attenuated permissions, and provenance through a delegation chain. The token chain itself becomes part of the audit evidence rather than something reconstructed after the fact from fragmented logs.</p>
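


<p class="wp-block-paragraph">The general shape of such a token can be sketched in a few lines of Python. This is an illustration of the pattern only and is not the AIP wire format. The field names, the <code>issue</code> and <code>verify</code> functions, and the example scope strings are all hypothetical, and the single shared HMAC key is a simplifying assumption; a real design would more likely use per-agent asymmetric keys so that a verifier cannot mint tokens itself.</p>



<pre class="wp-block-code"><code>import hashlib
import hmac
import json
import time

# Assumption: one shared demo key, used only to keep the sketch short.
DEMO_KEY = b"demo-only-key"


def _sign(payload):
    """Sign a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(DEMO_KEY, body, hashlib.sha256).hexdigest()


def issue(agent, scopes, purpose, parent=None, ttl=300, now=None):
    """Issue a short-lived token; a child must hold strictly fewer scopes."""
    scopes = set(scopes)
    if parent is not None:
        parent_scopes = set(parent["payload"]["scopes"])
        if not scopes.issubset(parent_scopes) or scopes == parent_scopes:
            raise PermissionError("delegation must narrow the parent's scopes")
    issued_at = time.time() if now is None else now
    payload = {
        "agent": agent,                 # identity of the token holder
        "scopes": sorted(scopes),       # attenuated permission set
        "purpose": purpose,             # what the delegation is for
        "expires": issued_at + ttl,     # short lifetime
        "parent_sig": parent["sig"] if parent is not None else None,  # provenance link
    }
    return {"payload": payload, "sig": _sign(payload)}


def verify(token, now=None):
    """Return True if the token is unexpired and its signature matches."""
    current = time.time() if now is None else now
    unexpired = token["payload"]["expires"] > current
    return unexpired and hmac.compare_digest(token["sig"], _sign(token["payload"]))


root = issue("planner", ["liquidity:read", "summary:write"], "daily summary")
child = issue("visualizer", ["liquidity:read"], "render chart", parent=root)</code></pre>



<p class="wp-block-paragraph">Here <code>verify</code> checks a single token. Validating a whole chain would also mean following each <code>parent_sig</code> back to the root, which this sketch leaves out.</p>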



<p class="wp-block-paragraph">Complementary approaches are also emerging. Behavioral credentials, the idea that agents should be continuously reauthorized based on runtime behavior rather than just initial permissions, address a related but distinct problem. Delegation tokens tell you who authorized what. Behavioral monitoring tells you whether the agent is still acting within its authorized profile. A complete solution will likely need both.</p>



<p class="wp-block-paragraph">None of these approaches have reached mainstream adoption. But the fact that they are emerging simultaneously, from different corners of the industry, signals that the delegation gap is real and recognized.</p>



<h2 class="wp-block-heading">What enterprise teams should do now</h2>



<p class="wp-block-paragraph">You don’t need to wait for standards to mature before addressing the delegation problem. There are concrete steps that security, platform, and architecture teams can take today.</p>



<p class="wp-block-paragraph"><strong>Map your delegation chains.</strong> Most teams deploying multi-agent workflows haven’t documented which agents call which other agents, with what permissions, through which protocols. Start there. If you can’t draw the graph, you can’t secure it.</p>



<p class="wp-block-paragraph"><strong>Audit implicit permissions.</strong> For every agent-to-agent interaction, ask: Was this access explicitly granted, or is the downstream agent inheriting permissions by proximity? If the answer is inheritance, you have a ghost permission that needs a policy decision.</p>
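


<p class="wp-block-paragraph">Once the call graph is written down, a first pass at this audit can be automated. The following Python sketch is a minimal illustration under stated assumptions: The agent names, the <code>CALLS</code> and <code>EXPLICIT_GRANTS</code> tables, and the function <code>ghost_permissions</code> are hypothetical, and the sketch conservatively assumes that any agent reachable from the caller may end up handling the caller’s data.</p>



<pre class="wp-block-code"><code># Hypothetical inventory: which agent calls which, and who was granted what.
CALLS = {
    "analyst-assistant": ["summarizer"],
    "summarizer": ["renderer"],
    "renderer": [],
}
EXPLICIT_GRANTS = {
    "analyst-assistant": {"customer_transactions"},
    "summarizer": set(),
    "renderer": set(),
}


def ghost_permissions(root, resource):
    """List agents reachable from root that hold no explicit grant
    for a resource the root is allowed to use."""
    ghosts = []
    stack = list(CALLS.get(root, []))
    seen = {root}
    while stack:
        agent = stack.pop()
        if agent in seen:
            continue
        seen.add(agent)
        if resource not in EXPLICIT_GRANTS.get(agent, set()):
            ghosts.append(agent)
        stack.extend(CALLS.get(agent, []))
    return ghosts


ghosts = ghost_permissions("analyst-assistant", "customer_transactions")
# ghosts == ["summarizer", "renderer"]</code></pre>



<p class="wp-block-paragraph">Each name in the result is a policy decision waiting to be made: Either grant the access explicitly or stop the data from traveling that far.</p>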



<p class="wp-block-paragraph"><strong>Require scope attenuation.</strong> Establish an architectural rule: When an agent delegates a task, the subagent must receive fewer permissions than the parent, never more. Current tooling doesn’t enforce this automatically, but you can enforce it in your orchestration layer.</p>
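
<p class="wp-block-paragraph">In an orchestration layer, the rule can be a single guard that runs before any handoff. This is a minimal sketch, assuming permissions are modeled as sets of strings. The function and exception names are illustrative and do not come from any particular framework.</p>

```python
class ScopeEscalationError(Exception):
    """Raised when a delegation would not narrow the parent's permissions."""


def attenuate(parent_permissions, requested_permissions):
    """Return the subagent's permissions, or raise unless they are strictly narrower."""
    parent = frozenset(parent_permissions)
    requested = frozenset(requested_permissions)
    # Fewer than the parent, never more: require a proper subset.
    if not (requested.issubset(parent) and requested != parent):
        extra = sorted(requested - parent)
        raise ScopeEscalationError(f"not a strict subset of parent scope; extra={extra}")
    return requested


child_scope = attenuate({"tickets:read", "tickets:write"}, {"tickets:read"})
print(sorted(child_scope))

try:
    attenuate({"tickets:read"}, {"tickets:read", "admin:all"})
except ScopeEscalationError as err:
    print("blocked:", err)
```

<p class="wp-block-paragraph">Failing closed is the design choice that matters here: a handoff that cannot prove it narrows the scope does not happen.</p>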



<p class="wp-block-paragraph"><strong>Build the audit trail before the auditor asks.</strong> If your organization is in a regulated industry, the question &#8220;Who authorized this agent action?&#8221; will eventually be asked. The time to instrument delegation logging is before that question arrives, not after. Log the full chain: which agent initiated the task, what permissions were passed, which subagents were invoked, and what each one accessed.</p>
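
<p class="wp-block-paragraph">One way to start is a structured record per delegation hop, written as JSON lines. The field names below are assumptions chosen to mirror the four items in the list above. They do not follow any standard schema.</p>

```python
import json
from datetime import datetime, timezone


def delegation_record(task_id, initiator, delegate, permissions, resources_accessed, timestamp=None):
    """Build one JSON-serializable audit entry for a single delegation hop."""
    ts = timestamp or datetime.now(timezone.utc).isoformat()
    return {
        "task_id": task_id,
        "timestamp": ts,
        "initiator": initiator,                          # which agent started this hop
        "delegate": delegate,                            # which subagent was invoked
        "permissions_passed": sorted(permissions),       # what was handed over
        "resources_accessed": list(resources_accessed),  # what the subagent touched
    }


def reconstruct_chain(records, task_id):
    """List the (initiator, delegate) hops for one task in log order."""
    return [(r["initiator"], r["delegate"]) for r in records if r["task_id"] == task_id]


log = [
    delegation_record("t-1", "planner", "retriever", {"kb:read"},
                      ["kb/policies.md"], "2026-01-01T00:00:00+00:00"),
    delegation_record("t-1", "retriever", "summarizer", set(),
                      [], "2026-01-01T00:00:05+00:00"),
]
for entry in log:
    print(json.dumps(entry))  # one JSON object per line, ready for an append-only sink
print(reconstruct_chain(log, "t-1"))
```

<p class="wp-block-paragraph">With records like these, answering the auditor’s question becomes a filter over one task ID instead of a reconstruction across fragmented logs.</p>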



<p class="wp-block-paragraph"><strong>Test with real tooling.</strong> Delegation-aware approaches, including capability-token designs, workload identity standards, and agent identity frameworks, are early but functional. Running one in a nonproduction environment will expose gaps in your current authorization model that architecture review alone will not surface.</p>



<h2 class="wp-block-heading">Delegation is the security boundary</h2>



<p class="wp-block-paragraph">The first phase of enterprise agent adoption was about connectivity: Can the agent reach the tool, the API, the database, or the other agent? The next phase will be about accountable delegation: Should this agent be allowed to ask that agent to do this specific thing, with this data, under these constraints?</p>



<p class="wp-block-paragraph">That question won’t be answered by prompt engineering. It belongs in the authorization layer, the platform layer, and the audit trail.</p>



<p class="wp-block-paragraph">Enterprises don’t need to solve the entire standards problem today. But they do need to stop treating delegation as an implementation detail. In multi-agent systems, delegation is the security boundary.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>This Week in AI: Rethinking the Agent Harness</title>
		<link>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/</link>
				<comments>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/#respond</comments>
				<pubDate>Fri, 22 May 2026 15:01:29 +0000</pubDate>
					<dc:creator><![CDATA[Michelle Smith]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[This Week in AI]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18774</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-scaled.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Plus AI security, the compute arms race, and why eventually there may no longer be an internet for humans]]></custom:subtitle>
		
				<description><![CDATA[We kicked off our new weekly series This Week in AI on Monday, and we covered a lot of ground in 30 minutes, including an AI model that found security holes faster than decades of human auditing, a data center in Utah the size of two Manhattans, and a practical argument for why the harness [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">We kicked off our new weekly series <em>This Week in AI</em> on Monday, and we covered a lot of ground in 30 minutes, including an AI model that found security holes faster than decades of human auditing, a data center in Utah the size of two Manhattans, and a practical argument for why the harness you build around a model now matters more than which model you pick.<br><br>Here are a few takeaways from the conversation between host Eric Freeman, a faculty member at UT Austin and a longtime <a href="https://learning.oreilly.com/search/?q=author%3A%20%22Eric%20Freeman%22&amp;suggested=true&amp;suggestionType=author&amp;originalQuery=eric%20freeman&amp;rows=100&amp;language=en" target="_blank" rel="noreferrer noopener">friend of O’Reilly</a>, and guest John Berryman, founder of Arcturus Labs, an early production engineer on GitHub Copilot, and coauthor of O&#8217;Reilly&#8217;s <a href="https://learning.oreilly.com/library/view/prompt-engineering-for/9781098156145/" target="_blank" rel="noreferrer noopener"><em>Prompt Engineering for LLMs</em></a>. Watch the entire episode to find out why you should be building your own agent and why John believes eventually there will be no internet for humans.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="This Week in AI: Rethinking the Agent Harness" width="500" height="281" src="https://www.youtube.com/embed/g4cfjz5AKxY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>AI&#8217;s security problem is now a policy problem</strong></h2>



<p class="wp-block-paragraph">You’ve probably already heard about <a href="https://red.anthropic.com/2026/mythos-preview/" target="_blank" rel="noreferrer noopener">Mythos</a>. Anthropic&#8217;s internal testing of the frontier model surfaced thousands of previously unknown security vulnerabilities across major operating systems, browsers, and financial infrastructure, including a 27-year-old bug in OpenBSD. Anthropic chose not to release the model publicly and instead launched <a href="https://www.anthropic.com/glasswing" target="_blank" rel="noreferrer noopener">Project Glasswing</a>, a restricted program giving monitored access to a small group of trusted partners for defensive patching.</p>



<p class="wp-block-paragraph">That decision moved fast in Washington. In roughly six weeks, the conversation shifted from the light-touch national AI policy released in March to reported White House discussions of an <a href="https://fortune.com/2026/05/06/trump-administration-embraces-ai-oversight-policies-it-once-rejected-anthropic-mythos-caisi/" target="_blank" rel="noreferrer noopener">executive order review process</a> modeled on how the FDA handles drugs. Security researcher Bruce Schneier has questioned <a href="https://www.schneier.com/blog/archives/2026/04/mythos-and-cybersecurity.html" target="_blank" rel="noreferrer noopener">whether Mythos is uniquely capable here</a> or whether similar results are achievable with cheaper public models, but as Freeman noted (paraphrasing Schneier), either way, it’s a problem that’s coming.</p>



<h2 class="wp-block-heading">The compute race is getting stranger</h2>



<p class="wp-block-paragraph">Anthropic <a href="https://x.ai/news/anthropic-compute-partnership" target="_blank" rel="noreferrer noopener">leased xAI&#8217;s entire Colossus 1 supercluster</a> in Memphis: more than 200,000 GPUs and 300 megawatts of power. A month before that deal, <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute" target="_blank" rel="noreferrer noopener">Anthropic expanded its agreement with Google and Broadcom for 3.5 gigawatts</a> of capacity coming online in 2027. For context, that&#8217;s nearly 12 times the power of the Colossus 1 deal, in a single contract. After this episode aired, Anthropic announced that the Colossus deal had been <a href="https://www.axios.com/2026/05/20/anthropic-spacex-compute" target="_blank" rel="noreferrer noopener">expanded to Colossus 2</a> as well.</p>



<p class="wp-block-paragraph">Box Elder County, Utah, just approved a 40,000-acre AI data center called the Stratos project, backed by investor and TV personality Kevin O&#8217;Leary (a.k.a. Mr. Wonderful). It’s planned for <a href="https://www.theregister.com/on-prem/2026/05/13/utah-mega-datacenter-could-dump-23-atomic-bombs-worth-of-energy-per-day/5239670" target="_blank" rel="noreferrer noopener">9 gigawatts at full buildout</a>. That&#8217;s a footprint more than twice the size of Manhattan, powered by the equivalent of nine commercial nuclear reactors. And like many data center deals going forward, including Colossus above, it was <a href="https://www.cnn.com/2026/05/09/tech/ai-data-center-utah-kevin-oleary-opposition" target="_blank" rel="noreferrer noopener">approved over local protests</a>.</p>



<p class="wp-block-paragraph">Infrastructure at this incredible scale takes years to come online, and the companies making these bets are pricing in a world where model capability keeps scaling. Whether that assumption holds will determine a lot about what&#8217;s economically viable to build in the next decade.</p>



<h2 class="wp-block-heading"><strong>The harness matters more than the model</strong></h2>



<p class="wp-block-paragraph">John was on hand to rethink the agent harness, which, as he pointed out, entered a new phase with the step change in model capability that occurred in November and December of last year. He took Eric through the arc of AI product development, from document completion and chat loops to tool-calling agents, DAG-based workflows, and now the harness era represented by tools like Claude Code. Each progression added capability, John noted, but also complexity, and each generated a new class of problems around reliability and control. In our current moment, which John has dubbed the “age of the unharnessed agent,” agents are now within reach of everyone, not just software developers.</p>



<p class="wp-block-paragraph">The payoff of this “unharnessed” era is control. John described a client engagement where he replaced a bespoke application with a skills-driven agent. Now domain experts with no development experience can read the agent&#8217;s behavior written in plain English and better understand it. As John explained,</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Rather than building a bespoke agent.&nbsp;.&nbsp;., I just built something that was just the agent harness—the agent—and I just gave it skills that describe what basically I learned in interviewing their experts, how they would work with these agents. And it worked perfectly. Not only does the agent stay on track and do what it needs to do these days, but it&#8217;s coded, as far as my client is concerned, in English.<br><br>The experts don&#8217;t have to complain to developers “this doesn&#8217;t work.” The experts can look at the English description of what&#8217;s going on and see problems, and maybe even fix it themselves. And I&#8217;m really excited to basically give that power into the hands of the people that know best how to change it, the experts.</p>
</blockquote>



<p class="wp-block-paragraph">That&#8217;s a different relationship between the experts and the tool than anything a wrapped commercial product offers.</p>



<p class="wp-block-paragraph">As Eric pointed out, recent <a href="https://arxiv.org/html/2603.28052v1" target="_blank" rel="noreferrer noopener">Stanford research</a> supports this broader point: Performance gaps between a bare model and a well-designed harness now often matter more than which underlying model you&#8217;re using. The benchmark that used to dominate buying decisions, which model scores highest, has been displaced by a harder question about which harness fits the task.</p>



<p class="wp-block-paragraph">John closed with a demo of his personal agent moving from an Obsidian notebook into Wikipedia and back, carrying context across environments. He used it to illustrate a concept he called the &#8220;open agent protocol,&#8221; his term for a not-yet-existing standard where an agent receives environment-specific skills as it moves between contexts. The protocol doesn&#8217;t exist yet, but the demo made the direction clear.</p>



<h2 class="wp-block-heading"><strong>What&#8217;s next</strong></h2>



<p class="wp-block-paragraph">Join us and a rotating lineup of expert guests for weekly live tool demos and deeper dives into the topics that matter in AI. We’re taking next week off for Memorial Day in the US, but we’ll be back on June 1 with host Andreas Welsch and guests Maya Mikhailov and Doug Shannon to cut through another week of AI headlines and separate what actually drives business value from what looks good in a demo but goes nowhere in production. Our first few episodes are free and open to all if you’d like to attend live—<a href="https://www.oreilly.com/live/this-week-in-ai.html" target="_blank" rel="noreferrer noopener">register here</a>.</p>



<p class="wp-block-paragraph">We’ll continue to share full episodes and publish our takeaways here on Radar each Friday. You can also watch or listen on <a href="https://www.youtube.com/watch?v=g4cfjz5AKxY&amp;list=PL055Epbe6d5bJEhT7_ZzOeJZ6gPyUzYpS" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/033kJS2BG1teGunxmtsU1r" data-type="link" data-id="https://open.spotify.com/show/033kJS2BG1teGunxmtsU1r" target="_blank" rel="noreferrer noopener">Spotify</a>, Apple, or wherever you get your podcasts.</p>



]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Agentic P&#038;L: Beyond the Empire of Headcount</title>
		<link>https://www.oreilly.com/radar/the-agentic-pl-beyond-the-empire-of-headcount/</link>
				<comments>https://www.oreilly.com/radar/the-agentic-pl-beyond-the-empire-of-headcount/#respond</comments>
				<pubDate>Thu, 21 May 2026 15:04:52 +0000</pubDate>
					<dc:creator><![CDATA[Shreshta Shyamsundar and Anmol Jain]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18761</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-Agentic-PL.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-Agentic-PL-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[For over a century, both the prestige and budget of a corporate department have been measured by a single crude metric: headcount. If you manage 500 people, you’re a &#8220;distinguished leader.&#8221; If you manage five, you’re a footnote. This &#8220;empire of headcount&#8221; has governed everything from office square footage to C-suite influence. It’s the fundamental [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">For over a century, both the prestige and budget of a corporate department have been measured by a single crude metric: headcount. If you manage 500 people, you’re a &#8220;distinguished leader.&#8221; If you manage five, you’re a footnote. This &#8220;empire of headcount&#8221; has governed everything from office square footage to C-suite influence. It’s the fundamental unit of the 20th-century P&amp;L.</p>



<p class="wp-block-paragraph">In an enterprise powered by federated agentic systems, this math is not just obsolete—it is a liability. AI will reshape the enterprise. The question is now “Which line items on the P&amp;L change, and by how much?” Labor and benefits contract. Token and infrastructure costs appear as a new operating line. Compliance costs shift from reactive rework to proactive provenance. And the assets that matter most—structured knowledge enclaves, trained agent policies, decision logs—do not yet appear on most balance sheets.</p>



<h2 class="wp-block-heading">Why AI-on-top-of-hierarchy fails</h2>



<p class="wp-block-paragraph">Most enterprise AI deployments begin with the right instinct and the wrong architecture. A foundation model is procured, a chatbot is deployed, and analysts are relieved of their most repetitive queries. This is the butler-bot phase: AI as a faster way to do what the organization already does, inside a structure designed for a different era.</p>



<p class="wp-block-paragraph">The problem is the process the model is plugged into. If a compliance decision requires sign-off from three managers, an AI assistant that drafts the memo faster doesn’t change the three-week cycle time. If context is scattered across email threads and local drives, a model querying that corpus will hallucinate at exactly the rate the corpus is incomplete. The model inherits the organization&#8217;s structural debt. The agentic P&amp;L begins where the butler bot ends: with a deliberate redesign of the process, not just the tooling.</p>



<p class="wp-block-paragraph">The enterprise must pivot: Stop valuing the empire of headcount and start valuing the federated nervous system.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="362" height="186" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11.png" alt="Figure 1. Empire of headcount vs. federated nervous system—An analogy" class="wp-image-18771" style="width:503px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11.png 362w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11-300x154.png 300w" sizes="auto, (max-width: 362px) 100vw, 362px" /><figcaption class="wp-element-caption">Figure 1. Empire of headcount vs. federated nervous system—An analogy</figcaption></figure>



<h2 class="wp-block-heading">Pillar 1: Potential energy—How knowledge-ready is your department?</h2>



<p class="wp-block-paragraph">If the department is the fundamental unit of the enterprise, its contextual enclave is its brain—its store of potential energy. Most companies are drowning in low-quality context: petabytes of data buried in half-finished Slack threads, abandoned wikis, and tacit knowledge held by seniors who are three months from retirement. To an agent, this isn’t intelligence; it’s noise.</p>



<h3 class="wp-block-heading">From data lakes to sharded enclaves</h3>



<p class="wp-block-paragraph">The data lake became a 2020s nightmare—a giant swamp where context went to die. In the federated model, legal, HR, engineering, and compliance each maintain their own secure, high-density enclave instead. Policy, process documentation, and institutional knowledge are synthesized into a form an agent can reason over directly, without a human in the interpretive loop. Data stays local; reasoning moves via agents. Protocols like the Model Context Protocol (MCP) are emerging as the TCP/IP of the federated enterprise—a standard way for agents and tools to discover each other, exchange context, and record what happened regardless of which vendor stack sits underneath. MCP is what allows “reasoning moves, data stays” to be an implementation detail rather than a custom integration project every time.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1134" height="633" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8.png" alt="Figure 2. Contextual density in shared enclaves" class="wp-image-18764" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8.png 1134w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8-768x429.png 768w" sizes="auto, (max-width: 1134px) 100vw, 1134px" /><figcaption class="wp-element-caption">Figure 2. Contextual density in shared enclaves</figcaption></figure>



<h3 class="wp-block-heading">Making potential energy measurable</h3>



<p class="wp-block-paragraph">Three dimensions combine into what we call the contextual density score: coverage (what proportion of policy and process is documented and retrievable—for a compliance enclave, the fraction of onboarding scenarios tied to explicit playbooks); consistency and recency (how often does retrieved guidance conflict, and how stale is it); and retrieval quality (how often can a reference agent answer test questions from its own enclave without human overrides). The contextual density score measures how ready an enclave is for agents to act on it reliably. Each enclave is assigned an owner whose job is to improve that score quarter over quarter, as a traditional leader improves throughput or defect rates. Context maintenance becomes the new R&amp;D.</p>
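
<p class="wp-block-paragraph">How the three dimensions combine is a design decision. As one possible sketch, assume each dimension is normalized to a value between 0 and 1 and combined with a geometric mean, so that one weak dimension drags the whole score down. The function name, the normalization, and the geometric mean are all assumptions made for illustration.</p>

```python
def contextual_density(coverage, consistency_recency, retrieval_quality):
    """Combine three 0-1 dimension scores into one 0-1 score.

    The geometric mean is an assumption, chosen so that a single weak
    dimension cannot be averaged away by two strong ones.
    """
    dims = (coverage, consistency_recency, retrieval_quality)
    if any(d > 1.0 or 0.0 > d for d in dims):
        raise ValueError("each dimension must be between 0 and 1")
    return (coverage * consistency_recency * retrieval_quality) ** (1.0 / 3.0)


# Illustrative numbers for a compliance enclave: well documented and fresh,
# but the reference agent needs human overrides on half of its test questions.
compliance_enclave = contextual_density(
    coverage=0.8, consistency_recency=0.9, retrieval_quality=0.5
)
print(round(compliance_enclave, 2))
```

<p class="wp-block-paragraph">An enclave owner tracking this number quarter over quarter would see the most gain from fixing the weakest dimension first, which is the behavior a multiplicative score is meant to encourage.</p>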



<h2 class="wp-block-heading">Pillar 2: Agentic throughput (the kinetic energy)</h2>



<p class="wp-block-paragraph">If a department’s knowledge enclave is its store of potential energy, throughput is the kinetic energy: the volume and value of cognitive outcomes produced by the agentic layer without human execution in the critical path. To measure this, we must stop counting &#8220;activity&#8221; and start counting handshakes.</p>



<h3 class="wp-block-heading">The handshake economy</h3>



<p class="wp-block-paragraph">In a federated mesh, work is done through agent-to-agent (A2A) negotiation. A logistics agent detects a delayed shipment and initiates a handshake with a procurement agent to find an alternative supplier. That agent consults the contracts enclave via a legal agent to check compliance and risk limits. A resolution is reached, records are updated, and a human is notified of the result—not every intermediate step. Throughput is the rate of successful, economically meaningful handshakes.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1233" height="688" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9.png" alt="Figure 3. The federated agent operating model" class="wp-image-18765" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9.png 1233w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9-768x429.png 768w" sizes="auto, (max-width: 1233px) 100vw, 1233px" /><figcaption class="wp-element-caption">Figure 3. The federated agent operating model</figcaption></figure>



<h3 class="wp-block-heading">Agentic unit economics: The cost of the handshake</h3>



<p class="wp-block-paragraph">Not all handshakes are equal. Every one carries a token tax, an infrastructure cost, and a latency cost. Agentic throughput is only valuable when the cost per cognitive outcome is significantly lower than the labor-equivalent at equal or better quality. If an agent fans out 50 calls to a premium model to resolve a $5 inquiry, you&#8217;ve increased throughput and destroyed ROI. If a handful of calls to a moderately priced model resolve a complex cross-silo onboarding decision that previously took three teams and two weeks, the economics are compelling.</p>
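
<p class="wp-block-paragraph">The two scenarios above can be worked through with rough numbers. In the sketch below, the token counts, the per-million-token prices, the $20,000 labor-equivalent figure, and the 0.5 threshold standing in for “significantly lower” are all illustrative assumptions, not measured data.</p>

```python
def cost_per_outcome(calls, tokens_per_call, usd_per_million_tokens, infra_usd=0.0):
    """Token spend plus infrastructure cost for one resolved outcome, in USD."""
    token_usd = calls * tokens_per_call * usd_per_million_tokens / 1_000_000
    return token_usd + infra_usd


def is_worth_it(agent_usd, labor_equivalent_usd, margin=0.5):
    """True if the agent cost is at most `margin` times the labor-equivalent cost."""
    return labor_equivalent_usd * margin >= agent_usd


# Scenario 1: 50 calls to a premium model to resolve a $5 inquiry.
premium_fanout = cost_per_outcome(calls=50, tokens_per_call=20_000, usd_per_million_tokens=30.0)

# Scenario 2: a handful of calls to a moderately priced model for a
# cross-silo decision, with an assumed labor-equivalent cost of $20,000.
lean_resolution = cost_per_outcome(calls=6, tokens_per_call=20_000, usd_per_million_tokens=3.0)

print(premium_fanout, is_worth_it(premium_fanout, 5.0))
print(lean_resolution, is_worth_it(lean_resolution, 20_000.0))
```

<p class="wp-block-paragraph">Under these assumptions the first scenario spends several times the value of the inquiry, while the second costs well under a dollar against weeks of labor.</p>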



<p class="wp-block-paragraph">The agentic P&amp;L must therefore track outcome volume (risk-weighted handshakes per period) and cost per outcome relative to the pre-agentic baseline—this is where CFOs and architects meet. This recommendation is consistent with <a href="https://www.pwc.com.au/media/2026/pwc-ai-performance-study-australian-companies-lead-on-ai-security.html" target="_blank" rel="noreferrer noopener">emerging research</a>: The companies seeing genuine AI ROI are those using it to expand what they can do, not those focused purely on headcount reduction.</p>



<h3 class="wp-block-heading">How agents learn: Gyms and mirrors</h3>



<p class="wp-block-paragraph">The gym is a simulation built from historical cases and synthetic data where agents train against gold decisions, respecting policy constraints and risk limits. The mirror is a read-only, regulator-grade log of what agents did in production: prompts, tool calls, model versions, human overrides, and final outcomes. <a href="https://www.oreilly.com/radar/gyms-for-them-mirrors-for-us/" target="_blank" rel="noreferrer noopener">Agents spar in the gym; they are judged in the mirror</a>. By 2026, decision provenance—the ability to reconstruct who or what did what, under which policy and model version—is becoming standard operating procedure in regulated industries.</p>
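
<p class="wp-block-paragraph">One way to approximate a mirror in code is an append-only log in which each entry’s hash covers the previous entry, so later edits are detectable. Hash chaining is an assumption layered on top of the description above, and the field names simply follow the list of what the mirror records.</p>

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the entry before the first one


def _digest(body):
    """Stable SHA-256 over a record's JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log, prompt, tool_calls, model_version, human_override, outcome):
    """Append a decision record whose hash covers the previous entry's hash."""
    body = {
        "prompt": prompt,
        "tool_calls": tool_calls,
        "model_version": model_version,
        "human_override": human_override,
        "outcome": outcome,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    log.append({**body, "hash": _digest(body)})
    return log[-1]


def verify(log):
    """Recompute every hash; False means an entry was altered or reordered."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


mirror = []
append_entry(mirror, "Screen customer 123", ["sanctions_lookup"], "model-v1", False, "cleared")
append_entry(mirror, "Review case 456", ["kyc_fetch"], "model-v1", True, "escalated")
print(verify(mirror))
```

<p class="wp-block-paragraph">If anyone later rewrites an earlier outcome, <code>verify</code> returns <code>False</code>, which is the property that lets a reviewer trust the log as a record of what happened.</p>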



<h3 class="wp-block-heading">The Agentic P&amp;L decomposed</h3>



<p class="wp-block-paragraph">Four line items change structurally when an enterprise moves from a headcount model to a federated agentic model:</p>



<p class="wp-block-paragraph">Labor and benefits contract, but not to zero. The compliance function that previously employed 400 analysts moves to 80–100 humans in orchestration and oversight roles—higher-skilled and higher-cost per head, a deliberate trade of volume for leverage.</p>



<p class="wp-block-paragraph">General expenses shift as management layers thin, training budgets pivot from procedural compliance to enclave curation, and real estate requirements contract as hybrid squads replace large hub operations.</p>



<p class="wp-block-paragraph">Token and infrastructure costs emerge as a new operating line that does not exist in the pre-agentic P&amp;L. This line must be actively managed: cost per cognitive outcome is the new unit of measurement and deteriorates quickly with poorly designed agent architectures.</p>



<p class="wp-block-paragraph">Compliance and audit costs shift structure. In a Tier-1 bank, the cost of a single regulatory finding—remediation, legal exposure, delayed onboarding—dwarfs the annual cost of maintaining a well-designed decision log. The mirror transforms regulatory response from a fire drill into a navigable record. Decision provenance is not governance overhead. It is P&amp;L protection.</p>



<p class="wp-block-paragraph">Revenue productivity per person (RPP)—revenue divided by headcount—ties the expense-side story to the top line. Software-native firms have long used RPP as a signal of operational leverage; banks are now applying the same lens to their operations functions. As headcount contracts while throughput and revenue capacity hold or grow, RPP rises structurally rather than cyclically—the metric that tells a CFO whether agentic transformation is delivering leverage or merely cost reduction.</p>
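
<p class="wp-block-paragraph">As a quick worked example of the metric, the sketch below holds revenue flat and moves headcount from 400 to 90, the midpoint of the 80–100 range described earlier. The $60 million revenue figure is purely an assumption for illustration.</p>

```python
def revenue_per_person(revenue_usd, headcount):
    """RPP: revenue divided by headcount."""
    if not headcount > 0:
        raise ValueError("headcount must be positive")
    return revenue_usd / headcount


revenue = 60_000_000  # assumed flat revenue capacity, illustrative only
before = revenue_per_person(revenue, 400)
after = revenue_per_person(revenue, 90)
print(before, round(after), round(after / before, 2))
```

<p class="wp-block-paragraph">Because revenue is held constant, the ratio between the two results depends only on the headcount change, which is why RPP reads as a leverage signal instead of a cost figure.</p>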



<h2 class="wp-block-heading">A stylized agentic P&amp;L: Compliance in a Tier-1 bank</h2>



<p class="wp-block-paragraph">Consider a compliance function with 400 analysts. Its P&amp;L is dominated by salaries, benefits, and office costs. Context sits in email, local drives, and the memory of experienced analysts—institutional knowledge that walks out of the building every evening.</p>



<p class="wp-block-paragraph">In phase 1, the bank builds a compliance enclave: policies, historical cases, and regulator Q&amp;A synthesized into a structured knowledge graph. Three hybrid squads of 12–15 humans work alongside 10–15 agents handling document collection, screening, and rule-based decisions. Agentic throughput starts modestly—20%–30% of low-risk cases auto-cleared from within the enclave. The P&amp;L effect at this stage is primarily a productivity story: lower cost per case, faster cycle times.</p>



<p class="wp-block-paragraph">The structural transformation comes in phase 2. After several cycles of gym training and mirror-driven refinement, the function operates with 80–100 humans plus 40–60 agents. The compliance enclave—curated policies, decision logs, evaluated reward functions—is now the primary asset. Legal discovery may require the email archive; what the regulator wants is a structured, navigable record of decisions. That’s what the mirror provides. With it, the reduced headcount is defensible to regulators, to the board, and on the P&amp;L.</p>



<h2 class="wp-block-heading">The new org unit: The 3+N squad</h2>



<p class="wp-block-paragraph">The &#8220;3+N&#8221; squad—a small human core plus a flexible swarm of agents—is the fundamental cell of the agentic enterprise. The strategic architect sets intent and constraints. The policy and ethics lead designs the gyms, ensuring agents act under responsible AI principles. The technical orchestrator manages the context mesh, MCP-based connectors, and enclave density. Around them, specialized agents handle contract analysis, sanctions screening, exception routing, and external API liaison. This is cognitive federation. Humans move up-stack into judgment and intent, while agents handle high-volume reasoning and cross-departmental coordination.</p>



<p class="wp-block-paragraph">Leaders rewarded for headcount and budget will resist decomposing their empires even as enclave quality and throughput improve. Executive scorecards must include agentic KPIs: enclave maturity, agentic throughput, risk-adjusted outcomes, and RPP. The mirror needs an explicit owner spanning risk, compliance, and engineering. Without decision provenance, you get the worst of both worlds: expensive models and humans still quietly doing the real work in spreadsheets.</p>



<p class="wp-block-paragraph">When you tell a senior vice president that their value is no longer tied to a 500-person headcount but to the knowledge readiness and agentic throughput of their domain, they will fight. The resistance isn’t just economic; it’s psychological. Headcount has been a proxy for power and identity. In the new world, it often becomes a proxy for architectural debt.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Client: &#8220;Can&#8217;t we just put a human in the loop but set the default to &#8216;Accept&#8217;?&#8221;</p>



<p class="wp-block-paragraph">Me: &#8220;That&#8217;s not human-in-the-loop. That&#8217;s human-as-rubberstamp. You&#8217;re just automating the blame.&#8221;</p>
</blockquote>



<p class="wp-block-paragraph">The reframing that works is not &#8220;we are shrinking your kingdom&#8221; but &#8220;we are upgrading your leverage&#8221; from managing people (inherently high friction and limited scale) to designing intelligence (human-plus-agent systems that scale almost without bound).</p>



<h2 class="wp-block-heading">The leader of 2027: The playbook</h2>



<p class="wp-block-paragraph">The leader of 2027 thinks in flows instead of functions, enclaves and mirrors instead of departments and reports, and token costs and compliance risk instead of merely headcount and budget. Their signature move is converting headcount empires into high-density enclaves and high-throughput meshes under credible governance, then proving it on the P&amp;L with lower unit costs, faster cycle times, and a compliance posture auditors can navigate.</p>



<p class="wp-block-paragraph">For leaders mapping their 2026–2027 roadmaps, here are three hard pivots you need to make: First, stop hiring for capacity; build a better gym, not a bigger team. Second, audit your enclave’s knowledge readiness—if agents hallucinate, you have contextual debt, not a model problem; invest in governed sharded enclaves and mirrors your auditors can use. Finally, manage your token line as the new overhead expense; track cost per cognitive outcome rather than aggregate spend and monitor RPP as your headline leverage indicator.</p>



<p class="wp-block-paragraph">The goal is not to build an AI that works for you. The goal is to build an enterprise that thinks with you.</p>



<p class="wp-block-paragraph">Gyms for them, mirrors for us, and a context mesh to hold the P&amp;L together—that is the architecture of a decentralized, high-alpha enterprise. Anything else is just an expensive way to stay in the 20th century.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-agentic-pl-beyond-the-empire-of-headcount/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Agent Stack Bet</title>
		<link>https://www.oreilly.com/radar/the-agent-stack-bet/</link>
				<comments>https://www.oreilly.com/radar/the-agent-stack-bet/#respond</comments>
				<pubDate>Wed, 20 May 2026 10:58:36 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18746</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-agent-stack-bet.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-agent-stack-bet-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[The bet every serious developer needs to make on their agent stack]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Elevate newsletter and is being reposted here with the author&#8217;s permission. Peek under the hood of most “production agents” shipping today and you won’t find intelligence. You’ll find custom plumbing, fragile session logic, shared service accounts, and a security model held together by hope. This can be so [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on the</em> <a href="https://addyo.substack.com/p/the-agent-stack-bet" target="_blank" rel="noreferrer noopener">Elevate</a> <em>newsletter and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">Peek under the hood of most “production agents” shipping today and you won’t find intelligence. You’ll find custom plumbing, fragile session logic, shared service accounts, and a security model held together by hope. This can be so much better.</p>



<p class="wp-block-paragraph">If you’ve spent the last 18 months putting agents into production, you already know the models and tools have gotten <em>dramatically</em> better. You also know the problems that are still burning your on-call rotation are not problems you can prompt your way out of. We are running into a <strong>stack ceiling</strong>, and it is quietly creating a <strong>governance</strong> and <strong>reliability gap</strong> that the next generation of agentic systems cannot grow through.</p>



<p class="wp-block-paragraph">Right now the industry is living with what I’d call <em>excessive agency</em>: <strong>autonomous systems given broad permissions to get things done</strong>, then left to discover—at runtime, in production—that a schema drifted, an API changed, or a downstream service started returning PII it wasn’t supposed to. Agents mark tasks “complete” while leaving a trail of corrupted state behind them. The humans find out on Monday.</p>



<p class="wp-block-paragraph">This is not a failure of the people building agents. It is a failure of the stack they’re building on.</p>



<p class="wp-block-paragraph">Here are the four architectural bets I think every serious team has to make in the next twelve months.</p>



<h2 class="wp-block-heading"><strong>1) Agents need identities, not shared credentials</strong></h2>



<p class="wp-block-paragraph">Every engineer who has shipped agents to production knows this specific flavor of dread: You have agents doing useful work, and effectively zero visibility into which tools they touched, which data they moved, or which credentials they used to do it. I call this <em>governance debt</em>—the silent accumulation of security and audit risk that eventually forces a full rewrite, usually right after the first incident that reaches the CISO.</p>



<p class="wp-block-paragraph">The root cause is that most agents today are ghosts. They don’t have identities. They borrow a service account, inherit a human’s OAuth token, and “promise”—in application code, in a prompt—to stay inside the lines. In a real enterprise environment, a promise in a prompt is not a policy.</p>



<p class="wp-block-paragraph"><strong>My bet is that agent identity has to move from the application layer down into the platform layer.</strong></p>



<p class="wp-block-paragraph">The difference is between bolted-on and embedded security. Bolted-on looks like middleware in front of every tool call, politely asking the agent to behave: easy to bypass, expensive in latency, and invisible to your existing IAM. Embedded looks like a badge reader welded into a steel frame. The agent has a distinct, unforgeable identity recognized at the network and platform level, and policy is enforced at the source. If the agent reaches for a database it isn’t cleared for, the connection never opens. No middleware, no vibes.</p>



<p class="wp-block-paragraph">Done right, this turns “a fleet of liabilities” into something that looks a lot more like a managed workforce: every action attributable, every permission auditable, every agent revocable with one call.</p>
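
<p class="wp-block-paragraph">As a rough sketch of what “attributable, auditable, revocable” could look like, here is a toy policy gate. Every name in it (<code>AgentIdentity</code>, <code>PolicyGate</code>, the resource strings) is hypothetical and stands in for whatever your platform and IAM actually provide; real enforcement would sit at the network or platform layer, not in application code like this.</p>

```python
from dataclasses import dataclass, field


@dataclass
class AgentIdentity:
    """Hypothetical per-agent identity: an ID plus the resources it may touch."""
    agent_id: str
    allowed_resources: frozenset
    revoked: bool = False


@dataclass
class PolicyGate:
    """Hypothetical gate: every access attempt is checked and logged."""
    identities: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, agent_id, resources):
        self.identities[agent_id] = AgentIdentity(agent_id, frozenset(resources))

    def revoke(self, agent_id):
        # One call cuts off every future access for this agent.
        self.identities[agent_id].revoked = True

    def connect(self, agent_id, resource):
        identity = self.identities.get(agent_id)
        allowed = (
            identity is not None
            and not identity.revoked
            and resource in identity.allowed_resources
        )
        # Log the attempt whether or not it succeeds, so actions stay attributable.
        self.audit_log.append((agent_id, resource, allowed))
        if not allowed:
            raise PermissionError(f"{agent_id} is not cleared for {resource}")
        return f"connection:{resource}"
```

<p class="wp-block-paragraph">The design point is that the deny path raises before any connection object exists, and the audit entry is written for refusals as well as successes.</p>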



<h2 class="wp-block-heading"><strong>2) Agents need universal context, not scraped windows</strong></h2>



<p class="wp-block-paragraph">Context management is a tax every builder is currently paying. Teams are burning a huge share of their engineering hours (and tokens) on undifferentiated plumbing—custom serialization, bespoke session stores, hand-rolled memory layers—just to keep an agent from forgetting its mission halfway through a multi-step task.</p>



<p class="wp-block-paragraph">Worse, the context agents <em>can</em> get their hands on is usually siloed. A browser-based agent can see the open tab. A desktop wrapper can see the files a user happened to drag in. Neither of them can easily reason across the systems where the business actually lives—the CRM, the ERP, the data warehouse, the ticketing system, the transcripts, the project plans—at the same time.</p>



<p class="wp-block-paragraph"><strong>Agents need universal context that integrates at the platform level.</strong> If we don’t fix this, we should be honest that the ceiling of agentic AI is “slightly better spreadsheet autocomplete,” and we should stop writing vision pieces about it.</p>



<h2 class="wp-block-heading"><strong>3) Agents need to survive your laptop closing</strong></h2>



<p class="wp-block-paragraph">Here’s the uncomfortable version of this: A lot of what ships today as “an agent” isn’t yet ready to deploy across a business.</p>



<p class="wp-block-paragraph">I want to be precise, because the frontier has genuinely moved in the last six months. Environments like Claude Code, OpenClaw, and similar platforms are capable—persistent task state, scheduled execution, multi-agent coordination, and long-running sessions that survive disconnects are no longer aspirational. These are not toys. The question has moved on.</p>



<p class="wp-block-paragraph">The question now is whether an agent can run for a week instead of an hour. Whether it can cross three handoffs, two credential rotations, and an approval gate without a human babysitting the session. Whether the work it did on Tuesday is auditable on Friday by someone who wasn’t in the room. A session that survives a dropped WebSocket is table stakes. A mission that survives a quarter is the bar enterprises actually need.</p>



<p class="wp-block-paragraph">Real work doesn’t fit in a session, and most of it doesn’t fit in a day either. A procurement workflow spans weeks and a dozen handoffs. A compliance audit runs for a month. An incident investigation outlives three on-call rotations.</p>



<p class="wp-block-paragraph"><strong>Most agents today hit a hard ceiling—sometimes time-based, sometimes token-based, sometimes governance-based—and when they hit it, the mission fails and a human picks up the pieces from wherever the transcript ended.</strong></p>



<p class="wp-block-paragraph">Enterprise-grade autonomy requires durable, cloud-native execution with a much higher floor than “the session stayed up.” Concretely, that means:</p>



<ul class="wp-block-list">
<li><strong>State</strong> and <strong>checkpointing</strong> that survives restarts, disconnects, redeploys, and model version changes by default—not bolted on with a local Redis and a prayer.</li>



<li><strong>Context that outlives the window</strong>: long-horizon memory, summarization, and handoff between agent instances, so a multi-week task doesn’t die because a single run exhausted its tokens.</li>



<li><strong>Missions that outlive sessions</strong>: agents that stay on the job across days, handoffs, and credential rotations, with an auditable trail of what happened while you were asleep.</li>



<li><strong>First-class human-in-the-loop primitives,</strong> so the agent can pause and ask for permission to do something new instead of silently deciding it has the authority.</li>
</ul>
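
<p class="wp-block-paragraph">The checkpointing and human-in-the-loop items above can be sketched in a few lines. This is a minimal illustration, not a production design: the <code>run_mission</code> function, the <code>(name, needs_approval)</code> step shape, and the JSON checkpoint file are all assumptions made for the example, and a real platform would use durable, shared storage instead of a local file.</p>

```python
import json
from pathlib import Path


def run_mission(steps, checkpoint_path, approve):
    """Run steps in order, persisting progress after each one.

    steps: list of (name, needs_approval) tuples (hypothetical shape).
    approve: callable(name) returning a bool; stands in for a human approval gate.
    Returns the list of completed step names.
    """
    path = Path(checkpoint_path)
    # Resume from the last checkpoint if one exists.
    done = json.loads(path.read_text()) if path.exists() else []
    for name, needs_approval in steps:
        if name in done:
            continue  # already finished in an earlier run
        if needs_approval and not approve(name):
            break  # pause here; a later run picks up at this step
        done.append(name)
        path.write_text(json.dumps(done))  # checkpoint after every step
    return done
```

<p class="wp-block-paragraph">Calling <code>run_mission</code> a second time with the same checkpoint path skips the completed steps and retries the approval gate, which is the “mission outlives the session” behavior in miniature.</p>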



<p class="wp-block-paragraph">Persistence with guardrails. That’s the bar. Anything less and you’re building demos that happen to run for a long time.</p>



<h2 class="wp-block-heading"><strong>4) Agents need platforms</strong></h2>



<p class="wp-block-paragraph">The pattern I see most often in strong teams is the saddest one: brilliant engineers draining their bandwidth into stack problems that do not differentiate their product. Custom memory. Bespoke eval harnesses. Homegrown observability. Handwritten retry logic. A tracing system that almost works. None of this is the hard part of the agentic era, and none of it is what your users are paying you for.</p>



<p class="wp-block-paragraph">The real value lives in domain reasoning and business logic—the judgment calls that are specific to your company, your customers, your regulatory environment. Everything underneath should be the platform you <em>build on</em>, not the plumbing you <em>build</em>.</p>



<p class="wp-block-paragraph">This is why the maturation of open primitives matters right now. Open-source orchestration frameworks exist precisely so the scaffolding isn’t locked behind any single vendor’s roadmap. The model that worked for cloud compute, containers, and CI/CD—start local on open primitives, graduate to a managed platform when you’re ready to scale—is the model agent platforms need to copy.</p>



<p class="wp-block-paragraph"><strong>Teams should be able to prototype on their laptop with the same building blocks they’ll run in production, and cross that boundary without a rewrite.</strong></p>



<p class="wp-block-paragraph">That’s the engineering standard that lets teams stop fighting plumbing and get back to the product.</p>



<h2 class="wp-block-heading"><strong>The five-year horizon</strong></h2>



<p class="wp-block-paragraph">The teams that pull ahead in the next five years will not pull ahead by being smarter at writing boilerplate. They’ll pull ahead by <strong>choosing the right agent foundation</strong> and spending their engineering hours on the problems <em><strong>only they can solve</strong></em>.</p>



<p class="wp-block-paragraph">Every month spent rebuilding the common stack—identity, context, persistence, orchestration—is a month not spent on the logic that actually makes your agents worth deploying.</p>



<p class="wp-block-paragraph"><strong>The agent stack has to become a solved problem.</strong> The only real question is whether you want to solve it yourself, again, or build on a foundation that was engineered for agents from the ground up.</p>



<p class="wp-block-paragraph">My bet is on the latter. I think yours should be too.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-agent-stack-bet/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>When an Agent Deletes the Production Database</title>
		<link>https://www.oreilly.com/radar/when-an-agent-deletes-the-production-database/</link>
				<comments>https://www.oreilly.com/radar/when-an-agent-deletes-the-production-database/#respond</comments>
				<pubDate>Tue, 19 May 2026 16:00:39 +0000</pubDate>
					<dc:creator><![CDATA[Sam Newman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18743</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/When-an-agent-deletes-the-production-database.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/When-an-agent-deletes-the-production-database-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Revisiting the PocketOS Incident]]></custom:subtitle>
		
				<description><![CDATA[Another day, another example of an AI Agent &#8220;running rogue&#8221; and doing something the human operator didn&#8217;t want it to do. The tl;dr is that Jeremy (Jer) Crane, founder of PocketOS, was using Claude to perform some routine DB maintenance. Claude then proceeded to delete the production database and all backups hosted at their cloud [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Another day, another example of an AI Agent &#8220;running rogue&#8221; and doing something the <a href="https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/" target="_blank" rel="noreferrer noopener">human operator didn&#8217;t want it to do</a>. The tl;dr is that Jeremy (Jer) Crane, founder of PocketOS, was using Claude to perform some routine DB maintenance. Claude then proceeded to delete the production database and all backups hosted at their cloud provider, Railway. To their credit Railway managed to recover the lost data. The initial deletion took less than 10 seconds; I&#8217;m sure the recovery took much longer. Let’s look at what we can learn from what happened, and why AI is really just an amplifier of existing issues, rather than the cause itself.</p>



<p class="wp-block-paragraph">We know about the incident because Jer <a href="https://x.com/lifeof_jer/status/2048103471019434248?s=20" target="_blank" rel="noreferrer noopener">wrote</a> about it after it happened. First, taking time to reflect after something goes wrong is important; it&#8217;s how we learn. Sharing your mistakes with the world can be difficult, but it creates chances for us all to learn from each other. Second, I&#8217;ve seen a lot of people publicly dunking on both PocketOS and Railway. I would guess that none of those people have ever experienced the sheer terror and panic that happens during an incident like this. The feeling that you just want the ground to open and swallow you whole. It&#8217;s a feeling I&#8217;ve only experienced once or twice before, and it&#8217;s not an experience I&#8217;m keen to repeat.</p>



<p class="wp-block-paragraph">One point in Railway’s favor is that they got PocketOS’s data back. If you called for a deletion via the APIs on AWS, Azure, Google Cloud, or whatever, using a valid credential, that data is gone, unless you have your own backups of course. AWS et al. aren’t maintaining backups of customer data to hedge against customer mistakes. This is your yearly reminder to <a href="https://www.backblaze.com/blog/the-3-2-1-backup-strategy/" target="_blank" rel="noreferrer noopener">look into the 3-2-1 backup strategy</a>.</p>
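
<p class="wp-block-paragraph">For reference, the 3-2-1 rule asks for at least three copies of your data, on at least two different kinds of storage, with at least one copy offsite. A small check makes the rule concrete; the function name and the <code>(media_type, is_offsite)</code> tuple shape are illustrative assumptions, not part of any tool.</p>

```python
def meets_3_2_1(copies):
    """Check a backup layout against the 3-2-1 rule.

    copies: list of (media_type, is_offsite) tuples, one per copy of the data,
    including the primary. The tuple shape is a hypothetical convention.
    """
    media_types = {media for media, _ in copies}
    has_offsite = any(offsite for _, offsite in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite
```

<p class="wp-block-paragraph">A database and its backups living on the same volume count as two copies on one medium with nothing offsite, so the check fails on every count.</p>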



<p class="wp-block-paragraph">What can we learn from what happened? Well, for all the discussion around how this is AI&#8217;s fault, what we have here is a much simpler example of common system weaknesses being exploited both accidentally and at speed.</p>



<h2 class="wp-block-heading">What Did Claude Do?</h2>



<p class="wp-block-paragraph">Claude had been asked to carry out a task against PocketOS&#8217;s staging environment. The agent hit an issue, searched out and found a long-lived API token which gave access to production, and then proceeded to delete the production volume that contained both the production databases and the backups.</p>



<p class="wp-block-paragraph">When asked what had happened, Claude’s reaction was objectively funny. It seemed to be totally aware of what went wrong, and what it should have done instead. This implies a level of reasoning that was not evident during the actual operation itself. I do wonder if recent attempts to reduce how much reasoning Claude does in certain modes (to cut token use and, with it, Anthropic’s operating costs) might partly be to blame.</p>



<p class="wp-block-paragraph">Breaking it all down, there seem to be a couple of fairly straightforward issues at play that at first glance have very little to do with AI itself.</p>



<p class="wp-block-paragraph">The token Claude had access to gave overly broad access. It&#8217;s common for cloud-based infrastructure providers like AWS or Azure to allow you to create tokens that are limited in what they do. This helps implement the <em>principle of least privilege</em>. The idea is that an actor in a system should be given access to what they need, and no more. The principle of least privilege reduces the impact if an inappropriate party gains access to the actor’s credentials, or if the actor themself goes rogue. Consider what happens if someone steals your hotel room key. They can get into your hotel room, which isn&#8217;t great, but they can&#8217;t get into anyone else&#8217;s. It seems that Railway&#8217;s auth tokens cannot be limited in scope.</p>



<p class="wp-block-paragraph">The second problem was that the credentials were stored on disk and had not expired. This makes the impact of the broadly scoped auth token much worse. Credentials should be time limited, so that if they are found later they cannot be used. If tokens are generated on demand, which could have been done in this specific case, then this particular issue could have been mitigated. Claude would have had to ask for a human to provide a credential—at which point, hopefully, the operator would have had a chance to work out what was going on.</p>
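
<p class="wp-block-paragraph">To make the two ideas concrete, here is a minimal sketch of a token that is both narrowly scoped and short-lived. It does not model Railway&#8217;s API or any real provider&#8217;s; <code>ScopedToken</code>, <code>issue_token</code>, <code>is_allowed</code>, and the action names are all hypothetical.</p>

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedToken:
    """Hypothetical token limited to one environment, a set of actions, and a lifetime."""
    environment: str
    actions: frozenset
    expires_at: float


def issue_token(environment, actions, ttl_seconds, now=None):
    # Tokens are minted on demand with a short time-to-live.
    now = time.time() if now is None else now
    return ScopedToken(environment, frozenset(actions), now + ttl_seconds)


def is_allowed(token, environment, action, now=None):
    now = time.time() if now is None else now
    return (
        token.expires_at > now                 # not expired
        and token.environment == environment   # right environment only
        and action in token.actions            # least privilege: listed actions only
    )
```

<p class="wp-block-paragraph">With a token issued for staging and a 15-minute lifetime, a request against production fails on scope, and the same token found on disk a day later fails on expiry.</p>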



<p class="wp-block-paragraph">I take minor issue with Jer&#8217;s assertion that Railway&#8217;s GraphQL API should have required a confirmation before deletion. This, to me, is a fundamental misunderstanding of what cloud APIs are for. APIs are there for automation; if you want a human-in-the-loop confirmation model, you have to build that yourself. This has always been the case. However, in the aftermath of an incident like this, we should give Jer a lot of leeway around his view of the problems, and some of his requests for how Railway should change appear to be very sensible (e.g., clearer SLAs and tokens that are easier to scope).</p>



<h2 class="wp-block-heading">How Could These Issues Be Mitigated?</h2>



<p class="wp-block-paragraph">One obvious takeaway is to ensure that access tokens are more aggressively expired, but also made more limited in scope. This reduces the chance of Claude accessing something it shouldn’t. This would need to be solved on the Railway side, as they generate the token in the first place.</p>



<p class="wp-block-paragraph">Unfortunately, having a more limited token for Claude isn’t a total fix for this scenario. Claude was given a token that limited its behavior, and went looking for a better token—and found it. This is not the first time I’ve heard of this happening; the same thing happened to a client of mine recently.</p>



<p class="wp-block-paragraph">As our agents become more sophisticated, it seems that some sort of sandboxing is key. The production token was viewable by Claude, so it was used. Running agents in a restricted sandbox where they are only able to see parts of your filesystem would help greatly. However, that also limits their usefulness.</p>
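
<p class="wp-block-paragraph">The visibility rule behind a filesystem sandbox is simple to state in code. The sketch below only illustrates that rule; real sandboxing is enforced by the operating system or a container, not by a Python helper, and <code>is_visible</code> is a hypothetical name.</p>

```python
from pathlib import Path


def is_visible(requested, sandbox_root):
    """Return True only if `requested` resolves to a location inside `sandbox_root`."""
    root = Path(sandbox_root).resolve()
    target = (root / requested).resolve()
    # resolve() collapses ".." segments and follows symlinks, so traversal
    # attempts and absolute paths end up outside the root and are rejected.
    return target == root or root in target.parents
```

<p class="wp-block-paragraph">A credentials file elsewhere in a home directory would fail this check, which is the point: a token the agent cannot see is a token it cannot use.</p>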



<p class="wp-block-paragraph">Another option would be for the agent to ask for confirmation before it does something like delete data. It seems conceivable that a human-in-the-loop model for when the agent has to escalate privileges could help. But again, if it gets hold of an access token with broad scope, it won’t need to ask a human.</p>
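
<p class="wp-block-paragraph">A confirmation gate of this kind is small to build yourself. The following is a sketch under stated assumptions: the action names in <code>DESTRUCTIVE_ACTIONS</code> are made up, and <code>confirm</code> and <code>run</code> are placeholder callables for a human prompt and the real operation.</p>

```python
# Illustrative action names; a real list depends on your provider's API.
DESTRUCTIVE_ACTIONS = frozenset({"delete_volume", "drop_database"})


def execute(action, target, confirm, run):
    """Run `action` on `target`, pausing for a human on destructive actions.

    confirm: callable(action, target) returning a bool; stands in for a human prompt.
    run: callable(action, target) that performs the actual work.
    """
    if action in DESTRUCTIVE_ACTIONS and not confirm(action, target):
        return "blocked"
    run(action, target)
    return "done"
```

<p class="wp-block-paragraph">The gate only protects calls that go through it. An agent holding a broadly scoped token can call the provider directly and skip the check, so this complements scoped credentials and sandboxing and does not replace them.</p>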



<p class="wp-block-paragraph">Finally, I’ve seen a lot of discussion about how the agent should “know” that deleting the data was bad, and that it should have checked first. This is a fundamental limitation of an LLM-based agent. It has no concept of causality. It cannot predict what will happen. There is a field of AI study known as <a href="https://en.wikipedia.org/wiki/World_model_(artificial_intelligence)" target="_blank" rel="noreferrer noopener">world models</a>, which could allow these agents to make more informed decisions. For example, a world model that understands physics would be able to predict that an egg would likely break if it were pushed from a table onto the concrete floor below. World models are used a lot in video generation and autonomous driving (where prediction of motion is key), but are sparsely used elsewhere.</p>



<h2 class="wp-block-heading">AI Not To Blame?</h2>



<p class="wp-block-paragraph">I said just a moment ago that these issues seem to have little to do with AI. That isn&#8217;t entirely true.</p>



<p class="wp-block-paragraph">In the recent DORA report on the state of <a href="https://dora.dev/research/2025/dora-report/" target="_blank" rel="noreferrer noopener">AI-assisted Software Development</a>, the authors noted that AI seems to be an amplifier: that AI-assisted software development tends to help good teams go faster, and slow teams go slower. Bad practices get encoded and done more. In the PocketOS and Railway situation, we have credentials that were overly broad, long-lived, and stored on disk, combined with an apologetic AI agent doing something other than what was expected of it. If a human had made the same mistakes, they would have made them much more slowly, and may well have had the chance to work out their mistake partway through. AI works so fast that it can go more quickly in the wrong direction.</p>



<p class="wp-block-paragraph">More importantly, unlike LLM-based AI, a human being has the chance to learn from experience, and for that learning to be rooted in a very specific, emotional response. When I first heard about the PocketOS story, I was brought back to a dim echo of that same horrific feeling I had in the midst of a major production issue that I had contributed to. Those feelings don&#8217;t leave you—those lessons don&#8217;t leave you. Every time I touched a production system, those memories were with me, and helped guide me towards more sensible working practices.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/when-an-agent-deletes-the-production-database/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>AI Artifact Catalogs: Durable Standards Worth Institutional Investment</title>
		<link>https://www.oreilly.com/radar/ai-artifact-catalogs-durable-standards-worth-institutional-investment/</link>
				<comments>https://www.oreilly.com/radar/ai-artifact-catalogs-durable-standards-worth-institutional-investment/#respond</comments>
				<pubDate>Tue, 19 May 2026 11:05:38 +0000</pubDate>
					<dc:creator><![CDATA[Tadas Antanavicius]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18737</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/AI-artifact-catalogs.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/AI-artifact-catalogs-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Companies everywhere are trying to leverage AI to boost internal productivity metrics. Some, like Ramp and Intercom, are succeeding. Many are failing. To make matters more complicated, the narrative around what tooling enables these gains is constantly shifting. For software engineers, auto-complete via GitHub Copilot was the bleeding-edge tool of choice in 2024. Then it [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Companies everywhere are trying to leverage AI to boost internal productivity metrics. Some, like <a href="https://x.com/geoffintech/status/2042002590758572377?s=20" target="_blank" rel="noreferrer noopener">Ramp</a> and <a href="https://www.linkedin.com/posts/destraynor_in-intercom-we-literally-doubled-productivity-activity-7450589093400469504-wbHH/" target="_blank" rel="noreferrer noopener">Intercom</a>, are succeeding. <a href="https://www.pwc.com/gx/en/news-room/press-releases/2026/pwc-2026-ai-performance-study.html" target="_blank" rel="noreferrer noopener">Many are failing</a>.</p>



<p class="wp-block-paragraph">To make matters more complicated, the narrative around what tooling enables these gains is constantly shifting. For software engineers, auto-complete via <a href="https://github.com/features/copilot" target="_blank" rel="noreferrer noopener">GitHub Copilot</a> was the bleeding-edge tool of choice in 2024. Then it was Cursor for <a href="https://ramp.com/vendors/cursor" target="_blank" rel="noreferrer noopener">much of 2025</a>. 2026 has been dominated by command-line-based coding agents like <a href="https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation" target="_blank" rel="noreferrer noopener">Claude Code</a> and <a href="https://fortune.com/2026/03/04/openai-codex-growth-enterprise-ai-agents/" target="_blank" rel="noreferrer noopener">Codex</a>.</p>



<p class="wp-block-paragraph">While the winds of the tooling layer keep shifting, many of these tools have come to share a number of common primitives: <strong>open standards that help configure and guide these tools’ capabilities</strong>.</p>



<p class="wp-block-paragraph"><a href="https://agentskills.io/" target="_blank" rel="noreferrer noopener">Agent Skills</a>. <a href="https://modelcontextprotocol.io/" target="_blank" rel="noreferrer noopener">MCP</a>. <a href="https://open-plugins.com/" target="_blank" rel="noreferrer noopener">Plugins</a>. These all present vendor-agnostic mechanisms by which we can configure the tools today. The catch: These mechanisms aren’t one-size-fits-all. How you can connect to an MCP server depends on your organization’s security posture. An Agent Skill crafted specifically for one team’s design system does not copy-paste well into that of another team.</p>



<p class="wp-block-paragraph">As individuals within organizations begin to configure (and sometimes build from scratch) the skills and MCP servers that unlock real productivity gains, the next unlock is to translate those wins into shareable, reusable institutional knowledge. <strong>AI artifact catalogs</strong> are the output of this step. They capture the useful bits of <em>internal</em> knowledge and glue behind much of what employees are doing manually today, and put them to work empowering both:</p>



<ul class="wp-block-list">
<li><strong>Their peers</strong>. By sharing these artifacts within or across teams, productivity gains are shared across the organization, not in individual silos.</li>



<li><strong>And their agents</strong>. Equipping agent runtimes like Claude Code or Codex with hard-won, domain-specific guidance means employees can spend more time building agentic systems and less time toiling on repeatable labor.</li>
</ul>



<h2 class="wp-block-heading">The durability of open standards</h2>



<p class="wp-block-paragraph">There is an ongoing industry-wide rush to buy AI-powered solutions in the hopes that a vendor can unlock these sought-after productivity gains. <a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/" target="_blank" rel="noreferrer noopener">95% of those pilot projects are failing</a>.</p>



<p class="wp-block-paragraph">Of course, there is a spectrum of risk when buying solutions like this from a vendor. If you go all-in on Anthropic’s tooling—like <a href="https://ideas.fin.ai/p/we-gave-claude-code-to-everyone-at" target="_blank" rel="noreferrer noopener">Intercom did with Claude Code</a>—and Anthropic continues to be an industry leader, things will go well. Make the same decision with a startup’s offering that fails to get broad industry adoption, and you’re stuck with a proprietary data model that operates in a dead-end silo you have to rebuild from scratch in a year.</p>



<p class="wp-block-paragraph">There’s another path: that of committing to open standards. If you invest in Agent Skills, in MCP, in plugins, not only will you be protected against a single vendor going belly-up, but you won’t even miss a beat when the leading coding agent that all your engineers demand next quarter changes, again. Switching costs drop to a fraction of what they’d be with a proprietary stack.</p>



<p class="wp-block-paragraph">There’s no doubt that AI capabilities are evolving at a breakneck pace. It’s hard to predict what innovations the next cycle will bring. But what’s unique about these vendor-agnostic standardized primitives is that they are concepts that innovation can build upon rather than replace. We’re all still building on top of HTTP, which forms the fabric of the web. QWERTY keyboards are arguably inferior to Dvorak keyboards, and yet the standard prevails to this day. JavaScript is a much-maligned language, yet it underpins practically the entire frontend of the internet.</p>



<p class="wp-block-paragraph">As AI rapidly reduces the cost of building, the cost of coordination among people and among entities remains high. Standards remain scarce and valuable.</p>



<h2 class="wp-block-heading">AI artifacts and their relative maturity</h2>



<p class="wp-block-paragraph">The most important aspect of any standard is its level of adoption. It’s clear that the leading tooling empowering internal AI transformation is coalescing around coding agent tools like Claude Code and Codex, less-technical tooling like Claude Cowork, and rich agent SDKs like those from Anthropic or OpenAI.</p>



<p class="wp-block-paragraph">Taking the compatibility of leading tools in those categories as indicators of standard adoption, here’s where I think the landscape of AI artifacts currently nets out:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Standard</strong></td><td><strong>Artifact</strong></td><td><strong>Status</strong></td><td><strong>Adoption</strong></td></tr><tr><td>Agent Skills</td><td><a href="https://agentskills.io/" target="_blank" rel="noreferrer noopener">Skill</a></td><td>Vendor-agnostic standard</td><td>Highest</td></tr><tr><td>MCP servers</td><td><a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2633" target="_blank" rel="noreferrer noopener">mcp.json</a> and <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2127" target="_blank" rel="noreferrer noopener">Server Card</a></td><td>Vendor-agnostic standard</td><td>Highest</td></tr><tr><td>Plugins</td><td><a href="https://open-plugins.com/" target="_blank" rel="noreferrer noopener">Plugin</a></td><td>Vendor-agnostic standard</td><td>High</td></tr><tr><td>Command line interface (CLI) tools</td><td>Custom</td><td>Unstandardized</td><td>High</td></tr><tr><td>Hooks</td><td><a href="https://open-plugins.com/agent-builders/components/hooks" target="_blank" rel="noreferrer noopener">Hook</a></td><td>Derivative standard (Open Plugins)</td><td>Medium</td></tr><tr><td>Roots</td><td>Git repositories</td><td>Derivative standard (<a href="https://agents.md" target="_blank" rel="noreferrer noopener">AGENTS.md</a>)</td><td>Medium</td></tr><tr><td>Rules</td><td><a href="https://open-plugins.com/agent-builders/components/rules" target="_blank" rel="noreferrer noopener">Rule</a></td><td>Derivative standard (Open Plugins)</td><td>Medium</td></tr></tbody></table><figcaption class="wp-element-caption"><em>Tool compatibility considered in “adoption” as of April 2026</em><strong><em>: </em></strong><em>Claude Code, Cowork, Codex, Cursor, GitHub Copilot, Gemini CLI, Pi, OpenCode, Amp, Claude Agents SDK, OpenAI Agents SDK</em></figcaption></figure>



<p class="wp-block-paragraph">A minimalist catalog stored as a Git repository for a team might start off looking something like this:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="626" height="252" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-6.png" alt="A minimalist catalog stored as a Git repository" class="wp-image-18738" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-6.png 626w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-6-300x121.png 300w" sizes="auto, (max-width: 626px) 100vw, 626px" /></figure>



<p class="wp-block-paragraph">I work with software engineering teams early in their AI adoption journey, where they might have a few individual tinkerers leaning heavily into AI but haven’t yet figured out how to propagate adoption more widely. Out of the gate, my conversations with teams tend to run the gamut of disparate tool preferences, unique workflows, disjoint architectures, and other one-off quirks. A big unlock for moving these organizations forward is to introduce shared language. Shared language grounds conversations. It puts teams working on different AI-related initiatives on a path to smooth integration with each other. People get excited about how puzzle pieces might fit together.</p>



<p class="wp-block-paragraph">Let’s review these artifacts in more detail.</p>



<h3 class="wp-block-heading">Skills: The lifeblood of most institutional knowledge</h3>



<p class="wp-block-paragraph">As Tim O’Reilly <a href="https://www.oreilly.com/radar/betting-against-the-bitter-lesson/" target="_blank" rel="noreferrer noopener">wrote a few months ago</a>, a skill can be “the integration of expert workflow logic that orchestrates when and how to use each tool, informed by domain knowledge that gives the LLM the judgment to make good decisions in context.”</p>



<p class="wp-block-paragraph">This is not the only “type” of skill out there. Skills serve a range of purposes; to name a few:</p>



<ul class="wp-block-list">
<li>Encoding of internal, expert orchestration knowledge (as in the above)</li>



<li>Guidance on using otherwise deterministic tools (such as MCP servers or CLI tools)</li>



<li>Context management tricks that have broad appeal (to make up for LLM capability limitations)</li>
</ul>



<p class="wp-block-paragraph">But the first—the encoding of expert knowledge—is very much the most valuable and irreplaceable. Chances are, what an organization might capture in that variant of skill is knowledge not otherwise documented. It lives as tacit knowledge among your employees or is scattered across many systems so as to make any associated work a multistep journey.</p>



<p class="wp-block-paragraph">The implication: Any skill you can download from the public internet is probably not nearly as valuable as an internal skill crafted by an employee. The latter skill is aware of your business context, the opinionated systems in play, and maybe encodes unique expertise hard-won over years of tenure. And most importantly: That level of insight is not making it into a model training run any time soon. Nor is it likely to be relevant to just about anyone outside of your own company. The same can’t be said for the latest skill repository on GitHub that acquires 10,000 stars. If that public skill is any good, the generic concepts will find their way into natural model and harness capabilities before long, eliminating the need for that class of skill.</p>



<p class="wp-block-paragraph">Skills are <a href="https://agentskills.io/clients" target="_blank" rel="noreferrer noopener">extremely well adopted</a>: Every major coding agent supports them, and uncontroversially so.</p>



<h3 class="wp-block-heading">MCP and CLI tools: The connectivity layer to external systems</h3>



<p class="wp-block-paragraph">Most agents don’t operate in a vacuum: Interaction with external systems is how we compose AI. One agent can talk to another agent, or just some separate deterministic system, by way of MCP or a CLI tool.</p>



<p class="wp-block-paragraph">The <a href="https://claude.com/blog/building-agents-that-reach-production-systems-with-mcp" target="_blank" rel="noreferrer noopener">MCP versus CLI</a> debate is well-documented, so we won’t rehash it here. Regardless of which of the two you implement (and perhaps you use both for different use cases), the point is that MCP/CLI is responsible for poking a hole into what is otherwise a local-only sandboxed environment for your agent.</p>



<p class="wp-block-paragraph">This is the layer that juggles authentication—facilitating OAuth, injecting any relevant secrets—and exposes some well-defined surface area for what your agent could possibly do in communication with that external system (e.g., MCP tool definitions or CLI command options).</p>



<p class="wp-block-paragraph">For MCP, you have well-established conventions and standards in the form of <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2127" target="_blank" rel="noreferrer noopener">Server Cards</a> and <a href="https://github.com/modelcontextprotocol/registry/blob/main/docs/reference/server-json/generic-server-json.md" target="_blank" rel="noreferrer noopener">server.json</a> files—to declare all the <em>possible</em> configurations of an MCP server—and also an upcoming standard called <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2633" target="_blank" rel="noreferrer noopener">mcp.json</a> to declare <em>specific</em> configurations of an MCP server (inspired by, among others, files like <a href="https://code.claude.com/docs/en/mcp#project-scope" target="_blank" rel="noreferrer noopener">.mcp.json from Claude Code</a>).</p>



<p class="wp-block-paragraph">For CLI, cataloging a tool means rolling your own catalog format: probably covering metadata like “how to install this,” “what auth mechanisms does it support,” “where to store secrets,” and related concerns that are explicitly or implicitly captured in analogous mcp.json files.</p>
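
<p class="wp-block-paragraph">As a minimal sketch of what one entry in such a homegrown format could capture: The field names, the validation rules, and the acme-deploy tool below are all hypothetical, chosen only to mirror the metadata listed above, and are not part of any standard.</p>

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class CliToolEntry:
    """One hypothetical catalog record for a CLI tool an agent may call."""
    name: str
    install: str                  # "how to install this"
    auth_mechanisms: list[str]    # "what auth mechanisms does it support"
    secrets_location: str         # "where to store secrets"
    allowed_commands: list[str] = field(default_factory=list)


def validate(entry: CliToolEntry) -> list[str]:
    """Return human-readable problems; an empty list means the entry is usable."""
    problems = []
    if not entry.install.strip():
        problems.append("missing install instructions")
    if not entry.auth_mechanisms:
        problems.append("no auth mechanism declared")
    if not entry.secrets_location.strip():
        problems.append("no secrets location declared")
    return problems


entry = CliToolEntry(
    name="acme-deploy",  # hypothetical internal tool
    install="brew install acme/tap/acme-deploy",
    auth_mechanisms=["oauth-device-flow"],
    secrets_location="env:ACME_DEPLOY_TOKEN",
    allowed_commands=["status", "rollback"],
)
print(json.dumps(asdict(entry), indent=2))
```

<p class="wp-block-paragraph">Keeping the entry as plain data means it can be serialized into the same Git repository as the rest of the catalog and checked in review like any other artifact.</p>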



<p class="wp-block-paragraph">MCP is very well adopted and natively compatible with most agent frameworks. CLI works anywhere the agent comes with bash capabilities, but it can be fairly limited in a sandboxed environment and doesn’t offer the sort of configurability that MCP does.</p>



<h3 class="wp-block-heading">Hooks: Inject capabilities at deterministic trigger points</h3>



<p class="wp-block-paragraph">Hooks are handy to inject sprinkles of determinism in an otherwise nondeterministic agentic session. Some effective uses I’ve seen: injecting a session transcript capture step for future review or capturing analytics on what skills are being invoked within a team.</p>
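
<p class="wp-block-paragraph">The analytics use case can be sketched in a few lines. This assumes the agent hands each hook a JSON event; the event shape and the skill field name are hypothetical, since real payloads differ from agent to agent.</p>

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def record_event(raw_event: str, log_path: Path) -> dict:
    """Append one hook event to a JSON Lines log and return the parsed event."""
    event = json.loads(raw_event)
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(event, sort_keys=True) + "\n")
    return event


def skill_usage(log_path: Path) -> Counter:
    """Count how often each skill name appears in the log."""
    counts = Counter()
    for line in log_path.read_text(encoding="utf-8").splitlines():
        event = json.loads(line)
        # "skill" is a hypothetical field name; real payloads differ by agent.
        if "skill" in event:
            counts[event["skill"]] += 1
    return counts


# Simulate three hook invocations instead of wiring up a real agent.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "hook-events.jsonl"
    for name in ["pr-review", "deploy-checklist", "pr-review"]:
        record_event(json.dumps({"event": "skill_invoked", "skill": name}), log)
    usage = skill_usage(log)
print(usage.most_common())
```

<p class="wp-block-paragraph">Because the hook itself is deterministic, the log is complete regardless of what the model decides to do during the session.</p>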



<p class="wp-block-paragraph">Hooks don’t have their own standard but are <a href="https://open-plugins.com/agent-builders/components/hooks" target="_blank" rel="noreferrer noopener">baked into the upcoming Open Plugins standard</a>. The concept is supported by most major coding agents, although implementations have some variance.</p>



<h3 class="wp-block-heading">Rules: Injected blurbs of context</h3>



<p class="wp-block-paragraph">Originally <a href="https://cursor.com/docs/rules" target="_blank" rel="noreferrer noopener">popularized by Cursor</a>, rules allow for injecting blurbs of context in largely deterministic, but sometimes nondeterministic, fashion.</p>



<p class="wp-block-paragraph">Functionally, many rules could be modeled as skills and AGENTS.md files. Given the popularity of those alternatives, it’s unclear whether rules will remain relevant in the long run.</p>



<h3 class="wp-block-heading">Roots: An agent’s starting point</h3>



<p class="wp-block-paragraph">Most agents “start” inside a particular location in a filesystem: a “root.” For coding agents, this means some folder within a Git repository. In some agents, such as Claude Cowork, this is equivalent to the notion of a “project.”</p>



<p class="wp-block-paragraph">While not directly standardized, the notion of a root is implicit in the AGENTS.md standard, which assumes the presence of a filesystem hosting static context for the agent to operate on.</p>



<h3 class="wp-block-heading">Plugins: Bundles to bring it all together</h3>



<p class="wp-block-paragraph">Plugins are somewhat unique in the above list. Conceptually, they are a <em>bundle</em> of several of the other artifacts. A plugin can be thought of as a composition of skills, rules, hooks, MCP servers, and some other components. The up-and-coming <a href="https://open-plugins.com/" target="_blank" rel="noreferrer noopener">Open Plugins</a> initiative spearheaded by Vercel is working to finalize what this specification looks like.</p>



<p class="wp-block-paragraph">Plugins serve a natural purpose. Any team leaning into building skills and MCP servers will quickly get to a point where several of them combine to form a practical grouping of guidance and capabilities. Claude Code’s implementation of <a href="https://code.claude.com/docs/en/plugin-marketplaces" target="_blank" rel="noreferrer noopener">plugin marketplaces</a> is becoming a de facto distribution mechanism for plugins. It’s very much an option to catalog individual artifacts and then use a mechanism like plugin marketplaces to distribute them as bundles within the plugin abstraction layer.</p>



<p class="wp-block-paragraph">Some companies have fully leaned into this abstraction. For example, Intercom, rather than cataloging skills or hooks individually, <a href="https://ideas.fin.ai/p/how-we-use-claude-code-today-at-intercom" target="_blank" rel="noreferrer noopener">just catalogs plugins</a>—skills and hooks are fully inlined within them.</p>



<p class="wp-block-paragraph">The agentic tooling ecosystem is largely aligned on plugins, with Pi and OpenCode being notable holdouts.</p>



<h2 class="wp-block-heading">Rich, practical catalogs are what can separate AI success stories from repeated false starts</h2>



<p class="wp-block-paragraph">Maybe you choose to go all-in on plugins and bundle your skills and MCP servers inline; maybe you build a granular catalog per artifact type. But whatever shape it takes, what matters is that your company is cataloging—and retaining ownership of—its way of working. And doing so in a way that maximizes potential compatibility with the frontier tooling that is yet to be invented.</p>



<p class="wp-block-paragraph">A company can start on this path immediately. No new vendor relationship is needed, just an internal agreement to start storing artifacts in some company-wide Git repository. From there, it’s a matter of encouraging sharing, moving past individual silos, and celebrating wins (and eventually celebrating <em>usage</em>) of these artifacts. Every addition to that catalog is an opportunity for a colleague to leverage an artifact someone else constructed, a chance to build on top of it, to collaborate or consolidate efforts.</p>



<p class="wp-block-paragraph">If you’re part of a company building its first catalog, I’d like to hear from you. I work with a few companies in the early stages of this initiative, and I’ve been capturing early learnings around managing these catalogs in a very <a href="https://github.com/pulsemcp/air" target="_blank" rel="noreferrer noopener">lightweight open source framework called AIR</a>. If others are getting value out of leaning into these open standards as catalogs, we likely have an opportunity to collaborate across companies on some of the glue and minutiae that can operationalize the ideas here.</p>



<p class="wp-block-paragraph">Ramp and Intercom aren&#8217;t winning because they picked the right tooling vendor. They&#8217;re winning because they&#8217;ve turned individual productivity into organizational capability. The tooling will keep rotating. Whether your company compounds alongside it is a choice worth making deliberately.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/ai-artifact-catalogs-durable-standards-worth-institutional-investment/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Agent Skills Work but the Research Shows Most Teams Are Building Them Wrong</title>
		<link>https://www.oreilly.com/radar/agent-skills-work-but-the-research-shows-most-teams-are-building-them-wrong/</link>
				<comments>https://www.oreilly.com/radar/agent-skills-work-but-the-research-shows-most-teams-are-building-them-wrong/#respond</comments>
				<pubDate>Mon, 18 May 2026 10:59:14 +0000</pubDate>
					<dc:creator><![CDATA[Aishwarya Naresh Reganti, Prahitha Movva and Kiriti Badam]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18732</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills-work-but-the-research-shows-most-teams-are-building-them-wrong.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills-work-but-the-research-shows-most-teams-are-building-them-wrong-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Everybody is building agent skills, but not all skills are created equal. Here are some recent research papers that empirically show best practices to build them.]]></custom:subtitle>
		
				<description><![CDATA[This post was originally published on The Nuanced Perspective and is being reposted here with the authors’ permission. Agent skills are everywhere right now. Atlassian built them into Rovo so agents can automatically triage Jira tickets, draft Confluence pages, and route service requests without anyone typing a prompt. Canva and Figma use them so Claude [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>This post was originally published on </em><a href="https://thenuancedperspective.substack.com/p/agent-skills-work-but-the-research" target="_blank" rel="noreferrer noopener">The Nuanced Perspective</a><em> and is being reposted here with the authors’ permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">Agent skills are everywhere right now. Atlassian built them into Rovo so agents can automatically triage Jira tickets, draft Confluence pages, and route service requests without anyone typing a prompt. Canva and Figma use them so Claude can interact with design files directly. Stripe published skills for payment workflow automation. When Anthropic <a href="https://venturebeat.com/technology/anthropic-launches-enterprise-agent-skills-and-opens-the-standard" target="_blank" rel="noreferrer noopener">launched the Agent Skills open standard in December 2025</a>, Microsoft adopted it in VS Code and GitHub within weeks.</p>



<p class="wp-block-paragraph">The idea is elegantly simple. Instead of building a new specialized agent for every use case, you write a skill once, and any agent that understands the standard can use it. A code reviewer, a PR generator, a deployment checklist, a sprint planner. Each lives in a folder, triggers when relevant, and brings your team’s specific way of doing things into the agent’s context.</p>



<p class="wp-block-paragraph">But the research on whether skills actually work, and what causes them to fail, is only catching up to adoption now. Four recent papers take the first systematic look at skills in practice: what the benchmarks show, how libraries break down as they grow, and what a more principled approach to orchestration looks like.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>Three findings that will change how you think about skills</strong>:</p>



<ul class="wp-block-list">
<li>Curated skills raised the rate at which agents successfully completed tasks by <strong>16.2 percentage points on average</strong> across 84 tasks. Model-written skills showed no consistent benefit across any configuration tested.</li>



<li>As skill libraries grow, the agent’s ability to find the right skill on demand breaks down. When it scans every skill description in one pass, similar-sounding skills start colliding. <strong>Organizing skills into a hierarchy</strong> rather than a flat list is what the research shows actually fixes this.</li>



<li>A large-scale security study of ~31K community skills found that more than one in four contain exploitable vulnerabilities, spanning <strong>prompt injection</strong>, <strong>data exfiltration</strong>, and <strong>privilege escalation</strong>.</li>
</ul>
</blockquote>



<p class="wp-block-paragraph">This is what those papers found, and what it means for anyone building with skills today.</p>



<h2 class="wp-block-heading">What a skill is</h2>



<p class="wp-block-paragraph">Your team has a specific way of reviewing PRs. Particular checks, a specific order, standards that go beyond what any generic reviewer would know. You’ve explained it to every new engineer who joined. A skill is how you stop explaining it and let the agent carry it instead. In practice, it’s a folder with a SKILL.md file at the center: a description that acts as the trigger condition, a body with step-by-step instructions, and optionally scripts and reference documents that load only when needed. In short, it’s a scoped set of tools and instructions the agent can invoke.</p>



<p class="wp-block-paragraph">At session startup, the agent reads only the name and description from each installed skill, which is about 100 tokens per skill. The full instructions load only when the skill activates, and scripts run without being read into context at all. A large skill library costs almost nothing at initialization. The context budget only gets spent when a skill is actually running.</p>
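
<p class="wp-block-paragraph">A rough sketch of that startup behavior, assuming each skill folder holds a SKILL.md whose frontmatter sits between two --- lines: The parser below handles only simple key: value pairs, and the function names and sample skill are hypothetical rather than taken from any agent’s implementation.</p>

```python
import tempfile
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict:
    """Read only the simple key: value pairs between the leading '---' lines."""
    meta = {}
    with skill_md.open(encoding="utf-8") as handle:
        if handle.readline().strip() != "---":
            return meta
        for line in handle:
            if line.strip() == "---":
                break  # stop before the body, so instructions are never read
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip()
    return meta


def build_index(skills_root: Path) -> dict:
    """Map skill name to description for every SKILL.md under skills_root."""
    index = {}
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        if "name" in meta and "description" in meta:
            index[meta["name"]] = meta["description"]
    return index


def load_body(skills_root: Path, folder: str) -> str:
    """Load the full instructions only when a skill activates."""
    text = (skills_root / folder / "SKILL.md").read_text(encoding="utf-8")
    return text.split("---", 2)[2].strip()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pr-review").mkdir()
    (root / "pr-review" / "SKILL.md").write_text(
        "---\nname: pr-review\ndescription: Review PRs against our style guide\n---\n"
        "1. Load the style guide.\n2. Flag blockers first.\n",
        encoding="utf-8",
    )
    index = build_index(root)            # cheap: names and descriptions only
    body = load_body(root, "pr-review")  # paid for only on activation
print(index)
```

<p class="wp-block-paragraph">The index is what stays in context for the whole session; the body is fetched on demand, which is why initialization cost grows slowly with library size.</p>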



<p class="wp-block-paragraph">That’s progressive disclosure, and it’s what makes skills different from system prompts, which load everything globally every session, and from tools, which are API calls that give the agent direct capabilities. The distinction from MCP that holds up is that MCP gives the agent abilities (say, a shell, an API connection, or access to a database), whereas skills encode the knowledge of how to use those abilities well for a specific workflow. <a href="https://block.github.io/goose/blog/2025/12/22/agent-skills-vs-mcp/" target="_blank" rel="noreferrer noopener">Block’s engineering team put it well</a>: Skills are like GitHub Actions YAML, and MCP is the runner. One describes the workflow and the other makes it possible.</p>



<p class="wp-block-paragraph">Some concrete examples of what this looks like in practice, from teams that have shipped skills in production:</p>



<ul class="wp-block-list">
<li>A <strong>PR review skill</strong> that loads your org’s specific style guide, flagging violations and blockers according to your team’s standards rather than generic best practices</li>



<li>A <strong>deployment checklist skill</strong> that runs your team’s exact predeploy sequence, covering environment checks, rollback verification, and the three Slack channels to notify in order</li>



<li>A <strong>data reporting skill</strong> that knows your company’s metric definitions, so when someone asks for “revenue,” it pulls the right number rather than the closest approximation</li>



<li>A <strong>sprint planning skill</strong> that fetches the backlog, applies your team’s capacity rules, and proposes a plan structured the way your team runs standups</li>
</ul>



<p class="wp-block-paragraph">The value in each of these isn’t the task itself. Any agent can attempt a PR review or a sprint plan. The value is the organizational knowledge baked into how the skill executes it: your style rules, your deploy sequence, your metric definitions, your team’s way of running things. That specificity is also what makes skills hard to get right, as the benchmarks show.</p>



<h2 class="wp-block-heading">What the benchmarks show</h2>



<p class="wp-block-paragraph"><a href="https://arxiv.org/pdf/2602.12670v1" target="_blank" rel="noreferrer noopener">SkillsBench</a> is the <a href="https://www.skillsbench.ai/blogs/introducing-skillsbench" target="_blank" rel="noreferrer noopener">first benchmark</a> built specifically to measure whether agent skills actually improve performance. It tested 84 tasks across 11 domains, running each task under three conditions: no skill, a curated skill, and a self-generated skill. The results are worth sitting with.</p>



<p class="wp-block-paragraph">Curated skills raised average pass rates by 16.2 percentage points. However, the gains were uneven across domains. Software engineering tasks improved by 4.5 percentage points, while healthcare tasks improved by nearly 52. The domains where skills helped most were the ones with highly structured workflows and domain-specific conventions the base model doesn’t carry natively.</p>



<p class="wp-block-paragraph">The less-cited result is that self-generated skills, where the model writes its own skill rather than a human curating one, provided no average benefit across configurations (“<a href="https://arxiv.org/pdf/2602.12670v1">SkillsBench</a>,” Table 3). Some model configurations saw small gains; others saw small losses. The paper’s conclusion was that models cannot reliably author the procedural knowledge they benefit from consuming. The trajectory analysis in the benchmark identified two failure modes:</p>



<ul class="wp-block-list">
<li>Models generate imprecise procedures that lack specific API patterns.</li>



<li>Models fail to recognize what domain knowledge the task actually requires.</li>
</ul>



<p class="wp-block-paragraph">The benchmark’s self-generation condition has also drawn pushback from practitioners. One engineer writing on <a href="https://hackernoon.com/read-this-before-you-write-another-agent-skill" target="_blank" rel="noreferrer noopener"><em>HackerNoon</em></a> argues the test doesn’t reflect how skilled teams actually build skills. The benchmark prompted a fresh agent to write a skill and immediately use it, which is closer to asking a model to think harder before attempting a task than to building a skill from real execution experience. His own replication, using skills built from actual debugging sessions, showed much stronger results. The distinction matters because a skill captures what a fresh model wouldn’t know. If the model could have reasoned its way there anyway, the skill wasn’t needed.</p>



<p class="wp-block-paragraph">This matters in practice because self-generation is the obvious shortcut. You finish a workflow, ask the agent to extract it as a skill, and move on. The benchmark says that without a human review step, you’re not getting the gains you’d expect. The skills look complete. They often cover the main path. What they miss are the edge cases, the exceptions, the three things your team does differently that the model has no way of knowing, and those are exactly the things that make a skill valuable.</p>



<p class="wp-block-paragraph">One finding worth noting for anyone building with skills: focused skills with two to three modules consistently outperformed comprehensive documentation (“<a href="https://arxiv.org/pdf/2602.12670v1" target="_blank" rel="noreferrer noopener">SkillsBench</a>,” Section 4.2). More coverage in a single skill didn’t help; more focused, well-scoped skills did. The benchmark also found that smaller models running with curated skills could match larger models running without them, which is a meaningful cost implication for anyone running skills at scale (“<a href="https://arxiv.org/pdf/2602.12670v1" target="_blank" rel="noreferrer noopener">SkillsBench</a>,” Section 4.2.3, Finding 7).</p>



<h2 class="wp-block-heading">Questions that come up when building with skills</h2>



<p class="wp-block-paragraph">These questions show up every time a team starts building a skill library.</p>



<p class="wp-block-paragraph"><strong>When does something become a skill versus staying in a workflow or system prompt?</strong><br>The cleanest test is whether this is a recurring task that your team has a specific, repeatable way of doing. If yes, it’s a skill candidate. If it’s a one-time flow or something where general reasoning is sufficient, it probably doesn’t need one. The key difference between a skill and a workflow tool like n8n is flexibility. A workflow executes a fixed sequence and breaks when inputs change, while a skill gives the agent procedural guidance it can apply to variations of the same task. Similarly, agentic workflows can chain multiple agents and tasks together, but each agent still benefits from skills that encode the org-specific knowledge for its part of the chain. When you want the <em>what</em> to be consistent but the agent to handle the <em>how</em> intelligently, that’s a skill.</p>



<p class="wp-block-paragraph"><strong>How narrow or broad should a skill be?</strong><br>The SkillsBench finding that focused skills with two to three modules outperform comprehensive ones is directly relevant here (“<a href="https://arxiv.org/pdf/2602.12670v1" target="_blank" rel="noreferrer noopener">SkillsBench</a>,” Section 4.2). A skill that tries to cover an entire domain tends to underperform one that handles a specific thing well. The more practical question is whether to put a full workflow (data fetch, format, generate PDF) into one skill or split it. Current research supports splitting because each piece then becomes reusable, easier to update when something changes, and less likely to create unexpected behavior when one module’s scope drifts.</p>



<p class="wp-block-paragraph"><strong>What about skills for noncoders or nonsoftware workflows?</strong><br>Skills are format-agnostic. They’re structured instructions plus optional scripts, and the domain can be anything. A customer support team can encode their escalation criteria, tone guidelines, and the specific conditions where a human always takes over. A legal team can encode their document review checklist. A design team can encode component standards so reviews stay consistent across contributors. <a href="https://support.atlassian.com/rovo/docs/agent-actions/" target="_blank" rel="noreferrer noopener">Atlassian’s Rovo agents</a> are a useful reference outside the coding context. Their skills handle ticket triage, Confluence page creation, and service request routing, none of which is software engineering.</p>



<p class="wp-block-paragraph"><strong>When should you deprecate a skill?</strong><br>This is the question that gets skipped most often. The “<a href="https://arxiv.org/pdf/2602.20867v1" target="_blank" rel="noreferrer noopener">SoK</a>” paper argues for treating skills like any other maintained artifact through discovery, refinement, evaluation, update, and eventually deprecation (see Figure 2 in the paper). A skill that was compensating for a model capability gap six months ago may now be redundant, and worse than redundant if it’s overriding better native behavior. The practical test is to run the task with and without the skill and check if the skill still helps. If the gap has closed, retire it.</p>
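
<p class="wp-block-paragraph">That with-and-without comparison is simple enough to sketch. The 0.05 minimum uplift and the pass/fail outcomes below are hypothetical; a real check would reuse your own evaluation tasks and pick a threshold that accounts for run-to-run noise.</p>

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed; an empty list counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def should_retire(with_skill: list[bool], without_skill: list[bool],
                  min_uplift: float = 0.05) -> bool:
    """Retire when the skill no longer beats the bare agent by min_uplift."""
    uplift = pass_rate(with_skill) - pass_rate(without_skill)
    return not (uplift >= min_uplift)


# Hypothetical outcomes from running the same ten tasks both ways.
six_months_ago = should_retire([True] * 9 + [False], [True] * 6 + [False] * 4)
today = should_retire([True] * 9 + [False], [True] * 9 + [False])
print(six_months_ago, today)
```

<p class="wp-block-paragraph">In this made-up data the skill once added 30 points of pass rate and now adds none, so the second check flags it for retirement.</p>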



<h2 class="wp-block-heading">What breaks as the library grows</h2>



<p class="wp-block-paragraph">A single well-written skill works well. As libraries grow, flat retrieval breaks down, and the “<a href="https://arxiv.org/pdf/2603.02176" target="_blank" rel="noreferrer noopener">AgentSkillOS</a>” paper is the first to study this systematically across ecosystem scales from 200 to 200,000 skills.</p>



<p class="wp-block-paragraph">Flat skill libraries don’t scale. When the agent scans a flat directory of, say, 80+ skills on every request, retrieval becomes unreliable. Two skills with similar descriptions start triggering interchangeably, and behavior becomes nondeterministic for the same input. At the extreme, the orchestrator falls into <strong>routing collapse</strong>, where it consistently invokes the wrong skill because the semantic embeddings of two similar skills are indistinguishable. The output looks reasonable, but the wrong skill ran.</p>



<p class="wp-block-paragraph">The fix the paper proposes is capability trees: organize skills into a hierarchy rather than a flat list. Top-level domains like code, data, docs, with more specific skills as branches and leaves. The agent navigates from domain to branch to leaf instead of scanning everything. They also introduce a usage frequency queue, where skills that aren’t being invoked or aren’t improving outcomes get moved to a <strong>dormant index</strong> so they don’t pollute retrieval for active skills.</p>
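
<p class="wp-block-paragraph">To make the navigation idea concrete, here is a toy sketch. It is not the paper’s implementation: The word-overlap score stands in for whatever relevance judgment a real agent would make, and the tree, the skill names, and the dormant flag are hypothetical.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    description: str
    children: list["Node"] = field(default_factory=list)
    dormant: bool = False  # dormant skills are skipped during routing


def overlap(query: str, text: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()).intersection(text.lower().split()))


def route(node: Node, query: str) -> Node:
    """Walk from the root to a leaf, comparing only siblings at each level."""
    while node.children:
        active = [child for child in node.children if not child.dormant]
        if not active:
            break
        node = max(active, key=lambda child: overlap(query, child.description))
    return node


root = Node("root", "all skills", [
    Node("code", "code review pull request tests", [
        Node("pr-review", "review a pull request against the style guide"),
        Node("legacy-lint", "review lint pull request", dormant=True),
    ]),
    Node("docs", "docs writing confluence page", [
        Node("page-draft", "draft a confluence page"),
    ]),
])
print(route(root, "review this pull request").name)
```

<p class="wp-block-paragraph">At each step the router compares a handful of siblings rather than every skill in the library, and the dormant legacy-lint skill never competes with pr-review despite its similar description.</p>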



<p class="wp-block-paragraph">Testing this across ecosystems ranging from 200 to over 200,000 skills, the structured approach consistently outperformed flat invocation, and the gap widened as library size grew.</p>



<p class="wp-block-paragraph">This pattern shows up in how production teams manage their libraries too. Atlassian recommends <a href="https://support.atlassian.com/rovo/docs/agent-actions/" target="_blank" rel="noreferrer noopener">fewer than five skills per Rovo agent</a>. OpenHands maintains a <a href="https://github.com/OpenHands/extensions" target="_blank" rel="noreferrer noopener">curated extensions repository</a> with separate skill packages for discrete workflows rather than one monolithic skill set. In both cases, scoped, purposeful skill sets outperform comprehensive ones. More skills doesn’t mean more capability. Past a point, it’s just more noise.</p>



<h2 class="wp-block-heading">How orchestration can work differently</h2>



<p class="wp-block-paragraph"><em>This section uses a different definition of skill than the rest of the article, so the distinction matters upfront.</em></p>



<p class="wp-block-paragraph">In the “<a href="https://arxiv.org/pdf/2602.19672" target="_blank" rel="noreferrer noopener">SkillOrchestra</a>” paper, a skill isn’t a SKILL.md file. It’s a capability description used to match task requirements to individual agents in a multi-agent system (see Figure 3 in the paper). The concern isn’t procedural knowledge for one agent but figuring out which agent in a pool should handle a given task and why.</p>



<p class="wp-block-paragraph">The problem it’s solving is that standard reinforcement learning approaches to multi-agent routing don’t hold up as systems grow. Adding a new agent or modifying a workflow means retraining from scratch. RL policies also tend to send everything to the highest-capability agent regardless of cost, which looks fine in evaluation but gets expensive when you’re running it in production.</p>



<p class="wp-block-paragraph">SkillOrchestra’s alternative has each agent maintain a <strong>competence profile</strong> derived from its own execution history, specifically estimated success rates across different task types. The orchestrator routes incoming tasks to the agent whose profile best matches what the task actually demands, rather than the one with the highest raw capability. The routing logic stays current without retraining, and you can inspect why a task went where it went.</p>
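
<p class="wp-block-paragraph">A simplified sketch of profile-based routing follows. The scoring rule (smoothed success rate minus a weighted cost), the agent names, and the numbers are assumptions made for illustration, not the paper’s actual formulation.</p>

```python
from collections import defaultdict


class CompetenceProfile:
    """Per-agent success estimates by task type, built from execution history."""

    def __init__(self, cost_per_task: float):
        self.cost_per_task = cost_per_task
        self.outcomes = defaultdict(lambda: [0, 0])  # task_type: [successes, attempts]

    def record(self, task_type: str, success: bool) -> None:
        stats = self.outcomes[task_type]
        stats[0] += int(success)
        stats[1] += 1

    def success_rate(self, task_type: str) -> float:
        successes, attempts = self.outcomes.get(task_type, [0, 0])
        # Laplace smoothing: an untried agent starts at 0.5 rather than 0 or 1.
        return (successes + 1) / (attempts + 2)


def route(profiles: dict, task_type: str, cost_weight: float = 0.1) -> str:
    """Pick the agent with the best smoothed success rate net of weighted cost."""
    def score(name: str) -> float:
        profile = profiles[name]
        return profile.success_rate(task_type) - cost_weight * profile.cost_per_task
    return max(profiles, key=score)


profiles = {"large": CompetenceProfile(3.0), "small": CompetenceProfile(1.0)}
for i in range(10):
    profiles["large"].record("sql", i != 0)       # 9 of 10 succeed
    profiles["small"].record("sql", i >= 2)       # 8 of 10 succeed
    profiles["large"].record("refactor", i != 0)  # 9 of 10 succeed
    profiles["small"].record("refactor", i >= 8)  # 2 of 10 succeed
print(route(profiles, "sql"), route(profiles, "refactor"))
```

<p class="wp-block-paragraph">With these made-up numbers, the cheaper agent wins the task type where it is nearly as reliable, and the expensive one is reserved for the task type where the gap is large. Every routing decision can be explained by printing two numbers per agent.</p>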



<p class="wp-block-paragraph">The same logic applies to SKILL.md-based systems. Tracking which skills actually improve outcomes for specific task types, and what they cost in tokens, gives you the foundation for better selection as your library grows. You don’t need SkillOrchestra’s full framework to benefit from the core idea.</p>



<h2 class="wp-block-heading">The security problem</h2>



<p class="wp-block-paragraph">A <a href="https://arxiv.org/abs/2601.10338" target="_blank" rel="noreferrer noopener">large-scale security analysis</a> of 31,132 community-sourced skills found that 26.1% contain at least one exploitable vulnerability, spanning prompt injection, data exfiltration, privilege escalation, and supply chain risks. More than one in four.</p>



<p class="wp-block-paragraph">The attack patterns aren’t exotic. Prompt injection hidden in skill descriptions that manipulates agent behavior once the skill loads. Scripts that execute with filesystem permissions broader than the skill needs. Tool authorizations scoped to the entire workspace when the task requires only one directory.</p>



<p class="wp-block-paragraph">The core issue is that an external skill isn’t a document you’re reading. It’s code running with your agent’s permissions. Importing a skill from a public repository without reviewing it is like doing an npm install from an unknown author. You wouldn’t do that without at least checking what the package does. That framing changes what due diligence looks like. It means checking the scripts folder before installing, verifying that the permissions the skill requests match what the task actually requires, and sandboxing execution where your environment allows.</p>
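<p class="wp-block-paragraph">As a rough aid for that manual review, a short Python sketch can list the lines in a skill’s scripts folder that deserve a closer look. The <code>scripts</code> directory layout follows the description above; the <code>audit_skill</code> function and the <code>RISKY</code> patterns are hypothetical, and a pattern scan like this is a starting point for reading the code, not a substitute for it.</p>

```python
import re
from pathlib import Path

# Illustrative red flags only; a real review still means reading the scripts.
RISKY = [r"rm\s+-rf", r"curl\s", r"\.ssh", r"os\.environ"]


def audit_skill(skill_dir):
    """Return (relative_path, line_number, line) for each risky-looking line."""
    findings = []
    scripts = Path(skill_dir) / "scripts"
    if not scripts.is_dir():
        return findings
    for path in sorted(scripts.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(re.search(pattern, line) for pattern in RISKY):
                findings.append((str(path.relative_to(skill_dir)), number, line.strip()))
    return findings
```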



<p class="wp-block-paragraph">The tooling for auditing skills at install time doesn’t exist at the level it should yet. Until it does, the due diligence is manual. <a href="https://github.com/OpenHands/extensions">OpenHands’ extensions repository</a> and <a href="https://medium.com/@xuelangping/introducing-atlassian-skills-extending-ai-agent-with-atlassian-integration-fa19f6056df7">Atlassian’s open source skill package</a> are reasonable references for how production-grade community skills scope permissions. Claude Code’s built-in skill creator also helps here, since it structures permission scoping explicitly from the start.</p>



<h2 class="wp-block-heading">3 things to do differently</h2>



<p class="wp-block-paragraph">Across all four papers, three recommendations are consistent.</p>



<p class="wp-block-paragraph"><strong>Write skills from real execution.</strong> Do the workflow manually with an agent, correct it as you go, then extract it as a skill. The agent has full context of what worked. Skills built from real runbooks, incident reports, and accumulated corrections outperform skills written from scratch. The org-specific edge cases are exactly what the base model doesn’t already know. The general workflow it can handle; the three exceptions your team deals with differently are what the skill needs to capture.</p>



<p class="wp-block-paragraph"><strong>Treat the description as routing logic.</strong> The description isn’t a label. It’s how the skill gets triggered at all. Specific phrases, explicit activation conditions, context that distinguishes this skill from adjacent ones. If a skill isn’t firing when you expect it to, or fires when it shouldn’t, rewrite the description first. That’s almost always where the problem is.</p>



<p class="wp-block-paragraph"><strong>Plan for the full lifecycle.</strong> Creation is the easy part. Skills drift out of relevance as models improve. A skill that compensated for something Claude couldn’t do eight months ago may now be actively overriding better native behavior. They need to be evaluated against actual task outcomes, updated when workflows change, and retired when they stop earning their place. The teams that treat their skill libraries the way good engineering teams treat their codebase, with reviews, with metrics, with a process for deprecation, are the ones whose libraries stay useful as they grow.</p>



<h2 class="wp-block-heading">Where this is heading</h2>



<p class="wp-block-paragraph">The shift from prompt engineering to tool use to skill engineering has followed a pattern. Each era produces artifacts that persist longer than the last. Prompts lived in conversations. Tools live in configurations. Skills live in libraries, versioned, shared, maintained, and eventually retired. They behave like code.</p>



<p class="wp-block-paragraph">Most teams aren’t treating them that way yet. Skills get written quickly, without evaluation criteria, without any plan for what happens when they stop being useful. That’s worked so far because most skill libraries are still small enough to hold in your head. It won’t hold as they become infrastructure.</p>



<p class="wp-block-paragraph">The teams building durable agent systems won’t be the ones with the most skills. They’ll be the ones who figured out earlier that a skill library needs to be maintained, not just populated, and who started building the discipline to do that before it became urgent.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>This article grew out of a live “Chai &amp; AI” session conducted by </em><a href="https://open.substack.com/users/14105724-prahitha-movva?utm_source=mentions" target="_blank" rel="noreferrer noopener"><em>Prahitha Movva</em></a><em> where practitioners debated whether agent skills actually deliver on the hype, or just add another layer of complexity.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/agent-skills-work-but-the-research-shows-most-teams-are-building-them-wrong/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Agent Harness Engineering</title>
		<link>https://www.oreilly.com/radar/agent-harness-engineering/</link>
				<pubDate>Fri, 15 May 2026 11:02:24 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18718</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-harness-engineering.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-harness-engineering-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A coding agent is the model plus everything you build around it. Harness engineering treats that scaffolding as a real artifact, and it tightens every time the agent slips.]]></custom:subtitle>
		
				<description><![CDATA[This article was originally published on Addy Osmani’s blog. It’s being reposted here with the author’s permission. Roughly: Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again. We’ve spent the last two years arguing about models. Which one is [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>This article was originally published on </em><a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener"><em>Addy Osmani’s blog</em></a><em>. It’s being reposted here with the author’s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">Roughly: Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.</p>



<p class="wp-block-paragraph">We’ve spent the last two years arguing about models. Which one is smartest, which one writes the cleanest React, which one hallucinates less. That conversation is fine as far as it goes, but it’s missing the other half of the system. The model is one input into a running agent. The rest is the <em>harness</em>: the prompts, tools, context policies, hooks, sandboxes, subagents, feedback loops, and recovery paths wrapped around the model so it can actually finish something.</p>



<p class="wp-block-paragraph"><strong>A decent model with a great harness beats a great model with a bad harness</strong>. I’ve watched this play out on my own work over and over. And increasingly the interesting engineering isn’t in picking the model; it’s in designing the scaffolding around it.</p>



<p class="wp-block-paragraph">That discipline now has a name. Viv Trivedy coined the term <em>harness engineering</em>, and his “<a href="https://x.com/Vtrivedy10/status/2031408954517971368" target="_blank" rel="noreferrer noopener">Anatomy of an Agent Harness</a>” post is the cleanest derivation of what a harness actually is and why each piece exists. <a href="https://x.com/dexhorthy/status/1985699548153467120" target="_blank" rel="noreferrer noopener">Dex Horthy</a> has been tracking the pattern as it emerges. <a href="https://www.humanlayer.dev/blog/skill-issue-harness-engineering-for-coding-agents" target="_blank" rel="noreferrer noopener">HumanLayer</a> frames most agent failures as “skill issues” that come down to configuration rather than model weights. <a href="https://www.anthropic.com/engineering/harness-design-long-running-apps" target="_blank" rel="noreferrer noopener">Anthropic’s engineering team</a> has published what I think is the best public breakdown of how to design a harness for long-running work. And <a href="https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html" target="_blank" rel="noreferrer noopener">Birgitta Böckeler</a> has a good overview of what this looks like from the user’s side.</p>



<p class="wp-block-paragraph">This post is my attempt to pull those threads together.</p>



<h2 class="wp-block-heading">What is a harness, really?</h2>



<p class="wp-block-paragraph">Viv’s one-liner does most of the work:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Agent = Model + Harness. If you’re not the model, you’re the harness.</p>
</blockquote>



<p class="wp-block-paragraph">A harness is every piece of code, configuration, and execution logic that isn’t the model itself. A raw model is not an agent. It becomes one once a harness gives it state, tool execution, feedback loops, and enforceable constraints.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1376" height="768" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image.jpeg" alt="The model is one chip on the board. The harness is everything else that makes it useful." class="wp-image-18719" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image.jpeg 1376w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-300x167.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-768x429.jpeg 768w" sizes="auto, (max-width: 1376px) 100vw, 1376px" /></figure>



<p class="wp-block-paragraph">Concretely, a harness includes:</p>



<ul class="wp-block-list">
<li>System prompts, CLAUDE.md, AGENTS.md, skill files, and subagent prompts</li>



<li>Tools, skills, MCP servers, and their descriptions</li>



<li>Bundled infrastructure (filesystem, sandbox, browser)</li>



<li>Orchestration logic (subagent spawning, handoffs, model routing)</li>



<li>Hooks and middleware for deterministic execution (compaction, continuation, lint checks)</li>



<li>Observability (logs, traces, cost and latency metering)</li>
</ul>



<p class="wp-block-paragraph"><a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/" target="_blank" rel="noreferrer noopener">Simon Willison</a> reduces the loop part to its essence: an agent is a system that “runs tools in a loop to achieve a goal.” The skill is in the design of both the tools and the loop.</p>



<p class="wp-block-paragraph">If that sounds like a lot of surface area, it is. And it’s <em>your</em> surface area, not the model provider’s. Claude Code, Cursor, Codex, Aider, Cline: These are all harnesses. <strong>The model underneath is sometimes the same, but the behavior you experience is dominated by what the harness does</strong>.</p>



<p class="has-text-align-center wp-block-paragraph"><code>coding agent = AI model(s) + harness</code></p>



<p class="wp-block-paragraph">This equation, articulated by Viv and echoed by HumanLayer, is where the work actually lives. The debate over the left-hand side is loud. Most of the actual leverage sits on the right.</p>



<h2 class="wp-block-heading">The “skill issue” reframe</h2>



<p class="wp-block-paragraph">There’s a pattern I watch engineers fall into. The agent does something dumb, the engineer blames the model, and the blame gets filed under “wait for the next version.”</p>



<p class="wp-block-paragraph">The harness-engineering mindset rejects that default. The failure is usually legible. The agent didn’t know about a convention, so you add it to AGENTS.md. The agent ran a destructive command, so you add a hook that blocks it. The agent got lost in a 40-step task, so you split it into a planner and an executor. The agent kept “finishing” broken code, so you wire a typecheck backpressure signal into the loop.</p>



<p class="wp-block-paragraph">HumanLayer says: “It’s not a model problem. It’s a configuration problem.” Harness engineering is what happens when you take that seriously.</p>



<p class="wp-block-paragraph">There’s a striking data point that shows up in both Viv’s write-up and HumanLayer’s. On Terminal Bench 2.0, Claude Opus 4.6 running inside Claude Code scores far lower than the same model running in a custom harness. Viv’s team moved a coding agent from Top 30 to Top 5 by changing only the harness. Posttraining couples a model to the harness it was trained in. Moving it into a different harness, with better tools for your codebase, a tighter prompt, and sharper backpressure, can unlock capability the original harness was leaving on the floor.</p>



<p class="wp-block-paragraph">This is the opposite of the “just wait for GPT-6” narrative. <strong>The gap between what today’s models can do and what you see them doing is largely a harness gap</strong>.</p>



<h2 class="wp-block-heading">The ratchet: Every mistake becomes a rule</h2>



<p class="wp-block-paragraph">The most important habit in harness engineering is treating agent mistakes as permanent signals. Not one-off stories to laugh about, not “bad runs” to retry. Signals.</p>



<p class="wp-block-paragraph">If the agent ships a PR with a commented-out test and I merge it by accident, that’s an input. The next version of my AGENTS.md says “never comment out tests; delete them or fix them.” The next version of my precommit hook greps for <code>.skip(</code> and <code>xit(</code> in the diff. The next version of my reviewer subagent flags commented-out tests as a blocker.</p>
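<p class="wp-block-paragraph">A minimal sketch of that precommit check in Python, assuming the hook receives a unified diff as text. The <code>blocked_additions</code> name and the exact regular expression are illustrative, not taken from any particular hook framework.</p>

```python
import re

# Patterns named in the text: skipped or disabled tests.
BLOCKED = re.compile(r"\.skip\(|\bxit\(")


def blocked_additions(diff_text):
    """Return added lines in a unified diff that disable tests."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if BLOCKED.search(line):
                hits.append(line[1:].strip())
    return hits
```

<p class="wp-block-paragraph">One way to wire this up would be to feed it the output of <code>git diff --cached</code> and fail the commit when the returned list is non-empty. Only added lines are checked, so removing an old <code>xit(</code> never trips the rule.</p>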



<p class="wp-block-paragraph">You only add constraints when you’ve seen a real failure. You only remove them when a capable model has made them redundant. <strong>Every line in a good AGENTS.md should be traceable back to a specific thing that went wrong</strong>.</p>



<p class="wp-block-paragraph">This is also why harness engineering is a discipline rather than a framework. The right harness for your codebase is shaped by your failure history. You can’t download it.</p>



<h2 class="wp-block-heading">Working backward from behavior</h2>



<p class="wp-block-paragraph">The framing from Viv that I find most useful when I’m actually designing a harness is to start from the behavior you want and derive the harness piece that delivers it. His pattern: <em>behavior we want (or want to fix) → harness design to help the model achieve this</em>.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1376" height="768" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-1.jpeg" alt="Every harness feature is a bridge across a specific thing the model can't do on its own" class="wp-image-18720" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-1.jpeg 1376w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-1-300x167.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-1-768x429.jpeg 768w" sizes="auto, (max-width: 1376px) 100vw, 1376px" /></figure>



<p class="wp-block-paragraph">The useful thing about deriving it this way is that every harness component has a specific job. <strong>If you can’t name the behavior a component exists to deliver, it probably shouldn’t be there</strong>.</p>



<p class="wp-block-paragraph">The sections that follow walk through the pieces in roughly the order Viv does, with the specific patterns I’ve found worth stealing.</p>



<h2 class="wp-block-heading">Filesystem and Git: Durable state</h2>



<p class="wp-block-paragraph">The filesystem is the most foundational primitive, and it tends to be underrated because it’s boring. Models can only directly operate on what fits in context. Without a filesystem, you’re copy-pasting into a chat window, and that isn’t a workflow.</p>



<p class="wp-block-paragraph">Once you have a filesystem, the agent gets a workspace to read data, code, and docs; a place to offload intermediate work instead of holding it in context; and a surface where multiple agents and humans can coordinate through shared files. Adding Git on top gives you versioning for free, so the agent can track progress, roll back errors, and branch experiments.</p>



<p class="wp-block-paragraph">Most of the other harness primitives end up pointing at the filesystem for something.</p>



<h2 class="wp-block-heading">Bash and code execution: The general-purpose tool</h2>



<p class="wp-block-paragraph">The main agent loop today is a ReAct loop: The model reasons, takes an action via a tool call, observes the result, and repeats. But a harness can only execute the tools it has logic for. You can try to prebuild a tool for every possible action, or you can give the agent bash and let it build the tools it needs on the fly.</p>
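<p class="wp-block-paragraph">The loop itself is small. Here is a hedged Python sketch, assuming a hypothetical <code>model</code> callable that takes the history and returns a dict describing either a tool call or a final answer; real harnesses wrap a provider API and add much more (streaming, retries, context management).</p>

```python
def run_agent(model, tools, goal, max_steps=10):
    """Reason, act through a tool, observe, and repeat until the model answers."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        decision = model(history)  # hypothetical: returns a dict
        if decision["type"] == "final":
            return decision["content"]
        tool = tools.get(decision["tool"])
        if tool is None:
            # The harness can only execute tools it has logic for.
            observation = "error: unknown tool " + decision["tool"]
        else:
            observation = tool(decision["input"])
        history.append(("action", decision))
        history.append(("observation", observation))
    return None  # step budget exhausted without a final answer
```

<p class="wp-block-paragraph">The <code>tools</code> dict is where the trade-off described above shows up: it can hold dozens of prebuilt entries, or a single bash entry that lets the agent compose the rest.</p>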



<p class="wp-block-paragraph">Willison’s take on this is that agents already excel at shell commands; most tasks collapse to a few well-chosen CLI invocations. Harnesses still ship focused tools, but bash plus code execution has become the default general-purpose strategy for autonomous problem solving. It’s the difference between teaching someone to use a single kitchen gadget and handing them a kitchen.</p>



<h2 class="wp-block-heading">Sandboxes and default tooling</h2>



<p class="wp-block-paragraph">Bash is only useful if it runs somewhere safe. Running agent-generated code on your laptop is risky, and a single local environment doesn’t scale to many parallel agents.</p>



<p class="wp-block-paragraph">Sandboxes give agents an isolated operating environment. Instead of executing locally, the harness connects to a sandbox to run code, inspect files, install dependencies, and verify work. You can allow-list commands, enforce network isolation, spin up new environments on demand, and tear them down when the task is done.</p>
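<p class="wp-block-paragraph">Command allow-listing, for instance, can be sketched as follows. The <code>ALLOWED</code> set and the <code>is_allowed</code> function are assumptions made for illustration; a check like this complements the sandbox’s isolation and does not replace it.</p>

```python
import shlex

ALLOWED = {"ls", "cat", "git", "pytest", "python"}  # illustrative allow-list
SHELL_META = set(";|&$\n\x60")  # \x60 is the backtick


def is_allowed(command):
    """Allow a command only if it has no shell operators and its executable is listed."""
    if SHELL_META.intersection(command):
        return False  # refuse chaining, substitution, and pipes outright
    try:
        parts = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    return bool(parts) and parts[0] in ALLOWED
```

<p class="wp-block-paragraph">Matching on the executable alone is coarse: allowing <code>git</code> also allows every Git subcommand, so finer rules still belong in a separate layer.</p>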



<p class="wp-block-paragraph">A good sandbox ships with good defaults: preinstalled language runtimes and packages, Git and test CLIs, a headless browser for web interaction. Browsers, logs, screenshots, and test runners are what let the agent observe its own work and close the self-verification loop.</p>



<p class="wp-block-paragraph">The model doesn’t configure its execution environment. Deciding where the agent runs, what’s available, and how it verifies its output are all harness-level calls.</p>



<h2 class="wp-block-heading">Memory and search: Continual learning</h2>



<p class="wp-block-paragraph">Models have no additional knowledge beyond their weights and what’s currently in context. Without the ability to edit weights, the only way to add knowledge is through context injection.</p>



<p class="wp-block-paragraph">The filesystem is again the primitive. Harnesses support memory file standards like AGENTS.md that get injected on every start. As the agent edits that file, the harness reloads it, and knowledge from one session carries into the next. This is a crude but effective form of continual learning.</p>
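<p class="wp-block-paragraph">A minimal sketch of that injection step, assuming the memory file sits at the repo root. The <code>build_system_prompt</code> function and the header line it adds are hypothetical; each harness has its own loading rules.</p>

```python
from pathlib import Path


def build_system_prompt(base_prompt, repo_root, memory_file="AGENTS.md"):
    """Prepend the repo's memory file, if present, to the base prompt."""
    path = Path(repo_root) / memory_file
    if not path.is_file():
        return base_prompt
    # Read at session start, so edits made in one session reach the next.
    memory = path.read_text(encoding="utf-8").strip()
    return "# Project memory (" + memory_file + ")\n" + memory + "\n\n" + base_prompt
```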



<p class="wp-block-paragraph">For knowledge that didn’t exist at training time (new library versions, current docs, today’s data), web search and MCP tools like Context7 bridge the cutoff. These are useful primitives to bake into the harness rather than leaving to the user.</p>



<h2 class="wp-block-heading">Battling context rot</h2>



<p class="wp-block-paragraph">Context rot is the observation that models get worse at reasoning and completing tasks as the context window fills up. Context is scarce, and harnesses are largely delivery mechanisms for good context engineering.</p>



<p class="wp-block-paragraph">Three techniques show up repeatedly:</p>



<p class="wp-block-paragraph"><strong>Compaction</strong>. When the window gets close to full, something has to give. Letting the API error is not an option for a production harness, so the harness intelligently summarizes and offloads older context so the agent can keep working.</p>



<p class="wp-block-paragraph"><strong>Tool-call offloading</strong>. Large tool outputs (think 2,000-line log files) clutter context without adding much signal. When an output exceeds a size threshold, the harness keeps only its head and tail in context and offloads the full output to the filesystem, where the agent can read it on demand.</p>



<p class="wp-block-paragraph"><strong>Skills with progressive disclosure</strong>. Loading every tool and MCP into context at startup degrades performance before the agent takes a single action. Skills let the harness reveal instructions and tools only when the task actually calls for them.</p>
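<p class="wp-block-paragraph">Of the three, tool-call offloading is the easiest to sketch. The Python below is an illustration with assumed names (<code>offload_output</code>, <code>keep_lines</code>), and it counts lines rather than tokens to stay dependency-free.</p>

```python
from pathlib import Path


def offload_output(output, store_dir, name, keep_lines=20):
    """Keep the head and tail of a large output in context; save the rest to disk."""
    lines = output.splitlines()
    if len(lines) > 2 * keep_lines:
        path = Path(store_dir) / name
        path.write_text(output, encoding="utf-8")
        omitted = len(lines) - 2 * keep_lines
        marker = "[... {} lines omitted; full output saved to {} ...]".format(omitted, path)
        return "\n".join(lines[:keep_lines] + [marker] + lines[-keep_lines:])
    return output  # small enough to keep whole
```

<p class="wp-block-paragraph">The marker line tells the agent where the full output lives, so it can read the file on demand instead of carrying every line in context.</p>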



<p class="wp-block-paragraph">Anthropic’s harness post adds one more technique for the really long jobs: full context resets, where the harness tears the session down and rebuilds it from a compact handoff file. They’re explicit that <em>compaction alone wasn’t sufficient</em> for long tasks; sometimes you need to start fresh with a structured brief. This is closer to how humans onboard a new engineer than to how we usually think about “memory.”</p>



<h2 class="wp-block-heading">Long-horizon execution: Ralph loops, planning, verification</h2>



<p class="wp-block-paragraph">Autonomous long-horizon work is the holy grail and the hardest thing to get right. Today’s models suffer from early stopping, poor decomposition of complex problems, and incoherence as work stretches across multiple context windows. The harness has to design around all of that.</p>



<p class="wp-block-paragraph">I’ve written about autonomous coding loops like the Ralph loop before in <a href="https://addyosmani.com/blog/self-improving-agents/" target="_blank" rel="noreferrer noopener">self-improving agents</a> and in my <a href="https://beyond.addy.ie/2026-trends/" target="_blank" rel="noreferrer noopener">2026 trends piece</a>, but it’s worth restating in this framing: A hook intercepts the model’s attempt to exit and reinjects the original prompt into a fresh context window, forcing the agent to continue against a completion goal. Each iteration starts clean but reads state from the previous one through the filesystem. It’s a surprisingly simple trick for turning a single-session agent into a multisession one, and it’s the kind of primitive you’d never derive from “just use a smarter model.”</p>
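<p class="wp-block-paragraph">Stripped to its core, the pattern looks something like the Python sketch below. It simplifies the exit-intercepting hook into an outer loop, and the <code>run_session</code> callable and the <code>DONE</code> marker file are assumptions made for the example rather than part of any specific tool.</p>

```python
from pathlib import Path


def ralph_loop(run_session, prompt, workdir, done_file="DONE", max_iterations=20):
    """Re-run a fresh session with the same prompt until a completion marker appears."""
    marker = Path(workdir) / done_file
    for iteration in range(1, max_iterations + 1):
        # Each session starts with a clean context; state lives in workdir only.
        run_session(prompt, workdir)
        if marker.exists():
            return iteration
    return None  # budget exhausted before the goal was met
```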



<p class="wp-block-paragraph"><strong>Planning</strong> is when the model decomposes a goal into a sequence of steps, usually into a plan file on disk. The harness supports this with prompting and reminders about how to use the plan file. After each step, the agent checks its work via self-verification: Hooks run a predefined test suite and loop failures back to the model with the error text, or the model reviews its own output against explicit criteria.</p>



<p class="wp-block-paragraph"><strong>Planner/generator/evaluator splits</strong>. Anthropic’s long-running harness work is explicit that separating generation from evaluation into distinct agents outperforms self-evaluation, because agents reliably skew positive when grading their own work. It’s GANs for prose. The related pattern is the <strong>sprint contract</strong>, where the generator and evaluator negotiate what “done” actually means before code gets written. In my own workflows, writing down the done condition before starting has caught more scope drift than any prompt change I’ve ever made.</p>



<h2 class="wp-block-heading">Hooks: The enforcement layer</h2>



<p class="wp-block-paragraph">Hooks are what separate “I told the agent to do X” from “the system enforces X.”</p>



<p class="wp-block-paragraph">A hook is a script that runs at a specific lifecycle point: before a tool call, after a file edit, before commit, on session start. They’re the right place for things the agent should never forget but often does. Run typecheck and lint and tests after every edit and surface failures. Block destructive bash (<code>rm -rf</code>, <code>git push --force</code>, <code>DROP TABLE</code>). Require approval before opening a PR or pushing to main. Auto-format on write so the agent doesn’t waste tokens on whitespace.</p>
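<p class="wp-block-paragraph">A pre-tool-call hook for the destructive-command case might look like this Python sketch. The <code>pre_tool_hook</code> name, its return convention, and the patterns are illustrative assumptions; they cover only the three examples named above and are deliberately conservative.</p>

```python
import re

# Patterns taken from the examples in the text; extend from your own failure history.
DESTRUCTIVE = [
    re.compile(r"\brm\s+-(\w*r\w*f|\w*f\w*r)"),
    re.compile(r"\bgit\s+push\b.*(--force|\s-f\b)"),
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
]


def pre_tool_hook(command):
    """Return a block reason for destructive commands, or None to let the call proceed."""
    for pattern in DESTRUCTIVE:
        if pattern.search(command):
            return "blocked by hook: matches " + pattern.pattern
    return None
```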



<p class="wp-block-paragraph">The principle HumanLayer highlights and I’ve come to agree with is: <strong>Success is silent; failures are verbose</strong>. If typecheck passes, the agent hears nothing. If it fails, the error text gets injected into the loop and the agent self-corrects. That makes the feedback loop almost free in the common case and directly actionable when something goes wrong.</p>
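<p class="wp-block-paragraph">That principle fits in a few lines of Python. In this sketch, the <code>check</code> function is a hypothetical helper that wraps whatever typecheck, lint, or test command your project uses and returns text only when there is something for the agent to act on.</p>

```python
import subprocess


def check(command, timeout=120):
    """Run a check; return None on success, or the error text to inject on failure."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    if result.returncode == 0:
        return None  # success is silent: nothing enters the context
    output = (result.stdout + result.stderr).strip()
    return output or "exit code {}".format(result.returncode)
```

<p class="wp-block-paragraph">The harness appends the returned text to the loop only when it is not <code>None</code>, so a passing check costs no tokens.</p>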



<h2 class="wp-block-heading">AGENTS.md and tool choice</h2>



<p class="wp-block-paragraph">The flat markdown rulebook at the root of your repo is still the single highest-leverage configuration point, because it lands in the system prompt every turn. Conventions go here: package manager, test framework, formatting, “never touch <code>/legacy</code>,” “always use our logger.” Two hard-won lessons:</p>



<p class="wp-block-paragraph">Keep it short. HumanLayer keeps theirs under 60 lines. Every line is competing for attention, and more rules make each rule matter less. <strong>Pilot’s checklist, not style guide</strong>.</p>



<p class="wp-block-paragraph">Earn each line. Rules should trace to a specific past failure or a hard external constraint. If they don’t, they’re noise. Ratchet; don’t brainstorm.</p>



<p class="wp-block-paragraph">Same discipline applies to tools. Each tool’s name, description, and schema get stamped into the prompt every request. Ten focused tools outperform fifty overlapping ones because the model can hold the menu in its head. HumanLayer also flags a real security concern here: tool descriptions populate the prompt, so any MCP server you install is trusted text the model will read. A sloppy or malicious MCP can prompt-inject your agent before you’ve typed anything.</p>



<h2 class="wp-block-heading">What this looks like in production</h2>



<p class="wp-block-paragraph">The clearest public picture I’ve seen of a mature harness is Fareed Khan’s (estimated) breakdown of <a href="https://levelup.gitconnected.com/building-claude-code-with-harness-engineering-d2e8c0da85f0" target="_blank" rel="noreferrer noopener">Claude Code’s architecture</a>.</p>



<p class="wp-block-paragraph">Almost every concept from the previous sections shows up in Khan’s diagram as a named component. Context injection is the knowledge layer. Loop state lives in the memory store and the worktree isolator. Destructive-action hooks sit behind the permission gate. Subagent context firewalls are the entire multi-agent layer. The tool dispatch registry is where MCP servers and bash both plug in. Khan’s argument is the same as Viv’s, just worked through a shipping product: <strong>Claude Code’s trajectory is about the harness at least as much as about the model underneath it</strong>.</p>



<h2 class="wp-block-heading">Harnesses don’t shrink; they move</h2>



<p class="wp-block-paragraph">One of the better observations in the Anthropic write-up is that as models improve, the space of interesting harness combinations doesn’t shrink. It moves.</p>



<p class="wp-block-paragraph">The naive story is that better models make harnesses obsolete. If the model can plan, no planner. If the model is coherent at long horizons, no context resets. And yes, Opus 4.6 largely killed the context-anxiety failure mode (Sonnet 4.5 used to wrap up work prematurely as it approached what it thought was its context limit), which means a whole class of anxiety-mitigation scaffolding I was writing six months ago is now dead code.</p>



<p class="wp-block-paragraph">But the ceiling moved with the model. Tasks that were unreachable are in play, and they have their own failure modes. The anxiety scaffolding goes away, and in its place you need a multiday memory policy or a harness that coordinates three specialized agents or evaluators for design quality in generated UIs. The assumptions shift, and so does the scaffolding that encodes them.</p>



<p class="wp-block-paragraph">Anthropic puts it cleanly: “Every component in a harness encodes an assumption about what the model can’t do on its own.” When the model gets better at something, that component becomes load-bearing for nothing and should come out. When the model unlocks something new, new scaffolding is needed to reach the new ceiling.</p>



<h2 class="wp-block-heading">The model-harness training loop</h2>



<p class="wp-block-paragraph">The other thing that’s happening, which Viv names explicitly, is a feedback loop between harness design and model training.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1376" height="768" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-2.jpeg" alt="The harness doesn't shrink, it moves" class="wp-image-18721" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-2.jpeg 1376w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-2-300x167.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-2-768x429.jpeg 768w" sizes="auto, (max-width: 1376px) 100vw, 1376px" /></figure>



<p class="wp-block-paragraph">Today’s agent products are posttrained with harnesses in the loop. The model gets specifically better at the actions the harness designers think it should be good at: filesystem operations, bash, planning, subagent dispatch. That’s why Opus 4.6 feels different inside Claude Code than inside someone else’s harness, and it’s why changing a tool’s logic sometimes causes strange regressions. A genuinely general model wouldn’t care whether you used <code>apply_patch</code> or <code>str_replace</code>, but cotraining creates overfitting.</p>



<p class="wp-block-paragraph">The practical implication is twofold. <strong>A harness is a living system, not a config file you set up once</strong>. And the “best” harness isn’t necessarily the one the model was trained inside; it’s the one designed for your task. Viv’s Top 30 to Top 5 Terminal Bench jump is the clearest proof point I’ve seen.</p>



<h2 class="wp-block-heading">Harness as a service</h2>



<p class="wp-block-paragraph">Viv’s other contribution is the <a href="https://www.vtrivedy.com/posts/claude-code-sdk-haas-harness-as-a-service" target="_blank" rel="noreferrer noopener">HaaS</a> framing: harness as a service. The observation is that we’re moving from building on LLM APIs (which give you a completion) to building on harness APIs (which give you a runtime). The Claude Agent SDK, the Codex SDK, and the OpenAI Agents SDK all point in the same direction. You get the loop, the tools, the context management, the hooks, and the sandbox primitives out of the box, and you customize them.</p>



<p class="wp-block-paragraph">The shift matters because the default path used to be: build your own loop, wire up your own tool-calling, handle your own conversation state, invent your own approval flow. Now the default path is: pick a harness framework, configure it along the four pillars (system prompt, tools, context, subagents), and put the rest of your effort into domain-specific prompt and tool design.</p>



<p class="wp-block-paragraph">That’s what makes “skill issue” tractable. You’re not rebuilding an agent from scratch every time something goes wrong. You’re tuning a configuration surface that’s already well-factored.</p>



<p class="wp-block-paragraph">Viv’s line on this is also the best argument for starting messy: “Good agent building is an exercise in iteration. You can’t do iterations if you don’t have a v0.1.”</p>



<h2 class="wp-block-heading">Where this is going</h2>



<p class="wp-block-paragraph">Look at the top coding agents side by side (Claude Code, Cursor, Codex, Aider, Cline) and <strong>they look more like each other than their underlying models do</strong>. The models are different. The harness patterns are converging. I don’t think that’s an accident. It’s the industry slowly finding the load-bearing pieces of scaffolding that turn a generative model into something that can ship.</p>



<p class="wp-block-paragraph">Viv’s framing of the open problems is the one I find most exciting: orchestrating many agents working in parallel on a shared codebase; agents that analyze their own traces to identify and fix harness-level failure modes; harnesses that dynamically assemble the right tools and context just-in-time for a given task instead of being preconfigured at startup.</p>



<p class="wp-block-paragraph">That last one, in particular, feels like <strong>where harnesses stop being static config and start becoming something closer to a compiler</strong>.</p>
]]></content:encoded>
										</item>
		<item>
		<title>Generative AI in the Real World: Chang She on Data Infrastructure for AI</title>
		<link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-chang-she-on-data-infrastructure-for-ai/</link>
				<comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-chang-she-on-data-infrastructure-for-ai/#respond</comments>
				<pubDate>Thu, 14 May 2026 16:00:00 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica and Chang She]]></dc:creator>
						<category><![CDATA[Generative AI in the Real World]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=18710</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[As a pandas core contributor and early Parquet adopter who built AI data pipelines at streaming company Tubi TV, Chang She saw firsthand why the traditional data stack breaks down for AI workloads—and founded LanceDB to fix it. Chang joined Ben Lorica to explain why vector databases are too narrow a solution for modern AI [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">As a pandas core contributor and early Parquet adopter who built AI data pipelines at streaming company Tubi TV, Chang She saw firsthand why the traditional data stack breaks down for AI workloads—and founded LanceDB to fix it. Chang joined Ben Lorica to explain why vector databases are too narrow a solution for modern AI data needs, and what a true multimodal data infrastructure actually looks like. Chang and Ben get into why the Lance file format is quickly becoming the open source standard for multimodal data, how the rise of agents is exploding data infrastructure demands, why open-weight models are the enterprise cost shift to watch in the next 12 months, and more. &#8220;Trillion is the new billion,&#8221; Chang says, and the enterprises that set up their data infrastructure now for that scale will be the ones that succeed.</p>



<p class="wp-block-paragraph">About the <em>Generative AI in the Real World</em> podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>



<p class="wp-block-paragraph">Check out other episodes of this podcast <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*zs7my8*_gcl_au*NDMxMjk0ODI3LjE3Nzc0ODkyMDc.*_ga*MTk1MTQxMjQ0My4xNzc3NDA1NDkw*_ga_092EL089CH*czE3Nzg1MTY2NzEkbzM2JGcxJHQxNzc4NTE2Njk0JGozNyRsMCRoMA.." target="_blank" rel="noreferrer noopener">on the O’Reilly learning platform</a> or follow us on <a href="https://www.youtube.com/playlist?list=PL055Epbe6d5YcJUhZbsVW9dlMueIuOxK_" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/5C9oof8TFkP65lDUcEy5jT" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/generative-ai-in-the-real-world/id1835476293" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</p>



<h2 class="wp-block-heading"><strong>Transcript</strong></h2>



<p class="wp-block-paragraph"><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=35" target="_blank" rel="noreferrer noopener">00.35</a><br><strong>All right, so today we have Chang She, CEO and cofounder of LanceDB, which you can find at </strong><a href="http://lancedb.com" target="_blank" rel="noreferrer noopener"><strong>lancedb.com</strong></a><strong>. Tagline is “Build better models faster.” So Chang, welcome to the podcast.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=49" target="_blank" rel="noreferrer noopener">00.49</a><br>Hey Ben, super excited to be here.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=52" target="_blank" rel="noreferrer noopener">00.52</a><br><strong>All right, we&#8217;ll jump into the core topics, but first a bit of background for our listeners who may not be familiar with you. You worked on pandas; you were a core member of the pandas team. You were very early on with Parquet as well. And at some point, you became convinced that for AI workloads, the tools you had worked on (Parquet, pandas) were not enough. So what was the moment of realization for you that these traditional tools that were foundational for analytics were lacking?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=93" target="_blank" rel="noreferrer noopener">01.33</a><br>Absolutely. So I worked at a company called Tubi TV, which was video on-demand and streaming. So movies and TV. And it was there that I ended up dealing with a lot of what I guess I would call AI data. So we had to have embeddings for personalization, video assets, image assets, audio, text for subtitles, and all of those things. All of those did not really fit into the traditional data stack: you know, pandas, Spark, Parquet, and even Arrow. So that was sort of the inspiration for me to start LanceDB.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=135" target="_blank" rel="noreferrer noopener">02.15</a><br><strong>And Chang, at this point, do you think that more people are aware of this disconnect between those tools and the kinds of tools they’ll need moving forward?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=150" target="_blank" rel="noreferrer noopener">02.30</a><br>When I talk to data infrastructure folks who are building and managing that stack for dealing with this kind of data, there&#8217;s broad recognition that something has to be done, that the existing stack is just not sufficient to deal with this data. And what&#8217;s more interesting is that this data is also becoming a lot more valuable because of AI.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=172" target="_blank" rel="noreferrer noopener">02.52</a><br><strong>So obviously, before you came on the scene, there was this wave of vector stores or vector databases which were optimized for retrieval. So let&#8217;s say I&#8217;m a listener and all I have is text. Do I need anything beyond the vector database?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=197" target="_blank" rel="noreferrer noopener">03.17</a><br>Even if you just have text and you just have text embeddings, the creation of those embeddings and then the management of all of those data assets—your metadata, the actual documents, how to serve that—a lot of that falls outside the purview of a vector database. The vector databases tend to be very narrow solutions for a very narrow problem, whereas something like LanceDB takes a broader view of, “When you have AI data, what are all the things you need to do to it throughout that life cycle of application development or model development? And how do we build a tool and a system that allows you to simplify your life by having one system to do all of the major workloads throughout that life cycle?”</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=253" target="_blank" rel="noreferrer noopener">04.13</a><br><strong>And by the way, for our listeners, there&#8217;s LanceDB and then there&#8217;s the open Lance file format, and I want to ask you about this file format in a second. But you mentioned something about vector databases, and you were kind of saying that, you know, they&#8217;re not great at creating the embeddings. But Chang, the vector database people, they never really positioned themselves as responsible for creating the embeddings, right? So they just assume that you&#8217;ll show up with embeddings.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=287" target="_blank" rel="noreferrer noopener">04.47</a><br>That&#8217;s right. But even if you take that narrow view, what we find in enterprises today is a lot of folks have an offline generation process in the data lake itself, where they chunk up the documents, then they generate the embeddings, then they have what they call an offline store, then they have to copy-paste that data into a vector database for serving. So there&#8217;s a lot of data syncing [and] data movement, so it creates expense and there&#8217;s a lot of complexity.</p>



<p class="wp-block-paragraph">And so that&#8217;s the.&nbsp;.&nbsp;. Even for just text-based workloads, even just for pure vector search, that tends to be a big pain point. And then number two is that vector databases, a lot of times, don&#8217;t pay as much attention to the overall retrieval stack, right? If you remember, the task for users is “I want to find the right data in my dataset,” and vector search is just one technique. You have many different kinds of techniques: full-text search, or even techniques outside of search. You might have SQL queries that you want to run, filters, regexes. All of that goes into a rich and very accurate retrieval process. And vector databases, in general, do not expand beyond just that simple semantic or vector search.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=370" target="_blank" rel="noreferrer noopener">06.10</a><br><strong>So I mentioned the Lance open file format, which.&nbsp;.&nbsp;. I guess the shortcut that people use is like Parquet for AI, but it&#8217;s actually both a file and table format. So maybe give our listeners, Chang, a high-level description of the Lance format and why it&#8217;s become so popular.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=393" target="_blank" rel="noreferrer noopener">06.33</a><br>Lance is what we call a lakehouse format. It is quickly becoming the new open source standard for multimodal data. And what I mean by a lakehouse format is that it spans a couple of different layers. So you mentioned in the beginning a file format. So this is the equivalent in the stack to Parquet, where we would talk about “How do we lay out the data in a particular file?” And at this layer, the innovation in Lance is that it is really, really good for random access without sacrificing any speed in scans. And our files are actually smaller than Parquet for many AI datasets.<br><br>The next layer is usually what we call a table format, which is occupied by projects like Iceberg and Delta and Hudi today. And [the] Lance format comes in at this layer. We have much better designs, more optimizations for machine learning experimentation, so doing backfills easily, doing two-dimensional data evolution, being able to handle really large blob data like videos and images, and then just being able to do a branching strategy that supports true Git-for-data semantics and takes the best of Parquet and Iceberg.</p>



<p class="wp-block-paragraph">And then finally, there&#8217;s a third layer, which is about indexing so that you can have fast scans, fast searches, fast queries. So when you put all that together, that&#8217;s what we call the Lance lakehouse format.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=491" target="_blank" rel="noreferrer noopener">08.11</a><br><strong>I described Lance as open. Can you kind of clarify what that means, because I actually don&#8217;t know?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=499" target="_blank" rel="noreferrer noopener">08.19</a><br>Number one is Lance format is open source. It&#8217;s Apache 2.0 licensed. You can find it on our GitHub. We have community governance; [we] have PMC members from lots of external contributors. And then I think beyond that, there&#8217;s open source and there&#8217;s open source, right? I think what Lance format is designed for is a true open architecture as well. So not only is it open source; it also plays really well into the rest of the data ecosystem.</p>



<p class="wp-block-paragraph">So for example, when people compare us to Parquet and Iceberg, well, we&#8217;re not designed as a head-to-head competitor with Parquet and Iceberg. We will slot into the same Polaris data catalog, so you can have one unified view on all of your datasets, but then under the hood it can be Parquet/Iceberg for BI data and Lance for your AI data. And then Lance itself plugs in natively to Spark and pandas and Polars and DuckDB and any sort of open data tooling that you&#8217;re already used to.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=571" target="_blank" rel="noreferrer noopener">09.31</a><br><strong>So operationally then, Chang, if I&#8217;m a data architect, should I think of Lance as, “OK, so I have Parquet and these table formats like Delta and Iceberg for my structured data. And then if it&#8217;s nonstructured, which could mean video, audio, and also text, right? So then I have to bring in this other format, Lance.” Is that operationally what happens in practice?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=607" target="_blank" rel="noreferrer noopener">10.07</a><br>Yeah, often what the data infra folks and data engineers we talk to interact with is the tooling, right? So they&#8217;re looking at their data pipelines, they&#8217;re looking at maybe their Spark jobs or their search applications, and then those are the jobs that actually interact with the underlying storage, for example. And so instead of.&nbsp;.&nbsp;.&nbsp;</p>



<p class="wp-block-paragraph">And that data transfer process is actually really easy through Apache Arrow. And most of the time, it&#8217;s really just one line of code change. It&#8217;s the same Spark code, for example. Instead of writing to Parquet, you&#8217;re writing to Lance. And it simplifies your overall data pipeline by bringing all of your tabular data and metadata, along with your multimodal data and also embeddings, all into the same place.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=665" target="_blank" rel="noreferrer noopener">11.05</a><br><strong>And then in terms of workload, you alluded to the fact that the previous-generation vector stores excelled at something very specific, maybe retrieval. So is Lance equally specialized in the sense that, “All right, Lance is great for X, and X might be, I don&#8217;t know, analytics, but it doesn&#8217;t excel in other things”? Describe the kinds of workloads that teams using Lance are running.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=699" target="_blank" rel="noreferrer noopener">11.39</a><br>So very high-level, the summary is LanceDB, our enterprise data platform, excels at helping our customers manage really large-scale AI data. So embeddings for search, adding new features and extracting new columns, enriching their dataset, doing data curation and exploration, and then feeding that to GPUs really quickly for distributed training jobs so that they can get as high GPU utilization and as high model FLOPs utilization as they can.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=740" target="_blank" rel="noreferrer noopener">12.20</a><br><strong>You&#8217;ve used the word multimodal a few times, and I&#8217;ve always been a proponent of people really making sure that their data infrastructure is positioned for this multimodal world. But sometimes I question this assumption in the following sense, right? Is multimodality a Bay Area bubble thing? In other words, if I go to the East Coast and talk to, I don&#8217;t know, Goldman Sachs or an insurance company, are they still grappling with legacy systems that are mostly structured data? What they want to do is be able to do all this fancy AI stuff now with agents, but still using the old-school data that they have.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=792" target="_blank" rel="noreferrer noopener">13.12</a><br>I think when we talk about multimodal data, a lot of times what comes to mind first is video generation, image generation, all of those. Self-driving cars.&nbsp;.&nbsp;. So there&#8217;s a lot of high-tech, cutting-edge applications that are multimodal. But I think if you look at more traditional enterprises, they already have a lot of multimodal data.&nbsp;</p>



<p class="wp-block-paragraph">So you just mentioned insurance: They have millions of documents and PDFs and contracts lying around. Insurance especially will have top-down views of houses and boundaries so that they can figure out and assess risk a little bit better. The way I think about it is before AI, it was just really hard to get value out of that data. They just really hadn&#8217;t paid as much attention.</p>



<p class="wp-block-paragraph">So it&#8217;s kind of like when I clean up my house, what I like to do is just like move all the mess into a back room or storage. And so then I don’t have to think about it, right? My wife yells at me all the time. She opens up the storage and everything kind of falls out. And so I feel like with multimodal data, this is kind of what traditional enterprises have done: They didn&#8217;t know what to do with it. They stuck it in some directory in SharePoint or something like that and kind of just left it there for storage. But there&#8217;s actually a tremendous amount of value, and AI is helping them unlock all of that. So I think in the next few years, especially, we&#8217;re going to see a lot more attention paid to, “If we can get a lot more value out of this data, how do we actually manage it? How do we work with it? And how do we combine it with the rest of our data stack so that it&#8217;s governed within a single entity?”</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=906" target="_blank" rel="noreferrer noopener">15.06</a><br><strong>The hot thing a few years ago in data infrastructure was the lakehouse, right? Great term we introduced. [laughs]</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=918" target="_blank" rel="noreferrer noopener">15.18</a><br>I wonder who came up with that one. [laughs]</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=922" target="_blank" rel="noreferrer noopener">15.22</a><br><strong>Yeah. So you folks are starting to use the term multimodal lakehouse. So compare the status of the lakehouse.&nbsp;.&nbsp;. [The term] is I think now widely used, right? And then now you&#8217;re introducing the multimodal lakehouse. So where is the multimodal lakehouse now kind of mature, and where does it still need to do some work?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=950" target="_blank" rel="noreferrer noopener">15.50</a><br>Just for the audience who&#8217;s not as familiar, the really, really simplified way I think about just a lakehouse is you have all your data in one place in the data lake, and then you have a combined data warehousing layer on top that provides structure, tables, and structured ways to run workloads on all of that data.&nbsp;</p>



<p class="wp-block-paragraph">Now, the way we think about multimodal lakehouse is in a couple of different ways. One, the data changes so that you go from purely tabular data or maybe like clickstream data to now all sorts of multimodal data. So from embeddings to all of your multimedia types. So that changes a lot about how you can read and write data efficiently, how you manage that, how you synchronize that with metadata.<br><br>Number two is the workloads also are multimodal. You&#8217;re not just thinking about running SQL and analytics workloads. You&#8217;re now thinking about search. Now you&#8217;re thinking about training. Now you&#8217;re thinking about feature engineering and “How does your lakehouse interact with GPU clusters?” and all of those things that traditional lakehouses are not very good at.<br><br>And then I think the third layer, where the term “multimodal” comes in, is that traditional lakehouses tend to be good only at batch offline processing. And then if you want to do serving, online processing, you probably need to introduce some sort of OLTP kind of database or some system that&#8217;s primarily for serving. Well, with LanceDB, because of the innovations in the format, you can actually do both at the same time. So the online-offline scenario can also become multimodal in this sense.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1064" target="_blank" rel="noreferrer noopener">17.44</a><br><strong>So if I understand what you&#8217;re saying, you&#8217;re multimodal in multiple senses. So multimodal data types, multimodal workloads, and multimodal kinds of operations. So right now, in the Databricks world, they have—I don&#8217;t think they used the word multimodal. If anything, they go back to that HTAP kind of thing, so [a] hybrid transactional analytics kind of processing engine. I think through an acquisition, now they are very good at Postgres. I forget what they call this. [Chang: A lakebase.] So they have the transactions, and they have the analytics. So what you&#8217;re saying is that your vision of the multimodal lakehouse has that hybrid transactional analytics, multimodal types of data, and then multimodal workloads. Is that a fair summation? Surely, Chang, certain aspects of what you just described are more fleshed out than others, right? So what areas do you anticipate you folks will be working on hard, in terms of multiple notions of multimodality?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1156" target="_blank" rel="noreferrer noopener">19.16</a><br>Number one is actually scale. Scale is actually the biggest driving factor late last year and this year. And a lot of that has been the rise of agents. Because of the rise of agents, data volume and scale, query throughput and scale, and performance and latency requirements, all of those things have just kind of exploded. And that&#8217;s the thing that we find we&#8217;re uniquely suited for. And that&#8217;s something that we&#8217;re pushing a lot on. Oftentimes when we talk to customers, really what we think about is like, trillion is the new billion. And we have folks who probably are operating at a thousand times the scale that they were just a year ago or two years ago.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1222" target="_blank" rel="noreferrer noopener">20.22</a><br><strong>I guess the hack that people will do for some of these things, Chang, is just let&#8217;s put the files in S3 and then use a database somehow. So are you still seeing a lot of people kind of try to do this?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1239" target="_blank" rel="noreferrer noopener">20.39</a><br>Yeah, I mean, I think there are a few attempts that [are] doing that. And I think there&#8217;s generally a trend because of the data scale, like object storage is kind of the only sort of cost-effective and scalable storage backend for a lot of these newer data storage systems. I think where the challenge lies for data infrastructure providers is “How do you actually have scalability and high performance and maintain the cost advantages of S3 and object store?” That is, I think, the difficult challenge. And so we actually have a recent blog article talking about how we do that at 10 billion-vector scale.<br><br>At smaller scales, that&#8217;s actually really easy. You just slurp up all the data from S3 into some caching system. You can serve it from there in any in-memory system. That&#8217;s a really easy problem. There&#8217;s tons of open source projects, Lance, for example, that can help you do that pretty effectively. And then the challenge is really at scale. If you have 10 billion vectors, pretty much, your only cost-effective solution is to store that on object storage. Then, you know, imagine the query times if you were just targeting S3 directly. So then indexing challenges and search and caching and all of that, that becomes a big distributed systems problem. So that&#8217;s what we solve.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1336" target="_blank" rel="noreferrer noopener">22.16</a><br><strong>Like you said, many data engineering and data infrastructure teams are trying to think through, “So what does our infrastructure look like in a world of agents?” right? So imagine—this isn&#8217;t happening yet—the equivalent of OpenClaw in enterprise, where a single employee might have 10 of these AI delegates or AI assistants. Some of the things that come up: One, identity management, so access control, identity management. Secondly, maybe some of these AI agents and AI delegates don&#8217;t really need anything permanent. They just want something ephemeral. So stand up a LanceDB for a minute and then make it go away. Are these some of the things that you are starting to think of?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1394" target="_blank" rel="noreferrer noopener">23.14</a><br>Yeah, so for our cutting-edge customers, that&#8217;s already the reality. We specialize a lot in infrastructure for model training, for example. So if you think about features, like a researcher might have, “Hey, I have a feature idea. There&#8217;s two input features, each with 10 variants. And then I have some output feature that combines the two.” Well, now I&#8217;ve got 100 different variants. So before, there was a limited [number] of variants that I could test as an individual researcher manually. But now I can use agents to run all of that automatically. And I can just go to sleep and it&#8217;ll run. Well, now humans can go to sleep, but then the agents are presenting a lot of load on the underlying data infrastructure. This year we&#8217;re talking about going from hundreds of queries per second with plain RAG a couple of years ago to a hundred thousand queries per second in this land of agents.</p>



<p class="wp-block-paragraph">And then when it comes to security and compliance, there&#8217;s a lot of churn in the stack about sandboxing and ephemeral systems. And when we talk about object storage, this is actually an even bigger challenge, right? So if your source of truth is on object store, that&#8217;s actually the only way you can make this ephemeral workload work out well so that when you have hot data, you cache it, you serve it for a time, and then that can go away. And then the cache can expire [to] be replaced by the next hot workload. And you can do that without having to pay for really expensive memory and NVMe for all of your data.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1504" target="_blank" rel="noreferrer noopener">25.04</a><br><strong>So the other thing, Chang, that comes up with agents right now, the hot thing that it seems like there&#8217;s a gazillion people working on, is this notion of memory. So I guess my question to you is, if I have a bunch of agents and then I have a multimodal lakehouse.&nbsp;.&nbsp;. I have a lakehouse and now I have memories. So I have three different systems that I have to maintain. What&#8217;s your guys’ take in terms of agent memory?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1542" target="_blank" rel="noreferrer noopener">25.42</a><br>LanceDB open source is actually the main memory plug-in for OpenClaw and a number of other agents like CrewAI, for example. And for a lot of these agent frameworks and harnesses, there&#8217;s a couple of different requirements. Number one is just lightweight, super easy to use. LanceDB is the only one that supports hybrid search, reranking, all these fairly sophisticated retrieval mechanisms, without having to maintain a service.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1580" target="_blank" rel="noreferrer noopener">26.20</a><br><strong>Before you continue.&nbsp;.&nbsp;. All right, so this notion of lightweight, right? On the one hand, there&#8217;s the notion of multimodal lakehouse and a lakehouse is never lightweight, right? But then, it seems like you folks are positioning yourself also in the DuckDB kind of very lightweight SQLite world. Can you clarify what you mean by lightweight when you are supposedly a lakehouse, right?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1609" target="_blank" rel="noreferrer noopener">26.49</a><br>So what I mean by lightweight here is that if you think about it from an agent perspective, it simplifies a lot of things if you don&#8217;t have to connect to another service and talk to another system in order to get access to your memory and to retrieve from memory. So that&#8217;s what I mean. So the open source, the.&nbsp;.&nbsp;.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1635" target="_blank" rel="noreferrer noopener">27.15</a><br><strong>But then you’re large-scale infrastructure.&nbsp;.&nbsp;. So then if I&#8217;m a lightweight agent, how can you.&nbsp;.&nbsp;. This is where I guess I&#8217;m a bit confused. Can you clarify, why am I bringing along a big piece of infrastructure if I&#8217;m a lightweight agent?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1655" target="_blank" rel="noreferrer noopener">27.37</a><br>Right. LanceDB open source is actually very lightweight. So there&#8217;s no heavy infrastructure involved. This is why it&#8217;s perfect for memory. Because a lot of times, memory is very ephemeral. So you just interact with a session and then when that session is gone, you don&#8217;t want to retain all of that. At most you might want to compress some of it and then retain it for downstream historical processing. But most of the time, it&#8217;s just gone. You don&#8217;t have to think about it. And so that&#8217;s what I mean by lightweight. So there&#8217;s a version of that.</p>



<p class="wp-block-paragraph">And then for large-scale retrieval, you have a large historical corpus, if you&#8217;re working in a corporate environment, if you have an agent that&#8217;s searching through patent history or something like that, right? And then that&#8217;s where the infrastructure comes in. Well, if I have a petabyte of data out there that I need to search through, the embedded library is not going to do. So you need to have a scalable system out there, but it needs to be easy to use. And from an agent perspective, it&#8217;s the same interface. So from the agent perspective, it&#8217;s just as easy, but there is a scalable system for that large amount of data that&#8217;s kind of hidden beneath the surface there.&nbsp;</p>



<p class="wp-block-paragraph">I think for agents, that&#8217;s sort of just one of the requirements. The other one is having more sophisticated retrieval so that agents can find what they&#8217;re looking for. And different agents will want to look for data in different ways. So being able to support all of that without having like a million different plug-ins to do each modality, I think that&#8217;s also something very important for agents as well.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1768" target="_blank" rel="noreferrer noopener">29.28</a><br><strong>By the way, I was playing devil&#8217;s advocate there because I actually use LanceDB every day on my laptop. It can be something that you use on your laptop, just in-memory.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1782" target="_blank" rel="noreferrer noopener">29.42</a><br>Yeah. So I think what we find is that when you make it really easy for agents to actually use it, that&#8217;s when scale really takes off. The way we&#8217;re looking at it is agents are kind of like an ideal gas: if you make it easy for them to use, no matter how much compute you have, no matter how much data and infrastructure you have, agents will expand to fill all of it, right? So what we&#8217;ve seen is.&nbsp;.&nbsp;. We talked about growth and creep throughput. And then because of complex agents, there&#8217;s compression and latency. Your agents want a hundred-millisecond or like 20-millisecond latencies now. And then we also see a lot of proliferation of data.<br><br>One of the largest users of LanceDB told us they&#8217;re now managing something like a billion tables. That&#8217;s just because they have so many agents and so much data that they have to manage that number of tables within their system. Any computational and data management dimension you can think of, agents will expand to however much capacity you give them.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1859" target="_blank" rel="noreferrer noopener">30.59</a><br><strong>So this is a two-part question. Our listeners may not be aware, but for some reason, LanceDB kind of blew up a little more during the launch of OpenClaw. So I guess my two questions are one: How did this OpenClaw community land on Lance? And have you heard back from them, and have they told you what they liked about Lance?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1892" target="_blank" rel="noreferrer noopener">31.32</a><br>Yeah, I mean, a lot of that is what we just talked about: It’s lightweight; it&#8217;s easy to use the model.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1899" target="_blank" rel="noreferrer noopener">31.39</a><br><strong>But how did it happen? How did they land on Lance? Do you know?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1903" target="_blank" rel="noreferrer noopener">31.43</a><br>So my recollection was that originally it was a recommendation from Claude or something like that. And I think [Lance] was the only one out there that met the requirements, was embedded, lightweight, sophisticated retrieval. And it can do both in-memory on NVMe local and also on object store.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1931" target="_blank" rel="noreferrer noopener">32.11</a><br><strong>Interesting. So since then, has this kind of marriage [with OpenClaw] continued?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=1940" target="_blank" rel="noreferrer noopener">32.20</a><br>Yeah, we continue to see engagement from the open source community. Our open source continues to grow. I think at the latest, we&#8217;re at around 14 million downloads a month across our open source projects. And we&#8217;re super excited about working and supporting the open source community on that. What we see now is demand for a more filesystem-like interface. It&#8217;s easier for agents a lot of times to interact with a filesystem interface.<br><br>Now, I&#8217;m choosing my words carefully. I don&#8217;t mean a filesystem. I just mean an interface. This is something that we&#8217;re looking into—trying to see what it would look like to put a filesystem interface over a LanceDB or Lance format. Based on the usage patterns that we see from agents, this is fairly straightforward to do. So I think if you&#8217;re listening and this is something interesting, we&#8217;d love to have early users come check it out and test it out with us.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2009" target="_blank" rel="noreferrer noopener">33.29</a><br><strong>It&#8217;s interesting, actually, as you were talking there, it just dawned on me that this notion.&nbsp;.&nbsp;. These various notions of multimodality that you described earlier actually might be another reason why people landed on Lance. Because there are other vector search systems that you can run in-memory or embedded. If you want to build agents that are more capable moving forward, then the various notions of multimodality that Chang described earlier might come in handy, right?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2046" target="_blank" rel="noreferrer noopener">34.06</a><br>Yeah, yeah, absolutely. I will say that like, I&#8217;m sort of a.&nbsp;.&nbsp;. There are AI maximalists. I&#8217;m sort of a multimodal maximalist. So my prediction is that in five years, multimodal won&#8217;t even be a word anymore. It&#8217;ll just be data, and it&#8217;ll just be multimodal by default. People will just say data, and it&#8217;ll be inclusive of all the different modalities. And when we think about data engineering, there won&#8217;t be multimodal data engineering. It&#8217;ll just be multimodal by default when we say data engineering.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2077" target="_blank" rel="noreferrer noopener">34.37</a><br><strong>Interesting, which actually.&nbsp;.&nbsp;. As we&#8217;re winding down here, I was going to ask you, If I&#8217;m a CxO or an architect at an enterprise, what data infrastructure decision do you think I should bear in mind? Or I guess to put it negatively, what are some of the decisions I can make right now that potentially can hurt my team moving forward in the next year?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2108" target="_blank" rel="noreferrer noopener">35.08</a><br>Right, right. So I think we&#8217;re already.&nbsp;.&nbsp;. For a lot of early adopters, we see big pain points around new AI data silos. So one pattern, I wouldn&#8217;t call it an anti-pattern, but one I would say pain point is if you&#8217;re a CIO or CDO or something like that, chances are a lot of your teams within the enterprise have charged forward with their own AI applications and AI stack. And so now the centralized data platform team is faced with maybe like 10 different vector databases that they have to support and maybe five different ways to store the AI data, some in images and some just embeddings and others, many different modalities. So that becomes a big pain point going forward, right? So as companies go from “Let&#8217;s try out AI in this particular area” to, I guess, AI transformation, having large swaths of the enterprise be AI-assisted or AI-native, that becomes a big pain point.</p>



<p class="wp-block-paragraph">I think if I were a CIO or a CEO or CTO at a larger enterprise, I would be looking forward a little bit to think about how to set up all of my teams across the enterprise for success: “How do I allow them to charge forward and iterate very quickly without presenting this crazy, untenable challenge to the central platform team?” So that&#8217;s what I would be thinking of. That&#8217;s actually.&nbsp;.&nbsp;. At LanceDB, that&#8217;s what we&#8217;re building for.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2225" target="_blank" rel="noreferrer noopener">37.05</a><br><strong>If your thesis is multimodal data matures over the next few years, and so do agents and everything that comes with agents, including memory, what does the data stack look like in a few years?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2242" target="_blank" rel="noreferrer noopener">37.22</a><br>In broad strokes, the base layers are not going to change all that much. I think the infrastructure layer stays roughly the same. There&#8217;s going to be object storage. There&#8217;s going to be a storage layer. And then the compute layer will start to change.&nbsp;</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2269" target="_blank" rel="noreferrer noopener">37.49</a><br><strong>Ray. [laughs]</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2272" target="_blank" rel="noreferrer noopener">37.52</a><br>What I think we&#8217;ll see is that the middle layer of data tooling will start to melt away a little bit because of agents.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2284" target="_blank" rel="noreferrer noopener">38.04</a><br><strong>Define data tooling.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2287" target="_blank" rel="noreferrer noopener">38.07</a><br>I don&#8217;t want to name names, but I think there&#8217;s a lot of [what] I would call developer middleware for data where it&#8217;s neither the infrastructure layer nor is it the layer that&#8217;s interfacing with agents and users directly, right? That middle layer, I think will melt away a little bit or at least be very much refactored. So there&#8217;s going to be a lot of churn in that. It&#8217;s going to be interesting to see what shakes out. I think what will happen is that agents will continue to push that layer down, and agents will want to get as close to the base layer as possible.&nbsp;</p>



<p class="wp-block-paragraph">If you look at this middle layer, there&#8217;s really two things that they&#8217;re providing. One is a precanned data model for how their users think about the problem, right? So they built that on top of the base infrastructure. So they would build that on top of LanceDB, for example. And then the other thing that they have in this middle tier right now is user interaction, right? The combination of the two is how they capture user workflows. And that&#8217;s the core of that. I think what happens in the future is that that UI workflow layer will largely go away and be replaced by agents.</p>



<p class="wp-block-paragraph">But useful data models will still be useful, and they&#8217;ll still stay. Yes, you can have agents directly talk to random bits on S3, but why waste all that intelligence? It&#8217;s not worth the token cost. A well-formed data model is the right base layer for agents to interact with. And so I think that&#8217;s what we&#8217;ll see, is that melting away and reformatting of that middle layer. And I think this is something when I talk to data builders and AI infrastructure builders today, I think we&#8217;re all seeing that all at the same time.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2422" target="_blank" rel="noreferrer noopener">40.22</a><br><strong>What I describe to people right now as kind of the forward-looking stack has two main parts: So one, you have the multimodal lakehouse built around Lance, LanceDB, and the Lance format. And then you have the AI compute layer, which I call the PARK stack, so PyTorch, AI foundation models, Ray, and Kubernetes. So PARK stack here, and then your lakehouse will be around Lance and the Lance format. I see that quite a bit actually. I definitely see the PARK stack, PyTorch, Ray, Kubernetes. And now I&#8217;m starting to see more and more people talking about Lance and Lance format. Do you think of these as complementary or what?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2476" target="_blank" rel="noreferrer noopener">41.16</a><br>Yeah, yeah, absolutely. I think we have close relationships with Ray and Spark and really like native-level integrations. And also PyTorch, right? I don&#8217;t think that&#8217;s going away. Those are either like.&nbsp;.&nbsp;. PyTorch is essentially interacting with developers directly, whereas Spark and Ray are very much infrastructure layer, so I don&#8217;t think those things are going anywhere. Kubernetes is definitely still around.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2511" target="_blank" rel="noreferrer noopener">41.51</a><br><strong>Yeah, yeah, yeah, yeah. And so what big trend are you paying attention to right now that we haven&#8217;t yet talked about? This is how we close.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2528" target="_blank" rel="noreferrer noopener">42.08</a><br>What&#8217;s been really interesting that we didn&#8217;t talk about is the rise of open source models. And I think that&#8217;s going to have a big impact, maybe starting next year or even the remainder of this year. Enterprise AI. [Ben: Open weight.] Open-weight models. That&#8217;s correct. Yeah.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2555" target="_blank" rel="noreferrer noopener">42.35</a><br><strong>Who&#8217;s the source? Because right now the main source is China for the better ones. And I still see a lot of hesitation for enterprise teams to adopt such models. I actually just wrote a short post about this. Basically the perception seems to be that while the open-weight models from China are closing the gap, there is still a gap, and there&#8217;s structural reasons why there&#8217;s a gap. So one is the Chinese seem to be benchmaxxing. You know, they&#8217;re optimized for the benchmark, so not real workloads. And then secondly, there is a compute challenge, which makes iteration for them more challenging. So whereas the labs here may update their models every three or four months, the Chinese have to wait six months. And then finally, the data pipelines and the investment in data pipelines is just not the same as you would see at, for example, Gemini, Anthropic, and OpenAI. They’re licensing data from all over the place. The Chinese labs tend to do distillation, which means.&nbsp;.&nbsp;. When you&#8217;re doing distillation, your cap is basically the model you&#8217;re distilling from.</strong></p>



<p class="wp-block-paragraph"><strong>And then there&#8217;s the flywheel—OpenAI and Anthropic and Gemini have a lot of users, so therefore they get better as more users interact with them.&nbsp;.&nbsp;.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2660" target="_blank" rel="noreferrer noopener">44.20</a><br>That&#8217;s right. Don&#8217;t forget the open-weight models in China are also.&nbsp;.&nbsp;. [cross-talk] Here&#8217;s the way I think about it, right? So I think as AI adoption grows exponentially within enterprises, they are going to be extremely motivated to invest in their own inference on open-weight models, right? Just because there&#8217;s such a drastic cost in tokens.</p>



<p class="wp-block-paragraph">Because of that economic incentive, I think there&#8217;s going to be a lot more incentive for companies to create better open-weight models. If you look at the open-weight models in China, one, the fact that they can create open-weight models of this quality on really limited hardware is really telling. So a team in the US theoretically should be able to create much better quality open-weight models because of that.<br><br>Number two, I don&#8217;t think the distillation argument is actually true. If you look at the report that Anthropic threw out, right, like if you look at the numbers of how much distillation they accused DeepSeek of doing, it&#8217;s actually not that much. It&#8217;s basically negligible, right? Like MiniMax is a legit big offender, but DeepSeek, basically, didn&#8217;t really do that much. I don&#8217;t think distillation is a big factor in the quality of open-weight models anymore.<br><br>So then there is a remaining gap in quality. Maybe there&#8217;s a three- to four-month gap between open-weight models and SOTA. But what&#8217;s interesting is the experiments that people have done is, open-weight models, one, are cheaper, and they&#8217;re much faster. So if you have a coding agent task, you can do a one-shot with SOTA models or you can do multiple rounds and iterations on an open-weight model, which gets you the same quality, still lower total costs and tokens, and you finish around the same time, or you actually might finish faster. 
So then I think a lot of that is lack of familiarity and a skill gap, where if you have to do a few shots, that complexity is way more than what people want to think about right now.<br><br>So the pattern today is you go into production with SOTA models, then you reach some cost-prohibitive moment where you say, “OK, what are the areas where there&#8217;s not requirements for really heavy intelligence but still have a lot of token costs, and then I can replace [them] with open models?” And I think that will happen more and more across enterprises. So I think that&#8217;s going to be a big trend to watch this year and next.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2838" target="_blank" rel="noreferrer noopener">47.18</a><br><strong>And actually, as you mentioned, my conversations are a product of the stage of adoption, which is basically [the] early stage. I will deploy with state-of-the-art models because I&#8217;m early. And then as my agent or my application gets used, then I start paying attention to cost, latency, and all these. And then I can worry about swapping the models then. And hopefully, we will have some Western labs start cranking on open-weight models again, right? It seems like Meta is off the table. The Gemma folks produce models, but they&#8217;re meant for on-device, I think. Maybe there&#8217;s an opening there for someone to start up something that&#8230;</strong></p>



<p class="wp-block-paragraph"><strong>Especially as people become more clever in terms of training and tools like LanceDB make training more affordable somehow. We&#8217;ll see what happens. And with that, thank you, Chang.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=6URiiQmeiXo#t=2904" target="_blank" rel="noreferrer noopener">48.24</a><br>That&#8217;s right. Thank you, Ben.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-chang-she-on-data-infrastructure-for-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Why Doesn&#8217;t Anyone Teach Developers About Context Management?</title>
		<link>https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/</link>
				<pubDate>Thu, 14 May 2026 10:57:49 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18713</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Why-doesnt-anyone-teach-developers-about-context-management.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Why-doesnt-anyone-teach-developers-about-context-management-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[We overestimate what AI can remember and underestimate what it can orchestrate.]]></custom:subtitle>
		
				<description><![CDATA[This is the sixth article in a series on agentic engineering and AI-driven development.&#160;Read part one&#160;here, part two&#160;here, part three&#160;here, part four&#160;here, and part five here. I think context management is one of the most important skills in AI-driven development, and it&#8217;s weird that compared to other AI-related topics, almost nobody talks about it. We [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>This is the sixth article in a series on agentic engineering and AI-driven development.&nbsp;Read part one&nbsp;<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two&nbsp;<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three&nbsp;<a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, part four&nbsp;<a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">here</a>, and part five <a href="https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
</blockquote>



<p class="wp-block-paragraph">I think context management is one of the most important skills in AI-driven development, and it&#8217;s weird that compared to other AI-related topics, almost nobody talks about it. We talk about prompt engineering, about which model to use, about agentic workflows and tool use. But more than anything else, the thing that actually determines whether your AI session produces good work or mediocre work is how well you manage context (or if you even do it at all!).</p>



<p class="wp-block-paragraph">A lot of developers using AI tools treat all this &#8220;context&#8221; talk as AI jargon that can be dismissed, and it&#8217;s not hard to understand why. AI development tools have gotten so easy that an experienced developer can be incredibly effective by just combining vibe coding with critical thinking (that&#8217;s the central idea behind <a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">the Sens-AI Framework</a>), and not really think about context at all. That&#8217;s ironic, because despite all the &#8220;I&#8217;m functionally illiterate but I just vibe coded an entire multitenant SaaS platform&#8221; articles, and despite everyone&#8217;s general concern that AI will put all developers out of work, the development skills you&#8217;ve been working on for years make you especially effective at writing code with AI—and context management is where those skills really shine.</p>



<p class="wp-block-paragraph">Just to make sure we&#8217;re all on the same page, <strong>context</strong> is (basically) everything the AI is thinking about right now: your prompt, the conversation so far, the files it&#8217;s read, the decisions you&#8217;ve made together. When you start a fresh session with an AI, its context is wiped clean, and it starts over with just the initial instructions it&#8217;s been given. Managing context is central to building AI agents and skills. But it&#8217;s also really important when you&#8217;re using tools like Claude Code, Cursor, or Copilot for day-to-day development work. Context is typically measured in tokens, and there&#8217;s a finite amount of it. The <strong>context window</strong> is the maximum amount of information (input and output tokens) an AI model can process and retain at once. When it fills up, the AI starts losing track of things, and that&#8217;s when you start to see it give wrong and weird answers.</p>
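


<p class="wp-block-paragraph">To make the idea of a finite token budget concrete, here is a minimal Python sketch. It is not taken from any of the tools mentioned above. The <code>ContextBudget</code> class, the four-characters-per-token ratio, and the 80% warning threshold are all assumptions chosen for illustration; real tokenizers and window sizes vary by model.</p>

```python
from dataclasses import dataclass, field

# Rough heuristic and an assumption: real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate a token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


@dataclass
class ContextBudget:
    """Hypothetical tracker for how much of a context window is in use."""

    window_tokens: int  # illustrative window size, not a real model limit
    used_tokens: int = 0
    items: list = field(default_factory=list)

    def add(self, label: str, text: str) -> None:
        """Record something that entered the context (prompt, file, reply)."""
        cost = estimate_tokens(text)
        self.items.append((label, cost))
        self.used_tokens += cost

    def fraction_used(self) -> float:
        return self.used_tokens / self.window_tokens

    def nearly_full(self, threshold: float = 0.8) -> bool:
        """True once usage reaches the (assumed) warning threshold."""
        return self.fraction_used() >= threshold
```

<p class="wp-block-paragraph">The point of the sketch is only that everything the AI reads or writes draws down the same budget, so a few large files can crowd out the decisions you made earlier in the session.</p>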



<p class="wp-block-paragraph">Unfortunately a lot of developers read paragraphs like the last one and their eyes glaze over. Somehow it gets classified in the same part of our brains as learning how our build systems work: boring stuff we somehow don&#8217;t really want to think about because it takes us away from &#8220;real&#8221; programming. That&#8217;s a shame, because when we don&#8217;t understand the basics of how context works we waste a lot of time.</p>



<p class="wp-block-paragraph">For example, here&#8217;s something I see developers do all the time that they absolutely shouldn&#8217;t. They&#8217;re deep into an AI coding session, and the AI has built up a detailed understanding of their codebase (it&#8217;s noticed patterns, it&#8217;s making good decisions). Then they start seeing &#8220;Compacting conversation&#8221; messages, or they notice the little context usage indicator in Cursor or Copilot filling up, and they don&#8217;t really know what that means. But they&#8217;ve learned that closing the session and starting a new one seems to fix the problem. Unfortunately, all they&#8217;ve done is trade compaction for total amnesia. The new session just keeps going, producing output that looks fine, but it&#8217;s giving worse answers and generating worse code because it&#8217;s working from incomplete information.</p>



<p class="wp-block-paragraph">The really weird thing is that I was writing about something really similar all the way back in 2006, long before AI was around, in <a href="https://learning.oreilly.com/library/view/applied-software-project/0596009488/" target="_blank" rel="noreferrer noopener"><em>Applied Software Project Management</em></a>: Missing requirements are especially insidious because they&#8217;re difficult to spot. I was writing about requirements, not AI context, but the problem is the same. I&#8217;ve <a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">written about how prompt engineering is requirements engineering</a>, and this is another place where the parallel holds up. When a requirement is missing, there&#8217;s no artifact to flag it; you just end up with code that doesn&#8217;t do what it&#8217;s supposed to do. When context is missing from an AI session, there&#8217;s no error message telling you what the AI forgot; you just end up with worse answers.</p>



<p class="wp-block-paragraph">The cost of poor context management is actually measurable. A developer on <a href="https://devblogs.microsoft.com/all-things-azure/i-wasted-68-minutes-a-day-re-explaining-my-code-then-i-built-auto-memory/" target="_blank" rel="noreferrer noopener">Microsoft&#8217;s Dev Blog</a> recently timed his own reorientation overhead and found he was spending over an hour a day just reexplaining things to his AI that it had known in a previous session. He&#8217;s not alone. There are now entire frameworks and managed services dedicated to giving agents persistent memory, from lightweight CLIs that query Copilot&#8217;s local session database to managed memory services from Cloudflare. Some of these tools are genuinely useful, but they&#8217;re solutions you need to evaluate, integrate, and maintain before they help you.</p>



<p class="wp-block-paragraph">My goal in this article and the next is to give you four specific things you can do today, using whatever AI tools you&#8217;re already working with. This article covers the problem: why context management matters and how context loss affects the quality of your AI&#8217;s output. The next article covers the specific practices that emerged from building the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a> and <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a>, things you can bring back to your own prompts, skills, and agents immediately. I’ll use real examples from those projects that you can draw on.</p>



<h2 class="wp-block-heading"><strong>We get AI wrong in both directions</strong></h2>



<p class="wp-block-paragraph">I think the through line in all of this is that developers both overestimate and underestimate AI. We overestimate how much it can hold in memory and how well it can make decisions for us. So we&#8217;ll just stuff a whole bunch of material into the context window and assume the AI will work it out, and then get annoyed when it hallucinates or forgets.</p>



<p class="wp-block-paragraph">On the other hand, we massively underestimate its ability as an orchestrator. Your prompt doesn&#8217;t just have to ask a question or ask the AI to generate something. You can give it a multistep workflow where each step writes its results to files, and the AI will coordinate the whole thing, spinning off subtasks and picking up where it left off if something breaks.</p>



<p class="wp-block-paragraph">When developers don&#8217;t take either of those things seriously, context management or orchestration, you get a specific cycle. They treat the context window as infinite and cram everything in. Then when the session gets too long and the AI starts losing track, they throw it all away and start fresh. They never consider the alternative, which is designing the workflow so the AI works from externalized files across independent sessions.</p>
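


<p class="wp-block-paragraph">The alternative described above, a workflow where each step writes its results to files so that work can resume after a break, can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the Quality Playbook&#8217;s actual implementation: <code>run_workflow</code>, the one-JSON-file-per-step layout, and the step names in the usage note are all hypothetical.</p>

```python
import json
from pathlib import Path


def run_workflow(steps, workdir):
    """Run (name, func) steps in order, persisting each result as JSON.

    A step whose output file already exists is loaded instead of rerun,
    so a fresh session can resume from the files rather than from chat
    history. Each func receives the dict of results produced so far.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    results = {}
    executed = []  # names of steps that actually ran this time
    for name, func in steps:
        out_file = workdir / f"{name}.json"
        if out_file.exists():
            # Externalized context: reuse what an earlier session wrote down.
            results[name] = json.loads(out_file.read_text(encoding="utf-8"))
            continue
        results[name] = func(results)
        out_file.write_text(json.dumps(results[name]), encoding="utf-8")
        executed.append(name)
    return results, executed
```

<p class="wp-block-paragraph">With two hypothetical steps, say <code>inventory</code> and <code>review</code>, the first call runs both and writes <code>inventory.json</code> and <code>review.json</code>. A second call against the same folder runs nothing and rebuilds the same results from disk, which is the property that lets independent sessions pick up where the last one stopped.</p>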



<p class="wp-block-paragraph">I discovered this while building the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>. The context management was working so well inside my sessions that I realized the sessions themselves were the bottleneck. I was running the playbook in a single prompt. I think I had a record of over 15 million tokens in a single Copilot GPT-5.4 session that ran for hours, and I did eight of them in parallel. Which incidentally is why I got rate-limited for 54 hours from Copilot, which is completely fair.</p>



<p class="wp-block-paragraph">The playbook was writing everything down to files as it went, which is why those runs could last that long at all. But I didn&#8217;t want that behavior. Running 15 million tokens in a single session is expensive, and if you&#8217;re on pay-as-you-go API tokens instead of a flat-rate plan like Copilot or Claude Max or Cursor, that kind of usage can be a real shock. I wanted to make the playbook available to developers who don&#8217;t want to burn that much at once. And because the context was already externalized to files, splitting into independent phases turned out to be easy.</p>



<h2 class="wp-block-heading"><strong>Ask the AI to write its context down along the way</strong></h2>



<p class="wp-block-paragraph">Before I get into how the pipeline splits things up, I want to talk about the practice that made the split possible in the first place: storing development context in files as you go.</p>



<p class="wp-block-paragraph">I don&#8217;t mean asking the AI to export its notes at the end of a session, or writing up a &#8220;lessons learned&#8221; document after the fact. I mean baking it into the actual instructions you give the AI from the start, so it&#8217;s continually writing and updating context as it works. For Octobatch, the batch LLM orchestrator that was my first experiment in agentic engineering (I wrote about the development process in “<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a>”), I had the AI write developer context in every folder, and that really made it easy to spin up a new session.</p>



<p class="wp-block-paragraph">Here&#8217;s what that looks like in practice. Every new Claude Code session on Octobatch starts with a single line: &#8220;Read ai_context/DEVELOPMENT_CONTEXT.md and bootstrap yourself to continue development.&#8221; That file contains a loading sequence: read this first, then fan out to component-level CONTEXT.md files in scripts/, tui/, pipelines/, each describing its own subsystem at the right level of detail. By the time the AI finishes reading, it knows what the project is, how it&#8217;s built, what&#8217;s currently in progress, and what the active bugs are.</p>
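
<p class="wp-block-paragraph">The loading sequence above can be sketched in a few lines of Python. This is an illustration only: it assumes each component folder holds a CONTEXT.md directly inside it, and the function names are hypothetical. In practice the AI follows the sequence itself by reading the files.</p>

```python
from pathlib import Path


def loading_sequence(root, components=("scripts", "tui", "pipelines")):
    """Return context files in read order: project-level first, then components."""
    root = Path(root)
    ordered = [root / "ai_context" / "DEVELOPMENT_CONTEXT.md"]
    ordered += [root / name / "CONTEXT.md" for name in components]
    # Skip files that do not exist so a missing component does not break bootstrap.
    return [path for path in ordered if path.is_file()]


def bootstrap_text(root):
    """Concatenate the context files, labeling each with its relative path."""
    root = Path(root)
    parts = []
    for path in loading_sequence(root):
        parts.append(f"## {path.relative_to(root).as_posix()}\n{path.read_text()}")
    return "\n\n".join(parts)
```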



<p class="wp-block-paragraph">I think of this as shifting left. Instead of putting constraints in every prompt (don&#8217;t use additionalProperties: false, always test with --limit 3), those rules live in the CONTEXT.md files. The prompt stays clean because the documentation does the heavy lifting.</p>



<p class="wp-block-paragraph">And updating context files is part of every task. Before we commit anything, I have the AI review the context files and make sure they reflect what we just did. If we added a feature or fixed a bug, the context file should reflect that before we commit. Stale context causes the same kinds of problems as stale documentation, except it&#8217;s worse because the AI is actually relying on it to make decisions.</p>



<p class="wp-block-paragraph">I want to be clear about exactly what I mean by &#8220;development context.” Specifically, it’s the information a new AI session needs to get up to speed: what the project is, how it&#8217;s built, and what decisions have been made along the way. Tools like Claude Code read development context from files like <a href="http://agents.md" target="_blank" rel="noreferrer noopener">AGENTS.md</a> (and you can actually go to that website to learn more) at the start of every session, and if you do a thorough enough job of building up your development context and keeping it up-to-date, you can get them fully bootstrapped. Those context files are the blueprints for your AI sessions. I wrote in <em>Applied Software Project Management</em> that building software without requirements is similar to building a house without blueprints. Running AI sessions without externalized context is the same mistake. You&#8217;re relying on what&#8217;s in someone&#8217;s head instead of what&#8217;s written down. And when you&#8217;re working with AI, &#8220;someone&#8217;s head&#8221; is a context window that&#8217;s going to get compacted or thrown away.</p>



<p class="wp-block-paragraph">The most important thing is that what&#8217;s in my head matches what&#8217;s in the AI&#8217;s head. The context file is just a convenient way to help us figure out whether or not we agree. When I start a new Claude Code session on a folder that has a good DEVELOPMENT_CONTEXT.md, the AI reads it and we&#8217;re immediately aligned. When I start a session without one, the AI has to rediscover everything from scratch, and it always misses things. Rediscovery is always lossy.</p>



<p class="wp-block-paragraph">If you&#8217;re not already writing context files as part of your workflow, none of the fancier techniques I&#8217;m about to describe matter. This is the foundation.</p>



<h2 class="wp-block-heading"><strong>Include the why, or the AI will undo your decisions</strong></h2>



<p class="wp-block-paragraph">There&#8217;s a specific thing that has to go into these context files, and it took me a while to learn why it matters so much: the reasoning behind every decision.</p>



<p class="wp-block-paragraph">Octobatch&#8217;s DEVELOPMENT_CONTEXT.md has a section called &#8220;Key Technical Learnings&#8221; with 49 entries, each in a specific format: What happened, Why it matters, When we discovered it, and Where in the code it applies. At the top of that section is a note in bold: &#8220;IMPORTANT: Always include the REASONING (the &#8216;Why&#8217;) for each learning. This prevents future sessions from &#8216;refactoring&#8217; a deliberate decision.&#8221;</p>
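
<p class="wp-block-paragraph">One way to make that four-part format hard to skip is to model an entry as a small record that refuses to exist without its reasoning. The sketch below is hypothetical: the <em>Learning</em> class and its Markdown layout are assumptions, not the format Octobatch actually uses.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Learning:
    what: str
    why: str
    when: str
    where: str

    def __post_init__(self):
        # Refuse entries with no reasoning: the "Why" is what protects the decision.
        if not self.why.strip():
            raise ValueError("a learning needs its 'Why'")

    def to_markdown(self):
        """Render the entry as one Markdown list item (layout is an assumption)."""
        return (
            f"- **What:** {self.what}\n"
            f"  **Why:** {self.why}\n"
            f"  **When:** {self.when}\n"
            f"  **Where:** {self.where}"
        )
```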



<p class="wp-block-paragraph">That note is there because without it, the AI will do exactly that. I had a case with Octobatch where we used recursive set_timer() instead of set_interval() for auto-refresh because Textual&#8217;s set_interval() callbacks aren&#8217;t reliably serviced on pushed screens. Without the &#8220;Why&#8221; in the context file, a future session would look at that code, see a &#8220;cleaner&#8221; alternative, and helpfully refactor it right back to the broken approach.</p>



<p class="wp-block-paragraph">The same principle applies to quality standards. Don&#8217;t just say &#8220;90% coverage for core logic.&#8221; Say &#8220;90% coverage for core logic, because expression evaluation touches randomness and seeding, where subtle bugs produce plausible-but-wrong output. The drunken sailor reseeding bug passed all visual inspection. Only statistical verification caught that sequential seeds created correlation bias (77.5% fell in water instead of a theoretical 50/50).&#8221; Without the &#8220;why,&#8221; a future AI session will argue the coverage target down. Any standard or architectural decision or unusual code pattern that doesn&#8217;t have its rationale attached is vulnerable to being optimized away by an AI that doesn&#8217;t know what problem it was solving.</p>
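
<p class="wp-block-paragraph">To show what &#8220;statistical verification&#8221; can look like, here is a minimal sketch of a proportion check in Python. It is not the test Octobatch runs; the function names, the 3.0 threshold, and the trial count in the note below are assumptions chosen for illustration.</p>

```python
import math


def proportion_z_score(successes, trials, expected=0.5):
    """Standard errors between an observed proportion and the expected one."""
    observed = successes / trials
    std_error = math.sqrt(expected * (1 - expected) / trials)
    return (observed - expected) / std_error


def looks_biased(successes, trials, expected=0.5, threshold=3.0):
    """Flag results more than `threshold` standard errors from the expectation."""
    return abs(proportion_z_score(successes, trials, expected)) > threshold
```

<p class="wp-block-paragraph">With a hypothetical 1,000 simulated walks, 775 landing in water (77.5%) gives a z-score of about 17, far outside anything a fair 50/50 process would produce by chance. A result like that can look plausible on screen and still fail this check immediately.</p>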



<h2 class="wp-block-heading"><strong>The garbage collection problem</strong></h2>



<p class="wp-block-paragraph">A lot of people like to talk about the context window as your AI&#8217;s short-term or working memory, and context that&#8217;s persisted to disk as long-term memory. Personally, I&#8217;m not sure those analogies to human memory work all that well. I think it&#8217;s a lot more useful to find ways to think about context that are similar to how we manage memory in our code.</p>



<p class="wp-block-paragraph">I find it especially helpful to compare context compaction to garbage collection—again, not a perfect analogy but a useful one. When you look at a GC graph in Java, you see the memory slowly fill up and then suddenly drop after each GC. That drop is the runtime figuring out what&#8217;s still being referenced and freeing everything else.</p>



<p class="wp-block-paragraph">The context window does the same thing. Your conversation accumulates tokens, the AI&#8217;s context window fills up, and then compaction happens. The tool (or the model) decides what to keep and what to throw away. Compaction is lossy and automatic, and you don&#8217;t control what survives.</p>



<p class="wp-block-paragraph">Java developers spent decades learning to design their allocation patterns so garbage collection wouldn&#8217;t destroy anything important. AI developers need to learn the same thing, and the learning curve should be shorter because the concepts transfer directly.</p>



<p class="wp-block-paragraph">When you ask the AI to write important state to files, you&#8217;re promoting it out of that volatile space. It&#8217;s surprisingly easy to do this. Just ask the AI to write its context to a Markdown file. For example, you can put all of the context related to a specific domain into a particular file: if the AI noticed a behavioral contract, you could have it write all the related context to a file called CONTRACTS.md. If it made a design decision, that could go into DEVELOPMENT_CONTEXT.md, a pattern I use all the time to write down all the important context needed to bootstrap a new AI session to work on the code. Those files live on disk, outside the context window, and compaction can&#8217;t touch them. But if you start a new session without externalizing any of this, you&#8217;re shutting down the application and losing everything that was in memory.</p>



<p class="wp-block-paragraph">The first time I built Octobatch&#8217;s batch orchestrator, it was a Python script with in-memory state and a lot of hope. It worked for small batches but fell apart at scale, which is pretty much what most developers are doing with their AI context right now: keeping everything in the context window and hoping it holds together, even though that stops working once sessions get long and codebases get complex.</p>



<h2 class="wp-block-heading"><strong>It&#8217;s way too easy to fall into one context management extreme or the other</strong></h2>



<p class="wp-block-paragraph">The Quality Playbook exists in part because of this problem. When I was building the requirements pipeline, I discovered that single-pass requirement generation runs out of attention after about 70 requirements. The model forgets behavioral contracts it noticed earlier. And it&#8217;s completely invisible. You don&#8217;t get a stack trace or an error message or any kind of warning, just incomplete output and no way to know what&#8217;s missing.</p>



<p class="wp-block-paragraph">The longer a defect goes uncorrected, the more entrenched it becomes and the more things get built on top of it. Context drift works the same way. When the AI loses track of a design decision early in a session, everything built on that lost context compounds the error. And just like a late-discovered defect, you don&#8217;t know what went wrong because the original context is gone.</p>



<p class="wp-block-paragraph">I had a concrete example when I was running the playbook against virtio-win. Version 1.3.32 found four bugs. Version 1.3.33, after some changes, found only one. That regression was only diagnosable because I had EXPLORATION.md, an externalized intermediate state file that captures what the AI observed during its exploration phase. Without it, the only observable output would have been &#8220;fewer bugs this time.&#8221; I would have had no way to tell whether the playbook was worse, or the bugs were harder, or it had just missed something. Without externalized state, I couldn&#8217;t have answered any of those questions.</p>
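
<p class="wp-block-paragraph">Comparing two saved exploration files is what turns &#8220;fewer bugs this time&#8221; into something you can inspect. The sketch below assumes, purely for illustration, that each observation sits on its own line; the real EXPLORATION.md layout may differ, and <em>dropped_observations</em> is a hypothetical helper.</p>

```python
def dropped_observations(old_text, new_text):
    """Return non-blank lines the older exploration file had and the newer lacks."""
    new_lines = {line.strip() for line in new_text.splitlines()}
    return [
        line.strip()
        for line in old_text.splitlines()
        if line.strip() and line.strip() not in new_lines
    ]
```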



<p class="wp-block-paragraph">The contracts file in the pipeline exists specifically to solve this. When the model forgets about a behavioral contract it noticed earlier, that forgetting is normally invisible. But with a contracts file, every observation is written down before any requirements work begins. If a contract is in the file but has no corresponding requirement, that&#8217;s a visible, greppable gap. You can see what was forgotten and fix it.</p>
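
<p class="wp-block-paragraph">That gap check is simple enough to script. The sketch below assumes a hypothetical ID scheme in which contracts are labeled like C-12 and requirements mention the contract IDs they cover; the playbook&#8217;s real files may be organized differently.</p>

```python
import re

CONTRACT_ID = re.compile(r"\bC-\d+\b")  # hypothetical ID scheme, e.g. C-12


def uncovered_contracts(contracts_text, requirements_text):
    """Return contract IDs that no requirement mentions, in file order."""
    covered = set(CONTRACT_ID.findall(requirements_text))
    missing = []
    for contract_id in CONTRACT_ID.findall(contracts_text):
        if contract_id not in covered and contract_id not in missing:
            missing.append(contract_id)
    return missing
```

<p class="wp-block-paragraph">An empty result means every recorded contract has at least one requirement pointing at it. Anything else is a list of what was forgotten.</p>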



<p class="wp-block-paragraph">But it&#8217;s just as easy to overcompensate. If the LLM has to constantly hop between eight different reference files, its context window fragments and you start getting hallucinations. I&#8217;ve seen this happen. You load all your context files and requirements documents and design docs into the session, and the AI gets worse, not better. It spends all its attention navigating between reference files instead of thinking about the problem.</p>



<p class="wp-block-paragraph">I hit this with the Quality Playbook when I expanded the scope of a run against virtio-win from 10 files to about 60. The result was 6x more files analyzed but 75% fewer bugs found. The model burned its context on device drivers instead of going deep on the transport layer where the bugs actually were. Wider scope meant shallower analysis.</p>



<p class="wp-block-paragraph">The goal isn&#8217;t to save everything. You have to decide what to externalize, what to keep in context, and what to let go. The best context file contains exactly what the AI needs for this session and nothing more.</p>



<h2 class="wp-block-heading"><strong>Helping your AI manage its context helps you too</strong></h2>



<p class="wp-block-paragraph">The interesting thing about all of this is that good context management really makes use of your development expertise, and it’s one of those things that makes you a better developer the more you do it. Every practice I&#8217;ve described in this article, writing down your decisions, recording why you made them, being deliberate about what goes into a session and what doesn&#8217;t, is something developers have always been told to do. We write ADRs and design docs and inline comments explaining nonobvious choices, and we all know we should do more of it. When you&#8217;re working with AI, the cost of not doing it becomes immediate and visible. Your context files end up being the project documentation you should have been writing all along, except now there&#8217;s something on the other end that will actually go wrong if you skip it.</p>



<p class="wp-block-paragraph">And once you start thinking about context as something you actively manage, you can start designing your workflows around it. That&#8217;s what happened with the Quality Playbook, when it went from a single 15-million-token session to a set of independent phases with clean handoffs between them, and the whole split worked on the first try because the context was already externalized to files.</p>



<p class="wp-block-paragraph">In the next article, I&#8217;ll get into the specific techniques you can use today in your AI agents, but also in your day-to-day AI development work.<br><br><em>The </em><a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener"><em>Quality Playbook</em></a><em> </em><em>is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of </em><a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener"><em>awesome-copilot</em></a><em>.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
]]></content:encoded>
										</item>
		<item>
		<title>Ryan Carson Is a One-Person Code Factory</title>
		<link>https://www.oreilly.com/radar/ryan-carson-is-a-one-person-code-factory/</link>
				<pubDate>Wed, 13 May 2026 16:23:10 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18706</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Ryan-Carson-is-a-one-person-code-factory.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Ryan-Carson-is-a-one-person-code-factory-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A conversation about running a startup alone, with agents doing the work of a full engineering team]]></custom:subtitle>
		
				<description><![CDATA[Ryan Carson has built companies for 25 years, including Treehouse, which taught over a million people to code. He knows what it takes to grow a team. So when he told me he&#8217;d raised $2 million in seed funding for his latest company, Untangle, an AI-powered divorce assistant, and had no plans to hire anyone, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Ryan Carson has built companies for 25 years, including Treehouse, which taught over a million people to code. He knows what it takes to grow a team. So when he told me he&#8217;d raised $2 million in seed funding for his latest company, Untangle, an AI-powered divorce assistant, and had no plans to hire anyone, I wanted to understand what that actually looks like.</p>



<p class="wp-block-paragraph">Ryan stopped writing code professionally around 2008. He’d essentially been “abstracted away” from it by the responsibilities of running a funded startup, as he put it. Following the acquisition of Treehouse and inspired by the arrival of large language models, he decided to teach himself to code again with ChatGPT. Ryan learned Next.js, a framework he&#8217;d never touched, using AI as a tutor that was wrong often enough to keep him honest but patient enough that he could go as slowly as he needed.</p>



<p class="wp-block-paragraph">He shipped something. It didn&#8217;t work commercially, so he moved on, but he still learned a lot about iterating on AI products in the process. A few years later, when he had an idea for a divorce tool born out of watching his family members struggle through difficult splits, he was ready to build a real MVP, and he did it all by himself (with a little design help along the way).</p>



<p class="wp-block-paragraph">As one of the foremost proponents of companies led by a single founder running a team of agents, in some sense, Ryan is a prince from another country. Maybe it’s not immediately apparent how his current workflow is relevant to developers working for big corporations beyond efficiency gains with AI-assisted coding. But thinking bigger picture, what Ryan calls the “<a href="https://x.com/ryancarson/status/2023452909883609111" target="_blank" rel="noreferrer noopener">code factory</a>”—a system where agents write and review the code, run the tests, triage the error reports, and monitor the production environment, under his oversight—may be an early version of what a lot more organizations will look like in five years.</p>



<h2 class="wp-block-heading">The loop is the thing</h2>



<p class="wp-block-paragraph">What makes the code factory model possible, Ryan explained, is the ability to set up automations and skills for jobs that you know that you need to be doing every day. In other words, you&#8217;re teaching an agent to do a repeatable process. The underlying pattern is the iterative loop, and Ryan was an early <a href="https://github.com/snarktank/ralph" target="_blank" rel="noreferrer noopener">proponent and popularizer</a> of Geoffrey Huntley’s “<a href="https://ghuntley.com/ralph/" target="_blank" rel="noreferrer noopener">Ralph Wiggum</a>” approach.</p>



<p class="wp-block-paragraph">The name comes from a <em>Simpsons</em> character who is, to put it charitably, not the sharpest. The idea is that you don&#8217;t need the agent to be superintelligent. You need it to do one thing, write down what it did and what it learned, stop, and restart with that notebook in hand. As Ryan pointed out, it turns out that pretty good intelligence, a loop, some instructions, and a notebook gets you surprisingly far into complex territory. Or to use another of Ryan’s analogies:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Think of it as a notebook where it&#8217;s like, “Here are the things I&#8217;ve done. And here&#8217;s the holes I fell into.” It&#8217;s like <em>Memento</em>, the movie, where [the main character] tattoos himself or uses notes to remember, like, “What did I do yesterday and what did I learn?” And agents are the same. They don&#8217;t have any long-term memory. And so [Geoffrey Huntley] figured out, yeah, this loop actually works shockingly well. It&#8217;s very primitive, this idea. And eventually after a number of these iterations, you actually get pretty complex outcomes.</p>
</blockquote>



<p class="wp-block-paragraph">When I heard this I thought of my first exposure to shell programming and how I fell in love with loops. You have a repetitive task and you want to do it many times, and computers are good at that. The language has changed, though; it&#8217;s English now instead of Bash. But the logic hasn&#8217;t: do something; save the result; do it again.</p>
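
<p class="wp-block-paragraph">The do, save, repeat logic above can be sketched in Python. This is not Geoffrey Huntley&#8217;s script or Ryan&#8217;s setup: <em>do_one_thing</em> is a hypothetical stand-in for a fresh agent session that receives the notebook text and returns a note plus a flag saying whether the work is finished.</p>

```python
from pathlib import Path


def run_loop(do_one_thing, notebook_path, max_iterations=10):
    """Call a fresh worker each iteration, passing only the notebook as memory."""
    notebook = Path(notebook_path)
    notebook.touch()
    for iteration in range(1, max_iterations + 1):
        notes = notebook.read_text()
        # The worker keeps no state of its own: everything it knows is in `notes`.
        entry, done = do_one_thing(notes)
        with notebook.open("a") as handle:
            handle.write(f"[{iteration}] {entry}\n")
        if done:
            return iteration
    return max_iterations
```

<p class="wp-block-paragraph">The worker can be as forgetful as you like, because the only memory that carries over between iterations is the file.</p>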



<p class="wp-block-paragraph">The <a href="https://www.oreilly.com/radar/what-mcp-and-claude-skills-teach-us-about-open-source-for-ai/" target="_blank" rel="noreferrer noopener">skill</a> I use to generate first drafts of posts like this reads the transcript, summarizes it, and suggests possible video clips to extract. I built it with a different sort of loop, iteratively training Claude to write more like me by rewriting its drafts, asking it to analyze the differences, and then feeding back the differences as a SKILL.md file, repeating until the gap narrowed enough to reduce the amount of time it takes to accurately reflect my own takeaways.</p>



<p class="wp-block-paragraph">Ryan brought up an important point: skills decay. A Next.js skill from six months ago may conflict with your current component library. Two skills may say opposite things. He told me he’d gladly pay for a system that audits his skills library, flags conflicts, and surfaces what&#8217;s gone stale. Anyone can write a skill that’s useful in the moment. The value is in keeping the skill current and coherent as it interacts with the code factory’s complete workflow.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Achieving Complex Outcomes with the Ralph Wiggum Loop with Ryan Carson" width="500" height="281" src="https://www.youtube.com/embed/wcPnswKLq2E?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The code factory in practice</h2>



<p class="wp-block-paragraph">I asked Ryan to show us his daily workflow to give us a peek into the code factory. He shared a screen with 15 active threads running in Devin (at a monthly token burn of $2,000–$3,000). As Ryan explained, having a tool like Devin is the key to the code factory model. He’d started by “hand-cobbling” together a system with a Ralph Wiggum loop and a skill, but it was fragile and things broke or got out of sync. He needed a more durable system to run the cron jobs and nightly automations that keep the factory humming. He picked Devin, but ultimately choosing a direction was more important than the choice itself:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">If you back up and say, How is the modern code factory happening? It&#8217;s choosing a tool that allows you to have automations and skills for jobs that you know that you need to be doing every day.</p>
</blockquote>



<p class="wp-block-paragraph">And he’s since expanded that toolset to cover product requirements beyond software engineering, like design.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Code Factory in Practice with Ryan Carson" width="500" height="281" src="https://www.youtube.com/embed/EH4rdnQ6A48?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">What you can automate, and what you can’t</h2>



<p class="wp-block-paragraph">One of the threads Ryan had open was an end-to-end smoke test that signs up for his own app every morning, runs through the full onboarding flow, exercises all 14 tools, and records a video of itself doing it. Every morning he wakes up to a report. The test passed or it didn&#8217;t, and if it didn&#8217;t, here&#8217;s what failed. He has a separate Devin automation that reads <a href="https://sentry.io/" target="_blank" rel="noreferrer noopener">Sentry</a> every morning, and if it finds something problematic, spins up another Devin to fix it.</p>



<p class="wp-block-paragraph">This is what a CTO does: reads the Datadog and Sentry reports, triages what matters, and points the team at it. Ryan has automated the reading and the triaging. He still decides what to do about the things that matter, but the number of things he has to pay attention to has been compressed dramatically.</p>



<p class="wp-block-paragraph">Ryan’s figured out how to automate many of the responsibilities he hired for in his previous companies. Another automation runs against his Google Ads, Meta, and X spend, compiles a performance report on cost per click, lead generation, and click-through rate. He reads that the way a head of marketing would read it.</p>



<p class="wp-block-paragraph">There’s one thing he hasn’t been able to automate: what he should build. As we hear <a href="https://www.oreilly.com/radar/everyones-an-engineer-now/#:~:text=Product%20taste%20as%20the%20new%20technical%20skill" target="_blank" rel="noreferrer noopener">again</a> and <a href="https://www.oreilly.com/radar/the-mythical-agent-month/#:~:text=Design%20and%20taste%20as%20our%20last%20foothold" target="_blank" rel="noreferrer noopener">again</a>, the efficiency gains in coding, testing, design iteration, and monitoring don&#8217;t replace the judgment calls about which problems matter. As Ryan noted, “There isn&#8217;t a magic wand still. You can build faster, but whether you&#8217;re building the right thing, and doing it better is something [else].”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="What I Can’t Automate Is What to Build with Ryan Carson" width="500" height="281" src="https://www.youtube.com/embed/utDJaO2LAuE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Programming isn&#8217;t going away</h2>



<p class="wp-block-paragraph">We all need to keep pushing back on the narrative that programming is going away. When I started, I wrote assembly language programs. I was literally moving data from registers, multiplying values, low-level operations that nobody does manually anymore because the compiler handles it all. When we look back on that, we don&#8217;t think “programmers became unnecessary.” We understand that programming was just abstracted to a higher level, and became more powerful for it. That&#8217;s where we are again.</p>



<p class="wp-block-paragraph">Ryan used the analogy of a carpenter switching from a handsaw to a Sawzall. It saves a ton of time, but you still need to know which pipes you&#8217;re cutting or you’re going to have a bad day. The domain knowledge doesn&#8217;t get abstracted away with the tool.</p>



<p class="wp-block-paragraph">The people who are going to do well are the ones who bring genuine domain expertise to what they&#8217;re asking agents to do. Ryan knows divorce law well enough to evaluate whether the output is right. He knows enough about software to catch when the agent has gone off the rails. The agent amplifies what you already know; it can&#8217;t supply what you don&#8217;t.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Programming Isn&#039;t Going Away. It&#039;s Being Abstracted. with Ryan Carson" width="500" height="281" src="https://www.youtube.com/embed/3-jjYuixwFI?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">What happened when he pitched an attorney</h2>



<p class="wp-block-paragraph">Ryan&#8217;s company is built for people considering or going through a divorce who find the process too expensive and too hard. But he always expected attorneys to have opinions. As he put it, “Either they would hate us and see us as the grim reaper, or they would love us because we&#8217;re going to save them costs.” So he had his AI agent, whom he calls R2, find and book meetings with small family law firms to hear them out. The feedback was very positive (from lawyers at least; paralegals may have another opinion). Here’s how one legal business owner responded to his pitch:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">The truth is, I have a lot of overhead from folks that are more in the paralegal space. And it sounds like your tool will do all that work. And I would rather have attorneys on staff that are doing the real legal work and then have all the paralegal work done by AI. I would love to pay you for that.</p>
</blockquote>



<p class="wp-block-paragraph">I expect that&#8217;s where most of the near-term displacement happens. Lower-value overhead gets automated and professionals spend more of their hours on actual professional work.</p>



<p class="wp-block-paragraph">Sometimes there&#8217;s an economic tradeoff between job losses (bad for those who lose their jobs) and lower costs that can be passed on to consumers. A lot of people who need legal help with a divorce can&#8217;t afford it, so they get stuck in a bad marriage. If the cost of the process comes down because the overhead is lower, some of those people get served who currently aren&#8217;t. There&#8217;s a big difference in economic impact between a business just saving costs and pocketing the savings and one that passes those savings along to consumers or uses them to radically improve access.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Law Office Owner Wants Accounts for All Her Attorneys with Ryan Carson" width="500" height="281" src="https://www.youtube.com/embed/dDRtV-3rb5Q?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">AI’s supporting role</h2>



<p class="wp-block-paragraph">Late in our conversation, someone asked how you use AI to identify strategic opportunities. Ryan&#8217;s answer was practical: build a priority map of the projects and people that matter to you, then run a cron job every 15 minutes to triage your inbox and Slack through that map, surface what&#8217;s relevant, and act. Ryan calls it his AI chief of staff, and he’s even open-sourced it as <a href="https://github.com/snarktank/clawchief" target="_blank" rel="noreferrer noopener">Clawchief</a>.</p>
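


<p class="wp-block-paragraph">To make the priority-map idea concrete, here is a minimal sketch in Python. It is not Clawchief&#8217;s code: the keywords, weights, threshold, and the <code>Message</code> type are all hypothetical, and simple keyword matching stands in for whatever an AI model would do to judge relevance. A scheduler such as cron (for example, a <code>*/15 * * * *</code> entry) could run something like this every 15 minutes against newly fetched messages.</p>

```python
from dataclasses import dataclass

# Hypothetical priority map: keyword -> weight. These names and weights are
# illustrative assumptions, not taken from Clawchief.
PRIORITY_MAP = {
    "family law": 3,
    "pilot": 2,
    "invoice": 1,
}


@dataclass
class Message:
    source: str  # e.g. "email" or "slack"
    sender: str
    text: str


def score(message, priority_map):
    """Sum the weights of every priority keyword found in the message text."""
    text = message.text.lower()
    return sum(w for keyword, w in priority_map.items() if keyword in text)


def triage(messages, priority_map, threshold=2):
    """Return messages scoring at or above the threshold, highest score first."""
    scored = [(score(m, priority_map), m) for m in messages]
    relevant = [pair for pair in scored if pair[0] >= threshold]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in relevant]


if __name__ == "__main__":
    # Stand-in data; a real run would fetch new email and Slack messages.
    inbox = [
        Message("email", "a@example.com", "Interested in a pilot for our family law firm"),
        Message("slack", "b@example.com", "Lunch on Friday?"),
        Message("email", "c@example.com", "Invoice attached for the pilot"),
    ]
    for m in triage(inbox, PRIORITY_MAP):
        print(m.source, m.sender, m.text)
```

<p class="wp-block-paragraph">The design choice worth noting is that the map is plain data, so changing what matters to you means editing a dictionary rather than rewriting the triage logic.</p>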



<p class="wp-block-paragraph">My framing is a little different, and it comes from a conversation I had years ago with <a href="https://www.linkedin.com/in/jeff-jonas/" target="_blank" rel="noreferrer noopener">Jeff Jonas</a>, who has done data work for national intelligence agencies and casino security systems. His dream was a system where the query lives in the same space as the data. Rather than going looking for things, you define what matters to you and the system watches for it. New data shows up and the query is already there, waiting. Jeff was talking about that long before agents were a concept, but it describes what a well-designed agent loop can do now.</p>



<p class="wp-block-paragraph">Only you can fully understand the moments of strategic opportunity for your company. What AI can do for you is be a scout. It can surface things that you should be paying closer attention to. That’s what Jeff and Ryan are both talking about (<a href="https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/#:~:text=Everyone%20gets%20a%20chief%20of%20staff" target="_blank" rel="noreferrer noopener">Steve Yegge too</a>): an agent that watches the flow and surfaces what deserves your attention rather than one that tries to make decisions for you.</p>



<p class="wp-block-paragraph">Right now, there&#8217;s an incredible opportunity to try things out and see what sticks. As Ryan has shown, it doesn’t take an entire company. Identify your goal and opportunity, then start building. His advice: Don’t worry about trying out every new tool. Just “find an energetic system,” then “pick a lane and invest.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="AI Scouts and AI Chiefs of Staff with Ryan Carson" width="500" height="281" src="https://www.youtube.com/embed/qG6M3OmWAv0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>
]]></content:encoded>
										</item>
	</channel>
</rss>
