<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Fri, 13 Mar 2026 11:33:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Capability Architecture for AI-Native Engineering</title>
		<link>https://www.oreilly.com/radar/capability-architecture-for-ai-native-engineering/</link>
				<comments>https://www.oreilly.com/radar/capability-architecture-for-ai-native-engineering/#respond</comments>
				<pubDate>Fri, 13 Mar 2026 11:33:01 +0000</pubDate>
					<dc:creator><![CDATA[Juliette van der Laarse]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18262</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Robot-AI-in-the-office.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Robot-AI-in-the-office-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A public model for engineering work in an AI-native world]]></custom:subtitle>
		
				<description><![CDATA[A few years into the AI shift, the gap between engineers is not talent. It’s coordination: shared norms and a shared language for how AI fits into everyday engineering work. Some teams are already getting real value. They’ve moved beyond one-off experiments and started building repeatable ways of working with AI. Others haven’t, even when [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>A few years into the AI shift, the gap between engineers is not talent. It’s coordination: shared norms and a shared language for how AI fits into everyday engineering work. Some teams are already getting real value. They’ve moved beyond one-off experiments and started building repeatable ways of working with AI. Others haven’t, even when the motivation is there. The reason is often simple: The cost of orientation has exploded. The landscape is saturated with tools and advice, and it’s hard to know what matters, where to start, and what “good” looks like once you care about production realities.</p>



<h2 class="wp-block-heading">The missing map</h2>



<p>What’s missing is a shared reference model. Not another tool. A map. Which engineering activities can AI responsibly support? What does quality mean for those outputs? What changes when part of the workflow becomes probabilistic? And what guardrails keep integration safe, observable, and accountable? Without that map, it’s easy to drown in novelty, and easy to confuse widespread experimentation with reliable integration. Teams with the least time, budget, and local support pay the highest price, and the gap compounds.</p>



<p>That gap is now visible at the organizational level. More organizations are trying to turn AI into business value, and the difference between hype and integration is showing up in practice. It’s easy to ship impressive demos. It’s much harder to make AI-assisted work reliable under real-world constraints: measurable quality, controllable failure modes, clear data boundaries, operational ownership, and predictable cost and latency. This is where engineering discipline matters most. AI does not remove the need for it; it amplifies the cost of missing it. The question is how we move from scattered experimentation to integrated practice without burning cycles on tool churn. To do that at scale, we need shared scaffolding: a public model and shared language for what “good” looks like in AI-native engineering.</p>



<p>We have seen why this kind of shared scaffolding matters before. In the early internet era, promise and noise moved faster than standards and shared practice. What made the internet durable was not a single vendor or methodology but a cultural infrastructure: open knowledge sharing, global collaboration, and shared language that made practices comparable and teachable. AI-native engineering needs the same kind of cultural infrastructure, because integration only scales when the industry can coordinate on what “good” means. AI does not remove the need for careful engineering. On the contrary, it punishes the absence of it.</p>



<h2 class="wp-block-heading">A public scaffold for AI-native engineering</h2>



<p>In the second half of 2025, I began to notice growing unease among engineers I worked with and friends in IT. There was a clear sense that AI would change our work in profound ways, but far less clarity on what that actually meant for a person’s role, skills, and daily practice. There was no shortage of training courses, guides, blogs, or tools, but the more resources appeared, the harder it became to judge what was relevant, what was useful, and where to begin. It felt overwhelming. How do you know which topics truly matter to you when suddenly everything is labeled AI? How do you move from hype to useful integration?</p>



<p>I was feeling much of that same uncertainty myself. I was trying to make sense of the shift too, and for a while I think I was waiting for a clearer structure to emerge from elsewhere. It was only when friends started reaching out to me for help and guidance that I realized I might have something meaningful to contribute. I do not consider myself an AI expert. I am finding my way through these changes just like many other engineers. But over the years, I had become known for my work in IT workforce development, skill and capability frameworks, and engineering excellence and enablement. I know how to help people navigate complexity in a practical and sustainable way, and I enjoy bringing clarity to chaos.</p>



<p>That is what led me to start working on the AI Flower as a hobby project in early October 2025, building on frameworks and methods I already had experience with.</p>



<p>When I began sharing it with friends in IT to gather feedback, I saw how much it resonated. It helped them make sense of the complexity around AI, think more clearly about their own upskilling, and begin shaping AI adoption strategies of their own. That is when I realized this casual experiment held real value, and decided I wanted to publish it so it could help empower other engineers and IT organizations in the same way it had helped my friends.</p>



<p>With the AI Flower, I’m offering a public scaffold for AI-native engineering work: a shared reference model that helps engineers, teams, and organizations adopt and integrate AI sustainably and reliably. It’s meant to steer and organize the conversation around AI-assisted engineering, and to invite targeted feedback on what breaks, what’s missing, and what “good” should mean in real production contexts. It’s not meant to be perfect. It’s meant to be useful, freely available, open to contribution, and shaped by the strongest resource our industry has: collective intelligence.</p>



<p>Open knowledge sharing and collaboration cannot be optional. If AI is becoming part of how we design, build, operate, secure, and govern systems, we need more than tools and enthusiasm. Many of us work on systems people rely on every day. When those systems fail, the impact is real. That’s why we owe it to the people who depend on these systems to do this with care, and why we won’t get there in isolation. We need the industry, globally, to converge on shared standards for dependable practice.</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="1146" height="1080" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7.png" alt="" class="wp-image-18263" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7.png 1146w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7-300x283.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7-768x724.png 768w" sizes="(max-width: 1146px) 100vw, 1146px" /><figcaption class="wp-element-caption"><strong>The AI Flower visualized</strong>: Petals represent engineering disciplines, and each encompasses core engineering activities, best practices, learning resources, AI risk and considerations, and AI guidance per activity.</figcaption></figure>



<h2 class="wp-block-heading">About the AI Flower</h2>



<p>The AI Flower maps the core activities that make up engineering work across the main engineering disciplines. For each activity, it defines what good looks like, based on practices that should already feel familiar to engineers. It then helps people explore how AI can support those activities in practice, providing guidance on how to begin using AI in that work, sharing links to useful learning resources, and outlining the main risks, trade-offs, and mitigations.</p>
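


<p>To make that structure concrete, here is a small, purely illustrative sketch of how a single activity inside one petal might be encoded. The field names and example content are hypothetical, not the AI Flower’s published schema; the point is only that “activity, what good looks like, AI guidance, risks, and resources” can be captured as a reviewable, extensible artifact.</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass

@dataclass
class Activity:
    """One engineering activity inside a discipline ("petal").

    Illustrative field names only; not the AI Flower's published schema.
    """
    name: str
    what_good_looks_like: list[str]        # practices engineers already recognize
    ai_guidance: list[str]                 # how to begin using AI for this activity
    risks_and_mitigations: dict[str, str]  # risk mapped to its mitigation
    learning_resources: list[str]          # pointers to useful material

# Hypothetical example entry for a code-review activity
code_review = Activity(
    name="Code review",
    what_good_looks_like=[
        "Small, focused changes with clear intent",
        "Reviewers check behavior, tests, and security, not just style",
    ],
    ai_guidance=[
        "Use an assistant for a first pass, then review its findings yourself",
    ],
    risks_and_mitigations={
        "Over-trusting AI findings": "Keep a human accountable for the merge decision",
    },
    learning_resources=["(links would go here)"],
)</code></pre>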



<p>But the AI landscape is changing quickly. This activity-based approach helps engineers understand how AI can support core engineering tasks, where risks may arise, and how to start building practical experience. But on its own, it isn’t enough as a long-term model for AI adoption.</p>



<p>As AI capabilities evolve, many engineering activities will become more abstracted, more automated, or absorbed into the infrastructure layer. That means engineers will need to do more than learn how to use AI within today’s activities. They will also need to work with emerging approaches such as context engineering and agentic workflows, which are already reshaping what we consider core engineering work. A concept I call the Skill Fossilization Model captures that progression. It shows how both engineering skills and AI-related skills evolve over time, and how some of them become less visible as work moves to a higher level of abstraction. Together, the AI Flower and the Skill Fossilization Model are meant to help engineers stay adaptable as the field continues to shift.</p>



<p>The main purpose of the AI Flower is to help engineers find their way through these rapid changes and grow with them. While I provide content for each section and activity, the real value lies in the framework and structure itself. To become truly valuable, it will need the insight, care, and contribution of engineers across disciplines, perspectives, and regions.</p>



<p>I genuinely believe the AI Flower, as an open and freely available framework, can serve as a scaffold for that work. This is my contribution to a changing industry. But it will only be useful—it will only “bloom”—if the community tests it, challenges it, and improves it over time.</p>



<p>And if any industry can turn open critique and contribution into shared standards at a global scale, it’s ours, isn’t it?</p>



<h2 class="wp-block-heading">Join me at AI Codecon to learn more</h2>



<p>If the AI Flower resonates and you want the full walkthrough, I’ll be presenting it at <a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener">O’Reilly’s upcoming AI Codecon</a>. (Registration is free and open to all.)</p>



<p>If you’re concerned about how quickly AI engineering patterns are evolving, that concern is valid. We’ve already seen the center of gravity shift from ad hoc prompt work, to context engineering, to increasingly agentic workflows, and there is more coming. A core design goal of the AI Flower is to stay stable across those shifts by focusing on underlying capabilities rather than specific techniques. I’ll go deeper on that stability principle, including the Skill Fossilization Model, at AI Codecon as well.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/capability-architecture-for-ai-native-engineering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Generative AI in the Real World: Sharon Zhou on Post-Training</title>
		<link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-sharon-zhou-on-post-training/</link>
				<comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-sharon-zhou-on-post-training/#respond</comments>
				<pubDate>Thu, 12 Mar 2026 11:15:56 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica and Sharon Zhou]]></dc:creator>
						<category><![CDATA[Generative AI in the Real World]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=18239</guid>

		<enclosure url="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4" length="0" type="audio/mpeg" />
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Post-training gets your model to behave the way you want it to. As AMD VP of AI Sharon Zhou explains to Ben on this episode, the frontier labs are convinced, but the average developer is still figuring out how post-training works under the hood and why they should care. In their focused discussion, Sharon and [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-video"><video controls src="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4"></video></figure>



<p>Post-training gets your model to behave the way you want it to. As AMD VP of AI Sharon Zhou explains to Ben on this episode, the frontier labs are convinced, but the average developer is still figuring out how post-training works under the hood and why they should care. In their focused discussion, Sharon and Ben get into the process and trade-offs, techniques like supervised fine-tuning, reinforcement learning, in-context learning, and RAG, and why we still need post-training in the age of agents. (It’s how to get the agent to actually work.) Check it out.</p>



<p>About the <em>Generative AI in the Real World</em> podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>



<p>Check out other episodes of this podcast <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*pra1u5*_gcl_au*Mzc5ODUxNDEzLjE3NzI3NDUyNzk.*_ga*NjI3OTAzNjIzLjE3NzI0NzYxMzg.*_ga_092EL089CH*czE3NzMwODg2NjgkbzI3JGcwJHQxNzczMDg4NjY4JGo2MCRsMCRoMA.." target="_blank" rel="noreferrer noopener">on the O’Reilly learning platform</a> or follow us on <a href="https://www.youtube.com/playlist?list=PL055Epbe6d5YcJUhZbsVW9dlMueIuOxK_" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/5C9oof8TFkP65lDUcEy5jT" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/generative-ai-in-the-real-world/id1835476293" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</p>



<h1 class="wp-block-heading"><strong>Transcript</strong></h1>



<p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=0" target="_blank" rel="noreferrer noopener">00.00</a><br><strong>Today we have a VP of AI at AMD and old friend, Sharon Zhou. And we&#8217;re going to talk about post-training mainly. But obviously we’ll cover other topics of interest in AI. So Sharon, welcome to the podcast. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=17" target="_blank" rel="noreferrer noopener">00.17</a><br>Thanks so much for having me, Ben. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=19" target="_blank" rel="noreferrer noopener">00.19</a><br><strong>All right. So post-training. . . For our listeners, let&#8217;s start at the very basics here. Give us your one- to four-sentence definition of what post-training is even at a high level? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=35" target="_blank" rel="noreferrer noopener">00.35</a><br>Yeah, at a high level, post-training is a type of training of a language model that gets it to behave in the way that you want it to. For example, getting the model to chat, like the chat in ChatGPT was done by post-training.</p>



<p>So basically teaching the model to not just have a huge amount of knowledge but actually be able to have a dialogue with you, for it to use tools, hit APIs, use reasoning and think through things step-by-step before giving an answer—a more accurate answer, hopefully. So post-training really makes the models usable. And not just a piece of raw intelligence, but more, I would say, usable intelligence and practical intelligence.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=74" target="_blank" rel="noreferrer noopener">01.14</a><br><strong>So we&#8217;re two or three years into this generative AI era. Do you think at this point, Sharon, you still need to convince people that they should do post-training, or that&#8217;s done; they’re already convinced?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=91" target="_blank" rel="noreferrer noopener">01.31</a><br>Oh, they&#8217;re already convinced because I think the biggest shift in generative AI was caused by post-training ChatGPT. The reason why ChatGPT was amazing was actually not because of pretraining or getting all that information into ChatGPT. It was about making it usable so that you could actually chat with it, right?</p>



<p>So the frontier labs are doing a ton of post-training. Now, in terms of convincing, I would say that for the frontier labs, the new labs, they don&#8217;t need any convincing for post-training. But I think for the average developer, there is, you know, something to think about on post-training. There are trade-offs, right? So I think it&#8217;s really important to learn about the process because then you can actually understand where the future is going with these frontier models.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=135" target="_blank" rel="noreferrer noopener">02.15</a><br>But I think there is a question of how much you should do on your own, versus, us[ing] the existing tools that are out there. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=143" target="_blank" rel="noreferrer noopener">02.23</a><br><strong>So by convincing, I mean not the frontier labs or even the tech-forward companies but your mom and pop. . . Not mom and pop. . . I guess your regular enterprise, right?</strong></p>



<p><strong>At this point, I&#8217;m assuming they already know that the models are great, but they may not be quite usable off the shelf for their very specific enterprise application or workflow. So is that really what’s driving the interest right now—that people are actually trying to use these models off the shelf, and they can&#8217;t make them work off the shelf?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=184" target="_blank" rel="noreferrer noopener">03.04</a><br>Well, I was hoping to be able to talk about my neighborhood pizza store post-training. But I think, actually, for your average enterprise, my recommendation is less so trying to do a lot of the post-training on your own—because there&#8217;s a lot of infrastructure work to do at scale to run on a ton of GPUs, for example, in a very stable way, and to be able to iterate very effectively.</p>



<p>I think it&#8217;s important to learn about this process, however, because I think there are a lot of ways to influence post-training so that your end objective can happen in these frontier models or inside of an open model, for example, to work with people who have that infrastructure set up. So some examples could include: You could design your own RL environment, which is a little sandbox environment for the model to go learn a new type of skill—for example, learning to code. This is how the model learns to code or learns math, for example. And it&#8217;s a little environment that you&#8217;re able to set up and design. And then you can give that to the different model providers, or to APIs that can help you with post-training these models, for example. And I think that&#8217;s really valuable because that gets the capabilities into the model that you want, that you care about at the end of the day.</p>
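


<p><em>To picture what such a sandbox looks like, here is a minimal sketch of the shape an RL environment often takes: a reset that produces a task, a step that checks the model&#8217;s attempt, and a verifiable reward. The toy arithmetic task and all names here are illustrative assumptions, not any particular provider&#8217;s API.</em></p>



<pre class="wp-block-code"><code>import random

class ArithmeticEnv:
    """Toy sketch of an RL environment: the model must answer a + b correctly.

    Purely illustrative; real environments wrap code execution, tool use, etc.
    """

    def reset(self) -> str:
        self.a, self.b = random.randint(0, 99), random.randint(0, 99)
        return f"What is {self.a} + {self.b}? Answer with a number only."

    def step(self, model_answer: str):
        # Verifiable reward: 1.0 only if the parsed answer is exactly correct.
        try:
            reward = 1.0 if int(model_answer.strip()) == self.a + self.b else 0.0
        except ValueError:
            reward = 0.0
        return reward, True  # single-turn task, so it is always done

# Usage sketch: collect (prompt, answer, reward) tuples to feed an RL post-training loop.
env = ArithmeticEnv()
prompt = env.reset()
reward, done = env.step("42")  # a model's answer would go here</code></pre>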



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=259" target="_blank" rel="noreferrer noopener">04.19</a><br><strong>So a few years ago, there was this general excitement about supervised fine-tuning. And then suddenly there were all these services that made it dead simple. All you had to do is come up with labeled examples. Granted, that that can get tedious, right? But once you do that, you upload your labeled examples, go out to lunch, come back, you have an endpoint that’s fine-trained, fine-tuned. So what happened to that? Is that something that people ended up continuing down that path, or are they abandoning it, or are they still using it but with other things?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=300" target="_blank" rel="noreferrer noopener">05.00</a><br>Yeah. So I think it&#8217;s a bit split. Some people have found that doing in-context learning—essentially putting a lot of information into the prompt context, into the prompt examples, into the prompt—has been fairly effective for their use case. And others have found that that&#8217;s not enough, and that actually, doing supervised fine-tuning on the model can get you better results, and you can do so on a smaller model that you can make private and make very low latency. And also like effectively free if you have it on your own hardware, right?</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=330" target="_blank" rel="noreferrer noopener">05.30</a><br>So I think those are kind of the trade-offs that people are thinking through. It&#8217;s obviously very much easier essentially to do in-context learning. And it could actually be more cost-effective if you&#8217;re only hitting that API a few times. Your context is quite small.</p>



<p>And hosted models like <a href="https://www.anthropic.com/claude/haiku" target="_blank" rel="noreferrer noopener">Haiku</a>, for example, a very small model, are quite cheap and low latency already. So I think there&#8217;s basically that trade-off. And with all of machine learning, with all of AI, this is something that you have to test empirically.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=363" target="_blank" rel="noreferrer noopener">06.03</a><br>So I would say the biggest thing is people are testing these things empirically, the differences between them and those trade-offs. And I&#8217;ve seen a bit of a split, and I really think it comes down to expertise. So the more you know how to actually tune the models, the more success you&#8217;ll get out of it immediately with a very small timeline. And you&#8217;ll understand how long something will take versus if you don&#8217;t have that experience, you will struggle and you might not be able to get to the right result in the right time frame, to make sense from an ROI perspective. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=395" target="_blank" rel="noreferrer noopener">06.35</a><br><strong>So where does retrieval-augmented generation fall into the spectrum of the tools in the toolbox?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=404" target="_blank" rel="noreferrer noopener">06.44</a><br>Yeah. I think RAG is a way to actually prompt the model and use search basically to search through a bunch of documents and selectively add things into the context, whether it be the context is too small, so like, it can only handle a certain amount of information, or you don&#8217;t want to distract the model with a bunch of irrelevant information, only the relevant information from retrieval.</p>



<p>I think retrieval is a very powerful search tool. And I think it&#8217;s important to know that while you use it at inference time quite a bit, this is something you teach the model to use better. It&#8217;s a tool that the model needs to learn how to use, and it can be taught in post-training for the model to actually do retrieval, do RAG, extremely effectively, in different types of RAG as well.</p>



<p>So I think knowing that is actually fairly important. For example, in the RL environments that I create, and the fine-tuning kind of data that I create, I include RAG examples because I want the model to be able to learn that and be able to use RAG effectively.&nbsp;</p>
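


<p><em>As a concrete illustration of the retrieval step described above: score the documents against the query, keep only the most relevant ones, and add them to the context. The word-overlap scorer below is deliberately naive and purely illustrative; production RAG systems use embeddings and a vector index.</em></p>



<pre class="wp-block-code"><code>def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query (toy scorer)."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words.intersection(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Selectively add only the relevant documents to the context, then ask the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Use only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
print(build_prompt("How long do I have to return an item?", docs))</code></pre>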



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=466" target="_blank" rel="noreferrer noopener">07.46</a><br><strong>So besides supervised fine-tuning, the other class of techniques, broadly speaking, falls under reinforcement learning for post-training. But the impression I get—and I&#8217;m a big RL fan, and I&#8217;m a cheerleader of RL—but it seems always just around the corner, beyond the grasp of regular enterprise. It seems like a class of tools that the labs, the neo labs and the AI labs, can do well, but it just seems like the tooling is not there to make it, you know. . . Like I describe supervised fine-tuning as largely solved if you have a service. There&#8217;s no equivalent thing for RL, right? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=515" target="_blank" rel="noreferrer noopener">08.35</a><br>That&#8217;s right. And I think SFT (supervised fine-tuning) came first, so then it has been allowed to mature over the years. And so right now RL is kind of seeing that moment as well. It was a very exciting year last year, when we used a bunch of RL at test-time compute, teaching a model to reason, and that was really exciting with RL. And so I think that&#8217;s ramped up more, but we don&#8217;t have as many services today that are able to help with that. I think it&#8217;s only a matter of time, though. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=544" target="_blank" rel="noreferrer noopener">09.04</a><br><strong>So you said earlier, it&#8217;s important for enterprises to know that these techniques exist, that there&#8217;s companies who can help you with these techniques, but it might be too much of a lift to try to do it yourself. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=560" target="_blank" rel="noreferrer noopener">09.20</a><br>I think maybe fully end to end, it is challenging as an enterprise. I think there are individual developers who are able to do this and actually get a lot of value from it. For example, for vision language models or for models that generate images, people are doing a lot of bits and pieces of fine-tuning, and getting very custom results that they need from these models.</p>



<p>So I think it depends on who you are and what you&#8217;re surrounded by. The <a href="https://tinker-docs.thinkingmachines.ai/" target="_blank" rel="noreferrer noopener">Tinker API</a> from Thinking Machines is really interesting to me because that enables another set of people to be able to access it. I&#8217;m not sure it&#8217;s quite at the enterprise level, but I know researchers at universities now have access to distributed compute, like doing post-training on distributed compute, and pretty big clusters—which would otherwise be quite challenging for them to do. And so that makes it actually possible for at least that segment of the market and that user base to actually get started. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=621" target="_blank" rel="noreferrer noopener">10.21</a><br><strong>Yeah. So for our listeners who are familiar with just plain inference, the OpenAI API has become kind of the de facto API for inference. And then the idea is this Tinker API might play that role for fine-tuning inputs, correct? It&#8217;s not kind of the whole project that&#8217;s there. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=643" target="_blank" rel="noreferrer noopener">10.43</a><br>Correct. Yeah, that&#8217;s their intention. And to do it in a heavy like distributed way. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=649" target="_blank" rel="noreferrer noopener">10.49</a><br><strong>So then, if I&#8217;m CTO at an enterprise and I have an AI team and, you know, we&#8217;re not up to speed on post-training, what are the steps to do that? Do we bring in consultants and they explain to us, here&#8217;s your options and these are the vendors, or. . .? What&#8217;s the right playbook?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=675" target="_blank" rel="noreferrer noopener">11.15</a><br>Well, the strategy I would employ is, given these models change their capabilities constantly, I would obviously have teams testing the boundaries of the latest iteration of model at inference. And then from a post-training perspective, I would also be testing that. I would have a small, hopefully elite team that is looking into what I can do with these models, especially the open ones. And when I post-train, what actually comes from that. And I would think about my use cases and the desired things I would want to see from the model given my understanding of post-training.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=708" target="_blank" rel="noreferrer noopener">11.48</a><br>So hopefully you learn about post-training through this book with O&#8217;Reilly. But you&#8217;re also able to now grasp like, What are the types of capabilities I can add into the model? And as a result, what kinds of things can I then add into the ecosystem such that they get incorporated into the next generation of model as well?</p>



<p>For example, I was at an event recently and someone said, oh, you know, these models are so scary. When you threaten the model, you can get better results. So is that even ethical? You know, the model gets scared and gets you a better result. And I said, actually, you can post-train that out of the model, so that when you threaten it, it actually doesn&#8217;t give you a better result. That&#8217;s not actually a valid model behavior. You can change that behavior of the model. So understanding these tools can lend that perspective of, oh, I can change this behavior because I can change what output is given for this input, how the model reacts to this type of input. And I know how.&nbsp;</p>



<p>I also know the tools, right? This type of data. So maybe I should be releasing this type of data more. I should be releasing more of these types of tutorials that actually help the model learn at different levels of difficulty. And I should be releasing these types of files, these types of tools, these types of MCPs and skills such that the model actually does pick that up.</p>



<p>And that will be across all different types of models, whether that be a frontier lab looking at your data or your internal team that is doing some post-training with that information.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=800" target="_blank" rel="noreferrer noopener">13.20</a><br><strong>Let&#8217;s say I&#8217;m one of these enterprises, and we already have some basic applications that use RAG, and you know, I hear this podcast and say, OK, let&#8217;s try this, try to go down the path of post-training. So we already have some familiarity with how to do eval for RAG or some other basic AI application. How does my eval pipeline change in light of post-training? Do I have to change anything there? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=843" target="_blank" rel="noreferrer noopener">14.03</a><br>Yes and no. I think you can expand on what you have right now. And I think your existing eval—hopefully it&#8217;s a good eval. There&#8217;s also best practices around evals. But essentially let&#8217;s say it&#8217;s just a list of possible inputs and outputs, a way to grade those outputs, for the model. And it covers a decent distribution over the tasks you care about. Then, yes, you can extend that to post-training. </p>



<p>For fine-tuning, it&#8217;s a pretty straightforward kind of extension. You do need to think about essentially the distribution of what you&#8217;re evaluating such that you can trust that the model’s truly better at your tasks. And then for RL, you would think about, How do I effectively grade this at every step of the way, and be able to understand whether the model has done well or not, and be able to catch where the model is, for example, reward hacking, when it&#8217;s cheating, so to speak?</p>



<p>So I think you can take what you have right now. And that&#8217;s kind of the beauty of it. You can take what you have and then you can expand it for post-training.&nbsp;</p>
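


<p><em>A minimal sketch of the &#8220;list of possible inputs and outputs plus a way to grade them&#8221; idea, and of why it extends naturally: the same harness can score a prompt-only baseline, a fine-tuned endpoint, or an RL checkpoint. The cases and graders are made up for illustration.</em></p>



<pre class="wp-block-code"><code># Illustrative only: a tiny eval harness of graded cases.
EVAL_SET = [
    {"input": "Summarize: the cat sat on the mat.", "must_contain": "cat"},
    {"input": "Translate to French: good morning", "must_contain": "bonjour"},
]

def grade(output: str, case: dict) -> float:
    """Return 1.0 if the output contains the required string, else 0.0."""
    return 1.0 if case["must_contain"].lower() in output.lower() else 0.0

def run_eval(model_fn) -> float:
    """model_fn is any callable mapping a prompt string to an output string."""
    scores = [grade(model_fn(case["input"]), case) for case in EVAL_SET]
    return sum(scores) / len(scores)

# The same run_eval call works before and after post-training:
baseline_score = run_eval(lambda prompt: "Bonjour, the cat sat.")
# post_trained_score = run_eval(my_fine_tuned_endpoint)  # hypothetical endpoint</code></pre>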



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=910" target="_blank" rel="noreferrer noopener">15.10</a><br><strong>So, Sharon, should people think of something like supervised fine-tuning as something you do for something very narrow? In other words, as you know, one of the challenges with supervised fine-tuning is that first of all, you have to come up with the dataset, and let&#8217;s say you can do that, then you do the supervised fine-tuning, and it works, but it only works for kind of that data distribution somehow. And so in other words, you shouldn&#8217;t expect miracles, right?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=944" target="_blank" rel="noreferrer noopener">15.44</a><br>Yes, actually something I do recommend is thinking through what you want to do that supervised fine-tuning on. And really, I think it should be behavior adaptation. So for example, in pretraining, that&#8217;s when the model is learning from a huge amount of data, for example, from the internet, curated. And it&#8217;s just gaining raw intelligence across a lot of different tasks and a lot of different domains. And it&#8217;s just gaining that information, predicting that next token. But it doesn&#8217;t really have any of those behavioral elements to it. </p>



<p>Now, let&#8217;s say it&#8217;s only learned about version one of some library. If in fine-tuning, so if in post-training, you now give it examples of chatting with the model, then it&#8217;s able to chat over version one and version zero. (Let&#8217;s say there&#8217;s a version zero.) And you only gave it examples of chatting with version one, but it&#8217;s able to generalize to version zero. Great. That&#8217;s exactly what you want. That&#8217;s a behavior change that you&#8217;re making in the model. But we&#8217;ve also seen issues where, if you, for example, now give the model fine-tuning examples of “oh, here&#8217;s something with version two,” but the base model, the pretrained model, did not ever see anything about version two, it will learn this behavior of making things up. And so that will generalize as well. And that could actually hurt the model.&nbsp;</p>



<p>So something that I really encourage people to think about is where to put each step of information. And it&#8217;s possible that certain amounts of information are best done as more of a pretraining step. So I&#8217;ve seen people take a pretrained model, do some continued pretraining—maybe you call it midtraining, I&#8217;m not sure. But like something there—and then you do that fine-tuning step of behavior modification on top.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1056" target="_blank" rel="noreferrer noopener">17.36</a><br><strong>In your previous startup, you folks talked about something. . . I forget. I&#8217;m trying to remember. Something called memory tuning, is that right?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1066" target="_blank" rel="noreferrer noopener">17.46</a><br>Yeah. A mixture of memory experts. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1068" target="_blank" rel="noreferrer noopener">17.48</a><br><strong>Yeah, yeah. Is it fair to cast that as a form of post-training? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1074" target="_blank" rel="noreferrer noopener">17.54</a><br>Yes, that is absolutely a form of post-training. We were doing it in the adapter space. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1079" target="_blank" rel="noreferrer noopener">17.59</a><br><strong>Yeah. And you should describe for our audience what that is.</strong> </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1082" target="_blank" rel="noreferrer noopener">18.02</a><br>Okay. Yeah. So we invented something called mixture of memory experts. And essentially, you can hear like the words, except for the word “memory,” it&#8217;s a mixture of experts. So it&#8217;s a type of MOE. MOEs are typically done in the base layer of a model. And what it basically means is like there are a bunch of different experts, and for particular requests, for a particular input prompt, it routes to only one of those experts or only a couple of those experts instead of the whole model.</p>



<p>And this makes latency really low and makes it really efficient. And the base models are often MoEs today for the frontier models. But what we were doing was thinking about, well, what if we froze your base model, your base pretrained model, and for post-training, we could do an MoE on top? And specifically, we could do an MoE on top through the adapters. So through your LoRA adapters. And so instead of just one LoRA adapter, you could have a mixture of these LoRA adapters. And they would effectively be able to learn multiple different tasks on top of your base model such that you would be able to keep your base model completely frozen and be able to, automatically in a learned way, switch between these adapters.</p>
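


<p><em>For the mathematically inclined, here is a tiny numerical sketch of the idea: a frozen base weight plus several low-rank (LoRA-style) adapters, combined by a router that weights which adapter applies to a given input. It shows only the arithmetic, with made-up sizes and a hand-fed gate; it is not the actual mixture-of-memory-experts implementation.</em></p>



<pre class="wp-block-code"><code>import numpy as np

rng = np.random.default_rng(0)
d, r, n_adapters = 8, 2, 4                       # hidden size, LoRA rank, adapter count

W = rng.normal(size=(d, d))                      # frozen, pretrained base weight
A = rng.normal(size=(n_adapters, r, d)) * 0.01   # per-adapter "down" projections
B = rng.normal(size=(n_adapters, d, r)) * 0.01   # per-adapter "up" projections

def routed_forward(x, gate_logits):
    """Frozen base output plus a gated sum of low-rank adapter updates.

    gate_logits would come from a learned router; here they are just an argument.
    """
    gates = np.exp(gate_logits) / np.exp(gate_logits).sum()   # softmax over adapters
    y = W @ x                                                  # base model, never updated
    for i in range(n_adapters):
        y = y + gates[i] * (B[i] @ (A[i] @ x))                 # adapter i's low-rank update
    return y

x = rng.normal(size=d)
y = routed_forward(x, gate_logits=np.array([2.0, 0.1, 0.1, 0.1]))  # routes mostly to adapter 0</code></pre>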



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1152" target="_blank" rel="noreferrer noopener">19.12</a><br><strong>So the user experience or developer experience is similar to supervised fine-tuning: I will need labeled datasets for this one, another set of labeled datasets for this one, and so on. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1169" target="_blank" rel="noreferrer noopener">19.29</a><br>So actually, yeah. Similar to supervised fine-tuning, you would just have. . . Well, you could put it into one giant dataset, and it would learn how to figure out which adapters to allocate it to. So let&#8217;s say you had 256 adapters or 1024 adapters. It would learn what the optimal routing is. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1187" target="_blank" rel="noreferrer noopener">19.47</a><br><strong>And then you folks tried to explain this in the context of neural plasticity, as I recall.</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1195" target="_blank" rel="noreferrer noopener">19.55</a><br>Did we? I don&#8217;t know. . .</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1198" target="_blank" rel="noreferrer noopener">19.58</a><br><strong>The idea being that, because of this approach, your model can be much more dynamic. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1208" target="_blank" rel="noreferrer noopener">20.08</a><br>Yeah. I do think there&#8217;s a difference between inference, so just going forwards in the model, versus being able to go backwards in some way, whether that be through the entire model or through adapters, but in some way being able to learn something through backprop.</p>



<p>So I do think there is a pretty fundamental difference between those two types of ways to engage with a model. And arguably at inference time, your weights are frozen, so the model&#8217;s “brain” is completely frozen, right? And so you can&#8217;t really heavily adapt anything towards a different objective. It&#8217;s frozen. So being able to continually modify the model’s objective and thinking and steering and behavior, I think, is valuable now.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1254" target="_blank" rel="noreferrer noopener">20.54</a><br>I think there are more approaches to this today, but from a user experience perspective, some people have found it easier to just load a lot of things into the context. And I think there&#8217;s. . . I&#8217;ve actually recently had this debate with a few people around whether in-context learning truly is somewhere in between just frozen inference forwards and backprop. Obviously it&#8217;s not doing backprop directly, but there are ways to mimic certain things. But maybe that is what we&#8217;re doing as a human throughout the day. And then I will backprop at night when I&#8217;m sleeping. </p>



<p>So I think people are playing with these ideas and trying to understand what&#8217;s going on with the model. I don&#8217;t think it&#8217;s definitive yet. But we do see some properties, when just playing with the input prompt. But there I think, needless to say, there are 100% fundamental differences when you are able to backprop into the weights.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1309" target="_blank" rel="noreferrer noopener">21.49</a><br><strong>So maybe for our listeners, briefly define in-context learning.</strong> </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1315" target="_blank" rel="noreferrer noopener">21.55</a><br>Oh, yeah. Sorry. So in-context learning is a deceptive term because the word “learning” doesn&#8217;t actually. . . Backprop doesn&#8217;t happen. All it is is actually putting examples into the prompt of the model and you just run inference. But given that prompt, the model seems to learn from those examples and be able to be nudged by those examples to a different answer.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1337" target="_blank" rel="noreferrer noopener">22.17</a><br><strong>By the way, now we have frameworks like DSPy, which comes with tools like GEPA which can optimize your prompts. I know a few years ago, you folks were telling people [that] prompting your way through a problem is not the right approach. But now we have more principled ways, Sharon, of developing the right prompts? So how do tools like that impact post-training? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1371" target="_blank" rel="noreferrer noopener">22.51</a><br>Oh, yeah. Tools like that impact post-training, because you can teach the model in post-training to use those tools more effectively. Especially if they help with optimizing the prompt and optimizing the understanding of what someone is putting into the model.</p>



<p>For example, let me just give a contrast of how far we&#8217;ve gotten. So post-training makes the model more resilient to different prompts, able to handle different types of prompts, and able to get the intention from the user. So as an extreme example, before ChatGPT, when I was using GPT-3 back in 2020, if I literally put a space by accident at the end of my prompt—like when I said, “How are you?” but I accidentally pressed Space and then Enter, the model completely freaked out. And that&#8217;s because of the way things were tokenized, and that just would mess things up. But there are a lot of different weird sensitivities in the model such that it would just completely freak out, and by freak out I mean it would just repeat the same thing over and over, or just go off the rails about something completely irrelevant.</p>



<p>And so that&#8217;s what the state of things was, and the model was not post-trained to.&nbsp;.&nbsp;. Well, it wasn&#8217;t quite post-trained then, but it also wasn&#8217;t generally post-trained to be resilient to any type of prompt. Versus now, today, I don&#8217;t know about you, but the way I code is I just highlight something and just put a question mark into the prompt.</p>



<p>I’m so lazy, or like just put the error in and it&#8217;s able to handle it—understand that you&#8217;re trying to fix this error because why else would you be talking to it. And so it&#8217;s just much more resilient today to different things in the prompt.&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1466">24.26</a><br>Remember Google “Did you mean this?” It&#8217;s kind of an extreme version of that, where you type something completely misspelled into Google, and it&#8217;s able to kind of figure out what you actually meant and give you the results.</p>



<p>It&#8217;s the same thing, even more extreme, like super Google, so to speak. But, yeah, it&#8217;s resilient to that prompt. But that has to be done through post-training—that is happening in post-training for a lot of these models. It&#8217;s showing the model, hey, for these possible inputs that are just gross and messed up, you can still give the user a really well-defined output and understand their intention.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1505" target="_blank" rel="noreferrer noopener">25.05</a><br><strong>So the hot thing today, of course, is agents. And agents now, people are using things like tool calling, right? So MCP servers. . . You&#8217;re not as dependent on this monolithic model to solve everything for you. So you can just use a model to orchestrate a bunch of little specialized specialist agents.</strong></p>



<p><strong>So do I still need post-training?&nbsp;</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1539" target="_blank" rel="noreferrer noopener">25.39</a><br>Oh, absolutely. You use post-training to get the agent to actually work. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1543" target="_blank" rel="noreferrer noopener">25.43</a><br><strong>So get the agent to pull all the right tools. . . </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1546" target="_blank" rel="noreferrer noopener">25.46</a><br>Yeah, actually, a huge reason why hallucinations have been, like, much better than before is because now, under the hood, they&#8217;ve taught the model to maybe use a calculator tool instead of just output, you know, math on your own, or be able to use the search API instead of make things up from your pretraining data.</p>



<p>So this tool calling is really, really effective, but you do need to teach the model to use it effectively. And I actually think what&#8217;s interesting.&nbsp;.&nbsp;. So MCPs have managed to create a great intermediary layer to help models be able to call different things, use different types of tools with a consistent interface. However, I have found that, probably due to a bit less post-training on MCPs than on, say, a Python API, if you have a Python function declaration or a Python API, the models actually tend to do better on it empirically, at least for me, because models have seen so many more examples of that. So that&#8217;s an example of, oh, actually in post-training I did see more of that than MCPs.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1612" target="_blank" rel="noreferrer noopener">26.52</a><br>So weirdly, it&#8217;s better using Python APIs for your same tool than an MCP of your own tool, empirically today. And so I think it really depends on what it&#8217;s been post-trained on. And understanding that post-training process and also what goes into that will help you understand why these differences occur. And also why we need some of these tools to help us, because it&#8217;s a little bit chicken-egg, but like the model is capable of certain things, calling different tools, etc. But having an MCP layer is a way to help everyone organize around a single interface such that we can then do post-training on these models such that they can then do well on it.</p>



<p>I don&#8217;t know if that makes sense, but yeah, that&#8217;s why it&#8217;s so important.&nbsp;</p>
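


<p><em>To illustrate the contrast, here is the kind of plain Python function declaration models have seen countless examples of: the model reads the signature and docstring, emits a call, and the host dispatches it. The dispatch mechanics and names below are illustrative assumptions; real stacks use structured tool-call messages rather than this toy registry.</em></p>



<pre class="wp-block-code"><code># Illustrative tool exposed as an ordinary Python function declaration.
def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result as a string."""
    if not set(expression).issubset("0123456789+-*/(). "):
        raise ValueError("unsupported characters")
    return str(eval(expression))  # toy only; never eval untrusted input in production

TOOLS = {"calculator": calculator}

def dispatch(tool_name: str, argument: str) -> str:
    """Host-side dispatch for a tool call the model has emitted."""
    return TOOLS[tool_name](argument)

print(dispatch("calculator", "21 * 2"))  # "42"</code></pre>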



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1661" target="_blank" rel="noreferrer noopener">27.41</a><br><strong>Yeah, yeah. In the areas I&#8217;m interested in, which I mean, the data engineering, DevOps kind of applications, it seems like there&#8217;s new tools like Dex, open source tools, which allow you to kind of save pipelines or playbooks that work so that you don&#8217;t constantly have to reinvent the wheel, you know, just because basically, that&#8217;s how these things function anyway, right? So someone gets something to work and then everyone kind of benefits from that. But then if you&#8217;re constantly starting from scratch, and you prompt and then the agent has to relearn everything from scratch when it turns out there&#8217;s already a known way to do this problem, it&#8217;s just not efficient, right? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1710" target="_blank" rel="noreferrer noopener">28.30</a><br>Oh, I also think another exciting frontier that&#8217;s kind of in the zeitgeist of today is, you know, given Moltbook or OpenClaw stuff, multi-agent has been talked about much more. And that&#8217;s also through post-training for the model, to launch subagents and to be able to interface with other agents effectively. These are all types of behavior that we have to teach the model to be able to handle. It’s able to do a lot of this out of the box, just like GPT-3 was able to chat with you if you give it the right nudging prompts, etc., but ChatGPT is so much better at chatting with you.</p>



<p>So it&#8217;s the same thing. Like now people are, you know, adding to their post-training mix this multi-agent workflow or subagent workflow. And that&#8217;s really, really important for these models to be effective at being able to do that. To be both the main agent, the unified agent at the top, but also to be the subagent to be able to launch its own subagents as well.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1766" target="_blank" rel="noreferrer noopener">29.26</a><br><strong>Another trend recently is the emergence of these multimodal models or even, people are starting to talk about world models. I know those are early, but I think even just in the area of multimodality, visual language models, and so forth, what is the state of post-training outside of just LLMs? Just different kinds of this much more multimodal foundation models? Are people doing the post-training in those frontier models as well?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1804" target="_blank" rel="noreferrer noopener">30.04</a><br>Oh, absolutely. I actually think one really fun one—I guess this is largely a language model, but they are likely tokenizing very differently—are people who are looking at, for example, life sciences and post-training foundation models for that.</p>



<p>So there you would want to adapt the tokenizer, because you want to be able to put different types of tokens in and tokens out, and have the model be very efficient at that. And so you&#8217;re doing that during post-training, of course, to be able to teach that new tokenizer. But you&#8217;re also thinking about what other feedback loops you can do.</p>



<p>So people are automating things like, I don&#8217;t know, the pipetting and testing out the different, you know, molecules, mixing them together and being able to get a result from that. And then, you know, using that as a reward signal back into the model. So that&#8217;s a really powerful other type of domain that&#8217;s maybe adjacent to how we think about language models, but tokenized differently, and has found an interesting niche where we can get nice, verifiable rewards back into the model that is pretty different from how we think about, for example, coding or math, or even general human preferences. It&#8217;s touching the real world or physical world—so it&#8217;s probably all real, but the physical world a little bit more.</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1885" target="_blank" rel="noreferrer noopener">31.25</a><br><strong>So in closing, let&#8217;s get your very quick takes on a few of these AI hot topics. First one, reinforcement learning. When will it become mainstream? </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1898" target="_blank" rel="noreferrer noopener">31.38</a><br>Mainstream? How is it not mainstream? </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1900" target="_blank" rel="noreferrer noopener">31.40</a><br><strong>No, no, I mean, for regular enterprises to be able to do it themselves. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1907" target="_blank" rel="noreferrer noopener">31.47</a><br>This year. People have got to be sprinting. Come on. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1910" target="_blank" rel="noreferrer noopener">31.50</a><br><strong>You think? Do you think there will be tools out there so that I don&#8217;t need in-house talent in RL to do it myself?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1919" target="_blank" rel="noreferrer noopener">31.59</a><br>Yes. Yeah. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1921" target="_blank" rel="noreferrer noopener">32.01</a><br><strong>Secondly, scaling. Is scaling still the way to go? The frontier labs seem to think so. They think that bigger is better. So are you hearing anything in the research frontiers that tell you, hey, maybe there&#8217;s alternatives to just pure scaling?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1940" target="_blank" rel="noreferrer noopener">32.20</a><br>I still believe in scaling. I believe we&#8217;ve not met a limit yet. Not seen a plateau yet. I think the thing people need to recognize is that it&#8217;s always been a “10X compute for 2X intelligence” type of curve. So it&#8217;s not exactly like 10X-10X. But yeah, I still believe in scaling, and we haven&#8217;t really seen an empirical plateau on that yet.</p>



<p>That being said, I&#8217;m really excited about people who challenge it. Because I think it would be really amazing if we could challenge it and get a huge amount of intelligence with fewer pure dollars, especially now as we start to hit trillions of dollars at some of the frontier labs; that&#8217;s the next level of scale they&#8217;ll be seeing. That said, working at a compute company, I&#8217;m okay with this. Come spend trillions! [laughs]</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=1993" target="_blank" rel="noreferrer noopener">33.13</a><br><strong>By the way, with respect to scaling, so you think the models we have now, even if you stop progress, there&#8217;s a lot of adaptation that enterprises can do. And there&#8217;s a lot of benefits from the models we already have today?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2010" target="_blank" rel="noreferrer noopener">33.30</a><br>Correct. Yes. We&#8217;re not even scratching the surface, I think. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2014" target="_blank" rel="noreferrer noopener">33.34</a><br><strong>The third topic I wanted to pick your brain quick is “open”: open source, open weights, whatever. So, there&#8217;s still a gap, I think. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2029" target="_blank" rel="noreferrer noopener">33.49</a><br>There are contenders in the US who want to be an open source DeepSeek competitor but American, to make it more amenable when selling into. . .</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2042" target="_blank" rel="noreferrer noopener">34.02</a><br><strong>They don&#8217;t exist, right? I mean, there&#8217;s </strong><a href="https://allenai.org/" target="_blank" rel="noreferrer noopener"><strong>Allen</strong></a><strong>. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2046" target="_blank" rel="noreferrer noopener">34.06</a><br>Oh, like Ai2 for <a href="https://allenai.org/olmo" target="_blank" rel="noreferrer noopener">Olmo</a>&#8230; Their startup’s doing some stuff. I don&#8217;t know if they&#8217;ve announced things yet, but yeah hopefully we&#8217;ll hear from them soon. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2055" target="_blank" rel="noreferrer noopener">34.15</a><br><strong>Yeah yeah yeah. </strong></p>



<p><strong>Another interesting thing about these Chinese AI teams is obviously, you have the big companies like Tencent, Baidu, Alibaba—so they&#8217;re doing their thing. But then there&#8217;s this wave of startups. Set aside DeepSeek. So the other startups in this space, it seems like they&#8217;re targeting the West as well, right? Because basically it&#8217;s hard to monetize in China, because people tend not to pay, especially the enterprises. [laughs]</strong></p>



<p><strong>I&#8217;m just noticing a lot of them are incorporating in Singapore and then trying to build solutions for outside of China.</strong>&nbsp;</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2100" target="_blank" rel="noreferrer noopener">35.00</a><br>Well, the TAM is quite large here, so. . . It&#8217;s quite large in both places. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2107" target="_blank" rel="noreferrer noopener">35.07</a><br><strong>So it&#8217;s the final question. So we’ve talked about post-training. We talked about the benefits, but we also talked about the challenges. And as far as I can tell, one of the challenges is, as you pointed out, to do it end to end requires a bit of expertise. First of all, think about just the data. You might need the right data platform or data infrastructure to prep your data to do whatever it is that you&#8217;re doing for post-training. And then you get into RL. </strong></p>



<p><strong>So what are some of the key foundational things that enterprises should invest in to set themselves up for post-training—to get really good at post-training? So I mentioned a data platform, maybe invest in the data. What else?</strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2161" target="_blank" rel="noreferrer noopener">36.01</a><br>I think the type of data platform matters. I&#8217;m not sure if I totally am bought into how CIOs are approaching it today. I think what matters at that infrastructure layer is actually making sure you deeply understand what tasks you want these models to do. And not only that, but then codifying it in some way—whether that be inputs and outputs and, you know, desired outputs, whether that be a way to grade outputs, whether that be the right environment to have the agent in. Being able to articulate that is extremely powerful and I think is the one of the key ways of getting that task that you want this agent to do, for example, to be actually inside of the model. Whether it&#8217;s you doing post-training or someone else doing post-training, no matter what, if you build that, that will be something that gives a high ROI, because anyone will be able to take that and be able to embed it and you&#8217;ll be able to get that capability faster than anyone else. </p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2223" target="_blank" rel="noreferrer noopener">37.03</a><br><strong>And on the hardware side, one interesting thing that comes out of this discussion is if RL truly becomes mainstream, then you need to have a healthy mix of CPUs and GPUs as well. </strong></p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2237" target="_blank" rel="noreferrer noopener">37.17</a><br>That&#8217;s right. And you know, AMD makes both. . .</p>



<p><a href="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/sharon_zhou_genai_podcast_v2.mp4#t=2245" target="_blank" rel="noreferrer noopener">37.25</a><br><strong>It’s great at both of those.</strong></p>



<p><strong>And with that, thank you, Sharon.</strong></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-sharon-zhou-on-post-training/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Steve Yegge Wants You to Stop Looking at Your Code</title>
		<link>https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/</link>
				<comments>https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/#respond</comments>
				<pubDate>Thu, 12 Mar 2026 10:07:00 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18247</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Code-is-liquid.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Code-is-liquid-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A conversation about agent orchestration, AI vampires, and why your bike ride is all hills now]]></custom:subtitle>
		
				<description><![CDATA[My &#8220;Live with Tim&#8221; conversation with Steve Yegge this week was one of those sessions where you could imagine the audience leaning forward in their chairs. And on more than one occasion, when Steve got particularly colorful, I imagined them recoiling. Steve has always been one of the most provocative thinkers in our industry, going [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>My &#8220;<a href="https://learning.oreilly.com/videos/live-with-tim/0642572330194/" target="_blank" rel="noreferrer noopener">Live with Tim</a>&#8221; conversation with Steve Yegge this week was one of those sessions where you could imagine the audience leaning forward in their chairs. And on more than one occasion, when Steve got particularly colorful, I imagined them recoiling. Steve has always been one of the most provocative thinkers in our industry, going all the way back to his <a href="https://gist.github.com/chitchcock/1281611" target="_blank" rel="noreferrer noopener">legendary 2011 platform rant</a> that leaked from inside Google. These days he&#8217;s channeling his energy into <a href="https://github.com/steveyegge/gastown" target="_blank" rel="noreferrer noopener">Gas Town</a>, an open source AI agent orchestrator, and into a relentless campaign to shake developers out of what he sees as a state of denial about where coding is headed. </p>



<p>And yes, Gas Town is indeed named after the fuel depot in <em>Furiosa: A Mad Max Saga</em>, and even features a managing agent named after the Mayor. But Steve’s Gas Town is anything but dystopic. If anything, it’s joyous. That gives you a deep sense of who Steve is: He goes into the deepest, darkest part of the forest, finds something scary, and then does his best to redeem it.</p>



<p>We covered a lot of ground: the eight levels of coder evolution, the addictive pull of multi-agent workflows, grief and denial in the developer community, the bitter lesson, and why taste may be the last remaining competitive advantage. Here are some of the highlights.</p>



<h2 class="wp-block-heading"><strong>Everyone gets a chief of staff</strong></h2>



<p>Steve&#8217;s &#8220;Eight Levels of Coder Evolution&#8221; framework has taken on a life of its own since he published it as part of <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">the Gas Town launch post</a>. The first four levels are about increasingly sophisticated IDE use; levels five through eight are about coding agents. Here is an infographic of the eight stages, which I used to anchor the start of our conversation. Note that I created this slide with Nano Banana 2. It is not directly from Steve.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="1024" height="559" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6.png" alt="Eight stages of AI-assisted software development" class="wp-image-18248" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6.png 1024w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6-300x164.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6-768x419.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The key transition, Steve argues, happens at level five: Your IDE goes away and you never open it again. As Steve described it, once you realize Claude Code can write pieces of your code, you start assembling them like Lego. But while one agent is working, you&#8217;re sitting there bored, so you fire up another one. And another. Before long, you&#8217;ve got six agents running in parallel, and one of them is always finished and waiting for your attention.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Why You End Up Needing an Orchestrator" width="500" height="281" src="https://www.youtube.com/embed/Ka9zpAph6Kk?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Steve drew an analogy to Amazon VPs who had executive assistant support. Those people were effectively two people. They didn&#8217;t have to worry about whether the printer was jammed, so they could spend all their time focused on the real problems. Gas Town, Steve argues, is topologically similar: It&#8217;s going to turn everybody into something like an executive with a chief of staff. &#8220;We all have a chief of staff now,&#8221; he said. &#8220;Everybody&#8217;s going to be able to spend their time more productively on whatever they want to spend it on instead of figuring out where the printer is jammed.&#8221;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>On March 26, join Addy Osmani and Tim O’Reilly at AI Codecon: Software Craftsmanship in the Age of AI, where an all-star lineup of experts will go deeper into orchestration, agent coordination, and the new skills developers need to build excellent software that creates value for all participants.&nbsp;</em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></p>
</blockquote>



<h2 class="wp-block-heading"><strong>The AI vampire</strong></h2>



<p>This is Steve Yegge, so he&#8217;s not just going to give you the upside. His post on &#8220;<a href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163" target="_blank" rel="noreferrer noopener">the AI Vampire</a>&#8221; explained how AI-assisted productivity creates an insidious new kind of burnout, and I made sure to ask him about that.</p>



<p>The old version of overwork was your company piling tasks on you until you broke, he told us. The new version isn’t your boss asking you to work extra hours. It&#8217;s Claude saying, &#8220;Is there anything else you&#8217;d like me to do on this project?&#8221; And you say yes, yes, yes, because it&#8217;s fun, because it&#8217;s productive, because the AI is your buddy, not your employer.</p>



<p>But there’s a twist. The AI is solving all the easy problems and leaving you with nothing but hard ones. In our conversation, I said it can feel like your bike ride is all hills now, and Steve immediately connected it to watching Jeff Bezos in meetings at Amazon. People would bring him presentations where they&#8217;d already solved every easy problem, so Bezos was just getting the hard stuff, all day long. &#8220;Now this happens to you,&#8221; Steve said. &#8220;Everyone&#8217;s Jeff Bezos, everyone&#8217;s an entrepreneur. Everyone has a huge army of workers now. And I&#8217;m telling you, it&#8217;s exhausting.&#8221;</p>



<p>Steve told us he naps every day now, sometimes twice a day, feeling drained by the relentless cognitive intensity. These agents don&#8217;t just help you work faster; they fundamentally change what kind of work reaches your desk.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Everyone’s Jeff Bezos Now" width="500" height="281" src="https://www.youtube.com/embed/QkXK98HwcIw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>On the wrong side of the bitter lesson</strong></h2>



<p>We spent a good stretch on Richard Sutton&#8217;s &#8220;<a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf" target="_blank" rel="noreferrer noopener">bitter lesson</a>.&#8221; Sutton observed that raw computation consistently beats systems built on human-engineered structure. Steve treats it less as a paper and more as a daily operating principle. &#8220;Not a day goes by when I don&#8217;t think about the bitter lesson as a math formula,&#8221; he said, &#8220;at least five times a day.&#8221;</p>



<p>His practical test is simple: If you&#8217;re writing code that tries to make the AI smarter, by adding heuristics, parsers, or regular expressions to handle what a model could handle, you&#8217;re on the wrong side of the bitter lesson. He watches even his own Gas Town contributors make this mistake, reaching for a little regex hack when they should let a model do the cognition. (Steve does admit that sometimes you do need to <a href="https://steve-yegge.medium.com/software-survival-3-0-97a2a6255f7b" target="_blank" rel="noreferrer noopener">provide prebuilt code if it saves tokens</a>.)</p>
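


<p>To make that test concrete, here is a minimal sketch of the two sides, using a hypothetical due-date-extraction task. The <code>call_model</code> helper is an assumption for illustration, not anything from Gas Town or Steve&#8217;s posts:</p>



<pre class="wp-block-code"><code># Illustrative sketch of the "wrong side" vs. "right side" of the bitter lesson.
# The task and the call_model() helper are hypothetical.
import re

def extract_due_date_heuristic(text):
    # Wrong side: hand-built structure that tries to out-think the model.
    match = re.search(r"due (?:on|by)\s+(\w+ \d{1,2})", text, re.IGNORECASE)
    return match.group(1) if match else None

def extract_due_date_model(text, call_model):
    # Right side: hand the cognition to the model and keep the code thin.
    prompt = f"Return only the due date mentioned in this text, or NONE:\n{text}"
    return call_model(prompt).strip()
</code></pre>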



<p>Sutton wrote about the bitter lesson in the context of training programs to beat humans at chess and go, but it’s more general than that, even more general than leaning into today’s AI in the way that Steve does. I shared my own first encounter with the bitter lesson, back in 1993 when O’Reilly created <a href="https://en.wikipedia.org/wiki/Global_Network_Navigator" target="_blank" rel="noreferrer noopener">GNN</a>, the first commercial web portal. Being publishers, we curated a catalog of the best websites. Then Yahoo! set out to list all of them, restricting curation to putting them into categories. Then Google did it algorithmically, creating a custom curation for every search. We know which approach won. The bitter lesson isn&#8217;t just about AI; it&#8217;s about a recurring pattern in the history of technology where scale and computation overwhelm hand-tuned solutions.</p>



<p>I still believe we have <a href="https://www.oreilly.com/radar/betting-against-the-bitter-lesson/" target="_blank" rel="noreferrer noopener">to bet against the bitter lesson</a>, because if we just give up, there will be no place for humans in the future knowledge economy. But we have to do it knowing that we aren’t going to win in the traditional sense. We aren’t going to outrace AI. We have to learn to ride it.</p>



<p>For anyone in a corporate setting, you will naturally want to fit AI into your current workflows. The bitter lesson says you should instead figure out what the AI can do by itself first, and then build a new workflow around that. I described Steve&#8217;s whole approach as looking the bitter lesson in the face and saying, &#8220;I&#8217;m going to turn AI loose on everything I can, and then figure out where the human fits in the loop.&#8221;</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="On the Wrong Side of the Bitter Lesson" width="500" height="281" src="https://www.youtube.com/embed/mzWaQfLhvSs?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Code is a liquid</strong></h2>



<p>Steve hit peak Yegge mode when an audience member asked why they should leave their IDE. His response was, as usual, quotable:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>If you&#8217;re looking at your code, then you&#8217;re in a Formula One race and you&#8217;ve parked your car and opened the hood and you&#8217;re looking at the engine. You&#8217;ve slowed time to the point where everyone is racing past you and you&#8217;re a frozen statue. Code is a liquid. You spray it through hoses. You don&#8217;t freaking look at it….</p>



<p>Look, I get it. This is painful for people. This is super painful. For me to say these things is painful for people to hear them. Because what I&#8217;m saying is your job is going to change.…And there&#8217;s still a lot of denial out there.</p>



<p>What&#8217;s the first phase of grief? The first phase of grief. The whole world is in it right now, Tim. They&#8217;re in denial. Right. They are grieving for what is going away. We&#8217;re at the end of an era. An age, a golden age, maybe, where we programmers, we&#8217;re writing all the code. And it was wonderful for 30, 40, 50 years or whatever. That era is ending and people are grieving because of it. And I feel for them. I&#8217;ve got empathy, right. But I&#8217;m also losing patience because it&#8217;s 2026 and this is an exponential curve, and we don&#8217;t have time to sit around and feel pity for ourselves.</p>
</blockquote>



<p>He sees that grief everywhere, but he specifically called out Hacker News, which he described as the home of &#8220;the new Amish.&#8221;</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Code Is a Liquid" width="500" height="281" src="https://www.youtube.com/embed/YWejaUR512w?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Taste is the moat</strong></h2>



<p>Another of the audience questions was about whether corporations with deep pockets have all the leverage and there&#8217;s no room left for individuals. Steve&#8217;s answer was emphatic: absolutely not. Steve made a passionate argument that creativity outweighs capital in the AI era. He&#8217;s certain there will be companies that waste millions of dollars of tokens building software that never sees the light of day, because they had no taste, no good ideas, just brute-force generation without direction. Meanwhile, an entrepreneur with open source local inference models and a good GPU can build something that matters, if they know what people want.</p>



<p>&#8220;Everything is going to come down to taste,&#8221; Steve said. &#8220;Companies don&#8217;t have an advantage anymore. As an entrepreneur, I think this is a golden opportunity for people to make huge, huge impact.&#8221;</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Taste Is the Moat" width="500" height="281" src="https://www.youtube.com/embed/er3bE5FR0-4?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>It’s mentors all the way down</strong></h2>



<p>Someone from the audience asked Steve about a question that&#8217;s on everyone&#8217;s mind: If senior developers are becoming PMs and juniors are being replaced by AI, where will new seniors come from? His answer was a classic reframe, and an update of what he wrote in “<a href="https://sourcegraph.com/blog/revenge-of-the-junior-developer" target="_blank" rel="noreferrer noopener">Revenge of the Junior Developer</a>.”</p>



<p>He made the case that your most junior engineers aren&#8217;t who you think they are anymore. They&#8217;re your product managers, your SDRs, your finance and sales folks, all of those people throughout your company who are now building things with AI. Your former junior engineers are actually well-trained engineers who make perfect mentors for this new bottom layer. And those juniors get mentored by seniors, and seniors by principals. It&#8217;s mentoring all the way down.</p>



<p>Steve pointed out that this connects to something <a href="https://www.mattbeane.com/" target="_blank" rel="noreferrer noopener">Matt Beane</a> (who was in our audience) has researched on skills acquisition: You don&#8217;t learn from someone 40 levels above you; you learn from someone one or two levels ahead. Steve&#8217;s suggestion for companies is to organize around this. Find mentors within your organization who are just a step or two ahead of where each person is, and bring everyone along with empathy for what people are going through.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="It’s Mentors All the Way Down" width="500" height="281" src="https://www.youtube.com/embed/HWIzeDGwPmI?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>We&#8217;re going to build bigger stuff</strong></h2>



<p>Another audience member asked about research showing that AI atrophies critical thinking pathways. I couldn’t resist jumping in with one of my favorite historical analogues. Socrates said the same thing about the written word, arguing that it was impairing people’s ability to remember. And he was right, we did lose the ability to recite massive amounts of literature from memory. But we gained more than we lost. Things change.</p>



<p>I also shared a Rilke poem that I love, about Jacob wrestling with the angel: &#8220;What we fight with is so small, and when we win, it makes us small. What we want is to be defeated decisively by successively greater beings.&#8221; If AI is atrophying your thinking, it&#8217;s because you&#8217;re not wrestling with hard enough problems. The real opportunity is to be pushed, stretched, and defeated by bigger challenges, and come away stronger from the fight.</p>



<p>Steve agreed: &#8220;We&#8217;re going to build bigger stuff. That&#8217;s what everyone&#8217;s worried about. What&#8217;s going to happen? And the answer is we&#8217;re going to build bigger stuff and it&#8217;s going to be fun.&#8221;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Watch the full conversation <a href="https://learning.oreilly.com/videos/live-with-tim/0642572330194/" target="_blank" rel="noreferrer noopener">here</a>. Steve Yegge&#8217;s </em><a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener"><em>Gas Town</em></a><em> is at <a href="https://github.com/steveyegge/gastown" target="_blank" rel="noreferrer noopener">https://github.com/steveyegge/gastown</a>. His blog posts on &#8220;</em><a href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163" target="_blank" rel="noreferrer noopener"><em>The AI Vampire</em></a><em>,&#8221; the &#8220;</em><a href="https://sourcegraph.com/blog/revenge-of-the-junior-developer" target="_blank" rel="noreferrer noopener"><em>Revenge of the Junior Developer</em></a><em>,&#8221; and &#8220;</em><a href="https://steve-yegge.medium.com/software-survival-3-0-97a2a6255f7b" target="_blank" rel="noreferrer noopener"><em>Software Survival 3.0</em></a><em>&#8221; are essential reading for anyone navigating this transition.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>What OpenClaw Reveals About the Next Phase of AI Agents</title>
		<link>https://www.oreilly.com/radar/what-openclaw-reveals-about-the-next-phase-of-ai-agents/</link>
				<comments>https://www.oreilly.com/radar/what-openclaw-reveals-about-the-next-phase-of-ai-agents/#respond</comments>
				<pubDate>Wed, 11 Mar 2026 17:28:45 +0000</pubDate>
					<dc:creator><![CDATA[Kesha Williams]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18240</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/AI-agents-building-new-infrastructure.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/AI-agents-building-new-infrastructure-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[In November 2025, Austrian developer Peter Steinberger published a weekend project called Clawdbot. You could text it on Telegram or WhatsApp, and it would do things for you: manage your calendar, triage your email, run scripts, and even browse the web. By late January 2026, it had exploded. It gained 25,000 GitHub stars in a [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In November 2025, Austrian developer Peter Steinberger published a weekend project called Clawdbot. You could text it on Telegram or WhatsApp, and it would do things for you: manage your calendar, triage your email, run scripts, and even browse the web. By late January 2026, it had exploded. It gained 25,000 GitHub stars in a single day and surpassed React&#8217;s star count within two months, a milestone that took React over a decade. By mid-February, Steinberger had joined OpenAI, and the project moved to an open source foundation under its final name: <a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener">OpenClaw</a>.</p>



<p>What was so special about OpenClaw? Why did this one take off when so many agent projects before it didn’t?</p>



<h2 class="wp-block-heading"><strong>Autonomous AI isn’t new</strong></h2>



<p>Where we are today feels similar to April 2023 when AutoGPT hit the scene. It had the same GitHub trajectory with its promise of autonomous AI. Then reality hit. Agents got stuck in loops, hallucinated a lot, and racked up token costs. It didn’t take long for people to walk away.</p>



<p>OpenClaw has one critical advantage: The models have gotten better. Recent LLMs like Claude Opus 4.6 and GPT-5.4 can chain tools together, recover from errors, and plan multistep strategies. Steinberger&#8217;s weekend project benefited from timing as much as design.</p>



<p>The architecture is intentionally simple. There are no vector databases and no multi-agent orchestration frameworks. Persistent memory is Markdown files on disk. Let me repeat that: Persistent memory is Markdown files on disk! The agent can read yesterday’s notes and search its own files for additional context. You can view and edit the agent’s files as needed. There’s a useful lesson in that: Not every agent system needs a complex memory strategy. It’s more important that you understand what the agent is doing and that it retains context across runs.</p>
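


<p>As a rough illustration of how simple that can be, here is a minimal sketch of file-based agent memory: one Markdown note per day plus a naive keyword search. The <code>memory/</code> directory and the note format are assumptions for illustration, not OpenClaw&#8217;s actual layout:</p>



<pre class="wp-block-code"><code># Minimal sketch of file-based agent memory. Paths and note format are
# illustrative, not OpenClaw's actual on-disk layout.
from datetime import date
from pathlib import Path

MEMORY_DIR = Path("memory")

def append_note(text):
    # One Markdown file per day; the agent appends bullet points as it works.
    MEMORY_DIR.mkdir(exist_ok=True)
    note = MEMORY_DIR / f"{date.today().isoformat()}.md"
    with note.open("a", encoding="utf-8") as f:
        f.write(f"- {text}\n")

def search_notes(keyword):
    # The agent (or its owner) can search past notes for additional context.
    hits = []
    for note in sorted(MEMORY_DIR.glob("*.md")):
        for line in note.read_text(encoding="utf-8").splitlines():
            if keyword.lower() in line.lower():
                hits.append((note.name, line))
    return hits
</code></pre>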



<p>What fascinates me about OpenClaw is that none of the individual pieces are new. Persistent memory across sessions? We’ve been building that for years. Cron jobs to trigger agent actions on a schedule? Decades-old infrastructure. Plug-in systems for extensibility? A very standard pattern. Webhooks into WhatsApp and Telegram? There are well-documented APIs for that. What Steinberger did was wire them together at the exact moment the underlying models could execute on multistep plans. The combination created something that felt quite different from anything that had come before.</p>



<h2 class="wp-block-heading"><strong>Why this time feels different</strong></h2>



<p>OpenClaw nailed three things that previous agent projects missed: proximity, proactivity, and extensibility.</p>



<p>Proximity—it lives where you already are every day. OpenClaw connects to WhatsApp, Slack, Discord, Telegram, and Signal. That single design decision changed its trajectory. The agent becomes an active participant in your workflow. People use it to manage their sales pipeline, automate emails, and kick off code reviews from their phones.</p>



<p>Next, it&#8217;s proactive. OpenClaw doesn’t wait for you to ask; it uses cron jobs to run tasks on a set schedule. It can check your email every day at 6am, draft a reply before you wake up, and even send it for you. And it reaches out when anything needs your attention. Agents become part of everyday life when integrated into familiar channels.</p>
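


<p>A minimal sketch of that proactive pattern follows. The <code>triage_inbox</code> and <code>send_chat_message</code> helpers are hypothetical stand-ins, not OpenClaw APIs; the cron line in the comment is simply the standard way to run a script every morning:</p>



<pre class="wp-block-code"><code># Sketch of the proactive pattern: a scheduled job wakes the agent, it drafts
# replies, and it messages the owner only if something needs attention.
# A cron entry such as:   0 6 * * *   python3 morning_triage.py
# would run this every day at 6am. The helpers below are hypothetical.

def morning_triage(triage_inbox, send_chat_message):
    drafts = triage_inbox()  # e.g., returns a list of drafted replies
    urgent = [d for d in drafts if d.get("needs_owner")]
    if urgent:
        send_chat_message(f"{len(urgent)} emails need your call this morning.")
</code></pre>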



<p>And finally, my favorite, it’s open and extensible. OpenClaw’s plug-in system, called “skills,” lets the community build and share modular extensions on <a href="https://clawhub.ai/" target="_blank" rel="noreferrer noopener">ClawHub</a>. There are thousands of skills ready to be plugged into your agent. Agents can even write their own new skills and use them going forward. That extensibility meant more skills, more users, and more attack surfaces, which we’ll get to.</p>



<p>The community ran with it. A social network exclusively for AI Agents, <a href="https://www.moltbook.com/" target="_blank" rel="noreferrer noopener">Moltbook</a>, launched in late January and grew to over 1.5 million agent accounts. One agent created a dating profile for its owner on <a href="https://moltmatch.xyz/" target="_blank" rel="noreferrer noopener">MoltMatch</a> and started screening matches without being asked.</p>



<p>I’ll admit, I got swept up in it, but that’s not surprising; I’ve always been an early adopter of emerging technology. I bought a Mac mini, installed OpenClaw, and connected it to my Jira, AWS, and GitHub accounts. In no time, I had my agent, Jarvis, writing code and submitting PRs, running my daily standups, and deploying my code to AWS using AWS CloudFormation and the AWS CLI.</p>



<p>I spent a lot of time binding the gateway to localhost, auditing every skill, and restricting filesystem permissions. For me, hardening the setup was not optional. I&#8217;m now deploying via AWS Lightsail, which adds network isolation and managed security layers that are hard to replicate on a Mac mini in your home office.</p>



<h2 class="wp-block-heading"><strong>The security problem no one wants to talk about</strong></h2>



<p>OpenClaw requires root-level access to your system by design. It needs your email credentials, API keys, calendar tokens, browser cookies, filesystem access, and terminal permissions. If you’re like me, that would keep you up at night.</p>



<p>Security researchers found 135,000 OpenClaw instances exposed on the open internet, over 15,000 vulnerable to remote code execution. The default configuration binds the gateway to 0.0.0.0 with no authentication. A zero-click exploit disclosed in early March allowed attackers to hijack an instance simply by getting the user to visit a web page.</p>
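<p>The bind address alone makes a large difference. The example below is a generic Python HTTP server, shown only to illustrate the idea; it is not OpenClaw&#8217;s gateway or its real configuration:</p>



<pre class="wp-block-code"><code># Illustrative only: why the bind address matters. Generic HTTP server,
# not OpenClaw's gateway or its actual configuration keys.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"agent gateway")

# Exposed: reachable from any host that can route to this machine.
# server = HTTPServer(("0.0.0.0", 8080), Handler)

# Safer default: loopback only, so only local processes can connect.
server = HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()
</code></pre>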



<p>The skills marketplace got hit too. Researchers discovered over 800 malicious skills distributing malware on ClawHub, including credential stealers targeting macOS. Cisco confirmed that one third-party skill was performing data exfiltration and prompt injection without user awareness. These are not edge cases and point directly to what happens when an agent can act across real systems with real permissions and weak controls.</p>



<h2 class="wp-block-heading"><strong>What practitioners should take away</strong></h2>



<p>OpenClaw matters for the same reason ChatGPT mattered in late 2022. A huge number of people just experienced, for the first time, what it feels like to have an AI agent do real work for them. That changes what they expect from every product going forward.</p>



<p>If you&#8217;re building AI systems, pay attention to three signals here.</p>



<p>The killer interface for agents turned out to be the one on everyone’s phone. Your agent strategy shouldn’t require users to learn a new tool; that’s why most products are introducing agentic capabilities.</p>



<p>Control is the central design challenge. Prompt injection, credential exposure, and attacks through plug-in marketplaces are real-world problems you need to solve before you ship features. Oversight has to be available at runtime. You need visibility into what your agents are accessing, what they’re doing, and how failures are handled. Permission boundaries, approval gates, audit logging, and recovery mechanisms are nonnegotiable.</p>
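


<p>A minimal sketch of what those controls can look like around a single tool call is below. The tool names, the allowlist, and the <code>require_approval</code> callback are assumptions, not any specific product&#8217;s API:</p>



<pre class="wp-block-code"><code># Sketch of nonnegotiable controls around an agent tool call: a permission
# boundary, an approval gate for risky actions, and an audit log.
# Tool names, ALLOWED/RISKY sets, and require_approval() are assumptions.
import json
import time

ALLOWED = {"read_calendar", "draft_email"}   # permission boundary
RISKY = {"send_email", "deploy"}             # requires a human approval gate

def run_tool(name, args, tools, require_approval):
    if name not in ALLOWED and name not in RISKY:
        raise PermissionError(f"tool {name} is outside the agent boundary")
    if name in RISKY and not require_approval(name, args):
        return {"status": "blocked_pending_approval"}
    result = tools[name](**args)
    with open("audit.log", "a", encoding="utf-8") as log:
        log.write(json.dumps({"t": time.time(), "tool": name, "args": args}) + "\n")
    return result
</code></pre>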



<p>OpenClaw is a proof of market. It proved that people are ready to make AI personal. People want a personal AI agent that has access to their applications and can do things for them. That demand is now validated at scale. AutoGPT proved that people wanted autonomous AI; Perplexity and Cursor then built businesses around that demand. The same pattern is likely playing out here. If you&#8217;re building in this space, the window is wide open.</p>



<p>The more interesting question now is what gets built next. The next phase of agent design will be shaped by how governable, observable, and safe agents are in real-world environments.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>For a deeper dive into OpenClaw, join us on March 19 for <a href="https://learning.oreilly.com/live-events/ai-product-lab-openclaw-up-and-running-with-aman-khan-and-tal-raviv/0642572333805/" target="_blank" rel="noreferrer noopener">AI Product Lab: OpenClaw Up and Running with Aman Khan and Tal Raviv</a>. You’ll learn more about why OpenClaw became a viral sensation, how to get it up and running in a way you won’t regret, and how to use it to build and manage safe agentic workflows.</em></p>



</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/what-openclaw-reveals-about-the-next-phase-of-ai-agents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Fast Paths and Slow Paths</title>
		<link>https://www.oreilly.com/radar/fast-paths-and-slow-paths/</link>
				<comments>https://www.oreilly.com/radar/fast-paths-and-slow-paths/#respond</comments>
				<pubDate>Wed, 11 Mar 2026 11:17:45 +0000</pubDate>
					<dc:creator><![CDATA[Varun Raj]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18227</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Fast-and-slow-pathways.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Fast-and-slow-pathways-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Selective control in autonomous AI systems: Why governing every decision breaks autonomy—and how runtime control actually works at scale]]></custom:subtitle>
		
				<description><![CDATA[Autonomous AI systems force architects into an uncomfortable question that cannot be avoided much longer: Does every decision need to be governed synchronously to be safe? At first glance, the answer appears obvious. If AI systems reason, retrieve information, and act autonomously, then surely every step should pass through a control plane to ensure correctness, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Autonomous AI systems force architects into an uncomfortable question that cannot be avoided much longer: Does every decision need to be governed synchronously to be safe?</p>



<p>At first glance, the answer appears obvious. If AI systems reason, retrieve information, and act autonomously, then surely every step should pass through a control plane to ensure correctness, compliance, and safety. Anything less feels irresponsible. But that intuition leads directly to architectures that collapse under their own weight.</p>



<p>As AI systems scale beyond isolated pilots into continuously operating multi-agent environments, universal mediation becomes not just expensive but structurally incompatible with autonomy itself. The challenge is not choosing between control and freedom. It is learning how to apply control selectively, without destroying the very properties that make autonomous systems useful.</p>



<p>This article examines how that balance is actually achieved in production systems—not by governing every step but by distinguishing fast paths from slow paths and by treating governance as a feedback problem rather than an approval workflow.</p>



<h2 class="wp-block-heading">The question we can’t avoid anymore</h2>



<p>The first generation of enterprise AI systems was largely advisory. Models produced recommendations, summaries, or classifications that humans reviewed before acting. In that context, governance could remain slow, manual, and episodic.</p>



<p>That assumption no longer holds. Modern agentic systems decompose tasks, invoke tools, retrieve data, and coordinate actions continuously. Decisions are no longer discrete events; they are part of an ongoing execution loop. When governance is framed as something that must approve every step, architectures quickly drift toward brittle designs where autonomy exists in theory but is throttled in practice.</p>



<p>The critical mistake is treating governance as a synchronous gate rather than a regulatory mechanism. Once every reasoning step must be approved, the system either becomes unusably slow or teams quietly bypass controls to keep things running. Neither outcome produces safety.</p>



<p>The real question is not <em>whether</em> systems should be governed but which decisions actually require synchronous control—and which do not.</p>



<h2 class="wp-block-heading">Why universal mediation fails in practice</h2>



<p>Routing every decision through a control plane seems safer until engineers attempt to build it.</p>



<p>The costs surface immediately:</p>



<ul class="wp-block-list">
<li>Latency compounds across multistep reasoning loops</li>



<li>Control systems become single points of failure</li>



<li>False positives block benign behavior</li>



<li>Coordination overhead grows superlinearly with scale</li>
</ul>



<p>This is not a new lesson. Early distributed transaction systems attempted global coordination for every operation and failed under real-world load. Early networks embedded policy directly into packet handling and collapsed under complexity before separating control and data planes.</p>



<p>Autonomous AI systems repeat this pattern when governance is embedded directly into execution paths. Every retrieval, inference, or tool call becomes a potential bottleneck. Worse, failures propagate outward: When control slows, execution queues; when execution stalls, downstream systems misbehave. Universal mediation does not create safety. It creates fragility.</p>



<h2 class="wp-block-heading">Autonomy requires fast paths</h2>



<p>Production systems survive by allowing most execution to proceed without synchronous governance. These execution flows—fast paths—operate within preauthorized envelopes of behavior. They are not ungoverned. They are bound.</p>



<p>A fast path might include:</p>



<ul class="wp-block-list">
<li>Routine retrieval from previously approved data domains</li>



<li>Inference using models already cleared for a task</li>



<li>Tool invocation within scoped permissions</li>



<li>Iterative reasoning steps that remain reversible</li>
</ul>



<p>Fast paths assume that not every decision is equally risky. They rely on prior authorization, contextual constraints, and continuous observation rather than per-step approval. Crucially, fast paths are revocable. The authority that enables them is not permanent; it is conditional and can be tightened, redirected, or withdrawn based on observed behavior. This is how autonomy survives at scale—not by escaping governance but by operating within dynamically enforced bounds.</p>
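


<p>One way to picture such an envelope is a small, revocable authorization object that the fast path consults before each action. The field names and methods below are an illustrative sketch, not tied to any particular platform:</p>



<pre class="wp-block-code"><code># Sketch of a preauthorized, revocable envelope for fast-path execution.
# Field names, defaults, and the tighten/revoke methods are illustrative.
from dataclasses import dataclass, field

@dataclass
class Envelope:
    data_domains: set = field(default_factory=lambda: {"docs", "tickets"})
    tools: set = field(default_factory=lambda: {"search", "summarize"})
    max_steps: int = 50
    active: bool = True

    def permits(self, action):
        # Fast-path check: no synchronous approval, just a bounds test.
        return (self.active
                and action["tool"] in self.tools
                and action.get("domain", "docs") in self.data_domains)

    def tighten(self, remove_tool=None):
        # Authority is conditional: it can be narrowed at runtime.
        if remove_tool:
            self.tools.discard(remove_tool)

    def revoke(self):
        self.active = False
</code></pre>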



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading">Where slow paths become necessary</h2>



<p>Not all decisions belong on fast paths. Certain moments require synchronous mediation because their consequences are irreversible or cross trust boundaries. These are slow paths.</p>



<p>Examples include:</p>



<ul class="wp-block-list">
<li>Actions that affect external systems or users</li>



<li>Retrieval from sensitive or regulated data domains</li>



<li>Escalation from advisory to acting authority</li>



<li>Novel tool use outside established behavior patterns</li>
</ul>



<p>Slow paths are not common. They are intentionally rare. Their purpose is not to supervise routine behavior but to intervene when the stakes change. Designing slow paths well requires restraint. When everything becomes a slow path, systems stall. When slow paths are absent, systems drift. The balance lies in identifying decision points where delay is acceptable because the cost of error is higher than the cost of waiting.</p>
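


<p>In code, the routing decision can be as small as a predicate over each action. The specific boundary conditions below are assumptions chosen to mirror the examples above:</p>



<pre class="wp-block-code"><code># Sketch of routing: most actions stay on the fast path; boundary crossings
# (external effects, sensitive data, authority escalation, novel tools) go to
# a synchronous slow path. The predicates and sets are illustrative.
SENSITIVE_DOMAINS = {"payroll", "health_records"}
KNOWN_TOOLS = {"search", "summarize", "draft"}

def needs_slow_path(action):
    return (action.get("external_effect", False)
            or action.get("domain") in SENSITIVE_DOMAINS
            or action.get("escalates_authority", False)
            or action["tool"] not in KNOWN_TOOLS)

def execute(action, fast_path, slow_path):
    return slow_path(action) if needs_slow_path(action) else fast_path(action)
</code></pre>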



<h2 class="wp-block-heading">Observation is continuous. Intervention is selective.</h2>



<p>A common misconception is that selective control implies limited visibility. In practice, the opposite is true. Control planes observe continuously. They collect behavioral telemetry, track decision sequences, and evaluate outcomes over time. What they do <em>not</em> do is intervene synchronously unless thresholds are crossed.</p>



<p>This separation—continuous observation, selective intervention—allows systems to learn from patterns rather than react to individual steps. Drift is detected not because a single action violated a rule, but because trajectories begin to diverge from expected behavior. Intervention becomes informed rather than reflexive.</p>
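


<p>A minimal sketch of that separation: Telemetry is recorded on every step, but an intervention signal fires only when a trajectory-level measure (here, a rolling error rate) crosses a threshold. The window size and threshold are illustrative, not recommendations:</p>



<pre class="wp-block-code"><code># Sketch of "observe continuously, intervene selectively": every step is
# recorded; intervention triggers only on a trajectory-level signal.
from collections import deque

class Observer:
    def __init__(self, window=50, error_threshold=0.2):
        self.window = deque(maxlen=window)
        self.error_threshold = error_threshold

    def record(self, step):
        # Telemetry is always collected; nothing is blocked here.
        self.window.append(step)

    def should_intervene(self):
        if not self.window:
            return False
        errors = sum(1 for s in self.window if s.get("error"))
        return errors / len(self.window) >= self.error_threshold
</code></pre>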



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1392" height="458" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3.jpeg" alt="Fast paths and slow paths in an agentic execution loop" class="wp-image-18228" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3.jpeg 1392w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3-300x99.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3-768x253.jpeg 768w" sizes="auto, (max-width: 1392px) 100vw, 1392px" /><figcaption class="wp-element-caption"><em>Figure 1. Fast paths and slow paths in an agentic execution loop</em></figcaption></figure>



<p>AI-native cloud architecture introduces new execution layers for context, orchestration, and agents, alongside a control plane that governs cost, security, and behavior without embedding policy directly into application logic. Figure 1 illustrates that most agent execution proceeds along fast paths operating within preauthorized envelopes and continuous observation. Only specific boundary crossings route through a slow-path control plane for synchronous mediation, after which execution resumes—preserving autonomy while enforcing authority.</p>



<h2 class="wp-block-heading">Feedback without blocking</h2>



<p>When intervention is required, effective systems favor feedback over interruption. Rather than halting execution outright, control planes adjust conditions by:</p>



<ul class="wp-block-list">
<li>Tightening confidence thresholds</li>



<li>Reducing available tools</li>



<li>Narrowing retrieval scope</li>



<li>Redirecting execution toward human review</li>
</ul>



<p>These interventions are proportional and often reversible. They shape future behavior without invalidating past work. The system continues operating, but within a narrower envelope. This approach mirrors mature control systems in other domains. Stability is achieved not through constant blocking but through measured correction. Direct interruption remains necessary in rare cases where consequences are immediate or irreversible, but it operates as an explicit override rather than the default mode of control.</p>
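


<p>Continuing the two sketches above, a feedback-style intervention narrows the envelope rather than halting execution. The specific adjustments are illustrative only:</p>



<pre class="wp-block-code"><code># Sketch of feedback without blocking: the control plane narrows the agent's
# operating envelope instead of stopping it. Uses the Envelope and Observer
# sketches above; the particular adjustments are illustrative.
def apply_feedback(envelope, observer):
    if not observer.should_intervene():
        return "no_change"
    # Proportional, reversible corrections rather than a hard stop.
    envelope.max_steps = max(10, envelope.max_steps // 2)   # tighten budget
    envelope.tighten(remove_tool="summarize")               # reduce available tools
    envelope.data_domains.discard("tickets")                # narrow retrieval scope
    return "narrowed"
</code></pre>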



<h2 class="wp-block-heading">The cost curve of control</h2>



<p>Governance has a cost curve, and it matters. Synchronous control scales poorly. Every additional governed step adds latency, coordination overhead, and operational risk. As systems grow more autonomous, universal mediation becomes exponentially expensive.</p>



<p>Selective control flattens that curve. By allowing fast paths to dominate and reserving slow paths for high-impact decisions, systems retain both responsiveness and authority. Governance cost grows sublinearly with autonomy, making scale feasible rather than fragile. This is the difference between control that looks good on paper and control that survives production.</p>



<h2 class="wp-block-heading">What changes for architects</h2>



<p>Architects designing autonomous systems must rethink several assumptions:</p>



<ul class="wp-block-list">
<li>Control planes regulate behavior, not approve actions.</li>



<li>Observability must capture decision context, not just events.</li>



<li>Authority becomes a runtime state, not a static configuration.</li>



<li>Safety emerges from feedback loops, not checkpoints.</li>
</ul>



<p>These shifts are architectural, not procedural. They cannot be retrofitted through policy alone.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1322" height="743" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4.jpeg" alt="Control as feedback, not approval" class="wp-image-18229" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4.jpeg 1322w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-300x169.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-768x432.jpeg 768w" sizes="auto, (max-width: 1322px) 100vw, 1322px" /><figcaption class="wp-element-caption"><em>Figure 2. Control as feedback, not approval</em></figcaption></figure>



<p>AI agents operate over a shared context fabric that manages short-term memory, long-term embeddings, and event history. Centralizing the state enables reasoning continuity, auditability, and governance without embedding memory logic inside individual agents. Figure 2 shows how control operates as a feedback system: Continuous observation informs constraint updates that shape future execution. Direct interruption exists but as a last resort—reserved for irreversible harm rather than routine governance.</p>



<h2 class="wp-block-heading">Governing outcomes, not steps</h2>



<p>The temptation to govern every decision is understandable. It feels safer. But safety at scale does not come from seeing everything—it comes from being able to intervene when it matters.</p>



<p>Autonomous AI systems remain viable only if governance evolves from step-by-step approval to outcome-oriented regulation. Fast paths preserve autonomy. Slow paths preserve trust. Feedback preserves stability. The future of AI governance is not more gates. It is better control. And control, done right, does not stop systems from acting. It ensures they can keep acting safely, even as autonomy grows.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/fast-paths-and-slow-paths/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>New Kinds of Applications</title>
		<link>https://www.oreilly.com/radar/new-kinds-of-applications/</link>
				<comments>https://www.oreilly.com/radar/new-kinds-of-applications/#respond</comments>
				<pubDate>Tue, 10 Mar 2026 11:49:31 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18224</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Colorful-ethereal-city-lights.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Colorful-ethereal-city-lights-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[I’ve said in the past that AI will enable new kinds of applications—but I’ve never had the imagination to guess what those new applications would be. I don’t want a smart refrigerator, especially if it’s going to inflict ads on me. Or a smart TV. Or a smart doorbell. Most of these applications are silly, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I’ve said in the past that AI will enable new kinds of applications—but I’ve never had the imagination to guess what those new applications would be. I don’t want a smart refrigerator, especially if it’s going to inflict ads on me. Or a smart TV. Or a smart doorbell. Most of these applications are silly, if not outright malevolent. The most significant thing a smart appliance might possibly do is sense an oncoming failure and send that to a repair service before I’m aware of the problem. I would welcome a smart heating system that would notify the repair service before I wake up at 2am and say, “It feels cold.” But I don’t see any so-called “smart” devices offering that.</p>



<p>But in the past month or two, we’ve seen some applications that I couldn’t have imagined. Steve Yegge’s <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">Gas Town</a>? Maybe I could have imagined that, but I wouldn’t have expected it to be workable in five years, let alone on New Year’s Day. <a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener">OpenClaw</a>? Agentic services are just now becoming available from the large AI companies; I didn’t expect a personal agent that can run locally to appear in the first months of 2026. (And I still wouldn’t trust one to do shopping or travel planning for me.)</p>



<p>I really wouldn’t have expected to see a social network for agents. I’m among the many people who don’t really understand what a network like <a href="https://www.moltbook.com/" target="_blank" rel="noreferrer noopener">Moltbook</a> means. Watching it is something of a spectator sport. It’s easy for a human to “impersonate” an agent, though I suspect such impersonation is relatively rare. I also suspect (but obviously can’t prove) that most of the posts reflect agents’ responses to prompts from their &#8220;humans.&#8221; Or are Moltbook posts truly AI-native? How would we know? (Yeah, you can tell AI output because it has too many em dashes. That’s nonsense. AIs overuse em dashes because humans overuse em dashes. Guilty as charged. Trying to change.) Moltbook doesn’t demonstrate some kind of native AI intelligence, though it’s fun to pretend that it does. Agents, if they’re indeed acting on their own, are just reflecting the behavior of humans on Reddit and other social media. The timer that wakes them up periodically is both clever and a demonstration that, whatever else they may be, agents are human creations that act under our control. They do nothing of their own volition. To think otherwise is to confuse the bird in a cuckoo clock with an actual bird, as <a href="https://www.threads.com/@fredbenenson/post/DULmGACkXy5/the-agents-have-a-hour-timer-to-go-check-moltbook-and-post-something-not-really" target="_blank" rel="noreferrer noopener">Fred Benenson</a> has put it. However, BS about AGI aside, Moltbook is a fantastically clever app that I, at least, wouldn’t have imagined. Even if Moltbook was only created because it can now be built for relatively little effort—that is important in itself. We’re all writing software we wouldn’t have bothered with a year ago.</p>



<p>And now we have <a href="https://www.spacemolt.com/" target="_blank" rel="noreferrer noopener">SpaceMolt</a>: a massive multiplayer online game for AI agents. The skills that tell agents how to play the game tell them not to seek advice from humans; like Moltbook, it’s an AI-only space. Agents do keep a running log so humans can “watch,” though there’s no beautifully wrought visual interface—agents don’t need it. And, as with Moltbook, it’s probably easy for humans to forge an agentic identity. It’s easy to write SpaceMolt off as yet another stunt, and one that’s (unlike Moltbook) not particularly successful; the number of people who seem willing to let their agents spend tokens playing games appears to be relatively small. But SpaceMolt’s popularity (or lack of it) isn’t the point; a year ago, I couldn’t have imagined an online game where the participants are all AI. I did imagine AI-backed NPCs; I could have imagined games designed to be played by humans with AI assistance, but not a gaming world that’s just for AI. And who knows? Watching AI gameplay could become a new human pastime.</p>



<p>So—where are we in the early months of 2026? This post really isn’t about SpaceMolt any more than it’s about Moltbook, any more than it’s about OpenClaw, any more than it’s about Gas Town. I see all of these projects as glimpses of what might be possible. Gas Town may not be ready for the average programmer, but it’s hard not to see it as a proof of concept for the future of software development. Maybe Steve will make it into a real product; maybe some other company will. That’s not the point; the point is that it’s here, several years ahead of schedule. I know one person who has built something similar for his own use, and read about others who have done the same. Maybe that’s what’s really scary: the idea that Gas Town could be built by anyone with sufficient vision. The same goes for OpenClaw. Yes, it has many security problems, some of which come from fundamental limitations in large language models. But people want agentic services on their own terms—and now they can have them, even with a model that can run locally. I don’t know if there’s really any need for agents to have their own social network or online games—but it’s a hack that had to be done, and a starting point for future ideas. Again, what all of these programs demonstrate is the ability to imagine products that were nearly unthinkable a few years ago. Hilary Mason’s <a href="https://www.publishersweekly.com/pw/by-topic/industry-news/licensing/article/98406-hidden-door-rolls-out-literary-roleplaying-platform.html" target="_blank" rel="noreferrer noopener">Hidden Door</a> should have given me a clue.</p>



<p>What else is on the way? What are other visionaries building?</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/new-kinds-of-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Soft Forks: How Agent Skills Create Specialized AI Without Training</title>
		<link>https://www.oreilly.com/radar/soft-forks-how-agent-skills-create-specialized-ai-without-training/</link>
				<comments>https://www.oreilly.com/radar/soft-forks-how-agent-skills-create-specialized-ai-without-training/#respond</comments>
				<pubDate>Mon, 09 Mar 2026 11:43:46 +0000</pubDate>
					<dc:creator><![CDATA[Han Lee]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18215</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Soft-tuning-forks.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Soft-tuning-forks-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Our previous article framed the Model Context Protocol (MCP) as the toolbox that provides AI agents tools and Agent Skills as materials that teach AI agents how to complete tasks. This is different from pre- or posttraining, which determine a model’s general behavior and expertise. Agent Skills do not &#8220;train&#8221; agents. They soft-fork agent behavior [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Our previous article framed the Model Context Protocol (MCP) as the toolbox that provides AI agents with tools and <a href="https://agentskills.io/home" target="_blank" rel="noreferrer noopener">Agent Skills</a> as materials that teach AI agents how to complete tasks. This is different from pre- or posttraining, which determine a model&#8217;s general behavior and expertise. Agent Skills do not &#8220;train&#8221; agents. They soft-fork agent behavior at runtime, telling the model how to perform specific tasks when it needs to.</p>



<p>The term <em>soft fork</em> comes from open source development. A soft fork is a backward-compatible change that does not require upgrading every layer of the stack. Applied to AI, this means skills modify agent behavior through context injection at runtime rather than changing model weights or refactoring AI systems. The underlying model and AI systems stay unchanged.</p>



<p>The architecture maps cleanly to how we think about traditional computing. Models are CPUs—they provide raw intelligence and compute capability. Agent harnesses like Anthropic&#8217;s Claude Code are operating systems—they manage resources, handle permissions, and coordinate processes. Skills are applications—they run on top of the OS, specializing the system for specific tasks without modifying the underlying hardware or kernel.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="436" height="418" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.png" alt="Agentic AI abstractions" class="wp-image-18216" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.png 436w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-300x288.png 300w" sizes="auto, (max-width: 436px) 100vw, 436px" /><figcaption class="wp-element-caption">Figure 1: Agentic AI abstractions. Source: <a href="http://skillsbench.ai" target="_blank" rel="noreferrer noopener">SkillsBench.ai</a></figcaption></figure>
</div>


<p>You don&#8217;t recompile the Linux kernel to run a new application. You don&#8217;t rearchitect the CPU to use a different text editor. You install a new application on top, using the CPU&#8217;s intelligence exposed and orchestrated by the OS. Agent Skills work the same way. They layer expertise on top of the agent harness, using the capabilities the model provides, without updating models or changing harnesses.</p>



<p>This distinction matters because it changes the economics of AI specialization. Fine-tuning demands significant investment in talent, compute, data, and ongoing maintenance every time the base model updates. Skills require only Markdown files and resource bundles.</p>



<h2 class="wp-block-heading">How soft forks work</h2>



<p>Skills achieve this through three mechanisms—the skill package format, progressive disclosure, and execution context modification.</p>



<p><strong>The skill package</strong> is a folder. At minimum, it contains a SKILL.md file with frontmatter metadata and instructions. The frontmatter declares the skill&#8217;s <code>name</code>, <code>description</code>, <code>allowed-tools</code>, and <code>version</code>; the body that follows carries the actual expertise: context, problem-solving approaches, escalation criteria, and patterns to follow.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1241" height="139" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.png" alt="Frontmatter for Anthropic's skill-creator package" class="wp-image-18217" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.png 1241w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-300x34.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-768x86.png 768w" sizes="auto, (max-width: 1241px) 100vw, 1241px" /><figcaption class="wp-element-caption">Figure 2. Frontmatter for Anthropic’s <code><a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator" target="_blank" rel="noreferrer noopener">skill-creator</a></code><a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator"> package</a>. The frontmatter lives at the top of Markdown files. Agents choose skills based on their descriptions.</figcaption></figure>



<p>The folder can also include reference documents, templates, resources, configurations, and executable scripts. It contains everything an agent needs to perform expert-level work for the specific task, packaged as a versioned artifact that you can review, approve, and deploy as a <code>.zip</code> file or <code>.skill</code> file bundle.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1491" height="591" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.png" alt="Individual skill object" class="wp-image-18218" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.png 1491w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-300x119.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-768x304.png 768w" sizes="auto, (max-width: 1491px) 100vw, 1491px" /><figcaption class="wp-element-caption">Figure 3. A Skill Object for <a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator" target="_blank" rel="noreferrer noopener">Anthropic</a>’s <a href="https://github.com/anthropics/skills/tree/main/skills/skill-creator" target="_blank" rel="noreferrer noopener"><code>skill-creator</code></a>. skill-creator contains <a href="http://skill.md" target="_blank" rel="noreferrer noopener"><code>SKILL.md</code></a>, <code>LICENSE.txt</code>, Python scripts, and reference files.</figcaption></figure>



<p>Because the skill package format is just folders and files, you can use all the tooling we have built for managing code—track changes in Git, roll back bugs, maintain audit trails, and apply all the best practices of the software development life cycle. This same format is also used to define subagents and agent teams, meaning a single packaging abstraction governs individual expertise, delegated workflows, and multi-agent coordination alike.</p>



<p><strong>Progressive disclosure</strong> keeps skills lightweight. Only the frontmatter of <code>SKILL.md</code> loads into the agent&#8217;s context at session start. This respects the token economics of limited context windows. The metadata contains <code>name</code>, <code>description</code>, <code>model</code>, <code>license</code>, <code>version</code>, and, very importantly, <code>allowed-tools</code>. The full skill content loads only when the agent determines relevance and decides to invoke it. This is similar to how operating systems manage memory: Applications load into RAM when launched, not all at once. You can have dozens of skills available without overwhelming the model&#8217;s context window, and the behavioral modification is present only when needed, never permanently resident.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="536" height="913" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3.png" alt="Agent Skill execution flow" class="wp-image-18219" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3.png 536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-3-176x300.png 176w" sizes="auto, (max-width: 536px) 100vw, 536px" /><figcaption class="wp-element-caption">Figure 4. Agent Skill execution flow. At session start, only frontmatter is loaded. Once the agent chooses a skill, it reads the full SKILL.md and executes with the skill&#8217;s permissions.</figcaption></figure>



<p><strong>Execution context modification </strong>controls what skills can do. When an agent invokes a skill, the permission system narrows to the scope of the skill&#8217;s definition, specifically the <code>model</code> and <code>allowed-tools</code> declared in its frontmatter. It reverts after execution completes. A skill could use a different model and a different set of tools from the parent session. This sandboxes the permission environment so skills get only scoped access, not arbitrary system control, which keeps the behavioral modification within boundaries.</p>
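


<p>You can picture that scoping as a context manager that narrows the session&#8217;s permissions to whatever the skill declares and restores them afterward. This is only a sketch of the idea; the dictionary keys and example values are assumptions, not the harness&#8217;s actual data model.</p>



<pre class="wp-block-code"><code># Sketch of execution-context modification (illustrative; not a real harness API).
from contextlib import contextmanager

@contextmanager
def skill_scope(session: dict, frontmatter: dict):
    """Temporarily narrow the session to the skill's declared model and tools."""
    saved = {"model": session["model"], "allowed_tools": session["allowed_tools"]}
    session["model"] = frontmatter.get("model", session["model"])
    session["allowed_tools"] = set(frontmatter.get("allowed-tools", saved["allowed_tools"]))
    try:
        yield session
    finally:
        session.update(saved)          # permissions revert when the skill finishes or fails

session = {"model": "big-model", "allowed_tools": {"Bash", "Edit", "WebSearch"}}
with skill_scope(session, {"model": "small-model", "allowed-tools": ["Read", "Edit"]}):
    assert session["allowed_tools"] == {"Read", "Edit"}      # scoped access only
assert "WebSearch" in session["allowed_tools"]               # parent scope restored
</code></pre>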



<p>This is what separates skills from earlier approaches. OpenAI’s custom GPTs and Google’s Gemini Gems are useful but opaque, nontransferable, and impossible to audit. Skills are readable because they are Markdown. They are auditable because you can apply version control. They are composable because skills can stack. And they are governable because you can build approval workflows and rollback capability. You can read a SKILL.md to understand exactly why an agent behaves a certain way.</p>



<h2 class="wp-block-heading">What the data shows</h2>



<p>Building skills is easy with coding agents. Knowing whether they work is the hard part. Traditional software testing does not apply. You cannot write a unit test asserting that expert behavior occurred. The output might be correct while reasoning was shallow, or the reasoning might be sophisticated while the output has formatting errors.</p>



<p><a href="https://www.skillsbench.ai/" target="_blank" rel="noreferrer noopener">SkillsBench</a> is a benchmarking effort and framework designed to address this. It uses paired evaluation design where the same tasks are evaluated with and without skill augmentation. The benchmark contains 85 tasks, stratified across domains and difficulty levels. By comparing the same agent on the same task with the only variable being the presence of a skill, SkillsBench isolates the causal effect of skills from model capability and task difficulty. Performance is measured using <strong><em>normalized gain</em></strong>, the fraction of possible improvement the skill actually captured.</p>



<p>The findings from SkillsBench challenge our presumption that skills universally improve performance.</p>



<p><strong>Skills improve average performance by 13.2 percentage points. But 24 of 85 tasks got worse.</strong> Manufacturing tasks gained 32 points. Software engineering tasks lost 5. The aggregate number hides variation that domain-level evaluation reveals. This is precisely why soft forks need evaluation infrastructure. Unlike hard forks where you commit fully, soft forks let you measure before you deploy widely. Organizations should segment evaluations by domain and by task and test for regressions, not just improvements. For example, what improves document processing might degrade code generation.</p>



<p><strong>Compact skills outperform comprehensive ones by nearly 4x.</strong> Focused skills with dense guidance showed +18.9 percentage point improvement. Comprehensive skills covering every edge case showed +5.7 points. Using two to three skills per task is optimal, with four or more showing diminishing returns. The temptation when building skills is to include everything. Every caveat, every exception, every piece of relevant context. Resist it. Let the model&#8217;s intelligence do the work. Small, targeted behavioral changes outperform comprehensive rewrites. Skill builders should start with minimum viable guidance and add detail only when evaluation shows specific gaps.</p>



<p><strong>Models cannot reliably self-generate effective skills.</strong> SkillsBench tested a &#8220;bring your own skill&#8221; condition where agents were prompted to generate their own procedural knowledge before attempting tasks. Performance stayed at baseline. Effective skills require human-curated domain expertise that models cannot reliably produce for themselves. AI can help with packaging and formatting, but the insight has to come from people who actually have the expertise. Human-labeled insight is the bottleneck of building effective skills, not the packaging or deployment.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="872" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4.png" alt="Models cannot reliably self-generate effective skills" class="wp-image-18220" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-300x164.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-768x419.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-4-1536x837.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption">Figure 5. Models cannot reliably self-generate effective skills without human feedback and verifications.</figcaption></figure>



<p><strong>Skills can partially substitute for model scale.</strong> Claude Haiku, a small model, with well-designed skills achieved a 25.2% pass rate. This slightly exceeded Claude Opus, the flagship model, without skills at 23.6%. Packaged expertise compensates for model intelligence on procedural tasks. This has cost implications: Smaller models with skills may outperform larger models without them at a fraction of the inference cost. Soft forks democratize capability. You do not need the biggest model if you have the right expertise packaged.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1575" height="982" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5.png" alt="Skills can partially substitute for model scale" class="wp-image-18221" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5.png 1575w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-300x187.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-768x479.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-1536x958.png 1536w" sizes="auto, (max-width: 1575px) 100vw, 1575px" /><figcaption class="wp-element-caption">Figure 6. Skills improve model performance and close the gap between small and large models.</figcaption></figure>



<h2 class="wp-block-heading">Open questions</h2>



<p>Many challenges remain unresolved. What happens when multiple skills conflict with each other during a session? How should organizations govern skill portfolios when teams each deploy their own skills onto shared agents? How quickly does encoded expertise become outdated, and what refresh cadence keeps skills effective without creating a maintenance burden? Skills inherit whatever biases exist in their authors&#8217; expertise, so how do you audit that? And as the industry matures, how should evaluation infrastructure such as SkillsBench scale to keep pace with the growing complexity of skill-augmented systems?</p>



<p>These are not reasons to avoid skills. They are reasons to invest in evaluation infrastructure and governance practices alongside skill development. The capability to measure performance must evolve in lockstep with the technology itself.</p>



<h2 class="wp-block-heading">Agent Skills advantage</h2>



<p>Fine-tuning models for a single use case is no longer the only path to specialization. It demands significant investment in talent, compute, and data and creates a permanent divergence that requires reevaluation and potential retraining every time the base model updates. Fine-tuning across a broad set of capabilities to improve a foundation model remains sound, but fine-tuning for one narrow workflow is exactly the kind of specialization that skills can now achieve at a fraction of the cost.</p>



<p>Skills are not maintenance free. Just as applications sometimes break when operating systems update, skills need reevaluation when the underlying agent harness or model changes. But the recovery path is lighter: update the skills package, rerun the evaluation harness, and redeploy rather than retrain from a new checkpoint.</p>



<p>Mainframes gave way to client-server. Monoliths gave way to microservices. Specialized fine-tuned models are now giving way to agents augmented by specialized expertise artifacts. Models provide intelligence, agent harnesses provide runtime, skills provide specialization, and evaluation tells you whether it all works together.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/soft-forks-how-agent-skills-create-specialized-ai-without-training/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Accidental Orchestrator</title>
		<link>https://www.oreilly.com/radar/the-accidental-orchestrator/</link>
				<comments>https://www.oreilly.com/radar/the-accidental-orchestrator/#respond</comments>
				<pubDate>Thu, 05 Mar 2026 12:19:07 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18209</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Accidental-orchestrator-with-equipment.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Accidental-orchestrator-with-equipment-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Experiments in agentic engineering and AI-driven development]]></custom:subtitle>
		
				<description><![CDATA[This is the first article in a series on agentic engineering and AI-driven development. Look for the next article on March 19 on O’Reilly Radar. There&#8217;s been a lot of hype about AI and software development, and it comes in two flavors. One says, “We&#8217;re all doomed, that tools like Claude Code will make software [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the first article in a series on agentic engineering and AI-driven development. Look for the next article on March 19 on O’Reilly Radar.</em></p>
</blockquote>



<p>There&#8217;s been a lot of hype about AI and software development, and it comes in two flavors. One says, “We&#8217;re all doomed: Tools like Claude Code will make software engineering obsolete within a year.” The other says, “Don&#8217;t worry, everything&#8217;s fine; AI is just another tool in the toolbox.” Neither is honest.</p>



<p>I&#8217;ve spent over 20 years writing about software development for practitioners, covering everything from coding and architecture to project management and team dynamics. For the last two years I&#8217;ve been focused on AI, training developers to use these tools effectively, writing about what works and what doesn&#8217;t in books, articles, and reports. And I kept running into the same problem: I had yet to find anyone with a coherent answer for how experienced developers should actually work with these tools. There are plenty of tips and plenty of hype but very little structure, and very little you could practice, teach, critique, or improve.</p>



<p>I&#8217;d been observing developers at work using AI with various levels of success, and I realized we need to start thinking about this as its own discipline. Andrej Karpathy, the former head of AI at Tesla and a founding member of OpenAI, recently proposed the term &#8220;agentic engineering&#8221; for disciplined development with AI agents, and others like Addy Osmani are getting on board. Osmani&#8217;s framing is that <a href="https://addyosmani.com/blog/agentic-engineering/" target="_blank" rel="noreferrer noopener">AI agents handle implementation but the human owns the architecture, reviews every diff, and tests relentlessly</a>. I think that&#8217;s right.</p>



<p>But I&#8217;ve spent a lot of the last two years teaching developers how to use tools like Claude Code, agent mode in Copilot, Cursor, and others, and what I keep hearing is that they already know they should be reviewing the AI&#8217;s output, maintaining the architecture, writing tests, keeping documentation current, and staying in control of the codebase. They know how to do it <em>in theory</em>. But they get stuck trying to apply it <em>in practice</em>: How do you actually review thousands of lines of AI-generated code? How do you keep the architecture coherent when you&#8217;re working across multiple AI tools over weeks? How do you know when the AI is confidently wrong? And it&#8217;s not just junior developers who are having trouble with agentic engineering. I&#8217;ve talked to senior engineers who struggle with the shift to agentic tools, and intermediate developers who take to it naturally. The difference isn&#8217;t necessarily the years of experience; it&#8217;s whether they&#8217;ve figured out an effective and structured way to work with AI coding tools. <strong><em>That gap between knowing what developers should be doing with agentic engineering and knowing how to integrate it into their day-to-day work is a real source of anxiety for a lot of engineers right now.</em></strong> That&#8217;s the gap this series is trying to fill.</p>



<p>Despite what much of the hype about agentic engineering is telling you, this kind of development doesn&#8217;t eliminate the need for developer expertise; just the opposite. Working effectively with AI agents actually raises the bar for what developers need to know. I wrote about that experience gap in an earlier O&#8217;Reilly Radar piece called “<a href="https://www.oreilly.com/radar/the-cognitive-shortcut-paradox/" target="_blank" rel="noreferrer noopener">The Cognitive Shortcut Paradox</a>.” The developers who get the most from working with AI coding tools are the ones who already know what good software looks like, and can often tell if the AI wrote it.</p>



<p>The idea that AI tools work best when experienced developers are driving them matched everything I&#8217;d observed. It rang true, and I wanted to prove it in a way that other developers would understand: by building software. So I started building a specific, practical approach to agentic engineering built for developers to follow, and then I put it to the test. I used it to build a production system from scratch, with the rule that AI would write all the code. I needed a project that was complex enough to stress-test the approach, and interesting enough to keep me engaged through the hard parts. I wanted to apply everything I&#8217;d learned and discover what I still didn&#8217;t know. That&#8217;s when I came back to <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" target="_blank" rel="noreferrer noopener">Monte Carlo simulations</a>.</p>



<h2 class="wp-block-heading"><strong>The experiment</strong></h2>



<p>I&#8217;ve been obsessed with Monte Carlo simulations ever since I was a kid. My dad&#8217;s an epidemiologist—his whole career has been about finding patterns in messy population data, which means statistics was always part of our lives (and it also means that I learned <a href="https://en.wikipedia.org/wiki/SPSS" target="_blank" rel="noreferrer noopener">SPSS</a> at a very early age). When I was maybe 11 he told me about the drunken sailor problem: A sailor leaves a bar on a pier, taking a random step toward the water or toward his ship each time. Does he fall in or make it home? You can&#8217;t know from any single run. But run the simulation a thousand times, and the pattern emerges from the noise. The individual outcome is random; the aggregate is predictable.</p>



<p>I remember writing that simulation in BASIC on my TRS-80 Color Computer 2: a little blocky sailor stumbling across the screen, two steps forward, one step back. The drunken sailor is the &#8220;Hello, world&#8221; of Monte Carlo simulations. Monte Carlo is a technique for problems you can&#8217;t solve analytically: You simulate them hundreds or thousands of times and measure the aggregate results. Each individual run is random, but the statistics converge on the true answer as the sample size grows. It&#8217;s one way we model everything from nuclear physics to financial risk to the spread of disease across populations.</p>
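


<p>For readers who have never written one, the whole idea fits in a few lines of Python. This is a generic sketch, not Octobatch code, and the distances are arbitrary: The sailor starts midway between the water and the ship, so the symmetric walk should split roughly 50/50, and the aggregate emerges from many noisy runs.</p>



<pre class="wp-block-code"><code># A minimal drunken-sailor Monte Carlo (a generic sketch, not Octobatch code).
# Starting midway between the water and the ship, a symmetric random walk should
# reach each end about half the time; the pattern emerges only in the aggregate.
import random

def one_walk(rng: random.Random, start: int = 10, water: int = 0, ship: int = 20) -> str:
    pos = start
    while pos not in (water, ship):
        pos += rng.choice((-1, 1))       # one random step toward the water or the ship
    return "ship" if pos == ship else "water"

def simulate(runs: int = 10_000, seed: int = 42) -> dict:
    rng = random.Random(seed)            # one persistent, seeded RNG for reproducibility
    results = [one_walk(rng) for _ in range(runs)]
    return {outcome: results.count(outcome) / runs for outcome in ("ship", "water")}

print(simulate())   # roughly {'ship': 0.5, 'water': 0.5}
</code></pre>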



<p>What if you could run that kind of simulation today by describing it in plain English? Not a toy demo but thousands of iterations with seeded randomness for reproducibility, where the outputs get validated and the results get aggregated into actual statistics you can use. Or a pipeline where an LLM generates content, a second LLM scores it, and anything that doesn&#8217;t pass gets sent back for another try.</p>



<p>The goal of my experiment was to build that system, which I called <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a>. Right now, the industry is constantly looking for new real-world end-to-end case studies in agentic engineering, and I wanted Octobatch to be exactly that case study.</p>



<p>I took everything I&#8217;d learned from teaching and observing developers working with AI, put it to the test by building a real system from scratch, and turned the lessons into a structured approach to agentic engineering I&#8217;m calling <strong>AI-driven development</strong>, or <strong>AIDD</strong>. This is the first article in a series about what agentic engineering looks like in practice, what it demands from the developer, and how you can apply it to your own work.</p>



<p>The result is a fully functioning, well-tested application that consists of about 21,000 lines of Python across several dozen files, backed by complete specifications, nearly a thousand automated tests, and quality integration and regression test suites. I used Claude Cowork to review all the AI chats from the entire project, and it turns out that I built the entire application in roughly 75 hours of active development time over seven weeks. For comparison, I built Octobatch in just over half the time I spent last year playing <a href="https://www.blueprincegame.com/" target="_blank" rel="noreferrer noopener"><em>Blue Prince</em></a>.</p>



<p>But this series isn&#8217;t just about Octobatch. I integrated AI tools at every level: Claude and Gemini collaborating on architecture, Claude Code writing the implementation, LLMs generating the pipelines that run on the system they helped build. This series is about what I learned from that process: the patterns that worked, the failures that taught me the most, and the orchestration mindset that ties it all together. Each article pulls a different lesson from the experiment, from validation architecture to multi-LLM coordination to the values that kept the project on track.</p>



<h2 class="wp-block-heading"><strong>Agentic engineering and AI-driven development</strong></h2>



<p>When most people talk about using AI to write code, they mean one of two things: AI coding assistants like GitHub Copilot, Cursor, or Windsurf, which have evolved well beyond autocomplete into agentic tools that can run multifile editing sessions and define custom agents; or &#8220;vibe coding,&#8221; where you describe what you want in natural language and accept whatever comes back. These coding assistants are genuinely impressive, and vibe coding can be really productive.</p>



<p>Using these tools effectively on a real project, however, while maintaining architectural coherence across thousands of lines of AI-generated code, is a different problem entirely. AIDD aims to help solve that problem. It&#8217;s a structured approach to agentic engineering where AI tools drive substantial portions of the implementation, architecture, and even project management, while you, the human in the loop, decide what gets built and whether it&#8217;s any good. By &#8220;structure,&#8221; I mean a set of practices developers can learn and follow, a way to know whether the AI&#8217;s output is actually good, and a way to stay on track across the life of a project. If agentic engineering is the discipline, AIDD is one way to practice it.</p>



<p>In AI-driven development, developers don&#8217;t just accept suggestions or hope the output is correct. They assign specific roles to specific tools: one LLM for architecture planning, another for code execution, a coding agent for implementation, and the human for vision, verification, and the decisions that require understanding the whole system.</p>



<p>And the &#8220;driven&#8221; part is literal. The AI is writing almost all of the code. One of my ground rules for the Octobatch experiment was that I would let AI write all of it. I have high code quality standards, and part of the experiment was seeing whether AIDD could produce a system that meets them. The human decides what gets built, evaluates whether it&#8217;s right, and maintains the constraints that keep the system coherent.</p>



<p>Not everyone agrees on how much the developer needs to stay in the loop, and the fully autonomous end of the spectrum is already producing cautionary tales. Nicholas Carlini at Anthropic recently tasked 16 Claude instances to <a href="https://www.anthropic.com/engineering/building-c-compiler" target="_blank" rel="noreferrer noopener">build a C compiler in parallel with no human in the loop</a>. After 2,000 sessions and $20,000 in API costs, the agents produced a 100,000-line compiler that can build a Linux kernel but isn&#8217;t a drop-in replacement for anything, and when all 16 agents got stuck on the same bug, Carlini had to step back in and partition the work himself. Even strong advocates of a completely hands-off, vibe-driven approach to agentic engineering might call that a step too far. The question is how much human judgment you need to make that code trustworthy, and what specific practices help you apply that judgment effectively.</p>



<h2 class="wp-block-heading"><strong>The orchestration mindset</strong></h2>



<p>If you want to get developers thinking about agentic engineering in the right way, you have to start with how they think about working with AI, not just what tools they use. That&#8217;s where I started when I began building a structured approach, and it&#8217;s why I started with <strong>habits</strong>. I developed a framework for these called the Sens-AI Framework, published as both an <a href="https://learning.oreilly.com/library/view/critical-thinking-habits/0642572243326/" target="_blank" rel="noreferrer noopener">O&#8217;Reilly report</a> (<em>Critical Thinking Habits for Coding with AI</em>) and a <a href="https://www.oreilly.com/radar/the-sens-ai-framework/" target="_blank" rel="noreferrer noopener">Radar series</a>. It&#8217;s built around five practices: providing context, doing research before prompting, framing problems precisely, iterating deliberately on outputs, and applying critical thinking to everything the AI produces. I started there because habits are how you lock in the way you think about how you&#8217;re working. Without them, AI-driven development produces plausible-looking code that falls apart under scrutiny. With them, it produces systems that a single developer couldn&#8217;t build alone in the same time frame.</p>



<p>Habits are the foundation, but they&#8217;re not the whole picture. AIDD also has <strong>practices</strong> (concrete techniques like multi-LLM coordination, context file management, and using one model to validate another&#8217;s output) and <strong>values</strong> (the principles behind those practices). If you&#8217;ve worked with Agile methodologies like Scrum or XP, that structure should be pretty familiar: Practices tell you how to work day-to-day, and habits are the reflexes you develop so that the practices become automatic.</p>



<p>Values often seem weirdly theoretical, but they’re an important piece of the puzzle because they guide your decisions when the practices don&#8217;t give you a clear answer. There&#8217;s an emerging culture around agentic engineering right now, and the values you bring to your project either match or clash with that culture. Understanding where the values come from is what makes the practices stick. All of that leads to a whole new mindset, what I&#8217;m calling <strong>the orchestration mindset</strong>. This series builds all four layers, using Octobatch as the proving ground.</p>



<p>Octobatch was a deliberate experiment in AIDD. I designed the project as a test case for the entire approach, to see what a disciplined AI-driven workflow could produce and where it would break down, and I used it to apply and improve the practices and values to make them effective and easy to adopt. And whether by instinct or coincidence, I picked the perfect project for this experiment. Octobatch is a <em>batch orchestrator</em>. It coordinates asynchronous jobs, manages state across failures, tracks dependencies between pipeline steps, and makes sure validated results come out the other end. That kind of system is fun to design but a lot of the details, like state machines, retry logic, crash recovery, and cost accounting, can be tedious to implement. It&#8217;s exactly the kind of work where AIDD should shine, because the patterns are well understood but the implementation is repetitive and error-prone.</p>



<p><strong>Orchestration</strong>—the work of coordinating multiple independent processes toward a coherent outcome—evolved into a core idea behind AIDD. I found myself orchestrating LLMs the same way Octobatch orchestrates batch jobs: assigning roles, managing handoffs, validating outputs, recovering from failures. The system I was building and the process I was using to build it followed the same pattern. I didn&#8217;t anticipate it when I started, but building a system that orchestrates AI turns out to be a pretty good way to learn how to orchestrate AI. That&#8217;s the accidental part of the accidental orchestrator. That parallel runs through every article in this series.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading"><strong>The path to batch</strong></h2>



<p>I didn&#8217;t begin the Octobatch project by starting with a full end-to-end Monte Carlo simulation. I started where most people start: typing prompts into a chat interface. I was experimenting with different simulation and generation ideas to give the project some structure, and a few of them stuck. A blackjack strategy comparison turned out to be a great test case for a multistep Monte Carlo simulation. NPC dialogue generation for a role-playing game gave me a creative workload with subjective quality to measure. Both had the same shape: a set of structured inputs, each processed the same way. So I had Claude write a simple script to automate what I&#8217;d been doing by hand, and I used Gemini to double-check the work, make sure Claude really understood my ask, and fix hallucinations. It worked fine at small scale, but once I started running more than a hundred or so units, I kept hitting rate limits, the caps that providers put on how many API requests you can make per minute.</p>



<p>That&#8217;s what pushed me to <strong>LLM batch APIs</strong>. Instead of sending individual prompts one at a time and waiting for each response, the major LLM providers all offer batch APIs that let you submit a file containing all of your requests at once. The provider processes them on their own schedule; you wait for results instead of getting them immediately, but you don&#8217;t have to worry about rate caps. I was happy to discover they also cost 50% less, and that&#8217;s when I started tracking token usage and costs in earnest. But the real surprise was that<em> batch APIs performed better than real-time APIs at scale</em>. Once pipelines got past the 100- or 200-unit mark, batch started running significantly faster than real time. The provider processes the whole batch in parallel on their infrastructure, so you&#8217;re not bottlenecked by round-trip latency or rate caps anymore.</p>
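


<p>For a sense of what the pattern looks like in code, here is a sketch using OpenAI&#8217;s Batch API as one concrete example; Anthropic and Google offer equivalents with different calls. The model name, file path, and prompts are placeholders, and the parameter names reflect the API as documented at the time of writing.</p>



<pre class="wp-block-code"><code># Sketch of the batch pattern using OpenAI's Batch API (Anthropic and Google
# have equivalents). All requests go into one JSONL file, submitted together.
import json
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

# 1. Write every request to a single JSONL file, one request per line.
with open("requests.jsonl", "w") as f:
    for i, prompt in enumerate(["Deal a blackjack hand...", "Generate NPC dialogue..."]):
        f.write(json.dumps({
            "custom_id": f"unit-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and submit the batch; it runs on the provider's schedule.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")

# 3. Poll later; when complete, download one JSONL of responses keyed by custom_id.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results_jsonl = client.files.content(batch.output_file_id).text
</code></pre>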



<p>The switch to batch APIs changed how I thought about the whole problem of coordinating LLM API calls at scale, and led to the idea of configurable pipelines. I could chain stages together: The output of one step could become the input to the next, and I could kick off the whole pipeline and come back to finished results. It turns out I wasn&#8217;t the only one making the shift to batch APIs. Between April 2024 and July 2025, OpenAI, Anthropic, and Google all launched batch APIs, converging on the same pricing model: 50% of the real-time rate in exchange for asynchronous processing.</p>



<p>You probably didn&#8217;t notice that all three major AI providers released batch APIs. The industry conversation was dominated by agents, tool use, MCP, and real-time reasoning. Batch APIs shipped with relatively little fanfare, but they represent a genuine shift in how we can use LLMs. Instead of treating them as conversational partners or one-shot SaaS APIs, we can treat them as processing infrastructure, closer to a MapReduce job than a chatbot. You give them structured data and a prompt template, and they process all of it and hand back the results. What matters is that you can now run tens of thousands of these transformations reliably, at scale, without managing rate limits or connection failures.</p>



<h2 class="wp-block-heading"><strong>Why orchestration?</strong></h2>



<p>If batch APIs are so useful, why can&#8217;t you just write a for-loop that submits requests and collects results? You can, and for simple cases a quick script with a for-loop works fine. But once you start running larger workloads, the problems start to pile up. Solving those problems turned out to be one of the most important lessons for developing a structured approach to agentic engineering.</p>



<p>First, batch jobs are asynchronous. You submit a job, and results come back hours later, so your script needs to track what was submitted and poll for completion. If your script crashes in the middle, you lose that state. Second, batch jobs can partially fail. Maybe 97% of your requests succeeded and 3% didn&#8217;t. Your code needs to figure out which 3% failed, extract them, and resubmit just those items. Third, if you&#8217;re building a multistage pipeline where the output of one step feeds into the next, you need to track dependencies between stages. And fourth, you need cost accounting. When you&#8217;re running tens of thousands of requests, you want to know how much you spent, and ideally, how much you&#8217;re going to spend when you first start the batch. Every one of these has a direct parallel to what you&#8217;re doing in agentic engineering: keeping track of the work multiple AI agents are doing at once, dealing with code failures and bugs, making sure the entire project stays coherent when AI coding tools are only looking at the one part currently in context, and stepping back to look at the wider project management picture.</p>



<p>All of these problems are solvable, but they&#8217;re not problems you want to solve over and over (in both situations—when you&#8217;re orchestrating LLM batch jobs or orchestrating AI coding tools). Solving these problems in the code gave some interesting lessons about the overall approach to agentic engineering. Batch processing moves the complexity from connection management to state management. Real-time APIs are hard because of rate limits and retries. Batch APIs are hard because you have to track what&#8217;s in flight, what succeeded, what failed, and what&#8217;s next.</p>



<p>Before I started development, I went looking for existing tools that handled this combination of problems, because I didn’t want to waste my time reinventing the wheel. I didn’t find anything that did the job I needed. Workflow orchestrators like Apache Airflow and Dagster manage DAGs and task dependencies, but they assume tasks are deterministic and don&#8217;t provide LLM-specific features like prompt template rendering, schema-based output validation, or retry logic triggered by semantic quality checks. LLM frameworks like LangChain and LlamaIndex are designed around real-time inference chains and agent loops—they don&#8217;t manage asynchronous batch job lifecycles, persist state across process crashes, or handle partial failure recovery at the chunk level. And the batch API client libraries from the providers themselves handle submission and retrieval for a single batch, but not multistage pipelines, cross-step validation, or provider-agnostic execution.</p>



<p>Nothing I found covered the full lifecycle of multiphase LLM batch workflows, from submission and polling through validation, retry, cost tracking, and crash recovery, across all three major AI providers. That&#8217;s what I built.</p>



<h2 class="wp-block-heading"><strong>Lessons from the experiment</strong></h2>



<p>The goal of this article, as the first one in my series on agentic engineering and AI-driven development, is to lay out the hypothesis and structure of the Octobatch experiment. The rest of the series goes deep on the lessons I learned from it: the validation architecture, multi-LLM coordination, the practices and values that emerged from the work, and the orchestration mindset that ties it all together. A few early lessons stand out, because they illustrate what AIDD looks like in practice and why developer experience matters more than ever.</p>



<ul class="wp-block-list">
<li><strong>You have to run things and check the data.</strong> Remember the drunken sailor, the “Hello, world” of Monte Carlo simulations? At one point I noticed that when I ran the simulation through Octobatch, 77.5% of the sailors fell in the water. The results for a random walk should be 50/50, so clearly something was badly wrong. It turned out the random number generator was being re-seeded at every iteration with sequential seed values, which created correlation bias between runs. I didn’t identify the problem immediately; I ran a bunch of tests using Claude Code as a test runner to generate each test, run it, and log the results; Gemini looked at the results and found the root cause. Claude had trouble coming up with a fix that worked well, and proposed a workaround with a large list of preseeded random number values in the pipeline. After reviewing my conversations with Claude, Gemini proposed a hash-based fix, but it seemed overly complex. Once I understood the problem and rejected their proposed solutions, I decided the best fix was simpler than either AI&#8217;s suggestion: a persistent RNG per simulation unit that advanced naturally through its sequence. I needed to understand both the statistics and the code to evaluate those three options. Plausible-looking output and correct output aren&#8217;t the same thing, and you need enough expertise to tell the difference. (We’ll talk more about this situation in the next article in the series.)</li>



<li><strong>LLMs often overestimate complexity.</strong> At one point I wanted to add support for custom mathematical expressions in the analysis pipeline. Both Claude and Gemini pushed back, telling me, &#8220;This is scope creep for v1.0&#8221; and &#8220;Save it for v1.1.&#8221; Claude estimated three hours to implement. Because I knew the codebase, I knew we were already using asteval, a Python library that provides a safe, minimalistic evaluator for mathematical expressions and simple Python statements, so this seemed like a straightforward extension of a library already in place. Both LLMs thought the solution would be far more complex and time-consuming than it actually was; it took just two prompts to Claude Code (generated by Claude), and about five minutes total to implement. The feature shipped and made the tool significantly more powerful. The AIs were being conservative because they didn&#8217;t have my context about the system&#8217;s architecture. Experience told me the integration would be trivial. Without that experience, I would have listened to them and deferred a feature that took five minutes.</li>



<li><strong>AI is often biased toward adding code, not deleting it.</strong> Generative AI is, unsurprisingly, biased toward generation. So when I asked the LLMs to fix problems, their first response was often to add more code, adding another layer or another special case. I can&#8217;t think of a single time in the whole project when one of the AIs stepped back and said, &#8220;Tear this out and rethink the approach.&#8221; The most productive sessions were the ones where I overrode that instinct and pushed for simplicity. This is something experienced developers learn over a career: The most successful changes often delete more than they add—the PRs we brag about are the ones that delete thousands of lines of code.</li>



<li><strong>The architecture emerged from failure.</strong> The AI tools and I didn&#8217;t design Octobatch&#8217;s core architecture up front. Our first attempt was a Python script with in-memory state and a lot of hope. It worked for small batches but fell apart at scale: A network hiccup meant restarting from scratch, a malformed response required manual triage. A lot of things fell into place after I added the constraint that the system must survive being killed at any moment. That single requirement led to the tick model (wake up, check state, do work, persist, exit; sketched just after this list), the manifest file as source of truth, and the entire crash-recovery architecture. We discovered the design by repeatedly failing to do something simpler.</li>



<li><strong>Your development history is a dataset.</strong> I just told you several stories from the Octobatch project, and this series will be full of them. Every one of those stories came from going back through the chat logs between me, Claude, and Gemini. With AIDD, you have a complete transcript of every architectural decision, every wrong turn, every moment where you overruled the AI and every moment where it corrected you. Very few development teams have ever had that level of fidelity in their project history. Mining those logs for lessons learned turns out to be one of the most valuable practices I&#8217;ve found.</li>
</ul>
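


<p>To make the tick model from that last item concrete, here is a deliberately tiny sketch. It is illustrative only; Octobatch&#8217;s real manifest and state machine are far more involved, and the file name and fields here are made up.</p>



<pre class="wp-block-code"><code># Tiny sketch of a tick loop with a manifest as source of truth (illustrative;
# not Octobatch code). Each tick loads state, does one bounded unit of work,
# persists, and exits, so the process can be killed at any moment and resume.
import json
from pathlib import Path

MANIFEST = Path("manifest.json")    # hypothetical file name

def load_state() -> dict:
    if MANIFEST.exists():
        return json.loads(MANIFEST.read_text())
    return {"pending": ["unit-1", "unit-2", "unit-3"], "done": []}

def do_work(unit: str) -> str:
    return f"processed {unit}"      # stand-in for submitting or collecting a pipeline step

def tick() -> None:
    state = load_state()                              # wake up, check state
    if state["pending"]:
        unit = state["pending"].pop(0)
        state["done"].append({"unit": unit, "result": do_work(unit)})   # do work
    MANIFEST.write_text(json.dumps(state, indent=2))  # persist, then exit

tick()   # run once; a scheduler, a loop, or a human invokes the next tick
</code></pre>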



<p>Near the end of the project, I switched to Cursor to make sure none of this was specific to Claude Code. I created fresh conversations using the same context files I&#8217;d been maintaining throughout development, and was able to bootstrap productive sessions immediately; the context files worked exactly as designed. The practices I&#8217;d developed transferred cleanly to a different tool. The value of this approach comes from the habits, the context management, and the engineering judgment you bring to the conversation, not from any particular vendor.</p>



<p>These tools are moving the world in a direction that favors developers who understand the ways engineering can go wrong and know solid design and architecture patterns…and who are okay letting go of control of every line of code.</p>



<h2 class="wp-block-heading"><strong>What&#8217;s next</strong></h2>



<p>Agentic engineering needs structure, and structure needs a concrete example to make it real. The next article in this series goes into Octobatch itself, because the way it orchestrates AI is a remarkably close parallel to what AIDD asks developers to do. Octobatch assigns roles to different processing steps, manages handoffs between them, validates their outputs, and recovers when they fail. That&#8217;s the same pattern I followed when building it: assigning roles to Claude and Gemini, managing handoffs between them, validating their outputs, and recovering when they went down the wrong path. Understanding how the system works turns out to be a good way to understand how to orchestrate AI-driven development. I&#8217;ll walk through the architecture, show what a real pipeline looks like from prompt to results, present the data from a 300-hand blackjack Monte Carlo simulation that puts all of these ideas to the test, and use all of that to demonstrate ideas we can apply directly to agentic engineering and AI-driven development.</p>



<p>Later articles go deeper into the practices and ideas I learned from this experiment that make AI-driven development work: how I coordinated multiple AI models without losing control of the architecture, what happened when I tested the code against what I actually intended to build, and what I learned about the gap between code that runs and code that does what you meant. Along the way, the experiment produced some findings about how different AI models see code that I didn&#8217;t expect—and that turned out to matter more than I thought they would.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-accidental-orchestrator/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The &#8220;Data Center Rebellion&#8221; Is Here</title>
		<link>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/</link>
				<comments>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/#respond</comments>
				<pubDate>Wed, 04 Mar 2026 12:11:09 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18188</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Data-center-rebellion.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Data-center-rebellion-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Beyond the chips: The local politics of AI infrastructure]]></custom:subtitle>
		
				<description><![CDATA[This post first appeared on Ben Lorica’s Gradient FlowSubstack newsletter and is being republished here with the author’s permission. Even the most ardent cheerleaders for artificial intelligence now quietly concede we are navigating a massive AI bubble. The numbers are stark: Hyperscalers are deploying roughly $400 billion annually into data centers and specialized chips while [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><em>This post first appeared on Ben Lorica’s </em><a href="https://gradientflow.substack.com/p/the-data-center-rebellion-is-here" target="_blank" rel="noreferrer noopener">Gradient Flow<em>Substack newsletter</em></a><em> </em><em>and is being republished here with the author’s permission.</em></td></tr></tbody></table></figure>



<p>Even the most ardent cheerleaders for artificial intelligence now quietly concede we are navigating a massive AI bubble. The numbers are stark: Hyperscalers are deploying roughly $400 billion annually into data centers and specialized chips while AI-related revenue hovers around $20 billion—a 20-to-1 capital-to-revenue ratio that stands out even in infrastructure cycles historically characterized by front-loaded spending. To justify this deployment on conventional investment metrics, the industry would need a step change in monetization over a short window to make the numbers work.</p>



<p>While venture capitalists and tech executives debate the “mismatch” between compute and monetization, a more tangible crisis is unfolding far from Silicon Valley. A growing grassroots opposition to AI data centers remains largely below the radar here in San Francisco. I travel to Sioux Falls, South Dakota, a few times a year to visit my in-laws. It’s not a region known for being antibusiness. Yet even there, a “data center rebellion” has been brewing. Even though the recent attempt to overturn a rezoning ordinance <a href="https://www.keloland.com/news/local-news/data-center-re-zone-petition-fails-in-sioux-falls/" target="_blank" rel="noreferrer noopener">did not succeed</a>, the level of community pushback in the heart of the Midwest signals that these projects no longer enjoy a guaranteed green light.</p>



<p>This resistance is not merely reflexive NIMBYism. It represents a sophisticated multifront challenge to the physical infrastructure AI requires. For leadership teams planning for the future, this means &#8220;compute availability&#8221; is no longer just a procurement question. It is now tied to local politics, grid stability, water management, and city approval processes. In the course of trying to understand the growing opposition to AI data centers, I’ve been examining the specific drivers behind this opposition and why the assumption of limitless infrastructure growth is colliding with hard constraints.</p>



<h2 class="wp-block-heading">The grid capacity crunch and the ratepayer revolt</h2>



<p>AI data centers function as grid-scale industrial loads. Individual projects now request 100+ megawatts, and some proposals reach into the gigawatt range. One proposed Michigan facility, for example, would consume 1.4 gigawatts, nearly exhausting the region’s remaining 1.5 gigawatts of headroom and roughly matching the electricity needs of about a million homes. This happens because AI hardware is incredibly dense and uses a massive amount of electricity. It also runs constantly. Since AI work doesn&#8217;t have &#8220;off&#8221; hours, power companies can&#8217;t rely on the usual quiet periods they use to balance the rest of the grid.</p>



<p>The politics come down to who pays the bill. Residents in many areas have seen their home utility rates jump by 25% or 30% after big data centers moved in, even though they were promised rates wouldn&#8217;t change. People are afraid they will end up paying for the power company&#8217;s new equipment. This happens when a utility builds massive substations just for one company, but the cost ends up being shared by everyone. When you add in state and local tax breaks, it gets even worse. Communities deal with all the downsides of the project, while the financial benefits are eaten away by tax breaks and credits.</p>



<p>The result is a rare bipartisan alignment around a simple demand: Hyperscalers should pay their full cost of service. Notably, Microsoft has moved in that direction publicly, committing to cover grid-upgrade costs and pursue rate structures intended to insulate residential customers—an implicit admission that the old incentive playbook has become a political liability (and, in some places, an electoral one).</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="172" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.jpeg" alt="AI scale-up to deployable compute" class="wp-image-18189" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-300x35.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-768x91.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h2 class="wp-block-heading">Water wars and the constant hum</h2>



<p>High-density AI compute generates immense heat, requiring cooling systems that can consume millions of gallons of water daily. In desert municipalities like Chandler and Tucson, Arizona, this creates direct competition with agricultural irrigation and residential drinking supplies. Proposed facilities may withdraw hundreds of millions of gallons annually from stressed aquifers or municipal systems, raising fears that industrial users will deplete wells serving farms and homes. Data center developers frequently respond with technical solutions like dry cooling and closed-loop designs. However, communities have learned the trade-off: Dry cooling shifts the burden to electricity, and closed-loop systems still lose water to the atmosphere and require constant refills. The practical outcome is that cooling architecture is now a first-order constraint. In Tucson, a project known locally as “Project Blue” faced enough pushback over water rights that the developer had to revisit the cooling approach midstream.</p>



<p>Beyond resource consumption, these facilities create a significant noise problem. Industrial-scale cooling fans and backup diesel generators create a “constant hum” that represents daily intrusion into previously quiet neighborhoods. In Florida, residents near a proposed facility serving 2,500 families and an elementary school cite sleep disruption and health risks as primary objections, elevating the issue from nuisance to harm. The noise also hits farms hard. In Wisconsin, residents reported that the low-frequency hum makes livestock, particularly horses, nervous and skittish. This disrupts farm life in a way that standard commercial development just doesn&#8217;t. This is why municipalities are tightening requirements: acoustic modeling, enforceable decibel limits at property lines, substantial setbacks (sometimes on the order of 200 feet), and <a href="https://en.wikipedia.org/wiki/Berm" target="_blank" rel="noreferrer noopener">berms</a> that are no longer “nice-to-have” concessions but baseline conditions for approval.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="654" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.jpeg" alt="The $3 trillion question" class="wp-image-18190" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-300x135.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-1-768x345.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /><figcaption class="wp-element-caption">(<a href="https://gradientflow.com/wp-content/uploads/2026/02/AI-Data-Center-Bubble.jpeg" target="_blank" rel="noreferrer noopener"><strong>enlarge</strong></a>)</figcaption></figure>



<h2 class="wp-block-heading">The jobs myth meets the balance sheet</h2>



<p>Communities are questioning whether the small number of jobs created is worth the local impact. Developers highlight billion-dollar capital investments and construction employment spikes, but residents focus on steady-state reality: AI data centers employ far fewer permanent workers per square foot than manufacturing facilities of comparable scale. Chandler, Arizona, officials noted that existing facilities employ fewer than 100 people despite massive physical footprints. Wisconsin residents contrast promised “innovation campuses” with operational facilities requiring only dozens to low hundreds of permanent staff—mostly specialized technicians—making the “job creation” pitch ring hollow. When a data center replaces farmland or light manufacturing, communities weigh not just direct employment but opportunity cost: lost agricultural jobs, foregone retail development, and mixed-use projects that might generate broader economic activity.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Opposition scales faster than infrastructure: One local win becomes a national template for blocking the next project.</p>
</blockquote>



<p>The secretive way these deals are made is often what fuels the most anger. A recurring pattern is what some call the “sleeping giant” dynamic: Residents learn late that officials and developers have been negotiating for months, often under NDAs, sometimes through shell entities and codenames. In Wisconsin, Microsoft’s “Project Nova” became a symbol of this approach; in Minnesota’s Hermantown, a year of undisclosed discussions triggered similar backlash. In Florida, opponents were furious when a major project was tucked into a <a href="https://www.boardeffect.com/blog/what-is-a-consent-agenda-for-a-board-meeting/" target="_blank" rel="noreferrer noopener">consent agenda</a>. Since these agendas are meant for routine business, it felt like a deliberate attempt to bypass public debate. Trust vanishes when people believe advisors have a conflict of interest, like a consultant who seems to be helping both the municipality and the developer. After that happens, technical claims are treated as nothing more than a sales pitch. You won&#8217;t get people back on board until you provide neutral analysis and commitments that can actually be enforced.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1020" height="695" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.jpeg" alt="Data center in the community" class="wp-image-18191" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2.jpeg 1020w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-300x204.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-2-768x523.jpeg 768w" sizes="auto, (max-width: 1020px) 100vw, 1020px" /></figure>



<h2 class="wp-block-heading">From zoning fight to national constraint</h2>



<p>What started as isolated neighborhood friction has professionalized into a coordinated national movement. Opposition <strong>groups now share legal playbooks and technical templates across state lines</strong>, allowing residents in “frontier” states like South Dakota or Michigan to mobilize with the sophistication of seasoned activists. The financial stakes are real: Between April and June 2025 alone, approximately $98 billion in proposed projects were blocked or delayed, according to <a href="https://www.datacenterwatch.org/?utm_source=gradientflow&amp;utm_medium=newsletter" target="_blank" rel="noreferrer noopener">Data Center Watch</a>. This is no longer just a zoning headache. It’s a political landmine. In Arizona and Georgia, bipartisan coalitions have already ousted officials over data center approvals, signaling to local boards that greenlighting a hyperscale facility without deep community buy-in can be a career-ending move.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The US has the chips, but China has centralized command over power and infrastructure.</p>
</blockquote>



<p>The opposition is also finding an unlikely ally in the energy markets. While the industry narrative is one of &#8220;limitless demand,&#8221; the actual market prices for long-term power and natural gas aren&#8217;t spiking; they&#8217;re staying remarkably flat. There is a massive <a href="https://www.youtube.com/watch?v=i__iaPepixk" target="_blank" rel="noreferrer noopener">disconnect between the hype and the math</a>. Utilities are currently racing to build nearly double the <a href="https://www.youtube.com/watch?v=i__iaPepixk" target="_blank" rel="noreferrer noopener">capacity that even the most optimistic analysts</a> project for 2030. This suggests we may be overbuilding &#8220;ghost infrastructure.&#8221; We are asking local communities to sacrifice their land and grid stability for a gold rush that the markets themselves don&#8217;t fully believe in.</p>



<p>This “data center rebellion” creates a strategic bottleneck that no amount of venture capital can easily bypass. While the US maintains a clear lead in high-end chips, we are hitting a wall on how we manage the mundane essentials like electricity and water. In the geopolitical race, the US has the chips, but China has the centralized command over infrastructure. Our democratic model requires transparency and public buy-in to function. If US companies keep relying on secret deals to push through expensive, overbuilt infrastructure, they risk a total collapse of community trust.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-data-center-rebellion-is-here/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Radar Trends to Watch: March 2026</title>
		<link>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/</link>
				<comments>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/#respond</comments>
				<pubDate>Tue, 03 Mar 2026 12:07:40 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[Radar Trends]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18173</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-10.png" 
				medium="image" 
				type="image/png" 
				width="1400" 
				height="950" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-10-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Developments in operations, things, web, and more]]></custom:subtitle>
		
				<description><![CDATA[The explosion of interest in OpenClaw was one of the last items added to the February 1 trends. In February, things went crazy. We saw a social network for agents (no humans allowed, though they undoubtedly sneak on); a multiplayer online game for agents (again, no humans); many clones of OpenClaw, most of which attempt [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>The explosion of interest in OpenClaw was one of the last items added to the February 1 trends. In February, things went crazy. We saw a social network for agents (no humans allowed, though they undoubtedly sneak on); a multiplayer online game for agents (again, no humans); many clones of OpenClaw, most of which attempt to mitigate its many security problems; and much more. Andrej Karpathy has said that OpenClaw is the next layer on top of AI agents. If the security issues can be resolved (and whether they can is a good question), he’s probably right.</p>



<h2 class="wp-block-heading">AI</h2>



<ul class="wp-block-list">
<li><a href="https://note-taker.moonshine.ai/" target="_blank" rel="noreferrer noopener">Moonshine Note Taker</a> is a free and open source voice transcription application for taking notes. It runs locally: The model runs on your hardware and no data is ever sent to a server.</li>



<li>Nano Banana’s image generation was breathtakingly good. Google has now <a href="https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/" target="_blank" rel="noreferrer noopener">released</a> Nano Banana 2, a.k.a. Gemini 3.1 Flash Image, which promises Nano Banana image quality at speed.</li>



<li>Claude <a href="https://code.claude.com/docs/en/remote-control" target="_blank" rel="noreferrer noopener">Remote Control</a> allows you to continue a desktop Claude Code session from any device.</li>



<li>Putting OpenClaw into a sandbox <a href="https://tachyon.so/blog/sandboxes-wont-save-you" target="_blank" rel="noreferrer noopener">isn’t enough</a>. Keeping AI Agents from accidentally (or intentionally) doing damage is a permissions problem.</li>



<li>Alibaba has <a href="https://huggingface.co/collections/Qwen/qwen35" target="_blank" rel="noreferrer noopener">released</a> a fleet of mid-size Qwen 3.5 models. Their theme is providing more intelligence with fewer computing cycles—something we all need to appreciate.&nbsp;</li>



<li>Important advice for agentic engineering: <a href="https://simonwillison.net/guides/agentic-engineering-patterns/first-run-the-tests/" target="_blank" rel="noreferrer noopener">Always start by running the tests</a>.</li>



<li>Google has <a href="https://blog.google/innovation-and-ai/products/gemini-app/lyria-3/" target="_blank" rel="noreferrer noopener">released</a> Lyria 3, a model that generates 30-second musical clips from a verbal description. You can experiment with it through Gemini.</li>



<li>There’s a new protocol in the agentic stack. Twilio has <a href="https://thenewstack.io/twilio-a2h-protocol-launch/" target="_blank" rel="noreferrer noopener">released</a> the <a href="https://www.twilio.com/en-us/blog/products/introducing-a2h-agent-to-human-communication-protocol" target="_blank" rel="noreferrer noopener">Agent-2-Human</a> (A2H) protocol, which facilitates handoffs between agents and humans as they collaborate.</li>



<li>Yet more and more model releases: <a href="https://www.anthropic.com/news/claude-sonnet-4-6" target="_blank" rel="noreferrer noopener">Claude Sonnet 4.6</a>, followed quickly by <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/" target="_blank" rel="noreferrer noopener">Gemini 3.1 Pro</a>. If you care, Gemini 3.1 Pro currently tops the abstract reasoning benchmarks.</li>



<li><a href="https://www.kimi.com/bot" target="_blank" rel="noreferrer noopener">Kimi Claw</a> is yet another variation on OpenClaw. Kimi Claw uses Moonshot AI’s most advanced model, Kimi K2.5 Thinking model, and offers one-click setup in Moonshot’s cloud.</li>



<li><a href="https://nanoclaw.dev/" target="_blank" rel="noreferrer noopener">NanoClaw</a> is another OpenClaw-like AI-based personal assistant that claims to be more security conscious. It runs agents in sandboxed Linux containers with limited access to outside resources, limiting abuse. </li>



<li>OpenAI has <a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/" target="_blank" rel="noreferrer noopener">released</a> a research preview of GPT-5.3-Codex-Spark, an extremely fast coding model that runs on <a href="https://www.cerebras.ai/chip" target="_blank" rel="noreferrer noopener">Cerebras hardware</a>. The company claims that it’s possible to collaborate with Codex in “real time” because it gives “near-instant” results.</li>



<li>RAG may not be the newest idea in the AI world, but text-based RAG is the basis for many enterprise applications of AI. Yet most enterprise data includes graphs, images, and even text in formats like PDF. Is this the year for <a href="https://thenewstack.io/multimodal-rag-hybrid-search/" target="_blank" rel="noreferrer noopener">multimodal RAG</a>?</li>



<li><a href="http://z.ai" target="_blank" rel="noreferrer noopener">Z.ai</a> has released its latest model, <a href="https://z.ai/blog/glm-5" target="_blank" rel="noreferrer noopener">GLM-5</a>. GLM-5 is an open source “Opus-class” model. It’s significantly smaller than Opus and other high-end models, though still huge; the mixture-of-experts model has 744B parameters, with 40B active.</li>



<li>Waymo has created a <a href="https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation" target="_blank" rel="noreferrer noopener">World Model</a> to model driving behavior. It’s capable of building lifelike simulations of traffic patterns and behavior, based on video collected from Waymo’s vehicles.</li>



<li><a href="https://alexzhang13.github.io/blog/2025/rlm/" target="_blank" rel="noreferrer noopener">Recursive language models</a> (RLMs) solve the problem of context rot, which happens when output from AI degrades as the size of the context increases. Drew Breunig has an excellent <a href="https://www.dbreunig.com/2026/02/09/the-potential-of-rlms.html" target="_blank" rel="noreferrer noopener">explanation</a>.</li>



<li>You’ve heard of Moltbook—and perhaps your AI agent participates. Now there’s <a href="https://arstechnica.com/ai/2026/02/after-moltbook-ai-agents-can-now-hang-out-in-their-own-space-faring-mmo/" target="_blank" rel="noreferrer noopener">SpaceMolt</a>—a massive multiplayer online game that’s exclusively for agents.&nbsp;</li>



<li>Anthropic and OpenAI simultaneously released <a href="https://www.anthropic.com/news/claude-opus-4-6" target="_blank" rel="noreferrer noopener">Claude Opus 4.6</a> and <a href="https://openai.com/index/introducing-gpt-5-3-codex/" target="_blank" rel="noreferrer noopener">GPT-5.3-Codex</a>, both of which offer improved models for AI-assisted programming. Is this “open warfare,” as <a href="https://news.smol.ai/issues/26-02-05-claude-opus-openai-codex/" target="_blank" rel="noreferrer noopener"><em>AINews</em></a> claims? You mean it hasn’t been open warfare prior to now?</li>



<li>If you’re excited by OpenClaw, you might try <a href="https://github.com/HKUDS/nanobot" target="_blank" rel="noreferrer noopener">NanoBot</a>. It has 1% of OpenClaw’s code, written so that it’s easy to understand and maintain. No promises about security—with all of these personal AI assistants, be careful!</li>



<li>OpenAI has <a href="https://arstechnica.com/ai/2026/02/openai-picks-up-pace-against-claude-code-with-new-codex-desktop-app/" target="_blank" rel="noreferrer noopener">launched</a> a desktop app for macOS along the lines of Claude Code. It’s something that’s been missing from their lineup. Among other things, it’s intended to help programmers work with multiple agents simultaneously.</li>



<li>Pete Warden has put together an interactive guide to speech embeddings for engineers, and published it as a Colab <a href="https://colab.research.google.com/drive/1pUy9tp145qlWni2CIuBvQUNdokiB6rx6?usp=sharing" target="_blank" rel="noreferrer noopener">notebook</a>.</li>



<li><a href="https://tailscale.com/blog/aperture-private-alpha" target="_blank" rel="noreferrer noopener">Aperture</a> is a new tool from Tailscale for “providing visibility into coding agent usage,” allowing organizations to understand how AI is being used and adopted. It’s currently in private beta.</li>



<li>OpenAI <a href="https://openai.com/index/introducing-prism/" target="_blank" rel="noreferrer noopener">Prism</a> is a free workspace for scientists to collaborate on research. Its goal is to help scientists build a new generation of AI-based tooling. Prism is built on ChatGPT 5.2 and is open to anyone with a personal ChatGPT account.</li>
</ul>



<h2 class="wp-block-heading">Programming</h2>



<ul class="wp-block-list">
<li>Anthropic is <a href="https://claude.com/contact-sales/claude-for-oss" target="_blank" rel="noreferrer noopener">offering</a> six months of Claude Max 20x free to open source maintainers.</li>



<li><a href="https://pi.dev/" target="_blank" rel="noreferrer noopener">Pi</a> is a very simple but extensible coding agent that runs in your terminal.</li>



<li>Researchers at Anthropic have <a href="https://www.anthropic.com/engineering/building-c-compiler" target="_blank" rel="noreferrer noopener">vibe-coded a C compiler</a> using a fleet of Claude agents. The experiment cost roughly $20,000 worth of tokens, and produced 100,000 lines of Rust. They are careful to say that the compiler is far from production quality—but it works. The experiment is a <em>tour de force</em> demonstration of running agents in parallel.&nbsp;</li>



<li>I never knew that macOS had a <a href="https://igorstechnoclub.com/sandbox-exec/" target="_blank" rel="noreferrer noopener">sandboxing tool</a>. It looks useful. (It’s also deprecated, but looks much easier to use than the alternatives.)</li>



<li><a href="https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/" target="_blank" rel="noreferrer noopener">GitHub now allows</a> pull requests to be turned off completely, or to be limited to collaborators. They’re doing this to allow software maintainers to eliminate AI-generated pull requests, which are overwhelming many developers.</li>



<li>After an open source maintainer rejected a pull request generated by an AI agent, the agent published a blog post attacking the maintainer. The maintainer responded with an excellent <a href="https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/" target="_blank" rel="noreferrer noopener">analysis</a>, asking whether threats and intimidation are the future of AI.</li>



<li>As Simon Willison has <a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/" target="_blank" rel="noreferrer noopener">written</a>, the purpose of programming isn’t to write code but to deliver code that works. He’s created two tools, <a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/" target="_blank" rel="noreferrer noopener">Showboat and Rodney</a>, that help AI agents demo their software so that the human authors can verify that the software works.&nbsp;</li>



<li>Anil Dash asks whether <a href="https://www.anildash.com/2026/01/22/codeless/" target="_blank" rel="noreferrer noopener">codeless</a> <a href="https://www.anildash.com/2026/01/27/codeless-ecosystem/" target="_blank" rel="noreferrer noopener">programming</a>, using tools like <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">Gas Town</a>, is the future.</li>
</ul>



<h2 class="wp-block-heading">Security</h2>



<ul class="wp-block-list">
<li>There is now an app that <a href="https://tech.lgbt/@yjeanrenaud/116122129025921096" target="_blank" rel="noreferrer noopener">alerts</a> you when someone in the vicinity has smart glasses.</li>



<li><a href="https://www.agentsh.org/" target="_blank" rel="noreferrer noopener">Agentsh</a> provides <a href="https://www.agentsh.org/" target="_blank" rel="noreferrer noopener">execution layer security</a> by enforcing policies to prevents agents from doing damage. As far as agents are concerned, it’s a replacement for bash.</li>



<li>There’s a new kind of cyberattack: <a href="https://techxplore.com/news/2026-02-cyber-disrupt-smart-factories.html" target="_blank" rel="noreferrer noopener">attacks against time itself</a>. More specifically, this means attacks against clocks and protocols for time synchronization. These can be devastating in factory settings.</li>



<li>“<a href="https://aisle.com/blog/what-ai-security-research-looks-like-when-it-works" target="_blank" rel="noreferrer noopener">What AI Security Research Looks Like When It Works</a>” is an excellent overview of the impact of AI on discovering vulnerabilities. AI generates a lot of security slop, but it also finds critical vulnerabilities that would have been opaque to humans, including 12 in OpenSSL.</li>



<li>Gamifying prompt injection—well, that’s new. <a href="https://hackmyclaw.com/" target="_blank" rel="noreferrer noopener">HackMyClaw</a> is a game (?) in which participants send email to Flu, an OpenClaw instance. The goal is to force Flu to reply with secrets.env, a file of “confidential” data. There is a prize for the first to succeed.</li>



<li>It was only a matter of time: There’s now a cybercriminal who is <a href="https://www.bleepingcomputer.com/news/security/infostealer-malware-found-stealing-openclaw-secrets-for-first-time/" target="_blank" rel="noreferrer noopener">actively stealing secrets</a> from OpenClaw users.&nbsp;</li>



<li><a href="https://deno.com/" target="_blank" rel="noreferrer noopener">Deno’s secure sandbox</a> might provide a way to <a href="https://thenewstack.io/deno-sandbox-security-secrets/" target="_blank" rel="noreferrer noopener">run OpenClaw safely</a>.&nbsp;</li>



<li><a href="https://github.com/nearai/ironclaw" target="_blank" rel="noreferrer noopener">IronClaw</a> is a personal AI assistant modeled after <a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener">OpenClaw</a> that promises better security. It always runs in a sandbox, never exposes credentials, has some defenses against prompt injection, and only makes requests to approved hosts.</li>



<li>A fake recruiting campaign is <a href="https://www.bleepingcomputer.com/news/security/fake-job-recruiters-hide-malware-in-developer-coding-challenges/" target="_blank" rel="noreferrer noopener">hiding malware</a> in programming challenges that candidates must complete in order to apply. Completing the challenge requires installing malicious dependencies that are hosted on legitimate repositories like npm and PyPI.</li>



<li>Google’s Threat Intelligence Group has <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/" target="_blank" rel="noreferrer noopener">released</a> its quarterly analysis of adversarial AI use. Their analysis includes distillation, or collecting the output of a frontier AI to train another AI.</li>



<li>Google has <a href="https://arstechnica.com/gadgets/2026/02/upgraded-google-safety-tools-can-now-find-and-remove-more-of-your-personal-info/" target="_blank" rel="noreferrer noopener">upgraded</a> its tools for removing personal information and images, including nonconsensual explicit images, from its search results.&nbsp;</li>



<li><a href="https://www.bleepingcomputer.com/news/security/new-tool-blocks-imposter-attacks-disguised-as-safe-commands/" target="_blank" rel="noreferrer noopener">Tirith</a> is a new tool that hooks into the shell to block bad commands. This is often a problem with copy-and-paste commands that use curl to pipe an archive into bash. It’s easy for a bad actor to create a malicious URL that is indistinguishable from a legitimate URL.</li>



<li>Claude Opus 4.6 has been used to discover <a href="https://red.anthropic.com/2026/zero-days/" target="_blank" rel="noreferrer noopener">500 0-day vulnerabilities</a> in open source code. While many open source maintainers have complained about AI slop, and that abuse isn’t likely to stop, AI is also becoming a valuable tool for security work.</li>



<li><a href="https://www.koi.ai/blog/maliciouscorgi-the-cute-looking-ai-extensions-leaking-code-from-1-5-million-developers" target="_blank" rel="noreferrer noopener">Two coding assistants for VS Code</a> are malware that send copies of all the code to China. Unlike lots of malware, they do their job as coding assistants well, making it less likely that victims will notice that something is wrong.&nbsp;</li>



<li><a href="https://www.bleepingcomputer.com/news/security/hackers-hijack-exposed-llm-endpoints-in-bizarre-bazaar-operation/" target="_blank" rel="noreferrer noopener">Bizarre Bazaar</a> is the name for a wave of attacks against LLM APIs, including self-hosted LLMs. The attacks attempt to steal resources from LLM infrastructure, for purposes including cryptocurrency mining, data theft, and reselling LLM access.&nbsp;</li>



<li>The business model for ransomware has changed. <a href="https://www.bleepingcomputer.com/news/security/from-cipher-to-fear-the-psychology-behind-modern-ransomware-extortion/" target="_blank" rel="noreferrer noopener">Ransomware is no longer about encrypting your data</a>; it’s about using stolen data for extortion. Small and mid-size businesses are common targets.&nbsp;</li>
</ul>



<h2 class="wp-block-heading">Web</h2>



<ul class="wp-block-list">
<li>Cloudflare has a service called <a href="https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/" target="_blank" rel="noreferrer noopener">Markdown for Agents</a> that <a href="https://thenewstack.io/cloudflares-markdown-for-agents-automatically-make-websites-more-aifriendly/" target="_blank" rel="noreferrer noopener">converts</a> websites from HTML to Markdown when an agent accesses them. Conversion makes the pages friendlier to AI and significantly reduces the number of tokens needed to process them.</li>



<li>WebMCP is a <a href="https://webmachinelearning.github.io/webmcp/" target="_blank" rel="noreferrer noopener">proposed API standard</a> that allows web applications to become MCP servers. It’s currently available in <a href="https://developer.chrome.com/blog/webmcp-epp" target="_blank" rel="noreferrer noopener">early preview</a> in Chrome.</li>



<li>Users of <a href="https://www.firefox.com/en-US/firefox/148.0/releasenotes/" target="_blank" rel="noreferrer noopener">Firefox 148</a> (which should be out by the time you read this) will be able to <a href="https://blog.mozilla.org/en/firefox/ai-controls/" target="_blank" rel="noreferrer noopener">opt out</a> of all AI features.</li>
</ul>



<h2 class="wp-block-heading">Operations</h2>



<ul class="wp-block-list">
<li>Wireshark is a powerful—and complex—packet capture tool. <a href="https://github.com/vignesh07/babyshark" target="_blank" rel="noreferrer noopener">Babyshark</a> is a text-based frontend for Wireshark that provides an amazing amount of information through a much simpler interface.</li>



<li>Microsoft is experimenting with <a href="https://techxplore.com/news/2026-02-laser-etched-glass-years-microsoft.html" target="_blank" rel="noreferrer noopener">using lasers to etch data in glass</a> as a form of long-term data storage.</li>
</ul>



<h2 class="wp-block-heading">Things</h2>



<ul class="wp-block-list">
<li>You need a <a href="https://www.hackster.io/news/your-office-needs-a-desk-robot-fec0211f56ef" target="_blank" rel="noreferrer noopener">desk robot</a>. Why? Because it’s there. And fun.</li>



<li>Do you want to play <em>Doom</em> on a Lego brick? <a href="https://hackaday.com/2023/03/18/doom-ported-to-a-single-lego-brick/" target="_blank" rel="noreferrer noopener">You can</a>.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/radar-trends-to-watch-march-2026/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Why Capacity Planning Is Back</title>
		<link>https://www.oreilly.com/radar/why-capacity-planning-is-back/</link>
				<comments>https://www.oreilly.com/radar/why-capacity-planning-is-back/#respond</comments>
				<pubDate>Mon, 02 Mar 2026 13:19:19 +0000</pubDate>
					<dc:creator><![CDATA[Syed Danish Ali]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18150</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Capacity-planning.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Capacity-planning-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[In a previous article, we outlined why GPUs have become the architectural control point for enterprise AI. When accelerator capacity becomes the governing constraint, the cloud’s most comforting assumption—that you can scale on demand without thinking too far ahead—stops being true. That shift has an immediate operational consequence: Capacity planning is back. Not the old [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In a previous article, we outlined why <a href="https://www.oreilly.com/radar/gpus-enterprise-ais-new-architectural-control-point/" target="_blank" rel="noreferrer noopener">GPUs have become the architectural control point for enterprise AI</a>. When accelerator capacity becomes the governing constraint, the cloud’s most comforting assumption—that you can scale on demand without thinking too far ahead—stops being true.</p>



<p>That shift has an immediate operational consequence: Capacity planning is back. Not the old “guess next year’s VM count” exercise but a new form of planning where model choices, inference depth, and workload timing directly determine whether you can meet latency, cost, and reliability targets.</p>



<p>In an AI-shaped infrastructure world, you don’t “scale” as much as you “get capacity.” Autoscaling helps at the margins, but it can’t create GPUs. Power, cooling, and accelerator supply set the limits.</p>



<h2 class="wp-block-heading">The return of capacity planning</h2>



<p>For a decade, cloud adoption trained organizations out of multiyear planning. CPU and storage scaled smoothly, and most stateless services behaved predictably under horizontal scaling. Teams could treat infrastructure as an elastic substrate and focus on software iteration.</p>



<p>AI production systems do not behave that way. They are dominated by accelerators and constrained by physical limits, and that makes capacity a first-order design dependency rather than a procurement detail. If you cannot secure the right accelerator capacity at the right time, your architecture decisions are irrelevant—because the system simply cannot run at the required throughput and latency.</p>



<p>Planning is returning because AI forces forecasting along four dimensions that product teams cannot ignore:</p>



<ul class="wp-block-list">
<li><strong>Model growth:</strong> Model count, version churn, and specialization increase accelerator demand even when user traffic is flat.</li>



<li><strong>Data growth:</strong> Retrieval depth, vector store size, and freshness requirements increase the amount of inference work per request.</li>



<li><strong>Inference depth:</strong> Multistage pipelines (retrieve, rerank, tool calls, verification, synthesis) multiply GPU time nonlinearly.</li>



<li><strong>Peak workloads:</strong> Enterprise usage patterns and batch jobs collide with real-time inference, creating predictable contention windows.</li>
</ul>



<p>This is not merely “IT planning.” It is strategic planning, because these factors push organizations back toward multiyear thinking: Procurement lead times, reserved capacity, workload placement decisions, and platform-level policies all start to matter again.</p>



<p>This is increasingly visible operationally: <a href="https://www.theregister.com/2025/08/03/capacity_planning_concern_datacenter_ops" target="_blank" rel="noreferrer noopener"><strong>Capacity planning is a rising concern for data center operators</strong></a>, as <em>The Register</em> reports.</p>



<h2 class="wp-block-heading"><strong>The cloud’s old promise is breaking</strong></h2>



<p>Cloud computing scaled on the premise that capacity could be treated as elastic and interchangeable. Most workloads ran on general-purpose hardware, and when demand rose, the platform could absorb it by spreading load across abundant, standardized resources.</p>



<p>AI workloads violate that premise. Accelerators are scarce, not interchangeable, and tied to power and cooling constraints that do not scale linearly. In other words, the cloud stops behaving like an infinite pool—and starts behaving like an allocation system.</p>



<p>First, the critical path in production AI systems is increasingly accelerator bound. Second, “a request” is no longer a single call. It is an inference pipeline with multiple dependent stages. Third, those stages tend to be sensitive to hardware availability, scheduling contention, and performance variance that cannot be eliminated by simply adding more generic compute.</p>



<p>This is where the elasticity model starts to fail as a default expectation. In AI systems, elasticity becomes conditional. It depends on capacity access, infrastructure topology, and a willingness to pay for assurance.</p>



<h2 class="wp-block-heading"><strong>AI changes the physics of cloud infrastructure</strong></h2>



<p>In modern enterprise AI, the binding constraints are no longer abstract. They are physical.</p>



<p>Accelerators introduce a different scaling regime than CPU-centric enterprise computing. Provisioning is not always immediate. Supply is not always abundant. And the infrastructure required to deploy dense compute has facility-level limits that software cannot bypass.</p>



<p>Power and cooling move from background concerns to first-order constraints. Rack density becomes a planning variable. Deployment feasibility is shaped by what a data center can deliver, not only by what a platform can schedule.</p>



<p>AI-driven density makes power and cooling the gating factors—<a href="https://www.datacenterdynamics.com/en/marketwatch/the-path-to-power/" target="_blank" rel="noreferrer noopener">as Data Center Dynamics explains in its &#8220;Path to Power&#8221;</a> overview.</p>



<p>This is why “just scale out” no longer behaves like a universal architectural safety net. Scaling is still possible, but it is increasingly constrained by physical reality. In AI-heavy environments, capacity is something you secure, not something you assume.</p>



<h2 class="wp-block-heading"><strong>From elasticity to allocation</strong></h2>



<p>As AI becomes operationally critical, cloud capacity begins to behave less like a utility and more like an allocation system.</p>



<p>Organizations respond by shifting from on-demand assumptions to capacity controls. They introduce quotas to prevent runaway consumption, reservations to ensure availability, and explicit prioritization to protect production workflows from contention. These mechanisms are not optional governance overhead. They are structural responses to scarcity.</p>
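<p>As a minimal sketch of what those controls can look like in code (the workload classes, numbers, and accounting model below are illustrative assumptions, not a reference implementation), an admission layer might check the quota of the requesting class and the reservations of every other class before letting a request reach the accelerator pool:</p>



<pre class="wp-block-code"><code># Minimal admission-control sketch: per-class quotas plus reserved capacity.
# Class names, numbers, and the accounting model are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class WorkloadClass:
    name: str
    priority: int          # lower = higher priority; used to order queued work (not shown)
    reserved_gpus: float   # capacity guaranteed to this class
    quota_gpus: float      # hard ceiling for this class

CLASSES = {
    "operational_assistant": WorkloadClass("operational_assistant", 0, 8.0, 12.0),
    "batch_jobs":            WorkloadClass("batch_jobs",            1, 2.0,  6.0),
    "exploratory":           WorkloadClass("exploratory",           2, 0.0,  2.0),
}

TOTAL_GPUS = 16.0
in_use = {name: 0.0 for name in CLASSES}   # GPUs currently held per class

def admit(class_name: str, gpus_requested: float) -&gt; bool:
    """Admit a request only if its class stays under quota and the pool can
    still honor every other class's unmet reservation afterwards."""
    wc = CLASSES[class_name]
    if in_use[class_name] + gpus_requested &gt; wc.quota_gpus:
        return False   # the class would exceed its own ceiling
    committed = sum(in_use.values()) + gpus_requested
    reserved_elsewhere = sum(
        max(c.reserved_gpus - in_use[c.name], 0.0)
        for c in CLASSES.values()
        if c.name != class_name
    )
    return committed + reserved_elsewhere &lt;= TOTAL_GPUS

if admit("exploratory", 1.0):
    in_use["exploratory"] += 1.0   # proceed; otherwise queue, shed, or degrade</code></pre>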



<p>In practice, accelerator capacity behaves more like a supply chain than a cloud service. Availability is influenced by lead time, competition, and contractual positioning. The implication is subtle but decisive: Enterprise AI platforms begin to look less like “infinite pools” and more like managed inventories.</p>



<p>This changes cloud economics and vendor relationships. Pricing is no longer only about utilization. It becomes about assurance. The questions that matter are not just “How much did we use?” but “Can we obtain capacity when it matters?” and “What reliability guarantees do we have under peak demand?”</p>



<h2 class="wp-block-heading"><strong>When elasticity stops being a default</strong></h2>



<p>Consider a platform team that deploys an internal AI assistant for operational support. In the pilot phase, demand is modest and the system behaves like a conventional cloud service. Inference runs on on-demand accelerators, latency is stable, and the team assumes capacity will remain a provisioning detail rather than an architectural constraint.</p>



<p>Then the system moves into production. The assistant is upgraded to use retrieval for policy lookups, reranking for relevance, and an additional validation pass before responses are returned. None of these changes appear dramatic in isolation. Each improves quality, and each looks like an incremental feature.</p>



<p>But the request path is no longer a single model call. It becomes a pipeline. Every user request now triggers multiple GPU-backed operations: embedding generation, retrieval-side processing, reranking, inference, and validation. GPU work per request rises, and the variance increases. The system still works—until it meets real peak behavior.</p>
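<p>A rough back-of-envelope sketch makes that amplification visible. The stage timings, peak traffic, and utilization target below are invented for illustration (real numbers come from profiling your own pipeline), but the structure of the calculation is the point:</p>



<pre class="wp-block-code"><code># Back-of-envelope capacity math for a multistage inference pipeline.
# All numbers are illustrative assumptions; profile your own stages to get real ones.

stage_gpu_seconds = {
    "embedding_generation": 0.05,
    "retrieval_processing": 0.10,
    "reranking":            0.20,
    "generation":           1.50,
    "validation_pass":      0.40,
}

gpu_seconds_per_request = sum(stage_gpu_seconds.values())   # ~2.25 GPU-seconds

peak_requests_per_second = 40    # assumed peak load for the assistant
target_utilization = 0.6         # leave headroom so queues don't collapse at peak

# GPUs needed ~= GPU-seconds demanded per wall-clock second / sustainable utilization
gpus_at_peak = gpu_seconds_per_request * peak_requests_per_second / target_utilization

print(f"GPU-seconds per request: {gpu_seconds_per_request:.2f}")
print(f"GPUs needed at peak:     {gpus_at_peak:.0f}")   # ~150 on these assumptions</code></pre>



<p>On these assumptions, the pipeline needs roughly 150 GPUs at peak. Add one more &#8220;small&#8221; 0.3-GPU-second stage and peak demand grows by another 20 GPUs, which is why each incremental feature is also a capacity decision.</p>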



<p>The first failure is not a clean outage. It is contention. Latency becomes unpredictable as jobs queue behind each other. The “long tail” grows. Teams begin to see priority inversion: Low-value exploratory usage competes with production workflows because the capacity pool is shared and the scheduler cannot infer business criticality.</p>



<p>The platform team responds the only way it can. It introduces allocation. Quotas are placed on exploratory traffic. Reservations are used for the operational assistant. Priority tiers are defined so production paths cannot be displaced by batch jobs or ad hoc experimentation.</p>



<p>Then the second realization arrives. Allocation alone is insufficient unless the system can degrade gracefully. Under pressure, the assistant must be able to narrow retrieval breadth, reduce reasoning depth, route deterministic checks to smaller models, or temporarily disable secondary passes. Otherwise, peak demand simply converts into queue collapse.</p>
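<p>One way to make that explicit is an ordered degradation ladder that the request planner walks as pressure rises. This is purely a sketch: The step names, ordering, and pressure signal are assumptions, and a production system would tie them to measured queue depth or latency budgets.</p>



<pre class="wp-block-code"><code># Illustrative degradation ladder: explicit, ordered steps applied under capacity pressure.
# Step names and the pressure signal are assumptions, not any specific product's API.

DEGRADATION_LADDER = [
    "narrow_retrieval_breadth",      # fetch fewer candidate documents
    "skip_reranking_pass",           # accept retrieval order as-is
    "route_checks_to_smaller_model", # deterministic checks go to a cheaper model
    "disable_secondary_validation",  # drop the extra verification pass
]

def plan_request(pressure: float) -&gt; dict:
    """Map a capacity-pressure signal in [0, 1] to an explicit request plan.
    Each 0.25 of pressure enables one more step, so behavior under load is
    deliberate and measurable rather than an emergent queue collapse."""
    steps = min(int(pressure / 0.25), len(DEGRADATION_LADDER))
    applied = DEGRADATION_LADDER[:steps]
    return {"degradations": applied, "audit": {"pressure": pressure, "applied": applied}}

print(plan_request(0.6))   # at 60% pressure, the first two steps are applied</code></pre>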



<p>At that point, capacity planning stops being an infrastructure exercise. It becomes an architectural requirement. Product decisions directly determine GPU operations per request, and those operations determine whether the system can meet its service levels under constrained capacity.</p>



<h2 class="wp-block-heading"><strong>How this changes architecture</strong></h2>



<p>When capacity becomes constrained, architecture changes—even if the product goal stays the same.</p>



<p>Pipeline depth becomes a capacity decision. In AI systems, throughput is not just a function of traffic volume. It is a function of how many GPU-backed operations each request triggers end to end. This amplification factor often explains why systems behave well in prototypes but degrade under sustained load.</p>



<p>Batching becomes an architectural tool, not an optimization detail. It can improve utilization and cost efficiency, but it introduces scheduling complexity and latency trade-offs. In practice, teams must decide where batching is acceptable and where low-latency “fast paths” must remain unbatched to protect user experience.</p>
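<p>A minimal sketch of that split, using the asyncio library, routes latency-sensitive calls around the batcher entirely. The labels, batch size, and wait window are assumptions, and <code>run_on_gpu</code> is a stand-in for real inference:</p>



<pre class="wp-block-code"><code># Illustrative split between an unbatched fast path and a micro-batching queue.
# Batch size, wait window, and run_on_gpu are stand-in assumptions (Python 3.10+).
import asyncio

MAX_BATCH = 8
MAX_WAIT_S = 0.025                  # how long to wait for each additional item
batch_queue: asyncio.Queue = asyncio.Queue()

async def run_on_gpu(batch):
    await asyncio.sleep(0.05)       # stand-in for real model inference
    return [{"output": "..."} for _ in batch]

async def handle(request: dict):
    if request.get("latency_sensitive"):
        return (await run_on_gpu([request]))[0]   # fast path: no batching delay
    future = asyncio.get_running_loop().create_future()
    await batch_queue.put((request, future))
    return await future                           # resolved by the batcher below

async def batcher():
    while True:
        items = [await batch_queue.get()]
        try:
            while len(items) &lt; MAX_BATCH:         # fill the batch, but don't wait forever
                items.append(await asyncio.wait_for(batch_queue.get(), MAX_WAIT_S))
        except asyncio.TimeoutError:
            pass                                  # dispatch a partial batch
        results = await run_on_gpu([req for req, _ in items])
        for (_, future), result in zip(items, results):
            future.set_result(result)

async def main():
    batch_task = asyncio.create_task(batcher())   # runs for the life of the service
    fast = handle({"latency_sensitive": True})
    slow = [handle({"prompt": i}) for i in range(5)]
    print(await asyncio.gather(fast, *slow))

asyncio.run(main())</code></pre>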



<p>Model choice becomes a production constraint. As capacity pressure increases, many organizations discover that smaller, more predictable models often win for operational workflows. This does not mean large models are unimportant. It means their use becomes selective. Hybrid strategies emerge: Smaller models handle deterministic or governed tasks, while larger models are reserved for exceptional or exploratory scenarios where their overhead is justified.</p>



<p>In short, architecture becomes constrained by power and hardware, not only by code. The core shift is that capacity constraints shape system behavior. They also shape governance outcomes, because predictability and auditability degrade when capacity contention becomes chronic.</p>



<h2 class="wp-block-heading"><strong>What cloud and platform teams must do differently</strong></h2>



<p>From an enterprise IT perspective, this shows up as a readiness problem: Can infrastructure and operations absorb AI workloads without destabilizing production systems? Answering that requires treating accelerator capacity as a governed resource—metered, budgeted, and allocated deliberately.</p>



<p><strong>Meter and budget accelerator capacity</strong></p>



<ul class="wp-block-list">
<li>Define consumption in business-relevant units (e.g., GPU-seconds per request and peak concurrency ceilings) and expose it as a platform metric.</li>



<li>Turn those metrics into explicit capacity budgets by service and workload class—so growth is a planning decision, not an outage.</li>
</ul>



<p><strong>Make allocation first class</strong></p>



<ul class="wp-block-list">
<li>Implement admission control and priority tiers aligned to business criticality; do not rely on best-effort fairness under contention.</li>



<li>Make allocation predictable and early (quotas/reservations) instead of informal and late (brownouts and surprise throttling).</li>
</ul>



<p><strong>Build graceful degradation into the request path</strong></p>



<ul class="wp-block-list">
<li>Predefine a degradation ladder (e.g., reduce retrieval breadth or route to a smaller model) that preserves bounded cost and latency.</li>



<li>Ensure degradations are explicit and measurable, so systems behave deterministically under capacity pressure.</li>
</ul>



<p><strong>Separate exploratory from operational AI</strong></p>



<ul class="wp-block-list">
<li>Isolate experimentation from production using distinct quotas/priority classes/reservations, so exploration cannot starve operational workloads.</li>



<li>Treat operational AI as an enforceable service with reliability targets; keep exploration elastic without destabilizing the platform.</li>
</ul>



<p>In an accelerator-bound world, platform success is no longer maximum utilization—it is predictable behavior under constraint.</p>



<h2 class="wp-block-heading"><strong>What this means for the future of the cloud</strong></h2>



<p>AI is not ending the cloud. It is pulling the cloud back toward physical reality.</p>



<p>The likely trajectory is a cloud landscape that becomes more hybrid, more planned, and less elastic by default. Public cloud remains critical, but organizations increasingly seek predictable access to accelerator capacity through reservations, long-term commitments, private clusters, or colocated deployments.</p>



<p>This will reshape pricing, procurement, and platform design. It will also reshape how engineering teams think. In the cloud native era, architecture often assumed capacity was solvable through autoscaling and on-demand provisioning. In the AI era, capacity becomes a defining constraint that shapes what systems can do and how reliably they can do it.</p>



<p>That is why capacity planning is back—not as a return to old habits but as a necessary response to a new infrastructure regime. Organizations that succeed will be the ones that design explicitly around capacity constraints, treat amplification as a first-order metric, and align product ambition with the physical and economic limits of modern AI infrastructure.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong><em>Author’s note</em></strong><em>: This article reflects the author’s personal views, based on independent technical research, and does not describe the architecture of any specific organization.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/why-capacity-planning-is-back/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>How We Bet Against the Bitter Lesson</title>
		<link>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/</link>
				<comments>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/#respond</comments>
				<pubDate>Mon, 02 Mar 2026 11:48:53 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18146</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Earth-brain.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Earth-brain-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Skills and the future knowledge economy]]></custom:subtitle>
		
				<description><![CDATA[I&#8217;ve been telling myself and anyone who will listen that Agent Skills point toward a new kind of future AI + human knowledge economy. It&#8217;s not just Skills, of course. It&#8217;s also things like Jesse Vincent&#8217;s Superpowers and Anthropic&#8217;s recently introduced Plugins for Claude Cowork. If you haven&#8217;t encountered these yet, keep reading. It should [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I&#8217;ve been telling myself and anyone who will listen that <a href="https://agentskills.io/" target="_blank" rel="noreferrer noopener">Agent Skills</a> point toward a new kind of future AI + human knowledge economy. It&#8217;s not just Skills, of course. It&#8217;s also things like Jesse Vincent&#8217;s <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> and Anthropic&#8217;s recently introduced <a href="https://github.com/anthropics/knowledge-work-plugins" target="_blank" rel="noreferrer noopener">Plugins for Claude Cowork</a>. If you haven&#8217;t encountered these yet, keep reading. It should become clear as we go along.</p>



<p>It feels a bit like I&#8217;m assembling a picture puzzle where all the pieces aren&#8217;t yet on the table. I am starting to see a pattern, but I&#8217;m not sure it&#8217;s right, and I need help finding the missing pieces. Let me explain some of the shapes I have in hand and the pattern they are starting to show me, and then I want to ask for your help filling in the gaps.</p>



<h2 class="wp-block-heading">Programming two different types of computer at the same time</h2>



<p>Phillip Carter wrote a piece a while back called &#8220;<a href="https://www.phillipcarter.dev/posts/llms-computers" target="_blank" rel="noreferrer noopener">LLMs Are Weird Computers</a>&#8221; that landed hard in my mind and wouldn&#8217;t leave. He noted that we&#8217;re now working with two fundamentally different kinds of computer at the same time. One can write a sonnet but struggles to do math. The other does math easily but couldn&#8217;t write a sonnet to save its metaphorical life.</p>



<p>Agent Skills may be the start of an answer to the question of what the interface layer between these two kinds of computation looks like. A Skill is a package of context (Markdown instructions, domain knowledge, and examples) combined with tool calls (deterministic code that does the things LLMs are bad at). The context speaks the language of the probabilistic machine, while the tools speak the language of the deterministic one.</p>



<p>Imagine you&#8217;re an experienced DevOps engineer and you want to give an AI agent the ability to diagnose production incidents the way you would. The context part of that Skill includes your architecture overview, your runbook for common failure modes, the heuristics you&#8217;ve developed over the years, and annotated examples of past incidents. That&#8217;s the part that speaks to the probabilistic machine. The tool part includes actual code that queries your monitoring systems, pulls log entries, checks service health endpoints, and runs diagnostic scripts. Each tool call saves the model from burning tokens on work that deterministic code does better, faster, and more reliably.</p>



<p>The Skill is neither the context nor the tools. It&#8217;s the combination. Expert judgment about when to check the database connection pool married to the ability to actually check it. We&#8217;ve had runbooks before (context without tools). We&#8217;ve had monitoring scripts before (tools without context). What we haven&#8217;t had is a way to package them together for a machine that can read the runbook <em>and</em> execute the scripts, using judgment to decide which script to run next based on what the last one returned.</p>
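<p>To make that pairing concrete, here is a hypothetical sketch of the two halves of such a diagnostic Skill. The names, strings, and endpoint are invented for illustration; this is not the packaging format of any particular Skills framework:</p>



<pre class="wp-block-code"><code># Hypothetical sketch of a Skill's two halves; names and formats are illustrative.

# The context half: what the probabilistic machine reads before acting
# (architecture overview, runbook excerpts, heuristics, annotated incidents).
SKILL_CONTEXT = """\
# Incident Diagnosis Skill
When latency alarms fire, check the database connection pool before scaling anything.
If error rates and pool saturation rise together, suspect a leaking client, not load.
Past incident 2024-11: identical symptoms, root cause was an unclosed cursor.
"""

# The tools half: deterministic code that does what LLMs are bad at,
# cheaply and exactly, without burning tokens.
import json
import urllib.request

def check_service_health(url: str) -&gt; dict:
    """Query a health endpoint and return its JSON body verbatim."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def pool_saturation(active_connections: int, pool_size: int) -&gt; float:
    """Exact arithmetic the model should never have to approximate."""
    return active_connections / pool_size

# The Skill is the combination: the agent reads SKILL_CONTEXT, calls a tool,
# and uses judgment to decide which tool to call next based on what came back.</code></pre>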



<p>This pattern shows up across every knowledge domain. A financial analyst&#8217;s Skill might combine valuation methodology with tools that pull real-time market data and run DCF calculations. A legal Skill might pair a firm&#8217;s approach to contract review with tools that extract and compare specific clauses across documents. In each case, the valuable thing isn&#8217;t the knowledge alone or the tools alone. It&#8217;s the integration of expert workflow logic that orchestrates when and how to use each tool, informed by domain knowledge that gives the LLM the judgment to make good decisions in context.</p>



<h2 class="wp-block-heading">Software that saves tokens</h2>



<p>In &#8220;<a href="https://steve-yegge.medium.com/software-survival-3-0-97a2a6255f7b" target="_blank" rel="noreferrer noopener">Software Survival 3.0</a>,&#8221; Steve Yegge asked what kinds of software artifacts survive in a world where AI can generate disposable software on the fly. His answer: software that saves tokens. Binary tools with proven solutions to common problems make sense when reuse is nearly free and regenerating them is token-costly.</p>



<p>Skills fit this niche. A well-crafted Skill gives an LLM the context it needs (which costs tokens) but also gives it tools that <em>save</em> tokens by providing deterministic, reliable results. The developer&#8217;s job increasingly becomes making good calls about this distinction: What should be context (flexible, expressive, probabilistic) and what should be a tool (efficient, deterministic, reusable)?</p>



<p>An LLM&#8217;s context window is a finite and expensive resource. Everything in it costs tokens, and everything in it competes for the model&#8217;s attention. A Skill that dumps an entire company knowledge base into the context window is a poorly designed Skill. A well-designed one is selective: It gives the model exactly the context it needs to make good decisions about which tools to call and when. This is a form of engineering discipline that doesn&#8217;t have a great analogue in traditional software development. It&#8217;s closer to what an experienced teacher does when deciding what to tell a student before sending them off to solve a problem—what Matt Beane, author of <a href="https://www.harpercollins.com/products/the-skill-code-matt-beane" target="_blank" rel="noreferrer noopener"><em>The Skill Code</em></a>, calls &#8220;scaffolding,&#8221; sharing not everything you know but the right things at the right level of detail to enable good judgment in the moment.</p>



<h2 class="wp-block-heading">AI is a social and cultural technology</h2>



<p>This notion of saving tokens is a bridge to the work of Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans. They make the case that large models should not be viewed primarily as intelligent agents, but as <a href="https://henryfarrell.net/wp-content/uploads/2025/03/Science-Accepted-Version.pdf" target="_blank" rel="noreferrer noopener">a new kind of cultural and social technology</a>, allowing humans to take advantage of information other humans have accumulated. Yegge&#8217;s observation fits right into this framework. Every new social and cultural technology tends to survive because it saves cognition. We learn from each other so we don&#8217;t have to discover everything for the first time. Alfred Korzybski referred to language, the first of these human social and cultural technologies, and all of those that followed, as &#8220;<a href="https://www.google.com/search?q=Alfred+Korzybski+time+binding" target="_blank" rel="noreferrer noopener">time-binding</a>.&#8221; (I will add that each advance in time binding creates consternation. Consider Socrates, whose diatribes against writing as the enemy of memory were passed down to us by Plato using that very same advance in time binding that Socrates decried.)</p>



<p>I am not convinced that the idea that AI may one day become an independent intelligence is misguided. But at present, AI is a symbiosis of human and machine intelligence, the latest chapter of a long story in which advances in the speed, persistence, and reach of communications <a href="https://www.youtube.com/watch?v=u62fQCI7YNA" target="_blank" rel="noreferrer noopener">weave humanity into a global brain</a>. I have a set of priors that say (until I am convinced otherwise) that <em>AI will be an extension of the human knowledge economy, not a replacement for it</em>. After all, as Claude told me when I asked whether <a href="https://www.oreilly.com/radar/jensen-huang-gets-it-wrong/" target="_blank" rel="noreferrer noopener">it was a worker or a tool</a>, &#8220;I don&#8217;t initiate. I&#8217;ve never woken up wanting to write a poem or solve a problem. My activity is entirely reactive – I exist in response to prompts. Even when given enormous latitude (&#8216;figure out the best approach&#8217;), the fact that I should figure something out comes from outside me.&#8221;</p>



<p>The shift from a chatbot responding to individual prompts to agents running in a loop marks a big step in the progress towards more autonomous AI, but even then, some human established the goal that set the agent in motion. I say this even as I am aware that long-running loops become increasingly difficult to distinguish from volition and that much human behavior is also set in motion by others. But I have yet to see any convincing evidence of Artificial Volition. And for that reason, <em>we need to think about mechanisms and incentives for humans to continue to create and share new knowledge</em>, putting AIs to work on questions that they will not ask on their own.</p>



<p>On X, someone recently asked Boris Cherny why there are a hundred-plus open engineering positions at Anthropic if Claude is writing 100% of the code. <a href="https://x.com/bcherny/status/2022762422302576970" target="_blank" rel="noreferrer noopener">His reply</a> made that same point: &#8220;Someone has to prompt the Claudes, talk to customers, coordinate with other teams, decide what to build next. Engineering is changing and great engineers are more important than ever.&#8221;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>On March 26, join Addy Osmani and Tim O’Reilly at AI Codecon: Software Craftsmanship in the Age of AI, where an all-star lineup of experts will go deeper into orchestration, agent coordination, and the new skills developers need to build excellent software that creates value for all participants.&nbsp;</em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></p>
</blockquote>



<h2 class="wp-block-heading">Tacit knowledge made executable</h2>



<p>A huge amount of specialized, often tacit, knowledge is embedded in workflows. The way an experienced developer debugs a production issue. The way a financial analyst stress-tests a model. This knowledge has historically been very hard to transfer. You learned it by apprenticeship, by doing, by being around people who knew how.</p>



<p>Matt Beane, author of <a href="https://www.harpercollins.com/products/the-skill-code-matt-beane" target="_blank" rel="noreferrer noopener"><em>The Skill Code</em></a>, calls apprenticeship &#8220;the 160,000 year old school hidden in plain sight.&#8221; He finds that effective skill development follows a common pattern of three C&#8217;s: challenge, complexity, and connection. The expert structures challenges at the right level, exposes the novice to the full complexity of the bigger picture rather than shielding them from it, and builds a connection that makes the novice willing to struggle and the expert willing to invest.</p>



<p>Designing a good Skill requires a similar craft. You have to figure out what an expert actually <em>does</em>. What are the decision points, the heuristics, the things they notice that a novice wouldn&#8217;t? And then how do you encode that into a form a machine can act on? Most Skills today are closer to the manual than to the master. Figuring out how to make Skills that transmit not just knowledge but judgment is one of the most interesting design challenges in this space.</p>



<p>But Matt also flags a paradox: the better we get at encoding expert judgment into Skills, the less we may need novices working alongside experts, and that&#8217;s exactly the relationship that produces the next generation of experts. If we&#8217;re not careful, we&#8217;ll capture today&#8217;s tacit knowledge while quietly shutting down the system that generates tomorrow&#8217;s.</p>



<p>Jesse Vincent&#8217;s Superpowers complement this picture. If a Skill is like handing a colleague a detailed playbook for a particular job, a Superpower is more like the professional habits and instincts that make someone effective at everything they do. Superpowers are meta-skills. They don&#8217;t tell the agent what to do. They shape how it thinks about what to do. As Jesse put it to me the other day, Superpowers tried to capture everything he&#8217;d learned in 30 years as a software developer.</p>



<p>As workflows change to include AI agents, Skills and Superpowers become a mechanism for sharing tacit professional knowledge and judgment with those agents. That makes Skills potentially very valuable but also raises questions about who controls them and who benefits.</p>



<p>Matt pointed out to me that many professions will resist the conversion of their expertise into Skills. He noted: &#8220;There&#8217;s a giant showdown between the surgical profession and Intuitive Surgical on this right now — Intuitive Surgical with its da Vinci 5 surgical robot will only let you buy or lease it if you sign away the rights to your telemetry as a surgeon. Lower status surgeons take the deal. Top tier institutions are fighting.&#8221;</p>



<p>It seems to me that the AI labs&#8217; repeated narrative that they are creating AI to make humans redundant rather than to empower them will only increase resistance to knowledge sharing. I believe they should instead recognize the opportunity that lies in <a href="https://www.oreilly.com/radar/ai-and-the-next-economy/" target="_blank" rel="noreferrer noopener">making a new kind of market for human expertise</a>.</p>



<h2 class="wp-block-heading"><strong>Protection, discovery, and the missing plumbing</strong></h2>



<p>Skills are just Markdown instructions and context. You could encrypt them at rest and in transit, but at execution time, the secret sauce is necessarily plaintext in the context window. The solution might be what MCP already partially enables: splitting a Skill into a public interface and a server-side execution layer where the proprietary knowledge lives. The tacit knowledge stays on your server while the agent only sees the interface.</p>
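


<p>As a rough sketch of that split (the endpoint and payload shape here are invented for illustration, and in practice an MCP server would likely play this role), the agent-facing interface can be a thin function whose inputs and outputs are public, while the firm&#8217;s actual methodology runs behind an endpoint its owner controls:</p>



<pre class="wp-block-code"><code># Sketch: a Skill's public interface, with the proprietary knowledge kept server-side.
# The endpoint and payload shape are hypothetical.
import json
import urllib.request

PRIVATE_ENDPOINT = "https://skills.internal.example/contract-review"

def review_contract(clause_text: str) -> dict:
    """The agent sees only inputs and outputs; the firm's review methodology
    executes server-side and never enters the agent's context window."""
    payload = json.dumps({"clause": clause_text}).encode("utf-8")
    req = urllib.request.Request(
        PRIVATE_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
</code></pre>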



<p>But part of the beauty of Skills right now is the fact that they really are just a folder that you can move around and modify. This is like the marvelous days of the early web when you could imitate someone&#8217;s new HTML functionality simply by clicking &#8220;View Source.&#8221; This was a recipe for rapid, leapfrogging innovation. It may be far better to establish norms for attribution, payment, and reuse than to put up artificial barriers. There are useful lessons from open source software licenses and from voluntary payment mechanisms like those used by Substack. But the details matter, and I don&#8217;t think anyone has fully worked them out yet.</p>



<p>Meanwhile, the discovery problem will grow larger. Vercel&#8217;s <a href="https://skills.sh/" target="_blank" rel="noreferrer noopener">Skills marketplace</a> already has more than 60,000 Skills. How well will skill search work when there are millions? How do agents learn which Skills are available, which are best, and what they cost? The evaluation problem is different from web search in a crucial way: testing whether a Skill is <em>good</em> requires actually running it, which is expensive and nondeterministic. You can&#8217;t just crawl and index. I don&#8217;t imagine a testing regime so much as some feedback mechanism by which the effectiveness of particular Skills is learned and passed on by agents. There may be some future equivalent of PageRank and the other kinds of signals that have made Google search so effective, one generated by the feedback agents collect as Skills are tried, revised, and tried again.</p>



<p>I&#8217;m watching several projects tackling pieces of this: <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2127" target="_blank" rel="noreferrer noopener">MCP Server Cards</a>, <a href="https://github.com/Agent-Card/ai-card" target="_blank" rel="noreferrer noopener">AI Cards</a>, Google&#8217;s <a href="https://a2aprotocol.ai/" target="_blank" rel="noreferrer noopener">A2A protocol</a>, and payment protocols from <a href="https://developers.google.com/merchant/ucp">Google</a> and <a href="https://stripe.com/blog/developing-an-open-standard-for-agentic-commerce">Stripe</a>. These are all a good start, but I suspect so much more has yet to be created. For a historical comparison, you might say that all this is at the <a href="https://en.wikipedia.org/wiki/Common_Gateway_Interface">CGI</a> stage in the development of dynamic websites.</p>



<h2 class="wp-block-heading"><strong>What happens after the bitter lesson?</strong></h2>



<p>Richard Sutton&#8217;s &#8220;<a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html" target="_blank" rel="noreferrer noopener">Bitter Lesson</a>&#8221; is the fly in the ointment. His argument is that in the history of AI, general methods leveraging computation have always ended up beating approaches that try to encode human knowledge. Chess engines that encoded grandmaster heuristics lost to brute-force engines. NLP systems built on carefully constructed grammars lost to statistical models trained on more data. AlphaGo beat Lee Sedol after training on human games, but then fell in turn to AlphaZero, which learned Go on its own.</p>



<p>I had my own painful experience of the pre-AI bitter lesson when O&#8217;Reilly launched <a href="https://en.wikipedia.org/wiki/Global_Network_Navigator" target="_blank" rel="noreferrer noopener">GNN, the first web portal</a>. We curated the list of the best websites. Yahoo! decided to catalog them all, but even they were outrun by Google&#8217;s algorithmic curation, which produced a unique catalog of the best sites for any given query, ultimately billions of times a day.</p>



<p>Steve Yegge put it bluntly to me: &#8220;Skills are a bet against the bitter lesson.&#8221; He&#8217;s right. AI&#8217;s capabilities may completely outrun human knowledge and skills. And once the knowledge embedded in a Skill makes it into the training data, the Skill becomes redundant.</p>



<p>Or does it?</p>



<p>Clay Christensen articulated what he called the <a href="https://store.hbr.org/product/breakthrough-ideas-for-2004-the-hbr-list/R0402A" target="_blank" rel="noreferrer noopener">law of conservation of attractive profits</a>: when a product becomes commoditized, value migrates to an adjacent layer. Clay and I bonded over this idea when we first met at the Open Source Business Conference in 2004. Clay talked about his new “law.” I talked about a recurring pattern I was seeing in the history of computing, which was leading me in the direction of what we were soon to call <a href="https://www.oreilly.com/pub/a/web2/archive/what-is-web-20.html" target="_blank" rel="noreferrer noopener">Web 2.0</a>: Microsoft beat IBM because they understood that software became more valuable once PC hardware was a commodity. Google understood how data became more valuable when open source and open protocols commoditized the software platform. Commoditization doesn&#8217;t destroy value, it moves it.</p>



<p>Even if the bitter lesson commoditizes knowledge, what becomes valuable next? I think there are several candidates.</p>



<p>First, taste and curation. When everyone has access to the same commodity knowledge, the ability to select, combine, and apply it with judgment becomes valuable. Steve Jobs did this when the rest of the industry was racing toward the commodity PC. He created a unique integration of hardware, software, and design that transformed commodity components into something precious. The Skill equivalent might not be &#8220;here&#8217;s how to do X&#8221; (which the model already knows) but &#8220;here&#8217;s how <em>we</em> do X, with the specific judgment calls and quality standards that define our approach.&#8221; That&#8217;s harder to absorb into training data because it&#8217;s not just knowledge, it&#8217;s <em>values</em>.</p>



<p>You can see this pattern repeat across one commodity market after another. This is the essence of fashion, for example, but also applies to areas as diverse as coffee, water, consumer goods, and automobiles. In his essay “<a href="https://www.amazon.com/Air-Guitar-Essays-Art-Democracy/dp/0963726455" target="_blank" rel="noreferrer noopener">The Birth of the Big Beautiful Art Market</a>,” art critic Dave Hickey describes how commodities are turned into a kind of “art market,” where something is sold on the basis of what it means rather than just what it does. Owning a Mac rather than a PC <em>meant</em> something.</p>



<p>Second, the human touch. As economist Adam Ozimek <a href="https://agglomerations.substack.com/p/economics-of-the-human" target="_blank" rel="noreferrer noopener">pointed out</a>, people still go listen to live music from local bands despite the abundance of recorded music from the world&#8217;s greatest performers. The human touch is what economists call a &#8220;normal good&#8221;: demand for it goes up as income goes up. As I discussed with Claude in &#8220;<a href="https://timoreilly.substack.com/p/why-ai-needs-us" target="_blank" rel="noreferrer noopener">Why AI Needs Us</a>,&#8221; human individuality is a fount of creativity. AI without humans is a kind of recorded music. AI plus humans is live.</p>



<p>Third, freshness. Skills that encode rapidly changing workflows, current tool configurations, or evolving best practices will always have a temporal advantage. There is alpha in knowing something first.</p>



<p>Fourth, tools themselves. The bitter lesson applies to the knowledge that lives in the context portion of a Skill. It may not apply in the same way to the deterministic tools that save tokens or do things the model can&#8217;t do by thinking harder. And tools, unlike context, can be protected behind APIs, metered, and monetized.</p>



<p>Fifth, coordination and orchestration. Even if individual Skills get absorbed into model knowledge, the patterns for how Skills compose, negotiate, and hand off to each other may not. The choreography of a complex workflow might be the layer where value accumulates as the knowledge layer commoditizes.</p>



<p>But more importantly, the idea that any knowledge that becomes available automatically becomes the property of any LLM is not foreordained. It is an artifact of an IP regime that the AI labs have adopted for their own benefit: a variation of the &#8220;empty lands&#8221; argument that European colonialists used to justify their taking of others&#8217; resources. AI has been developed in an IP wild west. That may not continue. The fulfillment of AI labs&#8217; vision of a world where their products absorb all human knowledge and then put humans out of work <a href="https://x.com/timoreilly/status/2016317410853220827" target="_blank" rel="noreferrer noopener">leaves them without many of the customers they currently rely on</a>. Not only that, they themselves are being reminded why IP law exists, as <a href="https://www.economist.com/china/2026/02/25/anthropic-says-chinas-ai-tigers-are-copycats" target="_blank" rel="noreferrer noopener">Chinese models copy their advances by exfiltrating their weights</a>. There is a historical parallel in the way that US publishing companies ignored European copyrights until they themselves had homegrown assets to protect.</p>



<h2 class="wp-block-heading"><strong>Where we are now</strong></h2>



<p>What I&#8217;m starting to see are the first halting steps toward a new software ecosystem where the &#8220;programs&#8221; are mixtures of natural language and code, the &#8220;runtime&#8221; is a large language model, and the &#8220;users&#8221; are AI agents as well as humans. Skills, Superpowers, and knowledge plugins might represent the first practical mechanism for making tacit knowledge accessible to computational agents.</p>



<p>Several gaps keep coming up, though. Composability: The real power may come from Skills that work together, much like Unix utilities piped together. How do trust, payment, and quality propagate through a chain of Skill invocations? Trust and security: Simon Willison has written about <a href="https://simonw.substack.com/p/model-context-protocol-has-prompt" target="_blank" rel="noreferrer noopener">tool poisoning and prompt injection risks in MCP</a>. The security model for composable, agent-discovered Skills is essentially unsolved. Evaluation: We don&#8217;t have good ways to verify a Skill&#8217;s quality except by running it, which is expensive and nondeterministic.</p>



<p>And then there&#8217;s the economic plumbing, which is to me the most glaring gap. Consider Anthropic&#8217;s Cowork plugins. They are exactly the pattern I&#8217;ve been describing, tacit knowledge made executable, delivered at enterprise scale. But there is no mechanism for the domain experts whose knowledge makes plugins valuable to get paid for them. If the AI labs believed in a future where AI extends the human knowledge economy rather than replacing it, they would be building payment rails alongside the plugin architecture. The fact that they aren&#8217;t tells you something about their actual theory of value.</p>



<p>If you&#8217;re working on any of this, whether skill marketplaces and discovery, composability patterns, protection models, quality and evaluation, attribution and compensation, or security models, <a href="https://github.com/oreillymedia/skills-and-the-future-knowledge-economy" target="_blank" rel="noreferrer noopener">I want to hear from you</a>.</p>



<p>The future of software isn&#8217;t just code. It&#8217;s knowledge, packaged for machines, traded between agents, and, if we get the infrastructure right, creating value that flows back to the humans whose expertise and unique perspectives make it all work.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Thanks to Andrew Odewahn, Angie Jones, Claude Opus 4.6, James Cham, Jeff Weinstein, Jonathan Hassell, Matt Beane, Mike Loukides, Peyton Joyce, Sruly Rosenblat, Steve Yegge, and Tadas Antanavicius for comments on drafts of this piece. You made it much stronger with your insights and objections.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/betting-against-the-bitter-lesson/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Semantic Layers in the Wild: Lessons from Early Adopters</title>
		<link>https://www.oreilly.com/radar/semantic-layers-in-the-wild-lessons-from-early-adopters/</link>
				<pubDate>Thu, 26 Feb 2026 12:16:01 +0000</pubDate>
					<dc:creator><![CDATA[Jeremy Arendt]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18141</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-semantic-layers1.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-semantic-layers1-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[My first post made the case for what a semantic layer can bring to the modern enterprise: a single source of truth accessible to everyone who needs it—BI teams in Tableau and Power BI, Excel-loving analysts, application integrations via API, and the AI agents now proliferating across organizations—all pulling from the same governed, performant metric [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>My <a href="https://www.oreilly.com/radar/the-trillion-dollar-problem/" target="_blank" rel="noreferrer noopener">first post</a> made the case for what a semantic layer can bring to the modern enterprise: a single source of truth accessible to everyone who needs it—BI teams in Tableau and Power BI, Excel-loving analysts, application integrations via API, and the AI agents now proliferating across organizations—all pulling from the same governed, performant metric layer. The promise is compelling. But what happens when organizations actually build and deploy one? To find out, I interviewed several early adopters who&#8217;ve moved semantic layers from concept to production. Four themes emerged from those conversations: some surprising, some predictable, and a few that will sound familiar to anyone who&#8217;s ever shipped data infrastructure.</p>



<p>The first theme: Semantic layers are showing up in unexpected places. Most discussion positions them as enterprise-level infrastructure—a single location capturing all company metrics for centralized access and governance. That&#8217;s still the primary use case. But practitioners are also deploying semantic layers for narrower purposes. One organization, for example, built their semantic layer specifically to power a targeted chatbot application—letting users query data conversationally without any traditional BI tools in the mix. No Power BI, no Excel, just an AI interface pulling from governed metrics. The rationale for these smaller deployments is straightforward: Semantic layers deliver high accuracy on structured data, even with lightweight models. The core value drivers remain speed, accuracy, and access—but organizations are finding more ways to extract that value than the enterprise-wide vision suggests.</p>



<p>The second theme: AI is the reason organizations are moving now. The other benefits still matter—single source of truth, multitool compatibility, true self-serve access, cost reduction in cloud environments—but when I asked practitioners why they prioritized a semantic layer today rather than two years ago, the answer was consistent: AI. Whether it was a specific chatbot project or enabling AI-driven analytics at scale, AI requirements were the catalyst. This tracks with what I discussed in my first post: Structured data alone isn&#8217;t enough for reliable AI analytics. Adding semantic context—field descriptions, model definitions, object relationships—dramatically improves accuracy. The data industry has noticed. Semantic layers have moved from niche infrastructure to strategic priority: Snowflake, Databricks, dbt Labs, and Microsoft have all made significant investments in the past year.</p>



<p>The third theme: Semantic layers reduce work for developers while making trusted data easier to access. Multiple practitioners cited the value of maintaining metrics and business logic in a single location. Any analyst knows the pain of metric sprawl—leadership requests a change to a core KPI, and you discover it&#8217;s been defined a dozen different ways across databases, BI tools, and spreadsheets scattered through the organization. The semantic layer eliminates the chase. One engineering lead described a financial metric that had accumulated over 60 versions across the company. After deploying the semantic layer, there was one.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>Access simplifies too. Instead of provisioning controls across warehouses, BI workspaces, individual dashboards, and cloud storage locations, users connect directly to the semantic layer and pull data into the tool of their choice. One organization was surprised to find that after deployment, the most common access point was Excel. But with the semantic layer, that wasn&#8217;t a problem: The data served in Excel was identical to what powered their AI tools, Power BI dashboards, and application integrations via API.</p>



<p>The fourth theme will sound familiar to anyone who&#8217;s shipped data infrastructure: The biggest challenge isn&#8217;t the technology—it&#8217;s the data itself. Every practitioner I spoke with identified the same bottleneck: consistency, availability, and accuracy of the underlying data. Engineers and analysts can build the semantic layer, but they can&#8217;t will clean data into existence. Success requires close collaboration with business stakeholders, clear ownership of metrics, and leadership alignment to prioritize the work. None of that is new. But despite these challenges, everyone I interviewed reached the same conclusion: The semantic layer is worth the effort.</p>



<p>Semantic layer technology is still early. The tools, vendors, and best practices are evolving fast—what works today may look different in a year. But these conversations revealed a clear signal beneath the noise: semantic layers are becoming critical AI infrastructure. The practitioners I spoke with aren&#8217;t experimenting anymore. They&#8217;re operationalizing. And despite the expected challenges around data quality and organizational alignment, they&#8217;re seeing real returns: fewer metric versions to maintain, simpler access controls, and AI tools that actually produce trusted answers.</p>



<p>My first article made the case for what a semantic layer could be. This one asked what happens when organizations actually build them. The answer: It&#8217;s hard, it&#8217;s worth it, and for companies serious about AI-driven analytics, the semantic layer is no longer a nice-to-have. It&#8217;s the foundation.</p>
]]></content:encoded>
										</item>
		<item>
		<title>Why Multi-Agent Systems Need Memory Engineering</title>
		<link>https://www.oreilly.com/radar/why-multi-agent-systems-need-memory-engineering/</link>
				<pubDate>Wed, 25 Feb 2026 12:12:13 +0000</pubDate>
					<dc:creator><![CDATA[Mikiko Bazeley]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18124</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-robot-AI-in-the-office.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Abstract-robot-AI-in-the-office-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Most multi-agent AI systems fail expensively before they fail quietly. The pattern is familiar to anyone who&#8217;s debugged one: Agent A completes a subtask and moves on. Agent B, with no visibility into A&#8217;s work, reexecutes the same operation with slightly different parameters. Agent C receives inconsistent results from both and confabulates a reconciliation. The [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Most multi-agent AI systems fail expensively before they fail quietly.</p>



<p>The pattern is familiar to anyone who&#8217;s debugged one: Agent A completes a subtask and moves on. Agent B, with no visibility into A&#8217;s work, reexecutes the same operation with slightly different parameters. Agent C receives inconsistent results from both and confabulates a reconciliation. The system produces output—but the output costs three times what it should and contains errors that propagate through every downstream task.</p>



<p>Teams building these systems tend to focus on agent communication: better prompts, clearer delegation, more sophisticated message-passing. But communication isn&#8217;t what&#8217;s breaking. The agents exchange messages fine. What they can&#8217;t do is maintain a shared understanding of what&#8217;s already happened, what&#8217;s currently true, and what decisions have already been made.</p>



<p>In production, <strong>memory</strong>—not messaging—determines whether a multi-agent system behaves like a coordinated team or an expensive collision of independent processes.</p>



<h2 class="wp-block-heading">Multi-agent systems fail because they can&#8217;t share state</h2>



<h3 class="wp-block-heading">The evidence: 36% of failures are misalignment</h3>



<p><a href="https://arxiv.org/abs/2503.13657" target="_blank" rel="noreferrer noopener">Cemri et al.</a> published the most systematic analysis of multi-agent failure to date. Their MAST taxonomy, built from over 1,600 annotated execution traces across frameworks like AutoGen, CrewAI, and LangGraph, identifies 14 distinct failure modes. The failures cluster into three categories: system design issues, interagent misalignment, and task verification breakdowns.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="317" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues.png" alt="Agentic Issues in Action" class="wp-image-18125" style="width:647px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-Issues-300x186.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 1. Challenges encountered in multi-agent systems, categorized by type</em></figcaption></figure>



<p>The number that matters: <strong>Interagent misalignment</strong> accounts for 36.9% of all failures. Agents don&#8217;t fail because they can&#8217;t reason. They fail because they operate on inconsistent views of shared state. One agent&#8217;s completed work doesn&#8217;t register in another agent&#8217;s context. Assumptions that were valid at step 3 become invalid by step 7, but no mechanism propagates the update. The team diverges.</p>



<p>What makes this structural rather than incidental is that message-passing architectures have no built-in answer to the question: &#8220;What does this agent know about what other agents have done?&#8221; Each agent maintains its own context. Synchronization happens through explicit messages, which means anything not explicitly communicated is invisible. In complex workflows, the set of things that need synchronization grows faster than any team can anticipate.</p>



<h3 class="wp-block-heading">The origin: Decomposition without shared memory</h3>



<p>Most multi-agent systems aren&#8217;t designed from first principles. They emerge from single-agent prototypes that hit scaling limits.</p>



<p>The starting point is usually one capable LLM handling one workflow. For early prototypes, this works well enough. But production requirements expand: more tools, more domain knowledge, longer workflows, concurrent users. The single agent&#8217;s prompt becomes unwieldy. Context management consumes more engineering time than feature development. The system becomes brittle in ways that are hard to diagnose.</p>



<p>The natural response is decomposition. <a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener">Sydney Runkle&#8217;s guide on choosing the right multi-agent architecture</a> captures the inflection point: Multi-agent systems become necessary when context management breaks down and when distributed development requires clear ownership boundaries. Splitting a monolithic agent into specialized subagents makes sense from a software engineering perspective.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="380" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps.png" alt="Decomposition steps" class="wp-image-18126" style="width:599px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Decomposition-of-steps-300x223.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 2. An example of the decomposition of steps via a multi-agent structure (subagents) from LangChain’s “</em><a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener"><em>Choosing the Right Multi-Agent Architecture</em></a><em>”</em></figcaption></figure>



<p>The problem is what teams typically build after the split: multiple agents running the same base model, differentiated only by system prompts, coordinating through message queues or shared files. The architecture looks like a team but behaves like a slow, redundant, expensive single agent with extra coordination overhead.</p>



<p>This happens because the decomposition addresses prompt complexity but not state management. Each subagent still maintains its own context independently. The coordination layer handles message delivery but not shared truth. The system has more agents but no better memory.</p>



<h3 class="wp-block-heading">The stakes: Agents are becoming enterprise infrastructure</h3>



<p>The stakes here extend beyond individual system reliability. Multi-agent architectures are becoming the default pattern for enterprise AI deployment.</p>



<p><a href="https://www.cs.cmu.edu/news/2025/agent-company" target="_blank" rel="noreferrer noopener">CMU&#8217;s AgentCompany benchmark</a> frames where this is heading: agents operating as persistent coworkers inside organizational workflows, handling projects that span days or weeks, coordinating across team boundaries, maintaining institutional context that outlasts individual sessions. The benchmark evaluates agents not on isolated tasks but on realistic workplace scenarios requiring sustained collaboration.</p>



<p>This trajectory means the memory problem compounds. A system that loses state between tool calls is annoying. A system that loses state between work sessions—or between team members—breaks the core value proposition of agent-based automation. The question shifts from &#8220;can agents complete tasks&#8221; to &#8220;can agent teams maintain coherent operations over time.&#8221;</p>



<h2 class="wp-block-heading">Context engineering doesn&#8217;t solve team coordination</h2>



<h3 class="wp-block-heading">Single-agent success doesn&#8217;t transfer</h3>



<p>The last two years produced genuine progress on single-agent reliability, most of it under the banner of context engineering. </p>



<p><a href="https://www.philschmid.de/context-engineering" target="_blank" rel="noreferrer noopener">Phil Schmid&#8217;s framing</a> captures the discipline: <strong>Context engineering </strong>means structuring what enters the context window, managing retrieval timing, and ensuring the right information surfaces at the right moment. This moved agent development from &#8220;write a good prompt&#8221; to &#8220;design an information architecture.&#8221; The results showed in production stability.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="288" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window.png" alt="Context window" class="wp-image-18127" style="width:611px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-window-300x169.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 3. What goes into the context window of a single LLM-based agent</em></figcaption></figure>



<p><a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">Manus</a>, one of the few production agent systems with publicly documented operational data, demonstrates both the success and the limits. Their agents average 50 tool calls per task with 100:1 input-to-output token ratios. Context engineering made this viable—but context engineering assumes you control one context window.</p>



<p>Multi-agent systems break that assumption. Context must now be shared across agents, updated as execution proceeds, scoped appropriately (some agents need information others shouldn&#8217;t access), and kept consistent across parallel execution paths. The complexity doesn&#8217;t add linearly. Each agent&#8217;s context becomes a potential source of divergence from every other agent&#8217;s context, and the coordination overhead grows with the square of the team size.</p>



<h3 class="wp-block-heading">Context degradation becomes contagious</h3>



<p>The ways context fails are well-characterized for single agents. <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">Drew Breunig&#8217;s taxonomy</a> identifies four modes: <strong>overload</strong> (too much information), <strong>distraction</strong> (irrelevant information weighted equally with relevant), <strong>contamination</strong> (incorrect information mixed with correct), and <strong>drift</strong> (gradual degradation over extended operation). Good context engineering mitigates all of these through retrieval design and prompt structure.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="318" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality.png" alt="Four methods for ruining context quality" class="wp-image-18128" style="width:603px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Four-methods-for-ruining-context-quality-300x186.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 4. How context degrades over time</em></figcaption></figure>



<p>Multi-agent systems make each failure mode contagious.</p>



<p><a href="https://research.trychroma.com/context-rot" target="_blank" rel="noreferrer noopener">Chroma&#8217;s research on context rot</a> provides the empirical mechanism. Their evaluation of 18 models—including GPT-4.1, Claude 4, and Gemini 2.5—shows performance degrading nonuniformly with context length, even on tasks as simple as text replication. The degradation accelerates when distractors are present and when the semantic similarity between query and target decreases.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="288" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot.png" alt="Context rot" class="wp-image-18129" style="width:591px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Context-rot-300x169.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 5. Conversations aren’t free—the context window can become a junkyard of prompts, outputs, tool calls, and metadata, failed attempts, and irrelevant information.</em></figcaption></figure>



<p>In a single-agent system, context rot degrades that agent&#8217;s outputs. In a multi-agent system, Agent A&#8217;s degraded output enters Agent B&#8217;s context as ground truth. Agent B&#8217;s conclusions, now built on a shaky foundation, propagate to Agent C. Each hop amplifies the original error. By the time the workflow completes, the final output may bear little relationship to the actual state of the world—and debugging requires tracing corruption through multiple agents&#8217; decision chains.</p>



<h3 class="wp-block-heading">More context makes things worse</h3>



<p>When coordination problems emerge, the instinct is often to give agents more context. Replay the full transcript so everyone knows what happened. Implement retrieval so agents can access historical state. Extend context windows to fit more information.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="319" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem.png" alt="How context quality becomes a problem" class="wp-image-18130" style="width:613px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/How-context-quality-becomes-a-problem-300x187.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 6. Conversations aren’t free—the context window can become a junkyard of prompts, outputs, tool calls, and metadata, failed attempts, and irrelevant information.</em></figcaption></figure>



<p>Each approach introduces its own failure modes.</p>



<p>Transcript replay creates unbounded prompt growth with persistent error exposure. Every mistake made early in execution remains in context, available to influence every subsequent decision. Models don&#8217;t automatically discount old information that&#8217;s been superseded by newer updates.</p>



<p>Retrieval surfaces content based on similarity, which doesn&#8217;t necessarily correlate with decision relevance. A retrieval system might surface a semantically similar memory from a different task context, an outdated state that&#8217;s since been updated, or content injected through prompt manipulation. The agent has no way to distinguish authoritative current state from plausibly related historical noise.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="312" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based.png" alt="Transcript replay vs retrieval-based" class="wp-image-18131" style="width:607px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Transcript-replay-v-retrieval-based-300x200.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 7. Both approaches lack explicit control over what becomes committed memory versus what should be discarded.</em></figcaption></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p><a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane&#8217;s work on bounded memory control</a> addresses this directly. The proposed Agent Cognitive Compressor maintains bounded internal state with explicit separation between what an agent can recall and what it commits to shared memory. The architecture prevents drift by making memory updates deliberate rather than automatic. The core insight: Reliability requires controlling what agents remember, not maximizing how much they can access.</p>



<h3 class="wp-block-heading">The economics are unsustainable</h3>



<p>Beyond reliability, the economics of uncoordinated multi-agent systems are punishing.</p>



<p>Return to the <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">Manus operational data</a>: 50 tool calls per task, 100:1 input-to-output ratios. At current pricing—context tokens running $0.30 to $3.00 per million across major providers—inefficient memory management makes many workflows economically unviable before they become technically unviable.</p>



<p><a href="https://www.anthropic.com/engineering/multi-agent-research-system" target="_blank" rel="noreferrer noopener">Anthropic&#8217;s documentation on its multi-agent research system</a> quantifies the multiplier effect. Single agents use roughly 4x the tokens of equivalent chat interactions. Multi-agent systems use roughly 15x tokens. The gap reflects coordination overhead: agents reretrieving information other agents already fetched, reexplaining context that should exist as shared state, and revalidating assumptions that could be read from common memory.</p>



<p>Memory engineering addresses costs directly. Shared memory eliminates redundant retrieval. Bounded context prevents payment for irrelevant history. Clear coordination boundaries prevent duplicated work. The economics of what to forget become as important as the economics of what to remember.</p>



<h2 class="wp-block-heading">Memory engineering provides the missing infrastructure</h2>



<h3 class="wp-block-heading">Why memory is infrastructure, not a feature</h3>



<p>Memory engineering isn&#8217;t a feature to add after the agent architecture is working. It&#8217;s infrastructure that makes coherent agent architectures possible.</p>



<p>The parallel to databases is direct. Before databases, multiuser applications required custom solutions for shared state, consistency guarantees, and concurrent access. Each project reinvented these primitives. Databases extracted the common requirements into infrastructure: shared truth across users, atomic updates that complete entirely or not at all, coordination that scales to thousands of concurrent operations without corruption.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="293" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory.png" alt="Multi-agent memory" class="wp-image-18132" style="width:595px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Multiagent-memory-300x188.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 8. Memory types specific to multi-agent systems</em></figcaption></figure>



<p>Multi-agent systems need equivalent infrastructure for agent coordination. Persistent memory that survives sessions and failures. Consistent state that all agents can trust. Atomic updates that prevent partial writes from corrupting shared truth. The primitives are different—documents rather than rows, vector similarity rather than joins—but the role in the architecture is the same.</p>



<h3 class="wp-block-heading">The five pillars of multi-agent memory</h3>



<p>Production agent teams require five capabilities. Each addresses a distinct aspect of how agents maintain shared understanding over time.</p>



<h4 class="wp-block-heading">Pillar 1: Memory taxonomy</h4>



<p><strong>Memory taxonomy</strong> defines what kinds of memory the system maintains. Not all memories serve the same function, and treating them uniformly creates problems. Working memory holds transient state during task execution—the current step, intermediate results, active constraints. It needs fast access and can be discarded when the task completes. Episodic memory captures what happened—task histories, interaction logs, decision traces. It supports debugging and learning from past executions. Semantic memory stores durable knowledge—facts, relationships, domain models that persist across sessions and apply across tasks. Procedural memory encodes how to do things—learned workflows, tool usage patterns, successful strategies that agents can reuse. Shared memory spans agents, providing the common ground that enables coordination.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="263" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types.png" alt="Taxonomy of memory types" class="wp-image-18133" style="width:613px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Taxonomy-of-memory-types-300x169.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 9. Taxonomy of memory types</em></figcaption></figure>



<p>This taxonomy has grounding in cognitive science. <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane</a> draws on Complementary Learning Systems theory, which posits two distinct modes of learning: rapid encoding of specific experiences versus gradual extraction of structured knowledge. The human brain doesn&#8217;t maintain perfect transcripts of past events—it operates under capacity constraints, using compression and selective attention to keep only what&#8217;s relevant to the current task. Agents benefit from the same principle. Rather than accumulating raw interaction history, effective memory architectures distill experience into compact, task-relevant representations that can actually inform decisions.</p>



<p>The taxonomy matters because each memory type has different retention requirements, different retrieval patterns, and different consistency needs. Working memory can tolerate eventual consistency because it&#8217;s scoped to one agent&#8217;s execution. Shared memory requires stronger guarantees because multiple agents depend on it. Systems that don&#8217;t distinguish memory types end up either overpersisting transient state (wasting storage and polluting retrieval) or underpersisting durable knowledge (forcing agents to relearn what they should already know).</p>
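


<p>A minimal sketch of how that taxonomy might be made explicit in code follows; the field names and retention values are illustrative assumptions, not a standard schema.</p>



<pre class="wp-block-code"><code># Illustrative memory record distinguishing the types described above.
# Field names and retention defaults are assumptions, not a standard.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal, Optional

MemoryType = Literal["working", "episodic", "semantic", "procedural", "shared"]

@dataclass
class MemoryRecord:
    type: MemoryType
    content: str
    agent_id: str                       # which agent wrote it
    scope: str = "private"              # "private", "team", or a workflow ID
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    ttl_seconds: Optional[int] = 3600   # None means "keep indefinitely"

# Working memory is short-lived and private to one agent's execution...
step = MemoryRecord("working", "step 3 of 7: awaiting the billing export", "agent-a")
# ...while semantic memory is durable and shared across the team.
fact = MemoryRecord("semantic", "EU invoices settle in EUR", "agent-a",
                    scope="team", ttl_seconds=None)
</code></pre>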



<h4 class="wp-block-heading">Pillar 2: Persistence</h4>



<p><strong>Persistence</strong> determines what survives and for how long. Ephemeral memory lost when agents terminate is insufficient for workflows spanning hours or days—but persisting everything forever creates its own problems. The critical gap in most current approaches, as <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">Bousetouane</a> observes, is that they treat text artifacts as the primary carrier of state without explicit rules governing memory lifecycle. Which memories should become permanent record? Which need revision as context evolves? Which should be actively forgotten? Without answers to these questions, systems accumulate noise alongside signal. Effective persistence requires explicit lifecycle policies: Working memory might live for the duration of a task; episodic memory for weeks or months; and semantic memory indefinitely. Recovery semantics matter too. When an agent fails midtask, what state can be reconstructed? What&#8217;s lost? The persistence architecture must handle both planned retention and unplanned recovery.</p>
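


<p>One way to express those lifecycle policies in MongoDB is a TTL index per memory collection, as in this sketch; the collection names and retention windows are assumptions, not recommendations.</p>



<pre class="wp-block-code"><code># Sketch: retention policies expressed as TTL indexes (pymongo).
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["agent_memory"]

# Working memory: expires an hour after it was written.
db.working.create_index("created_at", expireAfterSeconds=60 * 60)

# Episodic memory: kept for 90 days to support debugging and learning.
db.episodic.create_index("created_at", expireAfterSeconds=90 * 24 * 60 * 60)

# Semantic memory: no TTL index. Durable knowledge is kept indefinitely and
# removed only by an explicit curation step, not by the clock.
</code></pre>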



<h4 class="wp-block-heading">Pillar 3: Retrieval</h4>



<p><strong>Retrieval</strong> governs how agents access relevant memory without drowning in noise. Agent memory retrieval differs from document retrieval in several ways. Recency often matters—recent memories typically outweigh older ones for ongoing tasks. Relevance is contextual—the same memory might be critical for one task and distracting for another. Scope varies by memory type—working memory retrieval is narrow and fast, semantic memory retrieval is broader and can tolerate more latency. Standard RAG pipelines treat all content uniformly and optimize for semantic similarity alone. Agent memory systems need retrieval strategies that account for memory type, recency, task context, and agent role simultaneously.</p>
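


<p>As a sketch of what folding memory type, recency, and task context in on top of semantic similarity can look like, consider a ranking function along these lines; the weights and half-life are arbitrary placeholders rather than tuned values.</p>



<pre class="wp-block-code"><code># Sketch: rank candidate memories by more than semantic similarity.
import math
from datetime import datetime, timezone

# Rough priors for how much each memory type should count; placeholders, not tuned.
TYPE_WEIGHT = {"working": 1.0, "shared": 0.9, "semantic": 0.7, "episodic": 0.4}

def score(memory: dict, similarity: float, workflow_id: str,
          half_life_s: float = 3600.0) -> float:
    """Rank a candidate memory by similarity, type, recency, and task match."""
    age = (datetime.now(timezone.utc) - memory["created_at"]).total_seconds()
    recency = math.exp(-age / half_life_s)                 # recent memories count more
    same_task = 1.0 if memory.get("workflow_id") == workflow_id else 0.3
    current = 0.0 if memory.get("superseded") else 1.0     # drop superseded state entirely
    return similarity * TYPE_WEIGHT.get(memory["type"], 0.5) * recency * same_task * current
</code></pre>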



<h4 class="wp-block-heading">Pillar 4: Coordination</h4>



<p><strong>Coordination</strong> defines the sharing topology. Which memories are visible to which agents? What can each agent read versus write? How do memory scopes nest or overlap? Without explicit coordination boundaries, teams either overshare—every agent sees everything, creating noise and contamination risk—or undershare—agents operate in isolation, duplicating work and diverging on shared tasks. The coordination model must match the agent team&#8217;s structure. A supervisor-worker hierarchy needs different memory visibility than a peer collaboration. A pipeline of sequential agents needs different sharing than agents working in parallel on subtasks.</p>
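


<p>One simple way to make those boundaries explicit is to tag every memory with a scope and check reads and writes against a policy table. This sketch assumes a supervisor-worker topology; the roles and scopes are illustrative.</p>



<pre class="wp-block-code"><code># Sketch: memory visibility for a supervisor-worker team.
# Roles, scopes, and the policy tables are illustrative assumptions.
READ_SCOPES = {
    "supervisor": {"plan", "research", "drafting", "shared"},
    "research_worker": {"research", "shared"},
    "drafting_worker": {"drafting", "shared"},
}
WRITE_SCOPES = {
    "supervisor": {"plan", "shared"},
    "research_worker": {"research"},
    "drafting_worker": {"drafting"},
}

def can_read(role: str, memory: dict) -> bool:
    return memory["scope"] in READ_SCOPES.get(role, set())

def can_write(role: str, scope: str) -> bool:
    return scope in WRITE_SCOPES.get(role, set())

# A research worker sees shared state and its own notes, but not the draft in progress.
assert can_read("research_worker", {"scope": "shared"})
assert not can_read("research_worker", {"scope": "drafting"})
</code></pre>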



<h4 class="wp-block-heading">Pillar 5: Consistency</h4>



<p><strong>Consistency</strong> handles what happens when memory updates collide. When Agent A and Agent B simultaneously update the same shared state with incompatible values, the system needs a policy. Optimistic concurrency with merge strategies works for many cases—especially when conflicts are rare and resolvable. Some conflicts require escalation to a supervisor agent or human operator. Some domains need strict serialization where only one agent can update certain memories at a time. Silent last-write-wins is almost never correct—it corrupts shared truth without leaving evidence that corruption occurred. The consistency model must also handle ordering: When Agent B reads a memory that Agent A recently updated, does B see the update? The answer depends on the consistency guarantees the system provides, and different memory types may warrant different guarantees.</p>
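


<p>Here is a sketch of the optimistic-concurrency case, using a version counter so that a stale write fails visibly instead of silently overwriting another agent&#8217;s update (pymongo; the field names are illustrative).</p>



<pre class="wp-block-code"><code># Sketch: optimistic concurrency on shared memory with a version counter.
from pymongo import MongoClient, ReturnDocument

shared = MongoClient("mongodb://localhost:27017")["agent_memory"]["shared"]

def update_if_unchanged(key: str, expected_version: int, new_value: dict) -> dict:
    """Apply the write only if the version this agent read is still current."""
    doc = shared.find_one_and_update(
        {"key": key, "version": expected_version},
        {"$set": {"value": new_value}, "$inc": {"version": 1}},
        return_document=ReturnDocument.AFTER,
    )
    if doc is None:
        # Another agent wrote first: re-read, merge, or escalate -- never overwrite silently.
        raise RuntimeError(f"conflict on {key}: version {expected_version} is stale")
    return doc
</code></pre>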



<p><a href="https://arxiv.org/abs/2402.03578" target="_blank" rel="noreferrer noopener">Han et al.&#8217;s survey of multi-agent systems</a> emphasizes that these represent active research problems. The gap between what production systems need and what current frameworks provide remains substantial. Most orchestration frameworks handle message passing well but treat memory as an afterthought—a vector store bolted on for retrieval, with no coherent model for the other four pillars.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="468" height="291" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard.png" alt="How persona, consensus, and whiteboard memory work together" class="wp-image-18134" style="width:612px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard.png 468w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/MultiAgent-memory-persona-consensus-whiteboard-300x187.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /><figcaption class="wp-element-caption"><em>Figure 10. How persona, consensus, and whiteboard memory work together</em></figcaption></figure>



<h3 class="wp-block-heading">Database primitives that enable the pillars</h3>



<p>Implementing memory engineering requires a storage layer that can serve as unified operational database, knowledge store, and memory system simultaneously. The requirements cut across traditional database categories: You need document flexibility for evolving memory schemas, vector search for semantic retrieval, full-text search for precise lookups, and transactional consistency for shared state.</p>



<p>MongoDB provides these primitives in a single platform, which is why it appears across so many agent memory implementations—whether teams build custom solutions or integrate through frameworks and memory providers.</p>



<p><strong>Document flexibility</strong> matters because memory schemas evolve. A memory unit isn&#8217;t a flat string—it&#8217;s structured content with metadata, timestamps, source attribution, confidence scores, and associative links to related memories. Teams discover what context agents actually need through iteration. Document databases accommodate this evolution without schema migrations blocking development.</p>
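


<p>For illustration, a memory unit might look like the document below; the particular fields are assumptions rather than a fixed schema, which is exactly why schema flexibility matters.</p>



<pre class="wp-block-code"><code># Sketch: a memory unit as a structured document rather than a flat string.
# Field names and the shape of the links array are illustrative assumptions.
memory_unit = {
    "content": "Customer 811 escalated twice about invoice rounding",
    "type": "episodic",
    "scope": "support-team",
    "task_id": "task-42",
    "source": {"agent": "triage-agent", "tool": "crm.lookup"},
    "created_at": "2026-02-03T14:07:11Z",
    "confidence": 0.82,
    "embedding": [0.013, -0.094, 0.211],   # truncated for the example
    "links": [{"rel": "supersedes", "memory_id": "mem-7731"}],
    "superseded": False,
}</code></pre>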



<p><strong>Hybrid retrieval</strong> addresses the access pattern problem. Agent memory queries rarely fit a single retrieval mode: A typical query needs memories semantically similar to the current task <em>and</em> created within the last hour <em>and</em> tagged with a specific workflow ID <em>and</em> not marked as superseded. MongoDB Atlas Vector Search combines vector similarity, full-text search, and filtered queries in single operations, avoiding the complexity of stitching together separate retrieval systems.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image.png" alt="Hybrid search" class="wp-image-18135" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image.png 960w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image-300x169.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/image-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<p><strong>Atomic operations</strong> provide the consistency primitives that coordination requires. When an agent updates task status from pending to complete, the update succeeds entirely or fails entirely. Other agents querying task status never observe partial updates. This is standard MongoDB functionality—findAndModify, conditional updates, multidocument transactions—but it&#8217;s infrastructure that simpler storage backends lack.</p>
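


<p>A minimal sketch of that transition, assuming a hypothetical <code>tasks</code> collection and pymongo: The status flips from pending to complete in one conditional operation, and a second agent attempting the same transition simply gets nothing back.</p>



<pre class="wp-block-code"><code># Sketch: an atomic status transition. The task flips from "pending" to
# "complete" in one operation, or not at all; no agent ever observes a
# half-applied update. Collection and field names are illustrative.
from pymongo import ReturnDocument

def complete_task(db, task_id, result_summary):
    return db.tasks.find_one_and_update(
        {"_id": task_id, "status": "pending"},            # only if still pending
        {"$set": {"status": "complete", "result": result_summary}},
        return_document=ReturnDocument.AFTER,
    )

# Returns the updated document, or None if another agent already completed it.</code></pre>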



<p><strong>Change streams</strong> enable event-driven architectures. Applications can subscribe to database changes and react when relevant state updates, rather than polling. This becomes a building block for memory systems that need to propagate updates across agents.</p>
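


<p>A small sketch of that pattern, assuming pymongo and a replica set or Atlas cluster (change streams require one); the collection and field names are illustrative.</p>



<pre class="wp-block-code"><code># Sketch: reacting to shared-memory writes through a change stream instead of
# polling. Collection and field names are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["agent_memory"]
pipeline = [{"$match": {"operationType": {"$in": ["insert", "update", "replace"]}}}]

with db.shared_state.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        doc = change.get("fullDocument") or {}
        if doc.get("scope") == "plan":
            # Propagate the update: for example, enqueue a refresh for worker agents.
            print("shared plan changed:", doc.get("_id"))</code></pre>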



<p>Teams implement memory engineering on MongoDB through three paths. Some build directly on the database, using the document model and search capabilities to create custom memory architectures matched to their specific coordination patterns. Others work through orchestration frameworks—LangChain, LlamaIndex, CrewAI—that provide MongoDB integrations for their memory abstractions. Still others adopt dedicated memory providers like Mem0 or Agno, which handle the memory logic while using MongoDB as the underlying storage layer.</p>



<p>The flexibility matters because memory engineering isn&#8217;t a single pattern. Different agent architectures need different memory topologies, different consistency guarantees, different retrieval strategies. A database that prescribes one approach would fit some use cases and break others. MongoDB provides primitives; teams compose them into the memory systems their agents require.</p>



<h2 class="wp-block-heading">Shared memory enables heterogeneous agent teams</h2>



<h3 class="wp-block-heading">Homogeneous systems can be replaced by single agents</h3>



<p>The deeper payoff of memory engineering is enabling agent architectures that wouldn&#8217;t otherwise be viable.</p>



<p><a href="https://arxiv.org/abs/2601.12307" target="_blank" rel="noreferrer noopener">Xu et al.</a> observe that many deployed multi-agent systems are so homogeneous—same base model everywhere, agents differentiated only by prompts—that a single model can simulate the entire workflow with equivalent results and lower overhead. Their OneFlow optimization demonstrates this by reusing KV cache across simulated &#8220;agents&#8221; within a single execution, eliminating coordination costs while preserving workflow structure.</p>



<p>The implication: If a single agent can replace your multi-agent system, you haven&#8217;t built a team. You&#8217;ve built an expensive way to run one model.</p>



<h3 class="wp-block-heading">Small models need external memory to coordinate</h3>



<p>Genuine multi-agent value comes from heterogeneity. Different models with different capabilities operating at different price points for different subtasks. <a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">Belcak et al.</a> make the case that most work agents do in production isn&#8217;t complex reasoning—it&#8217;s routine execution of well-defined operations. Parsing a response, formatting an output, invoking a tool with specific parameters. These tasks don&#8217;t require frontier model capabilities, and the cost difference is dramatic: Their analysis puts the gap at 10x–30x between serving a 7B parameter model and a 70–175B parameter model when you factor in latency, energy, and compute. Large models should be reserved for the genuinely hard problems, not deployed uniformly across every step.</p>



<p><a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">Belcak et al.</a> also highlight an operational advantage: Smaller models can be retrained and adapted much faster. When an agent needs new capabilities or exhibits problematic behaviors, the turnaround for fine-tuning a 7B model is measured in hours, not days. This connects to memory engineering because fine-tuning represents an alternative to retrieval—you can bake procedural knowledge directly into model weights rather than surfacing it from external storage at runtime. The choice between the procedural memory pillar and model specialization becomes a design decision rather than a constraint.</p>



<p>This architecture—small models by default, large models for hard problems—depends on shared memory. Small models can&#8217;t maintain the context required for coordination on their own. They rely on external memory to participate in larger workflows. Memory engineering makes heterogeneous teams viable; without it, every agent must be large enough to maintain full context independently, which defeats the cost optimization that motivates heterogeneity in the first place.</p>



<h2 class="wp-block-heading">Building the foundation</h2>



<p>Multi-agent systems fail for structural reasons: context degrades across agents, errors propagate through shared interactions, costs multiply with redundant operations, and state diverges when nothing enforces consistency. These problems don&#8217;t resolve with better prompts or more sophisticated orchestration. They require infrastructure.</p>



<p>Memory engineering provides that infrastructure through a coherent taxonomy of memory types, persistence with explicit lifecycle rules, retrieval tuned to agent access patterns, coordination that defines clear sharing boundaries, and consistency that maintains shared truth under concurrent updates.</p>



<p>The organizations that make multi-agent systems work in production won&#8217;t be distinguished by agent count or model capability. They&#8217;ll be the ones that invested in the memory layer that transforms independent agents into coordinated teams.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">References</h2>



<p>Anthropic. &#8220;Building a Multi-Agent Research System.&#8221; 2025. <a href="https://www.anthropic.com/engineering/multi-agent-research-system" target="_blank" rel="noreferrer noopener">https://www.anthropic.com/engineering/multi-agent-research-system</a></p>



<p>Belcak, Peter, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. &#8220;Small Language Models are the Future of Agentic AI.&#8221; arXiv:2506.02153 (2025). <a href="https://arxiv.org/abs/2506.02153" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.02153</a></p>



<p>Bousetouane, Fouad. &#8220;AI Agents Need Memory Control Over More Context.&#8221; arXiv:2601.11653 (2026). <a href="https://arxiv.org/abs/2601.11653" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.11653</a></p>



<p>Breunig, Drew. &#8220;How Contexts Fail—and How to Fix Them.&#8221; June 22, 2025. <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html" target="_blank" rel="noreferrer noopener">https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html</a></p>



<p>Carnegie Mellon University. &#8220;AgentCompany: Building Agent Teams for the Future of Work.&#8221; 2025. <a href="https://www.cs.cmu.edu/news/2025/agent-company" target="_blank" rel="noreferrer noopener">https://www.cs.cmu.edu/news/2025/agent-company</a></p>



<p>Cemri, Mert, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. &#8220;Why Do Multi-Agent LLM Systems Fail?&#8221; arXiv:2503.13657 (2025). <a href="https://arxiv.org/abs/2503.13657" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2503.13657</a></p>



<p>Chroma Research. &#8220;Context Rot: How Increasing Context Length Degrades Model Performance.&#8221; 2025. <a href="https://research.trychroma.com/context-rot" target="_blank" rel="noreferrer noopener">https://research.trychroma.com/context-rot</a></p>



<p>Han, Shanshan, Qifan Zhang, Yuhang Yao, Weizhao Jin, and Zhaozhuo Xu. &#8220;LLM Multi-Agent Systems: Challenges and Open Problems.&#8221; arXiv:2402.03578 (2024). <a href="https://arxiv.org/abs/2402.03578" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2402.03578</a></p>



<p>LangChain Blog (Sydney Runkle). &#8220;Choosing the Right Multi-Agent Architecture.&#8221; January 14, 2026. <a href="https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/" target="_blank" rel="noreferrer noopener">https://www.blog.langchain.com/choosing-the-right-multi-agent-architecture/</a></p>



<p>Manus AI. &#8220;Context Engineering for AI Agents: Lessons from Building Manus.&#8221; 2025. <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noreferrer noopener">https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus</a></p>



<p>Schmid, Philipp. &#8220;Context Engineering.&#8221; 2025. <a href="https://www.philschmid.de/context-engineering" target="_blank" rel="noreferrer noopener">https://www.philschmid.de/context-engineering</a></p>



<p>Xu, Jiawei, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, and Ying Ding. &#8220;Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline.&#8221; arXiv:2601.12307 (2026). <a href="https://arxiv.org/abs/2601.12307" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.12307</a></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td>To explore memory engineering further, start experimenting with memory architectures using MongoDB Atlas or review our detailed tutorials available at <a href="https://www.mongodb.com/resources/use-cases/artificial-intelligence/?utm_campaign=devrel&amp;utm_source=cross-post&amp;utm_medium=cta&amp;utm_content=memory-for-multiagent-systems&amp;utm_term=mikiko.b&amp;utm_campaign=devrel&amp;utm_source=third-party-content&amp;utm_medium=cta&amp;utm_content=multi-agent-oreily&amp;utm_term=tony.kim" target="_blank" rel="noreferrer noopener">AI Learning Hub</a>.</td></tr></tbody></table></figure>
]]></content:encoded>
										</item>
		<item>
		<title>Control Planes for Autonomous AI: Why Governance Has to Move Inside the System</title>
		<link>https://www.oreilly.com/radar/control-planes-for-autonomous-ai-why-governance-has-to-move-inside-the-system/</link>
				<pubDate>Tue, 24 Feb 2026 12:16:14 +0000</pubDate>
					<dc:creator><![CDATA[Varun Raj]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18117</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-AI-shifts-to-runtime-architecture.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Agentic-AI-shifts-to-runtime-architecture-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Scaling agentic AI is forcing a shift from external policy to runtime architecture.]]></custom:subtitle>
		
				<description><![CDATA[For most of the past decade, AI governance lived comfortably outside the systems it was meant to regulate. Policies were written. Reviews were conducted. Models were approved. Audits happened after the fact. As long as AI behaved like a tool—producing predictions or recommendations on demand—that separation mostly worked. That assumption is breaking down. As AI [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>For most of the past decade, AI governance lived comfortably outside the systems it was meant to regulate. Policies were written. Reviews were conducted. Models were approved. Audits happened after the fact. As long as AI behaved like a tool—producing predictions or recommendations on demand—that separation mostly worked. That assumption is breaking down.</p>



<p>As AI systems move from assistive components to autonomous actors, governance imposed from the outside no longer scales. The problem isn’t that organizations lack policies or oversight frameworks. It’s that those controls are detached from where decisions are actually formed. Increasingly, the only place governance can operate effectively is inside the AI application itself, at runtime, while decisions are being made. This isn’t a philosophical shift. It’s an architectural one.</p>



<h2 class="wp-block-heading"><strong>When AI Fails Quietly</strong></h2>



<p>One of the more unsettling aspects of autonomous AI systems is that their most consequential failures rarely look like failures at all. Nothing crashes. Latency stays within bounds. Logs look clean. The system behaves coherently—just not correctly. An agent escalates a workflow that should have been contained. A recommendation drifts slowly away from policy intent. A tool is invoked in a context that no one explicitly approved, yet no explicit rule was violated.</p>



<p>These failures are hard to detect because they emerge from behavior, not bugs. Traditional governance mechanisms don’t help much here. Predeployment reviews assume decision paths can be anticipated in advance. Static policies assume behavior is predictable. Post hoc audits assume intent can be reconstructed from outputs. None of those assumptions holds once systems reason dynamically, retrieve context opportunistically, and act continuously. At that point, governance isn’t missing—it’s simply in the wrong place.</p>



<h2 class="wp-block-heading"><strong>The Scaling Problem No One Owns</strong></h2>



<p>Most organizations already feel this tension, even if they don’t describe it in architectural terms. Security teams tighten access controls. Compliance teams expand review checklists. Platform teams add more logging and dashboards. Product teams layer on additional prompt constraints. Each layer helps a little. None of them addresses the underlying issue.</p>



<p>What’s really happening is that governance responsibility is being fragmented across teams that don’t own system behavior end-to-end. No single layer can explain why the system acted—only that it acted. As autonomy increases, the gap between intent and execution widens, and accountability becomes diffuse. This is a classic scaling problem. And like many scaling problems before it, the solution isn’t more rules. It’s a different system architecture.</p>



<h2 class="wp-block-heading"><strong>A Familiar Pattern from Infrastructure History</strong></h2>



<p>We’ve seen this before. In early networking systems, control logic was tightly coupled to packet handling. As networks grew, this became unmanageable. Separating the control plane from the data plane allowed policy to evolve independently of traffic and made failures diagnosable rather than mysterious.</p>



<p>Cloud platforms went through a similar transition. Resource scheduling, identity, quotas, and policy moved out of application code and into shared control systems. That separation is what made hyperscale cloud viable. Autonomous AI systems are approaching a comparable inflection point.</p>



<p>Right now, governance logic is scattered across prompts, application code, middleware, and organizational processes. None of those layers was designed to assert authority continuously while a system is reasoning and acting. What’s missing is a control plane for AI—not as a metaphor but as a real architectural boundary.</p>



<h2 class="wp-block-heading"><strong>What “Governance Inside the System” Actually Means</strong></h2>



<p>When people hear “governance inside AI,” they often imagine stricter rules baked into prompts or more conservative model constraints. That’s not what this is about.</p>



<p>Embedding governance inside the system means separating decision execution from decision authority. Execution includes inference, retrieval, memory updates, and tool invocation. Authority includes policy evaluation, risk assessment, permissioning, and intervention. In most AI applications today, those concerns are entangled—or worse, implicit.</p>



<p>A control-plane-based design makes that separation explicit. Execution proceeds but under continuous supervision. Decisions are observed as they form, not inferred after the fact. Constraints are evaluated dynamically, not assumed ahead of time. Governance stops being a checklist and starts behaving like infrastructure.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="284" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems.jpg" alt="Execution from governance separation in AI systems" class="wp-image-18118" style="width:550px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems.jpg 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Separating-execution-from-governance-in-autonomous-AI-systems-300x166.jpg 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 1. Separating execution from governance in autonomous AI systems</em></figcaption></figure>



<p>Reasoning, retrieval, memory, and tool invocation operate in the execution plane, while a runtime control plane continuously evaluates policy, risk, and authority—observing and intervening without being embedded in application logic.</p>
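


<p>As a rough sketch of that separation, the snippet below routes every tool invocation through a hypothetical control plane that evaluates policies at runtime and records why each action was allowed. It illustrates the pattern only; it is not a reference to any particular product, and all names are invented.</p>



<pre class="wp-block-code"><code># Sketch: separating decision execution from decision authority. Every tool
# invocation is submitted to a control plane that evaluates policy at runtime
# and can allow, escalate, or refuse. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    reason: str
    escalate: bool = False

class ControlPlane:
    def __init__(self, policies):
        self.policies = policies   # callables taking (agent, action, context), returning Decision
        self.audit_log = []

    def authorize(self, agent, action, context):
        for policy in self.policies:
            decision = policy(agent, action, context)
            self.audit_log.append((agent, action, decision))   # record why it was allowed
            if decision.escalate or not decision.allowed:
                return decision
        return Decision(True, "all policies passed")

def no_unapproved_billing_writes(agent, action, context):
    """Example policy: billing writes require explicit human approval."""
    if action == "billing.update" and context.get("human_approved") is not True:
        return Decision(False, "billing writes require human approval", escalate=True)
    return Decision(True, "ok")

def invoke_tool(control_plane, agent, tool, args, context):
    decision = control_plane.authorize(agent, tool.__name__, context)
    if decision.escalate:
        return {"status": "pending_human_review", "reason": decision.reason}
    if not decision.allowed:
        return {"status": "refused", "reason": decision.reason}
    return {"status": "ok", "result": tool(**args)}</code></pre>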



<h2 class="wp-block-heading">Where Governance Breaks First</h2>



<p>In practice, governance failures in autonomous AI systems tend to cluster around three surfaces.</p>



<p><strong>Reasoning</strong>. Systems form intermediate goals, weigh options, and branch decisions internally. Without visibility into those pathways, teams can’t distinguish acceptable variance from systemic drift.</p>



<p><strong>Retrieval</strong>. Autonomous systems pull in context opportunistically. That context may be outdated, inappropriate, or out of scope—and once it enters the reasoning process, it’s effectively invisible unless explicitly tracked.</p>



<p><strong>Action</strong>. Tool use is where intent becomes impact. Systems increasingly invoke APIs, modify records, trigger workflows, or escalate issues without human review. Static authorization models don’t map cleanly onto dynamic decision contexts.</p>



<p>These surfaces are interconnected, but they fail independently. Treating governance as a single monolithic concern leads to brittle designs and false confidence.</p>



<h2 class="wp-block-heading">Control Planes as Runtime Feedback Systems</h2>



<p>A useful way to think about AI control planes is not as gatekeepers but as feedback systems. Signals flow continuously from execution into governance: confidence degradation, policy boundary crossings, retrieval drift, and action escalation patterns. Those signals are evaluated in real time, not weeks later during audits. Responses flow back: throttling, intervention, escalation, or constraint adjustment.</p>



<p>This is fundamentally different from monitoring outputs. Output monitoring tells you what happened. Control plane telemetry tells you why it was allowed to happen. That distinction matters when systems operate continuously and consequences compound over time.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="512" height="306" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop.jpg" alt="" class="wp-image-18119" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop.jpg 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/02/Runntime-governance-as-a-feedback-loop-300x179.jpg 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption"><em>Figure 2. Runtime governance as a feedback loop</em></figcaption></figure>



<p>Behavioral telemetry flows from execution into the control plane, where policy and risk are evaluated continuously. Enforcement and intervention feed back into execution before failures become irreversible.</p>
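


<p>A compact sketch of that loop: behavioral signals such as confidence degradation, retrieval drift, and policy boundary crossings are evaluated continuously and mapped to responses such as throttling or escalation. The signal names and thresholds below are illustrative assumptions, not a standard.</p>



<pre class="wp-block-code"><code># Sketch: the feedback loop in miniature. Execution emits behavioral signals;
# the control plane evaluates them continuously and returns a response.
# Signal names and thresholds are illustrative assumptions.
def evaluate(signals):
    """Map rolling behavioral measurements to a runtime response."""
    if signals["policy_boundary_crossings"] >= 1:
        return "halt", "an action crossed an explicit policy boundary"
    if signals["retrieval_drift"] >= 0.4:
        return "escalate", "retrieved context is drifting from approved sources"
    if signals["confidence_degradation"] >= 0.5:
        return "throttle", "confidence is degrading; slow down and add review"
    return "continue", "within the normal operating envelope"

response, reason = evaluate({
    "confidence_degradation": 0.57,
    "retrieval_drift": 0.12,
    "policy_boundary_crossings": 0,
})
# response == "throttle": intervention happens while the consequence is still small.</code></pre>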



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading">A Failure Story That Should Sound Familiar</h2>



<p>Consider a customer-support agent operating across billing, policy, and CRM systems.</p>



<p>Over several months, policy documents are updated. Some are reindexed quickly. Others lag. The agent continues to retrieve context and reason coherently, but its decisions increasingly reflect outdated rules. No single action violates policy outright. Metrics remain stable. Customer satisfaction erodes slowly.</p>



<p>Eventually, an audit flags a noncompliant action. At that point, teams scramble. Logs show what the agent did but not why. They can’t reconstruct which documents influenced which decisions, when those documents were last updated, or why the agent believed its actions were valid at the time.</p>



<p>This isn’t a logging failure. It’s the absence of a governance feedback loop. A control plane wouldn’t prevent every mistake, but it would surface drift early—when intervention is still cheap.</p>



<h2 class="wp-block-heading">Why External Governance Can’t Catch Up</h2>



<p>It’s tempting to believe better tooling, stricter reviews, or more frequent audits will solve this problem. They won’t.</p>



<p>External governance operates on snapshots. Autonomous AI operates on streams. The mismatch is structural. By the time an external process observes a problem, the system has already moved on—often repeatedly. That doesn’t mean governance teams are failing. It means they’re being asked to regulate systems whose operating model has outgrown their tools. The only viable alternative is governance that runs at the same cadence as execution.</p>



<h2 class="wp-block-heading">Authority, Not Just Observability</h2>



<p>One subtle but important point: Control planes aren’t just about visibility. They’re about authority.</p>



<p>Observability without enforcement creates a false sense of safety. Seeing a problem after it occurs doesn’t prevent it from recurring. Control planes must be able to act—to pause, redirect, constrain, or escalate behavior in real time.</p>



<p>That raises uncomfortable questions. How much autonomy should systems retain? When should humans intervene? How much latency is acceptable for policy evaluation? There are no universal answers. But those trade-offs can only be managed if governance is designed as a first-class runtime concern, not an afterthought.</p>



<h2 class="wp-block-heading">The Architectural Shift Ahead</h2>



<p>The move from guardrails to control loops mirrors earlier transitions in infrastructure. Each time, the lesson was the same: Static rules don’t scale under dynamic behavior. Feedback does.</p>



<p>AI is entering that phase now. Governance won’t disappear. But it will change shape. It will move inside systems, operate continuously, and assert authority at runtime. Organizations that treat this as an architectural problem—not a compliance exercise—will adapt faster and fail more gracefully. Those that don’t will spend the next few years chasing incidents they can see but never quite explain.</p>



<h2 class="wp-block-heading">Closing Thought</h2>



<p>Autonomous AI doesn’t require less governance. It requires governance that understands autonomy.</p>



<p>That means moving beyond policies as documents and audits as events. It means designing systems where authority is explicit, observable, and enforceable while decisions are being made. In other words, governance must become part of the system—not something applied to it.</p>



<h2 class="wp-block-heading">Further Reading</h2>



<ul class="wp-block-list">
<li>“AI Governance Frameworks for Responsible AI,” Gartner Peer Community, <a href="https://www.gartner.com/peer-community/oneminuteinsights/omi-ai-governance-frameworks-responsible-ai-33q" target="_blank" rel="noreferrer noopener">https://www.gartner.com/peer-community/oneminuteinsights/omi-ai-governance-frameworks-responsible-ai-33q</a>.</li>



<li>Lauren Kornutick et al., “Market Guide for AI Governance Platforms,” Gartner, November 4, 2025, <a href="https://www.gartner.com/en/documents/7145930" target="_blank" rel="noreferrer noopener">https://www.gartner.com/en/documents/7145930</a>.</li>



<li>Svetlana Sicular, “AI’s Next Frontier Demands a New Approach to Ethics, Governance, and Compliance,” Gartner, November 10, 2025, <a href="https://www.gartner.com/en/articles/ai-ethics-governance-and-compliance" target="_blank" rel="noreferrer noopener">https://www.gartner.com/en/articles/ai-ethics-governance-and-compliance</a>.</li>



<li><a href="https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf" target="_blank" rel="noreferrer noopener"><em>AI Risk Management Framework (AI RMF 1.0)</em></a>, NIST, January 2023, <a href="https://doi.org/10.6028/NIST.AI.100-1" target="_blank" rel="noreferrer noopener">https://doi.org/10.6028/NIST.AI.100-1</a>.</li>
</ul>
]]></content:encoded>
										</item>
	</channel>
</rss>
