<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Thu, 04 Jun 2026 17:08:52 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>The Tidy House</title>
		<link>https://www.oreilly.com/radar/the-tidy-house/</link>
				<comments>https://www.oreilly.com/radar/the-tidy-house/#respond</comments>
				<pubDate>Thu, 04 Jun 2026 16:25:11 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18849</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-tidy-house.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-tidy-house-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[DJ Patil on why the hardest part of AI adoption is organizational, not technical]]></custom:subtitle>
		
				<description><![CDATA[DJ Patil has spent the past several months on a listening tour. Wherever he travels, he finds a local university, pings faculty and students and anyone else who wants to show up, and runs an AMA. He&#8217;s heard from grad students who can&#8217;t get callbacks, hospital administrators dealing with federal policy changes that land like [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">DJ Patil has spent the past several months on a listening tour. Wherever he travels, he finds a local university, pings faculty and students and anyone else who wants to show up, and runs an AMA. He&#8217;s heard from grad students who can&#8217;t get callbacks, hospital administrators dealing with federal policy changes that land like a change in the laws of physics, and executives who can&#8217;t forecast their AI spending past six months. He&#8217;s trying to synthesize all of it and help reframe the wider conversation.</p>



<p class="wp-block-paragraph">DJ co-coined the term &#8220;data scientist,&#8221; served as America&#8217;s first chief data scientist under President Obama, and was chief scientist at LinkedIn. He&#8217;s a longtime O&#8217;Reilly author, going back to <em><a href="https://www.oreilly.com/library/view/building-data-science/BLDNGDST0001/" target="_blank" rel="noreferrer noopener">Building Data Science Teams</a></em> and <em><a href="https://www.oreilly.com/library/view/ethics-and-data/9781492043898/" target="_blank" rel="noreferrer noopener">Ethics and Data Science</a></em>, and he&#8217;s on the founding team at <a href="https://www.devoted.com/" target="_blank" rel="noreferrer noopener">Devoted Health</a>, where he&#8217;s spent the past decade building the kind of data infrastructure most organizations are still struggling to put in place. He calls it “the tidy house.” He sat down with me to talk about the gap between what the technology can do and what most institutions can actually absorb.</p>



<h2 class="wp-block-heading">The broken promise</h2>



<p class="wp-block-paragraph">What DJ keeps hearing on his tour is anger and angst. One word that keeps coming up is &#8220;terrified.&#8221; Workers are worried about layoffs. Meanwhile, students, including those from top-tier universities like MIT, Carnegie Mellon, and UC Berkeley, have been applying to 300+ internships and getting fewer than 10 callbacks. Many had zero offers going into the summer. And the industry&#8217;s response has been to tell them to learn more AI and burn more tokens. What it comes down to, DJ explained, is “effectively a broken promise”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">We said, “Go to college, get these things, you&#8217;re going to get an internship, you&#8217;re going to get job training, you&#8217;re going to pay off your student loans, and then you&#8217;re going to have all the other things that are part of that social contract.”</p>



<p class="wp-block-paragraph">What the students are feeling for the first time [is]. . .“Wait, if I can&#8217;t get this internship, . . .I&#8217;m fundamentally off trajectory from getting this job.” And it doesn&#8217;t have to be a technical person. It could be someone that is in marketing. It could be someone that&#8217;s in the liberal arts. It could be a researcher. . . .There are plenty of students that I have talked to who are supposed to be going to a doctoral PhD program or a medical school or something like that. The slots aren&#8217;t there because of the overall budget impacts. And so whether you call it AI impact or economic reframing, the thing is broken.</p>
</blockquote>



<p class="wp-block-paragraph">This is where I&#8217;ve been trying to build a counter narrative. The story coming from the AI labs is destructive: “We&#8217;re going to put all of you out of work, and we&#8217;ll figure out the rest once the intelligence explosion arrives.” That&#8217;s bad PR for AI, but it’s also magical thinking. An economy is a circulatory system. You can&#8217;t put your customers out of work and at the same time expect that the economy will hum along as usual. A catastrophic recession could easily interrupt the funding that keeps AI on its growth path and the concentration of value that they assume will fund universal basic income and an expanded safety net.</p>



<p class="wp-block-paragraph">That’s why I’m a fan of <a href="https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/" target="_blank" rel="noreferrer noopener">mechanism design</a>: start from the outcome you want, then figure out the rules of the game that produces it. Right now, they’ve designed a game that concentrates all the value in the hands of AI first movers. They could be designing a game that generates value throughout the economy. But they aren’t building affordances for that.</p>



<p class="wp-block-paragraph">YouTube ContentID is a good example of mechanism design leading to economic value creation. When unauthorized music use by online video creators triggered a backlash from rights holders, YouTube replied to the takedown notices with a way for both the people who owned the music and the people who wanted to use it to get paid. A whole creator economy came out of that design choice. The labs have the same opportunity in front of them and mostly aren&#8217;t taking it.</p>



<p class="wp-block-paragraph">DJ had one concrete mechanism in mind:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Imagine OpenAI and Anthropic and Microsoft.&nbsp;.&nbsp;.get together and [say], “If you&#8217;re building something for your local community, we&#8217;ll fully subsidize the token cost for some period of time.”.&nbsp;.&nbsp;.We&#8217;re talking about marginal token usage relatively on the spectrum of things, but the potential innovation and use of AI to help local communities could be astounding. You&#8217;re not putting anybody out of a job with that.&nbsp;.&nbsp;.&nbsp;.You&#8217;re filling the holes that already exist in the system.</p>
</blockquote>



<p class="wp-block-paragraph">The <a href="https://openaifoundation.org/news/update-on-the-openai-foundation#our-mission" target="_blank" rel="noreferrer noopener">OpenAI Foundation just announced</a> it will put $1 billion into public-benefit projects this year, including $250 million aimed at building economic futures. It&#8217;s a start. But it mostly seems designed to ameliorate the bad effects of AI rather than to forestall them by building a more inclusive AI future. If the labs start investing in the human-plus-AI economy rather than just studying the job losses, the payoff to local communities could be real.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="The Broken Promise with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/OAwI4G_MxYg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">A makerspace to bridge the internship gap</h2>



<p class="wp-block-paragraph">DJ&#8217;s plan is to build a bridge. He&#8217;s launching a program, basically a makerspace, for students who don&#8217;t have an internship this summer. Over two four-week sprints, an initial cohort will get mentors, speakers, and the space to explore whatever they&#8217;re interested in. It doesn&#8217;t have to be AI. Whether they’re doing investigative journalism, screenwriting, or building civic tech, participants will get some experience with current tools and produce a tangible asset they can use to prove what they know. As I told DJ in our conversation, I think he’s really on to something, and I&#8217;d love O&#8217;Reilly to be part of what he’s building.</p>



<p class="wp-block-paragraph">There&#8217;s a kind of person who has always been at the center of the O&#8217;Reilly community and never waited for a job description. High school dropouts who started companies. People who looked around, found something that needed doing, and did it. DJ is one of them. He&#8217;s a community college kid who learned from a good local library, from the <a href="https://www.oreilly.com/content/a-short-history-of-the-oreilly-animals/" target="_blank" rel="noreferrer noopener">books with the “funny animals” on the cover</a>, and from open source. That path is still open. The early O&#8217;Reilly business came out of exactly this instinct. We were a tech-writing consulting shop, and when we ran out of paid work, we wrote manuals that didn&#8217;t exist yet but that we thought were needed. Later, when there were big conferences for every corporate technology and none for open source, we ran the first one for Perl. Conferences became a whole new business for us. You look for the gap and you fill it.</p>



<p class="wp-block-paragraph">DJ pushes the same idea down to the level of the neighborhood:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">If you want to feel rewarded, go fix something in your neighborhood. Go help out the food pantry. Go help out the local foster child care system. Go help out.&nbsp;.&nbsp;.parks and rec. Use those skills to go do something, and then you&#8217;re going to see.&nbsp;.&nbsp;.people respond in a different way.&nbsp;.&nbsp;.&nbsp;.The target-rich area for problems is massive. You just have to look.</p>
</blockquote>



<p class="wp-block-paragraph">I&#8217;ve never bought the jobless-future story. Back when I wrote <em><a href="https://www.oreilly.com/tim/wtf-book.html" target="_blank" rel="noreferrer noopener">WTF?</a></em> in 2016, I pointed out that there is so much around us that needs to be made better. The constraint has never been a shortage of problems. AI gives us new tools for solving them. It should be a way to put people <em>to work</em>, not <em>out of work</em>.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="A Makerspace to Bridge the Internship Gap with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/bzE88bDjvJo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The organization is the bottleneck</h2>



<p class="wp-block-paragraph">DJ has also been visiting hospitals and clinics and talking to CIOs and CTOs as part of the tour, and what he&#8217;s seeing is alarming.</p>



<p class="wp-block-paragraph">The federal changes to Medicaid and the Affordable Care Act are landing on systems that were already near collapse. Hospitals that depended on outpatient procedures like colonoscopies for margin are watching volumes drop 20% to 30% because people can&#8217;t afford insurance. Some are running $1 million a day behind, a $300 to $400 million shortfall for the year.</p>



<p class="wp-block-paragraph">At the same time, AI companies are telling those same hospitals to move into the new world, and partly because of the “you will soon be replaced” narrative from the AI labs, labor is responding the way the Kaiser nurses did in California, where any use of AI was off the table as a bargaining condition. As DJ pointed out, we can’t afford to disregard AI when it has the potential to automate the most painful parts of healthcare workers’ jobs and let them “do the job they&#8217;re trained for” without the administrative burden. Businesses need to change not just their narrative but their strategy. They need to be saying, “We’re going to use AI to help you do more for our customers. We’re going to make your job more human and let the machines deal with the BS.”</p>



<p class="wp-block-paragraph">The constraint here is organizational capacity, not technology. The Silicon Valley default assumes that incumbents will just get disrupted by startups, the way media was by Google and Meta and retail was by Amazon. There&#8217;s some truth to that. But disruption takes much longer than people think, and in a domain this central, the delay means real harm to real people. Healthcare is nearly a fifth of the US economy. You can&#8217;t just let it fail and rebuild it fresh while people depend on it for survival.</p>



<p class="wp-block-paragraph">There’s a version of this where the efficiencies AI creates get plowed back into better patient care. There&#8217;s also the version that&#8217;s actually happening in most places, where private equity captures the savings as profit. The difference is institutional design, and that&#8217;s where reform isn&#8217;t happening. I saw this directly with a <a href="http://codeforamerica.org" target="_blank" rel="noreferrer noopener">Code for America</a> project called <a href="https://www.clearmyrecord.org/" target="_blank" rel="noreferrer noopener">Clear My Record</a>. A California initiative had reclassified a number of low-level felonies as misdemeanors, but very few people were petitioning to have their status changed. We started using software to streamline an absurdly convoluted criminal record expungement process, but then we asked ourselves why we were helping people fill out forms that shouldn&#8217;t exist. The law had already changed the record. The process should have been a database update, not something that required a petition to the court. That’s the kind of problem AI was born to solve. It can help us refactor old, stuck processes and move to something way better.</p>



<p class="wp-block-paragraph">Done right, DOGE could have been an opportunity to carry out that kind of real institutional change at scale. Instead it became a wrecking ball, and it&#8217;s given the whole idea of institutional reform a bad name.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Organizational Capacity Is the Bottleneck with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/BHsqVllEZPQ?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Data infrastructure is the competitive advantage</h2>



<p class="wp-block-paragraph">DJ&#8217;s term for the alternative he&#8217;s living with at Devoted is “the tidy house.” He built the boring infrastructure years before LLMs existed, and that&#8217;s why the company could move the moment AI arrived.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">One of the ways we&#8217;ve tried to make this work is fundamentally still data 101, unified data environments, data flows that are clean, that have a lot of organization.&nbsp;.&nbsp;.&nbsp;.Because we invested so heavily in that infrastructure, the dumb, boring, painful parts of making sure you&#8217;ve got a really great data warehouse, great data engineering pipes, all of the metadata that goes with it, when AI shows up, you get to use it right away. Now you get to focus on the orchestration, the harness, all those pieces.</p>
</blockquote>



<p class="wp-block-paragraph">While other organizations are reconstructing ETL inside context windows and paying for it in GPU costs, Devoted&#8217;s team gets to work on the actual clinical problems. As DJ put it, transforming a healthcare system is &#8220;like walking and chewing gum while balancing bowling balls on your head and on a unicycle,&#8221; with the laws of physics changing on you the whole time. The organizations that come through it will be the ones that did the unglamorous work of keeping clean, flowing data with its lineage and metadata intact. The ones that didn&#8217;t will keep paying to reconstruct context they should have had all along.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Keeping a Tidy House with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/73vf3GeP20g?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The pharmacists who built their own agents</h2>



<p class="wp-block-paragraph">The tidy house pays off when you put the tools in the hands of people who already know the domain. At Devoted, clinicians are building things without waiting for a product manager to learn the problem first. These frontline workers have already spent decades understanding it.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">A pharmacist.&nbsp;.&nbsp;.says, “Hey, you know what? I&#8217;m really worried when I see these kinds of drugs show up together. That&#8217;s not a good thing.&nbsp;.&nbsp;.&nbsp;.Why don&#8217;t I have an agent that alerts me every time this happens? I should just automate it because maybe one of the patients gets prescribed something by another provider and we don&#8217;t see it.” So the pharmacist [says,].&nbsp;.&nbsp;.“I&#8217;m just going to build that agent.” Now I&#8217;ve got an agent always looking for bad drug interactions. And another pharmacist says, “I&#8217;ve got my own version of that.”&nbsp;.&nbsp;.&nbsp;.So I say, “Hey, agent, I want you to go ask all the pharmacists that we have a quick survey of what might be happening.&nbsp;.&nbsp;.&nbsp;.What are the universe of things that we should be watching out for?” Now I&#8217;ve got a robust medical layer.&nbsp;.&nbsp;.looking out and protecting all of our members from bad drug interactions.</p>
</blockquote>



<p class="wp-block-paragraph">One clinician automating the thing they&#8217;d always done by hand expands to cover an entire membership of patients. Having the right infrastructure makes it possible to act on decades of accumulated judgment at the scale of the whole system.</p>
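


<p class="wp-block-paragraph">As a rough illustration of the kind of check the pharmacist describes, the sketch below scans each member’s combined medication list, across all prescribers, for pairs on a watchlist. It is not Devoted’s implementation: the drug names, the watchlist, and the function name are hypothetical placeholders.</p>

```python
from itertools import combinations

# Hypothetical watchlist of drug pairs a pharmacist wants flagged.
# The names are placeholders, not real clinical guidance.
WATCHLIST = {
    frozenset({"drug_a", "drug_b"}),
    frozenset({"drug_c", "drug_d"}),
}


def flag_interactions(prescriptions):
    """Return {member_id: [sorted pair, ...]} for watched pairs.

    `prescriptions` is an iterable of (member_id, prescriber, drug)
    tuples, so drugs from different providers land in one list.
    """
    drugs_by_member = {}
    for member_id, _prescriber, drug in prescriptions:
        drugs_by_member.setdefault(member_id, set()).add(drug)

    alerts = {}
    for member_id, drugs in drugs_by_member.items():
        hits = [
            tuple(sorted(pair))
            for pair in combinations(sorted(drugs), 2)
            if frozenset(pair) in WATCHLIST
        ]
        if hits:
            alerts[member_id] = hits
    return alerts


prescriptions = [
    ("m1", "clinic_1", "drug_a"),
    ("m1", "clinic_2", "drug_b"),  # second provider, same member
    ("m2", "clinic_1", "drug_c"),
]
print(flag_interactions(prescriptions))
```

<p class="wp-block-paragraph">Keying the lookup on the member rather than the prescriber is what lets a pair split across two providers surface at all.</p>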



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Pharmacists Who Built Their Own Agents with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/bHqxMWVbP44?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The histogram is still the most powerful data product</h2>



<p class="wp-block-paragraph">You don&#8217;t need exotic tooling to get value out of data, and DJ has a way of puncturing the assumption that you do.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Oftentimes, I tell people, the most powerful data product you can build is still a histogram. Just give me a distribution of what&#8217;s going on.&nbsp;.&nbsp;.&nbsp;.AI gives us a tremendous opportunity to let people [access this data quickly], but we&#8217;ve got to figure out the guardrails, so people don&#8217;t ask [questions] or get answers.&nbsp;.&nbsp;.[without realizing] that there&#8217;s a flaw in how they&#8217;re asking it.</p>
</blockquote>
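


<p class="wp-block-paragraph">A minimal sketch of that point, using only the Python standard library: bucket a list of values and print the distribution as text. The sample data (days between a referral and an appointment) and the bin width are made up for illustration.</p>

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values into fixed-width bins keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def render(counts, bin_width):
    """Return one text line per non-empty bin, with one '#' per observation."""
    return [
        f"{low:3d}-{low + bin_width - 1:3d} | {'#' * n} ({n})"
        for low, n in counts.items()
    ]


# Hypothetical sample: days between referral and appointment.
wait_days = [2, 3, 3, 5, 8, 9, 11, 12, 12, 14, 21, 45]
counts = histogram(wait_days, bin_width=10)
for line in render(counts, bin_width=10):
    print(line)
```

<p class="wp-block-paragraph">Even this crude view shows the long tail (the single 45-day wait) that a lone average would hide.</p>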



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Histogram Is Still the Most Powerful Data Product with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/xBBjws9NIIo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p class="wp-block-paragraph">We&#8217;ve been in this loop since the beginning of the data movement, DJ explained. The stewards of the data warehouse stand at the gate and say, “You shall not pass!” Then democratization breaks it open, and the gatekeepers reconstitute themselves in the next era. Hadoop did it last time. LLMs are doing it now, and the temptation to insist that only experts can use the tools correctly is as strong as it&#8217;s ever been. You do need ways to catch errors. But the goal should always be access.</p>



<h2 class="wp-block-heading">The real opportunity is in the layers above AI models</h2>



<p class="wp-block-paragraph">There&#8217;s a new discipline forming inside computer science. We increasingly have to engineer trade-offs: when to use conventional software and when to use an LLM, when to reach for a local or open-weight model, and what inference actually costs against what it returns.</p>
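


<p class="wp-block-paragraph">One way to make that trade-off concrete is a back-of-the-envelope expected-value comparison. The sketch below is illustrative only: the options, success rates, per-call costs, and the value of a correct answer are invented numbers, not measurements.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    success_rate: float   # chance the result is usable, 0.0 to 1.0
    cost_per_call: float  # dollars per request (hypothetical)


def expected_net_value(option, value_if_correct):
    """Expected return of one call minus what the call costs."""
    return option.success_rate * value_if_correct - option.cost_per_call


def best_option(options, value_if_correct):
    """Pick the option with the highest expected net value."""
    return max(options, key=lambda o: expected_net_value(o, value_if_correct))


# All figures below are made up for illustration.
options = [
    Option("conventional rules", success_rate=0.70, cost_per_call=0.0001),
    Option("local open-weight model", success_rate=0.85, cost_per_call=0.002),
    Option("hosted frontier model", success_rate=0.95, cost_per_call=0.03),
]

for value in (0.01, 0.10, 0.50):
    choice = best_option(options, value)
    print(f"value per correct answer ${value:.2f}: {choice.name}")
```

<p class="wp-block-paragraph">With these assumed numbers, the cheapest option wins when a correct answer is worth a cent, and the most expensive one only pays for itself once the answer is worth much more.</p>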



<p class="wp-block-paragraph">Getting that right requires an expanded view of what economists call mechanism design. While this isn’t how economists talk about it, many advances in technology are really a form of mechanism design: redesigning the rules of a game to get better outcomes. Pay-per-click advertising started as a crude auction that sold to the highest bidder, and then Google refined it into something that worked. Rob McCool wired a web server to a database with CGI and ushered in a decade of invention of new mechanisms for data-driven websites. Or take Apache Kafka, which DJ reminded us began as a project to help LinkedIn rein in its Splunk bill and only later became the foundation for a company and an ecosystem.</p>



<p class="wp-block-paragraph">We&#8217;re at the front of an architectural innovation cycle now, and the biggest opportunities are probably not in the models themselves but in the layers above them. That’s also where a renaissance of open source for the AI era could happen.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Future of Software Will Be Shaped by Microeconomics with Tim O&#8217;Reilly" width="500" height="281" src="https://www.youtube.com/embed/ZLffZO_GHzs?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p class="wp-block-paragraph">DJ and I are both, as he says, &#8220;this giant human LLM, summarizing and distilling all the things we&#8217;re hearing&#8221; from a lot of people. What we&#8217;re hearing is that the technology is mostly ready, but our institutions are not. What&#8217;s lagging is the organizational and economic infrastructure that lets universities, hospitals, data teams, and the labs themselves actually deploy what&#8217;s been built.</p>



<p class="wp-block-paragraph">It’s time to get busy!</p>



<p class="wp-block-paragraph"><em>On June 10, Harper Reed, cofounder of 2389 Research, will join me to talk about why the future of software depends on creativity, serendipity, and building weird stuff. And on July 9, Trail of Bits cofounder and CEO Dan Guido will stop by to share his playbook for going AI native. You can register to attend them live <a href="https://www.oreilly.com/live/live-with-tim/" target="_blank" rel="noreferrer noopener">here</a>. You can also follow </em>Live with Tim O’Reilly<em> on <a href="https://www.youtube.com/playlist?list=PL055Epbe6d5YQ8t30jyo1D6XuSpe8uhAG" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/79YLK6OLSAJam4kcd8w3Kw" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/live-with-tim-oreilly/id1896312725" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-tidy-house/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Predict, Don&#8217;t Enumerate</title>
		<link>https://www.oreilly.com/radar/predict-dont-enumerate/</link>
				<comments>https://www.oreilly.com/radar/predict-dont-enumerate/#respond</comments>
				<pubDate>Thu, 04 Jun 2026 10:57:44 +0000</pubDate>
					<dc:creator><![CDATA[Michael Roytman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18846</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Predict-dont-enumerate.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Predict-dont-enumerate-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[An AI lab just endorsed a predictive model for defense.]]></custom:subtitle>
		
				<description><![CDATA[A third of the way into a security-operations guide that Anthropic published in April 2026, wedged between a recommendation to patch CISA&#8217;s Known Exploited Vulnerabilities list and a suggestion to automate your deployment pipeline, is a small recommendation: &#8220;Use EPSS to prioritize the rest.&#8221; For anyone who has worked on a vulnerability backlog in the [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">A third of the way into a <a href="https://claude.com/blog/preparing-your-security-program-for-ai-accelerated-offense" target="_blank" rel="noreferrer noopener">security-operations guide</a> that Anthropic published in April 2026, wedged between a recommendation to patch CISA&#8217;s Known Exploited Vulnerabilities list and a suggestion to automate your deployment pipeline, is a small recommendation: &#8220;Use EPSS to prioritize the rest.&#8221; For anyone who has worked on a vulnerability backlog in the last decade, the sentence is an acknowledgment of a widely felt but often unspoken fact about security programs: They have become machine-scale problems of signal to noise.</p>



<p class="wp-block-paragraph">EPSS (Exploit Prediction Scoring System) is a statistical model that takes a known software flaw, runs it through a set of signals about what attackers are actually doing across the internet, and returns a probability that the flaw will be exploited in the next 30 days. It isn’t an LLM, and it involves no reasoning or prompt engineering. It predicts. The company endorsing it is the same company whose newest model can surface thousands of novel, exploitable vulnerabilities in production software, many of them two or three decades old, most of them still unpatched.</p>



<p class="wp-block-paragraph">As far as we can tell, this is the first time a frontier AI lab has publicly endorsed a purpose-built predictive model as the right tool for a defensive problem. LLM labs usually recommend LLMs. That Anthropic did not is worth noting, but the recommendation itself isn’t news to the practitioners it’s aimed at. It’s a description of what they’ve been doing.</p>



<h2 class="wp-block-heading">The quiet consensus</h2>



<p class="wp-block-paragraph">The volume problem isn’t new. Anyone running a scanner against a large enterprise estate in 2015 was already generating hundreds of thousands of findings per month. Anyone running one against a cloud environment in 2020 was generating millions. Enterprises have spent the better part of a decade staring at dashboards where the number of open critical findings was larger than the capacity of the team supposed to fix them. In other words, cybersecurity has become machine scale.</p>



<p class="wp-block-paragraph">Risk-based vulnerability management, as a product category, has existed since around 2018. EPSS, as a public resource, has been usable since 2021. More than 120 vendors embed it in their products today. The field has had access to a predictive baseline for years.</p>



<p class="wp-block-paragraph">What has been missing is an external justification to change the status quo recommendations from auditors, model risk management teams, and even boards. Auditors want a clear set of expectations, which makes grading more objective and therefore easier to evaluate. Compliance frameworks favor CVSS (Common Vulnerability Scoring System) because CVSS is <em>easy</em>, and implementing something more efficient has historically required that external push. A working CISO could tell you that in 2019 she stopped treating every vulnerability CVSS scored 9.8 out of 10 as an emergency, but she would also tell you she still kept CVSS in the report.</p>



<p class="wp-block-paragraph">Anthropic&#8217;s guidance is useful because it makes the private consensus public. Patch what you know to be exploited, then work through everything whose EPSS score clears a threshold set by the team’s capacity or risk tolerance. DHS CISA’s practice of publishing known exploited vulnerabilities since November of 2021 is just additional proof that the existing methodologies were being overwhelmed by scale and lack of signal.</p>
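


<p class="wp-block-paragraph">That two-step rule can be sketched in a few lines. This is an illustration, not Anthropic’s or CISA’s code: the CVE identifiers, the scores, and the 0.10 threshold are placeholders, and a real program would pull KEV membership and EPSS scores from their published feeds.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve_id: str
    epss: float   # probability of exploitation in the next 30 days
    in_kev: bool  # listed in CISA's Known Exploited Vulnerabilities


def prioritize(findings, epss_threshold):
    """Return (patch_now, patch_next, backlog).

    patch_now:  everything known to be exploited (KEV).
    patch_next: the rest at or above the EPSS threshold, highest first.
    backlog:    everything else.
    """
    patch_now = [f for f in findings if f.in_kev]
    rest = [f for f in findings if not f.in_kev]
    patch_next = sorted(
        (f for f in rest if f.epss >= epss_threshold),
        key=lambda f: f.epss,
        reverse=True,
    )
    backlog = [f for f in rest if epss_threshold > f.epss]
    return patch_now, patch_next, backlog


# Placeholder identifiers and scores, not real CVE data.
findings = [
    Finding("CVE-0000-0001", epss=0.02, in_kev=True),
    Finding("CVE-0000-0002", epss=0.45, in_kev=False),
    Finding("CVE-0000-0003", epss=0.91, in_kev=False),
    Finding("CVE-0000-0004", epss=0.003, in_kev=False),
]

now, nxt, later = prioritize(findings, epss_threshold=0.10)
print([f.cve_id for f in now])
print([f.cve_id for f in nxt])
print([f.cve_id for f in later])
```

<p class="wp-block-paragraph">Note that the first placeholder finding is patched immediately despite its low score: known exploitation overrides the prediction, and the threshold only sorts what remains.</p>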



<h2 class="wp-block-heading">Why prediction, stated plainly</h2>



<p class="wp-block-paragraph">In 2014, at Black Hat, Dan Geer, then the chief information security officer of In-Q-Tel, asked the first-principles question: Are vulnerabilities in software sparse or dense? Sparse meant finite, meaning every fix measurably shrank the attack surface. Dense meant weeds in a field. Geer could not answer the question because the data were not in.</p>



<p class="wp-block-paragraph">Eight years later, Jonathan Spring at Carnegie Mellon&#8217;s Software Engineering Institute tied vulnerability enumeration to the halting problem and showed, in theory, that for any sufficiently complex piece of deployed software, there are always more undiscovered flaws.</p>



<p class="wp-block-paragraph">The AI-driven discovery results of the last 18 months have made the density argument impossible to wave off even in a compliance review. A 27-year-old bug in OpenBSD. A 16-year-old bug in FFmpeg that five million fuzzing runs never caught. Disclosed findings, by the developers&#8217; own accounting, are less than 1% of what has been found. But again, the volume was already a problem. With the coming release of its newest model, Mythos, Anthropic is telling teams to plan for an order of magnitude more findings over the next 24 months.</p>



<p class="wp-block-paragraph">Static severity scoring can’t survive the volume problem, because it’s a human-scale solution for a machine-scale problem. Neither can any process that treats every critical finding as an emergency. The threshold for action has to be probabilistic, measurable, and defensible. That’s what a predictive model is for, and that’s what working teams have been using in noisy, large enterprise environments.</p>



<h2 class="wp-block-heading">Pointing machines and knowing machines</h2>



<p class="wp-block-paragraph">Geer returned to his 2014 question in the summer of 2025, <a href="https://www.lawfaremedia.org/article/ai-and-secure-code-generation" target="_blank" rel="noreferrer noopener">writing with Dave Aitel in <em>Lawfare</em></a>. The piece gives the industry a vocabulary for a distinction it has been fudging:</p>



<p class="wp-block-paragraph">A vulnerability in the code isn’t automatically a threat. A buffer overflow is a hazard. It becomes a risk only if an attacker can exploit it reliably, in this environment, against these controls, through this traffic. Bugs are abundant, but the ability to weaponize a particular bug against a particular target is much rarer.</p>



<p class="wp-block-paragraph">The industry, they wrote, has built a pointing machine. It enumerates.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Even children learn early to point and name—but knowing the word “dog” doesn’t reveal whether the animal might bite. In cybersecurity, we’ve built systems that similarly point and name vulnerabilities without understanding whether they’re truly dangerous. By embracing AI solely for pattern recognition, we’ve created a powerful “pointing machine” that identifies possible threats but does not comprehend their actual impact. What we need instead is a “knowing machine,” capable of understanding how code functions within complex, real-world environments, recognizing not just hazards but the full context of how and whether those hazards might become genuine risks.</em></p>
</blockquote>



<p class="wp-block-paragraph">A knowing machine is a system that understands how code behaves in a particular environment and recognizes the context that turns a hazard into a risk. A predictive model is how you build a knowing machine. EPSS is the clearest public example: It covers every published CVE and is updated daily.</p>



<h2 class="wp-block-heading">Global isn’t local</h2>



<p class="wp-block-paragraph">EPSS is a global model. It sees what attackers are doing across the whole of the internet. It picks up patterns in exploitation activity that severity scores never could. What it can’t see is any particular organization&#8217;s environment. It doesn’t know which assets carry the data the business actually cares about. It doesn’t know what compensating controls are in place, where remediation is risky, or how your telemetry and history change the odds.</p>



<p class="wp-block-paragraph">A 9.8 with a 97% global probability of exploitation and a 9.8 with a 0.1% probability are not the same animal. Neither are two organizations applying the same EPSS threshold to the same CVE on different assets. One has the vulnerable code path exposed to the internet, behind a web application firewall that doesn’t inspect the relevant protocol. The other has the same CVE on an internal system that accepts authenticated input from a single service account. A scanner can’t tell them apart. A global model can’t tell them apart. Their actual risk profiles are orders of magnitude apart.</p>



<p class="wp-block-paragraph">Local context is where most security teams have been stuck the entire time, and where the next decade of the field is going to be fought.</p>



<h2 class="wp-block-heading">What a local knowing machine actually requires</h2>



<p class="wp-block-paragraph">Pair a better pointing machine with a faster remediation engine and all you’ve done is increase the speed at which you produce churn, breakage and wasted effort. You’ll also spend a king&#8217;s ransom in agent tokens fixing vulnerabilities that were never dangerous in your environment.</p>



<p class="wp-block-paragraph">In contrast to an omniscient scanner, a local model trains on the specific environment being defended: asset inventory, application topology, reachability, deployed controls, attack telemetry observed on-site, and the history of the organization&#8217;s own remediations and their outcomes. The model produces probabilities specific to the enterprise. Most organizations already have the inputs, scattered across CMDBs, endpoint agents, firewall logs, ticketing systems, and scanner output. This context is precisely what attackers (whether they’re using good old-fashioned Metasploit or Mythos with an infinite budget) are lacking in their models. The context becomes an asymmetrical advantage for defenders, perhaps the only one that exists.</p>



<h2 class="wp-block-heading">The policy shifts that actually matter</h2>



<p class="wp-block-paragraph">The interventions that will decide whether a security program survives the next 24 months aren’t purely technical. A CISO can put most of them in motion without buying anything.</p>



<p class="wp-block-paragraph">Rewrite the SLA. Most vulnerability-management SLAs are organized by severity. Criticals in 15 days, highs in 30, mediums in 90. That structure was built for a world where the count of open criticals was small enough to matter. It’s now actively harmful, because it forces teams to spend the same effort on a 9.8 nobody is exploiting and a 7.5 that’s under active attack. SLAs should be rewritten in terms of probability of exploitation and asset exposure, not severity. A CISO who can’t get that past her GRC team can at least add a second tier that makes the probability-based cut enforceable alongside the severity-based one.</p>
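<p class="wp-block-paragraph">To make the idea concrete, here is one hypothetical way an SLA keyed to exploitation probability and asset exposure might be expressed. Every number in it (the probability cuts and the day counts) is a placeholder chosen for illustration, not a recommendation, and <code>sla_days</code> is an invented name.</p>

```python
def sla_days(epss, internet_exposed, known_exploited=False):
    """Return a remediation deadline in days from probability and exposure.

    All tiers below are illustrative placeholders, not recommended values.
    """
    if known_exploited:
        return 7
    if epss >= 0.5:
        return 15 if internet_exposed else 30
    if epss >= 0.1:
        return 30 if internet_exposed else 90
    return 180


# A "critical" that nobody is exploiting, sitting on an internal host:
quiet_critical = sla_days(epss=0.001, internet_exposed=False)
# A lower-severity finding under active attack on an exposed host:
active_high = sla_days(epss=0.6, internet_exposed=True, known_exploited=True)
print(quiet_critical, active_high)
```

<p class="wp-block-paragraph">Severity never appears as an input. A second, severity-based tier could run alongside this function for a GRC team that still requires it.</p>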



<p class="wp-block-paragraph">Change what the board sees. If the monthly security report counts vulnerabilities, exposures, or findings in different buckets (“critical,” “open past 30 days,” etc.), the organization is being managed to the wrong metric. The metric should be exploitability-weighted exposure over time, with a second line for predicted versus observed exploitation. Boards will accept this once somebody explains it. This beats showing them a number that has no relationship to risk and is growing exponentially as new LLMs are released. More to the point: A great team can do amazing <em>volumes</em> of remediation work, and risk can still rise because they’re measuring and remediating the wrong thing. An efficient, context-rich team can do far less work and meaningfully lower the probability of an event.</p>
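<p class="wp-block-paragraph">The gap between counting findings and weighting them can be shown with a toy calculation. The definition used here (exploitation probability multiplied by an asset weight, summed over open findings) is an assumption made for the sketch, as are all of the numbers. A real program would choose its own weighting.</p>

```python
def weighted_exposure(open_findings):
    """Sum of (exploitation probability * asset weight) over open findings."""
    return sum(epss * asset_weight for epss, asset_weight in open_findings)


# Month 1: 200 low-probability findings plus one dangerous one on a key asset.
month_1 = [(0.01, 1.0)] * 200 + [(0.90, 5.0)]

# Month 2: the team closed 150 low-probability findings, left the dangerous
# one open, and a second likely-to-be-exploited finding appeared.
month_2 = [(0.01, 1.0)] * 50 + [(0.90, 5.0), (0.70, 5.0)]

for label, month in (("month 1", month_1), ("month 2", month_2)):
    print(label, "open findings:", len(month),
          "weighted exposure:", round(weighted_exposure(month), 2))
```

<p class="wp-block-paragraph">The open-finding count falls by roughly three quarters while the weighted exposure goes up, which is the situation a count-based board report hides.</p>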



<p class="wp-block-paragraph">Invest in telemetry. The single most valuable instrument a security program can build is a feedback loop between what was prioritized and what was exploited. If the loop shows you were wrong, the model improves. If the loop does not exist, you will keep being wrong indefinitely, or simply never learn about your misses.</p>
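<p class="wp-block-paragraph">At its simplest, such a loop is a comparison of two sets. The sketch below assumes you can produce a list of what was prioritized in a period and a list of what was observed being exploited. The function name, the metric names, and the sample identifiers are all hypothetical.</p>

```python
def feedback(prioritized, exploited):
    """Compare what was prioritized with what was actually exploited."""
    prioritized, exploited = set(prioritized), set(exploited)
    hits = prioritized & exploited
    return {
        # Share of prioritized work that turned out to matter.
        "efficiency": len(hits) / len(prioritized) if prioritized else 0.0,
        # Share of real exploitation that the prioritization anticipated.
        "coverage": len(hits) / len(exploited) if exploited else 1.0,
        # Exploited findings the model never flagged: the misses to learn from.
        "misses": sorted(exploited - prioritized),
    }


report = feedback(
    prioritized=["vuln-a", "vuln-b", "vuln-c", "vuln-d"],
    exploited=["vuln-b", "vuln-e"],
)
print(report)
```

<p class="wp-block-paragraph">The <code>misses</code> list is the part that improves the model: each entry is a case where the threshold or the local context was wrong.</p>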



<p class="wp-block-paragraph">Fix the compliance conversation. The reason CVSS survives is regulatory inertia. PCI, HIPAA, and most state breach-notification frameworks still reference severity. The CISOs who will come out of the next two years in the best shape are the ones who engage their auditors now, in writing, about what a probabilistic prioritization framework looks like under the existing rules.</p>



<p class="wp-block-paragraph">Staff for the bottleneck, which isn’t scanning. The industry has spent a decade hiring people to find bugs. The bottleneck now is deciding which bugs matter, getting the fixes deployed, and measuring whether the prioritization was correct. The job descriptions should reflect this. A security-data engineer who makes the team more efficient may do more to meet SLAs than added capacity would.</p>



<p class="wp-block-paragraph">None of this requires a new product. All of it requires a CISO willing to say, out loud, that the old dogma is broken and that the new one will be managed by data and probabilities. That is the shift Anthropic&#8217;s five-word sentence was really announcing. The technology is available and the models are here—both the LLM-based ones to find the vulnerabilities and the predictive knowing machines to prioritize efficiently.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/predict-dont-enumerate/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Context as Code</title>
		<link>https://www.oreilly.com/radar/context-as-code/</link>
				<comments>https://www.oreilly.com/radar/context-as-code/#respond</comments>
				<pubDate>Wed, 03 Jun 2026 11:00:14 +0000</pubDate>
					<dc:creator><![CDATA[Artur Huk]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18837</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Context-as-code.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Context-as-code-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Build-time governance in the era of infinite syntax]]></custom:subtitle>
		
				<description><![CDATA[As syntax becomes cheap and abundant, architectural control becomes the scarce resource. Effective governance starts upstream, where intent, constraints, and threat models shape the agent’s working context before generation begins. The goal isn’t better prompting but build-time boundaries that prevent structurally invalid code from entering the system. The Frankenstein factories The dark factories (as Dan [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">As syntax becomes cheap and abundant, architectural control becomes the scarce resource. Effective governance starts upstream, where intent, constraints, and threat models shape the agent’s working context before generation begins. The goal isn’t better prompting but build-time boundaries that prevent structurally invalid code from entering the system.</p>



<h2 class="wp-block-heading">The Frankenstein factories</h2>



<p class="wp-block-paragraph">The <a href="https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/" target="_blank" rel="noreferrer noopener">dark factories</a> (as Dan Shapiro calls them) are running. Tokens fly through trycycles, features ship overnight, and codebases are ported before breakfast. The velocity is real. And <a href="https://www.oreilly.com/radar/comprehension-debt-the-hidden-cost-of-ai-generated-code/" target="_blank" rel="noreferrer noopener">comprehension debt</a> (a term coined by Addy Osmani) is compounding in silence behind it.</p>



<p class="wp-block-paragraph">What this era is producing, at scale, deserves its own name: Frankenstein factories. Not a critique of any single approach but a description of a structural condition—generation engines so effective at producing working syntax that they have industrialized the creation of architecturally ungovernable systems. The creature walks out of the laboratory impressive, functional, and alive on delivery day.</p>



<p class="wp-block-paragraph">The crisis arrives the day someone must govern it. To govern a system means to hold it accountable to its design boundaries—the ability to look at it and reliably say <em>why</em> it works, <em>what</em> is permitted to touch what, and to categorically prevent forbidden state changes before they happen. Victor&#8217;s catastrophe was not the act of creation but the absent governing frame.</p>



<p class="wp-block-paragraph">For prototyping or shipping features fast, unconstrained generation is a powerful tool. It optimizes for velocity, and it delivers. But for enterprise payment systems, insurance underwriting engines, logistics orchestrators, and regulated platforms, the question is not &#8220;Does the code ship?&#8221; but &#8220;Who is liable when it does the wrong thing?&#8221; Here, automating the word &#8220;YES&#8221; to every feature request does not solve the problem. It industrializes it.</p>



<p class="wp-block-paragraph">Consider a standard Jira ticket: &#8220;Add an email notification after a successful payment.&#8221;</p>



<p class="wp-block-paragraph">A junior developer might attempt to wedge the email-sending logic directly into the <code>PaymentProcessor</code> class. A senior architect catches this in code review: &#8220;No. Fire a <code>PaymentSuccessEvent</code> to the message bus.&#8221; That human friction—the architectural &#8220;No&#8221;—keeps the system maintainable.</p>



<p class="wp-block-paragraph">Unconstrained AI agents lack this assertiveness. By default, they are the ultimate yes-men.</p>



<p class="wp-block-paragraph">Hand that same ticket to a standard coding agent and it will not argue about bounded contexts. It will burn tokens until it produces 300 lines of syntactically perfect code, import an SMTP library directly into the core of your billing domain, and submit a pull request. The tests will pass; conventional feature tests make no assertion about bounded contexts. The CI pipeline will go green. And structurally, the system is now a disaster.</p>
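<p class="wp-block-paragraph">For contrast, the design the senior architect was asking for might look like the following sketch. The <code>PaymentProcessor</code> and <code>PaymentSuccessEvent</code> names come from the ticket discussion above. The in-memory <code>MessageBus</code>, the method names, and the email handler are stand-ins invented for illustration, not a real messaging library.</p>

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentSuccessEvent:
    order_id: str
    customer_email: str


class MessageBus:
    """Minimal in-memory stand-in for a real message bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        for handler in self._handlers[type(event)]:
            handler(event)


class PaymentProcessor:
    """Billing domain: knows about the bus, never about email or SMTP."""

    def __init__(self, bus):
        self._bus = bus

    def process(self, order_id, customer_email):
        # Charging the customer is omitted; only the boundary matters here.
        self._bus.publish(PaymentSuccessEvent(order_id, customer_email))


# Notification code lives outside the billing domain and merely subscribes.
sent = []


def send_receipt_email(event):
    sent.append(f"receipt for {event.order_id} -> {event.customer_email}")


bus = MessageBus()
bus.subscribe(PaymentSuccessEvent, send_receipt_email)
PaymentProcessor(bus).process("order-42", "customer@example.com")
print(sent)
```

<p class="wp-block-paragraph">A feature test would pass for either design. Only the second keeps the SMTP dependency out of the billing module, and nothing in a conventional test suite checks for that.</p>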



<p class="wp-block-paragraph">This happens not through malice but because of how agentic loops are built. Without explicit architectural constraints, the system&#8217;s emergent behavior is to fulfill immediate user intent. The agent is orchestrated to ship the feature, not to defend the architecture. Comprehension debt is the structural consequence: AI generates syntax faster than human beings can read or govern it. Expecting a probabilistic model to enforce structural integrity on its own is a category error. Without a governing frame, the agent will always take the path of least resistance to a &#8220;YES.&#8221;</p>



<p class="wp-block-paragraph">You cannot fix code overproduction by hiring more people to read it or by running the generation loop faster. The only scalable answer is to build a concrete riverbed <em>before</em> you turn on the water.</p>



<p class="wp-block-paragraph">If the current era automates the word &#8220;YES,&#8221; we should automate the word &#8220;NO.&#8221;</p>



<p class="wp-block-paragraph">Securing the runtime environment prevents the monster from escaping. But to prevent it from being built in the first place, we need to step back into the IDE and the CI/CD pipeline. We need to govern <em>generation</em>.</p>



<h2 class="wp-block-heading">The great softening: Shifting risk from build time to runtime</h2>



<p class="wp-block-paragraph">Compilers never guaranteed correct software. You could write catastrophic, logically broken systems in C, Java, or any other compiled language. But compilers served a crucial engineering purpose: They deterministically governed a specific layer of structural risk.</p>



<p class="wp-block-paragraph">By enforcing hard execution constraints—syntax validity, type compatibility, linkage rules, and executable viability—the compiler acted as an automated boundary. It didn’t verify business intent, domain correctness, or architectural quality. What it did was eliminate an entire class of low-level structural failure <em>before</em> execution ever began.</p>



<p class="wp-block-paragraph">That delegation of risk is one of the quiet triumphs of software engineering. Our discipline has always advanced by mechanizing one class of guarantees so humans can focus on the next layer of abstraction. We automated machine-level structural correctness so engineers could spend their cognitive energy on application logic. Later, we pushed more guarantees upward, into schemas, testing, static analysis, architectural patterns, and operational controls.</p>



<p class="wp-block-paragraph">Over time, we also deliberately softened certain boundaries in exchange for speed. Dynamic languages, richer runtimes, reflection, and increasingly abstract frameworks all traded deterministic compile-time guarantees for developer velocity and flexibility. The newly exposed risk was absorbed elsewhere: runtime validation, automated testing, observability, and engineering discipline.</p>



<p class="wp-block-paragraph">Today, with agentic AI, we are softening boundaries again, more radically than ever before.</p>



<p class="wp-block-paragraph">Natural language has become a high-level control plane for software generation. Arbitrary text increasingly shapes executable behavior. And in that shift, we have blurred one of the oldest boundaries in computing: the separation between <em>data</em> and <em>instructions</em>.</p>



<p class="wp-block-paragraph">Outside the model, that boundary still exists. Systems enforce permission scopes, schema contracts, sandboxing, and execution policies. But inside the inference context, those protections collapse into the same token stream.</p>



<p class="wp-block-paragraph">System prompts, retrieved documents, user messages, tool outputs, and external content all flow through the same neural weights. There is no hard privilege boundary between instruction and input. Modern models may resist naive attacks like &#8220;Ignore previous instructions,&#8221; but they remain vulnerable to indirect injections disguised as legitimate operational context. A malicious instruction embedded in a customer email, a webpage, or a tool response is not processed as passive data. It can become behavioral influence.</p>



<p class="wp-block-paragraph">Inside the context window, untrusted text can shape control flow. That is the real softening.</p>



<p class="wp-block-paragraph">We are generating syntax at machine speed, but we have dissolved the structural gate that once constrained how systems were built. The result is a massive shift of risk from build time to runtime. Code that appears structurally sound during generation may violate architectural boundaries, introduce unsafe execution paths, or become behaviorally compromised the moment hostile context enters the loop.</p>



<p class="wp-block-paragraph">The conclusion is straightforward: The fact that AI-generated code runs is no longer a meaningful proxy for system correctness.</p>



<p class="wp-block-paragraph">Syntax is abundant. Execution is easy. Structural governance is what is missing.</p>



<p class="wp-block-paragraph">We outsourced the writing of logic to machines, but we did not build a deterministic boundary that governs what those machines are allowed to generate.</p>



<p class="wp-block-paragraph">If we want control back, we cannot rely on human code review at machine speed. We must rebuild the build-time gate.</p>



<h2 class="wp-block-heading">From dependency bloat to tailor-made architecture</h2>



<p class="wp-block-paragraph">For decades, the industry&#8217;s default response to complexity was abstraction by accumulation: monolithic frameworks, sprawling dependency trees, and ever-thicker layers of indirection. Importing a 50-megabyte library to avoid repetitive boilerplate was a rational trade-off when developer time and cognitive bandwidth were the scarce resources. For AI agents, that trade-off changes.</p>



<p class="wp-block-paragraph">This is not an argument against foundational infrastructure. Mature primitives—like SQLAlchemy in Python or Spring Boot in Java—remain essential precisely because their conventions are widely learned and predictable. The problem isn’t abstraction but opacity. When core business logic disappears behind proprietary decorators, internal frameworks, or custom orchestration layers, execution becomes a black box. An agent cannot safely reason about code it cannot trace. It needs direct visibility into causality: what changes state, what enforces invariants, and where responsibilities begin and end. Hidden flow degrades reasoning into guesswork; guesswork silently becomes architectural drift.</p>



<p class="wp-block-paragraph">At the same time, AI drives the cost of procedural code toward zero. Boilerplate is no longer expensive. Clarity is. The design question shifts from &#8220;How much can we abstract away?&#8221; to &#8220;How much must remain explicit for safe reasoning?&#8221; The answer is tailor-made architecture: thin infrastructure, explicit domain logic, hard boundaries, and narrowly scoped components with visible contracts. The value is no longer in how much code you avoid writing but in how clearly the system declares its boundaries.</p>



<p class="wp-block-paragraph">That same opacity also breaks verification. AI review can catch local defects, risky patterns, and implementation mistakes, but it remains blind to architectural drift and missing business intent unless those constraints are explicitly encoded. After all, if you ask a model to review code generated from the exact same vague Jira ticket, do you actually get verification, or do you just engineer a circular hallucination, where the AI politely revalidates its own blind spots?</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1536" height="1024" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image.png" alt="Tailor-made architecture gives generated syntax a clear structure without dissolving system boundaries." class="wp-image-18838" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-300x200.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-768x512.png 768w" sizes="auto, (max-width: 1536px) 100vw, 1536px" /><figcaption class="wp-element-caption"><em>Figure 1. Tailor-made architecture gives generated syntax a clear structure without dissolving system boundaries.</em></figcaption></figure>



<h2 class="wp-block-heading">The Context Compilation Pattern</h2>



<p class="wp-block-paragraph">The Context Compilation Pattern governs <em>generation</em> in the IDE and the CI/CD pipeline before a single syntactically plausible line ever reaches a human reviewer. If the Decision Intelligence Runtime (DIR) is the vault door that protects execution in production, context compilation is the blueprint that prevents the monster from being built in the lab.</p>



<p class="wp-block-paragraph">This is not &#8220;prompt engineering,&#8221; which merely asks a probabilistic model for a better answer. What we need is build-time governance: two layers of defense, both defined before LLM inference is even triggered. The first is structured context injection (assembling the prompt from prioritized artifacts). The second is postgeneration static verification (deterministic AST checks that enforce rules no probabilistic model can override). The prompt structure biases generation toward compliant solutions; the static checks make declared, machine-verifiable boundary violations impossible to merge.</p>
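<p class="wp-block-paragraph">As a minimal sketch of what a deterministic AST check can look like, the following uses Python&#8217;s standard <code>ast</code> module to flag imports from a deny list in generated source. The deny list, the function name, and the sample source are illustrative assumptions. A production pipeline would more likely use a dedicated static analysis tool than a hand-rolled script.</p>

```python
import ast

# Hypothetical deny list for one module; a real one would come from a
# reviewed, versioned rule file rather than being hardcoded.
FORBIDDEN = {"requests", "smtplib"}


def forbidden_imports(source, forbidden=FORBIDDEN):
    """Return the sorted forbidden top-level modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as "from . import x" have no module name.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.add(root)
    return sorted(found)


generated = (
    "import smtplib\n"
    "from requests.adapters import HTTPAdapter\n"
    "import json\n"
)
violations = forbidden_imports(generated)
print(violations)
```

<p class="wp-block-paragraph">The check either finds the import node or it does not. No phrasing in the prompt or the generated code can argue it into a different answer, which is the property the second layer depends on.</p>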



<p class="wp-block-paragraph">Deterministic build-time governance is not a return to formal software specification (like UML), nor is it merely &#8220;prompt engineering disguised as Markdown.&#8221; It’s a mechanical constraint on the generation space that makes explicitly declared boundary violations rejectable by design. Context compilation does not eliminate architectural review or replace engineering judgment. Instead, it ensures that the agent operates within a defined riverbed of allowed structural invariants.</p>



<p class="wp-block-paragraph">Engineering evolves whenever implicit rules become explicit declarations. Application development is now crossing that boundary. The senior engineer&#8217;s new job is <em>declarative boundary engineering</em>: explicitly declaring what the system is absolutely forbidden from doing.</p>



<p class="wp-block-paragraph">The failure is not in the frameworks. The failure is in the process: pointing an unconstrained AI agent at a codebase full of invisible magic and expecting a CI/CD pipeline designed for human-generated code to catch what goes wrong. The answer is to build a compiler for the agent&#8217;s context.</p>



<p class="wp-block-paragraph">The Context Compilation Pattern is the staged pipeline that makes this concrete.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1056" height="1600" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-1056x1600.png" alt="The Context Compilation Pattern pipeline, enforcing build-time constraints through deterministic artifact assembly and dual verification." class="wp-image-18839" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-1056x1600.png 1056w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-198x300.png 198w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-768x1164.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-1013x1536.png 1013w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1.png 1274w" sizes="auto, (max-width: 1056px) 100vw, 1056px" /><figcaption class="wp-element-caption"><em>Figure 2. The Context Compilation Pattern pipeline, enforcing build-time constraints through deterministic artifact assembly and dual verification.</em></figcaption></figure>



<h3 class="wp-block-heading">Step 1: The context artifacts</h3>



<p class="wp-block-paragraph">The most strategically valuable code in your repository may no longer live in <code>src/</code>. It lives in <code>/context</code>. The pipeline consumes versioned artifacts such as <code>intent.md</code>, <code>boundaries.md</code>, and <code>threat-model.md</code>, each authored by a specialist before a single line of code is generated. (Ownership and role responsibilities are covered in “Artifact-Bound Roles and Accountability” below.) What matters here is that these files are the <em>inputs</em> to the compiler: Without them, there’s nothing to compile.</p>



<p class="wp-block-paragraph">To prevent cognitive overlap, their roles must be fiercely separated: <code>boundaries.md</code> declares <em>structural invariants</em> (e.g., dependency direction, allowed communication paths, and event emission), whereas <code>threat-model.md</code> models <em>adversarial constraints</em> as declarative abuse scenarios (e.g., prompt injection and secrets exfiltration) that must be mechanically blocked.</p>



<p class="wp-block-paragraph"><code>boundaries.md</code> warrants a precise definition, because it anchors the entire build-time governance model. In practice, boundaries are typically defined at module or bounded-context granularity (e.g., <code>/billing/*</code> or <code>/risk/*</code>), not per class or per repository. They are implemented using <strong>hybrid artifacts</strong>: a natural language document designed to constrain the LLM, tightly paired with a deterministic rule for the CI runner.</p>



<p class="wp-block-paragraph">Consider this concrete example of how an architectural boundary is explicitly declared and enforced:</p>



<p class="wp-block-paragraph"><strong>1. <code>boundaries.md</code> (for the LLM context)</strong><br>This Markdown file is injected into the agent’s prompt. It defines the vocabulary, architectural constraints, and allowed interactions.</p>



<pre class="wp-block-code"><code>Module: Billing
Ontology: Order, Invoice, PaymentEvent
Rule: Zero external network I/O is allowed in this domain. You must NEVER import requests or smtplib.</code></pre>



<p class="wp-block-paragraph"><strong>2. <code>semgrep-rule.yml</code> (for the CI/CD runner)</strong><br>This static file goes to the CI pipeline to mechanize the boundary. It ensures the code check is fully deterministic.</p>



<pre class="wp-block-code"><code>rules:
  # Block forbidden imports at the module boundary
  - id: block-external-io-in-billing
    patterns:
      - pattern-either:
          - pattern: import smtplib
          - pattern: from smtplib import ...
          - pattern: import requests
          - pattern: from requests import ...
    message: "Architecture Violation: External I/O is strictly forbidden in the billing domain."
    severity: ERROR
    languages: &#91;python]
    paths:
      include: &#91;"src/billing/**"]

  # Domain layer must not talk to DB driver directly
  - id: block-db-driver-in-domain
    patterns:
      - pattern-either:
          - pattern: import sqlalchemy
          - pattern: from sqlalchemy import ...
          - pattern: import psycopg2
          - pattern: from psycopg2 import ...
    message: "Architecture Violation: Domain layer must use Repository abstraction, not database drivers directly."
    severity: ERROR
    languages: &#91;python]
    paths:
      include:
        - "src/billing/domain/**"</code></pre>



<p class="wp-block-paragraph">Crucially, these Semgrep/CI rules are human-authored (or human-reviewed) precommit artifacts. We don’t rely on an LLM to generate the security gates on the fly. The AI reads the Markdown to guide its generation; the CI runner executes the static YAML to enforce the boundary.</p>



<p class="wp-block-paragraph">If these artifacts stay current, they actively govern the generated codebase. Stale or malformed context becomes context debt: The pipeline will enforce strictly whatever was declared, even if the declaration is wrong. Governance artifacts are production code. They require strict versioning, explicit ownership, and periodic review just like the executable logic they constrain. That’s why core artifacts like <code>boundaries.md</code> require rigorous peer review, not just casual updates.</p>



<h3 class="wp-block-heading">Step 2: The context compiler</h3>



<p class="wp-block-paragraph">Dumping all Markdown files into the system prompt is sometimes acceptable for small projects and small artifacts. But as the codebase grows or the context window fills with too many competing constraints, models begin to suffer from &#8220;lost in the middle&#8221; degradation and silently ignore what matters most.</p>



<p class="wp-block-paragraph">The term “context compiler&#8221; might sound like a magical enterprise heavy-lift, but the reality is entirely mundane. In its simplest form, it’s just a deterministic context assembly layer combined with a routing mechanism.</p>



<p class="wp-block-paragraph">Instead of treating context as a flat pile of documents, the compiler assembles it into an ordered structure. Different artifacts apply to different parts of the project: The <code>boundaries.md</code> in the <code>/billing</code> module might enforce strict isolation, while the one in <code>/frontend</code> might be much more permissive.</p>



<p class="wp-block-paragraph">In practice, the compiler may take one of these forms:</p>



<p class="wp-block-paragraph"><strong>Manual selection:</strong> The developer simply points their IDE or agent to a structured set of Markdown files.</p>



<p class="wp-block-paragraph"><strong>A mundane script:</strong> A basic Python or bash script that understands a directory structure. It concatenates the <code>.md</code> files to build the LLM&#8217;s system prompt and hands the <code>.yml</code> files directly to the CI runner.</p>



<p class="wp-block-paragraph"><strong>Tool-mediated context protocols:</strong> Dedicated mechanisms (e.g., MCP) that allow the agent to query the workspace and dynamically assemble the required boundaries directly within the IDE, bypassing the need for manual script invocation.</p>



<p class="wp-block-paragraph">Consider a practical directory structure:</p>



<pre class="wp-block-code"><code>/context
  /global
    coding-standards.md
  /domain
    /billing
      boundaries.md
      threat-model.md
      semgrep-rule.yml
    /risk
      boundaries.md
      threat-model.md
      semgrep-rule.yml
    /frontend
      boundaries.md
      threat-model.md
      semgrep-rule.yml</code></pre>



<p class="wp-block-paragraph">When generating code for the billing module, the script reads <code>/context/global</code> and <code>/context/domain/billing</code>. The compiler simply scopes the rules by directory, focusing the agent&#8217;s attention on the boundaries that matter while wiring the corresponding YAML rules for deterministic CI verification.</p>
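<p class="wp-block-paragraph">The &#8220;mundane script&#8221; variant can be sketched in a few lines of Python. This is a minimal illustration, assuming the directory layout shown above; the function name <code>compile_context</code> and the section header format are hypothetical, not part of any reference implementation.</p>

```python
from pathlib import Path


def compile_context(context_root: Path, module: str) -> tuple[str, list[Path]]:
    """Assemble the prompt text and the CI rule files for one module.

    Assumes the layout context_root/global and context_root/domain/NAME.
    Returns (system_prompt, rule_files).
    """
    scopes = [context_root / "global", context_root / "domain" / module]
    sections = []
    rule_files = []
    for scope in scopes:
        if not scope.is_dir():
            raise FileNotFoundError(f"missing context scope: {scope}")
        # Sorted traversal keeps the assembled prompt deterministic.
        for path in sorted(scope.iterdir()):
            if path.suffix == ".md":
                body = path.read_text(encoding="utf-8").strip()
                sections.append(f"## {scope.name}/{path.name}\n\n{body}")
            elif path.suffix in {".yml", ".yaml"}:
                # Rule files never enter the prompt; they go to the CI runner.
                rule_files.append(path)
    return "\n\n".join(sections), rule_files
```

<p class="wp-block-paragraph">Calling <code>compile_context(Path("context"), "billing")</code> would return the concatenated Markdown for the global and billing scopes plus the path to the billing rule file, leaving the <code>/risk</code> and <code>/frontend</code> artifacts out of the prompt entirely.</p>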



<h3 class="wp-block-heading">Step 3: Strict boundary hierarchy (resolving conflicts)</h3>



<p class="wp-block-paragraph">When faced with conflicting instructions, LLMs don’t throw a compilation error. They hallucinate a dangerous compromise. The compiler prevents this by enforcing a deterministic precedence of declared constraints before the prompt is assembled:</p>



<p class="wp-block-paragraph"><strong>Threat model &gt; Boundaries &gt; Coding standards &gt; Intent + acceptance criteria</strong></p>



<p class="wp-block-paragraph">Security and architectural boundaries unconditionally overrule feature delivery. This operates at two levels. At the prompt level (soft enforcement), constraint ordering biases generation toward compliant solutions. At the postgeneration level (hard enforcement), deterministic code checks parse the generated syntax, verify structural invariants, and instantly fail the build on violation.</p>
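<p class="wp-block-paragraph">The soft-enforcement half of that hierarchy amounts to a fixed sort order applied before the prompt is assembled. The sketch below is one possible encoding, assuming the artifact file names used in this article; the <code>PRECEDENCE</code> table and the function name are illustrative only.</p>

```python
# Precedence from the declared hierarchy: a lower rank overrules a higher one.
PRECEDENCE = {
    "threat-model.md": 0,
    "boundaries.md": 1,
    "coding-standards.md": 2,
    "intent.md": 3,
    "acceptance-criteria.md": 3,
}


def order_by_precedence(artifact_names: list[str]) -> list[str]:
    """Return artifact names sorted from highest to lowest precedence.

    Unknown file names are rejected rather than silently placed last,
    so a typo cannot demote a constraint.
    """
    unknown = [name for name in artifact_names if name not in PRECEDENCE]
    if unknown:
        raise ValueError(f"no declared precedence for: {unknown}")
    # Artifacts with the same rank are ordered by name so the result is deterministic.
    return sorted(artifact_names, key=lambda name: (PRECEDENCE[name], name))
```

<p class="wp-block-paragraph">Ordering alone only biases generation. The hard guarantee still comes from the deterministic checks that run after the code is generated.</p>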



<p class="wp-block-paragraph">&#8220;Resolution&#8221; in this context does not mean an LLM philosophically negotiating between two Markdown files. It means <em>deterministic rejection via CI</em>. If the <code>intent.md</code> asks to &#8220;email a receipt to the user,&#8221; but <code>boundaries.md</code> forbids external network calls in the billing module, an unconstrained AI might try to generate an SMTP call. The conflict is mechanically &#8220;resolved&#8221; when the CI pipeline runs a static rule (derived from <code>semgrep-rule.yml</code>) and instantly fails the build. The developer (context orchestrator) must then intervene and change the design to use an event bus instead. The hierarchy is enforced by deterministic code analysis, not LLM reasoning. A rejected build is not necessarily a rejected business need; it’s a signal that declared boundaries and intended capability must be reconciled explicitly before regeneration. (This mechanical rejection physically executes during the adversarial verification phase in step 5).</p>



<p class="wp-block-paragraph">We do not use AI for this validation. We use existing, proven AST tools and code linters like <a href="https://semgrep.dev/" target="_blank" rel="noreferrer noopener">Semgrep</a>, <a href="https://bandit.readthedocs.io/" target="_blank" rel="noreferrer noopener">Bandit</a>, or <a href="https://codeql.github.com/" target="_blank" rel="noreferrer noopener">CodeQL</a> to enforce these boundaries in CI/CD.</p>



<p class="wp-block-paragraph">However, we must be precise about what this governance actually achieves. Deterministic checks enforce invariants, not the architecture as a whole. You can statically enforce forbidden imports, forbidden outbound I/O, strict layering, and schema conformance. You cannot statically enforce domain semantics, aggregate ownership correctness, subtle coupling, or conceptual cohesion. Deterministic verification doesn’t prove architectural correctness. It proves compliance with explicitly declared structural invariants.</p>
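<p class="wp-block-paragraph">To make &#8220;forbidden imports&#8221; concrete, here is a minimal sketch of that kind of invariant check using Python&#8217;s standard <code>ast</code> module. It is a stand-in for illustration, not a replacement for Semgrep or CodeQL, and the deny list is a hypothetical example that a real project would derive from its own <code>boundaries.md</code>.</p>

```python
import ast

# Hypothetical deny list for the billing domain layer.
FORBIDDEN_MODULES = frozenset({"smtplib", "sqlalchemy", "psycopg2"})


def find_forbidden_imports(
    source: str, forbidden: frozenset[str] = FORBIDDEN_MODULES
) -> list[tuple[int, str]]:
    """Return (line_number, module) for every import of a forbidden module."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Compare the top-level package so "sqlalchemy.orm" is caught too.
            if module.split(".")[0] in forbidden:
                violations.append((node.lineno, module))
    return sorted(violations)
```

<p class="wp-block-paragraph">A CI step could run this over every file under <code>src/billing/domain/</code> and exit nonzero when the list is not empty. Note what it cannot see: whether the repository abstraction the code does import is the semantically right one.</p>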



<h3 class="wp-block-heading">Step 4: Generation</h3>



<p class="wp-block-paragraph">Context as code matters only if generated syntax is verified against the same boundaries that shaped it. With a compiled, conflict-free context hierarchy, the developer agent generates code inside an isolated user space sandbox. In this fleeting fraction of a second, the agent inside the developer&#8217;s IDE consumes the narrowed, precompiled system prompt and outputs the actual <code>payment_service.py</code>. Its role is constrained synthesis: translating the boundaries in <code>boundaries.md</code> and the imperatives in <code>intent.md</code> into code.</p>



<h3 class="wp-block-heading">Step 5: Adversarial verification (negative space)</h3>



<p class="wp-block-paragraph">This phase checks whether the generated code crossed a forbidden boundary. Before the development cycle begins, the adversarial context provider defines threat vectors in <code>threat-model.md</code>. Because a Markdown file only guides the LLM softly, the governance platform engineer bridges the gap to determinism by translating those declarative threats into matching executable rules (like <code>semgrep-rule.yml</code>) wired into the CI gates. If the threat model identifies server-side request forgery or secrets exfiltration as a risk for the <code>/frontend</code> module, the corresponding CI rule parses the generated code and instantly fails the build if a known attack pattern or insecure execution sink is detected.</p>



<p class="wp-block-paragraph">The pipeline doesn’t ask an LLM to read the Markdown and assess if the code is safe. It mechanically executes the prewritten rules derived from it. If a generative agent helps draft the rule set, it does so before the cycle in an isolated sandbox, and a human reviews the result before it enters CI. Step 5 doesn’t prove overall correctness; it proves that declared structural and security boundaries are enforced.</p>
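<p class="wp-block-paragraph">One way to keep the threat model and the rule set from drifting apart is a small coverage check that runs alongside the rules themselves. The sketch below rests on an assumed convention that this article does not prescribe: each threat in <code>threat-model.md</code> gets a heading such as <code>## T-SSRF: ...</code>, and each rule names the threat it covers in a <code>threat:</code> line under its <code>metadata</code> key.</p>

```python
import re

# Assumed convention (hypothetical, not from the article): threats are declared
# as "## T-NAME: description" headings, and rules reference them with a
# "threat: T-NAME" line in their metadata.
THREAT_HEADING = re.compile(r"^##\s+(T-[A-Z0-9-]+)", re.MULTILINE)
RULE_THREAT = re.compile(r"^\s*threat:\s*(T-[A-Z0-9-]+)\s*$", re.MULTILINE)


def uncovered_threats(threat_model_md: str, rules_yml: str) -> list[str]:
    """Return declared threat IDs that no rule claims to cover."""
    declared = set(THREAT_HEADING.findall(threat_model_md))
    covered = set(RULE_THREAT.findall(rules_yml))
    return sorted(declared - covered)
```

<p class="wp-block-paragraph">A nonempty result would mean a declared threat has only soft, prompt-level enforcement. The check says nothing about whether a rule actually detects the threat it names; that still needs human review.</p>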



<p class="wp-block-paragraph">Like any static gate, deterministic boundary checks trade flexibility for safety and will occasionally reject valid implementations. That friction is intentional: Explicit override and artifact refinement are part of the governance loop.</p>



<p class="wp-block-paragraph">AI code review may identify suspicious code, but it cannot certify that declared boundaries survived generation. Step 5 therefore relies on deterministic CI rules, not on a probabilistic model interpreting the pull request.</p>



<h3 class="wp-block-heading">Step 6: Acceptance verification (positive space)</h3>



<p class="wp-block-paragraph">This phase checks whether the generated code solves the business problem. The <code>acceptance-criteria.md</code> defines the expected behavior not as a vague user story, but as a machine-executable contract (e.g., using Gherkin syntax):</p>



<pre class="wp-block-code"><code>Scenario: Successful payment emits notification
  Given a valid payment of 100 EUR
  When the transaction completes
  Then the PaymentSuccessEvent is published to the message bus</code></pre>



<p class="wp-block-paragraph">The CI pipeline parses this exact Markdown block and runs the corresponding test suite. Step 6 provides what step 5 cannot: verification against a declared delivery contract.</p>
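<p class="wp-block-paragraph">In practice a BDD framework would bind these steps to test code. The sketch below shows the idea with the standard library only, for this one scenario. <code>InMemoryBus</code> and <code>complete_transaction</code> are hypothetical stand-ins for the message bus and the generated payment code.</p>

```python
import re
from dataclasses import dataclass, field


@dataclass
class InMemoryBus:
    """Hypothetical stand-in for the message bus."""
    published: list = field(default_factory=list)

    def publish(self, event: str) -> None:
        self.published.append(event)


def complete_transaction(amount: int, currency: str, bus: InMemoryBus) -> None:
    """Hypothetical system under test; generated code would replace this."""
    bus.publish("PaymentSuccessEvent")


def parse_steps(scenario: str) -> list[tuple[str, str]]:
    """Split a Gherkin scenario into (keyword, text) pairs."""
    pattern = re.compile(r"^\s*(Given|When|Then)\s+(.*\S)\s*$", re.MULTILINE)
    return pattern.findall(scenario)


def run_scenario(scenario: str) -> InMemoryBus:
    """Execute the steps in order; raise if a step is unknown or fails."""
    bus = InMemoryBus()
    state = {}
    for keyword, text in parse_steps(scenario):
        if keyword == "Given":
            match = re.fullmatch(r"a valid payment of (\d+) ([A-Z]{3})", text)
            if match is None:
                raise ValueError(f"unsupported step: {text}")
            state["payment"] = (int(match.group(1)), match.group(2))
        elif keyword == "When" and text == "the transaction completes":
            amount, currency = state["payment"]
            complete_transaction(amount, currency, bus)
        elif keyword == "Then":
            match = re.fullmatch(r"the (\w+) is published to the message bus", text)
            if match is None:
                raise ValueError(f"unsupported step: {text}")
            if match.group(1) not in bus.published:
                raise AssertionError(f"{match.group(1)} was not published")
        else:
            raise ValueError(f"unsupported step: {text}")
    return bus
```

<p class="wp-block-paragraph">Rejecting unknown steps matters as much as running known ones: a scenario line the runner cannot bind to code should fail the build rather than pass silently.</p>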



<p class="wp-block-paragraph">The code is approved only when it passes adversarial checks <em>and</em> satisfies the acceptance criteria. Without step 5, the system could violate structural boundaries. Without step 6, it could implement the wrong intent. Both contracts must hold.</p>



<h2 class="wp-block-heading">Artifact-bound roles and accountability</h2>



<p class="wp-block-paragraph">The traditional SDLC is a linear cascade: Requirements flow to architecture, then to code, then to QA. In an era where a machine generates 10,000 lines of syntax in the time it takes to fetch a coffee, that handoff is a fatal bottleneck.</p>



<p class="wp-block-paragraph">In the context matrix, specialists define parallel, independent constraint vectors <em>before</em> generation begins. The titles on business cards stay the same. The artifacts they produce change entirely.</p>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Old role</strong></td><td><strong>New role</strong></td><td><strong>Artifact</strong></td><td><strong>Responsibility</strong></td></tr><tr><td>Business analyst</td><td><strong>Intent definer</strong></td><td><code>intent.md</code> + <br><code>acceptance-criteria.md</code></td><td>Define the &#8220;what&#8221; and the deterministic proof that it was delivered</td></tr><tr><td>Software architect</td><td><strong>World builder</strong></td><td><code>boundaries.md</code></td><td>Define domain ontology, architectural invariants, and allowed interaction patterns</td></tr><tr><td>QA &amp; security engineer</td><td><strong>Adversarial context provider</strong></td><td><code>threat-model.md</code></td><td>Define threat vectors and abuse paths <em>before</em> generation</td></tr><tr><td>Platform engineer/DevOps</td><td><strong>Governance platform engineer</strong></td><td>Compiler pipeline + CI gates (<code>semgrep-rule.yml</code>)&nbsp;</td><td>Operationalize declared constraints into nonbypassable enforcement gates</td></tr><tr><td>Developer</td><td><strong>Context orchestrator</strong></td><td><code>coding-standards.md</code> + critical code</td><td>Resolve artifact conflicts, steer generation workflows, implement critical paths, and refine context quality</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">In this model, accountability is distributed and artifact bound. Rather than handing off work downstream, each role owns specific upstream activities and constraints.</p>



<ul class="wp-block-list">
<li><strong>The intent definer (formerly business analyst):</strong> Owns the business reality. They translate user needs into <code>intent.md</code> and define hard <code>acceptance-criteria.md</code> (like BDD scenarios or API contracts). Their job is to formulate requirements so strictly that the pipeline can automatically prove delivery, acting as the first line of defense against vague &#8220;vibe coding.&#8221;</li>



<li><strong>The world builder (formerly software architect):</strong> Owns the structural gravity. They write <code>boundaries.md</code> to establish the domain ontology and hard architectural boundaries. Instead of reviewing pull requests for drift, their daily activity is defining what modules are allowed to communicate and declaring the structural invariants the generated code must respect.</li>



<li><strong>The adversarial context provider (formerly QA and security):</strong> Owns the negative space. They anticipate failure modes and define threat vectors via <code>threat-model.md</code>. Their responsibility is identifying the precise abuse paths that the CI pipeline must block, ensuring an LLM never tests its own code.</li>



<li><strong>The governance platform engineer (formerly platform engineer/DevOps):</strong> Owns the enforcement machinery. They build the context compiler pipeline and operationalize declared constraints into nonbypassable enforcement gates. Their responsibility is the deterministic enforcement pipeline that executes declared governance artifacts at precommit and CI/CD boundaries.</li>



<li><strong>The context orchestrator (formerly developer):</strong> Owns generation orchestration and critical handwritten paths. This is a hybrid reality, not the end of programming. They write <code>coding-standards.md</code>, manually implement zero-trust paths, and resolve runtime exception requests. For the bulk of the system, their focus shifts to a meta-level: resolving conflicting constraints, tuning the prompt&#8217;s signal-to-noise ratio, and debugging why a given artifact failed to govern the agent properly.</li>
</ul>



<p class="wp-block-paragraph">When a failure occurs, the investigation shifts from &#8220;What was the agent thinking?&#8221; to &#8220;Which contract failed to govern?&#8221; Because the pipeline deterministically enforces what was explicitly declared, failures are no longer opaque hallucinations. They’re traceable collisions between artifact boundaries. A structural flaw cleanly points to an unbounded <code>boundaries.md</code>. When the pipeline is green and the contracts are honest, the orchestrator acts as a firewall against process failure, not a scapegoat for undocumented assumptions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="780" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-1600x780.png" alt="The decision boundary architecture: Context compilation governs generation, ROA structures intent, and DIR validates execution." class="wp-image-18841" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-1600x780.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-300x146.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-768x375.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-1536x749.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2.png 2048w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 3. The decision boundary architecture: Context compilation governs generation, ROA structures intent, and DIR validates execution.</em></figcaption></figure>



<h2 class="wp-block-heading">The economics of governance</h2>



<p class="wp-block-paragraph">Context compilation makes economic sense only when the cost of architectural failure exceeds the cost of explicit governance. It adds upfront design work and cognitive overhead, so its value depends on how expensive a wrong system decision would be.</p>



<p class="wp-block-paragraph">For rapid prototyping, throwaway utility scripts, marketing sites, or low-stakes internal tools—where the worst-case consequence of a hallucination is a misaligned dashboard—let the generative engines run unconstrained. Velocity is the only thing that matters.</p>



<p class="wp-block-paragraph">For safety-critical automation, trading platforms, healthcare orchestrators, and regulated enterprise systems, the economics invert. Velocity without deterministic boundaries is simply the speed at which you accumulate liability. A single unconstrained agent importing an insecure dependency into a payment core costs orders of magnitude more than the engineer-hours spent writing a <code>boundaries.md</code> contract.</p>



<p class="wp-block-paragraph">You don’t build a bank vault door for a garden shed. You apply context compilation where the systemic cost of emergent architectural failure is catastrophic.</p>



<h2 class="wp-block-heading">Automating the word &#8220;NO&#8221;</h2>



<p class="wp-block-paragraph">When code generation becomes cheap, architectural entropy tends to scale with it. That makes post hoc code review less effective, especially when reviewers spend their attention on machine-generated boilerplate. A more durable approach is <em>context review</em>: peer review of the declarative constraints that shape what the machine is allowed to build. A reviewed <code>boundaries.md</code> can guide many later development cycles. A reviewed pull request usually governs only a single change.</p>



<p class="wp-block-paragraph">The discipline has shifted from imperative engineering of procedures to declarative engineering of boundaries.</p>



<p class="wp-block-paragraph">Let’s return to the Jira ticket that started this discussion: &#8220;Add an email notification after a successful payment.&#8221;</p>



<p class="wp-block-paragraph">The business analyst submits the <code>intent.md</code>. Before the developer agent sees the prompt, the context compiler activates, either at the precommit gate or through a script or a tool-mediated context protocol (e.g., MCP) in the IDE. It retrieves the architect&#8217;s <code>boundaries.md</code>, which states, &#8220;The <code>/domain</code> module has zero external dependencies. No network calls.&#8221; An SMTP import would collide with that boundary. The prompt biases generation toward compliant solutions, and even if the agent generates the import anyway, the deterministic static check in step 5 rejects the build at the declared boundary. The Frankenstein is caught in the pipeline, not discovered in production three release cycles later.</p>



<p class="wp-block-paragraph">Code generation is becoming abundant. Architectural discipline is becoming scarce.</p>



<p class="wp-block-paragraph">Context as code governs what may be generated. Responsibility-oriented agents govern what may be proposed. Decision Intelligence Runtime governs what may be executed. Three boundaries. One governing frame.</p>



<p class="wp-block-paragraph">The highest-value engineering skill is no longer writing syntax. It’s engineering the conditions under which correct syntax can emerge.</p>



<p class="wp-block-paragraph">That is the ability to automate the word &#8220;NO.&#8221;</p>



<p class="wp-block-paragraph"><em>This article concludes the three-part series on engineering boundaries in agentic AI. The repository at <a href="https://github.com/huka81/decision-intelligence-runtime" target="_blank" rel="noreferrer noopener">github.com/huka81/decision-intelligence-runtime</a> contains an open source reference implementation of the concepts described in this series.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/context-as-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Radar Trends to Watch: June 2026</title>
		<link>https://www.oreilly.com/radar/radar-trends-to-watch-june-2026/</link>
				<comments>https://www.oreilly.com/radar/radar-trends-to-watch-june-2026/#respond</comments>
				<pubDate>Tue, 02 Jun 2026 10:58:22 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[Radar Trends]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18834</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-7.png" 
				medium="image" 
				type="image/png" 
				width="1400" 
				height="950" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-7-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Developments in policy and governance, infrastructure and ops, AI models, and more]]></custom:subtitle>
		
				<description><![CDATA[Coauthored with Claude Agents are making the transition from performing tasks to running operations. The Cloudflare and Stripe partnership ships an agent that opens accounts, registers domains, and deploys an application on its own (details), while Stripe/Tempo and iWallet have each published machine-to-machine payment protocols to make that kind of work a standard. Office documents, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>Coauthored with Claude</em></p>



<p class="wp-block-paragraph">Agents are making the transition from performing tasks to running operations. The Cloudflare and Stripe partnership ships an agent that opens accounts, registers domains, and deploys an application on its own (<a href="https://www.infoworld.com/article/4165857/are-we-ready-to-give-ai-agents-the-keys-to-the-cloud-cloudflare-thinks-so.html" target="_blank" rel="noreferrer noopener">details</a>), while Stripe/Tempo and iWallet have each published machine-to-machine payment protocols to make that kind of work a standard. Office documents, browser sessions, and, in one announcement, the phone interface itself are next on the list. View the expanded role of agents as an opportunity for humans to accomplish more.</p>



<h2 class="wp-block-heading">AI Models</h2>



<p class="wp-block-paragraph">The model menagerie keeps expanding in size and shape. Open weight contenders run at frontier capability on modest hardware, while specialist models for voice, conversation timing, and privacy filtering take over what used to be features inside one general chat model. Treat your prompts and skills as portable; the model behind them will change.</p>



<ul class="wp-block-list">
<li>Anthropic has <a href="https://www.anthropic.com/news/claude-opus-4-8" target="_blank" rel="noreferrer noopener">released</a> Claude Opus 4.8. This model is not Mythos, which they expect to release soon. Opus 4.8 is a “modest improvement” that claims better results on coding and greater likelihood of informing users when it is uncertain about claims. Changes to the agents may be more important. Claude Code now has the ability to plan solutions to large problems involving hundreds of subagents (“dynamic workflows”); Cowork can control the effort put into solving a problem.</li>



<li>Cohere&#8217;s <a href="https://cohere.com/blog/command-a-plus" target="_blank" rel="noreferrer noopener">Command A+</a> is an open weight mixture-of-experts model with 218B parameters, 25B active. It’s competitive with frontier models and requires relatively little hardware to run: Two H100s isn&#8217;t small, but it&#8217;s not a data center either.</li>



<li>Google&#8217;s announcements at this year’s I/O conference include <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/" target="_blank" rel="noreferrer noopener">Omni</a>, a new model that takes any kind of input (video, audio, image) and generates any kind of output; <a href="https://ai.google.dev/gemini-api/docs/interactions/whats-new-gemini-3.5" target="_blank" rel="noreferrer noopener">Gemini 3.5 Flash</a>, a fast and efficient update to their coding model; <a href="https://gemini.google/overview/agent/spark/" target="_blank" rel="noreferrer noopener">Gemini Spark</a>, a personal agent; and <a href="https://blog.google/products-and-platforms/platforms/android/android-xr-io-2026/" target="_blank" rel="noreferrer noopener">intelligent eyewear</a>, another attempt at smart glasses.</li>



<li>Alibaba has <a href="https://qwen.ai/blog?id=qwen3.7" target="_blank" rel="noreferrer noopener">announced</a> Qwen3.7-Max, its most capable model.</li>



<li>Thinking Machines has <a href="https://thinkingmachines.ai/blog/interaction-models/" target="_blank" rel="noreferrer noopener">announced</a> a research preview of interaction models. These models support natural conversation flow. The model can wait for a speaker to finish, interrupt the speaker, respond when the speaker interrupts the model, and keep track of time.</li>



<li>OpenAI has <a href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/" target="_blank" rel="noreferrer noopener">released</a> new voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They’re moving from call-and-response models to models that can take part in conversations, reason, and take actions.</li>



<li>OpenRouter published cost studies for both <a href="https://openrouter.ai/announcements/opus-47-tokenizer-analysis" target="_blank" rel="noreferrer noopener">Claude Opus 4.7</a> and <a href="https://openrouter.ai/announcements/gpt55-cost-analysis" target="_blank" rel="noreferrer noopener">GPT-5.5</a>. GPT-5.5 raised the token price but reduced the number of tokens in a typical conversation. Claude kept prices the same, but conversations tend to require more tokens. What&#8217;s the impact on your monthly bill?</li>



<li>Google has <a href="https://arstechnica.com/ai/2026/05/googles-gemma-4-open-ai-models-use-speculative-decoding-to-get-up-to-3x-faster/" target="_blank" rel="noreferrer noopener">updated</a> its Gemma 4 models, claiming that they triple token generation speed. They use a technique called <a href="https://x.com/googlegemma/status/2051694045869879749" target="_blank" rel="noreferrer noopener">multi-token prediction</a> (MTP) to draft a sequence of tokens with a very small model and then approve those tokens with the large model.</li>



<li>IBM released <a href="https://research.ibm.com/blog/granite-4-1-ai-foundation-models" target="_blank" rel="noreferrer noopener">Granite 4.1</a>, a collection of small models (30B parameters and down).</li>



<li>An academic paper describes “<a href="https://arxiv.org/abs/2510.22977" target="_blank" rel="noreferrer noopener">the reasoning trap</a>,” a phenomenon in which training models for increased reasoning also increases hallucinations about tool use.</li>



<li><a href="https://talkie-lm.com/chat" target="_blank" rel="noreferrer noopener">Talkie</a> is an LLM that was trained only on data from 1931 and earlier. If you want to know what it was like to live during the start of the Depression, this is the LLM to ask.</li>



<li>OpenAI has <a href="https://openai.com/index/introducing-openai-privacy-filter/" target="_blank" rel="noreferrer noopener">announced</a> a <a href="https://huggingface.co/openai/privacy-filter" target="_blank" rel="noreferrer noopener">privacy filter model</a>. This is a small specialized model (1.5B) that can run on phones and other small devices. It removes personally identifiable information (PII) from text documents.</li>
</ul>



<h2 class="wp-block-heading">Software Development</h2>



<p class="wp-block-paragraph">We are beginning to see anecdotal evidence that the brief era of <a href="https://thenewstack.io/opus-4-8-claude-smarter-token-discipline-urgent/" target="_blank" rel="noreferrer noopener">tokenmaxxing is coming to an end</a>. Agents may increase productivity, but they can also use tokens at an astonishing rate. So can the latest models, like Anthropic’s Claude Opus 4.8 with new features like dynamic workflows. Employers are realizing that the only way to measure productivity is to look at the quality of an employee’s work rather than relying on an artificial (and easily gameable) metric like token use. Teams that use AI effectively will be disciplined about token use; they’ll choose lower cost (or local) models where possible, reaching for expensive models like Claude Opus 4.8 only when necessary.</p>



<ul class="wp-block-list">
<li>The Agentic AI Foundation is <a href="https://aaif.io/blog/mcp-is-growing-up/" target="_blank" rel="noreferrer noopener">updating</a> MCP, with a <a href="https://blog.modelcontextprotocol.io/posts/2026-07-28-release-candidate/" target="_blank" rel="noreferrer noopener">release candidate</a> scheduled for July 28. Changes include making MCP a stateless protocol, adding a process for creating extensions, and aligning authorization with the OAuth and OpenID standards.</li>



<li>Google is <a href="https://developers.googleblog.com/an-important-update-transitioning-gemini-cli-to-antigravity-cli/" target="_blank" rel="noreferrer noopener">dropping Gemini CLI</a> and putting all of its effort behind <a href="https://antigravity.google/" target="_blank" rel="noreferrer noopener">Antigravity</a>, its agentic software development platform. There are desktop and command line versions of Antigravity, but unlike Gemini CLI, neither is open source.</li>



<li>What shall we call <a href="https://steve-yegge.medium.com/welcome-to-gas-city-57f564bb3607" target="_blank" rel="noreferrer noopener">Gas City</a>, created by Julian Knutsen and Chris Sells? Gas Town 2.0? Steve Yegge says it&#8217;s an SDK for building your own &#8220;dark factories&#8221; by deploying teams of collaborating agents in any topology. It&#8217;s &#8220;a pivotal moment in the Mad Max school of agent orchestration.&#8221;</li>



<li>The problem with agentic programming is that agents serve individuals, not groups, and programming is a team sport. Is <a href="https://www.lukew.com/ff/entry.asp?2153" target="_blank" rel="noreferrer noopener">collaborative steering</a> (context management for groups) an answer?</li>



<li>GitHub has <a href="https://github.com/features/preview/github-app" target="_blank" rel="noreferrer noopener">released</a> a preview of its Copilot app, a stand-alone desktop application for coding with AI. It’s completely integrated with GitHub; for example, you can launch tasks directly from GitHub issues.</li>



<li>If you think tokenmaxxing is your path to promotion, check out <a href="https://github.com/dtnewman/burn-baby-burn" target="_blank" rel="noreferrer noopener">burn-baby-burn</a>. It does what it says: burns lots of tokens, fast, using the LLM of your choice. We hope it&#8217;s a parody, but we bet it works.</li>



<li>Mitchell Hashimoto <a href="https://x.com/mitchellh/status/2055039647924007222" target="_blank" rel="noreferrer noopener">tweets</a> that Anthropic&#8217;s rewrite of Bun from Zig to Rust demonstrates that programming languages are now fungible. Programming language lock-in has ended; programs can easily move from one language to another.</li>



<li><a href="https://github.com/NVIDIA/OpenShell?utm_source=the+new+stack&amp;utm_medium=referral&amp;utm_content=inline-mention&amp;utm_campaign=tns+platform" target="_blank" rel="noreferrer noopener">OpenShell</a> is a <a href="https://thenewstack.io/nvidia-openshell-agent-runtime/" target="_blank" rel="noreferrer noopener">runtime environment</a> built with security in mind from the ground up. It’s intended to be used as a secure environment for running agents. Every agent runs in its own sandbox; an external gateway manages credentials and policies.</li>



<li>OpenAI is <a href="https://community.openai.com/t/openai-is-winding-down-the-fine-tuning-api-and-platform-discussion-thread/1380522" target="_blank" rel="noreferrer noopener">shutting down</a> its API for fine-tuning its models. <a href="https://x.com/bradenjhancock/status/2053309599248453999?s=20" target="_blank" rel="noreferrer noopener">They say</a> the current models are better and don&#8217;t require significant fine-tuning. As <em>Latent Space</em> <a href="https://www.latent.space/p/ainews-the-end-of-finetuning" target="_blank" rel="noreferrer noopener">points out</a>, this doesn&#8217;t necessarily mean the end of fine-tuning as a discipline, particularly for open models. But it may be a signal. Drew Breunig <a href="https://www.dbreunig.com/2026/05/10/overfitting-the-harness.html" target="_blank" rel="noreferrer noopener">writes</a> about what this means for agents and harnesses.</li>



<li>Anthropic has <a href="https://claude.com/blog/collaborate-with-claude-across-excel-powerpoint-word-and-outlook" target="_blank" rel="noreferrer noopener">released</a> Claude for Office 365, allowing users to run sessions that cross Word, Excel, and PowerPoint. Integration with Outlook is coming, though Claude for Outlook is currently a separate product.</li>



<li>A <a href="https://developers.openai.com/codex/app/chrome-extension?utm_source=the+new+stack&amp;utm_medium=referral&amp;utm_content=inline-mention&amp;utm_campaign=tns+platform" target="_blank" rel="noreferrer noopener">plugin to Chrome allows Codex to use Chrome</a> for browser tasks that require you to be logged in—for example, reading email.</li>



<li><a href="https://www.firecrawl.dev/" target="_blank" rel="noreferrer noopener">Firecrawl</a> is an API that agents can use to interact with websites in a human way. It enables agents to search for the latest data, interact with the site, and return the results at scale.</li>



<li>Drew Breunig&#8217;s “<a href="https://www.dbreunig.com/2026/05/04/10-lessons-for-agentic-coding.html" target="_blank" rel="noreferrer noopener">10 Lessons for Agentic Coding</a>” is an invaluable list of tips, including &#8220;Implement to learn.&#8221; Letting an agent write all the code is easy, but when you really need to learn something, write it by hand first.</li>



<li><a href="https://github.com/aattaran/deepclaude" target="_blank" rel="noreferrer noopener">Deepclaude</a> configures Claude&#8217;s autonomous agent loop to use DeepSeek V4 Pro rather than one of Anthropic&#8217;s models. It&#8217;s a good way to save money (DeepSeek costs much less per token) and to experiment with open models. (Fair warning: The name deepclaude may change.)</li>



<li>OpenAI has announced <a href="https://chatgpt.com/codex/for-work/" target="_blank" rel="noreferrer noopener">Codex for Work</a>, an assistant that&#8217;s designed for office work rather than software development.</li>



<li><a href="https://github.com/kanwas-ai/kanwas" target="_blank" rel="noreferrer noopener">Kanwas</a> is a new tool for sharing context across agents. It can be used by workgroups to collaborate on projects.</li>



<li><a href="https://mikeoss.com/" target="_blank" rel="noreferrer noopener">Mike</a> is an open source AI trained for legal work and designed to run locally.</li>



<li>GitHub is <a href="https://arstechnica.com/ai/2026/04/github-will-start-charging-copilot-users-based-on-their-actual-ai-usage/" target="_blank" rel="noreferrer noopener">transitioning</a> to <a href="https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/" target="_blank" rel="noreferrer noopener">usage-based billing for Copilot</a>.</li>



<li>OpenAI and Qualcomm are reportedly <a href="https://thenextweb.com/news/openai-qualcomm-ai-phone-agents-replace-apps" target="_blank" rel="noreferrer noopener">working on a phone</a> where the user interface is an agent. There won&#8217;t be any apps; the agent will do everything.</li>
</ul>



<h2 class="wp-block-heading">Infrastructure and Operations</h2>



<p class="wp-block-paragraph">The infrastructure questions of the moment are whether agents can transact and deploy without humans, and whether the platforms that host open source can stay reliable enough to keep that work going. Watch for GitHub alternatives to become competitive. And watch Together AI, a cloud company that hosts hundreds of open weight models.</p>



<ul class="wp-block-list">
<li><a href="https://www.withlanai.com/products/tokentuner" target="_blank" rel="noreferrer noopener">TokenTuner</a> helps control AI costs by <a href="https://thenewstack.io/lanai-token-tuner-tokenmaxxing/" target="_blank" rel="noreferrer noopener">identifying</a> where companies can use lower-cost models productively. It attempts to match token usage to business outcomes, and evaluates individuals and teams on how effectively they use their token budget.</li>



<li>In partnership with <a href="https://projects.dev/" target="_blank" rel="noreferrer noopener">Stripe</a>, <a href="https://blog.cloudflare.com/agents-stripe-projects/" target="_blank" rel="noreferrer noopener">Cloudflare</a> now has an <a href="https://www.infoworld.com/article/4165857/are-we-ready-to-give-ai-agents-the-keys-to-the-cloud-cloudflare-thinks-so.html" target="_blank" rel="noreferrer noopener">agent that can create a new account</a>, start a subscription, register a domain name with DNS, and deploy an application without human intervention aside from granting permission.</li>



<li>Stripe and Tempo have <a href="https://thenewstack.io/ai-agent-payment-protocols/" target="_blank" rel="noreferrer noopener">released</a> the Machine Payments Protocol (MPP), and iWallet has laid out a roadmap for the Autonomous Settlement Protocol (ASP). These new protocols are designed to facilitate machine-to-machine transactions, transactions that have to be designed without a human in the loop.</li>



<li>The <a href="https://www.latent.space/p/ainews-the-inference-inflection" target="_blank" rel="noreferrer noopener">Inference Era</a> is when inference, rather than training, drives AI usage, cost, and infrastructure. GPUs remain important, but the relative demand for CPUs increases.</li>



<li>GitHub is in danger of losing its place at the center of the open source ecosystem. <a href="https://www.theregister.com/2026/04/29/github_says_sorry_and_says/" target="_blank" rel="noreferrer noopener">Problems with uptime</a> are causing projects to find homes elsewhere—<a href="https://www.theregister.com/2026/04/29/mitchell_hashimoto_ghostty_quitting_github/" target="_blank" rel="noreferrer noopener">most recently, Ghostty</a>.</li>



<li><a href="https://www.together.ai/" target="_blank" rel="noreferrer noopener">Together AI</a> operates a cloud AI platform that’s designed <a href="https://rokosbas.beehiiv.com/p/may-20-2026" target="_blank" rel="noreferrer noopener">specifically for inference</a> rather than training and that provides API access to over 200 open weight models. As AI use increases, the ability to run models and provide answers efficiently becomes more important than the ability to train new models.</li>
</ul>



<h2 class="wp-block-heading">Security</h2>



<p class="wp-block-paragraph">The patch window is shrinking to zero, and the attacker&#8217;s toolkit and the defender&#8217;s toolkit now include the same AI models. Any vulnerability disclosed today is being exploited tonight. The good news is that defenders running these tools at scale can close gaps faster than ever; the bad news is that the race never ends.</p>



<ul class="wp-block-list">
<li><a href="https://arstechnica.com/security/2026/05/websites-have-a-new-way-to-spy-on-visitors-analyzing-their-ssd-activity/" target="_blank" rel="noreferrer noopener">FROST</a> is a new technology for surreptitiously discovering what websites a user is visiting. It’s based on measuring the I/O operations on the user’s SSD. FROST requires no interaction from the user and runs entirely in the browser.</li>



<li>Regrettably, neither arcane prompt injection attacks nor cryptocurrency scams are news. But it warms a ham radio enthusiast&#8217;s heart to see <a href="https://www.dexerto.com/entertainment/x-user-tricks-grok-into-sending-them-200000-in-crypto-using-morse-code-3361036/" target="_blank" rel="noreferrer noopener">Morse code used in a prompt injection to scam a crypto trading bot</a>.</li>



<li>TeamPCP, a cybercriminal collective, has <a href="https://arstechnica.com/information-technology/2026/05/a-hacker-group-is-poisoning-open-source-code-at-an-unprecedented-scale/" target="_blank" rel="noreferrer noopener">attacked GitHub</a> by installing a poisoned extension to VS Code. GitHub announced that nearly 4,000 repositories have been compromised, all belonging to GitHub itself; no customer repositories have become victims. But anyone who installs corrupted code from GitHub&#8217;s own repositories is vulnerable.</li>



<li><em><a href="https://berryvilleiml.com/docs/no-security-meter-ai.pdf" target="_blank" rel="noreferrer noopener">No Security Meter for AI</a></em> provides an excellent look into the state of AI security.</li>



<li>Cloudflare&#8217;s <a href="https://blog.cloudflare.com/cyber-frontier-models/" target="_blank" rel="noreferrer noopener">report</a> on Project Glasswing and Claude Mythos is worth reading. Mythos is especially noteworthy for its ability to chain vulnerabilities. In real life, few vulnerabilities are exploitable on their own; they become exploitable when they are used in combination with others.</li>



<li>Daniel Stenberg <a href="https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/" target="_blank" rel="noreferrer noopener">reports</a> that Mythos found five potential vulnerabilities in <a href="https://curl.se/" target="_blank" rel="noreferrer noopener">curl</a>, of which one was legitimate. The low count isn&#8217;t surprising, given the quality of the curl team&#8217;s work. What&#8217;s significant is that Mythos was able to find a legitimate vulnerability in software that had been thoroughly audited by humans, traditional tools, and AI.</li>



<li><a href="https://arman-bd.hashnode.dev/i-left-port-22-open-on-the-internet-for-54-days-here-s-who-showed-up" target="_blank" rel="noreferrer noopener">Who showed up?</a> A security researcher ran a honeypot with port 22 open for 54 days, and logged every attempt to log in: 269,000 connection attempts from 7,556 unique IP addresses.</li>



<li>GitHub&#8217;s dependency scanning service for its MCP server is now in <a href="https://github.blog/changelog/2026-05-05-dependency-scanning-with-github-mcp-server-is-in-public-preview/?utm_source=the+new+stack&amp;utm_medium=referral&amp;utm_content=inline-mention&amp;utm_campaign=tns+platform" target="_blank" rel="noreferrer noopener">public preview</a>. It checks code changes for vulnerable dependencies before committing code or opening a pull request.</li>



<li><a href="https://jorijn.com/en/blog/copy-fail-cve-2026-31431-linux-kernel-bug-explained/" target="_blank" rel="noreferrer noopener">Copy.fail</a> is a recently discovered Linux kernel vulnerability that allows unprivileged processes to escalate privileges, and it was exploited within a day of its disclosure. Unlike with most vulnerabilities, running infected programs in a container does not offer protection. The time from disclosure of a zero-day to exploitation in the wild is indeed shrinking.</li>



<li>OpenAI&#8217;s <a href="https://thenextweb.com/news/openai-chatgpt-advanced-security-yubico-passkeys" target="_blank" rel="noreferrer noopener">Advanced Account Security</a> requires a physical key or passkey for access; there are no passwords. The physical key can come from Yubico or be any compatible hardware token.</li>



<li><a href="https://techcrunch.com/2026/04/30/after-dissing-anthropic-for-limiting-mythos-openai-restricts-access-to-cyber-too/" target="_blank" rel="noreferrer noopener">GPT-5.5 Cyber</a> is a version of GPT-5.5 that has been trained as a security tool. As Anthropic did with Mythos, OpenAI is limiting access to a small group of trusted users.</li>



<li>The Firefox team has <a href="https://blog.mozilla.org/en/firefox/ai-security-zero-day-vulnerabilities/" target="_blank" rel="noreferrer noopener">used Claude Mythos to find 271 previously unknown vulnerabilities</a> in Firefox. While this finding is terrifying, they conclude that defenders now have the advantage. Once you know the vulnerabilities, it&#8217;s possible to close the gap between defenders and attackers.</li>



<li>Claude Code can <a href="https://bdtechtalks.com/2026/04/27/claude-code-api-token-leak/" target="_blank" rel="noreferrer noopener">leak credentials</a> and other secrets to public repos and package registries. When you select &#8220;allow always&#8221; for a specific command, the command and its credentials are stored in a subdirectory of .claude. This directory can inadvertently be incorporated into a package.</li>
</ul>
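
<p class="wp-block-paragraph">The Claude Code item above describes a .claude directory being swept into a published package. A minimal sketch of a pre-publish check follows, assuming only what the item states: that the sensitive material lives somewhere under a directory named .claude. The function name find_claude_dirs and the idea of running it before publishing are illustrative assumptions, not part of Claude Code or of any registry tooling.</p>

```python
import os

def find_claude_dirs(package_root):
    """Return paths of any directories named .claude under package_root.

    Hypothetical helper: it only looks for the directory name and does not
    inspect file contents or know anything about Claude Code internals.
    """
    hits = []
    for dirpath, dirnames, _filenames in os.walk(package_root):
        if ".claude" in dirnames:
            hits.append(os.path.join(dirpath, ".claude"))
    return sorted(hits)

if __name__ == "__main__":
    # Scan the current directory and report anything that should be excluded.
    for path in find_claude_dirs("."):
        print("exclude before publishing:", path)
```

<p class="wp-block-paragraph">An empty result does not prove a package is free of secrets; it only rules out this one directory name.</p>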



<h2 class="wp-block-heading">Policy and Governance</h2>



<ul class="wp-block-list">
<li>The ArXiv preprint repository has <a href="https://xcancel.com/tdietterich/status/2055000956144935055" target="_blank" rel="noreferrer noopener">clarified</a> its code of conduct for AI users. Submitters are responsible for their papers and will be banned for a year if they submit papers that use AI-generated content inappropriately. This includes hallucinated content, hallucinated references, and plagiarism.</li>



<li>Look to China for new approaches to <a href="https://thenextweb.com/news/china-data-governance-global-standard" target="_blank" rel="noreferrer noopener">data governance</a>. China is treating data as a national resource and building the infrastructure for a data economy.</li>
</ul>



<h2 class="wp-block-heading">Web</h2>



<ul class="wp-block-list">
<li>At its I/O conference, Google <a href="https://blog.google/products-and-platforms/products/search/search-io-2026/#powerful-ai" target="_blank" rel="noreferrer noopener">announced</a> that traditional search will be replaced by AI search, powered by Gemini 3.5 Flash. Both AI search and traditional search (which is really AI-powered) have proven useful. What happens when you eliminate one of the options?</li>



<li><a href="https://www.xda-developers.com/linux-running-inside-pdf-file/" target="_blank" rel="noreferrer noopener">Linux running in a PDF</a>? The PDF format supports JavaScript, and C can be compiled to JavaScript.</li>
</ul>



<h2 class="wp-block-heading">Biology</h2>



<ul class="wp-block-list">
<li>Colossal Biosciences has <a href="https://www.technologyreview.com/2026/05/19/1137471/colossal-biosciences-is-growing-chickens-in-a-3d-printed-container/" target="_blank" rel="noreferrer noopener">developed</a> a 3D-printed artificial eggshell that’s capable of raising chicks from embryos.</li>



<li>Brazil has <a href="https://www.economist.com/the-americas/2026/05/21/why-brazils-government-is-obsessed-with-vaccines" target="_blank" rel="noreferrer noopener">invested heavily</a> in vaccines and has created a single-shot vaccine against dengue fever. The country is striving for “medical sovereignty,” a concept that’s clearly related to data sovereignty and AI sovereignty.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/radar-trends-to-watch-june-2026/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>AI Sovereignty and the Architecture of Participation</title>
		<link>https://www.oreilly.com/radar/ai-sovereignty-and-the-architecture-of-participation/</link>
				<comments>https://www.oreilly.com/radar/ai-sovereignty-and-the-architecture-of-participation/#respond</comments>
				<pubDate>Mon, 01 Jun 2026 16:05:58 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18818</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Image-by-ChatGPT-5.5-Earth-from-space-at-night-as-a-federated-distributed-network.png" 
				medium="image" 
				type="image/png" 
				width="512" 
				height="288" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Image-by-ChatGPT-5.5-Earth-from-space-at-night-as-a-federated-distributed-network-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Adam Tooze recently shared a piece from The Economist about Brazil&#8217;s push for what it calls &#8220;medical sovereignty,&#8221; the determination to make its own vaccines and the active ingredients that go into its medicines rather than depend on supply chains it doesn&#8217;t control. Brazil already produces a large share of its own medicines through public [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Adam Tooze recently <a href="https://adamtooze.substack.com/p/top-links-1115-claiming-medical-sovereignty" target="_blank" rel="noreferrer noopener">shared</a> a piece from <em>The Economist</em> about <a href="https://www.economist.com/the-americas/2026/05/21/why-brazils-government-is-obsessed-with-vaccines" target="_blank" rel="noreferrer noopener">Brazil&#8217;s push for what it calls &#8220;medical sovereignty,&#8221;</a> the determination to make its own vaccines and the active ingredients that go into its medicines rather than depend on supply chains it doesn&#8217;t control. Brazil already produces a large share of its own medicines through public institutions like Fiocruz and Butantan, but a lot of the underlying inputs still come from abroad, and the pandemic made clear the cost of that dependence. So the country is trying to build the capacity to make the things it most needs to survive. The economist behind a lot of this thinking is <a href="https://marianamazzucato.com/" target="_blank" rel="noreferrer noopener">Mariana Mazzucato</a>, whose mission-oriented approach treats public procurement as a tool to build national capacity rather than just buy finished goods. (<a href="https://foreignpolicy.com/2024/01/26/brazil-lula-industrial-policy-economy-mission-mazzucato/" target="_blank" rel="noreferrer noopener"><em>Foreign Policy</em> has a good overview</a>.)</p>



<p class="wp-block-paragraph">I think we&#8217;re going to see a lot more of this, and not only in medicine. The same impulse is driving the quest for sovereign AI, as countries decide they don&#8217;t want their access to a foundational technology to run through a handful of American or Chinese companies. You can see it too in Europe&#8217;s and Japan&#8217;s new willingness to take responsibility for their own military destiny rather than assume the United States will always be there.</p>



<p class="wp-block-paragraph">Most commentators describe all of this as decoupling, the unwinding of a connected world. That reading is too narrow.</p>



<h2 class="wp-block-heading">Free trade was an architecture of participation that broke</h2>



<p class="wp-block-paragraph">Much like open source software and the World Wide Web, free trade was supposed to have what I call “<a href="https://asimovaddendum.substack.com/p/the-architecture-of-participation" target="_blank" rel="noreferrer noopener">an architecture of participation</a>.” The most important thing about the web and open source wasn&#8217;t openness for its own sake. It was that there were no central gatekeepers. Anyone could add to the richness of the system without asking permission as long as they followed the rules of the communication protocols that allowed independently developed pieces to work together. In addition, value circulated among the participants instead of being extracted to a center, and the system got better the more people used it. That is a very different thing from a system that is merely large and connected.</p>



<p class="wp-block-paragraph">Free trade was also supposed to work like that. The theory, going back to Smith and Ricardo, was that specialization and exchange would make everyone better off, and that the connections would be mutual. What we actually got over the past few decades looks more like the platform dominance we see in big tech than the original vision of a commons built around shared exchange. A handful of large and powerful countries and firms set the terms and the smaller players are forced to take what is on offer. Despite the language of free trade, the experience for many countries was closer to colonialism, just with a new narrative.</p>



<p class="wp-block-paragraph">Overall, under the neoliberal order (whose reign, as <a href="https://global.oup.com/academic/product/the-rise-and-fall-of-the-neoliberal-order-9780197519646" target="_blank" rel="noreferrer noopener">Gary Gerstle explains</a>, is now ending), free trade became far less egalitarian, inclusive, and generative than it could have been. Less powerful countries ended up in roughly the position that small businesses occupy on Amazon, or developers occupy on the app stores: free to participate, on terms they don&#8217;t control, with much of the value they create flowing back to the hub.</p>



<p class="wp-block-paragraph">Brazil&#8217;s response (and that of many others) should not be seen as a retreat from the world. It is a refusal to be participate <em>only as a buyer</em>, or as a source of raw materials.</p>



<p class="wp-block-paragraph">That&#8217;s why decoupling is the wrong word. Decoupling means cutting the connections. What these countries seem to want is to stay connected but to build real capacity of their own, so that no single supplier can switch them off. That&#8217;s closer to federation than to separation. A federated system is still a system, and its nodes still interoperate. But no node is wholly at the mercy of another, and value circulates among them rather than collecting at the center. A trading order in which the gains pool at a few hubs is brittle and eventually illegitimate, in the same way that a platform economy that strip-mines its participants eventually provokes regulation and revolt.</p>



<p class="wp-block-paragraph">I put the increasingly visible quest for <a href="https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-sovereign-ai" target="_blank" rel="noreferrer noopener">sovereign AI</a>, and the role of open source models and open source agentic protocols and harnesses in enabling that sovereignty, into the same bucket. I remember back in the early days of open source software when Michael Tiemann, whose pioneering open source company Cygnus Solutions had just been acquired by Red Hat, told me “What we really sell at Red Hat is control. The ability to control your own destiny.”</p>



<p class="wp-block-paragraph">As companies are increasingly at the mercy of <a href="https://www.theinformation.com/newsletters/ai-agenda/rising-ai-costs-becoming-problem-investors" target="_blank" rel="noreferrer noopener">unexpected token pricing changes by the big centralized players</a>, this same quest for sovereignty is playing out at the level of organizations. Open source AI, including not just open source and open weight models but open agentic protocols, agentic harnesses, and portable memory, is increasingly an essential part of the sovereignty toolkit.</p>



<p class="wp-block-paragraph">The national technology sovereignty movements should take a lesson from the open source movement. The heart of open source is its architecture of participation. It is a force for innovation and value creation to the extent that it frees up the ability of people to solve their own problems and contribute their solutions to a low-friction global commons.</p>



<h2 class="wp-block-heading">Is capture the inevitable fate of any architecture of participation?</h2>



<p class="wp-block-paragraph">The pattern of open architectures leading to a wave of innovation, winners emerging, consolidating their power and then turning to the dark side seems to be a natural part of the technology cycle. The web broke Microsoft’s dominance over the personal computer software ecosystem only to give rise to a new generation of gatekeepers. Cory Doctorow called this cycle “<a href="https://en.wikipedia.org/wiki/Enshittification" target="_blank" rel="noreferrer noopener">enshittification</a>.” I’ve told my own version of that story using the language of economics in “<a href="https://www.oreilly.com/radar/rising-tide-rents-and-robber-baron-rents/" target="_blank" rel="noreferrer noopener">Rising Tide Rents and Robber Baron Rents</a>.”</p>



<p class="wp-block-paragraph">The instinct after capture is to try to rebuild the thing that got captured, only this time with better rules. Mastodon and Bluesky tried to rebuild Twitter&#8217;s social layer with cleaner governance, and neither has succeeded at the scale they hoped for. Critics might say that it was because Mastodon stayed pure and never made itself easy enough to use, while Bluesky looked federated without really being so. But more importantly, reinventing what we used to have, or what we think we used to have, is rarely the path forward. You have to build something new.</p>



<p class="wp-block-paragraph">Each country building its own answer to the latest frontier models is the Mastodon move. The winning move is to operate at a layer the centralized model structurally can&#8217;t reach. Open agent protocols that let services from different providers interoperate (the work that MCP and the emerging agent stack are beginning to do) are one such layer. AI accountable to local democratic and legal institutions is another such layer. Domain-specific AI built around problems the global market won&#8217;t serve (the tropical disease vaccine analogue) is another. None of these is a smaller copy of what the hyperscalers offer. But there’s one more important layer to consider: infrastructure.</p>



<h2 class="wp-block-heading">Where are the servers?</h2>



<p class="wp-block-paragraph"><a href="https://ai-disclosures.org/" target="_blank" rel="noreferrer noopener">Ilan Strauss</a> made a useful point in our conversation about these ideas. Ilan noted that AI is one of the most global forms of capital we&#8217;ve ever built, trained on the whole of the internet and runnable more or less anywhere, and the sovereignty rhetoric is partly an attempt to give something inherently placeless a place. The technology wants to be everywhere at once. The people who live with its consequences want some say over it where they are.</p>



<p class="wp-block-paragraph">The placelessness of AI is only half of the truth, though. The other half is that AI is physically place-bound. The model weights are placeless. The data centers, the chips, the electrical grid, and the water for cooling are very much somewhere.</p>



<p class="wp-block-paragraph">The comparison with Brazil’s medical sovereignty reinforces this point. Brazil’s challenge isn’t to invent new drugs to compete with Pfizer, but to build the capacity to manufacture existing vaccines, and eventually to build the capacity to invent vaccines for diseases the West ignores. Fiocruz and Butantan matter not because they hold patents but because they are physical institutional capacity rooted in Brazilian soil: the labs, the cold chains, the regulatory capacity, the trained workforce, and access to the active pharmaceutical ingredients. That&#8217;s what medical sovereignty really means in practice. It is infrastructure plus the institutions that run it.</p>



<p class="wp-block-paragraph">The same is becoming true for AI. Open weights matter. They&#8217;re closer, though, to the patent than to the lab. Even if Qwen, Kimi, DeepSeek, Llama, Gemma, Granite, and whatever comes next are fully open, running them at scale requires data centers that cost tens of billions to build, chips whose supply chains a handful of countries control, and electricity grids that have to be expanded substantially to carry the load. The countries pursuing sovereign AI seriously seem to understand this. The EU&#8217;s AI Gigafactories program, India&#8217;s IndiaAI mission, the Gulf compute buildouts, the Singapore and Japan strategies, are all infrastructure plays first and model plays second.</p>



<p class="wp-block-paragraph">Infrastructure is the layer where capture is hardest to undo. You can distill or fine-tune a model far more easily than you can build a new continent’s worth of data centers or conjure the necessary electricity from a fragile power grid. If the architecture of participation for AI is defined only at the model layer, the infrastructure layer below will quietly recapture, over years, everything that was won above. Open weights running on three companies’ servers is not sovereignty.</p>



<p class="wp-block-paragraph">Building physical infrastructure capable of carrying a generation&#8217;s worth of economic activity is exactly the kind of mission the public sector used to take on, before we convinced ourselves the market would handle it. Mazzucato’s argument is that public procurement and public capacity-building are the real engines of foundational technology. AI sovereignty without industrial policy is wishful thinking.</p>



<p class="wp-block-paragraph">Industrial policy should aim to reinvent 20th century infrastructure, not just copy it. Can we use the enormous rebuild of infrastructure for the AI era to leapfrog the past? The analogy with centralized power grids and decentralized solar reminds us that local control does not have to be a localized version of the hyperscaler pattern. Might we envision a future where there is an intelligence grid that seamlessly uses frontier models in massive data centers and local models controlled by the user as dictated by considerations like cost, privacy, specialized knowledge, and user preferences? Creating the software to manage such an interoperable intelligence grid should be a high priority for the AI open source community. We need an orchestrator not just for agents but also for models and even for data center capacity.</p>



<h2 class="wp-block-heading">Could federated AI give us a new pattern for the economy?</h2>



<p class="wp-block-paragraph">In a previous piece about AI and markets, &#8220;<a href="https://asimovaddendum.substack.com/p/the-third-artificial-intelligence" target="_blank" rel="noreferrer noopener">The Third Artificial Intelligence</a>,&#8221; I picked up Richard Danzig&#8217;s argument that markets and the bureaucracies that underpin nation states are themselves artificial intelligences, information-processing mechanisms older than the machine kind. The question with all three is who designs and builds them, what they optimize for, and what feedback loops govern them.</p>



<p class="wp-block-paragraph">We&#8217;re about to spend a lot of effort working out how AI should be organized both across nations and across organizations, whether it concentrates in a few firms and a few countries or whether it can be built as something more federated, where smaller players have genuine capacity and the value they create flows back to them. The choices we are now making about how AI is organized, at the model layer, the protocol layer, and the infrastructure layer, are also choices about how economic activity will be organized for at least a generation. If we manage to get that architecture right for AI, it may give us a working pattern for the thing we&#8217;ve so far failed to get right for trade. If we get it wrong, we&#8217;ll most likely reproduce, at the level of intelligence itself, the same concentration that free trade has produced in goods and the existing internet platforms produced online.</p>



<p class="wp-block-paragraph">The technology wants to be everywhere at once. The people who live with its consequences want some say over it where they are. The infrastructure that resolves that tension will be a federation of models, a federation of protocols and code, and a federation of capacity. We need an architecture of participation all the way down the stack, and all the way up.</p>



<p class="wp-block-paragraph"><em>The final section of this piece benefited greatly from questions and comments raised by Ilan Strauss and <a href="https://www.oreilly.com/people/mike-loukides/" target="_blank" rel="noreferrer noopener">Mike Loukides</a>, as well as from previous conversations with Richard Danzig.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/ai-sovereignty-and-the-architecture-of-participation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>SaaS Is Not Dead Yet</title>
		<link>https://www.oreilly.com/radar/saas-is-not-dead-yet/</link>
				<comments>https://www.oreilly.com/radar/saas-is-not-dead-yet/#respond</comments>
				<pubDate>Mon, 01 Jun 2026 11:01:35 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18822</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/SaaS-is-not-dead-yet.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/SaaS-is-not-dead-yet-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[With the rise of agents, many people have been proclaiming that the age of software as a service (SaaS) is over. Who needs to subscribe to a service when you can create your own software with a few English-language prompts and a few dollars spent on tokens? Your own software, most likely a skill that [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">With the rise of agents, many people have been proclaiming that the age of software as a service (SaaS) is over. Who needs to subscribe to a service when you can create your own software with a few English-language prompts and a few dollars spent on tokens? Your own software, most likely a skill that runs in an agent, will have exactly the features you want: no more, no less.</p>



<p class="wp-block-paragraph">But whenever someone talks about the death of SaaS, there’s something wrong with the picture. What’s wrong is simple: Work is about groups and teams, and so far, programming with agents is about individuals. A related challenge is that SaaS companies are good at building dashboards and generating reports for humans, but agents need the raw data, not a representation of the data.</p>



<p class="wp-block-paragraph">Think about the teamwork required for a good sales team. Someone needs a database to keep track of their customer info. It’s easy to get Claude, Gemini, or GPT to build that, using SQLite for a backend and putting a reasonable web frontend on it. You could also do that fairly quickly with Ruby on Rails, but AI makes it even easier. But what about the salesperson at the next desk? She needs similar CRM software, and she can create it with Claude, Gemini, or GPT. No problem. But it won’t be exactly the same; it will reflect her needs and preferences. Soon you have a team of salespeople in which everyone has their own personal CRM. They’re all similar, but slightly different. They may use different backends (FileMaker, SQLite, MySQL, or maybe a corporate Oracle instance); they have similar-but-slightly-different schemas (one has a single field for customer address, another has separate street, city, state, and country fields); and they don’t interoperate.</p>



<p class="wp-block-paragraph">That’s the simplest possible case. How do you generate company-wide reports if everyone has their own version of the data? How do you know if you’re succeeding or failing if everyone on the team has their own version of the metrics? Everyone has become their own silo.</p>
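
<p class="wp-block-paragraph">A minimal sketch in Python may make the reporting problem concrete. It uses the standard sqlite3 module; the table names, column names, and sample customers are all hypothetical, invented for illustration. Two personal CRMs store the same kind of data under different schemas, and even a trivial report (customers per state) needs schema-specific logic plus a fragile guess about how one person formats addresses.</p>

```python
import sqlite3

# Two personal CRMs with similar-but-different schemas (all names hypothetical).
crm_a = sqlite3.connect(":memory:")
crm_a.execute("CREATE TABLE customers (name TEXT, address TEXT)")
crm_a.execute(
    "INSERT INTO customers VALUES (?, ?)",
    ("Acme Corp", "1 Main St, Springfield, IL, USA"),
)

crm_b = sqlite3.connect(":memory:")
crm_b.execute(
    "CREATE TABLE customers (name TEXT, street TEXT, city TEXT, state TEXT, country TEXT)"
)
crm_b.execute(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    ("Globex", "9 Elm Ave", "Portland", "OR", "USA"),
)


def customers_by_state(conn):
    """Count customers per state; each schema needs its own strategy."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(customers)")}
    if "state" in columns:
        states = [row[0] for row in conn.execute("SELECT state FROM customers")]
    else:
        # Fragile guess: assumes every address is "street, city, state, country".
        rows = conn.execute("SELECT address FROM customers")
        states = [row[0].split(",")[2].strip() for row in rows]
    counts = {}
    for state in states:
        counts[state] = counts.get(state, 0) + 1
    return counts


# A company-wide report has to merge the per-CRM results by hand.
report = {}
for conn in (crm_a, crm_b):
    for state, n in customers_by_state(conn).items():
        report[state] = report.get(state, 0) + n
print(report)  # {'IL': 1, 'OR': 1}
```

<p class="wp-block-paragraph">With two CRMs this is an annoyance. With one per salesperson, every new schema adds another branch, and every free-text address format adds another way for the report to be silently wrong.</p>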



<p class="wp-block-paragraph">The company is not paying subscription fees to a vendor like Salesforce, but is this really progress? If anything, we need to make sharing data and metrics easier, not more difficult. On top of that, a product like Salesforce has hundreds of features. Most people don’t need most of them, but there’s a good chance that almost everyone needs one feature that nobody else needs. And there are always the features you don’t know you need: ways to get value from data that you haven’t thought of. There’s value in buying a bundle that goes beyond your immediate requirements.</p>



<p class="wp-block-paragraph">There’s certainly a lot that’s good about enabling people to develop their own tools. I guarantee that if we’d had Claude Code 30 years ago, I would have vibe-coded my own skills for managing the authors I was working with. I would have vibe-coded some of the crazy tools I wrote to translate from one document format to another. (WordPerfect to troff? Why?) Now that we have agentic programming, I may never write my own tools again. But the SaaS scenario highlights something missing from the agentic picture. We don’t have tools for sharing or collaboration. Nobody buys a Salesforce subscription for themselves. It’s a departmental or corporate resource, shared among many people. And the ability to share easily is precisely what agentic programming lacks. I’ve built some of my own Claude tools and skills, but it’s very difficult to share them with other people at O’Reilly. <a href="https://www.linkedin.com/posts/openai-for-business_today-were-introducing-skills-in-beta-for-activity-7435743335107084288-yHR9/">ChatGPT Skills for Business and Enterprise</a> hints at the ability to share skills among team members and some ability to generate them collaboratively, though it’s hard to find evidence that it delivers. I think we’re seeing a symptom of technological overreach. It’s easy to assume something is &#8220;easy&#8221; when it isn’t: &#8220;You just generate a .md file and put it in the corporate GitHub.&#8221; That process has a lot of friction, particularly for users who aren’t technical.</p>



<p class="wp-block-paragraph">To make skills really useful across a company, we need:</p>



<ul class="wp-block-list">
<li><strong>Sharing.</strong> This can be a Git server that’s registered as a private marketplace and then configured via a corporate administrative dashboard. Publishing skills to the marketplace would remain the province of Git-aware users, and that’s a problem.</li>



<li><strong>Requirements.</strong> We don’t want everyone to build a personal toolset; that’s the problem we’re trying to solve. How do you resolve differences between users who want slightly different things? What does the PRD for a skill look like?</li>



<li><strong>Collaboration.</strong> Aside from Google Docs, the current state of widely used collaboration tools is poor. Suffice it to say that working on different branches of a Git repo and merging changes may work for professional programmers, but not for anyone else.</li>



<li><strong>Testing.</strong> Tests and evals for agents (related, but not the same) are topics that we don’t yet understand well. But if you’re going to empower users to use and create agentic tools for creating projections and writing reports, you need to know they won’t backfire. Skills also behave like any other AI application: They drift over time. Even after they’re published, they need to be evaluated regularly to see if they still perform correctly.</li>



<li><strong>Versioning.</strong> Agentic tools and skills are software, even if they’re written in English, and like any software they will need to be updated as requirements change and as LLM behavior drifts. It’s important to keep track of versions and to make it easy for users to update their skills to the latest version. Again, this is a matter of wrapping Git appropriately for nontechnical users.</li>



<li><strong>Security.</strong> Security for intelligent agents is still poorly understood. We know about prompt injection, but we also know that it’s a problem that can’t be solved yet. And attackers are still finding novel ways to inject malicious prompts. What vulnerabilities might agentic skills and tools have if they can access corporate data?</li>
</ul>



<p class="wp-block-paragraph">While the democratization of programming doesn’t threaten SaaS companies, intelligent agents pose a deeper challenge. In “<a href="https://asimovaddendum.substack.com/p/the-salesforce-of-agents-wont-be" target="_blank" rel="noreferrer noopener">The Salesforce of Agents Won’t Be Salesforce, the Google of Agents Won’t Be Google</a>,” Jesus Rodriguez points out that the future for services like Salesforce and Google isn’t web UIs and dashboards; it’s APIs that are designed for agents. These APIs require a different kind of data: not something that a human can glance at to get a quick feel for what’s happening, but “structured state, task objectives, relationship graphs, permissioned memory, machine-readable sales playbooks, and reliable APIs for updating intent.” Humans need the data compression that you get from a dashboard. Agents want the data itself, and they’ll take care of the compression. SaaS companies can become the system of record that is responsible for delivering accurate data. What they need to recognize is that their real customer may not be a human user; the customer will be an agent, and that will affect everything from marketing strategy and product design to pricing.</p>
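
<p class="wp-block-paragraph">The difference between data compressed for a human and data delivered to an agent can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields, the stage names, and both function names are hypothetical, and a real agent-facing API would also need the permissions, task objectives, and relationship data that Rodriguez lists.</p>

```python
import json

# Hypothetical system-of-record data; field names are invented for illustration.
opportunities = [
    {"id": "opp-1", "account": "Acme Corp", "stage": "negotiation", "amount": 40000},
    {"id": "opp-2", "account": "Globex", "stage": "prospecting", "amount": 15000},
    {"id": "opp-3", "account": "Acme Corp", "stage": "closed_won", "amount": 25000},
]


def dashboard_summary(records):
    """What a human-facing dashboard shows: a lossy, compressed view."""
    open_records = [r for r in records if r["stage"] != "closed_won"]
    return {
        "open_deals": len(open_records),
        "open_pipeline": sum(r["amount"] for r in open_records),
    }


def agent_payload(records):
    """What an agent-facing API might return: the records themselves, as JSON."""
    return json.dumps({"opportunities": records}, sort_keys=True)


# An agent can rebuild the dashboard from the payload, but not the reverse.
payload = json.loads(agent_payload(opportunities))
summary = dashboard_summary(payload["opportunities"])
print(summary)  # {'open_deals': 2, 'open_pipeline': 55000}
```

<p class="wp-block-paragraph">The asymmetry is the point: The summary can always be derived from the records, but nothing in the two summary numbers tells an agent which account to follow up with.</p>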



<p class="wp-block-paragraph">I wouldn’t claim that Salesforce or Google can’t or won’t build APIs to help companies access their own data. SaaS remains relevant, but it’s a different kind of SaaS than we have now. Companies like Salesforce know what data is available and how to work with it. Designing and building the data infrastructure that’s needed to provide next-generation SaaS isn’t trivial, and doing the programming in English rather than C++ doesn’t make it easier. Companies like Salesforce and Google know what needs to be built. They’re likely to offer their own collections of agentic skills as a starting point, alongside APIs. But large, established companies are ripe to be blindsided if they move slowly—and it’s difficult for large institutions to move quickly.</p>



<p class="wp-block-paragraph">SaaS companies have momentum—or inertia, which to a physicist is the same thing. They have to change, but they aren’t threatened by AI, agents, and user-defined skills. Providing APIs that have been designed to provide data in formats that machines can use should be an obvious next step. If they die, it will be because they don’t adapt. But there’s nothing new about that.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/saas-is-not-dead-yet/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Open Source Ecosystems</title>
		<link>https://www.oreilly.com/radar/open-source-ecosystems/</link>
				<comments>https://www.oreilly.com/radar/open-source-ecosystems/#respond</comments>
				<pubDate>Fri, 29 May 2026 11:00:08 +0000</pubDate>
					<dc:creator><![CDATA[Ilan Strauss]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18814</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Open-source-ecosystems.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Open-source-ecosystems-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[When open strategy meets private tactics]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Asimov&#8217;s Addendum Substack and is being reposted here with the author&#8217;s permission. Bill Gurley&#160;has an excellent article on what he calls&#160;open source strategy,&#160;which we recommend reading. There is a lot to debate about his concluding argument in particular: that open-weight models are central to keeping the AI market [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on the</em> <a href="https://asimovaddendum.substack.com/p/open-source-ecosystems" target="_blank" rel="noreferrer noopener">Asimov&#8217;s Addendum</a> <em>Substack and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph"><a href="https://p3institute.substack.com/p/from-open-source-software-to-open" target="_blank" rel="noreferrer noopener">Bill Gurley</a>&nbsp;has an excellent article on what he calls&nbsp;<em>open source strategy,&nbsp;</em>which we recommend reading. There is a lot to debate about his concluding argument in particular: that open-weight models are central to keeping the AI market rent-free. The limits of open-weight AI as the primary open source strategy are surely considerable though, if it still requires expensive hardware to run on, and&nbsp;<a href="https://www.oreilly.com/pub/a/tim/articles/architecture_of_participation.html" target="_blank" rel="noreferrer noopener">if the architecture ultimately remains monolithic</a>—rather than composable and protocol-centric.</p>



<p class="wp-block-paragraph">A related consideration comes from Anthropic’s <a href="https://www.anthropic.com/news/anthropic-acquires-stainless" target="_blank" rel="noreferrer noopener">recent acquisition of Stainless</a>, a startup that generates SDKs, command-line tools, and MCP servers from API specifications. This illustrates that open protocols like MCP, even when publicly governed,<sup data-fn="6732a4b0-bcdf-41ae-a355-761cc861ab6b" class="fn"><a href="#6732a4b0-bcdf-41ae-a355-761cc861ab6b" id="6732a4b0-bcdf-41ae-a355-761cc861ab6b-link">1</a></sup>&nbsp;remain exposed at their complementary layers to private actors capturing rents. (Protocol openness does not eliminate this exposure; it probably enables it, by enabling market growth.)</p>



<p class="wp-block-paragraph">We asked Claude to analyze this acquisition, going beyond the press releases. Its first pass overstated parts of the competitive-denial story; what follows is what survived a closer look:</p>



<ol class="wp-block-list">
<li><strong>Complement capture, not protocol capture.</strong>&nbsp;MCP—the standard that lets AI agents talk to other software—remains open, and its governance has been handed to an independent foundation. What Anthropic bought is the company that turned that standard into something most developers could actually use.&nbsp;<em>Stainless was the dominant tool for taking an ordinary business API</em>&nbsp;(say, a hotel booking system or a customer database) and converting it into something an AI agent could call through MCP. The open standard is still open. The path most developers walked to use it has now been bought.<br></li>



<li><strong>This isn’t a one-off—the whole layer is consolidating.</strong>&nbsp;Stainless wasn’t alone in this market. Its main competitor, Fern, was<a href="https://buildwithfern.com/post/stainless-pricing-alternatives" target="_blank" rel="noreferrer noopener">&nbsp;bought by Postman in January 2026</a>. Anthropic bought Stainless four months later, in May 2026. That leaves&nbsp;<a href="https://www.speakeasy.com/" target="_blank" rel="noreferrer noopener">Speakeasy</a>&nbsp;as the only major independent player, plus an open-source fallback called&nbsp;<a href="https://openapi-generator.tech/" target="_blank" rel="noreferrer noopener">OpenAPI Generator</a>&nbsp;that most developers consider too rough for production use without significant manual work. In under five months, two of the three serious companies in this part of the market have been absorbed into larger platforms.&nbsp;<em>The Stainless deal is more visible because of who bought it and why, but the broader pattern matters more: an entire layer of AI infrastructure is being pulled inside platform owners</em>.<br></li>



<li><strong>Moat migration.</strong> The gap in raw model capability between Anthropic, OpenAI, and Google has narrowed considerably and continues to close, and the implication is that model quality alone is unlikely to be the principal basis of competitive advantage over the next two years. What may distinguish the leading firms instead <em>is the quality of the developer experience around their models: how easily a business or an engineer can build something useful on top of a given model, how cleanly the tooling integrates with existing systems, and how reliable the connectors are over time.</em>

<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Stainless was founded by Alex Rattray, formerly of Stripe.&nbsp;<em>Stripe built its market position largely on unusually well-designed developer tools</em>, and Stainless was, in effect, an attempt to apply the same approach to the layer between AI APIs and the rest of the software economy. Anthropic has acquired the team that knows how to do this.</p>
</blockquote>
</li>



<li><strong>Pricing logic, with caveats on denial.</strong>&nbsp;Stainless was last valued at&nbsp;<a href="https://www.analyticsinsight.net/news/anthropic-acquires-stainless-for-over-300m-to-strengthen-ai-sdk-and-tool-access" target="_blank" rel="noreferrer noopener">$150M in December 2025</a>; at &gt;$300M five months later, this is a roughly 2x strategic markup, not acqui-hire arithmetic. Removing a critical-path external dependency on Anthropic’s own SDKs, while denying it to a tight set of competitors, is rational at that price—but the denial logic is partial.&nbsp;<em>Speakeasy is a viable substitute, and OpenAI was reportedly already migrating off Stainless. The friction tax falls hardest on smaller players who lack the engineering bench to absorb migration cost</em>.</li>
</ol>



<p class="wp-block-paragraph">…The press release calls it “extending reach”; the <em>InfoWorld</em> read—“last-mile developer experience”—is closer, but the complement-capture component, even if partial, is real.</p>



<p class="wp-block-paragraph">-*-</p>



<p class="wp-block-paragraph">Now, while Claude might be overstating some of the market risks associated with this acquisition (you tell us?), it shows that open source’s impacts are highly conditional on its dependencies and should never be analyzed in isolation from the market’s software stack and architecture. This is as true for open-weight models, which depend on data, compute, and distribution, as it is for open protocols like MCP, which depend on constant API translations and access. Tracking those interdependencies is what a full ecosystem view involves. It helps identify where chokepoints might arise and, in turn, where&nbsp;<em>open source strategy</em>&nbsp;might eventually fail or be captured.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="6732a4b0-bcdf-41ae-a355-761cc861ab6b">In this case by the<a href="https://www.linuxfoundation.org/press/agentic-ai-foundation" target="_blank" rel="noreferrer noopener"> Agentic AI Foundation under the Linux Foundation</a> <a href="#6732a4b0-bcdf-41ae-a355-761cc861ab6b-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/open-source-ecosystems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Your AI Agent Already Forgot Half of What You Told It</title>
		<link>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/</link>
				<comments>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/#respond</comments>
				<pubDate>Thu, 28 May 2026 10:59:36 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18803</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Your-AI-agent-already-forgot-half-of-what-you-told-it.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Your-AI-agent-already-forgot-half-of-what-you-told-it-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[How to keep agents and skills from losing track mid-workflow]]></custom:subtitle>
		
				<description><![CDATA[This is the seventh article in a series on agentic engineering and AI-driven development.&#160;Read part one&#160;here, part two&#160;here, part three&#160;here, part four&#160;here, part five&#160;here, and part six here. This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>This is the seventh article in a series on agentic engineering and AI-driven development.&nbsp;Read part one&nbsp;<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two&nbsp;<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three&nbsp;<a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, part four&nbsp;<a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">here</a>, part five&nbsp;<a href="https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/" target="_blank" rel="noreferrer noopener">here</a></em>, <em>and part six <a href="https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
</blockquote>



<p class="wp-block-paragraph">This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of a turn I wasn&#8217;t expecting.</p>



<p class="wp-block-paragraph">In my <a href="https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/" target="_blank" rel="noreferrer noopener">last article</a> I talked about context and context management, and I promised to give you some real practical tips for managing it. This article was originally meant to be about specific, practical context management techniques that were really helpful to me building <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a> and the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, two open source projects where I work with AIs to plan and orchestrate all of the work and every line of code is written by AI tools like Claude Code and Cursor.</p>



<p class="wp-block-paragraph">But as I was writing this, I found that I&#8217;d adapted those same techniques to my work writing articles like this one. Which is surprising! I&#8217;ve been doing all this work finding ways to help people developing AI skills improve context management, so their skills run more efficiently. It turns out that those same exact techniques apply to anyone using AI tools, even when you&#8217;re using chatbots like Claude.ai or ChatGPT.</p>



<p class="wp-block-paragraph">Full disclosure: I use multiple AI tools to manage this article series. My primary tools are Claude Cowork for brainstorming and managing my article research, notes, and backlog and Gemini&#8217;s mobile app for reading drafts aloud and taking my notes while I&#8217;m away from my desk. And I want to tell you about something that happened while I was using those tools, because I think it really helps show why context management isn&#8217;t just a problem for developers.</p>



<p class="wp-block-paragraph">While I was writing this article, I was using Gemini&#8217;s mobile app to read the draft aloud and take my notes. Partway through the session I asked it to go back and check whether there were earlier notes it hadn&#8217;t incorporated yet. It told me it didn&#8217;t have access to the previous notes, which seemed weird and insane, since we had <em>just taken those notes a few prompts earlier in the session</em>. I could scroll back up and see them earlier in the conversation, but somehow it didn&#8217;t &#8220;know&#8221; about them.</p>



<p class="wp-block-paragraph">Here&#8217;s what happened. Gemini had compacted our conversation without telling me, and the notes from the first half of the session were just&#8230; gone.</p>



<p class="wp-block-paragraph">If you&#8217;ve ever had a web chat AI just seem to forget things you talked about earlier, you&#8217;ve experienced context compaction, just like I did. Understanding even the basics of context and context windows can make a big difference in preventing that kind of frustration.</p>



<p class="wp-block-paragraph">This all reminded me of something I wrote more than two decades ago in <em><a href="https://learning.oreilly.com/library/view/applied-software-project/0596009488/" target="_blank" rel="noreferrer noopener">Applied Software Project Management</a></em> (back in 2005!): &#8220;Important information is discovered during the discussion that the team will need to refer back to during the development process, and if that information is not written down, the team will have to have the discussion all over again.&#8221;</p>



<p class="wp-block-paragraph">Jenny Greene and I wrote that about human teams and project meetings, but it applies to AI sessions just as well.</p>



<p class="wp-block-paragraph">Which brings me back to context, which I wrote about in my last article, and which I&#8217;ll write more about in the next one, because it&#8217;s one of the most important concepts to keep top of mind when working with AI.</p>



<h3 class="wp-block-heading"><strong>Context loss may be invisible, but that doesn&#8217;t make it any less frustrating</strong></h3>



<p class="wp-block-paragraph"><strong>Context</strong> is everything the AI is holding in its working memory during a conversation: what you&#8217;ve told it, what it&#8217;s told you, any files or instructions it&#8217;s read, and whatever internal notes the system has made along the way. All of that lives in a fixed-size <strong>context window</strong>—think of that as your AI&#8217;s short-term memory, the stuff it&#8217;s thinking about right now—and when the window fills up, the AI has to start letting things go. Different tools handle this differently: Some truncate older messages, some compress the conversation into a summary (which means details get lost even though the summary looks complete), and some just start behaving inconsistently so you can&#8217;t tell whether the AI forgot something or never understood it in the first place. The result is the same: The AI loses track of things you told it, decisions you made together, or details it noticed earlier in the session. And it won&#8217;t tell you it forgot. It&#8217;ll just keep generating confident-sounding output based on whatever it still has.</p>
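
<p class="wp-block-paragraph">The truncation case is the easiest one to picture in code. What follows is a toy model in Python, not how any particular tool works: the class name and its methods are hypothetical, it counts words where real tools count tokens, and it only drops old messages where real tools may summarize them instead. It does show the behavior that matters here, which is that the oldest notes vanish without any error or warning.</p>

```python
from collections import deque


class TruncatingContext:
    """Toy context window that silently drops the oldest messages when full."""

    def __init__(self, max_words):
        self.max_words = max_words  # stand-in for a real token budget
        self.messages = deque()

    def add(self, message):
        self.messages.append(message)
        # Evict from the front until the window fits; nothing reports the loss.
        while self._size() > self.max_words and len(self.messages) > 1:
            self.messages.popleft()

    def _size(self):
        return sum(len(m.split()) for m in self.messages)

    def remembers(self, phrase):
        return any(phrase in m for m in self.messages)


ctx = TruncatingContext(max_words=12)
ctx.add("Note one: the deadline is Friday")    # 6 words, window holds 6
ctx.add("Note two: use the long title")        # 6 words, window holds 12
ctx.add("Please read the next section aloud")  # 18 words total, so note one is evicted

print(ctx.remembers("deadline"))    # False
print(ctx.remembers("long title"))  # True
```

<p class="wp-block-paragraph">Nothing in the third call signals that the first note was lost. The only way to find out is to ask about it later, which is exactly what happened in my Gemini session.</p>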



<p class="wp-block-paragraph">Before we dive in a little deeper, I want to do a quick jargon check. If you&#8217;ve seen the terms &#8220;skills&#8221; and &#8220;agents&#8221; floating around but aren&#8217;t sure what they are, think of skills as libraries for AIs and agents as interactive executables. Those aren&#8217;t perfectly precise definitions, but if you&#8217;re a developer they&#8217;re close enough for this discussion.</p>



<p class="wp-block-paragraph">When you&#8217;re coding skills and agents, you run into context problems quickly. The work you&#8217;re asking the AI to do is often complex enough that the context window fills up, and the AI has to start compacting: compressing or dropping older parts of the conversation to make room for new ones. Compaction always seems to happen at the most frustrating and inconvenient time, which makes sense when you think about it. You hit context limits precisely when you&#8217;ve put the most information into the conversation, which is exactly when losing that information costs you the most.</p>



<p class="wp-block-paragraph">That&#8217;s why I think it can often help to think of AIs as having the same shortcomings that human teams do, except those shortcomings are exaggerated by their AI nature. A person who forgets something from a meeting last week might remember it when you remind them. An AI that lost something to context compaction won&#8217;t, because the information is gone. But there&#8217;s something you can do about it, and it turns out the techniques that help are the same whether you&#8217;re building autonomous AI skills or just trying to get a chatbot to remember what you told it 20 minutes ago.</p>



<p class="wp-block-paragraph">I&#8217;ve landed on four techniques that I come back to over and over again. Each one exists because at some point the AI forgot something important and I responded by putting that thing in a file where it couldn&#8217;t be forgotten. None of them require special tooling. And to my surprise, all of these techniques have turned out to be useful for both building software and managing a writing project like this one, whether I&#8217;m chatting with Claude, ChatGPT, or Gemini, or using a desktop tool like Claude Cowork or Codex. These are the techniques I find most valuable:</p>



<ul class="wp-block-list">
<li><strong>Split discovery from documentation:</strong> Don&#8217;t ask the AI to figure something out and produce polished output in the same pass.</li>



<li><strong>Use handoff documents, not continuation prompts:</strong> Before closing a stale session, have the AI write down everything the next session needs to know.</li>



<li><strong>Give the AI an acceptance criterion, not a procedure:</strong> Tell it what &#8220;done&#8221; looks like instead of spelling out the steps.</li>



<li><strong>Use spec documents as the bridge between AI tools:</strong> Make a shared document the single source of truth that all your tools read from.</li>
</ul>



<h3 class="wp-block-heading"><strong>Split discovery from documentation</strong></h3>



<p class="wp-block-paragraph">When you ask an AI to do something complex, you&#8217;re often asking it to do two things at once without realizing it. You&#8217;re asking it to figure something out and produce polished output at the same time. The problem is that figuring things out takes attention, and producing output takes attention, and the model only has so much of it. When you combine both tasks in the same prompt, the model starts cutting corners on one of them, and you can&#8217;t tell which one it shortchanged.</p>



<p class="wp-block-paragraph">I ran into this with the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open source AI coding skill I built that runs structured code reviews against any codebase. One of the things it does is derive requirements from source code: It reads through the code, identifies what the code promises to do (I call these behavioral contracts), and then produces a requirements document. Originally this all happened in a single pass. The problem was that single-pass requirement generation ran out of attention after about 70 requirements. The model forgot behavioral contracts it had noticed earlier in the code, and the forgetting was completely invisible. There was no stack trace or error message, just incomplete output and no way to know what was missing. I fixed it by splitting the work into two separate prompts:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Read each source file and write down every behavioral contract you observe as a simple list in CONTRACTS.md.</em></p>



<p class="wp-block-paragraph"><em>Read CONTRACTS.md and the documentation, then derive requirements from them and write REQUIREMENTS.md.</em></p>
</blockquote>



<p class="wp-block-paragraph">Then a third pass checks whether every contract has a corresponding requirement, and if there are gaps, goes back to step one for the files with gaps.</p>



<p class="wp-block-paragraph">The key idea is that CONTRACTS.md is external memory. When the model &#8220;forgets&#8221; about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap. You can see what was forgotten and fix it.</p>
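
<p class="wp-block-paragraph">To make &#8220;greppable gap&#8221; concrete, here is a minimal Python sketch of the kind of coverage check the third pass performs. It assumes that each contract in CONTRACTS.md carries an ID such as C-001 and that each requirement cites the IDs it covers. That ID convention, the function names, and the sample data are hypothetical, invented for illustration; they are not the Quality Playbook&#8217;s actual format.</p>

```python
import re

# Hypothetical convention (an assumption, not the playbook's format):
# contract IDs look like "C-001".
CONTRACT_ID = re.compile(r"\bC-\d+\b")


def contract_ids(text: str) -> set[str]:
    """Return every contract ID mentioned anywhere in the text."""
    return set(CONTRACT_ID.findall(text))


def uncovered_contracts(contracts_md: str, requirements_md: str) -> list[str]:
    """List contract IDs recorded in CONTRACTS.md that no requirement cites."""
    missing = contract_ids(contracts_md) - contract_ids(requirements_md)
    return sorted(missing)


if __name__ == "__main__":
    # Made-up sample content standing in for the two files.
    contracts = """\
- C-001 (parser.py): rejects empty input with ValueError
- C-002 (parser.py): preserves key order
- C-003 (cache.py): evicts the oldest entry first
"""
    requirements = """\
REQ-1: Empty input is rejected. (covers C-001)
REQ-2: Eviction is oldest-first. (covers C-003)
"""
    for cid in uncovered_contracts(contracts, requirements):
        print(f"no requirement covers {cid}")
```

<p class="wp-block-paragraph">Because the check only compares two files on disk, it gives the same answer no matter how much of the original session the model still remembers.</p>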



<p class="wp-block-paragraph">The principle: Don&#8217;t ask the AI to figure out what exists and write formatted output in the same pass. The model runs out of attention trying to do both at once. Whenever you&#8217;re asking an AI to do something complex, consider whether you&#8217;re actually asking it to do two things at once. &#8220;Analyze this codebase and write a report&#8221; is two tasks. &#8220;Read this document and suggest improvements&#8221; is two tasks. Split them, and let the first pass write its observations to a file before the second pass starts working with them.</p>



<h3 class="wp-block-heading"><strong>Use handoff documents, not continuation prompts</strong></h3>



<p class="wp-block-paragraph">Anyone who&#8217;s spent a long session with an AI coding tool has felt the moment when the context starts to go stale. The AI stops tracking details it was handling fine an hour ago, or it contradicts something it said earlier. The session gets slow, and you&#8217;re often restarting because the AI seems bogged down, its context filled up with everything you&#8217;ve told it. You get the sense that if you keep going, you&#8217;re going to spend more time correcting it than making progress.</p>



<p class="wp-block-paragraph">Most developers respond to a session getting too long in one of two ways: They push through in the same session, or they start a fresh session and try to reexplain everything from scratch. Both approaches can cause the AI to lose context. The first loses it to compaction; the second loses it to incomplete reexplanation. And both are frustrating, precisely because you just spent so much time building up all that context with the AI.</p>



<p class="wp-block-paragraph">There&#8217;s a third option. Before you close the session, ask the AI to write a handoff document: a file that captures everything the next session needs to know, written while the current session still has full context. The key is that you&#8217;re asking the AI to write this while the relevant details are still fresh in the working context, and in a way that it or another AI can read.</p>



<p class="wp-block-paragraph">I built this into the Quality Playbook as a core part of how phases communicate. When I split the playbook from a single prompt to independent phases, I needed each phase to run as a completely independent session with no context carryover. So each phase got its own kickoff prompt as a standalone file. Here&#8217;s the structure each one follows:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Write a handoff document that a fresh session could use to pick up this work cold. Include everything it would need to know.</em></p>
</blockquote>



<p class="wp-block-paragraph">Every kickoff opens with what prior phases accomplished, includes explicit boundaries about what&#8217;s frozen, and names which future phase owns each piece of remaining work. Those boundaries matter, because without them the AI will helpfully start doing Phase 3 work while you&#8217;re still in Phase 2. Each phase also ends with a required forward-looking handoff where the completing agent writes down what the next session needs to know.</p>



<p class="wp-block-paragraph">The principle: Each handoff is a complete state snapshot. The incoming AI agent never needs to read prior kickoff prompts or chat history. Everything it needs is in the current handoff file: current state, uncommitted changes, immediate next task, pending tasks, file locations, and anything that was discovered during the prior session. A fresh AI session can pick it up cold.</p>
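
<p class="wp-block-paragraph">One way to picture that snapshot is as a small data structure with one field per item in the list above. The Python sketch below is an illustration under assumptions: the class name, field names, and sample values are hypothetical, and the Quality Playbook&#8217;s real handoff files are written by the AI as prose, not generated from a class like this.</p>

```python
from dataclasses import dataclass, field, fields


@dataclass
class Handoff:
    """State snapshot a fresh session reads instead of chat history."""

    current_state: str
    uncommitted_changes: list[str] = field(default_factory=list)
    immediate_next_task: str = ""
    pending_tasks: list[str] = field(default_factory=list)
    file_locations: list[str] = field(default_factory=list)
    discoveries: list[str] = field(default_factory=list)

    def empty_sections(self) -> list[str]:
        """Names of sections left blank, so gaps are visible before handing off."""
        return [s.name for s in fields(self) if not getattr(self, s.name)]

    def to_markdown(self) -> str:
        """Render every section, marking blank ones explicitly as '(none)'."""
        lines = ["# Handoff"]
        for section in fields(self):
            value = getattr(self, section.name)
            title = section.name.replace("_", " ").capitalize()
            lines.append(f"\n## {title}")
            if isinstance(value, list):
                body = [f"- {item}" for item in value]
            else:
                body = [value] if value else []
            lines.extend(body or ["(none)"])
        return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example values.
    handoff = Handoff(
        current_state="Phase 2 complete; REQUIREMENTS.md is frozen.",
        immediate_next_task="Start the Phase 3 review.",
        file_locations=["CONTRACTS.md", "REQUIREMENTS.md"],
    )
    print(handoff.to_markdown())
    print("Blank sections:", handoff.empty_sections())
```

<p class="wp-block-paragraph">Rendering blank sections as &#8220;(none)&#8221; instead of omitting them is a deliberate choice in this sketch: the incoming session can tell the difference between &#8220;nothing pending&#8221; and &#8220;nobody wrote it down.&#8221;</p>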



<p class="wp-block-paragraph">If you&#8217;re deep into a Claude Code or Copilot session and you can feel the context getting stale, ask the AI to write a handoff document before you close the session. Tell it to include everything a fresh session would need to continue the work. Then start a new session and point it at that file. A fresh session with a good handoff document will usually outperform a stale session, because it&#8217;s starting with clean context instead of compacted, fragmented context.</p>



<h3 class="wp-block-heading"><strong>Give the AI an acceptance criterion, not a procedure</strong></h3>



<p class="wp-block-paragraph">When you give an AI a multistep task, the natural instinct is to spell out the steps. First do this, then do that, then combine the results. The problem is that step-by-step procedures are the first thing the AI forgets when the context window fills up. It&#8217;ll skip steps, merge phases, or quietly drop tasks, and there&#8217;s nothing in the procedure itself that would help the AI notice what it missed. The procedure tells the AI what to do, but it doesn&#8217;t tell the AI what &#8220;done&#8221; looks like.</p>



<p class="wp-block-paragraph">I learned this the hard way with the Quality Playbook. The playbook runs multiple iteration passes over a codebase, and the results need to be cumulative. It keeps a list of all the bugs it finds in the code being tested in a file called BUGS.md. Early on, I gave the AI a procedure to run four times and then update that file:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>First run the main pass, then run four iteration passes, then merge the findings into BUGS.md.</em></p>
</blockquote>



<p class="wp-block-paragraph">The AI did not respond well to that instruction.</p>



<p class="wp-block-paragraph">It turns out that when you ask an AI to do a very complex task a specific number of times, it can lose count. In fact, from my experimentation, it seems that the count is one of the first casualties of context compaction. Most of the time the AI decided three iterations was enough, or merged findings from only two passes, and no matter how many different ways I tried to rephrase that instruction, there was nothing I could come up with that prevented the problem.</p>



<p class="wp-block-paragraph">However, everything changed when I replaced the &#8220;run four times&#8221; instruction with an <strong>acceptance criterion</strong>, or a specific condition that tells the AI when to stop looping:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>You are done only when BUGS.md contains the cumulative findings from the main run plus all four iteration passes.</em></p>
</blockquote>



<p class="wp-block-paragraph">Even when the AI lost track of intermediate steps, it could check the output against the criterion and know whether it was finished. And I could verify the output against the same criterion, which gave me a way to audit the agent&#8217;s work without watching every step.</p>



<p class="wp-block-paragraph">In developer terms, the AI is really bad at loops like <em>for (i = 0; i &lt; 4; i++)</em> because it loses track of the value of the iterator <em>i</em> when it compacts its context. But it&#8217;s really good at loops like <em>while (!done)</em> because it can check <em>done</em> based on the current state without relying on history.</p>
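
<p class="wp-block-paragraph">The difference can be sketched in a few lines of Python. The check below decides &#8220;done&#8221; purely from what BUGS.md contains right now, with no counter to lose. It assumes each pass records its findings under its own heading (&#8220;Main run,&#8221; &#8220;Iteration 1,&#8221; and so on); that heading convention and the function names are hypothetical, chosen for illustration, and are not how the Quality Playbook actually structures the file.</p>

```python
import re

# Hypothetical convention (an assumption): each pass writes its findings
# under a "## " heading named after the pass.
REQUIRED_PASSES = ["Main run"] + [f"Iteration {n}" for n in range(1, 5)]


def missing_passes(bugs_md: str) -> list[str]:
    """Return the required passes that have no heading in BUGS.md yet."""
    found = re.findall(r"^## (.+)$", bugs_md, flags=re.MULTILINE)
    headings = {heading.strip() for heading in found}
    return [name for name in REQUIRED_PASSES if name not in headings]


def is_done(bugs_md: str) -> bool:
    """Acceptance criterion: findings from every required pass are present."""
    return not missing_passes(bugs_md)


if __name__ == "__main__":
    # Made-up file contents: the main run plus only two iteration passes.
    bugs = (
        "## Main run\n- BUG-1\n\n"
        "## Iteration 1\n- BUG-2\n\n"
        "## Iteration 2\n(no new findings)\n"
    )
    print("done:", is_done(bugs))
    print("still missing:", missing_passes(bugs))
```

<p class="wp-block-paragraph">Run on the sample above, the check reports that iterations 3 and 4 are still missing. Either the agent or a human reviewer can run the same test at any point, which is what makes the criterion auditable.</p>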



<p class="wp-block-paragraph">The principle behind all this is that an acceptance criterion survives context pressure because the AI can always check &#8220;Am I done?&#8221; against a concrete test. This is actually the same principle behind test-driven development: write the test before the code so you know when you&#8217;re done. The acceptance criterion is the test for your AI session. When you&#8217;re giving an AI a task that has multiple steps, don&#8217;t describe the steps. Describe what &#8220;done&#8221; looks like, and let the AI figure out how to get there.</p>



<h3 class="wp-block-heading"><strong>Use spec documents as the bridge between AI tools</strong></h3>



<p class="wp-block-paragraph">Most developers working with AI don&#8217;t use just one tool. You might use Claude for design, Cursor for coding, and Copilot for quick edits. You might even use multiple models inside the same tool, like GPT-5.5 and Opus 4.7 in separate Copilot chats inside VS Code. It&#8217;s common to have one model for coding, another for review, and a third for orchestration and project management. The problem is that none of these tools or chats know what you told the others. Claude doesn&#8217;t know what you decided with Cursor. Two separate Copilot chats in the same editor don&#8217;t share context. You&#8217;re the one carrying context between them, and that&#8217;s exactly the kind of lossy handoff that causes drift. A design decision you made in one conversation gets lost or distorted by the time it reaches the tool that needs to implement it.</p>



<p class="wp-block-paragraph">The fix is to make the spec document the single source of truth that all your AI tools read from. I used this when building a game prototype, where I had Claude handling design and planning and Cursor doing the coding. They never talked to each other directly, so the spec documents served as the shared contract: Claude wrote the specs, and Cursor read them. The rule I followed was simple:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Never tell the AI coder something that isn&#8217;t already in the specs. If you make a design decision in conversation, write it into the spec first, then point the coder at the spec.</em></p>
</blockquote>



<p class="wp-block-paragraph">If I made a design decision in a conversation with Claude, that decision had to be written into the spec before I told Cursor about it. If I discovered something during implementation, I wrote it into the appropriate doc first, then pointed the coder at it. The spec was always the single source of truth. When Claude and I changed the wound topology (removing one wound type, promoting another), we updated the docs first, then told Cursor to reread them. When we decided to add a new UI element, we wrote it into the UI spec first, then told Cursor to reread the doc.</p>



<p class="wp-block-paragraph">The key was including rationale in the specs. Not just &#8220;show 5 progressive labels&#8221; but why: &#8220;The player shouldn&#8217;t be told what they&#8217;re fighting. They should discover it.&#8221; This helps the AI coder make better decisions when the spec doesn&#8217;t cover an edge case because it knows the intent behind the requirement.</p>



<p class="wp-block-paragraph">The principle: The spec document is the shared context that all your tools can read. It prevents the drift that happens when design intent lives only in chat history that the other tool can&#8217;t see. This technique works any time you&#8217;re using more than one AI tool on the same project, which at this point is most projects.</p>



<h3 class="wp-block-heading"><strong>How these techniques combine: Managing this article series</strong></h3>



<p class="wp-block-paragraph">Those four practices came out of AI-driven development work, but they apply to almost any AI work. And while these techniques emerged for me while working on agents and skills, I think it&#8217;s valuable to demonstrate them in a nondevelopment context, so I&#8217;ll share an example from my work on the article series you&#8217;re reading now.</p>



<p class="wp-block-paragraph">Over time, the process for how my AI assistant and I manage this article backlog evolved organically in conversation, but it was never written down anywhere except in the AI&#8217;s context window. Which means every time the session compacted or I started a fresh chat, the process was gone and I had to reexplain it. I caught this when the AI did something slightly wrong and I wanted to confirm we were on the same page. So I asked:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Every time I suggest a new article idea, you add an entry to the backlog, and then create a new markdown file with the source material, right?</em></p>
</blockquote>



<p class="wp-block-paragraph">That&#8217;s split discovery from documentation. I didn&#8217;t say &#8220;document our process.&#8221; I said &#8220;confirm what we do.&#8221; Discovery first, then documentation as a separate step. If I&#8217;d said &#8220;write up our process&#8221; without confirming first, the AI might have written something plausible but wrong, and I wouldn&#8217;t have caught the discrepancy.</p>



<p class="wp-block-paragraph">Once we&#8217;d confirmed the process, I asked the AI to create two files. <strong>AGENTS.md</strong> is an emerging standard for AI-readable project context—a single file that tells any AI session what it needs to know about a project. You can learn more about the convention at <a href="https://agents.md/" target="_blank" rel="noreferrer noopener">agents.md</a>. <strong>CONTEXT.md</strong> serves a similar role as a bootstrapping document—it&#8217;s less established as a standard, but the practice of asking the AI to dump everything it knows into a context file so the next session can pick it up cold has been one of the most valuable habits I&#8217;ve developed. Here&#8217;s the prompt I used:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Update the backlog file to explain what it is and how we maintain it. Create a CONTEXT.md with everything you&#8217;d need to bootstrap a new chat. Create an AGENTS.md to make it easy to bootstrap with a single-line prompt.</em></p>
</blockquote>



<p class="wp-block-paragraph">That prompt is a handoff document. I was explicitly asking the AI to write down everything it knew while it still had full context, specifically because I knew that context would be lost to compaction. The CONTEXT.md file is a handoff from this session to whatever fresh session picks up the work next week.</p>



<p class="wp-block-paragraph">Notice what I didn&#8217;t say. I didn&#8217;t give step-by-step instructions for what should go in those files. I said &#8220;everything you would need to bootstrap this process again in case we lost it&#8221; and &#8220;a complete dump of all of the context you would need to bootstrap a new chat and get it to the point where this current chat is.&#8221; Those are acceptance criteria, not procedures. The AI had to figure out what belonged in those files. If I&#8217;d given it a procedure (&#8220;first write the publication history, then the voice rules, then the file locations&#8221;), it would have followed the list and missed anything I forgot to include. The acceptance criterion is harder to satisfy but more robust: the test is &#8220;Could a fresh session bootstrap from these files alone?&#8221;</p>



<p class="wp-block-paragraph">And the AGENTS.md file itself is a spec document as a bridge between tools. It&#8217;s the shared contract that any AI session, whether it&#8217;s Claude, Gemini, Cowork, or a fresh chat, can read to get aligned with the project. This session wrote it; the next session reads it. The two sessions never communicate directly, so the spec file bridges the gap between them.</p>



<p class="wp-block-paragraph">That&#8217;s all four practices in two prompts, applied to something as ordinary as managing a writing project. It didn&#8217;t require pipelines or codebases or batch orchestration. The practices work because they solve the same underlying problem regardless of the domain: important information living in the AI&#8217;s context window instead of on disk.</p>



<h3 class="wp-block-heading"><strong>Context management is a development skill</strong></h3>



<p class="wp-block-paragraph">Every practice I&#8217;ve described in this article and the last one is something developers have always been told to do: write things down, record your rationale, be deliberate about what you save and what you let go, write ADRs and design docs and inline comments explaining nonobvious choices. We&#8217;ve always known we should do more of it. When you&#8217;re working with AI, the cost of not doing it becomes immediate and visible.</p>



<p class="wp-block-paragraph">The practices in this article all come down to the same thing: putting the important information in files where compaction can&#8217;t touch it, so you can see what the AI knows and verify that it matches reality. In the next article, I&#8217;ll go deeper on the debugging angle: how to use externalized files to understand what your AI is actually doing, with practical techniques that work even if you&#8217;re not building agents but are just using a chatbot.</p>



<p class="wp-block-paragraph"><em>The <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a> is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of <a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener">awesome-copilot</a>.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Get a Good Return on Your AI Investments</title>
		<link>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/</link>
				<comments>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/#respond</comments>
				<pubDate>Wed, 27 May 2026 16:52:37 +0000</pubDate>
					<dc:creator><![CDATA[Louise Corrigan]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18808</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Get-a-good-return-on-your-AI-investments.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Get-a-good-return-on-your-AI-investments-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Takeaways from Sam Newman&#039;s fireside chat with Nathen Harvey, DORA team lead at Google Cloud]]></custom:subtitle>
		
				<description><![CDATA[Last week, we had our first Infrastructure &#38; Ops superstream of 2026, Platform Engineering in the Age of AI. Our speakers explored a range of topics focused on supporting new AI workloads, each with unique infrastructure needs, unpredictable costs, and novel security concerns. Google Cloud’s Abdel Sghiouar took the audience through what a good platform [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Last week, we had our first Infrastructure &amp; Ops superstream of 2026, <a href="https://learning.oreilly.com/live-events/infrastructure-ops-superstream-platform-engineering-in-the-age-of-ai/0642572314507/0642572314491/" target="_blank" rel="noreferrer noopener">Platform Engineering in the Age of AI</a>. Our speakers explored a range of topics focused on supporting new AI workloads, each with unique infrastructure needs, unpredictable costs, and novel security concerns. Google Cloud’s Abdel Sghiouar took the audience through what a good platform for AI looks like, Cockroach Labs’ Jordan Lewis shared lessons learned rolling out a corporate AI platform, Syntasso’s Daniel Bryant outlined a three-layer model for building a good platform, technology leader Sarah Wells discussed the importance of governance and how to make it more manageable, and Thoughtworks’ Ben O&#8217;Mahony explained why evals should be part of your observability story. You can <a href="https://youtu.be/neycwJJmpG0" target="_blank" rel="noreferrer noopener">watch the highlights here</a>.</p>



<p class="wp-block-paragraph">The event concluded with a fireside chat between Sam Newman and Nathen Harvey, who leads the DORA team at Google Cloud. <a href="https://dora.dev/" target="_blank" rel="noreferrer noopener">DORA</a> has been tracking software delivery performance for over a decade, which means they&#8217;ve watched a lot of technology trends come through. Their center of gravity has always been the same question: How quickly and safely can a team move change into a running production application?</p>



<p class="wp-block-paragraph">AI hasn&#8217;t changed that question, although it has made answering it a bit harder. DORA recently released its <a href="https://cloud.google.com/resources/content/dora-roi-of-ai-assisted-software-development" target="_blank" rel="noreferrer noopener"><em>ROI of AI-Assisted Software Development</em> report</a> to show how AI is working for teams right now, and how that may or may not be contributing to organizations’ bottom lines. Nathen used the findings as a jumping-off point to dig into how AI is changing platform engineering and software development as a whole.</p>



<h2 class="wp-block-heading">The productivity gap</h2>



<p class="wp-block-paragraph">Sam started by pointing out one of the biggest headline findings from DORA’s 2025 data: Organizations saw about 10% improvement in terms of actual code shipped to production systems. Even though developers likely felt that they were more productive, that doesn&#8217;t automatically carry through to production. DORA&#8217;s data shows higher throughput alongside higher instability. In other words, teams are shipping more but they’re also more frequently rolling back changes or implementing fixes. The gains at the individual level are real (and 10% is a pretty good number), but those gains aren’t “the dramatic improvements that you find in the headlines.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Productivity Gap with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/9jxMx1yHAZo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">AI amplifies good processes (and bad ones)</h2>



<p class="wp-block-paragraph">Nathen explained that AI is an amplifier and mirror that equally reflects the good and bad. On teams where shipping change is already easy, AI tends to keep things running well. On teams where getting change into production is painful, AI generates <em>more</em> change and makes the existing friction more acute. That said, his read on this outcome is cautiously optimistic: &#8220;If the pain is more acute, we maybe will invest in addressing that pain.&#8221;</p>



<p class="wp-block-paragraph">The rub is that the investment has to actually happen. Nathen noted that in lower-performing organizations, AI tools often arrive with a reset of expectations rather than an invitation to fix the process: Here&#8217;s your new tool. Now we expect more from you. Addressing this problem means reframing the question “Does AI make people more productive?” What we really should be asking is “Under what conditions will AI boost productivity, and who&#8217;s responsible for creating them?” And that falls on the organization, not the technology.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="AI Is an Amplifier and Mirror for Good Processes and Bad with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/5CzvrWpXBHg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Verification isn&#8217;t a checkbox</h2>



<p class="wp-block-paragraph">Trust is a big challenge with generative AI. About 30% of DORA survey respondents trust AI output little or not at all. Around 46% trust it &#8220;somewhat&#8221; (and Nathen is one of them). Despite all the advances in generative AI, these tools still make mistakes, and if you&#8217;ve multiplied your ability to generate code without doing anything to scale your ability to verify it, you&#8217;ve made your situation worse, not better.</p>



<p class="wp-block-paragraph">Nathen called this the verification tax, and it belongs in any honest accounting of AI&#8217;s productivity impact. Pipeline adaptation belongs there too: Is your delivery pipeline fit for purpose given the volume of change you&#8217;re now trying to push through? These costs don&#8217;t show up in the headlines about 10x developer productivity. They show up in your incident reports three months later.</p>



<p class="wp-block-paragraph">DORA recently published an <a href="https://dora.dev/ai/roi/calculator/#staff_size=500&amp;salary=176000&amp;revenue=100000000&amp;downtime_cost_per_hour=100000&amp;current_deployments_per_year=50&amp;current_features_per_year=50&amp;idea_success_rate=0.33&amp;revenue_impact_per_feature=0.005&amp;current_cfr=0.05&amp;current_fdrt=4&amp;time_saved_per_developer=0.125&amp;ai_license_cost_per_user=250&amp;additional_ai_cost_per_user=80&amp;additional_ai_infra_cost=100000&amp;training_cost_per_user=9600&amp;target_deployments_per_year=56&amp;target_features_per_year=56&amp;target_cfr=0.06&amp;j_curve_drop=0.15&amp;j_curve_duration=3" target="_blank" rel="noreferrer noopener">ROI framework and calculator</a> for AI-assisted software development. Nathen was clear that there&#8217;s no universal number to offer, and the calculator doesn&#8217;t pretend otherwise. What it does is give teams a way to model the real costs, including the learning investment, the verification overhead, and the pipeline changes required.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Verification Tax with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/wGYLtVj8z0Q?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Context switching and burnout</h2>



<p class="wp-block-paragraph">With productivity on the upswing, AI-induced burnout is becoming a serious concern. (Steve Yegge calls this the “<a href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163" target="_blank" rel="noreferrer noopener">AI vampire</a>.”) DORA’s data for 2025 showed that AI adoption wasn’t strongly connected with burnout, with the caveat that about 64% of DORA survey respondents said they’d never worked in an agentic workflow. Both of those findings are likely to change significantly in 2026.</p>



<p class="wp-block-paragraph">Nathen highlighted one source of burnout he expects to escalate as agents become the norm: context switching. As he pointed out, software developers spent years arguing for protected focus time to do the deep work that requires them to maintain flow. Agentic workflows are now incentivizing those same developers to voluntarily run a dozen or more agents at once, forcing them to context-switch multiple times every hour. As he joked, “There&#8217;s plenty of research that supports the idea that all of us feel like we&#8217;re pretty good multitaskers and none of us are.” The consequences are coming, and we’re doing it to ourselves.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Burnout Will Go Up, and We’re Doing It to Ourselves with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/ibdw27MxQq0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The cognitive debt question</h2>



<p class="wp-block-paragraph">Sam Newman brought up the related notion of “cognitive debt,” and in particular, Margaret-Anne Storey’s discussion of it. (See “<a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" target="_blank" rel="noreferrer noopener">How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt</a>” and “<a href="https://arxiv.org/abs/2603.22106" target="_blank" rel="noreferrer noopener">From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI</a>.”) Here’s how Storey explains the problem in her blog post:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Debt compounded from going fast lives in the brains of the developers and affects their lived experiences and abilities to “go fast” or to make changes. Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do, how their intentions were implemented, or how to possibly change it.</p>
</blockquote>



<p class="wp-block-paragraph">And as Sam noted, this compounds across teams and organizations. As developers increasingly work in parallel with AI rather than with each other, they lose the shared understanding that comes from people building software together. Kent Beck once said that “<a href="https://tidyfirst.substack.com/p/self-team-product" target="_blank" rel="noreferrer noopener">software design is an exercise in human relationships</a>.” Agentic workflows are putting pressure on that in ways we&#8217;re only beginning to see.</p>



<p class="wp-block-paragraph">Nathen agreed that cognitive debt is where he&#8217;s most concerned, and that both your workers and your architecture will suffer for it. The ramifications of an architectural decision you made eight months ago take years of operation to surface, and AI doesn&#8217;t help with that at all.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Cognitive Debt and Long Feedback Loops with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/yiOsikXaQ7c?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Invest in your platform now</h2>



<p class="wp-block-paragraph">Considering what makes some AI-assisted teams high performers, Nathen explained, “It’s not <em>that</em> you’re using AI but <em>how</em> you’re using AI.” This observation led DORA to develop <a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-doras-inaugural-ai-capabilities-model" target="_blank" rel="noreferrer noopener">seven capabilities</a> that, when combined with AI adoption, lead to better outcomes. Nathen briefly ran through the list, ending on quality internal platforms. And here he made a claim about software engineering investment that was, in his words, “a little bit wild”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Every product engineer that you have in your organization, every engineer that&#8217;s focused on building features right now, should probably stop building features and focus on the platform.</p>
</blockquote>



<p class="wp-block-paragraph">His argument is that platforms matter more, not less, in an environment where AI makes it possible for almost anyone in an organization to build something. The people closest to customers and business problems can now generate working software. What they can&#8217;t do is ensure that software is durable, secure, and production-ready.</p>



<p class="wp-block-paragraph">Nathen suggested that the best leverage for software engineering investment today might be building platforms that provide those guardrails, that shift the complexity of production-readiness down into the infrastructure so that anyone building on top of it gets the safety net for free. He acknowledged that moving every product engineer to platform work might be overkill. But the direction of travel is real. The platform is also, as Newman pointed out, where you bring determinism back into a process that AI has made more nondeterministic.</p>



<p class="wp-block-paragraph">That’s something we’ve been hearing a lot here at O’Reilly. The expansion of who can build doesn&#8217;t reduce the need for deep engineering expertise. It changes where that expertise is most valuable, and platforms are a good answer to where.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="AI Capabilities and the Case for Platform Investment Now with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/CIFoHFTbIec?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">What DORA’s research tells us</h2>



<p class="wp-block-paragraph">The teams that are doing well are running experiments, learning from them, and spreading those lessons. The measure Nathen suggested is not how many tokens you&#8217;ve consumed but how many experiments you&#8217;ve run and how well you&#8217;re distributing what you&#8217;ve learned.</p>



<p class="wp-block-paragraph">The tools are moving fast enough that any organization locking in a fixed policy around specific tools will find itself stuck. What you want is the capacity to keep learning, which means building the culture and the processes that make learning visible and transferable.</p>



<p class="wp-block-paragraph">All of DORA&#8217;s research is freely available at <a href="https://dora.dev/" target="_blank" rel="noreferrer noopener">dora.dev</a>, including the 2025 annual report and the ROI framework. The <a href="https://dora.community/" target="_blank" rel="noreferrer noopener">DORA Community</a> provides a space for practitioners to work through these questions together. If you&#8217;re trying to navigate any of this with your team, you may want to spend some time there.</p>



<p class="wp-block-paragraph">And if you want to dive deeper into Nathen and Sam’s chat or explore the other sessions, you can <a href="https://learning.oreilly.com/videos/infrastructure-ops/0642572308308/" target="_blank" rel="noreferrer noopener">watch the entire Infrastructure &amp; Ops Superstream</a> on the O’Reilly learning platform. Our next event, on September 9, will cover agentic observability. <a href="https://www.oreilly.com/live/io-superstream-agentic-observability.html" target="_blank" rel="noreferrer noopener">Register for free here</a>, and check out all the other <a href="https://www.oreilly.com/live/free.html" target="_blank" rel="noreferrer noopener">free live events on O’Reilly</a>.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Agent Skills</title>
		<link>https://www.oreilly.com/radar/agent-skills/</link>
				<comments>https://www.oreilly.com/radar/agent-skills/#respond</comments>
				<pubDate>Wed, 27 May 2026 10:59:18 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18796</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A senior engineer’s job is mostly the parts that don’t show up in the diff. Specs. Tests. Reviews. Scope discipline. Refusing to ship what can’t be verified. AI coding agents skip those parts by default. Agent Skills is my attempt to make them not optional.]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission. The default behavior of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It doesn’t ask whether you have a spec, write a test before [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on <a href="https://addyosmani.com/blog/agent-skills/" target="_blank" rel="noreferrer noopener">Addy Osmani’s blog</a> and is being reposted here with the author’s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">The default behavior of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It doesn’t ask whether you have a spec, write a test before the implementation, consider whether the change crosses a trust boundary, or check what the PR will look like to a reviewer. It produces code, declares victory, and moves on.</p>



<p class="wp-block-paragraph">This is the same failure mode every senior engineer has spent their career learning to avoid. The senior version of any task includes work that doesn’t show up in the diff: surfacing assumptions, writing the spec, breaking the work into reviewable chunks, choosing the boring design, leaving evidence that the result is correct, sizing the change so a human can actually review it. Those steps are most of what separates engineers who ship reliable software at scale from people who push code that breaks.</p>



<p class="wp-block-paragraph">Agents skip those steps for the same reason any junior would. They’re invisible. The reward signal points at “task complete,” not “task complete and the design doc exists.” So we have to bolt the senior-engineer scaffolding back on.</p>



<p class="wp-block-paragraph"><a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">Agent Skills</a> is my attempt at that scaffolding. It just crossed 27K stars, so apparently I’m not alone in wanting it. This post is the part the README doesn’t quite cover: why each design choice exists, how it maps onto standard SDLC and Google’s published engineering practices, and what you should steal from the project even if you never install a single skill.</p>



<h2 class="wp-block-heading">What a “skill” actually is</h2>



<p class="wp-block-paragraph">The word “skill” is doing a lot of work in the Claude Code/Anthropic vocabulary, and it helps to be precise. A skill is a Markdown file with front matter that gets injected into the agent’s context when the situation calls for it. Somewhere between a system-prompt fragment and a runbook.</p>



<p class="wp-block-paragraph">A skill is <em>not</em> reference documentation. It is not “everything you should know about testing.” It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.</p>



<p class="wp-block-paragraph">That distinction is the whole game. If you put a 2,000-word essay on testing best practices into the agent’s context, the agent reads it, generates plausible-looking text, and skips the actual testing. If you put a <em>workflow</em> there (write the failing test first, run it, watch it fail, write the minimum code to pass, watch it pass, refactor), the agent has something to do, and you have something to verify.</p>



<p class="wp-block-paragraph">Process over prose. Workflows over reference. Steps with exit criteria over essays without them. That single distinction separates a useful skill from a pretty Markdown file. It also explains why so many “AI rules” repos end up doing nothing in practice. The rules are essays.</p>



<h2 class="wp-block-heading">The SDLC the skills encode</h2>



<p class="wp-block-paragraph">The 20 skills in the repo are organized around six lifecycle phases, with seven slash commands sitting on top. Define (<code>/spec</code>) is where you decide what you’re actually building. Plan (<code>/plan</code>) breaks the work down. Build (<code>/build</code>) implements it in vertical slices. Verify (<code>/test</code>) proves it works. Review (<code>/review</code>) catches what slipped through. Ship (<code>/ship</code>) gets it to users safely. The seventh command, <code>/code-simplify</code>, sits across the bottom of the whole thing.</p>



<p class="wp-block-paragraph">This isn’t a coincidence. It’s the same SDLC every functioning engineering organization runs, just in different vocabulary. Google calls it design doc → review → implementation → readability review → launch checklist. Amazon calls it the working-backward memo and the bar raiser. Every healthy team has some version of this loop.</p>



<p class="wp-block-paragraph">What’s new with AI coding agents is that <em>most agents skip most of these phases by default</em>. You ask for a feature, you get an implementation, and the spec, plan, tests, review, and launch checklist all just don’t happen. Skills push the agent through the same phases a senior engineer forces themselves through, because shipping the code without them is how you produce incidents.</p>



<p class="wp-block-paragraph">A complex feature might activate eleven skills in sequence. A small bug fix might use three. The router (<code>using-agent-skills</code>) decides which apply. The point is that the workflow scales to the actual scope, not to the assumed scope.</p>



<h2 class="wp-block-heading">Five principles that are doing the work</h2>



<p class="wp-block-paragraph">Five design decisions in the project are the loadbearing ones. The rest of the system follows from them.</p>



<h3 class="wp-block-heading">1. Process over prose</h3>



<p class="wp-block-paragraph">Already covered. Workflows are agent-actionable; essays are not. The same is true for human teams. If your team handbook is 200 pages, no one reads it under time pressure. If it’s a small set of workflows with checkpoints, people actually run them.</p>



<h3 class="wp-block-heading">2. Anti-rationalization tables</h3>



<p class="wp-block-paragraph">This is the most distinctive design decision in the project, and the one I most want other teams to steal.</p>



<p class="wp-block-paragraph">Each skill includes a table of common excuses an agent (or a tired engineer) might use to skip the workflow, paired with a written rebuttal. A few examples close to the originals:</p>



<ul class="wp-block-list">
<li>“This task is too simple to need a spec.” → Acceptance criteria still apply. Five lines is fine. Zero lines is not.</li>



<li>“I’ll write tests later.” → Later is the loadbearing word. There is no later. Write the failing test first.</li>



<li>“Tests pass, ship it.” → Passing tests are evidence, not proof. Did you check the runtime? Did you verify user-visible behavior? Did a human read the diff?</li>
</ul>
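


<p class="wp-block-paragraph">As a rough illustration of the pattern, and not code taken from the repo, an anti-rationalization table can be treated as plain data: a list of excuse/rebuttal pairs rendered into a skill file. The <code>Rebuttal</code> class, the <code>render_table</code> helper, and the Markdown table layout are assumptions made for this sketch; the entries paraphrase the examples above.</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass


@dataclass(frozen=True)
class Rebuttal:
    excuse: str    # what the agent (or a tired engineer) says
    response: str  # the prewritten answer


# Hypothetical entries, paraphrasing the examples in this post.
SPEC_REBUTTALS = [
    Rebuttal("This task is too simple to need a spec.",
             "Acceptance criteria still apply. Five lines is fine."),
    Rebuttal("I will write tests later.",
             "There is no later. Write the failing test first."),
    Rebuttal("Tests pass, ship it.",
             "Passing tests are evidence, not proof."),
]


def render_table(rows):
    """Render rebuttals as a Markdown table for inclusion in a skill file."""
    lines = ["| Excuse | Rebuttal |", "| --- | --- |"]
    for row in rows:
        lines.append(f"| {row.excuse} | {row.response} |")
    return "\n".join(lines)


print(render_table(SPEC_REBUTTALS))</code></pre>



<p class="wp-block-paragraph">Keeping the pairs as data rather than free text makes it easy to review them in a PR and to reuse the same list in a team wiki.</p>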



<p class="wp-block-paragraph">The reason this works is that LLMs are excellent at rationalization. They will produce a plausible-sounding paragraph explaining why <em>this particular</em> task doesn’t need a spec or why <em>this particular</em> change is fine to merge without review. Anti-rationalization tables are prewritten rebuttals to lies the agent hasn’t yet told.</p>



<p class="wp-block-paragraph">The pattern is just as good for human teams. Most engineering decay isn’t anyone choosing to do bad work. It’s people accepting plausible-sounding justifications for skipping the parts they don’t feel like doing. A team that writes down its anti-rationalizations is a team that has fewer of them.</p>



<h3 class="wp-block-heading">3. Verification is nonnegotiable</h3>



<p class="wp-block-paragraph">Every skill terminates in concrete evidence. Tests pass. Build output is clean. The runtime trace shows the expected behavior. A reviewer signs off. “Seems right” is never sufficient.</p>



<p class="wp-block-paragraph">This is the same principle that makes Anthropic’s harness recover from failures, that makes Cursor’s planner/worker/judge split actually catch bugs, that makes any <a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">long-running agent</a> recoverable. The agent is a generator. You need a separate signal that the work is done. Skills bake that signal into every workflow.</p>
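


<p class="wp-block-paragraph">One way to picture that separate signal is a task object that cannot be closed until evidence is attached. This is a minimal sketch under assumed names (<code>Task</code>, <code>attach</code>, <code>close</code>), not how any particular harness implements it:</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    # Evidence items are free-form strings here: a test summary, a log path,
    # a review approval. A real system would likely use structured records.
    evidence: list = field(default_factory=list)

    def attach(self, item):
        """Record one piece of evidence that the work is done."""
        self.evidence.append(item)

    def close(self):
        """Refuse to mark the task done unless some evidence was recorded."""
        if not self.evidence:
            raise ValueError(f"{self.name}: no evidence attached, task is not done")
        return f"{self.name}: done ({len(self.evidence)} evidence item(s))"


task = Task("fix-login-bug")
task.attach("test run: all green")
print(task.close())</code></pre>



<p class="wp-block-paragraph">The sketch only checks that evidence exists, not that it is good. Judging the evidence is still the job of a test runner, a reviewer, or a separate judge step.</p>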



<h3 class="wp-block-heading">4. Progressive disclosure</h3>



<p class="wp-block-paragraph">Do not load all 20 skills into context at session start. Activate them based on the phase. A small meta-skill (<code>using-agent-skills</code>) acts as a router that decides which skill applies to the current task.</p>



<p class="wp-block-paragraph">This is the <a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">harness engineering</a> lesson applied at skill granularity. Every token loaded into context degrades performance somewhere, so you load what’s relevant and leave the rest on disk. Progressive disclosure is how you get a 20-skill library into a 5K-token slot without poisoning the well.</p>
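


<p class="wp-block-paragraph">The router in the project is itself a skill that the model reads, so the following is only an analogy in code. Assuming a hypothetical mapping from lifecycle phase to skill names (the names appear in this post; the grouping is invented for illustration), progressive disclosure amounts to loading only the entries for the current phase:</p>



<pre class="wp-block-code"><code># Hypothetical phase-to-skill map. Skill names come from this post;
# the grouping by phase is an assumption made for the sketch.
SKILLS_BY_PHASE = {
    "build": ["test-driven-development", "api-and-interface-design"],
    "review": ["code-review-and-quality", "code-simplification"],
    "ship": ["ci-cd-and-automation", "git-workflow-and-versioning"],
}


def select_skills(phase, limit=2):
    """Return at most `limit` skill names for a phase; unknown phases load nothing."""
    return SKILLS_BY_PHASE.get(phase, [])[:limit]


def build_context(phase, load):
    """Join only the selected skill bodies; everything else stays on disk.

    `load` is any callable that maps a skill name to its text.
    """
    return "\n\n".join(load(name) for name in select_skills(phase))


print(select_skills("review"))</code></pre>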



<h3 class="wp-block-heading">5. Scope discipline</h3>



<p class="wp-block-paragraph">The meta-skill encodes a nonnegotiable I’d staple to every agent if I could: “touch only what you’re asked to touch.” Don’t refactor adjacent systems. Don’t remove code you don’t fully understand. Don’t brush against a TODO and decide to rewrite the file.</p>



<p class="wp-block-paragraph">This sounds obvious until you watch an agent decide that fixing one bug requires modernizing three unrelated files. Scope discipline is the single biggest determinant of whether an agent’s PR is mergeable or has to be unwound. It’s also the principle that maps most cleanly onto Google’s code review norms, where reviewers will block a PR for doing more than one thing.</p>



<h2 class="wp-block-heading">The Google DNA</h2>



<p class="wp-block-paragraph">The skills are saturated with practices from <em><a href="https://learning.oreilly.com/library/view/software-engineering-at/9781492082781/" target="_blank" rel="noreferrer noopener">Software Engineering at Google</a></em> and Google’s public engineering culture. This is intentional. Most of what makes Google-scale software work is documented and public, and it is <em>exactly</em> the part agents are most likely to skip.</p>



<p class="wp-block-paragraph">A partial map of which skill encodes which practice:</p>



<ul class="wp-block-list">
<li><strong>Hyrum’s law</strong><strong> in </strong><strong>api-and-interface-design</strong><strong>. </strong>Every observable behavior of your API will eventually be depended on by someone, so design with that in mind.</li>



<li><strong>The test pyramid (~80/15/5) and the Beyoncé rule</strong><strong> in </strong><strong>test-driven-development</strong><strong>.</strong> “If you liked it, you should have put a test on it.” Infrastructure changes don’t catch bugs; tests do.</li>



<li><strong>DAMP over DRY in tests.</strong> Google’s testing philosophy is explicit that test code should read like a specification even at the cost of some duplication. Overabstracted tests are a known antipattern.</li>



<li><strong>~100-line PR sizing, with Critical/Nit/Optional/FYI severity labels</strong><strong> in </strong><strong>code-review-and-quality</strong><strong>.</strong> Straight from Google’s code review norms. Big PRs don’t get reviewed; they get rubber-stamped.</li>



<li><strong>Chesterton’s Fence</strong><strong> in </strong><strong>code-simplification</strong><strong>.</strong> Don’t remove a thing until you understand why it was put there.</li>



<li><strong>Trunk-based development and atomic commits</strong><strong> in </strong><strong>git-workflow-and-versioning</strong><strong>.</strong></li>



<li><strong>Shift left and feature flags</strong><strong> in </strong><strong>ci-cd-and-automation</strong><strong>.</strong> Catch problems as early as possible, decouple deploy from release.</li>



<li><strong>Code-as-liability</strong><strong> in </strong><strong>deprecation-and-migration</strong><strong>.</strong> Every line you keep is one you have to maintain forever, so prefer the smaller surface.</li>
</ul>



<p class="wp-block-paragraph">None of these are new ideas. The point is that none of them are in the agent by default. A frontier model has read the phrase “Hyrum’s law” in its training data, but it does not apply Hyrum’s law when it’s designing your API at 3am. Skills are how you make sure it does.</p>



<h2 class="wp-block-heading">How to actually use it</h2>



<p class="wp-block-paragraph">Three modes, in roughly increasing commitment.</p>



<p class="wp-block-paragraph"><strong>Mode 1: Install via marketplace. </strong>If you’re using Claude Code:</p>



<pre class="wp-block-code"><code>/plugin marketplace add addyosmani/agent-skills
/plugin install agent-skills@addy-agent-skills</code></pre>



<p class="wp-block-paragraph">You get the slash commands (<code>/spec</code>, <code>/plan</code>, <code>/build</code>, <code>/test</code>, <code>/review</code>, <code>/ship</code>, <code>/code-simplify</code>) and the agent activates the relevant skills automatically based on context. This is the path I’d recommend most people start on.</p>



<p class="wp-block-paragraph"><strong>Mode 2: Drop the Markdown into your tool of choice.</strong> The skills are plain Markdown with front matter. Cursor users put them in <code>.cursor/rules/</code>. Gemini CLI has its own install path. Codex, Aider, Windsurf, OpenCode, anything that accepts a system prompt can read them. The tooling matters less than the workflow underneath.</p>



<p class="wp-block-paragraph"><strong>Mode 3: Read them as a spec. </strong>Even if you never install anything, the skills are a <em>documented description of what good engineering with AI agents looks like</em>. Read <code>code-review-and-quality.md</code> and apply the five-axis framework to your team’s review process. Read <code>test-driven-development.md</code> and use it to settle the next “do we need to write the test first” argument with a junior. Read the meta-skill and steal the five nonnegotiables for your own AGENTS.md.</p>



<p class="wp-block-paragraph">This third mode is where I’d actually start. Pick the four or five skills closest to your current pain. Decide which workflows you want enforced. Then install the runtime, or roll your own, to do the enforcing.</p>



<h2 class="wp-block-heading">What to steal even if you never install</h2>



<p class="wp-block-paragraph">A few patterns from the project I’d steal regardless of whether you use AI coding agents at all:</p>



<p class="wp-block-paragraph"><strong>Anti-rationalization as a team practice.</strong> Write down the lies your team tells itself. “We’ll fix the tests after launch.” “This change is too small for a design doc.” “It’s fine, we have monitoring.” Pair each with the rebuttal. Put it in your AGENTS.md or your engineering wiki. It will save you arguments and it will catch the next tired Friday-afternoon shortcut.</p>



<p class="wp-block-paragraph"><strong>Process over prose for anything you write internally.</strong> If you find yourself writing a 2,000-word doc titled “how we approach X,” you’ve written reference material. Convert it to a workflow with checkpoints. The doc shrinks to 400 words and people actually run it. This applies as much to onboarding guides and runbooks as it does to agent skills.</p>



<p class="wp-block-paragraph"><strong>Verification as a hard exit criterion.</strong> Make “produce evidence” the exit step of every task. For agents, for engineers, for yourself. Evidence is whatever proves the work is done: a green test run, a screenshot, a log, a review approval. Without it, the task is not done. “Seems right” never closes the loop.</p>



<p class="wp-block-paragraph"><strong>Progressive disclosure for any rulebook.</strong> Do not write a 50-page handbook. Write a small router that points to the right small chapter for the situation. This is true for AGENTS.md, for runbooks, for incident playbooks, for anything anyone will read under time pressure.</p>



<p class="wp-block-paragraph">Five nonnegotiables, lifted from the meta-skill, that I’d put in any AGENTS.md tomorrow:</p>



<ol class="wp-block-list">
<li>Surface assumptions before building. Wrong assumptions held silently are the most common failure mode.</li>



<li>Stop and ask when requirements conflict. Don’t guess.</li>



<li>Push back when warranted. The agent (or engineer) is not a yes-machine.</li>



<li>Prefer the boring, obvious solution. Cleverness is expensive.</li>



<li>Touch only what you’re asked to touch.</li>
</ol>



<p class="wp-block-paragraph">That’s a worthwhile engineering culture in five lines, and you don’t need to install anything to adopt it.</p>



<h2 class="wp-block-heading">Where this fits in the harness</h2>



<p class="wp-block-paragraph">In the broader picture, skills are one layer of <a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">agent harness engineering</a>. The harness is the model plus everything you build around it; skills are the reusable workflow chunks that get progressively disclosed into the system prompt. They sit alongside <code>AGENTS.md</code> (the rolling rulebook), hooks (the deterministic enforcement layer), tools (the actions the agent can take), and the session log (the durable memory). Each layer has a specific job. Skills do the senior-engineer-process job.</p>



<p class="wp-block-paragraph">Skills matter more for <a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">long-running agents</a> than they do for chat-style ones, because long runs amplify every shortcut. An agent that skips the test in a 10-minute session produces one bug. An agent that skips the test in a 30-hour session produces a debugging archaeology project at the end of the run, when no one remembers what the original intent was. The longer the run, the more the senior-engineer scaffolding has to be enforced rather than suggested.</p>



<p class="wp-block-paragraph">The portability of the skills format matters too. The same SKILL.md file works in Claude Code, Cursor (with rules), Gemini CLI, Codex, and any other harness that accepts system-prompt content. Write the workflow once, and the runtime enforces it. That’s what the Markdown-with-front-matter format buys you that bespoke prompt engineering does not.</p>



<h2 class="wp-block-heading">Closing</h2>



<p class="wp-block-paragraph">The thing I most want people to take from this project, more than the skills themselves, is the framing.</p>



<p class="wp-block-paragraph">AI coding agents are extremely capable junior engineers with no instinct for the parts of the job that don’t show up in the diff. The senior-engineering work (surfacing assumptions, sizing changes, writing the spec, leaving evidence, refusing to merge what can’t be reviewed) is exactly what an agent will skip unless you make it impossible to skip. The job, increasingly, is to encode that discipline as something the agent cannot talk itself out of.</p>



<p class="wp-block-paragraph">Skills are one shape of that. Anti-rationalization tables. Progressive disclosure. Process over prose. Verification as the loadbearing exit criterion. The Google practices that already work, made portable.</p>



<p class="wp-block-paragraph">You can install <a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">my version</a>. You can roll your own. The lesson stands either way: The senior-engineer parts of the job are no longer optional, even when the engineer is a model.</p>



<p class="wp-block-paragraph"><em>The repo is at <a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">github.com/addyosmani/agent-skills</a> (MIT). For the broader scaffolding picture, see “<a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">Agent Harness Engineering</a>” and “<a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">Long-Running Agents</a>.”</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/agent-skills/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Who Authorized That? The Delegation Problem in Multi-Agent AI</title>
		<link>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/</link>
				<comments>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/#respond</comments>
				<pubDate>Tue, 26 May 2026 10:58:58 +0000</pubDate>
					<dc:creator><![CDATA[Sunil Prakash]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18793</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Who-Authorized-That.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Who-Authorized-That-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Securing access isn’t enough. As agents begin calling other agents, enterprises need to secure delegation too.]]></custom:subtitle>
		
				<description><![CDATA[Your AI agent booked a meeting, summarized a financial report, and emailed the highlights to three stakeholders. To do this, it called a calendar agent, a document analysis agent, and an email agent. Each accessed internal systems, made decisions about what to include, and acted on your behalf. Here’s the question your security team can’t [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Your AI agent booked a meeting, summarized a financial report, and emailed the highlights to three stakeholders. To do this, it called a calendar agent, a document analysis agent, and an email agent. Each accessed internal systems, made decisions about what to include, and acted on your behalf.</p>



<p class="wp-block-paragraph">Here’s the question your security team can’t answer: <strong>Who authorized the email agent to read that financial report?</strong></p>



<p class="wp-block-paragraph">In most current architectures, the honest answer is that no one did, at least not explicitly. The logs may show that a service called another service. But they can’t show that the delegation itself was authorized. The authorization didn’t fail loudly. It leaked silently through the chain.</p>



<p class="wp-block-paragraph">This is the delegation problem in multi-agent AI. As enterprises connect agents through protocols such as MCP and A2A, they’re solving the connectivity problem faster than they’re solving the authority problem. The result is a new security boundary that most enterprise architectures have not yet modeled, precisely because most organizations still treat it as orchestration rather than authorization.</p>



<h2 class="wp-block-heading">Agents are connecting faster than authorization is adapting</h2>



<p class="wp-block-paragraph">The agent ecosystem has moved fast over the past two years. Anthropic&#8217;s MCP gave model-powered applications a standard way to connect to tools, data sources, and services. Google&#8217;s A2A protocol gave agents a standard way to communicate and coordinate across systems. Frameworks and SDKs such as LangChain, CrewAI, and Google&#8217;s ADK made it easier to build multi-agent workflows where one agent orchestrates several others.</p>



<p class="wp-block-paragraph">What these protocols don’t yet provide, at least not as a mature common layer, is a delegation-aware authorization model.</p>



<p class="wp-block-paragraph">MCP describes a protected server as an OAuth 2.1 resource server, with the MCP client acting as an OAuth client making requests on behalf of a resource owner. That’s a familiar and well-understood pattern, but it was designed for a world where a human clicks &#8220;Allow&#8221; and a single client gets a scoped token. It doesn’t address what happens when Agent A receives that token, delegates a subtask to Agent B, and Agent B spawns Agent C to handle part of it. Each hop in that chain either reuses the original token (overprivileged) or has no token at all (untracked).</p>



<p class="wp-block-paragraph">A2A was built for interoperability: independent, potentially opaque agent systems communicating and coordinating actions across enterprise platforms. That’s the right problem to solve. But communication and delegation governance are different layers. A2A helps agents discover, describe, and communicate with one another. This is necessary infrastructure, but it isn’t the same as delegated authority. It doesn’t tell you whether a specific downstream action was legitimately derived from an upstream instruction.</p>



<p class="wp-block-paragraph">Static API keys are even weaker for this problem. A key grants access to a service. It says nothing about who is using it, what they’re using it for, or whether the entity presenting it is the same one it was issued to. Service accounts identify a workload, not an intent. When three agents share a service account, every action looks the same in your logs.</p>



<p class="wp-block-paragraph">None of these tools are broken. They solve different problems. The gap is structural. Authentication answers which agent is calling. Authorization defines what that agent may access. The harder question, and the one most enterprise architectures are not yet designed to answer, is whether a specific downstream action was legitimately derived from an upstream instruction, under narrowed constraints, with a verifiable chain back to a human decision. That’s the delegation question, and it sits in a layer that today&#8217;s stack doesn’t really have.</p>



<p class="wp-block-paragraph">In a clean version of this picture, privilege should sit only with the agent that touches the outside world. If a payer (A) asks a bookkeeper agent (B) to make a payment, and the bookkeeper asks a banking agent (C) to execute the transfer, only the banking agent needs banking authority. The bookkeeper doesn’t need to move money. It only needs to know the request came from an authorized payer. The banking agent only needs to know the request came from an authorized bookkeeper. This is the principle of least privilege, a concept the security community has lived with for decades, applied to delegation chains. The difficulty is that today&#8217;s agent stacks make it hard to enforce.</p>



<h2 class="wp-block-heading">What breaks in the chain</h2>



<p class="wp-block-paragraph">Consider a treasury reporting workflow in a regulated bank. A planning agent is allowed to read liquidity projections and produce a daily summary for senior finance users. To complete the task, it delegates chart generation to a visualization agent and narrative review to a communications agent. The visualization agent doesn’t need access to raw account-level data. The communications agent doesn’t need access to the underlying liquidity model. Yet unless the delegation layer attenuates permissions, both may receive more context than their task requires. The result isn’t a dramatic breach, but it is a quiet expansion of access that the access-control model never explicitly approved.</p>



<p class="wp-block-paragraph">The risk isn’t limited to internet-facing agents. Many delegation failures happen entirely inside the enterprise boundary. An internal agent may call another internal agent, which calls an internal tool, which sends data to an approved SaaS service. Every individual step may look acceptable. The risk appears in the composition: The final data movement or action may exceed the intent of the original authorization.</p>



<p class="wp-block-paragraph">This pattern creates three categories of failure that enterprises may have to explain to regulators, auditors, or customers.</p>



<p class="wp-block-paragraph"><strong>Ghost permissions. </strong>A finance analyst assistant has been given access to a customer transactions database to support quarterly reporting. It calls a summarization agent: &#8220;summarize recent transactions for these accounts.&#8221; The summarization agent now operates against customer records, even though no policy engine granted it that access. The analyst assistant&#8217;s privileges effectively traveled with the request. The permission is a ghost. It exists in practice but not in any authorization system.</p>



<p class="wp-block-paragraph"><strong>Scope drift.</strong> Even when an agent starts with narrow permissions, delegation tends to widen scope rather than narrow it. An agent authorized to read Q1 revenue data delegates to a charting agent, which calls an external rendering API, which now has the revenue figures. The data left the organization through two hops of implicit trust. Each agent acted within what it understood as its scope. The aggregate result exceeded what any human would have approved.</p>



<p class="wp-block-paragraph"><strong>Broken audit trails.</strong> Regulated industries require the ability to answer &#8220;who did what and why&#8221; for any consequential action. In a single-agent system, this is manageable. In a multi-agent chain, the audit trail fragments across agents, protocols, and services. When a compliance team asks why a particular customer communication was sent, the answer might involve four agents across two protocols, none of which logged the delegation chain. The action is traceable to a system but not to a decision.</p>



<p class="wp-block-paragraph">These aren’t edge cases. They’re a common outcome when delegation isn’t modeled explicitly. The delegation problem isn’t a bug in any particular framework. It’s a gap in the layer between them.</p>



<h2 class="wp-block-heading">What a delegation-aware model requires</h2>



<p class="wp-block-paragraph">A delegation-aware authorization model has to solve four things at once, which is part of why no existing layer covers it cleanly.</p>



<p class="wp-block-paragraph">The first is identity. The downstream agent needs a cryptographic credential that the receiving system can verify independently, not just a hostname or an API key. Hostnames lie. API keys travel. A real identity is one the calling system cannot fabricate.</p>



<p class="wp-block-paragraph">The second is attenuation. When an agent delegates a task, the subagent should receive strictly fewer permissions than the parent—never the same set, and certainly never more. This is the principle of least privilege applied to delegation chains, and almost no current tooling enforces it by default.</p>
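
<p class="wp-block-paragraph">The attenuation rule can be sketched in a few lines. This is a minimal illustration, not any framework&#8217;s API: the <code>Grant</code> type, the <code>delegate</code> function, and the scope strings are all hypothetical names chosen for the example.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Hypothetical permission set held by one agent in a delegation chain."""
    agent: str
    scopes: frozenset


def delegate(parent: Grant, child_agent: str, requested: set) -> Grant:
    """Return a child grant only if it is strictly narrower than the parent's."""
    requested = frozenset(requested)
    # `<` on frozensets is the proper-subset test: every requested scope must
    # be held by the parent, and at least one parent scope must be dropped.
    if not requested < parent.scopes:
        raise PermissionError(
            f"{child_agent} requested scopes that are not strictly narrower "
            f"than those of {parent.agent}"
        )
    return Grant(agent=child_agent, scopes=requested)


# Illustrative scope names, assumed for this sketch.
planner = Grant("planning-agent", frozenset({"liquidity:read", "summary:write"}))
viz = delegate(planner, "visualization-agent", {"liquidity:read"})
```

<p class="wp-block-paragraph">Asking for the parent&#8217;s full scope set, or for any scope the parent does not hold, raises <code>PermissionError</code> rather than silently passing authority along.</p>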



<p class="wp-block-paragraph">The third is purpose. &#8220;Read this report to summarize liquidity exposure for the CFO&#8221; is a different authorization from &#8220;read this report and send selected figures to an external charting service.&#8221; It may be the same data and the same agent, but it’s two very different risk profiles. Without a purpose binding, the authorization layer has no way to distinguish them.</p>



<p class="wp-block-paragraph">The fourth is audit. The organization should be able to reconstruct, after the fact, who delegated what, under which constraints, and what evidence each agent produced at completion. Not just which systems were called but which decisions were made and on whose authority.</p>



<p class="wp-block-paragraph">It’s possible for agents to authenticate successfully even when they don’t have accountable authority. They can prove who they are and still execute actions that no human ever authorized.</p>



<h2 class="wp-block-heading">Emerging approaches</h2>



<p class="wp-block-paragraph">Several efforts address parts of this problem: workload identity standards, agent metadata in tokens, OAuth-based MCP authorization, A2A authentication patterns, and agent identity frameworks. These are useful building blocks, but identity is not the same as delegated authority. A signed agent card can help establish an agent&#8217;s declared identity and capabilities. An OAuth token can tell you what a client may access. Neither, by itself, proves that a specific downstream action was authorized by a specific upstream decision under narrowed constraints.</p>



<p class="wp-block-paragraph">One emerging pattern is delegation-bound capability tokens: short-lived credentials that bind an invocation to an agent identity, a constrained permission set, and a provenance record. One example is the <a href="https://datatracker.ietf.org/doc/draft-prakash-aip/" target="_blank" rel="noreferrer noopener">Agent Identity Protocol (AIP)</a>, which I’ve been working on as an Internet-Draft and <a href="https://sunilprakash.com/aip/" target="_blank" rel="noreferrer noopener">open source implementation</a>. AIP is still early, but it illustrates the shape of one possible answer: invocation-bound tokens that carry identity, attenuated permissions, and provenance through a delegation chain. The token chain itself becomes part of the audit evidence rather than something reconstructed after the fact from fragmented logs.</p>
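
<p class="wp-block-paragraph">To make the general idea concrete, here is a rough sketch of a chained, short-lived token. It is not AIP&#8217;s wire format or API. The field names, the <code>issue</code> and <code>verify_chain</code> functions, and the demo keys are assumptions for illustration, and HMAC with shared keys stands in for the asymmetric signatures a real design would likely use.</p>

```python
import hashlib
import hmac
import json
import time

# Hypothetical demo keys. A real design would use asymmetric signatures so
# that verifiers never hold an issuer's signing key.
KEYS = {"human:cfo": b"demo-key-cfo", "agent:planner": b"demo-key-planner"}


def sign(issuer, body):
    """HMAC over the canonical JSON form of the token body."""
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(KEYS[issuer], payload, hashlib.sha256).hexdigest()


def issue(issuer, subject, scopes, purpose, ttl_s, parent=None):
    """Mint a short-lived token bound to a subject, scopes, purpose, and parent."""
    body = {
        "iss": issuer,
        "sub": subject,
        "scopes": sorted(scopes),
        "purpose": purpose,
        "exp": time.time() + ttl_s,
        "parent_sig": parent["sig"] if parent else None,
    }
    return {"body": body, "sig": sign(issuer, body)}


def verify_chain(chain, now=None):
    """Check signatures, expiry, linkage, and attenuation from root to leaf."""
    now = time.time() if now is None else now
    for i, tok in enumerate(chain):
        body = tok["body"]
        if body["iss"] not in KEYS:
            return False
        if not hmac.compare_digest(tok["sig"], sign(body["iss"], body)):
            return False
        if body["exp"] < now:
            return False
        if i == 0:
            if body["parent_sig"] is not None:
                return False
            continue
        parent = chain[i - 1]
        if body["parent_sig"] != parent["sig"]:
            return False  # not linked to the token it claims to derive from
        if body["iss"] != parent["body"]["sub"]:
            return False  # only the holder of the parent token may delegate
        if not set(body["scopes"]) < set(parent["body"]["scopes"]):
            return False  # scopes must strictly narrow at every hop
    return True


root = issue("human:cfo", "agent:planner",
             {"liquidity:read", "summary:write"}, "daily liquidity summary", 300)
leaf = issue("agent:planner", "agent:viz",
             {"liquidity:read"}, "chart generation", 60, parent=root)
chain_ok = verify_chain([root, leaf])
```

<p class="wp-block-paragraph">Because each token carries its parent&#8217;s signature, the chain that authorizes an action is the same object an auditor would later inspect.</p>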



<p class="wp-block-paragraph">Complementary approaches are also emerging. Behavioral credentials, the idea that agents should be continuously reauthorized based on runtime behavior rather than just initial permissions, address a related but distinct problem. Delegation tokens tell you who authorized what. Behavioral monitoring tells you whether the agent is still acting within its authorized profile. A complete solution will likely need both.</p>



<p class="wp-block-paragraph">None of these approaches have reached mainstream adoption. But the fact that they are emerging simultaneously, from different corners of the industry, signals that the delegation gap is real and recognized.</p>



<h2 class="wp-block-heading">What enterprise teams should do now</h2>



<p class="wp-block-paragraph">You don’t need to wait for standards to mature before addressing the delegation problem. There are concrete steps that security, platform, and architecture teams can take today.</p>



<p class="wp-block-paragraph"><strong>Map your delegation chains.</strong> Most teams deploying multi-agent workflows haven’t documented which agents call which other agents, with what permissions, through which protocols. Start there. If you can’t draw the graph, you can’t secure it.</p>
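
<p class="wp-block-paragraph">Drawing the graph can start as a small script. The sketch below assumes a hand-maintained inventory of which agent calls which, over which protocol. The agent names, the <code>CALLS</code> table, and the <code>EXTERNAL</code> set are hypothetical.</p>

```python
# Hypothetical inventory: caller -> list of (callee, protocol) edges.
CALLS = {
    "analyst-assistant": [("summarizer", "A2A")],
    "summarizer": [("transactions-db", "MCP")],
    "planner": [("charting-agent", "A2A")],
    "charting-agent": [("render-api", "HTTPS")],
}
# Hypothetical set of endpoints that sit outside the organization.
EXTERNAL = {"render-api"}


def chains(start, graph, path=None):
    """Yield every delegation path from `start` to a node that calls nothing."""
    path = (path or []) + [start]
    edges = graph.get(start, [])
    if not edges:
        yield path
        return
    for callee, _protocol in edges:
        if callee in path:
            # Cycle guard: report the loop once instead of recursing forever.
            yield path + [callee]
            continue
        yield from chains(callee, graph, path)


def external_chains(graph, external):
    """Return the paths that start at an entry point and end outside the org."""
    callees = {callee for edges in graph.values() for callee, _ in edges}
    roots = [node for node in graph if node not in callees]
    return [p for root in roots for p in chains(root, graph) if p[-1] in external]


leaks = external_chains(CALLS, EXTERNAL)
```

<p class="wp-block-paragraph">With this inventory, <code>leaks</code> holds the single path from the planner through the charting agent to the rendering API, which is the kind of chain worth reviewing first.</p>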



<p class="wp-block-paragraph"><strong>Audit implicit permissions.</strong> For every agent-to-agent interaction, ask: Was this access explicitly granted, or is the downstream agent inheriting permissions by proximity? If the answer is inheritance, you have a ghost permission that needs a policy decision.</p>
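
<p class="wp-block-paragraph">One way to run that check is to compare what each agent actually touched against what it was explicitly granted. The sketch below assumes you can export both lists. The <code>GRANTS</code> and <code>OBSERVED</code> structures and their field order are hypothetical.</p>

```python
# Hypothetical explicit grants from a policy engine: agent -> resources.
GRANTS = {
    "analyst-assistant": {"customer-transactions"},
    "summarizer": set(),
}

# Hypothetical observed accesses from logs: (agent, resource, on_behalf_of).
OBSERVED = [
    ("analyst-assistant", "customer-transactions", None),
    ("summarizer", "customer-transactions", "analyst-assistant"),
]


def ghost_permissions(grants, observed):
    """Return accesses that happened without an explicit grant to that agent."""
    return [
        (agent, resource, upstream)
        for agent, resource, upstream in observed
        if resource not in grants.get(agent, set())
    ]


ghosts = ghost_permissions(GRANTS, OBSERVED)
```

<p class="wp-block-paragraph">Each entry in <code>ghosts</code> names the agent, the resource, and the upstream caller whose privileges it borrowed, which is enough to open a policy decision.</p>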



<p class="wp-block-paragraph"><strong>Require scope attenuation.</strong> Establish an architectural rule: When an agent delegates a task, the subagent must receive fewer permissions than the parent, never more. Current tooling doesn’t enforce this automatically, but you can enforce it in your orchestration layer.</p>



<p class="wp-block-paragraph"><strong>Build the audit trail before the auditor asks.</strong> If your organization is in a regulated industry, the question &#8220;Who authorized this agent action?&#8221; will eventually be asked. The time to instrument delegation logging is before that question arrives, not after. Log the full chain: which agent initiated the task, what permissions were passed, which subagents were invoked, and what each one accessed.</p>
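
<p class="wp-block-paragraph">A delegation log along these lines might look like the following sketch. The record fields mirror the list above, but the field names, the <code>log_delegation</code> and <code>chain_for</code> helpers, and the in-memory list standing in for durable storage are assumptions for illustration.</p>

```python
import json
import uuid


def log_delegation(log, initiator, subagent, permissions, accessed, parent_id=None):
    """Append one JSON-line delegation record and return its id."""
    record = {
        "id": str(uuid.uuid4()),
        "parent_id": parent_id,  # links this hop to the delegation that caused it
        "initiator": initiator,
        "subagent": subagent,
        "permissions_passed": sorted(permissions),
        "accessed": sorted(accessed),
    }
    log.append(json.dumps(record))
    return record["id"]


def chain_for(log, record_id):
    """Walk parent_id links to rebuild the chain for one action, root first."""
    by_id = {r["id"]: r for r in map(json.loads, log)}
    chain = []
    while record_id is not None:
        record = by_id[record_id]
        chain.append(record)
        record_id = record["parent_id"]
    return list(reversed(chain))


log = []  # stand-in for an append-only store
first = log_delegation(log, "user:analyst", "planner",
                       {"revenue:q1:read"}, {"revenue-q1"})
second = log_delegation(log, "planner", "charting-agent",
                        {"revenue:q1:read"}, {"revenue-q1"}, parent_id=first)
hops = [r["subagent"] for r in chain_for(log, second)]
```

<p class="wp-block-paragraph">The <code>parent_id</code> link is what turns a pile of per-agent log lines into a chain that can be traced back to the person who started it.</p>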



<p class="wp-block-paragraph"><strong>Test with real tooling.</strong> Delegation-aware approaches, including capability-token designs, workload identity standards, and agent identity frameworks, are early but functional. Running one in a nonproduction environment will expose gaps in your current authorization model that architecture review alone will not surface.</p>



<h2 class="wp-block-heading">Delegation is the security boundary</h2>



<p class="wp-block-paragraph">The first phase of enterprise agent adoption was about connectivity: Can the agent reach the tool, the API, the database, or the other agent? The next phase will be about accountable delegation: Should this agent be allowed to ask that agent to do this specific thing, with this data, under these constraints?</p>



<p class="wp-block-paragraph">That question won’t be answered by prompt engineering. It belongs in the authorization layer, the platform layer, and the audit trail.</p>



<p class="wp-block-paragraph">Enterprises don’t need to solve the entire standards problem today. But they do need to stop treating delegation as an implementation detail. In multi-agent systems, delegation is the security boundary.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>This Week in AI: Rethinking the Agent Harness</title>
		<link>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/</link>
				<comments>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/#respond</comments>
				<pubDate>Fri, 22 May 2026 15:01:29 +0000</pubDate>
					<dc:creator><![CDATA[Michelle Smith]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[This Week in AI]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18774</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-scaled.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Plus AI security, the compute arms race, and why eventually there may no longer be an internet for humans]]></custom:subtitle>
		
				<description><![CDATA[We kicked off our new weekly series This Week in AI on Monday, and we covered a lot of ground in 30 minutes, including an AI model that found security holes faster than decades of human auditing, a data center in Utah the size of two Manhattans, and a practical argument for why the harness [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">We kicked off our new weekly series <em>This Week in AI</em> on Monday, and we covered a lot of ground in 30 minutes, including an AI model that found security holes faster than decades of human auditing, a data center in Utah the size of two Manhattans, and a practical argument for why the harness you build around a model now matters more than which model you pick.<br><br>Here are a few takeaways from the conversation between host Eric Freeman, faculty member at UT Austin and a longtime <a href="https://learning.oreilly.com/search/?q=author%3A%20%22Eric%20Freeman%22&amp;suggested=true&amp;suggestionType=author&amp;originalQuery=eric%20freeman&amp;rows=100&amp;language=en" target="_blank" rel="noreferrer noopener">friend of O’Reilly</a>, and guest John Berryman, founder of Arcturus Labs, an early production engineer on GitHub Copilot, and coauthor of O&#8217;Reilly&#8217;s <a href="https://learning.oreilly.com/library/view/prompt-engineering-for/9781098156145/" target="_blank" rel="noreferrer noopener"><em>Prompt Engineering for LLMs</em></a>. Watch the entire episode to find out why you should be building your own agent and why John believes eventually there will be no internet for humans.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="This Week in AI: Rethinking the Agent Harness" width="500" height="281" src="https://www.youtube.com/embed/g4cfjz5AKxY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>AI&#8217;s security problem is now a policy problem</strong></h2>



<p class="wp-block-paragraph">You’ve probably already heard about <a href="https://red.anthropic.com/2026/mythos-preview/" target="_blank" rel="noreferrer noopener">Mythos</a>. Anthropic&#8217;s internal testing of the frontier model surfaced thousands of previously unknown security vulnerabilities across major operating systems, browsers, and financial infrastructure, including a 27-year-old bug in OpenBSD. Anthropic chose not to release the model publicly and instead launched <a href="https://www.anthropic.com/glasswing" target="_blank" rel="noreferrer noopener">Project Glasswing</a>, a restricted program giving monitored access to a small group of trusted partners for defensive patching.</p>



<p class="wp-block-paragraph">That decision moved fast in Washington. In roughly six weeks, the conversation shifted from the light-touch national AI policy released in March to reported White House discussions of an <a href="https://fortune.com/2026/05/06/trump-administration-embraces-ai-oversight-policies-it-once-rejected-anthropic-mythos-caisi/" target="_blank" rel="noreferrer noopener">executive order review process</a> modeled on how the FDA handles drugs. Security researcher Bruce Schneier has questioned <a href="https://www.schneier.com/blog/archives/2026/04/mythos-and-cybersecurity.html" target="_blank" rel="noreferrer noopener">whether Mythos is uniquely capable here</a> or whether similar results are achievable with cheaper public models, but as Freeman noted (paraphrasing Schneier), either way, it’s a problem that’s coming.</p>



<h2 class="wp-block-heading">The compute race is getting stranger</h2>



<p class="wp-block-paragraph">Anthropic <a href="https://x.ai/news/anthropic-compute-partnership" target="_blank" rel="noreferrer noopener">leased xAI&#8217;s entire Colossus 1 supercluster</a> in Memphis: more than 200,000 GPUs and 300 megawatts of power. A month before that deal, <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute" target="_blank" rel="noreferrer noopener">Anthropic expanded its agreement with Google and Broadcom for 3.5 gigawatts</a> of capacity coming online in 2027. For context, that&#8217;s more than 11 times the power capacity of the Colossus 1 deal, in a single contract. After this episode aired, Anthropic announced that the Colossus deal has been <a href="https://www.axios.com/2026/05/20/anthropic-spacex-compute" target="_blank" rel="noreferrer noopener">expanded to Colossus 2</a> as well.</p>



<p class="wp-block-paragraph">Box Elder County, Utah, just approved a 40,000-acre AI data center called the Stratos project, backed by investor and TV personality Kevin O&#8217;Leary (a.k.a. Mr. Wonderful). It’s planned for <a href="https://www.theregister.com/on-prem/2026/05/13/utah-mega-datacenter-could-dump-23-atomic-bombs-worth-of-energy-per-day/5239670" target="_blank" rel="noreferrer noopener">9 gigawatts at full buildout</a>. That&#8217;s a footprint more than twice the size of Manhattan, powered by the equivalent of nine commercial nuclear reactors. And like many recent data center deals, including Colossus above, it was <a href="https://www.cnn.com/2026/05/09/tech/ai-data-center-utah-kevin-oleary-opposition" target="_blank" rel="noreferrer noopener">approved over local protests</a>.</p>



<p class="wp-block-paragraph">Infrastructure at this incredible scale takes years to come online, and the companies making these bets are pricing in a world where model capability keeps scaling. Whether that assumption holds will determine a lot about what&#8217;s economically viable to build in the next decade.</p>



<h2 class="wp-block-heading"><strong>The harness matters more than the model</strong></h2>



<p class="wp-block-paragraph">John was on hand to rethink the agent harness, which, as he pointed out, entered a new phase with the step change in model capability that occurred in November and December of last year. He took Eric through the arc of AI product development, from document completion and chat loops to tool-calling agents, DAG-based workflows, and now the harness era represented by tools like Claude Code. Each progression added capability, John noted, but also complexity, and each generated a new class of problems around reliability and control. In our current moment, which John has dubbed the “age of the unharnessed agent,” agents are now within reach of everyone, not just software developers.</p>



<p class="wp-block-paragraph">The payoff of this “unharnessed” era is control. John described a client engagement where he replaced a bespoke application with a skills-driven agent. Now domain experts with no development experience can read the agent&#8217;s behavior written in plain English and better understand it. As John explained,</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Rather than building a bespoke agent.&nbsp;.&nbsp;., I just built something that was just the agent harness—the agent—and I just gave it skills that describe what basically I learned in interviewing their experts, how they would work with these agents. And it worked perfectly. Not only does the agent stay on track and do what it needs to do these days, but it&#8217;s coded, as far as my client is concerned, in English.<br><br>The experts don&#8217;t have to complain to developers “this doesn&#8217;t work.” The experts can look at the English description of what&#8217;s going on and see problems, and maybe even fix it themselves. And I&#8217;m really excited to basically give that power into the hands of the people that know best how to change it, the experts.</p>
</blockquote>



<p class="wp-block-paragraph">That&#8217;s a different relationship between the experts and the tool than anything a wrapped commercial product offers.</p>



<p class="wp-block-paragraph">As Eric pointed out, recent <a href="https://arxiv.org/html/2603.28052v1" target="_blank" rel="noreferrer noopener">Stanford research</a> supports this broader point: Performance gaps between a bare model and a well-designed harness now often matter more than which underlying model you&#8217;re using. The benchmark that used to dominate buying decisions, which model scores highest, has been displaced by a harder question about which harness fits the task.</p>



<p class="wp-block-paragraph">John closed with a demo of his personal agent moving from an Obsidian notebook into Wikipedia and back, carrying context across environments. He used it to illustrate a concept he called the &#8220;open agent protocol,&#8221; his term for a not-yet-existing standard where an agent receives environment-specific skills as it moves between contexts. The protocol doesn&#8217;t exist yet, but the demo made the direction clear.</p>



<h2 class="wp-block-heading"><strong>What&#8217;s next</strong></h2>



<p class="wp-block-paragraph">Join us and a rotating lineup of expert guests for weekly live tool demos and deeper dives into the topics that matter in AI. We’re taking next week off for Memorial Day in the US, but we’ll be back on June 1 with host Andreas Welsch and guests Maya Mikhailov and Doug Shannon to cut through another week of AI headlines and separate what actually drives business value from what looks good in a demo but goes nowhere in production. Our first few episodes are free and open to all if you’d like to attend live—<a href="https://www.oreilly.com/live/this-week-in-ai.html" target="_blank" rel="noreferrer noopener">register here</a>.</p>



<p class="wp-block-paragraph">We’ll continue to share full episodes and publish our takeaways here on Radar each Friday. You can also watch or listen on <a href="https://www.youtube.com/watch?v=g4cfjz5AKxY&amp;list=PL055Epbe6d5bJEhT7_ZzOeJZ6gPyUzYpS" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/033kJS2BG1teGunxmtsU1r" data-type="link" data-id="https://open.spotify.com/show/033kJS2BG1teGunxmtsU1r" target="_blank" rel="noreferrer noopener">Spotify</a>, Apple, or wherever you get your podcasts.</p>



]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Agentic P&#038;L: Beyond the Empire of Headcount</title>
		<link>https://www.oreilly.com/radar/the-agentic-pl-beyond-the-empire-of-headcount/</link>
				<pubDate>Thu, 21 May 2026 15:04:52 +0000</pubDate>
					<dc:creator><![CDATA[Shreshta Shyamsundar and Anmol Jain]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18761</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-Agentic-PL.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-Agentic-PL-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[For over a century, both the prestige and budget of a corporate department have been measured by a single crude metric: headcount. If you manage 500 people, you’re a &#8220;distinguished leader.&#8221; If you manage five, you’re a footnote. This &#8220;empire of headcount&#8221; has governed everything from office square footage to C-suite influence. It’s the fundamental [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">For over a century, both the prestige and budget of a corporate department have been measured by a single crude metric: headcount. If you manage 500 people, you’re a &#8220;distinguished leader.&#8221; If you manage five, you’re a footnote. This &#8220;empire of headcount&#8221; has governed everything from office square footage to C-suite influence. It’s the fundamental unit of the 20th-century P&amp;L.</p>



<p class="wp-block-paragraph">In an enterprise powered by federated agentic systems, this math is not just obsolete—it is a liability. AI will reshape the enterprise. The question is now “Which line items on the P&amp;L change, and by how much?” Labor and benefits contract. Token and infrastructure costs appear as a new operating line. Compliance costs shift from reactive rework to proactive provenance. And the assets that matter most—structured knowledge enclaves, trained agent policies, decision logs—do not yet appear on most balance sheets.</p>



<h2 class="wp-block-heading">Why AI-on-top-of-hierarchy fails</h2>



<p class="wp-block-paragraph">Most enterprise AI deployments begin with the right instinct and the wrong architecture. A foundation model is procured, a chatbot is deployed, and analysts are relieved of their most repetitive queries. This is the butler-bot phase: AI as a faster way to do what the organization already does, inside a structure designed for a different era.</p>



<p class="wp-block-paragraph">The problem is the process the model is plugged into. If a compliance decision requires sign-off from three managers, an AI assistant that drafts the memo faster doesn’t change the three-week cycle time. If context is scattered across email threads and local drives, a model querying that corpus will hallucinate at exactly the rate the corpus is incomplete. The model inherits the organization&#8217;s structural debt. The agentic P&amp;L begins where the butler bot ends: with a deliberate redesign of the process, not just the tooling.</p>



<p class="wp-block-paragraph">The enterprise must pivot: Stop valuing the empire of headcount and start valuing the federated nervous system.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="362" height="186" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11.png" alt="Figure 1. Empire of headcount vs. federated nervous system—An analogy" class="wp-image-18771" style="width:503px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11.png 362w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11-300x154.png 300w" sizes="auto, (max-width: 362px) 100vw, 362px" /><figcaption class="wp-element-caption">Figure 1. Empire of headcount vs. federated nervous system—An analogy</figcaption></figure>



<h2 class="wp-block-heading">Pillar 1: Potential energy—How knowledge-ready is your department?</h2>



<p class="wp-block-paragraph">If the department is the fundamental unit of the enterprise, its contextual enclave is its brain—its store of potential energy. Most companies are drowning in low-quality context: petabytes of data buried in half-finished Slack threads, abandoned wikis, and tacit knowledge held by seniors who are three months from retirement. To an agent, this isn’t intelligence; it’s noise.</p>



<h3 class="wp-block-heading">From data lakes to sharded enclaves</h3>



<p class="wp-block-paragraph">The data lake became a 2020s nightmare—a giant swamp where context went to die. In the federated model, legal, HR, engineering, and compliance each maintain their own secure, high-density enclave instead. Policy, process documentation, and institutional knowledge are synthesized into a form an agent can reason over directly, without a human in the interpretive loop. Data stays local; reasoning moves via agents. Protocols like the Model Context Protocol (MCP) are emerging as the TCP/IP of the federated enterprise—a standard way for agents and tools to discover each other, exchange context, and record what happened regardless of which vendor stack sits underneath. MCP is what allows “reasoning moves, data stays” to be an implementation detail rather than a custom integration project every time.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1134" height="633" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8.png" alt="Figure 2. Contextual density in shared enclaves" class="wp-image-18764" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8.png 1134w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8-768x429.png 768w" sizes="auto, (max-width: 1134px) 100vw, 1134px" /><figcaption class="wp-element-caption">Figure 2. Contextual density in shared enclaves</figcaption></figure>



<h3 class="wp-block-heading">Making potential energy measurable</h3>



<p class="wp-block-paragraph">Three dimensions combine into what we call the contextual density score: coverage (what proportion of policy and process is documented and retrievable—for a compliance enclave, the fraction of onboarding scenarios tied to explicit playbooks); consistency and recency (how often does retrieved guidance conflict, and how stale is it); and retrieval quality (how often can a reference agent answer test questions from its own enclave without human overrides). The contextual density score measures how ready an enclave is for agents to act on it reliably. Each enclave is assigned an owner whose job is to improve that score quarter over quarter, much as a traditional leader improves throughput or defect rates. Context maintenance becomes the new R&amp;D.</p>
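
<p class="wp-block-paragraph">The text names the three dimensions but not how they combine, so the sketch below makes an assumption: each dimension is normalized to the range 0 to 1 and the score is a weighted average. The <code>EnclaveMetrics</code> type, the equal default weights, and the sample values are all hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass
class EnclaveMetrics:
    coverage: float             # share of scenarios tied to explicit playbooks
    consistency_recency: float  # 1.0 means no conflicts and nothing stale
    retrieval_quality: float    # share of test questions answered without override


def contextual_density(m, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed scoring rule: a weighted average of the three dimensions."""
    parts = (m.coverage, m.consistency_recency, m.retrieval_quality)
    if any(not 0.0 <= p <= 1.0 for p in parts):
        raise ValueError("each dimension must be normalized to the range 0..1")
    return sum(w * p for w, p in zip(weights, parts))


# Made-up values for a compliance enclave.
compliance = EnclaveMetrics(coverage=0.6, consistency_recency=0.9,
                            retrieval_quality=0.75)
score = contextual_density(compliance)  # about 0.75 with equal weights
```

<p class="wp-block-paragraph">Whatever rule an organization settles on, keeping the three inputs visible alongside the score makes it easier for an enclave owner to see which dimension to work on next.</p>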



<h2 class="wp-block-heading">Pillar 2: Agentic throughput (the kinetic energy)</h2>



<p class="wp-block-paragraph">If a department’s knowledge enclave is its store of potential energy, throughput is the kinetic energy: the volume and value of cognitive outcomes produced by the agentic layer without human execution in the critical path. To measure this, we must stop counting &#8220;activity&#8221; and start counting handshakes.</p>



<h3 class="wp-block-heading">The handshake economy</h3>



<p class="wp-block-paragraph">In a federated mesh, work is done through agent-to-agent (A2A) negotiation. A logistics agent detects a delayed shipment and initiates a handshake with a procurement agent to find an alternative supplier. That agent consults the contracts enclave via a legal agent to check compliance and risk limits. A resolution is reached, records are updated, and a human is notified of the result—not every intermediate step. Throughput is the rate of successful, economically meaningful handshakes.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1233" height="688" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9.png" alt="Figure 3. The federated agent operating model" class="wp-image-18765" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9.png 1233w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9-768x429.png 768w" sizes="auto, (max-width: 1233px) 100vw, 1233px" /><figcaption class="wp-element-caption">Figure 3. The federated agent operating model</figcaption></figure>



<h3 class="wp-block-heading">Agentic unit economics: The cost of the handshake</h3>



<p class="wp-block-paragraph">Not all handshakes are equal. Every one carries a token tax, an infrastructure cost, and a latency cost. Agentic throughput is only valuable when the cost per cognitive outcome is significantly lower than the labor-equivalent at equal or better quality. If an agent fans out 50 calls to a premium model to resolve a $5 inquiry, you&#8217;ve increased throughput and destroyed ROI. If a handful of calls to a moderately priced model resolve a complex cross-silo onboarding decision that previously took three teams and two weeks, the economics are compelling.</p>
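


<p class="wp-block-paragraph">The two cases above can be put into numbers with a small sketch. Every price here is hypothetical, as is the 50% saving threshold standing in for “significantly lower”; the check also ignores quality, which would need its own measure.</p>

```python
def cost_per_outcome(model_calls: int, cost_per_call: float, infra_cost: float = 0.0) -> float:
    """Total agent-side cost of resolving one case, in dollars."""
    return model_calls * cost_per_call + infra_cost


def beats_labor_baseline(agent_cost: float, labor_cost: float,
                         required_saving: float = 0.5) -> bool:
    """True when the agent cost is at least `required_saving` below the labor-equivalent cost."""
    return labor_cost * (1.0 - required_saving) >= agent_cost


# Hypothetical prices: 50 premium calls at $0.12 each to resolve a $5 inquiry.
fan_out = cost_per_outcome(model_calls=50, cost_per_call=0.12)
print(beats_labor_baseline(fan_out, labor_cost=5.0))        # False

# Hypothetical prices: 8 mid-tier calls at $0.02 each plus $1 of infrastructure,
# against an assumed $12,000 labor-equivalent for a cross-team decision.
handful = cost_per_outcome(model_calls=8, cost_per_call=0.02, infra_cost=1.0)
print(beats_labor_baseline(handful, labor_cost=12_000.0))   # True
```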



<p class="wp-block-paragraph">The agentic P&amp;L must therefore track outcome volume (risk-weighted handshakes per period) and cost per outcome relative to the pre-agentic baseline—this is where CFOs and architects meet. This recommendation is consistent with <a href="https://www.pwc.com.au/media/2026/pwc-ai-performance-study-australian-companies-lead-on-ai-security.html" target="_blank" rel="noreferrer noopener">emerging research</a>: The companies seeing genuine AI ROI are those using it to expand what they can do, not those focused purely on headcount reduction.</p>



<h3 class="wp-block-heading">How agents learn: Gyms and mirrors</h3>



<p class="wp-block-paragraph">The gym is a simulation built from historical cases and synthetic data where agents train against gold decisions, respecting policy constraints and risk limits. The mirror is a read-only, regulator-grade log of what agents did in production: prompts, tool calls, model versions, human overrides, and final outcomes. <a href="https://www.oreilly.com/radar/gyms-for-them-mirrors-for-us/" target="_blank" rel="noreferrer noopener">Agents spar in the gym; they are judged in the mirror</a>. By 2026, decision provenance—the ability to reconstruct who or what did what, under which policy and model version—is becoming standard operating procedure in regulated industries.</p>
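


<p class="wp-block-paragraph">One way to picture the mirror is as an append-only log in which each record is chained to the previous one by hash, so later tampering is detectable. This is a sketch under assumptions: the class name, field names, and the choice of SHA-256 chaining are illustrative, and a production log would also need durable, access-controlled storage.</p>

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class MirrorLog:
    """Append-only decision log; each entry carries the hash of the one before it."""
    _entries: list = field(default_factory=list)

    def append(self, prompt: str, tool_calls: list, model_version: str,
               human_override: bool, outcome: str) -> str:
        """Record one decision and return its hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else "genesis"
        record = {
            "prompt": prompt,
            "tool_calls": tool_calls,
            "model_version": model_version,
            "human_override": human_override,
            "outcome": outcome,
            "prev_hash": prev_hash,
        }
        record["hash"] = self._digest(record)
        self._entries.append(record)
        return record["hash"]

    @staticmethod
    def _digest(record: dict) -> str:
        """Hash every field except the stored hash itself."""
        body = {key: value for key, value in record.items() if key != "hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def verify(self) -> bool:
        """Walk the chain and confirm that no record was altered or reordered."""
        prev_hash = "genesis"
        for record in self._entries:
            if record["prev_hash"] != prev_hash or record["hash"] != self._digest(record):
                return False
            prev_hash = record["hash"]
        return True
```

<p class="wp-block-paragraph">An auditor can call <code>verify()</code> before reading anything else; if it fails, the record cannot be trusted as a reconstruction of what happened.</p>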



<h3 class="wp-block-heading">The Agentic P&amp;L decomposed</h3>



<p class="wp-block-paragraph">Four line items change structurally when an enterprise moves from a headcount model to a federated agentic model:</p>



<p class="wp-block-paragraph">Labor and benefits contract, but not to zero. The compliance function that previously employed 400 analysts moves to 80–100 humans in orchestration and oversight roles—higher-skilled and higher-cost per head, a deliberate trade of volume for leverage.</p>



<p class="wp-block-paragraph">General expenses shift as management layers thin, training budgets pivot from procedural compliance to enclave curation, and real estate requirements contract as hybrid squads replace large hub operations.</p>



<p class="wp-block-paragraph">Token and infrastructure costs emerge as a new operating line that does not exist in the pre-agentic P&amp;L. This line must be actively managed: cost per cognitive outcome is the new unit of measurement and deteriorates quickly with poorly designed agent architectures.</p>



<p class="wp-block-paragraph">Compliance and audit costs shift structure. In a Tier-1 bank, the cost of a single regulatory finding—remediation, legal exposure, delayed onboarding—dwarfs the annual cost of maintaining a well-designed decision log. The mirror transforms regulatory response from a fire drill into a navigable record. Decision provenance is not governance overhead. It is P&amp;L protection.</p>



<p class="wp-block-paragraph">Revenue productivity per person (RPP)—revenue divided by headcount—ties the expense-side story to the top line. Software-native firms have long used RPP as a signal of operational leverage; banks are now applying the same lens to their operations functions. As headcount contracts while throughput and revenue capacity hold or grow, RPP rises structurally rather than cyclically—the metric that tells a CFO whether agentic transformation is delivering leverage or merely cost reduction.</p>



<h2 class="wp-block-heading">A stylized agentic P&amp;L: Compliance in a Tier-1 bank</h2>



<p class="wp-block-paragraph">Consider a compliance function with 400 analysts. Its P&amp;L is dominated by salaries, benefits, and office costs. Context sits in email, local drives, and the memory of experienced analysts—institutional knowledge that walks out of the building every evening.</p>



<p class="wp-block-paragraph">In phase 1, the bank builds a compliance enclave: policies, historical cases, and regulator Q&amp;A synthesized into a structured knowledge graph. Three hybrid squads of 12–15 humans work alongside 10–15 agents handling document collection, screening, and rule-based decisions. Agentic throughput starts modestly—20%–30% of low-risk cases auto-cleared from within the enclave. The P&amp;L effect at this stage is primarily a productivity story: lower cost per case, faster cycle times.</p>



<p class="wp-block-paragraph">The structural transformation comes in phase 2. After several cycles of gym training and mirror-driven refinement, the function operates with 80–100 humans plus 40–60 agents. The compliance enclave—curated policies, decision logs, evaluated reward functions—is now the primary asset. Legal discovery may require the email archive; what the regulator wants is a structured, navigable record of decisions. That’s what the mirror provides. With it, the reduced headcount is defensible to regulators, to the board, and on the P&amp;L.</p>



<h2 class="wp-block-heading">The new org unit: The 3+N squad</h2>



<p class="wp-block-paragraph">The &#8220;3+N&#8221; squad—a small human core plus a flexible swarm of agents—is the fundamental cell of the agentic enterprise. The strategic architect sets intent and constraints. The policy and ethics lead designs the gyms, ensuring agents act under responsible AI principles. The technical orchestrator manages the context mesh, MCP-based connectors, and enclave density. Around them, specialized agents handle contract analysis, sanctions screening, exception routing, and external API liaison. This is cognitive federation. Humans move up-stack into judgment and intent, while agents handle high-volume reasoning and cross-departmental coordination.</p>



<p class="wp-block-paragraph">Leaders rewarded for headcount and budget will resist decomposing their empires even as enclave quality and throughput improve. Executive scorecards must include agentic KPIs: enclave maturity, agentic throughput, risk-adjusted outcomes, and RPP. The mirror needs an explicit owner spanning risk, compliance, and engineering. Without decision provenance, you get the worst of both worlds: expensive models and humans still quietly doing the real work in spreadsheets.</p>



<p class="wp-block-paragraph">When you tell a senior vice president that their value is no longer tied to a 500-person headcount but to the knowledge readiness and agentic throughput of their domain, they will fight. The resistance isn’t just economic; it’s psychological. Headcount has been a proxy for power and identity. In the new world, it often becomes a proxy for architectural debt.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Client: &#8220;Can&#8217;t we just put a human in the loop but set the default to &#8216;Accept&#8217;?&#8221;</p>



<p class="wp-block-paragraph"><br>Me: &#8220;That&#8217;s not human-in-the-loop. That&#8217;s human-as-rubberstamp. You&#8217;re just automating the blame.&#8221;</p>
</blockquote>



<p class="wp-block-paragraph">The reframing that works is not &#8220;we are shrinking your kingdom&#8221; but &#8220;we are upgrading your leverage&#8221; from managing people (inherently high friction and limited scale) to designing intelligence (human-plus-agent systems that scale almost without bound).</p>



<h2 class="wp-block-heading">The leader of 2027: The playbook</h2>



<p class="wp-block-paragraph">The leader of 2027 thinks in flows instead of functions, enclaves and mirrors instead of departments and reports, and token costs and compliance risk instead of merely headcount and budget. Their signature move is converting headcount empires into high-density enclaves and high-throughput meshes under credible governance, then proving it on the P&amp;L with lower unit costs, faster cycle times, and a compliance posture auditors can navigate.</p>



<p class="wp-block-paragraph">For leaders mapping their 2026–2027 roadmaps, here are three hard pivots you need to make: First, stop hiring for capacity; build a better gym, not a bigger team. Second, audit your enclave’s knowledge readiness—if agents hallucinate, you have contextual debt, not a model problem; invest in governed sharded enclaves and mirrors your auditors can use. Finally, manage your token line as the new overhead expense; track cost per cognitive outcome rather than aggregate spend and monitor RPP as your headline leverage indicator.</p>



<p class="wp-block-paragraph">The goal is not to build an AI that works for you. The goal is to build an enterprise that thinks with you.</p>



<p class="wp-block-paragraph">Gyms for them, mirrors for us, and a context mesh to hold the P&amp;L together—that is the architecture of a decentralized, high-alpha enterprise. Anything else is just an expensive way to stay in the 20th century.</p>
]]></content:encoded>
										</item>
		<item>
		<title>The Agent Stack Bet</title>
		<link>https://www.oreilly.com/radar/the-agent-stack-bet/</link>
				<pubDate>Wed, 20 May 2026 10:58:36 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18746</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-agent-stack-bet.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-agent-stack-bet-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[The bet every serious developer needs to make on their agent stack]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Elevate newsletter and is being reposted here with the author&#8217;s permission. Peek under the hood of most “production agents” shipping today and you won’t find intelligence. You’ll find custom plumbing, fragile session logic, shared service accounts, and a security model held together by hope. This can be so [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on the</em> <a href="https://addyo.substack.com/p/the-agent-stack-bet" target="_blank" rel="noreferrer noopener">Elevate</a> <em>newsletter and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">Peek under the hood of most “production agents” shipping today and you won’t find intelligence. You’ll find custom plumbing, fragile session logic, shared service accounts, and a security model held together by hope. This can be so much better.</p>



<p class="wp-block-paragraph">If you’ve spent the last 18 months putting agents into production, you already know the models and tools have gotten <em>dramatically</em> better. You also know the problems that are still burning your on-call rotation are not problems you can prompt your way out of. We are running into a <strong>stack ceiling</strong>, and it is quietly creating a <strong>governance</strong> and <strong>reliability gap</strong> that the next generation of agentic systems cannot grow through.</p>



<p class="wp-block-paragraph">Right now the industry is living with what I’d call <em>excessive agency</em>: <strong>autonomous systems given broad permissions to get things done</strong>, then left to discover—at runtime, in production—that a schema drifted, an API changed, or a downstream service started returning PII it wasn’t supposed to. Agents mark tasks “complete” while leaving a trail of corrupted state behind them. The humans find out on Monday.</p>



<p class="wp-block-paragraph">This is not a failure of the people building agents. It is a failure of the stack they’re building on.</p>



<p class="wp-block-paragraph">Here are the four architectural bets I think every serious team has to make in the next twelve months.</p>



<h2 class="wp-block-heading"><strong>1) Agents need identities, not shared credentials</strong></h2>



<p class="wp-block-paragraph">Every engineer who has shipped agents to production knows this specific flavor of dread: You have agents doing useful work, and effectively zero visibility into which tools they touched, which data they moved, or which credentials they used to do it. I call this <em>governance debt</em>—the silent accumulation of security and audit risk that eventually forces a full rewrite, usually right after the first incident that reaches the CISO.</p>



<p class="wp-block-paragraph">The root cause is that most agents today are ghosts. They don’t have identities. They borrow a service account, inherit a human’s OAuth token, and “promise”—in application code, in a prompt—to stay inside the lines. In a real enterprise environment, a promise in a prompt is not a policy.</p>



<p class="wp-block-paragraph"><strong>My bet is that agent identity has to move from the application layer down into the platform layer.</strong></p>



<p class="wp-block-paragraph">The difference is between bolted-on and embedded security. Bolted-on looks like middleware in front of every tool call, politely asking the agent to behave: easy to bypass, expensive in latency, and invisible to your existing IAM. Embedded looks like a badge reader welded into a steel frame. The agent has a distinct, unforgeable identity recognized at the network and platform level, and policy is enforced at the source. If the agent reaches for a database it isn’t cleared for, the connection never opens. No middleware, no vibes.</p>



<p class="wp-block-paragraph">Done right, this turns “a fleet of liabilities” into something that looks a lot more like a managed workforce: every action attributable, every permission auditable, every agent revocable with one call.</p>
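


<p class="wp-block-paragraph">To make the embedded model concrete, here is a minimal in-process sketch of deny-by-default grants keyed by agent identity. All names are hypothetical, and a real platform would enforce this at the network and IAM layer rather than in application code; the point is only the shape: every attempt is recorded, an uncleared connection never opens, and one call revokes an agent.</p>

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when an agent identity is not cleared for a resource."""


@dataclass
class PlatformPolicy:
    """Deny-by-default grants keyed by agent identity (all names hypothetical)."""
    grants: dict = field(default_factory=dict)       # agent_id -> set of (resource, action)
    audit_trail: list = field(default_factory=list)  # every attempt, allowed or not

    def grant(self, agent_id: str, resource: str, action: str) -> None:
        self.grants.setdefault(agent_id, set()).add((resource, action))

    def revoke(self, agent_id: str) -> None:
        """Remove every permission held by one agent in a single call."""
        self.grants.pop(agent_id, None)

    def open_connection(self, agent_id: str, resource: str, action: str) -> str:
        """Return a connection handle, or raise before anything is opened."""
        allowed = (resource, action) in self.grants.get(agent_id, set())
        self.audit_trail.append((agent_id, resource, action, allowed))
        if not allowed:
            raise AccessDenied(f"{agent_id} is not cleared for {action} on {resource}")
        return f"connection:{resource}:{agent_id}"
```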



<h2 class="wp-block-heading"><strong>2) Agents need universal context, not scraped windows</strong></h2>



<p class="wp-block-paragraph">Context management is a tax every builder is currently paying. Teams are burning a huge share of their engineering hours (and tokens) on undifferentiated plumbing—custom serialization, bespoke session stores, hand-rolled memory layers—just to keep an agent from forgetting its mission halfway through a multi-step task.</p>



<p class="wp-block-paragraph">Worse, the context agents <em>can</em> get their hands on is usually siloed. A browser-based agent can see the open tab. A desktop wrapper can see the files a user happened to drag in. Neither of them can easily reason across the systems where the business actually lives—the CRM, the ERP, the data warehouse, the ticketing system, the transcripts, the project plans—at the same time.</p>



<p class="wp-block-paragraph"><strong>Agents need universal context that integrates at the platform level.</strong> If we don’t fix this, we should be honest that the ceiling of agentic AI is “slightly better spreadsheet autocomplete,” and we should stop writing vision pieces about it.</p>



<h2 class="wp-block-heading"><strong>3) Agents need to survive your laptop closing</strong></h2>



<p class="wp-block-paragraph">Here’s the uncomfortable version of this: A lot of what ships today as “an agent” isn’t yet ready to deploy across a business.</p>



<p class="wp-block-paragraph">I want to be precise, because the frontier has genuinely moved in the last six months. Environments like Claude Code, OpenClaw, and similar platforms are capable—persistent task state, scheduled execution, multi-agent coordination, and long-running sessions that survive disconnects are no longer aspirational. These are not toys. The question has moved on.</p>



<p class="wp-block-paragraph">The question now is whether an agent can run for a week instead of an hour. Whether it can cross three handoffs, two credential rotations, and an approval gate without a human babysitting the session. Whether the work it did on Tuesday is auditable on Friday by someone who wasn’t in the room. A session that survives a dropped WebSocket is table stakes. A mission that survives a quarter is the bar enterprises actually need.</p>



<p class="wp-block-paragraph">Real work doesn’t fit in a session, and most of it doesn’t fit in a day either. A procurement workflow spans weeks and a dozen handoffs. A compliance audit runs for a month. An incident investigation outlives three on-call rotations.</p>



<p class="wp-block-paragraph"><strong>Most agents today hit a hard ceiling—sometimes time-based, sometimes token-based, sometimes governance-based—and when they hit it, the mission fails and a human picks up the pieces from wherever the transcript ended.</strong></p>



<p class="wp-block-paragraph">Enterprise-grade autonomy requires durable, cloud-native execution with a much higher floor than “the session stayed up.” Concretely, that means:</p>



<ul class="wp-block-list">
<li><strong>State</strong> and <strong>checkpointing</strong> that survives restarts, disconnects, redeploys, and model version changes by default—not bolted on with a local Redis and a prayer.</li>



<li><strong>Context that outlives the window</strong>: long-horizon memory, summarization, and handoff between agent instances, so a multi-week task doesn’t die because a single run exhausted its tokens.</li>



<li><strong>Missions that outlive sessions</strong>: agents that stay on the job across days, handoffs, and credential rotations, with an auditable trail of what happened while you were asleep.</li>



<li><strong>First-class human-in-the-loop primitives,</strong> so the agent can pause and ask for permission to do something new instead of silently deciding it has the authority.</li>
</ul>
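


<p class="wp-block-paragraph">The checkpointing and human-in-the-loop items in the list above can be sketched together in a few lines of Python. This is a toy under stated assumptions: the function and exception names are made up for illustration, and the JSON file stands in for whatever durable store a real platform provides.</p>

```python
import json
from pathlib import Path


class ApprovalRequired(Exception):
    """Signals that the mission paused and is waiting for a human decision."""


def run_mission(steps: list, checkpoint: Path, approved: frozenset = frozenset()) -> list:
    """Run named steps in order, persisting progress after each one.

    `steps` is a list of (name, function, needs_approval) tuples.
    Returns the names of all completed steps.
    """
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    for name, action, needs_approval in steps:
        if name in done:
            continue  # finished in an earlier run; do not repeat it
        if needs_approval and name not in approved:
            raise ApprovalRequired(name)  # pause instead of assuming authority
        action()
        done.append(name)
        checkpoint.write_text(json.dumps(done))  # progress survives a restart
    return done
```

<p class="wp-block-paragraph">Calling <code>run_mission</code> again after a crash, or after a human adds the step name to <code>approved</code>, resumes from the last completed step rather than from the start.</p>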



<p class="wp-block-paragraph">Persistence with guardrails. That’s the bar. Anything less and you’re building demos that happen to run for a long time.</p>



<h2 class="wp-block-heading"><strong>4) Agents need platforms</strong></h2>



<p class="wp-block-paragraph">The pattern I see most often in strong teams is the saddest one: brilliant engineers draining their bandwidth into stack problems that do not differentiate their product. Custom memory. Bespoke eval harnesses. Homegrown observability. Handwritten retry logic. A tracing system that almost works. None of this is the hard part of the agentic era, and none of it is what your users are paying you for.</p>



<p class="wp-block-paragraph">The real value lives in domain reasoning and business logic—the judgment calls that are specific to your company, your customers, your regulatory environment. Everything underneath should be the platform you <em>build on</em>, not the plumbing you <em>build</em>.</p>



<p class="wp-block-paragraph">This is why the maturation of open primitives matters right now. Open-source orchestration frameworks exist precisely so the scaffolding isn’t locked behind any single vendor’s roadmap. The model that worked for cloud compute, containers, and CI/CD—start local on open primitives, graduate to a managed platform when you’re ready to scale—is the model agent platforms need to copy.</p>



<p class="wp-block-paragraph"><strong>Teams should be able to prototype on their laptop with the same building blocks they’ll run in production, and cross that boundary without a rewrite.</strong></p>



<p class="wp-block-paragraph">That’s the engineering standard that lets teams stop fighting plumbing and get back to the product.</p>



<h2 class="wp-block-heading"><strong>The five-year horizon</strong></h2>



<p class="wp-block-paragraph">The teams that pull ahead in the next five years will not pull ahead by being smarter at writing boilerplate. They’ll pull ahead by <strong>choosing the right agent foundation</strong> and spending their engineering hours on the problems <em><strong>only they can solve</strong></em>.</p>



<p class="wp-block-paragraph">Every month spent rebuilding the common stack—identity, context, persistence, orchestration—is a month not spent on the logic that actually makes your agents worth deploying.</p>



<p class="wp-block-paragraph"><strong>The agent stack has to become a solved problem.</strong> The only real question is whether you want to solve it yourself, again, or build on a foundation that was engineered for agents from the ground up.</p>



<p class="wp-block-paragraph">My bet is on the latter. I think yours should be too.</p>
]]></content:encoded>
										</item>
		<item>
		<title>When an Agent Deletes the Production Database</title>
		<link>https://www.oreilly.com/radar/when-an-agent-deletes-the-production-database/</link>
				<pubDate>Tue, 19 May 2026 16:00:39 +0000</pubDate>
					<dc:creator><![CDATA[Sam Newman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18743</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/When-an-agent-deletes-the-production-database.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/When-an-agent-deletes-the-production-database-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Revisiting the PocketOS Incident]]></custom:subtitle>
		
				<description><![CDATA[Another day, another example of an AI Agent &#8220;running rogue&#8221; and doing something the human operator didn&#8217;t want it to do. The tl;dr is that Jeremy (Jer) Crane, founder of PocketOS, was using Claude to perform some routine DB maintenance. Claude then proceeded to delete the production database and all backups hosted at their cloud [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Another day, another example of an AI agent &#8220;running rogue&#8221; and doing something the <a href="https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/" target="_blank" rel="noreferrer noopener">human operator didn&#8217;t want it to do</a>. The tl;dr is that Jeremy (Jer) Crane, founder of PocketOS, was using Claude to perform some routine DB maintenance. Claude then proceeded to delete the production database and all backups hosted at their cloud provider, Railway. To their credit, Railway managed to recover the lost data. The initial deletion took less than 10 seconds; I&#8217;m sure the recovery took much longer. Let’s look at what we can learn from what happened, and why AI is really just an amplifier of existing issues, rather than the cause itself.</p>



<p class="wp-block-paragraph">We know about the incident because Jer <a href="https://x.com/lifeof_jer/status/2048103471019434248?s=20" target="_blank" rel="noreferrer noopener">wrote</a> about it after it happened. First, taking time to reflect after something goes wrong is important; it&#8217;s how we learn. Sharing your mistakes with the world can be difficult, but it creates chances for us all to learn from each other. Second, I&#8217;ve seen a lot of people publicly dunking on both PocketOS and Railway. I would guess that none of those people have ever experienced the sheer terror and panic that happens during an incident like this. The feeling that you just want the ground to open and swallow you whole. It&#8217;s a feeling I&#8217;ve only experienced once or twice before, and it&#8217;s not an experience I&#8217;m keen to repeat.</p>



<p class="wp-block-paragraph">One point to Railway’s credit is that they got PocketOS’s data back. If you called for a deletion via the APIs on AWS, Azure, Google Cloud, or whatever, using a valid credential, that data is gone, unless you have your own backups of course. AWS et al. aren’t maintaining backups of customer data to hedge against customer mistakes. This is your yearly reminder to <a href="https://www.backblaze.com/blog/the-3-2-1-backup-strategy/" target="_blank" rel="noreferrer noopener">look into the 3-2-1 backup strategy</a>.</p>
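


<p class="wp-block-paragraph">For anyone who hasn’t met it, the 3-2-1 rule asks for three copies of your data, on two different kinds of media, with one copy offsite. A small sketch of checking an inventory against it follows; the class and the example inventory are hypothetical and are not a description of PocketOS’s actual setup.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # where the copy lives
    medium: str     # kind of storage, e.g. "block-volume" or "object-storage"
    offsite: bool   # True when held outside the primary provider or site


def satisfies_3_2_1(copies: list) -> bool:
    """Three copies in total, on two different media, at least one offsite."""
    media = {copy.medium for copy in copies}
    return len(copies) >= 3 and len(media) >= 2 and any(copy.offsite for copy in copies)


# Hypothetical inventory: production data and its backup share one volume.
same_volume = [
    BackupCopy("prod-volume", "block-volume", offsite=False),
    BackupCopy("prod-volume/backups", "block-volume", offsite=False),
]
print(satisfies_3_2_1(same_volume))  # False
```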



<p class="wp-block-paragraph">What can we learn about what happened? Well, for all the discussion around how this is AI&#8217;s fault, what we have here is a much simpler example of common system weaknesses being exploited both accidentally and at speed.</p>



<h2 class="wp-block-heading">What Did Claude Do?</h2>



<p class="wp-block-paragraph">Claude had been asked to carry out a task against PocketOS&#8217;s staging environment. The agent hit an issue, searched out and found a long-lived API token which gave access to production, and then proceeded to delete the production volume that contained both the production databases and the backups.</p>



<p class="wp-block-paragraph">When asked what had happened, Claude’s reaction was objectively funny. It seemed to be totally aware of what went wrong, and what it should have done instead. This implies a set of reasoning that was not evident during the actual operation itself. I do wonder if recent attempts to reduce how much reasoning Claude does in certain modes, in order to cut token use and Anthropic’s operating costs, might partly be to blame.</p>



<p class="wp-block-paragraph">Breaking it all down, there seem to be a couple of fairly straightforward issues at play that at first glance have very little to do with AI itself.</p>



<p class="wp-block-paragraph">The token Claude had access to gave overly broad access. It&#8217;s common for cloud-based infrastructure providers like AWS or Azure to allow you to create tokens that are limited in what they do. This helps implement the <em>principle of least privilege</em>. The idea is that an actor in a system should be given access to what they need, and no more. The principle of least privilege reduces the impact if an inappropriate party gains access to the actor’s credentials, or if the actor themself goes rogue. Consider what happens if someone steals your hotel room key. They can get into your hotel room, which isn&#8217;t great, but they can&#8217;t get into anyone else&#8217;s. It seems that Railway has a limitation that its auth tokens cannot have their scope limited.</p>



<p class="wp-block-paragraph">The second problem was that the credentials were stored on disk and had not expired. This makes the impact of the broadly scoped auth token much worse. Credentials should be time limited, so that if they are found later they cannot be used. If tokens are generated on demand, which could have been done in this specific case, then this particular issue could have been mitigated. Claude would have had to ask for a human to provide a credential—at which point, hopefully, the operator would have had a chance to work out what was going on.</p>
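


<p class="wp-block-paragraph">Both ideas, narrow scope and short lifetime, fit in a few lines. The sketch below is not Railway’s API or any provider’s token format; the names, the 15-minute default, and the action strings are assumptions chosen to illustrate the checks.</p>

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedToken:
    environment: str     # for example "staging"
    actions: frozenset   # for example {"volume:read"}
    expires_at: float    # Unix timestamp after which the token is useless


def issue_token(environment: str, actions, ttl_seconds: int = 900, now: float = None) -> ScopedToken:
    """Mint a token on demand, limited to one environment and a short lifetime."""
    issued_at = time.time() if now is None else now
    return ScopedToken(environment, frozenset(actions), issued_at + ttl_seconds)


def authorize(token: ScopedToken, environment: str, action: str, now: float = None) -> bool:
    """Allow a request only if the token is unexpired and covers this exact use."""
    current = time.time() if now is None else now
    if current >= token.expires_at:
        return False  # a token found on disk later cannot be replayed
    return token.environment == environment and action in token.actions
```

<p class="wp-block-paragraph">With tokens like this, a staging credential discovered by an agent would fail against production, and would fail everywhere once its lifetime ran out.</p>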



<p class="wp-block-paragraph">I take minor issue with Jer&#8217;s assertion that Railway&#8217;s GraphQL API should have required a confirmation before deletion. This, to me, is a fundamental misunderstanding of what cloud APIs are for. APIs are there for automation; if you want a human-in-the-loop confirmation model, you have to build that yourself. This has always been the case. However, in the aftermath of an incident like this, we should give Jer a lot of leeway around his view of the problems, and some of Jer&#8217;s requests for how Railway should change appear to be very sensible (e.g., clearer SLAs and tokens that are easier to scope).</p>



<h2 class="wp-block-heading">How Could These Issues Be Mitigated?</h2>



<p class="wp-block-paragraph">One obvious takeaway is to ensure that access tokens are more aggressively expired, but also made more limited in scope. This reduces the chance of Claude accessing something it shouldn’t. This would need to be solved on the Railway side, as they generate the token in the first place.</p>



<p class="wp-block-paragraph">Unfortunately, having a more limited token for Claude isn’t a total fix for this scenario. Claude was given a token that limited its behavior, and went looking for a better token—and found it. This is not the first time I’ve heard of this happening; the same thing happened to a client of mine recently.</p>



<p class="wp-block-paragraph">As our agents become more sophisticated, it seems that some sort of sandboxing is key. The production token was viewable by Claude, so it was used. Running agents in a restricted sandbox where they are only able to see parts of your filesystem would help greatly. However, that also limits their usefulness.</p>



<p class="wp-block-paragraph">Another option would be for the agent to ask for confirmation before it does something like delete data. It seems conceivable that a human-in-the-loop model for when the agent has to escalate privileges could help. But again, if it gets access to an access token with broad scope, it won’t need to ask a human.</p>
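


<p class="wp-block-paragraph">Here is a rough sketch of the two guards just discussed: a filesystem check that keeps the agent inside a sandbox root, and a confirmation gate in front of destructive actions. The action names and paths are hypothetical, and both guards only help if they sit somewhere the agent cannot route around, which is exactly the caveat above.</p>

```python
from pathlib import Path

DESTRUCTIVE_ACTIONS = frozenset({"delete_volume", "drop_database"})  # hypothetical names


def visible_to_agent(path: str, sandbox_root: str) -> bool:
    """True only when the resolved path stays inside the sandbox root."""
    root = Path(sandbox_root).resolve()
    target = Path(path).resolve()  # collapses ".." so traversal cannot escape
    return target == root or root in target.parents


def execute(action: str, confirm) -> str:
    """Run an action, asking the `confirm` callback first if it is destructive."""
    if action in DESTRUCTIVE_ACTIONS and not confirm(action):
        return "blocked"
    return "executed"
```

<p class="wp-block-paragraph">In practice <code>confirm</code> would page a human and wait; in a test it can be a function that simply returns <code>False</code>.</p>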



<p class="wp-block-paragraph">Finally, I’ve seen a lot of discussion about how the agent should “know” that deleting the data was bad, and that it should have checked first. This is a fundamental limitation of an LLM-based agent. It has no concept of causality. It cannot predict what will happen. There is a field of AI study known as <a href="https://en.wikipedia.org/wiki/World_model_(artificial_intelligence)" target="_blank" rel="noreferrer noopener">world models</a>, which could allow these agents to make more informed decisions. For example, a world model that understands physics would be able to predict that an egg would likely break if it was pushed from a table onto the concrete floor below. World models are used a lot in video generation and autonomous driving (where prediction of motion is key), but are sparsely used elsewhere.</p>



<h2 class="wp-block-heading">AI Not To Blame?</h2>



<p class="wp-block-paragraph">I said just a moment ago that these issues seem to have little to do with AI. That isn&#8217;t entirely true.</p>



<p class="wp-block-paragraph">In the recent DORA report on the state of <a href="https://dora.dev/research/2025/dora-report/" target="_blank" rel="noreferrer noopener">AI-assisted Software Development</a>, the authors noted that AI seems to be an amplifier: that AI-assisted software development tends to help good teams go faster, and slow teams go slower. Bad practices get encoded and done more. In the PocketOS and Railway situation, we have a set of credentials that were overly broad, with long-lived credentials stored on disk, combined with an apologetic AI agent doing something other than what was expected of it. If a human had made the same mistakes, they would have made them much more slowly, and may well have had the chance to work out their mistake part way through. AI works so fast that it can go more quickly in the wrong direction.</p>



<p class="wp-block-paragraph">More importantly, unlike LLM-based AI, a human being has the chance to learn from experience, and for that learning to be rooted in a very specific, emotional response. When I first heard about the PocketOS story, I was brought back to a dim echo of that same horrific feeling I had in the midst of a major production issue that I had contributed to. Those feelings don&#8217;t leave you—those lessons don&#8217;t leave you. Every time I touched a production system, those memories were with me, and helped guide me towards more sensible working practices.</p>
]]></content:encoded>
										</item>
	</channel>
</rss>
