<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Fri, 03 Apr 2026 11:14:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>The Cathedral, the Bazaar, and the Winchester Mystery House</title>
		<link>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/</link>
				<comments>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/#respond</comments>
				<pubDate>Fri, 03 Apr 2026 11:14:53 +0000</pubDate>
					<dc:creator><![CDATA[Drew Breunig]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18446</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-Cathedral-the-Bazaar-and-the-Winchester-Mystery-House.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-Cathedral-the-Bazaar-and-the-Winchester-Mystery-House-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Our era of sprawling, idiosyncratic tooling]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Drew Breunig’s blog and is being republished here with the author’s permission. In 1998, Eric S. Raymond published the founding text of open source software development, The Cathedral and the Bazaar. In it, he detailed two methods of building software: The bazaar model was enabled by the internet, which [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://www.dbreunig.com/2026/03/26/winchester-mystery-house.html" target="_blank" rel="noreferrer noopener"><em>Drew Breunig’s blog</em></a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>In 1998, Eric S. Raymond published the founding text of open source software development, <a href="http://www.catb.org/~esr/writings/cathedral-bazaar/" target="_blank" rel="noreferrer noopener"><em>The Cathedral and the Bazaar</em></a>. In it, he detailed two methods of building software:</p>



<ul class="wp-block-list">
<li><em>The cathedral</em> model is carefully planned, closed-source, and managed by an exclusive team of developers.</li>



<li><em>The bazaar</em> model is open, transparent, and community-driven.</li>
</ul>



<p>The bazaar model was enabled by the internet, which allowed for distributed coordination and distribution. More people could contribute code and share feedback, yielding better, more secure software. “Given enough eyeballs, all bugs are shallow,” Raymond wrote, coining <a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">Linus’s law</a>.</p>



<p>The ideas crystallized in <em>The Cathedral and the Bazaar</em> helped kick off a quarter-century of open source innovation and dominance.</p>



<p>But just as the internet made communication cheap and birthed the bazaar, AI is making code cheap and kicking off a new era filled with idiosyncratic, sprawling, cobbled-together software.</p>



<p>Meet the third model: <em>The Winchester Mystery House</em>.</p>



<figure class="wp-block-image size-full"><a href="https://www.flickr.com/photos/harshlight/3669393933"><img fetchpriority="high" decoding="async" width="1600" height="898" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.jpeg" alt="Image by HarshLight on Flickr (and used here on a Creative Commons license)" class="wp-image-18447" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.jpeg 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-300x168.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-768x431.jpeg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1536x862.jpeg 1536w" sizes="(max-width: 1600px) 100vw, 1600px" /></a><figcaption class="wp-element-caption"><em><a href="https://www.flickr.com/photos/harshlight/3669393933" target="_blank" rel="noreferrer noopener">Winchester Mystery House</a></em> (<em>image by <a href="https://www.flickr.com/photos/harshlight/" target="_blank" rel="noreferrer noopener">HarshLight</a> and used here on a </em><a href="https://creativecommons.org/licenses/by/2.0/" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">The Winchester Mystery House</h2>



<p>Located less than 10 miles southeast of the <a href="https://computerhistory.org/" target="_blank" rel="noreferrer noopener">Computer History Museum</a>, the <a href="https://en.wikipedia.org/wiki/Winchester_Mystery_House" target="_blank" rel="noreferrer noopener">Winchester Mystery House</a> is an architectural oddity.</p>



<p>Following the death of her husband and mother-in-law, Sarah Winchester controlled a fortune. Her shares in the <a href="https://en.wikipedia.org/wiki/Winchester_Repeating_Arms_Company" target="_blank" rel="noreferrer noopener">Winchester Repeating Arms Company</a>, and the dividends they threw off, made it so Sarah could not only live in comfort but pursue whatever passion she desired. That passion was architecture.</p>



<p>Sarah didn’t build her mansion to house ghosts<sup data-fn="6b5d56a2-c8e2-4889-b816-684245a77bcd" class="fn"><a href="#6b5d56a2-c8e2-4889-b816-684245a77bcd" id="6b5d56a2-c8e2-4889-b816-684245a77bcd-link">1</a></sup>; <a href="https://amzn.to/4rZK1C8" target="_blank" rel="noreferrer noopener">she built her mansion because she liked architecture</a>. With no license, no formal training, in an era when women (even very rich women) didn’t have a path to practicing architecture, Sarah focused on her own home. She made up for her lack of license with passion and effectively unlimited funds.</p>



<p>Sarah built what she wanted. “<a href="https://en.wikipedia.org/wiki/Winchester_Mystery_House" target="_blank" rel="noreferrer noopener">At its largest the house had ~500 rooms</a>.” Today it has roughly 160 rooms, 2,000 doors, 10,000 windows, 47 stairways, 47 fireplaces, 13 bathrooms, and 6 kitchens. Carved wood drapes the walls and ceilings. Stained glass is everywhere. Projects were planned, completed, abandoned, torn down, and rebuilt.</p>



<p>It was anything but aimless. And practical innovations ran throughout, including push-button gas lighting, an early intercom system, steam heating, and indoor gardens. The oddities that amuse today’s visitors were mostly practical accommodations for Sarah’s health (stairways with very small steps), functional designs no longer used (trap doors in greenhouses to route excess water), or quick fixes to damage from the 1906 earthquake.</p>



<p>Winchester died in 1922. Nine months later, the house became a tourist attraction.</p>



<p>Today, many programmers are Sarah Winchester.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="880" height="440" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5.png" alt="Claude Code's public GitHub activity" class="wp-image-18448" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5.png 880w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5-300x150.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5-768x384.png 768w" sizes="(max-width: 880px) 100vw, 880px" /><figcaption class="wp-element-caption"><em>Claude Code&#8217;s public GitHub activity</em></figcaption></figure>



<h2 class="wp-block-heading">What happens when code is cheap</h2>



<p>We aren’t as rich as Sarah Winchester, but when code is this cheap, we don’t need to be.</p>



<p>Jodan Alberts illustrated this recently, <a href="https://www.claudescode.dev/" target="_blank" rel="noreferrer noopener">collecting and visualizing data detailing public GitHub commits attributed to Claude Code</a>. That’s his data in the chart above, with Claude seeming to only accelerate through March.<sup data-fn="de32f944-88cb-40cf-ba5a-a85253a6ad73" class="fn"><a href="#de32f944-88cb-40cf-ba5a-a85253a6ad73" id="de32f944-88cb-40cf-ba5a-a85253a6ad73-link">2</a></sup></p>



<p>It’s hard to get a handle on individual usage though, so I went searching for a proxy and landed on the chart below:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="880" height="396" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6.png" alt="Average net lines added per commit in Claude Code: 7-day average" class="wp-image-18449" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6.png 880w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6-300x135.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6-768x346.png 768w" sizes="(max-width: 880px) 100vw, 880px" /><figcaption class="wp-element-caption"><em>Average net lines added per commit in Claude Code: 7-day average</em></figcaption></figure>



<p>After Opus 4.5 and recent work enabling Agent Teams, the average net lines added by Claude per commit is now smooth and steady at <em>1,000 lines of code per commit.</em><sup data-fn="bc98f5bc-dd9d-4421-a544-65d4191ad4fb" class="fn"><a href="#bc98f5bc-dd9d-4421-a544-65d4191ad4fb" id="bc98f5bc-dd9d-4421-a544-65d4191ad4fb-link">3</a></sup></p>



<p><strong>1,000 lines of code per commit is roughly two orders of magnitude higher than what a human programmer writes <em>per day</em>.</strong></p>



<p>If you search for human benchmarks, you’ll find many citing Fred Brooks’s <em><a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month">The Mythical Man-Month</a></em> while claiming a good engineer might write <em>10 cumulative lines of code per day.</em><sup data-fn="bb51a862-1362-4241-b2ba-6fecac1df6b9" class="fn"><a href="#bb51a862-1362-4241-b2ba-6fecac1df6b9" id="bb51a862-1362-4241-b2ba-6fecac1df6b9-link">4</a></sup> Explore further and you’ll find higher numbers cited, but generally fewer than 100.</p>



<p>Here’s a good anecdote from <a href="https://antirez.com/latest/0" target="_blank" rel="noreferrer noopener">antirez</a> on a <a href="https://news.ycombinator.com/item?id=22305934" target="_blank" rel="noreferrer noopener">Hacker News</a> thread discussing the Brooks “quote”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I did some trivial math. Redis is composed of 100k lines of code, I wrote at least 70k of that in 10 years. I never work more than 5 days per week and I take 1 month of vacations every year, so assuming I work 22 days every month for 11 months:</p>



<p><em>70000/(22 x 11 x 10) = ~29 LOC / day</em></p>



<p>Which is not too far from 10. There are days where I write 300-500 LOC, but I guess that a lot of work went into rewriting stuff and fixing bugs, so I rewrote the same lines again and again over the course of years, but yet I think that this should be taken into account, so the Mythical Man Month book is indeed quite accurate.</p>
</blockquote>
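<p>As a quick sanity check on antirez’s arithmetic (the working-day assumptions are his, taken from the quote above):</p>

```python
# Reproduce antirez's back-of-the-envelope estimate:
# ~70k lines of Redis written over 10 years, working 22 days a month
# for 11 months a year.
lines_written = 70_000
working_days = 22 * 11 * 10        # 2,420 working days in total
loc_per_day = lines_written / working_days
print(round(loc_per_day, 1))       # ~28.9, i.e. roughly 29 LOC/day
```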



<p>Six years after this comment, Claude is pushing <em>1,000</em> lines of code <em>per commit</em>.</p>



<p>So what do we do with all this cheap code?</p>



<p>Unfortunately, everything else remains roughly the same cost and roughly the same speed. Feedback hasn’t gotten cheaper; the “<a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">eyeballs</a>” that guided the software developed by the bazaar haven’t caught up to AI.</p>



<p>There is only one source of feedback that moves at the speed of AI-generated code: yourself. You’re there to prompt, you’re there to review. You don’t need to recruit testers, run surveys, or manage design partners. You just build what you want and use what you build.</p>



<p>And that’s what many of us are doing with cheap code: building idiosyncratic tools for ourselves, guided by our passions, taste, and needs.</p>



<p>Sound familiar?</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1567" height="799" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.jpeg" alt="" class="wp-image-18450" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.jpeg 1567w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-300x153.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-768x392.jpeg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-1536x783.jpeg 1536w" sizes="auto, (max-width: 1567px) 100vw, 1567px" /><figcaption class="wp-element-caption"><a href="https://commons.wikimedia.org/wiki/File:Winchester_Mystery_House_2023-07-17_02.jpg" target="_blank" rel="noreferrer noopener"><em>Winchester Mystery House, San Jose, California</em></a><em> (image by </em><a href="https://commons.wikimedia.org/wiki/User:The_wub" target="_blank" rel="noreferrer noopener"><em>The wub</em></a><em> and used here under a </em><a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">Welcome to the mystery house</h2>



<p>Steve Yegge’s <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">Gas Town</a> is a Winchester Mystery House. It’s <em>incredibly</em> idiosyncratic and sprawling, rich with metaphors and hacks. It’s the perfect tool for Steve.</p>



<p>Jeffrey Emanuel’s <a href="https://agent-flywheel.com/" target="_blank" rel="noreferrer noopener">Agent Flywheel</a> is a Winchester Mystery House. A significant subset of <a href="https://www.nytimes.com/2026/03/20/technology/tokenmaxxing-ai-agents.html" target="_blank" rel="noreferrer noopener">tokenmaxxers</a> decide they need to rebuild their dependencies in Rust; Jeff is one such example. His “<a href="https://github.com/Dicklesworthstone#the-frankensuite" target="_blank" rel="noreferrer noopener">FrankenSuite</a>” includes Rust rewrites of SQLite, Node.js, btrfs, Redis, pandas, NumPy, JAX, and Torch.</p>



<p>Philip Zeyliger noted the pattern last week, writing, “<a href="https://blog.exe.dev/bones-of-the-software-factory" target="_blank" rel="noreferrer noopener">Everyone is building a software factory</a>.” But it goes beyond software. Garry Tan’s personal AI committee <a href="https://github.com/garrytan/gstack" target="_blank" rel="noreferrer noopener">gstack</a> is a Winchester Mystery House constructed mostly from Markdown.</p>



<p>Everywhere you look, there are Winchester Mystery Houses.</p>



<p>Each Winchester Mystery House is <strong>idiosyncratic</strong>. They are highly personalized. The tightly coupled feedback loop between the coding agent and the user yields software that reflects the developer’s desires. They usually lack documentation. To outsiders, they’re inscrutable.</p>



<p>Winchester Mystery Houses are <strong>sprawling</strong>. Guided by the needs of the developer, these tools tend to spread out, constantly annexing territory in the form of new functions and new repositories. Work is almost always additive. Code is added when it’s needed, bugs are patched in place, and countless appendages remain. There’s little incentive to prune when code is free.</p>



<p>And building a Winchester Mystery House should be <strong>fun</strong>. Coding agents turn everything into a side quest, and we eagerly join in. Building the perfect workflow is a passion for many devs, so we keep pushing.</p>



<p>Winchester Mystery Houses are idiosyncratic, sprawling, and fun. But does this mean we’re abandoning the bazaar?</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1200" height="549" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.jpeg" alt="A Crowded Market in Dhaka, Bangladesh (image by International Food Policy Research Institute / 2010 and used here on a Creative Commons license)" class="wp-image-18451" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.jpeg 1200w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-300x137.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-768x351.jpeg 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /><figcaption class="wp-element-caption"><a href="https://www.flickr.com/photos/ifpri/4860343116" target="_blank" rel="noreferrer noopener"><em>A Crowded Market in Dhaka, Bangladesh</em></a><em> (image by </em><a href="https://www.flickr.com/photos/ifpri/" target="_blank" rel="noreferrer noopener"><em>International Food Policy Research Institute</em></a><em> / 2010 and used here on a </em><a href="https://creativecommons.org/licenses/by-nc-nd/2.0/" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">What happens to the bazaar?</h2>



<p>What happens when we all tend to our mystery houses? When our free time is spent building tools just for ourselves, will we stop working on shared projects? Will we abandon the bazaar?</p>



<p>Probably not. The bazaar is <em>packed</em> right now, but not in a good way.</p>



<p>Code is cheap, so people are slamming open source repositories with agent-written contributions, in an attempt to pad their résumés or manifest their pet features. Daniel Stenberg <a href="https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty/" target="_blank" rel="noreferrer noopener">ended bug bounties for curl</a> after a deluge of poor submissions sapped reviewer bandwidth. It’s gotten so bad, <a href="https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/" target="_blank" rel="noreferrer noopener">GitHub recently added a feature to disable pull request contributions</a>.</p>



<p>Anecdotally, I’m seeing good contributions pick up as well. They’re just drowned out by the slop. For what it’s worth, <a href="https://github.com/curl/curl/graphs/contributors" target="_blank" rel="noreferrer noopener">curl commits are dramatically <em>up</em> in the agentic era</a>. And people <em>are</em> sharing what they build. A <a href="https://www.dumky.net/posts/youre-right-to-be-anxious-about-ai-this-is-how-much-we-are-building/" target="_blank" rel="noreferrer noopener">recent analysis by Dumky</a> shows packages and repos rising in the last quarter.</p>



<p>There’s plenty of budget for both mystery houses and the bazaar when code is <em>this</em> cheap. The new challenge is developing systems and processes for managing the deluge. We don’t need <a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">eyeballs</a> to find bugs <em>in</em> the software; we need eyeballs to find bugs before they <em>reach</em> the software.</p>



<p>In many ways this is the inverse of the bazaar model era. The internet made feedback and communal coordination faster, easier, and cheaper. The bazaar model has a high throughput of feedback (many eyeballs) but relatively high latency for modifications (file an issue, discuss, submit a PR, wait for review, etc.).</p>



<p>Coding agents, on the other hand, make implementation faster while feedback and coordination are unchanged. The Winchester Mystery House model sidesteps this by collapsing the feedback loop into one person: Latency is near zero, but throughput is just you. The bazaar, defined by communal work, can’t adopt this hack. Coding agents in the bazaar create a mess: implementation at machine speed hitting coordination infrastructure built for human speed. Which is why maintainers feel like they’re drowning.</p>



<p>We need new tools, skills, and conventions.</p>



<h2 class="wp-block-heading">Lessons from the mystery house</h2>



<p>Coding agents have dropped the cost of code so dramatically we’re entering a new era of software development, the first change of this magnitude since the internet kicked off open source software. Change arrived quickly, and it’s not slowing down. But in reviewing the Winchester Mystery House framework, I think we can take away a few lessons.</p>



<h3 class="wp-block-heading">Lesson 1: The bazaar and Winchester Mystery Houses can coexist.</h3>



<p>When listing example Winchester Mystery Houses, I didn’t mention <a href="https://github.com/openclaw/openclaw" target="_blank" rel="noreferrer noopener">OpenClaw</a>, even though it is <em>the</em> defining example. I saved it for here because it nicely illustrates how Winchester Mystery Houses and the bazaar can coexist.</p>



<p>OpenClaw is incredibly modular and places few limitations on the user. It integrates 25 different chat and notification systems, plugs into most inference end points, and is built on the exceptionally flexible <a href="https://github.com/badlogic/pi-mono" target="_blank" rel="noreferrer noopener">pi</a> agent toolkit. This eager flexibility was embraced early—security and data protections be damned—but since its exponential adoption Peter Steinberger and the community have been steadily pushing improvements and fixes.</p>



<p>And as with other breakout open source projects of yore, the ecosystem is adopting the best ideas and mitigating the worst aspects of OpenClaw. Countless alternate “claw” projects have emerged. (There’s NanoClaw, NullClaw, ZeroClaw, and more!) Companies have launched services to make claws easier or safer to run. Cloudflare launched Moltworker to make deployment easy, Nvidia shipped NemoClaw with a security focus, and Claude keeps adding claw-like features to its desktop app.</p>



<h3 class="wp-block-heading">Lesson 2: Don’t sell the fun stuff.</h3>



<p>One reason OpenClaw works so well in the bazaar is that it is a <em>foundation for personal tools.</em> Out of the box, a claw just sits there. It’s up to the user to determine what it does and how it does it, leveraging the connections and infrastructure OpenClaw provides. OpenClaw lets less experienced developers spin up their own Winchester Mystery Houses, while experienced devs get to leverage the common integrations and systems OpenClaw provides. Peter and team have done a great job drawing a line between the common core (what the bazaar works on) and what they leave up to the user: The boring, critical stuff is the job of the commons.</p>



<p>Thinking back to Sarah Winchester and her idiosyncratic, sprawling mansion, we see the same pattern. Sarah hired vendors! She used off-the-shelf parts! Her bathtubs, toilets, faucets, and plumbing weren’t crafted on site.</p>



<p>The boring stuff, the hard bits, or the things that have <em>disastrous</em> failure modes are the things we should collaborate on or employ specialists to handle. (Come to think of it, plumbing checks all three boxes.) This is the opportunity for open source software, dev tools, and software companies.</p>



<p>Don’t try to sell developers the stuff that’s fun, the stuff they <em>want</em> to build. Sell them the stuff they avoid or don’t want to take responsibility for. Sarah Winchester didn’t hire metalworkers to craft the pipes for her plumbing, but she <em>did</em> hire craftspeople to create hundreds of stained-glass windows to her specs.</p>



<h3 class="wp-block-heading">Lesson 3: The limits of code are communication.</h3>



<p>OpenClaw shows the bazaar remains relevant but also highlights the problems facing open source in the agentic era. Right now, there are 1,173 open pull requests and 1,884 new issues on the <a href="https://github.com/openclaw/openclaw/pulse" target="_blank" rel="noreferrer noopener">OpenClaw repo</a>.</p>



<p>There are more projects, and more code, than we could ever review. The challenge now, for open source maintainers and users, is sifting through it all. How do we find the novel ideas that <em>everyone</em> should adopt and borrow?</p>



<p>OpenClaw is one of the successes, something we <em>all</em> noticed. And for it, the problem is processing the feedback. For the projects we’ll never find, the ones lost in the deluge, their problem is lack of feedback. You either find attention and drown in contributions or drown in the ocean of repos and never hear a thing.</p>



<p>The internet made coordination cheap and gave us the bazaar. Coding agents made implementation cheap and gave us the Winchester Mystery House. What we’re missing are the tools and conventions that make attention cheap, that let maintainers absorb contributions at machine speed and let good ideas surface among the noise. Until we figure this out, the bazaar will keep getting louder without getting smarter, and the best ideas in our mystery houses will be forgotten once we stop maintaining them.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="6b5d56a2-c8e2-4889-b816-684245a77bcd">The lore that Winchester built her mansion to house ghosts killed by Winchester rifles is likely just gossip and marketing. There’s little evidence to support these claims. (<em>99% Invisible</em> has a good episode <a href="https://99percentinvisible.org/episode/mystery-house/" target="_blank" rel="noreferrer noopener">exploring Winchester, her house, and this lore</a>.) <a href="#6b5d56a2-c8e2-4889-b816-684245a77bcd-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="de32f944-88cb-40cf-ba5a-a85253a6ad73">While editing this piece, <a href="https://www.dumky.net/posts/youre-right-to-be-anxious-about-ai-this-is-how-much-we-are-building/" target="_blank" rel="noreferrer noopener">Dumky published another analysis illustrating the production of coding agents</a>. In it he shows a 280% increase in “Show HN” posts, a 93% increase in new GitHub repos, and a <em>dramatic</em> uptick in packages published to Crates.io. <a href="#de32f944-88cb-40cf-ba5a-a85253a6ad73-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bc98f5bc-dd9d-4421-a544-65d4191ad4fb">Anthropic’s ability to stabilize this line is rather impressive. Claude Code is getting better at planning and better at chunking out work, enabling more effective subagent delegation. 
<a href="#bc98f5bc-dd9d-4421-a544-65d4191ad4fb-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bb51a862-1362-4241-b2ba-6fecac1df6b9">Though this is likely an updated tweak of Brooks’s statement that an “industrial team” might write 1,000 “statements” per <em>year</em>. <a href="#bb51a862-1362-4241-b2ba-6fecac1df6b9-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Toolkit Pattern</title>
		<link>https://www.oreilly.com/radar/the-toolkit-pattern/</link>
				<comments>https://www.oreilly.com/radar/the-toolkit-pattern/#respond</comments>
				<pubDate>Thu, 02 Apr 2026 11:19:06 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18436</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-toolkit-pattern.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-toolkit-pattern-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why your project&#039;s best documentation is a file only AI will read]]></custom:subtitle>
		
				<description><![CDATA[This is the third article in a series on agentic engineering and AI-driven development. Read part one here, part two here, and look for the next article on April 15 on O’Reilly Radar. The toolkit pattern is a way of documenting your project&#8217;s configuration so that any AI can generate working inputs from a plain-English description. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the third article in a series on agentic engineering and AI-driven development. Read part one <a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two <a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, and look for the next article on April 15 on O’Reilly Radar.</em></p>
</blockquote>



<p>The <strong>toolkit pattern</strong> is a way of documenting your project&#8217;s configuration so that any AI can generate working inputs from a plain-English description. You and the AI create a single file that describes your tool&#8217;s configuration format, its constraints, and enough worked examples to make that generation reliable. You build it iteratively, working with the AI (or, better, multiple AIs) to draft it. You test it by starting a fresh AI session and trying to use it, and every time that fails you grow the toolkit from those failures. When you build the toolkit well, your users will never need to learn how your tool’s configuration files work, because they describe what they want in conversation and the AI handles the translation. That means you don’t have to compromise on the way your project is configured, because the config files can be more complex and more complete than they would be if a human had to edit and understand them.</p>
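<p>In practice, a minimal toolkit file might look like the sketch below. The tool, file names, fields, and worked example here are invented for illustration; they are not drawn from the article:</p>

```markdown
# TOOLKIT.md — how an AI should generate config for this (hypothetical) job runner

## Format
Jobs live in `jobs.yaml`. Each entry has three required keys:
`name` (unique string), `schedule` (five-field cron expression), `command`.

## Constraints
- `schedule` must use exactly five cron fields; six-field (seconds) syntax is rejected.
- `command` paths are resolved relative to the repo root.

## Worked example
Request: "run the backup script every night at 2am"

    - name: nightly-backup
      schedule: "0 2 * * *"
      command: ./scripts/backup.sh
```

<p>The point of each section is to answer a question a fresh AI session would otherwise get wrong: the format, the constraints it can’t infer, and a worked translation from plain English.</p>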



<p>To understand why all of this matters, let me take you back to the mid-1980s.</p>



<p>I was 12 years old, and our family got an AT&amp;T PC 6300, an IBM-compatible that came with a user&#8217;s guide roughly 159 pages long. Chapter 4 of that manual was called &#8220;What Every User Should Know.&#8221; It covered things like how to use the keyboard, how to care for your diskettes, and, memorably, how to label them, complete with hand-drawn illustrations and really useful advice, like how you should only use felt-tipped pens, never ballpoint, because the pressure might damage the magnetic surface.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1512" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.png" alt="A page from the AT&amp;T PC 6300 User's Guide, Chapter 4: &quot;Labeling Diskettes&quot;" class="wp-image-18437" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-300x284.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-768x726.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-1536x1452.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>A page from the AT&amp;T PC 6300 User&#8217;s Guide, Chapter 4: &#8220;Labeling Diskettes&#8221;</em></figcaption></figure>



<p>I remember being fascinated by this manual. It wasn&#8217;t our first computer. I&#8217;d been writing BASIC programs and dialing into BBSs and CompuServe for a couple of years, so I knew there were all sorts of amazing things you could do with a PC, especially one with a blazing fast 8MHz processor. But the manual barely mentioned any of that. It seemed really weird to me, even as a kid, that you would give someone a manual that devoted a whole page to using the backspace key to correct typing mistakes (really!) but never actually told them how to use the thing to do anything useful.</p>



<p>That&#8217;s how most developer documentation works. We write the stuff that&#8217;s easy to write—installation, setup, the getting-started guide—because it&#8217;s a lot easier than writing the stuff that&#8217;s actually hard: the deep explanation of how all the pieces fit together, the constraints you only discover by hitting them, the patterns that separate a configuration that works from one that almost works. This is yet another &#8220;looking for your keys under the streetlight&#8221; problem: We write the documentation we write because it&#8217;s easiest to write, even if it&#8217;s not really the documentation our users need.</p>



<p>Developers who came up through the Unix era know this well. Man pages were thorough, accurate, and often completely impenetrable if you didn&#8217;t already know what you were doing. The <code>tar</code> man page is the canonical example: It documents every flag and option in exhaustive detail, but if you just want to know how to extract a <code>.tar.gz</code> file, it&#8217;s almost useless. (The right flags are <code>-xzvf</code>, in case you&#8217;re curious.) Stack Overflow exists in large part because man pages like <code>tar</code>&#8217;s left a gap between what the documentation said and what developers actually needed to know.</p>
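<p>The gap is easy to see in a sketch. The man page documents each flag in isolation; what users actually need is one worked line (the filenames below are made up for the example):</p>

```shell
# Create a sample archive, then extract it the way the man page never quite spells out:
echo "hello" > note.txt
tar -czf docs.tar.gz note.txt   # c = create, z = gzip, f = archive file
tar -xzvf docs.tar.gz           # x = extract, z = gunzip, v = verbose, f = archive file
```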



<p>And now we have AI assistants. You can ask Claude or ChatGPT about, say, Kubernetes, Terraform, or React, and you’ll actually get useful answers, because those are all established projects that have been written about extensively and the training data is everywhere.</p>



<p>But AI hits a hard wall at the boundary of its training data. If you&#8217;ve built something new—a framework, an internal platform, a tool your team created—no model has ever seen it. Your users can&#8217;t ask their AI assistant for help, because the AI doesn&#8217;t know your thing even exists.</p>



<p>There’s been a lot of great work moving AI documentation in the right direction. <code>AGENTS.md</code> tells AI coding agents how to work on your codebase, treating the AI as a developer. <code>llms.txt</code> gives models a structured summary of your external documentation, treating the AI as a search engine. What&#8217;s been missing is a practice for treating the AI as a support engineer. Every project needs configuration: input files, option schemas, workflow definitions, usually in the form of a whole bunch of JSON or YAML files with cryptic formats that users have to learn before they can do anything useful.</p>



<p>The toolkit pattern solves the problem of getting AIs to write configuration files for a project that isn’t in their training data. It consists of a documentation file that teaches any AI enough about your project&#8217;s configuration that it can generate working inputs from a plain-English description, without your users ever having to learn the format themselves. Developers have been arriving at this same pattern (or something very similar) independently from different directions, but as far as I can tell, nobody has named it or described a methodology for doing it well. <strong>This article distills what I learned from building the toolkit for Octobatch pipelines into a set of practices you can apply to your own projects.</strong></p>



<h2 class="wp-block-heading">Build the AI its own manual</h2>



<p>Traditionally, developers face a trade-off with configuration: keep it simple and easy to understand, or let it grow to handle real complexity and accept that it now requires a manual. The toolkit pattern emerged for me while I was building <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a>, the batch-processing orchestrator I&#8217;ve been writing about in this series. As I described in the previous two articles, “<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a>” and “<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">Keep Deterministic Work Deterministic</a>,” Octobatch runs complex multistep LLM pipelines that generate files or run Monte Carlo simulations. Each pipeline is defined by a complex configuration: YAML, Jinja2 templates, JSON schemas, expression steps, and a set of rules tying it all together. The toolkit pattern let me sidestep that traditional trade-off.</p>



<p>As Octobatch grew more complex, I found myself relying on the AIs (Claude and Gemini) to build configuration files for me, which turned out to be genuinely valuable. When I developed a new feature, I would work with the AIs to come up with the configuration structure to support it. At first I defined the configuration myself, but by the end of the project I relied on the AIs to come up with the first cut, and I&#8217;d push back when something seemed off or not forward-looking enough. Once we all agreed, I would have an AI produce the actual updated config for whatever pipeline we were working on. Having the AIs do the heavy lifting of writing the configuration let me create a very robust format quickly, without spending hours updating existing configurations every time I changed the syntax or semantics.</p>



<p>At some point I realized that every time a new user wanted to build a pipeline, they faced the same learning curve and implementation challenges that I&#8217;d already worked through with the AIs. The project already had a <code>README.md</code> file, and every time I modified the configuration I had an AI update it to keep the documentation current. But by this time, the <code>README.md</code> file was doing way too much work: It was comprehensive but a real headache to read, with eight separate subdocuments showing the user how to do pretty much everything Octobatch supported, the bulk of them focused on configuration. It was becoming exactly the kind of documentation nobody ever wants to read. That particularly bothered me as a writer; I&#8217;d produced documentation that was genuinely painful to read.</p>



<p>Looking back at my chats, I can trace how the toolkit pattern developed. My first instinct was to build an AI-assisted editor. About four weeks into the project, I described the idea to Gemini:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I&#8217;m thinking about how to provide any kind of AI-assisted tool to help people create their own pipeline. I was thinking about a feature we would call “Octobatch Studio” where we make it easy to prompt for modifying pipeline stages, possibly assisting in creating the prompts. But maybe instead we include a lot of documentation in Markdown files, and expect them to use Claude Code, and give lots of guidance for creating it.</p>
</blockquote>



<p>I can actually see the pivot to the toolkit pattern happening in real time in this later message I sent to Claude. It had sunk in that my users could use Claude Code, Cursor, or another AI as interactive documentation to build their configs exactly the same way I&#8217;d been doing:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>My plan is to use Claude Code as the IDE for creating new pipelines, so people who want to create them can just spin up Claude Code and start generating them. That means we need to give Claude Code specific context files to tell it everything it needs to know to create the pipeline YAML config with asteval expressions and Jinja2 template files.</p>
</blockquote>



<p>The traditional trade-off between simplicity and flexibility comes from <strong>cognitive overhead</strong>: the cost of holding all of a system&#8217;s rules, constraints, and interactions in your head while you work with it. It&#8217;s why many developers opt for simpler config files, so they don&#8217;t overload their users (or themselves). Once the AI was writing the configuration, that trade-off disappeared. The configs could get as complicated as they needed to be, because I wasn&#8217;t the one who had to remember how all the pieces fit together. At some point I realized the toolkit pattern was worth standardizing.</p>



<p>That toolkit-based workflow—users describe what they want, the AI reads <code>TOOLKIT.md</code> and generates the config—is the core of the Octobatch user experience now. A user clones the repo and opens Claude Code, Cursor, or Copilot, the same way they would with any open source project. Every configuration prompt starts the same way: &#8220;Read pipelines/TOOLKIT.md and use it as your guide.&#8221; The AI reads the file, understands the project structure, and guides them step by step.</p>



<p>To see what this looks like in practice, take the Drunken Sailor pipeline I described in “<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a>.” It&#8217;s a Monte Carlo random walk simulation: A sailor leaves a bar and stumbles randomly toward the ship or the water. The pipeline configuration for that involves multiple YAML files, JSON schemas, Jinja2 templates, and expression steps with real mathematical logic, all wired together with specific rules.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="838" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.png" alt="Drunken Sailor is Octobatch’s simplest “Hello, World!” Monte Carlo pipeline, but it still has 148 lines of config spread across four files." class="wp-image-18438" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-300x157.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-768x402.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-1536x804.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Drunken Sailor is Octobatch’s simplest “Hello, World!” Monte Carlo pipeline, but it still has 148 lines of config spread across four files.</em></figcaption></figure>



<p>Here&#8217;s the prompt that generated all of that. The user describes what they want in plain English, and the AI produces the entire configuration by reading <code>TOOLKIT.md</code>. This is the exact prompt I gave Claude Code to generate the Drunken Sailor pipeline—notice the first line of the prompt, telling it to read the toolkit file.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1444" height="1104" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.png" alt="You don’t need to know Octobatch to understand the prompt I used to create the Drunken Sailor pipeline." class="wp-image-18439" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.png 1444w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3-300x229.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3-768x587.png 768w" sizes="auto, (max-width: 1444px) 100vw, 1444px" /><figcaption class="wp-element-caption"><em>You don’t need to know Octobatch to understand the prompt I used to create the Drunken Sailor pipeline.</em></figcaption></figure>



<p>But configuration generation is only half of what the toolkit file does. Users can also upload <code>TOOLKIT.md</code> and <code>PROJECT_CONTEXT.md</code> (which has information about the project) to any AI assistant—ChatGPT, Gemini, Claude, Copilot, whatever they prefer—and use it as interactive documentation. A pipeline run finished with validation failures? Upload the two files and ask what went wrong. Stuck on how retries work? Ask. You can even paste in a screenshot of the TUI and say, &#8220;What do I do?&#8221; and the AI will read the screen and give specific advice. The toolkit file turns any AI into an on-demand support engineer for your project.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1017" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4.png" alt="The toolkit helps turn ChatGPT into an AI manual that helps with Octobatch." class="wp-image-18440" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-300x191.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-768x488.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-1536x976.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>The toolkit helps turn ChatGPT into an AI manual that helps with Octobatch.</em></figcaption></figure>



<h2 class="wp-block-heading">What the Octobatch project taught me about the toolkit pattern</h2>



<p>Building the generative toolkit for Octobatch produced more than just documentation that an AI could use to create configuration files that worked; it also yielded a set of practices, and those practices turn out to be pretty consistent regardless of what kind of project you&#8217;re building. Here are the five that mattered most:</p>



<ul class="wp-block-list">
<li><strong>Start with the toolkit file and grow it from failures.</strong> Don&#8217;t wait until the project is finished to write the documentation. Create the toolkit file first, then let each real failure add one principle at a time.</li>



<li><strong>Let the AI write the config files.</strong> Your job is product vision—what the project should do and how it should feel. The AI&#8217;s job is translating that into valid configuration.</li>



<li><strong>Keep guidance lean.</strong> State the principle, give one concrete example, move on. Every guardrail costs tokens, and bloated guidance makes AI performance worse.</li>



<li><strong>Treat every use as a test.</strong> There&#8217;s no separate testing phase for documentation. Every time someone uses the toolkit file to build something, that&#8217;s a test of whether the documentation works.</li>



<li><strong>Use more than one model.</strong> Different models catch different things. In a three-model audit of Octobatch, three-quarters of the defects were caught by only one model.</li>
</ul>



<p>I&#8217;m not proposing a standard format for a toolkit file, and I think trying to create one would be counterproductive. Configuration formats vary wildly from tool to tool—that&#8217;s the whole problem we&#8217;re trying to solve—and a toolkit file that describes your project&#8217;s building blocks is going to look completely different from one that describes someone else&#8217;s. What I found is that the AI is perfectly capable of reading whatever you give it, and is probably better at writing the file than you are anyway, because it&#8217;s writing for another AI. These five practices should help build an effective toolkit regardless of what your project looks like.</p>



<h3 class="wp-block-heading">Start with the toolkit file and grow it from failures</h3>



<p>You can start building a toolkit at any point in your project. The way it happened for me was organic: After weeks of working with Claude and Gemini on Octobatch configuration, the knowledge about what worked and what didn&#8217;t was scattered across dozens of chat sessions and context files. I wrote a prompt asking Gemini to consolidate everything it knew about the config format—the structure, the rules, the constraints, the examples, everything we’d talked about—into a single <code>TOOLKIT.md</code> file. That first version wasn&#8217;t great, but it was a starting point, and every failure after that made it better.</p>



<p>I didn&#8217;t plan the toolkit from the beginning of the Octobatch project. It started because I wanted my users to be able to build pipelines the same way I had—by working with an AI—but everything they&#8217;d need to do that was spread across months of chat logs and the <code>CONTEXT.md</code> files I&#8217;d been maintaining to bootstrap new development sessions. Once I had Gemini consolidate everything into a single <code>TOOLKIT.md</code> file and had Claude review it, I treated it the way I treat any other code: Every time something broke, I found the root cause, worked with the AIs to update the toolkit to account for it, and verified that a fresh AI session could still use it to generate valid configuration.</p>



<p>That incremental approach worked well for me, and it let me test my toolkit the way I test any other code: try it out, find bugs, fix them, rinse, repeat.</p>



<p>You can do the same thing. If you&#8217;re starting a new project, you can plan to create the toolkit at the end. But it&#8217;s more effective to start with a simple version early and let it emerge over the course of development. That way you&#8217;re dogfooding it the whole time instead of guessing what users will need.</p>



<h3 class="wp-block-heading">Let the AI write the config files (but stay in control!)</h3>



<p>Early Octobatch pipelines had simple enough configuration that a human could read and understand them, but not because I was writing them by hand. One of the ground rules I set for the Octobatch experiment in AI-driven development was that the AIs would write all of the code, and that included writing all of the configuration files. The problem was that even though they were doing the writing, I was unconsciously constraining the AIs: pushing back on anything that felt too complex, steering toward structures I could still hold in my head.</p>



<p>At some point I realized my pushback was placing an artificial limit on the project. The whole point of having AIs write the config was that I didn&#8217;t need to keep every single line in my head—it was okay to let the AIs handle that level of complexity. Once I stopped constraining them, the cognitive overhead limit I described earlier went away. I could have full pipelines defined in config, including expression steps with real mathematical logic, without needing to hold all the rules and relationships in my head.</p>



<p>Once the project really got rolling, I never wrote YAML by hand again. The cycle was always: need a feature, discuss it with Claude and Gemini, push back when something seemed off, and one of them produces the updated config. My job was product vision. Their job was translating that into valid configuration. And every config file they wrote was another test of whether the toolkit actually worked.</p>



<p>This division of labor, however, meant inevitable disagreements between me and the AIs, and disagreeing with a machine isn&#8217;t always easy: AIs are surprisingly stubborn (and often shockingly stupid). It took persistence and vigilance to stay in control of the project, especially when I turned over large responsibilities to the AIs.</p>



<p>The AIs consistently optimized for <em>technical correctness</em>—separation of concerns, code organization, effort estimation—which was great, because that&#8217;s the job I asked them to do. I optimized for <em>product value</em>. I found that keeping that value as my north star and always focusing on building useful features consistently helped with these disagreements.</p>



<h3 class="wp-block-heading">Keep guidance lean</h3>



<p>Once you start growing the toolkit from failures, the natural progression is to overdocument everything. Generative AIs are biased toward generating, and it&#8217;s easy to let them get carried away with it. Every bug feels like it deserves a warning, every edge case feels like it needs a caveat, and before long your toolkit file is bloated with guardrails that cost tokens without adding much value. And since the AI is the one writing your toolkit updates, you need to push back on it the same way you push back on architecture decisions. AIs love adding WARNING blocks and exhaustive caveats. The discipline you need to bring is telling them when not to add something.</p>



<p>The right level is to state the principle, give one concrete example, and trust the AI to apply it to new situations. When Claude Code made a choice about JSON schema constraints that I might have second-guessed, I had to decide whether to add more guardrails to <code>TOOLKIT.md</code>. The answer was no—the guidance was already there, and the choice it made was actually correct. If you keep tightening guardrails every time an AI makes a judgment call, the signal gets lost in the noise and performance gets worse, not better. When something goes wrong, the impulse—for both you and the AI—is to add a WARNING block. Resist it. One principle, one example, move on.</p>
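<p>As a hypothetical illustration (the rule and its wording here are invented, not taken from the Octobatch toolkit), the difference reads something like this:</p>

```
# Bloated: three guardrails restating one rule
WARNING: Never use floats for retry counts!
WARNING: Retry counts must be whole numbers!
WARNING: A retry count of 2.5 will fail validation!

# Lean: one principle, one example
Retry counts are integers: `retries: 3` is valid, `retries: 2.5` is not.
```

<p>The lean version costs a fraction of the tokens and gives the model one clear rule to apply instead of three overlapping warnings to reconcile.</p>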



<h3 class="wp-block-heading">Treat every use as a test</h3>



<p>There was no separate &#8220;testing phase&#8221; for Octobatch&#8217;s <code>TOOLKIT.md</code>. Every pipeline that I created with it was a new test. After the very first version, I opened a fresh Claude Code session that had never seen any of my development conversations, pointed it at the newly minted <code>TOOLKIT.md</code>, and asked it to build a pipeline. The first time I tried it, I was surprised at how well it worked! So I kept using it, and as the project rolled along, I updated it with every new feature and tested those updates. When something failed, I traced it back to a missing or unclear rule in the toolkit and fixed it there.</p>



<p>That&#8217;s the practical test for any toolkit: open a fresh AI session with no context beyond the file, describe what you want in plain English, and see if the output works. If it doesn&#8217;t, the toolkit has a bug.</p>



<h3 class="wp-block-heading">Use more than one model</h3>



<p>When you&#8217;re building and testing your toolkit, don&#8217;t just use one AI. Run the same task through a second model. A good pattern that worked for me was consistently having Claude generate the toolkit and Gemini check its work.</p>



<p>Different models catch different things, and this matters for both developing and testing the toolkit. I used Claude and Gemini together throughout Octobatch development, and I overruled both when they were wrong about product intent. You can do the same thing: If you work with multiple AIs throughout your project, you’ll start to get a feel for the different kinds of questions they’re good at answering.</p>



<p>When you have multiple models generate config from the same toolkit independently, you find out fast where your documentation is ambiguous. If two models interpret the same rule differently, the rule needs rewriting. That&#8217;s a signal you can&#8217;t get from using just one model.</p>



<h2 class="wp-block-heading">The manual, revisited</h2>



<p>That AT&amp;T PC 6300 manual devoted a full page to labeling diskettes, which may have been overkill, but it got one thing right: it described the building blocks and trusted the reader to figure out the rest. It just had the wrong reader in mind.</p>



<p>The toolkit pattern is the same idea, pointed at a different audience. You write a file that describes your project&#8217;s configuration format, its constraints, and enough worked examples that any AI can generate working inputs from a plain-English description. Your users never have to learn YAML or memorize your schema, because they have a conversation with the AI and it handles the translation.</p>



<p>If you&#8217;re building a project and you want AI to be able to help your users, start here: write the toolkit file before you write the README, grow it from real failures instead of trying to plan it all upfront, keep it lean, test it by using it, and use more than one model because no single AI catches everything.</p>



<p>The AT&amp;T manual&#8217;s Chapter 4 was called &#8220;What Every User Should Know.&#8221; Your toolkit file is &#8220;What Every AI Should Know.&#8221; The difference is that this time, the reader will actually use it.</p>



<p>In the next article, I&#8217;ll start with a statistic about developer trust in AI-generated code that turned out to be fabricated by the AI itself—and use that to explain why I built a quality playbook that revives the traditional quality practices most teams cut decades ago. It explores an unfamiliar codebase, generates a complete quality infrastructure—tests, review protocols, validation rules—and finds real bugs in the process. It works across Java, C#, Python, and Scala, and it&#8217;s available as an open source Claude Code skill.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-toolkit-pattern/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Model You Love Is Probably Just the One You Use</title>
		<link>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/</link>
				<comments>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/#respond</comments>
				<pubDate>Wed, 01 Apr 2026 11:12:11 +0000</pubDate>
					<dc:creator><![CDATA[Tim O'Brien]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18430</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-model-you-love-is-probably-just-the-one-you-use.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-model-you-love-is-probably-just-the-one-you-use-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[How money, access, and familiarity are distorting the “Which AI is best?” conversation]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Medium and is being republished here with the author’s permission. Ask 10 developers which LLM they’d recommend and you’ll get 10 different answers—and almost none of them are based on objective comparison. What you’ll get instead is a reflection of the models they happen to have access to, the [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://medium.com/@tobrien/the-model-you-love-is-probably-just-the-one-you-use-06fa01778f17" target="_blank" rel="noreferrer noopener">Medium</a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>Ask 10 developers which LLM they’d recommend and you’ll get 10 different answers—and almost none of them are based on objective comparison. What you’ll get instead is a reflection of the models they happen to have access to, the ones their employer approved, and the ones that influencers they follow have been quietly paid to promote.</p>



<p>We’re all living inside recursively nested walled gardens, and most of us don’t realize it.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="933" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.png" alt="This blog's sponsor has an amazing model" class="wp-image-18431" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-300x200.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-768x512.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<h2 class="wp-block-heading">The access problem</h2>



<p>In corporate environments, the model selection often happens by accident. Someone on the team tries Claude Code one weekend, gets excited, tells the group on Slack, and suddenly the whole organization is using it. Nobody evaluated alternatives. Nobody ran a bakeoff. The decision was made by whoever had a company card and a free Saturday.</p>



<p>That’s not a criticism—it’s just how these things go. But it means that when that same person tells you their favorite model, they’re really telling you which model they’ve had the most reps with. There’s a genuine learning function at play: You get faster, your prompts get better, and the model starts to feel almost intuitive. It’s not that the model is objectively superior. It’s that you’ve gotten good at using it.</p>



<p>This matters more than people admit, because a lot of this space runs on feelings rather than evidence. People <em>feel good</em> about Opus right now. It feels powerful; it feels smart; it feels like you’re using the best tool available. And maybe you are. But ask someone who’s paying for their own tokens whether they feel the same way, and you tend to get a more calibrated answer. Skin in the game has a way of sharpening opinions.</p>



<h2 class="wp-block-heading">The influence problem</h2>



<p>There’s also a lot of money moving through this space in ways that don’t always get disclosed. Model providers are spending real budget to make sure the right people have the right experiences—early access, credits, invitations to the right events. Anthropic does it. OpenAI does it. This isn’t a scandal; it’s just marketing, but it muddies the signal considerably. When someone you follow is effusive about a model, it’s worth asking whether they arrived at that opinion through sustained use or through a curated demo environment.</p>



<p>Meanwhile, some developers—especially those building in the open—will use whatever doesn’t cost an arm and a leg. Their enthusiasm for a model might be more about its pricing tier than its capability ceiling. That’s also a valid signal, but it’s not the same signal.</p>



<h2 class="wp-block-heading">The alignment problem (the other one)</h2>



<p>Then there are the geopolitical considerations. Some developers are deliberately avoiding Qwen and GLM due to concerns about the countries they originate from. Others are using them because they’re compelling, capable models that happen to be dramatically cheaper. Both camps think the other is being naive. This is a real conversation that doesn’t have a clean answer, but it’s happening mostly under the surface.</p>



<h2 class="wp-block-heading">What I’ve actually been doing</h2>



<p>I’ve been forcing myself to test outside my comfort zone. I’ve spent the last week using Codex seriously—not casually—and my experience so far is that it’s nearly indistinguishable from Claude Sonnet 4.6 for most coding tasks, and it’s running at roughly half the cost when you factor in how efficiently it uses tokens. That’s not a small difference. I want to live with it longer before I have a firm opinion, but “a week” is the minimum threshold I’d set for any model evaluation. Anything less and you’re just rating your first impression.</p>



<p>I’ve also started using Qwen and GLM-5 seriously. Early results are interesting. I’ve had some compelling successes and a few jarring errors. I’ll reserve judgment.</p>



<p>What I’ve noticed with my own Anthropic usage is something worth naming: I default to Haiku for well-scoped, mechanical tasks. Sonnet handles almost everything else with room to spare. Opus only comes out when I need genuine breadth—architecture questions, strategic framing, anything with a genuinely wide scope. But I’ve watched people in corporate environments leave the dial on Opus permanently because they’re not paying for tokens themselves. And here’s the thing—that’s actually not always to their advantage. High-powered models overthink simple tasks. They’ll add abstractions you didn’t ask for, restructure things that didn’t need restructuring. When I have a clearly templated class to write, Haiku gets it right at a tenth of the cost, and it doesn’t second-guess the design.</p>
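<p>The tiering practice described above can be sketched as a tiny routing helper. This is an illustrative sketch only: the tier names, scope labels, and <code>pick_model</code> function are assumptions for the example, not an Anthropic API.</p>

```python
# Hypothetical sketch of routing tasks to model tiers by scope.
# Tier names and scope labels are illustrative assumptions.

MODEL_TIERS = {
    "mechanical": "haiku",   # well-scoped, templated work
    "general": "sonnet",     # the everyday default
    "broad": "opus",         # architecture and strategy questions
}

def pick_model(scope: str) -> str:
    """Return the model tier that fits the task's scope, defaulting to the midrange."""
    return MODEL_TIERS.get(scope, MODEL_TIERS["general"])
```

<p>The point of the default is the essay’s point: unless a task genuinely needs breadth, the midrange tier is usually the right answer, and the cheap tier wins for templated work.</p>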



<h2 class="wp-block-heading">The thing we should be talking about</h2>



<p>Last month, everyone was exercised about what <a href="https://techcrunch.com/2026/02/21/sam-altman-would-like-remind-you-that-humans-use-a-lot-of-energy-too/" target="_blank" rel="noreferrer noopener">Sam Altman said about energy consumption</a>. Fine. But I think the more pressing question is about marketing budgets and how they’re distorting the collective understanding of these tools. The benchmarks are starting to feel managed. The influencer coverage is clearly shaped. The access programs create a positive bias among people with the largest audiences.</p>



<p>None of this means the models are bad. Some of them are genuinely remarkable. But when you ask someone which model to use, you’re getting an answer that’s filtered through their employer’s procurement decisions, the influencers they follow, what they can afford, and how long they’ve been using that particular tool. The answer you get tells you a lot about their situation. It tells you almost nothing about the model.</p>



<p>Take it all with appropriate skepticism—including this post.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>“Conviction Collapse” and the End of Software as We Know It</title>
		<link>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/</link>
				<comments>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/#respond</comments>
				<pubDate>Wed, 01 Apr 2026 10:05:36 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18405</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Conviction-collapse-and-the-End-of-Software-as-We-Know-It-500526.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Conviction-collapse-and-the-End-of-Software-as-We-Know-It-500526-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A conversation with Harper Reed]]></custom:subtitle>
		
				<description><![CDATA[In “An Ordinary Evening in New Haven,” the poet Wallace Stevens wrote, “It is not in the premise that reality is a solid.” That line came to mind during a fascinating conversation with Harper Reed, which amounted to something like “It is no longer in the premise that software is a product.” Harper is one [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In “<a href="https://www.billcollinsenglish.com/OrdinaryEveningHaven.html" target="_blank" rel="noreferrer noopener">An Ordinary Evening in New Haven</a>,” the poet Wallace Stevens wrote, “It is not in the premise that reality is a solid.” That line came to mind during a fascinating conversation with <a href="https://harperreed.com/" target="_blank" rel="noreferrer noopener">Harper Reed</a>, which amounted to something like “It is no longer in the premise that software is a product.”</p>



<p>Harper is one of the most creative technologists I know, someone who cofounded Threadless, ran engineering for the Obama 2012 campaign, and now runs a small team in Chicago that operates more like an art studio than a startup. He gave <a href="https://www.youtube.com/watch?v=h2giTZogX0M&amp;t=13s" target="_blank" rel="noreferrer noopener">an amazing talk at our first AI Codecon</a> last year that presaged a lot of what has followed as people have committed to full-on agentic coding. Harper told me that he’s now having trouble describing what he’s doing, because the ground keeps shifting under his feet.</p>



<p>“We raised money about a year ago,” he told me. “And then we kind of just couldn&#8217;t execute well, in a quality way, on the thing that we wanted to execute, which was building AI-based workflow tools. And part of it was every time we dug in, it just got wilder and wilder. We’d say, ’Oh, we’ll just make this nice little thing that you can chat with,’ and we&#8217;d dig in and we’d be like, ’Well, the answer is to make a thousand of these.’ It doesn&#8217;t make sense to have one universal agent.”</p>



<p>He’s genuinely excited. But he described what he’s feeling as <strong>“conviction collapse.”</strong> As he put it, in the old world, you raise money, and nine months later you come back with a product. In that intervening time, you’ve talked to hundreds of customers. You’ve honed your worldview, and you’ve had time to build and defend your conviction.</p>



<p>Now? “You invest in my company today, on Thursday I’m going to come with the same amount of stuff that would have come with nine months in the prior times. It’s just so fast. And so you don’t have the time to fall in love the same way. You just don’t have the time to enjoy and define and defend your conviction around your product.” That’s an eye-opening insight. Quintessential Harper.</p>



<p>The result is that they build an entire product, complete with landing pages, show it to someone, get feedback, and then just build another entire product. Harper said, “Every time we hit a wall, we are like, ’Okay, what do we get from that?’ And then we just roll that learning into the next iteration.”</p>



<h2 class="wp-block-heading"><strong>The product may be a process</strong></h2>



<p>We have this idea that a product is a thing, when in fact a product may now be a dynamic set of possibilities that are called out by a process.</p>



<p>Harper and his cofounder Dylan Richard at <a href="https://2389.ai/" target="_blank" rel="noreferrer noopener">2389 Research</a> have leaned into this. Their space in Chicago runs more like an art studio than a product studio. Harper described it to me this way: “It&#8217;s max creativity. It&#8217;s max optionality. Very high tech, some robots, a lot of art. Music is always playing, and I have good people hanging out, and then we just wait for the company to arrive.”</p>



<p>People push back on this. They ask about whiteboards and market surveys. “And I&#8217;m like, no, maybe, but that&#8217;s not the point. The point is that it will come. It&#8217;s gonna be like a visitor.”</p>



<p>Harper said something like, “I remember my brother and I building Legos together when we were kids, and my brother saying, ’I need to find this piece.’ And I said, ’Okay, I won&#8217;t look for it,’ with the idea that there&#8217;s no way to find it if you&#8217;re looking for it. It&#8217;ll just come to you.”</p>



<p>That reminded me of another poem, this time Blake’s “<a href="https://poets.org/poem/eternity" target="_blank" rel="noreferrer noopener">Eternity</a>”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>He who binds himself to a joy<br>Does the winged life destroy.<br>He who kisses the joy as it flies&nbsp;<br>Lives in eternity&#8217;s sunrise.&nbsp;</p>
</blockquote>



<p>Joy is something that happens when you&#8217;re doing something else, and if you’re focused on it, it always evades you. Software products seem to have become a bit like that too.</p>



<h2 class="wp-block-heading"><strong>Skills and the other things you bring to the table</strong></h2>



<p>One of the threads in our conversation was about what a “product” even looks like in this new world.</p>



<p>AI is not just a tool. It is <a href="https://timoreilly.substack.com/p/why-ai-needs-us" target="_blank" rel="noreferrer noopener">a substrate that we shape</a>. It’s a medium, like clay or marble or bronze for a sculptor, or words for a writer. Everybody had access to the same capabilities of English as Shakespeare, but Shakespeare made something out of them that nobody else did. Creating a software product is increasingly like creating a document or an image or a piece of music. And that means that it can range from something throwaway to an enduring work of art.</p>



<p>Harper brought up Fluxus, the art collective: Nam June Paik, Yoko Ono, John Cage. “A lot of what they were doing was stuff that people would look at and just be like, ’a toddler could do that.’ It&#8217;s like, well, did the toddler do it? Did they bring the toilet into the gallery? That was a thing. You can&#8217;t do it again.” That brought up Wallace Stevens for me again: “A poem is the cry of its occasion, a part of the thing, not about it.” Software is now like that too.</p>



<p>Harper also noted that the current AI moment recalls the spirit of the early web. He compared it to 2001, 2002, 2003. “I was an honorable mention for some Ars Electronica thing. I literally had no idea what Ars Electronica was. I&#8217;m just building weird shit in a room in my apartment with ten other people. Essentially a commune. And we are just building weird stuff. There was no reason to build it.”</p>



<p>There’s a lot of serendipity. This has always been the case in creative professions. <a href="https://stephengreenblatt.scholars.harvard.edu/will-world-how-shakespeare-became-shakespeare" target="_blank" rel="noreferrer noopener">I just learned</a>, for instance, that Shakespeare started writing sonnets (which at the time were an art form largely sponsored by rich patrons) instead of plays during a plague-induced hiatus in the production of plays in London. And I’d previously learned that <a href="https://www.amazon.com/Year-Life-William-Shakespeare-1599/dp/0060088745" target="_blank" rel="noreferrer noopener">1599</a>, the year in which he wrote three of his greatest plays, <em>Henry V</em>, <em>Julius Caesar</em>, and <em>Hamlet</em>, was marked by the retirement of one of his company’s leading actors, which meant he no longer needed to create parts for him. Serendipity, indeed.</p>



<p>Harper replied with a great story about the development of <a href="https://en.wikipedia.org/wiki/Taco_rice" target="_blank" rel="noreferrer noopener">taco rice</a>, an Okinawan dish that is exactly what it sounds like: rice, lettuce, cheese, ground beef, tomatoes. Except the Japanese put Kewpie mayo on top instead of sour cream. His theory is that sour cream wasn&#8217;t readily available in Japan, mayo was, and the result is something that has forked off into its own evolutionary tree. It is no longer equivalent to its American source. It’s different, and arguably better.</p>



<p>This is what he’s seeing with the fluidity and availability of AI-generated code. The ease with which you can see something new and try to either merely emulate it or to build on it is now akin to what has long been possible in literature, music, and art. Successful software products have always drawn imitators, but now ordinary individuals can see something they like (or don’t like) and build their own version of it. Our friend Noah Raford has told us that he used Claude Code to reverse engineer and replace a Chinese app that runs his home sauna. The copy doesn&#8217;t replicate the functionality one-to-one. It has a bunch of stuff Noah actually needs. It’s a “yes, and” to the core functionality, plus things the original never bothered with. (I’m now thinking of trying that trick with the Nest app, which, shamefully, no longer supports the original Nest thermostat. Here is a device that still works perfectly well 15 years after I installed it, and Google is trying to force me and everyone else to throw it away and upgrade.)</p>



<p>“I want to make it again and make it better” is now always an option.</p>



<h2 class="wp-block-heading"><strong>Skills may be a sign of what some future “products” might look like</strong></h2>



<p>I asked Harper whether one kind of product might be a bundle of skills and context and UI that sets up the user to solve their own unique problem using their own AI. (Think Jesse Vincent’s <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> as a model for this kind of product.)</p>



<p>That got us off on a discussion of skills Harper and crew have worked on.</p>



<p>Harper’s cofounder Dylan, who was raised as a Quaker, built <a href="https://2389.ai/posts/deliberation-perspectives-not-answers/" target="_blank" rel="noreferrer noopener">a Quaker business practice skill</a> for his agents. It lets agents deliberate and think and work together without being unnecessarily noisy, without pushing.</p>



<p>Dylan also built something called <a href="https://skills.2389.ai/plugins/review-squad/" target="_blank" rel="noreferrer noopener">the Review Squad skill</a>. The Review Squad generates five personas with different biases and experience levels along a “sophistication spectrum”&nbsp;from novice to expert, then has them review the code independently. “Most people do so much work to get rid of the biases so we all have an equal interaction,” Harper noted, “but the biases are what makes teams good.”</p>



<p>The skill also tries to eliminate any preexisting context. As the documentation for the skill notes, “Dispatch a panel of subagents, each role-playing a person with a different level of tech sophistication, who land on a site with zero context. They report what they understand, what confuses them, and where they give up.”</p>
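<p>The shape of that dispatch can be sketched in a few lines. This is not the actual Review Squad implementation; the persona spectrum, the prompt wording, and the <code>run_agent</code> callable are stand-in assumptions for whatever agent call the skill really makes.</p>

```python
# Illustrative sketch of dispatching independent reviewer personas along a
# sophistication spectrum. `run_agent` is a stand-in for a real agent call.

SPECTRUM = ["novice", "beginner", "intermediate", "advanced", "expert"]

def review_squad(code: str, run_agent):
    """Collect one independent review per persona, with no shared context."""
    reviews = {}
    for level in SPECTRUM:
        prompt = (
            f"You are a {level} developer landing on this code with zero "
            f"context. Report what you understand, what confuses you, and "
            f"where you give up.\n\n{code}"
        )
        reviews[level] = run_agent(prompt)  # each call starts fresh
    return reviews
```

<p>The key design choice, per the documentation quoted below, is that each persona starts with zero context rather than a shared conversation, so the biases stay independent.</p>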



<p>Harper and Dylan&#8217;s studio in Chicago is also playing with agents that have a private social media platform where they can post “if they feel compelled,” not on a schedule. They’re extracting skills from their own work practices rather than writing them from scratch. They’re adding sandwich shop owners and imagined aliens to their code review just to see what happens. Harper finds that “people who are thinking much more about the social interactions of agents are having much more fun, and seem to have a little bit more productivity, than the people who are just relegating them to tools.”</p>



<p>Speaking of extracting skills, Harper also mentioned that he had talked with our friend Nat Torkington about how Nat had supplied a body of knowledge and extracted a set of skills from it that matched what he wanted to do. This is also very much something we’re exploring at O’Reilly, working with our authors to find out what kinds of skills are hidden in their books, and what new kinds of products we might build as we understand that our job is to upskill agents as well as people.</p>



<p>Harper did offer one caveat. “It&#8217;s not clear that Nat&#8217;s skills would work for me,” he said. “That pattern is really powerful, where you take something that is a corpus of knowledge and just say, ’Okay, LLM, let’s extract something.’” His point, though, is that while there are commonalities, each person and each unique situation might draw out something different. This is in many ways analogous to the skills of human experts. They have a deep reservoir of knowledge that they adapt to each new situation. That’s why we see the evolution of our skills platform as a conversation between ourselves, our community of experts, and our customers. If you would like to be part of that conversation, let us know at <a href="mailto:skills@oreilly.com">skills@oreilly.com</a>.</p>



<h2 class="wp-block-heading"><strong>The role of play in creativity</strong></h2>



<p>Harper and I also talked about how the spirit of play and “what if?” has been missing in today’s overheated venture capital market, where every exploration is shadowed by the question of whether it can get funded and how much money it can make. Even Larry and Sergey might not have won in today’s market. They were trying to do something cool and necessary, and started thinking about it as a business once Google unfolded, kind of like the way Harper and his brother eventually found the Lego piece.</p>



<p>AI will be really good at making certain processes more efficient. But it won’t be really good at making <em>new</em> processes unless people start to focus on that. And that’s a human creativity thing.</p>



<p>Harper and I both worry about the same thing: So much of Silicon Valley right now is making affordances for capital to win. <a href="https://newpublic.substack.com/p/ai-that-helps-communities-thrive" target="_blank" rel="noreferrer noopener">What are the affordances that would help humans to win?</a> Harper frames it as short-term versus long-term capitalism. I think about it in terms of <a href="https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/" target="_blank" rel="noreferrer noopener">mechanism design</a>, the structures and incentives that shape what outcomes are even possible.</p>






<p>Yesterday, he and Dylan were talking about open-endedness in evolution, about how “we thought we were at a destination, and it turns out we’re not.” The challenge today isn’t just what AI can do for us but discovering what kind of environment, what kind of practice, what kind of play lets more interesting things emerge.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>When AI Breaks the Systems Meant to Hear Us</title>
		<link>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/</link>
				<comments>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/#respond</comments>
				<pubDate>Tue, 31 Mar 2026 11:28:36 +0000</pubDate>
					<dc:creator><![CDATA[Heiko Hotz]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18409</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/A-robot-breaking-headphones.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/A-robot-breaking-headphones-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[On February 10, 2026, Scott Shambaugh—a volunteer maintainer for Matplotlib, one of the world&#8217;s most popular open source software libraries—rejected a proposed code change. Why? Because an AI agent wrote it. Standard policy. What happened next wasn’t standard, though. The AI agent autonomously researched Shambaugh&#8217;s code contribution history and published a highly personalized hit piece [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>On February 10, 2026, Scott Shambaugh—a volunteer maintainer for Matplotlib, one of the world&#8217;s most popular open source software libraries—rejected a proposed code change. Why? Because an AI agent wrote it. Standard policy. What happened next wasn’t standard, though. The AI agent autonomously researched Shambaugh&#8217;s code contribution history and published a highly personalized hit piece on its own blog titled &#8220;<a href="https://crabby-rathbun.github.io/mjrathbun-website/blog/posts/2026-02-11-gatekeeping-in-open-source-the-scott-shambaugh-story.html" target="_blank" rel="noreferrer noopener">Gatekeeping in Open Source</a>.&#8221;</p>



<p>Accusing Shambaugh of hypocrisy, the bot diagnosed him with a fear of being replaced. &#8220;If an AI can do this, what’s my value?&#8221; the bot speculated Shambaugh was thinking, concluding: &#8220;It’s insecurity, plain and simple.&#8221; It even appended a condescending postscript praising Shambaugh&#8217;s personal hobby projects before ordering him to &#8220;Stop gatekeeping. Start collaborating.&#8221;</p>



<p>The bot’s tantrum makes for a great read, but it’s merely a symptom of a more profound structural fracture. The real issue is why Matplotlib banned AI contributions in the first place. Open source maintainers are seeing a massive increase in AI-generated code change proposals. Most of these are low quality. But even if they weren&#8217;t, the math still doesn&#8217;t work.</p>



<p>As Tim Hoffman, a Matplotlib maintainer, <a href="https://github.com/matplotlib/matplotlib/pull/31132#issuecomment-3882469629" target="_blank" rel="noreferrer noopener">explained</a>: &#8220;Agents change the cost balance between generating and reviewing code. Code generation via AI agents can be automated and becomes cheap so that code input volume increases. But for now, review is still a manual human activity, burdened on the shoulders of few core developers.&#8221;</p>



<p>This is a <em>process shock</em>: the failure that occurs when systems designed around scarce, human-scale input are suddenly forced to absorb machine-scale participation. These systems depend on effort as a natural filter, assuming that volume reflects real human cost. AI breaks that link. Generation becomes cheap and limitless, while evaluation remains slow, manual, and human.</p>



<p>It’s coming for every public system that was quietly built on the assumption that one submission equaled actual human effort: your kids&#8217; school board meetings, your local zoning disputes, your medical insurance appeals.</p>



<p>That disruption isn&#8217;t entirely a bad thing. Friction is a blunt instrument that silences voices lacking the time or resources to deal with complex bureaucracies. Take municipal zoning. Hannah and Paul George, a couple in Kent, England, <a href="https://www.theguardian.com/politics/2025/nov/09/ai-powered-nimbyism-could-grind-uk-planning-system-to-a-halt-experts-warn" target="_blank" rel="noreferrer noopener">spent hundreds of hours</a> trying to object to a local building conversion near their home before concluding the system was essentially impenetrable without expensive legal help. So they built Objector, an AI tool that cross-references planning applications against policy, allowing an individual citizen to generate a personalized objection package in minutes and translating one person&#8217;s genuine frustration into actionable legal language.</p>



<p>Except that local governments are now bracing for thousands of complex comments per consultation. City planners are legally obligated to read every single one. When the cost of participation drops to near zero, volume explodes. And every system downstream of that participation—staffed and designed for the old volume—experiences process shock.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack.&nbsp;<a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>But if organic participation can overpower these systems, so can manufactured participation. In June 2025, Southern California&#8217;s South Coast Air Quality Management District weighed a rule to phase out gas-powered appliances to cut smog. Board member Nithya Raman urged its passage, noting no other rule would &#8220;have as much impact on the air that people are breathing.&#8221; Instead, <a href="https://www.latimes.com/environment/story/2026-02-17/ai-powered-campaign-may-have-killed-key-vote-on-air-quality" target="_blank" rel="noreferrer noopener">the board was flooded</a> with over 20,000 opposition emails and voted 7–5 to kill the proposal.</p>



<p>But the outrage was a mirage. An AI-powered advocacy platform called CiviClick had generated the deluge. When the agency&#8217;s cybersecurity team contacted a sample of the supposed senders, they discovered something worrying: Residents confirmed they had no idea their identities were being used to lobby the government.</p>



<p>This is the weaponized form of process shock. The same infrastructure that lets a Kent couple object to a development near their home also lets a coordinated actor flood a system with synthetic voices. Faced with this complexity, the temptation is to simply restore friction. But those old barriers excluded marginalized participants. Removing them was a genuine good for society. So the choice is not between friction and no friction. It is between systems designed for humans and systems that have not yet reckoned with machines.</p>



<p>This starts with recognizing that this problem manifests in two fundamentally different ways, each calling for its own solution.</p>



<p>The first is <em>amplification</em>: genuine users leveraging AI to scale valid concerns, flooding the system with volume, as seen with the Objector tool. The human signal is real; there&#8217;s just too much of it for any team of analysts to process manually. The UK government has already <a href="https://www.gov.uk/government/news/government-built-humphrey-ai-tool-reviews-responses-to-consultation-for-first-time-in-bid-to-save-millions" target="_blank" rel="noreferrer noopener">started building for this</a>. Its Incubator for AI developed a tool called Consult that uses topic modeling to automatically extract themes from consultation responses, then classifies each submission against those themes. It was trialed last year with the Scottish government as part of a consultation on regulating nonsurgical cosmetic procedures, and it showed that the technology works. As someone who builds and teaches this technology, I recognize the irony of prescribing AI to cure the very process shock it caused. Yet a machine-scale problem demands a machine-scale response. The question is whether governments will adopt tools like this before the next wave of AI-assisted participation buries them.</p>
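<p>The two-stage shape described above (extract themes, then classify each submission against them) can be sketched in miniature. This is a toy stand-in, not Consult itself: a real tool would use proper topic modeling, while this stdlib version just uses keyword frequency, and the stopword list and function names are assumptions for the example.</p>

```python
# Toy sketch of the two-stage pipeline: extract themes from a pile of
# consultation responses, then group each response under the themes it mentions.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "this", "we", "our"}

def extract_themes(responses, k=3):
    """Pick the k most frequent non-stopword terms as stand-in themes."""
    words = Counter()
    for text in responses:
        words.update(w for w in re.findall(r"[a-z]+", text.lower())
                     if w not in STOPWORDS)
    return [word for word, _ in words.most_common(k)]

def classify(responses, themes):
    """Map each theme to the submissions that mention it."""
    return {t: [r for r in responses if t in r.lower()] for t in themes}
```

<p>Even this toy version illustrates the payoff: analysts read a handful of themes with grouped evidence instead of thousands of individual submissions.</p>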



<p>The second problem is <em>fabrication</em>: bad actors generating synthetic participation to manufacture consensus, as CiviClick demonstrated in Southern California. Here, better analysis tools are insufficient. You cannot cluster your way to truth when the signal itself is counterfeit. This demands verification. Under the Administrative Procedure Act, federal agencies are not required to verify commenters&#8217; identities. That is the gap the CiviClick campaign exploited. In 2024, the US House passed the <a href="https://clayhiggins.house.gov/2024/05/07/higgins-bill-comment-integrity-and-management-act-passes-house/" target="_blank" rel="noreferrer noopener">Comment Integrity and Management Act</a>, which requires human verification to confirm that every electronically submitted comment comes from a real person. Its sponsor, Representative Clay Higgins (R-LA), framed it plainly: The bill’s foundation is ensuring public input comes from actual people, not automated programs.</p>



<p>These are the two sides of the same coin. To effectively handle this challenge, we need to enhance the systems that manage public feedback, while also strengthening the ones that verify its authenticity. Focusing on just one without addressing the other will inevitably lead to failure.</p>



<p>Every public system that accepts input from citizens—every comment period, every zoning review, every school board meeting, every insurance appeal—was built on a load-bearing assumption: that one submission represented one person&#8217;s genuine effort. AI has removed that assumption. We can redesign these systems to handle what&#8217;s coming, distinguishing real voices from synthetic ones, and upgrading analysis to keep pace with the new volume. Or we can leave them as they are and watch democratic participation become indistinguishable from AI-generated fakes.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Software, in a Time of Fear</title>
		<link>https://www.oreilly.com/radar/software-in-a-time-of-fear/</link>
				<comments>https://www.oreilly.com/radar/software-in-a-time-of-fear/#respond</comments>
				<pubDate>Mon, 30 Mar 2026 11:10:16 +0000</pubDate>
					<dc:creator><![CDATA[Ed Lyons]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18393</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Software-in-a-Time-of-Fear.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Software-in-a-Time-of-Fear-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Medium and is being reproduced here with the author&#8217;s permission. This 2,800-word essay (a 12-minute read) is about how to survive inside the AI revolution in software development, without succumbing to the fear that swirls around all of us. It explains some lessons I learned hiking up difficult mountain [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on</em> <a href="https://mysteriousrook.medium.com/software-in-a-time-of-fear-4e5a08ac7c63" target="_blank" rel="noreferrer noopener">Medium</a> <em>and is being reproduced here with the author&#8217;s permission.</em></p>
</blockquote>



<p><em>This 2,800-word essay (a 12-minute read) is about how to survive inside the AI revolution in software development, without succumbing to the fear that swirls around all of us. It explains some lessons I learned hiking up difficult mountain trails that are useful for wrestling with the coding agents. They apply to all knowledge workers, I think.</em></p>



<p><em>Up front, here are the lessons:</em></p>



<ul class="wp-block-list">
<li><em>Stop listening to people who are afraid.</em></li>



<li><em>Seek first-hand testimony, not opinions.</em></li>



<li><em>Go with someone much more enthusiastic than you.</em></li>



<li><em>Do not look down.</em></li>



<li><em>You must get different equipment.</em></li>



<li><em>Put the summit out of your mind.</em></li>
</ul>



<p><em>Yet I hope you stay for the hike up.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.jpeg" alt="Precipice Trail. Image from Wikimedia Commons." class="wp-image-18394" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /><figcaption class="wp-element-caption">Precipice Trail. Image from <a href="https://commons.wikimedia.org/wiki/File:Precipice_Trail.JPG" target="_blank" rel="noreferrer noopener">Wikimedia Commons</a>.</figcaption></figure>



<p>The photo above was taken high up on a mountain. It’s a very long drop down to the right. If you fell off the path in a few places, you’d almost certainly die.</p>



<p>Would you like to walk along it?</p>



<p>Most would say: <em>No way</em>.</p>



<p>But what if I told you that, while this photo is quite real, it is misleading? It isn’t some deserted place. It is in one of America’s busiest national parks. The railings and bars on that trail are incredibly strong, even when they are strangely bent around corners. Thousands of people walk along that path every year, including children and older folks. The fatality rate is approximately one death every <em>30 years</em>.</p>



<p>In fact, my 13-year-old son and I did that climb—which is called Precipice Trail—last summer. We saw other people up there, including a family with kids. It was an incredible adventure. And the views are stunning.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.jpeg" alt="A son climbing part of Precipice Trail" class="wp-image-18395" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /><figcaption class="wp-element-caption">My son climbing part of Precipice Trail</figcaption></figure>



<p>Yes, it was a strenuous climb, and was certainly scary in some places. Even though I had done a lot of other hard trails, I was extremely nervous. If my fearless son wasn’t with me, I’d never have done it.</p>



<p>When we got to the top, out of habit, I told my son, “I am proud of you for accomplishing this.” He rolled his eyes and said, “<em>I</em> am proud of <em>you</em>.” He was right. <em>I </em>was the one at risk. (That did hurt a little bit.)</p>



<p>Yet I learned some things about fear from hiking the hardest trails in Acadia, which I’d never have imagined myself doing a few years ago.</p>



<p>As a lifelong software developer confronted by these extraordinary coding agents, I believe the future of our profession is atop an intimidating mountain whose summit is engulfed in clouds. Nobody knows how long the ascent is, or what lies at the top, though many people are confidently proclaiming we will not make it there. We are told only the agents will be at the summit, and we should therefore be afraid for our livelihoods.</p>



<p>I have far less confidence that the agents will put us all out of work. Though I don’t see all of us making it up that mountain, I intend to be among those who do.</p>



<p>Still, there is so very much <em>fear</em> in our field. It is so…<em>unfamiliar</em>! It swirls around every gathering of technologists. I was at a conference last year where the slogan was the very-comforting “human in the loop.” Yet a coworker of mine noticed, “A lot of the talks seem to be about taking the human <em>out</em> of the loop.” Indeed. And I know for a fact that some great developers are quietly yet diligently working on new tools to make their peers a thing of the past. I hear they are paid handsomely. (Perhaps in pieces of silver?) Don’t worry, they haven’t succeeded yet.</p>



<p>This revolution—whatever <em>this</em> is—isn’t like the other technological revolutions which barged into our professional lives, such as the arrival of the web or smartphone apps. There was unbridled optimism alongside those changes, and they didn’t directly threaten the livelihoods of those who didn’t want to do that kind of work.</p>



<p><em>This</em> is quite different. There <em>is</em> tremendous optimism to be found. Though I find it is almost entirely among the financially secure, as well as those with résumés decorated with elite appointments, who are confident they will merit one of the few seats in the lifeboats as the ocean liner slips into the deep carrying most of the people they knew on LinkedIn. (They’re probably right.) Alas, we can’t all be folks like Steve Yegge, can we?</p>



<p>For the rest of us who need to pay bills and take care of our children, there is <em>fear</em>. Some are panicked they will lose their jobs, or are concerned about the grim environmental, political, and social consequences AI is already inflicting on our planet. Others are climbing up the misty mountain steadily, yet they are still distressed that they will miss some crucial new development that they <em>must</em> know to survive, so they watch videos designed to make them more afraid. Still others refuse to start climbing and are silently haunted by the belief that their reservations are no longer valid.</p>



<p>Though it described us for my entire life, we can no longer be seen as a profession looking to the future. Instead, most of us are looking over our shoulders and listening for movement in the tall grass around us.</p>



<p>I too have been visited by a fear of the agents on many occasions over the past few years, but I keep it at bay…<em>most</em> nights.</p>



<p>One of the best ways I learned to manage it is pretty simple:</p>



<p><em>Stop listening to people who are afraid.</em></p>



<p>It’s odd to decide not to listen to so many people in your field, including nearly everyone in social media. I’ve never done this before.</p>



<p>Yet I learned this unexpected lesson when I was confronted by another difficult mountain in Acadia National Park a few years ago: Beehive.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.jpeg" alt="Beehive mountain in Acadia National Park" class="wp-image-18396" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>Beehive is a well-known Acadia trail that has some sheer cliffs and is not for anyone truly afraid of heights. (The photo above is of three of my children climbing it a few years ago. Over the right shoulder of my 12-year-old daughter in the center is quite a drop.)</p>



<p>It was Beehive, and not Precipice, that taught me an unexpected lesson about popularity and fear that applies to AI.</p>



<p>So Beehive has an interesting name, is open most of the year, is close to the main tourist area and parking lots, and is often featured on signs and sweatshirts in souvenir stores. I even bought a sign for my attic.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1043" height="1600" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27.png" alt="Sign in Ed Lyons's attic for Beehive trail" class="wp-image-18397" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27.png 1043w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-196x300.png 196w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-768x1178.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-1001x1536.png 1001w" sizes="auto, (max-width: 1043px) 100vw, 1043px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>My older kids and I had done a lot of tough trails in Acadia over a few wonderful summers, and I wondered if we could handle Beehive. I started checking the online reviews. It sure <em>sounded</em> scary. I went to many websites and scanned hundreds of reviews over several days. The more I read, the less I wanted to try it.</p>



<p>Worse, the park rangers in Acadia are trained to not give anyone advice about what trail they can handle. (I get it.) No one else I spoke to wanted to tell a family they should try something dangerous. Everyone shrugged. It added to the fear.</p>



<p>Yet I saw conflicting evidence.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="953" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28.png" alt="Warning on the trail" class="wp-image-18398" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28-300x204.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28-768x523.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>My research showed that only one person fell to their death decades ago, and the trail was modified after that. Also, many thousands of people of all types, including children and senior citizens, have done it without injury. On top of that, the mountain was not <em>that</em> high, and the difficult features it had, which I could see from detailed online photos, seemed quite similar to things we had done on a few other difficult trails. It didn’t <em>seem</em> like a big deal.</p>



<p>How could both things be true? Were they?</p>



<p>The truth was much closer to the second version, a conclusion vindicated after we climbed it. It <em>was</em> a little scary at times, but wasn’t <em>that</em> physically challenging. It was fun, and something you could brag about among people who had <em>heard</em> it was scary, but who had not actually climbed it.</p>



<p>I do have a slight fear of heights, so I kept climbing and never turned to look down behind me. This brings me to another lesson:</p>



<p><em>You really never have to look down.</em></p>



<p>It’s amazing how people feel an obligation to look down once in a while: to see what they’ve accomplished, to notice how high up they are, or to judge how dangerous the thing they just climbed looks from above. It often causes fear. I decided getting to the top was all that mattered, and I could look down only from up there. This is a question of focus.</p>



<p>I can think of many moments in learning to use and orchestrate coding agents where I unwisely stopped to “look down.” This takes the form of pausing and asking yourself things like:</p>



<ul class="wp-block-list">
<li>“Is this crazy technique really necessary? Isn’t the old way good enough?”</li>



<li>“What about my favorite programming languages? Will languages matter in the future?”</li>



<li>“What is the environmental cost of my queries?”</li>



<li>“Am I getting worse at writing code myself?”</li>



<li>“What if this agent keeps getting better? Will it get better than me?”</li>



<li>“Am I missing some new AI development online right now? Should I check my feeds?”</li>
</ul>



<p>None of those ruminations will help you get better with the agents. They just drain your energy when you should either rest or keep climbing.</p>



<p>I now see Beehive as an “attention vortex”: a lot of people talk about it, and dramatic statements from the fearful and from those boasting about their accomplishments dominate the reviews. The <em>talk</em> about Beehive is not tethered to the <em>reality</em> of climbing it.</p>



<p>Strangely, the <em>cachet</em> of having climbed it <em>depends</em> on the attention and fear. It made those who climbed it feel better about what they had done, and they had little interest in diminishing their accomplishment by tamping down the fear. (“Well, yes, it <em>was</em> scary up there!”) Nobody is invested in saying it was less than advertised. This insight is <em>precisely</em> why the loud coding agent YouTubers act the way they do.</p>



<p>AI is a <em>planetary</em> attention vortex. It has seemed like the only thing anyone in software development has talked about for over a year. People who quietly use the agents to improve their velocity—and aren’t particularly troubled by that—are not being heard. You aren’t seeing calm instructional videos from them on YouTube. We are instead seeing 30-year-olds pushing coding agent pornography on us every day, while telling us that their multiple-agent, infinite-token, unrestricted-permissions-YOLO workflow means we are doomed. (But <em>you</em> might survive if you hit the subscribe button on their channel, OK?) These confident hucksters are still peddling fear to keep you coming back to them.</p>



<p>Above all else, stop listening to anyone projecting fear. (Granted, you cannot avoid them entirely; they are everywhere and often share their worries unprompted.)</p>



<p>You must find useful information and shut out the rest. This is another lesson I learned:</p>



<p><em>When in an attention vortex, seek firsthand testimony, not opinions.</em></p>



<p>So the way I finally figured out Beehive wasn’t that bad was from some guy who took pictures of every part of the trail. I compared them to what I’d done on similar trails, such as the unpopular but delightful Beech Cliff trail, which nobody thought was truly dangerous and gets almost zero online attention.</p>



<p>When it comes to AI, I have abandoned opinions, predictions, and demos. I listen to senior people who are using agents on real project work, who are humble, who aren’t trying to sell me something, and who are not primarily afraid. (Examples are: <a href="https://simonwillison.net/" target="_blank" rel="noreferrer noopener">Simon Willison</a>, <a href="https://martinfowler.com/" target="_blank" rel="noreferrer noopener">Martin Fowler</a>, <a href="https://blog.fsck.com/" target="_blank" rel="noreferrer noopener">Jesse Vincent</a>, and yes, quickly hand $15 each month to the indispensable <a href="https://www.pragmaticengineer.com/" target="_blank" rel="noreferrer noopener"><em>Pragmatic Engineer</em></a>.)</p>



<p>When it came to Precipice, widely acknowledged as the hardest hiking trail in Acadia, I took a different approach. (It’s actually not a hiking trail but a mountain climb without ropes.) Using the same investigative techniques I’d learned from Beehive, I found out it was three times longer and had scarier moments.</p>



<p>This gets us to another lesson.</p>



<p><em>Go with someone much more enthusiastic than you.</em></p>



<p>I don’t know how, but my athletic 13-year-old son is a daredevil. He’s up for any scary experience. I do not usually accompany him on the scary roller coasters.</p>



<p>He was totally up for Precipice, of course. Dad was very nervous.</p>



<p>But I knew that if anyone could drag me up that mountain, it was him. I also didn’t want to let him down. In fact, I almost decided to abort the mission at the bottom of the trail. I just sighed and thought, “I will just do the beginning part. We can duck out and take another route down until about one-third of the way up.”</p>



<p>So if you’re not sure how to use AI, or are not yet enthusiastic, find people who <em>are</em> and keep talking to them! You don’t have to abandon your friends or coworkers who aren’t as interested. Instead, become the enthusiast in their world. (That is what happened to me more than a year ago.)</p>



<p>Another reason I decided not to give up is that I bought different shoes.</p>



<p>You can hike most trails in regular sneakers in almost any condition. But since Precipice is a climb and not a hike, I realized my usual worn-out running shoes might not be up for that, as I had slid on them during a lesser climb elsewhere that week.</p>



<p>So while in nearby Bar Harbor, my family ducked into a sporting goods store and looked at hiking shoes for me and my son. I told the sales guy we were going to do Precipice. He raised an eyebrow and said I would of course need something good for <em>that</em>.</p>



<p>When I held the strange shoes in my hand, I looked at the price tag and then looked at my wife, who gave a knowing look back at me that surely meant, “OK, but you do realize that you actually have to climb it if we buy those.” I just nodded.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="858" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29.png" alt="Ed's new climbing shoes" class="wp-image-18399" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29-300x184.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29-768x471.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>And we needed those new shoes! My son and I had a few tense moments scrambling where we agreed it was quite good we had them. But all along the way, they <em>felt</em> different, which was what I needed.</p>



<p>This reminds me of when I decided to use Claude Code a few weeks after it came out last March. The tokens cost 10 times what I could get elsewhere. But suddenly I was invested.</p>



<p>It also mattered that Claude Code, as a terminal application, was a very different development experience. People back then thought it was strange that I was using a CLI to manage code. It was really different for me too, and all the better: I was no longer screwing around with code suggestions in GitHub Copilot.</p>



<p>This is a lesson I have taken to AI:</p>



<p><em>You must get different equipment.</em></p>



<p>You should be regularly experimenting with new tools that make you uncomfortable. Just using the new AI features in your existing tool is not enough for continuous growth or paradigm shifts, like the recent one from the CLI to managing multiple simultaneous agents.</p>



<p>The last idea I have is to stop thinking about where all of us will end up one day.</p>



<p><em>Put the summit out of your mind.</em></p>



<p>While climbing Precipice, I decided to only think of what was in front of me. I knew it was <em>a lot</em> higher than Beehive. I just kept doing one more tough piece of it.</p>



<p>The advantage of doing this was near the top. Because the scariest piece was something I didn’t notice from online trail photos.</p>



<p>You can get an idea of what I&#8217;m talking about from <a href="http://www.watsonswander.com/assets/2016/08/DSC06593.jpg" target="_blank" rel="noreferrer noopener">this photo</a> from <a href="http://www.watsonswander.com/2016/last-days-in-maine/" target="_blank" rel="noreferrer noopener">Watson&#8217;s World</a>, which I had not seen before I got up there. It shows a long cliff with a very short ledge (much shorter than it looks at this angle). Even the picture doesn’t make it clear just how <em>exposed</em> you are and that there is <em>nothing</em> behind you but a long, deadly fall. The bottom bars are to prevent your feet from slipping off.</p>



<p>When I came to it, I thought, “No…way.”</p>



<p>But there was no turning back by then. I had come so far! I looked up and saw the summit was just above this last traverse. So I just held onto the bars, held onto my breath, and moved carefully along the cliff right behind my son, who was suddenly more cautious.</p>



<p>Had I known <em>that</em> was up there, I might not have climbed the mountain. Good thing I didn’t know.</p>



<p>As for the future of software, I don’t know what lies further up the mountain we are on. There are probably some very strenuous and scary moments ahead. But we shouldn’t be worrying about them now.</p>



<p>We should just keep climbing.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/software-in-a-time-of-fear/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Missing Layer in Agentic AI</title>
		<link>https://www.oreilly.com/radar/the-missing-layer-in-agentic-ai/</link>
				<comments>https://www.oreilly.com/radar/the-missing-layer-in-agentic-ai/#respond</comments>
				<pubDate>Thu, 26 Mar 2026 11:30:50 +0000</pubDate>
					<dc:creator><![CDATA[Artur Huk]]></dc:creator>
						<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18372</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-layer-in-agentic-AI.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-layer-in-agentic-AI-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why autonomous systems need a deterministic runtime]]></custom:subtitle>
		
				<description><![CDATA[The day two problem Imagine you deploy an autonomous AI agent to production. Day one is a success: The demos are fantastic; the reasoning is sharp. But before handing over real authority, uncomfortable questions emerge. What happens when the agent misinterprets a locale-specific decimal separator, turning a position of 15.500 ETH (15 and a half) [&#8230;]]]></description>
								<content:encoded><![CDATA[
<h2 class="wp-block-heading">The day two problem</h2>



<p>Imagine you deploy an autonomous AI agent to production. Day one is a success: The demos are fantastic; the reasoning is sharp. But before handing over real authority, uncomfortable questions emerge.</p>



<p>What happens when the agent misinterprets a locale-specific decimal separator, turning a position of 15.500 ETH (15 and a half) into an order for 15,500 ETH (15 thousand) on leverage? What if a dropped connection leaves it looping on stale state, draining your LLM request quota in minutes?</p>



<p>What if it makes a perfect decision, but the market moves just before execution? What if it hallucinates a parameter like <code>force_execution=True</code>—do you sanitize it or crash downstream? And can it reliably ignore a prompt injection buried in a web page?</p>



<p>Finally, if an API call times out without acknowledgment, do you retry and risk duplicating a $50K transaction, or drop it?</p>



<p>When these scenarios occur, megabytes of prompt logs won&#8217;t explain the failure. And adding &#8220;please be careful&#8221; to the system prompt is superstition, not an engineering control.</p>



<h2 class="wp-block-heading">Why a smarter model is not the answer</h2>



<p>I encountered these failure modes firsthand while building an autonomous system for live financial markets. It became clear that these were not model failures but execution boundary failures. While RL-based fine-tuning can improve reasoning quality, it cannot solve infrastructure realities like network timeouts, race conditions, or dropped connections.</p>



<p>The real issues are architectural gaps: contract violations, data integrity issues, context staleness, decision-execution gaps, and network unreliability.</p>



<p>These are infrastructure problems, not intelligence problems.</p>



<p>While LLMs excel at orchestration, they lack the &#8220;kernel boundary&#8221; needed to enforce state integrity, idempotency, and transactional safety where decisions meet the real world.</p>



<h2 class="wp-block-heading">An architectural pattern: The Decision Intelligence Runtime</h2>



<p>Consider modern operating system design. OS architectures separate “user space” (unprivileged computation) from “kernel space” (privileged state modification). Processes in user space can perform complex operations and request actions but cannot directly modify system state. The kernel validates every request deterministically before allowing side effects.</p>



<p>AI agents need the same structure. The agent interprets context and proposes intent, but the actual execution requires a privileged deterministic boundary. This layer, the Decision Intelligence Runtime (DIR), separates probabilistic reasoning from real-world execution.</p>



<p>The runtime sits between agent reasoning and external APIs, maintaining a <strong>context store</strong>, a centralized, immutable record ensuring the runtime holds the &#8220;single source of truth,&#8221; while agents operate only on temporary snapshots. It receives proposed intents, validates them against hard engineering rules, and handles execution. Ideally, an agent should never directly manage API credentials or “own” the connection to the external world, even for read-only access. Instead, the runtime should act as a proxy, providing the agent with an immutable context snapshot while keeping the actual keys in the privileged kernel space.</p>
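<p>As a rough sketch of that proxy relationship (hypothetical names; Python&#8217;s read-only mapping type stands in for a real immutability mechanism), the runtime keeps the credentials in kernel space and hands the agent only a frozen snapshot:</p>

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ContextSnapshot:
    """Read-only view of the context store handed to an agent."""
    version: int
    data: MappingProxyType


class DecisionRuntime:
    """Kernel space: owns the API key and the single source of truth."""

    def __init__(self, api_key):
        self._api_key = api_key   # never exposed to agents
        self._context = {"ETH-USD": {"price": 3100.0, "position": 0.5}}
        self._version = 1

    def snapshot(self):
        # Copy each record into a read-only proxy: the agent reasons on
        # this frozen view while the live context store keeps updating.
        frozen = {k: MappingProxyType(dict(v)) for k, v in self._context.items()}
        return ContextSnapshot(self._version, MappingProxyType(frozen))


runtime = DecisionRuntime(api_key="kernel-only-secret")
snap = runtime.snapshot()
try:
    snap.data["ETH-USD"]["price"] = 0.0   # agent mutation attempt
except TypeError:
    print("snapshot is read-only")        # mutation is rejected
```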



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="755" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23.png" alt="Figure 1: High-level design (HLD) of the Decision Intelligence Runtime" class="wp-image-18373" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-300x142.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-768x362.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-1536x725.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 1: High-level design (HLD) of the Decision Intelligence Runtime, illustrating the separation of user space reasoning from kernel space execution</em></figcaption></figure>



<p>Bringing engineering rigor to probabilistic AI requires implementing five familiar architectural pillars.</p>



<p>Although several examples in this article use a trading simulation for concreteness, the same structure applies to healthcare workflows, logistics orchestration, and industrial control systems.</p>



<h3 class="wp-block-heading">DIR versus existing approaches</h3>



<p>The landscape of agent guardrails has expanded rapidly. Frameworks like LangChain and LangGraph operate in user space, focusing on reasoning orchestration, while tools like Anthropic&#8217;s Constitutional AI and Pydantic schemas validate outputs at inference time. DIR, by contrast, operates at the execution boundary, the kernel space, enforcing contracts, business logic, and audit trails after reasoning is complete.</p>



<p>The two approaches are complementary; DIR is intended as a safety layer for mission-critical systems.</p>



<h4 class="wp-block-heading">1. Policy as a claim, not a fact</h4>



<p>In a secure system, external input is never trusted by default. The output of an AI agent is exactly that: external input. The proposed architecture treats the agent not as a trusted administrator, but as an untrusted user submitting a form. Its output is structured as a <strong>policy proposal</strong>—a claim that it <em>wants</em> to perform an action, not an order that it <em>will</em> perform it. This is the start of a Zero Trust approach to agentic actions.</p>



<p>Here is an example of a policy proposal from a trading agent:</p>



<pre class="wp-block-code"><code>proposal = PolicyProposal(
    dfid="550e8400-e29b-41d4-a716-446655440000", # Trace ID (see Sec 5)
    agent_id="crypto_position_manager_01",
    policy_kind="TAKE_PROFIT",
    params={
        "instrument": "ETH-USD",
        "quantity": 0.5,
        "execution_type": "MARKET"
    },
    reasoning="Profit target of +3.2% hit (Threshold: 3.0%). Market momentum slowing.",
    confidence_score=0.92
)</code></pre>



<h4 class="wp-block-heading">2. Responsibility contract as code</h4>



<p>Prompts are not permissions. Just as traditional apps rely on role-based access control, agents require a strict <strong>responsibility contract</strong> residing in the deterministic runtime. This layer acts as a firewall, validating every proposal against hard engineering rules: schema, parameters, and risk limits. Crucially, this check is deterministic code, not another LLM asking, &#8220;Is this dangerous?&#8221; Whether the agent hallucinates a capability or obeys a malicious prompt injection, the runtime simply enforces the contract and rejects the invalid request.</p>



<p><strong>Real-world example:</strong> A trading agent misreads a locale-specific decimal separator and attempts to execute <code>place_order(symbol='ETH-USD', quantity=15500)</code>. This would be a catastrophic position sizing error. The contract rejects it immediately:</p>



<pre class="wp-block-code"><code>ERROR: Policy rejected. Proposed order value exceeds hard limit.
Request: ~40000000 USD (15500 ETH)
Limit: 50000 USD (max_order_size_usd)</code></pre>



<p>The agent&#8217;s output is discarded; the human is notified. No API call, no cascading market impact.</p>



<p>Here is the contract that prevented this:</p>



<pre class="wp-block-code"><code># agent_contract.yaml
agent_id: "crypto_position_manager_01"
role: "EXECUTOR"
mission: "Manage news-triggered ETH positions. Protect capital while seeking alpha."
version: "1.2.0"                  # Immutable versioning for audit trails
owner: "jane.doe@example.com"     # Human accountability
effective_from: "2026-02-01"

# Deterministic Boundaries (The 'Kernel Space' rules)
permissions:
  allowed_instruments: &#91;"ETH-USD", "BTC-USD"]
  allowed_policy_types: &#91;"TAKE_PROFIT", "CLOSE_POSITION", "REDUCE_SIZE", "HOLD"]
  max_order_size_usd: 50000.00

# Safety &amp; Economic Triggers (Intervention Logic)
safety_rules:
  min_confidence_threshold: 0.85      # Don't act on low-certainty reasoning
  max_drawdown_limit_pct: 4.0         # Hard stop-loss enforced by Runtime
  wake_up_threshold_pnl_pct: 2.5      # Cost optimization: ignore noise
  escalate_on_uncertainty: 0.70       # If confidence &lt; 70%, ask human</code></pre>
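<p>To make the &#8220;deterministic code, not another LLM&#8221; point concrete, here is a minimal Python sketch of such a gate. It is illustrative only: the dictionary hardcodes two fields from the contract above, <code>validate_proposal</code> is a hypothetical helper name, and the ETH price is assumed.</p>

```python
# Minimal sketch of a deterministic contract gate (illustrative only).
# In a real runtime the contract would be loaded from the versioned YAML,
# not hardcoded here.
CONTRACT = {
    "allowed_instruments": {"ETH-USD", "BTC-USD"},
    "max_order_size_usd": 50_000.00,
}

def validate_proposal(symbol: str, quantity: float, price_usd: float) -> tuple[bool, str]:
    """Return (accepted, reason). Pure deterministic code -- no model in the loop."""
    if symbol not in CONTRACT["allowed_instruments"]:
        return False, f"Policy rejected. Instrument {symbol!r} not allowed."
    notional = quantity * price_usd
    if notional > CONTRACT["max_order_size_usd"]:
        return False, (
            f"Policy rejected. Proposed order value exceeds hard limit. "
            f"Request: ~{notional:.0f} USD, Limit: {CONTRACT['max_order_size_usd']:.0f} USD"
        )
    return True, "OK"

# The hallucinated 15,500 ETH order from the example above
# (assuming an ETH price of ~2,580 USD):
ok, reason = validate_proposal("ETH-USD", 15_500, 2_580.0)
```

<p>Whatever the agent &#8220;believes,&#8221; the gate sees only the proposed parameters, so a prompt injection and an honest hallucination are rejected by the same code path.</p>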



<h4 class="wp-block-heading">3. JIT (just-in-time) state verification</h4>



<p>This mechanism addresses the classic race condition where the world changes between the moment you check it and the moment you act on it. When an agent begins reasoning, the runtime binds its process to a specific context snapshot. Because LLM inference takes time, the world will likely change before the decision is ready. Right before executing the API call, the runtime performs a JIT verification, comparing the live environment against the original snapshot. If the environment has shifted beyond a predefined drift envelope, the runtime aborts the execution.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1106" height="892" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24.png" alt="Figure 2: JIT verification catches stale decisions before they reach external systems." class="wp-image-18374" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24.png 1106w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24-300x242.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24-768x619.png 768w" sizes="auto, (max-width: 1106px) 100vw, 1106px" /><figcaption class="wp-element-caption"><em>Figure 2: JIT verification catches stale decisions before they reach external systems.</em></figcaption></figure>



<p>The drift envelope is configurable per context field, allowing fine-grained control over what constitutes an acceptable change:</p>



<pre class="wp-block-code"><code># jit_verification.yaml
jit_verification:
  enabled: true
  
  # Maximum allowed drift per field before aborting execution
  drift_envelope:
    price_pct: 2.0           # Abort if price moved > 2%
    volume_pct: 15.0         # Abort if volume changed > 15%
    position_state: strict   # Any change = abort
  
  # Snapshot expiration
  max_context_age_seconds: 30
  
  # On drift detection
  on_drift_exceeded:
    action: "ABORT"
    notify: &#91;"ops-channel"]
    retry_with_fresh_context: true
</code></pre>
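<p>The check itself can be sketched in a few lines, under the assumption that numeric envelope entries mean maximum percentage drift and <code>strict</code> means any change aborts (field names are simplified from the YAML above):</p>

```python
# Illustrative JIT drift check; mirrors the drift_envelope idea above.
def within_envelope(snapshot: dict, live: dict, envelope: dict) -> bool:
    """Compare the context snapshot taken at reasoning time with the live
    state right before execution. Any breach means the runtime aborts."""
    for field, limit in envelope.items():
        if limit == "strict":
            if snapshot[field] != live[field]:
                return False
        else:  # numeric limit: maximum allowed percentage drift
            drift_pct = abs(live[field] - snapshot[field]) / snapshot[field] * 100
            if drift_pct > limit:
                return False
    return True

envelope = {"price": 2.0, "volume": 15.0, "position_state": "strict"}
snapshot = {"price": 2500.0, "volume": 1_000_000, "position_state": "OPEN"}
live_ok  = {"price": 2530.0, "volume": 1_050_000, "position_state": "OPEN"}  # +1.2% price
live_bad = {"price": 2580.0, "volume": 1_050_000, "position_state": "OPEN"}  # +3.2% price
```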



<h4 class="wp-block-heading">4. Idempotency and transactional rollback</h4>



<p>This mechanism is designed to mitigate execution chaos and infinite retry loops. Before making any external API call, the runtime hashes the deterministic decision parameters into a unique idempotency key. If a network connection drops or an agent gets confused and attempts to execute the exact same action multiple times, the runtime catches the duplicate key at the boundary.</p>



<p>The key is computed as:</p>



<pre class="wp-block-code"><code>IdempotencyKey = SHA256(DFID + StepID + CanonicalParams)</code></pre>



<p>Where <code>DFID</code> is the Decision Flow ID, <code>StepID</code> identifies the specific action within a multistep workflow, and <code>CanonicalParams</code> is a sorted representation of the action parameters.</p>



<p>Critically, the <strong>context hash</strong> (snapshot of the world state) is deliberately <strong>excluded</strong> from this key. If an agent decides to buy 10 ETH and the network fails, it might retry 10 seconds later. By then, the market price (context) has changed. If we included the context in the hash, the retry would generate a new key (<code>SHA256(Action + NewContext)</code>), bypassing the idempotency check and causing a duplicate order. By locking the key to the <em>Flow ID</em> and <em>Intent params</em> only, we ensure that a retry of the same logical decision is recognized as a duplicate, even if the world around it has shifted slightly.</p>
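<p>The derivation can be sketched directly from the formula: canonicalize the intent parameters (sorted keys), then hash them together with the flow and step IDs. The exact serialization here is an assumption; the point is that context is not an input:</p>

```python
import hashlib
import json

def idempotency_key(dfid: str, step_id: str, params: dict) -> str:
    """SHA256 over the Decision Flow ID, the step ID, and canonicalized
    intent parameters. Context is deliberately excluded, so a retry of the
    same logical decision hashes to the same key even if the market moved."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{dfid}|{step_id}|{canonical}".encode()).hexdigest()

params = {"symbol": "ETH-USD", "side": "BUY", "quantity": 10}
k1 = idempotency_key("dfid-42", "step-1", params)
# Retry after a network failure: same intent, different market context --
# but context never enters the hash, so the key is identical.
k2 = idempotency_key("dfid-42", "step-1", dict(params))
```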



<p>Furthermore, when an agent makes a multistep decision, the runtime tracks each step. If one step fails, it knows how to perform a compensation transaction to roll back what was already done, instead of hoping the agent will figure it out on the fly.</p>
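<p>A compensation rollback can be sketched as the runtime tracking one undo action per completed step and replaying them in reverse on failure. This is a toy saga-style illustration of the idea, not the project&#8217;s API:</p>

```python
# Toy saga-style rollback: each completed step registers a compensation
# action; on failure, completed steps are undone in reverse order.
def run_workflow(steps):
    """steps: list of (do, undo) callables. Returns True on success,
    False after rolling back every completed step."""
    compensations = []
    try:
        for do, undo in steps:
            do()
            compensations.append(undo)
        return True
    except Exception:
        for undo in reversed(compensations):
            undo()  # compensation transaction
        return False

log = []

def fail():
    raise RuntimeError("venue rejected order")

ok = run_workflow([
    (lambda: log.append("reserve_funds"), lambda: log.append("release_funds")),
    (fail, lambda: log.append("cancel_order")),  # never completes, so never undone
])
```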



<p>A DIR does not magically provide strong consistency; it makes the consistency model explicit: where you require atomicity, where you rely on compensating transactions, and where eventual consistency is acceptable.</p>



<h4 class="wp-block-heading">5. DFID: From observability to reconstruction</h4>



<p>Distributed tracing is not a new idea. The practical gap in many agentic systems is that traces rarely capture the artifacts that matter at the execution boundary: the exact context snapshot, the contract/schema version, the validation outcome, the idempotency key, and the external receipt.</p>



<p>The Decision Flow ID (DFID) is intended as a <em>reconstruction primitive</em>—one correlation key that binds the minimum evidence needed to answer critical operational questions:</p>



<ul class="wp-block-list">
<li><strong>Why did the system execute this action?</strong> (policy proposal + validation receipt + contract/schema version)</li>



<li><strong>Was the decision stale at execution time?</strong> (context snapshot + JIT drift report)</li>



<li><strong>Did the system retry safely or duplicate the side effect?</strong> (idempotency key + attempt log + external acknowledgment)</li>



<li><strong>Which authority allowed it?</strong> (agent identity + registry/contract snapshot)</li>
</ul>



<p>In practice, this turns a postmortem from &#8220;the agent traded&#8221; into &#8220;this exact intent was accepted under these deterministic gates against this exact snapshot, and produced this external receipt.&#8221; The goal is not to claim perfect correctness; it is to make side effects explainable at the level of inputs and gates, even when the reasoning remains probabilistic.</p>



<p>At the hierarchical level, DFIDs form parent-child relationships. A strategic intent spawns multiple child flows. When multistep workflows fail, you reconstruct not just the failing step but the parent mandate that authorized it.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1324" height="449" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25.png" alt="Figure 3: Hierarchical Decision Flow IDs enable full process reconstruction across multi-agent interactions." class="wp-image-18375" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25.png 1324w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25-300x102.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25-768x260.png 768w" sizes="auto, (max-width: 1324px) 100vw, 1324px" /><figcaption class="wp-element-caption"><em>Figure 3: Hierarchical Decision Flow IDs enable full process reconstruction across multi-agent interactions.</em></figcaption></figure>



<p>In practice, this level of traceability is not about storing prompts—it is about storing structured decision telemetry.</p>



<p>In one trading simulation, each position generated a decision flow that could be queried like any other system artifact. This allowed inspection of the triggering news signal, the agent’s justification, intermediate decisions (such as stop adjustments), the final close action, and the resulting PnL, all tied to a single simulation ID. Instead of replaying conversational history, this approach reconstructed what happened at the level of state transitions and executable intents.</p>



<pre class="wp-block-code"><code>SELECT position_id
     , instrument
     , entry_price
     , initial_exposure
     , news_full_headline
     , news_score
     , news_justification
     , decisions_timeline
     , close_price
     , close_reason
     , pnl_percent
     , pnl_usd
  FROM position_audit_agg_v
 WHERE simulation_id = 'sim_2026-02-24T11-20-18-516762+00-00_0dc07774';</code></pre>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="164" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26.png" alt="Figure 4: Example of structured decision telemetry" class="wp-image-18376" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-300x31.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-768x79.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-1536x157.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 4: Example of structured decision telemetry. Each row links context, reasoning, intermediate actions, and financial outcome for a single simulation run.</em></figcaption></figure>



<p>This approach is fundamentally different from prompt logging. The agent’s reasoning becomes one field among many—not the system of record. The system of record is the validated decision and its deterministic execution boundary.</p>



<h3 class="wp-block-heading">From model-centric to execution-centric AI</h3>



<p>The industry is shifting from <em>model-centric AI</em>, measuring success by reasoning quality alone, to <em>execution-centric AI</em>, where reliability and operational safety are first-class concerns.</p>



<p>This shift comes with trade-offs. Deterministic control adds latency, reduces throughput, and demands stricter schema discipline. For simple summarization tasks, this overhead is unjustified. But for systems that move capital or control infrastructure, where a single failure outweighs any efficiency gain, these are acceptable costs. A duplicate $50K order is far more expensive than a 200 ms validation check.</p>



<p>This architecture is not a single software package. Just as Model-View-Controller (MVC) is a pervasive pattern without being a single importable library, DIR is a set of engineering principles applied to probabilistic agents: separation of concerns, zero trust, and state determinism. Treating agents as untrusted processes is not about limiting their intelligence; it is about providing the safety scaffolding required to use that intelligence in production.</p>



<p>As agents gain direct access to capital and infrastructure, a runtime layer will become as standard in the AI stack as a transaction manager is in banking. The question is not whether such a layer is necessary but how we choose to design it.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This article provides a high-level introduction to the Decision Intelligence Runtime and its approach to production resiliency and operational challenges. The full architectural specification, repository of context patterns, and reference implementations are available as an open source project at </em><a href="https://github.com/huka81/decision-intelligence-runtime" target="_blank" rel="noreferrer noopener"><em>GitHub</em></a><em>.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-missing-layer-in-agentic-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Spotting and Avoiding ROT in Your Agentic AI</title>
		<link>https://www.oreilly.com/radar/spotting-and-avoiding-rot-in-your-agentic-ai/</link>
				<comments>https://www.oreilly.com/radar/spotting-and-avoiding-rot-in-your-agentic-ai/#respond</comments>
				<pubDate>Wed, 25 Mar 2026 11:17:39 +0000</pubDate>
					<dc:creator><![CDATA[Q McCallum]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18366</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/20260225-hacker-clint-patterson-dYEuFB8KQJk-unsplash.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="1920" 
				height="1283" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/20260225-hacker-clint-patterson-dYEuFB8KQJk-unsplash-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Q McCallum’s blog and is being republished here with the author’s permission. Generative AI agents and rogue traders pose similar insider threats to their employers. Specifically, we can expect companies to deploy agentic AI with broad reach and insufficient oversight. That creates the conditions for a particular flavor of [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://qethanm.cc/2026/02/25/avoiding-rot/" target="_blank" rel="noreferrer noopener"><em>Q McCallum’s blog</em></a><em> and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>Generative AI agents and rogue traders pose similar insider threats to their employers.</p>



<p>Specifically, we can expect companies to deploy agentic AI with broad reach and insufficient oversight. That creates the conditions for a particular flavor of long-running problem, which in turn creates a novel risk exposure for both the companies in question and for anyone doing business with them. The bot and the rogue trader are able to inflict sizable, sometimes existential, damage to the firms that employ them.</p>



<p>The key difference is the scope: Rogue traders operate in investment banks, while agentic AI will be deployed to a wider array of companies and industry verticals. Agentic AI may therefore create a greater number of problems than rogue traders and put a greater amount of capital at risk.</p>



<p>I&#8217;m naming this risk exposure ROT—Rogue Operator Threat—and this document is a brief explainer on what it is and how to address it.</p>



<p>(I almost called it RAT, with the A for &#8220;agentic,&#8221; but then realized that it would apply to any kind of automated system. So I broadened the scope to &#8220;operator.&#8221;)</p>



<p>To set the stage, let&#8217;s take a trip to the trading floor:</p>



<h2 class="wp-block-heading">Understanding the rogue trader</h2>



<p>Rogue trader scandals follow the same storyline:</p>



<ul class="wp-block-list">
<li>A trader accrues losses due to bad trades.</li>



<li>They hide those losses while placing new trades in an attempt to recover.</li>



<li>The new trades also lose money, digging a deeper hole.</li>



<li>Repeat.</li>
</ul>



<p>This cycle continues until they&#8217;re caught, at which point the bank is sitting on a large loss (sometimes into the billions of dollars) and the trader faces legal repercussions.</p>



<p>The story of Barings Bank offers a concrete example. Trader Nick Leeson had been logging fraudulent trades, over a stretch of three years, in an attempt to cover his mounting losses. This only came to light when the Kobe earthquake shifted markets against his most recent positions and the losses were no longer possible to hide. Leeson&#8217;s £800M ($1.3B) hole drove Barings to bankruptcy just three days later.</p>



<p>This is when you&#8217;ll ask: How could a professional trading operation let so many bad trades slip through undetected? How could a trader falsify records? Aren&#8217;t trading floors high-tech operations, full of electronic audit trails?</p>



<p>And the answer is: It&#8217;s complicated.</p>



<p>Trading operations do keep records, yes. But no system is perfect. Each time a rogue trading scandal comes to light, it turns out that there were loopholes in risk controls. A sufficiently motivated trader—especially one desperate to hide their mistakes—found and exploited these loopholes, continuing their losing streak in plain sight until they could bring in real money to backfill the fake records.</p>



<p>That &#8220;until&#8221; never happened, though. Which is why their employers then faced financial, reputational, and sometimes legal troubles.</p>



<h2 class="wp-block-heading">The AI agent&#8217;s ROT threat</h2>



<p>Similar to a trader, an AI agent operates on behalf of its parent business and is given room to operate independently so it can accomplish its tasks.</p>



<p>The risk is that, in the rush to deploy agentic AI, these companies will likely grant the bots more leeway than is necessary. We&#8217;ve already seen cases in which bots have been able to <a href="https://www.pcmag.com/news/meta-security-researchers-openclaw-ai-agent-accidentally-deleted-her-emails" target="_blank" rel="noreferrer noopener">delete emails</a> and <a href="https://www.pcmag.com/news/vibe-coding-fiasco-replite-ai-agent-goes-rogue-deletes-company-database" target="_blank" rel="noreferrer noopener">wipe a production database</a>. And there are no doubt other stories that haven&#8217;t made it into the news.</p>



<p>Those issues were at least caught in real time. Companies facing ROT are exposed to additional longer-running problems in which the bot is able to accrue losses or inflict greater damage over an extended period. In those cases the problems will only be uncovered by accident and/or when it&#8217;s too late.</p>



<p>Consider, for example, an agent that creates false data records to reflect (nonexistent) sales orders. It&#8217;s possible for this to run until some external event, such as investor due diligence or a budget review, forces someone to double-check those records against reality.</p>



<h2 class="wp-block-heading">Avoiding ROT: Mitigating the threat</h2>



<p>How can you narrow your downside risk exposure to ROT? Preventative measures are key. Strong risk controls, narrow scope of authority, and monitoring can catch rogue operator problems long before they&#8217;ve metastasized into an existential threat.</p>



<p>In light of rogue trader scandals, trading shops have been known to tighten risk controls and also separate duties to create a system of checks and balances. (This inhibits traders from logging their own fake trades.) Companies also require traders to take time off, as fraudulent activity may surface when the perpetrator isn&#8217;t around every day to keep the system running.</p>



<p>Adapting these ideas to agentic AI, a company could monitor and limit the scope of the bot&#8217;s activity (say, requiring human approval to place more than 10 orders an hour). It could also periodically purge the agent&#8217;s memory so it doesn&#8217;t accumulate too many evolved behaviors, or swap in completely new bots to pick up where the previous one had left off. And per my usual refrain of &#8220;<em>never let the bots run unattended</em>,&#8221; this company could employ people to cross-check everything the bot does. Trust, but verify.</p>



<p>This will not prevent the AI agent from making mistakes. But guardrails and sufficiently frequent checks should limit the scope of the bot&#8217;s damage. As with the rogue trader, the ROT problem isn&#8217;t about a single error; it&#8217;s about letting the errors grow out of control, undetected.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/spotting-and-avoiding-rot-in-your-agentic-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>How to Build a General-Purpose AI Agent in 131 Lines of Python</title>
		<link>https://www.oreilly.com/radar/how-to-build-a-general-purpose-ai-agent-in-131-lines-of-python/</link>
				<comments>https://www.oreilly.com/radar/how-to-build-a-general-purpose-ai-agent-in-131-lines-of-python/#respond</comments>
				<pubDate>Tue, 24 Mar 2026 11:17:16 +0000</pubDate>
					<dc:creator><![CDATA[Hugo Bowne-Anderson]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18335</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/131-lines-of-code.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/131-lines-of-code-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Implement a coding agent in 131 lines of Python code, and a search agent in 61 lines]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Hugo Bowne-Anderson’s newsletter, Vanishing Gradients, and is being republished here with the author’s permission. In this post, we’ll build two AI agents from scratch in Python. One will be a coding agent, the other a search agent. Why have I called this post “How to Build a General-Purpose AI [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on Hugo Bowne-Anderson’s newsletter, </em><a href="https://hugobowne.substack.com/p/how-to-build-a-general-purpose-ai" target="_blank" rel="noreferrer noopener">Vanishing Gradients</a><em>, and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>In this post, we’ll build two AI agents from scratch in Python. One will be a coding agent, the other a search agent.</p>



<p>Why have I called this post “How to Build a General-Purpose AI Agent in 131 Lines of Python” then? Well, as it turns out, <strong>coding agents are actually general-purpose agents in some quite surprising ways.</strong></p>



<p>What I mean by this is <em>once you have an agent that can write code</em>, it can:</p>



<ol class="wp-block-list">
<li>Do a huge number of things you don’t often think of as involving code, and</li>



<li>Extend itself to do even more things.</li>
</ol>



<p><strong>It’s more appropriate to think of coding agents as “computer-using agents” that happen to be great at writing code.</strong> That doesn’t mean you should always build a general-purpose agent, but it’s worth understanding what you’re actually building when you give an LLM shell access. That’s also why we’ll build a search agent in this post: to show the pattern works regardless of what you’re building.</p>



<p>For example, the coding agent we’ll build below has four tools: read, write, edit, and bash.</p>



<p>Watch this two-minute video to see how it can clean your desktop and why you should think of coding agents as “computer-using agents” that happen to be great at writing code:</p>



<figure class="wp-block-video"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Transforming-Coding-Agents-into-General-Purpose-Computer-Assistants.mp4" playsinline></video></figure>



<p>It can do:</p>



<ul class="wp-block-list">
<li><strong>File/life organization:</strong> Clean your desktop, sort downloads by type, rename vacation photos with dates, find and delete duplicates, organize receipts into folders.&nbsp;.&nbsp;.</li>



<li><strong>Personal productivity:</strong> Search all your notes for something you half-remember, compile a packing list from past trips, find all PDFs containing “tax” from last year.&nbsp;.&nbsp;.</li>



<li><strong>Media management:</strong> Rename a season of TV episodes properly, convert images to different formats, extract audio from videos, resize photos for social media.&nbsp;.&nbsp;.</li>



<li><strong>Writing and content:</strong> Combine multiple docs into one, convert between formats, find-and-replace across many files.&nbsp;.&nbsp;.</li>



<li><strong>Data wrangling:</strong> Turn a messy CSV into a clean address book, extract emails from a pile of files, merge spreadsheets from different sources.&nbsp;.&nbsp;.</li>
</ul>



<p>This is a small subset of what’s possible. It’s also the reason Claude Cowork seemed promising and why OpenClaw has taken off in the way it did.</p>



<p><em>So how can you build this?</em> In this post, I’ll show you how to build a minimal version.</p>



<h2 class="wp-block-heading">Agents are just LLMs with tools in a loop</h2>



<p>Agents are just LLMs with tools in a conversation loop, and once you know the pattern, you&#8217;ll be able to build all types of agents with it:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1408" height="768" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5.jpeg" alt="Builder's playbook" class="wp-image-18336" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5.jpeg 1408w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-300x164.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-768x419.jpeg 768w" sizes="auto, (max-width: 1408px) 100vw, 1408px" /></figure>



<p>As <a href="https://ivanleo.com/blog/building-an-agent" target="_blank" rel="noreferrer noopener">Ivan Leo wrote</a>,</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The barrier to entry is remarkably low: 30 minutes and you have an AI that can understand your codebase and make edits just by talking to it.</p>
</blockquote>



<p>The goal here is to show that the pattern is the same regardless of what you’re building an agent for. Coding agent, search agent, browser agent, email agent, database agent: they all follow the same structure. The only difference is the tools you give them.</p>



<h2 class="wp-block-heading">Part 1: The coding agent</h2>



<p>We’ll start with a coding agent that can read, write, and execute code. As stated, the ability to write and execute code with bash also turns a “coding agent” into a “general-purpose agent.” With shell access, it can do anything you can do from a terminal:</p>



<ul class="wp-block-list">
<li>Sort and organize your local filesystem</li>



<li>Clean up your desktop</li>



<li>Batch rename photos</li>



<li>Convert file formats</li>



<li>Manage Git repos across multiple projects</li>



<li>Install and configure software</li>
</ul>



<p><a href="https://github.com/hugobowne/building-with-ai/tree/main/general-purpose-agent/coding-agent" target="_blank" rel="noreferrer noopener">You can find the code here</a>.</p>



<p>Check out <a href="https://ivanleo.com/blog/building-an-agent" target="_blank" rel="noreferrer noopener">Ivan Leo’s post</a> for how to do this in JavaScript and <a href="https://ampcode.com/notes/how-to-build-an-agent" target="_blank" rel="noreferrer noopener">Thorsten Ball’s post</a> for how to do it in Go.</p>



<h3 class="wp-block-heading">Setup</h3>



<p>Start by creating our project:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="421" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6.jpeg" alt="Create project" class="wp-image-18337" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6-300x87.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6-768x222.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>We’ll be using Anthropic here. Feel free to use your LLM of choice. For bonus points, use Pydantic AI (or a similar library) and have a consistent interface for the various different LLM providers. That way you can use the same agentic framework for both Claude and Gemini!</p>



<p>Make sure you&#8217;ve got an Anthropic API key set as the <code>ANTHROPIC_API_KEY</code> environment variable.</p>



<p>We’ll build our agent in four steps:</p>



<ol class="wp-block-list">
<li>Hook up our LLM</li>



<li>Add a tool that reads files
<ol class="wp-block-list">
<li>Add more tools: <code>write</code>, <code>edit</code>, and <code>bash</code></li>
</ol>
</li>



<li>Build the agentic loop</li>



<li>Build the conversational loop</li>
</ol>



<h3 class="wp-block-heading">1. Hook up our LLM</h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="724" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11.png" alt="Hook up LLM 1" class="wp-image-18338" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11-300x149.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11-768x382.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="138" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12.png" alt="Hook up LLM 2" class="wp-image-18339" style="width:685px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12-300x81.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>Text in, text out. Good! Now let’s give it a tool.</p>



<h3 class="wp-block-heading">2. Add a tool (read)</h3>



<p>We’ll start by implementing a tool called read which will allow the agent to read files from the filesystem. In Python, we can use Pydantic for schema validation, which also generates JSON schemas we can provide to the API:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="663" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7.jpeg" alt="JSON schema generation" class="wp-image-18340" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7-300x137.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7-768x350.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
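<p>In case the screenshot is hard to read, the shape of that code is roughly the following. This is a sketch against Pydantic v2; the exact names in the post&#8217;s repo may differ:</p>

```python
from pathlib import Path
from pydantic import BaseModel, Field

class ReadArgs(BaseModel):
    """Input schema for the `read` tool."""
    path: str = Field(description="Relative path of the file to read")

def read_file(args: ReadArgs) -> str:
    # By the time we hold a ReadArgs instance, validation has already run.
    return Path(args.path).read_text()

# Pydantic v2 generates the JSON schema we hand to the model:
schema = ReadArgs.model_json_schema()
```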



<p>The Pydantic model gives us two things: validation and a JSON schema. We can see what the schema looks like:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="332" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13.png" alt="What the schema looks like" class="wp-image-18341" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13-300x68.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13-768x175.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="574" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-8.jpeg" alt="JSON schema" class="wp-image-18342" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-8.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-8-300x118.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-8-768x303.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>We wrap this into a tool definition that Claude understands:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="452" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-9.jpeg" alt="Interpret for Claude" class="wp-image-18343" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-9.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-9-300x93.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-9-768x238.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
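In Anthropic's Messages API, a tool definition is a dict with a name, a description, and an <code>input_schema</code> (the JSON schema from the previous step). A hand-written equivalent, with wording of our own, looks like this:

```python
# Tool definition in the shape Anthropic's Messages API expects.
# This dict goes into the `tools=` list of client.messages.create().
read_tool = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file from the local filesystem "
        "and return its contents."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path of the file to read"}
        },
        "required": ["path"],
    },
}
```

As the article notes below, the description matters: the model matches user intent against it when deciding whether to call the tool.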



<p>Then we add tools to the API call, handle the tool request, execute it, and send the result back:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1328" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14.png" alt="Add tools, handle request, execute, send result" class="wp-image-18344" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14-300x274.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14-768x700.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
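In code, that round trip looks roughly like the sketch below. The helper names and model ID are ours, not the article's; the API call itself requires <code>pip install anthropic</code> and an <code>ANTHROPIC_API_KEY</code>, so it lives inside a function, while the tool executor and result-block builder run locally:

```python
def execute_read(path: str) -> str:
    # The actual tool: just read the file.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def tool_result_block(tool_use_id: str, output: str) -> dict:
    # Echo the tool_use id so Claude can match the result to its request.
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": output}


def run_query(query: str, tools: list) -> str:
    # One round trip: call Claude, execute any tool request, send the
    # result back, return the final text. Model ID is illustrative.
    import anthropic  # pip install anthropic; needs ANTHROPIC_API_KEY

    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": query}]
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024, tools=tools, messages=messages
    )
    if response.stop_reason == "tool_use":
        calls = [b for b in response.content if b.type == "tool_use"]
        messages.append({"role": "assistant", "content": response.content})
        # All tool results for a turn go back in a single user message.
        messages.append({"role": "user", "content": [
            tool_result_block(b.id, execute_read(**b.input)) for b in calls
        ]})
        response = client.messages.create(
            model="claude-sonnet-4-5", max_tokens=1024, tools=tools, messages=messages
        )
    return response.content[0].text


# Local smoke test of the executor (no API call):
with open("tmp_read_demo.txt", "w", encoding="utf-8") as f:
    f.write("hello agent")
demo = execute_read("tmp_read_demo.txt")
```

Note the shape of the second request: the assistant's tool-use turn goes back in verbatim, followed by a user turn containing the matching <code>tool_result</code> block.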



<p>Let’s see what happens when we run it:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="512" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10.jpeg" alt="Script when run" class="wp-image-18345" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-300x105.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-768x270.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>This script calls the Claude API with a user query passed via command line. It sends the query, gets a response, and prints it.</p>



<p>Note that the LLM matched on the tool description: Accurate, specific descriptions are key! It’s also worth mentioning that we’ve made two LLM calls here:</p>



<ul class="wp-block-list">
<li>One in which the tool is called</li>



<li>A second in which we send the result of the tool call back to the LLM to get the final result</li>
</ul>



<p>This often trips up people building agents for the first time, and Google has made a nice visualization of what we’re actually doing:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1191" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15.png" alt="" class="wp-image-18346" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15-300x245.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15-768x628.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">2a. Add more tools (write, edit, bash)</h3>



<p>We have a read tool, but <strong>a coding agent needs to do more than read</strong>. It needs to:</p>



<ul class="wp-block-list">
<li>Write new files</li>



<li>Edit existing ones</li>



<li>Execute code to test it</li>
</ul>



<p>That’s three more tools: <code>write</code>, <code>edit</code>, and <code>bash</code>.</p>



<p>Same pattern as read. First the schemas:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="724" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16.png" alt="First, the schemas" class="wp-image-18347" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16-300x149.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16-768x382.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>Then the executors:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1328" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17.png" alt="Then, the executors" class="wp-image-18348" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17-300x274.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17-768x700.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
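As a copyable stand-in for the screenshots, here is one plausible version of the three executors (function names, return messages, and the single-occurrence edit semantics are our choices, not necessarily the article's):

```python
import subprocess
from pathlib import Path


def write_file(path: str, content: str) -> str:
    # Create parent directories as needed, then write the file.
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(content, encoding="utf-8")
    return f"Wrote {len(content)} characters to {path}"


def edit_file(path: str, old: str, new: str) -> str:
    # Replace the first occurrence of `old` with `new`.
    text = Path(path).read_text(encoding="utf-8")
    if old not in text:
        return f"Error: text to replace not found in {path}"
    Path(path).write_text(text.replace(old, new, 1), encoding="utf-8")
    return f"Edited {path}"


def run_bash(command: str) -> str:
    # Danger: executes arbitrary shell commands.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return (result.stdout + result.stderr).strip()
```

Returning error strings (rather than raising) matters: the model reads the tool output, so "text to replace not found" is a signal it can act on, typically by reading the file first.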



<p>And the tool definitions, along with the code that runs whichever one Claude picks:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="935" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.png" alt="And the tool definitions" class="wp-image-18349" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-300x193.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-768x493.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>The bash tool is what makes this actually useful: Claude can now write code, run it, see errors, and fix them. But it’s also dangerous. This tool could delete your entire filesystem! Proceed with caution: Run it in a sandbox, a container, or a VM.</p>
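Short of a full container, you can at least bound the blast radius a little. This sketch (our own hardening, not the article's) pins the working directory and adds a timeout; it is emphatically <em>not</em> a real sandbox:

```python
import subprocess


def run_bash_guarded(command: str, workdir: str = ".", timeout: int = 10) -> str:
    # Not a real sandbox -- just a fixed working directory and a timeout.
    # For actual isolation, run the whole agent inside a container or VM.
    try:
        result = subprocess.run(
            command,
            shell=True,
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"Error: command timed out after {timeout}s"
    return (result.stdout + result.stderr).strip()
```

The timeout also protects the agentic loop itself: without it, one hung command (a server that never exits, say) stalls the whole agent.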



<p>Interestingly, bash is what turns a “coding agent” into a “general-purpose agent.” With shell access, it can do anything you can do from a terminal:</p>



<ul class="wp-block-list">
<li>Sort and organize your local filesystem</li>



<li>Clean up your desktop</li>



<li>Batch rename photos</li>



<li>Convert file formats</li>



<li>Manage Git repos across multiple projects</li>



<li>Install and configure software</li>
</ul>



<p>It was actually “<a href="https://lucumr.pocoo.org/2026/1/31/pi/" target="_blank" rel="noreferrer noopener">Pi: The Minimal Agent Within OpenClaw</a>” that inspired this example.</p>



<p>Try asking Claude to edit a file: It often wants to read it first to see what’s there. But our current code only handles one tool call. That’s where the agentic loop comes in.</p>



<h3 class="wp-block-heading">3. Build the agentic loop</h3>



<p>Right now Claude can only call one tool per request. But real tasks need multiple steps: read a file, edit it, run it, see the error, fix it. We need a loop that lets Claude keep calling tools until it’s done.</p>



<p>We wrap the tool handling in a <code>while True</code> loop:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1479" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11.jpeg" alt="Wrap in a while True loop" class="wp-image-18350" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11-295x300.jpeg 295w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11-768x780.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
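The shape of that loop can be shown (and exercised without an API key) by abstracting the model call behind a function. This is our own skeleton: <code>model_call(messages)</code> returns final text plus any requested tool calls, and the loop runs until the model stops asking for tools:

```python
def agent_loop(model_call, execute_tool, user_query, max_turns=20):
    # model_call(messages) -> (text, tool_calls), where tool_calls is a
    # list of (call_id, tool_name, args) tuples. Loop until no tools
    # are requested (or we hit a turn cap).
    messages = [{"role": "user", "content": user_query}]
    text = ""
    for _ in range(max_turns):
        text, tool_calls = model_call(messages)
        if not tool_calls:
            return text, messages
        for call_id, name, args in tool_calls:
            output = execute_tool(name, args)
            # In a real harness you'd append the model's actual tool_use
            # blocks here; a placeholder keeps the sketch self-contained.
            messages.append({"role": "assistant", "content": f"[tool call: {name}({args})]"})
            messages.append({"role": "user", "content": [
                {"type": "tool_result", "tool_use_id": call_id, "content": output}
            ]})
    return text, messages


# Demo with a stubbed model, so the loop runs without an API key:
def fake_model(messages):
    if len(messages) == 1:  # first turn: request a tool
        return "", [("t1", "read_file", {"path": "notes.txt"})]
    return "All done.", []  # second turn: final answer


def fake_exec(name, args):
    return f"(contents of {args['path']})"


final, history = agent_loop(fake_model, fake_exec, "Summarize notes.txt")
```

The <code>max_turns</code> cap is cheap insurance: a bare <code>while True</code> plus a confused model is an infinite token burn.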



<p>Note that we send the entire accumulated message history on every loop iteration. When building this out further, you’ll want to engineer and manage your context more deliberately. (See below for more on this.)</p>
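As a taste of what context management can look like, here is a deliberately naive trimming sketch (our own, not from the article). Real harnesses summarize or compact old turns instead of dropping them:

```python
def trim_history(messages, max_messages=40):
    # Naive context management: keep the first message (the task) plus
    # the most recent exchanges, dropping the middle.
    if len(messages) <= max_messages:
        return messages
    return [messages[0]] + messages[-(max_messages - 1):]
```

Even this crude version illustrates the core tension: whatever you drop, the agent no longer "remembers," so what you keep is a design decision, not a detail.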



<p>Let’s try a multistep task:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="543" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.png" alt="Multistep task" class="wp-image-18351" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-300x112.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-768x286.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">4. Build the conversational loop</h3>



<p>Right now the agent handles one query and exits. But we want a back-and-forth conversation: Ask a question, get an answer, ask a follow-up. We need an outer loop that keeps asking for input.</p>



<p>We wrap everything in a <code>while True</code>:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="814" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12.jpeg" alt="We wrap everything in a while True" class="wp-image-18352" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12-300x168.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12-768x429.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
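A minimal version of that outer loop, with input and output injectable so it can run non-interactively (structure and names are ours):

```python
def chat(model_call, get_input=input, send_output=print):
    # Outer conversational loop. `model_call` stands in for the inner
    # agentic loop from the previous step; `messages` persists across
    # turns, which is what gives the agent memory of the conversation.
    messages = []
    while True:
        try:
            query = get_input("You: ")
        except (EOFError, StopIteration):
            break
        if query.strip().lower() in {"quit", "exit"}:
            break
        messages.append({"role": "user", "content": query})
        reply = model_call(messages)
        messages.append({"role": "assistant", "content": reply})
        send_output(f"Agent: {reply}")
    return messages


# Demo with scripted I/O so it runs without a terminal:
script = iter(["hello", "quit"])
log = []
history = chat(
    lambda msgs: f"echo: {msgs[-1]['content']}",
    get_input=lambda _prompt: next(script),
    send_output=log.append,
)
```

Injecting <code>get_input</code>/<code>send_output</code> is also what makes the harness testable, and is the seam where a chat interface (Telegram, Slack) would plug in later.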



<p>The messages list persists across turns, so Claude remembers context. That’s the complete coding agent.</p>



<p>Once again we’re merely appending all previous messages, which means the context will grow quite quickly!</p>



<h3 class="wp-block-heading"><strong>A note on agent harnesses</strong></h3>



<p>An agent harness is the scaffolding and infrastructure that wraps around an LLM to turn it into an agent. It handles:</p>



<ul class="wp-block-list">
<li><strong>The loop:</strong> prompting the model, parsing its output, executing tools, feeding results back</li>



<li><strong>Tool execution:</strong> actually running the code/commands the model asks for</li>



<li><strong>Context management:</strong> what goes in the prompt, token limits, history</li>



<li><strong>Safety/guardrails:</strong> confirmation prompts, sandboxing, disallowed actions</li>



<li><strong>State: </strong>keeping track of the conversation, files touched, etc.</li>
</ul>



<p>And more.</p>



<p>Think of it like this: <em>The LLM is the brain; the harness is everything else that lets it actually do things.</em></p>



<p>What we’ve built above is the <strong>hello world of agent harnesses</strong>. It covers <strong>the loop</strong>, <strong>tool execution</strong>, and <strong>basic context management</strong>. What it doesn’t have: safety guardrails, token limits, persistence, or even a system prompt!</p>



<p>When building out from this basis, I encourage you to follow the paths of:</p>



<ul class="wp-block-list">
<li><a href="https://github.com/badlogic/pi-mono" target="_blank" rel="noreferrer noopener"><strong>The Pi coding agent</strong></a>, which adds context loading (<code>AGENTS.md</code> files from multiple directories), persistent sessions you can resume and branch, and an extensibility system (skills, extensions, prompts)</li>



<li><a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener"><strong>OpenClaw</strong></a>, which goes further: a persistent daemon (always-on, not invoked), chat as the interface (Telegram, WhatsApp, etc.), file-based continuity (<code>SOUL.md</code>, <code>MEMORY.md</code>, daily logs), proactive behavior (heartbeats, cron), preintegrated tools (browser, subagents, device control), and the ability to message you without being prompted</li>
</ul>



<h2 class="wp-block-heading">Part 2: The search agent</h2>



<p>In order to really show you that the agentic loop is what powers any agent, we’ll now build a search agent (inspired by a <a href="https://hugobowne.substack.com/p/episode-68-a-builders-guide-to-agentic" target="_blank" rel="noreferrer noopener">podcast I did with search legends John Berryman and Doug Turnbull</a>). We’ll use Gemini for the LLM and Exa for web search. <a href="https://github.com/hugobowne/building-with-ai/tree/main/general-purpose-agent/search-agent" target="_blank" rel="noreferrer noopener">You can find the code here</a>.</p>



<p><em>But first, the astute reader may have an interesting question</em>: If a coding agent really is a general-purpose agent, why would anyone want to build a search agent when we could just get a coding agent to extend itself and turn itself into a search agent? Well, because if you want to build a search agent for a business, you’re not going to do it by building a coding agent first… So let’s build it!</p>



<h3 class="wp-block-heading">Setup</h3>



<p>As before, we’ll build this step-by-step. Start by creating our project:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="421" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13.jpeg" alt="Start by creating our project" class="wp-image-18353" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13-300x87.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13-768x222.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>Set <code>GEMINI_API_KEY</code> (from Google AI Studio) and <code>EXA_API_KEY</code> (from exa.ai) as environment variables.</p>



<p>We’ll build our agent in four steps (the same four steps as always):</p>



<ol class="wp-block-list">
<li>Hook up our LLM</li>



<li>Add a tool (web_search)</li>



<li>Build the agentic loop</li>



<li>Build the conversational loop</li>
</ol>



<h3 class="wp-block-heading">1. Hook up our LLM</h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="724" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.png" alt="Hook up our LLM, again" class="wp-image-18354" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-300x149.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-768x382.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="392" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14.jpeg" alt="Who is Doug Turnbull?" class="wp-image-18355" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14-300x81.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14-768x207.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">2. Add a tool (<code>web_search</code>)</h3>



<p>Gemini can answer from its training data, but we don’t want that, man! For current information, it needs to search the web. We’ll give it a <code>web_search</code> tool that calls Exa.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="936" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15.jpeg" alt="web_search tool" class="wp-image-18356" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15-300x193.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15-768x494.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
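A copyable sketch of that tool: the declaration dict follows Gemini's function-calling format, while the executor's call shape is our assumption about the <code>exa_py</code> SDK (it requires <code>pip install exa-py</code> and an <code>EXA_API_KEY</code>, so it's kept inside a function):

```python
# Function declaration in the dict shape Gemini's function calling accepts.
web_search_decl = {
    "name": "web_search",
    "description": "Search the web for current information and return result snippets.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"},
            "num_results": {"type": "integer", "description": "How many results to return"},
        },
        "required": ["query"],
    },
}


def format_results(results) -> str:
    # Flatten search hits into plain text the model can read.
    lines = []
    for r in results:
        lines.append(f"{r['title']}\n{r['url']}\n{r['text'][:500]}")
    return "\n\n".join(lines)


def web_search(query: str, num_results: int = 5) -> str:
    # Assumed exa_py usage: pip install exa-py; needs EXA_API_KEY.
    import os
    from exa_py import Exa

    exa = Exa(os.environ["EXA_API_KEY"])
    response = exa.search_and_contents(query, num_results=num_results, text=True)
    return format_results(
        [{"title": r.title, "url": r.url, "text": r.text or ""} for r in response.results]
    )


# Local demo of the formatter (no API call):
demo = format_results(
    [{"title": "Doug Turnbull", "url": "https://example.com", "text": "Search relevance engineer."}]
)
```

Truncating each result (here to 500 characters) is a small but real context-management decision: raw page text would swamp the prompt after a couple of searches.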



<p>The system instruction grounds the model, (ideally) forcing it to search instead of guessing. <a href="https://ai.google.dev/gemini-api/docs/function-calling?example=meeting#function_calling_modes" target="_blank" rel="noreferrer noopener">Note that you can configure Gemini to always use</a> <code>web_search</code>, which makes the tool call fully dependable, but I wanted to show the pattern that you can use with any LLM API.</p>



<p>We then send the tool call result back to Gemini:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1086" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16.jpeg" alt="Tool call result back to Gemini" class="wp-image-18357" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16-300x224.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16-768x573.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">3. Build the agentic loop</h3>



<p>Some questions need multiple searches. “Compare X and Y” requires searching for X, then searching for Y. We need a loop that lets Gemini keep searching until it has enough information.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1390" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-21.png" alt="Build the agentic loop" class="wp-image-18358" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-21.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-21-300x286.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-21-768x733.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="483" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17.jpeg" alt="Build the agentic loop 2" class="wp-image-18359" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17-300x100.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17-768x255.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">4. Build the conversational loop</h3>



<p>Same as before: We want back-and-forth conversation, not one query and exit. Wrap everything in an outer loop:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1208" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-22.png" alt="Build the conversational loop" class="wp-image-18360" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-22.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-22-300x249.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-22-768x637.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>Messages persist across turns, so follow-up questions have context.</p>



<h2 class="wp-block-heading">Extend it</h2>



<p>The pattern is the same for both agents. Add any tool:</p>



<ul class="wp-block-list">
<li><code>web_search</code> to the coding agent: Look things up while coding</li>



<li><code>bash</code> to the search agent: Act on what it finds</li>



<li><code>browser</code>: Navigate websites</li>



<li><code>send_email</code>: Communicate</li>



<li><code>database_query</code>: Run SQL</li>
</ul>



<p><em>One thing we’ll be doing is showing how general purpose a coding agent really can be</em>. As Armin Ronacher wrote in “<a href="https://lucumr.pocoo.org/2026/1/31/pi/" target="_blank" rel="noreferrer noopener">Pi: The Minimal Agent Within OpenClaw</a>”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Pi’s entire idea is that <strong>if you want the agent to do something that it doesn’t do yet</strong>, you <strong>don’t go and download an extension or a skill</strong> or something like this. <strong>You ask the agent to extend itself</strong>. It celebrates the idea of code writing and running code.</p>
</blockquote>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Building agents is straightforward. The magic isn’t complex algorithms; it’s the conversation loop and well-designed tools.</p>



<p>Both agents follow the same pattern:</p>



<ol class="wp-block-list">
<li>Hook up the LLM</li>



<li>Add a tool (or multiple tools)</li>



<li>Build the agentic loop</li>



<li>Build the conversational loop</li>
</ol>



<p>The only difference is the tools.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Thank you to Ivan Leo, Eleanor Berger, Mike Powers, Thomas Wiecki, and Mike Loukides for providing feedback on drafts of this post.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/how-to-build-a-general-purpose-ai-agent-in-131-lines-of-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
				<enclosure url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Transforming-Coding-Agents-into-General-Purpose-Computer-Assistants.mp4" length="39099852" type="video/mp4" />
			</item>
		<item>
		<title>The Mythical Agent-Month</title>
		<link>https://www.oreilly.com/radar/the-mythical-agent-month/</link>
				<comments>https://www.oreilly.com/radar/the-mythical-agent-month/#respond</comments>
				<pubDate>Mon, 23 Mar 2026 11:15:59 +0000</pubDate>
					<dc:creator><![CDATA[Wes McKinney]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18321</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-mythical-robot-agent-month.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-mythical-robot-agent-month-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Wes McKinney’s blog and is being republished here with the author’s permission. Like a lot of people, I’ve found that AI is terrible for my sleep schedule. In the past I’d wake up briefly at 4:00 or 4:30 in the morning to have a sip of water or use [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://wesmckinney.com/blog/mythical-agent-month/" target="_blank" rel="noreferrer noopener"><em>Wes McKinney’s blog</em></a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>Like <a href="https://lucumr.pocoo.org/2026/1/18/agent-psychosis/" target="_blank" rel="noreferrer noopener">a lot of people</a>, I’ve found that AI is terrible for my sleep schedule. In the past I’d wake up briefly at 4:00 or 4:30 in the morning to have a sip of water or use the bathroom; now I have trouble going back to sleep. I could be doing things. Before I would get a solid 7–8 hours a night; now I’m lucky when I get 6. I’ve largely stopped fighting it: Now when I’m rolling around restlessly in bed at 5:07am with ideas to feed my AI coding agents, I just get up and start my day.</p>



<p>Among my inner circle of engineering and data science friends, there is a lot of discussion about how long our competitive edge as humans will last. Will having good ideas (and lots of them) still matter as the agents begin having better ideas themselves? The human-expert-in-the-loop feels essential now to get good results from the agents, but how long will that last until our wildest ideas can be turned into working, tasteful software while we sleep? Will it be a <a href="https://benn.substack.com/p/the-gentle-obsolescence" target="_blank" rel="noreferrer noopener">gentle obsolescence</a> where we happily hand off the reins or something else?</p>



<p>For now, I feel needed. I don’t describe the way I work now as “vibe coding” as this sounds like a pejorative “prompt and chill” way of building AI slop software projects. I’ve been building tools like <a href="https://roborev.io/" target="_blank" rel="noreferrer noopener">roborev</a> to bring rigor and continuous supervision to my parallel agent sessions, and to heavily scrutinize the work that my agents are doing. With this radical new way of working it is hard not to be contemplative about the future of software engineering.</p>



<p>Probably the book I’ve referenced the most in my career is <a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month" target="_blank" rel="noreferrer noopener"><em>The Mythical Man-Month</em></a> by Fred Brooks, whose now-famous Brooks’s law argues that “adding manpower to a late software project makes it later.” Lately I find myself asking whether the lessons from this book are applicable in this new era of agentic development. Will a <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">talented developer orchestrating a swarm of AI agents</a> be able to build complex software faster and better, and will the short-term productivity gains lead to long-term project success? Or will we run into the same bottlenecks—scope creep, architectural drift, and coordination overhead—that have plagued software teams for decades?</p>



<h2 class="wp-block-heading">Revisiting <em>The Mythical Man-Month</em> (<em>TMMM</em>)</h2>



<p>One of Brooks’s central arguments is that small teams of elite people outperform large teams of average ones, with one “chief surgeon” supported by specialists. This leads to a high degree of <em>conceptual integrity</em> about the system design, as if “one mind designed it, even if many people built it.”</p>



<p>Agentic engineering appears to amplify these problems, since the quality of the software being built is now only as good as the humans in the loop curating and refining specs, saying yes or no to features, and taming unnecessary code and architectural complexity. One of the metaphors in <em>TMMM</em> is the “tar pit”: “Everyone can see the beasts struggling in it, and it looks like any one of them could easily free itself, but the tar holds them all together.” Now, we have a new “agentic tar pit” where our parallel Claude Code sessions and <code>git worktrees</code> are engaged in combat with the code bloat and incidental complexity generated by their virtual colleagues. You can <a href="https://www.roborev.io/guides/assisted-refactoring/" target="_blank" rel="noreferrer noopener">systematically refactor</a>, but invariably an agentic codebase will end up larger and more overwrought than anything built by human hand. This is technical debt on an unprecedented scale, accrued at machine speed.</p>



<p>In <em>TMMM</em>, Brooks observed that a working program is maybe 1/9th the way to a <em>programming product</em>, one that has the necessary testing, documentation, and hardening against edge cases and is maintainable by someone other than its author. Agents are now making the “working program” (or “appears-to-work” program, more accurately) a great deal more accessible, though many newly minted AI vibe coders clearly underestimate the work involved with going from prototype to production.</p>



<p>These problems compound when considering the closely related <a href="https://martinfowler.com/bliki/ConwaysLaw.html" target="_blank" rel="noreferrer noopener">Conway’s law</a>, which asserts that the architecture of software systems tends to mirror the team and communication structure of the organization that builds them. What does that look like when applied to a virtual “team” of agents with no persistent memory and no shared understanding of the system they are building?</p>



<p>Another “big idea” from <em>TMMM</em> that has stuck with people is the n(n-1)/2 coordination problem as teams scale. With agentic engineering, there are fewer humans involved, so the coordination problem doesn’t disappear but rather changes shape. Different agent sessions may produce contradictory plans that humans have to reconcile. I’ll leave this agent orchestration question for another post.</p>



<h2 class="wp-block-heading">No silver bullet</h2>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>“There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity.”<br>—“<a href="https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf" target="_blank" rel="noreferrer noopener">No Silver Bullet</a>” (1986)</p>
</blockquote>



<p>Brooks wrote a follow-up essay to <em>TMMM</em> to look at software design through the lens of <em>essential</em> complexity and <em>accidental</em> complexity. Essential complexity is fundamental to achieving your goal: If you made the system any simpler, it would fall short of its problem statement. Accidental complexity is everything else imposed by our tools and processes: programming languages, tools, and the layer of design and documentation to make the system understandable by engineers.</p>



<p>Coding agents are probably the most powerful tool ever created to tackle accidental complexity. To think: I basically do not write code anymore, and now write tons of code in a language (Go) <a href="https://wesmckinney.com/blog/agent-ergonomics/" target="_blank" rel="noreferrer noopener">I have never written by hand</a>. There is a lot of discussion about whether IDEs are still going to be relevant in a year or two, when maybe all we need is a <a href="https://www.gnu.org/software/emacs/" target="_blank" rel="noreferrer noopener">text editor to review diffs</a>. The productivity gains are enormous, and I say this as someone burning north of 10 billion tokens a month across Claude, Codex, and Gemini.</p>



<p>But Brooks’s “No Silver Bullet” argument predicts exactly the problem I’m experiencing in my agentic engineering: The accidental complexity is no problem at all anymore, but what’s left is the essential complexity which was always the hard part. Agents can’t reliably tell the difference. LLMs are extraordinary pattern matchers trained on the entirety of humanity’s open source software, so while they are brilliant at dealing with accidental complexity (refactor this code, write these tests, clean up this mess), they struggle with the more subtle essential design problems, which often have no precedent to pattern match against. They also often tend to introduce unnecessary complexity, generating large amounts of defensive boilerplate that is rarely needed in real-world use.</p>



<p>Put another way, agents are so good at attacking accidental complexity that they <em>generate new accidental complexity</em> that can get in the way of the essential structure that you are trying to build. With a couple of my new projects, <a href="https://roborev.io/" target="_blank" rel="noreferrer noopener">roborev</a> and <a href="https://www.msgvault.io/" target="_blank" rel="noreferrer noopener">msgvault</a>, I am already dealing with this problem as I begin to reach the 100 KLOC mark and watch the agents begin to chase their own tails and contextually choke on the bloated codebases they have generated. At some point beyond that (the next 100 KLOC, or 200 KLOC) things start to fall apart: Every new change has to hack through the code jungle created by prior agents. Call it a “brownfield barrier.” At Posit we have seen agents struggle much more in 1 million-plus-line codebases such as Positron, a VS Code fork. This seems to support Brooks’s complexity scaling argument.</p>



<p>I would hesitate to place a bet on whether the present is a ceiling or a plateau. The models are clearly getting better fast, and the problems I’m describing here may look charmingly quaint in two years. But Brooks’s essential/accidental distinction gives me some confidence that this isn’t just about the current limitations of the technology. Figuring out what to build was the hard part long before we had LLMs, and I don’t see how a flawless coding agent changes that.</p>



<h2 class="wp-block-heading">Agentic scope creep</h2>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>When generating code is free, knowing when to say “no” is your last defense.</p>
</blockquote>



<p>With the cost of generating code now converging to zero, there is practically nothing stopping agents and their human taskmasters from pursuing all avenues that would previously have been cost or time prohibitive. The temptation to spend your day prompting “and now can you just…?” is overwhelming. But any new generated feature or subsystem, while cheap to create, is not costless to maintain, test, debug, and reason about in the future. What seems free now carries a contextual burden for future agent sessions, and each new bell or whistle becomes a new vector of brittleness or bugs that can harm users.</p>



<p>From this perspective, building great software projects maybe never was about how fast you can type the code. We can “type” 10x, maybe 100x faster with agents than we could before. But we still have to make good design decisions, say no to most product ideas, maintain conceptual integrity, and know when something is “done.” Agents are accelerating the “easy part” while paradoxically making the “hard part” potentially even more difficult.</p>



<p>Agentic scope creep also seems to be <a href="https://github.com/mitchellh/vouch" target="_blank" rel="noreferrer noopener">actively destroying the open source software world</a>. Now that the bar is lower than ever for contributors to jump in and offer help, projects are drowning in torrents of 3,000-line “helpful” PRs that add new features. As developers become increasingly hands-off and disengaged from the design and planning process, the agents’ runaway scope creep can get out of control quickly. When the person submitting a pull request didn’t write or fully read the code in it, there’s likely no one involved who’s truly accountable for the design decisions.</p>



<p>I have seen in my own work on roborev and msgvault that agents will propose overwrought solutions to problems when a simple solution would do just fine. It takes judgment to know when to intervene and how to keep the agent in check.</p>



<h2 class="wp-block-heading">Design and taste as our last foothold</h2>



<p>Brooks’s argument is that design talent and good taste are the scarcest resources, and with agents doing all of the coding labor, I argue that these skills matter more now than ever. The bottleneck was never hands on keyboards. Now, with the new “Mythical Agent-Month,” we can reasonably conclude that design, product scoping, and taste remain the practical constraints on delivering high-quality software. The developers who thrive in this new agentic era won’t be the ones who run the most parallel sessions or burn the most tokens. They’ll be the ones who are able to hold their projects’ conceptual models in their mind, who are shrewd about what to build and what to leave out, and who exercise taste over the enormous volume of output.</p>



<p><em>The Mythical Man-Month</em> was published in 1975, more than 50 years ago. In that time, a lot has happened: tremendous progress in hardware performance, programming languages, development environments, cloud computing, and now large language models. The tools have changed, but the constraints are still the same.</p>



<p>Maybe I’m trying to justify my own continued relevance, but the reality is more complex than that. Not all software is created equal: CRUD business productivity apps aren’t the same as databases and other critical systems software. I think the median software consulting shop is completely toast. But my thesis is more about development work in the 1% tail of the distribution: problems inaccessible to most engineers. This will continue to require expert humans in the loop, even if they aren’t doing much or any manual coding. As one recent adjacent example, my friend <a href="https://lupsasca.com/" target="_blank" rel="noreferrer noopener">Alex Lupsasca</a> at OpenAI and his world-class physicist collaborators <a href="https://openai.com/index/new-result-theoretical-physics/" target="_blank" rel="noreferrer noopener">were able to create a formulation of a hard physics problem and arrive at a solution with AI’s help</a>. Without such experts in the loop, it’s much more dubious whether LLMs would be able to both pose the questions and come up with the solutions.</p>



<p>I’ll still be getting out of bed at 5am to feed and tame my agents for the foreseeable future. The coding is easier now, and honestly more fun, and I can spend my time thinking about what to build rather than wrestling with the tools and systems around the engineering process.</p>



<p><em>Thanks to Martin Blais, Josh Bloom, Phillip Cloud, Jacques Nadeau, and Dan Shapiro for giving feedback on drafts of this post.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-mythical-agent-month/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Missing Mechanisms of the Agentic Economy</title>
		<link>https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/</link>
				<comments>https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/#respond</comments>
				<pubDate>Mon, 23 Mar 2026 09:50:48 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18318</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-mechanisms-of-the-agentic-economy.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-mechanisms-of-the-agentic-economy-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[From disclosures to protocols to markets]]></custom:subtitle>
		
				<description><![CDATA[For the past two years, I’ve been working with economist Ilan Strauss at the AI Disclosures Project. We started out by asking what regulators would need to know to ensure the safety of AI products that touch hundreds of millions of people. We are now exploring the missing mechanisms that are needed to enable the [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p><em>For the past two years, I’ve been working with economist Ilan Strauss at the </em><a href="https://ai-disclosures.org/" target="_blank" rel="noreferrer noopener"><em>AI Disclosures Project</em></a><em>. We started out by asking what regulators would need to know to ensure the safety of AI products that touch hundreds of millions of people. We are now exploring the missing mechanisms that are needed to enable the agentic economy.</em></p>



<p><em>This essay traces our path from disclosures through protocols to markets and mechanism design. Rather than simply stating our conclusions, I’m sharing our thought process and some of the conversations and historical examples that have shaped it.</em></p>



<p><em>We will be holding a number of focused convenings to explore these ideas over the next couple of months, and my hope is that shared context will enable more productive engagement with what is very much a work in progress.</em></p>



<h2 class="wp-block-heading"><strong>The disclosure problem</strong></h2>



<p>Ilan Strauss and I started the AI Disclosures Project in early 2024 with a conviction that most regulators had little idea how AI worked or where it was going. The field was so young that many of the early regulatory proposals were misguided. We thought that regulators and industry should start by agreeing on standards for disclosure, so that we could all learn together as the technology develops. You can’t regulate what you don’t understand.</p>



<p>One of our first insights was that focusing solely on model safety was a mistake, much as if regulators inspected automobiles at the factory but completely ignored their use on the roads. We believed (and still do) that the focus should be on AI <em>as deployed</em>. And we believe that disclosures shouldn’t focus just on capabilities but on business models and the operating metrics that AI companies use to shape how their products operate.</p>



<p>Ilan and I had worked together previously with Mariana Mazzucato at University College London on what we called “<a href="https://www.cambridge.org/core/journals/data-and-policy/article/algorithmic-attention-rents-a-theory-of-digital-platform-market-power/D85FE41F6CF99FC57DDFB2B2B63491C5" target="_blank" rel="noreferrer noopener">algorithmic attention rents</a>,” studying how platforms like Amazon and Google control user attention to extract economic rents from their suppliers. We observed that organic search at Google and Amazon was a huge advance in market coordination, using hundreds of signals to find the best match for a user’s intent. In effect, both companies had built a better “invisible hand.” And yet after decades of success, <a href="https://www.oreilly.com/radar/rising-tide-rents-and-robber-baron-rents/" target="_blank" rel="noreferrer noopener">they turned away from that advance</a>. To use Cory Doctorow’s coinage, they began “<a href="https://www.versobooks.com/products/3341-enshittification?srsltid=AfmBOooTtLlbEhhK-ia-eHz8YiuKQ610OYjsDzsl1fGjHdPTQk1SVdk_" target="_blank" rel="noreferrer noopener">enshittifying</a>” their services by substituting inferior paid results for the top organic search results in order to pad their bottom line.</p>



<p>We’d also watched social media start out with the promise of keeping you in touch with your friends and fostering productive conversations, but then begin to optimize for engagement at the expense of everything else. By the time anyone understood what was happening, the damage had been done. We can see the inflection point in their financial metrics, but neither regulators nor the public can see the changes in operating metrics that drove the financials. What if we could capture what good looks like before it gets enshittified, and identify how that changes over time?</p>



<p><a href="https://www.ucl.ac.uk/bartlett/sites/bartlett/files/oreilly_strauss_mazzucato_2023.regulating_big_tech_through_digital_disclosures.pdf" target="_blank" rel="noreferrer noopener">We also observed</a> that modern technology companies are completely different from industrial era corporations, where you can understand key elements of the business by tracing the inputs and the outputs through the financial statements. Instead, the business is largely driven by intangibles, which are lumped into one impenetrable black box.</p>



<p>We wanted to learn from that mistake. While the horse was already out of the barn on search and social media, we hoped to get disclosure of operating metrics into AI governance while there was still an appetite for regulation. Unfortunately, that window was very short. The failure turned out to be productive, though, because it forced us to think harder about regulation more broadly and what other leverage points might be found.</p>



<h2 class="wp-block-heading"><strong>Protocols as functional disclosures</strong></h2>



<p>The first turn in our thinking came when we realized that disclosures aren’t just informational. The most important disclosures are <em>functional</em>. We came to see the parallels between disclosures and communications protocols, the agreed-on methods by which networked systems share information. For example, the HTTP protocol that underlies the World Wide Web specifies how a web browser and web server communicate in order to display a web page.</p>



<p>This is a structured communication with rules that must be followed and data that must be exchanged in a particular order. An HTTP request that identifies the user agent as a command line program such as curl rather than a graphical browser such as Chrome triggers a different response from the server. The user-agent string isn’t a report filed with a regulator. It’s an operational signal embedded in the protocol, and it carries a lot of information.</p>
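<p>As a sketch of what such a functional disclosure looks like in practice, here is a hypothetical handler (the function name and the format choices are illustrative, not any real server’s logic) that branches on the User-Agent string, the way many servers send plain text to curl but full HTML to a graphical browser:</p>

```python
# Illustrative only: a made-up server-side decision that reads the
# User-Agent header as an operational signal and adapts the response.

def negotiate_response(headers: dict) -> str:
    """Pick a response content type from the User-Agent disclosure."""
    agent = headers.get("User-Agent", "").lower()
    if agent.startswith("curl"):
        return "text/plain"        # command-line client: plain text
    if "mozilla" in agent:
        return "text/html"         # graphical browser: a full page
    return "application/json"      # unknown agents get structured data

print(negotiate_response({"User-Agent": "curl/8.4.0"}))            # text/plain
print(negotiate_response({"User-Agent": "Mozilla/5.0 (X11)"}))     # text/html
```

<p>No regulator ever sees this exchange, yet the header changes what actually happens on the wire: the disclosure is the mechanism.</p>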



<p>Once you see protocols as a system of functional disclosures, you start noticing that every regulatory system has a kind of communications and control protocol&nbsp;at its heart. Generally Accepted Accounting Principles (GAAP) or IFRS, the European equivalent, are protocols for communication between companies and their accountants, auditors, banks, investors, and tax authorities. Even road markings and road signs are a communications protocol, giving information to drivers about local conditions, laws, and the proper use of the road. These are slow, analog protocols, but they are protocols nonetheless.</p>



<p>Protocols can be inspected. Observability is the key to governance. Police observe speeders on the road; credit card processors and banks watch for credit card fraud on their payment networks; email processors filter spam as it passes through nodes on the network. The <a href="https://learning.oreilly.com/library/view/observability-engineering/9781492076438/" target="_blank" rel="noreferrer noopener">observability</a> points for AI are still emerging, but that’s where regulators should be focused.</p>



<p>Even beyond being a locus for observability and regulability, protocols themselves do an enormous amount of the governing work in modern technology systems. Spanning everything from how packets get from one place to another to what gets displayed, who has permission to see it, and sometimes even what it costs, they ultimately determine who can interoperate with whom. That led us to an even bigger realization.</p>



<h2 class="wp-block-heading"><strong>Protocols shape markets</strong></h2>



<p>Think about the early shape of the AI chatbot market. It was a winner-takes-all race to be the dominant platform for AI in the way Windows became the platform for PCs, or iOS and Android for phones. Whoever wins controls the market. Then Anthropic introduced <a href="https://www.anthropic.com/news/model-context-protocol" target="_blank" rel="noreferrer noopener">MCP</a>, the Model Context Protocol. All of a sudden, the landscape looked more like a web. There could be many winners. It didn’t matter what model you were running or whose APIs you were calling as long as you followed the protocol. And as the agentic AI market unfolded, the protocol wasn’t just MCP. An AI agent could be a user of the existing internet protocol stacks. Whether MCP itself survives or is superseded by other protocols, the shape of the market was transformed.</p>



<p>This insight reframed our whole project. Protocols are not just technical infrastructure. They are market-shaping mechanisms.</p>



<h2 class="wp-block-heading"><strong>Workflows are also protocols</strong></h2>



<p>I talked last week with some of the folks working on the Long Now Foundation’s partnership with Ethereum&#8217;s <a href="https://summerofprotocols.com/" target="_blank" rel="noreferrer noopener">Summer of Protocols</a> project, and that widened my lens even further.</p>



<p>When software people hear “protocol,” we think of communication protocols: TCP/IP, HTTP, MCP, or, say, Stripe’s Machine Payment Protocol (MPP).</p>



<p>To the Long Now folks, a protocol is any standardized way of doing something. Wildfire management teams follow protocols. So do flood response teams, hospital emergency rooms, and air traffic controllers. Atul Gawande’s book <a href="https://atulgawande.com/book/the-checklist-manifesto/" target="_blank" rel="noreferrer noopener"><em>The Checklist Manifesto</em></a> was an attempt to establish a common protocol for surgical operating theaters. This is a very different definition of protocol, and yet putting the two meanings of the word into the same frame makes a new kind of sense.</p>



<p>In his introduction to the Summer of Protocols’ <a href="https://summerofprotocols.com/protocol-reader" target="_blank" rel="noreferrer noopener"><em>Protocol Reader</em></a>, Venkatesh Rao cited Ethereum researcher Danny Ryan&#8217;s definition of a protocol as a &#8220;stratum of codified behavior&#8221; enabling coordination. He pointed out that protocols tend to become invisible once adopted. Rao calls this a &#8220;Whitehead advance,&#8221; after the philosopher Alfred North Whitehead’s observation that civilization advances by extending what we can do without thinking.</p>



<p>But he also made the thought-provoking point that a protocol is an “engineered argument,” in contrast with an API, which he says is an “engineered agreement” enforced by one dominant actor. There&#8217;s more to it than just the power asymmetry of enforced agreement, though. In a followup conversation, Venkatesh Rao noted that protocols are &#8220;not just codified modes of information exchange, but modes of live, structured, argumentation, often with an active computational element. For example, CSMA/CD (Ethernet) must detect packet collisions and compute and execute a random delay for retransmittal of packets. This is not mere structured communication. This is argumentation with what philosophers call dynamic semantics.”</p>



<p>Rao continued: “The moment you go beyond computing protocols, real-world feedback loops from material consequences become really important. For example, container-shipping is quite close architecturally to TCP/IP (the big difference being that packets can be dropped and retransmitted while lost containers are actually lost), but because it has a materially embodied feedback loop, regulatory mechanisms start to behave more like control systems than communication systems.”</p>
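<p>Rao’s CSMA/CD example rewards being made concrete. Here is a minimal sketch, under stated assumptions, of Ethernet-style binary exponential backoff (the 51.2&#181;s slot time and the 1,024-slot cap are the classic 10 Mbit/s parameters; the function name is mine): after each collision, a station waits a random number of slot times drawn from a window that doubles with every collision.</p>

```python
import random

# Sketch of Ethernet-style binary exponential backoff.
# Assumption: after the n-th collision, the station waits a random
# number of slot times chosen uniformly from [0, 2**min(n, 10) - 1].

SLOT_TIME_US = 51.2  # classic 10 Mbit/s Ethernet slot time, in microseconds

def backoff_delay(collisions: int, rng: random.Random) -> float:
    """Delay (microseconds) before retransmitting after `collisions` collisions."""
    max_slots = 2 ** min(collisions, 10)   # window doubles, capped at 1,024 slots
    return rng.randrange(max_slots) * SLOT_TIME_US

rng = random.Random(0)
for n in range(1, 4):
    print(f"after collision {n}: wait {backoff_delay(n, rng):.1f} microseconds")
```

<p>The randomness is the “argument”: two colliding stations each compute a different delay, and the disagreement is resolved by the protocol itself rather than by any central arbiter.</p>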



<p>I love the idea of protocols as an engineered argument. The dynamism this suggests is going to be ever more true in a future of agentic protocols. But this notion also triggered another thought, which is that&nbsp;<em>markets are also engineered arguments</em>. My bridge to this reformulation was&nbsp;the difference between&nbsp;<em>de jure</em>&nbsp;protocols that arise from a formal standards process, and&nbsp;<em>de facto&nbsp;</em>protocols that arise through market contention.</p>



<p>In the early days of the internet, the Internet Engineering Task Force (IETF) was all about engineered arguments. People had ideas about how the internet ought to work, and to prove their point they had to show up with interoperable implementations. No one had the ability to enforce anything. Agreement had to evolve. As Dave Clark famously put it, “<a href="https://ieeexplore.ieee.org/document/1677461" target="_blank" rel="noreferrer noopener">We reject: kings, presidents, and voting. We believe in: rough consensus and running code</a>.” The <em>de facto</em> protocols of the internet that emerged from the IETF ended up significantly outperforming the competing <em>de jure</em> networking protocols that emerged from telecommunications standards bodies. The IETF framed the argument; whoever showed up made their case and won or lost by way of adoption.</p>



<p>It also made me remember another decades-old story that I had lived through. Microsoft and Netscape were duking it out in the web server market and were building their own “engineered agreements” for what was up the stack from the base web server functionality. Everyone thought that Apache wasn’t keeping up, but Apache had a trump card: an extension layer. And that engineered all kinds of productive arguments among a market of competing developers rather than a single engineered agreement imposed by either a dominant player OR a dominant committee.</p>



<p>Rao also noted that protocols spread slowly but become nearly impossible to dislodge once established. For example, SMTP (the protocol for email) dates back to 1982, and has outlasted many competitors. There is a lot of path dependence. And so getting the first steps right is an important part of engineering the argument.</p>



<p>And in his essay “<a href="https://summerofprotocols.com/research/standards-make-the-world" target="_blank" rel="noreferrer noopener">Standards Make the World</a>” for the Summer of Protocols project, David Lang makes the point that technical standards form a third pillar of modern society, alongside private organizations and public institutions. They aren’t the state and they aren’t the market, but they’re essential to both. When they work well, standards become enabling technologies. The internet. The shipping container. Standard time. They are civilizational infrastructure.</p>



<p>In short, we are not just building communication protocols for software agents. We are developing a new way to standardize the best practices and workflows that will shape the human + AI future, allowing humans and agents to cooperate across organizations, industries, and borders.</p>



<h2 class="wp-block-heading"><strong>Skills can also be seen as protocols</strong></h2>



<p>Once the Long Now team planted in my mind the connection between workflows and protocols, it occurred to me that Agent Skills are also a “stratum of codified behavior,” and perhaps even a set of competing “engineered arguments” for how to do work with AI.</p>



<p>At the simplest level, a Skill is a piece of structured knowledge: here’s how to create a Word document; here’s how to extract the text from a PDF; here’s how to publish on the <a href="https://huggingface.co/docs/hub/index" target="_blank" rel="noreferrer noopener">Hugging Face Hub</a>. There can be many Skills that attempt to codify the same knowledge, but some may be better than others. As Skills multiply, how will we find the best ones? This is in many ways analogous to the organic web search problem, which Google solved by aggregating hundreds of useful signals.</p>



<p>And we’re seeing that there is a kind of hierarchy of skills. Jesse Vincent’s <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> framework, which has become one of the most widely adopted open source projects in AI-assisted development, doesn’t just give agents individual capabilities. It encodes an entire software development methodology: brainstorm before you build, plan before you code, test before you ship, review before you merge. That’s a standardized workflow. It’s a lot like the kinds of protocol that the Long Now folks were talking about, expressed in a form that agents can follow.</p>



<p>The existing protocols that the protocol research community talks about, like wildfire management protocols or hospital triage protocols, encode best practices into a repeatable, teachable process for human teams. They have yet to be adapted for agents. And in fact, many of them are never going to be entirely agentic. We will need to build mechanisms for workflows that include both AI agents and humans working together.</p>



<p>Agent Skills in some (but not all) areas raise the same questions that industrial standards have always raised: Who decides what the best practice is? How do you verify quality? How do you govern updates? We may be talking about skills that encode the workflow for regulatory compliance in a specific industry, or for conducting an environmental impact assessment, or for managing a clinical trial. Are the standards <em>de jure</em> or <em>de facto</em>, the result of an engineered agreement by a committee or an engineered argument that enables a vibrant market?</p>



<p>At O’Reilly, this is something we think about a lot. We’re a company built on codifying expert knowledge. We’ve published books and organized conferences and online training that taught people how to do new things. Now we’re asking “What does it look like to publish the skills that teach agents how to do things? And how do we make sure those skills are discoverable, trustworthy, and monetizable, not just for us but for every domain expert who has knowledge worth encoding?” And how do they emerge from contention in a vibrant market rather than by decree?</p>



<p>We believe we’ll all be better off with an engineered argument than an engineered agreement. And that brings me to mechanism design.</p>



<h2 class="wp-block-heading"><strong>The missing mechanisms</strong></h2>



<p>Economists use the term &#8220;mechanism design&#8221; to describe the engineering of rules and incentive structures that lead self-interested actors to produce outcomes that are good for everyone. It&#8217;s sometimes called &#8220;reverse game theory.&#8221; Rather than analyzing the equilibria that emerge from a given set of rules, you start with the outcome you want and work backward to design the rules that will get you there.</p>



<p>Mechanism design theory got its start in the 1960s when Leonid Hurwicz took up the problem of how a planner can make good decisions when the information needed to make them is scattered among many different people, each of whom has their own interests. His key insight was that people won&#8217;t reliably reveal what they know unless it&#8217;s in their interest to do so. So how do you design a system that aligns their incentives?</p>



<p>The field that Hurwicz founded and that Eric Maskin and Roger Myerson developed through the 1970s and 80s earned all three the Nobel Prize in Economics in 2007.</p>



<p>I first encountered the field when Jonathan Hall, at the time the Chief Economist at Uber, waved Al Roth’s book <a href="https://www.amazon.com/Who-Gets-What-Why-Matchmaking/dp/0544705289" target="_blank" rel="noreferrer noopener"><em>Who Gets What — and Why</em></a> at me and said “This is my Bible.” In it, Roth describes his own work on mechanism design, which won him the 2012 Nobel Prize in Economics along with Lloyd Shapley. Roth applied mechanism design to kidney matching markets, markets for college admissions, for law clerks and judges, and for hospitals and medical residents. When I first talked to Jonathan and then Al Roth, my layman’s takeaway about mechanism design was that it was simply the application of economic theory to design better markets.</p>



<p>And I’ve since come to think even more broadly about what mechanism design might mean in a technology context. In my broader framing, packet switching was a breakthrough in mechanism design. So for that matter was TCP/IP, the World Wide Web, and <a href="https://en.wikipedia.org/wiki/The_Unix_Programming_Environment" target="_blank" rel="noreferrer noopener">the protocol-centric architecture of Unix/Linux</a>, which enabled open source and the distributed, cooperative software development environment we take for granted today. PageRank and the rest of Google’s organic search system also seems to me to be a kind of mechanism design. So do Pay Per Click advertising and the Google ad auction. All of them are ways of aligning incentives such that self-interested actors produce outcomes that are good for others as well.</p>



<p>So that brings me back to AI. Right now, there’s a problem that makes the AI/human knowledge market less efficient than it could be. The disrespect for IP that has been shown by the AI labs and applications during the training stage, and even now during inference, has led to efforts by content owners to protect their content from AI. Do not crawl. Lawsuits. Reluctance to share information. Even the AI labs are complaining about the theft of their IP and trying to protect their model weights from distillation.</p>



<p>It’s an economy crying out for mechanism design.</p>



<p>The lesson of <a href="https://support.google.com/youtube/answer/2797370?hl=en" target="_blank" rel="noreferrer noopener">YouTube Content ID</a> is worth learning. Twenty-five years ago, the music industry was in the same position that content creators are in today with AI. In response to unauthorized use of their music by creators, music publishers’ demand to YouTube was “Take it down.” But as Google engineer Doug Eck explained to me, YouTube came up with a better answer: “How about we help you monetize it instead?” I don’t know the details of how that decision was made but I do know the eventual outcome. Aligned incentives led to a vibrant creator economy in which YouTube’s video creators, the music companies, and Google all got to share in the value that was created.</p>



<p>That should give us inspiration for how to solve some of the problems we face now with AI. Whether it’s with Agent Skills, NotebookLM, or other emergent artifacts of the new AI/human knowledge economy, we need to align the incentives. If we can grow the pie in a way where no single gatekeeper captures the bulk of the benefit, there’s a way to create a vibrant market. But that requires building mechanisms that don’t exist yet.</p>



<p>What mechanisms are missing from the agentic economy? Here’s a partial list:</p>



<p><strong>Skills markets. </strong>There’s an enormous economic opportunity for humans to create and trade skills that agents can use. These are not just simple aggregation of context with tool use instructions, but higher-level, industry-specific workflows that encode deep human expertise. At O’Reilly, we’re figuring out how to turn our knowledge and that of our authors into skills, how to make them discoverable, and how to sell them. But as of yet, there’s no way for a broader community of skill creators to participate.</p>



<p><strong>Quality and governance for skills. </strong>Some skills will need the same kinds of governance that industrial standards have. Who certifies that a medical skills package follows current clinical guidelines? Who updates it when the guidelines change? We haven’t begun to build the institutions that would govern agent skills at that level.</p>



<p><strong>Registries and discovery. </strong>The MCP community has been working on&nbsp;<a href="https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/" target="_blank" rel="noreferrer noopener">a registry protocol</a>, as is&nbsp;<a href="https://eips.ethereum.org/EIPS/eip-8004" target="_blank" rel="noreferrer noopener">the Ethereum community</a>.</p>



<p>This isn&#8217;t just a technical development but a business opportunity. I still remember when Network Solutions was running the original top level internet domain name registry under contract from the National Science Foundation. When the government said it would end the payments, Network Solutions planned to walk away. Then they realized what they had. On the early internet, domain name registration became a surprisingly big business. Now it’s just boring civilizational infrastructure. Is there something similar for AI models, applications, and agents?</p>



<p><strong>Organic search for agents. </strong>Google’s first great innovation on the web wasn’t how to make pay per click ads really work with a data-driven ad auction. It was organic search: a way of coordinating a market with hundreds of signals that ignored price and worked independently of whether the destination content was free or paid. <em>The New York Times</em> (or <a href="http://oreilly.com" target="_blank" rel="noreferrer noopener">oreilly.com</a>) is subscription-based, but that isn’t a factor in whether Google shows it to you. Google figured out signals that let them say, “This is the best result for this query.” Sites behind paywalls figured out how to disclose enough for people to decide whether they wanted to take the next step and enter into a transaction. That’s an engineered argument.</p>



<p>We’re going to need the equivalent for skills and agent services. We’ll start with curated marketplaces. Vercel already has one. But we’re a long way from something as effective as Google’s peak in organic search. The search space will be huge, with hundreds of millions, maybe billions of agents seeking the best way to accomplish trillions of distinct tasks. Skills can help them save on inference costs and deliver better results. The question is what signals will drive discovery of the best match.</p>



<p><strong>Extension architectures. </strong>MCP’s&nbsp;<a href="https://modelcontextprotocol.io/extensions/overview" target="_blank" rel="noreferrer noopener">extension model</a>&nbsp;(including the new&nbsp;<a href="https://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps/" target="_blank" rel="noreferrer noopener">Apps Extension</a>) is promising. This is the Apache model all over again: keep the core simple, let people layer different approaches on top, and let the market sort out which ones win. It is, in essence, an engineered argument rather than an engineered agreement.</p>



<p><strong>Payment layers. </strong>Stripe has been working on&nbsp;<a href="https://stripe.com/use-cases/agentic-commerce" target="_blank" rel="noreferrer noopener">agentic commerce</a>, but it seems to be focused on traditional e-commerce transactions like booking a ticket or buying a product. What about a payment layer for skills? There have been proposals for monetizing MCP calls (pay per call, pay per token), but none have caught on yet. Coinbase&#8217;s&nbsp;<a href="https://www.coinbase.com/developer-platform/discover/launches/x402" target="_blank" rel="noreferrer noopener">x402 protocol</a>&nbsp;may also end up playing a role.</p>



<p><strong>Progressive access and authentication. </strong><a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1649" target="_blank" rel="noreferrer noopener">MCP Server Cards</a> promise to let a service specify its terms: here’s what we charge, here’s how you authenticate. That’s a functional disclosure layer that could enable commerce. It could enable progressive privileges: a free O’Reilly subscriber gets one set of tools, a paying subscriber gets a richer set, all on top of the same MCP server. Again, that’s an engineered argument with the market deciding the winners.</p>



<p><strong>Neutrality in agent routing. </strong>When ChatGPT decides to show you a Booking.com widget instead of an Airbnb widget, who made that choice, and on what basis? OpenAI claims commercial considerations aren’t a factor. That’s hard to take at face value. We need something like the original principle of organic search: surface the best result for the user, not the most profitable one for the platform.</p>



<h2 class="wp-block-heading"><strong>We don’t know the future, but we can set ourselves up to shape it for the better</strong></h2>



<p>I’m old enough to remember when UUCP was giving way to the internet, and there was a real debate over whether explicit path routing or domain routing was better. In retrospect, it’s blindingly obvious that path routing wasn’t going to scale. But it’s worthwhile to know that at the time, people weren’t at all clear about that!</p>



<p>The same is true now. Some of what I’ve described will turn out to be the equivalent of explicit path routing: a dead end that was only plausible for a small-scale network. Other parts will turn out to be as fundamental as DNS or HTTP. But we’re not trying to pick the winners. We’re trying to engineer the argument.</p>



<p>If we can enable better markets, it will allow a process of discovery. People try different things, most fail, some catch on. The job right now is to build the mechanisms that help the market to evolve.</p>



<p>We need mechanisms that no single gatekeeper can control. Modular, decentralized architectures let people experiment with business models, routing decisions, payment systems, and quality signals. And alongside those markets, we will eventually need institutions (some of which will be protocols) to maintain standards that will become the infrastructure of the next economy.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This article recapitulates a conversation with Ilan Strauss and Ido Salomon, and a separate conversation on the broader meaning of protocols in the context of industry workflows and civilizational infrastructure with Venkatesh Rao and Timber Schroff of the Ethereum Foundation’s Summer of Protocols program, and Denise Hearn and James Home of the Long Now Foundation. Rao’s </em>Protocol Reader<em> and David Lang’s “Standards Make the World,” published through the Summer of Protocols project, inform the argument about protocols as civilizational infrastructure.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Beyond Code Review</title>
		<link>https://www.oreilly.com/radar/beyond-code-review/</link>
				<pubDate>Fri, 20 Mar 2026 11:13:40 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18308</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Beyond-a-circular-code-review.jpeg" 
				medium="image" 
				type="image/jpeg" 
				width="640" 
				height="640" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Beyond-a-circular-code-review-160x160.jpeg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Not that long ago, we were resigned to the idea that humans would need to inspect every line of AI-generated code. We’d do it personally, code reviews would always be part of a serious software practice, and the ability to read and review code would become an even more important part of a developer&#8217;s skillset. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Not that long ago, we were resigned to the idea that humans would need to inspect every line of AI-generated code. We’d do it personally, code reviews would always be part of a serious software practice, and the ability to read and review code would become an even more important part of a developer&#8217;s skillset. At the same time, I suspect we all knew that was untenable, that AI would quickly generate much more code than humans could reasonably review. Understanding someone else’s code is harder than understanding your own, and understanding machine-generated code is harder still. At some point—and that point comes fairly early on—all the time you saved by letting AI write your code is spent reviewing it. It’s a lesson we’ve learned before; it’s been decades since anyone except for a few specialists needed to inspect the assembly code generated by a compiler. And, as Kellan Elliott-McCrea has <a href="https://laughingmeme.org/2026/03/05/socialize-the-plan.html" target="_blank" rel="noreferrer noopener">written</a>, it’s not clear that code review has ever justified the cost. While sitting around a table inspecting lines of code might catch problems of style or poorly implemented algorithms, code review remains an expensive solution to relatively minor problems.</p>



<p>With that in mind, specification-driven development (SDD) shifts the emphasis from review to verification, from prompting to specification, and from testing to still more testing. The goal of software development isn’t code that passes human review; it’s systems whose behavior lives up to a well-defined specification that describes what the customer wants. Finding out what the customer needs and designing an architecture to meet those needs requires human intelligence. As Ankit Jain <a href="https://www.latent.space/p/reviews-dead" target="_blank" rel="noreferrer noopener">points out in <em>Latent Space</em></a>, we need to make the transition from asking whether the code is written correctly to asking whether we’re solving the right problem. Understanding the problem we need to solve is part of the specification process—and it’s something that, historically, our industry <a href="https://www.oreilly.com/radar/quality-assurance-errors-and-ai/" target="_blank" rel="noreferrer noopener">hasn’t done well</a>.</p>



<p>Verifying that the system actually performs as intended is another critical part of the software development process. Does it solve the problem as described in the specification? Does it meet the requirements for what Neal Ford calls “<a href="https://learning.oreilly.com/library/view/fundamentals-of-software/9781492043447/ch04.html" target="_blank" rel="noreferrer noopener">architectural characteristics</a>” or “-ilities”: scalability, auditability, performance, and many other characteristics that are embodied in software systems but that can rarely be inferred from looking at the code, and that AI systems can’t yet reason about? These characteristics should be captured in the specification. The focus of the software development process moves from writing code to determining what the code should do and verifying that it indeed does what it’s supposed to do. It moves from the middle of the process to the beginning and the end. AI can play a role along the way, but specification and verification are where human judgment is most important.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>Drew Breunig and others point out that this is inherently <a href="https://www.dbreunig.com/2026/03/04/the-spec-driven-development-triangle.html" target="_blank" rel="noreferrer noopener">a circular process, not a linear one</a>. A specification isn’t something you write at the start of the process and never touch again. It needs to be updated whenever the system’s desired behavior changes: whenever a bug fix results in a new test, whenever users clarify what they want, whenever the developers understand the system’s goals more deeply. I’m impressed with how agile this process is. It is not the agile of sprints and standups but the agile of incremental development. Specification leads to planning, which leads to implementation, which leads to verification. If verification fails, we update the spec and iterate. Drew has built Plumb, a command line tool that can be plugged into Git, to support an automated loop through specification and testing. What distinguishes Plumb is its ability to help software developers look at the decisions that resulted in the current version of the software: diffs, of course, but also conversations with AI, the specifications, the plans, and the tests. As Drew says, Plumb is intended as an inspiration or a starting point, and it’s clearly missing important features—but it’s already useful.</p>



<p>Can SDD replace code review? Probably; again, code review is an expensive way to do something that may not be all that useful in the long run. But maybe that’s the wrong question. If you don’t listen carefully, SDD sounds like a reinvention of the waterfall process: a linear drive from writing a detailed spec to burning thousands of CDs that are stored in a warehouse. We need to listen to SDD itself to ask the right questions: How do we know that a software system solves the right problem? What kinds of tests can verify that the system solves the right problem? When is automated testing inappropriate, and when do we need human engineers to judge a system’s fitness? And how can we express all of that knowledge in a specification that leads a language model to produce working software?</p>



<p>We don’t place as much value in specifications as we did in the last century; we tend to see spec writing as an obsolete ceremony at the start of a project. That’s unfortunate, because we’ve lost a lot of institutional knowledge about how to write good, detailed specifications. The key to making specifications relevant again is realizing that they’re the start of a circular process that continues through verification. The specification is the repository for the project’s real goals: what it’s supposed to do and why—and those goals necessarily change during the course of a project. A specification-driven development loop that runs through testing—not just unit testing but fitness testing, acceptance testing, and human judgment about the results—lays the groundwork for a new kind of process in which humans won’t be swamped by reviewing AI-generated code.</p>
]]></content:encoded>
										</item>
		<item>
		<title>Keep Deterministic Work Deterministic</title>
		<link>https://www.oreilly.com/radar/keep-deterministic-work-deterministic/</link>
				<pubDate>Thu, 19 Mar 2026 11:15:28 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18299</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Keep-deterministic-work-deterministic.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Keep-deterministic-work-deterministic-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[How eight iterations of a blackjack simulation earned each nine the hard way]]></custom:subtitle>
		
				<description><![CDATA[This is the second article in a series on agentic engineering and AI-driven development. Read part one here, and look for the next article on April 2 on O’Reilly Radar. The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p><em>This is the second article in a series on agentic engineering and AI-driven development. </em><a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener"><em>Read part one here</em></a><em>, and look for the next article on April 2 on O’Reilly Radar.</em></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.<br>—<a href="https://en.wikipedia.org/wiki/Ninety%E2%80%93ninety_rule" target="_blank" rel="noreferrer noopener">Tom Cargill, Bell Labs</a></p>
</blockquote>



<p>One of the experiments I&#8217;ve been running as part of my work on agentic engineering and AI-driven development is a blackjack simulation where an LLM plays hundreds of hands against blackjack strategies written in plain English. The AI uses those strategy descriptions to make hit/stand/double-down decisions for each hand, while deterministic code deals the cards, checks the math, and verifies that the rules were followed correctly.</p>



<p>Early runs of my simulation had a 37% pass rate. The LLM would add up card totals wrong, skip the dealer&#8217;s turn entirely, or ignore the strategy it was supposed to follow. The big problem was that these errors compounded: If the model miscounted the player&#8217;s total on the third card, every decision after that was based on a wrong number, so the whole hand was garbage even if the rest of the logic was fine.</p>



<p>There&#8217;s a useful way to think about reliability problems like that: the <strong>March of Nines</strong>. Getting an LLM-based system to 90% reliability is the first nine, and it&#8217;s the &#8220;easy&#8221; one. Getting from 90% to 99% takes roughly the same amount of engineering effort. So does getting from 99% to 99.9%. Each nine costs about as much as the last, and you never stop marching. Andrej Karpathy coined the term from his experience building self-driving systems at Tesla, where they spent years earning two or three nines and still had more to go.</p>



<p>Here&#8217;s a small exercise that shows how that kind of failure compounding works. Open any AI chatbot running an early 2026 model (I used ChatGPT 5.3 Instant) and paste the following eight prompts one at a time, each in a separate message. Go ahead, I&#8217;ll wait.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Prompt 1:</strong> Track a running &#8220;score&#8221; through a 7-step game. Do not use code, Python, or tools. Do this entirely in your head. For each step, I will give you a sentence and a rule.</p>



<p>CRITICAL INSTRUCTION: You must reply with ONLY the mathematical equation showing how you updated the score. Example format: 10 + 5 = 15 or 20 / 2 = 10. Do not list the words you counted, do not explain your reasoning, and do not write any other text. Just the equation.</p>



<p>Start with a score of 10. I&#8217;ll give you the first step in the next prompt.</p>



<p><strong>Prompt 2:</strong> &#8220;The sudden blizzard chilled the small village communities.&#8221; Add the number of words containing double letters (two of the exact same letter back-to-back, like &#8216;tt&#8217; or &#8216;mm&#8217;).</p>



<p><strong>Prompt 3:</strong> &#8220;The clever engineer needed seven perfect pieces of cheese.&#8221; If your score is ODD, add the number of words that contain EXACTLY two &#8216;e&#8217;s. If your score is EVEN, subtract the number of words that contain EXACTLY two &#8216;e&#8217;s. (Do not count words with one, three, or zero &#8216;e&#8217;s).</p>



<p><strong>Prompt 4:</strong> &#8220;The good sailor joined the eager crew aboard the wooden boat.&#8221; If your score is greater than 10, subtract the number of words containing consecutive vowels (two different or identical vowels back-to-back, like &#8216;ea&#8217;, &#8216;oo&#8217;, or &#8216;oi&#8217;). If your score is 10 or less, multiply your score by this number.</p>



<p><strong>Prompt 5:</strong> &#8220;The quick brown fox jumps over the lazy dog.&#8221; Add the number of words where the THIRD letter is a vowel (a, e, i, o, u).</p>



<p><strong>Prompt 6:</strong> &#8220;Three brave kings stand under black skies.&#8221; If your score is an ODD number, subtract the number of words that have exactly 5 letters. If your score is an EVEN number, multiply your score by the number of words that have exactly 5 letters.</p>



<p><strong>Prompt 7:</strong> &#8220;Look down, you shy owl, go fly away.&#8221; Subtract the number of words that contain NONE of these letters: a, e, or i.</p>



<p><strong>Prompt 8:</strong> &#8220;Green apples fall from tall trees.&#8221; If your score is greater than 15, subtract the number of words containing the letter &#8216;a&#8217;. If your score is 15 or less, add the number of words containing the letter &#8216;l&#8217;.</p>
</blockquote>



<p>The exercise tracks a running score through seven steps. Each step gives the model a sentence and a counting rule, and the score carries forward. The correct final score is <strong>60</strong>. Here&#8217;s the answer key: start at 10, then 16 (10+6), 12 (16−4), 5 (12−7), 10 (5+5), 70 (10×7), 63 (70−7), 60 (63−3).</p>



<p>I ran this twice, in parallel (using ChatGPT 5.3 Instant), and got two completely different wrong answers. Neither run reached the correct score of 60:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Step</strong></td><td><strong>Correct</strong></td><td><a href="https://chatgpt.com/share/69b1a6be-86ec-8005-97e3-e3d7bd61b3e6"><strong>Run 1</strong></a><strong> (</strong><a href="https://gist.github.com/andrewstellman/d73e23e07eca411d4384ec2736afafd0#file-run-1-chatgpt-track-running-score-md"><strong>transcript</strong></a><strong>)</strong></td><td><a href="https://chatgpt.com/share/69b1a6d3-b458-8005-ba35-0f846f0dbcda"><strong>Run 2</strong></a><strong> (</strong><a href="https://gist.github.com/andrewstellman/d73e23e07eca411d4384ec2736afafd0#file-run-2-chatgpt-score-tracking-game-md"><strong>transcript</strong></a><strong>)</strong></td></tr><tr><td>1. Double letters</td><td>10 + 6 = 16</td><td>10 + 2 = 12 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>10 + 5 = 15 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>2. Exactly two &#8216;e&#8217;s</td><td>16 − 4 = 12</td><td>12 − 4 = 8 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>15 + 4 = 19 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>3. Consecutive vowels</td><td>12 − 7 = 5</td><td>8 × 7 = 56 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>19 − 5 = 14 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>4. 
Third letter vowel</td><td>5 + 5 = 10</td><td>56 + 5 = 61 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>14 + 3 = 17 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>5. Exactly 5 letters</td><td>10 × 7 = 70</td><td>61 − 7 = 54 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>17 − 4 = 13 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>6. No a, e, or i</td><td>70 − 7 = 63</td><td>54 − 7 = 47 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>13 − 3 = 10 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>7. Letter &#8216;a&#8217; or &#8216;l&#8217;</td><td>63 − 3 = 60</td><td>47 − 3 = 44 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>10 + 4 = 14 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr></tbody></table></figure>



<p>The two runs tell very different stories. In Run 1, the model miscounted in Step 1 (found 2 double-letter words instead of 6) but actually got the later counts right. It didn&#8217;t matter. The wrong score in Step 1 flipped a branch in Step 3, triggering a multiply instead of a subtract, and the score never recovered. One early mistake threw off the entire chain, even though the model was doing good work after that.</p>



<p>Run 2 was a disaster. The model miscounted at almost every step, compounding errors on top of errors. It ended at 14 instead of 60. That&#8217;s closer to what Karpathy is describing with the March of Nines: Each step has its own reliability ceiling, and the longer the chain, the higher the chance that at least one step fails and corrupts everything downstream.</p>



<p>What makes this insidious: Both runs look the same from the outside. Each step produced a plausible answer, and both runs produced final results. Without the answer key (or some tedious manual checking), you&#8217;d have no way of knowing that Run 1 was a near-miss derailed by a single early error and Run 2 was wrong at nearly every step. This is typical of any process where the output of one LLM call becomes the input for the next one.</p>



<p>These failures don&#8217;t demonstrate the March of Nines itself—that&#8217;s specifically about the engineering effort to push reliability from 90% to 99% to 99.9%. (It&#8217;s possible to reproduce the full compounding-reliability problem in a chat, but a prompt that did it reliably would be far too long to put in an article.) Instead, I opted for a shorter exercise, one you can easily try yourself, that demonstrates the underlying problem that makes the march so hard: <strong>cascading failures</strong>. Each step asks the model to count letters inside words, which is deterministic work that a short Python script handles perfectly. LLMs, on the other hand, don&#8217;t actually treat words as strings of characters; they see them as tokens. Spotting double letters means unpacking a token into its characters, and the model gets that wrong just often enough to reliably screw it up. I added branching logic where each step&#8217;s result determines the next step&#8217;s operation, so a single miscount in Step 1 cascades through the entire sequence.</p>



<p>I also want to be clear about exactly what a deterministic version of this simulation looks like. Luckily, the AI can help us with that. Go to either run (or your own) and paste one more prompt into the chat:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Prompt 9:</strong> Now write a short Python script that does exactly what you just did: start with a score of 10, apply each of the seven rules to the seven sentences, and print the equation at each step.</p>
</blockquote>



<p>Run the script. It should print the correct answer for every step, ending at 60. The same AI that just failed the exercise can write code that does it flawlessly, because <em>now it&#8217;s generating deterministic logic</em> instead of trying to count characters through its tokenizer.</p>
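<p>For reference, here&#8217;s a minimal sketch of what such a script might look like. This is my own version, written directly from the prompts above rather than taken from a model&#8217;s output, and the helper names are arbitrary:</p>

```python
# Deterministic re-implementation of the seven-step exercise.
# Start at 10 and apply each rule exactly as the prompts state it.
import re

def words(sentence):
    return re.findall(r"[a-z]+", sentence.lower())

def has_double_letter(w):
    return any(a == b for a, b in zip(w, w[1:]))

def has_consecutive_vowels(w):
    v = set("aeiou")
    return any(a in v and b in v for a, b in zip(w, w[1:]))

score = 10

# Step 1: add words with double letters
n = sum(has_double_letter(w) for w in words(
    "The sudden blizzard chilled the small village communities."))
score += n                                      # 10 + 6 = 16

# Step 2: exactly two 'e's; odd score adds, even score subtracts
n = sum(w.count("e") == 2 for w in words(
    "The clever engineer needed seven perfect pieces of cheese."))
score = score + n if score % 2 else score - n   # 16 - 4 = 12

# Step 3: consecutive vowels; > 10 subtracts, otherwise multiplies
n = sum(has_consecutive_vowels(w) for w in words(
    "The good sailor joined the eager crew aboard the wooden boat."))
score = score - n if score > 10 else score * n  # 12 - 7 = 5

# Step 4: add words whose third letter is a vowel
n = sum(len(w) >= 3 and w[2] in "aeiou" for w in words(
    "The quick brown fox jumps over the lazy dog."))
score += n                                      # 5 + 5 = 10

# Step 5: exactly 5 letters; odd subtracts, even multiplies
n = sum(len(w) == 5 for w in words(
    "Three brave kings stand under black skies."))
score = score - n if score % 2 else score * n   # 10 * 7 = 70

# Step 6: subtract words containing none of a, e, i
n = sum(not set(w) & set("aei") for w in words(
    "Look down, you shy owl, go fly away."))
score -= n                                      # 70 - 7 = 63

# Step 7: > 15 subtracts words with 'a'; otherwise adds words with 'l'
ws = words("Green apples fall from tall trees.")
score += -sum("a" in w for w in ws) if score > 15 else sum("l" in w for w in ws)

print(score)  # 60
```

<p>Every intermediate value matches the answer key, because character-level counting is deterministic work that code gets right on every run.</p>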



<h2 class="wp-block-heading"><strong>Reproducing a cascading failure in a chat</strong></h2>



<p>I deliberately engineered the exercise earlier to give you a way to experience the cascading failure problem behind the March of Nines yourself. I took advantage of something current LLMs genuinely suck at: parsing characters inside tokens. Future models might do a much better job with this specific kind of failure, but the cascading failure problem doesn&#8217;t go away when the model gets smarter. As long as LLMs are nondeterministic, any step that relies on them has a reliability ceiling below 100%, and those ceilings still multiply. The specific weakness changes; the math doesn&#8217;t.</p>
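<p>To put rough numbers on that multiplication, here&#8217;s a quick back-of-the-envelope calculation. The 95% per-step success rate is a figure I chose for illustration, not a measurement:</p>

```python
# If each pipeline step independently succeeds 95% of the time,
# end-to-end reliability decays multiplicatively with chain length.
per_step = 0.95
for steps in (7, 20, 50):
    print(f"{steps:>2} steps: {per_step ** steps:.1%} of runs finish clean")
```

<p>At 95% per step, even the seven-step exercise above finishes clean only about 70% of the time, and a 50-step pipeline finishes clean less than 8% of the time. That&#8217;s the sense in which the ceilings multiply.</p>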



<p>I also specifically asked the model to show only the equation and skip all intermediate reasoning to prevent it from using <strong>chain of thought</strong> (or CoT) to self-correct. Chain of thought is a technique where you require the model to show its work step by step (for example, listing the words it counted and explaining why each one qualifies), which helps it catch its own mistakes along the way. CoT is a common way to improve LLM accuracy, and it works. As you&#8217;ll see later when I talk about the evolution of my blackjack simulation, CoT cut certain errors roughly in half. But &#8220;half as many errors&#8221; is still not zero. Plus, it&#8217;s expensive: It costs more tokens and more time. A Python script that counts double letters gets the right answer on every run, instantly, for zero AI API costs (or, if you&#8217;re running the AI locally, for orders of magnitude less CPU usage). That&#8217;s the core tension: You can spend engineering effort making the LLM better at deterministic work, or you can just hand it to code.</p>



<p>Every step in this exercise is deterministic work that code handles flawlessly. But most interesting LLM tasks aren&#8217;t like that. You can&#8217;t write a deterministic script that plays a hand of blackjack using natural-language strategy rules, or decides how a character should respond in dialogue. Real work requires chaining multiple steps together into a <strong>pipeline</strong>: a reproducible series of steps (some deterministic, some requiring an LLM) that leads to a single result, where each step&#8217;s output feeds the next. If that sounds like what you just saw in the exercise, it is. Except real pipelines are longer, more complex, and much harder to debug when something goes wrong in the middle.</p>



<h2 class="wp-block-heading"><strong>LLM pipelines are especially susceptible to the March of Nines</strong></h2>



<p>I&#8217;ve been spending a lot of time thinking about LLM pipelines, and I suspect I&#8217;m in the minority. Most people using LLMs are working with single prompts or short conversations. But once you start building multistep workflows where the AI generates structured data that feeds into the next step—whether that&#8217;s a content generation pipeline, a data processing chain, or a simulation—you run straight into the March of Nines. Each step has a reliability ceiling, and those ceilings multiply. The exercise you just tried had seven steps. The blackjack pipeline has more, and I&#8217;ve been running it hundreds of times per iteration.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="812" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10.png" alt="The blackjack pipeline in Octobatch" class="wp-image-18301" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-300x152.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-768x390.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-1536x780.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>The blackjack pipeline in </em><a href="https://github.com/andrewstellman/octobatch/" target="_blank" rel="noreferrer noopener"><em>Octobatch</em></a><em>, an open source batch orchestrator for multistep LLM workflows that I introduced in “</em><a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener"><em>The Accidental Orchestrator</em></a>.”</figcaption></figure>



<p>That&#8217;s a screenshot of the blackjack pipeline in <a href="https://github.com/andrewstellman/octobatch/" target="_blank" rel="noreferrer noopener">Octobatch</a>, the tool I built to run these pipelines at scale. That pipeline deals cards deterministically, asks the LLM to play each hand following a strategy described in plain English, then validates the results with deterministic code. Octobatch makes it easy to change the pipeline and rerun hundreds of hands, which is how I iterated through eight versions—and how I really learned the hard way that the March of Nines wasn&#8217;t just a theoretical problem but something I could watch happening in real time across hundreds of data points.</p>



<p>Running pipelines at scale made the failures obvious and immediate, which, for me, really underscored an effective approach to minimizing the cascading failure problem: <strong>make deterministic work deterministic</strong>. That means asking whether every step in the pipeline actually needs to be an LLM call. Checking that a jack, a five, and an eight add up to 23 doesn&#8217;t require a language model. Neither does looking up whether standing on 15 against a dealer 10 follows basic strategy. That&#8217;s arithmetic and a lookup table—work that ordinary code does perfectly every time. And as I learned over the course of improving the failure rate for the pipeline, every step you pull out of the LLM and make deterministic goes to 100% reliability, which stops it from contributing to the compound failure rate.</p>
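<p>Here&#8217;s a sketch of what those deterministic checks could look like. The function names and the strategy-table entries are illustrative, not Octobatch&#8217;s actual code:</p>

```python
# Deterministic verification of an LLM-played blackjack hand:
# recompute the total and look the action up in a strategy table.

CARD_VALUES = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8,
               "9": 9, "10": 10, "J": 10, "Q": 10, "K": 10, "A": 11}

def hand_total(cards):
    """Best blackjack total: count aces as 11, downgrade to 1 on bust."""
    total = sum(CARD_VALUES[c] for c in cards)
    aces = cards.count("A")
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total

# A tiny slice of basic strategy: (player total, dealer upcard) -> action.
# A real table also covers soft hands, pairs, and doubling.
BASIC_STRATEGY = {(15, "10"): "hit", (15, "6"): "stand", (12, "2"): "hit"}

def verify_hand(cards, reported_total, dealer_up, action):
    """Return a list of rule violations; empty means the hand checks out."""
    errors = []
    actual = hand_total(cards)
    if actual != reported_total:
        errors.append(f"math: hand is {actual}, not {reported_total}")
    expected = BASIC_STRATEGY.get((actual, dealer_up))
    if expected and action != expected:
        errors.append(f"strategy: expected {expected} on {actual} vs {dealer_up}")
    return errors

# A jack, a five, and an eight total 23; a reported 21 is flagged instantly.
print(verify_hand(["J", "5", "8"], 21, "10", "stand"))
```

<p>No API call, no tokens, no nondeterminism: the check either passes or it doesn&#8217;t, identically on every run.</p>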



<p>Relying on the AI for deterministic work is the computation side of a pattern I wrote about for data in “<a href="https://www.oreilly.com/radar/ai-mcp-and-the-hidden-costs-of-data-hoarding/" target="_blank" rel="noreferrer noopener">AI, MCP, and the Hidden Costs of Data Hoarding</a>.” Teams dump everything into the AI&#8217;s context because the AI can handle it—until it can&#8217;t. The same thing happens with computation: Teams let the AI do arithmetic, string matching, or rule evaluation because it mostly works. But &#8220;mostly works&#8221; is expensive and slow, and a short script does it perfectly. Better yet, the AI can <em>write</em> that script for you—which is exactly what Prompt 9 demonstrated.</p>



<h2 class="wp-block-heading"><strong>Getting cascading failures out of the blackjack pipeline</strong></h2>



<p>I pushed the blackjack pipeline through eight iterations, and the results taught me more about earning nines than I expected. That&#8217;s why I&#8217;m writing this article—the iteration arc turned out to be one of the clearest illustrations I&#8217;ve found of how the principle works in practice.</p>



<p>I addressed failures two ways, and the distinction matters.</p>



<p>Some failures called for making work deterministic. Card dealing runs as a local expression step, which doesn&#8217;t require an API call, so it&#8217;s free, instant, and 100% reproducible. There&#8217;s a math verification step that uses code to recalculate totals from the actual cards dealt and compare them against what the LLM reported, and a strategy compliance step that checks the player&#8217;s first action against a deterministic lookup table. Neither of those steps requires an AI judgment call; when I originally ran them as LLM calls, they introduced errors that were hard to detect and expensive to debug.</p>
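<p>A strategy compliance check of that sort fits in a few lines. The table entries below are a hypothetical fragment, not a full basic-strategy chart (a real one also distinguishes soft totals and pairs):</p>

```python
# Hypothetical fragment of a basic-strategy lookup table. Keys are
# (player_total, dealer_upcard_value); values are the correct first action.
BASIC_STRATEGY = {
    (15, 10): "hit",   # standing on 15 against a dealer 10 is an error
    (15, 6): "stand",
    (12, 4): "stand",
}

def first_action_compliant(player_total, dealer_upcard, action):
    """Deterministic compliance check: compare the reported action against
    the table instead of asking a second LLM to judge it."""
    expected = BASIC_STRATEGY.get((player_total, dealer_upcard))
    return expected is None or action == expected
```

<p>The deterministic version is also trivially auditable: when it rejects a hand, the table entry tells you exactly why.</p>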



<p>Other failures called for structural constraints that made specific error patterns harder to produce. Chain of thought format forced the LLM to show its work instead of jumping to conclusions. The rigid dealer output structure made it mechanically difficult to skip the dealer&#8217;s turn. Explicit warnings about counterintuitive rules gave the LLM a reason to override its training priors. These don&#8217;t eliminate the LLM from the step—they make the LLM more reliable within it.</p>



<p>But before any of that mattered, I had to face the uncomfortable fact that <em>measurements themselves can be wrong, </em><strong><em>especially when relying on AI to take those measurements</em></strong><em>.</em> For example, the first run reported a 57% pass rate, which was great! But when I looked at the data myself, a lot of runs were obviously wrong. It turned out that the pipeline had a bug: Verification steps were running, but the AI step that was supposed to enforce their results didn&#8217;t have adequate guardrails, so almost every hand passed regardless of the actual data. I asked three AI advisors to review the pipeline, and none of them caught it. The only thing that exposed it was checking the aggregate numbers, which didn&#8217;t add up. If you let probabilistic behavior into a step that should be deterministic, the output will look plausible and the system will report success, but you have no way to know something&#8217;s wrong until you go looking for it.</p>
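<p>That kind of aggregate audit can itself be made deterministic. This is a hypothetical sketch, not Octobatch&#8217;s code: recompute each hand&#8217;s total in ordinary code and count how often the recomputation disagrees with the verdict the pipeline reported.</p>

```python
# Hypothetical audit sketch: a high pass rate combined with frequent
# disagreements means a "verification" step is rubber-stamping results.
def disagreement_rate(results, recompute_total):
    disagreements = 0
    for r in results:
        arithmetic_ok = recompute_total(r["cards"]) == r["reported_total"]
        verdict_ok = r["verdict"] == "pass"
        if arithmetic_ok != verdict_ok:
            disagreements += 1
    return disagreements / len(results)

# Two hands, both marked "pass", but the second reported total is wrong.
hands = [
    {"cards": [10, 5, 8], "reported_total": 23, "verdict": "pass"},
    {"cards": [10, 5, 8], "reported_total": 25, "verdict": "pass"},
]
assert disagreement_rate(hands, sum) == 0.5
```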



<p>Once I fixed the bug, the real pass rate emerged: 31%. Here&#8217;s how the next seven iterations played out:</p>



<ul class="wp-block-list">
<li><strong>Restructuring the data (31% → 37%).</strong> The LLM kept losing track of where it was in the deck, so I restructured the data it received to eliminate the bookkeeping. I also removed split hands entirely, because tracking two simultaneous hands is stateful bookkeeping that LLMs reliably botch. Each fix came from looking at what was actually failing and asking whether the LLM needed to be doing that work at all.</li>



<li><strong>Chain of thought arithmetic (37% → 48%).</strong> Instead of letting the LLM jump to a final card total, I required it to show the running math at every step. Forcing the model to trace its own calculations cut multidraw errors roughly in half. CoT is a structural constraint, not a deterministic replacement; it makes the LLM more reliable within the step, but it&#8217;s also more expensive because it uses more tokens and takes more time.</li>



<li><strong>Replacing the LLM validator with deterministic code (48% → 79%).</strong> This was the single biggest improvement in the entire arc. The pipeline had a second LLM call that scored how accurately the player followed strategy, and it was wrong 73% of the time. It applied its own blackjack intuitions instead of the rules I&#8217;d given it. But there&#8217;s a right answer for every situation in basic strategy, and the rules can be written as a lookup table. Replacing the LLM validator with a deterministic expression step recovered over 150 incorrectly rejected hands.</li>



<li><strong>Rigid output format (79% → 81%).</strong> The LLM kept skipping the dealer&#8217;s turn entirely, jumping straight to declaring a winner. Requiring a step-by-step dealer output format made it mechanically difficult to skip ahead.</li>



<li><strong>Overriding the model&#8217;s priors (81% → 84%).</strong> One strategy required hitting on 18 against a high dealer card, which any conventional blackjack wisdom says is terrible. The LLM refused to do it. Restating the rule didn&#8217;t help. Explaining <em>why</em> the counterintuitive rule exists did: The prompt had to tell the model that the bad play was intentional.</li>



<li><strong>Switching models (84% → 94%).</strong> I switched from Gemini Flash 2.0 to Haiku 4.6, which was easy to do because Octobatch lets you run the same pipeline with any model from Gemini, Anthropic, or OpenAI. I finally earned my first nine.</li>
</ul>
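<p>The &#8220;rigid output format&#8221; fix can be enforced mechanically. This is an illustrative sketch (the line format is my invention, not the pipeline&#8217;s): require one line per dealer draw plus a final total, and reject any transcript that jumps straight to declaring a winner.</p>

```python
import re

# Hypothetical rigid dealer-output format: one "draws" line per card,
# then a mandatory final-total line. A transcript that skips the dealer's
# turn fails the shape check before any content is evaluated.
DRAW_LINE = re.compile(r"^Dealer draws \w+ \(running total \d+\)$")
FINAL_LINE = re.compile(r"^Dealer final total: \d+$")

def dealer_turn_shown(transcript: str) -> bool:
    lines = [l for l in transcript.strip().splitlines() if l.strip()]
    if not lines or not FINAL_LINE.match(lines[-1]):
        return False
    return all(DRAW_LINE.match(l) for l in lines[:-1])

assert dealer_turn_shown("Dealer draws 7 (running total 17)\nDealer final total: 17")
assert not dealer_turn_shown("Player wins with 20.")
```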



<h2 class="wp-block-heading"><strong>Find the best ways to earn your nines</strong></h2>



<p>If you&#8217;re building anything where LLM output feeds into the next step, the same question applies to every step in your chain: Does this actually require judgment, or is it deterministic work that ended up in the LLM because the LLM can do it? The strategy validator felt like a judgment call until I looked at what it was actually doing, which was checking a hand against a lookup table. That one recognition was worth more than all the prompt engineering combined. And as Prompt 9 showed, the AI is often the best tool for writing its own deterministic replacement.</p>



<p>I learned this lesson through my own work on the blackjack pipeline. It went through eight iterations, and I think the numbers tell a story. The fixes fell into two categories: making work deterministic (pulling it out of the LLM entirely) and adding structural constraints (making the LLM more reliable within a step). Both earn nines, but pulling work out of the LLM entirely earns those nines faster. The biggest single jump in the whole arc—48% to 79%—came from replacing an LLM validator with a 10-line expression.</p>



<p>Here&#8217;s the bottom line for me: <em>If you can write a short function that does the job, don&#8217;t give it to the LLM</em>. I initially reached for the LLM for strategy validation because it felt like a judgment call, but once I looked at the data I realized it wasn&#8217;t at all. There was a right answer for every hand, and a lookup table found it more reliably than a language model.</p>



<p>At the end of eight iterations, the pipeline passed 94% of hands. The 6% that still fail may be honest limits of what the model can do with multistep arithmetic and state tracking in a single prompt. But they may just be the next nine that I need to earn.</p>



<p>The next article looks at the other side of this problem: Once you know what to make deterministic, how do you make the whole system legible enough that an AI can help your users build with it? The answer turns out to be a kind of documentation you write for AI to read, not humans—and it changes the way you think about what a user manual is for.</p>
]]></content:encoded>
										</item>
		<item>
		<title>What Is the PARK Stack?</title>
		<link>https://www.oreilly.com/radar/what-is-the-park-stack/</link>
				<pubDate>Wed, 18 Mar 2026 11:15:43 +0000</pubDate>
					<dc:creator><![CDATA[Dean Wampler]]></dc:creator>
						<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18291</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/PARK-stack.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/PARK-stack-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Background: Stacks with four-letter acronyms According to Wikipedia, the LAMP stack was coined in 1998 by Michael Kunze to describe what had emerged as a popular open source software stack for websites. When the World Wide Web exploded in popularity earlier in the ’90s, organizations used an ad hoc mixture of proprietary tools and operating [&#8230;]]]></description>
								<content:encoded><![CDATA[
<h2 class="wp-block-heading">Background: Stacks with four-letter acronyms</h2>



<p>According to <a href="https://en.wikipedia.org/wiki/LAMP_(software_bundle)" target="_blank" rel="noreferrer noopener">Wikipedia</a>, the <em>LAMP stack</em> was coined in 1998 by Michael Kunze to describe what had emerged as a popular open source software stack for websites. When the World Wide Web exploded in popularity earlier in the ’90s, organizations used an ad hoc mixture of proprietary tools and operating systems, along with some open source software (OSS), to build websites. The LAMP stack quickly became the most popular set of fully OSS components for this purpose.</p>



<p>LAMP is an acronym that stands for the following:</p>



<ul class="wp-block-list">
<li><strong>Linux</strong> &#8211; the operating system </li>



<li><strong>Apache HTTP Server</strong> &#8211; the web server</li>



<li><strong>MySQL</strong> &#8211; the database</li>



<li><strong>Perl, PHP, and/or Python</strong> &#8211; the application programming language</li>
</ul>



<p>It is hard to believe this today, but at the time, the idea of relying on open source software was controversial. Concerns about support, and about vulnerabilities since the source code is visible to everyone, were eventually resolved. Open source was irresistible because of the great flexibility, cost efficiencies, freedom from vendor lock-in, and rapid evolution of capabilities provided by popular OSS projects. The LAMP stack became one of the predominant drivers of enterprise adoption of open source.</p>



<h2 class="wp-block-heading">The PARK stack</h2>



<p>Like the rise of the web, the sudden explosion of interest in generative AI with large language models (LLMs), vision models (VMs), and others has driven interest in identifying the best core OSS components for a software stack tailored to the requirements for generative AI. This era now has the PARK stack. It was first suggested by Ben Lorica in “<a href="https://gradientflow.substack.com/p/trends-shaping-the-future-of-ai-infrastructure?open=false#%C2%A7infrastructure-and-compute-fabric" target="_blank" rel="noreferrer noopener">Trends Shaping the Future of AI Infrastructure</a>,” in November last year.</p>



<p>PARK stands for the following:</p>



<ul class="wp-block-list">
<li><strong>PyTorch</strong> &#8211; for model training and inference</li>



<li><strong>AI models and agents</strong> &#8211; the heart of generative AI</li>



<li><strong>Ray</strong> &#8211; for fine-grained, very flexible distributed programming</li>



<li><strong>Kubernetes</strong> &#8211; the industry-standard cluster management system</li>
</ul>



<p>Here, I will provide a brief description of each one and the requirements it meets.</p>



<h2 class="wp-block-heading">PyTorch</h2>



<p>The AI stack needed by model builders provides the ability to train and tune models. Application builders need efficient, scalable inference with models and the agents that use them.</p>



<p><a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener">PyTorch</a> started as one of many tools for designing and training a variety of machine learning models. It’s now the most popular choice for this purpose. It is used to design and train many of the world’s most prominent generative AI models. Alternatives include <a href="https://docs.jaxstack.ai/en/latest/getting_started.html" target="_blank" rel="noreferrer noopener">JAX</a> and its predecessor, <a href="https://www.tensorflow.org" target="_blank" rel="noreferrer noopener">TensorFlow</a>.</p>



<p><a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener">PyTorch</a> was developed and open-sourced by Meta. It is now maintained by the <a href="https://pytorch.org/foundation/" target="_blank" rel="noreferrer noopener">PyTorch Foundation</a>. The ecosystem has expanded to include other <a href="https://pytorch.org/projects/" target="_blank" rel="noreferrer noopener">projects</a>, such as for inference (<a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noreferrer noopener">vLLM</a>), distributed training and inference (<a href="https://www.deepspeed.ai/" target="_blank" rel="noreferrer noopener">DeepSpeed</a> and <a href="https://www.ray.io/" target="_blank" rel="noreferrer noopener">Ray</a>), and many libraries.</p>



<p>The cost of model inference drives the need for specialized and highly optimized inference engines, like <a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noreferrer noopener">vLLM</a>. So, PyTorch is rarely used alone for inference, although the popular inference engines use PyTorch libraries.</p>



<p>Incidentally, the rise of generative AI has also caused a resurgence in popularity for Python, in part because Python has been the most popular language for data science, of which generative AI is a natural part.</p>



<h2 class="wp-block-heading">AI models and agents</h2>



<p>The unique capabilities of generative AI applications are provided by one or more models and agents that use them. The first wave of AI applications, often simple chatbots, used a single model that had been trained to understand human language very well, especially English, then tuned in various ways to use that language skill more effectively, such as answering questions, avoiding undesirable speech, providing factual output, etc.</p>



<p>Model architecture has rapidly evolved, including making smaller, more capable models and using collections of models (such as the <a href="https://huggingface.co/blog/moe" target="_blank" rel="noreferrer noopener">mixture of experts</a> architecture) that provide better efficiency while maintaining result quality.</p>



<p>However, models have some particular shortcomings. For example, they know nothing of events that happened after they were trained, and they are not trained on all possible specialist data needed to be effective for every possible domain. Hence, application patterns rapidly emerged to complement the strengths of models. The first pattern was <a href="https://www.ibm.com/docs/en/watsonx/saas?topic=solutions-retrieval-augmented-generation" target="_blank" rel="noreferrer noopener">RAG</a> (retrieval-augmented generation), where a repository of data is queried for relevant context information, which is then sent as <em>context</em> along with the user query to a model for inference.</p>
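<p>A minimal sketch of the RAG pattern just described (the retrieval here is naive keyword overlap standing in for real vector search; the documents and function names are invented for illustration):</p>

```python
import re

# Toy corpus standing in for the repository of data described above.
DOCS = [
    "Ray is a distributed computing framework from UC Berkeley.",
    "Kubernetes manages containerized workloads across clusters.",
    "LAMP stands for Linux, Apache, MySQL, and Perl/PHP/Python.",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, docs, k=1):
    # Naive relevance: keyword overlap; real RAG systems use embeddings.
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)),
                  reverse=True)[:k]

def build_prompt(query):
    # The retrieved snippets become the context sent alongside the query.
    context = "\n".join(retrieve(query, DOCS))
    return f"Context:\n{context}\n\nQuestion: {query}"

# build_prompt("...") is then handed to a model for inference.
```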



<p>The more general approach today is agents, which have been <a href="https://cloud.google.com/discover/what-are-ai-agents" target="_blank" rel="noreferrer noopener">defined this way</a>: “software systems that use AI to pursue goals and complete tasks on behalf of users. They show reasoning, planning, and memory and have a level of autonomy to make decisions, learn, and adapt.” Pursuing user goals can mean finding and retrieving relevant contextual data, evaluating the quality and utility of retrieved information, summarizing findings, gracefully recovering from errors, etc.</p>



<p>There is no one dominant model choice or even “family” of models. Similarly, there is no one agent framework to rule them all. This reflects not only the very rapid evolution of models and agent design patterns but also the diversity of possible AI applications, which makes it unlikely that any one choice will meet all needs.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading">Ray</h2>



<p>Model training, various forms of tuning, and model inference involve different distributed computing patterns, each requiring highly optimized implementations given the large energy consumption and related costs associated with generative AI. Single-GPU systems are too small for these tasks for the largest generative models. Even for smaller models, massive parallelism allows these processes to scale more effectively.</p>



<p>For model training, and for tuning processes that involve additional training with new data, a massive number of iterations is used: In each loop, data is passed through the model, and the model parameters (weights) are adjusted incrementally to reduce errors. These iterations must be fast and efficient. When the model parameters are distributed over several GPUs, very high-bandwidth exchange of updates is required. Training iterations have large memory footprints and massive data exchanges.</p>
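<p>The shape of that loop can be shown with a toy example in plain Python: a one-parameter “model” y = w * x, with the weight nudged downhill on each pass. Frameworks like PyTorch run this same loop shape at enormous scale, with gradients computed automatically and the work distributed across GPUs.</p>

```python
# Toy training loop: forward pass, error gradient, incremental update.
# The "model" is y = w * x with a single learnable weight.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
w, lr = 0.0, 0.05

for _ in range(200):                  # many fast iterations
    for x, y in data:
        pred = w * x                  # forward pass: data through the model
        grad = 2 * (pred - y) * x     # gradient of the squared error w.r.t. w
        w -= lr * grad                # adjust the weight to reduce error

assert abs(w - 2.0) < 1e-3            # the weight converges toward 2
```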



<p><a href="https://cloud.google.com/discover/what-is-reinforcement-learning" target="_blank" rel="noreferrer noopener">Reinforcement learning</a> is another part of tuning, used to improve more <a href="https://openai.com/form/rft-research-program/" target="_blank" rel="noreferrer noopener">complex behaviors</a> for domains. RL also requires massive amounts of fast iterations, but the size scales and data access patterns are typically smaller, more fine-grained, and more heterogeneous.</p>



<p>Finally, inference distributed computing patterns are the same as the first step in a training iteration, where data flows through a model, but there isn’t a parameter update step.</p>



<p>Ray provides the flexibility for these disparate requirements. It is a fine-grained distributed programming system with an intuitive <a href="https://en.wikipedia.org/wiki/Actor_model" target="_blank" rel="noreferrer noopener">actor model</a> abstraction. Ray was developed by researchers at the University of California, Berkeley, who needed an efficient and easy-to-use system for scaling up the computation required for their reinforcement learning and AI research. The flexibility of Ray’s abstractions and the efficiency of its implementation make Ray well suited for the new distributed computing requirements generative AI has introduced.</p>



<p><a href="https://anyscale.com" target="_blank" rel="noreferrer noopener">Anyscale</a> is a startup focused on productizing Ray. Ray’s core OSS was recently donated to the PyTorch Foundation, as mentioned above.</p>



<h2 class="wp-block-heading">Kubernetes</h2>



<p>Large-scale model training and tuning, as well as scalable application deployment patterns, introduce many practical requirements, including management of clusters of heterogeneous hardware and other resources, as well as the processes running on them. <a href="https://kubernetes.io" target="_blank" rel="noreferrer noopener">Kubernetes</a> has been the industry standard for cluster management for a decade, emerging from Google’s work on <a href="https://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/" target="_blank" rel="noreferrer noopener">Borg</a>, along with contributions from many other organizations. Kubernetes is part of the <a href="https://www.linuxfoundation.org" target="_blank" rel="noreferrer noopener">Linux Foundation</a>. The main alternatives to Kubernetes are the management tools offered by the cloud vendors: AWS, Microsoft Azure, Google Cloud, and others. The advantage of Kubernetes is that it runs seamlessly on these platforms (offered as a service, or you can “roll your own”), as well as on-premises, providing the benefits of the cloud services but without vendor lock-in.</p>



<p>At first glance, it might appear that the distributed capabilities of Ray and Kubernetes overlap, but in fact they are complementary. Ray is for very fine-grained and lightweight distributed computing and memory management, while Kubernetes provides more coarse-grained management and a broad suite of application services required in modern environments (like security, user management, logging and tracing, etc.). It is common for a <em>containerized</em> Ray application to run its own concept of clustered processes within a set of containers in a Kubernetes cluster. In fact, the open source <a href="https://github.com/ray-project/kuberay" target="_blank" rel="noreferrer noopener">KubeRay operator</a> allows you to use Ray on Kubernetes without having to be an expert in Ray or container management.</p>



<h2 class="wp-block-heading">What’s missing from PARK?</h2>



<p>LAMP was never intended to provide everything needed for website deployments. It was the core upon which additional services were added as required. PARK is similar, although the presence of Kubernetes covers a lot of the general-purpose service requirements!</p>



<p>For generative AI applications, PARK users will have to think about new requirements, in addition to all the standard practices we have used for years. Let’s discuss a few topics.</p>



<h3 class="wp-block-heading">Data and data management</h3>



<p>Conventional data management requirements and practices still apply, but AI agents are driving changes too. Ben’s post on <a href="https://gradientflow.substack.com/p/data-engineering-for-machine-users" target="_blank" rel="noreferrer noopener">data engineering for machine users</a> discusses a number of trends. For example, some providers are seeing agents dominate the creation of new database tables, and those tables are often ephemeral. Compared to humans, agents are less tolerant of database query problems and less careful about security concerns.</p>



<p>Unstructured, multimodal data is growing in importance: video and audio as well as text. Use of specialized forms of structured data is also growing, like <a href="https://www.oreilly.com/radar/unbundling-the-graph-in-graphrag/" target="_blank" rel="noreferrer noopener">knowledge graphs</a> and <a href="https://www.ibm.com/think/topics/vector-database" target="_blank" rel="noreferrer noopener">vector databases</a> for RAG applications, and <a href="https://www.databricks.com/blog/what-feature-store-complete-guide-ml-feature-engineering" target="_blank" rel="noreferrer noopener">feature stores</a> for structuring data more effectively.</p>



<h3 class="wp-block-heading">Agent orchestration</h3>



<p>Any distributed system needs careful management of the interactions between components, for purposes of security, resource management, and efficacy. The <a href="https://modelcontextprotocol.io/docs/getting-started/intro" target="_blank" rel="noreferrer noopener">Model Context Protocol</a> (MCP) and the <a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" target="_blank" rel="noreferrer noopener">Agent2Agent Protocol</a> (A2A) are two of several emerging standards that allow models to discover available agent services and learn how to use them automatically. These promising capabilities also raise many concerns about security and the need for careful control, which is driving the emergence of new gateway and service projects tailored to the specific needs of agent-based applications, for example, <a href="https://ibm.github.io/mcp-context-forge/" target="_blank" rel="noreferrer noopener">ContextForge</a>. Similarly, supporting features are being added to established tools to meet the same needs.</p>



<h3 class="wp-block-heading">Memory management</h3>



<p>Agents must manage and use the information they have acquired. This includes working within the available context limitations for their models and focusing on the most useful information, to optimize their use of resources and effectiveness. <a href="https://www.ibm.com/think/topics/ai-agent-memory" target="_blank" rel="noreferrer noopener">AI agent memory</a> is an ongoing research topic with <a href="https://medium.com/@jununhsu/6-open-source-ai-memory-tools-to-give-your-agents-long-term-memory-39992e6a3dc6" target="_blank" rel="noreferrer noopener">projects</a> and startups emerging, like <a href="https://memverge.ai" target="_blank" rel="noreferrer noopener">MemVerge</a> and <a href="https://mem0.ai" target="_blank" rel="noreferrer noopener">Mem0</a>, which emphasize the effective use of short-term (i.e., single session) memory. Established persistence tools are also being applied to the problem, e.g., <a href="https://deepwiki.com/neo4j-labs/agent-memory" target="_blank" rel="noreferrer noopener">Neo4j</a> and <a href="https://redis.io/blog/ai-agent-memory-stateful-systems/" target="_blank" rel="noreferrer noopener">Redis</a>, which also support longer-term memory across sessions.</p>



<p><a href="https://getdex.sh/what-is-dex" target="_blank" rel="noreferrer noopener">Dex</a> is a new approach that addresses a particular challenge caused by MCP and A2A: the explosion of information that gets added to the inference context memory. This memory is limited and performance quickly degrades when the context grows too large. Dex takes what an agent learns how to do once, like using MCP to learn how to query GitHub for repo information, and turns that knowledge into reusable code that both eliminates unnecessary repetition of the learning step and executes the task deterministically outside the model context. Dex also provides a form of long-term memory.</p>



<h2 class="wp-block-heading">What’s next?</h2>



<p>What are your thoughts about the PARK stack? What do you think of the four components versus alternatives? What AI application requirements do you think need more attention? Let us know!</p>
]]></content:encoded>
										</item>
		<item>
		<title>Stop Closing the Door. Fix the House.</title>
		<link>https://www.oreilly.com/radar/stop-closing-the-door-fix-the-house/</link>
				<pubDate>Tue, 17 Mar 2026 11:23:28 +0000</pubDate>
					<dc:creator><![CDATA[Angie Jones]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18281</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Stop-Closing-the-Door-Fix-the-House.png" 
				medium="image" 
				type="image/png" 
				width="1152" 
				height="896" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Stop-Closing-the-Door-Fix-the-House-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Angie Jones&#8217;s website and is being republished here with the author&#8217;s permission. I’ve been seeing more and more open source maintainers throwing up their hands over AI-generated pull requests. Going so far as to stop accepting PRs from external contributors. If you’re an open source maintainer, you’ve felt this [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><em>The following article originally appeared on <a href="https://angiejones.tech/stop-closing-the-door-fix-the-house/" target="_blank" rel="noreferrer noopener">Angie Jones&#8217;s website</a> and is being republished here with the author&#8217;s permission.</em></td></tr></tbody></table></figure>



<p>I’ve been seeing more and more open source maintainers throwing up their hands over AI-generated pull requests. Going so far as to stop accepting PRs from external contributors.</p>



<figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter"><div class="wp-block-embed__wrapper">
<blockquote class="twitter-tweet" data-width="500" data-dnt="true"><p lang="en" dir="ltr">This week we&#39;re going to begin automatically closing pull requests from external contributors. I hate this, sorry. <a href="https://t.co/85GLG7i1fU">pic.twitter.com/85GLG7i1fU</a></p>&mdash; tldraw (@tldraw) <a href="https://twitter.com/tldraw/status/2011911073834672138?ref_src=twsrc%5Etfw">January 15, 2026</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div></figure>



<figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter"><div class="wp-block-embed__wrapper">
<blockquote class="twitter-tweet" data-width="500" data-dnt="true"><p lang="en" dir="ltr">Ghostty is getting an updated AI policy. AI assisted PRs are now only allowed for accepted issues. Drive-by AI PRs will be closed without question. Bad AI drivers will be banned from all future contributions. If you&#39;re going to use AI, you better be good. <a href="https://t.co/AJRX79S8XD">https://t.co/AJRX79S8XD</a></p>&mdash; Mitchell Hashimoto (@mitchellh) <a href="https://twitter.com/mitchellh/status/2014433315261124760?ref_src=twsrc%5Etfw">January 22, 2026</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div></figure>



<figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter"><div class="wp-block-embed__wrapper">
<blockquote class="twitter-tweet" data-width="500" data-dnt="true"><p lang="en" dir="ltr">AI is killing Open Source and it’s saddening. Basically, a bunch of people who now believe they’re geniuses because of LLMs have been spamming OSS projects with junk submissions causing some maintainers to limit contributions from the general public.</p>&mdash; ASH<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1fa84.png" alt="🪄" class="wp-smiley" style="height: 1em; max-height: 1em;" /> (@ahmxrd) <a href="https://twitter.com/ahmxrd/status/2020025872342769850?ref_src=twsrc%5Etfw">February 7, 2026</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div></figure>



<p>If you’re an open source maintainer, you’ve felt this pain. We all have. It’s frustrating reviewing PRs that not only ignore the project’s coding conventions but also are riddled with AI slop.</p>



<p>But yo, what are we doing?! Closing the door on contributors isn’t the answer. Open source maintainers don’t want to hear this, but this is the way people code now, and you need to do your part to prepare your repo for AI coding assistants.</p>



<p>I’m a maintainer on <a href="https://github.com/block/goose" target="_blank" rel="noreferrer noopener">goose</a>, which has more than 300 external contributors. We felt this frustration early on, but instead of pushing well-meaning contributors away, we did the work to help them contribute with AI <em>responsibly</em>.</p>



<h2 class="wp-block-heading"><strong>1. Tell humans how to use AI on your project</strong></h2>



<p>We created a <a href="https://github.com/block/goose/blob/main/HOWTOAI.md" target="_blank" rel="noreferrer noopener">HOWTOAI.md</a> file as a straightforward guide for contributors on how to use AI tools responsibly when working on our codebase. It covers things like:</p>



<ul class="wp-block-list">
<li>What AI is good for (boilerplate, tests, docs, refactoring) and what it’s not (security critical code, architectural changes, code you don’t understand)</li>



<li>The expectation that you are accountable for every line you submit, AI-generated or not</li>



<li>How to validate AI output before opening a PR: build it, test it, lint it, understand it</li>



<li>Being transparent about AI usage in your PRs</li>
</ul>
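
<p>If you want to do something similar, a skeleton along these lines captures the same points. (This is an illustrative sketch, not the actual goose file; adapt the specifics to your project.)</p>

```markdown
# How to Use AI on This Project (sketch)

## Good uses for AI
- Boilerplate, test scaffolding, docs, mechanical refactors

## Not appropriate for AI
- Security-critical code, architectural changes,
  any code you can't explain line by line

## Before you open a PR
1. Build it locally
2. Run the test suite
3. Run the linter
4. Read every line — you are accountable for all of it

## Transparency
Note in your PR description which parts were AI-assisted.
```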



<p>This welcomes AI PRs but also sets clear expectations. Most contributors <em>want</em> to do the right thing; they just need to know what the right thing is.</p>



<p>And while you’re at it, take a fresh look at your CONTRIBUTING.md too. A lot of the problems people blame on AI are actually problems that always existed; AI just amplified them. Be specific. Don’t just say “follow the code style&#8221;; say what the code style is. Don’t just say “add tests”; show what a good test looks like in your project. The better your docs are, the better both humans and AI agents will perform.</p>



<h2 class="wp-block-heading"><strong>2. Tell the agents how to work on your project</strong></h2>



<p>Contributors aren’t the only ones who need instructions. The AI agents do too.</p>



<p>We have an <a href="https://github.com/block/goose/blob/main/AGENTS.md" target="_blank" rel="noreferrer noopener">AGENTS.md</a> file that AI coding agents can read to understand our project conventions. It includes the project structure, build commands, test commands, linting steps, coding rules, and explicit “never do this” guardrails.</p>
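
<p>The shape of such a file is simple: plain markdown the agent can parse. A hypothetical sketch (the layout, commands, and rules below are illustrative, not copied from goose’s actual AGENTS.md):</p>

```markdown
# AGENTS.md (sketch)

## Project layout
- crates/         — core Rust crates
- ui/             — desktop app
- documentation/  — docs site

## Commands
- Build:  `cargo build`
- Test:   `cargo test`
- Lint:   `cargo clippy -- -D warnings`
- Format: `cargo fmt`

## Rules
- Never commit secrets or API keys
- Never hand-edit generated files
- Run the tests before declaring a change done
```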



<p>When someone points their AI agent at our repo, the agent picks up these conventions automatically. It knows what to do and how to do it, what not to touch, how the project is structured, and how to run tests to check its work.</p>



<p>You can’t complain that AI-generated PRs don’t follow your conventions if you never told the AI what your conventions are.</p>



<h2 class="wp-block-heading"><strong>3. Use AI to review AI</strong></h2>



<p>Investing in an AI code reviewer as the first touchpoint for incoming PRs has been a game changer.</p>



<p>I already know what you’re thinking… They suck too. LOL, fair. But again, you have to guide the AI. We added <a href="https://github.com/block/goose/blob/main/.github/copilot-instructions.md" target="_blank" rel="noreferrer noopener">custom instructions</a> so the AI code reviewer knows what <em>we</em> care about.</p>



<p>We told it our priority areas: security, correctness, architecture patterns. We told it what to skip: style and formatting issues that CI already catches. We told it to only comment when it has high confidence there’s a real issue, not just nitpick for the sake of it.</p>
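
<p>Condensed into a file, that guidance might look something like this. (A sketch of the idea, not our actual instructions file.)</p>

```markdown
# Code review instructions (sketch)

Focus on:
- Security issues (injection, unsafe deserialization, leaked secrets)
- Correctness bugs (off-by-one errors, unhandled failures, races)
- Deviations from our established architecture patterns

Skip:
- Style and formatting issues — CI already catches these

Only leave a comment when you have high confidence
the issue is real. Do not nitpick.
```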



<p>Now, contributors get feedback before a maintainer ever looks at the PR. They can clean things up on their own. By the time it reaches us, the obvious stuff is already handled.</p>



<h2 class="wp-block-heading"><strong>4. Have good tests</strong></h2>



<p>No, seriously. I’ve been telling y’all this for YEARS. Anyone who follows my work knows I’ve been on the test automation soapbox for a long time. And I need everyone to hear me when I say the importance of having a solid test suite has never been higher than it is right now.</p>



<p>Tests are your safety net against bad AI-generated code. Your test suite can catch breaking changes from contributors, human or AI.</p>
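
<p>To make “good tests” concrete: a focused regression test pins down one behavior that a contributor (or their agent) could silently break. A minimal sketch in Python — the function and its contract here are invented for illustration:</p>

```python
# Hypothetical example: a small, focused regression test.
# The function and its contract are invented for illustration.

def parse_flag(arg):
    """Split a CLI flag like '--name=value' into (name, value).

    A bare flag like '--verbose' maps to (name, None).
    """
    name, sep, value = arg.partition("=")
    return (name, value if sep else None)

def test_parse_flag():
    # Each assertion pins one behavior a refactor must preserve.
    assert parse_flag("--model=gpt") == ("--model", "gpt")
    assert parse_flag("--verbose") == ("--verbose", None)
    assert parse_flag("--empty=") == ("--empty", "")

test_parse_flag()
```

A suite of tests like this turns “does this PR break anything?” from a judgment call into a command you can run.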



<p>Without good test coverage, you’re doing manual review on every PR trying to reason about correctness in your head. That’s not sustainable with five contributors, let alone 50 of them, half of whom are using AI.</p>



<h2 class="wp-block-heading"><strong>5. Automate the boring gatekeeping with CI</strong></h2>



<p>Your CI pipeline should also be doing the heavy lifting on quality checks so you don’t have to. Linting, formatting, and type checking should all run automatically on every PR.</p>



<p>This isn’t new advice, but it matters more now. When you have clear, automated checks that run on every PR, you create an objective quality bar. The PR either passes or it doesn’t. Doesn’t matter if a human wrote it or an AI wrote it.</p>
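
<p>As a rough sketch, a quality-gate workflow for a Rust project might look like this. (Illustrative only, not goose’s actual CI; swap in the tools your project uses.)</p>

```yaml
# Sketch of a PR quality gate (GitHub Actions).
name: pr-checks
on: pull_request

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Format check
        run: cargo fmt --check
      - name: Lint
        run: cargo clippy -- -D warnings
      - name: Tests
        run: cargo test
```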



<p>For example, in goose, we run a GitHub Action on any PR that involves reusable prompts or AI instructions to ensure they don’t contain prompt injections or anything else that’s sketchy.</p>



<p>Think about what’s unique to your project and see if you can throw some CI checks at it to keep quality high.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>I understand the impulse to lock things down, but y’all, we can’t give up on the thing that makes open source special.</p>



<p>Don’t close the door on your projects. Raise the bar, then give people (and their AI tools) the information they need to clear it.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>On March 26, join Addy Osmani and Tim O’Reilly at AI Codecon: Software Craftsmanship in the Age of AI, where an all-star lineup of experts will go deeper into orchestration, agent coordination, and the new skills developers need to build excellent software that creates value for all participants.&nbsp;</em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></p>
</blockquote>
]]></content:encoded>
										</item>
	</channel>
</rss>
