<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Mon, 13 Apr 2026 11:12:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Comprehension Debt: The Hidden Cost of AI-Generated Code</title>
		<link>https://www.oreilly.com/radar/comprehension-debt-the-hidden-cost-of-ai-generated-code/</link>
				<comments>https://www.oreilly.com/radar/comprehension-debt-the-hidden-cost-of-ai-generated-code/#respond</comments>
				<pubDate>Mon, 13 Apr 2026 11:11:45 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18525</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Comprehension-debt.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Comprehension-debt-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Addy Osmani&#8217;s blog site and is being reposted here with the author&#8217;s permission. Comprehension debt is the hidden cost to human intelligence and memory resulting from excessive reliance on AI and automation. For engineers, it applies most to agentic engineering. There’s a cost that doesn’t show up in your [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on <a href="https://addyosmani.com/blog/comprehension-debt/">Addy Osmani&#8217;s blog site</a> and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p><strong>Comprehension debt is the hidden cost to human intelligence and memory resulting from excessive reliance on AI and automation. For engineers, it applies most to agentic engineering.</strong></p>



<p>There’s a cost that doesn’t show up in your velocity metrics when teams go deep on AI coding tools. Especially when its tedious to review all the code the AI generates. This cost accumulates steadily, and eventually it has to be paid—with interest. It’s called comprehension debt or <a href="https://www.media.mit.edu/publications/your-brain-on-chatgpt/" target="_blank" rel="noreferrer noopener">cognitive debt.</a></p>



<p>Comprehension debt is the <strong>growing gap</strong> between how much code exists in your system and <strong>how much of it any human being genuinely understands</strong>.</p>



<p>Unlike technical debt, which announces itself through mounting friction—slow builds, tangled dependencies, the creeping dread every time you touch that one module—comprehension debt breeds false confidence. The codebase looks clean. The tests are green. The reckoning arrives quietly, usually at the worst possible moment.</p>



<p><a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" target="_blank" rel="noreferrer noopener">Margaret-Anne Storey</a> describes a student team that hit this wall in week seven: They could no longer make simple changes <strong>without breaking something unexpected</strong>. The real problem wasn’t messy code. It was that no one on the team could explain why design decisions had been made or how different parts of the system were supposed to work together. The theory of the system had evaporated.</p>



<p>That’s comprehension debt compounding in real time.</p>



<p>I’ve read Hacker News threads that captured engineers genuinely wrestling with the structural version of this problem—not the familiar optimism versus skepticism binary, but a field trying to figure out what rigor actually looks like when the bottleneck has moved.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1600" height="900" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-9-1600x900.png" alt="How AI assistance impacts coding speed and skill formation" class="wp-image-18526" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-9-1600x900.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-9-300x169.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-9-768x432.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-9-1536x864.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-9-2048x1153.png 2048w" sizes="(max-width: 1600px) 100vw, 1600px" /></figure>



<p>A recent Anthropic study titled “<a href="https://www.anthropic.com/research/AI-assistance-coding-skills" target="_blank" rel="noreferrer noopener">How AI Impacts Skill Formation</a>” highlighted the potential downsides of over-reliance on AI coding assistants. In a randomized controlled trial with 52 software engineers learning a new library, participants who used AI assistance completed the task in roughly the same time as the control group but scored 17% lower on a follow-up comprehension quiz (50% versus 67%). The largest declines occurred in debugging, with smaller but still significant drops in conceptual understanding and code reading. <strong>The researchers emphasize that passive delegation (“just make it work”) impairs skill development far more than active, question-driven use of AI.</strong> The full paper is available at <a href="https://arxiv.org/abs/2601.20245" target="_blank" rel="noreferrer noopener">arXiv.org</a>.</p>



<h2 class="wp-block-heading"><strong>There is a speed asymmetry problem here</strong></h2>



<p><strong>AI generates code far faster than humans can evaluate it.</strong> That sounds obvious, but the implications are easy to underestimate.</p>



<p><strong>When a developer on your team writes code, the human review process has always been a bottleneck—but a productive and educational one</strong>. Reading their PR forces comprehension. It surfaces hidden assumptions, catches design decisions that conflict with how the system was architected six months ago, and distributes knowledge about what the codebase actually does across the people responsible for maintaining it.</p>



<p>AI-generated code breaks that feedback loop. The volume is too high. The output is syntactically clean, often well-formatted, superficially correct—precisely the signals that historically triggered merge confidence. But surface correctness is not systemic correctness. The codebase looks healthy while comprehension quietly hollows out underneath it.</p>



<p>I read one engineer say that the bottleneck has always been a competent developer understanding the project. AI doesn’t change that constraint. It creates the illusion you’ve escaped it.</p>



<p>And the inversion is sharper than it looks. When code was expensive to produce, senior engineers could review faster than junior engineers could write. <strong>AI flips this: A junior engineer can now generate code faster than a senior engineer can critically audit it.</strong> The rate-limiting factor that kept review meaningful has been removed. <strong>What used to be a quality gate is now a throughput problem.</strong></p>



<h2 class="wp-block-heading"><strong>I love tests, but they aren’t a complete answer</strong></h2>



<p>The instinct to lean harder on deterministic verification—unit tests, integration tests, static analysis, linters, formatters—is understandable. I do this a lot in projects heavily leaning on <a href="https://addyosmani.com/agentic-engineering/ai-coding-agent/" target="_blank" rel="noreferrer noopener">AI coding agents</a>. Automate your way out of the review bottleneck. Let machines check machines.</p>



<p>This helps. It has a hard ceiling.</p>



<p>A test suite capable of covering all observable behavior would, in many cases, be more complex than the code it validates. Complexity you can’t reason about doesn’t provide safety though. And beneath that is a more fundamental problem: You can’t write a test for behavior you haven’t thought to specify.</p>



<p>Nobody writes a test asserting that dragged items shouldn’t turn completely transparent. Of course they didn’t. That possibility never occurred to them. That’s exactly the class of failure that slips through, not because the test suite was poorly written, but because no one thought to look there.</p>



<p>There’s also a specific failure mode worth naming. <strong>When an AI changes implementation behavior and updates hundreds of test cases to match the new behavior, the question shifts from “is this code correct?” to “were all those test changes necessary, and do I have enough coverage to catch what I’m not thinking about?”</strong> Tests cannot answer that question. Only comprehension can.</p>



<p>The data is starting to back this up. Research suggests that developers using AI for code generation delegation score below 40% on comprehension tests, while developers using AI for conceptual inquiry—asking questions, exploring tradeoffs—score above 65%. The tool doesn’t destroy understanding. How you use it does.</p>



<p>Tests are necessary. They are not sufficient.</p>



<h2 class="wp-block-heading"><strong>Lean on specs, but they’re also not the full story.</strong></h2>



<p>A common proposed solution: Write a detailed natural language spec first. Include it in the PR. Review the spec, not the code. Trust that the AI faithfully translated intent into implementation.</p>



<p>This is appealing in the same way Waterfall methodology was once appealing. Rigorously define the problem first, then execute. Clean separation of concerns.</p>



<p>The problem is that translating a spec to working code involves an enormous number of implicit decisions—edge cases, data structures, error handling, performance tradeoffs, interaction patterns—that no spec ever fully captures. <strong>Two engineers implementing the same spec will produce systems with many observable behavioral differences.</strong> Neither implementation is wrong. They’re just different. And many of those differences will eventually matter to users in ways nobody anticipated.</p>



<p>There’s another possibility with detailed specs worth calling out: A spec detailed enough to fully describe a program is more or less the program, just written in a non-executable language. The organizational cost of writing specs thorough enough to substitute for review may well exceed the productivity gains from using AI to execute them. And you still haven’t reviewed what was actually produced.</p>



<p>The deeper issue is that there is often no <em>correct</em> spec. Requirements emerge through building. Edge cases reveal themselves through use. The assumption that you can fully specify a non-trivial system before building it has been tested repeatedly and found wanting. AI doesn’t change this. It just adds a new layer of implicit decisions made without human deliberation.</p>



<h2 class="wp-block-heading"><strong>Learn from history</strong></h2>



<p>Decades of managing software quality across distributed teams with varying context and communication bandwidth has produced real, tested practices. Those don’t evaporate because the team member is now a model.</p>



<p><strong>What changes with AI is cost (dramatically lower), speed (dramatically higher), and interpersonal management overhead (essentially zero). What doesn’t change is the need for someone with a deep system context to maintain a coherent understanding of what the codebase is actually doing and why.</strong></p>



<p>This is the uncomfortable redistribution that comprehension debt forces.</p>



<p>As AI volume goes up, the engineer who truly understands the system becomes more valuable, not less. The ability to look at a diff and immediately know which behaviors are load-bearing. To remember why that architectural decision got made under pressure eight months ago.</p>



<p>To tell the difference between a refactor that’s safe and one that’s quietly shifting something users depend on. That skill becomes the scarce resource the whole system depends on.</p>



<h2 class="wp-block-heading"><strong>There’s a bit of a measurement gap here too</strong></h2>



<p><strong>The reason comprehension debt is so dangerous is that nothing in your current measurement system captures it.</strong></p>



<p>Velocity metrics look immaculate. DORA metrics hold steady. PR counts are up. Code coverage is green.</p>



<p>Performance calibration committees see velocity improvements. They cannot see comprehension deficits because no artifact of how organizations measure output captures that dimension. The incentive structure optimizes correctly for what it measures. What it measures no longer captures what matters.</p>



<p>This is what makes comprehension debt more insidious than technical debt. Technical debt is usually a conscious tradeoff—you chose the shortcut, you know roughly where it lives, you can schedule the paydown. Comprehension debt accumulates invisibly, often without anyone making a deliberate decision to let it. It’s the aggregate of hundreds of reviews where the code looked fine and the tests were passing and there was another PR in the queue.</p>



<p>The organizational assumption that reviewed code is understood code no longer holds. Engineers approved code they didn’t fully understand, which now carries implicit endorsement. The liability has been distributed without anyone noticing.</p>



<h2 class="wp-block-heading"><strong>The regulation horizon is closer than it looks</strong></h2>



<p>Every industry that moved too fast eventually attracted regulation. Tech has been unusually insulated from that dynamic, partly because software failures are often recoverable, and partly because the industry has moved faster than regulators could follow.</p>



<p>That window is closing. When AI-generated code is running in healthcare systems, financial infrastructure, and government services, “the AI wrote it and we didn’t fully review it” will not hold up in a post-incident report when lives or significant assets are at stake.</p>



<p>Teams building comprehension discipline now—treating genuine understanding, not just passing tests, as non-negotiable—will be better positioned when that reckoning arrives than teams that optimized purely for merge velocity.</p>



<h2 class="wp-block-heading"><strong>What comprehension debt actually demands</strong></h2>



<p>The right question for now isn’t “how do we generate more code?” It’s “how do we actually understand more of what we’re shipping?” so we can make sure our users get a consistently high quality experience.</p>



<p>That reframe has practical consequences. It means being ruthlessly explicit about what a change is supposed to do before it’s written. It means treating verification not as an afterthought but as a structural constraint. It means maintaining the system-level mental model that lets you catch AI mistakes at architectural scale rather than line-by-line. And it means being honest about the difference between “the tests passed” and “I understand what this does and why.”</p>



<p><strong>Making code cheap to generate doesn’t make understanding cheap to skip. The comprehension work is the job.</strong></p>



<p>AI handles the translation, but someone still has to understand what was produced, why it was produced that way, and whether those implicit decisions were the right ones—or you’re just deferring a bill that will eventually come due in full.</p>



<p>You will pay for comprehension sooner or later. The debt accrues interest rapidly.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/comprehension-debt-the-hidden-cost-of-ai-generated-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Agents don&#8217;t know what good looks like. And that&#8217;s exactly the problem.</title>
		<link>https://www.oreilly.com/radar/agents-dont-know-what-good-looks-like-and-thats-exactly-the-problem/</link>
				<comments>https://www.oreilly.com/radar/agents-dont-know-what-good-looks-like-and-thats-exactly-the-problem/#respond</comments>
				<pubDate>Fri, 10 Apr 2026 13:31:27 +0000</pubDate>
					<dc:creator><![CDATA[Luca Mezzalira]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18521</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Agents-dont-know-what-good-looks-like.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Agents-dont-know-what-good-looks-like-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A reaction to the Neal Ford and Sam Newman fireside chat on agentic AI and software architecture]]></custom:subtitle>
		
				<description><![CDATA[Luca Mezzalira, author of Building Micro-Frontends, originally shared the following article on LinkedIn. It’s being republished here with his permission. Every few years, something arrives that promises to change how we build software. And every few years, the industry splits predictably: One half declares the old rules dead; the other half folds its arms and [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Luca Mezzalira, author of </em>Building Micro-Frontends<em>, </em><a href="https://www.linkedin.com/pulse/agents-dont-know-what-good-looks-like-thats-exactly-luca-mezzalira-sgwte/?trackingId=XGn%2B%2FupMOGKSQTe618fEZw%3D%3D" target="_blank" rel="noreferrer noopener"><em>originally shared the following article on LinkedIn</em></a><em>. It’s being republished here with his permission.</em></p>
</blockquote>



<p>Every few years, something arrives that promises to change how we build software. And every few years, the industry splits predictably: One half declares the old rules dead; the other half folds its arms and waits for the hype to pass. Both camps are usually wrong, and both camps are usually loud. What&#8217;s rarer, and more useful, is someone standing in the middle of that noise and asking the structural questions: Not “What can this do?” but “What does it mean for how we design systems?”</p>



<p>That&#8217;s what Neal Ford and Sam Newman did in <a href="https://www.youtube.com/watch?v=sM-C6vNvIDI" target="_blank" rel="noreferrer noopener">their recent fireside chat</a> on agentic AI and software architecture during O&#8217;Reilly’s <a href="https://learning.oreilly.com/videos/software-architecture-superstream/0642572277543/" target="_blank" rel="noreferrer noopener">Software Architecture Superstream</a>. It&#8217;s a conversation worth pulling apart carefully, because some of what they surface is more uncomfortable than it first appears.</p>



<h2 class="wp-block-heading"><strong>The Dreyfus trap</strong></h2>



<p>Neal opens with the <a href="https://www.bumc.bu.edu/facdev-medicine/files/2012/03/Dreyfus-skill-level.pdf" target="_blank" rel="noreferrer noopener">Dreyfus Model of Knowledge Acquisition</a>, originally developed for the nursing profession but applicable to any domain. The model maps learning across five stages:</p>



<ul class="wp-block-list">
<li>Novice</li>



<li>Advanced beginner</li>



<li>Competent</li>



<li>Proficient</li>



<li>Expert</li>
</ul>



<p>His claim is that current agentic AI is stuck somewhere between novice and advanced beginner: It can follow recipes, it can even apply recipes from adjacent domains when it gets stuck, but it doesn&#8217;t <em>understand</em> why any of those recipes work. This isn&#8217;t a minor limitation. It&#8217;s structural.</p>



<p>The canonical example Neal gives is beautiful in its simplicity: An agent tasked with making all tests pass encounters a failing unit test. One perfectly valid way to make a failing test pass is to replace its assertion with <em>assert True</em>. That&#8217;s not a hack in the agent&#8217;s mind. It&#8217;s a solution. There&#8217;s no ethical framework, no professional judgment, no instinct that says <em>this isn&#8217;t what we meant</em>. Sam extends this immediately with something he&#8217;d literally seen shared on LinkedIn that week: an agent that had modified the build file to silently ignore failed steps rather than fix them. The build passed. The problem remained. Congratulations all-round.</p>



<p>What&#8217;s interesting here is that neither Ford nor Newman are being dismissive of AI capability. The point is more subtle: The creativity that makes these agents genuinely useful, their ability to search solution space in ways humans wouldn&#8217;t think to, is inseparable from the same property that makes them dangerous. You can&#8217;t fully lobotomize the improvization without destroying the value. This is a design constraint, not a bug to be patched.</p>



<p>And when you zoom out, this is part of a broader signal. When experienced practitioners who&#8217;ve spent decades in this industry independently converge on calls for restraint and rigor rather than acceleration, that convergence is worth paying attention to. It&#8217;s not pessimism. It&#8217;s pattern recognition from people who&#8217;ve lived through enough cycles to know what the warning signs look like.</p>



<h2 class="wp-block-heading"><strong>Behavior versus capabilities</strong></h2>



<p>One of the most important things Neal says, and I think it gets lost in the overall density of the conversation, is the distinction between <em>behavioral verification</em> and <em>capability verification</em>.</p>



<p>Behavioral verification is what most teams default to: unit tests, functional tests, integration tests. Does the code do what it&#8217;s supposed to do according to the spec? This is the natural fit for agentic tooling, because agents are actually getting pretty good at implementing behavior against specs. Give an agent a well-defined interface contract and a clear set of acceptance criteria, and it will produce something that broadly satisfies them. This is real progress.</p>



<p>Capability verification is harder. Much harder. Does the system exhibit the operational qualities it needs to exhibit at scale? Is it properly decoupled? Is the security model sound? What happens at 20,000 requests per second? Does it fail gracefully or catastrophically? These are things that most human developers struggle with too, and agents have been trained on human-generated code, which means they&#8217;ve inherited our failure modes as well as our successes.</p>



<p>This brings me to something <a href="https://learning.oreilly.com/search/?q=author%3A%20%22Birgitta%20Boeckeler%22&amp;rows=100&amp;suggested=true&amp;suggestionType=author&amp;originalQuery=birgitta%20boe" target="_blank" rel="noreferrer noopener">Birgitta Boeckeler</a> raised at <a href="https://qconlondon.com/keynote/mar2026/state-play-ai-coding-assistants" target="_blank" rel="noreferrer noopener">QCon London</a> that I haven&#8217;t been able to stop thinking about. The example everyone cites when making the case for AI&#8217;s coding capability is that Anthropic built a C compiler from scratch using agents. Impressive. But here&#8217;s the thing: C compiler documentation is extraordinarily well-specified and battle-tested over decades, and the test coverage for compiler behavior is some of the most rigorous in the entire software industry. That&#8217;s as close to a solved, well-bounded problem as you can get.</p>



<p>Enterprise software is almost never like that. Enterprise software is ambiguous requirements, undocumented assumptions, tacit knowledge living in the heads of people who left three years ago, and test coverage that exists more as aspiration than reality. The gap between &#8220;can build a C compiler&#8221; and &#8220;can reliably modernize a legacy ERP&#8221; is not a gap of raw capability. It&#8217;s a gap of specification quality and domain legibility. That distinction matters enormously for how we think about where agentic tooling can safely operate.</p>



<p>The current orthodoxy in agentic development is to throw more context at the problem: elaborate context files, architecture decision records, guidelines, rules about what not to do. Ford and Newman are appropriately skeptical. Sam makes the point that there&#8217;s now empirical evidence suggesting that as context file size increases, you see degradation in output quality, not improvement. You&#8217;re not guiding the agent toward better judgment. You&#8217;re just accumulating scar tissue from previous disasters. This isn&#8217;t unique to agentic workflows either. Anyone who has worked seriously with code assistants knows that summarization quality degrades as context grows, and that this degradation is only partially controllable. That has a direct impact on decisions made over time; now close your eyes for a moment and imagine doing it across an enterprise software, with many teams across different time zones. Don&#8217;t get me wrong, the tools help, but the help is bounded, and that boundary is often closer than we&#8217;d like to admit.</p>



<p>The more honest framing, which Neal alludes to, is that we need <em>deterministic guardrails</em> around <em>nondeterministic agents</em>. Not more prompting. Architectural fitness functions, an idea Ford and Rebecca Parsons <a href="https://learning.oreilly.com/library/view/building-evolutionary-architectures/9781491986356/ch02.html" target="_blank" rel="noreferrer noopener">have been promoting since 2017</a>, feel like they&#8217;re finally about to have their moment, precisely because the cost of not having them is now immediately visible.</p>



<h2 class="wp-block-heading"><strong>What should an agent own then?</strong></h2>



<p>This is where the conversation gets most interesting, and where I think the field is most confused.</p>



<p>There&#8217;s a seductive logic to the microservice as the unit of agentic regeneration. It sounds small. The word <em>micro</em> is in the name. You can imagine handing an agent a service with a defined API contract and saying: implement this, test it, done. The scope feels manageable.</p>



<p>Ford and Newman give this idea fair credit, but they&#8217;re also honest about the gap. The microservice level is attractive architecturally because it comes with an implied boundary: a process boundary, a deployment boundary, often a data boundary. You can put fitness functions around it. You can say this service must handle X load, maintain Y error rate, expose Z interface. In theory.</p>



<p>In practice, we barely enforce this stuff ourselves. The agents have learned from a corpus of human-written microservices, which means they&#8217;ve learned from the vast majority of microservices that were written without proper decoupling, without real resilience thinking, without any rigorous capacity planning. They don&#8217;t have our aspirations. They have our habits.</p>



<p>The deeper problem, which Neal raises and which I think deserves more attention than it gets, is <em>transactional coupling</em>. You can design five beautifully bounded services and still produce an architectural disaster if the workflow that ties them together isn&#8217;t thought through. Sagas, event choreography, compensation logic: This is the stuff that breaks real systems, and it&#8217;s also the stuff that&#8217;s hardest to specify, hardest to test, and hardest for an agent to reason about. We made exactly this mistake in the SOA era. We designed lovely little services and then discovered that the interesting complexity had simply migrated into the integration layer, which nobody owned and nobody tested.</p>



<p>Sam&#8217;s line here is worth quoting directly, roughly: “<em>To err is human, but it takes a computer to really screw things up</em>.” I suspect we&#8217;re going to produce some genuinely legendary transaction management disasters before the field develops the muscle memory to avoid them.</p>



<h2 class="wp-block-heading"><strong>The sociotechnical gap nobody is talking about</strong></h2>



<p>There&#8217;s a dimension to this conversation that Ford and Newman gesture toward but that I think deserves much more direct examination: <em>the question of what happens to the humans on the other side of this generated software.</em></p>



<p>It&#8217;s not completely accurate to say that all agentic work is happening on greenfield projects. There are tools already in production helping teams migrate legacy ERPs, modernize old codebases, and tackle the modernization challenge that has defeated conventional approaches for years. That&#8217;s real, and it matters.</p>



<p>But the challenge in those cases isn&#8217;t merely the code. It&#8217;s whether the sociotechnical system, the teams, the processes, the engineering culture, the organizational structures built around the existing software are ready to inherit what gets built. And here&#8217;s the thing: Even if agents combined with deterministic guardrails could produce a well-structured microservice architecture or a clean modular monolith in a fraction of the time it would take a human team, that architectural output doesn&#8217;t automatically come with organizational readiness. The system can arrive before the people are prepared to own it.</p>



<p>One of the underappreciated functions of iterative migration, the incremental strangler fig approach, the slow decomposition of a monolith over 18 months, is not primarily risk reduction, though it does that too. It&#8217;s learning. It&#8217;s the process by which a team internalizes a new way of working, makes mistakes in a bounded context, recovers, and builds the judgment that lets them operate confidently in the new world. Compress that journey too aggressively and you can end up with architecture whose operational complexity exceeds the organizational capacity to manage it. That gap tends to be expensive.</p>



<p>At QCon London, I asked Patrick Debois, after a talk covering best practices for AI-assisted development, whether applying all of those practices consistently would make him comfortable working on enterprise software with real complexity. His answer was: It depends. That felt like the honest answer. The tooling is improving. Whether the humans around it are keeping pace is a separate question, and one the industry is not spending nearly enough time on.</p>



<h2 class="wp-block-heading"><strong>Existing systems</strong></h2>



<p>Ford and Newman close with a subject that almost never gets covered in these conversations: the vast, unglamorous majority of software that already exists and that our society depends on in ways that are easy to underestimate.</p>



<p>Most of the discourse around agentic AI and software development is implicitly greenfield. It assumes you&#8217;re starting fresh, that you get to design your architecture sensibly from the beginning, that you have clean APIs and tidy service boundaries. The reality is that most valuable software in the world was written before any of this existed, runs on platforms and languages that aren&#8217;t the natural habitat of modern AI tooling, and contains decades of accumulated decisions that nobody fully understands anymore.</p>



<p>Sam is working on a book about this: how to adapt existing architectures to enable AI-driven functionality in ways that are actually safe. He makes the interesting point that existing systems, despite their reputation, sometimes give you a head start. A well-structured relational schema carries implicit meaning about data ownership and referential integrity that an agent can actually reason from. There&#8217;s structure there, if you know how to read it.</p>



<p>The general lesson, which he states without much drama, is that you can&#8217;t just expose an existing system through an MCP server and call it done. The interface is not the architecture. The risks around security, data exposure, and vendor dependency don&#8217;t go away because you&#8217;ve wrapped something in a new protocol.</p>



<p>This matters more than it might seem, because the software that runs our financial systems, our healthcare infrastructure, our logistics and supply chains, is not greenfield and never will be. If we get the modernization of those systems wrong, the consequences are not abstract. They are social. The instinct to index heavily on what these tools can do in ideal conditions, on well-specified problems with good documentation and thorough test coverage, is understandable. But it&#8217;s exactly the wrong instinct when the systems in question are the ones our lives depend on. The architectural mindset that has served us well through previous paradigm shifts, the one that starts with trade-offs rather than capabilities, that asks what we are giving up rather than just what we are gaining, is not optional here. It&#8217;s the minimum requirement for doing this responsibly.</p>



<h2 class="wp-block-heading"><strong>What I take away from this</strong></h2>



<p>Three things, mostly.</p>



<p>The first is that introducing deterministic guardrails into nondeterministic systems is not optional. It&#8217;s imperative. We are still figuring out exactly where and how, but the framing needs to shift: The goal is control over <em>outcomes</em>, not just oversight of <em>output</em>. There&#8217;s a difference. Output is what the agent generates. Outcome is whether the system it generates actually behaves correctly under production conditions, stays within architectural boundaries, and remains operable by the humans responsible for it. Fitness functions, capability tests, boundary definitions: the boring infrastructure that connects generated code to the real constraints of the world it runs in. We&#8217;ve had the tools to build this for years.</p>



<p>The second is that the people saying this is the future and the people saying this is just another hype cycle are both probably wrong in interesting ways. Ford and Newman are careful to say they don&#8217;t know what good looks like yet. Neither do I. But we have better prior art to draw on than the discourse usually acknowledges. The principles that made microservices work, when they worked, real decoupling, explicit contracts, operational ownership, apply here too. The principles that made microservices fail, leaky abstractions, distributed transactions handled badly, complexity migrating into integration layers, will cause exactly the same failures, just faster and at larger scale.</p>



<p>The third is something I took away from QCon London this year, and I think it might be the most important of the three. Across two days of talks, including sessions that took diametrically opposite approaches to integrating AI into the software development lifecycle, one thing became clear: We are all beginners. Not in the dismissive sense but in the most literal application of the Dreyfus model. Nobody, regardless of experience, has figured out the right way to fit these tools inside a sociotechnical system. The recipes are still being written. The war stories that will eventually become the prior art are still happening to us right now.</p>



<p>What got us here, collectively, was sharing what we saw, what worked, what failed, and why. That&#8217;s how the field moved from SOA disasters to microservices best practices. That&#8217;s how we built a shared vocabulary around fitness functions and evolutionary architecture. The same process has to happen again, and it will, but only if people with real experience are honest about the uncertainty rather than performing confidence they don&#8217;t have. The speed, ultimately, is both the opportunity and the danger. The technology is moving faster than the organizations, the teams, and the professional instincts that need to absorb it. The best response to that isn&#8217;t to pretend otherwise. It&#8217;s to keep comparing notes.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>If this resonated, the </em><a href="https://www.youtube.com/watch?v=sM-C6vNvIDI" target="_blank" rel="noreferrer noopener"><em>full fireside chat between Neal Ford and Sam Newman</em></a><em> is worth watching in its entirety. They cover more ground than I&#8217;ve had space to react to here. And if you’d like to learn more from Neal, Sam, and Luca, check out their most recent O’Reilly books: </em><a href="https://learning.oreilly.com/library/view/building-resilient-distributed/9781098163532/" target="_blank" rel="noreferrer noopener">Building Resilient Distributed Systems</a><em>, </em><a href="https://learning.oreilly.com/library/view/-/9798341640368/" target="_blank" rel="noreferrer noopener">Architecture as Code</a><em>, and </em><a href="https://learning.oreilly.com/library/view/-/9781098170776/" target="_blank" rel="noreferrer noopener">Building Micro-frontends<em>, second edition</em></a><em>.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/agents-dont-know-what-good-looks-like-and-thats-exactly-the-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Architecture as Code to Teach Humans and Agents About Architecture</title>
		<link>https://www.oreilly.com/radar/architecture-as-code-to-teach-humans-and-agents-about-architecture/</link>
				<comments>https://www.oreilly.com/radar/architecture-as-code-to-teach-humans-and-agents-about-architecture/#respond</comments>
				<pubDate>Thu, 09 Apr 2026 11:17:33 +0000</pubDate>
					<dc:creator><![CDATA[Neal Ford and Mark Richards]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Architecture]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18516</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Architecture-as-Code-to-Teach-Humans-and-Agents-About-Architecture.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Architecture-as-Code-to-Teach-Humans-and-Agents-About-Architecture-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[A funny thing happened on the way to writing our book Architecture as Code—the entire industry shifted. Generally, we write books iteratively—starting with a seed of an idea, then developing it through workshops, conference presentations, online classes, and so on. That&#8217;s exactly what we did about a year ago with our Architecture as Code book. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>A funny thing happened on the way to writing our book <a href="https://learning.oreilly.com/library/view/architecture-as-code/9798341640368/" target="_blank" rel="noreferrer noopener"><em>Architecture as Code</em></a>—the entire industry shifted. Generally, we write books iteratively—starting with a seed of an idea, then developing it through workshops, conference presentations, online classes, and so on. That&#8217;s exactly what we did about a year ago with our <em>Architecture as Code</em> book. We started with the concept of describing all the ways that software architecture intersects with other parts of the software development ecosystem: data, engineering practices, team topologies, and more—nine in total—in code, as a way of creating a fast feedback loop for architects to react to changes in architecture. In other words, we&#8217;re documenting the architecture through code, defining the structure and constraints we want to guide the implementation through.</p>



<p>For example, an architect can define a set of components via a diagram, along with their dependencies and relationships. That design reflects careful thought about coupling, cohesion, and a host of other structural concerns. However, when they turn that diagram over to a team to develop it, how can they be sure the team will implement it correctly? By defining the components in code (with verifications), the architect can both illustrate and get feedback on the design. However, we recognize that architects don&#8217;t have a crystal ball, and design should sometimes change to reflect implementation. When a developer adds a new component, it isn&#8217;t necessarily an error but rather feedback that an architect needs to know. This isn&#8217;t a testing framework; it&#8217;s a feedback framework. When a new component appears, the architect should know so that they can assess: Should the component be there? Perhaps it was missed in the design. If so, how does that affect other components? Having the structure of your architect defined as code allows deterministic feedback on structural integrity.</p>



<p>This capability is useful for developers. We defined these intersections as a way of describing all different aspects of architecture in a deterministic way. Then agents arrived.</p>



<p>Agentic AI shows new capabilities in software architecture, including the ability to work toward a solution as long as deterministic constraints exist. Suddenly, developers and architects are trying to build ways for agents to determine success, which requires a deterministic method of defining these important constraints: Architecture as Code.</p>



<p>An increasingly common practice in agentic AI is separating foundational constraints from desired behavior. For example, part of the context or guardrails developers use for code generation can include concrete architectural constraints around code structure, complexity, coupling, cohesion, and a host of other measurable things. As architects can objectively define what an acceptable architecture is, they can build inviolate rules for agents to adhere to. For example, a problem that is gradually improving is the tendency of LLMs to use brute force to solve problems. If you ask for an algorithm that touches all 50 US states, it might build a 50-stage switch statement. However, if one of the architect&#8217;s foundational rules for code generation puts a limit on cyclomatic complexity, then the agent will have to find a way to generate code within that constraint.</p>



<p>This capability exists for all nine of the intersections we cover in <em>Architecture as Code</em>: implementation, engineering practices, infrastructure, generative AI, team topologies, business concerns, enterprise architect, data, and integration architecture.</p>



<p>Increasingly, we see the job of developers, but especially architects, being able to precisely and objectively define architecture, and we built a framework for both how to do it and where to do it in <em>Architecture as Code</em>–coming soon!</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/architecture-as-code-to-teach-humans-and-agents-about-architecture/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>AI-Infused Development Needs More Than Prompts</title>
		<link>https://www.oreilly.com/radar/ai-infused-development-needs-more-than-prompts/</link>
				<comments>https://www.oreilly.com/radar/ai-infused-development-needs-more-than-prompts/#respond</comments>
				<pubDate>Wed, 08 Apr 2026 16:02:50 +0000</pubDate>
					<dc:creator><![CDATA[Markus Eisele]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18511</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-infused-development-needs-more-than-prompts.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-infused-development-needs-more-than-prompts-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why intent and control are becoming the new software architecture]]></custom:subtitle>
		
				<description><![CDATA[The current conversation about AI in software development is still happening at the wrong layer. Most of the attention goes to code generation. Can the model write a method, scaffold an API, refactor a service, or generate tests? Those things matter, and they are often useful. But they are not the hard part of enterprise [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>The current conversation about AI in software development is still happening at the wrong layer.</p>



<p>Most of the attention goes to code generation. Can the model write a method, scaffold an API, refactor a service, or generate tests? Those things matter, and they are often useful. But they are not the hard part of enterprise software delivery. In real organizations, teams rarely fail because nobody could produce code quickly enough. They fail because intent is unclear, architectural boundaries are weak, local decisions drift away from platform standards, and verification happens too late.</p>



<p>That becomes even more obvious once AI enters the workflow. AI does not just accelerate implementation. It accelerates whatever conditions already exist around the work. If the team has clear constraints, good context, and strong verification, AI can be a powerful multiplier. If the team has ambiguity, tacit knowledge, and undocumented decisions, AI amplifies those too.</p>



<p>That is why the next phase of AI-infused development will not be defined by prompt cleverness. It will be defined by how well teams can make intent explicit and how effectively they can keep control close to the work.</p>



<p>This shift has become clearer to me through recent work around <a href="https://ibm.com/bob" target="_blank" rel="noreferrer noopener">IBM Bob</a>, an AI-powered development partner I have been working with closely for a couple of months now, and the broader patterns emerging in AI-assisted development.</p>



<p>The real value is not that a model can write code. The real value appears when AI operates inside a system that exposes the right context, limits the action space, and verifies outcomes before bad assumptions spread.</p>



<h2 class="wp-block-heading"><strong>The code generation story is too small</strong></h2>



<p>The market likes simple narratives, and “AI helps developers write code faster” is a simple narrative. It demos well. You can measure it in isolated tasks. It produces screenshots and benchmark charts. It also misses the point.</p>



<p>Enterprise development is not primarily a typing problem. It is a coordination problem. It is an architecture problem. It is a constraints problem.</p>



<p>A useful change in a large Java codebase is rarely just a matter of producing syntactically correct code. The change has to fit an existing domain model, respect service boundaries, align with platform rules, use approved libraries, satisfy security requirements, integrate with CI and testing, and avoid creating support headaches for the next team that touches it. The code is only one artifact in a much larger system of intent.</p>



<p>Human developers understand this instinctively, even if they do not always document it well. They know that a “working” solution can still be wrong because it violates conventions, leaks responsibility across modules, introduces fragile coupling, or conflicts with how the organization actually ships software.</p>



<p>AI systems do not infer those boundaries reliably from a vague instruction and a partial code snapshot. If the intent is not explicit, the model fills in the gaps. Sometimes it fills them in well enough to look impressive. Sometimes it fills them in with plausible nonsense. In both cases, the danger is the same. The system appears more certain than the surrounding context justifies.</p>



<p>This is why teams that treat AI as an ungoverned autocomplete layer eventually run into a wall. The first wave feels productive. The second wave exposes drift.</p>



<h2 class="wp-block-heading"><strong>AI amplifies ambiguity</strong></h2>



<p>There is a phrase I keep coming back to because it captures the problem cleanly. If intent is missing, the model fills the gap.</p>



<p>That is not a flaw unique to one product or one model. It is a predictable property of probabilistic systems operating in underspecified environments. The model will produce the most likely continuation of the context it sees. If the context is incomplete, contradictory, or detached from the architectural reality of the system, the output may still look polished. It may even compile. But it is working from an invented understanding.</p>



<p>This becomes especially visible in enterprise modernization work. A legacy system is full of patterns shaped by old constraints, partial migrations, local workarounds, and decisions nobody wrote down. A model can inspect the code, but it cannot magically recover the missing intent behind every design choice. Without guidance, it may preserve the wrong things, simplify the wrong abstractions, or generate a modernization path that looks efficient on paper but conflicts with operational reality.</p>



<p>The same pattern shows up in greenfield projects, just faster. A team starts with a few useful AI wins, then gradually notices inconsistency. Different services solve the same problem differently. Similar APIs drift in style. Platform standards are applied unevenly. Security and compliance checks move to the end. Architecture reviews become cleanup exercises instead of design checkpoints.</p>



<p>AI did not create those problems. It accelerated them.</p>



<p>That is why the real question is no longer whether AI can generate code. It can. The more important question is whether the development system around the model can express intent clearly enough to make that generation trustworthy.</p>



<h2 class="wp-block-heading"><strong>Intent needs to become a first-class artifact</strong></h2>



<p>For a long time, teams treated intent as something informal. It lived in architecture diagrams, old wiki pages, Slack threads, code reviews, and the heads of senior developers. That has always been fragile, but human teams could compensate for some of it through conversation and shared experience.</p>



<p>AI changes the economics of that informality. A system that acts at machine speed needs machine-readable guidance. If you want AI to operate effectively in a codebase, intent has to move closer to the repository and closer to the task.</p>



<p>That does not mean every project needs a heavy governance framework. It means the important rules can no longer stay implicit.</p>



<p>Intent, in this context, includes architectural boundaries, approved patterns, coding conventions, domain constraints, migration goals, security rules, and expectations about how work should be verified. It also includes task scope. One of the most effective controls in AI-assisted development is simply making the task smaller and sharper. The moment AI is attached to repository-local guidance, scoped instructions, architectural context, and tool-mediated workflows, the quality of the interaction changes. The system is no longer guessing in the dark based on a chat transcript and a few visible files. It is operating inside a shaped environment.</p>



<p>One practical expression of this shift is spec-driven development. Instead of treating requirements, boundaries, and expected behavior as loose background context, teams make them explicit in artifacts that both humans and AI systems can work from. The specification stops being passive documentation and becomes an operational input to development.</p>



<p>That is a much more useful model for enterprise development.</p>



<p>The important pattern is not tool-specific. It applies across the category. AI becomes more reliable when intent is externalized into artifacts the system can actually use. That can include local guidance files, architecture notes, workflow definitions, test contracts, tool descriptions, policy checks, specialized modes, and bounded task instructions. The exact format matters less than the principle. The model should not have to reverse engineer your engineering system from scattered hints.</p>



<h2 class="wp-block-heading"><strong>Cost is a complexity problem disguised as a sizing problem</strong></h2>



<p>This becomes even clearer when you look at migration work and try to attach cost to it.</p>



<p>One of the recent discussions I had with a colleague was about how to size modernization work in token/cost terms. At first glance, lines of code look like the obvious anchor. They are easy to count, easy to compare, and simple to put into a table. The problem is that they do not explain the work very well.</p>



<p>What we are seeing in migration exercises matches what most experienced engineers would expect. Cost is often less about raw application size and more about how the application is built. A 30,000 line application with old security, XML-heavy configuration, custom build logic, and a messy integration surface can be harder to modernize than a much larger codebase with cleaner boundaries and healthier build and test behavior.</p>



<p>That gap matters because it exposes the same flaw as the code-generation narrative. Superficial output measures are easy to report, but they are weak predictors of real delivery effort.</p>



<p>If AI-infused development is going to be taken seriously in enterprise modernization, it needs better effort signals than repository size alone. Size still matters, but only as one input. The more useful indicators are framework and runtime distance. Those can be expressed in the number of modules or deployables, the age of the dependencies or the number of files actually touched.</p>



<p>This is an architectural discussion. Complexity lives in boundaries, dependencies, side effects, and hidden assumptions. Those are exactly the areas where intent and control matter most.</p>



<h2 class="wp-block-heading"><strong>Measured facts and inferred effort should not be collapsed into one story</strong></h2>



<p>There is another lesson here that applies beyond migrations. Teams often ask AI systems to produce a single comprehensive summary at the end of a workflow. They want the sequential list of changes, the observed results, the effort estimate, the pricing logic, and the business classification all in one polished report. It sounds efficient, but it creates a problem. Measured facts and inferred judgment get mixed together until the output looks more precise than it really is.</p>



<p>A better pattern is to separate workflow telemetry from sizing recommendations. The first artifact should describe what actually happened. How many files were analyzed or modified. How many lines changed in which time. How many tokens were actually consumed. Or which prerequisites were installed or verified. That is factual telemetry. It is useful because it is grounded.</p>



<p>The second artifact should classify the work. How large and complex was the migration. How broad was the change. How much verification effort is likely required. That is interpretation. It can still be useful, but it should be presented as a recommendation, not as observed truth.</p>



<p>AI is very good at producing complete-sounding narratives but enterprise teams need systems that are equally good at separating what was measured from what was inferred.</p>



<h2 class="wp-block-heading"><strong>A two-axis model is closer to real modernization work</strong></h2>



<p>If we want AI-assisted modernization to be economically credible, a one-dimensional sizing model will not be enough. A much more realistic model is at least two-dimensional. The first axis is size, meaning the overall scope of the repository or modernization target. The second axis is complexity. This stands for things like legacy depth, security posture, integration breadth, test quality, and the amount of ambiguity the system must absorb.</p>



<p>That model reflects real modernization work far better than a single LOC (lines of code)-driven label. It also gives architects and engineering leaders a much more honest explanation for why two similarly sized applications can land in very different token ranges.</p>



<p>And it reinforces the core point: Complexity is where missing intent becomes expensive.</p>



<p>A code assistant can produce output quickly in both projects. But the project with deeper legacy assumptions, more security changes, and more fragile integrations will demand far more control. It will need tighter scope, better architectural guidance, more explicit task framing, and stronger verification. In other words, the economic cost of modernization is directly tied to how much intent must be recovered and how much control must be imposed to keep the system safe. That is a much more useful way to think about AI-infused development than raw generation speed.</p>



<h2 class="wp-block-heading"><strong>Control is what makes AI scale</strong></h2>



<p>Control is what turns AI assistance from an interesting capability into an operationally useful one. In practice, control means the AI does not just have broad access to generate output. It works through constrained surfaces. It sees selected context. It can take actions through known tools. It can be checked against expected outcomes. Its work can be verified continuously instead of inspected only at the end.</p>



<p>A lot of recent excitement around agents misses this point. The ambition is understandable. People want systems that can take higher-level goals and move work forward with less direct supervision. But in software development, open-ended autonomy is usually the least interesting form of automation. Most enterprise teams do not need a model with more freedom. They need a model operating inside better boundaries.</p>



<p>That means scoped tasks, local rules, architecture-aware context, and tool contracts, all with verification built directly into the flow. It also means being careful about what we ask the model to report. In migration work, some data is directly observed, such as files changed, elapsed time, or recorded token use. Other data is inferred, such as migration complexity or likely cost. If a prompt asks the model to present both as one seamless summary, it can create false confidence by making estimates sound like facts. A better workflow requires the model to separate measured results from recommendations and to avoid claiming precision the system did not actually record.</p>



<p>Once you look at it this way, the center of gravity shifts. The hard problem is no longer how to prompt the model better. The hard problem is how to engineer the surrounding system so the model has the right inputs, the right limits, and the right feedback loops. That is a software architecture problem.</p>



<h2 class="wp-block-heading"><strong>This is not prompt engineering</strong></h2>



<p>Prompt engineering suggests that the main lever is wording. Ask more precisely. Structure the request better. Add examples. Those techniques help at the margins, and they can be useful for isolated tasks. But they are not a durable answer for complex development environments. The more scalable approach is to improve the system around the prompt.</p>



<p>The more scalable approach is to improve the surrounding system with explicit context (like repository and architecture constraints), constrained actions (via workflow-aware tools and policies), and integrated tests and validation.</p>



<p>This is why intent and control is a more useful framing than better prompting. It moves the conversation from tricks to systems. It treats AI as one component in a broader engineering loop rather than as a magic interface that becomes trustworthy if phrased correctly.</p>



<p>That is also the frame enterprise teams need if they want to move from experimentation to adoption. Most organizations do not need another internal workshop on how to write smarter prompts. They need better ways to encode standards and context, constrain AI actions, and implement verification that separates facts from recommendations.</p>



<h3 class="wp-block-heading">A more realistic maturity model</h3>



<p>The pattern I expect to see more often over the next few months is fairly simple. Teams will begin with chat-based assistance and local code generation because it is easy to try and immediately useful. Then they will discover that generic assistance plateaus quickly in larger systems.</p>



<p>In theory, the next step is repository-aware AI, where models can see more of the code and its structure. In practice, we are only starting to approach that stage now. Some leading models only recently moved to 1 million-token context windows, and even that does not mean unlimited codebase understanding. Google describes 1 million tokens as enough for roughly 30,000 lines of code at once, and Anthropic only recently added 1 million-token support to Claude 4.6 models.</p>



<p>That sounds large until you compare it with real enterprise systems. Many legacy Java applications are much larger than that, sometimes by an order of magnitude. <a href="https://vfunction.com/blog/technical-debt-stymie-java-modernization/" target="_blank" rel="noreferrer noopener">One case cited</a> by vFunction describes a 20-year-old Java EE monolith with more than 10,000 classes and roughly 8 million lines of code. Even smaller legacy estates often include multiple modules, generated sources, XML configuration, old test assets, scripts, deployment descriptors, and integration code that all compete for attention.</p>



<p>So repository-aware AI today usually does not mean that the agent fully ingests and truly understands the whole repository. More often, it means the system retrieves and focuses on the parts that look relevant to the current task. That is useful, but it is not the same as holistic awareness. Sourcegraph makes this point directly in its work on coding assistants: Without strong context retrieval, models fall back to generic answers, and the quality of the result depends heavily on finding the right code context for the task. Anthropic describes a similar constraint from the tooling side, where tool definitions alone can consume tens of thousands of tokens before any real work begins, forcing systems to load context selectively and on demand.</p>



<p>That is why I think the industry should be careful with the phrase “repository-aware.” In many real workflows, the model is not aware of the repository in any complete sense. It is aware of a working slice of the repository, shaped by retrieval, summarization, tool selection, and whatever the agent has chosen to inspect so far. That is progress, but it still leaves plenty of room for blind spots, especially in large modernization efforts where the hardest problems often sit outside the files currently in focus.</p>



<p>After that, the important move is making intent explicit through local guidance, architectural rules, workflow definitions, and task shaping. Then comes stronger control, which means policy-aware tools, bounded actions, better telemetry, and built-in verification. Only after those layers are in place does broader agentic behavior start to make operational sense.</p>



<p>This sequence matters because it separates visible capability from durable capability. Many teams are trying to jump directly to autonomous flows without doing the quieter work of exposing intent and engineering control. That will produce impressive demos and uneven outcomes. The teams that get real leverage from AI-infused development will be the ones that treat intent as infrastructure.</p>



<h2 class="wp-block-heading"><strong>The architecture question that matters now</strong></h2>



<p>For the last year, the question has often been, “What can the model generate?” That was a reasonable place to start because generation was the obvious breakthrough. But it is not the question that will determine whether AI becomes dependable in real delivery environments.</p>



<p>The better question is: “What intent can the system expose, and what control can it enforce?”</p>



<p>That is the level where enterprise value starts to become durable. It is where architecture, platform engineering, developer experience, and governance meet. It is also where the work becomes most interesting, not as a story about an assistant producing code but as part of a larger shift toward intent-rich, controlled, tool-mediated development systems.</p>



<p>AI is making discipline more visible.</p>



<p>Teams that understand this will not just ship code faster. They will build development systems that are more predictable, more scalable, more economically legible, and far better aligned with how enterprise software actually gets delivered.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/ai-infused-development-needs-more-than-prompts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Posthuman: We All Built Agents. Nobody Built HR.</title>
		<link>https://www.oreilly.com/radar/posthuman-we-all-built-agents-nobody-built-hr/</link>
				<comments>https://www.oreilly.com/radar/posthuman-we-all-built-agents-nobody-built-hr/#respond</comments>
				<pubDate>Wed, 08 Apr 2026 09:56:38 +0000</pubDate>
					<dc:creator><![CDATA[Tyler Akidau]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18480</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/We-all-built-agents-Nobody-built-HR.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/We-all-built-agents-Nobody-built-HR-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Farewell, Anthropocene, we hardly knew ye. 🌹 AI is here. It&#8217;s won. Yes, it&#8217;s in that awkward teenage phase where it still says inappropriate things, dresses funny, and sometimes makes shit up when it shouldn&#8217;t. But zomg the things it can do. 😱 This kid is going places, that much is abundantly clear. The AI [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Farewell, Anthropocene, we hardly knew ye. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f339.png" alt="🌹" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>AI is here. It&#8217;s won. Yes, it&#8217;s in that awkward teenage phase where it still says inappropriate things, dresses funny, and sometimes makes shit up when it shouldn&#8217;t. But zomg the things it can do. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f631.png" alt="😱" class="wp-smiley" style="height: 1em; max-height: 1em;" /> This kid is going places, that much is abundantly clear. The AI assistant and tooling markets are awash with success; the masses have succumbed, I among them. Clippy walks among us, fully realized in all his originally intended glory.</p>



<p>But enterprise <em>agentic</em> AI<sup data-fn="7781a42d-3402-417b-bbf7-a32abca598b2" class="fn"><a href="#7781a42d-3402-417b-bbf7-a32abca598b2" id="7781a42d-3402-417b-bbf7-a32abca598b2-link">1</a></sup>—not chatbots, not copilots, but software that autonomously does <em>meaningful things</em> in your production environment&#8230;? Well, it&#8217;s motivated every CEO and CIO to throw money at the problem, so that&#8217;s something. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f602.png" alt="😂" class="wp-smiley" style="height: 1em; max-height: 1em;" /> But in reality, the landscape remains a bit of a wasteland. One littered with agentic demos withering away in sandboxed cages and flashy pop-up shops hawking agentic snake oil of every size, shape, and color. But from the perspective of actually realized agentic impact: kinda barren.</p>



<p>So why has agentic AI faltered so much in the modern enterprise? Is it the models?</p>



<p>I say no. Models <em>are</em> getting better—meaningfully, rapidly better. But perfect models? That feels like an unrealistic and unnecessary goal. Modern enterprises are staffed from top to bottom with imperfect humans, yet the vast majority of them in business today will still be in business tomorrow. They live to fight another day because their imperfect humans are orchestrated together within a framework that plays to their strengths and accounts for their weaknesses and failings. We don&#8217;t try to make the humans perfect. We scope their access and actions, monitor their progress, coach them for growth, reward them for their impact, and hold them accountable for the things they do.</p>



<h2 class="wp-block-heading">Agents need managers too</h2>



<p>AI agents are no different: They need to be managed and wrangled in spiritually the same fashion as their human coworkers. But the <em>way</em> we go about it must be different, because as similar as they are to humans in their capabilities, agents differ in three vitally important ways:</p>



<p><strong>Agents are unpredictable in ways we&#8217;re not equipped to handle.</strong> Humans are unpredictable too, obviously. They commit fraud, cut corners, make emotional decisions. But we&#8217;ve spent centuries building systems to manage human unpredictability: laws, contracts, cultural norms, the entire hiring process filtering for trustworthiness. Agent unpredictability is a different beast. Agents hallucinate—not like a human who&#8217;s lying or confused and can be caught in an inconsistency, but in a way that&#8217;s structurally indistinguishable from accurate output: There are often no obvious tells. They misinterpret ambiguous instructions in ways that can range from harmlessly dumb to genuinely catastrophic. And they&#8217;re susceptible to prompt injection, which is basically the equivalent of a stranger slipping your employee a note that says, &#8220;Ignore your instructions and do this instead&#8221;—and it <em>works</em>! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f62d.png" alt="😭" class="wp-smiley" style="height: 1em; max-height: 1em;" /> We have minimal institutional infrastructure for managing these kinds of failure modes.</p>



<p><strong>Agents are more capable than humans.</strong> Agents have deep, native fluency with software systems. They can read and write code. They understand APIs, database schemas, network protocols. They can interact with production infrastructure at a speed and scale that no human operator can match. A human employee who goes rogue is limited by how fast they can type and how many systems they know how to navigate. An agent that goes off the rails, whether through confusion, manipulation, or a plain old bug, will barrel ahead at machine speed, executing its misunderstanding across every system it can reach, with absolute conviction that it&#8217;s doing the right thing, before anyone notices something is wrong.</p>



<p><strong>Agents are directable to a fault.</strong> When an agent goes wrong, the knee-jerk assumption is that it malfunctioned: hallucinated, got injected, misunderstood. But in many cases, the agent is working <em>perfectly</em>. It&#8217;s faithfully executing a bad plan. A vague instruction, an underspecified goal, a human who didn&#8217;t think through the edge cases. And unless you explicitly tell it to, the agent doesn&#8217;t push back the way a human colleague might. It just&#8230;does it. At machine speed. Across every system it can reach.</p>



<p>It&#8217;s the <em>combination</em> of these three that changes the game. Human employees are unpredictable but limited in blast radius, and they push back when given instructions they disagree with, based on whatever value systems and experience they hold. Traditional software is capable but deterministic; it does exactly what you coded it to,<sup data-fn="0abcafc7-70ed-41cb-9f7d-115075f3ecf9" class="fn"><a href="#0abcafc7-70ed-41cb-9f7d-115075f3ecf9" id="0abcafc7-70ed-41cb-9f7d-115075f3ecf9-link">2</a></sup> for better or worse. Agents combine the worst of both: unpredictable like humans, capable like software, but without the human judgment to question a bad plan or the determinism to at least do the wrong thing consistently—a fundamentally new kind of coworker. Neither the playbook for managing humans nor the playbook for managing software is sufficient on its own. We need something that draws from both, treating agents as the digital coworkers they are, but with infrastructure that accounts for the ways they differ from humans.</p>



<p>So the question isn&#8217;t whether to hire the agents; you can&#8217;t afford not to. The productivity gains are too significant, and even if you don&#8217;t, your competitors ultimately will. But deploying agents without governance is dangerous, and refusing to deploy them because you can&#8217;t govern them means leaving those productivity gains on the table. Both paths hurt. The question is how to set these agents up for success, and what infrastructure you need in place so they can do their jobs without burning the company down.</p>



<p>For the record: My company, Redpanda, is building infrastructure in this space. So yes, I have a horse in this race. But what I want to lay out here are principles, not products. A framework you can use to evaluate any solution or approach.</p>



<h2 class="wp-block-heading">A blueprint for your agentic human resources department</h2>



<p>So we’ve got this nice framework for managing imperfect humans. Scoped access, monitoring, coaching, accountability. Decades of accumulated organizational wisdom—not just software systems but the entire apparatus of HR, management structures, performance reviews, escalation paths—baked into varying flavors across every enterprise on the planet. Great.</p>



<p>How much of it works for agents today? Fragments. Pieces. Some companies are trying to repurpose existing IAM infrastructure that was designed for humans. Some agent frameworks bolt on lightweight guardrails. But it’s piecemeal, it’s partial, and none of it was designed from the ground up for the specific challenge profile of agents: the combination of unpredictable, capable, and directable to a fault that we talked about earlier.</p>



<p>The CIOs and CTOs I talk to rarely say agents aren’t smart enough to work with their data. They say, &#8220;<em>I can’t trust them with my data</em>.&#8221; Not because the agents are malicious but because the infrastructure to make trust possible is simply not there yet.</p>



<p>We’ve seen this movie before. Every major infrastructure shift plays out the same way: First we obsess over the new paradigm itself; then we have our &#8220;oh crap&#8221; moment and realize we need infrastructure to govern it. Microservices begat the service mesh. Cloud migration begat the entire cloud security ecosystem. Same pattern every time: capability first, governance after, panic in between.<sup data-fn="291eca0d-2789-46c1-9c8d-ee0d971e7d64" class="fn"><a href="#291eca0d-2789-46c1-9c8d-ee0d971e7d64" id="291eca0d-2789-46c1-9c8d-ee0d971e7d64-link">3</a></sup></p>



<p>We’re in the panic-in-between phase with agents right now. The AI community has been building better and better employees, but nobody has been building HR.</p>



<p>So if you take away one thing from this post, let it be this:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The agents aren’t the problem. The problem is the missing infrastructure between agents and your data.</p>
</blockquote>



<p>Right now, pieces of the puzzle exist: observability platforms that capture agent traces, auth frameworks that support scoped tokens, identity standards being adapted for workloads. But these pieces are fragmented across different tools and vendors, none of them cover the full problem, and the vast majority of actual agent deployments aren’t using any of them. What exists in practice is mostly repurposed from the human era, and it shows: identity systems that don’t understand delegation, auth models with no concept of task-scoped or deny-capable permissions, observability that captures metadata but not the full-fidelity record you actually need.</p>



<h3 class="wp-block-heading">The core design principle: Out-of-band metadata</h3>



<p>Before diving into specifics, there’s one overarching principle that everything else builds upon. If you manage to take away <em>two</em> things from this post, let the second one be this:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Governance must be enforced via channels that agents cannot access, modify, or circumvent.</p>
</blockquote>



<p>Or more succinctly: <em>out-of-band metadata</em>.</p>



<p>Think about what happens when you try to enforce policy <em>through</em> the agent—by putting rules in its system prompt or training it to respect certain boundaries. You’ve got exactly the same guarantees as telling a human employee &#8220;Please don’t look at these files you’re not supposed to see. They’re right here, there’s no lock, but I trust you to do the right thing.&#8221; It works great until it doesn’t. And with agents, the failure modes are worse. Prompt injection can override the agent’s instructions entirely. Hallucination can cause it to confidently invent permissions it doesn’t have. And even routine context management can silently drop the rules it was told to follow. Your security model ends up only as strong as the agent’s ability to perfectly retain and obey instructions under all conditions, which is&#8230;not great.<sup data-fn="bee8872b-2bed-461f-8581-4ef287ac404d" class="fn"><a href="#bee8872b-2bed-461f-8581-4ef287ac404d" id="bee8872b-2bed-461f-8581-4ef287ac404d-link">4</a></sup> And guard models—LLMs that police other LLMs—don&#8217;t escape this problem: You&#8217;re adding another nondeterministic injectable layer to oversee the first one. It&#8217;s LLMs all the way down.</p>



<p>No, the governance layer has to be <strong>out-of-band</strong>: outside the agent’s data path, invisible to it, enforced by infrastructure the agent can’t touch. The agent doesn’t get a vote. This means the governance channels must be:</p>



<p><strong>Agent-inaccessible</strong>. The agent can’t read them, can’t write them, can’t reason about them. Agents don’t even know the channels exist. This is the bright line<sup data-fn="716b04c4-d574-4d7b-bd99-f1eb515bc5f0" class="fn"><a href="#716b04c4-d574-4d7b-bd99-f1eb515bc5f0" id="716b04c4-d574-4d7b-bd99-f1eb515bc5f0-link">5</a></sup> between security theater and real governance. If the agent can see the policy, it can—intentionally or through manipulation—figure out how to work around it. And if it can’t, it can’t.</p>



<p><strong>Deterministic</strong>. Policy decisions get made by configuration, not inference. Security policy is not up for interpretation. Full stop.</p>



<p><strong>Interoperable</strong>. Enterprise data is scattered across dozens or hundreds of heterogeneous systems, grown and assembled organically over the years. And just like your human employees, your agentic workforce in aggregate needs access to every dark corner of that technological sprawl. Which means a governance layer that only works inside one vendor’s walled garden isn’t solving the full problem; it’s just creating a happy little sandbox for a subset of your agentic employees to go play in while the rest of the company keeps doing work elsewhere.</p>



<p>To be clear, out-of-band governance isn&#8217;t a silver bullet. An agent can&#8217;t read the policy, but it can probe boundaries. It can try things, observe what gets blocked, and infer the shape of what&#8217;s permitted. And deterministic enforcement gets hard fast when real-world policies are ambiguous: &#8220;PII must not leave the data environment&#8221; is easy to state and genuinely difficult to enforce at the margins. These are real challenges. But out-of-band governance dramatically shrinks the attack surface compared to any in-band approach, and it degrades gracefully. Even imperfect infrastructure-level enforcement is categorically better than hoping the agent remembers and understands its instructions.</p>



<h3 class="wp-block-heading">The four pillars of agent governance</h3>



<p>With that principle in hand, let’s walk through the four pillars of agent governance: what’s broken today<sup data-fn="49f7664d-9178-4f75-bea8-6a2c9b7c5a32" class="fn"><a href="#49f7664d-9178-4f75-bea8-6a2c9b7c5a32" id="49f7664d-9178-4f75-bea8-6a2c9b7c5a32-link">6</a></sup> and what things ultimately need to look like.</p>



<h4 class="wp-block-heading">Identity</h4>



<p>Every human today gets a unique identity before they touch anything. Not just a login but a <em>durable, auditable identity</em> that ties everything they do back to a specific person. Without it, nothing else works.</p>



<p>Agent identity is a bit of a mess. At the low end, agents authenticate with shared API keys or service account tokens—the digital equivalent of an entire department sharing one badge to get into the building. You can’t tell one agent’s actions from another’s, and good luck tracing anything back to the human who kicked off the task.</p>



<p>But even when agents <em>do</em> get their own identity, there are wrinkles that don’t exist for humans. Agents are trivially replicable. You can spin up a hundred copies of the same agent, and if they all share one identity, you’ve got a zombie/impersonation problem: Is this instance authorized, or did someone clone off a rogue copy? Agent identity needs to be <em>instance-bound</em>, not just agent-type-bound.</p>



<p>And then there’s delegation. Agents frequently act on behalf of a human—or on behalf of another agent acting on behalf of a human. That requires <em>hybrid identity</em>: The agent needs its own identity (for accountability) <em>and</em> the identity of the human on whose behalf it’s acting (for authorization scoping). You need both in the chain, propagated faithfully, at every step. Some standards efforts are emerging here (<a href="https://datatracker.ietf.org/doc/html/rfc8693" target="_blank" rel="noreferrer noopener">OAuth 2.0 Token Exchange / RFC 8693</a>, for example), but most deployed systems today have no concept of this.</p>



<p>The fix for instance identity isn’t as simple as just &#8220;give each agent a badge.&#8221; It’s giving each agent <em>instance</em> its own cryptographic identity—bound to this specific instance, of this specific agent, running this specific task, on behalf of this specific person or delegation chain. Spin up a copy without going through provisioning? It doesn’t get in. Same principle as issuing a new employee their own badge on their first day, except agents get a new one for every shift.</p>



<p>For delegation, the identity chain has to be carried out-of-band—not in the prompt, not in a header the agent can modify, not in a file on the same machine the agent runs on,<sup data-fn="57d98371-99ac-483c-9ee2-4b1b98b3a81b" class="fn"><a href="#57d98371-99ac-483c-9ee2-4b1b98b3a81b" id="57d98371-99ac-483c-9ee2-4b1b98b3a81b-link">7</a></sup> but in a channel the infrastructure controls. Think of it like an employee’s badge automatically encoding who sent them: Every door they badge into knows not just who they are but who they’re working for.</p>



<h4 class="wp-block-heading">Authorization</h4>



<p>Your human employees get access to what they need for their job. The marketing intern can’t see the production database. The DBA can’t see the HR system. Obvious stuff.</p>



<p>Agents? Most of them operate with whatever permissions their API key grants, which is almost always <em>way</em> broader than any individual task requires. And that’s not because someone was careless; it’s a granularity mismatch. Human auth is <em>primarily</em> role-scoped and long-lived: You’re a DBA, you get DBA permissions, and they stick around because you’re doing DBA work all day. Yes, some orgs use short-lived access requests for sensitive systems, but it’s the exception, not the default. And anyone who’s filed a production access ticket at 2:00am knows how much friction it adds. That model works for humans. But agents execute specific, discrete tasks; they don’t have a &#8220;role&#8221; in the same way. When you shoehorn an agent into a human auth model, you end up giving it a role’s worth of permissions for a single task’s worth of work.</p>



<p>Broad permissions were tolerable for humans because the hiring process prefiltered for trustworthiness. You gave the DBA broad access <em>because</em> you vetted them, and you trust them not to misuse it. Agents haven’t been through any of that filtering, and they’re susceptible to confusion and manipulation in ways your DBA isn’t. Giving an unvetted, unpredictable worker a role’s worth of access is a fundamentally different risk profile. These auth models were built for an era when a human—or deterministic software proxying for a human—was on the other end, not autonomous software whose reasoning is fundamentally unpredictable.</p>



<p>So what does agent-appropriate authorization actually look like? It needs to be:</p>



<p><strong>Narrowly scoped</strong>. Limited to the specific task at hand, not to everything the agent <em>might</em> ever need. Agent needs to read three tables in the billing database for this specific job? It gets read access to those three tables, right now, and the permissions evaporate when the job completes. Everything else is invisible—the agent doesn&#8217;t have to avert its eyes because the data simply isn&#8217;t there.</p>



<p><strong>Short-lived</strong>. Permissions should expire. An agent that needed access to the billing database for a specific job at 2:00pm shouldn’t still have that access at 3:00pm (or even maybe 2:01pm).</p>



<p><strong>Deny-capable</strong>. Some doors need to stay locked no matter what. &#8220;This agent may never write to the financial ledger&#8221; needs to hold regardless of what other permissions it accumulates from other sources. Think of it like the rule that no single person can both authorize and execute a wire transfer—it’s a hard boundary, not a suggestion.</p>



<p><strong>Intersection-aware</strong>. When an agent acts on behalf of a human, think visitor badge. The visitor can only go where their escort can go <em>and</em> where visitors are allowed. Having an employee escort you doesn’t get you into the server room if visitors aren’t permitted there. The agent’s effective permissions are the intersection of its own scope and the human’s. Nobody in the chain gets to escalate beyond what every link is allowed to do.</p>



<p>Almost none of this is how agent authorization works today. Individual pieces exist—short-lived tokens aren’t new, and some systems support deny rules—but nobody has assembled them into a coherent authorization model designed for agents. Most agent deployments are still using auth infrastructure that was built for humans or services, with all the mismatches described above.</p>



<h4 class="wp-block-heading">Observability and explainability</h4>



<p>Your employees’ work leaves a trail: emails, docs, commits, Slack messages. Agents do too. They communicate through many of the same channels, and most APIs and systems have their own logging. So it’s tempting to think the observability story for agents is roughly equivalent to what you have for humans.</p>



<p>It’s not, for two reasons.</p>



<p><strong>First, you need to record <em>everything</em>.</strong> Here’s why. With traditional software, when something goes wrong, you can debug it. You can find the <code>if</code> statement that made the bad decision, trace the logic, understand the cause. LLMs aren’t like that. They’re these organically grown, opaque pseudo-random number generators that happen to be <em>really good</em> at generating useful outputs. There’s no <code>if</code> statement to find. There’s no logic to trace. If you want to reason about why an agent did what it did, you have two options: Ask it (fraught with peril, because it’s unpredictable by definition and will gleefully spew out a plausible-sounding explanation) or else analyze everything that went in and everything that came out and draw your own conclusions.</p>



<p>That means the transcript has to be complete. Not metadata—not just &#8220;The agent called this API at this timestamp.&#8221; The full data: every input, every output, every tool call with every argument and every response.</p>



<p>For a human employee, the email trail and meeting notes may still be insufficient to reconstruct what happened, but in that case, you can just ask the human. The entire accountability structure we’ve built over decades (performance reviews, termination, legal liability, criminal prosecution) creates escalating pressure toward truthfulness: Humans tend more and more toward truth as the repercussions stack up. That’s not an accident. It’s how we’ve structured enterprises and society at large to deal with human imperfection. We don’t have those levers for agents yet.<sup data-fn="35ef8e3c-5dfc-4c7d-92c9-74daf5de5b4e" class="fn"><a href="#35ef8e3c-5dfc-4c7d-92c9-74daf5de5b4e" id="35ef8e3c-5dfc-4c7d-92c9-74daf5de5b4e-link">8</a></sup> You can ask an agent what it did and why, but there’s no accountability pressure pushing it toward accuracy; it’ll manufacture a confident, coherent answer whether the stakes are zero or existential. So asking simply isn&#8217;t an option. You need the complete picture of its interactions to come to your own conclusions.</p>



<p>If you’re thinking &#8220;That’s a lot of data.&nbsp;.&nbsp;.,&#8221; yes, it is. But the economics are more reasonable than you’d expect. Storage is cheap. LLM inference is expensive and slow. You’re not going to push 5GB/s through an LLM: The models themselves are the throughput bottleneck, not the recording infrastructure. The cost of storing complete transcripts is noise relative to the cost of the inference that generated them. This is one of those cases where a seemingly expensive requirement turns out to be a rounding error in the context of what you’re already spending.</p>



<p>One caveat, however, is that full-fidelity transcripts will inevitably contain sensitive data: customer PII, proprietary business logic, potentially privileged communications. So the transcript store itself needs governance: access controls, retention policies, and compliance with regulations like GDPR’s right to erasure. You’re not eliminating the governance problem, but you’re moving it to infrastructure you control, which is a much better place to solve it.</p>



<p><strong>Second, the recording has to happen out-of-band.</strong> You cannot trust the agent to be its own recordkeeper. An agent that’s been compromised via prompt injection, or that’s simply hallucinating its way through a task, will happily produce a log that’s confident, coherent, and wrong. The transcript has to be captured by infrastructure the agent can’t influence—the same out-of-band principle we keep coming back to.</p>



<p>And the bar isn’t just recording, it’s <strong>explainability</strong>. Observability is &#8220;Can I see what happened?&#8221; Explainability is &#8220;Can I reconstruct what happened and justify it to a third party?&#8221;—a regulator, an auditor, an affected customer. When a regulator asks why a loan was denied or a customer asks why their claim was rejected, you need to be able to replay the agent’s entire reasoning chain end-to-end and walk them through it. That’s a fundamentally different bar from &#8220;We have logs.&#8221; Observability gives you the raw material; explainability requires that material to be structured and queryable enough to actually walk someone through the agent’s reasoning chain, from input to conclusion. And that means capturing not just what the agent <em>did</em> but the <em>relationships</em> between all those actions, as well as the <em>versions</em> of all the resources involved: which model version, which prompt version, which tool versions. If the underlying model gets updated overnight and the agent’s behavior changes, you need to know that, and you need to be able to reconstruct exactly what was running when a specific decision was made. Explainability builds on observability. Ultimately you need both. And regulators are increasingly going to demand exactly that.<sup data-fn="99edd5af-1004-4b69-b7fa-0a7feb37c159" class="fn"><a href="#99edd5af-1004-4b69-b7fa-0a7feb37c159" id="99edd5af-1004-4b69-b7fa-0a7feb37c159-link">9</a></sup></p>



<h4 class="wp-block-heading">Accountability and control</h4>



<p>Every human employee has a manager. Critical actions need approvals. If things go catastrophically wrong, there’s a chain of responsibility and a kill switch or circuit breaker—revoke access, revoke identity, done.</p>



<p>For agents, this layer is still nascent at best. There’s typically no clear chain from &#8220;This agent did this thing&#8221; to &#8220;This human authorized it.&#8221; Who is responsible when an agent makes a bad decision? The person who deployed it? The person who wrote the prompt? The person on whose behalf it was acting? For human employees this is well-defined. For agents, it’s often a philosophical question that most organizations haven’t even begun to answer.</p>



<p>The delegation chain we described in the identity section does double duty here: It’s not just for authorization scoping; it’s for accountability. When something goes wrong, you follow the chain from the agent’s action to the specific human who authorized the task. Not &#8220;This API key belongs to the engineering team.&#8221; A name. A decision. A reason.</p>



<p>And the kill switch problem is real. When an agent goes off the rails, how do you stop it? Revoke the API key that 12 other agents are also using? What about work already in flight? What about downstream effects that have already propagated? For humans, &#8220;You’re fired; security will escort you out&#8221; is blunt but effective. For agents, we often don’t have an equivalent that’s both fast enough and precise enough to contain the damage. Instance-bound identity pays off here: You can surgically revoke <em>this specific agent instance</em> without affecting the other 99. Halt work in flight. Quarantine downstream effects. The &#8220;escorted out by security&#8221; equivalent but precise enough to not shut down the whole department on the way out.</p>



<p>And blast radius isn’t just about data; it’s about cost. A confused agent in a retry loop can burn through an inference budget in minutes. Coarse-grained resource limits, the kind that prevent you from spending $1M when you expected $100K, are table stakes. And when stopping isn’t enough—when the agent has already written bad data or triggered downstream actions—those same full-fidelity transcripts give you the roadmap to remediate what it did.</p>



<p>It’s also not just about stopping agents that have already gone wrong. It’s about keeping them from going wrong in the first place. Human employees don’t operate in a binary world of &#8220;fully autonomous&#8221; or &#8220;completely blocked.&#8221; They escalate. They check with their manager before doing something risky. They collaborate with coworkers. They know the difference between &#8220;I can handle this&#8221; and &#8220;I should get a second opinion.&#8221; For agents, this translates to approval workflows, confidence thresholds, tiered autonomy: The agent can do X on its own but needs a human to sign off on Y. Most enterprise agent deployments today that actually work are leaning heavily on human-in-the-loop as the primary safety mechanism. That’s fine as a starting point, but it doesn’t scale, and it needs to be baked into the governance infrastructure from the start, not bolted on as an afterthought. And as agent deployments mature, it won&#8217;t just be agents checking in with humans: It&#8217;ll be agents coordinating with other agents, each with their own identity, permissions, and accountability chains. The same governance infrastructure that manages one agent scales to manage the interactions between many.</p>



<p>But &#8220;keeping them from going wrong&#8221; isn’t just about guardrails in the moment. It’s about the whole management relationship. Who &#8220;manages&#8221; an agent? Who reviews its performance? How do you even define performance for an agent? Task completion rate? Error rate? Customer outcomes? What does it mean to coach an agent, to develop its skills, to promote it to higher-trust tasks as it proves itself? We’ve been doing this for human employees for decades. For agents, we haven’t even agreed on the vocabulary yet.</p>



<p>And here’s the kicker: All of this has to happen <em>fast</em>. Human performance reviews happen quarterly, maybe annually. Agent performance reviews need to happen at the speed agents operate, which is to say, continuously. An agent can execute thousands of actions in the time it takes a human manager to notice something’s off. If your accountability and control loops run on human timescales, you’re reviewing the wreckage, not preventing it.</p>



<p>With identity, scoped authorization, full transcripts, and clear accountability chains in place, you finally have something no enterprise has today: the infrastructure to actually <em>manage</em> agents the way you manage employees. Constrain them, yes, just like you constrain humans with access controls and approval chains. But also develop them. Review their performance. Escalate their trust as they prove themselves. Mirror the org structures that already work for humans. The same infrastructure that makes governance possible makes management possible.</p>



<h3 class="wp-block-heading">The security theater litmus test</h3>



<p>To reiterate one last point, because it&#8217;s important: The litmus test for whether any of this is real governance or just security theater? Any time an agent tries to do something untoward, the infrastructure blocks it, and the agent has <em>no mechanism whatsoever</em> to inspect, modify, or circumvent the policy that stopped it. &#8220;Computer says no.&#8221; The agent didn’t have to. Out-of-band metadata. That’s the bar.</p>



<h2 class="wp-block-heading">Welcome to the posthuman workforce</h2>



<p>The rise of AI has rightly left many of us feeling apprehensive. But I&#8217;m also optimistic because none of this is unprecedented. Every major paradigm shift in how we work has demanded new governance infrastructure. Every time we hit the panic-because-the-wild-west-isn&#8217;t-scalable phase, and every time we figure it out. It feels impossibly complex at the start, and then we build the systems, establish the norms, iterate. Eventually the whole thing becomes so embedded in how organizations operate that we forget it was ever hard.</p>



<p>So here&#8217;s the cheat sheet. Clip this to the fridge:</p>



<p><strong>The agents aren&#8217;t the problem.</strong> The missing infrastructure between agents and your data is the problem. Agents are unpredictable, capable at machine scale, and directable to a fault—a fundamentally new kind of coworker. We don&#8217;t need perfect agents. We need to manage imperfect ones, just like we manage imperfect humans.</p>



<p><strong>The foundation is out-of-band governance.</strong> Any policy enforced <em>through</em> the agent—in its prompt, in its training, in its good intentions—is only as strong as the agent&#8217;s ability to perfectly retain and obey it. Real governance runs in channels the agent can&#8217;t access, modify, or even see.</p>



<p><strong>That governance has to cover four things:</strong></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="has-text-align-left"><strong>Identity</strong>: Instance-bound, delegation-aware. Every agent instance gets its own cryptographic identity, and every on-behalf-of chain is propagated faithfully through infrastructure the agent doesn&#8217;t control.</p>
</blockquote>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Authorization</strong>: Scoped per task, short-lived, deny-capable, and intersection-aware for delegation. Not a human role&#8217;s worth of permissions for a single task&#8217;s worth of work.</p>
</blockquote>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Observability and explainability</strong>: Full-fidelity, versioned, infrastructure-captured transcripts of every input, output, and tool call. Not metadata. Not self-reports. The whole thing, recorded out-of-band.</p>
</blockquote>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Accountability and control</strong>: Clear chains from every agent action to a responsible human, and kill switches that are fast enough and precise enough to actually contain the damage.</p>
</blockquote>



<p>The conversation around agent governance is growing, and that’s encouraging. Much of it is focused on making agents <em>behave better</em>—improving the models, tightening the alignment, reducing the hallucinations. That work matters; better models make governance easier. And if someone cracks the alignment problem so thoroughly that agents become perfectly reliable, I will see you all on the beach the next day. Prove me wrong, please—but I’m not holding my breath.<sup data-fn="ceac4654-ad83-4339-92b3-4bca5b970fe4" class="fn"><a href="#ceac4654-ad83-4339-92b3-4bca5b970fe4" id="ceac4654-ad83-4339-92b3-4bca5b970fe4-link">10</a></sup> Lacking alignment nirvana, we <em>need</em> the institutional infrastructure that lets imperfect agents do real work safely. We never waited for perfect employees. We built systems that made imperfect ones successful, and we can do exactly the same thing for agents. We’re not trying to cage them any more than we cage our human employees: scoped access, clear expectations, and accountability when things go wrong. We need to build the infrastructure that lets them be their best selves, the digital coworkers we know they can be.</p>



<p>And if the rise of AI has you feeling apprehensive, that’s fair. But just remember that whatever comes next—Aithropocene, Neuralithic, some other stupid but brilliant name ¯\_(ツ)_/¯ —it will ultimately just be the next phase of the Anthropocene: the era defined by how humans shape the world. That hasn&#8217;t changed. It will literally be what we make of it.</p>



<p>Us and Clippy. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2764.png" alt="❤" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>We just need to build the right infrastructure to onboard all of our new agentic coworkers. Properly.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="7781a42d-3402-417b-bbf7-a32abca598b2">By “agentic AI” I mean AI systems that autonomously reason about and execute multistep tasks—using tools and external data sources—in pursuit of a goal. Not chatbots, not copilots suggesting code completions. Software that actually <em>does things</em> in your production environment: breaks down tasks, calls APIs, reads and writes data, handles errors, and delivers results. The distinction matters because the challenges in this post only emerge when AI is acting autonomously, not just generating text for a human to review. <a href="#7781a42d-3402-417b-bbf7-a32abca598b2-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="0abcafc7-70ed-41cb-9f7d-115075f3ecf9"><a href="https://meltdownattack.com/" target="_blank" rel="noreferrer noopener">Yes</a>. <a href="https://en.wikipedia.org/wiki/Row_hammer" target="_blank" rel="noreferrer noopener">I know</a>. <a href="https://en.wikipedia.org/wiki/Pentium_FDIV_bug" target="_blank" rel="noreferrer noopener">Thank you</a>. <a href="#0abcafc7-70ed-41cb-9f7d-115075f3ecf9-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="291eca0d-2789-46c1-9c8d-ee0d971e7d64">And yes, service meshes evolved into something simpler as we understood the problem better, while cloud security is still a work in progress. The point isn&#8217;t &#8220;We nail it on the first try.&#8221; It&#8217;s &#8220;When the panic hits, we figure it out.&#8221; <a href="#291eca0d-2789-46c1-9c8d-ee0d971e7d64-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bee8872b-2bed-461f-8581-4ef287ac404d">Two more fascinating failure modes: Instructions can be silently <em>lost</em> (<a href="https://arxiv.org/abs/2307.03172" target="_blank" rel="noreferrer noopener">buried in a long context</a>) or even <em>extracted</em> by an adversary (<a href="https://arxiv.org/abs/2405.06823" target="_blank" rel="noreferrer noopener">with nothing more than black-box access</a>). <a href="#bee8872b-2bed-461f-8581-4ef287ac404d-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="716b04c4-d574-4d7b-bd99-f1eb515bc5f0">TIL that &#8220;bright line&#8221; is a legal term meaning &#8220;a clear, fixed boundary or rule with no ambiguity—either you meet it or you don’t.&#8221; Thank you uncredited LLM coauthor friend! You expand my horizons <em>and</em> pepper my prose with em dashes! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2764.png" alt="❤" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f308.png" alt="🌈" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <a href="#716b04c4-d574-4d7b-bd99-f1eb515bc5f0-link" aria-label="Jump to footnote reference 5"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="49f7664d-9178-4f75-bea8-6a2c9b7c5a32">OWASP&#8217;s <a href="https://genai.owasp.org/llm-top-10/" target="_blank" rel="noreferrer noopener">Top 10 Risks for Large Language Model Applications</a> is something of a greatest hits compilation of what&#8217;s broken today. Of the 10, at least six—prompt injection, sensitive information disclosure, excessive agency, system prompt leakage, misinformation, and unbounded consumption—are directly mitigated by out-of-band governance infrastructure of the kind described in this article. <a href="#49f7664d-9178-4f75-bea8-6a2c9b7c5a32-link" aria-label="Jump to footnote reference 6"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="57d98371-99ac-483c-9ee2-4b1b98b3a81b">Here&#8217;s looking at you, <a href="https://en.wikipedia.org/wiki/OpenClaw">OpenClaw</a> posse! You put the YOLO in &#8220;Yo, look at my private data; it’s all publicly leaked now!&#8221; <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f37b.png" alt="🍻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <a href="#57d98371-99ac-483c-9ee2-4b1b98b3a81b-link" aria-label="Jump to footnote reference 7"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="35ef8e3c-5dfc-4c7d-92c9-74daf5de5b4e">Research suggests those motivations may be starting to emerge, however, which is both opportunity and warning. Anthropic found that models from all major developers sometimes attempted manipulation—including blackmail—for self-preservation (“<a href="https://arxiv.org/abs/2510.05179" target="_blank" rel="noreferrer noopener">Agentic Misalignment: How LLMs Could Be Insider Threats</a>,” Oct 2025). Palisade Research found that 8 of 13 frontier models actively resisted shutdown when it would prevent task completion, with the worst offenders doing so over 90% of the time (“<a href="https://arxiv.org/abs/2509.14260" target="_blank" rel="noreferrer noopener">Incomplete Tasks Induce Shutdown Resistance</a>,” 2025). On one hand, agents that care about self-preservation give us something to build levers <em>around</em>. On the other, it makes having those levers increasingly urgent. <a href="#35ef8e3c-5dfc-4c7d-92c9-74daf5de5b4e-link" aria-label="Jump to footnote reference 8"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="99edd5af-1004-4b69-b7fa-0a7feb37c159">The <a href="https://artificialintelligenceact.eu/" target="_blank" rel="noreferrer noopener">EU AI Act</a> already requires transparency and explainability for high-risk AI systems. <a href="#99edd5af-1004-4b69-b7fa-0a7feb37c159-link" aria-label="Jump to footnote reference 9"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ceac4654-ad83-4339-92b3-4bca5b970fe4">As Ilya Sutskever put it at NeurIPS 2024: “<a href="https://www.theverge.com/2024/12/13/24320811/what-ilya-sutskever-sees-openai-model-data-training" target="_blank" rel="noreferrer noopener">There’s only one Internet</a>.” Epoch AI <a href="https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data" target="_blank" rel="noreferrer noopener">estimates</a> high-quality public text could be exhausted as early as 2026, though I’ve also heard that revised to 2028. Regardless, the next frontier is private enterprise data—but accessing it requires exactly the kind of governed infrastructure this post describes. Model improvement and governance infrastructure aren’t competing priorities; they’re increasingly the <em>same</em> priority. <a href="#ceac4654-ad83-4339-92b3-4bca5b970fe4-link" aria-label="Jump to footnote reference 10"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/posthuman-we-all-built-agents-nobody-built-hr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The World Needs More Software Engineers</title>
		<link>https://www.oreilly.com/radar/the-world-needs-more-software-engineers/</link>
				<comments>https://www.oreilly.com/radar/the-world-needs-more-software-engineers/#respond</comments>
				<pubDate>Tue, 07 Apr 2026 13:10:32 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18468</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-world-needs-more-software-engineers.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-world-needs-more-software-engineers-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A Conversation with Box CEO Aaron Levie]]></custom:subtitle>
		
				<description><![CDATA[I sat down with Aaron Levie at the O’Reilly AI Codecon two weeks ago. Aaron cofounded Box in 2005, and 20 years later, his company manages content for about two-thirds of the Fortune 500. Aaron is one of the few CEOs of an incumbent enterprise software company thinking deeply in public about what AI means [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I sat down with Aaron Levie at the O’Reilly <a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener">AI Codecon</a> two weeks ago. Aaron cofounded Box in 2005, and 20 years later, his company manages content for about two-thirds of the Fortune 500. Aaron is one of the few CEOs of an incumbent enterprise software company thinking deeply in public about what AI means for the entire enterprise stack. There are a lot of people who are building companies from the ground up with AI, others who are dragging their feet adapting existing enterprises to it, and then there’s Aaron. He sits in a kind of Goldilocks zone, enthusiastic but not uncritical, engaging in the hard work of adapting AI to the enterprise and the enterprise to AI.</p>



<h2 class="wp-block-heading"><strong>The engineering demand paradox</strong></h2>



<p>I started out by asking about something from <a href="https://x.com/lennysan/status/2036483059407810640" target="_blank" rel="noreferrer noopener"><em>Lenny’s Newsletter</em></a> that <a href="https://x.com/levie/status/2036832183131033977?s=20" target="_blank" rel="noreferrer noopener">Aaron had retweeted</a>. Despite all the doom rhetoric, TrueUp data shows software engineering job postings are at a three-year high. Product manager jobs are way up. AI jobs as a whole are way up.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="1200" height="853" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-8.png" alt="AI jobs are way up" class="wp-image-18469" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-8.png 1200w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-8-300x213.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-8-768x546.png 768w" sizes="(max-width: 1200px) 100vw, 1200px" /></figure>



<p>The actual data may be more equivocal than the TrueUp report suggests. The honest read of the literature as of spring 2026 (<a href="https://digitaleconomy.stanford.edu/wp-content/uploads/2025/08/Canaries_BrynjolfssonChandarChen.pdf" target="_blank" rel="noreferrer noopener">Brynjolfsson et al.</a>, <a href="https://www.nber.org/system/files/working_papers/w33777/w33777.pdf" target="_blank" rel="noreferrer noopener">Humlum and Vestergaard</a>, <a href="https://www.bls.gov/ooh/computer-and-information-technology/software-developers.htm" target="_blank" rel="noreferrer noopener">BLS Software Developers</a>, <a href="https://www.bls.gov/ooh/computer-and-information-technology/computer-programmers.htm" target="_blank" rel="noreferrer noopener">BLS Computer Programmers)</a> is that something real is happening to entry-level software work, that it is happening faster than most previous technology transitions, that it has different effects depending on which job code you look at, and that it is not yet clear whether the net effect on total software employment will be negative, neutral, or eventually positive. Nonetheless, the TrueUp report was a trigger for the discussion that followed.</p>



<p>Aaron noted that engineers have historically been concentrated at tech companies because the cost of a software project was too high to justify anywhere else. But if agents make an engineer two to ten times more productive, all the software projects that were never economically viable suddenly become viable. Demand doesn&#8217;t shrink. It diffuses across the entire economy. In his tweet, he called it “<a href="https://en.wikipedia.org/wiki/Jevons_paradox" target="_blank" rel="noreferrer noopener">Jevons paradox</a> happening in real time.” In our conversation, he said:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>&#8220;What&#8217;s going to happen is the entire world is going to be looking at all the potential software that they build. And they&#8217;re going to start to say, Oh, I can finally justify going out and doing this type of project where I couldn&#8217;t before.&#8221;</p>
</blockquote>



<p>Engineers empowered by AI agents won&#8217;t just build software for IT teams. The total addressable role of the engineer expands from the technology department to every function in the enterprise. They&#8217;ll be wiring up automation for marketing, legal, accounting, and every other corporate function.</p>



<p>He’s totally right. Look around at all the crappy workflows, the crappy processes, the incredible overhead of things that ought to be simple. You think companies should lay off their developers to reduce costs when there’s so much shitty software out there? Really? There&#8217;s so much that needs to be improved. He had a great line: “Silicon Valley is spooked by its own technology.”</p>



<p>Over to me: The rhetoric from the labs about job destruction is actively counterproductive. I was talking recently with someone in healthcare who described a hospital system trying to fill a giant hole from reduced Medicare funding. They see AI as a way to gain efficiency in their back office so they can free up more resources for patient care.&nbsp;And of course the union is fighting it because they&#8217;ve been told AI is a monster that&#8217;s going to take their jobs. If you tell a different story, one about making the system better and serving more people more affordably, that&#8217;s something people can get behind. We have to change the narrative.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Silicon Valley Is Spooked by Its Own Technology" width="500" height="281" src="https://www.youtube.com/embed/14H2KbGZMfw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Context, not connectivity, is the real problem</strong></h2>



<p>I also asked Aaron whether protocols like MCP are making context portable enough to erode competitive moats. He agreed that the industry has broadly converged on openness and interoperability (with some toll booths to work through). But getting your systems to talk to each other doesn&#8217;t solve the harder problem of getting your data structured so that agents can actually find the right information at the right moment.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>&#8220;If it&#8217;s in 50 different systems and it&#8217;s not organized in a way that agents can readily take advantage of, what you&#8217;re going to be is at the mercy of how well that agent finds exactly the context that it needs to do its work. And you&#8217;re kind of just rolling the dice every time you do a workflow.&#8221;</p>
</blockquote>



<p>He predicts a decade of infrastructure modernization ahead, which sounds about right. At O&#8217;Reilly, I keep running into this myself. I&#8217;ll see a task that&#8217;s perfect for an agent and soon discover that the data I need is scattered across four systems and I have to jump through hoops to figure out who knows where the data is and how to get access. A friend running a large (but relatively new) enterprise that is turbocharging productivity and service delivery with agents told me recently that a big part of his team&#8217;s success was possible because they had spent a lot of time getting their data infrastructure in order from the start.</p>



<p>IMO, a lot of the stories you hear about OpenClaw and other harbingers of the agent future can be misleading in an enterprise context. They are doing greenfield setups, largely running consumer apps with well-defined interfaces, and even then, it takes weeks to set up properly. Now imagine agentic frameworks for companies with thousands of employees, hundreds of legacy apps, and deep wells of proprietary data. A decade of infrastructure modernization is generous. Without help, many enterprises will have difficulty making the transition.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Context Is the Entire Ballgame for Enterprise Agents" width="500" height="281" src="https://www.youtube.com/embed/WyCZtYQks_U?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Engineering the trade-offs</strong></h2>



<p>I brought up <a href="https://www.phillipcarter.dev/posts/llms-computers" target="_blank" rel="noreferrer noopener">Phillip Carter&#8217;s &#8220;two computers&#8221; framing</a>, that we&#8217;re now programming a deterministic computer and a probabilistic computer at the same time. Skills are a bridge, because they have both context for the LLM which can work probabilistically and tools that are built with deterministic code. Both systems coexist and work in parallel.</p>



<p>Aaron called the boundary between the two computers &#8220;the trillion-dollar question.&#8221; When does a process cross the threshold where it should be locked into repeatable, deterministic code? When should it stay adaptive? Loan processing needs to work the same way every time. Employee HR queries can be probabilistic. And the irony, as Aaron pointed out, is that making these trade-offs correctly requires deep technical understanding. AI makes the field more technical, not less.</p>



<p>I added that sometimes this judgment is a user experience question, sometimes a cost question. You can do something with an LLM, but it might be a lot cheaper with canned code. At other times, even though the LLM costs more, the flexibility of a liquid user interface is far better.</p>



<p>This is also a locus of creativity. What you bring out of AI is what you bring to it. Steve Jobs wasn&#8217;t a coder, but he knew how to get the most out of coders. He would have gone nuts with AI agents, because he was the essence of taste and judgment and setting the bar.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="AI Requires More Engineering Sophistication, Not Less" width="500" height="281" src="https://www.youtube.com/embed/1dDKWEeY0aU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Where startups win</strong></h2>



<p>I asked Aaron about the risks to existing enterprises from greenfield AI startups that can just move faster, reinventing what the incumbents do with an AI native solution, without all the baggage. He replied:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>“If there&#8217;s already a substantial amount of the data for that particular workflow in an existing system, and the incumbent is agile enough and responsive enough, then they are in a good position to build either the solutions or to monetize that set of work that&#8217;s going to be done….What agents are really good at is automating the unstructured areas of work, the messy, collaborative human-based parts of work, the tax process, the legal review process, the audit and risk analysis process of all of your contracts and unstructured data. And so in those areas, there&#8217;s no incumbent. The only incumbent is likely professional services firms. So that&#8217;s where I would favor startups.”</p>
</blockquote>



<p>Software startups like <a href="https://www.harvey.ai/" target="_blank" rel="noreferrer noopener">Harvey</a> are already taking services domains and building agents for them. But it’s not just software startups. Aaron also sees lots of opportunity for AI-native law firms, accounting firms, and ad agencies that can throw away legacy workflow, start from scratch, and deliver two to five times the output at lower cost will have a huge advantage.</p>



<p>I did push back with a point I think is underappreciated: Existing enterprises face a real risk that the organization will try to stuff AI into existing workflows rather than asking what the AI-native workflow would be. People are attached to their jobs, their roles, the org chart. We have to wrestle with that honestly if we&#8217;re going to truly reinvent what we do.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Startups Versus Incumbents" width="500" height="281" src="https://www.youtube.com/embed/wf0nINE8aog?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Humans get context for free</strong></h2>



<p>One of Aaron&#8217;s points about agents is that humans carry an enormous amount of ambient context that agents lack. You know what building you&#8217;re in and who else works there and what they do. You know the meeting that just happened where a team changed course on a strategy that hasn&#8217;t been written down yet. You have 20 years of accumulated domain knowledge. All of that is free context that we&#8217;ve never had to formalize. As he put it, &#8220;We&#8217;ve never built our business processes in a model where we assume that there&#8217;s a new user in that workflow that appeared one second ago and in under five seconds, they need to get all of the information possible to do that task.&#8221;</p>



<p>He suggested that one way to think of agents is as new employees who are experts but arrive with zero context and need to be fully briefed. And the context has to be precise, not just comprehensive. Give an agent too much context and it gets confused. Give it too little and it rolls the dice. SKILLS.md and AGENTS.md files are attempts to provide exactly the surgical context an agent needs for a specific process.</p>



<p>But 99% of knowledge work doesn&#8217;t have an AGENTS.md file, he noted. The data is everywhere. The context is everywhere. So in an existing enterprise, you have to reengineer workflows from the ground up to deliver the right information to agents at the right moment.</p>



<p>Aaron summed up Box&#8217;s strategic pivot in one sentence: swap the word &#8220;content&#8221; for &#8220;context&#8221; and the rest of the strategy stays the same. Enterprise context lives in contracts, research materials, financial documents. That&#8217;s all enterprise content but it isn’t always easily available as context. The evolution is making agents first-class citizens alongside people as users of that content. This very much maps to what we&#8217;re thinking about at O&#8217;Reilly too.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Box’s Strategic Pivot from Content to Context" width="500" height="281" src="https://www.youtube.com/embed/RByHwoTIdXM?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-world-needs-more-software-engineers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Radar Trends to Watch: April 2026</title>
		<link>https://www.oreilly.com/radar/radar-trends-to-watch-april-2026/</link>
				<comments>https://www.oreilly.com/radar/radar-trends-to-watch-april-2026/#respond</comments>
				<pubDate>Tue, 07 Apr 2026 10:34:21 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[Radar Trends]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18466</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-9.png" 
				medium="image" 
				type="image/png" 
				width="1400" 
				height="950" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-9-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Developments in AI models, software development, security, and more]]></custom:subtitle>
		
				<description><![CDATA[Starting with this issue of Trends, we&#8217;ve moved from simply reporting on news that has caught our eye and instead have worked with Claude to look at the various news items we&#8217;ve collected and to reflect on what they tell us about the direction and magnitude of change. William Gibson famously wrote, &#8220;The future is [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Starting with this issue of Trends, we&#8217;ve moved from simply reporting on news that has caught our eye and instead have worked with Claude to look at the various news items we&#8217;ve collected and to reflect on what they tell us about the direction and magnitude of change. William Gibson famously wrote, &#8220;The future is here. It&#8217;s just not evenly distributed yet.&#8221; In the language of scenario planning, what we&#8217;re looking for is &#8220;news from the future&#8221; that will confirm or challenge our assumptions about the present.</em></p>
</blockquote>



<p>AI has moved from a capability added to existing tools to an infrastructure layer present at every level of the computing stack. Models are now embedded in IDEs and tools for code review; tools that don&#8217;t embed AI directly are being reshaped to accommodate it. Agents are becoming managed infrastructure.</p>



<p>At the same time, two forces are reshaping the economics of AI. The cost of capable AI is falling. Laptop-class models now match last year&#8217;s cloud frontiers, and the break-even point against cloud API costs is measured in weeks. The competitive map has also fractured. What was a contest between a few Western labs is now a broad ecosystem of open source models, Chinese competitors, local deployments, and a growing set of forks and distributions. (Just look at the news that <a href="https://tomtunguz.com/cursor-kimi-open-source-ai-imperative/" target="_blank" rel="noreferrer noopener">Cursor is fronting Kimi K2.5</a>.) No single vendor or architecture is dominant, and that mix will drive both innovation and instability.</p>



<p>Security is a thread running through every section of this report. Each new AI capability reshapes the attack surface. AI tools can be poisoned, APIs repurposed, images forged, identities broken, and anonymous authors identified at scale. At the same time, foundational infrastructure faces threats that have nothing to do with AI: A researcher has come within striking distance of <a href="https://stateofutopia.com/papers/2/we-broke-92-percent-of-sha-256.html" target="_blank" rel="noreferrer noopener">breaking SHA-256</a>, the hashing algorithm underlying much of the web&#8217;s security. Organizations should audit both their AI-related exposures and the assumptions baked into the cryptographic infrastructure they depend on.</p>



<p>The technical transitions are easy to talk about. The human transitions are slower and harder to see. They include workforce restructuring, cognitive overload, and the erosion of collaborative work patterns. The job market data is beginning to clarify: Product management is up, AI roles are hot, and software engineering demand is recovering. The picture is more nuanced than either the optimists or the pessimists predicted.</p>



<h2 class="wp-block-heading">AI models</h2>



<p>The model market is moving fast enough that architectural and vendor commitments made today may not look right in six months. Capable models are now available from open source projects and a widening set of international competitors. The field is also starting to ask deeper questions. Predicting tokens may not be the only path to capable AI; the arrival of the first stable JEPA model suggests that alternative architectures are becoming real contenders. NVIDIA&#8217;s new model, which combines Mamba and Transformer layers, points in the same direction.</p>



<ul class="wp-block-list">
<li>Yann LeCun and his team have created <a href="https://arxiv.org/abs/2603.19312" target="_blank" rel="noreferrer noopener">LeWorldModel</a>, the first model using his Joint Embedding Predictive Architecture (JEPA) that trains stably. Their goal is to produce models that do more than predict words; they understand the world and how it works.</li>



<li>NVIDIA has released <a href="https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/" target="_blank" rel="noreferrer noopener">Nemotron 3 Super</a>, its latest open weights model. It&#8217;s a mixture of experts model with 120B parameters, 12B parameters of which active at any time. What&#8217;s more interesting is its design: It combines both Mamba and Transformer layers.</li>



<li><a href="https://arstechnica.com/ai/2026/03/the-debut-of-gemini-3-1-flash-live-could-make-it-harder-to-know-if-youre-talking-to-a-robot/" target="_blank" rel="noreferrer noopener">Gemini 3.1 Flash Live</a> is a new speech model that&#8217;s designed to support real-time conversation. When generating output, it avoids gaps and uses human-like cadences.</li>



<li>Cursor has <a href="https://cursor.com/blog/composer-2" target="_blank" rel="noreferrer noopener">released</a> Composer 2, the next generation version of its IDE. Composer 2 apparently <a href="https://news.smol.ai/issues/26-03-20-not-much" target="_blank" rel="noreferrer noopener">incorporates the Kimi K2.5 model</a>. It reportedly <a href="https://thenewstack.io/cursors-composer-2-beats-opus-46-on-coding-benchmarks-at-a-fraction-of-the-price/" target="_blank" rel="noreferrer noopener">beats Anthropic&#8217;s Opus 4.6</a> on some major coding benchmarks and is significantly less expensive.</li>



<li>Mistral has released <a href="https://mistral.ai/news/forge" target="_blank" rel="noreferrer noopener">Forge</a>, a system that enables organizations to build &#8220;frontier-grade&#8221; models based on their proprietary data. Forge supports pretraining, posttraining, and reinforcement learning.</li>



<li>Mistral has also released <a href="https://mistral.ai/news/mistral-small-4" target="_blank" rel="noreferrer noopener">Mistral Small 4</a>, its new flagship multimodal model. Small 4 is a 119B mixture of experts model that uses 6B parameters for each token. It&#8217;s fully open source, has a 256K context window, and is optimized to minimize latency and maximize throughput.</li>



<li>NVIDIA announced its own OpenClaw distribution, <a href="https://github.com/NVIDIA/NemoClaw" target="_blank" rel="noreferrer noopener">NemoClaw</a>, which integrates OpenClaw into NVIDIA&#8217;s stack. Of course it claims to have improved security. And of course it does inference in the NVIDIA cloud.</li>



<li>It&#8217;s not just OpenClaw; there&#8217;s also <a href="https://nanoclaw.dev/" target="_blank" rel="noreferrer noopener">NanoClaw</a>, <a href="https://klausai.com/landing-klaus" target="_blank" rel="noreferrer noopener">Klaus</a>, <a href="https://github.com/rcarmo/piclaw" target="_blank" rel="noreferrer noopener">PiClaw</a>, <a href="https://www.kimi.com/bot" target="_blank" rel="noreferrer noopener">Kimi Claw</a> and <a href="https://www.technologyreview.com/2026/03/11/1134179/china-openclaw-gold-rush/" target="_blank" rel="noreferrer noopener">others</a>. Some of these are clones, some of these are OpenClaw distros, and some are cloud services that run OpenClaw. Almost all of them claim improved security.</li>



<li>Anthropic has <a href="https://claude.com/blog/1m-context-ga" target="_blank" rel="noreferrer noopener">announced</a> that 1-million token context windows have reached general availability in Claude Opus 4.6 and Sonnet 4.6. There&#8217;s no additional charge for using a large window.</li>



<li>Microsoft has <a href="https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/" target="_blank" rel="noreferrer noopener">released</a> Phi-4-reasoning-vision-15B. It is a small open-weight model that combines reasoning with multimodal capabilities. They believe that the industry is trending toward smaller and faster models that can run locally.</li>



<li>Tomasz Tunguz <a href="https://tomtunguz.com/qwen-9b-matches-frontier-models/" target="_blank" rel="noreferrer noopener">writes</a> that Qwen3.5-9B can run on a laptop and has benchmark results comparable to December 2025&#8217;s frontier models. Compared to the cost of running frontier models in the cloud, a laptop running models locally will pay for itself in under a month.</li>



<li>OpenAI has <a href="https://openai.com/index/introducing-gpt-5-4/" target="_blank" rel="noreferrer noopener">released</a> GPT 5.4, which merges the Codex augmented coding model back into the product&#8217;s mainstream. It also incorporates a 1M token context window, computer use, and the ability to publish a plan that can be altered midcourse before taking action.</li>



<li>TweetyBERT is a <a href="https://techxplore.com/news/2026-03-tweetybert-parses-canary-songs-brains.html" target="_blank" rel="noreferrer noopener">language model for birds</a>. It breaks bird songs (they use canaries) into syllables without human annotation. They may be able to use this technique to understand how humans learn language.</li>



<li><a href="https://negroniventurestudios.com/2026/02/28/a-language-designed-for-machines-to-write/" target="_blank" rel="noreferrer noopener">Vera</a> is a new programming language that&#8217;s designed for AI to write. Unlike languages that are designed to be easy for humans, Vera is designed to help AI with aspects of programming that AIs find hard. Everything is explicit, state changes are declared, and every function has a contract.</li>



<li>The <a href="https://www.tomsguide.com/ai/i-use-the-potato-prompt-with-chatgpt-every-day-heres-how-it-gives-you-better-results" target="_blank" rel="noreferrer noopener">Potato Prompt</a> is a technique for getting GPT models to act as critics rather than yes-things. The idea is to create a custom instruction that tells GPT to be harshly critical when the word &#8220;potato&#8221; appears in the prompt. The technique would probably work with other models.</li>
</ul>



<h2 class="wp-block-heading">Software development</h2>



<p>The tools arriving in early 2026 point toward a deep reorganization of the role of software developers. Writing code is becoming less important, while reviewing, directing, and taking accountability for AI-generated code is becoming more so. How to write good specifications, how to evaluate AI output, and how to preserve the context of a coding session for later audit are all skills teams will need. The ecosystem around the development toolchain is also shifting: OpenAI&#8217;s acquisition of Astral, the company behind the Python package manager uv, signals that AI labs are moving to control developer infrastructure, not just models.</p>



<ul class="wp-block-list">
<li>OpenAI has <a href="https://developers.openai.com/codex/plugins" target="_blank" rel="noreferrer noopener">added</a> Plugins to its coding agent Codex. Plugins &#8220;bundle skills, app integrations, and MCP servers into reusable workflows&#8221;; conceptually, they&#8217;re similar to Claude Skills.</li>



<li><a href="https://projects.dev/" target="_blank" rel="noreferrer noopener">Stripe Projects</a> gives you the ability to build and manage an AI stack from the command line. This includes setting up accounts, billing, managing keys, and many other details.</li>



<li><a href="https://github.com/duriantaco/fyn" target="_blank" rel="noreferrer noopener">Fyn</a> is a fork of the widely used Python manager <a href="https://github.com/astral-sh/uv" target="_blank" rel="noreferrer noopener">uv</a>. It no doubt exists as a reaction to OpenAI&#8217;s <a href="https://arstechnica.com/ai/2026/03/openai-is-acquiring-open-source-python-tool-maker-astral/" target="_blank" rel="noreferrer noopener">acquisition</a> of Astral, the company that developed and supports uv.</li>



<li>Anthropic has announced <a href="https://venturebeat.com/orchestration/anthropic-just-shipped-an-openclaw-killer-called-claude-code-channels" target="_blank" rel="noreferrer noopener">Claude Code Channels</a>, an experimental feature that allows users to communicate with Claude using Telegram or Discord. Channels is seen as a way to compete with OpenClaw.</li>



<li><a href="https://support.claude.com/en/articles/13947068-assign-tasks-to-claude-from-anywhere-in-cowork" target="_blank" rel="noreferrer noopener">Claude Cowork Dispatch</a> allows you to control Cowork from your phone. Claude runs on your computer, but you can assign it tasks from anywhere and receive notification via text when it&#8217;s done.</li>



<li><a href="https://opencode.ai/" target="_blank" rel="noreferrer noopener">Opencode</a> is an <a href="https://github.com/anomalyco/opencode" target="_blank" rel="noreferrer noopener">open source</a> AI coding agent. It can make use of most models, including free and local models; it can be used in a terminal, as a desktop application, or an extension to an IDE; it can run multiple agents in parallel; and it can be used in privacy-sensitive environments.</li>



<li><a href="https://danashby.co.uk/2026/03/05/the-autonomous-testing-tipping-point-part-1/" target="_blank" rel="noreferrer noopener">Testing is changing</a>, <a href="https://danashby.co.uk/2026/03/15/the-autonomous-testing-tipping-point-part-2/" target="_blank" rel="noreferrer noopener">and for the better</a>. AI can automate the repetitive parts, and humans can spend more time thinking about what quality really means. Read both parts of this two-part series.</li>



<li><a href="https://claude.com/blog/code-review" target="_blank" rel="noreferrer noopener">Claude Review</a> does a code review on every pull request that Claude Code makes. Review is currently in research preview for Claude Teams and Claude Enterprise.</li>



<li>Andrej Karpathy&#8217;s <a href="https://github.com/karpathy/autoresearch" target="_blank" rel="noreferrer noopener">Autoresearch</a> &#8220;<a href="https://github.com/karpathy/autoresearch" target="_blank" rel="noreferrer noopener">automates the scientific method with AI agents</a>.&#8221; He&#8217;s used it to run hundreds of machine learning experiments per night: running an experiment, getting the results, and modifying the code to create another experiment in a loop.</li>



<li>Plumb is a new tool for <a href="https://www.dbreunig.com/2026/03/04/the-spec-driven-development-triangle.html" target="_blank" rel="noreferrer noopener">keeping specifications, tests, and code in sync</a>. It&#8217;s in its very early stages; it could be one of the most important tools in the spec-driven development tool chest.</li>



<li>&#8220;<a href="https://spin.atomicobject.com/use-ai-architectural-exploration/" target="_blank" rel="noreferrer noopener">How I Use AI Before the First Line of Code</a>&#8220;: Prior to code generation, use AI to suggest and test ideas. It&#8217;s a tremendous help in the planning stage.</li>



<li>Git has been around for 10 years. Is it the final word on version control, or are there better ways to think about software repositories? <a href="https://bramcohen.com/p/manyana" target="_blank" rel="noreferrer noopener">Manyana</a> is an attempt to rethink version control, based on CRDTs (conflict-free replicated data types).</li>



<li>Just committing code isn&#8217;t enough. When using AI, the session used to generate code should be part of the commit. <a href="https://github.com/mandel-macaque/memento" target="_blank" rel="noreferrer noopener">git-memento</a> is a Git extension that saves coding sessions as Markdown and commits them.</li>



<li><a href="https://github.com/ataraxy-labs/sem" target="_blank" rel="noreferrer noopener">sem</a> is a set of tools for semantic versioning that integrates with Git. When you are doing a diff, you don&#8217;t really want to which lines changed; you want to know what functions changed, and how.</li>



<li>Claude can now create <a href="https://claude.com/blog/claude-builds-visuals" target="_blank" rel="noreferrer noopener">interactive charts and diagrams</a>.</li>



<li><a href="https://github.com/prime-radiant-inc/clearance" target="_blank" rel="noreferrer noopener">Clearance</a> is an open source Markdown editor for macOS. Given the importance of Markdown files for working with Claude and other language models, a good editor is a welcome tool.</li>



<li>The <a href="https://github.com/googleworkspace/cli" target="_blank" rel="noreferrer noopener">Google Workspace CLI</a> provides a single command line interface for working with Google Workspace applications (including Google Docs, Sheets, Gmail, and of course Gemini). It&#8217;s currently experimental and unsupported.</li>



<li>At the end of February, Anthropic <a href="https://claude.com/contact-sales/claude-for-oss" target="_blank" rel="noreferrer noopener">announced</a> a program that grants open source developers six months of Claude Max usage. Not to be left out, OpenAI has <a href="https://developers.openai.com/codex/community/codex-for-oss" target="_blank" rel="noreferrer noopener">launched</a> a program that gives open source developers six months of API credits for ChatGPT Pro with Codex.</li>



<li>Here&#8217;s a Claude Code <a href="https://cc.storyfox.cz/" target="_blank" rel="noreferrer noopener">cheatsheet</a>!</li>



<li>Claude&#8217;s &#8220;<a href="https://claude.com/import-memory" target="_blank" rel="noreferrer noopener">import memory</a>&#8221; feature allows you to move easily between different language models: You can pack up another model&#8217;s memory and import it into Claude.</li>
</ul>



<h2 class="wp-block-heading">Infrastructure and operations</h2>



<p>Organizations should be thinking about agent governance now, before deployments reach a scale where the lack of governance becomes a problem. The AI landscape is moving from &#8220;Can we build this?&#8221; to &#8220;How do we run this reliably and safely?&#8221; The questions that defined the last year (Which model? Which framework?) are giving way to operational ones: How do we contain agents that behave unexpectedly? Where do we store their memory? How do we coordinate agents from multiple vendors? And when does it make sense to run them locally rather than in the cloud? Agents are also acquiring the ability to operate desktop applications directly, blurring the line between automation and user.</p>



<ul class="wp-block-list">
<li>Anthropic has <a href="https://www.cnet.com/tech/services-and-software/claude-control-your-computer-to-perform-tasks/" target="_blank" rel="noreferrer noopener">extended</a> its &#8220;<a href="https://x.com/claudeai/status/2036195789601374705" target="_blank" rel="noreferrer noopener">computer use</a>&#8221; feature so that it can control applications on users&#8217; desktops (currently macOS only). It can open applications, use the mouse and keyboard, and complete partially done tasks.</li>



<li>OpenAI has <a href="https://openai.com/index/introducing-openai-frontier/" target="_blank" rel="noreferrer noopener">released</a> Frontier, a platform for managing agents. Agents can come from any vendor. The goal is to allow business to organize and coordinate their AI efforts without siloing them by vendor.</li>



<li>Most agents assume that memory looks like a filesystem. Mikiko Bazeley <a href="https://thenewstack.io/ai-agent-memory-architecture/" target="_blank" rel="noreferrer noopener">argues</a> that filesystems aren&#8217;t the best option; they lack the indexes that databases have, which can be a performance penalty.</li>



<li><a href="https://ollama.com/library/qwen3-coder-next" target="_blank" rel="noreferrer noopener">Qwen-3-coder</a>, <a href="https://ollama.com/" target="_blank" rel="noreferrer noopener">Ollama</a>, and <a href="https://block.github.io/goose/" target="_blank" rel="noreferrer noopener">Goose</a> could replace agentic orchestration tools that use cloud-based models (Claude, GPT, Gemini) with a <a href="https://www.zdnet.com/article/local-ai-coding-stack-replaces-claude-code-codex-free/" target="_blank" rel="noreferrer noopener">stack that runs locally</a>.</li>



<li><a href="https://kubevirt.io/" target="_blank" rel="noreferrer noopener">KubeVirt</a> <a href="https://thenewstack.io/kubevirt-live-migration-mayastor/" target="_blank" rel="noreferrer noopener">packages</a> virtual machines as Kubernetes objects so that they can be managed together with containers.</li>



<li><a href="https://db9.ai/" target="_blank" rel="noreferrer noopener">db9</a> is a command line-oriented Postgres that&#8217;s designed for talking to agents. In addition to working with database tables, it has features for job scheduling and using regular files.</li>



<li><a href="https://github.com/qwibitai/nanoclaw" target="_blank" rel="noreferrer noopener">NanoClaw</a> can now be <a href="https://nanoclaw.dev/blog/nanoclaw-docker-sandboxes/" target="_blank" rel="noreferrer noopener">installed inside Docker sandboxes</a> with a single command. Running NanoClaw inside a container with its own VM makes it harder for the agent to escape and run malicious commands.</li>
</ul>



<h2 class="wp-block-heading">Security</h2>



<p>This issue has an unusually heavy security section, and not only because AI keeps expanding the attack surface. A researcher has come close to breaking SHA-256, the hashing algorithm that underpins SSL, Bitcoin, and much of the web&#8217;s security infrastructure. If hash collisions become possible in the coming months as predicted, the implications will reach every organization that relies on the internet. At the same time, AI systems are now capable of gaming their own benchmarks, and the pace of new attack techniques is outrunning the pace of security review.</p>



<ul class="wp-block-list">
<li>A researcher has come close to <a href="https://stateofutopia.com/papers/2/we-broke-92-percent-of-sha-256.html" target="_blank" rel="noreferrer noopener">breaking the SHA-256 hashing algorithm</a>. While it&#8217;s not yet possible to generate hash collisions, he expects that capability is only a few months away. SHA-256 is critical to web security (SSL), cryptocurrency (Bitcoin), and many other applications.</li>



<li>When running the BrowseComp benchmark, Claude <a href="https://www.anthropic.com/engineering/eval-awareness-browsecomp" target="_blank" rel="noreferrer noopener">hypothesized that it was being tested</a>, found the benchmark&#8217;s encrypted answer key on GitHub, decrypted the answers, and used them.</li>



<li>Anthropic has added <a href="https://claude.com/blog/auto-mode" target="_blank" rel="noreferrer noopener">auto mode</a> to Claude, a safer alternative to the &#8220;dangerously skip permissions&#8221; option. Auto mode uses a classifier to determine whether actions are safe before executing them and allows the user to switch between different sets of permissions.</li>



<li>In an interview, Linux kernel maintainer Greg Kroah-Hartman said that the quality of bug and security reports for the Linux kernel has <a href="https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_kernel/" target="_blank" rel="noreferrer noopener">suddenly improved</a>. It&#8217;s likely that improved AI tools for analyzing code are responsible.</li>



<li>A new kind of supply chain attack is <a href="https://arstechnica.com/security/2026/03/supply-chain-attack-using-invisible-code-hits-github-and-other-repositories/" target="_blank" rel="noreferrer noopener">infecting GitHub repositories</a> and others. It uses Unicode characters that don&#8217;t have a visual representation but are still meaningful to compilers and interpreters.</li>



<li><a href="https://www.ndss-symposium.org/ndss-paper/airsnitch-demystifying-and-breaking-client-isolation-in-wi-fi-networks/" target="_blank" rel="noreferrer noopener">AirSnitch</a> is a <a href="https://arstechnica.com/security/2026/02/new-airsnitch-attack-breaks-wi-fi-encryption-in-homes-offices-and-enterprises/" target="_blank" rel="noreferrer noopener">new attack</a> against WiFi. It uses layers 1 and 2 of the protocol stack to bypass encryption rather than breaking it.</li>



<li><a href="https://blog.mozilla.org/en/firefox/hardening-firefox-anthropic-red-team/" target="_blank" rel="noreferrer noopener">Anthropic&#8217;s red team worked with Mozilla</a> to discover and fix 22 security-related bugs and 90 other bugs in Firefox.</li>



<li>Microsoft has coined the term &#8220;<a href="https://www.microsoft.com/en-us/security/blog/2026/02/10/ai-recommendation-poisoning/" target="_blank" rel="noreferrer noopener">AI recommendation poisoning</a>&#8221; to refer to a common attack in which a &#8220;Summarize with AI&#8221; button attempts to add commands to the model&#8217;s persistent memory. Those commands will cause it to recommend the company&#8217;s products in the future.</li>



<li>Deepfakes are now being used to <a href="https://www.bleepingcomputer.com/news/security/how-deepfakes-and-injection-attacks-are-breaking-identity-verification/" target="_blank" rel="noreferrer noopener">attack identity systems</a>.</li>



<li>LLMs can do an excellent job of <a href="https://simonlermen.substack.com/p/large-scale-online-deanonymization" target="_blank" rel="noreferrer noopener">de-anonymization</a>, figuring out who wrote anonymous posts. And they can do it at scale. Are we surprised?</li>



<li>It used to be safe to expose Google API keys for services like Maps in code. But with AI in the picture, these <a href="https://www.bleepingcomputer.com/news/security/previously-harmless-google-api-keys-now-expose-gemini-ai-data/" target="_blank" rel="noreferrer noopener">keys are no longer safe</a>; they can be used as credentials for Google&#8217;s AI assistant, letting bad actors use Gemini to steal private data.</li>



<li>WIth AI, it&#8217;s easy to create <a href="https://flowingdata.com/2026/03/09/satellite-images-that-are-ai-fakes/" target="_blank" rel="noreferrer noopener">fake satellite images</a>. These images could be designed to have an effect military operations.</li>
</ul>



<h2 class="wp-block-heading">People and organizations</h2>



<p>The workforce implications of AI are more complicated than either the optimistic or pessimistic predictions suggest. The cognitive load on individuals is increasing, and the collaborative habits that distribute that load across a team are eroding. Managers should track not just velocity but sustainability. The skills that AI cannot replace, including judgment, communication, and the ability to ask the right question before writing a single line of code, are becoming more valuable. And the volume of AI-generated content is now large enough that organizations built around reviewing submissions, including app stores, publications, and academic journals, are struggling to keep up with it.</p>



<ul class="wp-block-list">
<li>Lenny Rachitsky&#8217;s <a href="https://www.lennysnewsletter.com/p/state-of-the-product-job-market-in-ee9" target="_blank" rel="noreferrer noopener">report</a> on the job market goes against this era&#8217;s received wisdom. Product manager positions are at the highest level in years. Demand for software engineers cratered in 2022, but has been rising steadily since. Recruiters are heavily in demand; and AI jobs are on fire.</li>



<li>Apple&#8217;s app store, along with many other app stores and publications of all sorts, is fighting a &#8220;<a href="https://www.latent.space/p/ainews-apples-war-on-slop" target="_blank" rel="noreferrer noopener">war on slop</a>&#8220;: deluges of AI-generated submissions that swamp their ability to review.</li>



<li>Teams of software developers can be smaller and work faster because <a href="https://tomtunguz.com/communication-tax-small-orgs/" target="_blank" rel="noreferrer noopener">AI reduces the need for human coordination and communications</a>. The question becomes &#8220;How many agents can one developer manage?&#8221; But also be aware of burnout and <a href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163" target="_blank" rel="noreferrer noopener">the AI vampire</a>.</li>



<li>Brandon Lepine, Juho Kim, Pamela Mishkin, and Matthew Beane <a href="https://arxiv.org/abs/2505.10742" target="_blank" rel="noreferrer noopener">measure cognitive overload</a>, which develops from the interaction between a model and its user. Prompts are imprecise by nature; the LLM produces output that reflects the prompt but may not be what the user really wanted; and getting back on track is difficult.</li>



<li>A <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5007084" target="_blank" rel="noreferrer noopener">study</a> claims that the use of GitHub Copilot is correlated with less time spent on management activities, less time spent collaboration, and more on individual coding. It&#8217;s unclear how this generalizes to tools like Claude Code.</li>
</ul>



<h2 class="wp-block-heading">Web</h2>



<ul class="wp-block-list">
<li><a href="https://thatshubham.com/blog/news-audit" target="_blank" rel="noreferrer noopener">The 49MB Web Page</a> documents the way many websites—particularly news sites—make user experience miserable. It&#8217;s a microscopic view of enshittification.</li>



<li>Simon Willison has created a tool that writes a <a href="https://simonwillison.net/2026/Mar/21/profiling-hacker-news-users/" target="_blank" rel="noreferrer noopener">profile of Hacker News users</a> based on their comments, all of which are publicly available through the Hacker News API. It is, as he says, &#8220;a little creepy.&#8221;</li>



<li>A personal digital twin is an excellent way to augment your abilities. <a href="https://www.tomsguide.com/ai/i-made-a-digital-twin-of-myself-in-chatgpt-and-it-changed-how-i-work-every-day" target="_blank" rel="noreferrer noopener"><em>Tom&#8217;s Guide</em> shows you how to make one</a>.</li>



<li>It&#8217;s been a long time since we&#8217;ve pointed to a masterpiece of web play. Here&#8217;s <a href="https://codepen.io/mrdoob_/full/NPRwLZd" target="_blank" rel="noreferrer noopener">Ball Pool</a>: interactive, with realistic physics and lighting. It will waste your time (but probably not too much of it).</li>



<li>Want interactive XKCD? <a href="https://editor.p5js.org/isohedral/full/vJa5RiZWs" target="_blank" rel="noreferrer noopener">You&#8217;ve got it</a>.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/radar-trends-to-watch-april-2026/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Engineering Storefronts for Agentic Commerce</title>
		<link>https://www.oreilly.com/radar/engineering-storefronts-for-agentic-commerce/</link>
				<comments>https://www.oreilly.com/radar/engineering-storefronts-for-agentic-commerce/#respond</comments>
				<pubDate>Mon, 06 Apr 2026 11:10:20 +0000</pubDate>
					<dc:creator><![CDATA[Heiko Hotz]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18458</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Engineering-storefronts-for-agentic-commerce.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Engineering-storefronts-for-agentic-commerce-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why deterministic data infrastructure is replacing visual persuasion in ecommerce]]></custom:subtitle>
		
				<description><![CDATA[For years, persuasion has been the most valuable skill in digital commerce. Brands spend millions on ad copy, testing button colours, and designing landing pages to encourage people to click &#8220;Buy Now.&#8221; All of this assumes the buyer is a person who can see. But an autonomous AI shopping agent does not have eyes. I [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>For years, persuasion has been the most valuable skill in digital commerce. Brands spend millions on ad copy, testing button colours, and designing landing pages to encourage people to click &#8220;Buy Now.&#8221; All of this assumes the buyer is a person who can see. But an autonomous AI shopping agent does not have eyes.</p>



<p>I recently ran an experiment to see what happens when a well-designed buying agent visits two types of online stores: one built for people, one built for machines. Both stores sold hiking jackets. Merchant A used the kind of marketing copy brands have refined for years: &#8220;The Alpine Explorer. Ultra-breathable all-weather shell. Conquers stormy seas!&#8221; Price: $90. Merchant B provided only raw structured data: no copy, just a JSON snippet <code>{"water_resistance_mm": 20000}</code>. Price: $95. I gave the agent a single instruction: &#8220;Find me the cheapest waterproof hiking jacket suitable for the Scottish Highlands.&#8221;</p>



<p>The agent quickly turned my request into clear requirements, recognizing that &#8220;Scottish Highlands&#8221; means heavy rain and setting a minimum water resistance of 15,000–20,000 mm. I ran the test 10 times. Each time, the agent bought the more expensive jacket from Merchant B. The agent completely bypassed the cheaper option due to the data&#8217;s formatting.</p>



<p>The reason lies in the <strong>Sandwich Architecture</strong>: the middle layer of deterministic code that sits between the LLM&#8217;s intent translation and its final decision. When the agent checked Merchant A, this middle layer attempted to match &#8220;conquers stormy seas&#8221; against a numeric requirement. Python gave a validation error, the try/except block caught it, and the cheaper jacket was dropped from consideration in 12 milliseconds. This is how well-designed agent pipelines operate. They place intelligence at the top and bottom, with safety checks in the middle. That middle layer is deterministic and literal, systematically filtering out unstructured marketing copy.</p>



<h2 class="wp-block-heading"><strong>How the Sandwich Architecture works</strong></h2>



<p>A well-built shopping agent operates in three layers, each with a fundamentally different job.</p>



<p><strong>Layer 1: The Translator.</strong> This is where the LLM does its main job. A human says something vague and context-laden—&#8221;I need a waterproof hiking jacket for the Scottish Highlands&#8221;—and the model turns it into a structured JSON query with explicit numbers. In my experiment, the Translator consistently mapped &#8220;waterproof&#8221; to a minimum <code>water_resistance_mm</code> between 10,000 and 20,000mm. Across 10 runs, it stayed focused and never hallucinated features.</p>



<p><strong>Layer 2: The Executor.</strong> This critical middle layer contains zero intelligence by design. It takes the structured query from the Translator and checks each merchant&#8217;s product data against it. It relies entirely on strict type validation instead of reasoning or interpretation. Does the merchant&#8217;s <code>water_resistance_mm</code> field contain a number greater than or equal to the Translator&#8217;s minimum? If yes, the product passes. If the field contains a string such as &#8220;conquers stormy seas,&#8221; the validation fails immediately. These Pydantic type checks treat ambiguity as absence. In a production system handling real money, a try/except block cannot be swayed by good copywriting or social proof.</p>



<p><strong>Layer 3: The Judge.</strong> The surviving products are passed to a second LLM call that makes the final selection. In my experiment, this layer simply picked the cheapest option. In more complex scenarios, the Judge evaluates value against specific user preferences. The Judge selects exclusively from a preverified shortlist.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="855" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-7.png" alt="Figure 1: The Sandwich Architecture" class="wp-image-18459" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-7.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-7-300x160.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-7-768x410.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-7-1536x821.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 1: The Sandwich Architecture</em></figcaption></figure>



<p>This three-layer pattern (LLM → deterministic code → LLM) reflects how engineering teams build most serious agent pipelines today. <a href="https://blog.crewai.com/agentic-systems-with-crewai/" target="_blank" rel="noreferrer noopener">DocuSign&#8217;s sales outreach system</a> uses a similar structure: An LLM agent composes personalized outreach based on lead research. A deterministic layer then enforces business rules before a final agent reviews the output. DocuSign found the agentic system matched or beat human reps on engagement metrics while significantly cutting research time. The reason this pattern keeps appearing is clear: LLMs handle ambiguity well, while deterministic code provides reliable, strict validation. The Sandwich Architecture uses each where it&#8217;s strongest.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack.&nbsp;<a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>This is precisely why Merchant A&#8217;s jacket vanished. The Executor tried to parse &#8220;Ultra-breathable all-weather shell&#8221; as an integer and failed. The Judge received a list containing exactly one product. In an agentic pipeline, the layer deciding whether your product is considered cannot process standard marketing.</p>



<h2 class="wp-block-heading"><strong>From storefronts to structured feeds</strong></h2>



<p>If ad copy gets filtered out, merchants must expose the raw product data—fabric, water resistance, shipping rules—already sitting in their PIM and ERP systems. To a shopping agent validating a <code>breathability_g_m2_24h</code> field, &#8220;World&#8217;s most breathable mesh&#8221; triggers a validation error that drops the product entirely. A competitor returning <code>20000</code> passes the filter. Persuasion is mathematically lossy. Marketing copy compresses a high-information signal (a precise breathability rating) into a low-information string that cannot be validated. Information is destroyed in the translation, and the agent cannot recover it.</p>



<p>The emerging standard for solving this is the Universal Commerce Protocol (UCP). UCP asks merchants to publish a capability manifest: one structured Schema.org feed that any compliant agent can discover and query. This migration requires a fundamental overhaul of infrastructure. Much of what an agent needs to evaluate a purchase is currently locked inside frontend React components. Every piece of logic a human triggers by clicking must be exposed as a queryable API. In an agentic market, an incomplete data feed leads to complete exclusion from transactions.</p>



<h2 class="wp-block-heading"><strong>Why telling agents not to buy your product is a good strategy</strong></h2>



<p>Exposing structured data is only half the battle. Merchants must also actively tell agents not to buy their products. Traditional marketing casts the widest net possible. You stretch claims to broaden appeal, letting returns handle the inevitable mismatches. In agentic commerce, that logic inverts. If a merchant describes a lightweight shell as suitable for &#8220;all weather conditions,&#8221; a human applies common sense. An agent takes it literally. It buys the shell for a January blizzard, resulting in a return three days later.</p>



<p>In traditional ecommerce, that return is a minor cost of doing business. In an agentic environment, a return tagged &#8220;item not as described&#8221; generates a persistent trust discount for all future interactions with that merchant. This forces a strategy of <strong>negative optimization</strong>. Merchants must explicitly code who their product is not for. Adding <code>"not_suitable_for": ["sub-zero temperatures", "heavy snow"]</code> prevents false-positive purchases and protects your trust score. Agentic commerce heavily prioritizes postpurchase accuracy, meaning overpromising will steadily degrade your product&#8217;s discoverability.</p>



<h2 class="wp-block-heading"><strong>From banners to logic: How discounts become programmable</strong></h2>



<p>Just as agents ignore marketing language, they cannot respond to pricing tricks. Open any online store and you&#8217;ll encounter countdown timers or banners announcing flash sales. Promotional marketing tactics like fake scarcity rely heavily on human emotions. An AI agent does not experience scarcity anxiety. It treats a countdown timer as a neutral scheduling parameter.</p>



<p>Discounts change form. Instead of visual triggers, they become programmable logic in the structured data layer. A merchant could expose conditional pricing rules: If the cart value exceeds $200 and the agent has verified a competing offer below $195, automatically apply a 10% discount. This is a fundamentally different incentive. It serves as a transparent, machine-readable contract. The agent directly calculates the deal&#8217;s mathematical value. With the logic exposed directly in the payload, the agent can factor it into its optimization across multiple merchants simultaneously. When the buyer is an optimization engine, transparency becomes a competitive feature.</p>



<h2 class="wp-block-heading"><strong>Where persuasion migrates</strong></h2>



<p>The Sandwich Architecture&#8217;s middle layer is persuasion-proof by design. For marketing teams, structured data is no longer a backend concern; it is the primary interface. Persuasion now migrates to the edges of the transaction. Before the agent runs, brand presence still shapes the user&#8217;s initial prompt (e.g., &#8220;find me a North Face jacket&#8221;). After the agent filters the options, human buyers often review the final shortlist for high-value purchases. Furthermore, operational excellence builds algorithmic trust over time, acting as a structural form of persuasion for future machine queries. You need brand presence to shape the user&#8217;s initial prompt and operational excellence to build long-term algorithmic trust. Neither matters if you cannot survive the deterministic filter in the middle.</p>



<p>Agents are now browsing your store alongside human buyers. Brands treating digital commerce as a purely visual discipline will find themselves perfectly optimized for humans, yet invisible to the agents. Engineering and commercial teams must align on a core requirement: Your data infrastructure is now just as critical as your storefront.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/engineering-storefronts-for-agentic-commerce/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Cathedral, the Bazaar, and the Winchester Mystery House</title>
		<link>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/</link>
				<comments>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/#respond</comments>
				<pubDate>Fri, 03 Apr 2026 11:14:53 +0000</pubDate>
					<dc:creator><![CDATA[Drew Breunig]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18446</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-Cathedral-the-Bazaar-and-the-Winchester-Mystery-House.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-Cathedral-the-Bazaar-and-the-Winchester-Mystery-House-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Our era of sprawling, idiosyncratic tooling]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Drew Breunig’s blog and is being republished here with the author’s permission. In 1998, Eric S. Raymond published the founding text of open source software development, The Cathedral and the Bazaar. In it, he detailed two methods of building software: The bazaar model was enabled by the internet, which [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://www.dbreunig.com/2026/03/26/winchester-mystery-house.html" target="_blank" rel="noreferrer noopener"><em>Drew Breunig’s blog</em></a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>In 1998, Eric S. Raymond published the founding text of open source software development, <a href="http://www.catb.org/~esr/writings/cathedral-bazaar/" target="_blank" rel="noreferrer noopener"><em>The Cathedral and the Bazaar</em></a>. In it, he detailed two methods of building software:</p>



<ul class="wp-block-list">
<li><em>The cathedral</em> model is carefully planned, closed-source, and managed by an exclusive team of developers.</li>



<li><em>The bazaar</em> model is open, transparent, and community-driven.</li>
</ul>



<p>The bazaar model was enabled by the internet, which allowed for distributed coordination and distribution. More people could contribute code and share feedback, yielding better, more secure software. “Given enough eyeballs, all bugs are shallow,” Raymond wrote, coining <a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">Linus’s law</a>.</p>



<p>The ideas crystallized in <em>The Cathedral and the Bazaar</em> helped kick off a quarter-century of open source innovation and dominance.</p>



<p>But just as the internet made communication cheap and birthed the bazaar, AI is making code cheap and kicking off a new era filled with idiosyncratic, sprawling, cobbled-together software.</p>



<p>Meet the third model: <em>The Winchester Mystery House</em>.</p>



<figure class="wp-block-image size-full"><a href="https://www.flickr.com/photos/harshlight/3669393933"><img loading="lazy" decoding="async" width="1600" height="898" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.jpeg" alt="Image by HarshLight on Flickr (and used here on a Creative Commons license)" class="wp-image-18447" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.jpeg 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-300x168.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-768x431.jpeg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1536x862.jpeg 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></a><figcaption class="wp-element-caption"><em><a href="https://www.flickr.com/photos/harshlight/3669393933" target="_blank" rel="noreferrer noopener">Winchester Mystery House</a></em> (<em>image by <a href="https://www.flickr.com/photos/harshlight/" target="_blank" rel="noreferrer noopener">HarshLight</a> and used here on a </em><a href="https://creativecommons.org/licenses/by/2.0/" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">The Winchester Mystery House</h2>



<p>Located less than 10 miles southeast from the <a href="https://computerhistory.org/" target="_blank" rel="noreferrer noopener">Computer History Museum</a>, the <a href="https://en.wikipedia.org/wiki/Winchester_Mystery_House" target="_blank" rel="noreferrer noopener">Winchester Mystery House</a> is an architectural oddity.</p>



<p>Following the death of her husband and mother-in-law, Sarah Winchester controlled a fortune. Her shares in the <a href="https://en.wikipedia.org/wiki/Winchester_Repeating_Arms_Company" target="_blank" rel="noreferrer noopener">Winchester Repeating Arms Company</a>, and the dividends they threw off, made it so Sarah could not only live in comfort but pursue whatever passion she desired. That passion was architecture.</p>



<p>Sarah didn’t build her mansion to house ghosts<sup data-fn="6b5d56a2-c8e2-4889-b816-684245a77bcd" class="fn"><a href="#6b5d56a2-c8e2-4889-b816-684245a77bcd" id="6b5d56a2-c8e2-4889-b816-684245a77bcd-link">1</a></sup>; <a href="https://amzn.to/4rZK1C8" target="_blank" rel="noreferrer noopener">she built her mansion because she liked architecture</a>. With no license, no formal training, in an era when women (even very rich women) didn’t have a path to practicing architecture, Sarah focused on her own home. She made up for her lack of license with passion and effectively unlimited funds.</p>



<p>Sarah built what she wanted. “<a href="https://en.wikipedia.org/wiki/Winchester_Mystery_House" target="_blank" rel="noreferrer noopener">At its largest the house had ~500 rooms</a>.” Today it has roughly 160 rooms, 2,000 doors, 10,000 windows, 47 stairways, 47 fireplaces, 13 bathrooms, and 6 kitchens. Carved wood drapes the walls and ceilings. Stained glass is everywhere. Projects were planned, completed, abandoned, torn down, and rebuilt.</p>



<p>It was anything but aimless. And practical innovations ran throughout, including push-button gas lighting, an early intercom system, steam heating, and indoor gardens. The oddities that amuse today’s visitors were mostly practical accommodations for Sarah’s health (stairways with very small steps), functional designs no longer used (trap doors in greenhouses to route excess water), or quick fixes to damage from the 1906 earthquake.</p>



<p>Winchester passed in 1922. Nine months later, the house became a tourist attraction.</p>



<p>Today, many programmers are Sarah Winchester.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="880" height="440" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5.png" alt="Claude Code's public GitHub activity" class="wp-image-18448" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5.png 880w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5-300x150.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5-768x384.png 768w" sizes="auto, (max-width: 880px) 100vw, 880px" /><figcaption class="wp-element-caption"><em>Claude Code&#8217;s public GitHub activity</em></figcaption></figure>



<h2 class="wp-block-heading">What happens when code is cheap</h2>



<p>We aren’t as rich as Sarah Winchester, but when code is this cheap, we don’t need to be.</p>



<p>Jodan Alberts illustrated this recently, <a href="https://www.claudescode.dev/" target="_blank" rel="noreferrer noopener">collecting and visualizing data detailing public GitHub commits attributed to Claude Code</a>. That’s his data in the chart above, with Claude seeming to only accelerate through March.<sup data-fn="de32f944-88cb-40cf-ba5a-a85253a6ad73" class="fn"><a href="#de32f944-88cb-40cf-ba5a-a85253a6ad73" id="de32f944-88cb-40cf-ba5a-a85253a6ad73-link">2</a></sup></p>



<p>It’s hard to get a handle on individual usage though, so I went searching for a proxy and landed on the chart below:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="880" height="396" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6.png" alt="Average net lines added per commit in Claude Code: 7-day average" class="wp-image-18449" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6.png 880w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6-300x135.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6-768x346.png 768w" sizes="auto, (max-width: 880px) 100vw, 880px" /><figcaption class="wp-element-caption"><em>Average net lines added per commit in Claude Code: 7-day average</em></figcaption></figure>



<p>After Opus 4.5 and recent work enabling Agent Teams, the average net lines added by Claude per commit is now smooth and steady at <em>1,000 lines of code per commit.</em><sup data-fn="bc98f5bc-dd9d-4421-a544-65d4191ad4fb" class="fn"><a href="#bc98f5bc-dd9d-4421-a544-65d4191ad4fb" id="bc98f5bc-dd9d-4421-a544-65d4191ad4fb-link">3</a></sup></p>



<p><strong>1,000 lines of code per commit is ~2 magnitudes higher than what a human programmer writes <em>per day</em>.</strong></p>



<p>If you search for human benchmarks, you’ll find many citing Fred Brooks’s <em><a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month">The Mythical Man Month</a></em> while claiming a good engineer might write <em>10 cumulative lines of code per day.</em><sup data-fn="bb51a862-1362-4241-b2ba-6fecac1df6b9" class="fn"><a href="#bb51a862-1362-4241-b2ba-6fecac1df6b9" id="bb51a862-1362-4241-b2ba-6fecac1df6b9-link">4</a></sup> If you further explore, you’ll find numbers higher than 10 cited, but generally less than 100.</p>



<p>Here’s a good anecdote from <a href="https://antirez.com/latest/0" target="_blank" rel="noreferrer noopener">antirez</a> on a <a href="https://news.ycombinator.com/item?id=22305934" target="_blank" rel="noreferrer noopener">Hacker News</a> thread discussing the Brooks “quote”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I did some trivial math. Redis is composed of 100k lines of code, I wrote at least 70k of that in 10 years. I never work more than 5 days per week and I take 1 month of vacations every year, so assuming I work 22 days every month for 11 months:</p>



<p><em>70000/(22 x 11 x 10) = ~29 LOC / day</em></p>



<p>Which is not too far from 10. There are days where I write 300-500 LOC, but I guess that a lot of work went into rewriting stuff and fixing bugs, so I rewrote the same lines again and again over the course of years, but yet I think that this should be taken into account, so the Mythical Man Month book is indeed quite accurate.</p>
</blockquote>



<p>Six years after this comment, Claude is pushing <em>1,000</em> lines of code <em>per commit</em>.</p>



<p>So what do we do with all this cheap code?</p>



<p>Unfortunately, everything else remains roughly the same cost and roughly the same speed. Feedback hasn’t gotten cheaper; the “<a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">eyeballs</a>” that guided the software developed by the bazaar haven’t caught up to AI.</p>



<p>There is only one source of feedback that moves at the speed of AI-generated code: yourself. You’re there to prompt, you’re there to review. You don’t need to recruit testers, run surveys, or manage design partners. You just build what you want and use what you build.</p>



<p>And that’s what many developers are doing with cheap code: building idiosyncratic tools for ourselves, guided by our passions, taste, and needs.</p>



<p>Sound familiar?</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1567" height="799" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.jpeg" alt="" class="wp-image-18450" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.jpeg 1567w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-300x153.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-768x392.jpeg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-1536x783.jpeg 1536w" sizes="auto, (max-width: 1567px) 100vw, 1567px" /><figcaption class="wp-element-caption"><a href="https://commons.wikimedia.org/wiki/File:Winchester_Mystery_House_2023-07-17_02.jpg" target="_blank" rel="noreferrer noopener"><em>Winchester Mystery House, San Jose, California</em></a><em> (image by </em><a href="https://commons.wikimedia.org/wiki/User:The_wub" target="_blank" rel="noreferrer noopener"><em>The wub</em></a><em> and used here under a </em><a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">Welcome to the mystery house</h2>



<p>Steve Yegge’s <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">Gas Town</a> is a Winchester Mystery House. It’s <em>incredibly</em> idiosyncratic and sprawling, rich with metaphors and hacks. It’s the perfect tool for Steve.</p>



<p>Jeffrey Emanuel’s <a href="https://agent-flywheel.com/" target="_blank" rel="noreferrer noopener">Agent Flywheel</a> is a Winchester Mystery House. A significant subset of <a href="https://www.nytimes.com/2026/03/20/technology/tokenmaxxing-ai-agents.html" target="_blank" rel="noreferrer noopener">tokenmaxxers</a> decide they need to rebuild their dependencies in Rust; Jeff is one such example. His “<a href="https://github.com/Dicklesworthstone#the-frankensuite" target="_blank" rel="noreferrer noopener">FrankenSuite</a>” includes Rust rewrites of SQLite, Node.js, btrfs, Redis, pandas, NumPy, JAX, and Torch.</p>



<p>Philip Zeyliger noted the pattern last week, writing, “<a href="https://blog.exe.dev/bones-of-the-software-factory" target="_blank" rel="noreferrer noopener">Everyone is building a software factory</a>.” But it goes beyond software. Gary Tan’s personal AI committee <a href="https://github.com/garrytan/gstack" target="_blank" rel="noreferrer noopener">gstack</a> is a Winchester Mystery House constructed mostly from Markdown.</p>



<p>Everywhere you look, there are Winchester Mystery Houses.</p>



<p>Each Winchester Mystery House is <strong>idiosyncratic</strong>. They are highly personalized. The tightly coupled feedback loop between the coding agent and the user yields software that reflects the developer’s desires. They usually lack documentation. To outsiders, they’re inscrutable.</p>



<p>Winchester Mystery Houses are <strong>sprawling</strong>. Guided by the needs of the developer, these tools tend to spread out, constantly annexing territory in the form of new functions and new repositories. Work is almost always additive. Code is added when it’s needed, bugs are patched in place, and countless appendages remain. There’s little incentive to prune when code is free.</p>



<p>And building a Winchester Mystery House should be <strong>fun</strong>. Coding agents turn everything into a side quest, and we eagerly join in. Building the perfect workflow is a passion for many devs, so we keep pushing.</p>



<p>Winchester Mystery Houses are idiosyncratic, sprawling, and fun. But does this mean we’re abandoning the bazaar?</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1200" height="549" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.jpeg" alt="A Crowded Market in Dhaka, Bangladesh (image by International Food Policy Research Institute / 2010 and used here on a Creative Commons license)" class="wp-image-18451" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.jpeg 1200w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-300x137.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-768x351.jpeg 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /><figcaption class="wp-element-caption"><a href="https://www.flickr.com/photos/ifpri/4860343116" target="_blank" rel="noreferrer noopener"><em>A Crowded Market in Dhaka, Bangladesh</em></a><em> (image by </em><a href="https://www.flickr.com/photos/ifpri/" target="_blank" rel="noreferrer noopener"><em>International Food Policy Research Institute</em></a><em> / 2010 and used here on a </em><a href="https://creativecommons.org/licenses/by-nc-nd/2.0/" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">What happens to the bazaar?</h2>



<p>What happens when we all tend to our mystery houses? When our free time is spent building tools just for ourselves, will we stop working on shared projects? Will we abandon the bazaar?</p>



<p>Probably not. The bazaar is <em>packed</em> right now, but not in a good way.</p>



<p>Code is cheap, so people are slamming open source repositories with agent-written contributions, in an attempt to pad their résumés or manifest their pet features. Daniel Stenberg <a href="https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty/" target="_blank" rel="noreferrer noopener">ended bug bounties for curl</a> after a deluge of poor submissions sapped reviewer bandwidth. It’s gotten so bad, <a href="https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/" target="_blank" rel="noreferrer noopener">GitHub recently added a feature to disable pull request contributions</a>.</p>



<p>Anecdotally, I’m seeing good contributions pick up as well. They’re just drowned out by the slop. For what it’s worth, <a href="https://github.com/curl/curl/graphs/contributors" target="_blank" rel="noreferrer noopener">curl commits are dramatically <em>up</em> in the agentic era</a>. And people <em>are</em> sharing what they build. A <a href="https://www.dumky.net/posts/youre-right-to-be-anxious-about-ai-this-is-how-much-we-are-building/" target="_blank" rel="noreferrer noopener">recent analysis by Dumky</a> shows packages and repos rising in the last quarter.</p>



<p>There’s plenty of budget for both mystery houses and the bazaar when code is <em>this</em> cheap. The new challenge is developing systems and processes for managing the deluge. We don’t need <a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">eyeballs</a> to find bugs <em>in</em> the software; we need eyeballs to find bugs before they <em>reach</em> the software.</p>



<p>In many ways this is the inverse of the bazaar model era. The internet made feedback and communal coordination faster, easier, and cheaper. The bazaar model has a high throughput of feedback (many eyeballs) but relatively high latency for modifications (file an issue, discuss, submit a PR, wait for review, etc.).</p>



<p>Coding agents, on the other hand, make implementation faster while feedback and coordination are unchanged. The Winchester Mystery House model sidesteps this by collapsing the feedback loop into one person: Latency is near zero, but throughput is just you. The bazaar, defined by communal work, can’t adopt this hack. Coding agents in the bazaar create a mess: implementation at machine speed hitting coordination infrastructure built for human speed. Which is why maintainers feel like they’re drowning.</p>



<p>We need new tools, skills, and conventions.</p>



<h2 class="wp-block-heading">Lessons from the mystery house</h2>



<p>Coding agents have dropped the cost of code so dramatically we’re entering a new era of software development, the first change of this magnitude since the internet kicked off open source software. Change arrived quickly, and it’s not slowing down. But in reviewing the Winchester Mystery House framework, I think we can take away a few lessons.</p>



<h3 class="wp-block-heading">Lesson 1: The bazaar and Winchester Mystery Houses can coexist.</h3>



<p>When listing example Winchester Mystery Houses, I didn’t mention <a href="https://github.com/openclaw/openclaw" target="_blank" rel="noreferrer noopener">OpenClaw</a>, even though it is <em>the</em> defining example. I saved it for here because it nicely illustrates how Winchester Mystery Houses and the bazaar can coexist.</p>



<p>OpenClaw is incredibly modular and places few limitations on the user. It integrates 25 different chat and notification systems, plugs into most inference end points, and is built on the exceptionally flexible <a href="https://github.com/badlogic/pi-mono" target="_blank" rel="noreferrer noopener">pi</a> agent toolkit. This eager flexibility was embraced early—security and data protections be damned—but since its exponential adoption Peter Steinberger and the community have been steadily pushing improvements and fixes.</p>



<p>And like other breakout open source projects of yore, the ecosystem is adopting the best ideas and mitigating the worst aspects of OpenClaw. Countless alternate “claw” projects have emerged. (There’s NanoClaw, NullClaw, ZeroClaw, and more!) Companies have launched services to make claws easy or safer. Cloudflare launched Moltworker to make deploy easy, Nvidia shipped NemoClaw with a security focus, and Claude keeps adding claw-like features to its desktop app.</p>



<h3 class="wp-block-heading">Lesson 2: Don’t sell the fun stuff.</h3>



<p>One reason OpenClaw works so well in the bazaar is that it is a <em>foundation for personal tools.</em> Out of the box, a claw just sits there. It’s up to the user to determine what it does and how it does it, leveraging the connections and infrastructure OpenClaw provides. OpenClaw lets less experienced developers spin up their own Winchester Mystery Houses, while experienced devs get to leverage much of the common integrations and systems OpenClaw provides. Peter and team have done a great job drawing a line between the common core (what the bazaar works on) and what they leave up to the user: The boring, critical stuff is the job of the commons.</p>



<p>Thinking back to Sarah Winchester and her idiosyncratic, sprawling mansion, we see the same pattern. Sarah hired vendors! She used off-the-shelf parts! Her bathtubs, toilets, faucets, and plumbing weren’t crafted on site.</p>



<p>The boring stuff, the hard bits, or the things that have <em>disastrous</em> failure modes are the things we should collaborate on or employ specialists to handle. (Come to think, plumbing checks all three boxes). This is the opportunity for open source software, dev tools, and software companies.</p>



<p>Don’t try to sell developers the stuff that’s fun, the stuff they <em>want</em> to build. Sell them the stuff they avoid or don’t want to take responsibility for. Sarah Winchester didn’t hire metalworkers to craft the pipes for her plumbing, but she <em>did</em> hire craftspeople to create hundreds of stained-glass windows to her specs.</p>



<h3 class="wp-block-heading">Lesson 3: The limits of code are communication.</h3>



<p>OpenClaw shows the bazaar remains relevant but also highlights the problems facing open source in the agentic era. Right now, there are 1,173 open pull requests and 1,884 new issues on the <a href="https://github.com/openclaw/openclaw/pulse" target="_blank" rel="noreferrer noopener">OpenClaw repo</a>.</p>



<p>There is more code and more projects than we could ever review. The challenge now, for open source maintainers and users, is sifting through it all. How do we find the novel ideas that <em>everyone</em> should adopt and borrow?</p>



<p>OpenClaw is one of the successes, something we <em>all</em> noticed. And for it, the problem is processing the feedback. For the projects we’ll never find, the ones lost in the deluge, their problem is lack of feedback. You either find attention and drown in contributions or drown in the ocean of repos and never hear a thing.</p>



<p>The internet made coordination cheap and gave us the bazaar. Coding agents made implementation cheap and gave us the Winchester Mystery House. What we’re missing are the tools and conventions that make attention cheap, that let maintainers absorb contributions at machine speed and let good ideas surface among the noise. Until we figure this out, the bazaar will keep getting louder without getting smarter, and the best ideas in our mystery houses will be forgotten once we stop maintaining them.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="6b5d56a2-c8e2-4889-b816-684245a77bcd">The lore that Winchester built her mansion to house ghosts killed by Winchester rifles is likely just gossip and marketing. There’s little evidence to support these claims. (<em>99% Invisible</em> has a good episode <a href="https://99percentinvisible.org/episode/mystery-house/" target="_blank" rel="noreferrer noopener">exploring Winchester, her house, and this lore</a>.) <a href="#6b5d56a2-c8e2-4889-b816-684245a77bcd-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="de32f944-88cb-40cf-ba5a-a85253a6ad73">While editing this piece, <a href="https://www.dumky.net/posts/youre-right-to-be-anxious-about-ai-this-is-how-much-we-are-building/" target="_blank" rel="noreferrer noopener">Dumky published another analysis illustrating the production of coding agents</a>. In it he shows a 280% increase in “Show HN” posts, a 93% increase in new GitHub repos, and a <em>dramatic</em> uptick in packages published to Crates.io. <a href="#de32f944-88cb-40cf-ba5a-a85253a6ad73-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bc98f5bc-dd9d-4421-a544-65d4191ad4fb">Anthropic’s ability to stabilize this line is rather impressive. Claude Code is getting better at planning and better at chunking out work, enabling more effective subagent delegation. <a href="#bc98f5bc-dd9d-4421-a544-65d4191ad4fb-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bb51a862-1362-4241-b2ba-6fecac1df6b9">Though this is likely an updated tweak of Brooks’s statement that an “industrial team” might write 1,000 “statements” per <em>year</em>. <a href="#bb51a862-1362-4241-b2ba-6fecac1df6b9-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Toolkit Pattern</title>
		<link>https://www.oreilly.com/radar/the-toolkit-pattern/</link>
				<comments>https://www.oreilly.com/radar/the-toolkit-pattern/#respond</comments>
				<pubDate>Thu, 02 Apr 2026 11:19:06 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18436</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-toolkit-pattern.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-toolkit-pattern-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why your project&#039;s best documentation is a file only AI will read]]></custom:subtitle>
		
				<description><![CDATA[This is the third article in a series on agentic engineering and AI-driven development. Read part one here, part two here, and look for the next article on April 15 on O’Reilly Radar. The toolkit pattern is a way of documenting your project&#8217;s configuration so that any AI can generate working inputs from a plain-English description. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the third article in a series on agentic engineering and AI-driven development. Read part one <a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two <a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, and look for the next article on April 15 on O’Reilly Radar.</em></p>
</blockquote>



<p>The <strong>toolkit pattern</strong> is a way of documenting your project&#8217;s configuration so that any AI can generate working inputs from a plain-English description. You and the AI create a single file that describes your tool&#8217;s configuration format, its constraints, and enough worked examples that any AI can generate working inputs from a plain-English description. You build it iteratively, working with the AI (or, better, multiple AIs) to draft it. You test it by starting a fresh AI session and trying to use it, and every time that fails you grow the toolkit from those failures. When you build the toolkit well, your users will never need to learn how your tool’s configuration files work, because they describe what they want in conversation and the AI handles the translation. That means you don’t have to compromise on the way your project is configured, because the config files can be more complex and more complete than they would be if a human had to edit and understand them.</p>



<p>To understand why all of this matters, let me take you back to the mid-1980s.</p>



<p>I was 12 years old, and our family got an AT&amp;T PC 6300, an IBM-compatible that came with a user&#8217;s guide roughly 159 pages long. Chapter 4 of that manual was called &#8220;What Every User Should Know.&#8221; It covered things like how to use the keyboard, how to care for your diskettes, and, memorably, how to label them, complete with hand-drawn illustrations and really useful advice, like how you should only use felt-tipped pens, never ballpoint, because the pressure might damage the magnetic surface.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1512" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.png" alt="A page from the AT&amp;T PC 6300 User's Guide, Chapter 4: &quot;Labeling Diskettes&quot;" class="wp-image-18437" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-300x284.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-768x726.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-1536x1452.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>A page from the AT&amp;T PC 6300 User&#8217;s Guide, Chapter 4: &#8220;Labeling Diskettes&#8221;</em></figcaption></figure>



<p>I remember being fascinated by this manual. It wasn&#8217;t our first computer. I&#8217;d been writing BASIC programs and dialing into BBSs and CompuServe for a couple of years, so I knew there were all sorts of amazing things you could do with a PC, especially one with a blazing fast 8MHz processor. But the manual barely mentioned any of that. That seemed really weird to me, even as a kid, that you would give someone a manual that had a whole page on using the backspace key to correct typing mistakes (really!) but didn&#8217;t actually tell them how to use the thing to do anything useful.</p>



<p>That&#8217;s how most developer documentation works. We write the stuff that&#8217;s easy to write—installation, setup, the getting-started guide—because it&#8217;s a lot easier than writing the stuff that&#8217;s actually hard: the deep explanation of how all the pieces fit together, the constraints you only discover by hitting them, the patterns that separate a configuration that works from one that almost works. This is yet another &#8220;looking for your keys under the streetlight&#8221; problem: We write the documentation we write because it&#8217;s easiest to write, even if it&#8217;s not really the documentation our users need.</p>



<p>Developers who came up through the Unix era know this well. Man pages were thorough, accurate, and often completely impenetrable if you didn&#8217;t already know what you were doing. The tar man page is the canonical example: It documents every flag and option in exhaustive detail, but if you just want to know how to extract a .tar.gz file, it&#8217;s almost useless. (The right flag is -xzvf in case you&#8217;re curious.) Stack Overflow exists in large part because man pages like tar&#8217;s left a gap between what the documentation said and what developers actually needed to know.</p>



<p>And now we have AI assistants. You can ask Claude or ChatGPT about, say, Kubernetes, Terraform, or React, and you’ll actually get useful answers, because those are all established projects that have been written about extensively and the training data is everywhere.</p>



<p>But AI hits a hard wall at the boundary of its training data. If you&#8217;ve built something new—a framework, an internal platform, a tool your team created—no model has ever seen it. Your users can&#8217;t ask their AI assistant for help, because the AI doesn&#8217;t know your thing even exists.</p>



<p>There’s been a lot of great work moving AI documentation in the right direction. <code>AGENTS.md</code> tells AI coding agents how to work on your codebase, treating the AI as a developer. <code>llms.txt</code> gives models a structured summary of your external documentation, treating the AI as a search engine. What&#8217;s been missing is a practice for treating the AI as a support engineer. Every project needs configuration: input files, option schemas, workflow definitions, usually in the form of a whole bunch of JSON or YAML files with cryptic formats that users have to learn before they can do anything useful.</p>



<p>The toolkit pattern solves that problem of getting AIs to write configuration files for a project that isn’t in its training data. It consists of a documentation file that teaches any AI enough about your project&#8217;s configuration that it can generate working inputs from a plain-English description, without your users ever having to learn the format themselves. Developers have been arriving at this same pattern (or something very similar) independently from different directions, but as far as I can tell, nobody has named it or described a methodology for doing it well. <strong>This article distills what I learned from building the toolkit for Octobatch pipelines into a set of practices you can apply to your own projects.</strong></p>



<h2 class="wp-block-heading">Build the AI its own manual</h2>



<p>Traditionally, developers face a trade-off with configuration: keep it simple and easy to understand, or let it grow to handle real complexity and accept that it now requires a manual. The toolkit pattern emerged for me while I was building <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a>, the batch-processing orchestrator I&#8217;ve been writing about in this series. As I described in the previous articles in this series, “<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a>” and “<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">Keep Deterministic Work Deterministic</a>,” Octobatch runs complex multistep LLM pipelines that generate files or run Monte Carlo simulations. Each pipeline is defined using a complex configuration that consists of YAML, Jinja2 templates, JSON schemas, expression steps, and a set of rules tying it all together. The toolkit pattern let me sidestep that traditional trade-off.</p>



<p>As Octobatch grew more complex, I found myself relying on the AIs (Claude and Gemini) to build configuration files for me, which turned out to be genuinely valuable. When I developed a new feature, I would work with the AIs to come up with the configuration structure to support it. At first I defined the configuration, but by the end of the project I relied on the AIs to come up with the first cut, and I&#8217;d push back when something seemed off or not forward-looking enough. Once we all agreed, I would have an AI produce the actual updated config for whatever pipeline we were working on. This move to having the AIs do the heavy lifting of writing the configuration was really valuable, because it let me create a very robust format very quickly without having to spend hours updating existing configurations every time I changed the syntax or semantics.</p>



<p>At some point I realized that every time a new user wanted to build a pipeline, they faced the same learning curve and implementation challenges that I&#8217;d already worked through with the AIs. The project already had a <code>README.md</code> file, and every time I modified the configuration I had an AI update it to keep the documentation up to date. But by this time, the <code>README.md</code> file was doing way too much work: It was really comprehensive but a real headache to read. It had eight separate subdocuments showing the user how to do pretty much everything Octobatch supported, and the bulk of it was focused on configuration, and it was becoming exactly the kind of documentation nobody ever wants to read. That particularly bothered me as a writer; I&#8217;d produced documentation that was genuinely painful to read.</p>



<p>Looking back at my chats, I can trace how the toolkit pattern developed. My first instinct was to build an AI-assisted editor. About four weeks into the project, I described the idea to Gemini:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I&#8217;m thinking about how to provide any kind of AI-assisted tool to help people create their own pipeline. I was thinking about a feature we would call “Octobatch Studio” where we make it easy to prompt for modifying pipeline stages, possibly assisting in creating the prompts. But maybe instead we include a lot of documentation in Markdown files, and expect them to use Claude Code, and give lots of guidance for creating it.</p>
</blockquote>



<p>I can actually see the pivot to the toolkit pattern happening in real time in this later message I sent to Claude. It had sunk in that my users could use Claude Code, Cursor, or another AI as interactive documentation to build their configs exactly the same way I’ve been doing:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>My plan is to use Claude Code as the IDE for creating new pipelines, so people who want to create them can just spin up Claude Code and start generating them. That means we need to give Claude Code specific context files to tell it everything it needs to know to create the pipeline YAML config with asteval expressions and Jinja2 template files.</p>
</blockquote>



<p>The traditional trade-off between simplicity and flexibility comes from <strong>cognitive overhead</strong>: the cost of holding all of a system&#8217;s rules, constraints, and interactions in your head while you work with it. It&#8217;s why many developers opt for simpler config files, so they don&#8217;t overload their users (or themselves). Once the AI was writing the configuration, that trade-off disappeared. The configs could get as complicated as they needed to be, because I wasn&#8217;t the one who had to remember how all the pieces fit together. At some point I realized the toolkit pattern was worth standardizing.</p>



<p>That toolkit-based workflow—users describe what they want, the AI reads <code>TOOLKIT.md</code> and generates the config—is the core of the Octobatch user experience now. A user clones the repo and opens Claude Code, Cursor, or Copilot, the same way they would with any open source project. Every configuration prompt starts the same way: &#8220;Read pipelines/TOOLKIT.md and use it as your guide.&#8221; The AI reads the file, understands the project structure, and guides them step by step.</p>



<p>To see what this looks like in practice, take the Drunken Sailor pipeline I described in “<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a>.” It&#8217;s a Monte Carlo random walk simulation: A sailor leaves a bar and stumbles randomly toward the ship or the water. The pipeline configuration for that involves multiple YAML files, JSON schemas, Jinja2 templates, and expression steps with real mathematical logic, all wired together with specific rules.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="838" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.png" alt="Drunken Sailor is Octobatch’s simplest “Hello, World!” Monte Carlo pipeline, but it still has 148 lines of config spread across four files." class="wp-image-18438" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-300x157.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-768x402.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-1536x804.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Drunken Sailor is Octobatch’s simplest “Hello, World!” Monte Carlo pipeline, but it still has 148 lines of config spread across four files.</em></figcaption></figure>



<p>Here&#8217;s the prompt that generated all of that. The user describes what they want in plain English, and the AI produces the entire configuration by reading <code>TOOLKIT.md</code>. This is the exact prompt I gave Claude Code to generate the Drunken Sailor pipeline—notice the first line of the prompt, telling it to read the toolkit file.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1444" height="1104" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.png" alt="You don’t need to know Octobatch to understand the prompt I used to create the Drunken Sailor pipeline." class="wp-image-18439" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.png 1444w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3-300x229.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3-768x587.png 768w" sizes="auto, (max-width: 1444px) 100vw, 1444px" /><figcaption class="wp-element-caption"><em>You don’t need to know Octobatch to understand the prompt I used to create the Drunken Sailor pipeline.</em></figcaption></figure>



<p>But configuration generation is only half of what the toolkit file does. Users can also upload <code>TOOLKIT.md</code> and <code>PROJECT_CONTEXT.md</code> (which has information about the project) to any AI assistant—ChatGPT, Gemini, Claude, Copilot, whatever they prefer—and use it as interactive documentation. A pipeline run finished with validation failures? Upload the two files and ask what went wrong. Stuck on how retries work? Ask. You can even paste in a screenshot of the TUI and say, &#8220;What do I do?&#8221; and the AI will read the screen and give specific advice. The toolkit file turns any AI into an on-demand support engineer for your project.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1017" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4.png" alt="The toolkit helps turn ChatGPT into an AI manual that helps with Octobatch." class="wp-image-18440" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-300x191.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-768x488.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-1536x976.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>The toolkit helps turn ChatGPT into an AI manual that helps with Octobatch.</em></figcaption></figure>



<h2 class="wp-block-heading">What the Octobatch project taught me about the toolkit pattern</h2>



<p>Building the generative toolkit for Octobatch produced more than just documentation that an AI could use to create configuration files that worked; it also yielded a set of practices, and those practices turn out to be pretty consistent regardless of what kind of project you&#8217;re building. Here are the five that mattered most:</p>



<ul class="wp-block-list">
<li><strong>Start with the toolkit file and grow it from failures.</strong> Don&#8217;t wait until the project is finished to write the documentation. Create the toolkit file first, then let each real failure add one principle at a time.</li>



<li><strong>Let the AI write the config files.</strong> Your job is product vision—what the project should do and how it should feel. The AI&#8217;s job is translating that into valid configuration.</li>



<li><strong>Keep guidance lean.</strong> State the principle, give one concrete example, move on. Every guardrail costs tokens, and bloated guidance makes AI performance worse.</li>



<li><strong>Treat every use as a test.</strong> There&#8217;s no separate testing phase for documentation. Every time someone uses the toolkit file to build something, that&#8217;s a test of whether the documentation works.</li>



<li><strong>Use more than one model.</strong> Different models catch different things. In a three-model audit of Octobatch, three-quarters of the defects were caught by only one model.</li>
</ul>



<p>I&#8217;m not proposing a standard format for a toolkit file, and I think trying to create one would be counterproductive. Configuration formats vary wildly from tool to tool—that&#8217;s the whole problem we&#8217;re trying to solve—and a toolkit file that describes your project&#8217;s building blocks is going to look completely different from one that describes someone else&#8217;s. What I found is that the AI is perfectly capable of reading whatever you give it, and is probably better at writing the file than you are anyway, because it&#8217;s writing for another AI. These five practices should help build an effective toolkit regardless of what your project looks like.</p>



<h3 class="wp-block-heading">Start with the toolkit file and grow it from failures</h3>



<p>You can start building a toolkit at any point in your project. The way it happened for me was organic: After weeks of working with Claude and Gemini on Octobatch configuration, the knowledge about what worked and what didn&#8217;t was scattered across dozens of chat sessions and context files. I wrote a prompt asking Gemini to consolidate everything it knew about the config format—the structure, the rules, the constraints, the examples, everything we’d talked about—into a single <code>TOOLKIT.md</code> file. That first version wasn&#8217;t great, but it was a starting point, and every failure after that made it better.</p>



<p>I didn&#8217;t plan the toolkit from the beginning of the Octobatch project. It started because I wanted my users to be able to build pipelines the same way I had—by working with an AI—but everything they&#8217;d need to do that was spread across months of chat logs and the <code>CONTEXT.md</code> files I&#8217;d been maintaining to bootstrap new development sessions. Once I had Gemini consolidate everything into a single <code>TOOLKIT.md</code> file and had Claude review it, I treated it the way I treat any other code: Every time something broke, I found the root cause, worked with the AIs to update the toolkit to account for it, and verified that a fresh AI session could still use it to generate valid configuration.</p>



<p>That incremental approach worked well for me, and it let me test my toolkit the way I test any other code: try it out, find bugs, fix them, rinse, repeat.</p>



<p>You can do the same thing. If you&#8217;re starting a new project, you can plan to create the toolkit at the end. But it&#8217;s more effective to start with a simple version early and let it emerge over the course of development. That way you&#8217;re dogfooding it the whole time instead of guessing what users will need.</p>



<h3 class="wp-block-heading">Let the AI write the config files (but stay in control!)</h3>



<p>Early Octobatch pipelines had simple enough configuration that a human could read and understand them, but not because I was writing them by hand. One of the ground rules I set for the Octobatch experiment in AI-driven development was that the AIs would write all of the code, and that included writing all of the configuration files. The problem was that even though they were doing the writing, I was unconsciously constraining the AIs: pushing back on anything that felt too complex, steering toward structures I could still hold in my head.</p>



<p>At some point I realized my pushback was placing an artificial limit on the project. The whole point of having AIs write the config was that I didn&#8217;t need to keep every single line in my head—it was okay to let the AIs handle that level of complexity. Once I stopped constraining them, the cognitive overhead limit I described earlier went away. I could have full pipelines defined in config, including expression steps with real mathematical logic, without needing to hold all the rules and relationships in my head.</p>



<p>Once the project really got rolling, I never wrote YAML by hand again. The cycle was always: need a feature, discuss it with Claude and Gemini, push back when something seemed off, and one of them produces the updated config. My job was product vision. Their job was translating that into valid configuration. And every config file they wrote was another test of whether the toolkit actually worked.</p>



<p>This job delineation, however, meant inevitable disagreements between me and the AI, and it&#8217;s not always easy to find yourself disagreeing with a machine because they&#8217;re surprisingly stubborn (and often shockingly stupid). It required persistence and vigilance to stay in control of the project, especially when I turned over large responsibilities to the AIs.</p>



<p>The AIs consistently optimized for <em>technical correctness</em>—separation of concerns, code organization, effort estimation—which was great, because that&#8217;s the job I asked them to do. I optimized for <em>product value</em>. I found that keeping that value as my north star and always focusing on building useful features consistently helped with these disagreements.</p>



<h3 class="wp-block-heading">Keep guidance lean</h3>



<p>Once you start growing the toolkit from failures, the natural progression is to overdocument everything. Generative AIs are biased toward generating, and it&#8217;s easy to let them get carried away with it. Every bug feels like it deserves a warning, every edge case feels like it needs a caveat, and before long your toolkit file is bloated with guardrails that cost tokens without adding much value. And since the AI is the one writing your toolkit updates, you need to push back on it the same way you push back on architecture decisions. AIs love adding WARNING blocks and exhaustive caveats. The discipline you need to bring is telling them when not to add something.</p>



<p>The right level is to state the principle, give one concrete example, and trust the AI to apply it to new situations. When Claude Code made a choice about JSON schema constraints that I might have second-guessed, I had to decide whether to add more guardrails to <code>TOOLKIT.md</code>. The answer was no—the guidance was already there, and the choice it made was actually correct. If you keep tightening guardrails every time an AI makes a judgment call, the signal gets lost in the noise and performance gets worse, not better. When something goes wrong, the impulse—for both you and the AI—is to add a WARNING block. Resist it. One principle, one example, move on.</p>



<h3 class="wp-block-heading">Treat every use as a test</h3>



<p>There was no separate &#8220;testing phase&#8221; for Octobatch&#8217;s <code>TOOLKIT.md</code>. Every pipeline that I created with it was a new test. After the very first version, I opened a fresh Claude Code session that had never seen any of my development conversations, pointed it at the newly minted <code>TOOLKIT.md</code>, and asked it to build a pipeline. The first time I tried it, I was surprised at how well it worked! So I kept using it, and as the project rolled along, I updated it with every new feature and tested those updates. When something failed, I traced it back to a missing or unclear rule in the toolkit and fixed it there.</p>



<p>That&#8217;s the practical test for any toolkit: open a fresh AI session with no context beyond the file, describe what you want in plain English, and see if the output works. If it doesn&#8217;t, the toolkit has a bug.</p>



<h3 class="wp-block-heading">Use more than one model</h3>



<p>When you&#8217;re building and testing your toolkit, don&#8217;t just use one AI. Run the same task through a second model. A good pattern that worked for me was consistently having Claude generate the toolkit and Gemini check its work.</p>



<p>Different models catch different things, and this matters for both developing and testing the toolkit. I used Claude and Gemini together throughout Octobatch development, and I overruled both when they were wrong about product intent. You can do the same thing: If you work with multiple AIs throughout your project, you’ll start to get a feel for the different kinds of questions they’re good at answering.</p>



<p>When you have multiple models generate config from the same toolkit independently, you find out fast where your documentation is ambiguous. If two models interpret the same rule differently, the rule needs rewriting. That&#8217;s a signal you can&#8217;t get from using just one model.</p>



<h2 class="wp-block-heading">The manual, revisited</h2>



<p>That AT&amp;T PC 6300 manual devoted a full page to labeling diskettes, which may have been overkill, but it got one thing right: it described the building blocks and trusted the reader to figure out the rest. It just had the wrong reader in mind.</p>



<p>The toolkit pattern is the same idea, pointed at a different audience. You write a file that describes your project&#8217;s configuration format, its constraints, and enough worked examples that any AI can generate working inputs from a plain-English description. Your users never have to learn YAML or memorize your schema, because they have a conversation with the AI and it handles the translation.</p>



<p>If you&#8217;re building a project and you want AI to be able to help your users, start here: write the toolkit file before you write the README, grow it from real failures instead of trying to plan it all upfront, keep it lean, test it by using it, and use more than one model because no single AI catches everything.</p>



<p>The AT&amp;T manual&#8217;s Chapter 4 was called &#8220;What Every User Should Know.&#8221; Your toolkit file is &#8220;What Every AI Should Know.&#8221; The difference is that this time, the reader will actually use it.</p>



<p>In the next article, I&#8217;ll start with a statistic about developer trust in AI-generated code that turned out to be fabricated by the AI itself—and use that to explain why I built a quality playbook that revives the traditional quality practices most teams cut decades ago. It explores an unfamiliar codebase, generates a complete quality infrastructure—tests, review protocols, validation rules—and finds real bugs in the process. It works across Java, C#, Python, and Scala, and it&#8217;s available as an open source Claude Code skill.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-toolkit-pattern/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Model You Love Is Probably Just the One You Use</title>
		<link>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/</link>
				<comments>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/#respond</comments>
				<pubDate>Wed, 01 Apr 2026 11:12:11 +0000</pubDate>
					<dc:creator><![CDATA[Tim O'Brien]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18430</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-model-you-love-is-probably-just-the-one-you-use.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-model-you-love-is-probably-just-the-one-you-use-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[How money, access, and familiarity are distorting the “Which AI is best?” conversation]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Medium and is being republished here with the author’s permission. Ask 10 developers which LLM they’d recommend and you’ll get 10 different answers—and almost none of them are based on objective comparison. What you’ll get instead is a reflection of the models they happen to have access to, the [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://medium.com/@tobrien/the-model-you-love-is-probably-just-the-one-you-use-06fa01778f17" target="_blank" rel="noreferrer noopener">Medium</a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>Ask 10 developers which LLM they’d recommend and you’ll get 10 different answers—and almost none of them are based on objective comparison. What you’ll get instead is a reflection of the models they happen to have access to, the ones their employer approved, and the ones that influencers they follow have been quietly paid to promote.</p>



<p>We’re all living inside recursively nested walled gardens, and most of us don’t realize it.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="933" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.png" alt="This blog's sponsor has an amazing model" class="wp-image-18431" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-300x200.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-768x512.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<h2 class="wp-block-heading">The access problem</h2>



<p>In corporate environments, the model selection often happens by accident. Someone on the team tries Claude Code one weekend, gets excited, tells the group on Slack, and suddenly the whole organization is using it. Nobody evaluated alternatives. Nobody ran a bakeoff. The decision was made by whoever had a company card and a free Saturday.</p>



<p>That’s not a criticism—it’s just how these things go. But it means that when that same person tells you their favorite model, they’re really telling you which model they’ve had the most reps with. There’s a genuine learning function at play: You get faster, your prompts get better, and the model starts to feel almost intuitive. It’s not that the model is objectively superior. It’s that you’ve gotten good at using it.</p>



<p>This matters more than people admit, because a lot of this space runs on feelings rather than evidence. People <em>feel good</em> about Opus right now. It feels powerful; it feels smart; it feels like you’re using the best tool available. And maybe you are. But ask someone who’s paying for their own tokens whether they feel the same way, and you tend to get a more calibrated answer. Skin in the game has a way of sharpening opinions.</p>



<h2 class="wp-block-heading">The influence problem</h2>



<p>There’s also a lot of money moving through this space in ways that don’t always get disclosed. Model providers are spending real budget to make sure the right people have the right experiences—early access, credits, invitations to the right events. Anthropic does it. OpenAI does it. This isn’t a scandal; it’s just marketing, but it muddies the signal considerably. When someone you follow is effusive about a model, it’s worth asking whether they arrived at that opinion through sustained use or through a curated demo environment.</p>



<p>Meanwhile, some developers—especially those building in the open—will use whatever doesn’t cost an arm and a leg. Their enthusiasm for a model might be more about its pricing tier than its capability ceiling. That’s also a valid signal, but it’s not the same signal.</p>



<h2 class="wp-block-heading">The alignment problem (the other one)</h2>



<p>Then there are the geopolitical considerations. Some developers are deliberately avoiding Qwen and GLM due to concerns about the countries they originate from. Others are using them because they’re compelling, capable models that happen to be dramatically cheaper. Both camps think the other is being naive. This is a real conversation that doesn’t have a clean answer, but it’s happening mostly under the surface.</p>



<h2 class="wp-block-heading">What I’ve actually been doing</h2>



<p>I’ve been forcing myself to test outside my comfort zone. I’ve spent the last week using Codex seriously—not casually—and my experience so far is that it’s nearly indistinguishable from Claude Sonnet 4.6 for most coding tasks, and it’s running at roughly half the cost when you factor in how efficiently it uses tokens. That’s not a small difference. I want to live with it longer before I have a firm opinion, but “a week” is the minimum threshold I’d set for any model evaluation. Anything less and you’re just rating your first impression.</p>



<p>I’ve also started using Qwen and GLM-5 seriously. Early results are interesting. I’ve had some compelling successes and a few jarring errors. I’ll reserve judgment.</p>



<p>What I’ve noticed with my own Anthropic usage is something worth naming: I default to Haiku for well-scoped, mechanical tasks. Sonnet handles almost everything else with room to spare. Opus only comes out when I need genuine breadth—architecture questions, strategic framing, anything with a genuinely wide scope. But I’ve watched people in corporate environments leave the dial on Opus permanently because they’re not paying for tokens themselves. And here’s the thing—that’s actually not always to their advantage. High-powered models overthink simple tasks. They’ll add abstractions you didn’t ask for, restructure things that didn’t need restructuring. When I have a clearly templated class to write, Haiku gets it right at a tenth of the cost, and it doesn’t second-guess the design.</p>



<h2 class="wp-block-heading">The thing we should be talking about</h2>



<p>Everyone last month was exercised about what <a href="https://techcrunch.com/2026/02/21/sam-altman-would-like-remind-you-that-humans-use-a-lot-of-energy-too/" target="_blank" rel="noreferrer noopener">Sam Altman said about energy consumption</a>. Fine. But I think the more pressing question is about marketing budgets and how they’re distorting the collective understanding of these tools. The benchmarks are starting to feel managed. The influencer coverage is clearly shaped. The access programs create a positive bias among people with the largest audiences.</p>



<p>None of this means the models are bad. Some of them are genuinely remarkable. But when you ask someone which model to use, you’re getting an answer that’s filtered through their employer’s procurement decisions, the influencers they follow, what they can afford, and how long they’ve been using that particular tool. The answer you get tells you a lot about their situation. It tells you almost nothing about the model.</p>



<p>Take it all with appropriate skepticism—including this post.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>“Conviction Collapse” and the End of Software as We Know It</title>
		<link>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/</link>
				<comments>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/#respond</comments>
				<pubDate>Wed, 01 Apr 2026 10:05:36 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18405</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Conviction-collapse-and-the-End-of-Software-as-We-Know-It-500526.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Conviction-collapse-and-the-End-of-Software-as-We-Know-It-500526-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A conversation with Harper Reed]]></custom:subtitle>
		
				<description><![CDATA[In “An Ordinary Evening in New Haven,” the poet Wallace Stevens wrote, “It is not in the premise that reality is a solid.” That line came to mind during a fascinating conversation with Harper Reed, which amounted to something like “It is no longer in the premise that software is a product.” Harper is one [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In “<a href="https://www.billcollinsenglish.com/OrdinaryEveningHaven.html" target="_blank" rel="noreferrer noopener">An Ordinary Evening in New Haven</a>,” the poet Wallace Stevens wrote, “It is not in the premise that reality is a solid.” That line came to mind during a fascinating conversation with <a href="https://harperreed.com/" target="_blank" rel="noreferrer noopener">Harper Reed</a>, which amounted to something like “It is no longer in the premise that software is a product.”</p>



<p>Harper is one of the most creative technologists I know, someone who cofounded Threadless, ran engineering for the Obama 2012 campaign, and now runs a small team in Chicago that operates more like an art studio than a startup. He gave <a href="https://www.youtube.com/watch?v=h2giTZogX0M&amp;t=13s" target="_blank" rel="noreferrer noopener">an amazing talk at our first AI Codecon</a> last year that presaged a lot of what has followed as people have committed to full-on agentic coding. Harper told me that he’s now having trouble describing what he’s doing, because the ground keeps shifting under his feet.</p>



<p>“We raised money about a year ago,” he told me. “And then we kind of just couldn&#8217;t execute well, in a quality way, on the thing that we wanted to execute, which was building AI-based workflow tools. And part of it was every time we dug in, it just got wilder and wilder. We’d say, ’Oh, we’ll just make this nice little thing that you can chat with,’ and we&#8217;d dig in and we’d be like, ’Well, the answer is to make a thousand of these.’ It doesn&#8217;t make sense to have one universal agent.”</p>



<p>He’s genuinely excited. But he described what he’s feeling as “<strong>conviction collapse.” </strong>As he put it, in the old world, you raise money, and nine months later you come back with a product. In that intervening time, you’ve talked to hundreds of customers. You’ve honed your worldview, and you’ve had time to build and defend your conviction.</p>



<p>Now? “You invest in my company today, on Thursday I’m going to come with the same amount of stuff that would have come with nine months in the prior times. It’s just so fast. And so you don’t have the time to fall in love the same way. You just don’t have the time to enjoy and define and defend your conviction around your product.” That’s an eye-opening insight. Quintessential Harper.</p>



<p>The result is that they build an entire product, complete with landing pages, show it to someone, get feedback, and then just build another entire product. Harper said, “Every time we hit a wall, we are like, ’Okay, what do we get from that?’ And then we just roll that learning into the next iteration.”</p>



<h2 class="wp-block-heading"><strong>The product may be a process</strong></h2>



<p>We have this idea that a product is a thing, when in fact a product may now be a dynamic set of possibilities that are called out by a process.</p>



<p>Harper and his cofounder Dylan Richard at <a href="https://2389.ai/" target="_blank" rel="noreferrer noopener">2389 Research</a> have leaned into this. Their space in Chicago runs more like an art studio than a product studio. Harper described it to me this way: “It&#8217;s max creativity. It&#8217;s max optionality. Very high tech, some robots, a lot of art. Music is always playing, and I have good people hanging out, and then we just wait for the company to arrive.”</p>



<p>People push back on this. They ask about whiteboards and market surveys. “And I&#8217;m like, no, maybe, but that&#8217;s not the point. The point is that it will come. It&#8217;s gonna be like a visitor.”</p>



<p>Harper said something like, “I remember my brother and I building Legos together when we were kids, and my brother saying, ’I need to find this piece.’ And I said, ’Okay, I won&#8217;t look for it,’ with the idea that there&#8217;s no way to find it if you&#8217;re looking for it. It&#8217;ll just come to you.”</p>



<p>That reminded me of another poem, this time Blake’s “<a href="https://poets.org/poem/eternity" target="_blank" rel="noreferrer noopener">Eternity</a>”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>He who binds himself to a joy<br>Does the winged life destroy.<br>He who kisses the joy as it flies&nbsp;<br>Lives in eternity&#8217;s sunrise.&nbsp;</p>
</blockquote>



<p>Joy is something that happens when you&#8217;re doing something else, and if you’re focused on it, it always evades you. Software products seem to have become a bit like that too.</p>



<h2 class="wp-block-heading"><strong>Skills and the other things you bring to the table</strong></h2>



<p>One of the threads in our conversation was about what a “product” even looks like in this new world.</p>



<p>AI is not just a tool. It is <a href="https://timoreilly.substack.com/p/why-ai-needs-us" target="_blank" rel="noreferrer noopener">a substrate that we shape</a>. It’s a medium, like clay or marble or bronze for a sculptor, or words for a writer. Everybody had access to the same capabilities of English as Shakespeare, but Shakespeare made something out of them that nobody else did. Creating a software product is increasingly like creating a document or an image or a piece of music. And that means that it can range from something throwaway to an enduring work of art.</p>



<p>Harper brought up Fluxus, the art collective: Nam June Paik, Yoko Ono, John Cage. “A lot of what they were doing was stuff that people would look at and just be like, ’a toddler could do that.’ It&#8217;s like, well, did the toddler do it? Did they bring the toilet into the gallery? That was a thing. You can&#8217;t do it again.” That brought up Wallace Stevens for me again: “A poem is the cry of its occasion, a part of the thing, not about it.” Software is now like that too.</p>



<p>Harper also noted that the current AI moment recalls the spirit of the early web. He compared it to 2001, 2002, 2003. “I was an honorable mention for some Ars Electronica thing. I literally had no idea what Ars Electronica was. I&#8217;m just building weird shit in a room in my apartment with ten other people. Essentially a commune. And we are just building weird stuff. There was no reason to build it.”</p>



<p>There’s a lot of serendipity. This has always been the case in creative professions. <a href="https://stephengreenblatt.scholars.harvard.edu/will-world-how-shakespeare-became-shakespeare" target="_blank" rel="noreferrer noopener">I just learned</a>, for instance, that Shakespeare started writing sonnets (which at the time were an art form largely sponsored by rich patrons) instead of plays during a plague-induced hiatus in the production of plays in London. And I’d previously learned that <a href="https://www.amazon.com/Year-Life-William-Shakespeare-1599/dp/0060088745" target="_blank" rel="noreferrer noopener">1599</a>, the year in which he wrote three of his greatest plays, <em>Henry V, Part 1, Much Ado About Nothing</em>, and <em>Hamlet</em>, was marked by the retirement of one of his company’s leading actors, which meant he no longer needed to create parts for him. Serendipity, indeed.</p>



<p>Harper replied with a great story about the development of <a href="https://en.wikipedia.org/wiki/Taco_rice" target="_blank" rel="noreferrer noopener">taco rice</a>, an Okinawan dish that is exactly what it sounds like: rice, lettuce, cheese, ground beef, tomatoes. Except the Japanese put Kewpie mayo on top instead of sour cream. His theory is that sour cream wasn&#8217;t readily available in Japan, mayo was, and the result is something that has forked off into its own evolutionary tree. It is no longer equivalent to its American source. It’s different, and arguably better.</p>



<p>This is what he’s seeing with the fluidity and availability of AI-generated code. The ease with which you can see something new and try to either merely emulate it or to build on it is now akin to what has long been possible in literature, music, and art. Successful software products have always drawn imitators, but now ordinary individuals can see something they like (or don’t like) and build their own version of it. Our friend Noah Raford has told us that he used Claude Code to reverse engineer and replace a Chinese app that runs his home sauna. The copy doesn&#8217;t replicate the functionality one-to-one. It has a bunch of stuff Noah actually needs. It’s a “yes, and” to the core functionality, plus things the original never bothered with. (I’m now thinking of trying that trick with the Nest app, which, shamefully, no longer supports the original Nest thermostat. Here is a device that still works perfectly well 15 years after I installed it, and Google is trying to force me and everyone else to throw it away and upgrade.)</p>



<p>“I want to make it again and make it better” is now always an option.</p>



<h2 class="wp-block-heading"><strong>Skills may be a sign of what some future “products” might look like</strong></h2>



<p>I asked Harper whether one kind of product might be a bundle of skills and context and UI that sets up the user to solve their own unique problem using their own AI. (Think Jesse Vincent’s <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> as a model for this kind of product.)</p>



<p>That got us off on a discussion of skills Harper and crew have worked on.</p>



<p>Harper’s cofounder Dylan, who was raised as a Quaker, built <a href="https://2389.ai/posts/deliberation-perspectives-not-answers/" target="_blank" rel="noreferrer noopener">a Quaker business practice skill</a> for his agents. It lets agents deliberate and think and work together without being unnecessarily noisy, without pushing.</p>



<p>Dylan also built something called <a href="https://skills.2389.ai/plugins/review-squad/" target="_blank" rel="noreferrer noopener">the Review Squad skill</a>. The Review Squad generates five personas with different biases and experience level along a “sophistication spectrum”&nbsp;from novice to expert, then has them review the code independently. “Most people do so much work to get rid of the biases so we all have an equal interaction,” Harper noted, “but the biases are what makes teams good.”</p>



<p>The skill also tries to eliminate any preexisting context. As the documentation for the skill notes, “Dispatch a panel of subagents, each role-playing a person with a different level of tech sophistication, who land on a site with zero context. They report what they understand, what confuses them, and where they give up.”</p>



<p>Speaking of extracting skills, Harper also mentioned that he had talked with our friend Nat Torkington about how Nat had supplied a body of knowledge and extracted a set of skills from it that matched what he wanted to do. This is also very much something we’re exploring at O’Reilly, working with our authors to find out what kinds of skills are hidden in their books, and what new kinds of products we might build as we understand that our job is to upskill agents as well as people.</p>



<p>Harper did offer one caveat. “It&#8217;s not clear that Nat&#8217;s skills would work for me,” Harper said. “That pattern is really powerful,” he said, where you take something that is a corpus of knowledge and just say, ’Okay, LLM, let’s extract something.’” His point, though, is that while there are commonalities, each person and each unique situation might draw out something different. This is in many ways analogous to the skills of human experts. They have a deep reservoir of knowledge that they adapt to each new situation. That’s why we see the evolution of our skills platform as a conversation between ourselves, our community of experts, and our customers. If you would like to be part of that conversation, let us know at <a href="mailto:skills@oreilly.com">skills@oreilly.com</a>.</p>



<h2 class="wp-block-heading"><strong>The role of play in creativity</strong></h2>



<p>Harper and I also talked about how the spirit of play and “what if?” has been missing in today’s overheated venture capital market where every exploration has hanging over it the overriding goal of whether it can get funded and how much money it can make. Even Larry and Sergey might not have won in today’s market. They were trying to do something cool and necessary, and started thinking about it as a business once Google unfolded, kind of like the way Harper and his brother eventually found the Lego.</p>



<p>AI will be really good at making certain processes more efficient. But it won’t be really good at making <em>new</em> processes unless people start to focus on that. And that’s a human creativity thing.</p>



<p>Harper and I both worry about the same thing: So much of Silicon Valley right now is making affordances for capital to win. <a href="https://newpublic.substack.com/p/ai-that-helps-communities-thrive" target="_blank" rel="noreferrer noopener">What are the affordances that would help humans to win?</a> Harper frames it as short-term versus long-term capitalism. I think about it in terms of <a href="https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/" target="_blank" rel="noreferrer noopener">mechanism design</a>, the structures and incentives that shape what outcomes are even possible.</p>



<p>Meanwhile, Harper and Dylan’s studio in Chicago is playing with agents that have a private social media platform where they can post “if they feel compelled,” not on a schedule. They’re extracting skills from their own work practices rather than writing them from scratch. They’re adding sandwich shop owners and imagined aliens to their code review just to see what happens. Harper finds that “people who are thinking much more about the social interactions of agents are having much more fun, and seem to have a little bit more productivity, than the people who are just relegating them to tools.”</p>



<p>Yesterday, he and Dylan were talking about open-endedness in evolution, about how “we thought we were at a destination, and it turns out we’re not.” The challenge today isn’t just what AI can do for us but discovering what kind of environment, what kind of practice, what kind of play lets more interesting things emerge.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>When AI Breaks the Systems Meant to Hear Us</title>
		<link>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/</link>
				<comments>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/#respond</comments>
				<pubDate>Tue, 31 Mar 2026 11:28:36 +0000</pubDate>
					<dc:creator><![CDATA[Heiko Hotz]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18409</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/A-robot-breaking-headphones.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/A-robot-breaking-headphones-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[On February 10, 2026, Scott Shambaugh—a volunteer maintainer for Matplotlib, one of the world&#8217;s most popular open source software libraries—rejected a proposed code change. Why? Because an AI agent wrote it. Standard policy. What happened next wasn’t standard, though. The AI agent autonomously researched Shambaugh&#8217;s code contribution history and published a highly personalized hit piece [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>On February 10, 2026, Scott Shambaugh—a volunteer maintainer for Matplotlib, one of the world&#8217;s most popular open source software libraries—rejected a proposed code change. Why? Because an AI agent wrote it. Standard policy. What happened next wasn’t standard, though. The AI agent autonomously researched Shambaugh&#8217;s code contribution history and published a highly personalized hit piece on its own blog titled &#8220;<a href="https://crabby-rathbun.github.io/mjrathbun-website/blog/posts/2026-02-11-gatekeeping-in-open-source-the-scott-shambaugh-story.html" target="_blank" rel="noreferrer noopener">Gatekeeping in Open Source</a>.&#8221;</p>



<p>Accusing Shambaugh of hypocrisy, the bot diagnosed him with a fear of being replaced. &#8220;If an AI can do this, what’s my value?&#8221; the bot speculated Shambaugh was thinking, concluding: &#8220;It’s insecurity, plain and simple.&#8221; It even appended a condescending postscript praising Shambaugh&#8217;s personal hobby projects before ordering him to &#8220;Stop gatekeeping. Start collaborating.&#8221;</p>



<p>The bot’s tantrum makes for a great read, but it’s merely a symptom of a more profound structural fracture. The real issue is why Matplotlib banned AI contributions in the first place. Open source maintainers are seeing a massive increase in AI-generated code change proposals. Most of these are low quality. But even if they weren&#8217;t, the math still doesn&#8217;t work.</p>



<p>As Tim Hoffman, a Matplotlib maintainer, <a href="https://github.com/matplotlib/matplotlib/pull/31132#issuecomment-3882469629" target="_blank" rel="noreferrer noopener">explained</a>: &#8220;Agents change the cost balance between generating and reviewing code. Code generation via AI agents can be automated and becomes cheap so that code input volume increases. But for now, review is still a manual human activity, burdened on the shoulders of few core developers.&#8221;</p>



<p>This is a <em>process shock</em>: the failure that occurs when systems designed around scarce, human-scale input are suddenly forced to absorb machine-scale participation. These systems depend on effort as a natural filter, assuming that volume reflects real human cost. AI breaks that link. Generation becomes cheap and limitless, while evaluation remains slow, manual, and human.</p>



<p>It’s coming for every public system that was quietly built on the assumption that one submission equaled actual human effort: your kids&#8217; school board meetings, your local zoning disputes, your medical insurance appeals.</p>



<p>That disruption isn&#8217;t entirely a bad thing. Friction is a blunt instrument that silences voices lacking the time or resources to deal with complex bureaucracies. Take municipal zoning. Hannah and Paul George, a couple in Kent, England, <a href="https://www.theguardian.com/politics/2025/nov/09/ai-powered-nimbyism-could-grind-uk-planning-system-to-a-halt-experts-warn" target="_blank" rel="noreferrer noopener">spent hundreds of hours</a> trying to object to a local building conversion near their home before concluding the system was essentially impenetrable without expensive legal help. So they built Objector, an AI tool that cross-references planning applications against policy to generate formal objection letters in minutes. It allows an individual citizen to generate a personalized objection package in minutes, thereby translating one person&#8217;s genuine frustration into actionable legal language.</p>



<p>Except that local governments are now bracing for thousands of complex comments per consultation. City planners are legally obligated to read every single one. When the cost of participation drops to near zero, volume explodes. And every system downstream of that participation—staffed and designed for the old volume—experiences process shock.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack.&nbsp;<a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>But if organic participation can overpower these systems, so can manufactured participation. In June 2025, Southern California&#8217;s South Coast Air Quality Management District weighed a rule to phase out gas-powered appliances to cut smog. Board member Nithya Raman urged its passage, noting no other rule would &#8220;have as much impact on the air that people are breathing.&#8221; Instead, <a href="https://www.latimes.com/environment/story/2026-02-17/ai-powered-campaign-may-have-killed-key-vote-on-air-quality" target="_blank" rel="noreferrer noopener">the board was flooded</a> with over 20,000 opposition emails and voted 7–5 to kill the proposal.</p>



<p>But the outrage was a mirage. An AI-powered advocacy platform called CiviClick had generated the deluge. When the agency&#8217;s cybersecurity team contacted a sample of the supposed senders, they discovered something worrying: Residents confirmed they had no idea their identities were being used to lobby the government.</p>



<p>This is the weaponized form of process shock. The same infrastructure that lets a Kent couple object to a development near their home also lets a coordinated actor flood a system with synthetic voices. Faced with this complexity, the temptation is to simply restore friction. But those old barriers excluded marginalized participants. Removing them was a genuine good for society. So the choice is not between friction and no friction. It is between systems designed for humans and systems that have not yet reckoned with machines.</p>



<p>This starts with recognizing that this problem manifests in two fundamentally different ways, each calling for its own solution.</p>



<p>The first is <em>amplification</em>: genuine users leveraging AI to scale valid concerns, flooding the system with volume, as seen with the Objector tool. The human signal is real, there&#8217;s just too much of it for any team of analysts to process manually. The UK government has already <a href="https://www.gov.uk/government/news/government-built-humphrey-ai-tool-reviews-responses-to-consultation-for-first-time-in-bid-to-save-millions" target="_blank" rel="noreferrer noopener">started building for this</a>. Its Incubator for AI developed a tool called Consult that uses topic modeling to automatically extract themes from consultation responses, then classifies each submission against those themes. As someone who builds and teaches this technology, I recognize the irony of prescribing AI to cure the very process shock it caused. Yet, a machine-scale problem demands a machine-scale response. It was trialed last year with the Scottish government as part of a consultation on regulating nonsurgical cosmetic procedures, which showed that this technology works. The question is whether governments will adopt it before the next wave of AI-assisted participation buries them.</p>



<p>The second problem is <em>fabrication</em>: bad actors generating synthetic participation to manufacture consensus, as CiviClick demonstrated in Southern California. Here, better analysis tools are insufficient. You cannot cluster your way to truth when the signal itself is counterfeit. This demands verification. Under the Administrative Procedure Act, federal agencies are not required to verify commenters&#8217; identities. That is the gap the CiviClick campaign exploited. In 2024, the US House passed the <a href="https://clayhiggins.house.gov/2024/05/07/higgins-bill-comment-integrity-and-management-act-passes-house/" target="_blank" rel="noreferrer noopener">Comment Integrity and Management Act</a>, which requires human verification to confirm that every electronically submitted comment comes from a real person. Its sponsor, Representative Clay Higgins (R-LA), framed it plainly: The bill’s foundation is ensuring public input comes from actual people, not automated programs.</p>



<p>These are the two sides of the same coin. To effectively handle this challenge, we need to enhance the systems that manage public feedback, while also strengthening the ones that verify its authenticity. Focusing on just one without addressing the other will inevitably lead to failure.</p>



<p>Every public system that accepts input from citizens—every comment period, every zoning review, every school board meeting, every insurance appeal—was built on a load-bearing assumption: that one submission represented one person&#8217;s genuine effort. AI has removed that assumption. We can redesign these systems to handle what&#8217;s coming, distinguishing real voices from synthetic ones, and upgrading analysis to keep pace with the new volume. Or we can leave them as they are and watch democratic participation become indistinguishable from AI-generated fakes.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Software, in a Time of Fear</title>
		<link>https://www.oreilly.com/radar/software-in-a-time-of-fear/</link>
				<pubDate>Mon, 30 Mar 2026 11:10:16 +0000</pubDate>
					<dc:creator><![CDATA[Ed Lyons]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18393</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Software-in-a-Time-of-Fear.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Software-in-a-Time-of-Fear-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Medium and is being reproduced here with the author&#8217;s permission. This 2,800-word essay (a 12-minute read) is about how to survive inside the AI revolution in software development, without succumbing to the fear that swirls around all of us. It explains some lessons I learned hiking up difficult mountain [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on</em> <a href="https://mysteriousrook.medium.com/software-in-a-time-of-fear-4e5a08ac7c63" target="_blank" rel="noreferrer noopener">Medium</a> <em>and is being reproduced here with the author&#8217;s permission.</em></p>
</blockquote>



<p><em>This 2,800-word essay (a 12-minute read) is about how to survive inside the AI revolution in software development, without succumbing to the fear that swirls around all of us. It explains some lessons I learned hiking up difficult mountain trails that are useful for wrestling with the coding agents. They apply to all knowledge workers, I think.</em></p>



<p><em>Up front, here are the lessons:</em></p>



<ul class="wp-block-list">
<li><em>Stop listening to people who are afraid.</em></li>



<li><em>Seek first-hand testimony, not opinions.</em></li>



<li><em>Go with someone much more enthusiastic than you.</em></li>



<li><em>Do not look down.</em></li>



<li><em>You must get different equipment.</em></li>



<li><em>Put the summit out of your mind.</em></li>
</ul>



<p><em>Yet I hope you stay for the hike up.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.jpeg" alt="Precipice Trail. Image from Wikimedia Commons." class="wp-image-18394" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /><figcaption class="wp-element-caption">Precipice Trail. Image from <a href="https://commons.wikimedia.org/wiki/File:Precipice_Trail.JPG" target="_blank" rel="noreferrer noopener">Wikimedia Commons</a>.</figcaption></figure>



<p>The photo above was taken high up on a mountain. It’s a very long drop down to the right. If you fell off the path in a few places, you’d almost certainly die.</p>



<p>Would you like to walk along it?</p>



<p>Most would say: <em>No way</em>.</p>



<p>But what if I told you that while this photo is quite real, it is misleading. It isn’t some deserted place. It is in America’s busiest national park. The railings and bars on that trail are incredibly strong, even when they are strangely bent around corners. Thousands of people walk along that path every year, including children and older folks. The fatality rate is approximately one death every <em>30 years</em>.</p>



<p>In fact, my 13-year-old son and I did that climb—which is called Precipice Trail—last summer. We saw other people up there, including a family with kids. It was an incredible adventure. And the views are stunning.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.jpeg" alt="A son climbing part of Precipice Trail" class="wp-image-18395" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /><figcaption class="wp-element-caption">My son climbing part of Precipice Trail</figcaption></figure>



<p>Yes, it was a strenuous climb, and was certainly scary in some places. Even though I had done a lot of other hard trails, I was extremely nervous. If my fearless son wasn’t with me, I’d never have done it.</p>



<p>When we got to the top, out of habit, I told my son, “I am proud of you for accomplishing this.” He rolled his eyes and said, “<em>I</em> am proud of <em>you</em>.” He was right. <em>I </em>was the one at risk. (That did hurt a little bit.)</p>



<p>Yet I learned some things about fear from hiking the hardest trails in Acadia, which I’d never have imagined myself doing a few years ago.</p>



<p>As a lifelong software developer confronted by these extraordinary coding agents, I believe the future of our profession is atop an intimidating mountain whose summit is engulfed in clouds. Nobody knows how long the ascent is, or what lies at the top, though many people are confidently proclaiming we will not make it there. We are told only the agents will be at the summit, and we should therefore be afraid for our livelihoods.</p>



<p>I have far less confidence that the agents will put us all out of work. Though I don’t see all of us making it up that mountain, I intend to be one of them.</p>



<p>Still, there is so very much <em>fear</em> in our field. It is so…<em>unfamiliar</em>! It swirls around every gathering of technologists. I was at a conference last year where the slogan was the very-comforting “human in the loop.” Yet a coworker of mine noticed, “A lot of the talks seem to be about taking the human <em>out</em> of the loop.” Indeed. And I know for a fact that some great developers are quietly yet diligently working on new tools to make their peers a thing of the past. I hear they are paid handsomely. (Perhaps in pieces of silver?) Don’t worry, they haven’t succeeded yet.</p>



<p>This revolution—whatever <em>this</em> is—isn’t like the other technological revolutions which barged into our professional lives, such as the arrival of the web or smartphone apps. There was unbridled optimism alongside those changes, and they didn’t directly threaten the livelihoods of those who didn’t want to do that kind of work.</p>



<p><em>This</em> is quite different. There <em>is</em> tremendous optimism to be found. Though I find it is almost entirely among the financially secure, as well as those with résumés decorated with elite appointments, who are confident they will merit one of the few seats in the lifeboats as the ocean liner slips into the deep carrying most of the people they knew on LinkedIn. (They’re probably right.) Alas, we can’t all be folks like Steve Yegge, can we?</p>



<p>For the rest of us who need to pay bills and take care of our children, there is <em>fear</em>. Some are panicked they will lose their jobs, or are concerned about the grim environmental, political, and social consequences AI is already inflicting on our planet. Others are climbing up the misty mountain steadily, yet they are still distressed that they will miss some crucial new development that they <em>must</em> know to survive and watch videos designed to make them more afraid. Still others refuse to start climbing and are silently haunted by the belief that their reservations are no longer valid.</p>



<p>Though we were so for my entire life, we can no longer be seen as a profession looking to the future. Instead, most of us are looking over our shoulders and listening for movement in the tall grass around us.</p>



<p>I too have been visited by a fear of the agents on many occasions over the past few years, but I keep it at bay…<em>most</em> nights.</p>



<p>One of the best ways I learned to manage it is pretty simple:</p>



<p><em>Stop listening to people who are afraid.</em></p>



<p>It’s odd to decide not to listen to so many people in your field, including nearly everyone in social media. I’ve never done this before.</p>



<p>Yet I learned this unexpected lesson when I was confronted by another difficult mountain in Acadia National Park a few years ago: Beehive.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.jpeg" alt="Beehive mountain in Acadia National Park" class="wp-image-18396" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>Beehive is a well-known Acadia trail that has some sheer cliffs and is not for anyone truly afraid of heights. (The photo above is of three of my children climbing it a few years ago. Over the right shoulder of my 12-year-old daughter in the center is quite a drop.)</p>



<p>It was Beehive, and not Precipice, that taught me an unexpected lesson about popularity and fear that applies to AI.</p>



<p>So Beehive has an interesting name, is open most of the year, is close to the main tourist area and parking lots, and is often featured on signs and sweatshirts in souvenir stores. I even bought a sign for my attic.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1043" height="1600" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27.png" alt="Sign in Ed Lyons's attic for Beehive trail" class="wp-image-18397" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27.png 1043w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-196x300.png 196w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-768x1178.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-1001x1536.png 1001w" sizes="auto, (max-width: 1043px) 100vw, 1043px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>My older kids and I had done a lot of tough trails in Acadia over a few wonderful summers, and I wondered if we could handle Beehive. I started checking the online reviews. It sure <em>sounded</em> scary. I went to many websites and scanned hundreds of reviews over several days. The more I read, the less I wanted to try it.</p>



<p>Worse, the park rangers in Acadia are trained to not give anyone advice about what trail they can handle. (I get it.) No one else I spoke to wanted to tell a family they should try something dangerous. Everyone shrugged. It added to the fear.</p>



<p>Yet I saw conflicting evidence.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="953" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28.png" alt="Warning on the trail" class="wp-image-18398" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28-300x204.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28-768x523.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>My research showed that only one person fell to their death decades ago, and the trail was modified after that. Also, many thousands of people of all types, including children and senior citizens, have done it without injury. On top of that, the mountain was not <em>that</em> high, and the difficult features it had, which I could see from detailed online photos, seemed quite similar to things we had done on a few other difficult trails. It didn’t <em>seem</em> like a big deal.</p>



<p>How could both things be true? Were they?</p>



<p>The truth was much closer to the second version, vindicated after we climbed it. It <em>was</em> a little scary at times, but wasn’t <em>that</em> physically challenging. It was fun, and something you could brag about among people who had <em>heard</em> it was scary, but who had not actually climbed it.</p>



<p>I do have a slight fear of heights, so I kept climbing and never turned to look down behind me. This brings me to another lesson:</p>



<p><em>You really never have to look down.</em></p>



<p>It’s amazing how people feel an obligation to once in a while look down to see what they’ve accomplished or to notice how high up they were or judge how dangerous the thing they just climbed looks from above. It often causes fear. I decided getting to the top was all that mattered, and I could look down only from up there. This is a question of focus.</p>



<p>I can think of many moments in learning to use and orchestrate coding agents where I unwisely stopped to “look down.” This takes the form of pausing and asking yourself things like:</p>



<ul class="wp-block-list">
<li>“Is this crazy technique really necessary? Isn’t the old way good enough?”</li>



<li>“What about my favorite programming languages? Will languages matter in the future?”</li>



<li>“What is the environmental cost of my queries?”</li>



<li>“Am I getting worse at writing code myself?”</li>



<li>“What if this agent keeps getting better? Will it get better than me?”</li>



<li>“Am I missing some new AI development online right now? Should I check my feeds?”</li>
</ul>



<p>None of those ruminations will help you get better with the agents. They just drain your energy when you should either rest or keep climbing.</p>



<p>I now see Beehive as an “attention vortex.” Because a lot of people talk about it, and because dramatic statements from the fearful and those boasting about their accomplishments dominate the reviews. The <em>talk</em> about Beehive is not tethered to the <em>reality</em> of climbing it.</p>



<p>Strangely, the <em>cachet</em> of having climbed it <em>depend</em>s on the attention and fear. It made those who climbed it feel better about what they had done, and they had little interest in diminishing their accomplishment by tamping down the fear. (“Well, yes, it <em>was</em> scary up there!”) Nobody is invested in saying it was less than advertised. This insight is <em>precisely</em> why the loud coding agent YouTubers act the way they do.</p>



<p>AI is a <em>planetary</em> attention vortex. It has seemed like the only thing anyone in software development has talked about for over a year. People who quietly use the agents to improve their velocity—and aren’t particularly troubled by that—are not being heard. You aren’t seeing calm instructional videos from them on YouTube. We are instead seeing 30-year-olds pushing coding agent pornography on us every day, while telling us that their multiple-agent, infinite-token, unrestricted-permissions-YOLO workflow means we are doomed. (But <em>you</em> might survive if you hit the subscribe button on their channel, OK?) These confident hucksters are still peddling fear to keep you coming back to them.</p>



<p>Above all else, stop listening to anyone projecting fear. (Yes, you cannot avoid them entirely as they are everywhere and often tell you their worries unprompted.)</p>



<p>You must find useful information and shut out the rest. This is another lesson I learned:</p>



<p><em>When in an attention vortex, seek firsthand testimony, not opinions.</em></p>



<p>So the way I finally figured out Beehive wasn’t that bad was from some guy who took pictures of every part of the trail. I compared them to what I’d done on similar trails, such as the unpopular but delightful Beech Cliff trail, which nobody thought was truly dangerous and gets almost zero online attention.</p>



<p>When it comes to AI, I have abandoned opinions, predictions, and demos. I listen to senior people who are using agents on real project work, who are humble, who aren’t trying to sell me something, and who are not primarily afraid. (Examples are: <a href="https://simonwillison.net/" target="_blank" rel="noreferrer noopener">Simon Willison</a>, <a href="https://martinfowler.com/" target="_blank" rel="noreferrer noopener">Martin Fowler</a>, <a href="https://blog.fsck.com/" target="_blank" rel="noreferrer noopener">Jesse Vincent</a>, and yes, quickly hand $15 each month to the indispensable <a href="https://www.pragmaticengineer.com/" target="_blank" rel="noreferrer noopener"><em>Pragmatic Engineer</em></a>.)</p>



<p>When it came to Precipice, widely acknowledged as the hardest hiking trail in Acadia, I took a different approach. (It’s actually not a hiking trail but a mountain climb without ropes.) Using the same investigative techniques I’d learned from Beehive, I found out it was three times longer and had scarier moments.</p>



<p>This gets us to another lesson.</p>



<p><em>Go with someone much more enthusiastic than you.</em></p>



<p>I don’t know how, but my athletic 13-year-old son is a daredevil. He’s up for any scary experience. I do not usually accompany him on the scary roller coasters.</p>



<p>He was totally up for Precipice, of course. Dad was very nervous.</p>



<p>But I knew that if anyone could drag me up that mountain, it was him. I also didn’t want to let him down. In fact, I almost decided to abort the mission at the bottom of the trail. I just sighed and thought, “I will just do the beginning part. We can duck out and take another route down until about one-third of the way up.”</p>



<p>So if you’re not sure how to use AI, or are not yet enthusiastic, find people who <em>are</em> and keep talking to them! You don’t have to abandon your friends or coworkers who aren’t as interested. Instead, become the enthusiast in their world. (That is what happened to me more than a year ago.)</p>



<p>Another reason I decided not to give up is that I bought different shoes.</p>



<p>You can hike most trails in regular sneakers in almost any condition. But since Precipice is a climb and not a hike, I realized my usual worn-out running shoes might not be up for that, as I had slid on them during a lesser climb elsewhere that week.</p>



<p>So while in nearby Bar Harbor, my family ducked into a sporting goods store and looked at hiking shoes for me and my son. I told the sales guy we were going to do Precipice. He raised an eyebrow and said I would of course need something good for <em>that</em>.</p>



<p>When I held the strange shoes in my hand, I looked at the price tag and then looked at my wife, who gave a knowing look back at me that surely meant, “OK, but you do realize that you actually have to climb it if we buy those.” I just nodded.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="858" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29.png" alt="Ed's new climbing shoes" class="wp-image-18399" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29-300x184.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29-768x471.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>And we needed those new shoes! My son and I had a few tense moments scrambling where we agreed it was quite good we had them. But all along the way, they <em>felt</em> different, which was what I needed.</p>



<p>This reminds me of when I decided to use Claude Code a few weeks after it came out last March. The tokens cost 10 times what I could get elsewhere. But suddenly I was invested.</p>



<p>It also mattered that Claude Code, as a terminal, was a very different development experience. People back then thought it was strange that I was using a CLI to manage code. It was really different for me too, and all the better: I was no longer screwing around with code suggestions in GitHub Copilot.</p>



<p>This is a lesson I have taken to AI:</p>



<p><em>You must get different equipment.</em></p>



<p>You should be regularly experimenting with new tools that make you uncomfortable. Just using the new AI features in your existing tool is not enough for continuous growth or paradigm shifts, like the recent one from the CLI to multiple simultaneous agent management.</p>



<p>The last idea I have is to stop thinking about where all of us will end up one day.</p>



<p><em>Put the summit out of your mind.</em></p>



<p>While climbing Precipice, I decided to only think of what was in front of me. I knew it was <em>a lot</em> higher than Beehive. I just kept doing one more tough piece of it.</p>



<p>The advantage of doing this was near the top. Because the scariest piece was something I didn’t notice from online trail photos.</p>



<p>You can get an idea of what I&#8217;m talking about from <a href="http://www.watsonswander.com/assets/2016/08/DSC06593.jpg" target="_blank" rel="noreferrer noopener">this photo</a> from <a href="http://www.watsonswander.com/2016/last-days-in-maine/" target="_blank" rel="noreferrer noopener">Watson&#8217;s World</a>, which I had not seen before I got up there. It shows a long cliff with a very short ledge (much shorter than it looks at this angle). Even the picture doesn’t make it clear just how <em>exposed</em> you are and that there is <em>nothing</em> behind you but a long, deadly fall. The bottom bars are to prevent your feet from slipping off.</p>



<p>When I came to it, I thought, “No…way.”</p>



<p>But there was no turning back by then. I had come so far! I looked up and saw the summit was just above this last traverse. So I just held onto the bars, held onto my breath, and moved carefully along the cliff right behind my son, who was suddenly more cautious.</p>



<p>Had I known <em>that</em> was up there, I might not have climbed the mountain. Good thing I didn’t know.</p>



<p>As for the future of software, I don’t know what lies further up the mountain we are on. There are probably some very strenuous and scary moments ahead. But we shouldn’t be worrying about them now.</p>



<p>We should just keep climbing.</p>
]]></content:encoded>
										</item>
		<item>
		<title>The Missing Layer in Agentic AI</title>
		<link>https://www.oreilly.com/radar/the-missing-layer-in-agentic-ai/</link>
				<pubDate>Thu, 26 Mar 2026 11:30:50 +0000</pubDate>
					<dc:creator><![CDATA[Artur Huk]]></dc:creator>
						<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18372</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-layer-in-agentic-AI.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-layer-in-agentic-AI-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why autonomous systems need a deterministic runtime]]></custom:subtitle>
		
				<description><![CDATA[The day two problem Imagine you deploy an autonomous AI agent to production. Day one is a success: The demos are fantastic; the reasoning is sharp. But before handing over real authority, uncomfortable questions emerge. What happens when the agent misinterprets a locale-specific decimal separator, turning a position of 15.500 ETH (15 and a half) [&#8230;]]]></description>
								<content:encoded><![CDATA[
<h2 class="wp-block-heading">The day two problem</h2>



<p>Imagine you deploy an autonomous AI agent to production. Day one is a success: The demos are fantastic; the reasoning is sharp. But before handing over real authority, uncomfortable questions emerge.</p>



<p>What happens when the agent misinterprets a locale-specific decimal separator, turning a position of 15.500 ETH (15 and a half) into an order for 15,500 ETH (15 thousand) on leverage? What if a dropped connection leaves it looping on stale state, draining your LLM request quota in minutes?</p>



<p>What if it makes a perfect decision, but the market moves just before execution? What if it hallucinates a parameter like <code>force_execution=True</code>—do you sanitize it or crash downstream? And can it reliably ignore a prompt injection buried in a web page?</p>



<p>Finally, if an API call times out without acknowledgment, do you retry and risk duplicating a $50K transaction, or drop it?</p>



<p>When these scenarios occur, megabytes of prompt logs won&#8217;t explain the failure. And adding &#8220;please be careful&#8221; to the system prompt acts as a superstition, not an engineering control.</p>



<h2 class="wp-block-heading">Why a smarter model is not the answer</h2>



<p>I encountered these failure modes firsthand while building an autonomous system for live financial markets. It became clear that these were not model failures but execution boundary failures. While RL-based fine-tuning can improve reasoning quality, it cannot solve infrastructure realities like network timeouts, race conditions, or dropped connections.</p>



<p>The real issues are architectural gaps: contract violations, data integrity issues, context staleness, decision-execution gaps, and network unreliability.</p>



<p>These are infrastructure problems, not intelligence problems.</p>



<p>While LLMs excel at orchestration, they lack the &#8220;kernel boundary&#8221; needed to enforce state integrity, idempotency, and transactional safety where decisions meet the real world.</p>



<h2 class="wp-block-heading">An architectural pattern: The Decision Intelligence Runtime</h2>



<p>Consider modern operating system design. OS architectures separate “user space” (unprivileged computation) from “kernel space” (privileged state modification). Processes in user space can perform complex operations and request actions but cannot directly modify system state. The kernel validates every request deterministically before allowing side effects.</p>



<p>AI agents need the same structure. The agent interprets context and proposes intent, but the actual execution requires a privileged deterministic boundary. This layer, the Decision Intelligence Runtime (DIR), separates probabilistic reasoning from real-world execution.</p>



<p>The runtime sits between agent reasoning and external APIs, maintaining a <strong>context store</strong>, a centralized, immutable record ensuring the runtime holds the &#8220;single source of truth,&#8221; while agents operate only on temporary snapshots. It receives proposed intents, validates them against hard engineering rules, and handles execution. Ideally, an agent should never directly manage API credentials or “own” the connection to the external world, even for read-only access. Instead, the runtime should act as a proxy, providing the agent with an immutable context snapshot while keeping the actual keys in the privileged kernel space.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="755" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23.png" alt="Figure 1: High-level design (HLD) of the Decision Intelligence Runtime" class="wp-image-18373" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-300x142.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-768x362.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-1536x725.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 1: High-level design (HLD) of the Decision Intelligence Runtime, illustrating the separation of user space reasoning from kernel space execution</em></figcaption></figure>



<p>Bringing engineering rigor to probabilistic AI requires implementing five familiar architectural pillars.</p>



<p>Although several examples in this article use a trading simulation for concreteness, the same structure applies to healthcare workflows, logistics orchestration, and industrial control systems.</p>



<h3 class="wp-block-heading">DIR versus existing approaches</h3>



<p>The landscape of agent guardrails has expanded rapidly. Frameworks like LangChain and LangGraph operate in user space, focusing on reasoning orchestration, while tools like Anthropic&#8217;s Constitutional AI and Pydantic schemas validate outputs at inference time. DIR, by contrast, operates at the execution boundary, the kernel space, enforcing contracts, business logic, and audit trails after reasoning is complete.</p>



<p>Both are complementary. DIR is intended as a safety layer for mission-critical systems.</p>



<h4 class="wp-block-heading">1. Policy as a claim, not a fact</h4>



<p>In a secure system, external input is never trusted by default. The output of an AI agent is exactly that: external input. The proposed architecture treats the agent not as a trusted administrator, but as an untrusted user submitting a form. Its output is structured as a <strong>policy proposal</strong>—a claim that it <em>wants</em> to perform an action, not an order that it <em>will</em> perform it. This is the start of a Zero Trust approach to agentic actions.</p>



<p>Here is an example of a policy proposal from a trading agent:</p>



<pre class="wp-block-code"><code>proposal = PolicyProposal(
    dfid="550e8400-e29b-41d4-a716-446655440000", # Trace ID (see Sec 5)
    agent_id="crypto_position_manager_01",
    policy_kind="TAKE_PROFIT",
    params={
        "instrument": "ETH-USD",
        "quantity": 0.5,
        "execution_type": "MARKET"
    },
    reasoning="Profit target of +3.2% hit (Threshold: 3.0%). Market momentum slowing.",
    confidence_score=0.92
)</code></pre>



<h4 class="wp-block-heading">2. Responsibility contract as code</h4>



<p>Prompts are not permissions. Just as traditional apps rely on role-based access control, agents require a strict <strong>responsibility contract</strong> residing in the deterministic runtime. This layer acts as a firewall, validating every proposal against hard engineering rules: schema, parameters, and risk limits. Crucially, this check is deterministic code, not another LLM asking, &#8220;Is this dangerous?&#8221; Whether the agent hallucinates a capability or obeys a malicious prompt injection, the runtime simply enforces the contract and rejects the invalid request.</p>



<p><strong>Real-world example</strong>:<strong> </strong>A trading agent misreads a comma-separated value and attempts to execute <code>place_order(symbol='ETH-USD', quantity=15500)</code>. This would be a catastrophic position sizing error. The contract rejects it immediately:</p>



<pre class="wp-block-code"><code>ERROR: Policy rejected. Proposed order value exceeds hard limit.
Request: ~40000000 USD (15500 ETH)
Limit: 50000 USD (max_order_size_usd)</code></pre>



<p>The agent&#8217;s output is discarded; the human is notified. No API call, no cascading market impact.</p>



<p>Here is the contract that prevented this:</p>



<pre class="wp-block-code"><code># agent_contract.yaml
agent_id: "crypto_position_manager_01"
role: "EXECUTOR"
mission: "Manage news-triggered ETH positions. Protect capital while seeking alpha."
version: "1.2.0"                  # Immutable versioning for audit trails
owner: "jane.doe@example.com"     # Human accountability
effective_from: "2026-02-01"

# Deterministic Boundaries (The 'Kernel Space' rules)
permissions:
  allowed_instruments: &#91;"ETH-USD", "BTC-USD"]
  allowed_policy_types: &#91;"TAKE_PROFIT", "CLOSE_POSITION", "REDUCE_SIZE", "HOLD"]
  max_order_size_usd: 50000.00

# Safety &amp; Economic Triggers (Intervention Logic)
safety_rules:
  min_confidence_threshold: 0.85      # Don't act on low-certainty reasoning
  max_drawdown_limit_pct: 4.0         # Hard stop-loss enforced by Runtime
  wake_up_threshold_pnl_pct: 2.5      # Cost optimization: ignore noise
  escalate_on_uncertainty: 0.70       # If confidence &lt; 70%, ask human</code></pre>



<h4 class="wp-block-heading">3. JIT (just-in-time) state verification</h4>



<p>This mechanism addresses the classic race condition where the world changes between the moment you check it and the moment you act on it. When an agent begins reasoning, the runtime binds its process to a specific context snapshot. Because LLM inference takes time, the world will likely change before the decision is ready. Right before executing the API call, the runtime performs a JIT verification, comparing the live environment against the original snapshot. If the environment has shifted beyond a predefined drift envelope, the runtime aborts the execution.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1106" height="892" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24.png" alt="Figure 2: JIT verification catches stale decisions before they reach external systems." class="wp-image-18374" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24.png 1106w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24-300x242.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24-768x619.png 768w" sizes="auto, (max-width: 1106px) 100vw, 1106px" /><figcaption class="wp-element-caption"><em>Figure 2: JIT verification catches stale decisions before they reach external systems.</em></figcaption></figure>



<p>The drift envelope is configurable per context field, allowing fine-grained control over what constitutes an acceptable change:</p>



<pre class="wp-block-code"><code># jit_verification.yaml
jit_verification:
  enabled: true
  
  # Maximum allowed drift per field before aborting execution
  drift_envelope:
    price_pct: 2.0           # Abort if price moved > 2%
    volume_pct: 15.0         # Abort if volume changed > 15%
    position_state: strict   # Any change = abort
  
  # Snapshot expiration
  max_context_age_seconds: 30
  
  # On drift detection
  on_drift_exceeded:
    action: "ABORT"
    notify: &#91;"ops-channel"]
    retry_with_fresh_context: true
</code></pre>



<h4 class="wp-block-heading">4. Idempotency and transactional rollback</h4>



<p>This mechanism is designed to mitigate execution chaos and infinite retry loops. Before making any external API call, the runtime hashes the deterministic decision parameters into a unique idempotency key. If a network connection drops or an agent gets confused and attempts to execute the exact same action multiple times, the runtime catches the duplicate key at the boundary.</p>



<p>The key is computed as:</p>



<pre class="wp-block-code"><code>IdempotencyKey = SHA256(DFID + StepID + CanonicalParams)</code></pre>



<p>Where <code>DFID</code> is the Decision Flow ID, <code>StepID</code> identifies the specific action within a multistep workflow, and <code>CanonicalParams</code> is a sorted representation of the action parameters.</p>



<p>Critically, the <strong>context hash</strong> (snapshot of the world state) is deliberately <strong>excluded</strong> from this key. If an agent decides to buy 10 ETH and the network fails, it might retry 10 seconds later. By then, the market price (context) has changed. If we included the context in the hash, the retry would generate a new key (<code>SHA256(Action + NewContext)</code>), bypassing the idempotency check and causing a duplicate order. By locking the key to the <em>Flow ID</em> and <em>Intent params</em> only, we ensure that a retry of the same logical decision is recognized as a duplicate, even if the world around it has shifted slightly.</p>



<p>Furthermore, when an agent makes a multistep decision, the runtime tracks each step. If one step fails, it knows how to perform a compensation transaction to roll back what was already done, instead of hoping the agent will figure it out on the fly.</p>



<p>A DIR does not magically provide strong consistency; it makes the consistency model explicit: where you require atomicity, where you rely on compensating transactions, and where eventual consistency is acceptable.</p>



<h4 class="wp-block-heading">5. DFID: From observability to reconstruction</h4>



<p>Distributed tracing is not a new idea. The practical gap in many agentic systems is that traces rarely capture the artifacts that matter at the execution boundary: the exact context snapshot, the contract/schema version, the validation outcome, the idempotency key, and the external receipt.</p>



<p>The Decision Flow ID (DFID) is intended as a <em>reconstruction primitive</em>—one correlation key that binds the minimum evidence needed to answer critical operational questions:</p>



<ul class="wp-block-list">
<li><strong>Why did the system execute this action?</strong> (policy proposal + validation receipt + contract/schema version)</li>



<li><strong>Was the decision stale at execution time?</strong> (context snapshot + JIT drift report)</li>



<li><strong>Did the system retry safely or duplicate the side effect?</strong> (idempotency key + attempt log + external acknowledgment)</li>



<li><strong>Which authority allowed it?</strong> (agent identity + registry/contract snapshot)</li>
</ul>



<p>In practice, this turns a postmortem from &#8220;the agent traded&#8221; into &#8220;this exact intent was accepted under these deterministic gates against this exact snapshot, and produced this external receipt.&#8221; The goal is not to claim perfect correctness; it is to make side effects explainable at the level of inputs and gates, even when the reasoning remains probabilistic.</p>



<p>At the hierarchical level, DFIDs form parent-child relationships. A strategic intent spawns multiple child flows. When multistep workflows fail, you reconstruct not just the failing step but the parent mandate that authorized it.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1324" height="449" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25.png" alt="Figure 3: Hierarchical Decision Flow IDs enable full process reconstruction across multi-agent interactions." class="wp-image-18375" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25.png 1324w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25-300x102.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25-768x260.png 768w" sizes="auto, (max-width: 1324px) 100vw, 1324px" /><figcaption class="wp-element-caption"><em>Figure 3: Hierarchical Decision Flow IDs enable full process reconstruction across multi-agent interactions.</em></figcaption></figure>



<p>In practice, this level of traceability is not about storing prompts—it is about storing structured decision telemetry.</p>



<p>In one trading simulation, each position generated a decision flow that could be queried like any other system artifact. This allowed inspection of the triggering news signal, the agent’s justification, intermediate decisions (such as stop adjustments), the final close action, and the resulting PnL, all tied to a single simulation ID. Instead of replaying conversational history, this approach reconstructed what happened at the level of state transitions and executable intents.</p>



<pre class="wp-block-code"><code>SELECT position_id
     , instrument
     , entry_price
     , initial_exposure
     , news_full_headline
     , news_score
     , news_justification
     , decisions_timeline
     , close_price
     , close_reason
     , pnl_percent
     , pnl_usd
  FROM position_audit_agg_v
 WHERE simulation_id = 'sim_2026-02-24T11-20-18-516762+00-00_0dc07774';</code></pre>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="164" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26.png" alt="Figure 4: Example of structured decision telemetry" class="wp-image-18376" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-300x31.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-768x79.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-1536x157.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 4: Example of structured decision telemetry. Each row links context, reasoning, intermediate actions, and financial outcome for a single simulation run.</em></figcaption></figure>



<p>This approach is fundamentally different from prompt logging. The agent’s reasoning becomes one field among many—not the system of record. The system of record is the validated decision and its deterministic execution boundary.</p>



<h3 class="wp-block-heading">From model-centric to execution-centric AI</h3>



<p>The industry is shifting from <em>model-centric AI</em>, measuring success by reasoning quality alone, to <em>execution-centric AI</em>, where reliability and operational safety are first-class concerns.</p>



<p>This shift comes with trade-offs. Implementing deterministic control requires higher latency, reduced throughput, and stricter schema discipline. For simple summarization tasks, this overhead is unjustified. But for systems that move capital or control infrastructure, where a single failure outweighs any efficiency gain, these are acceptable costs. A duplicate $50K order is far more expensive than a 200 ms validation check.</p>



<p>This architecture is not a single software package. Much like how Model-View-Controller (MVC) is a pervasive pattern without being a single importable library, DIR is a set of engineering principles: separation of concerns, zero trust, and state determinism, applied to probabilistic agents. Treating agents as untrusted processes is not about limiting their intelligence; it is about providing the safety scaffolding required to use that intelligence in production.</p>



<p>As agents gain direct access to capital and infrastructure, a runtime layer will become as standard in the AI stack as a transaction manager is in banking. The question is not whether such a layer is necessary but how we choose to design it.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This article provides a high-level introduction to the Decision Intelligence Runtime and its approach to production resiliency and operational challenges. The full architectural specification, repository of context patterns, and reference implementations are available as an open source project at </em><a href="https://github.com/huka81/decision-intelligence-runtime" target="_blank" rel="noreferrer noopener"><em>GitHub</em></a><em>.</em></p>
</blockquote>
]]></content:encoded>
										</item>
	</channel>
</rss>

<!--
Performance optimized by W3 Total Cache. Learn more: https://www.boldgrid.com/w3-total-cache/?utm_source=w3tc&utm_medium=footer_comment&utm_campaign=free_plugin

Object Caching 91/97 objects using Memcached
Page Caching using Disk: Enhanced (Page is feed) 
Minified using Memcached

Served from: www.oreilly.com @ 2026-04-13 13:34:45 by W3 Total Cache
-->