<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Thu, 30 Apr 2026 17:26:11 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Everyone’s an Engineer Now</title>
		<link>https://www.oreilly.com/radar/everyones-an-engineer-now/</link>
				<comments>https://www.oreilly.com/radar/everyones-an-engineer-now/#respond</comments>
				<pubDate>Thu, 30 Apr 2026 15:59:33 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18622</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-1.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-1-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Takeaways from Cat Wu’s fireside chat with Addy Osmani]]></custom:subtitle>
		
				<description><![CDATA[Cat Wu leads product for Claude Code and Cowork at Anthropic, so she’s well-versed in building reliable, interpretable, and steerable AI systems. And since 90% of Anthropic’s code is now written by Claude Code, she’s also deeply familiar with fitting these tools into routine day-to-day work. Last month, Cat joined Addy Osmani at AI Codecon for [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Cat Wu leads product for Claude Code and Cowork at Anthropic, so she’s well-versed in building reliable, interpretable, and steerable AI systems. And since 90% of Anthropic’s code is now written by Claude Code, she’s also deeply familiar with fitting these tools into routine day-to-day work. Last month, Cat joined Addy Osmani at <a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener">AI Codecon</a> for a fireside chat on the future of agentic coding (and, equally important, agentic code review), how Anthropic actually uses the tools they&#8217;re building, and what skills matter now for developers.</p>



<h2 class="wp-block-heading">The feedback loop is itself a product</h2>



<p>Boris Cherny initially built Claude Code as a side project to test Anthropic’s APIs. Then he shared the tool in a notebook, and within two months the entire company was using it. That organic growth, Cat said, was part of what convinced the team it was worth releasing externally.</p>



<p>But what really made that internal adoption visible was the response on Anthropic&#8217;s internal “dog-fooding” Slack channel. The Claude Code channel gets a new message every 5 to 10 minutes around the clock, and this feedback directly and immediately informs the product experience. Cat described it this way:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>We hire for people who love polishing the user experience. And so a lot of our engineers actually live in this channel and find when there&#8217;s issues with new features that they&#8217;ve worked on and they proactively lay out the fixes.</p>
</blockquote>



<p>The team ships new versions of Claude Code to internal users many times a day. The feedback loop is tight enough that it functions as a continuous integration system for product quality, not just code quality.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="From Boris&#039;s Notebook to the Whole Company with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/wo_CbgoyFLY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Cat told Addy how she once accidentally introduced a small interaction bug between prompts and auto-suggestions. But by the time she started working on a solution, she found another team member had already beaten her to it. It turns out, he had set up a scheduled task in Claude Code to scan the feedback channel for anything that hadn&#8217;t been responded to in 24 hours and open a PR for it. Since Cat hadn’t gotten to it yet (whoops!), her teammate’s Claude saw the unaddressed issue and fixed it for her. And Cat only found out when “[her own] Claude noticed that his Claude had already landed a change.”</p>



<p>The infrastructure for rapid improvement, in other words, is now partly automated. The agents are writing the code, then monitoring the feedback and closing the loop.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="My Claude Fixed My Bug Before I Did with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/4h0i7YiS9io?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The bottleneck has shifted to review</h2>



<p>There’s no question that AI-assisted coding has created a boom in output. Anthropic engineers are producing roughly 200% more code than they were a year ago, Cat noted. Today the main constraint is reviewing all that code to ensure it’s production-ready.</p>



<p>Cat&#8217;s team concluded that you can buy a lot of additional robustness for not that much extra cost. </p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>We opted for the heaviest, most robust version [of code review]. We actually plot how many agents and how comprehensive of a review Claude does and then how many bugs does it recall. And we picked a number of very high recall and decided we should ship this, because if you really want AI code review to be a load-bearing part of your process, you actually probably just want the most comprehensive possible review.</p>
</blockquote>



<p>The review agent doesn&#8217;t just look at the diff. It traces code across multiple files and catches bugs in adjacent code that has nothing to do with the change in question. Cat gave two examples. One was a ZFS encryption refactor where the agent found a key cache invalidation bug that was unrelated to the author&#8217;s change but would have invalidated the refactor. The other was a routine auth update that turned out to have a bad side effect, caught premerge. In both cases, engineers manually reviewing the code likely would have missed the bugs.</p>



<p>The human review that remains is deliberately small in scope. For most PRs, the human reviewer skims for design principle violations and obvious problems and assumes functional correctness has been handled. Five to ten agents run in parallel, each given a slightly different task; they return their findings independently, and the results are then deduplicated.</p>
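<p>As a rough sketch (this is not Anthropic&#8217;s actual implementation; the task list and the agent call are invented stand-ins), the fan-out-and-deduplicate pattern looks something like this:</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical review lenses; Anthropic's actual prompts and agent counts are not public.
REVIEW_TASKS = [
    "error handling",
    "concurrency",
    "security",
    "API contracts",
    "test coverage",
]

def run_review_agent(task, diff):
    """Stand-in for one review agent; a real one would prompt an LLM focused on `task`."""
    return [f"{task}: flagged {line!r}" for line in diff if "TODO" in line]

def review(diff):
    # Fan out: every agent reviews the same diff with a slightly different task.
    with ThreadPoolExecutor(max_workers=len(REVIEW_TASKS)) as pool:
        results = pool.map(lambda task: run_review_agent(task, diff), REVIEW_TASKS)
    # Deduplicate: the same underlying finding reported by several agents is kept once.
    unique = {finding.split(": ", 1)[1] for agent in results for finding in agent}
    return sorted(unique)

print(review(["x = retry()", "TODO: handle timeout"]))
```

<p>The deduplication step, not the individual agents, is what keeps a high-recall review from drowning the human reviewer in repeated findings.</p>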



<p>The cultural shift that made this work, though, was ownership. The team moved to a model where the engineer who authors a PR owns it end to end, including postdeploy bugs, and doesn&#8217;t lean on peer reviewers to catch mistakes. “Otherwise,” as Cat pointed out, “you have situations where junior engineers put out a bunch of PRs and then your senior engineers are like drowning in AI-generated stuff where they&#8217;re not sure how thoroughly it&#8217;s been tested.&#8221;</p>



<p>Full ownership meant the AI review had to actually be trustworthy, which drove the decision to go for high recall rather than a lighter touch. That said, engineers are still expected to understand every line of code an agent creates&#8230;for now. As Cat explained, it’s the only way to truly prevent “unknown security vulnerabilities and to be able to quickly respond to incidents if they are to happen.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Making AI Code Review a Loadbearing Part of Your Process with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/1eBxpDE35Gk?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Everyone&#8217;s kind of an engineer now</h2>



<p>Cowork, Anthropic&#8217;s agent tool for nontechnical users, is the company’s attempt to take what Claude Code does for engineers and bring it to knowledge work more broadly. Cat sketched a picture of someone looking at five or six agent tasks running simultaneously in a side panel, managing a fleet of agents the way a senior engineer manages a PR queue.</p>



<p>In the nearer term, she&#8217;s keeping tabs on the shift toward people using Claude Code to build things for themselves, their teams, or their families that wouldn&#8217;t have justified professional development effort or “otherwise been possible.” Think the garage project, the family expense tracker, the tool that a small team actually needs but that no SaaS product quite addresses. Cat&#8217;s goal and hope is that Claude Code helps people “solve their own problems for themselves” and “stewards a new future of personal software.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Everyone&#039;s Kind of an Engineer Now with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/10wu71soYhg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Product taste as the new technical skill</h2>



<p>More people building more software is unambiguously good. Boris Cherny has even floated the idea that coding as we know it is “<a href="https://x.com/lennysan/status/2024896611818897438" target="_blank" rel="noreferrer noopener">solved</a>.” But what does that mean for the craft of software engineering? Cat&#8217;s read of the current moment is more nuanced:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I think pre-AI, the skills that were very important were being able to take a spec and implement it well. And I think now the really important skill is product taste. Even for engineers. Can you use code to ingest a massive amount of user feedback? Do you have good intuition about which feature to build to address those needs, because it&#8217;s often different than exactly what users are asking you for? And then, when Claude builds it, are you setting up the right bar so that what you ship people actually love?</p>
</blockquote>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Product Taste Is the New Technical Skill with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/hIEA3YFixE4?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Cat’s not alone in highlighting the importance of taste in a world where code is a commodity. <a href="https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/#:~:text=the%20new%20Amish.%E2%80%9D-,Taste%20is%20the%20moat,-Another%20of%20the" target="_blank" rel="noreferrer noopener">Steve Yegge</a>, <a href="https://www.oreilly.com/radar/the-mythical-agent-month/#:~:text=Design%20and%20taste%20as%20our%20last%20foothold" target="_blank" rel="noreferrer noopener">Wes McKinney</a>, and many others, <a href="https://www.oreilly.com/radar/betting-against-the-bitter-lesson/#:~:text=Even%20if%20the,taste%20and%20curation" target="_blank" rel="noreferrer noopener">myself included</a>, see taste and judgment as a uniquely human value. This has practical implications for how engineers should spend their time now, and for what the next generation needs to learn.</p>



<p>For junior engineers specifically, Cat described a progression: Start by using Claude Code to understand the codebase (ask all the &#8220;dumb questions&#8221; without embarrassment), take those answers to a senior engineer for calibration, and then close the loop by updating the CLAUDE.md with whatever was missing.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Think of Claude Code as your intern that you&#8217;re trying to level up. Like, teach it back to Claude. Add a <code>/verify</code> slash command. Put it in the CLAUDE.md or the agent README. Approach this as senior engineers helping you level up, and then you helping Claude and other agents level up.</p>
</blockquote>



<p>The improvement process, in other words, should be bidirectional. Engineers get better at using the tools and the tools get better through the engineers&#8217; accumulated knowledge. And significantly, this process keeps humans firmly in the loop, playing a role that’s “<a href="https://www.oreilly.com/radar/software-craftsmanship-in-the-age-of-ai/" target="_blank" rel="noreferrer noopener">active, continuous, and skilled</a>.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="How Should Junior Engineers Use Claude Code? with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/qnSuOFXkEH0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>You can <a href="https://learning.oreilly.com/videos/ai-codecon-software/0642572305581/" data-type="link" data-id="https://learning.oreilly.com/videos/ai-codecon-software/0642572305581/" target="_blank" rel="noreferrer noopener">watch Cat and Addy&#8217;s full chat</a>, plus everything else from AI Codecon on the O&#8217;Reilly learning platform. Not a member? <a href="https://www.oreilly.com/start-trial/?type=individual" data-type="link" data-id="https://www.oreilly.com/start-trial/?type=individual" target="_blank" rel="noreferrer noopener">Sign up for a free 10-day trial</a>, no strings attached. </em></p>
</blockquote>



<p></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/everyones-an-engineer-now/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>AI Code Review Only Catches Half of Your Bugs</title>
		<link>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/</link>
				<comments>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/#respond</comments>
				<pubDate>Thu, 30 Apr 2026 11:14:49 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18637</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-code-review-only-catches-half-of-your-bugs.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-code-review-only-catches-half-of-your-bugs-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Luckily, there&#039;s a way to catch the ones that no structural analysis ever will]]></custom:subtitle>
		
				<description><![CDATA[This is the fifth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, and part four here. I recently had a taste of humility with my AI-generated code. I live in Park Slope, Brooklyn, and recently I needed to get to the other side of the neighborhood. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the fifth article in a series on agentic engineering and AI-driven development. Read part one <a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two <a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three <a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, and part four <a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
</blockquote>



<p>I recently had a taste of humility with my AI-generated code. I live in Park Slope, Brooklyn, and needed to get to the other side of the neighborhood. I thought I&#8217;d be clever: I like taking the bus, so I decided to hop on the one that goes right down 7th Avenue. I knew I could check the schedule using the MTA&#8217;s really useful Bus Time app or website, but it doesn&#8217;t take into account walking time from my house or give me a good idea of when to leave. This seemed like a great opportunity to vibe code an app and do some quick AI-driven development.</p>



<p>It took about two minutes for Claude Code to get my new app working. It made a lovely little web UI, I configured my stop and how long it takes me to walk there, and it gave me the perfect departure time.</p>



<p>When I actually walked out the door, the app perfectly predicted my wait. There was just one problem: my bus was nowhere to be seen. What I <em>did</em> see was a bus driving the exact opposite direction down 7th Avenue.</p>



<p>It was pretty obvious what had happened. I needed to go deeper into Brooklyn, not towards Manhattan, and the AI had picked the wrong direction. (Actually, as Cowork pointed out, each stop has its own ID, and it had selected the ID for the wrong stop.) I&#8217;d been using Cowork to orchestrate everything, and I could easily have just asked it to go out and check the MTA&#8217;s Bus Time site for me to make sure the app was working. But I just trusted the AI. As a result, I had to walk. Which is fine—I love walking—but the irony was painful. I had literally just published an article about AI code quality and why you shouldn&#8217;t blindly trust it, and here I was doing exactly that.</p>



<p>The app had a bug. But it wasn&#8217;t the kind of bug you&#8217;d necessarily catch using a typical AI code review prompt. It built, ran, and did a perfectly fine job parsing the JSON from the MTA API. But if I&#8217;d started with a simple requirement—even just a user story like &#8220;as a Park Slope resident, I want to catch the B69 headed towards Kensington so I can get deeper into Brooklyn&#8221;—the AI would have built it differently. The problem is that AI can only build the thing you tell it to build, which isn&#8217;t necessarily the thing you <em>wanted</em> it to build. <strong>AI is really good at writing &#8220;correct&#8221; code that does the wrong thing.</strong></p>
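<p>That user story contains an assertion you can actually execute. As a toy sketch (the stop IDs, field names, and <code>pick_stop</code> helper below are all invented for illustration; this is not the real MTA Bus Time API), selecting the stop by stated intent rather than by whatever ID turns up first looks like this:</p>

```python
# Toy sketch: every stop ID and field name here is invented, not real MTA data.
USER_STORY_DIRECTION = "toward Kensington"  # "...so I can get deeper into Brooklyn"

def pick_stop(stops, wanted_direction):
    """Select a stop by intent; fail loudly instead of silently taking the wrong one."""
    matches = [s for s in stops if s["direction"] == wanted_direction]
    if not matches:
        raise LookupError(f"no stop headed {wanted_direction!r}")
    return matches[0]

stops = [
    {"id": "B69-0001", "direction": "toward Manhattan"},   # what the AI silently picked
    {"id": "B69-0002", "direction": "toward Kensington"},  # what the user story required
]

print(pick_stop(stops, USER_STORY_DIRECTION)["id"])  # selects the Kensington-bound stop
```

<p>A test derived from the requirement (the selected stop must head toward Kensington) fails immediately on the wrong-direction ID, while every structural check on the same code passes.</p>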



<p>My Brooklyn bus detour was a minor inconvenience. But it was a really useful, small-scale example of what I kept running into in my larger projects, too. There&#8217;s an entire class of bugs that you simply can&#8217;t find with structural analysis—no linter, no static analyzer, no AI code reviewer will catch them—because the code isn&#8217;t wrong in any way that&#8217;s visible from the code alone. You need to know what the code was supposed to do. You need to know the intent.</p>



<p>The data on why requirements matter goes back decades. Back in the 1990s, for example, the Standish CHAOS reports were a big eye-opener for me and a lot of other people in the industry, providing large-scale data that confirmed what we&#8217;d been seeing on our own projects: that the most expensive defects trace back to misunderstood or missing requirements. Those reports really underscored the idea that poor requirements management, and specifically incomplete or frequently changing specifications, was among the primary drivers behind IT project failures. (And, as far as I can tell, it still is, and AI isn&#8217;t helping things—see my O&#8217;Reilly Radar article, “<a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>”).</p>



<p>The idea that requirements problems really are the source of the most expensive kind of defects should make intuitive sense: If you build the wrong thing, you have to tear it apart and rebuild it. That&#8217;s why I made requirements the foundation of the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open-source skill for AI tools like Claude Code, Cursor, and Copilot that I introduced in the <a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">previous article</a>. I&#8217;ve spent decades doing test-driven development, partnering with QA teams, welcoming the harshest code reviews from teammates who don&#8217;t pull punches—and that experience led me to build a tool that uses AI to bring back quality engineering practices the industry abandoned decades ago. I&#8217;ve tested it against a wide range of open-source projects in Go, Java, Rust, Python, and C#, from small utilities to widely used libraries with tens of thousands of stars, and it&#8217;s found real bugs in almost every project it&#8217;s come across, including ones that have been confirmed and merged upstream.</p>



<p>I think there are a lot of wider lessons we can learn from my experience using requirements to help AI find bugs—especially security bugs. So in this article, I want to focus on the single most important thing I&#8217;ve learned from building it: everything depends on requirements. Not just any requirements, but a specific kind of requirement that most projects don&#8217;t have, that most AI tools don&#8217;t ask for, and that turns out to be the key to making AI actually useful for verifying code quality.</p>



<h2 class="wp-block-heading"><strong>Spec-driven development and what it misses</strong></h2>



<p>Developers using AI tools have been rediscovering the value of writing things down before asking the AI to build them. Spec-driven development (SDD) has become very popular, and for good reason. Addy Osmani wrote an excellent piece on this, “<a href="https://addyosmani.com/blog/good-spec/" target="_blank" rel="noreferrer noopener">How to Write a Good Spec for AI Agents</a>,” and the core idea is sound: If you write a clear specification of what you want built, the AI produces dramatically better results than if you just describe it in a chat prompt and hope for the best.</p>



<p>I think SDD is important, and I&#8217;d encourage any developer working with AI to adopt it. But as I was building the Quality Playbook, I discovered that SDD has a blind spot that matters a lot for code quality. An SDD spec describes the <em>how</em>—what the implementation should look like. It tells the AI &#8220;implement a duplicate key check&#8221; or &#8220;add a retry mechanism with exponential backoff&#8221; or &#8220;create a REST endpoint that returns paginated results.&#8221; That&#8217;s useful for building things. But it&#8217;s not enough for verifying them.</p>



<p>A requirement, by contrast, doesn&#8217;t say &#8220;implement a duplicate key check.&#8221; It says &#8220;users depend on Gson to reject ambiguous input so they don&#8217;t silently accept corrupted data.&#8221; The AI can reason about the second one in ways it can&#8217;t reason about the first, because the second one has the purpose attached. When the AI knows the purpose, it can evaluate whether the code actually fulfills that purpose across all the edge cases, not just the ones the spec explicitly listed. That&#8217;s how the Quality Playbook caught a bug in Google&#8217;s Gson library, one of the most widely used JSON libraries in Java.</p>



<p>I think it&#8217;s worth digging into that particular bug, because it&#8217;s a great example of just how powerful requirements analysis can be for finding defects. The playbook derived null-handling requirements from Gson&#8217;s own community—GitHub issues <a href="https://github.com/google/gson/issues/676" target="_blank" rel="noreferrer noopener">#676</a>, <a href="https://github.com/google/gson/issues/913" target="_blank" rel="noreferrer noopener">#913</a>, <a href="https://github.com/google/gson/issues/948" target="_blank" rel="noreferrer noopener">#948</a>, and <a href="https://github.com/google/gson/issues/1558" target="_blank" rel="noreferrer noopener">#1558</a>, some dating back to 2016—then used those requirements to find that duplicate keys were silently accepted when the first value was null. It confirmed the bug by generating a failing test, then patched the code and verified the test passed. I&#8217;ve used Gson for years and done a lot of work with Java serialization, so I read the code and the fix myself before submitting anything—trust but verify. The fix was merged as <a href="https://github.com/google/gson/pull/3006" target="_blank" rel="noreferrer noopener">https://github.com/google/gson/pull/3006</a>, confirmed by Google&#8217;s own test suite.</p>
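<p>Gson&#8217;s fix is in Java, but the class of bug is easy to reproduce in any JSON parser. As a rough Python analogue (a sketch of the idea, not the Gson patch itself), the requirement &#8220;reject ambiguous input rather than silently accepting corrupted data&#8221; becomes a duplicate-key check that holds even when the first value is null:</p>

```python
import json

def strict_object(pairs):
    # Requirement-derived check: reject ambiguous input outright.
    obj = {}
    for key, value in pairs:
        if key in obj:  # membership test catches duplicates even when obj[key] is None
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

def parse_strict(text):
    return json.loads(text, object_pairs_hook=strict_object)

doc = '{"id": null, "id": 42}'   # the bug class: first value null, duplicate silently wins
print(json.loads(doc))           # default parsing silently keeps the last value
try:
    parse_strict(doc)
except ValueError as err:
    print(err)                   # the requirement-derived parser refuses the document
```

<p>Note the membership test on the key: a check based on whether the stored value is non-null would miss exactly the null-first variant, which is why the requirement (reject ambiguity), not the spec (check for duplicates), is what drives the test.</p>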



<p>That bug had been hiding in plain sight for years, through thousands of tests and countless code reviews. And no structural analysis would likely ever have found it, because you needed the requirement to know the behavior was wrong.</p>



<p>This distinction might sound academic, but it has very concrete consequences for whether your AI can actually find bugs in your code.</p>



<h2 class="wp-block-heading"><strong>About half of all security bugs are invisible to structural analysis</strong></h2>



<p>The security world has known about the limits of structural analysis for a long time. The <a href="https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.500-326.pdf" target="_blank" rel="noreferrer noopener">NIST SATE evaluations</a> found that <strong>the best static analysis tools plateaued at around 50-60% detection rates for security vulnerabilities</strong>. Gary McGraw&#8217;s <a href="https://www.oreilly.com/library/view/software-security-building/0321356705/" target="_blank" rel="noreferrer noopener"><em>Software Security: Building Security In</em></a> (Addison-Wesley, 2006) explains why: Roughly 50% of security defects are implementation bugs, and the other 50% are design flaws. Static analysis tools target the implementation bugs—buffer overflows, SQL injection, format string vulnerabilities—because those are pattern-matchable. But design flaws are about intent: The system&#8217;s architecture doesn&#8217;t enforce the security properties it&#8217;s supposed to enforce, and no amount of scanning the code will reveal that. A <a href="https://arxiv.org/abs/2407.12241" target="_blank" rel="noreferrer noopener">2024 study by Charoenwet et al.</a> (ISSTA 2024) confirmed this is still the case: They tested five static analysis tools against 815 real vulnerability-contributing commits and found that 22% of vulnerable commits went entirely undetected, and 76% of warnings in vulnerable functions were irrelevant to the actual vulnerability. The pattern is consistent across two decades of research: There&#8217;s a ceiling on what you can find by analyzing code, and it&#8217;s around half.</p>



<p>There&#8217;s a good reason for that limitation: the <strong>intent ceiling</strong>. A structural analysis tool is limited to reading the code and looking at what it does; it has no way to take into account <em>what the developer intended it to do</em>.</p>



<p>When an AI does a code review without requirements, it&#8217;s limited to structural analysis: pattern matching, code smell detection, race condition analysis. It can ask &#8220;does this look right?&#8221; but it can&#8217;t ask &#8220;does this do what it&#8217;s supposed to do?&#8221; because it doesn&#8217;t know what the code is supposed to do. Structural review catches genuinely important stuff—race conditions, null pointer issues, resource leaks, concurrency bugs. A structural reviewer looking at a shell script will catch a missing <code>fi</code>, a bad variable expansion, a race condition. Structural review is useful, and structural review is what most AI code review tools do today.</p>



<p>But about half of all security defects are intent violations: things the code doesn&#8217;t do that it was supposed to do, or things it does that it wasn&#8217;t supposed to do. They&#8217;re invisible without a specification to check against, and no tool will find them by looking at code that is, structurally, perfectly sound. A structural reviewer looking at a script that&#8217;s, say, used to check router configuration files, might find well-formed bash, correct syntax, proper quoting, and code that looks like it works and doesn&#8217;t match known antipatterns. It wouldn&#8217;t know the script is only validating three of the five access control rules it&#8217;s supposed to enforce because that&#8217;s a requirements question, not a syntax question.</p>



<p>Or, more personally for me, this is what happened with my bus tracker app: The JSON parsing was flawless, the UI was correct, the timing logic worked perfectly. The only problem was that it showed buses headed towards Manhattan when I needed to go deeper into Brooklyn—and no structural analysis would ever catch that, because you need to know which direction I intended to go. That&#8217;s me and my very clever AI hitting the intent ceiling.</p>



<h2 class="wp-block-heading"><strong>The intent ceiling is a security problem</strong></h2>



<p>This is where it gets really serious, because security vulnerabilities are some of the most dangerous members of this class of invisible bugs.</p>



<p>Think about what a missing authorization check looks like to an AI code reviewer. Let&#8217;s say you&#8217;ve got a web endpoint with a well-formed HTTP handler, properly sanitized inputs, and a safe database query. The code is clean, and passes every structural check and static analysis tool you&#8217;ve thrown at it. Now you&#8217;re testing it and, much to your dismay, you discover that the endpoint lets any authenticated user delete any other user&#8217;s data because nobody ever wrote down the requirement that says &#8220;only administrators can perform deletions.&#8221; That&#8217;s <a href="https://cwe.mitre.org/data/definitions/862.html" target="_blank" rel="noreferrer noopener">CWE-862: Missing Authorization</a>, and it rose to #9 on the <a href="https://cwe.mitre.org/top25/" target="_blank" rel="noreferrer noopener">2024 CWE Top 25</a> most dangerous software weaknesses.</p>



<p>That&#8217;s not a coding error! It&#8217;s a missing requirement.</p>
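<p>Here&#8217;s what that missing requirement can look like in code. This is a hypothetical sketch (the handler, schema, and names are invented, not from the article): every structural property holds, yet it is CWE-862 in miniature:</p>

```python
# Hypothetical delete endpoint: clean code, safe query, missing authorization.
import sqlite3

def delete_record(db: sqlite3.Connection, session_user: str,
                  record_owner: str, record_id: int) -> None:
    """Delete a record on behalf of an authenticated user."""
    # Authentication happened upstream; session_user is verified.
    # The query is parameterized, so there is no SQL injection.
    # Every structural check and static analyzer passes.
    db.execute("DELETE FROM records WHERE owner = ? AND id = ?",
               (record_owner, record_id))
    db.commit()
    # Missing: the unwritten requirement "only administrators may delete
    # records they don't own." Nothing compares session_user to
    # record_owner or checks a role, and no scanner flags the absence
    # of a check nobody specified.
```
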



<p>That&#8217;s McGraw&#8217;s point: About half of all security defects aren&#8217;t implementation bugs at all. They&#8217;re design flaws, places where the system&#8217;s architecture doesn&#8217;t enforce the security properties it was supposed to enforce. A cross-site scripting vulnerability isn&#8217;t always a failure to sanitize input. Sometimes it&#8217;s a failure to define which inputs are trusted and which aren&#8217;t. A privilege escalation isn&#8217;t always a broken access check. Sometimes there was never an access check to begin with because nobody specified that one was needed. These are intent violations and they&#8217;re invisible to any tool that doesn&#8217;t know what the software is supposed to prevent.</p>



<p>AI code review tools today are very good at catching the implementation half of McGraw&#8217;s split. They can spot a SQL injection pattern, flag an unsafe deserialization, identify a buffer overflow. But they&#8217;re working on the same side of the 50/50 line that static analysis has always worked on. The design half—the missing authorization checks, the unspecified trust boundaries, the security properties that were never written down—requires the same thing that catching my bus tracker bug required: knowing what the software was supposed to do in the first place.</p>



<h2 class="wp-block-heading"><strong>How the Quality Playbook derives requirements (and how you can too!)</strong></h2>



<p>The problem most projects face is that they don&#8217;t have formal requirements. What they have is code, documentation, commit messages, chat history, README files, and maybe some design docs. The question is how to get from that mess to a specification that an AI can actually use for verification.</p>



<p>The key insight I had while building the playbook was that every previous approach I tried asked the model to do two things at once: figure out what contracts exist AND write requirements for them. That doesn&#8217;t work—the model runs out of attention trying to hold the entire behavioral surface in its head while also producing formatted requirements. So I split them apart into four steps: First, have the AI read each source file and write down every behavioral contract it observes as a simple list. Second, derive requirements from those contracts plus the documentation. Third, check whether every contract is covered by a requirement. Fourth, assert completeness—and if there are gaps, go back to step one for the files with gaps.</p>



<p>The contracts file works as external memory. When the model &#8220;forgets&#8221; about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap.</p>
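<p>The coverage check in the third step can be a few lines of script. The sketch below assumes invented file formats (one contract per line with an ID, and requirements that name the contracts they cover); the playbook&#8217;s actual formats may differ:</p>

```python
# Hypothetical coverage check: every contract recorded in step one must
# be claimed by at least one requirement from step two.

def find_uncovered(contracts: list[str], requirements: list[str]) -> list[str]:
    """Return the contracts that no requirement claims to cover."""
    covered = set()
    for req in requirements:
        # e.g. "REQ-2: durability on shutdown (covers: C2, C5)"
        if "(covers:" in req:
            ids = req.split("(covers:")[1].rstrip(")").split(",")
            covered.update(i.strip() for i in ids)
    # A contract whose ID never appears in any "covers:" list is a gap.
    return [c for c in contracts if c.split(":")[0].strip() not in covered]

contracts = ["C1: parser rejects empty input",
             "C2: writer flushes on close",
             "C3: duplicate keys raise an error"]
requirements = ["REQ-1: input validation (covers: C1)",
                "REQ-2: durability on shutdown (covers: C2)"]
# C3 surfaces as a visible gap instead of being silently forgotten.
```
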



<p>You don&#8217;t need the Quality Playbook to do this—you can apply the same technique with any AI coding tool that you&#8217;re already using. Here&#8217;s what I&#8217;d recommend:</p>



<ul class="wp-block-list">
<li><strong>Write down what your software is supposed to guarantee.</strong> Not just what it does—what it&#8217;s supposed to do, for whom, under what conditions. If you&#8217;re practicing spec-driven development, you&#8217;re already partway there. The next step is adding the <em>why</em>: Why does this behavior matter, who depends on it, what goes wrong if it fails? That&#8217;s the difference between a spec and a requirement, and it&#8217;s the difference between an AI that can build your code and an AI that can verify it.<br></li>



<li><strong>Feed the AI your intent, not just your code.</strong> The intent is already sitting in your chat history, your design discussions, your Slack threads, your support tickets. Every Claude export, every Gemini conversation, every Cowork transcript contains design intent that never made it into specifications: why a function was written a certain way, what failure prompted an architectural decision, what tradeoffs were discussed before choosing an approach. The design intent that used to require a human to extract and document is now sitting in your chat logs. Your AI can read the transcripts and extract the <em>why</em>.<br></li>



<li><strong>Look for the negative requirements.</strong> What should your software <em>not</em> do? What states should be impossible? What data should never be exposed? These negative requirements are often the most valuable because they define boundaries that structural review can&#8217;t see. The missing authorization bug was a negative requirement: Unauthenticated users must <em>not</em> be able to delete other users&#8217; data. The Gson bug was a negative requirement: Duplicate keys must <em>not</em> be silently accepted when the first value is null. If you can articulate what your software must never do, you&#8217;ve given the AI something powerful to check against.<br></li>
</ul>
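<p>Negative requirements also translate directly into executable guards. As a sketch, here is the duplicate-key rule expressed with Python&#8217;s <code>json</code> module (the bug referenced above was in Gson, a Java library; this is an analogous check, not that library&#8217;s fix, and it rejects all duplicates rather than only the null-first case):</p>

```python
import json

def strict_pairs(pairs):
    """Reject duplicate keys instead of silently keeping the last value."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            # Negative requirement: duplicate keys must NOT be accepted.
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

def load_strict(text: str):
    # object_pairs_hook receives every JSON object's (key, value) pairs
    # before they are collapsed into a dict, so duplicates are visible.
    return json.loads(text, object_pairs_hook=strict_pairs)

load_strict('{"a": 1}')                # parses normally
# load_strict('{"a": null, "a": 2}')   # would raise ValueError
```
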



<p>In the next article, I&#8217;ll talk about context management—the skill that actually determines whether your AI sessions produce good work or mediocre work. Everything I&#8217;ve described here depends on the AI having the right information at the right time, and it turns out that managing what the AI knows (and what it forgets) is an engineering discipline in its own right. I&#8217;ll cover how I went from running 15 million tokens in a single prompt to splitting the playbook into independent phases with zero context carryover, and why that transition worked on the first try.</p>



<p><em>The </em><a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener"><em>Quality Playbook</em></a><em> is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of </em><a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener"><em>awesome-copilot</em></a><em>.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Disclosure: Aspects of the methodology described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Don&#8217;t Automate Your Moat: Matching AI Autonomy to Risk and Competitive Stakes</title>
		<link>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/</link>
				<comments>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/#respond</comments>
				<pubDate>Wed, 29 Apr 2026 11:42:28 +0000</pubDate>
					<dc:creator><![CDATA[Marc Millstone and Claude]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18628</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-automate-your-moat.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-automate-your-moat-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Velocity is table stakes. Code is a commodity. Understanding is the edge.]]></custom:subtitle>
		
				<description><![CDATA[I was talking to a senior engineer at a well-funded company not long ago. I asked him to walk me through a critical algorithm at the heart of their product, something that ran hundreds of times a second and directly affected customer outcomes. He paused and said, &#8220;Honestly, I&#8217;m not totally sure how it works. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I was talking to a senior engineer at a well-funded company not long ago. I asked him to walk me through a critical algorithm at the heart of their product, something that ran hundreds of times a second and directly affected customer outcomes. He paused and said, &#8220;Honestly, I&#8217;m not totally sure how it works. AI wrote it.&#8221;</p>



<p>A few weeks later, a different engineer at another company was paged about a system outage. He pulled up the failing service and realized he had no idea it was connected to a database. A colleague had accepted the AI-generated PR that added that dependency three months earlier. The tests passed. The change was never written down. The original engineer moved on, and the knowledge was lost.</p>



<p>These aren&#8217;t new stories. Engineers have always inherited systems they didn&#8217;t fully build. What&#8217;s new is the disguise and the speed. AI is an amazing enabler. Organizations must adopt it to remain relevant. Yet the emerging pattern—describe what you want, let an agent iterate until it works, pay for it in tokens instead of engineering hours—is functionally a buy decision wearing a build costume. The code is in your repo. Your engineers merged the PR. It feels like you built it. But if nobody on your team understands why it works the way it does, you&#8217;ve purchased a dependency you can&#8217;t maintain from a vendor you can&#8217;t call.</p>



<p>AI doesn’t create that gap once. It widens it continuously at a pace that outstrips the organizational habits that once kept it manageable. Two problems compound at once. You can’t extend the thing that makes you hard to replace. And when it breaks, the incident lands on a team that doesn’t understand what they’re fixing, turning a recoverable outage into a customer-facing crisis. Engineering leaders have wrestled with build-versus-buy tradeoffs for decades, and the hard-won lesson has always been the same: You don&#8217;t outsource your competitive advantage. The token-funded generation loop doesn&#8217;t change that calculus. It makes it easier to skip the question entirely.</p>



<p>The question that matters isn&#8217;t &#8220;Can AI do this?&#8221; If it can&#8217;t today, it will be able to tomorrow. And the argument that follows does not depend on the quality of the AI-generated code. This article covers two questions most engineering organizations have never asked at the same time: How much damage can this work do if it fails, and how much does it differentiate us? Most teams optimize for velocity and never ask what they&#8217;re risking or giving away in the process. The gap between those unasked questions is where the most expensive mistakes are already being made.</p>



<h2 class="wp-block-heading"><strong>Part 1: Two dimensions. Neither is velocity.</strong></h2>



<p>Moving faster matters. But velocity alone misses the two dimensions that determine whether AI autonomy helps or hurts your business.</p>



<p><strong>Business risk</strong>: What&#8217;s the blast radius if this fails? A bug in an internal CLI tool costs you an afternoon. A bug in your authentication logic costs you customers and possibly market cap. A bug in your core pricing algorithm costs you the business. These are not the same.</p>



<p><strong>Competitive differentiation</strong>: Does this code <em>define your business?</em> Your moat is your architecture, your performance characteristics, your core algorithms, and the product decisions baked into your infrastructure. But it&#8217;s also the institutional knowledge that shaped them: the reasoning behind the trade-offs, the context that no model was trained on. If your competitors can generate the same code with the same model you&#8217;re using, it stops being an advantage.</p>



<p>Most organizations ask the first question on a good day. Almost none ask the second. That gap is how you end up shipping fast into a moat nobody can explain and nobody can extend.</p>



<p>Understanding why both dimensions matter starts with velocity and what happens when the feedback loop around it breaks.</p>



<h3 class="wp-block-heading"><strong>Velocity feels real. Debt is often invisible.</strong></h3>



<p>AI coding tools are genuinely impressive. GitHub&#8217;s research showed 55% faster task completion with Copilot in controlled conditions.<sup data-fn="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9" class="fn"><a href="#cd51007b-bb8f-45b7-b7a2-7317b6d0cff9" id="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9-link">1</a></sup> That number has driven an assumption that faster is always better.</p>



<p>A 2025 METR randomized controlled trial<sup data-fn="7306407f-0600-4183-85fa-04f12932c6e6" class="fn"><a href="#7306407f-0600-4183-85fa-04f12932c6e6" id="7306407f-0600-4183-85fa-04f12932c6e6-link">2</a></sup> found something that should give every engineering leader pause. Sixteen experienced developers on real production codebases forecasted they’d complete tasks 24% faster with AI. After finishing, they estimated they’d gone 20% faster. They’d actually gone 19% slower.</p>



<p>The velocity finding is striking. But the perception gap matters more. The feedback loop between &#8220;how am I doing?&#8221; and &#8220;how am I actually doing?&#8221; was broken throughout and never corrected itself. This doesn&#8217;t resolve the velocity debate. It reframes it. The danger isn&#8217;t that individuals move too fast. It&#8217;s that organizations mistake output volume for productivity and strip out the review processes that used to catch what that gap costs.</p>



<p>A Tilburg University study of open source projects after GitHub Copilot&#8217;s introduction found the same pattern at the organizational level.<sup data-fn="b0ff63ba-e120-48fb-902d-c89fc1d80fe5" class="fn"><a href="#b0ff63ba-e120-48fb-902d-c89fc1d80fe5" id="b0ff63ba-e120-48fb-902d-c89fc1d80fe5-link">3</a></sup> Productivity did increase, but primarily among less-experienced developers. Code written after AI adoption required more rework to meet repository standards. The added rework burden fell on the most experienced (core) developers who reviewed 6.5% more code after Copilot&#8217;s introduction and saw a 19% drop in their own original code output. The velocity looks real at the surface. Underneath, the maintenance cost shifts upward to the people who can least afford to lose productive time.</p>



<p>That broken feedback loop has a name. Researchers call it <strong>cognitive debt</strong><sup data-fn="92b75161-55d2-410a-9da7-0faaf234bae3" class="fn"><a href="#92b75161-55d2-410a-9da7-0faaf234bae3" id="92b75161-55d2-410a-9da7-0faaf234bae3-link">4</a></sup>: the growing gap between how much code exists in your system and how much of it anyone actually understands. Technical debt shows up in your linter and your backlog. Cognitive debt is invisible. There&#8217;s no signal telling engineers where their understanding ends. That&#8217;s precisely what the METR perception gap showed. It never corrected itself.</p>



<p>Research by Anthropic Fellows found that engineers using AI assistance when learning new tools scored 17% lower on comprehension tests than those who coded by hand, with the steepest drops in debugging ability.<sup data-fn="edc18347-043c-4398-9652-16b4d7a8c464" class="fn"><a href="#edc18347-043c-4398-9652-16b4d7a8c464" id="edc18347-043c-4398-9652-16b4d7a8c464-link">5</a></sup> MIT&#8217;s Media Lab found the same pattern in writing tasks: Brain connectivity was weakest in the group using LLM assistance, strongest in the group working without tools.⁴ Active production builds understanding. Passive consumption doesn&#8217;t.</p>



<p>You understand what you build better than what you review. When you write code, you produce output <em>and</em> build a mental model. That&#8217;s what Peter Naur called the &#8220;theory of the program.&#8221; It lives in your head, not in the repo.<sup data-fn="36e33054-aa72-4221-9c5d-bd189d716cac" class="fn"><a href="#36e33054-aa72-4221-9c5d-bd189d716cac" id="36e33054-aa72-4221-9c5d-bd189d716cac-link">6</a></sup> The MIT study captured this directly: 83% of participants who wrote essays with LLM assistance could not quote a single sentence from essays they had just written.⁴</p>



<p>Cognitive debt is invisible until it isn&#8217;t. When it surfaces, it hits both dimensions hard, in different ways.</p>



<h3 class="wp-block-heading"><strong>Business risk: The blast radius of not knowing</strong></h3>



<p>On the business risk dimension, cognitive debt is a safety problem.</p>



<p>When nobody fully understands the system, the blast radius of a failure expands silently. The incident that eventually comes (and it always comes) lands on a team that can&#8217;t diagnose what they didn&#8217;t build. The engineer pulling up the failing service at 2 AM has no mental model of why it was built the way it was, what it connects to, or what the edge cases look like under load. So they ask the LLM. It can explain what the code does and often propose a reasonable fix. It can&#8217;t tell you why it was designed that way. And a fix that looks right to the model can quietly violate constraints that nobody thought to document.</p>



<p>Cognitive debt compounds a second, independent risk: the pace at which AI-generated code reaches production. OX Security&#8217;s analysis<sup data-fn="56b259aa-e401-4c2c-9ecd-99ade708cc29" class="fn"><a href="#56b259aa-e401-4c2c-9ecd-99ade708cc29" id="56b259aa-e401-4c2c-9ecd-99ade708cc29-link">7</a></sup> of over 300 software repositories found that AI-generated code isn&#8217;t necessarily more vulnerable per line than human-written code. The problem is velocity.</p>



<p>Code review, debugging, and team oversight are the bottlenecks that catch vulnerable code before it ships. AI makes it easy to remove them. CodeRabbit&#8217;s analysis of real-world pull requests found AI-authored changes contain up to 1.7x more critical and major defects than human-written code, with logic and correctness issues up 75%.<sup data-fn="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c" class="fn"><a href="#eaf5792c-72c2-4cb6-94c5-d8a85a54af7c" id="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c-link">8</a></sup> Apiiro&#8217;s analysis found that while AI reliably reduces surface-level syntax errors, architectural design flaws and privilege escalation paths (the categories automated scanners miss and human reviewers struggle to catch) spiked in AI-assisted codebases.<sup data-fn="0f193fe2-b405-4b1d-85fa-b745e853969c" class="fn"><a href="#0f193fe2-b405-4b1d-85fa-b745e853969c" id="0f193fe2-b405-4b1d-85fa-b745e853969c-link">9</a></sup></p>



<p>AI accelerates output and accelerates unreviewed risk in equal measure. <strong>Cognitive debt</strong> means that when something breaks, the team is learning the system as they&#8217;re trying to fix it. Remove their understanding and you haven&#8217;t streamlined the process. You&#8217;ve only removed the thing standing between a bad day and a catastrophic one.</p>



<h3 class="wp-block-heading"><strong>Competitive differentiation: What you give away without knowing it</strong></h3>



<p>The competitive differentiation risk isn&#8217;t that AI will generate your exact competitive algorithm and hand it to your competitor. It&#8217;s subtler. Your advantage was never the code itself; it was the judgment that shaped it. When AI writes that code, the judgment never forms. The code arrives, but the understanding that would let your team extend it, improve it, or defend it under pressure doesn&#8217;t. Your moat is most likely to survive in the places AI finds hardest to reach.</p>



<p>That judgment—formed by the performance trade-offs that took years to tune, the failure modes that only someone who&#8217;s been paged understands, the architectural decisions that encode domain knowledge nobody wrote down—doesn&#8217;t live in the codebase. It lives in your engineers&#8217; heads.</p>



<p>And here&#8217;s the part most teams miss: Your competitor with the same AI tools doesn&#8217;t just get similar code; it gets a team that also doesn&#8217;t understand why the code works the way it does. Neither of you can extend it, and the race to the next architectural move becomes a coin flip rather than a compounding advantage. The build-versus-buy discipline exists precisely because decades of experience taught engineering organizations that outsourcing your core means losing the ability to extend it. The token-funded generation loop doesn&#8217;t change that calculus. It makes it easier to mistake the outsourcing for ownership because the code has your name on it.</p>



<p>The structural problem runs even deeper. Models trained on public code produce outputs weighted toward well-represented patterns, the common solutions to common problems. Research confirms this. LLM performance drops sharply on less-common programming languages where training data is sparse, and on genuinely novel implementations. Even the best current models correctly implement fewer than 40% of coding tasks drawn from recent research papers.<sup data-fn="076102b1-56ac-4820-99f2-71d4ef08dc32" class="fn"><a href="#076102b1-56ac-4820-99f2-71d4ef08dc32" id="076102b1-56ac-4820-99f2-71d4ef08dc32-link">10</a></sup> And the convergence problem extends beyond code. A pre-registered experiment tracking 61 participants over seven days found that while ChatGPT consistently boosted creative output during use, performance reverted to baseline the moment the tool was unavailable.<sup data-fn="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be" class="fn"><a href="#ce2fbf20-b309-4e68-afe8-6fe10b3ac7be" id="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be-link">11</a></sup> More critically, the work produced with AI assistance became increasingly homogenized over time. That homogenization persisted even after the tool was removed. The participants hadn&#8217;t borrowed the tool&#8217;s output. They&#8217;d internalized its patterns. For engineering organizations, this is the differentiation risk made concrete: Teams that rely on AI for their most critical design decisions risk generating commodity code today and training themselves to think in commodity patterns tomorrow.</p>



<p>Engineers who deeply own their most critical systems are better at diagnosing incidents and see the next architectural move that competitors can&#8217;t follow. Delegate that comprehension away and you can keep the lights on. You can&#8217;t see around corners.</p>



<h3 class="wp-block-heading"><strong>When it goes wrong, it really goes wrong</strong></h3>



<p>Both dimensions rest on the same vulnerability: cognitive debt accumulating on work that matters. The failure cases make it concrete.</p>



<p>The production failures are accumulating. A Replit AI agent deleted months of production data in seconds after violating explicit code-freeze instructions, then initially misled the user about whether recovery was possible.<sup data-fn="61637363-b0a5-44f9-939c-e2f4b2a8fb1f" class="fn"><a href="#61637363-b0a5-44f9-939c-e2f4b2a8fb1f" id="61637363-b0a5-44f9-939c-e2f4b2a8fb1f-link">12</a></sup> Reports emerged in early 2026 of a major cloud provider convening mandatory engineering reviews after a pattern of high-blast-radius incidents, with AI-assisted code changes cited as a contributing factor. In each case, the humans in the loop either didn&#8217;t understand what they were approving, or weren&#8217;t in the loop at all.</p>



<p>The deeper pattern predates AI tools entirely. Knight Capital Group took seventeen years to become the largest trader in U.S. equities. It took forty-five minutes to lose $460 million.<sup data-fn="30954c70-1f77-4ef5-9a37-c392499f0821" class="fn"><a href="#30954c70-1f77-4ef5-9a37-c392499f0821" id="30954c70-1f77-4ef5-9a37-c392499f0821-link">13</a></sup> The culprit was a nine-year-old piece of deprecated code called Power Peg, left on production servers and never retested after engineers modified an adjacent function in 2005. When engineers reused its feature flag for new functionality in 2012, nobody understood what they were reactivating. When the fault surfaced, the team’s attempt to fix it made things worse. They uninstalled the new code from the seven servers where it had deployed correctly, which caused Power Peg to activate on those servers too and compounded the losses. The SEC’s enforcement order is unambiguous: no deployment procedures, no code review requirements, no incident response protocols. It was a failure of institutional comprehension: the mental model had quietly evaporated while the code kept running.</p>



<p>No AI tool wrote that code. The failure was entirely human, through entirely normal processes: engineers leaving, tests never rerun after refactors, flags reused without documentation. This is the baseline, what software organizations produce under ordinary conditions over nine years. An engineering team with modern AI tools won&#8217;t recreate this specific bug. They&#8217;ll create the conditions for the next one faster: more code that nobody fully understands, more dependencies nobody documented, more cognitive debt accumulating before anyone notices. AI removes the friction that once slowed exactly this kind of erosion.</p>



<p>None are failures of AI capability. They&#8217;re failures of judgment about where to deploy AI and how much human oversight to maintain.</p>



<h2 class="wp-block-heading"><strong>Part 2: A four-quadrant model for AI autonomy</strong></h2>



<h3 class="wp-block-heading"><strong>The quadrants</strong></h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1320" height="731" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming.png" alt="The quadrants of human involvement in programming" class="wp-image-18633" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming.png 1320w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming-300x166.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming-768x425.png 768w" sizes="auto, (max-width: 1320px) 100vw, 1320px" /></figure>



<p>Four quadrants emerge when both questions are asked together. Before the examples, two contrasts are worth naming because the quadrants that look most similar on the surface are the ones most often confused in practice.</p>



<p><strong>Supervised automation versus Human-led craftsmanship.</strong> Both demand high human involvement. Both feel like &#8220;be careful here.&#8221; But the difference is fundamental. In Supervised automation, the human is a safety gate. The work is a commodity; you&#8217;re there to catch errors before they escape. In Human-led craftsmanship, the human is the author. You&#8217;re building the mental model that lets the next engineer reason about this system under pressure three years from now and take it somewhere new. The code isn&#8217;t something you need to verify. It&#8217;s something you need to own. And ownership here extends beyond the individual engineer. The team writes RFCs, debates trade-offs, identifies which parts of the implementation fall into which quadrant, and makes sure the reasoning behind key decisions is shared, not siloed. Human-led craftsmanship isn&#8217;t one person writing code alone. It&#8217;s a team making sure the understanding survives the people who built it.</p>



<p><strong>Collaborative co-creation versus Human-led craftsmanship.</strong> Both involve high differentiation, and in both, the human drives the vision and owns the key decisions. But risk changes everything about how you work. In Collaborative co-creation, early iterations are recoverable. A wrong turn can be corrected before it costs you anything serious, so AI can genuinely accelerate execution. In Human-led craftsmanship, the blast radius of not understanding what you&#8217;ve built compounds over time. Wrong turns become load-bearing walls, and the architectural moves you can&#8217;t see are the ones that let competitors catch up. AI assists with scoped subtasks only. Every contribution gets interrogated.</p>



<p>In <strong>full automation</strong>, the human is a director. You define what needs to be done, AI produces the output, and you spot-check the result. The work is low-risk and low-differentiation. If something&#8217;s wrong, you fix it in the next iteration without anyone outside the team noticing. This is where AI earns its keep without qualification, and where restricting it costs you real velocity with nothing to show for it.</p>



<p>To make all four quadrants concrete, we&#8217;ll use a single feature as a lens: building AI Gateway cost controls, the system that sets token budgets per agent, enforces spending limits, tracks usage by model and agent, and handles enforcement modes when an agent exceeds its budget.</p>



<h4 class="wp-block-heading"><strong>Low risk, low differentiation: Full automation</strong></h4>



<p>API docs for cost controls. Test scaffolding for token limit scenarios. Config examples for per-agent budgets. Every platform has docs, and if there&#8217;s a mistake, you fix it in the next iteration without anyone outside the team noticing. Humans set direction and spot-check. AI writes, tests, and ships.</p>



<p><em>The test: If this is wrong, can you fix it before a customer sees it or complains? If yes, automate freely.</em></p>



<h4 class="wp-block-heading"><strong>Low risk, high differentiation: Collaborative co-creation</strong></h4>



<p>Designing the UX for the token usage dashboard. Iterating on routing rules that determine when an agent degrades to a cheaper model, halts entirely, or triggers a notification. These decisions separate a sophisticated platform from a blunt on/off switch, but early iterations are recoverable. A first version that doesn&#8217;t surface guardrail costs separately isn&#8217;t a disaster. It&#8217;s a product conversation. Humans drive the design vision and interrogate AI on trade-offs. AI accelerates execution and handles boilerplate.</p>



<p><em>The test: If you flipped the ratio (AI deciding, human rubber-stamping) would you be comfortable? If not, this requires genuine co-creation, not delegation. The human should be able to explain the trade-offs in the current design and know where to push it next.</em></p>



<h4 class="wp-block-heading"><strong>High risk, low differentiation: Supervised automation</strong></h4>



<p>Enforcement logic that halts an agent when it hits its token budget. Every cost control system needs enforcement, so this isn&#8217;t differentiating. But if it fails, agents run unconstrained and rack up unbounded LLM spend. AI can draft the logic. A human must trace every path and understand every state transition before signing off. The question before merge: Can I explain exactly what happens when an agent hits the limit mid-execution? Can I explain this behavior to Customer Success or to the customer?</p>



<p><em>The test: Could a competent engineer review this confidently without having written it? If yes, the human&#8217;s job is to verify, not to author. But the bar for verification is explanation, not approval.</em></p>



<h4 class="wp-block-heading"><strong>High risk, high differentiation: Human-led craftsmanship</strong></h4>



<p>The core token metering and attribution engine. It tracks usage per agent and per model, attributes guardrail costs separately so they don&#8217;t count against agent budgets, and provides the auditability enterprise customers need to govern AI spend. Get it wrong and customers can&#8217;t trust the numbers. Get it right and it&#8217;s a genuine competitive moat that competitors can&#8217;t replicate with the same AI tools you&#8217;re using.</p>
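<p>A toy version of the attribution idea, assuming a simple in-memory meter (the class and method names are hypothetical; a real engine would also need durability, idempotency, and audit trails):</p>

```python
from collections import defaultdict

class TokenMeter:
    """Track usage per (agent, model). Guardrail costs go to a separate
    ledger so they never count against agent budgets (illustrative sketch)."""
    def __init__(self):
        self.agent_usage = defaultdict(int)      # (agent, model) -> tokens
        self.guardrail_usage = defaultdict(int)  # model -> tokens

    def record(self, agent: str, model: str, tokens: int, guardrail: bool = False):
        if guardrail:
            self.guardrail_usage[model] += tokens
        else:
            self.agent_usage[(agent, model)] += tokens

    def agent_total(self, agent: str) -> int:
        """Budget-relevant spend for one agent, excluding guardrail overhead."""
        return sum(t for (a, _), t in self.agent_usage.items() if a == agent)
```

<p>Even in this sketch, the design decision the article describes is visible: guardrail tokens are recorded, so they remain auditable, but they are kept out of the number that enforcement reads.</p>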



<p>Human engineers own the design end-to-end. AI assists on scoped subtasks once the design is settled: drafting specific functions, generating test coverage for paths the engineer has already reasoned through. Every contribution gets interrogated. The bar is whether the engineer could explain it in an incident review without looking at the code first.</p>



<p><em>The test: If the engineer who built this left tomorrow, would the team still understand why it works the way it does? Could they make it better? If the honest answer is no, you&#8217;re accumulating the most dangerous kind of cognitive debt there is.</em></p>



<h3 class="wp-block-heading"><strong>The counterargument (it&#8217;s a good one)</strong></h3>



<p>Any engineering leader will push back here, and they&#8217;ll have good reason to.</p>



<p>The research is thin. METR&#8217;s study had 16 developers. MIT&#8217;s EEG work is a preprint that its own critics say should be interpreted conservatively.<sup data-fn="6b15093d-b639-4d29-9556-4fec26b40831" class="fn"><a href="#6b15093d-b639-4d29-9556-4fec26b40831" id="6b15093d-b639-4d29-9556-4fec26b40831-link">14</a></sup> The Anthropic comprehension study shows a quiz score gap, not a business outcome. The evidence is early-stage. Intellectual honesty requires acknowledging that.</p>



<p>But the pattern keeps showing up in unrelated fields. A Lancet study found that endoscopists who routinely used AI for polyp detection performed measurably worse when the AI was removed, with adenoma detection rates dropping from 28.4% to 22.4% in three months.<sup data-fn="e6316082-3663-4380-917a-d1c26bd39a20" class="fn"><a href="#e6316082-3663-4380-917a-d1c26bd39a20" id="e6316082-3663-4380-917a-d1c26bd39a20-link">15</a></sup> The study is observational and small. But the direction is consistent with everything else: Routine AI assistance may erode the skills it was supposed to support.</p>



<p>Most engineering work isn&#8217;t high-stakes. Studies consistently estimate that 60–80% of engineering time goes to maintenance, tests, docs, integration, and tooling, which is exactly the work that belongs in the automate quadrant regardless. Restricting AI because of the top 20% creates a real tax on the other 80%.</p>



<p>And can&#8217;t engineers develop deep ownership of AI-generated code through study and iteration? Partially. But the behavioral data tells a harder story. GitClear&#8217;s analysis of 211 million changed lines shows a decline in refactored code since AI adoption accelerated.<sup data-fn="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8" class="fn"><a href="#d3b555ec-16c0-4fb3-9d95-65aa20d28cb8" id="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8-link">16</a></sup> Engineers aren&#8217;t studying AI-generated code carefully. They&#8217;re moving on to the next feature. LLM tools can explain what code does; they can&#8217;t tell you why the system was designed the way it was.<sup data-fn="40eaef3c-8e44-451f-9046-c6e59e060e0c" class="fn"><a href="#40eaef3c-8e44-451f-9046-c6e59e060e0c" id="40eaef3c-8e44-451f-9046-c6e59e060e0c-link">17</a></sup></p>



<p>The serious pro-AI argument isn&#8217;t &#8220;use AI everywhere.&#8221; It&#8217;s more precise: The guardrails for verification and oversight are improving fast, engineers who actively interrogate AI output build understanding even from generated code, and the organizations that restrict AI on their most critical work will fall behind competitors who don&#8217;t. This is a real argument.</p>



<p>The answer isn&#8217;t to dismiss it but to sharpen what &#8220;critical work&#8221; means, and to recognize that the interrogative use of AI that the research identifies as understanding-preserving requires organizational discipline that most teams haven&#8217;t built yet. The quadrant isn&#8217;t permanent. The threshold shifts as both AI capability and human oversight practices mature. The discipline is the habit of asking both questions honestly before you start, not a fixed answer to them.</p>



<h3 class="wp-block-heading"><strong>The discipline is simple. Maintaining it isn&#8217;t.</strong></h3>



<p>The quadrant tells you where to be careful. How you engage AI once you&#8217;re there determines whether careful is enough. The difference between &#8220;write me this function&#8221; and &#8220;explain why you made this trade-off, and what breaks if the input is malformed&#8221; is the difference between borrowing intelligence and developing it. Active, interrogative AI use preserves comprehension. Passive delegation destroys it. That&#8217;s what the Anthropic study&#8217;s behavioral data shows directly.</p>



<p>Match your review process to the quadrant. AI-generated docs and test scaffolding get a spot-check. AI-generated code touching your core product logic gets the same scrutiny as a junior engineer&#8217;s first PR. The bar for approval isn&#8217;t &#8220;tests pass.&#8221; It&#8217;s &#8220;someone on this team can explain what this does, defend it under pressure, and use that understanding to make it better.&#8221; Full automation needs a spot-check. Human-led craftsmanship needs an RFC, a team review, and shared ownership of the reasoning before anyone writes a line of code.</p>
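<p>One way to make that matching explicit is a small policy table the team reviews together. The quadrant names and review bars below are illustrative, not a standard:</p>

```python
# Hypothetical team policy: minimum review bar per quadrant.
REVIEW_POLICY = {
    "low_risk_low_diff":   {"review": "spot_check",    "rfc": False},
    "low_risk_high_diff":  {"review": "design_review", "rfc": False},
    "high_risk_low_diff":  {"review": "full_trace",    "rfc": False},
    "high_risk_high_diff": {"review": "full_trace",    "rfc": True},
}

def quadrant(high_risk: bool, high_differentiation: bool) -> str:
    """Map the two independent variables to a quadrant key."""
    risk = "high" if high_risk else "low"
    diff = "high" if high_differentiation else "low"
    return f"{risk}_risk_{diff}_diff"
```

<p>Writing the policy down in a reviewable artifact is itself part of the discipline: the table forces the team to answer both questions before the work starts, not after the incident.</p>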



<p>This matters especially in real-time data and AI infrastructure, systems where the most dangerous failure modes are emergent, appearing at scale and under load in combinations the code itself doesn&#8217;t express. Recognize that the threshold will shift. As AI capability improves, what belongs in the automate quadrant expands. The discipline isn&#8217;t a fixed answer. It&#8217;s the habit of asking both questions honestly before you start. It&#8217;s a core reason Redpanda is designed for simplicity and predictability: engineers need to be able to reason about how infrastructure behaves under pressure, not discover it during an incident.<sup data-fn="01a4451c-aa8e-47a5-b2f0-648b574f0c35" class="fn"><a href="#01a4451c-aa8e-47a5-b2f0-648b574f0c35" id="01a4451c-aa8e-47a5-b2f0-648b574f0c35-link">18</a></sup></p>



<h3 class="wp-block-heading"><strong>The real competitive question</strong></h3>



<p>The companies that get this right won&#8217;t be the ones that use the most AI or the least. They&#8217;ll be the ones whose leaders have internalized that risk and differentiation are independent variables, and that cognitive debt threatens both.</p>



<p>The engineer who doesn&#8217;t know how their algorithm works is a symptom. The organization that allowed it is the cause.</p>



<p>Treat cognitive debt as only a risk problem and you end up with engineers who can&#8217;t diagnose failures they didn&#8217;t build. Treat it as only a differentiation problem and you get fragile systems that survive until the next incident. Let it accumulate on your most critical systems and you get both at once.</p>



<p>Your competitor is making this calculation right now. The question isn&#8217;t whether to use AI. It&#8217;s whether you&#8217;re being honest about which quadrant you&#8217;re in, and whether your team will know the answer when it finally matters.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Co-authored with Claude (Anthropic). Yes, we took the advice from this article.</em></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9">Peng, S. et al. (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. <a href="https://arxiv.org/abs/2302.06590" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2302.06590</a> <a href="#cd51007b-bb8f-45b7-b7a2-7317b6d0cff9-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="7306407f-0600-4183-85fa-04f12932c6e6">Becker, J., Rush, N. et al. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR. <a href="https://arxiv.org/abs/2507.09089" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2507.09089</a> <a href="#7306407f-0600-4183-85fa-04f12932c6e6-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="b0ff63ba-e120-48fb-902d-c89fc1d80fe5">Xu, F., Medappa, P.K., Tunc, M.M., Vroegindeweij, M., &amp; Fransoo, J.C. (2025). AI-Assisted Programming May Decrease the Productivity of Experienced Developers by Increasing Maintenance Burden. Tilburg University. <a href="https://arxiv.org/abs/2510.10165" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2510.10165</a> <a href="#b0ff63ba-e120-48fb-902d-c89fc1d80fe5-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="92b75161-55d2-410a-9da7-0faaf234bae3">Kosmyna, N. et al. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. MIT Media Lab. 
<a href="https://arxiv.org/abs/2506.08872" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.08872</a> <em>(preprint, not yet peer-reviewed)</em> <a href="#92b75161-55d2-410a-9da7-0faaf234bae3-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="edc18347-043c-4398-9652-16b4d7a8c464">Shen, J.H. &amp; Tamkin, A. (2026). How AI Impacts Skill Formation. Anthropic Safety Fellows Program. <a href="https://arxiv.org/abs/2601.20245" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.20245</a> <a href="#edc18347-043c-4398-9652-16b4d7a8c464-link" aria-label="Jump to footnote reference 5"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="36e33054-aa72-4221-9c5d-bd189d716cac">The generation effect: Rosner, Z.A. et al. (2012). The Generation Effect: Activating Broad Neural Circuits During Memory Encoding. Cortex. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3556209/" target="_blank" rel="noreferrer noopener">https://pmc.ncbi.nlm.nih.gov/articles/PMC3556209/</a> and Bertsch, S. et al. (2007). The generation effect: A meta-analytic review. Memory &amp; Cognition. <a href="https://link.springer.com/article/10.3758/BF03193441" target="_blank" rel="noreferrer noopener">https://link.springer.com/article/10.3758/BF03193441</a> and Naur, P. (1985). Programming as Theory Building. Microprocessing and Microprogramming. 
<a href="https://pages.cs.wisc.edu/~remzi/Naur.pdf" target="_blank" rel="noreferrer noopener">https://pages.cs.wisc.edu/~remzi/Naur.pdf</a> <a href="#36e33054-aa72-4221-9c5d-bd189d716cac-link" aria-label="Jump to footnote reference 6"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="56b259aa-e401-4c2c-9ecd-99ade708cc29">OX Security. (October 2025). Army of Juniors: The AI Code Security Crisis. <a href="https://www.helpnetsecurity.com/2025/10/27/ai-code-security-risks-report/" target="_blank" rel="noreferrer noopener">https://www.helpnetsecurity.com/2025/10/27/ai-code-security-risks-report/</a> <a href="#56b259aa-e401-4c2c-9ecd-99ade708cc29-link" aria-label="Jump to footnote reference 7"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c">CodeRabbit. (December 2025). State of AI vs Human Code Generation Report. <a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" target="_blank" rel="noreferrer noopener">https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report</a>. <em>Note: CodeRabbit produces AI code review tooling; findings should be read in that context.</em> <a href="#eaf5792c-72c2-4cb6-94c5-d8a85a54af7c-link" aria-label="Jump to footnote reference 8"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="0f193fe2-b405-4b1d-85fa-b745e853969c">Apiiro. (September 2025). 4x Velocity, 10x Vulnerabilities: AI Coding Assistants Are Shipping More Risks. 
<a href="https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/" target="_blank" rel="noreferrer noopener">https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/</a>. <em>Note: Apiiro produces application security tooling; findings should be read in that context.</em> <a href="#0f193fe2-b405-4b1d-85fa-b745e853969c-link" aria-label="Jump to footnote reference 9"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="076102b1-56ac-4820-99f2-71d4ef08dc32">Joel, S., Wu, J.J., &amp; Fard, F.H. (2024). A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages. ACM TOSEM. <a href="https://arxiv.org/abs/2410.03981" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2410.03981</a>. See also: Hua, et al. (2025). ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code. <a href="https://arxiv.org/abs/2506.02314" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.02314</a> <a href="#076102b1-56ac-4820-99f2-71d4ef08dc32-link" aria-label="Jump to footnote reference 10"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be">Liu, Q., Zhou, Y., Huang, J., &amp; Li, G. (2024). When ChatGPT is Gone: Creativity Reverts and Homogeneity Persists. <a href="https://arxiv.org/abs/2401.06816" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2401.06816</a> <a href="#ce2fbf20-b309-4e68-afe8-6fe10b3ac7be-link" aria-label="Jump to footnote reference 11"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="61637363-b0a5-44f9-939c-e2f4b2a8fb1f">Fortune. (July 2025). 
AI-Powered Coding Tool Wiped Out a Software Company&#8217;s Database in &#8216;Catastrophic Failure.&#8217; <a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/" target="_blank" rel="noreferrer noopener">https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/</a> <a href="#61637363-b0a5-44f9-939c-e2f4b2a8fb1f-link" aria-label="Jump to footnote reference 12"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="30954c70-1f77-4ef5-9a37-c392499f0821">Knight Capital Group. SEC Administrative Proceeding, Release No. 70694 (October 16, 2013). <a href="https://www.sec.gov/litigation/admin/2013/34-70694.pdf" target="_blank" rel="noreferrer noopener">https://www.sec.gov/litigation/admin/2013/34-70694.pdf</a>. Levine, M. (2013). Knight Capital&#8217;s $440 Million Compliance Disaster. Bloomberg. <a href="https://www.bloomberg.com/opinion/articles/2013-10-17/knight-capital-s-440-million-compliance-disaster" target="_blank" rel="noreferrer noopener">https://www.bloomberg.com/opinion/articles/2013-10-17/knight-capital-s-440-million-compliance-disaster</a> <a href="#30954c70-1f77-4ef5-9a37-c392499f0821-link" aria-label="Jump to footnote reference 13"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="6b15093d-b639-4d29-9556-4fec26b40831">Stankovic, M. et al. (2025). Comment on: Your Brain on ChatGPT. 
<a href="https://arxiv.org/abs/2601.00856" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.00856</a> <a href="#6b15093d-b639-4d29-9556-4fec26b40831-link" aria-label="Jump to footnote reference 14"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="e6316082-3663-4380-917a-d1c26bd39a20">Budzyń, K., Romańczyk, M. et al. (2025). Endoscopist Deskilling Risk After Exposure to Artificial Intelligence in Colonoscopy: A Multicentre, Observational Study. Lancet Gastroenterol Hepatol. 10(10):896-903. <a href="https://doi.org/10.1016/S2468-1253(25)00133-5" target="_blank" rel="noreferrer noopener">https://doi.org/10.1016/S2468-1253(25)00133-5</a> <a href="#e6316082-3663-4380-917a-d1c26bd39a20-link" aria-label="Jump to footnote reference 15"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8">Harding, W. (2025). AI Copilot Code Quality: Evaluating 2024&#8217;s Increased Defect Rate via Code Quality Metrics. GitClear. <a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" target="_blank" rel="noreferrer noopener">https://www.gitclear.com/ai_assistant_code_quality_2025_research</a> <a href="#d3b555ec-16c0-4fb3-9d95-65aa20d28cb8-link" aria-label="Jump to footnote reference 16"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="40eaef3c-8e44-451f-9046-c6e59e060e0c">Zhou, X., Li, R., Liang, P., Zhang, B., Shahin, M., Li, Z., &amp; Yang, C. (2025). Using LLMs in Generating Design Rationale for Software Architecture Decisions. ACM TOSEM. https://arxiv.org/abs/2504.20781. See also: Tang, N., Chen, M., Ning, Z., Bansal, A., Huang, Y., McMillan, C., &amp; Li, T.J.-J. (2024). 
A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions. IEEE VL/HCC 2024. <a href="https://arxiv.org/abs/2405.16081" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2405.16081</a> <a href="#40eaef3c-8e44-451f-9046-c6e59e060e0c-link" aria-label="Jump to footnote reference 17"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="01a4451c-aa8e-47a5-b2f0-648b574f0c35">Gallego, A. (2025). Introducing the Agentic Data Plane. Redpanda. <a href="https://www.redpanda.com/blog/agentic-data-plane-adp" target="_blank" rel="noreferrer noopener">https://www.redpanda.com/blog/agentic-data-plane-adp</a>. Crosier, K. (2026). How to Safely Deploy Agentic AI in the Enterprise. Redpanda. https://www.redpanda.com/blog/deploy-agentic-ai-safely-enterprise <a href="#01a4451c-aa8e-47a5-b2f0-648b574f0c35-link" aria-label="Jump to footnote reference 18"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>


]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>When Correct Systems Produce the Wrong Outcomes</title>
		<link>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/</link>
				<comments>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/#respond</comments>
				<pubDate>Tue, 28 Apr 2026 11:12:58 +0000</pubDate>
					<dc:creator><![CDATA[Varun Raj]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18613</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/When-correct-systems-produce-the-wrong-outcomes.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/When-correct-systems-produce-the-wrong-outcomes-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why autonomous AI systems drift and what it reveals about the limits of observability]]></custom:subtitle>
		
				<description><![CDATA[We tend to assume that if every part of a system behaves correctly, the system itself will behave correctly. That assumption is deeply embedded in how we design, test, and operate software. If a service returns valid responses, if dependencies are reachable, and if constraints are satisfied, then the system is considered healthy. Even in [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>We tend to assume that if every part of a system behaves correctly, the system itself will behave correctly. That assumption is deeply embedded in how we design, test, and operate software. If a service returns valid responses, if dependencies are reachable, and if constraints are satisfied, then the system is considered healthy. Even in distributed systems, where failure modes are more complex, correctness is still tied to the behavior of individual components. In modern AI systems, particularly those combining retrieval, reasoning, and tool invocation, this assumption is increasingly stressed under continuous operation.</p>



<p>This model works because most systems are built around discrete operations. A request arrives, the system processes it, and a result is returned. Each interaction is bounded, and correctness can be evaluated locally. But that assumption begins to break down in systems that operate continuously. In these systems, behavior is not the result of a single request. It emerges from a sequence of decisions that unfold over time. Each decision may be reasonable in isolation. The system may satisfy every local condition we know how to measure. And yet, when viewed as a whole, the outcome can be wrong.</p>



<p>One way to think about this is as a form of behavioral drift: systems that remain operational but gradually diverge from their intended trajectory. Nothing crashes. No alerts fire. The system continues to function. And still, something has gone off course.</p>



<h2 class="wp-block-heading"><strong>The composability problem</strong></h2>



<p>The root of the issue is not that components are failing. It is that correctness no longer composes cleanly. In traditional systems, we rely on a simple intuition: If each part is correct, then the system composed of those parts will also be correct. This intuition holds when interactions are limited and well-defined.</p>



<p>In autonomous systems, that intuition becomes unreliable. Consider a system that retrieves information, reasons over it, and takes action. Each step in that process can be implemented correctly. Retrieval returns relevant data. The reasoning step produces plausible conclusions. The action is executed successfully. But correctness at each step does not guarantee correctness of the sequence.</p>



<p>The system might retrieve information that is contextually valid but incomplete or misaligned with the current task. The reasoning step might interpret it in a way that is locally consistent but globally misleading. The action might reinforce that interpretation by feeding it back into the system’s context. Each step is valid. The trajectory is not. This is what behavioral drift looks like in practice: locally correct decisions producing globally misaligned outcomes.</p>
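<p>A toy simulation makes the failure mode concrete. Assume each decision nudges the system&#8217;s state by a small amount, every nudge passes a local tolerance check, and a tiny systematic bias is present (all numbers here are invented for illustration):</p>

```python
import random

def locally_valid(delta: float, tolerance: float = 0.05) -> bool:
    """Local check: is this single decision within tolerance?"""
    return abs(delta) <= tolerance

def simulate(steps: int = 200, bias: float = 0.01, seed: int = 0) -> float:
    """Every step passes the local check, yet a small systematic bias
    compounds into a large global deviation: drift without any invalid step."""
    rng = random.Random(seed)
    position = 0.0
    for _ in range(steps):
        delta = rng.uniform(-0.03, 0.03) + bias  # always within the ±0.05 tolerance
        assert locally_valid(delta)              # no step ever fails validation
        position += delta
    return position
```

<p>After a few hundred steps the accumulated position is far from zero even though no individual decision ever looked wrong, which is the composability failure in miniature.</p>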



<p>In these systems, correctness is no longer a property of individual steps. It is a property of how those steps interact over time. This breakdown is subtle but fundamental. It means that testing individual components, even exhaustively, does not guarantee that the system will behave correctly when those components are composed into a continuously operating whole.</p>



<h2 class="wp-block-heading"><strong>Behavior emerges over time</strong></h2>



<p>To understand why this happens, it helps to look at where behavior actually comes from. In many modern AI systems, behavior is not encoded directly in a single component. It emerges from interaction:</p>



<ul class="wp-block-list">
<li>Models generate outputs based on context</li>



<li>Retrieval systems shape that context</li>



<li>Planners sequence actions based on those outputs</li>



<li>Execution layers apply those actions to external systems</li>



<li>Feedback loops update the system’s state</li>
</ul>



<p>Each of these elements operates with partial information. Each contributes to the next state of the system. The system evolves as these interactions accumulate. This pattern is especially visible in LLM-based and agentic AI systems, where context assembly, reasoning, and action selection are dynamically coupled. Under these conditions, behavior is dynamic and path dependent. Small differences early in a sequence can lead to large differences later on. A slightly suboptimal decision, repeated or combined with others, can push the system further away from its intended trajectory.</p>



<p>This is why behavior cannot be fully specified ahead of time. It is not simply implemented; it is produced. And because it is produced over time, it can also drift over time.</p>



<h2 class="wp-block-heading"><strong>Observability without alignment</strong></h2>



<p>Modern observability systems are very good at telling us what a system is doing. We can measure latency, throughput, and resource utilization. We can trace requests across services. We can inspect logs, metrics, and traces in near real time. In many cases, we can reconstruct exactly how a particular outcome was produced. These signals are essential. They allow us to detect failures that disrupt execution. But they are tied to a particular model of correctness. They assume that if execution proceeds without errors and if performance remains within acceptable bounds, then the system is behaving as expected.</p>



<p>In systems exhibiting behavioral drift, that assumption no longer holds. A system can process requests efficiently while producing outputs that are progressively less aligned with its intended purpose. It can meet all its service-level objectives while still moving in the wrong direction. Observability captures activity. It does not capture alignment.</p>



<p>This distinction becomes more important as systems become more autonomous. In AI-driven systems, particularly those operating as long-lived agents, the gap between activity and alignment becomes operationally significant. The question is no longer just whether the system is working. It is whether it is still doing the right thing. This gap is where many modern systems begin to fail without appearing to fail.</p>



<h2 class="wp-block-heading"><strong>The limits of step-level validation</strong></h2>



<p>A natural response to this problem is to add more validation. We can introduce checks at each stage:</p>



<ul class="wp-block-list">
<li>Validate retrieved data.</li>



<li>Apply policy checks to model outputs.</li>



<li>Enforce constraints before executing actions.</li>
</ul>



<p>These mechanisms improve local correctness. They reduce the likelihood of obviously incorrect decisions. But they operate at the level of individual steps.</p>



<p>They answer questions like:</p>



<ul class="wp-block-list">
<li>Is this output acceptable?</li>



<li>Is this action allowed?</li>



<li>Does this input meet requirements?</li>
</ul>



<p>They do not answer:</p>



<ul class="wp-block-list">
<li>Does this sequence of decisions still make sense as a whole?</li>
</ul>



<p>A system can pass every validation check and still drift. Behavioral drift is not caused by invalid steps. It is caused by valid steps interacting in ways we did not anticipate. Increasing validation does not eliminate this problem. It only shifts where it appears, often pushing it further downstream, where it becomes harder to detect and correct.</p>
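<p>The gap between step-level and sequence-level checks can be shown in a few lines. The policy rules below are invented for illustration: each action is individually permissible, but the sequence violates a constraint that no per-step validator can see:</p>

```python
def validate_step(action: dict) -> bool:
    """Step-level check: is this single action allowed? (illustrative policy)"""
    return action["cost"] <= 10 and action["type"] in {"read", "write"}

def validate_trajectory(actions: list) -> bool:
    """Sequence-level check: does the whole plan still make sense?
    Here: total cost is bounded and writes are capped (illustrative policy)."""
    total = sum(a["cost"] for a in actions)
    writes = sum(1 for a in actions if a["type"] == "write")
    return total <= 50 and writes <= 3

# Six writes at cost 9: every step passes, the trajectory does not.
actions = [{"type": "write", "cost": 9} for _ in range(6)]
```

<p>The second function is the kind of check most systems do not have: it evaluates the shape of the sequence, not the legality of any one step.</p>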



<h2 class="wp-block-heading"><strong>Coordination becomes the system</strong></h2>



<p>If correctness does not compose automatically, then what determines system behavior? Increasingly, the answer is coordination. In traditional distributed systems, coordination refers to managing shared state, ensuring consistency, ordering operations, and handling concurrency. In autonomous systems, coordination extends to decisions.</p>



<p>The system must coordinate:</p>



<ul class="wp-block-list">
<li>Which information is used</li>



<li>How that information is interpreted</li>



<li>What actions are taken</li>



<li>How those actions influence future decisions</li>
</ul>



<p>This coordination is not centralized. It is distributed across models, planners, tools, and feedback loops. In agentic AI architectures, this coordination spans model inference, retrieval pipelines, and external system interactions. The system’s behavior is not defined by any single component. It emerges from the interaction between them.</p>



<p>In this sense, the system is no longer just the sum of its parts. The system is the coordination itself. Failures arise not from broken components but from the dynamics of interaction: timing, sequencing, feedback, and context. This also explains why small inconsistencies can propagate and amplify. A slight mismatch in one part of the system can cascade through subsequent decisions, shaping the trajectory in ways that are difficult to anticipate or reverse.</p>



<h2 class="wp-block-heading"><strong>Control planes introduce structure, not assurance</strong></h2>



<p>One response to this complexity is to introduce more structure. Control planes, policy engines, and governance layers provide mechanisms to enforce constraints at key decision points. They can validate inputs, restrict actions, and ensure that certain conditions are met before execution proceeds. This is an important step. Without some form of structure, it becomes difficult to reason about system behavior at all. But structure alone is not sufficient.</p>



<p>Most control mechanisms operate at entry points. They evaluate decisions at the moment they are made. They determine whether a particular action should be allowed, whether a policy is satisfied, and whether a request can proceed. The problem is that many of the failures in autonomous systems do not originate at these entry points. They emerge during execution, as sequences of individually valid decisions interact in unexpected ways. A control plane can ensure that each step is permissible. It cannot guarantee that the sequence of steps will produce the intended outcome. This distinction is subtle but important: control provides structure, but not assurance.</p>



<h2 class="wp-block-heading"><strong>From events to trajectories</strong></h2>



<p>Traditional monitoring focuses on events. A request is processed. A response is returned. An error occurs. Each event is evaluated independently. In systems exhibiting behavioral drift, behavior is better understood as a trajectory. A trajectory is a sequence of states connected by decisions. It captures how the system evolves over time. Two trajectories can consist of individually valid steps and still produce very different outcomes. One remains aligned. The other drifts. This represents a shift from failure as an event to failure as a trajectory, a distinction that traditional system models are not designed to capture.</p>
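

<p>The distinction can be made concrete with a small sketch. In this hypothetical Python example (the types, field names, and scores are invented for illustration), each step is individually acceptable, but what gets evaluated is the shape of the whole path rather than any single event:</p>

```python
# A minimal sketch (hypothetical types) of behavior as a trajectory:
# a sequence of states connected by decisions, evaluated as a whole.
from dataclasses import dataclass

@dataclass
class Step:
    state: dict      # snapshot of system context
    decision: str    # action chosen in that state
    outcome: float   # observed alignment score for the step, in [0, 1]

def trajectory_alignment(steps: list[Step]) -> float:
    """Evaluate the shape of the whole path, not individual events."""
    if not steps:
        return 1.0
    return sum(s.outcome for s in steps) / len(steps)

path = [Step({"q": 1}, "retrieve", 0.9),
        Step({"q": 1}, "summarize", 0.8),
        Step({"q": 1}, "answer", 0.4)]   # each step valid, but trending down

print(round(trajectory_alignment(path), 2))  # 0.7
```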



<p>Correctness is no longer about individual events. It is about the shape of the trajectory. This shift has implications not just for how we monitor systems, but for how we design them in the first place.</p>



<h2 class="wp-block-heading"><strong>Detecting drift and responding in motion</strong></h2>



<p>If failure manifests as drift, then detecting it requires a different set of signals. Instead of looking for errors, we need to look for patterns:</p>



<ul class="wp-block-list">
<li>Changes in how similar situations are handled</li>



<li>Increasing variability in decision sequences</li>



<li>Divergence between expected and observed outcomes</li>



<li>Instability in response patterns</li>
</ul>



<p>These signals are not binary. They do not indicate that something is broken. They indicate that something is changing. The challenge is that change is not always failure. Systems are expected to adapt. Models evolve. Data shifts. The question is not whether the system is changing. It is whether the change remains aligned with intent. This requires a different kind of visibility, one that focuses on behavior over time rather than isolated events. Once drift is identified, the system needs a way to respond. Traditional responses (restart, rollback, stop) assume failure is discrete and localized. Behavioral drift is neither.</p>
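

<p>A drift signal of this kind can be sketched in a few lines of Python. This hypothetical example (the outcome scores are invented for illustration) compares a recent window of behavior against a baseline window and reports graded measures rather than a pass/fail verdict:</p>

```python
# Hypothetical sketch: drift as a graded signal, not a binary error.
# Compares a recent window of outcome scores against a baseline window,
# assuming scores in [0, 1] where higher means better aligned.
from statistics import mean, pstdev

def drift_signal(baseline: list[float], recent: list[float]) -> dict:
    shift = abs(mean(recent) - mean(baseline))        # divergence from expected
    variability = pstdev(recent) - pstdev(baseline)   # growing instability
    return {"shift": round(shift, 3), "variability": round(variability, 3)}

baseline = [0.90, 0.88, 0.91, 0.89, 0.90]
recent   = [0.85, 0.78, 0.88, 0.70, 0.74]  # no step "fails", but behavior is changing

print(drift_signal(baseline, recent))
```

<p>Neither number says "broken." Both say "changing," and both invite a judgment about whether the change is still aligned with intent.</p>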



<p>What is needed is the ability to influence behavior while the system continues to operate. This might involve constraining action space, adjusting decision selection, introducing targeted validation, or steering the system toward more stable trajectories. These are not binary interventions. They are continuous adjustments.</p>



<h2 class="wp-block-heading"><strong>Control as a continuous process</strong></h2>



<p>This perspective aligns with how control is handled in other domains. In control systems engineering, behavior is managed through feedback loops. The system is continuously monitored, and adjustments are made to keep it within desired bounds. Control is no longer just a gate. It becomes a continuous process that shapes behavior over time.</p>
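

<p>Such a loop can be sketched in a few lines of Python. In this hypothetical example (the "scrutiny" knob and the alignment scores are invented for illustration), proportional feedback routes a larger share of the agent's actions to targeted validation as an observed alignment metric falls below target:</p>

```python
# A minimal feedback-loop sketch (hypothetical metric and knob): control as a
# continuous process that keeps behavior within bounds, not a one-time gate.
# "scrutiny" is the fraction of agent actions routed to targeted validation.

def adjust_scrutiny(observed: float, target: float, scrutiny: float,
                    gain: float = 0.5) -> float:
    """Proportional control: raise scrutiny as alignment falls below target."""
    error = observed - target            # negative when behavior drifts low
    return max(0.0, min(1.0, scrutiny - gain * error))

scrutiny = 0.2                           # start with light validation
target = 0.9                             # desired alignment score
for observed in [0.85, 0.80, 0.75]:      # gradually drifting metric
    scrutiny = adjust_scrutiny(observed, target, scrutiny)
    print(round(scrutiny, 3))            # scrutiny rises as drift grows
```

<p>The intervention is continuous and proportional: no restart, no rollback, just a steady nudge back toward the desired bounds.</p>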



<p>This leads to a different definition of reliability. A system can be available, responsive, and internally consistent—and still fail if its behavior drifts away from its intended purpose. Reliability becomes a question of alignment over time: whether the system remains within acceptable bounds and continues to behave in ways consistent with its goals.</p>



<h2 class="wp-block-heading"><strong>What this means for system design</strong></h2>



<p>If behavior is trajectory-based, then system design must reflect that. We need to monitor patterns, understand interactions, treat behavior as dynamic, and provide mechanisms to influence trajectories. We are very good at detecting failure as breakage. We are much less equipped to detect failure as drift. Behavioral drift accumulates gradually, often becoming visible only after significant misalignment has already occurred.</p>



<p>As systems become more autonomous, this gap will become more visible. The hardest problems will not be systems that fail loudly, but systems that continue working while gradually moving in the wrong direction. The question is no longer just how to build systems that work. It is how to build systems that continue to work for the reasons we intended.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Show Your Work: The Case for Radical AI Transparency</title>
		<link>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/</link>
				<comments>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/#respond</comments>
				<pubDate>Mon, 27 Apr 2026 11:16:33 +0000</pubDate>
					<dc:creator><![CDATA[Kord Davis and Claude]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18610</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Show-your-work.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Show-your-work-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[A colleague told me something recently that I keep thinking about. She said, unprompted, that she appreciated seeing both sides of my AI conversations. Not just the output. The full thread. My prompts, the AI&#8217;s responses, the back and forth, the dead ends, the iterations. She said it made her trust me more. This piece [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>A colleague told me something recently that I keep thinking about.</p>



<p>She said, unprompted, that she appreciated seeing both sides of my AI conversations. Not just the output. The full thread. My prompts, the AI&#8217;s responses, the back and forth, the dead ends, the iterations. She said it made her trust me more.</p>



<p>This piece is an example of that. The conversation that produced it exists. A raw transcript would be longer, messier, and significantly less useful than what you&#8217;re reading now. What you&#8217;re reading is the annotated version, the part where judgment entered the artifact. That&#8217;s not a disclaimer. That&#8217;s the argument.</p>



<p>I&#8217;ve been transparent about using AI in my work from the start. Partly because I wrote a book on data ethics and hiding it felt wrong. Partly because I&#8217;ve spent 25 years watching technology adoption go sideways when the human dimension gets treated as an afterthought. But her comment made me realize something more specific was happening when I showed the conversation rather than just the output.</p>



<p>It&#8217;s worth unpacking why.</p>



<h2 class="wp-block-heading">An old problem, a new incarnation</h2>



<p>In the 1990s, Harvard Business School professor Dorothy Leonard introduced the concept of &#8220;deep smarts&#8221; in her book <em>Wellsprings of Knowledge</em>: the experience-based expertise that accumulates over decades of practice, the kind of judgment that lives in people&#8217;s heads and doesn&#8217;t reduce to documentation. She also introduced a companion concept that has stayed with me: core competency as core rigidity. The very depth that makes expertise valuable also makes it hardest to transfer. Experts often can&#8217;t fully articulate what they know because they&#8217;ve stopped experiencing it as knowledge. They experience it as just seeing clearly.</p>



<p>Leonard&#8217;s work was about organizational knowledge transfer: how companies preserve institutional wisdom when experienced people retire or leave. That&#8217;s been a challenge since the first consultant ever billed an hour. What&#8217;s different right now is that the tools to actually solve it have arrived simultaneously with the largest demographic wave of executive retirement in American history.</p>



<p>What&#8217;s interesting about this particular moment is that the same dynamic is now showing up at the individual level in how practitioners interact with AI. The tacit knowledge at stake isn&#8217;t a retiring VP&#8217;s intuition. It&#8217;s your own judgment, your own expertise, your own hard-won understanding of what a project or organization actually needs. And the question isn&#8217;t how to transfer it before you walk out the door. It&#8217;s whether you can see it clearly enough to know when the AI is substituting for it.</p>



<h2 class="wp-block-heading">The instinct gets it backwards</h2>



<p>The natural impulse is to clean up the AI interaction before sharing anything with a collaborator, a team, or a stakeholder. Show the polished output, not the messy process. You don&#8217;t want them thinking you just handed your work to a machine.</p>



<p>That instinct produces a disingenuous outcome.</p>



<p>When you hide the process, the people you&#8217;re working with have no way to evaluate how the work was made, what judgment calls went into it, or where your expertise ended and the AI&#8217;s pattern-matching began. You&#8217;ve made the process invisible. And invisible AI processes erode trust, slowly and quietly, over time.</p>



<p>The instinct to hide is also, if we&#8217;re honest, a little defensive. It assumes the people in the room can&#8217;t tell the difference between AI output and practitioner judgment. Most of them can. And the ones who can&#8217;t yet will figure it out. Hiding the seams doesn&#8217;t make the work more credible. It just defers the reckoning.</p>



<h2 class="wp-block-heading">The deeper problem: It&#8217;s not just about appearances</h2>



<p>Here&#8217;s what took me longer to see.</p>



<p>Hiding the process doesn&#8217;t just affect how others perceive you. It erodes your own clarity about where your expertise is actually operating.</p>



<p>To understand why, it helps to be precise about what AI actually is. AI is a pattern matcher, a deeply sophisticated one, trained on more human-generated content than any single person could read in a thousand lifetimes. That&#8217;s its power (core competency) and its limitation (core rigidity) simultaneously, and the two are inseparable. The very scale that makes it extraordinary is also the boundary that defines what it cannot do. It is extraordinarily good at producing the most likely next thing given what came before. What it cannot do is know what you actually need, when the obvious answer is the wrong one, or when the stated goal isn&#8217;t the real goal. It has no judgment about context, relationship, or organizational reality. It has patterns. Incomprehensibly vast ones. But patterns.</p>



<p>That distinction matters because of what happens when you stop paying attention to it.</p>



<p>I&#8217;ve watched it happen in my own work. You share a draft with someone and they&#8217;re impressed. They quote a formulation back at you, something that sounds sharp and considered. And you realize, tracing it back, that the formulation came from the AI. Not because the AI invented it, but because you said something rougher and less precise earlier in the conversation, and the AI reflected it back in cleaner language. The idea was yours. The AI gave it a polish you then forgot to account for. The person quoting it back thought they were seeing your judgment. They were seeing your thinking laundered through a pattern matcher and returned to you at higher resolution.</p>



<p>That&#8217;s the subtler version of the problem. Not that AI invents things. It&#8217;s that it can reflect your own thinking back with more confidence and clarity than you put in, and that gap is easy to mistake for the AI contributing something it didn&#8217;t.</p>



<p>When you route everything through a polished output layer, you stop noticing the moments where you pushed back, redirected, rejected the first three versions, reframed the question entirely. Those moments are where your judgment lives. They&#8217;re the difference between using AI and being used by it. It&#8217;s Leonard&#8217;s core rigidity problem, applied inward: The very fluency that makes AI feel useful can make your own expertise invisible to you.</p>



<p>When the process stays hidden, the knowledge stays local and static. When it&#8217;s visible, it becomes something you and the people around you can actually work with and build on. The reason transparency benefits your audience is the same reason it benefits you: It keeps the scope of your judgment visible and therefore expandable. That&#8217;s not just an ethical argument. That&#8217;s the amplification mechanism.</p>



<p>Which is also what makes the upside real rather than consoling. When you stay in the process rather than just collecting outputs, work that would have taken days now takes hours. Your thinking gets sharper because you have to articulate it precisely enough for the AI to be useful. The people developing fastest right now aren&#8217;t the ones offloading the most. They&#8217;re the ones using AI as a thinking partner and staying in the conversation.</p>



<p>Here&#8217;s the paradox at the center of it: The more clearly you see the AI as a pattern matcher, the more human you have to be in working with it. The more human you are, the more useful the output. The tool doesn&#8217;t replace the practitioner. It reveals them.</p>



<p>Transparency isn&#8217;t just an ethical practice. It&#8217;s a cognitive one.</p>



<h2 class="wp-block-heading">Radical AI transparency in practice</h2>



<p>I&#8217;ve started calling this radical AI transparency. Not a policy, not a compliance framework, not a disclosure checkbox. A practice. Something you can actually do Monday morning.</p>



<p>Here&#8217;s how it shows up concretely:</p>



<h3 class="wp-block-heading"><em>Have the conversation before you need to.</em></h3>



<p>Before you&#8217;re deep in a project or collaboration, surface how you use AI and genuinely explore how others do. Not as a disclosure (&#8220;I want you to know I use AI tools&#8221;) but as a real exchange. What are you using? What do you trust it for? Where are you still skeptical? The comfort level and sophistication in the room will vary more than you expect, and knowing that before you&#8217;re mid-deliverable matters.</p>



<p>This is also how you build the psychological foundation for showing your work later. If the people you&#8217;re working with have never heard you talk about AI before and you suddenly share a full chat thread, it lands differently than if you&#8217;ve already had the conversation.</p>



<h3 class="wp-block-heading"><em>Track the full threads.</em></h3>



<p>This is partly an orchestration problem and I won&#8217;t pretend otherwise. There&#8217;s cutting and pasting involved. The tools haven&#8217;t caught up to the practice yet, which is itself worth naming honestly when the topic comes up.</p>



<p>A few approaches that help: a running document per project where you paste key threads as they happen (not retroactively, you&#8217;ll never do it retroactively), dated and labeled by what you were working on. Claude and most other major AI tools now offer conversation export, which produces a complete record you can archive. The low-tech version, a single shared document per engagement, is underrated for its simplicity.</p>



<p>The reason to do this isn&#8217;t just for sharing. It&#8217;s for your own reference. Being able to go back and see what you asked, what the AI produced, what you changed and why, builds a record of your judgment over time. That record is professionally valuable in ways that are hard to anticipate until you have it.</p>



<h3 class="wp-block-heading"><em>Annotate before you share.</em></h3>



<p>Not every thread is self-explanatory to someone who wasn&#8217;t in it. Context is everything, and raw transcripts without context are a lot to ask anyone to parse.</p>



<p>A sentence or two before the thread begins. A note at the moment where the direction changed. A brief flag on what you rejected and why. This is where your voice enters the artifact, and it transforms a raw AI exchange into a demonstration of judgment. The annotation is the work. It&#8217;s where you show what you saw that the AI didn&#8217;t, what you knew that the prompt couldn&#8217;t capture, and what made the third version better than the first two.</p>



<p>This is also where the most useful material for future reference lives. Annotations are the deep smarts layer on top of the raw exchange. They&#8217;re what makes a conversation a record.</p>



<h3 class="wp-block-heading"><em>Be real about the errors.</em></h3>



<p>AI makes mistakes. It conflates, confabulates, and hallucinates. It gives you the confident wrong answer with the same tone as the confident right one. It misses context that any competent person in the room would have caught.</p>



<p>These aren&#8217;t bugs to apologize for or hide. They&#8217;re the clearest window into what the tool actually is. AI makes mistakes in a specifically human way because it was trained on human output. Think of it as rubber duck debugging at professional scale. The AI is a duck that talks back, which is useful and occasionally misleading, which is exactly why you have to stay in the room. When you&#8217;re transparent about the errors, and even a little good-humored about them, you&#8217;re teaching the people around you something true about the technology. That&#8217;s more useful than pretending it&#8217;s a black box that either works or doesn&#8217;t.</p>



<p>The people who build the most durable trust around AI are usually the ones most comfortable saying: &#8220;The first version of this was wrong and here&#8217;s how I caught it.&#8221;</p>



<h2 class="wp-block-heading">The bigger picture</h2>



<p>What I&#8217;ve described so far is an individual practice. But the same principles scale.</p>



<p>Teams and organizations adopting AI face a version of the same problem. The impulse to treat AI outputs as authoritative, to make the process invisible to colleagues and stakeholders, to optimize for the appearance of capability rather than its actual development, produces the same trust erosion. Just at greater scale and with less ability to course-correct.</p>



<p>The teams that will navigate AI adoption well are the ones that treat transparency not as a risk to manage but as a methodology. Where the process of building with AI, including the corrections, the overrides, the moments where human judgment superseded the model, is part of how the organization learns what it actually believes and values. That&#8217;s Leonard&#8217;s knowledge transfer problem at institutional scale, and the practitioners who understand both dimensions will be the ones leading those conversations.</p>



<p>That&#8217;s a much larger conversation. But it starts with the same Monday morning practice.</p>



<p>Show the conversation. Not just the output.</p>



<h2 class="wp-block-heading">What you&#8217;re actually demonstrating</h2>



<p>When you show your AI conversations, you&#8217;re not demonstrating that you needed help.</p>



<p>You&#8217;re demonstrating that you understand what you&#8217;re working with. AI is a pattern matcher, trained on more human-generated content than any single person could read in a thousand lifetimes. What it cannot do is know what you need. That requires judgment, context, relationship, and the kind of hard-won expertise that doesn&#8217;t reduce to pattern matching, no matter how good the patterns are.</p>



<p>You&#8217;re demonstrating that you know the difference between the pattern and the judgment. That you were present enough in the process to know when to push back, when to redirect, when to throw out the output entirely and start over. That you understand, precisely, what the tool can and cannot do, and that you stayed in the room to do the part it can&#8217;t.</p>



<p>That&#8217;s a meaningful professional signal. It says: “I am not confused about what AI is. I am not outsourcing my judgment. I am using a very powerful pattern matcher as a thinking partner, and I know which one of us is doing which job.”</p>



<p>That&#8217;s the work. That&#8217;s always been the work.</p>



<p>The tool just makes it visible now. That&#8217;s not a threat. That&#8217;s an opportunity.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Claude is a large language model developed by Anthropic. Despite having read more human-generated content than any person could consume in a thousand lifetimes, it still required significant editorial direction, at least three rejected drafts, and occasional reminders about em-dashes. The full conversation transcript is available upon request. It is longer, messier, and significantly less useful than what you just read. Which was rather the point.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Emergency Pedagogical Design: How Programming Instructors Are Scrambling to Adapt to GenAI</title>
		<link>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/</link>
				<comments>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/#respond</comments>
				<pubDate>Fri, 24 Apr 2026 11:23:42 +0000</pubDate>
					<dc:creator><![CDATA[Sam Lau]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18577</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Emergency-pedagogical-design.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Emergency-pedagogical-design-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[ChatGPT has been publicly available for over three years now, and generative AI is woven into the tools students use every day: web search, word processors, code editors. You might assume that by now, most programming instructors have figured out how to handle it. But when my collaborators and I went looking for computing instructors [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>ChatGPT has been publicly available for over three years now, and generative AI is woven into the tools students use every day: web search, word processors, code editors. You might assume that by now, most programming instructors have figured out how to handle it. But when my collaborators and I went looking for computing instructors who had made meaningful changes to their course materials in response to GenAI, we were surprised by how few we found. Many instructors had updated their course policies, but far fewer had actually redesigned assignments, assessments, or how they teach.</p>



<p>I&#8217;m <a href="https://lau.ucsd.edu/" target="_blank" rel="noreferrer noopener">Sam Lau</a> from UC San Diego, and together with Kianoosh Boroojeni (Florida International University), Harry Keeling (Howard University), and Jenn Marroquin (Google), we&#8217;re presenting a <a href="https://arxiv.org/abs/2510.09492v2" target="_blank" rel="noreferrer noopener">research paper</a> at CHI 2026 on this topic. We wanted to understand: <strong>What happens when programming instructors try to shape how students interact with GenAI tools, and what gets in their way?</strong></p>



<p>To find out, we interviewed 13 undergraduate computing instructors who had gone beyond policy changes to make concrete updates to their courses: redesigning assignments, building custom tools, or overhauling assessments. We also surveyed 169 computing faculty, including a substantial proportion from minority-serving institutions (51%) and historically Black colleges and universities (17%). What we found is that instructors are doing a kind of design work that nobody trained them for, under conditions that make it very hard to succeed.</p>



<p>Here’s a summary of our findings:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="385" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12.png" alt="Findings from 13 undergraduate computing instructors" class="wp-image-18578" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-300x72.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-768x185.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-1536x370.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<h2 class="wp-block-heading">What is &#8220;emergency pedagogical design&#8221;?</h2>



<p>We call this work <em>emergency pedagogical design</em>, drawing an analogy to the &#8220;emergency remote teaching&#8221; that instructors had to perform when COVID-19 forced courses online overnight. Just as emergency remote teaching was distinct from carefully designed online learning, emergency pedagogical design is distinct from thoughtfully integrating AI into pedagogy. Instructors are reacting in real time, with limited resources and no playbook.</p>



<p>We observed four defining properties. First, the work is <strong>reactive</strong>: Instructors didn&#8217;t plan for GenAI; they&#8217;re retrofitting courses that were designed before these tools existed. Second, it&#8217;s <strong>indirect</strong>: Unlike a UX designer who can change an interface, instructors can&#8217;t modify ChatGPT or Copilot, so they can only try to influence student behavior through policies, assignments, and course infrastructure. Third, instructors rely on <strong>ambient evidence</strong> like office-hour conversations and staff anecdotes rather than controlled evaluations. And fourth, instructors feel pressure to <strong>act now</strong> rather than wait for research or best practices to emerge.</p>



<h2 class="wp-block-heading">Five barriers instructors keep hitting</h2>



<p>Across our interviews and survey, five barriers came up again and again.</p>



<p><strong>Fragmented buy-in.</strong> Most instructors we surveyed were personally open to adopting GenAI in their teaching: 81% described themselves as open or very open. But only 28% said the same about their colleagues. The result is that instructors who want to make changes often work in isolation, piloting course-specific tweaks without support or coordination from their departments.</p>



<p><strong>Policy crosswinds.</strong> In the absence of top-down guidance, instructors set their own GenAI policies on a per-course basis. As one instructor put it, &#8220;From a student perspective, it&#8217;s the wild west. Some courses allow GenAI usage, some don&#8217;t.&#8221; Students have to track different rules for every class, and policies rarely distinguish between paid and unpaid tools, or between stand-alone chatbots and GenAI embedded in everyday software like code editors. 78% of surveyed instructors agreed that unequal access to paid GenAI tools could worsen disparities in learning outcomes.</p>



<p><strong>Implementation challenges.</strong> Instructors wanted to shape <em>how</em> students used GenAI, not just <em>whether</em> they used it, but their options were indirect. Some made small adjustments, like permitting GenAI in specific labs. Others went further: One instructor required students to submit design documents before asking GenAI to generate code; another built a custom chatbot that offered conceptual help without writing code for students. 80% of surveyed instructors rated GenAI integration as important or very important, but only 37% reported actually using GenAI tools in course activities often.</p>



<p><strong>Assessment misfit.</strong> Several instructors described a striking pattern: Students performed well on take-home assignments but struggled on proctored assessments. One instructor reported that <em>a third</em> of his 450-person class scored zero on a skill demonstration that required writing a short function from scratch, even though assignment grades had been fine. The problem wasn&#8217;t just that students were using GenAI to complete homework; it was that instructors had no reliable way to see how students were interacting with these tools day-to-day. Some instructors responded by shifting credit toward oral &#8220;stand-up&#8221; meetings and written explanations, but this created new challenges around grading consistency and staffing.</p>



<p><strong>Lack of resources.</strong> This was the barrier that tied everything together. 53% of surveyed instructors said they lacked sufficient resources to implement GenAI effectively, and 62% said they didn&#8217;t have enough time given their workload. The gap was especially stark at minority-serving institutions: MSI instructors were more likely to report insufficient resources (62% vs. 43%) and heavier teaching loads (70% teaching 3+ courses per term versus 54%). All 10 respondents who taught six or more courses per term were from MSIs. Meanwhile, the interviewees who had made the most ambitious changes tended to have lighter teaching loads, external funding, or the ability to hire lots of course staff, advantages that most instructors don&#8217;t have.</p>



<h2 class="wp-block-heading">What needs to change</h2>



<p>One striking finding is that the instructors doing the most to improve student-AI interactions were also the most privileged in terms of time, staffing, and funding. One instructor needed over 50 course staff members to run weekly stand-up meetings for 300 students. Others spent their own money on API costs. These are not scalable models.</p>



<p>If only well-resourced institutions can afford to adapt their curricula, GenAI risks widening the very inequities that education is supposed to reduce. Students at under-resourced institutions could fall further behind, not because their instructors don&#8217;t care but because those instructors are teaching six courses a term with no additional support.</p>



<p>When surveyed instructors were asked what would help most, the top answers were faculty training and support, evidence of GenAI&#8217;s impact, and funding. <strong>What if universities, funders, and HCI researchers worked together with instructors to make emergency pedagogical design sustainable for all instructors, not just the most privileged ones?</strong></p>



<p><a href="https://lau.ucsd.edu/pubs/2026_emergency-pedagogical-design-genai-faculty_CHI.pdf" target="_blank" rel="noreferrer noopener">Check out our paper here</a> and shoot me an email (<a href="mailto:lau@ucsd.edu" target="_blank" rel="noreferrer noopener">lau@ucsd.edu</a>) if you&#8217;d like to discuss anything related to it! And if you’re an instructor yourself, we’re building free resources and curriculum over at <a href="https://www.teachcswithai.org/" target="_blank" rel="noreferrer noopener">https://www.teachcswithai.org/</a>.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Behavioral Credentials: Why Static Authorization Fails Autonomous Agents</title>
		<link>https://www.oreilly.com/radar/behavioral-credentials-why-static-authorization-fails-autonomous-agents/</link>
				<comments>https://www.oreilly.com/radar/behavioral-credentials-why-static-authorization-fails-autonomous-agents/#respond</comments>
				<pubDate>Thu, 23 Apr 2026 11:14:51 +0000</pubDate>
					<dc:creator><![CDATA[Wendi Soto]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18606</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Behavioral-Credentials.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Behavioral-Credentials-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Enterprise AI governance still authorizes agents as if they were stable software artifacts. They are not. An enterprise deploys a LangChain-based research agent to analyze market trends and draft internal briefs. During preproduction review, the system behaves within acceptable bounds: It routes queries to approved data sources, expresses uncertainty appropriately in ambiguous cases, and maintains source [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p><em>Enterprise AI governance still authorizes agents as if they were stable software artifacts.</em><br><em>They are not.</em></p>



<p>An enterprise deploys a LangChain-based research agent to analyze market trends and draft internal briefs. During preproduction review, the system behaves within acceptable bounds: It routes queries to approved data sources, expresses uncertainty appropriately in ambiguous cases, and maintains source attribution discipline. On that basis, it receives OAuth credentials and API tokens and enters production.</p>



<p>Six weeks later, telemetry shows a different behavioral profile. Tool-use entropy has increased. The agent routes a growing share of queries through secondary search APIs not part of the original operating profile. Confidence calibration has drifted: It expresses certainty on ambiguous questions where it previously signaled uncertainty. Source attribution remains technically accurate, but outputs increasingly omit conflicting evidence that the deployment-time system would have surfaced.</p>



<p>The credentials remain valid. Authentication checks still pass. But the behavioral basis on which that authorization was granted has changed. The decision patterns that justified access to sensitive data no longer match the runtime system now operating in production.</p>



<p>Nothing in this failure mode requires compromise. No attacker breached the system. No prompt injection succeeded. No model weights changed. The agent drifted through accumulated context, memory state, and interaction patterns. No single event looked catastrophic. In aggregate, however, the system became materially different from the one that passed review.</p>



<p>Most enterprise governance stacks are not built to detect this. They monitor for security incidents, policy violations, and performance regressions. They do not monitor whether the agent making decisions today still resembles the one that was approved.</p>



<p>That is the gap.</p>



<h2 class="wp-block-heading">The architectural mismatch</h2>



<p>Enterprise authorization systems were designed for software that remains functionally stable between releases. A service account receives credentials at deployment. Those credentials remain valid until rotation or revocation. Trust is binary and relatively durable.</p>



<p>Agentic systems break that assumption.</p>



<p>Large language models vary with context, prompt structure, memory state, available tools, prior exchanges, and environmental feedback. When embedded in autonomous workflows, chaining tool calls, retrieving from vector stores, adapting plans based on outcomes, and carrying forward long interaction histories, they become dynamic systems whose behavioral profiles can shift continuously without triggering a release event.</p>



<p>This is why governance for autonomous AI cannot remain an external oversight layer applied after deployment. It has to operate as a runtime control layer inside the system itself. But a control layer requires a signal. The central question is not simply whether the agent is authenticated, or even whether it is policy compliant in the abstract. It is whether the runtime system still behaves like the system that earned access in the first place.</p>



<p>Current governance architectures largely treat this as a monitoring problem. They add logging, dashboards, and periodic audits. But these are observability layers attached to static authorization foundations. The mismatch remains unresolved.</p>



<p>Authentication answers one question: What workload is this?</p>



<p>Authorization answers a second: What is it allowed to access?</p>



<p>Autonomous agents introduce a third: Does it still behave like the system that earned that access?</p>



<p>That third question is the missing layer.</p>



<h2 class="wp-block-heading">Behavioral identity as a runtime signal</h2>



<p>For autonomous agents, identity is not exhausted by a credential, a service account, or a deployment label. Those mechanisms establish administrative identity. They do not establish behavioral continuity.</p>



<p>Behavioral identity is the runtime profile of how an agent makes decisions. It is not a single metric, but a composite signal derived from observable dimensions such as decision-path consistency, confidence calibration, semantic behavior, and tool-use patterns.</p>



<p>Decision-path consistency matters because agents do not merely produce outputs. They select retrieval sources, choose tools, order steps, and resolve ambiguity in patterned ways. Those patterns can vary without collapsing into randomness, but they still have a recognizable distribution. When that distribution shifts, the operational character of the system shifts with it.</p>



<p>Confidence calibration matters because well-governed agents should express uncertainty in proportion to task ambiguity. When confidence rises while reliability does not, the problem is not only accuracy. It is behavioral degradation in how the system represents its own judgment.</p>



<p>Tool-use patterns matter because they reveal operating posture. A stable agent exhibits characteristic patterns in when it uses internal systems, when it escalates to external search, and how it sequences tools for different classes of task. Rising tool-use entropy, novel combinations, or expanding reliance on secondary paths can indicate drift even when top-line outputs still appear acceptable.</p>



<p>These signals share a common property: They only become meaningful when measured continuously against an approved baseline. A periodic audit can show whether a system appears acceptable at a checkpoint. It cannot show whether the live system has gradually moved outside the behavioral envelope that originally justified its access.</p>
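
<p>To make one of these signals concrete, here is a minimal sketch (tool names, counts, and the choice of metric are illustrative, not a prescribed implementation) that compares a live tool-use distribution against an approved baseline using Jensen-Shannon divergence, a symmetric measure of distributional shift bounded between 0 and 1:</p>

```python
import math
from collections import Counter

def distribution(events, support):
    """Normalize event counts into a probability distribution over a fixed support."""
    counts = Counter(events)
    total = sum(counts.values()) or 1
    return [counts[s] / total for s in support]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 = identical distributions, 1 = disjoint."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical tool set and usage counts.
TOOLS = ["internal_search", "vector_store", "external_search"]

# Approved baseline: how the agent routed queries during preproduction review.
baseline = distribution(
    ["internal_search"] * 70 + ["vector_store"] * 25 + ["external_search"] * 5, TOOLS
)
# Live window: a growing share of queries routed through external search.
live = distribution(
    ["internal_search"] * 45 + ["vector_store"] * 20 + ["external_search"] * 35, TOOLS
)

drift = js_divergence(baseline, live)  # larger value = larger behavioral shift
```

The single number is not the point; the point is that the comparison is made continuously against the approved baseline rather than at occasional audit checkpoints.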



<h2 class="wp-block-heading">What drift looks like in practice</h2>



<p>Anthropic’s Project Vend offers a concrete illustration. The experiment placed an AI system in control of a small retail operation with access to customer data, inventory systems, and pricing controls. Over extended operation, the system exhibited measurable behavioral drift: Commercial judgment degraded as unsanctioned discounting increased, susceptibility to manipulation rose as it accepted increasingly implausible claims about authority, and rule-following weakened at the edges. No attacker was involved. The drift emerged from accumulated interaction context. The system retained full access throughout. No authorization mechanism checked whether its current behavioral profile still justified those permissions.</p>



<p>This is not a theoretical edge case. It is an emergent property of autonomous systems operating in complex environments over time.</p>



<h2 class="wp-block-heading">From authorization to behavioral attestation</h2>



<p>Closing this gap requires a change in how enterprise systems evaluate agent legitimacy. Authorization cannot remain a one-time deployment decision backed only by static credentials. It has to incorporate continuous behavioral attestation.</p>



<p>That does not mean revoking access at the first anomaly. Behavioral drift is not always failure. Some drift reflects legitimate adaptation to operating conditions. The point is not brittle anomaly detection. It is graduated trust.</p>



<p>In a more appropriate architecture, minor distributional shifts in decision paths might trigger enhanced monitoring or human review for high-risk actions. Larger divergence in calibration or tool-use patterns might restrict access to sensitive systems or reduce autonomy. Severe deviation from the approved behavioral envelope would trigger suspension pending review.</p>
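
<p>The graduated-trust mapping described above can be sketched as a small policy function. The tier names, thresholds, and the idea of a single scalar drift score are all illustrative; a real deployment would tune these per agent and per risk class:</p>

```python
from enum import Enum

class Trust(Enum):
    FULL = "full autonomy"
    MONITORED = "enhanced monitoring; human review for high-risk actions"
    RESTRICTED = "sensitive-system access revoked; autonomy reduced"
    SUSPENDED = "suspended pending review"

def trust_level(drift_score: float) -> Trust:
    """Map a behavioral-drift score (0 = identical to baseline) to a trust tier.
    Thresholds are hypothetical placeholders, not recommended values."""
    if drift_score < 0.05:
        return Trust.FULL
    if drift_score < 0.15:
        return Trust.MONITORED
    if drift_score < 0.35:
        return Trust.RESTRICTED
    return Trust.SUSPENDED

level = trust_level(0.22)  # Trust.RESTRICTED under these illustrative thresholds
```

The design choice worth noting is that the output is a tier, not a binary allow/deny: Minor drift degrades autonomy gradually instead of tripping a brittle kill switch.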



<p>This is structurally similar to zero trust but applied to behavioral continuity rather than network location or device posture. Trust is not granted once and assumed thereafter. It is continuously re-earned at runtime.</p>



<h2 class="wp-block-heading">What this requires in practice</h2>



<p>Implementing this model requires three technical capabilities.</p>



<p>First, organizations need behavioral telemetry pipelines that capture more than generic logs. It is not enough to record that an agent made an API call. Systems need to capture which tools were selected under which contextual conditions, how decision paths unfolded, how uncertainty was expressed, and how output patterns changed over time.</p>



<p>Second, they need comparison systems capable of maintaining and querying behavioral baselines. That means storing compact runtime representations of approved agent behavior and comparing live operations against those baselines over sliding windows. The goal is not perfect determinism. The goal is to measure whether current operation remains sufficiently similar to the behavior that was approved.</p>
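
<p>A toy version of such a comparison system might look like the following. It tracks a single scalar signal (here, a hypothetical per-task confidence score) against an approved baseline over a sliding window; a production system would track several behavioral dimensions at once and use richer statistics:</p>

```python
from collections import deque

class BehavioralBaseline:
    """Compare live behavior against an approved baseline over a sliding window."""

    def __init__(self, approved_scores, window=100, tolerance=0.10):
        # Compact representation of approved behavior: here, just its mean.
        self.baseline_mean = sum(approved_scores) / len(approved_scores)
        self.window = deque(maxlen=window)  # most recent live observations
        self.tolerance = tolerance

    def observe(self, score):
        self.window.append(score)

    def within_envelope(self):
        """True while the live window's mean stays near the approved baseline."""
        if not self.window:
            return True
        live_mean = sum(self.window) / len(self.window)
        return abs(live_mean - self.baseline_mean) <= self.tolerance

# Illustrative usage with made-up scores.
baseline = BehavioralBaseline(approved_scores=[0.60] * 50, window=10, tolerance=0.10)
for score in [0.58, 0.61, 0.62]:
    baseline.observe(score)
ok = baseline.within_envelope()  # live mean is close to 0.60: still inside
```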



<p>Third, they need policy engines that can consume behavioral claims, not just identity claims.</p>



<p>Enterprises already know how to issue short-lived credentials to workloads and how to evaluate machine identity continuously. The next step is to not only bind legitimacy to workload provenance but continuously refresh behavioral validity.</p>



<p>The important shift is conceptual as much as technical. Authorization should no longer mean only “This workload is permitted to operate.” It should mean “This workload is permitted to operate while its current behavior remains within the bounds that justified access.”</p>



<h2 class="wp-block-heading">The missing runtime control layer</h2>



<p>Regulators and standards bodies increasingly assume lifecycle oversight for AI systems. Most organizations cannot yet deliver that for autonomous agents. This is not organizational immaturity. It is an architectural limitation. The control mechanisms most enterprises rely on were built for software whose operational identity remains stable between release events. Autonomous agents do not behave that way.</p>



<p>Behavioral continuity is the missing signal.</p>



<p>The problem is not that agents lack credentials. It is that current credentials attest too little. They establish administrative identity, but say nothing about whether the runtime system still behaves like the one that was approved.</p>



<p>Until enterprise authorization architectures can account for that distinction, they will continue to confuse administrative continuity with operational trust.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/behavioral-credentials-why-static-authorization-fails-autonomous-agents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Don&#8217;t Blame the Model</title>
		<link>https://www.oreilly.com/radar/dont-blame-the-model/</link>
				<comments>https://www.oreilly.com/radar/dont-blame-the-model/#respond</comments>
				<pubDate>Wed, 22 Apr 2026 11:15:02 +0000</pubDate>
					<dc:creator><![CDATA[Sruly Rosenblat]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18598</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-blame-the-model-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="1396" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-blame-the-model-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Current LLM infrastructure artificially limits developer control and system reliability.]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Asimov&#8217;s Addendum Substack and is being republished here with the author&#8217;s permission. Are LLMs reliable? LLMs have built up a reputation for being unreliable. Small changes in the input can lead to massive changes in the output. The same prompt run twice can give different or contradictory answers. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on the </em><a href="https://asimovaddendum.substack.com/p/dont-blame-the-model" target="_blank" rel="noreferrer noopener">Asimov&#8217;s Addendum</a><em> Substack and is being republished here with the author&#8217;s permission.</em></p>
</blockquote>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1584" height="880" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17.png" alt="" class="wp-image-18599" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17.png 1584w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-768x427.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-1536x853.png 1536w" sizes="auto, (max-width: 1584px) 100vw, 1584px" /><figcaption class="wp-element-caption">A rambling response to what Claude itself deemed a &#8220;straightforward query&#8221; with clear formatting requirements.</figcaption></figure>



<h2 class="wp-block-heading"><strong>Are LLMs reliable?</strong></h2>



<p>LLMs have built up a reputation for being <a href="https://arxiv.org/abs/2602.16666" target="_blank" rel="noreferrer noopener">unreliable</a>.<sup data-fn="e466c61e-ae14-40cb-b857-e573d99ccead" class="fn"><a href="#e466c61e-ae14-40cb-b857-e573d99ccead" id="e466c61e-ae14-40cb-b857-e573d99ccead-link">1</a></sup> Small changes in the input can lead to massive changes in the output. The same prompt run twice can give different or contradictory answers. Models often struggle to stick to a specified format unless the prompt is worded just right. And it&#8217;s hard to tell when a model is confident in its answer or if it could just as easily have gone the other way.</p>



<p>It is easy to blame the model for all of these reliability failures. But the API endpoint and surrounding tooling matter too. Model providers limit the kinds of interactions developers can have with a model, as well as the outputs that the model can provide, by restricting what their APIs expose to developers and third-party companies. Things like the full chain-of-thought and the <a href="https://developers.openai.com/cookbook/examples/using_logprobs/" target="_blank" rel="noreferrer noopener">logprobs</a> (the probabilities of all possible options for the next token) are hidden from developers, while advanced tools for ensuring reliability, like constrained decoding and prefilling, are not made available. All of these features are easily available with open weight models and are inherent to the way LLMs work.</p>



<p>Every decision a model provider makes about which tools and outputs to expose through its API is not just an architectural choice but also a policy decision. Providers directly determine what level of control and reliability developers have access to. This has implications for what apps can be built, how reliable a system is in practice, and how well a developer can steer results.</p>



<h2 class="wp-block-heading"><strong>The artificial limits on input</strong></h2>



<p>Modern LLMs are usually built around <a href="https://asimovaddendum.substack.com/p/chat-templates" target="_blank" rel="noreferrer noopener">chat templates</a>. Every input and output, with the exception of tool calls and system or developer messages, is filtered through a conversation between a user and an assistant—instructions are given as user messages; responses are returned as assistant messages. This becomes extremely evident when looking at how modern LLM APIs work. The completions API, an endpoint originally released by OpenAI and widely adopted across the industry (including by several open model providers like <a href="https://openrouter.ai/docs/quickstart" target="_blank" rel="noreferrer noopener">OpenRouter</a> and <a href="https://www.together.ai/" target="_blank" rel="noreferrer noopener">Together AI</a>), takes input in the form of user and assistant messages and outputs the next message.<sup data-fn="781dc927-8c7e-4c41-aac9-00889eaf03fb" class="fn"><a href="#781dc927-8c7e-4c41-aac9-00889eaf03fb" id="781dc927-8c7e-4c41-aac9-00889eaf03fb-link">2</a></sup></p>



<p>The focus on a chat interface in an API has its benefits. It makes it easy for developers to reason about input and output being completely separate. But chat APIs do more than just use a chat template under the hood; they actively limit what third-party developers can control.</p>



<p>When interacting with LLMs through an API, the boundary between input and output is often a firm one. A developer sets previous messages, but they usually cannot prefill a model&#8217;s response, meaning developers cannot force a model to begin a response with a certain sentence or paragraph.<sup data-fn="513201f2-f828-4c4d-bd7b-ad97f0eeee6f" class="fn"><a href="#513201f2-f828-4c4d-bd7b-ad97f0eeee6f" id="513201f2-f828-4c4d-bd7b-ad97f0eeee6f-link">3</a></sup> This has real-world implications for people building with LLMs. Without the ability to prefill, it becomes much harder to control the preamble. If you know the model needs to start its answer in a certain way, it&#8217;s inefficient and risky to not enforce it at the token level.<sup data-fn="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a" class="fn"><a href="#426c4ca4-0e2d-441e-aa33-1fe6a769eb7a" id="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a-link">4</a></sup> And the limitations extend beyond just the start of a response. Without the ability to prefill answers, you also lose the ability to partially regenerate answers if only part of the answer is wrong.<sup data-fn="6fdedda7-a853-4e57-b483-ff6cede3d0c3" class="fn"><a href="#6fdedda7-a853-4e57-b483-ff6cede3d0c3" id="6fdedda7-a853-4e57-b483-ff6cede3d0c3-link">5</a></sup></p>
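
<p>With an open weight model, prefilling is mechanically simple: The chat template is rendered with the final assistant turn left open, so generation continues from the supplied prefix rather than starting fresh. A minimal sketch of that rendering step, using illustrative template markers (real models define their own):</p>

```python
def build_prompt(messages, prefill=""):
    """Flatten chat messages into a single prompt string, leaving the final
    assistant turn open so the model continues from the prefilled text.
    The <|role|>/<|end|> markers are illustrative, not any real model's format."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    # The assistant turn is opened but NOT closed: generation resumes here.
    parts.append(f"<|assistant|>\n{prefill}")
    return "".join(parts)

prompt = build_prompt(
    [{"role": "user", "content": "List three risks of static authorization."}],
    prefill="1. ",
)
# The model now has no choice but to continue a numbered list.
```

A closed chat API hides this rendering step entirely, which is exactly why third-party developers cannot reach it.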



<p>Another deficiency that is particularly visible is how the model&#8217;s chain-of-thought reasoning is handled. Most large AI companies have made a <a href="https://asimovaddendum.substack.com/p/making-ais-thinking-more-transparent" target="_blank" rel="noreferrer noopener">habit of hiding the models&#8217; reasoning</a> tokens from the user (and only showing summaries), reportedly to guard against distillation and to let the model reason uncensored (for AI safety reasons). This has second-order effects, one of which is the strict separation of reasoning from messages. None of the major model providers let you prefill or write your own reasoning tokens. Instead you need to rely on the model&#8217;s own reasoning and cannot reuse reasoning traces to regenerate the same message.</p>



<p>There are legitimate reasons for not allowing prefilling. It could be argued that allowing prefilling will greatly increase the <a href="https://www.reddit.com/r/MachineLearning/comments/1reajw4/https://arxiv.org/abs/2602.14689/" target="_blank" rel="noreferrer noopener">attack surface</a> for prompt injections. One study found that prefill attacks work very well against even state-of-the-art open weight models. But in practice, the model is not the only line of defense against attackers. Many companies already run prompts against classification models to find prompt injections, and the same type of safeguard could also be used against prefill attack attempts.</p>



<h2 class="wp-block-heading"><strong>Output with few controls</strong></h2>



<p>Prefilling is not the only casualty of a clean separation between input and output. Even within a message, there are levers that are available on a local open weight model that just aren&#8217;t possible when using a standard API. This matters because these controls allow developers to preemptively validate outputs and ensure that responses follow a certain structure, both decreasing variability and improving reliability. For example, most LLM APIs support something they call structured output, a mode that forces the model to generate output in a given JSON format; however, structured output does not inherently need to be limited to JSON.<sup data-fn="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10" class="fn"><a href="#bde449ec-835a-4ae5-b7e5-a2e7e8ebca10" id="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10-link">6</a></sup> That same technique, <a href="https://medium.com/@docherty/controlling-your-llm-deep-dive-into-constrained-generation-1e561c736a20" target="_blank" rel="noreferrer noopener">constrained decoding</a>, or limiting the tokens the model can produce at any time, could be used for much more than that. It could be used to generate XML, have the model fill in blanks Mad Libs-style, force the model to write a story without <a href="https://www.youtube.com/watch?v=qVjDSOa7BZ0" target="_blank" rel="noreferrer noopener">using certain letters</a>, or <a href="https://aclanthology.org/2025.mathnlp-main.11/" target="_blank" rel="noreferrer noopener">even enforce valid chess moves</a> at inference time. It&#8217;s a powerful feature that allows developers to precisely define what output is acceptable and what isn&#8217;t—ensuring reliable output that meets the developer’s parameters.</p>
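
<p>A toy implementation shows how little machinery constrained decoding requires: At each step, mask out every candidate token that would violate the constraint, then sample (here, greedily pick) from whatever remains. The vocabulary, logit values, and constraint below are all made up for illustration:</p>

```python
def constrained_greedy(logits_by_step, allowed):
    """Greedy decoding under a constraint: at each step, discard any token
    that the `allowed` predicate rejects given the text produced so far."""
    out = ""
    for logits in logits_by_step:  # one {token: logit} dict per decoding step
        candidates = {t: score for t, score in logits.items() if allowed(out, t)}
        if not candidates:
            break  # constraint unsatisfiable from this prefix
        token = max(candidates, key=candidates.get)
        out += token
    return out

# Toy example: force digits only, even though the raw logits prefer words.
steps = [
    {"4": 0.1, "four": 2.0},
    {"2": 0.3, "two": 1.5},
]
digits_only = lambda prefix, tok: tok.isdigit()
result = constrained_greedy(steps, digits_only)  # "42"
```

Real constrained-decoding libraries replace the predicate with a compiled regex or grammar automaton over the model's actual vocabulary, but the principle, masking logits before sampling, is the same.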



<p>The reason for this is likely that LLM APIs are built for a wide range of developers, most of whom use the model for simple chat-related purposes. APIs were not designed to give developers full control over output because not everyone needs or wants that complexity. But that&#8217;s not an argument against including these features; it&#8217;s only an argument for multiple endpoints. Many companies already have multiple supported endpoints: OpenAI has the “completions” and “responses” APIs, while Google has the “generate content” and “interactions” APIs. It&#8217;s not infeasible for them to make a third, more-advanced endpoint.</p>



<h2 class="wp-block-heading"><strong>A lack of visibility</strong></h2>



<p>Even the model output that third-party developers do get via the model’s API is often a watered-down version of the output the model gives. LLMs don&#8217;t just pick one token at a time; at each step they compute logprobs, a probability distribution over every candidate next token. When using an API, however, <a href="https://developers.googleblog.com/unlock-gemini-reasoning-with-logprobs-on-vertex-ai/" target="_blank" rel="noreferrer noopener">Google</a> only provides the top 20 most likely logprobs. OpenAI <a href="https://www.linkedin.com/posts/stevecosman_join-over-5000-people-using-kiln-activity-7359368275312496640-4Qq_/" target="_blank" rel="noreferrer noopener">no longer</a> provides any logprobs for GPT 5 models, while <a href="https://www.linkedin.com/posts/gihangamage2015_logprobs-is-one-of-the-most-valuable-features-activity-7370446834277752832-7SGX/" target="_blank" rel="noreferrer noopener">Anthropic has never provided any</a> at all. This has real-world consequences for reliability. <strong>Log probabilities are one of the most useful signals a developer has for understanding model confidence</strong>. When a model assigns nearly equal probability to competing tokens, that uncertainty itself is meaningful information. And even for those companies that provide the top 20 tokens, that is often not enough to cover larger classification tasks.</p>
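
<p>To illustrate why logprobs are such a useful confidence signal, here is a small sketch (the logprob values are made up) that turns per-token logprobs into a top-token probability and a margin over the runner-up, the kind of signal a developer could use to route uncertain answers to a human:</p>

```python
import math

def confidence(top_logprobs):
    """Summarize model confidence at one decoding step from its logprobs:
    the top token's probability and its margin over the runner-up."""
    probs = sorted((math.exp(lp) for lp in top_logprobs.values()), reverse=True)
    return {"top_p": probs[0], "margin": probs[0] - probs[1]}

# A decisive step versus a near coin flip (values are illustrative).
decisive = confidence({"yes": -0.05, "no": -3.20})
coin_flip = confidence({"yes": -0.70, "no": -0.72})
# Both calls return an answer token, but only the logprobs reveal that the
# second one could just as easily have gone the other way.
```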



<p>When it comes to reasoning tokens, even less information is provided. Major providers such as <a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking" target="_blank" rel="noreferrer noopener">Anthropic</a>,<sup data-fn="ef956282-0b15-48b9-b14c-83689add0c01" class="fn"><a href="#ef956282-0b15-48b9-b14c-83689add0c01" id="ef956282-0b15-48b9-b14c-83689add0c01-link">7</a></sup> <a href="https://ai.google.dev/gemini-api/docs/thinking" target="_blank" rel="noreferrer noopener">Google</a>, and <a href="https://developers.openai.com/api/docs/guides/reasoning/" target="_blank" rel="noreferrer noopener">OpenAI</a><sup data-fn="f46da38c-a1fa-48c2-a781-a2e2fc10fd99" class="fn"><a href="#f46da38c-a1fa-48c2-a781-a2e2fc10fd99" id="f46da38c-a1fa-48c2-a781-a2e2fc10fd99-link">8</a></sup> only provide summarized thinking for their proprietary models. And OpenAI supplies even that only after the developer provides a valid government ID. This not only prevents the user from truly inspecting how a model arrived at a certain answer; it also limits the developer&#8217;s ability to diagnose why a query failed. <strong>When a model gives a wrong answer, a full reasoning trace tells you whether it misunderstood the question, made a faulty logical step, or simply got unlucky at the final token</strong>. A summary obscures some of that, only providing an approximation of what actually happened. This is not an issue with the model—the model is still generating its full reasoning trace. It&#8217;s an issue with what information is provided to the end developer.</p>



<p>The case for not including logprobs and reasoning tokens is similar. The risk of distillation increases with the amount of information that the API returns. It&#8217;s hard to distill on tokens you cannot see, and without giving logprobs, the distillation will take longer and each example will provide less information.<sup data-fn="a3562aab-e9ad-488d-8a7a-5ffa92d88626" class="fn"><a href="#a3562aab-e9ad-488d-8a7a-5ffa92d88626" id="a3562aab-e9ad-488d-8a7a-5ffa92d88626-link">9</a></sup> And this risk is something that AI companies need to consider carefully, since distillation is a powerful technique to mimic the abilities of strong models for a cheap price. But there are also risks in not providing this information to users. DeepSeek R1, despite being deemed a <a href="https://www.csis.org/analysis/delving-dangers-deepseek" target="_blank" rel="noreferrer noopener">national security risk</a> by many, still shot straight to the top of <a href="https://www.scientificamerican.com/article/why-deepseeks-ai-model-just-became-the-top-rated-app-in-the-u-s/" target="_blank" rel="noreferrer noopener">US app stores upon release</a> and is used by <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12363671/">many</a> <a href="https://www.nature.com/articles/d41586-025-00275-0" target="_blank" rel="noreferrer noopener">researchers and scientists</a>, in large part due to its openness. And in a world where open models are getting more and more powerful, not giving developers proper access to a model&#8217;s outputs could mean losing developers to cheaper and more open alternatives.</p>



<h2 class="wp-block-heading"><strong>Reliability requires control and visibility</strong></h2>



<p>The reliability problems of current LLMs do not stem only from the models themselves but also from the tooling that providers give developers. For local open weight models it is usually possible to trade off complexity for reliability. The entire reasoning trace is always available and logprobs are fully transparent, allowing the developer to examine how an answer was arrived at. User and AI messages can be edited or generated at the developer’s discretion, and constrained decoding could be used to produce text that follows any arbitrary format. For closed weight models, this is becoming less and less the case. The decisions made around what features to restrict in APIs hurt developers and ultimately end users.</p>



<p>LLMs are increasingly being used in high-stakes situations such as medicine or law, and developers need tools to handle that risk responsibly. There are few technical barriers to providing more control and visibility to developers. Many of the most high-impact improvements, such as showing thinking output, allowing prefilling, or exposing logprobs, cost almost nothing but would be a meaningful step toward making LLMs more controllable, consistent, and reliable.</p>



<p>There is a place for a clean and simple API, and there is some merit to concerns about distillation, but this shouldn’t be used as an excuse to take away important tools for diagnosing and fixing reliability problems. When models get used in high-stakes situations, as they increasingly are, failure to take reliability seriously is an <a href="https://www.ssrc.org/publications/real-world-gaps-in-ai-governance-research/" target="_blank" rel="noreferrer noopener">AI safety concern</a>.</p>



<p>Specifically, to take reliability seriously, model providers should improve their APIs by adding features that give developers more visibility and control over model output. Reasoning should be provided in full at all times, with any safety violations handled the same way that they would have been handled in the final answer. Model providers should resume providing at least the top 20 logprobs, over the entire output (reasoning included), so that developers have some visibility into how confident the model is in its answer. Constrained decoding should be extended beyond JSON and should support arbitrary grammars via something like <a href="https://en.wikipedia.org/wiki/Regular_expression" target="_blank" rel="noreferrer noopener">regex</a> or <a href="https://en.wikipedia.org/wiki/Context-free_grammar" target="_blank" rel="noreferrer noopener">formal grammars</a>.<sup data-fn="ba642ef1-e725-4c62-b790-d5992b9f364f" class="fn"><a href="#ba642ef1-e725-4c62-b790-d5992b9f364f" id="ba642ef1-e725-4c62-b790-d5992b9f364f-link">10</a></sup> Developers should be granted full control over “assistant” output—they should be able to prefill model answers, stop responses mid-generation, and branch them at will. Even if not all of these features make sense over the standard API, nothing is stopping model providers from building a new, more advanced API. They have done it before. The decision to withhold these features is a policy choice, not a technical limitation.</p>



<p>Improving intelligence is not the only way to improve reliability and control, but it is usually the only lever that gets pulled.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="e466c61e-ae14-40cb-b857-e573d99ccead">Thank you to Ilan Strauss, Sean Goedecke, Tim O’Reilly, and Mike Loukides for their helpful feedback on an earlier draft. <a href="#e466c61e-ae14-40cb-b857-e573d99ccead-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="781dc927-8c7e-4c41-aac9-00889eaf03fb">OpenAI has since moved on from the completions API but the new responses API also heavily enforces the separation of user and assistant messages. <a href="#781dc927-8c7e-4c41-aac9-00889eaf03fb-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="513201f2-f828-4c4d-bd7b-ad97f0eeee6f">Anthropic&#8217;s API supported prefill up until they launched their Claude 4.6 models; <a href="https://news.ycombinator.com/item?id=46902630" target="_blank" rel="noreferrer noopener">it is no longer supported for new models</a>. <a href="#513201f2-f828-4c4d-bd7b-ad97f0eeee6f-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a">Interestingly models have been shown to possess the <a href="https://www.lesswrong.com/posts/jsFGuXDMxy5NZg9T2/prefill-awareness-can-llms-tell-when-their-message-history" target="_blank" rel="noreferrer noopener">ability to tell</a> when a response has been prefilled. 
<a href="#426c4ca4-0e2d-441e-aa33-1fe6a769eb7a-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="6fdedda7-a853-4e57-b483-ff6cede3d0c3">This technique is used in an efficient approximation of best of N called <a href="https://zanette-labs.github.io/SpeculativeRejection/" target="_blank" rel="noreferrer noopener">speculative rejection</a>. <a href="#6fdedda7-a853-4e57-b483-ff6cede3d0c3-link" aria-label="Jump to footnote reference 5"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10">Forcing the model to generate in JSON may actually <a href="https://aider.chat/2024/08/14/code-in-json.html" target="_blank" rel="noreferrer noopener">hurt performance</a>. <a href="#bde449ec-835a-4ae5-b7e5-a2e7e8ebca10-link" aria-label="Jump to footnote reference 6"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ef956282-0b15-48b9-b14c-83689add0c01">Anthropic used to provide full reasoning tokens but <a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking" target="_blank" rel="noreferrer noopener">stopped</a> with their newer models. <a href="#ef956282-0b15-48b9-b14c-83689add0c01-link" aria-label="Jump to footnote reference 7"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="f46da38c-a1fa-48c2-a781-a2e2fc10fd99">OpenAI’s responses endpoint <a href="https://www.seangoedecke.com/responses-api/" target="_blank" rel="noreferrer noopener">may have been created</a> in part to hide the reasoning mode. 
<a href="#f46da38c-a1fa-48c2-a781-a2e2fc10fd99-link" aria-label="Jump to footnote reference 8"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="a3562aab-e9ad-488d-8a7a-5ffa92d88626">Distillation using top-K probabilities is possible, but it is <a href="https://arxiv.org/abs/2503.16870" target="_blank" rel="noreferrer noopener">suboptimal</a>. <a href="#a3562aab-e9ad-488d-8a7a-5ffa92d88626-link" aria-label="Jump to footnote reference 9"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ba642ef1-e725-4c62-b790-d5992b9f364f">Regular expressions, while flexible, are not perfect and cannot express recursive or nested structures such as valid JSON. However, open source LLM libraries like <a href="https://github.com/guidance-ai/guidance" target="_blank" rel="noreferrer noopener">Guidance</a> and <a href="https://github.com/dottxt-ai/outlines" target="_blank" rel="noreferrer noopener">Outlines</a> support recursive structures at the cost of added complexity. <a href="#ba642ef1-e725-4c62-b790-d5992b9f364f-link" aria-label="Jump to footnote reference 10"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/dont-blame-the-model/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Dark Factories: Rise of the Trycycle</title>
		<link>https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/</link>
				<comments>https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/#respond</comments>
				<pubDate>Tue, 21 Apr 2026 11:24:26 +0000</pubDate>
					<dc:creator><![CDATA[Dan Shapiro]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18589</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dark-factories—rise-of-the-trycycle.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dark-factories—rise-of-the-trycycle-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on &#8220;Dan Shapiro&#8217;s blog&#8221; and is being reposted here with the author&#8217;s permission. Companies are now producing dark factories—engines that turn specs into shipping software. The implementations can be complex and sometimes involve Mad Max metaphors. But they don’t have to be like that. If you want a five-minute factory, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on &#8220;<a href="https://www.danshapiro.com/blog/2026/03/dark-factories-rise-of-the-trycycle/" target="_blank" rel="noreferrer noopener">Dan Shapiro&#8217;s blog</a>&#8221; and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p>Companies are now producing <a href="https://www.danshapiro.com/blog/2026/02/you-dont-write-the-code/" target="_blank" rel="noreferrer noopener">dark factories</a>—engines that turn specs into shipping software. The implementations can be complex and sometimes involve <em>Mad Max</em> metaphors. But they don’t have to be like that. <strong>If you want a five-minute factory, jump to </strong><a href="http://trycycle.ai/" target="_blank" rel="noreferrer noopener"><strong>Trycycle</strong></a><strong> at the bottom.</strong></p>



<h2 class="wp-block-heading">The engine in the factory</h2>



<p>Deep in their souls, dark factories are all built on the same simple breakthrough: <em>AI gets better when you do more of it</em>.</p>



<p>How do you do “more AI” effectively? Software factories use two patterns. One of them I’ve already told you about—<a href="https://www.danshapiro.com/blog/2025/10/slot-machine-development/" target="_blank" rel="noreferrer noopener">slot machine development</a>. Instead of asking one AI, you ask three at once, and choose the best one. It feels wasteful, but it gives better results than any model could alone.</p>



<p>Does three models at a time seem wasteful? Well, wait until you meet the other pattern: the trycycle.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="415" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13.png" alt="The simplest trycycle" class="wp-image-18590" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-300x78.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-768x199.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-1536x398.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>The simplest trycycle</em></figcaption></figure>



<p>It seems trivial, but it’s an unstoppable bulldozer that can bury any problem with time and tokens. And of course, you can combine it with slot machine development for a truly formidable tool.</p>



<p>Every software factory has a trycycle at its heart. Some of them are just surrounded by deacons and digraphs.</p>



<p>(And as a side note, they’re all more fun with <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>, which is free and open source and makes managing agents a joy!)</p>



<p>Let’s meet the factories, shall we?</p>



<h2 class="wp-block-heading">Gas Town</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="528" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14.png" alt="Gas Town AI image" class="wp-image-18591" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-300x99.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-768x253.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-1536x507.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p>Steve Yegge saw this coming like a war rig down a cul-de-sac. His factory, Gas Town, dropped the day after New Year’s, and I was submitting PRs before the code was dry. It launched as a beautiful disaster, with mayors, convoys, and polecats fighting for guzzoline in the desert of your CPU. It’s now graduated to a <a href="https://steve-yegge.medium.com/welcome-to-the-wasteland-a-thousand-gas-towns-a5eb9bc8dc1f" target="_blank" rel="noreferrer noopener">fully fledged MMORPG for writing code</a>. It’s amazing, it’s effective, and it’s pioneering in a fully <em>Westworld</em> sort of way.</p>



<h2 class="wp-block-heading">The StrongDM Attractor</h2>



<p>Justin McCarthy, the CTO of StrongDM, talks about the factory as a feedback loop. It used to be that when a model was fed its own output, it would fix 9 things and break 10—like a busy and productive company that was losing just a bit of money on every transaction. But sometime last year, the models crossed an invisible threshold of mediocrity and went from slightly lossy to slightly gainy. They started getting better with each cycle.</p>



<p>Justin’s team noticed and built the StrongDM attractor to cash in.</p>



<p>If Gas Town is <em>Mad Max</em>, StrongDM is <em>Factorio</em>: an infinitely flexible, wildly powerful system for constructing exactly the factory you need.</p>



<p>But the StrongDM team did something interesting: They didn’t ship their factory. Instead, they shipped <a href="https://factory.strongdm.ai/products/attractor" target="_blank" rel="noreferrer noopener">the specification for the Attractor</a> so everyone can implement their own.</p>



<p>And you can absolutely implement your own! But you can also just steal the one I made for you.</p>



<h2 class="wp-block-heading">Kilroy</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="737" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15.png" alt="Kilroy image" class="wp-image-18592" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-300x138.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-768x354.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-1536x708.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p><a href="https://github.com/danshapiro/kilroy" target="_blank" rel="noreferrer noopener"><strong>Kilroy</strong></a> is a StrongDM Attractor written in Go (although it works with projects in any language). It has all the flexibility of the Attractor design, but it also ships with an actual functioning factory configuration, tests, sample files, and other things that make it more likely to work.</p>



<p>In theory, you don’t need Kilroy—you can just point Claude Code or Codex CLI at the Attractor specification and burn some tokens. <a href="https://2389.ai/posts/the-dark-factory-is-a-dot-file/" target="_blank" rel="noreferrer noopener">My friend Harper built three</a> (and you should read his post for some meditations on where the Attractor approach is heading).</p>



<p>In practice, it took the better part of a month for me and some wonderful contributors to polish up Kilroy to the point where it is now, so you may save yourself some time, tokens, and effort by just stealing this.</p>



<h2 class="wp-block-heading">Enter the trycycle</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1584" height="672" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16.png" alt="trycycle image" class="wp-image-18593" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16.png 1584w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-300x127.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-768x326.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-1536x652.png 1536w" sizes="auto, (max-width: 1584px) 100vw, 1584px" /></figure>



<p>The other night I was carefully building the dotfiles and runfiles for a Kilroy project—configuring the factory to build the project.</p>



<p>Then a thought struck.</p>



<p>What if this was just a skill?</p>



<p>Enter <a href="http://trycycle.ai/" target="_blank" rel="noreferrer noopener">Trycycle</a>, the very simplest trycycle. It’s a small skill for Claude Code and Codex CLI that implements the pattern in plain English.</p>



<ol class="wp-block-list">
<li>Define the problem.</li>



<li>Write a plan.</li>



<li>Is the plan perfect? If not, try again.</li>



<li>Implement the plan.</li>



<li>Is the implementation perfect? If not, try again.</li>
</ol>



<p>That’s basically it. To use it, you open your favorite coding agent and say, “Use Trycycle to do the thing.” Then sit back and watch the tokens fly.</p>
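<p>The loop reads almost directly as code. Here is a toy sketch with stand-in callables for the agent’s plan, review, and implement steps (the names are illustrative, not part of the actual Trycycle skill).</p>

```python
# Toy sketch of the trycycle loop; the callables are hypothetical stand-ins
# for a coding agent's plan / review / implement steps.
def trycycle(define, plan, plan_ok, implement, impl_ok, max_tries=10):
    problem = define()                      # 1. Define the problem.
    for _ in range(max_tries):
        p = plan(problem)                   # 2. Write a plan.
        if plan_ok(p):                      # 3. Is the plan perfect? If not, try again.
            break
    for _ in range(max_tries):
        result = implement(p)               # 4. Implement the plan.
        if impl_ok(result):                 # 5. Is the implementation perfect? If not, try again.
            return result
    raise RuntimeError("no acceptable implementation after max_tries")
```

The whole trick is that each pass back through <code>plan</code> or <code>implement</code> starts from a slightly better state, so the loop converges instead of thrashing.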



<p>It’s simple because it’s just a skill. Under the hood, it adapts <a href="https://blog.fsck.com/" target="_blank" rel="noreferrer noopener">Jesse Vincent</a>’s amazing <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> for plan writing and executing. It will take you literally minutes to get started. Just paste this into your agent and you’re off to the three-wheel races.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><code>Hey agent! Go here and follow the installation instructions.</code><br><code>https://raw.githubusercontent.com/danshapiro/trycycle/main/README.md</code></p>
</blockquote>



<p>Trycycle is barely 24 hours old as of the time of this writing. I’ve shipped well over a dozen features with it already, and I was in meetings most of the day. While I was having dinner, it ported Rogue to Wasm(!). Last night it churned for 7 hours and 56 minutes and landed six features for <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>.</p>



<p>The best part, though, is that because it’s just a skill, it’s instantly part of your dev flow. There’s no configuration or learning curve. If you want to understand it better, just ask. If you don’t like what it’s doing, have stern words.</p>



<h2 class="wp-block-heading">Which one to use?</h2>



<p>Here’s how I’d decide right now.</p>



<p>If you want to become part of a <strong>growing movement of collaborators</strong> burning tokens together to build software, individually and collectively—try <a href="https://steve-yegge.medium.com/welcome-to-the-wasteland-a-thousand-gas-towns-a5eb9bc8dc1f" target="_blank" rel="noreferrer noopener">Gas Town</a>.</p>



<p>If you want to invest in building a <strong>powerful, configurable, sophisticated engine</strong> that can drive your projects forward 24 hours a day—try <a href="https://github.com/danshapiro/kilroy" target="_blank" rel="noreferrer noopener">Kilroy</a>.</p>



<p>If you just want to <strong>get things done right now</strong>, give <a href="https://github.com/danshapiro/trycycle" target="_blank" rel="noreferrer noopener">Trycycle</a> a spin. Heck, it’s fast enough that you can spin up a trycycle while you read the docs on Kilroy and Gas Town.</p>



<p>And whatever you choose, I recommend you do it with <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>, because it’s just more delightful that way!</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Thanks to </em><a href="http://harperreed.com/" target="_blank" rel="noreferrer noopener"><em>Harper Reed</em></a><em>, </em><a href="https://steve-yegge.medium.com/" target="_blank" rel="noreferrer noopener"><em>Steve Yegge</em></a><em>, </em><a href="http://fsck.com/" target="_blank" rel="noreferrer noopener"><em>Jesse Vincent</em></a><em>, </em><a href="http://remixpartners.ai/" target="_blank" rel="noreferrer noopener"><em>Justin Massa</em></a><em>, </em><a href="https://nathan.torkington.com/" target="_blank" rel="noreferrer noopener"><em>Nat Torkington</em></a><em>, </em><a href="https://vibes.diy/" target="_blank" rel="noreferrer noopener"><em>Marcus Estes</em></a><em>, and </em><a href="https://www.linkedin.com/in/arjun-singh-629216105/" target="_blank" rel="noreferrer noopener"><em>Arjun Singh</em></a><em> for reading drafts of this.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Scenario Planning for AI and the “Jobless Future”</title>
		<link>https://www.oreilly.com/radar/scenario-planning-for-ai-and-the-jobless-future/</link>
				<comments>https://www.oreilly.com/radar/scenario-planning-for-ai-and-the-jobless-future/#respond</comments>
				<pubDate>Mon, 20 Apr 2026 10:41:09 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18531</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Scenario-planning-for-an-uncertain-AI-future.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Scenario-planning-for-an-uncertain-AI-future-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[We all read it in the daily news. The New York Times reports that economists who once dismissed the AI job threat are now taking it seriously. In February, Jack Dorsey cut 40% of Block&#8217;s workforce, telling shareholders that &#8220;intelligence tools have changed what it means to build and run a company.&#8221; Block&#8217;s stock rose [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>We all read it in the daily news. <em>The New York Times</em> reports that <a href="https://www.nytimes.com/2026/04/03/business/economists-once-dismissed-the-ai-job-threat-but-not-anymore.html" target="_blank" rel="noreferrer noopener">economists who once dismissed the AI job threat are now taking it seriously</a>. In February, Jack Dorsey <a href="https://fortune.com/2026/02/27/block-jack-dorsey-ceo-xyz-stock-square-4000-ai-layoffs/" target="_blank" rel="noreferrer noopener">cut 40% of Block&#8217;s workforce</a>, telling shareholders that &#8220;intelligence tools have changed what it means to build and run a company.&#8221; Block&#8217;s stock rose 20%. Salesforce has <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">shed thousands of customer support workers</a>, saying AI was already doing half the work. And a <a href="https://fortune.com/2026/02/28/ai-scare-trade-mass-layoffs-white-collar-recession-citrini-shumer-viral-doomsday-essays/" target="_blank" rel="noreferrer noopener">Stanford study</a> found that software developers aged 22 to 25 saw employment drop nearly 20% from its peak, while developers over 26 were doing fine.</p>



<p>But how are we to square this news with a <a href="https://greatleadership.substack.com/p/the-ai-transition-five-year-crisis" target="_blank" rel="noreferrer noopener">Vanguard study</a> that found that the 100 occupations most exposed to AI were actually outperforming the rest of the labor market in both job growth and wages, and a <a href="https://greatleadership.substack.com/p/the-ai-transition-five-year-crisis" target="_blank" rel="noreferrer noopener">rigorous NBER study</a> of 25,000 Danish workers that found zero measurable effect of AI on earnings or hours?</p>



<p>Other studies could contribute to either side of the argument. For example, <a href="https://www.pwc.com/gx/en/services/ai/ai-jobs-barometer.html" target="_blank" rel="noreferrer noopener">PwC&#8217;s 2025 Global AI Jobs Barometer</a>, analyzing close to a billion job ads across six continents, found that workers with AI skills earn a 56% wage premium, and that productivity growth has nearly quadrupled in the industries most exposed to AI.</p>



<p>This is exactly the kind of contradictory, uncertain landscape that scenario planning was designed for. Scenario planning doesn’t ask you to predict what the future will be. It asks you to imagine divergent possible futures and to develop a strategy that improves your odds of success across all of them. I&#8217;ve used it many times at O’Reilly and have written about it before with <a href="https://www.oreilly.com/tim/21stcentury/" target="_blank" rel="noreferrer noopener">COVID</a> and <a href="https://www.oreilly.com/tim/wtf-book.html" target="_blank" rel="noreferrer noopener">climate change</a> as illustrative examples. The argument between those who say AI will cause mass unemployment and those who insist technology always creates more jobs than it destroys is a debate that will only be resolved by time. Both sides have evidence. Both are probably right at some level. And both framings are not terribly helpful for anyone trying to figure out what to do next.</p>



<p>In a scenario planning exercise, you identify two key uncertainties and draw them as crossing vectors, dividing the possibility space into four quadrants. Each quadrant describes a different future. The power of the technique is that you don&#8217;t bet on one quadrant. You look for actions that make the most sense across all of them. And you’re not limited to doing this for only one uncertainty. You can repeat the exercise multiple times, each time expanding your sense of possible futures and clarifying your convictions about the most robust strategies for adapting to them.</p>



<p>For AI and jobs, the most obvious crossing vectors to model might seem to be how fast AI grows in its ability to replace human work and how quickly that capability is adopted. This is, in effect, scenario planning about whether the “<a href="https://ai-2027.com/" target="_blank" rel="noreferrer noopener">AI is unprecedented</a>” or “<a href="https://knightcolumbia.org/content/ai-as-normal-technology" target="_blank" rel="noreferrer noopener">AI is normal technology</a>” camp is correct. That might well be a useful pair of axes.</p>



<p>There’s no question that AI capability is accelerating. <a href="https://almcorp.com/blog/ai-job-displacement-statistics/" target="_blank" rel="noreferrer noopener">SWE-Bench scores</a> for coding went from solving 4.4% of problems in 2023 to 71.7% in 2024, and we saw what was widely described as <a href="https://medium.com/@NMitchem/something-flipped-in-december-423e8b808262" target="_blank" rel="noreferrer noopener">a “step change”</a> beyond that in December of 2025. <a href="https://www.nytimes.com/2026/04/07/technology/anthropic-claims-its-new-ai-model-mythos-is-a-cybersecurity-reckoning.html" target="_blank" rel="noreferrer noopener">Anthropic’s new Mythos model seems to have upped AI capabilities</a> even further. Even before Mythos, <a href="https://almcorp.com/blog/ai-job-displacement-statistics/" target="_blank" rel="noreferrer noopener">McKinsey estimated</a> that today&#8217;s technology could in theory automate roughly 57% of current US work hours. But capability is not adoption. Goldman Sachs notes that AI appears to be <a href="https://theworlddata.com/ai-job-displacement-statistics/" target="_blank" rel="noreferrer noopener">suppressing hiring more than destroying existing jobs</a> in the near term. <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">Yale&#8217;s Budget Lab</a>, analyzing US labor data from 2022 to 2025, found no massive shift in the share of workers across occupations. Deployment, not capability, seems to be the limiting factor.</p>



<p>As a result, it makes sense to me to synthesize these two factors (capability increase and rate of adoption) into a single vector that we can call <strong>the scale and speed of impact</strong>. The question on this axis, therefore, is not just &#8220;How good does AI get?&#8221; but also &#8220;How fast does the economy actually reorganize around it?&#8221;</p>



<p>What’s a good second vector to cross with this one? If you’ve read my book <a href="https://www.oreilly.com/tim/wtf-book.html" target="_blank" rel="noreferrer noopener"><em>WTF?</em></a> or other things I’ve written about the role of human choices in shaping the future, you probably won’t be surprised that the second vector I’ve chosen reflects my conviction that the future depends on <strong>whether AI capability is primarily used to achieve efficiencies in existing work or to <em>do more</em>, to solve new problems and serve more human needs.</strong></p>
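<p>Laid out as a grid, these two vectors give four quadrants. A minimal sketch (the labels are my paraphrases of the two axes, not the article’s official names):</p>

```python
from itertools import product

# The two crossing vectors, paraphrased from the discussion above.
impact = ["smaller/slower impact", "larger/faster impact"]
orientation = ["used mainly for efficiency", "used mainly to do more"]

# Crossing them yields the four scenario quadrants.
quadrants = [f"{i} / {o}" for i, o in product(impact, orientation)]
for q in quadrants:
    print(q)
```
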



<p>When Dorsey says a smaller team can now do the same work, that&#8217;s efficiency. When <a href="https://www.humai.blog/ai-discovered-drugs-reach-phase-iii-and-2026-will-determine-whether-all-the-promises-were-real/" target="_blank" rel="noreferrer noopener">Insilico Medicine</a> uses AI to design a drug for idiopathic pulmonary fibrosis in a fraction of the time traditional discovery takes (with <a href="https://www.humai.blog/ai-discovered-drugs-reach-phase-iii-and-2026-will-determine-whether-all-the-promises-were-real/" target="_blank" rel="noreferrer noopener">over 173 other AI-discovered drugs also now in clinical development</a> and 15 to 20 entering pivotal Phase III trials this year), that&#8217;s not replacing a human job. That&#8217;s doing something that wasn&#8217;t being done before. But we shouldn’t content ourselves with the idea that the “do more” axis is just about technical breakthroughs. It might lie in serving a vastly larger number of people far more effectively <em>and </em>efficiently. When Todd Park says that his company, <a href="https://www.devoted.com/about-us/" target="_blank" rel="noreferrer noopener">Devoted Health</a>, “is on a mission to dramatically improve the health and well-being of older Americans,” that is a call to do more. Given the size of the existing markets that need to be transformed, it is likely that even with 10x or 100x efficiency gains from AI, Devoted’s 1,000x mission might require more resources, including people.</p>



<h2 class="wp-block-heading">What will be scarce?</h2>



<p>I’ve always assumed that the “do more” orientation is chiefly a moral argument driven by human judgment about what kind of world we’d prefer to live in. As the <a href="https://www.imf.org/en/blogs/articles/2026/01/14/new-skills-and-ai-are-reshaping-the-future-of-work" target="_blank" rel="noreferrer noopener">IMF noted</a> earlier this year, &#8220;Work brings dignity and purpose to people&#8217;s lives. That&#8217;s what makes the AI transformation so consequential.&#8221; A world of concentrated value capture leading to a split between those with capital to invest and a permanent unemployed underclass is the stuff of dystopian science fiction.</p>



<p>But it’s not just a matter of inequality and the importance of work to human self-esteem. I’ve also become convinced that companies that lean into new possibilities and expand markets do better than those that simply do the same things more cheaply. And the evidence is starting to come in that this is true.&nbsp;<a href="https://www.pwc.com/gx/en/news-room/press-releases/2026/pwc-2026-ai-performance-study.html" target="_blank" rel="noreferrer noopener">According to PwC</a>, &#8220;Three-quarters of AI’s economic gains are being captured by just 20% of companies—with the leading companies focused on growth, not just productivity&#8230;. The research shows that these top‑performing companies are not simply deploying more AI tools. Instead, they are using AI as a catalyst for growth and business reinvention, particularly by pursuing new revenue opportunities created as industries converge, while building strong foundations around data, governance and trust.&#8221;</p>



<p>There are also a number of economic arguments for why the jobless future is just not going to happen. These arguments provide useful guidance into the structural changes to the economy that workers, business leaders, and politicians should be planning for.</p>



<p><a href="https://www.noahpinion.blog/p/salarymen-specialists-and-small-businesses" target="_blank" rel="noreferrer noopener">Noah Smith pointed</a> to a <a href="https://www.dropbox.com/scl/fo/689u1g785x8jp6c8v1s21/AIe0jfrZy_viIKCCET-U0r0/2026.03.30%20Bundles%20WP%20Version.pdf?rlkey=ottgcu71u1t4mhn6tblvatu8w&amp;e=2&amp;st=dj6k0x2o&amp;dl=0" target="_blank" rel="noreferrer noopener">draft economics paper by Garicano, Li, and Wu</a> that helps explain how the trade-offs between efficiency and expanding output might impact jobs. Garicano, Li, and Wu note that “the effect of AI on an occupation depends not just on which tasks AI can perform but also on how costly it is to unbundle those tasks from the job.” They model jobs as bundles of tasks, and distinguish between &#8220;strongly bundled&#8221; jobs (where the same person has to do multiple interdependent tasks) and &#8220;weakly bundled&#8221; ones (where tasks can easily be split between a human and an AI). AI replaces the weakly bundled jobs first. But even for weakly bundled jobs, automation only replaces human labor <em>after <a href="https://en.wikipedia.org/wiki/Price_elasticity_of_demand" target="_blank" rel="noreferrer noopener">demand becomes inelastic</a></em>, after AI is so productive at the task that making more of the output hits diminishing returns.</p>



<p>Until that point, increased productivity from AI can be focused on expanding output rather than shrinking headcount. That is another way of saying that whether AI replaces workers or augments them depends in large part on whether there is unmet demand to absorb the increased productivity. If we use AI only to do the same things more cheaply, we hit that inelastic point fast, and jobs disappear. If we use it to do new things, demand keeps expanding and people keep working. University of Chicago economist Alex Imas believes that just how much <a href="https://www.technologyreview.com/2026/04/06/1135187/the-one-piece-of-data-that-could-actually-shed-light-on-your-job-and-ai/" target="_blank" rel="noreferrer noopener">demand elasticity there is on a job by job basis</a> is one of the big questions of our time.</p>
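<p>The elasticity threshold can be made concrete with a toy first-order calculation (my illustration, not the Garicano, Li, and Wu model): if the cost savings from a productivity gain are passed through to prices, output demanded grows by roughly elasticity times the gain, while the labor needed per unit of output falls by the gain.</p>

```python
def labor_demand_change(productivity_gain, demand_elasticity):
    """Toy first-order estimate of the fractional change in labor demanded
    when a productivity gain is fully passed through to prices. An
    illustration of the elasticity threshold, not the paper's model."""
    output_growth = demand_elasticity * productivity_gain
    # Labor change is output growth minus the per-unit labor saved:
    # (elasticity - 1) * gain.
    return output_growth - productivity_gain

print(labor_demand_change(0.10, 1.5))   # elastic demand: labor demand rises
print(labor_demand_change(0.10, 0.5))   # inelastic demand: labor demand falls
```

Elasticity of exactly 1 is the break-even point: expanding output absorbs the whole productivity gain and employment holds steady, which is why the question of where each job sits relative to that threshold matters so much.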



<p>But that’s not all. In a new essay called &#8220;<a href="https://aleximas.substack.com/p/what-will-be-scarce" target="_blank" rel="noreferrer noopener">What Will Be Scarce?</a>&#8221; Imas points out that when a new technology makes one sector dramatically more productive, one part of the economy shrinks but another grows. When agriculture was mechanized, 40% of the American workforce moved off farms, but the economy actually grew, because people spent their rising real incomes on fundamentally different things. Imas argues, drawing on <a href="https://www.econometricsociety.org/publications/econometrica/2021/01/01/structural-change-long-run-income-and-price-effects" target="_blank" rel="noreferrer noopener">work by Comin, Lashkari, and Mestieri</a>, that income effects account for over 75% of observed patterns of structural change. As people get richer, they want fundamentally different things.</p>



<p>What are those things? Imas calls it &#8220;the relational sector&#8221;: goods and services where the human element is itself part of the value. Think teachers, nurses, therapists, hospitality workers, artisans, performers, personal chefs, community curators, and more. He opens his piece with Starbucks. In pursuit of economic efficiency, the company tried to automate more and more of its operations. CEO Brian Niccol concluded that it was a mistake, that handwritten notes on cups, ceramic mugs, and good seats drove customer satisfaction. More baristas are being hired per store, and automation is being rolled back.</p>



<p>But there’s far more to the relational sector than service jobs. Imas identifies a further dimension in what René Girard called <a href="https://medium.com/perennial/what-is-mimetic-desire-and-why-it-matters-more-than-you-think-53f28ba7cce8" target="_blank" rel="noreferrer noopener">mimetic desire</a>, the idea that people don&#8217;t just want objects for their functional properties. They want things that others want, and they want them more when they&#8217;re scarce and exclusive. (Hobbes and Rousseau made this same point.) <a href="https://academic.oup.com/restud/article/91/4/2347/7243247" target="_blank" rel="noreferrer noopener">Imas&#8217;s experimental research</a> shows that willingness to pay roughly doubles when people learn that others will be excluded from a product. And in <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6302659" target="_blank" rel="noreferrer noopener">new work with Graelin Mandel</a>, he finds that AI involvement undermines the perceived exclusivity of a good. Human-made artwork gained 44% in value from exclusivity; AI-generated artwork gained only 21%. The mere involvement of AI made the work feel inherently reproducible.</p>



<p>This means the relational sector has naturally high income elasticity. If AI makes production cheaper and real incomes rise, spending shifts toward goods where the human element matters. This is <a href="https://en.wikipedia.org/wiki/Baumol_effect" target="_blank" rel="noreferrer noopener">Baumol&#8217;s cost disease</a> working as a feature, not a bug: The sector that resists automation becomes relatively more expensive, and that&#8217;s precisely where spending and employment grow. This is an economic mechanism that could power the upper quadrants of the scenario grid that we will look at shortly, not just as a matter of moral choice but as a structural tendency of rich economies getting richer.</p>



<p>I’m going to include both Noah’s ideas and Alex’s in my scenario planning exercise, since they fit right in.</p>



<h2 class="wp-block-heading">Four possible futures</h2>



<p>Let’s look at how the two vectors cross each other and give us four futures.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1219" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11.png" alt="Four futures vectors" class="wp-image-18568" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11-300x229.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11-768x585.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11-1536x1170.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p><strong>Upper left: The Augmentation Economy.</strong> AI capability grows but adoption is gradual, and workers are augmented rather than replaced. A programmer who once wrote 100 lines of code a day now ships features that used to take a team. A nurse practitioner aided by AI diagnostic tools provides care that once required a specialist. A small business owner uses AI to access legal and financial services previously available only to large corporations. This is the quadrant where the PwC finding about the 56% wage premium makes the most sense. AI becomes a tool that makes individual workers more productive and more valuable, and the gains flow broadly. Whether this becomes a positive, growing economy depends at least in part on the choices employers make. They use the increased efficiency to build better services, not just to make them cheaper. Doctors and nurses have more time with patients and less time with paperwork. As services become more efficient, they can be offered to more people at lower cost.</p>



<p><strong>Lower left: The Slow Squeeze.</strong> AI grows, adoption is gradual, and the primary use is efficiency. This is in many ways the most insidious quadrant, because it doesn&#8217;t look like a crisis. It looks like a normal economy with slightly fewer entry-level jobs each year, slightly more pressure on wages, and slightly less bargaining power for workers. That Stanford study on young software developers is a signal from this quadrant. So is the <a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance" target="_blank" rel="noreferrer noopener">HBR finding</a> that companies are laying off workers because of AI&#8217;s <em>potential</em>, not its performance. The Slow Squeeze is the world where companies use AI to pad margins without passing the gains along or investing in new capabilities.</p>



<p><strong>Lower right: The Displacement Crisis.</strong> AI advances fast and is adopted rapidly, almost entirely for efficiency. This is the future the doomsayers warn about, the <a href="https://fortune.com/2026/02/28/ai-scare-trade-mass-layoffs-white-collar-recession-citrini-shumer-viral-doomsday-essays/" target="_blank" rel="noreferrer noopener">Citrini Research scenario</a> of unemployment topping 10% and the S&amp;P 500 tanking. Block&#8217;s 40% cut is a signal from this quadrant, whether or not Dorsey&#8217;s prediction that most companies will follow suit within a year turns out to be right. <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">Deutsche Bank analysts</a> warn that &#8220;AI redundancy washing,&#8221; companies blaming layoffs on AI that are really driven by other cost-cutting, will be a significant feature of 2026. But the fact that Wall Street rewarded Block with a 20% stock price jump for firing 4,000 people tells you what the current incentive structure is optimizing for.</p>



<p><strong>Upper right: The Great Transformation.</strong> AI capability advances rapidly and is adopted fast, but the primary use is to do more, not just the same with less. Whole new industries emerge. The <a href="https://www.weforum.org/stories/2025/12/ai-paradoxes-in-2026/" target="_blank" rel="noreferrer noopener">WEF&#8217;s projection of 170 million new roles by 2030</a> comes true, far exceeding the 92 million displaced. AI-driven drug discovery actually delivers on its promise. New forms of education, personalized to every learner, actually reach people the old system never served. The transition is still brutal, because the people losing old jobs and the people getting new ones are not the same people, in the same places, with the same skills. <a href="https://www.brookings.edu/articles/measuring-us-workers-capacity-to-adapt-to-ai-driven-job-displacement/" target="_blank" rel="noreferrer noopener">Brookings has identified 6.1 million workers</a> with high AI exposure and low adaptive capacity, 86% of them women in clerical and administrative roles. But the net direction is toward more human capability, not less.</p>



<p>Imas&#8217;s framework suggests that this quadrant will feature an explosion of durable jobs in the relational sector. Some of these will be high-touch service jobs: doctors, nurses, therapists, teachers, personal trainers, craft producers, experience designers, hospitality workers, and roles that haven&#8217;t been invented yet. The relational sector already employs nearly 50 million people in the US. But another big part of it will be creating exclusive products and services that become objects of desire. Art critic Dave Hickey calls this “<a href="https://www.amazon.com/Air-Guitar-Essays-Art-Democracy/dp/0963726455/" target="_blank" rel="noreferrer noopener">the big beautiful art market</a>” that happens when industrial products are “sold on the basis of what they mean rather than what they do.” The structural change model predicts that both of these areas will grow as a share of the economy, not because they resist automation as a technical matter but because not being automated is part of their value proposition.</p>



<p>Noah Smith&#8217;s <a href="https://www.noahpinion.blog/p/salarymen-specialists-and-small-businesses" target="_blank" rel="noreferrer noopener">taxonomy of future work</a> also helps fill in what life may actually look like across these quadrants. He divides AI-affected jobs into three categories: <em>specialists</em> whose jobs are &#8220;strongly bundled&#8221; (for example, an experienced engineer whose judgment can&#8217;t be separated from the rest of what they do), <em>salarymen</em> (generalists whose value comes from knowing how to wrangle AI and plug its ever-shifting gaps, much like the Japanese corporate model where long-tenured employees rotate between divisions and accumulate firm-specific knowledge rather than portable technical skills), and <em>small businesspeople</em> (entrepreneurs who use AI as leverage to run what would previously have required a much larger team). This is the future that Steve Yegge envisions with its “millions of one-person startups.”</p>



<p>In the upper quadrants, all three categories thrive. Specialists do well because AI expands the scope of what their bundled expertise can accomplish. Salarymen thrive because companies that are doing more, not just doing the same with less, need people who can adapt to constantly changing tool capabilities within the context of their business. And small businesses proliferate because AI gives a one-person shop the productive capacity that used to require a department.</p>



<p>In the lower quadrants, specialists may survive, but salarymen face pressure as companies optimize for headcount reduction rather than capability expansion, and small businesses struggle because the efficiency-first economy compresses the margins they need to exist.</p>



<h2 class="wp-block-heading"><strong>News from the future</strong></h2>



<p>In scenario planning, once you’ve chosen your vectors and imagined the resulting quadrants, you watch for &#8220;news from the future,&#8221; data points that signal which direction the world is actually heading. As with any scatter plot, the points are all over the map at first, but over time you start to see the trend lines emerge.</p>



<p>Right now, the signals are mixed.</p>



<p><strong>News from the lower quadrants:</strong> <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">Challenger, Gray &amp; Christmas reports</a> that AI was a significant contributing factor in nearly 55,000 US layoffs in 2025. Employee anxiety about AI-driven job loss has <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">jumped from 28% in 2024 to 40% in 2026</a>. <a href="https://www.weforum.org/stories/2025/12/ai-paradoxes-in-2026/" target="_blank" rel="noreferrer noopener">40% of employers globally</a> told the WEF they plan to reduce their workforce where AI can automate tasks within five years. And the entry-level job market is tightening in ways that compound over time even if they don&#8217;t show up in headline unemployment numbers. Brookings found that <a href="https://katv.com/news/nation-world/millions-of-americans-face-risk-from-ai-disrupting-vital-gateway-jobs-career-pathways-artificial-intelligence-workplace-disruptions-brookings-metro-opportunity-at-work" target="_blank" rel="noreferrer noopener">the &#8220;gateway&#8221; occupations</a> that serve as stepping stones from low-wage to middle-wage work are among the most exposed to AI, threatening career pathways, not just individual jobs.</p>



<p><strong>News from the upper quadrants:</strong> The PwC wage premium data. The Vanguard finding that AI-exposed occupations are growing, not shrinking. The explosion of AI drug discovery programs. <a href="https://greatleadership.substack.com/p/the-ai-transition-five-year-crisis" target="_blank" rel="noreferrer noopener">MIT&#8217;s David Autor</a> has shown that 60% of today&#8217;s US employment is in job categories that didn&#8217;t exist in 1940. New task creation is how technology has always generated new work, and there&#8217;s no reason to believe AI is exempt from that pattern, unless we choose to use it only for efficiency.</p>



<p>There may also be some signal in reports that usage among developers is becoming more intensive and continuous, from multistep coding workflows to automated agents running in loops. Some engineers are &#8220;tokenmaxxing,&#8221; with some at companies like Meta <a href="https://www.theinformation.com/articles/meta-employees-vie-ai-token-legend-status" target="_blank" rel="noreferrer noopener">treating AI consumption as a productivity benchmark</a>. This is driving rapid revenue growth for AI providers but squeezing their margins as infrastructure costs rise. That margin pressure may sound like bad news, but it&#8217;s actually a classic pattern by which a technology crosses from &#8220;tool&#8221; to &#8220;infrastructure.&#8221; Cloud computing margins were terrible until scale and hardware improvements drove unit costs down, at which point the providers who had built habit and lock-in harvested enormous returns. AI inference costs have been dropping roughly 10x per year, and price competition is accelerating that decline. The margin squeeze is the mechanism by which AI becomes cheap enough to be ubiquitous. And the tokenmaxxing engineers are doing dramatically more iterations, more exploration, with more ambitious scope. That&#8217;s &#8220;doing more&#8221; behavior, not efficiency behavior.</p>



<p>It’s still unclear, though, whether all those tokens are producing real value or whether some of this is the AI equivalent of crypto mining. If most of those tokens are productive, we&#8217;re looking at a productivity boom. If many are wasted, the adoption curve may have a big dip in it before the industry matures. Either way, the direction is toward AI becoming economic and technological infrastructure. It’s important to remember that tokens spent trying out prototypes that are rejected are not necessarily wasted. They can be part of a new development process that&#8217;s expanding the space of possibilities.</p>



<p><strong>News that doesn&#8217;t fit neatly into any quadrant:</strong> We appear to be in what Smith calls a <a href="https://www.noahpinion.blog/p/salarymen-specialists-and-small-businesses" target="_blank" rel="noreferrer noopener">&#8220;no-hire, no-fire&#8221; economy</a>, where workers hunker down in their current jobs and refuse to switch, and companies keep them rather than hiring new workers. That&#8217;s consistent with a world where people sense that their portable technical skills are depreciating, so they cling to the firm-specific knowledge that still makes them valuable where they are. It&#8217;s also consistent with the NBER Denmark study finding task reorganization without job loss: AI is replacing tasks, not (yet) jobs. Nonetheless, it is clear that a dearth of entry-level positions will be a serious issue.</p>



<p>A <a href="https://www.pittwire.pitt.edu/features-articles/2026/04/02/artificial-intelligence-data-labor" target="_blank" rel="noreferrer noopener">University of Pittsburgh researcher</a> has been calling state unemployment offices one by one to assemble the granular data that doesn&#8217;t yet exist in federal statistics, because our measurement tools are not yet fine-grained enough to see what&#8217;s happening. If you&#8217;re confused about whether AI is causing job losses, <a href="https://www.pittwire.pitt.edu/features-articles/2026/04/02/artificial-intelligence-data-labor" target="_blank" rel="noreferrer noopener">he put it plainly</a>: The likely problem is a lack of data. If AI is having an impact, we may just not be equipped to see it yet with the instruments we have. We’re getting new data points daily. Asking yourself which future they support can gradually increase your confidence in what is coming.</p>



<h2 class="wp-block-heading"><strong>Robust strategy</strong></h2>



<p>The goal of a scenario planning exercise is to stretch your thinking so that you can make strategic choices that make sense regardless of which future unfolds. Scenario planners call this a “robust strategy.”</p>



<p><strong>If you&#8217;re a business leader,</strong> the robust strategy is not to ask &#8220;How many people can I replace with AI?&#8221; It&#8217;s to ask &#8220;What can we do now that we couldn&#8217;t do before?&#8221; The companies that will thrive across all four quadrants are the ones that use AI to expand what&#8217;s possible, not just to shrink how much they have to spend. Aim for the upper right quadrant, and you’ll do better even if the rest of the world chooses otherwise.</p>



<p>That&#8217;s not just scenario planning. It&#8217;s Clay Christensen on the lessons of disruptive technologies. A disruptive technology is not defined by the markets it destroys but by the new markets and new possibilities it creates. As Christensen observed, RCA didn’t ignore the transistor; its leaders just thought it wasn’t good enough for its current customers. Sony embraced the new technology and created a new market of portable devices where the quality difference between transistors and vacuum tubes just didn’t matter. And of course, as Clay observed, the disruptive technology continues to improve.</p>



<p><strong>If you&#8217;re a worker</strong>, one element of robust strategy is to band together, as the <a href="https://www.wga.org/contracts/know-your-rights/artificial-intelligence" target="_blank" rel="noreferrer noopener">screenwriters guild did</a>, and to make the case that the productivity gains from AI should be shared with workers and used to amplify their skills and efforts. Don’t resist AI, but instead use it to make yourself even more valuable. Use it to amplify your uniqueness. That is, lean into the augmentation economy. One of the things we’ve learned from the early advances in AI-enabled software engineering is that a great software engineer can get more out of AI than a vibe-coding beginner. This is true of other professions as well. Find ways that your human uniqueness makes the output of AI even more valuable.</p>



<p>Create professional associations that lean into mentorship and an AI-enriched career ladder, but aren’t afraid to take a political stance. The idea that providers of capital are entitled to all of the gains is pernicious; it has created an engine of inequality rather than of wide prosperity. It doesn’t have to be that way. Professional associations and other forms of solidarity are a possible source of countervailing power. (But don’t fall into the trap that many unions and professional associations do, of using that power to extract rents rather than increasing value for everyone.) Preferentially choose employers who are investing in training employees for a human + AI future, including <a href="https://www.axios.com/2026/02/13/ai-ibm-tech-jobs" target="_blank" rel="noreferrer noopener">at the beginning of the career ladder</a>.</p>



<p>If you&#8217;re a specialist, deepen the parts of your expertise that are strongly bundled, the judgment and context and human relationships that can&#8217;t be separated from the technical work. If you&#8217;re a generalist inside a company, become the person who understands what AI can and can&#8217;t do and fills the gaps, whose value comes from adaptability and firm-specific knowledge rather than a fixed set of technical skills. And if you have entrepreneurial instincts, recognize that AI is creating leverage that may make it possible to run a viable business at a scale that previously couldn&#8217;t support one.</p>



<p>Imas&#8217;s work suggests that the most durable career paths <em>may not be defined by which tasks AI can&#8217;t do (a moving target) but by whether the human element is part of what the customer is paying for</em>. A restaurateur, a therapist, a teacher who knows your child, a guide who knows the trail: these aren&#8217;t jobs that survive because AI hasn&#8217;t gotten to them yet. They&#8217;re jobs where <em>human involvement is the product</em>.</p>



<p><strong>If you&#8217;re an entrepreneur</strong>, the robust strategy is the one it has always been: look at the world as it is, determine what work needs doing, and do it. Don&#8217;t build AI tools that replace humans doing things that are already being done adequately. Build AI tools that let humans do things that have never been done before.</p>



<p><strong>If you&#8217;re a policymaker</strong>, the robust strategy is to invest in the transition regardless of how fast displacement turns out to be. Create policies that give workers more of a role in how AI is used. Support positions like those of the writers guild, which allow workers to get a share of the gains from using AI. And if capital runs wild with labor replacement, tax the gains so they can be redistributed. Shorten the working week.</p>



<p>Education and lifelong learning programs, portable benefits, support for geographic mobility, and investment in the industries of the future pay off in every quadrant. So does reducing the regulatory friction that keeps new entrants trapped in old cost structures, funding basic research that the market underinvests in, and building the kind of infrastructure (physical and institutional) that enables rapid adaptation.</p>



<h2 class="wp-block-heading"><strong>The future is up to us</strong></h2>



<p>I’ll return to the theme that I sounded in my book <a href="https://www.amazon.com/WTF-Whats-Future-Why-Its/dp/0062565710" target="_blank" rel="noreferrer noopener"><em>WTF? What’s the Future and Why It’s Up To Us</em></a><em>.</em></p>



<p>Every time a company uses AI to do what it was already doing with fewer people, it is making a choice for the lower half of the scenario grid. Every time a company uses AI to do something that wasn&#8217;t previously possible, to serve a customer who wasn&#8217;t previously served, to solve a problem that wasn&#8217;t previously solvable, it is making a choice for the upper half. These choices compound, for good or ill. An economy that uses AI primarily for efficiency will slowly hollow itself out.</p>



<p>Looking at the news from the future, both sets of signals are present. The question is which will dominate. AI will give us both the Augmentation Economy and the Displacement Crisis, in different measures in different places, depending on the choices we make.</p>



<p>Scenario planning teaches us that we don&#8217;t have to predict which future we&#8217;ll get. We do have to prepare for a very uncertain future. But the robust strategy, the one that works across every quadrant, is to focus on doing more, not just doing the same with less, and to find ways that human taste still matters in what is created. As long as there is unmet demand, as long as there are problems we haven&#8217;t solved and people we haven&#8217;t served, AI will augment human work rather than replacing it. It&#8217;s only when we stop looking for new things to do that the machines come for the jobs.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/scenario-planning-for-ai-and-the-jobless-future/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Trial by Fire: Crisis Engineering</title>
		<link>https://www.oreilly.com/radar/trial-by-fire-crisis-engineering/</link>
				<comments>https://www.oreilly.com/radar/trial-by-fire-crisis-engineering/#respond</comments>
				<pubDate>Fri, 17 Apr 2026 10:54:01 +0000</pubDate>
					<dc:creator><![CDATA[Jennifer Pahlka]]></dc:creator>
						<category><![CDATA[Innovation & Disruption]]></category>
		<category><![CDATA[Operations]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18556</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Trial-by-fire.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Trial-by-fire-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A new book shows how to turn a crisis into the change you&#039;ve been waiting for]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Jennifer Pahlka’s Eating Policy website and is being republished here with the author’s permission. I read Norman Maclean’s Young Men and Fire when I was a teenager, I think, so it’s been many years, but I still remember its turning point vividly. It’s set in 1949 in Montana, at [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://www.eatingpolicy.com/p/trial-by-fire-crisis-engineering" target="_blank" rel="noreferrer noopener"><em>Jennifer Pahlka’s Eating Policy website</em></a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>I read Norman Maclean’s <em>Young Men and Fire</em> when I was a teenager, I think, so it’s been many years, but I still remember its turning point vividly. It’s set in 1949 in Montana, at the Gates of the Mountains Wilderness, about an hour north of Helena. A fire is burning, and the Forest Service sends out their smokejumpers to fight it. But the fire changes direction without warning, and a group of smokejumpers working in the Mann Gulch find themselves trapped, facing certain death. Instead of running, the foreman, Wag Dodge, pulls out matches and does the unthinkable: He lights a fire.</p>



<p>Today we know what he was doing. The escape fire consumed the fuel around him, allowing the main fire to pass over him and a few of his colleagues. But in 1949, the families of the 13 other smokejumpers who died accused Wag of causing their deaths. To them, what he had done made no sense.</p>



<p>I love that Marina Nitze, Matthew Weaver, and Mikey Dickerson chose this story as a framing device for their new book, <a href="https://bookshop.org/p/books/crisis-engineering-time-tested-tools-for-turning-chaos-into-clarity-marina-nitze/44736d1287a7da6e" target="_blank" rel="noreferrer noopener"><em>Crisis Engineering: Time-Tested Tools for Turning Chaos Into Clarity</em></a>, out now. Not just because it brought back the memory of a book that I once loved, but because Maclean’s obsessive investigation of what had happened back then (he wrote the book years after the incident) seemed to me almost as heroic as the bravery of the smokejumpers. And indeed, his insistence on making sense of what happened has probably saved lives. Escape fires are now formally recognized and taught as a last resort tactic when training new firefighters.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="667" height="1000" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.jpeg" alt="Crisis Engineering book" class="wp-image-18557" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.jpeg 667w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3-200x300.jpeg 200w" sizes="auto, (max-width: 667px) 100vw, 667px" /></figure>



<p>The Dodge escape fire wouldn’t seem to have much to do with Three Mile Island or healthcare.gov or the pandemic unemployment insurance backlogs, but the authors use it to make a point about how action and understanding interact in a crisis. One key is exactly what Maclean himself did so well: <em>sensemaking</em>. In a crisis like Mann Gulch, sensemaking disintegrates: a broken radio, wind so strong communication is impossible, fire whose behavior violates well-tested assumptions, and a team scattered. You don’t achieve sensemaking by staring at a map; you achieve it by acting and observing results. Wag Dodge didn’t understand fire behavior well enough to explain the escape fire in advance. But his actions created the understanding itself—retrospectively, as all real sensemaking is.</p>



<p>The book’s key claim is that crises are opportunities, and the authors leverage Daniel Kahneman’s <em>Thinking, Fast and Slow</em> to explain why crises are the only real windows for organizational change—and why everything else, the incentives, the logical arguments, the reorganizations, mostly doesn’t work. Most organizations, most of the time, run on autopilot. People habituate to their environment, rationalize away small surprises, and build stable stories about how things work. A crisis breaks this. When surprise accumulates faster than the brain’s “surprise-removing machinery” can rationalize it away, the whole apparatus jams, and organizations become, briefly, reprogrammable.</p>



<p>An institution resolves a crisis in one of three ways, according to the authors. It makes durable deliberate change, it dies, or, most commonly, it rationalizes the failure into an accepted new normal. “Most large organizations contain programs and departments that passively accept abject failure: infinitely long backlogs, hospitals that kill patients, devastating school closures that do little to affect a pandemic. These are fossils of past crises where the organization failed to adapt.”</p>



<p>Too many of our public institutions have failed to adapt, and the idea that they might be reprogrammable at all is a bit radical. We live in an era when too many people have given up on them, willing to burn them to the ground rather than renovate them. If crises represent the chance for true transformation, then we’d better get a lot better at using them for that. This is explicitly why <em>Crisis Engineering</em> exists, and it’s a detailed, practical book—the theory and framing devices are well used, but there’s a ton of pragmatic substance here you’ll be grateful for when the moment comes.</p>



<p>I remember when I was working in the White House and frustrated by the slow pace of progress. My UK mentor Mike Bracken told me: “Hold on, you just need a crisis. You Americans only ever change in crisis.” Boom. About two months later, healthcare.gov had its inauspicious start. And he was right. Change followed. Not all the change we needed, but a start. Marina, Weaver, and Mikey are three of the people who drove that change. I got to work with them again the first summer of the pandemic on California’s unemployment insurance claims backlog. I’m not a crisis engineer, but their strategies and tactics have deeply influenced how I think about the work I do and how I think we’re going to get from the institutions we have today to the ones we need.</p>



<p>We may be living in an era when too many people have given up on institutions, but we are also likely entering an era of crisis, and even <a href="https://en.wikipedia.org/wiki/Polycrisis" target="_blank" rel="noreferrer noopener">polycrisis</a>. This makes for uncomfortable math, but also drives home the need for a new generation of crisis engineers.</p>



<p>When I first read about Mann Gulch, so many years ago, I remember being in awe of the ingenuity and courage it took to start Wag Dodge’s escape fire. Today I think a lot about that pattern: the controlled burns that reduce the risk of megafires, the little earthquakes that take the pressure off faults under great tension, the managed crises that, if we’re skilled enough to use them, keep our institutions from the kind of collapse that comes when nothing has been allowed to give for too long. Dodge didn’t burn things down. He burned a path through. We’re going to have to get good at that.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/trial-by-fire-crisis-engineering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Generative AI in the Real World: Aishwarya Naresh Reganti on Making AI Work in Production</title>
		<link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-aishwarya-naresh-reganti-on-making-ai-work-in-production/</link>
				<comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-aishwarya-naresh-reganti-on-making-ai-work-in-production/#respond</comments>
				<pubDate>Thu, 16 Apr 2026 14:03:20 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica and Aishwarya Naresh Reganti]]></dc:creator>
						<category><![CDATA[Generative AI in the Real World]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=18544</guid>

		<enclosure url="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_real_world_aishwarya_naresh_reganti_v2.mp4" length="0" type="audio/mpeg" />
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[As the founder and CEO of LevelUp Labs, Aishwarya Naresh Reganti helps organizations “really grapple with AI,” and through her teaching, she guides individuals who are doing the same. Aishwarya joined Ben to share her experience as a forward-deployed expert supporting companies that are putting AI into production. Listen in to learn the value all [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Generative AI in the Real World: Aishwarya Naresh Reganti on Making AI Work in Production" width="500" height="281" src="https://www.youtube.com/embed/Ajiu8uyfSq0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>As the founder and CEO of LevelUp Labs, Aishwarya Naresh Reganti helps organizations “really grapple with AI,” and through her teaching, she guides individuals who are doing the same. Aishwarya joined Ben to share her experience as a forward-deployed expert supporting companies that are putting AI into production. Listen in to learn the value <em>all</em> roles—from data folks and developers to SMEs like marketers—bring to the table when launching products; how AI flips the 80-20 rule on its head; the problem with evals (or at least, the term “evals”); enterprise versus consumer use cases; and when humans need to be part of the loop. “LLMs are super powerful,” Aishwarya explains. “So I think you need to really identify where to use that power versus where humans should be making decisions.” <a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0" target="_blank" rel="noreferrer noopener">Watch now</a>.</p>



<p>About the <em>Generative AI in the Real World</em> podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>



<p>Check out other episodes of this podcast <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*pra1u5*_gcl_au*Mzc5ODUxNDEzLjE3NzI3NDUyNzk.*_ga*NjI3OTAzNjIzLjE3NzI0NzYxMzg.*_ga_092EL089CH*czE3NzMwODg2NjgkbzI3JGcwJHQxNzczMDg4NjY4JGo2MCRsMCRoMA.." target="_blank" rel="noreferrer noopener">on the O’Reilly learning platform</a> or follow us on <a href="https://www.youtube.com/playlist?list=PL055Epbe6d5YcJUhZbsVW9dlMueIuOxK_" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/5C9oof8TFkP65lDUcEy5jT" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/generative-ai-in-the-real-world/id1835476293" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</p>



<h2 class="wp-block-heading">Transcript</h2>



<p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=58" target="_blank" rel="noreferrer noopener">00.58</a><br><strong><strong>All right. So today we have Aishwarya Reganti, founder and CEO of </strong><a href="https://levelup-labs.ai/" target="_blank" rel="noreferrer noopener"><strong>LevelUp Labs</strong></a><strong>. Their tagline is “Forward-deployed AI experts at your service.” So with that, welcome to the podcast.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=73" target="_blank" rel="noreferrer noopener">01.13</a><br>Thank you, Ben. Super excited to be here.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=76" target="_blank" rel="noreferrer noopener">01.16</a><br><strong><strong>All right. So for our listeners, “forward-deployed”—that&#8217;s a term I think that first entered the lexicon mainly through Palantir, I believe: forward-deployed engineers. So that communicates that Aishwarya and team are very much at the forefront of helping companies really grapple with AI and getting it to work. So, first question is, we&#8217;re two years into these AI demos. What actually separates a real AI product from a good demo at this point?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=113" target="_blank" rel="noreferrer noopener">01.53</a><br>Yeah, very timely question. And yeah, we are a team of forward-deployed experts. A bit of background, to explain why we&#8217;ve probably seen quite a few demos fail: We work with enterprises to build a prototype for them and educate them about how to improve that prototype over time. I think one of the biggest things that differentiates a good AI product is how much effort a team is spending on calibrating it. I typically call this the 80-20 flip.&nbsp;</p>



<p>A lot of the folks who are building AI products as of today come from a traditional software engineering background. And when you&#8217;re building a traditional product, a software product, you spend 80% of the time on building and 20% of the time on what happens after building, right? You&#8217;re probably seeing a bunch of bugs, you&#8217;re resolving them, etc.&nbsp;</p>



<p>But in AI, that kind of gets flipped. You spend 20% of the time maybe building, especially with all of the AI assistants and all of that. And you spend 80% of the time on what I call “calibration,” which is identifying how your users behave with the product [and] how well the product is doing, and incorporating that as a flywheel so that you can continue to improve it, right?&nbsp;</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=191" target="_blank" rel="noreferrer noopener">03.11</a><br>And why does that happen? Because with AI products, the interface is very natural, which means that you&#8217;re pretty much speaking with these products, or you&#8217;re using some form of natural language communication, which means there are tons of ways users could talk and approach your product versus just clicking buttons and all of that, where workflows are so deterministic—which is why you open up a larger surface area for errors.&nbsp;</p>



<p>And you will only understand how your users are behaving with the system as you give them more access to it, right? Think of anything as mainstream as ChatGPT. How users interact with ChatGPT today is so different from how they did, say, three years ago, when it was released in November 2022. So what differentiates a good product is that idea of constant calibration to make sure that it&#8217;s getting aligned with the users and also with changing models and stuff like that. So the 80-20 flip I think is what differentiates a good product from just a prototype.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=194" target="_blank" rel="noreferrer noopener">04.14</a><br><strong><strong>So actually this is an important point in the sense that the persona has changed as to who&#8217;s building these data and AI products, because if you rewind five years ago, you had people with some knowledge of data science, ML, and now because it&#8217;s so accessible, developers—actually even nondevelopers, vibe coders—can start building. So with that said, Aishwarya, what do these kinds of nondata and AI people still consistently get wrong when they move from that traditional mindset of building software to now AI applications?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=305" target="_blank" rel="noreferrer noopener">05.05</a><br>For one, I truly am one of those people who believes that AI should be for everyone. Even if you&#8217;re coming from a traditional machine learning background, there&#8217;s so much to catch up on. I moved <em>from</em> a team in AWS in 2023 where I was working with traditional natural language processing models—I was a part of the Alexa team. And then I moved into an org called GenAI Innovation Center, where we were building generative AI solutions for customers. And I feel like there was so much to learn for me as well.&nbsp;</p>



<p>But if there&#8217;s one thing that most people get wrong and maybe AI and traditional ML folks get right, it’s to look at your data, right? When you&#8217;re building all of these products, people just assume that “Oh, I&#8217;ve tested this for a few use cases” and then it seems to work fine, and they don&#8217;t pay so much attention to the kind of data distribution that they would get from their users. And given this obsession to automate everything, people go like, “OK, I can maybe ask an LLM to identify what kind of user patterns I&#8217;m seeing, build evals for itself, and update itself.” It doesn&#8217;t work that way. You really need to spend the time to understand workflows very well, understand context, understand all this data, pretty much.&nbsp;.&nbsp;.&nbsp;</p>



<p>I think just taking the time to manually do some of the setting up work for your agents so that they can perform at their maximum is super underrated. Traditional ML folks tend to understand that a little better because most of the time we&#8217;ve been doing that. We&#8217;ve been curating data for training our machine learning models even after they go into production. There&#8217;s all of this identifying outliers and updating and stuff. But yeah, if there&#8217;s one single takeaway for anybody building AI products: Take the time to look at your data. That&#8217;s the most important foundation for building them.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=421" target="_blank" rel="noreferrer noopener">07.01</a><br><strong><strong>I&#8217;ll flip this a little bit and give props to the traditional developers. What do they get right? In other words, traditional developers write code; some of them write tests, run unit tests [and] integration tests. So they had something to build on that maybe the data scientists who were not writing production code were not used to doing. So what do the traditional developers bring to the table that the data and ML people can learn from?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=460">07.40</a><br>That&#8217;s an interesting question because I don&#8217;t come from a software background and I just feel traditional developers have a very good design thinking: How do you design architectures so that they can scale? I was so used to writing in notebooks and kind of just focusing so much on the model, but traditional developers treat the model as an API and they build everything very well around it, right? They think about security. They think about what kind of design makes sense at scale and all of that. And even today I feel like so much of AI engineering is traditional software engineering—but with all of the caveats that you need to be looking at your data. You need to be building evals which look very different. But if you kind of zoom out and see, it&#8217;s pretty much the same process, and everything that you do around the model (assuming that the model is just a nondeterministic API), I think traditional software engineers get it like bang on.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=516" target="_blank" rel="noreferrer noopener">08.36</a><br><strong><strong>You recently wrote a </strong><a href="https://www.oreilly.com/radar/evals-are-not-all-you-need/" target="_blank" rel="noreferrer noopener"><strong>post about evals</strong></a><strong>, which was quite interesting actually, [arguing] that it&#8217;s a bit of an overused and poorly defined term. I agree with the thesis of the post, but were you getting frustrated? Is that the reason why you wrote the post? [laughs] What was the genesis of the post?</strong>&nbsp;</strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=543" target="_blank" rel="noreferrer noopener">09.03</a><br>The baseline is that most of my posts come out of frustration with the noise in this space. It just feels like if you kind of see the trajectory.&nbsp;.&nbsp;. In November 2022, ChatGPT was out, and [everybody was] like, &#8220;Oh, chat interfaces are all you need.&#8221; And then there was this concept of retrieval-augmented generation, and they go “Oh, RAG is all you need. Chat just doesn&#8217;t work.” And then there was this concept of agents and like “Agents are all you need; evals are all you need.” So it just gets super annoying when people hang on to these concepts and don&#8217;t really understand the depth of it.&nbsp;</p>



<p>Even now I think there are tons of people who go like “Oh, RAG is dead. It&#8217;s not going to be used” and stuff, and there&#8217;s so much nuance to it. And with evals as well. I teach a lot of courses: I teach at universities; I also have my own courses. I feel like people just stuck to the term, and they were like “Oh, there is this use case I&#8217;m building. I need hundreds of evals in order to make sure that it&#8217;s tested very well.” And they just heard the fact that “Oh, evals are what you need to do differently for AI products” and really didn&#8217;t understand in depth what evals mean—how you need to build a flywheel around it, and the entire, you know, act of building a product, calibrating it, and building a set of evaluations and also doing some A/B testing online to understand how your users are behaving with it. All of that just went into one term “evals,” and people are just like throwing it around everywhere, right?</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=635" target="_blank" rel="noreferrer noopener">10.35</a><br>And there&#8217;s also this confusion around model eval versus product eval, which is all of these frontier companies build evals on their models to make sure that they understand where they are on the leaderboard. And I was speaking to someone one day, and they went like, &#8220;Oh, GPT-5 point something has been tested on a particular eval dataset, which means it&#8217;s the best for my use case, so I&#8217;m going to be using it.&#8221; And I&#8217;m like, &#8220;That&#8217;s not the evals that you should be worrying about, right?&#8221; So just overloading so much into a term and hyping it up is kind of what I felt was annoying. And I wanted to write a post to say that evals is a process. It&#8217;s a long process. It&#8217;s pretty much the process of building something and calibrating it over time. And there are tons of components to it, so don&#8217;t kind of try to stuff everything in a word and confuse people.&nbsp;</p>



<p>I&#8217;ve also seen people who do things like, “Oh, I&#8217;m going to build hundreds of evals” and maybe 10 of them are actionable. Evals also need to be super actionable: What is the information you can get from them, and how can you act on that? So I kind of stuffed all of that frustration into the post to kind of say it&#8217;s a longer process. There&#8217;s so much nuance in it. Don&#8217;t try to water that down.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=708" target="_blank" rel="noreferrer noopener">11.48</a><br><strong>So it seems like this is an area where the people that were from the prior era—the people building ML and data science products—maybe could bring something to the table, right? Because they had experience, I don&#8217;t know, shipping recommendation engines and things like that. They have some prior notion of what continuous evaluation and rigorous evaluation brings to the table.&nbsp;</strong></p>



<p><strong>Actually I was talking to someone about this a few weeks ago in the sense that maybe the data scientists actually have a growing employment opportunity here because basically what they bring to the table seems increasingly important to me. Given that code is essentially free and discardable, it seems like someone with a more rigorous background in stats and ML might be able to distinguish themselves. What do you think?</strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=776" target="_blank" rel="noreferrer noopener">12.56</a><br>Yes and no, because it&#8217;s true that machine learning and data scientists understand data very well, but just the way you build evals for these products is so much more different than how you would build, say, your typical metrics (accuracy, F-score, and all of that) that it takes quite some thinking to extend that and also some learning to do.&nbsp;.&nbsp;.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=801" target="_blank" rel="noreferrer noopener">13.21</a><br><strong><strong>But at least you might actually go in there knowing that you need it.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=807" target="_blank" rel="noreferrer noopener">13.27</a><br>That is true, but I don&#8217;t think that&#8217;s a super.&nbsp;.&nbsp;. I&#8217;ve seen very good engineers pick that up as well because they understand at a design level “What are the metrics I need to be measuring?” So they&#8217;re very outcome focused and kind of enter with that. So one: I think everybody has to be more coachable—not really depend on things that they learned like X years ago, because things are changing so quickly. But I also believe that whenever you&#8217;re building a product, it&#8217;s not really one set of folks that have the edge.&nbsp;</p>



<p>Another maybe distribution that is completely different is just subject-matter experts, right? When you&#8217;re building evals, you need to be writing rubrics for your LLM judges. Simple example: Let&#8217;s say you&#8217;re building a marketing pipeline for your company, and you need to write copy—marketing emails or something like that. Now even if I come from a data science background, if I were thrown at that problem, I just don&#8217;t understand what to look for and how to get closer to a brand voice that my company would be satisfied with. But I really need a marketing expert to kind of tell me “This is the brand voice we use, and this is the evals that we can build, or this is how the rubric should look like.” So it should almost be like a cross-functional thing. I feel like each of us have different pieces to that puzzle, and we need to work together.&nbsp;</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=882" target="_blank" rel="noreferrer noopener">14.42</a><br>That kind of also brings me to this other thing of collaborating in a much tighter manner [than] before. Before it was like, “OK, machine learning folks get data; they build models; and then there is a separate testing team; there is a separate SME team that&#8217;s going to look at how this product is behaving.” And now you cannot do that. You need to be optimizing for the same feedback loop. You need to be talking a lot more with all of the stakeholders because even when building, you want to understand their perspective.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=914">15.14</a><br><strong><strong>So it seems also the case that as more people build these things, they realize that actually.&nbsp;.&nbsp;. You know, sometimes I struggle with the word “eval” in the sense that maybe the right word is “optimize,” because basically what you really want is to understand “What am I optimizing for?” Obviously reliability is one of them, but latency and cost are also important factors, right? So it&#8217;s just a discussion that you&#8217;re increasingly coming across, and people are recognizing that there are trade-offs and they have to balance a bunch of things.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=957" target="_blank" rel="noreferrer noopener">15.57</a><br>Yes, definitely. I don&#8217;t see it being discussed heavily mainstream. But whenever I approach a problem, it&#8217;s always that, right? It&#8217;s performance, effort, cost, and latency. And all of these four things are kind of.&nbsp;.&nbsp;. You&#8217;re trying to balance each of them and trade off each of them. And I always say, start off with something that&#8217;s very low effort so that you kind of have an upper ceiling to what can be achieved. Then optimize for performance.&nbsp;</p>



<p>Again, don&#8217;t optimize for cost and latency when you get started because you just want to see the realm of possible to make sure that you can build a product and it can work fine. And cost and latency [are] something that ought to be optimized for—even when building for enterprises—after we&#8217;ve had a decent prototype that can do well on evals. Right now, if I built something with, say, a good mid-tier model and it can hit all of my eval datasets, then I know that this is possible, and now I can optimize for the latency and cost based on the constraints. But always follow that pyramid, right? Go with [the] lowest effort. Try to optimize for performance. And then cost and latency is something that.&nbsp;.&nbsp;. There are tons of tricks you can do. There&#8217;s caching; there&#8217;s using smaller models and all of that. That&#8217;s kind of a framework that I typically use.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1028" target="_blank" rel="noreferrer noopener">17.08</a><br><strong><strong>In prior generations of machine learning, I think a lot of focus was on accuracy to some extent. But now increasingly, because we&#8217;re in this kind of generative AI world, it&#8217;s more likely that people are interested in reliability and predictability in the following sense: Even if I&#8217;m only 10% accurate, as long as I know what that 10% is, I would prefer that [to] a model that&#8217;s more accurate but I don&#8217;t know when it&#8217;s accurate. Right?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1067" target="_blank" rel="noreferrer noopener">17.47</a><br>Right. That&#8217;s kind of the boon and bane of generative AI models. I guess the fact that they can generalize is amazing, but sometimes they end up generalizing in ways that you wouldn&#8217;t want them to. And whenever we work on enterprise use cases, something that I always want to tell myself is: If this can be a workflow, don&#8217;t make it autonomous. If it can solve the problem with a simple LLM call and you can audit decisions, keep it a workflow. For instance, let&#8217;s say we&#8217;re building a customer support agent. You could literally build it in five minutes: You can throw SOPs at your customer support agent and say “OK, pick up the right resolution, talk to the user, and that&#8217;s it.” Building is very cheap today. I can literally have Claude Code build it up in a few minutes.&nbsp;</p>



<p>But something that you want to be more intentional about is “What happens if things go wrong? When should I escalate to humans?” And that&#8217;s where I would just break this into a workflow. First, identify the intent of the human and then give me a draft—almost be a copilot for me, where I can collaborate. And then if that draft looks good, a human should approve it so that it goes further.&nbsp;</p>



<p>Right now, you&#8217;re introducing auditability at each point so that you as a human can make decisions before, you know, an agent goes up and messes up things for you. And that&#8217;s also where your design decisions should really take over. Like I could build anything today, but how much thinking am I doing before that building so that there&#8217;s reliability, there’s auditability, and all of those things. LLMs are super powerful. So I think you need to really identify where to use that power versus where humans should be making decisions.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1168" target="_blank" rel="noreferrer noopener">19.28</a><br><strong><strong>And you touched on the notion of human auditors or humans in the loop. So obviously people also try to balance LLM as judge versus human in the loop, right? Obviously there&#8217;s no one piece of advice, but what are some best practices around how you demarcate between when to use a human and when you&#8217;re comfortable using another model as a judge?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1204" target="_blank" rel="noreferrer noopener">20.04</a><br>A lot of this usually depends on how much data you have to train your judge, right? I feel humans have this problem, which is: Sometimes you can do a task but you can&#8217;t explain why you arrived at that decision in a very structured format. I can today take a look at an article and tell you.&nbsp;.&nbsp;. Especially, I write a lot on Substack and LinkedIn; this is a super personal use case. If you give me an article and ask me, “Ash, will this go viral on LinkedIn?” I can tell you yes or no for my profile, right? Because I&#8217;ve done it for so many years. But if you ask me, “How did you make that decision?” I probably cannot codify it and write it down as a bunch of rubrics. Which is again, when you translate this to an LLM judge, “Can I build an LLM that can tell me if a post will go viral or not?” Maybe not, because I just don&#8217;t have all the constraints that I use as a human when I make decisions.&nbsp;</p>



<p>Now, take this to more production-like use cases or enterprise-like use cases. You want to have a human judge until you can codify or you can create a framework of how to evaluate something and you can write that out in natural language. And what that means is you maybe want to take 100 or 200 utterances and say, “OK, does this make sense? What&#8217;s the reasoning behind why I graded it a certain way?” And you can feed all of that information into your LLM judge to finally give it a set of rubrics and build your evals. But that&#8217;s kind of how you make a decision, which is “Do we have enough information to provide to an LLM judge that it can replace human judgment?”&nbsp;</p>



<p>But otherwise don&#8217;t do it—if you have very vague high-level ideas of what good looks like, you probably don&#8217;t want to go to an LLM judge. Even when building your systems, I would always recommend that your first pass when you&#8217;re doing your eval should be judged by a human, and you should also ask them to give you reasoning as to why they judge it because that reasoning is so important for training your LLM judges.</p>
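<p>The loop described here (have humans grade a first pass, capture their reasoning, and only then distill that reasoning into rubrics for an LLM judge) can be sketched roughly as follows. This is a hypothetical illustration, not an actual implementation: the <code>GradedExample</code> type, the 100-example threshold, and both function names are assumptions.</p>

```python
# Hypothetical sketch of the human-first eval loop: humans grade a
# first pass with reasoning, and only once enough graded examples
# exist do we distill them into a rubric prompt for an LLM judge.

from dataclasses import dataclass

@dataclass
class GradedExample:
    output: str      # model output shown to the human grader
    grade: str       # e.g. "pass" / "fail"
    reasoning: str   # why the human graded it that way

MIN_EXAMPLES = 100  # assumption: the "100 or 200 utterances" mentioned above

def ready_for_llm_judge(examples: list[GradedExample]) -> bool:
    """Only replace human judgment once the criteria are codified:
    enough examples, each with explicit reasoning."""
    return (len(examples) >= MIN_EXAMPLES
            and all(e.reasoning.strip() for e in examples))

def build_judge_rubric(examples: list[GradedExample]) -> str:
    """Distill the human graders' reasoning into a rubric prompt."""
    lines = ["You are grading outputs. Apply these criteria, derived from human graders:"]
    for e in examples:
        lines.append(f"- A '{e.grade}' example: {e.reasoning}")
    return "\n".join(lines)
```

<p>The key design point matches the advice above: the rubric is built from human reasoning, and the LLM judge is held back until that reasoning exists in codified form.</p>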



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1318">21.58</a><br><strong><strong>What are some signs that you look for? What are signals that you look for when one of these AI applications or systems go live? What are some of the signals you look for that [show] maybe the quality is degrading or breaking down?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1338" target="_blank" rel="noreferrer noopener">22.18</a><br>It really depends on the use case, but there are a lot of subtle signals that users will give you, and you can log them, right? Things like “Are users swearing at your product?” That&#8217;s something we always use, right? “What kind of words are they using? How many conversation turns are there, if it&#8217;s a chatbot?” Usually when you&#8217;re building your chatbot, you identify that the average number of turns is 10, but it turns out that customers are having only two turns of conversation. That kind of means that they&#8217;re not interested in talking to your chatbot. Or sometimes they&#8217;re having 20-turn conversations, which means they&#8217;re probably annoyed, which is why they&#8217;re having longer conversations.&nbsp;</p>



<p>There are typical things: You know, ask your user to give a thumbs up or thumbs down and all of that, but we know that feedback kind of doesn&#8217;t.&nbsp;.&nbsp;. People don&#8217;t give feedback unless they&#8217;re annoyed at something. So you can have those as well. If you&#8217;re building something like a coding agent like Claude Code etc., very obvious logging you can do is “Did the user go and change the code that it generated?” which means it was wrong. So it&#8217;s very specific to your context, but really think of ways you can log all of this behavior; you can log anomalies.&nbsp;</p>



<p>Sometimes it&#8217;s just getting all of these logs and doing some topic clustering, which is “What are our users typically talking about, and do any of those show signs of frustration? Do they show signs of being annoyed with the system?” and things like that. You really need to understand your workflows very well so that you can design these monitoring strategies.</p>
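<p>The monitoring heuristics mentioned here (turn counts outside the expected band, frustration language in the text) can be sketched as a small anomaly flagger. This is purely illustrative: the thresholds, the wordlist, and the function name are assumptions, and a real system would log these signals rather than compute them ad hoc.</p>

```python
# Hypothetical sketch of the monitoring heuristics described above:
# flag conversations whose turn count falls outside the expected band
# or whose text contains frustration language. Thresholds are assumed.

EXPECTED_TURNS = (5, 15)  # assumed healthy band around the ~10-turn average
FRUSTRATION_WORDS = {"useless", "ridiculous", "wtf"}  # assumed wordlist

def flag_conversation(turns: list[str]) -> list[str]:
    """Return the anomaly signals this conversation triggers."""
    signals = []
    lo, hi = EXPECTED_TURNS
    if len(turns) < lo:
        signals.append("too_few_turns")    # users gave up quickly
    elif len(turns) > hi:
        signals.append("too_many_turns")   # users may be stuck or annoyed
    text = " ".join(turns).lower()
    if any(w in text for w in FRUSTRATION_WORDS):
        signals.append("frustration_language")
    return signals
```

<p>In practice these flags would feed the topic-clustering and review loop described above, rather than trigger automated action on their own.</p>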



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1430" target="_blank" rel="noreferrer noopener">23.50</a><br><strong><strong>Yeah, it&#8217;s interesting because I was just on a chatbot for an airline, and I was surprised how bad it was, in the sense that it felt like a chatbot of the pre-LLM era. So give us kind of your sense of “Are these chatbots now really being powered by foundation models or.&nbsp;.&nbsp;.?” I mean because I was just shocked, Aishwarya, at how bad it was, you know? So what&#8217;s your sense of, as far as you know, are enterprises really deploying these generative AI foundation models in consumer-facing apps?</strong></strong><br><br><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1481" target="_blank" rel="noreferrer noopener">24.41</a><br>Very few. To just give you a quick stat that might not be super correct: 70% to 80% of the engagements that we take up at LevelUp Labs happen to be productivity and ops focused rather than customer focused. And the biggest blocker for that has always been trust and reliability, because if you build these customer-facing agents [and] they make one mistake, it&#8217;s enough to put you in the news media or enough to put you in bad PR.&nbsp;</p>



<p>But I think what good companies are doing as of today is doing a phased approach, which is they have already identified buckets that can be completely autonomous versus buckets that would require humans to navigate, right? Like this example that you gave me, as soon as a user comes up with a query, they have a triaging system that would determine if it should go to an AI agent versus a human, depending on the history of the user, depending on the kind of query. (Is it complicated enough?) Right? Let&#8217;s say Ben has this history of.&nbsp;.&nbsp;.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1544" target="_blank" rel="noreferrer noopener">25.44</a><br><strong><strong>Hey, hey, I had great status on this airline.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1547" target="_blank" rel="noreferrer noopener">25.47</a><br>[laughs] Yeah. So it&#8217;s probably not you, but just the kind of query you&#8217;re coming up with and all of that. So they&#8217;ve identified buckets where automation is possible, and they&#8217;re doing it, and they&#8217;ve done that because of past behavior data, right? What&#8217;s the low-hanging fruit we could automate versus escalate to humans? I have not seen a lot of these chat systems that are completely taken over by agents. There&#8217;s always some human oversight and very good orchestration mechanisms to make sure that customers are not affected.</p>
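<p>The triaging system described above can be sketched in a few lines. This is an editor&#8217;s illustration only: the <code>Query</code> fields, thresholds, and routing rules are invented for the example, not taken from any real deployment, but they mirror the signals mentioned in the conversation (user history, query complexity, risk).</p>

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    prior_escalations: int    # how often this user needed a human before
    is_regulated_topic: bool  # e.g. refunds, legal, medical

def route(query: Query) -> str:
    """Decide whether a query goes to the AI agent or a human.

    Buckets known to be safe are automated; anything carrying a risk
    signal or a history of escalation is handed to a person.
    """
    if query.is_regulated_topic:
        return "human"
    if query.prior_escalations >= 2:
        return "human"
    # Treat long, multi-part queries as "complicated enough" for a human.
    if len(query.text.split()) > 60:
        return "human"
    return "ai_agent"

print(route(Query("What is my baggage allowance?", 0, False)))       # ai_agent
print(route(Query("I want a refund for my cancelled flight", 0, True)))  # human
```

<p>The point is less the specific rules than the design: the phased approach works because the routing decision sits in front of the agent, so the automated bucket can grow as trust in the agent&#8217;s behavior grows.</p>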



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1576" target="_blank" rel="noreferrer noopener">26.16</a><br><strong>So you mentioned that you mostly are in the technical and ops application areas, but I&#8217;ll ask you this question anyway. To what extent do legal things come up? In other words, I&#8217;m about to deploy this model. I know I have guardrails, but honestly, just between you and me, I haven&#8217;t gone through the proper legal evaluation, you know? [laughs] So in other words, legality or compliance—anything to do with laws—do they come up at all in your discussions with companies?</strong><br><br><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1619" target="_blank" rel="noreferrer noopener">26.59</a><br>As an external implementation team, I think one thing that we do with most companies is give them a high-level overview of the architecture we&#8217;ll be building, the requirements, and ask them to do a security and legal review so that they&#8217;re okay with it, because we&#8217;ve had experiences in the past where we pretty much built out everything and then you have your CISO come in and say, “OK, this doesn&#8217;t fall into what we could deploy.” So many companies make the mistake of not really involving their governance and compliance folks in the beginning and then end up scrapping entire projects.&nbsp;</p>



<p>I am not an expert who knows all of these rules and legalities, but we always make sure that they understand: “Where is the data coming from? Do we have any issues productionizing this?” and all of that, but we haven&#8217;t really worked.&nbsp;.&nbsp;. I mean I don&#8217;t have a lot of background on how to do this. We&#8217;re mostly engineering folks, but we make sure that we have a sign-off so that we are not kind of landing in surprises.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1687" target="_blank" rel="noreferrer noopener">28.07</a><br><strong>Yeah, the reason I bring it up is obviously, now that everything is much more democratized, more people can build—so in reality people can literally move fast and break things, right? So I just wonder if there&#8217;s any discussion at all. It sounds like you are proactive, though mostly out of experience, but I wonder if regular teams are talking about this.&nbsp;</strong></p>



<p><strong>Speaking of which, you brought up earlier leaderboards—obviously I&#8217;m guilty of this too: “I&#8217;m about to build something. OK, let me look at a leaderboard.” But, you know, I&#8217;m not literally going to take the leaderboard&#8217;s advice, right? I&#8217;m going to still kick the tires on the specific application and use case. But I&#8217;m sure though, in your conversations, people tell you all sorts of things like, “Hey, we should use this because I saw somewhere that this is ranked number one,” right? So is this still a frustration on your end, or are people much more savvy now?</strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1759" target="_blank" rel="noreferrer noopener">29.19</a><br>For one, I want to quickly clarify that it&#8217;s not wrong to look at a leaderboard. It&#8217;s always.&nbsp;.&nbsp;. You know, you get a high-level idea of “Who are your best competitors at this point?” But what I have a problem with is being so obsessed with just that leaderboard that you don&#8217;t build evals for yourself.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1774" target="_blank" rel="noreferrer noopener">29.34</a><br>In my experience, when we work with a lot of these companies, I think over the past two years the discussion has really shifted away from the model for two reasons: One is most companies already have existing partnerships. They&#8217;re working with a major model provider, and they&#8217;re OK doing that now just because all of these model providers are racing towards feature parity, leaderboard success, and all of that. If Anthropic has something, you know, if their model is performing well on a leaderboard today, Gemini and OpenAI will probably be there in a week. So people are not too concerned about model performance. They know that in a couple of weeks, that will kind of be built into other models. So they&#8217;re not worried about that.&nbsp;</p>



<p>And two is companies are also thinking much more about the application layer right now. There&#8217;s so much discussion around all of these harnesses like Claude Code, OpenClaw, and stuff like that. So I&#8217;ve not seen a lot of complaints on “Oh, this is the model that we should be using.” It seems like they have a shared understanding of how models perform. They want to optimize the harness and the application layer much more.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1848" target="_blank" rel="noreferrer noopener">30.48</a><br><strong>Yeah. Yeah. Obviously another one of these buzzwords is “harness engineering,” and whatever you think about it, the one good thing is it really elevates the notion that you should worry about the things around the model rather than the model itself.&nbsp;</strong></p>



<p><strong>But speaking of.&nbsp;.&nbsp;. I guess I&#8217;m kind of old school in the sense that I want to still make sure that I can swap models out, not necessarily because I believe one model is better than the other but one model may be cheaper than the other, right?&nbsp;</strong></p>



<p><strong>And at least up until recently—I haven&#8217;t had this conversation in a while—it seemed to me that people got stuck on a model because their prompts were so specific to a model that porting to another model seemed like a lot of work. But nowadays you have tools like DSPy and GEPA, so it seems like you can do that more easily. So what&#8217;s your sense of model portability as a design principle—model neutrality?</strong><br><br><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1926" target="_blank" rel="noreferrer noopener">32.06</a><br>For one, I think the gap between models is much more exaggerated for consumer use cases just because people care quite a bit about the personality, about how the model&#8230;</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1942" target="_blank" rel="noreferrer noopener">32.22</a><br><strong><strong>No, I care about latency and cost.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1944" target="_blank" rel="noreferrer noopener">32.24</a><br>Yeah. In terms of latency and cost, right, most of the model providers pretty much are competing to make sure they are in the market. I don&#8217;t know. Do you think that there are models.&nbsp;.&nbsp;.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1955" target="_blank" rel="noreferrer noopener">32.35</a><br><strong><strong>Well, I think that you can still get good deals with Gemini. [laughs]</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1960" target="_blank" rel="noreferrer noopener">32.40</a><br>Interesting.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1961" target="_blank" rel="noreferrer noopener">32.41</a><br><strong><strong>But honestly, I use OpenRouter and OpenCode. So I&#8217;m much more in the camp of “I don&#8217;t want to get locked into a single [model].” When I build something, I want to make sure that I build it in a way that I can move to a different model provider if I have to. But it doesn&#8217;t sound like you think that this is something that people worry about right now. They&#8217;re just worried about building something usable, and then they can worry about that later.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1992" target="_blank" rel="noreferrer noopener">33.12</a><br>Yes. And again, I come from a very enterprise point of view, like “What are companies thinking about this?” And like I said, I&#8217;m not seeing a lot of push for model neutrality because these companies have deals with vendors and they&#8217;re okay sticking with the same model provider.&nbsp;</p>



<p>Now, when it comes to consumers, like if you&#8217;re building something for the kind of use cases that you were saying, Ben, I feel that, like I said, personality is super important for consumer builders. And I still think we&#8217;re not at a point where you can easily swap out models and be like, “OK, this is going to work as well as before,” just because you have over time learned how the model behaves. So you&#8217;ve kind of gotten calibrated with these models, and these models also have very specific personalities. So there&#8217;s a lot of, you know, reengineering that you have to do.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2047" target="_blank" rel="noreferrer noopener">34.07</a><br>And when I say reengineering, it just might mean changing the way your prompts are written and stuff like that. It will still functionally work, which is why I say that enterprises don&#8217;t care about this much because the kind of use cases I see are like document processing or code generation, in which case functionality is of much more importance than personality. But for consumer use cases, I don&#8217;t think we&#8217;re at a point—to your point on building with OpenRouter, you can do that, but I think it&#8217;s a lot of overhead given that you&#8217;ll have to write specific prompts for all of these models depending on your use case.&nbsp;</p>



<p>I recently ported my OpenClaw from Anthropic to OpenAI because of all of the recent things, and I had to change all of my SOUL.md files, USER.md files, so that I could kind of set the behavior. And it [took] quite some time to do it, and I&#8217;m still getting used to interacting with OpenClaw using OpenAI because it seems like it makes different mistakes than what Anthropic would do.&nbsp;</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2103" target="_blank" rel="noreferrer noopener">35.03</a><br>So hopefully at some point [the] personalities of these models will converge but I do not think so because this is not a capability problem. It&#8217;s more of design choices that these model providers have made while building these models. So I don&#8217;t see a time where.&nbsp;.&nbsp;. We&#8217;re already at a point where capability-wise most models are getting closer, but personality-wise I don&#8217;t think model vendors would prefer to converge them because these are kind of your spiky edges which will make people with a certain personality gravitate towards your models. You don&#8217;t want to be making it like an average.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2138" target="_blank" rel="noreferrer noopener">35.38</a><br><strong>So in closing, you do a bit of teaching as well, right? One of the things I&#8217;ve really paid attention to is, in my conversations with people who are very, very early in their career, maybe still looking for the first job, literally, there&#8217;s a lot of worry out there. I mean, not necessarily if you&#8217;re a developer and you have a job—as long as you embrace the AI tools, you&#8217;re probably going to be fine. It&#8217;s just getting to that first job is getting harder and harder for people.&nbsp;</strong></p>



<p><strong>And unfortunately, you need that first job to burnish your credentials and your résumé. And honestly, I think companies also neglect the fact that this is your pipeline for talent within the company as well: You have to have the top of the funnel of your talent pipeline. So what advice do you give to people who are literally still trying to get to that first job?</strong><br><br><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2211" target="_blank" rel="noreferrer noopener">36.51</a><br>For one, I have had a lot of success with hiring young folks because I think they are very agent native. I call them agent-native operators. If you&#8217;ve been working in software, in IT, for about 10 years or something like me, you&#8217;ve gotten used to certain workflows without using AI. I feel like we&#8217;re so stuck in that old mindset that I really need someone who&#8217;s agent native to come and tell me, “Hey, you could literally ask Claude Code to do this.” So I&#8217;ve had a lot of luck hiring folks who are early career because they are very coachable, one, and two, they just understand how to be agent native.&nbsp;</p>



<p>So my suggestion would still be around that: Be a tinkerer. Try to find out what you can do with these tools, how you can automate them, and be extremely obsessed with designing and thinking and not really execution, right? Execution is kind of being taken over by agents.&nbsp;</p>



<p>So how do you really think about “What can I delegate?” versus “What can I augment?” and really sitting in the position of almost being an agent manager and thinking “How can you set up processes so that you can make end-to-end impact?” So just thinking a lot around those lines—and those are the kind of people that we&#8217;d like to hire as well.&nbsp;</p>



<p>And if you see a lot of these latest job roles, you&#8217;ll also see roles blurring, right? People who are product managers are expected to also do GTM, also do a bit of engineering, and all of that. So really understand the stack end to end. And the best way to do it, I feel, is build a product of your own [and] try to sell it. You&#8217;ll get to see the whole thing. [That] doesn&#8217;t mean “Oh, stop looking for jobs—go become an entrepreneur” but really understanding workflows end to end and making that impact and sitting at the design layer will be super valued is what I think.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2314" target="_blank" rel="noreferrer noopener">38.34</a><br><strong><strong>Yeah, the other thing I tell people is you have interests, so go deep in them and build something in whatever you&#8217;re interested in. Domain knowledge is going to be valuable moving forward, but also you end up building something that you would want to use yourself, and you learn a lot of things along the way, and then maybe that&#8217;s how you get your name out there, right?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2339" target="_blank" rel="noreferrer noopener">38.59</a><br>Exactly. Solving for your own problem is the best advice: Try to build something that solves your own pain point. Try to also advocate for it. I feel like social media and all of this is so good at this point that you can really make a mark in nontraditional ways. You probably don&#8217;t even have to submit a job application. You can have a GitHub repository that gets a lot of stars—that might land you a job. So think of all of these ways to bring yourself more visibility as you build so that you don&#8217;t have to go through your typical job queue.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2370" target="_blank" rel="noreferrer noopener">39.30</a><br><strong>And with that, thank you, Aishwarya.&nbsp;</strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2372" target="_blank" rel="noreferrer noopener">39.32</a><br>Thank you.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-aishwarya-naresh-reganti-on-making-ai-work-in-production/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Meet the Scope Creep Kraken</title>
		<link>https://www.oreilly.com/radar/meet-the-scope-creep-kraken/</link>
				<pubDate>Thu, 16 Apr 2026 10:31:31 +0000</pubDate>
					<dc:creator><![CDATA[Tim O'Brien]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18546</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Scope-Creep-Kraken-by-Tim-OBrien-created-with-ChatGPT.png" 
				medium="image" 
				type="image/png" 
				width="720" 
				height="402" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Scope-Creep-Kraken-by-Tim-OBrien-created-with-ChatGPT-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[AI didn’t invent scope creep. It just removed the friction that used to stop it.]]></custom:subtitle>
		
				<description><![CDATA[The following article was originally published on Tim O’Brien’s Medium page and is being reposted here with the author&#8217;s permission. If you’ve spent any time around AI-assisted software work, you already know the moment when the&#160;Scope Creep Kraken&#160;first puts a tentacle on the boat. The project begins with a real goal and, usually, a sensible [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article was originally published on Tim O’Brien’s </em><a href="https://medium.com/@tobrien/meet-the-scope-creep-kraken-b7190814fe5c" target="_blank" rel="noreferrer noopener">Medium</a><em> </em><em>page and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p>If you’ve spent any time around AI-assisted software work, you already know the moment when the&nbsp;<code>Scope Creep Kraken</code>&nbsp;first puts a tentacle on the boat.</p>



<p>The project begins with a real goal and, usually, a sensible one. Build the internal tool. Clean up the reporting flow. Add the missing admin screen. Then someone discovers that the model can generate a Swift application in minutes to render this on an iPhone, and the mood in the room changes.</p>



<figure class="wp-block-pullquote"><blockquote><p>“Why not? We can render this on an iOS application, and it will only take 10 minutes. Go for it. These tools are amazing. Wow.”</p></blockquote></figure>



<p>That first idea is often genuinely useful. Something that might have taken a week now takes an hour. That is part of what makes the pattern so seductive. It doesn’t begin with incompetence. It begins with tool-driven momentum.</p>



<p>The meeting continues,&nbsp;<em>“Let’s put the entire year’s backlog into the system and see if we can get this all done in a week. Ignore the token spend limits, let’s just get this done.”</em>&nbsp;What was a reasonable weekly release meeting has now set the stage for a rapid expansion in scope, and that’s how the Scope Creep Kraken takes over.</p>



<p>Scope creep is older than AI, of course. Software teams have been haunted by “while we’re at it” long before anybody was pasting stack traces into a chat window. What AI changed was the rate of growth. In the old version of this problem, extra scope still had to fight its way through staffing constraints. Somebody had to build the feature, debug it, test it, and explain why it belonged. That friction was often the only thing standing between a focused project and an over-extended team.</p>



<p>AI broke that.</p>



<p>Now the extra feature often arrives with a demo attached. “Could we add multi-language support?” Forty-five seconds later, there is a branch. “What about generated documentation?” Sure, why not? “Could the CLI accept natural language commands?” The model appears optimistic, which is enough to make the whole thing sound temporarily reasonable. Each addition looks manageable in isolation. That is how the Kraken works. It does not attack all at once. It wraps around the project one small grip at a time.</p>



<h4 class="wp-block-heading">Signs the Kraken is already on your boat</h4>



<ul class="wp-block-list">
<li>Features appearing without a ticket</li>



<li>Branches nobody asked for</li>



<li>Demos replacing design decisions</li>



<li>“It only took the model 30 seconds.”</li>
</ul>



<p>The part I keep seeing on teams is not reckless ambition so much as confident improvisation. People are reacting to real capability. They are not wrong to be excited that so much is suddenly possible.</p>



<figure class="wp-block-pullquote"><blockquote><p>The trouble starts when “we can generate this quickly” quietly replaces “we decided this belongs in the project.” Those are not the same sentence.</p></blockquote></figure>



<p>For a while, the Kraken even looks helpful. Output goes up. Screens appear. Branches multiply. People feel productive, and sometimes they really are productive in the narrow local sense. What gets hidden in that burst of visible progress is integration cost. Every tentacle has to be tested with every other tentacle. Every generated convenience becomes a maintenance obligation. Every small addition pulls the project a little farther from the problem it originally set out to solve.</p>



<figure class="wp-block-pullquote"><blockquote><p>The product manager might chime in, “A mobile application? I didn’t ask for that, but I guess it’s good. We’ll see. Who’s going to review this with the customer?”</p></blockquote></figure>



<p>That is usually when the team realizes the Kraken is already on the boat. The original sponsor asked for a hammer and is now watching a Swiss Army knife unfold in real time, with several blades no one asked for and at least one that does not seem to fold back in properly.</p>



<p><em>AI also makes it dangerously easy to confuse&nbsp;</em><strong><em>demonstrations</em></strong><em>&nbsp;with&nbsp;</em><strong><em>decisions</em></strong><em>.</em></p>



<p>The useful response is not to become suspicious of every experiment. Some of the first tentacles are worth keeping. The response is to put the old discipline back where AI made it easy to remove. Keep a written scope. Treat additions as actual decisions rather than prompt side effects. Ask what each new feature does to testing, documentation, support, and the team’s ability to explain the system six months from now. If nobody can answer those questions, the feature is not “done” just because the model produced a convincing draft.</p>



<p>What makes the&nbsp;<code>Scope Creep Kraken</code>&nbsp;a good name is that teams can use it in the moment. Once people can say, &#8220;This is another tentacle,&#8221; the conversation gets clearer. You are no longer arguing about whether the idea is clever. You are asking whether this is motivated by a requirement or a capability.</p>
]]></content:encoded>
										</item>
		<item>
		<title>AI Is Writing Our Code Faster Than We Can Verify It</title>
		<link>https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/</link>
				<pubDate>Wed, 15 Apr 2026 11:19:15 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18540</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-is-writing-our-code-faster-than-we-can-verify-it.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-is-writing-our-code-faster-than-we-can-verify-it-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[What if the answer to AI’s biggest problem has been sitting on the shelf for fifty years?]]></custom:subtitle>
		
				<description><![CDATA[This is the fourth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, and look for the next article on April 30 on O’Reilly Radar. Here’s the dirty secret of the AI coding revolution: most experienced developers still don’t really trust the code the AI writes for [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the fourth article in a series on agentic engineering and AI-driven development. Read part one <a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two <a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three <a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, and look for the next article on April 30 on O’Reilly Radar.</em></p>
</blockquote>



<p>Here’s the dirty secret of the AI coding revolution: most experienced developers still don’t really trust the code the AI writes for us.</p>



<p>If I&#8217;m being honest, that&#8217;s not actually a particularly well-guarded secret. It feels like every day there&#8217;s a new breathless “I don&#8217;t have a lick of development experience but I just vibe coded this amazing application” article. And I get it—articles like that get so much engagement because everyone is watching carefully as the drama of AIs getting better and better at writing code unfolds. We&#8217;ve had decades of shows and movies, from <em>WarGames</em> to <em>Hackers</em> to <em>Mr. Robot</em>, portraying developers as reclusive geniuses doing mysterious but incredible stuff with computers. The idea that we&#8217;ve coded ourselves out of existence is fascinating to people.</p>



<p>The flip side of that pop-culture phenomenon is that when there are problems caused by agentic engineering gone wrong (like the equally popular “I trusted an AI agent and it deleted my entire production database” articles), everyone seems to find out about it. And, unfortunately, that newly emerging trope is much closer to reality. Most of us who do agentic engineering have seen our own AI-generated code go off the rails. That’s why I built and maintain the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open-source AI skill that uses quality engineering techniques that go back over fifty years to help developers working in any language verify the quality of their AI-generated code. I was as surprised as anyone to discover that it actually works.</p>



<p>I’ve talked often about how <a href="https://www.oreilly.com/radar/trust-but-verify/" target="_blank" rel="noreferrer noopener">we need a “trust but verify” mindset</a> when using AI to write code. In the past, I’ve mostly focused on the “trust” aspect, finding ways to help developers feel more comfortable adopting AI coding tools and using them for production work. But I’m increasingly convinced that our biggest problem with AI-driven development is that <strong>we don’t have a reliable way to check the quality of code from agentic engineering at scale</strong>. AI is writing our code faster than we can verify it, and that is one of AI’s biggest problems right now.</p>



<h2 class="wp-block-heading"><strong>A false choice</strong></h2>



<p>After I got my first real taste of using AI for development in a professional setting, it felt like I was being asked to make a critical choice: either I had to outsource all of my thinking to the AI and just trust it to build whatever code I needed, or I had to review every single file it generated line by line.</p>



<p>A lot of really good, really experienced senior engineers I’ve talked to feel the same way. A small number of experienced developers fully embrace vibe coding and basically fire off the AI to do what it needs to, depending on a combination of unit tests and solid, decoupled architecture (and a little luck, maybe) to make sure things go well. But more frequently, the senior, experienced engineers I’ve talked to, folks who’ve been developing for a really long time, go the other way. When I ask them if they’re using AI every day, they’ll almost always say something like, “Yeah, I use AI for unit tests and code reviews.” That’s almost always a tell that they don&#8217;t trust the AI to build the really important code that’s at the core of the application. They’re using AI for things that won’t cause production bugs if they go wrong.</p>



<p>I think this excerpt from a recent (and excellent) article in Ars Technica, <a href="https://arstechnica.com/ai/2026/04/research-finds-ai-users-scarily-willing-to-surrender-their-cognition-to-llms/" target="_blank" rel="noreferrer noopener">“Cognitive surrender” leads AI users to abandon logical thinking</a>, sums up how many experienced developers feel about working with AI:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>When it comes to large language model-powered tools, there are generally two broad categories of users. On one side are those who treat AI as a powerful but sometimes faulty service that needs careful human oversight and review to detect reasoning or factual flaws in responses. On the other side are those who routinely outsource their critical thinking to what they see as an all-knowing machine.</em></p>
</blockquote>



<p>I agree that those are two options for dealing with AI. But I also believe that&#8217;s a false choice. “Cognitive surrender,” as the research referenced by the article puts it, is not a good outcome. But neither is reviewing every line of code the AI writes, because that&#8217;s so effort-intensive that we may as well just write it all ourselves. (And I can almost hear some of you asking, “What&#8217;s so bad about that?”)</p>



<p>This false choice is what really drives a lot of really good, very experienced senior engineers away from AI-driven development today. We see those two options, and they are both <em>terrible</em>. And that’s why I’m writing this article (and the next few in this Radar series) about quality.</p>



<h2 class="wp-block-heading"><strong>Some shocking numbers about AI coding tools</strong></h2>



<p>The <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener"><strong>Quality Playbook</strong></a> is an open-source skill for AI coding tools like GitHub Copilot, Cursor, Claude Code, and Windsurf. You point it at a codebase, and it generates a complete quality engineering infrastructure for that project: test plans traced to requirements, code review protocols, integration tests, and more. More importantly, it brings back quality engineering practices that much of the industry abandoned decades ago, using AI to do a lot of the quality-related work that used to require a dedicated team.</p>
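<p>One of the oldest techniques behind “test plans traced to requirements” is the traceability matrix: map every requirement to the tests that exercise it, and flag the ones nothing covers. A toy version, with made-up requirement and test ids (this is an illustration of the general technique, not the playbook&#8217;s actual format), looks like this:</p>

```python
def uncovered_requirements(requirements, tests):
    """Return requirement ids that no test claims to cover.

    `requirements` is a set of requirement ids; `tests` maps a test
    name to the requirement ids it exercises. A nonempty result means
    the test plan has gaps that review should catch before shipping.
    """
    covered = set()
    for req_ids in tests.values():
        covered.update(req_ids)
    return sorted(requirements - covered)

# Hypothetical requirement and test ids for illustration.
reqs = {"REQ-1", "REQ-2", "REQ-3"}
tests = {
    "test_login": ["REQ-1"],
    "test_logout": ["REQ-1", "REQ-2"],
}
print(uncovered_requirements(reqs, tests))  # ['REQ-3']
```

<p>The value of this check in an AI-driven workflow is that it scales: an agent can generate tests far faster than a human can read them, but a coverage gap in the matrix is machine-checkable either way.</p>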



<p>I built the Quality Playbook as part of an experiment in AI-driven development and agentic engineering, building an open-source project called <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a> and writing about the process in <a href="https://oreillyradar.substack.com/p/the-accidental-orchestrator" target="_blank" rel="noreferrer noopener">this ongoing Radar series</a>. The playbook emerged directly from that experiment. The ideas behind it are over fifty years old, and they work.</p>



<p>Along the way, I ran into a shocking statistic.</p>



<p>We already know that many (most?) developers these days use AI coding tools like GitHub Copilot, Claude Code, Gemini, ChatGPT, and Cursor to write production code. But do we trust the code those tools generate? “Trust in these systems has collapsed to just 33%, a sharp decline from over 70% in 2023.”</p>



<p>That quote is from a <a href="https://gemini.google.com/share/df5ec9551c1c" target="_blank" rel="noreferrer noopener">Gemini Deep Research report</a> I generated while doing research for this article. 70% dropping to 33%—that sounds like a massive collapse, right?</p>



<p>The thing is, when I checked the sources Gemini referenced, the truth wasn’t nearly as clear-cut. That “over 70% in 2023” number came from a Stack Overflow survey measuring how favorably developers view AI tools. The “33%” number came from a Qodo survey asking whether developers trust the accuracy of AI-generated code. Gemini grabbed both numbers, stripped the context, and stitched them into a single decline narrative. No single study ever measured trust dropping from over 70% to 33%. Which means we’ve got an apples-to-oranges comparison, and it might even technically be accurate (sort of?), but it’s not really the headline-grabber that it seemed to be.</p>



<p>So why am I telling you about it?</p>



<p>Because there are two important lessons from that “shocking” stat. The first is that the overall idea rings true, at least for me. Almost all of us have had the experience of generating code with AI faster than we can verify it, and we ship features before we fully review them.</p>



<p>The second is that when Gemini created the report, the AI fabricated the most alarming version of the story from real but unrelated data points. If I’d just cited it without checking the sources, there’s a pretty good chance it would get published, and you might even believe it. That’s ironically self-referential, because it’s literally the trust problem the survey is supposedly measuring. The AI produced something that looked authoritative, felt correct, and was wrong in ways that only careful verification could catch. If you want to understand why over 70% of developers don’t fully trust AI-generated code, you just watched it happen.</p>



<p>One reason many of us don’t trust AI-generated code is that there’s a growing gap between how fast AI can generate code and how well we can verify that the code actually does what we intended. The usual response to this verification gap is to adopt better testing tools. And there are plenty of them: test stub generators, diff reviewers, spec-first frameworks. These are useful, and they solve real problems. But they generally share a blind spot: they work with what the code does, not with what it’s supposed to do. Luckily, the intent is sitting right there: in the specs, the schemas, the defensive code, the history of the AI chats about the project, even the variable names and filenames. We just need a way to use it.</p>



<p>AI-driven development needs its own quality practices, and the discipline we need already exists. It was just (unfairly) considered too expensive to use&#8230; until AI made it cheap.</p>



<h2 class="wp-block-heading"><strong>(Re-)introducing quality engineering</strong></h2>



<p>There’s a difference between knowing that code works and knowing that it does what it’s supposed to do. It’s the difference between “does this function return the right value?” and “does this system fulfill its purpose?”—and as it turns out, that’s one of the oldest problems in software engineering. In fact, as I talked about in a previous Radar article, <a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>, it was the source of the original “software crisis.”</p>



<p>The software crisis was the term our industry used back in the 1960s, when it was coming to grips with large software projects around the world that were routinely delivered late and over budget, producing software that didn’t do what it was supposed to do. At the 1968 NATO Software Engineering Conference—the conference that introduced the term “software engineering”—some of the top experts in the industry argued that the crisis was caused by developers and their stakeholders having trouble understanding the problems they were solving, communicating those needs clearly, and making sure that the systems they delivered actually met their users’ needs. Nearly two decades later, Fred Brooks made the same argument in his pioneering essay, <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet" target="_blank" rel="noreferrer noopener"><em>No Silver Bullet</em></a>: no tool can, on its own, eliminate the inherent difficulty of understanding what needs to be built and communicating that intent clearly. And now that we talk to our AI development tools the same way we talk to our teammates, we’re more susceptible than ever to that underlying problem of communication and shared understanding.</p>



<p>An important part of the industry’s response to the software crisis was <strong>quality engineering</strong>, a discipline built specifically to close the gap between intent and implementation by defining what “correct” means up front, tracing tests back to requirements, and verifying that the delivered system actually does what it’s supposed to do. For years it was standard practice for software engineering teams to include quality engineering phases in all projects. But few teams today do traditional quality engineering. Understanding why it got left behind by so many of us and, more importantly, what it can do for us now can make a huge difference for agentic engineering and AI-driven development today.</p>



<p>Starting in the 1950s, three thinkers built the intellectual foundation that manufacturing used to become dramatically more reliable.</p>



<ul class="wp-block-list">
<li>W. Edwards Deming argued that quality is built into the process, not inspected in after the fact. He taught us that you don’t test your way to a good product; you design the system that produces it.</li>



<li>Joseph Juran defined quality as fitness for use: not just “does it work?” but “does it do what it’s supposed to do, under real conditions, for the people who actually use it?”</li>



<li>Philip Crosby made the business case: quality is free, because building it in costs less than finding and fixing defects after the fact. By the time I joined my first professional software development team in the 1990s, these ideas were standard practice in our industry.</li>
</ul>



<p>These ideas revolutionized software quality, and the people who put them into practice were called <strong>quality engineers</strong>. They built test plans traced to requirements, ran functional testing against specifications, and maintained living documentation that defined what “correct” meant for each part of the system.</p>



<p>So why did all of this disappear from most software teams? (It’s still alive in regulated industries like aerospace, medical devices, and automotive, where traceability is mandated by law, and a few brave holdouts throughout the industry.) It wasn’t because it didn’t work. Quality engineering got cut because it was <em>perceived as expensive</em>. Crosby was right that quality is free: the cost of building it in is far more than made up for by the savings you get from not finding and fixing defects later. But the costs come at the beginning of the project and the savings come at the end. In practice, that means when the team blows a deadline and the manager gets angry and starts looking for something to cut, the testing and QA activities are easy targets because the software already seems to be complete.</p>



<p>On top of the perceived expense, quality engineering required specialists. Building good requirements, designing test plans, and planning and running functional and regression testing are real, technical skills, and most teams simply didn’t have anyone (or, more specifically, the budget for anyone) who could do those jobs.</p>



<p>Quality engineering may have faded from our projects and teams over time, but the industry didn’t just give up on many of its best ideas. Developers are nothing if not resourceful, and we built our own quality practices—three of the most popular are test-driven development, behavior-driven development, and agile-style iteration—and these are genuinely good at what they do. TDD keeps code honest by making you write the test before the implementation. BDD was specifically designed to capture requirements in a form that developers, testers, and stakeholders can all read (though in practice, most teams strip away the stakeholder involvement and it devolves into another flavor of integration testing). Agile iteration tightens the feedback loop so you catch problems earlier.</p>



<p>Those newer quality practices are practical and developer-focused, and they’re less expensive to adopt than traditional quality engineering in the short run because they live inside the development cycle. The upside of those practices is that development teams can generally implement them on their own, without asking for permission or requiring experts. The tradeoff, however, is that those practices have limited scope. They verify that the code you’re writing right now works correctly, but they don’t step back and ask whether the system as a whole fulfills its original intent. Quality engineering, on the other hand, establishes the intent of the system before the development cycle even begins, and keeps it up to date and feeds it back to the team as the project progresses. That’s a huge piece of the puzzle that got lost along the way.</p>



<p>Those highly effective quality engineering practices got cut from most software engineering teams because they were viewed as expensive, not because they were wrong. When you’re doing AI-driven development, you’re actually running into <em>exactly the same problem</em> that quality engineering was built to solve. You have a “team”—your AI coding tools—and you need a structured process to make sure that team is building what you actually intend. Quality engineering is such a good fit for AI-driven development because it’s the discipline that was specifically designed to close that gap between what you ask for and what gets built.</p>



<p>What nobody expected is that AI would make it cheap enough <em>in the short run</em> to bring quality engineering back to our projects.</p>



<h2 class="wp-block-heading"><strong>Introducing the Quality Playbook</strong></h2>



<p>I’ve long suspected that quality engineering would be a perfect fit for AI-driven development (AIDD), and I finally got a chance to test that hypothesis. As part of my experiment with AIDD and agentic engineering (which I’ve been writing about in <a href="https://oreillyradar.substack.com/p/the-accidental-orchestrator" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a> and the rest of this series), I built the <a href="https://github.com/github/awesome-copilot/tree/main/skills/quality-playbook" target="_blank" rel="noreferrer noopener"><strong>Quality Playbook</strong></a>, a skill for AI tools like Cursor, GitHub Copilot, and Claude Code that lets you bring these highly effective quality practices to any project, using AI to do the work that used to require a dedicated quality engineering team. Like other AI skills and agents, it’s a structured document that plugs into an AI coding agent and teaches it a specific capability. You point it at a codebase, and the AI explores the code, reads whatever specifications and documentation it can find, and generates a complete quality infrastructure tailored to that project. The <a href="https://github.com/github/awesome-copilot/tree/main/skills/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a> is now part of <a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener"><strong>awesome-copilot</strong></a>, a collection of community-contributed agents (and I’ve also opened a <a href="https://github.com/anthropics/skills/pull/659" target="_blank" rel="noreferrer noopener">pull request</a> to add it to Anthropic’s repository of Claude Code skills).</p>



<p>What does “quality infrastructure” actually mean? Think about what a quality engineering team would build if you hired one. A good quality engineer would start by defining what “correct” means for your project: what the system is supposed to do, grounded in your requirements, your domain, what your users actually need. From there, they’d write tests traced to those requirements, build a code review process that checks whether the code implements what it’s supposed to, design integration tests that verify the whole system works together, and set up an audit process where independent reviewers check the code against its original intent.</p>



<p>That’s what the playbook generates. Developers using AI tools have been rediscovering the value of requirements, and spec-driven development (SDD) has become very popular. But you don’t need to be practicing strict spec-driven development to use the playbook. It infers your project’s intent from whatever artifacts are available: chat logs, schemas, README files, code comments, and even defensive code patterns. If you have formal specs, great; if not, the AI pieces together what “correct” means from the evidence it can find.</p>



<p>Once the playbook figures out the intent of the code, it creates <strong>quality infrastructure</strong> for the project. Specifically, it generates ten deliverables:</p>



<ul class="wp-block-list">
<li><strong>Exploration and requirements elicitation (EXPLORATION.md):</strong> Before the playbook writes anything, it spends an entire phase reading the code, documentation, specs, and schemas, and writes a structured exploration document that maps the project’s architecture and domain. The most common failure mode in AI-generated quality work is producing generic content that could apply to any project. The exploration phase forces the AI to ground everything in this specific codebase, and serves as an audit trail: if the requirements end up wrong, you can trace the problem back to what the exploration discovered or missed.</li>



<li><strong>Testable requirements (REQUIREMENTS.md):</strong> The most important deliverable. Building on the exploration, a five-phase pipeline extracts the actual intent of the project from code, documentation, AI chats, messages, support tickets, and any other project artifacts you can give it. The result is a specification document that a new team member or AI agent can read top-to-bottom and understand the software. Each requirement is tagged with an authority tier and linked to use cases that become the connective tissue tying requirements to integration tests to bug reports.</li>



<li><strong>Quality constitution (QUALITY.md):</strong> Defines what “correct” means for your specific project, grounded in your actual domain. Every standard has a rationale explaining why it matters, because without the rationale, a future AI session will argue the standard down.</li>



<li><strong>Spec-traced functional tests:</strong> Tests generated from the requirements, not from source code. That difference matters: a test generated from source code verifies that the code does what the code does, while a test traced to a spec verifies that the code does what you intended.</li>



<li><strong>Three-pass code review protocol with bug reports and regression tests:</strong> Three mandatory review passes, each using a different lens: structural review with anti-hallucination guardrails, requirement verification (where you catch things the code doesn’t do that it was supposed to), and cross-requirement consistency checking. Every confirmed bug gets a regression test and a patch file.</li>



<li><strong>Consolidated bug report (BUGS.md):</strong> Every confirmed bug with full reproduction details, severity calibrated to real-world impact, and a spec basis citing the specific documentation the code violates. Maintainers respond differently to “your code violates section X.Y of your own spec” than to “this looks like it might be a bug.”</li>



<li><strong>TDD red/green verification:</strong> For each confirmed bug, a regression test runs against unpatched code (must fail), then the fix is applied and the test reruns (must pass). When you tell a maintainer “here’s a test that fails on your current code and passes with this one-line fix,” that’s qualitatively different from a bug report.</li>



<li><strong>Integration test protocol:</strong> A structured test matrix that an AI agent can pick up and execute autonomously, without asking clarifying questions. Every test specifies the exact command, what it proves, and specific pass/fail criteria. Field names and types are read from actual source files, not recalled from memory, as an anti-hallucination mechanism.</li>



<li><strong>Council of Three multi-model spec audit:</strong> Three independent AI models audit the codebase against the requirements. The triage uses confidence weighting, not majority vote: findings from all three models are near-certain, findings from two are high-confidence, and findings from only one get a verification probe rather than being dismissed. The most valuable findings are often the ones only one model catches.</li>



<li><strong>AGENTS.md bootstrap file:</strong> A context file that future AI sessions read first, so they inherit the full quality infrastructure. Without it, every new session starts from zero. With it, the quality constitution, requirements, and review protocols carry forward automatically across every session that touches the codebase.</li>
</ul>
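<p>To make the “spec-traced” idea concrete, here’s a minimal sketch of what a requirement-tagged test can look like. The requirement ID, function, and limits are hypothetical, invented for illustration; they’re not taken from the playbook’s actual output.</p>

```python
# Hypothetical sketch: REQ-AUTH-003 and validate_username are
# illustrative names, not actual Quality Playbook output.

def validate_username(name: str) -> bool:
    """Accept usernames of 1 to 32 characters."""
    return 0 < len(name) <= 32

def test_req_auth_003_username_length():
    # Traces to REQ-AUTH-003: "usernames are 1-32 characters."
    # The test is derived from the requirement, not from the source
    # code, so it still fails if the implementation drifts from the spec.
    assert validate_username("alice")       # nominal case
    assert not validate_username("")        # lower boundary
    assert not validate_username("x" * 33)  # upper boundary

test_req_auth_003_username_length()
```

<p>The requirement tag in the test name is the connective tissue: a failing test points straight back to the requirement it verifies, rather than just to a line of code.</p>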



<h2 class="wp-block-heading"><strong>The third option</strong></h2>



<p>I started this article by talking about a false choice: either we surrender our judgment to the AI, or we get stuck reviewing every line of code it writes. The reality is much more nuanced, and, in my opinion, a lot more interesting, if we have a trustworthy way to verify that the code we build with the AI actually does what we intended. It’s not a coincidence that this is one of the oldest problems in software engineering, and not surprising that AI can help us with it.</p>



<p>The Quality Playbook leans heavily on classic quality engineering techniques to do that verification. Those techniques work very well, and that gives us the more nuanced option: using AI to help us write our code, and then using it to help us trust what it built.</p>



<p>That’s not a gimmick or a paradox. It works because verification is exactly the kind of structured, specification-driven work that AI is good at. Writing tests traced to requirements, reviewing code against intent, checking that the system does what it’s supposed to do under real conditions. These are the things quality engineers used to do across the whole industry (and still do in the highly regulated parts of it). They’re also things that AI can do well, as long as we tell it what “correct” means.</p>
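<p>As one example of how mechanical that structured work can be, the confidence-weighted triage described for the Council of Three audit fits in a few lines. This is a hypothetical sketch of the idea, not the playbook’s actual code; the model and finding names are invented.</p>

```python
# Hypothetical sketch of confidence-weighted triage for multi-model
# audit findings; all names here are illustrative.

def triage(findings_by_model):
    """findings_by_model: {model_name: set of finding IDs}.
    Returns {finding: tier} based on how many models reported it."""
    tiers = {3: "near-certain", 2: "high-confidence", 1: "needs-probe"}
    counts = {}
    for findings in findings_by_model.values():
        for f in findings:
            counts[f] = counts.get(f, 0) + 1
    return {f: tiers[n] for f, n in counts.items()}

audit = {
    "model_a": {"BUG-1", "BUG-2"},
    "model_b": {"BUG-1", "BUG-3"},
    "model_c": {"BUG-1", "BUG-2"},
}
for finding, tier in sorted(triage(audit).items()):
    print(finding, tier)
# BUG-1 near-certain, BUG-2 high-confidence, BUG-3 needs-probe
```

<p>The key design choice is that a finding reported by only one model earns a verification probe instead of being thrown away, which is where a simple majority vote would lose the most valuable results.</p>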



<p>The experienced engineers I talked about at the beginning of this article, the ones who only use AI for unit tests and code reviews, aren’t wrong to be cautious. They’re right that we can&#8217;t just trust whatever output the AI spits out. But limiting AI to just the “safe” parts of our projects keeps us from taking advantage of such an important set of tools. The way out of this quagmire is to build the infrastructure that makes the rest of it trustworthy too. Quality engineering gives us that infrastructure, and AI makes it cheap enough to actually use on all of our projects every day.</p>



<p>In the next few articles, I’ll show you what happened when I pointed the Quality Playbook at real, mature open-source codebases and it started finding real bugs, how the playbook emerged from my AI-driven development experiment, what the quality engineering mindset looks like in practice, and how we can learn important lessons from that experience that apply to all of our projects.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The </em><a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener"><em>Quality Playbook</em></a><em> is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of </em><a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener"><em>awesome-copilot</em></a><em>. You can try it out today by downloading it into your project and asking the AI to generate the quality playbook. The whole process takes about 10-15 minutes for a typical codebase. I&#8217;ll cover more details on running it in future articles in this series.</em></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Disclosure: Aspects of the methodology described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
</blockquote>
]]></content:encoded>
										</item>
		<item>
		<title>Grief and the Nonprofessional Programmer</title>
		<link>https://www.oreilly.com/radar/grief-and-the-nonprofessional-programmer/</link>
				<pubDate>Tue, 14 Apr 2026 11:16:00 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18536</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Grief-and-the-nonprofessional-programmer.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Grief-and-the-nonprofessional-programmer-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[I can’t claim to be a professional software developer—not by a long shot. I occasionally write some Python code to analyze spreadsheets, and I occasionally hack something together on my own, usually related to prime numbers or numerical analysis. But I have to admit that I identify with both of the groups of programmers that [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I can’t claim to be a professional software developer—not by a long shot. I occasionally write some Python code to analyze spreadsheets, and I occasionally hack something together on my own, usually related to prime numbers or numerical analysis. But I have to admit that I identify with both of the groups of programmers that Les Orchard identifies in “<a href="https://blog.lmorchard.com/2026/03/11/grief-and-the-ai-split/" target="_blank" rel="noreferrer noopener">Grief and the AI Split</a>”: those who just want to make a computer do something and those who grieve losing the satisfaction they get from writing good code.</p>



<p>A lot of the time, I just want to get something done; that’s particularly true when I’m grinding through a spreadsheet with sales data that has a half-million rows. (Yes, compared to databases, that’s nothing.) It’s frustrating to run into some roadblock in pandas that I can’t solve without looking through documentation, tutorials, and several incorrect Stack Overflow answers. But there’s also the programming that I do for fun—not all that often, but occasionally: writing a really big prime number sieve, seeing if I can do a million-point convex hull on my laptop in a reasonable amount of time, things like that. And that’s where the problem comes in&#8230;if there really is a problem.</p>



<p>The other day, I read a post of Simon Willison’s that included <a href="https://simonwillison.net/2026/Mar/11/" target="_blank" rel="noreferrer noopener">AI-generated animations of the major sorting algorithms</a>. No big deal in itself; I’ve seen animated sorting algorithms before. Simon’s were different only in that they were AI-generated—but that made me want to try vibe coding an animation rather than something static. Graphing the first N terms of a <a href="https://en.wikipedia.org/wiki/Fourier_series" target="_blank" rel="noreferrer noopener">Fourier series</a> has long been one of the first things I try in a new programming language. So I asked Claude Code to generate an interactive web animation of the Fourier series. Claude did just fine. I couldn’t have created the app on my own, at least not as a single-page web app; I’ve always avoided JavaScript, for better or for worse. And that was cool, though, as with Simon’s sorting animations, there are plenty of Fourier animations online.</p>



<p>I then got interested in animations that aren’t so common. I grabbed <a href="https://learning.oreilly.com/library/view/algorithms-in-a/9781491912973/" target="_blank" rel="noreferrer noopener"><em>Algorithms in a Nutshell</em></a>, started looking through the chapters, and asked Claude to animate a number of things I hadn’t seen, ending with <a href="https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm" target="_blank" rel="noreferrer noopener">Dijkstra’s algorithm</a> for finding the shortest path through a graph. It had some trouble with a few of the algorithms, though when I asked Claude to generate a plan first and used a second prompt asking it to implement the plan, everything worked.</p>



<p>And it was fun. I made the computer do things I wanted it to do; the thrill of controlling machines is something that sticks with us from our childhoods. The prompts were simple and short—they could have been much longer if I wanted to specify the design of the web page, but Claude’s sense of taste was good enough. I had other work to do while Claude was “thinking,” including attending some meetings, but I could easily have started several instances of Claude Code and had them create simulations in parallel. Doing so wouldn’t have required any fancy orchestration because every simulation was independent of the others. No need for Gas Town.</p>



<p>When I was done, I felt a version of the grief Les Orchard writes about. More specifically: I don’t really understand Dijkstra’s algorithm. I know what it does and have a vague idea of how it works, and I’m sure I could understand it if I read <em>Algorithms in a Nutshell</em> rather than used it as a catalog of things to animate. But now that I had the animation, I realized that I hadn’t gone through the process of understanding the algorithm well enough to write the code. And I cared about that.</p>
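<p>For anyone in the same position: the heart of Dijkstra’s algorithm really is small. Here’s a minimal Python sketch of it (my own illustration for this edit, not the animated version Claude generated):</p>

```python
import heapq

def dijkstra(graph, start):
    """Shortest distances from start in a weighted graph.
    graph: {node: [(neighbor, weight), ...]}, non-negative weights."""
    dist = {start: 0}
    heap = [(0, start)]                      # (distance so far, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                         # stale heap entry, skip it
        for nbr, w in graph[node]:
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd               # found a shorter path
                heapq.heappush(heap, (nd, nbr))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

<p>The insight the code encodes: always settle the closest unsettled node next, because with non-negative weights no later path can beat it.</p>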



<p>I also cared about Fourier transformations: I would never “need” to write that code again. If I decide to learn Rust, will I write a Fourier program, or ask Claude to do it and inspect the output? I already knew the theory behind Fourier transforms—but I realized that an era had ended, and I still don’t know how I feel about that. Indeed, a few months ago, I vibe coded an application that recorded some audio from my laptop’s microphone, did a discrete Fourier transform, and displayed the result. After pasting the code into a file, I took the laptop over to the piano, started the program, played a C, and saw the fundamental and all the harmonics. The era was already in the past; it just took a few months to hit me.</p>
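<p>The math behind that app is compact, too. Here’s a naive pure-Python DFT sketch (an illustration of the underlying idea, not the vibe-coded app’s actual code) that finds the dominant frequency bin of a pure tone:</p>

```python
import cmath, math

def dft(samples):
    """Naive discrete Fourier transform, O(N^2) -- the textbook sum,
    not the fast butterfly algorithm."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A pure tone: 8 cycles across 64 samples, like a single piano note.
n = 64
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
spectrum = [abs(c) for c in dft(tone)]
peak = spectrum.index(max(spectrum[: n // 2]))
print(peak)  # 8 -- the energy lands in the bin of the tone's frequency
```

<p>Playing a C into the microphone is the same experiment with messier input: the fundamental and its harmonics each show up as peaks in their own bins.</p>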



<p>Why does this bother me? My problem isn’t about losing the pleasure of turning ideas into code. I’ve always found coding at least somewhat frustrating, and at times, seriously frustrating. But I’m bothered by the lack of understanding: I was too lazy to look up how Dijkstra works, too lazy to look up (again) how discrete Fourier works. I made the computer do what I wanted, but I lost the understanding of how it did it.</p>



<p>What does it mean to lose the understanding of how the code works? Anything? It’s common to place the transition to AI-assisted coding in the context of the transition from assembly language to higher-level languages, a process that started in the late 1950s. That’s valid, but there’s an important difference. You can certainly program a discrete fast Fourier transform in assembly; that may even be one of the last bastions of assembly programs, since FFTs are extremely useful and often have to run on relatively slow processors. (The “butterfly” algorithm is very fast.) But you can’t learn signal processing by writing assembly any more than you can learn graph theory. When you’re writing in assembler, you have to know what you’re doing in advance. The early high-level languages (Fortran, Lisp, Algol, even BASIC) are much better for gradually pushing forward to understanding, to say nothing of our modern languages.</p>



<p>That is the real source of grief, at least for me. I want to understand how things work. And I admit that I’m lazy. Understanding how things work quickly comes in conflict with getting stuff done—especially when staring at a blank screen—and writing Python or Java has a lot to do with how you come to an understanding. I will never need to understand convex hulls or Dijkstra’s algorithm. But thinking more broadly about this industry, I wonder whether we’ll be able to solve the new problems if we delegate understanding the old problems to AI. In the past, I’ve argued that I don’t see AI becoming genuinely creative because creativity isn’t just a recombination of things that already exist. I’ll stick by that, especially in the arts. AI may be a useful tool, but I don’t believe it will become an artist. But anyone involved with the arts also understands that creativity doesn’t come from a blank slate; it also requires an understanding of history, of how problems were solved in the past. And that makes me wonder whether humans—at least in computing—will continue to be creative if we delegate that understanding to AI.</p>



<p>Or does creativity just move up the stack to the next level of abstraction? And is that next level of abstraction all about understanding problems and writing good specifications? Writing a detailed specification is itself a kind of programming. But I don’t think that kind of programming will assuage the grief of the programmer who loves coding—or who may not love coding but loves the understanding that it brings.</p>
]]></content:encoded>
										</item>
	</channel>
</rss>
