<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Thu, 18 Jun 2026 14:25:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Kubernetes in the Age of AI</title>
		<link>https://www.oreilly.com/radar/kubernetes-in-the-age-of-ai/</link>
				<comments>https://www.oreilly.com/radar/kubernetes-in-the-age-of-ai/#respond</comments>
				<pubDate>Thu, 18 Jun 2026 14:21:16 +0000</pubDate>
					<dc:creator><![CDATA[Andy Kwan]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18938</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Kubernetes-in-the-age-of-AI.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Kubernetes-in-the-age-of-AI-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[When Kubernetes first came onto the scene, it was a major turning point, a revision of the infrastructure and operations space that transformed the way developers and ops personnel build, deploy, and maintain applications in the cloud. It has since become the clear standard for how modern applications are built and operated. As the CNCF [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">When Kubernetes first came onto the scene, it was a major turning point, a revision of the infrastructure and operations space that transformed the way developers and ops personnel build, deploy, and maintain applications in the cloud. It has since become the clear standard for how modern applications are built and operated. As the CNCF noted in its latest <a href="https://www.cncf.io/reports/the-cncf-annual-cloud-native-survey/" target="_blank" rel="noreferrer noopener"><em>Annual Cloud Native Survey</em> report</a>, “Among container users, 82% are using Kubernetes in production in 2025, up from 66% in 2023. This represents near-universal adoption within the container ecosystem.”</p>



<p class="wp-block-paragraph">Over the last few years, another revision in the space has occurred with Kubernetes’s evolution from a container orchestrator to an AI infrastructure platform. According to the CNCF survey, “The rise of Kubernetes as the de facto AI platform represents a fundamental shift in how organizations approach machine learning operations.&nbsp;.&nbsp;.[with Kubernetes] providing a unified orchestration layer that handles both traditional application workloads and compute-intensive AI tasks.” The emergence of seismic technologies like generative AI and agentic AI has only accelerated this transformation.</p>



<p class="wp-block-paragraph">The intersection of AI with Kubernetes is undoubtedly one of the most impactful developments in the operations space. As Jonathan Johnson, software architect at Dijure, observes, “AI on K8s is very, very important, and there is not enough [resources] out there.” Raju Gandhi, senior technical architect at Edward Jones, echoes this assessment, noting that “operationalizing AI/ML on K8s is a big issue, [and it’s only] getting bigger. This is a topic that needs attention.” But what are some of the things that you should know about this trend to keep abreast and stay ahead in the game?</p>



<h2 class="wp-block-heading"><strong>Generative AI</strong></h2>



<p class="wp-block-paragraph">Anyone with access to a computer or a smartphone has likely used some iteration of generative AI, a stunning fact when you consider that GenAI was on the outer edges of mainstream discourse and consumption a scant five years ago. But at the end of 2022, the debut of ChatGPT marked the beginning of a technological revolution, one that would impact and reshape nearly every aspect of our working and personal lives. Unsurprisingly, there are now thousands of generative AI models, a proliferation that naturally has its own set of complexities. Selecting a model is simple, but if you’re an application developer or MLOps engineer, how do you go about operating that model in a production system? Not only do you have to be cognizant of factors like resilience, scalability, security, and operational costs, but there’s the fact that bringing a model from experimentation into production can be arduous if not done properly. That’s where Kubernetes comes into play.</p>



<p class="wp-block-paragraph">As Roland Huß and Daniele Zonca, distinguished engineers at Red Hat, note, “GenAI/LLM models are resource intensive, requiring substantial computational power and large datasets. Given its scalability and extensibility, Kubernetes is uniquely suited to function as an efficient platform for AI and LLM model pretraining, fine-tuning, deployment, and prompt engineering.” They further elaborate that “this integration with Kubernetes not only simplifies the adoption of cutting-edge AI technologies but also ensures a seamless and efficient operational flow. Kubernetes, with its robust scalability and management capabilities, stands as an ideal platform for generative AI projects, aligning DevOps and MLOps practices in a cohesive ecosystem.”</p>



<p class="wp-block-paragraph">This sentiment is already shared by a wide swath of the industry. According to the CNCF survey above, as of 2025, 66% of organizations run generative AI workloads on Kubernetes. These organizations include <a href="https://kubernetes.io/case-studies/openai/" target="_blank" rel="noreferrer noopener">OpenAI</a>, which uses Kubernetes for its AI/LLM application experimenting and testing; <a href="https://llm-d.ai/blog/production-grade-llm-inference-at-scale-kserve-llm-d-vllm" target="_blank" rel="noreferrer noopener">Tesla</a>, which utilizes KServe to manage production-grade LLM inference; and <a href="https://docs.firefly.ai/integrations/data-sources/kubernetes" target="_blank" rel="noreferrer noopener">Adobe</a>, which uses Kubernetes to power its suite of generative creative models. Other companies taking this approach include <a href="https://www.zenml.io/mlops-database/uber-michelangelo-modernization-ray-on-kubernetes-michelangelo-modernization-evolving-an-end-to-end-ml-platform-from-tre">Uber</a>, <a href="https://www.techtarget.com/searchitoperations/news/366558957/Generative-AI-brings-changes-to-cloud-native-platforms" target="_blank" rel="noreferrer noopener">Intuit</a>, and <a href="https://learning.oreilly.com/library/view/generative-ai-on/9781098171919/preface01.html" target="_blank" rel="noreferrer noopener">Google</a>. With more companies adopting this practice for their generative AI and LLMs operations, it’d be prudent for any organization to leverage Kubernetes for their own GenAI and LLM workflows.</p>



<h2 class="wp-block-heading"><strong>Agentic AI</strong></h2>



<p class="wp-block-paragraph">Nearly coinciding with the rise of GenAI has been the steady growth of agentic AI. Unlike GenAI, agentic AI goes beyond answering simple prompts and generating text in its ability to operate autonomously to perform complex, multistep actions, utilize tools, and make independent decisions. With its ability to support both traditional ML processes and GenAI and LLM operations, it should come as no surprise that Kubernetes has a role in the agentic AI ecosystem as well.</p>



<p class="wp-block-paragraph">According to Ronald Petty, principal consultant at RX-M, “Kubernetes has been leveraged to host machine learning pipelines, including AI model training and inference. As inference options have become plentiful and affordable, on and off-premise, we have seen the rise of agents. Coupling cloud native technologies and popular protocols, we now see agents moving from ad hoc demos to complex fleets of agents on systems like Kubernetes.” So what are some examples of the integration between these two technologies?</p>



<p class="wp-block-paragraph">One notable offering is <a href="https://www.cncf.io/blog/2025/04/15/kagent-bringing-agentic-ai-to-cloud-native/" target="_blank" rel="noreferrer noopener">Kagent</a>, an OS programming framework that runs AI agents in Kubernetes and “helps engineers build powerful internal platforms by tackling cloud native tasks such as configuration, troubleshooting, complex deployment scenarios, observability pipelines and dashboards, and safely enabling network security.” Operating along similar lines is K8sGPT, an AI-powered tool that leverages intelligent insights and automated troubleshooting to analyze Kubernetes clusters for configuration problems and security issues, as well as generates solutions to problems discovered in analysis.</p>



<p class="wp-block-paragraph">A more recent entry in the field is <a href="https://github.com/sympozium-ai/sympozium" target="_blank" rel="noreferrer noopener">Sympozium</a>, a Kubernetes-native coordination layer for multi-agent AI systems that “solves the same problem Kubernetes solved for containers, but for agents that need to share context, hand off tasks, and maintain shared situational awareness.” Another newer offering is <a href="https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox/" target="_blank" rel="noreferrer noopener">Agent Sandbox</a>, which allows you to run AI agents as isolated, stateful workloads with a native API on Kubernetes.</p>



<h2 class="wp-block-heading"><strong>The fundamentals</strong></h2>



<p class="wp-block-paragraph">While it’s important to be aware of the latest developments and trends affecting your domain, that shouldn’t come at the expense of foundational knowledge and skills. As basketball great Michael Jordan once said, “Get the fundamentals down and the level of everything you do will rise.” One of the most fundamental skills for working with Kubernetes is networking, and frustratingly enough, it&#8217;s one of the more difficult ones to master. As Cisco senior staff engineer Nico Vibert observes, “Platform engineers tend to be comfortable with Linux networking but less so with protocols like BGP and IPv6; network administrators know those protocols well but find Kubernetes abstractions unfamiliar. Both personas struggle to navigate the dozens of networking tools seemingly required to meet connectivity and security requirements.” Yet as organizations move mission-critical workloads, AI training pipelines, and regulated financial services onto Kubernetes, the engineers who can design, secure, and troubleshoot the network layer have become some of the most sought-after professionals in the industry.</p>



<p class="wp-block-paragraph">In recognition of both the importance and difficult nature of the Kubernetes networking skill, the CNCF recently <a href="https://www.cncf.io/announcements/2025/11/11/cncf-launches-cnpe-certification-to-define-enterprise-scale-platform-engineering-globally/" target="_blank" rel="noreferrer noopener">announced</a> a new certification focused on the Kubernetes network engineer role. The certification is designed to validate hands-on networking expertise across all of the aforementioned layers, filling a gap that the Kubernetes community has long recognized.</p>



<p class="wp-block-paragraph">For organizations that use Kubernetes to develop and deliver applications, leaders and decision-makers need to be aware that utilizing Kubernetes in conjunction with the latest AI tools is no longer a luxury but a necessary practice that will allow their companies to thrive. A similar onus should be placed on the basics. When hiring your next DevOps, network, or site reliability engineer, ensure that their ability to design, secure, and troubleshoot the Kubernetes network layer is second to none.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>If you want to dive deeper, check out Roland Huß and Daniele Zonca’s </em><a href="https://learning.oreilly.com/library/view/generative-ai-on/9781098171919/" target="_blank" rel="noreferrer noopener">Generative AI on Kubernetes</a><em>, Jonathan Johnson’s <a href="https://learning.oreilly.com/live-events/gpu-kubernetes-homelab-infrastructure-as-code-for-ai-workloads/0642572275662/" target="_blank" rel="noreferrer noopener">GPU Kubernetes Homelab</a> live course, Alex Corvin, Taneem Ibrahim, and Kyle Stratis’s </em><a href="https://learning.oreilly.com/library/view/kubernetes-for-generative/9781836209935/" target="_blank" rel="noreferrer noopener">Scalable Kubernetes Infrastructure for AI Platforms</a><em>, Ashok Srirama and Sukirti Gupta’s </em><a href="https://learning.oreilly.com/library/view/kubernetes-for-generative/9781836209935/" target="_blank" rel="noreferrer noopener">Kubernetes for Generative AI Solutions</a><em>, and Yogesh Raheja’s <a href="https://learning.oreilly.com/course/k8sgpt-essentials-/9781806690077/" target="_blank" rel="noreferrer noopener">K8sGPT Essentials</a> on-demand course. They’re all on O’Reilly. If you’re not a member, you can <a href="https://www.oreilly.com/start-trial/?type=individual" target="_blank" rel="noreferrer noopener">get started with a free trial</a>.</em></p>
</blockquote>



<p class="wp-block-paragraph"></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/kubernetes-in-the-age-of-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Case Against Building Your Own Agent Platform</title>
		<link>https://www.oreilly.com/radar/the-case-against-building-your-own-agent-platform/</link>
				<comments>https://www.oreilly.com/radar/the-case-against-building-your-own-agent-platform/#respond</comments>
				<pubDate>Wed, 17 Jun 2026 13:53:16 +0000</pubDate>
					<dc:creator><![CDATA[Pete Johnson]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18935</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-case-against-building-your-own-agent-platform.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-case-against-building-your-own-agent-platform-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[You know the meeting. The board wants an AI agent strategy by end of quarter. Someone on the leadership team has read a McKinsey report. You&#8217;ve been voluntold to build the platform. The slide deck says &#8220;AI-native.&#8221; The acceptance criteria are vague. Somebody mentions LangGraph, and somebody else says, &#8220;We&#8217;ll just wrap it ourselves.&#8221; You [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">You know the meeting. The board wants an AI agent strategy by end of quarter. Someone on the leadership team has read a McKinsey report. You&#8217;ve been voluntold to build the platform. The slide deck says &#8220;AI-native.&#8221; The acceptance criteria are vague. Somebody mentions LangGraph, and somebody else says, &#8220;We&#8217;ll just wrap it ourselves.&#8221;</p>



<p class="wp-block-paragraph">You ask what &#8220;done&#8221; looks like. Nobody in the room can answer.</p>



<p class="wp-block-paragraph">The cost of building this is almost always estimated before anyone has a clear picture of what &#8220;this&#8221; actually is. And that&#8217;s the problem I want to work through here, because the scope of the work being casually assigned to internal platform teams right now is genuinely larger than the people assigning it understand.</p>



<h2 class="wp-block-heading"><strong>Build versus buy, flipped in a year</strong></h2>



<p class="wp-block-paragraph">This particular pendulum has swung before. App servers in the late 1990s. Content management systems in the 2000s. Container orchestration in the 2010s. The pattern rhymes every time: When a category is new, the components look deceptively simple. Early adopters build their own. The market catches up. Within 18 months, building becomes the expensive path. Within 36 months, the teams that built internally are rewriting on top of the category winner that emerged while they weren&#8217;t looking.</p>



<p class="wp-block-paragraph">What&#8217;s different about the current moment is the speed. Menlo Ventures&#8217; <a href="https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/" target="_blank" rel="noreferrer noopener"><em>2025 State of Generative AI in the Enterprise</em> report</a> shows the build-versus-buy split inverted in a single year. In 2024, 47% of enterprise AI solutions were built internally. By late 2025, that number had collapsed to 24%. The market made the decision in 12 months, which is unusual.</p>



<p class="wp-block-paragraph">I&#8217;ve lived through enough of these transitions to recognize the shape. What I want to do in this piece is explain why I think the scope of &#8220;agent platform&#8221; is systematically underestimated right now, and what platform engineers should be asking before they commit to building one.</p>



<h2 class="wp-block-heading"><strong>Most &#8220;agent platforms&#8221; aren&#8217;t</strong></h2>



<p class="wp-block-paragraph">A lot of the projects labeled &#8220;agent platform&#8221; right now are actually workflow systems with an LLM in the loop. That&#8217;s a meaningful distinction. As Anthropic pointed out in its &#8220;<a href="https://www.anthropic.com/research/building-effective-agents" target="_blank" rel="noreferrer noopener">Building Effective Agents</a>&#8221; guidance, workflows are systems where LLMs and tools are <strong>orchestrated</strong> through <strong>predefined code paths</strong>. Agents are systems where LLMs <strong>dynamically</strong> direct their <strong>own processes and tool usage</strong>.</p>



<p class="wp-block-paragraph">Most of what enterprises are shipping today sits on the workflow side. That&#8217;s fine. Workflows have bounded requirements, tractable testing, and predictable failure modes. If your team is building a workflow system, you might reasonably build it yourselves.</p>



<p class="wp-block-paragraph">The trap is that teams start building for workflows, then get asked to support agents, and discover the jump isn&#8217;t incremental. Agents need memory that survives across sessions. They need evaluation that handles nondeterminism. They need governance that tracks actions, not just outputs. They need orchestration that recovers from failure modes a workflow engine never sees.</p>



<p class="wp-block-paragraph">Here&#8217;s the thesis I want to put on the table: The decision to build an agent platform almost always underestimates the long tail. Memory, governance, eval, and orchestration aren&#8217;t features you add to a workflow engine. They&#8217;re separate product bets, each with its own maturity curve, its own vendor landscape, and its own team of specialists who&#8217;ve been working on it full-time for 18 months while you&#8217;ve been doing something else.</p>



<p class="wp-block-paragraph">Let me walk through them.</p>



<h3 class="wp-block-heading">Memory</h3>



<p class="wp-block-paragraph">The assumption inside most build proposals is that memory is a database problem. You&#8217;ll pick a vector store, shove conversation history into it, and retrieve relevant chunks when the agent needs context. Done.</p>



<p class="wp-block-paragraph">Production memory is three separate systems: episodic, semantic, and procedural, each with different retention and retrieval policies. It&#8217;s temporal reasoning that tracks when facts were valid, not just what they were. It&#8217;s deduplication, multitenant isolation, and explicit source-of-truth governance.</p>



<p class="wp-block-paragraph">The signal that this is a separate product category, not a feature: Mem0 raised <a href="https://mem0.ai/series-a" target="_blank" rel="noreferrer noopener">$24 million across seed and Series A</a>. Letta (formerly MemGPT) raised <a href="https://www.felicis.com/blog/letta" target="_blank" rel="noreferrer noopener">$10M from Felicis</a>. Zep exists as an independent company with a <a href="https://arxiv.org/abs/2501.13956" target="_blank" rel="noreferrer noopener">temporal knowledge graph engine</a>. Mem0&#8217;s <a href="https://mem0.ai/blog/state-of-ai-agent-memory-2026" target="_blank" rel="noreferrer noopener"><em>State of AI Agent Memory 2026</em> report</a> maps 21 frameworks across three hosting models with measurable benchmark gaps between them. On <a href="https://mem0.ai/blog/graph-memory-solutions-ai-agents" target="_blank" rel="noreferrer noopener">LongMemEval</a>, Zep scores 15 points higher than Mem0 on temporal queries, which tells you these aren&#8217;t interchangeable tools that happen to serve the same market.</p>



<p class="wp-block-paragraph">This is the component that platform teams underestimate hardest. Memory sounds like a database problem. It isn&#8217;t.</p>



<h3 class="wp-block-heading">Governance</h3>



<p class="wp-block-paragraph">The assumption is that governance is RBAC plus audit logging. Your agents are services. Services get role-based access controls. You log the tool calls. Compliance is happy.</p>



<p class="wp-block-paragraph">Agent governance is something different. It spans action authorization, not just data authorization. It requires decision-chain auditability, where you can reconstruct why the agent did what it did, not just what it did. It needs behavioral drift detection, tiered autonomy, and compliance mapped to agent actions rather than data accesses.</p>



<p class="wp-block-paragraph">Grant Thornton&#8217;s <a href="https://www.grantthornton.com/services/advisory-services/artificial-intelligence/2026-ai-impact-survey" target="_blank" rel="noreferrer noopener"><em>2026 AI Impact Survey</em></a> of 950 business executives found that 78% lack strong confidence they could pass an independent AI governance audit within 90 days. Meanwhile, enterprises are moving to increase agent autonomy faster than their governance frameworks can keep up. Traditional AI governance wasn&#8217;t designed for action-level authorization, which is where most agent-specific risk accumulates.</p>



<p class="wp-block-paragraph">And there&#8217;s a hard deadline attached to this. The <a href="https://www.covasant.com/blogs/eu-ai-act-compliance-autonomous-agents-enterprise-2026" target="_blank" rel="noreferrer noopener">EU AI Act</a> becomes fully enforceable for high-risk systems in August 2026. Credit scoring, hiring decisions, healthcare support, and critical infrastructure all fall in scope. If your internal platform doesn&#8217;t handle conformity assessments, human oversight mechanisms, complete audit trails, and ongoing monitoring, that&#8217;s not a v2 feature. That&#8217;s a legal exposure.</p>



<p class="wp-block-paragraph">OWASP now documents &#8220;<a href="https://www.ewsolutions.com/agentic-ai-governance/" target="_blank" rel="noreferrer noopener">excessive agency</a>&#8221; as a top vulnerability class for LLM applications. Cornell researchers have demonstrated indirect prompt injection attacks that manipulate agents through content they ingest. These are agent-specific attack surfaces, and traditional security tooling doesn&#8217;t see them.</p>



<p class="wp-block-paragraph">RBAC was designed for humans with predictable intent. Agents don&#8217;t have predictable intent.</p>



<h3 class="wp-block-heading">Eval</h3>



<p class="wp-block-paragraph">The assumption is that evaluation means writing test cases and measuring accuracy. You built software before. You know how to test things.</p>



<p class="wp-block-paragraph">Agent evaluation is qualitatively different from traditional software testing or even LLM evaluation, <a href="https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a" target="_blank" rel="noreferrer noopener">McKinsey&#8217;s QuantumBlack team noted</a>: For LLMs, you evaluate the response to a prompt. For a single agent, you evaluate the full trajectory, including tool calls, state transitions, and intermediate decisions. For multi-agent systems, you evaluate system dynamics, including coordination patterns and collective invariants.</p>



<p class="wp-block-paragraph">This matters because agent behavior is nondeterministic by design. The same input produces different valid execution paths. &#8220;Did the agent succeed?&#8221; is no longer a yes-or-no question, because the agent might reach the right answer through a trajectory you didn&#8217;t anticipate, or reach the wrong answer through a trajectory that looks reasonable until the last step.</p>



<p class="wp-block-paragraph">The tooling ecosystem reflects this. <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-agents" target="_blank" rel="noreferrer noopener">Google Vertex AI has standardized</a> <code>trajectory_exact_match</code>, <code>trajectory_precision</code>, and <code>trajectory_recall</code> as production metrics. These didn&#8217;t exist 18 months ago. LangSmith, Braintrust, Arize, Galileo, Maxim, and others are building full evaluation platforms around trajectory-based analysis, LLM-as-judge scoring with statistical validation, and regression testing against production failures.</p>



<p class="wp-block-paragraph">Here&#8217;s the signal that the category is real: LangChain&#8217;s <a href="https://www.getmaxim.ai/articles/top-5-ai-evaluation-platforms-in-2026-2/" target="_blank" rel="noreferrer noopener"><em>2026 State of AI Agents</em> report</a> found that 57% of organizations now have agents in production, and 32% cite quality as the top deployment barrier. Gartner projects that 60% of software engineering teams will adopt AI evaluation and observability platforms by 2028, up from 18% in 2025. When a category jumps from 18% to 60% adoption in three years, that&#8217;s not a &#8220;we can build this in a sprint&#8221; situation.</p>



<p class="wp-block-paragraph">You can&#8217;t tell whether your evaluation is working without another evaluation. Judge drift, calibration against human experts, internal consistency across independent runs. . .your eval system needs its own eval system, which is exactly the kind of recursion that eats platform teams alive.</p>



<h3 class="wp-block-heading">Orchestration</h3>



<p class="wp-block-paragraph">The orchestration layer hasn&#8217;t converged. LangGraph uses directed graphs with conditional edges. CrewAI uses role-based crews. OpenAI&#8217;s Agents SDK uses explicit handoffs. AutoGen uses conversational GroupChat. Google ADK uses hierarchical agent trees. Claude&#8217;s Agent SDK uses tool-use chains with subagents. Microsoft&#8217;s Agent Framework is its own thing. Each represents a different bet on state management, communication pattern, and coordination model. None of them are interchangeable. Migration between them isn&#8217;t a config change—it&#8217;s rewriting most of your agent logic.</p>



<p class="wp-block-paragraph">Underneath them, the protocol layer is still being invented. The <a href="https://www.anthropic.com/research/building-effective-agents" target="_blank" rel="noreferrer noopener">Model Context Protocol</a> is becoming the standard for tool integration, and agent-to-agent (A2A) protocols are emerging for cross-framework coordination. Both are moving targets, and building on a moving protocol is a cost that internal platform teams rarely price in.</p>



<p class="wp-block-paragraph">If you built your own orchestration layer in 2024, you&#8217;re rewriting it in 2026. The teams that picked a framework spent those two years shipping.</p>



<h2 class="wp-block-heading">The honest case for building</h2>



<p class="wp-block-paragraph">I want to engage the strongest version of the build argument, because there are real reasons to build, and pretending otherwise makes this piece less useful than it should be.</p>



<p class="wp-block-paragraph">Proprietary data genuinely is a durable competitive moat. Mastercard built a foundation model on its transaction network. Plaid built one on its financial institution coverage. As <a href="https://www.pymnts.com/artificial-intelligence-2/2026/fintechs-race-to-build-foundation-models-on-proprietary-data/" target="_blank" rel="noreferrer noopener">Morgan Stanley&#8217;s analysis</a> from last year made clear, decades of verified historical data with consistent identifiers is both technically challenging and prohibitively expensive for outside players to recreate. If your organization has data like that, you should absolutely build on it.</p>



<p class="wp-block-paragraph">Regulated industries have legitimate reasons to want control over the full stack. Off-the-shelf AI tools don&#8217;t always cleanly map to frameworks like HIPAA, GxP, 21 CFR Part 11, SOX, FFIEC, and PCI DSS, and the cost of a failed audit is measured in business units shut down, not in sprints.</p>



<p class="wp-block-paragraph">Vendor lock-in at the AI layer is subtler and more dangerous than in traditional software. If your agentic workflows are built on a vendor&#8217;s proprietary orchestration layer, switching costs compound rapidly across memory, eval, and integrations simultaneously.</p>



<p class="wp-block-paragraph">But here&#8217;s the distinction that matters: Those are arguments for building agents on top of platform components, not arguments for building the platform components themselves. You can own the data, the domain logic, the evaluation criteria, the governance policies, and the specific behaviors your business needs without owning the memory layer, the orchestration engine, or the trace collection infrastructure underneath them.</p>



<p class="wp-block-paragraph">Build the things that are specific to your business. Buy the things that are specific to the technology category. That&#8217;s the heuristic.</p>



<h2 class="wp-block-heading"><strong>Five questions before you commit</strong></h2>



<p class="wp-block-paragraph">If you&#8217;re the platform engineer being pulled into this decision, here are the questions worth asking before anyone signs up for the scope.</p>



<p class="wp-block-paragraph"><strong>Are you building an agent platform or a workflow system?</strong> They&#8217;re not the same scope, and conflating them is where most of the cost overruns originate. A workflow system is a reasonable thing to build. An agent platform is four product categories you haven&#8217;t staffed for.</p>



<p class="wp-block-paragraph"><strong>Can you articulate what &#8220;done&#8221; looks like for each of the four components?</strong> Memory, governance, eval, orchestration. In under three sentences each. If you can&#8217;t, you don&#8217;t have requirements. You have a vibe. And vibes don&#8217;t ship.</p>



<p class="wp-block-paragraph"><strong>What happens to your platform when you need to swap the underlying model?</strong> Menlo&#8217;s <a href="https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/" target="_blank" rel="noreferrer noopener">December 2025 data</a> shows Anthropic went from 12% of enterprise LLM spend in 2023 to 40% in 2025, while OpenAI fell from 50% to 27%. Enterprises didn&#8217;t plan those switches. The capability gaps forced them. If your internal platform hardcoded assumptions about context windows, tool-calling formats, or reasoning styles from one vendor, swapping models isn&#8217;t an API key change. It&#8217;s simultaneous rewrites across memory, eval, and orchestration.</p>



<p class="wp-block-paragraph"><strong>What happens when the techniques themselves change?</strong> Eighteen months ago the default pattern was RAG with flat vector retrieval. Now it&#8217;s just-in-time context strategies, agent-managed memory tiers, and trajectory-based evaluation. Anthropic&#8217;s <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" target="_blank" rel="noreferrer noopener">own follow-up</a> to &#8220;Building Effective Agents&#8221; explicitly acknowledges the field has moved since they wrote the original. If your platform baked in the 2024 patterns, the 2026 patterns are a refactor, not a config change. Vendor platforms absorb those shifts as releases. Internal platforms absorb them as sprints.</p>



<p class="wp-block-paragraph"><strong>What happens when the platform team leaves?</strong> This is the tale as old as COBOL, custom ESBs in 2008, or hand-rolled container orchestration in 2015. A small team builds something clever, it works, they move on, and five years later you&#8217;re paying premium rates to contractors who can still read the code. Agent platforms are a particularly bad candidate for this pattern because the talent pool is both small and mobile. Here&#8217;s the uncomfortable version of the question: Who on your team, today, could rebuild the memory layer if the person who wrote it left tomorrow?</p>



<h2 class="wp-block-heading">What this looks like in 2 years</h2>



<p class="wp-block-paragraph">Gartner&#8217;s prediction that <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" target="_blank" rel="noreferrer noopener">over 40% of agentic AI projects will be canceled by 2027</a> isn&#8217;t really about the AI. It&#8217;s about projects that got scoped before anyone understood the shape of the work. Most of the canceled projects will be internal builds, because internal builds are where the scope estimation error accumulates. Deloitte&#8217;s data on <a href="https://www.deloitte.com/us/en/insights/topics/digital-transformation/state-of-generative-ai-in-enterprise.html" target="_blank" rel="noreferrer noopener">two- to four-year AI ROI horizons</a> is the warning shot. If your timeline to value is already long, every month you spend rebuilding a component that exists as a product is a month you don&#8217;t have.</p>



<p class="wp-block-paragraph">The teams that built their platforms around OpenAI in 2023 weren&#8217;t wrong. They made a reasonable bet on the market leader at the time. But they spent 2025 porting to a landscape where Anthropic had tripled share and Google had gone from 7% to 21%. The teams that picked model-agnostic platforms spent 2025 shipping. The only durable bet in this space is the one that assumes the bet will change.</p>



<p class="wp-block-paragraph">The best platform engineering decision you can make this quarter might be to not build the platform.</p>



<h2 class="wp-block-heading">Sources</h2>



<h3 class="wp-block-heading">Primary sources</h3>



<ul class="wp-block-list">
<li>Menlo Ventures, <em>2025: The State of Generative AI in the Enterprise</em>, December 2025, <br><a href="https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/" target="_blank" rel="noreferrer noopener">https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/</a>.</li>



<li>Anthropic, &#8220;Building Effective Agents,&#8221; December 2024, <br><a href="https://www.anthropic.com/research/building-effective-agents" target="_blank" rel="noreferrer noopener">https://www.anthropic.com/research/building-effective-agents</a>.</li>



<li>Anthropic, &#8220;Effective Context Engineering for AI Agents,&#8221; 2025, <br><a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" target="_blank" rel="noreferrer noopener">https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents</a>.</li>



<li>European Commission, AI Act Regulatory Framework (Regulation EU 2024/1689), <br><a href="https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai" target="_blank" rel="noreferrer noopener">https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai</a>.</li>



<li>Google Cloud, &#8220;Evaluate Gen AI Agents,&#8221; Vertex AI Documentation, <br><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-agents" target="_blank" rel="noreferrer noopener">https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-agents</a>.</li>



<li>McKinsey QuantumBlack, &#8220;Evaluations for the Agentic World,&#8221; <br><a href="https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a" target="_blank" rel="noreferrer noopener">https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a</a>.</li>



<li>LangChain, <em>State of Agent Engineering 2026</em>,<br><a href="https://www.langchain.com/state-of-agent-engineering" target="_blank" rel="noreferrer noopener">https://www.langchain.com/state-of-agent-engineering</a>.</li>



<li>Gartner, &#8220;Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027,&#8221; June 2025, <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" target="_blank" rel="noreferrer noopener">https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027</a>.</li>



<li>Grant Thornton, <em>2026 AI Impact Survey</em>, April 2026,<br><a href="https://www.grantthornton.com/services/advisory-services/artificial-intelligence/2026-ai-impact-survey" target="_blank" rel="noreferrer noopener">https://www.grantthornton.com/services/advisory-services/artificial-intelligence/2026-ai-impact-survey</a>.</li>
</ul>



<h3 class="wp-block-heading">Secondary Sources</h3>



<ul class="wp-block-list">
<li>Mem0, &#8220;Mem0 Raises $24M to Build the Memory Layer for AI,&#8221; October 2025, <br><a href="https://mem0.ai/series-a" target="_blank" rel="noreferrer noopener">https://mem0.ai/series-a</a>.</li>



<li>Felicis, &#8220;Felicis&#8217;s Seed in Letta,&#8221; September 2024, <br><a href="https://www.felicis.com/blog/letta" target="_blank" rel="noreferrer noopener">https://www.felicis.com/blog/letta</a>.</li>



<li>Vectorize.io, &#8220;Mem0 vs Zep,&#8221; Benchmark Comparison, <br><a href="https://vectorize.io/articles/mem0-vs-zep" target="_blank" rel="noreferrer noopener">https://vectorize.io/articles/mem0-vs-zep</a>.</li>



<li>Rasmussen et al., &#8220;Zep: A Temporal Knowledge Graph Architecture for Agent Memory,&#8221; arXiv 2501.13956, <br><a href="https://arxiv.org/abs/2501.13956" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2501.13956</a>.</li>



<li>OWASP, &#8220;LLM08:2025 Excessive Agency,&#8221; OWASP Top 10 for LLM Applications, <br><a href="https://genai.owasp.org/llmrisk/llm08-excessive-agency/" target="_blank" rel="noreferrer noopener">https://genai.owasp.org/llmrisk/llm08-excessive-agency/</a>.</li>



<li>Greshake et al., &#8220;Not What You&#8217;ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,&#8221; arXiv 2302.12173, February 2023,<br><a href="https://arxiv.org/abs/2302.12173" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2302.12173</a>.</li>



<li>Model Context Protocol, Official Specification, <br><a href="https://modelcontextprotocol.io" target="_blank" rel="noreferrer noopener">https://modelcontextprotocol.io</a>.</li>



<li>PYMNTS, &#8220;FinTechs Race to Build Foundation Models on Proprietary Data,&#8221; 2026,<br><a href="https://www.pymnts.com/artificial-intelligence-2/2026/fintechs-race-to-build-foundation-models-on-proprietary-data/" target="_blank" rel="noreferrer noopener">https://www.pymnts.com/artificial-intelligence-2/2026/fintechs-race-to-build-foundation-models-on-proprietary-data/</a>.</li>



<li>Deloitte, &#8220;State of Generative AI in the Enterprise,&#8221; Quarterly Reports, <br><a href="https://www.deloitte.com/us/en/insights/topics/digital-transformation/state-of-generative-ai-in-enterprise.html" target="_blank" rel="noreferrer noopener">https://www.deloitte.com/us/en/insights/topics/digital-transformation/state-of-generative-ai-in-enterprise.html</a>.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-case-against-building-your-own-agent-platform/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Linear Thinking, Nonlinear Costs</title>
		<link>https://www.oreilly.com/radar/linear-thinking-nonlinear-costs/</link>
				<comments>https://www.oreilly.com/radar/linear-thinking-nonlinear-costs/#respond</comments>
				<pubDate>Tue, 16 Jun 2026 11:02:01 +0000</pubDate>
					<dc:creator><![CDATA[Nicole Koenigstein]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18920</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-missing-optimization-layer-in-agent-systems.png" 
				medium="image" 
				type="image/png" 
				width="1200" 
				height="896" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-missing-optimization-layer-in-agent-systems-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[The Missing Optimization Layer in Agent Systems]]></custom:subtitle>
		
				<description><![CDATA[Many AI agent systems become economically unsustainable long before they become technically impressive. Teams usually focus on model choice, prompt design, tool calling, and orchestration. Those things matter, but they are only part of the system setup. The deeper issue is that coding agents, such as Claude Code, Codex, and Jules, make agent workflows easier [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Many AI agent systems become economically unsustainable long before they become technically impressive. Teams usually focus on model choice, prompt design, tool calling, and orchestration. Those things matter, but they are only part of the system setup. The deeper issue is that coding agents, such as Claude Code, Codex, and Jules, make agent workflows easier to generate. But when implementation is abstracted away, the underlying mechanics become harder to see. Bad engineering used to produce slow code. Now it produces expensive systems that also happen to be slow.</p>



<p class="wp-block-paragraph">When we design agent systems, we still need to remember that the costs scale nonlinearly. A single user request rarely triggers a single model call. It expands into routing, retrieval, reasoning, reflection, guardrail checks, tool calls, and synthesis. Each step may repeat shared context, reload state, recompute a planner decision, or retry a failed path. What looks like an intelligent workflow can therefore behave like a recursive, stateful computation with overlapping subproblems. If that sounds like backtracking, dynamic programming, and memoization to you, you’re right.</p>



<p class="wp-block-paragraph">We already know how to optimize systems like this. The problem is that coding agents make agent systems easier to generate, but not necessarily easier to optimize. Unless we recognize the underlying mechanics, we may never ask our coding agents to apply the optimization patterns that keep our systems viable.</p>



<h2 class="wp-block-heading"><strong>Old problems wearing new clothes</strong></h2>



<p class="wp-block-paragraph">When we use coding agents to generate agent architectures, it’s tempting to stop at &#8220;the trace looks reasonable.&#8221; The tool can generate routers, retrievers, planners, evaluators, guardrails, tool interfaces, and synthesis steps. It may also know about caching, pruning, memoization, and state modeling. But it won’t necessarily implement those patterns unless you ask for these optimization layers explicitly.</p>



<p class="wp-block-paragraph">Even if you work with agent instructions, unless your SKILL.md, AGENTS.md, or project instructions include constraints around repeated context, memoization, cache invalidation, pruning, and cost per request, your resulting agent system may be functionally correct and economically wasteful at the same time. That’s the tricky part: The code can pass review, the unit tests can pass, and the architecture can look reasonable. The invoice is where the hidden computation finally shows up.</p>



<p class="wp-block-paragraph">It’s easy to give too much agency to tools like Claude Code. When a coding agent reasons in language, calls tools, reflects, and produces fluent text or code, it can feel like a knowledgeable coworker. At the interface level, that impression is understandable. These tools help teams generate more code, move faster, and become more productive. Still, this doesn’t remove the need for engineering craft underneath. Someone still has to recognize repeated context, recomputed planner decisions, correlated retries, unpruned branches, and state that can’t be reused. The coding agent can implement the system, but the engineer still has to understand what kind of system should be implemented. This is where old computer science returns, not as theory but as the optimization layer our agent systems need in production.</p>



<h2 class="wp-block-heading"><strong>The cost multiplier, repeated-work problems, and backtracking</strong></h2>



<p class="wp-block-paragraph">The cost multiplier often shows up first as latency. The user doesn’t see the router, the retries, the reflection loop, or the tool calls. They only see that the agent is taking too long. From the outside, the system looks stuck or broken. From the inside, it may simply be repeating work.</p>



<p class="wp-block-paragraph">This is one of the uncomfortable differences between traditional software and agent systems. In a conventional application, a failed operation often throws an error, times out, or leaves a trace that is easy to inspect. In an agent workflow, failure can look like effort to improve reliability. Take the weakest step in your agent workflow. If it succeeds 60% of the time, and you try to push it close to 99% reliability through retries, you need 5 retries:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="has-text-align-center wp-block-paragraph">1 <em>− </em>(1 <em>− </em>0<em>.</em>60)<sup>5 </sup>= 0<em>.</em>98976</p>
</blockquote>



<p class="wp-block-paragraph">This math assumes each retry is a roll of fair dice. LLMs aren’t dice. Whether you’re using greedy decoding or probabilistic sampling, the model is still drawing from the same underlying distribution shaped by your prompt. If the first &#8220;thought&#8221; is a hallucination or logic error, bumping the temperature won’t fix the underlying state. You aren’t buying independent trials; you’re just sampling different paths through the same flawed map and state.</p>



<p class="wp-block-paragraph">This is where the old algorithmic framing matters. In a backtracking problem, you don’t keep walking down the same failed branch and call it progress. You return to the last valid state, mark the failed path, and use the failure as information for the next choice. The point isn’t just to try again. The point is to try again under a changed state.</p>



<p class="wp-block-paragraph">Agent workflows need the same discipline. A retry shouldn’t mean &#8220;run it again and hope.&#8221; It should give the model structured feedback about why the previous attempt failed: which constraint failed, which tool result was invalid, which schema didn’t validate, which assumption was unsupported, or which branch added nothing. The next attempt should then change something meaningful: the prompt, the tool choice, the retrieved evidence, the validation constraint, or the planner state.</p>



<h2 class="wp-block-heading"><strong>Memoization, pruning, and dynamic programming</strong></h2>



<p class="wp-block-paragraph">Prompt caching is usually the first optimization. If every step repeats the same system prompt, tool definitions, schema constraints, examples, and policy rules, then caching the shared prefix is an obvious win. It reduces the cost of repeated context. But prompt caching only recognizes that text repeats. It doesn’t notice that decisions repeat.</p>



<p class="wp-block-paragraph">In many agent systems, the expensive unit isn’t only text. It’s the repeated decision. If the same or equivalent state appears again, paying the model to rediscover the same action is unnecessary. That is what memoization does: It turns repeated computation into lookup. In classical algorithms, the repeated computation might be a recursive subproblem. In an agent system, it might be a planner decision over the same task, facts, tools, and constraints. The planner can be treated as a function over state:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<div class="wp-block-math"><math display="block"><semantics><mrow><msup><mrow></mrow><mi>π</mi></msup><mi>L</mi><mi>L</mi><mi>M</mi><mo form="prefix" stretchy="false">(</mo><msub><mi>S</mi><mi>t</mi></msub><mo form="postfix" stretchy="false">)</mo><mo stretchy="false">→</mo><msub><mi>a</mi><mrow><mi>t</mi><mo>+</mo><mn>1</mn></mrow></msub></mrow><annotation encoding="application/x-tex">^πLLM(S_t) \rightarrow a_{t+1} </annotation></semantics></math></div>
</blockquote>



<p class="wp-block-paragraph">where <math data-latex="S_t "><semantics><msub><mi>S</mi><mi>t</mi></msub><annotation encoding="application/x-tex">S_t </annotation></semantics></math> is the current state of the workflow and <math data-latex="a_{t+1}"><semantics><msub><mi>a</mi><mrow><mi>t</mi><mo>+</mo><mn>1</mn></mrow></msub><annotation encoding="application/x-tex">a_{t+1}</annotation></semantics></math><sub> </sub>is the next action. Without memoization, this function is evaluated again and again through an LLM call. With memoization, the system first checks whether it has seen the same or equivalent state before. If you want a deeper walkthrough of how to use memoization, I cover it in <em><a href="https://learning.oreilly.com/library/view/ai-agents-the/0642572247775/" target="_blank" rel="noreferrer noopener">AI Agents: The Definitive Guide</a></em>.</p>



<p class="wp-block-paragraph">But memoization only helps once the system knows which states are worth revisiting. Pruning handles the other side of the problem: branches that shouldn’t be explored further. However, don’t limit pruning to KV cache pruning or speculative decoding. Use it also when a tool repeatedly returns no new information. Your next LLM call shouldn’t be a slightly reworded version of the same query. If a reflection loop keeps producing stylistic changes without improving correctness, the loop should stop. If a search path violates a constraint or depends on an unsupported assumption, it should be marked as unproductive and removed from the active search space.</p>



<p class="wp-block-paragraph">Dynamic programming becomes relevant when different branches of the workflow solve overlapping subproblems. A research agent may ask similar questions across several documents. A coding agent may inspect the same dependency chain from different entry points. A business analysis agent may compute the same metric for several report sections. If every branch solves these subproblems from scratch, the system pays repeatedly for work it has already done. Table 1 shows examples of how these patterns map to AI agent systems.</p>



<p class="has-text-align-center wp-block-paragraph"><strong>Table 1. Classical optimization patterns applied to AI agent systems </strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Optimization</strong></td><td><strong>The &#8220;old&#8221; CS way</strong></td><td><strong>The &#8220;agent&#8221; way&nbsp;</strong></td></tr><tr><td>Memoization</td><td>Store results of expensive function calls.</td><td>Cache decisions. If the agent saw this state before, don’t ask it to reason again.&nbsp;</td></tr><tr><td>Pruning</td><td>Cut off search paths in a tree that won’t lead to a solution.</td><td>Kill a reflection loop when the critique stops yielding structural improvements.</td></tr><tr><td>Dynamic programming</td><td>Break problems into overlapping subproblems.&nbsp;</td><td>Share codebase analysis across multiple specialized agents instead of rereading files.</td></tr></tbody></table></figure>



<p class="wp-block-paragraph"><br>This isn’t nostalgia. These patterns mitigate the cost structure of agent systems. Memoization reduces repeated decisions. Pruning reduces repeated failure. Dynamic programming reduces repeated subproblem solving. Together, they form the optimization layer many agent architectures are missing in production.</p>



<h2 class="wp-block-heading"><strong>Where to start: Optimization follows topology</strong></h2>



<p class="wp-block-paragraph">The patterns above aren’t a checklist you apply uniformly. Each multi-agent topology, whether centralized, decentralized, independent, or hybrid, distributes communication and coordination differently, which directly affects overhead, latency, and failure propagation. The optimization layer has to follow.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>Centralized</strong><br>A single orchestrator decides, delegates, and aggregates. The expensive unit is the orchestrator’s decision, repeated across similar inputs. Memoize the planner first.</p>



<p class="wp-block-paragraph"><strong>Decentralized</strong><br>Agents coordinate peer-to-peer, exchanging messages without a central authority. The cost moves into the communication itself: redundant exchanges, restated context, agents reasoning over the same shared state from different angles. Prompt caching on the shared context is the first win, followed by pruning exchanges that no longer add information.</p>



<p class="wp-block-paragraph"><strong>Independent/swarms</strong><br>Lightweight agents fan out without coordinating. Cheap individually, expensive in aggregate. If three of your ten agents ask semantically equivalent questions, you pay three times for the same answer. Memoization and pruning aren’t optimizations here; they’re load-bearing.</p>



<p class="wp-block-paragraph"><strong>Hybrid</strong><br>The repeated work shows up at two scales: within a cluster (overlapping subproblems among peers) and across clusters (the coordinator rediscovering the same routing decision). Use dynamic programming on shared subproblems inside the cluster, memoization on the coordinator’s decisions across them.</p>
</blockquote>



<p class="wp-block-paragraph">The optimization layer isn’t a generic discipline you bolt on. It’s a function of the shape of the implementation. Coding agents made it easy to generate the shape without seeing it. The craft is in seeing it anyway.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/linear-thinking-nonlinear-costs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Who Owns the Code Claude Wrote?</title>
		<link>https://www.oreilly.com/radar/who-owns-the-code-claude-wrote/</link>
				<comments>https://www.oreilly.com/radar/who-owns-the-code-claude-wrote/#respond</comments>
				<pubDate>Mon, 15 Jun 2026 10:58:47 +0000</pubDate>
					<dc:creator><![CDATA[Sena Evren]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18912</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Who-owns-the-code-Claude-wrote.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Who-owns-the-code-Claude-wrote-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[AI-generated code copyright explained for builders.]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Sena Evren’s Legal Layer newsletter and is being reposted here with the author’s permission. TL; DR Agentic coding tools like Claude Code, Cursor, and Codex generate code that may be uncopyrightable, owned by your employer, or contaminated by open source licenses you cannot see. Some of this is settled [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>The following article originally appeared on </em><a href="https://legallayer.substack.com/p/who-owns-the-claude-code-wrote" target="_blank" rel="noreferrer noopener"><em>Sena Evren’s </em>Legal Layer<em> newsletter</em></a><em> and is being reposted here with the author’s permission.</em></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>TL; DR</strong><br><br>Agentic coding tools like Claude Code, Cursor, and Codex generate code that may be uncopyrightable, owned by your employer, or contaminated by open source licenses you cannot see. Some of this is settled law, some is actively contested, and this piece is clear about which is which. If you are shipping AI-assisted code and have not thought about any of this, this piece is for you.<br><br></p>
</blockquote>



<p class="wp-block-paragraph">If you shipped code this week, some of it was probably written by an AI. The question of who legally owns that code is less settled than most developers assume, and the answer depends on three things that have nothing to do with how good the code is:</p>



<ol class="wp-block-list">
<li>Whether a human made enough creative decisions to establish copyright</li>



<li>Whether your employment contract already assigned it to your employer</li>



<li>Whether the model pulled from GPL-licensed training data and quietly contaminated your codebase</li>
</ol>



<p class="wp-block-paragraph">On March 31, 2026, Anthropic accidentally published 512,000 lines of Claude Code’s source code in a routine software update through a missing configuration file. Before sunrise, the codebase was mirrored across GitHub. Before breakfast, a developer had used an AI tool to rewrite the entire thing in Python, and the “claw-code” repository hit 100,000 GitHub stars in a single day, the fastest in history. Then came the DMCA takedowns, and then came the question nobody had a clean answer to:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">If Claude Code was, by Anthropic’s own lead engineer’s admission, predominantly written by Claude itself, does Anthropic even own it? Can you issue a DMCA takedown for code that copyright law may not protect?</p>
</blockquote>



<p class="wp-block-paragraph">That incident compressed every open question about AI-generated code ownership into a single news cycle. The same questions apply to your codebase.</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="1200" height="480" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-14.png" alt="Three risks in every AI-assisted codebase" class="wp-image-18913" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-14.png 1200w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-14-300x120.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-14-768x307.png 768w" sizes="(max-width: 1200px) 100vw, 1200px" /></figure>



<h2 class="wp-block-heading">The copyright rule nobody told you</h2>



<p class="wp-block-paragraph">Here is the legal baseline, in plain terms: <strong>Copyright only protects work created by a human</strong>.</p>



<p class="wp-block-paragraph">The US Copyright Office has confirmed this consistently, and the DC Circuit upheld it in the Thaler case. When the Supreme Court declined to hear the Thaler appeal in March 2026, it did not endorse the lower court&#8217;s reasoning or settle the question nationally. Cert denial means the court chose not to hear the case, nothing more. What it does mean is that the DC Circuit&#8217;s ruling stands, the Copyright Office&#8217;s position is intact, and no court has yet gone the other way. Works predominantly generated by AI without meaningful human authorship are not eligible for copyright protection under current doctrine, and that position is stable even if it is not finally settled.</p>



<p class="wp-block-paragraph">Two important limits on what Thaler actually decided.</p>



<ol class="wp-block-list">
<li>The case involved a painting created with zero human involvement at all. Thaler listed the AI system as sole author and made no claim of any human creative contribution. The ruling does not directly address the harder question of AI-assisted work where a human was involved but the degree of that involvement is disputed.</li>



<li>Thaler involved visual art. No court has yet applied the human authorship doctrine specifically to code output from an AI coding tool. The logic applies, but the direct precedent does not exist yet.</li>
</ol>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>What it means for you</strong>: Code that Claude Code or Cursor generated and you accepted without meaningful modification may not be copyrightable by anyone. If a competitor copies it, you may have no legal recourse, because the code sits in the public domain in everything but name.</p>
</blockquote>



<figure class="wp-block-image size-full"><img decoding="async" width="1200" height="500" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-15.png" alt="What counts as meaningful human authorship?" class="wp-image-18914" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-15.png 1200w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-15-300x125.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-15-768x320.png 768w" sizes="(max-width: 1200px) 100vw, 1200px" /></figure>



<p class="wp-block-paragraph">The phrase that determines whether your code is protected is “<strong>meaningful human authorship</strong>,” and the Copyright Office has deliberately refused to quantify it with a percentage or a number of edits, because what courts look for is evidence that a human made genuine creative decisions:</p>



<ul class="wp-block-list">
<li>Choosing the architecture</li>



<li>Deciding what to reject</li>



<li>Restructuring the output to fit a specific design</li>
</ul>



<p class="wp-block-paragraph">Specifying an objective to the model is not enough. Directing how the work is constructed is what counts.</p>



<p class="wp-block-paragraph">In an agentic workflow, this distinction is harder to establish than it sounds. Consider a typical Claude Code session:</p>



<ul class="wp-block-list">
<li>You write a one-line prompt: “build a rate limiting module for the API.”</li>



<li>Claude Code plans the approach, generates five files, and iterates through three versions.</li>



<li>You review the output, run the tests, and merge.</li>
</ul>



<p class="wp-block-paragraph">Your contribution in that sequence is your architectural intent and your final approval. Whether that constitutes meaningful human authorship in a courtroom is an unresolved question with no definitive court ruling yet.</p>



<p class="wp-block-paragraph">The honest answer is: probably yes for modules you substantially redirected, probably no for code you accepted verbatim, and unclear for everything in between.</p>



<p class="wp-block-paragraph">The middle ground is actively being litigated right now. In Allen v. Perlmutter, artist Jason Allen is challenging the Copyright Office’s denial of registration for a work he created using more than 600 detailed prompts and subsequent editing in Photoshop. The Copyright Office acknowledged the Photoshop edits as human-authored but still denied registration for the AI-generated underlying elements. That case has not been decided yet, and whatever it decides will be the closest thing to a ruling on how much human involvement is enough.</p>



<p class="wp-block-paragraph">The closest existing precedent on partial protection is <em>Zarya of the Dawn</em>, a graphic novel where the Copyright Office granted registration for the human-authored text but denied it for the Midjourney-generated images. That decision establishes a practical principle developers can use right now: The human-authored elements of an AI-assisted codebase may be separately protectable even if the generated code itself is not. Your architecture documents, your design decisions recorded in commit messages, your ADRs, your prompt logs showing deliberate redirection, these may be protectable as human-authored expression even if the code they produced is not. Protecting what you can starts with documenting what you actually did.</p>



<h2 class="wp-block-heading">What your employer probably already owns</h2>



<p class="wp-block-paragraph">Before you think about whether your code is copyrightable, there is a more immediate question: Even if it is, is it actually yours?</p>



<p class="wp-block-paragraph">Your employment contract almost certainly says that anything you build at work belongs to your employer. That principle has a name in copyright law: the work-for-hire doctrine. Under it, any code created by an employee within the scope of their employment is owned by the employer, who is treated as the legal author, regardless of whether the code was written by hand, generated by Claude Code, or some combination. Using an AI coding tool during work hours, on a work project, on a work machine, does not change who owns the result.</p>



<p class="wp-block-paragraph">Most employment contracts go further than the doctrine’s defaults. Look for a section in yours called “Intellectual Property,” “IP Assignment,” or “Work Product.” Open the contract, search for those terms, and read that section. A clause that says any of the following almost certainly covers your AI-assisted code:</p>



<ul class="wp-block-list">
<li>“Any work product created using company equipment or resources”</li>



<li>“Any invention or development made during the term of employment”</li>



<li>“Any software created with the assistance of company-licensed tools”</li>
</ul>



<p class="wp-block-paragraph">The third one is the one to watch. If your employer licenses Claude Code, Cursor, or Copilot for the team, and you use those same tools to build a side project, a broad IP assignment clause may give the employer a claim over that project, even if you built it on your own time.</p>



<p class="wp-block-paragraph">A senior developer in San Francisco described exactly this situation earlier this year. He had used Claude Code for work projects and for a personal fitness tracking app built on evenings and weekends. His company updated its IP policy and claimed everything he had built with AI assistance, including the personal app, arguing that because Claude had access to open work files in the IDE, any AI output was a derivative work of company IP.</p>



<p class="wp-block-paragraph">This is the clearest example of how far this can stretch. His company&#8217;s claim rested on one phrase: The AI tools were &#8220;context-aware&#8221; of his company&#8217;s codebase. The argument does not hold up legally, because context visibility in an IDE does not make AI output a derivative work of files that were open nearby, and the connection between what Claude can see and what it generates is probabilistic pattern completion, not copying. But the argument illustrates what employers are starting to claim. If the clause is broad enough, it has surface validity regardless of what the AI actually did.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>The practical rule</strong>: If you are building something on the side, use a personal account, a personal machine, and tools you pay for yourself. Keep your employer’s licensed tools out of that workflow entirely.</p>
</blockquote>



<h2 class="wp-block-heading">The open source contamination problem</h2>



<p class="wp-block-paragraph">Even if you own your AI-generated code, you may have already contaminated it with an open source license you cannot see.</p>



<p class="wp-block-paragraph">AI coding tools are trained on massive amounts of public code, including code licensed under the GPL, LGPL, and other copyleft licenses. <strong>Copyleft licenses carry a specific obligation that travels with the code</strong>:</p>



<ul class="wp-block-list">
<li>If you distribute software that is a derivative of GPL-licensed code, you must release your own source code under the same license.</li>



<li>This applies even if you did not know the code you incorporated was GPL-licensed.</li>



<li>“I did not know” is not a defense to a copyleft violation.</li>
</ul>



<figure class="wp-block-image size-full"><img decoding="async" width="1200" height="460" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-16.png" alt="The GPL contamination chain" class="wp-image-18915" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-16.png 1200w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-16-300x115.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-16-768x294.png 768w" sizes="(max-width: 1200px) 100vw, 1200px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">When an AI tool reproduces a substantial verbatim portion of GPL-licensed code from its training data, and you ship that code in a commercial product without releasing source, you may have created a copyleft violation without ever touching the original repository. The legal standard for infringement is substantial verbatim reproduction, not functional similarity or resemblance, and this distinction matters: an AI tool generating code that works like GPL code is different from an AI tool that reproduces GPL code word for word. The risk sits at the verbatim end of that spectrum, and the problem is that you have no way to know which side of the line your codebase is on without running a scan.</p>
</blockquote>



<p class="wp-block-paragraph">The chardet community dispute made this concrete in early 2026. This was not a filed lawsuit but a public dispute within the open source community that raised the question without resolving it legally. A developer used Claude to rewrite chardet, a Python character encoding library, and rereleased it under an MIT license, arguing that the AI rewrite was a “clean room” implementation free of the original LGPL license.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>The legal question the community fought over</strong>: If Claude was trained on the LGPL-licensed codebase and its output reproduces substantial verbatim portions of that code, can the output be treated as license-free? The chardet dispute did not resolve cleanly and no court has issued a definitive ruling on this specific question. What is settled is that verbatim copying of GPL code violates the license regardless of how it was produced. What is unsettled is whether AI-generated output that reproduces training data patterns counts as verbatim copying. The working assumption among lawyers advising companies through M&amp;A is that it probably does, and that assumption is now showing up as a standard condition in acquisition due diligence.</p>
</blockquote>



<p class="wp-block-paragraph">The Doe v GitHub litigation, still working through the Ninth Circuit as of April 2026, is asking whether GitHub Copilot reproduces licensed code without attribution in violation of copyright law and DMCA Section 1202. The district court dismissed most claims but the appeal is live. Whatever the outcome, the litigation has already changed industry behavior: GitHub Copilot added duplicate detection filters, and acquisition due diligence now routinely includes an AI codebase license scan.</p>



<h2 class="wp-block-heading">What to do about all of this</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1200" height="420" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-17.png" alt="Your four actions before you ship" class="wp-image-18916" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-17.png 1200w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-17-300x105.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-17-768x269.png 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></figure>



<p class="wp-block-paragraph">Four concrete actions, none of which require a lawyer.</p>



<h3 class="wp-block-heading">1. Run a license scan on your AI-assisted codebase</h3>



<p class="wp-block-paragraph">Tools that do this well:</p>



<ul class="wp-block-list">
<li><a href="https://fossa.com/">FOSSA</a>—most comprehensive, widely used in enterprise</li>



<li><a href="https://snyk.io/product/open-source-security/">Snyk Open Source</a>—good for dev-team workflows, integrates with GitHub</li>



<li><a href="https://www.blackducksoftware.com/">Black Duck</a>—standard in M&amp;A due diligence</li>
</ul>



<p class="wp-block-paragraph">Each will scan your codebase, flag code that matches known open source libraries, and identify the licenses attached. If you are shipping a commercial product and have never run one of these, you are operating on assumption. The scan takes an afternoon and costs less than the first hour of a copyright dispute.</p>



<h3 class="wp-block-heading">2. Document your human creative contributions as you go</h3>



<p class="wp-block-paragraph">The evidence that establishes meaningful human authorship is the same evidence you already produce in a normal engineering workflow. You just have to keep it deliberately rather than letting it disappear.</p>



<p class="wp-block-paragraph">What to preserve:</p>



<ul class="wp-block-list">
<li>Commit messages that describe what you changed and why, not just what the AI generated. “Restructured Claude’s module architecture, rejected initial state management approach, rewrote error handling from scratch” is evidence. “Add rate limiting module” is not.</li>



<li>Prompt logs. Claude Code and Cursor both retain interaction history. Export or screenshot the sessions where you made significant architectural decisions.</li>



<li>Design documents, ADRs, or any notes that predate the generated code and show you specified the structure before the AI built it.</li>
</ul>



<p class="wp-block-paragraph">The second commit message versus the first is the difference between a defensible authorship claim and a clean “Claude wrote this” record.</p>



<h3 class="wp-block-heading">3. Read the IP clause in your employment contract before you build anything on the side</h3>



<p class="wp-block-paragraph">Open your contract, search for “intellectual property,” “IP assignment,” or “work product,” and read that section carefully. The specific language determines your exposure:</p>



<ul class="wp-block-list">
<li>“Work product created during employment hours” is narrower than “work product created using company resources.”</li>



<li>“Relating to the company’s business” is narrower than “any software development.”</li>



<li>“Company-licensed tools” is the phrase that captures AI coding tools even on personal projects.</li>
</ul>



<p class="wp-block-paragraph">If the clause is broad and you want to build something independently, you have three realistic options: negotiate a written carveout before you start (easier at the start of a new role than mid-employment), use entirely personal tools on entirely personal time on a personal machine, or accept that the claim exists and decide whether the risk is worth it.</p>



<h3 class="wp-block-heading">4. Check which Anthropic plan you are on before shipping for commercial use</h3>



<p class="wp-block-paragraph">Go to <a href="http://anthropic.com/legal" target="_blank" rel="noreferrer noopener">anthropic.com/legal</a> and compare the consumer terms against the commercial terms. The difference that matters:</p>



<ul class="wp-block-list">
<li><strong>Consumer terms (free and Pro plans)</strong>: Anthropic assigns outputs to you, but the IP indemnification is narrower and covers fewer scenarios.</li>



<li><strong>Commercial terms (API and enterprise)</strong>: Anthropic assigns outputs to you and will defend you against copyright infringement claims arising from your authorized use of the service and its outputs.</li>
</ul>



<p class="wp-block-paragraph">If you are shipping AI-assisted code in a commercial product using the free or Pro plan, the indemnification gap is real. The API or enterprise agreement is the appropriate tier. Note that neither indemnification covers a downstream GPL violation from license contamination in your codebase. That is your governance problem to solve with the license scan in action 1.</p>



<h2 class="wp-block-heading">The thing worth sitting with</h2>



<p class="wp-block-paragraph">Anthropic’s own lead engineer publicly stated that his recent contributions to Claude Code were written entirely by the AI, and the leaked codebase that Anthropic issued 8,000 DMCA takedowns to suppress may be predominantly AI-authored. Whether Anthropic’s copyright claims over that codebase are legally valid remains an open question no court has yet resolved.</p>



<p class="wp-block-paragraph">If the company that built the tool cannot cleanly assert copyright over its own AI-assisted code, the question of whether you can is worth taking seriously before it becomes relevant in a transaction, a dispute, or an acquisition conversation. The developer who documents their creative contributions from the start is in a meaningfully different legal position than the one who accepted three thousand lines of Claude output and merged without review, even if both shipped the same product.</p>



<h2 class="wp-block-heading">A note on what this piece covers and what it does not</h2>



<p class="wp-block-paragraph">Three things in it are settled law:</p>



<ul class="wp-block-list">
<li>Works lacking human authorship are uncopyrightable,</li>



<li>The work-for-hire doctrine applies regardless of how code was generated.</li>



<li>Verbatim copying of GPL-licensed code violates the license.</li>
</ul>



<p class="wp-block-paragraph">Two things are emerging consensus without definitive court rulings yet:</p>



<ul class="wp-block-list">
<li>How much human direction is enough to establish meaningful authorship in an agentic workflow</li>



<li>Whether AI output that reproduces training data patterns counts as verbatim copying</li>
</ul>



<p class="wp-block-paragraph">One thing is genuine speculation:</p>



<ul class="wp-block-list">
<li>Whether any of this will be litigated at scale in the near term</li>
</ul>



<p class="wp-block-paragraph">Most code copyright claims never reach court. The place where the unsettled questions become concrete today is M&amp;A due diligence and institutional fundraising, where acquirers and investors are already asking these questions as a condition of closing.</p>



<p class="wp-block-paragraph">If neither of those applies to your situation right now, the four actions above are still worth doing, but the urgency is lower than the piece might imply.</p>



<h3 class="wp-block-heading">Further reading</h3>



<p class="wp-block-paragraph">1. <a href="https://www.copyright.gov/ai/" target="_blank" rel="noreferrer noopener">US Copyright Office—Copyright and Artificial Intelligence (Part 2: Copyrightability)</a><br>The primary regulatory source on what qualifies as meaningful human authorship in AI-assisted works. Part 2 covers the specific tests the Office applies when reviewing AI-generated content registrations. Essential if you want to understand exactly where the legal line sits.</p>



<p class="wp-block-paragraph">2. <a href="https://fingfx.thomsonreuters.com/gfx/legaldocs/gdpzybblovw/STABILITY%20AI%20LAWSUIT.pdf" target="_blank" rel="noreferrer noopener">Andersen v. Stability AI, Midjourney, DeviantArt—Ninth Circuit docket</a><br>The foundational case on AI training data and copyright infringement, currently shaping how courts think about what AI models learn and reproduce. Relevant to the GPL contamination question in a way most developers have not connected yet.</p>



<p class="wp-block-paragraph">3. <a href="https://githubcopilotlitigation.com/" target="_blank" rel="noreferrer noopener">Doe v. GitHub, Inc.—Ninth Circuit appeal</a><br>The live litigation on whether Copilot reproduces licensed code without attribution. Track this one: The Ninth Circuit decision will set the standard that determines whether AI-generated code carrying open source patterns constitutes copyright infringement.</p>



<p class="wp-block-paragraph">4. <a href="https://github.blog/2021-11-15-why-github-copilot-does-not-infringe-copyright/" target="_blank" rel="noreferrer noopener">GitHub—Copilot and copyright: What you need to know</a><br>GitHub’s own legal position on why Copilot outputs are not infringing. Worth reading as a counterpoint: Understanding the argument they make helps you understand where it is strong and where it has limits, particularly on the GPL training data question.</p>



<p class="wp-block-paragraph">5. <a href="https://fossa.com/learn/open-source-licenses" target="_blank" rel="noreferrer noopener">FOSSA—Understanding open source license obligations</a><br>A developer-friendly reference to how copyleft obligations actually work in practice: what triggers the source disclosure requirement, what constitutes a derivative work, and how the GPL, LGPL, and AGPL differ in their reach. The clearest plain-language guide available on this topic.</p>



<p class="wp-block-paragraph">6. <a href="https://www.anthropic.com/legal" target="_blank" rel="noreferrer noopener">Anthropic—Usage Policy and Terms of Service</a><br>The actual document that determines your IP rights and indemnification scope when you use Claude commercially. Read sections 7 and 8 specifically: output ownership and IP indemnification. The difference between the consumer and commercial terms is stated plainly and takes 10 minutes to understand.</p>



<p class="wp-block-paragraph"><em>I write about legal architecture for AI products at </em><a href="https://legallayer.substack.com/" target="_blank" rel="noreferrer noopener">Legal Layer</a><em>. This piece is informational and does not constitute legal advice.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/who-owns-the-code-claude-wrote/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>This Week in AI: The Next-Gen Recommendation Experience</title>
		<link>https://www.oreilly.com/radar/this-week-in-ai-the-next-gen-recommendation-experience/</link>
				<comments>https://www.oreilly.com/radar/this-week-in-ai-the-next-gen-recommendation-experience/#respond</comments>
				<pubDate>Fri, 12 Jun 2026 14:18:19 +0000</pubDate>
					<dc:creator><![CDATA[Michelle Smith]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18909</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-scaled.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Plus responsible AI and why a conversational agent doesn’t count as a true sales agent]]></custom:subtitle>
		
				<description><![CDATA[This week Miguel Fierro, a former Microsoft principal researcher who recently founded his own company, RecoMind, joined data and AI evangelist Christina Stathopoulos to talk about the state of recommendation systems. Christina also ran through the latest AI news she&#8217;s been watching, from Anthropic&#8217;s continued rise to responsible AI, announcements from Google’s I/O 2026 conference, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">This week Miguel Fierro, a former Microsoft principal researcher who recently founded his own company, <a href="https://recomind.io/" target="_blank" rel="noreferrer noopener">RecoMind</a>, joined data and AI evangelist Christina Stathopoulos to talk about the state of recommendation systems. Christina also ran through the latest AI news she&#8217;s been watching, from Anthropic&#8217;s continued rise to responsible AI, announcements from Google’s I/O 2026 conference, and (continuing the discussion from last week) the growing backlash against tokenmaxxing as a productivity metric. Here are three takeaways from the conversation.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="This Week in AI with Christina Stathopoulos and Miguel Fierro" width="500" height="281" src="https://www.youtube.com/embed/apTfbIR-U24?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Recommendation systems are a bigger deal than most companies realize</strong></h2>



<p class="wp-block-paragraph">Miguel has spent the better part of a decade building recommendation systems for enterprise customers at Microsoft, and he thinks most companies are leaving a lot on the table by not paying closer attention to recommendations. Amazon generates roughly 35% of its revenue through recommendations. Netflix attributes 75% of content consumption to them. Best Buy credits recommendations with 24% of revenue. TikTok&#8217;s entire user experience is a recommendation engine. And yet many large retailers he worked with at Microsoft weren&#8217;t investing seriously in the area, often because they weren&#8217;t tracking the value it was generating.</p>



<p class="wp-block-paragraph">The gap between the top tier and everyone else is wide and getting wider. The most advanced systems today treat user behavior as a sequence prediction problem, similar to how large language models predict the next token. Rather than just encoding clicks, they encode all user actions into embeddings, run sequences through those representations, and use huge 1.5 trillion-parameter models to predict what a user will want next. That&#8217;s not something a mid-tier retailer can replicate today, but it signals where the field is heading.</p>



<p class="wp-block-paragraph">Even if you don’t work in a top well-resourced company, you should still pay attention to the convergence of search and recommendations into a single personalized retrieval layer and the early application of foundation models to recommendation problems. Netflix has built what Miquel described as the <a href="https://netflixtechblog.medium.com/integrating-netflixs-foundation-model-into-personalization-applications-cf176b5860eb" target="_blank" rel="noreferrer noopener">only published foundation model</a> in this space; Meta is rumored to be developing one as well. The barrier is data, particularly for smaller organizations. Unlike text, behavioral interaction data isn&#8217;t publicly available, so building at that scale requires both proprietary datasets and serious compute.</p>



<p class="wp-block-paragraph">If you want to get your hands on state-of-the-art implementations, including knowledge graph-based approaches, without starting from scratch, Miguel suggested the <a href="https://github.com/recommenders-team/recommenders" target="_blank" rel="noreferrer noopener">open source Recommenders library</a>, originally developed at Microsoft and now housed under the Linux Foundation, as a practical entry point.</p>



<h2 class="wp-block-heading"><strong>The agent hype has a recommender-shaped hole in it</strong></h2>



<p class="wp-block-paragraph">Miguel drew a distinction between true sales agents and what most companies offer today, which are usually just conversational agents. A conversational agent responds to what you say. An agentic sales system understands a customer, anticipates what they want, and surfaces the right product or offer at the right moment—and that requires a recommendation system baked in.</p>



<p class="wp-block-paragraph">If your &#8220;agent&#8221; is a chatbot with access to a knowledge base, it&#8217;s not doing recommendation. Recommendation systems need training data, a retrieval layer, and a personalization model, none of which you get for free from a foundation model API. A language model can answer questions about a product catalog, but it can’t offer up personalized recommendations unless it also has a model of the customer&#8217;s preferences, history, and likely next action. Most companies don’t have the infrastructure in place to make that possible. . .yet.</p>



<h2 class="wp-block-heading"><strong>The responsible AI conversation has left the research community</strong></h2>



<p class="wp-block-paragraph">What’s notable about the responsible AI conversation right now is the range of institutions offering their perspective. Anthropic, alongside announcing a funding round pushing its valuation toward $1 trillion, urged a <a href="https://www.reuters.com/business/anthropic-says-ai-labs-need-coordinated-plan-halt-development-if-risks-rise-2026-06-04/" target="_blank" rel="noreferrer noopener">global pause on AI development</a> tied to the risk of recursive self-improvement: systems that can design and develop their own successors. The Future of Life Institute published <em>The Better Path for AI</em>, a framework arguing for <a href="https://betterpath.ai/" target="_blank" rel="noreferrer noopener">capability development oriented toward human benefit</a> rather than human replacement. And the pope issued a <a href="https://www.vatican.va/content/leo-xiv/en/encyclicals/documents/20260515-magnifica-humanitas.html" target="_blank" rel="noreferrer noopener">formal encyclical focused on AI</a> and the common good.</p>



<p class="wp-block-paragraph">None of these institutions is making the same argument, but the convergence of their attention matters. Responsible AI used to be a specialized conversation happening largely within research labs and a small set of policy organizations. It&#8217;s now a topic where major AI companies, religious institutions, and civil society groups are all staking out public positions in the same news cycle.</p>



<p class="wp-block-paragraph">For the technical community, this creates both pressure and opportunity. &#8220;We&#8217;re thinking about safety&#8221; is no longer a sufficient posture; external scrutiny is intensifying from directions that don&#8217;t share the field&#8217;s assumptions or vocabulary. But the broader conversation creates real demand for practitioners who can translate between what responsible AI actually requires in practice and what policymakers, executives, and institutions are trying to figure out. That translation work is increasingly where the field needs people.</p>



<h2 class="wp-block-heading"><strong>What&#8217;s next</strong></h2>



<p class="wp-block-paragraph">Join us Monday morning for the <a href="https://www.oreilly.com/live/this-week-in-ai.html" target="_blank" rel="noreferrer noopener">next episode of <em>This Week in AI</em></a>, where YK Sugi and John Lindquist will break down the massive structural and financial shifts reshaping the technology industry. (They’ll also chat about the recent release of Claude Fable 5.) And on July 23, Christina will be hosting the <a href="https://www.oreilly.com/live/ai-superstream-ai-harnesses.html" target="_blank" rel="noreferrer noopener">AI Superstream on AI harnesses</a>, a four-hour event focused on agentic AI and the frameworks practitioners need to move from models to agents. Both are free to attend. <a href="https://www.oreilly.com/live/free.html" target="_blank" rel="noreferrer noopener">Register now</a> to save your seat.</p>



<p class="wp-block-paragraph">For deeper reading on topics covered this week, Christina recommended three titles available on the O&#8217;Reilly learning platform: <a href="https://learning.oreilly.com/library/view/hands-on-llm-serving/9798341621480/" target="_blank" rel="noreferrer noopener"><em>Hands-On LLM Serving and Optimization</em></a>, <em><a href="https://learning.oreilly.com/library/view/hands-on-rag-for/9798341621701/" target="_blank" rel="noreferrer noopener">Hands-On RAG for Production</a></em>, and <em><a href="https://learning.oreilly.com/library/view/large-language-models/9798341622517/" target="_blank" rel="noreferrer noopener">Large Language Models: The Hard Parts</a></em>. Not a member? <a href="https://www.oreilly.com/start-trial/?type=individual" target="_blank" rel="noreferrer noopener">Sign up for a free 10-day trial</a> to check them out.</p>



<p class="wp-block-paragraph">We’ll continue to publish our takeaways here on Radar each Friday and share full episodes on <a href="https://www.youtube.com/watch?v=g4cfjz5AKxY&amp;list=PL055Epbe6d5bJEhT7_ZzOeJZ6gPyUzYpS" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/033kJS2BG1teGunxmtsU1r" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/this-week-in-ai/id1896798047" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/this-week-in-ai-the-next-gen-recommendation-experience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Generative AI in the Real World: Agentic Systems Fundamentals with Maarten Grootendorst</title>
		<link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-agentic-systems-fundamentals-with-maarten-grootendorst/</link>
				<comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-agentic-systems-fundamentals-with-maarten-grootendorst/#respond</comments>
				<pubDate>Thu, 11 Jun 2026 17:58:23 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica and Maarten Grootendorst]]></dc:creator>
						<category><![CDATA[Generative AI in the Real World]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=18898</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[BERTopic creator and Google DeepMind developer relations engineer Maarten Grootendorst has spent years helping practitioners build intuition for how AI systems actually work—not just how to prompt them. Maarten joined Ben Lorica to cover the enduring relevance of embeddings and topic models in an LLM-dominated world, his hot take that agents are essentially just an [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">BERTopic creator and Google DeepMind developer relations engineer Maarten Grootendorst has spent years helping practitioners build intuition for how AI systems actually work—not just how to prompt them. Maarten joined Ben Lorica to cover the enduring relevance of embeddings and topic models in an LLM-dominated world, his hot take that agents are essentially just an “LLM in a for loop with some tools, some memory, and perhaps some guardrails,&#8221; and what separates genuine agentic behavior from a well-constructed pipeline. They also get into the practical trade-offs between open weight and proprietary models, the future of state space models and attention, and why Maarten worries that a generation of builders shipping code they can&#8217;t read may be storing up technical debt they can&#8217;t repay. &#8220;If you don&#8217;t really know how an LLM works,&#8221; he says, &#8220;that intuition [about how to use it effectively] is much more difficult to develop.&#8221;</p>



<p class="wp-block-paragraph">About the <em>Generative AI in the Real World</em> podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>



<p class="wp-block-paragraph">Check out other episodes of this podcast on the <a href="https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-chang-she-on-data-infrastructure-for-ai/#:~:text=on%20the%20O%E2%80%99Reilly%20learning%20platform" target="_blank" rel="noreferrer noopener">O’Reilly learning platform</a> or follow us on <a href="https://www.youtube.com/playlist?list=PL055Epbe6d5YcJUhZbsVW9dlMueIuOxK_" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/5C9oof8TFkP65lDUcEy5jT" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/generative-ai-in-the-real-world/id1835476293" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</p>



<h2 class="wp-block-heading">Transcript</h2>



<p class="wp-block-paragraph"><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=50" target="_blank" rel="noreferrer noopener">0.50 </a><br><strong>All right. So today we have Maarten Grootendorst. He is a developer relations engineer at Google DeepMind, and he is also the coauthor of two O&#8217;Reilly books, <em><a href="https://learning.oreilly.com/library/view/hands-on-large-language/9781098150952/" target="_blank" rel="noreferrer noopener">Hands-On Large Language Models</a></em> and <em><a href="https://learning.oreilly.com/library/view/an-illustrated-guide/9798341662681/" target="_blank" rel="noreferrer noopener">An Illustrated Guide to AI</a></em>. And so, Maarten, welcome to the podcast.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=70" target="_blank" rel="noreferrer noopener">01.10</a><br>Thank you. It&#8217;s wonderful to be here.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=72" target="_blank" rel="noreferrer noopener">01.12</a> <br><strong>So, I had you on the podcast—I was looking at it earlier this morning—August 2022, a few months before ChatGPT was released. </strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=83" target="_blank" rel="noreferrer noopener">01.23</a><br>It&#8217;s been a while. [laughs]</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=85" target="_blank" rel="noreferrer noopener">01.25</a><br><strong>Yeah. Back then, what I wanted to talk to you about was, I was a user of your <a href="https://maartengr.github.io/BERTopic/index.html" target="_blank" rel="noreferrer noopener">BERTopic library</a>. For listeners who are not familiar, BERTopic was kind of a marriage between the transformer approach with topic modeling and Maarten wrote one of the more popular libraries for doing that. Actually, what&#8217;s happened to this whole topic of topic models?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=118" target="_blank" rel="noreferrer noopener">01.58</a><br>Oh, yeah. I think it&#8217;s still going strong. You mentioned ChatGPT. So a lot of people say, “OK, just use that for topic modeling.” You can. It&#8217;s just very difficult to make sure you get a more structured, standardized output rerun thing, especially if [you have] millions of potential documents. And you can still use that on top of that. It&#8217;s still my baby of sorts, right? I mean, it&#8217;s been four years since we talked, and. . . I love working on that. I don&#8217;t have that much time to do it anymore, but it&#8217;s great.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=156" target="_blank" rel="noreferrer noopener">02.36</a><br><strong>Yeah. So I think one of the things that these large language models have done is kind of, I guess, cast by the wayside some of these earlier approaches for really wading through a lot of text. Unfortunately, I think people, as you mentioned, are trying to prompt their way into a topic model. But I think topic models themselves are still very useful. So one question to you, Maarten. What&#8217;s the level of usage of BERTopic now compared to when we talked?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=193" target="_blank" rel="noreferrer noopener">03.13</a><br>It&#8217;s only grown since then.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=197" target="_blank" rel="noreferrer noopener">03.17</a><br><strong>Really?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=198" target="_blank" rel="noreferrer noopener">03.18</a><br>Yeah. It surprised me too. [laughs] I think it&#8217;s because it&#8217;s easy to use. I did some, I think, cool tricks in there, but other than that, I think the main benefit was mostly just a nice user experience. And that helps people use something for a very specific task instead of trying to prompt your way towards something that might or might not work, and you still have to iterate over that. It just works out of the box. It&#8217;s not perfect. Nothing is. It&#8217;s not a free lunch. But yeah, I think that&#8217;s it.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=235" target="_blank" rel="noreferrer noopener">03.55</a><br><strong>One thing that&#8217;s happened, of course, is that this whole area of AI and NLP has gotten so democratized that. . . When we talked, I think the people who were using BERTopic at least had some notion of what NLP was and what text mining was, right? I would imagine now, in your role as a developer relations person, you encounter a lot of people who don&#8217;t come from a data science or ML background. And so they have no clue what topic models are, I would imagine.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=274" target="_blank" rel="noreferrer noopener">04.34</a><br>Yeah, many don&#8217;t. It&#8217;s very interesting to see because you mentioned NLP and text mining and, well, [they’re] completely outdated terms now for some reason. It&#8217;s all AI. Let&#8217;s just call it AI and be done with it. [laughs] That&#8217;s not necessarily a bad thing, don&#8217;t get me wrong. It&#8217;s just very interesting to see how the field has evolved, but that also means that people don&#8217;t really look towards these “older techniques” that still drive much of the adoption of newer stuff.<br><br>Sometimes it feels like that, you know, AI and LLMs. . . It&#8217;s a hammer and we&#8217;re looking for nails to actually use it instead of, “OK, but we have packages for very specific things, and you can use LLMs on top of that.” You don&#8217;t have to. But it requires a bit of education on that end, because like you mentioned, a lot of people new to the field, you have to explain, “What are embeddings? What is clustering?” It&#8217;s also very interesting to see that even something like that needs to be explained a little bit in more detail. It&#8217;s a nice opportunity for me to explain stuff. I like doing that.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=348" target="_blank" rel="noreferrer noopener">05.48</a><br><strong>And the key here is that because a lot of people are entering this field and building things and they don&#8217;t necessarily know the prior art, so to speak, it seems like they might be leaving a lot of things on the table. Right? So in terms of, here&#8217;s my text or my data, I am just going to prompt and I think that I got everything out of it, but that&#8217;s not really the case for the most part.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=384" target="_blank" rel="noreferrer noopener">06.24</a><br>No. Definitely not. There&#8217;s so many things that you can do with these systems, whether it&#8217;s on the LLM side or the agentic side or the topic modeling side. If you just know a little bit more on what&#8217;s going on under the hood then that helps you understand “When do I prompt? When do I not prompt? What&#8217;s going wrong?” That feeling, that intuition. You don&#8217;t just get it with building. Building’s very important, but if you don&#8217;t really know how an LLM works, that intuition is much more difficult to develop.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=419" target="_blank" rel="noreferrer noopener">06.59</a><br><strong>Which brings me to your two books, which are fantastic, which I think go a long way into helping people get that foundation. But let&#8217;s face it, a lot of people, Maarten. . . So let&#8217;s take your earlier book with Jay [Alammar], which is <em>Hands-On Large Language Models</em>. A lot of people may say, “I don&#8217;t have time to read this whole book.” So for someone who is a developer, doesn&#8217;t have a data science or ML background, what would be the most important concepts for large language models? Drill down on these three or four concepts that will set you up for success.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=469" target="_blank" rel="noreferrer noopener">07.49 </a><br>From the top of my head, those are chapters two and three. So buy the book now. [laughs] I&#8217;m just kidding. Tokens. Super underappreciated.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=483" target="_blank" rel="noreferrer noopener">08.03</a><br><strong>Which now is a big topic because, as I joke, the CFO has now become the CTO, the chief token officer.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=491" target="_blank" rel="noreferrer noopener">08.11</a><br>I didn&#8217;t know that one. That&#8217;s amazing. I&#8217;m gonna use it. But, yeah, tokens are now the thing, right? It&#8217;s what LLMs use to see the world, so to say—to interpret the world. And it&#8217;s how they communicate with the world. So it&#8217;s really important to know what tokens are. It helps you get into the realm of embeddings, which I still think is super fundamental to so many things we do.<br><br>And the second part is kind of an obvious one, but the attention mechanism, “Oh, wow. Why are these things so strong? What makes them so special?” Attention is an obvious one. We have other things like Mamba, recurrent neural networks, but it all starts from attention. So if you&#8217;re completely new to this field, those two. Yeah.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=538" target="_blank" rel="noreferrer noopener">08.58</a> <br><strong>Let&#8217;s take the topic of embeddings. I think at least that topic, Maarten, some people have had to play around with it, right? Because when LLMs first came online, the “Hello, World!” example was RAG, and one of the knobs that people were tuning was embedding, obviously chunking, so the information extraction, the search and retrieval—they&#8217;re all important. But one thing that people immediately tried to play around with was embeddings because they could go to places like Hugging Face: <br>Hey, let me try these four different embeddings.” Do you find that embeddings have a special place in that more people play around with embeddings and have some rudimentary understanding of embeddings?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=590" target="_blank" rel="noreferrer noopener">09.50</a> <br>I have a sweet spot for embeddings because it&#8217;s the main part of BERTopic. But I think it&#8217;s so fundamental to so many things that we do in this field. Even things like RAG—which some people think is outdated. It actually isn&#8217;t. It&#8217;s very much alive and still kicking—runs on embeddings and understanding how they work will also help you understand how LLMs work. And it can be used in so many different ways. </p>



<p class="wp-block-paragraph">Sometimes we&#8217;re looking for bigger embedding models, more contextualized information. Great. [They] have their own purposes. And there are now certain parties focusing a little bit more on these static embeddings that are super fast and quick, like the old school embeddings that we used to have, and now in a new form that can be used in conjunction with coding agents to quickly search through repos and find the information that they&#8217;re looking for. Much of what we do is still search, and search revolves in big part on embeddings. And it&#8217;s just nice when you have text that you have one numerical representation for it—just that gives you so many opportunities to do so many cool things.&nbsp;.&nbsp;.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=678" target="_blank" rel="noreferrer noopener">11.18</a><br><strong>So when you&#8217;re trying to convince someone, Maarten, that “Hey, you should learn more about embeddings, because they&#8217;re important,” is there a canonical example that you use to say, “Hey, look, if you just understood embeddings and you made this one decision, look at the change in your application.” Is there a canonical example that you go to?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=700" target="_blank" rel="noreferrer noopener">11.40</a><br>Oh, yeah, I love the question, but I don&#8217;t think I have an answer to that. Because, OK, so I&#8217;m a psychologist and I really like to say “it depends on,” and here it kind of depends on the application that you&#8217;re running, obviously. Contextualized versus noncontextualized embeddings is a very interesting example because the contextualized ones are generally larger. But there&#8217;s larger transformer-like models that require a lot of compute to run. So you can see the latency actually appearing in your search engines. Or if you connect your coding agent to one of those, it slows down because, you know, it needs to wait for the search compared to the faster static ones, for instance, like Model2Vec and stuff like that, which are tremendously fast. So amazing for those use cases, not that performance because they&#8217;re way smaller, obviously. And it&#8217;s these use cases where the building does get you a lot of intuition about when to use what instead of relaying that decision only to an agent. You&#8217;re still the one that needs to have the feeling, that gut feeling, to say this works better for my use case.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=703" target="_blank" rel="noreferrer noopener">13.03</a><br><strong>But I would say the reality is that people will go to some leaderboard.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=789" target="_blank" rel="noreferrer noopener">13.09 </a><br>Yeah. That&#8217;s just the way it is.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=793" target="_blank" rel="noreferrer noopener">13.13</a><br><strong>So there we go. OK. So in this leaderboard here are the top 10. In this top 10, there&#8217;s some that look larger than the others. So I&#8217;ll try three or four of varying sizes. Is that a fair characterization of what normally happens?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=812" target="_blank" rel="noreferrer noopener">13.32</a><br>Yeah that&#8217;s even what I always did. Just you know, top of the leaderboard, pick one or two. But then as you are more experienced with picking one, what about multilinguality? I&#8217;m Dutch. There aren&#8217;t that many very good Dutch embedding models—big problem there. There are things like matryoshka embeddings, where they&#8217;re embedding one embedding model, but they generate embeddings of different sizes for different purposes, which is also very interesting. So there&#8217;s all these types of small decisions and nuances that you can make. And we now have instruction-tuned embeddings, where you prefix it with an instruction that you want an embedding for clustering or for classification or for what have you. And then you suddenly see the nuances in selecting something.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=867" target="_blank" rel="noreferrer noopener">14.27</a><br><strong>So on the attention mechanism, again, I will play the role of someone who has no time. I don&#8217;t have time to read the chapter, Maarten. What are one to three things I should know about the attention mechanism?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=884" target="_blank" rel="noreferrer noopener">14.44</a> <br>I think the most important thing about the attention mechanism is it contextualizes information. That&#8217;s by far the most important thing. When you look at the world before attention and after, it&#8217;s a little bit less black-and-white, obviously, but it puts stuff into context. You know, if you have the word “bank,” is it the bank of a river or a financial bank? And as we talk now with each other, there&#8217;s a lot of contextual stuff going on. You need to interpret what I&#8217;m saying, because if you only focus on what I say, you don&#8217;t know that that was actually a question beforehand that drives my answer. And I think that&#8217;s what makes attention so special. It tries to look at the entire thing instead of individual tokens or words.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=934" target="_blank" rel="noreferrer noopener">15.34</a><br><strong>Playing devil&#8217;s advocate, so you just explained it to me. Why do I have to learn more than that? [laughs]</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=940" target="_blank" rel="noreferrer noopener">15.40</a><br>Always learn more. [laughs]</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=944" target="_blank" rel="noreferrer noopener">15.44</a><br><strong>Yeah, yeah, yeah. So you mentioned Mamba and the state space models. There was some excitement around them. So maybe give our listeners a high-level description of what these state space models are and what their current status is in the wild in terms of actual practical usage.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=968" target="_blank" rel="noreferrer noopener">16.08</a> <br>State space models are a completely different way of approaching this attention mechanism, right? It almost does away with it and replaces it with something that is much, much faster. It&#8217;s a very complex and highly technical subject, so I don&#8217;t want to go too into that because it&#8217;s really confusing. [laughs]</p>



<p class="wp-block-paragraph">So what you see happening is that people replace attention mechanisms. So you have a decoder and LLM, and it has several stacks of attention mechanism normally. What you can do is you can remove half of them with the very quick state space models that help speed up the inference—because that&#8217;s what we&#8217;re mostly bound now by, is inference speeds. People want more, more tokens. So it needs to be faster. So it&#8217;s, it&#8217;s a way to make it quicker.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1033" target="_blank" rel="noreferrer noopener">17.13</a><br><strong>Yeah. And so what is the actual implementation or adoption of state space models right now?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1041" target="_blank" rel="noreferrer noopener">17.21</a><br>Mostly hybrid models. Models, stats, interleave the attention blocks, the decoder blocks with Mamba blocks as a way to make it faster, where some do it with, for example, local attention and global attention—one is more compute-intensive than others. Mamba is a way to do something similar, as a way to speed up that inference.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1071" target="_blank" rel="noreferrer noopener">17.51</a><br><strong>Your latest book is about agents: <em>An Illustrated Guide to AI Agents</em>. Before we dive in, in your mind, what makes a system truly agentic? In other words, before we started bandying around the word “agents,” people were using the term “robotic process automation” or something like that. So in your mind, what makes a system agentic?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1102" target="_blank" rel="noreferrer noopener">18.22 </a><br>That&#8217;s actually been one of the more complex topics for us to actually describe, because the field has been changing so quickly. And what is fundamentally an agent when they change it every two months? It&#8217;s a little bit of a hot take, but I really do think that an agent is an LLM in a for loop with some tools, some memory, and perhaps some guardrails. And that really is essentially all it boils down to at its base.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1135" target="_blank" rel="noreferrer noopener">18.55</a><br><strong>You just described the harness basically. The hot term right now is harness engineering. So what is the real progress and what is just marketing when it comes to agents?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1159" target="_blank" rel="noreferrer noopener">19.19 </a><br>Yeah, I agree very much with what you imply here because agents sound so cool, and they are cool, but the moment you give an LLM complete freedom, no constraints, just go off and do your stuff, it will fail horribly, horribly, horribly. Agents still need. . . And we can call them guardrails, but you can call them something else. They need direction. They need to be constrained a little bit in the things that they do. So yes, agents, there&#8217;s a lot of hype around that. I&#8217;m not a big fan of hype. It is what it is. But there are a lot of cool use cases for it because there&#8217;s a reason why coding agents are now the big thing. I&#8217;m using them myself daily because they make my life easier. But when we look at other use cases, we&#8217;re so early in AI progress. Yeah, coding works very nicely. But to ask an agent to book a vacation for me. Yeah. No.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1235" target="_blank" rel="noreferrer noopener">20.35</a><br><strong>It seems like that example of “I want to go on a trip. This trip will involve staying in five countries. And I want you to pick the best hotel for every country.” always was kind of the demo even during the robotic process automation. And as you alluded to, I don&#8217;t think we can do it quite yet. So here&#8217;s another family of agents, Maarten, that a lot of people are using now: deep research agents. Would you consider deep research an agent?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1275" target="_blank" rel="noreferrer noopener">21.15</a><br>Maybe. It kind of depends on how it&#8217;s implemented. It depends. I&#8217;m sorry. I&#8217;m going to do that a couple of times, but. . . You can make it very structured, where you say, “OK, do the search on the archive, read the abstracts, make a summary. That&#8217;s it.” That&#8217;s not really. . .</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1298" target="_blank" rel="noreferrer noopener">21.38</a><br><strong>It fits into your description in that you’re prompting an LLM. The LLM goes on a for loop where it uses as tools a search index, a knowledge graph. . .</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1313" target="_blank" rel="noreferrer noopener">21.53</a><br>Fair enough. Yeah. It makes the decision on its own when to use a tool, why to use a tool. Whereas you can also put it in a pipeline where you specifically say, “I always want you to do steps one, two, and three.” And an agent might decide to say, “OK, I&#8217;m going to do step 3, 3, 1, 2, 1, 3.” Decide on its own when and where to use specific tools. I think that&#8217;s maybe the best distinction you can make on what is and what isn&#8217;t an agent.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1346" target="_blank" rel="noreferrer noopener">22.26</a><br><strong>And then I guess it depends on the implementation, as you mentioned. But memory could also fill a role there, especially. . . Let&#8217;s say I&#8217;m using only one service—Google or Perplexity. Maybe it remembers over time what my preferences are. I don&#8217;t know if they actually implement it that way. But there&#8217;s potentially that aspect.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1373" target="_blank" rel="noreferrer noopener">22.53</a><br>So how we phrase it in the book at least, we say, “OK, an agent is a reasoning LLM that has access to planning, tools, and memory,” because there&#8217;s no such thing as an agent that goes off and does three steps of something only to forget what the previous steps were. So I think memory is maybe a little bit underappreciated in the realm of agents, because imagine it has to go through an entire codebase and translate it from Python to C++ or Rust or what have you. It&#8217;s a very common example of things people want to do. That requires hundreds of steps to do, because it&#8217;s potentially a large codebase. How does it remember what it did when it did what, what the current state is, what what&#8217;s changed, etc., etc.? And you can write that in a Markdown file. That&#8217;s nice, but it also needs to understand, “OK, what&#8217;s the trajectory that I went through?” And you can do a lot of cool stuff with that trajectory, because that&#8217;s essentially the memory of an agent.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1442" target="_blank" rel="noreferrer noopener">24.02</a><br><strong>In your role in developer relations, I assume you talk to a lot of people who work in different companies. We&#8217;ve mentioned coding agents; we mentioned deep research. So what are some of the more common agents that people are building? They could be internal or external facing. So what are some of the more common agent types, I guess, that people are building?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1469" target="_blank" rel="noreferrer noopener">24.29</a><br>Aside from the obvious, it depends on the industry. I do see coding agents actually being done quite a bit internally. Just trying to see how they can prevent data from being leaked elsewhere. Because a lot of processes now are very privacy sensitive. I came from healthcare before I joined DeepMind. And what you see in these kinds of fields is that, especially in Europe. . .</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1506" target="_blank" rel="noreferrer noopener">25.06</a><br><strong>I imagine if you&#8217;re in finance in a hedge fund. . .</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1509" target="_blank" rel="noreferrer noopener">25.09</a><br>So yeah, same. . . And these are situations wherein people focus a lot on privacy and making sure that everything&#8217;s constrained within their environments. And you see a lot of people playing around with LLMs and then using harnesses—can be Hermes but also [taking] a more foundational agent and build[ing] stuff around that. Or the larger organizations that, well, just use whatever cloud offering there is and use an agent there. We&#8217;re so at the beginning of all of this. [laughs]</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1550" target="_blank" rel="noreferrer noopener">25.50</a><br><strong>For me, the area where I see it being used—and this is not going to be a surprise to our listeners—is still the technical team bucket, which would be DevOps, data engineering, platform engineering. . . They&#8217;re building agents to help them do the work. But you might be interacting with a large website, and in the background, there&#8217;s a bunch of agents doing a lot of heavy lifting, moving data around for you to get the answer you want or whatever, or internal processes. But DevOps, I think they&#8217;re starting to build their own agents. I think, data engineering for pipelines, they&#8217;re building their own agents. I would imagine the people in security teams are also building agents because they have to go through lots of log files and. . .</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1615" target="_blank" rel="noreferrer noopener">26.55</a><br>A question for you then: Are they building agents, as in, you know, fully an agent, or are they building skills? Because I&#8217;ve seen a lot of people more focusing on creating skills and giving that to whatever agent is available. Or do you also see a lot of people actually building agents from scratch?</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1637" target="_blank" rel="noreferrer noopener">27.17</a><br><strong>I think internally there are people who are building what we would consider agents in the sense that it would do a huge chunk of their normal work and they interact with it with prompting, but maybe they don&#8217;t consider it completely autonomous. So in the sense that many people who use coding agents, at least, the ones who know how to code, as you might still test and read some of the code, right?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1670" target="_blank" rel="noreferrer noopener">27.50</a><br>Sometimes. Sometimes. [laughs]</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1672" target="_blank" rel="noreferrer noopener">27.52</a><br><strong>Our listeners may be sharp, but there&#8217;s huge cohorts of people using coding agents who don&#8217;t know how to code or who are building websites and web applications. So in the data, in the DevOps, in the data engineering field, the kinds of agents they&#8217;re building are somewhat similar to the coding agents in that they&#8217;re doing a lot of the work, but they still have guardrails. I would say they&#8217;re still human-in-the-loop. Now, there&#8217;s also agents in the nontechnical fields, but they&#8217;re a little more. . . Maybe to your point, maybe they can be better described as skills, for example, in marketing or sales. Internally at some of these companies, they&#8217;re building things to help these teams be more independent from IT.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1741" target="_blank" rel="noreferrer noopener">29.01</a><br>So yeah, you see mostly and we can call them skills, but we can also call them workflows or pipelines or just prompts. . .</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1750" target="_blank" rel="noreferrer noopener">29.10</a><br><strong>Imagine you&#8217;re a marketing analyst at a big Fortune 500 company. And your job used to be to manage a bunch of ad campaigns and online campaigns. That was very manual, and so now you can automate a lot of that work. And then you might still have a dashboard where you can kind of see what&#8217;s going on. But the things that used to drive you crazy, now you can focus on other things.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1786" target="_blank" rel="noreferrer noopener">29.46</a><br>But I am curious about the long-term effects of all of this, especially when, as you mentioned, a lot of people code without knowing how to code. I think that&#8217;s fun for a while but in the long term, stuff breaks and you don’t know where to start.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1801" target="_blank" rel="noreferrer noopener">30.01</a><br><strong>I don&#8217;t know about you, but I&#8217;ve come across people who literally don&#8217;t know how to code, who built a website, starting to have customers. Customers will file support questions or they say, “This part of your website doesn&#8217;t quite work.” Since they don&#8217;t know how to code, they go back to the same coding agent: “Hey, fix this.” The coding agent says I fixed it. They go back to the customer: “It&#8217;s fixed.” The customer goes, “It&#8217;s not fixed.” And so then this is when they start going “I need to hire someone to actually. . . Because now it actually needs to be fixed. And the holding agent can&#8217;t fix it.” So there are obviously dangers to going kind of completely wild on these technologies.</strong></p>



<p class="wp-block-paragraph"><strong>So open weights versus proprietary. This might be a sensitive topic to you because you have Gemini, but you guys also have Gemma.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1869" target="_blank" rel="noreferrer noopener">31.09</a><br>I work on Gemma. Ask me everything about Gemma. [laughs]</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1872" target="_blank" rel="noreferrer noopener">31.12</a><br><strong>[laughs] In your work—or not in your work, but in your day-to-day life, talking to friends, traveling, in your dev rel hat, what is a level of interest in open weights?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1887" target="_blank" rel="noreferrer noopener">31.27</a><br>Oh, a lot, yeah. That&#8217;s for the most part because I&#8217;m in Europe. And Europe loves to say, “OK, we want to own things. We don&#8217;t want to push it over to someone else.” So there&#8217;s a lot of interest for open weight models. It&#8217;s way more than I initially thought because there was quite a big performance gap when ChatGPT came out, 3.5. But now they&#8217;re closing in. These models are extremely capable. You can run them on MacBooks. I mean, when Claude came out, I&#8217;ve seen so many threads of people buying Mac Studios just to be able to run whatever local LLM they have. So you see it in every part of the field, whether it&#8217;s very large organizations or very small, finance, healthcare, what have you.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=1945" target="_blank" rel="noreferrer noopener">32.25</a><br><strong>One of the challenges with open weights is open weights is a business decision. And business decisions can be reversed. Meta Llama may no longer produce open weights. Alibaba—kind of mixed signals there. Some of the Chinese open weights providers are starting to send mixed signals. So it&#8217;s one thing to release an open weights model. But as you know, in this environment you have to release models at a regular cadence and that starts getting expensive. So I guess one of the challenges there for our whole community and industry is, you know, where is the steady supply of open weights models going to come from moving forward? Because basically, like I said, it&#8217;s a business decision, and a business decision is going to be reversed.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2008" target="_blank" rel="noreferrer noopener">33.28</a><br>No, I agree on that. So in the general sense, that&#8217;s what we see happening. Some organizations stop doing open source, [or] less of it, focus on different things. It&#8217;s understandable in a way, because, you know. . .</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2025" target="_blank" rel="noreferrer noopener">33.45</a><br><strong>And, you know, one of the obvious advantages of open weights is you can take the weights and run it in your cluster. And so you have control if. . . One of the things that annoys a lot of these enterprise teams is OK, so I&#8217;m really optimized for Claude 4.5. And then, hey, they are deprecating Claude 4.5, you know. So here at least you have control. And I think one of the things that most teams are starting to realize, Maarten, is actually I can use open weights for a lot of things because. . . Let&#8217;s say it&#8217;s so focused, like a simple sentiment analysis or whatever. I don&#8217;t need the most expensive models. And this I can control moving forward. So I think people and teams are discovering, “Hey, while I should be concerned that these open weights models may stop getting released, for some, for many of my tasks, maybe I don&#8217;t need the latest and greatest anyway.”</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2092" target="_blank" rel="noreferrer noopener">34.52</a><br>That can be the case. Yeah, because these models are very capable. I think there will always be a steady supply of open weight models. If we look at the status of the field now, many. . . Obviously Qwen, they&#8217;re doing an amazing job. Needs to be said. Same with Gemma, they’re also doing well.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2114" target="_blank" rel="noreferrer noopener">35.14</a><br>T<strong>he Qwen team lost a bunch of people, and I think there&#8217;s some worry that Alibaba may back off from. . .</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2123" target="_blank" rel="noreferrer noopener">35.23</a><br>I think they will continue. I don&#8217;t know, obviously, but I think it&#8217;s still a very good strategy to do.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2130" target="_blank" rel="noreferrer noopener">35.30</a><br><strong>And wait, Gemma is not as good as Gemini. [laughs]</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2133" target="_blank" rel="noreferrer noopener">35.33</a><br>We have good benchmarks. What is this? What is this? [laughs] No, but they serve different audiences. And what we see happening with open weights is you get so much back from giving open weights to the community. And DeepMind is a nice example. But the more labs obviously that have always given a lot to the community, when you do that, you also get a lot back, right? Because if people are super excited about Gemma 4—we released a model two days ago, <a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/" target="_blank" rel="noreferrer noopener">12B-1</a>. And you see people using that for a lot of cool use cases. Driving research to create new things that, you know, we might not have thought of. That can be the case. You see Flash, for instance, which is a diffusion-based drafter, super fast, very incredible being used with Gemma 4. That&#8217;s cool. And it&#8217;s not to say that Gemma was the first one that drove that, but open weights in general allow a random person somewhere without access to thousands of GPUs to pretrain a model and still be able to do very cool and interesting research. So as long as I&#8217;m at DeepMind, I&#8217;m gonna make sure we&#8217;re gonna keep doing very cool Gemma stuff.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2223" target="_blank" rel="noreferrer noopener">37.03</a><br><strong>All right, so let&#8217;s close with a rapid fire round. So for each question, keep your answer under a minute. Question number one. OpenClaw. What says you, Maarten, about this trend around personal agents?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2241" target="_blank" rel="noreferrer noopener">37.21</a><br>I love personal agents. They&#8217;re very cool and interesting. And at the same time, I&#8217;m very worried about the security of it. We&#8217;re seeing a lot of people&#8217;s keys being opened up, things that are being deleted that shouldn&#8217;t be deleted. And that&#8217;s because we&#8217;re in very early stages of all of this—just a little bit more time, and then it will be amazing.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2266" target="_blank" rel="noreferrer noopener">37.46</a><br><strong>Yeah. And run it locally with Gemma. [laughs]</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2270" target="_blank" rel="noreferrer noopener">37.50</a><br>Yeah, of course. [laughs] I&#8217;m not gonna sell too much. I love Gemma, I&#8217;m selling already too much.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2277" target="_blank" rel="noreferrer noopener">37.57</a><br><strong>Question number two: reinforcement learning. I&#8217;m a big fan. I always push out a post once a year at least, where I say it&#8217;s just around the corner. Now it seems like there&#8217;s a bit of a comeback with reinforcement, fine-tuning. Are you paying attention to reinforcement learning?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2301" target="_blank" rel="noreferrer noopener">38.21</a><br>A lot. I have a couple of colleagues, and we started something called the <a href="https://www.ragpack.ai/" target="_blank" rel="noreferrer noopener">RAG Pack</a> with some bigger influencers, like Jay Allamar and Josh Starmer from StatQuest. And we did a course on reinforcement quite recently. It&#8217;s such a cool technology. It&#8217;s the technique that makes LLMs the way they are today. And there&#8217;s still a lot of new things coming up in that field to make them faster, more capable, multituning trajectories. Yeah, it&#8217;s the whole thing.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2334" target="_blank" rel="noreferrer noopener">38.54</a><br><strong>Third question: scaling loss. So Anthropic in particular is big on scaling loss: bigger models, more data, that&#8217;s the road to better and better models. So what&#8217;s your feeling right now about scaling loss.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2351" target="_blank" rel="noreferrer noopener">39.11</a><br>They change quickly. We started with regular “more parameters, better model.” Then we switched to reasoning, where we said “longer reasoning, better model.” And now we&#8217;re slowly going towards the “longer trajectories, better model.” You know, more is better. I think they&#8217;re interesting, but they&#8217;re changing now so quickly that I&#8217;m wondering in half a year what the new scaling law and the new nifty thing is going to be.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2379" target="_blank" rel="noreferrer noopener">39.39</a><br><strong>So in closing, data centers. Data centers are a hot topic in the US. A lot of communities seem to be coalescing around opposing the build-out of data centers. So it&#8217;s a bit of a complicated issue in the sense that, you know, assuming that these AI technologies work and they get adopted, we will need compute in order for people to have access to these technologies. Otherwise, maybe the rich are the only ones who will have access to AI. On the other hand, the data centers themselves, you definitely need local input because, electricity, water, noise. . . And then unlike factories, they don&#8217;t really produce a lot of jobs because how many people do you really need to run a data center with all the DevOps agents now that we talked about? So what&#8217;s going on in data centers in Europe?</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2443" target="_blank" rel="noreferrer noopener">40.43</a><br>We don&#8217;t like them. I&#8217;m saying we—I&#8217;m Dutch. If I&#8217;m saying for the people of the Netherlands, we don&#8217;t like them generally. And that&#8217;s going to be very interesting moving forward because there&#8217;s still demand for AI. I know there&#8217;s a lot of people that don&#8217;t like it, but at the same time, there&#8217;s still a lot of people using it, and we need to find a way to balance that out. There&#8217;s no way forward otherwise, and I really hope we can focus more on efficiency when it comes to these compute-heavy things. That&#8217;s why I focus so much on Gemma. They&#8217;re small, capable models that you run on your cell phone. That&#8217;s great. Without needing to have these large data centers, aside from training, maybe, but that will always be there. We have to be honest about that. AI is here to stay. We just need to make it more efficient.</p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2498" target="_blank" rel="noreferrer noopener">41.38</a><br><strong>And with that, thank you, Maarten. And by the way, closing note about data centers, for our listeners, there&#8217;s a lot of announcements, right? Several gigawatts are being. . . Contracts being signed. But if you really follow what&#8217;s going on, there&#8217;s not a lot of build-out. There&#8217;s not a lot of data centers actually being built in and coming online. So&#8230; Thank you, Maarten.</strong></p>



<p class="wp-block-paragraph"><a href="https://www.youtube.com/watch?v=4FRZtBnZWnQ#t=2527" target="_blank" rel="noreferrer noopener">42.07</a> <br>Thank you.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-agentic-systems-fundamentals-with-maarten-grootendorst/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>When Context Collapses: Teaching Agents to Detect and Recover from Lost Memory</title>
		<link>https://www.oreilly.com/radar/when-context-collapses-teaching-agents-to-detect-and-recover-from-lost-memory/</link>
				<comments>https://www.oreilly.com/radar/when-context-collapses-teaching-agents-to-detect-and-recover-from-lost-memory/#respond</comments>
				<pubDate>Thu, 11 Jun 2026 10:59:13 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18901</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/When-context-collapses.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/When-context-collapses-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Or, how I learned to stop worrying about compaction and love the file system]]></custom:subtitle>
		
				<description><![CDATA[This is the eighth article in a series on agentic engineering and AI-driven development.&#160;Read part one&#160;here, part two&#160;here, part three&#160;here, part four&#160;here, part five&#160;here, part six&#160;here, and part seven here. &#8220;640K ought to be enough for anybody.&#8221;—Bill Gates (allegedly) If you&#8217;re building AI agents that do complex, multistep work, you&#8217;re going to run into context [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>This is the eighth article in a series on agentic engineering and AI-driven development.&nbsp;Read part one&nbsp;<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two&nbsp;<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three&nbsp;<a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, part four&nbsp;<a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">here</a>, part five&nbsp;<a href="https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/" target="_blank" rel="noreferrer noopener">here</a></em>, <em> part six&nbsp;<a href="https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/" target="_blank" rel="noreferrer noopener">here</a>, and part seven <a href="https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
</blockquote>



<p class="wp-block-paragraph"><em>&#8220;640K ought to be enough for anybody.&#8221;—Bill Gates (allegedly)</em></p>



<p class="wp-block-paragraph">If you&#8217;re building AI agents that do complex, multistep work, you&#8217;re going to run into context loss. The agent&#8217;s working memory fills up, older information gets silently dropped or compressed, and the agent keeps going without realizing it&#8217;s forgotten something. This article, the third in my Radar article trilogy about context management, walks through a pattern I&#8217;ve been refining for detecting and recovering from that problem, which I call the <strong>externalize-recognize-rehydrate pattern</strong> (or <strong>ERR</strong>, which I think is actually a pretty good acronym for an error recovery pattern): save your agent&#8217;s state to files on disk, detect when context has degraded, and reload from those files to recover. The individual techniques are standard practice in agent and skill engineering—checkpointing, progress files, state verification—but the real power comes from combining them into a coherent workflow that you can use live or build into your agents. I&#8217;ll walk through each step with specific prompts you can adapt for your own agents and coding sessions.</p>



<p class="wp-block-paragraph">Which brings me to memory. Gates has said on multiple occasions that he never actually said that quote at the top of this article, but it endures because it captures one of the core limitations of that era, one that people struggled with constantly, in a way that we can laugh about now. Around that time I was using a 286 with 1 MB of RAM. That&#8217;s megabytes, not gigabytes. MS-DOS 3.3 gave me 640K of conventional memory plus 384K of upper memory, and I spent a lot of time figuring out how to use every bit of it. I configured memory managers, loaded device drivers high, used (and wrote!) terminate-and-stay-resident programs that moved themselves out of conventional memory to free up space, and generally treated memory as a resource that required active, deliberate engineering. There was a lot I wanted to do that didn&#8217;t fit into 640K, and like most people at the time, I went to some lengths to compensate for the memory limitations.</p>



<p class="wp-block-paragraph">We&#8217;re at the 640K stage of AI development. The context window is the new RAM ceiling. Most of today&#8217;s models give you somewhere between 200K and 2M tokens of working memory (and, like memory in the late 1980s and early 1990s, those numbers are growing all the time), and if you&#8217;re building agents that do complex multistep work, you will hit that ceiling. When you do, the AI starts compacting: compressing or dropping older parts of the conversation to make room. And just like running out of conventional memory on a 286, things stop working right and you&#8217;re not sure why.</p>



<p class="wp-block-paragraph">In 20 years we&#8217;ll be looking back at today&#8217;s puny context windows and wondering how developers in the 2020s managed to get anything done with just a few million tokens. Because none of this is new. In case you don&#8217;t believe me, here&#8217;s a photo of my dad at Princeton in the early 1970s working on an Evans and Sutherland LDS-1 graphics computer, the first commercial vector graphics machine, connected to a PDP-10 mainframe:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1600" height="1225" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-12-1600x1225.png" alt="Keep on truckin" class="wp-image-18902" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-12-1600x1225.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-12-300x230.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-12-768x588.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-12-1536x1176.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-12.png 1964w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p class="wp-block-paragraph">The actual LDS-1 is in the large cabinet in the background, directly behind the monitor. Sitting next to it, just out of the picture, is an even larger cabinet that holds a memory unit with <em>16K of magnetic core memory</em> (technically 8K words).</p>



<p class="wp-block-paragraph">So you can imagine that just a decade later, 640K in a tiny PC that fit on your desktop seemed extravagant.</p>



<p class="wp-block-paragraph">In the last two articles in this series (“<a href="https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/" target="_blank" rel="noreferrer noopener">Why Doesn’t Anyone Teach Developers About Context Management?</a>” and “<a href="https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/" target="_blank" rel="noreferrer noopener">Your AI Agent Already Forgot Half of What You Told It</a>”), I talked about what context is and why context management matters, and I shared practical techniques and prompts for keeping important information in files instead of leaving it in the AI&#8217;s context window. This article gets more technical. I want to build on those strategies and talk about how to build agents that can detect when they&#8217;ve lost context and recover from it on their own.</p>



<h2 class="wp-block-heading"><strong>Brute-forcing my way through context loss</strong></h2>



<p class="wp-block-paragraph">I&#8217;ve been doing this kind of context management for a while now, long before the specific tools I&#8217;m about to describe existed. But a recent crash gave me a clean example of what the process looks like in its most brute-force form.</p>



<p class="wp-block-paragraph">I was working in Copilot with a seven-step plan, going through it one step at a time, having another AI review each step before moving on. Steps one and two went fine. When it came time to do step three and I gave it the prompt, it jumped straight to step four. This kind of thing can be really frustrating, because it seems like an AI smart enough to implement a complex feature in code should be able to (ahem) count to four.</p>



<p class="wp-block-paragraph">The key to not getting frustrated when the AI loses track of steps or can&#8217;t seem to count from prompt to prompt is to remember what it&#8217;s good at and how it remembers things. If the AI you&#8217;re using does that, check the conversation history. You&#8217;ll probably see something like &#8220;summarizing conversation history&#8221; or &#8220;compacting conversation&#8221; somewhere above your last message. That&#8217;s telling you that the AI lost track of where it was because that count was literally purged from its memory.</p>



<p class="wp-block-paragraph">AIs are good at carrying out an instruction. They&#8217;re bad at keeping track of their own state over a long conversation, and the way they manage their memory is a big part of that. This article is about finding ways to build your AI tools so you&#8217;re not relying on them to do the thing they&#8217;re worst at.</p>



<p class="wp-block-paragraph">But compaction isn&#8217;t the only way your AI loses context. A few weeks ago I was deep into a long session with Copilot, working through a multiphase code review. I&#8217;d spent a while building up context with the AI about my codebase and the decisions we&#8217;d made together. I was about to move on to the next phase, and then I got this:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1552" height="964" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-13.png" alt="Phase B" class="wp-image-18903" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-13.png 1552w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-13-300x186.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-13-768x477.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-13-1536x954.png 1536w" sizes="auto, (max-width: 1552px) 100vw, 1552px" /></figure>



<p class="wp-block-paragraph">The entire context was wiped, which could have been a really frustrating problem, since I had a long history with the session, and it had built up a lot of knowledge about what we were doing. This turned out to be a bug in Opus 4.6&#8217;s interaction with Copilot&#8217;s conversation history, and I&#8217;ve seen other people hit the same thing. I was staring at a fresh prompt with nothing in it.</p>



<p class="wp-block-paragraph">So I did something that, in retrospect, is a pretty good brute-force version of what this whole article is about. I recognized the context was gone (hard to miss when the whole conversation disappears). I copied the entire conversation out of Copilot and pasted it into a text file. Then I gave the new session a prompt:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">We were in the middle of a long conversation, then I got an error and the entire context was wiped. I saved a copy of the conversation in #file:chat_history.txt, read it and bring yourself back up to speed.</p>
</blockquote>



<p class="wp-block-paragraph">And it worked! This brought the new session back to where I needed it to be.</p>



<p class="wp-block-paragraph">That simple error and recovery actually outlines a pretty good pattern for dealing with context loss:</p>



<ol class="wp-block-list">
<li><strong>Externalize the state.</strong> Get the important information out of the conversation and into a file on disk, where it won&#8217;t disappear when the context window reshuffles.</li>



<li><strong>Recognize the loss.</strong> Notice that the agent&#8217;s working context has been wiped or degraded, whether that&#8217;s obvious (like a crash) or subtle (like output that quietly stops making sense).</li>



<li><strong>Rehydrate from the file.</strong> Point a new session at that file and let it rebuild its understanding from what&#8217;s written down.</li>
</ol>



<p class="wp-block-paragraph">The individual mechanics are well-documented across cognitive science (cognitive offloading, task resumption), software engineering (the Memento pattern, React hydration), and knowledge management (the SECI model). I&#8217;m not claiming to have invented any of them. But the specific abstraction of these three phases into a unified, named pattern applied to AI context management is, as far as I can tell, new. It&#8217;s synthesis and codification, not invention.</p>



<p class="wp-block-paragraph">In this case I did it with copy and paste, which isn&#8217;t particularly elegant, but it worked for me. But this is a blunt instrument, because a raw conversation dump is both too much and too little: it&#8217;s too much because it&#8217;s full of noise, like tool calls, dead ends, back-and-forth that doesn&#8217;t matter anymore; and it&#8217;s too little because the context that got silently compressed away during the session is already gone. When you build these mechanisms into agents and skills, you can do it in a much more subtle and automated way.</p>



<h2 class="wp-block-heading"><strong>Externalize: Add two layers of state to your agent</strong></h2>



<p class="wp-block-paragraph">The idea behind <strong>externalization</strong>, or periodically saving your agent&#8217;s state, came out of a conversation I was having with an AI assistant while building the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open source AI coding skill that runs structured code reviews. The playbook runs a structured code review as a single process, but that process could easily turn into a 15-million-token request if you tried to do it all in one shot. I described in the <a href="https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/" target="_blank" rel="noreferrer noopener">previous article in this series</a> how I broke it into six phases, and that was only possible because the context for each phase had already been externalized. Each phase reads its inputs from files, does its work, writes its outputs to files, and stops. The next phase picks up from the files, not from whatever the agent remembers. If this sounds like the familiar advice to ask the AI to plan before you ask it to implement, it&#8217;s the same principle applied to context management. Separating each step and persisting the output means you can inspect it, and the next step doesn&#8217;t depend on the agent&#8217;s memory.</p>



<p class="wp-block-paragraph">But what should those files contain? I found that the AI is actually good at figuring that out. At some point I asked the assistant:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Would it make sense for the agent to record more context in files as it progresses, to make sure nothing is dropped along the way? It should work even if you break it into separate prompts, because the result from each step is persisted. Plus, we can audit its reasoning for debugging and improvement.</p>
</blockquote>



<p class="wp-block-paragraph">That prompt was all it took. The assistant designed the file structure itself: a progress tracker that records which phase is active and what&#8217;s been completed, a JSONL artifact file (JSONL is just a file with a bundle of JSON objects, with one record per line) where each pass appends its output, and a set of brief documents describing the purpose of each phase. You don&#8217;t need to overengineer this. Tell the agent what you&#8217;re trying to preserve and let it figure out the file layout.</p>



<p class="wp-block-paragraph">What emerged falls into two categories that I think of as execution continuity and task continuity:</p>



<ul class="wp-block-list">
<li><strong>Execution continuity</strong> is the state the agent needs to resume work in the middle of a task: what step it&#8217;s on, what it&#8217;s completed, what decisions it&#8217;s made so far. These files change constantly as the agent works.<br></li>



<li><strong>Task continuity</strong> is the broader context that doesn&#8217;t change during execution: what the whole task is about, what success looks like, what the structural constraints are. These files are written once and read at every resumption.</li>
</ul>



<p class="wp-block-paragraph">When an agent needs to resume after suspected compaction, it reads back both layers. The task continuity files anchor it back to what the whole endeavor is about. The execution continuity files put it back in the middle of the work. Together, they give the agent enough information to continue without relying on anything that might have been compacted.</p>



<p class="wp-block-paragraph">The key is that externalization isn&#8217;t something you do once at the beginning of a task. You want the agent saving its state at frequent checkpoints so that if compaction happens mid-run, the most recent checkpoint is close to where the agent was working. Here&#8217;s the kind of instruction I gave the agent for tasks that processed records one at a time:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Update the progress file after every single record, not in batches. Write the output line first, then update the progress file with the new cursor and a fresh timestamp. If the progress file&#8217;s timestamp falls behind the output file&#8217;s, you&#8217;re batching and that&#8217;s wrong.</p>
</blockquote>



<p class="wp-block-paragraph">The frequency matters because context can compact at any point. If the agent only saves state at the end of a long run, compaction in the middle means losing everything since the start. If it checkpoints after every unit of work, the worst case is losing one unit.</p>



<p class="wp-block-paragraph">Two-layer externalization survives context reshaping, not only outright context loss. Even if the agent&#8217;s context window isn&#8217;t full, if the context has been reorganized or reprioritized (a compression that reshapes without truncating), the agent can reload the external files and know for certain what the ground truth is.</p>



<h2 class="wp-block-heading"><strong>Recognize: Detecting loss from inside the agent</strong></h2>



<p class="wp-block-paragraph">The second step in the pattern is to <strong>recognize</strong> that your agent has lost context, and it turns out to be the hardest part (at least with today&#8217;s AI technology). When the context window fills up, the AI compacts silently, and the agent keeps working without realizing it&#8217;s lost information. The agent can&#8217;t tell you it&#8217;s forgotten something, because it doesn&#8217;t know it forgot. Detecting that change turns out to be a nontrivial problem; I&#8217;ll walk you through an approach that helped me, and keep it general enough so you can do the same thing. The copy-and-paste approach works when the context loss is obvious, like a crash that wipes your whole conversation. But most context loss isn&#8217;t that visible.</p>



<p class="wp-block-paragraph">I described context compaction in the <a href="https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/" target="_blank" rel="noreferrer noopener">previous article</a>, but it&#8217;s worth restating the core problem from the agent&#8217;s perspective. Different tools handle context overflow differently: Some truncate older messages; some compress conversations into summaries; some use a sliding window. But they all have the same effect. Information disappears from the agent&#8217;s working context, and the agent doesn&#8217;t get notified.</p>



<p class="wp-block-paragraph">This was a challenge when I built the Quality Playbook, because it runs multiple passes over a codebase, each one reading source files, extracting requirements, and checking coverage. Each pass can involve enough work that it fills the context window multiple times over. And when context compacts mid-pass, the agent doesn&#8217;t know it happened. It keeps working, but the output starts silently degrading. So I started building mechanisms for the agent to detect compaction and recover by reading back the files it had written earlier. The patterns that came out of that work are general enough to apply to anyone building agents that need to survive context pressure.</p>



<p class="wp-block-paragraph">From the agent&#8217;s perspective, compaction is seamless. It&#8217;s tracking state, referencing decisions made earlier in the conversation, and then at some point the earlier context is gone. But the agent can&#8217;t tell the difference between &#8220;I never knew that&#8221; and &#8220;I knew it but lost it.&#8221; It tries to reference something and finds nothing, or finds a compressed version that lost the nuance. And because the agent doesn&#8217;t know it lost anything, it doesn&#8217;t know it needs to recover.</p>



<p class="wp-block-paragraph">This invisibility is the core problem. But it turns out you can work around it, and the next two sections walk through how.</p>



<h2 class="wp-block-heading"><strong>Building a detection mechanism</strong></h2>



<p class="wp-block-paragraph">Once you have files on disk, the question is what specifically to check and how to know when something has gone wrong. I landed on a mechanism while building the Quality Playbook&#8217;s requirement extraction pipeline. The playbook processes source documents in multiple passes, and each pass appends its output to a JSONL artifact file. After each unit of work, the agent also writes a progress record to a separate file: what it just finished, what it found, and where it should pick up next.</p>



<p class="wp-block-paragraph">The detection mechanism comes from two rules I gave the agent. The idea is that the progress file tracks a cursor, which is just a position marker that tells the agent which record to process next. If the agent writes a record to the output file but then loses context before updating the progress file, those two files will be out of sync.</p>



<p class="wp-block-paragraph">The agent didn&#8217;t need to understand any of that upfront; I just described the rules in plain language and let it figure out the implementation. The first rule establishes an invariant between the output file and the progress file:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Cursor advances only after the line is on disk. Write the summary line to the output file first, then update the progress file. The cursor must always equal the index of the next record that still needs to be processed.</p>
</blockquote>



<p class="wp-block-paragraph">The second rule told the agent how to check that invariant on startup:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">On startup, read the progress file. Resume from its cursor value. Verify continuity: the last line in the output file should equal cursor minus one. If not, roll the cursor back to match disk state and report the discrepancy.</p>
</blockquote>



<p class="wp-block-paragraph">If the progress file says the cursor is at record 381, but the last line in the output file is record 379, something happened. The context compacted and the agent lost track of where it was. The divergence between the two files is the signal.</p>



<p class="wp-block-paragraph">This worked because files on disk don&#8217;t change when context compacts. They&#8217;re written once and then read repeatedly. If what the agent thinks it knows doesn&#8217;t match what&#8217;s actually in the files, something shifted in the agent&#8217;s memory, not on disk. I ended up folding this check into a preamble that every session started with:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">If this session has experienced auto-compaction, re-read the pass specification from disk. Do not try to reconstruct it from the compacted summary. Read the progress file. Read the last record of the JSONL artifact and confirm its index equals the cursor minus one. If not, roll the cursor back to match disk state. Disk is the source of truth. The conversation is not.</p>
</blockquote>



<p class="wp-block-paragraph">That preamble ran at the top of every session. During one particularly intensive day of pipeline development, I ran over a hundred Claude Code sessions with that exact instruction. Most of them completed without hitting compaction. But the ones that did hit it recovered cleanly, because the preamble told the agent exactly what to check and exactly what to do when the check failed.</p>



<p class="wp-block-paragraph">The specific prompts I used are tied to the Quality Playbook&#8217;s file structure, but the technique generalizes. If you&#8217;re building any agent that does multistep work, you can adapt the same approach. Here&#8217;s a version you could drop into a session preamble or an agent&#8217;s system prompt:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Before continuing any task, read your progress file and your most recent output file. Compare them: does the progress file say you&#8217;ve completed work that isn&#8217;t reflected in the output? If so, trust the output file, roll back your progress to match, and note the discrepancy. Do not rely on what you remember from the conversation. The files on disk are the source of truth.</p>
</blockquote>



<p class="wp-block-paragraph">The wording doesn&#8217;t have to be precise. What matters is the structure: tell the agent where to look, what to compare, and which source to trust when they disagree.</p>



<h2 class="wp-block-heading"><strong>But didn&#8217;t you just say the AI can&#8217;t detect its own compaction?</strong></h2>



<p class="wp-block-paragraph">Right, and it can&#8217;t. What I described above isn&#8217;t the agent detecting compaction. It&#8217;s the agent running a deterministic check against files on disk and finding a discrepancy. The agent doesn&#8217;t need to know that compaction happened. It just needs to notice that two files disagree. Think of the agent as an amnesiac clerk. You don&#8217;t ask the clerk to remember what they did yesterday. You make the clerk check the physical ledger every time they sit down at the desk. If their notes disagree with the ledger, they&#8217;re trained to trust the ledger.</p>



<p class="wp-block-paragraph">If you saw Christopher Nolan&#8217;s breakout movie <em>Memento</em>, you can think of your agent as Leonard Shelby, the character played by Guy Pearce with anterograde amnesia. You couldn&#8217;t ask Leonard to remember what he did yesterday. He had to check his tattoos every time he woke up. If his tattoos disagreed with what he&#8217;s seeing, he trusts the tattoo (which leads to a major plot point, which I won&#8217;t spoil). Again, this isn&#8217;t a new idea either. I mentioned the <a href="https://en.wikipedia.org/wiki/Memento_pattern" target="_blank" rel="noreferrer noopener">Memento pattern</a> earlier, which is literally named after this movie.</p>



<p class="wp-block-paragraph">This is a classic distributed systems technique. In double-entry bookkeeping, you maintain two independent records of the same transaction and reconcile them regularly. If they disagree, you investigate. You don&#8217;t need to know why they diverged; the divergence itself is the signal. A two-phase commit works the same way: write the data first, then update the record that says the data was written. If you find data without a matching record, or a record without matching data, something went wrong between the two phases.</p>



<p class="wp-block-paragraph">That&#8217;s exactly what the cursor invariant does. The agent writes the output line first, then updates the progress file. If those two files are out of sync, something happened between the two writes. The agent doesn&#8217;t detect compaction. It detects a broken invariant, and it&#8217;s been told that when the invariant breaks, the files on disk win.</p>



<p class="wp-block-paragraph">Three things make this work. First, the check is purely deterministic: read two files, compare two numbers, act on the result. There&#8217;s no reasoning involved, no judgment call about whether the agent &#8220;feels&#8221; like it lost context. I wrote about this principle in “<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">Keep Deterministic Work Deterministic</a>”; you never want an AI making decisions that a file comparison can make for it. Second, the files on disk don&#8217;t change when context compacts. They&#8217;re the stable reference point that the agent&#8217;s memory gets checked against. Third, the instruction to run the check lives in the system prompt or preamble, which is generally preserved even when conversation context gets compacted. The check survives the thing it&#8217;s designed to detect.</p>



<h2 class="wp-block-heading"><strong>Rehydrate: Reading back the state</strong></h2>



<p class="wp-block-paragraph"><strong>Rehydration</strong> is the process of reading back externalized state and rebuilding the agent&#8217;s working context. Once the agent detects compaction (or, more specifically and accurately, has enough evidence from the filesystem that compaction occurred), the recovery step is to read back the externalized files and rebuild. For the Quality Playbook, rehydration meant:</p>



<ol class="wp-block-list">
<li>Read the phase brief to re-anchor the purpose of this pass</li>



<li>Read the progress file to know which unit is active and what&#8217;s been completed</li>



<li>Read the tail of the JSONL artifact to confirm the last successfully written record</li>



<li>Recompute the next unit of work from those files</li>
</ol>



<p class="wp-block-paragraph">This is different from just continuing without detection. Without detection, the agent tries to pick up where it left off and hopes it still has enough context. With detection, the agent knows something happened and deliberately reloads state before continuing.</p>



<p class="wp-block-paragraph">You can make the rehydration process itself auditable. Instead of silently reading the files and resuming, have the agent write down what it learned:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Read the progress file and the JSONL artifact. Write a summary of what you learned: what pass is running, what unit is active, what the cursor position is, and how many requirements have been extracted so far. Then continue from there.</p>
</blockquote>



<p class="wp-block-paragraph">Writing a rehydration summary serves two purposes. It gives you visibility into what the agent understood and whether it rehydrated correctly. And it forces the agent to process the external files explicitly rather than just loading them into context. Explicit processing is more reliable than silent loading because the agent has to commit to an interpretation, and you can read that interpretation and catch mistakes.</p>



<p class="wp-block-paragraph">You can adapt this approach to any agent workflow where work happens in steps. The specific files and cursor values are particular to my pipeline, but the underlying technique is general: have the agent write its progress to a file after each step, and check that file against its output at the start of every session. And this advice isn&#8217;t just for writing agents or skills. Even in a live session with Claude Code, Cursor, or Copilot, you can tell the agent to periodically write a summary of what it&#8217;s done and what it plans to do next to a file on disk. If the session crashes or the context gets long enough to compact, you can point a new session at that file and pick up where you left off. The key is getting the state out of the conversation and onto disk before you need it.</p>



<h2 class="wp-block-heading"><strong>Context management is an architectural concern</strong></h2>



<p class="wp-block-paragraph">Every technique I&#8217;ve described in these articles comes down to the same principle: Important information shouldn&#8217;t live only in the agent&#8217;s context window. The previous articles covered how to put that information on disk. This one covers how to make the agent aware of its own limitations so it can recover when context pressure gets too high.</p>



<p class="wp-block-paragraph">An agent that can detect its own degradation and correct for it is fundamentally more reliable than one that just keeps going. When the agent knows how to stop, check itself against ground truth, and reload what it lost, context pressure becomes a recoverable event instead of a slow, silent failure.</p>



<p class="wp-block-paragraph">This concludes my mini-series trilogy of articles about context management. The first article in this series was about understanding what context is and why it disappears. The second was about getting important information out of the conversation and onto disk before you need it. This one is about closing the loop: making the agent aware of its own limitations so it can detect degradation and recover from it. Together, they add up to treating context as an engineering problem rather than something you hope works out.</p>



<p class="wp-block-paragraph">These are still early days. Context windows will get larger, compaction will get smarter, and some of the workarounds in this article will eventually be unnecessary. But the underlying principle won&#8217;t change: If your agent&#8217;s ability to do its job depends on information, that information needs to live somewhere more durable than working memory. That was true for my dad&#8217;s 32KB core memory at Princeton, it was true for my 640K of conventional RAM, and it&#8217;s true for today&#8217;s 200K-token context windows.</p>



<p class="wp-block-paragraph"><em>The <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a> and <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a> are open source projects where these techniques are used in production. Both are built using AI-driven development and available for exploration if you want to see how this looks in practice.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026, by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/when-context-collapses-teaching-agents-to-detect-and-recover-from-lost-memory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The PM&#8217;s Playbook for Shipping AI Features That Actually Work in Production</title>
		<link>https://www.oreilly.com/radar/the-pms-playbook-for-shipping-ai-features-that-actually-work-in-production/</link>
				<comments>https://www.oreilly.com/radar/the-pms-playbook-for-shipping-ai-features-that-actually-work-in-production/#respond</comments>
				<pubDate>Wed, 10 Jun 2026 10:55:56 +0000</pubDate>
					<dc:creator><![CDATA[Gaurav Savla]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18892</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-PMs-playbook-for-shipping-AI-features-that-actually-work-in-production.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-PMs-playbook-for-shipping-AI-features-that-actually-work-in-production-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The demo to production Death Valley If you&#8217;ve worked on an AI feature, you know the feeling. You start building something that you are excited about, set launch timelines. The model spits out a perfect response, the prototype works magically, and everybody in the room is mentally calculating how big this product will be when [&#8230;]]]></description>
								<content:encoded><![CDATA[
<h2 class="wp-block-heading"><strong>The demo to production Death Valley</strong></h2>



<p class="wp-block-paragraph">If you&#8217;ve worked on an AI feature, you know the feeling. You start building something that you are excited about, set launch timelines. The model spits out a perfect response, the prototype works magically, and everybody in the room is mentally calculating how big this product will be when we launch. I&#8217;ve been in that room a lot many times and it&#8217;s fun.</p>



<p class="wp-block-paragraph">Then you try to test before you ship.</p>



<p class="wp-block-paragraph">Latency spikes to 10 seconds on mobile. The model starts hallucinating on edge cases that happen to represent 15% of actual user queries. Your A/B test shows no statistically significant engagement lift because the variance in AI outputs makes traditional hypothesis testing basically meaningless. The safety team flags 340 failure cases in the first week, and you’re now debugging nondeterministic cases that fail in creative, novel ways every single day.</p>



<p class="wp-block-paragraph">Most often than not, it&#8217;s not a model problem but an engineering discipline problem. Shipping an AI product is very different from traditional software. I&#8217;ve figured this out the hard way. This playbook shares my learnings.</p>



<h2 class="wp-block-heading"><strong>Latency budgets</strong></h2>



<p class="wp-block-paragraph">Every AI feature comes with a latency tax. Large language model inference takes time. We&#8217;re talking 500 milliseconds to 5 or even 50 seconds depending on model size, input length, and infrastructure setup. For consumer products where people expect sub-200-millisecond interactions, this is a hard constraint you have to design around.</p>



<p class="wp-block-paragraph">The mistake I see most often is teams measuring only p50 latency. A feature with 800 milliseconds p50 sounds fine until you discover the p90 is 15 seconds. That means 10 in every 100 users sit there waiting for 15+ seconds. At scale, that&#8217;s thousands of terrible experiences per day.</p>



<p class="wp-block-paragraph">The way I think about it is you define your latency budget by interaction type, not globally: <strong>Synchronous interactions</strong>, where the user is staring at a spinner, need to resolve under 1 second.<strong> Progressive interactions</strong>,<strong> </strong>where output streams token by token, need first token in under 500 milliseconds and full response under 5 seconds. <strong>Asynchronous interactions</strong>, where the user keeps doing other stuff, can take up to 20 seconds with a progress indicator.</p>



<p class="wp-block-paragraph">You also need to measure cold starts separately. The first request after a model loads into memory can be 10 times slower than subsequent requests, and if your traffic is bursty, cold starts will disproportionately punish your most engaged users arriving during peak hours.</p>



<p class="wp-block-paragraph">Besides, you also need to budget for the full pipeline, not just inference. A typical AI feature pipeline including input preprocessing (tokenization, context assembly, and prompt construction), model inference, output postprocessing (parsing, formatting, safety filtering, etc.), and a full response delivery adds up. Optimizing inference while ignoring the rest is like tuning your engine while driving on flat tires.</p>



<p class="wp-block-paragraph">Lastly, use streaming aggressively for generative features. Pushing tokens to the user as they&#8217;re generated instead of waiting for the full response changes how users perceive latency.&nbsp; A four-second response that starts appearing at 300 milliseconds feels dramatically faster than one that pops in all at once. Perception is reality when it comes to user experience.</p>



<h2 class="wp-block-heading"><strong>Designing fallbacks</strong></h2>



<p class="wp-block-paragraph">Traditional software fails in boring, predictable ways. AI features fail in novel, unpredictable, and occasionally creative ways. I once saw a model respond to a product recommendation query with a poem about loneliness. Your fallback strategy needs to be considerably more sophisticated than a try/catch block.</p>



<p class="wp-block-paragraph">I think about fallbacks as a hierarchy. First, model fallback: When your primary model fails, drop to a simpler, faster, and more reliable model. Most failure cases get handled without the user ever knowing. Second, cache fallback: For queries similar to stuff you&#8217;ve seen before, serve a cached response. Third, template fallback: When generation fails completely, fall back to prewritten templates. Degraded beats dead every time. Fourth, graceful omission: Sometimes the best fallback is to simply not show the AI feature at all rather than showing a broken version.</p>



<p class="wp-block-paragraph">The design principle underneath all of this is that users should never encounter an unhandled AI failure. Every failure mode maps to a specific level, and transitions between levels should be invisible whenever you can manage it.</p>



<h2 class="wp-block-heading"><strong>Quality measurement</strong></h2>



<p class="wp-block-paragraph">Quality in traditional software is binary. The button works or it doesn&#8217;t. AI feature quality is continuous and subjective, and it changes depending on context. I&#8217;ve landed on a four-layer quality pyramid.</p>



<p class="wp-block-paragraph">The foundation is safety, and it&#8217;s nonnegotiable. Does the output contain harmful content, PII, or made-up facts? This layer is binary, and you measure it with automated classifiers running against 100% of outputs.</p>



<p class="wp-block-paragraph">The second layer is factual correctness, which is domain specific. Is the output actually right? For a coding assistant that means generated code compiles and passes tests. For a writing tool it means grammatical, stylistically appropriate output. You measure this with domain specific evaluation suites.</p>



<p class="wp-block-paragraph">The third layer is usefulness, and it&#8217;s user centered. Did the person actually benefit? Track acceptance rate, edit distance, time to task completion, and repeat usage. This is where traditional product metrics meet AI specific ones.</p>



<p class="wp-block-paragraph">The fourth layer is delight, which is experimental. Does the output feel good? Hardest to measure but often most important for adoption. Sometimes the numbers say the feature works but users&#8217; guts say it doesn&#8217;t. This layer catches that gap.</p>



<h2 class="wp-block-heading"><strong>A/B testing AI features</strong></h2>



<p class="wp-block-paragraph">A/B testing AI features is fundamentally harder than traditional features because AI outputs are nondeterministic. The same user doing the same thing twice might get different outputs, introducing variance that traditional frameworks weren&#8217;t built to handle.</p>



<p class="wp-block-paragraph">The core challenge is that intratreatment variance inflates the sample size you need for statistical significance, often by three to five times. If you&#8217;re running your AI experiment with normal sample size assumptions, you&#8217;re probably looking at noise and calling it signal.</p>



<p class="wp-block-paragraph">Then there&#8217;s the metric selection problem. A chatbot generating entertaining but factually wrong responses might show amazing engagement numbers while actively misleading users. You have to measure engagement and quality together. &#8220;Engaged interactions where quality score exceeds threshold&#8221; is more meaningful than raw engagement alone.</p>



<p class="wp-block-paragraph">The temporal problem matters too. AI feature value changes over time as users learn how to work with it. Short experiments will underestimate long-term value if there&#8217;s a learning curve, or overestimate it if there&#8217;s a novelty bump.</p>



<p class="wp-block-paragraph">My practical guidance: budget two to three times more time and traffic for AI experiments than traditional ones. Lean on Bayesian methods as they handle high variance better. And always pair quantitative tests with qualitative research. Ten user interviews will surface failure modes that no amount of statistical analysis will catch.</p>



<h2 class="wp-block-heading"><strong>Model drift monitoring</strong></h2>



<p class="wp-block-paragraph">Model drift is the slow, invisible rot of AI output quality over time, and there are multiple culprits.</p>



<p class="wp-block-paragraph">Data drift happens because the world changes and user behavior evolves. A model trained on 2024 data performs worse on 2026 queries referencing new concepts, slang, and cultural moments.</p>



<p class="wp-block-paragraph">Provider drift happens because third-party APIs change without your consent. <a href="https://www.ciodive.com/news/ChatGPT-OpenAI-GPT4-LLM-behavior-Stanford-UC-Berkeley/688683/" target="_blank" rel="noreferrer noopener">OpenAI acknowledged</a> that GPT-4&#8217;s behavior shifted measurably between March and June 2023, and <a href="https://arxiv.org/abs/2307.09009" target="_blank" rel="noreferrer noopener">Stanford researchers documented significant performance swings</a>. The fix: Pin your model versions so updates happen on your schedule, after your testing.</p>



<p class="wp-block-paragraph">Evaluation drift is the subtlest form. Even your quality metrics can become inadequate and the evaluation criteria that made sense at launch might become inadequate as usage patterns shift and user expectations change. Quarterly reviews of your evaluation suites are essential.</p>



<p class="wp-block-paragraph">At minimum you need daily automated quality evaluations on 1% to 5% of production traffic, weekly analysis of input distribution characteristics, and monthly human evaluation of 100 to 500 examples. Shipping an AI feature without drift monitoring is like deploying a service without alerting. You won&#8217;t know it&#8217;s broken until your users tell you, and by then they&#8217;re angry.</p>



<h2 class="wp-block-heading"><strong>Evaluation frameworks</strong></h2>



<p class="wp-block-paragraph">How do you know if your AI feature is good enough? You need two fundamentally different approaches, and you genuinely need both.</p>



<p class="wp-block-paragraph">Automated evaluation gives you speed. Build a golden dataset of 500 to 2,000 labeled examples, train a classifier or use a capable model as judge, and validate against human judgment quarterly targeting 85% agreement. Automated evals chew through thousands of examples per hour, making them essential for velocity. The pitfall: They miss novel failure modes not in the training data.</p>



<p class="wp-block-paragraph">Human evaluation catches what automation misses. Structure it with five to seven evaluators mixing domain experts and representative users. Use a consistent rubric covering accuracy, helpfulness, tone, completeness, and safety. Run weekly during development, monthly in production. The trade-offs: expensive at $15 to $30 per example, slow with 24 to 72 hour turnaround, and subject to human biases. Manage by rotating evaluators and capping sessions at two hours.</p>



<p class="wp-block-paragraph">The model as judge approach is an increasingly viable middle ground. Judging quality is often easier than generating it, which means a model can reliably evaluate outputs even for tasks where it couldn&#8217;t produce them itself. Use it for high-volume evaluation but always validate against human judgment.</p>



<h2 class="wp-block-heading"><strong>Graceful degradation and prompt engineering</strong></h2>



<p class="wp-block-paragraph">Graceful degradation means when capabilities decrease, the experience gets worse smoothly instead of falling off a cliff. Design for capability levels, not binary states. Define four to five levels with specific behaviors at each. For example, for an AI writing assistant: Level 5 is full capability with real-time suggestions, tone adjustment, and structure recommendations. Level 4 is delayed suggestions appearing after a two- to three-second pause because latency is up. Level 3 is basic suggestions only like grammar and spelling with no style feedback. Each level is a deliberate design decision, not an accident.</p>



<p class="wp-block-paragraph">Make degradation invisible when possible. Users shouldn&#8217;t see a &#8220;broken&#8221; experience. They see a less detailed one. That&#8217;s a huge difference psychologically. However,&nbsp; when the degradation is significant enough that users will notice, proactive communication like &#8220;AI suggestions are temporarily limited&#8221; builds trust infinitely more than silently pushing poor-quality outputs.</p>



<p class="wp-block-paragraph">Prompt engineering in production is software engineering. In production, prompts are code, and they need version control, testing, monitoring, and maintenance. Version controls every prompt. Parameterize prompts, don&#8217;t hardcode context. Production prompts should be templates with clearly defined injection points for user context, system state, and dynamic instructions. This makes them testable because you can inject known inputs and verify outputs, and it makes them maintainable because changing how you handle context shouldn&#8217;t require rewriting the entire prompt from scratch.</p>



<p class="wp-block-paragraph">Test prompts against regression suites. Maintain 200 to 500 test cases covering the full distribution of expected inputs, including edge cases and adversarial inputs. Run the suite against every prompt change before deployment.</p>



<p class="wp-block-paragraph">Monitor prompt performance in production. Track output quality metrics like acceptance rate, user edits, and regeneration requests, segmented by prompt version. When you deploy a new version, compare its production metrics against the previous one for at least 72 hours before calling it stable. This is basically canary deployment for prompts.</p>



<h2 class="wp-block-heading"><strong>Ship it right</strong></h2>



<p class="wp-block-paragraph">These systems aren&#8217;t optional add ons you can bolt on after launch. Every feature I&#8217;ve seen fail was built first with plans to &#8220;add production hardening later.&#8221; Later never comes.</p>



<p class="wp-block-paragraph">AI features are probabilistic and nondeterministic, and they change over time without anyone touching them. Build these systems, staff them properly, and treat them with the same seriousness you&#8217;d give your core infrastructure. The gap between demo and production is wide, but it&#8217;s absolutely crossable if you build the right bridge.</p>



<p class="wp-block-paragraph"><em>Note: The research work pertaining to this article was done in a personal capacity. Views are of my own and do not reflect my employer&#8217;s views in any way.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-pms-playbook-for-shipping-ai-features-that-actually-work-in-production/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Subsidy Ended: What Tool-Using Agents Actually Cost</title>
		<link>https://www.oreilly.com/radar/the-subsidy-ended-what-tool-using-agents-actually-cost/</link>
				<comments>https://www.oreilly.com/radar/the-subsidy-ended-what-tool-using-agents-actually-cost/#respond</comments>
				<pubDate>Tue, 09 Jun 2026 11:09:17 +0000</pubDate>
					<dc:creator><![CDATA[Bennie Haelen]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18887</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-subsidy-ended.png" 
				medium="image" 
				type="image/png" 
				width="1200" 
				height="896" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-subsidy-ended-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Usage-based billing didn’t make agents expensive. It made their existing costs visible, and visibility turns agent economics into a governance problem.]]></custom:subtitle>
		
				<description><![CDATA[On June 1, GitHub Copilot&#8217;s usage-based billing became active for all Copilot plans, and developers reacted quickly and loudly. A Pro plan still costs $10, but it now comes with a monthly pool of AI credits. Those credits are priced at a penny each, and they’re consumed according to the model used and the tokens [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">On June 1, GitHub Copilot&#8217;s <a href="https://github.blog/changelog/2026-06-01-updates-to-github-copilot-billing-and-plans/" target="_blank" rel="noreferrer noopener">usage-based billing became active</a> for all Copilot plans, and developers reacted quickly and loudly. A Pro plan still costs $10, but it now comes with a monthly pool of AI credits. Those credits are priced at a penny each, and they’re consumed according to the model used and the tokens processed, including input, output, and cached tokens. For a heavy agentic session running a frontier model, that makes spend feel very different from a flat subscription.</p>



<p class="wp-block-paragraph">That’s the news, and it’s worth understanding, but it isn’t the important part. Nothing about the underlying cost of agentic work actually changed on June 1. The tokens were always being consumed, the loops were always running, and the tool calls were always expanding the context. What changed is that the meter became visible. A workload that had been quietly subsidized under a flat rate started showing up as an itemized bill.</p>



<h2 class="wp-block-heading"><strong>Where the tokens go</strong></h2>



<p class="wp-block-paragraph">To see why the bill landed so hard, it helps to compare two things that look similar and bill very differently. A chat completion is close to a single transaction. You send a prompt, the model sends an answer, and you pay roughly once for the input and once for the output. A tool-using agent doesn’t work that way at all. An agent doesn’t answer a question so much as work toward it, and it works by looping. It reasons about the task, calls a tool, reads the result, reasons again, calls another tool, and continues until it decides it’s finished.</p>



<p class="wp-block-paragraph">Every pass through that loop carries a cost that’s easy to miss. In many agent harnesses, each turn carries forward a large share of the accumulated context: prior messages, tool descriptions, retrieved files, and tool results. Even when some of that context is cached, summarized, or pruned, the system is still doing metered work to preserve enough state for the next decision. The final answer you actually wanted is only a thin slice of what you paid for. The loop is the bill.</p>



<p class="wp-block-paragraph">This is why agent cost doesn’t scale politely. It scales with the number of turns, and the number of turns scales with how much discovery the agent has to do, which in turn scales with how vague the request was and how much irrelevant context it’s dragging along. A clean, well-scoped task might finish in three turns, while the same task posed as an open-ended question might wander through 15, each carrying the cost of everything that came before it. Under a flat rate, that difference was invisible. Under usage-based billing, it’s the difference between a small interaction and an expensive one.</p>



<h2 class="wp-block-heading"><strong>Tool design is now part of the cost model</strong></h2>



<p class="wp-block-paragraph">I wrote recently about a <a href="https://www.linkedin.com/pulse/hidden-input-tax-your-mcp-tools-bennie-haelen-eqflc/" target="_blank" rel="noreferrer noopener">hidden tax on Model Context Protocol servers</a>: the way an overstuffed tool catalog quietly degrades a model&#8217;s ability to route to the right tool. Bloated descriptions, overlapping responsibilities, and vague parameters make the model&#8217;s job harder and its choices worse. That argument was about accuracy. The billing change adds a second invoice for the same bloat, and this one is denominated in dollars.</p>



<p class="wp-block-paragraph">The tool catalog is often part of what gets carried through the agent&#8217;s loop. A tool described in three tight sentences and a tool described in three rambling paragraphs may both function, but the second one pays rent in the context window every time an agent has it loaded. Multiply that across a catalog of 40 tools and a workflow that runs a dozen turns, and the cost of verbose tool design stops being a rounding error. Tool design was already a correctness discipline. It’s now a cost discipline as well. The same audit that tightens routing accuracy tightens the bill.</p>



<h2 class="wp-block-heading"><strong>Where prompt discipline runs out</strong></h2>



<p class="wp-block-paragraph">There’s a layer of this that individual users can control, and it’s worth knowing because the savings are real and immediate. Two patterns matter most, and I’ve been handing both to the engineers on a pilot I run for a large healthcare organization. They aren’t magic tricks. They’re ways to keep the agent out of unnecessary discovery loops.</p>



<p class="wp-block-paragraph">The first pattern is about input. Prompt the agent like a short requirement rather than a broad question. A request such as &#8220;look at the encounter data and tell me what you find&#8221; forces the agent into discovery mode, where it burns turns figuring out what you meant, and every one of those turns carries the full context forward. Compare that to a prompt that front-loads the specifics by naming the project and the table, naming the date field to filter on, stating the output shape you want, and calling out anything that should be excluded. A better prompt would be: &#8220;Using the curated clinical project and the silver-zone encounters table, show total encounters by month for calendar year 2025, use admission_date_time for inclusion, and return one row per month ordered chronologically.&#8221; The second prompt collapses the loop. The agent has what it needs on the first turn, so it does the work instead of interviewing you for it.</p>



<p class="wp-block-paragraph">In practice, the difference isn’t just polish. The vague version forces the agent to discover the data model, infer the date semantics, choose an aggregation, and decide on a display format. The specific version turns the task into a bounded query. That difference shows up in accuracy, latency, and cost.</p>



<p class="wp-block-paragraph">The second pattern is about output, and it’s the lever most people overlook. Ask for plain text or Markdown during the intermediate steps, and save rich HTML formatting for the final, confirmed deliverable. Formatted output is expensive to generate, and requirements shift. If you ask for a polished HTML report on the first pass and then change a filter, you pay full output-token freight to regenerate all that layout, often more than once. The cheaper habit is to validate the numbers in text and format only at the end.</p>



<p class="wp-block-paragraph">These patterns work, and they also have a ceiling. Both of them put the entire burden of cost control on the user, and they hold only as long as every user exercises the discipline on every prompt. The day someone reverts to &#8220;tell me what you find,&#8221; the savings evaporate, and the only thing standing between the team and a surprise invoice is a budget cap that reports the overspend after it has already happened.</p>



<h2 class="wp-block-heading"><strong>Cost is a governance problem, not a budgeting one</strong></h2>



<p class="wp-block-paragraph">That fragility is the real lesson. A budget cap is a backstop rather than a control. It will stop a runaway, but it tells you that you overspent rather than why, and it does nothing to make the next run cheaper. Treating cost as a budgeting problem leaves you forever reacting to the meter, while treating it as an architecture problem lets you build the savings in once and stop relying on everyone&#8217;s good behavior.</p>



<p class="wp-block-paragraph">That means the controls that matter belong on the platform rather than in individual prompts. By the platform I don’t mean the agent itself, the coding assistant or chat client a developer drives day-to-day, and I don’t mean the model or a router sitting beneath it. I mean the control plane that sits above the agents, the layer where an organization enforces policy, access, observability, and now cost across every agent and model its developers touch. An administrative console that gives IT visibility into who is doing what and which capabilities they can install is an early, narrow instance of it. A router that sends planning to a cheap model is one feature that belongs there. The platform is where the rules live, and the agent is a consumer of those rules rather than the place you set them. The platform should route models by task, using cheaper models for planning and reserving frontier models for work that earns the price. It should bound the loop, requiring the agent to check in after a fixed number of iterations. It should cap tool-result payloads so a careless query cannot dump a million rows into the context window. It should default intermediate work to plain text, making the cheap path the path of least resistance instead of something users have to remember.</p>



<p class="wp-block-paragraph">Every one of those controls is something a user can approximate by hand and something the platform can simply guarantee. This is the same principle I keep returning to in the context of data access, where safe behavior cannot depend on the person at the keyboard remembering the rules. Prompts guide behavior. Guardrails make the cheaper and safer behavior the default. Cost governance is guardrails as control plane, with a dollar sign attached, enforced at the same layer where you already enforce who is allowed to see which row.</p>



<h2 class="wp-block-heading"><strong>The pattern, not the vendor</strong></h2>



<p class="wp-block-paragraph">It would be a mistake to read this as only a GitHub story. GitHub is the current example because its change is visible and recent, but usage-based billing for agentic work is the direction of travel for many AI tools. The economics under the hood are similar: Agentic workloads turn single answers into loops of model calls, tool calls, and context management. The flat-rate subsidy was always going to come under pressure once the workload shifted from autocomplete to autonomy.</p>



<p class="wp-block-paragraph">The organizations that treat June 1 as a pricing event will optimize a few prompts, grumble, and move on until the next vendor changes its meter. The ones that treat it as an architecture signal will push the cost controls down into the platform, where they hold regardless of which provider is counting which token. That’s the more durable place to stand. The bill didn’t get bigger this month. It got honest, and an honest bill is the kind you can engineer against.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-subsidy-ended-what-tool-using-agents-actually-cost/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Long-Running Agents</title>
		<link>https://www.oreilly.com/radar/long-running-agents/</link>
				<comments>https://www.oreilly.com/radar/long-running-agents/#respond</comments>
				<pubDate>Mon, 08 Jun 2026 15:59:06 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18883</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Long-running-agents-image-created-with-Adobe-Firefly.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Long-running-agents-image-created-with-Adobe-Firefly-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission. A long-running AI agent can keep making progress over hours, days, or weeks. It can do this across many context windows and sandboxes, recover from failure, leave structured artifacts behind, and resume where it left off. For [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on <a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">Addy Osmani’s blog</a> and is being reposted here with the author’s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">A long-running AI agent can keep making progress over hours, days, or weeks. It can do this across many context windows and sandboxes, recover from failure, leave structured artifacts behind, and resume where it left off.</p>



<p class="wp-block-paragraph">For two years the dominant image of an “AI agent” has been a chat window with a clever loop in it. You type a goal; the agent calls some tools; you watch tokens stream by; you stop watching when the work runs out of patience or the context window fills up. That paradigm got us a long way, but it has a ceiling. The model forgets. It declares “task complete” when it isn’t. It reintroduces a bug it fixed nine turns ago. The whole thing is structured around a single sitting.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1375" height="768" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image.jpeg" alt="Long-running AI agents" class="wp-image-18884" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image.jpeg 1375w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-300x168.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-768x429.jpeg 768w" sizes="auto, (max-width: 1375px) 100vw, 1375px" /></figure>



<p class="wp-block-paragraph">Long-running agents are what comes next. The idea is easy to state: an agent that keeps making forward progress on a goal across many sessions and many sandboxes, possibly many days or weeks, while leaving the workspace clean enough that the next session can pick up where the last one left off. The engineering is harder. You have to solve for persistence, recovery, and verification in a way that doesn’t just paper over the cracks. You have to build a state layer that lives outside the model’s context window, and you have to design the handoff between sessions so the agent doesn’t lose its mind when it wakes up and finds itself in a different sandbox with a different context window.</p>



<p class="wp-block-paragraph">This post is my attempt to lay out what’s changed, who’s pushing on it, and how an engineer can use long-running agents today without writing the whole thing from scratch.</p>



<h2 class="wp-block-heading">What “long-running” actually means</h2>



<p class="wp-block-paragraph">“Long-running” used to mean at least three different things in practice, and it helps to keep them separate.</p>



<p class="wp-block-paragraph"><strong>Long-horizon reasoning</strong>. The agent has to plan and execute over many dependent steps. This is mostly a model-quality story: coherence, planning, the ability to recover from a wrong turn 10 steps ago. METR has been tracking this with their <em>time horizon</em> metric, which estimates how long a task a frontier model can complete with 50% reliability. The headline finding is that the metric has been <a href="https://metr.org/time-horizons/" target="_blank" rel="noreferrer noopener">doubling roughly every seven months</a> since 2019, and their <a href="https://metr.org/blog/2026-1-29-time-horizon-1-1/" target="_blank" rel="noreferrer noopener">TH1.1 update</a> earlier this year doubled the count of eight-hour-plus tasks in the eval set. If that curve holds, frontier agents complete tasks at the day scale by 2028 and the year scale by 2034.</p>



<p class="wp-block-paragraph"><strong>Long-running execution</strong>. The agent’s <em>process</em> runs for hours or days. Maybe it’s a coding job, maybe it’s a research sweep, maybe it’s a 24-7 monitoring service. The model might be invoked thousands of times across the run. This is mostly a <em>harness</em> story, and it’s the one this post is mostly about.</p>



<p class="wp-block-paragraph"><strong>Persistent agency</strong>. The agent has an identity that outlives any single task. It accumulates memory, learns user preferences, and is always available. This is the <a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview" target="_blank" rel="noreferrer noopener">Memory Bank</a> flavor of long-running.</p>



<p class="wp-block-paragraph">In practice the three blur together. A real production agent does long-horizon reasoning <em>inside</em> a long-running execution <em>backed by</em> persistent agency. But the engineering problems are different in each, and so are the products that solve them.</p>



<h2 class="wp-block-heading">Why this matters</h2>



<p class="wp-block-paragraph">There are two reasons I believe this work matters a lot right now.</p>



<p class="wp-block-paragraph">The first is a phase change in what’s economically feasible to delegate. An agent that runs for 10 minutes can answer a question, summarize a doc, fix a small bug. An agent that runs for 10 hours can own an entire feature, finish a migration that was on the backlog for six quarters, or do the kind of overnight research sweep that used to require a junior analyst. One of Anthropic’s <a href="https://www.anthropic.com/news/claude-sonnet-4-5" target="_blank" rel="noreferrer noopener">Claude Sonnet announcements</a> put concrete numbers on this last fall: 30+ hours of autonomous coding in internal tests, including <a href="https://venturebeat.com/ai/anthropics-new-claude-can-code-for-30-hours-think-of-it-as-your-ai-coworker" target="_blank" rel="noreferrer noopener">one run</a> that produced an 11,000-line Slack-style app. That’s already past the threshold where the answer to “Should I delegate this?” is no longer obvious.</p>



<p class="wp-block-paragraph">The second is that persistence changes what the agent <em>is</em>. A stateless agent answers your question and disappears. A long-running one accumulates context: which competitor moved which way last week, which test flaked twice on Tuesday, what you usually mean by “the dashboard.” Anthropic’s <a href="https://www.anthropic.com/research/project-vend-1" target="_blank" rel="noreferrer noopener">Project Vend</a> was the most public early demonstration of this. They had a Claude instance run an actual office vending business for a month, managing inventory, setting prices, talking to suppliers. It failed in informative ways, and <a href="https://www.anthropic.com/research/project-vend-2" target="_blank" rel="noreferrer noopener">the second phase</a> ran much better, but the point wasn’t profitability. The point was watching what kinds of weird coherence problems show up when an agent has to maintain identity across weeks instead of turns.</p>



<p class="wp-block-paragraph">Those are the same problems every team building production agents now hits.</p>



<h2 class="wp-block-heading">The three walls every long-running agent hits</h2>



<p class="wp-block-paragraph">Three walls show up in basically every write-up I’ve read this year.</p>



<p class="wp-block-paragraph"><strong>Finite context</strong>. Even a 1M-token window fills. And <a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">context rot</a>, the steady degradation of model performance as the window gets full, kicks in well before the hard limit. A 24-hour run is not going to fit in any context window the field has on its roadmap. Something has to give.</p>



<p class="wp-block-paragraph"><strong>No persistent state</strong>. A new session starts blank. Anthropic’s framing in their <a href="https://www.anthropic.com/research/long-running-Claude" target="_blank" rel="noreferrer noopener">scientific computing post</a> is the cleanest version I’ve seen: “Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Without an explicit persistence story, every shift change is a productivity disaster.</p>



<p class="wp-block-paragraph"><strong>No self-verification</strong>. Models reliably skew positive when they grade their own work. Asked “Are you done?” they answer “yes” more often than they should. Without a separate signal that the work meets a bar, you get the agent that ships at 30% complete with full confidence.</p>



<p class="wp-block-paragraph">Long-running agent designs are mostly answers to these three problems. The major labs have converged on similar shapes of answer, but with very different surface area.</p>



<h2 class="wp-block-heading">The Ralph loop: One of the simpler practitioner versions of long-running agents</h2>



<p class="wp-block-paragraph">The Ralph loop (sometimes called the Ralph Wiggum technique) is one of “simpler” practitioner version of long-running agents, popularized by <a href="https://ghuntley.com/ralph/" target="_blank" rel="noreferrer noopener">Geoffrey Huntley</a> and <a href="https://github.com/snarktank/ralph" target="_blank" rel="noreferrer noopener">Ryan Carson</a>. The reference implementation is <a href="https://ghuntley.com/ralph/" target="_blank" rel="noreferrer noopener">literally a bash script</a> that loops:</p>



<ol class="wp-block-list">
<li>Pick the next unfinished task from a list (prd.json or equivalent).</li>



<li>Build a prompt with the task, the relevant context, and any persistent notes.</li>



<li>Call the agent.</li>



<li>Run tests or other checks.</li>



<li>Append what happened to progress.txt.</li>



<li>Update the task list (done, failed, blocked).</li>



<li>Go back to step 1.</li>
</ol>



<p class="wp-block-paragraph">The reason it works is the same reason any of the harnesses below work: State lives outside the agent’s context. <code>prd.json</code> is the plan, <code>progress.txt</code> is the lab notes, and <code>AGENTS.md</code> is the rolling rulebook. The agent itself is amnesiac, but the filesystem isn’t. Each iteration starts fresh and reads enough state from disk to keep going. Carson’s <a href="https://github.com/snarktank/compound-product" target="_blank" rel="noreferrer noopener">Compound Product</a> extends the idea by chaining multiple loops (an analysis loop that reads daily reports, a planning loop that emits a PRD, an execution loop that writes the code), which is roughly the open source version of the planner-generator-evaluator triad Anthropic landed on independently.</p>



<p class="wp-block-paragraph">I went deeper on all of this in “<a href="https://addyosmani.com/blog/self-improving-agents/" target="_blank" rel="noreferrer noopener">Self-Improving Coding Agents</a>”: task list structure, progress files, QA gates, monitoring, the failure modes you’ll actually hit. The short version is that you can build a working long-running agent in an evening with a bash script and a JSON file. Most of what Google and Anthropic have productized is the work of making this pattern recoverable, secure, and observable at scale.</p>



<p class="wp-block-paragraph">The big-lab stories below are different ways of paying for that production-readiness.</p>



<h2 class="wp-block-heading">Anthropic: Harnesses, then the brain/hands/session split</h2>



<p class="wp-block-paragraph">Anthropic has been the most public about the engineering. Two posts are worth reading end to end.</p>



<p class="wp-block-paragraph">The first is “<a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents" target="_blank" rel="noreferrer noopener">Effective Harnesses for Long-Running Agents</a>,” which lays out a two-agent harness for autonomous full stack development. An initializer agent runs once at the start of a project to set up the environment, expand the prompt into a structured <code>feature-list.json</code>, and write an <code>init.sh</code> that future sessions will run on boot. A coding agent is then woken up over and over, each session asked to make incremental progress on one feature, run tests, leave a <code>claude-progress.txt</code> note, and commit. A test ratchet (“it is unacceptable to remove or edit tests because this could lead to missing or buggy functionality”) sits in the prompt to stop the very common failure of an agent deleting failing tests to “make them pass.” <a href="https://www.infoq.com/news/2026/04/anthropic-three-agent-harness-ai/" target="_blank" rel="noreferrer noopener"><em>InfoQ</em>’s writeup</a> extends this into a planner, generator, and evaluator triad, on the same logic that separating generation from evaluation matters because models grade their own work too generously.</p>



<p class="wp-block-paragraph">The second is “<a href="https://www.anthropic.com/engineering/managed-agents" target="_blank" rel="noreferrer noopener">Scaling Managed Agents: Decoupling the Brain from the Hands</a>,” the architectural post behind <a href="https://platform.claude.com/docs/en/managed-agents/overview" target="_blank" rel="noreferrer noopener">Claude Managed Agents</a> (Anthropic’s hosted runtime, launched in early April). The argument is that an agent has three components that should be independently replaceable. The Brain is the model and the harness loop that calls it. The Hands are sandboxed, ephemeral execution environments where tools actually run. The Session is an append-only event log of every thought, tool call, and observation.</p>



<p class="wp-block-paragraph">This sounds abstract, but it isn’t. Here’s Anthropic’s framing: “Every component in a harness encodes an assumption about what the model can’t do on its own.” When you couple them, an assumption that goes stale (e.g., the model used to need an explicit planner and now plans natively) means the whole system has to change at once. When you decouple them, the harness becomes stateless, sandboxes become <em>cattle, not pets</em>, and a brain crash doesn’t lose the run. A fresh container calls <code>wake(sessionId)</code> and reconstitutes the state from the log. They reported <a href="https://www.anthropic.com/engineering/managed-agents" target="_blank" rel="noreferrer noopener">time-to-first-token dropped ~60% at p50 and over 90% at p95</a> just from being able to start inference before the sandbox is ready.</p>



<p class="wp-block-paragraph">The session-as-event-log idea is the part most teams underappreciate. It is what makes a long-running agent recoverable. Without it, a container failure is a session failure and you’re debugging into a stale snapshot. With it, the agent’s memory is a queryable artifact that lives outside whatever process happens to be running at the moment.</p>



<p class="wp-block-paragraph">For the scientific computing crowd, Anthropic’s “<a href="https://www.anthropic.com/research/long-running-Claude" target="_blank" rel="noreferrer noopener">long-running Claude</a>” post reduces all of this to a simpler stack: <code>CLAUDE.md</code> as a living plan the agent edits as it learns, <code>CHANGELOG.md</code> as portable lab notes, <code>tmux</code> plus <code>SLURM</code> plus <code>git</code> as the execution and coordination layer, and the Ralph loop, a <code>for</code> loop that kicks the agent back into context whenever it claims completion and asks if it’s <em>really</em> done. Their flagship case study is a Boltzmann solver Claude Opus 4.6 built over a few days that reached subpercent agreement with a reference CLASS implementation. Months to years of researcher time, compressed.</p>



<p class="wp-block-paragraph">Same patterns across all three posts: an explicit plan file, an explicit progress file, structured handoffs between sessions, separate generation from evaluation, and a loop that refuses to let the agent stop early.</p>



<h2 class="wp-block-heading">Cursor: Planners, workers, judges</h2>



<p class="wp-block-paragraph">Cursor’s “<a href="https://cursor.com/blog/scaling-agents" target="_blank" rel="noreferrer noopener">Scaling Long-Running Autonomous Coding</a>” is the other essential read this year. They walked into walls that Anthropic mostly papered over.</p>



<p class="wp-block-paragraph">Their first attempt was a flat coordination model: equal-status agents writing to shared files with locks. It became a bottleneck and made the agents risk averse, churning rather than committing. Their second attempt swapped locks for optimistic concurrency control, which removed the bottleneck but didn’t fix the coordination problem. The third design is what’s running in production now and what they describe as solving most of the problem:</p>



<ul class="wp-block-list">
<li>Planners continuously explore the codebase and emit tasks. They can recursively spawn subplanners.</li>



<li>Workers are focused executors. They don’t coordinate with each other and they don’t worry about the big picture.</li>



<li>Judges decide when an iteration is finished and when to restart.</li>
</ul>



<p class="wp-block-paragraph">Two things stand out from the post. One: “A surprising amount of the system’s behavior comes down to how we prompt the agents” more than the harness or the model. Two: Different models slot into different roles. Their reported finding is that a GPT model was better than Opus for <em>extended autonomous work</em> specifically because Opus tended to stop early and take shortcuts. Same task, different role, different model. The matching is becoming part of the design surface.</p>



<p class="wp-block-paragraph">This pairs with <a href="https://cursor.com/blog/composer" target="_blank" rel="noreferrer noopener">Composer 2</a> (their proprietary frontier coding model that ships in <a href="https://cursor.com/changelog/2-0" target="_blank" rel="noreferrer noopener">Cursor 3</a>) and their background cloud agents: long-running tasks that run on Anysphere’s cloud infrastructure rather than your laptop. Eight-hour refactors and codebase-wide migrations survive a closed lid. You can start a task locally, hit <em>run in cloud</em> when you realize it’ll take 30 minutes, and reattach later from your phone. Each agent runs in an isolated Git worktree and merges back via PR. The handoff between local and remote is the part most teams haven’t figured out yet, and Cursor’s bet is that it has to be its own product surface.</p>



<p class="wp-block-paragraph">The shape ends up close to Anthropic’s: Roles are split, sessions are durable, judges sit beside the worker, and a long task runs in a cloud sandbox with Git as the coordination substrate.</p>



<h2 class="wp-block-heading">Google: Long-running agents on the Agent Platform</h2>



<p class="wp-block-paragraph">Google’s announcement at <a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform" target="_blank" rel="noreferrer noopener">Cloud Next ’26</a> folded Vertex AI into the Gemini Enterprise Agent Platform and turned long-running agents into a named product, with named SLAs.</p>



<p class="wp-block-paragraph">The pieces that matter for this post:</p>



<ul class="wp-block-list">
<li>Agent Runtime supports agents that “run autonomously for days at a time” with sub-second cold starts and on-demand sandbox provisioning. The launch post’s example use case is a sales prospecting sequence that takes a week to play out, which is roughly the right shape for it.</li>



<li>Agent Sessions persist conversation and event history. You can pin them to a custom session ID that maps to your own CRM or DB record, so the agent’s state lives next to the business state instead of in a separate AI silo.</li>



<li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/memory-bank">Agent Memory Bank</a> is the persistent long-term memory layer, generally available as of Next ’26. It curates memories from sessions, scopes them to a user identity, and exposes a search API so the next agent invocation can pull what’s relevant. Payhawk reported that auto-submitting expenses through a Memory Bank-backed agent cut submission time by over 50%.</li>



<li>Agent Sandbox handles hardened code execution.</li>



<li>Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability, and Agent Simulation cover basically every operational concern you’d otherwise build by hand for a production fleet, including the cryptographic-identity-and-audit-log story enterprises actually need to ship.</li>
</ul>



<p class="wp-block-paragraph">Architecturally this is the same brain/hands/session split Anthropic described, just productized at platform scale and bundled with <a href="https://google.github.io/adk-docs/" target="_blank" rel="noreferrer noopener">ADK</a> (the code-first dev kit) and Agent Studio (the visual one). If you’re building inside Google Cloud, you don’t have to design a session log or a memory store from scratch anymore. You wire an ADK agent into Memory Bank and Sessions, deploy onto Agent Runtime, and the persistence question is answered.</p>



<p class="wp-block-paragraph">Notice how much this looks like the pattern Anthropic and Cursor describe, just unbundled into named services with SLAs. Three years ago you’d have built all of this yourself. Now you pick which version of “decoupled brain, hands, and session” you want to rent.</p>



<h2 class="wp-block-heading">Five patterns for long-running agents in production</h2>



<p class="wp-block-paragraph">Shubham Saboo and I <a href="https://x.com/GoogleCloudTech/status/2046989964077146490" target="_blank" rel="noreferrer noopener">wrote up</a> five design patterns we’ve seen separate working long-running agents from demos. They aren’t Google-specific, but they map cleanly onto the primitives Agent Runtime now exposes, so it’s worth walking through them here in shortened form.</p>



<p class="wp-block-paragraph"><strong>Checkpoint-and-resume</strong>. The most common multiday failure is context loss. An agent processes 200 documents over four hours, hits an error on document 201, and without a checkpoint you start from scratch. Treat the agent like a long-running server process: write intermediate state to disk, checkpoint every N units of work, recover from failures. The Agent Runtime sandbox gives you a persistent filesystem, but choosing the right checkpoint granularity (not every step, not only the end) is on you.</p>



<p class="wp-block-paragraph"><strong>Delegated approval (human-in-the-loop)</strong>. Most “human-in-the-loop” implementations are: serialize state to JSON, fire a webhook, hope someone responds. The state goes stale, the notification gets buried, the agent re-deserializes into a slightly different world. Long-running runtimes let the agent pause in place with full execution state intact: reasoning chain, working memory, tool history, pending action. Hours of human time pass, the agent consumes zero compute, and it resumes with subsecond latency. Mission Control is Google’s inbox for this. The pattern works regardless of vendor.</p>



<p class="wp-block-paragraph"><strong>Memory-layered context</strong>. A seven-day agent needs more than session state. Memory Bank handles long-term curated memory, Memory Profiles add low-latency lookups, and the failure mode you’ll hit in production is memory drift: The agent learns a procedural shortcut from a few atypical interactions and starts applying it broadly. Govern memory like you govern microservices. Agent Identity controls who can read and write which banks. Agent Registry tracks which version of which agent is running. Agent Gateway enforces policy on the wire. The auditing question stops being “What are my agents doing?” and becomes “What are my agents remembering, and how is that changing their behavior?”</p>



<p class="wp-block-paragraph"><strong>Ambient processing</strong>. Not every long-running agent talks to a human. Some sit on a Pub/Sub stream or a BigQuery table and act on events as they arrive: content moderation, anomaly detection, inbox triage. The architectural decision worth making early is to not hardcode policy into the agent. Define it in the Gateway and the fleet picks up policy changes without redeploys. Ambient agents run unsupervised for long stretches, and the only sane way to update a hundred of them is to update the policy layer once.</p>



<p class="wp-block-paragraph"><strong>Fleet orchestration</strong>. In real systems, you rarely have one agent. A coordinator delegates subtasks to specialists (a Lead Researcher Agent, a Scoring Agent, an Outreach Agent), each running independently for different durations. Each specialist gets its own Identity (so the Outreach Agent can’t read financial data meant for Scoring), its own policy enforcement, its own Registry entry. This is the same coordinator/worker shape distributed systems have used for decades. What’s new is that ADK handles it declaratively with graph-based workflows, and a bad deployment in one specialist doesn’t cascade to the others.</p>



<p class="wp-block-paragraph">The patterns compose. A compliance system might use checkpointing for document processing, delegated approval for review gates, memory layering for cross-session knowledge, and fleet orchestration to coordinate the specialists. The opening question is always the same: What’s the longest uninterrupted unit of work your agent needs to perform? Minutes, and you don’t need long-running agents. Hours or days, and these patterns are where to start. The <a href="https://x.com/GoogleCloudTech/status/2046989964077146490" target="_blank" rel="noreferrer noopener">full write-up with code samples</a> covers each pattern in depth.</p>



<h2 class="wp-block-heading">So how do you actually build one today?</h2>



<p class="wp-block-paragraph">This is the practical question, and it has a different answer depending on what you’re building.</p>



<p class="wp-block-paragraph"><strong>You’re a developer who wants long-running coding work on your own repo</strong>. Just use <a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">Claude Code</a> (or Antigravity, Cursor, or Codex). The harness is already there. Treat your <code>AGENTS.md</code> like a pilot’s checklist: short, every line earned by a real failure. Add hooks for typecheck and lint that surface failures back to the agent. Write a plan file before the agent starts. Use <a href="https://addyosmani.com/blog/self-improving-agents/" target="_blank" rel="noreferrer noopener">the Ralph loop</a> when the agent claims it’s done and you don’t believe it. For multihour or overnight jobs, run in a worktree so a closed laptop doesn’t kill the run, and have it commit progress every meaningful unit of work. This is the path most people should take, and it’s where the most leverage is right now.</p>



<p class="wp-block-paragraph"><strong>You’re building a hosted agent product</strong>. Don’t build the runtime. Pick a managed one. The three real options today: <a href="https://cloud.google.com/products/gemini-enterprise-agent-platform" target="_blank" rel="noreferrer noopener">Google’s Agent Platform</a> (Agent Engine + Memory Bank + Sessions), <a href="https://platform.claude.com/docs/en/managed-agents/overview" target="_blank" rel="noreferrer noopener">Claude Managed Agents</a>, or roll something on top of <a href="https://google.github.io/adk-docs/">ADK</a>, the <a href="https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk" target="_blank" rel="noreferrer noopener">Claude Agent SDK</a>, or <a href="https://platform.openai.com/docs/codex" target="_blank" rel="noreferrer noopener">Codex SDK</a> and host it yourself. The trade-off is the usual one. Managed gets you the brain/hands/session split, observability, identity, and an audit trail out of the box. Self-hosted gets you control and the ability to use weird models for weird roles (Cursor’s pattern). For most teams, the right starting point is a managed runtime plus your own ADK or SDK code for the actual loop.</p>



<p class="wp-block-paragraph"><strong>You’re doing something autonomous and operational (monitoring, research, ops)</strong>. Memory Bank-style persistence is what you want, and it’s the part that doesn’t exist in Claude Code. ADK + Memory Bank + Cloud Run + Cloud Scheduler is the cleanest stack I’ve seen for “agent runs every N hours, accumulates state, alerts on a threshold.” This is also where Cursor’s planner/worker/judge split starts to matter more than it does for IDE coding, because the work is genuinely parallel and the failure modes are different.</p>



<p class="wp-block-paragraph">A few things matter regardless of which path you take.</p>



<p class="wp-block-paragraph"><em>Write down the done condition before the agent starts.</em> This is the single highest-leverage move for long runs. The Anthropic harness post calls it the feature list; Cursor calls it the planner’s task spec. Either way, it’s an external file with explicit, testable completion criteria, and it exists so the agent can’t quietly redefine <em>done</em> midrun.</p>



<p class="wp-block-paragraph"><em>Separate the evaluator from the generator.</em> Self-grading is the failure mode. A planner/worker/judge pipeline, or a generator/evaluator pair, is a real architectural pattern, not a stylistic preference. Even if it’s the same model in different roles with different prompts.</p>



<p class="wp-block-paragraph"><em>Invest in the session log, not just the prompt.</em> The append-only event log is what makes the agent recoverable, debuggable, and auditable. If you can’t reconstruct what the agent did in the last 24 hours from durable storage, what you have is a long-running shell script that happens to call an LLM, not a long-running agent.</p>



<p class="wp-block-paragraph"><em>Treat compaction and context resets as first class.</em> Anthropic is explicit that summarization-as-compaction wasn’t enough for very long jobs; they had to do full context resets where the harness tears the session down and rebuilds it from a structured handoff file. It is essentially how humans onboard a new engineer.</p>



<h2 class="wp-block-heading">There are some real limitations right now</h2>



<p class="wp-block-paragraph">A few things are still genuinely unsolved.</p>



<p class="wp-block-paragraph"><strong>Cost</strong>. A 24-hour run with a frontier model and a few tools is not cheap. Without budgets, circuit breakers, and a hard cap on tool spend, an agent can quietly burn through a week’s API budget in an afternoon. This is solvable, but it’s an explicit step you have to take.</p>



<p class="wp-block-paragraph"><strong>Security</strong>. A long-running agent with API keys, cloud access, and the ability to run shell commands has a much larger attack surface than a chat session. The brain/hands separation pattern matters here too: Credentials should be unreachable from the sandbox where model-generated code runs, which is one of the benefits Anthropic calls out for Managed Agents.</p>



<p class="wp-block-paragraph"><strong>Alignment drift</strong>. Over many context windows, agents drift. The original goal gets summarized, then resummarized, then loses fidelity. This is the part hooks and judges exist to defend against. It is also the most common reason “the agent went off and did something I didn’t ask for.”</p>



<p class="wp-block-paragraph"><strong>Verification</strong>. Auditing 24 hours of autonomous activity is a real human-time problem. Observability and structured artifacts (PRs, commits, briefings, test runs) are how you make this tractable. Without them, you’re scrolling logs and you’ll miss what matters.</p>



<p class="wp-block-paragraph"><strong>The human role</strong>. This is the one I keep coming back to. Defining work crisply enough that an agent can run for a day on it is harder than doing the work yourself. The skill that’s appreciating in value isn’t writing code. It’s writing specs that survive contact with an autonomous executor.</p>



<h2 class="wp-block-heading">Where this is going</h2>



<p class="wp-block-paragraph">Google, Anthropic, and Cursor have converged on roughly the same shape. Separate the model loop from the execution sandbox from the durable session log. Split planning from generation from evaluation. Bake in compaction, hooks, and context resets. Expose memory as a managed service that any agent invocation can query.</p>



<p class="wp-block-paragraph">Surface area is what differs. Google’s Agent Platform is the enterprise-stack version, with the identity and audit trail story baked in. The patterns underneath are the same. Claude Managed Agents is “Anthropic’s harness, hosted.” Cursor’s background agents are “long-running coding, pulled out of the IDE and into the cloud.”</p>



<p class="wp-block-paragraph">The harder problems for the next year aren’t in any of those layers individually. They’re in the coordination above them. Many long-running agents on a shared codebase. Agents that read their own traces and patch their own harnesses. Harnesses that assemble tools and context just in time for a task instead of being preconfigured at startup. That’s where the agent stops looking like a smarter chat window and starts looking like a colleague who’s been on the project longer than you have.</p>



<p class="wp-block-paragraph">The model is still load-bearing. But the gap between a chat window and an agent you can leave running overnight is mostly in the state, sessions, and structured handoffs wrapped around it. That’s where I’d spend my learning time right now.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/long-running-agents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The AI Agents Stack (2026 Edition)</title>
		<link>https://www.oreilly.com/radar/the-ai-agents-stack-2026-edition/</link>
				<comments>https://www.oreilly.com/radar/the-ai-agents-stack-2026-edition/#respond</comments>
				<pubDate>Mon, 08 Jun 2026 10:56:59 +0000</pubDate>
					<dc:creator><![CDATA[Paolo Perrone]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18870</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-AI-agents-stack.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-AI-agents-stack-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Six layers between your LLM and a production agent]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Paolo Perrone’s The AI Engineer Substack and is being reposted here with the author’s permission. Your team picks LangGraph for a customer support chatbot. Three weeks in, you&#8217;ve got 14 nodes in a state graph, a custom checkpointer writing to Redis, and retry logic for tool calls that fail [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on </em><a href="https://theaiengineer.substack.com/p/the-ai-agents-stack-2026-edition" target="_blank" rel="noreferrer noopener"><em>Paolo Perrone’s </em>The AI Engineer<em> Substack</em></a><em> and is being reposted here with the author’s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">Your team picks LangGraph for a customer support chatbot. Three weeks in, you&#8217;ve got 14 nodes in a state graph, a custom checkpointer writing to Redis, and retry logic for tool calls that fail once a week. The agent answers refund questions. It calls one API. A 50-line script on the OpenAI SDK with two MCP servers would have done the same thing. But nobody mapped which layers the problem actually needed.</p>



<p class="wp-block-paragraph">In November 2024, Letta published an <a href="https://www.letta.com/blog/ai-agents-stack" target="_blank" rel="noreferrer noopener">AI agents stack diagram</a> that became the default reference for half the engineering teams I talk to. If you&#8217;ve seen a &#8220;layers of an agent&#8221; visual on LinkedIn or pinned in a Slack channel, it probably traces back to that article.</p>



<p class="wp-block-paragraph">That diagram is 14 months old now, and a lot has changed since. MCP didn&#8217;t exist yet. Memory was still treated as a subset of your vector database. Nobody was shipping provider-native agent SDKs. Eval wasn&#8217;t even on the map. The stack has six layers in 2026, and at least three of them didn&#8217;t exist as distinct categories when Letta drew the original.</p>



<p class="wp-block-paragraph">So we drew it from scratch. This is the 2026 version.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="639" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-3.png" alt="The minimum viable agent stack in 2026" class="wp-image-18871" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-3.png 700w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-3-300x274.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<h2 class="wp-block-heading"><strong>TL;DR</strong></h2>



<p class="wp-block-paragraph">That&#8217;s the starting stack. Add complexity when something specific breaks, not before.</p>



<h2 class="wp-block-heading"><strong>What are we even mapping?</strong></h2>



<p class="wp-block-paragraph">Before the stack, there was a loop. In “<a href="https://theaiengineer.substack.com/p/what-is-an-ai-agent" target="_blank" rel="noreferrer noopener">What Is an AI Agent?</a>,” we defined an agent as the think-act-observe cycle: The model reasons about a task, takes an action (calls a tool, writes to memory), observes the result, and loops until the task is done. That loop is the atomic unit. Everything in this issue is infrastructure that makes that loop work reliably, at scale, in production.</p>



<p class="wp-block-paragraph">The agent stack is not the LLM stack. A chatbot needs inference and maybe RAG. An agent needs state management across multistep execution, tool access governed by protocols, memory that persists across sessions, autonomous reasoning loops, and guardrails that constrain behavior in real time. That&#8217;s a fundamentally different set of infrastructure problems.</p>



<p class="wp-block-paragraph">We&#8217;re mapping the six layers between your LLM and a production agent. We&#8217;re not covering training infrastructure, data pipelines, or model fine-tuning. Those are adjacent stacks. We covered RAG in depth in <a href="https://theaiengineer.substack.com/p/what-is-rag-retrieval-augmented-generation" target="_blank" rel="noreferrer noopener">Issue #5</a>. Today we’re zooming out to show where RAG fits in the bigger picture.</p>



<p class="wp-block-paragraph">Three things redrew the map between 2024 and 2026. MCP standardized tool connectivity, and the entire tools layer is new because of it. Reasoning models changed what agents can do autonomously, with single-call agents replacing some multistep chains. And memory became a first-class architectural primitive, not an afterthought bolted onto a vector database.</p>



<h3 class="wp-block-heading"><strong>How to evaluate each layer</strong></h3>



<p class="wp-block-paragraph">When choosing tools at each layer, ask three questions. <em>How much state do you need to manage?</em> A stateless tool caller and a multi-session agent that learns over time are different engineering problems, and the layers where state management is hardest (memory, frameworks) are where most teams get stuck. <em>How much vendor lock-in can you tolerate?</em> MCP is an open standard, provider SDKs are not, and every tool choice either increases or decreases how painful your next migration will be. <em>And how hard is it to go from demo to production?</em> Some layers (model serving) have almost no gap, while others (eval, guardrails) have a massive one. The layer where you feel that gap most is the one to invest in first.</p>



<p class="wp-block-paragraph">We take each layer from the bottom up, starting with the most stable and ending with the least mature.</p>



<h2 class="wp-block-heading"><strong>Layer 1: Models and inference</strong></h2>



<p class="wp-block-paragraph"><em>How you run the model that powers your agent: call an API, use a managed open weight provider, or self-host.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="305" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-4.png" alt="Models &amp; inference: key players" class="wp-image-18872" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-4.png 700w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-4-300x131.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<p class="wp-block-paragraph">The inference layer changed more in tone than in substance. Reasoning models like o1, o3, DeepSeek R1, and Claude with extended thinking shifted what agents can plan and execute. Agents that previously needed multistep chains can now solve problems in a single reasoning call. Open weight models like Llama 3.3, DeepSeek V3, and Qwen 2.5 closed the quality gap dramatically, so &#8220;always use the biggest closed model&#8221; is no longer default advice. The emerging pattern is to prototype on closed source and deploy on open weight.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">The honest take: This layer is commoditizing. Model differences matter less each quarter. The real decision is the cost and latency trade-off, not which model is &#8220;smartest.&#8221;</p>
</blockquote>



<p class="wp-block-paragraph">On the evaluation side, API calls are stateless. Send a request, get a response. Nothing to manage. Lock-in risk runs high for closed APIs because each model reasons differently, so switching providers means retuning prompts, adjusting for different failure modes, and retesting your eval suite. It&#8217;s low for open weight, where you can swap the model and keep the infra. The prototype-to-production gap is the smallest of any layer. Your demo API call is the same as your production API call.</p>



<p class="wp-block-paragraph">Self-host when your agent call volume makes API pricing untenable or when you need sub-100ms latency that API round-trips can&#8217;t deliver.</p>



<h2 class="wp-block-heading"><strong>Layer 2: Protocols and tools</strong></h2>



<p class="wp-block-paragraph"><em>How your agent calls external tools and APIs: through MCP servers, browser automation, or agent-to-agent protocols.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="336" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-5.png" alt="Protocols &amp; tools: key players" class="wp-image-18873" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-5.png 700w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-5-300x144.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<p class="wp-block-paragraph">This layer didn&#8217;t exist as a distinct category in 2024. Every framework had its own JSON schema for tool definitions. Now MCP is the standard, with 97M monthly SDK downloads, adoption by OpenAI, Google, and Microsoft, and a donation to the Linux Foundation.</p>



<p class="wp-block-paragraph">Browser Use exploded in parallel, hitting 78K GitHub stars in under a year. Nobody was shipping browser agents in production in 2024. And agents can now talk to other agents. IBM launched ACP, and Google launched A2A. Neither is standard yet, but the problem they solve (agents coordinating with other agents) is real and growing.</p>



<p class="wp-block-paragraph">Security is the open problem. Endor Labs <a href="https://www.endorlabs.com/learn/classic-vulnerabilities-meet-ai-infrastructure-why-mcp-needs-appsec" target="_blank" rel="noreferrer noopener">analyzed 2,614 MCP servers</a> and found 82% prone to path traversal and 67% to code injection.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">The honest take: The protocol debate is over. MCP won. The only question left is how you lock down your MCP servers before someone exploits them.</p>
</blockquote>



<p class="wp-block-paragraph">State management is nonexistent here. Your agent calls a tool, gets a response, done. No session, no memory between calls. Lock-in risk is low because MCP is an open standard, so if you build MCP servers, any MCP-compatible agent can use them. The prototype-to-production gap is medium. Your demo MCP server works until someone sends a malicious tool description. Security and governance are the gap.</p>



<p class="wp-block-paragraph">MCP standardized how agents use tools. It says nothing about how agents talk to each other. ACP and A2A are trying to solve that, but neither has reached critical mass. If you need multi-agent coordination today, you&#8217;re building it yourself at the framework layer. We covered MCP in depth in <a href="https://theaiengineer.substack.com/p/what-is-mcp" target="_blank" rel="noreferrer noopener">Issue #4</a>.</p>



<h2 class="wp-block-heading"><strong>Layer 3: Memory and knowledge</strong></h2>



<p class="wp-block-paragraph"><em>How your agent stores and retrieves what it knows: in-context state, vector search, or persistent memory across sessions.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="288" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-6.png" alt="Memory &amp; knowledge: key players" class="wp-image-18874" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-6.png 700w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-6-300x123.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<p class="wp-block-paragraph">All three tiers feed into the same place: The context window your agent sees on every call.</p>



<p class="wp-block-paragraph">In 2024, memory meant &#8220;pick a vector database and do RAG.&#8221; In 2026, memory is a first-class architectural primitive with three distinct tiers. Context windows got massive. Gemini hit 1M+ tokens, Claude 200K. Bigger windows didn&#8217;t kill the need for memory. They changed the trade-off: What do you stuff in-context versus what do you retrieve on demand?</p>



<p class="wp-block-paragraph">&#8220;Context engineering&#8221; replaced &#8220;prompt engineering&#8221; as the core discipline. Instead of writing a better prompt, you architect what information the agent sees on every call. Memory blocks appeared as named, structured fields in the context window that the agent can read and overwrite every turn. Instead of dumping everything into the system prompt, the agent manages its own state: what to keep, what to update, what to drop.</p>



<p class="wp-block-paragraph">On the infrastructure side, pgvector became the default for teams that don&#8217;t need a dedicated vector database. It&#8217;s just Postgres with an extension. GraphRAG emerged as a second retrieval option: follow relationships between entities instead of matching embeddings, with Neo4j leading this space. Sleep-time compute, where agents process information during idle time, is research stage but signals where tier 3 is heading.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">The honest take: Most teams overcomplicate memory. Start with conversation history in Postgres and a structured system prompt. Add vector search when your history exceeds context limits. Add agentic memory management only when your agent needs to learn across sessions.</p>
</blockquote>



<p class="wp-block-paragraph">This IS the state layer. You&#8217;re deciding what your agent remembers, how it retrieves it, and when it forgets. Highest complexity in the stack. Lock-in risk is medium. pgvector is portable because it&#8217;s just Postgres, while specialized tools like Mem0 or Zep are harder to migrate away from. The prototype-to-production gap is large. Demo memory works because context windows are big enough. Production memory breaks when conversations get long and your agent starts forgetting the important parts.</p>



<p class="wp-block-paragraph">In-context memory breaks down when agents need to share memory across instances or maintain state across model provider switches. That&#8217;s where dedicated memory infrastructure like Letta, Zep, and Mem0 earns its keep.</p>



<h2 class="wp-block-heading"><strong>Layer 4: Frameworks and SDKs</strong></h2>



<p class="wp-block-paragraph"><em>How you wire together the model calls, tool use, and control flow that make your agent work: a provider&#8217;s built-in toolkit (SDK), a graph-based framework like LangGraph, or raw code.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="384" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-7.png" alt="Frameworks &amp; SDKs: key players" class="wp-image-18875" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-7.png 700w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-7-300x165.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<p class="wp-block-paragraph">Every major AI lab now ships its own agent SDK. OpenAI has the Agents SDK (evolved from Swarm). Google released ADK. Microsoft has Semantic Kernel and AutoGen. Hugging Face built smolagents. Two years ago, LangChain was the only game. Now you pick between three camps: provider SDKs that are fast to start but locked to one model, graph-based frameworks like LangGraph that are portable but require more setup, or no framework at all. That choice didn&#8217;t exist in 2024.</p>



<p class="wp-block-paragraph">LangGraph solidified as the graph-based orchestration leader with v1.0 released October 2025 and production deployments at Uber, JPMorgan, LinkedIn, and Klarna. LangChain agents are now built on LangGraph under the hood. Meanwhile, the &#8220;build it yourself&#8221; camp grew. Teams that tried LangChain in 2024 and fought the abstraction are now writing thin wrappers over provider APIs + MCP. No framework means full control. This works until your agent needs state management or complex branching.</p>



<p class="wp-block-paragraph">A quick note on naming: &#8220;LangChain&#8221; and &#8220;LangGraph&#8221; are not the same thing. LangChain is the integration layer handling model connectors, tool calling, and prompt templates. LangGraph is the orchestration engine managing state, control flow, and graphs. Most production teams use both together, but LangGraph is where the agent logic lives.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">The honest take: Most teams pick too much framework. If your agent calls a model and a few tools, you don&#8217;t need LangGraph. A provider SDK and a couple of tool calls will get you to production faster than any graph.</p>
</blockquote>



<p class="wp-block-paragraph">Provider SDKs manage state for you. LangGraph makes you define every state transition explicitly. Build-it-yourself means you roll your own. Lock-in risk is the highest in the stack. Your orchestration code doesn&#8217;t port. A LangGraph agent rewritten for CrewAI is a new codebase. Provider SDKs are worse because you&#8217;re locked to one model too. The prototype-to-production gap is large. Demo works because nothing goes wrong. Production means handling tool failures, retries, timeouts, and humans who need to approve before the agent acts.</p>



<p class="wp-block-paragraph">The framework you pick determines your migration cost. Provider SDKs are fastest to start but lock you to one model. LangGraph is portable but complex. Building your own gives you full control until your agent outgrows your wrapper. MCP is the one layer that transfers across all three camps.</p>



<h2 class="wp-block-heading"><strong>Layer 5: Eval and observability</strong></h2>



<p class="wp-block-paragraph"><em>How you measure whether your agent is doing its job: tracing runs, scoring outputs, and catching regressions before users do.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="336" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-8.png" alt="Eval &amp; observability: key players" class="wp-image-18876" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-8.png 700w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-8-300x144.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<p class="wp-block-paragraph">This layer barely existed in 2024. Now it&#8217;s the gap. <a href="https://www.langchain.com/state-of-agent-engineering" target="_blank" rel="noreferrer noopener">LangChain&#8217;s State of Agent Engineering</a> survey found 89% of teams with production agents have implemented observability, but only 52% have evals. That 37-point gap is where production quality dies.</p>



<p class="wp-block-paragraph">&#8220;Evaluation as infrastructure&#8221; is converging on three tiers: fast checks on every PR (Did the agent call the right tools?), nightly regression suites that use an LLM to judge output quality, and continuous production monitoring that alerts when agent performance drifts. New agent-specific benchmarks have emerged too, including Context-Bench for memory management, Recovery-Bench for error recovery, and Terminal-Bench for coding agents.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">The honest take: Most teams skip eval until something breaks in production. By then they&#8217;re debugging blind. The teams that don&#8217;t have this problem built evals before they deployed.</p>
</blockquote>



<p class="wp-block-paragraph">State management matters here because your agent runs 12 steps, step 3 picked the wrong tool, and steps 4–12 were doomed from there. If your eval only checks the final output, you&#8217;ll never know why. Lock-in risk is moderate. Most tools export OpenTelemetry traces, so switching observability providers is doable, but switching eval frameworks means rebuilding your test suites. The prototype-to-production gap is the biggest of any layer. Most prototypes have zero eval. You don&#8217;t feel the pain until production users find the failures for you.</p>



<p class="wp-block-paragraph">Current eval tools are strongest for single-turn and tool-calling evaluation. Multi-agent evaluation, long-horizon task assessment, and evaluating agents that learn over time are all unsolved problems. If your agent does any of those, you&#8217;ll need custom eval infrastructure beyond what the platforms offer today.</p>



<h2 class="wp-block-heading"><strong>Layer 6: Guardrails and safety</strong></h2>



<p class="wp-block-paragraph"><em>How you stop your agent from doing things it shouldn&#8217;t: filtering inputs, authorizing tool calls, and validating outputs.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="336" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-9.png" alt="Guardrails &amp; safety: key players" class="wp-image-18877" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-9.png 700w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-9-300x144.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<p class="wp-block-paragraph">Agent guardrails became a separate discipline from LLM guardrails. In 2024, guardrails meant input/output filters on a model. In 2026, your agent calls tools, spends money, and takes actions. Guardrails now means authorizing tool calls, enforcing rate limits, and validating what the agent actually did.</p>



<p class="wp-block-paragraph">The &#8220;guardrails before action&#8221; pattern emerged from teams that learned the hard way. They now enforce authorization at the tool execution layer, not the output layer. By the time you filter the response, the agent already sent the email. OWASP published the MCP Top 10 (beta), which is the first real security checklist for tool-connected agents. Deployment is still DIY. LangGraph Cloud and Bedrock Agents exist, but most production teams are still deploying with FastAPI and their own infra. This layer is where you&#8217;ll spend the most unplanned engineering time.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">The honest take: This is the least mature layer in the stack. No dominant framework, no established patterns. You&#8217;re writing policy code from scratch.</p>
</blockquote>



<p class="wp-block-paragraph">Guardrails need to know what the agent is doing right now to decide what it shouldn&#8217;t do next. That means tracking agent state in real time. Lock-in risk is low because most guardrails are custom policy code you write yourself. NeMo Guardrails is the closest thing to a framework, but you&#8217;ll still write most rules from scratch. The prototype-to-production gap is effectively infinite. Your demo has no guardrails because nobody&#8217;s trying to break it. Production will.</p>



<p class="wp-block-paragraph">Current guardrails tools focus on single-agent systems. If you&#8217;re running multi-agent workflows where agents delegate to each other, guardrail propagation across agent boundaries is an unsolved problem. You&#8217;ll need custom authorization logic.</p>



<h2 class="wp-block-heading"><strong>What are you building?</strong></h2>



<p class="wp-block-paragraph">This is the decision that cuts through the framework confusion. The agent type determines which layers you invest in and which tools to pick at each one.</p>



<p class="wp-block-paragraph">A <strong>stateless tool caller</strong> answers questions from a knowledge base, looks up an order, or checks inventory. You need a provider SDK, MCP, and Postgres. No framework, no vector database. This is a weekend project.</p>



<p class="wp-block-paragraph">A <strong>multistep workflow</strong> processes a refund end to end, reviews a PR across five files, or triages and routes support tickets. Steps depend on each other, things fail in the middle, and humans need to approve before the agent acts. You need LangGraph, MCP, and eval. Build evals before you deploy because these agents break silently.</p>



<p class="wp-block-paragraph">An <strong>agent that learns</strong> remembers your preferences across sessions, gets better at your codebase over time, or tracks project context across weeks. You need a memory-first architecture, a vector DB, and eval. Orchestration is the easy part. The hard part is deciding what to remember, what gets dropped, and how you stop old context from polluting new answers.</p>



<p class="wp-block-paragraph">A <strong>multi-agent system</strong> has agents that delegate to other agents, split a research task across specialists, or run parallel workstreams. You need the full stack. Two agents passing context to each other is already hard to debug. Five is impossible without trace-level evals on every handoff. Build eval infrastructure before you build the second agent.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="867" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-10.png" alt="Pick your stack" class="wp-image-18878" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-10.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-10-300x186.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-10-768x476.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<h2 class="wp-block-heading"><strong>Coding agents: All 6 layers in action</strong></h2>



<p class="wp-block-paragraph">Coding agents like Cursor, Claude Code, Codex, and Windsurf are the most proven application of the AI agents stack. All six layers, working together.</p>



<p class="wp-block-paragraph">At the inference layer, these tools serve hundreds of millions of daily requests. Cursor routes between Claude, GPT-4, and its own fine-tuned models depending on the task. At the protocols layer, MCP servers connect to editors, terminals, filesystems, and Git, which is how the agent reads your code and runs commands. The memory layer uses codebase-aware retrieval with reranking. The agent doesn&#8217;t read your whole repo. It retrieves the files that matter for this specific edit.</p>



<p class="wp-block-paragraph">At the framework layer, these are custom orchestration systems with RL loops. Not LangGraph, not a provider SDK. Purpose-built control flow for code generation, review, and iteration. At the eval layer, Cursor retrains its acceptance-rate model every 90 minutes based on whether users accept or reject suggestions. That&#8217;s eval running in production, continuously. And at the guardrails layer, sandboxed execution prevents runaway agents. The agent can write code and run it, but inside a container that limits what it can touch.</p>



<h2 class="wp-block-heading"><strong>The AI agent stack cheat sheet</strong></h2>



<p class="wp-block-paragraph">Every layer scored on the three questions from the evaluation framework: How much state do you need to manage? How much vendor lock-in can you tolerate? And how hard is it to go from demo to production?</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="700" height="478" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-11.png" alt="The agent stack cheat sheet" class="wp-image-18879" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-11.png 700w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-11-300x205.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure>



<h2 class="wp-block-heading"><strong>The bigger picture</strong></h2>



<p class="wp-block-paragraph">Most teams are building like it&#8217;s still 2024. They pick LangGraph before they know if they need state. They add a vector database before they&#8217;ve outgrown Postgres. They design multi-agent architectures before they&#8217;ve shipped one agent that works. The decision flowchart above exists because a tool-calling chatbot and a multi-agent research system share almost no infrastructure. Treat them the same and you&#8217;ll overbuild the first and underbuild the second.</p>



<p class="wp-block-paragraph">The teams that got past this run evals on every deploy, not once a quarter. Their guardrails sit at the tool call layer, not the output layer. Their memory architecture was designed, not inherited from whatever the framework defaulted to. Most teams ship the opposite: no evals, output-only filtering, and a system prompt that grows until the context window chokes. The gap isn&#8217;t talent or budget. It&#8217;s knowing which layers matter for your specific agent instead of half-building all six.</p>



<p class="wp-block-paragraph">The stack is going to collapse. Provider SDKs are already absorbing memory, tool calling, and basic eval into a single API. By early 2027, most teams won&#8217;t build each layer separately. They&#8217;ll get an increasingly opinionated stack from their model provider and that will be fine for 80% of use cases. The other 20%, agents at scale where the defaults break, will still build custom at every layer. But even then, when something fails in production, you need to know which layer failed. That&#8217;s what this article is for.</p>



<h2 class="wp-block-heading">Sources</h2>



<ol class="wp-block-list">
<li>“<a href="https://www.letta.com/blog/ai-agents-stack" target="_blank" rel="noreferrer noopener">The AI Agents Stack</a>,” Letta, November 2024.</li>



<li>“<a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation" target="_blank" rel="noreferrer noopener">Donating the Model Context Protocol and Establishing the Agentic AI Foundation</a>,” Anthropic, December 2025.</li>



<li>“<a href="https://www.stackone.com/blog/ai-agent-tools-landscape-2026/" target="_blank" rel="noreferrer noopener">120+ Agentic AI Tools Mapped Across 11 Categories [2026]</a>,” StackOne, February 2026.</li>



<li>Henrik Plate and Darren Meyer, <em><a href="https://www.endorlabs.com/lp/dependency-management-report" target="_blank" rel="noreferrer noopener">Dependency Management Report</a></em>, Endor Labs, January 2026.</li>



<li>Jason Liu, <a href="https://jxnl.co/writing/2025/08/28/context-engineering-index/" target="_blank" rel="noreferrer noopener">Context Engineering Series: Building Better Agentic RAG Systems</a>, August 2025.</li>



<li>“<a href="https://www.langchain.com/blog/langchain-langgraph-1dot0" target="_blank" rel="noreferrer noopener">LangChain and LangGraph Agent Frameworks Reach v1.0 Milestones</a>,” LangChain, October 2025.</li>



<li><em><a href="https://www.langchain.com/state-of-agent-engineering" target="_blank" rel="noreferrer noopener">State of Agent Engineering</a></em>, LangChain, December 2025.</li>



<li>Yunfei Bai, Allie Colin, Kashif Imran, and Winnie Xiong, “<a href="https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/" target="_blank" rel="noreferrer noopener">Evaluating AI Agents: Real-World Lessons from Building Agentic Systems at Amazon</a>,” Amazon, February 2026.</li>



<li><a href="https://github.com/OWASP/www-project-mcp-top-10/" target="_blank" rel="noreferrer noopener">OWASP MCP Top 10</a>, OWASP.</li>
</ol>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-ai-agents-stack-2026-edition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>This Week in AI: Production Viability</title>
		<link>https://www.oreilly.com/radar/this-week-in-ai-production-viability/</link>
				<comments>https://www.oreilly.com/radar/this-week-in-ai-production-viability/#respond</comments>
				<pubDate>Fri, 05 Jun 2026 15:55:20 +0000</pubDate>
					<dc:creator><![CDATA[Michelle Smith]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[This Week in AI]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18861</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-scaled.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Metacognition, what OpenAI’s finance move is really about, and why tokenmaxxing is a trap]]></custom:subtitle>
		
				<description><![CDATA[On this week’s episode, host and the founder of AI advisory firm Intelligence Briefing Andreas Welsch brought together Maya Mikhailov, cofounder and CEO of Savvi AI, and Doug Shannon, generative AI and intelligent automation leader, to cover a handful of interconnected topics that practitioners are navigating right now: OpenAI’s push into personal finance, the role [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">On this week’s episode, host and the founder of AI advisory firm <a href="https://www.intelligence-briefing.com" target="_blank" rel="noreferrer noopener">Intelligence Briefing</a> Andreas Welsch brought together Maya Mikhailov, cofounder and CEO of <a href="https://www.savviai.com/" target="_blank" rel="noreferrer noopener">Savvi AI</a>, and Doug Shannon, generative AI and intelligent automation leader, to cover a handful of interconnected topics that practitioners are navigating right now: OpenAI’s push into personal finance, the role of <a href="https://www.linkedin.com/feed/update/urn:li:activity:7462494796318748673/?trk=public_post_embed_social-actions-reactions" target="_blank" rel="noreferrer noopener">metacognition</a> in AI-assisted technical work, the growing backlash against token-based productivity metrics, and the new role of forward-deployed engineer. Together, these stories sketch a picture of an industry that’s good at generating output but is still figuring out what output is worth.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="This Week in AI: Production Viability with Andreas Welsch, Maya Mikhailov, and Doug Shannon" width="500" height="281" src="https://www.youtube.com/embed/inQlD1CzUg8?start=1&amp;feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>Why OpenAI wants your bank account data</strong></h2>



<p class="wp-block-paragraph">When OpenAI announced it was <a href="https://openai.com/index/personal-finance-chatgpt/" target="_blank" rel="noreferrer noopener">analyzing users’ transaction data</a> in partnership with financial institutions, the coverage focused on the consumer benefit: a smarter way to track spending, comparable to what Credit Karma or Mint offered but with a more conversational interface.</p>



<p class="wp-block-paragraph">But that’s not all the company’s interested in, or even the main thing. Maya reframed the stakes: “What OpenAI wants to do is figure out consumer intent.” Being able to access users’ financial data is less about helping people manage their money and more about completing a profile the company can then monetize. OpenAI already builds a surprisingly accurate picture of users from their chat histories. Add transaction data and you get specifics that weren’t there before: what someone is saving for, what they’re anxious about, where their money is actually going. That’s a data asset worth a great deal to advertisers.</p>



<p class="wp-block-paragraph">We’ve seen this pattern before, and as Andreas noted, companies have long held (and used) potentially invasive data to recommend products. The <a href="https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html" target="_blank" rel="noreferrer noopener">Target pregnancy prediction story</a> is now more than a decade old, but it’s still being taught in business school, including by Andreas, precisely because it illustrates how behavioral data can be combined to infer things people haven’t explicitly disclosed—and spotlights the fine line between effective recommendations and those that feel <em>too</em> personalized, reminding consumers just how much information companies have on them. Companies’ profile-building capability hasn’t changed, but AI chat adds a new wrinkle, said Maya. A conversational interface makes disclosure feel natural, so the knowledge graph based on your chat history is very powerful. And these tools are also better positioned to share recommendations than traditional avenues. “By having this style that is agreeable, that is engaging,” Maya explained, “those recommendations are going to be a lot stickier than what a fragment of a sentence I type into a regular search engine.”</p>



<h2 class="wp-block-heading"><strong>Metacognition as a professional skill</strong></h2>



<p class="wp-block-paragraph">When you delegate thinking to a system that averages across a massive range of inputs to produce an answer, you need to know when that answer is good enough and when it isn’t.</p>



<p class="wp-block-paragraph">“We’re essentially being averaged out,” Doug said. The model is doing many things behind the scenes to find a mean response. The human’s job is to ask questions about the questions, to push past the first answer, and to know whether their own judgment is still in the loop. That’s why Doug’s been pushing for a renewed interest in metacognition, or “thinking about thinking.” Offloading cognitive load that’s peripheral to your work is fine, Doug and Maya agreed. Offloading the reasoning that’s central to your job’s value—what Doug called cognitive surrender—is where organizations get into trouble.</p>



<p class="wp-block-paragraph">The future advantage won’t come from access to AI. Everyone will have some kind of access to it. The advantage will come from knowing what to offload, what to question, and what should never leave human judgment. This is a skill-development question as much as a philosophical one. The people who’ll be most effective with AI tools aren’t the ones who use them most; they’re the ones who understand what to hand off and what to keep. That requires domain knowledge, judgment about when a model’s answer is plausible but wrong, and enough fluency with how these systems work to recognize when you’re being handed an average instead of an answer.</p>



<h2 class="wp-block-heading"><strong>Tokenmaxxing and the wrong incentive</strong></h2>



<p class="wp-block-paragraph">The <a href="https://en.wikipedia.org/wiki/Token_maxxing" target="_blank" rel="noreferrer noopener">tokenmaxxing</a> debate seems to be coming to a head. Amazon <a href="https://www.cnet.com/tech/services-and-software/amazon-ai-leaderboard-tokenmaxxing/" target="_blank" rel="noreferrer noopener">abolished its AI productivity leaderboard</a> after employees started gaming it by writing inefficient code to rack up token usage. And one company reportedly burned through <a href="https://www.axios.com/2026/05/28/ai-spending-roi-enterprise-costs" target="_blank" rel="noreferrer noopener">$500M in Anthropic tokens in a single month</a> after failing to set limits. The companies encouraging tokenmaxxing are incentivizing the wrong metrics, Maya argued. It’s like determining which bakery is best by the amount of flour it uses. The right question is “Are we making a quality product?”</p>



<p class="wp-block-paragraph">Andreas shared his own vibe coding experience as an example of how token consumption and technical debt compound in practice. A developer starts with a modest plan and burns through their quota running agents in half an hour. They upgrade to a higher tier, paying five times more, but now the sunk-cost logic kicks in. As Andreas pointed out, now they feel like they “should also be getting five times more the value out of [their subscription],” so scope expands from a single tool into a unified business operating system. Three weeks later, the accumulated complexity has outpaced the ability to evaluate it: Repeated security audits keep surfacing new issues, each pass generating recommendations that require cybersecurity expertise most vibe coders don’t have. Here’s where Doug’s point about metacognition applies: The more a builder stays actively involved in understanding what the system is actually doing, the better their judgment about whether it is working. For less engaged users, the risk is accepting the output, shipping the debt, and discovering the consequences later.</p>



<p class="wp-block-paragraph">Most of the misalignment originates in the gap between what executives expect from AI and what practitioners deal with day-to-day. Executives see a capability that could change the slope of productivity, Maya explained. Engineers and analysts live with the technical debt, the version control problems, and the regulatory constraints that don’t disappear because you have a better code completion tool. The leaderboard problem is a symptom of that disconnect.</p>



<p class="wp-block-paragraph">GitHub’s recent shift from unlimited to usage-based pricing for Copilot is likely to realign these incentives faster than any internal policy change would. When more CFOs start seeing the actual bills, the leaderboards will all come down.</p>



<p class="wp-block-paragraph">Doug identified a related problem emerging with the “cognitive surrender” to LLMs. When organizations encourage employees to pipe internal processes, proprietary logic, and institutional knowledge into foundation models without governance, they’re not just running up token bills. They’re giving away the operational knowledge that differentiates them. Process documentation, workflow logic, and institutional memory about why certain decisions were made are all forms of intellectual property, and once they’re encoded into a general-purpose model, the organization’s advantage from them diminishes.</p>



<h2 class="wp-block-heading"><strong>Forward-deployed engineers aren’t enough on their own</strong></h2>



<p class="wp-block-paragraph">Is the answer to these challenges to put a skilled engineer directly inside the customer environment to translate between what a model produces and what an organization actually needs? That’s the promise of the forward-deployed engineer (FDE) approach popularized by AI firms. Doug and Maya both had some criticisms of the model.</p>



<p class="wp-block-paragraph">Maya’s objection was structural. Enterprise AI deployment isn’t a matter of adding capability on top of existing infrastructure. Organizations arrive with siloed data, legacy systems, and regulatory constraints that no forward-deployed engineer can resolve on technical skill alone. You can’t “just sprinkle some AI on it, and it’ll work just by a package of tokens,” she said. Engineers have to know the context behind why certain data can’t be used or why a particular model can’t be deployed in a regulated context. FDEs coming into an organization fresh don’t have this understanding and as a result may undo decisions that were made carefully and for reasons that aren’t written down anywhere obvious.</p>



<p class="wp-block-paragraph">Doug’s concern was about communication. FDEs, in his experience, tend to arrive with strong technical instincts and limited organizational context. They get into the work quickly but struggle to communicate across the full stack of stakeholders involved. That’s why business analysts exist, to understand the customers’ problems and what the process actually is before engineers can address them. Skip that step and you get technically correct output that solves the wrong problem.</p>



<p class="wp-block-paragraph">What both Maya and Doug were underscoring is that AI deployment at the enterprise level is fundamentally a <em>context</em> problem. The models are capable. What’s hard is knowing which capability to apply, where to do it, and with what constraints in place. That knowledge doesn’t live in the model; it lives in the people who’ve worked inside the organization long enough to know why things are the way they are.</p>



<h2 class="wp-block-heading"><strong>The measurement problem</strong></h2>



<p class="wp-block-paragraph">All the topics in this episode circle back to the same question: What are we actually measuring, and what incentives are we setting in place with those measurements? Token counts and lines of code don’t always correlate to the outcomes companies want. You need human expertise and a contextual knowledge of the business to figure out what goals you want to achieve and what to measure to ensure you get there.</p>



<p class="wp-block-paragraph">On next Monday’s episode of <em>This Week in AI</em>, RecoMind founder Miguel Fierro joins host Christina Stathopoulos to discuss responsible AI, multimodal content creation, and more on how LLMs are changing personalization and user understanding. Miguel will also lead a live demo that offers a glimpse of the next generation of recommendation experiences—<a href="https://www.oreilly.com/live/this-week-in-ai.html" target="_blank" rel="noreferrer noopener">register here</a>.</p>



<p class="wp-block-paragraph">We’ll continue to publish our takeaways here on Radar each Friday and share full episodes on <a href="https://www.youtube.com/watch?v=g4cfjz5AKxY&amp;list=PL055Epbe6d5bJEhT7_ZzOeJZ6gPyUzYpS" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/033kJS2BG1teGunxmtsU1r" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/this-week-in-ai/id1896798047" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/this-week-in-ai-production-viability/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>I Let an AI Agent Run 40 Experiments While I Slept</title>
		<link>https://www.oreilly.com/radar/i-let-an-ai-agent-run-40-experiments-while-i-slept/</link>
				<comments>https://www.oreilly.com/radar/i-let-an-ai-agent-run-40-experiments-while-i-slept/#respond</comments>
				<pubDate>Fri, 05 Jun 2026 10:27:18 +0000</pubDate>
					<dc:creator><![CDATA[Vanchhit Khare]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18855</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/I-let-an-AI-agent-run-40-experiments-while-I-slept.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/I-let-an-AI-agent-run-40-experiments-while-I-slept-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A linter ruined half of them.]]></custom:subtitle>
		
				<description><![CDATA[I set up an AI agent on a rented GPU, pointed it at a training script, and went to bed. By morning it had run 40 experiments, improved validation loss by 5.9%, and cut memory usage from 44 GB to 17 GB. It also spent four hours chasing a bug that a linter introduced behind [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">I set up an AI agent on a rented GPU, pointed it at a training script, and went to bed. By morning it had run 40 experiments, improved validation loss by 5.9%, and cut memory usage from 44 GB to 17 GB. It also spent four hours chasing a bug that a linter introduced behind its back. The agent never flagged it. I only found out because the numbers stopped improving and I started reading logs.</p>



<p class="wp-block-paragraph">The setup was based on <a href="https://github.com/karpathy/autoresearch" target="_blank" rel="noreferrer noopener">Andrej Karpathy’s autoresearch project</a>: Give an agent one file it can edit (train.py), one metric to optimize (validation bits per byte), a fixed five-minute training budget per experiment, and Git for checkpointing. If an experiment beats the current best, keep the commit. If not, revert. Loop forever. Karpathy’s own run produced <a href="https://x.com/karpathy/status/2031135152349524125" target="_blank" rel="noreferrer noopener">700 experiments and 20 genuine improvements</a> across 48 hours, an 11% speedup on already-optimized code. Shopify’s Tobi Lütke <a href="https://x.com/tobi/status/2032212531846971413" target="_blank" rel="noreferrer noopener">pointed the same pattern at Liquid</a>, their templating engine, and got 53% faster rendering from 93 automated commits. The pattern clearly works. The question is what breaks when you run it yourself.</p>



<h2 class="wp-block-heading">The first failure: Agents fixing agents</h2>



<p class="wp-block-paragraph">Before running autoresearch, I had a separate problem. I had 15 custom skills for Claude Code (think reusable prompt templates with tool access, structured inputs, and specific behaviors). Most of them were broken when dispatched as parallel background agents. Vague descriptions meant the system couldn’t figure out when to invoke them. Missing tool permissions caused silent failures. Duplicate scopes between similar skills created routing confusion.</p>



<p class="wp-block-paragraph">So I used the same pattern: dispatch background agents in parallel, one per skill, each tasked with reading the skill definition, identifying problems, and rewriting it. 13 out of 15 came back improved. Descriptions got specific. Dead references to nonexistent files were removed. Tool permissions were added. Two skills were left untouched because the agents couldn’t find anything wrong with them. The whole batch took under an hour.</p>



<p class="wp-block-paragraph">But here’s what I didn’t expect. Three of the “improved” skills had subtle regressions. One agent removed an AskUserQuestion gate that was there for a reason, because the gate’s purpose wasn’t documented and the agent read it as unnecessary friction. Another agent rewrote a skill description so precisely that it stopped triggering on the fuzzy, misspelled queries real users actually type. I caught these during manual review, but if I had trusted the parallel output without checking, three skills would have silently degraded in production.</p>



<h2 class="wp-block-heading">The second failure: The linter in the loop</h2>



<p class="wp-block-paragraph">Then I started the training loop. The agent worked through hyperparameters methodically. It halved the batch size early (experiment 4), which turned out to be the single biggest win: more gradient steps in the same five-minute window. It reduced model depth from eight to seven layers, dropped weight decay from 0.2 to 0.05, and tuned the learning rate schedule. Each change was small. The cumulative effect was a 5.9% improvement in validation loss and a 60% reduction in peak GPU memory.</p>



<p class="wp-block-paragraph">Out of 40 experiments, the agent kept nine, discarded 28, and crashed three. That keep/discard ratio felt about right. Most ideas don’t work. The point of automation isn’t to have better ideas. It’s to try bad ones faster.</p>



<p class="wp-block-paragraph">Then the numbers plateaued. Experiments 30 through 38 produced nothing worth keeping. I started digging through the logs and found something I hadn’t expected: A linter running on the remote machine had been silently modifying a hyperparameter in train.py. It changed SCALAR_LR from 0.5 to 0.3 every time the agent saved the file. The agent would set the value, commit, and run the experiment, but the linter would alter the file between the save and the execution. The agent had no way to detect this because it checked Git diffs, not the runtime state of the file. Every experiment after a certain point was running with a learning rate the agent never chose.</p>



<p class="wp-block-paragraph">I lost roughly four hours of compute to this. The agent kept going, proposing new ideas, running experiments, logging results. From its perspective nothing was wrong. The experiments ran, produced numbers, and the numbers were plausible. There was no crash, no error, no alert.</p>



<h2 class="wp-block-heading">Why this matters beyond my GPU bill</h2>



<p class="wp-block-paragraph">Gartner predicts <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" target="_blank" rel="noreferrer noopener">over 40% of agentic AI projects will be canceled</a> by the end of 2027, citing escalating costs and inadequate risk controls as the primary drivers. My overnight session was a toy example: a single GPU, a small model, and a low-stakes experiment. But the failure pattern scales. An agent that can’t detect when its inputs are being modified between decisions will make the same class of error whether it’s tuning hyperparameters or managing a production pipeline.</p>



<p class="wp-block-paragraph">The autoresearch constraints are smart: one file, one metric, and Git for state. But they assume the environment is stable. Nobody checks whether something outside the loop is modifying the file between commits. The agent optimizes within its sandbox, and the sandbox has a hole in the wall that nobody thought to look for.</p>



<p class="wp-block-paragraph">Anyone who has run distributed systems recognizes this. When the linter changed that hyperparameter, it was the equivalent of someone editing a database record between a read and a write. We solved that problem years ago with compare-and-swap, optimistic locking, checksums. We just haven’t brought any of it to autonomous AI workflows. The SkyPilot team recently <a href="https://blog.skypilot.co/scaling-autoresearch/" target="_blank" rel="noreferrer noopener">scaled autoresearch to 16 GPUs and 910 experiments</a>. At that scale, an undetected environment mutation doesn’t cost you four hours. It costs you a cluster.</p>



<p class="wp-block-paragraph">Next time I run autoresearch, I’ll add a file integrity check before every experiment. It’s three lines of code, but it would have saved me four hours and produced a better final result. The agent did its job. The environment didn’t.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/i-let-an-ai-agent-run-40-experiments-while-i-slept/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Tidy House</title>
		<link>https://www.oreilly.com/radar/the-tidy-house/</link>
				<comments>https://www.oreilly.com/radar/the-tidy-house/#respond</comments>
				<pubDate>Thu, 04 Jun 2026 16:25:11 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18849</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-tidy-house.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/The-tidy-house-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[DJ Patil on why the hardest part of AI adoption is organizational, not technical]]></custom:subtitle>
		
				<description><![CDATA[DJ Patil has spent the past several months on a listening tour. Wherever he travels, he finds a local university, pings faculty and students and anyone else who wants to show up, and runs an AMA. He&#8217;s heard from grad students who can&#8217;t get callbacks, hospital administrators dealing with federal policy changes that land like [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">DJ Patil has spent the past several months on a listening tour. Wherever he travels, he finds a local university, pings faculty and students and anyone else who wants to show up, and runs an AMA. He&#8217;s heard from grad students who can&#8217;t get callbacks, hospital administrators dealing with federal policy changes that land like a change in the laws of physics, and executives who can&#8217;t forecast their AI spending past six months. He&#8217;s trying to synthesize all of it and help reframe the wider conversation.</p>



<p class="wp-block-paragraph">DJ co-coined the term &#8220;data scientist,&#8221; served as America&#8217;s first chief data scientist under President Obama, and was chief scientist at LinkedIn. He&#8217;s a longtime O&#8217;Reilly author, going back to <em><a href="https://www.oreilly.com/library/view/building-data-science/BLDNGDST0001/" target="_blank" rel="noreferrer noopener">Building Data Science Teams</a></em> and <em><a href="https://www.oreilly.com/library/view/ethics-and-data/9781492043898/" target="_blank" rel="noreferrer noopener">Ethics and Data Science</a></em>, and he&#8217;s on the founding team at <a href="https://www.devoted.com/" target="_blank" rel="noreferrer noopener">Devoted Health</a>, where he&#8217;s spent the past decade building the kind of data infrastructure most organizations are still struggling to put in place. He calls it “the tidy house.” He sat down with me to talk about &#8220;the broken promise&#8221; in the job market that is driving AI sentiment, and why weak data infrastructure is a big part of the gap between what AI can do and what most institutions can actually absorb.</p>



<h2 class="wp-block-heading">The broken promise</h2>



<p class="wp-block-paragraph">What DJ keeps hearing on his tour is anger and angst. One word that keeps coming up is &#8220;terrified.&#8221; Workers are worried about layoffs. Meanwhile, students, including those from top-tier universities like MIT, Carnegie Mellon, and UC Berkeley, have been applying to 300+ internships and getting fewer than 10 callbacks. Many had zero offers going into the summer. And the industry&#8217;s response has been to tell them to learn more AI and burn more tokens. What it comes down to, DJ explained, is “effectively a broken promise”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">We said, “Go to college, get these things, you&#8217;re going to get an internship, you&#8217;re going to get job training, you&#8217;re going to pay off your student loans, and then you&#8217;re going to have all the other things that are part of that social contract.”</p>



<p class="wp-block-paragraph">What the students are feeling for the first time [is]. . .“Wait, if I can&#8217;t get this internship, . . .I&#8217;m fundamentally off trajectory from getting this job.” And it doesn&#8217;t have to be a technical person. It could be someone that is in marketing. It could be someone that&#8217;s in the liberal arts. It could be a researcher.&nbsp;.&nbsp;.&nbsp;.There are plenty of students that I have talked to who are supposed to be going to a doctoral PhD program or a medical school or something like that. The slots aren&#8217;t there because of the overall budget impacts. And so whether you call it AI impact or economic reframing, the thing is broken.</p>
</blockquote>



<p class="wp-block-paragraph">This is where both DJ and I have been trying to build a counter narrative. The story coming from the AI labs is destructive: “We&#8217;re going to put all of you out of work, and we&#8217;ll figure out the rest once the intelligence explosion arrives.” That&#8217;s bad PR for AI, but it’s also magical thinking. An economy is a circulatory system. You can&#8217;t put your customers out of work and at the same time expect that the economy will hum along as usual. A catastrophic recession could easily interrupt the funding that keeps AI on its growth path and the concentration of value that they assume will fund universal basic income and an expanded safety net.</p>



<p class="wp-block-paragraph">That’s why I’m a fan of <a href="https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/" target="_blank" rel="noreferrer noopener">mechanism design</a>: start from the outcome you want, then figure out the rules of the game that produces it. Right now, they’ve designed a game that concentrates all the value in the hands of AI first movers. They could be designing a game that generates value throughout the economy. But they aren’t building affordances for that.</p>



<p class="wp-block-paragraph">YouTube ContentID is a good example of mechanism design leading to economic value creation. When unauthorized music use by online video creators triggered a backlash from rights holders, YouTube replied to the takedown notices with a way for both the people who owned the music and the people who wanted to use it to get paid. A whole creator economy came out of that design choice. The labs have the same opportunity in front of them and mostly aren&#8217;t taking it.</p>



<p class="wp-block-paragraph">DJ had one concrete mechanism in mind:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Imagine OpenAI and Anthropic and Microsoft.&nbsp;.&nbsp;.get together and [say], “If you&#8217;re building something for your local community, we&#8217;ll fully subsidize the token cost for some period of time.”.&nbsp;.&nbsp;.We&#8217;re talking about marginal token usage relatively on the spectrum of things, but the potential innovation and use of AI to help local communities could be astounding. You&#8217;re not putting anybody out of a job with that.&nbsp;.&nbsp;.&nbsp;.You&#8217;re filling the holes that already exist in the system.</p>
</blockquote>



<p class="wp-block-paragraph">The <a href="https://openaifoundation.org/news/update-on-the-openai-foundation#our-mission" target="_blank" rel="noreferrer noopener">OpenAI Foundation just announced</a> it will put $1 billion into public-benefit projects this year, including $250 million aimed at building economic futures. It&#8217;s a start. But it mostly seems designed to ameliorate the bad effects of AI rather than to forestall them by building a more inclusive AI future. If the labs start investing in the human-plus-AI economy rather than just studying the job losses, the payoff to local communities could be real.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Broken Promise with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/OAwI4G_MxYg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">A makerspace to bridge the internship gap</h2>



<p class="wp-block-paragraph">DJ&#8217;s plan is to build a bridge. He&#8217;s launching a program, basically a makerspace, for students who don&#8217;t have an internship this summer. Over two four-week sprints, an initial cohort will get mentors, speakers, and the space to explore whatever they&#8217;re interested in. It doesn&#8217;t have to be AI. Whether they’re doing investigative journalism, screenwriting, or building civic tech, participants will get some experience with current tools and produce a tangible asset they can use to prove what they know. As I told DJ in our conversation, I think he’s really on to something, and I&#8217;d love O&#8217;Reilly to be part of what he’s building.</p>



<p class="wp-block-paragraph">There&#8217;s a kind of person who has always been at the center of the O&#8217;Reilly community and never waited for a job description. High school and college dropouts who started companies, built open source software packages, or otherwise took the future into their own hands. People who looked around, found something that needed doing, and did it. DJ is one of them. He&#8217;s a community college kid who learned from a good local library, from the <a href="https://www.oreilly.com/content/a-short-history-of-the-oreilly-animals/" target="_blank" rel="noreferrer noopener">books with the “funny animals” on the cover</a>, and from open source. That path is still open. The early O&#8217;Reilly business came out of exactly this instinct. We were a tech-writing consulting shop, and when we ran out of paid work, we wrote manuals that didn&#8217;t exist yet but that we thought were needed. Later, when there were big conferences for every corporate technology and none for open source, we ran the first one for Perl. Conferences became a whole new business for us. You look for the gap and you fill it.</p>



<p class="wp-block-paragraph">DJ pushes the same idea down to the level of the neighborhood:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">If you want to feel rewarded, go fix something in your neighborhood. Go help out the food pantry. Go help out the local foster child care system. Go help out.&nbsp;.&nbsp;.parks and rec. Use those skills to go do something, and then you&#8217;re going to see.&nbsp;.&nbsp;.people respond in a different way.&nbsp;.&nbsp;.&nbsp;.The target-rich area for problems is massive. You just have to look.</p>
</blockquote>



<p class="wp-block-paragraph">I&#8217;ve never bought the jobless-future story. Back when I wrote <em><a href="https://www.oreilly.com/tim/wtf-book.html" target="_blank" rel="noreferrer noopener">WTF?</a></em> in 2016, I pointed out that there is so much around us that needs to be made better. The constraint has never been a shortage of problems. AI gives us new tools for solving them. It should be a way to put people <em>to work</em>, not <em>out of work</em>.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="A Makerspace to Bridge the Internship Gap with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/bzE88bDjvJo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The organization is the AI bottleneck</h2>



<p class="wp-block-paragraph">DJ has also been visiting hospitals and clinics and talking to CIOs and CTOs as part of the tour, and what he&#8217;s seeing is alarming.</p>



<p class="wp-block-paragraph">The federal changes to Medicaid and the Affordable Care Act are landing on systems that were already near collapse. Hospitals that depended on outpatient procedures like colonoscopies for margin are watching volumes drop 20% to 30% because people can&#8217;t afford insurance. Some are running $1 million a day behind, a $300 to $400 million shortfall for the year.</p>



<p class="wp-block-paragraph">At the same time, AI companies are telling those same hospitals to move into the new world, and partly because of the “you will soon be replaced” narrative from the AI labs, labor is responding the way the Kaiser nurses did in California, where any use of AI was off the table as a bargaining condition. As DJ pointed out, we can’t afford to disregard AI when it has the potential to automate the most painful parts of healthcare workers’ jobs and let them “do the job they&#8217;re trained for” without the administrative burden. Businesses need to change not just their narrative but their strategy. They need to be saying, “We’re going to use AI to help you do more for our customers. We’re going to make your job more human and let the machines deal with the BS.”</p>



<p class="wp-block-paragraph">There’s a version of this where the efficiencies AI creates get plowed back into better patient care. There&#8217;s also the version that&#8217;s actually happening in most places, where private equity captures the savings as profit. The difference is institutional design, and that&#8217;s where reform isn&#8217;t happening. I saw this directly with a <a href="http://codeforamerica.org" target="_blank" rel="noreferrer noopener">Code for America</a> project called <a href="https://www.clearmyrecord.org/" target="_blank" rel="noreferrer noopener">Clear My Record</a>. A California initiative had turned a number of petty crimes into misdemeanors, but very few people were petitioning to have their status changed. We started using software to streamline an absurdly convoluted criminal record expungement process, but then we asked ourselves why we were helping people fill out forms that shouldn&#8217;t exist. The law had already changed the record. The process should have been a database update, not something that required a petition to the court. That’s the kind of problem AI was born to solve. It can help us refactor old stuck processes and move to something way better.</p>



<p class="wp-block-paragraph">Done right, DOGE could have been an opportunity to carry out that kind of real institutional change at scale. Instead it became a wrecking ball, and it&#8217;s given the whole idea of institutional reform a bad name.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Organizational Capacity Is the Bottleneck with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/BHsqVllEZPQ?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p class="wp-block-paragraph">The Silicon Valley default assumes that incumbents will just get disrupted by startups, the way media was by Google and Meta and retail was by Amazon. There&#8217;s some truth to that. But disruption takes much longer than people think, and in a domain as central as healthcare or government services, the delay means real harm to real people. Healthcare is a third of the economy. You can&#8217;t just let it fail and rebuild it fresh while people depend on it for survival.</p>



<h2 class="wp-block-heading">Data infrastructure is the competitive advantage</h2>



<p class="wp-block-paragraph">DJ&#8217;s term for the alternative he&#8217;s living with at Devoted is “the tidy house.” He built the boring infrastructure years before LLMs existed, and that&#8217;s why the company could move the moment AI arrived. People don&#8217;t think about having well organized, effective data infrastructure as the deep secret behind enterprise AI adoption, but DJ is right. As we work on O&#8217;Reilly&#8217;s own transformation and talk with our customers about what&#8217;s holding them back, it&#8217;s a huge part of the problem.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">One of the ways we&#8217;ve tried to make this work is fundamentally still data 101, unified data environments, data flows that are clean, that have a lot of organization.&nbsp;.&nbsp;.&nbsp;.Because we invested so heavily in that infrastructure, the dumb, boring, painful parts of making sure you&#8217;ve got a really great data warehouse, great data engineering pipes, all of the metadata that goes with it, when AI shows up, you get to use it right away. Now you get to focus on the orchestration, the harness, all those pieces.</p>
</blockquote>



<p class="wp-block-paragraph">While other organizations are reconstructing ETL inside context windows and paying for it in GPU costs, Devoted&#8217;s team gets to work on the actual clinical problems. As DJ put it, transforming a healthcare system is &#8220;like walking and chewing gum while balancing bowling balls on your head and on a unicycle,&#8221; with the laws of physics changing on you the whole time. The organizations that come through it will be the ones that did the unglamorous work of keeping clean, flowing data with its lineage and metadata intact. The ones that didn&#8217;t will keep paying to reconstruct context they should have had all along.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Keeping a Tidy House with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/73vf3GeP20g?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The pharmacists who built their own agents</h2>



<p class="wp-block-paragraph">The tidy house pays off when you put the tools in the hands of people who already know the domain. At Devoted, clinicians are building things without waiting for a product manager to learn the problem first. These frontline workers have already spent decades understanding it.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">A pharmacist. . .says, “Hey, you know what? I&#8217;m really worried when I see these kinds of drugs show up together. That&#8217;s not a good thing. . . .Why don&#8217;t I have an agent that alerts me every time this happens? I should just automate it because maybe one of the patients gets prescribed something by another provider and we don&#8217;t see it.” So the pharmacist [says,]. . .”I&#8217;m just going to build that agent.” Now I&#8217;ve got an agent always looking for bad drug interactions. And another pharmacist says, “I&#8217;ve got my own version of that.” . . .So I say, “Hey, agent, I want you to go ask all the pharmacists that we have a quick survey of what might be happening. . . .What are the universe of things that we should be watching out for?” Now I&#8217;ve got a robust medical layer. . .looking out and protecting all of our members from bad drug interactions. Having the right infrastructure makes it possible to act on decades of accumulated judgment distributed throughout the organization.</p>
</blockquote>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Pharmacists Who Built Their Own Agents with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/bHqxMWVbP44?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The histogram is still the most powerful product</h2>



<p class="wp-block-paragraph">You don&#8217;t need exotic tooling to get value out of data, and DJ punctured the assumption that you do.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Oftentimes, I tell people, the most powerful data product you can build is still a histogram. Just give me a distribution of what&#8217;s going on.&nbsp;.&nbsp;.&nbsp;.AI gives us a tremendous opportunity to let people [access this data quickly], but we&#8217;ve got to figure out the guardrails, so people don&#8217;t ask [questions] or get answers.&nbsp;.&nbsp;.[without realizing] that there&#8217;s a flaw in how they&#8217;re asking it.</p>
</blockquote>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Histogram Is Still the Most Powerful Data Product with DJ Patil" width="500" height="281" src="https://www.youtube.com/embed/xBBjws9NIIo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p class="wp-block-paragraph">Every time a new technology empowers employees to make innovative use of corporate data, there is resistance. We’ve been in this loop since the beginning of the data movement, DJ explained. The stewards of the data warehouse stand at the gate and say, “You shall not pass!” Then democratization breaks it open, and the gatekeepers reconstitute themselves in the next era. Hadoop did it last time. LLMs are doing it now, and the temptation to insist that only experts can use the tools correctly is as strong as it&#8217;s ever been. You do need ways to catch errors. But the goal should always be access.</p>



<h2 class="wp-block-heading">The real opportunity is in the layers above AI models</h2>



<p class="wp-block-paragraph">DJ and I also talked about the new discipline forming inside computer science, engineering the trade-offs between conventional software and LLMs, when to reach for a local or open weight model, and understanding what inference actually costs against the value it returns.</p>



<p class="wp-block-paragraph">Getting that right requires an expanded view of mechanism design. While this isn’t how economists talk about it, many advances in technology are really just that: redesigning the rules of a game to get better outcomes. Pay-per-click advertising started as a crude auction that sold to the highest bidder, and then Google refined it into something that worked. Rob McCool wired a web server to a database with CGI and ushered in a decade of invention of new mechanisms for data-driven websites. Or take Apache Kafka, which DJ reminded us began as a project to help LinkedIn rein in its Splunk bill and only later became the foundation for a company and an ecosystem.</p>



<p class="wp-block-paragraph">We&#8217;re at the front of an architectural innovation cycle now, and the biggest opportunities are not in the models themselves but in the layers above them. That’s also where a renaissance of open source for the AI era could happen.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Future of Software Will Be Shaped by Microeconomics with Tim O&amp;apos;Reilly" width="500" height="281" src="https://www.youtube.com/embed/ZLffZO_GHzs?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p class="wp-block-paragraph">DJ and I are both, as he says, &#8220;this giant human LLM, summarizing and distilling all the things we&#8217;re hearing&#8221; from a lot of people. What we&#8217;re hearing is that the technology is mostly ready, but our institutions are not. What&#8217;s lagging is the organizational and economic infrastructure that lets universities, hospitals, data teams, and the labs themselves actually deploy what&#8217;s been built.</p>



<p class="wp-block-paragraph">It’s time to get busy!</p>



<p class="wp-block-paragraph"><em>On June 10, Harper Reed, cofounder of 2389 Research, will join me to talk about why the future of software depends on creativity, serendipity, and building weird stuff. And on July 9, Trail of Bits cofounder and CEO Dan Guido will stop by to share his playbook for going AI native. You can register to attend them live <a href="https://www.oreilly.com/live/live-with-tim/" target="_blank" rel="noreferrer noopener">here</a>. You can also follow </em>Live with Tim O’Reilly<em> on <a href="https://www.youtube.com/playlist?list=PL055Epbe6d5YQ8t30jyo1D6XuSpe8uhAG" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/79YLK6OLSAJam4kcd8w3Kw" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/live-with-tim-oreilly/id1896312725" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-tidy-house/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Predict, Don&#8217;t Enumerate</title>
		<link>https://www.oreilly.com/radar/predict-dont-enumerate/</link>
				<pubDate>Thu, 04 Jun 2026 10:57:44 +0000</pubDate>
					<dc:creator><![CDATA[Michael Roytman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18846</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Predict-dont-enumerate.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Predict-dont-enumerate-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[An AI lab just endorsed a predictive model for defense.]]></custom:subtitle>
		
				<description><![CDATA[A third of the way into a security-operations guide that Anthropic published in April 2026, wedged between a recommendation to patch CISA&#8217;s Known Exploited Vulnerabilities list and a suggestion to automate your deployment pipeline is a small recommendation: &#8220;Use EPSS to prioritize the rest.&#8221; For anyone who has worked on a vulnerability backlog in the [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">A third of the way into a <a href="https://claude.com/blog/preparing-your-security-program-for-ai-accelerated-offense" target="_blank" rel="noreferrer noopener">security-operations guide</a> that Anthropic published in April 2026, wedged between a recommendation to patch CISA&#8217;s Known Exploited Vulnerabilities list and a suggestion to automate your deployment pipeline is a small recommendation: &#8220;Use EPSS to prioritize the rest.&#8221; For anyone who has worked on a vulnerability backlog in the last decade, the sentence is an acknowledgment of a widely felt but often unspoken fact about security programs: They have become machine-scale problems of signal to noise.</p>



<p class="wp-block-paragraph">EPSS (Exploit Prediction Scoring System) is a statistical model that takes a known software flaw, runs it through a set of signals about what attackers are actually doing across the internet, and returns a probability that the flaw will be exploited in the next 30 days. It isn’t an LLM, and it does no reasoning or prompt engineering. It predicts. The company endorsing it is the same company whose newest model can surface thousands of novel, exploitable vulnerabilities in production software, many of them two or three decades old, most of them still unpatched.</p>



<p class="wp-block-paragraph">As far as we can tell, this is the first time a frontier AI lab has publicly endorsed a purpose-built predictive model as the right tool for a defensive problem. LLM labs usually recommend LLMs. That Anthropic did not is worth noting, but the recommendation itself isn’t news to the practitioners it’s aimed at. It’s a description of what they’ve been doing.</p>



<h2 class="wp-block-heading"><strong>The quiet consensus</strong></h2>



<p class="wp-block-paragraph">The volume problem isn’t new. Anyone running a scanner against a large enterprise estate in 2015 was already generating hundreds of thousands of findings per month. Anyone running one against a cloud environment in 2020 was generating millions. Enterprises have spent the better part of a decade staring at dashboards where the number of open critical findings was larger than the capacity of the team supposed to fix them. In other words, cybersecurity has become machine scale.</p>



<p class="wp-block-paragraph">Risk-based vulnerability management, as a product category, has existed since around 2018. EPSS, as a public resource, has been usable since 2021. More than 120 vendors embed it today into their products. The field has had access to a predictive baseline for years.</p>



<p class="wp-block-paragraph">What has been missing is an external justification to change the status quo recommendations from auditors, model risk management teams, and even boards. Auditors want a clear set of expectations, making grading more objective and therefore easier to evaluate. Compliance frameworks like CVSS (Common Vulnerability Scoring System) because CVSS is <em>easy</em>, but implementing something more efficient has historically required that aforementioned external push. A working CISO could tell you she had stopped treating every vulnerability scored a severity 9.8/10 by CVSS as an emergency in 2019, but she would also tell you she still kept CVSS in the report.</p>



<p class="wp-block-paragraph">Anthropic&#8217;s guidance is useful because it makes the private consensus public. Patch what you know to be exploited, then use EPSS above a threshold based on the team’s capacity or risk tolerance. DHS CISA’s practice of publishing known exploited vulnerabilities since November of 2021 is just additional proof that the existing methodologies were being overwhelmed by scale and lack of signal.</p>



<h2 class="wp-block-heading">Why prediction, stated plainly</h2>



<p class="wp-block-paragraph">In 2014, at Black Hat, Dan Geer, then the chief information security officer of In-Q-Tel, asked the first principles question: Are vulnerabilities in software sparse or dense? Sparse meant finite, meaning every fix measurably shrank the attack surface. Dense meant weeds in a field. Geer could not answer the question because the data were not in.</p>



<p class="wp-block-paragraph">Eight years later, Jonathan Spring at Carnegie Mellon&#8217;s Software Engineering Institute tied vulnerability enumeration to the halting problem and showed, in theory, that for any sufficiently complex piece of deployed software, there are always more undiscovered flaws.</p>



<p class="wp-block-paragraph">The AI-driven discovery results of the last 18 months have made the density argument impossible to wave off even in a compliance review. A 27-year-old bug in OpenBSD. A 16-year-old bug in FFmpeg that five million fuzzing runs never caught. Disclosed findings, by the developers&#8217; own accounting, are less than 1% of what has been found. But again, the volume was already a problem. With the coming release of its newest model, Mythos, Anthropic is telling teams to plan for an order of magnitude more findings over the next 24 months.</p>



<p class="wp-block-paragraph">Static severity scoring can’t survive the volume problem, because it’s a human-scale solution for a machine scale problem. Neither can any process that treats every critical finding as an emergency. The threshold for action has to be probabilistic, measurable, and defensible. That’s what a predictive model is for, and that’s what working teams have been using in noisy large enterprise environments.</p>



<h2 class="wp-block-heading">Pointing machines and knowing machines</h2>



<p class="wp-block-paragraph">Geer returned to his 2014 question in the summer of 2025, <a href="https://www.lawfaremedia.org/article/ai-and-secure-code-generation" target="_blank" rel="noreferrer noopener">writing with Dave Aitel in <em>Lawfare</em></a>. The piece gives the industry a vocabulary for a distinction it has been fudging:</p>



<p class="wp-block-paragraph">A vulnerability in the code isn’t automatically a threat. A buffer overflow is a hazard. It becomes a risk only if an attacker can exploit it reliably, in this environment, against these controls, through this traffic. Bugs are abundant but the ability to weaponize a particular bug against a particular target is much rarer.</p>



<p class="wp-block-paragraph">The industry, they wrote, has built a pointing machine. It enumerates.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Even children learn early to point and name—but knowing the word “dog” doesn’t reveal whether the animal might bite. In cybersecurity, we’ve built systems that similarly point and name vulnerabilities without understanding whether they’re truly dangerous. By embracing AI solely for pattern recognition, we’ve created a powerful “pointing machine” that identifies possible threats but does not comprehend their actual impact. What we need instead is a “knowing machine,” capable of understanding how code functions within complex, real-world environments, recognizing not just hazards but the full context of how and whether those hazards might become genuine risks.</em></p>
</blockquote>



<p class="wp-block-paragraph">A knowing machine is a system that understands how code behaves in a particular environment and recognizes the context that turns a hazard into a risk. A predictive model is how you build a knowing machine. EPSS is the clearest public example: It covers every published CVE and is updated daily.</p>



<h2 class="wp-block-heading">Global isn’t local</h2>



<p class="wp-block-paragraph">EPSS is a global model. It sees what attackers are doing across the whole of the internet. It picks up patterns in exploitation activity that severity scores never could. What it can’t see is any particular organization&#8217;s environment. It doesn’t know which assets carry the data the business actually cares about. It doesn’t know what compensating controls are in place, where remediation is risky, or how your telemetry and history change the odds.</p>



<p class="wp-block-paragraph">A 9.8 with a 97% global probability of exploitation and a 9.8 with a 0.1% probability are not the same animal. Neither are two organizations applying the same EPSS threshold to the same CVE on different assets. One has the vulnerable code path exposed to the internet, behind a web application firewall that doesn’t inspect the relevant protocol. The other has the same CVE on an internal system that accepts authenticated input from a single service account. A scanner can’t tell them apart. A global model can’t tell them apart. Their actual risk profiles are orders of magnitude apart.</p>



<p class="wp-block-paragraph">Local context is where most security teams have been stuck the entire time, and where the next decade of the field is going to be fought.</p>



<h2 class="wp-block-heading">What a local knowing machine actually requires</h2>



<p class="wp-block-paragraph">Pair a better pointing machine with a faster remediation engine and all you’ve done is increase the speed at which you produce churn, breakage and wasted effort. You’ll also spend a king&#8217;s ransom in agent tokens fixing vulnerabilities that were never dangerous in your environment.</p>



<p class="wp-block-paragraph">In contrast to an omniscient scanner, a local model trains on the specific environment being defended: asset inventory, application topology, reachability, deployed controls, attack telemetry observed on-site, and the history of the organization&#8217;s own remediations and their outcomes. The model produces probabilities specific to the enterprise. Most organizations already have the inputs, scattered across CMDBs, endpoint agents, firewall logs, ticketing systems and scanner output. This context is precisely what attackers (whether they’re using good old fashioned metasploit or Mythos with an infinite budget) are lacking in their models. The context becomes an asymmetrical advantage for defenders, perhaps the only one that exists.</p>



<h2 class="wp-block-heading">The policy shifts that actually matter</h2>



<p class="wp-block-paragraph">The interventions that will decide whether a security program survives the next 24 months aren’t purely technical. A CISO can put most of them in motion without buying anything.</p>



<p class="wp-block-paragraph">Rewrite the SLA. Most vulnerability-management SLAs are organized by severity. Criticals in 15 days, highs in 30, mediums in 90. That structure was built for a world where the count of open criticals was small enough to matter. It’s now actively harmful, because it forces teams to spend the same effort on a 9.8 nobody is exploiting and a 7.5 that’s under active attack. SLAs should be rewritten in terms of probability of exploitation and asset exposure, not severity. A CISO who can’t get that past her GRC team can at least add a second tier that makes the probability-based cut enforceable alongside the severity-based one.</p>



<p class="wp-block-paragraph">Change what the board sees. If the monthly security report counts the numbers of vulnerabilities, exposures or findings in different buckets (“critical,” “open past 30 days,” etc.), the organization is being managed to the wrong metric. The metric should be exploitability-weighted exposure over time, with a second line for predicted versus observed exploitation. Boards will accept this once somebody explains it. This beats showing them a number that has no relationship to risk and is growing exponentially as new LLM models are released. More to the point: A great team can do amazing <em>volumes</em> of remediation work, and risk can still rise because they’re measuring and remediating the wrong thing. An efficient, context-rich team can do far less work and meaningfully move the probability of an event down.</p>



<p class="wp-block-paragraph">Invest in telemetry. The single most valuable instrument a security program can build is a feedback loop between what was prioritized and what was exploited. If the loop shows you were wrong, the model improves. If the loop does not exist, you will keep being wrong indefinitely (or just not being aware of misses).</p>



<p class="wp-block-paragraph">Fix the compliance conversation. The reason CVSS survives is regulatory inertia. PCI, HIPAA, and most state breach-notification frameworks still reference severity. The CISOs who will come out of the next two years in the best shape are the ones who engage their auditors now, in writing, about what a probabilistic prioritization framework looks like under the existing rules.</p>



<p class="wp-block-paragraph">Staff for the bottleneck, which isn’t scanning. The industry has spent a decade hiring people to find bugs. The bottleneck now is deciding which bugs matter, getting the fixes deployed, and measuring whether the prioritization was correct. The job descriptions should reflect this. A security-data engineer may be able to increase efficiency to meet SLAs more than increasing capacity would.</p>



<p class="wp-block-paragraph">None of this requires a new product. All of it requires a CISO willing to say, out loud, that the old dogma is broken and that the new one will be managed by data and probabilities. That is the shift Anthropic&#8217;s five-word sentence was really announcing. The technology is available and the models are here—both the LLM-based ones to find the vulnerabilities and the predictive knowing machines to prioritize efficiently.</p>
]]></content:encoded>
										</item>
	</channel>
</rss>

<!--
Performance optimized by W3 Total Cache. Learn more: https://www.boldgrid.com/w3-total-cache/?utm_source=w3tc&utm_medium=footer_comment&utm_campaign=free_plugin

Object Caching 91/102 objects using Memcached
Page Caching using Disk: Enhanced (Page is feed) 
Minified using Memcached

Served from: www.oreilly.com @ 2026-06-18 14:33:29 by W3 Total Cache
-->