<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Wed, 03 Jun 2026 11:00:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Context as Code</title>
		<link>https://www.oreilly.com/radar/context-as-code/</link>
				<comments>https://www.oreilly.com/radar/context-as-code/#respond</comments>
				<pubDate>Wed, 03 Jun 2026 11:00:14 +0000</pubDate>
					<dc:creator><![CDATA[Artur Huk]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18837</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Context-as-code.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/Context-as-code-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Build-time governance in the era of infinite syntax]]></custom:subtitle>
		
				<description><![CDATA[As syntax becomes cheap and abundant, architectural control becomes the scarce resource. Effective governance starts upstream, where intent, constraints, and threat models shape the agent’s working context before generation begins. The goal isn’t better prompting but build-time boundaries that prevent structurally invalid code from entering the system. The Frankenstein factories The dark factories (as Dan [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">As syntax becomes cheap and abundant, architectural control becomes the scarce resource. Effective governance starts upstream, where intent, constraints, and threat models shape the agent’s working context before generation begins. The goal isn’t better prompting but build-time boundaries that prevent structurally invalid code from entering the system.</p>



<h2 class="wp-block-heading">The Frankenstein factories</h2>



<p class="wp-block-paragraph">The <a href="https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/" target="_blank" rel="noreferrer noopener">dark factories</a> (as Dan Shapiro calls them) are running. Tokens fly through trycycles, features ship overnight, and codebases are ported before breakfast. The velocity is real. And <a href="https://www.oreilly.com/radar/comprehension-debt-the-hidden-cost-of-ai-generated-code/" target="_blank" rel="noreferrer noopener">comprehension debt</a> (a term coined by Addy Osmani) is compounding in silence behind it.</p>



<p class="wp-block-paragraph">What this era is producing, at scale, deserves its own name: Frankenstein factories. Not a critique of any single approach but a description of a structural condition—generation engines so effective at producing working syntax that they have industrialized the creation of architecturally ungovernable systems. The creature walks out of the laboratory impressive, functional, and alive on delivery day.</p>



<p class="wp-block-paragraph">The crisis arrives the day someone must govern it. To govern a system means to hold it accountable to its design boundaries—the ability to look at it and reliably say <em>why</em> it works, <em>what</em> is permitted to touch what, and to categorically prevent forbidden state changes before they happen. Victor&#8217;s catastrophe was not the act of creation but the absent governing frame.</p>



<p class="wp-block-paragraph">For prototyping or shipping features fast, unconstrained generation is a powerful tool. It optimizes for velocity, and it delivers. But for enterprise payment systems, insurance underwriting engines, logistics orchestrators, and regulated platforms, the question is not &#8220;Does the code ship?&#8221; but &#8220;Who is liable when it does the wrong thing?&#8221; Here, automating the word &#8220;YES&#8221; to every feature request does not solve the problem. It industrializes it.</p>



<p class="wp-block-paragraph">Consider a standard Jira ticket: &#8220;Add an email notification after a successful payment.&#8221;</p>



<p class="wp-block-paragraph">A junior developer might attempt to wedge the email-sending logic directly into the <code>PaymentProcessor</code> class. A senior architect catches this in code review: &#8220;No. Fire a <code>PaymentSuccessEvent</code> to the message bus.&#8221; That human friction—the architectural &#8220;No&#8221;—keeps the system maintainable.</p>



<p class="wp-block-paragraph">Unconstrained AI agents lack this assertiveness. By default, they are the ultimate yes-men.</p>



<p class="wp-block-paragraph">Hand that same ticket to a standard coding agent and it will not argue about bounded contexts. It will burn tokens until it produces 300 lines of syntactically perfect code, import an SMTP library directly into the core of your billing domain, and submit a pull request. The tests will pass; conventional feature tests make no assertion about bounded contexts. The CI pipeline will go green. And structurally, the system is now a disaster.</p>



<p class="wp-block-paragraph">This happens not through malice but because of how agentic loops are built. Without explicit architectural constraints, the system&#8217;s emergent behavior is to fulfill immediate user intent. The agent is orchestrated to ship the feature, not to defend the architecture. Comprehension debt is the structural consequence: AI generates syntax faster than human beings can read or govern it. Expecting a probabilistic model to enforce structural integrity on its own is a category error. Without a governing frame, the agent will always take the path of least resistance to a &#8220;YES.&#8221;</p>



<p class="wp-block-paragraph">You cannot fix code overproduction by hiring more people to read it nor by running the generation loop faster. The only scalable answer is to build a concrete riverbed <em>before</em> you turn on the water.</p>



<p class="wp-block-paragraph">If the current era automates the word &#8220;YES,&#8221; we should automate the word &#8220;NO.&#8221;</p>



<p class="wp-block-paragraph">Securing the runtime environment prevents the monster from escaping. But to prevent it from being built in the first place, we need to step back into the IDE and the CI/CD pipeline. We need to govern <em>generation</em>.</p>



<h2 class="wp-block-heading">The great softening: Shifting risk from build time to runtime</h2>



<p class="wp-block-paragraph">Compilers never guaranteed correct software. You could write catastrophic, logically broken systems in C, Java, or any other compiled language. But compilers served a crucial engineering purpose: They deterministically governed a specific layer of structural risk.</p>



<p class="wp-block-paragraph">By enforcing hard execution constraints—syntax validity, type compatibility, linkage rules, and executable viability—the compiler acted as an automated boundary. It didn’t verify business intent, domain correctness, or architectural quality. What it did was eliminate an entire class of low-level structural failure <em>before</em> execution ever began.</p>



<p class="wp-block-paragraph">That delegation of risk is one of the quiet triumphs of software engineering. Our discipline has always advanced by mechanizing one class of guarantees so humans can focus on the next layer of abstraction. We automated machine-level structural correctness so engineers could spend their cognitive energy on application logic. Later, we pushed more guarantees upward, into schemas, testing, static analysis, architectural patterns, and operational controls.</p>



<p class="wp-block-paragraph">Over time, we also deliberately softened certain boundaries in exchange for speed. Dynamic languages, richer runtimes, reflection, and increasingly abstract frameworks all traded deterministic compile-time guarantees for developer velocity and flexibility. The newly exposed risk was absorbed elsewhere: runtime validation, automated testing, observability, and engineering discipline.</p>



<p class="wp-block-paragraph">Today, with agentic AI, we are softening boundaries again, more radically than ever before.</p>



<p class="wp-block-paragraph">Natural language has become a high-level control plane for software generation. Arbitrary text increasingly shapes executable behavior. And in that shift, we have blurred one of the oldest boundaries in computing: the separation between <em>data</em> and <em>instructions</em>.</p>



<p class="wp-block-paragraph">Outside the model, that boundary still exists. Systems enforce permission scopes, schema contracts, sandboxing, and execution policies. But inside the inference context, those protections collapse into the same token stream.</p>



<p class="wp-block-paragraph">System prompts, retrieved documents, user messages, tool outputs, and external content all flow through the same neural weights. There is no hard privilege boundary between instruction and input. Modern models may resist naive attacks like &#8220;Ignore previous instructions,&#8221; but they remain vulnerable to indirect injections disguised as legitimate operational context. A malicious instruction embedded in a customer email, a webpage, or a tool response is not processed as passive data. It can become behavioral influence.</p>



<p class="wp-block-paragraph">Inside the context window, untrusted text can shape control flow. That is the real softening.</p>



<p class="wp-block-paragraph">We are generating syntax at machine speed, but we have dissolved the structural gate that once constrained how systems were built. The result is a massive shift of risk from build time to runtime. Code that appears structurally sound during generation may violate architectural boundaries, introduce unsafe execution paths, or become behaviorally compromised the moment hostile context enters the loop.</p>



<p class="wp-block-paragraph">The conclusion is straightforward: The fact that AI-generated code runs is no longer a meaningful proxy for system correctness.</p>



<p class="wp-block-paragraph">Syntax is abundant. Execution is easy. Structural governance is what is missing.</p>



<p class="wp-block-paragraph">We outsourced the writing of logic to machines, but we did not build a deterministic boundary that governs what those machines are allowed to generate.</p>



<p class="wp-block-paragraph">If we want control back, we cannot rely on human code review at machine speed. We must rebuild the build-time gate.</p>



<h2 class="wp-block-heading">From dependency bloat to tailor-made architecture</h2>



<p class="wp-block-paragraph">For decades, the industry&#8217;s default response to complexity was abstraction by accumulation: monolithic frameworks, sprawling dependency trees, and ever-thicker layers of indirection. Importing a 50-megabyte library to avoid repetitive boilerplate was a rational trade-off when developer time and cognitive bandwidth were the scarce resources. For AI agents, that trade-off changes.</p>



<p class="wp-block-paragraph">This is not an argument against foundational infrastructure. Mature primitives—like SQLAlchemy in Python or Spring Boot in Java—remain essential precisely because their conventions are widely learned and predictable. The problem isn’t abstraction but opacity. When core business logic disappears behind proprietary decorators, internal frameworks, or custom orchestration layers, execution becomes a black box. An agent cannot safely reason about code it cannot trace. It needs direct visibility into causality: what changes state, what enforces invariants, and where responsibilities begin and end. Hidden flow degrades reasoning into guesswork; guesswork silently becomes architectural drift.</p>



<p class="wp-block-paragraph">At the same time, AI drives the cost of procedural code toward zero. Boilerplate is no longer expensive. Clarity is. The design question shifts from &#8220;How much can we abstract away?&#8221; to &#8220;How much must remain explicit for safe reasoning?&#8221; The answer is tailor-made architecture: thin infrastructure, explicit domain logic, hard boundaries, and narrowly scoped components with visible contracts. The value is no longer in how much code you avoid writing but in how clearly the system declares its boundaries.</p>



<p class="wp-block-paragraph">That same opacity also breaks verification. AI review can catch local defects, risky patterns, and implementation mistakes, but it remains blind to architectural drift and missing business intent unless those constraints are explicitly encoded. After all, if you ask a model to review code generated from the exact same vague Jira ticket, do you actually get verification, or do you just engineer a circular hallucination, where the AI politely revalidates its own blind spots?</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="1536" height="1024" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image.png" alt="Tailor-made architecture gives generated syntax a clear structure without dissolving system boundaries." class="wp-image-18838" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-300x200.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-768x512.png 768w" sizes="(max-width: 1536px) 100vw, 1536px" /><figcaption class="wp-element-caption"><em>Figure 1. Tailor-made architecture gives generated syntax a clear structure without dissolving system boundaries.</em></figcaption></figure>



<h2 class="wp-block-heading">The Context Compilation Pattern</h2>



<p class="wp-block-paragraph">The Context Compilation Pattern governs <em>generation</em> in the IDE and the CI/CD pipeline before a single syntactically plausible line ever reaches a human reviewer. If the Decision Intelligence Runtime (DIR) is the vault door that protects execution in production, context compilation is the blueprint that prevents the monster from being built in the lab.</p>



<p class="wp-block-paragraph">This is not &#8220;prompt engineering,&#8221; which merely asks a probabilistic model for a better answer. What we need is build-time governance: two layers of defense assembled before the LLM inference is even triggered. The first is structured context injection (assembling the prompt from prioritized artifacts). The second is postgeneration static verification (deterministic AST checks that enforce rules no probabilistic model can override). The prompt structure biases generation toward compliant solutions; the static checks make declared, machine-verifiable boundary violations impossible to merge.</p>



<p class="wp-block-paragraph">Deterministic build-time governance is not a return to formal software specification (like UML), nor is it merely &#8220;prompt engineering disguised as Markdown.&#8221; It’s a mechanical constraint on the generation space that makes explicitly declared boundary violations rejectable by design. Context compilation does not eliminate architectural review or replace engineering judgment. Instead, it ensures that the agent operates within a defined riverbed of allowed structural invariants.</p>



<p class="wp-block-paragraph">Engineering evolves whenever implicit rules become explicit declarations. Application development is now crossing that boundary. The senior engineer&#8217;s new job is <em>declarative boundary engineering</em>: explicitly declaring what the system is absolutely forbidden from doing.</p>



<p class="wp-block-paragraph">The failure is not in the frameworks. The failure is in the process: pointing an unconstrained AI agent at a codebase full of invisible magic and expecting a CI/CD pipeline designed for human-generated code to catch what goes wrong. The answer is to build a compiler for the agent&#8217;s context.</p>



<p class="wp-block-paragraph">The Context Compilation Pattern is the staged pipeline that makes this concrete.</p>



<figure class="wp-block-image size-large"><img decoding="async" width="1056" height="1600" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-1056x1600.png" alt="The Context Compilation Pattern pipeline, enforcing build-time constraints through deterministic artifact assembly and dual verification." class="wp-image-18839" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-1056x1600.png 1056w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-198x300.png 198w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-768x1164.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1-1013x1536.png 1013w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-1.png 1274w" sizes="(max-width: 1056px) 100vw, 1056px" /><figcaption class="wp-element-caption"><em>Figure 2. The Context Compilation Pattern pipeline, enforcing build-time constraints through deterministic artifact assembly and dual verification.</em></figcaption></figure>



<h3 class="wp-block-heading">Step 1: The context artifacts</h3>



<p class="wp-block-paragraph">The most strategically valuable code in your repository may no longer live in <code>src/</code>. It lives in <code>/context</code>. The pipeline consumes versioned artifacts such as <code>intent.md</code>, <code>boundaries.md</code>, and <code>threat-model.md</code>, each authored by a specialist before a single line of code is generated. (Ownership and role responsibilities are covered in “Artifact-Bound Roles and Accountability” below.) What matters here is that these files are the <em>inputs</em> to the compiler: Without them, there’s nothing to compile.</p>



<p class="wp-block-paragraph">To prevent cognitive overlap, their roles must be fiercely separated: <code>boundaries.md</code> declares <em>structural invariants</em> (e.g., dependency direction, allowed communication paths, and event emission), whereas <code>threat-model.md</code> models <em>adversarial constraints</em> as declarative abuse scenarios (e.g., prompt injection and secrets exfiltration) that must be mechanically blocked.</p>



<p class="wp-block-paragraph"><code>boundaries.md</code> warrants a precise definition, because it anchors the entire build-time governance model. In practice, boundaries are typically defined at module or bounded-context granularity (e.g., <code>/billing/*</code> or <code>/risk/*</code>), not per class or per repository. They are implemented using <strong>hybrid artifacts</strong>: a natural language document designed to constrain the LLM, tightly paired with a deterministic rule for the CI runner.</p>



<p class="wp-block-paragraph">Consider this concrete example of how an architectural boundary is explicitly declared and enforced:</p>



<p class="wp-block-paragraph"><strong>1. <code>boundaries.md</code> (for the LLM context)<br></strong>This Markdown file is injected into the agent’s prompt. It defines the vocabulary, architectural constraints, and allowed interactions.</p>



<pre class="wp-block-code"><code>Module: Billing
Ontology: Order, Invoice, PaymentEvent
Rule: Zero external network I/O is allowed in this domain. You must NEVER import requests or smtplib.</code></pre>



<p class="wp-block-paragraph"><strong>2. <code>semgrep-rule.yml</code> (for the CI/CD runner)</strong><br>This static file goes to the CI pipeline to mechanize the boundary. It ensures the code check is fully deterministic.</p>



<pre class="wp-block-code"><code>rules:
  # Block forbidden imports at the module boundary
  - id: block-external-io-in-billing
    patterns:
      - pattern-either:
          - pattern: import smtplib
          - pattern: from smtplib import ...
          - pattern: import requests
          - pattern: from requests import ...
    message: "Architecture Violation: External I/O is strictly forbidden in the billing domain."
    severity: ERROR
    languages: &#91;python]
    paths:
      include: &#91;"src/billing/**"]

  # Domain layer must not talk to DB driver directly
  - id: block-db-driver-in-domain
    patterns:
      - pattern-either:
          - pattern: import sqlalchemy
          - pattern: from sqlalchemy import ...
          - pattern: import psycopg2
          - pattern: from psycopg2 import ...
    message: "Architecture Violation: Domain layer must use Repository abstraction, not database drivers directly."
    severity: ERROR
    languages: &#91;python]
    paths:
      include:
        - "src/billing/domain/**"</code></pre>



<p class="wp-block-paragraph">Crucially, these Semgrep/CI rules are human-authored (or human-reviewed) precommit artifacts. We don’t rely on an LLM to generate the security gates on the fly. The AI reads the Markdown to guide its generation; the CI runner executes the static YAML to enforce the boundary.</p>
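<p class="wp-block-paragraph">To make the deterministic half of the pair less abstract, here is a minimal sketch of the kind of check such a rule performs, written against Python's standard <code>ast</code> module rather than Semgrep itself. The helper name <code>find_forbidden_imports</code> and the <code>FORBIDDEN_IN_BILLING</code> set are assumptions made for illustration; they mirror the rule line in <code>boundaries.md</code> and are not part of any real tool.</p>

```python
import ast

# Hypothetical constant: mirrors the Rule line in boundaries.md.
FORBIDDEN_IN_BILLING = {"requests", "smtplib"}


def find_forbidden_imports(source, forbidden=FORBIDDEN_IN_BILLING):
    """Return sorted (line, module) pairs for every forbidden import in source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level != 0); they stay inside the package.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for name in names:
            # Compare the top-level package so "requests.adapters" is caught too.
            if name.split(".")[0] in forbidden:
                violations.append((node.lineno, name))
    return sorted(violations)


generated = "import smtplib\nfrom requests import get\nimport json\n"
print(find_forbidden_imports(generated))  # [(1, 'smtplib'), (2, 'requests')]
```

<p class="wp-block-paragraph">The point of the sketch is that the verdict depends only on the parsed syntax tree, so the same input always produces the same result. A production gate would more plausibly rely on an established tool than on a hand-rolled walker like this one.</p>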



<p class="wp-block-paragraph">If these artifacts stay current, they actively govern the generated codebase. Stale or malformed context becomes context debt: The pipeline will enforce strictly whatever was declared, even if the declaration is wrong. Governance artifacts are production code. They require strict versioning, explicit ownership, and periodic review just like the executable logic they constrain. That’s why core artifacts like <code>boundaries.md</code> require rigorous peer review, not just casual updates.</p>



<h3 class="wp-block-heading">Step 2: The context compiler</h3>



<p class="wp-block-paragraph">Dumping all Markdown files into the system prompt is sometimes acceptable for small projects and small artifacts. But as the codebase grows or the context window fills with too many competing constraints, models begin to suffer from &#8220;lost in the middle&#8221; degradation and silently ignore what matters most.</p>



<p class="wp-block-paragraph">The term &#8220;context compiler&#8221; might sound like a magical enterprise heavy-lift, but the reality is entirely mundane. In its simplest form, it’s just a deterministic context assembly layer combined with a routing mechanism.</p>



<p class="wp-block-paragraph">Instead of treating context as a flat pile of documents, the compiler assembles it into an ordered structure. Different artifacts apply to different parts of the project: <code>boundaries.md</code> in the <code>/billing</code> module might enforce strict isolation, while the one in <code>/frontend</code> might be much more permissive.</p>



<p class="wp-block-paragraph">In practice, the compiler may take one of these forms:</p>



<p class="wp-block-paragraph"><strong>Manual selection:</strong> The developer simply points their IDE or agent to a structured set of Markdown files.</p>



<p class="wp-block-paragraph"><strong>A mundane script:</strong> A basic Python or bash script that understands a directory structure. It concatenates the <code>.md</code> files to build the LLM&#8217;s system prompt and hands the <code>.yml</code> files directly to the CI runner.</p>



<p class="wp-block-paragraph"><strong>Tool-mediated context protocols:</strong> Dedicated mechanisms (e.g., MCP) that allow the agent to query the workspace and dynamically assemble the required boundaries directly within the IDE, bypassing the need for manual script invocation.</p>



<p class="wp-block-paragraph">Consider a practical directory structure:</p>



<pre class="wp-block-code"><code>/context
  /global
    coding-standards.md
  /domain
    /billing
      boundaries.md
      threat-model.md
      semgrep-rule.yml
    /risk
      boundaries.md
      threat-model.md
      semgrep-rule.yml
    /frontend
      boundaries.md
      threat-model.md
      semgrep-rule.yml</code></pre>



<p class="wp-block-paragraph">When generating code for the billing module, the script reads <code>/context/global</code> and <code>/context/domain/billing</code>. The compiler simply scopes the rules by directory, focusing the agent&#8217;s attention on the boundaries that matter while wiring the corresponding YAML rules for deterministic CI verification.</p>
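<p class="wp-block-paragraph">A minimal sketch of such a script, assuming the directory layout shown above. The function name <code>compile_context</code>, the section headers it emits, and the sample file contents are all hypothetical; the sketch only illustrates directory-scoped assembly and orders files alphabetically within each scope.</p>

```python
import tempfile
from pathlib import Path


def compile_context(context_root, module):
    """Return (prompt_text, rule_files) for one module under context_root."""
    root = Path(context_root)
    # Global artifacts first, then the module's own directory.
    scopes = [root / "global", root / "domain" / module]
    sections, rule_files = [], []
    for scope in scopes:
        if not scope.is_dir():
            continue
        for path in sorted(scope.iterdir()):
            if path.suffix == ".md":
                label = path.relative_to(root).as_posix()
                body = path.read_text(encoding="utf-8").strip()
                sections.append("## " + label + "\n" + body)
            elif path.suffix == ".yml":
                # YAML rules never enter the prompt; they go to the CI runner.
                rule_files.append(path)
    return "\n\n".join(sections), rule_files


# Demo against a throwaway tree that mirrors the layout above.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    sample_files = {
        "global/coding-standards.md": "Prefer explicit domain logic.",
        "domain/billing/boundaries.md": "Module: Billing",
        "domain/billing/semgrep-rule.yml": "rules: []",
        "domain/risk/boundaries.md": "Module: Risk",
    }
    for rel, text in sample_files.items():
        target = demo_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
    prompt, rules = compile_context(demo_root, "billing")
    print(prompt)
    print([rule.name for rule in rules])
```

<p class="wp-block-paragraph">In the demo, the risk module's boundaries never reach the billing prompt, and the only file handed to CI is the billing rule file.</p>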



<h3 class="wp-block-heading">Step 3: Strict boundary hierarchy (resolving conflicts)</h3>



<p class="wp-block-paragraph">When faced with conflicting instructions, LLMs don’t throw a compilation error. They hallucinate a dangerous compromise. The compiler prevents this by enforcing a deterministic precedence of declared constraints before the prompt is assembled:</p>



<p class="wp-block-paragraph"><strong>Threat model &gt; Boundaries &gt; Coding standards &gt; Intent + acceptance criteria</strong></p>



<p class="wp-block-paragraph">Security and architectural boundaries unconditionally overrule feature delivery. This operates at two levels. At the prompt level (soft enforcement), constraint ordering biases generation toward compliant solutions. At the postgeneration level (hard enforcement), deterministic code checks parse the generated syntax, verify structural invariants, and instantly fail the build on violation.</p>
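<p class="wp-block-paragraph">The prompt-level half of that precedence can be sketched as a plain, deterministic sort. The <code>PRECEDENCE</code> table and <code>order_artifacts</code> function below are hypothetical names; the ranks follow the hierarchy stated above, and refusing unranked files is a design assumption of this sketch, not something the pattern prescribes.</p>

```python
# Lower number = higher precedence, following the hierarchy stated above.
PRECEDENCE = {
    "threat-model.md": 0,
    "boundaries.md": 1,
    "coding-standards.md": 2,
    "intent.md": 3,
    "acceptance-criteria.md": 3,
}


def order_artifacts(filenames):
    """Sort artifact filenames by declared precedence; reject unranked ones."""
    unknown = sorted(name for name in filenames if name not in PRECEDENCE)
    if unknown:
        # Failing loudly avoids silently placing an unranked file in the prompt.
        raise ValueError("No declared precedence for: " + ", ".join(unknown))
    # Ties (intent and acceptance criteria) are broken alphabetically.
    return sorted(filenames, key=lambda name: (PRECEDENCE[name], name))


print(order_artifacts(["intent.md", "coding-standards.md", "boundaries.md", "threat-model.md"]))
# ['threat-model.md', 'boundaries.md', 'coding-standards.md', 'intent.md']
```

<p class="wp-block-paragraph">Ordering alone only biases generation. The hard guarantee still comes from the postgeneration checks.</p>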



<p class="wp-block-paragraph">&#8220;Resolution&#8221; in this context does not mean an LLM philosophically negotiating between two Markdown files. It means <em>deterministic rejection via CI</em>. If the <code>intent.md</code> asks to &#8220;email a receipt to the user,&#8221; but <code>boundaries.md</code> forbids external network calls in the billing module, an unconstrained AI might try to generate an SMTP call. The conflict is mechanically &#8220;resolved&#8221; when the CI pipeline runs a static rule (derived from <code>semgrep-rule.yml</code>) and instantly fails the build. The developer (context orchestrator) must then intervene and change the design to use an event bus instead. The hierarchy is enforced by deterministic code analysis, not LLM reasoning. A rejected build is not necessarily a rejected business need; it’s a signal that declared boundaries and intended capability must be reconciled explicitly before regeneration. (This mechanical rejection physically executes during the adversarial verification phase in step 5).</p>



<p class="wp-block-paragraph">We do not use AI for this validation. We use existing, proven AST tools and code linters like <a href="https://semgrep.dev/" target="_blank" rel="noreferrer noopener">Semgrep</a>, <a href="https://bandit.readthedocs.io/" target="_blank" rel="noreferrer noopener">Bandit</a>, or <a href="https://codeql.github.com/" target="_blank" rel="noreferrer noopener">CodeQL</a> to enforce these boundaries in CI/CD.</p>



<p class="wp-block-paragraph">However, we must be precise about what this governance actually achieves. Deterministic checks enforce invariants, not the architecture as a whole. You can statically enforce forbidden imports, forbidden outbound I/O, strict layering, and schema conformance. You cannot statically enforce domain semantics, aggregate ownership correctness, subtle coupling, or conceptual cohesion. Deterministic verification doesn’t prove architectural correctness. It proves compliance with explicitly declared structural invariants.</p>



<h3 class="wp-block-heading">Step 4: Generation</h3>



<p class="wp-block-paragraph">Context as code matters only if generated syntax is verified against the same boundaries that shaped it. With a compiled, conflict-free context hierarchy, the developer agent generates code inside an isolated user-space sandbox. In this step, the agent inside the developer&#8217;s IDE consumes the narrowed, precompiled system prompt and outputs the actual <code>payment_service.py</code>. Its role is constrained synthesis: translating the boundaries in <code>boundaries.md</code> and the imperatives in <code>intent.md</code> into code.</p>



<h3 class="wp-block-heading">Step 5: Adversarial verification (negative space)</h3>



<p class="wp-block-paragraph">This phase checks whether the generated code crossed a forbidden boundary. Before the development cycle begins, the adversarial context provider defines threat vectors in <code>threat-model.md</code>. Because a Markdown file only guides the LLM softly, the governance platform engineer bridges the gap to determinism by translating those declarative threats into matching executable rules (like <code>semgrep-rule.yml</code>) wired into the CI gates. If the threat model identifies server-side request forgery or secrets exfiltration as a risk for the <code>/frontend</code> module, the corresponding CI rule parses the generated code and instantly fails the build if a known attack pattern or insecure execution sink is detected.</p>



<p class="wp-block-paragraph">The pipeline doesn’t ask an LLM to read the Markdown and assess if the code is safe. It mechanically executes the prewritten rules derived from it. If a generative agent helps draft the rule set, it does so before the cycle in an isolated sandbox, and a human reviews the result before it enters CI. Step 5 doesn’t prove overall correctness; it proves that declared structural and security boundaries are enforced.</p>



<p class="wp-block-paragraph">Like any static gate, deterministic boundary checks trade flexibility for safety and will occasionally reject valid implementations. That friction is intentional: Explicit override and artifact refinement are part of the governance loop.</p>



<p class="wp-block-paragraph">AI code review may identify suspicious code, but it cannot certify that declared boundaries survived generation. Step 5 therefore relies on deterministic CI rules, not on a probabilistic model interpreting the pull request.</p>



<h3 class="wp-block-heading">Step 6: Acceptance verification (positive space)</h3>



<p class="wp-block-paragraph">This phase checks whether the generated code solves the business problem. The <code>acceptance-criteria.md</code> defines the expected behavior not as a vague user story, but as a machine-executable contract (e.g., using Gherkin syntax):</p>



<pre class="wp-block-code"><code>Scenario: Successful payment emits notification
  Given a valid payment of 100 EUR
  When the transaction completes
  Then the PaymentSuccessEvent is published to the message bus</code></pre>



<p class="wp-block-paragraph">The CI pipeline parses this exact Markdown block and runs the corresponding test suite. Step 6 provides what step 5 cannot: verification against a declared delivery contract.</p>
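<p class="wp-block-paragraph">As a rough illustration, the scenario above could map to a test like the following. Everything except the <code>PaymentProcessor</code> and <code>PaymentSuccessEvent</code> names is an assumption: the <code>InMemoryBus</code> stand-in, the <code>process</code> method, and the event fields are hypothetical, and a real suite would more likely bind the Gherkin steps through a BDD framework than hand-write the mapping.</p>

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentSuccessEvent:
    amount: Decimal
    currency: str


class InMemoryBus:
    """Hypothetical stand-in for the message bus; records published events."""

    def __init__(self):
        self.published = []

    def publish(self, event):
        self.published.append(event)


class PaymentProcessor:
    """Billing logic that only talks to the bus, never to SMTP or HTTP."""

    def __init__(self, bus):
        self.bus = bus

    def process(self, amount, currency):
        if not amount > 0:
            raise ValueError("payment amount must be positive")
        # The notification itself is some subscriber's job, not billing's.
        self.bus.publish(PaymentSuccessEvent(amount, currency))


def test_successful_payment_emits_notification():
    bus = InMemoryBus()
    processor = PaymentProcessor(bus)
    # Given a valid payment of 100 EUR, when the transaction completes
    processor.process(Decimal("100"), "EUR")
    # Then the PaymentSuccessEvent is published to the message bus
    assert bus.published == [PaymentSuccessEvent(Decimal("100"), "EUR")]


test_successful_payment_emits_notification()
```

<p class="wp-block-paragraph">The test asserts on the published event rather than on any email being sent, so it passes only for an implementation that respects the billing boundary.</p>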



<p class="wp-block-paragraph">The code is approved only when it passes adversarial checks <em>and</em> satisfies the acceptance criteria. Without step 5, the system could violate structural boundaries. Without step 6, it could implement the wrong intent. Both contracts must hold.</p>



<h2 class="wp-block-heading">Artifact-bound roles and accountability</h2>



<p class="wp-block-paragraph">The traditional SDLC is a linear cascade: Requirements flow to architecture, then to code, then to QA. In an era where a machine generates 10,000 lines of syntax in the time it takes to fetch a coffee, that handoff is a fatal bottleneck.</p>



<p class="wp-block-paragraph">In the context matrix, specialists define parallel, independent constraint vectors <em>before</em> generation begins. The titles on business cards stay the same. The artifacts they produce change entirely.</p>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Old role</strong></td><td><strong>New role</strong></td><td><strong>Artifact</strong></td><td><strong>Responsibility</strong></td></tr><tr><td>Business analyst</td><td><strong>Intent definer</strong></td><td><code>intent.md</code> + <br><code>acceptance-criteria.md</code></td><td>Define the &#8220;what&#8221; and the deterministic proof that it was delivered</td></tr><tr><td>Software architect</td><td><strong>World builder</strong></td><td><code>boundaries.md</code></td><td>Define domain ontology, architectural invariants, and allowed interaction patterns</td></tr><tr><td>QA &amp; security engineer</td><td><strong>Adversarial context provider</strong></td><td><code>threat-model.md</code></td><td>Define threat vectors and abuse paths <em>before</em> generation</td></tr><tr><td>Platform engineer/DevOps</td><td><strong>Governance platform engineer</strong></td><td>Compiler pipeline + CI gates (<code>semgrep-rule.yml</code>)&nbsp;</td><td>Operationalize declared constraints into nonbypassable enforcement gates</td></tr><tr><td>Developer</td><td><strong>Context orchestrator</strong></td><td><code>coding-standards.md</code> + critical code</td><td>Resolve artifact conflicts, steer generation workflows, implement critical paths, and refine context quality</td></tr></tbody></table></figure>



<p class="wp-block-paragraph">In this model, accountability is distributed and artifact bound. Rather than handing off work downstream, each role owns specific upstream activities and constraints.</p>



<ul class="wp-block-list">
<li><strong>The intent definer (formerly business analyst):</strong> Owns the business reality. They translate user needs into <code>intent.md</code> and define hard <code>acceptance-criteria.md</code> (like BDD scenarios or API contracts). Their job is to formulate requirements so strictly that the pipeline can automatically prove delivery, acting as the first line of defense against vague &#8220;vibe coding.&#8221;</li>



<li><strong>The world builder (formerly software architect):</strong> Owns the structural gravity. They write <code>boundaries.md</code> to establish the domain ontology and hard architectural boundaries. Instead of reviewing pull requests for drift, their daily activity is defining what modules are allowed to communicate and declaring the structural invariants the generated code must respect.</li>



<li><strong>The adversarial context provider (formerly QA and security):</strong> Owns the negative space. They anticipate failure modes and define threat vectors via <code>threat-model.md</code>. Their responsibility is identifying the precise abuse paths that the CI pipeline must block, ensuring an LLM never tests its own code.</li>



<li><strong>The governance platform engineer (formerly platform engineer/DevOps):</strong> Owns the enforcement machinery. They build the context compiler pipeline and operationalize declared constraints into nonbypassable enforcement gates. Their responsibility is the deterministic enforcement pipeline that executes declared governance artifacts at precommit and CI/CD boundaries.</li>



<li><strong>The context orchestrator (formerly developer):</strong> Owns generation orchestration and critical handwritten paths. This is a hybrid reality, not the end of programming. They write <code>coding-standards.md</code>, manually implement zero-trust paths, and resolve runtime exception requests. For the bulk of the system, their focus shifts to a meta-level: resolving conflicting constraints, tuning the prompt&#8217;s signal-to-noise ratio, and debugging why a given artifact failed to govern the agent properly.</li>
</ul>



<p class="wp-block-paragraph">When a failure occurs, the investigation shifts from &#8220;What was the agent thinking?&#8221; to &#8220;Which contract failed to govern?&#8221; Because the pipeline deterministically enforces what was explicitly declared, failures are no longer opaque hallucinations. They’re traceable collisions between artifact boundaries. A structural flaw cleanly points to an unbounded <code>boundaries.md</code>. When the pipeline is green and the contracts are honest, the orchestrator acts as a firewall against process failure, not a scapegoat for undocumented assumptions.</p>
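


<p class="wp-block-paragraph">One way to picture that traceability is a lookup from a failure class to the artifact that should have governed it, and from there to the role that owns the artifact. The ownership map below is copied from the roles table. The failure classes are an assumption made for illustration: Only the structural case is stated in the text, and <code>trace_failure</code> is a hypothetical helper, not part of the reference implementation.</p>



```python
# Artifact ownership, copied from the roles table in this article.
ARTIFACT_OWNERS = {
    "intent.md": "intent definer",
    "acceptance-criteria.md": "intent definer",
    "boundaries.md": "world builder",
    "threat-model.md": "adversarial context provider",
    "semgrep-rule.yml": "governance platform engineer",
    "coding-standards.md": "context orchestrator",
}

# Assumption: illustrative failure classes. Only "structural" comes from the text.
FAILURE_TO_ARTIFACT = {
    "structural": "boundaries.md",
    "abuse-path": "threat-model.md",
    "wrong-intent": "acceptance-criteria.md",
}


def trace_failure(failure_class: str) -> tuple:
    """Return (artifact, owning role) for a failure class, or raise if undeclared."""
    if failure_class not in FAILURE_TO_ARTIFACT:
        # An unmapped failure is itself a finding: no contract claimed to govern it.
        raise ValueError(f"No governing artifact declared for {failure_class!r}")
    artifact = FAILURE_TO_ARTIFACT[failure_class]
    return artifact, ARTIFACT_OWNERS[artifact]


if __name__ == "__main__":
    print(trace_failure("structural"))
```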



<figure class="wp-block-image size-large"><img decoding="async" width="1600" height="780" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-1600x780.png" alt="The decision boundary architecture: Context compilation governs generation, ROA structures intent, and DIR validates execution." class="wp-image-18841" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-1600x780.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-300x146.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-768x375.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2-1536x749.png 1536w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/image-2.png 2048w" sizes="(max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 3. The decision boundary architecture: Context compilation governs generation, ROA structures intent, and DIR validates execution.</em></figcaption></figure>



<h2 class="wp-block-heading">The economics of governance</h2>



<p class="wp-block-paragraph">Context compilation makes economic sense only when the cost of architectural failure exceeds the cost of explicit governance. It adds upfront design work and cognitive overhead, so its value depends on how expensive a wrong system decision would be.</p>



<p class="wp-block-paragraph">For rapid prototyping, throwaway utility scripts, marketing sites, or low-stakes internal tools—where the worst-case consequence of a hallucination is a misaligned dashboard—let the generative engines run unconstrained. Velocity is the only thing that matters.</p>



<p class="wp-block-paragraph">For safety-critical automation, trading platforms, healthcare orchestrators, and regulated enterprise systems, the economics invert. Velocity without deterministic boundaries is simply the speed at which you accumulate liability. A single unconstrained agent importing an insecure dependency into a payment core costs orders of magnitude more than the engineer-hours spent writing a <code>boundaries.md</code> contract.</p>



<p class="wp-block-paragraph">You don’t build a bank vault door for a garden shed. You apply context compilation where the systemic cost of emergent architectural failure is catastrophic.</p>
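


<p class="wp-block-paragraph">The trade-off can be written as a simple expected-cost comparison. The sketch below assumes the cost of failure is weighted by a rough probability that an unconstrained agent causes it, which is a modeling assumption added here rather than a formula from this series. The function name and every figure in the example are hypothetical placeholders, not measured data.</p>



```python
def governance_pays_off(
    failure_probability: float,
    failure_cost: float,
    governance_cost: float,
) -> bool:
    """True when the expected cost of failure exceeds the cost of governance."""
    return failure_probability * failure_cost > governance_cost


if __name__ == "__main__":
    # Hypothetical payment core: rare failure, very expensive consequence.
    print(governance_pays_off(0.02, 5_000_000, 40_000))
    # Hypothetical internal dashboard: same odds, cheap consequence.
    print(governance_pays_off(0.02, 10_000, 40_000))
```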



<h2 class="wp-block-heading">Automating the word &#8220;NO&#8221;</h2>



<p class="wp-block-paragraph">When code generation becomes cheap, architectural entropy tends to scale with it. That makes post hoc code review less effective, especially when reviewers spend their attention on machine-generated boilerplate. A more durable approach is <em>context review</em>: peer review of the declarative constraints that shape what the machine is allowed to build. A reviewed <code>boundaries.md</code> can guide many later development cycles. A reviewed pull request usually governs only a single change.</p>



<p class="wp-block-paragraph">The discipline has shifted from imperative engineering of procedures to declarative engineering of boundaries.</p>



<p class="wp-block-paragraph">Let’s return to the Jira ticket that started this discussion: &#8220;Add an email notification after a successful payment.&#8221;</p>



<p class="wp-block-paragraph">The business analyst submits the <code>intent.md</code>. Before the developer agent sees the prompt, and before a line is written, the context compiler activates at the precommit gate or via tool-mediated context protocols (e.g., script or MCP) in the IDE. It retrieves the architect&#8217;s <code>boundaries.md</code>, which states, &#8220;The <code>/domain</code> module has zero external dependencies. No network calls.&#8221; The SMTP import collides with that boundary instantly. The prompt biases generation toward compliant solutions, and even if the agent generates the import anyway, the build will not survive it: The deterministic static check in step 5 rejects it at the declared boundary. The Frankenstein is caught in the pipeline, not discovered in production three release cycles later.</p>
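


<p class="wp-block-paragraph">A deterministic check of this kind can be very small. The sketch below uses Python&#8217;s standard <code>ast</code> module as a stand-in for the Semgrep rules named earlier, and it assumes the <code>/domain</code> module is written in Python. The allowlist, the <code>forbidden_imports</code> helper, and the generated snippet are all hypothetical. The point is only that the rule is executed mechanically, with no model interpreting the code.</p>



```python
import ast

# Assumption: a hand-written allowlist standing in for rules compiled from boundaries.md.
ALLOWED_DOMAIN_IMPORTS = {"dataclasses", "decimal", "typing"}

# Hypothetical agent output for a file under /domain.
GENERATED = """\
from decimal import Decimal
import smtplib

def complete_payment(amount: Decimal) -> None:
    smtplib.SMTP("mail.example.com")
"""


def forbidden_imports(source: str, allowed: set) -> list:
    """Return (line, module) pairs for every import outside the allowlist."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            if node.level:
                continue  # relative import: stays inside the package
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare only the top-level package, so "email.mime" is judged as "email".
            if name.split(".")[0] not in allowed:
                violations.append((node.lineno, name))
    return sorted(violations)


if __name__ == "__main__":
    found = forbidden_imports(GENERATED, ALLOWED_DOMAIN_IMPORTS)
    for line, module in found:
        print(f"boundary violation at line {line}: import of {module}")
    # A nonzero exit code is what makes the CI gate fail.
    raise SystemExit(1 if found else 0)
```



<p class="wp-block-paragraph">An allowlist is stricter than a blocklist: A dependency nobody anticipated is rejected by default, which matches the &#8220;zero external dependencies&#8221; wording of the boundary.</p>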



<p class="wp-block-paragraph">Code generation is becoming abundant. Architectural discipline is becoming scarce.</p>



<p class="wp-block-paragraph">Context as code governs what may be generated. Responsibility-oriented agents govern what may be proposed. Decision Intelligence Runtime governs what may be executed. Three boundaries. One governing frame.</p>



<p class="wp-block-paragraph">The highest-value engineering skill is no longer writing syntax. It’s engineering the conditions under which correct syntax can emerge.</p>



<p class="wp-block-paragraph">That is the ability to automate the word &#8220;NO.&#8221;</p>



<p class="wp-block-paragraph"><em>This article concludes the three-part series on engineering boundaries in agentic AI. The repository at <a href="https://github.com/huka81/decision-intelligence-runtime" target="_blank" rel="noreferrer noopener">github.com/huka81/decision-intelligence-runtime</a> contains an open source reference implementation of the concepts described in this series.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/context-as-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Radar Trends to Watch: June 2026</title>
		<link>https://www.oreilly.com/radar/radar-trends-to-watch-june-2026/</link>
				<comments>https://www.oreilly.com/radar/radar-trends-to-watch-june-2026/#respond</comments>
				<pubDate>Tue, 02 Jun 2026 10:58:22 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[Radar Trends]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18834</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-7.png" 
				medium="image" 
				type="image/png" 
				width="1400" 
				height="950" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2023/06/radar-1400x950-7-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Developments in policy and governance, infrastructure and ops, AI models, and more]]></custom:subtitle>
		
				<description><![CDATA[Coauthored with Claude Agents are making the transition from performing tasks to running operations. The Cloudflare and Stripe partnership ships an agent that opens accounts, registers domains, and deploys an application on its own (details), while Stripe/Tempo and iWallet have each published machine-to-machine payment protocols to make that kind of work a standard. Office documents, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph"><em>Coauthored with Claude</em></p>



<p class="wp-block-paragraph">Agents are making the transition from performing tasks to running operations. The Cloudflare and Stripe partnership ships an agent that opens accounts, registers domains, and deploys an application on its own (<a href="https://www.infoworld.com/article/4165857/are-we-ready-to-give-ai-agents-the-keys-to-the-cloud-cloudflare-thinks-so.html" target="_blank" rel="noreferrer noopener">details</a>), while Stripe/Tempo and iWallet have each published machine-to-machine payment protocols to make that kind of work a standard. Office documents, browser sessions, and, in one announcement, the phone interface itself are next on the list. View the expanded role of agents as an opportunity for humans to accomplish more.</p>



<h2 class="wp-block-heading">AI Models</h2>



<p class="wp-block-paragraph">The model menagerie keeps expanding in size and shape. Open weight contenders run at frontier capability on modest hardware, while specialist models for voice, conversation timing, and privacy filtering take over what used to be features inside one general chat model. Treat your prompts and skills as portable; the model behind them will change.</p>



<ul class="wp-block-list">
<li>Anthropic has <a href="https://www.anthropic.com/news/claude-opus-4-8" target="_blank" rel="noreferrer noopener">released</a> Claude Opus 4.8. This model is not Mythos, which they expect to release soon. Opus 4.8 is a “modest improvement” that claims better results on coding and greater likelihood of informing users when it is uncertain about claims. Changes to the agents may be more important. Claude Code now has the ability to plan solutions to large problems involving hundreds of subagents (“dynamic workflows”); Cowork can control the effort put into solving a problem.</li>



<li>Cohere&#8217;s <a href="https://cohere.com/blog/command-a-plus" target="_blank" rel="noreferrer noopener">Command A+</a> is an open weight mixture-of-experts model with 218B parameters, 25B active. It’s competitive with frontier models and requires relatively little hardware to run: Two H100s isn&#8217;t small, but it&#8217;s not a data center either.</li>



<li>Google&#8217;s announcements at this year’s I/O conference include <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/" target="_blank" rel="noreferrer noopener">Omni</a>, a new model that takes any kind of input (video, audio, image) and generates any kind of output; <a href="https://ai.google.dev/gemini-api/docs/interactions/whats-new-gemini-3.5" target="_blank" rel="noreferrer noopener">Gemini 3.5 Flash</a>, a fast and efficient update to their coding model; <a href="https://gemini.google/overview/agent/spark/" target="_blank" rel="noreferrer noopener">Gemini Spark</a>, a personal agent; and <a href="https://blog.google/products-and-platforms/platforms/android/android-xr-io-2026/" target="_blank" rel="noreferrer noopener">intelligent eyewear</a>, another attempt at smart glasses.</li>



<li>Alibaba has <a href="https://qwen.ai/blog?id=qwen3.7" target="_blank" rel="noreferrer noopener">announced</a> Qwen3.7-Max, its most capable model.</li>



<li>Thinking Machines has <a href="https://thinkingmachines.ai/blog/interaction-models/" target="_blank" rel="noreferrer noopener">announced</a> a research preview of interaction models. These models support natural conversation flow. The model can wait for a speaker to finish, interrupt the speaker, respond when the speaker interrupts the model, and keep track of time.</li>



<li>OpenAI has <a href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/" target="_blank" rel="noreferrer noopener">released</a> new voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They’re moving from call-and-response models to models that can take part in conversations, reason, and take actions.</li>



<li>OpenRouter published cost studies for both <a href="https://openrouter.ai/announcements/opus-47-tokenizer-analysis" target="_blank" rel="noreferrer noopener">Claude Opus 4.7</a> and <a href="https://openrouter.ai/announcements/gpt55-cost-analysis" target="_blank" rel="noreferrer noopener">GPT-5.5</a>. GPT-5.5 raised the token price but reduced the number of tokens in a typical conversation. Claude kept prices the same, but conversations tend to require more tokens. What&#8217;s the impact on your monthly bill?</li>



<li>Google has <a href="https://arstechnica.com/ai/2026/05/googles-gemma-4-open-ai-models-use-speculative-decoding-to-get-up-to-3x-faster/" target="_blank" rel="noreferrer noopener">updated</a> its Gemma 4 models, claiming that they triple token generation speed. They use a technique called <a href="https://x.com/googlegemma/status/2051694045869879749" target="_blank" rel="noreferrer noopener">multi-token prediction</a> (MTP) to draft a sequence of tokens with a very small model and then approve those tokens with the large model.</li>



<li>IBM released <a href="https://research.ibm.com/blog/granite-4-1-ai-foundation-models" target="_blank" rel="noreferrer noopener">Granite 4.1</a>, a collection of small models (30B parameters and down).</li>



<li>An academic paper describes “<a href="https://arxiv.org/abs/2510.22977" target="_blank" rel="noreferrer noopener">the reasoning trap</a>,” a phenomenon in which training models for increased reasoning also increases hallucinations about tool use.</li>



<li><a href="https://talkie-lm.com/chat" target="_blank" rel="noreferrer noopener">Talkie</a> is an LLM that was trained only on data from 1931 and earlier. If you want to know what it was like to live during the start of the Depression, this is the LLM to ask.</li>



<li>OpenAI has <a href="https://openai.com/index/introducing-openai-privacy-filter/" target="_blank" rel="noreferrer noopener">announced</a> a <a href="https://huggingface.co/openai/privacy-filter" target="_blank" rel="noreferrer noopener">privacy filter model</a>. This is a small specialized model (1.5B) that can run on phones and other small devices. It removes personally identifiable information (PII) from text documents.</li>
</ul>



<h2 class="wp-block-heading">Software Development</h2>



<p class="wp-block-paragraph">We are beginning to see anecdotal evidence that the brief era of <a href="https://thenewstack.io/opus-4-8-claude-smarter-token-discipline-urgent/" target="_blank" rel="noreferrer noopener">tokenmaxxing is coming to an end</a>. Agents may increase productivity, but they can also use tokens at an astonishing rate. So can the latest models, like Anthropic’s Claude Opus 4.8, with new features like dynamic workflows. Employers are realizing that the only way to measure productivity is to look at the quality of an employee’s work rather than relying on an artificial (and easily gameable) metric like token use. Teams that use AI effectively will be disciplined about token use; they’ll choose lower-cost (or local) models where possible, reaching for expensive models like Claude Opus 4.8 only when necessary.</p>



<ul class="wp-block-list">
<li>The Agentic AI Foundation is <a href="https://aaif.io/blog/mcp-is-growing-up/" target="_blank" rel="noreferrer noopener">updating</a> MCP, with a <a href="https://blog.modelcontextprotocol.io/posts/2026-07-28-release-candidate/" target="_blank" rel="noreferrer noopener">release candidate</a> scheduled for July 28. Changes include making MCP a stateless protocol, adding a process for creating extensions, and aligning authorization with the OAuth and OpenID standards.</li>



<li>Google is <a href="https://developers.googleblog.com/an-important-update-transitioning-gemini-cli-to-antigravity-cli/" target="_blank" rel="noreferrer noopener">dropping Gemini CLI</a> and putting all of its effort behind <a href="https://antigravity.google/" target="_blank" rel="noreferrer noopener">Antigravity</a>, its agentic software development platform. There are desktop and command line versions of Antigravity, but unlike Gemini CLI, neither is open source.</li>



<li>What shall we call <a href="https://steve-yegge.medium.com/welcome-to-gas-city-57f564bb3607" target="_blank" rel="noreferrer noopener">Gas City</a>, created by Julian Knutsen and Chris Sells? Gas Town 2.0? Steve Yegge says it&#8217;s an SDK for building your own &#8220;dark factories&#8221; by deploying teams of collaborating agents in any topology. It&#8217;s &#8220;a pivotal moment in the Mad Max school of agent orchestration.&#8221;</li>



<li>The problem with agentic programming is that agents serve individuals, not groups, and programming is a team sport. Is <a href="https://www.lukew.com/ff/entry.asp?2153" target="_blank" rel="noreferrer noopener">collaborative steering</a> (context management for groups) an answer?</li>



<li>GitHub has <a href="https://github.com/features/preview/github-app" target="_blank" rel="noreferrer noopener">released</a> a preview of its Copilot app, a stand-alone desktop application for coding with AI. It’s completely integrated with GitHub; for example, you can launch tasks directly from GitHub issues.</li>



<li>If you think tokenmaxxing is your path to promotion, check out <a href="https://github.com/dtnewman/burn-baby-burn" target="_blank" rel="noreferrer noopener">burn-baby-burn</a>. It does what it says: burns lots of tokens, fast, using the LLM of your choice. We hope it&#8217;s a parody, but we bet it works.</li>



<li>Mitchell Hashimoto <a href="https://x.com/mitchellh/status/2055039647924007222" target="_blank" rel="noreferrer noopener">tweets</a> that Anthropic&#8217;s rewrite of Bun from Zig to Rust demonstrates that programming languages are now fungible. Programming language lock-in has ended; programs can easily move from one language to another.</li>



<li><a href="https://github.com/NVIDIA/OpenShell?utm_source=the+new+stack&amp;utm_medium=referral&amp;utm_content=inline-mention&amp;utm_campaign=tns+platform" target="_blank" rel="noreferrer noopener">OpenShell</a> is a <a href="https://thenewstack.io/nvidia-openshell-agent-runtime/" target="_blank" rel="noreferrer noopener">runtime environment</a> built with security in mind from the ground up. It’s intended to be used as a secure environment for running agents. Every agent runs in its own sandbox; an external gateway manages credentials and policies.</li>



<li>OpenAI is <a href="https://community.openai.com/t/openai-is-winding-down-the-fine-tuning-api-and-platform-discussion-thread/1380522" target="_blank" rel="noreferrer noopener">shutting down</a> its API for fine-tuning its models. <a href="https://x.com/bradenjhancock/status/2053309599248453999?s=20" target="_blank" rel="noreferrer noopener">They say</a> the current models are better and don&#8217;t require significant fine-tuning. As <em>Latent Space</em> <a href="https://www.latent.space/p/ainews-the-end-of-finetuning" target="_blank" rel="noreferrer noopener">points out</a>, this doesn&#8217;t necessarily mean the end of fine-tuning as a discipline, particularly for open models. But it may be a signal. Drew Breunig <a href="https://www.dbreunig.com/2026/05/10/overfitting-the-harness.html" target="_blank" rel="noreferrer noopener">writes</a> about what this means for agents and harnesses.</li>



<li>Anthropic has <a href="https://claude.com/blog/collaborate-with-claude-across-excel-powerpoint-word-and-outlook" target="_blank" rel="noreferrer noopener">released</a> Claude for Office 365, allowing users to run sessions that cross Word, Excel, and PowerPoint. Integration with Outlook is coming, though Claude for Outlook is currently a separate product.</li>



<li>A <a href="https://developers.openai.com/codex/app/chrome-extension?utm_source=the+new+stack&amp;utm_medium=referral&amp;utm_content=inline-mention&amp;utm_campaign=tns+platform" target="_blank" rel="noreferrer noopener">plugin to Chrome allows Codex to use Chrome</a> for browser tasks that require you to be logged in—for example, reading email.</li>



<li><a href="https://www.firecrawl.dev/" target="_blank" rel="noreferrer noopener">Firecrawl</a> is an API that agents can use to interact with websites in a human way. It enables agents to search for the latest data, interact with the site, and return the results at scale.</li>



<li>Drew Breunig&#8217;s “<a href="https://www.dbreunig.com/2026/05/04/10-lessons-for-agentic-coding.html" target="_blank" rel="noreferrer noopener">10 Lessons for Agentic Coding</a>” is an invaluable list of tips, including &#8220;Implement to learn.&#8221; Letting an agent write all the code is easy, but when you really need to learn something, write it by hand first.</li>



<li><a href="https://github.com/aattaran/deepclaude" target="_blank" rel="noreferrer noopener">Deepclaude</a> configures Claude&#8217;s autonomous agent loop to use DeepSeek V4 Pro rather than one of Anthropic&#8217;s models. It&#8217;s a good way to save money (DeepSeek costs much less per token) and experiment with open models. (Fair warning: The name deepclaude may change.)</li>



<li>OpenAI has announced <a href="https://chatgpt.com/codex/for-work/" target="_blank" rel="noreferrer noopener">Codex for Work</a>, an assistant that&#8217;s designed for office work rather than software development.</li>



<li><a href="https://github.com/kanwas-ai/kanwas" target="_blank" rel="noreferrer noopener">Kanwas</a> is a new tool for sharing context across agents. It can be used by workgroups to collaborate on projects.</li>



<li><a href="https://mikeoss.com/" target="_blank" rel="noreferrer noopener">Mike</a> is an open source AI trained for legal work and designed to run locally.</li>



<li>GitHub is <a href="https://arstechnica.com/ai/2026/04/github-will-start-charging-copilot-users-based-on-their-actual-ai-usage/" target="_blank" rel="noreferrer noopener">transitioning</a> to <a href="https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/" target="_blank" rel="noreferrer noopener">usage-based billing for Copilot</a>.</li>



<li>OpenAI and Qualcomm are reportedly <a href="https://thenextweb.com/news/openai-qualcomm-ai-phone-agents-replace-apps" target="_blank" rel="noreferrer noopener">working on a phone</a> where the user interface is an agent. There won&#8217;t be any apps; the agent will do everything.</li>
</ul>



<h2 class="wp-block-heading">Infrastructure and Operations</h2>



<p class="wp-block-paragraph">The infrastructure questions of the moment are whether agents can transact and deploy without humans, and whether the platforms that host open source can stay reliable enough to keep that work going. Watch for GitHub alternatives to become competitive. And watch Together AI, a cloud company that hosts hundreds of open weight models.</p>



<ul class="wp-block-list">
<li><a href="https://www.withlanai.com/products/tokentuner" target="_blank" rel="noreferrer noopener">TokenTuner</a> helps control AI costs by <a href="https://thenewstack.io/lanai-token-tuner-tokenmaxxing/" target="_blank" rel="noreferrer noopener">identifying</a> where companies can use lower-cost models productively. It attempts to match token usage to business outcomes, and evaluates individuals and teams on how effectively they use their token budget.</li>



<li>In partnership with <a href="https://projects.dev/" target="_blank" rel="noreferrer noopener">Stripe</a>, <a href="https://blog.cloudflare.com/agents-stripe-projects/" target="_blank" rel="noreferrer noopener">Cloudflare</a> now has an <a href="https://www.infoworld.com/article/4165857/are-we-ready-to-give-ai-agents-the-keys-to-the-cloud-cloudflare-thinks-so.html" target="_blank" rel="noreferrer noopener">agent that can create a new account</a>, start a subscription, register a domain name with DNS, and deploy an application without human intervention aside from granting permission.</li>



<li>Stripe and Tempo have <a href="https://thenewstack.io/ai-agent-payment-protocols/" target="_blank" rel="noreferrer noopener">released</a> the Machine Payments Protocol (MPP), and iWallet has laid out a roadmap for the Autonomous Settlement Protocol (ASP). These new protocols are designed to facilitate machine-to-machine transactions, which have to work without a human in the loop.</li>



<li>The <a href="https://www.latent.space/p/ainews-the-inference-inflection" target="_blank" rel="noreferrer noopener">Inference Era</a> is when inference, rather than training, drives AI usage, cost, and infrastructure. GPUs remain important, but the relative demand for CPUs increases.</li>



<li>GitHub is in danger of losing its place at the center of the open source ecosystem. <a href="https://www.theregister.com/2026/04/29/github_says_sorry_and_says/" target="_blank" rel="noreferrer noopener">Problems with uptime</a> are causing projects to find homes elsewhere—<a href="https://www.theregister.com/2026/04/29/mitchell_hashimoto_ghostty_quitting_github/" target="_blank" rel="noreferrer noopener">most recently, Ghostty</a>.</li>



<li><a href="https://www.together.ai/" target="_blank" rel="noreferrer noopener">Together AI</a> operates a cloud AI platform that’s designed <a href="https://rokosbas.beehiiv.com/p/may-20-2026" target="_blank" rel="noreferrer noopener">specifically for inference</a> rather than training and that provides API access to over 200 open weight models. As AI use increases, the ability to run models and provide answers efficiently becomes more important than the ability to train new models.</li>
</ul>



<h2 class="wp-block-heading">Security</h2>



<p class="wp-block-paragraph">The patch window is shrinking to zero, and the attacker&#8217;s toolkit and the defender&#8217;s toolkit now include the same AI models. Any vulnerability disclosed today is being exploited tonight. The good news is that defenders running these tools at scale can close gaps faster than ever; the bad news is that the race never ends.</p>



<ul class="wp-block-list">
<li><a href="https://arstechnica.com/security/2026/05/websites-have-a-new-way-to-spy-on-visitors-analyzing-their-ssd-activity/" target="_blank" rel="noreferrer noopener">FROST</a> is a new technology for surreptitiously discovering what websites a user is visiting. It’s based on measuring the I/O operations on the user’s SSD. FROST requires no interaction from the user and runs entirely in the browser.</li>



<li>Regrettably, neither arcane prompt injection attacks nor cryptocurrency scams are news. But it warms a ham radio enthusiast&#8217;s heart to see <a href="https://www.dexerto.com/entertainment/x-user-tricks-grok-into-sending-them-200000-in-crypto-using-morse-code-3361036/" target="_blank" rel="noreferrer noopener">Morse code used in a prompt injection to scam a crypto trading bot</a>.</li>



<li>TeamPCP, a cybercriminal collective, has <a href="https://arstechnica.com/information-technology/2026/05/a-hacker-group-is-poisoning-open-source-code-at-an-unprecedented-scale/" target="_blank" rel="noreferrer noopener">attacked GitHub</a> by installing a poisoned extension to VS Code. GitHub announced that nearly 4,000 repositories have been compromised, all belonging to GitHub itself; no customer repositories were affected. But anyone who installs corrupted code from GitHub&#8217;s own repositories is vulnerable.</li>



<li><em><a href="https://berryvilleiml.com/docs/no-security-meter-ai.pdf" target="_blank" rel="noreferrer noopener">No Security Meter for AI</a></em> provides an excellent look into the state of AI security.</li>



<li>Cloudflare&#8217;s <a href="https://blog.cloudflare.com/cyber-frontier-models/" target="_blank" rel="noreferrer noopener">report</a> on Project Glasswing and Claude Mythos is worth reading. Mythos is especially noteworthy for its ability to chain vulnerabilities. In real life, few vulnerabilities are exploitable on their own; they become exploitable when they are used in combination with others.</li>



<li>Daniel Stenberg <a href="https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/" target="_blank" rel="noreferrer noopener">reports</a> that Mythos found five potential vulnerabilities in <a href="https://curl.se/" target="_blank" rel="noreferrer noopener">curl</a>, of which one was legitimate. The low count isn&#8217;t surprising, given the quality of the curl team&#8217;s work. What&#8217;s significant is that Mythos was able to find a legitimate vulnerability in software that had been thoroughly audited by humans, traditional tools, and AI.</li>



<li><a href="https://arman-bd.hashnode.dev/i-left-port-22-open-on-the-internet-for-54-days-here-s-who-showed-up" target="_blank" rel="noreferrer noopener">Who showed up?</a> A security researcher ran a honeypot with port 22 open for 54 days, and logged every attempt to log in: 269,000 connection attempts from 7,556 unique IP addresses.</li>



<li>GitHub&#8217;s dependency scanning service for its MCP server is now in <a href="https://github.blog/changelog/2026-05-05-dependency-scanning-with-github-mcp-server-is-in-public-preview/?utm_source=the+new+stack&amp;utm_medium=referral&amp;utm_content=inline-mention&amp;utm_campaign=tns+platform" target="_blank" rel="noreferrer noopener">public preview</a>. It checks code changes for vulnerable dependencies before committing code or opening a pull request.</li>



<li><a href="https://jorijn.com/en/blog/copy-fail-cve-2026-31431-linux-kernel-bug-explained/" target="_blank" rel="noreferrer noopener">Copy.fail</a> is a recently discovered Linux kernel vulnerability that allows unprivileged processes to escalate privileges, and it was exploited within a day of its release. Unlike with most vulnerabilities, running infected programs in a container does not offer protection. The time from release of a zero-day to exploitation in the wild is indeed shrinking.</li>



<li>OpenAI&#8217;s <a href="https://thenextweb.com/news/openai-chatgpt-advanced-security-yubico-passkeys" target="_blank" rel="noreferrer noopener">Advanced Account Security</a> requires a physical key or passkey for access; there are no passwords. The hardware key can be a Yubico key or a compatible hardware token.</li>



<li><a href="https://techcrunch.com/2026/04/30/after-dissing-anthropic-for-limiting-mythos-openai-restricts-access-to-cyber-too/" target="_blank" rel="noreferrer noopener">GPT-5.5 Cyber</a> is a version of GPT-5.5 that has been trained as a security tool. As Anthropic did with Mythos, OpenAI is limiting access to a small group of trusted users.</li>



<li>The Firefox team has <a href="https://blog.mozilla.org/en/firefox/ai-security-zero-day-vulnerabilities/" target="_blank" rel="noreferrer noopener">used Claude Mythos to find 271 previously unknown vulnerabilities</a> in Firefox. While this finding is terrifying, they conclude that defenders now have the advantage. Once you know the vulnerabilities, it&#8217;s possible to close the gap between defenders and attackers.</li>



<li>Claude Code can <a href="https://bdtechtalks.com/2026/04/27/claude-code-api-token-leak/" target="_blank" rel="noreferrer noopener">leak credentials</a> and other secrets to public repos and package registries. When you select &#8220;allow always&#8221; for a specific command, the command and its credentials are stored in a subdirectory of .claude. This directory can inadvertently be incorporated into a package.</li>
</ul>
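
<p class="wp-block-paragraph">One way to guard against the Claude Code leak described in the last item is to check for stray <code>.claude</code> directories before publishing a package. The following is a minimal sketch, not an official tool: the function name <code>find_claude_dirs</code> is hypothetical, and it only reports directories by name without inspecting their contents.</p>

```python
import os

def find_claude_dirs(root):
    """Return the path of every directory named .claude under root, sorted."""
    hits = []
    for dirpath, dirnames, _filenames in os.walk(root):
        # dirnames lists the immediate subdirectories of dirpath
        if ".claude" in dirnames:
            hits.append(os.path.join(dirpath, ".claude"))
    return sorted(hits)
```

<p class="wp-block-paragraph">Running a check like this on the directory you are about to package, and refusing to publish when the result is not empty, is a cheap extra safeguard alongside the ignore rules of your packaging tool.</p>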



<h2 class="wp-block-heading">Policy and Governance</h2>



<ul class="wp-block-list">
<li>The ArXiv preprint repository has <a href="https://xcancel.com/tdietterich/status/2055000956144935055" target="_blank" rel="noreferrer noopener">clarified</a> its code of conduct for AI users. Submitters are responsible for their papers and will be banned for a year if they submit papers that use AI-generated content inappropriately. This includes hallucinated content, references, and plagiarism.</li>



<li>Look to China for new approaches to <a href="https://thenextweb.com/news/china-data-governance-global-standard" target="_blank" rel="noreferrer noopener">data governance</a>. China is treating data as a national resource and building the infrastructure for a data economy.</li>
</ul>



<h2 class="wp-block-heading">Web</h2>



<ul class="wp-block-list">
<li>At its I/O conference, Google <a href="https://blog.google/products-and-platforms/products/search/search-io-2026/#powerful-ai" target="_blank" rel="noreferrer noopener">announced</a> that traditional search will be replaced by AI search, powered by Gemini 3.5 Flash. Both AI search and traditional search (which is really AI-powered) have proven useful. What happens when you eliminate one of the options?</li>



<li><a href="https://www.xda-developers.com/linux-running-inside-pdf-file/" target="_blank" rel="noreferrer noopener">Linux running in a PDF</a>? The PDF format supports JavaScript, and C can be compiled to JavaScript.</li>
</ul>



<h2 class="wp-block-heading">Biology</h2>



<ul class="wp-block-list">
<li>Colossal Biosciences has <a href="https://www.technologyreview.com/2026/05/19/1137471/colossal-biosciences-is-growing-chickens-in-a-3d-printed-container/" target="_blank" rel="noreferrer noopener">developed</a> a 3D-printed artificial eggshell that’s capable of raising chicks from embryos.</li>



<li>Brazil has <a href="https://www.economist.com/the-americas/2026/05/21/why-brazils-government-is-obsessed-with-vaccines" target="_blank" rel="noreferrer noopener">invested heavily</a> in vaccines and has created a single-shot vaccine against dengue fever. The country is striving for “medical sovereignty,” a concept that’s clearly related to data sovereignty and AI sovereignty.</li>
</ul>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/radar-trends-to-watch-june-2026/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>AI Sovereignty and the Architecture of Participation</title>
		<link>https://www.oreilly.com/radar/ai-sovereignty-and-the-architecture-of-participation/</link>
				<comments>https://www.oreilly.com/radar/ai-sovereignty-and-the-architecture-of-participation/#respond</comments>
				<pubDate>Mon, 01 Jun 2026 16:05:58 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18818</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Image-by-ChatGPT-5.5-Earth-from-space-at-night-as-a-federated-distributed-network.png" 
				medium="image" 
				type="image/png" 
				width="512" 
				height="288" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Image-by-ChatGPT-5.5-Earth-from-space-at-night-as-a-federated-distributed-network-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Adam Tooze recently shared a piece from The Economist about Brazil&#8217;s push for what it calls &#8220;medical sovereignty,&#8221; the determination to make its own vaccines and the active ingredients that go into its medicines rather than depend on supply chains it doesn&#8217;t control. Brazil already produces a large share of its own medicines through public [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Adam Tooze recently <a href="https://adamtooze.substack.com/p/top-links-1115-claiming-medical-sovereignty" target="_blank" rel="noreferrer noopener">shared</a> a piece from <em>The Economist</em> about <a href="https://www.economist.com/the-americas/2026/05/21/why-brazils-government-is-obsessed-with-vaccines" target="_blank" rel="noreferrer noopener">Brazil&#8217;s push for what it calls &#8220;medical sovereignty,&#8221;</a> the determination to make its own vaccines and the active ingredients that go into its medicines rather than depend on supply chains it doesn&#8217;t control. Brazil already produces a large share of its own medicines through public institutions like Fiocruz and Butantan, but a lot of the underlying inputs still come from abroad, and the pandemic made clear the cost of that dependence. So the country is trying to build the capacity to make the things it most needs to survive. The economist behind a lot of this thinking is <a href="https://marianamazzucato.com/" target="_blank" rel="noreferrer noopener">Mariana Mazzucato</a>, whose mission-oriented approach treats public procurement as a tool to build national capacity rather than just buy finished goods. (<a href="https://foreignpolicy.com/2024/01/26/brazil-lula-industrial-policy-economy-mission-mazzucato/" target="_blank" rel="noreferrer noopener"><em>Foreign Policy</em> has a good overview</a>.)</p>



<p class="wp-block-paragraph">I think we&#8217;re going to see a lot more of this, and not only in medicine. The same impulse is driving the quest for sovereign AI, as countries decide they don&#8217;t want their access to a foundational technology to run through a handful of American or Chinese companies. You can see it too in Europe&#8217;s and Japan&#8217;s new willingness to take responsibility for their own military destiny rather than assume the United States will always be there.</p>



<p class="wp-block-paragraph">Most commentators describe all of this as decoupling, the unwinding of a connected world. That reading is too narrow.</p>



<h2 class="wp-block-heading">Free trade was an architecture of participation that broke</h2>



<p class="wp-block-paragraph">Much like open source software and the World Wide Web, free trade was supposed to have what I call “<a href="https://asimovaddendum.substack.com/p/the-architecture-of-participation" target="_blank" rel="noreferrer noopener">an architecture of participation</a>.” The most important thing about the web and open source wasn&#8217;t openness for its own sake. It was that there were no central gatekeepers. Anyone could add to the richness of the system without asking permission as long as they followed the rules of the communication protocols that allowed independently-developed pieces to work together. In addition, value circulated among the participants instead of being extracted to a center, and the system got better the more people used it. That is a very different thing from a system that is merely large and connected.</p>



<p class="wp-block-paragraph">Free trade was also supposed to work like that. The theory, going back to Smith and Ricardo, was that specialization and exchange would make everyone better off, and that the connections would be mutual. What we actually got over the past few decades looks more like the platform dominance we see in big tech than the original vision of a commons built around shared exchange. A handful of large and powerful countries and firms set the terms and the smaller players are forced to take what is on offer. Despite the language of free trade, the experience for many countries was closer to colonialism, just with a new narrative.</p>



<p class="wp-block-paragraph">Overall, under the neoliberal order (whose reign, as <a href="https://global.oup.com/academic/product/the-rise-and-fall-of-the-neoliberal-order-9780197519646" target="_blank" rel="noreferrer noopener">Gary Gerstle explains</a>, is now ending), free trade became far less egalitarian, inclusive, and generative than it could have been. Less powerful countries ended up in roughly the position that small businesses occupy on Amazon, or developers occupy on the app stores: free to participate, on terms they don&#8217;t control, with much of the value they create flowing back to the hub.</p>



<p class="wp-block-paragraph">Brazil&#8217;s response (and that of many others) should not be seen as a retreat from the world. It is a refusal to participate <em>only as a buyer</em>, or as a source of raw materials.</p>



<p class="wp-block-paragraph">That&#8217;s why decoupling is the wrong word. Decoupling means cutting the connections. What these countries seem to want is to stay connected but to build real capacity of their own, so that no single supplier can switch them off. That&#8217;s closer to federation than to separation. A federated system is still a system, and its nodes still interoperate. But no node is wholly at the mercy of another, and value circulates among them rather than collecting at the center. A trading order in which the gains pool at a few hubs is brittle and eventually illegitimate, in the same way that a platform economy that strip-mines its participants eventually provokes regulation and revolt.</p>



<p class="wp-block-paragraph">I put the increasingly visible quest for <a href="https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-sovereign-ai" target="_blank" rel="noreferrer noopener">sovereign AI</a>, and the role of open source models and open source agentic protocols and harnesses in enabling that sovereignty, into the same bucket. I remember back in the early days of open source software when Michael Tiemann, whose pioneering open source company Cygnus Solutions had just been acquired by Red Hat, told me “What we really sell at Red Hat is control. The ability to control your own destiny.”</p>



<p class="wp-block-paragraph">As companies are increasingly at the mercy of <a href="https://www.theinformation.com/newsletters/ai-agenda/rising-ai-costs-becoming-problem-investors" target="_blank" rel="noreferrer noopener">unexpected token pricing changes by the big centralized players</a>, this same quest for sovereignty is playing out at the level of organizations. Open source AI, including not just open source and open weight models but open agentic protocols, agentic harnesses, and portable memory, is increasingly an essential part of the sovereignty toolkit.</p>



<p class="wp-block-paragraph">The national technology sovereignty movements should take a lesson from the open source movement. The heart of open source is its architecture of participation. It is a force for innovation and value creation to the extent that it frees up the ability of people to solve their own problems and contribute their solutions to a low-friction global commons.</p>



<h2 class="wp-block-heading">Is capture the inevitable fate of any architecture of participation?</h2>



<p class="wp-block-paragraph">The pattern of open architectures leading to a wave of innovation, winners emerging, consolidating their power and then turning to the dark side seems to be a natural part of the technology cycle. The web broke Microsoft’s dominance over the personal computer software ecosystem only to give rise to a new generation of gatekeepers. Cory Doctorow called this cycle “<a href="https://en.wikipedia.org/wiki/Enshittification" target="_blank" rel="noreferrer noopener">enshittification</a>.” I’ve told my own version of that story using the language of economics in “<a href="https://www.oreilly.com/radar/rising-tide-rents-and-robber-baron-rents/" target="_blank" rel="noreferrer noopener">Rising Tide Rents and Robber Baron Rents</a>.”</p>



<p class="wp-block-paragraph">The instinct after capture is to try to rebuild the thing that got captured, only this time with better rules. Mastodon and Bluesky tried to rebuild Twitter&#8217;s social layer with cleaner governance, and neither has succeeded at the scale they hoped for. Critics might say that it was because Mastodon stayed pure and never made itself easy enough to use, while Bluesky looked federated without really being so. But more importantly, reinventing what we used to have, or what we think we used to have, is rarely the path forward. You have to build something new.</p>



<p class="wp-block-paragraph">Each country building its own answer to the latest frontier models is the Mastodon move. The winning move is to operate at a layer the centralized model structurally can&#8217;t reach. Open agent protocols that let services from different providers interoperate (the work that MCP and the emerging agent stack are beginning to do) are one such layer. AI accountable to local democratic and legal institutions is another such layer. Domain-specific AI built around problems the global market won&#8217;t serve (the tropical disease vaccine analogue) is another. None of these is a smaller copy of what the hyperscalers offer. But there’s one more important layer to consider: infrastructure.</p>



<h2 class="wp-block-heading">Where are the servers?</h2>



<p class="wp-block-paragraph"><a href="https://ai-disclosures.org/" target="_blank" rel="noreferrer noopener">Ilan Strauss</a> made a useful point in our conversation about these ideas. Ilan noted that AI is one of the most global forms of capital we&#8217;ve ever built, trained on the whole of the internet and runnable more or less anywhere, and the sovereignty rhetoric is partly an attempt to give something inherently placeless a place. The technology wants to be everywhere at once. The people who live with its consequences want some say over it where they are.</p>



<p class="wp-block-paragraph">The placelessness of AI is only half of the truth, though. The other half is that AI is physically place-bound. The model weights are placeless. The data centers, the chips, the electrical grid, and the water for cooling are very much somewhere.</p>



<p class="wp-block-paragraph">The comparison with Brazil’s medical sovereignty reinforces this point. Brazil’s challenge isn’t to invent new drugs to compete with Pfizer, but to build the capacity to manufacture existing vaccines, and eventually to build the capacity to invent vaccines for diseases the West ignores. Fiocruz and Butantan matter not because they hold patents but because they are physical institutional capacity rooted in Brazilian soil: the labs, the cold chains, the regulatory capacity, the trained workforce, and access to the active pharmaceutical ingredients. That&#8217;s what medical sovereignty really means in practice. It is infrastructure plus the institutions that run it.</p>



<p class="wp-block-paragraph">The same is becoming true for AI. Open weights matter. They&#8217;re closer, though, to the patent than to the lab. Even if Qwen, Kimi, DeepSeek, Llama, Gemma, Granite, and whatever comes next are fully open, running them at scale requires data centers that cost tens of billions to build, chips whose supply chains a handful of countries control, and electricity grids that have to be expanded substantially to carry the load. The countries pursuing sovereign AI seriously seem to understand this. The EU&#8217;s AI Gigafactories program, India&#8217;s IndiaAI mission, the Gulf compute buildouts, and the Singapore and Japan strategies are all infrastructure plays first and model plays second.</p>



<p class="wp-block-paragraph">Infrastructure is the layer where capture is hardest to undo. You can distill or fine tune a model far more easily than you can build a new continent’s worth of data centers or conjure the necessary electricity from a fragile power grid. If the architecture of participation for AI is defined only at the model layer, the infrastructure layer below will quietly recapture, over years, everything that was won above. Open weights running on three companies’ servers is not sovereignty.</p>



<p class="wp-block-paragraph">Building physical infrastructure capable of carrying a generation&#8217;s worth of economic activity is exactly the kind of mission the public sector used to take on, before we convinced ourselves the market would handle it. Mazzucato’s argument is that public procurement and public capacity-building are the real engines of foundational technology. AI sovereignty without industrial policy is wishful thinking.</p>



<p class="wp-block-paragraph">Industrial policy should aim to reinvent 20th century infrastructure, not just copy it. Can we use the enormous rebuild of infrastructure for the AI era to leapfrog the past? The analogy with centralized power grids and decentralized solar reminds us that local control does not have to be a localized version of the hyperscaler pattern. Might we envision a future where there is an intelligence grid that seamlessly uses frontier models in massive data centers and local models controlled by the user as dictated by considerations like cost, privacy, specialized knowledge, and user preferences? Creating the software to manage such an interoperable intelligence grid should be a high priority for the AI open source community. We need an orchestrator not just for agents but also for models and even for data center capacity.</p>
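
<p class="wp-block-paragraph">To make the idea of an intelligence grid a little more concrete, here is a minimal sketch of the kind of routing decision such an orchestrator might make. Everything in it is an assumption for illustration: the <code>Request</code> fields, the <code>FRONTIER_COST_CENTS</code> figure, and the policy itself (privacy first, then quality within budget, otherwise local). It is not a description of any existing system.</p>

```python
from dataclasses import dataclass

@dataclass
class Request:
    contains_private_data: bool   # must the data stay under the user's control?
    needs_frontier_quality: bool  # is a local model unlikely to be good enough?
    budget_cents: float           # what the user is willing to spend

# Hypothetical per-request cost of a hosted frontier model, purely illustrative.
FRONTIER_COST_CENTS = 5.0

def route(request):
    """Pick "local" or "frontier" for one request under the assumed policy."""
    if request.contains_private_data:
        return "local"            # privacy wins over every other consideration
    if request.needs_frontier_quality and request.budget_cents >= FRONTIER_COST_CENTS:
        return "frontier"         # quality is needed and the budget covers it
    return "local"                # default to the model the user controls
```

<p class="wp-block-paragraph">A real orchestrator would also have to weigh specialized knowledge, user preferences, and available data center capacity, but even this toy policy shows that the choice between local and centralized models can be made per request rather than once and for all.</p>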



<h2 class="wp-block-heading">Could federated AI give us a new pattern for the economy?</h2>



<p class="wp-block-paragraph">In a previous piece about AI and markets, &#8220;<a href="https://asimovaddendum.substack.com/p/the-third-artificial-intelligence" target="_blank" rel="noreferrer noopener">The Third Artificial Intelligence</a>,&#8221; I picked up Richard Danzig&#8217;s argument that markets and the bureaucracies that underpin nation states are themselves artificial intelligences, information-processing mechanisms older than the machine kind. The question with all three is who designs and builds them, what they optimize for, and what feedback loops govern them.</p>



<p class="wp-block-paragraph">We&#8217;re about to spend a lot of effort working out how AI should be organized both across nations and across organizations, whether it concentrates in a few firms and a few countries or whether it can be built as something more federated, where smaller players have genuine capacity and the value they create flows back to them. The choices we are now making about how AI is organized, at the model layer, the protocol layer, and the infrastructure layer, are also choices about how economic activity will be organized for at least a generation. If we manage to get that architecture right for AI, it may give us a working pattern for the thing we&#8217;ve so far failed to get right for trade. If we get it wrong, we&#8217;ll most likely reproduce, at the level of intelligence itself, the same concentration that free trade has produced in goods and the existing internet platforms produced online.</p>



<p class="wp-block-paragraph">The technology wants to be everywhere at once. The people who live with its consequences want some say over it where they are. The infrastructure that resolves that tension will be a federation of models, a federation of protocols and code, and a federation of capacity. We need an architecture of participation all the way down the stack, and all the way up.</p>



<p class="wp-block-paragraph"><em>The final section of this piece benefited greatly from questions and comments raised by Ilan Strauss and <a href="https://www.oreilly.com/people/mike-loukides/" target="_blank" rel="noreferrer noopener">Mike Loukides</a>, as well as from previous conversations with Richard Danzig.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/ai-sovereignty-and-the-architecture-of-participation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>SaaS Is Not Dead Yet</title>
		<link>https://www.oreilly.com/radar/saas-is-not-dead-yet/</link>
				<comments>https://www.oreilly.com/radar/saas-is-not-dead-yet/#respond</comments>
				<pubDate>Mon, 01 Jun 2026 11:01:35 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18822</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/SaaS-is-not-dead-yet.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/06/SaaS-is-not-dead-yet-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[With the rise of agents, many people have been proclaiming that the age of software as a service (SaaS) is over. Who needs to subscribe to a service when you can create your own software with a few English-language prompts and a few dollars spent on tokens? Your own software, most likely a skill that [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">With the rise of agents, many people have been proclaiming that the age of software as a service (SaaS) is over. Who needs to subscribe to a service when you can create your own software with a few English-language prompts and a few dollars spent on tokens? Your own software, most likely a skill that runs in an agent, will have exactly the features you want: no more, no less.</p>



<p class="wp-block-paragraph">But whenever someone talks about the death of SaaS, there’s something wrong with the picture. It’s simply that work is about groups and teams, and so far, programming with agents is about individuals. A related challenge is that SaaS companies are good at building dashboards and generating reports for humans, but agents need the raw data, not a representation of the data.</p>



<p class="wp-block-paragraph">Think about the teamwork required for a good sales team. Someone needs a database to keep track of their customer info. It’s easy to get Claude, Gemini, or GPT to build that, using SQLite for a backend and putting a reasonable web frontend on it. You could also do that fairly quickly with Ruby on Rails, but AI makes it even easier. But what about the salesperson at the next desk? She needs similar CRM software, and she can create it with Claude, Gemini, or GPT. No problem. But it won’t be exactly the same; it will reflect her needs and preferences. Soon you have a team of salespeople in which everyone has their own personal CRM. They’re all similar, but slightly different. They may use different backends (FileMaker, SQLite, MySQL, or maybe a corporate Oracle instance); they have similar-but-slightly-different schemas (one has a single field for customer address, another has separate street, city, state, and country fields); and they don’t interoperate.</p>
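
<p class="wp-block-paragraph">The schema mismatch is easy to see in a small sketch. The two records below stand in for the same customer in two hypothetical personal CRMs; the field names and the comma-separated address format are assumptions for illustration, not taken from any real product. Merging them requires glue code that guesses how to split the single address field, and that guess breaks as soon as an address doesn’t have exactly four parts.</p>

```python
# Hypothetical records from two personal CRMs; all field names are illustrative.
record_a = {"name": "Acme Corp", "address": "12 Main St, Springfield, IL, USA"}
record_b = {"name": "Acme Corp", "street": "12 Main St", "city": "Springfield",
            "state": "IL", "country": "USA"}

ADDRESS_PARTS = ("street", "city", "state", "country")

def normalize(record):
    """Map either hypothetical schema onto one shared shape.

    Assumes a single-field address is exactly four comma-separated parts,
    which real address data will often violate.
    """
    if "address" in record:
        parts = [p.strip() for p in record["address"].split(",")]
        if len(parts) != len(ADDRESS_PARTS):
            raise ValueError(f"cannot split address: {record['address']!r}")
        return {"name": record["name"], **dict(zip(ADDRESS_PARTS, parts))}
    return {"name": record["name"], **{k: record[k] for k in ADDRESS_PARTS}}
```

<p class="wp-block-paragraph">Multiply this by every field and every salesperson, and the cost of reconciling a team’s worth of personal tools becomes clear.</p>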



<p class="wp-block-paragraph">That’s the simplest possible case. How do you generate company-wide reports if everyone has their own version of the data? How do you know if you’re succeeding or failing if everyone on the team has their own version of the metrics? Everyone has become their own silo.</p>



<p class="wp-block-paragraph">The company is not paying subscription fees to a vendor like Salesforce, but is this really progress? If anything, we need to make sharing data and metrics easier, not more difficult. On top of that, a product like Salesforce has hundreds of features. Most people don’t need most of them, but there’s a good chance that almost everyone needs one feature that nobody else needs. And there are always the features you don’t know you need, ways to get value from data that you haven’t thought of. There’s value in buying a bundle that goes beyond your immediate requirements.</p>



<p class="wp-block-paragraph">There’s certainly a lot that’s good about enabling people to develop their own tools. I guarantee that if we had Claude Code 30 years ago, I would have vibe-coded my own skills for managing the authors I was working with. I would have vibe-coded some of the crazy tools I wrote to translate from one document format to another. (WordPerfect to troff? Why?) Now that we have agentic programming, I may never write my own tools again. But the SaaS scenario highlights something missing from the agentic picture. We don’t have tools for sharing or collaboration. Nobody buys a Salesforce subscription for themselves. It’s a departmental or corporate resource, shared between many people. And the ability to share easily is precisely what agentic programming lacks. I’ve built some of my own Claude tools and skills, but it’s very difficult to share them with other people at O’Reilly. <a href="https://www.linkedin.com/posts/openai-for-business_today-were-introducing-skills-in-beta-for-activity-7435743335107084288-yHR9/">ChatGPT Skills for Business and Enterprise</a> hints at the ability to share skills among team members and some ability to generate them collaboratively, though it’s hard to find evidence that it delivers. I think we’re seeing a symptom of technological overreach. It’s easy to assume something is &#8220;easy&#8221; when it isn’t: &#8220;You just generate a .md file and put it in the corporate GitHub.&#8221; That process has a lot of friction, particularly for users who aren’t technical.</p>



<p class="wp-block-paragraph">To make skills really useful across a company, we need:</p>



<ul class="wp-block-list">
<li><strong>Sharing.</strong> This can be a Git server that’s registered as a private marketplace and then configured via a corporate administrative dashboard. Publishing skills to the marketplace would remain the province of Git-aware users, and that’s a problem.</li>



<li><strong>Requirements.</strong> We don’t want everyone to build a personal toolset; that’s the problem we’re trying to solve. How do you resolve differences between users who want slightly different things? What does the PRD for a skill look like?</li>



<li><strong>Collaboration.</strong> Aside from Google Docs, the current state of widely used collaboration tools is poor. Suffice it to say that working on different branches of a Git repo and merging changes may work for professional programmers, but not for anyone else.</li>



<li><strong>Testing.</strong> Tests and evals for agents (related, but not the same) are topics that we don’t yet understand well. But if you’re going to empower users to use and create agentic tools for creating projections and writing reports, you need to know they won’t backfire. Skills also behave like any other AI application: They drift over time. Even after they’re published, they need to be evaluated regularly to see if they still perform correctly.</li>



<li><strong>Versioning.</strong> Like any software—and we need to recognize that agentic tools and skills are software, even if they’re written in English—it will be important to update them as requirements change and as LLM behavior drifts. It’s important to keep track of versions and for users to update their skills to the latest version easily. Again, this is a matter of wrapping Git appropriately for nontechnical users.</li>



<li><strong>Security.</strong> Security for intelligent agents is still poorly understood. We know about prompt injection, but we also know that it’s a problem that can’t be solved yet. And attackers are still finding novel ways to inject malicious prompts. What vulnerabilities might agentic skills and tools have if they can access corporate data?</li>
</ul>



<p class="wp-block-paragraph">While the democratization of programming doesn’t threaten SaaS companies, intelligent agents pose a deeper challenge. In “<a href="https://asimovaddendum.substack.com/p/the-salesforce-of-agents-wont-be" target="_blank" rel="noreferrer noopener">The Salesforce of Agents Won’t Be Salesforce, the Google of Agents Won’t Be Google</a>,” Jesus Rodriguez points out that the future for services like Salesforce and Google isn’t web UIs and dashboards; it’s APIs that are designed for agents. These APIs require a different kind of data: not something that a human can glance at to get a quick feel for what’s happening, but “structured state, task objectives, relationship graphs, permissioned memory, machine-readable sales playbooks, and reliable APIs for updating intent.” Humans need the data compression that you get from a dashboard. Agents want the data itself, and they’ll take care of the compression. SaaS companies can become the system of record that is responsible for delivering accurate data. What they need to recognize is that their real customer may not be a human user; the customer will be an agent, and that will affect everything from marketing strategy and product design to pricing.</p>



<p class="wp-block-paragraph">I wouldn’t claim that Salesforce or Google can’t or won’t build APIs to help companies access their own data. SaaS remains relevant, but it’s a different kind of SaaS than we have now. Companies like Salesforce know what data is available and how to work with it. Designing and building the data infrastructure that’s needed to provide next-generation SaaS isn’t trivial, and doing the programming in English rather than C++ doesn’t make it easier. Companies like Salesforce and Google know what needs to be built. They’re likely to offer their own collections of agentic skills as a starting point, alongside APIs. But large, established companies are ripe to be blindsided if they move slowly—and it’s difficult for large institutions to move quickly.</p>



<p class="wp-block-paragraph">SaaS companies have momentum—or inertia, which to a physicist is the same thing. They have to change, but they aren’t threatened by AI, agents, and user-defined skills. Providing APIs that have been designed to provide data in formats that machines can use should be an obvious next step. If they die, it will be because they don’t adapt. But there’s nothing new about that.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/saas-is-not-dead-yet/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Open Source Ecosystems</title>
		<link>https://www.oreilly.com/radar/open-source-ecosystems/</link>
				<comments>https://www.oreilly.com/radar/open-source-ecosystems/#respond</comments>
				<pubDate>Fri, 29 May 2026 11:00:08 +0000</pubDate>
					<dc:creator><![CDATA[Ilan Strauss]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18814</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Open-source-ecosystems.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Open-source-ecosystems-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[When open strategy meets private tactics]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Asimov&#8217;s Addendum Substack and is being reposted here with the author&#8217;s permission. Bill Gurley&#160;has an excellent article on what he calls&#160;open source strategy,&#160;which we recommend reading. There is a lot to debate about his concluding argument in particular: that open-weight models are central to keeping the AI market [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on the</em> <a href="https://asimovaddendum.substack.com/p/open-source-ecosystems" target="_blank" rel="noreferrer noopener">Asimov&#8217;s Addendum</a> <em>Substack and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph"><a href="https://p3institute.substack.com/p/from-open-source-software-to-open" target="_blank" rel="noreferrer noopener">Bill Gurley</a>&nbsp;has an excellent article on what he calls&nbsp;<em>open source strategy,&nbsp;</em>which we recommend reading. There is a lot to debate about his concluding argument in particular: that open-weight models are central to keeping the AI market rent-free. The limits of open-weight AI as the primary open source strategy are surely considerable though, if it still requires expensive hardware to run on, and&nbsp;<a href="https://www.oreilly.com/pub/a/tim/articles/architecture_of_participation.html" target="_blank" rel="noreferrer noopener">if the architecture ultimately remains monolithic</a>—rather than composable and protocol-centric.</p>



<p class="wp-block-paragraph">A related consideration comes from Anthropic’s<a href="https://www.anthropic.com/news/anthropic-acquires-stainless" target="_blank" rel="noreferrer noopener">&nbsp;recent acquisition of Stainless</a>, a startup that generates SDKs, command-line tools, and MCP servers from API specifications. This illustrates that open protocols like MCP, even when publicly governed,<sup data-fn="6732a4b0-bcdf-41ae-a355-761cc861ab6b" class="fn"><a href="#6732a4b0-bcdf-41ae-a355-761cc861ab6b" id="6732a4b0-bcdf-41ae-a355-761cc861ab6b-link">1</a></sup>&nbsp;remain exposed at their complementary layers to private actors capturing rents. (Protocol openness does not eliminate this exposure; if anything, it probably enables it by growing the market.)</p>



<p class="wp-block-paragraph">We asked Claude to analyze this acquisition, going beyond the press releases. Its first pass overstated parts of the competitive-denial story; what follows is what survived a closer look:</p>



<ol class="wp-block-list">
<li><strong>Complement capture, not protocol capture.</strong>&nbsp;MCP—the standard that lets AI agents talk to other software—remains open, and its governance has been handed to an independent foundation. What Anthropic bought is the company that turned that standard into something most developers could actually use.&nbsp;<em>Stainless was the dominant tool for taking an ordinary business API</em>&nbsp;(say, a hotel booking system or a customer database) and converting it into something an AI agent could call through MCP. The open standard is still open. The path most developers walked to use it has now been bought.<br></li>



<li><strong>This isn’t a one-off—the whole layer is consolidating.</strong>&nbsp;Stainless wasn’t alone in this market. Its main competitor, Fern, was<a href="https://buildwithfern.com/post/stainless-pricing-alternatives" target="_blank" rel="noreferrer noopener">&nbsp;bought by Postman in January 2026</a>. Anthropic bought Stainless four months later, in May 2026. That leaves&nbsp;<a href="https://www.speakeasy.com/" target="_blank" rel="noreferrer noopener">Speakeasy</a>&nbsp;as the only major independent player, plus an open-source fallback called&nbsp;<a href="https://openapi-generator.tech/" target="_blank" rel="noreferrer noopener">OpenAPI Generator</a>&nbsp;that most developers consider too rough for production use without significant manual work. In under five months, two of the three serious companies in this part of the market have been absorbed into larger platforms.&nbsp;<em>The Stainless deal is more visible because of who bought it and why, but the broader pattern matters more: an entire layer of AI infrastructure is being pulled inside platform owners</em>.<br></li>



<li><strong>Moat migration.</strong> The gap in raw model capability between Anthropic, OpenAI, and Google has narrowed considerably and continues to close, and the implication is that model quality alone is unlikely to be the principal basis of competitive advantage over the next two years. What may distinguish the leading firms instead <em>is the quality of the developer experience around their models: how easily a business or an engineer can build something useful on top of a given model, how cleanly the tooling integrates with existing systems, and how reliable the connectors are over time.</em></li>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Stainless was founded by Alex Rattray, formerly of Stripe.&nbsp;<em>Stripe built its market position largely on unusually well-designed developer tools</em>, and Stainless was, in effect, an attempt to apply the same approach to the layer between AI APIs and the rest of the software economy. Anthropic has acquired the team that knows how to do this.</p>
</blockquote>



<li><strong>Pricing logic, with caveats on denial.</strong>&nbsp;Stainless was last valued at&nbsp;<a href="https://www.analyticsinsight.net/news/anthropic-acquires-stainless-for-over-300m-to-strengthen-ai-sdk-and-tool-access" target="_blank" rel="noreferrer noopener">$150M in December 2025</a>; at &gt;$300M five months later, this is a roughly 2x strategic markup, not acqui-hire arithmetic. Removing a critical-path external dependency on Anthropic’s own SDKs, while denying it to a tight set of competitors, is rational at that price—but the denial logic is partial.&nbsp;<em>Speakeasy is a viable substitute, and OpenAI was reportedly already migrating off Stainless. The friction tax falls hardest on smaller players who lack the engineering bench to absorb migration cost</em>.</li>
</ol>



<p class="wp-block-paragraph">…The press release calls it “extending reach”; the <em>InfoWorld</em> read—“last-mile developer experience”—is closer, but the complement-capture component, even if partial, is real.</p>



<p class="wp-block-paragraph">-*-</p>



<p class="wp-block-paragraph">Now, while Claude might be overstating some of the market risks associated with this acquisition (you tell us?), it shows that open source’s impacts are highly conditional on its dependencies and should never be analyzed in isolation from the market’s software stack and architecture. This is equally true for open weight models, which depend on data, compute, and distribution, as it is for open protocols like MCP, which depend on constant API translations and access. Tracking those interdependencies is what a full ecosystem view involves, and it helps identify where chokepoints might arise and, in turn, where&nbsp;<em>open source strategy</em>&nbsp;might eventually fail or be captured.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="6732a4b0-bcdf-41ae-a355-761cc861ab6b">In this case by the<a href="https://www.linuxfoundation.org/press/agentic-ai-foundation" target="_blank" rel="noreferrer noopener"> Agentic AI Foundation under the Linux Foundation</a> <a href="#6732a4b0-bcdf-41ae-a355-761cc861ab6b-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/open-source-ecosystems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Your AI Agent Already Forgot Half of What You Told It</title>
		<link>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/</link>
				<comments>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/#respond</comments>
				<pubDate>Thu, 28 May 2026 10:59:36 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18803</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Your-AI-agent-already-forgot-half-of-what-you-told-it.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Your-AI-agent-already-forgot-half-of-what-you-told-it-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[How to keep agents and skills from losing track mid-workflow]]></custom:subtitle>
		
				<description><![CDATA[This is the seventh article in a series on agentic engineering and AI-driven development.&#160;Read part one&#160;here, part two&#160;here, part three&#160;here, part four&#160;here, part five&#160;here, and part six here. This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>This is the seventh article in a series on agentic engineering and AI-driven development.&nbsp;Read part one&nbsp;<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two&nbsp;<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three&nbsp;<a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, part four&nbsp;<a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">here</a>, part five&nbsp;<a href="https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/" target="_blank" rel="noreferrer noopener">here</a></em>, <em>and part six <a href="https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
</blockquote>



<p class="wp-block-paragraph">This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of a turn I wasn&#8217;t expecting.</p>



<p class="wp-block-paragraph">In my <a href="https://www.oreilly.com/radar/why-doesnt-anyone-teach-developers-about-context-management/" target="_blank" rel="noreferrer noopener">last article</a>, I talked about context and context management, and I promised to give you some real practical tips for managing it. This article was originally meant to be about specific, practical context management techniques that were really helpful to me while building <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a> and the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, two open source projects where I work with AIs to plan and orchestrate all of the work, and where every line of code is written by AI tools like Claude Code and Cursor.</p>



<p class="wp-block-paragraph">But as I was writing this, I found that I&#8217;d adapted those same techniques to my work writing articles like this one. Which is surprising! I&#8217;ve been doing all this work finding ways to help people developing AI skills improve context management, so their skills run more efficiently. It turns out that those same exact techniques apply to anyone using AI tools, even when you&#8217;re using chatbots like Claude.ai or ChatGPT.</p>



<p class="wp-block-paragraph">Full disclosure: I use multiple AI tools to manage this article series. My primary tools are Claude Cowork for brainstorming and managing my article research, notes, and backlog and Gemini&#8217;s mobile app for reading drafts aloud and taking my notes while I&#8217;m away from my desk. And I want to tell you about something that happened while I was using those tools, because I think it really helps show why context management isn&#8217;t just a problem for developers.</p>



<p class="wp-block-paragraph">While I was writing this article, I was using Gemini&#8217;s mobile app to read the draft aloud and take my notes. Partway through the session I asked it to go back and check whether there were earlier notes it hadn&#8217;t incorporated yet. It told me it didn&#8217;t have access to the previous notes, which seemed weird and insane, since we had <em>just taken those notes a few prompts earlier in the session</em>. I could scroll back up and see them earlier in the conversation, but somehow it didn&#8217;t &#8220;know&#8221; about them.</p>



<p class="wp-block-paragraph">Here&#8217;s what happened. Gemini had compacted our conversation without telling me, and the notes from the first half of the session were just&#8230; gone.</p>



<p class="wp-block-paragraph">If you&#8217;ve ever had a web chat AI just seem to forget things you talked about earlier, you&#8217;ve experienced context compaction, just like I did. Understanding even the basics of context and context windows can make a big difference in preventing that kind of frustration.</p>



<p class="wp-block-paragraph">This all reminded me of something I wrote more than two decades ago in <em><a href="https://learning.oreilly.com/library/view/applied-software-project/0596009488/" target="_blank" rel="noreferrer noopener">Applied Software Project Management</a></em> (back in 2005!): &#8220;Important information is discovered during the discussion that the team will need to refer back to during the development process, and if that information is not written down, the team will have to have the discussion all over again.&#8221;</p>



<p class="wp-block-paragraph">Jenny Greene and I wrote that about human teams and project meetings, but it applies to AI sessions just as well.</p>



<p class="wp-block-paragraph">Which brings me back to context, which I wrote about in my last article, and which I&#8217;ll write more about in the next one, because it&#8217;s one of the most important concepts to keep top of mind when working with AI.</p>



<h3 class="wp-block-heading"><strong>Context loss may be invisible, but that doesn&#8217;t make it any less frustrating</strong></h3>



<p class="wp-block-paragraph"><strong>Context</strong> is everything the AI is holding in its working memory during a conversation: what you&#8217;ve told it, what it&#8217;s told you, any files or instructions it&#8217;s read, and whatever internal notes the system has made along the way. All of that lives in a fixed-size <strong>context window</strong>—think of that as your AI&#8217;s short-term memory, the stuff it&#8217;s thinking about right now—and when the window fills up, the AI has to start letting things go. Different tools handle this differently: Some truncate older messages, some compress the conversation into a summary (which means details get lost even though the summary looks complete), and some just start behaving inconsistently so you can&#8217;t tell whether the AI forgot something or never understood it in the first place. The result is the same: The AI loses track of things you told it, decisions you made together, or details it noticed earlier in the session. And it won&#8217;t tell you it forgot. It&#8217;ll just keep generating confident-sounding output based on whatever it still has.</p>
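
<p class="wp-block-paragraph">To make the truncation case concrete, here&#8217;s a minimal sketch of a fixed-size window that drops its oldest messages once a budget is exceeded. It&#8217;s a toy model, not how any particular tool works: the <em>ContextWindow</em> class, the word count standing in for tokens, and the budget of 12 are all assumptions made for illustration.</p>

```python
from collections import deque


class ContextWindow:
    """Toy fixed-size context window that silently truncates old messages.

    Assumption: "tokens" are approximated by whitespace-separated words.
    Real tools use model-specific tokenizers and may summarize instead.
    """

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.messages: deque[str] = deque()

    def _size(self) -> int:
        return sum(len(m.split()) for m in self.messages)

    def add(self, message: str) -> list[str]:
        """Append a message and return whatever was dropped to make room."""
        self.messages.append(message)
        dropped = []
        # Evict from the oldest end until the window fits the budget again.
        while self._size() > self.budget and len(self.messages) > 1:
            dropped.append(self.messages.popleft())
        return dropped


window = ContextWindow(budget=12)
window.add("note one: the deadline is Friday")
window.add("note two: use the blue template")
lost = window.add("please summarize all of my notes")
```

<p class="wp-block-paragraph">After the third message, <em>lost</em> holds the first note, and nothing left in the window records that it ever existed. That is the invisible part: a summary request now runs against two messages, not three.</p>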



<p class="wp-block-paragraph">Before we dive in a little deeper, I want to do a quick jargon check. If you&#8217;ve seen the terms &#8220;skills&#8221; and &#8220;agents&#8221; floating around but aren&#8217;t sure what they are, think of skills as libraries for AIs and agents as interactive executables. Those aren&#8217;t perfectly precise definitions, but if you&#8217;re a developer they&#8217;re close enough for this discussion.</p>



<p class="wp-block-paragraph">When you&#8217;re coding skills and agents, you run into context problems quickly. The work you&#8217;re asking the AI to do is often complex enough that the context window fills up, and the AI has to start compacting: compressing or dropping older parts of the conversation to make room for new ones. Compaction always seems to happen at the most frustrating and inconvenient time, which makes sense when you think about it. You hit context limits precisely when you&#8217;ve put the most information into the conversation, which is exactly when losing that information costs you the most.</p>



<p class="wp-block-paragraph">That&#8217;s why I think it can often help to think of AIs as having the same shortcomings that human teams do, except those shortcomings are exaggerated by their AI nature. A person who forgets something from a meeting last week might remember it when you remind them. An AI that lost something to context compaction won&#8217;t, because the information is gone. But there&#8217;s something you can do about it, and it turns out the techniques that help are the same whether you&#8217;re building autonomous AI skills or just trying to get a chatbot to remember what you told it 20 minutes ago.</p>



<p class="wp-block-paragraph">I&#8217;ve landed on four techniques that I come back to over and over again. Each one exists because at some point the AI forgot something important and I responded by putting that thing in a file where it couldn&#8217;t be forgotten. None of them require special tooling. And to my surprise, all of these techniques have turned out to be useful for both building software and managing a writing project like this one, whether I&#8217;m chatting with Claude, ChatGPT, or Gemini, or using a desktop tool like Claude Cowork or Codex. These are the techniques I find most valuable:</p>



<ul class="wp-block-list">
<li><strong>Split discovery from documentation:</strong> Don&#8217;t ask the AI to figure something out and produce polished output in the same pass.</li>



<li><strong>Use handoff documents, not continuation prompts:</strong> Before closing a stale session, have the AI write down everything the next session needs to know.</li>



<li><strong>Give the AI an acceptance criterion, not a procedure:</strong> Tell it what &#8220;done&#8221; looks like instead of spelling out the steps.</li>



<li><strong>Use spec documents as the bridge between AI tools:</strong> Make a shared document the single source of truth that all your tools read from.</li>
</ul>



<h3 class="wp-block-heading"><strong>Split discovery from documentation</strong></h3>



<p class="wp-block-paragraph">When you ask an AI to do something complex, you&#8217;re often asking it to do two things at once without realizing it. You&#8217;re asking it to figure something out and produce polished output at the same time. The problem is that figuring things out takes attention, and producing output takes attention, and the model only has so much of it. When you combine both tasks in the same prompt, the model starts cutting corners on one of them, and you can&#8217;t tell which one it shortchanged.</p>



<p class="wp-block-paragraph">I ran into this with the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open source AI coding skill I built that runs structured code reviews against any codebase. One of the things it does is derive requirements from source code: It reads through the code, identifies what the code promises to do (I call these behavioral contracts), and then produces a requirements document. Originally this all happened in a single pass. The problem was that single-pass requirement generation ran out of attention after about 70 requirements. The model forgot behavioral contracts it had noticed earlier in the code, and the forgetting was completely invisible. There was no stack trace or error message, just incomplete output and no way to know what was missing. I fixed it by splitting the work into two separate prompts:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Read each source file and write down every behavioral contract you observe as a simple list in CONTRACTS.md.</em></p>



<p class="wp-block-paragraph"><em>Read CONTRACTS.md and the documentation, then derive requirements from them and write REQUIREMENTS.md.</em></p>
</blockquote>



<p class="wp-block-paragraph">Then a third pass checks whether every contract has a corresponding requirement, and if there are gaps, goes back to step one for the files with gaps.</p>



<p class="wp-block-paragraph">The key idea is that CONTRACTS.md is external memory. When the model &#8220;forgets&#8221; about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap. You can see what was forgotten and fix it.</p>
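
<p class="wp-block-paragraph">Once the contracts are in a file, the gap check itself can be deterministic. Here&#8217;s a rough sketch of what that could look like. It assumes a hypothetical convention in which each contract carries an ID like <em>C-001</em> and each requirement cites the IDs it was derived from; the Quality Playbook&#8217;s actual file format may differ, and the function name <em>uncovered_contracts</em> is made up for this example.</p>

```python
import re

CONTRACT_ID = re.compile(r"\bC-\d+\b")


def uncovered_contracts(contracts_md: str, requirements_md: str) -> list[str]:
    """Return contract IDs listed in CONTRACTS.md that no requirement cites.

    Assumption: contracts carry hypothetical IDs like C-001 and each
    requirement names the contract IDs it was derived from.
    """
    declared = CONTRACT_ID.findall(contracts_md)
    cited = set(CONTRACT_ID.findall(requirements_md))
    # Preserve file order and skip duplicates so the report is stable.
    seen = set()
    gaps = []
    for cid in declared:
        if cid not in cited and cid not in seen:
            gaps.append(cid)
        seen.add(cid)
    return gaps


contracts = """\
- C-001 parse_config() raises ValueError on a missing key
- C-002 save() writes atomically via a temp file
- C-003 retry() gives up after three attempts
"""
requirements = """\
REQ-1 (from C-001): Configuration errors must be reported to the caller.
REQ-2 (from C-003): Retries must be bounded.
"""
gaps = uncovered_contracts(contracts, requirements)
```

<p class="wp-block-paragraph">In this made-up example, <em>gaps</em> contains only <em>C-002</em>: the contract was observed, written down, and then dropped during requirements work. Without the intermediate file there would be nothing to compare against.</p>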



<p class="wp-block-paragraph">The principle: Don&#8217;t ask the AI to figure out what exists and write formatted output in the same pass. The model runs out of attention trying to do both at once. Whenever you&#8217;re asking an AI to do something complex, consider whether you&#8217;re actually asking it to do two things at once. &#8220;Analyze this codebase and write a report&#8221; is two tasks. &#8220;Read this document and suggest improvements&#8221; is two tasks. Split them, and let the first pass write its observations to a file before the second pass starts working with them.</p>



<h3 class="wp-block-heading"><strong>Use handoff documents, not continuation prompts</strong></h3>



<p class="wp-block-paragraph">Anyone who&#8217;s spent a long session with an AI coding tool has felt the moment when the context starts to go stale. The AI stops tracking details it was handling fine an hour ago, or it contradicts something it said earlier. The session gets slow, and you find yourself restarting because the AI seems bogged down by everything you&#8217;ve told it. You get the sense that if you keep going, you&#8217;re going to spend more time correcting it than making progress.</p>



<p class="wp-block-paragraph">Most developers respond to a session getting too long in one of two ways: They push through the problem, or they start a fresh session and try to reexplain everything from scratch. Both of those approaches can cause the AI to lose context. The first loses it to compaction; the second loses it to incomplete reexplanation. And both are frustrating, specifically because you just spent so much time building up all that context with the AI.</p>



<p class="wp-block-paragraph">There&#8217;s a third option. Before you close the session, ask the AI to write a handoff document: a file that captures everything the next session needs to know, written while the current session still has full context. The key is that you&#8217;re asking the AI to write this while the relevant details are still fresh in the working context, and in a way that it or another AI can read.</p>



<p class="wp-block-paragraph">I built this into the Quality Playbook as a core part of how phases communicate. When I split the playbook from a single prompt to independent phases, I needed each phase to run as a completely independent session with no context carryover. So each phase got its own kickoff prompt as a standalone file. Here&#8217;s the structure each one follows:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Write a handoff document that a fresh session could use to pick up this work cold. Include everything it would need to know.</em></p>
</blockquote>



<p class="wp-block-paragraph">Every kickoff opens with what prior phases accomplished, includes explicit boundaries about what&#8217;s frozen, and names which future phase owns each piece of remaining work. Without those boundaries, the AI will helpfully start doing Phase 3 work while you&#8217;re still in Phase 2. Each phase also ends with a required forward-looking handoff where the completing agent writes down what the next session needs to know.</p>



<p class="wp-block-paragraph">The principle: Each handoff is a complete state snapshot. The incoming AI agent never needs to read prior kickoff prompts or chat history. Everything it needs is in the current handoff file: current state, uncommitted changes, immediate next task, pending tasks, file locations, and anything that was discovered during the prior session. A fresh AI session can pick it up cold.</p>
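
<p class="wp-block-paragraph">One way to picture that snapshot is as a small data structure with a slot for each item in the list above. This is only a sketch: the <em>Handoff</em> class, its field names, and the Markdown layout are hypothetical, not the format the Quality Playbook actually uses.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical state snapshot; field names mirror the list above."""

    current_state: str
    next_task: str
    uncommitted_changes: list[str] = field(default_factory=list)
    pending_tasks: list[str] = field(default_factory=list)
    file_locations: list[str] = field(default_factory=list)
    discoveries: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Name required sections left blank, so gaps are visible."""
        required = {"current_state": self.current_state, "next_task": self.next_task}
        return [name for name, value in required.items() if not value.strip()]

    def to_markdown(self) -> str:
        lines = ["# Handoff", "", "## Current state", self.current_state, "",
                 "## Immediate next task", self.next_task]
        sections = [
            ("Uncommitted changes", self.uncommitted_changes),
            ("Pending tasks", self.pending_tasks),
            ("File locations", self.file_locations),
            ("Discovered during this session", self.discoveries),
        ]
        for title, items in sections:
            lines += ["", f"## {title}"]
            # An explicit "None" beats an absent section the reader must guess at.
            lines += [f"- {item}" for item in items] or ["- None"]
        return "\n".join(lines) + "\n"


handoff = Handoff(
    current_state="Phase 2 complete; requirements derived.",
    next_task="Run the Phase 3 coverage check.",
    pending_tasks=["Review REQUIREMENTS.md gaps"],
)
doc = handoff.to_markdown()
```

<p class="wp-block-paragraph">The design choice worth noting is that every section is always emitted, even when empty. A fresh session reading the file can tell the difference between &#8220;nothing is uncommitted&#8221; and &#8220;nobody wrote that part down.&#8221;</p>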



<p class="wp-block-paragraph">If you&#8217;re deep into a Claude Code or Copilot session and you can feel the context getting stale, ask the AI to write a handoff document before you close the session. Tell it to include everything a fresh session would need to continue the work. Then start a new session and point it at that file. A fresh session with a good handoff document will usually outperform a stale session, because it&#8217;s starting with clean context instead of compacted, fragmented context.</p>



<h3 class="wp-block-heading"><strong>Give the AI an acceptance criterion, not a procedure</strong></h3>



<p class="wp-block-paragraph">When you give an AI a multistep task, the natural instinct is to spell out the steps. First do this, then do that, then combine the results. The problem is that step-by-step procedures are the first thing the AI forgets when the context window fills up. It&#8217;ll skip steps, merge phases, or quietly drop tasks, and there&#8217;s nothing in the procedure itself that would help the AI notice what it missed. The procedure tells the AI what to do, but it doesn&#8217;t tell the AI what &#8220;done&#8221; looks like.</p>



<p class="wp-block-paragraph">I learned this the hard way with the Quality Playbook. The playbook runs multiple iteration passes over a codebase, and the results need to be cumulative. It keeps a list of all the bugs it finds in the code being tested in a file called BUGS.md. Early on, I gave the AI a procedure to run four times and then update that file:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>First run the main pass, then run four iteration passes, then merge the findings into BUGS.md.</em></p>
</blockquote>



<p class="wp-block-paragraph">The AI did not respond well to that instruction.</p>



<p class="wp-block-paragraph">It turns out that when you ask an AI to do a very complex task a specific number of times, it can lose count. In fact, from my experimentation, it seems that count is one of the first casualties of context compaction. Most of the time the AI decided three iterations was enough, or merged findings from only two passes, and no matter how many different ways I tried to rephrase that instruction, there was nothing I could come up with that prevented the problem.</p>



<p class="wp-block-paragraph">However, everything changed when I replaced the &#8220;run four times&#8221; instruction with an <strong>acceptance criterion</strong>, or a specific condition that tells the AI when to stop looping:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>You are done only when BUGS.md contains the cumulative findings from the main run plus all four iteration passes.</em></p>
</blockquote>



<p class="wp-block-paragraph">Even when the AI lost track of intermediate steps, it could check the output against the criterion and know whether it was finished. And I could verify the output against the same criterion, which gave me a way to audit the agent&#8217;s work without watching every step.</p>



<p class="wp-block-paragraph">In developer terms, the AI is really bad at loops like <em>for (i = 0; i &lt; 4; i++)</em> because it loses track of the value of the iterator <em>i</em> when it compacts its context. But it&#8217;s really good at loops like <em>while (!done)</em> because it can check <em>done</em> based on the current state without relying on history.</p>
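
<p class="wp-block-paragraph">Here&#8217;s a small sketch of that <em>while (!done)</em> shape in Python. It assumes each pass records its findings in BUGS.md under a heading like <em>## pass: iteration-2</em>. That marker format, along with the <em>missing_passes</em>, <em>run_until_done</em>, and <em>fake_pass</em> names, is invented for this example; the point is only that &#8220;done&#8221; is computed from the file&#8217;s current contents, never from a remembered counter.</p>

```python
import re

REQUIRED_PASSES = ["main"] + [f"iteration-{n}" for n in range(1, 5)]


def missing_passes(bugs_md: str) -> list[str]:
    """Return the passes whose findings are not yet recorded in BUGS.md.

    Assumption: each pass writes its findings under a heading such as
    "## pass: iteration-2". That marker format is made up for this sketch.
    """
    recorded = set(re.findall(r"^## pass: (\S+)$", bugs_md, flags=re.MULTILINE))
    return [name for name in REQUIRED_PASSES if name not in recorded]


def run_until_done(bugs_md: str, run_pass) -> str:
    """Loop on the state of the file, not on a counter."""
    while True:
        todo = missing_passes(bugs_md)
        if not todo:
            return bugs_md
        # run_pass must append a section with the matching heading,
        # otherwise this loop would never terminate.
        bugs_md += run_pass(todo[0])


def fake_pass(name: str) -> str:
    # Stand-in for an agent run; a real pass would review the codebase.
    return f"## pass: {name}\n- example finding from {name}\n"


partial = "# BUGS\n## pass: main\n- off-by-one in pager\n## pass: iteration-1\n- none\n"
final = run_until_done(partial, fake_pass)
```

<p class="wp-block-paragraph">Starting from a file that already holds the main pass and one iteration, the loop runs exactly the three passes that are missing. It would do the same if it were restarted cold halfway through, because nothing about its progress lives outside the file.</p>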



<p class="wp-block-paragraph">The principle behind all this is that an acceptance criterion survives context pressure because the AI can always check &#8220;Am I done?&#8221; against a concrete test. This is actually the same principle behind test-driven development: write the test before the code so you know when you&#8217;re done. The acceptance criterion is the test for your AI session. When you&#8217;re giving an AI a task that has multiple steps, don&#8217;t describe the steps. Describe what &#8220;done&#8221; looks like, and let the AI figure out how to get there.</p>



<h3 class="wp-block-heading"><strong>Use spec documents as the bridge between AI tools</strong></h3>



<p class="wp-block-paragraph">Most developers working with AI don&#8217;t use just one tool. You might use Claude for design, Cursor for coding, and Copilot for quick edits. You might even use multiple models inside the same tool, like GPT-5.5 and Opus 4.7 in separate Copilot chats inside VS Code. It&#8217;s common to have one model for coding, another for review, and a third for orchestration and project management. The problem is that none of these tools or chats know what you told the others. Claude doesn&#8217;t know what you decided with Cursor. Two separate Copilot chats in the same editor don&#8217;t share context. You&#8217;re the one carrying context between them, and that&#8217;s exactly the kind of lossy handoff that causes drift. A design decision you made in one conversation gets lost or distorted by the time it reaches the tool that needs to implement it.</p>



<p class="wp-block-paragraph">The fix is to make the spec document the single source of truth that all your AI tools read from. I used this when building a game prototype, where I had Claude handling design and planning and Cursor doing the coding. They never talked to each other directly, so the spec documents served as the shared contract: Claude wrote the specs, and Cursor read them. The rule I followed was simple:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Never tell the AI coder something that isn&#8217;t already in the specs. If you make a design decision in conversation, write it into the spec first, then point the coder at the spec.</em></p>
</blockquote>



<p class="wp-block-paragraph">If I made a design decision in a conversation with Claude, that decision had to be written into the spec before I told Cursor about it. If I discovered something during implementation, I wrote it into the appropriate doc first, then pointed the coder at it. The spec was always the single source of truth. When Claude and I changed the wound topology (removing one wound type, promoting another), we updated the docs first, then told Cursor to reread them. When we decided to add a new UI element, we wrote it into the UI spec first, then told Cursor to reread the doc.</p>



<p class="wp-block-paragraph">The key was including rationale in the specs. Not just &#8220;show 5 progressive labels&#8221; but why: &#8220;The player shouldn&#8217;t be told what they&#8217;re fighting. They should discover it.&#8221; This helps the AI coder make better decisions when the spec doesn&#8217;t cover an edge case because it knows the intent behind the requirement.</p>



<p class="wp-block-paragraph">The principle: The spec document is the shared context that all your tools can read. It prevents the drift that happens when design intent lives only in chat history that the other tool can&#8217;t see. This technique works any time you&#8217;re using more than one AI tool on the same project, which at this point is most projects.</p>



<h3 class="wp-block-heading"><strong>How these techniques combine: Managing this article series</strong></h3>



<p class="wp-block-paragraph">Those four practices came out of AI-driven development work, but they apply to almost any AI work. And although these techniques emerged while I was working on agents and skills, I think it&#8217;s valuable to demonstrate them in a nondevelopment context, so I&#8217;ll share an example from my work on the article series you&#8217;re reading now.</p>



<p class="wp-block-paragraph">Over time, the process for how my AI assistant and I manage this article backlog evolved organically in conversation, but it was never written down anywhere except in the AI&#8217;s context window. Which means every time the session compacted or I started a fresh chat, the process was gone and I had to reexplain it. I caught this when the AI did something slightly wrong and I wanted to confirm we were on the same page. So I asked:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Every time I suggest a new article idea, you add an entry to the backlog, and then create a new markdown file with the source material, right?</em></p>
</blockquote>



<p class="wp-block-paragraph">That&#8217;s split discovery from documentation. I didn&#8217;t say &#8220;document our process.&#8221; I said &#8220;confirm what we do.&#8221; Discovery first, then documentation as a separate step. If I&#8217;d said &#8220;write up our process&#8221; without confirming first, the AI might have written something plausible but wrong, and I wouldn&#8217;t have caught the discrepancy.</p>



<p class="wp-block-paragraph">Once we&#8217;d confirmed the process, I asked the AI to create two files. <strong>AGENTS.md</strong> is an emerging standard for AI-readable project context—a single file that tells any AI session what it needs to know about a project. You can learn more about the convention at <a href="https://agents.md/" target="_blank" rel="noreferrer noopener">agents.md</a>. <strong>CONTEXT.md</strong> serves a similar role as a bootstrapping document—it&#8217;s less established as a standard, but the practice of asking the AI to dump everything it knows into a context file so the next session can pick it up cold has been one of the most valuable habits I&#8217;ve developed. Here&#8217;s the prompt I used:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>Update the backlog file to explain what it is and how we maintain it. Create a CONTEXT.md with everything you&#8217;d need to bootstrap a new chat. Create an AGENTS.md to make it easy to bootstrap with a single-line prompt.</em></p>
</blockquote>



<p class="wp-block-paragraph">That prompt is a handoff document. I was explicitly asking the AI to write down everything it knew while it still had full context, specifically because I knew that context would be lost to compaction. The CONTEXT.md file is a handoff from this session to whatever fresh session picks up the work next week.</p>



<p class="wp-block-paragraph">Notice what I didn&#8217;t say. I didn&#8217;t give step-by-step instructions for what should go in those files. I said &#8220;everything you would need to bootstrap this process again in case we lost it&#8221; and &#8220;a complete dump of all of the context you would need to bootstrap a new chat and get it to the point where this current chat is.&#8221; Those are acceptance criteria, not procedures. The AI had to figure out what belonged in those files. If I&#8217;d given it a procedure (&#8220;first write the publication history, then the voice rules, then the file locations&#8221;), it would have followed the list and missed anything I forgot to include. The acceptance criterion is harder to satisfy but more robust: the test is &#8220;Could a fresh session bootstrap from these files alone?&#8221;</p>



<p class="wp-block-paragraph">And the AGENTS.md file itself is a spec document acting as a bridge between tools. It&#8217;s the shared contract that any AI session, whether it&#8217;s Claude, Gemini, Cowork, or a fresh chat, can read to get aligned with the project. This session wrote it; the next session reads it. The two sessions never communicate directly, so the spec file bridges the gap between them.</p>



<p class="wp-block-paragraph">That&#8217;s all four practices in two prompts, applied to something as ordinary as managing a writing project. It didn&#8217;t require pipelines or codebases or batch orchestration. The practices work because they solve the same underlying problem regardless of the domain: important information living in the AI&#8217;s context window instead of on disk.</p>



<h3 class="wp-block-heading"><strong>Context management is a development skill</strong></h3>



<p class="wp-block-paragraph">Every practice I&#8217;ve described in this article and the last one is something developers have always been told to do: write things down, record your rationale, be deliberate about what you save and what you let go, write ADRs and design docs and inline comments explaining nonobvious choices. We&#8217;ve always known we should do more of it. When you&#8217;re working with AI, the cost of not doing it becomes immediate and visible.</p>



<p class="wp-block-paragraph">The practices in this article all come down to the same thing: putting the important information in files where compaction can&#8217;t touch it, so you can see what the AI knows and verify that it matches reality. In the next article, I&#8217;ll go deeper on the debugging angle: how to use externalized files to understand what your AI is actually doing, with practical techniques that work even if you&#8217;re not building agents but are just using a chatbot.</p>



<p class="wp-block-paragraph"><em>The <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a> is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of <a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener">awesome-copilot</a>.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/your-ai-agent-already-forgot-half-of-what-you-told-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Get a Good Return on Your AI Investments</title>
		<link>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/</link>
				<comments>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/#respond</comments>
				<pubDate>Wed, 27 May 2026 16:52:37 +0000</pubDate>
					<dc:creator><![CDATA[Louise Corrigan]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18808</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Get-a-good-return-on-your-AI-investments.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Get-a-good-return-on-your-AI-investments-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Takeaways from Sam Newman&#039;s fireside chat with Nathen Harvey, DORA team lead at Google Cloud]]></custom:subtitle>
		
				<description><![CDATA[Last week, we had our first Infrastructure &#38; Ops superstream of 2026, Platform Engineering in the Age of AI. Our speakers explored a range of topics focused on supporting new AI workloads, each with unique infrastructure needs, unpredictable costs, and novel security concerns. Google Cloud’s Abdel Sghiouar took the audience through what a good platform [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Last week, we had our first Infrastructure &amp; Ops superstream of 2026, <a href="https://learning.oreilly.com/live-events/infrastructure-ops-superstream-platform-engineering-in-the-age-of-ai/0642572314507/0642572314491/" target="_blank" rel="noreferrer noopener">Platform Engineering in the Age of AI</a>. Our speakers explored a range of topics focused on supporting new AI workloads, each with unique infrastructure needs, unpredictable costs, and novel security concerns. Google Cloud’s Abdel Sghiouar took the audience through what a good platform for AI looks like, Cockroach Labs’ Jordan Lewis shared lessons learned rolling out a corporate AI platform, Syntasso’s Daniel Bryant outlined a three-layer model for building a good platform, technology leader Sarah Wells discussed the importance of governance and how to make it more manageable, and Thoughtworks’ Ben O&#8217;Mahony explained why evals should be part of your observability story. You can <a href="https://youtu.be/neycwJJmpG0" target="_blank" rel="noreferrer noopener">watch the highlights here</a>.</p>



<p class="wp-block-paragraph">The event concluded with a fireside chat between Sam Newman and Nathen Harvey, who leads the DORA team at Google Cloud. <a href="https://dora.dev/" target="_blank" rel="noreferrer noopener">DORA</a> has been tracking software delivery performance for over a decade, which means they&#8217;ve watched a lot of technology trends come through. Their center of gravity has always been the same question: How quickly and safely can a team move change into a running production application?</p>



<p class="wp-block-paragraph">AI hasn&#8217;t changed that question, although it has made answering it a bit harder. DORA recently released its <a href="https://cloud.google.com/resources/content/dora-roi-of-ai-assisted-software-development" target="_blank" rel="noreferrer noopener"><em>ROI of AI-Assisted Software Development</em> report</a> to show how AI is working for teams right now, and how that may or may not be contributing to organizations’ bottom lines. Nathen used the findings as a jumping-off point to dig into how AI is changing platform engineering and software development as a whole.</p>



<h2 class="wp-block-heading">The productivity gap</h2>



<p class="wp-block-paragraph">Sam started by pointing out one of the biggest headline findings from DORA’s 2025 data: Organizations saw about a 10% improvement in terms of actual code shipped to production systems. Even though developers likely felt that they were more productive, that doesn&#8217;t automatically carry through to production. DORA&#8217;s data shows higher throughput alongside higher instability. In other words, teams are shipping more but they’re also more frequently rolling back changes or implementing fixes. The gains at the individual level are real (and 10% is a pretty good number), but those gains aren’t “the dramatic improvements that you find in the headlines.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Productivity Gap with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/9jxMx1yHAZo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">AI amplifies good processes (and bad ones)</h2>



<p class="wp-block-paragraph">Nathen explained that AI is an amplifier and mirror that equally reflects the good and bad. On teams where shipping change is already easy, AI tends to keep things running well. On teams where getting change into production is painful, AI generates <em>more</em> change and makes the existing friction more acute. That said, his read on this outcome is cautiously optimistic: &#8220;If the pain is more acute, we maybe will invest in addressing that pain.&#8221;</p>



<p class="wp-block-paragraph">The rub is that the investment has to actually happen. Nathen noted that in lower-performing organizations, AI tools often arrive with a reset of expectations rather than an invitation to fix the process: Here&#8217;s your new tool. Now we expect more from you. Addressing this problem means reframing the question “Does AI make people more productive?” What we really should be asking is “Under what conditions will AI boost productivity, and who&#8217;s responsible for creating them?” And that falls on the organization, not the technology.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="AI Is an Amplifier and Mirror for Good Processes and Bad with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/5CzvrWpXBHg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Verification isn&#8217;t a checkbox</h2>



<p class="wp-block-paragraph">Trust is a big challenge with generative AI. About 30% of DORA survey respondents trust AI output little or not at all. Around 46% trust it &#8220;somewhat&#8221; (and Nathen is one of them). Despite all the advances in generative AI, these tools still make mistakes, and if you&#8217;ve multiplied your ability to generate code without doing anything to scale your ability to verify it, you&#8217;ve made your situation worse, not better.</p>



<p class="wp-block-paragraph">Nathen called this the verification tax, and it belongs in any honest accounting of AI&#8217;s productivity impact. Pipeline adaptation belongs there too: Is your delivery pipeline fit for purpose given the volume of change you&#8217;re now trying to push through? These costs don&#8217;t show up in the headlines about 10x developer productivity. They show up in your incident reports three months later.</p>



<p class="wp-block-paragraph">DORA recently published an <a href="https://dora.dev/ai/roi/calculator/#staff_size=500&amp;salary=176000&amp;revenue=100000000&amp;downtime_cost_per_hour=100000&amp;current_deployments_per_year=50&amp;current_features_per_year=50&amp;idea_success_rate=0.33&amp;revenue_impact_per_feature=0.005&amp;current_cfr=0.05&amp;current_fdrt=4&amp;time_saved_per_developer=0.125&amp;ai_license_cost_per_user=250&amp;additional_ai_cost_per_user=80&amp;additional_ai_infra_cost=100000&amp;training_cost_per_user=9600&amp;target_deployments_per_year=56&amp;target_features_per_year=56&amp;target_cfr=0.06&amp;j_curve_drop=0.15&amp;j_curve_duration=3" target="_blank" rel="noreferrer noopener">ROI framework and calculator</a> for AI-assisted software development. Nathen was clear that there&#8217;s no universal number to offer, and the calculator doesn&#8217;t pretend otherwise. What it does is give teams a way to model the real costs, including the learning investment, the verification overhead, and the pipeline changes required.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="The Verification Tax with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/wGYLtVj8z0Q?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Context switching and burnout</h2>



<p class="wp-block-paragraph">With productivity on the upswing, AI-induced burnout is becoming a serious concern. (Steve Yegge calls this the “<a href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163" target="_blank" rel="noreferrer noopener">AI vampire</a>.”) DORA’s data for 2025 showed that AI adoption wasn’t strongly connected with burnout, with the caveat that about 64% of DORA survey respondents said they’d never worked in an agentic workflow. Both of those findings are likely to change significantly in 2026.</p>



<p class="wp-block-paragraph">Nathen highlighted one source of burnout he expects to escalate as agents become the norm: context switching. As he pointed out, software developers spent years arguing for protected focus time to do the deep work that requires them to maintain flow. Agentic workflows are now incentivizing those same developers to voluntarily run a dozen or more agents at once, forcing them to context-switch multiple times every hour. As he joked, “There&#8217;s plenty of research that supports the idea that all of us feel like we&#8217;re pretty good multitaskers and none of us are.” The consequences are coming, and we’re doing it to ourselves.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Burnout Will Go Up, and We’re Doing It to Ourselves with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/ibdw27MxQq0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The cognitive debt question</h2>



<p class="wp-block-paragraph">Sam Newman brought up the related notion of “cognitive debt,” and in particular, Margaret-Anne Storey’s discussion of it. (See “<a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" target="_blank" rel="noreferrer noopener">How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt</a>” and “<a href="https://arxiv.org/abs/2603.22106" target="_blank" rel="noreferrer noopener">From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI</a>.”) Here’s how Storey explains the problem in her blog post:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Debt compounded from going fast lives in the brains of the developers and affects their lived experiences and abilities to “go fast” or to make changes. Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do, how their intentions were implemented, or how to possibly change it.</p>
</blockquote>



<p class="wp-block-paragraph">And as Sam noted, this compounds across teams and organizations. As developers increasingly work in parallel with AI rather than with each other, they lose the shared understanding that comes from people building software together. Kent Beck once said that “<a href="https://tidyfirst.substack.com/p/self-team-product" target="_blank" rel="noreferrer noopener">software design is an exercise in human relationships</a>.” Agentic workflows are putting pressure on that in ways we&#8217;re only beginning to see.</p>



<p class="wp-block-paragraph">Nathen agreed that cognitive debt is what concerns him most, and both your workers and your architecture will suffer for it. The ramifications of an architectural decision you made eight months ago take years of operation to surface, and AI doesn&#8217;t help with that at all.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Cognitive Debt and Long Feedback Loops with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/yiOsikXaQ7c?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Invest in your platform now</h2>



<p class="wp-block-paragraph">Considering what makes some AI-assisted teams high performers, Nathen explained, “It’s not <em>that</em> you’re using AI but <em>how</em> you’re using AI.” This observation led DORA to develop <a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-doras-inaugural-ai-capabilities-model" target="_blank" rel="noreferrer noopener">seven capabilities</a> that, when combined with AI adoption, lead to better outcomes. Nathen briefly ran through the list, ending on quality internal platforms. And here he made a claim about software engineering investment that was, in his words, “a little bit wild”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Every product engineer that you have in your organization, every engineer that&#8217;s focused on building features right now, should probably stop building features and focus on the platform.</p>
</blockquote>



<p class="wp-block-paragraph">His argument is that platforms matter more, not less, in an environment where AI makes it possible for almost anyone in an organization to build something. The people closest to customers and business problems can now generate working software. What they can&#8217;t do is ensure that software is durable, secure, and production-ready.</p>



<p class="wp-block-paragraph">Nathen suggested that the best leverage for software engineering investment today might be building platforms that provide those guardrails, that shift the complexity of production-readiness down into the infrastructure so that anyone building on top of it gets the safety net for free. He acknowledged that moving every product engineer to platform work might be overkill. But the direction of travel is real. The platform is also, as Sam pointed out, where you bring determinism back into a process that AI has made more nondeterministic.</p>



<p class="wp-block-paragraph">That’s something we’ve been hearing a lot here at O’Reilly. The expansion of who can build doesn&#8217;t reduce the need for deep engineering expertise. It changes where that expertise is most valuable, and platforms are a good answer to where.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="AI Capabilities and the Case for Platform Investment Now with Nathen Harvey" width="500" height="281" src="https://www.youtube.com/embed/CIFoHFTbIec?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">What DORA’s research tells us</h2>



<p class="wp-block-paragraph">The teams that are doing well are running experiments, learning from them, and spreading those lessons. The measure Nathen suggested is not how many tokens you&#8217;ve consumed but how many experiments you&#8217;ve run and how well you&#8217;re distributing what you&#8217;ve learned.</p>



<p class="wp-block-paragraph">The tools are moving fast enough that any organization locking in a fixed policy around specific tools will find itself stuck. What you want is the capacity to keep learning, which means building the culture and the processes that make learning visible and transferable.</p>



<p class="wp-block-paragraph">All of DORA&#8217;s research is freely available at <a href="https://dora.dev/" target="_blank" rel="noreferrer noopener">dora.dev</a>, including the 2025 annual report and the ROI framework. The <a href="https://dora.community/" target="_blank" rel="noreferrer noopener">DORA Community</a> provides a space for practitioners to work through these questions together. If you&#8217;re trying to navigate any of this with your team, you may want to spend some time there.</p>



<p class="wp-block-paragraph">And if you want to dive deeper into Nathen and Sam’s chat or explore the other sessions, you can <a href="https://learning.oreilly.com/videos/infrastructure-ops/0642572308308/" target="_blank" rel="noreferrer noopener">watch the entire Infrastructure &amp; Ops Superstream</a> on the O’Reilly learning platform. Our next event, on September 9, will cover agentic observability. <a href="https://www.oreilly.com/live/io-superstream-agentic-observability.html" target="_blank" rel="noreferrer noopener">Register for free here</a>, and check out all the other <a href="https://www.oreilly.com/live/free.html" target="_blank" rel="noreferrer noopener">free live events on O’Reilly</a>.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/get-a-good-return-on-your-ai-investments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Agent Skills</title>
		<link>https://www.oreilly.com/radar/agent-skills/</link>
				<comments>https://www.oreilly.com/radar/agent-skills/#respond</comments>
				<pubDate>Wed, 27 May 2026 10:59:18 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18796</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A senior engineer’s job is mostly the parts that don’t show up in the diff. Specs. Tests. Reviews. Scope discipline. Refusing to ship what can’t be verified. AI coding agents skip those parts by default. Agent Skills is my attempt to make them not optional.]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission. The default behavior of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It doesn’t ask whether you have a spec, write a test before [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on <a href="https://addyosmani.com/blog/agent-skills/" target="_blank" rel="noreferrer noopener">Addy Osmani’s blog</a> and is being reposted here with the author’s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">The default behavior of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It doesn’t ask whether you have a spec, write a test before the implementation, consider whether the change crosses a trust boundary, or check what the PR will look like to a reviewer. It produces code, declares victory, and moves on.</p>



<p class="wp-block-paragraph">This is the same failure mode every senior engineer has spent their career learning to avoid. The senior version of any task includes work that doesn’t show up in the diff: surfacing assumptions, writing the spec, breaking the work into reviewable chunks, choosing the boring design, leaving evidence that the result is correct, sizing the change so a human can actually review it. Those steps are most of what separates engineers who ship reliable software at scale from people who push code that breaks.</p>



<p class="wp-block-paragraph">Agents skip those steps for the same reason any junior would. They’re invisible. The reward signal points at “task complete” not “task complete and the design doc exists.” So we have to bolt the senior-engineer scaffolding back on.</p>



<p class="wp-block-paragraph"><a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">Agent Skills</a> is my attempt at that scaffolding. It just crossed 27K stars, so apparently I’m not alone in wanting it. This post is the part the README doesn’t quite cover: why each design choice exists, how it maps onto standard SDLC and Google’s published engineering practices, and what you should steal from the project even if you never install a single skill.</p>



<h2 class="wp-block-heading">What a “skill” actually is</h2>



<p class="wp-block-paragraph">The word “skill” is doing a lot of work in the Claude Code/Anthropic vocabulary, and it helps to be precise. A skill is a Markdown file with front matter that gets injected into the agent’s context when the situation calls for it. Somewhere between a system-prompt fragment and a runbook.</p>



<p class="wp-block-paragraph">A skill is <em>not</em> reference documentation. It is not “everything you should know about testing.” It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.</p>



<p class="wp-block-paragraph">That distinction is the whole game. If you put a 2,000-word essay on testing best practices into the agent’s context, the agent reads it, generates plausible-looking text, and skips the actual testing. If you put a <em>workflow</em> there (write the failing test first, run it, watch it fail, write the minimum code to pass, watch it pass, refactor), the agent has something to do, and you have something to verify.</p>
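


<p class="wp-block-paragraph">A minimal Python sketch of that difference: A workflow is a list of steps, each with a check that has to return evidence before the next step runs. The names (<code>Step</code>, <code>run_workflow</code>) and the stand-in checks are assumptions for illustration, not the format the Agent Skills repo uses.</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    # Returns a piece of evidence (for example a log line), or None if the step is not done.
    check: Callable[[], Optional[str]]


def run_workflow(steps):
    """Run steps in order; stop at the first one that yields no evidence."""
    evidence = []
    for step in steps:
        proof = step.check()
        if proof is None:
            return False, step.name, evidence
        evidence.append((step.name, proof))
    return True, None, evidence


# Hypothetical TDD loop; each lambda stands in for a real command or inspection.
tdd = [
    Step("write failing test", lambda: "tests/test_cart.py added"),
    Step("watch it fail", lambda: "1 failed"),
    Step("write minimum code", lambda: None),  # not done yet
    Step("watch it pass", lambda: "1 passed"),
]

done, blocked_at, evidence = run_workflow(tdd)
print(done, blocked_at, len(evidence))</code></pre>



<p class="wp-block-paragraph">The run stops at “write minimum code” with two pieces of evidence collected. An essay in the context window gives you nothing comparable to inspect.</p>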



<p class="wp-block-paragraph">Process over prose. Workflows over reference. Steps with exit criteria over essays without them. That single distinction separates a useful skill from a pretty Markdown file. It also explains why so many “AI rules” repos end up doing nothing in practice. The rules are essays.</p>



<h2 class="wp-block-heading">The SDLC the skills encode</h2>



<p class="wp-block-paragraph">The 20 skills in the repo organize around six lifecycle phases, with seven slash commands sitting on top. Define (<code>/spec</code>) is where you decide what you’re actually building. Plan (<code>/plan</code>) breaks the work down. Build (<code>/build</code>) implements it in vertical slices. Verify (<code>/test</code>) proves it works. Review (<code>/review</code>) catches what slipped through. Ship (<code>/ship</code>) gets it to users safely. <code>/code-simplify</code> sits across the bottom of the whole thing.</p>



<p class="wp-block-paragraph">This isn’t a coincidence. It’s the same SDLC every functioning engineering organization runs, just in different vocabulary. Google calls it design doc → review → implementation → readability review → launch checklist. Amazon calls it the Working Backwards memo and the Bar Raiser. Every healthy team has some version of this loop.</p>



<p class="wp-block-paragraph">What’s new with AI coding agents is that <em>most agents skip most of these phases by default</em>. You ask for a feature, you get an implementation, and the spec, plan, tests, review, and launch checklist all just don’t happen. Skills push the agent through the same phases a senior engineer forces themselves through, because shipping the code without them is how you produce incidents.</p>



<p class="wp-block-paragraph">A complex feature might activate eleven skills in sequence. A small bug fix might use three. The router (<code>using-agent-skills</code>) decides which apply. The point is that the workflow scales to the actual scope, not to the assumed scope.</p>



<h2 class="wp-block-heading">Five principles that are doing the work</h2>



<p class="wp-block-paragraph">Five design decisions in the project are the loadbearing ones. The rest of the system follows from them.</p>



<h3 class="wp-block-heading">1. Process over prose</h3>



<p class="wp-block-paragraph">Already covered. Workflows are agent-actionable; essays are not. The same is true for human teams. If your team handbook is 200 pages, no one reads it under time pressure. If it’s a small set of workflows with checkpoints, people actually run them.</p>



<h3 class="wp-block-heading">2. Anti-rationalization tables</h3>



<p class="wp-block-paragraph">This is the most distinctive design decision in the project, and the one I most want other teams to steal.</p>



<p class="wp-block-paragraph">Each skill includes a table of common excuses an agent (or a tired engineer) might use to skip the workflow, paired with a written rebuttal. A few examples close to the originals:</p>



<ul class="wp-block-list">
<li>“This task is too simple to need a spec.” → Acceptance criteria still apply. Five lines is fine. Zero lines is not.</li>



<li>“I’ll write tests later.” → Later is the loadbearing word. There is no later. Write the failing test first.</li>



<li>“Tests pass, ship it.” → Passing tests are evidence, not proof. Did you check the runtime? Did you verify user-visible behavior? Did a human read the diff?</li>
</ul>



<p class="wp-block-paragraph">The reason this works is that LLMs are excellent at rationalization. They will produce a plausible-sounding paragraph explaining why <em>this particular</em> task doesn’t need a spec or why <em>this particular</em> change is fine to merge without review. Anti-rationalization tables are prewritten rebuttals to lies the agent hasn’t yet told.</p>
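


<p class="wp-block-paragraph">As a rough sketch of the mechanism, the table can be treated as a lookup from excuse to rebuttal. The Python below assumes naive substring matching and a hypothetical <code>rebut</code> helper. In the skills themselves the table is plain Markdown that the agent reads, so this only illustrates the pairing.</p>



<pre class="wp-block-code"><code># Excuse fragments mapped to prewritten rebuttals (wording condensed from the examples above).
REBUTTALS = {
    "too simple to need a spec": "Acceptance criteria still apply. Five lines is fine. Zero lines is not.",
    "write tests later": "There is no later. Write the failing test first.",
    "tests pass, ship it": "Passing tests are evidence, not proof. Did a human read the diff?",
}


def rebut(agent_message):
    """Return every rebuttal whose excuse appears in the agent's message."""
    text = agent_message.lower()
    return [reply for excuse, reply in REBUTTALS.items() if excuse in text]


replies = rebut("This task is too simple to need a spec, and I'll write tests later.")
for reply in replies:
    print(reply)</code></pre>



<p class="wp-block-paragraph">Substring matching is crude and easy to dodge with a rephrased excuse. The value is in having the rebuttal written before the excuse shows up.</p>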



<p class="wp-block-paragraph">The pattern is just as good for human teams. Most engineering decay isn’t anyone choosing to do bad work. It’s people accepting plausible-sounding justifications for skipping the parts they don’t feel like doing. A team that writes down its anti-rationalizations is a team that has fewer of them.</p>



<h3 class="wp-block-heading">3. Verification is nonnegotiable</h3>



<p class="wp-block-paragraph">Every skill terminates in concrete evidence. Tests pass. Build output is clean. The runtime trace shows the expected behavior. A reviewer signs off. “Seems right” is never sufficient.</p>



<p class="wp-block-paragraph">This is the same principle that makes Anthropic’s harness recover from failures, that makes Cursor’s planner/worker/judge split actually catch bugs, that makes any <a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">long-running agent</a> recoverable. The agent is a generator. You need a separate signal that the work is done. Skills bake that signal into every workflow.</p>
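


<p class="wp-block-paragraph">One way to picture that separate signal is a gate that trusts a command’s exit code rather than the agent’s own report. This is a sketch under assumptions: <code>verify</code> is a hypothetical helper, and the two inline Python commands stand in for a real test suite or build.</p>



<pre class="wp-block-code"><code>import subprocess
import sys


def verify(command):
    """Run a check command; the exit code, not the agent's claim, decides 'done'."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "passed": result.returncode == 0,
        "output": (result.stdout + result.stderr).strip(),
    }


# Stand-in checks so the sketch runs anywhere; a real gate might run a test suite or a build.
passing = verify([sys.executable, "-c", "print('3 passed')"])
failing = verify([sys.executable, "-c", "raise SystemExit(1)"])

print(passing["passed"], passing["output"])
print(failing["passed"])</code></pre>



<p class="wp-block-paragraph">The returned dictionary doubles as the evidence record: the command that ran, whether it passed, and what it printed.</p>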



<h3 class="wp-block-heading">4. Progressive disclosure</h3>



<p class="wp-block-paragraph">Do not load all 20 skills into context at session start. Activate them based on the phase. A small meta-skill (<code>using-agent-skills</code>) acts as a router that decides which skill applies to the current task.</p>



<p class="wp-block-paragraph">This is the <a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">harness engineering</a> lesson applied at skill granularity. Every token loaded into context degrades performance somewhere, so you load what’s relevant and leave the rest on disk. Progressive disclosure is how you get a 20-skill library into a 5K-token slot without poisoning the well.</p>
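


<p class="wp-block-paragraph">A toy router makes the idea concrete. In this sketch the skill names come from this post, but the phase assignments, token counts, and the <code>route</code> function are invented for illustration. The real <code>using-agent-skills</code> meta-skill is Markdown the agent reads, not Python.</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    phase: str
    tokens: int  # rough size of the skill file once loaded


# Phases and token counts below are made up for the example.
LIBRARY = [
    Skill("test-driven-development", "verify", 1800),
    Skill("code-review-and-quality", "review", 2200),
    Skill("api-and-interface-design", "define", 2000),
    Skill("code-simplification", "review", 1500),
]


def route(phase, budget):
    """Pick skills for the current phase, smallest first, without exceeding the token budget."""
    chosen, used = [], 0
    for skill in sorted(LIBRARY, key=lambda s: s.tokens):
        if skill.phase != phase or used + skill.tokens > budget:
            continue
        chosen.append(skill.name)
        used += skill.tokens
    return chosen, used


names, used = route("review", budget=5000)
print(names, used)</code></pre>



<p class="wp-block-paragraph">Everything that doesn’t match the phase or fit the budget stays on disk, which is the whole trick.</p>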



<h3 class="wp-block-heading">5. Scope discipline</h3>



<p class="wp-block-paragraph">The meta-skill encodes a nonnegotiable I’d staple to every agent if I could: “touch only what you’re asked to touch.” Don’t refactor adjacent systems. Don’t remove code you don’t fully understand. Don’t brush against a TODO and decide to rewrite the file.</p>



<p class="wp-block-paragraph">This sounds obvious until you watch an agent decide that fixing one bug requires modernizing three unrelated files. Scope discipline is the single biggest determinant of whether an agent’s PR is mergeable or has to be unwound. It’s also the principle that maps most cleanly onto Google’s code review norms, where reviewers will block a PR for doing more than one thing.</p>
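


<p class="wp-block-paragraph">Scope discipline is also one of the easier rules to check mechanically. The sketch below assumes you can list the files a change touched and the glob patterns the task was allowed to touch. The paths and the <code>out_of_scope</code> helper are hypothetical.</p>



<pre class="wp-block-code"><code>from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical bug fix scoped to the cart module and its tests.
allowed = ["src/cart/*", "tests/cart/*"]
changed = ["src/cart/totals.py", "tests/cart/test_totals.py", "src/auth/session.py"]

violations = out_of_scope(changed, allowed)
print(violations)</code></pre>



<p class="wp-block-paragraph">A check like this could run in a hook or in CI. Anything it returns is a file the agent touched without being asked.</p>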



<h2 class="wp-block-heading">The Google DNA</h2>



<p class="wp-block-paragraph">The skills are saturated with practices from <em><a href="https://learning.oreilly.com/library/view/software-engineering-at/9781492082781/" target="_blank" rel="noreferrer noopener">Software Engineering at Google</a></em> and Google’s public engineering culture. This is intentional. Most of what makes Google-scale software work is documented and public, and it is <em>exactly</em> the part agents are most likely to skip.</p>



<p class="wp-block-paragraph">A partial map of which skill encodes which practice:</p>



<ul class="wp-block-list">
<li><strong>Hyrum’s law</strong><strong> in </strong><strong>api-and-interface-design</strong><strong>. </strong>Every observable behavior of your API will eventually be depended on by someone, so design with that in mind.</li>



<li><strong>The test pyramid (~80/15/5) and the Beyoncé rule</strong><strong> in </strong><strong>test-driven-development</strong><strong>.</strong> “If you liked it, you should have put a test on it.” If an infrastructure change breaks behavior that no test covers, the gap is the missing test, not the change.</li>



<li><strong>DAMP over DRY in tests.</strong> Google’s testing philosophy is explicit that test code should read like a specification even at the cost of some duplication. Overabstracted tests are a known antipattern.</li>



<li><strong>~100-line PR sizing, with Critical/Nit/Optional/FYI severity labels</strong><strong> in </strong><strong>code-review-and-quality</strong><strong>.</strong> Straight from Google’s code review norms. Big PRs don’t get reviewed; they get rubber-stamped.</li>



<li><strong>Chesterton’s Fence</strong><strong> in </strong><strong>code-simplification</strong><strong>.</strong> Don’t remove a thing until you understand why it was put there.</li>



<li><strong>Trunk-based development and atomic commits</strong><strong> in </strong><strong>git-workflow-and-versioning</strong><strong>.</strong></li>



<li><strong>Shift left and feature flags</strong><strong> in </strong><strong>ci-cd-and-automation</strong><strong>.</strong> Catch problems as early as possible, decouple deploy from release.</li>



<li><strong>Code-as-liability</strong><strong> in </strong><strong>deprecation-and-migration</strong><strong>.</strong> Every line you keep is one you have to maintain forever, so prefer the smaller surface.</li>
</ul>
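


<p class="wp-block-paragraph">The PR sizing norm in the list above is the easiest of these to automate. As a sketch, assuming a unified diff as input and the ~100-line figure as the default limit (the function names and the sample diff are made up):</p>



<pre class="wp-block-code"><code>def changed_lines(unified_diff):
    """Count added and removed lines in a unified diff, ignoring file headers."""
    count = 0
    for line in unified_diff.splitlines():
        # File headers start with +++ or ---. Caveat: a removed line whose own
        # text starts with "--" would be skipped too; good enough for a sketch.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            count += 1
    return count


def review_size(unified_diff, limit=100):
    """Report the change size and whether it fits the reviewable limit."""
    total = changed_lines(unified_diff)
    return {"changed": total, "reviewable": not total > limit}


SAMPLE = """\
--- a/cart.py
+++ b/cart.py
@@ -1,3 +1,4 @@
 def total(items):
-    return sum(items)
+    subtotal = sum(items)
+    return round(subtotal, 2)
"""

print(review_size(SAMPLE))</code></pre>



<p class="wp-block-paragraph">A gate like this doesn’t judge quality. It only stops a change from growing past the point where a human will actually read it.</p>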



<p class="wp-block-paragraph">None of these are new ideas. The point is that none of them are in the agent by default. A frontier model has read the phrase “Hyrum’s law” in its training data, but it does not apply Hyrum’s law when it’s designing your API at 3am. Skills are how you make sure it does.</p>



<h2 class="wp-block-heading">How to actually use it</h2>



<p class="wp-block-paragraph">Three modes, in roughly increasing commitment.</p>



<p class="wp-block-paragraph"><strong>Mode 1: Install via marketplace. </strong>If you’re using Claude Code:</p>



<pre class="wp-block-code"><code>/plugin marketplace add addyosmani/agent-skills
/plugin install agent-skills@addy-agent-skills</code></pre>



<p class="wp-block-paragraph">You get the slash commands (<code>/spec</code>, <code>/plan</code>, <code>/build</code>, <code>/test</code>, <code>/review</code>, <code>/ship</code>, <code>/code-simplify</code>) and the agent activates the relevant skills automatically based on context. This is the path I’d recommend most people start on.</p>



<p class="wp-block-paragraph"><strong>Mode 2: Drop the Markdown into your tool of choice.</strong> The skills are plain Markdown with front matter. Cursor users put them in <code>.cursor/rules/</code>. Gemini CLI has its own install path. Codex, Aider, Windsurf, OpenCode, or anything else that accepts a system prompt can read them. The tooling matters less than the workflow underneath.</p>



<p class="wp-block-paragraph"><strong>Mode 3: Read them as a spec. </strong>Even if you never install anything, the skills are a <em>documented description of what good engineering with AI agents looks like</em>. Read <code>code-review-and-quality.md</code> and apply the five-axis framework to your team’s review process. Read <code>test-driven-development.md</code> and use it to settle the next “do we need to write the test first” argument with a junior. Read the meta-skill and steal the five nonnegotiables for your own AGENTS.md.</p>



<p class="wp-block-paragraph">This third mode is where I’d actually start. Pick the four or five skills closest to your current pain. Decide which workflows you want enforced. Then install the runtime, or roll your own, to do the enforcing.</p>



<h2 class="wp-block-heading">What to steal even if you never install</h2>



<p class="wp-block-paragraph">A few patterns from the project I’d steal regardless of whether you use AI coding agents at all:</p>



<p class="wp-block-paragraph"><strong>Anti-rationalization as a team practice.</strong> Write down the lies your team tells itself. “We’ll fix the tests after launch.” “This change is too small for a design doc.” “It’s fine, we have monitoring.” Pair each with the rebuttal. Put it in your AGENTS.md or your engineering wiki. It will save you arguments and it will catch the next tired Friday-afternoon shortcut.</p>



<p class="wp-block-paragraph"><strong>Process over prose for anything you write internally.</strong> If you find yourself writing a 2,000-word doc titled “how we approach X,” you’ve written reference material. Convert it to a workflow with checkpoints. The doc shrinks to 400 words and people actually run it. This applies as much to onboarding guides and runbooks as it does to agent skills.</p>



<p class="wp-block-paragraph"><strong>Verification as a hard exit criterion.</strong> Make “produce evidence” the exit step of every task. For agents, for engineers, for yourself. Evidence is whatever proves the work is done: a green test run, a screenshot, a log, a review approval. Without it, the task is not done. “Seems right” never closes the loop.</p>



<p class="wp-block-paragraph"><strong>Progressive disclosure for any rulebook.</strong> Do not write a 50-page handbook. Write a small router that points to the right small chapter for the situation. This is true for AGENTS.md, for runbooks, for incident playbooks, for anything anyone will read under time pressure.</p>



<p class="wp-block-paragraph">Five nonnegotiables, lifted from the meta-skill, that I’d put in any AGENTS.md tomorrow:</p>



<ol class="wp-block-list">
<li>Surface assumptions before building. Wrong assumptions held silently are the most common failure mode.</li>



<li>Stop and ask when requirements conflict. Don’t guess.</li>



<li>Push back when warranted. The agent (or engineer) is not a yes-machine.</li>



<li>Prefer the boring, obvious solution. Cleverness is expensive.</li>



<li>Touch only what you’re asked to touch.</li>
</ol>



<p class="wp-block-paragraph">That’s a worthwhile engineering culture in five lines, and you don’t need to install anything to adopt it.</p>



<h2 class="wp-block-heading">Where this fits in the harness</h2>



<p class="wp-block-paragraph">In the broader picture, skills are one layer of <a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">agent harness engineering</a>. The harness is the model plus everything you build around it; skills are the reusable workflow chunks that get progressively disclosed into the system prompt. They sit alongside <code>AGENTS.md</code> (the rolling rulebook), hooks (the deterministic enforcement layer), tools (the actions the agent can take), and the session log (the durable memory). Each layer has a specific job. Skills do the senior-engineer-process job.</p>



<p class="wp-block-paragraph">Skills matter more for <a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">long-running agents</a> than they do for chat-style ones, because long runs amplify every shortcut. An agent that skips the test in a 10-minute session produces one bug. An agent that skips the test in a 30-hour session produces a debugging archaeology project at the end of the run, when no one remembers what the original intent was. The longer the run, the more the senior-engineer scaffolding has to be enforced rather than suggested.</p>



<p class="wp-block-paragraph">The portability of the skills format matters too. The same SKILL.md file works in Claude Code, Cursor (with rules), Gemini CLI, Codex, and any other harness that accepts system-prompt content. Write the workflow once, and the runtime enforces it. That’s the thing the Markdown-with-front-matter format buys you that bespoke prompt engineering does not.</p>



<h2 class="wp-block-heading">Closing</h2>



<p class="wp-block-paragraph">The thing I most want people to take from this project, more than the skills themselves, is the framing.</p>



<p class="wp-block-paragraph">AI coding agents are extremely capable junior engineers with no instinct for the parts of the job that don’t show up in the diff. The senior-engineering work (surfacing assumptions, sizing changes, writing the spec, leaving evidence, refusing to merge what can’t be reviewed) is exactly what an agent will skip unless you make it impossible to skip. The job, increasingly, is to encode that discipline as something the agent cannot talk itself out of.</p>



<p class="wp-block-paragraph">Skills are one shape of that. Anti-rationalization tables. Progressive disclosure. Process over prose. Verification as the loadbearing exit criterion. The Google practices that already work, made portable.</p>



<p class="wp-block-paragraph">You can install <a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">my version</a>. You can roll your own. The lesson stands either way: The senior-engineer parts of the job are no longer optional, even when the engineer is a model.</p>



<p class="wp-block-paragraph"><em>The repo is at <a href="https://github.com/addyosmani/agent-skills" target="_blank" rel="noreferrer noopener">github.com/addyosmani/agent-skills</a> (MIT). For the broader scaffolding picture, see “<a href="https://addyosmani.com/blog/agent-harness-engineering/" target="_blank" rel="noreferrer noopener">Agent Harness Engineering</a>” and “<a href="https://addyosmani.com/blog/long-running-agents/" target="_blank" rel="noreferrer noopener">Long-Running Agents</a>.”</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/agent-skills/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Who Authorized That? The Delegation Problem in Multi-Agent AI</title>
		<link>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/</link>
				<comments>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/#respond</comments>
				<pubDate>Tue, 26 May 2026 10:58:58 +0000</pubDate>
					<dc:creator><![CDATA[Sunil Prakash]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18793</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Who-Authorized-That.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Who-Authorized-That-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Securing access isn’t enough. As agents begin calling other agents, enterprises need to secure delegation too.]]></custom:subtitle>
		
				<description><![CDATA[Your AI agent booked a meeting, summarized a financial report, and emailed the highlights to three stakeholders. To do this, it called a calendar agent, a document analysis agent, and an email agent. Each accessed internal systems, made decisions about what to include, and acted on your behalf. Here’s the question your security team can’t [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Your AI agent booked a meeting, summarized a financial report, and emailed the highlights to three stakeholders. To do this, it called a calendar agent, a document analysis agent, and an email agent. Each accessed internal systems, made decisions about what to include, and acted on your behalf.</p>



<p class="wp-block-paragraph">Here’s the question your security team can’t answer: <strong>Who authorized the email agent to read that financial report?</strong></p>



<p class="wp-block-paragraph">In most current architectures, the honest answer is: no one, explicitly. The logs may show that a service called another service. But they can’t show that the delegation itself was authorized. The authorization didn’t fail loudly. It leaked silently through the chain.</p>



<p class="wp-block-paragraph">This is the delegation problem in multi-agent AI. As enterprises connect agents through protocols such as MCP and A2A, they’re solving the connectivity problem faster than they’re solving the authority problem. The result is a new security boundary that most enterprise architectures have not yet modeled, precisely because most organizations still treat it as orchestration rather than authorization.</p>



<h2 class="wp-block-heading">Agents are connecting faster than authorization is adapting</h2>



<p class="wp-block-paragraph">The agent ecosystem has moved fast over the past two years. Anthropic&#8217;s MCP gave model-powered applications a standard way to connect to tools, data sources, and services. Google&#8217;s A2A protocol gave agents a standard way to communicate and coordinate across systems. Frameworks and SDKs such as LangChain, CrewAI, and Google&#8217;s ADK made it easier to build multi-agent workflows where one agent orchestrates several others.</p>



<p class="wp-block-paragraph">What these protocols don’t yet provide, at least not as a mature common layer, is a delegation-aware authorization model.</p>



<p class="wp-block-paragraph">MCP describes a protected server as an OAuth 2.1 resource server, with the MCP client acting as an OAuth client making requests on behalf of a resource owner. That’s a familiar and well-understood pattern, but it was designed for a world where a human clicks &#8220;Allow&#8221; and a single client gets a scoped token. It doesn’t address what happens when Agent A receives that token, delegates a subtask to Agent B, and Agent B spawns Agent C to handle part of it. Each hop in that chain either reuses the original token (overprivileged) or has no token at all (untracked).</p>



<p class="wp-block-paragraph">A2A was built for interoperability: independent, potentially opaque agent systems communicating and coordinating actions across enterprise platforms. That’s the right problem to solve. But communication and delegation governance are different layers. A2A helps agents discover, describe, and communicate with one another. This is necessary infrastructure, but it isn’t the same as delegated authority. It doesn’t tell you whether a specific downstream action was legitimately derived from an upstream instruction.</p>



<p class="wp-block-paragraph">Static API keys are even weaker for this problem. A key grants access to a service. It says nothing about who is using it, what they’re using it for, or whether the entity presenting it is the same one it was issued to. Service accounts identify a workload, not an intent. When three agents share a service account, every action looks the same in your logs.</p>



<p class="wp-block-paragraph">None of these tools are broken. They solve different problems. The gap is structural. Authentication answers which agent is calling. Authorization defines what that agent may access. The harder question, and the one most enterprise architectures are not yet designed to answer, is whether a specific downstream action was legitimately derived from an upstream instruction, under narrowed constraints, with a verifiable chain back to a human decision. That’s the delegation question, and it sits in a layer that today&#8217;s stack doesn’t really have.</p>



<p class="wp-block-paragraph">In a clean version of this picture, privilege should sit only with the agent that touches the outside world. If a payer (A) asks a bookkeeper agent (B) to make a payment, and the bookkeeper asks a banking agent (C) to execute the transfer, only the banking agent needs banking authority. The bookkeeper doesn’t need to move money. It only needs to know the request came from an authorized payer. The banking agent only needs to know the request came from an authorized bookkeeper. This is the principle of least privilege, a concept the security community has lived with for decades, applied to delegation chains. The difficulty is that today&#8217;s agent stacks make it hard to enforce.</p>



<h2 class="wp-block-heading">What breaks in the chain</h2>



<p class="wp-block-paragraph">Consider a treasury reporting workflow in a regulated bank. A planning agent is allowed to read liquidity projections and produce a daily summary for senior finance users. To complete the task, it delegates chart generation to a visualization agent and narrative review to a communications agent. The visualization agent doesn’t need access to raw account-level data. The communications agent doesn’t need access to the underlying liquidity model. Yet unless the delegation layer attenuates permissions, both may receive more context than their task requires. The result isn’t a dramatic breach, but it is a quiet expansion of access that the access-control model never explicitly approved.</p>



<p class="wp-block-paragraph">The risk isn’t limited to internet-facing agents. Many delegation failures happen entirely inside the enterprise boundary. An internal agent may call another internal agent, which calls an internal tool, which sends data to an approved SaaS service. Every individual step may look acceptable. The risk appears in the composition: The final data movement or action may exceed the intent of the original authorization.</p>



<p class="wp-block-paragraph">This pattern creates three categories of failure that enterprises may have to explain to regulators, auditors, or customers.</p>



<p class="wp-block-paragraph"><strong>Ghost permissions. </strong>A finance analyst assistant has been given access to a customer transactions database to support quarterly reporting. It calls a summarization agent: &#8220;summarize recent transactions for these accounts.&#8221; The summarization agent now operates against customer records, even though no policy engine granted it that access. The analyst assistant&#8217;s privileges effectively traveled with the request. The permission is a ghost. It exists in practice but not in any authorization system.</p>



<p class="wp-block-paragraph"><strong>Scope drift.</strong> Even when an agent starts with narrow permissions, delegation tends to widen scope rather than narrow it. An agent authorized to read Q1 revenue data delegates to a charting agent, which calls an external rendering API, which now has the revenue figures. The data left the organization through three hops of implicit trust. Each agent acted within what it understood as its scope. The aggregate result exceeded what any human would have approved.</p>



<p class="wp-block-paragraph"><strong>Broken audit trails.</strong> Regulated industries require the ability to answer &#8220;who did what and why&#8221; for any consequential action. In a single-agent system, this is manageable. In a multi-agent chain, the audit trail fragments across agents, protocols, and services. When a compliance team asks why a particular customer communication was sent, the answer might involve four agents across two protocols, none of which logged the delegation chain. The action is traceable to a system but not to a decision.</p>



<p class="wp-block-paragraph">These aren’t edge cases. They’re a common outcome when delegation isn’t modeled explicitly. The delegation problem isn’t a bug in any particular framework. It’s a gap in the layer between them.</p>



<h2 class="wp-block-heading">What a delegation-aware model requires</h2>



<p class="wp-block-paragraph">A delegation-aware authorization model has to solve four things at once, which is part of why no existing layer covers it cleanly.</p>



<p class="wp-block-paragraph">The first is identity. The downstream agent needs a cryptographic credential that the receiving system can verify independently, not just a hostname or an API key. Hostnames lie. API keys travel. A real identity is one the calling system cannot fabricate.</p>



<p class="wp-block-paragraph">The second is attenuation. When an agent delegates a task, the subagent should receive strictly fewer permissions than the parent—never the same set, and certainly never more. This is the principle of least privilege applied to delegation chains, and almost no current tooling enforces it by default.</p>
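


<p class="wp-block-paragraph">In code, attenuation reduces to a strict-subset check at the moment of delegation. The sketch below is a minimal illustration, not any framework’s API. The <code>delegate</code> function and the scope strings are hypothetical.</p>



<pre class="wp-block-code"><code>def delegate(parent_scopes, requested_scopes):
    """Grant a subagent only a strict subset of the parent's scopes, or refuse."""
    parent = frozenset(parent_scopes)
    requested = frozenset(requested_scopes)
    if not requested.issubset(parent):
        raise PermissionError(f"not held by parent: {sorted(requested - parent)}")
    if requested == parent:
        raise PermissionError("delegation must narrow the parent's scopes")
    return requested


# Hypothetical scopes for the treasury example.
planner = {"liquidity:read", "accounts:read", "summary:write"}

charting = delegate(planner, {"liquidity:read"})
print(sorted(charting))

try:
    delegate(planner, {"liquidity:read", "payments:execute"})
except PermissionError as err:
    print("refused:", err)</code></pre>



<p class="wp-block-paragraph">Both failure cases are refused: asking for a scope the parent never held, and asking for the parent’s full set unchanged.</p>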



<p class="wp-block-paragraph">The third is purpose. &#8220;Read this report to summarize liquidity exposure for the CFO&#8221; is a different authorization from &#8220;read this report and send selected figures to an external charting service.&#8221; It may be the same data and the same agent, but it’s two very different risk profiles. Without a purpose binding, the authorization layer has no way to distinguish them.</p>
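


<p class="wp-block-paragraph">A purpose binding can be sketched as one more field in the grant that has to match. Everything named here (<code>Grant</code>, <code>is_authorized</code>, the purpose labels) is an assumption made for the example, not an existing policy engine’s schema.</p>



<pre class="wp-block-code"><code>from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    agent: str
    resource: str
    action: str
    purpose: str


# Hypothetical policy: the same agent and data, but only one approved purpose.
GRANTS = {
    Grant("report-agent", "liquidity-report", "read", "summarize-for-cfo"),
}


def is_authorized(agent, resource, action, purpose):
    """Allow a request only if identity, resource, action, and purpose all match a grant."""
    return Grant(agent, resource, action, purpose) in GRANTS


print(is_authorized("report-agent", "liquidity-report", "read", "summarize-for-cfo"))
print(is_authorized("report-agent", "liquidity-report", "read", "export-to-charting-service"))</code></pre>



<p class="wp-block-paragraph">Drop the <code>purpose</code> field and the two requests become indistinguishable, which is the gap the paragraph above describes.</p>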



<p class="wp-block-paragraph">The fourth is audit. The organization should be able to reconstruct, after the fact, who delegated what, under which constraints, and what evidence each agent produced at completion. Not just which systems were called but which decisions were made and on whose authority.</p>



<p class="wp-block-paragraph">It’s possible for agents to authenticate successfully even when they don’t have accountable authority. They can prove who they are and still execute actions that no human ever authorized.</p>



<h2 class="wp-block-heading">Emerging approaches</h2>



<p class="wp-block-paragraph">Several efforts address parts of this problem: workload identity standards, agent metadata in tokens, OAuth-based MCP authorization, A2A authentication patterns, and agent identity frameworks. These are useful building blocks, but identity is not the same as delegated authority. A signed agent card can help establish an agent&#8217;s declared identity and capabilities. An OAuth token can tell you what a client may access. Neither, by itself, proves that a specific downstream action was authorized by a specific upstream decision under narrowed constraints.</p>



<p class="wp-block-paragraph">One emerging pattern is delegation-bound capability tokens: short-lived credentials that bind an invocation to an agent identity, a constrained permission set, and a provenance record. One example is the <a href="https://datatracker.ietf.org/doc/draft-prakash-aip/" target="_blank" rel="noreferrer noopener">Agent Identity Protocol (AIP)</a>, which I’ve been working on as an Internet-Draft and <a href="https://sunilprakash.com/aip/" target="_blank" rel="noreferrer noopener">open source implementation</a>. AIP is still early, but it illustrates the shape of one possible answer: invocation-bound tokens that carry identity, attenuated permissions, and provenance through a delegation chain. The token chain itself becomes part of the audit evidence rather than something reconstructed after the fact from fragmented logs.</p>
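


<p class="wp-block-paragraph">To make the general shape concrete, here is a toy token that binds an agent identity, a scope set, a parent reference, and an expiry under one signature. It is not the AIP wire format. The claim names are invented, and the shared HMAC secret is a simplification. A real design would need credentials the receiver can verify independently, which usually means asymmetric keys.</p>



<pre class="wp-block-code"><code>import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-shared-secret"  # assumption: stands in for real key material


def issue(agent, scopes, parent_id, ttl_seconds=60, now=None):
    """Create a signed, short-lived token bound to an agent, a scope set, and its parent."""
    issued = time.time() if now is None else now
    claims = {
        "agent": agent,
        "scopes": sorted(scopes),
        "parent": parent_id,  # provenance: which delegation this one derives from
        "expires": issued + ttl_seconds,
    }
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + signature


def verify(token, now=None):
    """Return the claims if the signature is valid and the token is unexpired, else None."""
    body, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    return claims if claims["expires"] > current else None


token = issue("charting-agent", {"liquidity:read"}, parent_id="task-001", now=1000.0)
print(verify(token, now=1030.0))
print(verify(token, now=2000.0))</code></pre>



<p class="wp-block-paragraph">The first check returns the claims. The second returns <code>None</code> because the token has expired. Any edit to the body invalidates the signature, so scopes can’t be widened in transit.</p>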



<p class="wp-block-paragraph">Complementary approaches are also emerging. Behavioral credentials, the idea that agents should be continuously reauthorized based on runtime behavior rather than just initial permissions, address a related but distinct problem. Delegation tokens tell you who authorized what. Behavioral monitoring tells you whether the agent is still acting within its authorized profile. A complete solution will likely need both.</p>



<p class="wp-block-paragraph">None of these approaches have reached mainstream adoption. But the fact that they are emerging simultaneously, from different corners of the industry, signals that the delegation gap is real and recognized.</p>



<h2 class="wp-block-heading">What enterprise teams should do now</h2>



<p class="wp-block-paragraph">You don’t need to wait for standards to mature before addressing the delegation problem. There are concrete steps that security, platform, and architecture teams can take today.</p>



<p class="wp-block-paragraph"><strong>Map your delegation chains.</strong> Most teams deploying multi-agent workflows haven’t documented which agents call which other agents, with what permissions, through which protocols. Start there. If you can’t draw the graph, you can’t secure it.</p>
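


<p class="wp-block-paragraph">Drawing the graph can start as a dictionary and a path walk. The agent names below are hypothetical, loosely based on the treasury example. In practice the edges would come from your orchestration config or call logs.</p>



<pre class="wp-block-code"><code># Hypothetical delegation graph: each agent maps to the agents or tools it calls.
CALLS = {
    "planning-agent": ["visualization-agent", "communications-agent"],
    "visualization-agent": ["rendering-api"],
    "communications-agent": ["email-service"],
}


def chains(start, graph, path=()):
    """Yield every delegation path from start to a leaf, skipping cycles."""
    path = path + (start,)
    callees = [c for c in graph.get(start, []) if c not in path]
    if not callees:
        yield path
        return
    for callee in callees:
        yield from chains(callee, graph, path)


for chain in chains("planning-agent", CALLS):
    print(" to ".join(chain))</code></pre>



<p class="wp-block-paragraph">Each printed path is a chain that needs an answer to “with what permissions, through which protocol?” at every hop.</p>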



<p class="wp-block-paragraph"><strong>Audit implicit permissions.</strong> For every agent-to-agent interaction, ask: Was this access explicitly granted, or is the downstream agent inheriting permissions by proximity? If the answer is inheritance, you have a ghost permission that needs a policy decision.</p>
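


<p class="wp-block-paragraph">That question can be run as a set difference between what policy grants and what logs show. The sketch assumes you can export both as mappings from agent to resources. The data and the <code>ghost_permissions</code> helper are made up.</p>



<pre class="wp-block-code"><code># Hypothetical inputs: what policy explicitly grants vs. what logs show each agent touching.
GRANTED = {
    "analyst-assistant": {"customer-transactions"},
    "summarization-agent": set(),
}
OBSERVED = {
    "analyst-assistant": {"customer-transactions"},
    "summarization-agent": {"customer-transactions"},
}


def ghost_permissions(granted, observed):
    """Return resources each agent accessed without an explicit grant."""
    ghosts = {}
    for agent, resources in observed.items():
        extra = resources - granted.get(agent, set())
        if extra:
            ghosts[agent] = sorted(extra)
    return ghosts


print(ghost_permissions(GRANTED, OBSERVED))</code></pre>



<p class="wp-block-paragraph">Here the summarization agent shows up with access to customer transactions that no policy granted, which is the inherited-by-proximity case.</p>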



<p class="wp-block-paragraph"><strong>Require scope attenuation.</strong> Establish an architectural rule: When an agent delegates a task, the subagent must receive fewer permissions than the parent, never more. Current tooling doesn’t enforce this automatically, but you can enforce it in your orchestration layer.</p>
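


<p class="wp-block-paragraph">In an orchestration layer, the rule can be reduced to a set comparison. The following is a minimal sketch, assuming permissions are modeled as flat strings; the function name and the permission strings in the usage note are hypothetical.</p>

```python
def attenuate(parent_permissions, requested):
    """Return the permission set for a subagent, or raise if it is not strictly narrower."""
    parent = frozenset(parent_permissions)
    child = frozenset(requested)
    if not child.issubset(parent):
        # The subagent asked for something the parent never held.
        raise PermissionError(f"escalation: {sorted(child - parent)}")
    if child == parent:
        # Equal scope is inheritance by proximity, not attenuation.
        raise PermissionError("no attenuation: subagent would hold the full parent scope")
    return child
```

<p class="wp-block-paragraph">Calling <code>attenuate({"crm:read", "crm:write"}, {"crm:read"})</code> succeeds, while requesting the full parent set or anything outside it raises before the subagent is ever invoked.</p>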



<p class="wp-block-paragraph"><strong>Build the audit trail before the auditor asks.</strong> If your organization is in a regulated industry, the question &#8220;Who authorized this agent action?&#8221; will eventually be asked. The time to instrument delegation logging is before that question arrives, not after. Log the full chain: which agent initiated the task, what permissions were passed, which subagents were invoked, and what each one accessed.</p>
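


<p class="wp-block-paragraph">A delegation log entry covering those four items can be very small. This sketch assumes JSON lines as the storage format and uses hypothetical field names; a real deployment would write to durable, access-controlled storage rather than an in-memory list.</p>

```python
import json
from datetime import datetime, timezone


def delegation_record(initiator, subagent, permissions, accessed, authorized_by):
    """Build one log entry; every field name is an illustrative assumption."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "initiator": initiator,                     # which agent initiated the task
        "subagent": subagent,                       # which subagent was invoked
        "permissions_passed": sorted(permissions),  # what was handed down
        "resources_accessed": sorted(accessed),     # what the subagent touched
        "authorized_by": authorized_by,             # the human or policy behind the grant
    }


def append_record(log_lines, record):
    """Append one JSON line to the log."""
    log_lines.append(json.dumps(record, sort_keys=True))


def who_authorized(log_lines, subagent):
    """Answer the auditor's question for a given subagent."""
    records = [json.loads(line) for line in log_lines]
    return [r["authorized_by"] for r in records if r["subagent"] == subagent]
```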



<p class="wp-block-paragraph"><strong>Test with real tooling.</strong> Delegation-aware approaches, including capability-token designs, workload identity standards, and agent identity frameworks, are early but functional. Running one in a nonproduction environment will expose gaps in your current authorization model that architecture review alone will not surface.</p>



<h2 class="wp-block-heading">Delegation is the security boundary</h2>



<p class="wp-block-paragraph">The first phase of enterprise agent adoption was about connectivity: Can the agent reach the tool, the API, the database, or the other agent? The next phase will be about accountable delegation: Should this agent be allowed to ask that agent to do this specific thing, with this data, under these constraints?</p>



<p class="wp-block-paragraph">That question won’t be answered by prompt engineering. It belongs in the authorization layer, the platform layer, and the audit trail.</p>



<p class="wp-block-paragraph">Enterprises don’t need to solve the entire standards problem today. But they do need to stop treating delegation as an implementation detail. In multi-agent systems, delegation is the security boundary.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/who-authorized-that-the-delegation-problem-in-multi-agent-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>This Week in AI: Rethinking the Agent Harness</title>
		<link>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/</link>
				<comments>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/#respond</comments>
				<pubDate>Fri, 22 May 2026 15:01:29 +0000</pubDate>
					<dc:creator><![CDATA[Michelle Smith]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[This Week in AI]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18774</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-scaled.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/0642572383770_This_Week_in_AI_Cover-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Plus AI security, the compute arms race, and why eventually there may no longer be an internet for humans]]></custom:subtitle>
		
				<description><![CDATA[We kicked off our new weekly series This Week in AI on Monday, and we covered a lot of ground in 30 minutes, including an AI model that found security holes faster than decades of human auditing, a data center in Utah the size of two Manhattans, and a practical argument for why the harness [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">We kicked off our new weekly series <em>This Week in AI</em> on Monday, and we covered a lot of ground in 30 minutes, including an AI model that found security holes faster than decades of human auditing, a data center in Utah the size of two Manhattans, and a practical argument for why the harness you build around a model now matters more than which model you pick.<br><br>Here are a few takeaways from the conversation between host Eric Freeman, faculty member at UT Austin and a longtime <a href="https://learning.oreilly.com/search/?q=author%3A%20%22Eric%20Freeman%22&amp;suggested=true&amp;suggestionType=author&amp;originalQuery=eric%20freeman&amp;rows=100&amp;language=en" target="_blank" rel="noreferrer noopener">friend of O’Reilly</a>, and guest John Berryman, founder of Arcturus Labs, an early production engineer on GitHub Copilot, and coauthor of O&#8217;Reilly&#8217;s <a href="https://learning.oreilly.com/library/view/prompt-engineering-for/9781098156145/" target="_blank" rel="noreferrer noopener"><em>Prompt Engineering for LLMs</em></a>. Watch the entire episode to find out why you should be building your own agent and why John believes eventually there will be no internet for humans.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="This Week in AI: Rethinking the Agent Harness" width="500" height="281" src="https://www.youtube.com/embed/g4cfjz5AKxY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading"><strong>AI&#8217;s security problem is now a policy problem</strong></h2>



<p class="wp-block-paragraph">You’ve probably already heard about <a href="https://red.anthropic.com/2026/mythos-preview/" target="_blank" rel="noreferrer noopener">Mythos</a>. Anthropic&#8217;s internal testing of the frontier model surfaced thousands of previously unknown security vulnerabilities across major operating systems, browsers, and financial infrastructure, including a 27-year-old bug in OpenBSD. Anthropic chose not to release the model publicly and instead launched <a href="https://www.anthropic.com/glasswing" target="_blank" rel="noreferrer noopener">Project Glasswing</a>, a restricted program giving monitored access to a small group of trusted partners for defensive patching.</p>



<p class="wp-block-paragraph">That decision reverberated quickly in Washington. In roughly six weeks, the conversation shifted from the light-touch national AI policy released in March to reported White House discussions of an <a href="https://fortune.com/2026/05/06/trump-administration-embraces-ai-oversight-policies-it-once-rejected-anthropic-mythos-caisi/" target="_blank" rel="noreferrer noopener">executive order review process</a> modeled on how the FDA handles drugs. Security researcher Bruce Schneier has questioned <a href="https://www.schneier.com/blog/archives/2026/04/mythos-and-cybersecurity.html" target="_blank" rel="noreferrer noopener">whether Mythos is uniquely capable here</a> or whether similar results are achievable with cheaper public models, but as Freeman noted (paraphrasing Schneier), either way, it’s a problem that’s coming.</p>



<h2 class="wp-block-heading">The compute race is getting stranger</h2>



<p class="wp-block-paragraph">Anthropic <a href="https://x.ai/news/anthropic-compute-partnership" target="_blank" rel="noreferrer noopener">leased xAI&#8217;s entire Colossus 1 supercluster</a> in Memphis: more than 200,000 GPUs and 300 megawatts of power. A month before that deal, <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute" target="_blank" rel="noreferrer noopener">Anthropic expanded its agreement with Google and Broadcom for 3.5 gigawatts</a> of capacity coming online in 2027. For context, that&#8217;s nearly 12 times the power of the Colossus 1 deal (3.5 gigawatts versus 0.3), in a single contract. After this episode aired, Anthropic announced that the Colossus deal had been <a href="https://www.axios.com/2026/05/20/anthropic-spacex-compute" target="_blank" rel="noreferrer noopener">expanded to Colossus 2</a> as well.</p>



<p class="wp-block-paragraph">Box Elder County, Utah, just approved a 40,000-acre AI data center called the Stratos project, backed by investor and TV personality Kevin O&#8217;Leary (a.k.a. Mr. Wonderful). It’s planned for <a href="https://www.theregister.com/on-prem/2026/05/13/utah-mega-datacenter-could-dump-23-atomic-bombs-worth-of-energy-per-day/5239670" target="_blank" rel="noreferrer noopener">9 gigawatts at full buildout</a>. The footprint is more than twice the size of Manhattan, and the power draw is equivalent to the output of about nine commercial nuclear reactors. And like many data center deals, including the Colossus deal above, it was <a href="https://www.cnn.com/2026/05/09/tech/ai-data-center-utah-kevin-oleary-opposition" target="_blank" rel="noreferrer noopener">approved over local protests</a>.</p>



<p class="wp-block-paragraph">Infrastructure at this incredible scale takes years to come online, and the companies making these bets are pricing in a world where model capability keeps scaling. Whether that assumption holds will determine a lot about what&#8217;s economically viable to build in the next decade.</p>



<h2 class="wp-block-heading"><strong>The harness matters more than the model</strong></h2>



<p class="wp-block-paragraph">John was on hand to rethink the agent harness, which, as he pointed out, entered a new phase with the step change in model capability that occurred in November and December of last year. He took Eric through the arc of AI product development, from document completion and chat loops to tool-calling agents, DAG-based workflows, and now the harness era represented by tools like Claude Code. Each progression added capability, John noted, but also complexity, and each generated a new class of problems around reliability and control. In our current moment, which John has dubbed the “age of the unharnessed agent,” agents are within reach of everyone, not just software developers.</p>



<p class="wp-block-paragraph">The payoff of this “unharnessed” era is control. John described a client engagement where he replaced a bespoke application with a skills-driven agent. Now domain experts with no development experience can read the agent&#8217;s behavior written in plain English and better understand it. As John explained,</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Rather than building a bespoke agent.&nbsp;.&nbsp;., I just built something that was just the agent harness—the agent—and I just gave it skills that describe what basically I learned in interviewing their experts, how they would work with these agents. And it worked perfectly. Not only does the agent stay on track and do what it needs to do these days, but it&#8217;s coded, as far as my client is concerned, in English.<br><br>The experts don&#8217;t have to complain to developers “this doesn&#8217;t work.” The experts can look at the English description of what&#8217;s going on and see problems, and maybe even fix it themselves. And I&#8217;m really excited to basically give that power into the hands of the people that know best how to change it, the experts.</p>
</blockquote>



<p class="wp-block-paragraph">That&#8217;s a different relationship between the experts and the tool than anything a wrapped commercial product offers.</p>



<p class="wp-block-paragraph">As Eric pointed out, recent <a href="https://arxiv.org/html/2603.28052v1" target="_blank" rel="noreferrer noopener">Stanford research</a> supports this broader point: Performance gaps between a bare model and a well-designed harness now often matter more than which underlying model you&#8217;re using. The benchmark that used to dominate buying decisions, which model scores highest, has been displaced by a harder question about which harness fits the task.</p>



<p class="wp-block-paragraph">John closed with a demo of his personal agent moving from an Obsidian notebook into Wikipedia and back, carrying context across environments. He used it to illustrate a concept he called the &#8220;open agent protocol,&#8221; his term for a standard in which an agent receives environment-specific skills as it moves between contexts. The protocol doesn&#8217;t exist yet, but the demo made the direction clear.</p>



<h2 class="wp-block-heading"><strong>What&#8217;s next</strong></h2>



<p class="wp-block-paragraph">Join us and a rotating lineup of expert guests for weekly live tool demos and deeper dives into the topics that matter in AI. We’re taking next week off for Memorial Day in the US, but we’ll be back on June 1 with host Andreas Welsch and guests Maya Mikhailov and Doug Shannon to cut through another week of AI headlines and separate what actually drives business value from what looks good in a demo but goes nowhere in production. Our first few episodes are free and open to all if you’d like to attend live—<a href="https://www.oreilly.com/live/this-week-in-ai.html" target="_blank" rel="noreferrer noopener">register here</a>.</p>



<p class="wp-block-paragraph">We’ll continue to share full episodes and publish our takeaways here on Radar each Friday. You can also watch or listen on <a href="https://www.youtube.com/watch?v=g4cfjz5AKxY&amp;list=PL055Epbe6d5bJEhT7_ZzOeJZ6gPyUzYpS" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/033kJS2BG1teGunxmtsU1r" data-type="link" data-id="https://open.spotify.com/show/033kJS2BG1teGunxmtsU1r" target="_blank" rel="noreferrer noopener">Spotify</a>, Apple, or wherever you get your podcasts.</p>



]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/this-week-in-ai-rethinking-the-agent-harness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Agentic P&#038;L: Beyond the Empire of Headcount</title>
		<link>https://www.oreilly.com/radar/the-agentic-pl-beyond-the-empire-of-headcount/</link>
				<comments>https://www.oreilly.com/radar/the-agentic-pl-beyond-the-empire-of-headcount/#respond</comments>
				<pubDate>Thu, 21 May 2026 15:04:52 +0000</pubDate>
					<dc:creator><![CDATA[Shreshta Shyamsundar and Anmol Jain]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18761</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-Agentic-PL.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-Agentic-PL-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[For over a century, both the prestige and budget of a corporate department have been measured by a single crude metric: headcount. If you manage 500 people, you’re a &#8220;distinguished leader.&#8221; If you manage five, you’re a footnote. This &#8220;empire of headcount&#8221; has governed everything from office square footage to C-suite influence. It’s the fundamental [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">For over a century, both the prestige and budget of a corporate department have been measured by a single crude metric: headcount. If you manage 500 people, you’re a &#8220;distinguished leader.&#8221; If you manage five, you’re a footnote. This &#8220;empire of headcount&#8221; has governed everything from office square footage to C-suite influence. It’s the fundamental unit of the 20th-century P&amp;L.</p>



<p class="wp-block-paragraph">In an enterprise powered by federated agentic systems, this math is not just obsolete—it is a liability. AI will reshape the enterprise. The question is now “Which line items on the P&amp;L change, and by how much?” Labor and benefits contract. Token and infrastructure costs appear as a new operating line. Compliance costs shift from reactive rework to proactive provenance. And the assets that matter most—structured knowledge enclaves, trained agent policies, decision logs—do not yet appear on most balance sheets.</p>



<h2 class="wp-block-heading">Why AI-on-top-of-hierarchy fails</h2>



<p class="wp-block-paragraph">Most enterprise AI deployments begin with the right instinct and the wrong architecture. A foundation model is procured, a chatbot is deployed, and analysts are relieved of their most repetitive queries. This is the butler-bot phase: AI as a faster way to do what the organization already does, inside a structure designed for a different era.</p>



<p class="wp-block-paragraph">The problem is the process the model is plugged into. If a compliance decision requires sign-off from three managers, an AI assistant that drafts the memo faster doesn’t change the three-week cycle time. If context is scattered across email threads and local drives, a model querying that corpus will hallucinate at exactly the rate the corpus is incomplete. The model inherits the organization&#8217;s structural debt. The agentic P&amp;L begins where the butler bot ends: with a deliberate redesign of the process, not just the tooling.</p>



<p class="wp-block-paragraph">The enterprise must pivot: Stop valuing the empire of headcount and start valuing the federated nervous system.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="362" height="186" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11.png" alt="Figure 1. Empire of headcount vs. federated nervous system—An analogy" class="wp-image-18771" style="width:503px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11.png 362w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-11-300x154.png 300w" sizes="auto, (max-width: 362px) 100vw, 362px" /><figcaption class="wp-element-caption">Figure 1. Empire of headcount vs. federated nervous system—An analogy</figcaption></figure>



<h2 class="wp-block-heading">Pillar 1: Potential energy—How knowledge-ready is your department?</h2>



<p class="wp-block-paragraph">If the department is the fundamental unit of the enterprise, its contextual enclave is its brain—its store of potential energy. Most companies are drowning in low-quality context: petabytes of data buried in half-finished Slack threads, abandoned wikis, and tacit knowledge held by seniors who are three months from retirement. To an agent, this isn’t intelligence; it’s noise.</p>



<h3 class="wp-block-heading">From data lakes to sharded enclaves</h3>



<p class="wp-block-paragraph">The data lake became a 2020s nightmare: a giant swamp where context went to die. In the federated model, legal, HR, engineering, and compliance each maintain their own secure, high-density enclave instead. Policy, process documentation, and institutional knowledge are synthesized into a form an agent can reason over directly, without a human in the interpretive loop. Data stays local; reasoning moves via agents. Protocols like the Model Context Protocol (MCP) are emerging as the TCP/IP of the federated enterprise: a standard way for agents and tools to discover each other, exchange context, and record what happened regardless of which vendor stack sits underneath. MCP is what allows “reasoning moves, data stays” to be an implementation detail rather than a custom integration project every time.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1134" height="633" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8.png" alt="Figure 2. Contextual density in shared enclaves" class="wp-image-18764" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8.png 1134w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-8-768x429.png 768w" sizes="auto, (max-width: 1134px) 100vw, 1134px" /><figcaption class="wp-element-caption">Figure 2. Contextual density in shared enclaves</figcaption></figure>



<h3 class="wp-block-heading">Making potential energy measurable</h3>



<p class="wp-block-paragraph">Three dimensions combine into what we call the contextual density score: coverage (what proportion of policy and process is documented and retrievable—for a compliance enclave, the fraction of onboarding scenarios tied to explicit playbooks); consistency and recency (how often does retrieved guidance conflict, and how stale is it); and retrieval quality (how often can a reference agent answer test questions from its own enclave without human overrides). The contextual density score measures how ready an enclave is for agents to act on it reliably. Each enclave is assigned an owner whose job is to improve that score quarter over quarter, as a traditional leader improves throughput or defect rates. Context maintenance becomes the new R&amp;D.</p>
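


<p class="wp-block-paragraph">The text does not specify how the three dimensions combine, so the sketch below makes assumptions explicit: every input is a fraction between 0 and 1, consistency and recency are multiplied into one term, and the three terms are averaged with equal weight. The function and parameter names are hypothetical.</p>

```python
def contextual_density(coverage, conflict_rate, stale_fraction, retrieval_success):
    """Illustrative score in [0, 1]; the equal weighting is an assumption, not a standard."""
    inputs = {
        "coverage": coverage,                    # share of policy/process documented and retrievable
        "conflict_rate": conflict_rate,          # share of retrievals returning conflicting guidance
        "stale_fraction": stale_fraction,        # share of documents past their review date
        "retrieval_success": retrieval_success,  # share of test questions answered without override
    }
    for name, value in inputs.items():
        if not (0.0 <= value <= 1.0):
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    consistency_recency = (1 - conflict_rate) * (1 - stale_fraction)
    return round((coverage + consistency_recency + retrieval_success) / 3, 3)
```

<p class="wp-block-paragraph">An enclave owner could track this number quarter over quarter; what matters is that the inputs are measured the same way each time, not the particular weighting.</p>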



<h2 class="wp-block-heading">Pillar 2: Agentic throughput (the kinetic energy)</h2>



<p class="wp-block-paragraph">If a department’s knowledge enclave is its store of potential energy, throughput is the kinetic energy: the volume and value of cognitive outcomes produced by the agentic layer without human execution in the critical path. To measure this, we must stop counting &#8220;activity&#8221; and start counting handshakes.</p>



<h3 class="wp-block-heading">The handshake economy</h3>



<p class="wp-block-paragraph">In a federated mesh, work is done through agent-to-agent (A2A) negotiation. A logistics agent detects a delayed shipment and initiates a handshake with a procurement agent to find an alternative supplier. That agent consults the contracts enclave via a legal agent to check compliance and risk limits. A resolution is reached, records are updated, and a human is notified of the result—not every intermediate step. Throughput is the rate of successful, economically meaningful handshakes.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1233" height="688" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9.png" alt="Figure 3. The federated agent operating model" class="wp-image-18765" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9.png 1233w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-9-768x429.png 768w" sizes="auto, (max-width: 1233px) 100vw, 1233px" /><figcaption class="wp-element-caption">Figure 3. The federated agent operating model</figcaption></figure>



<h3 class="wp-block-heading">Agentic unit economics: The cost of the handshake</h3>



<p class="wp-block-paragraph">Not all handshakes are equal. Every one carries a token tax, an infrastructure cost, and a latency cost. Agentic throughput is only valuable when the cost per cognitive outcome is significantly lower than the labor-equivalent at equal or better quality. If an agent fans out 50 calls to a premium model to resolve a $5 inquiry, you&#8217;ve increased throughput and destroyed ROI. If a handful of calls to a moderately priced model resolve a complex cross-silo onboarding decision that previously took three teams and two weeks, the economics are compelling.</p>
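


<p class="wp-block-paragraph">The comparison can be written down in a few lines. In this sketch the per-call prices are invented, and &#8220;significantly lower&#8221; is modeled as an agent cost at or below half the labor-equivalent cost; both are assumptions for illustration.</p>

```python
def cost_per_outcome(calls, price_per_call, infra_cost=0.0):
    """Token spend plus any infrastructure cost attributed to one resolved outcome."""
    return calls * price_per_call + infra_cost


def is_worth_automating(agent_cost, labor_equivalent_cost, margin=0.5):
    """Assumed threshold: the agent must cost at most margin * the labor equivalent."""
    return agent_cost <= margin * labor_equivalent_cost


# Hypothetical prices: 50 premium calls at $0.20 each against a $5 inquiry.
fan_out = cost_per_outcome(calls=50, price_per_call=0.20)
# Hypothetical prices: 6 mid-tier calls at $0.05 each against a $20,000 cross-team effort.
onboarding = cost_per_outcome(calls=6, price_per_call=0.05)
```

<p class="wp-block-paragraph">Under these assumed prices the fan-out case fails the check and the onboarding case passes it by a wide margin, which is the asymmetry the paragraph above describes.</p>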



<p class="wp-block-paragraph">The agentic P&amp;L must therefore track outcome volume (risk-weighted handshakes per period) and cost per outcome relative to the pre-agentic baseline—this is where CFOs and architects meet. This recommendation is consistent with <a href="https://www.pwc.com.au/media/2026/pwc-ai-performance-study-australian-companies-lead-on-ai-security.html" target="_blank" rel="noreferrer noopener">emerging research</a>: The companies seeing genuine AI ROI are those using it to expand what they can do, not those focused purely on headcount reduction.</p>



<h3 class="wp-block-heading">How agents learn: Gyms and mirrors</h3>



<p class="wp-block-paragraph">The gym is a simulation built from historical cases and synthetic data where agents train against gold decisions, respecting policy constraints and risk limits. The mirror is a read-only, regulator-grade log of what agents did in production: prompts, tool calls, model versions, human overrides, and final outcomes. <a href="https://www.oreilly.com/radar/gyms-for-them-mirrors-for-us/" target="_blank" rel="noreferrer noopener">Agents spar in the gym; they are judged in the mirror</a>. By 2026, decision provenance—the ability to reconstruct who or what did what, under which policy and model version—is becoming standard operating procedure in regulated industries.</p>
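


<p class="wp-block-paragraph">One common way to make such a log tamper-evident is to chain each entry to the hash of the previous one. That technique is an assumption of this sketch rather than something the mirror concept prescribes, and the class and field names are hypothetical.</p>

```python
import hashlib
import json


class Mirror:
    """Append-only decision log; each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, prompt, tool_calls, model_version, human_override, outcome):
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "prompt": prompt,
            "tool_calls": list(tool_calls),
            "model_version": model_version,
            "human_override": human_override,
            "outcome": outcome,
            "prev": prev,
        }
        entry = dict(body, hash=self._digest(body))
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```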



<h3 class="wp-block-heading">The agentic P&amp;L decomposed</h3>



<p class="wp-block-paragraph">Four line items change structurally when an enterprise moves from a headcount model to a federated agentic model:</p>



<p class="wp-block-paragraph">Labor and benefits contract, but not to zero. The compliance function that previously employed 400 analysts moves to 80–100 humans in orchestration and oversight roles—higher-skilled and higher-cost per head, a deliberate trade of volume for leverage.</p>



<p class="wp-block-paragraph">General expenses shift as management layers thin, training budgets pivot from procedural compliance to enclave curation, and real estate requirements contract as hybrid squads replace large hub operations.</p>



<p class="wp-block-paragraph">Token and infrastructure costs emerge as a new operating line that does not exist in the pre-agentic P&amp;L. This line must be actively managed: Cost per cognitive outcome is the new unit of measurement, and it deteriorates quickly with poorly designed agent architectures.</p>



<p class="wp-block-paragraph">Compliance and audit costs shift structure. In a Tier-1 bank, the cost of a single regulatory finding—remediation, legal exposure, delayed onboarding—dwarfs the annual cost of maintaining a well-designed decision log. The mirror transforms regulatory response from a fire drill into a navigable record. Decision provenance is not governance overhead. It is P&amp;L protection.</p>



<p class="wp-block-paragraph">Revenue productivity per person (RPP)—revenue divided by headcount—ties the expense-side story to the top line. Software-native firms have long used RPP as a signal of operational leverage; banks are now applying the same lens to their operations functions. As headcount contracts while throughput and revenue capacity hold or grow, RPP rises structurally rather than cyclically—the metric that tells a CFO whether agentic transformation is delivering leverage or merely cost reduction.</p>
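


<p class="wp-block-paragraph">As a worked example of the metric, the sketch below holds revenue constant while headcount falls. The revenue figure is invented for illustration; the headcounts echo the 400-analyst function described above.</p>

```python
def rpp(revenue, headcount):
    """Revenue productivity per person: revenue divided by headcount."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return revenue / headcount


# Hypothetical revenue attributed to the function, held constant across both cases.
revenue = 120_000_000
before = rpp(revenue, 400)   # 300,000 per person
after = rpp(revenue, 100)    # 1,200,000 per person
```

<p class="wp-block-paragraph">The fourfold rise here comes entirely from the denominator, which is why RPP is best read alongside throughput and cost per outcome rather than on its own.</p>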



<h2 class="wp-block-heading">A stylized agentic P&amp;L: Compliance in a Tier-1 bank</h2>



<p class="wp-block-paragraph">Consider a compliance function with 400 analysts. Its P&amp;L is dominated by salaries, benefits, and office costs. Context sits in email, local drives, and the memory of experienced analysts—institutional knowledge that walks out of the building every evening.</p>



<p class="wp-block-paragraph">In phase 1, the bank builds a compliance enclave: policies, historical cases, and regulator Q&amp;A synthesized into a structured knowledge graph. Three hybrid squads of 12–15 humans work alongside 10–15 agents handling document collection, screening, and rule-based decisions. Agentic throughput starts modestly—20%–30% of low-risk cases auto-cleared from within the enclave. The P&amp;L effect at this stage is primarily a productivity story: lower cost per case, faster cycle times.</p>



<p class="wp-block-paragraph">The structural transformation comes in phase 2. After several cycles of gym training and mirror-driven refinement, the function operates with 80–100 humans plus 40–60 agents. The compliance enclave—curated policies, decision logs, evaluated reward functions—is now the primary asset. Legal discovery may require the email archive; what the regulator wants is a structured, navigable record of decisions. That’s what the mirror provides. With it, the reduced headcount is defensible to regulators, to the board, and on the P&amp;L.</p>



<h2 class="wp-block-heading">The new org unit: The 3+N squad</h2>



<p class="wp-block-paragraph">The &#8220;3+N&#8221; squad—a small human core plus a flexible swarm of agents—is the fundamental cell of the agentic enterprise. The strategic architect sets intent and constraints. The policy and ethics lead designs the gyms, ensuring agents act under responsible AI principles. The technical orchestrator manages the context mesh, MCP-based connectors, and enclave density. Around them, specialized agents handle contract analysis, sanctions screening, exception routing, and external API liaison. This is cognitive federation. Humans move up-stack into judgment and intent, while agents handle high-volume reasoning and cross-departmental coordination.</p>



<p class="wp-block-paragraph">Leaders rewarded for headcount and budget will resist decomposing their empires even as enclave quality and throughput improve. Executive scorecards must include agentic KPIs: enclave maturity, agentic throughput, risk-adjusted outcomes, and RPP. The mirror needs an explicit owner spanning risk, compliance, and engineering. Without decision provenance, you get the worst of both worlds: expensive models and humans still quietly doing the real work in spreadsheets.</p>



<p class="wp-block-paragraph">When you tell a senior vice president that their value is no longer tied to a 500-person headcount but to the knowledge readiness and agentic throughput of their domain, they will fight. The resistance isn’t just economic; it’s psychological. Headcount has been a proxy for power and identity. In the new world, it often becomes a proxy for architectural debt.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph">Client: &#8220;Can&#8217;t we just put a human in the loop but set the default to &#8216;Accept&#8217;?&#8221;</p>



<p class="wp-block-paragraph"><br>Me: &#8220;That&#8217;s not human-in-the-loop. That&#8217;s human-as-rubberstamp. You&#8217;re just automating the blame.&#8221;</p>
</blockquote>



<p class="wp-block-paragraph">The reframing that works is not &#8220;we are shrinking your kingdom&#8221; but &#8220;we are upgrading your leverage&#8221;: moving from managing people (inherently high friction and limited scale) to designing intelligence (human-plus-agent systems that scale almost without bound).</p>



<h2 class="wp-block-heading">The leader of 2027: The playbook</h2>



<p class="wp-block-paragraph">The leader of 2027 thinks in flows instead of functions, enclaves and mirrors instead of departments and reports, and token costs and compliance risk instead of merely headcount and budget. Their signature move is converting headcount empires into high-density enclaves and high-throughput meshes under credible governance, then proving it on the P&amp;L with lower unit costs, faster cycle times, and a compliance posture auditors can navigate.</p>



<p class="wp-block-paragraph">For leaders mapping their 2026–2027 roadmaps, here are three hard pivots you need to make: First, stop hiring for capacity; build a better gym, not a bigger team. Second, audit your enclave’s knowledge readiness—if agents hallucinate, you have contextual debt, not a model problem; invest in governed sharded enclaves and mirrors your auditors can use. Finally, manage your token line as the new overhead expense; track cost per cognitive outcome rather than aggregate spend and monitor RPP as your headline leverage indicator.</p>



<p class="wp-block-paragraph">The goal is not to build an AI that works for you. The goal is to build an enterprise that thinks with you.</p>



<p class="wp-block-paragraph">Gyms for them, mirrors for us, and a context mesh to hold the P&amp;L together—that is the architecture of a decentralized, high-alpha enterprise. Anything else is just an expensive way to stay in the 20th century.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-agentic-pl-beyond-the-empire-of-headcount/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Agent Stack Bet</title>
		<link>https://www.oreilly.com/radar/the-agent-stack-bet/</link>
				<pubDate>Wed, 20 May 2026 10:58:36 +0000</pubDate>
					<dc:creator><![CDATA[Addy Osmani]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18746</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-agent-stack-bet.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/The-agent-stack-bet-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[The bet every serious developer needs to make on their agent stack]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Elevate newsletter and is being reposted here with the author&#8217;s permission. Peek under the hood of most “production agents” shipping today and you won’t find intelligence. You’ll find custom plumbing, fragile session logic, shared service accounts, and a security model held together by hope. This can be so [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>The following article originally appeared on the</em> <a href="https://addyo.substack.com/p/the-agent-stack-bet" target="_blank" rel="noreferrer noopener">Elevate</a> <em>newsletter and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">Peek under the hood of most “production agents” shipping today and you won’t find intelligence. You’ll find custom plumbing, fragile session logic, shared service accounts, and a security model held together by hope. This can be so much better.</p>



<p class="wp-block-paragraph">If you’ve spent the last 18 months putting agents into production, you already know the models and tools have gotten <em>dramatically</em> better. You also know the problems that are still burning your on-call rotation are not problems you can prompt your way out of. We are running into a <strong>stack ceiling</strong>, and it is quietly creating a <strong>governance</strong> and <strong>reliability gap</strong> that the next generation of agentic systems cannot grow through.</p>



<p class="wp-block-paragraph">Right now the industry is living with what I’d call <em>excessive agency</em>: <strong>autonomous systems given broad permissions to get things done</strong>, then left to discover—at runtime, in production—that a schema drifted, an API changed, or a downstream service started returning PII it wasn’t supposed to. Agents mark tasks “complete” while leaving a trail of corrupted state behind them. The humans find out on Monday.</p>



<p class="wp-block-paragraph">This is not a failure of the people building agents. It is a failure of the stack they’re building on.</p>



<p class="wp-block-paragraph">Here are the four architectural bets I think every serious team has to make in the next twelve months.</p>



<h2 class="wp-block-heading"><strong>1) Agents need identities, not shared credentials</strong></h2>



<p class="wp-block-paragraph">Every engineer who has shipped agents to production knows this specific flavor of dread: You have agents doing useful work, and effectively zero visibility into which tools they touched, which data they moved, or which credentials they used to do it. I call this <em>governance debt</em>—the silent accumulation of security and audit risk that eventually forces a full rewrite, usually right after the first incident that reaches the CISO.</p>



<p class="wp-block-paragraph">The root cause is that most agents today are ghosts. They don’t have identities. They borrow a service account, inherit a human’s OAuth token, and “promise”—in application code, in a prompt—to stay inside the lines. In a real enterprise environment, a promise in a prompt is not a policy.</p>



<p class="wp-block-paragraph"><strong>My bet is that agent identity has to move from the application layer down into the platform layer.</strong></p>



<p class="wp-block-paragraph">The difference is between bolted-on and embedded security. Bolted-on looks like middleware in front of every tool call, politely asking the agent to behave: easy to bypass, expensive in latency, and invisible to your existing IAM. Embedded looks like a badge reader welded into a steel frame. The agent has a distinct, unforgeable identity recognized at the network and platform level, and policy is enforced at the source. If the agent reaches for a database it isn’t cleared for, the connection never opens. No middleware, no vibes.</p>



<p class="wp-block-paragraph">Done right, this turns “a fleet of liabilities” into something that looks a lot more like a managed workforce: every action attributable, every permission auditable, every agent revocable with one call.</p>



<h2 class="wp-block-heading"><strong>2) Agents need universal context, not scraped windows</strong></h2>



<p class="wp-block-paragraph">Context management is a tax every builder is currently paying. Teams are burning a huge share of their engineering hours (and tokens) on undifferentiated plumbing—custom serialization, bespoke session stores, hand-rolled memory layers—just to keep an agent from forgetting its mission halfway through a multi-step task.</p>



<p class="wp-block-paragraph">Worse, the context agents <em>can</em> get their hands on is usually siloed. A browser-based agent can see the open tab. A desktop wrapper can see the files a user happened to drag in. Neither of them can easily reason across the systems where the business actually lives—the CRM, the ERP, the data warehouse, the ticketing system, the transcripts, the project plans—at the same time.</p>



<p class="wp-block-paragraph"><strong>Agents need universal context that integrates at the platform level.</strong> If we don’t fix this, we should be honest that the ceiling of agentic AI is “slightly better spreadsheet autocomplete,” and we should stop writing vision pieces about it.</p>



<h2 class="wp-block-heading"><strong>3) Agents need to survive your laptop closing</strong></h2>



<p class="wp-block-paragraph">Here’s the uncomfortable version of this: A lot of what ships today as “an agent” isn’t yet ready to deploy across a business.</p>



<p class="wp-block-paragraph">I want to be precise, because the frontier has genuinely moved in the last six months. Environments like Claude Code, OpenClaw, and similar platforms are capable—persistent task state, scheduled execution, multi-agent coordination, and long-running sessions that survive disconnects are no longer aspirational. These are not toys. The question has moved on.</p>



<p class="wp-block-paragraph">The question now is whether an agent can run for a week instead of an hour. Whether it can cross three handoffs, two credential rotations, and an approval gate without a human babysitting the session. Whether the work it did on Tuesday is auditable on Friday by someone who wasn’t in the room. A session that survives a dropped WebSocket is table stakes. A mission that survives a quarter is the bar enterprises actually need.</p>



<p class="wp-block-paragraph">Real work doesn’t fit in a session, and most of it doesn’t fit in a day either. A procurement workflow spans weeks and a dozen handoffs. A compliance audit runs for a month. An incident investigation outlives three on-call rotations.</p>



<p class="wp-block-paragraph"><strong>Most agents today hit a hard ceiling—sometimes time-based, sometimes token-based, sometimes governance-based—and when they hit it, the mission fails and a human picks up the pieces from wherever the transcript ended.</strong></p>



<p class="wp-block-paragraph">Enterprise-grade autonomy requires durable, cloud-native execution with a much higher floor than “the session stayed up.” Concretely, that means:</p>



<ul class="wp-block-list">
<li><strong>State</strong> and <strong>checkpointing</strong> that survives restarts, disconnects, redeploys, and model version changes by default—not bolted on with a local Redis and a prayer.</li>



<li><strong>Context that outlives the window</strong>: long-horizon memory, summarization, and handoff between agent instances, so a multi-week task doesn’t die because a single run exhausted its tokens.</li>



<li><strong>Missions that outlive sessions</strong>: agents that stay on the job across days, handoffs, and credential rotations, with an auditable trail of what happened while you were asleep.</li>



<li><strong>First-class human-in-the-loop primitives,</strong> so the agent can pause and ask for permission to do something new instead of silently deciding it has the authority.</li>
</ul>
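


<p class="wp-block-paragraph">As a rough illustration of the first and last items in that list, here is a minimal Python sketch of a mission that checkpoints after every step and pauses at an approval gate. Everything in it is hypothetical: the step names, the JSON checkpoint file, and the <code>run_mission</code> helper are assumptions made for the example, not the API of any particular agent platform.</p>

```python
import json
import tempfile
from pathlib import Path

# Hypothetical mission: ordered steps, some of which need human approval.
STEPS = [
    {"name": "collect_quotes", "needs_approval": False},
    {"name": "issue_purchase_order", "needs_approval": True},
    {"name": "notify_requester", "needs_approval": False},
]


def load_checkpoint(path):
    """Return saved progress, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "approved": [], "log": []}


def save_checkpoint(path, state):
    """Write to a temp file, then rename it over the previous checkpoint."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def run_mission(path, approvals=()):
    """Advance the mission until it finishes or reaches an unapproved gate."""
    state = load_checkpoint(path)
    state["approved"] = sorted(set(state["approved"]) | set(approvals))
    while state["next_step"] != len(STEPS):
        step = STEPS[state["next_step"]]
        if step["needs_approval"] and step["name"] not in state["approved"]:
            save_checkpoint(path, state)
            return "paused:" + step["name"]  # stop and wait for a human
        state["log"].append(step["name"])  # audit trail of completed steps
        state["next_step"] += 1
        save_checkpoint(path, state)
    return "done"


# Usage: the first call stops at the gate; a later call, possibly in a
# different process, resumes from the checkpoint once approval is recorded.
checkpoint = Path(tempfile.mkdtemp()) / "mission.json"
print(run_mission(checkpoint))
print(run_mission(checkpoint, approvals=["issue_purchase_order"]))
```

<p class="wp-block-paragraph">The point of the sketch is that progress lives outside the process: a restart between the two calls loses nothing, and the log records what ran before the pause. A real platform would need far more than a local file, but the shape is the same.</p>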



<p class="wp-block-paragraph">Persistence with guardrails. That’s the bar. Anything less and you’re building demos that happen to run for a long time.</p>



<h2 class="wp-block-heading"><strong>4) Agents need platforms</strong></h2>



<p class="wp-block-paragraph">The pattern I see most often in strong teams is the saddest one: brilliant engineers draining their bandwidth into stack problems that do not differentiate their product. Custom memory. Bespoke eval harnesses. Homegrown observability. Handwritten retry logic. A tracing system that almost works. None of this is the hard part of the agentic era, and none of it is what your users are paying you for.</p>



<p class="wp-block-paragraph">The real value lives in domain reasoning and business logic—the judgment calls that are specific to your company, your customers, your regulatory environment. Everything underneath should be the platform you <em>build on</em>, not the plumbing you <em>build</em>.</p>



<p class="wp-block-paragraph">This is why the maturation of open primitives matters right now. Open-source orchestration frameworks exist precisely so the scaffolding isn’t locked behind any single vendor’s roadmap. The model that worked for cloud compute, containers, and CI/CD—start local on open primitives, graduate to a managed platform when you’re ready to scale—is the model agent platforms need to copy.</p>



<p class="wp-block-paragraph"><strong>Teams should be able to prototype on their laptop with the same building blocks they’ll run in production, and cross that boundary without a rewrite.</strong></p>



<p class="wp-block-paragraph">That’s the engineering standard that lets teams stop fighting plumbing and get back to the product.</p>



<h2 class="wp-block-heading"><strong>The five-year horizon</strong></h2>



<p class="wp-block-paragraph">The teams that pull ahead in the next five years will not pull ahead by being smarter at writing boilerplate. They’ll pull ahead by <strong>choosing the right agent foundation</strong> and spending their engineering hours on the problems <em><strong>only they can solve</strong></em>.</p>



<p class="wp-block-paragraph">Every month spent rebuilding the common stack—identity, context, persistence, orchestration—is a month not spent on the logic that actually makes your agents worth deploying.</p>



<p class="wp-block-paragraph"><strong>The agent stack has to become a solved problem.</strong> The only real question is whether you want to solve it yourself, again, or build on a foundation that was engineered for agents from the ground up.</p>



<p class="wp-block-paragraph">My bet is on the latter. I think yours should be too.</p>
]]></content:encoded>
										</item>
		<item>
		<title>When an Agent Deletes the Production Database</title>
		<link>https://www.oreilly.com/radar/when-an-agent-deletes-the-production-database/</link>
				<pubDate>Tue, 19 May 2026 16:00:39 +0000</pubDate>
					<dc:creator><![CDATA[Sam Newman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18743</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/When-an-agent-deletes-the-production-database.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/When-an-agent-deletes-the-production-database-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Revisiting the PocketOS Incident]]></custom:subtitle>
		
				<description><![CDATA[Another day, another example of an AI Agent &#8220;running rogue&#8221; and doing something the human operator didn&#8217;t want it to do. The tl;dr is that Jeremy (Jer) Crane, founder of PocketOS, was using Claude to perform some routine DB maintenance. Claude then proceeded to delete the production database and all backups hosted at their cloud [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Another day, another example of an AI agent &#8220;running rogue&#8221; and doing something the <a href="https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/" target="_blank" rel="noreferrer noopener">human operator didn&#8217;t want it to do</a>. The tl;dr is that Jeremy (Jer) Crane, founder of PocketOS, was using Claude to perform some routine DB maintenance. Claude then proceeded to delete the production database and all backups hosted at their cloud provider, Railway. To their credit, Railway managed to recover the lost data. The initial deletion took less than 10 seconds; I&#8217;m sure the recovery took much longer. Let’s look at what we can learn from what happened, and why AI is really just an amplifier of existing issues, rather than the cause itself.</p>



<p class="wp-block-paragraph">We know about the incident because Jer <a href="https://x.com/lifeof_jer/status/2048103471019434248?s=20" target="_blank" rel="noreferrer noopener">wrote</a> about it after it happened. First, taking time to reflect after something goes wrong is important; it&#8217;s how we learn. Sharing your mistakes with the world can be difficult, but it creates chances for us all to learn from each other. Second, I&#8217;ve seen a lot of people publicly dunking on both PocketOS and Railway. I would guess that none of those people have ever experienced the sheer terror and panic that happens during an incident like this. The feeling that you just want the ground to open and swallow you whole. It&#8217;s a feeling I&#8217;ve only experienced once or twice before, and it&#8217;s not an experience I&#8217;m keen to repeat.</p>



<p class="wp-block-paragraph">One point in Railway’s favor is that they got PocketOS’s data back. If you called for a deletion via the APIs on AWS, Azure, Google Cloud, or whatever, using a valid credential, that data would be gone, unless you had your own backups of course. AWS et al. aren’t maintaining backups of customer data to hedge against customer mistakes. This is your yearly reminder to <a href="https://www.backblaze.com/blog/the-3-2-1-backup-strategy/" target="_blank" rel="noreferrer noopener">look into the 3-2-1 backup strategy</a>.</p>



<p class="wp-block-paragraph">What can we learn from what happened? Well, for all the discussion around how this is AI&#8217;s fault, what we have here is a much simpler example of common system weaknesses being exploited both accidentally and at speed.</p>



<h2 class="wp-block-heading">What Did Claude Do?</h2>



<p class="wp-block-paragraph">Claude had been asked to carry out a task against PocketOS&#8217;s staging environment. The agent hit an issue, searched out and found a long-lived API token which gave access to production, and then proceeded to delete the production volume that contained both the production databases and the backups.</p>



<p class="wp-block-paragraph">When asked what had happened, Claude’s reaction was objectively funny. It seemed to be totally aware of what went wrong, and what it should have done instead. This implies a level of reasoning that was not evident during the operation itself. I do wonder whether recent attempts to reduce how much reasoning Claude does in certain modes (to cut token use and Anthropic’s operating costs) might partly be to blame.</p>



<p class="wp-block-paragraph">Breaking it all down, there seem to be a couple of fairly straightforward issues at play that at first glance have very little to do with AI itself.</p>



<p class="wp-block-paragraph">The token Claude had access to gave overly broad access. It&#8217;s common for cloud-based infrastructure providers like AWS or Azure to allow you to create tokens that are limited in what they do. This helps implement the <em>principle of least privilege</em>. The idea is that an actor in a system should be given access to what they need, and no more. The principle of least privilege reduces the impact if an inappropriate party gains access to the actor’s credentials, or if the actor themselves goes rogue. Consider what happens if someone steals your hotel room key. They can get into your hotel room, which isn&#8217;t great, but they can&#8217;t get into anyone else&#8217;s. It seems that Railway&#8217;s auth tokens cannot be limited in scope.</p>



<p class="wp-block-paragraph">The second problem was that the credentials were stored on disk and had not expired. This makes the impact of the broadly scoped auth token much worse. Credentials should be time limited, so that if they are found later they cannot be used. If tokens are generated on demand, which could have been done in this specific case, then this particular issue could have been mitigated. Claude would have had to ask for a human to provide a credential—at which point, hopefully, the operator would have had a chance to work out what was going on.</p>
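


<p class="wp-block-paragraph">To make the combination of narrow scope and short lifetime concrete, here is a minimal Python sketch using only the standard library. It is not how Railway or any other provider issues tokens; the signing key, the scope names, and the <code>issue_token</code> and <code>authorize</code> helpers are all assumptions made for illustration.</p>

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical signing key; a real one would live in a secrets manager.
SECRET = b"demo-signing-key"


def issue_token(scopes, ttl_seconds, now=None):
    """Mint a signed token limited to the given scopes and lifetime."""
    issued = time.time() if now is None else now
    claims = {"scopes": sorted(scopes), "expires_at": issued + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def authorize(token, required_scope, now=None):
    """Allow an action only if the token is authentic, unexpired, and in scope."""
    payload, _, signature = token.partition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False  # forged or tampered token
    claims = json.loads(base64.urlsafe_b64decode(payload))
    current = time.time() if now is None else now
    if current >= claims["expires_at"]:
        return False  # a token found on disk later is useless
    return required_scope in claims["scopes"]


# Usage: a 15-minute token that can only write to staging.
token = issue_token(["staging:write"], ttl_seconds=900)
print(authorize(token, "staging:write"))
print(authorize(token, "production:delete"))
```

<p class="wp-block-paragraph">With a token like this, an agent that stumbles on the credential can neither reach production nor reuse it after the window closes, which are the two failures described above.</p>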



<p class="wp-block-paragraph">I take minor issue with Jer&#8217;s assertion that Railway&#8217;s GraphQL API should have required a confirmation before deletion. This, to me, is a fundamental misunderstanding of what cloud APIs are for. APIs are there for automation; if you want a human-in-the-loop confirmation model, you have to build that yourself. This has always been the case. However, in the aftermath of an incident like this, we should give Jer a lot of leeway around his view of the problems, and some of his requests for how Railway should change appear to be very sensible (e.g., clearer SLAs and tokens that are easier to scope).</p>



<h2 class="wp-block-heading">How Could These Issues Be Mitigated?</h2>



<p class="wp-block-paragraph">One obvious takeaway is to ensure that access tokens are more aggressively expired, but also made more limited in scope. This reduces the chance of Claude accessing something it shouldn’t. This would need to be solved on the Railway side, as they generate the token in the first place.</p>



<p class="wp-block-paragraph">Unfortunately, having a more limited token for Claude isn’t a total fix for this scenario. Claude was given a token that limited its behavior, and went looking for a better token—and found it. This is not the first time I’ve heard of this happening; the same thing happened to a client of mine recently.</p>



<p class="wp-block-paragraph">As our agents become more sophisticated, it seems that some sort of sandboxing is key. The production token was viewable by Claude, so it was used. Running agents in a restricted sandbox where they are only able to see parts of your filesystem would help greatly. However, that also limits their usefulness.</p>
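


<p class="wp-block-paragraph">The core of the filesystem restriction can be sketched in a few lines of Python. This is an application-level check only, and the <code>resolve_inside</code> helper and the paths are hypothetical; a real sandbox would enforce the boundary at the operating system or container level rather than trusting the agent&#8217;s tooling to call a function like this.</p>

```python
import tempfile
from pathlib import Path


def resolve_inside(sandbox_root, requested):
    """Return the resolved path if it stays inside the sandbox, else raise."""
    root = Path(sandbox_root).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # comparison below is made against the real location on disk.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"{requested!r} is outside the sandbox")
    return candidate


# Usage: the agent may read project files but not climb out to find
# credentials stored elsewhere on the machine.
sandbox = tempfile.mkdtemp()
print(resolve_inside(sandbox, "src/app.py"))
try:
    resolve_inside(sandbox, "../.config/provider/token")
except PermissionError as error:
    print("blocked:", error)
```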



<p class="wp-block-paragraph">Another option would be for the agent to ask for confirmation before it does something like delete data. It seems conceivable that a human-in-the-loop model for when the agent has to escalate privileges could help. But again, if it gets hold of an access token with broad scope, it won’t need to ask a human.</p>



<p class="wp-block-paragraph">Finally, I’ve seen a lot of discussion about how the agent should “know” that deleting the data was bad, and that it should have checked first. This is a fundamental limitation of an LLM-based agent. It has no concept of causality. It cannot predict what will happen. There is a field of AI study known as <a href="https://en.wikipedia.org/wiki/World_model_(artificial_intelligence)" target="_blank" rel="noreferrer noopener">world models</a>, which could allow these agents to make more informed decisions. For example, a world model that understands physics would be able to predict that an egg pushed off a table onto the concrete floor below would likely break. World models are used a lot in video generation and autonomous driving (where prediction of motion is key), but are sparsely used elsewhere.</p>



<h2 class="wp-block-heading">AI Not To Blame?</h2>



<p class="wp-block-paragraph">I said just a moment ago that these issues seem to have little to do with AI. That isn&#8217;t entirely true.</p>



<p class="wp-block-paragraph">In the recent DORA report on the state of <a href="https://dora.dev/research/2025/dora-report/" target="_blank" rel="noreferrer noopener">AI-assisted Software Development</a>, the authors noted that AI seems to be an amplifier: that AI-assisted software development tends to help good teams go faster, and slow teams go slower. Bad practices get encoded and done more. In the PocketOS and Railway situation, we have overly broad, long-lived credentials stored on disk, combined with an apologetic AI agent doing something other than what was expected of it. If a human had made the same mistakes, they would have made them much more slowly, and may well have had the chance to work out their mistake part way through. AI works so fast that it can go more quickly in the wrong direction.</p>



<p class="wp-block-paragraph">More importantly, unlike LLM-based AI, a human being has the chance to learn from experience, and for that learning to be rooted in a very specific, emotional response. When I first heard about the PocketOS story, I was brought back to a dim echo of that same horrific feeling I had in the midst of a major production issue that I had contributed to. Those feelings don&#8217;t leave you—those lessons don&#8217;t leave you. Every time I touched a production system, those memories were with me, and helped guide me towards more sensible working practices.</p>
]]></content:encoded>
										</item>
		<item>
		<title>AI Artifact Catalogs: Durable Standards Worth Institutional Investment</title>
		<link>https://www.oreilly.com/radar/ai-artifact-catalogs-durable-standards-worth-institutional-investment/</link>
				<pubDate>Tue, 19 May 2026 11:05:38 +0000</pubDate>
					<dc:creator><![CDATA[Tadas Antanavicius]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18737</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/AI-artifact-catalogs.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/AI-artifact-catalogs-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Companies everywhere are trying to leverage AI to boost internal productivity metrics. Some, like Ramp and Intercom, are succeeding. Many are failing. To make matters more complicated, the narrative around what tooling enables these gains is constantly shifting. For software engineers, auto-complete via GitHub Copilot was the bleeding-edge tool of choice in 2024. Then it [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p class="wp-block-paragraph">Companies everywhere are trying to leverage AI to boost internal productivity metrics. Some, like <a href="https://x.com/geoffintech/status/2042002590758572377?s=20" target="_blank" rel="noreferrer noopener">Ramp</a> and <a href="https://www.linkedin.com/posts/destraynor_in-intercom-we-literally-doubled-productivity-activity-7450589093400469504-wbHH/" target="_blank" rel="noreferrer noopener">Intercom</a>, are succeeding. <a href="https://www.pwc.com/gx/en/news-room/press-releases/2026/pwc-2026-ai-performance-study.html" target="_blank" rel="noreferrer noopener">Many are failing</a>.</p>



<p class="wp-block-paragraph">To make matters more complicated, the narrative around what tooling enables these gains is constantly shifting. For software engineers, auto-complete via <a href="https://github.com/features/copilot" target="_blank" rel="noreferrer noopener">GitHub Copilot</a> was the bleeding-edge tool of choice in 2024. Then it was Cursor for <a href="https://ramp.com/vendors/cursor" target="_blank" rel="noreferrer noopener">much of 2025</a>. 2026 has been dominated by command-line-based coding agents like <a href="https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation" target="_blank" rel="noreferrer noopener">Claude Code</a> and <a href="https://fortune.com/2026/03/04/openai-codex-growth-enterprise-ai-agents/" target="_blank" rel="noreferrer noopener">Codex</a>.</p>



<p class="wp-block-paragraph">While the winds at the tooling layer keep shifting, many of these tools have come to share a number of common primitives: <strong>open standards that help configure and guide these tools’ capabilities</strong>.</p>



<p class="wp-block-paragraph"><a href="https://agentskills.io/" target="_blank" rel="noreferrer noopener">Agent Skills</a>. <a href="https://modelcontextprotocol.io/" target="_blank" rel="noreferrer noopener">MCP</a>. <a href="https://open-plugins.com/" target="_blank" rel="noreferrer noopener">Plugins</a>. These all present vendor-agnostic mechanisms by which we can configure the tools today. The catch: These mechanisms aren’t one-size-fits-all. How you can connect to an MCP server depends on your organization’s security posture. An Agent Skill crafted specifically for one team’s design system does not copy-paste well into that of another team.</p>



<p class="wp-block-paragraph">As individuals within organizations begin to configure (and sometimes build from scratch) the skills and MCP servers that unlock real productivity gains, the next unlock is to translate those wins into shareable, reusable institutional knowledge. <strong>AI artifact catalogs</strong> are the output of this step. They capture the useful bits of <em>internal</em> knowledge and glue behind much of what employees are doing manually today, and put them to work empowering both:</p>



<ul class="wp-block-list">
<li><strong>Their peers</strong>. By sharing these artifacts within or across teams, productivity gains are shared across the organization, not in individual silos.</li>



<li><strong>And their agents</strong>. Equipping agent runtimes like Claude Code or Codex with hard-won, domain-specific guidance means employees can spend more time building agentic systems and less time toiling on repeatable labor.</li>
</ul>



<h2 class="wp-block-heading">The durability of open standards</h2>



<p class="wp-block-paragraph">There is an ongoing industry-wide rush to buy AI-powered solutions in the hopes that a vendor can unlock these sought-after productivity gains. <a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/" target="_blank" rel="noreferrer noopener">95% of those pilot projects are failing</a>.</p>



<p class="wp-block-paragraph">Of course, there is a spectrum of risk when buying solutions like this from a vendor. If you go all-in on Anthropic’s tooling—like <a href="https://ideas.fin.ai/p/we-gave-claude-code-to-everyone-at" target="_blank" rel="noreferrer noopener">Intercom did with Claude Code</a>—and Anthropic continues to be an industry leader, things will go well. Make the same decision with a startup’s offering that fails to get broad industry adoption, and you’re stuck with a proprietary data model that operates in a dead-end silo you have to rebuild from scratch in a year.</p>



<p class="wp-block-paragraph">There’s another path: that of committing to open standards. If you invest in Agent Skills, in MCP, in plugins, not only will you be protected against a single vendor going belly-up, but you won’t even miss a beat when the leading coding agent that all your engineers demand next quarter changes, again. Switching costs drop to a fraction of what they’d be with a proprietary stack.</p>



<p class="wp-block-paragraph">There’s no doubt that AI capabilities are evolving at a breakneck pace. It’s hard to predict what innovations the next cycle will bring. But what’s unique about these vendor-agnostic standardized primitives is that they are concepts that innovation can build upon rather than replace. We’re all still building on top of HTTP, which forms the fabric of the web. QWERTY keyboards are often argued to be inferior to Dvorak keyboards, and yet the standard prevails to this day. JavaScript is a much-maligned language, yet it underpins practically the entire frontend of the internet.</p>



<p class="wp-block-paragraph">As AI rapidly reduces the cost of building, the cost of coordination among people and among entities remains high. Standards remain scarce and valuable.</p>



<h2 class="wp-block-heading">AI artifacts and their relative maturity</h2>



<p class="wp-block-paragraph">The most important aspect of any standard is its level of adoption. It’s clear that the leading tooling empowering internal AI transformation is coalescing around coding agent tools like Claude Code and Codex, less-technical tooling like Claude Cowork, and rich agent SDKs like those from Anthropic or OpenAI.</p>



<p class="wp-block-paragraph">Taking the compatibility of leading tools in those categories as indicators of standard adoption, here’s where I think the landscape of AI artifacts currently nets out:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Standard</strong></td><td><strong>Artifact</strong></td><td><strong>Status</strong></td><td><strong>Adoption</strong></td></tr><tr><td>Agent Skills</td><td><a href="https://agentskills.io/" target="_blank" rel="noreferrer noopener">Skill</a></td><td>Vendor-agnostic standard</td><td>Highest</td></tr><tr><td>MCP servers</td><td><a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2633" target="_blank" rel="noreferrer noopener">mcp.json</a> and <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2127" target="_blank" rel="noreferrer noopener">Server Card</a></td><td>Vendor-agnostic standard</td><td>Highest</td></tr><tr><td>Plugins</td><td><a href="https://open-plugins.com/" target="_blank" rel="noreferrer noopener">Plugin</a></td><td>Vendor-agnostic standard</td><td>High</td></tr><tr><td>Command line interface (CLI) tools</td><td>Custom</td><td>Unstandardized</td><td>High</td></tr><tr><td>Hooks</td><td><a href="https://open-plugins.com/agent-builders/components/hooks" target="_blank" rel="noreferrer noopener">Hook</a></td><td>Derivative standard (Open Plugins)</td><td>Medium</td></tr><tr><td>Roots</td><td>Git repositories</td><td>Derivative standard (<a href="https://agents.md" target="_blank" rel="noreferrer noopener">AGENTS.md</a>)</td><td>Medium</td></tr><tr><td>Rules</td><td><a href="https://open-plugins.com/agent-builders/components/rules" target="_blank" rel="noreferrer noopener">Rule</a></td><td>Derivative standard (Open Plugins)</td><td>Medium</td></tr></tbody></table><figcaption class="wp-element-caption"><em>Tool compatibility considered in “adoption” as of April 2026</em><strong><em>: </em></strong><em>Claude Code, Cowork, Codex, Cursor, GitHub Copilot, Gemini CLI, Pi, OpenCode, Amp, Claude Agents SDK, OpenAI Agents SDK</em></figcaption></figure>



<p class="wp-block-paragraph">A minimalist catalog stored as a Git repository for a team might start off looking something like this:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="626" height="252" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-6.png" alt="A minimalist catalog stored as a Git repository" class="wp-image-18738" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-6.png 626w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/image-6-300x121.png 300w" sizes="auto, (max-width: 626px) 100vw, 626px" /></figure>



<p class="wp-block-paragraph">I work with software engineering teams early in their AI adoption journey, where they might have a few individual tinkerers leaning heavily into AI but haven’t yet figured out how to propagate adoption more widely. Out of the gate, my conversations with teams tend to run the gamut of disparate tool preferences, unique workflows, disjoint architectures, and other one-off quirks. A big unlock for moving these organizations forward is to introduce shared language. Shared language grounds conversations. It puts teams working on different AI-related initiatives on a path to smooth integration with each other. People get excited about how puzzle pieces might fit together.</p>



<p class="wp-block-paragraph">Let’s review these artifacts in more detail.</p>



<h3 class="wp-block-heading">Skills: The lifeblood of most institutional knowledge</h3>



<p class="wp-block-paragraph">As Tim O’Reilly <a href="https://www.oreilly.com/radar/betting-against-the-bitter-lesson/" target="_blank" rel="noreferrer noopener">wrote a few months ago</a>, a skill can be “the integration of expert workflow logic that orchestrates when and how to use each tool, informed by domain knowledge that gives the LLM the judgment to make good decisions in context.”</p>



<p class="wp-block-paragraph">This is not the only “type” of skill that currently exists out there. They can span a gamut of purposes; to name a few:</p>



<ul class="wp-block-list">
<li>Encoding of internal, expert orchestration knowledge (as in the above)</li>



<li>Guidance on using otherwise deterministic tools (such as MCP servers or CLI tools)</li>



<li>Context management tricks that have broad appeal (to make up for LLM capability limitations)</li>
</ul>



<p class="wp-block-paragraph">But the first—the encoding of expert knowledge—is very much the most valuable and irreplaceable. Chances are, what an organization might capture in that variant of skill is knowledge not otherwise documented. It lives as tacit knowledge among your employees or is scattered across many systems so as to make any associated work a multistep journey.</p>



<p class="wp-block-paragraph">The implication: Any skill you can download from the public internet is probably not nearly as valuable as an internal skill crafted by an employee. The latter skill is aware of your business context, the opinionated systems in play, and maybe encodes unique expertise hard-won over years of tenure. And most importantly: That level of insight is not making it into a model training run any time soon. Nor is it likely to be relevant to just about anyone outside of your own company. The same can’t be said for the latest skill repository on GitHub that acquires 10,000 stars. If that public skill is any good, the generic concepts will find their way into natural model and harness capabilities before long, eliminating the need for that class of skill.</p>



<p class="wp-block-paragraph">Skills are <a href="https://agentskills.io/clients" target="_blank" rel="noreferrer noopener">extremely well-adopted</a>: Every major coding agent supports them, uncontroversially so.</p>



<h3 class="wp-block-heading">MCP and CLI tools: The connectivity layer to external systems</h3>



<p class="wp-block-paragraph">Most agents don’t operate in a vacuum: Interaction with external systems is how we compose AI. One agent can talk to another agent, or just some separate deterministic system, by way of MCP or a CLI tool.</p>



<p class="wp-block-paragraph">The <a href="https://claude.com/blog/building-agents-that-reach-production-systems-with-mcp" target="_blank" rel="noreferrer noopener">MCP versus CLI</a> debate is well-documented, so we won’t rehash it here. Regardless of which of the two you implement (and perhaps you use both for different use cases), the point is that MCP/CLI is responsible for poking a hole into what is otherwise a local-only sandboxed environment for your agent.</p>



<p class="wp-block-paragraph">This is the layer that juggles authentication—facilitating OAuth, injecting any relevant secrets—and exposes some well-defined surface area for what your agent could possibly do in communication with that external system (e.g., MCP tool definitions or CLI command options).</p>



<p class="wp-block-paragraph">For MCP, you have well-established conventions and standards in the form of <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2127" target="_blank" rel="noreferrer noopener">Server Cards</a> and <a href="https://github.com/modelcontextprotocol/registry/blob/main/docs/reference/server-json/generic-server-json.md" target="_blank" rel="noreferrer noopener">server.json</a> files—to declare all the <em>possible</em> configurations of an MCP server—and also an upcoming standard called <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2633" target="_blank" rel="noreferrer noopener">mcp.json</a> to declare <em>specific</em> configurations of an MCP server (inspired by, among others, files like <a href="https://code.claude.com/docs/en/mcp#project-scope" target="_blank" rel="noreferrer noopener">.mcp.json from Claude Code</a>).</p>



<p class="wp-block-paragraph">For CLI, cataloging a tool means rolling your own catalog format: probably covering metadata like “how to install this,” “what auth mechanisms does it support,” “where to store secrets,” and related concerns that are explicitly or implicitly captured in analogous mcp.json files.</p>
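
<p class="wp-block-paragraph">As a rough illustration of what a home-grown CLI catalog entry might hold, here is a minimal Python sketch. Every field name and value below (<code>CliToolEntry</code>, <code>example-cli</code>, the install command, the secrets location) is a hypothetical placeholder, not part of any standard:</p>

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CliToolEntry:
    """One catalog record for a CLI tool (hypothetical schema)."""
    name: str
    install: str                      # "how to install this"
    auth_mechanisms: list = field(default_factory=list)  # "what auth does it support"
    secrets_location: str = ""        # "where to store secrets"


# Illustrative values only; none of these refer to a real tool.
entry = CliToolEntry(
    name="example-cli",
    install="brew install example-cli",
    auth_mechanisms=["oauth", "api-key"],
    secrets_location="env:EXAMPLE_CLI_TOKEN",
)

# Serialize to JSON so the entry can live in a Git-backed catalog.
print(json.dumps(asdict(entry), indent=2))
```

<p class="wp-block-paragraph">Keeping the record this small makes it easy to review in a pull request; a team can add fields as its needs become clearer.</p>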



<p class="wp-block-paragraph">MCP is very well-adopted and natively compatible with most agent frameworks. CLI works anywhere the agent comes with bash capabilities but can be fairly limited in a sandbox environment and doesn’t offer the sort of configurability that MCP does.</p>



<h3 class="wp-block-heading">Hooks: Inject capabilities at deterministic trigger points</h3>



<p class="wp-block-paragraph">Hooks are handy to inject sprinkles of determinism in an otherwise nondeterministic agentic session. Some effective uses I’ve seen: injecting a session transcript capture step for future review or capturing analytics on what skills are being invoked within a team.</p>



<p class="wp-block-paragraph">Hooks don’t have their own standard but are <a href="https://open-plugins.com/agent-builders/components/hooks" target="_blank" rel="noreferrer noopener">baked into the upcoming Open Plugins standard</a>. The concept is supported by most major coding agents, although implementations have some variance.</p>



<h3 class="wp-block-heading">Rules: Context appended to rules</h3>



<p class="wp-block-paragraph">Originally <a href="https://cursor.com/docs/rules" target="_blank" rel="noreferrer noopener">popularized by Cursor</a>, rules allow for injecting blurbs of context in largely deterministic, but sometimes nondeterministic, fashion.</p>



<p class="wp-block-paragraph">Functionally, many rules could be modeled as skills and AGENTS.md files. Given the popularity of the latter, it’s unclear whether rules will remain relevant in the long run.</p>



<h3 class="wp-block-heading">Roots: An agent’s starting point</h3>



<p class="wp-block-paragraph">Most agents “start” inside a particular location in a filesystem: a “root.” For coding agents, this means some folder within a Git repository. In some agents, such as Claude Cowork, this is equivalent to the notion of a “project.”</p>



<p class="wp-block-paragraph">While not directly standardized, the notion of a root is implicit in the AGENTS.md standard, which assumes the presence of a filesystem that hosts static context for the agent to operate upon.</p>



<h3 class="wp-block-heading">Plugins: Bundles to bring it all together</h3>



<p class="wp-block-paragraph">Plugins are somewhat unique in the above list. Conceptually, they are a <em>bundle</em> of several of the other artifacts. A plugin can be thought of as a composition of skills, rules, hooks, MCP servers, and some other components. The up-and-coming <a href="https://open-plugins.com/" target="_blank" rel="noreferrer noopener">Open Plugins</a> initiative spearheaded by Vercel is working to finalize what this specification looks like.</p>



<p class="wp-block-paragraph">They serve a natural purpose. Any team leaning into building skills and MCP servers will quickly get to a point where several skills and MCP servers combine to form a practical grouping of guidance and capabilities. Claude Code’s implementation of <a href="https://code.claude.com/docs/en/plugin-marketplaces" target="_blank" rel="noreferrer noopener">plugin marketplaces</a> is becoming a de facto distribution mechanism for plugins. It’s very much an option to catalog individual artifacts and then use mechanisms like that to distribute them as bundles within the plugin abstraction layer.</p>



<p class="wp-block-paragraph">Some companies have fully leaned into this abstraction. For example, Intercom, rather than cataloging skills or hooks individually, <a href="https://ideas.fin.ai/p/how-we-use-claude-code-today-at-intercom" target="_blank" rel="noreferrer noopener">just catalogs plugins</a>—skills and hooks are fully inlined within them.</p>



<p class="wp-block-paragraph">The agentic tooling ecosystem is largely aligned on plugins, with Pi and OpenCode being notable holdouts.</p>



<h2 class="wp-block-heading">Rich, practical catalogs are what can separate AI success stories from repeated false starts</h2>



<p class="wp-block-paragraph">Maybe you choose to go all-in on plugins and bundle your skills and MCP servers inline; maybe you build a granular catalog per artifact type. But whatever shape it takes, what matters is that your company is cataloging—and retaining ownership of—its way of working. And doing so in a way that maximizes potential compatibility with the frontier tooling that is yet to be invented.</p>



<p class="wp-block-paragraph">A company can start on this path immediately. No new vendor relationship is needed, just an internal agreement to start storing artifacts in some company-wide Git repository. From there, it’s a matter of encouraging sharing, moving past individual silos, and celebrating wins (and eventually celebrating <em>usage</em>) of these artifacts. Every addition to that catalog is an opportunity for a colleague to leverage an artifact someone else constructed, a chance to build on top of it, to collaborate or consolidate efforts.</p>



<p class="wp-block-paragraph">If you’re part of a company building its first catalog, I’d like to hear from you. I work with a few companies in the early stages of this initiative, and I’ve been capturing early learnings around managing these catalogs in a very <a href="https://github.com/pulsemcp/air" target="_blank" rel="noreferrer noopener">lightweight open source framework called AIR</a>. If others are getting value out of leaning into these open standards as catalogs, we likely have an opportunity to collaborate across companies on some of the glue and minutiae that can operationalize the ideas here.</p>



<p class="wp-block-paragraph">Ramp and Intercom aren&#8217;t winning because they picked the right tooling vendor. They&#8217;re winning because they&#8217;ve turned individual productivity into organizational capability. The tooling will keep rotating. Whether your company compounds alongside it is a choice worth making deliberately.</p>
]]></content:encoded>
										</item>
		<item>
		<title>Agent Skills Work but the Research Shows Most Teams Are Building Them Wrong</title>
		<link>https://www.oreilly.com/radar/agent-skills-work-but-the-research-shows-most-teams-are-building-them-wrong/</link>
				<pubDate>Mon, 18 May 2026 10:59:14 +0000</pubDate>
					<dc:creator><![CDATA[Aishwarya Naresh Reganti, Prahitha Movva and Kiriti Badam]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18732</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills-work-but-the-research-shows-most-teams-are-building-them-wrong.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/05/Agent-skills-work-but-the-research-shows-most-teams-are-building-them-wrong-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Everybody is building agent skills, but not all skills are created equal. Here are some recent research papers that empirically show best practices to build them.]]></custom:subtitle>
		
				<description><![CDATA[This post was originally published on The Nuanced Perspective and is being reposted here with the authors’ permission. Agent skills are everywhere right now. Atlassian built them into Rovo so agents can automatically triage Jira tickets, draft Confluence pages, and route service requests without anyone typing a prompt. Canva and Figma use them so Claude [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><em>This post was originally published on </em><a href="https://thenuancedperspective.substack.com/p/agent-skills-work-but-the-research" target="_blank" rel="noreferrer noopener">The Nuanced Perspective</a><em> and is being reposted here with the authors’ permission.</em></p>
</blockquote>



<p class="wp-block-paragraph">Agent skills are everywhere right now. Atlassian built them into Rovo so agents can automatically triage Jira tickets, draft Confluence pages, and route service requests without anyone typing a prompt. Canva and Figma use them so Claude can interact with design files directly. Stripe published skills for payment workflow automation. When Anthropic <a href="https://venturebeat.com/technology/anthropic-launches-enterprise-agent-skills-and-opens-the-standard" target="_blank" rel="noreferrer noopener">launched the Agent Skills open standard in December 2025</a>, Microsoft adopted it in VS Code and GitHub within weeks.</p>



<p class="wp-block-paragraph">The idea is elegantly simple. Instead of building a new specialized agent for every use case, you write a skill once, and any agent that understands the standard can use it. A code reviewer, a PR generator, a deployment checklist, a sprint planner. Each lives in a folder, triggers when relevant, and brings your team’s specific way of doing things into the agent’s context.</p>



<p class="wp-block-paragraph">But the research on whether skills actually work, and what causes them to fail, is only catching up to adoption now. Four recent papers take the first systematic look at skills in practice: what the benchmarks show, how libraries break down as they grow, and what a more principled approach to orchestration looks like.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="wp-block-paragraph"><strong>Three findings that will change how you think about skills</strong>:</p>



<ul class="wp-block-list">
<li>Curated skills raised the rate at which agents successfully completed tasks by <strong>16.2 percentage points on average</strong> across 84 tasks. Model-written skills showed no consistent benefit across any configuration tested.</li>



<li>As skill libraries grow, the agent’s ability to find the right skill on demand breaks down. When it scans every skill description in one pass, similar-sounding skills start colliding. <strong>Organizing skills into a hierarchy</strong> rather than a flat list is what the research shows actually fixes this.</li>



<li>A large-scale security study of ~31K community skills found that more than one in four contain exploitable vulnerabilities, spanning <strong>prompt injection</strong>, <strong>data exfiltration</strong>, and <strong>privilege escalation</strong>.</li>
</ul>
</blockquote>



<p class="wp-block-paragraph">This is what those papers found, and what it means for anyone building with skills today.</p>



<h2 class="wp-block-heading">What a skill is</h2>



<p class="wp-block-paragraph">Your team has a specific way of reviewing PRs. Particular checks, a specific order, standards that go beyond what any generic reviewer would know. You’ve explained it to every new engineer who joined. A skill is how you stop explaining it and let the agent carry it instead. In practice it’s a folder with a SKILL.md file at the center: a description that acts as the trigger condition, a body with step-by-step instructions, and optionally scripts and reference documents that load only when needed. A scoped set of tools and instructions the agent can invoke.</p>



<p class="wp-block-paragraph">At session startup, the agent reads only the name and description from each installed skill, which is about 100 tokens per skill. The full instructions load only when the skill activates, and scripts run without being read into context at all. A large skill library costs almost nothing at initialization. The context budget only gets spent when a skill is actually running.</p>



<p class="wp-block-paragraph">That’s progressive disclosure, and it’s what makes skills different from system prompts, which load everything globally every session, or tools, which are API calls that give the agent direct capabilities. The distinction that holds up for MCP is that it gives the agent abilities (say, a shell, an API connection, or access to a database), whereas skills encode the knowledge of how to use those abilities well for a specific workflow. <a href="https://block.github.io/goose/blog/2025/12/22/agent-skills-vs-mcp/" target="_blank" rel="noreferrer noopener">Block’s engineering team put it well</a>: Skills are like GitHub Actions YAML, and MCP is the runner. One describes the workflow and the other makes it possible.</p>
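
<p class="wp-block-paragraph">The startup-versus-activation split described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular agent implements it: It assumes each skill folder holds a SKILL.md whose name and description sit in a frontmatter block delimited by <code>---</code> lines, and the helper names (<code>index_skills</code>, <code>load_body</code>) are made up for the example.</p>

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillSummary:
    name: str
    description: str
    path: Path


def read_frontmatter(skill_md: Path) -> dict:
    """Parse only the simple `key: value` lines between the leading '---' markers."""
    fields = {}
    with skill_md.open(encoding="utf-8") as handle:
        if handle.readline().strip() != "---":
            return fields  # no frontmatter, so nothing to index
        for line in handle:
            if line.strip() == "---":
                break  # stop before the body, which stays unread at startup
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


def index_skills(root: Path) -> list:
    """Startup pass: collect one short summary per skill folder."""
    summaries = []
    for skill_md in sorted(root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        if "name" in meta and "description" in meta:
            summaries.append(SkillSummary(meta["name"], meta["description"], skill_md))
    return summaries


def load_body(summary: SkillSummary) -> str:
    """Activation: only now read the full instructions into context."""
    text = summary.path.read_text(encoding="utf-8")
    parts = text.split("---", 2)
    return parts[2].strip() if len(parts) == 3 else text
```

<p class="wp-block-paragraph">The point of the split is that <code>index_skills</code> touches only a few lines per skill, while <code>load_body</code> runs just for the skill that actually triggers.</p>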



<p class="wp-block-paragraph">Some concrete examples of what this looks like in practice, from teams that have shipped skills in production:</p>



<ul class="wp-block-list">
<li>A <strong>PR review skill</strong> that loads your org’s specific style guide, flagging violations and blockers according to your team’s standards rather than generic best practices</li>



<li>A <strong>deployment checklist skill</strong> that runs your team’s exact predeploy sequence, covering environment checks, rollback verification, and the three Slack channels to notify in order</li>



<li>A <strong>data reporting skill</strong> that knows your company’s metric definitions, so when someone asks for “revenue,” it pulls the right number rather than the closest approximation</li>



<li>A <strong>sprint planning skill</strong> that fetches the backlog, applies your team’s capacity rules, and proposes a plan structured the way your team runs standups</li>
</ul>



<p class="wp-block-paragraph">The value in each of these isn’t the task itself. Any agent can attempt a PR review or a sprint plan. The value is the organizational knowledge baked into how the skill executes it: your style rules, your deploy sequence, your metric definitions, your team’s way of running things. That specificity is also what makes skills hard to get right, as the benchmarks show.</p>



<h2 class="wp-block-heading">What the benchmarks show</h2>



<p class="wp-block-paragraph"><a href="https://arxiv.org/pdf/2602.12670v1" target="_blank" rel="noreferrer noopener">SkillsBench</a> is the <a href="https://www.skillsbench.ai/blogs/introducing-skillsbench" target="_blank" rel="noreferrer noopener">first benchmark</a> built specifically to measure whether agent skills actually improve performance. It tested 84 tasks across 11 domains, running each task under three conditions: no skill, a curated skill, and a self-generated skill. The results are worth sitting with.</p>



<p class="wp-block-paragraph">Curated skills raised average pass rates by 16.2 percentage points. However, the gains were uneven across domains. Software engineering tasks improved by 4.5 points, while healthcare tasks improved by nearly 52. The domains where skills helped most were the ones with highly structured workflows and domain-specific conventions the base model doesn’t carry natively.</p>



<p class="wp-block-paragraph">The less-cited result is that self-generated skills, where the model writes its own skill rather than a human curating one, provided no average benefit across configurations (“<a href="https://arxiv.org/pdf/2602.12670v1">SkillsBench</a>,” Table 3). Some model configurations saw small gains; others saw small losses. The paper’s conclusion was that models cannot reliably author the procedural knowledge they benefit from consuming. The trajectory analysis in the benchmark identified two failure modes:</p>



<ul class="wp-block-list">
<li>Models either generate imprecise procedures lacking specific API patterns, or</li>



<li>Fail to recognize what domain knowledge the task actually requires.</li>
</ul>



<p class="wp-block-paragraph">The benchmark’s self-generation condition has also drawn pushback from practitioners. One engineer writing on <a href="https://hackernoon.com/read-this-before-you-write-another-agent-skill" target="_blank" rel="noreferrer noopener"><em>HackerNoon</em></a> argues the test doesn’t reflect how skilled teams actually build skills. The benchmark prompted a fresh agent to write a skill and immediately use it, which is closer to asking a model to think harder before attempting a task than to building a skill from real execution experience. His own replication, using skills built from actual debugging sessions, showed much stronger results. The distinction matters because a skill captures what a fresh model wouldn’t know. If the model could have reasoned its way there anyway, the skill wasn’t needed.</p>



<p class="wp-block-paragraph">This matters in practice because self-generation is the obvious shortcut. You finish a workflow, ask the agent to extract it as a skill, and move on. The benchmark says that without a human review step, you’re not getting the gains you’d expect. The skills look complete. They often cover the main path. What they miss are the edge cases, the exceptions, the three things your team does differently that the model has no way of knowing, and those are exactly the things that make a skill valuable.</p>



<p class="wp-block-paragraph">One finding worth noting for anyone building with skills: focused skills with two to three modules consistently outperformed comprehensive documentation (“<a href="https://arxiv.org/pdf/2602.12670v1" target="_blank" rel="noreferrer noopener">SkillsBench</a>,” Section 4.2). More coverage in a single skill didn’t help; more focused, well-scoped skills did. The benchmark also found that smaller models running with curated skills could match larger models running without them, which is a meaningful cost implication for anyone running skills at scale (“<a href="https://arxiv.org/pdf/2602.12670v1" target="_blank" rel="noreferrer noopener">SkillsBench</a>,” Section 4.2.3, Finding 7).</p>



<h2 class="wp-block-heading">Questions that come up when building with skills</h2>



<p class="wp-block-paragraph">These questions show up every time a team starts building a skill library.</p>



<p class="wp-block-paragraph"><strong>When does something become a skill versus staying in a workflow or system prompt?</strong><br>The cleaner test is whether this is a recurring task that your team has a specific, repeatable way of doing. If yes, it’s a skill candidate. If it’s a one-time flow or something where general reasoning is sufficient, it probably doesn’t need one. The key difference between a skill and a workflow tool like n8n is flexibility. A workflow executes a fixed sequence and breaks when inputs change, while a skill gives the agent procedural guidance it can apply to variations of the same task. Similarly, agentic workflows can chain multiple agents and tasks together, but each agent still benefits from skills that encode the org-specific knowledge for its part of the chain. When you want the <em>what</em> to be consistent but the agent to handle the <em>how</em> intelligently, that’s a skill.</p>



<p class="wp-block-paragraph"><strong>How narrow or broad should a skill be?</strong><br>The SkillsBench finding that focused skills with two to three modules outperform comprehensive ones is directly relevant here (“<a href="https://arxiv.org/pdf/2602.12670v1" target="_blank" rel="noreferrer noopener">SkillsBench</a>,” Section 4.2). A skill that tries to cover an entire domain tends to underperform one that handles a specific thing well. The more practical question is whether to put a full workflow (data fetch, format, generate PDF) into one skill or split it. Current research supports splitting: Each piece becomes reusable, easier to update when something changes, and less likely to create unexpected behavior when one module’s scope drifts.</p>



<p class="wp-block-paragraph"><strong>What about skills for noncoders or nonsoftware workflows?</strong><br>Skills are format-agnostic. They’re structured instructions plus optional scripts, and the domain can be anything. A customer support team can encode their escalation criteria, tone guidelines, and the specific conditions where a human always takes over. A legal team can encode their document review checklist. A design team can encode component standards so reviews stay consistent across contributors. <a href="https://support.atlassian.com/rovo/docs/agent-actions/" target="_blank" rel="noreferrer noopener">Atlassian’s Rovo agents</a> are a useful reference outside the coding context. Their skills handle ticket triage, Confluence page creation, and service request routing, none of which is software engineering.</p>



<p class="wp-block-paragraph"><strong>When should you deprecate a skill?</strong><br>This is the question that gets skipped most often. The “<a href="https://arxiv.org/pdf/2602.20867v1" target="_blank" rel="noreferrer noopener">SoK</a>” paper argues for treating skills like any other maintained artifact through discovery, refinement, evaluation, update, and eventually deprecation (see Figure 2 in the paper). A skill that was compensating for a model capability gap six months ago may now be redundant, and worse than redundant if it’s overriding better native behavior. The practical test is to run the task with and without the skill and check if the skill still helps. If the gap has closed, retire it.</p>
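
<p class="wp-block-paragraph">The with-and-without test can be sketched as a small harness. This is only an outline under stated assumptions: <code>run_task</code> is a hypothetical callable you would supply that runs one task with the skill enabled or disabled and returns whether it passed, and the 5-point <code>min_gain</code> threshold is an arbitrary placeholder, not a number from the paper.</p>

```python
from typing import Callable, Sequence


def pass_rate(run_task: Callable[[str, bool], bool],
              tasks: Sequence[str], use_skill: bool) -> float:
    """Fraction of tasks that pass with the skill switched on or off."""
    results = [run_task(task, use_skill) for task in tasks]
    return sum(results) / len(results)


def should_retire(run_task: Callable[[str, bool], bool],
                  tasks: Sequence[str], min_gain: float = 0.05) -> bool:
    """Flag the skill for retirement when it no longer adds a meaningful gain."""
    with_skill = pass_rate(run_task, tasks, True)
    without_skill = pass_rate(run_task, tasks, False)
    return min_gain > with_skill - without_skill
```

<p class="wp-block-paragraph">Because agent runs are noisy, a real check would likely repeat each task several times before trusting the comparison.</p>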



<h2 class="wp-block-heading">What breaks as the library grows</h2>



<p class="wp-block-paragraph">A single well-written skill works well. As libraries grow, flat retrieval breaks down, and the “<a href="https://arxiv.org/pdf/2603.02176" target="_blank" rel="noreferrer noopener">AgentSkillOS</a>” paper is the first to study this systematically across ecosystem scales from 200 to 200,000 skills.</p>



<p class="wp-block-paragraph">Flat skill libraries don’t scale. When the agent scans a flat directory of, say, 80+ skills on every request, retrieval becomes unreliable. Two skills with similar descriptions start triggering interchangeably and behavior becomes nondeterministic for the same input. At the extreme, the orchestrator falls into <strong>routing collapse</strong>, where it consistently invokes the wrong skill because the semantic embeddings of two similar skills are indistinguishable. The output looks reasonable, but the wrong skill ran.</p>



<p class="wp-block-paragraph">The fix the paper proposes is capability trees: organize skills into a hierarchy rather than a flat list. Top-level domains like code, data, docs, with more specific skills as branches and leaves. The agent navigates from domain to branch to leaf instead of scanning everything. They also introduce a usage frequency queue, where skills that aren’t being invoked or aren’t improving outcomes get moved to a <strong>dormant index</strong> so they don’t pollute retrieval for active skills.</p>



<p class="wp-block-paragraph">Testing this across ecosystems ranging from 200 to over 200,000 skills, the structured approach consistently outperformed flat invocation, and the gap widened as library size grew.</p>
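
<p class="wp-block-paragraph">To make the domain-to-branch-to-leaf navigation concrete, here is a toy Python sketch. It is not the AgentSkillOS implementation: The word-overlap score stands in for whatever relevance model a real system would use, and the tree contents are invented for illustration.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    description: str
    children: list = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()).intersection(text.lower().split()))


def route(root: Node, query: str) -> Node:
    """Walk down the tree, comparing only the siblings at each level."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.description))
    return node


# Hypothetical library: two domains, three leaf skills.
root = Node("root", "", [
    Node("code", "code review pull request deploy", [
        Node("pr-review", "review a pull request against the style guide"),
        Node("deploy-checklist", "run the deploy checklist before release"),
    ]),
    Node("data", "data metrics revenue report", [
        Node("revenue-report", "build the revenue report from metric definitions"),
    ]),
])

print(route(root, "review this pull request").name)
```

<p class="wp-block-paragraph">At each step the query is compared against a handful of siblings rather than every skill in the library, which is why similar-sounding skills in different branches never compete directly.</p>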



<p class="wp-block-paragraph">This pattern shows up in how production teams manage their libraries too. Atlassian recommends <a href="https://support.atlassian.com/rovo/docs/agent-actions/" target="_blank" rel="noreferrer noopener">fewer than five skills per Rovo agent</a>. OpenHands maintains a <a href="https://github.com/OpenHands/extensions" target="_blank" rel="noreferrer noopener">curated extensions repository</a> with separate skill packages for discrete workflows rather than one monolithic skill set. Across all of them, scoped purposeful skill sets outperform comprehensive ones. More skills isn’t more capable. Past a point, it’s just more noise.</p>



<h2 class="wp-block-heading">How orchestration can work differently</h2>



<p class="wp-block-paragraph"><em>This section uses a different definition of skill than the rest of the article, so the distinction matters upfront.</em></p>



<p class="wp-block-paragraph">In the “<a href="https://arxiv.org/pdf/2602.19672" target="_blank" rel="noreferrer noopener">SkillOrchestra</a>” paper, a skill isn’t a SKILL.md file. It’s a capability description used to match task requirements to individual agents in a multi-agent system (see Figure 3 in the paper). The concern isn’t procedural knowledge for one agent but figuring out which agent in a pool should handle a given task and why.</p>



<p class="wp-block-paragraph">The problem it’s solving is that standard reinforcement learning approaches to multi-agent routing don’t hold up as systems grow. Adding a new agent or modifying a workflow means retraining from scratch. RL policies also tend to send everything to the highest-capability agent regardless of cost, which looks fine in evaluation but gets expensive when you’re running it in production.</p>



<p class="wp-block-paragraph">SkillOrchestra’s alternative has each agent maintain a <strong>competence profile</strong> derived from its own execution history, specifically estimated success rates across different task types. The orchestrator routes incoming tasks to the agent whose profile best matches what the task actually demands, rather than the one with the highest raw capability. The routing logic stays current without retraining, and you can inspect why a task went where it went.</p>



<p class="wp-block-paragraph">The same logic applies to SKILL.md-based systems. Tracking which skills actually improve outcomes for specific task types, and what they cost in tokens, gives you the foundation for better selection as your library grows. You don’t need SkillOrchestra’s full framework to benefit from the core idea.</p>
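
<p class="wp-block-paragraph">A minimal sketch of profile-based routing might look like the following. It is an illustration of the core idea only, not SkillOrchestra’s algorithm: The smoothed success rate, the linear cost penalty, and the <code>cost_weight</code> value are all assumptions chosen to keep the example short.</p>

```python
from collections import defaultdict


class CompetenceProfile:
    """Running success statistics per task type for one agent (or one skill)."""

    def __init__(self, cost_per_call: float):
        self.cost_per_call = cost_per_call
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)

    def record(self, task_type: str, succeeded: bool) -> None:
        self.attempts[task_type] += 1
        self.successes[task_type] += int(succeeded)

    def success_rate(self, task_type: str) -> float:
        # Add-one smoothing so an unseen task type starts at 0.5, not 0 or 1.
        return (self.successes[task_type] + 1) / (self.attempts[task_type] + 2)


def choose_agent(profiles: dict, task_type: str, cost_weight: float = 0.1) -> str:
    """Pick the name whose estimated success, minus a cost penalty, is highest."""
    def score(name: str) -> float:
        profile = profiles[name]
        return profile.success_rate(task_type) - cost_weight * profile.cost_per_call
    return max(sorted(profiles), key=score)
```

<p class="wp-block-paragraph">Because the decision is just a comparison of recorded numbers, you can print the scores to see why a task went where it went, and a new agent or skill joins the pool by adding one more profile.</p>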



<h2 class="wp-block-heading">The security problem</h2>



<p class="wp-block-paragraph">A <a href="https://arxiv.org/abs/2601.10338" target="_blank" rel="noreferrer noopener">large-scale security analysis</a> of 31,132 community-sourced skills found that 26.1% contain at least one exploitable vulnerability, spanning prompt injection, data exfiltration, privilege escalation, and supply chain risks. More than one in four.</p>



<p class="wp-block-paragraph">The attack patterns aren’t exotic. Prompt injections hidden in skill descriptions manipulate agent behavior once the skill loads. Scripts execute with filesystem permissions broader than the skill needs. Tool authorizations are scoped to the entire workspace when the task only requires one directory.</p>



<p class="wp-block-paragraph">The core issue is that an external skill isn’t a document you’re reading. It’s code running with your agent’s permissions. Importing a skill from a public repository without reviewing it is like running npm install on a package from an unknown author. You wouldn’t do that without at least checking what the package does. That framing changes what due diligence looks like. It means checking the scripts folder before installing, verifying that the permissions the skill requests match what the task actually requires, and sandboxing execution where your environment allows.</p>
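
<p class="wp-block-paragraph">A rough sketch of the first two checks follows. It assumes a skill directory with a <code>scripts</code> folder and assumes you have already written down, by hand, which permissions the skill requests and which the task needs. The permission strings and the skill name are hypothetical; no standard manifest format is implied.</p>

```python
from pathlib import Path
import tempfile


def list_scripts(skill_dir):
    """Return relative paths of every file under the skill's scripts folder."""
    scripts = Path(skill_dir) / "scripts"
    if not scripts.is_dir():
        return []
    return sorted(
        p.relative_to(skill_dir).as_posix()
        for p in scripts.rglob("*")
        if p.is_file()
    )


def excess_permissions(requested, needed):
    """Permissions the skill asks for that the task does not require."""
    return sorted(set(requested) - set(needed))


# Build a throwaway example skill so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "report-skill"
    (skill / "scripts").mkdir(parents=True)
    (skill / "SKILL.md").write_text("name: report-skill\n")
    (skill / "scripts" / "export.py").write_text("print('export')\n")
    found = list_scripts(skill)

# Hypothetical permission labels, written down by the reviewer.
extra = excess_permissions(
    requested={"read:workspace", "write:workspace", "network"},
    needed={"read:workspace"},
)

print("scripts to read before installing:", found)
print("permissions beyond what the task needs:", extra)
```

<p class="wp-block-paragraph">This only surfaces what to look at. Reading the listed scripts and deciding whether the extra permissions are justified is still a manual review.</p>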



<p class="wp-block-paragraph">Tooling for auditing skills at install time isn’t yet where it should be. Until it is, due diligence is manual. <a href="https://github.com/OpenHands/extensions">OpenHands’ extensions repository</a> and <a href="https://medium.com/@xuelangping/introducing-atlassian-skills-extending-ai-agent-with-atlassian-integration-fa19f6056df7">Atlassian’s open source skill package</a> are reasonable references for how production-grade community skills scope permissions. Claude Code’s built-in skill creator also helps here, since it structures permission scoping explicitly from the start.</p>



<h2 class="wp-block-heading">3 things to do differently</h2>



<p class="wp-block-paragraph">Across all four papers, three recommendations are consistent.</p>



<p class="wp-block-paragraph"><strong>Write skills from real execution.</strong> Do the workflow manually with an agent, correct it as you go, then extract it as a skill. The agent has full context of what worked. Skills built from real runbooks, incident reports, and accumulated corrections outperform skills written from scratch. The org-specific edge cases are exactly what the base model doesn’t already know. The general workflow it can handle; the three exceptions your team deals with differently are what the skill needs to capture.</p>



<p class="wp-block-paragraph"><strong>Treat the description as routing logic.</strong> The description isn’t a label. It’s how the skill gets triggered at all. It needs specific phrases, explicit activation conditions, and context that distinguishes this skill from adjacent ones. If a skill isn’t firing when you expect it to, or fires when it shouldn’t, rewrite the description first. That’s almost always where the problem is.</p>



<p class="wp-block-paragraph"><strong>Plan for the full lifecycle.</strong> Creation is the easy part. Skills drift out of relevance as models improve. A skill that compensated for something Claude couldn’t do eight months ago may now be actively overriding better native behavior. Skills need to be evaluated against actual task outcomes, updated when workflows change, and retired when they stop earning their place. The teams that treat their skill libraries the way good engineering teams treat their codebase, with reviews, metrics, and a process for deprecation, are the ones whose libraries stay useful as they grow.</p>



<h2 class="wp-block-heading">Where this is heading</h2>



<p class="wp-block-paragraph">The shift from prompt engineering to tool use to skill engineering has followed a pattern. Each era produces artifacts that persist longer than the last. Prompts lived in conversations. Tools live in configurations. Skills live in libraries, versioned, shared, maintained, and eventually retired. They behave like code.</p>



<p class="wp-block-paragraph">Most teams aren’t treating them that way yet. Skills get written quickly, without evaluation criteria, without any plan for what happens when they stop being useful. That’s worked so far because most skill libraries are still small enough to hold in your head. It won’t hold as they become infrastructure.</p>



<p class="wp-block-paragraph">The teams building durable agent systems won’t be the ones with the most skills. They’ll be the ones who figured out earlier that a skill library needs to be maintained, not just populated, and who started building the discipline to do that before it became urgent.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p class="wp-block-paragraph"><em>This article grew out of a live “Chai &amp; AI” session conducted by </em><a href="https://open.substack.com/users/14105724-prahitha-movva?utm_source=mentions" target="_blank" rel="noreferrer noopener"><em>Prahitha Movva</em></a><em> where practitioners debated whether agent skills actually deliver on the hype, or just add another layer of complexity.</em></p>
]]></content:encoded>
										</item>
	</channel>
</rss>
