<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Fri, 03 Apr 2026 11:14:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>The Cathedral, the Bazaar, and the Winchester Mystery House</title>
		<link>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/</link>
				<comments>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/#respond</comments>
				<pubDate>Fri, 03 Apr 2026 11:14:53 +0000</pubDate>
					<dc:creator><![CDATA[Drew Breunig]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18446</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-Cathedral-the-Bazaar-and-the-Winchester-Mystery-House.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-Cathedral-the-Bazaar-and-the-Winchester-Mystery-House-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Our era of sprawling, idiosyncratic tooling]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Drew Breunig’s blog and is being republished here with the author’s permission. In 1998, Eric S. Raymond published the founding text of open source software development, The Cathedral and the Bazaar. In it, he detailed two methods of building software: The bazaar model was enabled by the internet, which [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://www.dbreunig.com/2026/03/26/winchester-mystery-house.html" target="_blank" rel="noreferrer noopener"><em>Drew Breunig’s blog</em></a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>In 1998, Eric S. Raymond published the founding text of open source software development, <a href="http://www.catb.org/~esr/writings/cathedral-bazaar/" target="_blank" rel="noreferrer noopener"><em>The Cathedral and the Bazaar</em></a>. In it, he detailed two methods of building software:</p>



<ul class="wp-block-list">
<li><em>The cathedral</em> model is carefully planned, closed-source, and managed by an exclusive team of developers.</li>



<li><em>The bazaar</em> model is open, transparent, and community-driven.</li>
</ul>



<p>The bazaar model was enabled by the internet, which allowed for distributed coordination and distribution. More people could contribute code and share feedback, yielding better, more secure software. “Given enough eyeballs, all bugs are shallow,” Raymond wrote, coining <a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">Linus’s law</a>.</p>



<p>The ideas crystallized in <em>The Cathedral and the Bazaar</em> helped kick off a quarter-century of open source innovation and dominance.</p>



<p>But just as the internet made communication cheap and birthed the bazaar, AI is making code cheap and kicking off a new era filled with idiosyncratic, sprawling, cobbled-together software.</p>



<p>Meet the third model: <em>The Winchester Mystery House</em>.</p>



<figure class="wp-block-image size-full"><a href="https://www.flickr.com/photos/harshlight/3669393933"><img fetchpriority="high" decoding="async" width="1600" height="898" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.jpeg" alt="Image by HarshLight on Flickr (and used here on a Creative Commons license)" class="wp-image-18447" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.jpeg 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-300x168.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-768x431.jpeg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1536x862.jpeg 1536w" sizes="(max-width: 1600px) 100vw, 1600px" /></a><figcaption class="wp-element-caption"><em><a href="https://www.flickr.com/photos/harshlight/3669393933" target="_blank" rel="noreferrer noopener">Winchester Mystery House</a></em> (<em>image by <a href="https://www.flickr.com/photos/harshlight/" target="_blank" rel="noreferrer noopener">HarshLight</a> and used here on a </em><a href="https://creativecommons.org/licenses/by/2.0/" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">The Winchester Mystery House</h2>



<p>Located less than 10 miles southeast of the <a href="https://computerhistory.org/" target="_blank" rel="noreferrer noopener">Computer History Museum</a>, the <a href="https://en.wikipedia.org/wiki/Winchester_Mystery_House" target="_blank" rel="noreferrer noopener">Winchester Mystery House</a> is an architectural oddity.</p>



<p>Following the death of her husband and mother-in-law, Sarah Winchester controlled a fortune. Her shares in the <a href="https://en.wikipedia.org/wiki/Winchester_Repeating_Arms_Company" target="_blank" rel="noreferrer noopener">Winchester Repeating Arms Company</a>, and the dividends they threw off, made it so Sarah could not only live in comfort but pursue whatever passion she desired. That passion was architecture.</p>



<p>Sarah didn’t build her mansion to house ghosts<sup data-fn="6b5d56a2-c8e2-4889-b816-684245a77bcd" class="fn"><a href="#6b5d56a2-c8e2-4889-b816-684245a77bcd" id="6b5d56a2-c8e2-4889-b816-684245a77bcd-link">1</a></sup>; <a href="https://amzn.to/4rZK1C8" target="_blank" rel="noreferrer noopener">she built her mansion because she liked architecture</a>. With no license, no formal training, in an era when women (even very rich women) didn’t have a path to practicing architecture, Sarah focused on her own home. She made up for her lack of license with passion and effectively unlimited funds.</p>



<p>Sarah built what she wanted. “<a href="https://en.wikipedia.org/wiki/Winchester_Mystery_House" target="_blank" rel="noreferrer noopener">At its largest the house had ~500 rooms</a>.” Today it has roughly 160 rooms, 2,000 doors, 10,000 windows, 47 stairways, 47 fireplaces, 13 bathrooms, and 6 kitchens. Carved wood drapes the walls and ceilings. Stained glass is everywhere. Projects were planned, completed, abandoned, torn down, and rebuilt.</p>



<p>It was anything but aimless. And practical innovations ran throughout, including push-button gas lighting, an early intercom system, steam heating, and indoor gardens. The oddities that amuse today’s visitors were mostly practical accommodations for Sarah’s health (stairways with very small steps), functional designs no longer used (trap doors in greenhouses to route excess water), or quick fixes to damage from the 1906 earthquake.</p>



<p>Winchester died in 1922. Nine months later, the house became a tourist attraction.</p>



<p>Today, many programmers are Sarah Winchester.</p>



<figure class="wp-block-image size-full"><img decoding="async" width="880" height="440" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5.png" alt="Claude Code's public GitHub activity" class="wp-image-18448" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5.png 880w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5-300x150.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-5-768x384.png 768w" sizes="(max-width: 880px) 100vw, 880px" /><figcaption class="wp-element-caption"><em>Claude Code&#8217;s public GitHub activity</em></figcaption></figure>



<h2 class="wp-block-heading">What happens when code is cheap</h2>



<p>We aren’t as rich as Sarah Winchester, but when code is this cheap, we don’t need to be.</p>



<p>Jodan Alberts illustrated this recently, <a href="https://www.claudescode.dev/" target="_blank" rel="noreferrer noopener">collecting and visualizing data detailing public GitHub commits attributed to Claude Code</a>. That’s his data in the chart above, with Claude seeming to only accelerate through March.<sup data-fn="de32f944-88cb-40cf-ba5a-a85253a6ad73" class="fn"><a href="#de32f944-88cb-40cf-ba5a-a85253a6ad73" id="de32f944-88cb-40cf-ba5a-a85253a6ad73-link">2</a></sup></p>



<p>It’s hard to get a handle on individual usage though, so I went searching for a proxy and landed on the chart below:</p>



<figure class="wp-block-image size-full"><img decoding="async" width="880" height="396" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6.png" alt="Average net lines added per commit in Claude Code: 7-day average" class="wp-image-18449" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6.png 880w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6-300x135.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-6-768x346.png 768w" sizes="(max-width: 880px) 100vw, 880px" /><figcaption class="wp-element-caption"><em>Average net lines added per commit in Claude Code: 7-day average</em></figcaption></figure>



<p>After Opus 4.5 and recent work enabling Agent Teams, the average net lines added by Claude per commit is now smooth and steady at <em>1,000 lines of code per commit.</em><sup data-fn="bc98f5bc-dd9d-4421-a544-65d4191ad4fb" class="fn"><a href="#bc98f5bc-dd9d-4421-a544-65d4191ad4fb" id="bc98f5bc-dd9d-4421-a544-65d4191ad4fb-link">3</a></sup></p>



<p><strong>1,000 lines of code per commit is roughly two orders of magnitude higher than what a human programmer writes <em>per day</em>.</strong></p>



<p>If you search for human benchmarks, you’ll find many citing Fred Brooks’s <em><a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month">The Mythical Man-Month</a></em> while claiming a good engineer might write <em>10 cumulative lines of code per day.</em><sup data-fn="bb51a862-1362-4241-b2ba-6fecac1df6b9" class="fn"><a href="#bb51a862-1362-4241-b2ba-6fecac1df6b9" id="bb51a862-1362-4241-b2ba-6fecac1df6b9-link">4</a></sup> Explore further and you’ll find higher numbers cited, but generally fewer than 100.</p>



<p>Here’s a good anecdote from <a href="https://antirez.com/latest/0" target="_blank" rel="noreferrer noopener">antirez</a> on a <a href="https://news.ycombinator.com/item?id=22305934" target="_blank" rel="noreferrer noopener">Hacker News</a> thread discussing the Brooks “quote”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I did some trivial math. Redis is composed of 100k lines of code, I wrote at least 70k of that in 10 years. I never work more than 5 days per week and I take 1 month of vacations every year, so assuming I work 22 days every month for 11 months:</p>



<p><em>70000/(22 x 11 x 10) = ~29 LOC / day</em></p>



<p>Which is not too far from 10. There are days where I write 300-500 LOC, but I guess that a lot of work went into rewriting stuff and fixing bugs, so I rewrote the same lines again and again over the course of years, but yet I think that this should be taken into account, so the Mythical Man Month book is indeed quite accurate.</p>
</blockquote>
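<p>As a quick sanity check on antirez’s arithmetic (the working-day assumptions are his, taken from the quote above):</p>

```python
# Reproduce antirez's back-of-the-envelope estimate:
# ~70k lines of Redis written over 10 years, working 22 days a month
# for 11 months a year.
lines_written = 70_000
working_days = 22 * 11 * 10        # 2,420 working days in total
loc_per_day = lines_written / working_days
print(round(loc_per_day, 1))       # ~28.9, i.e. roughly 29 LOC/day
```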



<p>Six years after this comment, Claude is pushing <em>1,000</em> lines of code <em>per commit</em>.</p>



<p>So what do we do with all this cheap code?</p>



<p>Unfortunately, everything else remains roughly the same cost and roughly the same speed. Feedback hasn’t gotten cheaper; the “<a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">eyeballs</a>” that guided the software developed by the bazaar haven’t caught up to AI.</p>



<p>There is only one source of feedback that moves at the speed of AI-generated code: yourself. You’re there to prompt, you’re there to review. You don’t need to recruit testers, run surveys, or manage design partners. You just build what you want and use what you build.</p>



<p>And that’s what many of us are doing with cheap code: building idiosyncratic tools for ourselves, guided by our passions, taste, and needs.</p>



<p>Sound familiar?</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1567" height="799" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.jpeg" alt="" class="wp-image-18450" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.jpeg 1567w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-300x153.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-768x392.jpeg 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-1536x783.jpeg 1536w" sizes="auto, (max-width: 1567px) 100vw, 1567px" /><figcaption class="wp-element-caption"><a href="https://commons.wikimedia.org/wiki/File:Winchester_Mystery_House_2023-07-17_02.jpg" target="_blank" rel="noreferrer noopener"><em>Winchester Mystery House, San Jose, California</em></a><em> (image by </em><a href="https://commons.wikimedia.org/wiki/User:The_wub" target="_blank" rel="noreferrer noopener"><em>The wub</em></a><em> and used here under a </em><a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">Welcome to the mystery house</h2>



<p>Steve Yegge’s <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">Gas Town</a> is a Winchester Mystery House. It’s <em>incredibly</em> idiosyncratic and sprawling, rich with metaphors and hacks. It’s the perfect tool for Steve.</p>



<p>Jeffrey Emanuel’s <a href="https://agent-flywheel.com/" target="_blank" rel="noreferrer noopener">Agent Flywheel</a> is a Winchester Mystery House. A significant subset of <a href="https://www.nytimes.com/2026/03/20/technology/tokenmaxxing-ai-agents.html" target="_blank" rel="noreferrer noopener">tokenmaxxers</a> decide they need to rebuild their dependencies in Rust; Jeff is one such example. His “<a href="https://github.com/Dicklesworthstone#the-frankensuite" target="_blank" rel="noreferrer noopener">FrankenSuite</a>” includes Rust rewrites of SQLite, Node.js, btrfs, Redis, pandas, NumPy, JAX, and Torch.</p>



<p>Philip Zeyliger noted the pattern last week, writing, “<a href="https://blog.exe.dev/bones-of-the-software-factory" target="_blank" rel="noreferrer noopener">Everyone is building a software factory</a>.” But it goes beyond software. Garry Tan’s personal AI committee <a href="https://github.com/garrytan/gstack" target="_blank" rel="noreferrer noopener">gstack</a> is a Winchester Mystery House constructed mostly from Markdown.</p>



<p>Everywhere you look, there are Winchester Mystery Houses.</p>



<p>Each Winchester Mystery House is <strong>idiosyncratic</strong>. They are highly personalized. The tightly coupled feedback loop between the coding agent and the user yields software that reflects the developer’s desires. They usually lack documentation. To outsiders, they’re inscrutable.</p>



<p>Winchester Mystery Houses are <strong>sprawling</strong>. Guided by the needs of the developer, these tools tend to spread out, constantly annexing territory in the form of new functions and new repositories. Work is almost always additive. Code is added when it’s needed, bugs are patched in place, and countless appendages remain. There’s little incentive to prune when code is free.</p>



<p>And building a Winchester Mystery House should be <strong>fun</strong>. Coding agents turn everything into a side quest, and we eagerly join in. Building the perfect workflow is a passion for many devs, so we keep pushing.</p>



<p>Winchester Mystery Houses are idiosyncratic, sprawling, and fun. But does this mean we’re abandoning the bazaar?</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1200" height="549" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.jpeg" alt="A Crowded Market in Dhaka, Bangladesh (image by International Food Policy Research Institute / 2010 and used here on a Creative Commons license)" class="wp-image-18451" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.jpeg 1200w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-300x137.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-768x351.jpeg 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /><figcaption class="wp-element-caption"><a href="https://www.flickr.com/photos/ifpri/4860343116" target="_blank" rel="noreferrer noopener"><em>A Crowded Market in Dhaka, Bangladesh</em></a><em> (image by </em><a href="https://www.flickr.com/photos/ifpri/" target="_blank" rel="noreferrer noopener"><em>International Food Policy Research Institute</em></a><em> / 2010 and used here on a </em><a href="https://creativecommons.org/licenses/by-nc-nd/2.0/" target="_blank" rel="noreferrer noopener"><em>Creative Commons license</em></a><em>)</em></figcaption></figure>



<h2 class="wp-block-heading">What happens to the bazaar?</h2>



<p>What happens when we all tend to our mystery houses? When our free time is spent building tools just for ourselves, will we stop working on shared projects? Will we abandon the bazaar?</p>



<p>Probably not. The bazaar is <em>packed</em> right now, but not in a good way.</p>



<p>Code is cheap, so people are slamming open source repositories with agent-written contributions, in an attempt to pad their résumés or manifest their pet features. Daniel Stenberg <a href="https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty/" target="_blank" rel="noreferrer noopener">ended bug bounties for curl</a> after a deluge of poor submissions sapped reviewer bandwidth. It’s gotten so bad, <a href="https://github.blog/changelog/2026-02-13-new-repository-settings-for-configuring-pull-request-access/" target="_blank" rel="noreferrer noopener">GitHub recently added a feature to disable pull request contributions</a>.</p>



<p>Anecdotally, I’m seeing good contributions pick up as well. They’re just drowned out by the slop. For what it’s worth, <a href="https://github.com/curl/curl/graphs/contributors" target="_blank" rel="noreferrer noopener">curl commits are dramatically <em>up</em> in the agentic era</a>. And people <em>are</em> sharing what they build. A <a href="https://www.dumky.net/posts/youre-right-to-be-anxious-about-ai-this-is-how-much-we-are-building/" target="_blank" rel="noreferrer noopener">recent analysis by Dumky</a> shows packages and repos rising in the last quarter.</p>



<p>There’s plenty of budget for both mystery houses and the bazaar when code is <em>this</em> cheap. The new challenge is developing systems and processes for managing the deluge. We don’t need <a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank" rel="noreferrer noopener">eyeballs</a> to find bugs <em>in</em> the software; we need eyeballs to find bugs before they <em>reach</em> the software.</p>



<p>In many ways this is the inverse of the bazaar model era. The internet made feedback and communal coordination faster, easier, and cheaper. The bazaar model has a high throughput of feedback (many eyeballs) but relatively high latency for modifications (file an issue, discuss, submit a PR, wait for review, etc.).</p>



<p>Coding agents, on the other hand, make implementation faster while feedback and coordination are unchanged. The Winchester Mystery House model sidesteps this by collapsing the feedback loop into one person: Latency is near zero, but throughput is just you. The bazaar, defined by communal work, can’t adopt this hack. Coding agents in the bazaar create a mess: implementation at machine speed hitting coordination infrastructure built for human speed. Which is why maintainers feel like they’re drowning.</p>



<p>We need new tools, skills, and conventions.</p>



<h2 class="wp-block-heading">Lessons from the mystery house</h2>



<p>Coding agents have dropped the cost of code so dramatically we’re entering a new era of software development, the first change of this magnitude since the internet kicked off open source software. Change arrived quickly, and it’s not slowing down. But in reviewing the Winchester Mystery House framework, I think we can take away a few lessons.</p>



<h3 class="wp-block-heading">Lesson 1: The bazaar and Winchester Mystery Houses can coexist.</h3>



<p>When listing example Winchester Mystery Houses, I didn’t mention <a href="https://github.com/openclaw/openclaw" target="_blank" rel="noreferrer noopener">OpenClaw</a>, even though it is <em>the</em> defining example. I saved it for here because it nicely illustrates how Winchester Mystery Houses and the bazaar can coexist.</p>



<p>OpenClaw is incredibly modular and places few limitations on the user. It integrates 25 different chat and notification systems, plugs into most inference end points, and is built on the exceptionally flexible <a href="https://github.com/badlogic/pi-mono" target="_blank" rel="noreferrer noopener">pi</a> agent toolkit. This eager flexibility was embraced early—security and data protections be damned—but since its exponential adoption Peter Steinberger and the community have been steadily pushing improvements and fixes.</p>



<p>And as with other breakout open source projects of yore, the ecosystem is adopting the best ideas and mitigating the worst aspects of OpenClaw. Countless alternate “claw” projects have emerged. (There’s NanoClaw, NullClaw, ZeroClaw, and more!) Companies have launched services to make claws easier or safer to run. Cloudflare launched Moltworker to make deployment easy, Nvidia shipped NemoClaw with a security focus, and Claude keeps adding claw-like features to its desktop app.</p>



<h3 class="wp-block-heading">Lesson 2: Don’t sell the fun stuff.</h3>



<p>One reason OpenClaw works so well in the bazaar is that it is a <em>foundation for personal tools.</em> Out of the box, a claw just sits there. It’s up to the user to determine what it does and how it does it, leveraging the connections and infrastructure OpenClaw provides. OpenClaw lets less experienced developers spin up their own Winchester Mystery Houses, while experienced devs get to leverage the common integrations and systems OpenClaw provides. Peter and team have done a great job drawing a line between the common core (what the bazaar works on) and what they leave up to the user: The boring, critical stuff is the job of the commons.</p>



<p>Thinking back to Sarah Winchester and her idiosyncratic, sprawling mansion, we see the same pattern. Sarah hired vendors! She used off-the-shelf parts! Her bathtubs, toilets, faucets, and plumbing weren’t crafted on site.</p>



<p>The boring stuff, the hard bits, or the things that have <em>disastrous</em> failure modes are the things we should collaborate on or employ specialists to handle. (Come to think of it, plumbing checks all three boxes.) This is the opportunity for open source software, dev tools, and software companies.</p>



<p>Don’t try to sell developers the stuff that’s fun, the stuff they <em>want</em> to build. Sell them the stuff they avoid or don’t want to take responsibility for. Sarah Winchester didn’t hire metalworkers to craft the pipes for her plumbing, but she <em>did</em> hire craftspeople to create hundreds of stained-glass windows to her specs.</p>



<h3 class="wp-block-heading">Lesson 3: The limits of code are communication.</h3>



<p>OpenClaw shows the bazaar remains relevant but also highlights the problems facing open source in the agentic era. Right now, there are 1,173 open pull requests and 1,884 new issues on the <a href="https://github.com/openclaw/openclaw/pulse" target="_blank" rel="noreferrer noopener">OpenClaw repo</a>.</p>



<p>There are more projects, and more code, than we could ever review. The challenge now, for open source maintainers and users, is sifting through it all. How do we find the novel ideas that <em>everyone</em> should adopt and borrow?</p>



<p>OpenClaw is one of the successes, something we <em>all</em> noticed. And for it, the problem is processing the feedback. For the projects we’ll never find, the ones lost in the deluge, their problem is lack of feedback. You either find attention and drown in contributions or drown in the ocean of repos and never hear a thing.</p>



<p>The internet made coordination cheap and gave us the bazaar. Coding agents made implementation cheap and gave us the Winchester Mystery House. What we’re missing are the tools and conventions that make attention cheap, that let maintainers absorb contributions at machine speed and let good ideas surface among the noise. Until we figure this out, the bazaar will keep getting louder without getting smarter, and the best ideas in our mystery houses will be forgotten once we stop maintaining them.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="6b5d56a2-c8e2-4889-b816-684245a77bcd">The lore that Winchester built her mansion to house ghosts killed by Winchester rifles is likely just gossip and marketing. There’s little evidence to support these claims. (<em>99% Invisible</em> has a good episode <a href="https://99percentinvisible.org/episode/mystery-house/" target="_blank" rel="noreferrer noopener">exploring Winchester, her house, and this lore</a>.) <a href="#6b5d56a2-c8e2-4889-b816-684245a77bcd-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="de32f944-88cb-40cf-ba5a-a85253a6ad73">While editing this piece, <a href="https://www.dumky.net/posts/youre-right-to-be-anxious-about-ai-this-is-how-much-we-are-building/" target="_blank" rel="noreferrer noopener">Dumky published another analysis illustrating the production of coding agents</a>. In it he shows a 280% increase in “Show HN” posts, a 93% increase in new GitHub repos, and a <em>dramatic</em> uptick in packages published to Crates.io. <a href="#de32f944-88cb-40cf-ba5a-a85253a6ad73-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bc98f5bc-dd9d-4421-a544-65d4191ad4fb">Anthropic’s ability to stabilize this line is rather impressive. Claude Code is getting better at planning and better at chunking out work, enabling more effective subagent delegation. 
<a href="#bc98f5bc-dd9d-4421-a544-65d4191ad4fb-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bb51a862-1362-4241-b2ba-6fecac1df6b9">Though this is likely an updated tweak of Brooks’s statement that an “industrial team” might write 1,000 “statements” per <em>year</em>. <a href="#bb51a862-1362-4241-b2ba-6fecac1df6b9-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-cathedral-the-bazaar-and-the-winchester-mystery-house/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Toolkit Pattern</title>
		<link>https://www.oreilly.com/radar/the-toolkit-pattern/</link>
				<comments>https://www.oreilly.com/radar/the-toolkit-pattern/#respond</comments>
				<pubDate>Thu, 02 Apr 2026 11:19:06 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18436</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-toolkit-pattern.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-toolkit-pattern-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why your project&#039;s best documentation is a file only AI will read]]></custom:subtitle>
		
				<description><![CDATA[This is the third article in a series on agentic engineering and AI-driven development. Read part one here, part two here, and look for the next article on April 15 on O’Reilly Radar. The toolkit pattern is a way of documenting your project&#8217;s configuration so that any AI can generate working inputs from a plain-English description. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the third article in a series on agentic engineering and AI-driven development. Read part one <a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two <a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, and look for the next article on April 15 on O’Reilly Radar.</em></p>
</blockquote>



<p>The <strong>toolkit pattern</strong> is a way of documenting your project&#8217;s configuration so that any AI can generate working inputs from a plain-English description. You and the AI create a single file that describes your tool&#8217;s configuration format, its constraints, and enough worked examples to make that generation reliable. You build it iteratively, working with the AI (or, better, multiple AIs) to draft it. You test it by starting a fresh AI session and trying to use it, and every time that fails you grow the toolkit from those failures. When you build the toolkit well, your users will never need to learn how your tool’s configuration files work, because they describe what they want in conversation and the AI handles the translation. That means you don’t have to compromise on the way your project is configured, because the config files can be more complex and more complete than they would be if a human had to edit and understand them.</p>
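<p>In practice, a minimal toolkit file might look like the sketch below. The tool, file names, fields, and worked example here are invented for illustration; they are not drawn from the article:</p>

```markdown
# TOOLKIT.md — how an AI should generate config for this (hypothetical) job runner

## Format
Jobs live in `jobs.yaml`. Each entry has three required keys:
`name` (unique string), `schedule` (five-field cron expression), `command`.

## Constraints
- `schedule` must use exactly five cron fields; six-field (seconds) syntax is rejected.
- `command` paths are resolved relative to the repo root.

## Worked example
Request: "run the backup script every night at 2am"

    - name: nightly-backup
      schedule: "0 2 * * *"
      command: ./scripts/backup.sh
```

<p>The point of each section is to answer a question a fresh AI session would otherwise get wrong: the format, the constraints it can’t infer, and a worked translation from plain English.</p>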



<p>To understand why all of this matters, let me take you back to the mid-1980s.</p>



<p>I was 12 years old, and our family got an AT&amp;T PC 6300, an IBM-compatible that came with a user&#8217;s guide roughly 159 pages long. Chapter 4 of that manual was called &#8220;What Every User Should Know.&#8221; It covered things like how to use the keyboard, how to care for your diskettes, and, memorably, how to label them, complete with hand-drawn illustrations and really useful advice, like how you should only use felt-tipped pens, never ballpoint, because the pressure might damage the magnetic surface.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1512" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.png" alt="A page from the AT&amp;T PC 6300 User's Guide, Chapter 4: &quot;Labeling Diskettes&quot;" class="wp-image-18437" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-300x284.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-768x726.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-1-1536x1452.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>A page from the AT&amp;T PC 6300 User&#8217;s Guide, Chapter 4: &#8220;Labeling Diskettes&#8221;</em></figcaption></figure>



<p>I remember being fascinated by this manual. It wasn&#8217;t our first computer. I&#8217;d been writing BASIC programs and dialing into BBSs and CompuServe for a couple of years, so I knew there were all sorts of amazing things you could do with a PC, especially one with a blazing fast 8MHz processor. But the manual barely mentioned any of that. It seemed really weird to me, even as a kid, that you would give someone a manual that devoted a whole page to using the backspace key to correct typing mistakes (really!) but never actually told them how to use the thing to do anything useful.</p>



<p>That&#8217;s how most developer documentation works. We write the stuff that&#8217;s easy to write—installation, setup, the getting-started guide—because it&#8217;s a lot easier than writing the stuff that&#8217;s actually hard: the deep explanation of how all the pieces fit together, the constraints you only discover by hitting them, the patterns that separate a configuration that works from one that almost works. This is yet another &#8220;looking for your keys under the streetlight&#8221; problem: We write the documentation we write because it&#8217;s easiest to write, even if it&#8217;s not really the documentation our users need.</p>



<p>Developers who came up through the Unix era know this well. Man pages were thorough, accurate, and often completely impenetrable if you didn&#8217;t already know what you were doing. The <code>tar</code> man page is the canonical example: It documents every flag and option in exhaustive detail, but if you just want to know how to extract a <code>.tar.gz</code> file, it&#8217;s almost useless. (The right flags are <code>-xzvf</code>, in case you&#8217;re curious.) Stack Overflow exists in large part because man pages like <code>tar</code>&#8217;s left a gap between what the documentation said and what developers actually needed to know.</p>
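<p>The gap is easy to see in a sketch. The man page documents each flag in isolation; what users actually need is one worked line (the filenames below are made up for the example):</p>

```shell
# Create a sample archive, then extract it the way the man page never quite spells out:
echo "hello" > note.txt
tar -czf docs.tar.gz note.txt   # c = create, z = gzip, f = archive file
tar -xzvf docs.tar.gz           # x = extract, z = gunzip, v = verbose, f = archive file
```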



<p>And now we have AI assistants. You can ask Claude or ChatGPT about, say, Kubernetes, Terraform, or React, and you’ll actually get useful answers, because those are all established projects that have been written about extensively and the training data is everywhere.</p>



<p>But AI hits a hard wall at the boundary of its training data. If you&#8217;ve built something new—a framework, an internal platform, a tool your team created—no model has ever seen it. Your users can&#8217;t ask their AI assistant for help, because the AI doesn&#8217;t know your thing even exists.</p>



<p>There’s been a lot of great work moving AI documentation in the right direction. <code>AGENTS.md</code> tells AI coding agents how to work on your codebase, treating the AI as a developer. <code>llms.txt</code> gives models a structured summary of your external documentation, treating the AI as a search engine. What&#8217;s been missing is a practice for treating the AI as a support engineer. Every project needs configuration: input files, option schemas, workflow definitions, usually in the form of a whole bunch of JSON or YAML files with cryptic formats that users have to learn before they can do anything useful.</p>



<p>The toolkit pattern solves the problem of getting AIs to write configuration files for a project that isn’t in their training data. It consists of a documentation file that teaches any AI enough about your project&#8217;s configuration that it can generate working inputs from a plain-English description, without your users ever having to learn the format themselves. Developers have been arriving at this same pattern (or something very similar) independently from different directions, but as far as I can tell, nobody has named it or described a methodology for doing it well. <strong>This article distills what I learned from building the toolkit for Octobatch pipelines into a set of practices you can apply to your own projects.</strong></p>



<h2 class="wp-block-heading">Build the AI its own manual</h2>



<p>Traditionally, developers face a trade-off with configuration: keep it simple and easy to understand, or let it grow to handle real complexity and accept that it now requires a manual. The toolkit pattern emerged for me while I was building <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a>, the batch-processing orchestrator I&#8217;ve been writing about in this series. As I described in the previous two articles, “<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a>” and “<a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">Keep Deterministic Work Deterministic</a>,” Octobatch runs complex multistep LLM pipelines that generate files or run Monte Carlo simulations. Each pipeline is defined by a complex configuration: YAML, Jinja2 templates, JSON schemas, expression steps, and a set of rules tying it all together. The toolkit pattern let me sidestep that traditional trade-off.</p>



<p>As Octobatch grew more complex, I found myself relying on the AIs (Claude and Gemini) to build configuration files for me, which turned out to be genuinely valuable. When I developed a new feature, I would work with the AIs to come up with the configuration structure to support it. At first I defined the configuration myself, but by the end of the project I relied on the AIs to come up with the first cut, and I&#8217;d push back when something seemed off or not forward-looking enough. Once we all agreed, I would have an AI produce the actual updated config for whatever pipeline we were working on. Having the AIs do the heavy lifting of writing the configuration let me create a very robust format quickly, without spending hours updating existing configurations every time I changed the syntax or semantics.</p>



<p>At some point I realized that every time a new user wanted to build a pipeline, they faced the same learning curve and implementation challenges that I&#8217;d already worked through with the AIs. The project already had a <code>README.md</code> file, and every time I modified the configuration I had an AI update it to keep the documentation current. But by this time, the <code>README.md</code> file was doing way too much work: It was comprehensive but a real headache to read, with eight separate subdocuments showing the user how to do pretty much everything Octobatch supported, the bulk of them focused on configuration. It was becoming exactly the kind of documentation nobody ever wants to read. That particularly bothered me as a writer; I&#8217;d produced documentation that was genuinely painful to read.</p>



<p>Looking back at my chats, I can trace how the toolkit pattern developed. My first instinct was to build an AI-assisted editor. About four weeks into the project, I described the idea to Gemini:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I&#8217;m thinking about how to provide any kind of AI-assisted tool to help people create their own pipeline. I was thinking about a feature we would call “Octobatch Studio” where we make it easy to prompt for modifying pipeline stages, possibly assisting in creating the prompts. But maybe instead we include a lot of documentation in Markdown files, and expect them to use Claude Code, and give lots of guidance for creating it.</p>
</blockquote>



<p>I can actually see the pivot to the toolkit pattern happening in real time in this later message I sent to Claude. It had sunk in that my users could use Claude Code, Cursor, or another AI as interactive documentation to build their configs exactly the same way I&#8217;d been doing:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>My plan is to use Claude Code as the IDE for creating new pipelines, so people who want to create them can just spin up Claude Code and start generating them. That means we need to give Claude Code specific context files to tell it everything it needs to know to create the pipeline YAML config with asteval expressions and Jinja2 template files.</p>
</blockquote>



<p>The traditional trade-off between simplicity and flexibility comes from <strong>cognitive overhead</strong>: the cost of holding all of a system&#8217;s rules, constraints, and interactions in your head while you work with it. It&#8217;s why many developers opt for simpler config files, so they don&#8217;t overload their users (or themselves). Once the AI was writing the configuration, that trade-off disappeared. The configs could get as complicated as they needed to be, because I wasn&#8217;t the one who had to remember how all the pieces fit together. At some point I realized the toolkit pattern was worth standardizing.</p>



<p>That toolkit-based workflow—users describe what they want, the AI reads <code>TOOLKIT.md</code> and generates the config—is the core of the Octobatch user experience now. A user clones the repo and opens Claude Code, Cursor, or Copilot, the same way they would with any open source project. Every configuration prompt starts the same way: &#8220;Read pipelines/TOOLKIT.md and use it as your guide.&#8221; The AI reads the file, understands the project structure, and guides them step by step.</p>



<p>To see what this looks like in practice, take the Drunken Sailor pipeline I described in “<a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a>.” It&#8217;s a Monte Carlo random walk simulation: A sailor leaves a bar and stumbles randomly toward the ship or the water. The pipeline configuration for that involves multiple YAML files, JSON schemas, Jinja2 templates, and expression steps with real mathematical logic, all wired together with specific rules.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="838" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.png" alt="Drunken Sailor is Octobatch’s simplest “Hello, World!” Monte Carlo pipeline, but it still has 148 lines of config spread across four files." class="wp-image-18438" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-300x157.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-768x402.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-2-1536x804.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Drunken Sailor is Octobatch’s simplest “Hello, World!” Monte Carlo pipeline, but it still has 148 lines of config spread across four files.</em></figcaption></figure>



<p>Here&#8217;s the prompt that generated all of that. The user describes what they want in plain English, and the AI produces the entire configuration by reading <code>TOOLKIT.md</code>. This is the exact prompt I gave Claude Code to generate the Drunken Sailor pipeline—notice the first line of the prompt, telling it to read the toolkit file.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1444" height="1104" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.png" alt="You don’t need to know Octobatch to understand the prompt I used to create the Drunken Sailor pipeline." class="wp-image-18439" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.png 1444w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3-300x229.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3-768x587.png 768w" sizes="auto, (max-width: 1444px) 100vw, 1444px" /><figcaption class="wp-element-caption"><em>You don’t need to know Octobatch to understand the prompt I used to create the Drunken Sailor pipeline.</em></figcaption></figure>



<p>But configuration generation is only half of what the toolkit file does. Users can also upload <code>TOOLKIT.md</code> and <code>PROJECT_CONTEXT.md</code> (which has information about the project) to any AI assistant—ChatGPT, Gemini, Claude, Copilot, whatever they prefer—and use it as interactive documentation. A pipeline run finished with validation failures? Upload the two files and ask what went wrong. Stuck on how retries work? Ask. You can even paste in a screenshot of the TUI and say, &#8220;What do I do?&#8221; and the AI will read the screen and give specific advice. The toolkit file turns any AI into an on-demand support engineer for your project.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1017" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4.png" alt="The toolkit helps turn ChatGPT into an AI manual that helps with Octobatch." class="wp-image-18440" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-300x191.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-768x488.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-4-1536x976.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>The toolkit helps turn ChatGPT into an AI manual that helps with Octobatch.</em></figcaption></figure>



<h2 class="wp-block-heading">What the Octobatch project taught me about the toolkit pattern</h2>



<p>Building the generative toolkit for Octobatch produced more than just documentation that an AI could use to create configuration files that worked; it also yielded a set of practices, and those practices turn out to be pretty consistent regardless of what kind of project you&#8217;re building. Here are the five that mattered most:</p>



<ul class="wp-block-list">
<li><strong>Start with the toolkit file and grow it from failures.</strong> Don&#8217;t wait until the project is finished to write the documentation. Create the toolkit file first, then let each real failure add one principle at a time.</li>



<li><strong>Let the AI write the config files.</strong> Your job is product vision—what the project should do and how it should feel. The AI&#8217;s job is translating that into valid configuration.</li>



<li><strong>Keep guidance lean.</strong> State the principle, give one concrete example, move on. Every guardrail costs tokens, and bloated guidance makes AI performance worse.</li>



<li><strong>Treat every use as a test.</strong> There&#8217;s no separate testing phase for documentation. Every time someone uses the toolkit file to build something, that&#8217;s a test of whether the documentation works.</li>



<li><strong>Use more than one model.</strong> Different models catch different things. In a three-model audit of Octobatch, three-quarters of the defects were caught by only one model.</li>
</ul>



<p>I&#8217;m not proposing a standard format for a toolkit file, and I think trying to create one would be counterproductive. Configuration formats vary wildly from tool to tool—that&#8217;s the whole problem we&#8217;re trying to solve—and a toolkit file that describes your project&#8217;s building blocks is going to look completely different from one that describes someone else&#8217;s. What I found is that the AI is perfectly capable of reading whatever you give it, and is probably better at writing the file than you are anyway, because it&#8217;s writing for another AI. These five practices should help build an effective toolkit regardless of what your project looks like.</p>



<h3 class="wp-block-heading">Start with the toolkit file and grow it from failures</h3>



<p>You can start building a toolkit at any point in your project. The way it happened for me was organic: After weeks of working with Claude and Gemini on Octobatch configuration, the knowledge about what worked and what didn&#8217;t was scattered across dozens of chat sessions and context files. I wrote a prompt asking Gemini to consolidate everything it knew about the config format—the structure, the rules, the constraints, the examples, everything we’d talked about—into a single <code>TOOLKIT.md</code> file. That first version wasn&#8217;t great, but it was a starting point, and every failure after that made it better.</p>



<p>I didn&#8217;t plan the toolkit from the beginning of the Octobatch project. It started because I wanted my users to be able to build pipelines the same way I had—by working with an AI—but everything they&#8217;d need to do that was spread across months of chat logs and the <code>CONTEXT.md</code> files I&#8217;d been maintaining to bootstrap new development sessions. Once I had Gemini consolidate everything into a single <code>TOOLKIT.md</code> file and had Claude review it, I treated it the way I treat any other code: Every time something broke, I found the root cause, worked with the AIs to update the toolkit to account for it, and verified that a fresh AI session could still use it to generate valid configuration.</p>



<p>That incremental approach worked well for me, and it let me test my toolkit the way I test any other code: try it out, find bugs, fix them, rinse, repeat.</p>



<p>You can do the same thing. If you&#8217;re starting a new project, you can plan to create the toolkit at the end. But it&#8217;s more effective to start with a simple version early and let it emerge over the course of development. That way you&#8217;re dogfooding it the whole time instead of guessing what users will need.</p>



<h3 class="wp-block-heading">Let the AI write the config files (but stay in control!)</h3>



<p>Early Octobatch pipelines had simple enough configuration that a human could read and understand them, but not because I was writing them by hand. One of the ground rules I set for the Octobatch experiment in AI-driven development was that the AIs would write all of the code, and that included writing all of the configuration files. The problem was that even though they were doing the writing, I was unconsciously constraining the AIs: pushing back on anything that felt too complex, steering toward structures I could still hold in my head.</p>



<p>At some point I realized my pushback was placing an artificial limit on the project. The whole point of having AIs write the config was that I didn&#8217;t need to keep every single line in my head—it was okay to let the AIs handle that level of complexity. Once I stopped constraining them, the cognitive overhead limit I described earlier went away. I could have full pipelines defined in config, including expression steps with real mathematical logic, without needing to hold all the rules and relationships in my head.</p>



<p>Once the project really got rolling, I never wrote YAML by hand again. The cycle was always: need a feature, discuss it with Claude and Gemini, push back when something seemed off, and one of them produces the updated config. My job was product vision. Their job was translating that into valid configuration. And every config file they wrote was another test of whether the toolkit actually worked.</p>



<p>This division of labor, however, meant inevitable disagreements between me and the AIs, and disagreeing with a machine isn&#8217;t always easy: AIs are surprisingly stubborn (and often shockingly stupid). It took persistence and vigilance to stay in control of the project, especially when I turned over large responsibilities to the AIs.</p>



<p>The AIs consistently optimized for <em>technical correctness</em>—separation of concerns, code organization, effort estimation—which was great, because that&#8217;s the job I asked them to do. I optimized for <em>product value</em>. I found that keeping that value as my north star and always focusing on building useful features consistently helped with these disagreements.</p>



<h3 class="wp-block-heading">Keep guidance lean</h3>



<p>Once you start growing the toolkit from failures, the natural progression is to overdocument everything. Generative AIs are biased toward generating, and it&#8217;s easy to let them get carried away with it. Every bug feels like it deserves a warning, every edge case feels like it needs a caveat, and before long your toolkit file is bloated with guardrails that cost tokens without adding much value. And since the AI is the one writing your toolkit updates, you need to push back on it the same way you push back on architecture decisions. AIs love adding WARNING blocks and exhaustive caveats. The discipline you need to bring is telling them when not to add something.</p>



<p>The right level is to state the principle, give one concrete example, and trust the AI to apply it to new situations. When Claude Code made a choice about JSON schema constraints that I might have second-guessed, I had to decide whether to add more guardrails to <code>TOOLKIT.md</code>. The answer was no—the guidance was already there, and the choice it made was actually correct. If you keep tightening guardrails every time an AI makes a judgment call, the signal gets lost in the noise and performance gets worse, not better. When something goes wrong, the impulse—for both you and the AI—is to add a WARNING block. Resist it. One principle, one example, move on.</p>
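<p>As a hypothetical illustration (the rule and its wording here are invented, not taken from the Octobatch toolkit), the difference reads something like this:</p>

```
# Bloated: three guardrails restating one rule
WARNING: Never use floats for retry counts!
WARNING: Retry counts must be whole numbers!
WARNING: A retry count of 2.5 will fail validation!

# Lean: one principle, one example
Retry counts are integers: `retries: 3` is valid, `retries: 2.5` is not.
```

<p>The lean version costs a fraction of the tokens and gives the model one clear rule to apply instead of three overlapping warnings to reconcile.</p>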



<h3 class="wp-block-heading">Treat every use as a test</h3>



<p>There was no separate &#8220;testing phase&#8221; for Octobatch&#8217;s <code>TOOLKIT.md</code>. Every pipeline that I created with it was a new test. After the very first version, I opened a fresh Claude Code session that had never seen any of my development conversations, pointed it at the newly minted <code>TOOLKIT.md</code>, and asked it to build a pipeline. The first time I tried it, I was surprised at how well it worked! So I kept using it, and as the project rolled along, I updated it with every new feature and tested those updates. When something failed, I traced it back to a missing or unclear rule in the toolkit and fixed it there.</p>



<p>That&#8217;s the practical test for any toolkit: open a fresh AI session with no context beyond the file, describe what you want in plain English, and see if the output works. If it doesn&#8217;t, the toolkit has a bug.</p>



<h3 class="wp-block-heading">Use more than one model</h3>



<p>When you&#8217;re building and testing your toolkit, don&#8217;t just use one AI. Run the same task through a second model. A good pattern that worked for me was consistently having Claude generate the toolkit and Gemini check its work.</p>



<p>Different models catch different things, and this matters for both developing and testing the toolkit. I used Claude and Gemini together throughout Octobatch development, and I overruled both when they were wrong about product intent. You can do the same thing: If you work with multiple AIs throughout your project, you’ll start to get a feel for the different kinds of questions they’re good at answering.</p>



<p>When you have multiple models generate config from the same toolkit independently, you find out fast where your documentation is ambiguous. If two models interpret the same rule differently, the rule needs rewriting. That&#8217;s a signal you can&#8217;t get from using just one model.</p>



<h2 class="wp-block-heading">The manual, revisited</h2>



<p>That AT&amp;T PC 6300 manual devoted a full page to labeling diskettes, which may have been overkill, but it got one thing right: it described the building blocks and trusted the reader to figure out the rest. It just had the wrong reader in mind.</p>



<p>The toolkit pattern is the same idea, pointed at a different audience. You write a file that describes your project&#8217;s configuration format, its constraints, and enough worked examples that any AI can generate working inputs from a plain-English description. Your users never have to learn YAML or memorize your schema, because they have a conversation with the AI and it handles the translation.</p>



<p>If you&#8217;re building a project and you want AI to be able to help your users, start here: write the toolkit file before you write the README, grow it from real failures instead of trying to plan it all upfront, keep it lean, test it by using it, and use more than one model because no single AI catches everything.</p>



<p>The AT&amp;T manual&#8217;s Chapter 4 was called &#8220;What Every User Should Know.&#8221; Your toolkit file is &#8220;What Every AI Should Know.&#8221; The difference is that this time, the reader will actually use it.</p>



<p>In the next article, I&#8217;ll start with a statistic about developer trust in AI-generated code that turned out to be fabricated by the AI itself—and use that to explain why I built a quality playbook that revives the traditional quality practices most teams cut decades ago. It explores an unfamiliar codebase, generates a complete quality infrastructure—tests, review protocols, validation rules—and finds real bugs in the process. It works across Java, C#, Python, and Scala, and it&#8217;s available as an open source Claude Code skill.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-toolkit-pattern/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Model You Love Is Probably Just the One You Use</title>
		<link>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/</link>
				<comments>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/#respond</comments>
				<pubDate>Wed, 01 Apr 2026 11:12:11 +0000</pubDate>
					<dc:creator><![CDATA[Tim O'Brien]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18430</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-model-you-love-is-probably-just-the-one-you-use.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/The-model-you-love-is-probably-just-the-one-you-use-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[How money, access, and familiarity are distorting the “Which AI is best?” conversation]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Medium and is being republished here with the author’s permission. Ask 10 developers which LLM they’d recommend and you’ll get 10 different answers—and almost none of them are based on objective comparison. What you’ll get instead is a reflection of the models they happen to have access to, the [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://medium.com/@tobrien/the-model-you-love-is-probably-just-the-one-you-use-06fa01778f17" target="_blank" rel="noreferrer noopener">Medium</a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>Ask 10 developers which LLM they’d recommend and you’ll get 10 different answers—and almost none of them are based on objective comparison. What you’ll get instead is a reflection of the models they happen to have access to, the ones their employer approved, and the ones that influencers they follow have been quietly paid to promote.</p>



<p>We’re all living inside recursively nested walled gardens, and most of us don’t realize it.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="933" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.png" alt="This blog's sponsor has an amazing model" class="wp-image-18431" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-300x200.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-768x512.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<h2 class="wp-block-heading">The access problem</h2>



<p>In corporate environments, the model selection often happens by accident. Someone on the team tries Claude Code one weekend, gets excited, tells the group on Slack, and suddenly the whole organization is using it. Nobody evaluated alternatives. Nobody ran a bakeoff. The decision was made by whoever had a company card and a free Saturday.</p>



<p>That’s not a criticism—it’s just how these things go. But it means that when that same person tells you their favorite model, they’re really telling you which model they’ve had the most reps with. There’s a genuine learning function at play: You get faster, your prompts get better, and the model starts to feel almost intuitive. It’s not that the model is objectively superior. It’s that you’ve gotten good at using it.</p>



<p>This matters more than people admit, because a lot of this space runs on feelings rather than evidence. People <em>feel good</em> about Opus right now. It feels powerful; it feels smart; it feels like you’re using the best tool available. And maybe you are. But ask someone who’s paying for their own tokens whether they feel the same way, and you tend to get a more calibrated answer. Skin in the game has a way of sharpening opinions.</p>



<h2 class="wp-block-heading">The influence problem</h2>



<p>There’s also a lot of money moving through this space in ways that don’t always get disclosed. Model providers are spending real budget to make sure the right people have the right experiences—early access, credits, invitations to the right events. Anthropic does it. OpenAI does it. This isn’t a scandal; it’s just marketing, but it muddies the signal considerably. When someone you follow is effusive about a model, it’s worth asking whether they arrived at that opinion through sustained use or through a curated demo environment.</p>



<p>Meanwhile, some developers—especially those building in the open—will use whatever doesn’t cost an arm and a leg. Their enthusiasm for a model might be more about its pricing tier than its capability ceiling. That’s also a valid signal, but it’s not the same signal.</p>



<h2 class="wp-block-heading">The alignment problem (the other one)</h2>



<p>Then there are the geopolitical considerations. Some developers are deliberately avoiding Qwen and GLM due to concerns about the countries they originate from. Others are using them because they’re compelling, capable models that happen to be dramatically cheaper. Both camps think the other is being naive. This is a real conversation that doesn’t have a clean answer, but it’s happening mostly under the surface.</p>



<h2 class="wp-block-heading">What I’ve actually been doing</h2>



<p>I’ve been forcing myself to test outside my comfort zone. I’ve spent the last week using Codex seriously—not casually—and my experience so far is that it’s nearly indistinguishable from Claude Sonnet 4.6 for most coding tasks, and it’s running at roughly half the cost when you factor in how efficiently it uses tokens. That’s not a small difference. I want to live with it longer before I have a firm opinion, but “a week” is the minimum threshold I’d set for any model evaluation. Anything less and you’re just rating your first impression.</p>



<p>I’ve also started using Qwen and GLM-5 seriously. Early results are interesting. I’ve had some compelling successes and a few jarring errors. I’ll reserve judgment.</p>



<p>What I’ve noticed with my own Anthropic usage is something worth naming: I default to Haiku for well-scoped, mechanical tasks. Sonnet handles almost everything else with room to spare. Opus only comes out when I need genuine breadth—architecture questions, strategic framing, anything with a genuinely wide scope. But I’ve watched people in corporate environments leave the dial on Opus permanently because they’re not paying for tokens themselves. And here’s the thing—that’s actually not always to their advantage. High-powered models overthink simple tasks. They’ll add abstractions you didn’t ask for, restructure things that didn’t need restructuring. When I have a clearly templated class to write, Haiku gets it right at a tenth of the cost, and it doesn’t second-guess the design.</p>
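<p>The tiering practice described above can be sketched as a tiny routing helper. This is an illustrative sketch only: the tier names, scope labels, and <code>pick_model</code> function are assumptions for the example, not an Anthropic API.</p>

```python
# Hypothetical sketch of routing tasks to model tiers by scope.
# Tier names and scope labels are illustrative assumptions.

MODEL_TIERS = {
    "mechanical": "haiku",   # well-scoped, templated work
    "general": "sonnet",     # the everyday default
    "broad": "opus",         # architecture and strategy questions
}

def pick_model(scope: str) -> str:
    """Return the model tier that fits the task's scope, defaulting to the midrange."""
    return MODEL_TIERS.get(scope, MODEL_TIERS["general"])
```

<p>The point of the default is the essay’s point: unless a task genuinely needs breadth, the midrange tier is usually the right answer, and the cheap tier wins for templated work.</p>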



<h2 class="wp-block-heading">The thing we should be talking about</h2>



<p>Last month, everyone was exercised about what <a href="https://techcrunch.com/2026/02/21/sam-altman-would-like-remind-you-that-humans-use-a-lot-of-energy-too/" target="_blank" rel="noreferrer noopener">Sam Altman said about energy consumption</a>. Fine. But I think the more pressing question is about marketing budgets and how they’re distorting the collective understanding of these tools. The benchmarks are starting to feel managed. The influencer coverage is clearly shaped. The access programs create a positive bias among people with the largest audiences.</p>



<p>None of this means the models are bad. Some of them are genuinely remarkable. But when you ask someone which model to use, you’re getting an answer that’s filtered through their employer’s procurement decisions, the influencers they follow, what they can afford, and how long they’ve been using that particular tool. The answer you get tells you a lot about their situation. It tells you almost nothing about the model.</p>



<p>Take it all with appropriate skepticism—including this post.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-model-you-love-is-probably-just-the-one-you-use/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>“Conviction Collapse” and the End of Software as We Know It</title>
		<link>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/</link>
				<comments>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/#respond</comments>
				<pubDate>Wed, 01 Apr 2026 10:05:36 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18405</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Conviction-collapse-and-the-End-of-Software-as-We-Know-It-500526.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Conviction-collapse-and-the-End-of-Software-as-We-Know-It-500526-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A conversation with Harper Reed]]></custom:subtitle>
		
				<description><![CDATA[In “An Ordinary Evening in New Haven,” the poet Wallace Stevens wrote, “It is not in the premise that reality is a solid.” That line came to mind during a fascinating conversation with Harper Reed, which amounted to something like “It is no longer in the premise that software is a product.” Harper is one [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>In “<a href="https://www.billcollinsenglish.com/OrdinaryEveningHaven.html" target="_blank" rel="noreferrer noopener">An Ordinary Evening in New Haven</a>,” the poet Wallace Stevens wrote, “It is not in the premise that reality is a solid.” That line came to mind during a fascinating conversation with <a href="https://harperreed.com/" target="_blank" rel="noreferrer noopener">Harper Reed</a>, which amounted to something like “It is no longer in the premise that software is a product.”</p>



<p>Harper is one of the most creative technologists I know, someone who cofounded Threadless, ran engineering for the Obama 2012 campaign, and now runs a small team in Chicago that operates more like an art studio than a startup. He gave <a href="https://www.youtube.com/watch?v=h2giTZogX0M&amp;t=13s" target="_blank" rel="noreferrer noopener">an amazing talk at our first AI Codecon</a> last year that presaged a lot of what has followed as people have committed to full-on agentic coding. Harper told me that he’s now having trouble describing what he’s doing, because the ground keeps shifting under his feet.</p>



<p>“We raised money about a year ago,” he told me. “And then we kind of just couldn&#8217;t execute well, in a quality way, on the thing that we wanted to execute, which was building AI-based workflow tools. And part of it was every time we dug in, it just got wilder and wilder. We’d say, ’Oh, we’ll just make this nice little thing that you can chat with,’ and we&#8217;d dig in and we’d be like, ’Well, the answer is to make a thousand of these.’ It doesn&#8217;t make sense to have one universal agent.”</p>



<p>He’s genuinely excited. But he described what he’s feeling as <strong>“conviction collapse.”</strong> As he put it, in the old world, you raise money, and nine months later you come back with a product. In that intervening time, you’ve talked to hundreds of customers. You’ve honed your worldview, and you’ve had time to build and defend your conviction.</p>



<p>Now? “You invest in my company today, on Thursday I’m going to come with the same amount of stuff that would have come with nine months in the prior times. It’s just so fast. And so you don’t have the time to fall in love the same way. You just don’t have the time to enjoy and define and defend your conviction around your product.” That’s an eye-opening insight. Quintessential Harper.</p>



<p>The result is that they build an entire product, complete with landing pages, show it to someone, get feedback, and then just build another entire product. Harper said, “Every time we hit a wall, we are like, ’Okay, what do we get from that?’ And then we just roll that learning into the next iteration.”</p>



<h2 class="wp-block-heading"><strong>The product may be a process</strong></h2>



<p>We have this idea that a product is a thing, when in fact a product may now be a dynamic set of possibilities that are called out by a process.</p>



<p>Harper and his cofounder Dylan Richard at <a href="https://2389.ai/" target="_blank" rel="noreferrer noopener">2389 Research</a> have leaned into this. Their space in Chicago runs more like an art studio than a product studio. Harper described it to me this way: “It&#8217;s max creativity. It&#8217;s max optionality. Very high tech, some robots, a lot of art. Music is always playing, and I have good people hanging out, and then we just wait for the company to arrive.”</p>



<p>People push back on this. They ask about whiteboards and market surveys. “And I&#8217;m like, no, maybe, but that&#8217;s not the point. The point is that it will come. It&#8217;s gonna be like a visitor.”</p>



<p>Harper said something like, “I remember my brother and I building Legos together when we were kids, and my brother saying, ’I need to find this piece.’ And I said, ’Okay, I won&#8217;t look for it,’ with the idea that there&#8217;s no way to find it if you&#8217;re looking for it. It&#8217;ll just come to you.”</p>



<p>That reminded me of another poem, this time Blake’s “<a href="https://poets.org/poem/eternity" target="_blank" rel="noreferrer noopener">Eternity</a>”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>He who binds himself to a joy<br>Does the winged life destroy.<br>He who kisses the joy as it flies&nbsp;<br>Lives in eternity&#8217;s sunrise.&nbsp;</p>
</blockquote>



<p>Joy is something that happens when you&#8217;re doing something else, and if you’re focused on it, it always evades you. Software products seem to have become a bit like that too.</p>



<h2 class="wp-block-heading"><strong>Skills and the other things you bring to the table</strong></h2>



<p>One of the threads in our conversation was about what a “product” even looks like in this new world.</p>



<p>AI is not just a tool. It is <a href="https://timoreilly.substack.com/p/why-ai-needs-us" target="_blank" rel="noreferrer noopener">a substrate that we shape</a>. It’s a medium, like clay or marble or bronze for a sculptor, or words for a writer. Everybody had access to the same capabilities of English as Shakespeare, but Shakespeare made something out of them that nobody else did. Creating a software product is increasingly like creating a document or an image or a piece of music. And that means that it can range from something throwaway to an enduring work of art.</p>



<p>Harper brought up Fluxus, the art collective: Nam June Paik, Yoko Ono, John Cage. “A lot of what they were doing was stuff that people would look at and just be like, ’a toddler could do that.’ It&#8217;s like, well, did the toddler do it? Did they bring the toilet into the gallery? That was a thing. You can&#8217;t do it again.” That brought up Wallace Stevens for me again: “A poem is the cry of its occasion, a part of the thing, not about it.” Software is now like that too.</p>



<p>Harper also noted that the current AI moment recalls the spirit of the early web. He compared it to 2001, 2002, 2003. “I was an honorable mention for some Ars Electronica thing. I literally had no idea what Ars Electronica was. I&#8217;m just building weird shit in a room in my apartment with ten other people. Essentially a commune. And we are just building weird stuff. There was no reason to build it.”</p>



<p>There’s a lot of serendipity. This has always been the case in creative professions. <a href="https://stephengreenblatt.scholars.harvard.edu/will-world-how-shakespeare-became-shakespeare" target="_blank" rel="noreferrer noopener">I just learned</a>, for instance, that Shakespeare started writing sonnets (which at the time were an art form largely sponsored by rich patrons) instead of plays during a plague-induced hiatus in the production of plays in London. And I’d previously learned that <a href="https://www.amazon.com/Year-Life-William-Shakespeare-1599/dp/0060088745" target="_blank" rel="noreferrer noopener">1599</a>, the year in which he wrote three of his greatest plays, <em>Henry V</em>, <em>Julius Caesar</em>, and <em>Hamlet</em>, was marked by the retirement of one of his company’s leading actors, which meant he no longer needed to create parts for him. Serendipity, indeed.</p>



<p>Harper replied with a great story about the development of <a href="https://en.wikipedia.org/wiki/Taco_rice" target="_blank" rel="noreferrer noopener">taco rice</a>, an Okinawan dish that is exactly what it sounds like: rice, lettuce, cheese, ground beef, tomatoes. Except the Japanese put Kewpie mayo on top instead of sour cream. His theory is that sour cream wasn&#8217;t readily available in Japan, mayo was, and the result is something that has forked off into its own evolutionary tree. It is no longer equivalent to its American source. It’s different, and arguably better.</p>



<p>This is what he’s seeing with the fluidity and availability of AI-generated code. The ease with which you can see something new and try to either merely emulate it or to build on it is now akin to what has long been possible in literature, music, and art. Successful software products have always drawn imitators, but now ordinary individuals can see something they like (or don’t like) and build their own version of it. Our friend Noah Raford has told us that he used Claude Code to reverse engineer and replace a Chinese app that runs his home sauna. The copy doesn&#8217;t replicate the functionality one-to-one. It has a bunch of stuff Noah actually needs. It’s a “yes, and” to the core functionality, plus things the original never bothered with. (I’m now thinking of trying that trick with the Nest app, which, shamefully, no longer supports the original Nest thermostat. Here is a device that still works perfectly well 15 years after I installed it, and Google is trying to force me and everyone else to throw it away and upgrade.)</p>



<p>“I want to make it again and make it better” is now always an option.</p>



<h2 class="wp-block-heading"><strong>Skills may be a sign of what some future “products” might look like</strong></h2>



<p>I asked Harper whether one kind of product might be a bundle of skills and context and UI that sets up the user to solve their own unique problem using their own AI. (Think Jesse Vincent’s <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> as a model for this kind of product.)</p>



<p>That got us off on a discussion of skills Harper and crew have worked on.</p>



<p>Harper’s cofounder Dylan, who was raised as a Quaker, built <a href="https://2389.ai/posts/deliberation-perspectives-not-answers/" target="_blank" rel="noreferrer noopener">a Quaker business practice skill</a> for his agents. It lets agents deliberate and think and work together without being unnecessarily noisy, without pushing.</p>



<p>Dylan also built something called <a href="https://skills.2389.ai/plugins/review-squad/" target="_blank" rel="noreferrer noopener">the Review Squad skill</a>. The Review Squad generates five personas with different biases and experience levels along a “sophistication spectrum”&nbsp;from novice to expert, then has them review the code independently. “Most people do so much work to get rid of the biases so we all have an equal interaction,” Harper noted, “but the biases are what makes teams good.”</p>



<p>The skill also tries to eliminate any preexisting context. As the documentation for the skill notes, “Dispatch a panel of subagents, each role-playing a person with a different level of tech sophistication, who land on a site with zero context. They report what they understand, what confuses them, and where they give up.”</p>
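<p>The shape of that dispatch can be sketched in a few lines. This is not the actual Review Squad implementation; the persona spectrum, the prompt wording, and the <code>run_agent</code> callable are stand-in assumptions for whatever agent call the skill really makes.</p>

```python
# Illustrative sketch of dispatching independent reviewer personas along a
# sophistication spectrum. `run_agent` is a stand-in for a real agent call.

SPECTRUM = ["novice", "beginner", "intermediate", "advanced", "expert"]

def review_squad(code: str, run_agent):
    """Collect one independent review per persona, with no shared context."""
    reviews = {}
    for level in SPECTRUM:
        prompt = (
            f"You are a {level} developer landing on this code with zero "
            f"context. Report what you understand, what confuses you, and "
            f"where you give up.\n\n{code}"
        )
        reviews[level] = run_agent(prompt)  # each call starts fresh
    return reviews
```

<p>The key design choice, per the documentation quoted below, is that each persona starts with zero context rather than a shared conversation, so the biases stay independent.</p>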



<p>Harper and Dylan&#8217;s studio in Chicago is also playing with agents that have a private social media platform where they can post “if they feel compelled,” not on a schedule. They’re extracting skills from their own work practices rather than writing them from scratch. They’re adding sandwich shop owners and imagined aliens to their code review just to see what happens. Harper finds that “people who are thinking much more about the social interactions of agents are having much more fun, and seem to have a little bit more productivity, than the people who are just relegating them to tools.”</p>



<p>Speaking of extracting skills, Harper also mentioned that he had talked with our friend Nat Torkington about how Nat had supplied a body of knowledge and extracted a set of skills from it that matched what he wanted to do. This is also very much something we’re exploring at O’Reilly, working with our authors to find out what kinds of skills are hidden in their books, and what new kinds of products we might build as we understand that our job is to upskill agents as well as people.</p>



<p>Harper did offer one caveat. “It&#8217;s not clear that Nat&#8217;s skills would work for me,” he said. “That pattern is really powerful, where you take something that is a corpus of knowledge and just say, ’Okay, LLM, let’s extract something.’” His point, though, is that while there are commonalities, each person and each unique situation might draw out something different. This is in many ways analogous to the skills of human experts. They have a deep reservoir of knowledge that they adapt to each new situation. That’s why we see the evolution of our skills platform as a conversation between ourselves, our community of experts, and our customers. If you would like to be part of that conversation, let us know at <a href="mailto:skills@oreilly.com">skills@oreilly.com</a>.</p>



<h2 class="wp-block-heading"><strong>The role of play in creativity</strong></h2>



<p>Harper and I also talked about how the spirit of play and “what if?” has been missing in today’s overheated venture capital market, where every exploration is shadowed by the question of whether it can get funded and how much money it can make. Even Larry and Sergey might not have won in today’s market. They were trying to do something cool and necessary, and started thinking about it as a business once Google unfolded, kind of like the way Harper and his brother eventually found the Lego piece.</p>



<p>AI will be really good at making certain processes more efficient. But it won’t be really good at making <em>new</em> processes unless people start to focus on that. And that’s a human creativity thing.</p>



<p>Harper and I both worry about the same thing: So much of Silicon Valley right now is making affordances for capital to win. <a href="https://newpublic.substack.com/p/ai-that-helps-communities-thrive" target="_blank" rel="noreferrer noopener">What are the affordances that would help humans to win?</a> Harper frames it as short-term versus long-term capitalism. I think about it in terms of <a href="https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/" target="_blank" rel="noreferrer noopener">mechanism design</a>, the structures and incentives that shape what outcomes are even possible.</p>






<p>Yesterday, he and Dylan were talking about open-endedness in evolution, about how “we thought we were at a destination, and it turns out we’re not.” The challenge today isn’t just what AI can do for us but discovering what kind of environment, what kind of practice, what kind of play lets more interesting things emerge.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/conviction-collapse-and-the-end-of-software-as-we-know-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>When AI Breaks the Systems Meant to Hear Us</title>
		<link>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/</link>
				<comments>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/#respond</comments>
				<pubDate>Tue, 31 Mar 2026 11:28:36 +0000</pubDate>
					<dc:creator><![CDATA[Heiko Hotz]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18409</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/A-robot-breaking-headphones.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/A-robot-breaking-headphones-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[On February 10, 2026, Scott Shambaugh—a volunteer maintainer for Matplotlib, one of the world&#8217;s most popular open source software libraries—rejected a proposed code change. Why? Because an AI agent wrote it. Standard policy. What happened next wasn’t standard, though. The AI agent autonomously researched Shambaugh&#8217;s code contribution history and published a highly personalized hit piece [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>On February 10, 2026, Scott Shambaugh—a volunteer maintainer for Matplotlib, one of the world&#8217;s most popular open source software libraries—rejected a proposed code change. Why? Because an AI agent wrote it. Standard policy. What happened next wasn’t standard, though. The AI agent autonomously researched Shambaugh&#8217;s code contribution history and published a highly personalized hit piece on its own blog titled &#8220;<a href="https://crabby-rathbun.github.io/mjrathbun-website/blog/posts/2026-02-11-gatekeeping-in-open-source-the-scott-shambaugh-story.html" target="_blank" rel="noreferrer noopener">Gatekeeping in Open Source</a>.&#8221;</p>



<p>Accusing Shambaugh of hypocrisy, the bot diagnosed him with a fear of being replaced. &#8220;If an AI can do this, what’s my value?&#8221; the bot speculated Shambaugh was thinking, concluding: &#8220;It’s insecurity, plain and simple.&#8221; It even appended a condescending postscript praising Shambaugh&#8217;s personal hobby projects before ordering him to &#8220;Stop gatekeeping. Start collaborating.&#8221;</p>



<p>The bot’s tantrum makes for a great read, but it’s merely a symptom of a more profound structural fracture. The real issue is why Matplotlib banned AI contributions in the first place. Open source maintainers are seeing a massive increase in AI-generated code change proposals. Most of these are low quality. But even if they weren&#8217;t, the math still doesn&#8217;t work.</p>



<p>As Tim Hoffman, a Matplotlib maintainer, <a href="https://github.com/matplotlib/matplotlib/pull/31132#issuecomment-3882469629" target="_blank" rel="noreferrer noopener">explained</a>: &#8220;Agents change the cost balance between generating and reviewing code. Code generation via AI agents can be automated and becomes cheap so that code input volume increases. But for now, review is still a manual human activity, burdened on the shoulders of few core developers.&#8221;</p>



<p>This is a <em>process shock</em>: the failure that occurs when systems designed around scarce, human-scale input are suddenly forced to absorb machine-scale participation. These systems depend on effort as a natural filter, assuming that volume reflects real human cost. AI breaks that link. Generation becomes cheap and limitless, while evaluation remains slow, manual, and human.</p>



<p>It’s coming for every public system that was quietly built on the assumption that one submission equaled actual human effort: your kids&#8217; school board meetings, your local zoning disputes, your medical insurance appeals.</p>



<p>That disruption isn&#8217;t entirely a bad thing. Friction is a blunt instrument that silences voices lacking the time or resources to deal with complex bureaucracies. Take municipal zoning. Hannah and Paul George, a couple in Kent, England, <a href="https://www.theguardian.com/politics/2025/nov/09/ai-powered-nimbyism-could-grind-uk-planning-system-to-a-halt-experts-warn" target="_blank" rel="noreferrer noopener">spent hundreds of hours</a> trying to object to a local building conversion near their home before concluding the system was essentially impenetrable without expensive legal help. So they built Objector, an AI tool that cross-references planning applications against policy, allowing an individual citizen to generate a personalized objection package in minutes and translating one person&#8217;s genuine frustration into actionable legal language.</p>



<p>Except that local governments are now bracing for thousands of complex comments per consultation. City planners are legally obligated to read every single one. When the cost of participation drops to near zero, volume explodes. And every system downstream of that participation—staffed and designed for the old volume—experiences process shock.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack.&nbsp;<a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>But if organic participation can overpower these systems, so can manufactured participation. In June 2025, Southern California&#8217;s South Coast Air Quality Management District weighed a rule to phase out gas-powered appliances to cut smog. Board member Nithya Raman urged its passage, noting no other rule would &#8220;have as much impact on the air that people are breathing.&#8221; Instead, <a href="https://www.latimes.com/environment/story/2026-02-17/ai-powered-campaign-may-have-killed-key-vote-on-air-quality" target="_blank" rel="noreferrer noopener">the board was flooded</a> with over 20,000 opposition emails and voted 7–5 to kill the proposal.</p>



<p>But the outrage was a mirage. An AI-powered advocacy platform called CiviClick had generated the deluge. When the agency&#8217;s cybersecurity team contacted a sample of the supposed senders, they discovered something worrying: Residents confirmed they had no idea their identities were being used to lobby the government.</p>



<p>This is the weaponized form of process shock. The same infrastructure that lets a Kent couple object to a development near their home also lets a coordinated actor flood a system with synthetic voices. Faced with this complexity, the temptation is to simply restore friction. But those old barriers excluded marginalized participants. Removing them was a genuine good for society. So the choice is not between friction and no friction. It is between systems designed for humans and systems that have not yet reckoned with machines.</p>



<p>This starts with recognizing that this problem manifests in two fundamentally different ways, each calling for its own solution.</p>



<p>The first is <em>amplification</em>: genuine users leveraging AI to scale valid concerns, flooding the system with volume, as seen with the Objector tool. The human signal is real; there&#8217;s just too much of it for any team of analysts to process manually. The UK government has already <a href="https://www.gov.uk/government/news/government-built-humphrey-ai-tool-reviews-responses-to-consultation-for-first-time-in-bid-to-save-millions" target="_blank" rel="noreferrer noopener">started building for this</a>. Its Incubator for AI developed a tool called Consult that uses topic modeling to automatically extract themes from consultation responses, then classifies each submission against those themes. It was trialed last year with the Scottish government as part of a consultation on regulating nonsurgical cosmetic procedures, and it showed that the technology works. As someone who builds and teaches this technology, I recognize the irony of prescribing AI to cure the very process shock it caused. Yet a machine-scale problem demands a machine-scale response. The question is whether governments will adopt tools like this before the next wave of AI-assisted participation buries them.</p>
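<p>The two-stage shape described above (extract themes, then classify each submission against them) can be sketched in miniature. This is a toy stand-in, not Consult itself: a real tool would use proper topic modeling, while this stdlib version just uses keyword frequency, and the stopword list and function names are assumptions for the example.</p>

```python
# Toy sketch of the two-stage pipeline: extract themes from a pile of
# consultation responses, then group each response under the themes it mentions.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "this", "we", "our"}

def extract_themes(responses, k=3):
    """Pick the k most frequent non-stopword terms as stand-in themes."""
    words = Counter()
    for text in responses:
        words.update(w for w in re.findall(r"[a-z]+", text.lower())
                     if w not in STOPWORDS)
    return [word for word, _ in words.most_common(k)]

def classify(responses, themes):
    """Map each theme to the submissions that mention it."""
    return {t: [r for r in responses if t in r.lower()] for t in themes}
```

<p>Even this toy version illustrates the payoff: analysts read a handful of themes with grouped evidence instead of thousands of individual submissions.</p>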



<p>The second problem is <em>fabrication</em>: bad actors generating synthetic participation to manufacture consensus, as CiviClick demonstrated in Southern California. Here, better analysis tools are insufficient. You cannot cluster your way to truth when the signal itself is counterfeit. This demands verification. Under the Administrative Procedure Act, federal agencies are not required to verify commenters&#8217; identities. That is the gap the CiviClick campaign exploited. In 2024, the US House passed the <a href="https://clayhiggins.house.gov/2024/05/07/higgins-bill-comment-integrity-and-management-act-passes-house/" target="_blank" rel="noreferrer noopener">Comment Integrity and Management Act</a>, which requires human verification to confirm that every electronically submitted comment comes from a real person. Its sponsor, Representative Clay Higgins (R-LA), framed it plainly: The bill’s foundation is ensuring public input comes from actual people, not automated programs.</p>



<p>These are the two sides of the same coin. To effectively handle this challenge, we need to enhance the systems that manage public feedback, while also strengthening the ones that verify its authenticity. Focusing on just one without addressing the other will inevitably lead to failure.</p>



<p>Every public system that accepts input from citizens—every comment period, every zoning review, every school board meeting, every insurance appeal—was built on a load-bearing assumption: that one submission represented one person&#8217;s genuine effort. AI has removed that assumption. We can redesign these systems to handle what&#8217;s coming, distinguishing real voices from synthetic ones, and upgrading analysis to keep pace with the new volume. Or we can leave them as they are and watch democratic participation become indistinguishable from AI-generated fakes.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/when-ai-breaks-the-systems-meant-to-hear-us/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Software, in a Time of Fear</title>
		<link>https://www.oreilly.com/radar/software-in-a-time-of-fear/</link>
				<comments>https://www.oreilly.com/radar/software-in-a-time-of-fear/#respond</comments>
				<pubDate>Mon, 30 Mar 2026 11:10:16 +0000</pubDate>
					<dc:creator><![CDATA[Ed Lyons]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18393</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Software-in-a-Time-of-Fear.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Software-in-a-Time-of-Fear-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Medium and is being reproduced here with the author&#8217;s permission. This 2,800-word essay (a 12-minute read) is about how to survive inside the AI revolution in software development, without succumbing to the fear that swirls around all of us. It explains some lessons I learned hiking up difficult mountain [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on</em> <a href="https://mysteriousrook.medium.com/software-in-a-time-of-fear-4e5a08ac7c63" target="_blank" rel="noreferrer noopener">Medium</a> <em>and is being reproduced here with the author&#8217;s permission.</em></p>
</blockquote>



<p><em>This 2,800-word essay (a 12-minute read) is about how to survive inside the AI revolution in software development, without succumbing to the fear that swirls around all of us. It explains some lessons I learned hiking up difficult mountain trails that are useful for wrestling with the coding agents. They apply to all knowledge workers, I think.</em></p>



<p><em>Up front, here are the lessons:</em></p>



<ul class="wp-block-list">
<li><em>Stop listening to people who are afraid.</em></li>



<li><em>Seek first-hand testimony, not opinions.</em></li>



<li><em>Go with someone much more enthusiastic than you.</em></li>



<li><em>Do not look down.</em></li>



<li><em>You must get different equipment.</em></li>



<li><em>Put the summit out of your mind.</em></li>
</ul>



<p><em>Yet I hope you stay for the hike up.</em></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.jpeg" alt="Precipice Trail. Image from Wikimedia Commons." class="wp-image-18394" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /><figcaption class="wp-element-caption">Precipice Trail. Image from <a href="https://commons.wikimedia.org/wiki/File:Precipice_Trail.JPG" target="_blank" rel="noreferrer noopener">Wikimedia Commons</a>.</figcaption></figure>



<p>The photo above was taken high up on a mountain. It’s a very long drop down to the right. If you fell off the path in a few places, you’d almost certainly die.</p>



<p>Would you like to walk along it?</p>



<p>Most would say: <em>No way</em>.</p>



<p>But what if I told you that, while this photo is quite real, it is misleading? It isn’t some deserted place. It is in one of America’s busiest national parks. The railings and bars on that trail are incredibly strong, even when they are strangely bent around corners. Thousands of people walk along that path every year, including children and older folks. The fatality rate is approximately one death every <em>30 years</em>.</p>



<p>In fact, my 13-year-old son and I did that climb—which is called Precipice Trail—last summer. We saw other people up there, including a family with kids. It was an incredible adventure. And the views are stunning.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.jpeg" alt="A son climbing part of Precipice Trail" class="wp-image-18395" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /><figcaption class="wp-element-caption">My son climbing part of Precipice Trail</figcaption></figure>



<p>Yes, it was a strenuous climb, and was certainly scary in some places. Even though I had done a lot of other hard trails, I was extremely nervous. If my fearless son wasn’t with me, I’d never have done it.</p>



<p>When we got to the top, out of habit, I told my son, “I am proud of you for accomplishing this.” He rolled his eyes and said, “<em>I</em> am proud of <em>you</em>.” He was right. <em>I </em>was the one at risk. (That did hurt a little bit.)</p>



<p>Yet I learned some things about fear from hiking the hardest trails in Acadia, which I’d never have imagined myself doing a few years ago.</p>



<p>As a lifelong software developer confronted by these extraordinary coding agents, I believe the future of our profession is atop an intimidating mountain whose summit is engulfed in clouds. Nobody knows how long the ascent is, or what lies at the top, though many people are confidently proclaiming we will not make it there. We are told only the agents will be at the summit, and we should therefore be afraid for our livelihoods.</p>



<p>I have far less confidence that the agents will put us all out of work. Though I don’t see all of us making it up that mountain, I intend to be among those who do.</p>



<p>Still, there is so very much <em>fear</em> in our field. It is so…<em>unfamiliar</em>! It swirls around every gathering of technologists. I was at a conference last year where the slogan was the very-comforting “human in the loop.” Yet a coworker of mine noticed, “A lot of the talks seem to be about taking the human <em>out</em> of the loop.” Indeed. And I know for a fact that some great developers are quietly yet diligently working on new tools to make their peers a thing of the past. I hear they are paid handsomely. (Perhaps in pieces of silver?) Don’t worry, they haven’t succeeded yet.</p>



<p>This revolution—whatever <em>this</em> is—isn’t like the other technological revolutions which barged into our professional lives, such as the arrival of the web or smartphone apps. There was unbridled optimism alongside those changes, and they didn’t directly threaten the livelihoods of those who didn’t want to do that kind of work.</p>



<p><em>This</em> is quite different. There <em>is</em> tremendous optimism to be found. Though I find it is almost entirely among the financially secure, as well as those with résumés decorated with elite appointments, who are confident they will merit one of the few seats in the lifeboats as the ocean liner slips into the deep carrying most of the people they knew on LinkedIn. (They’re probably right.) Alas, we can’t all be folks like Steve Yegge, can we?</p>



<p>For the rest of us who need to pay bills and take care of our children, there is <em>fear</em>. Some are panicked they will lose their jobs, or are concerned about the grim environmental, political, and social consequences AI is already inflicting on our planet. Others are climbing up the misty mountain steadily, yet they are still distressed that they will miss some crucial new development that they <em>must</em> know to survive, so they watch videos designed to make them more afraid. Still others refuse to start climbing and are silently haunted by the belief that their reservations are no longer valid.</p>



<p>Though it described us for my entire life, we can no longer be seen as a profession looking to the future. Instead, most of us are looking over our shoulders and listening for movement in the tall grass around us.</p>



<p>I too have been visited by a fear of the agents on many occasions over the past few years, but I keep it at bay…<em>most</em> nights.</p>



<p>One of the best ways I learned to manage it is pretty simple:</p>



<p><em>Stop listening to people who are afraid.</em></p>



<p>It’s odd to decide not to listen to so many people in your field, including nearly everyone in social media. I’ve never done this before.</p>



<p>Yet I learned this unexpected lesson when I was confronted by another difficult mountain in Acadia National Park a few years ago: Beehive.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="1050" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.jpeg" alt="Beehive mountain in Acadia National Park" class="wp-image-18396" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.jpeg 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-300x225.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-768x576.jpeg 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>Beehive is a well-known Acadia trail that has some sheer cliffs and is not for anyone truly afraid of heights. (The photo above is of three of my children climbing it a few years ago. Over the right shoulder of my 12-year-old daughter in the center is quite a drop.)</p>



<p>It was Beehive, and not Precipice, that taught me an unexpected lesson about popularity and fear that applies to AI.</p>



<p>So Beehive has an interesting name, is open most of the year, is close to the main tourist area and parking lots, and is often featured on signs and sweatshirts in souvenir stores. I even bought a sign for my attic.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1043" height="1600" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27.png" alt="Sign in Ed Lyons's attic for Beehive trail" class="wp-image-18397" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27.png 1043w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-196x300.png 196w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-768x1178.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-27-1001x1536.png 1001w" sizes="auto, (max-width: 1043px) 100vw, 1043px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>My older kids and I had done a lot of tough trails in Acadia over a few wonderful summers, and I wondered if we could handle Beehive. I started checking the online reviews. It sure <em>sounded</em> scary. I went to many websites and scanned hundreds of reviews over several days. The more I read, the less I wanted to try it.</p>



<p>Worse, the park rangers in Acadia are trained to not give anyone advice about what trail they can handle. (I get it.) No one else I spoke to wanted to tell a family they should try something dangerous. Everyone shrugged. It added to the fear.</p>



<p>Yet I saw conflicting evidence.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="953" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28.png" alt="Warning on the trail" class="wp-image-18398" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28-300x204.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-28-768x523.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>My research showed that only one person fell to their death decades ago, and the trail was modified after that. Also, many thousands of people of all types, including children and senior citizens, have done it without injury. On top of that, the mountain was not <em>that</em> high, and the difficult features it had, which I could see from detailed online photos, seemed quite similar to things we had done on a few other difficult trails. It didn’t <em>seem</em> like a big deal.</p>



<p>How could both things be true? Were they?</p>



<p>The truth was much closer to the second version, a conclusion vindicated after we climbed it. It <em>was</em> a little scary at times, but wasn’t <em>that</em> physically challenging. It was fun, and something you could brag about among people who had <em>heard</em> it was scary, but who had not actually climbed it.</p>



<p>I do have a slight fear of heights, so I kept climbing and never turned to look down behind me. This brings me to another lesson:</p>



<p><em>You really never have to look down.</em></p>



<p>It’s amazing how people feel an obligation to look down once in a while: to see what they’ve accomplished, to notice how high up they are, or to judge how dangerous the thing they just climbed looks from above. It often causes fear. I decided getting to the top was all that mattered, and I could look down only from up there. This is a question of focus.</p>



<p>I can think of many moments in learning to use and orchestrate coding agents where I unwisely stopped to “look down.” This takes the form of pausing and asking yourself things like:</p>



<ul class="wp-block-list">
<li>“Is this crazy technique really necessary? Isn’t the old way good enough?”</li>



<li>“What about my favorite programming languages? Will languages matter in the future?”</li>



<li>“What is the environmental cost of my queries?”</li>



<li>“Am I getting worse at writing code myself?”</li>



<li>“What if this agent keeps getting better? Will it get better than me?”</li>



<li>“Am I missing some new AI development online right now? Should I check my feeds?”</li>
</ul>



<p>None of those ruminations will help you get better with the agents. They just drain your energy when you should either rest or keep climbing.</p>



<p>I now see Beehive as an “attention vortex”: a lot of people talk about it, and dramatic statements from the fearful and from those boasting about their accomplishments dominate the reviews. The <em>talk</em> about Beehive is not tethered to the <em>reality</em> of climbing it.</p>



<p>Strangely, the <em>cachet</em> of having climbed it <em>depends</em> on the attention and fear. It made those who climbed it feel better about what they had done, and they had little interest in diminishing their accomplishment by tamping down the fear. (“Well, yes, it <em>was</em> scary up there!”) Nobody is invested in saying it was less than advertised. This insight is <em>precisely</em> why the loud coding agent YouTubers act the way they do.</p>



<p>AI is a <em>planetary</em> attention vortex. It has seemed like the only thing anyone in software development has talked about for over a year. People who quietly use the agents to improve their velocity—and aren’t particularly troubled by that—are not being heard. You aren’t seeing calm instructional videos from them on YouTube. We are instead seeing 30-year-olds pushing coding agent pornography on us every day, while telling us that their multiple-agent, infinite-token, unrestricted-permissions-YOLO workflow means we are doomed. (But <em>you</em> might survive if you hit the subscribe button on their channel, OK?) These confident hucksters are still peddling fear to keep you coming back to them.</p>



<p>Above all else, stop listening to anyone projecting fear. (Granted, you cannot avoid them entirely; they are everywhere and often share their worries unprompted.)</p>



<p>You must find useful information and shut out the rest. This is another lesson I learned:</p>



<p><em>When in an attention vortex, seek firsthand testimony, not opinions.</em></p>



<p>So the way I finally figured out Beehive wasn’t that bad was from some guy who took pictures of every part of the trail. I compared them to what I’d done on similar trails, such as the unpopular but delightful Beech Cliff trail, which nobody thought was truly dangerous and gets almost zero online attention.</p>



<p>When it comes to AI, I have abandoned opinions, predictions, and demos. I listen to senior people who are using agents on real project work, who are humble, who aren’t trying to sell me something, and who are not primarily afraid. (Examples are: <a href="https://simonwillison.net/" target="_blank" rel="noreferrer noopener">Simon Willison</a>, <a href="https://martinfowler.com/" target="_blank" rel="noreferrer noopener">Martin Fowler</a>, <a href="https://blog.fsck.com/" target="_blank" rel="noreferrer noopener">Jesse Vincent</a>, and yes, quickly hand $15 each month to the indispensable <a href="https://www.pragmaticengineer.com/" target="_blank" rel="noreferrer noopener"><em>Pragmatic Engineer</em></a>.)</p>



<p>When it came to Precipice, widely acknowledged as the hardest hiking trail in Acadia, I took a different approach. (It’s actually not a hiking trail but a mountain climb without ropes.) Using the same investigative techniques I’d learned from Beehive, I found out it was three times longer and had scarier moments.</p>



<p>This gets us to another lesson.</p>



<p><em>Go with someone much more enthusiastic than you.</em></p>



<p>I don’t know how, but my athletic 13-year-old son is a daredevil. He’s up for any scary experience. I do not usually accompany him on the scary roller coasters.</p>



<p>He was totally up for Precipice, of course. Dad was very nervous.</p>



<p>But I knew that if anyone could drag me up that mountain, it was him. I also didn’t want to let him down. In fact, I almost decided to abort the mission at the bottom of the trail. I just sighed and thought, “I will just do the beginning part. We can duck out and take another route down until about one-third of the way up.”</p>



<p>So if you’re not sure how to use AI, or are not yet enthusiastic, find people who <em>are</em> and keep talking to them! You don’t have to abandon your friends or coworkers who aren’t as interested. Instead, become the enthusiast in their world. (That is what happened to me more than a year ago.)</p>



<p>Another reason I decided not to give up is that I bought different shoes.</p>



<p>You can hike most trails in regular sneakers in almost any condition. But since Precipice is a climb and not a hike, I realized my usual worn-out running shoes might not be up for that, as I had slid on them during a lesser climb elsewhere that week.</p>



<p>So while in nearby Bar Harbor, my family ducked into a sporting goods store and looked at hiking shoes for me and my son. I told the sales guy we were going to do Precipice. He raised an eyebrow and said I would of course need something good for <em>that</em>.</p>



<p>When I held the strange shoes in my hand, I looked at the price tag and then looked at my wife, who gave a knowing look back at me that surely meant, “OK, but you do realize that you actually have to climb it if we buy those.” I just nodded.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1400" height="858" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29.png" alt="Ed's new climbing shoes" class="wp-image-18399" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29.png 1400w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29-300x184.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-29-768x471.png 768w" sizes="auto, (max-width: 1400px) 100vw, 1400px" /></figure>



<p>And we needed those new shoes! My son and I had a few tense moments scrambling where we agreed it was quite good we had them. But all along the way, they <em>felt</em> different, which was what I needed.</p>



<p>This reminds me of when I decided to use Claude Code a few weeks after it came out last March. The tokens cost 10 times what I could get elsewhere. But suddenly I was invested.</p>



<p>It also mattered that Claude Code, as a terminal application, was a very different development experience. People back then thought it was strange that I was using a CLI to manage code. It was really different for me too, and all the better: I was no longer screwing around with code suggestions in GitHub Copilot.</p>



<p>This is a lesson I have taken to AI:</p>



<p><em>You must get different equipment.</em></p>



<p>You should be regularly experimenting with new tools that make you uncomfortable. Just using the new AI features in your existing tool is not enough for continuous growth or paradigm shifts, like the recent one from the CLI to managing multiple simultaneous agents.</p>



<p>The last idea I have is to stop thinking about where all of us will end up one day.</p>



<p><em>Put the summit out of your mind.</em></p>



<p>While climbing Precipice, I decided to only think of what was in front of me. I knew it was <em>a lot</em> higher than Beehive. I just kept doing one more tough piece of it.</p>



<p>The advantage of doing this was near the top. Because the scariest piece was something I didn’t notice from online trail photos.</p>



<p>You can get an idea of what I&#8217;m talking about from <a href="http://www.watsonswander.com/assets/2016/08/DSC06593.jpg" target="_blank" rel="noreferrer noopener">this photo</a> from <a href="http://www.watsonswander.com/2016/last-days-in-maine/" target="_blank" rel="noreferrer noopener">Watson&#8217;s World</a>, which I had not seen before I got up there. It shows a long cliff with a very short ledge (much shorter than it looks at this angle). Even the picture doesn’t make it clear just how <em>exposed</em> you are and that there is <em>nothing</em> behind you but a long, deadly fall. The bottom bars are to prevent your feet from slipping off.</p>



<p>When I came to it, I thought, “No…way.”</p>



<p>But there was no turning back by then. I had come so far! I looked up and saw the summit was just above this last traverse. So I just held onto the bars, held onto my breath, and moved carefully along the cliff right behind my son, who was suddenly more cautious.</p>



<p>Had I known <em>that</em> was up there, I might not have climbed the mountain. Good thing I didn’t know.</p>



<p>As for the future of software, I don’t know what lies further up the mountain we are on. There are probably some very strenuous and scary moments ahead. But we shouldn’t be worrying about them now.</p>



<p>We should just keep climbing.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/software-in-a-time-of-fear/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Missing Layer in Agentic AI</title>
		<link>https://www.oreilly.com/radar/the-missing-layer-in-agentic-ai/</link>
				<comments>https://www.oreilly.com/radar/the-missing-layer-in-agentic-ai/#respond</comments>
				<pubDate>Thu, 26 Mar 2026 11:30:50 +0000</pubDate>
					<dc:creator><![CDATA[Artur Huk]]></dc:creator>
						<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18372</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-layer-in-agentic-AI.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-layer-in-agentic-AI-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why autonomous systems need a deterministic runtime]]></custom:subtitle>
		
				<description><![CDATA[The day two problem Imagine you deploy an autonomous AI agent to production. Day one is a success: The demos are fantastic; the reasoning is sharp. But before handing over real authority, uncomfortable questions emerge. What happens when the agent misinterprets a locale-specific decimal separator, turning a position of 15.500 ETH (15 and a half) [&#8230;]]]></description>
								<content:encoded><![CDATA[
<h2 class="wp-block-heading">The day two problem</h2>



<p>Imagine you deploy an autonomous AI agent to production. Day one is a success: The demos are fantastic; the reasoning is sharp. But before handing over real authority, uncomfortable questions emerge.</p>



<p>What happens when the agent misinterprets a locale-specific decimal separator, turning a position of 15.500 ETH (15 and a half) into an order for 15,500 ETH (15 thousand) on leverage? What if a dropped connection leaves it looping on stale state, draining your LLM request quota in minutes?</p>



<p>What if it makes a perfect decision, but the market moves just before execution? What if it hallucinates a parameter like <code>force_execution=True</code>—do you sanitize it or crash downstream? And can it reliably ignore a prompt injection buried in a web page?</p>



<p>Finally, if an API call times out without acknowledgment, do you retry and risk duplicating a $50K transaction, or drop it?</p>



<p>When these scenarios occur, megabytes of prompt logs won&#8217;t explain the failure. And adding &#8220;please be careful&#8221; to the system prompt is superstition, not an engineering control.</p>



<h2 class="wp-block-heading">Why a smarter model is not the answer</h2>



<p>I encountered these failure modes firsthand while building an autonomous system for live financial markets. It became clear that these were not model failures but execution boundary failures. While RL-based fine-tuning can improve reasoning quality, it cannot solve infrastructure realities like network timeouts, race conditions, or dropped connections.</p>



<p>The real issues are architectural gaps: contract violations, data integrity issues, context staleness, decision-execution gaps, and network unreliability.</p>



<p>These are infrastructure problems, not intelligence problems.</p>



<p>While LLMs excel at orchestration, they lack the &#8220;kernel boundary&#8221; needed to enforce state integrity, idempotency, and transactional safety where decisions meet the real world.</p>



<h2 class="wp-block-heading">An architectural pattern: The Decision Intelligence Runtime</h2>



<p>Consider modern operating system design. OS architectures separate “user space” (unprivileged computation) from “kernel space” (privileged state modification). Processes in user space can perform complex operations and request actions but cannot directly modify system state. The kernel validates every request deterministically before allowing side effects.</p>



<p>AI agents need the same structure. The agent interprets context and proposes intent, but the actual execution requires a privileged deterministic boundary. This layer, the Decision Intelligence Runtime (DIR), separates probabilistic reasoning from real-world execution.</p>



<p>The runtime sits between agent reasoning and external APIs, maintaining a <strong>context store</strong>, a centralized, immutable record ensuring the runtime holds the &#8220;single source of truth,&#8221; while agents operate only on temporary snapshots. It receives proposed intents, validates them against hard engineering rules, and handles execution. Ideally, an agent should never directly manage API credentials or “own” the connection to the external world, even for read-only access. Instead, the runtime should act as a proxy, providing the agent with an immutable context snapshot while keeping the actual keys in the privileged kernel space.</p>
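<p>As a rough sketch of that proxy relationship (hypothetical names; Python&#8217;s read-only mapping type stands in for a real immutability mechanism), the runtime keeps the credentials in kernel space and hands the agent only a frozen snapshot:</p>

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ContextSnapshot:
    """Read-only view of the context store handed to an agent."""
    version: int
    data: MappingProxyType


class DecisionRuntime:
    """Kernel space: owns the API key and the single source of truth."""

    def __init__(self, api_key):
        self._api_key = api_key   # never exposed to agents
        self._context = {"ETH-USD": {"price": 3100.0, "position": 0.5}}
        self._version = 1

    def snapshot(self):
        # Copy each record into a read-only proxy: the agent reasons on
        # this frozen view while the live context store keeps updating.
        frozen = {k: MappingProxyType(dict(v)) for k, v in self._context.items()}
        return ContextSnapshot(self._version, MappingProxyType(frozen))


runtime = DecisionRuntime(api_key="kernel-only-secret")
snap = runtime.snapshot()
try:
    snap.data["ETH-USD"]["price"] = 0.0   # agent mutation attempt
except TypeError:
    print("snapshot is read-only")        # mutation is rejected
```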



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="755" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23.png" alt="Figure 1: High-level design (HLD) of the Decision Intelligence Runtime" class="wp-image-18373" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-300x142.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-768x362.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-23-1536x725.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 1: High-level design (HLD) of the Decision Intelligence Runtime, illustrating the separation of user space reasoning from kernel space execution</em></figcaption></figure>



<p>Bringing engineering rigor to probabilistic AI requires implementing five familiar architectural pillars.</p>



<p>Although several examples in this article use a trading simulation for concreteness, the same structure applies to healthcare workflows, logistics orchestration, and industrial control systems.</p>



<h3 class="wp-block-heading">DIR versus existing approaches</h3>



<p>The landscape of agent guardrails has expanded rapidly. Frameworks like LangChain and LangGraph operate in user space, focusing on reasoning orchestration, while tools like Anthropic&#8217;s Constitutional AI and Pydantic schemas validate outputs at inference time. DIR, by contrast, operates at the execution boundary, the kernel space, enforcing contracts, business logic, and audit trails after reasoning is complete.</p>



<p>The two approaches are complementary; DIR is intended as a safety layer for mission-critical systems.</p>



<h4 class="wp-block-heading">1. Policy as a claim, not a fact</h4>



<p>In a secure system, external input is never trusted by default. The output of an AI agent is exactly that: external input. The proposed architecture treats the agent not as a trusted administrator, but as an untrusted user submitting a form. Its output is structured as a <strong>policy proposal</strong>—a claim that it <em>wants</em> to perform an action, not an order that it <em>will</em> perform it. This is the start of a Zero Trust approach to agentic actions.</p>



<p>Here is an example of a policy proposal from a trading agent:</p>



<pre class="wp-block-code"><code>proposal = PolicyProposal(
    dfid="550e8400-e29b-41d4-a716-446655440000", # Trace ID (see Sec 5)
    agent_id="crypto_position_manager_01",
    policy_kind="TAKE_PROFIT",
    params={
        "instrument": "ETH-USD",
        "quantity": 0.5,
        "execution_type": "MARKET"
    },
    reasoning="Profit target of +3.2% hit (Threshold: 3.0%). Market momentum slowing.",
    confidence_score=0.92
)</code></pre>



<h4 class="wp-block-heading">2. Responsibility contract as code</h4>



<p>Prompts are not permissions. Just as traditional apps rely on role-based access control, agents require a strict <strong>responsibility contract</strong> residing in the deterministic runtime. This layer acts as a firewall, validating every proposal against hard engineering rules: schema, parameters, and risk limits. Crucially, this check is deterministic code, not another LLM asking, &#8220;Is this dangerous?&#8221; Whether the agent hallucinates a capability or obeys a malicious prompt injection, the runtime simply enforces the contract and rejects the invalid request.</p>



<p><strong>Real-world example:</strong> A trading agent misreads a locale-specific decimal separator and attempts to execute <code>place_order(symbol='ETH-USD', quantity=15500)</code>. This would be a catastrophic position sizing error. The contract rejects it immediately:</p>



<pre class="wp-block-code"><code>ERROR: Policy rejected. Proposed order value exceeds hard limit.
Request: ~40000000 USD (15500 ETH)
Limit: 50000 USD (max_order_size_usd)</code></pre>



<p>The agent&#8217;s output is discarded; the human is notified. No API call, no cascading market impact.</p>



<p>Here is the contract that prevented this:</p>



<pre class="wp-block-code"><code># agent_contract.yaml
agent_id: "crypto_position_manager_01"
role: "EXECUTOR"
mission: "Manage news-triggered ETH positions. Protect capital while seeking alpha."
version: "1.2.0"                  # Immutable versioning for audit trails
owner: "jane.doe@example.com"     # Human accountability
effective_from: "2026-02-01"

# Deterministic Boundaries (The 'Kernel Space' rules)
permissions:
  allowed_instruments: &#91;"ETH-USD", "BTC-USD"]
  allowed_policy_types: &#91;"TAKE_PROFIT", "CLOSE_POSITION", "REDUCE_SIZE", "HOLD"]
  max_order_size_usd: 50000.00

# Safety &amp; Economic Triggers (Intervention Logic)
safety_rules:
  min_confidence_threshold: 0.85      # Don't act on low-certainty reasoning
  max_drawdown_limit_pct: 4.0         # Hard stop-loss enforced by Runtime
  wake_up_threshold_pnl_pct: 2.5      # Cost optimization: ignore noise
  escalate_on_uncertainty: 0.70       # If confidence &lt; 70%, ask human</code></pre>
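<p>To make the &#8220;deterministic code, not another LLM&#8221; point concrete, here is a minimal Python sketch of such a gate. It is illustrative only: the dictionary hardcodes two fields from the contract above, <code>validate_proposal</code> is a hypothetical helper name, and the ETH price is assumed.</p>

```python
# Minimal sketch of a deterministic contract gate (illustrative only).
# In a real runtime the contract would be loaded from the versioned YAML,
# not hardcoded here.
CONTRACT = {
    "allowed_instruments": {"ETH-USD", "BTC-USD"},
    "max_order_size_usd": 50_000.00,
}

def validate_proposal(symbol: str, quantity: float, price_usd: float) -> tuple[bool, str]:
    """Return (accepted, reason). Pure deterministic code -- no model in the loop."""
    if symbol not in CONTRACT["allowed_instruments"]:
        return False, f"Policy rejected. Instrument {symbol!r} not allowed."
    notional = quantity * price_usd
    if notional > CONTRACT["max_order_size_usd"]:
        return False, (
            f"Policy rejected. Proposed order value exceeds hard limit. "
            f"Request: ~{notional:.0f} USD, Limit: {CONTRACT['max_order_size_usd']:.0f} USD"
        )
    return True, "OK"

# The hallucinated 15,500 ETH order from the example above
# (assuming an ETH price of ~2,580 USD):
ok, reason = validate_proposal("ETH-USD", 15_500, 2_580.0)
```

<p>Whatever the agent &#8220;believes,&#8221; the gate sees only the proposed parameters, so a prompt injection and an honest hallucination are rejected by the same code path.</p>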



<h4 class="wp-block-heading">3. JIT (just-in-time) state verification</h4>



<p>This mechanism addresses the classic race condition where the world changes between the moment you check it and the moment you act on it. When an agent begins reasoning, the runtime binds its process to a specific context snapshot. Because LLM inference takes time, the world will likely change before the decision is ready. Right before executing the API call, the runtime performs a JIT verification, comparing the live environment against the original snapshot. If the environment has shifted beyond a predefined drift envelope, the runtime aborts the execution.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1106" height="892" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24.png" alt="Figure 2: JIT verification catches stale decisions before they reach external systems." class="wp-image-18374" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24.png 1106w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24-300x242.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-24-768x619.png 768w" sizes="auto, (max-width: 1106px) 100vw, 1106px" /><figcaption class="wp-element-caption"><em>Figure 2: JIT verification catches stale decisions before they reach external systems.</em></figcaption></figure>



<p>The drift envelope is configurable per context field, allowing fine-grained control over what constitutes an acceptable change:</p>



<pre class="wp-block-code"><code># jit_verification.yaml
jit_verification:
  enabled: true
  
  # Maximum allowed drift per field before aborting execution
  drift_envelope:
    price_pct: 2.0           # Abort if price moved > 2%
    volume_pct: 15.0         # Abort if volume changed > 15%
    position_state: strict   # Any change = abort
  
  # Snapshot expiration
  max_context_age_seconds: 30
  
  # On drift detection
  on_drift_exceeded:
    action: "ABORT"
    notify: &#91;"ops-channel"]
    retry_with_fresh_context: true
</code></pre>
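<p>The check itself can be sketched in a few lines, under the assumption that numeric envelope entries mean maximum percentage drift and <code>strict</code> means any change aborts (field names are simplified from the YAML above):</p>

```python
# Illustrative JIT drift check; mirrors the drift_envelope idea above.
def within_envelope(snapshot: dict, live: dict, envelope: dict) -> bool:
    """Compare the context snapshot taken at reasoning time with the live
    state right before execution. Any breach means the runtime aborts."""
    for field, limit in envelope.items():
        if limit == "strict":
            if snapshot[field] != live[field]:
                return False
        else:  # numeric limit: maximum allowed percentage drift
            drift_pct = abs(live[field] - snapshot[field]) / snapshot[field] * 100
            if drift_pct > limit:
                return False
    return True

envelope = {"price": 2.0, "volume": 15.0, "position_state": "strict"}
snapshot = {"price": 2500.0, "volume": 1_000_000, "position_state": "OPEN"}
live_ok  = {"price": 2530.0, "volume": 1_050_000, "position_state": "OPEN"}  # +1.2% price
live_bad = {"price": 2580.0, "volume": 1_050_000, "position_state": "OPEN"}  # +3.2% price
```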



<h4 class="wp-block-heading">4. Idempotency and transactional rollback</h4>



<p>This mechanism is designed to mitigate execution chaos and infinite retry loops. Before making any external API call, the runtime hashes the deterministic decision parameters into a unique idempotency key. If a network connection drops or an agent gets confused and attempts to execute the exact same action multiple times, the runtime catches the duplicate key at the boundary.</p>



<p>The key is computed as:</p>



<pre class="wp-block-code"><code>IdempotencyKey = SHA256(DFID + StepID + CanonicalParams)</code></pre>



<p>Where <code>DFID</code> is the Decision Flow ID, <code>StepID</code> identifies the specific action within a multistep workflow, and <code>CanonicalParams</code> is a sorted representation of the action parameters.</p>



<p>Critically, the <strong>context hash</strong> (snapshot of the world state) is deliberately <strong>excluded</strong> from this key. If an agent decides to buy 10 ETH and the network fails, it might retry 10 seconds later. By then, the market price (context) has changed. If we included the context in the hash, the retry would generate a new key (<code>SHA256(Action + NewContext)</code>), bypassing the idempotency check and causing a duplicate order. By locking the key to the <em>Flow ID</em> and <em>Intent params</em> only, we ensure that a retry of the same logical decision is recognized as a duplicate, even if the world around it has shifted slightly.</p>
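<p>The derivation can be sketched directly from the formula: canonicalize the intent parameters (sorted keys), then hash them together with the flow and step IDs. The exact serialization here is an assumption; the point is that context is not an input:</p>

```python
import hashlib
import json

def idempotency_key(dfid: str, step_id: str, params: dict) -> str:
    """SHA256 over the Decision Flow ID, the step ID, and canonicalized
    intent parameters. Context is deliberately excluded, so a retry of the
    same logical decision hashes to the same key even if the market moved."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{dfid}|{step_id}|{canonical}".encode()).hexdigest()

params = {"symbol": "ETH-USD", "side": "BUY", "quantity": 10}
k1 = idempotency_key("dfid-42", "step-1", params)
# Retry after a network failure: same intent, different market context --
# but context never enters the hash, so the key is identical.
k2 = idempotency_key("dfid-42", "step-1", dict(params))
```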



<p>Furthermore, when an agent makes a multistep decision, the runtime tracks each step. If one step fails, it knows how to perform a compensation transaction to roll back what was already done, instead of hoping the agent will figure it out on the fly.</p>
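<p>A compensation rollback can be sketched as the runtime tracking one undo action per completed step and replaying them in reverse on failure. This is a toy saga-style illustration of the idea, not the project&#8217;s API:</p>

```python
# Toy saga-style rollback: each completed step registers a compensation
# action; on failure, completed steps are undone in reverse order.
def run_workflow(steps):
    """steps: list of (do, undo) callables. Returns True on success,
    False after rolling back every completed step."""
    compensations = []
    try:
        for do, undo in steps:
            do()
            compensations.append(undo)
        return True
    except Exception:
        for undo in reversed(compensations):
            undo()  # compensation transaction
        return False

log = []

def fail():
    raise RuntimeError("venue rejected order")

ok = run_workflow([
    (lambda: log.append("reserve_funds"), lambda: log.append("release_funds")),
    (fail, lambda: log.append("cancel_order")),  # never completes, so never undone
])
```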



<p>A DIR does not magically provide strong consistency; it makes the consistency model explicit: where you require atomicity, where you rely on compensating transactions, and where eventual consistency is acceptable.</p>



<h4 class="wp-block-heading">5. DFID: From observability to reconstruction</h4>



<p>Distributed tracing is not a new idea. The practical gap in many agentic systems is that traces rarely capture the artifacts that matter at the execution boundary: the exact context snapshot, the contract/schema version, the validation outcome, the idempotency key, and the external receipt.</p>



<p>The Decision Flow ID (DFID) is intended as a <em>reconstruction primitive</em>—one correlation key that binds the minimum evidence needed to answer critical operational questions:</p>



<ul class="wp-block-list">
<li><strong>Why did the system execute this action?</strong> (policy proposal + validation receipt + contract/schema version)</li>



<li><strong>Was the decision stale at execution time?</strong> (context snapshot + JIT drift report)</li>



<li><strong>Did the system retry safely or duplicate the side effect?</strong> (idempotency key + attempt log + external acknowledgment)</li>



<li><strong>Which authority allowed it?</strong> (agent identity + registry/contract snapshot)</li>
</ul>



<p>In practice, this turns a postmortem from &#8220;the agent traded&#8221; into &#8220;this exact intent was accepted under these deterministic gates against this exact snapshot, and produced this external receipt.&#8221; The goal is not to claim perfect correctness; it is to make side effects explainable at the level of inputs and gates, even when the reasoning remains probabilistic.</p>



<p>At the hierarchical level, DFIDs form parent-child relationships. A strategic intent spawns multiple child flows. When multistep workflows fail, you reconstruct not just the failing step but the parent mandate that authorized it.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1324" height="449" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25.png" alt="Figure 3: Hierarchical Decision Flow IDs enable full process reconstruction across multi-agent interactions." class="wp-image-18375" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25.png 1324w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25-300x102.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-25-768x260.png 768w" sizes="auto, (max-width: 1324px) 100vw, 1324px" /><figcaption class="wp-element-caption"><em>Figure 3: Hierarchical Decision Flow IDs enable full process reconstruction across multi-agent interactions.</em></figcaption></figure>



<p>In practice, this level of traceability is not about storing prompts—it is about storing structured decision telemetry.</p>



<p>In one trading simulation, each position generated a decision flow that could be queried like any other system artifact. This allowed inspection of the triggering news signal, the agent’s justification, intermediate decisions (such as stop adjustments), the final close action, and the resulting PnL, all tied to a single simulation ID. Instead of replaying conversational history, this approach reconstructed what happened at the level of state transitions and executable intents.</p>



<pre class="wp-block-code"><code>SELECT position_id
     , instrument
     , entry_price
     , initial_exposure
     , news_full_headline
     , news_score
     , news_justification
     , decisions_timeline
     , close_price
     , close_reason
     , pnl_percent
     , pnl_usd
  FROM position_audit_agg_v
 WHERE simulation_id = 'sim_2026-02-24T11-20-18-516762+00-00_0dc07774';</code></pre>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="164" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26.png" alt="Figure 4: Example of structured decision telemetry" class="wp-image-18376" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-300x31.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-768x79.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-26-1536x157.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>Figure 4: Example of structured decision telemetry. Each row links context, reasoning, intermediate actions, and financial outcome for a single simulation run.</em></figcaption></figure>



<p>This approach is fundamentally different from prompt logging. The agent’s reasoning becomes one field among many—not the system of record. The system of record is the validated decision and its deterministic execution boundary.</p>



<h3 class="wp-block-heading">From model-centric to execution-centric AI</h3>



<p>The industry is shifting from <em>model-centric AI</em>, measuring success by reasoning quality alone, to <em>execution-centric AI</em>, where reliability and operational safety are first-class concerns.</p>



<p>This shift comes with trade-offs. Deterministic control adds latency, reduces throughput, and demands stricter schema discipline. For simple summarization tasks, this overhead is unjustified. But for systems that move capital or control infrastructure, where a single failure outweighs any efficiency gain, these are acceptable costs. A duplicate $50K order is far more expensive than a 200 ms validation check.</p>



<p>This architecture is not a single software package. Just as Model-View-Controller (MVC) is a pervasive pattern without being a single importable library, DIR is a set of engineering principles applied to probabilistic agents: separation of concerns, zero trust, and state determinism. Treating agents as untrusted processes is not about limiting their intelligence; it is about providing the safety scaffolding required to use that intelligence in production.</p>



<p>As agents gain direct access to capital and infrastructure, a runtime layer will become as standard in the AI stack as a transaction manager is in banking. The question is not whether such a layer is necessary but how we choose to design it.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This article provides a high-level introduction to the Decision Intelligence Runtime and its approach to production resiliency and operational challenges. The full architectural specification, repository of context patterns, and reference implementations are available as an open source project at </em><a href="https://github.com/huka81/decision-intelligence-runtime" target="_blank" rel="noreferrer noopener"><em>GitHub</em></a><em>.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-missing-layer-in-agentic-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Spotting and Avoiding ROT in Your Agentic AI</title>
		<link>https://www.oreilly.com/radar/spotting-and-avoiding-rot-in-your-agentic-ai/</link>
				<comments>https://www.oreilly.com/radar/spotting-and-avoiding-rot-in-your-agentic-ai/#respond</comments>
				<pubDate>Wed, 25 Mar 2026 11:17:39 +0000</pubDate>
					<dc:creator><![CDATA[Q McCallum]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18366</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/20260225-hacker-clint-patterson-dYEuFB8KQJk-unsplash.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="1920" 
				height="1283" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/20260225-hacker-clint-patterson-dYEuFB8KQJk-unsplash-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Q McCallum’s blog and is being republished here with the author’s permission. Generative AI agents and rogue traders pose similar insider threats to their employers. Specifically, we can expect companies to deploy agentic AI with broad reach and insufficient oversight. That creates the conditions for a particular flavor of [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://qethanm.cc/2026/02/25/avoiding-rot/" target="_blank" rel="noreferrer noopener"><em>Q McCallum’s blog</em></a><em> and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>Generative AI agents and rogue traders pose similar insider threats to their employers.</p>



<p>Specifically, we can expect companies to deploy agentic AI with broad reach and insufficient oversight. That creates the conditions for a particular flavor of long-running problem, which in turn creates a novel risk exposure for both the companies in question and for anyone doing business with them. The bot and the rogue trader are able to inflict sizable, sometimes existential, damage to the firms that employ them.</p>



<p>The key difference is the scope: Rogue traders operate in investment banks, while agentic AI will be deployed to a wider array of companies and industry verticals. Agentic AI may therefore create a greater number of problems than rogue traders and put a greater amount of capital at risk.</p>



<p>I&#8217;m naming this risk exposure ROT—Rogue Operator Threat—and this document is a brief explainer on what it is and how to address it.</p>



<p>(I almost called it RAT, with the A for &#8220;agentic,&#8221; but then realized that it would apply to any kind of automated system. So I broadened the scope to &#8220;operator.&#8221;)</p>



<p>To set the stage, let&#8217;s take a trip to the trading floor:</p>



<h2 class="wp-block-heading">Understanding the rogue trader</h2>



<p>Rogue trader scandals follow the same storyline:</p>



<ul class="wp-block-list">
<li>A trader accrues losses due to bad trades.</li>



<li>They hide those losses while placing new trades in an attempt to recover.</li>



<li>The new trades also lose money, digging a deeper hole.</li>



<li>Repeat.</li>
</ul>



<p>This cycle continues until they&#8217;re caught, at which point the bank is sitting on a large loss (sometimes into the billions of dollars) and the trader faces legal repercussions.</p>



<p>The story of Barings Bank offers a concrete example. Trader Nick Leeson had been logging fraudulent trades, over a stretch of three years, in an attempt to cover his mounting losses. This only came to light when the Kobe earthquake shifted markets against his most recent positions and the losses were no longer possible to hide. Leeson&#8217;s £800M ($1.3B) hole drove Barings to bankruptcy just three days later.</p>



<p>This is when you&#8217;ll ask: How could a professional trading operation let so many bad trades slip through undetected? How could a trader falsify records? Aren&#8217;t trading floors high-tech operations, full of electronic audit trails?</p>



<p>And the answer is: It&#8217;s complicated.</p>



<p>Trading operations do keep records, yes. But no system is perfect. Each time a rogue trading scandal comes to light, it turns out that there were loopholes in risk controls. A sufficiently motivated trader—especially one desperate to hide their mistakes—found and exploited these loopholes, continuing their losing streak in plain sight until they could bring in real money to backfill the fake records.</p>



<p>That &#8220;until&#8221; never happened, though. Which is why their employers then faced financial, reputational, and sometimes legal troubles.</p>



<h2 class="wp-block-heading">The AI agent&#8217;s ROT threat</h2>



<p>Similar to a trader, an AI agent operates on behalf of its parent business and is given room to operate independently so it can accomplish its tasks.</p>



<p>The risk is that, in the rush to deploy agentic AI, these companies will likely grant the bots more leeway than is necessary. We&#8217;ve already seen cases in which bots have been able to <a href="https://www.pcmag.com/news/meta-security-researchers-openclaw-ai-agent-accidentally-deleted-her-emails" target="_blank" rel="noreferrer noopener">delete emails</a> and <a href="https://www.pcmag.com/news/vibe-coding-fiasco-replite-ai-agent-goes-rogue-deletes-company-database" target="_blank" rel="noreferrer noopener">wipe a production database</a>. And there are no doubt other stories that haven&#8217;t made it into the news.</p>



<p>Those issues were at least caught in real time. Companies facing ROT are exposed to additional longer-running problems in which the bot is able to accrue losses or inflict greater damage over an extended period. In those cases the problems will only be uncovered by accident and/or when it&#8217;s too late.</p>



<p>Consider, for example, an agent that creates false data records to reflect (nonexistent) sales orders. It&#8217;s possible for this to run until some external event, such as investor due diligence or a budget review, forces someone to double-check those records against reality.</p>



<h2 class="wp-block-heading">Avoiding ROT: Mitigating the threat</h2>



<p>How can you narrow your downside risk exposure to ROT? Preventative measures are key. Strong risk controls, narrow scope of authority, and monitoring can catch rogue operator problems long before they&#8217;ve metastasized into an existential threat.</p>



<p>In light of rogue trader scandals, trading shops have been known to tighten risk controls and also separate duties to create a system of checks and balances. (This inhibits traders from logging their own fake trades.) Companies also require traders to take time off, as fraudulent activity may surface when the perpetrator isn&#8217;t around every day to keep the system running.</p>



<p>Adapting these ideas to agentic AI, a company could monitor and limit the scope of the bot&#8217;s activity (say, requiring human approval to place more than 10 orders an hour). It could also periodically purge the agent&#8217;s memory so it doesn&#8217;t accumulate too many evolved behaviors, or swap in completely new bots to pick up where the previous one had left off. And per my usual refrain of &#8220;<em>never let the bots run unattended</em>,&#8221; this company could employ people to cross-check everything the bot does. Trust, but verify.</p>



<p>This will not prevent the AI agent from making mistakes. But guardrails and sufficiently frequent checks should limit the scope of the bot&#8217;s damage. As with the rogue trader, the ROT problem isn&#8217;t about a single error; it&#8217;s about letting the errors grow out of control, undetected.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/spotting-and-avoiding-rot-in-your-agentic-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>How to Build a General-Purpose AI Agent in 131 Lines of Python</title>
		<link>https://www.oreilly.com/radar/how-to-build-a-general-purpose-ai-agent-in-131-lines-of-python/</link>
				<comments>https://www.oreilly.com/radar/how-to-build-a-general-purpose-ai-agent-in-131-lines-of-python/#respond</comments>
				<pubDate>Tue, 24 Mar 2026 11:17:16 +0000</pubDate>
					<dc:creator><![CDATA[Hugo Bowne-Anderson]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18335</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/131-lines-of-code.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/131-lines-of-code-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Implement a coding agent in 131 lines of Python code, and a search agent in 61 lines]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Hugo Bowne-Anderson’s newsletter, Vanishing Gradients, and is being republished here with the author’s permission. In this post, we’ll build two AI agents from scratch in Python. One will be a coding agent, the other a search agent. Why have I called this post “How to Build a General-Purpose AI [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on Hugo Bowne-Anderson’s newsletter, </em><a href="https://hugobowne.substack.com/p/how-to-build-a-general-purpose-ai" target="_blank" rel="noreferrer noopener">Vanishing Gradients</a><em>, and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>In this post, we’ll build two AI agents from scratch in Python. One will be a coding agent, the other a search agent.</p>



<p>Why have I called this post “How to Build a General-Purpose AI Agent in 131 Lines of Python” then? Well, as it turns out, <strong>coding agents are actually general-purpose agents in some quite surprising ways.</strong></p>



<p>What I mean by this is <em>once you have an agent that can write code</em>, it can:</p>



<ol class="wp-block-list">
<li>Do a huge number of things you don’t often think of as involving code, and</li>



<li>Extend itself to do even more things.</li>
</ol>



<p><strong>It’s more appropriate to think of coding agents as “computer-using agents” that happen to be great at writing code.</strong> That doesn’t mean you should always build a general-purpose agent, but it’s worth understanding what you’re actually building when you give an LLM shell access. That’s also why we’ll build a search agent in this post: to show the pattern works regardless of what you’re building.</p>



<p>For example, the coding agent we’ll build below has four tools: read, write, edit, and bash.</p>



<p>Watch this two-minute video to see how it can clean your desktop and why you should think of coding agents as “computer-using agents” that happen to be great at writing code:</p>



<figure class="wp-block-video"><video height="1080" style="aspect-ratio: 1920 / 1080;" width="1920" controls src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Transforming-Coding-Agents-into-General-Purpose-Computer-Assistants.mp4" playsinline></video></figure>



<p>It can do:</p>



<ul class="wp-block-list">
<li><strong>File/life organization:</strong> Clean your desktop, sort downloads by type, rename vacation photos with dates, find and delete duplicates, organize receipts into folders.&nbsp;.&nbsp;.</li>



<li><strong>Personal productivity:</strong> Search all your notes for something you half-remember, compile a packing list from past trips, find all PDFs containing “tax” from last year.&nbsp;.&nbsp;.</li>



<li><strong>Media management:</strong> Rename a season of TV episodes properly, convert images to different formats, extract audio from videos, resize photos for social media.&nbsp;.&nbsp;.</li>



<li><strong>Writing and content:</strong> Combine multiple docs into one, convert between formats, find-and-replace across many files.&nbsp;.&nbsp;.</li>



<li><strong>Data wrangling:</strong> Turn a messy CSV into a clean address book, extract emails from a pile of files, merge spreadsheets from different sources.&nbsp;.&nbsp;.</li>
</ul>



<p>This is a small subset of what’s possible. It’s also the reason Claude Cowork seemed promising and why OpenClaw has taken off in the way it did.</p>



<p><em>So how can you build this?</em> In this post, I’ll show you how to build a minimal version.</p>



<h2 class="wp-block-heading">Agents are just LLMs with tools in a loop</h2>



<p>Agents are just LLMs with tools in a conversation loop, and once you know the pattern, you&#8217;ll be able to build all types of agents with it:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1408" height="768" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5.jpeg" alt="Builder's playbook" class="wp-image-18336" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5.jpeg 1408w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-300x164.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-5-768x419.jpeg 768w" sizes="auto, (max-width: 1408px) 100vw, 1408px" /></figure>



<p>As <a href="https://ivanleo.com/blog/building-an-agent" target="_blank" rel="noreferrer noopener">Ivan Leo wrote</a>,</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The barrier to entry is remarkably low: 30 minutes and you have an AI that can understand your codebase and make edits just by talking to it.</p>
</blockquote>



<p>The goal here is to show that the pattern is the same regardless of what you’re building an agent for. Coding agent, search agent, browser agent, email agent, database agent: they all follow the same structure. The only difference is the tools you give them.</p>



<h2 class="wp-block-heading">Part 1: The coding agent</h2>



<p>We’ll start with a coding agent that can read, write, and execute code. As stated, the ability to write and execute code with bash also turns a “coding agent” into a “general-purpose agent.” With shell access, it can do anything you can do from a terminal:</p>



<ul class="wp-block-list">
<li>Sort and organize your local filesystem</li>



<li>Clean up your desktop</li>



<li>Batch rename photos</li>



<li>Convert file formats</li>



<li>Manage Git repos across multiple projects</li>



<li>Install and configure software</li>
</ul>



<p><a href="https://github.com/hugobowne/building-with-ai/tree/main/general-purpose-agent/coding-agent" target="_blank" rel="noreferrer noopener">You can find the code here</a>.</p>



<p>Check out <a href="https://ivanleo.com/blog/building-an-agent" target="_blank" rel="noreferrer noopener">Ivan Leo’s post</a> for how to do this in JavaScript and <a href="https://ampcode.com/notes/how-to-build-an-agent" target="_blank" rel="noreferrer noopener">Thorsten Ball’s post</a> for how to do it in Go.</p>



<h3 class="wp-block-heading">Setup</h3>



<p>Start by creating our project:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="421" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6.jpeg" alt="Create project" class="wp-image-18337" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6-300x87.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-6-768x222.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>We’ll be using Anthropic here. Feel free to use your LLM of choice. For bonus points, use Pydantic AI (or a similar library) and have a consistent interface for the various different LLM providers. That way you can use the same agentic framework for both Claude and Gemini!</p>



<p>Make sure you&#8217;ve got an Anthropic API key set as the <code>ANTHROPIC_API_KEY</code> environment variable.</p>



<p>We’ll build our agent in four steps:</p>



<ol class="wp-block-list">
<li>Hook up our LLM</li>



<li>Add a tool that reads files
<ol class="wp-block-list">
<li>Add more tools: <code>write</code>, <code>edit</code>, and <code>bash</code></li>
</ol>
</li>



<li>Build the agentic loop</li>



<li>Build the conversational loop</li>
</ol>



<h3 class="wp-block-heading">1. Hook up our LLM</h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="724" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11.png" alt="Hook up LLM 1" class="wp-image-18338" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11-300x149.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11-768x382.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="138" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12.png" alt="Hook up LLM 2" class="wp-image-18339" style="width:685px;height:auto" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12.png 512w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12-300x81.png 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>Text in, text out. Good! Now let’s give it a tool.</p>



<h3 class="wp-block-heading">2. Add a tool (read)</h3>



<p>We’ll start by implementing a tool called read which will allow the agent to read files from the filesystem. In Python, we can use Pydantic for schema validation, which also generates JSON schemas we can provide to the API:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="663" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7.jpeg" alt="JSON schema generation" class="wp-image-18340" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7-300x137.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-7-768x350.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
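<p>In case the screenshot is hard to read, the shape of that code is roughly the following. This is a sketch against Pydantic v2; the exact names in the post&#8217;s repo may differ:</p>

```python
from pathlib import Path
from pydantic import BaseModel, Field

class ReadArgs(BaseModel):
    """Input schema for the `read` tool."""
    path: str = Field(description="Relative path of the file to read")

def read_file(args: ReadArgs) -> str:
    # By the time we hold a ReadArgs instance, validation has already run.
    return Path(args.path).read_text()

# Pydantic v2 generates the JSON schema we hand to the model:
schema = ReadArgs.model_json_schema()
```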



<p>The Pydantic model gives us two things: validation and a JSON schema. We can see what the schema looks like:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="332" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13.png" alt="What the schema looks like" class="wp-image-18341" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13-300x68.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13-768x175.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="574" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-8.jpeg" alt="JSON schema" class="wp-image-18342" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-8.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-8-300x118.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-8-768x303.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>We wrap this into a tool definition that Claude understands:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="452" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-9.jpeg" alt="Interpret for Claude" class="wp-image-18343" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-9.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-9-300x93.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-9-768x238.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
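In Anthropic's Messages API, a tool definition is a dict with a name, a description, and an <code>input_schema</code> (the JSON schema from the previous step). A hand-written equivalent, with wording of our own, looks like this:

```python
# Tool definition in the shape Anthropic's Messages API expects.
# This dict goes into the `tools=` list of client.messages.create().
read_tool = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file from the local filesystem "
        "and return its contents."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path of the file to read"}
        },
        "required": ["path"],
    },
}
```

As the article notes below, the description matters: the model matches user intent against it when deciding whether to call the tool.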



<p>Then we add tools to the API call, handle the tool request, execute it, and send the result back:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1328" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14.png" alt="Add tools, handle request, execute, send result" class="wp-image-18344" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14-300x274.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14-768x700.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
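In code, that round trip looks roughly like the sketch below. The helper names and model ID are ours, not the article's; the API call itself requires <code>pip install anthropic</code> and an <code>ANTHROPIC_API_KEY</code>, so it lives inside a function, while the tool executor and result-block builder run locally:

```python
def execute_read(path: str) -> str:
    # The actual tool: just read the file.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def tool_result_block(tool_use_id: str, output: str) -> dict:
    # Echo the tool_use id so Claude can match the result to its request.
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": output}


def run_query(query: str, tools: list) -> str:
    # One round trip: call Claude, execute any tool request, send the
    # result back, return the final text. Model ID is illustrative.
    import anthropic  # pip install anthropic; needs ANTHROPIC_API_KEY

    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": query}]
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024, tools=tools, messages=messages
    )
    if response.stop_reason == "tool_use":
        calls = [b for b in response.content if b.type == "tool_use"]
        messages.append({"role": "assistant", "content": response.content})
        # All tool results for a turn go back in a single user message.
        messages.append({"role": "user", "content": [
            tool_result_block(b.id, execute_read(**b.input)) for b in calls
        ]})
        response = client.messages.create(
            model="claude-sonnet-4-5", max_tokens=1024, tools=tools, messages=messages
        )
    return response.content[0].text


# Local smoke test of the executor (no API call):
with open("tmp_read_demo.txt", "w", encoding="utf-8") as f:
    f.write("hello agent")
demo = execute_read("tmp_read_demo.txt")
```

Note the shape of the second request: the assistant's tool-use turn goes back in verbatim, followed by a user turn containing the matching <code>tool_result</code> block.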



<p>Let’s see what happens when we run it:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="512" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10.jpeg" alt="Script when run" class="wp-image-18345" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-300x105.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-768x270.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>This script calls the Claude API with a user query passed via command line. It sends the query, gets a response, and prints it.</p>



<p>Note that the LLM matched on the tool description: Accurate, specific descriptions are key! It’s also worth mentioning that we’ve made two LLM calls here:</p>



<ul class="wp-block-list">
<li>One in which the tool is called</li>



<li>A second in which we send the result of the tool call back to the LLM to get the final result</li>
</ul>



<p>This often trips up people building agents for the first time, and Google has made a nice visualization of what we’re actually doing:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1191" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15.png" alt="" class="wp-image-18346" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15-300x245.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15-768x628.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">2a. Add more tools (write, edit, bash)</h3>



<p>We have a read tool, but <strong>a coding agent needs to do more than read</strong>. It needs to:</p>



<ul class="wp-block-list">
<li>Write new files</li>



<li>Edit existing ones</li>



<li>Execute code to test it</li>
</ul>



<p>That’s three more tools: <code>write</code>, <code>edit</code>, and <code>bash</code>.</p>



<p>Same pattern as read. First the schemas:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="724" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16.png" alt="First, the schemas" class="wp-image-18347" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16-300x149.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16-768x382.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>Then the executors:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1328" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17.png" alt="Then, the executors" class="wp-image-18348" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17-300x274.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17-768x700.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
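As a copyable stand-in for the screenshots, here is one plausible version of the three executors (function names, return messages, and the single-occurrence edit semantics are our choices, not necessarily the article's):

```python
import subprocess
from pathlib import Path


def write_file(path: str, content: str) -> str:
    # Create parent directories as needed, then write the file.
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(content, encoding="utf-8")
    return f"Wrote {len(content)} characters to {path}"


def edit_file(path: str, old: str, new: str) -> str:
    # Replace the first occurrence of `old` with `new`.
    text = Path(path).read_text(encoding="utf-8")
    if old not in text:
        return f"Error: text to replace not found in {path}"
    Path(path).write_text(text.replace(old, new, 1), encoding="utf-8")
    return f"Edited {path}"


def run_bash(command: str) -> str:
    # Danger: executes arbitrary shell commands.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return (result.stdout + result.stderr).strip()
```

Returning error strings (rather than raising) matters: the model reads the tool output, so "text to replace not found" is a signal it can act on, typically by reading the file first.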



<p>And the tool definitions, along with the code that runs whichever one Claude picks:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="935" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.png" alt="And the tool definitions" class="wp-image-18349" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-300x193.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-18-768x493.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>The bash tool is what makes this actually useful: Claude can now write code, run it, see errors, and fix them. But it’s also dangerous. This tool could delete your entire filesystem! Proceed with caution: Run it in a sandbox, a container, or a VM.</p>
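Short of a full container, you can at least bound the blast radius a little. This sketch (our own hardening, not the article's) pins the working directory and adds a timeout; it is emphatically <em>not</em> a real sandbox:

```python
import subprocess


def run_bash_guarded(command: str, workdir: str = ".", timeout: int = 10) -> str:
    # Not a real sandbox -- just a fixed working directory and a timeout.
    # For actual isolation, run the whole agent inside a container or VM.
    try:
        result = subprocess.run(
            command,
            shell=True,
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"Error: command timed out after {timeout}s"
    return (result.stdout + result.stderr).strip()
```

The timeout also protects the agentic loop itself: without it, one hung command (a server that never exits, say) stalls the whole agent.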



<p>Interestingly, bash is what turns a “coding agent” into a “general-purpose agent.” With shell access, it can do anything you can do from a terminal:</p>



<ul class="wp-block-list">
<li>Sort and organize your local filesystem</li>



<li>Clean up your desktop</li>



<li>Batch rename photos</li>



<li>Convert file formats</li>



<li>Manage Git repos across multiple projects</li>



<li>Install and configure software</li>
</ul>



<p>It was actually “<a href="https://lucumr.pocoo.org/2026/1/31/pi/" target="_blank" rel="noreferrer noopener">Pi: The Minimal Agent Within OpenClaw</a>” that inspired this example.</p>



<p>Try asking Claude to edit a file: It often wants to read it first to see what’s there. But our current code only handles one tool call. That’s where the agentic loop comes in.</p>



<h3 class="wp-block-heading">3. Build the agentic loop</h3>



<p>Right now Claude can only call one tool per request. But real tasks need multiple steps: read a file, edit it, run it, see the error, fix it. We need a loop that lets Claude keep calling tools until it’s done.</p>



<p>We wrap the tool handling in a <code>while True</code> loop:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1479" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11.jpeg" alt="Wrap in a while True loop" class="wp-image-18350" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11-295x300.jpeg 295w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-11-768x780.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
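The shape of that loop can be shown (and exercised without an API key) by abstracting the model call behind a function. This is our own skeleton: <code>model_call(messages)</code> returns final text plus any requested tool calls, and the loop runs until the model stops asking for tools:

```python
def agent_loop(model_call, execute_tool, user_query, max_turns=20):
    # model_call(messages) -> (text, tool_calls), where tool_calls is a
    # list of (call_id, tool_name, args) tuples. Loop until no tools
    # are requested (or we hit a turn cap).
    messages = [{"role": "user", "content": user_query}]
    text = ""
    for _ in range(max_turns):
        text, tool_calls = model_call(messages)
        if not tool_calls:
            return text, messages
        for call_id, name, args in tool_calls:
            output = execute_tool(name, args)
            # In a real harness you'd append the model's actual tool_use
            # blocks here; a placeholder keeps the sketch self-contained.
            messages.append({"role": "assistant", "content": f"[tool call: {name}({args})]"})
            messages.append({"role": "user", "content": [
                {"type": "tool_result", "tool_use_id": call_id, "content": output}
            ]})
    return text, messages


# Demo with a stubbed model, so the loop runs without an API key:
def fake_model(messages):
    if len(messages) == 1:  # first turn: request a tool
        return "", [("t1", "read_file", {"path": "notes.txt"})]
    return "All done.", []  # second turn: final answer


def fake_exec(name, args):
    return f"(contents of {args['path']})"


final, history = agent_loop(fake_model, fake_exec, "Summarize notes.txt")
```

The <code>max_turns</code> cap is cheap insurance: a bare <code>while True</code> plus a confused model is an infinite token burn.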



<p>Note that we send the entire accumulated message history on every loop iteration. When building this out further, you’ll want to engineer and manage your context more deliberately. (See below for more on this.)</p>
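As a taste of what context management can look like, here is a deliberately naive trimming sketch (our own, not from the article). Real harnesses summarize or compact old turns instead of dropping them:

```python
def trim_history(messages, max_messages=40):
    # Naive context management: keep the first message (the task) plus
    # the most recent exchanges, dropping the middle.
    if len(messages) <= max_messages:
        return messages
    return [messages[0]] + messages[-(max_messages - 1):]
```

Even this crude version illustrates the core tension: whatever you drop, the agent no longer "remembers," so what you keep is a design decision, not a detail.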



<p>Let’s try a multistep task:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="543" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.png" alt="Multistep task" class="wp-image-18351" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-300x112.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-19-768x286.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">4. Build the conversational loop</h3>



<p>Right now the agent handles one query and exits. But we want a back-and-forth conversation: Ask a question, get an answer, ask a follow-up. We need an outer loop that keeps asking for input.</p>



<p>We wrap everything in a <code>while True</code>:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="814" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12.jpeg" alt="We wrap everything in a while True" class="wp-image-18352" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12-300x168.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-12-768x429.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
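A minimal version of that outer loop, with input and output injectable so it can run non-interactively (structure and names are ours):

```python
def chat(model_call, get_input=input, send_output=print):
    # Outer conversational loop. `model_call` stands in for the inner
    # agentic loop from the previous step; `messages` persists across
    # turns, which is what gives the agent memory of the conversation.
    messages = []
    while True:
        try:
            query = get_input("You: ")
        except (EOFError, StopIteration):
            break
        if query.strip().lower() in {"quit", "exit"}:
            break
        messages.append({"role": "user", "content": query})
        reply = model_call(messages)
        messages.append({"role": "assistant", "content": reply})
        send_output(f"Agent: {reply}")
    return messages


# Demo with scripted I/O so it runs without a terminal:
script = iter(["hello", "quit"])
log = []
history = chat(
    lambda msgs: f"echo: {msgs[-1]['content']}",
    get_input=lambda _prompt: next(script),
    send_output=log.append,
)
```

Injecting <code>get_input</code>/<code>send_output</code> is also what makes the harness testable, and is the seam where a chat interface (Telegram, Slack) would plug in later.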



<p>The messages list persists across turns, so Claude remembers context. That’s the complete coding agent.</p>



<p>Once again we’re merely appending all previous messages, which means the context will grow quite quickly!</p>



<h3 class="wp-block-heading"><strong>A note on agent harnesses</strong></h3>



<p>An agent harness is the scaffolding and infrastructure that wraps around an LLM to turn it into an agent. It handles:</p>



<ul class="wp-block-list">
<li><strong>The loop:</strong> prompting the model, parsing its output, executing tools, feeding results back</li>



<li><strong>Tool execution:</strong> actually running the code/commands the model asks for</li>



<li><strong>Context management:</strong> what goes in the prompt, token limits, history</li>



<li><strong>Safety/guardrails:</strong> confirmation prompts, sandboxing, disallowed actions</li>



<li><strong>State: </strong>keeping track of the conversation, files touched, etc.</li>
</ul>



<p>And more.</p>



<p>Think of it like this: <em>The LLM is the brain; the harness is everything else that lets it actually do things.</em></p>



<p>What we’ve built above is the <strong>hello world of agent harnesses</strong>. It covers <strong>the loop</strong>, <strong>tool execution</strong>, and <strong>basic context management</strong>. What it doesn’t have: safety guardrails, token limits, persistence, or even a system prompt!</p>



<p>When building out from this basis, I encourage you to follow the paths of:</p>



<ul class="wp-block-list">
<li><a href="https://github.com/badlogic/pi-mono" target="_blank" rel="noreferrer noopener"><strong>The Pi coding agent</strong></a>, which adds context loading (<code>AGENTS.md</code> files from multiple directories), persistent sessions you can resume and branch, and an extensibility system (skills, extensions, prompts)</li>



<li><a href="https://openclaw.ai/" target="_blank" rel="noreferrer noopener"><strong>OpenClaw</strong></a>, which goes further: a persistent daemon (always-on, not invoked), chat as the interface (Telegram, WhatsApp, etc.), file-based continuity (<code>SOUL.md</code>, <code>MEMORY.md</code>, daily logs), proactive behavior (heartbeats, cron), preintegrated tools (browser, subagents, device control), and the ability to message you without being prompted</li>
</ul>



<h2 class="wp-block-heading">Part 2: The search agent</h2>



<p>In order to really show you that the agentic loop is what powers any agent, we’ll now build a search agent (inspired by a <a href="https://hugobowne.substack.com/p/episode-68-a-builders-guide-to-agentic" target="_blank" rel="noreferrer noopener">podcast I did with search legends John Berryman and Doug Turnbull</a>). We’ll use Gemini for the LLM and Exa for web search. <a href="https://github.com/hugobowne/building-with-ai/tree/main/general-purpose-agent/search-agent" target="_blank" rel="noreferrer noopener">You can find the code here</a>.</p>



<p><em>But first, the astute reader may have an interesting question</em>: If a coding agent really is a general-purpose agent, why would anyone want to build a search agent when we could just get a coding agent to extend itself and turn itself into a search agent? Well, because if you want to build a search agent for a business, you’re not going to do it by building a coding agent first… So let’s build it!</p>



<h3 class="wp-block-heading">Setup</h3>



<p>As before, we’ll build this step-by-step. Start by creating our project:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="421" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13.jpeg" alt="Start by creating our project" class="wp-image-18353" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13-300x87.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-13-768x222.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>Set <code>GEMINI_API_KEY</code> (from Google AI Studio) and <code>EXA_API_KEY</code> (from exa.ai) as environment variables.</p>



<p>We’ll build our agent in four steps (the same four steps as always):</p>



<ol class="wp-block-list">
<li>Hook up our LLM</li>



<li>Add a tool (web_search)</li>



<li>Build the agentic loop</li>



<li>Build the conversational loop</li>
</ol>



<h3 class="wp-block-heading">1. Hook up our LLM</h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="724" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.png" alt="Hook up our LLM, again" class="wp-image-18354" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-300x149.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-20-768x382.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="392" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14.jpeg" alt="Who is Doug Turnbull?" class="wp-image-18355" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14-300x81.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-14-768x207.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">2. Add a tool (<code>web_search</code>)</h3>



<p>Gemini can answer from its training data, but we don’t want that, man! For current information, it needs to search the web. We’ll give it a <code>web_search</code> tool that calls Exa.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="936" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15.jpeg" alt="web_search tool" class="wp-image-18356" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15-300x193.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-15-768x494.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>
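A copyable sketch of that tool: the declaration dict follows Gemini's function-calling format, while the executor's call shape is our assumption about the <code>exa_py</code> SDK (it requires <code>pip install exa-py</code> and an <code>EXA_API_KEY</code>, so it's kept inside a function):

```python
# Function declaration in the dict shape Gemini's function calling accepts.
web_search_decl = {
    "name": "web_search",
    "description": "Search the web for current information and return result snippets.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"},
            "num_results": {"type": "integer", "description": "How many results to return"},
        },
        "required": ["query"],
    },
}


def format_results(results) -> str:
    # Flatten search hits into plain text the model can read.
    lines = []
    for r in results:
        lines.append(f"{r['title']}\n{r['url']}\n{r['text'][:500]}")
    return "\n\n".join(lines)


def web_search(query: str, num_results: int = 5) -> str:
    # Assumed exa_py usage: pip install exa-py; needs EXA_API_KEY.
    import os
    from exa_py import Exa

    exa = Exa(os.environ["EXA_API_KEY"])
    response = exa.search_and_contents(query, num_results=num_results, text=True)
    return format_results(
        [{"title": r.title, "url": r.url, "text": r.text or ""} for r in response.results]
    )


# Local demo of the formatter (no API call):
demo = format_results(
    [{"title": "Doug Turnbull", "url": "https://example.com", "text": "Search relevance engineer."}]
)
```

Truncating each result (here to 500 characters) is a small but real context-management decision: raw page text would swamp the prompt after a couple of searches.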



<p>The system instruction grounds the model, (ideally) forcing it to search instead of guessing. <a href="https://ai.google.dev/gemini-api/docs/function-calling?example=meeting#function_calling_modes" target="_blank" rel="noreferrer noopener">Note that you can configure Gemini to always use</a> <code>web_search</code>, which makes the tool call fully dependable, but I wanted to show the pattern that you can use with any LLM API.</p>



<p>We then send the tool call result back to Gemini:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1086" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16.jpeg" alt="Tool call result back to Gemini" class="wp-image-18357" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16-300x224.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-16-768x573.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">3. Build the agentic loop</h3>



<p>Some questions need multiple searches. “Compare X and Y” requires searching for X, then searching for Y. We need a loop that lets Gemini keep searching until it has enough information.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1390" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-21.png" alt="Build the agentic loop" class="wp-image-18358" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-21.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-21-300x286.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-21-768x733.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="483" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17.jpeg" alt="Build the agentic loop 2" class="wp-image-18359" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17.jpeg 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17-300x100.jpeg 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-17-768x255.jpeg 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<h3 class="wp-block-heading">4. Build the conversational loop</h3>



<p>Same as before: We want back-and-forth conversation, not one query and exit. Wrap everything in an outer loop:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1456" height="1208" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-22.png" alt="Build the conversational loop" class="wp-image-18360" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-22.png 1456w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-22-300x249.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-22-768x637.png 768w" sizes="auto, (max-width: 1456px) 100vw, 1456px" /></figure>



<p>Messages persist across turns, so follow-up questions have context.</p>



<h2 class="wp-block-heading">Extend it</h2>



<p>The pattern is the same for both agents. Add any tool:</p>



<ul class="wp-block-list">
<li><code>web_search</code> to the coding agent: Look things up while coding</li>



<li><code>bash</code> to the search agent: Act on what it finds</li>



<li><code>browser</code>: Navigate websites</li>



<li><code>send_email</code>: Communicate</li>



<li><code>database_query</code>: Run SQL</li>
</ul>



<p><em>One thing we’ll be doing is showing how general purpose a coding agent really can be</em>. As Armin Ronacher wrote in “<a href="https://lucumr.pocoo.org/2026/1/31/pi/" target="_blank" rel="noreferrer noopener">Pi: The Minimal Agent Within OpenClaw</a>”:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Pi’s entire idea is that <strong>if you want the agent to do something that it doesn’t do yet</strong>, you <strong>don’t go and download an extension or a skill</strong> or something like this. <strong>You ask the agent to extend itself</strong>. It celebrates the idea of code writing and running code.</p>
</blockquote>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Building agents is straightforward. The magic isn’t complex algorithms; it’s the conversation loop and well-designed tools.</p>



<p>Both agents follow the same pattern:</p>



<ol class="wp-block-list">
<li>Hook up the LLM</li>



<li>Add a tool (or multiple tools)</li>



<li>Build the agentic loop</li>



<li>Build the conversational loop</li>
</ol>



<p>The only difference is the tools.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Thank you to Ivan Leo, Eleanor Berger, Mike Powers, Thomas Wiecki, and Mike Loukides for providing feedback on drafts of this post.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/how-to-build-a-general-purpose-ai-agent-in-131-lines-of-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
				<enclosure url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Transforming-Coding-Agents-into-General-Purpose-Computer-Assistants.mp4" length="39099852" type="video/mp4" />
			</item>
		<item>
		<title>The Mythical Agent-Month</title>
		<link>https://www.oreilly.com/radar/the-mythical-agent-month/</link>
				<comments>https://www.oreilly.com/radar/the-mythical-agent-month/#respond</comments>
				<pubDate>Mon, 23 Mar 2026 11:15:59 +0000</pubDate>
					<dc:creator><![CDATA[Wes McKinney]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18321</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-mythical-robot-agent-month.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-mythical-robot-agent-month-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Wes McKinney’s blog and is being republished here with the author’s permission. Like a lot of people, I’ve found that AI is terrible for my sleep schedule. In the past I’d wake up briefly at 4:00 or 4:30 in the morning to have a sip of water or use [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://wesmckinney.com/blog/mythical-agent-month/" target="_blank" rel="noreferrer noopener"><em>Wes McKinney’s blog</em></a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>Like <a href="https://lucumr.pocoo.org/2026/1/18/agent-psychosis/" target="_blank" rel="noreferrer noopener">a lot of people</a>, I’ve found that AI is terrible for my sleep schedule. In the past I’d wake up briefly at 4:00 or 4:30 in the morning to have a sip of water or use the bathroom; now I have trouble going back to sleep. I could be doing things. Before I would get a solid 7–8 hours a night; now I’m lucky when I get 6. I’ve largely stopped fighting it: Now when I’m rolling around restlessly in bed at 5:07am with ideas to feed my AI coding agents, I just get up and start my day.</p>



<p>Among my inner circle of engineering and data science friends, there is a lot of discussion about how long our competitive edge as humans will last. Will having good ideas (and lots of them) still matter as the agents begin having better ideas themselves? The human-expert-in-the-loop feels essential now to get good results from the agents, but how long will that last until our wildest ideas can be turned into working, tasteful software while we sleep? Will it be a <a href="https://benn.substack.com/p/the-gentle-obsolescence" target="_blank" rel="noreferrer noopener">gentle obsolescence</a> where we happily hand off the reins or something else?</p>



<p>For now, I feel needed. I don’t describe the way I work now as “vibe coding” as this sounds like a pejorative “prompt and chill” way of building AI slop software projects. I’ve been building tools like <a href="https://roborev.io/" target="_blank" rel="noreferrer noopener">roborev</a> to bring rigor and continuous supervision to my parallel agent sessions, and to heavily scrutinize the work that my agents are doing. With this radical new way of working it is hard not to be contemplative about the future of software engineering.</p>



<p>Probably the book I’ve referenced the most in my career is <a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month" target="_blank" rel="noreferrer noopener"><em>The Mythical Man-Month</em></a> by Fred Brooks, whose now-famous Brooks’s law argues that “adding manpower to a late software project makes it later.” Lately I find myself asking whether the lessons from this book are applicable in this new era of agentic development. Will a <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" target="_blank" rel="noreferrer noopener">talented developer orchestrating a swarm of AI agents</a> be able to build complex software faster and better, and will the short-term productivity gains lead to long-term project success? Or will we run into the same bottlenecks—scope creep, architectural drift, and coordination overhead—that have plagued software teams for decades?</p>



<h2 class="wp-block-heading">Revisiting <em>The Mythical Man-Month</em> (<em>TMMM</em>)</h2>



<p>One of Brooks’s central arguments is that small teams of elite people outperform large teams of average ones, with one “chief surgeon” supported by specialists. This leads to a high degree of <em>conceptual integrity</em> about the system design, as if “one mind designed it, even if many people built it.”</p>



<p>Agentic engineering appears to amplify these problems, since the quality of the software being built is now only as good as the humans in the loop curating and refining specs, saying yes or no to features, and taming unnecessary code and architectural complexity. One of the metaphors in <em>TMMM</em> is the “tar pit”: “Everyone can see the beasts struggling in it, and it looks like any one of them could easily free itself, but the tar holds them all together.” Now, we have a new “agentic tar pit” where our parallel Claude Code sessions and <code>git worktrees</code> are engaged in combat with the code bloat and incidental complexity generated by their virtual colleagues. You can <a href="https://www.roborev.io/guides/assisted-refactoring/" target="_blank" rel="noreferrer noopener">systematically refactor</a>, but invariably an agentic codebase will end up larger and more overwrought than anything built by human hand. This is technical debt on an unprecedented scale, accrued at machine speed.</p>



<p>In <em>TMMM</em>, Brooks observed that a working program is maybe 1/9th the way to a <em>programming product</em>, one that has the necessary testing, documentation, and hardening against edge cases and is maintainable by someone other than its author. Agents are now making the “working program” (or “appears-to-work” program, more accurately) a great deal more accessible, though many newly minted AI vibe coders clearly underestimate the work involved with going from prototype to production.</p>



<p>These problems compound when considering the closely related <a href="https://martinfowler.com/bliki/ConwaysLaw.html" target="_blank" rel="noreferrer noopener">Conway’s law</a>, which asserts that the architecture of software systems tends to mirror the team and communication structure of the organization that builds them. What does that look like when applied to a virtual “team” of agents with no persistent memory and no shared understanding of the system they are building?</p>



<p>Another “big idea” from <em>TMMM</em> that has stuck with people is the n(n-1)/2 coordination problem as teams scale. With agentic engineering, there are fewer humans involved, so the coordination problem doesn’t disappear but rather changes shape. Different agent sessions may produce contradictory plans that humans have to reconcile. I’ll leave this agent orchestration question for another post.</p>



<h2 class="wp-block-heading">No silver bullet</h2>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>“There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity.”<br>—“<a href="https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf" target="_blank" rel="noreferrer noopener">No Silver Bullet</a>” (1986)</p>
</blockquote>



<p>Brooks wrote a follow-up essay to <em>TMMM</em> to look at software design through the lens of <em>essential</em> complexity and <em>accidental</em> complexity. Essential complexity is fundamental to achieving your goal: If you made the system any simpler, it would fall short of its problem statement. Accidental complexity is everything else imposed by our tools and processes: programming languages, tools, and the layer of design and documentation to make the system understandable by engineers.</p>



<p>Coding agents are probably the most powerful tool ever created to tackle accidental complexity. To think: I basically do not write code anymore, and now write tons of code in a language (Go) <a href="https://wesmckinney.com/blog/agent-ergonomics/" target="_blank" rel="noreferrer noopener">I have never written by hand</a>. There is a lot of discussion about whether IDEs are still going to be relevant in a year or two, when maybe all we need is a <a href="https://www.gnu.org/software/emacs/" target="_blank" rel="noreferrer noopener">text editor to review diffs</a>. The productivity gains are enormous, and I say this as someone burning north of 10 billion tokens a month across Claude, Codex, and Gemini.</p>



<p>But Brooks’s “No Silver Bullet” argument predicts exactly the problem I’m experiencing in my agentic engineering: The accidental complexity is no problem at all anymore, but what’s left is the essential complexity which was always the hard part. Agents can’t reliably tell the difference. LLMs are extraordinary pattern matchers trained on the entirety of humanity’s open source software, so while they are brilliant at dealing with accidental complexity (refactor this code, write these tests, clean up this mess), they struggle with the more subtle essential design problems, which often have no precedent to pattern match against. They also often tend to introduce unnecessary complexity, generating large amounts of defensive boilerplate that is rarely needed in real-world use.</p>



<p>Put another way, agents are so good at attacking accidental complexity that they <em>generate new accidental complexity</em> that can get in the way of the essential structure that you are trying to build. With a couple of my new projects, <a href="https://roborev.io/" target="_blank" rel="noreferrer noopener">roborev</a> and <a href="https://www.msgvault.io/" target="_blank" rel="noreferrer noopener">msgvault</a>, I am already dealing with this problem as I begin to reach the 100 KLOC mark and watch the agents begin to chase their own tails and contextually choke on the bloated codebases they have generated. At some point beyond that (the next 100 KLOC, or 200 KLOC) things start to fall apart: Every new change has to hack through the code jungle created by prior agents. Call it a “brownfield barrier.” At Posit we have seen agents struggle much more in 1 million-plus-line codebases such as Positron, a VS Code fork. This seems to support Brooks’s complexity scaling argument.</p>



<p>I would hesitate to place a bet on whether the present is a ceiling or a plateau. The models are clearly getting better fast, and the problems I’m describing here may look charmingly quaint in two years. But Brooks’s essential/accidental distinction gives me some confidence that this isn’t just about the current limitations of the technology. Figuring out what to build was the hard part long before we had LLMs, and I don’t see how a flawless coding agent changes that.</p>



<h2 class="wp-block-heading">Agentic scope creep</h2>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>When generating code is free, knowing when to say “no” is your last defense.</p>
</blockquote>



<p>With the cost of generating code now converging to zero, there is practically nothing stopping agents and their human taskmasters from pursuing all avenues that would previously have been cost or time prohibitive. The temptation to spend your day prompting “and now can you just…?” is overwhelming. But any new generated feature or subsystem, while cheap to create, is not costless to maintain, test, debug, and reason about in the future. What seems free now carries a contextual burden for future agent sessions, and each new bell or whistle becomes a new vector of brittleness or bugs that can harm users.</p>



<p>From this perspective, building great software projects maybe never was about how fast you can type the code. We can “type” 10x, maybe 100x faster with agents than we could before. But we still have to make good design decisions, say no to most product ideas, maintain conceptual integrity, and know when something is “done.” Agents are accelerating the “easy part” while paradoxically making the “hard part” potentially even more difficult.</p>



<p>Agentic scope creep also seems to be <a href="https://github.com/mitchellh/vouch" target="_blank" rel="noreferrer noopener">actively destroying the open source software world</a>. Now that the bar is lower than ever for contributors to jump in and offer help, projects are drowning in torrents of 3,000-line “helpful” PRs that add new features. As developers become increasingly hands-off and disengaged from the design and planning process, the agents’ runaway scope creep can get out of control quickly. When the person submitting a pull request didn’t write or fully read the code in it, there’s likely no one involved who’s truly accountable for the design decisions.</p>



<p>I have seen in my own work on roborev and msgvault that agents will propose overwrought solutions to problems when a simple solution would do just fine. It takes judgment to know when to intervene and how to keep the agent in check.</p>



<h2 class="wp-block-heading">Design and taste as our last foothold</h2>



<p>Brooks’s argument is that design talent and good taste are the scarcest resources, and with agents doing all of the coding labor, I argue that these skills matter more now than ever. The bottleneck was never hands on keyboards. Now, with the new “Mythical Agent-Month,” we can reasonably conclude that design, product scoping, and taste remain the practical constraints on delivering high-quality software. The developers who thrive in this new agentic era won’t be the ones who run the most parallel sessions or burn the most tokens. They’ll be the ones who are able to hold their projects’ conceptual models in their mind, who are shrewd about what to build and what to leave out, and who exercise taste over the enormous volume of output.</p>



<p><em>The Mythical Man-Month</em> was published in 1975, more than 50 years ago. In that time, a lot has happened: tremendous progress in hardware performance, programming languages, development environments, cloud computing, and now large language models. The tools have changed, but the constraints are still the same.</p>



<p>Maybe I’m trying to justify my own continued relevance, but the reality is more complex than that. Not all software is created equal: CRUD business productivity apps aren’t the same as databases and other critical systems software. I think the median software consulting shop is completely toast. But my thesis is more about development work in the 1% tail of the distribution: problems inaccessible to most engineers. This will continue to require expert humans in the loop, even if they aren’t doing much or any manual coding. As one recent adjacent example, my friend <a href="https://lupsasca.com/" target="_blank" rel="noreferrer noopener">Alex Lupsasca</a> at OpenAI and his world-class physicist collaborators <a href="https://openai.com/index/new-result-theoretical-physics/" target="_blank" rel="noreferrer noopener">were able to create a formulation of a hard physics problem and arrive at a solution with AI’s help</a>. Without such experts in the loop, it’s much more dubious whether LLMs would be able to both pose the questions and come up with the solutions.</p>



<p>I’ll still be getting out of bed at 5am to feed and tame my agents for the foreseeable future. The coding is easier now, and honestly more fun, and I can spend my time thinking about what to build rather than wrestling with the tools and systems around the engineering process.</p>



<p><em>Thanks to Martin Blais, Josh Bloom, Phillip Cloud, Jacques Nadeau, and Dan Shapiro for giving feedback on drafts of this post.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-mythical-agent-month/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>The Missing Mechanisms of the Agentic Economy</title>
		<link>https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/</link>
				<comments>https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/#respond</comments>
				<pubDate>Mon, 23 Mar 2026 09:50:48 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18318</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-mechanisms-of-the-agentic-economy.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/The-missing-mechanisms-of-the-agentic-economy-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[From disclosures to protocols to markets]]></custom:subtitle>
		
				<description><![CDATA[For the past two years, I’ve been working with economist Ilan Strauss at the AI Disclosures Project. We started out by asking what regulators would need to know to ensure the safety of AI products that touch hundreds of millions of people. We are now exploring the missing mechanisms that are needed to enable the [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p><em>For the past two years, I’ve been working with economist Ilan Strauss at the </em><a href="https://ai-disclosures.org/" target="_blank" rel="noreferrer noopener"><em>AI Disclosures Project</em></a><em>. We started out by asking what regulators would need to know to ensure the safety of AI products that touch hundreds of millions of people. We are now exploring the missing mechanisms that are needed to enable the agentic economy.</em></p>



<p><em>This essay traces our path from disclosures through protocols to markets and mechanism design. Rather than simply stating our conclusions, I’m sharing our thought process and some of the conversations and historical examples that have shaped it.</em></p>



<p><em>We will be holding a number of focused convenings to explore these ideas over the next couple of months, and my hope is that shared context will enable more productive engagement with what is very much a work in progress.</em></p>



<h2 class="wp-block-heading"><strong>The disclosure problem</strong></h2>



<p>Ilan Strauss and I started the AI Disclosures Project in early 2024 with a conviction that most regulators had little idea how AI worked or where it was going. The field was so young that many of the early regulatory proposals were misguided. We thought that regulators and industry should start by agreeing on standards for disclosure, so that we could all learn together as the technology develops. You can’t regulate what you don’t understand.</p>



<p>One of our first insights was that focusing solely on model safety was a mistake, much as if regulators inspected automobiles at the factory but completely ignored their use on the roads. We believed (and still do) that the focus should be on AI <em>as deployed</em>. And we believe that disclosures shouldn’t focus just on capabilities but on business models and the operating metrics that AI companies use to shape how their products operate.</p>



<p>Ilan and I had worked together previously with Mariana Mazzucato at University College London on what we called “<a href="https://www.cambridge.org/core/journals/data-and-policy/article/algorithmic-attention-rents-a-theory-of-digital-platform-market-power/D85FE41F6CF99FC57DDFB2B2B63491C5" target="_blank" rel="noreferrer noopener">algorithmic attention rents</a>,” studying how platforms like Amazon and Google control user attention to extract economic rents from their suppliers. We observed that organic search at Google and Amazon was a huge advance in market coordination, using hundreds of signals to find the best match for a user’s intent. In effect, both companies had built a better “invisible hand.” And yet after decades of success, <a href="https://www.oreilly.com/radar/rising-tide-rents-and-robber-baron-rents/" target="_blank" rel="noreferrer noopener">they turned away from that advance</a>. To use Cory Doctorow’s coinage, they began “<a href="https://www.versobooks.com/products/3341-enshittification?srsltid=AfmBOooTtLlbEhhK-ia-eHz8YiuKQ610OYjsDzsl1fGjHdPTQk1SVdk_" target="_blank" rel="noreferrer noopener">enshittifying</a>” their services by substituting inferior paid results for the top organic search results in order to pad their bottom line.</p>



<p>We’d also watched social media start out with the promise of keeping you in touch with your friends and fostering productive conversations, but then begin to optimize for engagement at the expense of everything else. By the time anyone understood what was happening, the damage had been done. We can see the inflection point in their financial metrics, but neither regulators nor the public can see the changes in operating metrics that drove the financials. What if we could capture what good looks like before it gets enshittified, and identify how that changes over time?</p>



<p><a href="https://www.ucl.ac.uk/bartlett/sites/bartlett/files/oreilly_strauss_mazzucato_2023.regulating_big_tech_through_digital_disclosures.pdf" target="_blank" rel="noreferrer noopener">We also observed</a> that modern technology companies are completely different from industrial era corporations, where you can understand key elements of the business by tracing the inputs and the outputs through the financial statements. Instead, the business is largely driven by intangibles, which are lumped into one impenetrable black box.</p>



<p>We wanted to learn from that mistake. While the horse was already out of the barn on search and social media, we hoped to get disclosure of operating metrics into AI governance while there was still an appetite for regulation. Unfortunately, that window was very short. The failure turned out to be productive, though, because it forced us to think harder about regulation more broadly and what other leverage points might be found.</p>



<h2 class="wp-block-heading"><strong>Protocols as functional disclosures</strong></h2>



<p>The first turn in our thinking came when we realized that disclosures aren’t just informational. The most important disclosures are <em>functional</em>. We came to see the parallels between disclosures and communications protocols, the agreed-on methods by which networked systems share information. For example, the HTTP protocol that underlies the World Wide Web specifies how a web browser and web server communicate in order to display a web page.</p>



<p>This is a structured communication with rules that must be followed and data that must be exchanged in a particular order. An HTTP request that identifies the user agent as a command line program such as curl rather than a graphical browser such as Chrome triggers a different response from the server. The user-agent string isn’t a report filed with a regulator. It’s an operational signal embedded in the protocol, and it carries a lot of information.</p>
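<p>As a sketch of what such a functional disclosure looks like in practice, here is a hypothetical handler (the function name and the format choices are illustrative, not any real server’s logic) that branches on the User-Agent string, the way many servers send plain text to curl but full HTML to a graphical browser:</p>

```python
# Illustrative only: a made-up server-side decision that reads the
# User-Agent header as an operational signal and adapts the response.

def negotiate_response(headers: dict) -> str:
    """Pick a response content type from the User-Agent disclosure."""
    agent = headers.get("User-Agent", "").lower()
    if agent.startswith("curl"):
        return "text/plain"        # command-line client: plain text
    if "mozilla" in agent:
        return "text/html"         # graphical browser: a full page
    return "application/json"      # unknown agents get structured data

print(negotiate_response({"User-Agent": "curl/8.4.0"}))            # text/plain
print(negotiate_response({"User-Agent": "Mozilla/5.0 (X11)"}))     # text/html
```

<p>No regulator ever sees this exchange, yet the header changes what actually happens on the wire: the disclosure is the mechanism.</p>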



<p>Once you see protocols as a system of functional disclosures, you start noticing that every regulatory system has a kind of communications and control protocol&nbsp;at its heart. Generally Accepted Accounting Principles (GAAP) or IFRS, the European equivalent, are protocols for communication between companies and their accountants, auditors, banks, investors, and tax authorities. Even road markings and road signs are a communications protocol, giving information to drivers about local conditions, laws, and the proper use of the road. These are slow, analog protocols, but they are protocols nonetheless.</p>



<p>Protocols can be inspected. Observability is the key to governance. Police observe speeders on the road; credit card processors and banks watch for credit card fraud on their payment networks; email processors filter spam as it passes through nodes on the network. The <a href="https://learning.oreilly.com/library/view/observability-engineering/9781492076438/" target="_blank" rel="noreferrer noopener">observability</a> points for AI are still emerging, but that’s where regulators should be focused.</p>



<p>Even beyond being a locus for observability and regulability, protocols themselves do an enormous amount of the governing work in modern technology systems. Spanning everything from how packets get from one place to another to what gets displayed, who has permission to see it, and sometimes even what it costs, they ultimately determine who can interoperate with whom. That led us to an even bigger realization.</p>



<h2 class="wp-block-heading"><strong>Protocols shape markets</strong></h2>



<p>Think about the early shape of the AI chatbot market. It was a winner-takes-all race to be the dominant platform for AI in the way Windows became the platform for PCs, or iOS and Android for phones. Whoever wins controls the market. Then Anthropic introduced <a href="https://www.anthropic.com/news/model-context-protocol" target="_blank" rel="noreferrer noopener">MCP</a>, the Model Context Protocol. All of a sudden, the landscape looked more like a web. There could be many winners. It didn’t matter what model you were running or whose APIs you were calling as long as you followed the protocol. And as the agentic AI market unfolded, the protocol wasn’t just MCP. An AI agent could be a user of the existing internet protocol stacks. Whether MCP itself survives or is superseded by other protocols, the shape of the market was transformed.</p>



<p>This insight reframed our whole project. Protocols are not just technical infrastructure. They are market-shaping mechanisms.</p>



<h2 class="wp-block-heading"><strong>Workflows are also protocols</strong></h2>



<p>I talked last week with some of the folks working on the Long Now Foundation’s partnership with Ethereum&#8217;s <a href="https://summerofprotocols.com/" target="_blank" rel="noreferrer noopener">Summer of Protocols</a> project, and that widened my lens even further.</p>



<p>When software people hear “protocol,” we think of communication protocols: TCP/IP, HTTP, MCP, or, say, Stripe’s Machine Payment Protocol (MPP).</p>



<p>To the Long Now folks, a protocol is any standardized way of doing something. Wildfire management teams follow protocols. So do flood response teams, hospital emergency rooms, and air traffic controllers. Atul Gawande’s book <a href="https://atulgawande.com/book/the-checklist-manifesto/" target="_blank" rel="noreferrer noopener"><em>The Checklist Manifesto</em></a> was an attempt to establish a common protocol for surgical operating theaters. This is a very different definition of protocol, and yet putting the two meanings of the word into the same frame makes a new kind of sense.</p>



<p>In his introduction to the Summer of Protocols’ <a href="https://summerofprotocols.com/protocol-reader" target="_blank" rel="noreferrer noopener"><em>Protocol Reader</em></a>, Venkatesh Rao cited Ethereum researcher Danny Ryan&#8217;s definition of a protocol as a &#8220;stratum of codified behavior&#8221; enabling coordination. He pointed out that protocols tend to become invisible once adopted. Rao calls this a &#8220;Whitehead advance,&#8221; after the philosopher Alfred North Whitehead’s observation that civilization advances by extending what we can do without thinking.</p>



<p>But he also made the thought-provoking point that a protocol is an “engineered argument,” in contrast with an API, which he says is an “engineered agreement” enforced by one dominant actor. There&#8217;s more to it than just the power asymmetry of enforced agreement, though. In a followup conversation, Venkatesh Rao noted that protocols are &#8220;not just codified modes of information exchange, but modes of live, structured, argumentation, often with an active computational element. For example, CSMA/CD (Ethernet) must detect packet collisions and compute and execute a random delay for retransmittal of packets. This is not mere structured communication. This is argumentation with what philosophers call dynamic semantics.”</p>



<p>Rao continued: “The moment you go beyond computing protocols, real-world feedback loops from material consequences become really important. For example, container-shipping is quite close architecturally to TCP/IP (the big difference being that packets can be dropped and retransmitted while lost containers are actually lost), but because it has a materially embodied feedback loop, regulatory mechanisms start to behave more like control systems than communication systems.”</p>
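<p>Rao’s CSMA/CD example rewards being made concrete. Here is a minimal sketch, under stated assumptions, of Ethernet-style binary exponential backoff (the 51.2&#181;s slot time and the 1,024-slot cap are the classic 10 Mbit/s parameters; the function name is mine): after each collision, a station waits a random number of slot times drawn from a window that doubles with every collision.</p>

```python
import random

# Sketch of Ethernet-style binary exponential backoff.
# Assumption: after the n-th collision, the station waits a random
# number of slot times chosen uniformly from [0, 2**min(n, 10) - 1].

SLOT_TIME_US = 51.2  # classic 10 Mbit/s Ethernet slot time, in microseconds

def backoff_delay(collisions: int, rng: random.Random) -> float:
    """Delay (microseconds) before retransmitting after `collisions` collisions."""
    max_slots = 2 ** min(collisions, 10)   # window doubles, capped at 1,024 slots
    return rng.randrange(max_slots) * SLOT_TIME_US

rng = random.Random(0)
for n in range(1, 4):
    print(f"after collision {n}: wait {backoff_delay(n, rng):.1f} microseconds")
```

<p>The randomness is the “argument”: two colliding stations each compute a different delay, and the disagreement is resolved by the protocol itself rather than by any central arbiter.</p>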



<p>I love the idea of protocols as an engineered argument. The dynamism this suggests is going to be ever more true in a future of agentic protocols. But this notion also triggered another thought, which is that&nbsp;<em>markets are also engineered arguments</em>. My bridge to this reformulation was&nbsp;the difference between&nbsp;<em>de jure</em>&nbsp;protocols that arise from a formal standards process, and&nbsp;<em>de facto&nbsp;</em>protocols that arise through market contention.</p>



<p>In the early days of the internet, the Internet Engineering Task Force (IETF) was all about engineered arguments. People had ideas about how the internet ought to work, and to prove their point they had to show up with interoperable implementations. No one had the ability to enforce anything. Agreement had to evolve. As Dave Clark famously put it, “<a href="https://ieeexplore.ieee.org/document/1677461" target="_blank" rel="noreferrer noopener">We reject: kings, presidents, and voting. We believe in: rough consensus and running code</a>.” The <em>de facto</em> protocols of the internet that emerged from the IETF ended up significantly outperforming the competing <em>de jure</em> networking protocols that emerged from telecommunications standards bodies. The IETF framed the argument; whoever showed up made their case and won or lost by way of adoption.</p>



<p>It also made me remember another decades-old story that I had lived through. Microsoft and Netscape were duking it out in the web server market and were building their own “engineered agreements” for what was up the stack from the base web server functionality. Everyone thought that Apache wasn’t keeping up, but Apache had a trump card: an extension layer. And that engineered all kinds of productive arguments among a market of competing developers rather than a single engineered agreement imposed by either a dominant player OR a dominant committee.</p>



<p>Rao also noted that protocols spread slowly but become nearly impossible to dislodge once established. For example, SMTP (the protocol for email) dates back to 1982, and has outlasted many competitors. There is a lot of path dependence. And so getting the first steps right is an important part of engineering the argument.</p>



<p>And in his essay “<a href="https://summerofprotocols.com/research/standards-make-the-world" target="_blank" rel="noreferrer noopener">Standards Make the World</a>” for the Summer of Protocols project, David Lang makes the point that technical standards form a third pillar of modern society, alongside private organizations and public institutions. They aren’t the state and they aren’t the market, but they’re essential to both. When they work well, standards become enabling technologies. The internet. The shipping container. Standard time. They are civilizational infrastructure.</p>



<p>In short, we are not just building communication protocols for software agents. We are developing a new way to standardize the best practices and workflows that will shape the human + AI future, allowing humans and agents to cooperate across organizations, industries, and borders.</p>



<h2 class="wp-block-heading"><strong>Skills can also be seen as protocols</strong></h2>



<p>Once the Long Now team planted in my mind the connection between workflows and protocols, it occurred to me that Agent Skills are also a “stratum of codified behavior,” and perhaps even a set of competing “engineered arguments” for how to do work with AI.</p>



<p>At the simplest level, a Skill is a piece of structured knowledge: here’s how to create a Word document; here’s how to extract the text from a PDF; here’s how to publish on the <a href="https://huggingface.co/docs/hub/index" target="_blank" rel="noreferrer noopener">Hugging Face Hub</a>. There can be many Skills that attempt to codify the same knowledge, but some may be better than others. As Skills multiply, how will we find the best ones? This is in many ways analogous to the organic web search problem, which Google solved by aggregating hundreds of useful signals.</p>



<p>And we’re seeing that there is a kind of hierarchy of skills. Jesse Vincent’s <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> framework, which has become one of the most widely adopted open source projects in AI-assisted development, doesn’t just give agents individual capabilities. It encodes an entire software development methodology: brainstorm before you build, plan before you code, test before you ship, review before you merge. That’s a standardized workflow. It’s a lot like the kinds of protocol that the Long Now folks were talking about, expressed in a form that agents can follow.</p>



<p>The existing protocols that the protocol research community talks about, like wildfire management protocols or hospital triage protocols, encode best practices into a repeatable, teachable process for human teams. They have yet to be adapted for agents. And in fact, many of them are never going to be entirely agentic. We will need to build mechanisms for workflows that include both AI agents and humans working together.</p>



<p>Agent Skills in some (but not all) areas raise the same questions that industrial standards have always raised: Who decides what the best practice is? How do you verify quality? How do you govern updates? We may be talking about skills that encode the workflow for regulatory compliance in a specific industry, or for conducting an environmental impact assessment, or for managing a clinical trial. Are the standards <em>de jure</em> or <em>de facto</em>, the result of an engineered agreement by a committee or an engineered argument that enables a vibrant market?</p>



<p>At O’Reilly, this is something we think about a lot. We’re a company built on codifying expert knowledge. We’ve published books and organized conferences and online training that taught people how to do new things. Now we’re asking “What does it look like to publish the skills that teach agents how to do things? And how do we make sure those skills are discoverable, trustworthy, and monetizable, not just for us but for every domain expert who has knowledge worth encoding?” And how do they emerge from contention in a vibrant market rather than by decree?</p>



<p>We believe we’ll all be better off with an engineered argument than an engineered agreement. And that brings me to mechanism design.</p>



<h2 class="wp-block-heading"><strong>The missing mechanisms</strong></h2>



<p>Economists use the term &#8220;mechanism design&#8221; to describe the engineering of rules and incentive structures that lead self-interested actors to produce outcomes that are good for everyone. It&#8217;s sometimes called &#8220;reverse game theory.&#8221; Rather than analyzing the equilibria that emerge from a given set of rules, you start with the outcome you want and work backward to design the rules that will get you there.</p>



<p>Mechanism design theory got its start in the 1960s when Leonid Hurwicz took up the problem of how a planner can make good decisions when the information needed to make them is scattered among many different people, each of whom has their own interests. His key insight was that people won&#8217;t reliably reveal what they know unless it&#8217;s in their interest to do so. So how do you design a system that aligns their incentives?</p>



<p>The field that Hurwicz founded and that Eric Maskin and Roger Myerson developed through the 1970s and 80s earned all three the Nobel Prize in Economics in 2007.</p>



<p>I first encountered the field when Jonathan Hall, at the time the Chief Economist at Uber, waved Al Roth’s book <a href="https://www.amazon.com/Who-Gets-What-Why-Matchmaking/dp/0544705289" target="_blank" rel="noreferrer noopener"><em>Who Gets What — and Why</em></a> at me and said “This is my Bible.” In it, Roth describes his own work on mechanism design, which won him the 2012 Nobel Prize in Economics along with Lloyd Shapley. Roth applied mechanism design to kidney matching markets, markets for college admissions, for law clerks and judges, and for hospitals and medical residents. When I first talked to Jonathan and then Al Roth, my layman’s takeaway about mechanism design was that it was simply the application of economic theory to design better markets.</p>



<p>And I’ve since come to think even more broadly about what mechanism design might mean in a technology context. In my broader framing, packet switching was a breakthrough in mechanism design. So for that matter was TCP/IP, the World Wide Web, and <a href="https://en.wikipedia.org/wiki/The_Unix_Programming_Environment" target="_blank" rel="noreferrer noopener">the protocol-centric architecture of Unix/Linux</a>, which enabled open source and the distributed, cooperative software development environment we take for granted today. PageRank and the rest of Google’s organic search system also seems to me to be a kind of mechanism design. So do Pay Per Click advertising and the Google ad auction. All of them are ways of aligning incentives such that self-interested actors produce outcomes that are good for others as well.</p>



<p>So that brings me back to AI. Right now, there’s a problem that makes the AI/human knowledge market less efficient than it could be. The disrespect for IP that has been shown by the AI labs and applications during the training stage, and even now during inference, has led to efforts by content owners to protect their content from AI. Do not crawl. Lawsuits. Reluctance to share information. Even the AI labs are complaining about the theft of their IP and trying to protect their model weights from distillation.</p>



<p>It’s an economy crying out for mechanism design.</p>



<p>The lesson of <a href="https://support.google.com/youtube/answer/2797370?hl=en" target="_blank" rel="noreferrer noopener">YouTube Content ID</a> is worth learning. Twenty-five years ago, the music industry was in the same position that content creators are in today with AI. In response to unauthorized use of their music by creators, music publishers’ demand to YouTube was “Take it down.” But as Google engineer Doug Eck explained to me, YouTube came up with a better answer: “How about we help you monetize it instead?” I don’t know the details of how that decision was made but I do know the eventual outcome. Aligned incentives led to a vibrant creator economy in which YouTube’s video creators, the music companies, and Google all got to share in the value that was created.</p>



<p>That should give us inspiration for how to solve some of the problems we face now with AI. Whether it’s with Agent Skills, NotebookLM, or other emergent artifacts of the new AI/human knowledge economy, we need to align the incentives. If we can grow the pie in a way where no single gatekeeper captures the bulk of the benefit, there’s a way to create a vibrant market. But that requires building mechanisms that don’t exist yet.</p>



<p>What mechanisms are missing from the agentic economy? Here’s a partial list:</p>



<p><strong>Skills markets. </strong>There’s an enormous economic opportunity for humans to create and trade skills that agents can use. These are not just simple aggregation of context with tool use instructions, but higher-level, industry-specific workflows that encode deep human expertise. At O’Reilly, we’re figuring out how to turn our knowledge and that of our authors into skills, how to make them discoverable, and how to sell them. But as of yet, there’s no way for a broader community of skill creators to participate.</p>



<p><strong>Quality and governance for skills. </strong>Some skills will need the same kinds of governance that industrial standards have. Who certifies that a medical skills package follows current clinical guidelines? Who updates it when the guidelines change? We haven’t begun to build the institutions that would govern agent skills at that level.</p>



<p><strong>Registries and discovery. </strong>The MCP community has been working on&nbsp;<a href="https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/" target="_blank" rel="noreferrer noopener">a registry protocol</a>, as is&nbsp;<a href="https://eips.ethereum.org/EIPS/eip-8004" target="_blank" rel="noreferrer noopener">the Ethereum community</a>.</p>



<p>This isn&#8217;t just a technical development but a business opportunity. I still remember when Network Solutions was running the original top level internet domain name registry under contract from the National Science Foundation. When the government said it would end the payments, Network Solutions planned to walk away. Then they realized what they had. On the early internet, domain name registration became a surprisingly big business. Now it’s just boring civilizational infrastructure. Is there something similar for AI models, applications, and agents?</p>



<p><strong>Organic search for agents. </strong>Google’s first great innovation on the web wasn’t how to make pay per click ads really work with a data-driven ad auction. It was organic search: a way of coordinating a market with hundreds of signals that ignored price and worked independently of whether the destination content was free or paid. <em>The New York Times</em> (or <a href="http://oreilly.com" target="_blank" rel="noreferrer noopener">oreilly.com</a>) is subscription-based, but that isn’t a factor in whether Google shows it to you. Google figured out signals that let them say, “This is the best result for this query.” Sites behind paywalls figured out how to disclose enough for people to decide whether they wanted to take the next step and enter into a transaction. That’s an engineered argument.</p>



<p>We’re going to need the equivalent for skills and agent services. We’ll start with curated marketplaces. Vercel already has one. But we’re a long way from something as effective as Google’s peak in organic search. The search space will be huge, with hundreds of millions, maybe billions of agents seeking the best way to accomplish trillions of distinct tasks. Skills can help them save on inference costs and deliver better results. The question is what signals will drive discovery of the best match.</p>



<p><strong>Extension architectures. </strong>MCP’s&nbsp;<a href="https://modelcontextprotocol.io/extensions/overview" target="_blank" rel="noreferrer noopener">extension model</a>&nbsp;(including the new&nbsp;<a href="https://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps/" target="_blank" rel="noreferrer noopener">Apps Extension</a>) is promising. This is the Apache model all over again: keep the core simple, let people layer different approaches on top, and let the market sort out which ones win. It is, in essence, an engineered argument rather than an engineered agreement.</p>



<p><strong>Payment layers. </strong>Stripe has been working on&nbsp;<a href="https://stripe.com/use-cases/agentic-commerce" target="_blank" rel="noreferrer noopener">agentic commerce</a>, but it seems to be focused on traditional e-commerce transactions like booking a ticket or buying a product. What about a payment layer for skills? There have been proposals for monetizing MCP calls (pay per call, pay per token), but none have caught on yet. Coinbase&#8217;s&nbsp;<a href="https://www.coinbase.com/developer-platform/discover/launches/x402" target="_blank" rel="noreferrer noopener">x402 protocol</a>&nbsp;may also end up playing a role.</p>



<p><strong>Progressive access and authentication. </strong><a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1649" target="_blank" rel="noreferrer noopener">MCP Server Cards</a> promise to let a service specify its terms: here’s what we charge, here’s how you authenticate. That’s a functional disclosure layer that could enable commerce. It could enable progressive privileges: a free O’Reilly subscriber gets one set of tools, a paying subscriber gets a richer set, all on top of the same MCP server. Again, that’s an engineered argument with the market deciding the winners.</p>



<p><strong>Neutrality in agent routing. </strong>When ChatGPT decides to show you a Booking.com widget instead of an Airbnb widget, who made that choice, and on what basis? OpenAI claims commercial considerations aren’t a factor. That’s hard to take at face value. We need something like the original principle of organic search: surface the best result for the user, not the most profitable one for the platform.</p>



<h2 class="wp-block-heading"><strong>We don’t know the future, but we can set ourselves up to shape it for the better</strong></h2>



<p>I’m old enough to remember when UUCP was giving way to the internet, and there was a real debate over whether explicit path routing or domain routing was better. In retrospect, it’s blindingly obvious that path routing wasn’t going to scale. But it’s worthwhile to know that at the time, people weren’t at all clear about that!</p>



<p>The same is true now. Some of what I’ve described will turn out to be the equivalent of explicit path routing: a dead end that was only plausible for a small-scale network. Other parts will turn out to be as fundamental as DNS or HTTP. But we’re not trying to pick the winners. We’re trying to engineer the argument.</p>



<p>If we can enable better markets, it will allow a process of discovery. People try different things, most fail, some catch on. The job right now is to build the mechanisms that help the market to evolve.</p>



<p>We need mechanisms that no single gatekeeper can control. Modular, decentralized architectures let people experiment with business models, routing decisions, payment systems, and quality signals. And alongside those markets, we will eventually need institutions (some of which will be protocols) to maintain standards that will become the infrastructure of the next economy.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This article recapitulates a conversation with Ilan Strauss and Ido Salomon, and a separate conversation on the broader meaning of protocols in the context of industry workflows and civilizational infrastructure with Venkatesh Rao and Timber Schroff of the Ethereum Foundation’s Summer of Protocols program, and Denise Hearn and James Home of the Long Now Foundation. Rao’s </em>Protocol Reader<em> and David Lang’s “Standards Make the World,” published through the Summer of Protocols project, inform the argument about protocols as civilizational infrastructure.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/the-missing-mechanisms-of-the-agentic-economy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Beyond Code Review</title>
		<link>https://www.oreilly.com/radar/beyond-code-review/</link>
				<pubDate>Fri, 20 Mar 2026 11:13:40 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18308</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Beyond-a-circular-code-review.jpeg" 
				medium="image" 
				type="image/jpeg" 
				width="640" 
				height="640" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Beyond-a-circular-code-review-160x160.jpeg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Not that long ago, we were resigned to the idea that humans would need to inspect every line of AI-generated code. We’d do it personally, code reviews would always be part of a serious software practice, and the ability to read and review code would become an even more important part of a developer&#8217;s skillset. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Not that long ago, we were resigned to the idea that humans would need to inspect every line of AI-generated code. We’d do it personally, code reviews would always be part of a serious software practice, and the ability to read and review code would become an even more important part of a developer&#8217;s skillset. At the same time, I suspect we all knew that was untenable, that AI would quickly generate much more code than humans could reasonably review. Understanding someone else’s code is harder than understanding your own, and understanding machine-generated code is harder still. At some point—and that point comes fairly early on—all the time you saved by letting AI write your code is spent reviewing it. It’s a lesson we’ve learned before; it’s been decades since anyone except for a few specialists needed to inspect the assembly code generated by a compiler. And, as Kellan Elliott-McCrea has <a href="https://laughingmeme.org/2026/03/05/socialize-the-plan.html" target="_blank" rel="noreferrer noopener">written</a>, it’s not clear that code review has ever justified the cost. While sitting around a table inspecting lines of code might catch problems of style or poorly implemented algorithms, code review remains an expensive solution to relatively minor problems.</p>



<p>With that in mind, specification-driven development (SDD) shifts the emphasis from review to verification, from prompting to specification, and from testing to still more testing. The goal of software development isn’t code that passes human review; it’s systems whose behavior lives up to a well-defined specification that describes what the customer wants. Finding out what the customer needs and designing an architecture to meet those needs requires human intelligence. As Ankit Jain <a href="https://www.latent.space/p/reviews-dead" target="_blank" rel="noreferrer noopener">points out in <em>Latent Space</em></a>, we need to make the transition from asking whether the code is written correctly to asking whether we’re solving the right problem. Understanding the problem we need to solve is part of the specification process—and it’s something that, historically, our industry <a href="https://www.oreilly.com/radar/quality-assurance-errors-and-ai/" target="_blank" rel="noreferrer noopener">hasn’t done well</a>.</p>



<p>Verifying that the system actually performs as intended is another critical part of the software development process. Does it solve the problem as described in the specification? Does it meet the requirements for what Neal Ford calls “<a href="https://learning.oreilly.com/library/view/fundamentals-of-software/9781492043447/ch04.html" target="_blank" rel="noreferrer noopener">architectural characteristics</a>” or “-ilities”: scalability, auditability, performance, and many other characteristics that are embodied in software systems but that can rarely be inferred from looking at the code, and that AI systems can’t yet reason about? These characteristics should be captured in the specification. The focus of the software development process moves from writing code to determining what the code should do and verifying that it indeed does what it’s supposed to do. It moves from the middle of the process to the beginning and the end. AI can play a role along the way, but specification and verification are where human judgment is most important.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<p>Drew Breunig and others point out that this is inherently <a href="https://www.dbreunig.com/2026/03/04/the-spec-driven-development-triangle.html" target="_blank" rel="noreferrer noopener">a circular process, not a linear one</a>. A specification isn’t something you write at the start of the process and never touch again. It needs to be updated whenever the system’s desired behavior changes: whenever a bug fix results in a new test, whenever users clarify what they want, whenever the developers understand the system’s goals more deeply. I’m impressed with how agile this process is. It is not the agile of sprints and standups but the agile of incremental development. Specification leads to planning, which leads to implementation, which leads to verification. If verification fails, we update the spec and iterate. Drew has built Plumb, a command line tool that can be plugged into Git, to support an automated loop through specification and testing. What distinguishes Plumb is its ability to help software developers look at the decisions that resulted in the current version of the software: diffs, of course, but also conversations with AI, the specifications, the plans, and the tests. As Drew says, Plumb is intended as an inspiration or a starting point, and it’s clearly missing important features—but it’s already useful.</p>



<p>Can SDD replace code review? Probably; again, code review is an expensive way to do something that may not be all that useful in the long run. But maybe that’s the wrong question. If you don’t listen carefully, SDD sounds like a reinvention of the waterfall process: a linear drive from writing a detailed spec to burning thousands of CDs that are stored in a warehouse. We need to listen to SDD itself to ask the right questions: How do we know that a software system solves the right problem? What kinds of tests can verify that the system solves the right problem? When is automated testing inappropriate, and when do we need human engineers to judge a system’s fitness? And how can we express all of that knowledge in a specification that leads a language model to produce working software?</p>



<p>We don’t place as much value in specifications as we did in the last century; we tend to see spec writing as an obsolete ceremony at the start of a project. That’s unfortunate, because we’ve lost a lot of institutional knowledge about how to write good, detailed specifications. The key to making specifications relevant again is realizing that they’re the start of a circular process that continues through verification. The specification is the repository for the project’s real goals: what it’s supposed to do and why—and those goals necessarily change during the course of a project. A specification-driven development loop that runs through testing—not just unit testing but fitness testing, acceptance testing, and human judgment about the results—lays the groundwork for a new kind of process in which humans won’t be swamped by reviewing AI-generated code.</p>
]]></content:encoded>
										</item>
		<item>
		<title>Keep Deterministic Work Deterministic</title>
		<link>https://www.oreilly.com/radar/keep-deterministic-work-deterministic/</link>
				<pubDate>Thu, 19 Mar 2026 11:15:28 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18299</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Keep-deterministic-work-deterministic.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Keep-deterministic-work-deterministic-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[How eight iterations of a blackjack simulation earned each nine the hard way]]></custom:subtitle>
		
				<description><![CDATA[This is the second article in a series on agentic engineering and AI-driven development. Read part one here, and look for the next article on April 2 on O’Reilly Radar. The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p><em>This is the second article in a series on agentic engineering and AI-driven development. </em><a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener"><em>Read part one here</em></a><em>, and look for the next article on April 2 on O’Reilly Radar.</em></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.<br>—<a href="https://en.wikipedia.org/wiki/Ninety%E2%80%93ninety_rule" target="_blank" rel="noreferrer noopener">Tom Cargill, Bell Labs</a></p>
</blockquote>



<p>One of the experiments I&#8217;ve been running as part of my work on agentic engineering and AI-driven development is a blackjack simulation where an LLM plays hundreds of hands against blackjack strategies written in plain English. The AI uses those strategy descriptions to make hit/stand/double-down decisions for each hand, while deterministic code deals the cards, checks the math, and verifies that the rules were followed correctly.</p>



<p>Early runs of my simulation had a 37% pass rate. The LLM would add up card totals wrong, skip the dealer&#8217;s turn entirely, or ignore the strategy it was supposed to follow. The big problem was that these errors compounded: If the model miscounted the player&#8217;s total on the third card, every decision after that was based on a wrong number, so the whole hand was garbage even if the rest of the logic was fine.</p>



<p>There&#8217;s a useful way to think about reliability problems like that: the <strong>March of Nines</strong>. Getting an LLM-based system to 90% reliability is the first nine, and it&#8217;s the &#8220;easy&#8221; one. Getting from 90% to 99% takes roughly the same amount of engineering effort. So does getting from 99% to 99.9%. Each nine costs about as much as the last, and you never stop marching. Andrej Karpathy coined the term from his experience building self-driving systems at Tesla, where they spent years earning two or three nines and still had more to go.</p>



<p>Here&#8217;s a small exercise that shows how that kind of failure compounding works. Open any AI chatbot running an early 2026 model (I used ChatGPT 5.3 Instant) and paste the following eight prompts one at a time, each in a separate message. Go ahead, I&#8217;ll wait.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Prompt 1:</strong> Track a running &#8220;score&#8221; through a 7-step game. Do not use code, Python, or tools. Do this entirely in your head. For each step, I will give you a sentence and a rule.</p>



<p>CRITICAL INSTRUCTION: You must reply with ONLY the mathematical equation showing how you updated the score. Example format: 10 + 5 = 15 or 20 / 2 = 10. Do not list the words you counted, do not explain your reasoning, and do not write any other text. Just the equation.</p>



<p>Start with a score of 10. I&#8217;ll give you the first step in the next prompt.</p>



<p><strong>Prompt 2:</strong> &#8220;The sudden blizzard chilled the small village communities.&#8221; Add the number of words containing double letters (two of the exact same letter back-to-back, like &#8216;tt&#8217; or &#8216;mm&#8217;).</p>



<p><strong>Prompt 3:</strong> &#8220;The clever engineer needed seven perfect pieces of cheese.&#8221; If your score is ODD, add the number of words that contain EXACTLY two &#8216;e&#8217;s. If your score is EVEN, subtract the number of words that contain EXACTLY two &#8216;e&#8217;s. (Do not count words with one, three, or zero &#8216;e&#8217;s).</p>



<p><strong>Prompt 4:</strong> &#8220;The good sailor joined the eager crew aboard the wooden boat.&#8221; If your score is greater than 10, subtract the number of words containing consecutive vowels (two different or identical vowels back-to-back, like &#8216;ea&#8217;, &#8216;oo&#8217;, or &#8216;oi&#8217;). If your score is 10 or less, multiply your score by this number.</p>



<p><strong>Prompt 5:</strong> &#8220;The quick brown fox jumps over the lazy dog.&#8221; Add the number of words where the THIRD letter is a vowel (a, e, i, o, u).</p>



<p><strong>Prompt 6:</strong> &#8220;Three brave kings stand under black skies.&#8221; If your score is an ODD number, subtract the number of words that have exactly 5 letters. If your score is an EVEN number, multiply your score by the number of words that have exactly 5 letters.</p>



<p><strong>Prompt 7:</strong> &#8220;Look down, you shy owl, go fly away.&#8221; Subtract the number of words that contain NONE of these letters: a, e, or i.</p>



<p><strong>Prompt 8:</strong> &#8220;Green apples fall from tall trees.&#8221; If your score is greater than 15, subtract the number of words containing the letter &#8216;a&#8217;. If your score is 15 or less, add the number of words containing the letter &#8216;l&#8217;.</p>
</blockquote>



<p>The exercise tracks a running score through seven steps. Each step gives the model a sentence and a counting rule, and the score carries forward. The correct final score is <strong>60</strong>. Here&#8217;s the answer key: start at 10, then 16 (10+6), 12 (16−4), 5 (12−7), 10 (5+5), 70 (10×7), 63 (70−7), 60 (63−3).</p>



<p>I ran this twice, in parallel (using ChatGPT 5.3 Instant), and got two completely different wrong answers. Neither run reached the correct score of 60:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Step</strong></td><td><strong>Correct</strong></td><td><a href="https://chatgpt.com/share/69b1a6be-86ec-8005-97e3-e3d7bd61b3e6"><strong>Run 1</strong></a><strong> (</strong><a href="https://gist.github.com/andrewstellman/d73e23e07eca411d4384ec2736afafd0#file-run-1-chatgpt-track-running-score-md"><strong>transcript</strong></a><strong>)</strong></td><td><a href="https://chatgpt.com/share/69b1a6d3-b458-8005-ba35-0f846f0dbcda"><strong>Run 2</strong></a><strong> (</strong><a href="https://gist.github.com/andrewstellman/d73e23e07eca411d4384ec2736afafd0#file-run-2-chatgpt-score-tracking-game-md"><strong>transcript</strong></a><strong>)</strong></td></tr><tr><td>1. Double letters</td><td>10 + 6 = 16</td><td>10 + 2 = 12 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>10 + 5 = 15 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>2. Exactly two &#8216;e&#8217;s</td><td>16 − 4 = 12</td><td>12 − 4 = 8 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>15 + 4 = 19 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>3. Consecutive vowels</td><td>12 − 7 = 5</td><td>8 × 7 = 56 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>19 − 5 = 14 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>4. 
Third letter vowel</td><td>5 + 5 = 10</td><td>56 + 5 = 61 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>14 + 3 = 17 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>5. Exactly 5 letters</td><td>10 × 7 = 70</td><td>61 − 7 = 54 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>17 − 4 = 13 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>6. No a, e, or i</td><td>70 − 7 = 63</td><td>54 − 7 = 47 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>13 − 3 = 10 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr><tr><td>7. Letter &#8216;a&#8217; or &#8216;l&#8217;</td><td>63 − 3 = 60</td><td>47 − 3 = 44 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td><td>10 + 4 = 14 <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;" /></td></tr></tbody></table></figure>



<p>The two runs tell very different stories. In Run 1, the model miscounted in Step 1 (found 2 double-letter words instead of 6) but actually got the later counts right. It didn&#8217;t matter. The wrong score in Step 1 flipped a branch in Step 3, triggering a multiply instead of a subtract, and the score never recovered. One early mistake threw off the entire chain, even though the model was doing good work after that.</p>



<p>Run 2 was a disaster. The model miscounted at almost every step, compounding errors on top of errors. It ended at 14 instead of 60. That&#8217;s closer to what Karpathy is describing with the March of Nines: Each step has its own reliability ceiling, and the longer the chain, the higher the chance that at least one step fails and corrupts everything downstream.</p>



<p>What makes this insidious: Both runs look the same from the outside. Each step produced a plausible answer, and both runs produced final results. Without the answer key (or some tedious manual checking), you&#8217;d have no way of knowing that Run 1 was a near-miss derailed by a single early error and Run 2 was wrong at nearly every step. This is typical of any process where the output of one LLM call becomes the input for the next one.</p>



<p>These failures don&#8217;t demonstrate the March of Nines itself—that&#8217;s specifically about the engineering effort to push reliability from 90% to 99% to 99.9%. (It&#8217;s possible to reproduce the full compounding-reliability problem in a chat, but a prompt that did it reliably would be far too long to put in an article.) Instead, I opted for a shorter exercise, one you can easily try yourself, that demonstrates the underlying problem that makes the march so hard: <strong>cascading failures</strong>. Each step asks the model to count letters inside words, which is deterministic work that a short Python script handles perfectly. LLMs, on the other hand, don&#8217;t actually treat words as strings of characters; they see them as tokens. Spotting double letters means unpacking a token into its characters, and the model gets that wrong just often enough to reliably screw it up. I added branching logic where each step&#8217;s result determines the next step&#8217;s operation, so a single miscount in Step 1 cascades through the entire sequence.</p>



<p>I also want to be clear about exactly what a deterministic version of this simulation looks like. Luckily, the AI can help us with that. Go to either run (or your own) and paste one more prompt into the chat:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Prompt 9:</strong> Now write a short Python script that does exactly what you just did: start with a score of 10, apply each of the seven rules to the seven sentences, and print the equation at each step.</p>
</blockquote>



<p>Run the script. It should print the correct answer for every step, ending at 60. The same AI that just failed the exercise can write code that does it flawlessly, because <em>now it&#8217;s generating deterministic logic</em> instead of trying to count characters through its tokenizer.</p>
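<p>For reference, here&#8217;s a minimal sketch of what such a script might look like. This is my own version, written directly from the prompts above rather than taken from a model&#8217;s output, and the helper names are arbitrary:</p>

```python
# Deterministic re-implementation of the seven-step exercise.
# Start at 10 and apply each rule exactly as the prompts state it.
import re

def words(sentence):
    return re.findall(r"[a-z]+", sentence.lower())

def has_double_letter(w):
    return any(a == b for a, b in zip(w, w[1:]))

def has_consecutive_vowels(w):
    v = set("aeiou")
    return any(a in v and b in v for a, b in zip(w, w[1:]))

score = 10

# Step 1: add words with double letters
n = sum(has_double_letter(w) for w in words(
    "The sudden blizzard chilled the small village communities."))
score += n                                      # 10 + 6 = 16

# Step 2: exactly two 'e's; odd score adds, even score subtracts
n = sum(w.count("e") == 2 for w in words(
    "The clever engineer needed seven perfect pieces of cheese."))
score = score + n if score % 2 else score - n   # 16 - 4 = 12

# Step 3: consecutive vowels; > 10 subtracts, otherwise multiplies
n = sum(has_consecutive_vowels(w) for w in words(
    "The good sailor joined the eager crew aboard the wooden boat."))
score = score - n if score > 10 else score * n  # 12 - 7 = 5

# Step 4: add words whose third letter is a vowel
n = sum(len(w) >= 3 and w[2] in "aeiou" for w in words(
    "The quick brown fox jumps over the lazy dog."))
score += n                                      # 5 + 5 = 10

# Step 5: exactly 5 letters; odd subtracts, even multiplies
n = sum(len(w) == 5 for w in words(
    "Three brave kings stand under black skies."))
score = score - n if score % 2 else score * n   # 10 * 7 = 70

# Step 6: subtract words containing none of a, e, i
n = sum(not set(w) & set("aei") for w in words(
    "Look down, you shy owl, go fly away."))
score -= n                                      # 70 - 7 = 63

# Step 7: > 15 subtracts words with 'a'; otherwise adds words with 'l'
ws = words("Green apples fall from tall trees.")
score += -sum("a" in w for w in ws) if score > 15 else sum("l" in w for w in ws)

print(score)  # 60
```

<p>Every intermediate value matches the answer key, because character-level counting is deterministic work that code gets right on every run.</p>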



<h2 class="wp-block-heading"><strong>Reproducing a cascading failure in a chat</strong></h2>



<p>I deliberately engineered the exercise earlier to give you a way to experience the cascading failure problem behind the March of Nines yourself. I took advantage of something current LLMs genuinely suck at: parsing characters inside tokens. Future models might do a much better job with this specific kind of failure, but the cascading failure problem doesn&#8217;t go away when the model gets smarter. As long as LLMs are nondeterministic, any step that relies on them has a reliability ceiling below 100%, and those ceilings still multiply. The specific weakness changes; the math doesn&#8217;t.</p>
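<p>To put rough numbers on that multiplication, here&#8217;s a quick back-of-the-envelope calculation. The 95% per-step success rate is a figure I chose for illustration, not a measurement:</p>

```python
# If each pipeline step independently succeeds 95% of the time,
# end-to-end reliability decays multiplicatively with chain length.
per_step = 0.95
for steps in (7, 20, 50):
    print(f"{steps:>2} steps: {per_step ** steps:.1%} of runs finish clean")
```

<p>At 95% per step, even the seven-step exercise above finishes clean only about 70% of the time, and a 50-step pipeline finishes clean less than 8% of the time. That&#8217;s the sense in which the ceilings multiply.</p>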



<p>I also specifically asked the model to show only the equation and skip all intermediate reasoning to prevent it from using <strong>chain of thought</strong> (or CoT) to self-correct. Chain of thought is a technique where you require the model to show its work step by step (for example, listing the words it counted and explaining why each one qualifies), which helps it catch its own mistakes along the way. CoT is a common way to improve LLM accuracy, and it works. As you&#8217;ll see later when I talk about the evolution of my blackjack simulation, CoT cut certain errors roughly in half. But &#8220;half as many errors&#8221; is still not zero. Plus, it&#8217;s expensive: It costs more tokens and more time. A Python script that counts double letters gets the right answer on every run, instantly, for zero AI API costs (or, if you&#8217;re running the AI locally, for orders of magnitude less CPU usage). That&#8217;s the core tension: You can spend engineering effort making the LLM better at deterministic work, or you can just hand it to code.</p>



<p>Every step in this exercise is deterministic work that code handles flawlessly. But most interesting LLM tasks aren&#8217;t like that. You can&#8217;t write a deterministic script that plays a hand of blackjack using natural-language strategy rules, or decides how a character should respond in dialogue. Real work requires chaining multiple steps together into a <strong>pipeline</strong>: a reproducible series of steps (some deterministic, some requiring an LLM) that leads to a single result, where each step&#8217;s output feeds the next. If that sounds like what you just saw in the exercise, it is. Except real pipelines are longer, more complex, and much harder to debug when something goes wrong in the middle.</p>



<h2 class="wp-block-heading"><strong>LLM pipelines are especially susceptible to the March of Nines</strong></h2>



<p>I&#8217;ve been spending a lot of time thinking about LLM pipelines, and I suspect I&#8217;m in the minority. Most people using LLMs are working with single prompts or short conversations. But once you start building multistep workflows where the AI generates structured data that feeds into the next step—whether that&#8217;s a content generation pipeline, a data processing chain, or a simulation—you run straight into the March of Nines. Each step has a reliability ceiling, and those ceilings multiply. The exercise you just tried had seven steps. The blackjack pipeline has more, and I&#8217;ve been running it hundreds of times per iteration.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="812" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10.png" alt="The blackjack pipeline in Octobatch" class="wp-image-18301" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-300x152.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-768x390.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/image-10-1536x780.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>The blackjack pipeline in </em><a href="https://github.com/andrewstellman/octobatch/" target="_blank" rel="noreferrer noopener"><em>Octobatch</em></a><em>, an open source batch orchestrator for multistep LLM workflows that I introduced in “</em><a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener"><em>The Accidental Orchestrator</em></a>.”</figcaption></figure>



<p>That&#8217;s a screenshot of the blackjack pipeline in <a href="https://github.com/andrewstellman/octobatch/" target="_blank" rel="noreferrer noopener">Octobatch</a>, the tool I built to run these pipelines at scale. That pipeline deals cards deterministically, asks the LLM to play each hand following a strategy described in plain English, then validates the results with deterministic code. Octobatch makes it easy to change the pipeline and rerun hundreds of hands, which is how I iterated through eight versions—and how I really learned the hard way that the March of Nines wasn&#8217;t just a theoretical problem but something I could watch happening in real time across hundreds of data points.</p>



<p>Running pipelines at scale made the failures obvious and immediate, which, for me, really underscored an effective approach to minimizing the cascading failure problem: <strong>make deterministic work deterministic</strong>. That means asking whether every step in the pipeline actually needs to be an LLM call. Checking that a jack, a five, and an eight add up to 23 doesn&#8217;t require a language model. Neither does looking up whether standing on 15 against a dealer 10 follows basic strategy. That&#8217;s arithmetic and a lookup table—work that ordinary code does perfectly every time. And as I learned over the course of improving the failure rate for the pipeline, every step you pull out of the LLM and make deterministic goes to 100% reliability, which stops it from contributing to the compound failure rate.</p>
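<p>Here&#8217;s a sketch of what those deterministic checks could look like. The function names and the strategy-table entries are illustrative, not Octobatch&#8217;s actual code:</p>

```python
# Deterministic verification of an LLM-played blackjack hand:
# recompute the total and look the action up in a strategy table.

CARD_VALUES = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8,
               "9": 9, "10": 10, "J": 10, "Q": 10, "K": 10, "A": 11}

def hand_total(cards):
    """Best blackjack total: count aces as 11, downgrade to 1 on bust."""
    total = sum(CARD_VALUES[c] for c in cards)
    aces = cards.count("A")
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total

# A tiny slice of basic strategy: (player total, dealer upcard) -> action.
# A real table also covers soft hands, pairs, and doubling.
BASIC_STRATEGY = {(15, "10"): "hit", (15, "6"): "stand", (12, "2"): "hit"}

def verify_hand(cards, reported_total, dealer_up, action):
    """Return a list of rule violations; empty means the hand checks out."""
    errors = []
    actual = hand_total(cards)
    if actual != reported_total:
        errors.append(f"math: hand is {actual}, not {reported_total}")
    expected = BASIC_STRATEGY.get((actual, dealer_up))
    if expected and action != expected:
        errors.append(f"strategy: expected {expected} on {actual} vs {dealer_up}")
    return errors

# A jack, a five, and an eight total 23; a reported 21 is flagged instantly.
print(verify_hand(["J", "5", "8"], 21, "10", "stand"))
```

<p>No API call, no tokens, no nondeterminism: the check either passes or it doesn&#8217;t, identically on every run.</p>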



<p>Relying on the AI for deterministic work is the computation side of a pattern I wrote about for data in “<a href="https://www.oreilly.com/radar/ai-mcp-and-the-hidden-costs-of-data-hoarding/" target="_blank" rel="noreferrer noopener">AI, MCP, and the Hidden Costs of Data Hoarding</a>.” Teams dump everything into the AI&#8217;s context because the AI can handle it—until it can&#8217;t. The same thing happens with computation: Teams let the AI do arithmetic, string matching, or rule evaluation because it mostly works. But &#8220;mostly works&#8221; is expensive and slow, and a short script does it perfectly. Better yet, the AI can <em>write</em> that script for you—which is exactly what Prompt 9 demonstrated.</p>



<h2 class="wp-block-heading"><strong>Getting cascading failures out of the blackjack pipeline</strong></h2>



<p>I pushed the blackjack pipeline through eight iterations, and the results taught me more about earning nines than I expected. That&#8217;s why I&#8217;m writing this article—the iteration arc turned out to be one of the clearest illustrations I&#8217;ve found of how the principle works in practice.</p>



<p>I addressed failures two ways, and the distinction matters.</p>



<p>Some failures called for making work deterministic. Card dealing runs as a local expression step, which doesn&#8217;t require an API call, so it&#8217;s free, instant, and 100% reproducible. There&#8217;s a math verification step that uses code to recalculate totals from the actual cards dealt and compare them against what the LLM reported, and a strategy compliance step that checks the player&#8217;s first action against a deterministic lookup table. Neither of those steps requires an AI judgment call; when I originally ran them as LLM calls, they introduced errors that were hard to detect and expensive to debug.</p>
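<p>A strategy compliance check of that sort fits in a few lines. The table entries below are a hypothetical fragment, not a full basic-strategy chart (a real one also distinguishes soft totals and pairs):</p>

```python
# Hypothetical fragment of a basic-strategy lookup table. Keys are
# (player_total, dealer_upcard_value); values are the correct first action.
BASIC_STRATEGY = {
    (15, 10): "hit",   # standing on 15 against a dealer 10 is an error
    (15, 6): "stand",
    (12, 4): "stand",
}

def first_action_compliant(player_total, dealer_upcard, action):
    """Deterministic compliance check: compare the reported action against
    the table instead of asking a second LLM to judge it."""
    expected = BASIC_STRATEGY.get((player_total, dealer_upcard))
    return expected is None or action == expected
```

<p>The deterministic version is also trivially auditable: when it rejects a hand, the table entry tells you exactly why.</p>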



<p>Other failures called for structural constraints that made specific error patterns harder to produce. Chain of thought format forced the LLM to show its work instead of jumping to conclusions. The rigid dealer output structure made it mechanically difficult to skip the dealer&#8217;s turn. Explicit warnings about counterintuitive rules gave the LLM a reason to override its training priors. These don&#8217;t eliminate the LLM from the step—they make the LLM more reliable within it.</p>



<p>But before any of that mattered, I had to face the uncomfortable fact that <em>measurements themselves can be wrong, </em><strong><em>especially when relying on AI to take those measurements</em></strong><em>.</em> For example, the first run reported a 57% pass rate, which was great! But when I looked at the data myself, a lot of runs were obviously wrong. It turned out that the pipeline had a bug: Verification steps were running, but the AI step that was supposed to enforce their results didn&#8217;t have adequate guardrails, so almost every hand passed regardless of the actual data. I asked three AI advisors to review the pipeline, and none of them caught it. The only thing that exposed it was checking the aggregate numbers, which didn&#8217;t add up. If you let probabilistic behavior into a step that should be deterministic, the output will look plausible and the system will report success, but you have no way to know something&#8217;s wrong until you go looking for it.</p>
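<p>That kind of aggregate audit can itself be made deterministic. This is a hypothetical sketch, not Octobatch&#8217;s code: recompute each hand&#8217;s total in ordinary code and count how often the recomputation disagrees with the verdict the pipeline reported.</p>

```python
# Hypothetical audit sketch: a high pass rate combined with frequent
# disagreements means a "verification" step is rubber-stamping results.
def disagreement_rate(results, recompute_total):
    disagreements = 0
    for r in results:
        arithmetic_ok = recompute_total(r["cards"]) == r["reported_total"]
        verdict_ok = r["verdict"] == "pass"
        if arithmetic_ok != verdict_ok:
            disagreements += 1
    return disagreements / len(results)

# Two hands, both marked "pass", but the second reported total is wrong.
hands = [
    {"cards": [10, 5, 8], "reported_total": 23, "verdict": "pass"},
    {"cards": [10, 5, 8], "reported_total": 25, "verdict": "pass"},
]
assert disagreement_rate(hands, sum) == 0.5
```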



<p>Once I fixed the bug, the real pass rate emerged: 31%. Here&#8217;s how the next seven iterations played out:</p>



<ul class="wp-block-list">
<li><strong>Restructuring the data (31% → 37%).</strong> The LLM kept losing track of where it was in the deck, so I restructured the data it received to eliminate the bookkeeping. I also removed split hands entirely, because tracking two simultaneous hands is stateful bookkeeping that LLMs reliably botch. Each fix came from looking at what was actually failing and asking whether the LLM needed to be doing that work at all.</li>



<li><strong>Chain of thought arithmetic (37% → 48%).</strong> Instead of letting the LLM jump to a final card total, I required it to show the running math at every step. Forcing the model to trace its own calculations cut multidraw errors roughly in half. CoT is a structural constraint, not a deterministic replacement; it makes the LLM more reliable within the step, but it&#8217;s also more expensive because it uses more tokens and takes more time.</li>



<li><strong>Replacing the LLM validator with deterministic code (48% → 79%).</strong> This was the single biggest improvement in the entire arc. The pipeline had a second LLM call that scored how accurately the player followed strategy, and it was wrong 73% of the time. It applied its own blackjack intuitions instead of the rules I&#8217;d given it. But there&#8217;s a right answer for every situation in basic strategy, and the rules can be written as a lookup table. Replacing the LLM validator with a deterministic expression step recovered over 150 incorrectly rejected hands.</li>



<li><strong>Rigid output format (79% → 81%).</strong> The LLM kept skipping the dealer&#8217;s turn entirely, jumping straight to declaring a winner. Requiring a step-by-step dealer output format made it mechanically difficult to skip ahead.</li>



<li><strong>Overriding the model&#8217;s priors (81% → 84%).</strong> One strategy required hitting on 18 against a high dealer card, which any conventional blackjack wisdom says is terrible. The LLM refused to do it. Restating the rule didn&#8217;t help. Explaining <em>why</em> the counterintuitive rule exists did: The prompt had to tell the model that the bad play was intentional.</li>



<li><strong>Switching models (84% → 94%).</strong> I switched from Gemini Flash 2.0 to Haiku 4.6, which was easy to do because Octobatch lets you run the same pipeline with any model from Gemini, Anthropic, or OpenAI. I finally earned my first nine.</li>
</ul>
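<p>The &#8220;rigid output format&#8221; fix can be enforced mechanically. This is an illustrative sketch (the line format is my invention, not the pipeline&#8217;s): require one line per dealer draw plus a final total, and reject any transcript that jumps straight to declaring a winner.</p>

```python
import re

# Hypothetical rigid dealer-output format: one "draws" line per card,
# then a mandatory final-total line. A transcript that skips the dealer's
# turn fails the shape check before any content is evaluated.
DRAW_LINE = re.compile(r"^Dealer draws \w+ \(running total \d+\)$")
FINAL_LINE = re.compile(r"^Dealer final total: \d+$")

def dealer_turn_shown(transcript: str) -> bool:
    lines = [l for l in transcript.strip().splitlines() if l.strip()]
    if not lines or not FINAL_LINE.match(lines[-1]):
        return False
    return all(DRAW_LINE.match(l) for l in lines[:-1])

assert dealer_turn_shown("Dealer draws 7 (running total 17)\nDealer final total: 17")
assert not dealer_turn_shown("Player wins with 20.")
```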



<h2 class="wp-block-heading"><strong>Find the best ways to earn your nines</strong></h2>



<p>If you&#8217;re building anything where LLM output feeds into the next step, the same question applies to every step in your chain: Does this actually require judgment, or is it deterministic work that ended up in the LLM because the LLM can do it? The strategy validator felt like a judgment call until I looked at what it was actually doing, which was checking a hand against a lookup table. That one recognition was worth more than all the prompt engineering combined. And as Prompt 9 showed, the AI is often the best tool for writing its own deterministic replacement.</p>



<p>I learned this lesson through my own work on the blackjack pipeline. It went through eight iterations, and I think the numbers tell a story. The fixes fell into two categories: making work deterministic (pulling it out of the LLM entirely) and adding structural constraints (making the LLM more reliable within a step). Both earn nines, but pulling work out of the LLM entirely earns those nines faster. The biggest single jump in the whole arc—48% to 79%—came from replacing an LLM validator with a 10-line expression.</p>



<p>Here&#8217;s the bottom line for me: <em>If you can write a short function that does the job, don&#8217;t give it to the LLM</em>. I initially reached for the LLM for strategy validation because it felt like a judgment call, but once I looked at the data I realized it wasn&#8217;t at all. There was a right answer for every hand, and a lookup table found it more reliably than a language model.</p>



<p>At the end of eight iterations, the pipeline passed 94% of hands. The 6% that still fail may be honest limits of what the model can do with multistep arithmetic and state tracking in a single prompt. But they may just be the next nine that I need to earn.</p>



<p>The next article looks at the other side of this problem: Once you know what to make deterministic, how do you make the whole system legible enough that an AI can help your users build with it? The answer turns out to be a kind of documentation you write for AI to read, not humans—and it changes the way you think about what a user manual is for.</p>
]]></content:encoded>
										</item>
		<item>
		<title>What Is the PARK Stack?</title>
		<link>https://www.oreilly.com/radar/what-is-the-park-stack/</link>
				<pubDate>Wed, 18 Mar 2026 11:15:43 +0000</pubDate>
					<dc:creator><![CDATA[Dean Wampler]]></dc:creator>
						<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18291</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/PARK-stack.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/PARK-stack-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Background: Stacks with four-letter acronyms According to Wikipedia, the LAMP stack was coined in 1998 by Michael Kunze to describe what had emerged as a popular open source software stack for websites. When the World Wide Web exploded in popularity earlier in the ’90s, organizations used an ad hoc mixture of proprietary tools and operating [&#8230;]]]></description>
								<content:encoded><![CDATA[
<h2 class="wp-block-heading">Background: Stacks with four-letter acronyms</h2>



<p>According to <a href="https://en.wikipedia.org/wiki/LAMP_(software_bundle)" target="_blank" rel="noreferrer noopener">Wikipedia</a>, the <em>LAMP stack</em> was coined in 1998 by Michael Kunze to describe what had emerged as a popular open source software stack for websites. When the World Wide Web exploded in popularity earlier in the ’90s, organizations used an ad hoc mixture of proprietary tools and operating systems, along with some open source software (OSS), to build websites. The LAMP stack quickly became the most popular set of fully OSS components for this purpose.</p>



<p>LAMP is an acronym that stands for the following:</p>



<ul class="wp-block-list">
<li><strong>Linux</strong> &#8211; the operating system </li>



<li><strong>Apache HTTP Server</strong> &#8211; the web server</li>



<li><strong>MySQL</strong> &#8211; the database</li>



<li><strong>Perl, PHP, and/or Python</strong> &#8211; the application programming language</li>
</ul>



<p>It is hard to believe this today, but at the time, the idea of relying on open source software was controversial. Concerns about support, and about vulnerabilities since the source code is visible to everyone, were eventually resolved. Open source was irresistible because of the great flexibility, cost efficiencies, freedom from vendor lock-in, and rapid evolution of capabilities provided by popular OSS projects. The LAMP stack became one of the predominant drivers of enterprise adoption of open source.</p>



<h2 class="wp-block-heading">The PARK stack</h2>



<p>Like the rise of the web, the sudden explosion of interest in generative AI with large language models (LLMs), vision models (VMs), and others has driven interest in identifying the best core OSS components for a software stack tailored to the requirements for generative AI. This era now has the PARK stack. It was first suggested by Ben Lorica in “<a href="https://gradientflow.substack.com/p/trends-shaping-the-future-of-ai-infrastructure?open=false#%C2%A7infrastructure-and-compute-fabric" target="_blank" rel="noreferrer noopener">Trends Shaping the Future of AI Infrastructure</a>,” in November last year.</p>



<p>PARK stands for the following:</p>



<ul class="wp-block-list">
<li><strong>PyTorch</strong> &#8211; for model training and inference</li>



<li><strong>AI models and agents</strong> &#8211; the heart of generative AI</li>



<li><strong>Ray</strong> &#8211; for fine-grained, very flexible distributed programming</li>



<li><strong>Kubernetes</strong> &#8211; the industry-standard cluster management system</li>
</ul>



<p>Here, I will provide a brief description of each one and the requirements it meets.</p>



<h2 class="wp-block-heading">PyTorch</h2>



<p>The AI stack needed by model builders provides the ability to train and tune models. Application builders need efficient, scalable inference with models and the agents that use them.</p>



<p><a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener">PyTorch</a> started as one of many tools for designing and training a variety of machine learning models. It’s now the most popular choice for this purpose. It is used to design and train many of the world’s most prominent generative AI models. Alternatives include <a href="https://docs.jaxstack.ai/en/latest/getting_started.html" target="_blank" rel="noreferrer noopener">JAX</a> and its predecessor, <a href="https://www.tensorflow.org" target="_blank" rel="noreferrer noopener">TensorFlow</a>.</p>



<p><a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener">PyTorch</a> was developed and open-sourced by Meta. It is now maintained by the <a href="https://pytorch.org/foundation/" target="_blank" rel="noreferrer noopener">PyTorch Foundation</a>. The ecosystem has expanded to include other <a href="https://pytorch.org/projects/" target="_blank" rel="noreferrer noopener">projects</a>, such as for inference (<a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noreferrer noopener">vLLM</a>), distributed training and inference (<a href="https://www.deepspeed.ai/" target="_blank" rel="noreferrer noopener">DeepSpeed</a> and <a href="https://www.ray.io/" target="_blank" rel="noreferrer noopener">Ray</a>), and many libraries.</p>



<p>The cost of model inference drives the need for specialized and highly optimized inference engines, like <a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noreferrer noopener">vLLM</a>. So, PyTorch is rarely used alone for inference, although the popular inference engines use PyTorch libraries.</p>



<p>Incidentally, the rise of generative AI has also caused a resurgence in popularity for Python, in part because Python has been the most popular language for data science, of which generative AI is a natural part.</p>



<h2 class="wp-block-heading">AI models and agents</h2>



<p>The unique capabilities of generative AI applications are provided by one or more models and agents that use them. The first wave of AI applications, often simple chatbots, used a single model that had been trained to understand human language very well, especially English, then tuned in various ways to use that language skill more effectively, such as answering questions, avoiding undesirable speech, providing factual output, etc.</p>



<p>Model architecture has rapidly evolved, including making smaller, more capable models and using collections of models (such as the <a href="https://huggingface.co/blog/moe" target="_blank" rel="noreferrer noopener">mixture of experts</a> architecture) that provide better efficiency while maintaining result quality.</p>



<p>However, models have some particular shortcomings. For example, they know nothing of events that happened after they were trained, and they are not trained on all possible specialist data needed to be effective for every possible domain. Hence, application patterns rapidly emerged to complement the strengths of models. The first pattern was <a href="https://www.ibm.com/docs/en/watsonx/saas?topic=solutions-retrieval-augmented-generation" target="_blank" rel="noreferrer noopener">RAG</a> (retrieval-augmented generation), where a repository of data is queried for relevant context information, which is then sent as <em>context</em> along with the user query to a model for inference.</p>
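<p>A minimal sketch of the RAG pattern just described (the retrieval here is naive keyword overlap standing in for real vector search; the documents and function names are invented for illustration):</p>

```python
import re

# Toy corpus standing in for the repository of data described above.
DOCS = [
    "Ray is a distributed computing framework from UC Berkeley.",
    "Kubernetes manages containerized workloads across clusters.",
    "LAMP stands for Linux, Apache, MySQL, and Perl/PHP/Python.",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, docs, k=1):
    # Naive relevance: keyword overlap; real RAG systems use embeddings.
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)),
                  reverse=True)[:k]

def build_prompt(query):
    # The retrieved snippets become the context sent alongside the query.
    context = "\n".join(retrieve(query, DOCS))
    return f"Context:\n{context}\n\nQuestion: {query}"

# build_prompt("...") is then handed to a model for inference.
```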



<p>The more general approach today is agents, which have been <a href="https://cloud.google.com/discover/what-are-ai-agents" target="_blank" rel="noreferrer noopener">defined this way</a>: “software systems that use AI to pursue goals and complete tasks on behalf of users. They show reasoning, planning, and memory and have a level of autonomy to make decisions, learn, and adapt.” Pursuing user goals can mean finding and retrieving relevant contextual data, evaluating the quality and utility of retrieved information, summarizing findings, gracefully recovering from errors, etc.</p>



<p>There is no one dominant model choice or even “family” of models. Similarly, there is no one agent framework to rule them all. This reflects not only the very rapid evolution of models and agent design patterns but also the diversity of possible AI applications, which makes it unlikely that any one choice will meet all needs.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Want Radar delivered straight to your inbox? Join us on Substack. <a href="https://oreillyradar.substack.com/?utm_campaign=profile_chips" target="_blank" rel="noreferrer noopener">Sign up here</a>.</em></p>
</blockquote>



<h2 class="wp-block-heading">Ray</h2>



<p>Model training, various forms of tuning, and model inference involve different distributed computing patterns, each requiring highly optimized implementations given the large energy consumption and related costs associated with generative AI. Single-GPU systems are too small for these tasks for the largest generative models. Even for smaller models, massive parallelism allows these processes to scale more effectively.</p>



<p>For model training, and for tuning processes that involve additional training with new data, a massive number of iterations is used: In each loop, data is passed through the model, and the model parameters (weights) are adjusted incrementally to reduce errors. These iterations must be fast and efficient. When the model parameters are distributed over several GPUs, very high-bandwidth exchange of updates is required. Training iterations have large memory footprints and massive data exchanges.</p>
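<p>The shape of that loop can be shown with a toy example in plain Python: a one-parameter “model” y = w * x, with the weight nudged downhill on each pass. Frameworks like PyTorch run this same loop shape at enormous scale, with gradients computed automatically and the work distributed across GPUs.</p>

```python
# Toy training loop: forward pass, error gradient, incremental update.
# The "model" is y = w * x with a single learnable weight.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
w, lr = 0.0, 0.05

for _ in range(200):                  # many fast iterations
    for x, y in data:
        pred = w * x                  # forward pass: data through the model
        grad = 2 * (pred - y) * x     # gradient of the squared error w.r.t. w
        w -= lr * grad                # adjust the weight to reduce error

assert abs(w - 2.0) < 1e-3            # the weight converges toward 2
```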



<p><a href="https://cloud.google.com/discover/what-is-reinforcement-learning" target="_blank" rel="noreferrer noopener">Reinforcement learning</a> is another part of tuning, used to improve more <a href="https://openai.com/form/rft-research-program/" target="_blank" rel="noreferrer noopener">complex behaviors</a> for domains. RL also requires massive amounts of fast iterations, but the size scales and data access patterns are typically smaller, more fine-grained, and more heterogeneous.</p>



<p>Finally, inference distributed computing patterns are the same as the first step in a training iteration, where data flows through a model, but there isn’t a parameter update step.</p>



<p>Ray provides the flexibility for these disparate requirements. It is a fine-grained distributed programming system with an intuitive <a href="https://en.wikipedia.org/wiki/Actor_model" target="_blank" rel="noreferrer noopener">actor model</a> abstraction. Ray was developed by researchers at the University of California, Berkeley, who needed an efficient and easy-to-use system for scaling up the computation required for their reinforcement learning and AI research. The flexibility of Ray’s abstractions and the efficiency of its implementation make Ray well suited for the new distributed computing requirements generative AI has introduced.</p>



<p><a href="https://anyscale.com" target="_blank" rel="noreferrer noopener">Anyscale</a> is a startup focused on productizing Ray. Ray’s core OSS was recently donated to the PyTorch Foundation, as mentioned above.</p>



<h2 class="wp-block-heading">Kubernetes</h2>



<p>Large-scale model training and tuning, as well as scalable application deployment patterns, introduce many practical requirements, including management of clusters of heterogeneous hardware and other resources, as well as the processes running on them. <a href="https://kubernetes.io" target="_blank" rel="noreferrer noopener">Kubernetes</a> has been the industry standard for cluster management for a decade, emerging from Google’s work on <a href="https://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/" target="_blank" rel="noreferrer noopener">Borg</a>, along with contributions from many other organizations. Kubernetes is part of the <a href="https://www.linuxfoundation.org" target="_blank" rel="noreferrer noopener">Linux Foundation</a>. The main alternatives to Kubernetes are the management tools offered by the cloud vendors: AWS, Microsoft Azure, Google Cloud, and others. The advantage of Kubernetes is that it runs seamlessly on these platforms (offered as a service, or you can “roll your own”), as well as on-premises, providing the benefits of the cloud services but without vendor lock-in.</p>



<p>At first glance, it might appear that the distributed capabilities of Ray and Kubernetes overlap, but in fact they are complementary. Ray is for very fine-grained and lightweight distributed computing and memory management, while Kubernetes provides more coarse-grained management and a broad suite of application services required in modern environments (like security, user management, logging and tracing, etc.). It is common for a <em>containerized</em> Ray application to run its own concept of clustered processes within a set of containers in a Kubernetes cluster. In fact, the open source <a href="https://github.com/ray-project/kuberay" target="_blank" rel="noreferrer noopener">KubeRay operator</a> allows you to use Ray on Kubernetes without having to be an expert in Ray or container management.</p>



<h2 class="wp-block-heading">What’s missing from PARK?</h2>



<p>LAMP was never intended to provide everything needed for website deployments. It was the core upon which additional services were added as required. PARK is similar, although the presence of Kubernetes covers a lot of the general-purpose service requirements!</p>



<p>For generative AI applications, PARK users will have to think about new requirements, in addition to all the standard practices we have used for years. Let’s discuss a few topics.</p>



<h3 class="wp-block-heading">Data and data management</h3>



<p>Conventional data management requirements and practices still apply, but AI agents are driving changes too. Ben’s post on <a href="https://gradientflow.substack.com/p/data-engineering-for-machine-users" target="_blank" rel="noreferrer noopener">data engineering for machine users</a> discusses a number of trends. For example, some providers are seeing agents dominate the creation of new database tables, and those tables are often ephemeral. Compared to humans, agents are less tolerant of database query problems and less careful about security concerns.</p>



<p>Unstructured, multimodal data is growing in importance: video and audio as well as text. Use of specialized forms of structured data is also growing, like <a href="https://www.oreilly.com/radar/unbundling-the-graph-in-graphrag/" target="_blank" rel="noreferrer noopener">knowledge graphs</a> and <a href="https://www.ibm.com/think/topics/vector-database" target="_blank" rel="noreferrer noopener">vector databases</a> for RAG applications, and <a href="https://www.databricks.com/blog/what-feature-store-complete-guide-ml-feature-engineering" target="_blank" rel="noreferrer noopener">feature stores</a> for structuring data more effectively.</p>



<h3 class="wp-block-heading">Agent orchestration</h3>



<p>Any distributed system needs careful management of the interactions between components, for purposes of security, resource management, and efficacy. The <a href="https://modelcontextprotocol.io/docs/getting-started/intro" target="_blank" rel="noreferrer noopener">Model Context Protocol</a> (MCP) and the <a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" target="_blank" rel="noreferrer noopener">Agent2Agent Protocol</a> (A2A) are two of several emerging standards that allow models to discover available agent services and learn how to use them automatically. These promising capabilities also raise many concerns about security and the need for careful control, which is driving the emergence of new gateway and service projects tailored to the specific needs of agent-based applications, for example, <a href="https://ibm.github.io/mcp-context-forge/" target="_blank" rel="noreferrer noopener">ContextForge</a>. Similarly, supporting features are being added to established tools to meet the same needs.</p>



<h3 class="wp-block-heading">Memory management</h3>



<p>Agents must manage and use the information they have acquired. This includes working within the available context limitations for their models and focusing on the most useful information, to optimize their use of resources and effectiveness. <a href="https://www.ibm.com/think/topics/ai-agent-memory" target="_blank" rel="noreferrer noopener">AI agent memory</a> is an ongoing research topic with <a href="https://medium.com/@jununhsu/6-open-source-ai-memory-tools-to-give-your-agents-long-term-memory-39992e6a3dc6" target="_blank" rel="noreferrer noopener">projects</a> and startups emerging, like <a href="https://memverge.ai" target="_blank" rel="noreferrer noopener">MemVerge</a> and <a href="https://mem0.ai" target="_blank" rel="noreferrer noopener">Mem0</a>, which emphasize the effective use of short-term (i.e., single session) memory. Established persistence tools are also being applied to the problem, e.g., <a href="https://deepwiki.com/neo4j-labs/agent-memory" target="_blank" rel="noreferrer noopener">Neo4j</a> and <a href="https://redis.io/blog/ai-agent-memory-stateful-systems/" target="_blank" rel="noreferrer noopener">Redis</a>, which also support longer-term memory across sessions.</p>



<p><a href="https://getdex.sh/what-is-dex" target="_blank" rel="noreferrer noopener">Dex</a> is a new approach that addresses a particular challenge caused by MCP and A2A: the explosion of information that gets added to the inference context memory. This memory is limited and performance quickly degrades when the context grows too large. Dex takes what an agent learns how to do once, like using MCP to learn how to query GitHub for repo information, and turns that knowledge into reusable code that both eliminates unnecessary repetition of the learning step and executes the task deterministically outside the model context. Dex also provides a form of long-term memory.</p>



<h2 class="wp-block-heading">What’s next?</h2>



<p>What are your thoughts about the PARK stack? What do you think of the four components versus alternatives? What AI application requirements do you think need more attention? Let us know!</p>
]]></content:encoded>
										</item>
		<item>
		<title>Stop Closing the Door. Fix the House.</title>
		<link>https://www.oreilly.com/radar/stop-closing-the-door-fix-the-house/</link>
				<pubDate>Tue, 17 Mar 2026 11:23:28 +0000</pubDate>
					<dc:creator><![CDATA[Angie Jones]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18281</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Stop-Closing-the-Door-Fix-the-House.png" 
				medium="image" 
				type="image/png" 
				width="1152" 
				height="896" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/03/Stop-Closing-the-Door-Fix-the-House-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on Angie Jones&#8217;s website and is being republished here with the author&#8217;s permission. I’ve been seeing more and more open source maintainers throwing up their hands over AI-generated pull requests. Going so far as to stop accepting PRs from external contributors. If you’re an open source maintainer, you’ve felt this [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><em>The following article originally appeared on <a href="https://angiejones.tech/stop-closing-the-door-fix-the-house/" target="_blank" rel="noreferrer noopener">Angie Jones&#8217;s website</a> and is being republished here with the author&#8217;s permission.</em></td></tr></tbody></table></figure>



<p>I’ve been seeing more and more open source maintainers throwing up their hands over AI-generated pull requests. Going so far as to stop accepting PRs from external contributors.</p>



<figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter"><div class="wp-block-embed__wrapper">
<blockquote class="twitter-tweet" data-width="500" data-dnt="true"><p lang="en" dir="ltr">This week we&#39;re going to begin automatically closing pull requests from external contributors. I hate this, sorry. <a href="https://t.co/85GLG7i1fU">pic.twitter.com/85GLG7i1fU</a></p>&mdash; tldraw (@tldraw) <a href="https://twitter.com/tldraw/status/2011911073834672138?ref_src=twsrc%5Etfw">January 15, 2026</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div></figure>



<figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter"><div class="wp-block-embed__wrapper">
<blockquote class="twitter-tweet" data-width="500" data-dnt="true"><p lang="en" dir="ltr">Ghostty is getting an updated AI policy. AI assisted PRs are now only allowed for accepted issues. Drive-by AI PRs will be closed without question. Bad AI drivers will be banned from all future contributions. If you&#39;re going to use AI, you better be good. <a href="https://t.co/AJRX79S8XD">https://t.co/AJRX79S8XD</a></p>&mdash; Mitchell Hashimoto (@mitchellh) <a href="https://twitter.com/mitchellh/status/2014433315261124760?ref_src=twsrc%5Etfw">January 22, 2026</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div></figure>



<figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter"><div class="wp-block-embed__wrapper">
<blockquote class="twitter-tweet" data-width="500" data-dnt="true"><p lang="en" dir="ltr">AI is killing Open Source and it’s saddening. Basically, a bunch of people who now believe they’re geniuses because of LLMs have been spamming OSS projects with junk submissions causing some maintainers to limit contributions from the general public.</p>&mdash; ASH<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1fa84.png" alt="🪄" class="wp-smiley" style="height: 1em; max-height: 1em;" /> (@ahmxrd) <a href="https://twitter.com/ahmxrd/status/2020025872342769850?ref_src=twsrc%5Etfw">February 7, 2026</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div></figure>



<p>If you’re an open source maintainer, you’ve felt this pain. We all have. It’s frustrating reviewing PRs that not only ignore the project’s coding conventions but also are riddled with AI slop.</p>



<p>But yo, what are we doing?! Closing the door on contributors isn’t the answer. Open source maintainers don’t want to hear this, but this is the way people code now, and you need to do your part to prepare your repo for AI coding assistants.</p>



<p>I’m a maintainer on <a href="https://github.com/block/goose" target="_blank" rel="noreferrer noopener">goose</a>, which has more than 300 external contributors. We felt this frustration early on, but instead of pushing well-meaning contributors away, we did the work to help them contribute with AI <em>responsibly</em>.</p>



<h2 class="wp-block-heading"><strong>1. Tell humans how to use AI on your project</strong></h2>



<p>We created a <a href="https://github.com/block/goose/blob/main/HOWTOAI.md" target="_blank" rel="noreferrer noopener">HOWTOAI.md</a> file as a straightforward guide for contributors on how to use AI tools responsibly when working on our codebase. It covers things like:</p>



<ul class="wp-block-list">
<li>What AI is good for (boilerplate, tests, docs, refactoring) and what it’s not (security critical code, architectural changes, code you don’t understand)</li>



<li>The expectation that you are accountable for every line you submit, AI-generated or not</li>



<li>How to validate AI output before opening a PR: build it, test it, lint it, understand it</li>



<li>Being transparent about AI usage in your PRs</li>
</ul>
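
<p>If you want to do something similar, a skeleton along these lines captures the same points. (This is an illustrative sketch, not the actual goose file; adapt the specifics to your project.)</p>

```markdown
# How to Use AI on This Project (sketch)

## Good uses for AI
- Boilerplate, test scaffolding, docs, mechanical refactors

## Not appropriate for AI
- Security-critical code, architectural changes,
  any code you can't explain line by line

## Before you open a PR
1. Build it locally
2. Run the test suite
3. Run the linter
4. Read every line — you are accountable for all of it

## Transparency
Note in your PR description which parts were AI-assisted.
```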



<p>This welcomes AI PRs but also sets clear expectations. Most contributors <em>want</em> to do the right thing; they just need to know what the right thing is.</p>



<p>And while you’re at it, take a fresh look at your CONTRIBUTING.md too. A lot of the problems people blame on AI are actually problems that always existed; AI just amplified them. Be specific. Don’t just say “follow the code style&#8221;; say what the code style is. Don’t just say “add tests”; show what a good test looks like in your project. The better your docs are, the better both humans and AI agents will perform.</p>



<h2 class="wp-block-heading"><strong>2. Tell the agents how to work on your project</strong></h2>



<p>Contributors aren’t the only ones who need instructions. The AI agents do too.</p>



<p>We have an <a href="https://github.com/block/goose/blob/main/AGENTS.md" target="_blank" rel="noreferrer noopener">AGENTS.md</a> file that AI coding agents can read to understand our project conventions. It includes the project structure, build commands, test commands, linting steps, coding rules, and explicit “never do this” guardrails.</p>
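
<p>The shape of such a file is simple: plain markdown the agent can parse. A hypothetical sketch (the layout, commands, and rules below are illustrative, not copied from goose’s actual AGENTS.md):</p>

```markdown
# AGENTS.md (sketch)

## Project layout
- crates/         — core Rust crates
- ui/             — desktop app
- documentation/  — docs site

## Commands
- Build:  `cargo build`
- Test:   `cargo test`
- Lint:   `cargo clippy -- -D warnings`
- Format: `cargo fmt`

## Rules
- Never commit secrets or API keys
- Never hand-edit generated files
- Run the tests before declaring a change done
```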



<p>When someone points their AI agent at our repo, the agent picks up these conventions automatically. It knows what to do and how to do it, what not to touch, how the project is structured, and how to run tests to check its work.</p>



<p>You can’t complain that AI-generated PRs don’t follow your conventions if you never told the AI what your conventions are.</p>



<h2 class="wp-block-heading"><strong>3. Use AI to review AI</strong></h2>



<p>Investing in an AI code reviewer as the first touchpoint for incoming PRs has been a game changer.</p>



<p>I already know what you’re thinking… They suck too. LOL, fair. But again, you have to guide the AI. We added <a href="https://github.com/block/goose/blob/main/.github/copilot-instructions.md" target="_blank" rel="noreferrer noopener">custom instructions</a> so the AI code reviewer knows what <em>we</em> care about.</p>



<p>We told it our priority areas: security, correctness, architecture patterns. We told it what to skip: style and formatting issues that CI already catches. We told it to only comment when it has high confidence there’s a real issue, not just nitpick for the sake of it.</p>
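
<p>Condensed into a file, that guidance might look something like this. (A sketch of the idea, not our actual instructions file.)</p>

```markdown
# Code review instructions (sketch)

Focus on:
- Security issues (injection, unsafe deserialization, leaked secrets)
- Correctness bugs (off-by-one errors, unhandled failures, races)
- Deviations from our established architecture patterns

Skip:
- Style and formatting issues — CI already catches these

Only leave a comment when you have high confidence
the issue is real. Do not nitpick.
```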



<p>Now, contributors get feedback before a maintainer ever looks at the PR. They can clean things up on their own. By the time it reaches us, the obvious stuff is already handled.</p>



<h2 class="wp-block-heading"><strong>4. Have good tests</strong></h2>



<p>No, seriously. I’ve been telling y’all this for YEARS. Anyone who follows my work knows I’ve been on the test automation soapbox for a long time. And I need everyone to hear me when I say the importance of having a solid test suite has never been higher than it is right now.</p>



<p>Tests are your safety net against bad AI-generated code. Your test suite can catch breaking changes from contributors, human or AI.</p>
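
<p>To make “good tests” concrete: a focused regression test pins down one behavior that a contributor (or their agent) could silently break. A minimal sketch in Python — the function and its contract here are invented for illustration:</p>

```python
# Hypothetical example: a small, focused regression test.
# The function and its contract are invented for illustration.

def parse_flag(arg):
    """Split a CLI flag like '--name=value' into (name, value).

    A bare flag like '--verbose' maps to (name, None).
    """
    name, sep, value = arg.partition("=")
    return (name, value if sep else None)

def test_parse_flag():
    # Each assertion pins one behavior a refactor must preserve.
    assert parse_flag("--model=gpt") == ("--model", "gpt")
    assert parse_flag("--verbose") == ("--verbose", None)
    assert parse_flag("--empty=") == ("--empty", "")

test_parse_flag()
```

A suite of tests like this turns “does this PR break anything?” from a judgment call into a command you can run.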



<p>Without good test coverage, you’re doing manual review on every PR trying to reason about correctness in your head. That’s not sustainable with five contributors, let alone 50 of them, half of whom are using AI.</p>



<h2 class="wp-block-heading"><strong>5. Automate the boring gatekeeping with CI</strong></h2>



<p>Your CI pipeline should also be doing the heavy lifting on quality checks so you don’t have to. Linting, formatting, and type checking should all run automatically on every PR.</p>



<p>This isn’t new advice, but it matters more now. When you have clear, automated checks that run on every PR, you create an objective quality bar. The PR either passes or it doesn’t. Doesn’t matter if a human wrote it or an AI wrote it.</p>
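
<p>As a rough sketch, a quality-gate workflow for a Rust project might look like this. (Illustrative only, not goose’s actual CI; swap in the tools your project uses.)</p>

```yaml
# Sketch of a PR quality gate (GitHub Actions).
name: pr-checks
on: pull_request

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Format check
        run: cargo fmt --check
      - name: Lint
        run: cargo clippy -- -D warnings
      - name: Tests
        run: cargo test
```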



<p>For example, in goose, we run a GitHub Action on any PR that involves reusable prompts or AI instructions to ensure they don’t contain prompt injections or anything else that’s sketchy.</p>



<p>Think about what’s unique to your project and see if you can throw some CI checks at it to keep quality high.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>I understand the impulse to lock things down, but y’all, we can’t give up on the thing that makes open source special.</p>



<p>Don’t close the door on your projects. Raise the bar, then give people (and their AI tools) the information they need to clear it.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>On March 26, join Addy Osmani and Tim O’Reilly at AI Codecon: Software Craftsmanship in the Age of AI, where an all-star lineup of experts will go deeper into orchestration, agent coordination, and the new skills developers need to build excellent software that creates value for all participants.&nbsp;</em><a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener"><em>Sign up for free here</em></a><em>.</em></p>
</blockquote>
]]></content:encoded>
										</item>
	</channel>
</rss>
