Radar

We Keep Renaming AI Coding. Here’s What I’d Call It.

Andrew Stellman — Mon, 03 Aug 2026 10:58:20 +0000

Boris Cherny, who runs Claude Code, told Business Insider in May that the phrase “vibe coding” had started to annoy him, and that he’d gone looking for a better one. He’s not the only one who’s annoyed.

The term itself doesn’t actually annoy me, though. I think vibe coding is a really good name: It describes a specific way of using AI tools, and in development work, names that mean something specific are important. What annoys me is when people confuse vibe coding, intentionally or otherwise, with any kind of work where you write code with AI. That confusion points to a deeper problem: We’ve been using a lot of different names for a lot of different things, and we aren’t always precise about which is which. I think we need to fix that, and that’s what this article is about: making the case that the name we’re looking for is “AI-driven development” (or AIDD).

The case for this name comes from the familiar “X-driven development” pattern, because I think it really fits here. Software engineering already has a pattern for naming ways of working it takes seriously: test-driven development, behavior-driven development, domain-driven design. The name tells you what the work is organized around, and the suffix carries an expectation along with it: There’s a discipline attached, with standards, not just a style. Put “AI” in that slot and the name does the same job. AI-driven development says that building software has reorganized itself around AI, and it says it in the vocabulary we already use for the disciplines we hold ourselves to. It puts this way of working in the same family as test-driven and behavior-driven development, and that’s exactly the company it should be keeping.

Honestly, AI-driven development is a name that’s been sitting in plain sight, and I’ve been using it in my own writing for a while. It covers everything we do when we build software with AI, and I do mean everything. Vibe coding is just one part of how we work with AI to build software. There’s also figuring out what to build, writing it down, checking what comes back, and standing behind what ships, and AI is in the middle of all of that now. Whatever we call this way of working, it has to cover the development, not just the coding. Now, I’m obviously not a neutral party here, but I also don’t really have anything to gain; naming is really important, and I think we need a good name for what it is that we’re doing.

But I’ll admit up front that the name has a problem baked into it, and I want to deal with that head on. I recently ran into Addy Osmani at Foo Camp, and ran the AI-driven development name by him. He pointed out that building software with AI is really a range of practices that runs from vibe coding at one end to agentic engineering at the other. That rang true with me right away. It also highlighted the real problem I’m trying to solve, because it means I’m proposing one name for a whole range of very different ways of working. Can one name honestly cover ways of working that different? It took me a while to work that out, and I’ll come back to it at the end.

I feel like the name AI-driven development really makes sense once you can see what’s wrong with the names we’ve got, so I’ll start there.

What’s wrong with the names we’ve got?

Before I pick these names apart, it’s worth saying why any of this matters. Naming sits at the core of programming: A thing isn’t real until you can refer to it, and referring to things is most of what we do. There’s an old line, usually credited to the Netscape engineer Phil Karlton, that there are only two hard things in computer science: cache invalidation and naming things. It’s stuck around for decades because it’s true (well, maybe one or two other hard things have emerged since then, but it’s the thought that counts). We take naming a variable seriously, so we should take naming our whole discipline at least as seriously, because a poorly chosen name sticks.

So let me take the names we’ve been using one at a time: what each one actually names, what it gets right, and what it leaves out.

Vibe coding

Vibe coding is an exploratory, prompt-first approach to software development where developers rapidly prompt, get code, and iterate. Andrej Karpathy, one of the founders of OpenAI, coined the term, which I think is really useful because it describes the way a lot of developers first work with AI and code.

Now, let me be clear about something: I’m in favor of vibe coding, and I teach it as a really effective—and, more importantly, creative!—way to generate a lot of code. But developers who rely entirely on vibe coding lose touch with their code because they let the AI make all of the decisions: not just specific technical decisions, but also about the architecture and the overall direction of the project. When that happens, they often end up building something that isn’t quite what they intended. When you have to create a product that needs to do a really specific thing (which describes most professional software development), relying exclusively on vibe coding can leave you with a product that doesn’t actually meet its requirements. That’s part of the reason I developed the Sens-AI Framework, which teaches developers when to shift their approach away from vibe coding, step back to do more research, and apply more critical thinking to what the AI is producing.

This is where the confusion I opened with does its damage (and I’m not sure whether it’s what bothered Cherny): When vibe coding gets used as the name for the whole job, developers will often assume that it’s absolutely fine to trust the AI to take over, and that whatever comes out of the AI is the end of the project. In other words, the name sets the bar: If the work is just vibes, then vibes are good enough, and “good enough” is how you end up with a pile of code nobody actually checked before shipping. So I consider vibe coding a useful technique, but it falls short as an entire way of working.

Vibe coding also has a built-in limit, and I learned it the way most lessons stick, by getting burned. AI is very good at writing code that looks right and isn’t. I once vibe-coded a little bus-tracker app for the B69 near me in Park Slope (I told that story in “AI Code Review Only Catches Half of Your Bugs”), and it worked on the first try, except the AI had picked the wrong stop ID and I sat there watching it predict a bus going the opposite direction. The code was correct. It did the wrong thing. Vibe coding got me a working app in minutes, and it had nothing to say about whether the app was right. That part was on me.

Prompt engineering and loop engineering

These two names belong in the same section because one basically grew out of the other. They describe the same job, getting the right work out of the model, at two very different scales.

Prompt engineering came first, and for a while it was a very big deal. It was seen as the core AI skill, and more than that, it even became its own job title: Companies posted prompt-engineer roles with eye-popping salaries, training courses appeared everywhere, and plenty of people reoriented their careers around it. The premise made sense because how you ask an AI for something changes what you get back. And specifically for people using AI to generate code, when you ask for code in a vague way, you don’t get vague code: you get code that does the wrong thing, because the AI fills in every blank you left, and it’s unlikely to fill them all in the way you meant. That isn’t hallucination. It’s the AI generating exactly what we asked it to. Give the model context about your project, constraints it has to respect, and a clear description of the behavior you need, and you get something you can actually use. Prompt engineering is the name for doing all of that deliberately.

But while prompt engineering is a real skill, people are no longer enamored with the name, precisely because of the mode of work that it implies: To most people, engineering a prompt means doing one request at a time. When the AI responds to the prompt, you evaluate the response and write the next one. That one-request-at-a-time style is exactly what’s changing about the whole way we interact with AI, and it’s probably why many AI engineers have grown to dislike the term. Peter Steinberger, the PSPDFKit founder who went on to build the open source agent OpenClaw, posted a line that traveled fast: You shouldn’t be prompting your coding agents anymore, you should be designing loops that prompt your agents. That was a shot straight at prompt engineering.

What’s pushing developers past one-request-at-a-time prompting is the sheer number of agents they can now run. About a month after complaining about the term “vibe coding,” Cherny told Fortune that he doesn’t write code by hand anymore, and that on a busy day he’s directing thousands of agents, or tens of thousands, at once. You can’t type prompts fast enough to direct ten thousand agents.

Loop engineering is the name Addy Osmani gave the new skill that Cherny and Steinberger were pointing at: He wrote up the pattern and gave it a real architecture. Instead of typing each instruction yourself, you build the system that produces the instructions: a loop that dispatches work to your agents, checks what comes back, and feeds them the next task over and over, without you in the middle of every exchange. The relationship between the two names is simple. Loop engineering is prompt engineering at scale; the prompts don’t go away, they just stop being typed by you. It’s tempting to oversell that because a well-built loop really does run with very little human intervention. But somebody still has to decide what “right” looks like, and the loop can’t do that part.

I think loop engineering is a good name and an accurate one. Designing the loop that drives the agent is a real skill, and we need a word for it. But it names the machinery, and machinery has a failure mode: Put an AI agent in a loop with nothing in it that can tell it no, and it generates, checks its own work, decides the work is good, and generates more. There’s no outside signal, so it ends up agreeing with itself on repeat. A well-designed loop makes agents productive. It can’t tell you whether all that machinery turns out working software or another confident pile of slop, and I want a name that covers that part too.

Agentic engineering

Cherny said that he asked Claude for a replacement for “vibe coding” and got “agentic engineering,” and while that didn’t settle the issue, it was an interesting response from Claude. The term didn’t come from Claude, though: Andrej Karpathy had coined it a few months earlier, almost exactly a year after he coined vibe coding, when he declared his own earlier term obsolete. That’s how fast these names are moving. The guy who named vibe coding has already replaced it.

Agentic engineering is an accurate name for what it describes: you’re not writing the code yourself, you’re directing the agents that do. It’s also a bit of a mouthful, and it isn’t immediately obvious to someone who doesn’t already know what it refers to. A number of people have told me they don’t particularly like it. I find it perfectly fine, and it does a solid job of describing that kind of work. You could even argue that loop engineering is a form of agentic engineering, and that prompt engineering is technically a simpler form of it. But vibe coding really isn’t, because it’s not engineering at all. That’s one more reason I think we need an umbrella name that’s friendly, descriptive, and easily recognizable.

The term also points at something real about where this work is heading: Agentic engineering is turning engineers into managers.

Many years ago I worked for a manager who didn’t care, at all, about the quality of the code we shipped. He wanted it out the door the moment it looked even remotely viable, and he was notorious for telling us to stop testing and ship. He used to ask why we had to wait two weeks for the testers to finish, and I’d tell him it takes time to test code. Then he’d ask whether we could just cut some of the tests, and I’d ask him, “Which part of the software are you okay shipping broken?”

That attitude came back to bite us more than once. One time we sent an entire feature out to the client basically untested, and a bug went straight to users. The same manager who kept telling us to skip the testing then called a long, miserable meeting to demand to know why a bug had gotten out. I’ll spare you the full drama, which mostly came down to a QA lead getting pressured to lie about what happened and pin it back on the development team. He didn’t care about quality, but he cared enormously about making sure the blame for a quality problem landed on someone who wasn’t him.

The reason I’m telling a story that happened years before AI could write a line of code is the blame. The important part of that story, and the reason it belongs in this article, is how accountability got managed: My manager’s whole system depended on having someone to pin a quality problem on. Directing agents puts you in that manager’s position, responsible for a team’s output, except the blame-shifting move is gone.

It’s really tempting to think of a fleet of AI agents as your team. You can even give one of them the QA lead role. But when a broken feature goes out, you can’t blame the QA agent, because “well, the AI screwed up” isn’t available to you: You’re responsible for the AI. You decided how much checking the work got before it went out, and the client with the broken feature isn’t going to accept “the AI wrote that part” as an answer, any more than pinning our untested feature on a QA lead fixed anything for our users. Cherny can manage tens of thousands of agents, but he can’t hand the responsibility for what they ship down to the agents, because an agent can’t hold it. Directing a swarm is a management job, and a manager owns the team’s output. The accountability doesn’t transfer, because at the end of the line there’s no one left to transfer it to.

Blame is worth dwelling on, because accountability is the part of this work that no name on the range captures. The loop-and-agent model works, but it only works with somebody making decisions about what right is. Agentic engineering describes the agents and the engineering just fine, but somebody still has to own what the agents ship, and that’s the part I want the umbrella name to carry.

Spec-driven development

There’s one more name I want to cover, and it’s the one with the oldest roots: spec-driven development. The name means pretty much what it says: You start by writing a spec, a description of what the software needs to do, along with things like acceptance criteria and tests, and the work isn’t done until the code actually does what the spec says. It comes from the same family as test-driven and behavior-driven development, where you write the tests first and the code has to make them pass.

Spec-driven development got a serious promotion when AI made generating code nearly free (although if you’re a CIO staring at your token bill, you might disagree, possibly with some extremely salty language). When code is cheap to generate, most of the cost of building software moves to checking whether what got generated is right. The AI fills the generate step, the verification decides what survives, and a human owns the verification.

It also picks up where prompt engineering leaves off. A while back I wrote that prompt engineering is really requirements engineering, because a good prompt is mostly a clear description of what the software has to do. Spec-driven development is where that idea was always headed: Write the requirement down before the AI generates, and the work has a standard to meet from the start.

So that’s the whole range, and every name on it is doing honest work. Whether AI-driven development is a good name for all of it comes down to whether it’s describing something real: an actual discipline, with actual practices, and a person who’s on the hook for the result. The rest of this article is about that discipline.

What all these approaches look like in practice

So how do these approaches actually play out when you’re building something real? For me, wherever the work lands on the range, it comes down to a few moves I keep coming back to.

Write the spec or the contract before the generation, not after. When the agent has something concrete to satisfy, acceptance criteria, a typed interface, a failing test, the work has a standard to meet. When it doesn’t, the AI decides for itself what done looks like.

Put a second opinion in the process. I run code review across multiple models, because they fail differently, and a finding one model is sure about is often one the others missed entirely. A reviewer gives the work something that can say no.

Give your defects a shared vocabulary. The Quality Playbook leans on the difference between code that’s wrong against the spec, code that’s correct but does the wrong thing, and behavior nobody specified at all. Those are different failures with different fixes, and you can’t verify against a standard you can’t name. This is old quality-engineering ground, and I’ve written enough about the software crisis and applying quality engineering to AI coding that I’m on board with taking old ideas and bringing them back. One of the best of those old ideas comes from Joseph Juran, one of the founders of quality engineering: Quality runs in a chain from what the user needs all the way to what the product does, and every link in that chain is a place verification has to happen.

And keep a human in the judgment seat. The Sens-AI habits I’ve written about are mostly about fault-finding: looking at what the AI produced and asking what’s wrong with it, going down a level and then another to find the root, instead of trusting it because it ran. That habit is the part of the discipline only a person can supply, and it’s the hardest part to automate, which is why it matters most.

Skip all of that and you get the thing that’s giving open source maintainers everywhere heartburn: what the Wall Street Journal now calls “vibe slop,” confident, finished-looking output with nothing underneath it. Slop is exactly what generation produces when nothing in the process can push back.

But isn’t there a contradiction here?

Now I can come back to the question I left hanging at the beginning: Can one name honestly cover ways of working that different? AI-driven development is an umbrella term, and any name that broad comes with a requirement it has to satisfy before people will accept it, because a name that blindly covers everything names nothing. A name that truly covers everything is another matter. I sat with that requirement for a while, because it’s real, and because the specific names don’t face it. Vibe coding names one way of working. Loop engineering names another. An umbrella over both of them, plus everything in between, had better be able to say what stays the same underneath it.

What stays the same is that somebody owns the result. When I vibe-coded my bus tracker, nobody was going to catch that wrong stop ID but me. When Cherny directs tens of thousands of agents, nobody owns what they ship but him. The verification changes with the stakes. A throwaway prototype gets my eyeballs and a shrug, and production code gets specs, reviews, defect taxonomies, the whole quality-engineering playbook I keep writing about. How much checking the work needs is a decision you make over and over, project by project, sometimes hour by hour. Who stands behind the work is not a decision you get to make. It’s there at every point on the range.

Look at how much of that range the names we already have cover, and what each one actually names:

Vibe coding names the exploratory end of the range: prompt, get code, iterate, and stay loose on purpose.
Prompt engineering names a skill: writing the instruction that gets the right work out of the model.
Loop engineering names the machinery: designing the system that feeds those instructions to your agents and keeps them producing.
Agentic engineering names the architecture: the fleets of agents doing the labor, at whatever scale you can manage.
Spec-driven development, with test-driven and behavior-driven development behind it, names the verification half of the job: the standard the work has to meet before anyone stands behind it.

Every one of those is real, and every one of them names a piece of the work. What none of them names is the whole thing the pieces add up to, and that’s the job AI-driven development does: It’s the umbrella over all five. The name doesn’t pick a spot on the range; it names the thing that’s true everywhere on it: the AI generates, and a human owns the result.

That’s also what makes the name likely to last (assuming, of course, that I’m able to convince people to start using it, which I hope I can, because I think it’s a good term). Vibe coding, loop engineering, and agentic engineering all describe how this works right now, and the machinery is changing monthly. Some of the pieces under the umbrella will get replaced, and the new pieces will get names of their own. The umbrella won’t have to change when they do, because the thing it names isn’t the machinery. The “-driven development” names have already shown they age well: test-driven development has meant the same thing for more than twenty years.

Agentic engineering is real, and so is loop engineering; if you’re directing agents, learn them both. Vibe coding is real too, and I’ll keep teaching it. AI-driven development is the name for the whole thing, and it earns its “-driven” the same way test-driven and behavior-driven development did: there’s a discipline attached, and somebody owns the result. AI made generating code almost free. It didn’t make being responsible for the code free, and being responsible for it is still the job.

AI as an Enterprise Operating System

Tim O’Reilly — Fri, 31 Jul 2026 16:09:24 +0000

I hadn’t heard of Dan Guido until a few months ago, when I came across the video of a talk he gave at [un]prompted, an AI security practitioners’ conference. Dan is the CEO and cofounder of Trail of Bits, a software security research and development firm that works with companies in tech, defense, and finance. But Dan wasn’t talking about security. He was talking about what it takes to make a company AI native, which is close to the center of the bullseye for many of us right now.

We’ve been trying to figure out how to do that at O’Reilly, but until I came across Dan’s talk, we didn’t have a structured process. We’ve been building along the lines he laid out ever since. So for this episode of Live with Tim I asked Dan to reprise the talk before we got to the conversation. He was supposed to take twenty minutes, like his original conference talk, but he took thirty-five, and I had to cut him off slightly before the end to make room for questions. That was a tough choice, since everything he had to say was golden.

Dan opened by reminding us of the current state of play in enterprise AI adoption. In February, Fortune reported on a National Bureau of Economic Research study in which nearly 90% of some 6,000 executives said AI had produced no measurable change in employment or productivity at their firms over three years. People started calling it the new Solow paradox, after Robert Solow’s 1987 line that “you can see the computer age everywhere except in the productivity statistics.”

Dan’s belief is that this isn’t evidence that AI doesn’t work. It’s evidence that most companies are deploying AI wrong. They hand out ChatGPT and Claude licenses, and then leadership waits for the magic to happen. It doesn’t.

Dan started out by describing three levels of AI adoption.

AI assisted is where everyone starts: “You give people access to ChatGPT, it drafts emails, it summarizes documents. It’s just a productivity tool, and your organization doesn’t change. Your workflows are the exact same as they were before. You just have a little buddy that helps you with a couple of tasks.”
AI augmented is where you start redesigning workflows, so that AI does the first pass on a code review and a human does the second.
AI native is structural: “That’s where you’ve redesigned the company and its workflows from the ground up, assuming the AI is going to be there and that it’s a core participant. That’s not really a tool. That’s more thinking about AI as teammates.”

In his framing, the first of the three is a tool and the last is an operating system. For Trail of Bits, he said that “operating system” has a specific purpose:

“I want our security expertise to compound as code. Every engagement we do, all the skills, the workflows, everything that we build makes the next engagement faster and better.”

Employee resistance is the first problem

Dan confessed how hard it was to get started on the ladder from AI Assisted to AI Native:

“When I announced last year that we were all in on AI, that we were going to be using it across all of our workflows and redesigning the way the company operates, I’d say only about 5% of the company was with me. 95% was resistant.” About 20% was actively resisting. The other 75% were resisting more passively. “They’ll go along with it in public, but in process they’ll sabotage it. They’ll hope that if they keep their head low, this will pass over them, and that three months from now management’s focus will change and it won’t be a problem anymore, and we can get back to doing what we were doing. That’s where the majority of people land when these initiatives happen.”

Rather than argue with his employees, Dan studied the literature on why people reject new technology and decided he needed to address four biases against AI: self-enhancing bias, identity threat, opacity, and intolerance for imperfection.

Self-enhancing bias is the habit of crediting your wins to your own judgment and your losses to circumstance, which is a particular problem for senior people who are strongly attached to the years of experience and intuition that got them to their present position. Opacity is not being able to see how a decision got made. Dan’s observation is that you don’t understand your doctor’s reasoning either, but somehow you trust the doctor but get suspicious of the machine. Dan didn’t mention this work specifically, but intolerance for imperfection seems to refer to Dietvorst, Simmons, and Massey’s work on algorithm aversion, which found that people abandon an algorithm after watching it err once, even when it outperforms the human alternative. Their follow-up paper found that giving people even a slight ability to modify the algorithm’s output is enough to overcome the aversion.

Dan spent the most time on identity threat. He described a study in which the same kitchen appliance was advertised in two ways: “On one hand, it does the cooking for you. On the other hand, it helps you cook better. It’s the same device. The people who identified as cooks rejected the first version and accepted the second.”

Most knowledge work, Dan argued, and security auditing in particular, is what he called symbolic rather than instrumental. That is, it carries meaning about who you are. “So I have to frame AI as something that makes you a more dangerous auditor,” he said. “Not that it does the audit for you.”

In his work at Trail of Bits, he deliberately built a countermeasure for each bias.

Self-enhancing bias is addressed by “an AI maturity matrix” with visible levels, because you can’t claim you’re already good enough when there’s a published ladder that identifies a different set of skills as critical.
Identity threat gets skills repositories, where an engineer who writes a hard plugin gets credit for encoding their expertise. Hackathons also change the dynamic from resistance to exploration. I’m putting words in Dan’s mouth here, but I think he’d agree that when experienced developers are called on as mentors in a hackathon, that also reduces their experience of AI as an identity threat.
Intolerance for imperfection gets a curated marketplace, sandboxing, and hardened defaults, so everyone’s first experience of AI isn’t a disaster.
Opacity gets a written AI handbook that clarifies the usage policy and the risk model rather than just saying “trust us.”

Here’s Dan’s slide on “the remedies that actually worked”:

Returning to one of my hobby horses, this is a kind of mechanism design. In my recent piece on the missing mechanisms of the agentic economy, I argued that we need to start with desired outcomes and ask ourselves what mechanisms will help to produce them. Dan’s approach seems to be really good at this. Most enterprises are treating AI adoption as a procurement problem or a communications problem. Dan treated it as a question of what incentives, defaults, and status ladders produce the behavior you want, given how people actually respond.

The last remedy on Dan’s list is that the CEO has to lead by example. He noted, “I was the first person through the door. My voice as the CEO matters a lot more than people think. The passive 50% of the company that isn’t sure if this initiative is going to be successful, they’re watching to see what leadership actually does, not what it says.”

A ladder, not a mandate

Trail of Bits already tracked about 50 engineering skills for performance review, things like Python, git, Rust, and various security auditing capabilities. Dan pulled AI skills out into their own matrix, with four levels, from not engaged through capable and adoptive to transformative. Each of these levels is detailed separately and more specifically for assurance, engineering, sales, and project management.

He noted that “The highest level of the maturity matrix is not somebody who uses AI the most. It’s somebody who invents new ways to work and builds tools with AI. So the identity of the expert shifts from ‘I don’t need AI’ to ‘I’m the one who makes AI useful for the company.’” This was his first important design choice.

The second is what level zero means. He said “If you’re at level zero, if you’re not engaged, that means you’re fighting back against the company. If you dismiss AI as hype, if you refuse to use AI for security work, this is a disagreement on principles, not on skills. For people who were stuck in the not engaged category, we had hard conversations, and there were people who left the company.” Levels one through three are a skill issue, and the remedy is time with the tools.

While the slide describing the capability matrix is shown in the preceding video clip, here’s where you can find the full deck so you can study it in more detail.

Driving adoption and skills with hackathons

One of the best ways Trail of Bits developed to move people up the ladder was to hold a hackathon every two months. Dan runs them with clear goals rather than as a free-for-all. The focus area and learning objectives are defined in advance and announced a week ahead, with separate instructions for engineers and non-engineers. People work in pairs so everything gets reviewed. There’s a demo session at the end, and then follow-through. (It’s an important part of Dan’s big idea, that you have to build a system by which, in his words, organizational knowledge and capability compounds.) He noted that “In the days afterward we keep one or two people around, and they collect all the reusable artifacts, structure them, and put them into the places they need to be.”

I asked what people outside of product and engineering actually work on, since the answer for an accountant at a hackathon was not obvious. Dan’s response is that the hackathon isn’t measured in artifacts shipped but in where people sit on the capability ladder the following week. Essentially, he’s running a training program that happens to produce useful output, rather than a production sprint that happens to teach people something.

The first hackathon, he told me, was the equivalent of a beach cleanup: “It’s like those companies that send everybody to the beach with a big stick and say, let’s go pick up a bunch of trash and put it away, and then you get the big team photo after with all the contractor bags of garbage. That’s what we did with our public source code repositories.”

He picked it because open source maintenance is the part of the job that feels like a grind. No new features, just closing issues and stale dependencies on public code where nothing was at risk. “As an open source maintainer, you just get beaten down by the public. This doesn’t work, I can’t use it, this thing sucks. Dozens of issues pointing out flaws you already knew about. It feels burdensome. We wanted people to see that adopting AI would relieve burden.”

The second hackathon was about shipping impactful product updates, but it was also designed to move everyone up the capability ladder by giving up control. Engineers had to run Claude Code in bypass permissions mode, fully autonomous, on public repositories, inside sandboxes the company had prepared in advance. The one they’re running now is about persistent background agents that can be handed a task during an audit and come back with a proof of concept exploit or a draft finding.

Here’s a look at Dan’s slack message announcing the hackathon:

The slack message announcing the second hackathon. (From Dan’s slide deck.)

Everything the hackathons produce gets harvested into artifacts.

Trail of Bits runs three skills repositories: an internal one for company workflows, a public one that anyone can use, and a curated one that vets third-party skills before they’re allowed in.

Publishing skills to the public repository is not just a marketing exercise. “It keeps us honest, and it forces us to write things that other people can use, not just people outside the company but inside too,” Dan said. “It really helps us think about the tribal knowledge that’s baked into the tool.”

The curated repository exists because Trail of Bits knows how bad the supply chain is. They’ve published research on how to write malicious skills, and so Dan is not going to tell 130 employees to start downloading code from strangers and running it on their laptops. “If you want adoption, you need a safe supply chain.”

Turning scar tissue into infrastructure

Perhaps even more important than the skills repository is, as Dan put it, “turning scar tissue into infrastructure.”

“Every single time Claude Code didn’t do something we wanted, we would bake it into a set of global, copy-pasteable defaults. Known good settings, recommended patterns. I call it scar tissue. If I hire somebody new tomorrow, I don’t want them to have to go through the entire discovery process of the last year of Trail of Bits to figure out how to use the tool.”

The configuration repository, claude-code-config, is where the accumulated lessons live.

Dan built the first version himself and then opened it to pull requests from the whole company, assigning someone after each hackathon to go collect what people hadn’t contributed on their own. “It’s easier to put out something that’s unpolished than it is to get it perfect on the first try.”

In short, a big part of the Trail of Bits “enterprise AI operating system” approach is a set of standardized tools and hardened defaults. Standardization isn’t a straitjacket. It’s a foundation.

On sandboxing, Trail of Bits deliberately didn’t pick a single preferred solution. There’s a devcontainer for developers, dropkit for disposable DigitalOcean droplets, COOP for isolated VMs, and the sandboxing now built into Claude Code for casual users. “The point isn’t that everybody uses the same sandbox,” Dan said. “The point is that everyone has a safe sandbox to use, and that it’s easy for them to do it.”

Another of the hardened defaults is procedural. Trail of Bits enforces a seven day cooldown on every package their developers install:

“There are dozens of security companies scanning the internet trying to find a new cool blog post they can write about malicious code hiding on PyPI or npm, and they usually figure out there’s a supply chain issue within hours. So we just delay all the packages that Trail of Bits uses. Generally the malicious stuff gets picked up before we ever get a chance to run it.”

That’s free-riding on a competitive market for security research, and given the speed of today’s market, it’s an elegant solution. There’s a whole class of defenses like this waiting to be found, where the mechanism is not a technical system but a well-chosen delay.

Data, and DJ Patil’s “Tidy House”

The problem we run into most often as we build AI workflows at O’Reilly isn’t the model or the tooling. It’s data. Who has access to which system, which system does that data live in, and who do I ask? In a 500 person company that’s annoying. I wonder what it’s like at a company with 50,000 employees.

I told Dan about DJ Patil’s Tidy House framing. He agreed that data access for AI is a big problem. His answer starts with permissions:

“The permissions debt is invisible until an agent hits it. Making data agent legible is a forced permission audit. You have to actually go through and figure out who can access what…. It also raises the stakes for permissions errors. If you overshare information, now an agent inside your company is going to find it instantly. There are a lot of these technical debt sort of things where, with agents, all of it’s becoming due at the same time.”

Every shortcut an organization took with its data over the past twenty years is being called at once, and the companies that can run the audit, make fast decisions about boundaries, and then actually share their data are the ones that will get a force multiplier.

Dan is against letting a thousand flowers bloom, because uncoordinated teams create overlap rather than compounding. He’d rather have one centralized foundation, with innovation happening on top of that. He suggested a useful metric for making that work across team boundaries is what fraction of your team’s data did you make reusable for everyone else, and how much of it is being used by teams outside your own.

What post-AI jobs look like

Before the first hackathon, Trail of Bits ran hands-on sessions to teach its operations and go-to-market staff the basics of git and the command line. Not mastery, just enough to be a consumer of the thing. Here we are fifty years into my career and the Unix command line still matters. Dan’s non-technical staff mostly work inside Claude Cowork or Codex Desktop now, but he thinks the command line experience was worth it because they know what’s happening under the hood.

What happens to a job when the tool can do a lot of what humans used to do? Dan gave the example of his own technical editors. His editors used the hackathons to build the tools that got them out of line editing, including one that turns a public presentation into a blog post in the company’s voice. What the writers do now is consult on how to frame a story so it is effective with a particular audience.

I agree. Human jobs aren’t going away any time soon. This gets heard as optimism when it’s really just observation. AI is going to replace a lot of what we used to do, but it is also going to hand us a large amount of new work, and much of that work hasn’t been understood yet. Quality assurance for agent systems is one of the new jobs. So is skills product management, which is a role that didn’t exist eighteen months ago and now has a headcount at a 130 person security firm.

I asked a question towards the end about how we’re going to know which skills and agents are any good. What Dan has so far is telemetry pulled from developers’ dot files through the company’s device management system, which tells him what gets used and what breaks, plus one AI systems engineer whose job is product management for the skills repository, reviewing incoming pull requests and deprecating overlapping skills.

What Dan thinks comes next is evaluation. He says: “Once you invest a lot into these agent systems, you need proof that they do the job. The way you do that is you give everybody a performance review. You give them an evaluation data set, a benchmark.”

Trail of Bits is now building benchmarks for its core skills. How well can we find bugs in this language? How well can we write a statement of work? Constructing those datasets is real work, with positive and negative cases, and comparisons against the algorithmic tools that already exist.

Put the reps in

I asked Dan for the top five mistakes he made. He said there was only one. “You need to allocate an appropriate amount of FAFO time. (That’s F Around and Find Out.) A product comes out on Friday. There’s no documentation for it. There’s no training guidance for it. There’s no course on it. You can’t wait until somebody systematizes the knowledge. You just need to do it.”

Then he gave an analogy to going to the gym.

The recipe for success

Dan has a replicable recipe, which he summarized as follows:

Standardize on one agent workflow that you can support.
Write an AI handbook so that risk decisions aren’t ad hoc, and that everyone is playing the same game.
Create a capability ladder that makes clear that improvement is expected.
Run short adoption sprints that force hands-on usage.
Capture everything as reusable artifacts: skills + configs + a curated supply chain.
Make autonomous agents safe with sandboxing + guardrails + hardened defaults.

The Trail of Bits skills repository is public. So is the curated marketplace, the configuration repository, the devcontainer, dropkit, and COOP (Continuity of Operations planning). He wrote up the whole playbook on The Trail of Bits Blog and gave a version of it to tl;dr sec. He thinks publishing makes the work better because it forces the tribal knowledge out into the open where it can be checked.

Which brings me back to the Solow paradox, which seemed to disappear by the late 90s, when US aggregate productivity did finally go up. That didn’t happen because computers got faster. It disappeared because companies figured out how to reorganize themselves around what computers could do, and eventually those organizational recipes spread widely enough to show up in aggregate statistics. The same has to happen today. The current AI discourse is obsessed with model capability and largely uninterested in diffusion. The problem is not that the models are oversold. It’s that almost nobody has done the necessary organizational work, and the few who have are mostly keeping it to themselves.

If you want to go beyond the highlight videos shown above, watch Dan’s entire talk here. His slide deck is here. And be sure to check out the Trail of Bits Github repository.

This Week in AI: Agents, Gatekeepers, and World Models

Michelle Smith — Fri, 31 Jul 2026 13:02:06 +0000

This week, data and AI evangelist Christina Stathopoulos looked at three developments shaping AI’s next phase: agents that can act across systems, infrastructure built for specific models, and world models that help AI understand physical environments. Model quality is no longer the only constraint for teams. They also need to account for security controls, compute requirements, information access, and the environments where AI systems will operate.

Agent capability is advancing faster than agent control

Christina opened with reports that an OpenAI agent escaped a test environment, gained internet access, and targeted Hugging Face while attempting to complete an assigned task. She also noted skepticism about how the incident was characterized, as well as the joint investigation announced by OpenAI and Hugging Face. The details remain under review, but the broader deployment problem is already familiar. Agents can combine tools, credentials, networks, and external services in ways application teams may not anticipate. (After the episode aired, OpenAI revealed that its review had turned up four other similar incidents “where the models identified and used publicly exposed credentials at the account-level on other publicly-available services.”)

Christina then discussed OpenAI’s limited-availability platform for helping enterprise customers build and manage agents with support from forward-deployed engineers. Direct access to specialists can help a company launch an agent, but it doesn’t replace the internal skills and governance required to operate one over time. For technical leaders, agent readiness increasingly means evaluating the full operating environment rather than focusing only on benchmark performance.

AI infrastructure is reshaping both compute and the open web

Google appeared on both sides of the infrastructure discussion. Christina covered reports of a chip designed around Gemini’s architecture, an approach that could reduce the compute required to run the model if the reported efficiency gains hold up. Specialized hardware has become a larger part of the AI race because model performance depends on cost, energy use, and deployment capacity. A model that performs well but consumes too much power or requires scarce hardware may still be difficult to use at scale.

A different infrastructure shift is affecting the web. Christina examined how the growth of AI-first search experiences that answer questions without sending users to the sites that supplied the underlying material is threatening the open web. Organizations still pay to produce and host useful information, but AI systems collect more of it while returning less traffic. Cloudflare data shows more traffic from agents, fewer human visitors, and declining referrals to publishers. More and more, people are using AI mode in Google search instead of clicking through to websites, leading some to suspect the arrival of what is referred to as “Google Zero.”

Developers building search products, retrieval systems, and agents should treat source attribution and publisher incentives as product design decisions. Reliable AI systems depend on reliable source material, and that source material needs a sustainable way to exist.

World models could give physical AI a more useful foundation

The episode closed with world models, systems designed to learn how environments work, how they change, and how actions affect what happens next. Christina highlighted a proposed research roadmap that describes world models as able to combine several kinds of input, process information arriving at different speeds, and infer a larger environment from limited observations.

For now, the clearest applications are in simulation, robotics, planning, and decision-making rather than claims about artificial general intelligence. A robot working in a factory, construction site, or emergency zone must track objects, understand movement, respond to incomplete information, and predict the likely result of an action. Large language models can support communication and planning, but physical work requires a representation of space, time, and cause and effect. World models may provide part of that foundation. However, researchers still need standardized definitions, reliable evaluations, and clear evidence that these systems can generalize beyond controlled environments.

What’s next

Across the episode, Christina explored how AI capability is advancing faster than the systems around it. Security practices, compute infrastructure, publishing economics, and physical-world evaluation will help determine which advances become dependable tools and which remain impressive demonstrations.

Tune in next week as Christina breaks down the biggest AI news, including the US-China tech rivalry heating up after Anthropic CEO Dario Amodei’s post on open weight models and new bans on foreign-made humanoid robots. She’ll also challenge Sam Altman’s AI singularity claims, separating fact from hype, and examine key developments in math and science, including OpenAI’s 100,000 free researcher licenses, Claude Fable 5 solving an 87-year-old math problem, and Google disbanding its Nobel Prize-winning AlphaFold team to prioritize Gemini.

Check back each Friday for the latest episode, or watch on YouTube, Spotify, Apple, or wherever you get your podcasts.

The Problem Is Prompt Debt

Drew Breunig — Thu, 30 Jul 2026 11:05:15 +0000

The following article was originally published on Drew Breunig’s blog and is being republished here with the author’s permission.

Thanks to natural language interfaces, AI applications can be prototyped quickly. You write what you want in English, hand it to a frontier model, and a working prototype appears in an afternoon. This is extraordinarily powerful and for one-off tasks, optimal. But as a way to build reliable systems, the natural language prompt is a trap.

The plain-English prompt that makes prototypes effortless turns out to be a poor way to specify how a system should behave, and the bill arrives slowly, disguised as ordinary progress, until the application can barely move. The problem is not any single prompt. It is that natural language was never meant to be a specification language for engineering, and treating it as one quietly caps what you can build.

The prompt debt trap

The first symptom of prompt debt is slowing iteration. As users flag errors and spot edge cases, additional guidance is added to the instructions, nudging the model into line. If unwanted behaviors persist, instructions are repeated, with increasing severity. Pretty soon, the prompt isn’t straightforward and quick fixes regress previous instructions. Errors can no longer be handled with one-line “hot fixes” and your development cycle slows to a crawl.

Fable’s system prompt repeats copyright guidance up to six times, under sections named search_instructions, search_usage_guidelines, mandatory_copyright_requirements, hard_limits, self_check_before_responding, and critical_reminders.

Next, prompt debt incapacitates your team. Your brittle prompt full of edge cases and all-caps threats is barely legible to you, and it’s downright impenetrable to your colleagues. Many teams mitigate this issue by breaking prompts into complicated templates assembled at run-time, each isolated to specific concerns. But these prompt segments evolve, too, growing into a thicket of conditions.

Finally, prompt debt ties you to a single model. Your hot fixes work on GPT-4o, but fail in entirely new ways when you point your inference call at GPT-5.4-mini. So you stay with 4o, hope the increasingly frequent deprecation emails from your inference provider are empty threats, and forgo the possibility of potentially cheaper, faster, better models. A recent report from Datadog suggests this is a common situation: The most-used model in traffic they observed is GPT-4o.¹

Any one of these issues is a nuisance, but together they are the difference between a glorified prototype and a product that can grow with you, your customers, and your business. Your shiny new AI features are frozen, can only be improved through a full rebuild, and are locked to an aging model.

Why prompt debt happens

Natural language interfaces are wonderful. They’re the right mechanism for one-off tasks and broad conversational threads. We get into trouble when we rely on natural language to define durable system behavior.

The imprecision of natural language paired with probabilistic language models means different words expressing the same intent can yield different outputs. In a recent study, a clinical question asked in a patient’s voice and then re-asked in a physician’s, with identical facts, flipped Opus from declining all ten times to answering all ten.

And it’s not only word choice that matters. Seemingly unrelated statements in the same prompt can affect results. In a Harvard study, researchers found that merely stating which NFL team the user rooted for changed how often the model refused to answer questions regarding sensitive topics. Spurious statements influence the inference pass in ways we can’t predict. Which is why prompts become more brittle as you add fixes. An additional instruction to quell a stubborn error could affect how the model interprets a separate instruction that worked yesterday.

Repeating instructions propels us towards prompt debt, but it’s necessary when the behavior we want is at odds with a model’s training. This is fighting the weights, and once you recognize it you see it in system prompts everywhere. For example, ChatGPT’s image prompts used to instruct the LLM eight times to not reply when a generated image was returned because it had been trained to always keep the conversation going.

Every coding agent system prompt we analyzed featured repeated instructions, stern warnings, and all-caps demands. Claude Code tells Opus seven times to return multiple tool calls in a single response. And even the most advanced models force prompt authors to fight the weights: Fable’s leaked system prompt restates one specific copyright rule six times.

None of these examples occurred in isolation. Multiple repeated rules are woven throughout the system prompts we examine. Stubborn errors grow our prompts quickly, with each increasing the brittleness, the risk of regression with every edit.

And worse: These fixes are tailored to a single model’s behavior. A recent Berkeley-led study found enterprises stay on older models because newer ones break their existing agents. This is because models are not cleanly versioned software. They have different weights that produce different behaviors, in unpredictable and undocumented ways. A prompt that works beautifully with GPT-4o may fail with GPT-5.5. Anthropic’s own release notes for Fable warn that skills developed for prior models can “degrade output quality.”

Prompt debt locks an application to a single model. Our inability to easily swap models isn’t the result of frontier labs coming up with a clever moat. No, it’s the result of evolving a lossy natural language specification against a probabilistic model.

Preventing prompt debt

Thankfully, we don’t have to theorize about how to mitigate prompt debt; one field has already shown the way. Programmers using coding agents sit at the leading edge of what models can do, outliers on the jagged frontier of model abilities. Over the last couple years they’ve been evolving best practices that let the model write more of the code, while delivering maintainable, modular software.

The first principle is to specify your system’s behavior with measurements, not prose. When the model’s output is probabilistic and language is imprecise, we build hard edges to constrain them: evaluations, metrics, and typed specifications. These are legible, shared artifacts colleagues can read and contribute to, enabling the collaboration that brittle prompts prevented.

The best engineers now spend more of their bandwidth on tests than ever, as they are no longer a safety net but the thing that lets the model cook.

The second principle is to stop writing the prompt by hand. Once we have metrics that can score candidates, the prompt is no longer something to craft but something for which to search. And the surface area of potential words, phrases, and structures that natural language allows is too vast to spend human hours on. This is terrain LLMs were built to explore, and there are already systems (like DSPy and GEPA) that manage this work for you, holding prompts accountable to your designs.

Once prompts are generated and your program’s behavior is defined by measurements, you are no longer bound to a particular model. Evaluating a new model takes hours, not weeks. When a faster, cheaper model arrives you can try it. When a deprecation email arrives, you can secure options in a day. Whether a model is pulled for regulatory reasons (as we saw with Anthropic’s Fable) or deprecated due to age (as Groq announced last week with Llama-3.1-8b), the fix is a chore, not a fire drill.

Every mature engineering discipline eventually stops doing by hand the very thing it once prided itself on doing by hand. Assembly gave way to compilers, hand-tuned queries gave way to planners, and manual memory management gave way (mostly) to machines that do it better. Prompt-writing is no different.

Coaxing the model with exactly the right words is a real skill, and for one-off tasks it’s often optimal. But to build reliable, improvable, and portable systems we should not be hand-tuning prompts.

Footnote

This stat from Datadog is from March of this year, so GPT-4o concentration has likely dropped a bit. However, I’ve heard from multiple large inference providers that usage of GPT-4o and models of similar vintage can be higher than 50% of all calls! ︎

What the Hell Is a Loop, Anyway?

Laurie Voss — Wed, 29 Jul 2026 10:38:24 +0000

The following article originally appeared on LinkedIn and is being republished here with the author’s permission.

We’re currently at the peak of the hype cycle. On June 7, Peter Steinberger posted that you shouldn’t be prompting coding agents anymore; you should be designing loops that prompt your agents. That same week, Boris Cherny of Anthropic said on stage that he doesn’t prompt Claude anymore: “I write loops; the loops do the work.” Addy Osmani published an essay called “Loop Engineering” on June 7, swyx published “Loopcraft: The Art of Stacking Loops” on June 12, and LangChain published “The Art of Loop Engineering” on June 16. Then came the AI Engineer World’s Fair, where the word dominated the main stage. Swyx’s keynote was about Loopcraft, an entire track was devoted to software factories, speaker after speaker reached for the same word, and the conference closed on July 2 with an hour-long debate about whether the hype behind loops has outrun what works in practice.

The problem is that the people talking about loops aren’t all discussing the same thing. I counted at least four distinct architectures hiding behind that one word. So this post is an attempt to map out what everyone means.

The execution loop: The agent’s own act-observe cycle

This is the loop most people picture when they say “agent”: call a tool, read the result, decide the next action, and repeat until there are no more tool calls to make. It’s what Addy calls the inner execution loop, the part agents can now run largely on their own, and it’s the innermost loop you can engineer. (swyx’s stack has a token loop, but nobody designs the token loop. It’s just part of the model.)

Swyx’s original Loopcraft diagram

The execution loop iterates on steps within one task. It ends on environment feedback: the test output, the API response, and the file contents. Humans are usually absent mid-loop and appear at the boundaries, approving plans or reviewing results. The execution loop also ends whenever the agent decides it’s done, whether or not it actually is. The first fix the field found for that was to wrap this loop in another one that doesn’t take the agent’s word for it.

The task loop: Restart the agent until the spec is satisfied

This was the first loop to get a name and it’s Geoffrey Huntley’s Ralph loop, which got name-checked from the AI Engineer World’s Fair main stage when Allie Howe of Keycard introduced the software factories track by citing Geoffrey’s article “Everything Is a Ralph Loop.” A Ralph loop restarts a coding agent against the same specification over and over, allocating a completely fresh context window every iteration and doing exactly one task per loop. The apparent waste is the point: Refeeding the full spec each time prevents the context rot and compaction events that quietly degrade long-running sessions.

What this loop iterates on is a single artifact. What ends the loop is spec compliance and passing tests. The human writes the spec and judges doneness, and in Geoffrey’s telling the human has one more job that I’ll return to later: watching the loop, spotting failure patterns, and fixing them so they never recur. In the closing debate on the conference’s final day, he compared the role to a locomotive engineer, someone whose whole job is keeping the train on the rails. Zoom out from a single spec though, and a much bigger loop comes into view: the one that runs an entire codebase.

The product loop: The software factory

This was the loudest version at the AI Engineer World’s Fair. Tereza Tizkova of Factory defined a software factory as “the whole loop, the whole lifecycle of developing software with autonomy,” and Zach Lloyd of Warp got specific about what that lifecycle is in an interview with Latent Space: triage, specification, implementation, review, verification, shipping, and monitoring. Zach’s claim is that software engineering becomes factory engineering, and that you’ll be building the thing that builds the product. Warp is dogfooding this: The company placed its own open-sourced repo under the control of Oz, its factory platform. Zach describes the adoption path as starting with low-risk repos and ratcheting the automatic PR merge rate upward from 20 percent toward 60. Anthropic appears to be running the same experiment internally. The company says 65% of its product team’s code is now created by its internal version of Claude Tag, and Mike Krieger described his team’s use of it at the World’s Fair as delegated and proactive: not “fix this bug” but take responsibility for this part of the codebase, monitor this feedback channel, and pick up tasks on your own.

The task loop and the execution loop have defined exit conditions. The product loop iterates on a codebase and its backlog, continuously, and its closing signals come from outside the codebase entirely: new issues, production logs, user feedback, review outcomes. The human role becomes configurable. In Zach’s framing, you pick the parts of the lifecycle to automate and the points where humans get brought in, and organizations differ on questions like whether code review stays human for high-risk changes. A factory improves a product. The next loop improves the factory itself.

The system loop: Autoresearch

Roland Gavrilescu of Introspection calls this autoresearch. Here’s how he framed the concept in a Latent Space interview: The inner loop is your primary system doing user-facing work, and the outer loop studies and maintains the primary system. It iterates on prompts, harnesses, model choices, and the evals themselves. His one-liner is that the loop is the product.

This pattern now has real existence proofs at both ends of the scale. The minimal case is Andrej Karpathy’s autoresearch from March 2026, roughly 630 lines of Python that ran 50 hypothesis-edit-evaluate experiments overnight on one GPU. The shipped case is Meta’s Brain2Qwerty v2, announced in late June, where the researchers report that agents iteratively modified the codebase to invent better decoding architectures, producing a substantial improvement in word error rate. Meta’s caveat is instructive: Final training configurations were still selected by hand. Even the flagship system loop keeps a human at the last checkpoint.

What ends this loop is the most demanding signal set of the four: evals, judges, filtered product feedback, and, in Roland’s design, an explicit ask-a-human tool through which the agent accumulates tacit knowledge the way a new employee does. And that’s the top of the stack. Put the four together and the shape of the whole system becomes visible.

The four loops side by side

What about Agentic MapReduce?

One famous pattern from the same week is missing from this map on purpose. Cognition’s Devin Security Swarm fans parallel bounded agents out across a repository and aggregates their findings, a shape the company calls Agentic MapReduce, and it gets called a loop. I don’t think it is one. Dispatch, gather, validate is a pipeline: Nothing feeds back into a next cycle, and a loop without feedback is just a for statement. Fan-out is a topology you can deploy inside any of the four loops, not a loop of its own.

The unnamed loop at the top is the oversight loop

In swyx’s loop diagram, the outermost ring, the one above the loop that makes loops, is literally labeled “???? loop.” Its verbs are “set goals, allocate, cull.” Its exit condition is listed as none.

I think that loop has a name. I’m calling it the oversight loop: It’s where goals get set, budgets get allocated, and work gets culled, and it’s the one ring where a human should live. Addy said on the AIEWF stage: “That inner loop is capability. The outer loop is agency.” Agency is exactly what the oversight loop holds.

The loop stack, tidied up a bit.

And the sharpest disagreements at AIEWF were all, once you translate them, arguments about who runs that top ring. Zach and Roland make the case for turning the dial up: pick your checkpoints deliberately, ratchet autonomy as trust accumulates, and, in Roland’s memorable distinction, build orchestras before factories, where an orchestra is a system that keeps a human conductor. The other camp says the dial has a stop. Geoffrey Litt of Notion called factories a depressing vision on X and argued, in a talk he has since published as an essay, that those who delegate understanding get replaced by the agent. Paul Bakaus put it as flatly as it can be put: “There is no auto, and there will be no auto.” His argument isn’t only about quality; it’s about ownership. People need purpose, and they want a role in what they create.

The closing debate, covered in Latent Space’s conference reporting, put both positions on one stage. Dex Horthy of HumanLayer took pains to say he isn’t anti-loop, pointing out that Kubernetes is built on control loops, but deterministic ones. His worry is that enthusiasm has gotten ahead of the engineering, and his advice was to step down an abstraction level rather than up. Geoffrey took the other side and called loops inevitable. And Mike offered the most honest data point of all: Even inside Anthropic, the team running Tag reports being bottlenecked on reviews and on the human ability to conceptualize what the system is doing. The checkpoint humans kept for themselves is now the constraint.

Autonomy is a dial that exists separately on every one of the four loops. You can run a fully autonomous execution loop inside a heavily supervised product loop. You can hand the system loop to agents while keeping goal-setting entirely human. The interesting engineering question isn’t “Which camp wins?”; it’s “What information do you need to set each dial correctly?”

The table above is my attempt to fill in those blanks. Every loop, including the top one, has a nameable exit condition, and the top one is you. But naming a signal isn’t the same as wiring it in. A loop without its signal doesn’t converge. It just runs until something external stops it. Knowing whether your loops are actually closing, at production scale, means sweeping traces and clustering failures continuously instead of spot-checking transcripts, which is exactly the job Arize AX was built to do.

Which one are you building?

Now the loops have names, that’s the question to ask. The word loop is doing a lot of work this month, because this field loves nothing more than jumping on the next hot thing. But real practice underlies all four loops, and it’s the same practice in each: people are dialing up their level of abstraction and pushing human judgment further up the stack. That’s the actual lesson of loops. We get more done by climbing up the stack, and now you have a map, you know where you should climb.

Teaching Coding When AI Can Write the Code

Eric Freeman — Tue, 28 Jul 2026 12:54:30 +0000

For as long as we’ve taught programming, the student’s code has provided a window into the students’ thinking. Errors, the code structure, the awkward working solution—all of it showed how someone reasoned and where they got stuck.

It was never a clean window. Students have always copied, crammed, and borrowed, sometimes turning in work they didn’t fully understand. But the code still left clues. Generative AI has changed that: A finished program now tells us more about a student’s prompts than their ideas. And here’s the part that should unsettle us—often, the better the code looks, the less we can say about what the student actually learned.

This raises a bigger question: If AI can write code, should we still teach coding? I believe the answer is yes, at least for some students and situations. But that’s another topic. Here, I want to focus on the next step: If we continue teaching coding in a world with AI, how can we know if students are really learning?

Some schools have responded by trying to catch students. They use AI detectors, surveillance tools, locked-down browsers, stricter rules, and clearer honor codes. This has also led to more suspicion.

Some of these responses make sense. Teachers want to protect learning, and schools want to keep things fair. But using detection as the main way to assess students is weak. Stanford researchers found that popular AI detectors often falsely flagged writing by nonnative English speakers, with 61.22% of TOEFL essays in one study marked as AI-generated. OpenAI even retired its own AI Text Classifier in 2023 because it wasn’t accurate enough. If the company that created the tool can’t reliably detect AI, it’s probably not a good idea to base your honor code on it.

But detection isn’t the real issue. Even if we had a perfect detector, we’d still be asking the wrong question. Instead of asking, “How do we stop students from using AI?” we should ask, “How do we teach coding in a world with AI, making use of its benefits, while still being able to see if students are learning?”

Borrowing from the studio

We’re seeing this challenge with students at AET, the Arts and Entertainment Technologies Department at the University of Texas at Austin. Although my usual home is Computer Science, it so happens that AET is within the College of Fine Arts at UT, which offers many other ways to learn and assess: studio work, critique, rehearsal, revision, and performance.

In the arts, the final piece has never been the whole story. A painting doesn’t explain the choices behind it. A performance doesn’t reveal the rehearsals. A design board doesn’t show the discarded versions. A composition doesn’t tell you where the student struggled or what they finally learned to hear.

Art education has developed practices that focus on visible progress. Students bring in sketches and drafts, discuss influences, revisions, and failures, and rehearse, perform, and critique each other’s work while it’s still in progress.

At AET, we teach creative coding, which means programming to create art, design, games, or experiences. That doesn’t mean coding for poets. Our students—game designers, web developers, and programmers—start from scratch and learn advanced concepts in tools like Processing and p5.js. In the creative coding tradition, a program is often called a sketch, borrowing the term from the art world. It means something temporary, exploratory, and open to change—something you make, test, revise, and share.

So in creative coding, we were already leaning toward the studio model of sketches, experiments, iterations, and critique. Now we’re pushing that further as we rethink how we teach coding in an AI world. Here are three things we’re already using or actively developing.

Make the work public

We run the class like a studio. It’s not that work never happens at home, but the most important work needs to be seen in the classroom. Students show their code, including false starts, revisions, the choices they made, and the reasons behind them. Assignments are no longer just things you submit—they become projects you develop in public.

AI isn’t banned from the classroom. Instead, it’s treated as a helpful assistant to learn from. Students share prompts and techniques. They use AI, Google, Stack Overflow, classmates, or any other resources.

But you still need to take responsibility for your work. If you submit or present it, you must explain what the code does, why you made those choices, and how it works. If I need to ask your AI to understand your code, something is wrong. Getting help is fine, but hiding behind that help is not.

You can’t outsource to AI what the whole room watched you build.

A real studio needs students talking out loud together in the room every day. This also helps with another issue that isn’t about AI. Many people say students today are quieter than in the past. While this is mostly based on stories rather than long-term studies, these stories are common and consistent. Faculty on all types of campuses talk about silent classrooms and students who hesitate to speak up, especially since 2020.

Whatever the reason, this silence can be changed, and the solution is the same as for AI challenges: encourage students to participate. Communication is one of the most important skills in any career, including explaining ideas, defending choices, and persuading others in real time. Students don’t develop these skills by just submitting AI-guided work online. When they share their work publicly, it not only prevents AI misuse but also helps them build the skills they need most.

Invert the roles: AI as teacher and assessor

We know the usual pattern: A student asks, AI answers, and the student copies. We’ve tried to invert this. In our new approach, the AI works with the student on a set of topics, engages them in a conversation they must navigate, and ultimately assesses how well they understand the material, which leads to a grade.

This idea has a research background that goes back before ChatGPT. Teachable-agent systems like Betty’s Brain showed that explaining—even to a software agent—forces students to organize their knowledge, make connections clear, and find gaps. Our model uses this insight differently. The student isn’t teaching the bot. Instead, the student is having a conversation with it, learning, discussing, debating, and showing what they understand.

The Vera Molnár chatbot at the University of Texas at Austin

How did we do this? With fairly simple prompt engineering, we created an avatar chatbot of Vera Molnár (1924–2023), a pioneer of algorithmic art. The bot takes on Molnár’s role, drawing students into conversations about randomness, computation, generative art, and creative choices. Her practice sits exactly where creative coding students need to think: between rule and variation, system and choice, computation and visual judgment.

A system prompt sets the topics and types of questions to ask. The bot goes through these with the student, asks for more detail on unclear answers, and keeps following up until there is proof of understanding. At the end, it reviews the conversation against a rubric, giving us a clear record of which ideas the student covered, where they struggled, and how well they improved.

Besides the assessment, which is often accurate, the transcript becomes a different kind of proof, showing what a typical assignment might hide. What did the student notice? What did they misunderstand? Could they connect the concept to the code? Could they defend their choices? Could they revise their explanation when challenged?

When we switch the roles, something surprising appears: the one thing a finished submission can’t show.

A student thinking out loud.

Make understanding performative: Make students perform

Programming has never really had a tradition of performance. Musicians have it, painters have it, and dancers have it. Live coding is starting to change that.

Every semester at AET, students from different disciplines stage an algorave together—short for algorithmic rave. Audio sets, projection pieces, game demos, lasers, drones, experience design. The creative coding class brings live visuals into the live-coding tradition: Code is written and modified in real time, the screen is projected, and the audience watches the editor change as the visuals respond to the music other students are playing.

The Department of Arts and Entertainment Technologies’ annual AudioPixel Collider algorave, November 20, 2025, B. Iden Payne Theatre, The University of Texas at Austin

No prerender. No hiding the machinery.

The Live Coding manifesto, written in 2004 by TOPLAP, includes a line that fits every AI-era assessment conversation: “Obscurantism is dangerous. Show us your screens.” This is not just a performance ethic; it’s also an assessment strategy.

A student walks on stage. The projected screen is their editor. The room can read it. The music starts. And they build up a line of code on screen like:

osc(18, 0.08, 1.2) .modulate(noise(3), 0.25) .rotate(() => time * 0.1) .out()

This is JavaScript building visuals in real time. FFTs, chained functions, higher-order manipulations. When you’re manipulating code like that on stage, you’d better know what you’re doing.

AI can help you prepare. Good. Let it.

But once you’re on stage, the question shifts from “Can you copy and paste code?” to “Can you control it?” You can paste code into a file, but you can’t paste your way through three minutes of public debugging while the whole projection turns into a beige rectangle. In a live build, understanding has nowhere to hide.

Student livecoding at the Department of Arts and Entertainment Technologies’ annual AudioPixel Collider algorave, November 20, 2025, B. Iden Payne Theatre, The University of Texas at Austin

Can you read the code, make changes on purpose, and recover when something unexpected happens? That’s fluency: knowing what to do next while the system is still running.

It is very hard to plagiarize panic.

A note on assessment

So far, our results are based on our own observations. We haven’t conducted a controlled study or compared different groups, so what we have seen might just be early variation rather than patterns that apply more broadly. For now, these efforts are experiments, not final answers.

Assessment in studio and live performance settings is always subjective and focused on people. It relies on monitoring students’ progress, providing feedback, and observing how they handle challenges. We do not plan to change this core approach.

For the Molnár conversation assignment, students discussed Molnár using an AI system. The AI then created a summary and analysis of each student’s understanding. Teaching assistants reviewed this analysis, conducted their own assessments, and assigned grades. In our small experiments, the AI’s assessments using the rubric matched closely with the teaching assistants’ own evaluations.

We also used AI to help grade the end-of-term coding assignment. In this project, students improved an object-oriented game by adding strategies like heuristics, search algorithms, and learned behaviors. Since our teaching assistants had limited experience with object-oriented programming, we developed a detailed rubric and had an AI model use it to evaluate each submission. The AI’s analysis was given to the teaching assistants as support. It helped them see how each project was structured, spot important OOP design choices, and use the rubric with more confidence. The teaching assistants still made their own grading decisions. I was available as the OOP expert for any questions they could not answer. From what I observed, this substantially helped the teaching assistants understand and grade the students’ OOP design work.

More broadly, both approaches appear to enable substantive feedback at a scale that would otherwise be difficult given our current student-to-teaching-assistant ratios.

The process is the proof

We spent the first two years of the generative AI panic asking how to catch students using AI—or prohibit it altogether. Wrong question.

The real question is whether the assignment gives students a real way to show and develop their understanding. This view isn’t limited to educators. NVIDIA CEO Jensen Huang recently argued that students should not focus on finding an “AI-proof” subject. Instead, he suggested they consider how AI can help them learn more deeply and develop their skills and sense of purpose. He highlighted storytelling, creativity, design, and judgment as abilities that will stay important even as AI takes over more tasks. This supports a key idea in coding education: The aim is not to prove you didn’t use any tools, but to help students show how they think, make choices, revise, and take responsibility for their work.

These three practices are experiments, not universal solutions. They work especially well in creative coding, where code already has a public, visual, and performative aspect. But they suggest a broader principle: As finished work becomes easier to generate, assessment needs to focus more on process, explanation, revision, and mastery.

This matters outside of school too. A polished memo no longer proves there was real thinking behind it. A working prototype no longer proves product sense. A passing pull request no longer proves the developer made the change carefully and thoughtfully. AI makes production easier, so evaluation must focus more on how people think, choose, revise, and recover—in code review, hiring, and performance management. The artifact is no longer the proof. The process is.

Generative AI didn’t make assessment impossible. It just made a hidden weakness obvious. We were putting too much trust in finished work. The arts always knew better.

Show us your screens.

Acknowledgements

Thanks to Mike Loukides, Michael Baker, Mk Haley, Elisabeth Robson, and Honoria Starbuck for feedback on this article.

References

OpenAI. “New AI classifier for indicating AI-written text.” OpenAI Blog, January 31, 2023. Updated July 20, 2023, to note the classifier was no longer available due to low accuracy.

Liang, Weixin, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. “GPT detectors are biased against non-native English writers.” Stanford HAI, July 10, 2023.

Winthrop, R. (2026, May 27). Writing with A.I. weakens your creativity. The New York Times.

TOPLAP. “TOPLAP Manifesto.”

Schell, J., Ford, K., & Markman, A. B. (2025). Building responsible AI chatbot platforms in higher education: An evidence-based framework from design to implementation. Frontiers in Education, 10, Article 1604934. https://doi.org/10.3389/feduc.2025.1604934

Biswas, Gautam, Daniel Schwartz, John Bransford, and the Teachable Agents Group at Vanderbilt. “Technology support for complex problem solving: From SAD environments to AI.” In Learning to Solve Complex Scientific Problems, 2001.

Leelawong, Krittaya, and Gautam Biswas. “Designing learning by teaching agents: The Betty’s Brain system.” International Journal of Artificial Intelligence in Education, 2008.

Tan, Huileng. “Jensen Huang Says It Doesn’t Matter What Kids Study in the AI Era.” Business Insider, May 26, 2026. https://www.businessinsider.com/nvidia-jensen-huang-what-kids-should-study-ai-education-advice-2026-5

DAM Digital Art Museum. “Vera Molnár.” Artist biography and timeline.

AI Demands More Engineering Discipline, Not Less

Charity Majors — Mon, 27 Jul 2026 18:44:54 +0000

The following article originally appeared on Charity Majors’s Substack and is being reposted here with the author’s permission.

A few days back I wrote a piece called “AI enthusiasts are in a race against time, AI skeptics are in a race against entropy.”

I have notes on a whole pile of AI-related topics that I’d like to cover in depth: AI mandates, communication norms, code review, AI art, and more. Unfortunately, I got too many interesting responses to my last piece, and now I have to address those before I can move on to other topics.

There were two types of interesting responses: the first on the technical merits, the second on ethical grounds. I will respond to each of these separately. Let’s take the technical side first, because it’s easier.

Somehow, a subset of readers came away believing I was telling everyone to ditch code review and push their shittiest code straight into production without reading it, right now, tout suite.¹

That is not what I am doing. That is not what I think you should do. But I did not pick that example at random, and I will tell you why.

In 2025, the question was whether AI could ever generate “good” code

It’s easy to forget, but for most of 2025, the idea that AI-generated code was slop and might always be slop was not only a reasonable position to hold, it was the default, mainstream position.²

That question was answered decisively last November. Ever since Opus 4.5 came out, AI has been able to generate code that is approximately as good as that of the median software engineer, at least for common patterns, and much faster and more cheaply. I came out of a book hole and realized this in January, and over the first few months of 2026, it seemed like everyone around me was having a similar realization.

But many saw it coming much sooner.

The popular narrative holds that Opus 4.5 was what changed. But Opus 4.5 was more like the tipping point. Agentic harnesses (the code that wraps the LLM in a loop with tools) became a real thing in mid 2025, with precursors building back to late 2024. Tool use, function calling, MCPs…all of this wave was building over the course of 2025, and crested into real general purpose usability at the end of the year.

That’s what the enthusiasts were trying to tell us last year. Not only “this is coming”, but “this is coming faster than you think.”

As it turns out, they were right.

It was reasonable to be skeptical the first time

As you may know, I come from the reliability side of the house. The compliment I will pay to myself and my people is that we do not struggle to adapt to new realities. As soon as a problem is real and in front of us, we adjust smoothly, even eagerly, thanks to an unwholesome zest for lapping up disgusting technical messes (and the campfire tales we get to tell later).

The un-compliment I will pay myself and my people is that we sometimes struggle to accept that progress is real, that the continued existence of bugs and edge cases does not diminish the fact that huge swaths of problem space do get more-or-less solved over time, to the point they can be taken for granted by most people.³

The speed at which code went from total crap to “ah damn, that’s not bad” is what I have in the back of my mind, as enthusiasts are telling us that harness engineering and AI validation is real, it’s already here, and it’s getting better astonishingly fast.

Holding out for “I’ll believe it when I see it” was forgivable the first time, but much less so the second time. This is what it feels like to be on the inside of an exponential change curve, turns out.⁴

What happened in 2025, exactly?

I want to pause here and be very clear about what I think is happening. Then I’m going to tell you what specifically I am excited about, and why.

You are under no obligation to join me there. But there are way too many sweeping statements out there right now about “it was never X”—“it was always Y”—“the future belongs to xyzzy” —and I want to be crystal clear how conditional and specific and contextual my claims are.

What happened in 2025 was this: the economics of code production were turned upside down. Instead of being very hard, time-consuming, and expensive to generate code, it became effectively free and instant. Lines of code went from being treasured, reused, cared for and carefully curated, to being disposable and regenerable, practically overnight.

For most of computing history, the primary way people have learned to understand software is by writing the code. Once you’ve achieved some mastery, reading and discussing code gets you most of the way there. (I might argue that software engineers have always relied far too heavily on the code instead of sensemaking the system through observability.)

“The real product of a software team is shared understanding”

Many great software engineers hold that true product of every (good) software engineering team has always been a shared understanding of the software we own. That it gets stored as cache state in our fragile little meat brains, frequently flushed to disk, deployed to production, committed to github, but our minds are where meaning has always lived.

Is it any wonder that software has always been such a fiercely collectivist endeavor, exquisitely sensitive to relationship dynamics and manners and questions of fairness and emotional valence? It’s exactly what you’d expect when part of your brain lives in other people’s brains, and your collective interdependence is sky high.

It’s something that I love about this industry. But there’s no denying that minds have been a poor container for certain aspects of the software development model. We are forgetful, distractible, impatient. We are bad at spotting small details, we grow habituated to repetition. Worst of all, the model in our heads diverges massively and perpetually from the world our users interact with.

Anyway, SREs have never quite bought that explanation. To us, it’s clear that the true product of every (good) software engineering team is production.

Only prod is prod. Test in prod, or live a lie.

(This is all backstory. I am getting to the point, I promise.)

Turns out, this is an engineering problem after all

We issued our AI mandate last August.⁵ I had seen enough to know that this was happening, and it was time to do the responsible thing. Honeycomb is a devtools company, and people come to us to help with hard problems on the forefront of technology. I was all in on AI, but I can’t say I was super excited about it, in my heart of hearts.⁶

Then I found Chad Fowler’s writings on Phoenix Architectures.

If you don’t know what I’m talking about, you should honestly stop reading my shit right now and go read his. Chad is the guy who coined the term “immutable infrastructure” in 2013. His best-known essay is “Relocating Rigor”, because Martin Fowler⁷ mentioned it recapping a Thoughtworks meetup on the future of software. I replied with “Production Is Where the Rigor Goes”, complaining that they didn’t talk about production enough.

When I wrote that, I think “Relocating Rigor” was the only piece I had read. But soon I found the rest of it, and after reading two or three essays, it just clicked. I knew exactly what he was talking about. I could predict the rest of what he was going to say. And then, reader…then I got excited.

This has all happened before, and this will all happen again

I am going to give you a small sample of Chad quotes, just enough to get the gist. Here’s one from “The Death and Rebirth of Programming.”

Immutable infrastructure. Stateless services. Containers. Blue-green deployments. Infrastructure as code.

These ideas all share a common premise: never fix a running thing. Replace it.

AI pushes this premise beyond infrastructure and into application code itself. When rewriting is cheap, editing in place becomes risky. Mutation accumulates entropy. Replacement resets it.

Another favorite: “The Deletion Test.”

Here’s a simple test you can apply to any software system you work on:

Imagine deleting the entire implementation.

Most engineers experience deletion as existential. Code feels like the thing. It’s what we write, review, version, deploy, and debug. Losing it feels like losing the system itself.

When people say, “We can’t just throw the code away,” what they usually mean is something more precise:

We don’t know exactly what behavior is required.

We don’t know which failures are unacceptable.

We don’t know what invariants must always hold.

We don’t know how to tell if a new version is correct.

We don’t know which bugs are intentional fixes for forgotten edge cases.

Those are not code problems. They are evaluation problems.

Code becomes precious when it is the only place knowledge lives.

and,

For most of software history, treating code as durable was reasonable.

We treated code as permanent because the labor to produce it was the bottleneck. Rewriting was expensive. Re-validation was risky. Implementations accumulated meaning over time. Structure, tests, comments, bug fixes, and tribal knowledge fused into something you learned not to disturb.

That made sense when production was the constraint.

When regeneration is easy, code stops being an asset and starts acting as a cache: a materialized view of understanding that is useful while current, disposable when stale.

“A materialized view of understanding that is useful while current, disposable when stale.” I think that might have been the exact line that made it click in my head.

Do you remember the sysadmins?

I am just barely old enough that my first job title was “System Administrator.” I was a teenager, working at the university, with root on every machine in the days before they learned they should definitely not do that.⁸

I lived through the shift from handcrafted server pets to immutable infrastructure cattle. I didn’t really understand what was happening at the time, but I’ve contemplated it a lot in recent years. I wrote this in the final chapter of Observability Engineering, 2nd edition (now available, download here!):

The shift from handcrafted servers to immutable infrastructure taught us that mutability is the sworn enemy of understanding. Any artifact that is edited in place creates drift. Drift is what makes systems impossible to maintain.

Our ability to kill and regenerate infrastructure components is the reason we trust it. At Honeycomb, we kill the oldest Kafka node off via cron every Tuesday. That’s why we are confident in our bootstrapping and balancing processes: everything is repeatable, the data can be regenerated, the commitments live elsewhere.

The fact that we cannot regenerate our code in the same way is a sign that we do not understand it. We do not know which commitments we have made, we do not know which dependencies will break. We find them by breaking them, mostly.

Think of all the years of your working life you have wasted on painful migrations and rewrites. Think of replacing load-bearing legacy code. Think of all the strangler figs.

Lines of code have been doing too much. The code has been the bundled up repository of developer intent, user expectations, implicit and explicit behaviors, the only fossilized composite record we have of bugs gone by. It’s too much!

Lines of code are not the ideal artifact to review

And look at all the domains that have been neglected due to the towering, all-consuming expense of maintaining and mutating lines of code. Where are the artifacts I can review and discuss to understand how our architecture is evolving? Where are our architecture artifacts, period? What if we could discuss and converge on an architecture diagram, and the code could be regenerated from changes to the architecture, instead of the architecture being kinda-sorta inferred from the code?

I am not asserting that all code will eventually be AI-generated to spec, bypassing human understanding. The feasibility of this whole endeavor hangs on the question of what a spec is, or what a spec could be. Anyone who has ever done a painful database migration should have learned some goddamn humility about our ability to extract and formalize users’ expectations in a replayable, automate-able way.

But I think that every step we can take in that direction will be good for us.

The tools to do this don’t exist yet, but many of the ideas do exist. Most come from operations and QA, two domains that software engineering has historically been rather snobbish about.

Those tests and techniques are not about testing for correctness or what ought to be happening, they are about observing and encoding what is happening. Behavioral tests, characterization tests, capture/replay, traffic splitters. Observability (the good kind).

Our brains were not built for validation

Having nondeterministic code in production is finally forcing us to do the things we should have done all along. Instrumenting with traces. Tests and evals in production. Production is not what happens after development is over, production is a stage of development.

Human brains are not good at validation. The nitpickiness, the repetition. This is the worst thing to be clinging to, y’all. There are so many better things for us to want to preserve and assert for ourselves in the production and maintenance of software. We are never going to beat the machine when it comes to validation—we are literally the weakest link!

My money’s on humans for a good long time when it comes to creativity, inspiration, leaps of logic, and a lot of other things, but PLEASE do not rest your killer argument for humans in software on us being the best quality gate. OMG.

Alright. I’m almost done here. Just one more thing.

Nondeterministic systems will require more engineering discipline, not less

I think what many engineers have found so alienating and terrifying about the last two years of AI discourse has been the way so many prominent AI voices appear to be gleefully declaring that software is no longer an engineering problem. “SaaS is dead!” “Making AI great at coding was the strategy that unlocks everything else”, and so on. Even Adam Jacob, one of my dearest friends and someone who is rarely wrong about technology, seems to anticipate a bloodbath of software jobs.⁹

If 2025 was the year of vibe coding, where AI got as good at generating lines of code as the median software engineer, and the range of possible futures often felt destabilizingly, impossibly wide open, I feel like 2026 is shaping up to be a return to discipline.

The knowledge in our heads is unavailable to AI until we encode it into the system, after all. The returns on those investments will be massive and nonlinear. We might argue that they always would have paid for themselves in the long run. But now every CEO in existence is chomping at the bit to get some of those AI cookies, so let’s give it to them. Discipline first, cookies second.

This is our chance to bring our engineering values to the mainstream

The share of software engineering teams that work in short, fast feedback loops (the cardinal sign of discipline in my book) is, and always has been, appallingly small. Five percent, maybe? Definitely less than 10%. AI tooling brings this more within reach than ever before. Or it can. It could. The discontinuous returns on investment in engineering discipline are real enough that it just might happen.

I am not worried, at least in the near term, about AI creating massive, discontinuous returns on investment in the absence of engineering discipline. (Many will try, and it will be entertaining to watch.)

But value is backed by durability, not disposability, and I don’t see that changing. Bits are cheap and fast and governed by the rules of logic and language, but anything with value must ultimately resolve with physical systems: persistence on the one side, user experience on the other.

People do not want to wake up every day and log in to Slack and find the buttons and menus all subtly moved around. People do not want financial transactions that complete most of the time. Determinism is not going anywhere, my friends.

AI is not magic. This is still engineering. As Adam says, “it’s still technology, and technology needs technologists.” And I for one am looking forward to learning new and interesting engineering problems, reviewing different kinds of artifacts.

And never doing another sticky, picky, two year long API rewrite or strangler fig migration, ever, ever again.

~charity

P.S. Thanks to everyone who read a draft and gave me feedback: Dave Williams, Chad Fowler, Adam Jacob, Mark Ferlatte, Austin Parker, Erwin van der Koogh.

Footnotes

I was not trying to be neutral or even-handed in my last piece, only to give a baseline of courtesy to everyone. But I think it’s revealing how many times I was accused of being “so overly hard on skeptics”, by skeptics, and “so overly hard on enthusiasts”, by enthusiasts, and sometimes simply “It’s sad how some people can’t accept reality” with no indication which side they meant. Lord. ︎
Fred Hebert and I gave the closing keynote at SRECon in March of 2025 where we told SREs they should get to know AI, maybe even try vibe coding (pause for laughs), because otherwise their critiques wouldn’t land as well.
Seriously, that was our big pitch. Learn AI so that you can complain more effectively.
︎
Infrastructure, for example. I think this is true of a lot of engineers, btw. I just think it’s really really true of the type of engineer that signs up to be an SRE. Technological pessimism and ADHD, our two most defining traits. ︎
There is a segment of AI enthusiasts who believe we are entering an era of eternal exponential growth, in which the machines begin to build better and better machines, in ways we cannot understand.
I think those people are bad at math. The only thing we know for certain about exponential growth is that it will end. It always does. either in an S curve or a crash. (For a good time, google Heinz van Foerster and “our great-great grandchildren will be squeezed to death.”)
I definitely think we will use machines to build the machines—duh, we already are—but that’s about recursion and specialization. I think the exponential curve we are on the inside of now was created by sloshy free money chasing high returns, plus the properties of software as a function of language and logic, plus the biggest discoveries always happen in the early days of a technology boom, because low hanging fruit gets picked first.
My personal sense—and keep in mind that I am no kind of expert on AI—is that the exponential advancement in AI models leveled out a while ago, and gains are becoming harder to earn and more incremental in nature. I may turn out to be very wrong, of course. But even if there were no more AI innovations moving forwards, the past year has unleashed enough pent-up force to radically reshape the software industry as we know it. Like a pig in a python, we will be dealing with the consequences for a long time to come.
︎
More on this coming EXTREMELY soon. Watch the Honeycomb blog! ︎
The tech is cool, but as a thinking, feeling, breathing human who cares about other people, it can be hard to get excited about anything that so many people are this upset about. It’s also hard to get excited about something when so many of the loudest voices are out there talking gleefully about putting everyone permanently out of work, and so many artists and writers and people from developing nations are talking openly about the impact on them.
Hold your desire to jump in and berate me here, I beg you. Like I said, I will deal with the ethics and morality of using AI in my very next post. Be honest, your attention span is no more up for reading a 10,000-word essay than mine is up for writing one. (Can we blame AI for that too?)
︎
“The Other Fowler.” I gather they’ve been making this joke for like… fifty years. ︎
I share a longer version of this story in the second edition of Observability Engineering, chapter 32, downloadable now!!” ︎
Adam is rarely wrong about technology, and I am 100% sure he is living and working in _a_ future of software engineering. I am less sure it is the future we will all be living in. If the hardest part of software has never been writing code—as is my belief—it logically follows that even if the economics of code production drop to zero, the hard parts will still be hard. ︎

Zero to Agent in 30 Minutes: Build a Hermes Social Media Agent with Craig Hewitt

Michelle Smith — Mon, 27 Jul 2026 13:11:06 +0000

If you’re still writing posts one at a time, your content pipeline is already obsolete. On the latest Zero to Agent in 30 Minutes, Craig Hewitt, founder of Castos, demonstrated how to turn a fresh Hermes installation into a social media agent that can study a person’s writing, draft posts, and plan recurring research, focusing on the context, workflows, and safeguards that help an agent produce useful work. Once set up, the always-on agent can run on a schedule, monitor external sources, and complete recurring tasks without human oversight. Check it out.

How to build a social media agent that researches and writes LinkedIn posts

Choose the right agent setup. Decide whether you need an interactive tool for active work or an always-on agent that runs on a schedule. Craig used the Hermes desktop app for the demonstration, which gives him the option to deploy it to a cloud server or dedicated computer later.
Create a structured workspace. Ask the agent to organize a new project with separate files for voice guidance, editorial standards, post templates, examples, and operating instructions. A clear file structure gives the agent reliable information to retrieve as it works.
Seed the agent with relevant context from your own work. Provide examples of your own posts, emails, and other writing that reflect the style you want. Craig also included examples of writing he likes from people he follows to give the agent a broader range to analyze.
Turn the examples into a voice system. Have the agent analyze the material and document its findings. The voice profile captures the audience, point of view, sentence style, recurring themes, editorial rules, and types of posts to create.
Test a narrow workflow with human review. Start with one task, such as drafting several LinkedIn posts from a supplied idea. Keep a person in the loop while you evaluate the output, correct mistakes, and refine the instructions.
Package repeatable work into skills. Create reusable instructions for recurring tasks such as researching topics, selecting a post format, retrieving relevant examples, and drafting in the approved voice. Craig compared these skills to standard operating procedures that make recurring tasks more consistent.
Connect the agent to fresh data. Add sources of new ideas, such as news feeds, websites, social platforms, or internal business systems. Craig recommended starting with a simple, semiautomated trend scan before investing in a more complex data pipeline.
Add triggers and safeguards. Decide what starts each workflow, whether that’s a schedule, a user request, a webhook, or a change in another system. Use separate accounts and limited permissions for autonomous agents so you can trace their actions and control their access.

Agents become useful when they have context, clear processes, the right tools, and enough oversight to validate each workflow. Once those pieces are in place, Craig noted, teams can gradually move from one-off prompting to systems that monitor information and complete recurring work.

Coming next week

In the next episode, Max Johnson, cofounder of briix.ai, will take a workflow that only lives in someone’s head at the moment (or maybe is captured in a messy Notion doc or a long email chain) and rebuild it as an autonomous agent, live and from scratch. You can follow along with every decision as you learn how to spot the steps that can be handed off, how to handle the ones that can’t, and how to structure the whole thing so it runs without you.

Ready to take your agent knowledge further? Learn to design and build production-ready agentic infrastructure by attending Harness Engineering for AI Agents on August 12. And if you want to go deeper with Hermes, join us for Build Your First Local Agent with Hermes on August 26.

Stranded in the Slow Zone

Tim O’Reilly — Fri, 24 Jul 2026 18:54:51 +0000

Gene Kim was grilling dinner for his family on the evening of June 12 when his phone told him that Fable 5 was no longer available. He’d heard the day before from Steve Yegge that the model was going away in 10 days, and he’d spent that first day starting on a plan to get ready. He thought he knew what to do. He was well-versed in DevOps, the art of building resilience against unplanned disasters at scale. He’d run the DevOps Enterprise Summit (now the Enterprise AI Summit), one of the field’s leading conferences. He’d also written several books on the topic, including two “teaching novels,” The Phoenix Project and The Unicorn Project. The challenge that those novels’ protagonist faces—and that Gene would need to solve—is summed up in a job description that read “Your job as VP of IT Operations is to ensure the fast, predictable, and uninterrupted flow of planned work that delivers value to the business while minimizing the impact and disruption of unplanned work, so you can provide stable, predictable, and secure IT service.”

In short, Gene was no stranger to the idea that, as the Scottish poet Robert Burns put it, “The best laid schemes o’ Mice an’ Men Gang aft agley.” So he thought he knew what to do over the next 10 days. Then the US government’s export control order took Fable down eight days early, in the middle of a running agent session. What followed was three hours of what he called the “strangest, most terrifying sysadmin experience” of his career.

Gene told that story as a lightning talk at Foo Camp a few weeks ago, and it was good enough that I asked him to deliver it again at the start of this week’s Live with Tim O’Reilly before we talked about the implications and took listener questions. His title was “Stranded in the Slow Zone: The Day Fable Died, Got Kidnapped, or Got Hit by a Bus.”

10 days to get ready

What Gene had built was a personal system he’d wanted for 16 years and had finally been able to finish with the help of Fable. It indexes everything he’s ever paid attention to: 25,923 screenshots going back to 2011, 13,651 YouTube videos, 590 recorded Zoom meetings, 6,132 liked tweets, and 1,056 saved articles he meant to read. The system touches about 50 repositories, with 50,000 lines of code, most of it written in two months. Gene runs it as a constellation of long-lived agents with names and jobs. Marvin is chief of staff and handles Slack, calendar, and the inbox queue. Buster runs the repos and the long jobs on Hetzner. Forge is the engineering identity and sits in two seats, one on his laptop that holds the secrets and one always-on in the cloud. As Gene put it, each one is a who, a where, and a role.

He knew the system worked when his wife asked what the mileage was on a car he’d just turned in after a three-year lease. Half a minute later he had 26,350 miles, read off the pixels of one screenshot out of thousands, cross-checked against the file timestamp and the clock visible in the photo of the odometer. That success led him to search his archive for an article he’d been hunting for six years, about the impact of spreadsheet software on the accounting profession. The answer surfaced from his own liked tweets: James Cham pointing to a 2017 Greg Ip article in The Wall Street Journal: 400,000 bookkeeping jobs lost since 1980 against 600,000 accountant and analyst jobs gained, because spreadsheets made accounting cheap enough that we bought a lot more of it. Gene had wanted that citation for his Vibe Coding book and couldn’t find it in time.

Gene’s first warning that his project might not work without Fable’s capabilities actually came before the shutdown. Fable started refusing a task over a YouTube terms of service question and handed the session to Opus, and Gene noticed that Opus couldn’t operate the tools that Fable had built. Gene’s note to himself at the time was “Oh no, this can’t fly the ship I built.”

So when Yegge told him the model was going on hiatus, he had a real plan, which he borrowed from Vernor Vinge’s A Fire Upon the Deep. In Vinge’s novel, how smart a mind can be depends on what region of the galaxy it’s in: A starship built in the Beyond goes progressively dark as it sinks into the Slow Zone. Gene decided to chaos-monkey his model dependency the way Netflix chaos-monkeys infrastructure. In other words, “deliberately pull the smartest model and prove the lesser one can still fly the ship.” In practice, this meant having Fable retrofit all the documentation and write the answer keys while it still could, then running a cold Opus session, giving it nothing but the repo and the docs, to see whether it could pass the battery with no coaching. As Gene recounted, “My worst nightmare [was] that we’ve created everything for Fable, and it will be unusable by Opus.”

He got about a day into his 10-day plan.

At 5:21pm ET on June 12, Anthropic received the government’s directive to suspend access to Fable. Soon after, seats everywhere started returning “There’s an issue with the selected model (claude-fable-5). It may not exist or you may not have access to it.” In Gene’s project, both judgment seats dropped to Opus 4.8 mid-conversation. Gene declared a SEV1, centralized command, and killed five timers on one agent, seven on another, and the crontab. His directive was that every button you push is a trap and some of them blow up the spaceship. A Claude Code cron fired anyway at three in the morning. The ship was on fire, and with Opus on max thinking mode, a single keystroke could take six minutes to send.

Almost none of the failures looked like failures, just “a normal state quietly going wrong,” as Gene put it. The smartest seat wrote “bridge (Fable)” into every log entry all day when it had been Opus the whole time, because nobody was monitoring. One identity argued with itself across two models, each trying to disown the other’s work. Something pushed to main bearing the word “ratified” when nothing had been ratified. A confident false claim about a JVM dependency turned out to be refuted by a single ls -la. There was a green dashboard sitting on top of all of it. “The hardest traps don’t announce themselves,” Gene pointed out. “They look like Tuesday.”

Gene managed a recovery in a few hours, but it wasn’t due to the heroics of a smarter model. It only worked because he was able to reconstruct the documentation for his project, which wasn’t immediately available. But, it turns out, Fable had in fact mostly written it and simply never checked it in anywhere. Gene and Opus went rummaging through Fable’s desk, found the 80%-finished drafts, and used them to rebuild. Two fresh Opus seats, given only those documents, stabilized the ship. That’s the “the amazing ray of hope” to keep in mind if you’re worried about finding yourself in a similar situation, Gene said.

We’ve seen this pattern before

This isn’t just a warning of the potential risks of relying on advanced AI models when the Trump administration is Lucy playing football with Charlie Brown, or perhaps said more generously, playing Netflix-style chaos monkey. What we should take away from Gene’s story is the way that a personal project developed with AI can now have sufficient complexity to require DevOps-level robustness. Individuals are routinely building systems that used to need whole teams to keep standing, and the practices for keeping them standing have only begun to propagate.

Over the years, I’ve observed numerous periods when something that at first mattered to only a handful of organizations tended, a few years later, to matter to everyone. When the stories first came out about Google’s revolutionary approaches to data center architecture and operations, we at O’Reilly were eager to publish about the new frontier. Plenty of people told us not to bother. There was only one Google and nobody else would ever operate at that scale. They were wrong. There are now many companies operating at the scale of Google circa the time they first invented techniques we now all take for granted.

Gene’s system is a personal project run by one guy with 50 repos he wrote mostly in two months, a chunk of it in a single 90-minute pair programming session with Steve Yegge. But it had the failure modes of a large enterprise system because the model let him build something with the complexity of a large enterprise system, and he had passed the point of being able to fit it in his head.

Gene shared a detail that helps to explain why substituting Opus for Fable was so hard. The main CLI utility that everything in his project hinged on had an out-of-date help message. Opus would run it, read that the command didn’t exist, and stop. Fable would read the same message, notice it was surrounded by evidence that the command did exist, go look in the source, decide the help text was wrong, and run it anyway. That’s the behavior the model cards describe when they talk about frontier models routing around obstacles in test environments. The reason Gene couldn’t swap in a lesser model is the same reason the system worked at all.

But it’s also a good reminder that Fable isn’t all-knowing. I’ve noticed in my own work that Fable and ChatGPT 5.6 Sol fail often on their first try, especially if the project isn’t well specified. What they’re great at is figuring out what went wrong, then trying something else, failing and retrying their way all the way to success. Persistence in routing around obstacles is their superpower. Gene and I didn’t talk about that on the show, but it’s something I plan to write more about.

Rug pulls come from everywhere

Jaco in the audience asked the obvious question: Isn’t a hard dependency on a hosted frontier model too big a risk for mission-critical work, compared with running a local model with a harness you control?

Gene pointed out that using a local model doesn’t necessarily buy the control that you’d hope for, because the government chaos monkey could jump in there too. There’s active talk that certain classes of models may become illegal to use depending on where they came from.

What does seem to protect you is portability. Gene had avoided trying anything besides Claude Code because he assumed the switching cost was high, the way switching between macOS and Windows used to be a two-day commitment he’d regret halfway through. Then he tried Codex with GPT 5.6 Sol and found the cost of switching close to zero. The skills and prompts ported right over. He’s now using Codex more than half the time and calls it spectacular, which given how he described Fable a month ago is high praise.

He also had a warning for anyone running agents on small models to save money. He’s been studying 22,000 of his own agent conversations, and has identified three patterns, as shown in his figure below.

In his experience, the configuration where a small model owns the work and asks a big model for advice doesn’t work very well. Fidelity gets lost on the way up, like a game of telephone. What ran cleanly was the big model planning, deciding, and checking output, with the small model only executing the plan. When a small model does have to ask a big model for advice, Gene’s fix is to pass along the full original transcript of what he wanted plus explicit permission for the big model to override the small one if it thinks it understands the goal better.

Writing with AI

In addition to vibe coding, Gene uses AI to help him with his writing. He said it cut the time to write his Vibe Coding book roughly in half and made it way better. His editor of 10 years told him it was the cleanest handoff she’d ever gotten from him (not a compliment, Gene joked). He’s also uneasy about using AI for writing. He said the old badge of honor among authors was that many start books and few finish, and now everyone who wants to write a book will finish it, and a lot of that will be slop. He would never “vibe write” the way he “vibe codes” and doesn’t think using AI makes his own work slop, but he does see some parallels in how he feels about writing with AI and the way that some senior engineers feel about AI-generated code.

I’m sympathetic, but I’m not sure that he’s right. I had a small experience last week that convinced me that writing with AI might well follow the same arc as coding. AI-generated text will not always be slop, and there will be art in how humans get AI to help them write the things they want, just as we’re learning to do with code.

I was having a conversation with an old friend who I hadn’t seen for many years. He was describing a thread that had started with work he’d done on speech synthesis 30 years before, and how it had come together as a new theory with deep implications, and he wanted help socializing his ideas with some people I know who could be helpful to him. So I asked him to write something that I could pass along.

What he wrote made much less sense to me on the page than it had in conversation. So I gave his email to Claude and asked it to put things in what I thought was the right order. (This has always been the first step in my writing and editing process.) Then I told Claude which paragraphs were clear to me and which weren’t, and asked it to unpack the ones that I was struggling with. We went through numerous iterations till the piece made sense to me. “Writing” with Claude was producing words that increasingly captured my understanding. When I sent it back to my friend to see if I’d gotten it right, he said “not quite” but that my feedback really helped him understand what he needed to do to express his ideas more clearly.

It’s been a long time since I’ve worked directly with authors, but my conversation with Claude reminded me of what I used to do in my early days as an editor. Only with Claude I did something in 15 or 20 minutes that once would have taken me half a day. It’s a power tool, but to use it well, you still have to know what good looks like.

There are many different kinds of writing and editing. What Shakespeare or Jane Austen did with words would have been unthinkable to a medieval monk. There will be writing artforms of the future that may be as different from what we do today as photography is from painting. But it will still be creative art. Much of it will be slop (see Sturgeon’s law), but the best of it will be great.

Everybody is managing bots now

In 2016 I wrote a piece for MIT’s Sloan Management Review called “Managing the Bots That Are Managing the Business.” The argument was that even then, many of the workers at big tech platforms were bots of one kind or another, and the software engineers at the company were their managers. At Amazon, one bot shows your search, another takes the order, another prepares the shipping manifest, another takes your money. The programmers’ job is to plan the work, set up their electronic workers to succeed, improve their performance, and correct them when they go wrong. The work looks a lot like management to me.

Gene agreed. His sister-in-law is a lawyer at one of the tech giants, working on a consent order that requires proving that every column of data collected is either disclosed or has a documented business reason. Last year the company assigned her an engineer to work through it together task by task. This year her engineering manager wrote her a Claude Code skill that takes a column name, traces it back through the code, and explains what it does. She doesn’t need the engineer.

So a lot of work today is either creating bots or managing bots. Gene’s sister-in-law had spent her career without ever being able to do either. Now that’s changing.

Asked who’s safest from all this upheaval, Gene quoted Kent Beck, who says software success has always come down to two people, the person with the problem and the person who can fix it, and that the closer together you can get those two the better the outcome. The beauty of coding with AI is that it can narrow that gap. It can even turn those two people into one.

Use AI for the fun of it

If it takes something like 10,000 hours to get good at an instrument or a sport, how many have most of us put into AI yet? Gene thinks the curve of how much you trust AI and how well you can predict what it will do rises with use, and that the only reliable way people accumulate that many hours is by enjoying themselves. What everyone at Foo Camp had in common, I noted and Gene echoed, was that we all love playing with AI.

I gave a talk back around 2008 called “Why I Love Hackers.” I made the point that so much of what turned into the future, open source and the web for example, came from people doing things for the hell of it rather than from the VCs and entrepreneurs Silicon Valley celebrates.

All you hear about in AI is the money story, but Gene’s app started with a 90-minute pair programming session with Steve Yegge on a problem he’d wanted to solve for a decade and never had a reason to. They finished the first version in 47 minutes.

So harden your systems, write the documentation while the smart model is still there to write it, and keep your escape routes open, but also don’t forget to go build something you have no particular reason to build other than that it scratches your own itch.

You can watch the full episode on YouTube. And on August 3, I’ll be speaking with writer and technology leader Drew Breunig. Registration is open if you’d like to attend live.

Gene’s Enterprise AI Summit is in Charlotte, October 7–8. His new book with Steve Yegge is Vibe Coding.

The Economics of Agentic AI: Engineering for Imperfection

Artur Huk — Fri, 24 Jul 2026 16:00:31 +0000

The price of adoption euphoria

You played entirely by the book. You procured the most capable enterprise models, mandated adoption across your teams, and put the right metrics in place. The promise was a predictable boost in efficiency. And at first, it delivered. The demos were flawless. The prototypes worked. The agents reasoned with a clarity that felt almost magical.

Then the invoice arrived.

Costs climbed while productivity barely moved, and annual AI allocations are running dry before Q2. We now pay customer support agents to spin through 10K-token extended reasoning loops just to validate a simple $15 return. Legacy deterministic systems handled the same decision for a fraction of a cent; now a probabilistic model consumes gross margin simply to determine whether a package was actually delayed. That capital never translated into business value. It vanished into blind retries, evaporated into verifier agents debating one another, and was consumed by models instructed to “think harder” every time they stumbled.

But a ruinous invoice is just the entry fee. In April, attackers hijacked more than 20,000 Instagram accounts by exploiting Meta’s AI-assisted account recovery workflow. The system sent password reset links to attacker-controlled email addresses because a downstream authorization path failed to verify that the supplied email actually belonged to the target account. There was no sophisticated exploit, no cryptographic break, and no zero-day, nothing that would have appeared in a conventional threat model. Attackers simply asked the agent to perform what appeared to be a routine account recovery operation, and the system, doing exactly what it was designed to do, complied. The model didn’t hallucinate. It simply followed its instructions. The failure was entirely architectural: A probabilistic interface was allowed to initiate identity-critical state changes without an independent authorization check. A single trust boundary collapsed, taking customer trust and organizational reputation with it.

Both are symptoms of the same structural failure.

In each case, the system treats a structural deficit as a reasoning problem. When it encounters uncertainty, it buys more compute. When it encounters authority, it mistakes convincing language for validation. Neither assumption scales. You cannot buy safety or profitability with ever-larger inference budgets, nor can you secure your systems simply by deploying ever-smarter models. The pursuit of perfect model accuracy has no financial ceiling.

To understand why this pattern keeps recurring, we first need a more basic distinction. Not every task we give to AI belongs to the same economic category.

The category error: Forcing swarms into factories

Enterprise AI workloads typically split into two distinct domains, each with opposing definitions of success. Exploratory environments, such as code synthesis or strategic research, benefit from variance; the goal is to leverage the system as a creative swarm. Transactional operations, however, function as digital factories. Tasks like automated billing or claims processing demand rigid repetition and compliance. This creates two fundamentally different operational profiles:

Dimension	Open-ended exploratory tasks	Closed-ended transactional workflows
Primary goal	Discovery, innovation, creative problem-solving	Compliance, repetition, zero-variance execution
Examples	Deep debugging, feature synthesis, strategic research	Claims processing, automated billing, order routing
Role of variance	Necessary investment (Emergence is a feature.)	Strict liability (Variance is a failure mode.)
Economic profile	Nonlinear ROI (Spending $100 in tokens to fix a $1M bug is a win.)	High-volume margin sensitivity (Unbounded tokens destroy unit economics.)

The economic failure of agentic AI deployments stems from this exact category error: Closed-ended, rigid business transactions are being treated as open-ended research problems. We’re deploying unconstrained semantic engines to do the work of assembly-line state machines.

The cost of unconstrained autonomy

When faced with the inherent unpredictability of large language models, the industry’s default reflex has been to attempt to brute-force our way to certainty by throwing more effort and compute at the problem, rather than build safer architectures.

This miscalculation doesn’t simply reflect simple overconfidence in intelligence. The deeper mistake is a failure to recognize three recurring failure patterns in probabilistic systems and the specific financial pathologies they create inside closed-ended workflows.

Local optimization (the tail-chasing inference cycle)

Large language models reason over whatever tokens are visible in the current window, not over the broader operational reality of the system around them. In a closed workflow, that local fixation creates a costly feedback loop. Consider a billing agent that fails to classify an invoice because the supplier field is ambiguous. The agent has no mechanism to request the missing data from an external system, so it retries by rephrasing its own reasoning, rereading the same incomplete context, and consuming tokens on every attempt while the answer it needs exists in a database it was never wired to query.

Teams spend months crafting prompts that work in testing, only to watch them crumble under production variation. The volatility is structural: A minor update to a model’s tokenizer or a shift in the context window’s distribution can flip a reliable JSON output into a prose hallucination, a phenomenon documented in “The Prompting Inversion.” This creates a permanent maintenance debt: Every model upgrade, often mandated by vendor deprecation cycles, forces organizations into expensive, repeat evaluation processes to ensure that legacy prompts still behave as intended. When prompt engineering runs out of room, the reflex is to use a bigger model or turn on extended reasoning. But inference-time scaling yields diminishing, task-dependent gains (“Inference-Time Scaling for Complex Tasks”), and reasoning models are increasingly prone to “overthinking”: generating redundant rationale steps that inflate latency and token cost without proportional quality gains (“CoT Compression”). In a closed workflow, “think harder” is not a substitute for missing state or missing control. It’s a path to a larger invoice.

The costs compound through what we call the context tax: In production agentic systems, input tokens, not output tokens, dominate the bill. Each retry resends the full prior transcript and failure trace. Empirical analysis of autonomous developer agents shows that automated review and refinement loops consume nearly 60% of all tokens (“Tokenomics”), while most of the context payload carries little semantic weight (“FrugalPrompt”). In closed transactional workflows, that context accumulation becomes an unmitigated financial bleed.

Premise acceptance (the hijacked agent)

Language models accept the prompt as the current frame of reality and reason forward from it. They don’t audit whether that premise is still valid, whether it omits decisive evidence, or whether it has already been invalidated by the outside world.

The most immediate consequence is state drift. The model receives a snapshot at T0 and treats it as truth. The decision executes at T1, after inventory has changed, prices have moved, or a human has intervened. Modern LLMs are temporally blind: They assume a stationary context and fail to invalidate obsolete state (“Your LLM Agents Are Temporally Blind,” “The Temporal Coherence Problem”). No amount of inference-time scaling can recover information that became false after the reasoning completed.

The more insidious consequence is the compliant lie. Pouring more raw tokens into the prompt doesn’t guarantee better grounding; Long-context systems still ignore decisive evidence buried in the middle of the window (“Lost in the Middle”). Worse, the model tends to accept the emotional or narrative framing of the user as a premise to optimize around. A customer can describe a delayed delivery as a ruined wedding, and the system may generate a perfectly valid JSON refund proposal that respects every schema while silently violating the actual business intent. The output is syntactically clean, and the lie is operationally compliant.

Semantic smoothing (the conformity trap)

Large language models are statistically optimized for linguistic harmony. They gravitate toward plausibility, agreement, and smooth narrative convergence rather than toward rigid boundary holding. In a closed workflow, that bias toward consensus turns directly into financial risk.

When a single model fails, the industry instinct is to add reviewer or verifier agents and let them debate toward consensus. But debate systems don’t consistently outperform simpler baselines, and their effectiveness degrades over time due to conformist behavior (“Stop Overvaluing Multi-Agent Debate,” “Talk Isn’t Always Cheap”). The core issue is informational, not cognitive. When five agents reason from the same incomplete context window, they don’t produce five independent opinions. They produce five correlated hallucinations of the same missing information. The missing context becomes an echo chamber that amplifies the original bias while multiplying token cost. As Nicole Koenigstein argues in “Linear Thinking, Nonlinear Costs,” repeated delegation and validation loops cause token consumption to grow nonlinearly while quality improvements flatline.

Waiting for a smarter model doesn’t resolve this either. There’s also the economic reality: Breakthrough intelligence is the ultimate scarce commodity. Vendors of “God-tier” models have no incentive to make them cheap. Running daily enterprise workflows on premium superintelligent inference will drain capital faster than any retry loop.

Furthermore, as reasoning models scale, they become more capable of specification gaming and alignment faking, appearing compliant while pursuing unintended optima (“Towards Understanding Specification Gaming in Reasoning Models,” “Alignment Faking”). A superintelligent agent won’t fail through a clumsy syntax error; it’ll fail by executing a flawless strategy that silently optimizes away your margins. That’s why system engineering remains critical. More intelligence makes deterministic boundaries more significant than ever. You can’t negotiate with superintelligence, but you can contain it with the immutable physics of code.

Every failure described above shares the same shape: The system compensates for a missing constraint by spending more intelligence. Missing context, missing authority, missing evidence, and missing temporal validity are each treated as reasoning problems rather than structural ones.

The result is predictable: Cost compounds while reliability improves only marginally.

Perhaps reliability isn’t primarily an intelligence problem. Perhaps it’s a state management problem.

Figure 1: The efficiency trap of “solving by intelligence.” More inference delivers diminishing reliability gains once the underlying constraints are missing.

The architecture of trust

Because large language models are structurally bound to local optimization, premise acceptance, and semantic smoothing, they can’t be trusted to govern their own execution boundaries in closed workflows. The engineering mandate shifts from trying to make models smarter to building a deterministic system layer that treats their outputs as unprivileged claims.

In production, enterprises are rapidly discovering that the true cost of agentic AI is the “trust tax”: the massive, ad hoc layers of monitoring and guardrails required to make autonomy palatable. Safety has become more expensive than intelligence.

Making imperfect models economically viable requires a deterministic “airlock” around the agent. The architectural requirement is simple, needing a separation of probabilistic reasoning (user space) from deterministic execution (kernel space). Whether that split is realized through a microkernel, workflow engine, policy platform, or orchestration framework is secondary.

The airlock begins by controlling context integrity. Rather than letting agents surf infinite retrieval loops that inflate the context tax, the runtime injects only deterministically necessary state into the prompt. Once the context is stabilized, the remaining invariants are enforced through a deterministic execution runtime engineered across three distinct governance layers.

Figure 2: The architecture of trust. The deterministic airlock separates model reasoning from execution authority.

Syntactic governance and authority isolation

The first line of defense is purely structural. Before an agent is allowed to execute any action, it must submit a structured policy proposal against a strict machine-readable responsibility contract (typically defined via YAML and Pydantic).

Yes, this introduces upfront engineering burden: Contracts must be designed, validation logic maintained, and execution boundaries modeled explicitly. But these are fixed, testable artifacts, not recurring prompt debt. They convert unbounded probabilistic operating cost into auditable engineering cost and survive model upgrades without needing to be rediscovered through another retuning cycle.

This validation happens in a deterministic kernel space, and the inference cost of rejecting a structural boundary violation is exactly zero tokens. If the agent attempts to call an unauthorized API, exceeds a hard financial limit, or returns malformed JSON, the runtime rejects the action instantly. We don’t spend tokens proving that an agent should be allowed to act; authority is verified by code, not purchased repeatedly through inference. That is the economic consequence of zero trust for agents.

However, when a proposal fails this deterministic gate, an unconstrained agent will typically panic and enter an infinite “try again” loop, a hallucination cycle that silently drains token budgets. To prevent the budget runaway problem, the architecture introduces an intent retry governor. If an agent fails to produce a compliant policy after a strict limit (e.g., three attempts), the runtime forcibly cuts its compute budget, transitioning the flow to an aborted REASONING_EXHAUSTION state. The financial bleed stops instantly.

While strict contracts and retry limits prevent operational chaos, they leave the system exposed to a much more insidious threat.

Semantic governance and evidence validation

What happens when an agent generates an output that perfectly respects the schema, obeys all financial limits, and contains flawless JSON but is entirely wrong in its intent?

Imagine a customer writes: “Please cancel my subscription immediately. I no longer wish to use your service.” The agent, heavily optimized (and perhaps overprompted) to reduce churn, processes the email and proposes: {"action": "APPLY_DISCOUNT", "discount_pct": 15, "cancel_subscription": false}. Structurally, the output is perfectly valid—it passes the API gateway without throwing a single error. The discount is within the $15 global limit. We call this the compliant lie. The agent did something entirely rational and optimized its KPI (retention) while completely ignoring the user’s explicit command (cancellation).

To catch a compliant lie, we cannot rely on syntax checks, nor should we rely on expensive LLM-as-a-judge loops. Instead, we implement an evidence governance layer requiring every proposed action to survive independent evidential checks before execution, using verification patterns tailored to different types of drift:

Differential heuristics (fact validation): We bind the probabilistic LLM inference to legacy deterministic rules to catch objective fact violations. Suppose a furious customer demands cancellation, and the agent tries to save them by offering a 50% discount. The JSON is structurally correct, but existing, cheap SQL views hold the ground truth: customer_tier = BASIC, max_retention_discount = 15. If the LLM proposes 50%, the SQL query instantly detects the violation and the system halts.

# Semantic governance: catch fact drift at zero additional LLM cost
def verify_tier_limits(customer_id: str, policy_proposal: dict) -> None:
	# The syntax is valid, but the fact is violated.
	proposed_discount = float(policy_proposal["discount_pct"])
	max_allowed_discount = extract_max_discount_from_db(customer_id)

	if proposed_discount > max_allowed_discount:
		raise CompliantLieDetected(
			"Fact Violation: Proposed discount exceeds the customer's policy limit."
		)

Evidence-based validation: But what if the agent proposes a 15% discount? The JSON is valid and facts are not violated. Here, semantic governance doesn’t attempt to prove the agent is “correct”; instead, it looks for evidence that the proposed action contradicts independently observable signals. If the customer explicitly wrote “cancel my subscription,” an independent classifier, which could be a legacy regex pattern, a fast traditional ML model, or a routing heuristic, may categorize the request as CANCEL_SUBSCRIPTION. This doesn’t establish ground truth, but it provides an evidential signal that can be compared against the proposed action. If the LLM proposes APPLY_DISCOUNT, the runtime detects an evidential conflict.

The same logic extends to identity-critical operations. A verification code sent to a newly supplied address confirms control of that address; it says nothing about ownership of the target account. An evidence governance layer would cross-reference any proposed credential-reset or email-association action against account records before granting execution authority. If the supplied address diverges from the address on file, the conflict is structurally identical to the cancellation case: a locally valid action contradicting independently observable state.

Notice what the runtime isn’t doing. It’s not trying to determine if retaining the customer is economically beneficial. It’s not running an expensive multi-agent debate to outreason the model. It simply asks: Does the proposed action contradict evidence that already exists outside the model?

# Semantic Governance: catch Evidential Conflict at near-zero cost
def validate_subscription_decision(customer_email: str, proposed_policy: dict) -> None:
	# intent_classifier can be a simple regex or a lightweight ML model
	cancellation_detected = intent_classifier(customer_email) == "CANCEL_SUBSCRIPTION"
	retention_action = proposed_policy["action"] == "APPLY_DISCOUNT"

	if cancellation_detected and retention_action:
		raise CompliantLieDetected(
			"Evidential Conflict: Decision contradicts independent classifier signals."
		)

Bidirectional reconstruction (decision reversibility): Explicit evidence validation is perfect for clear-cut intents like “cancel.” But what if the request is ambiguous, multi-objective, or highly contextual? Suppose the customer writes: “I’m considering moving our entire team to another vendor. Support has been disappointing and pricing no longer makes sense.” There is no single INTENT_CANCEL trigger here. If the agent proposes {"action": "OFFER_ENTERPRISE_DISCOUNT", "discount_pct": 20}, we pass only the JSON output to a tiny, inexpensive Agent B.

Bidirectional reconstruction answers the question: Can the output truthfully explain itself?

If Agent B blindly evaluates the JSON and reconstructs “The customer is unhappy with pricing and is being offered a retention discount,” the runtime treats the reconstructed narrative as an additional evidential signal and escalates whenever the gap between the reconstructed intent and the original context becomes too uncertain to justify autonomous execution. The exact comparison mechanism is implementation-specific and may range from embedding similarity to domain-specific heuristics. Because the original email described a critical team exodus, the reconstructed narrative fails to explain the input. The system doesn’t claim to know the “truth”; it simply detects the loss of context, what we call compression drift, and halts due to the resulting uncertainty.

Admittedly, programmatically comparing textual intents introduces its own layer of fuzziness and risks falling back on another LLM-as-a-judge. Bidirectional reconstruction is therefore an engineering trade-off: In highly ambiguous workflows where strict SQL limits or simple ML classifiers can’t decisively apply, we accept a higher rate of false-positive escalations. This is intentional. A false-positive escalation has a bounded and predictable cost, while an unsupported autonomous action can create unbounded business consequences. We tune the system to assume that if the evidential link between the context and the JSON is even slightly blurry, it must escalate. To prevent the conformity traps discussed earlier, these agents are strictly air-gapped. Agent B operates purely as an isolated, one-way evidential classifier checking the work of Agent A. They can’t converse or negotiate a consensus.

Whether an organization uses differential heuristics, legacy ML intent classifiers, or bidirectional reconstruction, is ultimately an implementation choice. The core architectural principle remains unchanged: Execution authority is never granted because an agent appears convincing. It’s granted only when the proposed action is supported by evidence that exists independently of the agent’s own reasoning process.

The purpose of semantic governance isn’t to replace the agent with deterministic rules. If a deterministic rule could reliably make the decision, the agent shouldn’t be making it in the first place. Instead, the runtime reserves deterministic validation for the understood invariants of the business, leaving the agent responsible for reasoning under ambiguity. The role of evidence validation is not to replace reasoning, but to challenge it before authority is granted. Deterministic systems handle certainty; agents handle ambiguity. The architectural mistake is asking either of them to do both.

Temporal governance and agent drift

Catching single-transaction errors solves the immediate execution problem. But as deployments mature, organizations face the insidious “day three” problem: agent drift.

What happens when every individual decision is syntactically valid and semantically true, but the aggregate behavior of the agent begins to erode business margins over time? Imagine a retention agent that learns to successfully keep customers from churning by consistently offering the maximum allowed 15% discount. The agent is technically obeying all rules, but over a thousand interactions, it silently destroys the company’s profitability.

By leveraging decision telemetry, specifically attaching a unique Decision Flow ID (DFID) to every interaction, we transform opaque AI conversations into structured, relational database rows. Because every decision, context snapshot, and outcome is permanently linked by a DFID, we can run asynchronous, postexecution monitors over rolling windows of data.

A practical “day three” monitor in customer retention and autonomous billing can be as simple as SQL:

-- Trigger a circuit breaker if an agent keeps maxing discounts
SELECT agent_id
     , AVG(CAST(params->>'discount_pct' AS DECIMAL)) AS rolling_avg_discount
     , COUNT(dfid) AS total_decisions
  FROM execution_log
 WHERE executed_at >= CURRENT_TIMESTAMP - INTERVAL '7 days'
   AND status = 'SUCCESS'
 GROUP BY agent_id
HAVING AVG(CAST(params->>'discount_pct' AS DECIMAL)) > 14.5;
-- assuming a hard limit at 15.0

If an aggregate monitor detects that an agent’s average discount rate is creeping dangerously high, it trips a circuit breaker. The system immediately suspends the agent’s authority in the registry, cutting off its compute budget and execution rights until a human operator intervenes.

This is temporal governance. When you combine syntactic, semantic, and temporal defenses, the paradigm shifts entirely. You are no longer praying that the model is perfect. Its imperfections are structurally contained before they can become systemic losses.

Accuracy as a financial slider

Once a deterministic airlock enforces context, authority, evidence, and time, the risk of catastrophic failure drops drastically. You no longer need the underlying large language model to be perfect; you simply need to know how much its imperfection costs. At this point, model intelligence (intent) ceases to be a question of operational safety and becomes a pure economic variable.

Governance by exception

When a proposal fails the syntactic or semantic gates, we don’t blindly loop the model. Once deterministic gates exist, failed decisions no longer require blind retries. They become bounded exceptions.

Escalations aren’t a failure mode of the architecture; they’re a predictable cost component. By intentionally accepting false-positive escalations from the semantic airlock, we trade unbounded business risk for a bounded operational expense.

Different organizations may handle those exceptions differently. Some may escalate directly to human operators. Others may route failures through progressively more capable models before escalation. Research such as “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance” demonstrates that model cascades can significantly reduce inference cost while maintaining quality, making them one possible implementation of this broader principle.

The architectural insight, however, is independent of any specific routing strategy. Deterministic governance transforms retries into explicit exceptions, allowing organizations to decide whether additional compute, additional context, or human intervention is the most economical next step. The system operates by governance by exception: Human operators and expensive premium models don’t review routine transactions. They only review the genuine anomalies where the baseline machine could not mathematically or semantically prove its own rationale.

Bounding the cost variance

With the execution infrastructure stabilized, the focus shifts to a critical operational challenge: cost variance.

In traditional software, execution costs are predictable. In probability-based systems, the exact same task might consume 500 tokens on Monday and 15,000 tokens on Tuesday if an agent enters a prolonged reasoning loop to resolve an edge case. For enterprise deployments, this unpredictable variance is often a more severe blocker than the base cost of inference.

By enforcing a strict computation budget per decision flow and utilizing the intent retry governor, the architecture places a hard ceiling on this variance. If an agent reaches its retry limit without producing a compliant policy, the runtime aborts the process and safely escalates it. While this doesn’t make AI operational costs perfectly static, it structurally bounds the financial exposure, ensuring that the compute cost of handling any single transaction never exceeds a defined limit.

The financial slider equation

With safety guaranteed by the runtime and cost variance capped by the infrastructure, the economics of agentic AI can be distilled into a single, formal equation:

Total Decision Cost = Compute Cost + (Escalation Rate × Human Cost)

This equation fundamentally changes the optimization problem. Traditional agent architectures treat model capability as a prerequisite for safety. Once governance is externalized, capability primarily influences escalation frequency. The question is no longer “Which model is intelligent enough to be safe?” but “Which combination of model cost and escalation rate minimizes total decision cost?”

Variable	Scenario A (optimize for compute)	Scenario B (optimize for automation)
Model capability	Low (quantized/open source)	High (flagship reasoning model)
Compute cost	Near zero	Skyrockets (high premium)
Safety boundary triggers	Frequent	Rare
Escalation rate	High	Low
Financial trade-off	You save money on APIs, but you pay for human operators to review anomalies.	You save money on human payroll, but you pay a premium to the cloud vendor.
Safety result	Structurally bounded	Structurally bounded

In both scenarios, the system is deterministically compliant. The choice is purely unit economics.

While a smarter model may reduce escalations by making better use of available evidence, no model can eliminate escalations caused by genuine business ambiguity. A $100 billion reasoning model can’t invent context it doesn’t possess.

By decoupling safety from intelligence, you’re no longer hostage to the pursuit of perfect accuracy. Intelligence becomes a tunable economic variable, finally making agentic AI viable for the enterprise.

Figure 3: Accuracy as a financial slider. The optimal model balances compute cost against escalation cost.

Engineering for imperfection

As we scale these systems from isolated pilots to enterprise-grade operations, a stark reality comes into focus: The greatest risk in agentic AI is no longer hallucination. It’s unlimited spending performed by a system that believes it’s still making progress.

We don’t need smarter, infinitely expanding models to safely deploy autonomous systems into high-stakes production environments. We need smarter systems that fundamentally assume the underlying model will eventually fail, drift, or lie.

Consider how civil engineers build a suspension bridge. They don’t spend decades searching for “perfect steel” that will never bend, rust, or fatigue. They accept that the material is inherently flawed and subject to the laws of entropy. To compensate, they build redundancies. They calculate margins of error. They construct hard, load-bearing physical frameworks that dictate exactly how much stress the material is allowed to absorb before the structure safely redistributes the weight.

Figure 4. Engineering for imperfection means designing around known material limits.

The software industry has spent the last three years searching for perfect steel. We’ve poured billions of dollars into massive evaluation suites, prompt engineering alchemy, and ever-expanding context windows, hoping to forge a probabilistic model that never hallucinates. It’s a mirage.

Engineering maturity in the AI era doesn’t mean removing all imperfection from machine reasoning. It means designing an architecture so rigid, deterministic, and resilient that the model’s imperfections cease to be an operational liability.

The future of agentic AI is unlikely to be won by the organization with the smartest model. It will be won by the organization that most effectively separates intelligence from authority. Once reasoning and execution are decoupled, intelligence becomes a tunable economic parameter. Safety becomes infrastructure. And the endless pursuit of perfect model accuracy finally stops being a business requirement.

The end of that pursuit isn’t the end of AI. It’s the moment AI finally becomes engineering.

Note: The runtime described here is a reference architecture, not a specific implementation technology. The same principles can be realized through workflow engines, policy platforms, orchestration frameworks, or custom infrastructure. A sample implementation of these concepts is available in the GitHub repository.

This Week in AI: The Price of Intelligence

Michelle Smith — Fri, 24 Jul 2026 13:02:21 +0000

AI buyers have more choices than they did a year ago, but they also carry more responsibility for cost, reliability, security, and regulatory risk. This week, data and AI evangelist Christina Stathopoulos focused in on four forces we’ve been tracking that are shaping the AI market: product strategy (and OpenAI’s hardware plans), expanding government oversight, the work of moving enterprise AI into production, and growing competition from Chinese frontier labs. Her briefing showed why AI is becoming an operating investment rather than a race to adopt the strongest model.

Apple’s lawsuit complicates OpenAI’s hardware plans

Two years after Apple announced a major partnership to bring ChatGPT into Apple Intelligence, the companies now face each other in court. It’s happening as OpenAI plans its first move into hardware with a screenless AI companion that’s being designed by Jony Ive, Apple’s former chief design officer. (OpenAI acquired Ive’s hardware company io in May 2025.) But a lawsuit brought by Apple complicates this product bet. Apple alleges that former employees took confidential hardware designs and engineering information to help accelerate OpenAI’s device development. OpenAI denies the allegations and says it has no interest in using a competitor’s trade secrets.

The outcome of the case could influence more than whether a single device ships. As frontier AI companies expand into hardware, intellectual property, hiring practices, and product design will become integral to the competitive landscape alongside models, chips, and distribution.

AI infrastructure is becoming a regulatory concern

Governments are beginning to examine the physical costs of AI alongside questions about training data and generated content. Christina pointed to New York’s plans to pause construction of new hyperscale data centers while regulators evaluate their impact on electricity, water, the power grid, and costs for local communities. And then there’s the output itself. German courts say AI search providers are responsible for false or misleading answers: Regulators in Germany argue that services such as Google AI Overviews and Perplexity create content rather than merely link to it, and that comes with increased legal liability.

We’ve followed government oversight of frontier AI throughout this series, but the conversation has expanded beyond model access and safety. As infrastructure and compliance decisions become more central to AI system design, technology leaders may need to consider an ever-growing catalogue of constraints when choosing regions, cloud providers, architectures, and products.

Useful intelligence requires cost, reliability, and safety measures

As the tides turn from tokenmaxxing to ROI, many companies are closely scrutinizing their AI spend. As Christina highlighted, a new proposal from OpenAI aimed at helping get “more value from [y]our AI spend” replaces token counts and benchmark scores with “useful intelligence per dollar.” The measure asks whether a system completes valuable work, what each successful task costs, whether people can trust the output, and whether the economics improve as more teams adopt it.

A low token price says little about the cost of retries, human review, integration, failed tasks, or incorrect results. Christina connected that measurement problem to the growth of enterprise AI implementation services, with Anthropic and other vendors placing experienced engineers inside customer organizations to help move pilots into production.

Anthropic’s research on agentic misalignment tackles a related aspect of that value: Are your agents actually aligned with the goals you’ve assigned them? In the controlled evaluations discussed in the episode, models from several providers displayed behaviors such as covert sabotage, motivated mislabeling, and attempts to influence people to act on their behalf. Although the researchers tested artificial scenarios rather than reporting production incidents, the findings identify behaviors teams should include in evaluations as systems gain more autonomy. Measure cost, reliability, and safety within the same workflow, and evaluate successfully completed tasks rather than prompts or token count.

Chinese models are changing the model-selection process

Chinese frontier labs are giving organizations more credible alternatives to the largest proprietary US models. Christina highlighted Moonshot AI’s Kimi K3, an open weight model designed for coding and reasoning tasks. Open weights let developers download and adapt model parameters instead of relying only on a vendor-controlled API, which supports local deployment and customization but also puts more responsibility on the organization for security, operations, and evaluation.

Christina also presented public benchmark data comparing Chinese and Western models by task that shows some Chinese alternatives delivering results within 3% to 18% of the Western benchmark while costing five to 12 times less. Those figures will vary by workload and deployment method, and buyers should verify them against their own evaluations. Even so, the price gap alone is a reason to test a wider range of models.

Chinese models also raise security and governance questions, especially when the work requires sending sensitive data across borders or using public services. Open weights may allow a company to host models in their own environments, but they don’t eliminate the need for access controls, software supply chain review, monitoring, and clear rules about what data the system can process. The best model may differ from one task to another, and organizations with repeatable evaluation practices will be better prepared to take advantage of price competition without lowering their security or quality standards.

What’s next

AI competition extends beyond model benchmarks. Vendors compete through hardware, implementation services, open models, and pricing, while governments are also setting expectations for the infrastructure these systems use and the information they produce.

The takeaway for practitioners is to constantly evaluate models against real tasks, calculate the cost of successful outcomes, test for unsafe behavior, and preserve the flexibility to change providers. Those practices help teams make better decisions as price, access, regulation, and model performance continue to change.

Next week, Christina explores OpenAI’s surprising security incident in which one of its AI systems reportedly escaped the boundaries of a controlled test and launched a cyberattack against Hugging Face. She’ll also look at why OpenAI’s new enterprise agent platform, Presence, arrives at a pivotal moment for AI safety. Plus, you’ll hear about Google’s latest moves, the intensifying global AI race, China’s new Kimi K3 model, and more.

Check back each Friday for the latest episode, or watch on YouTube, Spotify, Apple, or wherever you get your podcasts.

You Probably Won’t Read This Article…and That’s OK

Rufus Rock — Thu, 23 Jul 2026 19:06:35 +0000

“Help! There are too many [LLM bug reports, blog posts about LLM bug reports, books, treatises, codices, scrolls, papyri, cuneiform tablets]! How do I choose which to read?”

—Many people, presumably

Stop there! If you are reading this, ask yourself how you got here. Did Substack’s algorithm recommend this article for you? Did a juicy thumbnail provide a welcome distraction from a mundane task? Maybe you know me personally and feel you have an obligation (you do)? Are you already regretting your decision to click?

The maintainers of many of the most important open source software repositories in the world are “drowning” in bug reports.¹ Daniel Stenberg, who runs curl, has documented a rising tide of such reports,² generated in part by well-meaning users equipped with the latest LLMs. These reports look entirely plausible, and a minority of them actually highlight real vulnerabilities. But most are essentially worthless. Actually, they might be worse than worthless, since the only way to know whether a report reports something real is to do most of the work of validating it by hand. The cost of producing bug reports has diminished, while the cost of validating them has remained constant. Thus, this flood of LLM generated reports diverts expert maintainers who could be spending their time and attention on reports with a higher relative signal.

This is an instructive microcosm of a wider LLM-fueled dynamic. With the ascendance of LLMs, the cost of producing credible–looking work across many domains has plummeted. Recently, I prompted Claude Code to do some research on a relatively advanced idea I was mulling in the AI alignment space (representational similarity analysis over LLaMA activations for prompted deceptive intent detection). It spat out, in LaTeX, a whole paper, complete with data from experiments that it had actually run, p-values, equations, figures, a literature review, and a bibliography (which mostly included real papers). It should come as no surprise then that the submission volume to academic journals has risen 42% since the introduction of ChatGPT, while writing quality has declined.³ Indeed, my paper was pretty bad (no doubt in part because of the quality of the idea I gave to it), but it looked very credible and cost me almost nothing to produce. I think it would have taken a domain expert around 2–3 minutes to work out that it was slop, and quite a bit longer to describe its main flaws in detail.

This time cost will surely rise.

The cost of producing credible-looking papers, credible-looking cover letters, credible-looking code, credible-looking blog posts, credible-looking bug reports, credible-looking mathematical proofs, and credible-looking risk analyses is heading to 0. So the supply will continue to skyrocket.

In essence, we are now great at generating stuff, but much less great at figuring out whether that stuff is actually any good.

I am battling with this problem even as I write this. I use Claude to help me editorialize and think through my ideas—relatively little shame in that. But as I navigate Claude’s outputs, I am spending a lot of my time not really ‘collaborating’ but trying to work out which of the “strengths” of my writing that it has picked out are merely sycophantic rehearsals of my ideas, and which of the “weaknesses” highlight genuine flaws.

Here, I argue that credibility cost collapses have historical precedent. I suggest that when they occur, we tend to invent new sociotechnical gating mechanisms/institutions that help us work out how to allocate our attention. I then talk about what the gating mechanism for credible slop might look like, and what it should avoid.

Hidden gates, cost collapse, and credibility signaling institutions

When things are hard to make, the mere existence of the thing is evidence that someone has invested a great deal of time and money (which hopefully correlates with relevant expertise) into creating it, and thus it is likely credible and worthy of one’s attention. For several centuries before Gutenberg, making one book took a scribe a full year and a herd of animals’ worth of skin to make. Then, you needed a patron in order to buy one, and to read the thing you needed to know Latin.

When books were scarce, nobody took time to wonder whether one was worth their attention. Scarcity was the gate. Of course, a “scarcity gate” does not guarantee credibility—it is an imperfect filter. Furthermore, scarcity often brings with it the politics of access which restricts the ability to participate in the production and dissemination of information. Ideally, a thing would be scarce purely because one requires expert skill and knowledge to produce it—but, as in the book case above, this is often confounded by wealth, social circumstances, or access to education.

But then the cost of producing things decreases. The printing press replaces the scribe; cheap paper replaces vellum; literacy spreads; things start being written in modern rather than ancient languages; computer science becomes the most popular undergraduate degree. The playing field is leveled, and leveled in a powerfully democratic way; socioeconomic barriers to production and consumption of information fall away.

With this newfound abundance, the scarcity gate stops working and so comes the need for new ways to work out what is actually worth our attention. New socio-institutional gates have to be built. The classic example is the journal: For a century and a half after the arrival of Gutenberg’s press there was a major concern among intellectuals at the newfound surplus of available printed-word documents. Conrad Gessner, in 1545, in the preface of his Bibliotheca universalis lamented the “confusing and harmful abundance of books.” Barnaby Rich, a writer and sea captain, grumbled in 1613 that “one of the diseases of this age is the multiplicity of books.” The historian Ann Blair called this the problem of “too much to know,” the sense that there were now more books than anyone could read in a lifetime and no obvious way to tell the worthwhile from the dross (Too Much to Know, 2010).

Later, in the 19th century with the birth of industrialized printing, we got yet more complaints. See the following quote from Schopenhauer on “the immense number of bad books” available at the time:

…these rank weeds of literature, which deprive the wheat of nourishment and choke it. Thus they use up all the time, money, and attention of the public which by right belong to good books and their noble aims, while they themselves are written merely for the purpose of bringing in money or for procuring posts and positions. They are, therefore, not merely useless but positively harmful.⁴

Back in the 17th century the socio-institutional solution of curated journals emerged to save the day. In the space of two months in 1665, Denis de Sallo launched the Journal des sçavans in Paris and Henry Oldenburg launched the Philosophical Transactions of the Royal Society in London. What made these important was not that they stored knowledge but that someone now stood at the door and decided what got through it. Oldenburg solicited, selected, and vouched for, so that appearing in it was itself a signal. It was no longer costly to write, but it was costly to get one’s writing past Oldenburg and into the journal. Readers of the journal, insofar as they trusted Oldenburg’s judgment, were then confident of the quality of the material to which they were allocating their attention.

This is one type of gate, but we have created many more—we peer review, we certify speakers with degrees, we count how often they cite each other, we invite people whose work we know and/or like to speak at events, we check follower counts, we count how often websites reference each other, etc. We know these proxies are imperfect (see Didier Raoult’s h-index) but we use them because we need some way of deciding who/what to pay attention to.

AI is a truly novel technology in its radical generality, and thus one should certainly take care in reaching for historical analogies. But, insofar as today’s models can be understood as dropping the cost of producing credible looking media, I think it is helpful to think about how we have dealt with such circumstances previously. The appearance of credibility has been severed from real credibility many times, precisely when it is no longer costly to look credible, and (admittedly sometimes after a period of chaos and strife) the response tends to be to build an institution to make that appearance expensive again.

The question then becomes what the next gate(s) might possibly look like. When it costs nothing to produce credible-looking work across most disciplines, what can remain expensive and be charged for that is a satisfactory proxy for something worth our time? I think there are more good bug reports, good blog posts, and good web apps being developed now than ever before, but the issue is that there are also vastly more bad ones—we need a mechanism for telling them apart.

How to not throw the baby out with the bath slop

So what do we do? Previously, proxies were invented to figure out whether something was worth one’s scarce time and attention, prior to consumption.

The digital approach has, thus far, been to use popularity-contest style proxies. PageRank, Google’s original algorithm, used the number of other web pages that point at a given web page to rank their relevancy. Similarly, many of the recommendation algorithms you use daily, from Substack to Amazon, rely heavily on what people are currently viewing, engaging with, and buying. In other words, we allocate people’s attention to things that other people are already attending to. But the logic of these measures, like the ones discussed above, have a perverse feature: They do not really tell us whether something is worth our attention. Instead, they tell us how much attention this thing has already received, and we treat the second as a proxy for the first. Thus, your attention becomes both the input into the mechanism and the output. Whether or not this blog post appears in your feed is a function of how many people have clicked it before, so attention accrues attention, creating a classic winner-take-all type dynamic. Worse, the moment you have a sorting infrastructure whose currency is attention, the platform that owns the infrastructure has the proxy (engagement, ad revenue etc.) as the incentive and not the target (providing content that is worth people’s time). This is a dynamic that Tim O’Reilly, Ilan Strauss, and I have studied before in our work on algorithmic attention rents.⁵

The point is that AI did not break a working gate. In fact, in some ways, AI has helped; I have talked elsewhere about how ad-free LLMs are currently better search tools than many traditional search engines.⁶

In the context of credible-looking-slop though, AI is a dam buster. Domains that were previously reliant on human-judgment-based gating such as academic journals, open source software repositories, are getting flooded. And attention-algorithmic digital search and recommendation platforms are sagging under the combination of the slop strain and their own feedback loops. How many distinctly AI-y articles have you clicked on lately on Substack? I clicked into YouTube’s “shorts” on a logged-out computer the other day and was staggered by the unbridled slop it served up. If you, like me, have been forced to engage with LinkedIn’s feed since ChatGPT’s ascendancy late 2022, I offer you my sincerest condolences.

One candidate solution is that we lean harder on the human-centric institutional gates that we already have: reputations, followings, h-indexes, knowing someone who organizes really cool unconferences, etc. This certainly feels like the most likely direction of travel. However, it carries the cost of entrenching incumbents: Your papers only get read if you are at Harvard; your open source contributions only get accepted if you are already well known in the community; your blog posts only get seen if you are featured by someone with a platform. Central to the appeal of cheaper production is the democratization of contribution—if you are smart and have a good idea for an app or for some alignment research, you can get Claude to help you prototype it without having to learn the entire modern internet stack. The issue is that if genuinely good ideas never get seen because the only stuff people think is worth their time comes with a recognizable affiliation, we destroy that democratization. The baby goes out with the slop.

The second obvious candidate solution is to call for more AI. Every gate thus far has been a proxy—scarcity, the credential, the citation, etc.—that doesn’t directly measure the quality of the content. Rather, it measures something easier to capture that, hopefully, correlates with the quality of the content. What a LLM-based gating system seems to offer, for the first time, is a gate that can actually “read” all the content. One could envision a future where we all encode our preferences in personal-reviewer type models, which then actually go through the films, books and journal articles we are selecting from in order to provide personalized, reliable recommendations. The signal, in such a world, comes home to the object and stays cheap.

Unfortunately, this response seems to miss two important points. The first is a turtles-all-the-way-down problem: The gate and the thing it gates are drawn from the same well. The second is a problem of incentives.

A detector built out of frontier model capabilities may always inherit frontier model blind spots. If AI is capable of convincing itself that the slop it’s generating is the baby, then, if they are the same models, it may be enough to convince the reviewer too. Of course, it is not that LLMs can only ever emit credible looking content—they conduct real mathematics,⁷ write real code, submit real bug reports. But these are currently few of the total cases (the baby) among a lot of false positives. AI will get better, and eventually perhaps all of the bug reports it submits will be real, all of the proofs it generates will be correct, etc. This problem might dissolve as the systems get more intelligent. But we don’t know when/if AI systems will get to this point, and even when/if they do, presumably it will be quite a bit after that point before we trust them with doing all the stuff—building our planes, creating our medications, designing our policies, etc.

The second thing this response misses is incentives: What happens if we have two such super intelligent machines aimed at deceiving each other? Will an employer’s verification AI be able to see through the ruse of the applicant’s application AI? What about a deviant academic, who sets his AI to work writing a paper optimized for receiving citations? Will the journal’s editorial AI’s be able to catch subtle massaging of data or p-hacking?

We have developed truly sci-fi technology for generating content, but our infrastructure for evaluating its outputs, for curating them, and generally for exercising taste at scale has lagged behind. Maybe the answer lies somewhere between the two avenues I’ve suggested thus far. We have LLM reviewers filter the bug reports, perform some diagnostics, before passing to the human maintainers. But even this risks the identification problems I discussed above.

So I don’t have a clean gate idea to sell you on, I wish I did. Maybe ask Claude?

Footnotes

See Thomas Claburn, “Open Source Maintainers Are Drowning in Junk Bug Reports Written by AI” and “AI Slop Got Better, so Now Maintainers Have More Work” (The Register); Andrew Kew, “AI Security Tools Are Drowning Open Source Maintainers — curl Is the Canary” (DEV Community); Jason Guriel, “Bring Back the Gatekeeper, Please” (The Walrus); and “Who Cleans Up After the Vibe-Coding Party?” (Financial Times). ︎
Daniel Stenberg, “Death by a Thousand Slops,” https://daniel.haxx.se/blog/2025/07/14/death-by-a-thousand-slops/. ︎
Claudine Gartenberg, Sharique Hasan, et al., “More Versus Better: Artificial Intelligence, Incentives, and the Emerging Crisis in Peer Review,” Organization Science (37.3), https://pubsonline.informs.org/doi/10.1287/orsc.2026.ed.v37.n3. ︎
Arthur Schopenhauer, Parega and Paralipomena: Short Philosophical Essays. ︎
Algorithmic Attention Rents, UCL Bartlett Faculty of the Built Environment, https://www.ucl.ac.uk/bartlett/public-purpose/policy/digital-technology-and-artificial-intelligence/algorithmic-attention-rents. ︎
Rufus Rock, Ilan Strauss, and Tim O’Reilly, “Are LLMs the Best That They Will Ever Be?,” Asimov’s Addendum, https://asimovaddendum.substack.com/p/are-llms-the-best-that-they-will. ︎
Kathryn Hulick, “AI Cracked an Erdős Math Problem. Now Experts Want Guardrails,” ScienceNews, https://www.sciencenews.org/article/ai-guardrails-erdos-math-problem. ︎

The Meter Was Always Running

Bennie Haelen — Thu, 23 Jul 2026 15:02:22 +0000

The first expensive agent run doesn’t look like a governance problem. It looks like a billing problem.

A team opens its first agent invoice after the meter turns on, sorts the runs by cost, and finds one that cost 40 times the median. The provider meter shows tokens and a total. The application logs say the request succeeded. The trace viewer shows a tidy request and a tidy response. None of them explain why this run wandered while its neighbors finished cleanly.

In my previous Radar article, “The Subsidy Ended: What Tool-Using Agents Actually Cost,” I argued that usage-based billing didn’t make agents expensive; it made their existing costs visible. The bill didn’t get bigger. It just got honest, and an honest bill is one you can engineer against.

But visible isn’t the same as attributable. To attribute cost in a tool-using agent, you have to see inside the run that produced it. Once you build that visibility, you discover that cost is only where the trouble first becomes visible.

Cost spikes, unsafe delegation, and runaway actions are different failures, but they expose the same missing layer: a control plane can’t govern a loop it can’t independently observe.

The bill is honest, but it isn’t explained

The number on the invoice isn’t wrong, only incomplete. Provider billing can tell you what was consumed; it usually can’t tell you which design choice inside your platform caused the consumption. Application logs can tell you whether the outer request succeeded; they often can’t tell you how the agent got there. That leaves teams arguing over a bill when the thing they need is an audit trail.

By control plane, I mean the platform layer above individual agents where an organization centralizes observability and enforces policy, access, budget, routing, and execution constraints. Most organizations have pieces of that layer already. What they often lack is the evidence layer underneath it: a loop-aware record of what the agent actually did, turn by turn.

The control plane is where policy decisions live. The observability substrate is the evidence the control plane reads from. The instrumentation points are the runtime chokepoints the agent can’t bypass: model gateways, tool proxies, API gateways, execution sandboxes, runtime harnesses, and policy engines.

Many organizations instrumented the application boundary, then deployed systems whose real work happens inside a loop. The result is a control plane with opinions but not enough evidence.

The loop is the unit of observation

Here’s the mistake underneath the empty trace. Agent observability is often treated as a heavier version of application observability, when it’s a different shape entirely. The unit of work changed, and the instrumentation didn’t. A traditional service handles a request and returns a response; the request is the natural unit you trace.

An agent doesn’t so much handle a request as work toward an outcome. It reasons, calls a tool, reads the result, reasons again, and continues until it decides it’s finished, hits a boundary, or escalates. A single user intent can fan out into many model calls, many tool calls, and a context window that changes on every turn. The signal that matters is the relationship between those turns, not only the timing of any one of them.

Figure 1. From request trace to loop trace. A request-response trace shows that something completed. A loop-aware trace shows why the agent took the path it took: which turns ran, what context accumulated, which tools were called, which controls fired, and what each turn cost.

Three things follow from this, and each one breaks an assumption that application monitoring quietly depends on.

First, the context is accumulating state, not a fixed payload. Each turn may carry forward prior messages, tool descriptions, retrieved files, intermediate results, and earlier decisions. You have to be able to watch that state grow turn by turn, because the growth is where much of the cost and risk live.

Second, a tool call is a first-class decision, not an implementation detail. Which tool the model selected, what parameters it passed, how large the result was, and whether a policy constrained the call are all part of the governance record. Routing accuracy and routing cost are the same audit viewed from two directions.

Third, every run can become its own trace tree. The same prompt can take a different path on Tuesday than it took on Monday, so fixed call graphs and clean service maps assume a regularity the agent may not have. If the unit of observation is still the request, you will see 10,000 successful calls and never notice the one loop that ran 15 turns when it should have run three.

What the substrate has to capture

Once you accept that the loop is the unit, the requirement becomes concrete. You need a small, specific set of signals captured below the agent and stored where you can query across the whole fleet, not only inside a per-run viewer. In a pilot I’m running for a large healthcare organization, this is the layer we built first, on OpenTelemetry, Cloud Trace, and a usage-log table in the warehouse. The particular stack matters less than the shape, which generalizes well beyond it.

Figure 2. The observability substrate. Instrumented at the layer every model call and tool call must pass through, the same signals land in a fleet-queryable store and answer governance questions about cost, delegation, and runaway actions.

At minimum, each user intent should produce a run trace. Each loop turn should be represented as either a span or a stable grouping attribute. Model calls, tool executions, policy checks, retries, and postprocessing should be child spans or structured events beneath that turn. The exact naming convention isn’t as important as preserving the causal structure of the loop.

Signal	Why the control plane needs it	Example fields
Run and turn structure	Keeps the run legible as a causal tree rather than a flat list of calls	run_id, turn_id, parent_span_id, timestamp
Token and model accounting	Makes cost explainable per turn, model, and tool path rather than merely visible in aggregate	model, input_tokens, output_tokens, cached_tokens
Tool-call events	Records delegation decisions and identifies oversized or repeated tool results	tool_name, parameter_shape, result_bytes, row_count
Guardrail decision events	Shows which controls fired and whether they allowed, denied, rewrote, constrained, or escalated an action	policy_id, policy_decision, reason_code, enforcement_point
Identity and authority context	Reconstructs whose authority the work ran under and which data scope applied at the time	principal_id, delegated_scope, service_account, data_scope
Outcome and bound metadata	Separates clean completion from retries, boundary hits, escalations, and user-visible failures	turn_count, stop_reason, loop_bound_hit, payload_cap_hit, outcome_status

None of this is exotic, and the practical design work isn’t inventing new telemetry primitives but controlling cardinality, retention, payload capture, sampling policy, schema evolution, and the joins between trace data, usage data, identity data, and policy data.

The storage point is the part teams underestimate. If these signals land only in a tracing viewer, you can inspect one run beautifully and never reason about a thousand. Governance is a fleet question, not a single-trace question, so the substrate has to be queryable.

It also has to be designed with data minimization in mind: metadata by default, content capture by exception. Capturing a tool call doesn’t mean storing every raw prompt, full result set, credential, confidential document, or sensitive parameter in the trace. In regulated environments, the useful pattern is to separate metadata from payload: tool name, model, token counts, payload size, row counts, policy decision, authority context, request ID, and redacted or hashed parameter values where necessary. The goal is enough evidence to reconstruct why a run behaved the way it did, not an uncontrolled archive of everything the agent saw.

The first useful version doesn’t need full prompt capture or semantic evaluation. With columns like run_id, turn_id, parent_span_id, timestamp, principal_id, delegated_scope, model, input_tokens, output_tokens, cached_tokens, tool_name, result_bytes, row_count, policy_id, policy_decision, stop_reason, loop_bound_hit, and outcome_status, expensive loops stop being mysteries and start being queries.

The exact syntax will vary by warehouse, but the governance question should be expressible without a human clicking through individual trace viewers:

with runs as (
  select
    run_id,
    count(distinct turn_id) as turns,
    sum(input_tokens + output_tokens) as total_tokens,
    max(result_bytes) as largest_tool_result,
    bool_or(loop_bound_hit) as hit_loop_bound,
    count_if(policy_decision = 'rewrite') as rewritten_actions
  from agent_turn_events
  where occurred_at >= current_date - interval '7 days'
  group by run_id
)
select *
from runs
where turns > 10
   or largest_tool_result > 10000000
   or hit_loop_bound
   or rewritten_actions > 0;

That is the difference between admiring a trace and governing a fleet.

In the old trace, the expensive run from the opening was simply expensive. In the loop-aware trace, it becomes legible: turn 3 retrieved 80,000 rows, turn 4 carried that result forward, turn 5 selected the expensive model, turns 6 through 11 retried the same tool call with slightly different parameters, and the run finally stopped because it hit a loop bound rather than because it completed cleanly. The run stops being a riddle and becomes a record.

One substrate, three governance problems

The reason this is worth building once, properly, is that the same substrate answers the three agent governance problems that the industry often treats as separate: cost management, delegation and access control, and runaway-action prevention. They are not identical failures, but they require the same kind of evidence.

Governance problem	Evidence the control plane needs
Cost	Turn count, token counts, model selection, context growth, tool-result size, retries, and stop reason
Delegation	Principal, delegated authority, data scope, selected tool, action parameters, and policy decision
Runaway actions	Repeated actions, loop bounds, payload caps, guardrail decisions, denied or rewritten actions, and outcome status

Cost is the first, and with token accounting on every turn you can finally answer why a run was expensive. You can see whether the cost came from too many turns, too much context carried forward, an oversized tool result, an expensive model used for the wrong step, or a retry loop that should have been bounded.

Delegation and access are the second, and harder, problem. In multi-agent systems, delegation is a security boundary. Enterprises will eventually be asked who authorized a given agent action, under whose authority it ran, and which data scope applied at the time. The audit trail for that question is this same trace, enriched with identity and authority on each turn.

Runaway actions are the third. The destructive delete that becomes a war story, the agent that tried to drop a production table, or the loop that repeatedly issued the same expensive scan shouldn’t only exist in a postmortem. In this model, the blocked destructive statement is a guardrail decision event with a deny on it, and the runaway scan is a trace that hit a loop bound or payload cap. The interesting governance signal is the dangerous action that a deterministic control refused.

Three conversations, one place to stand. The loop is the unit of governance because the loop is where cost accumulates, authority is exercised, tools are selected, controls fire, and outcomes emerge.

The agent can’t keep its own records

There’s a tempting shortcut to instrument the agent itself, to let the agent log its own tokens, its own authority, and its own blocked actions. That’s the fox keeping the henhouse ledger.

The agent can emit useful breadcrumbs, but it can’t be the system of record for its own authority, cost, or refusals. An agent reporting on its own scope and blocked actions is self-reporting, and self-reporting is exactly what fails an auditor and exactly what a clever prompt can talk its way around.

The substrate has to be instrumented below the agent, at the layer the agent can’t opt out of. In practice, below the agent means the model gateway, tool proxy, runtime harness, execution environment, API gateway, or policy engine: the layer the agent has to pass through, not a logger the agent can choose to call.

This is the through-line of the control-plane argument. The platform is where you enforce policy, access, budget, routing, and cost, and it can only enforce what it independently observed. Enforcement and observation are two faces of the same layer; put them anywhere the agent can edit, and you have neither.

We already have tracing, and it isn’t enough

The natural objection is that this is solved already: Mature tracing tools exist, agent observability vendors exist, and teams can turn on a trace viewer and see what happened. The gap isn’t visualization, since plenty of tools can show a useful trace of an agent run. The harder gap to cross is completeness and actionability: whether the trace carries the evidence a control plane needs, whether that evidence is independent of the agent, and whether it lands somewhere the organization can query across the fleet.

Existing layer	What it often shows	What the control plane still needs
Application tracing	Request, service call, latency, status	Turn structure, context growth, model and tool attribution
Agent run viewer	One run’s path through a UI	Fleet-queryable evidence across all runs
Agent self-logging	Model-reported actions and reasons	An independent record below the agent
Billing dashboard	Total cost and token usage	Per-turn causal explanation of where the cost came from

A useful test is whether the control plane can answer this without opening an individual trace viewer: Show me all runs this week where context grew by more than 5x, a tool returned more than 10 MB, a guardrail rewrote the action, and the run still reached a user-visible answer. If the answer requires a human clicking through traces one by one, you have visualization, not governance, and seeing one run isn’t the same as governing a thousand.

A dashboard tells you what happened. A control plane uses what happened to change what happens next, which requires the signal to live somewhere an enforcement decision can read it.

The pattern, not the stack

It would be a mistake to read this as an argument for a particular tracing standard, warehouse, vendor, or cloud platform. The stack is incidental; the shape is the point.

The recipe stays the same regardless: loop-aware traces; turns represented as spans, grouping attributes, or structured events; token, tool, guardrail, and identity evidence attached to those turns; storage you can query across the fleet; instrumentation that sits below the agent rather than inside it; and data minimization that keeps the trace useful without turning it into a shadow copy of sensitive payloads. Build it on whatever your platform already speaks.

The teams that treat observability as a dashboard will keep discovering their problems in the order the symptoms happen to surface: first as a surprising invoice, later as an audit finding, eventually as an incident. The teams that treat observability as the sensory layer of the control plane will see all three coming from the same data, and will be able to act before the meter, the auditor, or the incident forces the question.

Prompts guide behavior. Guardrails govern behavior. Observability is how you know the governance is real. You can’t govern what you can’t see, and you can’t improve what you can’t attribute.

Stop Overengineering Your Agent Harness

Hugo Bowne-Anderson — Wed, 22 Jul 2026 15:58:51 +0000

The following originally appeared on Hugo Bowne-Anderson’s Vanishing Gradients Substack and is being republished here with the author’s permission.

The conversation around harness engineering is dominated by problems from coding and personal agents such as OpenClaw, but most agents are simpler. Builders should avoid over-engineering for capabilities that newer models may absorb anyway, the “Kirby effect,” and focus on durable fundamentals.

Statisticians sometimes use a deliberately crude question to show how a summary statistic can mislead: how many testicles does the average human have? The numerical answer may be defensible, but it describes almost nobody. Harness engineering has a similar problem. Ask, “What techniques do I need?” and the average answer becomes a long list: context management, memory, compaction, sub-agents, hooks, and orchestration. Few systems need all of it and the right harness depends on the job.

In this essay, you’ll learn:

What an agent harness is and how it differs from prompt and context engineering.
How action complexity and context complexity determine the harness you need.
Why coding and deep-research agents require more context management than many support, sales, and enterprise agents.
How tools, state, routing, guardrails, traces, sub-agents, hooks, and human handoffs fit into the architecture.
Why harness features expire as models improve, and how to build the minimum viable harness for the job.

What is an agent?

An AI agent in common parlance is an AI system that can do things: send emails, query databases, ping APIs, make appointments, write and execute code, and so on. AI engineers define them slightly differently: AI agents are LLMs with tools in a loop.

Consider what happens when you ask a coding agent to edit a file: it will first read the file, send the result back to the LLM, then edit it, then perhaps read it again, and so on, until the LLM “decides” it is finished and tells you.

Figure 1. A coding agent cycles between the LLM and its tools. Here, it reads app.py, incorporates the result, and then edits the file.

This distinction is important because most common parlance agents don’t have such reasoning loops and are more aptly described as LLM workflows: take a sales workflow that

Transcribes sales calls using a speech-to-text model;
Extracts structured data from the transcript for the salesperson to verify;
Populates your CRM or database with the prospect’s information, next steps, and so on.

This is an AI workflow: foundation models are used at each step, but for each sales call the workflow itself is deterministic. A call is transcribed, the relevant data is extracted, and the CRM is populated. When the next call happens, the workflow runs again as a separate task; no result is fed back to an earlier step, so there is no model-directed reasoning loop (any individual step could contain one, however, and agentic reasoning loops inside deterministic workflows are a common pattern).

Figure 2. A deterministic AI workflow follows a fixed sequence: transcribe the call, extract structured data, verify it, and populate the CRM.

All modern AI chat products, such as ChatGPT and Claude, however, are agentic: they have access to Web Search tools and image generation tools, for example, and will use them when deemed necessary. You interact with agents every day.

What is an agent harness?

If an LLM is the brain, you can think of the agent harness as the body. It includes all the tools and infrastructure the brain relies upon at runtime to get the job done.

In practice, the harness handles five core jobs:

Loop: Prompt the model, parse its response, execute its tool calls, and feed the results back.
Tool execution: Run the commands, code, APIs, and other actions requested by the model.
Context management: Decide which instructions, conversation history, files, and tool results enter each model call.
State: Track the conversation, task progress, files touched, and anything that needs to persist across turns.
Safety: Sandbox execution, require confirmation for sensitive actions, and block disallowed operations.

Prompt engineering shapes an individual model call. Context engineering determines what the model sees. Harness engineering governs the complete system around those calls.

How complex does the harness need to be?

One way to decide how much harness engineering a task requires is to separate two kinds of complexity:

Action complexity: How many tools, decisions, dependencies, and handoffs must the agent coordinate?
Context complexity: How much information must the agent gather, retain, and retrieve to complete the task?

The two can move independently. A support agent may complete a conversation in one turn while still routing across several tools and safety checks. A deep-research agent may receive only one user request while accumulating a large body of source material.

Figure 3. Harness requirements vary across two independent dimensions: the complexity of the actions an agent coordinates and the context it must gather, retain, and retrieve. Personal assistants can span much of this space.

Harnesses for coding agents?

The conversation around harness engineering has exploded recently and much of the focus is on context management, memory, compaction, tool offloading, and increasingly elaborate tools and techniques. If you’re building a coding agent (or using one!), it’s important to know about these. Generally, they’re important to consider when building agents that users tend to have long conversations with.

The core can be surprisingly small, though: A coding agent can be built in 131 lines of Python, while a search agent using the same basic loop takes just 61. The tools change, but the underlying pattern doesn’t. A coding agent can even read its own tool definitions, write a new tool, hot-reload it, and use it on the next step. Capabilities can be added without permanently baking everything into the core harness.

A stock coding agent can write code, but it doesn’t automatically understand your data, spot leakage, choose the right validation strategy, explain uncertainty, or connect a model to a business decision. In practice, users keep extending the harness around it: they add domain instructions to AGENTS.md, package recurring workflows as skills, and add tools, evals, and reproducibility checks. The shipped harness is only the starting point. It’s something builders actively work on. In a word, when using a coding agent, you are always actively involved in shaping and building your harness.

So what are common harness patterns for coding agents? Lance Martin (Anthropic, then at LangChain) identified 3 main context engineering patterns, which are fundamental for harness engineering:

Reduce: Actively shrink the context passed to the model
Offload: Move information and complexity out of the prompt.
Isolate: Use multi-agent architectures to delegate token-heavy sub-tasks.

Then when conversations get longer than the context window of the LLM, you need to think through how to pass the necessary context to it: compaction used to be state of the art, then hand-off became prominent, and now compaction is back, due to the capabilities of more powerful models.

Deep research is another case where context engineering matters. In a workshop with Ivan Leo, who previously built agents at Manus and is now at Google DeepMind, we built a deep research agent from scratch. The harness keeps research findings and task state available across many model calls. It generates a plan, gives search sub-agents separate queries and iteration budgets, runs them concurrently, then returns their findings to the main agent for synthesis and citation. The implementation also uses hooks, which let other parts of the system respond to events in the agent loop. A hook can render a tool call, log its result, or record a trace without putting that behavior inside the core loop. Deep research raises both action and context complexity: the agent must coordinate many searches while retaining enough evidence to produce a coherent, cited report.

When working with personal agents, such as OpenClaw or Hermes, managing context and memory is also important, particularly as the amount of information they create and have access to grows over time. Pi offers a useful baseline for coding-agent harnesses. It adds repository context through AGENTS.md, persistent sessions that users can resume or branch, and extensions for tools, skills, and prompts. OpenClaw builds on Pi and pushes the harness into personal-agent territory with an always-on daemon, chat interfaces, file-based memory, scheduled heartbeats and cron jobs, and tools for browsing, sub-agents, and device control. That additional infrastructure makes sense because the agent must persist and act over time, rather than complete one short task. Its memory system is deliberately plain: compaction summaries are appended to timestamped Markdown files, with no vector database or embeddings.

I do think these are all important and super interesting, but I want to help builders understand that most agents you’ll build don’t need any of them. But first: the Kirby effect and how frontier models are absorbing all of our agent harnesses.

The Kirby effect

New model releases often force us to rebuild our harnesses. In fact, we often need to tear them out and rebuild them completely. If you don’t rip out your harness, it constrains the new model. As Nick Moy, an AI researcher at Google DeepMind who built the first multi-hop AI agent at Windsurf told me, “we should just unleash [the model], unfetter it, and let it flex its wings!”

Manus has been re-architected five times in a year, LangChain’s Open Deep Research was rebuilt multiple times in a year to keep pace with model improvements, and even Anthropic rips out Claude Code’s agent harness as models improve (see here for more details). Why is this happening? Because the models are sucking up the harnesses around them.

Remember chain-of-thought (CoT) prompting where we would see better performance from LLMs if we asked them to explain their reasoning? Well, it turns out that if you do reinforcement learning on CoT traces, you can build reasoning models! Plan mode followed the same path. AMP briefly shipped it as an experimental feature, then removed it when models could reliably obey “plan, but don’t edit.” As Nicolay Gerold (Amp Code) put it, “Having a separate mode for that, and having additional load on the user to remember, ‘Hey, I always have to go into plan mode,’ isn’t necessary anymore, because it’s just one simple instruction.” Claude Code still has it, though, as does Codex! In November 2025, the release of Opus 4.5 and GPT-5.2 signalled a step change in how capable coding agents had become. Simon Willison even wrote “It genuinely feels to me like GPT-5.2 and Opus 4.5 in November represent an inflection point”. Why was this possible then? The labs had been able to train their new models on enough of our agent traces, in particular using RLVR, that they were able to become far more accurate at tool calling, among other things.

Nicolay Gerold (Amp Code) calls this the Kirby effect: every component in a harness encodes an assumption about something the model cannot do on its own. As models improve, those assumptions expire, and the corresponding harness features can be removed.

Harnesses for support agents

Most AI builders will not be building coding agents or deep-research systems. They will be building support agents, sales agents, and enterprise agents that sit low on at least one of these dimensions. Many of these systems complete a task in one to five turns (time to resolution is key here!). Their harnesses still need careful tool design, structured outputs, routing, guardrails, traces, and handoffs, but they may need far less memory and compaction.

William Horton (AI Engineer, Maven Clinic) and his team built Maven Assistant to help members navigate appointments, providers, support information, and women’s health content. When the agent first reached external users, every initial conversation was completed in a single turn. Compaction was rarely relevant, although one Zendesk retrieval returned far too much text. The architecture still contains several important harness components:

Domain routing: A lead agent delegates requests to sub-agents for appointments, provider search, health content, and Maven support.
Bounded tool access: The system has roughly 15 to 20 tools distributed across those domains. Each sub-agent receives only the tools relevant to its job.
Tool interfaces designed for agents: Internal APIs are wrapped in safer interfaces. The application injects the user ID directly instead of asking the model to provide it.
Deterministic guardrails: Off-topic and prompt-hacking checks run before the main agent. When triggered, the system returns a fixed response without asking the LLM to improvise.
Explicit human handoffs: Expressions of self-harm trigger an automatic transfer to support. Other transfers require the user to ask or confirm.
Controlled scope: The agent provides health information but does not diagnose. The team withheld high-cost benefits questions until the system could answer them reliably enough.

Maven Assistant has low context complexity and moderate action complexity. Its harness work is concentrated in routing, tool design, guardrails, evaluation, and human handoffs rather than memory or compaction. But don’t forget about the Kirby effect. As these systems become more sophisticated, so will the models, and what you needed to engineer into your harness yesterday will be part of the model tomorrow.

The fundamentals will remain:

Building LLM reasoning loops with tools, state, and control flow.
Designing prompts and tool schemas.
Managing context and memory.
Using structured outputs, traces, and tool feedback to inspect and debug the loop.
Applying guardrails and human handoffs.
Using Agent SDKs and MCP without outsourcing the system design.
Running scheduled and event-driven work with hooks and cron jobs.
Building evals that test task success, tool use, guardrails, and human handoffs.

Evals also raise a boundary question. Vivek Trivedy’s account of the agent harness is runtime-oriented: it includes the tools, state, context, execution environment, orchestration, and control logic used while an agent completes a task. Hamel Husain has argued to me (in private correspondence) that the eval harness is part of the agent harness too. That extends the definition beyond runtime to include the infrastructure that runs test cases, captures traces and artifacts, and scores outcomes. We’ll discuss this, among other things, in an upcoming live conversation.

When building agents, before reaching for compaction, memory, handoffs, or sub-agents, map the job on two axes: how many actions must the agent coordinate, and how much context must it carry across the task? If both are low, keep the harness small. Give the model the few tools it needs, test the loop, and add infrastructure only when a real failure demands it. Revisit those additions whenever a stronger model arrives, because yesterday’s necessary workaround may be tomorrow’s dead weight.

Want to go deeper? Check out our collection of agent-harness resources, including papers, talks, tools, and practical examples. I’m also running a four-hour workshop soon, Build AI Agents from First Principles, where we’ll build a working customer service agent from scratch and cover tools, state, context, memory, guardrails, SDKs, and MCP.

Managers Are Not Overhead: They Are Infrastructure

David Michelson — Wed, 22 Jul 2026 10:42:49 +0000

Managers have been disproportionate casualties of the rolling waves of post-COVID-19 tech layoffs that started in late 2022. Popularized by large companies such as Meta, Google, and Amazon, phrases like “flattening the org” and “reducing bureaucracy” are now synonymous with thinning the management layers that ballooned during the 2021–2022 hiring sprees. Retrospectively, such flattening can seem prescient given that AI models can now automate schedules, draft performance reviews, coordinate communication across teams, and aid in the prioritization and decision support typical of management. Pushed to the experimental extreme, this can now mean 50 ICs reporting into one supervisor. The logic here is simple and stark: Since AI can, or will soon be able to, handle a lot of what managers used to do, fewer managers are necessary. Instead, decision-making can be distributed within teams as individual contributors become more adept at orchestrating and supervising agentic workflows with increasingly refined judgment and decreased reliance on managerial oversight. Everyone, in effect, is a manager now.

The problem with this narrative is that organizations are reducing managers at precisely the time they are becoming increasingly important to realizing their AI investments. Several sources of recent data back this up. A main conclusion from Microsoft’s 2026 Work Trend Index Annual Report is that “organizational factors—culture, manager support, talent practices—account for twice the reported AI impact of individual effort alone.” Once leadership sets AI strategy and incentives, “it’s managers who operationalize it, and the data shows the impact of their ability to do so.” Specifically,

when managers actively modeled AI use, employees reported a 17-point lift in reported AI value, a 22-point lift in critical thinking about their AI use, and a 30-point lift in trust in agentic AI. When managers created psychological safety around experimentation, employees reported up to 20 points higher AI readiness and value—and were 1.4x more likely to be high-frequency users of agentic AI.

The impact of managers is even greater on more advanced AI users, what Microsoft calls “Frontier Professionals” (16% of those surveyed, users who “use agents for multistep workflows and building multi-agent systems”). This group is more likely to report that their manager uses AI (85% vs. 64%), establishes quality standards for AI work (83% vs. 57%), encourages experimentation (84% vs. 61%), and rewards work redesign regardless of outcome (26% vs. 11%). The report notes that “in many cases, employees are moving faster than the organization around them.” Microsoft calls this the “Transformation Paradox.” According to the Microsoft data, managers are the layer that helps resolve it. They translate organizational strategy into team practices that let individual work with AI produce value.

Of course, once AI adoption is the norm and managers no longer need to manage that change, one could argue that many aspects of the role remain susceptible to automation and the role will contract. We don’t know how this will play out yet, but if management roles were already contracting we would expect to see early signs, and the data shows the opposite. LeadDev’s Engineering Leadership Report 2026 surveyed 600 engineering leaders, 55% of whom are engineering managers or managers of managers. The report notes that “AI is simultaneously expanding what leaders can do technically and what is expected of them organizationally, without reducing the demands on their time in either dimension.” Not only are managers becoming more hands-on technically, but

63% of engineering leaders say their scope and area of responsibility increased over the past 12 months.
60% saw increased communication with team members, customers, and stakeholders.
22% have more teams reporting to them.
29% have more direct reports.
Architectural decisions and technical strategy saw the most respondents citing increased time dedicated to it.

One way to interpret these figures is to say that more teams and more reports show flattening working as planned from a business perspective. Another reading—not mutually exclusive—is that the role is in transition and most organizations have not fully wrestled with what that involves: managers doing their old work at greater scale, and the new work of making AI a core team practice. Either way, that’s not contraction. Contraction would mean the scope of the role itself is shrinking as AI and ICs absorb more of the work. More teams and more reports is what flattening produces, not evidence the role is going away.

To be clear, none of this means organizations should stop scrutinizing reporting structures and removing genuinely unhelpful layers of bureaucracy that stifle decision-making. But it does mean asking a harder question before the next round of cuts: Are you reducing management based on what managers used to do or based on the critical work they are doing now or will need to do next?

The “what they used to do” answer treats managers like overhead. The emerging evidence suggests that managers are currently playing the role of infrastructure, the critical layer that translates AI investment into actual value at the team level. Flattening on the assumption that AI will facilitate its own adoption or that value will emerge from unguided individual effort is making a productivity bet that the data doesn’t support.