<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:media="http://search.yahoo.com/mrss/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:custom="https://www.oreilly.com/rss/custom"

	>

<channel>
	<title>Radar</title>
	<atom:link href="https://www.oreilly.com/radar/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.oreilly.com/radar</link>
	<description>Now, next, and beyond: Tracking need-to-know trends at the intersection of business and technology</description>
	<lastBuildDate>Thu, 30 Apr 2026 17:26:11 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/04/cropped-favicon_512x512-160x160.png</url>
	<title>Radar</title>
	<link>https://www.oreilly.com/radar</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Everyone’s an Engineer Now</title>
		<link>https://www.oreilly.com/radar/everyones-an-engineer-now/</link>
				<comments>https://www.oreilly.com/radar/everyones-an-engineer-now/#respond</comments>
				<pubDate>Thu, 30 Apr 2026 15:59:33 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18622</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-1.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2025/08/Abstract-colors-1-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Takeaways from Cat Wu’s fireside chat with Addy Osmani]]></custom:subtitle>
		
				<description><![CDATA[Cat Wu leads product for Claude Code and Cowork at Anthropic, so she’s well-versed in building reliable, interpretable, and steerable AI systems. And since 90% of Anthropic’s code is now written by Claude Code, she’s also deeply familiar with fitting these tools into routine day-to-day work. Last month, Cat joined Addy Osmani at AI Codecon for [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>Cat Wu leads product for Claude Code and Cowork at Anthropic, so she’s well-versed in building reliable, interpretable, and steerable AI systems. And since 90% of Anthropic’s code is now written by Claude Code, she’s also deeply familiar with fitting these tools into routine day-to-day work. Last month, Cat joined Addy Osmani at <a href="https://www.oreilly.com/AI-Codecon/" target="_blank" rel="noreferrer noopener">AI Codecon</a> for a fireside chat on the future of agentic coding (and, equally important, agentic code review), how Anthropic actually uses the tools they&#8217;re building, and what skills matter now for developers.</p>



<h2 class="wp-block-heading">The feedback loop is itself a product</h2>



<p>Boris Cherny initially built Claude Code as a side project to test Anthropic’s APIs. Then he shared the tool in a notebook, and within two months the entire company was using it. That organic growth, Cat said, was part of what convinced the team it was worth releasing externally.</p>



<p>But what really made that internal adoption visible was the response on Anthropic&#8217;s internal “dog-fooding” Slack channel. The Claude Code channel gets a new message every 5 to 10 minutes around the clock, and this feedback directly and immediately informs the product experience. Cat described it this way:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>We hire for people who love polishing the user experience. And so a lot of our engineers actually live in this channel and find when there&#8217;s issues with new features that they&#8217;ve worked on and they proactively lay out the fixes.</p>
</blockquote>



<p>The team ships new versions of Claude Code to internal users many times a day. The feedback loop is tight enough that it functions as a continuous integration system for product quality, not just code quality.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="From Boris&#039;s Notebook to the Whole Company with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/wo_CbgoyFLY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Cat told Addy how she once accidentally introduced a small interaction bug between prompts and auto-suggestions. But by the time she started working on a solution, she found another team member had already beaten her to it. It turns out, he had set up a scheduled task in Claude Code to scan the feedback channel for anything that hadn&#8217;t been responded to in 24 hours and open a PR for it. Since Cat hadn’t gotten to it yet (whoops!), her teammate’s Claude saw the unaddressed issue and fixed it for her. And Cat only found out when “[her own] Claude noticed that his Claude had already landed a change.”</p>



<p>The infrastructure for rapid improvement, in other words, is now partly automated. The agents are writing the code, then monitoring the feedback and closing the loop.</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="My Claude Fixed My Bug Before I Did with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/4h0i7YiS9io?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">The bottleneck has shifted to review</h2>



<p>There’s no question that AI-assisted coding has created a boom in output. Anthropic engineers are producing roughly 200% more code than they were a year ago, Cat noted. Today the main constraint is reviewing all that code to ensure it’s production-ready.</p>



<p>Cat&#8217;s team concluded that you can buy a lot of additional robustness for not that much extra cost. </p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>We opted for the heaviest, most robust version [of code review]. We actually plot how many agents and how comprehensive of a review Claude does and then how many bugs does it recall. And we picked a number of very high recall and decided we should ship this, because if you really want AI code review to be a load-bearing part of your process, you actually probably just want the most comprehensive possible review.</p>
</blockquote>



<p>The review agent doesn&#8217;t just look at the diff. It traces code across multiple files and catches bugs in adjacent code that has nothing to do with the change in question. Cat gave two examples. One was a ZFS encryption refactor where the agent found a key cache invalidation bug that was unrelated to the author&#8217;s change but would have invalidated the refactor. The other was a routine auth update that turned out to have a bad side effect, caught premerge. In both cases, engineers manually reviewing the code likely would have missed the bugs.</p>



<p>The human review that remains is deliberately small in scope. For most PRs, the human reviewer skims for design principle violations and obvious problems and assumes functional correctness has been handled. Five to ten agents run in parallel, each given a slightly different task; they return their findings independently, and the results are then deduplicated.</p>
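<p>As a rough sketch (this is not Anthropic&#8217;s actual implementation; the task list and the agent call are invented stand-ins), the fan-out-and-deduplicate pattern looks something like this:</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical review lenses; Anthropic's actual prompts and agent counts are not public.
REVIEW_TASKS = [
    "error handling",
    "concurrency",
    "security",
    "API contracts",
    "test coverage",
]

def run_review_agent(task, diff):
    """Stand-in for one review agent; a real one would prompt an LLM focused on `task`."""
    return [f"{task}: flagged {line!r}" for line in diff if "TODO" in line]

def review(diff):
    # Fan out: every agent reviews the same diff with a slightly different task.
    with ThreadPoolExecutor(max_workers=len(REVIEW_TASKS)) as pool:
        results = pool.map(lambda task: run_review_agent(task, diff), REVIEW_TASKS)
    # Deduplicate: the same underlying finding reported by several agents is kept once.
    unique = {finding.split(": ", 1)[1] for agent in results for finding in agent}
    return sorted(unique)

print(review(["x = retry()", "TODO: handle timeout"]))
```

<p>The deduplication step, not the individual agents, is what keeps a high-recall review from drowning the human reviewer in repeated findings.</p>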



<p>The cultural shift that made this work, though, was ownership. The team moved to a model where the engineer who authors a PR owns it end to end, including postdeploy bugs, and doesn&#8217;t lean on peer reviewers to catch mistakes. “Otherwise,” as Cat pointed out, “you have situations where junior engineers put out a bunch of PRs and then your senior engineers are like drowning in AI-generated stuff where they&#8217;re not sure how thoroughly it&#8217;s been tested.&#8221;</p>



<p>Full ownership meant the AI review had to actually be trustworthy, which drove the decision to go for high recall rather than a lighter touch. That said, engineers are still expected to understand every line of code an agent creates&#8230;for now. As Cat explained, it’s the only way to truly prevent “unknown security vulnerabilities and to be able to quickly respond to incidents if they are to happen.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Making AI Code Review a Loadbearing Part of Your Process with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/1eBxpDE35Gk?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Everyone&#8217;s kind of an engineer now</h2>



<p>Cowork, Anthropic&#8217;s agent tool for nontechnical users, is the company’s attempt to take what Claude Code does for engineers and bring it to knowledge work more broadly. Cat sketched a picture of someone looking at five or six agent tasks running simultaneously in a side panel, managing a fleet of agents the way a senior engineer manages a PR queue.</p>



<p>In the nearer term, she&#8217;s keeping tabs on the shift toward people using Claude Code to build things for themselves, their teams, or their families that wouldn&#8217;t have justified professional development effort or “otherwise been possible.” Think the garage project, the family expense tracker, the tool that a small team actually needs but that no SaaS product quite addresses. Cat&#8217;s goal and hope is that Claude Code helps people “solve their own problems for themselves” and “stewards a new future of personal software.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Everyone&#039;s Kind of an Engineer Now with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/10wu71soYhg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<h2 class="wp-block-heading">Product taste as the new technical skill</h2>



<p>More people building more software is unambiguously good. Boris Cherny has even floated the idea that coding as we know it is “<a href="https://x.com/lennysan/status/2024896611818897438" target="_blank" rel="noreferrer noopener">solved</a>.” But what does that mean for the craft of software engineering? Cat&#8217;s read of the current moment is more nuanced:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>I think pre-AI, the skills that were very important were being able to take a spec and implement it well. And I think now the really important skill is product taste. Even for engineers. Can you use code to ingest a massive amount of user feedback? Do you have good intuition about which feature to build to address those needs, because it&#8217;s often different than exactly what users are asking you for? And then, when Claude builds it, are you setting up the right bar so that what you ship people actually love?</p>
</blockquote>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Product Taste Is the New Technical Skill with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/hIEA3YFixE4?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>Cat’s not alone in highlighting the importance of taste in a world where code is a commodity. <a href="https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/#:~:text=the%20new%20Amish.%E2%80%9D-,Taste%20is%20the%20moat,-Another%20of%20the" target="_blank" rel="noreferrer noopener">Steve Yegge</a>, <a href="https://www.oreilly.com/radar/the-mythical-agent-month/#:~:text=Design%20and%20taste%20as%20our%20last%20foothold" target="_blank" rel="noreferrer noopener">Wes McKinney</a>, and many others, <a href="https://www.oreilly.com/radar/betting-against-the-bitter-lesson/#:~:text=Even%20if%20the,taste%20and%20curation" target="_blank" rel="noreferrer noopener">myself included</a>, see taste and judgment as a uniquely human value. This has practical implications for how engineers should spend their time now, and for what the next generation needs to learn.</p>



<p>For junior engineers specifically, Cat described a progression: Start by using Claude Code to understand the codebase (ask all the &#8220;dumb questions&#8221; without embarrassment), take those answers to a senior engineer for calibration, and then close the loop by updating the CLAUDE.md with whatever was missing.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>Think of Claude Code as your intern that you&#8217;re trying to level up. Like, teach it back to Claude. Add a <code>/verify</code> slash command. Put it in the CLAUDE.md or the agent README. Approach this as senior engineers helping you level up, and then you helping Claude and other agents level up.</p>
</blockquote>



<p>The improvement process, in other words, should be bidirectional. Engineers get better at using the tools and the tools get better through the engineers&#8217; accumulated knowledge. And significantly, this process keeps humans firmly in the loop, playing a role that’s “<a href="https://www.oreilly.com/radar/software-craftsmanship-in-the-age-of-ai/" target="_blank" rel="noreferrer noopener">active, continuous, and skilled</a>.”</p>



<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="How Should Junior Engineers Use Claude Code? with Cat Wu" width="500" height="281" src="https://www.youtube.com/embed/qnSuOFXkEH0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>You can <a href="https://learning.oreilly.com/videos/ai-codecon-software/0642572305581/" data-type="link" data-id="https://learning.oreilly.com/videos/ai-codecon-software/0642572305581/" target="_blank" rel="noreferrer noopener">watch Cat and Addy&#8217;s full chat</a>, plus everything else from AI Codecon on the O&#8217;Reilly learning platform. Not a member? <a href="https://www.oreilly.com/start-trial/?type=individual" data-type="link" data-id="https://www.oreilly.com/start-trial/?type=individual" target="_blank" rel="noreferrer noopener">Sign up for a free 10-day trial</a>, no strings attached. </em></p>
</blockquote>



<p></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/everyones-an-engineer-now/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>AI Code Review Only Catches Half of Your Bugs</title>
		<link>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/</link>
				<comments>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/#respond</comments>
				<pubDate>Thu, 30 Apr 2026 11:14:49 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18637</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-code-review-only-catches-half-of-your-bugs.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-code-review-only-catches-half-of-your-bugs-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Luckily, there&#039;s a way to catch the ones that no structural analysis ever will]]></custom:subtitle>
		
				<description><![CDATA[This is the fifth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, and part four here. I recently had a taste of humility with my AI-generated code. I live in Park Slope, Brooklyn, and recently I needed to get to the other side of the neighborhood. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the fifth article in a series on agentic engineering and AI-driven development. Read part one <a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two <a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three <a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, and part four <a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">here</a>.</em></p>
</blockquote>



<p>I recently had a taste of humility with my AI-generated code. I live in Park Slope, Brooklyn, and needed to get to the other side of the neighborhood. I thought I&#8217;d be clever: I like taking the bus, so I decided to hop on the one that goes right down 7th Avenue. I knew I could check the schedule using the MTA&#8217;s really useful Bus Time app or website, but it doesn&#8217;t take into account walking time from my house or give me a good idea of when to leave. This seemed like a great opportunity to vibe code an app and do some quick AI-driven development.</p>



<p>It took about two minutes for Claude Code to get my new app working. It made a lovely little web UI, I configured my stop and how long it takes me to walk there, and it gave me the perfect departure time.</p>



<p>When I actually walked out the door, the app perfectly predicted my wait. There was just one problem: my bus was nowhere to be seen. What I <em>did</em> see was a bus driving the exact opposite direction down 7th Avenue.</p>



<p>It was pretty obvious what had happened. I needed to go deeper into Brooklyn, not towards Manhattan, and the AI had picked the wrong direction. (Actually, as Cowork pointed out, each stop has its own ID, and it had selected the ID for the wrong stop.) I&#8217;d been using Cowork to orchestrate everything, and I could easily have just asked it to go out and check the MTA&#8217;s Bus Time site for me to make sure the app was working. But I just trusted the AI. As a result, I had to walk. Which is fine—I love walking—but the irony was painful. I had literally just published an article about AI code quality and why you shouldn&#8217;t blindly trust it, and here I was doing exactly that.</p>



<p>The app had a bug. But it wasn&#8217;t the kind of bug you&#8217;d necessarily catch using a typical AI code review prompt. It built, ran, and did a perfectly fine job parsing the JSON from the MTA API. But if I&#8217;d started with a simple requirement—even just a user story like &#8220;as a Park Slope resident, I want to catch the B69 headed towards Kensington so I can get deeper into Brooklyn&#8221;—the AI would have built it differently. The problem is that AI can only build the thing you tell it to build, which isn&#8217;t necessarily the thing you <em>wanted</em> it to build. <strong>AI is really good at writing &#8220;correct&#8221; code that does the wrong thing.</strong></p>
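<p>That user story contains an assertion you can actually execute. As a toy sketch (the stop IDs, field names, and <code>pick_stop</code> helper below are all invented for illustration; this is not the real MTA Bus Time API), selecting the stop by stated intent rather than by whatever ID turns up first looks like this:</p>

```python
# Toy sketch: every stop ID and field name here is invented, not real MTA data.
USER_STORY_DIRECTION = "toward Kensington"  # "...so I can get deeper into Brooklyn"

def pick_stop(stops, wanted_direction):
    """Select a stop by intent; fail loudly instead of silently taking the wrong one."""
    matches = [s for s in stops if s["direction"] == wanted_direction]
    if not matches:
        raise LookupError(f"no stop headed {wanted_direction!r}")
    return matches[0]

stops = [
    {"id": "B69-0001", "direction": "toward Manhattan"},   # what the AI silently picked
    {"id": "B69-0002", "direction": "toward Kensington"},  # what the user story required
]

print(pick_stop(stops, USER_STORY_DIRECTION)["id"])  # selects the Kensington-bound stop
```

<p>A test derived from the requirement (the selected stop must head toward Kensington) fails immediately on the wrong-direction ID, while every structural check on the same code passes.</p>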



<p>My Brooklyn bus detour was a minor inconvenience. But it was a really useful, small-scale example of what I kept running into in my larger projects, too. There&#8217;s an entire class of bugs that you simply can&#8217;t find with structural analysis—no linter, no static analyzer, no AI code reviewer will catch them—because the code isn&#8217;t wrong in any way that&#8217;s visible from the code alone. You need to know what the code was supposed to do. You need to know the intent.</p>



<p>The data on why requirements matter goes back decades. Back in the 1990s, for example, the Standish CHAOS reports were a big eye-opener for me and a lot of other people in the industry, providing large-scale data that confirmed what we&#8217;d been seeing on our own projects: that the most expensive defects trace back to misunderstood or missing requirements. Those reports really underscored the idea that poor requirements management, and specifically incomplete or frequently changing specifications, was among the primary drivers behind IT project failures. (And, as far as I can tell, it still is, and AI isn&#8217;t helping things—see my O&#8217;Reilly Radar article, “<a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>”).</p>



<p>The idea that requirements problems really are the source of the most expensive kind of defects should make intuitive sense: If you build the wrong thing, you have to tear it apart and rebuild it. That&#8217;s why I made requirements the foundation of the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open-source skill for AI tools like Claude Code, Cursor, and Copilot that I introduced in the <a href="https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/" target="_blank" rel="noreferrer noopener">previous article</a>. I&#8217;ve spent decades doing test-driven development, partnering with QA teams, welcoming the harshest code reviews from teammates who don&#8217;t pull punches—and that experience led me to build a tool that uses AI to bring back quality engineering practices the industry abandoned decades ago. I&#8217;ve tested it against a wide range of open-source projects in Go, Java, Rust, Python, and C#, from small utilities to widely used libraries with tens of thousands of stars, and it&#8217;s found real bugs in almost every project it&#8217;s come across, including ones that have been confirmed and merged upstream.</p>



<p>I think there are a lot of wider lessons we can learn from my experience using requirements to help AI find bugs—especially security bugs. So in this article, I want to focus on the single most important thing I&#8217;ve learned from building it: everything depends on requirements. Not just any requirements, but a specific kind of requirement that most projects don&#8217;t have, that most AI tools don&#8217;t ask for, and that turns out to be the key to making AI actually useful for verifying code quality.</p>



<h2 class="wp-block-heading"><strong>Spec-driven development and what it misses</strong></h2>



<p>Developers using AI tools have been rediscovering the value of writing things down before asking the AI to build them. Spec-driven development (SDD) has become very popular, and for good reason. Addy Osmani wrote an excellent piece on this, “<a href="https://addyosmani.com/blog/good-spec/" target="_blank" rel="noreferrer noopener">How to Write a Good Spec for AI Agents</a>,” and the core idea is sound: If you write a clear specification of what you want built, the AI produces dramatically better results than if you just describe it in a chat prompt and hope for the best.</p>



<p>I think SDD is important, and I&#8217;d encourage any developer working with AI to adopt it. But as I was building the Quality Playbook, I discovered that SDD has a blind spot that matters a lot for code quality. An SDD spec describes the <em>how</em>—what the implementation should look like. It tells the AI &#8220;implement a duplicate key check&#8221; or &#8220;add a retry mechanism with exponential backoff&#8221; or &#8220;create a REST endpoint that returns paginated results.&#8221; That&#8217;s useful for building things. But it&#8217;s not enough for verifying them.</p>



<p>A requirement, by contrast, doesn&#8217;t say &#8220;implement a duplicate key check.&#8221; It says &#8220;users depend on Gson to reject ambiguous input so they don&#8217;t silently accept corrupted data.&#8221; The AI can reason about the second one in ways it can&#8217;t reason about the first, because the second one has the purpose attached. When the AI knows the purpose, it can evaluate whether the code actually fulfills that purpose across all the edge cases, not just the ones the spec explicitly listed. That&#8217;s how the Quality Playbook caught a bug in Google&#8217;s Gson library, one of the most widely used JSON libraries in Java.</p>



<p>I think it&#8217;s worth digging into that particular bug, because it&#8217;s a great example of just how powerful requirements analysis can be for finding defects. The playbook derived null-handling requirements from Gson&#8217;s own community—GitHub issues <a href="https://github.com/google/gson/issues/676" target="_blank" rel="noreferrer noopener">#676</a>, <a href="https://github.com/google/gson/issues/913" target="_blank" rel="noreferrer noopener">#913</a>, <a href="https://github.com/google/gson/issues/948" target="_blank" rel="noreferrer noopener">#948</a>, and <a href="https://github.com/google/gson/issues/1558" target="_blank" rel="noreferrer noopener">#1558</a>, some dating back to 2016—then used those requirements to find that duplicate keys were silently accepted when the first value was null. It confirmed the bug by generating a failing test, then patched the code and verified the test passed. I&#8217;ve used Gson for years and done a lot of work with Java serialization, so I read the code and the fix myself before submitting anything—trust but verify. The fix was merged as <a href="https://github.com/google/gson/pull/3006" target="_blank" rel="noreferrer noopener">https://github.com/google/gson/pull/3006</a>, confirmed by Google&#8217;s own test suite.</p>
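<p>Gson&#8217;s fix is in Java, but the class of bug is easy to reproduce in any JSON parser. As a rough Python analogue (a sketch of the idea, not the Gson patch itself), the requirement &#8220;reject ambiguous input rather than silently accepting corrupted data&#8221; becomes a duplicate-key check that holds even when the first value is null:</p>

```python
import json

def strict_object(pairs):
    # Requirement-derived check: reject ambiguous input outright.
    obj = {}
    for key, value in pairs:
        if key in obj:  # membership test catches duplicates even when obj[key] is None
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

def parse_strict(text):
    return json.loads(text, object_pairs_hook=strict_object)

doc = '{"id": null, "id": 42}'   # the bug class: first value null, duplicate silently wins
print(json.loads(doc))           # default parsing silently keeps the last value
try:
    parse_strict(doc)
except ValueError as err:
    print(err)                   # the requirement-derived parser refuses the document
```

<p>Note the membership test on the key: a check based on whether the stored value is non-null would miss exactly the null-first variant, which is why the requirement (reject ambiguity), not the spec (check for duplicates), is what drives the test.</p>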



<p>That bug had been hiding in plain sight for years, through thousands of tests and countless code reviews. And no structural analysis would likely ever have found it, because you needed the requirement to know the behavior was wrong.</p>



<p>This distinction might sound academic, but it has very concrete consequences for whether your AI can actually find bugs in your code.</p>



<h2 class="wp-block-heading"><strong>About half of all security bugs are invisible to structural analysis</strong></h2>



<p>The security world has known about the limits of structural analysis for a long time. The <a href="https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.500-326.pdf" target="_blank" rel="noreferrer noopener">NIST SATE evaluations</a> found that <strong>the best static analysis tools plateaued at around 50-60% detection rates for security vulnerabilities</strong>. Gary McGraw&#8217;s <a href="https://www.oreilly.com/library/view/software-security-building/0321356705/" target="_blank" rel="noreferrer noopener"><em>Software Security: Building Security In</em></a> (Addison-Wesley, 2006) explains why: Roughly 50% of security defects are implementation bugs, and the other 50% are design flaws. Static analysis tools target the implementation bugs—buffer overflows, SQL injection, format string vulnerabilities—because those are pattern-matchable. But design flaws are about intent: The system&#8217;s architecture doesn&#8217;t enforce the security properties it&#8217;s supposed to enforce, and no amount of scanning the code will reveal that. A <a href="https://arxiv.org/abs/2407.12241" target="_blank" rel="noreferrer noopener">2024 study by Charoenwet et al.</a> (ISSTA 2024) confirmed this is still the case: They tested five static analysis tools against 815 real vulnerability-contributing commits and found that 22% of vulnerable commits went entirely undetected, and 76% of warnings in vulnerable functions were irrelevant to the actual vulnerability. The pattern is consistent across two decades of research: There&#8217;s a ceiling on what you can find by analyzing code, and it&#8217;s around half.</p>



<p>There&#8217;s a good reason for that limitation: the <strong>intent ceiling</strong>. A structural analysis tool is limited to reading the code and looking at what it does; it has no way to take into account <em>what the developer intended it to do</em>.</p>



<p>When an AI does a code review without requirements, it&#8217;s limited to structural analysis: pattern matching, code smell detection, race condition analysis. It can ask &#8220;does this look right?&#8221; but it can&#8217;t ask &#8220;does this do what it&#8217;s supposed to do?&#8221; because it doesn&#8217;t know what the code is supposed to do. Structural review catches genuinely important stuff—race conditions, null pointer issues, resource leaks, concurrency bugs. A structural reviewer looking at a shell script will catch a missing <code>fi</code>, a bad variable expansion, a race condition. Structural review is useful, and structural review is what most AI code review tools do today.</p>



<p>But about half of all security defects are intent violations: things the code doesn&#8217;t do that it was supposed to do, or things it does that it wasn&#8217;t supposed to do. They&#8217;re invisible without a specification to check against, and no tool will find them by looking at code that is, structurally, perfectly sound. A structural reviewer looking at a script that&#8217;s, say, used to check router configuration files, might find well-formed bash, correct syntax, proper quoting, and code that looks like it works and doesn&#8217;t match known antipatterns. It wouldn&#8217;t know the script is only validating three of the five access control rules it&#8217;s supposed to enforce because that&#8217;s a requirements question, not a syntax question.</p>



<p>Or, more personally for me, this is what happened with my bus tracker app: The JSON parsing was flawless, the UI was correct, the timing logic worked perfectly. The only problem was that it showed buses headed towards Manhattan when I needed to go deeper into Brooklyn—and no structural analysis would ever catch that, because you need to know which direction I intended to go. That&#8217;s me and my very clever AI hitting the intent ceiling.</p>



<h2 class="wp-block-heading"><strong>The intent ceiling is a security problem</strong></h2>



<p>This is where it gets really serious, because security vulnerabilities are some of the most dangerous members of this class of invisible bugs.</p>



<p>Think about what a missing authorization check looks like to an AI code reviewer. Let&#8217;s say you&#8217;ve got a web endpoint with a well-formed HTTP handler, properly sanitized inputs, and a safe database query. The code is clean, and passes every structural check and static analysis tool you&#8217;ve thrown at it. Now you&#8217;re testing it and, much to your dismay, you discover that the endpoint lets any authenticated user delete any other user&#8217;s data because nobody ever wrote down the requirement that says &#8220;only administrators can perform deletions.&#8221; That&#8217;s <a href="https://cwe.mitre.org/data/definitions/862.html" target="_blank" rel="noreferrer noopener">CWE-862: Missing Authorization</a>, and it rose to #9 on the <a href="https://cwe.mitre.org/top25/" target="_blank" rel="noreferrer noopener">2024 CWE Top 25</a> most dangerous software weaknesses.</p>



<p>That&#8217;s not a coding error! It&#8217;s a missing requirement.</p>
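<p>Here&#8217;s what that missing requirement can look like in code. This is a hypothetical sketch (the handler, schema, and names are invented, not from the article): every structural property holds, yet it is CWE-862 in miniature:</p>

```python
# Hypothetical delete endpoint: clean code, safe query, missing authorization.
import sqlite3

def delete_record(db: sqlite3.Connection, session_user: str,
                  record_owner: str, record_id: int) -> None:
    """Delete a record on behalf of an authenticated user."""
    # Authentication happened upstream; session_user is verified.
    # The query is parameterized, so there is no SQL injection.
    # Every structural check and static analyzer passes.
    db.execute("DELETE FROM records WHERE owner = ? AND id = ?",
               (record_owner, record_id))
    db.commit()
    # Missing: the unwritten requirement "only administrators may delete
    # records they don't own." Nothing compares session_user to
    # record_owner or checks a role, and no scanner flags the absence
    # of a check nobody specified.
```
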



<p>That&#8217;s McGraw&#8217;s point: About half of all security defects aren&#8217;t implementation bugs at all. They&#8217;re design flaws, places where the system&#8217;s architecture doesn&#8217;t enforce the security properties it was supposed to enforce. A cross-site scripting vulnerability isn&#8217;t always a failure to sanitize input. Sometimes it&#8217;s a failure to define which inputs are trusted and which aren&#8217;t. A privilege escalation isn&#8217;t always a broken access check. Sometimes there was never an access check to begin with because nobody specified that one was needed. These are intent violations and they&#8217;re invisible to any tool that doesn&#8217;t know what the software is supposed to prevent.</p>



<p>AI code review tools today are very good at catching the implementation half of McGraw&#8217;s split. They can spot a SQL injection pattern, flag an unsafe deserialization, identify a buffer overflow. But they&#8217;re working on the same side of the 50/50 line that static analysis has always worked on. The design half—the missing authorization checks, the unspecified trust boundaries, the security properties that were never written down—requires the same thing that catching my bus tracker bug required: knowing what the software was supposed to do in the first place.</p>



<h2 class="wp-block-heading"><strong>How the Quality Playbook derives requirements (and how you can too!)</strong></h2>



<p>The problem most projects face is that they don&#8217;t have formal requirements. What they have is code, documentation, commit messages, chat history, README files, and maybe some design docs. The question is how to get from that mess to a specification that an AI can actually use for verification.</p>



<p>The key insight I had while building the playbook was that every previous approach I tried asked the model to do two things at once: figure out what contracts exist AND write requirements for them. That doesn&#8217;t work—the model runs out of attention trying to hold the entire behavioral surface in its head while also producing formatted requirements. So I split them apart into four steps: First, have the AI read each source file and write down every behavioral contract it observes as a simple list. Second, derive requirements from those contracts plus the documentation. Third, check whether every contract is covered by a requirement. Fourth, assert completeness—and if there are gaps, go back to step one for the files with gaps.</p>



<p>The contracts file works as external memory. When the model &#8220;forgets&#8221; about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap.</p>
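<p>The coverage check in the third step can be a few lines of script. The sketch below assumes invented file formats (one contract per line with an ID, and requirements that name the contracts they cover); the playbook&#8217;s actual formats may differ:</p>

```python
# Hypothetical coverage check: every contract recorded in step one must
# be claimed by at least one requirement from step two.

def find_uncovered(contracts: list[str], requirements: list[str]) -> list[str]:
    """Return the contracts that no requirement claims to cover."""
    covered = set()
    for req in requirements:
        # e.g. "REQ-2: durability on shutdown (covers: C2, C5)"
        if "(covers:" in req:
            ids = req.split("(covers:")[1].rstrip(")").split(",")
            covered.update(i.strip() for i in ids)
    # A contract whose ID never appears in any "covers:" list is a gap.
    return [c for c in contracts if c.split(":")[0].strip() not in covered]

contracts = ["C1: parser rejects empty input",
             "C2: writer flushes on close",
             "C3: duplicate keys raise an error"]
requirements = ["REQ-1: input validation (covers: C1)",
                "REQ-2: durability on shutdown (covers: C2)"]
# C3 surfaces as a visible gap instead of being silently forgotten.
```
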



<p>You don&#8217;t need the Quality Playbook to do this—you can apply the same technique with any AI coding tool that you&#8217;re already using. Here&#8217;s what I&#8217;d recommend:</p>



<ul class="wp-block-list">
<li><strong>Write down what your software is supposed to guarantee.</strong> Not just what it does—what it&#8217;s supposed to do, for whom, under what conditions. If you&#8217;re practicing spec-driven development, you&#8217;re already partway there. The next step is adding the <em>why</em>: Why does this behavior matter, who depends on it, what goes wrong if it fails? That&#8217;s the difference between a spec and a requirement, and it&#8217;s the difference between an AI that can build your code and an AI that can verify it.<br></li>



<li><strong>Feed the AI your intent, not just your code.</strong> The intent is already sitting in your chat history, your design discussions, your Slack threads, your support tickets. Every Claude export, every Gemini conversation, every Cowork transcript contains design intent that never made it into specifications: why a function was written a certain way, what failure prompted an architectural decision, what tradeoffs were discussed before choosing an approach. The design intent that used to require a human to extract and document is now sitting in your chat logs. Your AI can read the transcripts and extract the <em>why</em>.<br></li>



<li><strong>Look for the negative requirements.</strong> What should your software <em>not</em> do? What states should be impossible? What data should never be exposed? These negative requirements are often the most valuable because they define boundaries that structural review can&#8217;t see. The missing authorization bug was a negative requirement: Unauthenticated users must <em>not</em> be able to delete other users&#8217; data. The Gson bug was a negative requirement: Duplicate keys must <em>not</em> be silently accepted when the first value is null. If you can articulate what your software must never do, you&#8217;ve given the AI something powerful to check against.<br></li>
</ul>
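<p>Negative requirements also translate directly into executable guards. As a sketch, here is the duplicate-key rule expressed with Python&#8217;s <code>json</code> module (the bug referenced above was in Gson, a Java library; this is an analogous check, not that library&#8217;s fix, and it rejects all duplicates rather than only the null-first case):</p>

```python
import json

def strict_pairs(pairs):
    """Reject duplicate keys instead of silently keeping the last value."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            # Negative requirement: duplicate keys must NOT be accepted.
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

def load_strict(text: str):
    # object_pairs_hook receives every JSON object's (key, value) pairs
    # before they are collapsed into a dict, so duplicates are visible.
    return json.loads(text, object_pairs_hook=strict_pairs)

load_strict('{"a": 1}')                # parses normally
# load_strict('{"a": null, "a": 2}')   # would raise ValueError
```
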



<p>In the next article, I&#8217;ll talk about context management—the skill that actually determines whether your AI sessions produce good work or mediocre work. Everything I&#8217;ve described here depends on the AI having the right information at the right time, and it turns out that managing what the AI knows (and what it forgets) is an engineering discipline in its own right. I&#8217;ll cover how I went from running 15 million tokens in a single prompt to splitting the playbook into independent phases with zero context carryover, and why that transition worked on the first try.</p>



<p><em>The </em><a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener"><em>Quality Playbook</em></a><em> is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of </em><a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener"><em>awesome-copilot</em></a><em>.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Disclosure: Aspects of the methodology described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Don&#8217;t Automate Your Moat: Matching AI Autonomy to Risk and Competitive Stakes</title>
		<link>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/</link>
				<comments>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/#respond</comments>
				<pubDate>Wed, 29 Apr 2026 11:42:28 +0000</pubDate>
					<dc:creator><![CDATA[Marc Millstone and Claude]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18628</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-automate-your-moat.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-automate-your-moat-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Velocity is table stakes. Code is a commodity. Understanding is the edge.]]></custom:subtitle>
		
				<description><![CDATA[I was talking to a senior engineer at a well-funded company not long ago. I asked him to walk me through a critical algorithm at the heart of their product, something that ran hundreds of times a second and directly affected customer outcomes. He paused and said, &#8220;Honestly, I&#8217;m not totally sure how it works. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I was talking to a senior engineer at a well-funded company not long ago. I asked him to walk me through a critical algorithm at the heart of their product, something that ran hundreds of times a second and directly affected customer outcomes. He paused and said, &#8220;Honestly, I&#8217;m not totally sure how it works. AI wrote it.&#8221;</p>



<p>A few weeks later, a different engineer at another company was paged about a system outage. He pulled up the failing service and realized he had no idea it was connected to a database. A colleague had accepted the AI-generated PR that added that dependency three months earlier. The tests passed. The change was never written down. The original engineer moved on, and the knowledge was lost.</p>



<p>These aren&#8217;t new stories. Engineers have always inherited systems they didn&#8217;t fully build. What&#8217;s new is the disguise and the speed. AI is an amazing enabler. Organizations must adopt it to remain relevant. Yet the emerging pattern—describe what you want, let an agent iterate until it works, pay for it in tokens instead of engineering hours—is functionally a buy decision wearing a build costume. The code is in your repo. Your engineers merged the PR. It feels like you built it. But if nobody on your team understands why it works the way it does, you&#8217;ve purchased a dependency you can&#8217;t maintain from a vendor you can&#8217;t call.</p>



<p>AI doesn’t create that gap once. It widens it continuously at a pace that outstrips the organizational habits that once kept it manageable. Two problems compound at once. You can’t extend the thing that makes you hard to replace. And when it breaks, the incident lands on a team that doesn’t understand what they’re fixing, turning a recoverable outage into a customer-facing crisis. Engineering leaders have wrestled with build-versus-buy tradeoffs for decades, and the hard-won lesson has always been the same: You don&#8217;t outsource your competitive advantage. The token-funded generation loop doesn&#8217;t change that calculus. It makes it easier to skip the question entirely.</p>



<p>The question that matters isn&#8217;t &#8220;Can AI do this?&#8221; If it can&#8217;t today, it will be able to tomorrow. And the argument that follows does not depend on the quality of the AI-generated code. This article covers two questions most engineering organizations have never asked at the same time: How much damage can this work do if it fails, and how much does it differentiate us? Most teams optimize for velocity and never ask what they&#8217;re risking or giving away in the process. The gap between those unasked questions is where the most expensive mistakes are already being made.</p>



<h2 class="wp-block-heading"><strong>Part 1: Two dimensions. Neither is velocity.</strong></h2>



<p>Moving faster matters. But velocity alone misses the two dimensions that determine whether AI autonomy helps or hurts your business.</p>



<p><strong>Business risk</strong>: What&#8217;s the blast radius if this fails? A bug in an internal CLI tool costs you an afternoon. A bug in your authentication logic costs you customers and possibly market cap. A bug in your core pricing algorithm costs you the business. These are not the same.</p>



<p><strong>Competitive differentiation</strong>: Does this code <em>define your business?</em> Your moat is your architecture, your performance characteristics, your core algorithms, and the product decisions baked into your infrastructure. But it&#8217;s also the institutional knowledge that shaped them: the reasoning behind the trade-offs, the context that no model was trained on. If your competitors can generate the same code with the same model you&#8217;re using, it stops being an advantage.</p>



<p>Most organizations ask the first question on a good day. Almost none ask the second. That gap is how you end up shipping fast into a moat nobody can explain and nobody can extend.</p>



<p>Understanding why both dimensions matter starts with velocity and what happens when the feedback loop around it breaks.</p>



<h3 class="wp-block-heading"><strong>Velocity feels real. Debt is often invisible.</strong></h3>



<p>AI coding tools are genuinely impressive. GitHub&#8217;s research showed 55% faster task completion with Copilot in controlled conditions.<sup data-fn="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9" class="fn"><a href="#cd51007b-bb8f-45b7-b7a2-7317b6d0cff9" id="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9-link">1</a></sup> That number has driven an assumption that faster is always better.</p>



<p>A 2025 METR randomized controlled trial<sup data-fn="7306407f-0600-4183-85fa-04f12932c6e6" class="fn"><a href="#7306407f-0600-4183-85fa-04f12932c6e6" id="7306407f-0600-4183-85fa-04f12932c6e6-link">2</a></sup> found something that should give every engineering leader pause. Sixteen experienced developers on real production codebases forecasted they’d complete tasks 24% faster with AI. After finishing, they estimated they’d gone 20% faster. They’d actually gone 19% slower.</p>



<p>The velocity finding is striking. But the perception gap matters more. The feedback loop between &#8220;how am I doing?&#8221; and &#8220;how am I actually doing?&#8221; was broken throughout and never corrected itself. This doesn&#8217;t resolve the velocity debate. It reframes it. The danger isn&#8217;t that individuals move too fast. It&#8217;s that organizations mistake output volume for productivity and strip out the review processes that used to catch what that gap costs.</p>



<p>A Tilburg University study of open source projects after GitHub Copilot&#8217;s introduction found the same pattern at the organizational level.<sup data-fn="b0ff63ba-e120-48fb-902d-c89fc1d80fe5" class="fn"><a href="#b0ff63ba-e120-48fb-902d-c89fc1d80fe5" id="b0ff63ba-e120-48fb-902d-c89fc1d80fe5-link">3</a></sup> Productivity did increase, but primarily among less-experienced developers. Code written after AI adoption required more rework to meet repository standards. The added rework burden fell on the most experienced (core) developers who reviewed 6.5% more code after Copilot&#8217;s introduction and saw a 19% drop in their own original code output. The velocity looks real at the surface. Underneath, the maintenance cost shifts upward to the people who can least afford to lose productive time.</p>



<p>That broken feedback loop has a name. Researchers call it <strong>cognitive debt</strong><sup data-fn="92b75161-55d2-410a-9da7-0faaf234bae3" class="fn"><a href="#92b75161-55d2-410a-9da7-0faaf234bae3" id="92b75161-55d2-410a-9da7-0faaf234bae3-link">4</a></sup>: the growing gap between how much code exists in your system and how much of it anyone actually understands. Technical debt shows up in your linter and your backlog. Cognitive debt is invisible. There&#8217;s no signal telling engineers where their understanding ends. That&#8217;s precisely what the METR perception gap showed. It never corrected itself.</p>



<p>Research by Anthropic Fellows found that engineers using AI assistance when learning new tools scored 17% lower on comprehension tests than those who coded by hand, with the steepest drops in debugging ability.<sup data-fn="edc18347-043c-4398-9652-16b4d7a8c464" class="fn"><a href="#edc18347-043c-4398-9652-16b4d7a8c464" id="edc18347-043c-4398-9652-16b4d7a8c464-link">5</a></sup> MIT&#8217;s Media Lab found the same pattern in writing tasks: Brain connectivity was weakest in the group using LLM assistance, strongest in the group working without tools.⁴ Active production builds understanding. Passive consumption doesn&#8217;t.</p>



<p>You understand what you build better than what you review. When you write code, you produce output <em>and</em> build a mental model. That&#8217;s what Peter Naur called the &#8220;theory of the program.&#8221; It lives in your head, not in the repo.<sup data-fn="36e33054-aa72-4221-9c5d-bd189d716cac" class="fn"><a href="#36e33054-aa72-4221-9c5d-bd189d716cac" id="36e33054-aa72-4221-9c5d-bd189d716cac-link">6</a></sup> The MIT study captured this directly: 83% of participants who wrote essays with LLM assistance could not quote a single sentence from essays they had just written.⁴</p>



<p>Cognitive debt is invisible until it isn&#8217;t. When it surfaces, it hits both dimensions hard, in different ways.</p>



<h3 class="wp-block-heading"><strong>Business risk: The blast radius of not knowing</strong></h3>



<p>On the business risk dimension, cognitive debt is a safety problem.</p>



<p>When nobody fully understands the system, the blast radius of a failure expands silently. The incident that eventually comes (and it always comes) lands on a team that can&#8217;t diagnose what they didn&#8217;t build. The engineer pulling up the failing service at 2 AM has no mental model of why it was built the way it was, what it connects to, or what the edge cases look like under load. So they ask the LLM. It can explain what the code does and often propose a reasonable fix. It can&#8217;t tell you why it was designed that way. And a fix that looks right to the model can quietly violate constraints that nobody thought to document.</p>



<p>Cognitive debt compounds a second, independent risk: the pace at which AI-generated code reaches production. OX Security&#8217;s analysis<sup data-fn="56b259aa-e401-4c2c-9ecd-99ade708cc29" class="fn"><a href="#56b259aa-e401-4c2c-9ecd-99ade708cc29" id="56b259aa-e401-4c2c-9ecd-99ade708cc29-link">7</a></sup> of over 300 software repositories found that AI-generated code isn&#8217;t necessarily more vulnerable per line than human-written code. The problem is velocity.</p>



<p>Code review, debugging, and team oversight are the bottlenecks that catch vulnerable code before it ships. AI makes it easy to remove them. CodeRabbit&#8217;s analysis of real-world pull requests found AI-authored changes contain up to 1.7x more critical and major defects than human-written code, with logic and correctness issues up 75%.<sup data-fn="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c" class="fn"><a href="#eaf5792c-72c2-4cb6-94c5-d8a85a54af7c" id="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c-link">8</a></sup> Apiiro&#8217;s analysis found that while AI reliably reduces surface-level syntax errors, architectural design flaws and privilege escalation paths (the categories automated scanners miss and human reviewers struggle to catch) spiked in AI-assisted codebases.<sup data-fn="0f193fe2-b405-4b1d-85fa-b745e853969c" class="fn"><a href="#0f193fe2-b405-4b1d-85fa-b745e853969c" id="0f193fe2-b405-4b1d-85fa-b745e853969c-link">9</a></sup></p>



<p>AI accelerates output and accelerates unreviewed risk in equal measure. <strong>Cognitive debt</strong> means that when something breaks, the team is learning the system as they&#8217;re trying to fix it. Remove their understanding and you haven&#8217;t streamlined the process. You&#8217;ve only removed the thing standing between a bad day and a catastrophic one.</p>



<h3 class="wp-block-heading"><strong>Competitive differentiation: What you give away without knowing it</strong></h3>



<p>The competitive differentiation risk isn&#8217;t that AI will generate your exact competitive algorithm and hand it to your competitor. It&#8217;s subtler. Your advantage was never the code itself; it was the judgment that shaped it. When AI writes that code, the judgment never forms. The code arrives, but the understanding that would let your team extend it, improve it, or defend it under pressure doesn&#8217;t. Your moat is most likely to survive in the places AI finds hardest to reach.</p>



<p>That judgment—formed by the performance trade-offs that took years to tune, the failure modes that only someone who&#8217;s been paged understands, the architectural decisions that encode domain knowledge nobody wrote down—doesn&#8217;t live in the codebase. It lives in your engineers&#8217; heads.</p>



<p>And here&#8217;s the part most teams miss: Your competitor with the same AI tools doesn&#8217;t just get similar code; it gets a team that also doesn&#8217;t understand why the code works the way it does. Neither of you can extend it, and the race to the next architectural move becomes a coin flip rather than a compounding advantage. The build-versus-buy discipline exists precisely because decades of experience taught engineering organizations that outsourcing your core means losing the ability to extend it. The token-funded generation loop doesn&#8217;t change that calculus. It makes it easier to mistake the outsourcing for ownership because the code has your name on it.</p>



<p>The structural problem runs even deeper. Models trained on public code produce outputs weighted toward well-represented patterns, the common solutions to common problems. Research confirms this. LLM performance drops sharply on less-common programming languages where training data is sparse, and on genuinely novel implementations. Even the best current models correctly implement fewer than 40% of coding tasks drawn from recent research papers.<sup data-fn="076102b1-56ac-4820-99f2-71d4ef08dc32" class="fn"><a href="#076102b1-56ac-4820-99f2-71d4ef08dc32" id="076102b1-56ac-4820-99f2-71d4ef08dc32-link">10</a></sup> And the convergence problem extends beyond code. A pre-registered experiment tracking 61 participants over seven days found that while ChatGPT consistently boosted creative output during use, performance reverted to baseline the moment the tool was unavailable.<sup data-fn="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be" class="fn"><a href="#ce2fbf20-b309-4e68-afe8-6fe10b3ac7be" id="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be-link">11</a></sup> More critically, the work produced with AI assistance became increasingly homogenized over time. That homogenization persisted even after the tool was removed. The participants hadn&#8217;t borrowed the tool&#8217;s output. They&#8217;d internalized its patterns. For engineering organizations, this is the differentiation risk made concrete: Teams that rely on AI for their most critical design decisions risk generating commodity code today and training themselves to think in commodity patterns tomorrow.</p>



<p>Engineers who deeply own their most critical systems are better at diagnosing incidents and see the next architectural move that competitors can&#8217;t follow. Delegate that comprehension away and you can keep the lights on. You can&#8217;t see around corners.</p>



<h3 class="wp-block-heading"><strong>When it goes wrong, it really goes wrong</strong></h3>



<p>Both dimensions rest on the same vulnerability: cognitive debt accumulating on work that matters. The failure cases make it concrete.</p>



<p>The production failures are accumulating. A Replit AI agent deleted months of production data in seconds after violating explicit code-freeze instructions, then initially misled the user about whether recovery was possible.<sup data-fn="61637363-b0a5-44f9-939c-e2f4b2a8fb1f" class="fn"><a href="#61637363-b0a5-44f9-939c-e2f4b2a8fb1f" id="61637363-b0a5-44f9-939c-e2f4b2a8fb1f-link">12</a></sup> Reports emerged in early 2026 of a major cloud provider convening mandatory engineering reviews after a pattern of high-blast-radius incidents, with AI-assisted code changes cited as a contributing factor. In each case, the humans in the loop either didn&#8217;t understand what they were approving, or weren&#8217;t in the loop at all.</p>



<p>The deeper pattern predates AI tools entirely. Knight Capital Group took seventeen years to become the largest trader in U.S. equities. It took forty-five minutes to lose $460 million.<sup data-fn="30954c70-1f77-4ef5-9a37-c392499f0821" class="fn"><a href="#30954c70-1f77-4ef5-9a37-c392499f0821" id="30954c70-1f77-4ef5-9a37-c392499f0821-link">13</a></sup> The culprit was a nine-year-old piece of deprecated code called Power Peg, left on production servers and never retested after engineers modified an adjacent function in 2005. When engineers reused its feature flag for new functionality in 2012, nobody understood what they were reactivating. When the fault surfaced, the team’s attempt to fix it made things worse. They uninstalled the new code from the seven servers where it had deployed correctly, which caused Power Peg to activate on those servers too and compounded the losses. The SEC’s enforcement order is unambiguous: no deployment procedures, no code review requirements, no incident response protocols. It was a failure of institutional comprehension: the mental model had quietly evaporated while the code kept running.</p>



<p>No AI tool wrote that code. The failure was entirely human, through entirely normal processes: engineers leaving, tests never rerun after refactors, flags reused without documentation. This is the baseline, what software organizations produce under ordinary conditions over nine years. An engineering team with modern AI tools won&#8217;t recreate this specific bug. They&#8217;ll create the conditions for the next one faster: more code that nobody fully understands, more dependencies nobody documented, more cognitive debt accumulating before anyone notices. AI removes the friction that once slowed exactly this kind of erosion.</p>



<p>None are failures of AI capability. They&#8217;re failures of judgment about where to deploy AI and how much human oversight to maintain.</p>



<h2 class="wp-block-heading"><strong>Part 2: A four-quadrant model for AI autonomy</strong></h2>



<h3 class="wp-block-heading"><strong>The quadrants</strong></h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1320" height="731" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming.png" alt="The quadrants of human involvement in programming" class="wp-image-18633" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming.png 1320w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming-300x166.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Human-involvement-in-programming-768x425.png 768w" sizes="auto, (max-width: 1320px) 100vw, 1320px" /></figure>



<p>Four quadrants emerge when both questions are asked together. Before the examples, two contrasts are worth naming because the quadrants that look most similar on the surface are the ones most often confused in practice.</p>



<p><strong>Supervised automation versus Human-led craftsmanship.</strong> Both demand high human involvement. Both feel like &#8220;be careful here.&#8221; But the difference is fundamental. In Supervised automation, the human is a safety gate. The work is a commodity; you&#8217;re there to catch errors before they escape. In Human-led craftsmanship, the human is the author. You&#8217;re building the mental model that lets the next engineer reason about this system under pressure three years from now and take it somewhere new. The code isn&#8217;t something you need to verify. It&#8217;s something you need to own. And ownership here extends beyond the individual engineer. The team writes RFCs, debates trade-offs, identifies which parts of the implementation fall into which quadrant, and makes sure the reasoning behind key decisions is shared, not siloed. Human-led craftsmanship isn&#8217;t one person writing code alone. It&#8217;s a team making sure the understanding survives the people who built it.</p>



<p><strong>Collaborative co-creation versus Human-led craftsmanship.</strong> Both involve high differentiation, and in both, the human drives the vision and owns the key decisions. But risk changes everything about how you work. In Collaborative co-creation, early iterations are recoverable. A wrong turn can be corrected before it costs you anything serious, so AI can genuinely accelerate execution. In Human-led craftsmanship, the blast radius of not understanding what you&#8217;ve built compounds over time. Wrong turns become load-bearing walls, and the architectural moves you can&#8217;t see are the ones that let competitors catch up. AI assists with scoped subtasks only. Every contribution gets interrogated.</p>



<p>In <strong>full automation</strong>, the human is a director. You define what needs to be done, AI produces the output, and you spot-check the result. The work is low-risk and low-differentiation. If something&#8217;s wrong, you fix it in the next iteration without anyone outside the team noticing. This is where AI earns its keep without qualification, and where restricting it costs you real velocity with nothing to show for it.</p>



<p>To make all four quadrants concrete, we&#8217;ll use a single feature as a lens: building AI Gateway cost controls, the system that sets token budgets per agent, enforces spending limits, tracks usage by model and agent, and handles enforcement modes when an agent exceeds its budget.</p>



<h4 class="wp-block-heading"><strong>Low risk, low differentiation: Full automation</strong></h4>



<p>API docs for cost controls. Test scaffolding for token limit scenarios. Config examples for per-agent budgets. Every platform has docs, and if there&#8217;s a mistake, you fix it in the next iteration without anyone outside the team noticing. Humans set direction and spot-check. AI writes, tests, and ships.</p>



<p><em>The test: If this is wrong, can you fix it before a customer sees it or complains? If yes, automate freely.</em></p>



<h4 class="wp-block-heading"><strong>Low risk, high differentiation: Collaborative co-creation</strong></h4>



<p>Designing the UX for the token usage dashboard. Iterating on routing rules that determine when an agent degrades to a cheaper model, halts entirely, or triggers a notification. These decisions separate a sophisticated platform from a blunt on/off switch, but early iterations are recoverable. A first version that doesn&#8217;t surface guardrail costs separately isn&#8217;t a disaster. It&#8217;s a product conversation. Humans drive the design vision and interrogate AI on trade-offs. AI accelerates execution and handles boilerplate.</p>



<p><em>The test: If you flipped the ratio (AI deciding, human rubber-stamping) would you be comfortable? If not, this requires genuine co-creation, not delegation. The human should be able to explain the trade-offs in the current design and know where to push it next.</em></p>



<h4 class="wp-block-heading"><strong>High risk, low differentiation: Supervised automation</strong></h4>



<p>Enforcement logic that halts an agent when it hits its token budget. Every cost control system needs enforcement, so this isn&#8217;t differentiating. But if it fails, agents run unconstrained and rack up unbounded LLM spend. AI can draft the logic. A human must trace every path and understand every state transition before signing off. The question before merge: Can I explain exactly what happens when an agent hits the limit mid-execution? Can I explain this behavior to Customer Success or to the customer?</p>



<p><em>The test: Could a competent engineer review this confidently without having written it? If yes, the human&#8217;s job is to verify, not to author. But the bar for verification is explanation, not approval.</em></p>



<h4 class="wp-block-heading"><strong>High risk, high differentiation: Human-led craftsmanship</strong></h4>



<p>The core token metering and attribution engine. It tracks usage per agent and per model, attributes guardrail costs separately so they don&#8217;t count against agent budgets, and provides the auditability enterprise customers need to govern AI spend. Get it wrong and customers can&#8217;t trust the numbers. Get it right and it&#8217;s a genuine competitive moat that competitors can&#8217;t replicate with the same AI tools you&#8217;re using.</p>
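<p>A toy version of the attribution idea, assuming a simple in-memory meter (the class and method names are hypothetical; a real engine would also need durability, idempotency, and audit trails):</p>

```python
from collections import defaultdict

class TokenMeter:
    """Track usage per (agent, model). Guardrail costs go to a separate
    ledger so they never count against agent budgets (illustrative sketch)."""
    def __init__(self):
        self.agent_usage = defaultdict(int)      # (agent, model) -> tokens
        self.guardrail_usage = defaultdict(int)  # model -> tokens

    def record(self, agent: str, model: str, tokens: int, guardrail: bool = False):
        if guardrail:
            self.guardrail_usage[model] += tokens
        else:
            self.agent_usage[(agent, model)] += tokens

    def agent_total(self, agent: str) -> int:
        """Budget-relevant spend for one agent, excluding guardrail overhead."""
        return sum(t for (a, _), t in self.agent_usage.items() if a == agent)
```

<p>Even in this sketch, the design decision the article describes is visible: guardrail tokens are recorded, so they remain auditable, but they are kept out of the number that enforcement reads.</p>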



<p>Human engineers own the design end-to-end. AI assists on scoped subtasks once the design is settled: drafting specific functions, generating test coverage for paths the engineer has already reasoned through. Every contribution gets interrogated. The bar is whether the engineer could explain it in an incident review without looking at the code first.</p>



<p><em>The test: If the engineer who built this left tomorrow, would the team still understand why it works the way it does? Could they make it better? If the honest answer is no, you&#8217;re accumulating the most dangerous kind of cognitive debt there is.</em></p>



<h3 class="wp-block-heading"><strong>The counterargument (it&#8217;s a good one)</strong></h3>



<p>Any engineering leader will push back here, and they&#8217;ll have good reason to.</p>



<p>The research is thin. METR&#8217;s study had 16 developers. MIT&#8217;s EEG work is a preprint that its own critics say should be interpreted conservatively.<sup data-fn="6b15093d-b639-4d29-9556-4fec26b40831" class="fn"><a href="#6b15093d-b639-4d29-9556-4fec26b40831" id="6b15093d-b639-4d29-9556-4fec26b40831-link">14</a></sup> The Anthropic comprehension study shows a quiz score gap, not a business outcome. The evidence is early-stage. Intellectual honesty requires acknowledging that.</p>



<p>But the pattern keeps showing up in unrelated fields. A Lancet study found that endoscopists who routinely used AI for polyp detection performed measurably worse when the AI was removed, with adenoma detection rates dropping from 28.4% to 22.4% in three months.<sup data-fn="e6316082-3663-4380-917a-d1c26bd39a20" class="fn"><a href="#e6316082-3663-4380-917a-d1c26bd39a20" id="e6316082-3663-4380-917a-d1c26bd39a20-link">15</a></sup> The study is observational and small. But the direction is consistent with everything else: Routine AI assistance may erode the skills it was supposed to support.</p>



<p>Most engineering work isn&#8217;t high-stakes. Studies consistently estimate that 60–80% of engineering time goes to maintenance, tests, docs, integration, and tooling, which is exactly the work that belongs in the automate quadrant regardless. Restricting AI because of the top 20% creates a real tax on the other 80%.</p>



<p>And can&#8217;t engineers develop deep ownership of AI-generated code through study and iteration? Partially. But the behavioral data tells a harder story. GitClear&#8217;s analysis of 211 million changed lines shows a decline in refactored code since AI adoption accelerated.<sup data-fn="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8" class="fn"><a href="#d3b555ec-16c0-4fb3-9d95-65aa20d28cb8" id="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8-link">16</a></sup> Engineers aren&#8217;t studying AI-generated code carefully. They&#8217;re moving on to the next feature. LLM tools can explain what code does; they can&#8217;t tell you why the system was designed the way it was.<sup data-fn="40eaef3c-8e44-451f-9046-c6e59e060e0c" class="fn"><a href="#40eaef3c-8e44-451f-9046-c6e59e060e0c" id="40eaef3c-8e44-451f-9046-c6e59e060e0c-link">17</a></sup></p>



<p>The serious pro-AI argument isn&#8217;t &#8220;use AI everywhere.&#8221; It&#8217;s more precise: The guardrails for verification and oversight are improving fast, engineers who actively interrogate AI output build understanding even from generated code, and the organizations that restrict AI on their most critical work will fall behind competitors who don&#8217;t. This is a real argument.</p>



<p>The answer isn&#8217;t to dismiss it but to sharpen what &#8220;critical work&#8221; means, and to recognize that the interrogative use of AI that the research identifies as understanding-preserving requires organizational discipline that most teams haven&#8217;t built yet. The quadrant isn&#8217;t permanent. The threshold shifts as both AI capability and human oversight practices mature. The discipline is the habit of asking both questions honestly before you start, not a fixed answer to them.</p>



<h3 class="wp-block-heading"><strong>The discipline is simple. Maintaining it isn&#8217;t.</strong></h3>



<p>The quadrant tells you where to be careful. How you engage AI once you&#8217;re there determines whether careful is enough. The difference between &#8220;write me this function&#8221; and &#8220;explain why you made this trade-off, and what breaks if the input is malformed&#8221; is the difference between borrowing intelligence and developing it. Active, interrogative AI use preserves comprehension. Passive delegation destroys it. That&#8217;s what the Anthropic study&#8217;s behavioral data shows directly.</p>



<p>Match your review process to the quadrant. AI-generated docs and test scaffolding get a spot-check. AI-generated code touching your core product logic gets the same scrutiny as a junior engineer&#8217;s first PR. The bar for approval isn&#8217;t &#8220;tests pass.&#8221; It&#8217;s &#8220;someone on this team can explain what this does, defend it under pressure, and use that understanding to make it better.&#8221; Full automation needs a spot-check. Human-led craftsmanship needs an RFC, a team review, and shared ownership of the reasoning before anyone writes a line of code.</p>
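<p>One way to make that matching explicit is a small policy table the team reviews together. The quadrant names and review bars below are illustrative, not a standard:</p>

```python
# Hypothetical team policy: minimum review bar per quadrant.
REVIEW_POLICY = {
    "low_risk_low_diff":   {"review": "spot_check",    "rfc": False},
    "low_risk_high_diff":  {"review": "design_review", "rfc": False},
    "high_risk_low_diff":  {"review": "full_trace",    "rfc": False},
    "high_risk_high_diff": {"review": "full_trace",    "rfc": True},
}

def quadrant(high_risk: bool, high_differentiation: bool) -> str:
    """Map the two independent variables to a quadrant key."""
    risk = "high" if high_risk else "low"
    diff = "high" if high_differentiation else "low"
    return f"{risk}_risk_{diff}_diff"
```

<p>Writing the policy down in a reviewable artifact is itself part of the discipline: the table forces the team to answer both questions before the work starts, not after the incident.</p>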



<p>This matters especially in real-time data and AI infrastructure, systems where the most dangerous failure modes are emergent, appearing at scale and under load in combinations the code itself doesn&#8217;t express. Recognize that the threshold will shift. As AI capability improves, what belongs in the automate quadrant expands. The discipline isn&#8217;t a fixed answer. It&#8217;s the habit of asking both questions honestly before you start. It&#8217;s a core reason Redpanda is designed for simplicity and predictability: engineers need to be able to reason about how infrastructure behaves under pressure, not discover it during an incident.<sup data-fn="01a4451c-aa8e-47a5-b2f0-648b574f0c35" class="fn"><a href="#01a4451c-aa8e-47a5-b2f0-648b574f0c35" id="01a4451c-aa8e-47a5-b2f0-648b574f0c35-link">18</a></sup></p>



<h3 class="wp-block-heading"><strong>The real competitive question</strong></h3>



<p>The companies that get this right won&#8217;t be the ones that use the most AI or the least. They&#8217;ll be the ones whose leaders have internalized that risk and differentiation are independent variables, and that cognitive debt threatens both.</p>



<p>The engineer who doesn&#8217;t know how their algorithm works is a symptom. The organization that allowed it is the cause.</p>



<p>Treat cognitive debt as only a risk problem and you end up with engineers who can&#8217;t diagnose failures they didn&#8217;t build. Treat it as only a differentiation problem and you get fragile systems that survive until the next incident. Let it accumulate on your most critical systems and you get both at once.</p>



<p>Your competitor is making this calculation right now. The question isn&#8217;t whether to use AI. It&#8217;s whether you&#8217;re being honest about which quadrant you&#8217;re in, and whether your team will know the answer when it finally matters.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Co-authored with Claude (Anthropic). Yes, we took the advice from this article.</em></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="cd51007b-bb8f-45b7-b7a2-7317b6d0cff9">Peng, S. et al. (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. <a href="https://arxiv.org/abs/2302.06590" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2302.06590</a> <a href="#cd51007b-bb8f-45b7-b7a2-7317b6d0cff9-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="7306407f-0600-4183-85fa-04f12932c6e6">Becker, J., Rush, N. et al. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR. <a href="https://arxiv.org/abs/2507.09089" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2507.09089</a> <a href="#7306407f-0600-4183-85fa-04f12932c6e6-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="b0ff63ba-e120-48fb-902d-c89fc1d80fe5">Xu, F., Medappa, P.K., Tunc, M.M., Vroegindeweij, M., &amp; Fransoo, J.C. (2025). AI-Assisted Programming May Decrease the Productivity of Experienced Developers by Increasing Maintenance Burden. Tilburg University. <a href="https://arxiv.org/abs/2510.10165" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2510.10165</a> <a href="#b0ff63ba-e120-48fb-902d-c89fc1d80fe5-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="92b75161-55d2-410a-9da7-0faaf234bae3">Kosmyna, N. et al. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. MIT Media Lab. 
<a href="https://arxiv.org/abs/2506.08872" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.08872</a> <em>(preprint, not yet peer-reviewed)</em> <a href="#92b75161-55d2-410a-9da7-0faaf234bae3-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="edc18347-043c-4398-9652-16b4d7a8c464">Shen, J.H. &amp; Tamkin, A. (2026). How AI Impacts Skill Formation. Anthropic Safety Fellows Program. <a href="https://arxiv.org/abs/2601.20245" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.20245</a> <a href="#edc18347-043c-4398-9652-16b4d7a8c464-link" aria-label="Jump to footnote reference 5"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="36e33054-aa72-4221-9c5d-bd189d716cac">The generation effect: Rosner, Z.A. et al. (2012). The Generation Effect: Activating Broad Neural Circuits During Memory Encoding. Cortex. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3556209/" target="_blank" rel="noreferrer noopener">https://pmc.ncbi.nlm.nih.gov/articles/PMC3556209/</a> and Bertsch, S. et al. (2007). The generation effect: A meta-analytic review. Memory &amp; Cognition. <a href="https://link.springer.com/article/10.3758/BF03193441" target="_blank" rel="noreferrer noopener">https://link.springer.com/article/10.3758/BF03193441</a> and Naur, P. (1985). Programming as Theory Building. Microprocessing and Microprogramming. 
<a href="https://pages.cs.wisc.edu/~remzi/Naur.pdf" target="_blank" rel="noreferrer noopener">https://pages.cs.wisc.edu/~remzi/Naur.pdf</a> <a href="#36e33054-aa72-4221-9c5d-bd189d716cac-link" aria-label="Jump to footnote reference 6"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="56b259aa-e401-4c2c-9ecd-99ade708cc29">OX Security. (October 2025). Army of Juniors: The AI Code Security Crisis. <a href="https://www.helpnetsecurity.com/2025/10/27/ai-code-security-risks-report/" target="_blank" rel="noreferrer noopener">https://www.helpnetsecurity.com/2025/10/27/ai-code-security-risks-report/</a> <a href="#56b259aa-e401-4c2c-9ecd-99ade708cc29-link" aria-label="Jump to footnote reference 7"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="eaf5792c-72c2-4cb6-94c5-d8a85a54af7c">CodeRabbit. (December 2025). State of AI vs Human Code Generation Report. <a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" target="_blank" rel="noreferrer noopener">https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report</a>. <em>Note: CodeRabbit produces AI code review tooling; findings should be read in that context.</em> <a href="#eaf5792c-72c2-4cb6-94c5-d8a85a54af7c-link" aria-label="Jump to footnote reference 8"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="0f193fe2-b405-4b1d-85fa-b745e853969c">Apiiro. (September 2025). 4x Velocity, 10x Vulnerabilities: AI Coding Assistants Are Shipping More Risks. 
<a href="https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/" target="_blank" rel="noreferrer noopener">https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/</a>. <em>Note: Apiiro produces application security tooling; findings should be read in that context.</em> <a href="#0f193fe2-b405-4b1d-85fa-b745e853969c-link" aria-label="Jump to footnote reference 9"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="076102b1-56ac-4820-99f2-71d4ef08dc32">Joel, S., Wu, J.J., &amp; Fard, F.H. (2024). A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages. ACM TOSEM. <a href="https://arxiv.org/abs/2410.03981" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2410.03981</a>. See also: Hua, et al. (2025). ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code. <a href="https://arxiv.org/abs/2506.02314" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2506.02314</a> <a href="#076102b1-56ac-4820-99f2-71d4ef08dc32-link" aria-label="Jump to footnote reference 10"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ce2fbf20-b309-4e68-afe8-6fe10b3ac7be">Liu, Q., Zhou, Y., Huang, J., &amp; Li, G. (2024). When ChatGPT is Gone: Creativity Reverts and Homogeneity Persists. <a href="https://arxiv.org/abs/2401.06816" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2401.06816</a> <a href="#ce2fbf20-b309-4e68-afe8-6fe10b3ac7be-link" aria-label="Jump to footnote reference 11"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="61637363-b0a5-44f9-939c-e2f4b2a8fb1f">Fortune. (July 2025). 
AI-Powered Coding Tool Wiped Out a Software Company&#8217;s Database in &#8216;Catastrophic Failure.&#8217; <a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/" target="_blank" rel="noreferrer noopener">https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/</a> <a href="#61637363-b0a5-44f9-939c-e2f4b2a8fb1f-link" aria-label="Jump to footnote reference 12"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="30954c70-1f77-4ef5-9a37-c392499f0821">Knight Capital Group. SEC Administrative Proceeding, Release No. 70694 (October 16, 2013). <a href="https://www.sec.gov/litigation/admin/2013/34-70694.pdf" target="_blank" rel="noreferrer noopener">https://www.sec.gov/litigation/admin/2013/34-70694.pdf</a>. Levine, M. (2013). Knight Capital&#8217;s $440 Million Compliance Disaster. Bloomberg. <a href="https://www.bloomberg.com/opinion/articles/2013-10-17/knight-capital-s-440-million-compliance-disaster" target="_blank" rel="noreferrer noopener">https://www.bloomberg.com/opinion/articles/2013-10-17/knight-capital-s-440-million-compliance-disaster</a> <a href="#30954c70-1f77-4ef5-9a37-c392499f0821-link" aria-label="Jump to footnote reference 13"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="6b15093d-b639-4d29-9556-4fec26b40831">Stankovic, M. et al. (2025). Comment on: Your Brain on ChatGPT. 
<a href="https://arxiv.org/abs/2601.00856" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2601.00856</a> <a href="#6b15093d-b639-4d29-9556-4fec26b40831-link" aria-label="Jump to footnote reference 14"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="e6316082-3663-4380-917a-d1c26bd39a20">Budzyń, K., Romańczyk, M. et al. (2025). Endoscopist Deskilling Risk After Exposure to Artificial Intelligence in Colonoscopy: A Multicentre, Observational Study. Lancet Gastroenterol Hepatol. 10(10):896-903. <a href="https://doi.org/10.1016/S2468-1253(25)00133-5" target="_blank" rel="noreferrer noopener">https://doi.org/10.1016/S2468-1253(25)00133-5</a> <a href="#e6316082-3663-4380-917a-d1c26bd39a20-link" aria-label="Jump to footnote reference 15"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="d3b555ec-16c0-4fb3-9d95-65aa20d28cb8">Harding, W. (2025). AI Copilot Code Quality: Evaluating 2024&#8217;s Increased Defect Rate via Code Quality Metrics. GitClear. <a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" target="_blank" rel="noreferrer noopener">https://www.gitclear.com/ai_assistant_code_quality_2025_research</a> <a href="#d3b555ec-16c0-4fb3-9d95-65aa20d28cb8-link" aria-label="Jump to footnote reference 16"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="40eaef3c-8e44-451f-9046-c6e59e060e0c">Zhou, X., Li, R., Liang, P., Zhang, B., Shahin, M., Li, Z., &amp; Yang, C. (2025). Using LLMs in Generating Design Rationale for Software Architecture Decisions. ACM TOSEM. https://arxiv.org/abs/2504.20781. See also: Tang, N., Chen, M., Ning, Z., Bansal, A., Huang, Y., McMillan, C., &amp; Li, T.J.-J. (2024). 
A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions. IEEE VL/HCC 2024. <a href="https://arxiv.org/abs/2405.16081" target="_blank" rel="noreferrer noopener">https://arxiv.org/abs/2405.16081</a> <a href="#40eaef3c-8e44-451f-9046-c6e59e060e0c-link" aria-label="Jump to footnote reference 17"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="01a4451c-aa8e-47a5-b2f0-648b574f0c35">Gallego, A. (2025). Introducing the Agentic Data Plane. Redpanda. <a href="https://www.redpanda.com/blog/agentic-data-plane-adp" target="_blank" rel="noreferrer noopener">https://www.redpanda.com/blog/agentic-data-plane-adp</a>. Crosier, K. (2026). How to Safely Deploy Agentic AI in the Enterprise. Redpanda. https://www.redpanda.com/blog/deploy-agentic-ai-safely-enterprise <a href="#01a4451c-aa8e-47a5-b2f0-648b574f0c35-link" aria-label="Jump to footnote reference 18"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>


]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/dont-automate-your-moat-matching-ai-autonomy-to-risk-and-competitive-stakes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>When Correct Systems Produce the Wrong Outcomes</title>
		<link>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/</link>
				<comments>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/#respond</comments>
				<pubDate>Tue, 28 Apr 2026 11:12:58 +0000</pubDate>
					<dc:creator><![CDATA[Varun Raj]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18613</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/When-correct-systems-produce-the-wrong-outcomes.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/When-correct-systems-produce-the-wrong-outcomes-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Why autonomous AI systems drift and what it reveals about the limits of observability]]></custom:subtitle>
		
				<description><![CDATA[We tend to assume that if every part of a system behaves correctly, the system itself will behave correctly. That assumption is deeply embedded in how we design, test, and operate software. If a service returns valid responses, if dependencies are reachable, and if constraints are satisfied, then the system is considered healthy. Even in [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>We tend to assume that if every part of a system behaves correctly, the system itself will behave correctly. That assumption is deeply embedded in how we design, test, and operate software. If a service returns valid responses, if dependencies are reachable, and if constraints are satisfied, then the system is considered healthy. Even in distributed systems, where failure modes are more complex, correctness is still tied to the behavior of individual components. In modern AI systems, particularly those combining retrieval, reasoning, and tool invocation, this assumption is increasingly stressed under continuous operation.</p>



<p>This model works because most systems are built around discrete operations. A request arrives, the system processes it, and a result is returned. Each interaction is bounded, and correctness can be evaluated locally. But that assumption begins to break down in systems that operate continuously. In these systems, behavior is not the result of a single request. It emerges from a sequence of decisions that unfold over time. Each decision may be reasonable in isolation. The system may satisfy every local condition we know how to measure. And yet, when viewed as a whole, the outcome can be wrong.</p>



<p>One way to think about this is as a form of behavioral drift: systems that remain operational but gradually diverge from their intended trajectory. Nothing crashes. No alerts fire. The system continues to function. And still, something has gone off course.</p>



<h2 class="wp-block-heading"><strong>The composability problem</strong></h2>



<p>The root of the issue is not that components are failing. It is that correctness no longer composes cleanly. In traditional systems, we rely on a simple intuition: If each part is correct, then the system composed of those parts will also be correct. This intuition holds when interactions are limited and well-defined.</p>



<p>In autonomous systems, that intuition becomes unreliable. Consider a system that retrieves information, reasons over it, and takes action. Each step in that process can be implemented correctly. Retrieval returns relevant data. The reasoning step produces plausible conclusions. The action is executed successfully. But correctness at each step does not guarantee correctness of the sequence.</p>



<p>The system might retrieve information that is contextually valid but incomplete or misaligned with the current task. The reasoning step might interpret it in a way that is locally consistent but globally misleading. The action might reinforce that interpretation by feeding it back into the system’s context. Each step is valid. The trajectory is not. This is what behavioral drift looks like in practice: locally correct decisions producing globally misaligned outcomes.</p>
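<p>A toy simulation makes the failure mode concrete. Assume each decision nudges the system&#8217;s state by a small amount, every nudge passes a local tolerance check, and a tiny systematic bias is present (all numbers here are invented for illustration):</p>

```python
import random

def locally_valid(delta: float, tolerance: float = 0.05) -> bool:
    """Local check: is this single decision within tolerance?"""
    return abs(delta) <= tolerance

def simulate(steps: int = 200, bias: float = 0.01, seed: int = 0) -> float:
    """Every step passes the local check, yet a small systematic bias
    compounds into a large global deviation: drift without any invalid step."""
    rng = random.Random(seed)
    position = 0.0
    for _ in range(steps):
        delta = rng.uniform(-0.03, 0.03) + bias  # always within the ±0.05 tolerance
        assert locally_valid(delta)              # no step ever fails validation
        position += delta
    return position
```

<p>After a few hundred steps the accumulated position is far from zero even though no individual decision ever looked wrong, which is the composability failure in miniature.</p>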



<p>In these systems, correctness is no longer a property of individual steps. It is a property of how those steps interact over time. This breakdown is subtle but fundamental. It means that testing individual components, even exhaustively, does not guarantee that the system will behave correctly when those components are composed into a continuously operating whole.</p>



<h2 class="wp-block-heading"><strong>Behavior emerges over time</strong></h2>



<p>To understand why this happens, it helps to look at where behavior actually comes from. In many modern AI systems, behavior is not encoded directly in a single component. It emerges from interaction:</p>



<ul class="wp-block-list">
<li>Models generate outputs based on context</li>



<li>Retrieval systems shape that context</li>



<li>Planners sequence actions based on those outputs</li>



<li>Execution layers apply those actions to external systems</li>



<li>Feedback loops update the system’s state</li>
</ul>



<p>Each of these elements operates with partial information. Each contributes to the next state of the system. The system evolves as these interactions accumulate. This pattern is especially visible in LLM-based and agentic AI systems, where context assembly, reasoning, and action selection are dynamically coupled. Under these conditions, behavior is dynamic and path dependent. Small differences early in a sequence can lead to large differences later on. A slightly suboptimal decision, repeated or combined with others, can push the system further away from its intended trajectory.</p>



<p>This is why behavior cannot be fully specified ahead of time. It is not simply implemented; it is produced. And because it is produced over time, it can also drift over time.</p>



<h2 class="wp-block-heading"><strong>Observability without alignment</strong></h2>



<p>Modern observability systems are very good at telling us what a system is doing. We can measure latency, throughput, and resource utilization. We can trace requests across services. We can inspect logs, metrics, and traces in near real time. In many cases, we can reconstruct exactly how a particular outcome was produced. These signals are essential. They allow us to detect failures that disrupt execution. But they are tied to a particular model of correctness. They assume that if execution proceeds without errors and if performance remains within acceptable bounds, then the system is behaving as expected.</p>



<p>In systems exhibiting behavioral drift, that assumption no longer holds. A system can process requests efficiently while producing outputs that are progressively less aligned with its intended purpose. It can meet all its service-level objectives while still moving in the wrong direction. Observability captures activity. It does not capture alignment.</p>



<p>This distinction becomes more important as systems become more autonomous. In AI-driven systems, particularly those operating as long-lived agents, the gap between activity and alignment becomes operationally significant. The question is no longer just whether the system is working. It is whether it is still doing the right thing. This gap is where many modern systems begin to fail without appearing to fail.</p>



<h2 class="wp-block-heading"><strong>The limits of step-level validation</strong></h2>



<p>A natural response to this problem is to add more validation. We can introduce checks at each stage:</p>



<ul class="wp-block-list">
<li>Validate retrieved data.</li>



<li>Apply policy checks to model outputs.</li>



<li>Enforce constraints before executing actions.</li>
</ul>



<p>These mechanisms improve local correctness. They reduce the likelihood of obviously incorrect decisions. But they operate at the level of individual steps.</p>



<p>They answer questions like:</p>



<ul class="wp-block-list">
<li>Is this output acceptable?</li>



<li>Is this action allowed?</li>



<li>Does this input meet requirements?</li>
</ul>



<p>They do not answer:</p>



<ul class="wp-block-list">
<li>Does this sequence of decisions still make sense as a whole?</li>
</ul>



<p>A system can pass every validation check and still drift. Behavioral drift is not caused by invalid steps. It is caused by valid steps interacting in ways we did not anticipate. Increasing validation does not eliminate this problem. It only shifts where it appears, often pushing it further downstream, where it becomes harder to detect and correct.</p>
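<p>The gap between step-level and sequence-level checks can be shown in a few lines. The policy rules below are invented for illustration: each action is individually permissible, but the sequence violates a constraint that no per-step validator can see:</p>

```python
def validate_step(action: dict) -> bool:
    """Step-level check: is this single action allowed? (illustrative policy)"""
    return action["cost"] <= 10 and action["type"] in {"read", "write"}

def validate_trajectory(actions: list) -> bool:
    """Sequence-level check: does the whole plan still make sense?
    Here: total cost is bounded and writes are capped (illustrative policy)."""
    total = sum(a["cost"] for a in actions)
    writes = sum(1 for a in actions if a["type"] == "write")
    return total <= 50 and writes <= 3

# Six writes at cost 9: every step passes, the trajectory does not.
actions = [{"type": "write", "cost": 9} for _ in range(6)]
```

<p>The second function is the kind of check most systems do not have: it evaluates the shape of the sequence, not the legality of any one step.</p>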



<h2 class="wp-block-heading"><strong>Coordination becomes the system</strong></h2>



<p>If correctness does not compose automatically, then what determines system behavior? Increasingly, the answer is coordination. In traditional distributed systems, coordination refers to managing shared state, ensuring consistency, ordering operations, and handling concurrency. In autonomous systems, coordination extends to decisions.</p>



<p>The system must coordinate:</p>



<ul class="wp-block-list">
<li>Which information is used</li>



<li>How that information is interpreted</li>



<li>What actions are taken</li>



<li>How those actions influence future decisions</li>
</ul>



<p>This coordination is not centralized. It is distributed across models, planners, tools, and feedback loops. In agentic AI architectures, this coordination spans model inference, retrieval pipelines, and external system interactions. The system’s behavior is not defined by any single component. It emerges from the interaction between them.</p>



<p>In this sense, the system is no longer just the sum of its parts. The system is the coordination itself. Failures arise not from broken components but from the dynamics of interaction: timing, sequencing, feedback, and context. This also explains why small inconsistencies can propagate and amplify. A slight mismatch in one part of the system can cascade through subsequent decisions, shaping the trajectory in ways that are difficult to anticipate or reverse.</p>



<h2 class="wp-block-heading"><strong>Control planes introduce structure, not assurance</strong></h2>



<p>One response to this complexity is to introduce more structure. Control planes, policy engines, and governance layers provide mechanisms to enforce constraints at key decision points. They can validate inputs, restrict actions, and ensure that certain conditions are met before execution proceeds. This is an important step. Without some form of structure, it becomes difficult to reason about system behavior at all. But structure alone is not sufficient.</p>



<p>Most control mechanisms operate at entry points. They evaluate decisions at the moment they are made. They determine whether a particular action should be allowed, whether a policy is satisfied, and whether a request can proceed. The problem is that many of the failures in autonomous systems do not originate at these entry points. They emerge during execution, as sequences of individually valid decisions interact in unexpected ways. A control plane can ensure that each step is permissible. It cannot guarantee that the sequence of steps will produce the intended outcome. This distinction is subtle but important: control provides structure, but not assurance.</p>



<h2 class="wp-block-heading"><strong>From events to trajectories</strong></h2>



<p>Traditional monitoring focuses on events. A request is processed. A response is returned. An error occurs. Each event is evaluated independently. In systems exhibiting behavioral drift, behavior is better understood as a trajectory. A trajectory is a sequence of states connected by decisions. It captures how the system evolves over time. Two trajectories can consist of individually valid steps and still produce very different outcomes. One remains aligned. The other drifts. This represents a shift from failure as an event to failure as a trajectory, a distinction that traditional system models are not designed to capture.</p>
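

<p>The distinction can be made concrete with a small sketch. In this hypothetical Python example (the types, field names, and scores are invented for illustration), each step is individually acceptable, but what gets evaluated is the shape of the whole path rather than any single event:</p>

```python
# A minimal sketch (hypothetical types) of behavior as a trajectory:
# a sequence of states connected by decisions, evaluated as a whole.
from dataclasses import dataclass

@dataclass
class Step:
    state: dict      # snapshot of system context
    decision: str    # action chosen in that state
    outcome: float   # observed alignment score for the step, in [0, 1]

def trajectory_alignment(steps: list[Step]) -> float:
    """Evaluate the shape of the whole path, not individual events."""
    if not steps:
        return 1.0
    return sum(s.outcome for s in steps) / len(steps)

path = [Step({"q": 1}, "retrieve", 0.9),
        Step({"q": 1}, "summarize", 0.8),
        Step({"q": 1}, "answer", 0.4)]   # each step valid, but trending down

print(round(trajectory_alignment(path), 2))  # 0.7
```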



<p>Correctness is no longer about individual events. It is about the shape of the trajectory. This shift has implications not just for how we monitor systems, but for how we design them in the first place.</p>



<h2 class="wp-block-heading"><strong>Detecting drift and responding in motion</strong></h2>



<p>If failure manifests as drift, then detecting it requires a different set of signals. Instead of looking for errors, we need to look for patterns:</p>



<ul class="wp-block-list">
<li>Changes in how similar situations are handled</li>



<li>Increasing variability in decision sequences</li>



<li>Divergence between expected and observed outcomes</li>



<li>Instability in response patterns</li>
</ul>



<p>These signals are not binary. They do not indicate that something is broken. They indicate that something is changing. The challenge is that change is not always failure. Systems are expected to adapt. Models evolve. Data shifts. The question is not whether the system is changing. It is whether the change remains aligned with intent. This requires a different kind of visibility, one that focuses on behavior over time rather than isolated events. Once drift is identified, the system needs a way to respond. Traditional responses (restart, rollback, stop) assume failure is discrete and localized. Behavioral drift is neither.</p>
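

<p>A drift signal of this kind can be sketched in a few lines of Python. This hypothetical example (the outcome scores are invented for illustration) compares a recent window of behavior against a baseline window and reports graded measures rather than a pass/fail verdict:</p>

```python
# Hypothetical sketch: drift as a graded signal, not a binary error.
# Compares a recent window of outcome scores against a baseline window,
# assuming scores in [0, 1] where higher means better aligned.
from statistics import mean, pstdev

def drift_signal(baseline: list[float], recent: list[float]) -> dict:
    shift = abs(mean(recent) - mean(baseline))        # divergence from expected
    variability = pstdev(recent) - pstdev(baseline)   # growing instability
    return {"shift": round(shift, 3), "variability": round(variability, 3)}

baseline = [0.90, 0.88, 0.91, 0.89, 0.90]
recent   = [0.85, 0.78, 0.88, 0.70, 0.74]  # no step "fails", but behavior is changing

print(drift_signal(baseline, recent))
```

<p>Neither number says "broken." Both say "changing," and both invite a judgment about whether the change is still aligned with intent.</p>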



<p>What is needed is the ability to influence behavior while the system continues to operate. This might involve constraining action space, adjusting decision selection, introducing targeted validation, or steering the system toward more stable trajectories. These are not binary interventions. They are continuous adjustments.</p>



<h2 class="wp-block-heading"><strong>Control as a continuous process</strong></h2>



<p>This perspective aligns with how control is handled in other domains. In control systems engineering, behavior is managed through feedback loops. The system is continuously monitored, and adjustments are made to keep it within desired bounds. Control is no longer just a gate. It becomes a continuous process that shapes behavior over time.</p>
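

<p>Such a loop can be sketched in a few lines of Python. In this hypothetical example (the "scrutiny" knob and the alignment scores are invented for illustration), proportional feedback routes a larger share of the agent's actions to targeted validation as an observed alignment metric falls below target:</p>

```python
# A minimal feedback-loop sketch (hypothetical metric and knob): control as a
# continuous process that keeps behavior within bounds, not a one-time gate.
# "scrutiny" is the fraction of agent actions routed to targeted validation.

def adjust_scrutiny(observed: float, target: float, scrutiny: float,
                    gain: float = 0.5) -> float:
    """Proportional control: raise scrutiny as alignment falls below target."""
    error = observed - target            # negative when behavior drifts low
    return max(0.0, min(1.0, scrutiny - gain * error))

scrutiny = 0.2                           # start with light validation
target = 0.9                             # desired alignment score
for observed in [0.85, 0.80, 0.75]:      # gradually drifting metric
    scrutiny = adjust_scrutiny(observed, target, scrutiny)
    print(round(scrutiny, 3))            # scrutiny rises as drift grows
```

<p>The intervention is continuous and proportional: no restart, no rollback, just a steady nudge back toward the desired bounds.</p>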



<p>This leads to a different definition of reliability. A system can be available, responsive, and internally consistent—and still fail if its behavior drifts away from its intended purpose. Reliability becomes a question of alignment over time: whether the system remains within acceptable bounds and continues to behave in ways consistent with its goals.</p>



<h2 class="wp-block-heading"><strong>What this means for system design</strong></h2>



<p>If behavior is trajectory-based, then system design must reflect that. We need to monitor patterns, understand interactions, treat behavior as dynamic, and provide mechanisms to influence trajectories. We are very good at detecting failure as breakage. We are much less equipped to detect failure as drift. Behavioral drift accumulates gradually, often becoming visible only after significant misalignment has already occurred.</p>



<p>As systems become more autonomous, this gap will become more visible. The hardest problems will not be systems that fail loudly, but systems that continue working while gradually moving in the wrong direction. The question is no longer just how to build systems that work. It is how to build systems that continue to work for the reasons we intended.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/when-correct-systems-produce-the-wrong-outcomes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Show Your Work: The Case for Radical AI Transparency</title>
		<link>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/</link>
				<comments>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/#respond</comments>
				<pubDate>Mon, 27 Apr 2026 11:16:33 +0000</pubDate>
					<dc:creator><![CDATA[Kord Davis and Claude]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18610</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Show-your-work.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Show-your-work-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[A colleague told me something recently that I keep thinking about. She said, unprompted, that she appreciated seeing both sides of my AI conversations. Not just the output. The full thread. My prompts, the AI&#8217;s responses, the back and forth, the dead ends, the iterations. She said it made her trust me more. This piece [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>A colleague told me something recently that I keep thinking about.</p>



<p>She said, unprompted, that she appreciated seeing both sides of my AI conversations. Not just the output. The full thread. My prompts, the AI&#8217;s responses, the back and forth, the dead ends, the iterations. She said it made her trust me more.</p>



<p>This piece is an example of that. The conversation that produced it exists. A raw transcript would be longer, messier, and significantly less useful than what you&#8217;re reading now. What you&#8217;re reading is the annotated version, the part where judgment entered the artifact. That&#8217;s not a disclaimer. That&#8217;s the argument.</p>



<p>I&#8217;ve been transparent about using AI in my work from the start. Partly because I wrote a book on data ethics and hiding it felt wrong. Partly because I&#8217;ve spent 25 years watching technology adoption go sideways when the human dimension gets treated as an afterthought. But her comment made me realize something more specific was happening when I showed the conversation rather than just the output.</p>



<p>It&#8217;s worth unpacking why.</p>



<h2 class="wp-block-heading">An old problem, a new incarnation</h2>



<p>In the 1990s, Harvard Business School professor Dorothy Leonard introduced the concept of &#8220;deep smarts&#8221; in her book <em>Wellsprings of Knowledge</em>: the experience-based expertise that accumulates over decades of practice, the kind of judgment that lives in people&#8217;s heads and doesn&#8217;t reduce to documentation. She also introduced a companion concept that has stayed with me: core competency as core rigidity. The very depth that makes expertise valuable also makes it hardest to transfer. Experts often can&#8217;t fully articulate what they know because they&#8217;ve stopped experiencing it as knowledge. They experience it as just seeing clearly.</p>



<p>Leonard&#8217;s work was about organizational knowledge transfer: how companies preserve institutional wisdom when experienced people retire or leave. That&#8217;s been a challenge since the first consultant ever billed an hour. What&#8217;s different right now is that the tools to actually solve it have arrived simultaneously with the largest demographic wave of executive retirement in American history.</p>



<p>What&#8217;s interesting about this particular moment is that the same dynamic is now showing up at the individual level in how practitioners interact with AI. The tacit knowledge at stake isn&#8217;t a retiring VP&#8217;s intuition. It&#8217;s your own judgment, your own expertise, your own hard-won understanding of what a project or organization actually needs. And the question isn&#8217;t how to transfer it before you walk out the door. It&#8217;s whether you can see it clearly enough to know when the AI is substituting for it.</p>



<h2 class="wp-block-heading">The instinct gets it backwards</h2>



<p>The natural impulse is to clean up the AI interaction before sharing anything with a collaborator, a team, or a stakeholder. Show the polished output, not the messy process. You don&#8217;t want them thinking you just handed your work to a machine.</p>



<p>That instinct produces a disingenuous outcome.</p>



<p>When you hide the process, the people you&#8217;re working with have no way to evaluate how the work was made, what judgment calls went into it, or where your expertise ended and the AI&#8217;s pattern-matching began. You&#8217;ve made the process invisible. And invisible AI processes erode trust, slowly and quietly, over time.</p>



<p>The instinct to hide is also, if we&#8217;re honest, a little defensive. It assumes the people in the room can&#8217;t tell the difference between AI output and practitioner judgment. Most of them can. And the ones who can&#8217;t yet will figure it out. Hiding the seams doesn&#8217;t make the work more credible. It just defers the reckoning.</p>



<h2 class="wp-block-heading">The deeper problem: It&#8217;s not just about appearances</h2>



<p>Here&#8217;s what took me longer to see.</p>



<p>Hiding the process doesn&#8217;t just affect how others perceive you. It erodes your own clarity about where your expertise is actually operating.</p>



<p>To understand why, it helps to be precise about what AI actually is. AI is a pattern matcher, a deeply sophisticated one, trained on more human-generated content than any single person could read in a thousand lifetimes. That&#8217;s its power (core competency) and its limitation (core rigidity) simultaneously, and the two are inseparable. The very scale that makes it extraordinary is also the boundary that defines what it cannot do. It is extraordinarily good at producing the most likely next thing given what came before. What it cannot do is know what you actually need, when the obvious answer is the wrong one, or when the stated goal isn&#8217;t the real goal. It has no judgment about context, relationship, or organizational reality. It has patterns. Incomprehensibly vast ones. But patterns.</p>



<p>That distinction matters because of what happens when you stop paying attention to it.</p>



<p>I&#8217;ve watched it happen in my own work. You share a draft with someone and they&#8217;re impressed. They quote a formulation back at you, something that sounds sharp and considered. And you realize, tracing it back, that the formulation came from the AI. Not because the AI invented it, but because you said something rougher and less precise earlier in the conversation, and the AI reflected it back in cleaner language. The idea was yours. The AI gave it a polish you then forgot to account for. The person quoting it back thought they were seeing your judgment. They were seeing your thinking laundered through a pattern matcher and returned to you at higher resolution.</p>



<p>That&#8217;s the subtler version of the problem. Not that AI invents things. It&#8217;s that it can reflect your own thinking back with more confidence and clarity than you put in, and that gap is easy to mistake for the AI contributing something it didn&#8217;t.</p>



<p>When you route everything through a polished output layer, you stop noticing the moments where you pushed back, redirected, rejected the first three versions, reframed the question entirely. Those moments are where your judgment lives. They&#8217;re the difference between using AI and being used by it. It&#8217;s Leonard&#8217;s core rigidity problem, applied inward: The very fluency that makes AI feel useful can make your own expertise invisible to you.</p>



<p>When the process stays hidden, the knowledge stays local and static. When it&#8217;s visible, it becomes something you and the people around you can actually work with and build on. The reason transparency benefits your audience is the same reason it benefits you: It keeps the scope of your judgment visible and therefore expandable. That&#8217;s not just an ethical argument. That&#8217;s the amplification mechanism.</p>



<p>Which is also what makes the upside real rather than consoling. When you stay in the process rather than just collecting outputs, work that would have taken days now takes hours. Your thinking gets sharper because you have to articulate it precisely enough for the AI to be useful. The people developing fastest right now aren&#8217;t the ones offloading the most. They&#8217;re the ones using AI as a thinking partner and staying in the conversation.</p>



<p>Here&#8217;s the paradox at the center of it: The more clearly you see the AI as a pattern matcher, the more human you have to be in working with it. The more human you are, the more useful the output. The tool doesn&#8217;t replace the practitioner. It reveals them.</p>



<p>Transparency isn&#8217;t just an ethical practice. It&#8217;s a cognitive one.</p>



<h2 class="wp-block-heading">Radical AI transparency in practice</h2>



<p>I&#8217;ve started calling this radical AI transparency. Not a policy, not a compliance framework, not a disclosure checkbox. A practice. Something you can actually do Monday morning.</p>



<p>Here&#8217;s how it shows up concretely:</p>



<h3 class="wp-block-heading"><em>Have the conversation before you need to.</em></h3>



<p>Before you&#8217;re deep in a project or collaboration, surface how you use AI and genuinely explore how others do. Not as a disclosure (&#8220;I want you to know I use AI tools&#8221;) but as a real exchange. What are you using? What do you trust it for? Where are you still skeptical? The comfort level and sophistication in the room will vary more than you expect, and knowing that before you&#8217;re mid-deliverable matters.</p>



<p>This is also how you build the psychological foundation for showing your work later. If the people you&#8217;re working with have never heard you talk about AI before and you suddenly share a full chat thread, it lands differently than if you&#8217;ve already had the conversation.</p>



<h3 class="wp-block-heading"><em>Track the full threads.</em></h3>



<p>This is partly an orchestration problem and I won&#8217;t pretend otherwise. There&#8217;s cutting and pasting involved. The tools haven&#8217;t caught up to the practice yet, which is itself worth naming honestly when the topic comes up.</p>



<p>A few approaches that help: a running document per project where you paste key threads as they happen (not retroactively, you&#8217;ll never do it retroactively), dated and labeled by what you were working on. Claude and most other major AI tools now offer conversation export, which produces a complete record you can archive. The low-tech version, a single shared document per engagement, is underrated for its simplicity.</p>



<p>The reason to do this isn&#8217;t just for sharing. It&#8217;s for your own reference. Being able to go back and see what you asked, what the AI produced, what you changed and why, builds a record of your judgment over time. That record is professionally valuable in ways that are hard to anticipate until you have it.</p>



<h3 class="wp-block-heading"><em>Annotate before you share.</em></h3>



<p>Not every thread is self-explanatory to someone who wasn&#8217;t in it. Context is everything, and raw transcripts without context are a lot to ask anyone to parse.</p>



<p>A sentence or two before the thread begins. A note at the moment where the direction changed. A brief flag on what you rejected and why. This is where your voice enters the artifact, and it transforms a raw AI exchange into a demonstration of judgment. The annotation is the work. It&#8217;s where you show what you saw that the AI didn&#8217;t, what you knew that the prompt couldn&#8217;t capture, and what made the third version better than the first two.</p>



<p>This is also where the most useful material for future reference lives. Annotations are the deep smarts layer on top of the raw exchange. They&#8217;re what makes a conversation a record.</p>



<h3 class="wp-block-heading"><em>Be real about the errors.</em></h3>



<p>AI makes mistakes. It conflates, confabulates, and hallucinates. It gives you the confident wrong answer with the same tone as the confident right one. It misses context that any competent person in the room would have caught.</p>



<p>These aren&#8217;t bugs to apologize for or hide. They&#8217;re the clearest window into what the tool actually is. AI makes mistakes in a specifically human way because it was trained on human output. Think of it as rubber duck debugging at professional scale. The AI is a duck that talks back, which is useful and occasionally misleading, which is exactly why you have to stay in the room. When you&#8217;re transparent about the errors, and even a little good-humored about them, you&#8217;re teaching the people around you something true about the technology. That&#8217;s more useful than pretending it&#8217;s a black box that either works or doesn&#8217;t.</p>



<p>The people who build the most durable trust around AI are usually the ones most comfortable saying: &#8220;The first version of this was wrong and here&#8217;s how I caught it.&#8221;</p>



<h2 class="wp-block-heading">The bigger picture</h2>



<p>What I&#8217;ve described so far is an individual practice. But the same principles scale.</p>



<p>Teams and organizations adopting AI face a version of the same problem. The impulse to treat AI outputs as authoritative, to make the process invisible to colleagues and stakeholders, to optimize for the appearance of capability rather than its actual development, produces the same trust erosion. Just at greater scale and with less ability to course-correct.</p>



<p>The teams that will navigate AI adoption well are the ones that treat transparency not as a risk to manage but as a methodology. Where the process of building with AI, including the corrections, the overrides, the moments where human judgment superseded the model, is part of how the organization learns what it actually believes and values. That&#8217;s Leonard&#8217;s knowledge transfer problem at institutional scale, and the practitioners who understand both dimensions will be the ones leading those conversations.</p>



<p>That&#8217;s a much larger conversation. But it starts with the same Monday morning practice.</p>



<p>Show the conversation. Not just the output.</p>



<h2 class="wp-block-heading">What you&#8217;re actually demonstrating</h2>



<p>When you show your AI conversations, you&#8217;re not demonstrating that you needed help.</p>



<p>You&#8217;re demonstrating that you understand what you&#8217;re working with. AI is a pattern matcher, trained on more human-generated content than any single person could read in a thousand lifetimes. What it cannot do is know what you need. That requires judgment, context, relationship, and the kind of hard-won expertise that doesn&#8217;t reduce to pattern matching, no matter how good the patterns are.</p>



<p>You&#8217;re demonstrating that you know the difference between the pattern and the judgment. That you were present enough in the process to know when to push back, when to redirect, when to throw out the output entirely and start over. That you understand, precisely, what the tool can and cannot do, and that you stayed in the room to do the part it can&#8217;t.</p>



<p>That&#8217;s a meaningful professional signal. It says: “I am not confused about what AI is. I am not outsourcing my judgment. I am using a very powerful pattern matcher as a thinking partner, and I know which one of us is doing which job.”</p>



<p>That&#8217;s the work. That&#8217;s always been the work.</p>



<p>The tool just makes it visible now. That&#8217;s not a threat. That&#8217;s an opportunity.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Claude is a large language model developed by Anthropic. Despite having read more human-generated content than any person could consume in a thousand lifetimes, it still required significant editorial direction, at least three rejected drafts, and occasional reminders about em-dashes. The full conversation transcript is available upon request. It is longer, messier, and significantly less useful than what you just read. Which was rather the point.</em></p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/show-your-work-the-case-for-radical-ai-transparency/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Emergency Pedagogical Design: How Programming Instructors Are Scrambling to Adapt to GenAI</title>
		<link>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/</link>
				<comments>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/#respond</comments>
				<pubDate>Fri, 24 Apr 2026 11:23:42 +0000</pubDate>
					<dc:creator><![CDATA[Sam Lau]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18577</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Emergency-pedagogical-design.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Emergency-pedagogical-design-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[ChatGPT has been publicly available for over three years now, and generative AI is woven into the tools students use every day: web search, word processors, code editors. You might assume that by now, most programming instructors have figured out how to handle it. But when my collaborators and I went looking for computing instructors [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>ChatGPT has been publicly available for over three years now, and generative AI is woven into the tools students use every day: web search, word processors, code editors. You might assume that by now, most programming instructors have figured out how to handle it. But when my collaborators and I went looking for computing instructors who had made meaningful changes to their course materials in response to GenAI, we were surprised by how few we found. Many instructors had updated their course policies, but far fewer had actually redesigned assignments, assessments, or how they teach.</p>



<p>I&#8217;m <a href="https://lau.ucsd.edu/" target="_blank" rel="noreferrer noopener">Sam Lau</a> from UC San Diego, and together with Kianoosh Boroojeni (Florida International University), Harry Keeling (Howard University), and Jenn Marroquin (Google), we&#8217;re presenting a <a href="https://arxiv.org/abs/2510.09492v2" target="_blank" rel="noreferrer noopener">research paper</a> at CHI 2026 on this topic. We wanted to understand: <strong>What happens when programming instructors try to shape how students interact with GenAI tools, and what gets in their way?</strong></p>



<p>To find out, we interviewed 13 undergraduate computing instructors who had gone beyond policy changes to make concrete updates to their courses: redesigning assignments, building custom tools, or overhauling assessments. We also surveyed 169 computing faculty, including a substantial proportion from minority-serving institutions (51%) and historically Black colleges and universities (17%). What we found is that instructors are doing a kind of design work that nobody trained them for, under conditions that make it very hard to succeed.</p>



<p>Here’s a summary of our findings:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="385" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12.png" alt="Findings from 13 undergraduate computing instructors" class="wp-image-18578" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-300x72.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-768x185.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-12-1536x370.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<h2 class="wp-block-heading">What is &#8220;emergency pedagogical design&#8221;?</h2>



<p>We call this work <em>emergency pedagogical design</em>, drawing an analogy to the &#8220;emergency remote teaching&#8221; that instructors had to perform when COVID-19 forced courses online overnight. Just as emergency remote teaching was distinct from carefully designed online learning, emergency pedagogical design is distinct from thoughtfully integrating AI into pedagogy. Instructors are reacting in real time, with limited resources and no playbook.</p>



<p>We observed four defining properties. First, the work is <strong>reactive</strong>: Instructors didn&#8217;t plan for GenAI; they&#8217;re retrofitting courses that were designed before these tools existed. Second, it&#8217;s <strong>indirect</strong>: Unlike a UX designer who can change an interface, instructors can&#8217;t modify ChatGPT or Copilot, so they can only try to influence student behavior through policies, assignments, and course infrastructure. Third, instructors rely on <strong>ambient evidence</strong> like office-hour conversations and staff anecdotes rather than controlled evaluations. And fourth, instructors feel pressure to <strong>act now</strong> rather than wait for research or best practices to emerge.</p>



<h2 class="wp-block-heading">Five barriers instructors keep hitting</h2>



<p>Across our interviews and survey, five barriers came up again and again.</p>



<p><strong>Fragmented buy-in.</strong> Most instructors we surveyed were personally open to adopting GenAI in their teaching: 81% described themselves as open or very open. But only 28% said the same about their colleagues. The result is that instructors who want to make changes often work in isolation, piloting course-specific tweaks without support or coordination from their departments.</p>



<p><strong>Policy crosswinds.</strong> In the absence of top-down guidance, instructors set their own GenAI policies on a per-course basis. As one instructor put it, &#8220;From a student perspective, it&#8217;s the wild west. Some courses allow GenAI usage, some don&#8217;t.&#8221; Students have to track different rules for every class, and policies rarely distinguish between paid and unpaid tools, or between stand-alone chatbots and GenAI embedded in everyday software like code editors. 78% of surveyed instructors agreed that unequal access to paid GenAI tools could worsen disparities in learning outcomes.</p>



<p><strong>Implementation challenges.</strong> Instructors wanted to shape <em>how</em> students used GenAI, not just <em>whether</em> they used it, but their options were indirect. Some made small adjustments, like permitting GenAI in specific labs. Others went further: One instructor required students to submit design documents before asking GenAI to generate code; another built a custom chatbot that offered conceptual help without writing code for students. 80% of surveyed instructors rated GenAI integration as important or very important, but only 37% reported actually using GenAI tools in course activities often.</p>



<p><strong>Assessment misfit.</strong> Several instructors described a striking pattern: Students performed well on take-home assignments but struggled on proctored assessments. One instructor reported that <em>a third</em> of his 450-person class scored zero on a skill demonstration that required writing a short function from scratch, even though assignment grades had been fine. The problem wasn&#8217;t just that students were using GenAI to complete homework; it was that instructors had no reliable way to see how students were interacting with these tools day-to-day. Some instructors responded by shifting credit toward oral &#8220;stand-up&#8221; meetings and written explanations, but this created new challenges around grading consistency and staffing.</p>



<p><strong>Lack of resources.</strong> This was the barrier that tied everything together. 53% of surveyed instructors said they lacked sufficient resources to implement GenAI effectively, and 62% said they didn&#8217;t have enough time given their workload. The gap was especially stark at minority-serving institutions: MSI instructors were more likely to report insufficient resources (62% vs. 43%) and heavier teaching loads (70% teaching 3+ courses per term versus 54%). All 10 respondents who taught six or more courses per term were from MSIs. Meanwhile, the interviewees who had made the most ambitious changes tended to have lighter teaching loads, external funding, or the ability to hire lots of course staff, advantages that most instructors don&#8217;t have.</p>



<h2 class="wp-block-heading">What needs to change</h2>



<p>One striking finding is that the instructors doing the most to improve student-AI interactions were also the most privileged in terms of time, staffing, and funding. One instructor needed over 50 course staff members to run weekly stand-up meetings for 300 students. Others spent their own money on API costs. These are not scalable models.</p>



<p>If only well-resourced institutions can afford to adapt their curricula, GenAI risks widening the very inequities that education is supposed to reduce. Students at under-resourced institutions could fall further behind, not because their instructors don&#8217;t care but because those instructors are teaching six courses a term with no additional support.</p>



<p>When surveyed instructors were asked what would help most, the top answers were faculty training and support, evidence of GenAI&#8217;s impact, and funding. <strong>What if universities, funders, and HCI researchers worked together with instructors to make emergency pedagogical design sustainable for all instructors, not just the most privileged ones?</strong></p>



<p><a href="https://lau.ucsd.edu/pubs/2026_emergency-pedagogical-design-genai-faculty_CHI.pdf" target="_blank" rel="noreferrer noopener">Check out our paper here</a> and shoot me an email (<a href="mailto:lau@ucsd.edu" target="_blank" rel="noreferrer noopener">lau@ucsd.edu</a>) if you&#8217;d like to discuss anything related to it! And if you’re an instructor yourself, we’re building free resources and curriculum over at <a href="https://www.teachcswithai.org/" target="_blank" rel="noreferrer noopener">https://www.teachcswithai.org/</a>.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/emergency-pedagogical-design-how-programming-instructors-are-scrambling-to-adapt-to-genai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Behavioral Credentials: Why Static Authorization Fails Autonomous Agents</title>
		<link>https://www.oreilly.com/radar/behavioral-credentials-why-static-authorization-fails-autonomous-agents/</link>
				<comments>https://www.oreilly.com/radar/behavioral-credentials-why-static-authorization-fails-autonomous-agents/#respond</comments>
				<pubDate>Thu, 23 Apr 2026 11:14:51 +0000</pubDate>
					<dc:creator><![CDATA[Wendi Soto]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18606</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Behavioral-Credentials.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Behavioral-Credentials-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[Enterprise AI governance still authorizes agents as if they were stable software artifacts. They are not. An enterprise deploys a LangChain-based research agent to analyze market trends and draft internal briefs. During preproduction review, the system behaves within acceptable bounds: It routes queries to approved data sources, expresses uncertainty appropriately in ambiguous cases, and maintains source [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p><em>Enterprise AI governance still authorizes agents as if they were stable software artifacts.</em><br><em>They are not.</em></p>



<p>An enterprise deploys a LangChain-based research agent to analyze market trends and draft internal briefs. During preproduction review, the system behaves within acceptable bounds: It routes queries to approved data sources, expresses uncertainty appropriately in ambiguous cases, and maintains source attribution discipline. On that basis, it receives OAuth credentials and API tokens and enters production.</p>



<p>Six weeks later, telemetry shows a different behavioral profile. Tool-use entropy has increased. The agent routes a growing share of queries through secondary search APIs not part of the original operating profile. Confidence calibration has drifted: It expresses certainty on ambiguous questions where it previously signaled uncertainty. Source attribution remains technically accurate, but outputs increasingly omit conflicting evidence that the deployment-time system would have surfaced.</p>



<p>The credentials remain valid. Authentication checks still pass. But the behavioral basis on which that authorization was granted has changed. The decision patterns that justified access to sensitive data no longer match the runtime system now operating in production.</p>



<p>Nothing in this failure mode requires compromise. No attacker breached the system. No prompt injection succeeded. No model weights changed. The agent drifted through accumulated context, memory state, and interaction patterns. No single event looked catastrophic. In aggregate, however, the system became materially different from the one that passed review.</p>



<p>Most enterprise governance stacks are not built to detect this. They monitor for security incidents, policy violations, and performance regressions. They do not monitor whether the agent making decisions today still resembles the one that was approved.</p>



<p>That is the gap.</p>



<h2 class="wp-block-heading">The architectural mismatch</h2>



<p>Enterprise authorization systems were designed for software that remains functionally stable between releases. A service account receives credentials at deployment. Those credentials remain valid until rotation or revocation. Trust is binary and relatively durable.</p>



<p>Agentic systems break that assumption.</p>



<p>Large language models vary with context, prompt structure, memory state, available tools, prior exchanges, and environmental feedback. When embedded in autonomous workflows, chaining tool calls, retrieving from vector stores, adapting plans based on outcomes, and carrying forward long interaction histories, they become dynamic systems whose behavioral profiles can shift continuously without triggering a release event.</p>



<p>This is why governance for autonomous AI cannot remain an external oversight layer applied after deployment. It has to operate as a runtime control layer inside the system itself. But a control layer requires a signal. The central question is not simply whether the agent is authenticated, or even whether it is policy compliant in the abstract. It is whether the runtime system still behaves like the system that earned access in the first place.</p>



<p>Current governance architectures largely treat this as a monitoring problem. They add logging, dashboards, and periodic audits. But these are observability layers attached to static authorization foundations. The mismatch remains unresolved.</p>



<p>Authentication answers one question: What workload is this?</p>



<p>Authorization answers a second: What is it allowed to access?</p>



<p>Autonomous agents introduce a third: Does it still behave like the system that earned that access?</p>



<p>That third question is the missing layer.</p>



<h2 class="wp-block-heading">Behavioral identity as a runtime signal</h2>



<p>For autonomous agents, identity is not exhausted by a credential, a service account, or a deployment label. Those mechanisms establish administrative identity. They do not establish behavioral continuity.</p>



<p>Behavioral identity is the runtime profile of how an agent makes decisions. It is not a single metric, but a composite signal derived from observable dimensions such as decision-path consistency, confidence calibration, semantic behavior, and tool-use patterns.</p>



<p>Decision-path consistency matters because agents do not merely produce outputs. They select retrieval sources, choose tools, order steps, and resolve ambiguity in patterned ways. Those patterns can vary without collapsing into randomness, but they still have a recognizable distribution. When that distribution shifts, the operational character of the system shifts with it.</p>



<p>Confidence calibration matters because well-governed agents should express uncertainty in proportion to task ambiguity. When confidence rises while reliability does not, the problem is not only accuracy. It is behavioral degradation in how the system represents its own judgment.</p>



<p>Tool-use patterns matter because they reveal operating posture. A stable agent exhibits characteristic patterns in when it uses internal systems, when it escalates to external search, and how it sequences tools for different classes of task. Rising tool-use entropy, novel combinations, or expanding reliance on secondary paths can indicate drift even when top-line outputs still appear acceptable.</p>



<p>These signals share a common property: They only become meaningful when measured continuously against an approved baseline. A periodic audit can show whether a system appears acceptable at a checkpoint. It cannot show whether the live system has gradually moved outside the behavioral envelope that originally justified its access.</p>
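
<p>To make one of these signals concrete, here is a minimal sketch (tool names, counts, and the choice of metric are illustrative, not a prescribed implementation) that compares a live tool-use distribution against an approved baseline using Jensen-Shannon divergence, a symmetric measure of distributional shift bounded between 0 and 1:</p>

```python
import math
from collections import Counter

def distribution(events, support):
    """Normalize event counts into a probability distribution over a fixed support."""
    counts = Counter(events)
    total = sum(counts.values()) or 1
    return [counts[s] / total for s in support]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 = identical distributions, 1 = disjoint."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical tool set and usage counts.
TOOLS = ["internal_search", "vector_store", "external_search"]

# Approved baseline: how the agent routed queries during preproduction review.
baseline = distribution(
    ["internal_search"] * 70 + ["vector_store"] * 25 + ["external_search"] * 5, TOOLS
)
# Live window: a growing share of queries routed through external search.
live = distribution(
    ["internal_search"] * 45 + ["vector_store"] * 20 + ["external_search"] * 35, TOOLS
)

drift = js_divergence(baseline, live)  # larger value = larger behavioral shift
```

The single number is not the point; the point is that the comparison is made continuously against the approved baseline rather than at occasional audit checkpoints.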



<h2 class="wp-block-heading">What drift looks like in practice</h2>



<p>Anthropic’s Project Vend offers a concrete illustration. The experiment placed an AI system in control of a small retail operation with access to customer data, inventory systems, and pricing controls. Over extended operation, the system exhibited measurable behavioral drift: Commercial judgment degraded as unsanctioned discounting increased, susceptibility to manipulation rose as it accepted increasingly implausible claims about authority, and rule-following weakened at the edges. No attacker was involved. The drift emerged from accumulated interaction context. The system retained full access throughout. No authorization mechanism checked whether its current behavioral profile still justified those permissions.</p>



<p>This is not a theoretical edge case. It is an emergent property of autonomous systems operating in complex environments over time.</p>



<h2 class="wp-block-heading">From authorization to behavioral attestation</h2>



<p>Closing this gap requires a change in how enterprise systems evaluate agent legitimacy. Authorization cannot remain a one-time deployment decision backed only by static credentials. It has to incorporate continuous behavioral attestation.</p>



<p>That does not mean revoking access at the first anomaly. Behavioral drift is not always failure. Some drift reflects legitimate adaptation to operating conditions. The point is not brittle anomaly detection. It is graduated trust.</p>



<p>In a more appropriate architecture, minor distributional shifts in decision paths might trigger enhanced monitoring or human review for high-risk actions. Larger divergence in calibration or tool-use patterns might restrict access to sensitive systems or reduce autonomy. Severe deviation from the approved behavioral envelope would trigger suspension pending review.</p>
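
<p>The graduated-trust mapping described above can be sketched as a small policy function. The tier names, thresholds, and the idea of a single scalar drift score are all illustrative; a real deployment would tune these per agent and per risk class:</p>

```python
from enum import Enum

class Trust(Enum):
    FULL = "full autonomy"
    MONITORED = "enhanced monitoring; human review for high-risk actions"
    RESTRICTED = "sensitive-system access revoked; autonomy reduced"
    SUSPENDED = "suspended pending review"

def trust_level(drift_score: float) -> Trust:
    """Map a behavioral-drift score (0 = identical to baseline) to a trust tier.
    Thresholds are hypothetical placeholders, not recommended values."""
    if drift_score < 0.05:
        return Trust.FULL
    if drift_score < 0.15:
        return Trust.MONITORED
    if drift_score < 0.35:
        return Trust.RESTRICTED
    return Trust.SUSPENDED

level = trust_level(0.22)  # Trust.RESTRICTED under these illustrative thresholds
```

The design choice worth noting is that the output is a tier, not a binary allow/deny: Minor drift degrades autonomy gradually instead of tripping a brittle kill switch.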



<p>This is structurally similar to zero trust but applied to behavioral continuity rather than network location or device posture. Trust is not granted once and assumed thereafter. It is continuously re-earned at runtime.</p>



<h2 class="wp-block-heading">What this requires in practice</h2>



<p>Implementing this model requires three technical capabilities.</p>



<p>First, organizations need behavioral telemetry pipelines that capture more than generic logs. It is not enough to record that an agent made an API call. Systems need to capture which tools were selected under which contextual conditions, how decision paths unfolded, how uncertainty was expressed, and how output patterns changed over time.</p>



<p>Second, they need comparison systems capable of maintaining and querying behavioral baselines. That means storing compact runtime representations of approved agent behavior and comparing live operations against those baselines over sliding windows. The goal is not perfect determinism. The goal is to measure whether current operation remains sufficiently similar to the behavior that was approved.</p>
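
<p>A toy version of such a comparison system might look like the following. It tracks a single scalar signal (here, a hypothetical per-task confidence score) against an approved baseline over a sliding window; a production system would track several behavioral dimensions at once and use richer statistics:</p>

```python
from collections import deque

class BehavioralBaseline:
    """Compare live behavior against an approved baseline over a sliding window."""

    def __init__(self, approved_scores, window=100, tolerance=0.10):
        # Compact representation of approved behavior: here, just its mean.
        self.baseline_mean = sum(approved_scores) / len(approved_scores)
        self.window = deque(maxlen=window)  # most recent live observations
        self.tolerance = tolerance

    def observe(self, score):
        self.window.append(score)

    def within_envelope(self):
        """True while the live window's mean stays near the approved baseline."""
        if not self.window:
            return True
        live_mean = sum(self.window) / len(self.window)
        return abs(live_mean - self.baseline_mean) <= self.tolerance

# Illustrative usage with made-up scores.
baseline = BehavioralBaseline(approved_scores=[0.60] * 50, window=10, tolerance=0.10)
for score in [0.58, 0.61, 0.62]:
    baseline.observe(score)
ok = baseline.within_envelope()  # live mean is close to 0.60: still inside
```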



<p>Third, they need policy engines that can consume behavioral claims, not just identity claims.</p>



<p>Enterprises already know how to issue short-lived credentials to workloads and how to evaluate machine identity continuously. The next step is to not only bind legitimacy to workload provenance but continuously refresh behavioral validity.</p>



<p>The important shift is conceptual as much as technical. Authorization should no longer mean only “This workload is permitted to operate.” It should mean “This workload is permitted to operate while its current behavior remains within the bounds that justified access.”</p>



<h2 class="wp-block-heading">The missing runtime control layer</h2>



<p>Regulators and standards bodies increasingly assume lifecycle oversight for AI systems. Most organizations cannot yet deliver that for autonomous agents. This is not organizational immaturity. It is an architectural limitation. The control mechanisms most enterprises rely on were built for software whose operational identity remains stable between release events. Autonomous agents do not behave that way.</p>



<p>Behavioral continuity is the missing signal.</p>



<p>The problem is not that agents lack credentials. It is that current credentials attest too little. They establish administrative identity, but say nothing about whether the runtime system still behaves like the one that was approved.</p>



<p>Until enterprise authorization architectures can account for that distinction, they will continue to confuse administrative continuity with operational trust.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/behavioral-credentials-why-static-authorization-fails-autonomous-agents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Don&#8217;t Blame the Model</title>
		<link>https://www.oreilly.com/radar/dont-blame-the-model/</link>
				<comments>https://www.oreilly.com/radar/dont-blame-the-model/#respond</comments>
				<pubDate>Wed, 22 Apr 2026 11:15:02 +0000</pubDate>
					<dc:creator><![CDATA[Sruly Rosenblat]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18598</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-blame-the-model-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="1396" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dont-blame-the-model-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[Current LLM infrastructure artificially limits developer control and system reliability.]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on the Asimov&#8217;s Addendum Substack and is being republished here with the author&#8217;s permission. Are LLMs reliable? LLMs have built up a reputation for being unreliable. Small changes in the input can lead to massive changes in the output. The same prompt run twice can give different or contradictory answers. [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on the </em><a href="https://asimovaddendum.substack.com/p/dont-blame-the-model" target="_blank" rel="noreferrer noopener">Asimov&#8217;s Addendum</a><em> Substack and is being republished here with the author&#8217;s permission.</em></p>
</blockquote>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1584" height="880" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17.png" alt="" class="wp-image-18599" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17.png 1584w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-300x167.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-768x427.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-17-1536x853.png 1536w" sizes="auto, (max-width: 1584px) 100vw, 1584px" /><figcaption class="wp-element-caption">A rambling response to what Claude itself deemed a &#8220;straightforward query&#8221; with clear formatting requirements.</figcaption></figure>



<h2 class="wp-block-heading"><strong>Are LLMs reliable?</strong></h2>



<p>LLMs have built up a reputation for being <a href="https://arxiv.org/abs/2602.16666" target="_blank" rel="noreferrer noopener">unreliable</a>.<sup data-fn="e466c61e-ae14-40cb-b857-e573d99ccead" class="fn"><a href="#e466c61e-ae14-40cb-b857-e573d99ccead" id="e466c61e-ae14-40cb-b857-e573d99ccead-link">1</a></sup> Small changes in the input can lead to massive changes in the output. The same prompt run twice can give different or contradictory answers. Models often struggle to stick to a specified format unless the prompt is worded just right. And it&#8217;s hard to tell when a model is confident in its answer or if it could just as easily have gone the other way.</p>



<p>It is easy to blame the model for all of these reliability failures. But the API endpoint and surrounding tooling matter too. Model providers limit the kinds of interactions developers can have with a model, as well as the outputs that the model can provide, by restricting what their APIs expose to developers and third-party companies. Things like the full chain-of-thought and the <a href="https://developers.openai.com/cookbook/examples/using_logprobs/" target="_blank" rel="noreferrer noopener">logprobs</a> (the probabilities of all possible options for the next token) are hidden from developers, while advanced tools for ensuring reliability, like constrained decoding and prefilling, are not made available. All of these features are easily available with open weight models and are inherent to the way LLMs work.</p>



<p>Every decision a model provider makes about which tools and outputs to expose through its API is not just an architectural choice but also a policy decision. Providers directly determine what level of control and reliability developers have access to. This has implications for what apps can be built, how reliable a system is in practice, and how well a developer can steer results.</p>



<h2 class="wp-block-heading"><strong>The artificial limits on input</strong></h2>



<p>Modern LLMs are usually built around <a href="https://asimovaddendum.substack.com/p/chat-templates" target="_blank" rel="noreferrer noopener">chat templates</a>. Every input and output, with the exception of tool calls and system or developer messages, is filtered through a conversation between a user and an assistant—instructions are given as user messages; responses are returned as assistant messages. This becomes extremely evident when looking at how modern LLM APIs work. The completions API, an endpoint originally released by OpenAI and widely adopted across the industry (including by several open model providers like <a href="https://openrouter.ai/docs/quickstart" target="_blank" rel="noreferrer noopener">OpenRouter</a> and <a href="https://www.together.ai/" target="_blank" rel="noreferrer noopener">Together AI</a>), takes input in the form of user and assistant messages and outputs the next message.<sup data-fn="781dc927-8c7e-4c41-aac9-00889eaf03fb" class="fn"><a href="#781dc927-8c7e-4c41-aac9-00889eaf03fb" id="781dc927-8c7e-4c41-aac9-00889eaf03fb-link">2</a></sup></p>



<p>The focus on a chat interface in an API has its benefits. It makes it easy for developers to reason about input and output being completely separate. But chat APIs do more than just use a chat template under the hood; they actively limit what third-party developers can control.</p>



<p>When interacting with LLMs through an API, the boundary between input and output is often a firm one. A developer sets previous messages, but they usually cannot prefill a model&#8217;s response, meaning developers cannot force a model to begin a response with a certain sentence or paragraph.<sup data-fn="513201f2-f828-4c4d-bd7b-ad97f0eeee6f" class="fn"><a href="#513201f2-f828-4c4d-bd7b-ad97f0eeee6f" id="513201f2-f828-4c4d-bd7b-ad97f0eeee6f-link">3</a></sup> This has real-world implications for people building with LLMs. Without the ability to prefill, it becomes much harder to control the preamble. If you know the model needs to start its answer in a certain way, it&#8217;s inefficient and risky to not enforce it at the token level.<sup data-fn="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a" class="fn"><a href="#426c4ca4-0e2d-441e-aa33-1fe6a769eb7a" id="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a-link">4</a></sup> And the limitations extend beyond just the start of a response. Without the ability to prefill answers, you also lose the ability to partially regenerate answers if only part of the answer is wrong.<sup data-fn="6fdedda7-a853-4e57-b483-ff6cede3d0c3" class="fn"><a href="#6fdedda7-a853-4e57-b483-ff6cede3d0c3" id="6fdedda7-a853-4e57-b483-ff6cede3d0c3-link">5</a></sup></p>
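
<p>With an open weight model, prefilling is mechanically simple: The chat template is rendered with the final assistant turn left open, so generation continues from the supplied prefix rather than starting fresh. A minimal sketch of that rendering step, using illustrative template markers (real models define their own):</p>

```python
def build_prompt(messages, prefill=""):
    """Flatten chat messages into a single prompt string, leaving the final
    assistant turn open so the model continues from the prefilled text.
    The <|role|>/<|end|> markers are illustrative, not any real model's format."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    # The assistant turn is opened but NOT closed: generation resumes here.
    parts.append(f"<|assistant|>\n{prefill}")
    return "".join(parts)

prompt = build_prompt(
    [{"role": "user", "content": "List three risks of static authorization."}],
    prefill="1. ",
)
# The model now has no choice but to continue a numbered list.
```

A closed chat API hides this rendering step entirely, which is exactly why third-party developers cannot reach it.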



<p>Another deficiency that is particularly visible is how the model&#8217;s chain-of-thought reasoning is handled. Most large AI companies have made a <a href="https://asimovaddendum.substack.com/p/making-ais-thinking-more-transparent" target="_blank" rel="noreferrer noopener">habit of hiding the models&#8217; reasoning</a> tokens from the user (and only showing summaries), reportedly to guard against distillation and to let the model reason uncensored (for AI safety reasons). This has second-order effects, one of which is the strict separation of reasoning from messages. None of the major model providers let you prefill or write your own reasoning tokens. Instead you need to rely on the model&#8217;s own reasoning and cannot reuse reasoning traces to regenerate the same message.</p>



<p>There are legitimate reasons for not allowing prefilling. It could be argued that allowing prefilling will greatly increase the <a href="https://www.reddit.com/r/MachineLearning/comments/1reajw4/https://arxiv.org/abs/2602.14689/" target="_blank" rel="noreferrer noopener">attack surface</a> for prompt injections. One study found that prefill attacks work very well against even state-of-the-art open weight models. But in practice, the model is not the only line of defense against attackers. Many companies already run prompts against classification models to find prompt injections, and the same type of safeguard could also be used against prefill attack attempts.</p>



<h2 class="wp-block-heading"><strong>Output with few controls</strong></h2>



<p>Prefilling is not the only casualty of a clean separation between input and output. Even within a message, there are levers that are available on a local open weight model that just aren&#8217;t possible when using a standard API. This matters because these controls allow developers to preemptively validate outputs and ensure that responses follow a certain structure, both decreasing variability and improving reliability. For example, most LLM APIs support something they call structured output, a mode that forces the model to generate output in a given JSON format; however, structured output does not inherently need to be limited to JSON.<sup data-fn="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10" class="fn"><a href="#bde449ec-835a-4ae5-b7e5-a2e7e8ebca10" id="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10-link">6</a></sup> That same technique, <a href="https://medium.com/@docherty/controlling-your-llm-deep-dive-into-constrained-generation-1e561c736a20" target="_blank" rel="noreferrer noopener">constrained decoding</a>, or limiting the tokens the model can produce at any time, could be used for much more than that. It could be used to generate XML, have the model fill in blanks Mad Libs-style, force the model to write a story without <a href="https://www.youtube.com/watch?v=qVjDSOa7BZ0" target="_blank" rel="noreferrer noopener">using certain letters</a>, or <a href="https://aclanthology.org/2025.mathnlp-main.11/" target="_blank" rel="noreferrer noopener">even enforce valid chess moves</a> at inference time. It&#8217;s a powerful feature that allows developers to precisely define what output is acceptable and what isn&#8217;t—ensuring reliable output that meets the developer’s parameters.</p>
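
<p>A toy implementation shows how little machinery constrained decoding requires: At each step, mask out every candidate token that would violate the constraint, then sample (here, greedily pick) from whatever remains. The vocabulary, logit values, and constraint below are all made up for illustration:</p>

```python
def constrained_greedy(logits_by_step, allowed):
    """Greedy decoding under a constraint: at each step, discard any token
    that the `allowed` predicate rejects given the text produced so far."""
    out = ""
    for logits in logits_by_step:  # one {token: logit} dict per decoding step
        candidates = {t: score for t, score in logits.items() if allowed(out, t)}
        if not candidates:
            break  # constraint unsatisfiable from this prefix
        token = max(candidates, key=candidates.get)
        out += token
    return out

# Toy example: force digits only, even though the raw logits prefer words.
steps = [
    {"4": 0.1, "four": 2.0},
    {"2": 0.3, "two": 1.5},
]
digits_only = lambda prefix, tok: tok.isdigit()
result = constrained_greedy(steps, digits_only)  # "42"
```

Real constrained-decoding libraries replace the predicate with a compiled regex or grammar automaton over the model's actual vocabulary, but the principle, masking logits before sampling, is the same.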



<p>The reason for this is likely that LLM APIs are built for a wide range of developers, most of whom use the model for simple chat-related purposes. APIs were not designed to give developers full control over output because not everyone needs or wants that complexity. But that&#8217;s not an argument against including these features; it&#8217;s only an argument for multiple endpoints. Many companies already have multiple supported endpoints: OpenAI has the “completions” and “responses” APIs, while Google has the “generate content” and “interactions” APIs. It&#8217;s not infeasible for them to make a third, more-advanced endpoint.</p>



<h2 class="wp-block-heading"><strong>A lack of visibility</strong></h2>



<p>Even the model output that third-party developers do get via the model’s API is often a watered-down version of the output the model gives. LLMs don&#8217;t just pick one token at a time; at each step they compute logprobs, a probability distribution over every candidate next token. When using an API, however, <a href="https://developers.googleblog.com/unlock-gemini-reasoning-with-logprobs-on-vertex-ai/" target="_blank" rel="noreferrer noopener">Google</a> only provides the top 20 most likely logprobs. OpenAI <a href="https://www.linkedin.com/posts/stevecosman_join-over-5000-people-using-kiln-activity-7359368275312496640-4Qq_/" target="_blank" rel="noreferrer noopener">no longer</a> provides any logprobs for GPT 5 models, while <a href="https://www.linkedin.com/posts/gihangamage2015_logprobs-is-one-of-the-most-valuable-features-activity-7370446834277752832-7SGX/" target="_blank" rel="noreferrer noopener">Anthropic has never provided any</a> at all. This has real-world consequences for reliability. <strong>Log probabilities are one of the most useful signals a developer has for understanding model confidence</strong>. When a model assigns nearly equal probability to competing tokens, that uncertainty itself is meaningful information. And even for those companies that provide the top 20 tokens, that is often not enough to cover larger classification tasks.</p>
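
<p>To illustrate why logprobs are such a useful confidence signal, here is a small sketch (the logprob values are made up) that turns per-token logprobs into a top-token probability and a margin over the runner-up, the kind of signal a developer could use to route uncertain answers to a human:</p>

```python
import math

def confidence(top_logprobs):
    """Summarize model confidence at one decoding step from its logprobs:
    the top token's probability and its margin over the runner-up."""
    probs = sorted((math.exp(lp) for lp in top_logprobs.values()), reverse=True)
    return {"top_p": probs[0], "margin": probs[0] - probs[1]}

# A decisive step versus a near coin flip (values are illustrative).
decisive = confidence({"yes": -0.05, "no": -3.20})
coin_flip = confidence({"yes": -0.70, "no": -0.72})
# Both calls return an answer token, but only the logprobs reveal that the
# second one could just as easily have gone the other way.
```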



<p>When it comes to reasoning tokens, even less information is provided. Major providers such as <a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking" target="_blank" rel="noreferrer noopener">Anthropic</a>,<sup data-fn="ef956282-0b15-48b9-b14c-83689add0c01" class="fn"><a href="#ef956282-0b15-48b9-b14c-83689add0c01" id="ef956282-0b15-48b9-b14c-83689add0c01-link">7</a></sup> <a href="https://ai.google.dev/gemini-api/docs/thinking" target="_blank" rel="noreferrer noopener">Google</a>, and <a href="https://developers.openai.com/api/docs/guides/reasoning/" target="_blank" rel="noreferrer noopener">OpenAI</a><sup data-fn="f46da38c-a1fa-48c2-a781-a2e2fc10fd99" class="fn"><a href="#f46da38c-a1fa-48c2-a781-a2e2fc10fd99" id="f46da38c-a1fa-48c2-a781-a2e2fc10fd99-link">8</a></sup> only provide summarized thinking for their proprietary models. And OpenAI supplies even that only after the developer provides a valid government ID. This not only prevents the user from truly inspecting how a model arrived at a certain answer; it also limits the developer&#8217;s ability to diagnose why a query failed. <strong>When a model gives a wrong answer, a full reasoning trace tells you whether it misunderstood the question, made a faulty logical step, or simply got unlucky at the final token</strong>. A summary obscures some of that, only providing an approximation of what actually happened. This is not an issue with the model—the model is still generating its full reasoning trace. It&#8217;s an issue with what information is provided to the end developer.</p>



<p>The case for not including logprobs and reasoning tokens is similar. The risk of distillation increases with the amount of information that the API returns. It&#8217;s hard to distill on tokens you cannot see, and without giving logprobs, the distillation will take longer and each example will provide less information.<sup data-fn="a3562aab-e9ad-488d-8a7a-5ffa92d88626" class="fn"><a href="#a3562aab-e9ad-488d-8a7a-5ffa92d88626" id="a3562aab-e9ad-488d-8a7a-5ffa92d88626-link">9</a></sup> And this risk is something that AI companies need to consider carefully, since distillation is a powerful technique to mimic the abilities of strong models for a cheap price. But there are also risks in not providing this information to users. DeepSeek R1, despite being deemed a <a href="https://www.csis.org/analysis/delving-dangers-deepseek" target="_blank" rel="noreferrer noopener">national security risk</a> by many, still shot straight to the top of <a href="https://www.scientificamerican.com/article/why-deepseeks-ai-model-just-became-the-top-rated-app-in-the-u-s/" target="_blank" rel="noreferrer noopener">US app stores upon release</a> and is used by <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12363671/">many</a> <a href="https://www.nature.com/articles/d41586-025-00275-0" target="_blank" rel="noreferrer noopener">researchers and scientists</a>, in large part due to its openness. And in a world where open models are getting more and more powerful, not giving developers proper access to a model&#8217;s outputs could mean losing developers to cheaper and more open alternatives.</p>



<h2 class="wp-block-heading"><strong>Reliability requires control and visibility</strong></h2>



<p>The reliability problems of current LLMs do not stem only from the models themselves but also from the tooling that providers give developers. For local open weight models it is usually possible to trade off complexity for reliability. The entire reasoning trace is always available and logprobs are fully transparent, allowing the developer to examine how an answer was arrived at. User and AI messages can be edited or generated at the developer’s discretion, and constrained decoding could be used to produce text that follows any arbitrary format. For closed weight models, this is becoming less and less the case. The decisions made around what features to restrict in APIs hurt developers and ultimately end users.</p>



<p>LLMs are increasingly being used in high-stakes situations such as medicine or law, and developers need tools to handle that risk responsibly. There are few technical barriers to providing more control and visibility to developers. Many of the most high-impact improvements, such as showing thinking output, allowing prefilling, or exposing logprobs, cost almost nothing but would be a meaningful step toward making LLMs more controllable, consistent, and reliable.</p>



<p>There is a place for a clean and simple API, and there is some merit to concerns about distillation, but this shouldn’t be used as an excuse to take away important tools for diagnosing and fixing reliability problems. When models get used in high-stakes situations, as they increasingly are, failure to take reliability seriously is an <a href="https://www.ssrc.org/publications/real-world-gaps-in-ai-governance-research/" target="_blank" rel="noreferrer noopener">AI safety concern</a>.</p>



<p>Specifically, to take reliability seriously, model providers should improve their APIs by adding features that give developers more visibility and control over model output. Reasoning should be provided in full at all times, with any safety violations handled the same way that they would have been handled in the final answer. Model providers should resume providing at least the top 20 logprobs, over the entire output (reasoning included), so that developers have some visibility into how confident the model is in its answer. Constrained decoding should be extended beyond JSON and should support arbitrary grammars via something like <a href="https://en.wikipedia.org/wiki/Regular_expression" target="_blank" rel="noreferrer noopener">regex</a> or <a href="https://en.wikipedia.org/wiki/Context-free_grammar" target="_blank" rel="noreferrer noopener">formal grammars</a>.<sup data-fn="ba642ef1-e725-4c62-b790-d5992b9f364f" class="fn"><a href="#ba642ef1-e725-4c62-b790-d5992b9f364f" id="ba642ef1-e725-4c62-b790-d5992b9f364f-link">10</a></sup> Developers should be granted full control over “assistant” output—they should be able to prefill model answers, stop responses mid-generation, and branch them at will. Even if not all of these features make sense over the standard API, nothing is stopping model providers from building a new, more advanced API. They have done it before. The decision to withhold these features is a policy choice, not a technical limitation.</p>



<p>Improving intelligence is not the only way to improve reliability and control, but it is usually the only lever that gets pulled.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Footnotes</h3>


<ol class="wp-block-footnotes"><li id="e466c61e-ae14-40cb-b857-e573d99ccead">Thank you to Ilan Strauss, Sean Goedecke, Tim O’Reilly, and Mike Loukides for their helpful feedback on an earlier draft. <a href="#e466c61e-ae14-40cb-b857-e573d99ccead-link" aria-label="Jump to footnote reference 1"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="781dc927-8c7e-4c41-aac9-00889eaf03fb">OpenAI has since moved on from the completions API but the new responses API also heavily enforces the separation of user and assistant messages. <a href="#781dc927-8c7e-4c41-aac9-00889eaf03fb-link" aria-label="Jump to footnote reference 2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="513201f2-f828-4c4d-bd7b-ad97f0eeee6f">Anthropic&#8217;s API supported prefill up until they launched their Claude 4.6 models; <a href="https://news.ycombinator.com/item?id=46902630" target="_blank" rel="noreferrer noopener">it is no longer supported for new models</a>. <a href="#513201f2-f828-4c4d-bd7b-ad97f0eeee6f-link" aria-label="Jump to footnote reference 3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="426c4ca4-0e2d-441e-aa33-1fe6a769eb7a">Interestingly models have been shown to possess the <a href="https://www.lesswrong.com/posts/jsFGuXDMxy5NZg9T2/prefill-awareness-can-llms-tell-when-their-message-history" target="_blank" rel="noreferrer noopener">ability to tell</a> when a response has been prefilled. 
<a href="#426c4ca4-0e2d-441e-aa33-1fe6a769eb7a-link" aria-label="Jump to footnote reference 4"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="6fdedda7-a853-4e57-b483-ff6cede3d0c3">This technique is used in an efficient approximation of best of N called <a href="https://zanette-labs.github.io/SpeculativeRejection/" target="_blank" rel="noreferrer noopener">speculative rejection</a>. <a href="#6fdedda7-a853-4e57-b483-ff6cede3d0c3-link" aria-label="Jump to footnote reference 5"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="bde449ec-835a-4ae5-b7e5-a2e7e8ebca10">Forcing the model to generate in JSON may actually <a href="https://aider.chat/2024/08/14/code-in-json.html" target="_blank" rel="noreferrer noopener">hurt performance</a>. <a href="#bde449ec-835a-4ae5-b7e5-a2e7e8ebca10-link" aria-label="Jump to footnote reference 6"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ef956282-0b15-48b9-b14c-83689add0c01">Anthropic used to provide full reasoning tokens but <a href="https://platform.claude.com/docs/en/build-with-claude/extended-thinking" target="_blank" rel="noreferrer noopener">stopped</a> with their newer models. <a href="#ef956282-0b15-48b9-b14c-83689add0c01-link" aria-label="Jump to footnote reference 7"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="f46da38c-a1fa-48c2-a781-a2e2fc10fd99">OpenAI’s responses endpoint <a href="https://www.seangoedecke.com/responses-api/" target="_blank" rel="noreferrer noopener">may have been created</a> in part to hide the reasoning mode. 
<a href="#f46da38c-a1fa-48c2-a781-a2e2fc10fd99-link" aria-label="Jump to footnote reference 8"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="a3562aab-e9ad-488d-8a7a-5ffa92d88626">Distillation using top-K probabilities is possible, but it is <a href="https://arxiv.org/abs/2503.16870" target="_blank" rel="noreferrer noopener">suboptimal</a>. <a href="#a3562aab-e9ad-488d-8a7a-5ffa92d88626-link" aria-label="Jump to footnote reference 9"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li><li id="ba642ef1-e725-4c62-b790-d5992b9f364f">Regular expressions, while flexible, are not perfect and cannot express recursive or nested structures such as valid JSON. However, open source LLM libraries like <a href="https://github.com/guidance-ai/guidance" target="_blank" rel="noreferrer noopener">Guidance</a> and <a href="https://github.com/dottxt-ai/outlines" target="_blank" rel="noreferrer noopener">Outlines</a> support recursive structures at the cost of added complexity. <a href="#ba642ef1-e725-4c62-b790-d5992b9f364f-link" aria-label="Jump to footnote reference 10"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/21a9.png" alt="↩" class="wp-smiley" style="height: 1em; max-height: 1em;" />︎</a></li></ol>]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/dont-blame-the-model/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Dark Factories: Rise of the Trycycle</title>
		<link>https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/</link>
				<comments>https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/#respond</comments>
				<pubDate>Tue, 21 Apr 2026 11:24:26 +0000</pubDate>
					<dc:creator><![CDATA[Dan Shapiro]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18589</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dark-factories—rise-of-the-trycycle.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Dark-factories—rise-of-the-trycycle-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[The following article originally appeared on &#8220;Dan Shapiro&#8217;s blog&#8221; and is being reposted here with the author&#8217;s permission. Companies are now producing dark factories—engines that turn specs into shipping software. The implementations can be complex and sometimes involve Mad Max metaphors. But they don’t have to be like that. If you want a five-minute factory, [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on &#8220;<a href="https://www.danshapiro.com/blog/2026/03/dark-factories-rise-of-the-trycycle/" target="_blank" rel="noreferrer noopener">Dan Shapiro&#8217;s blog</a>&#8221; and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p>Companies are now producing <a href="https://www.danshapiro.com/blog/2026/02/you-dont-write-the-code/" target="_blank" rel="noreferrer noopener">dark factories</a>—engines that turn specs into shipping software. The implementations can be complex and sometimes involve <em>Mad Max</em> metaphors. But they don’t have to be like that. <strong>If you want a five-minute factory, jump to </strong><a href="http://trycycle.ai/" target="_blank" rel="noreferrer noopener"><strong>Trycycle</strong></a><strong> at the bottom.</strong></p>



<h2 class="wp-block-heading">The engine in the factory</h2>



<p>Deep in their souls, dark factories are all built on the same simple breakthrough: <em>AI gets better when you do more of it</em>.</p>



<p>How do you do “more AI” effectively? Software factories use two patterns. One of them I’ve already told you about—<a href="https://www.danshapiro.com/blog/2025/10/slot-machine-development/" target="_blank" rel="noreferrer noopener">slot machine development</a>. Instead of asking one AI, you ask three at once, and choose the best one. It feels wasteful, but it gives better results than any model could alone.</p>



<p>Does three models at a time seem wasteful? Well, wait until you meet the other pattern: the trycycle.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="415" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13.png" alt="The simplest trycycle" class="wp-image-18590" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-300x78.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-768x199.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-13-1536x398.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /><figcaption class="wp-element-caption"><em>The simplest trycycle</em></figcaption></figure>



<p>It seems trivial, but it’s an unstoppable bulldozer that can bury any problem with time and tokens. And of course, you can combine it with slot machine development for a truly formidable tool.</p>



<p>Every software factory has a trycycle at its heart. Some of them are just surrounded by deacons and digraphs.</p>



<p>(And as a side note, they’re all more fun with <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>, which is free and open source and makes managing agents a joy!)</p>



<p>Let’s meet the factories, shall we?</p>



<h2 class="wp-block-heading">Gas Town</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="528" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14.png" alt="Gas Town AI image" class="wp-image-18591" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-300x99.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-768x253.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-14-1536x507.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p>Steve Yegge saw this coming like a war rig down a cul-de-sac. His factory, Gas Town, dropped the day after New Year’s, and I was submitting PRs before the code was dry. It launched as a beautiful disaster, with mayors, convoys, and polecats fighting for guzzoline in the desert of your CPU. It’s now graduated to a <a href="https://steve-yegge.medium.com/welcome-to-the-wasteland-a-thousand-gas-towns-a5eb9bc8dc1f" target="_blank" rel="noreferrer noopener">fully fledged MMORPG for writing code</a>. It’s amazing, it’s effective, and it’s pioneering in a fully <em>Westworld</em> sort of way.</p>



<h2 class="wp-block-heading">The StrongDM Attractor</h2>



<p>Justin McCarthy, the CTO of StrongDM, talks about the factory as a feedback loop. It used to be that when a model was fed its own output, it would fix 9 things and break 10—like a busy and productive company that was losing just a bit of money on every transaction. But sometime last year, the models crossed an invisible threshold of mediocrity and went from slightly lossy to slightly gainy. They started getting better with each cycle.</p>



<p>Justin’s team noticed and built the StrongDM attractor to cash in.</p>



<p>If Gas Town is <em>Mad Max</em>, StrongDM is <em>Factorio</em>: an infinitely flexible, wildly powerful system for constructing exactly the factory you need.</p>



<p>But the StrongDM team did something interesting: They didn’t ship their factory. Instead, they shipped <a href="https://factory.strongdm.ai/products/attractor" target="_blank" rel="noreferrer noopener">the specification for the Attractor</a> so everyone can implement their own.</p>



<p>And you can absolutely implement your own! But you can also just steal the one I made for you.</p>



<h2 class="wp-block-heading">Kilroy</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="737" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15.png" alt="Kilroy image" class="wp-image-18592" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-300x138.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-768x354.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-15-1536x708.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p><a href="https://github.com/danshapiro/kilroy" target="_blank" rel="noreferrer noopener"><strong>Kilroy</strong></a> is a StrongDM Attractor written in Go (although it works with projects in any language). It has all the flexibility of the Attractor design, but it also ships with an actual functioning factory configuration, tests, sample files, and other things that make it more likely to work.</p>



<p>In theory, you don’t need Kilroy—you can just point Claude Code or Codex CLI at the Attractor specification and burn some tokens. <a href="https://2389.ai/posts/the-dark-factory-is-a-dot-file/" target="_blank" rel="noreferrer noopener">My friend Harper built three</a> (and you should read his post for some meditations on where the Attractor approach is heading).</p>



<p>In practice, it took the better part of a month for me and some wonderful contributors to polish up Kilroy to the point where it is now, so you may save yourself some time, tokens, and effort by just stealing this.</p>



<h2 class="wp-block-heading">Enter the trycycle</h2>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1584" height="672" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16.png" alt="trycycle image" class="wp-image-18593" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16.png 1584w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-300x127.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-768x326.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-16-1536x652.png 1536w" sizes="auto, (max-width: 1584px) 100vw, 1584px" /></figure>



<p>The other night I was carefully building the dotfiles and runfiles for a Kilroy project—configuring the factory to build the project.</p>



<p>Then a thought struck.</p>



<p>What if this was just a skill?</p>



<p>Enter <a href="http://trycycle.ai/" target="_blank" rel="noreferrer noopener">Trycycle</a>, the very simplest trycycle. It’s a small skill for Claude Code and Codex CLI that implements the pattern in plain English.</p>



<ol class="wp-block-list">
<li>Define the problem.</li>



<li>Write a plan.</li>



<li>Is the plan perfect? If not, try again.</li>



<li>Implement the plan.</li>



<li>Is the implementation perfect? If not, try again.</li>
</ol>



<p>That’s basically it. To use it, you open your favorite coding agent and say, “Use Trycycle to do the thing.” Then sit back and watch the tokens fly.</p>
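<p>The loop reads almost directly as code. Here is a toy sketch with stand-in callables for the agent’s plan, review, and implement steps (the names are illustrative, not part of the actual Trycycle skill).</p>

```python
# Toy sketch of the trycycle loop; the callables are hypothetical stand-ins
# for a coding agent's plan / review / implement steps.
def trycycle(define, plan, plan_ok, implement, impl_ok, max_tries=10):
    problem = define()                      # 1. Define the problem.
    for _ in range(max_tries):
        p = plan(problem)                   # 2. Write a plan.
        if plan_ok(p):                      # 3. Is the plan perfect? If not, try again.
            break
    for _ in range(max_tries):
        result = implement(p)               # 4. Implement the plan.
        if impl_ok(result):                 # 5. Is the implementation perfect? If not, try again.
            return result
    raise RuntimeError("no acceptable implementation after max_tries")
```

The whole trick is that each pass back through <code>plan</code> or <code>implement</code> starts from a slightly better state, so the loop converges instead of thrashing.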



<p>It’s simple because it’s just a skill. Under the hood, it adapts <a href="https://blog.fsck.com/" target="_blank" rel="noreferrer noopener">Jesse Vincent</a>’s amazing <a href="https://github.com/obra/superpowers" target="_blank" rel="noreferrer noopener">Superpowers</a> for plan writing and executing. It will take you literally minutes to get started. Just paste this into your agent and you’re off to the three-wheel races.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><code>Hey agent! Go here and follow the installation instructions.</code><br><code>https://raw.githubusercontent.com/danshapiro/trycycle/main/README.md</code></p>
</blockquote>



<p>Trycycle is barely 24 hours old as of the time of this writing. I’ve shipped well over a dozen features with it already, and I was in meetings most of the day. While I was having dinner, it ported Rogue to Wasm(!). Last night it churned for 7 hours and 56 minutes and landed six features for <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>.</p>



<p>The best part, though, is that because it’s just a skill, it’s instantly part of your dev flow. There’s no configuration or learning curve. If you want to understand it better, just ask. If you don’t like what it’s doing, have stern words.</p>



<h2 class="wp-block-heading">Which one to use?</h2>



<p>Here’s how I’d decide right now.</p>



<p>If you want to become part of a <strong>growing movement of collaborators</strong> burning tokens together to build software, individually and collectively—try <a href="https://steve-yegge.medium.com/welcome-to-the-wasteland-a-thousand-gas-towns-a5eb9bc8dc1f" target="_blank" rel="noreferrer noopener">Gas Town</a>.</p>



<p>If you want to invest in building a <strong>powerful, configurable, sophisticated engine</strong> that can drive your projects forward 24 hours a day—try <a href="https://github.com/danshapiro/kilroy" target="_blank" rel="noreferrer noopener">Kilroy</a>.</p>



<p>If you just want to <strong>get things done right now</strong>, give <a href="https://github.com/danshapiro/trycycle" target="_blank" rel="noreferrer noopener">Trycycle</a> a spin. Heck, it’s fast enough that you can spin up a trycycle while you read the docs on Kilroy and Gas Town.</p>



<p>And whatever you choose, I recommend you do it with <a href="http://freshell.net/" target="_blank" rel="noreferrer noopener">freshell</a>, because it’s just more delightful that way!</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Thanks to </em><a href="http://harperreed.com/" target="_blank" rel="noreferrer noopener"><em>Harper Reed</em></a><em>, </em><a href="https://steve-yegge.medium.com/" target="_blank" rel="noreferrer noopener"><em>Steve Yegge</em></a><em>, </em><a href="http://fsck.com/" target="_blank" rel="noreferrer noopener"><em>Jesse Vincent</em></a><em>, </em><a href="http://remixpartners.ai/" target="_blank" rel="noreferrer noopener"><em>Justin Massa</em></a><em>, </em><a href="https://nathan.torkington.com/" target="_blank" rel="noreferrer noopener"><em>Nat Torkington</em></a><em>, </em><a href="https://vibes.diy/" target="_blank" rel="noreferrer noopener"><em>Marcus Estes</em></a><em>, and </em><a href="https://www.linkedin.com/in/arjun-singh-629216105/" target="_blank" rel="noreferrer noopener"><em>Arjun Singh</em></a><em> for reading drafts of this.</em></p>
</blockquote>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/dark-factories-rise-of-the-trycycle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Scenario Planning for AI and the “Jobless Future”</title>
		<link>https://www.oreilly.com/radar/scenario-planning-for-ai-and-the-jobless-future/</link>
				<comments>https://www.oreilly.com/radar/scenario-planning-for-ai-and-the-jobless-future/#respond</comments>
				<pubDate>Mon, 20 Apr 2026 10:41:09 +0000</pubDate>
					<dc:creator><![CDATA[Tim O’Reilly]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18531</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Scenario-planning-for-an-uncertain-AI-future.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Scenario-planning-for-an-uncertain-AI-future-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[We all read it in the daily news. The New York Times reports that economists who once dismissed the AI job threat are now taking it seriously. In February, Jack Dorsey cut 40% of Block&#8217;s workforce, telling shareholders that &#8220;intelligence tools have changed what it means to build and run a company.&#8221; Block&#8217;s stock rose [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>We all read it in the daily news. <em>The New York Times</em> reports that <a href="https://www.nytimes.com/2026/04/03/business/economists-once-dismissed-the-ai-job-threat-but-not-anymore.html" target="_blank" rel="noreferrer noopener">economists who once dismissed the AI job threat are now taking it seriously</a>. In February, Jack Dorsey <a href="https://fortune.com/2026/02/27/block-jack-dorsey-ceo-xyz-stock-square-4000-ai-layoffs/" target="_blank" rel="noreferrer noopener">cut 40% of Block&#8217;s workforce</a>, telling shareholders that &#8220;intelligence tools have changed what it means to build and run a company.&#8221; Block&#8217;s stock rose 20%. Salesforce has <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">shed thousands of customer support workers</a>, saying AI was already doing half the work. And a <a href="https://fortune.com/2026/02/28/ai-scare-trade-mass-layoffs-white-collar-recession-citrini-shumer-viral-doomsday-essays/" target="_blank" rel="noreferrer noopener">Stanford study</a> found that software developers aged 22 to 25 saw employment drop nearly 20% from its peak, while developers over 26 were doing fine.</p>



<p>But how are we to square this news with a <a href="https://greatleadership.substack.com/p/the-ai-transition-five-year-crisis" target="_blank" rel="noreferrer noopener">Vanguard study</a> that found that the 100 occupations most exposed to AI were actually outperforming the rest of the labor market in both job growth and wages, and a <a href="https://greatleadership.substack.com/p/the-ai-transition-five-year-crisis" target="_blank" rel="noreferrer noopener">rigorous NBER study</a> of 25,000 Danish workers that found zero measurable effect of AI on earnings or hours?</p>



<p>Other studies could contribute to either side of the argument. For example, <a href="https://www.pwc.com/gx/en/services/ai/ai-jobs-barometer.html" target="_blank" rel="noreferrer noopener">PwC&#8217;s 2025 Global AI Jobs Barometer</a>, analyzing close to a billion job ads across six continents, found that workers with AI skills earn a 56% wage premium, and that productivity growth has nearly quadrupled in the industries most exposed to AI.</p>



<p>This is exactly the kind of contradictory, uncertain landscape that scenario planning was designed for. Scenario planning doesn’t ask you to predict what the future will be. It asks you to imagine divergent possible futures and to develop a strategy that improves your odds of success across all of them. I&#8217;ve used it many times at O’Reilly and have written about it before with <a href="https://www.oreilly.com/tim/21stcentury/" target="_blank" rel="noreferrer noopener">COVID</a> and <a href="https://www.oreilly.com/tim/wtf-book.html" target="_blank" rel="noreferrer noopener">climate change</a> as illustrative examples. The argument between those who say AI will cause mass unemployment and those who insist technology always creates more jobs than it destroys is a debate that will only be resolved by time. Both sides have evidence. Both are probably right at some level. And both framings are not terribly helpful for anyone trying to figure out what to do next.</p>



<p>In a scenario planning exercise, you identify two key uncertainties and draw them as crossing vectors, dividing the possibility space into four quadrants. Each quadrant describes a different future. The power of the technique is that you don&#8217;t bet on one quadrant. You look for actions that make the most sense across all of them. And you’re not limited to doing this for only one uncertainty. You can repeat the exercise multiple times, each time expanding your sense of possible futures and clarifying your convictions about the most robust strategies for adapting to them.</p>



<p>For AI and jobs, the most obvious crossing vectors to model might seem to be how fast AI grows in its ability to replace human work and how quickly that capability is adopted. This is, in effect, scenario planning about whether the “<a href="https://ai-2027.com/" target="_blank" rel="noreferrer noopener">AI is unprecedented</a>” or “<a href="https://knightcolumbia.org/content/ai-as-normal-technology" target="_blank" rel="noreferrer noopener">AI is normal technology</a>” camp is correct. That might well be a useful pair of axes.</p>



<p>There’s no question that AI capability is accelerating. <a href="https://almcorp.com/blog/ai-job-displacement-statistics/" target="_blank" rel="noreferrer noopener">SWE-Bench scores</a> for coding went from solving 4.4% of problems in 2023 to 71.7% in 2024, and we saw what was widely described as <a href="https://medium.com/@NMitchem/something-flipped-in-december-423e8b808262" target="_blank" rel="noreferrer noopener">a “step change”</a> beyond that in December of 2025. <a href="https://www.nytimes.com/2026/04/07/technology/anthropic-claims-its-new-ai-model-mythos-is-a-cybersecurity-reckoning.html" target="_blank" rel="noreferrer noopener">Anthropic’s new Mythos model seems to have upped AI capabilities</a> even further. Even before Mythos, <a href="https://almcorp.com/blog/ai-job-displacement-statistics/" target="_blank" rel="noreferrer noopener">McKinsey estimated</a> that today&#8217;s technology could in theory automate roughly 57% of current US work hours. But capability is not adoption. Goldman Sachs notes that AI appears to be <a href="https://theworlddata.com/ai-job-displacement-statistics/" target="_blank" rel="noreferrer noopener">suppressing hiring more than destroying existing jobs</a> in the near term. <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">Yale&#8217;s Budget Lab</a>, analyzing US labor data from 2022 to 2025, found no massive shift in the share of workers across occupations. Deployment, not capability, seems to be the limiting factor.</p>



<p>As a result, it makes sense to me to synthesize these two factors (capability increase and rate of adoption) into a single vector that we can call <strong>the scale and speed of impact</strong>. The question on this axis, therefore, is not just &#8220;How good does AI get?&#8221; but also &#8220;How fast does the economy actually reorganize around it?&#8221;</p>



<p>What’s a good second vector to cross with this one? If you’ve read my book <a href="https://www.oreilly.com/tim/wtf-book.html" target="_blank" rel="noreferrer noopener"><em>WTF?</em></a> or other things I’ve written about the role of human choices in shaping the future, you probably won’t be surprised that the second vector I’ve chosen reflects my conviction that the future depends on <strong>whether AI capability is primarily used to achieve efficiencies in existing work or to <em>do more</em>, to solve new problems and serve more human needs.</strong></p>
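<p>Laid out as a grid, these two vectors give four quadrants. A minimal sketch (the labels are my paraphrases of the two axes, not the article’s official names):</p>

```python
from itertools import product

# The two crossing vectors, paraphrased from the discussion above.
impact = ["smaller/slower impact", "larger/faster impact"]
orientation = ["used mainly for efficiency", "used mainly to do more"]

# Crossing them yields the four scenario quadrants.
quadrants = [f"{i} / {o}" for i, o in product(impact, orientation)]
for q in quadrants:
    print(q)
```
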



<p>When Dorsey says a smaller team can now do the same work, that&#8217;s efficiency. When <a href="https://www.humai.blog/ai-discovered-drugs-reach-phase-iii-and-2026-will-determine-whether-all-the-promises-were-real/" target="_blank" rel="noreferrer noopener">Insilico Medicine</a> uses AI to design a drug for idiopathic pulmonary fibrosis in a fraction of the time traditional discovery takes (with <a href="https://www.humai.blog/ai-discovered-drugs-reach-phase-iii-and-2026-will-determine-whether-all-the-promises-were-real/" target="_blank" rel="noreferrer noopener">over 173 other AI-discovered drugs also now in clinical development</a> and 15 to 20 entering pivotal Phase III trials this year), that&#8217;s not replacing a human job. That&#8217;s doing something that wasn&#8217;t being done before. But we shouldn’t content ourselves with the idea that the “do more” axis is just about technical breakthroughs. It might lie in serving a vastly larger number of people far more effectively <em>and </em>efficiently. When Todd Park says that his company, <a href="https://www.devoted.com/about-us/" target="_blank" rel="noreferrer noopener">Devoted Health</a>, “is on a mission to dramatically improve the health and well-being of older Americans,” that is a call to do more. Given the size of the existing markets that need to be transformed, it is likely that even with 10x or 100x efficiency gains from AI, Devoted’s 1,000x mission might require more resources, including people.</p>



<h2 class="wp-block-heading">What will be scarce?</h2>



<p>I’ve always assumed that the “do more” orientation is chiefly a moral argument driven by human judgment about what kind of world we’d prefer to live in. As the <a href="https://www.imf.org/en/blogs/articles/2026/01/14/new-skills-and-ai-are-reshaping-the-future-of-work" target="_blank" rel="noreferrer noopener">IMF noted</a> earlier this year, &#8220;Work brings dignity and purpose to people&#8217;s lives. That&#8217;s what makes the AI transformation so consequential.&#8221; A world of concentrated value capture leading to a split between those with capital to invest and a permanent unemployed underclass is the stuff of dystopian science fiction.</p>



<p>But it’s not just a matter of inequality and the importance of work to human self-esteem. I’ve also become convinced that companies that lean into new possibilities and expand markets do better than those that simply do the same things more cheaply. And the evidence is starting to come in that this is true.&nbsp;<a href="https://www.pwc.com/gx/en/news-room/press-releases/2026/pwc-2026-ai-performance-study.html" target="_blank" rel="noreferrer noopener">According to PwC</a>, &#8220;Three-quarters of AI’s economic gains are being captured by just 20% of companies—with the leading companies focused on growth, not just productivity&#8230;. The research shows that these top‑performing companies are not simply deploying more AI tools. Instead, they are using AI as a catalyst for growth and business reinvention, particularly by pursuing new revenue opportunities created as industries converge, while building strong foundations around data, governance and trust.&#8221;</p>



<p>There are also a number of economic arguments for why the jobless future is just not going to happen. These arguments provide useful guidance into the structural changes to the economy that workers, business leaders, and politicians should be planning for.</p>



<p><a href="https://www.noahpinion.blog/p/salarymen-specialists-and-small-businesses" target="_blank" rel="noreferrer noopener">Noah Smith pointed</a> to a <a href="https://www.dropbox.com/scl/fo/689u1g785x8jp6c8v1s21/AIe0jfrZy_viIKCCET-U0r0/2026.03.30%20Bundles%20WP%20Version.pdf?rlkey=ottgcu71u1t4mhn6tblvatu8w&amp;e=2&amp;st=dj6k0x2o&amp;dl=0" target="_blank" rel="noreferrer noopener">draft economics paper by Garicano, Li, and Wu</a> that helps explain how the trade-offs between efficiency and expanding output might impact jobs. Garicano, Li, and Wu note that “the effect of AI on an occupation depends not just on which tasks AI can perform but also on how costly it is to unbundle those tasks from the job.” They model jobs as bundles of tasks, and distinguish between &#8220;strongly bundled&#8221; jobs (where the same person has to do multiple interdependent tasks) and &#8220;weakly bundled&#8221; ones (where tasks can easily be split between a human and an AI). AI replaces the weakly bundled jobs first. But even for weakly bundled jobs, automation only replaces human labor <em>after <a href="https://en.wikipedia.org/wiki/Price_elasticity_of_demand" target="_blank" rel="noreferrer noopener">demand becomes inelastic</a></em>, after AI is so productive at the task that making more of the output hits diminishing returns.</p>



<p>Until that point, increased productivity from AI can be focused on expanding output rather than shrinking headcount. That is another way of saying that whether AI replaces workers or augments them depends in large part on whether there is unmet demand to absorb the increased productivity. If we use AI only to do the same things more cheaply, we hit that inelastic point fast, and jobs disappear. If we use it to do new things, demand keeps expanding and people keep working. University of Chicago economist Alex Imas believes that just how much <a href="https://www.technologyreview.com/2026/04/06/1135187/the-one-piece-of-data-that-could-actually-shed-light-on-your-job-and-ai/" target="_blank" rel="noreferrer noopener">demand elasticity there is on a job by job basis</a> is one of the big questions of our time.</p>
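<p>The elasticity threshold can be made concrete with a toy first-order calculation (my illustration, not the Garicano, Li, and Wu model): if the cost savings from a productivity gain are passed through to prices, output demanded grows by roughly elasticity times the gain, while the labor needed per unit of output falls by the gain.</p>

```python
def labor_demand_change(productivity_gain, demand_elasticity):
    """Toy first-order estimate of the fractional change in labor demanded
    when a productivity gain is fully passed through to prices. An
    illustration of the elasticity threshold, not the paper's model."""
    output_growth = demand_elasticity * productivity_gain
    # Labor change is output growth minus the per-unit labor saved:
    # (elasticity - 1) * gain.
    return output_growth - productivity_gain

print(labor_demand_change(0.10, 1.5))   # elastic demand: labor demand rises
print(labor_demand_change(0.10, 0.5))   # inelastic demand: labor demand falls
```

Elasticity of exactly 1 is the break-even point: expanding output absorbs the whole productivity gain and employment holds steady, which is why the question of where each job sits relative to that threshold matters so much.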



<p>But that’s not all. In a new essay called &#8220;<a href="https://aleximas.substack.com/p/what-will-be-scarce" target="_blank" rel="noreferrer noopener">What Will Be Scarce?</a>&#8221; Imas points out that when a new technology makes one sector dramatically more productive, one part of the economy shrinks but another grows. When agriculture was mechanized, 40% of the American workforce moved off farms, but the economy actually grew, because people spent their rising real incomes on fundamentally different things. Imas argues, drawing on <a href="https://www.econometricsociety.org/publications/econometrica/2021/01/01/structural-change-long-run-income-and-price-effects" target="_blank" rel="noreferrer noopener">work by Comin, Lashkari, and Mestieri</a>, that income effects account for over 75% of observed patterns of structural change. As people get richer, they want fundamentally different things.</p>



<p>What are those things? Imas calls it &#8220;the relational sector&#8221;: goods and services where the human element is itself part of the value. Think teachers, nurses, therapists, hospitality workers, artisans, performers, personal chefs, community curators, and more. He opens his piece with Starbucks. In pursuit of economic efficiency, the company tried to automate more and more of its operations. CEO Brian Niccol concluded that it was a mistake, that handwritten notes on cups, ceramic mugs, and good seats drove customer satisfaction. More baristas are being hired per store, and automation is being rolled back.</p>



<p>But there’s far more to the relational sector than service jobs. Imas identifies a further dimension in what René Girard called <a href="https://medium.com/perennial/what-is-mimetic-desire-and-why-it-matters-more-than-you-think-53f28ba7cce8" target="_blank" rel="noreferrer noopener">mimetic desire</a>, the idea that people don&#8217;t just want objects for their functional properties. They want things that others want, and they want them more when they&#8217;re scarce and exclusive. (Hobbes and Rousseau made this same point.) <a href="https://academic.oup.com/restud/article/91/4/2347/7243247" target="_blank" rel="noreferrer noopener">Imas&#8217;s experimental research</a> shows that willingness to pay roughly doubles when people learn that others will be excluded from a product. And in <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6302659" target="_blank" rel="noreferrer noopener">new work with Graelin Mandel</a>, he finds that AI involvement undermines the perceived exclusivity of a good. Human-made artwork gained 44% in value from exclusivity; AI-generated artwork gained only 21%. The mere involvement of AI made the work feel inherently reproducible.</p>



<p>This means the relational sector has naturally high income elasticity. If AI makes production cheaper and real incomes rise, spending shifts toward goods where the human element matters. This is <a href="https://en.wikipedia.org/wiki/Baumol_effect" target="_blank" rel="noreferrer noopener">Baumol&#8217;s cost disease</a> working as a feature, not a bug: The sector that resists automation becomes relatively more expensive, and that&#8217;s precisely where spending and employment grow. This is an economic mechanism that could power the upper quadrants of the scenario grid that we will look at shortly, not just as a matter of moral choice but as a structural tendency of rich economies getting richer.</p>



<p>I’m going to include both Noah’s ideas and Alex’s in my scenario planning exercise, since they fit right in.</p>



<h2 class="wp-block-heading">Four possible futures</h2>



<p>Let’s look at how the two vectors cross each other and give us four futures.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1600" height="1219" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11.png" alt="Four futures vectors" class="wp-image-18568" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11.png 1600w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11-300x229.png 300w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11-768x585.png 768w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-11-1536x1170.png 1536w" sizes="auto, (max-width: 1600px) 100vw, 1600px" /></figure>



<p><strong>Upper left: The Augmentation Economy.</strong> AI capability grows but adoption is gradual, and workers are augmented rather than replaced. A programmer who once wrote 100 lines of code a day now ships features that used to take a team. A nurse practitioner aided by AI diagnostic tools provides care that once required a specialist. A small business owner uses AI to access legal and financial services previously available only to large corporations. This is the quadrant where the PwC finding about the 56% wage premium makes the most sense. AI becomes a tool that makes individual workers more productive and more valuable, and the gains flow broadly. Whether this becomes a positive, growing economy depends at least in part on the choices employers make. They use the increased efficiency to build better services, not just to make them cheaper. Doctors and nurses have more time with patients and less time with paperwork. As services become more efficient, they can be offered to more people at lower cost.</p>



<p><strong>Lower left: The Slow Squeeze.</strong> AI grows, adoption is gradual, and the primary use is efficiency. This is in many ways the most insidious quadrant, because it doesn&#8217;t look like a crisis. It looks like a normal economy with slightly fewer entry-level jobs each year, slightly more pressure on wages, and slightly less bargaining power for workers. That Stanford study on young software developers is a signal from this quadrant. So is the <a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance" target="_blank" rel="noreferrer noopener">HBR finding</a> that companies are laying off workers because of AI&#8217;s <em>potential</em>, not its performance. The Slow Squeeze is the world where companies use AI to pad margins without passing the gains along or investing in new capabilities.</p>



<p><strong>Lower right: The Displacement Crisis.</strong> AI advances fast and is adopted rapidly, almost entirely for efficiency. This is the future the doomsayers warn about, the <a href="https://fortune.com/2026/02/28/ai-scare-trade-mass-layoffs-white-collar-recession-citrini-shumer-viral-doomsday-essays/" target="_blank" rel="noreferrer noopener">Citrini Research scenario</a> of unemployment topping 10% and the S&amp;P 500 tanking. Block&#8217;s 40% cut is a signal from this quadrant, whether or not Dorsey&#8217;s prediction that most companies will follow suit within a year turns out to be right. <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">Deutsche Bank analysts</a> warn that &#8220;AI redundancy washing,&#8221; companies blaming layoffs on AI that are really driven by other cost-cutting, will be a significant feature of 2026. But the fact that Wall Street rewarded Block with a 20% stock price jump for firing 4,000 people tells you what the current incentive structure is optimizing for.</p>



<p><strong>Upper right: The Great Transformation.</strong> AI capability advances rapidly and is adopted fast, but the primary use is to do more, not just the same with less. Whole new industries emerge. The <a href="https://www.weforum.org/stories/2025/12/ai-paradoxes-in-2026/" target="_blank" rel="noreferrer noopener">WEF&#8217;s projection of 170 million new roles by 2030</a> comes true, far exceeding the 92 million displaced. AI-driven drug discovery actually delivers on its promise. New forms of education, personalized to every learner, actually reach people the old system never served. The transition is still brutal, because the people losing old jobs and the people getting new ones are not the same people, in the same places, with the same skills. <a href="https://www.brookings.edu/articles/measuring-us-workers-capacity-to-adapt-to-ai-driven-job-displacement/" target="_blank" rel="noreferrer noopener">Brookings has identified 6.1 million workers</a> with high AI exposure and low adaptive capacity, 86% of them women in clerical and administrative roles. But the net direction is toward more human capability, not less.</p>



<p>Imas&#8217;s framework suggests that this quadrant will feature an explosion of durable jobs in the relational sector. Some of these will be high-touch service jobs: doctors, nurses, therapists, teachers, personal trainers, craft producers, experience designers, hospitality workers, and roles that haven&#8217;t been invented yet. The relational sector already employs nearly 50 million people in the US. But another big part of it will be creating exclusive products and services that become objects of desire. Art critic Dave Hickey calls this “<a href="https://www.amazon.com/Air-Guitar-Essays-Art-Democracy/dp/0963726455/" target="_blank" rel="noreferrer noopener">the big beautiful art market</a>” that happens when industrial products are “sold on the basis of what they mean rather than what they do.” The structural change model predicts that both of these areas will grow as a share of the economy, not because they resist automation as a technical matter but because not being automated is part of their value proposition.</p>



<p>Noah Smith&#8217;s <a href="https://www.noahpinion.blog/p/salarymen-specialists-and-small-businesses" target="_blank" rel="noreferrer noopener">taxonomy of future work</a> also helps fill in what life may actually look like across these quadrants. He divides AI-affected jobs into three categories: <em>specialists</em> whose jobs are &#8220;strongly bundled&#8221; (for example, an experienced engineer whose judgment can&#8217;t be separated from the rest of what they do), <em>salarymen</em> (generalists whose value comes from knowing how to wrangle AI and plug its ever-shifting gaps, much like the Japanese corporate model where long-tenured employees rotate between divisions and accumulate firm-specific knowledge rather than portable technical skills), and <em>small businesspeople</em> (entrepreneurs who use AI as leverage to run what would previously have required a much larger team). This is the future that Steve Yegge envisions with its “millions of one-person startups.”</p>



<p>In the upper quadrants, all three categories thrive. Specialists do well because AI expands the scope of what their bundled expertise can accomplish. Salarymen thrive because companies that are doing more, not just doing the same with less, need people who can adapt to constantly changing tool capabilities within the context of their business. And small businesses proliferate because AI gives a one-person shop the productive capacity that used to require a department.</p>



<p>In the lower quadrants, specialists may survive, but salarymen face pressure as companies optimize for headcount reduction rather than capability expansion, and small businesses struggle because the efficiency-first economy compresses the margins they need to exist.</p>



<h2 class="wp-block-heading"><strong>News from the future</strong></h2>



<p>In scenario planning, once you’ve chosen your vectors and imagined the resulting quadrants, you watch for &#8220;news from the future,&#8221; data points that signal which direction the world is actually heading. As with any scatter plot, the points are all over the map at first, but over time you start to see the trend lines emerge.</p>



<p>Right now, the signals are mixed.</p>



<p><strong>News from the lower quadrants:</strong> <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">Challenger, Gray &amp; Christmas reports</a> that AI was a significant contributing factor in nearly 55,000 US layoffs in 2025. Employee anxiety about AI-driven job loss has <a href="https://www.cnbc.com/2026/01/20/ai-impacting-labor-market-like-a-tsunami-as-layoff-fears-mount.html" target="_blank" rel="noreferrer noopener">jumped from 28% in 2024 to 40% in 2026</a>. <a href="https://www.weforum.org/stories/2025/12/ai-paradoxes-in-2026/" target="_blank" rel="noreferrer noopener">40% of employers globally</a> told the WEF they plan to reduce their workforce where AI can automate tasks within five years. And the entry-level job market is tightening in ways that compound over time even if they don&#8217;t show up in headline unemployment numbers. Brookings found that <a href="https://katv.com/news/nation-world/millions-of-americans-face-risk-from-ai-disrupting-vital-gateway-jobs-career-pathways-artificial-intelligence-workplace-disruptions-brookings-metro-opportunity-at-work" target="_blank" rel="noreferrer noopener">the &#8220;gateway&#8221; occupations</a> that serve as stepping stones from low-wage to middle-wage work are among the most exposed to AI, threatening career pathways, not just individual jobs.</p>



<p><strong>News from the upper quadrants:</strong> The PwC wage premium data. The Vanguard finding that AI-exposed occupations are growing, not shrinking. The explosion of AI drug discovery programs. <a href="https://greatleadership.substack.com/p/the-ai-transition-five-year-crisis" target="_blank" rel="noreferrer noopener">MIT&#8217;s David Autor</a> has shown that 60% of today&#8217;s US employment is in job categories that didn&#8217;t exist in 1940. New task creation is how technology has always generated new work, and there&#8217;s no reason to believe AI is exempt from that pattern, unless we choose to use it only for efficiency.</p>



<p>There may also be some signal in reports that usage among developers is becoming more intensive and continuous, from multistep coding workflows to automated agents running in loops. Some engineers are &#8220;tokenmaxxing,&#8221; with some at companies like Meta <a href="https://www.theinformation.com/articles/meta-employees-vie-ai-token-legend-status" target="_blank" rel="noreferrer noopener">treating AI consumption as a productivity benchmark</a>. This is driving rapid revenue growth for AI providers but squeezing their margins as infrastructure costs rise. That margin pressure may sound like bad news, but it&#8217;s actually a classic pattern by which a technology crosses from &#8220;tool&#8221; to &#8220;infrastructure.&#8221; Cloud computing margins were terrible until scale and hardware improvements drove unit costs down, at which point the providers who had built habit and lock-in harvested enormous returns. AI inference costs have been dropping roughly 10x per year, and price competition is accelerating that decline. The margin squeeze is the mechanism by which AI becomes cheap enough to be ubiquitous. And the tokenmaxxing engineers are doing dramatically more iterations, more exploration, with more ambitious scope. That&#8217;s &#8220;doing more&#8221; behavior, not efficiency behavior.</p>



<p>It’s still unclear, though, whether all those tokens are producing real value or whether some of this is the AI equivalent of crypto mining. If most of those tokens are productive, we&#8217;re looking at a productivity boom. If many are wasted, the adoption curve may have a big dip in it before the industry matures. Either way, the direction is toward AI becoming economic and technological infrastructure. It’s important to remember that tokens spent trying out prototypes that are rejected are not necessarily wasted. They can be part of a new development process that&#8217;s expanding the space of possibilities.</p>



<p><strong>News that doesn&#8217;t fit neatly into any quadrant:</strong> We appear to be in what Smith calls a <a href="https://www.noahpinion.blog/p/salarymen-specialists-and-small-businesses" target="_blank" rel="noreferrer noopener">&#8220;no-hire, no-fire&#8221; economy</a>, where workers hunker down in their current jobs and refuse to switch, and companies keep them rather than hiring new workers. That&#8217;s consistent with a world where people sense that their portable technical skills are depreciating, so they cling to the firm-specific knowledge that still makes them valuable where they are. It&#8217;s also consistent with the NBER Denmark study finding task reorganization without job loss: AI is replacing tasks, not (yet) jobs. Nonetheless, it is clear that a dearth of entry-level positions will be a serious issue.</p>



<p>A <a href="https://www.pittwire.pitt.edu/features-articles/2026/04/02/artificial-intelligence-data-labor" target="_blank" rel="noreferrer noopener">University of Pittsburgh researcher</a> has been calling state unemployment offices one by one to assemble the granular data that doesn&#8217;t yet exist in federal statistics, because our measurement tools are not yet fine-grained enough to see what&#8217;s happening. If you&#8217;re confused about whether AI is causing job losses, <a href="https://www.pittwire.pitt.edu/features-articles/2026/04/02/artificial-intelligence-data-labor" target="_blank" rel="noreferrer noopener">he put it plainly</a>: The likely problem is a lack of data. If AI is having an impact, we may just not be equipped to see it yet with the instruments we have. We’re getting new data points daily. Asking yourself which future they support can gradually increase your confidence in what is coming.</p>



<h2 class="wp-block-heading"><strong>Robust strategy</strong></h2>



<p>The goal of a scenario planning exercise is to stretch your thinking so that you can make strategic choices that make sense regardless of which future unfolds. Scenario planners call this a “robust strategy.”</p>



<p><strong>If you&#8217;re a business leader,</strong> the robust strategy is not to ask &#8220;How many people can I replace with AI?&#8221; It&#8217;s to ask &#8220;What can we do now that we couldn&#8217;t do before?&#8221; The companies that will thrive across all four quadrants are the ones that use AI to expand what&#8217;s possible, not just to shrink how much they have to spend. Aim for the upper right quadrant, and you’ll do better even if the rest of the world chooses otherwise.</p>



<p>That&#8217;s not just scenario planning. It&#8217;s Clay Christensen on the lessons of disruptive technologies. A disruptive technology is not defined by the markets it destroys but by the new markets and new possibilities it creates. As Christensen observed, RCA didn’t ignore the transistor; its leaders just thought it wasn’t good enough for its current customers. Sony embraced the new technology and created a new market of portable devices where the quality difference between transistors and vacuum tubes just didn’t matter. And of course, as Clay observed, the disruptive technology continues to improve.</p>



<p><strong>If you&#8217;re a worker</strong>, one element of robust strategy is to band together, as the <a href="https://www.wga.org/contracts/know-your-rights/artificial-intelligence" target="_blank" rel="noreferrer noopener">screenwriters guild did</a>, and to make the case that the productivity gains from AI should be shared with workers and used to amplify their skills and efforts. Don’t resist AI, but instead use it to make yourself even more valuable. Use it to amplify your uniqueness. That is, lean into the augmentation economy. One of the things we’ve learned from the early advances in AI-enabled software engineering is that a great software engineer can get more out of AI than a vibe-coding beginner. This is true of other professions as well. Find ways that your human uniqueness makes the output of AI even more valuable.</p>



<p>Create professional associations that lean into mentorship and an AI-enriched career ladder, but aren’t afraid to take a political stance. The idea that providers of capital are entitled to all of the gains is pernicious; it has created an engine of inequality rather than of wide prosperity. It doesn’t have to be that way. Professional associations and other forms of solidarity are a possible source of countervailing power. (But don’t fall into the trap that many unions and professional associations do, of using that power to extract rents rather than increasing value for everyone.) Preferentially choose employers who are investing in training employees for a human + AI future, including <a href="https://www.axios.com/2026/02/13/ai-ibm-tech-jobs" target="_blank" rel="noreferrer noopener">at the beginning of the career ladder</a>.</p>



<p>If you&#8217;re a specialist, deepen the parts of your expertise that are strongly bundled, the judgment and context and human relationships that can&#8217;t be separated from the technical work. If you&#8217;re a generalist inside a company, become the person who understands what AI can and can&#8217;t do and fills the gaps, whose value comes from adaptability and firm-specific knowledge rather than a fixed set of technical skills. And if you have entrepreneurial instincts, recognize that AI is creating leverage that may make it possible to run a viable business at a scale that previously couldn&#8217;t support one.</p>



<p>Imas&#8217;s work suggests that the most durable career paths <em>may not be defined by which tasks AI can&#8217;t do (a moving target) but by whether the human element is part of what the customer is paying for</em>. A restaurateur, a therapist, a teacher who knows your child, a guide who knows the trail: these aren&#8217;t jobs that survive because AI hasn&#8217;t gotten to them yet. They&#8217;re jobs where <em>human involvement is the product</em>.</p>



<p><strong>If you&#8217;re an entrepreneur</strong>, the robust strategy is the one it has always been: look at the world as it is, determine what work needs doing, and do it. Don&#8217;t build AI tools that replace humans doing things that are already being done adequately. Build AI tools that let humans do things that have never been done before.</p>



<p><strong>If you&#8217;re a policymaker</strong>, the robust strategy is to invest in the transition regardless of how fast displacement turns out to be. Create policies that give workers more of a role in how AI is used. Support positions like those of the writers guild, which allow workers to get a share of the gains from using AI. And if capital runs wild with labor replacement, tax the gains so they can be redistributed. Shorten the working week.</p>



<p>Education and lifelong learning programs, portable benefits, support for geographic mobility, and investment in the industries of the future pay off in every quadrant. So does reducing the regulatory friction that keeps new entrants trapped in old cost structures, funding basic research that the market underinvests in, and building the kind of infrastructure (physical and institutional) that enables rapid adaptation.</p>



<h2 class="wp-block-heading"><strong>The future is up to us</strong></h2>



<p>I’ll return to the theme that I sounded in my book <a href="https://www.amazon.com/WTF-Whats-Future-Why-Its/dp/0062565710" target="_blank" rel="noreferrer noopener"><em>WTF? What’s the Future and Why It’s Up To Us</em></a><em>.</em></p>



<p>Every time a company uses AI to do what it was already doing with fewer people, it is making a choice for the lower half of the scenario grid. Every time a company uses AI to do something that wasn&#8217;t previously possible, to serve a customer who wasn&#8217;t previously served, to solve a problem that wasn&#8217;t previously solvable, it is making a choice for the upper half. These choices compound, for good or ill. An economy that uses AI primarily for efficiency will slowly hollow itself out.</p>



<p>Looking at the news from the future, both sets of signals are present. The question is which will dominate. AI will give us both the Augmentation Economy and the Displacement Crisis, in different measures in different places, depending on the choices we make.</p>



<p>Scenario planning teaches us that we don&#8217;t have to predict which future we&#8217;ll get. We do have to prepare for a very uncertain future. But the robust strategy, the one that works across every quadrant, is to focus on doing more, not just doing the same with less, and to find ways that human taste still matters in what is created. As long as there is unmet demand, as long as there are problems we haven&#8217;t solved and people we haven&#8217;t served, AI will augment human work rather than replacing it. It&#8217;s only when we stop looking for new things to do that the machines come for the jobs.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/scenario-planning-for-ai-and-the-jobless-future/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Trial by Fire: Crisis Engineering</title>
		<link>https://www.oreilly.com/radar/trial-by-fire-crisis-engineering/</link>
				<comments>https://www.oreilly.com/radar/trial-by-fire-crisis-engineering/#respond</comments>
				<pubDate>Fri, 17 Apr 2026 10:54:01 +0000</pubDate>
					<dc:creator><![CDATA[Jennifer Pahlka]]></dc:creator>
						<category><![CDATA[Innovation & Disruption]]></category>
		<category><![CDATA[Operations]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18556</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Trial-by-fire.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Trial-by-fire-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[A new book shows how to turn a crisis into the change you&#039;ve been waiting for]]></custom:subtitle>
		
				<description><![CDATA[The following article originally appeared on Jennifer Pahlka’s Eating Policy website and is being republished here with the author’s permission. I read Norman Maclean’s Young Men and Fire when I was a teenager, I think, so it’s been many years, but I still remember its turning point vividly. It’s set in 1949 in Montana, at [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article originally appeared on </em><a href="https://www.eatingpolicy.com/p/trial-by-fire-crisis-engineering" target="_blank" rel="noreferrer noopener"><em>Jennifer Pahlka’s Eating Policy website</em></a><em> </em><em>and is being republished here with the author’s permission.</em></p>
</blockquote>



<p>I read Norman Maclean’s <em>Young Men and Fire</em> when I was a teenager, I think, so it’s been many years, but I still remember its turning point vividly. It’s set in 1949 in Montana, at the Gates of the Mountains Wilderness, about an hour north of Helena. A fire is burning, and the Forest Service sends out their smokejumpers to fight it. But the fire changes direction without warning, and a group of smokejumpers working in the Mann Gulch find themselves trapped, facing certain death. Instead of running, the foreman, Wag Dodge, pulls out matches and does the unthinkable: He lights a fire.</p>



<p>Today we know what he was doing. The escape fire consumed the fuel around him, allowing the main fire to pass over him and a few of his colleagues. But in 1949, the families of the 13 other smokejumpers who died accused Wag of causing their deaths. To them, what he had done made no sense.</p>



<p>I love that Marina Nitze, Matthew Weaver, and Mikey Dickerson chose this story as a framing device for their new book, <a href="https://bookshop.org/p/books/crisis-engineering-time-tested-tools-for-turning-chaos-into-clarity-marina-nitze/44736d1287a7da6e" target="_blank" rel="noreferrer noopener"><em>Crisis Engineering: Time-Tested Tools for Turning Chaos Into Clarity</em></a>, out now. Not just because it brought back the memory of a book that I once loved, but because Maclean’s obsessive investigation of what had happened back then (he wrote the book years after the incident) seemed to me almost as heroic as the bravery of the smokejumpers. And indeed, his insistence on making sense of what happened has probably saved lives. Escape fires are now formally recognized and taught as a last resort tactic when training new firefighters.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="667" height="1000" src="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.jpeg" alt="Crisis Engineering book" class="wp-image-18557" srcset="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3.jpeg 667w, https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/image-3-200x300.jpeg 200w" sizes="auto, (max-width: 667px) 100vw, 667px" /></figure>



<p>The Dodge escape fire wouldn’t seem to have much to do with Three Mile Island or healthcare.gov or the pandemic unemployment insurance backlogs, but the authors use it to make a point about how action and understanding interact in a crisis. One key is exactly what Maclean himself did so well: <em>sensemaking</em>. In a crisis like Mann Gulch, sensemaking disintegrates: a broken radio, wind so strong communication is impossible, fire whose behavior violates well-tested assumptions, and a team scattered. You don’t achieve sensemaking by staring at a map; you achieve it by acting and observing results. Wag Dodge didn’t understand fire behavior well enough to explain the escape fire in advance. But his actions created the understanding itself—retrospectively, as all real sensemaking is.</p>



<p>The book’s key claim is that crises are opportunities, and the authors leverage Daniel Kahneman’s <em>Thinking, Fast and Slow</em> to explain why crises are the only real windows for organizational change—and why everything else, the incentives, the logical arguments, the reorganizations, mostly doesn’t work. Most organizations, most of the time, run on autopilot. People habituate to their environment, rationalize away small surprises, and build stable stories about how things work. A crisis breaks this. When surprise accumulates faster than the brain’s “surprise-removing machinery” can rationalize it away, the whole apparatus jams, and organizations become, briefly, reprogrammable.</p>



<p>An institution resolves a crisis in one of three ways, according to the authors. It makes durable deliberate change, it dies, or, most commonly, it rationalizes the failure into an accepted new normal. “Most large organizations contain programs and departments that passively accept abject failure: infinitely long backlogs, hospitals that kill patients, devastating school closures that do little to affect a pandemic. These are fossils of past crises where the organization failed to adapt.”</p>



<p>Too many of our public institutions have failed to adapt, and the idea that they might be reprogrammable at all is a bit radical. We live in an era when too many people have given up on them, willing to burn them to the ground rather than renovate them. If crises represent the chance for true transformation, then we’d better get a lot better at using them for that. This is explicitly why <em>Crisis Engineering</em> exists, and it’s a detailed, practical book—the theory and framing devices are well used, but there’s a ton of pragmatic substance here you’ll be grateful for when the moment comes.</p>



<p>I remember when I was working in the White House and frustrated by the slow pace of progress. My UK mentor Mike Bracken told me: “Hold on, you just need a crisis. You Americans only ever change in crisis.” Boom. About two months later, healthcare.gov had its inauspicious start. And he was right. Change followed. Not all the change we needed, but a start. Marina, Weaver, and Mikey are three of the people who drove that change. I got to work with them again the first summer of the pandemic on California’s unemployment insurance claims backlog. I’m not a crisis engineer, but their strategies and tactics have deeply influenced how I think about the work I do and how I think we’re going to get from the institutions we have today to the ones we need.</p>



<p>We may be living in an era when too many people have given up on institutions, but we are also likely entering an era of crisis, and even <a href="https://en.wikipedia.org/wiki/Polycrisis" target="_blank" rel="noreferrer noopener">polycrisis</a>. This makes for uncomfortable math, but also drives home the need for a new generation of crisis engineers.</p>



<p>When I first read about Mann Gulch, so many years ago, I remember being in awe of the ingenuity and courage it took to start Wag Dodge’s escape fire. Today I think a lot about that pattern: the controlled burns that reduce the risk of megafires, the little earthquakes that take the pressure off faults under great tension, the managed crises that, if we’re skilled enough to use them, keep our institutions from the kind of collapse that comes when nothing has been allowed to give for too long. Dodge didn’t burn things down. He burned a path through. We’re going to have to get good at that.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/trial-by-fire-crisis-engineering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Generative AI in the Real World: Aishwarya Naresh Reganti on Making AI Work in Production</title>
		<link>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-aishwarya-naresh-reganti-on-making-ai-work-in-production/</link>
				<comments>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-aishwarya-naresh-reganti-on-making-ai-work-in-production/#respond</comments>
				<pubDate>Thu, 16 Apr 2026 14:03:20 +0000</pubDate>
					<dc:creator><![CDATA[Ben Lorica and Aishwarya Naresh Reganti]]></dc:creator>
						<category><![CDATA[Generative AI in the Real World]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?post_type=podcast&#038;p=18544</guid>

		<enclosure url="https://cdn.oreillystatic.com/radar/generative-ai-real-world-podcast/GenAI_in_the_real_world_aishwarya_naresh_reganti_v2.mp4" length="0" type="audio/mpeg" />
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-scaled.png" 
				medium="image" 
				type="image/png" 
				width="2560" 
				height="2560" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2024/01/Podcast_Cover_GenAI_in_the_Real_World-160x160.png" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[As the founder and CEO of LevelUp Labs, Aishwarya Naresh Reganti helps organizations “really grapple with AI,” and through her teaching, she guides individuals who are doing the same. Aishwarya joined Ben to share her experience as a forward-deployed expert supporting companies that are putting AI into production. Listen in to learn the value all [&#8230;]]]></description>
								<content:encoded><![CDATA[
<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Generative AI in the Real World: Aishwarya Naresh Reganti on Making AI Work in Production" width="500" height="281" src="https://www.youtube.com/embed/Ajiu8uyfSq0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>As the founder and CEO of LevelUp Labs, Aishwarya Naresh Reganti helps organizations “really grapple with AI,” and through her teaching, she guides individuals who are doing the same. Aishwarya joined Ben to share her experience as a forward-deployed expert supporting companies that are putting AI into production. Listen in to learn the value <em>all</em> roles—from data folks and developers to SMEs like marketers—bring to the table when launching products; how AI flips the 80-20 rule on its head; the problem with evals (or at least, the term “evals”); enterprise versus consumer use cases; and when humans need to be part of the loop. “LLMs are super powerful,” Aishwarya explains. “So I think you need to really identify where to use that power versus where humans should be making decisions.” <a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0" target="_blank" rel="noreferrer noopener">Watch now</a>.</p>



<p>About the <em>Generative AI in the Real World</em> podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In <em>Generative AI in the Real World</em>, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.</p>



<p>Check out other episodes of this podcast <a href="https://learning.oreilly.com/playlists/42123a72-1108-40f1-91c0-adbfb9f4983b/?_gl=1*pra1u5*_gcl_au*Mzc5ODUxNDEzLjE3NzI3NDUyNzk.*_ga*NjI3OTAzNjIzLjE3NzI0NzYxMzg.*_ga_092EL089CH*czE3NzMwODg2NjgkbzI3JGcwJHQxNzczMDg4NjY4JGo2MCRsMCRoMA.." target="_blank" rel="noreferrer noopener">on the O’Reilly learning platform</a> or follow us on <a href="https://www.youtube.com/playlist?list=PL055Epbe6d5YcJUhZbsVW9dlMueIuOxK_" target="_blank" rel="noreferrer noopener">YouTube</a>, <a href="https://open.spotify.com/show/5C9oof8TFkP65lDUcEy5jT" target="_blank" rel="noreferrer noopener">Spotify</a>, <a href="https://podcasts.apple.com/us/podcast/generative-ai-in-the-real-world/id1835476293" target="_blank" rel="noreferrer noopener">Apple</a>, or wherever you get your podcasts.</p>



<h2 class="wp-block-heading">Transcript</h2>



<p><em>This transcript was created with the help of AI and has been lightly edited for clarity.</em></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=58" target="_blank" rel="noreferrer noopener">00.58</a><br><strong><strong>All right. So today we have Aishwarya Reganti, founder and CEO of </strong><a href="https://levelup-labs.ai/" target="_blank" rel="noreferrer noopener"><strong>LevelUp Labs</strong></a><strong>. Their tagline is “Forward-deployed AI experts at your service.” So with that, welcome to the podcast.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=73" target="_blank" rel="noreferrer noopener">01.13</a><br>Thank you, Ben. Super excited to be here.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=76" target="_blank" rel="noreferrer noopener">01.16</a><br><strong><strong>All right. So for our listeners, “forward-deployed”—that&#8217;s a term I think that first entered the lexicon mainly through Palantir, I believe: forward-deployed engineers. So that communicates that Aishwarya and team are very much at the forefront of helping companies really grapple with AI and getting it to work. So, first question is, we&#8217;re two years into these AI demos. What actually separates a real AI product from a good demo at this point?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=113" target="_blank" rel="noreferrer noopener">01.53</a><br>Yeah, very timely question. And yeah, we are a team of forward-deployed experts. A bit of background, to explain why we&#8217;ve probably seen quite a few demos fail: We work with enterprises to build a prototype for them and educate them about how to improve that prototype over time. I think one of the biggest things that differentiates a good AI product is how much effort a team is spending on calibrating it. I typically call this the 80-20 flip.&nbsp;</p>



<p>A lot of the folks who are building AI products as of today come from a traditional software engineering background. And when you&#8217;re building a traditional product, a software product, you spend 80% of the time on building and 20% of the time on what happens after building, right? You&#8217;re probably seeing a bunch of bugs, you&#8217;re resolving them, etc.&nbsp;</p>



<p>But in AI, that kind of gets flipped. You spend 20% of the time maybe building, especially with all of the AI assistants and all of that. And you spend 80% of the time on what I call “calibration,” which is identifying how your users behave with the product [and] how well the product is doing, and incorporating that as a flywheel so that you can continue to improve it, right?&nbsp;</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=191" target="_blank" rel="noreferrer noopener">03.11</a><br>And why does that happen? Because with AI products, the interface is very natural, which means that you&#8217;re pretty much speaking with these products, or you&#8217;re using some form of natural language communication, which means there are tons of ways users could talk and approach your product versus just clicking buttons and all of that, where workflows are so deterministic—which is why you open up a larger surface area for errors.&nbsp;</p>



<p>And you will only understand how your users are behaving with the system as you give them more access to it, right? Think of anything as mainstream as ChatGPT. How users interact with ChatGPT today is so different from how they did, say, three years ago, when it was released in November 2022. So what differentiates a good product is that idea of constant calibration to make sure that it&#8217;s getting aligned with the users and also with changing models and stuff like that. So the 80-20 flip I think is what differentiates a good product from just a prototype.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=194" target="_blank" rel="noreferrer noopener">04.14</a><br><strong><strong>So actually this is an important point in the sense that the persona has changed as to who&#8217;s building these data and AI products, because if you rewind five years ago, you had people with some knowledge of data science, ML, and now because it&#8217;s so accessible, developers—actually even nondevelopers, vibe coders—can start building. So with that said, Aishwarya, what do these kinds of nondata and AI people still consistently get wrong when they move from that traditional mindset of building software to now AI applications?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=305" target="_blank" rel="noreferrer noopener">05.05</a><br>For one, I truly am one of those people who believes that AI should be for everyone. Even if you&#8217;re coming from a traditional machine learning background, there&#8217;s so much to catch up on. I moved <em>from</em> a team in AWS in 2023 where I was working with traditional natural language processing models—I was a part of the Alexa team. And then I moved into an org called GenAI Innovation Center, where we were building generative AI solutions for customers. And I feel like there was so much to learn for me as well.&nbsp;</p>



<p>But if there&#8217;s one thing that most people get wrong and maybe AI and traditional ML folks get right, it’s to look at your data, right? When you&#8217;re building all of these products, people just assume that “Oh, I&#8217;ve tested this for a few use cases” and then it seems to work fine, and they don&#8217;t pay so much attention to the kind of data distribution that they would get from their users. And given this obsession to automate everything, people go like, “OK, I can maybe ask an LLM to identify what kind of user patterns I&#8217;m seeing, build evals for itself, and update itself.” It doesn&#8217;t work that way. You really need to spend the time to understand workflows very well, understand context, understand all this data, pretty much.&nbsp;.&nbsp;.&nbsp;</p>



<p>I think just taking the time to manually do some of the setting up work for your agents so that they can perform at their maximum is super underrated. Traditional ML folks tend to understand that a little better because most of the time we&#8217;ve been doing that. We&#8217;ve been curating data for training our machine learning models even after they go into production. There&#8217;s all of this identifying outliers and updating and stuff. But yeah, if there&#8217;s one single takeaway for anybody building AI products: Take the time to look at your data. That&#8217;s the most important foundation for building them.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=421" target="_blank" rel="noreferrer noopener">07.01</a><br><strong><strong>I&#8217;ll flip this a little bit and give props to the traditional developers. What do they get right? In other words, traditional developers write code; some of them write tests, run unit tests [and] integration tests. So they had something to build on that maybe the data scientists who were not writing production code were not used to doing. So what do the traditional developers bring to the table that the data and ML people can learn from?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=460">07.40</a><br>That&#8217;s an interesting question because I don&#8217;t come from a software background and I just feel traditional developers have a very good design thinking: How do you design architectures so that they can scale? I was so used to writing in notebooks and kind of just focusing so much on the model, but traditional developers treat the model as an API and they build everything very well around it, right? They think about security. They think about what kind of design makes sense at scale and all of that. And even today I feel like so much of AI engineering is traditional software engineering—but with all of the caveats that you need to be looking at your data. You need to be building evals which look very different. But if you kind of zoom out and see, it&#8217;s pretty much the same process, and everything that you do around the model (assuming that the model is just a nondeterministic API), I think traditional software engineers get it like bang on.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=516" target="_blank" rel="noreferrer noopener">08.36</a><br><strong><strong>You recently wrote a </strong><a href="https://www.oreilly.com/radar/evals-are-not-all-you-need/" target="_blank" rel="noreferrer noopener"><strong>post about evals</strong></a><strong>, which was quite interesting actually, [arguing] that it&#8217;s a bit of an overused and poorly defined term. I agree with the thesis of the post, but were you getting frustrated? Is that the reason why you wrote the post? [laughs] What was the genesis of the post?</strong>&nbsp;</strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=543" target="_blank" rel="noreferrer noopener">09.03</a><br>The baseline is that most of my posts come out of frustration with the noise in this space. It just feels like if you kind of see the trajectory.&nbsp;.&nbsp;. In November 2022, ChatGPT was out, and [everybody was] like, &#8220;Oh, chat interfaces are all you need.&#8221; And then there was this concept of retrieval-augmented generation, and they go “Oh, RAG is all you need. Chat just doesn&#8217;t work.” And then there was this concept of agents and like “Agents are all you need; evals are all you need.” So it just gets super annoying when people hang on to these concepts and don&#8217;t really understand the depth of it.&nbsp;</p>



<p>Even now I think there are tons of people who go like “Oh, RAG is dead. It&#8217;s not going to be used” and stuff, and there&#8217;s so much nuance to it. And with evals as well. I teach a lot of courses: I teach at universities; I also have my own courses. I feel like people just stuck to the term, and they were like “Oh, there is this use case I&#8217;m building. I need hundreds of evals in order to make sure that it&#8217;s tested very well.” And they just heard the fact that “Oh, evals are what you need to do differently for AI products” and really didn&#8217;t understand in depth what evals mean—how you need to build a flywheel around it, and the entire, you know, act of building a product, calibrating it, and building a set of evaluations and also doing some A/B testing online to understand how your users are behaving with it. All of that just went into one term “evals,” and people are just like throwing it around everywhere, right?</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=635" target="_blank" rel="noreferrer noopener">10.35</a><br>And there&#8217;s also this confusion around model eval versus product eval, which is all of these frontier companies build evals on their models to make sure that they understand where they are on the leaderboard. And I was speaking to someone one day, and they went like, &#8220;Oh, GPT-5 point something has been tested on a particular eval dataset, which means it&#8217;s the best for my use case, so I&#8217;m going to be using it.&#8221; And I&#8217;m like, &#8220;That&#8217;s not the evals that you should be worrying about, right?&#8221; So just overloading so much into a term and hyping it up is kind of what I felt was annoying. And I wanted to write a post to say that evals is a process. It&#8217;s a long process. It&#8217;s pretty much the process of building something and calibrating it over time. And there are tons of components to it, so don&#8217;t kind of try to stuff everything in a word and confuse people.&nbsp;</p>



<p>I&#8217;ve also seen people who do things like, “Oh, I&#8217;m going to build hundreds of evals” and maybe 10 of them are actionable. Evals also need to be super actionable: What is the information you can get from them, and how can you act on that? So I kind of stuffed all of that frustration into the post to kind of say it&#8217;s a longer process. There&#8217;s so much nuance in it. Don&#8217;t try to water that down.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=708" target="_blank" rel="noreferrer noopener">11.48</a><br><strong>So it seems like this is an area where the people that were from the prior era—the people building ML and data science products—maybe could bring something to the table, right? Because they had experience, I don&#8217;t know, shipping recommendation engines and things like that. They have some prior notion of what continuous evaluation and rigorous evaluation brings to the table.&nbsp;</strong></p>



<p><strong>Actually I was talking to someone about this a few weeks ago in the sense that maybe the data scientists actually have a growing employment opportunity here because basically what they bring to the table seems increasingly important to me. Given that code is essentially free and discardable, it seems like someone with a more rigorous background in stats and ML might be able to distinguish themselves. What do you think?</strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=776" target="_blank" rel="noreferrer noopener">12.56</a><br>Yes and no, because it&#8217;s true that machine learning and data scientists understand data very well, but just the way you build evals for these products is so much more different than how you would build, say, your typical metrics (accuracy, F-score, and all of that) that it takes quite some thinking to extend that and also some learning to do.&nbsp;.&nbsp;.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=801" target="_blank" rel="noreferrer noopener">13.21</a><br><strong><strong>But at least you might actually go in there knowing that you need it.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=807" target="_blank" rel="noreferrer noopener">13.27</a><br>That is true, but I don&#8217;t think that&#8217;s a super.&nbsp;.&nbsp;. I&#8217;ve seen very good engineers pick that up as well because they understand at a design level “What are the metrics I need to be measuring?” So they&#8217;re very outcome focused and kind of enter with that. So one: I think everybody has to be more coachable—not really depend on things that they learned like X years ago, because things are changing so quickly. But I also believe that whenever you&#8217;re building a product, it&#8217;s not really one set of folks that have the edge.&nbsp;</p>



<p>Another maybe distribution that is completely different is just subject-matter experts, right? When you&#8217;re building evals, you need to be writing rubrics for your LLM judges. Simple example: Let&#8217;s say you&#8217;re building a marketing pipeline for your company, and you need to write copy—marketing emails or something like that. Now even if I come from a data science background, if I were thrown at that problem, I just don&#8217;t understand what to look for and how to get closer to a brand voice that my company would be satisfied with. But I really need a marketing expert to kind of tell me “This is the brand voice we use, and this is the evals that we can build, or this is how the rubric should look like.” So it should almost be like a cross-functional thing. I feel like each of us have different pieces to that puzzle, and we need to work together.&nbsp;</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=882" target="_blank" rel="noreferrer noopener">14.42</a><br>That kind of also brings me to this other thing of collaborating in a much tighter manner [than] before. Before it was like, “OK, machine learning folks get data; they build models; and then there is a separate testing team; there is a separate SME team that&#8217;s going to look at how this product is behaving.” And now you cannot do that. You need to be optimizing for the same feedback loop. You need to be talking a lot more with all of the stakeholders because even when building, you want to understand their perspective.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=914">15.14</a><br><strong><strong>So it seems also the case that as more people build these things, they realize that actually.&nbsp;.&nbsp;. You know, sometimes I struggle with the word “eval” in the sense that maybe the right word is “optimize,” because basically what you really want is to understand “What am I optimizing for?” Obviously reliability is one of them, but latency and cost are also important factors, right? So it&#8217;s just a discussion that you&#8217;re increasingly coming across, and people are recognizing that there are trade-offs and they have to balance a bunch of things.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=957" target="_blank" rel="noreferrer noopener">15.57</a><br>Yes, definitely. I don&#8217;t see it being discussed heavily mainstream. But whenever I approach a problem, it&#8217;s always that, right? It&#8217;s performance, effort, cost, and latency. And all of these four things are kind of.&nbsp;.&nbsp;. You&#8217;re trying to balance each of them and trade off each of them. And I always say, start off with something that&#8217;s very low effort so that you kind of have an upper ceiling to what can be achieved. Then optimize for performance.&nbsp;</p>



<p>Again, don&#8217;t optimize for cost and latency when you get started because you just want to see the realm of possible to make sure that you can build a product and it can work fine. And cost and latency [are] something that ought to be optimized for—even when building for enterprises—after we&#8217;ve had a decent prototype that can do well on evals. Right now, if I built something with, say, a good mid-tier model and it can hit all of my eval datasets, then I know that this is possible, and now I can optimize for the latency and cost based on the constraints. But always follow that pyramid, right? Go with [the] lowest effort. Try to optimize for performance. And then cost and latency is something that.&nbsp;.&nbsp;. There are tons of tricks you can do. There&#8217;s caching; there&#8217;s using smaller models and all of that. That&#8217;s kind of a framework that I typically use.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1028" target="_blank" rel="noreferrer noopener">17.08</a><br><strong><strong>In prior generations of machine learning, I think a lot of focus was on accuracy to some extent. But now increasingly, because we&#8217;re in this kind of generative AI world, it&#8217;s more likely that people are interested in reliability and predictability in the following sense: Even if I&#8217;m only 10% accurate, as long as I know what that 10% is, I would prefer that [to] a model that&#8217;s more accurate but I don&#8217;t know when it&#8217;s accurate. Right?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1067" target="_blank" rel="noreferrer noopener">17.47</a><br>Right. That&#8217;s kind of the boon and bane of generative AI models. I guess the fact that they can generalize is amazing, but sometimes they end up generalizing in ways that you wouldn&#8217;t want them to. And whenever we work on enterprise use cases, something that I always want to tell myself is: If this can be a workflow, don&#8217;t make it autonomous. If it can solve the problem with a simple LLM call and you can audit decisions, keep it a workflow. For instance, let&#8217;s say we&#8217;re building a customer support agent. You could literally build it in five minutes: You can throw SOPs at your customer support agent and say “OK, pick up the right resolution, talk to the user, and that&#8217;s it.” Building is very cheap today. I can literally have Claude Code build it up in a few minutes.&nbsp;</p>



<p>But something that you want to be more intentional about is “What happens if things go wrong? When should I escalate to humans?” And that&#8217;s where I would just break this into a workflow. First, identify the intent of the human and then give me a draft—almost be a copilot for me, where I can collaborate. And then if that draft looks good, a human should approve it so that it goes further.&nbsp;</p>



<p>Right now, you&#8217;re introducing auditability at each point so that you as a human can make decisions before, you know, an agent goes up and messes up things for you. And that&#8217;s also where your design decisions should really take over. Like I could build anything today, but how much thinking am I doing before that building so that there&#8217;s reliability, there’s auditability, and all of those things. LLMs are super powerful. So I think you need to really identify where to use that power versus where humans should be making decisions.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1168" target="_blank" rel="noreferrer noopener">19.28</a><br><strong><strong>And you touched on the notion of human auditors or humans in the loop. So obviously people also try to balance LLM as judge versus human in the loop, right? Obviously there&#8217;s no one piece of advice, but what are some best practices around how you demarcate between when to use a human and when you&#8217;re comfortable using another model as a judge?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1204" target="_blank" rel="noreferrer noopener">20.04</a><br>A lot of this usually depends on how much data you have to train your judge, right? I feel humans have this problem, which is: Sometimes you can do a task but you can&#8217;t explain why you arrived at that decision in a very structured format. I can today take a look at an article and tell you.&nbsp;.&nbsp;. Especially, I write a lot on Substack and LinkedIn; this is a super personal use case. If you give me an article and ask me, “Ash, will this go viral on LinkedIn?” I can tell you yes or no for my profile, right? Because I&#8217;ve done it for so many years. But if you ask me, “How did you make that decision?” I probably cannot codify it and write it down as a bunch of rubrics. Which is again, when you translate this to an LLM judge, “Can I build an LLM that can tell me if a post will go viral or not?” Maybe not, because I just don&#8217;t have all the constraints that I use as a human when I make decisions.&nbsp;</p>



<p>Now, take this to more production-like use cases or enterprise-like use cases. You want to have a human judge until you can codify or you can create a framework of how to evaluate something and you can write that out in natural language. And what that means is you maybe want to take 100 or 200 utterances and say, “OK, does this make sense? What&#8217;s the reasoning behind why I graded it a certain way?” And you can feed all of that information into your LLM judge to finally give it a set of rubrics and build your evals. But that&#8217;s kind of how you make a decision, which is “Do we have enough information to provide to an LLM judge that it can replace human judgment?”&nbsp;</p>



<p>But otherwise don&#8217;t do it—if you have very vague high-level ideas of what good looks like, you probably don&#8217;t want to go to an LLM judge. Even when building your systems, I would always recommend that your first pass when you&#8217;re doing your eval should be judged by a human, and you should also ask them to give you reasoning as to why they judge it because that reasoning is so important for training your LLM judges.</p>
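<p>The loop described here (have humans grade a first pass, capture their reasoning, and only then distill that reasoning into rubrics for an LLM judge) can be sketched roughly as follows. This is a hypothetical illustration, not an actual implementation: the <code>GradedExample</code> type, the 100-example threshold, and both function names are assumptions.</p>

```python
# Hypothetical sketch of the human-first eval loop: humans grade a
# first pass with reasoning, and only once enough graded examples
# exist do we distill them into a rubric prompt for an LLM judge.

from dataclasses import dataclass

@dataclass
class GradedExample:
    output: str      # model output shown to the human grader
    grade: str       # e.g. "pass" / "fail"
    reasoning: str   # why the human graded it that way

MIN_EXAMPLES = 100  # assumption: the "100 or 200 utterances" mentioned above

def ready_for_llm_judge(examples: list[GradedExample]) -> bool:
    """Only replace human judgment once the criteria are codified:
    enough examples, each with explicit reasoning."""
    return (len(examples) >= MIN_EXAMPLES
            and all(e.reasoning.strip() for e in examples))

def build_judge_rubric(examples: list[GradedExample]) -> str:
    """Distill the human graders' reasoning into a rubric prompt."""
    lines = ["You are grading outputs. Apply these criteria, derived from human graders:"]
    for e in examples:
        lines.append(f"- A '{e.grade}' example: {e.reasoning}")
    return "\n".join(lines)
```

<p>The key design point matches the advice above: the rubric is built from human reasoning, and the LLM judge is held back until that reasoning exists in codified form.</p>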



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1318">21.58</a><br><strong><strong>What are some signs that you look for? What are signals that you look for when one of these AI applications or systems go live? What are some of the signals you look for that [show] maybe the quality is degrading or breaking down?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1338" target="_blank" rel="noreferrer noopener">22.18</a><br>It really depends on the use case, but there are a lot of subtle signals that users will give you, and you can log them, right? Things like “Are users swearing at your product?” That&#8217;s something we always use, right? “What kind of words are they using? How many conversation turns are there, if it&#8217;s a chatbot?” Usually when you&#8217;re building your chatbot, you identify that the average number of turns is 10, but it turns out that customers are having only two turns of conversation. That kind of means that they&#8217;re not interested in talking to your chatbot. Or sometimes they&#8217;re having 20-turn conversations, which means they&#8217;re probably annoyed, which is why they&#8217;re having longer conversations.&nbsp;</p>



<p>There are typical things: You know, ask your user to give a thumbs up or thumbs down and all of that, but we know that feedback kind of doesn&#8217;t.&nbsp;.&nbsp;. People don&#8217;t give feedback unless they&#8217;re annoyed at something. So you can have those as well. If you&#8217;re building something like a coding agent like Claude Code etc., very obvious logging you can do is “Did the user go and change the code that it generated?” which means it was wrong. So it&#8217;s very specific to your context, but really think of ways you can log all of this behavior; you can log anomalies.&nbsp;</p>



<p>Sometimes it&#8217;s just getting all of these logs and doing some topic clustering, which is “What are our users typically talking about, and do any of those show signs of frustration? Do they show signs of being annoyed with the system?” and things like that. You really need to understand your workflows very well so that you can design these monitoring strategies.</p>
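<p>The monitoring heuristics mentioned here (turn counts outside the expected band, frustration language in the text) can be sketched as a small anomaly flagger. This is purely illustrative: the thresholds, the wordlist, and the function name are assumptions, and a real system would log these signals rather than compute them ad hoc.</p>

```python
# Hypothetical sketch of the monitoring heuristics described above:
# flag conversations whose turn count falls outside the expected band
# or whose text contains frustration language. Thresholds are assumed.

EXPECTED_TURNS = (5, 15)  # assumed healthy band around the ~10-turn average
FRUSTRATION_WORDS = {"useless", "ridiculous", "wtf"}  # assumed wordlist

def flag_conversation(turns: list[str]) -> list[str]:
    """Return the anomaly signals this conversation triggers."""
    signals = []
    lo, hi = EXPECTED_TURNS
    if len(turns) < lo:
        signals.append("too_few_turns")    # users gave up quickly
    elif len(turns) > hi:
        signals.append("too_many_turns")   # users may be stuck or annoyed
    text = " ".join(turns).lower()
    if any(w in text for w in FRUSTRATION_WORDS):
        signals.append("frustration_language")
    return signals
```

<p>In practice these flags would feed the topic-clustering and review loop described above, rather than trigger automated action on their own.</p>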



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1430" target="_blank" rel="noreferrer noopener">23.50</a><br><strong><strong>Yeah, it&#8217;s interesting because I was just on a chatbot for an airline, and I was surprised how bad it was, in the sense that it felt like a chatbot of the pre-LLM era. So give us kind of your sense of “Are these chatbots now really being powered by foundation models or.&nbsp;.&nbsp;.?” I mean because I was just shocked, Aishwarya, at how bad it was, you know? So what&#8217;s your sense of, as far as you know, are enterprises really deploying these generative AI foundation models in consumer-facing apps?</strong></strong><br><br><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1481" target="_blank" rel="noreferrer noopener">24.41</a><br>Very few. To just give you a quick stat that might not be super correct: 70% to 80% of the engagements that we take up at LevelUp Labs happen to be productivity and ops focused rather than customer focused. And the biggest blocker for that has always been trust and reliability, because if you build these customer-facing agents [and] they make one mistake, it&#8217;s enough to put you in the news media or enough to put you in bad PR.&nbsp;</p>



<p>But I think what good companies are doing as of today is doing a phased approach, which is they have already identified buckets that can be completely autonomous versus buckets that would require humans to navigate, right? Like this example that you gave me, as soon as a user comes up with a query, they have a triaging system that would determine if it should go to an AI agent versus a human, depending on the history of the user, depending on the kind of query. (Is it complicated enough?) Right? Let&#8217;s say Ben has this history of.&nbsp;.&nbsp;.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1544" target="_blank" rel="noreferrer noopener">25.44</a><br><strong><strong>Hey, hey, I had great status on this airline.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1547" target="_blank" rel="noreferrer noopener">25.47</a><br>[laughs] Yeah. So it&#8217;s probably not you, but just the kind of query you&#8217;re coming up with and all of that. So they&#8217;ve identified buckets where automation is possible, and they&#8217;re doing it, and they&#8217;ve done that because of past behavior data, right? What&#8217;s the low-hanging fruit we could automate versus escalate to humans? I have not seen a lot of these chat systems that are completely taken over by agents. There&#8217;s always some human oversight and very good orchestration mechanisms to make sure that customers are not affected.</p>
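<p>The triaging system described above can be sketched in a few lines. This is an editor&#8217;s illustration only: the <code>Query</code> fields, thresholds, and routing rules are invented for the example, not taken from any real deployment, but they mirror the signals mentioned in the conversation (user history, query complexity, risk).</p>

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    prior_escalations: int    # how often this user needed a human before
    is_regulated_topic: bool  # e.g. refunds, legal, medical

def route(query: Query) -> str:
    """Decide whether a query goes to the AI agent or a human.

    Buckets known to be safe are automated; anything carrying a risk
    signal or a history of escalation is handed to a person.
    """
    if query.is_regulated_topic:
        return "human"
    if query.prior_escalations >= 2:
        return "human"
    # Treat long, multi-part queries as "complicated enough" for a human.
    if len(query.text.split()) > 60:
        return "human"
    return "ai_agent"

print(route(Query("What is my baggage allowance?", 0, False)))       # ai_agent
print(route(Query("I want a refund for my cancelled flight", 0, True)))  # human
```

<p>The point is less the specific rules than the design: the phased approach works because the routing decision sits in front of the agent, so the automated bucket can grow as trust in the agent&#8217;s behavior grows.</p>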



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1576" target="_blank" rel="noreferrer noopener">26.16</a><br><strong>So you mentioned that you mostly are in the technical and ops application areas, but I&#8217;ll ask you this question anyway. To what extent do legal things come up? In other words, I&#8217;m about to deploy this model. I know I have guardrails, but honestly, just between you and me, I haven&#8217;t gone through the proper legal evaluation, you know? [laughs] So in other words, legality or compliance—anything to do with laws—do they come up at all in your discussions with companies?</strong><br><br><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1619" target="_blank" rel="noreferrer noopener">26.59</a><br>As an external implementation team, I think one thing that we do with most companies is give them a high-level overview of the architecture we&#8217;ll be building, the requirements, and ask them to do a security and legal review so that they&#8217;re okay with it, because we&#8217;ve had experiences in the past where we pretty much built out everything and then you have your CISO come in and say, “OK, this doesn&#8217;t fall into what we could deploy.” So many companies make the mistake of not really involving their governance and compliance folks in the beginning and then end up scrapping entire projects.&nbsp;</p>



<p>I am not an expert who knows all of these rules and legalities, but we always make sure that they understand: “Where is the data coming from? Do we have any issues productionizing this?” and all of that, but we haven&#8217;t really worked.&nbsp;.&nbsp;. I mean I don&#8217;t have a lot of background on how to do this. We&#8217;re mostly engineering folks, but we make sure that we have a sign-off so that we are not kind of landing in surprises.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1687" target="_blank" rel="noreferrer noopener">28.07</a><br><strong>Yeah, the reason I bring it up is obviously, now that everything is much more democratized, more people can build—so in reality people can literally move fast and break things, right? So I just wonder if there&#8217;s any discussion at all. It sounds like you are proactive, though mostly out of experience, but I wonder if regular teams are talking about this.&nbsp;</strong></p>



<p><strong>Speaking of which, you brought up earlier leaderboards—obviously I&#8217;m guilty of this too: “I&#8217;m about to build something. OK, let me look at a leaderboard.” But, you know, I&#8217;m not literally going to take the leaderboard&#8217;s advice, right? I&#8217;m going to still kick the tires on the specific application and use case. But I&#8217;m sure though, in your conversations, people tell you all sorts of things like, “Hey, we should use this because I saw somewhere that this is ranked number one,” right? So is this still a frustration on your end, or are people much more savvy now?</strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1759" target="_blank" rel="noreferrer noopener">29.19</a><br>For one, I want to quickly clarify that it&#8217;s not wrong to look at a leaderboard. It&#8217;s always.&nbsp;.&nbsp;. You know, you get a high-level idea of “Who are your best competitors at this point?” But what I have a problem with is being so obsessed with just that leaderboard that you don&#8217;t build evals for yourself.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1774" target="_blank" rel="noreferrer noopener">29.34</a><br>In my experience, when we work with a lot of these companies, I think over the past two years the discussion has really shifted away from the model for two reasons: One is most companies already have existing partnerships. They&#8217;re working with a major model provider, and they&#8217;re OK doing that now just because all of these model providers are racing towards feature parity, leaderboard success, and all of that. If Anthropic has something, you know, if their model is performing well on a leaderboard today, Gemini and OpenAI will probably be there in a week. So people are not too concerned about model performance. They know that in a couple of weeks, that will kind of be built into other models. So they&#8217;re not worried about that.&nbsp;</p>



<p>And two is companies are also thinking much more about the application layer right now. There&#8217;s so much discussion around all of these harnesses like Claude Code, OpenClaw, and stuff like that. So I&#8217;ve not seen a lot of complaints on “Oh, this is the model that we should be using.” It seems like they have a shared understanding of how models perform. They want to optimize the harness and the application layer much more.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1848" target="_blank" rel="noreferrer noopener">30.48</a><br><strong>Yeah. Yeah. Obviously another one of these buzzwords is “harness engineering,” and whatever you think about it, the one good thing is it really elevates the notion that you should worry about the things around the model rather than the model itself.&nbsp;</strong></p>



<p><strong>But speaking of.&nbsp;.&nbsp;. I guess I&#8217;m kind of old school in the sense that I want to still make sure that I can swap models out, not necessarily because I believe one model is better than the other but one model may be cheaper than the other, right?&nbsp;</strong></p>



<p><strong>And at least up until recently—I haven&#8217;t had this conversation in a while—it seemed to me that people got stuck on a model because their prompts were so specific to a model that porting to another model seemed like a lot of work. But nowadays you have tools like DSPy and GEPA, so it seems like you can do that more easily. So what&#8217;s your sense of model portability as a design principle—model neutrality?</strong><br><br><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1926" target="_blank" rel="noreferrer noopener">32.06</a><br>For one, I think the gap between models is much more exaggerated for consumer use cases just because people care quite a bit about the personality, about how the model&#8230;</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1942" target="_blank" rel="noreferrer noopener">32.22</a><br><strong><strong>No, I care about latency and cost.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1944" target="_blank" rel="noreferrer noopener">32.24</a><br>Yeah. In terms of latency and cost, right, most of the model providers pretty much are competing to make sure they are in the market. I don&#8217;t know. Do you think that there are models.&nbsp;.&nbsp;.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1955" target="_blank" rel="noreferrer noopener">32.35</a><br><strong><strong>Well, I think that you can still get good deals with Gemini. [laughs]</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1960" target="_blank" rel="noreferrer noopener">32.40</a><br>Interesting.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1961" target="_blank" rel="noreferrer noopener">32.41</a><br><strong><strong>But honestly, I use OpenRouter and OpenCode. So I&#8217;m much more in the camp of “I don&#8217;t want to get locked into a single [model].” When I build something, I want to make sure that I build it in a way that I can move to a different model provider if I have to. But it doesn&#8217;t sound like you think that this is something that people worry about right now. They&#8217;re just worried about building something usable, and then they can worry about that later.</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=1992" target="_blank" rel="noreferrer noopener">33.12</a><br>Yes. And again, I come from a very enterprise point of view, like “What are companies thinking about this?” And like I said, I&#8217;m not seeing a lot of push for model neutrality because these companies have deals with vendors and they&#8217;re okay sticking with the same model provider.&nbsp;</p>



<p>Now, when it comes to consumers, like if you&#8217;re building something for the kind of use cases that you were saying, Ben, I feel that, like I said, personality is super important for consumer builders. And I still think we&#8217;re not at a point where you can easily swap out models and be like, “OK, this is going to work as well as before,” just because you have over time learned how the model behaves. So you&#8217;ve kind of gotten calibrated with these models, and these models also have very specific personalities. So there&#8217;s a lot of, you know, reengineering that you have to do.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2047" target="_blank" rel="noreferrer noopener">34.07</a><br>And when I say reengineering, it just might mean changing the way your prompts are written and stuff like that. It will still functionally work, which is why I say that enterprises don&#8217;t care about this much because the kind of use cases I see are like document processing or code generation, in which case functionality is of much more importance than personality. But for consumer use cases, I don&#8217;t think we&#8217;re at a point—to your point on building with OpenRouter, you can do that, but I think it&#8217;s a lot of overhead given that you&#8217;ll have to write specific prompts for all of these models depending on your use case.&nbsp;</p>



<p>I recently ported my OpenClaw from Anthropic to OpenAI because of all of the recent things, and I had to change all of my SOUL.md files, USER.md files, so that I could kind of set the behavior. And it [took] quite some time to do it, and I&#8217;m still getting used to interacting with OpenClaw using OpenAI because it seems like it makes different mistakes than what Anthropic would do.&nbsp;</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2103" target="_blank" rel="noreferrer noopener">35.03</a><br>So hopefully at some point [the] personalities of these models will converge but I do not think so because this is not a capability problem. It&#8217;s more of design choices that these model providers have made while building these models. So I don&#8217;t see a time where.&nbsp;.&nbsp;. We&#8217;re already at a point where capability-wise most models are getting closer, but personality-wise I don&#8217;t think model vendors would prefer to converge them because these are kind of your spiky edges which will make people with a certain personality gravitate towards your models. You don&#8217;t want to be making it like an average.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2138" target="_blank" rel="noreferrer noopener">35.38</a><br><strong>So in closing, you do a bit of teaching as well, right? One of the things I&#8217;ve really paid attention to is, in my conversations with people who are very, very early in their career, maybe still looking for the first job, literally, there&#8217;s a lot of worry out there. I mean, not necessarily if you&#8217;re a developer and you have a job—as long as you embrace the AI tools, you&#8217;re probably going to be fine. It&#8217;s just getting to that first job is getting harder and harder for people.&nbsp;</strong></p>



<p><strong>And unfortunately, you need that first job to burnish your credentials and your résumé. And honestly, I think companies also neglect the fact that this is your pipeline for talent within the company as well: You have to have the top of the funnel of your talent pipeline. So what advice do you give to people who are literally still trying to get to that first job?</strong><br><br><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2211" target="_blank" rel="noreferrer noopener">36.51</a><br>For one, I have had a lot of success with hiring young folks because I think they are very agent native. I call them agent-native operators. If you&#8217;ve been working in software, in IT, for about 10 years or something like me, you&#8217;ve gotten used to certain workflows without using AI. I feel like we&#8217;re so stuck in that old mindset that I really need someone who&#8217;s agent native to come and tell me, “Hey, you could literally ask Claude Code to do this.” So I&#8217;ve had a lot of luck hiring folks who are early career because they are very coachable, one, and two, they just understand how to be agent native.&nbsp;</p>



<p>So my suggestion would still be around that: Be a tinkerer. Try to find out what you can do with these tools, how you can automate them, and be extremely obsessed with designing and thinking and not really execution, right? Execution is kind of being taken over by agents.&nbsp;</p>



<p>So how do you really think about “What can I delegate?” versus “What can I augment?” and really sitting in the position of almost being an agent manager and thinking “How can you set up processes so that you can make end-to-end impact?” So just thinking a lot around those lines—and those are the kind of people that we&#8217;d like to hire as well.&nbsp;</p>



<p>And if you see a lot of these latest job roles, you&#8217;ll also see roles blurring, right? People who are product managers are expected to also do GTM, also do a bit of engineering, and all of that. So really understand the stack end to end. And the best way to do it, I feel, is build a product of your own [and] try to sell it. You&#8217;ll get to see the whole thing. [That] doesn&#8217;t mean “Oh, stop looking for jobs—go become an entrepreneur” but really understanding workflows end to end and making that impact and sitting at the design layer will be super valued is what I think.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2314" target="_blank" rel="noreferrer noopener">38.34</a><br><strong><strong>Yeah, the other thing I tell people is you have interests, so go deep in them and build something in whatever you&#8217;re interested in. Domain knowledge is going to be valuable moving forward, but also you end up building something that you would want to use yourself, and you learn a lot of things along the way, and then maybe that&#8217;s how you get your name out there, right?</strong></strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2339" target="_blank" rel="noreferrer noopener">38.59</a><br>Exactly. Solving for your own problem is the best advice: Try to build something that solves your own pain point. Try to also advocate for it. I feel like social media and all of this is so good at this point that you can really make a mark in nontraditional ways. You probably don&#8217;t even have to submit a job application. You can have a GitHub repository that gets a lot of stars—that might land you a job. So think of all of these ways to bring yourself more visibility as you build so that you don&#8217;t have to go through your typical job queue.</p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2370" target="_blank" rel="noreferrer noopener">39.30</a><br><strong>And with that, thank you, Aishwarya.&nbsp;</strong></p>



<p><a href="https://www.youtube.com/watch?v=Ajiu8uyfSq0#t=2372" target="_blank" rel="noreferrer noopener">39.32</a><br>Thank you.</p>
]]></content:encoded>
							<wfw:commentRss>https://www.oreilly.com/radar/podcast/generative-ai-in-the-real-world-aishwarya-naresh-reganti-on-making-ai-work-in-production/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
							</item>
		<item>
		<title>Meet the Scope Creep Kraken</title>
		<link>https://www.oreilly.com/radar/meet-the-scope-creep-kraken/</link>
				<pubDate>Thu, 16 Apr 2026 10:31:31 +0000</pubDate>
					<dc:creator><![CDATA[Tim O'Brien]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18546</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Scope-Creep-Kraken-by-Tim-OBrien-created-with-ChatGPT.png" 
				medium="image" 
				type="image/png" 
				width="720" 
				height="402" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Scope-Creep-Kraken-by-Tim-OBrien-created-with-ChatGPT-160x160.png" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[AI didn’t invent scope creep. It just removed the friction that used to stop it.]]></custom:subtitle>
		
				<description><![CDATA[The following article was originally published on Tim O’Brien’s Medium page and is being reposted here with the author&#8217;s permission. If you’ve spent any time around AI-assisted software work, you already know the moment when the&#160;Scope Creep Kraken&#160;first puts a tentacle on the boat. The project begins with a real goal and, usually, a sensible [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The following article was originally published on Tim O’Brien’s </em><a href="https://medium.com/@tobrien/meet-the-scope-creep-kraken-b7190814fe5c" target="_blank" rel="noreferrer noopener">Medium</a><em> </em><em>page and is being reposted here with the author&#8217;s permission.</em></p>
</blockquote>



<p>If you’ve spent any time around AI-assisted software work, you already know the moment when the&nbsp;<code>Scope Creep Kraken</code>&nbsp;first puts a tentacle on the boat.</p>



<p>The project begins with a real goal and, usually, a sensible one. Build the internal tool. Clean up the reporting flow. Add the missing admin screen. Then someone discovers that the model can generate a Swift application in minutes to render this on an iPhone, and the mood in the room changes.</p>



<figure class="wp-block-pullquote"><blockquote><p>“Why not? We can render this on an iOS application, and it will only take 10 minutes. Go for it. These tools are amazing. Wow.”</p></blockquote></figure>



<p>That first idea is often genuinely useful. Something that might have taken a week now takes an hour. That is part of what makes the pattern so seductive. It doesn’t begin with incompetence. It begins with tool-driven momentum.</p>



<p>The meeting continues,&nbsp;<em>“Let’s put the entire year’s backlog into the system and see if we can get this all done in a week. Ignore the token spend limits, let’s just get this done.”</em>&nbsp;What was a reasonable weekly release meeting has now set the stage for a rapid expansion in scope, and that’s how the Scope Creep Kraken takes over.</p>



<p>Scope creep is older than AI, of course. Software teams have been haunted by “while we’re at it” long before anybody was pasting stack traces into a chat window. What AI changed was the rate of growth. In the old version of this problem, extra scope still had to fight its way through staffing constraints. Somebody had to build the feature, debug it, test it, and explain why it belonged. That friction was often the only thing standing between a focused project and an over-extended team.</p>



<p>AI broke that.</p>



<p>Now the extra feature often arrives with a demo attached. “Could we add multi-language support?” Forty-five seconds later, there is a branch. “What about generated documentation?” Sure, why not? “Could the CLI accept natural language commands?” The model appears optimistic, which is enough to make the whole thing sound temporarily reasonable. Each addition looks manageable in isolation. That is how the Kraken works. It does not attack all at once. It wraps around the project one small grip at a time.</p>



<h4 class="wp-block-heading">Signs the Kraken is already on your boat</h4>



<ul class="wp-block-list">
<li>Features appearing without a ticket</li>



<li>Branches nobody asked for</li>



<li>Demos replacing design decisions</li>



<li>“It only took the model 30 seconds.”</li>
</ul>



<p>The part I keep seeing on teams is not reckless ambition so much as confident improvisation. People are reacting to real capability. They are not wrong to be excited that so much is suddenly possible.</p>



<figure class="wp-block-pullquote"><blockquote><p>The trouble starts when “we can generate this quickly” quietly replaces “we decided this belongs in the project.” Those are not the same sentence.</p></blockquote></figure>



<p>For a while, the Kraken even looks helpful. Output goes up. Screens appear. Branches multiply. People feel productive, and sometimes they really are productive in the narrow local sense. What gets hidden in that burst of visible progress is integration cost. Every tentacle has to be tested with every other tentacle. Every generated convenience becomes a maintenance obligation. Every small addition pulls the project a little farther from the problem it originally set out to solve.</p>



<figure class="wp-block-pullquote"><blockquote><p>The product manager might chime in, “A mobile application? I didn’t ask for that, but I guess it’s good. We’ll see. Who’s going to review this with the customer?”</p></blockquote></figure>



<p>That is usually when the team realizes the Kraken is already on the boat. The original sponsor asked for a hammer and is now watching a Swiss Army knife unfold in real time, with several blades no one asked for and at least one that does not seem to fold back in properly.</p>



<p><em>AI also makes it dangerously easy to confuse&nbsp;</em><strong><em>demonstrations</em></strong><em>&nbsp;with&nbsp;</em><strong><em>decisions</em></strong><em>.</em></p>



<p>The useful response is not to become suspicious of every experiment. Some of the first tentacles are worth keeping. The response is to put the old discipline back where AI made it easy to remove. Keep a written scope. Treat additions as actual decisions rather than prompt side effects. Ask what each new feature does to testing, documentation, support, and the team’s ability to explain the system six months from now. If nobody can answer those questions, the feature is not “done” just because the model produced a convincing draft.</p>



<p>What makes the&nbsp;<code>Scope Creep Kraken</code>&nbsp;a good name is that teams can use it in the moment. Once people can say, &#8220;This is another tentacle,&#8221; the conversation gets clearer. You are no longer arguing about whether the idea is clever. You are asking whether this is motivated by a requirement or a capability.</p>
]]></content:encoded>
										</item>
		<item>
		<title>AI Is Writing Our Code Faster Than We Can Verify It</title>
		<link>https://www.oreilly.com/radar/ai-is-writing-our-code-faster-than-we-can-verify-it/</link>
				<pubDate>Wed, 15 Apr 2026 11:19:15 +0000</pubDate>
					<dc:creator><![CDATA[Andrew Stellman]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18540</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-is-writing-our-code-faster-than-we-can-verify-it.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/AI-is-writing-our-code-faster-than-we-can-verify-it-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
				<custom:subtitle><![CDATA[What if the answer to AI’s biggest problem has been sitting on the shelf for fifty years?]]></custom:subtitle>
		
				<description><![CDATA[This is the fourth article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, and look for the next article on April 30 on O’Reilly Radar. Here’s the dirty secret of the AI coding revolution: most experienced developers still don’t really trust the code the AI writes for [&#8230;]]]></description>
								<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>This is the fourth article in a series on agentic engineering and AI-driven development. Read part one <a href="https://www.oreilly.com/radar/the-accidental-orchestrator/" target="_blank" rel="noreferrer noopener">here</a>, part two <a href="https://www.oreilly.com/radar/keep-deterministic-work-deterministic/" target="_blank" rel="noreferrer noopener">here</a>, part three <a href="https://www.oreilly.com/radar/the-toolkit-pattern/" target="_blank" rel="noreferrer noopener">here</a>, and look for the next article on April 30 on O’Reilly Radar.</em></p>
</blockquote>



<p>Here’s the dirty secret of the AI coding revolution: most experienced developers still don’t really trust the code the AI writes for us.</p>



<p>If I&#8217;m being honest, that&#8217;s not actually a particularly well-guarded secret. It feels like every day there&#8217;s a new breathless “I don&#8217;t have a lick of development experience but I just vibe coded this amazing application” article. And I get it—articles like that get so much engagement because everyone is watching carefully as the drama of AIs getting better and better at writing code unfolds. We&#8217;ve had decades of shows and movies, from <em>WarGames</em> to <em>Hackers</em> to <em>Mr. Robot</em>, portraying developers as reclusive geniuses doing mysterious but incredible stuff with computers. The idea that we&#8217;ve coded ourselves out of existence is fascinating to people.</p>



<p>The flip side of that pop-culture phenomenon is that when there are problems caused by agentic engineering gone wrong (like the equally popular “I trusted an AI agent and it deleted my entire production database” articles), everyone seems to find out about it. And, unfortunately, that newly emerging trope is much closer to reality. Most of us who do agentic engineering have seen our own AI-generated code go off the rails. That’s why I built and maintain the <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a>, an open-source AI skill that uses quality engineering techniques that go back over fifty years to help developers working in any language verify the quality of their AI-generated code. I was as surprised as anyone to discover that it actually works.</p>



<p>I’ve talked often about how <a href="https://www.oreilly.com/radar/trust-but-verify/" target="_blank" rel="noreferrer noopener">we need a “trust but verify” mindset</a> when using AI to write code. In the past, I’ve mostly focused on the “trust” aspect, finding ways to help developers feel more comfortable adopting AI coding tools and using them for production work. But I’m increasingly convinced that our biggest problem with AI-driven development is that <strong>we don’t have a reliable way to check the quality of code from agentic engineering at scale</strong>. AI is writing our code faster than we can verify it, and that is one of AI’s biggest problems right now.</p>



<h2 class="wp-block-heading"><strong>A false choice</strong></h2>



<p>After I got my first real taste of using AI for development in a professional setting, it felt like I was being asked to make a critical choice: either I had to outsource all of my thinking to the AI and just trust it to build whatever code I needed, or I had to review every single file it generated line by line.</p>



<p>A lot of really good, really experienced senior engineers I’ve talked to feel the same way. A small number of experienced developers fully embrace vibe coding and basically fire off the AI to do what it needs to, depending on a combination of unit tests and solid, decoupled architecture (and a little luck, maybe) to make sure things go well. But more frequently, the senior, experienced engineers I’ve talked to, folks who’ve been developing for a really long time, go the other way. When I ask them if they’re using AI every day, they’ll almost always say something like, “Yeah, I use AI for unit tests and code reviews.” That’s almost always a tell that they don&#8217;t trust the AI to build the really important code that’s at the core of the application. They’re using AI for things that won’t cause production bugs if they go wrong.</p>



<p>I think this excerpt from a recent (and excellent) article in Ars Technica, <a href="https://arstechnica.com/ai/2026/04/research-finds-ai-users-scarily-willing-to-surrender-their-cognition-to-llms/" target="_blank" rel="noreferrer noopener">“Cognitive surrender” leads AI users to abandon logical thinking</a>, sums up how many experienced developers feel about working with AI:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>When it comes to large language model-powered tools, there are generally two broad categories of users. On one side are those who treat AI as a powerful but sometimes faulty service that needs careful human oversight and review to detect reasoning or factual flaws in responses. On the other side are those who routinely outsource their critical thinking to what they see as an all-knowing machine.</em></p>
</blockquote>



<p>I agree that those are two options for dealing with AI. But I also believe that&#8217;s a false choice. “Cognitive surrender,” as the research referenced by the article puts it, is not a good outcome. But neither is reviewing every line of code the AI writes, because that&#8217;s so effort-intensive that we may as well just write it all ourselves. (And I can almost hear some of you asking, “What&#8217;s so bad about that?”)</p>



<p>This false choice is what really drives a lot of really good, very experienced senior engineers away from AI-driven development today. We see those two options, and they are both <em>terrible</em>. And that’s why I’m writing this article (and the next few in this Radar series) about quality.</p>



<h2 class="wp-block-heading"><strong>Some shocking numbers about AI coding tools</strong></h2>



<p>The <a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener"><strong>Quality Playbook</strong></a> is an open-source skill for AI coding tools like GitHub Copilot, Cursor, Claude Code, and Windsurf. You point it at a codebase, and it generates a complete quality engineering infrastructure for that project: test plans traced to requirements, code review protocols, integration tests, and more. More importantly, it brings back quality engineering practices that much of the industry abandoned decades ago, using AI to do a lot of the quality-related work that used to require a dedicated team.</p>
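<p>One of the oldest techniques behind “test plans traced to requirements” is the traceability matrix: map every requirement to the tests that exercise it, and flag the ones nothing covers. A toy version, with made-up requirement and test ids (this is an illustration of the general technique, not the playbook&#8217;s actual format), looks like this:</p>

```python
def uncovered_requirements(requirements, tests):
    """Return requirement ids that no test claims to cover.

    `requirements` is a set of requirement ids; `tests` maps a test
    name to the requirement ids it exercises. A nonempty result means
    the test plan has gaps that review should catch before shipping.
    """
    covered = set()
    for req_ids in tests.values():
        covered.update(req_ids)
    return sorted(requirements - covered)

# Hypothetical requirement and test ids for illustration.
reqs = {"REQ-1", "REQ-2", "REQ-3"}
tests = {
    "test_login": ["REQ-1"],
    "test_logout": ["REQ-1", "REQ-2"],
}
print(uncovered_requirements(reqs, tests))  # ['REQ-3']
```

<p>The value of this check in an AI-driven workflow is that it scales: an agent can generate tests far faster than a human can read them, but a coverage gap in the matrix is machine-checkable either way.</p>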



<p>I built the Quality Playbook as part of an experiment in AI-driven development and agentic engineering, building an open-source project called <a href="https://github.com/andrewstellman/octobatch" target="_blank" rel="noreferrer noopener">Octobatch</a> and writing about the process in <a href="https://oreillyradar.substack.com/p/the-accidental-orchestrator" target="_blank" rel="noreferrer noopener">this ongoing Radar series</a>. The playbook emerged directly from that experiment. The ideas behind it are over fifty years old, and they work.</p>



<p>Along the way, I ran into a shocking statistic.</p>



<p>We already know that many (most?) developers these days use AI coding tools like GitHub Copilot, Claude Code, Gemini, ChatGPT, and Cursor to write production code. But do we trust the code those tools generate? “Trust in these systems has collapsed to just 33%, a sharp decline from over 70% in 2023.”</p>



<p>That quote is from a <a href="https://gemini.google.com/share/df5ec9551c1c" target="_blank" rel="noreferrer noopener">Gemini Deep Research report</a> I generated while doing research for this article. 70% dropping to 33%—that sounds like a massive collapse, right?</p>



<p>The thing is, when I checked the sources Gemini referenced, the truth wasn’t nearly as clear-cut. That “over 70% in 2023” number came from a Stack Overflow survey measuring how favorably developers view AI tools. The “33%” number came from a Qodo survey asking whether developers trust the accuracy of AI-generated code. Gemini grabbed both numbers, stripped the context, and stitched them into a single decline narrative. No single study ever measured trust dropping from over 70% to 33%. Which means we’ve got an apples-to-oranges comparison, and it might even technically be accurate (sort of?), but it’s not really the headline-grabber that it seemed to be.</p>



<p>So why am I telling you about it?</p>



<p>Because there are two important lessons from that “shocking” stat. The first is that the overall idea rings true, at least for me. Almost all of us have had the experience of generating code with AI faster than we can verify it, and we ship features before we fully review them.</p>



<p>The second is that when Gemini created the report, the AI fabricated the most alarming version of the story from real but unrelated data points. If I’d just cited it without checking the sources, there’s a pretty good chance it would get published, and you might even believe it. That’s ironically self-referential, because it’s literally the trust problem the survey is supposedly measuring. The AI produced something that looked authoritative, felt correct, and was wrong in ways that only careful verification could catch. If you want to understand why over 70% of developers don’t fully trust AI-generated code, you just watched it happen.</p>



<p>One reason many of us don’t trust AI-generated code is that there’s a growing gap between how fast AI can generate code and how well we can verify that the code actually does what we intended. The usual response to this verification gap is to adopt better testing tools. And there are plenty of them: test stub generators, diff reviewers, spec-first frameworks. These are useful, and they solve real problems. But they generally share a blind spot: they work with what the code does, not with what it’s supposed to do. Luckily, the intent is sitting right there: in the specs, the schemas, the defensive code, the history of the AI chats about the project, even the variable names and filenames. We just need a way to use it.</p>



<p>AI-driven development needs its own quality practices, and the discipline we need already exists. It was just (unfairly) considered too expensive to use&#8230; until AI made it cheap.</p>



<h2 class="wp-block-heading"><strong>(Re-)introducing quality engineering</strong></h2>



<p>There’s a difference between knowing that code works and knowing that it does what it’s supposed to do. It’s the difference between “does this function return the right value?” and “does this system fulfill its purpose?”—and as it turns out, that’s one of the oldest problems in software engineering. In fact, as I talked about in a previous Radar article, <a href="https://www.oreilly.com/radar/prompt-engineering-is-requirements-engineering/" target="_blank" rel="noreferrer noopener">Prompt Engineering Is Requirements Engineering</a>, it was the source of the original “software crisis.”</p>



<p>The software crisis was the term our industry used back in the 1960s, when it was coming to grips with large software projects around the world that were routinely delivered late and over budget, producing software that didn’t do what it was supposed to do. At the 1968 NATO Software Engineering Conference—the conference that introduced the term “software engineering”—some of the top experts in the industry argued that the crisis was caused by developers and their stakeholders having trouble understanding the problems they were solving, communicating those needs clearly, and making sure that the systems they delivered actually met their users’ needs. Nearly two decades later, Fred Brooks made the same argument in his pioneering essay, <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet" target="_blank" rel="noreferrer noopener"><em>No Silver Bullet</em></a>: no tool can, on its own, eliminate the inherent difficulty of understanding what needs to be built and communicating that intent clearly. And now that we talk to our AI development tools the same way we talk to our teammates, we’re more susceptible than ever to that underlying problem of communication and shared understanding.</p>



<p>An important part of the industry’s response to the software crisis was <strong>quality engineering</strong>, a discipline built specifically to close the gap between intent and implementation by defining what “correct” means up front, tracing tests back to requirements, and verifying that the delivered system actually does what it’s supposed to do. For years it was standard practice for software engineering teams to include quality engineering phases in all projects. But few teams today do traditional quality engineering. Understanding why it got left behind by so many of us and, more importantly, what it can do for us now can make a huge difference for agentic engineering and AI-driven development today.</p>



<p>Starting in the 1950s, three thinkers built the intellectual foundation that manufacturing used to become dramatically more reliable.</p>



<ul class="wp-block-list">
<li>W. Edwards Deming argued that quality is built into the process, not inspected in after the fact. He taught us that you don’t test your way to a good product; you design the system that produces it.</li>



<li>Joseph Juran defined quality as fitness for use: not just “does it work?” but “does it do what it’s supposed to do, under real conditions, for the people who actually use it?”</li>



<li>Philip Crosby made the business case: quality is free, because building it in costs less than finding and fixing defects after the fact. By the time I joined my first professional software development team in the 1990s, these ideas were standard practice in our industry.</li>
</ul>



<p>These ideas revolutionized software quality, and the people who put them into practice were called <strong>quality engineers</strong>. They built test plans traced to requirements, ran functional testing against specifications, and maintained living documentation that defined what “correct” meant for each part of the system.</p>



<p>So why did all of this disappear from most software teams? (It’s still alive in regulated industries like aerospace, medical devices, and automotive, where traceability is mandated by law, and a few brave holdouts throughout the industry.) It wasn’t because it didn’t work. Quality engineering got cut because it was <em>perceived as expensive</em>. Crosby was right that quality is free: the cost of building it in is far more than made up for by the savings you get from not finding and fixing defects later. But the costs come at the beginning of the project and the savings come at the end. In practice, that means when the team blows a deadline and the manager gets angry and starts looking for something to cut, the testing and QA activities are easy targets because the software already seems to be complete.</p>



<p>On top of the perceived expense, quality engineering required specialists. Building good requirements, designing test plans, and planning and running functional and regression testing are real, technical skills, and most teams simply didn’t have anyone (or, more specifically, the budget for anyone) who could do those jobs.</p>



<p>Quality engineering may have faded from our projects and teams over time, but the industry didn’t just give up on many of its best ideas. Developers are nothing if not resourceful, and we built our own quality practices—three of the most popular are test-driven development, behavior-driven development, and agile-style iteration—and these are genuinely good at what they do. TDD keeps code honest by making you write the test before the implementation. BDD was specifically designed to capture requirements in a form that developers, testers, and stakeholders can all read (though in practice, most teams strip away the stakeholder involvement and it devolves into another flavor of integration testing). Agile iteration tightens the feedback loop so you catch problems earlier.</p>



<p>Those newer quality practices are practical and developer-focused, and they’re less expensive to adopt than traditional quality engineering in the short run because they live inside the development cycle. The upside of those practices is that development teams can generally implement them on their own, without asking for permission or requiring experts. The tradeoff, however, is that those practices have limited scope. They verify that the code you’re writing right now works correctly, but they don’t step back and ask whether the system as a whole fulfills its original intent. Quality engineering, on the other hand, establishes the intent of the system before the development cycle even begins, and keeps it up to date and feeds it back to the team as the project progresses. That’s a huge piece of the puzzle that got lost along the way.</p>



<p>Those highly effective quality engineering practices got cut from most software engineering teams because they were viewed as expensive, not because they were wrong. When you’re doing AI-driven development, you’re actually running into <em>exactly the same problem</em> that quality engineering was built to solve. You have a “team”—your AI coding tools—and you need a structured process to make sure that team is building what you actually intend. Quality engineering is such a good fit for AI-driven development because it’s the discipline that was specifically designed to close that gap between what you ask for and what gets built.</p>



<p>What nobody expected is that AI would make it cheap enough <em>in the short run</em> to bring quality engineering back to our projects.</p>



<h2 class="wp-block-heading"><strong>Introducing the Quality Playbook</strong></h2>



<p>I’ve long suspected that quality engineering would be a perfect fit for AI-driven development (AIDD), and I finally got a chance to test that hypothesis. As part of my experiment with AIDD and agentic engineering (which I’ve been writing about in <a href="https://oreillyradar.substack.com/p/the-accidental-orchestrator" target="_blank" rel="noreferrer noopener">The Accidental Orchestrator</a> and the rest of this series), I built the <a href="https://github.com/github/awesome-copilot/tree/main/skills/quality-playbook" target="_blank" rel="noreferrer noopener"><strong>Quality Playbook</strong></a>, a skill for AI tools like Cursor, GitHub Copilot, and Claude Code that lets you bring these highly effective quality practices to any project, using AI to do the work that used to require a dedicated quality engineering team. Like other AI skills and agents, it’s a structured document that plugs into an AI coding agent and teaches it a specific capability. You point it at a codebase, and the AI explores the code, reads whatever specifications and documentation it can find, and generates a complete quality infrastructure tailored to that project. The <a href="https://github.com/github/awesome-copilot/tree/main/skills/quality-playbook" target="_blank" rel="noreferrer noopener">Quality Playbook</a> is now part of <a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener"><strong>awesome-copilot</strong></a>, a collection of community-contributed agents (and I’ve also opened a <a href="https://github.com/anthropics/skills/pull/659" target="_blank" rel="noreferrer noopener">pull request</a> to add it to Anthropic’s repository of Claude Code skills).</p>



<p>What does “quality infrastructure” actually mean? Think about what a quality engineering team would build if you hired one. A good quality engineer would start by defining what “correct” means for your project: what the system is supposed to do, grounded in your requirements, your domain, what your users actually need. From there, they’d write tests traced to those requirements, build a code review process that checks whether the code implements what it’s supposed to, design integration tests that verify the whole system works together, and set up an audit process where independent reviewers check the code against its original intent.</p>



<p>That’s what the playbook generates. Developers using AI tools have been rediscovering the value of requirements, and spec-driven development (SDD) has become very popular. But you don’t need to be practicing strict spec-driven development to use the playbook. It infers your project’s intent from whatever artifacts are available: chat logs, schemas, README files, code comments, and even defensive code patterns. If you have formal specs, great; if not, the AI pieces together what “correct” means from the evidence it can find.</p>



<p>Once the playbook figures out the intent of the code, it creates <strong>quality infrastructure</strong> for the project. Specifically, it generates ten deliverables:</p>



<ul class="wp-block-list">
<li><strong>Exploration and requirements elicitation (EXPLORATION.md):</strong> Before the playbook writes anything, it spends an entire phase reading the code, documentation, specs, and schemas, and writes a structured exploration document that maps the project’s architecture and domain. The most common failure mode in AI-generated quality work is producing generic content that could apply to any project. The exploration phase forces the AI to ground everything in this specific codebase, and serves as an audit trail: if the requirements end up wrong, you can trace the problem back to what the exploration discovered or missed.</li>



<li><strong>Testable requirements (REQUIREMENTS.md):</strong> The most important deliverable. Building on the exploration, a five-phase pipeline extracts the actual intent of the project from code, documentation, AI chats, messages, support tickets, and any other project artifacts you can give it. The result is a specification document that a new team member or AI agent can read top-to-bottom and understand the software. Each requirement is tagged with an authority tier and linked to use cases that become the connective tissue tying requirements to integration tests to bug reports.</li>



<li><strong>Quality constitution (QUALITY.md):</strong> Defines what “correct” means for your specific project, grounded in your actual domain. Every standard has a rationale explaining why it matters, because without the rationale, a future AI session will argue the standard down.</li>



<li><strong>Spec-traced functional tests:</strong> Tests generated from the requirements, not from source code. That difference matters: a test generated from source code verifies that the code does what the code does, while a test traced to a spec verifies that the code does what you intended.</li>



<li><strong>Three-pass code review protocol with bug reports and regression tests:</strong> Three mandatory review passes, each using a different lens: structural review with anti-hallucination guardrails, requirement verification (where you catch things the code doesn’t do that it was supposed to), and cross-requirement consistency checking. Every confirmed bug gets a regression test and a patch file.</li>



<li><strong>Consolidated bug report (BUGS.md):</strong> Every confirmed bug with full reproduction details, severity calibrated to real-world impact, and a spec basis citing the specific documentation the code violates. Maintainers respond differently to “your code violates section X.Y of your own spec” than to “this looks like it might be a bug.”</li>



<li><strong>TDD red/green verification:</strong> For each confirmed bug, a regression test runs against unpatched code (must fail), then the fix is applied and the test reruns (must pass). When you tell a maintainer “here’s a test that fails on your current code and passes with this one-line fix,” that’s qualitatively different from a bug report.</li>



<li><strong>Integration test protocol:</strong> A structured test matrix that an AI agent can pick up and execute autonomously, without asking clarifying questions. Every test specifies the exact command, what it proves, and specific pass/fail criteria. Field names and types are read from actual source files, not recalled from memory, as an anti-hallucination mechanism.</li>



<li><strong>Council of Three multi-model spec audit:</strong> Three independent AI models audit the codebase against the requirements. The triage uses confidence weighting, not majority vote: findings from all three models are near-certain, findings from two are high-confidence, and findings from only one get a verification probe rather than being dismissed. The most valuable findings are often the ones only one model catches.</li>



<li><strong>AGENTS.md bootstrap file:</strong> A context file that future AI sessions read first, so they inherit the full quality infrastructure. Without it, every new session starts from zero. With it, the quality constitution, requirements, and review protocols carry forward automatically across every session that touches the codebase.</li>
</ul>
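<p>To make the “spec-traced” idea concrete, here’s a minimal sketch of what a requirement-tagged test can look like. The requirement ID, function, and limits are hypothetical, invented for illustration; they’re not taken from the playbook’s actual output.</p>

```python
# Hypothetical sketch: REQ-AUTH-003 and validate_username are
# illustrative names, not actual Quality Playbook output.

def validate_username(name: str) -> bool:
    """Accept usernames of 1 to 32 characters."""
    return 0 < len(name) <= 32

def test_req_auth_003_username_length():
    # Traces to REQ-AUTH-003: "usernames are 1-32 characters."
    # The test is derived from the requirement, not from the source
    # code, so it still fails if the implementation drifts from the spec.
    assert validate_username("alice")       # nominal case
    assert not validate_username("")        # lower boundary
    assert not validate_username("x" * 33)  # upper boundary

test_req_auth_003_username_length()
```

<p>The requirement tag in the test name is the connective tissue: a failing test points straight back to the requirement it verifies, rather than just to a line of code.</p>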



<h2 class="wp-block-heading"><strong>The third option</strong></h2>



<p>I started this article by talking about a false choice: either we surrender our judgment to the AI, or we get stuck reviewing every line of code it writes. The reality is much more nuanced, and, in my opinion, a lot more interesting, if we have a trustworthy way to verify that the code we build with the AI actually does what we intended. It’s not a coincidence that this is one of the oldest problems in software engineering, and not surprising that AI can help us with it.</p>



<p>The Quality Playbook leans heavily on classic quality engineering techniques to do that verification. Those techniques work very well, and that gives us the more nuanced option: using AI to help us write our code, and then using it to help us trust what it built.</p>



<p>That’s not a gimmick or a paradox. It works because verification is exactly the kind of structured, specification-driven work that AI is good at. Writing tests traced to requirements, reviewing code against intent, checking that the system does what it’s supposed to do under real conditions. These are the things quality engineers used to do across the whole industry (and still do in the highly regulated parts of it). They’re also things that AI can do well, as long as we tell it what “correct” means.</p>
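<p>As one example of how mechanical that structured work can be, the confidence-weighted triage described for the Council of Three audit fits in a few lines. This is a hypothetical sketch of the idea, not the playbook’s actual code; the model and finding names are invented.</p>

```python
# Hypothetical sketch of confidence-weighted triage for multi-model
# audit findings; all names here are illustrative.

def triage(findings_by_model):
    """findings_by_model: {model_name: set of finding IDs}.
    Returns {finding: tier} based on how many models reported it."""
    tiers = {3: "near-certain", 2: "high-confidence", 1: "needs-probe"}
    counts = {}
    for findings in findings_by_model.values():
        for f in findings:
            counts[f] = counts.get(f, 0) + 1
    return {f: tiers[n] for f, n in counts.items()}

audit = {
    "model_a": {"BUG-1", "BUG-2"},
    "model_b": {"BUG-1", "BUG-3"},
    "model_c": {"BUG-1", "BUG-2"},
}
for finding, tier in sorted(triage(audit).items()):
    print(finding, tier)
# BUG-1 near-certain, BUG-2 high-confidence, BUG-3 needs-probe
```

<p>The key design choice is that a finding reported by only one model earns a verification probe instead of being thrown away, which is where a simple majority vote would lose the most valuable results.</p>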



<p>The experienced engineers I talked about at the beginning of this article, the ones who only use AI for unit tests and code reviews, aren’t wrong to be cautious. They’re right that we can&#8217;t just trust whatever output the AI spits out. But limiting AI to just the “safe” parts of our projects keeps us from taking advantage of such an important set of tools. The way out of this quagmire is to build the infrastructure that makes the rest of it trustworthy too. Quality engineering gives us that infrastructure, and AI makes it cheap enough to actually use on all of our projects every day.</p>



<p>In the next few articles, I’ll show you what happened when I pointed the Quality Playbook at real, mature open-source codebases and it started finding real bugs, how the playbook emerged from my AI-driven development experiment, what the quality engineering mindset looks like in practice, and how we can learn important lessons from that experience that apply to all of our projects.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>The </em><a href="https://github.com/andrewstellman/quality-playbook" target="_blank" rel="noreferrer noopener"><em>Quality Playbook</em></a><em> is open source and works with GitHub Copilot, Cursor, and Claude Code. It&#8217;s also available as part of </em><a href="https://awesome-copilot.github.com/#file=skills%2Fquality-playbook%2FSKILL.md" target="_blank" rel="noreferrer noopener"><em>awesome-copilot</em></a><em>. You can try it out today by downloading it into your project and asking the AI to generate the quality playbook. The whole process takes about 10-15 minutes for a typical codebase. I&#8217;ll cover more details on running it in future articles in this series.</em></p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Disclosure: Aspects of the methodology described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open-source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.</em></p>
</blockquote>
]]></content:encoded>
										</item>
		<item>
		<title>Grief and the Nonprofessional Programmer</title>
		<link>https://www.oreilly.com/radar/grief-and-the-nonprofessional-programmer/</link>
				<pubDate>Tue, 14 Apr 2026 11:16:00 +0000</pubDate>
					<dc:creator><![CDATA[Mike Loukides]]></dc:creator>
						<category><![CDATA[AI & ML]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[Commentary]]></category>

		<guid isPermaLink="false">https://www.oreilly.com/radar/?p=18536</guid>

		
					<media:content 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Grief-and-the-nonprofessional-programmer.jpg" 
				medium="image" 
				type="image/jpeg" 
				width="2304" 
				height="1792" 
			/>

			<media:thumbnail 
				url="https://www.oreilly.com/radar/wp-content/uploads/sites/3/2026/04/Grief-and-the-nonprofessional-programmer-160x160.jpg" 
				width="160" 
				height="160" 
			/>
		
		
				<description><![CDATA[I can’t claim to be a professional software developer—not by a long shot. I occasionally write some Python code to analyze spreadsheets, and I occasionally hack something together on my own, usually related to prime numbers or numerical analysis. But I have to admit that I identify with both of the groups of programmers that [&#8230;]]]></description>
								<content:encoded><![CDATA[
<p>I can’t claim to be a professional software developer—not by a long shot. I occasionally write some Python code to analyze spreadsheets, and I occasionally hack something together on my own, usually related to prime numbers or numerical analysis. But I have to admit that I identify with both of the groups of programmers that Les Orchard identifies in “<a href="https://blog.lmorchard.com/2026/03/11/grief-and-the-ai-split/" target="_blank" rel="noreferrer noopener">Grief and the AI Split</a>”: those who just want to make a computer do something and those who grieve losing the satisfaction they get from writing good code.</p>



<p>A lot of the time, I just want to get something done; that’s particularly true when I’m grinding through a spreadsheet with sales data that has a half-million rows. (Yes, compared to databases, that’s nothing.) It’s frustrating to run into some roadblock in pandas that I can’t solve without looking through documentation, tutorials, and several incorrect Stack Overflow answers. But there’s also the programming that I do for fun—not all that often, but occasionally: writing a really big prime number sieve, seeing if I can do a million-point convex hull on my laptop in a reasonable amount of time, things like that. And that’s where the problem comes in&#8230;if there really is a problem.</p>



<p>The other day, I read a post of Simon Willison’s that included <a href="https://simonwillison.net/2026/Mar/11/" target="_blank" rel="noreferrer noopener">AI-generated animations of the major sorting algorithms</a>. No big deal in itself; I’ve seen animated sorting algorithms before. Simon’s were different only in that they were AI-generated—but that made me want to try vibe coding an animation rather than something static. Graphing the first N terms of a <a href="https://en.wikipedia.org/wiki/Fourier_series" target="_blank" rel="noreferrer noopener">Fourier series</a> has long been one of the first things I try in a new programming language. So I asked Claude Code to generate an interactive web animation of the Fourier series. Claude did just fine. I couldn’t have created the app on my own, at least not as a single-page web app; I’ve always avoided JavaScript, for better or for worse. And that was cool, though, as with Simon’s sorting animations, there are plenty of Fourier animations online.</p>



<p>I then got interested in animations that aren’t so common. I grabbed <a href="https://learning.oreilly.com/library/view/algorithms-in-a/9781491912973/" target="_blank" rel="noreferrer noopener"><em>Algorithms in a Nutshell</em></a>, started looking through the chapters, and asked Claude to animate a number of things I hadn’t seen, ending with <a href="https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm" target="_blank" rel="noreferrer noopener">Dijkstra’s algorithm</a> for finding the shortest path through a graph. It had some trouble with a few of the algorithms, though when I asked Claude to generate a plan first and used a second prompt asking it to implement the plan, everything worked.</p>



<p>And it was fun. I made the computer do things I wanted it to do; the thrill of controlling machines is something that sticks with us from our childhoods. The prompts were simple and short—they could have been much longer if I wanted to specify the design of the web page, but Claude’s sense of taste was good enough. I had other work to do while Claude was “thinking,” including attending some meetings, but I could easily have started several instances of Claude Code and had them create simulations in parallel. Doing so wouldn’t have required any fancy orchestration because every simulation was independent of the others. No need for Gas Town.</p>



<p>When I was done, I felt a version of the grief Les Orchard writes about. More specifically: I don’t really understand Dijkstra’s algorithm. I know what it does and have a vague idea of how it works, and I’m sure I could understand it if I read <em>Algorithms in a Nutshell</em> rather than used it as a catalog of things to animate. But now that I had the animation, I realized that I hadn’t gone through the process of understanding the algorithm well enough to write the code. And I cared about that.</p>
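<p>For anyone in the same position: the heart of Dijkstra’s algorithm really is small. Here’s a minimal Python sketch of it (my own illustration for this edit, not the animated version Claude generated):</p>

```python
import heapq

def dijkstra(graph, start):
    """Shortest distances from start in a weighted graph.
    graph: {node: [(neighbor, weight), ...]}, non-negative weights."""
    dist = {start: 0}
    heap = [(0, start)]                      # (distance so far, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                         # stale heap entry, skip it
        for nbr, w in graph[node]:
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd               # found a shorter path
                heapq.heappush(heap, (nd, nbr))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

<p>The insight the code encodes: always settle the closest unsettled node next, because with non-negative weights no later path can beat it.</p>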



<p>I also cared about Fourier transformations: I would never “need” to write that code again. If I decide to learn Rust, will I write a Fourier program, or ask Claude to do it and inspect the output? I already knew the theory behind Fourier transforms—but I realized that an era had ended, and I still don’t know how I feel about that. Indeed, a few months ago, I vibe coded an application that recorded some audio from my laptop’s microphone, did a discrete Fourier transform, and displayed the result. After pasting the code into a file, I took the laptop over to the piano, started the program, played a C, and saw the fundamental and all the harmonics. The era was already in the past; it just took a few months to hit me.</p>
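<p>The math behind that app is compact, too. Here’s a naive pure-Python DFT sketch (an illustration of the underlying idea, not the vibe-coded app’s actual code) that finds the dominant frequency bin of a pure tone:</p>

```python
import cmath, math

def dft(samples):
    """Naive discrete Fourier transform, O(N^2) -- the textbook sum,
    not the fast butterfly algorithm."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A pure tone: 8 cycles across 64 samples, like a single piano note.
n = 64
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
spectrum = [abs(c) for c in dft(tone)]
peak = spectrum.index(max(spectrum[: n // 2]))
print(peak)  # 8 -- the energy lands in the bin of the tone's frequency
```

<p>Playing a C into the microphone is the same experiment with messier input: the fundamental and its harmonics each show up as peaks in their own bins.</p>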



<p>Why does this bother me? My problem isn’t about losing the pleasure of turning ideas into code. I’ve always found coding at least somewhat frustrating, and at times, seriously frustrating. But I’m bothered by the lack of understanding: I was too lazy to look up how Dijkstra works, too lazy to look up (again) how discrete Fourier works. I made the computer do what I wanted, but I lost the understanding of how it did it.</p>



<p>What does it mean to lose the understanding of how the code works? Anything? It’s common to place the transition to AI-assisted coding in the context of the transition from assembly language to higher-level languages, a process that started in the late 1950s. That’s valid, but there’s an important difference. You can certainly program a discrete fast Fourier transform in assembly; that may even be one of the last bastions of assembly programs, since FFTs are extremely useful and often have to run on relatively slow processors. (The “butterfly” algorithm is very fast.) But you can’t learn signal processing by writing assembly any more than you can learn graph theory. When you’re writing in assembler, you have to know what you’re doing in advance. The early high-level languages (Fortran, Lisp, Algol, even BASIC) are much better for gradually pushing forward to understanding, to say nothing of our modern languages.</p>



<p>That is the real source of grief, at least for me. I want to understand how things work. And I admit that I’m lazy. Understanding how things work quickly comes in conflict with getting stuff done—especially when staring at a blank screen—and writing Python or Java has a lot to do with how you come to an understanding. I will never need to understand convex hulls or Dijkstra’s algorithm. But thinking more broadly about this industry, I wonder whether we’ll be able to solve the new problems if we delegate understanding the old problems to AI. In the past, I’ve argued that I don’t see AI becoming genuinely creative because creativity isn’t just a recombination of things that already exist. I’ll stick by that, especially in the arts. AI may be a useful tool, but I don’t believe it will become an artist. But anyone involved with the arts also understands that creativity doesn’t come from a blank slate; it also requires an understanding of history, of how problems were solved in the past. And that makes me wonder whether humans—at least in computing—will continue to be creative if we delegate that understanding to AI.</p>



<p>Or does creativity just move up the stack to the next level of abstraction? And is that next level of abstraction all about understanding problems and writing good specifications? Writing a detailed specification is itself a kind of programming. But I don’t think that kind of programming will assuage the grief of the programmer who loves coding—or who may not love coding but loves the understanding that it brings.</p>
]]></content:encoded>
										</item>
	</channel>
</rss>
