Tomasz Tunguz

The Substitution Wave in AI

Sun, 07 Jun 2026 00:00:00 +0000

Three forces are reshaping the AI cost structure :

Foundation labs are moving up the stack into applications,¹ ²
Frontier model prices keep rising for the smartest models,³
Open-source models have crossed the good-enough threshold for most use cases.⁴ ⁵

The natural response from AI buyers is substitution.

Coinbase⁶ :

At Coinbase we’re working hard on routing prompts to cheaper models where appropriate, & in some cases have been able to keep costs roughly flat, while token usage continues to grow exponentially.

Lindy⁷ :

Pulled the trigger today & switched 100% of Lindy traffic to DeepSeek v4, churning from Anthropic models. Saves us millions of $ & we’re actually seeing an increase in performance on many core use cases. Transformative for the business.

Harvey⁸ :

On a 100-task slice of our Legal Agent Benchmark (LAB), SFT moved Kimi 2.6’s all-pass rate from 11% to 15%, beating Opus’ 14%. But the cost gap was even more striking : $84 vs $954 across the same 100 tasks, or ~11x cheaper.
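Harvey's figures can be restated as dollars per passing task, which is the outcome a buyer actually pays for. Below is a minimal sketch that uses only the numbers in the quote: 100 tasks, all-pass rates of 15% and 14%, and total costs of $84 and $954. The function name is illustrative.

```python
def cost_per_pass(total_cost: float, passes: int) -> float:
    """Dollars spent per task that passed every check (the 'all-pass' outcome)."""
    return total_cost / passes

# Figures from the Harvey quote: 100 tasks, so 15% = 15 passes and 14% = 14 passes.
kimi_sft = cost_per_pass(84, 15)
opus = cost_per_pass(954, 14)

print(f"Kimi (SFT): ${kimi_sft:.2f} per passing task")
print(f"Opus:       ${opus:.2f} per passing task")
print(f"raw cost gap: {954 / 84:.1f}x, per-pass gap: {opus / kimi_sft:.1f}x")
```

The fine-tuned model also passes one more task. That makes the gap per passing task (about $5.60 vs $68.14, roughly 12x) a bit wider than the ~11x raw cost gap.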

Cursor went further. They post-trained Kimi K2.5 into their own production model, Composer.⁹

Composer 2.5 is exceptionally intelligent & up to 10x more efficient than similarly capable models.

Coinbase’s quote shows where the savings go : costs flat, tokens exponential. Buyers don’t pocket the discount — they spend it on more intelligence.

Closed models are getting more expensive at the frontier; open models are getting cheaper at parity. The choice is which slope you want under your unit economics.

The Minimill of AI

Fri, 05 Jun 2026 00:00:00 +0000

A laptop on my desk now handles 78% of my AI work, with the rest sent to the cloud. The shift came out of my skill distillation work.

Here’s how it works.

I create tasks in Asana. An agent reads each task (scheduling, email triage, research, a CRM update) & classifies it as easy or hard. If it’s straightforward, a local model on my Mac handles it in seconds. If it’s complex, the same model routes it to a cloud model.
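The two-lane routing can be sketched in a few lines. This is a minimal sketch under stated assumptions. `classify` is a hypothetical keyword heuristic standing in for the local model's easy/hard judgment. `run_local` and `run_cloud` are hypothetical callables. None of these names come from the author's actual system.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    title: str
    kind: str  # e.g. "scheduling", "email", "research", "crm"

# Hypothetical stand-in for the local model's easy/hard classification.
EASY_KINDS = {"scheduling", "email", "crm"}

def classify(task: Task) -> str:
    return "easy" if task.kind in EASY_KINDS else "hard"

def route(task: Task,
          run_local: Callable[[Task], str],
          run_cloud: Callable[[Task], str]) -> str:
    """Keep easy tasks on the laptop; escalate hard ones to a cloud model."""
    if classify(task) == "easy":
        return run_local(task)
    return run_cloud(task)

def local_share(tasks: list[Task]) -> float:
    """Fraction of tasks handled locally."""
    if not tasks:
        return 0.0
    return sum(classify(t) == "easy" for t in tasks) / len(tasks)

tasks = [Task("Book 1:1 with Sam", "scheduling"),
         Task("Triage inbox", "email"),
         Task("Market map for vector DBs", "research"),
         Task("Log call notes", "crm")]
for t in tasks:
    print(route(t, lambda x: f"local: {x.title}", lambda x: f"cloud: {x.title}"))
print(f"local share: {local_share(tasks):.0%}")
```

In practice the classifier is the local model itself. The design point is that classification is cheap and local, so only tasks judged hard ever incur cloud cost.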

Over the last seven days, the daily local share peaked at 88%.

As the workload grew, the two-lane design paid off. Throughput jumped about 25%, average task duration fell from 47 seconds to 19, & queue age dropped from 73 seconds to four. Nothing about the work changed. Small, fast tasks simply stopped waiting behind big, slow ones.
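A toy simulation shows why queue age collapses when small tasks stop waiting behind big ones. It compares one FIFO line with two lanes that run side by side. The durations are made-up illustrative numbers, not the author's measurements. The model also assumes all tasks are queued at t=0 and that the two lanes (local and cloud) run concurrently.

```python
def fifo_waits(durations: list[float]) -> list[float]:
    """Wait time of each task in a single first-in-first-out lane."""
    waits, clock = [], 0.0
    for d in durations:
        waits.append(clock)   # each task waits for everything ahead of it
        clock += d
    return waits

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0

# Hypothetical mix: mostly 5 s easy tasks, a few 60 s hard ones.
tasks = [60, 5, 5, 60, 5, 5, 5, 60, 5, 5]
EASY_CUTOFF = 10  # seconds; illustrative threshold

one_lane = fifo_waits(tasks)
easy_waits_one_lane = [w for w, d in zip(one_lane, tasks) if d <= EASY_CUTOFF]

# Two concurrent lanes: each task only waits on its own lane's queue.
two_lanes_easy = fifo_waits([d for d in tasks if d <= EASY_CUTOFF])
two_lanes_hard = fifo_waits([d for d in tasks if d > EASY_CUTOFF])

print(f"easy-task mean wait, one lane:  {mean(easy_waits_one_lane):.1f} s")
print(f"easy-task mean wait, two lanes: {mean(two_lanes_easy):.1f} s")
print(f"all-task mean wait, one lane:   {mean(one_lane):.1f} s")
print(f"all-task mean wait, two lanes:  {mean(two_lanes_easy + two_lanes_hard):.1f} s")
```

The total work is identical in both cases. Only the waiting changes: easy tasks no longer sit behind the 60-second ones.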

The task factory that uses distilled skills is now humming along with 25% more throughput, queue age down 94%, & a much more responsive system. For now, the cloud handles the hard fifth. The Mac handles the rest.

It’s the minimill of agentic work. Nucor’s minimills¹ started small, capital-light, & close to demand; within a generation they outflanked the integrated steel giants.

Every laptop, phone, & edge device with enough memory to host a distilled model becomes its own minimill : routing locally, paying cloud rates only for the hard fifth. Tens of millions of these will proliferate inside companies in the next few years, each one quietly absorbing much of the work that today shows up on a hyperscaler invoice.

Nucor began in the 1960s by melting scrap steel in electric-arc furnaces rather than smelting iron ore in giant integrated blast-furnace mills. Each minimill was a fraction of the size & cost of an integrated plant, sited near regional demand, & ran on flexible, lower-cost labor. The integrated mills dismissed minimills as fit only for low-grade products like rebar. Over the next thirty years Nucor moved up-market into sheet steel & structural beams, & by 2014 had become the largest steel producer in the United States, while most of the integrated giants (Bethlehem, LTV, National) had gone bankrupt. Clayton Christensen used the story as the canonical example of disruptive innovation in The Innovator’s Dilemma. ↩︎

Intelligence Per Dollar

Wed, 03 Jun 2026 00:00:00 +0000

Yesterday Microsoft added a new metric to a model release card, one that will likely become a standard.¹

Average token usage.

In the first row of the release card, the Microsoft model hits 71.6 on SWE-Bench Verified using about a third of the tokens Claude Haiku 4.5 burns.

Benchmarks are now measured on two dimensions : overall performance & the cost to achieve that intelligence.

This is yet another sign that the era of subsidies², tokenmaxxing³, & all-out performance for many use cases is over.

Even the most valuable companies in the world cannot afford state-of-the-art intelligence for every conceivable use case.⁴ Uber capped employee AI spending after blowing through its budget in four months.⁵ Salesforce is spending $300M on Anthropic tokens & has frozen engineering hires.⁶

This new dual benchmark answers the buyer’s only question : what is my intelligence per dollar?

Artificial Analysis already benchmarks this.⁷ GPT 5.5 & Claude Opus 4.8 land within a point of each other on the Intelligence Index, around 60. Running the index costs $3,357 on GPT 5.5 & $4,685 on Opus 4.8. Same answer, 40% more expensive.
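Intelligence per dollar is just the index score divided by the cost of running the index. Here is a minimal sketch using the costs quoted above. Both scores are approximated as 60, since the post only says the models land within a point of each other. The scaling to points per $1,000 is an arbitrary choice for readability.

```python
from dataclasses import dataclass

@dataclass
class Run:
    model: str
    index_score: float  # Intelligence Index score (approximate)
    cost_usd: float     # cost to run the full index

    def points_per_1k_usd(self) -> float:
        """Intelligence per dollar, scaled to index points per $1,000."""
        return self.index_score / self.cost_usd * 1_000

# Scores approximated as 60 for both ("within a point of each other").
gpt = Run("GPT 5.5", 60, 3_357)
opus = Run("Claude Opus 4.8", 60, 4_685)

premium = opus.cost_usd / gpt.cost_usd - 1
print(f"{gpt.model}: {gpt.points_per_1k_usd():.1f} points per $1k")
print(f"{opus.model}: {opus.points_per_1k_usd():.1f} points per $1k")
print(f"cost premium for the same score: {premium:.0%}")
```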

Model companies must now compete on both dimensions. The application layer will compete one level up, on dollars per outcome, what a closed ticket, a shipped PR, or a resolved support case actually costs.

Every layer in the stack now has to price the same way the customer thinks : per result, not per token.

Introducing MAI-Code-1-Flash — Microsoft announces a new coding model with average token usage on the release card. ↩︎
The Unsustainable Subsidy — The era of AI subsidies is ending. ↩︎
Tokenmaxxing — Models that game benchmarks with extra tokens are losing their edge. ↩︎
Microsoft cancels Claude Code licenses, shifting developers to GitHub Copilot CLI — Microsoft cancelled Claude Code licenses across its Experiences and Devices division (Windows, Microsoft 365, Outlook, Teams, Surface) after engineering usage outran budgets. ↩︎
Uber caps employee AI spending after blowing through budget in 4 months. ↩︎
Salesforce Spends $300M on AI, Freezes Engineering Hires. ↩︎
AI Model & API Providers Analysis — Independent analysis of AI model costs. ↩︎

The Thriving Ecosystem of Open Models

Tue, 02 Jun 2026 00:00:00 +0000

Competition is a discovery procedure. — Friedrich Hayek

And developers are discovering the value of open models.

OpenRouter offers a useful view into the model market.¹ It is not the whole AI economy. But it is close to the API frontier, where developers can switch models quickly, compare price-performance daily, & route each request to the best available option.

Since 2025, open models have grown sharply on OpenRouter. In the latest model-level snapshot, open-weight models generated 69.1% of named open-versus-closed token volume. Closed models produced 30.9%.
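The open-versus-closed share is a ratio over weekly token counts. One assumption here is that traffic not labeled open or closed is excluded from the denominator, which is one plausible reading of "named." The sketch assumes a hypothetical row format of (model, license, tokens). The rows are invented for illustration and are not OpenRouter figures.

```python
from collections import defaultdict

# Hypothetical weekly snapshot rows: (model, license, tokens). Not real data.
ROWS = [
    ("open-model-a", "open", 300e9),
    ("open-model-b", "open", 100e9),
    ("closed-model-x", "closed", 500e9),
    ("closed-model-y", "closed", 100e9),
    ("unclassified-route", None, 50e9),  # excluded: neither named open nor closed
]

def open_vs_closed_share(rows) -> dict[str, float]:
    """Share of named open-versus-closed token volume."""
    totals: dict[str, float] = defaultdict(float)
    for _model, license_, tokens in rows:
        if license_ in ("open", "closed"):
            totals[license_] += tokens
    named = totals["open"] + totals["closed"]
    if named == 0:
        return {"open": 0.0, "closed": 0.0}
    return {k: totals[k] / named for k in ("open", "closed")}

shares = open_vs_closed_share(ROWS)
print({k: f"{v:.1%}" for k, v in shares.items()})
```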

New models attract developer attention & large-scale testing, after which token use surges. Each cluster of new releases lifts token volume to a new plateau.

Just as in the closed-model ecosystem, the competition among open models means rapid innovation & leaderboard changes.

DeepSeek’s early lead gave way to MiniMax & Kimi models in late 2025 & early 2026. Later, launches from MiMo, Qwen (Alibaba’s open-weight model family), Hy3 (Tencent’s open-weight model release), & DeepSeek reshuffled share again.

Arcee, a US lab, has made a strong appearance recently.

Across the broader AI economy, open models still represent a fraction of inference. But the thriving competition, increasing usage, & surge of experimentation suggest developers are increasingly willing to route production traffic to them.

Source data: OpenRouter rankings & usage data, analyzed from weekly token-volume snapshots in the OpenRouter analysis dataset. ↩︎

The AI Skepticism Map

Mon, 01 Jun 2026 00:00:00 +0000

With Michael Burry¹ & Leopold Aschenbrenner² placing heavy short bets on AI, & open questions about GPU depreciation & the SaaSpocalypse, how negative is the financial market on AI?

We can look at the percentage of shares sold short, a bet the stock will decline.

Across software, semiconductor, neocloud, data center, & hyperscaler stocks, the median short interest (short shares / total shares) has increased by about 24% in the last quarter.

One segment stands out for gloomy skies in the cloud: the GPU data center businesses, whose shorted shares have grown 60% in the last year³. AI cloud and neocloud companies have the highest current median short interest at 16.8% of float.

The negative sentiment for SaaS & Dev Tools is a more abrupt & recent phenomenon. Developer tools and infrastructure software follow at 9.5%. Enterprise SaaS and AI apps sit at 8.9%.

Hyperscalers are at the other end of the spectrum. Their median short interest is 1.1%. NVIDIA, the defining AI infrastructure stock, is also lightly shorted: 1.2%.

Semiconductor stocks saw a decrease in short-selling. Memory makers like Micron are up 742% this year⁴, & many ecosystem CEOs point to memory & storage as the limiting factor. The newest trillion-dollar companies are all memory makers.

The stocks with the most active bearish bettors? Most are small- or mid-cap companies. The updated chart adds market capitalization to each company label. The largest AI winners are mostly absent.

SoundHound AI is 36.3% short. C3.ai is 32.2%. BigBear.ai is 29.4%. Applied Digital is 28.0%. UiPath is 22.0%. TeraWulf is 21.3%.
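The metric behind these numbers is a per-stock ratio, followed by a median across a group. Here is a minimal sketch using the six most-shorted names quoted above, in % of float. The share counts and the 5.0% → 6.2% segment medians are hypothetical. They only show that "increased by about 24%" is most plausibly a relative change in the median, not 24 percentage points.

```python
from statistics import median

def short_interest(shares_short: float, denominator: float) -> float:
    """Shares sold short as a fraction of the share base (float, in the figures above)."""
    return shares_short / denominator

def pct_change(old: float, new: float) -> float:
    """Relative change, e.g. 0.24 for a 24% increase."""
    return (new - old) / old

# The six most-shorted names quoted above, in % of float.
MOST_SHORTED = {
    "SoundHound AI": 36.3, "C3.ai": 32.2, "BigBear.ai": 29.4,
    "Applied Digital": 28.0, "UiPath": 22.0, "TeraWulf": 21.3,
}

print(f"median of the six: {median(MOST_SHORTED.values()):.1f}%")
print(f"{short_interest(3_000_000, 20_000_000):.1%}")  # hypothetical stock
print(f"{pct_change(5.0, 6.2):.0%}")                   # hypothetical segment medians
```

With an even number of stocks, the median is the average of the two middle values (28.0% and 29.4%).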

This is the market’s current AI skepticism map.

The skepticism is concentrated in companies whose AI exposure still depends on future capital access, future demand, or future operating leverage.

That distinction matters. If short interest were rising uniformly across AI semiconductors, hyperscalers, and software, the message would be broad fatigue with the AI trade. Instead, the data suggest a more specific view: memory has become critical & in short supply; software & devtools businesses need to prove their worth post-AI; & businesses reselling GPUs have more than their fair share of doubters about current prices versus long-term value.

Skill Distillation

Fri, 29 May 2026 00:00:00 +0000

I’ve been using state-of-the-art models to teach small models running on my computer how I work.

My personal agent, based on Pi, runs my inbox, my deal pipeline, my blog publishing, my calendar, & my research. It looks less like a chatbot & more like a small operating system.

The first layer is QMD, a local markdown knowledge base of about eighty workflow files in ~/memories. Before answering any procedural question, the agent searches QMD for the right playbook.

The second layer is Skills, atomic SKILL.md files that describe one job each. The skills are written by a frontier model. So are the evaluations that grade them. The same system writes, tests, and rewrites each skill until accuracy converges. It also checks recall against QMD, so the right keywords always surface the right skill.
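The post doesn't describe QMD's actual search method. This sketch swaps in plain keyword overlap to show the shape of the recall check: for each labeled query, does the top hit match the expected skill? The skill texts and eval queries are hypothetical.

```python
import re

# Hypothetical skill index: skill name -> text of its SKILL.md.
SKILLS = {
    "triage-inbox": "Triage email inbox. Label, archive, draft replies.",
    "update-crm": "Update CRM deal stage after a call. Log notes to the deal.",
    "publish-post": "Publish blog post. Check links, schedule, push to site.",
}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query: str, skills: dict[str, str]) -> str:
    """Return the skill whose text shares the most keywords with the query."""
    q = tokens(query)
    return max(skills, key=lambda name: len(q & tokens(skills[name])))

def recall_at_1(labeled: list[tuple[str, str]], skills: dict[str, str]) -> float:
    """Share of test queries whose top hit is the expected skill."""
    hits = sum(search(q, skills) == expected for q, expected in labeled)
    return hits / len(labeled)

# Hypothetical eval set, the kind a frontier model might write for these skills.
EVALS = [
    ("archive the newsletters in my email", "triage-inbox"),
    ("log notes from today's call on the deal", "update-crm"),
    ("schedule the blog post for Friday", "publish-post"),
]
print(f"recall@1 = {recall_at_1(EVALS, SKILLS):.2f}")
```

When recall drops below 1.0, the failing query points at the skill whose keywords need rewriting.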

The third layer is the Agent Loop, a model running Plan → Tool Call → Observe → Refine, calling out to seventeen Rust APIs & a handful of MCP integrations.
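The control flow of the loop can be sketched under stated assumptions. A hypothetical `plan` function stands in for the model; here it is a scripted planner. Plain Python callables stand in for the Rust APIs and MCP tools. This is not the Pi agent's actual code.

```python
from typing import Callable, Optional

Step = tuple[str, str]  # (tool name, argument)
Planner = Callable[[str, list[str]], Optional[Step]]

def agent_loop(goal: str, plan: Planner, tools: dict[str, Callable[[str], str]],
               max_steps: int = 10) -> list[str]:
    """Plan -> Tool Call -> Observe -> Refine, until the planner returns None."""
    observations: list[str] = []
    for _ in range(max_steps):            # hard stop so the loop can't run forever
        step = plan(goal, observations)   # Plan / Refine: planner sees all prior observations
        if step is None:                  # planner judges the goal complete
            break
        tool_name, arg = step
        result = tools[tool_name](arg)    # Tool Call
        observations.append(f"{tool_name}({arg}) -> {result}")  # Observe
    return observations

# Hypothetical scripted planner standing in for the model.
def scripted_planner(goal: str, observations: list[str]) -> Optional[Step]:
    script = [("calendar", "next free slot"), ("email", "send invite")]
    return script[len(observations)] if len(observations) < len(script) else None

TOOLS = {
    "calendar": lambda query: "Tue 10:00",
    "email": lambda body: "sent",
}

for line in agent_loop("schedule a call with Sam", scripted_planner, TOOLS):
    print(line)
```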

One of the techniques I’ve started to use is skill distillation. A frontier model (Opus 4.7, GPT-5.1, or Gemini 3 Pro) authors & refines the skill files. A smaller model (Qwen 35B or Gemma 26B, running locally) executes them. The teacher transfers procedural knowledge to the student through markdown. The skill is inspectable, versionable, & hot-swappable.

This is fundamentally different from classical knowledge distillation, which compresses a big model’s soft probability outputs into a smaller model’s weights. It’s different from instruction tuning, which bakes behavior into weights through prompt-response pairs. It’s different from RAG, which retrieves facts.

Skill distillation retrieves procedures. The smaller model doesn’t have to know how to evaluate a company. It just has to know how to follow the steps.
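The teacher/student split can be sketched as a parser plus a step runner. The skill-file format (a title and numbered steps) and the `student` callable are assumptions for illustration. Real SKILL.md files and local-model calls will differ.

```python
import re
from typing import Callable

# Hypothetical SKILL.md written by the teacher model.
SKILL_MD = """\
# evaluate-company
1. Pull the latest revenue and growth figures from the CRM.
2. Compare growth against the benchmark table in memory.
3. Draft a three-bullet summary for the partner meeting.
"""

def parse_steps(markdown: str) -> list[str]:
    """Extract numbered steps ("1. ...") from a skill file."""
    return re.findall(r"^\d+\.\s+(.*)$", markdown, flags=re.MULTILINE)

def run_skill(markdown: str, student: Callable[[str], str]) -> list[str]:
    """Have the small model execute the steps one at a time, in order."""
    return [student(step) for step in parse_steps(markdown)]

# Stand-in for a local model such as Qwen or Gemma: it just echoes each step.
outputs = run_skill(SKILL_MD, lambda step: f"done: {step}")
print(len(outputs), outputs[0])
```

Because the procedure lives in the file rather than the weights, swapping the student means changing one callable.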

Every night a system runs through historical logs to understand what new skills should be generated, mirroring the loop that Pete Koomen described at Y Combinator earlier this week.
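One piece of that nightly pass can be sketched as counting which task types keep appearing without a matching skill. The log format and the threshold are hypothetical. The author's system, and the loop Koomen described, may select candidates quite differently.

```python
from collections import Counter

# Hypothetical log lines: "<timestamp> <task-type> <skill-used or '-'>"
LOGS = """\
2026-05-28T09:01 triage-inbox triage-inbox
2026-05-28T09:15 board-deck-prep -
2026-05-28T11:40 board-deck-prep -
2026-05-28T14:02 update-crm update-crm
2026-05-28T16:30 board-deck-prep -
2026-05-28T17:05 expense-report -
"""

def skill_candidates(log_text: str, min_count: int = 3) -> list[str]:
    """Task types handled without a skill at least `min_count` times, most frequent first."""
    misses: Counter = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        _ts, task_type, skill = line.split()
        if skill == "-":
            misses[task_type] += 1
    return [t for t, n in misses.most_common() if n >= min_count]

print(skill_candidates(LOGS))
```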

The frontier model becomes a teacher. The library becomes the company’s institutional knowledge. The student becomes whichever model happens to be cheapest this quarter.

Security in the Age of AI Agents: Office Hours with Jonathan Jaffe

Thu, 28 May 2026 00:00:00 +0000

When security practitioners become engineers, the mission changes from managing people to architecting the automated policies that govern an agentic world.

Jonathan Jaffe, CISO at Lemonade, joined me on Office Hours to discuss what this means for how we build, secure, & operate AI systems when both sides are automated.

AI is just as powerful for defenders as it is for attackers. The fear narrative underestimates this fact. Defenders harden everywhere, simultaneously, because every vendor in the stack is also racing to ship.

“There are tens of thousands of attack targets out there. The chances that you’re going to be one of those is small. At the same time, all of the vendors that you use will also have access to this to improve their services.”

The window of exploitability is narrowing. Yes, AI will write more vulnerable code. But AI-written code also gets reviewed, pen-tested, & patched faster than any human pipeline could manage. And the total number of bugs in any particular piece of software is finite. As the velocity of finding & fixing bugs increases, software will become far more resilient.

Security teams are becoming engineering teams. At Lemonade, every security person is an engineer. They built their own AI platform with agents on top of it. One agent reads threat intel. Another checks whether the vulnerable method is actually called in production code.
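The second agent's question, whether the vulnerable method is actually called, is at heart a reachability check on a call graph. Here is a minimal sketch with a hypothetical, hand-written call graph. A real tool would build the graph from static analysis and handle dynamic dispatch, which this ignores.

```python
from collections import deque

# Hypothetical call graph from production code: caller -> callees.
CALL_GRAPH = {
    "main": ["handle_request", "start_metrics"],
    "handle_request": ["parse_json", "render"],
    "render": ["template.escape"],
    "start_metrics": [],
    "legacy_import": ["yaml.unsafe_load"],  # present in the repo, never called from main
}

def is_reachable(graph: dict[str, list[str]], entry: str, target: str) -> bool:
    """Breadth-first search: can execution starting at `entry` reach `target`?"""
    seen, queue = {entry}, deque([entry])
    while queue:
        fn = queue.popleft()
        if fn == target:
            return True
        for callee in graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False

print(is_reachable(CALL_GRAPH, "main", "template.escape"))   # True
print(is_reachable(CALL_GRAPH, "main", "yaml.unsafe_load"))  # False
```

A vulnerable method that exists in the codebase but is unreachable from any entry point can be deprioritized.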

“Automation is the only way you can deal with the scale of what’s coming at us now.”

Every agent needs an identity. A single endpoint could be running 200 or 10,000 agents, & each one must be uniquely identified & governed by policy at the point of action.

“Every agent needs to have an identity, and more than that, you need a way to control policy for all of these agents in a much more complex way than current identity and access management systems do.”
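This minimal sketch gives each agent its own identity and runs a default-deny policy check at the moment the agent acts. The roles, actions, and in-memory policy table are hypothetical. As Jonathan notes, real systems need something richer than today's IAM; this toy only shows the shape.

```python
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Agent:
    role: str
    agent_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # unique per agent

# Hypothetical policy: (action, resource) pairs each role may perform.
POLICY = {
    "threat-intel-reader": {("read", "threat-feed")},
    "code-scanner": {("read", "repo"), ("comment", "pull-request")},
}

def authorize(agent: Agent, action: str, resource: str) -> bool:
    """Check policy at the point of action; deny anything not explicitly allowed."""
    return (action, resource) in POLICY.get(agent.role, set())

scanner = Agent("code-scanner")
print(scanner.agent_id, authorize(scanner, "read", "repo"))
print(authorize(scanner, "merge", "pull-request"))
```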

Modern agentic security engineering is rapidly transforming, and we should expect to see significantly hardened systems as a result. It’s a bright future for security and security professionals.

I’m grateful to Jonathan for sharing his insights at Office Hours!

Software After AI

Wed, 27 May 2026 00:00:00 +0000

The end of the software era is the beginning of the harness era.

SaaS was a managed database wrapped in fixed workflows. AI outmoded it by replacing those fixed workflows with intelligence. Like a mustang, AI is powerful but wild. Harnessing the power means domestication.

There are seven parts to this domestication :

Context & memory : General models need bespoke retrieval. The system that fetches the right context for a radiologist is not the system that fetches it for a paralegal.

Sometimes it’s a lot of short-term memory. What was the agent working on 45 seconds ago? Other times it’s large-scale image retrieval, say for radiology or for video generation. Other times it’s a keyword search across a billion documents. Those systems will be bespoke to each individual use case to drive the best accuracy.

Sitting alongside retrieval is the context database, the recipe book of how each business actually runs. The standard operating procedures we all carry in our heads & bring to work every day are those recipes. Capturing them initially & evolving them as both people & process change is the essence of the context database.
Tools & action : Tools are how the agent affects the outside world. The recipes in the context database describe what to do. Tools are the ingredients & utensils that actually do it.

A modern harness exposes tools through a registry, validates the arguments the model passes, dispatches the call, gates sensitive actions behind approvals, & parses the result back into the agent’s loop. MCP has emerged as the connective tissue. The quality of a harness depends on how many tools it can safely expose & how cleanly it handles their failures.
Orchestration & loop : The agentic loop is think, act, observe, repeat. Planning, decomposition, sub-agents, retries, & stop conditions define how the work gets done.

We also expect our software to improve as we use it. Closed loop patterns that learn from each run will separate different vendors.
State & persistence : In a large-scale enterprise with lots of different people working on a system, the system needs to be resilient. When a harness crashes at step 7 of a 10-step task, it should resume at step 8, not restart from zero. File systems, checkpoints, session threads, & artifact storage are the mechanisms that prevent lost work.
Sandbox & compute : Each agent needs a sandbox in which to play. Isolated Unix workspaces, controlled network egress, & credentials that live outside the model are what make sandboxes secure, confidential, & fast at scale.
Observability & governance : You cannot trust what you cannot see. Tracing every step, logging every tool call, running evals as regression tests, & putting humans in the loop for the highest stakes decisions are how a demo becomes a production system. Guardrails enforce policy. Evals catch regressions before customers do.
Cost & workflow optimization : The seventh discipline is architectural judgment. What should be deterministic versus non-deterministic? Which model is the right one for each step, state of the art, medium, small, or fine-tuned? What knowledge belongs in skills versus in memory?
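Three of the seven parts fit in one minimal sketch: a tool registry that validates arguments, an approval gate for sensitive actions, and checkpoints so a crash resumes from the first unfinished step. The class and function names are hypothetical, and a JSON file stands in for real checkpoint storage. Validation is reduced to a required-keys check rather than a full input schema like the ones MCP tools declare.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable

Approver = Callable[[str, dict], bool]

class ToolRegistry:
    """Hypothetical registry: validates arguments and gates sensitive tools."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[Callable[..., Any], set[str], bool]] = {}

    def register(self, name: str, fn: Callable[..., Any],
                 required: set[str], sensitive: bool = False) -> None:
        self._tools[name] = (fn, required, sensitive)

    def call(self, name: str, args: dict, approve: Approver) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        fn, required, sensitive = self._tools[name]
        missing = required - set(args)               # validate before dispatch
        if missing:
            raise ValueError(f"{name}: missing args {sorted(missing)}")
        if sensitive and not approve(name, args):    # human-in-the-loop gate
            raise PermissionError(f"{name}: approval denied")
        return fn(**args)

def deny_all(name: str, args: dict) -> bool:
    return False

def run_plan(plan: list, registry: ToolRegistry, checkpoint: Path,
             approve: Approver = deny_all) -> list:
    """Run steps in order, checkpointing after each so a crash resumes instead of restarting."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": 0, "results": []}
    for i in range(state["done"], len(plan)):
        name, args = plan[i]
        state["results"].append(registry.call(name, args, approve))
        state["done"] = i + 1
        checkpoint.write_text(json.dumps(state))     # results must be JSON-serializable
    return state["results"]

reg = ToolRegistry()
reg.register("add", lambda a, b: a + b, {"a", "b"})
reg.register("send_email", lambda to: f"sent to {to}", {"to"}, sensitive=True)

ckpt = Path(tempfile.mkdtemp()) / "run.json"
print(run_plan([("add", {"a": 1, "b": 2}), ("add", {"a": 3, "b": 4})], reg, ckpt))
```

Approval defaults to deny, so a sensitive tool runs only when a caller supplies an approver that says yes.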

The result is a new competitive dynamic in software.

This won’t work in every category. In the markets the major labs prioritize, the labs will benefit from their ability to move quickly & their direct control of the models. But that leaves thousands of separate markets open to startups.

What happens when every company has access to the same model? The best riders win.

Agent Gravity : Who's Running Your Agents

Tue, 26 May 2026 00:00:00 +0000

If data gravity was the most important force in the Decade of Data, agent gravity will be the same in the Decade of Agents.

Agents are wonderfully powerful technologies, & they require tremendous compute to run. That compute is big business, & major platforms will fight to keep agents on their platforms. The more agents & data running through a platform, the greater the agent gravity.

The most recent episode involves a new Databricks feature on Microsoft’s platform :

While this was not the feature’s stated purpose, it essentially made it easier for Power BI customers to manage their data and build AI agents in Databricks instead of a competing data management offering from Microsoft, called Fabric. - The Information

So what’s going on :

If DBX customers can create data pipelines & manipulate their data through agents, then the person building those agents - or the agent itself - will decide where to run the agent (agent gravity) & where to process the data (data gravity).

These agents can siphon the knowledge in the semantic layer, migrate the data into other cloud data warehouses, & publish data to other BI systems.

Very quickly, users, knowingly or unknowingly, can migrate the profitable agent workloads & data warehouse workloads to a new platform.

Winning & sustaining agent gravity is the motif of the Decade of Agents.

Plastic User Interfaces

Fri, 22 May 2026 00:00:00 +0000

Salesforce has gone headless : a salesperson can use AI to update their deal sheet without ever logging into salesforce.com. Many companies are following suit with MCPs. English as an interface to complex systems is a tremendous innovation.

And yet, some of the most sophisticated thinkers in AI are pushing beyond markdown, the plain-text format AI & computer systems commonly exchange. These thinkers espouse richer UIs :

“Imagine using iMessage to do everything, when in fact every other app has a unique interface…With e-commerce, you want a very rich user interface.” - Brian Chesky, CEO of Airbnb

“I want richer visualizations, color, and diagrams and I want to be able to share them easily,” he adds. “I’ve started preferring HTML as an output format instead of Markdown and increasingly see this being used by others on the Claude Code team, this is why.” — Thariq Shihipar, Claude Code engineer

AI enables us to create purpose-built UIs dynamically, whenever we need them, custom-tailored to vacation shopping, CRM updating, or terminal typing, whatever the recipient prefers.

Headless systems don’t decapitate the system ; they enable many user interfaces.

On the go? How about an audio summary of your email? Reviewing marketing copy? Interactive web app. Planning expenses for next year? Interactive spreadsheet with charts.
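A headless system with many heads can be pictured as a single data source plus a set of renderers, picked by context. This is a minimal sketch. The expense rows are invented. The text script, HTML table, and CSV are simple stand-ins for an audio summary, an interactive web app, and a spreadsheet with charts.

```python
import html
from typing import Callable

Rows = list[tuple[str, int]]

# Hypothetical headless core: one data source, no built-in UI.
EXPENSES: Rows = [("Cloud", 120_000), ("Travel", 45_000), ("Events", 30_000)]

def as_audio_script(rows: Rows) -> str:
    """Text a speech engine could read aloud on the go."""
    total = sum(v for _, v in rows)
    top, top_v = max(rows, key=lambda r: r[1])
    return f"Total planned spend is ${total:,}. The largest line is {top} at ${top_v:,}."

def as_html_table(rows: Rows) -> str:
    """Markup a web app could render for review."""
    body = "".join(f"<tr><td>{html.escape(k)}</td><td>{v:,}</td></tr>" for k, v in rows)
    return f"<table><tr><th>Category</th><th>USD</th></tr>{body}</table>"

def as_csv(rows: Rows) -> str:
    """Rows a spreadsheet could import for planning."""
    return "category,usd\n" + "\n".join(f"{k},{v}" for k, v in rows)

# Each "head" is a renderer chosen for the context.
HEADS: dict[str, Callable[[Rows], str]] = {
    "on_the_go": as_audio_script,
    "review": as_html_table,
    "planning": as_csv,
}

print(HEADS["on_the_go"](EXPENSES))
```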

Software systems need to decide which of these to keep over time & which are disposable ; those newer semi-permanent artifacts will become the new heads - yes, there will be many - & they’ll evolve as business does.

This dynamic UI management is the future of software value : the harness that controls the interface & ensures it’s correct, & the knowledge management that rationalizes all the AI products over time as a context database & library of artifacts.

The user interface, the head, isn’t disappearing. It’s become plastic, malleable to whatever interface a user needs when they need it. There’s a great future in plastics.