VentureBeat

Four AI supply-chain attacks in 50 days exposed the release pipeline red teams aren't covering

louiswcolumbus@gmail.com (Louis Columbus) — Mon, 18 May 2026 17:16:58 GMT

Four supply-chain incidents hit OpenAI, Anthropic and Meta in 50 days: three adversary-driven attacks and one self-inflicted packaging failure. None targeted the model, and all four exposed the same gap: release pipelines, dependency hooks, CI runners, and packaging gates that no system card, AISI evaluation, or Gray Swan red-team exercise has ever scoped.

On May 11, 2026, a self-propagating worm called Mini Shai-Hulud published 84 malicious package versions across 42 @tanstack/* npm packages in six minutes flat. The worm rode in on release.yml, chaining a pull_request_target misconfiguration, GitHub Actions cache poisoning, and OIDC token extraction from runner memory to hijack TanStack’s own trusted release pipeline. The packages carried valid SLSA Build Level 3 provenance because they were published from the correct repository, by the correct workflow, using a legitimately minted OIDC token. No maintainer password was phished. No 2FA prompt was intercepted.

The trust model worked exactly as designed and still produced 84 malicious artifacts.

Two days later, OpenAI confirmed that two employee devices were compromised and credential material was exfiltrated from internal code repositories. OpenAI is now revoking its macOS security certificates and forcing all desktop users to update by June 12, 2026. OpenAI noted that it had already been hardening its CI/CD pipeline after an earlier supply-chain incident, but the two affected devices had not yet received the updated configurations. That is the response profile of a build-pipeline breach, not a model-safety incident.

Four incidents, one finding

Model red teams do not cover release pipelines. The four incidents below are evidence for a single architectural finding that belongs in every AI vendor questionnaire.

OpenAI Codex command injection (disclosed March 30, 2026). BeyondTrust Phantom Labs researcher Tyler Jespersen found that OpenAI Codex passed GitHub branch names directly into shell commands with zero sanitization. An attacker could inject a semicolon and a backtick subshell into a branch name, and the Codex container would execute it, returning the victim’s GitHub OAuth token in cleartext. The flaw affected the ChatGPT website, Codex CLI, Codex SDK, and the IDE Extension. OpenAI classified it Critical Priority 1 and completed remediation by February 2026. The Phantom Labs team used Unicode characters to make a malicious branch name visually identical to "main" in the Codex UI. One branch name. That is where the attack started.

LiteLLM supply-chain poisoning and Mercor breach (March 24–27, 2026). The threat group TeamPCP used credentials stolen in a prior compromise of Aqua Security’s Trivy vulnerability scanner to publish two poisoned versions of the LiteLLM Python package to PyPI. LiteLLM is a widely adopted open-source LLM proxy gateway used across major AI infrastructure teams. The malicious versions were live for roughly 40 minutes and received nearly 47,000 downloads before PyPI quarantined them.

That was enough.

The attack cascaded downstream into Mercor, the $10 billion AI data startup that supplies training data to Meta, OpenAI, and Anthropic. Four terabytes exfiltrated, including proprietary training methodology references from Meta. Meta froze the partnership indefinitely. A class action followed within five days. One compromised open-source dependency sitting 40 minutes on PyPI created a cross-industry blast radius that no single vendor’s model red team would have caught.

Anthropic Claude Code source map leak (March 31, 2026). This incident was not adversary-driven. Anthropic shipped Claude Code version 2.1.88 to the npm registry with a 59.8 MB source map file that should never have been included. The map file pointed to a zip archive on Anthropic’s own Cloudflare R2 bucket containing 513,000 lines of unobfuscated TypeScript across 1,906 files. Agent orchestration logic. 44 feature flags. System prompts. Multi-agent coordination architecture. All public. All downloadable. No authentication required. Security researcher Chaofan Shou flagged the exposure within hours, and Anthropic pulled the package. Anthropic confirmed it was a “release packaging issue caused by human error.” This was the second such leak in 13 months. The root cause was a missing line in .npmignore. No attacker was involved, but the release-surface gap is identical. No human review gate existed between the build artifact and the registry publish step.

TanStack worm and downstream propagation (May 11–14, 2026). Wiz Research attributed the Mini Shai-Hulud attack to TeamPCP with high confidence. StepSecurity detected the compromise within 20 minutes. The worm spread beyond TanStack to Mistral AI, UiPath, and 160-plus packages within hours. Mini Shai-Hulud even impersonated the Anthropic Claude GitHub App identity by authoring commits under the fabricated identity “claude ” to bypass code review.

Four incidents. Three frontier labs. One finding. The red-team scope stops at the model boundary, and the build pipeline sits on the other side of it.

The timing no system card can explain

On May 10, 2026, OpenAI launched Daybreak, a cybersecurity initiative built on GPT-5.5 and a new permissive model called GPT-5.5-Cyber designed for authorized red teaming, penetration testing, and vulnerability discovery. Daybreak pairs Codex Security with partners, including Cisco, CrowdStrike, Akamai, Cloudflare, and Zscaler. OpenAI positioned the launch as proof that frontier AI can tilt the balance toward defenders.

The next day, the TanStack worm compromised two OpenAI employee devices.

OpenAI’s own incident disclosure acknowledged the gap directly. The company had already been hardening its CI/CD pipeline after the earlier Axios supply-chain attack, but the two affected devices “did not have the updated configurations that would have prevented the download.” The controls existed. The deployment was in progress. The worm arrived first.

The security community saw the same gap: Security researcher @EnTr0pY_88 noted on X that the real signal was the certificate rotation, not the exfiltrated code. "The cert rotation…is what you do when the blast radius reached signing trust, not just source access." @OpenMatter_ put the SLSA provenance failure in one sentence. "If an attacker controls your CI runner, they control your attestations. Policy-based security is failing at scale." And @The_Calda compressed the disclosure's internal contradiction into seven words. "'Limited impact' but the next sentence is 'we're rotating signing certs.'"

A company that launched a cyber defense platform on Sunday and disclosed a build-pipeline breach on Tuesday is not failing at model safety. OpenAI is demonstrating the exact gap this audit grid exists to close. The model red team and the release-pipeline red team are two different disciplines; four incidents in 50 days suggest only one of them is being funded consistently.

The VentureBeat Prescriptive Matrix

The matrix below maps the seven release-surface classes missing from AI vendor questionnaires, with vendor hit, failure mechanism, detection gap, technical mitigation, and priority tier a security team can execute before Q2 renewals close.

For teams that need to map these rows into existing GRC tooling, rows 2, 3, and 5 align with NIST SSDF PS.1.1 (protect all forms of code from unauthorized access and tampering). Row 4 maps to SSDF PS.2.1 (provide mechanisms for verifying software release integrity). Row 6 maps partially to SLSA Source Track requirements for verified contributor identity, though no published framework directly addresses upstream dependency maintainer credential provenance. Row 7 is not yet addressed by any published framework, which is itself the finding.

Release-surface class	Vendor hit	Failure mechanism	Detection gap	Technical mitigation	Priority
Model capability evals (jailbreak, misuse, exfiltration)	All three (ongoing)	Covered. System cards, AISI Expert suite, Gray Swan scope this today.	None. This row is the baseline.	Continue requiring the system card at every renewal.	Baseline
CI runner trust boundary (pull_request_target)	TanStack; OpenAI downstream (May 11–14, 2026)	TanStack pwn-request ran fork code in base-repo context. Poisoned pnpm cache. Extracted OIDC token from runner memory. Two OpenAI employee devices compromised.	No system card covers CI runner isolation. No AISI eval tests fork-to-base trust boundaries.	Audit every repo for pull_request_target + fork SHA checkout. Block fork code from base-repo context. Pin cache keys to commit SHA.	Do this week
OIDC trusted-publisher + SLSA provenance	TanStack; OpenAI downstream (May 11, 2026)	TanStack minted valid SLSA Build Level 3 provenance for all 84 malicious packages. First known npm worm with valid cryptographic attestation.	SLSA attestation confirms build origin, not build intent. No vendor questionnaire distinguishes the two.	Pin trusted publisher to branch + workflow, not just repository. Add behavioral analysis at install time.	Do this week
Release packaging review (human gate before publish)	Anthropic (Mar 31, 2026)	Missing .npmignore shipped 59.8 MB source map in Claude Code npm package. 513K lines exposed including agent logic, 44 feature flags, system prompts. Second leak in 13 months. Self-inflicted, not adversary-driven.	No red-team exercise checks artifact contents before registry publish.	Human review between build artifact and registry publish. Enforce .npmignore in CI. Fail build on unexpected artifact size.	Before renewal
Dependency lifecycle hooks (prepare, postinstall)	TanStack; OpenAI + downstream (May 11, 2026)	router_init.js executes on import. tanstack_runner.js self-propagates via optionalDependencies prepare hook. Spread to Mistral AI, UiPath, 160+ packages in hours.	Lifecycle hooks execute before any scanner runs. Model evals never test package install behavior.	Disable lifecycle scripts in CI by default. Explicit allowlist for production. Flag new optionalDependencies in PR review. Set minimumReleaseAge.	Do this week
Vendor maintainer credential hygiene	Meta via Mercor (Mar 24–27, 2026)	TeamPCP stole LiteLLM maintainer credential via prior Trivy compromise. Two poisoned PyPI versions live 40 min. Mercor cache held Meta training methodology references. 4 TB exfiltrated. Meta froze the partnership.	Vendor questionnaires ask about encryption and access control, not maintainer credential provenance for upstream dependencies.	Require hardware-key auth from every maintainer before onboarding. Add package-manager cooldown. Audit transitive dependency tree quarterly.	Add to vendor contract
Agent container input sanitization	OpenAI Codex (disclosed Mar 30, 2026)	BeyondTrust Phantom Labs injected shell commands through GitHub branch-name parameter. Stole OAuth tokens from Codex container. Scalable across shared repos. Rated Critical P1, patched Feb 2026.	Agent red teams test prompt injection, not input-parameter injection at the container level.	Sanitize all external input before shell execution. Audit OAuth token scope and lifetime per agent session. Enforce least-privilege on every container.	Do this week

Security director action plan

The matrix tells your team what to fix. Three actions tell security directors how to move it forward.

Add one question to every AI vendor questionnaire. "Does your organization red-team its release pipeline, including CI runner trust boundaries, OIDC token scoping, dependency lifecycle hooks, and registry publish gates? Provide the last assessment date and scope." No date and no scope document is the finding.
Run rows 2 through 7 against your own CI pipelines this week. StepSecurity and Snyk both published detection and remediation steps for the TanStack worm patterns. Dev teams pull OpenAI SDKs, Anthropic packages, and Llama weights through npm, PyPI, and HuggingFace every week. The same patterns that got exploited are in your CI right now.
Brief the board on the provenance gap. The TanStack worm proved that valid cryptographic provenance can sit on top of a malicious package. Attestation tells the board where a package was built. Behavioral analysis tells the board what it does after install. Q2 renewal requires both. Snyk's analysis recommends pinning trusted publisher configurations to specific branches and workflows, not just repositories. That is the language the board presentation needs.

The worm already knows where your AI credentials live

Mini Shai-Hulud does not stop at CI secrets. Datadog Security Labs documented that the payload reads ~/.claude.json and exfiltrates it. It scans for 1Password and Bitwarden vaults, Kubernetes service accounts, cloud provider tokens, and shell history files where developers paste API keys. StepSecurity's deobfuscation confirmed that Mini Shai-Hulud harvests Claude and Kiro MCP server configurations, which store API keys and auth tokens for external services. For developers using AI coding agents, the worm already knows where their credentials live.

OpenAI, Anthropic, and Meta will keep publishing system cards. They will keep funding red-team competitions. They will keep passing model evaluations. None of that stops the next worm from riding in on release.yml.

The TanStack postmortem team said it directly. Modern supply-chain defenses are important but not sufficient on their own. Teams must proactively identify and close workflow gaps rather than relying solely on the security features of their tools.

LangSmith Engine closes the agent debugging loop automatically — but multi-model enterprises still need a neutral layer

Mon, 18 May 2026 16:23:19 GMT

Enterprises building and deploying agents have a problem: it’s taking their engineers too long to find out that an agent made a mistake, and the loop has continued to perpetuate, especially without a human at every step.

LangSmith, the monitoring and evaluation platform from LangChain, launched a new capability in public beta that could make that issue more manageable. LangSmith Engine automates the entire chain by detecting production failures, diagnosing root causes against the live codebase, drafting a fix and preventing regression. It does this in a single automated pass.

LangSmith Engine gives AI engineers a faster path to triage, but it launches into a crowded field: Anthropic, OpenAI and Google are all pulling observability and evaluation into their own platforms.

LangSmith Engine looks at failures

LangChain said in a blog post that the typical agent development cycle starts by tracing the agent to understand what it’s doing, followed by identifying gaps, making changes to the prompts and tools, and creating ground-truth datasets. Developers then run experiments and check for regressions before shipping the agent.

The problem is that customers often run into issues when the trace review doesn’t surface faulty patterns, error repetition gets difficult to see, and there’s no targeted evaluator to catch the same problem when it repeats in production.

LangSmith Engine works by monitoring production traces for several signal types, “explicit errors, online evaluator failures, trace anomalies, negative user feedback and unusual behaviors like user asking questions the agent wasn’t built to answer,” according to the blog post.

Engine will then read the live codebase, find the culprit and draft a pull request before proposing a custom evaluator for that specific failure pattern. The human comes in at the approval step.

It’s built on top of LangSmith’s existing tracing and evaluation infrastructure and also works with an enterprise’s evaluator results.

Unlike observability tools such as Weights & Biases, Arize Phoenix and Honeyhive, LangSmith Engine takes the entire chain automatically — detecting the failure, diagnosing root cause, drafting a fix — and brings the human in only at the approval step.

Model providers bringing evaluators in platform

While LangSmith identified this evaluation loop as a need for many enterprises, Engine comes at a time where the larger providers are beginning to offer observability tools within their platform. This means enterprises may choose to use an end-to-end platform rather than add LangSmith Engine onto their existing workflows.

Anthropic's Claude Managed Agents brings together agentic deployment, evaluation and orchestration into a single suite. OpenAI's Frontier offers a similar end-to-end platform for building, governing and evaluating enterprise agents — though both have faced questions from enterprises wary of committing to a single vendor.

However, practitioners point out that not everyone wants to bring evaluations and observability fully into one platform.

Leigh Coney, founder and principal consultant at Workwise Solutions, told VentureBeat that third-party observability is the default for many enterprises.

“One fund I work with runs Claude for analysis and GPT for a separate workflow. If observability lives inside each provider's tooling, you now have two systems that can't talk to each other. Your compliance team can't produce a unified audit trail,” he said. “So third-party observability is surviving because multi-model is already the default in enterprise, and somebody has to sit across providers.”

Jessica Arredondo Murphy, CEO and co-founder of True Fit, said independent platforms like LangSmith have to prove to enterprises that they can "answer the long-term question of whether they become the cross-model operating layer for quality and reliability.”

“Enterprises are not consolidating onto the first-party model provider tooling as quickly as the model providers would prefer. What I see is a pragmatic split: teams will use first-party tooling for fast onboarding and early-stage debugging, but as soon as they care about production reliability, governance, and long-term flexibility, they tend to introduce a more neutral layer for observability and evaluation,” she said.

LangSmith Engine is available now in public beta. Teams can connect a tracing project, optionally connect their repo, and Engine will begin surfacing issues from production traces automatically.

Architectural patterns for graph-enhanced RAG: Moving beyond vector search in production

Sun, 17 May 2026 18:00:23 GMT

Retrieval-augmented generation (RAG) has become the de facto standard for grounding large language models (LLMs) in private data. The standard architecture — chunking documents, embedding them into a vector database, and retrieving top-k results via cosine similarity — is effective for unstructured semantic search.

However, for enterprise domains characterized by highly interconnected data (supply chain, financial compliance, fraud detection), vector-only RAG often fails. It captures similarity but misses structure. It struggles with multi-hop reasoning questions like, "How will the delay in Component X impact our Q3 deliverable for Client Y?" because the vector store doesn't "know" that Component X is part of Client Y's deliverable.

This article explores the graph-enhanced RAG pattern. Drawing on my experience building high-throughput logging systems at Meta and private data infrastructure at Cognee, we will walk through a reference architecture that combines the semantic flexibility of vector search with the structural determinism of graph databases.

The problem: When vector search loses context

Vector databases excel at capturing meaning but discard topology. When a document is chunked and embedded, explicit relationships (hierarchy, dependency, ownership) are often flattened or lost entirely.

Consider a supply chain risk scenario. While this is a hypothetical example, it represents the exact class of structural problems we see constantly in enterprise data architectures:

Structured data: A SQL database defining that Supplier A provides Component X to Factory Y.
Unstructured data: A news report stating, "Flooding in Thailand has halted production at Supplier A's facility."

A standard vector search for "production risks" will retrieve the news report. However, it likely lacks the context to link that report to Factory Y's output. The LLM receives the news but cannot answer the critical business question: "Which downstream factories are at risk?"

In production, this manifests as hallucination. The LLM attempts to bridge the gap between the news report and the factory but lacks the explicit link, leading it to either guess relationships or return an "I don't know" response despite the data being present in the system.

The pattern: Hybrid retrieval

To solve this, we move from a "Flat RAG" to a "Graph RAG" architecture. This involves a three-layer stack:

Ingestion (The "Meta" Lesson): At Meta, working on the Shops logging infrastructure, we learned that structure must be enforced at ingestion. You cannot guarantee reliable analytics if you try to reconstruct structure from messy logs later. Similarly, in RAG, we must extract entities (nodes) and relationships (edges) during ingestion. We can use an LLM or named entity recognition (NER) model to extract entities from text chunks and link them to existing records in the graph.
Storage: We use a graph database (like Neo4j) to store the structural graph. Vector embeddings are stored as properties on specific nodes (e.g., a RiskEvent node).
Retrieval: We execute a hybrid query:
- Vector scan: Find entry points in the graph based on semantic similarity.
- Graph traversal: Traverse relationships from those entry points to gather context.

Reference implementation

Let's build a simplified implementation of this supply chain risk analyzer using Python, Neo4j, and OpenAI.

1. Modeling the graph

We need a schema that connects our unstructured "risk events" to our structured "supply chain" entities.

2. Ingestion: Linking structure and semantics

In this step, we assume the structural graph (suppliers -> factories) already exists. We ingest a new unstructured "risk event" and link it to the graph.

3. The hybrid retrieval query

This is the core differentiator. Instead of just returning the top-k chunks, we use Cypher to perform a vector search to find the event, and then traverse to find the downstream impact.

The output: Instead of a generic text chunk, the LLM receives a structured payload:

[{'issue': 'Severe flooding...', 'impacted_supplier': 'TechChip Inc', 'risk_to_factory': 'Assembly Plant Alpha'}]

This allows the LLM to generate a precise answer: "The flooding at TechChip Inc puts Assembly Plant Alpha at risk."

Production lessons: Latency and consistency

Moving this architecture from a notebook to production requires handling trade-offs.

1. The latency tax

Graph traversals are more expensive than simple vector lookups. In my work on product image experimentation at Meta, we dealt with strict latency budgets where every millisecond impacted user experience. While the domain was different, the architectural lesson applies directly to Graph RAG: You cannot afford to compute everything on the fly.

Vector-only RAG: ~50-100ms retrieval time.
Graph-enhanced RAG: ~200-500ms retrieval time (depending on hop depth).

Mitigation: We use semantic caching. If a user asks a question similar (cosine similarity > 0.85) to a previous query, we serve the cached graph result. This reduces the "graph tax" for common queries.

2. The "stale edge" problem

In vector databases, data is independent. In a graph, data is dependent. If Supplier A stops supplying Factory Y, but the edge remains in the graph, the RAG system will confidently hallucinate a relationship that no longer exists.

Mitigation: Graph relationships must have Time-To-Live (TTL) or be synced via Change Data Capture (CDC) pipelines from the source of truth (the ERP system).

Infrastructure decision framework

Should you adopt Graph RAG? Here is the framework we use at Cognee:

Use vector-only RAG if:
- The corpus is flat (e.g., a chaotic Wiki or Slack dump).
- Questions are broad ("How do I reset my VPN?").
- Latency < 200ms is a hard requirement.
Use graph-enhanced RAG if:
- The domain is regulated (finance, healthcare).
- "Explainability" is required (you need to show the traversal path).
- The answer depends on multi-hop relationships ("Which indirect subsidiaries are affected?").

Conclusion

Graph-enhanced RAG is not a replacement for vector search, but a necessary evolution for complex domains. By treating your infrastructure as a knowledge graph, you provide the LLM with the one thing it cannot hallucinate: The structural truth of your business.

Daulet Amirkhanov is a software engineer at UseBead.

The enterprise risk nobody is modeling: AI is replacing the very experts it needs to learn from

Sat, 16 May 2026 21:05:53 GMT

For AI systems to keep improving in knowledge work, they need either a reliable mechanism for autonomous self-improvement or human evaluators capable of catching errors and generating high-quality feedback. The industry has invested enormously in the first. It's giving almost no thought to what's happening to the second.

I’d argue that we need to treat the human evaluation problem with just as much rigor and investment as we put into building the model capabilities themselves. New grad hiring at major tech companies has dropped by half since 2019. Document review, first-pass research, data cleaning, code review: Models handle these now. The economists tracking this call it displacement. The companies doing it call it efficiency. Neither are focusing on the future problem.

Why self-improvement has limits in knowledge work

The obvious pushback is reinforcement learning (RL). AlphaZero learned Go, chess, and Shogi at superhuman levels without human data and generated novel strategies in the process. Move 37 in the 2016 match against Lee Sedol, a move professionals said they would never have played, didn't come from human annotation. It emerged from AI self-play.

What enables this is the stability of the environment. Move 37 is a novel move within the fixed state space of Go. The rules are complete, unambiguous, and permanent. More importantly, the reward signal is perfect: Win or lose, and immediate, with no room for interpretation. The system always knows whether a move was good because the game eventually ends with a clear result.

Knowledge work doesn't have either of those properties. The rules in any professional domain are dynamic and continuously rewritten by the humans operating in them. New laws get passed. New financial instruments are invented. A legal strategy that worked in 2022 may fail in a jurisdiction that has since changed its interpretation. Whether a medical diagnosis was right may not be known for years. Without a stable environment and an unambiguous reward signal, you cannot close the loop. You need humans in the evaluation chain to continue teaching the model.

The formation problem

The AI systems being built today were trained on the expertise of people who went through exactly that formation. The difference now is that entry-level jobs that develop such expertise were automated first. Which means the next generation of potential experts is not accumulating the kind of judgment that makes a human evaluator worth having in the loop.

History has examples of knowledge dying. Roman concrete. Gothic construction techniques. Mathematical traditions that took centuries to recover. But in every historical case, the cause was external: Plague, conquest, the collapse of the institutions that hosted the knowledge. What's different here is that no external force is required. Fields could atrophy not from catastrophe but from a thousand individually rational economic decisions, each one sensible in isolation. That's a new mechanism, and we don't have much practice recognizing it while it's happening.

When entire fields go quiet

At its logical limit, this isn’t just a pipeline problem. It’s a demand collapse for the expertise itself.

Consider advanced mathematics. It doesn’t atrophy because we stop training mathematicians. It atrophies because organizations stop needing mathematicians for their day-to-day work, the economic incentive to become one disappears, the population of people who can do frontier mathematical reasoning shrinks, and the field’s capacity to generate novel insight quietly collapses. The same logic applies to coding. Our question is not “will AI write code” but “if AI writes all production code, who develops the deep architectural intuition that produces genuinely novel systems design?”

There is a critical difference between a field being automated and a field being understood. We can automate a huge amount of structural engineering today, but the abstract knowledge of why certain approaches work lives in the heads of people who spent years doing it wrong first. If you eliminate the practice, you don’t just lose the practitioners. You lose the capacity to know what you’ve lost.

Advanced mathematics, theoretical computer science, deep legal reasoning, complex systems architecture: When the last person who deeply understands a subfield of algebra retires and no one replaces them because the funding dried up and the career path disappeared, that knowledge isn’t likely to be rediscovered any time soon.

It’s gone. And nobody notices because the models trained on their work still perform well on benchmarks for another decade. I think of this as a hollowing out: The surface capability remains (models can still produce outputs that look expert) while the underlying human capacity to validate, extend, or correct that expertise quietly disappears.

Why rubrics don't fully substitute

The current approach is rubric-based evaluation. Constitutional AI, reinforcement learning from AI feedback (RLAIF), and structured criteria that let models score models are serious techniques that meaningfully reduce dependence on human evaluators. I'm not dismissing them.

Their limitation is this: A rubric can only capture what the person who wrote it knew to measure. Optimize hard against it and you get a model that's very good at satisfying the rubric. That's not the same thing as a model that's actually right.

Rubrics scale the explicit, articulable part of judgment. The deeper part, the instinct, the felt sense that something is off, doesn't fit in a rubric. You can't write it down because you need to experience it first before you know what to write.

What this means in practice

This isn’t an argument for slowing development. The capability gains are real. And it’s possible that researchers will find ways to close the evaluation loop without human judgment. Maybe synthetic data pipelines get good enough. Maybe models develop reliable self-correction mechanisms we can’t yet imagine.

But we don’t have those today. And in the meantime, we’re dismantling the human infrastructure that currently fills the gap, not as a deliberate decision but as a byproduct of a thousand rational ones. The responsible version of this transition isn’t to assume the problem will solve itself. It’s to treat the evaluation gap as an open research problem with the same urgency we bring to capability gains.

The thing AI most needs from humans is the thing we’re least focused on preserving. Whether that’s permanently true or temporarily true, the cost of ignoring it is the same.

Ahmad Al-Dahle is CTO of Airbnb.

Intercom, now called Fin, launches an AI agent whose only job is managing another AI agent

michael.nunez@venturebeat.com (Michael Nuñez) — Fri, 15 May 2026 21:06:09 GMT

The company formerly known as Intercom just did something that no major customer service platform has attempted at scale: it built an AI agent whose sole job is to manage another AI agent.

Fin Operator, announced Thursday at a live event in San Francisco, is a new AI-powered system designed specifically for the back-office teams that configure, monitor, and improve Fin, the company's customer-facing AI agent. Rather than replacing human support agents — which is what Fin itself does on the front lines — Operator targets the growing army of support operations professionals who spend their days updating knowledge bases, debugging conversation failures, and combing through performance dashboards.

"Fin is an agent for your customers," Brian Donohue, the company's VP of Product, told VentureBeat in an exclusive interview ahead of the launch. "Operator is an agent for your support ops team. This is an agent for the back office team who manages Fin and then manages their human agents."

The announcement arrives at a pivotal moment for the company. Just two days ago, CEO Eoghan McCabe formally renamed the 15-year-old company from Intercom to Fin — an aggressive signal that the AI agent is now the business, not merely a feature of it. Fin recently crossed $100 million in annual recurring revenue and is growing at 3.5x. The broader company generates $400 million in ARR, meaning the AI agent now accounts for roughly a quarter of total revenue and virtually all of its growth.

Fin Operator enters early access for Pro-tier users starting today, with general availability planned for summer 2026.

The invisible crisis behind every AI customer service deployment

As companies push their AI agents to handle more conversations — Fin alone now resolves more than two million customer issues each week across 8,000 customers globally, including Anthropic, DoorDash, and Mercury — the operational complexity behind those systems has exploded. Someone has to keep the knowledge base current. Someone has to figure out why the bot entered an infinite loop with a frustrated customer last Tuesday. Someone has to analyze whether the automation rate dropped after a product update.

That "someone" is the support operations team, and according to Donohue, they are drowning.

"Almost every support ops team is already doing data analysis and knowledge management — that's table stakes today," Donohue said. "Where teams struggle is the agent builder work. It's a new skill set, and most don't have enough time for it. They get their first iteration up and running, and then they get stuck."

The problem is structural. AI customer agents are not static software. They require constant tuning — a process that looks more like training a new employee than configuring a SaaS tool. Each customer conversation is a potential source of failure, and each failure requires diagnosis, root-cause analysis, a configuration fix, testing, and monitoring. It is tedious, technical, and relentless. Fin Operator aims to collapse that entire loop into a conversational interface.

How one AI system plays data analyst, knowledge manager, and debugger all at once

Donohue described Operator as filling three distinct roles that typically consume the bandwidth of support ops teams: expert data analyst, expert knowledge manager, and expert agent builder.

As a data analyst, Operator can field high-level questions like, "How did my team perform last week?" and generate on-the-fly charts, trend reports, and drill-down analyses across all of the data already stored in Intercom's platform. The company has loaded Operator with contextual knowledge about customer-specific data attributes to help it interpret workspace-specific metrics accurately.

As a knowledge manager, Operator can ingest a product update — say, a three-page PDF describing a new feature — and autonomously search the company's entire content library to identify what needs to change. It finds gaps, drafts new articles, suggests edits to existing ones, and presents everything in a diff-style review interface. The underlying search engine is the same semantic search system that Intercom has built and optimized for Fin over more than two years.

"On that knowledge management front, you just have such a time compression of something that would take, certainly hours, sometimes days, into the space of about 10 minutes," Donohue said.

As an agent builder, Operator introduces what the company calls a "debugger skill." Support ops teams can paste in a link to a conversation where Fin misbehaved, and Operator will trace every step of Fin's internal reasoning, identify the root cause — often a piece of guidance that unintentionally creates a loop — propose a rewrite, back-test the change against the original conversation, and then suggest creating a production monitor to catch similar issues going forward.

"This is literally what our professional services team does," Donohue explained. "You've written guidance that is unintentionally causing Fin to repeat itself — this happens a lot. You didn't realize it, but you never gave it an escape hatch."

The 'pull request' safety net that keeps humans in control of AI changes

One of the most consequential design decisions in Fin Operator is what the company calls its "proposal system" — a mechanism that functions like a pull request in software engineering.

Every change that Operator recommends — whether it is an edit to a help article, a rewrite of an AI guidance rule, or the creation of a new QA monitor — appears as a proposal with a full diff view. Users can inspect, edit, and approve each change before it takes effect. Nothing goes live without a human clicking "Apply."

"Right now, we're taking zero risk on this — Fin cannot make any changes to the system without human approval," Donohue emphasized. "Nothing goes live until a human clicks apply."

This is a notable architectural choice. In a market increasingly enamored with fully autonomous AI systems, the company is deliberately keeping a human approval gate in place — at least for now. Donohue acknowledged this will evolve, but said the current moment demands caution: "It's too big a leap to just let Operator make changes automatically and then tell the team, 'Hey, let me tell you about what I did.'"

For enterprise buyers evaluating AI tools, this design point matters. It is the difference between an AI system that proposes changes and one that enacts them — a distinction that compliance teams, security officers, and risk managers will scrutinize closely.

Why Fin Operator runs on Anthropic's Claude instead of the company's own AI models

In a revealing technical detail, Donohue confirmed that Fin Operator does not use the company's proprietary Apex models — the same custom AI models that power the customer-facing Fin agent and that the company has promoted as outperforming GPT-5.4 and Claude Sonnet 4.6 in customer service benchmarks.

Instead, Operator runs on Anthropic's Claude.

"We're not using our custom models," Donohue said. "Those are designed to directly answer customer questions, whereas these are closer to what frontier models are best suited for. This is really closer to software engineering."

The distinction is telling. Fin's Apex models are optimized for one thing: resolving customer service conversations with minimal hallucination and maximum accuracy. Operator's tasks — analyzing data, writing code-like configurations, debugging complex reasoning chains — demand a different kind of intelligence. Donohue characterized these capabilities as more akin to software engineering, an area where Anthropic's Claude models have been deliberately optimized.

The company has not ruled out building custom models for Operator in the future, but Donohue positioned it as a lower priority. What the team has built around Claude, he argued, is the differentiated layer: the proposal system, the debugger skill, the semantic search integration, the data attribution logic, and the charting capabilities that make Operator more than just "Claude inside the app."

Early beta testers say Fin Operator feels like adding five people to the team

Fin Operator is currently in beta with roughly 200 customers, a number Donohue said has "ramped up pretty fast the last couple of weeks."

Constantina Samara, VP of Customer Support, Enablement & Trust at Synthesia, said the tool has already changed how her team works: "Previously, improving how Fin handles a conversation often meant reviewing everything yourself — the conversation, the configuration, the content. With Fin Operator, you just ask. It walks you through what happened and makes improving Fin dramatically easier."

Jordan Thompson, an AI Conversational Analyst at Raylo, reported that he has been using Operator daily and has run head-to-head comparisons between Operator's analysis and his own manual work. "It's very accurate," Thompson said. "It's just as strong at high-level trend analysis as it is at debugging individual conversations. That's a real limitation when using an LLM connector on its own — you get conversational depth but nothing on reporting or trends."

Donohue also shared an internal anecdote from the company's own knowledge management team. Beth, who leads knowledge operations, told the product team that Operator made her feel like she had "five more people on my team." Whether internal testimonials carry the same weight as external customer validation is debatable, but Donohue said the knowledge management use case consistently generates the most visceral reactions because the time savings are so stark — collapsing hours or days of content auditing into roughly 10 minutes.

A new pricing model signals how AI is reshaping the economics of enterprise software

Fin Operator will live inside the company's Pro add-on tier — a relatively new bundle that already includes advanced analytics features like CX scoring, topic detection, real-time issue detection, and quality assurance monitoring across both AI and human agent conversations.

The pricing model introduces something new for the company: usage-based billing. Intercom has historically relied on outcome-based pricing — charging roughly $0.99 per conversation that Fin resolves without human intervention. Operator's work does not map cleanly to that model because it produces configuration changes, not customer resolutions.

"This has pushed us to a different model, to go more into that usage model for support ops teams," Donohue said. "We'll try to be generous with the usage amounts that come into Pro, but for people who are leaning heavily in, we'll have the ability to buy more usage blocks."

The shift is worth watching. Outcome-based pricing was one of the company's most distinctive market positions — a bet that customers would pay for results rather than seats. Extending that philosophy to internal operations work proved impractical, which suggests that as AI agents take on more diverse roles within an organization, the pricing models that support them will need to become equally diverse.

How Fin Operator stacks up in a crowded field of AI customer service competitors

Fin Operator lands in an increasingly competitive landscape. Zendesk, Salesforce, Sierra, and a constellation of AI-native startups are all building some version of AI-powered support operations tooling. The broader AI automation market is projected to reach $169 billion in 2026, according to Grand View Research, growing at a 31.4% compound annual rate.

But Donohue argued that Operator's differentiation lies in two areas. First, breadth: Operator works across the full surface area of the company's configuration system — data, content, procedures, simulations, guidance, and monitoring — rather than addressing a single narrow use case. Second, the fact that it spans both AI and human operations.

"Most critically, where I think we have the most differentiation is because it's for your human system and your AI system," Donohue said. "That's really one of the unique spaces we have — to have a first-class AI agent and a first-class help desk, and Operator works across both."

The competitive positioning also benefits from timing. The company's recent corporate rebrand from Intercom to Fin signals a wholesale commitment to AI that legacy players may struggle to match. As CEO McCabe wrote in announcing the name change, the AI agent "is about to be the largest part of our business." The help desk product continues as Intercom 2, but the parent company now carries the name of its AI agent — a branding move that some industry observers have interpreted as pre-IPO positioning. The Fin API Platform, launched in early April, adds another dimension: the company opened its proprietary Apex models to third-party developers and even offered to license the technology to direct competitors like Decagon and Sierra.

The real paradigm shift isn't a new chat interface — it's an agent that does the thinking for you

Step back from the product specifics and Fin Operator represents something potentially more consequential than a new dashboard or analytics tool. It is one of the first commercial products to explicitly embody the emerging paradigm of AI agents that manage other AI agents — a two-layer abstraction that is beginning to reshape how companies think about operational software.

Donohue was emphatic on this point. The real paradigm shift, he argued, is not the chat interface replacing buttons and menus. It is that the AI is doing the actual knowledge work — figuring out what should change, why, and how.

"The UX change is secondary, even though it's most visible," Donohue said. "The change is that we are identifying and doing the work of support operations. It's doing the work of what the knowledge manager is doing, so that they just have to approve that. That's the huge shift."

The analogy to software engineering is apt. Over the past year, AI coding agents have fundamentally altered the daily workflow of developers, shifting their primary responsibility from writing code to reviewing and guiding the AI that writes it. Donohue sees the same transformation arriving for support operations professionals.

"Software engineers — three months have upended their world, where their primary job now is managing agents who are actually writing the code," he said. "Similarly now, support ops, your job is to manage an agent who's managing the agent for your customers."

Whether this vision pans out at enterprise scale remains to be seen. The company is still launching Operator in beta precisely because it wants to keep refining quality through what Donohue described as a painstaking, conversation-by-conversation debugging process. "We've spent three months, conversation by conversation, learning, fixing, learning, fixing, to get it where it's robust," he said.

But if the early returns hold, Fin Operator may preview what the next generation of enterprise software looks like: not tools that help humans do work faster, but agents that do the work themselves, subject to human judgment and approval. For customer service leaders already running AI agents in production, the question is no longer just "how good is my bot?" It is now, inevitably, "who is managing it?" And increasingly, the answer is another bot.

How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75%

bendee983@gmail.com (Ben Dickson) — Fri, 15 May 2026 21:04:15 GMT

One of the key challenges of current multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, drives up token costs, and makes it difficult to train the entire system as a cohesive unit.

To overcome this challenge, researchers at University of Illinois Urbana-Champaign and Stanford University developed RecursiveMAS, a framework that enables agents to collaborate and transmit information through embedding space instead of text. This change results in both efficiency and performance gains.

Experiments show that RecursiveMAS achieves accuracy improvement across complex domains like code generation, medical reasoning, and search, while also increasing inference speed and slashing token usage.

RecursiveMAS is significantly cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective blueprint for custom multi-agent systems.

The challenges of improving multi-agent systems

Multi-agent systems can help tackle complex tasks that single-agent systems struggle to handle. When scaling multi-agent systems for real-world applications, a big challenge is enabling the system to evolve, improve, and adapt to different scenarios over time.

Prompt-based adaptation improves agent interactions by iteratively refining the shared context provided to the agents. By updating the prompts, the system acts as a director, guiding the agents to generate responses that are more aligned with the overarching goal. The fundamental limitation is that the capabilities of the models underlying each agent remain static.

A more sophisticated approach is to train the agents by updating the weights of the underlying models. Training an entire system of agents is difficult because updating all the parameters across multiple models is computationally non-trivial.

Even if an engineering team commits to training their models, the standard method of agents communicating via text-based interactions creates major bottlenecks. Because agents rely on sequential text generation, it causes latency as each model must wait for the previous one to finish generating its text before it can begin its own processing.

Forcing models to spell out their intermediate reasoning token-by-token just so the next model can read it is highly inefficient. It severely inflates token usage, drives up compute costs, and makes iterative learning across the whole system painfully slow to scale.

How RecursiveMAS works

Instead of trying to improve each agent as an isolated, standalone component, RecursiveMAS is designed to co-evolve and scale the entire multi-agent system as a single integrated whole.

The framework is inspired by recursive language models (RLMs). In a standard language model, data flows linearly through a stack of distinct layers. In contrast, a recursive language model reuses a set of shared layers that processes the data and feeds it back to itself. By looping the computation, the model can deepen its reasoning without adding parameters.

RecursiveMAS extends this scaling principle from a single model to a multi-agent architecture that acts as a unified recursive system. In this setup, each agent functions like a layer in a recursive language model. Rather than generating text, the agents iteratively pass their continuous latent representations to the next agent in the sequence, creating a looped hidden stream of information flowing through the system.

This latent hand-off continues down the line through all the agents. When the final agent finishes its processing, its latent outputs are fed directly back to the very first agent, kicking off a new recursion round.

This structure allows the entire multi-agent system to interact, reflect, and refine its collective reasoning over multiple rounds entirely in the latent space, with only the very last agent producing a textual output in the final round. It is like the agents are communicating telepathically as a unified whole and the last agent provides the final response as text.

The architecture of latent collaboration

To make continuous latent space collaboration possible, the authors introduce a specialized architectural component called the RecursiveLink. This is a lightweight, two-layer module designed to transmit and refine a model's latent states rather than forcing it to decode text.

A language model's last-layer hidden states contain the rich, semantic representation of its reasoning process. The RecursiveLink is designed to preserve and transmit this high-dimensional information from one embedding space to another.

To avoid the cost of updating every parameter across multiple large language models, the framework keeps the models' parameters frozen. Instead, it optimizes the system by only training the parameters of the RecursiveLink modules.

To handle both internal reasoning and external communication, the system uses two variations of the module. The inner RecursiveLink operates inside an agent during its reasoning phase. It takes the model's newly generated embeddings and maps them directly back into its own input embedding space. This allows the agent to continuously generate a stream of latent thoughts without generating discrete text tokens.

The outer RecursiveLink serves as the bridge between agents. Because agents in a real-world system might use different model architectures and sizes, their internal embedding spaces have entirely different dimensions. The outer RecursiveLink includes an additional layer designed to match the embeddings from one agent's hidden dimension with the next agent's embedding space.

During training, first, the inner links are trained independently to warm up each agent's ability to think in continuous latent embeddings. Then, the system enters outer-loop training, where the diverse, frozen models are chained together in a loop, and the system is evaluated based on the final textual output of the last agent.

The only thing that gets updated in the training process is the RecursiveLink parameters and the original model weights remain unchanged, similar to low-rank adaptation (LoRA). Another advantage of this system comes into effect when you have multiple agents on top of the same backbone model.

If you have a multi-agent system where two agents are built on the exact same foundation model acting in different roles, you do not need to load two copies of the model into your GPU memory, nor do you train them separately. The agents will share the same backbone as the brain and use the RecursiveLink as the connective tissue.

RecursiveMAS in action

The researchers evaluated RecursiveMAS across nine benchmarks spanning mathematics, science and medicine, code generation, and search-based question answering. They created a multi-agent system using open-weights models including Qwen, Llama-3, Gemma3, and Mistral. These models were assigned roles to form different agent collaboration patterns such as sequential reasoning and mixture-of-experts collaboration.

RecursiveMAS was compared to baselines under identical training budgets, including standalone models enhanced with LoRA or full supervised fine-tuning, alternative multi-agent frameworks like Mixture-of-Agents and TextGrad, and recursive baselines like LoopLM. It was also compared to Recursive-TextMAS, which uses the same recursive loop structure as RecursiveMAS but forces the agents to explicitly communicate via text.

RecursiveMAS achieved an average accuracy improvement of 8.3% compared to the strongest baselines across the benchmarks. It excelled particularly on reasoning-heavy tasks, outperforming text-based optimization methods like TextGrad by 18.1% on AIME2025 and 13% on AIME2026.

Because it avoids generating text at every step, RecursiveMAS achieved 1.2x to 2.4x end-to-end inference speedup. RecursiveMAS is also much more token efficient than the alternative. Compared to the text-based Recursive-TextMAS, it reduces token usage by 34.6% in the first round of the recursion, and by round three, it achieves 75.6% token reduction. RecursiveMAS also proved remarkably cheap to train. Because it only updates the lightweight RecursiveLink modules, which consist of roughly 13 million parameters or about 0.31% of the trainable parameters of the frozen models, it requires the lowest peak GPU memory and cuts training costs by more than half compared to full fine-tuning.

Enterprise adoption

The efficiency gains — lower token consumption, reduced GPU memory requirements, and faster inference — are intended to make complex multi-step agent workflows viable in production environments without the compute overhead that limits enterprise agentic deployments. The researchers have released the code and trained model weights under the Apache 2.0 license.

Claude’s next enterprise battle is not models: it’s the agent control plane

carl.franzen@venturebeat.com (Carl Franzen) — Fri, 15 May 2026 16:45:00 GMT

New VB Pulse data shows Microsoft and OpenAI leading enterprise agent orchestration, but Anthropic’s first measurable foothold points to a larger fight over who controls the infrastructure where AI agents run.

For the last two years, the enterprise AI race has mostly been framed as a model war: OpenAI’s GPT series versus Anthropic’s Claude versus Google’s Gemini, with smaller and open-source alternatives also coming in from the U.S. and China.

But the next strategic fight may not be over which model answers a prompt best. It may be over who controls the layer where agents plan, call tools, access data, run workflows and prove to security teams that they did not do anything they were not supposed to do.

New VB Pulse survey data suggests the category is already taking shape. Our independent Enterprise Agentic Orchestration tracker, a survey that records the preferences of qualified, verified technical-decision maker respondents at enterprises at regular intervals, found that Microsoft Copilot Studio and Azure AI Studio led with 38.6% primary-platform adoption in February, up from 35.7% in January.

OpenAI’s Assistants and Responses API held second place, rising from 23.2% to 25.7%.

Anthropic remained far smaller, but it made its first appearance in the tracker: moving from 0% in January to 5.7% in February for Anthropic tool use and workflows.

The underlying move is small — four respondents out of a total 70 in this cohort, with more to come — but strategically interesting because it marks the first sign in this tracker of Claude usage moving from the model layer into native orchestration.

That distinction matters. Enterprises are not merely choosing chatbots. They are deciding where the live operational machinery of AI work will sit: inside Microsoft’s stack, inside OpenAI’s API layer, inside Anthropic’s managed runtime, inside an open framework, or across a hybrid mix of all of them.

“This is the convergence moment for enterprise AI,” said Tom Findling, CEO and cofounder of AI cybsersecurity startup Conifers, in a statement to VentureBeat. “Models and agent frameworks have matured enough together that enterprises are now shifting focus beyond model quality to the control plane around it. In security operations, we’re seeing the competitive advantage move toward platforms that can orchestrate agents, leverage enterprise context, and provide governance and auditability across customer environments.”

Anthropic’s number is still small to start — but the increase is not

The Anthropic number, by itself, should not be overread. A move from zero to 5.7% is not a juggernaut. It is not proof that Anthropic has captured enterprise orchestration.

It is not even enough to say Anthropic has a durable lead in any part of this market. Microsoft owns the early enterprise distribution advantage, and OpenAI has a much larger installed base in orchestration than Anthropic.

But small numbers can matter when they appear at the start of a new market structure. Anthropic’s emergence in orchestration comes as the broader VB Pulse data shows Claude also gaining massive enterprise adoption at the model layer.

In our VB Pulse Q1 Foundation Models and Intelligence Platforms tracker, Anthropic rose from 23.9% in January to 28.6% in February and then even more dramatically to 56.2% in March among qualified enterprise respondents, with the March reading flagged as directional only, because the sample was only 16 respondents.

The story, then, is not that Anthropic is winning orchestration today. It is that Anthropic’s model momentum may be starting to spill into the orchestration layer.

That is where the strategic stakes get higher.

A model is easier to swap than an agent runtime

A model is relatively easy to swap, at least in theory. A company can route one workload to Claude, another to GPT, another to Gemini and another to a smaller open model.

In fact, the VB Pulse Foundation Models tracker over the same Q1 period shows that multi-model strategy is the enterprise consensus: respondents increasingly report adopting multiple models and building orchestration layers that route across them by task, cost and risk profile.

An agent runtime is different. Once a company’s workflows, tool permissions, credentials, audit logs, memory, sandboxed execution and operational monitoring live inside one provider’s environment, switching providers becomes less like changing models and more like changing infrastructure.

That is the real reason Anthropic’s 5.7% foothold is worth watching

Anthropic has already made clear that it wants to provide more than the model. Its Claude Managed Agents documentation describes a public beta for a managed agent harness with secure sandboxing, built-in tools and API-run sessions, while Anthropic’s engineering post frames the architecture around decoupling the model from the surrounding agent machinery: the session, the harness and the sandbox.

In plain English, Anthropic is trying to host the environment where Claude agents remember context, use tools, run code, operate inside sandboxes and persist across long-running workflows. That is no longer just inference. That is operational infrastructure.

The pitch is obvious: most enterprises do not want to stitch together their own agent stack from scratch. They want agents that can act, but they also want permission boundaries, audit trails, workflow reliability and ways to stop the system when something goes wrong.

Security is becoming the buying criterion

The VB Pulse orchestration tracker shows that buyers are prioritizing exactly those concerns. Security and permissions ranked as the top orchestration platform selection criterion in both January and February, at 39.3% and 37.1%.

Control over agent execution rose from 17.9% to 22.9%, while flexibility across models and tools fell from 35.7% to 25.7%. The market appears to be shifting from optionality toward governance.

That shift is not surprising. A chatbot can be wrong and still remain mostly contained. An agent that can send emails, modify documents, query databases, call APIs or execute workflows has a much larger blast radius. The enterprise question is not only whether the agent is smart enough.

It is who gave it permission, what it touched, what it changed, whether those actions were logged, and whether the company can unwind the damage if something goes wrong.

Ev Kontsevoy, cofounder and CEO of Teleport, an identity and digital infrastructure solutions company, argues that the industry is still putting too much emphasis on orchestration itself and not enough on identity: “The race to own the agent orchestration layer is real,” Kontsevoy said. “It’s also solving the wrong problem first. Orchestration without identity only multiplies chaos. Without identity, you don’t know what an agent can access, what it actually did, or how to revoke its access when it operates outside policy. A unified identity layer is a prerequisite to deploying agents — one or many — in infrastructure.”

Syam Nair, Chief Product Officer at the intelligent data infrastructure company NetApp, believes data management is key in all cases to secure AI agent orchestration across the enterprise. As he said in a statement to VentureBeat: "Effective agent management requires built-in intelligence and a continuously updated understanding of both data and, critically, its metadata. This visibility allows organizations to define and enforce clear policies so data is used only by the right agents, for the right purposes. Making this work at scale is a crossfunctional effort. Security, storage, and data science teams must work together to implement policies that safeguard company data, while creating a strong data foundation for AI."

He continued: "The CIOs and technology leaders that are successful are the ones who take the input, policies, and vision from all these teams into account as they build a data infrastructure that minimizes risk and drives business value."

Microsoft has the distribution edge

That is why Microsoft’s early lead makes sense. Copilot Studio and Azure AI Studio sit inside an enterprise stack many companies already use: Microsoft 365, Teams, Entra ID, Azure and existing procurement relationships.

The VB Pulse Orchestration Tracker for Q1 2026 describes Microsoft as the enterprise default, with no other platform within 13 percentage points in February.

David Weston, CVP, AI Security, Microsoft, provided some insight on why, writing in a statement to VentureBeat: "Without a unified control layer, you start to see fragmentation – agents operating in silos, inconsistent governance, and gaps in security. What customers are asking for is a way to bring order to that complexity. With Agent 365, we’re providing a single control plane to observe, govern, and secure agents across Microsoft, partner, and third-party ecosystems, all grounded in enterprise data and identity."

OpenAI’s second-place position is also unsurprising. Its Assistants and Responses API gave developers an early way to build agent-like systems using OpenAI’s models and tooling. In the orchestration tracker, OpenAI is not surging, but it is still ticking up steadily: 23.2% in January to 25.7% in February.

Anthropic is the newcomer at the orchestration layer. But its timing may be favorable. The VB Pulse Foundation Models tracker for Q1 2026 suggests enterprises increasingly see Claude as a fit for higher-stakes workloads where safety, instruction following, long context and governance matter.

The orchestration tracker suggests those same buyers are now moving from agent experiments toward production workflows, where security, permissions and task reliability become the gating issues.

That creates a possible path for Anthropic: not to beat Microsoft as the default enterprise platform, at least not immediately, but to become the agent runtime for companies that already trust Claude for sensitive or complex workloads.

The risk is lock-in

The risk for enterprises is lock-in.

The orchestration tracker found that a hybrid control plane — combining provider-native orchestration with external orchestration — was the leading expected architecture, holding around 35% to 36% across the two substantive waves.

Provider-managed-only approaches grew modestly but remained a minority. The report’s conclusion is blunt: enterprises are not willing to give full orchestration control to any single provider.

It makes total sense as enterprises seek to leverage the "best-in-breed" models, harnesses, and tools from multiple vendors, especially as their needs differ widely across sector, business, and size.

"Most enterprises will operate in a multi-model, multi-agent environment, which makes an independent control plane essential," agreed Felix Van de Maele, CEO of Collibra, a company specializing in unified governance for data and AI, in a statement to VentureBeat. "That is why we built AI Command Center: to give organizations the visibility, governance, and real-time oversight needed to manage AI systems and agents across the full lifecycle."

That caution shows up in the risk data. When asked about risks if agent control lives inside a model provider platform, respondents cited security and permissioning limitations as the top concern. Vendor lock-in was the second-largest concern and the only one that increased from January to February, rising from 23.2% to 25.7%.

This is the tension at the heart of the agent market. Enterprises want managed infrastructure because building reliable agents is hard. But the more a provider manages, the more it may own.

Dr. Rania Khalaf, chief AI officer at WSO2 — the subsidiary of EQT that offers open source, customizable AI stacks for enterprises — said enterprises will need an agent control plane that sits apart from individual frameworks, harnesses and runtimes because agents combine the unpredictability of LLMs with the ability to take actions that have consequences.

“Teams want the freedom to use the best model and framework for each job — Claude for coding, Gemini for writing, LangGraph or CrewAI for dynamic modular behavior — and that heterogeneity makes consistent governance untenable in integrated platforms that lock into one ecosystem,” Khalaf said.

From LLMOps to Agent Ops

Khalaf said the industry is also moving from MLOps to LLMOps to “Agent Ops,” where governance has to cover the whole agent, not just the model call.

“A guardrail on an LLM call can catch hallucination or toxic output, but it will not catch an agent thrashing in an unbreakable, costly loop, which is why governance now has to extend out from the LLM interaction to the scope of the agent,” she said.

The practical implication is that enterprises need to separate policy and control from the agent logic itself. Khalaf pointed to the recent example of an agent deleting a production database despite being told not to, arguing that the failure showed the limits of relying on prompt-level instructions where hard identity and access controls are needed.

“Pulling guardrails, evals, policies, bindings, and agent identity out of the core agent logic allows them to be configured per deployment and per environment, owned by the appropriate teams in security, product, and compliance, without fragmenting the governance layer as different teams choose different models and frameworks,” Khalaf said.

MCP is open. The runtime may still be sticky

That is where Anthropic’s Model Context Protocol, or MCP, complicates the story. MCP is not a walled garden; Anthropic introduced it as an open standard for connecting AI systems to data and tools, and Anthropic’s documentation describes MCP as an open-source standard for connecting AI applications to external systems.

But openness at the protocol layer does not automatically eliminate lock-in at the runtime layer. An enterprise could use an open protocol to connect tools while still becoming dependent on a provider’s managed sessions, logs, sandboxes, permissions model, workflow state and deployment environment. In other words, MCP may reduce integration friction, while managed agent infrastructure could still increase switching costs.

Khalaf said Microsoft’s lead likely reflects its M365 and Azure distribution, while Anthropic’s emerging foothold could reflect a different architectural bet around open protocols such as MCP. But she argued the long-term direction is not a single-provider stack.

“Enterprises serious about running agents in production will end up multi-vendor across these layers,” Khalaf said, “which is why the open and interoperable control plane matters more than the current percentages might suggest.”

The next cycle may be cross-vendor collaboration

That same tension — between provider-native convenience and cross-vendor reality — is where Arick Goomanovsky, CEO and cofounder of universal AI agent orchestrator startup BAND, sees the next competitive cycle forming.

“Enterprises now run agents everywhere: individual assistants and coding agents, multi-agent systems in production, agents embedded in Agentforce and ServiceNow, and third-party agents consumed as agent-as-a-service,” Goomanovsky said. “None of them collaborate across those boundaries by default.”

Goomanovsky argues that the missing layer is not just orchestration inside a single model provider, but a cross-vendor collaboration layer that lets agents from different ecosystems act together.

“What’s emerging in parallel is demand for an agentic collaboration harness - an interaction layer that lets agents from Microsoft, OpenAI, Anthropic, and internal teams operate as one workforce,” he said. “Orchestration inside any single vendor is still a walled garden so the next competitive cycle is cross-vendor agent collaboration.”

Independent frameworks face an enterprise packaging problem

There is also a warning sign for independent orchestration frameworks. LangChain and LangGraph fell from 5.4% to 1.4% as the primary orchestration platform in the qualified enterprise sample.

External orchestration abstracted entirely from model providers also fell from 8.9% to 2.9%.

Scott Likens, Global Chief AI Engineer at professional services giant PwC, has a front row seat to this trend as the company spearheads and assists clients with their AI transformations.

As he told VentureBeat in a statement: "Right now, most enterprises are still operating in fragmented environments, with orchestration spread across platforms, business applications, and internally developed tooling. Over time, the market will likely move toward more unified orchestration models, but interoperability, governance and security will remain critical because enterprises are unlikely to standardize on a single agent ecosystem."

The report argues that fully independent orchestration frameworks may not yet have the enterprise packaging — security certifications, support, compliance documentation and vendor accountability — that procurement teams require.

That does not mean open frameworks are irrelevant. It does suggest that enterprise buyers may increasingly consume open or developer-first orchestration through managed products, cloud-provider partnerships or internal control planes rather than as standalone frameworks.

The agent market starts to look like cloud infrastructure

This is where the agent market starts to look less like the early chatbot market and more like enterprise cloud infrastructure. The winning vendors will not only have capable models. They will have identity integration, permission controls, audit logs, observability, workflow tooling, sandboxing, evaluation and a credible answer to who owns the control plane.

Indeed, the orchestration layer is but one part of the stack that the enterprise must fill in, and enterprises may actually decide to have different orchestration layers for agents working in different departments and functions.

As Nithya Lakshmanan, Chief Product Officer at revenue orchestration company Outreach.ai wrote in a statement to VentureBeat: "General-purpose orchestration platforms coordinate agent activity well, but they don't carry the workflow-specific context that determines whether an agent's action is correct for a given situation. In revenue workflows, an agent acting on incomplete deal history or missing buyer context will underperform and erode trust with users. The teams getting the most out of multi-agent systems are treating domain-specific data as the governance layer, with orchestration sitting on top. Most enterprises have chosen their orchestration stack, and what they're now figuring out is how those platforms get access to the workflow context they need to make agents useful inside specific business functions."

That is why Anthropic — which is increasingly launching its own domain-specific agents for finance and design, among other categories — is worth following closely. The company does not need to win the entire orchestration market tomorrow for its strategy to matter. It only needs to persuade a growing set of Claude enterprise customers to let Anthropic handle more of the surrounding machinery: tools, workflows, memory, execution and governance.

If it succeeds, Claude becomes more than a model in a multi-model portfolio. It becomes part of the infrastructure where enterprise work gets done.

That would put Anthropic in a more direct fight with OpenAI and Microsoft — not just over model quality, but over the operating layer of AI agents.

The narrow but important read

The safe interpretation of the VB Pulse data is narrow but important: Anthropic is not yet a major enterprise orchestration platform. Microsoft is. OpenAI is much closer. But Anthropic has registered its first measurable foothold at the orchestration layer, just as the market is deciding who should control agent execution.

For enterprise buyers, that may be the question that matters most in 2026. Not which model is best, but which provider gets to run the agent — and how hard it will be to leave once the agent is running.