VentureBeat

How America's 250th birthday became a test of AI-powered collective intelligence

lb_rosenberg@yahoo.com (Louis Rosenberg, Unanimous A.I.) — Sat, 04 Jul 2026 19:00:12 GMT

Imagine if you could bring 250 people together in a massive room and have them discuss and debate an important issue, arguing the points and counterpoints, and converging on answers that accurately reflect their collective knowledge, wisdom, values, and sensibilities.

Now imagine that you convened this debate on America’s 250th birthday and asked 250 randomly selected Americans to come up with the top three innovations that America has contributed to the world over the last 250 years. What would they come up with?

I know – this all sounds impossible.

After all, you can’t get more than a dozen people to have a productive conversation on anything. At large scale, nobody would get enough airtime to express their views or respond to others. This is why typical business meetings or focus groups never have more than 8 to 10 people. Thoughtful real-time conversations just don’t scale.

To solve this, a new category of AI technology called “hyper-communication” is greatly expanding the size, scope, and efficiency of large-scale deliberations. It uses specialized AI agents to connect groups in real-time, allowing people to discuss and debate issues at any scale. The goal is to enable hundreds or even thousands of participants to hold thoughtful discussions where they can express their views and argue the merits of any issue.

I first wrote about this emerging technology in VentureBeat two years ago in an article about “Collective Superintelligence.” In that piece, I explain how large human groups can be hyper-connected by AI agents in ways that greatly amplify the group’s collective intelligence. You can check out the science behind hyper-communication in that prior VentureBeat piece. Here I am focusing on the debate among 250 Americans on America’s birthday.

To do this, I asked the team at Unanimous AI to field a randomly selected group of at least 250 Americans (with a broad distribution from every region in the country and a diverse mix of political and social demographics) and invite them to a twenty-minute online debate inside a hyper-communication platform called Thinkscape that enables massively scalable discussion by text, voice, or video.

Once connected, we asked the group to come up with the top three contributions that America has made to the world over the last 250 years – not a survey of opinions, but deliberation of ideas, arguments, evidence, and reasoning. The group converged on a set of top answers that surprised me – but on reflection, they were sensible and well-reasoned.

Before getting into the answers, let me show you what the debate looks like behind the scenes. There were 277 people, each of them debating the issues with four or five other people in parallel discussion spaces. The magic is the swarm of AI agents that connect all the small groups together into a single real-time deliberation. This is what it looks like at high speed:

In the debate above, the group of 277 people came up with 94 different ideas and then narrowed them down to a top 10, then a top 3. In the gif above, we plot only the top ten ideas as they emerged and battled for support during the live conversational debate.
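To make the structure concrete, here is a minimal sketch of how 277 participants could be split into parallel discussion spaces where each person talks with four or five others, meaning groups of five or six. The function name, the random shuffle, and the fixed seed are illustrative assumptions, not a description of how Thinkscape assigns people, and the sketch does not model the AI agents that pass ideas between groups.

```python
import random

def split_into_subgroups(participants, small=5, large=6, seed=0):
    """Shuffle participants and split them into groups of `small` or `large` people."""
    n = len(participants)
    # Find the fewest large groups that let the remainder divide evenly into small groups.
    for n_large in range(n // large + 1):
        rest = n - n_large * large
        if rest % small == 0:
            n_small = rest // small
            break
    else:
        raise ValueError("cannot split into the requested group sizes")

    shuffled = list(participants)
    random.Random(seed).shuffle(shuffled)  # hypothetical: random assignment

    groups, start = [], 0
    for size in [large] * n_large + [small] * n_small:
        groups.append(shuffled[start:start + size])
        start += size
    return groups

groups = split_into_subgroups(range(277))
print(len(groups), {len(g) for g in groups})
```

With 277 people this yields 55 small groups, which gives a sense of how many parallel conversations the agents would need to link.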

The most interesting part of a large debate like this is not the answers, but the reasons that emerge to justify the answers. Here is the group’s reasoning behind the “top three innovations” that America has given to the world over the last 250 years:

#1: The Internet: “Our collective perspective is that America’s greatest contribution to the world over the past 250 years is the internet. It was born exclusively in the U.S. through academic and government research and was scaled globally with profound impact. It transformed communication, democratized information and education, enabled commerce, medicine, research and cultural exchange, and amplified soft power and civic organizing. We also acknowledged significant harms (misinformation, addiction, privacy loss) and arguments that it’s recent, global, or not uniquely American.”

#2: Advances in medicine: “Our collective perspective is that the United States has saved and prolonged hundreds of millions of lives worldwide. American-developed vaccines have successfully eradicated or controlled once-deadly diseases, significantly extending life expectancy and enabling broader societal and technological progress. From major breakthroughs in cancer research and treatments to cutting-edge medical technologies that have revolutionized hospital safety and procedures, U.S. ingenuity has redefined healthcare. Ultimately, while the global diffusion of affordable medicines and vaccines has extended these benefits across borders, the U.S. remains a premier medical destination where people from around the world travel to receive the most advanced treatments.”

#3: Spreading democracy: “Our collective perspective is that one of America’s most significant global contributions is the nation's system of governance. The US has long demonstrated democracy in practice as an enduring global model. The U.S. Constitution provided a vital blueprint for representative government, inspiring democratic movements and revolutions worldwide while actively promoting human rights and individual liberties internationally. By empowering citizens with the fundamental power to vote and choose their own leaders, this framework has served as a foundational framework for broader societal advances and directly helped establish thriving democracies around the world.”

It’s important to remember, this is 100% human intelligence — a pure reflection of the collective knowledge, wisdom, and values of 277 randomly selected Americans. That’s because the role of the AI agents in a hyper-communication system is to connect people, not replace them. The agents work to enable scalable human deliberation in which every participant is given optimized ability to express their views, respond to others, and converge on solutions based on their merits. The only question left is — what should we ask next?

Louis Rosenberg earned his PhD from Stanford University, was a professor at California State University (Cal Poly) and has been awarded over 300 patents for his work in human-computer interaction, AI, and collective intelligence.

Trunk Tools' stack cut document review from 60 days to 10 by ditching general-purpose models

taryn.plumb@venturebeat.com (Taryn Plumb) — Sat, 04 Jul 2026 14:34:50 GMT

Most verticals aren’t clean, well-oiled SaaS databases; the reality is ugly documents, proprietary schemas, implicit workflows, and long‑running tasks that most general-purpose models struggle with.

This prompted construction project management company Trunk Tools to build a specialized, three-layer architecture — perception, semantics, agents — based on highly-detailed data to support high-accuracy, highly-relevant industry automation.

Their purpose-built stack has shrunk review cycles from months to days, prevented costly field errors, and given autonomous agents the ability to reason over millions of pages of documentation, the company says.

“We really set out to take the data from dispersed systems, pre-process it, structure it, go through our ontology into a knowledge graph, and then train AI models,” said Sarah Buchner, Trunk Tools' founder and CEO and a former carpenter.

For builders in other verticals, the company's approach could serve as a blueprint for transforming data chaos into agent‑ready, industry-specific workflows.

Where general-purpose LLMs break down on industry data

Foundation LLMs, while powerful, are optimized for breadth, not always depth.

“General-purpose LLMs are trained to be okay at everything, so they're weak at anything niche,” said Kriti Faujdar, a senior product manager working in AI infrastructure, agentic AI, security, and LLM platforms. For instance: Rare terms, domain-specific reasoning, the unspoken context that any practitioner “just knows.”

Web, app, and software developer Sébastien De Bollivier agreed that the biggest bottleneck is reliability on data that is “jargon-dense, abbreviation-heavy, and format-specific.”

“A GPT-4-class model can understand a French legal contract, but will fumble the specific article references practitioners need to cite,” he said.

Besides, the most valuable enterprise data never made it into pretraining anyway, Faujdar pointed out. It's sitting in internal systems and proprietary formats. “RAG helps a little,” she said. “But it's just giving better facts to a model that still can't reason properly in the domain.”

Pre-training on domain data is critical; enterprises should then fine-tune on good task examples and build their own evals. “A few thousand examples from real practitioners beats millions of scraped, noisy ones,” Faujdar said.

Mixture-of-experts (MoE) can provide specialization without inference costs blowing up. Pairing RAG with fine-tuning also works well; RAG handles the factual long tail while fine-tuning fixes vocabulary and reasoning.

De Bollivier pointed to the advantage of hybrid stacks: A general-purpose model for reasoning and orchestration, a smaller fine-tuned model (or dense retrieval over a curated corpus) for domain-specific extraction. He advised: “Don't fine-tune to make the model 'smarter' about a domain, fine-tune to make it more reliable on the specific output format your workflow requires.”

The trades and construction are certainly industries seeing traction with these techniques, as are legal and healthcare, De Bollivier said. These verticals have “high stakes for errors plus standardized document formats, equaling clear domain-training ROI.”

One honest caveat worth mentioning, Faujdar said: Specialized models often fall apart outside their domain, so they are of little use beyond their area of expertise unless they are retrained.

Perception, semantics, agents: inside Trunk Tools' three-layer stack

In highly-specialized domains like construction, “data dumps” into large language models (LLMs) don’t cut it, said Trunk Tools' CTO Amrish Kapoor. This is because most transformers are probabilistic models: When given an image, they report back that it is “probably” a tree, or “probably” a child playing next to a tree.

This makes them insufficient for high‑precision symbolic interpretation. For instance, in construction documents, a 2-millimeter-wide symbol has a vastly different meaning depending on where it’s placed.

Further, constrained by context limits, probabilistic models struggle with long‑term project memory. “I don't mean a context window of a few tokens,” Kapoor said. “I'm talking about long term memory that stretches across months and years, because this is how long some of these projects are.”

Instead, the company's three-layer system breaks workflows into:

Perception (reading and extracting data from messy docs like PDFs, drawings, or scans).
A semantic/graph layer (making sense of that data and understanding its relationships).
LLMs and agents on top.

Construction drawings are typically symbolic, Buchner said. A door isn't always labeled ‘door.’ Sometimes it's simply an arc on a wall that a trained eye learns to read based on years of practice.

“The perception layer is what teaches AI to read that language,” she said. The semantic layer then gives that information meaning; for instance, connecting the door to the drawing that details it, the spec that governs it, and the trade that installs it. This helps answer project engineers’ critical questions: Not "is there a door here?" but "does this door create a problem down the line?"
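The door example can be sketched as a tiny graph lookup. This is an illustration only, assuming a simple in-memory structure; the class, relation names, and identifiers such as `door:D-101` are hypothetical and are not Trunk Tools' ontology or schema.

```python
from collections import defaultdict

class ProjectGraph:
    """Tiny in-memory graph: nodes are strings, edges carry a relation label."""

    def __init__(self):
        self._edges = defaultdict(list)

    def link(self, source, relation, target):
        self._edges[source].append((relation, target))

    def related(self, source, relation):
        return [t for r, t in self._edges[source] if r == relation]

graph = ProjectGraph()
# Hypothetical identifiers; a perception layer would supply the detected element.
graph.link("door:D-101", "detailed_in", "drawing:A-501")
graph.link("door:D-101", "governed_by", "spec:doors-and-frames")
graph.link("door:D-101", "installed_by", "trade:door-installer")

def context_for(element):
    """Collect the linked records an agent would need before reasoning about an element."""
    return {
        "drawings": graph.related(element, "detailed_in"),
        "specs": graph.related(element, "governed_by"),
        "trades": graph.related(element, "installed_by"),
    }

print(context_for("door:D-101"))
```

The point of the sketch is the shape of the question: an agent asks for everything connected to the door, not just whether a door was detected.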

Particularly in construction, that shift matters because the cost of a problem compounds with time. “A conflict caught in design is relatively low cost to address,” Buchner said, “whereas the same problem caught in the field might cost tens of thousands of dollars.”

At a high level, the system identifies the document type and begins extracting information based on content (drawing, schedules, paragraph text). This data is then “transformed and augmented” in the platform, which triggers agentic workflows like knowledge graph relationships and end-user workflows.

For instance, an agent might review an architecture bulletin and produce a visual overlay comparing an older version and a newer version (flagging additions and removals), then generate written narratives that describe what those changes are in simple terms. This helps users understand what’s changed and coordinate with trade partners on updated pricing and change orders.

The scale of construction’s data problem

Construction workflows are “ripe with implicit assumptions and connections between data in its myriad of sources,” Buchner said. And the amount of unstructured data is “humanly impossible” to process or make sense of.

Buchner estimated the average high-rise building generates about 3.6 million pages of corresponding documentation. “If you print it into a stack of papers it would be as high as the building itself.”

All three layers of Trunk Tools' stack — perception, semantic, LLM — are trained on “very specific datasets” from customers with “explicit permissions” and auto‑labeling/IP, Kapoor explained. Customers who don’t want Trunk training on their data can opt out.

Data is deidentified and aggregated, and Trunk Tools also collects “tons more” labeled data through other pipelines like 3D building information modeling (BIM).

The company says it only ships agents that achieve around 95% accuracy. The team maintains continuous evaluation pipelines based on ground truth data from customers and experts. They also employ an LLM-as-a-judge model.

“This notion of an LLM as a judge is to score how well you're doing, both subjectively as well as objectively,” Kapoor said. Objectivity can be an easy ‘right’ or ‘not right,’ but subjectivity requires more nuance.

For instance, when creating an email, narrative, or explanation, an LLM-as-a-judge framework can create a composite score: a numerical value that aggregates different metrics and tests a model's performance or risk.
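A composite score of this kind is often just a weighted average. The sketch below assumes the judge has already returned a score between 0 and 1 for each metric; the metric names and weights are made up for illustration and are not Trunk Tools' evaluation criteria.

```python
def composite_score(metric_scores, weights):
    """Weighted average of per-metric scores, each expected in [0, 1]."""
    missing = set(weights) - set(metric_scores)
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(metric_scores[m] * w for m, w in weights.items()) / total_weight

# Hypothetical metrics and weights for judging a generated narrative.
weights = {"factual_accuracy": 0.5, "completeness": 0.3, "tone": 0.2}
judge_scores = {"factual_accuracy": 0.9, "completeness": 0.8, "tone": 1.0}

score = composite_score(judge_scores, weights)
print(round(score, 2))
```

Weighting lets a team say that a factual slip matters more than an awkward tone, which is the kind of nuance a single right-or-wrong check cannot capture.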

There can be challenges, though, particularly with latency, Buchner noted; any time the reasoning capacity of underlying models increases, the risk of latency goes up, too. Trunk Tools maintains a set of evaluation criteria to objectively measure latency whenever changes are made to underlying infrastructure, agents, and API calls.

Then, “before we release to customers, we ensure marginal changes to the end-user experience are well worth the performance enhancements,” Buchner said.

From 60 days to 10: the measurable payoff

Trunk Tools' platform powers seven AI agents purpose-built for construction, such as analyzing request for information (RFI) responses, overviewing bids, or reviewing drawings and submittals.

The submittal agent, for instance, flags missing, conflicting, or noncompliant information in product specs and RFIs. While it’s an essential step in the construction process, “it's a super annoying workflow,” Buchner said, because human reviewers have to compare documents “with a bunch of other parts of documents.”

But the agent is able to do this in seconds, and Trunk Tools says it has reduced submittal cycles from 50 to 60 days to 10, “which has massive schedule and financial implications.”

The company is now at a place where these agents are communicating directly with each other, which is “quite exciting,” Buchner said. So, for example, one agent will review an architectural drawing for accuracy, then autonomously hand it over to agents handling RFIs and asking follow-up questions.

“If the drawings have problems, the RFI agent is taking over and is actively reaching out for clarification,” Buchner explained.

Trunk Tools says its customers report savings of 20 to 40 minutes per field question. Buchner said that users in the field know better than anyone how much of a “time suck” it is to go back and forth from office trailers, dig through project documents in scattered systems or printed PDFs, reconcile discrepancies, and return to coordinate with trade partners.

The company says its customers report these additional outcomes:

Average 8-minute time savings for single-document retrieval (status checks, location lookups, quantity queries).
Average 20-minute time savings for standard referencing (cross-referencing 2 to 3 spec sections to form an answer).
Average 40-minute time savings for multi-document research (listing and filtering queries, mapping relationships, analyzing RFIs and submittals across 4 to 6 documents).
Average 75-minute time savings for complex tasks (creating RFIs and other communication materials, deep cross-referencing across documents, change tracking).

In one instance, the company's drawing review agent flagged that a structural beam had been moved up 8.5 inches. However, this was not documented by the architect. If the change hadn’t been caught, the project manager would likely have had to strip out and reinstall the right size beam, Buchner said. This rework would have added $10,000 or more to the budget, and “certainly there would have been implications on the schedule.”

Buchner also pointed to other examples: an agent flagged $60,000 in exaggerated pricing with no justification from landscaping subcontractors; identified a fireplace that needed to be sealed prior to drywall installation, saving around $100,000 in labor, materials, and delays; and called out that an electric door required a panel that wasn’t included in electrical drawings.

Learnings for other industries

Trunk Tools' approach to building agents is applicable to any vertical working with high volumes of unstructured, industry-specific data. Builders must understand the industry-specific data challenges their end users face and build technical infrastructure that can transform unstructured data into something an “LLM can traverse and understand,” Buchner said. “Only then can you build the connections between data points that ultimately feed agentic workflows.”

A lot of money is being invested in foundational models, so enterprises should build modular systems that can leverage the strengths of various models as they continue to improve, Buchner advised. Then, “build your technical advantage where the generic models are not investing and not performing well,” she said.

Enterprises lost Claude Fable 5 for a few weeks. New data shows two-thirds had already built their hedge

Fri, 03 Jul 2026 00:33:00 GMT

Two-thirds of enterprises have hedged their AI model strategy, and the past few weeks of controversy around Anthropic’s Claude Fable 5 model showed why that posture has gone mainstream.

On June 12, a U.S. export-control order pulled Anthropic's Claude Fable 5 — the most capable model on the market — offline for every customer, with no warning and no timeline. It returned this week wrapped in tighter safeguards, after China's Z.ai released its open-weights GLM-5.2 into the vacuum. New VentureBeat Pulse Research, which surveyed 145 enterprises across these last few weeks, shows that two-thirds had already hedged their model strategy before the order came down: 51% blend closed frontier models with open-weight models deployed on their own infrastructure, and another 16% are moving core workflows off closed APIs entirely. The remaining third was all-in on closed ecosystems when the lights went out.

The blackout put a spotlight on vendor dependency, by showing what happens when the model you rely on disappears. But vendor dependency is only the most visible piece of a deeper problem: Most enterprises lack the monitoring to know when an AI system they've put into production stops working correctly.

Just 1 in 10 enterprises has automated monitoring that would catch an AI model drifting, misbehaving, or failing in production. Roughly a quarter would learn of a production failure only when end users — internal or external — report it, or lack the visibility to detect it at all. And 79% of enterprise organizations have already taken a real financial or operational hit from autonomous agents — most often shadow AI, unauthorized agentic work run by enterprises' own employees on corporate credit cards, outside anyone's oversight.

We call this the “Control Gap,” or the distance between how aggressively enterprises are deploying AI and how little of it they can see, own, or govern. June’s blackout turned this into a live stress test.

About this data: VentureBeat Pulse Research surveyed 145 qualified respondents at organizations with 100 or more employees in June 2026, with fielding spanning the Fable 5 blackout that began June 12. The sample is self-selected and directional: 41% work in technology/software, 20% are consultants or advisors, and the respondent base skews senior and technical — CIO/CTO/CISOs (18%), directors of engineering/IT (14%), enterprise architects (12%). More than half of the respondents were from companies with 2,500 employees or more.

While our sample is not huge, what you can trust more than the exact percentages is the pattern: Every question in the survey, independently, points the same way, with deployment running ahead of governance, visibility, and cost control.

The full methodology is in the report.

How the Fable 5 export order rewrote enterprise AI risk

Fable 5 launched June 9 to immediate acclaim — and sticker shock, at $10 per million input tokens and $50 per million output. Three days later, the U.S. government issued an emergency export-control directive barring access by foreign nationals. Anthropic, with no way to verify nationality in real time, suspended the model for everyone.

Z.ai has continued to pick up momentum; on Wednesday it released an open agentic coding environment, called Zcode. OpenAI, meanwhile, previewed its cutting-edge GPT-5.6 line on June 26.

Enterprises had already spent the spring learning what AI dependence costs in dollars. Uber burned through its entire 2026 AI coding budget in four months after Claude Code adoption hit 84% of its roughly 5,000 engineers, Forbes reported. Microsoft canceled most internal Claude Code licenses in its Windows and Microsoft 365 division, steering engineers to its own tooling, according to The Verge.

June added the harder lesson: The model your workflows depend on can vanish overnight, by government order, through no decision of yours or your vendor's. And Chinese companies like DeepSeek were releasing hugely disruptive, powerful models, driving down costs to a fraction of Western ones.

Brian Craig, senior director of architecture at Liberty IT, the Ireland-based engineering arm of Liberty Mutual, one of the world’s largest insurance companies, saw both lessons collide in real time. Craig is Irish, which meant the export order hit him directly as a foreign-national user.

Onstage at VentureBeat's AI Impact event in New York on June 24, mid-blackout, I asked him about it. "Fable arrived, and immediately you saw the sticker price of using it, and you went, 'Ooh, goodness, it better be really good,'" Craig said. "But luckily enough, we didn’t get to use it enough to get to fall in love with it." Then it was gone.

The hedge was already built before the blackout hit

Craig's company was built to route around exactly this kind of disruption. Liberty IT runs what it calls an AI backbone — roughly 50 components spanning security, governance, observability, and orchestration, each independently replaceable.

"You can't lock in right now in one vendor and even one framework," Craig told the room. "You need to keep being able to have the flexibility with that backbone to be able to hook into different models, different vendors, depending not so much on who's the flavor of the day, but on what you can feel confident about for the next six months."

The survey shows Craig has plenty of company. A 51% majority of enterprises run a hybrid posture — closed frontier models for general reasoning, open-weight models deployed locally for specialized execution — and 16% are making a hard pivot, moving core workflows onto open weights running on their own hybrid or private cloud. The 32% holding a closed commitment are candid about why: The operational overhead of self-hosting still outweighs the savings for them. After June, that calculus has a new variable in it.
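In code, the simplest version of this hedge is an ordered fallback. The sketch below is an assumption about what such routing could look like, not Liberty IT's backbone; the provider functions are stand-ins and the exception type is hypothetical.

```python
class ModelUnavailable(Exception):
    """Raised by a provider when its model cannot be reached."""

def route_with_fallback(prompt, providers):
    """Try each (name, call) pair in order and return the first successful answer."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stand-in providers; real ones would wrap a vendor SDK or a self-hosted endpoint.
def closed_frontier(prompt):
    raise ModelUnavailable("suspended by provider")

def local_open_weights(prompt):
    return f"[local] {prompt}"

used, answer = route_with_fallback(
    "Summarize this claim",
    [("closed-frontier", closed_frontier), ("local-open-weights", local_open_weights)],
)
print(used, answer)
```

The design choice that matters is that callers depend on the router, not on any one vendor, so swapping or reordering providers does not touch the workflows built on top.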

Defection is now the active posture, and the target may surprise you. Asked which primary AI vendor they are most likely to downsize or phase out over the next 12 months, respondents named Microsoft first at 30% — most citing cutbacks to Copilot and Azure AI frameworks in favor of direct model access — ahead of the 28% who plan to trim no vendor at all. OpenAI drew 21%, largely on pricing volatility, with Anthropic at 15% and Google at 6%. No vendor faces an exodus. But loyalty by inertia has ended: Among these enterprises, actively cutting at least one provider is now more common than expanding across all of them.

Just 1 in 10 enterprises would catch a failing production model automatically

How would an enterprise know if one of its production AI models was drifting, behaving unsafely, or failing to complete tasks? We asked directly. Forty percent say they are very confident they would detect it. The question also asked what that confidence rests on, and respondents split into two camps: 30% rely on humans reviewing critical AI outputs, and just 10% — 14 of the 145 organizations — have automated monitoring and alerting running against production systems. The remaining respondents hold weaker positions still: 32% expect to catch most issues "eventually," 19% say they would likely hear about a failure from end users first, and 8% report no systematic visibility into production AI behavior at all.

That distinction matters because the two approaches are very different. Human review may seem like the gold standard, but it only reaches the outputs someone designates as important for such a review — and it happens at the pace humans can move at, with the inconsistency any manual process carries. Automated monitoring watches everything the system produces, continuously, and flags anomalies as they happen — for the same reason enterprises stopped depending on manual checks for uptime and security a decade ago.
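As a rough illustration of what "automated monitoring and alerting" can mean at its simplest, the sketch below tracks a rolling window of task outcomes and flags when the failure rate crosses a threshold. The class name, window size, and thresholds are assumptions chosen for the example; production systems typically track many more signals than pass or fail.

```python
from collections import deque

class FailureRateMonitor:
    """Tracks the most recent outcomes and flags when failures exceed a threshold."""

    def __init__(self, window=100, max_failure_rate=0.05, min_samples=20):
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples

    def record(self, ok):
        """Record one outcome; return True if an alert should fire."""
        self.outcomes.append(bool(ok))
        if len(self.outcomes) < self.min_samples:
            return False  # not enough data to judge yet
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        return failure_rate > self.max_failure_rate

monitor = FailureRateMonitor(window=50, max_failure_rate=0.10, min_samples=20)
alerts = [monitor.record(True) for _ in range(30)]    # healthy period
alerts += [monitor.record(False) for _ in range(5)]   # model starts failing
print(alerts.index(True))
```

In this run the alert fires on the fourth consecutive failure, without anyone having to designate that output for review.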

As agentic workloads multiply output volumes far beyond what any review team can read, the manual approach starts to fall behind. The leaders at our June 24 event in New York treat human review as a designed control with automation underneath it. "Nothing gets deployed into production unless it's a human actually reviewing it and signing off," Craig said of Liberty's agentic software factory, where planning, coding, testing, critic, and librarian agents ship features from epic to production. "It always has to be risk-based. That's why we work for an insurance company."

Todd Johnson, the Morgan Stanley managing director who runs agentic AI across the bank's end-of-day P&L controller process, described the same principle from finance: "One of our strong principles in our AI governance generally is that there always has to be human accountability, even if there's a degree of automation." VentureBeat covered Morgan Stanley's new results around its P&L resolution agent system separately.

Liberty Mutual and Morgan Stanley chose manual sign-off deliberately, layered on top of observability, identity, and governance infrastructure. Whether the human-review camp has similar infrastructure underneath is more than a single-select question can establish. The 16% who separately named missing observability tooling as their biggest governance barrier are the ones saying outright that it hasn't been built.

The top governance barrier is organizational: no single owner for AI across platforms

Why does the AI visibility tooling never get built? The respondents' answers suggest it is an organizational shortcoming. The single most-cited barrier to governing AI across platforms is the absence of a single owner or accountable team, at 32%. Vendor opacity follows at 25%, missing tooling at 16% — and a lack of talent lands dead last at 5%.

The skills exist, but the organizational mandate does not: Only 38% say a central team actually governs AI behavior across their platforms today, 21% say ownership is unclear or actively contested between teams, and 17% say no role holds formal accountability at all.

The AI surface being governed makes the vacuum worse. Fully 85% of enterprises run two or more platforms each claiming to be the "primary" AI layer — ERP, ITSM, productivity suite, data platform, each with its own AI, its own controls, and its own assumptions. 36% describe an open contest between four or more. Just 8% have consolidated to one. Asked in a free-text question what one thing they would fix, respondents converged from different directions on the same answer: a single accountable owner, and a control plane that abstracts cost, drift, and model choice away from the end user.

79% have already paid for an agent control failure — led by shadow AI

The cost of the vacuum is showing up on corporate cards.

Asked to name the most severe financial or operational control failure they have experienced from autonomous agents, 49% of enterprises cite shadow AI — departmental teams running unauthorized agentic pipelines on corporate credit cards, bypassing central financial oversight entirely. Another 25% have been hit by an infinite-loop bill, an uncaught recursive workflow racking up thousands in token costs in a single incident, and 6% by an agent that degraded production databases with unthrottled queries. Only 21% report guarded stability, with hard token throttling and budget caps at the infrastructure layer. Add it up: 79% of these enterprises have already paid for an agent control failure in real money or real downtime.
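A hard budget cap is conceptually simple, which makes its absence notable. The sketch below shows one way a cap could stop a runaway loop; the class names, the 10,000-token cap, and the 1,200 tokens per step are invented for illustration, and a real guard would usually sit at the gateway or infrastructure layer, not inside the agent's own loop.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push usage past the cap."""

class TokenBudget:
    """Hard cap on tokens for one workflow run."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens):
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(f"cap of {self.max_tokens} tokens would be exceeded")
        self.used += tokens

def run_agent_loop(step, budget, max_steps=1000):
    """Run `step` until it reports completion, the budget trips, or max_steps is hit."""
    for i in range(max_steps):
        tokens, done = step(i)
        budget.charge(tokens)  # raises as soon as the cap would be crossed
        if done:
            return "completed"
    return "step limit reached"

# A step that never finishes, standing in for an uncaught recursive workflow.
def runaway_step(i):
    return 1_200, False

budget = TokenBudget(max_tokens=10_000)
try:
    outcome = run_agent_loop(runaway_step, budget)
except BudgetExceeded:
    outcome = "halted by budget cap"
print(outcome, budget.used)
```

Here the loop is cut off after eight steps and 9,600 tokens. In this simplified version the step that trips the cap has already run before it is refused, so a stricter guard would check an estimate before each call.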

Finally, the economics of tokens suggest the pressure will keep rising. Per-token inference costs are falling 70 to 80% a year, but agentic workloads consume 100 to 500 times the tokens of the LLM tools they replaced.
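The arithmetic behind that pressure is worth spelling out. The sketch below combines the two ranges quoted above under a simplifying assumption that is ours, not the survey's: one year of price declines is applied to a workload that replaces an older LLM tool one for one.

```python
def net_spend_multiplier(price_drop_pct, token_multiplier):
    """Spend relative to the old tool after one year of per-token price declines."""
    return (100 - price_drop_pct) * token_multiplier / 100

# Best case: prices fall 80% and the agent uses 100x the tokens.
low = net_spend_multiplier(price_drop_pct=80, token_multiplier=100)
# Worst case: prices fall 70% and the agent uses 500x the tokens.
high = net_spend_multiplier(price_drop_pct=70, token_multiplier=500)

print(low, high)
```

Under those assumptions, spend still rises by roughly 20 to 150 times, because the growth in token volume far outpaces the fall in unit price.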

Brian Gracely, senior director of portfolio strategy at Red Hat, told our New York audience the answer starts with right-sizing: "If I'm simply trying to resolve an insurance claim, I don't need to know about the history of Western civilization in my model. I don't need to know soccer scores."

Enterprises are pairing smaller, specialized models with semantic routing, he said, so the platform decides which requests genuinely need frontier-scale reasoning — and which are burning premium tokens on commodity work. (One adjacent data point from the survey underlines the appetite for pragmatism: 73% of enterprises report little or nothing to show for their custom fine-tuning investments of the past 18 months — a reckoning we'll examine in its own report.)

The bottom line: Replaceability is spreading faster than ownership

The survey describes enterprises moving fast on AI with weak controls underneath. 58% are adding more AI initiatives than they retire. 85% run multiple platforms that each claim to be the primary AI layer. Three times as many enterprises rely on human review to catch a failing production model as have automated monitoring in place. And 79% have already paid for an agent control failure — most often unauthorized agent spending on corporate cards, outside IT's oversight.

On one problem, enterprises have clearly adapted: model dependency. Two-thirds hedge their model strategy, either running open-weight models alongside closed ones (51%) or moving core workflows off closed APIs entirely (16%). The Fable 5 shutdown showed the value of that position — the hedged companies could route around a model that a government order made unavailable overnight.

The remaining problems are internal, and no purchase fixes them: 32% name the lack of a single accountable owner as their top governance barrier, and 17% say no role holds formal accountability for AI at all. Assigning an owner costs nothing and requires no vendor. It still hasn't happened at most of these companies.

Our coming Q3 wave of research will measure whether June changed this — whether enterprises assigned owners and installed automated monitoring, or just added a second model and moved on.

Get the full Control Gap report here.

The themes in this report — agent orchestration, governance, and cost control — are the agenda at VB Transform, VentureBeat's flagship event, July 14-15 at Hotel Nia in Menlo Park, with technical leaders from Visa, GM, Waymo, Intuit, Instacart, LangChain and others. Details and registration here.

Disclosure: VentureBeat's June 24 AI Impact event in New York was sponsored by Red Hat and Intel. Sponsors have no input into VentureBeat Pulse Research survey design, findings, or editorial coverage.

New Alibaba AI framework skips loading every tool, cutting agent token use 99%

bendee983@gmail.com (Ben Dickson) — Thu, 02 Jul 2026 20:54:12 GMT

As enterprise AI systems scale to handle complex workflows, practitioners face the challenge of routing sub-tasks to the right tools and skills. Agents can have hundreds of tools and skills and get confused about which one to use for each step of a workflow.

To address this challenge, researchers at Alibaba developed SkillWeaver, a framework that creates an execution graph for a given task and chooses the right skills for each of the nodes. They also introduce Skill-Aware Decomposition (SAD), a novel technique that uses a feedback loop to enable the agent to fetch and vet relevant tool candidates iteratively. This compositional approach and feedback loop distinguish SkillWeaver from other tool-routing frameworks that choose tools in a one-shot fashion.

SkillWeaver relates to real-world AI applications where agents autonomously orchestrate multi-tool ecosystems, such as the Model Context Protocol (MCP), to execute multi-step business operations like downloading datasets, transforming information, and creating visual reports.

In practice, the researchers' experiments with SkillWeaver show that implementing this retrieve-and-route approach significantly increases accuracy while reducing token consumption by over 99% compared to naively exposing agents to an entire tool library.

For practitioners building AI agents, the main takeaway is that the granularity of task decomposition is the biggest bottleneck to accurate tool retrieval.

The challenge of skill routing

Skills are a key pattern in modern LLM agent architectures. A skill is a modular, reusable tool specification that uses structured natural language documentation.

As enterprise agents integrate with massive tool ecosystems, accurately routing user queries to the right skills becomes a difficult task. Exposing an entire library to an LLM to find the right tool is highly inefficient, quickly overwhelms context limits, and consumes hundreds of thousands of tokens.

Most current tool-use frameworks attempt to solve this through API retrieval, documentation matching, or hierarchical structures that treat routing strictly as a single-skill selection or per-step problem.

However, this single-skill paradigm is insufficient for enterprise environments because real-world queries are inherently compositional. A standard business request such as "Download the dataset, transform it, and create visual reports" cannot be fulfilled by one tool. It requires breaking the prompt down and sequencing an API client, a data processor, and a visualization tool into a cohesive, multi-step execution plan.

How SkillWeaver and SAD work

To tackle this, the researchers frame the problem of handling complex tasks that require multiple skills as "compositional skill routing." Given a complex user prompt and a vast library of tools, an agent must simultaneously figure out how to break the request into a sequence of atomic sub-tasks, how to map each sub-task to the single best available skill, and how to compose those skills into an executable plan.

SkillWeaver orchestrates this process through three distinct stages: Decompose, Retrieve, and Compose. In the first stage, an LLM acts as a task decomposer, breaking the user's complex query down into a sequence of sub-tasks that each require one skill. Once the sub-tasks are clearly defined, the system uses an embedding model to compare each sub-task against the skill library and pull a shortlist of the top candidate tools for each step.

In the final stage, a planner evaluates the retrieved candidates based on how well they work together. It checks for inter-skill compatibility to ensure the outputs of one tool naturally flow into the inputs of the next. It then creates a final execution plan as a Directed Acyclic Graph (DAG) that maps out dependencies so independent tasks can potentially execute in parallel.
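
The value of expressing the plan as a DAG can be sketched with Python's standard library. This is a hedged illustration, not SkillWeaver's executor: the step names (including the extra "summary_table" branch) and the `run_step` callback are hypothetical, and `graphlib.TopologicalSorter` simply stands in for whatever scheduler a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_plan(dependencies, run_step):
    """Run steps in dependency order; steps that become ready together run in parallel.

    `dependencies` maps each step to the set of steps it depends on.
    Returns the list of "waves" that were executed, for inspection.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    waves = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            list(pool.map(run_step, ready))  # block until the whole wave finishes
            sorter.done(*ready)
            waves.append(ready)
    return waves


# Hypothetical plan: two reporting steps both depend only on "transform".
plan = {
    "download": set(),
    "transform": {"download"},
    "chart": {"transform"},
    "summary_table": {"transform"},
}
```

Calling `run_plan(plan, print)` would run "download", then "transform", then the two independent reporting steps in the same wave.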

For example, consider a user asking an AI agent to "Download the dataset, transform it, and create visual reports." In the decompose stage, the decomposer LLM breaks this into three distinct sub-tasks: downloading the dataset, transforming the data, and creating the reports.

In the retrieve stage, the system searches the library and finds candidates like “api-client” or “http-fetch” for task one, “csv-parser” or “etl-pipeline” for task two, and so on. Finally, the compose stage evaluates these options, selects the specific combination of “api-client,” “csv-parser,” and “chart-gen” that are most compatible, and wires them together into a final, ready-to-execute workflow.
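
The three stages in this example can be sketched end to end in a few lines. This is a toy under loud assumptions, not the researchers' code: the skill descriptions and input/output types are invented for illustration, the decomposer is a string split standing in for the LLM, and retrieval uses word overlap in place of an embedding model.

```python
import re
from itertools import product

# Hypothetical skill library: name -> (description, input type, output type).
SKILLS = {
    "api-client": ("download dataset from a remote api", "url", "csv"),
    "http-fetch": ("download a web page over http", "url", "html"),
    "csv-parser": ("transform and clean csv data", "csv", "table"),
    "etl-pipeline": ("transform data in a warehouse", "warehouse", "table"),
    "chart-gen": ("create visual reports and charts", "table", "image"),
}


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def decompose(query):
    """Stand-in for the decomposer LLM: split on commas and 'and'."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", query.rstrip("."))
    return [p.strip() for p in parts if p.strip()]


def retrieve(subtask, k=2):
    """Shortlist up to k skills by word overlap (stand-in for embedding search)."""
    scored = [
        (len(tokens(subtask) & tokens(desc)), name)
        for name, (desc, _, _) in SKILLS.items()
    ]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:k]]


def compose(candidates_per_step):
    """Pick the chain whose outputs most often match the next step's input."""
    def links(chain):
        return sum(SKILLS[a][2] == SKILLS[b][1] for a, b in zip(chain, chain[1:]))
    return list(max(product(*candidates_per_step), key=links))


def route(query):
    return compose([retrieve(step) for step in decompose(query)])
```

In this sketch, `compose` raises a `ValueError` if any step has no candidates at all; a real planner would need to handle that case explicitly.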

A key challenge of this pipeline is that LLMs often produce generic step descriptions that fail to match the specific, technical vocabulary of the actual skills available in the library. To fix this, SkillWeaver introduces Iterative Skill-Aware Decomposition (SAD), a novel feedback loop. SAD works by having the LLM draft an initial plan, conducting a preliminary search to find loosely matching skills, and then feeding those retrieved skills back into the LLM as hints. This allows the LLM to rewrite its decomposition so that its granularity and vocabulary align with the tools that actually exist.
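
The shape of that feedback loop can be sketched as follows. This is an assumption-laden outline, not the paper's implementation: `decompose` and `search` are hypothetical callables standing in for the LLM call and the retriever, the number of rounds is arbitrary, and the toy stand-ins at the bottom exist only to make the loop runnable.

```python
from typing import Callable, List, Sequence


def skill_aware_decompose(
    query: str,
    decompose: Callable[[str, Sequence[str]], List[str]],
    search: Callable[[str], List[str]],
    rounds: int = 2,
) -> List[str]:
    """Draft a plan, retrieve loosely matching skills, then redraft with them as hints."""
    plan = decompose(query, [])  # initial draft, no hints yet
    for _ in range(rounds):
        # Collect hints in plan order, dropping duplicates.
        hints = list(dict.fromkeys(s for step in plan for s in search(step)))
        revised = decompose(query, hints)
        if revised == plan:  # hints no longer change the plan
            break
        plan = revised
    return plan


# Toy stand-ins (hypothetical): a real system would call an LLM and a vector index.
KEYWORD_TO_SKILL = {"data": "api-client", "process": "csv-parser", "visual": "chart-gen"}


def toy_search(step):
    return [s for kw, s in KEYWORD_TO_SKILL.items() if kw in step or s in step]


def toy_decompose(query, hints):
    if not hints:
        return ["get the data", "process it", "make something visual"]
    return [f"run {skill}" for skill in hints]
```

With the toy stand-ins, the generic first draft ("get the data", "process it", ...) is rewritten in the vocabulary of the retrieved skills after one round, and the second round confirms nothing changes.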

SkillWeaver in action

To evaluate how SkillWeaver performs in realistic enterprise scenarios, the researchers created a custom benchmark called CompSkillBench. It consists of 300 multi-step queries of different difficulty levels. To mirror real-world environments, they used a library of 2,209 real-world skills sourced from the public MCP ecosystem, covering 24 functional categories like cloud infrastructure, finance, and databases.

For the core engine, the researchers primarily used a lightweight 7-billion parameter model (Qwen2.5-7B-Instruct) for task decomposition, paired with a standard semantic search retriever (MiniLM with a FAISS index) to find the tools. SkillWeaver was evaluated against three main setups: a brute-force "LLM-Direct" method where they stuffed all the tool names into the prompt of a large model, a vanilla LLM-based decomposition without SAD, and a ReAct-style agent loop.

The experiments indicate that task decomposition is the main bottleneck. Standard LLM behavior falls short when dealing with large tool libraries, but the SAD feedback loop dramatically moves the needle. In the vanilla setup, the 7B model reached a decomposition accuracy (i.e., predicting the correct number of steps) of only 51.0%. With the SAD feedback loop activated, accuracy rose to 67.7% (with the larger Qwen-Max model, accuracy reached 92%). On "hard" tasks requiring four to five distinct skills, SAD improved accuracy by 50%.

One fascinating finding was that larger models can actually perform worse when unguided. When tested in the vanilla setup, a larger 14-billion parameter model saw its accuracy plummet below the 7B model's accuracy because it tended to over-decompose tasks into microscopic, unnecessary steps. Once SAD was introduced, the retrieved tool hints anchored the model back to reality and increased its accuracy. This suggests that aligning an agent with the vocabulary of specific tools is often more impactful than paying for a larger, more expensive LLM.

Another important takeaway is token savings. The LLM-Direct baseline, which used the very large Qwen-Max model, showed that feeding all tools into the prompt of a large model fails. Despite near-perfect task breakdown capabilities, the massive model only retrieved the right tool category 21.1% of the time when flooded with tool options. SkillWeaver's targeted retrieve-and-route approach vastly outperformed this in accuracy while slashing context window consumption from an estimated 884,000 tokens down to roughly 1,160 tokens per query, a 99.9% reduction. For practitioners, this translates directly to drastically lower API costs and faster response times.
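
The reduction figure follows directly from the two token counts reported above. A quick check, using only those numbers:

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return 1 - after / before


tokens_direct = 884_000  # estimated tokens per query for LLM-Direct (from the article)
tokens_routed = 1_160    # approximate tokens per query for SkillWeaver (from the article)

saving = reduction(tokens_direct, tokens_routed)
print(f"{saving:.1%}")  # prints 99.9%
```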

Finally, the traditional ReAct baseline completely failed, achieving 0% decomposition accuracy. Its loop naturally collapses multi-step plans into isolated actions rather than explicitly mapping out a cohesive, multi-tool sequence.

Considerations for developers

While the researchers have not yet released the source code for SkillWeaver, their work was built on off-the-shelf tools that can easily be reproduced.

Skill-Aware Decomposition (SAD), which is the key innovation at the heart of the framework, is a clever prompt-engineering and retrieval loop. The authors have shared the prompt templates in their paper, and developers can implement it themselves quite easily using standard orchestration libraries like LangChain, LlamaIndex, or even raw Python scripts.

As for the retrieval component, the authors built the core framework using all-MiniLM-L6-v2, an open-source embedding model. They found that swapping in a slightly stronger off-the-shelf encoder (BGE-base-en-v1.5) immediately boosted accuracy without any fine-tuning. While an off-the-shelf bi-encoder is great at getting a relevant tool into the top 10 candidates nearly 70% of the time, it struggles to consistently rank the perfect tool at exactly number one, achieving that only about 37% of the time. To bridge this gap, teams will likely need to implement a secondary cross-encoder or LLM-based reranker to re-order those top 10 candidates.
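
A two-stage retrieve-then-rerank setup of the kind suggested here might look like the sketch below. Everything in it is illustrative: the word-overlap scorer stands in for the bi-encoder, the bigram scorer stands in for a cross-encoder or LLM-based reranker, and the skill descriptions are invented.

```python
def words(text):
    return text.lower().split()


def overlap_score(query, doc):
    """Cheap first-stage score (stand-in for a bi-encoder): shared words."""
    return len(set(words(query)) & set(words(doc)))


def bigram_score(query, doc):
    """Costlier order-aware score (stand-in for a cross-encoder): shared word pairs."""
    def bigrams(text):
        w = words(text)
        return set(zip(w, w[1:]))
    return len(bigrams(query) & bigrams(doc))


def retrieve_then_rerank(query, skills, shortlist=10):
    """Shortlist with the cheap scorer, then reorder the shortlist with the precise one."""
    candidates = sorted(skills, key=lambda s: overlap_score(query, s), reverse=True)
    candidates = candidates[:shortlist]
    return sorted(candidates, key=lambda s: bigram_score(query, s), reverse=True)


# Hypothetical skill descriptions.
skills = ["convert json to xml", "csv to json converter", "send email"]
ranked = retrieve_then_rerank("convert csv to json", skills, shortlist=2)
```

In this example the first stage ties the two conversion skills and leaves the wrong one on top; the reranker, which looks at word order, moves "csv to json converter" to first place.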

One upfront preparation requirement is vectorizing the tool library and building a FAISS index in advance. In practice, this is a negligible hurdle. Embedding and indexing all 2,209 skills in the benchmark took a mere 15 seconds. Once built, retrieving tools from the index adds less than 15 milliseconds of latency per query. For enterprise environments, syncing the tool index is a trivial background job.

A current limitation in SkillWeaver is the lack of error recovery. While SkillWeaver successfully maps out a compatible DAG for execution, the authors' pilot study revealed the challenges of multi-step tool chains. For example, if an API call fails in step two, the entire chain breaks. The paper's core contribution is limited to the routing and planning phase. For a true production deployment, practitioners must build their own error recovery, fallback, and retry mechanisms on top of the compose stage to handle real-world API timeouts or malformed outputs.
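
One way to layer that recovery on top of a composed plan is a small retry-and-fallback wrapper around each step. The sketch below is a generic pattern, not part of SkillWeaver; the function name, the retry count, and the backoff schedule are assumptions made for illustration.

```python
import time


def run_with_recovery(step, primary, fallbacks=(), retries=2, backoff=0.0):
    """Try the primary skill with retries, then each fallback skill in turn.

    `primary` and each fallback are callables that take the step and return a
    result. Raises RuntimeError if every skill fails on every attempt.
    """
    last_error = None
    for skill in (primary, *fallbacks):
        for attempt in range(retries + 1):
            try:
                return skill(step)
            except Exception as err:  # a real system would catch narrower errors
                last_error = err
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"step {step!r} failed after retries and fallbacks") from last_error
```

A natural source of fallbacks is the retrieval shortlist itself: the runner-up candidates for a step are already known to be loosely compatible with it.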

Z.ai launches ZCode to challenge Cursor, Claude Code and GitHub Copilot in AI coding

michael.nunez@venturebeat.com (Michael Nuñez) — Thu, 02 Jul 2026 10:00:00 GMT

Z.ai, the Beijing-based artificial intelligence lab formerly known as Zhipu AI, on Wednesday officially launched ZCode, a free desktop application it describes as an "Agentic Development Environment" purpose-built for its flagship GLM-5.2 large language model. The move marks the company's most aggressive push yet into the fast-growing AI-powered coding tool market, where it now competes directly with Cursor, Claude Code, GitHub Copilot, and Google's Antigravity.

"Introducing ZCode, the official development environment for GLM-5.2," the company wrote on X, noting the tool is available on macOS, Windows, and Linux, supports bring-your-own-key (BYOK) configurations for third-party models, and offers a 1.5x usage-quota bonus for subscribers to its GLM Coding Plan.

Read one way, ZCode is simply another entrant in a crowded market. Read another, it is a single product that crystallizes three of the most consequential trends in enterprise software today: the race-to-the-bottom pricing of frontier AI models, the geopolitical balkanization of the AI stack, and the rapid maturation of AI coding agents into what Gartner now estimates is a roughly $10 billion market.

An AI coding tool designed to think in projects, not prompts

Unlike traditional IDEs that bolt on AI through a chat sidebar or autocomplete extension, ZCode is best understood as an agent-first development environment. Its core design is built around long-horizon tasks: the user describes an outcome, the agent plans the work, edits files, runs checks, reviews progress, and continues across multiple iterations until the goal is met.

ZCode organizes the development experience around the ZCode Agent, which is tuned for GLM-5.2: the model, tools, and execution workflow are tuned together so the Agent fits continuous, multi-step real-world development tasks. The environment supports continuous follow-up across devices: desktop, mobile Remote, and Feishu / WeChat Bot can all keep the same workspace task moving. Sensitive commands, file changes, and high-permission actions go through confirmation before execution.

That remote-control feature — the ability to steer a running coding agent from WeChat, Feishu, or Telegram on a phone — is a differentiator that speaks directly to the Chinese developer market, where those messaging platforms dominate professional communication. You can keep checking progress and adding instructions while long-running work continues, from any device with these messaging apps.

The tool is free to download. Revenue flows through Z.ai's GLM Coding Plan subscription tiers, which start at $16.20 per month for a "Lite" plan and scale to $144 per month for "Max" — prices that undercut Anthropic's Claude Code and Cursor's comparable tiers by significant margins.

Through July 31, ZCode is offering a promotional 1.5x effective quota bonus for Coding Plan subscribers, with off-peak token consumption charged at a 0.67x coefficient. The platform also supports multiple AI models and agents, including Claude Code, Codex, Gemini, and OpenCode — a pragmatic concession to the reality that no single model wins every task.

GLM-5.2, the open-source model trained entirely on Chinese chips, powers the whole experience

ZCode's value proposition is inseparable from GLM-5.2, the model it was designed to showcase. Z.ai released GLM-5.2 on June 16, first to its Coding Plan subscribers and subsequently as open-source weights under the MIT license on Hugging Face — a sequencing decision that prioritized distribution over the traditional benchmark-led launch.

The model's specifications are formidable. GLM-5.2 is a 744-billion-parameter mixture-of-experts architecture with 40 billion active parameters, a genuine one-million-token context window — five times the 200K limit on its predecessor — and training on 28.5 trillion tokens. It ranked second globally on Code Arena as of mid-June, trailing only Anthropic's Claude Fable 5, making it one of the highest-performing publicly available models for coding tasks.

Critically, the model was built entirely without American chips. As Decrypt reported, GLM-5.2 "runs entirely on Huawei silicon." Stability AI founder Emad Mostaque estimated total training costs at roughly $25 million, with 80 percent spent on post-training — a figure that, if accurate, would make GLM-5.2 extraordinarily cheap relative to Western frontier models.

On benchmarks, GLM-5.2 performs within striking distance of the best proprietary systems. It trails Anthropic's Claude Opus 4.8 by just one percentage point on FrontierSWE, a benchmark measuring multi-hour autonomous engineering projects, while edging out OpenAI's GPT-5.5.

Its API pricing, at $1.40 per million input tokens and $4.40 per million output tokens, is a cost reduction of up to 82 percent compared with Anthropic's Claude Opus 4.8 at $5 and $25, respectively. Because ZCode is a first-party tool from the same company that makes the model, it requires no manual endpoint configuration: the model is wired in.
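
The "up to 82 percent" figure is the output-token saving; the input-token saving is smaller. A quick check using only the list prices quoted above:

```python
# Prices in US dollars per million tokens, as quoted in the article.
glm = {"input": 1.40, "output": 4.40}
opus = {"input": 5.00, "output": 25.00}

savings = {kind: 1 - glm[kind] / opus[kind] for kind in glm}

for kind, value in savings.items():
    print(f"{kind}: {value:.1%}")
# prints:
# input: 72.0%
# output: 82.4%
```

The blended saving for a given workload therefore depends on its mix of input and output tokens, and falls somewhere between those two figures.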

The Anthropic export ban gave Chinese AI its biggest opening yet

ZCode's arrival cannot be separated from the geopolitical drama that has roiled the AI industry over the past three weeks. On June 12, the U.S. government, citing national security authorities, issued an export control directive suspending all access to Anthropic's Fable 5 and Mythos 5 models by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. Enterprise clients in finance, healthcare, SaaS, and critical infrastructure found their core intelligence services abruptly disabled, without exception, prior warning, or effective recourse.

While the Trump administration has since lifted those controls (Anthropic confirmed on June 30 that the Department of Commerce had rescinded the directive), the episode sent shockwaves through the developer community and accelerated interest in open-source, self-hostable alternatives. The government's crackdown on Anthropic coincided with a swift rise in Chinese open-source models that are proving to be almost as capable as, and significantly cheaper than, some of the most powerful U.S. models.

Z.ai's timing was surgical. On the same day the Trump administration ordered Anthropic's most advanced models blocked for foreign nationals, Zhipu announced the open-source release of GLM-5.2 with no usage restrictions. The South China Morning Post reported that GLM-5.2 would be available to all users of Zhipu's new GLM Coding Plan subscription, "priced at just a tenth of Anthropic's premium Claude Code and Claude Max tiers."

The market responded accordingly. Zhipu AI's market capitalization crossed HK$1 trillion (US$128 billion) on June 22, driven by a 42 percent intraday share surge. JPMorgan raised its 2026–2030 revenue forecast for Zhipu by between 7 and 16 percent following the launch, projecting a revenue surge of more than 534 percent for 2026 and expecting the AI firm to turn a profit by 2028.

Why vendor lock-in now carries a geopolitical risk that no SLA can cover

The Fable 5 episode did more than embarrass Anthropic. It introduced a new risk category into enterprise AI procurement: sovereign access risk. When a government can disable a commercially deployed AI model overnight, the traditional evaluation criteria of developer experience, benchmark scores, and pricing become secondary to a more fundamental question: Will this tool still work tomorrow?

The event exposed the inadequacy of standard enterprise contract language. An investigation by FifthRow found that almost all standard Data Processing Addenda, SaaS agreements, and procurement SLAs "relied on vague 'force majeure' or 'compliance with law' catch-alls, not on precise, actionable regulatory suspension or kill-switch clauses."

ZCode's BYOK architecture and GLM-5.2's MIT-licensed open weights offer a partial answer. A development team can download the model, host it on its own infrastructure, and run ZCode against it without ever touching Z.ai's cloud — eliminating both American export-control risk and Chinese data-sovereignty concerns in a single move. The catch is that anyone using Z.ai's cloud API remains subject to Chinese law, a consideration that evaporates only with pure self-hosting.

Gartner analysts have warned that governance, pricing, support, workflows, commercial maturity, and market durability matter as much as developer experience and model capabilities when evaluating coding agent vendors for enterprise-wide adoption. By that measure, ZCode faces a steep climb. It is not open source itself; Linux support remains in beta; and security reviewers have flagged the need for careful evaluation of its credential handling, particularly for remote development over SSH and messaging-platform-triggered tasks — an agent that can be summoned from WeChat involves access paths that should be mapped before trusting it with anything sensitive.

Inside the $10 billion race where model labs are becoming full-stack IDE companies

ZCode enters one of the most crowded and fastest-moving markets in enterprise software. Enterprise AI coding agents are capturing a growing share of enterprise software engineering spend, with the market estimated at roughly $9.8 billion to $11.0 billion annualized as of April 2026, according to Gartner. A defining shift this year, the analyst firm noted, is "the movement of frontier model providers into direct competition with application-layer vendors" — precisely the pattern ZCode embodies.

Gartner codified this evolution in May when it renamed its annual Magic Quadrant from "AI Code Assistants" to "Enterprise AI Coding Agents," defining the category as "autonomous or semiautonomous software engineering solutions that perceive context, translate human intent into multistep plans, and execute and verify those steps across code, tests and related engineering artifacts." The 2026 Magic Quadrant names Anthropic, Cursor, GitHub, and OpenAI as Leaders. Z.ai was not among the 12 vendors evaluated — an absence that underscores both the company's nascent enterprise sales presence outside China and the Western-centric lens through which the analyst community still views the market.

The competitive landscape is daunting. Cursor is the $2 billion ARR IDE that feels like VS Code with a supercharger. Claude Code reached approximately $2.5 billion in annualized revenue by early 2026. Google relaunched Antigravity 2.0 at I/O in May, and Cognition retired the Windsurf brand, relaunching the IDE as Devin Desktop with the Agent Command Center as the default surface.

Against these entrenched players, ZCode's pitch rests on three pillars: deep first-party integration with GLM-5.2 that no third-party editor can replicate, aggressive pricing that starts at a fraction of Western competitors, and MIT-licensed open weights that allow enterprises to self-host — eliminating the regulatory kill-switch risk that the Fable ban made viscerally real.

Z.ai's real challenge is turning a $128 billion valuation into a global developer tools business

Z.ai controls the model (GLM-5.2), the subscription layer (the GLM Coding Plan), and the IDE (ZCode) — a tightly coupled stack that optimizes for performance but concentrates switching costs. For the company, the business logic is clear. Its most reliable revenue stream has been on-premises deployments for Chinese government agencies, state-owned banks, and energy conglomerates. In full-year 2025, on-premises deployment revenue reached RMB 534 million, growing over 100 percent year-over-year and accounting for 73.7 percent of total revenue with a gross margin of 48.8 percent. ZCode and the GLM Coding Plan represent the company's bid to build a comparable revenue engine in cloud-based developer tools — globally, not just in China.

The early signals are encouraging for Z.ai, if anecdotal. Community reception on X was enthusiastic, with one early user calling the tool "super stable" and others clamoring for more Coding Plan capacity. "Bro, can't snag your family's Coding Plan? When are you gonna stock up on more cards?" one user wrote in Chinese, suggesting demand is already outstripping supply.

But the hard questions loom large. Can a Chinese AI company build trust with Western enterprise buyers amid escalating technology tensions? Can ZCode's ecosystem mature fast enough to compete with Cursor's polished UX, Claude Code's deep agent primitives, and GitHub Copilot's unmatched distribution? And can Z.ai sustain a company valued at $128 billion while still losing money?

What is no longer in question is the competitive dynamic itself. Three weeks ago, a U.S. government directive proved that access to the world's best coding model can vanish overnight. Today, a Chinese lab is shipping a free IDE, an open-source model trained on zero American chips, and a subscription plan that costs less per month than a single lunch in Manhattan. The AI coding agent market did not just become global this summer. It became a market where the fallback option might be better than the thing it's falling back from — and that changes the calculus for every engineering leader choosing a toolchain in the second half of 2026.

The Control Gap: Enterprise AI organizations have an ownership problem, not a technology problem — and most are governing it by hand

Wed, 01 Jul 2026 21:50:00 GMT

Across enterprises, AI portfolios are expanding far faster than the ability to govern them. Most organizations run a contested field of platforms, each claiming to be the “primary” AI layer; few could confidently detect a model drifting or failing in production; and the single most-cited barrier to control is the absence of any one owner accountable for AI across the stack. The result is a widening control gap, with ambition and spend racing ahead of visibility, ownership, and cost control, and with autonomous agents already producing real financial and operational failures.

This wave of VentureBeat Pulse Research examines the enterprise AI control gap: how many platforms claim to be the primary AI layer, who actually governs AI behavior across them, whether organizations could detect a model failing in production, what most blocks cross-platform governance, and how the financial and operational control failures of autonomous agents are already surfacing.

The central finding is a control gap — the distance between how aggressively enterprises are expanding AI and how little of it they can see, own, or govern. Just under three-fifths (58%) are net-adding AI initiatives, with “expanding significantly” the largest single posture.

Yet 85% run two or more platforms each claiming to be the “primary” AI layer and only 8% have consolidated to one. Against that contested surface, 40% say they are very confident they would detect a model drifting, behaving unsafely, or failing in production — but only 10% back that confidence with active monitoring and alerting, the rest leaning on manual human review. The machinery to expand AI is running well ahead of the machinery to control it.

The gap is, above all, a question of ownership. Just under two-fifths (38%) say a central team governs AI today, and a fifth (20%) say each platform team governs its own independently; the single most-cited barrier to cross-platform governance is the absence of a single accountable owner (32%), and roughly one in six (17%) say no role holds formal accountability at all. The same vacuum shows up in spend: just under half (49%) name shadow AI (unauthorized agentic pipelines run on corporate cards outside central oversight) as their most severe control failure, and another 25% have been hit by a runaway “infinite loop” agent bill. Enterprises have standardized the ambition well before they have standardized the control.

Methodology

VentureBeat fielded this survey as part of its ongoing Pulse Research series. This instrument focused on the enterprise AI control gap: governance, observability, and cost control across multiple AI platforms. Responses are filtered to organizations with 100 or more employees and, for this cut, exclude the respondents who selected “Other” as their job function, leaving a base of identifiable roles (n=145); all are drawn from a single Q2 2026 (June) wave.

By organization size the sample tilts toward the mid-market and lower-large bands: 100–499 and 500–2,499 employees (23% each) lead, with 10,000–49,999 (22%) and 2,500–9,999 (20%) close behind and 50,000+ at 11%. By role it is senior and technical: consultants and advisors (20%), CIO/CTO/CISO (18%), directors of engineering/IT (14%), product and program managers (13%), and enterprise architects (12%) make up the core. Technology/Software is the largest industry at 41%, followed by Financial Services and Professional Services (12% each) and Healthcare/Life Sciences and Manufacturing/Industrial (10% each).

The findings should be read as a directional signal rather than a precise measurement: the sample is self-selected and is not a probability sample. Where a single share would be fragile on its own, the report leans on the direction and grouping of responses rather than the exact percentage point.

Finding 1: Expansion is outrunning control

AI portfolios are growing faster than the means to govern them

We asked enterprises to describe how their AI portfolio has changed over the past 12 months. Growth leads — with a meaningful minority deliberately pulling back.

Expansion leads. Combining “expanding significantly” (33%) and “net positive growth” (25%), just under three-fifths of enterprises (58%) are net-adding AI initiatives. Yet a substantial share is easing off deliberately: roughly a quarter (23%) are actively rationalizing — scaling what works and cutting the rest — and another 12% hold their portfolios flat. Only a handful (3%) have paused to get governance in order first.

This is the engine behind every gap that follows: enterprises are accelerating into a landscape they have not yet learned to see or own, and a notable 4% cannot even describe their own portfolio. The ambition documented here is exactly what makes the visibility and ownership shortfalls in Findings 3 and 4 consequential rather than academic.

Finding 2: No single “primary” AI layer — the surface is contested

More than four in five run multiple platforms each claiming primacy

We asked how many enterprise platforms currently claim to be the organization’s “primary” AI layer — the ERP, EHR, ITSM, productivity suite, or data platform each positioning itself as the center of gravity. Almost no one has a single answer.

The defining condition is contested primacy. Adding the two multi-platform bands, 85% of enterprises have at least two platforms each asserting itself as the primary AI layer, and more than a third (36%) describe an open four-way-or-more contest. Only 8% have consolidated to a single layer, and another 6% have not even mapped the question. This is the structural reason governance is hard: there is no agreed center of gravity to govern from. Each platform brings its own AI, its own controls, and its own assumptions — and, as Finding 3 shows, the question of who governs across them increasingly has no settled answer.

Finding 3: Governance is claimed at the center but contested in practice

A central team owns it on paper; in practice, it's fragmenting

We asked who is actually responsible for governing AI behavior across all of those platforms today, and which function holds primary accountability. The headline answer is reassuring; the detail is not.

On the surface, a central governance function is the leading answer, but fewer than two-fifths (38%) claim one, well short of a majority. The rest of the distribution undercuts it further: a fifth (21%) say ownership is unclear or contested between teams, a fifth (20%) say each platform team simply governs its own AI independently, and 19% say no one has addressed it at all.

Accountability fragments further when we ask which role actually holds it: CIO/CTO/CISO leads at 27%, a Chief AI Officer or equivalent follows at 22%, and a striking 17% say no one holds formal accountability yet. Even where a central team is claimed, the named owner is most often the general technology executive rather than a dedicated AI authority. The governance function exists more often as an org-chart aspiration than an operating reality, which is the precondition for the detection gap in Finding 4.

Finding 4: The detection gap — confidence is real but largely manual

Only one in 10 have active monitoring and alerting

We asked how confident enterprises are that they would detect an AI model in production that was drifting, behaving unsafely, or failing to complete tasks correctly. This is the heart of the control gap.

This is the report’s central number. While 40% say they are very confident they would detect a failing model, three-quarters of that confidence rests on manual human review (30%) rather than automation: just 10% have active monitoring and alerting actually in place.

At the other end, more than a quarter combine the two reactive answers — no systematic visibility (8%) and would hear it from end users first (19%) — meaning they would learn of a production failure after the fact, from the people it affected. The plurality (32%) sit in a hopeful middle, expecting to “catch most issues eventually.” Set against the aggressive expansion of Finding 1, this is the crux of the control gap — enterprises are scaling AI into production faster than they are building automated means to know when it breaks. Confidence is real, but it is largely manual, and automated detection remains the exception.

Finding 5: The missing owner is the biggest barrier

Governance stalls on accountability first, visibility second

We asked enterprises to name their single biggest barrier to governing AI across multiple platforms. The org chart tops the list.

The single missing owner leads at 32%, the most-cited barrier. Vendor opacity (25%) and the lack of tooling or infrastructure to observe across platforms (16%) sit behind, and together these two technical-visibility barriers (41%) outweigh the ownership gap. Leadership deprioritization accounts for another 17%, while a clear lack of talent is rare (5%). Rounding out the picture, another 5% say it isn't a barrier for them at all — they've already solved it.

Read together, the picture is more contested than the headline suggests: enterprises still most often name a missing owner, but a good share locate the obstacle in vendor black boxes and the absence of cross-platform observability.

Asked in a free-text question what one thing they would fix, respondents converged from different directions on the same answer — a single accountable owner, and a control plane that abstracts cost, drift, and model choice away from the end user.

Finding 6: The fine-tuning ROI reckoning

Roughly seven in 10 have little to show for custom model investment

We asked what share of the proprietary foundation models enterprises have invested in fine-tuning over the past 18 months have delivered clear, measurable positive ROI in production today. Most describe a sandbox graveyard — or a deliberate decision to avoid one.

Custom fine-tuning has, for most, not paid off. Combining the three disappointing outcomes (sandbox graveyard, strategic avoidance, and total write-off), roughly seven in 10 (73%) either failed to get custom models into productive use or deliberately declined to try, against 27% for whom fine-tuned models are a reliable advantage. The largest single group (45%) remains the graveyard: projects too expensive or complex to maintain, stranded in development. Another quarter (24%) never started; they priced in the downstream maintenance burden and avoided it.

The signal is that many enterprises still treat bespoke model training as a cost trap, which helps explain the pragmatic, buy-and-blend vendor posture in Finding 7.

Finding 7: Vendor posture — hybrid by default, with defection rising

Enterprises blend open and closed models; more are now trimming a vendor

We asked two related questions: whether enterprises are shifting workloads toward open-weight models to escape API costs and lock-in, and which proprietary vendor, if any, they are most likely to phase out over the next year. The answers describe hedging — and a rising willingness to cut.

On open weights, a slim majority (51%) strike a hybrid balance, with a deliberate closed commitment second at 32% and a hard pivot to self-hosted open models at 16%. The hybrid majority reflects the same instinct visible throughout this survey (keep optionality, avoid being trapped), while the closed group remains candid that the operational overhead of self-hosting still outweighs the savings for them.

On vendor defection, loyalty by inertia no longer leads: Microsoft is now the single most-named target (29%, often citing Copilot/Azure cutbacks in favor of direct model access), narrowly ahead of the 27% who are downsizing no one at all. OpenAI follows at 21% (citing pricing volatility), with Anthropic at 15% and Google at 6%. No single vendor faces a wholesale exodus, but among identifiable roles the balance has tipped from “expanding across all” toward actively trimming at least one provider.

Finding 8: The agentic spending crisis — shadow AI leads the failures

Unauthorized pipelines, not runaway loops, are the top control failure

Finally, we asked what the most severe financial or operational control failure enterprises have experienced as autonomous agents run over longer execution windows. Shadow AI tops the list — and very few have escaped a scare.

The control gap has a price, and it is being paid. Just under half of enterprises (49%) cite shadow AI — unauthorized agentic pipelines spun up on corporate cards outside any central oversight — as their most severe failure, the operational twin of the “no single owner” barrier in Finding 5. Another 25% have been burned by a runaway infinite-loop agent bill, and 6% by an agent that degraded production databases. Only 21% report guarded stability — the minority that has imposed hard token throttling and budget caps at the infrastructure layer and avoided surprises.

Put differently, roughly four in five of these enterprises (79%) have already experienced a real financial or operational control failure from autonomous AI, not merely worried about one. As with detection in Finding 4, the deterministic controls that would prevent these failures exist at only a fraction of organizations.
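
The "hard token throttling and budget caps" described by the guarded minority can be pictured as a deterministic check that runs before every agent step. The sketch below is only an assumption about what such a guard might look like; the class name, the limits, and the error type are hypothetical and are not drawn from any respondent's implementation.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run would exceed one of its hard caps."""


class TokenBudget:
    """Hypothetical per-run guard: a hard token cap plus a step cap."""

    def __init__(self, max_tokens: int, max_steps: int) -> None:
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.tokens_used = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Record one agent step; refuse it if either cap would be exceeded."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded(f"step cap of {self.max_steps} reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap of {self.max_tokens} reached")
        self.steps += 1
        self.tokens_used += tokens


# Simulate a runaway loop in which every step requests 4,000 tokens.
budget = TokenBudget(max_tokens=10_000, max_steps=50)
try:
    while True:
        budget.charge(4_000)
except BudgetExceeded as err:
    print(f"agent halted after {budget.steps} steps: {err}")
```

In this toy run the loop is stopped on its third step, before the charge is recorded, so spend never passes the cap. The point of the design is that the check is enforced outside the agent's own reasoning, which is what makes it deterministic.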

The bottom line: A control gap that spending cannot close on its own

Organizations with 100 or more employees describe AI programs that are expanding fast and governing slowly. Just under three-fifths are net-adding to their portfolios; more than four in five run a contested field of platforms with no agreed primary layer; and the thing they most often name as their chief obstacle is the lack of a single accountable owner. The visibility to match the ambition is largely manual: only 10% have active monitoring and alerting, and confidence in detecting a failing model rests mostly on human review rather than automation.

The consequences are already concrete rather than hypothetical. Custom fine-tuning has disappointed more often than not, pushing enterprises toward a hedged, hybrid, buy-and-blend model posture; and the autonomous agents now reaching production have produced real control failures for roughly four in five respondents, led by shadow AI running outside any central oversight. This reads as a directional signal rather than a precise measurement — but the direction is consistent across every question: ambition, spend, and deployment are racing ahead of ownership, observability, and cost control. The control gap is not a tooling problem that more spending will close on its own; it is, first, a question of who owns the answer.

Based on survey responses from 145 qualified enterprise respondents (100+ employees). Sample size is small; data should be treated as directional. Respondents include Directors, VPs, CIOs, CTOs, and Enterprise Architects across Technology, Financial Services, Retail, Healthcare, and other sectors.

Anthropic is bringing back Claude Fable 5 globally after US lifts export control order — where can enterprises access it?

carl.franzen@venturebeat.com (Carl Franzen) — Wed, 01 Jul 2026 15:51:00 GMT

Anthropic is restoring global access to its most powerful generally released AI model yet, Claude Fable 5, today, after the U.S. Department of Commerce last night withdrew the emergency export controls it had issued previously around the model.

The U.S. export control order issued on June 12, 2026, led Anthropic to suspend all global access to both Fable 5 and its less restricted cybersecurity counterpart model Claude Mythos 5, just days after both models were initially introduced.

Now, Fable 5 is once again being made available for users globally across the primary Anthropic ecosystem, including the Claude Platform, Claude.ai, Claude Code, and Claude Cowork. The official Claude account on X announced the return of the model at 3:31 pm ET on July 1, 2026.

For organizations that rely on cloud hyperscalers, Anthropic says it is moving to re-enable access on Amazon Web Services, Google Cloud, and Microsoft Foundry “as quickly as possible.” VentureBeat has not yet been able to confirm whether the models have been restored on these third-party cloud platforms.

Mythos 5 remains a different case. A letter posted on the social network X allegedly from U.S. Commerce Secretary Howard Lutnick to Anthropic executive Tom Brown says a license is no longer required for the export, reexport, or in-country transfer of Fable and Mythos.

But Anthropic’s own redeployment post on its website says only that Mythos 5 access has been restored for “a set of US organizations,” following government approval on June 26. The company says it is continuing to coordinate with the government to expand access to broader domestic and international partners in its opt-in cybersecurity testing program, Project Glasswing.

That leaves Mythos 5 in a middle category: legally cleared from the emergency export-control order, but not generally available. The current limit appears to come from Anthropic’s decision to keep Mythos behind a vetted-access model, with the U.S. government still playing a role in approvals, standards and expansion.

Posting on X, Commerce Secretary Howard Lutnick said Anthropic and the government had “worked closely” to “analyze and approve Fable 5,” while White House Chief of Staff Susie Wiles also posted on X, framing the decision around U.S. AI leadership and deployment speed.

Wiles wrote that the United States is the “undisputed winner in the AI race,” adding that the shared priority is to “get the best tech deployed as quickly and safely as possible.”

The reversal follows criticism of the export control order from cybersecurity leaders and AI policy experts, who argued that the U.S. risked hobbling its own industry while giving Chinese AI labs an opening. Former Facebook security chief Alex Stamos called the Fable restriction a “huge own goal for the US,” warning that security companies could be driven toward Chinese models, while other critics said the "ad hoc" regulatory intervention made dependence on U.S. AI platforms look like a strategic liability.

Reminder on Claude Fable 5 pricing

For chief information and technology officers evaluating the return of the model, the deployment comes with distinct conditions and a significant price tag.

Anthropic is pricing both Fable 5 and Mythos 5 at $10.00 per million input tokens and $50.00 per million output tokens, the most expensive of all frontier models globally.

Model	Input ($/1M)	Output ($/1M)	Total ($/1M)	Source
MiMo-V2.5 Flash	$0.10	$0.30	$0.40	Xiaomi
deepseek-v4-flash	$0.14	$0.28	$0.42	DeepSeek
deepseek-v4-pro	$0.435	$0.87	$1.305	DeepSeek
MiniMax-M3	$0.30	$1.20	$1.50	MiniMax
LongCat-2.0 — limited-time promo	$0.30	$1.20	$1.50	LongCat
Gemini 3.1 Flash-Lite	$0.25	$1.50	$1.75	Google
Qwen3.7-Plus	$0.40	$1.60	$2.00	Alibaba Cloud
MiMo-V2.5	$0.40	$2.00	$2.40	Xiaomi
LongCat-2.0 — standard	$0.75	$2.95	$3.70	LongCat
Grok 4.3 (low context)	$1.25	$2.50	$3.75	xAI
MiMo-V2.5 Pro (≤256K)	$1.00	$3.00	$4.00	Xiaomi
Kimi-K2.6	$0.95	$4.00	$4.95	Moonshot AI
GLM-5.2	$1.40	$4.40	$5.80	Z.ai
GPT-5.6 Luna	$1.00	$6.00	$7.00	OpenAI
Grok 4.3 (high context)	$2.50	$5.00	$7.50	xAI
MiMo-V2.5 Pro (>256K)	$2.00	$6.00	$8.00	Xiaomi
Qwen3.7-Max	$2.50	$7.50	$10.00	Alibaba Cloud
Gemini 3.5 Flash	$1.50	$9.00	$10.50	Google
Gemini 3.1 Pro Preview (≤200K)	$2.00	$12.00	$14.00	Google
GPT-5.6 Terra	$2.50	$15.00	$17.50	OpenAI
GPT-5.4	$2.50	$15.00	$17.50	OpenAI
Gemini 3.1 Pro Preview (>200K)	$4.00	$18.00	$22.00	Google
Claude Opus 4.8	$5.00	$25.00	$30.00	Anthropic
GPT-5.5	$5.00	$30.00	$35.00	OpenAI
GPT-5.5 Instant (chat-latest)	$5.00	$30.00	$35.00	OpenAI
Sakana Fugu Ultra (≤272K)	$5.00	$30.00	$35.00	Sakana AI
GPT-5.6 Sol	$5.00	$30.00	$35.00	OpenAI
Claude Fable 5 / Claude Mythos 5	$10.00	$50.00	$60.00	Anthropic
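
To make the list prices concrete, here is a minimal sketch that estimates the cost of a single request from the per-million-token rates in the table above. The workload (200,000 input tokens and 20,000 output tokens) and the helper name `request_cost` are illustrative assumptions, not figures or tooling from Anthropic, and the sketch ignores any caching, batch, or subscription discounts.

```python
from decimal import Decimal

# List prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Claude Fable 5": (Decimal("10.00"), Decimal("50.00")),
    "Claude Opus 4.8": (Decimal("5.00"), Decimal("25.00")),
    "deepseek-v4-flash": (Decimal("0.14"), Decimal("0.28")),
}

MILLION = Decimal(1_000_000)


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the USD list-price cost of one request (hypothetical helper)."""
    input_rate, output_rate = PRICES[model]
    return (input_rate * input_tokens + output_rate * output_tokens) / MILLION


# Illustrative workload: 200,000 input tokens and 20,000 output tokens.
for name in PRICES:
    print(f"{name}: ${request_cost(name, 200_000, 20_000):.4f}")
```

At that assumed volume, the request costs $3.00 on Fable 5, $1.50 on Opus 4.8, and roughly $0.03 on deepseek-v4-flash, which is the gap the table's "Total" column summarizes.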

However, to incentivize immediate enterprise adoption following the export control order disruption saga, Anthropic is executing a temporary rollout plan through July 7.

For Pro, Max, Team, and select Enterprise subscriptions, Fable 5 usage will be included at no added cost for up to 50% of a user’s weekly tier allowance.

After July 7, Fable 5 will move to usage credits for those plans. For standard Enterprise seats, there is no included Fable 5 allowance; all usage is billed through credits, and the model will not work for those users unless credits are enabled.

Some AI influencers are already offering enterprises and developers guidance on how to maximize their Fable 5 usage during the seven-day promotion, while the model is still included in subscriptions.

Chronology of a Crisis: From Launch to Lockout

The whiplash regulatory cycle surrounding the model underscores the volatility currently facing enterprise software supply chains. The crisis unfolded over a rapid, three-week timeline:

June 9, 2026: Anthropic launches Claude Fable 5 and Mythos 5. Early corporate case studies report major performance gains. For instance, Stripe reports that Fable 5 compressed a codebase-wide migration across a 50-million-line Ruby infrastructure into a single day — a project estimated to take a team more than two months by hand.
June 12, 2026: At 5:21 PM ET, the U.S. government issues an export-control directive citing national security authorities. The order bans access to the models by any foreign national, whether inside or outside the borders of the United States. Lacking real-time mechanisms to verify user nationality at the API layer, Anthropic is forced to pull the plug for all customers to ensure compliance. Anthropic says access to all other Anthropic models was not affected.
June 13–25, 2026: Enterprise users and developers face abrupt disruption, forcing workflows that had adopted Fable 5 or Mythos 5 to fall back to older models such as Opus 4.8. Tensions peak as Anthropic publicly objects, arguing that pulling a major commercial model over a narrow jailbreak finding could “essentially halt all new model deployments for all frontier model providers.”
June 26, 2026: The U.S. government allows Anthropic to restore Mythos 5 access to a set of trusted U.S. organizations, partially reversing the June 12 order. Anthropic says it is restoring access for those organizations and continuing to work with the government to expand Mythos 5 access and make Fable 5 generally available again.
June 30, 2026: Commerce Secretary Howard Lutnick sends a letter withdrawing the June 12 export-control license requirement for both Mythos and Fable. The decision removes the emergency legal block, but Anthropic’s rollout still treats the models differently: Fable 5 returns globally, while Mythos 5 remains limited to approved users through Glasswing and related trusted-access channels.

The Technical Catalyst: The Amazon Vulnerability Report

The swift intervention by the federal government stemmed from a report by Amazon researchers describing a method for bypassing Fable 5’s safeguards. This was a brutal irony for Anthropic, given Amazon was one of the startup's initial and largest backers to the tune of $8 billion, and the two companies previously collaborated on improving Amazon's Alexa+ voice assistant.

According to Anthropic, the technique prompted Fable 5 to identify software vulnerabilities; in one case, the model produced code demonstrating how the relevant vulnerability could be exploited.

When the report reached government officials, it triggered alarm about the offensive cyber capabilities of publicly available large language models (LLMs). Anthropic countered that the exploit did not tap into unique “Mythos-level” cyber capabilities, noting that its own testing found other models, including Claude Opus 4.8, OpenAI’s GPT-5.5, and Moonshot’s Kimi K2.7, could identify the same vulnerabilities. Anthropic also said every model it tested could produce the same exploit demonstration as Fable 5.

To break the regulatory logjam, Anthropic developed an improved automated safety classifier specifically trained to catch and neutralize the Amazon technique. Tested by the Commerce Department’s Center for AI Standards and Innovation (CAISI), the updated classifier successfully halts that specific technique in more than 99% of cases.

Anthropic explicitly warns enterprise clients that this safety enforcement comes at an operational cost. Because the new classifiers require an expanded “safety margin” to catch ambiguous edge cases, benign coding and debugging requests may be flagged more often. When a prompt is blocked by the safety layer, the active session automatically downgrades, routing the request to Opus 4.8.

In a post on X, Thariq Shihipar, a Member of Technical Staff at Anthropic working on Claude Code, said that Anthropic is “continuing to refine these safeguards to better distinguish genuine misuse from legitimate requests and reduce false positives.”

Backroom Diplomacy: The Shifting of the Guard

The breakthrough that brought Fable 5 back to commercial markets was as much political as it was technical. According to WIRED, Anthropic initially argued that the administration’s security concerns were overblown and that no frontier model provider could guarantee zero jailbreaks.

That argument frustrated the administration, according to WIRED’s reporting. In recent weeks, Anthropic changed tack, focusing less on the theoretical impossibility of eliminating jailbreaks and more on building stronger safeguards and satisfying the government’s operational concerns.

WIRED reported that Anthropic CEO Dario Amodei was recently replaced in meetings by Brown, whom officials liked more personally. Brown is also the addressee of Lutnick’s June 30 Commerce letter.

Under Brown’s guidance, Anthropic appears to have moved from arguing over the absolute limits of model safety to committing to the expanded safeguards and collaboration framework the administration demanded.

The resulting Commerce letter describes several commitments by Anthropic. Under the terms of the clearance, Anthropic has agreed to:

Proactively detect and address security risks associated with the models.
Work with the U.S. government on protocols, standards and releases for Mythos, Fable and future models.
Inform the U.S. government of malicious activity.

Separately, Anthropic says it will expand pre-release government access and evaluation for frontier models, share information rapidly when significant jailbreaks or misuse patterns are identified, dedicate resources to joint government research and work toward a common industry security bar.

The U.S. Commerce Department explicitly reserved the right to re-evaluate these permissions and re-impose license requirements if circumstances change or if Anthropic fails to meet its commitments.

The Sovereign Calculus: Lessons for Enterprise AI

The nearly three-week blackout of Claude Fable 5, from June 12 to July 1, exposed the fragility of centralized, closed-API models for modern business infrastructure. It showed that enterprise automation pipelines remain vulnerable to sudden regulatory shifts and vendor compliance mandates.

The tech community’s response highlights a broader push toward hardware and model sovereignty. Following the initial shutdown, prominent tech figures voiced concerns over this centralization. AI founder Alex Finn described the Anthropic freeze as a major “wakeup call,” urging developers to invest heavily in local, open-weights infrastructure to insulate operations from federal volatility. As Finn noted on social media:

“No company or government will EVER be able to take away your local models.”

For enterprise architects, the return of Fable 5 demands a balanced approach to deployment:

The Frontier Performance Advantage: Utilizing closed models like Fable 5 offers state-of-the-art capabilities across agentic coding, long-context work, document reasoning and multi-step enterprise automation, according to Anthropic’s launch materials and early customer examples.
The Mitigating Data Trade-Off: Accessing Fable 5 means accepting Anthropic’s mandatory 30-day data retention requirement for covered models. Anthropic says prompts and model completions are retained for at least 30 days by default and then automatically deleted, except when they are part of a safety investigation or must be kept for legal reasons. Highly regulated financial, healthcare and legal groups must evaluate whether this telemetry window complies with their data privacy mandates.

The truth is, enterprises in the U.S. and globally have more options than ever for frontier-class LLMs, especially given the launch over the last few months of powerful new open-weights Chinese alternatives that can be downloaded, run locally or on virtual private clouds, and customized to any enterprise's liking.

MiniMax M3 pairs frontier-tier coding and agentic performance with a 1 million-token context window and native multimodality. Z.ai’s GLM-5.2 posts benchmark results that exceed OpenAI's GPT-5.5 on SWE-bench Pro and several long-horizon coding tests, and that come close to Claude Opus 4.8 on FrontierSWE and MCP-Atlas. Meituan’s LongCat-2.0 is also positioned around enterprise use, with a 1 million-token context window, MIT licensing and strong early developer traction through its Owl Alpha run on OpenRouter, though as we reported, the full weights are still listed as “coming soon.”

Meanwhile, Anthropic's top domestic rival OpenAI is still struggling to release its latest models broadly due to U.S. government pressure. The company says its newest and most powerful models, GPT-5.6 Sol, Terra and Luna — unveiled last week — are starting in a limited preview for a small group of trusted partners after OpenAI previewed the models and their capabilities to the U.S. government and the government requested the rollout be staggered.

OpenAI says it still plans broader availability, but argued in its announcement that "we don’t believe this kind of government access process should become the long-term default. It keeps the best tools from users, developers, enterprises, cyber defenders, and global partners who need them. We are taking this short-term step because we believe it is the strongest path to broader availability in the coming weeks, while we work with the Administration to develop the cyber Executive Order framework and a repeatable process for future model releases."

The executive order in question, signed by President Donald J. Trump on June 2, 2026, calls upon various federal agencies to collaborate on a process for benchmarking and assessing capabilities of new AI models to ensure they are safe and appropriate for wide release. That process is supposed to take 30 days, which would seem to indicate the agencies are due to provide it tomorrow, July 2, 2026.

Frontier model launches are starting to look less like ordinary product releases and more like negotiated deployments shaped by U.S. national security review, a shift that could slow American distribution even as Chinese competitors move aggressively through open-weight and lower-cost channels.

To safeguard operations against future regulatory lockouts, enterprise technical leaders are moving toward model-agnostic fallback architectures.

By deploying proxy layers that can dynamically reroute critical production pipelines from proprietary APIs to locally hosted, open-weights alternatives, businesses can leverage top-tier capabilities without exposing themselves to single-point-of-failure vulnerabilities.
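
A minimal sketch of such a proxy layer follows, assuming each model backend can be wrapped as a plain callable that takes a prompt and returns text. The backend functions, their names, and the `ProviderUnavailable` error are hypothetical stand-ins; a real deployment would call vendor SDKs or a local inference server at those points.

```python
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Hypothetical error a backend raises when it cannot serve a request."""


Backend = Callable[[str], str]


def route(prompt: str, backends: Sequence[tuple[str, Backend]]) -> tuple[str, str]:
    """Try backends in priority order; return (backend_name, completion)."""
    failures = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except ProviderUnavailable as err:
            failures.append(f"{name}: {err}")
    raise RuntimeError("all backends failed: " + "; ".join(failures))


# Stand-in backends for illustration only.
def proprietary_api(prompt: str) -> str:
    raise ProviderUnavailable("access suspended")


def local_open_weights(prompt: str) -> str:
    return f"[local] {prompt}"


used, answer = route(
    "Summarize the incident report.",
    [("proprietary", proprietary_api), ("local", local_open_weights)],
)
print(used, answer)
```

Here the proprietary backend is simulated as suspended, so the request is served by the local stand-in. Keeping the priority list as data rather than code is one way to let operators reorder or remove providers without touching the pipelines that call `route`.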

Fable 5 is officially back online, but the landscape governing its release has been fundamentally transformed.