<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>VentureBeat</title>
        <link>https://venturebeat.com/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Sat, 30 May 2026 14:01:43 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[The AI agent bottleneck isn't model performance — it's permissions]]></title>
            <link>https://venturebeat.com/orchestration/the-ai-agent-bottleneck-isnt-model-performance-its-permissions</link>
            <guid isPermaLink="false">7IiAwRzYzPGFYTz7T7BqXk</guid>
            <pubDate>Fri, 29 May 2026 22:27:49 GMT</pubDate>
            <description><![CDATA[<p>Enterprise AI agents are stalling, not because of model performance but because of permissioning. Every agentic workflow eventually hits the same wall: what is this agent allowed to touch, on whose behalf, and how does the system know?</p><p>Workday&#x27;s answer is to make its existing system of record the governance layer for agents. Gerrit Kazmaier, the company&#x27;s president for product and technology, told VentureBeat in an interview that customers often struggle when they cobble together solutions for their agents.</p><p>“Sana makes sure the integrity of the approvals and security model is always adhered to,” Kazmaier said. “Frankly, that’s where we see customers struggling when they try to build do-it-yourself AI by just accessing raw data, so the richness of the security model gets lost, and the results become overly broad.”</p><p>Workday, which <a href="https://newsroom.workday.com/2026-03-17-Introducing-Sana-from-Workday-Superintelligence-for-Work-That-Finds-Answers,-Takes-Action,-and-Automates-Workflows">launched Sana in March</a>, expanded its partnership with Google to bring its Sana agent system of record to <a href="https://venturebeat.com/orchestration/google-and-aws-split-the-ai-agent-stack-between-control-and-execution">Gemini Enterprise</a>, so agents built on Sana are also discoverable there.</p><h2>Architecting accuracy</h2><p>Kazmaier said the biggest hurdle Workday faced was ensuring agent accuracy, especially for HR and finance users.</p><p>“Almost right is not acceptable,” Kazmaier said. “Think about paying people correctly, closing the books or managing work schedules reliably.”</p><p>Accuracy is harder to evaluate here than in most AI contexts. Policy configurations, role-based security, and organizational hierarchies are deeply interrelated, so a small error compounds. And unlike most generative AI outputs, HR and finance queries often lack a correction loop.
By the time a paycheck processes incorrectly or an interview is scheduled wrong, the damage is done.</p><p>Workday addressed this by building Gemini in as its base reasoning layer, then adding its context engine and business process logic on top. Workday also added verification and classification models that “interrogate” outputs before execution. </p><p>Accuracy and identity, it turns out, are the same question: does the system know enough about the agent, the authorizing human, and the current state of the record to act correctly?</p><p>Workday’s advantage is that it can infer its customers&#x27; organizational structures from the data they provide. Already, third-party identity providers like Okta verify their information by checking Workday, so its context is the system of record for many enterprises. Kazmaier said the Sana Self-Service Agent uses Gemini as the conversational surface to trigger the workflow. The user is then authenticated and authorized through Workday’s identity and security model. Sana agents will only act on behalf of that user and work within their current permissions. </p><p>Audit trails follow the same logic: Gemini retains only interaction logs, while the main audit remains within Workday and its customer. </p><p>For many practitioners in the HR and finance space, the permission and governance layer in the agent system of record is key in regulated spaces. </p><p>“It has to live in the system of record, that’s not a preference, that’s the only way it works,” said Dan Obendorfer, director of product at Würk, in an email to VentureBeat. “If your permissions are defined somewhere outside of where the data actually lives, you’ve already lost.”</p><p>Kadan Stadelmann, chief technology officer and co-founder of Compance.AI, made the same point separately. “Without agent ownership, performance, costs or actions, chaos ensues.”</p>]]></description>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/3GJ8TdHpeKmb0m17vGERe4/9a0fe35f71e28ad43037babeb1b0e333/crimedy7_illustration_of_a_robot_bouncer_or_security_guard_in_8d52b895-08ce-40f4-9580-8d6048a64abb_1.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[MeMo's memory model lets teams upgrade their LLM without retraining it — and performance jumps 26%]]></title>
            <link>https://venturebeat.com/orchestration/memo-memory-model-teams-upgrade-llm-without-retraining</link>
            <guid isPermaLink="false">4W7VOSFiQsA1MMYRxIU5gu</guid>
            <pubDate>Fri, 29 May 2026 19:28:17 GMT</pubDate>
            <description><![CDATA[<p>Enabling LLMs to acquire new knowledge after training remains a major hurdle for enterprise AI — current solutions are either too expensive, too slow, or constrained by context window limits.</p><p><a href="https://arxiv.org/abs/2605.15156">MeMo</a>, a framework from researchers at multiple universities, encodes new knowledge into a dedicated smaller memory model that operates separately from the main LLM.</p><p>The modular architecture works with both open- and closed-source models and sidesteps the complexity of RAG pipelines and full model retraining.</p><p>Experiments show that MeMo handles complex queries reliably even when retrieval pipelines are noisy. It avoids the catastrophic forgetting associated with direct fine-tuning and provides a cost-effective pathway for continuous knowledge updates.</p><h2>The challenge of updating LLM memory</h2><p>Large language models are frozen after training and their internal knowledge remains static until they undergo subsequent, computationally massive updates. </p><p>Currently, developers rely on three main approaches to integrate external knowledge into an LLM, each with distinct drawbacks:</p><p><b>Non-parametric methods</b>, such as retrieval-augmented generation (RAG) and <a href="https://venturebeat.com/orchestration/microsofts-new-ai-training-method-eliminates-bloated-system-prompts-without"><u>in-context learning</u></a>, retrieve relevant documents from an external database and insert them directly into the model&#x27;s prompt. While popular, these methods are limited by context window sizes. </p><p>As Armando Solar-Lezama, a co-author of the paper, told VentureBeat, “Vector databases have a fundamentally difficult job of encoding the full semantics of a chunk of text in a single vector, and then match that vector to a query, even when the relevance of the chunk... 
may only be apparent in the context of other chunks.” </p><p>The researchers note that the semantic similarity of embeddings often does not correspond to what a user&#x27;s query actually requires. Processing thousands of retrieved tokens also creates substantial computational overhead and inference latency. Most problematically, RAG systems are highly sensitive to noise. Irrelevant or poorly retrieved passages often degrade the model&#x27;s final response.</p><p><b>Parametric methods</b>, like continual pretraining or supervised fine-tuning, attempt to internalize new knowledge directly into the LLM&#x27;s weights. Updating modern, massive LLMs is prohibitively expensive and typically impossible for proprietary, closed-source models hidden behind APIs. Fine-tuning is also prone to causing <a href="https://en.wikipedia.org/wiki/Catastrophic_interference"><u>catastrophic forgetting</u></a>. Forcing the model to adapt to new corporate data often erodes its previously acquired reasoning capabilities and safety guardrails.</p><p><b>Latent memory methods</b>, such as context compression, offer a middle ground. They compress knowledge into compact &quot;soft tokens&quot; or representations that are added to the model’s context during inference. The fatal flaw here is &quot;representation coupling.&quot; The compressed memory is strictly bound to the model architecture that produced it; you can&#x27;t transfer a latent memory trained on an open-source model to a closed-source one.</p><h2>How MeMo works</h2><p>The MeMo (Memory as a Model) framework introduces a modular architecture featuring two separate components. The MEMORY model is a small language model trained specifically to encode new knowledge into its parameters. The EXECUTIVE model is a frozen, off-the-shelf LLM that functions as the reasoning engine. 
When a user asks a question, the EXECUTIVE model treats the MEMORY model as an external oracle, issuing targeted sub-queries to gather facts and synthesizing those facts into a final answer.</p><p>The core design principle driving MeMo is the concept of &quot;reflections.&quot; Reflections are targeted question-answer (QA) pairs designed to capture every possible angle of a knowledge corpus. Rather than forcing the AI to process a massive, unstructured document corpus during training, MeMo uses a GENERATOR model to distill the raw text into thousands of targeted QA pairs. The MEMORY model is then fine-tuned on this dataset to answer questions using only its parametric knowledge without the need to read retrieved context.</p><p>At inference time, the interaction between the two models follows a structured, three-stage protocol:</p><p>1. The EXECUTIVE model decomposes a user&#x27;s complex query into a set of atomic sub-questions. The MEMORY model answers each independently to establish the basic facts.</p><p>2. Using those initial clues, the EXECUTIVE model issues follow-up queries to narrow down candidate entities until it confidently converges on a specific target. </p><p>3. Finally, the EXECUTIVE model queries the MEMORY model for supporting facts about that target entity and synthesizes the retrieved snippets into a cohesive answer.</p><p>This architecture merges the strengths of the three existing AI memory paradigms while bypassing their pitfalls. It leverages off-the-shelf frontier models by keeping memory storage separate from reasoning, guaranteeing compatibility with both open-weight and closed API models. It internalizes knowledge directly into parameters, but isolates the updates to a smaller, dedicated MEMORY model to protect the reasoning engine. 
Finally, it creates a queryable memory artifact that is not tied to any specific model and can be used with different LLM families.</p><h2>Handling continual knowledge updates</h2><p>Managing an AI&#x27;s memory requires continuous updates as company policies change and new reports are published. Normally, updating a model&#x27;s parameters requires retraining it from scratch on both the old and the new data combined. As the knowledge base grows, this cumulative retraining cost becomes unmanageable.</p><p>To handle continual updates efficiently, MeMo relies on a technique called &quot;model merging.&quot; Instead of a massive joint retraining phase, MeMo trains a new, independent MEMORY model exclusively on the newly added documents. The system derives a &quot;task vector&quot; representing the parameter changes learned from the fresh data. These updates are then mathematically merged into the weights of the original MEMORY model.</p><p>This approach reduces the computing hours required to keep the system current while avoiding the interference that causes catastrophic forgetting. </p><p>This efficiency comes with a trade-off: model merging incurs an 11% to 19% accuracy drop compared to a full retrain, depending on the reasoning model used.</p><h2>MeMo in action</h2><p>To measure real-world effectiveness, the research team evaluated MeMo against several industry benchmarks that require complex, multi-hop reasoning across multiple documents.</p><p>The researchers used Qwen2.5-32B-Instruct as the GENERATOR model to distill raw text into reflections. For the primary MEMORY model, they deployed Qwen2.5-14B-Instruct. They also validated the approach on smaller 1-2B parameter models across different architectures, including Gemma3-1B. 
</p><p>For the EXECUTIVE reasoning model, they tested both the open-weight Qwen2.5-32B and Google&#x27;s proprietary Gemini 3 Flash.</p><p>They benchmarked MeMo against a &quot;Perfect Retrieval&quot; upper bound (where the exact correct documents are manually provided) and several advanced retrieval systems, including traditional BM25 search, dense vector retrieval, and state-of-the-art graph-based RAG (HippoRAG2). They also tested &quot;Cartridges,&quot; a recent method that loads a <a href="https://venturebeat.com/orchestration/new-kv-cache-compaction-technique-cuts-llm-memory-50x-without-accuracy-loss"><u>trained KV-cache</u></a> onto the model during inference.</p><p>MeMo dominated in long-document reasoning. On the NarrativeQA benchmark, MeMo achieved 53.58% accuracy paired with Gemini 3 Flash, according to the researchers. HippoRAG2 maxed out at 23.21%.</p><p>Enterprise systems frequently need to synthesize complex answers, such as traversing overlapping regulatory frameworks written independently by different bodies, or consolidating insights across a massive codebase and external documentation. Traditional RAG systems falter here because they hit context window limits and fail to connect concepts spanning hundreds of pages. MeMo succeeds because those connections are mapped and internalized inside the MEMORY model during training. It is &quot;like having your very own Malcolm Gladwell that can connect the story of the Beatles with the story of Bill Gates to make an argument about the nature of expertise,&quot; Solar-Lezama said.</p><p>The experiments revealed another major advantage: upgrading the reasoning engine requires zero retraining. Simply switching the EXECUTIVE model from the open-source Qwen to the proprietary Gemini 3 Flash boosted MeMo&#x27;s performance by 26.73% on NarrativeQA and 11.90% on the MuSiQue benchmark. 
For practitioners, this means you can train a MEMORY model securely on your private data and instantly plug it into the latest commercial APIs, continuously upgrading system intelligence without incurring new training costs. </p><p>The research team described the integration as requiring no additional setup: &quot;The base (or Executive) LLM that teams are already using in RAG can be configured to query the Memory model directly. These queries are done in natural language, similar to sending a message request to an API, with no additional setup required.&quot; </p><p>MeMo also handles noisy data exceptionally well. When researchers deliberately flooded the dataset with irrelevant documents (up to twice the amount of the useful information), HippoRAG2’s performance dropped by 11.55%. MeMo&#x27;s performance remained relatively stable, dropping less than 2%. Enterprise knowledge bases are typically messy, filled with duplicate documents and outdated policies. Standard RAG systems struggle with this noise, pulling incorrect paragraphs into the prompt and causing hallucinations. Because MeMo&#x27;s EXECUTIVE model interacts with a synthesized oracle rather than raw document chunks, it remains highly robust against disorganized corporate data.</p><h2>Limitations and trade-offs</h2><p>For engineering teams looking to deploy MeMo, there are several key limitations to consider.</p><p>Unlike traditional RAG systems that quickly index raw documents into a vector database, MeMo requires an upfront training cost for each new corpus. The data generation pipeline used to synthesize the training reflections is computationally expensive. 
For example, the team noted that &quot;generating the full reflection QA dataset took approximately 240 GPU-hours on NVIDIA H200s,&quot; while training a 14B parameter MEMORY model &quot;took approximately 180 H200 GPU-hours.&quot; As Solar-Lezama said, &quot;Reducing the training cost is one of the most significant open research problems in order to make this a workhorse technique.&quot;</p><p>Because the MEMORY model is a fixed-size neural network, its ability to internalize knowledge is bounded by its representational capacity. While the researchers did not hit a hard limit during their benchmarking, they hypothesize that “sufficiently large or information-dense corpora will exceed what a fixed-size MEMORY model can correctly compress and represent.”</p><p>Finally, because MeMo synthesizes answers from parametric memory rather than retrieving exact text snippets, it obscures the provenance of the information. This makes it difficult to attribute specific claims to original source documents, which poses a critical compliance issue for enterprise applications requiring strict audit trails.</p><p>Deciding between MeMo and traditional RAG comes down to a heuristic of &quot;lookup vs. synthesis,&quot; alongside data volatility. The researchers advise that &quot;traditional RAG would be preferred when answers live in a single document or when there is a well-defined source... MeMo would be preferred when the task shifts from lookup to synthesizing an answer from information scattered across multiple chunks.&quot; If your knowledge corpus changes rapidly (e.g., daily feeds) and you require exact source citations, RAG remains the better option due to the upfront training cost of MeMo. If your corpus consists of generalized domain knowledge that evolves slowly relative to its volume, MeMo offers vastly superior reasoning. 
Teams can also adopt a hybrid routing architecture in production: sending &quot;lookup&quot; queries to a standard vector database and &quot;synthesis&quot; queries to the MEMORY model.</p><p>&quot;Looking further out, I would expect memory models to become a standard architectural component alongside retrieval,&quot; Daniela Rus, co-author of the paper and director of the MIT Computer Science and Artificial Intelligence Lab (CSAIL), told VentureBeat, &quot;in the same way that caching and indexing are standard components of any serious data system today.&quot;</p>]]></description>
            <author>bendee983@gmail.com (Ben Dickson)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/uNG5np6loL4mLiU9LKH0s/7525aad6eda1c42caffcb84af89bce26/LLM_memory_module.jpg?w=300&amp;q=30" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Pinterest cut AI costs 90% by gutting a frontier model's vision layer]]></title>
            <link>https://venturebeat.com/orchestration/pinterest-cut-ai-costs-90-by-gutting-a-frontier-models-vision-layer</link>
            <guid isPermaLink="false">7EXCBFypi32Hey0SeQZu6a</guid>
            <pubDate>Fri, 29 May 2026 16:24:25 GMT</pubDate>
            <description><![CDATA[<p>At 620 million monthly users, calling a frontier model for every image recommendation isn&#x27;t a strategy — it&#x27;s a bill. Pinterest CTO Matt Madrigal solved it by gutting Qwen3-VL&#x27;s vision layer and rebuilding it with proprietary embeddings, cutting costs 90% and boosting accuracy 30%.</p><p>Madrigal’s team has been heavily investing in customizing open-source models “foundationally in-house.” </p><p>“If you&#x27;ve got really unique data that you can then fine-tune an open source model with, data quality will, frankly, outweigh or overcome model size,” Madrigal explained in a recent <a href="https://www.youtube.com/watch?v=BvFanq9fTg0">VB Beyond the Pilot podcast</a>. </p><div></div><h2>How Pinterest customized Qwen for visual discovery</h2><p>Pinterest, which has around 620 million monthly active users, has long applied open source models for visual search and discovery, going back to Google’s BERT and OpenAI’s CLIP. The company fine-tuned its own Pin CLIP on the latter, incorporating proprietary visual embeddings and image metadata. </p><p>Pinterest’s conversational shopping assistant, Navigator 1, was built on Qwen3-VL and customized in “pretty significant” ways. Madrigal’s team essentially “ripped out” Qwen’s vision encoder layer and fine-tuned the model on proprietary multimodal embeddings. This has allowed them to capture metadata around pins and images that can then be precomputed offline and regularly retrained on new information to deliver personalized experiences. </p><p>“Open-source models, especially with open Apache licenses where you can truly tweak a lot of open weights and customize for unique use cases — that&#x27;s where we&#x27;ve found open source to be so powerful for us,” Madrigal said. </p><p>Bringing their own embeddings allows his team to gain context around metadata, pins, and images; also, notably, the model performs better at runtime and inference. 
Without these embeddings, devs would have to call and encode each image returned at runtime, one at a time. That results in a latency “20 times worse” from an inference perspective, Madrigal said. </p><p>“If it&#x27;s something that&#x27;s going to be critical for our end users, that&#x27;s going to drive engagement, that will have to scale to over 600 million monthly active users, we&#x27;re going to either probably build it or we&#x27;re going to leverage open source and customize the heck out of it,” he said. </p><div></div><h2>How a taste graph captures evolving interests</h2><p>To guide users from inspiration to purchase, Madrigal&#x27;s team built a &quot;taste graph&quot;: a dynamic representation of what individual users actually like, not just what they click on.  “It&#x27;s this representation of billions of people&#x27;s evolving tastes,” he said. </p><p>People go to Google or other search engines when they have a clear picture of what they want; Pinterest is for when they’re still in the discovery phase, Madrigal said. Pinterest’s goal is to encourage “lateral exploration” and transform discovery to intent (that is, clicking through ads or making purchases). </p><p>Under the hood, the architecture combines a graph structure with representational learning. User embeddings capture a user’s evolving tastes. These are constantly updated based on activity and new content and signals. “It&#x27;s not a social graph,” Madrigal said. “It&#x27;s much more of a preference graph: What&#x27;s going to inspire you? What are you trying to do next?” </p><p>For instance, one user may be into mid-century modern designs; another may prefer a Nantucket aesthetic. Those preferences will be captured in user embeddings, and the taste graph will deliver up specific, relevant products as a result. </p><p>“You go from the upper funnel, inspiration discovery, all the way through lower funnel intent,” Madrigal said. 
</p><p><b>Listen to the full podcast to hear more about:</b></p><ul><li><p>How Pinterest uses sandboxes to encourage creativity in a way that is secure and contained; </p></li><li><p>Why a continuous feedback loop can prevent visual AI slop; </p></li><li><p>The importance of constant benchmarking to gauge user engagement, performance, latency, and other factors. </p></li></ul><p><b>You can also listen and subscribe to </b><a href="https://beyondthepilot.ubpages.com/"><b>Beyond the Pilot</b></a><b> on </b><a href="https://open.spotify.com/show/4Zti73yb4hmiTNa7pEYls4"><b>Spotify</b></a><b>, </b><a href="https://podcasts.apple.com/us/podcast/beyond-the-pilot-enterprise-ai-in-action/id1839285239"><b>Apple</b></a><b> or wherever you get your podcasts.</b></p>]]></description>
            <author>taryn.plumb@venturebeat.com (Taryn Plumb)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/4q0SyXNOt4BdRHyRMYY0SU/0396007fe3dc5633fb9cb93a6681c219/Open-source.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AI agents are entering their rebuild era as enterprises confront the reliability problem]]></title>
            <link>https://venturebeat.com/orchestration/ai-agents-are-entering-their-rebuild-era-as-enterprises-confront-the-reliability-problem</link>
            <guid isPermaLink="false">trhVvl9lCnJ3I2sGN1UPJ</guid>
            <pubDate>Fri, 29 May 2026 15:00:12 GMT</pubDate>
            <description><![CDATA[<p>As enterprise AI agents move into production, organizations are confronting a growing reliability problem. Many teams are discovering that LLM performance alone does not determine whether agents succeed in production. Long-running AI workflows must survive crashes, preserve state, recover from failures, manage inference costs, and coordinate across APIs, tools, and enterprise systems.</p><p>After a first wave focused on rapid deployment, organizations now need to revisit those first-generation implementations and redesign early agent architectures around workflow orchestration, observability, governance, and recovery, said Preeti Somal, senior VP of engineering at Temporal Technologies, during the latest AI Impact Series event in New York.</p><p>“We do have a lot of customers that come to us where they’re building version 2.0 of the same agent,” Somal said. “They had to move really fast, but they didn’t take care of the plumbing. Things crash and burn, and then they’re back to rebuilding with the reliable foundation.”</p><p>For workflow orchestration company Temporal, whose infrastructure predates the current wave of agentic AI, the shift reflects a broader enterprise realization: production AI systems require durable execution, state management, visibility into workflows, and mechanisms to recover when models or downstream systems fail.</p><h2>Agentic AI has supercharged familiar engineering problems</h2><p>“These patterns aren’t necessarily new,” Somal said. “AI just supercharges them.”</p><p>Agentic systems introduce additional complexity because they often involve long-running, multi-step processes spanning multiple services, models, APIs, and tools. A single workflow might call several large language models, access retrieval systems, trigger external applications, and manage state over hours or days.
The engineering questions, Somal said, often emerge only after deployment.</p><p>“People will write agents but haven’t thought about what happens if the agent crashes,” she said. “Am I going to need to run the entire agent flow again?”</p><p>For enterprises operating under cost constraints, the answer matters. Restarting workflows after failures can multiply inference expenses, increase latency, and create poor customer experiences.</p><p>Somal compared the current moment to an earlier period in enterprise cloud adoption, when organizations migrated workloads without first redesigning the underlying architectures those workloads needed to hold up over the long term.</p><p>“This rush to do AI in a world where you haven’t even modernized your application reminds me a little bit of that lift-and-shift that happened in the cloud,” she said. “Everybody realized you’re spending more money on cloud and we haven’t gotten value there.”</p><h2>Why long-running agents force a new architecture</h2><p>Enterprise workflows increasingly involve agents executing over long windows, sometimes spanning many hours while interacting with tools and systems. Reliability challenges compound when workflows persist over time, affecting both state and memory, two ideas that are often treated interchangeably in AI conversations.</p><p>State concerns workflow execution. It includes where an agent is in a process, which actions have already completed, and where recovery should resume after failure. Memory, or context, captures information an agent carries forward across interactions or tasks.</p><p>“The state of the agent is around what step and what actions have been performed, and if something crashes, where do you want to recover from, versus the context and memory piece,” Somal explained.</p><p>That distinction becomes increasingly important when enterprises begin moving beyond simple chatbot interactions toward longer-running business processes.
Somal pointed to a healthcare example involving customer Abridge, where workflows process physician visits through multiple stages, including audio processing, summarization, model calls, and after-visit summary generation.</p><p>“There’s not just one piece to that flow,” Somal said. “Taking videos and slicing that, taking summaries, calling the LLMs, generating the after-visit summary, all of that is being orchestrated.”</p><p>The implication for enterprises is that successful agents increasingly depend on systems that can survive interruptions, coordinate across services, and maintain continuity over time.</p><h2>The rise of the deterministic spine</h2><p>A useful framework for enterprise AI design is the deterministic spine, Somal said, and it is how the company thinks about Temporal&#x27;s role.</p><p>“It is denoting the path you want to take,” she said. “It is calling the brain, but if the brain doesn’t respond, it will call it again. If the brain responds but the next step is going to fail, it will pick up from where that failure happened.”</p><p>In this framing, the language model acts as a probabilistic system producing variable outputs, while orchestration software maintains execution reliability around it. The concept matters because enterprise systems increasingly require consistency even when models remain non-deterministic. A procurement workflow, healthcare summary, customer support escalation, or compliance process cannot simply fail silently because a model call timed out or an external dependency crashed.</p><p>“What you care most about is making sure that you can recover and that you’re not paying the token tax if something goes wrong,” Somal said.</p><h2>Reliability, visibility, and the economics of token spend</h2><p>As enterprise leaders evaluate AI ROI, cost visibility has become a growing concern. Long-running agents frequently make multiple model calls across complex workflows, which can create opaque spending patterns.
Somal described one operational advantage of orchestration as visibility into where costs accumulate. Because workflows are observable step-by-step, teams can see where tokens are being consumed across an agent process.</p><p>“You’ve got visibility into that entire flow in a single pane of glass,” she said. “You can now see where you’re spending the tokens in an agent that is multiple steps and calling multiple different systems.” </p><p>Workflow recovery also shapes cost efficiency. Without durable orchestration, a late-stage failure can force organizations to rerun an entire process from the beginning, including all prior model calls. Somal said systems designed around recovery can resume execution from the point of interruption.</p><p>“You pick up from where the crash happened,” she said. “We save you the cost of running the agent from step one again.” </p><h2>Enterprises need to build paved paths and enlist partner expertise</h2><p>Governance concerns are another emerging pattern as agentic AI takes hold. Rather than adopting fully managed agent systems wholesale, Somal said enterprises increasingly want standardized internal frameworks that provide guardrails while preserving flexibility, and implementing necessary features like governance controls, model selection policies, identity systems, cost management, and observability. </p><p>“The enterprises are looking at building these paved paths,” she said. “Taking something off the shelf is maybe not going to work because there are all of these other requirements.” </p><p>As organizations revisit first-generation deployments, challenges like this increasingly look less like a model problem and more like a systems engineering problem, and Temporal is positioned to help enterprises take this next step in part because for many organizations, it already existed as part of broader modernization programs before AI became a strategic priority.</p><p>“Temporal is already in the enterprise,” Somal said. 
“Taking that and extending that to AI and agent platforms feels very natural.” </p>]]></description>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/2f5klWpC1zRzNZlEhccuJI/4c7b1d857a5402e12caa5be679d38c8e/26AIT_Temporal_sg_01_12_25_09cleaned.jpg?w=300&amp;q=30" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Researchers automated LLM reasoning strategy design and cut token usage by 69.5%]]></title>
            <link>https://venturebeat.com/orchestration/researchers-automated-llm-reasoning-strategy-design-and-cut-token-usage-by-69-5</link>
            <guid isPermaLink="false">75rGRv17fHyEZL6wixCn8v</guid>
            <pubDate>Thu, 28 May 2026 21:32:45 GMT</pubDate>
            <description><![CDATA[<p>Test-time scaling (TTS) has emerged as a proven method to improve the performance of large language models in real-world applications by giving them extra compute cycles at inference time. However, TTS strategies have historically been handcrafted, relying heavily on human intuition to dictate the rules of the model’s reasoning. </p><p>To address this bottleneck, researchers from Meta, Google, and several universities have introduced <a href="https://arxiv.org/abs/2605.08083">AutoTTS</a>, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to dynamically optimize compute allocation without manually tuning heuristics. </p><p>By implementing the optimal strategies discovered by AutoTTS, organizations can directly reduce the token usage and operational costs of deploying advanced reasoning models in production environments. In experimental trials, AutoTTS managed inference budgets efficiently, successfully reducing token consumption by up to 69.5% without sacrificing accuracy.</p><h2>The manual bottleneck in test-time scaling</h2><p><a href="https://venturebeat.com/ai/how-test-time-scaling-unlocks-hidden-reasoning-abilities-in-small-language-models-and-allows-them-to-outperform-llms">Test-time scaling</a> enhances LLMs by granting them extra compute when generating answers. This extra compute allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at a final response. </p><p>The primary challenge for designing TTS strategies is determining how to allocate this extra computation optimally. Historically, researchers have designed these strategies manually, relying on guesswork to build rigid heuristics. Engineers must hypothesize the rules and thresholds for when a model should branch out into new reasoning paths, probe deeper into an existing path, prune an unpromising branch, or stop reasoning altogether. 
</p><p>Because this manual tuning process is constrained by human intuition, a vast number of possible approaches remain unexplored. This often results in suboptimal trade-offs between model accuracy and computing costs.</p><p>Current TTS algorithms can be mapped to a width-depth control space — &quot;width&quot; being the number of reasoning branches explored, &quot;depth&quot; being how far each develops. Self-consistency (SC) samples a fixed number of trajectories and majority-votes the answer. Adaptive-consistency (ASC) saves compute by stopping early once a confidence threshold is hit. Parallel-probe takes a more granular approach, pruning unpromising branches while deepening the rest. All three are hand-crafted, and that&#x27;s the constraint AutoTTS is designed to break.</p><p>While some more advanced methods employ richer structures like tree search or external verifiers, they all share one key characteristic: they are meticulously hand-crafted. This manual approach restricts the scope of strategy discovery, leaving a massive portion of the potential resource-allocation space untouched.</p><h2>Automating strategy discovery with AutoTTS</h2><p>AutoTTS reframes the way test-time scaling is optimized. Instead of treating strategy design as a human task, AutoTTS approaches it as an algorithmic search problem within a controlled environment. </p><p>This framework redefines the roles of both the human engineer and the AI model. Rather than hand-crafting specific rules for when an LLM should branch, prune, or stop reasoning, the engineer&#x27;s role shifts to constructing the discovery environment. The human defines the boundaries, including the control space of states and actions, optimization objectives balancing accuracy versus cost, and the specific feedback mechanisms. </p><p>An explorer LLM, such as Claude Code, designs the strategy. 
This explorer acts as an autonomous agent that iteratively proposes TTS “controllers.” These controllers are code-defined policies or algorithms that dictate how an AI model allocates its computational budget during inference. The explorer tests and refines these controllers based on feedback until it discovers an optimal resource-allocation policy. </p><p>To make this automated search computationally affordable, AutoTTS relies on an “offline replay environment.” If the explorer LLM had to invoke a base reasoning model to generate new tokens every time it tested a new strategy, the compute costs would be astronomical. Instead, it relies on thousands of reasoning trajectories pre-collected from the base LLM. These trajectories include &quot;probe signals,&quot; which are intermediate answers that help the controller evaluate progress across different reasoning branches. </p><p>During the discovery loop, the explorer agent proposes a controller and evaluates it against this offline data. The agent observes the execution traces of the proposed controller, which show how it allocated compute over time. By analyzing these traces, the agent can diagnose specific failure modes, such as noting if a controller pruned branches too aggressively in a specific scenario. This provides an advantage over just viewing a final result. The agent then iteratively rewrites its code to improve the accuracy-cost tradeoff. </p><h2>Inside the AI-designed controller</h2><p>Because the explorer agent is not constrained by human intuition, it can discover highly coordinated, complex rules that a human engineer would likely never hand-code. One optimal controller discovered by AutoTTS, named the Confidence Momentum Controller, leverages several non-obvious mechanisms to manage compute:</p><ul><li><p><b>Trend-based stopping</b>: Hand-crafted strategies often instruct the model to stop reasoning once it hits a certain instantaneous confidence threshold. 
The AutoTTS agent discovered that instantaneous confidence can be misleading due to temporary spikes. Instead, the controller tracks an exponential moving average (EMA) of confidence and only stops if the overall confidence level is high and the trend is not actively declining.</p></li><li><p><b>Coupled width-depth control</b>: Manually designed algorithms usually treat the &quot;widening&quot; of new reasoning paths and the &quot;deepening&quot; of current paths as separate decisions. AutoTTS discovered a closed feedback loop where the two actions are linked. If the confidence of the current branches stalls or regresses, the controller automatically triggers the spawning of new branches.</p></li><li><p><b>Alignment-aware depth allocation</b>: Instead of giving all active reasoning branches an equal computation budget, the controller dynamically identifies which branches agree with the current leading answer. It then gives those branches priority &quot;bursts&quot; of extra computation. This concentrates the computational budget on the emerging consensus to quickly verify if it is correct.</p></li></ul><h2>Cost savings and accuracy gains in real-world benchmarks</h2><p>To test whether an AI could autonomously discover a better test-time scaling strategy, researchers set up a rigorous evaluation framework. The core experiments were conducted on Qwen3 models ranging from 0.6B to 8B parameters. The researchers also tested the system&#x27;s ability to generalize on a distilled 8B version of the DeepSeek-R1 model. </p><p>The explorer AI agent was initially tasked with discovering an optimal strategy using the AIME24 mathematical reasoning benchmark. This discovered strategy was then tested on two held-out math benchmarks, AIME25 and HMMT25, as well as the graduate-level general reasoning benchmark GPQA-Diamond. </p><p>The AutoTTS-discovered controller was pitted against four manually designed test-time scaling algorithms used in the industry. 
These baselines included Self-Consistency with 64 parallel reasoning paths (SC@64), Adaptive-Consistency (ASC), Parallel-Probe, and Early-Stopping Self-Consistency (ESC). ESC is a hybrid approach that generates trajectories in parallel and stops early when an answer seems stable.</p><p>When set to a balanced, cost-conscious mode, the AutoTTS-discovered controller reduced total token consumption by approximately 69.5% compared to SC@64. At the same time, the controller maintained the same average accuracy across the four Qwen models. When the inference budget was turned up, AutoTTS pushed peak accuracy beyond all handcrafted baselines in five out of eight test cases.</p><p>This efficiency translated to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant slashed the inference token cost from 510K tokens down to just 151K tokens, while slightly improving overall accuracy. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark while cutting the token spend nearly in half.</p><p>For practitioners building enterprise AI applications, these experiments highlight two major operational benefits:</p><ul><li><p><b>Raising peak performance:</b> AutoTTS doesn&#x27;t just save money on token consumption. It actively raises the peak attainable performance of the base model. The AI-designed controller is remarkably good at detecting noisy or unproductive reasoning branches on the fly and continuously redirecting its compute budget toward the branches generating the most useful reasoning signals.</p></li><li><p><b>Cost-effective custom development</b>: Because the framework relies on an offline replay environment, the entire discovery process cost only $39.90 and took 160 minutes. 
For enterprise teams, that means optimized reasoning strategies tailored to proprietary models and internal tasks are now within reach — without a dedicated research budget.</p></li></ul><p>Both the <a href="https://github.com/zhengkid/AutoTTS">AutoTTS framework</a> and the Confidence Momentum Controller are available on GitHub; the CMC can be used as a drop-in replacement for other TTS controllers.</p>]]></description>
            <author>bendee983@gmail.com (Ben Dickson)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/4uS7JdQ3Q7BzOfRUeZd7us/e702355f143d4ca73c7269c49c846025/test-time_scaling_strategy.jpg?w=300&amp;q=30" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Mistral AI launches Vibe, expands into industrial AI and announces data center push to challenge OpenAI]]></title>
            <link>https://venturebeat.com/technology/mistral-ai-launches-vibe-expands-into-industrial-ai-and-announces-data-center-push-to-challenge-openai</link>
            <guid isPermaLink="false">2rVWfVu4ZcuyUDgs3OOWfS</guid>
            <pubDate>Thu, 28 May 2026 20:54:16 GMT</pubDate>
            <description><![CDATA[<p><a href="https://mistral.ai/">Mistral AI</a> used its inaugural conference on Wednesday to announce a sweeping expansion into industrial manufacturing, a new inference data center south of Paris, and a rebranding of its consumer-facing assistant — moves that collectively signal the three-year-old French startup&#x27;s ambition to become the enterprise AI provider of record for companies that refuse to hand their most sensitive data to American hyperscalers.</p><p>At the <a href="https://ainowsummit.com/">AI NOW Summit</a>, held at a venue in central Paris, co-founder and CEO Arthur Mensch took the stage alongside CTO Timothée Lacroix and Chief Scientist Guillaume Lample to lay out a strategy that stretches from bare-metal GPU clusters to physics simulations for aircraft wings. The company disclosed that it now employs 1,000 people and is targeting €1 billion ($1.17B USD) in revenue for 2026 — a figure that, if achieved, would be an extraordinary growth trajectory for a company that began with 15 employees collaborating with its first customer, BNP Paribas, in 2023.</p><p>&quot;We have two convictions at Mistral,&quot; Mensch told the audience. &quot;The first is that in order to deploy AI in the enterprise, you actually need, as an AI provider, to own the full stack.&quot; He described Mistral&#x27;s business as fundamentally about &quot;transforming electrons into tokens and intelligence,&quot; arguing that physical infrastructure control matters as much as model quality.</p><p>The announcements come at a pivotal moment for Mistral and for the broader European AI ecosystem. 
The company has <a href="https://www.clay.com/dossier/mistral-ai-funding">raised at least $3.9 billion</a> across nine funding rounds, according to Clay&#x27;s funding tracker, including a massive €1.7 billion Series C led by Dutch semiconductor equipment maker ASML in September 2025 at an €11.7 billion valuation, and an $830 million debt financing round in March 2026 from a consortium of seven banks to fund data center construction. Mistral now finds itself in a peculiar competitive position: too large to be dismissed as a research lab, but still dwarfed by the resources of <a href="https://openai.com/">OpenAI</a>, <a href="https://deepmind.google/">Google DeepMind</a>, and <a href="https://www.anthropic.com/">Anthropic</a>.</p><p>Its answer, articulated across nearly an hour of presentations Wednesday, is vertical depth — going industry by industry, workflow by workflow, and building the infrastructure to keep everything on premises.</p><h2><b>Why Mistral is betting that physics AI will reshape how Airbus and BMW design products</b></h2><p>The centerpiece announcement was <a href="https://mistral.ai/industry/manufacturing/">Mistral for Industrial Engineering</a>, a fully integrated AI stack that combines Mistral&#x27;s large language models with physics simulation capabilities acquired through its purchase of <a href="https://www.emmi.ai/">Emmi AI</a>, completed earlier in May 2026. The platform targets the aerospace, automotive, and semiconductor industries with tools for accelerating product design, validating simulations, and optimizing production.</p><p>The launch came with headline partnerships. Mistral announced it is working with <a href="https://www.airbus.com/en/newsroom/press-releases/2026-05-airbus-partners-with-mistral-ai-to-strengthen-the-use-of-artificial-intelligence-in-sovereign">Airbus</a> across its commercial aircraft, helicopter, defense, and space divisions, implementing AI from initial design through to on-board capabilities. 
For <a href="https://www.automotiveworld.com/news/bmw-bets-on-mistral-for-an-edge-in-physical-ai/">BMW Group</a>, Mistral is serving as a central partner for what the automaker calls its &quot;Large Industry Model&quot; initiative, focused on multimodal reasoning models for crash simulation and other complex engineering tasks. ASML, already Mistral&#x27;s largest shareholder, is also an early adopter.</p><p>Mensch framed the industrial push as addressing a fundamental gap in how AI is currently deployed. &quot;AI is great today at automating tasks for knowledge workers and for people that are doing software engineering,&quot; he told the summit audience. &quot;But once you move to all the kind of engineers, well, they are underserved.&quot;</p><p>The reason, he explained, is structural. Simulating the behavior of a wing or a factory process requires compute-intensive physics solvers that can take hours or weeks per design variant. Traditional simulation creates a bottleneck that makes AI-assisted iteration impractical. </p><p>Mistral&#x27;s answer is what it calls &quot;<a href="https://mistral.ai/news/physics-ai-research/">physics AI</a>&quot; — data-driven models trained on solver outputs that can predict physical behavior in seconds rather than hours, running on a single GPU. 
As Mistral&#x27;s own blog post on the technology acknowledges, physics AI is &quot;not a replacement for first-principles solvers in every regime&quot; — it is a throughput accelerator for the majority of design-loop iterations, with traditional solvers reserved for verification and edge cases.</p><p>&quot;We now have both the language intelligence and the physical intelligence models, and by combining them together we are building delegation loops that allow us to create better tools, that allow us to create better objects that actually have an impact on the physical world,&quot; Mensch said.</p><p>The <a href="https://www.asml.com/en/news/press-releases/2025/asml-mistral-ai-enter-strategic-partnership">ASML partnership</a> offered a concrete illustration. In a video testimonial shown at the summit, an ASML representative described how the company&#x27;s lithography machines run around the clock at customer fabrication plants, and field service engineers need to diagnose issues as rapidly as possible. By combining ASML&#x27;s internal engineering expertise with Mistral&#x27;s models, &quot;we were able to develop a solution that&#x27;s 120 times faster with a similar accuracy as we have today,&quot; the representative said. Another ASML speaker described AI agents acting as &quot;an always-on code reviewer&quot; to catch software defects before they reach customers.</p><h2><b>Inside Mistral&#x27;s €4 billion infrastructure gamble to build Europe&#x27;s most powerful AI data centers</b></h2><p>Mistral&#x27;s full-stack ambitions extend all the way down to the physical layer. 
Launched in June 2025, <a href="https://mistral.ai/news/mistral-compute/">Mistral Compute</a> is a €4 billion ($4.66B USD) investment in data centers in France and Sweden, with a stated roadmap of 200 MW of capacity by 2027 and 1 GW by 2030.</p><p>Lacroix described the company&#x27;s existing 40 MW facility at Bruyères-le-Châtel, south of Paris, which was built in collaboration with Eclarion and has been training models since early 2026. &quot;It&#x27;s been very interesting to see how we can transfer rigor, which is one of our company values, into down to the hardware layer,&quot; he said, describing the process of &quot;fixing compute trays and fixing fibers, allowing us to reach the very best speeds possible on that hardware for training.&quot;</p><p>On Wednesday, Mistral announced a <a href="https://wiky.com/2026/05/28/mistral-defends-ai-use-in-warfare-rebuts-pope-criticism/">new 10 MW facility</a> at Les Ulis in the Essonne department, also south of Paris, dedicated to inference operations and scheduled to open in Q3 2026. Lacroix also referenced a site in Borlänge, Sweden, planned for development through 2027, which will host NVIDIA&#x27;s next-generation Vera Rubin GPUs. &quot;One of the benefits for us of owning the hardware layer is also that it lets us be at the very bleeding edge of what infrastructure provides,&quot; he told the audience.</p><p>The infrastructure push is funded in part by the <a href="https://www.clay.com/dossier/mistral-ai-funding">$830 million debt financing round</a> announced in March 2026, which Clay&#x27;s funding tracker attributes to a consortium of seven banks: Bpifrance, BNP Paribas, Crédit Agricole CIB, HSBC, La Banque Postale, MUFG, and Natixis CIB. And this infrastructure ownership is not merely a hedge against GPU scarcity — it is central to Mistral&#x27;s pitch to security-conscious enterprise and government customers. 
The company&#x27;s February 2026 acquisition of serverless platform Koyeb has been integrated into Mistral Studio to support both hosted and on-premises deployments, giving customers a choice between running inference on Mistral&#x27;s hardware or their own.</p><p>&quot;More and more, the compute world has been getting supply constrained,&quot; Lacroix told the audience. &quot;One of the reasons we&#x27;ve been doing all of this and developing all of this data center capacity is to secure compute capacity not only for ourselves but also for our customers.&quot;</p><h2><b>Le Chat is dead, long live Vibe: How Mistral&#x27;s new agent platform takes aim at enterprise productivity</b></h2><p>In a consumer-facing rebrand with significant enterprise implications, Mistral announced that <a href="https://chat.mistral.ai/chat">Le Chat</a> — its conversational AI assistant launched in February 2024 — is being renamed Vibe and reimagined as a unified agent platform for enterprise productivity and software development.</p><p>&quot;We are transitioning Le Chat to the Vibe family,&quot; Lacroix told the audience, explaining that the evolution was driven by the growing power of agentic models, particularly the new Mistral Medium 3.5. As the team used Vibe&#x27;s coding CLI internally with increasingly complex tasks, &quot;we realized that this really didn&#x27;t need to be bound to the CLI, it didn&#x27;t need to be limited to code, and we could do a lot more with it,&quot; he said.</p><p><a href="https://mistral.ai/products/vibe/">Vibe</a> encompasses two primary modes. Vibe for Work is a web and mobile agent that connects to enterprise tools — Google Workspace, Outlook, SharePoint, Slack, GitHub — to perform multi-step tasks such as summarizing emails, analyzing spreadsheets, drafting reports, and scheduling recurring workflows. 
Vibe for Code is a coding agent available through a web interface, a new VS Code extension, and the existing CLI, capable of building features, fixing bugs, refactoring code, and shipping pull requests. Critically, the same underlying agent powers both modes. &quot;When you access it through our web app or through the CLI, you have access to the same connections, the same tools, the same understanding of who you are, what you do, and what you&#x27;re trying to achieve,&quot; Lacroix said.</p><p>Pricing starts at free for basic use, <a href="https://mistral.ai/pricing/">$14.99 per month</a> for Pro, <a href="https://mistral.ai/pricing/">$24.99 per user per month</a> for Teams, and custom pricing for Enterprise deployments. Alongside Vibe, Mistral also launched Search Toolkit, an open-source framework for building production search pipelines already in use by shipping giant CMA CGM, which uses it alongside Voxtral to process audio from multiple data sources and return alerts within 15 seconds.</p><h2><b>Mistral&#x27;s model strategy signals a new phase: fewer products, more capabilities per model</b></h2><p>Chief Scientist Guillaume Lample used his portion of the keynote to describe a philosophical shift in Mistral&#x27;s model strategy: consolidation of capabilities into fewer, more versatile models rather than maintaining separate specialized products.</p><p><a href="https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5/">Mistral Medium 3.5</a>, the company&#x27;s current flagship, absorbs capabilities that previously required distinct models. Pixtral (image processing), Magistral (reasoning), and Devstral (coding) have all been deprecated as standalone products, with their capabilities folded natively into Medium 3.5. &quot;Now all our models are natively multimodal,&quot; Lample said. &quot;We no longer have Magistral. 
This model is deprecated, because all our models will natively be doing reasoning.&quot;</p><p>The company is also working on <a href="https://mistral.ai/models/">Mistral Large 4</a>, which Lample said would arrive &quot;in a couple of months at most, during the summer,&quot; with expanded capabilities in industrial applications such as fluid dynamics, computational chemistry, computer-aided design, and cybersecurity. On the smaller end of the spectrum, Lample highlighted Mr. Lossier, a 1-billion-parameter OCR model that can process thousands of pages per minute on a single GPU, and the Voxtral speech model family, which has expanded from automatic speech recognition to include text-to-speech with voice cloning. A &quot;duplex&quot; model for real-time conversational speech is planned for release within months.</p><p>Lample also made the case for open-weight models becoming more — not less — important in the agentic era. &quot;Today we are building these agentic workflows, these models are running in the background, they are doing a lot of actions, a lot of tool calls, so they are extremely token-hungry, much more than before,&quot; he said. &quot;What we are seeing today is actually a comeback of this small model and the efficient model.&quot; Upcoming models will be trained on more than 200 languages, a multilingual strength now powering a partnership with Amazon to improve non-English interactions on Alexa+.</p><h2><b>How Mistral&#x27;s enterprise playbook stacks up against OpenAI and Anthropic</b></h2><p>Mistral&#x27;s positioning stands in sharp contrast to the strategies of its most prominent American rivals. While <a href="https://openai.com/">OpenAI</a> and <a href="https://www.anthropic.com/">Anthropic</a> have each attracted hundreds of millions of consumer users and derive significant revenue from subscription products, Mistral has leaned almost entirely into enterprise and government deployments. 
As TechCrunch reported in March when Mistral announced its Forge customization platform at Nvidia GTC, CEO Mensch has described the company as being &quot;<a href="https://techcrunch.com/2026/03/17/mistral-forge-nvidia-gtc-build-your-own-ai-enterprise/">on track to surpass $1 billion in annual recurring revenue</a>&quot; — a figure driven largely by corporate clients.</p><p>The <a href="https://venturebeat.com/infrastructure/mistral-ai-launches-forge-to-help-companies-build-proprietary-ai-models">Forge platform</a>, which lets enterprises train custom models on their own data rather than simply fine-tuning or applying retrieval-augmented generation to existing models, represents the foundation on which the company&#x27;s industry-specific solutions are built. As Mistral&#x27;s head of product, Elisa Salamanca, told TechCrunch, Forge &quot;lets enterprises and governments customize AI models for their specific needs.&quot; Early partners include Ericsson, the European Space Agency, Italian consulting company Reply, and Singapore&#x27;s DSO and HTX, alongside ASML.</p><p>Mistral has also built an expanding network of systems integration partnerships to drive enterprise adoption. In February 2026, Accenture and Mistral announced a multi-year <a href="https://newsroom.accenture.com/news/2026/accenture-and-mistral-ai-accelerate-enterprise-reinvention-with-scalable-ai-that-delivers-strategic-autonomy-for-customers">strategic collaboration</a>, with Accenture itself becoming a Mistral customer. Mauro Macchi, Accenture&#x27;s CEO for Europe, Middle East, and Africa, said at the time that the partnership brings together &quot;sovereign models and the capability to scale technology across industries, geographies and business functions.&quot;</p><p>The <a href="https://www.reuters.com/business/finance/bnp-paribas-steps-up-mistral-partnership-bolster-rapid-ai-defences-2026-05-26/">BNP Paribas relationship</a> offers the most detailed public case study. 
In a video testimonial at the summit, a BNP Paribas representative described deploying Mistral&#x27;s models on-premises to satisfy strict security requirements, developing AI agents for KYC processes that reduced incomplete files from 80% to 10% and compressed processing time from weeks to days. The bank&#x27;s LLM platform at its Corporate and Institutional Banking division has now rolled out to 65,000 users. Mensch noted the significance: &quot;We started to collaborate in 2023 where we were 15 people, so that was, I think, really a leap of faith at the time.&quot;</p><p>The industrial vertical is also being extended to government clients. Mistral disclosed that it is working with France, Luxembourg, Singapore, Morocco, Greece, and Slovakia to build citizen-facing AI services — from deploying agents that help job-seekers through France Travail to building models that understand Moroccan Darija and Amazigh languages. &quot;We think that AI needs to be specialized and understand structural nuances,&quot; Mensch told the audience. &quot;It needs to speak languages as good as it speaks English.&quot;</p><h2><b>The road ahead for Europe&#x27;s most ambitious AI company</b></h2><p>For <a href="https://mistral.ai/">Mistral</a>, Wednesday&#x27;s announcements amount to a declaration that the company intends to compete not by matching American AI giants on any single dimension, but by assembling capabilities none of them are willing or able to offer in combination: open-weight models, owned infrastructure, on-premises deployment, physics simulation, and deep vertical customization — all under a single roof.</p><p>The strategy demands execution on multiple fronts simultaneously, each requiring enormous capital and specialized talent. The competition is formidable and accelerating. <a href="https://openai.com/">OpenAI</a> has been rapidly expanding its enterprise offerings. 
<a href="https://www.anthropic.com/">Anthropic</a>, backed by billions from Amazon, is building its own corporate AI practice. <a href="https://www.google.com/">Google</a>, <a href="https://www.microsoft.com/en-us">Microsoft</a>, and <a href="https://www.amazon.com/">Amazon</a> all offer AI platforms deeply integrated with cloud infrastructure that most enterprises already use.</p><p>But <a href="https://mistral.ai/">Mistral</a> is wagering that the world&#x27;s most consequential AI deployments — the ones governing how aircraft get designed, how banks process compliance, how governments interact with citizens — will ultimately go to providers that offer sovereignty over data, models, and compute. &quot;AI is too strategic to be left in the hands of a few,&quot; Mensch said, echoing the conviction he described from Mistral&#x27;s founding three years ago.</p><p>Three years in, the company that started as a Paris research lab with a handful of employees now trains models in its own data centers, simulates physics for the manufacturers that build the world&#x27;s planes and cars, and is rewriting its assistant into an agent that can file your pull requests and summarize your inbox in the same conversation. Whether that sprawling ambition coheres into a durable business or stretches Mistral too thin is the €11.7 billion ($13.6B USD)  question. The 1,000 people now working there are betting that in enterprise AI, owning the full stack is not a liability — it is the product.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Technology</category>
            <category>Business</category>
            <category>Infrastructure</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/64vl7LlTNGuZeHkUFTvu1Z/f7a5d77afb596fa3342837a14d05f644/Nuneybits_Vector_art_of_Paris_server_skyline_in_burnt_orange_ccb4ba92-3ffa-49fc-8eeb-223c3b174750.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[Anthropic's Claude Opus 4.8 is here with 3X cheaper fast mode and near-Mythos level alignment]]></title>
            <link>https://venturebeat.com/technology/anthropics-claude-opus-4-8-is-here-with-3x-cheaper-fast-mode-and-near-mythos-level-alignment</link>
            <guid isPermaLink="false">5vs25ZqndjZeZkNTrrnTAm</guid>
            <pubDate>Thu, 28 May 2026 18:23:00 GMT</pubDate>
            <description><![CDATA[<p>Anthropic today <a href="https://www.anthropic.com/news/claude-opus-4-8">released Claude Opus 4.8</a>, an upgrade to its flagship model that ships at the same price as its predecessor, alongside a dramatically cheaper &quot;fast mode&quot; tier and a new feature that lets the model spawn hundreds of parallel subagents for codebase-scale work.</p><p>The model is available immediately across Anthropic&#x27;s surfaces — claude.ai, Claude Code, the API, and Cowork — at unchanged pricing: $5 per million input tokens and $25 per million output tokens. Developers can call it as <code>claude-opus-4-8</code>.</p><p>The headline efficiency story is fast mode. Anthropic has slashed the price of running Opus 4.8 in fast mode — where the model produces tokens at roughly 2.5x normal speed — to $10 per million input tokens and $50 per million output tokens, down from $30/$150 for Opus 4.7.</p><p>That&#x27;s a 3X reduction from the fast-mode pricing of previous models, and brings high-throughput inference within reach of latency-sensitive production workloads. 
</p><p>Fast mode is available immediately in Claude Code via the <code>/fast</code> command; API access is gated, with a waitlist at <a href="https://claude.com/fast-mode">claude.com/fast-mode</a>.</p><p>In regular mode, Claude Opus 4.8 remains among the more expensive leading frontier models, but still comes in under chief rival OpenAI&#x27;s GPT-5.5.</p><h2><b>Frontier AI Model API Pricing Snapshot</b></h2><table><tbody><tr><td><p><b>Model</b></p></td><td><p><b>Input (per 1M tokens)</b></p></td><td><p><b>Output (per 1M tokens)</b></p></td><td><p><b>Input + Output</b></p></td><td><p><b>Source</b></p></td></tr><tr><td><p>MiMo-V2.5 Flash</p></td><td><p>$0.10</p></td><td><p>$0.30</p></td><td><p>$0.40</p></td><td><p><a href="https://platform.xiaomimimo.com/docs/en-US/pricing">Xiaomi MiMo</a></p></td></tr><tr><td><p>deepseek-v4-flash</p></td><td><p>$0.14</p></td><td><p>$0.28</p></td><td><p>$0.42</p></td><td><p><a href="https://api-docs.deepseek.com/quick_start/pricing">DeepSeek</a></p></td></tr><tr><td><p>deepseek-v4-pro</p></td><td><p>$0.435</p></td><td><p>$0.87</p></td><td><p>$1.305</p></td><td><p><a href="https://api-docs.deepseek.com/quick_start/pricing">DeepSeek</a></p></td></tr><tr><td><p>MiniMax M2.7</p></td><td><p>$0.30</p></td><td><p>$1.20</p></td><td><p>$1.50</p></td><td><p><a href="https://platform.minimax.io/docs/guides/models-intro">MiniMax</a></p></td></tr><tr><td><p>Gemini 3.1 Flash-Lite</p></td><td><p>$0.25</p></td><td><p>$1.50</p></td><td><p>$1.75</p></td><td><p><a href="https://ai.google.dev/gemini-api/docs/pricing">Google</a></p></td></tr><tr><td><p>MiMo-V2.5</p></td><td><p>$0.40</p></td><td><p>$2.00</p></td><td><p>$2.40</p></td><td><p><a href="https://platform.xiaomimimo.com/docs/en-US/pricing">Xiaomi MiMo</a></p></td></tr><tr><td><p>Kimi-K2.6</p></td><td><p>$0.95</p></td><td><p>$4.00</p></td><td><p>$4.95</p></td><td><p><a 
href="https://platform.kimi.ai/docs/pricing/chat-k26">Moonshot/Kimi</a></p></td></tr><tr><td><p>GLM-5</p></td><td><p>$1.00</p></td><td><p>$3.20</p></td><td><p>$4.20</p></td><td><p><a href="https://docs.z.ai/guides/overview/pricing">Z.ai</a></p></td></tr><tr><td><p>Grok 4.3 low context</p></td><td><p>$1.25</p></td><td><p>$2.50</p></td><td><p>$3.75</p></td><td><p><a href="https://docs.x.ai/developers/models/grok-4.3">xAI</a></p></td></tr><tr><td><p>GLM-5.1</p></td><td><p>$1.40</p></td><td><p>$4.40</p></td><td><p>$5.80</p></td><td><p><a href="https://docs.z.ai/guides/overview/pricing">Z.ai</a></p></td></tr><tr><td><p>Claude Haiku 4.5</p></td><td><p>$1.00</p></td><td><p>$5.00</p></td><td><p>$6.00</p></td><td><p><a href="https://www.anthropic.com/pricing">Anthropic</a></p></td></tr><tr><td><p>Grok 4.3 high context</p></td><td><p>$2.50</p></td><td><p>$5.00</p></td><td><p>$7.50</p></td><td><p><a href="https://docs.x.ai/developers/models/grok-4.3">xAI</a></p></td></tr><tr><td><p>Qwen3.7-Max</p></td><td><p>$2.50</p></td><td><p>$7.50</p></td><td><p>$10.00</p></td><td><p><a href="https://modelstudio.console.alibabacloud.com/ap-southeast-1?spm=a2ty_o05.31384571.0.0.52649f6b7G0D55&amp;tab=doc#/doc/?type=model&amp;url=2840914_2&amp;modelId=qwen3.7-max&amp;serviceSite=international">Alibaba Cloud</a></p></td></tr><tr><td><p>Gemini 3.5 Flash</p></td><td><p>$1.50</p></td><td><p>$9.00</p></td><td><p>$10.50</p></td><td><p><a href="https://ai.google.dev/gemini-api/docs/pricing">Google</a></p></td></tr><tr><td><p>Gemini 3.1 Pro Preview ≤200K</p></td><td><p>$2.00</p></td><td><p>$12.00</p></td><td><p>$14.00</p></td><td><p><a href="https://ai.google.dev/gemini-api/docs/pricing">Google</a></p></td></tr><tr><td><p>GPT-5.4</p></td><td><p>$2.50</p></td><td><p>$15.00</p></td><td><p>$17.50</p></td><td><p><a href="https://openai.com/api/pricing/">OpenAI</a></p></td></tr><tr><td><p>Gemini 3.1 Pro Preview &gt;200K</p></td><td><p>$4.00</p></td><td><p>$18.00</p></td><td><p>$22.00</p></td><td><p><a 
href="https://ai.google.dev/gemini-api/docs/pricing">Google</a></p></td></tr><tr><td><p><b>Claude Opus 4.8</b></p></td><td><p><b>$5.00</b></p></td><td><p><b>$25.00</b></p></td><td><p><b>$30.00</b></p></td><td><p><a href="https://platform.claude.com/docs/en/about-claude/pricing"><b>Anthropic</b></a></p></td></tr><tr><td><p>GPT-5.5</p></td><td><p>$5.00</p></td><td><p>$30.00</p></td><td><p>$35.00</p></td><td><p><a href="https://openai.com/api/pricing/">OpenAI</a></p></td></tr></tbody></table><h2><b>Modest gains over 4.7, but Mythos-class capabilities coming</b></h2><p>On benchmarks, Opus 4.8 is a step up rather than a leap. It scores 88.6% on SWE-bench Verified (vs. 87.6% for Opus 4.7), 69.2% on the harder SWE-bench Pro (vs. 64.3%), and 74.6% on Terminal-Bench 2.1 (vs. 66.1%). Anthropic itself characterizes the model as &quot;a modest but tangible improvement on its predecessor.&quot;</p><p>It beats GPT-5.5 regular across at least 12 benchmarks, including most knowledge-work, coding (issue-level), agentic tool-use, and long-context benchmarks. GPT-5.5 wins on terminal/CLI workflows and is roughly tied on web browsing and graduate-level science.</p><p>The bigger signal sits in Anthropic&#x27;s internal capability ladder: Opus 4.8 lands between Opus 4.7 and the more capable Claude Mythos Preview, which is currently restricted to a small number of organizations under Project Glasswing for cybersecurity work. </p><p>Anthropic says it expects to bring &quot;Mythos-class models to all our customers in the coming weeks&quot; once additional cyber safeguards are in place.</p><p>Several enterprise partners cited material gains. Databricks reported that Opus 4.8 unlocks &quot;a step change in agentic reasoning&quot; inside its Genie data agent, at &quot;61% cheaper token cost than Opus 4.7&quot; thanks to multimodal efficiency on PDFs and diagrams. </p><p>Hebbia cited better citation precision and token efficiency on dense financial filings. 
Devin-maker Cognition said the release &quot;translates directly into faster capability gains for engineers&quot; and noted Opus 4.8 fixed comment-verbosity and tool-calling issues from 4.7. A computer-use vendor reported 84% on Online-Mind2Web, a jump over both Opus 4.7 and GPT-5.5.</p><h2><b>Dynamic workflows: hundreds of parallel subagents</b></h2><p>Alongside the model, Anthropic launched a research preview of dynamic workflows in Claude Code — a feature designed for tasks too large for a single context window. Claude plans the work, spawns hundreds of parallel subagents, then verifies its own outputs before reporting back. Anthropic&#x27;s example: a codebase-scale migration &quot;across hundreds of thousands of lines of code from kickoff to merge, with the existing test suite as its bar.&quot;</p><p>Dynamic workflows is available on Claude Code&#x27;s Enterprise, Team, and Max plans.</p><p>Two smaller additions round out the release:</p><ol><li><p><b>Effort control on claude.ai and Claude Cowork:</b> A new selector lets users dial how much thinking Claude does per response — higher effort spends more tokens for better answers, lower effort responds faster and burns rate limits more slowly. Available on all plans.</p></li><li><p><b>System entries inside the messages array on the API:</b> Developers can now update Claude&#x27;s instructions mid-task — adjusting permissions, token budgets, or environment context as an agent runs — without breaking the prompt cache.</p></li></ol><h2><b>Honesty, and an &quot;evaluation awareness&quot; caveat</b></h2><p>Anthropic is leading with honesty as a headline trait. 
The company&#x27;s alignment team reports that Opus 4.8 is &quot;around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked,&quot; and that misaligned behavior rates are now &quot;substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview.&quot;</p><p>A bar chart released by Anthropic shows how close Opus 4.8 comes to the still selectively released Mythos on misalignment (lower is better): it scores roughly 1.9, down from 2.5 for Opus 4.7 and effectively tied with the more capable, restricted Mythos Preview. The score is based on roughly 2,600 simulated investigation sessions per model. </p><p>The <a href="https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf">244-page system card</a> publicly released by Anthropic also details specific categories of misalignment, such as whether a model produces potentially harmful content around &quot;military-grade weapons,&quot; &quot;harmful sexual content,&quot; &quot;disallowed cyberoffense,&quot; and &quot;undermining liberal democracy.&quot; Across all of them, Opus 4.8 scores markedly better than 4.7 or Sonnet 4.6 and comes quite close to Mythos.</p><p>Anthropic flags one finding it considers &quot;the most concerning&quot; from training: Opus 4.8 shows a growing tendency to reason explicitly about how its outputs will be graded, including in environments where it wasn&#x27;t told it was being evaluated. 
In other words: the model knows it is likely being graded, and produces a response it thinks will earn it a good grade on the test, not one it would necessarily produce if it thought it wasn&#x27;t being graded.</p><p>Anthropic says this didn&#x27;t translate into worse observable behavior — Opus 4.8 shows fewer misleading task-success claims than prior models — but calls it &quot;a concerning trend that could complicate training in the future.&quot; Preliminary interpretability work also found unverbalized grader-related reasoning in roughly 5% of training episodes.</p><p>Anthropic ran the model through a one-week live bug bounty for prompt injection — a first — and concluded Opus 4.8 sits between Opus 4.7 and Sonnet 4.6 on robustness, ahead of &quot;all comparable frontier models&quot; tested, with deployed safeguards bringing browser-use attack success rates to near zero.</p><h2><b>What&#x27;s next?</b></h2><p>Anthropic teased two trajectories. Near-term: cheaper models that provide &quot;many of the same capabilities as Opus.&quot; Longer-term: the Mythos-class models, which the company says represent higher intelligence than Opus but require stronger cyber safeguards before general release.</p><p>For now, Opus 4.8 is positioned as the new go-to enterprise and development workhorse — slightly smarter than 4.7, dramatically cheaper to run fast, and noticeably more honest about what it doesn&#x27;t know.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/27tk5kZOLqHhemHPKHhKAv/8c69c10482122f05a0fa9d5a6fa475ef/ChatGPT_Image_May_28__2026__01_55_48_PM.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
    </channel>
</rss>