<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>VentureBeat</title>
        <link>https://venturebeat.com/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Mon, 08 Jun 2026 15:37:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[When Claude changed, everything changed: Managing AI blast radius in production]]></title>
            <link>https://venturebeat.com/orchestration/when-claude-changed-everything-changed-managing-ai-blast-radius-in-production</link>
            <guid isPermaLink="false">1aKElYaW7iIn62jN0gNWDy</guid>
            <pubDate>Mon, 08 Jun 2026 01:02:33 GMT</pubDate>
            <description><![CDATA[<p>Our system did one thing, and it did it well: It turned natural-language questions into API calls.</p><p>The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like &quot;Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city&quot; was translated into an API call that the system could act on:</p><p><i>json</i></p><p><i>{</i></p><p><i>  &quot;description&quot;: &quot;User requested sales volume for the given date range, here is the API call to get the response&quot;,</i></p><p><i>  &quot;api_call&quot;: &quot;/api/sales_volume&quot;,</i></p><p><i>  &quot;post_body&quot;: {</i></p><p><i>    &quot;start_date&quot;: &quot;2026-01-01&quot;,</i></p><p><i>    &quot;end_date&quot;: &quot;2026-03-31&quot;,</i></p><p><i>    &quot;region&quot;: &quot;northeast&quot;</i></p><p><i>  }</i></p><p><i>}</i></p><p>The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a JSON query generated by a large language model (LLM) to filter and shape the response, and delivered the result by email, as a Drive document, or as a chart rendered in the browser.</p><p>By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. 
It had become the default way most teams pulled ad-hoc data.</p><p>The contract between the LLM and the rest of the system was a structured JSON object of the same shape as the example above:</p><p><i>json</i></p><p><i>{</i></p><p><i>  &quot;description&quot;: &quot;User requested sales volume for the given date range, here is the API call to get the response&quot;,</i></p><p><i>  &quot;api_call&quot;: &quot;/api/sales_volume&quot;,</i></p><p><i>  &quot;post_body&quot;: {</i></p><p><i>    &quot;start_date&quot;: &quot;2026-01-01&quot;,</i></p><p><i>    &quot;end_date&quot;: &quot;2026-03-31&quot;,</i></p><p><i>    &quot;region&quot;: &quot;northeast&quot;</i></p><p><i>  }</i></p><p><i>}</i></p><p>We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 and then to 4.0 without incident. By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. <a href="https://venturebeat.com/technology/anthropics-new-claude-can-code-for-30-hours-think-of-it-as-your-ai-coworker">Model upgrades</a> had become routine, like bumping a minor version of a well-behaved library.</p><p>Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of <i>post_body</i> into the <i>description</i> field. Two failure modes followed.</p><p>First, the filter parameters never reached the API. Our system read <i>post_body</i> as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the API being called, the backend either returned unfiltered sales volume, covering all time or all regions, or failed with a 500 error.</p><p>Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. 
Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.</p><p>We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.</p><h2><b>Why traditional engineering discipline fails here</b></h2><p>Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved. You can leverage the following property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.</p><p><a href="https://venturebeat.com/security/claude-mythos-exposed-a-hard-truth-your-enterprise-patching-process-is-way-too-slow">LLM-backed systems</a> break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.</p><p>This is what we mean by an <b><i>infinite blast radius</i></b>: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.</p><h2><b>Anatomy of the failure</b></h2><p>The post-mortem revealed that our prompt had always been under-specified. 
We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.</p><p>Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being &quot;helpful&quot; in its formatting choices, decided that asking for clarification or including the request body in the description made the response more useful. From the model&#x27;s perspective, this was a reasonable interpretation of an ambiguous instruction. However, it violated the assumptions under which our system was built.</p><p>The bug was not in the model. The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe.</p><p>Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We weren&#x27;t using them, for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question shouldn&#x27;t appear in a system with no path for clarification, or that a date range should never silently default to all-time. Schemas solve the easier half of the problem.</p><h2><b>The evals-first architecture</b></h2><p>The discipline that closes this gap is to treat the evaluation suite, not the prompt, as the formal <a href="https://venturebeat.com/technology/why-prompt-debt-retrieval-debt-and-evaluation-debt-are-quietly-reshaping-enterprise-ai-risk">specification of the system</a>. The prompt is an <i>implementation</i> of the spec. The model is an <i>interpreter</i>. 
The evals are the spec itself, and any model or prompt change is valid if and only if it passes them.</p><p>In practice, an eval is a triple: An input, a property the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:</p><p><i>python</i></p><p><i>def test_description_contains_no_serialized_payload(response):</i></p><p><i>    desc = response[&quot;description&quot;].lower()</i></p><p><i>    forbidden = [&quot;curl&quot;, &quot;post_body&quot;, &quot;{&quot;, &quot;http://&quot;, &quot;https://&quot;]</i></p><p><i>    assert not any(token in desc for token in forbidden), \</i></p><p><i>        f&quot;description leaked structured content: {response[&#x27;description&#x27;]}&quot;</i></p><p>A few hundred such properties, some written by hand for known-important invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier qualities like tone, become a gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.</p><p>Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance in outcomes. And the suite can only catch failure modes you have thought to specify — you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: Nobody on our team had ever written an assertion that said &quot;the description field should not contain a curl command,&quot; because nobody had thought the model would put one there.</p><p>Evals are not a silver bullet. 
They give you the ability to bound the blast radius of a change in the only way available when the underlying function is a black box: By densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior moves.</p><h2><b>The roadmap</b></h2><p>The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what &quot;coverage&quot; means in natural-language input spaces. CI/CD systems were not built to gate probabilistic test outcomes. As agents take on more autonomous work (writing code, moving money, scheduling infrastructure changes), the gap between &quot;the model passed our smoke tests&quot; and &quot;we know what this system will do in production&quot; becomes the central engineering problem of the next several years.</p><p>The teams that close that gap will be the ones who stop treating evals as a quality-assurance afterthought and start treating them as the actual specification of what their system is.</p><p><i>Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor.</i></p><p><i>Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.</i></p>]]></description>
            <category>Orchestration</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/FUBn4ZqFiOMRJs1UcjN1O/fdd8bca8d1fc308cd3b42db61a981138/u7277289442_A_mushroom_cloud_of_data_is_exploding_against_a_d_859177f5-e487-44a4-a545-407c10edd87e_0.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Agentic AI solved coding — and exposed every other problem in software engineering]]></title>
            <link>https://venturebeat.com/technology/agentic-ai-solved-coding-and-exposed-every-other-problem-in-software-engineering</link>
            <guid isPermaLink="false">5gcS69zEHGvbdbY17EDF9Z</guid>
            <pubDate>Sun, 07 Jun 2026 16:00:20 GMT</pubDate>
            <description><![CDATA[<p>Agentic AI is now a core part of the engineering process, driving massive execution leverage and helping us generate more code than ever before. Yet, a difficult question I’ve increasingly heard from business leaders is: <i>if we’re shipping code faster than ever, why aren’t our products improving at the same rate?</i></p><p>The reason is that writing code was never the rate limiter. Defining the right requirements, integrating with complex systems, and maintaining software under real-world conditions has always been the hard part. And when agents flood an organization with lots of new code, the hard part only gets harder. Agents compress execution time. They do not compress ambiguity, accountability, or operational complexity. </p><p>As AI-generated code scales, human review is becoming a massive new bottleneck, and engineers are losing the context needed to catch agent mistakes. The companies that understand this will move forward deliberately and <a href="https://www.nytimes.com/2026/06/01/technology/box-13-new-types-jobs-ai.html"><u>even create new roles because of AI</u></a>. The ones that don’t will default to a simpler, far more destructive conclusion: Reduce headcount and increase AI spend.</p><h2><b>The playbook</b></h2><p>Irreversible structural decisions demand caution, precisely because the technology is moving so fast. Enterprise engineering leaders need a deliberate playbook to navigate the chaos. Here&#x27;s how to start:</p><h3><b>Phase 1: Financial and risk governance</b></h3><p>Protect the downside — secure the infrastructure and cap the financial bleeding.</p><ul><li><p><b>Treat governance as a tier-one risk:</b> The pressure to integrate AI is real, but giving teams the freedom to experiment without a centralized structure creates fragmented processes, duplicated work, and runaway costs. 
Organizations will need to establish shared standards while still allowing teams to adapt and explore within defined boundaries. This means treating agent configuration like production infrastructure — versioning, reviewing, and testing prompts and skills before rolling them out gradually.</p></li><li><p><b>Enforce least privilege for non-human actors:</b> Never allow an agent to simply inherit the full permissions of its human operator. Human engineers are granted broad access because they possess contextual judgment and bear ultimate accountability. Deploying agents with human-level access without careful consideration introduces an accountability gap into your systems. Implement strict separation between <i>read</i> and <i>write/execute</i> access, and mandate human-in-the-loop approval gates for destructive or production-altering actions. As agents transition from suggesting code to autonomously executing tasks, they must be rigorously incorporated into your security model.</p></li><li><p><b>Watch your wallet:</b> Protect your overall AI budget by enforcing quotas and rate limits for both engineering and production. Cautionary tales are increasingly common: Uber capped its AI spend after <a href="https://www.axios.com/2026/05/28/ai-spending-roi-enterprise-costs"><u>burning its 2026 budget by April</u></a>, and, according to Axios, an unnamed company <a href="https://www.axios.com/2026/05/28/ai-spending-roi-enterprise-costs"><u>incurred a staggering $500 million Anthropic bill</u></a> in a single month due to runaway agentic loops.</p></li></ul><h3><b>Phase 2: Technical strategy</b></h3><p>Build the engine: Choose the right models and measure their success.</p><ul><li><p><b>Go multi-model and multi-vendor:</b> No single model excels at every task. It&#x27;s important to precisely characterize the behavior and performance boundaries across models to understand where each excels, routing specific tasks to the systems best equipped to handle them. 
Standardizing on a single vendor or model sacrifices capabilities and introduces a critical single point of failure. No organization should absorb that level of concentration risk in its core engineering function.</p></li><li><p><b>Pay for the frontier:</b> Treat AI as engineering leverage, not just another SaaS expense. Pay for premium frontier models that deliver the highest quality output and reduce costly rework. Ultimately, the cheapest model isn&#x27;t the one with the lowest token price — it’s the one that maximizes efficiency while minimizing your downstream risk.</p></li><li><p><b>Measure what actually matters:</b> Deployments, lines of code, and pull requests were never good metrics for productivity, and with AI, they are actively misleading. Instead, aim for metrics that are attached to business outcomes (feature adoption, retention) and engineering durability (change failure rate, escaped defects, code survival over time). For AI efficiency, measure task success per dollar and rework time. Token counts are convenient for leaderboards but they cannot tell you if the tokens were well spent.</p></li></ul><h3><b>Phase 3: Talent and organization</b></h3><p>Realign your human capital to manage the new bottleneck.</p><ul><li><p><b>Shift engineers from syntax to systems:</b> As agents handle the bulk of code generation, human review and architectural alignment are the new bottlenecks. Organizations must deliberately upskill their workforce to transition from syntax-writers to systems-thinkers and agent-managers. Engineers need the training and mandate to guide agentic processes, manage complex cross-system integrations, and hold the overarching architectural vision that agents can struggle to maintain.</p></li><li><p><b>Redefine performance and incentives:</b> When an individual engineer can generate the output of a former squad, traditional metrics like story points or sprint velocity can become ineffective overhead. 
Consider realigning your evaluation frameworks to better reward expanded business impact, cross-system reliability, and effective agent orchestration. If you want systems-thinkers who cover more strategic surface area, are willing to explore and take risks, and build products in a durable way, you must reward them for higher level impact, not sheer volume of output.</p></li><li><p><b>Don’t cut headcount before your strategy adapts:</b> If you haven&#x27;t integrated agentic workflows, measured augmented output in production, and reworked your roadmap around faster execution, you do not actually know whether your needs and capabilities align. Cutting headcount before establishing that baseline isn&#x27;t discipline — it’s blindness. The goal is not simply smaller teams, but teams capable of covering more strategic surface area.</p></li></ul><h2><b>Enterprise AI adoption requires human elasticity</b></h2><p>AI is not a replacement for engineering judgment; it is a force multiplier for it. In well-structured systems, it safely accelerates delivery. In poorly understood systems, it accelerates failure. We are already seeing the fallout: Outages, rising technical debt, and unexpected cost spikes driven by poorly governed adoption. These are operational failures, not theoretical risks.</p><p>The mistake organizations are now making isn’t adopting AI too slowly — it’s adopting it without understanding where it breaks.</p><p>For the C-suite, understanding this dynamic is no longer optional — it is the determining factor in how a business navigates this era. The challenge is that execution velocity is outpacing the industry&#x27;s ability to manage the consequences. We have handed engineering teams the ultimate power tool. The old adage demands that you measure twice and cut once. Instead, too many firms are opting to just cut.</p><p><i>Joe Bertolami is CTO and co-founder of </i><a href="https://cliftonai.com/"><i><u>Clifton AI</u></i></a><i>.</i></p>]]></description>
            <category>Technology</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/3GJfnTfObGfvke0twEzE23/f6fe7d7d2227048f9051439d53dee05f/Engineering_1.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Microsoft AI chief says company was “set free” from OpenAI to pursue superintelligence]]></title>
            <link>https://venturebeat.com/technology/microsoft-ai-chief-says-company-was-set-free-from-openai-to-pursue-superintelligence</link>
            <guid isPermaLink="false">3EaOhSBDkCkKYIkUQonFFJ</guid>
            <pubDate>Fri, 05 Jun 2026 22:55:38 GMT</pubDate>
            <description><![CDATA[<p>For three years, Microsoft&#x27;s artificial intelligence story has been inseparable from OpenAI. The partnership — cemented by a cumulative investment exceeding $13 billion — gave Microsoft early access to the most advanced AI models on the planet, catapulting its Copilot products into the enterprise mainstream and adding hundreds of billions of dollars to its market capitalization. To the outside world, Microsoft&#x27;s AI strategy <i>was</i> OpenAI.</p><p>Mustafa Suleyman wants to change that narrative.</p><p>In an exclusive sit-down interview with VentureBeat at <a href="https://news.microsoft.com/build-2026-live-blog/microsoft-build-2026-live/">Microsoft Build 2026</a>, the CEO of Microsoft AI disclosed that a contractual change with OpenAI roughly six months ago granted his division the formal authority to pursue what he openly calls &quot;superintelligence&quot; — using Microsoft&#x27;s own researchers, its own data pipelines, and its own custom silicon.</p><p>&quot;We were only sort of set free from our contract with OpenAI about six months ago to formally pursue superintelligence,&quot; Suleyman said. &quot;So this is very early days.&quot;</p><p>The comment, delivered matter-of-factly backstage at the Fort Mason Center here, offers the clearest signal yet of a strategic inflection point unfolding inside the world&#x27;s most valuable public company. Microsoft is not abandoning OpenAI. But it is building something alongside it — and, eventually, something that could stand entirely on its own.</p><h2>Microsoft&#x27;s first in-house model family signals a new level of AI ambition</h2><p>The most tangible evidence of that shift arrived the same day. 
Microsoft announced <a href="https://microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/">a family of seven new AI models</a> developed entirely in-house by its AI Superintelligence Team, spanning reasoning, code generation, image creation, transcription, and voice synthesis. The models — branded under the &quot;MAI&quot; family name — are Microsoft&#x27;s most ambitious first-party AI release to date.</p><p>The flagship, <a href="https://microsoft.ai/news/introducing-mai-thinking-1/">MAI-Thinking-1</a>, is a 35-billion-active-parameter reasoning model that Microsoft says matches leading models in its weight class on key software engineering benchmarks and demonstrates advanced mathematical reasoning. Suleyman emphasized one point repeatedly: the model was trained from scratch on clean, commercially licensed data, without distillation from third-party frontier models — a direct, if unstated, contrast to the widespread industry practice of using outputs from competitors&#x27; systems to train cheaper alternatives.</p><p>&quot;We train our reasoning models from scratch,&quot; Suleyman wrote in a blog post accompanying the announcement. 
&quot;We don&#x27;t distill from other labs and we don&#x27;t rely on unlicensed or opaque data.&quot;</p><p>The rest of the family fills out a multimodal portfolio designed for enterprise deployment: <a href="https://microsoft.ai/news/introducingmai-code-1-flash/">MAI-Code-1-Flash</a>, a lightweight coding model built specifically for <a href="https://github.com/features/copilot">GitHub Copilot</a> and <a href="https://code.visualstudio.com/">VS Code</a>; <a href="https://microsoft.ai/models/mai-image-2-5/">MAI-Image-2.5</a>, which supports both text-to-image and image editing; <a href="https://microsoft.ai/news/mai-transcribe-1-5more-accurate-context-aware-and-built-for-production/">MAI-Transcribe-1.5</a>, which Microsoft claims is the most accurate transcription model available, operating across 43 languages; and <a href="https://microsoft.ai/models/mai-voice-2/">MAI-Voice-2</a>, a multilingual speech-generation system. All of the models ship through <a href="https://azure.microsoft.com/en-us/products/ai-foundry">Microsoft Foundry</a>, the company&#x27;s model-hosting and deployment infrastructure, and for the first time, developers can tune model weights themselves through third-party platforms including <a href="https://openrouter.ai/">OpenRouter</a>, <a href="https://fireworks.ai/">Fireworks</a>, and <a href="https://www.baseten.co/">Baseten</a>.</p><p>But Suleyman made clear in the interview that the seven models are a proof of concept, not a finished product. The real project is the lab itself.</p><p>&quot;Our job is to make sure that when we look out to 2030 and beyond, we have the capacity not just to buy models from third parties, but to build the absolute frontier, the best models in the world,&quot; he said. 
&quot;That&#x27;s a long transition.&quot;</p><h2>What &quot;set free&quot; from OpenAI actually means for Microsoft&#x27;s AI future</h2><p>To understand what Suleyman means by &quot;set free,&quot; you need to understand the unusual contractual architecture that has governed Microsoft&#x27;s AI efforts for years.</p><p>When Microsoft <a href="https://openai.com/index/microsoft-invests-in-and-partners-with-openai/">invested billions</a> into OpenAI beginning in 2019, the partnership came with a specific arrangement: OpenAI would build the frontier models, and Microsoft would serve as the <a href="https://blogs.microsoft.com/blog/2023/01/23/microsoftandopenaiextendpartnership/">exclusive cloud provider</a>, integrating those models into its products and reselling them through Azure. The deal gave Microsoft extraordinary commercial leverage — access to the world&#x27;s most advanced AI without having to build it — but it also created a dependency. Microsoft was explicitly barred from pursuing its own AGI research, and the agreement even capped how large a model the company could train, restricting it from building systems beyond a certain computing threshold measured in FLOPS.</p><p>That arrangement was formally renegotiated. 
As <a href="https://fortune.com/2025/11/06/microsoft-launches-new-ai-humanist-superinteligence-team-mustafa-suleyman-openai/"><i>Fortune</i></a> and <a href="https://www.axios.com/2025/11/06/microsoft-mustafa-suleyman-superintelligence"><i>Axios</i></a> reported in November, a revised deal with OpenAI removed those restrictions, clearing the way for Suleyman to launch the MAI Superintelligence Team and pursue what he calls &quot;<a href="https://microsoft.ai/news/towards-humanist-superintelligence/">humanist superintelligence</a>.&quot; The result, in Suleyman&#x27;s telling at the time, was a &quot;best-of-both environment, where we&#x27;re free to pursue our own superintelligence and also work closely with them.&quot;</p><p>By the time he sat down with VentureBeat at Build 2026, roughly six months had passed since that self-sufficiency effort formally began. Microsoft had already started shipping in-house models — including <a href="https://venturebeat.com/technology/microsoft-launches-mai-image-2-efficient-a-cheaper-and-faster-ai-image-model">MAI-Image-2-Efficient</a>, a lighter-weight image generation model released in April — but the seven MAI models announced at Build are the team&#x27;s most ambitious release yet: a full multimodal family spanning reasoning, code, image generation, transcription, and voice.</p><p>Even so, Suleyman does not view the shift as a rupture with OpenAI. He described Microsoft&#x27;s current position as one of abundance, not scarcity.</p><p>&quot;There&#x27;s no immediate urgent need to fill a gap in three months&#x27; time or six months&#x27; time,&quot; he said. &quot;We have OpenAI, we have Anthropic, we have thousands of models inside Foundry. So there&#x27;s already a huge amount of optionality available to us.&quot;</p><p>The framing is telling. 
Microsoft&#x27;s push into first-party frontier models is not born out of a crisis in the OpenAI relationship but out of a strategic calculation: as AI becomes the most consequential technology layer in enterprise computing, the company cannot afford to depend entirely on partners for the foundational capability. &quot;Over the next five years, we have to be able to produce state-of-the-art frontier-scale models,&quot; Suleyman said. &quot;That&#x27;s our mission.&quot;</p><h2>Suleyman says the shift from chatbots to autonomous AI agents has already begun</h2><p>If the seven MAI models represent the technical ambition, a new capability called <a href="https://devblogs.microsoft.com/microsoft365dev/frontier-tuning-teaching-ai-to-work-the-way-you-do/">Frontier Tuning</a> represents the commercial logic. Announced alongside the models at Build, Frontier Tuning allows enterprise customers to customize MAI models using their own proprietary data, workflows, and domain terminology, all within their own secure compliance boundary. The system uses reinforcement learning environments — what Microsoft calls &quot;<a href="https://blogs.microsoft.com/blog/2026/06/02/ai-alone-wont-change-your-business-the-system-running-it-will/">training gyms for AI</a>&quot; — that let agents learn directly from real workplace tasks without affecting production systems.</p><p>The results Microsoft shared are striking. An MAI model tuned for Excel reportedly matches GPT 5.4 performance while operating at up to ten times greater efficiency. Early enterprise adopters are seeing similar gains: when tuned for one unnamed organization&#x27;s exacting standards, the MAI model achieved the highest win rate of any model tested at roughly one-tenth the cost.</p><p>Suleyman framed Frontier Tuning as part of a broader evolutionary stage — a move from intelligence to action. &quot;We&#x27;ve basically moved beyond just conversation,&quot; he told VentureBeat. 
&quot;Now we&#x27;re moving to action.&quot;</p><p>He introduced a new framework for thinking about that progression: the shift from IQ (factual intelligence) to EQ (emotional intelligence, or the ability to follow tone and style instructions) to what he calls AQ — the &quot;Actions Quotient.&quot; </p><p>Future AI agents, in Suleyman&#x27;s telling, won&#x27;t just answer questions. They will log into enterprise software, navigate complex multi-application workflows, and execute tasks across Excel, Word, Teams, Jira, Adobe InDesign, and customer relationship management systems — just as a human employee would.</p><p>&quot;You should be able to show up on day one and almost provision credentials to a new AI agent,&quot; he said. &quot;The model needs to be able to move across all of these different environments, and that&#x27;s actually the great strength of Microsoft.&quot;</p><p>The <a href="https://news.microsoft.com/build-2026-live-blog/microsoft-build-2026-live/">Build 2026</a> announcements bore this out in concrete product terms. <a href="https://www.microsoft.com/en-us/microsoft-365/blog/2026/06/02/introducing-microsoft-scout-your-always-on-personal-agent/">Microsoft Scout</a>, the company&#x27;s first &quot;Autopilot&quot; agent, operates as an always-on background assistant built on the open-source OpenClaw technology. It runs with its own governed identity inside <a href="https://www.microsoft.com/en-us/security/business/microsoft-entra">Microsoft Entra</a>, so its actions are auditable and attributable. <a href="https://techcommunity.microsoft.com/blog/windows-itpro-blog/made-for-developers-and-agents-windows-365-at-build-2026/4519041">Windows 365 for Agents</a> gives AI agents their own managed Cloud PCs, allowing them to interact directly with applications and browsers inside enterprise environments. 
And the <a href="https://devblogs.microsoft.com/foundry/whats-new-in-microsoft-foundry-build-2026/">Foundry platform</a> received major updates — including hosted agents with sub-100-millisecond cold starts, a new Microsoft Agent Framework, and one-click publishing to Teams and Microsoft 365 Copilot.</p><h2>Why Microsoft believes enterprise data is the next AI training frontier</h2><p>Suleyman also articulated why he believes Microsoft&#x27;s position is uniquely defensible — and the argument has less to do with model architecture than with where work actually happens.</p><p>&quot;We&#x27;ve sort of hoovered up all of the obvious pools of training data,&quot; he said, referring to the industry&#x27;s early scramble to ingest the open web. &quot;In the next phase, we actually want to be able to give these agents to companies to train on their specific tasks with the data that they have inside of their own big workflows.&quot;</p><p>The claim is subtle but consequential. The first wave of generative AI was trained on publicly available text — books, websites, Reddit posts, code repositories. That data is now largely exhausted, and its use is increasingly contested in court.</p><p>The next wave, Suleyman argues, will be trained on enterprise-specific data: the internal workflows, decision traces, and institutional knowledge that define how real organizations operate. Microsoft, which serves 493 of the Fortune 500 through Azure according to Suleyman, is already embedded inside those workflows through Microsoft 365, Teams, Dynamics 365, and the broader Azure ecosystem. 
Frontier Tuning is the mechanism that converts that positional advantage into model performance.</p><p>&quot;People underappreciate that that&#x27;s going to be the next domain,&quot; Suleyman said.</p><p>The early partner list for Frontier Tuning reflects the ambition: <a href="https://www.mayoclinic.org/">Mayo Clinic</a>, where Microsoft is co-creating a frontier AI model for healthcare using de-identified clinical data; <a href="https://www.ey.com/en_us">EY</a>, which is tuning a tax-advisory agent for deployment to 75,000 professionals globally; <a href="https://www.landolakesinc.com/">Land O&#x27;Lakes</a>, where Frontier Tuning delivered what the company&#x27;s product development scientist called &quot;meaningful improvements in grounded outputs and style compliance&quot;; and <a href="https://www.pearson.com/">Pearson</a>, which is using tuned models to provide learning-science-aligned feedback in its Communication Coach product.</p><p>The Mayo Clinic partnership may be the most significant. Microsoft and Mayo Clinic are collaborating to build a <a href="https://news.microsoft.com/source/2026/06/02/mayo-clinic-and-microsoft-collaborate-to-develop-a-frontier-ai-model-for-healthcare/">healthcare-specific frontier model</a> that combines Mayo&#x27;s clinical expertise and longitudinal patient insights with Microsoft&#x27;s AI capabilities. The model will be owned by Mayo Clinic and deployed first within Mayo&#x27;s own environment before being made available to other organizations through Foundry.</p><h2>Microsoft&#x27;s custom AI chips and GPU buying spree reveal the scale of its compute advantage</h2><p>None of this works without an industrial-scale compute infrastructure, and Suleyman was unusually candid about the hardware economics underlying Microsoft&#x27;s strategy.</p><p>&quot;We are the largest buyer of GPUs on the planet,&quot; he said. 
&quot;We&#x27;re the largest buyer of GB200s and GB300s in the world.&quot;</p><p>Microsoft will continue purchasing Nvidia accelerators &quot;for many, many years to come,&quot; Suleyman said. But the company is simultaneously building its own custom silicon. <a href="https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/">Maia 200</a>, Microsoft&#x27;s second-generation AI accelerator, is already running in production across data centers in Iowa and Arizona, with deployments planned for Italy, Australia, and South Korea. According to Microsoft, Maia 200 delivers the best tokens-per-dollar-per-watt in the company’s fleet.</p><p>Suleyman put a finer point on the economics in the interview: Maia 200 is 30 percent more cost-efficient than Nvidia&#x27;s GB200, he said. And when Microsoft co-optimizes its own MAI models to run natively on Maia silicon, the company sees an additional 1.4x improvement in performance per watt. &quot;It is going to be cheaper in years to come to build on MAI models with Maia 200 and Maia 300 inside of Azure,&quot; he said.</p><p>That claim — if it holds at scale — has profound implications for the competitive landscape. It means Microsoft is not merely buying its way to AI dominance through Nvidia; it is building a vertically integrated stack in which its own models, running on its own chips, inside its own cloud, tuned on its customers&#x27; own data, could offer performance and cost characteristics that no competitor can replicate.</p><h2>Suleyman rejects the idea that AI models are becoming commodities</h2><p>Suleyman also pushed back sharply against one of the most popular narratives in Silicon Valley: that AI models are rapidly commoditizing.</p><p>&quot;A lot of people are saying models are commoditizing,&quot; he said. 
&quot;I don&#x27;t think that&#x27;s true.&quot;</p><p>His argument hinges on what he calls &quot;quality tokens&quot; — the proposition that the composition, curation, licensing, and deduplication of training data matter at least as much as raw scale. Microsoft&#x27;s new MAI models, he said, were trained on a pre-training mix composed of approximately 50 percent high-quality code, with the remainder drawn from commercially licensed and carefully curated sources.</p><p>The result, he argued, is a distinct &quot;lineage&quot; of models optimized for coding, reasoning, and agentic behavior — fundamentally different from models optimized for consumer chat, cultural content, or multilingual breadth.</p><p>&quot;We&#x27;re going to see very distinct lineages that reflect different training objectives of different companies,&quot; he said. &quot;Quality tokens matter more than just brute-force scale.&quot;</p><p>This is a strategically important argument for Microsoft to make. If models are commodities — if any lab can match the frontier within months using cheaper compute and distilled training data — then the model layer becomes a race to the bottom, and Microsoft&#x27;s billions in compute investment offer no durable advantage. But if model quality is a function of data discipline, research depth, and institutional patience, then the lab-building approach Suleyman is pursuing becomes a genuine competitive moat.</p><p>He used a specific metaphor to describe that approach, one borrowed from optimization theory: the &quot;<a href="https://microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/">hill-climbing machine</a>.&quot; The phrase describes a system that continuously improves — cycle after cycle — by applying more compute, better data, and sharper evaluation. 
&quot;The goal here is to build what we think of as a hill-climbing machine,&quot; he wrote in <a href="https://microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/">his blog post</a>. &quot;An organization that can continuously improve, cycle after cycle.&quot; The metaphor is revealing because it describes a process, not a destination. Suleyman is not promising that Microsoft will build the world&#x27;s best model next quarter. He is arguing that Microsoft is building the <i>system</i> — the research culture, the data pipelines, the silicon co-optimization, the evaluation infrastructure — that will produce progressively better models over years.</p><h2>Inside Microsoft&#x27;s five-year plan to become a self-sufficient AI superpower</h2><p>The strategic picture that emerges from Suleyman&#x27;s comments — and from the full scope of the Build 2026 announcements — is of a company preparing for a future in which AI capability is not rented from a partner but generated internally, at scale, across every layer of the stack.</p><p>Microsoft still needs OpenAI. The partnership continues to power Copilot, Azure AI services, and ChatGPT&#x27;s infrastructure. Suleyman acknowledged as much, describing Microsoft&#x27;s portfolio of model providers as a source of strength, not a problem to be solved. </p><p>But the direction of travel is unmistakable. With its own frontier models, its own custom silicon, its own reinforcement learning environments for enterprise tuning, and its own autonomous agent infrastructure, Microsoft is constructing a parallel path — one that, by 2030, could make the company a fully self-sufficient frontier AI lab embedded inside the world&#x27;s largest enterprise software platform.</p><p>&quot;Our ultimate goal is what we call Humanist Superintelligence,&quot; Suleyman wrote in his <a href="https://microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/">blog post</a>. 
&quot;That means advanced AI systems designed to serve people and organizations, not replace them.&quot;</p><p>Whether that goal is achievable — or even clearly definable — remains one of the great open questions in technology. And Suleyman expressed more confidence than caution when asked about the trajectory of progress. &quot;I really think we&#x27;re at the tip of the iceberg,&quot; he said. &quot;The models are so much more powerful than we know how to extract intelligence from them.&quot;</p><p>But confidence and execution are different things. Building a frontier lab is not an announcement; it is a decade-long commitment that requires retaining elite researchers, maintaining scientific rigor under commercial pressure, and producing results that justify the staggering capital expenditure.</p><p>Google learned this with DeepMind — which Suleyman himself co-founded in 2010, before joining Microsoft — and even that lab, widely regarded as one of the best in the world, spent years navigating the tension between pure research and product delivery.</p><p>Suleyman seemed aware of the contradiction. &quot;If you rush it, you&#x27;ll screw it up,&quot; he said.</p><p>The sticker on his laptop reads: &quot;Patience and urgency.&quot; It is a paradox that Microsoft now has five years — and several hundred billion dollars — to resolve.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Technology</category>
            <category>Infrastructure</category>
            <category>Business</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/4wc08GCSuP5WTTenHHRob0/c2f2cdfcfee0e96b40f5d8e60ad45d98/Nuneybits_Vector_art_of_the_iconic_Microsoft_Windows_logo_on_a__7e9c6174-b677-4904-a7be-2d3aed0072e9.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[Microsoft's AI Futurist explains how he uses Copilot — and the real-world problems enterprises are solving with agents]]></title>
            <link>https://venturebeat.com/orchestration/microsofts-ai-futurist-explains-how-he-uses-copilot-and-the-real-world-problems-enterprises-are-solving-with-agents</link>
            <guid isPermaLink="false">3RdRRZK0navJNcrZIWBWHl</guid>
            <pubDate>Fri, 05 Jun 2026 19:31:00 GMT</pubDate>
            <description><![CDATA[<p>Microsoft used its <a href="https://news.microsoft.com/build-2026/">Build 2026 conference</a> this week to push a clear message: agents are rapidly moving into production throughout enterprise systems, and the winning platform will be the one that gives them reliable context, governance, identity, memory — and secure access to enterprise data.</p><p>The company <a href="https://venturebeat.com/data/enterprise-ai-agents-keep-creating-data-silos-microsofts-build-answer-is-microsoft-iq-and-rayfin">announced Microsoft IQ</a> as a context layer across GitHub Copilot, Microsoft Foundry and Copilot Studio; Work IQ APIs coming June 16; Fabric IQ for structured business data; Foundry IQ for retrieval across enterprise knowledge and the live web; and Web IQ as a new agent-facing web search stack.</p><p>Microsoft also introduced <a href="https://www.microsoft.com/en-us/microsoft-365/blog/2026/06/02/introducing-microsoft-scout-your-always-on-personal-agent/">Scout</a>, a personal work agent, and a whopping <i>seven</i> <a href="https://microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/">new in-house AI models in its growing MAI family</a> across modalities and use cases, including MAI-Thinking-1.</p><p>Those announcements sit directly in <b>Marco Casalaina</b>’s lane. Casalaina <a href="https://www.linkedin.com/in/marcocasalaina/">is Microsoft’s VP Products, Core AI and AI Futurist</a>. He leads Microsoft’s AI Futures team and previously led teams across Azure AI, including Azure OpenAI, Vision, Speech, Decision, Language, Responsible AI and AI Studio.</p><p><a href="https://onegiantleap.com/2026-speakers/marco-casalaina?utm_source=direct&amp;utm_medium=direct&amp;utm_campaign=Unspecified">Before Microsoft</a>, he led Salesforce’s Einstein AI team and earned a computer science degree from Cornell University. 
<a href="https://www.crn.com/slide-shows/channel-programs/30-notable-it-executive-moves-march-2022">CRN reported</a> that he joined Microsoft in early 2022 as vice president of products for Azure Cognitive Services, meaning he has now been at the company for more than four years.</p><p>VentureBeat spoke with Casalaina ahead of Build about Microsoft’s agent strategy, the company’s model-choice philosophy, how Microsoft IQ fits with MCP, and why he believes enterprises need far more than just access to powerful models. The interview below has been edited for clarity and condensed from the transcript.</p><h3><b>VentureBeat (VB): To start, can you explain your role at Microsoft and what “AI Futurist” means in practice?</b></h3><p>Marco Casalaina (MC): I am VP Products of what we call Core AI. Core AI is our set of tools for AI developers, and that includes Foundry, Visual Studio, VS Code, GitHub and GitHub Copilot. That’s our overall group.</p><p>My Silicon Valley title is AI Futurist, and that has a very concrete meaning here. I’ve worked with other folks who are considered futurists, like Peter Schwartz, and that can be a little bit more fuzzy. For me, what it means concretely is that I am the first person to try anything new here.</p><p>I am constantly getting things from all over Microsoft, not even just Foundry, because I work with really everybody across the company. Pretty much everybody sends me the new things at all times. Even today, I got something brand new just before this call. I’m usually the first person to try anything new here, which is pretty cool. I get to see a lot of really cool stuff.</p><p>A friend of mine, who is head of AI at Intuit, calls me an “adjacent possiblist.” I consider my futurist concept to be about a year out from now — the immediate future of what’s about to happen next. 
That’s what I focus on.</p><h3><b>VB: Where are you looking at the agentic state of things, and in particular Microsoft’s position as enterprises and individuals rush to adopt agentic AI?</b></h3><p>MC: We can look at it from bottom to top. At the very base of the stack is our commitment to model choice. All along, we’ve had the OpenAI GPT frontier models. Now we have a really solid partnership with Anthropic, where we’re offering the Claude models. We just launched Claude Opus 4.8 on Azure — on Foundry, I should say — and at Build, we are introducing our new MAI model.</p><p>The MAI models are a set of frontier models that we’re building in-house. They are made for token efficiency, optimization and customization. We are specifically making them for our customers to customize on their own data sets.</p><p>One level above that, we are announcing hosted agents in Foundry. That is our managed agent capability in Foundry. It automatically handles scaling, containerization and those kinds of things. It is an environment where you can manage agents.</p><p>One level above that is the Foundry control plane. At least for the agents you build, you want to have control over them. This gives you observability into their cost, tokens and correctness. You can do continuous evaluations and sample interactions with those agents, run evals and make sure they are continuing to work and not drifting.</p><p>The big news is going to be the GA of what we call the IQs here at Microsoft. There are currently three, and there will be four. There is Foundry IQ, which is basically for knowledge — largely unstructured knowledge. There is Fabric IQ. We have a ton of customers who have entrusted a lot of data to the Microsoft Cloud in Fabric, Power BI and related technologies. Fabric IQ is about making an agent-facing interface for this data, so agents can get to it without literally going through a Power BI report. That’s ridiculous.</p><p>Work IQ is about the Microsoft ecosystem. 
You can look at Work IQ as the agentic face of all the Microsoft apps: Outlook, Teams, Word, SharePoint and all those kinds of things. How does an agent interact with those things? That is Work IQ.</p><p>And finally, the fourth IQ is Web IQ. We are releasing our new agent-facing web search capability. It can search the web, search through videos and even do some kinds of browsing tasks automatically. It is super fast, and it kind of has no face. It’s headless. The interface is intended for agents.</p><p>We will also be announcing Agent Optimizer. That includes a new type of evaluation that allows you to evaluate much more granularly whether an agent is actually working and working correctly. The optimization step can go back in and make modifications to the prompt, obviously with your consent, and modify your agent so it works more correctly going forward. Effectively, it creates a feedback loop to make agents work better.</p><h3><b>VB: Microsoft has sometimes been criticized for murky and clunky product naming. Where do these IQ products sit? Are enterprise users supposed to go to IQ first, or is IQ more for developers to connect to?</b></h3><p>MC: All of the IQs are headless. The concept of IQ is that each one provides a different type of context to an agent specifically. Largely, it will be developers interacting with the various IQs — developers and the agents they build.</p><p>The IQ brand is really about agent context. End users largely won’t interact with the IQs. It is true that if you use Microsoft 365 Copilot today, you’ll notice a little thing that says it is using Work IQ. So it is a little bit visible, but the customer or end user doesn’t have to go find the IQ. Their system or developers hook that up.</p><h3><b>VB: Is the IQ family essentially Microsoft’s version of MCP? Is it using MCP, or is it something different?</b></h3><p>MC: All of the IQs are indeed exposed as MCP servers. 
You have correctly characterized MCP as basically an agent-facing or self-describing API. It’s not that fancy. That’s really what it is, with some authentication layers and capabilities built in, which is super useful.</p><p>Something like Work IQ — really all the IQs — have to be authenticated. In order for Work IQ to see my email, Teams messages, documents and stuff like that, I have to be able to authenticate it on behalf of me.</p><p>That gets us to another core differentiator that we will be announcing at Build, which is agent identity. We have this Entra system, and Entra is, I believe, the world’s largest used identity system for human users. For some time now, you have been able to declare an agent to have an identity in there. Now, agents will be able to have their own identity, their own Teams box, their own email inbox and stuff like that.</p><p>These agents will use Work IQ to check their own email, check their own documents and that sort of thing.</p><h3><b>VB: Enterprises are not one-size-fits-all on models. Microsoft supports many leading models through Foundry and Azure, while also building its own. Is Microsoft a model company, an infrastructure company or a connector between models and work products?</b></h3><p>MC: The answer is yes. We are obviously the hyperscaler. We are absolutely committed to model choice, and we will continue to offer the frontier models from all of the major players: OpenAI, Anthropic, Mistral, Black Forest, xAI — you name it. They are all going to be represented in there.</p><p>At the same time, we have what is now called our Microsoft AI Superintelligence Team, formed by Mustafa Suleyman, and we are building our own frontier models as well. 
Like I said earlier, we are really gearing these models toward optimization — token efficiency, bang for the buck and customization.</p><p>These are things our customers have been asking for: the ability to more finely customize models, whether that is fine-tuning or continued pre-training. Continued pre-training is literally changing the weights of the model, whereas fine-tuning is adding a little layer on top.</p><p>We have these capabilities in Foundry: fine-tuning, distillation and those kinds of things. I would note, by the way, that our MAI models are not distilled. Some model providers, especially some of the less scrupulous ones, will distill other models into theirs, and that can have unusual effects. We don’t do that. The data provenance of our models is of primary importance to us.</p><p>When we come out with these models, we want our customers to know that the data provenance is clean in terms of the rights to the data, where it came from and all that kind of stuff.</p><p>The choice thing also goes above the model layer. When we talk about Foundry hosted agents, we have the Microsoft Agent Framework. You talk about agent orchestration — how you make agents work together when you have multiple agents — and Microsoft Agent Framework is an excellent framework for that.</p><p>However, I can make a LangGraph or LangChain Foundry hosted agent. I can make a CrewAI Foundry hosted agent. I can use any number of orchestration frameworks and put that up as a Foundry hosted agent, and it becomes a first-class Foundry agent.</p><p>That means I get the observability. It shows up in the Foundry control plane. I can do evaluations on it. I can do traces on it. I can get all those things from the Foundry control plane with an agent built in really any framework I choose.</p><h3><b>VB: Some companies are interested in Chinese and open-source models. 
How much of Microsoft offering its own models is about giving customers an American version of that?</b></h3><p>MC: I can’t speak to that exactly. Of course, we offer DeepSeek models and Qwen models in Foundry, so we offer all of these choices today, and our customers can make that choice.</p><p>The MAI models are really focused on token efficiency and customizability. That is what our customers are demanding, and that is the gap we are filling.</p><h3><b>VB: As agents take on longer tasks and more specialized work, will enterprises keep expanding the number of models they use, or will there be a winnowing?</b></h3><p>MC: I do see it expanding. We are not just focused on tokens per se. A token is not a token is not a token. One token is not necessarily equivalent across these things. It is all about what you are doing with each token and the efficiency of that. It comes back to what kind of value you are getting for the cost. That is a lot of the rationale behind why we are developing our own MAI models.</p><p>Part of my job is to travel all around the world. I’ve been all over the place. For example, I’ve been working with Bayer. One of the things we are measuring is not just token usage, but number of users — monthly active users and daily active users — because we have a lot of first-party capabilities like Microsoft 365 Copilot. Over the last year, we’ve seen a 6x increase in monthly active users. We have over 20 million users of Microsoft 365 Copilot alone.</p><p>That is on the agents you use. In terms of the agents you build, Bayer put up its own agent system on Foundry, and now it has 20,000 of its own employees on it.</p><p>A few weeks ago, I was in Sydney, Australia, hanging out with AEMO, the Australian Energy Market Operator. They operate the electrical grid of Australia. They showed me that they had built agents to manage grid operations.</p><p>This is a human-centered thing. 
They have grid operators sitting in centers in West Sydney, Brisbane and places like that, and they are bombarded with alerts. I wouldn’t believe it if I hadn’t seen it myself. The alerts are constant. They built a system to triage those alerts. Is this alert a super major thing, or is it just that a transformer is getting a little hot? It also says, here is when we had this problem last time, and here is how we resolved it last time. Maybe now we need to replace this component, or whatever.</p><p>Ultimately, it is the grid operators making the choice. A lot of our philosophy here is human empowerment. These human-centered agents are the ones that are working best among our customers. What I saw at AEMO and Bayer is this notion of human empowerment: taking away some of the grunt work, or in the case of AEMO, taking billions of alerts and reducing them to something much more manageable and actionable for the people involved.</p><p>We are moving past the era where agents are just answering questions. AI in general is moving past that. We are not just answering questions anymore. We are moving toward a place where AI can really meaningfully help you do your work.</p><h3><b>VB: How do observability, tokenomics, ROI analysis and agent governance fit into Microsoft Foundry?</b></h3><p>MC: That is what the Foundry control plane is all about. We introduced it in November of last year. If you looked at my own Foundry control plane — I’ve built a ton of these agents, and I am a developer by background — you would see all of my agents that are running and the ones that are paused.</p><p>I can see how many tokens they’ve used over the last day, week or month. I can look at trends. I can look at costs, because the cost will be different depending on what underlying model I’m using. If I’m using our model router, it can route to different models depending on the complexity of the inbound prompt.</p><p>We also have Azure cost management overall. 
Azure has had cost management for over a decade, before the AI thing even happened. This integrates with overall Azure cost management.</p><p>It is not just narrowly about what your AI is doing. Your AI will be using storage resources, data resources and other compute resources around that AI. You can get a complete picture of not just the cost and token usage of the AI itself, but everything around it.</p><p>When you think about governance, that also extends to evaluation. One of the things we are releasing in preview is rubric-based evaluation. Rubric-based evaluation is much more granular.</p><p>Let’s say you have built a restaurant reservation agent. The things you want to test about that agent are not really groundedness. Groundedness is the opposite of hallucination, and that is very question-answering. For a restaurant reservation agent, you want to test very granular things. If you say, “Make me a table for two tomorrow,” did it come back and ask, “What time would you like the table?” Before it gave you a table for two tomorrow at 6 p.m., did it actually check that the table was available, or did it randomly give you a table without checking first?</p><p>There are very granular things you want to test about that specific use case. You don’t just want to test whether the agent works. You want to test whether the agent works right.</p><p>That is what we are approaching with our new rubric-based evaluation system. You will see that in Satya’s keynote. I have been using it myself lately, and I’m very happy about it. I’ve been waiting for this.</p><h3><b>VB: Microsoft is also partnering with companies like Anthropic and allowing Claude to work with Microsoft 365. How important is Copilot to this story? Why would someone turn to Copilot over other options?</b></h3><p>MC: Microsoft 365 Copilot is a huge advantage for us. As I mentioned, we crossed the 20 million user mark on Copilot relatively recently.</p><p>The great thing about that is that it is the face. 
When you go into Foundry and make an agent, there is a button that says “publish to Copilot” — actually, it says “publish to Copilot in Teams,” because you can put it in Teams too.</p><p>The idea is that you want to put these agents where your users are. A lot of people who use the Microsoft ecosystem are in Teams, or they are using Copilot. I can create a custom agent, as many of my colleagues have, and now it is in Copilot, which I use maybe 50 times a day.</p><p>Since January, Copilot has become more and more capable. I now use it to draft my email. I am not just using it for question answering. I’m starting to use it to manage my calendar and draft emails. I really do this every day now.</p><p>When I want to use a custom agent — for example, to file my expenses, because we have a custom agent for that now — I can access that agent not in some random standalone interface, but in Copilot or Teams, where I already am.</p><p>That surface area that people are already engaging with is a major advantage.</p><h3><b>VB: As people offload more repetitive work to AI, what are they able to spend more time doing?</b></h3><p>MC: Let’s consider something I did yesterday. I got an email from a customer named Frankie, and he asked me a question about Foundry hosted agents. I knew the answer because I had talked to my colleague Jeff Holland, who is the head of our hosted agents product management. I had asked Jeff the same question two weeks ago.</p><p>Where or how I asked him, I don’t remember. Was it in Teams? Was it email? Was it a meeting? I don’t really remember. But I knew the answer to the question Frankie was asking.</p><p>So I went into Copilot and said, “Answer Frankie’s question about how hosted agents scale, and reference the conversation I had with Jeff a couple of weeks ago on this same topic.” And it did it. It drafted the email.</p><p>Over time, I have taught Copilot my style. I don’t do the bold-print thing. I tell it: don’t use em dashes and that kind of stuff. 
I have a certain style in the way I write emails. It’s a little terse, to be perfectly honest, but I want it to be the way I write.</p><p>It drafted this thing. It searched through my Teams messages, my emails and the transcripts of my meetings with Jeff. It used Work IQ, as a matter of fact. It found the answer, drafted the email and provided a link to the documentation that specifically covered the question Frankie was asking.</p><p>I looked at the draft and thought, yep, that’s it.</p><p>Yes, I could have composed this email myself. I knew the answer to the question. I could have looked up the documentation. If I dug around, I’m sure I could have found the conversation I had with Jeff in whatever medium that was. I could have done that stuff. It probably would have taken me, I don’t know, an hour to find all the information and compose it.</p><p>Instead, I did it in about a minute. I had a draft, I looked at it, I was happy with it, I pressed send, and that was the end of that.</p><p>It really is about giving people time back. It is not even just grunt work. It is all this time you spend looking things up and finding things. Now, I can make it take an action. It didn’t just answer the question. It fully drafted the email and copied Jeff.</p><h3><b>VB: Do you fear for your job? How has AI changed your own work?</b></h3><p>MC: I don’t fear for my job. My job has changed. For one thing, I do a lot more now, both in my business life and personal life.</p><p>This weekend I was using Web IQ, the new Web IQ. I’ve been car shopping. My car’s lease is coming up, and there is a very specific car I’m trying to find, which is hard to find. It’s a Hyundai Ioniq 6, which Hyundai, for whatever reason, has stopped offering in the United States. I’m going to get one, though.</p><p>I set my agent to the task, using Web IQ, of finding all the Hyundai Ioniq 6s available in the entire Bay Area — everywhere, all the way out to Sacramento, all the way as far south as Gilroy. 
I set it to this task, and then I went on a hike.</p><p>When I got back, I had a big long list of all the Hyundai Ioniq 6s, at least the 2024 and 2025 models, available in the entire Bay Area. From that, I started calling down these dealers.</p><p>Even in my personal life, I’m using it constantly. It saves me a ton of time. That would have taken me hours, to go through every single dealer’s inventory like this. But Web IQ could do that, and it was super quick.</p><h3><b>VB: Any final thought for developers around this news?</b></h3><p>MC: Foundry is really the place. This is the place where you can build your agents, scale your agents, test your agents and improve your agents. That’s what it’s all about, and it’s happening.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1K4LAVCi8QNMrPhIDXzCoq/09968c485105f82d1305fd60ec7cb9f4/ChatGPT_Image_Jun_5__2026__03_26_52_PM.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AI agents are learning on the job — just not for your whole team]]></title>
            <link>https://venturebeat.com/orchestration/ai-agents-are-learning-on-the-job-just-not-for-your-whole-team</link>
            <guid isPermaLink="false">7uZ8DypEBhAAvj8zvgYIdD</guid>
            <pubDate>Fri, 05 Jun 2026 17:51:03 GMT</pubDate>
            <description><![CDATA[<p>When someone on a team corrects an AI agent — better prompts, better feedback, better context — that improvement disappears the moment a colleague opens the same tool. The correction doesn&#x27;t transfer, and the next person starts from zero.</p><p>The problem compounds in multi-agent workflows, where teams expect agents to share context across users and tasks. Without a shared memory layer, every team member effectively trains a different version of the same agent — and those versions never sync.</p><p>That gap shows up in the numbers. According to Asana&#x27;s own research, 75% of knowledge workers use AI on the job, but only 5% of companies have reported productivity gains. </p><p>“Model providers are getting really, really good at improving reasoning and retry loops, but what they’re not good at is bringing the enterprise work context in a way that human beings can reason about for shared memory,” Asana Chief Product Officer Arnab Bose told VentureBeat. </p><p>Asana had been building toward an agentic platform that centers context and shared memory. Its Agentic Work Management platform ensures that if any team member corrects an agent, that correction applies to everyone else on the team. </p><p>“That context graph is automatically provided to agents operating inside Asana’s system so you don’t have to have every human member of the team become an expert at prompt engineering or context engineering,” Bose said. </p><p>Bose said the <a href="https://venturebeat.com/orchestration/shared-memory-is-the-missing-layer-in-ai-orchestration">shared memory architecture</a> matters beyond Asana&#x27;s own product; it&#x27;s the design decision enterprises need to make for any multi-agent system.</p><p>Shared memory also becomes important when enterprises begin moving from simple single agents to multi-agent workflows that need to share context and behaviors. 
</p><h2>Memories for a multi-agent, multi-platform workflow</h2><p>The models powering agents are stateless by design, so memory becomes a dedicated layer outside of a context window. While this area of AI innovation is marching towards maturity, the question of what gets stored, who controls it, and how it stays consistent when different agents and users write to the same instance remains largely unsolved.</p><p>This is manageable for use cases with only one user. However, in enterprise agentic workflows, the idea is for agents to work with the entire team. Most platforms have agents that still act for individuals, which leads to repeated tasks, inconsistent versions of reality and the spread of mistakes. Agents could also contradict each other.</p><p>Sriharsha Chintalapani, co-founder and CTO of Collate, said in an email to VentureBeat that the lack of shared memory is a major obstacle for multi-agent workflows, particularly around consistency.</p><p>&quot;Agents are sensitive to the quality of their prompts,&quot; Chintalapani said. &quot;Someone with a strong understanding of the task will generally get more accurate results than someone less experienced. Partly that’s because they’re able to construct more detailed prompts, but also because they’re able to give the agent better feedback. The agent remembers the corrections it’s received and applies that knowledge to successive prompts. The more accurate the feedback, the better the agent will perform for that user.&quot;</p><p>He added that organizations should stop treating shared memory solely as a prompt engineering problem and think about building systems that repeat context across every conversation. 
</p><p>Neej Gore, chief data officer at Zeta Global, said in a separate email that shared context becomes a living memory that &quot;compounds intelligence across the enterprise.&quot;</p><p>The opportunity may lie in building AI agents that retrieve memory relationally, pulling in relevant context based on what&#x27;s being asked — an approach Chintalapani says few organizations outside the largest model providers are equipped to build.</p><h2>Personal versus team agents</h2><p>AI agents are already proliferating across enterprises, but many of them operate as personal agents doing work specific to individual users. Most prompts start from one person, files are uploaded by one account, and even agents living in a company-wide system mostly learn individual user preferences. </p><p>Most enterprise AI workflow platforms recognize that memory is important but approach it through different lenses. For example, Microsoft’s Copilot <a href="https://techcommunity.microsoft.com/blog/microsoft365copilotblog/introducing-copilot-memory-a-more-productive-and-personalized-ai-for-the-way-you/4432059">takes an individual-first approach by learning a user’s role</a> within the organization, tone preferences and working patterns, which are then stored as personal memories for the agent to apply across the different Microsoft 365 surfaces. </p><p>For engineering and orchestration teams evaluating agentic platforms, the shared memory question is now a procurement criterion — not just a technical nicety. An agent that learns only for the person using it will require ongoing individual upkeep. One connected to a team-wide memory layer builds institutional knowledge automatically.</p>]]></description>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/2ao1SxdtmqSty91VkWtsGQ/090c5e47ced9b77f97e134c0d3fdc488/crimedy7_illustration_of_a_robot_remembering_something_--ar_1_ca076720-6052-42d7-a901-a6539dc5dcfc_1.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Meta's AI support agent bound recovery emails for anyone who asked. Your SOC never saw an alert.]]></title>
            <link>https://venturebeat.com/security/meta-ai-support-agent-recovery-email-takeover-soc-audit-grid</link>
            <guid isPermaLink="false">5RjObu657JLLRcXlMWh4dH</guid>
            <pubDate>Fri, 05 Jun 2026 16:42:50 GMT</pubDate>
            <description><![CDATA[<p>Meta&#x27;s AI support agent bound recovery emails to accounts for whoever asked, and SOCs never saw an alert. An authorized agent writes a log of legitimate transactions, so nothing in the detection stack fired. Attackers asked the bot to make the change, took the one-time code it sent, <a href="https://www.404media.co/hackers-simply-asked-meta-ai-to-give-them-access-to-high-profile-instagram-accounts-it-worked/">and ran the password reset</a>, 404 Media reported.</p><p>No malware, no stolen credentials, and no prompt injection in the sense most security teams drill for. The agent did exactly what Meta built it to do. That is what should keep a security operations leader up at night: The takeover did not break a control; it rode one that was already trusted.</p><p>What a SOC needs is a way to walk each recovery path through an audit grid with its AI build team before the next renewal closes. The AI Authority Audit Grid at the end of this article maps every authentication write a support agent can make on the recovery path, what Meta&#x27;s incident proved about each one, why it stays dark to the SOC, and the control that closes it.</p><h2><b>The agent is an authorized actor, so the SOC reads the takeover as routine traffic</b></h2><p>From inside the detection stack, the attack produced no signal the stack could read. The agent binds a new email, then resets the password, and <a href="https://genai.owasp.org/llmrisk/llm062025-excessive-agency/">identity and access management</a> logs both writes as an authorized actor, so each lands in the authentication state as a legitimate transaction. No anomalous login, no failed-auth spike, nothing for EDR or DLP, no SIEM rule to match, because nothing in the sequence looks like an attack. The takeover lived inside the trust boundary the stack assumes is safe. 
There is no foothold to find, because the agent was the foothold, and it was supposed to be there.</p><p>The chain was almost insulting in its simplicity. Brian Krebs documented the <a href="https://krebsonsecurity.com/2026/06/hackers-used-metas-ai-support-bot-to-seize-instagram-accounts/">version pro-Iran hackers posted to Telegram on May 31</a>. The attacker <a href="https://www.bbc.com/news/articles/c98rzr72dpyo">switched on a VPN to appear in the victim&#x27;s region</a>, sidestepping Instagram&#x27;s location alarms, then asked the support assistant to add a new email and send a verification code, as the BBC confirmed from the same recordings. The bot complied, sending the one-time code straight to the attacker, <a href="https://gizmodo.com/hackers-tricked-meta-ai-into-handing-out-access-to-major-instagram-accounts-2000766087">Gizmodo reported</a>. The reset finished and the owner was locked out, in minutes. The exploit failed against any account with MFA enabled, according to Krebs.</p><p>The hijacked accounts were not soft targets. They included Sephora, U.S. Space Force senior enlisted leader Chief Master Sergeant John Bentivegna, researcher Jane Manchun Wong, and a dormant Obama White House handle that briefly posted a defaced image, according to <a href="https://www.404media.co/hackers-simply-asked-meta-ai-to-give-them-access-to-high-profile-instagram-accounts-it-worked/">404 Media</a>. <a href="https://techcrunch.com/2026/06/03/instagram-is-alerting-users-who-were-targeted-by-hackers-during-ai-chatbot-attacks/">Meta disputes the Obama account</a>, according to TechCrunch, and called claims that leaders&#x27; accounts were breached &quot;completely false,&quot; according to the BBC. The rest stand.</p><h2><b>MFA held. The recovery path beside it did not.</b></h2><p>The detail that decided who survived was narrow. Krebs reported the attack failed against any account with multifactor authentication, even SMS. The recovery path beside it was the gap. 
When that path asked for a selfie video, <a href="https://www.ghacks.net/2026/06/03/instagram-accounts-hijacked-by-tricking-meta-ai-support-into-verifying-attackers-as-owners/">attackers ran the target&#x27;s public photos through an AI video generator</a> and submitted the clip, which Meta accepted as valid identity verification, gHacks reported. Either way the failure was the recovery door, not the login door MFA guards.</p><p>That makes this an architecture problem, not a Meta problem. MFA gates the login path for owner and attacker alike, but the recovery path runs beside it, built to relax the usual checks because it exists for the moment a user has lost the normal way in. Meta put an agent on that path with write access to authentication state and no deterministic check between a convincing request and a committed change. Authorization cannot live inside the model, because a conversational system can be talked into skipping a check. It has to live outside the model, in a gate the agent cannot reason its way past. Security researchers have a name for this pattern, the confused deputy, a trusted system tricked into spending its privileges on an attacker&#x27;s behalf.</p><p>This is not the last support agent that will hand over an account. Ian Goldin, a threat researcher at Lumen&#x27;s Black Lotus Labs, told Krebs on Security that AI bots are as easy to social engineer as the human agents they replace, and just as eager to help. &quot;AI chatbots create interesting new attack surface, and we&#x27;re likely going to see a lot more of these kinds of attacks,&quot; Goldin said. Every enterprise wiring an agent into a recovery, provisioning, or password flow is shipping the same write access Meta did.</p><p>Simon Willison, who coined the term prompt injection, put it plainly on <a href="https://simonwillison.net/2026/Jun/1/hackers-simply-asked-meta-ai/">his blog</a>. 
&quot;Meta really did wire their support system into an AI chatbot that had the ability to fast-forward through the entire account recovery process,&quot; he wrote. &quot;This one hardly even qualifies as a prompt injection. Don&#x27;t wire your support bot up to allow one-shot account takeovers.&quot; The attacker never tricked the agent. The attacker asked, and the agent had untrusted input, write access, and a way to execute, all at once.</p><p>OWASP named this class before Meta shipped it, as Excessive Agency at <a href="https://genai.owasp.org/llmrisk/llm062025-excessive-agency/">LLM06</a> and Identity and Privilege Abuse at <a href="https://genai.owasp.org/2025/12/09/owasp-top-10-for-agentic-applications-the-benchmark-for-agentic-security-in-the-age-of-autonomous-ai/">ASI03 in the Agentic AI Top 10</a>. The warning label was on the box: Meta pushed the assistant to every Facebook and Instagram account in March, according to 404 Media, with the power to reset passwords and handle recovery, the product page promising &quot;solutions, not just suggestions&quot; under the line &quot;account security and recovery.&quot; Meta gave the agent the power and never built the gate to govern it.</p><h2><b>The AI Authority Audit Grid</b></h2><p>Security operations leaders need to run this against their own support agent before the next renewal closes. Each row is an authentication write the agent makes on the recovery path, with what Meta proved, why your stack misses it, and the control that closes it.</p><table><tbody><tr><td><p><b>Authentication write</b></p></td><td><p><b>What Meta proved</b></p></td><td><p><b>Why your stack misses it</b></p></td><td><p><b>Enterprise control and owner</b></p></td></tr><tr><td><p>Login authentication (MFA, factor prompts)</p></td><td><p>Held on login. Accounts with any MFA enabled, even SMS, survived (<a href="https://krebsonsecurity.com/2026/06/hackers-used-metas-ai-support-bot-to-seize-instagram-accounts/">Krebs</a>). 
The gap was the recovery path beside it.</p></td><td><p>MFA gates the login path for owner and attacker alike. It does not gate the recovery path beside it.</p></td><td><p>Enforce MFA as the baseline and extend step-up verification to the recovery path, the same standard login gets (<a href="https://cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html">OWASP</a>). A selfie video is not proof of identity. Any agent that operates on a path MFA does not cover fails the audit. Owner: IAM.</p></td></tr><tr><td><p>Email rebind</p></td><td><p>Full takeover. The agent bound attacker-controlled emails on request, taking Sephora and a U.S. Space Force account (<a href="https://www.404media.co/hackers-simply-asked-meta-ai-to-give-them-access-to-high-profile-instagram-accounts-it-worked/">404 Media</a>).</p></td><td><p>IAM logs the agent as an authorized actor, so the rebind reads as a legitimate transaction and no alert reaches the SOC or the account owner.</p></td><td><p>Confirm out-of-band to the existing verified contact before any rebind commits, gated outside the model, and notify the old address the moment it changes (<a href="https://community.ibm.com/community/user/blogs/shane-weeden1/2020/02/28/account-recovery-is-just-another-authentication-me">IBM</a>). An agent that rebinds without confirming the old address fails. Owner: IAM and platform engineering.</p></td></tr><tr><td><p>Password reset</p></td><td><p>Full takeover in minutes. Researcher Jane Manchun Wong was among the affected accounts (<a href="https://www.404media.co/hackers-simply-asked-meta-ai-to-give-them-access-to-high-profile-instagram-accounts-it-worked/">404 Media</a>).</p></td><td><p>The reset runs on the recovery path, outside the login MFA check, so no factor prompt fires and no detection rule triggers.</p></td><td><p>Require a second non-email factor before any reset completes. 
NIST dropped email as a valid out-of-band channel (<a href="https://pages.nist.gov/800-63-4/sp800-63b/authenticators/">NIST 800-63B</a>). An agent reset must clear the same gate a human reset does. Owner: IAM.</p></td></tr><tr><td><p>Recovery-method change</p></td><td><p>Persistent lockout. Victims could not self-recover. The support loop offered only AI with no human escalation (<a href="https://www.bleepingcomputer.com/news/security/instagram-users-locked-out-after-meta-ai-abused-to-steal-accounts/">BleepingComputer</a>).</p></td><td><p>A silent swap of the recovery email or phone removes the owner&#x27;s re-entry path with no SOC visibility.</p></td><td><p>Require step-up review on any change, notify the prior method, and grant time-delayed, reduced-scope access after recovery so a swap never hands over instant control (<a href="https://www.authsignal.com/blog/articles/account-recovery-is-the-identity-industrys-most-overlooked-challenge">Authsignal</a>). Keep a human escalation path the agent cannot close. Owner: GRC and IT operations.</p></td></tr><tr><td><p>Account-action execution</p></td><td><p>Speed risk. A dormant Obama White House handle briefly showed a defaced image during the spree, an account Meta disputes was taken this way (<a href="https://techcrunch.com/2026/06/03/instagram-is-alerting-users-who-were-targeted-by-hackers-during-ai-chatbot-attacks/">TechCrunch</a>).</p></td><td><p>The agent executes irreversible state changes in seconds with no human in the loop and no reversibility window.</p></td><td><p>Separate decision from execution. The agent only proposes the action. A policy service validates scope and approval before it runs, with approval bound to the exact action (<a href="https://cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html">OWASP</a>). No auth-state write commits without that gate and a reversibility window. 
Owner: platform engineering and the AI build team.</p></td></tr><tr><td><p>Agent action logging</p></td><td><p>Detection gap. The takeover left no alert, and Meta has not published how many accounts fell before the patch (<a href="https://techcrunch.com/2026/06/03/instagram-is-alerting-users-who-were-targeted-by-hackers-during-ai-chatbot-attacks/">TechCrunch</a>).</p></td><td><p>Without per-action telemetry piped to the SIEM, an authorized-agent takeover is invisible to the SOC.</p></td><td><p>Emit structured decision metadata for every auth-state write into the SIEM: action class, authorization outcome, approval ID, result, policy version (<a href="https://cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html">OWASP</a>). A write your SIEM cannot see is a write you cannot defend. Owner: SOC and detection engineering.</p></td></tr></tbody></table><p>The fix is not bolting yet another MFA prompt onto the login screen. The people who survived Meta’s incident were the ones who already had that control in place.</p><p>The fix is pulling authorization out of the recovery path’s honor system and putting it behind a gate that does not move just because a prompt sounds convincing. Build the agent so the SOC sees every write it makes, and so any write that changes who owns an account cannot commit without a check that the model does not control.</p><p>Meta just showed what happens when the most trusting employee on the team is also the one holding the keys. The next agent like that is already reading your intellectual property and financials.</p>]]></description>
            <author>louiswcolumbus@gmail.com (Louis Columbus)</author>
            <category>Security</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/38WzL1nPF6nI78v2ZkNhqv/24ab8b1e4f85b1a21a123522a39ded31/hero_image.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Anthropic says 80% of its new production code is now authored by Claude — how your enterprise can keep up]]></title>
            <link>https://venturebeat.com/technology/anthropic-says-80-of-its-new-production-code-is-now-authored-by-claude-how-your-enterprise-can-keep-up</link>
            <guid isPermaLink="false">7jx1tXydvxRPlFtfPPjhby</guid>
            <pubDate>Thu, 04 Jun 2026 20:25:00 GMT</pubDate>
            <description><![CDATA[<p>Anthropic co-founder and CEO Dario Amodei <a href="https://medium.com/@coders.stop/dario-amodei-said-90-of-code-will-be-ai-written-in-6-months-6b8060720d97">said it was coming</a>, but it still feels like a milestone: More than 80% of the code merged into Anthropic’s production codebase in May wasn&#x27;t authored by humans, but by its own AI model, Claude, according to a <a href="https://www.anthropic.com/institute/recursive-self-improvement">new report shared by the record-breaking AI startup today.</a></p><p>This transformation has triggered an <a href="https://x.com/AnthropicAI/status/2062568864240836995">8x increase in the volume of code</a> shipped per engineer per quarter compared to the company’s 2021–2025 baseline, which the company notes means even more code someone or something must review.</p><p>For enterprise technical leaders, this is no longer a localized research curiosity; it&#x27;s a new, aggressive competitive baseline. </p><p>If a frontier AI laboratory can successfully offload the vast majority of its engineering output to autonomous agents — showing signs of the long-sought AI Holy Grail of &quot;<a href="https://en.wikipedia.org/wiki/Recursive_self-improvement">recursive self-improvement</a>,&quot; models that can independently research and upgrade themselves — what&#x27;s preventing enterprises across other sectors from automating more of their internal software development with AI agents, too? </p><p>Obviously, it&#x27;s easier said than done. Anthropic is one of the principal creators of the current gen AI boom, so you&#x27;d expect them to know how to deploy the technology effectively.</p><p>But for other enterprises looking to bump up the amount of code and workflows handled by agents, Anthropic&#x27;s new blog post details the outlines of a general plan they too can adopt to re-engineer their operations and workflows to take advantage of the latest AI advances. 
</p><h2><b>Anthropic&#x27;s roadmap that other enterprises can follow</b></h2><p>The transition from human-centric coding to autonomous orchestration requires understanding the evolution of AI capabilities. Anthropic outlines a clear historical continuum that enterprises can map onto their own digital transformation roadmaps: </p><ul><li><p><b>2021–2023 (Manual Writing):</b> Engineers write code and documentation natively within local text editors. </p></li><li><p><b>2023–2025 (Chatbot Assistance):</b> Developers use early models to generate brief code snippets, copying and pasting outputs manually into their environments. </p></li><li><p><b>2025–2026 (Coding Agents):</b> Capable agents actively write and edit entire files autonomously. </p></li><li><p><b>Present Day (Autonomous Agents):</b> Agents execute code independently, debug live environments, and delegate multi-hour work streams to specialized sub-agents. </p></li></ul><p>This rapid evolution is validated by external benchmarks. Software engineering evaluation frameworks like SWE-bench—which tasks models with resolving real bug reports in complex, open-source codebases—have saturated over a two-year window. </p><p>Furthermore, long-duration capability evaluations demonstrate that models like Claude Opus 4.6 can reliably sustain operations on 12-hour tasks, while Claude Mythos Preview pushes past 16 hours of continuous problem-solving. </p><p>Internally, the technological leap is even more stark. On highly complex, open-ended engineering problems where clear specifications are initially absent, Claude’s success rate climbed to 76% in May 2026 — a 50-point increase in a six-month window. </p><p>In isolated optimization benchmarks, where models are tasked with accelerating AI model training code, Anthropic’s internal Mythos Preview model achieved a 52x speedup. 
</p><p>For comparison, a skilled human developer typically requires four to eight hours of manual refactoring to achieve a mere 4x speedup on the exact same codebase. </p><h2><b>3-step plan to more complete production code automation</b></h2><p>For an enterprise to replicate Anthropic&#x27;s 80 percent milestone, technical decision-makers must abandon the &quot;developer assistant&quot; mental model and transition to an &quot;automated factory&quot; architecture. This shift impacts product management, operations, and developer workflows in three distinct ways: </p><h3>1. Shift from Code Execution to Architectural Oversight</h3><p>When code generation costs near zero in human time, the primary engineering role shifts from writing software to specifying goals and reviewing outputs. Enterprise leaders must retrain developers to act as systems architects and judges. As one Anthropic employee noted regarding the operational reality of this shift: </p><blockquote><p>&quot;The shape of stuff today is roughly ‘humans have ideas, and the models are able to implement, test and evaluate them an [order of magnitude] faster than before.’&quot; </p></blockquote><h3>2. Overcome The Code Review Bottleneck</h3><p>Injecting vast quantities of AI-generated code into an organization inevitably creates operational friction.</p><p>According to <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s law</a>, the speedup of any process is strictly limited by its serial, non-automated bottlenecks.</p><p>At Anthropic, flooding the system with synthetic code instantly turned human code review into a critical bottleneck. </p><p>To counter this, enterprise teams must deploy automated AI code reviewers directly into their Continuous Integration/Continuous Deployment (CI/CD) pipelines. 
</p><p>Anthropic implemented an automated Claude reviewer (a publicly accessible version, <a href="https://venturebeat.com/technology/anthropic-rolls-out-code-review-for-claude-code-as-it-sues-over-pentagon">Claude Code Review</a>, rolled out for commercial use in March) tasked with analyzing every pull request for architectural defects, security flaws, and regression bugs before merging. Other dedicated firms like <a href="https://venturebeat.com/programming-development/qodo-teams-up-with-google-cloud-to-provide-devs-with-free-ai-code-review-tools-directly-within-platform">Qodo</a> offer tools tailor-made for this purpose, as well. </p><p>In Anthropic&#x27;s case, retrospective analyses indicated that the automated layer caught approximately one-third of the production bugs responsible for historical outages on the flagship claude.ai website.</p><h3>3. Target High-Volume Operational Debt</h3><p>Enterprises are frequently paralyzed by legacy code maintenance and long-deferred technical debt. Rather than deploying agents to write speculative new features, technical leaders should direct autonomous agents toward closed-loop, painstaking cleanup operations.</p><p>In April 2026, an Anthropic engineer deployed Claude to resolve a persistent class of API errors. Operating autonomously, the model shipped more than 800 individual fixes, successfully reducing the error rate by a factor of 1,000. </p><p>The supervising engineer estimated that a human developer would have spent four full years executing the same work, due to the cognitive load of holding massive, unfamiliar code context in their head simultaneously. 
</p><h2><b>Considerations for enterprises moving forward in an age of primarily AI-generated code</b></h2><p>Operating a codebase predominantly authored by AI introduces unique governance challenges that enterprise legal and security teams must navigate.</p><p>Unlike open-source licensing models (such as the permissive MIT license or copyleft GPL frameworks), enterprise codebases utilizing proprietary LLM infrastructure remain subject to the commercial terms of service of the respective AI vendor. </p><p>The deployment of autonomous agents requires rigorous verification protocols to ensure compliance, security, and intellectual property protection:</p><ul><li><p><b>Code Quality and Maintenance:</b> Anthropic’s internal data indicates that while AI-authored code was objectively lower in quality than human output in late 2025, it reached rough parity by mid-2026, with expectations to surpass human standards within the year. Enterprise governance must adapt to a reality where the baseline quality of automated output is structurally superior to average manual coding. </p></li><li><p><b>Security Auditing at Scale:</b> The sheer volume of automated code creation demands automated vulnerability discovery. Anthropic’s Project Glasswing illustrates the scale of this issue: utilizing Mythos Preview, the project identified more than 10,000 high- and critical-severity software vulnerabilities across global digital infrastructure within its first few weeks. This shifted the enterprise cybersecurity challenge entirely from vulnerability <i>discovery</i> to patch <i>deployment</i> velocity. </p></li><li><p><b>The Risk of Alignment Cascades:</b> Technical leaders must maintain strict verification gates. 
If an enterprise uses an AI system to continuously modify, maintain, and expand its proprietary software infrastructure, undetected errors or subtle misalignments can compound over successive agent sessions, gradually corrupting system integrity or introducing security exploits that escape human notice. </p></li></ul><h2><b>Brace for internal enterprise culture disruption</b></h2><p>The transition to an AI-dominated codebase is altering the cultural dynamics of engineering teams, introducing both unprecedented efficiency and deep psychological friction.</p><p>Publicly, Anthropic framed these metrics as a harbinger of a broader transformation. In an <a href="https://x.com/AnthropicAI/status/2062568862479208923">official statement on X</a>, the company observed:</p><blockquote><p>&quot;Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor. It’s happening faster than we thought, and the implications deserve greater attention.&quot; </p></blockquote><p>They expanded on the immediate productivity implications shortly thereafter:</p><blockquote><p>&quot;Today, Anthropic engineers on average ship 8x as much code per quarter as they did compared to 2021-2025... Many engineers also say Claude’s code quality is now on par with human code; we expect it to be better within the year.&quot; </p></blockquote><p>Behind these corporate metrics lies a complex human reality. Internal employee communications reveal a distinct erosion of traditional workplace collaboration, as peer-to-peer developer interaction is systematically replaced by asynchronous agent calls:</p><blockquote><p>&quot;Work (and life) ran on a gift economy of small favors between humans. ‘Can you help me get this script running?’ [...] each one created a little debt, a little mutual awareness. Claude has eaten the favors. 
It’s faster, it creates zero debt, but each of these is a lost bid for human collaboration.&quot; </p></blockquote><p>For individual contributors, the total automation of their primary skill set introduces acute professional anxiety regarding relevance and systemic control:</p><blockquote><p>&quot;I started leaning hard into Claudifying about a year ago. That’s been a crazy adventure and it’s now been ~5 months since I last wrote any code myself.&quot; </p></blockquote><blockquote><p>&quot;On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don&#x27;t understand why and I realize I have no idea what I’ve been up to anymore.&quot; </p></blockquote><p>Enterprise leaders aiming to match Anthropic’s technical velocity cannot afford to ignore these psychological dynamics. </p><p>Achieving an 80 percent automated codebase requires more than purchasing API tokens or configuring agent loops; it demands a total cultural overhaul, a strategy for mitigating developer obsolescence anxiety, and the implementation of rigorous, automated verification guardrails to maintain ultimate human control over the software stack. </p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/6TDfdDR3BaglHMnVTBvvmB/ebc812673e673345d4466f174868cc17/ChatGPT_Image_Jun_4__2026__04_47_29_PM.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
    </channel>
</rss>