VentureBeat

Microsoft's AI Futurist explains how he uses Copilot — and the real-world problems enterprises are solving with agents

carl.franzen@venturebeat.com (Carl Franzen) — Fri, 05 Jun 2026 19:31:00 GMT

Microsoft used its Build 2026 conference this week to push a clear message: agents are rapidly moving into production throughout enterprise systems, and the winning platform will be the one that gives them reliable context, governance, identity, memory — and secure access to enterprise data.

The company announced Microsoft IQ as a context layer across GitHub Copilot, Microsoft Foundry and Copilot Studio; Work IQ APIs coming June 16; Fabric IQ for structured business data; Foundry IQ for retrieval across enterprise knowledge and the live web; and Web IQ as a new agent-facing web search stack.

Microsoft also introduced Scout, a personal work agent, and a whopping seven new in-house AI models in its growing MAI family across modalities and use cases, including MAI-Thinking-1.

Those announcements sit directly in Marco Casalaina’s lane. Casalaina is Microsoft’s VP Products, Core AI and AI Futurist. He leads Microsoft’s AI Futures team and previously led teams across Azure AI, including Azure OpenAI, Vision, Speech, Decision, Language, Responsible AI and AI Studio.

Before Microsoft, he led Salesforce’s Einstein AI team and earned a computer science degree from Cornell University. CRN reported that he joined Microsoft in early 2022 as vice president of products for Azure Cognitive Services, meaning he has now been at the company for more than four years.

VentureBeat spoke with Casalaina ahead of Build about Microsoft’s agent strategy, the company’s model-choice philosophy, how Microsoft IQ fits with MCP, and why he believes enterprises need far more than just access to powerful models. The interview below has been edited for clarity and condensed from the transcript.

VentureBeat (VB): To start, can you explain your role at Microsoft and what “AI Futurist” means in practice?

Marco Casalaina (MC): I am VP Products of what we call Core AI. Core AI is our set of tools for AI developers, and that includes Foundry, Visual Studio, VS Code, GitHub and GitHub Copilot. That’s our overall group.

My Silicon Valley title is AI Futurist, and that has a very concrete meaning here. I’ve worked with other folks who are considered futurists, like Peter Schwartz, and that can be a little bit more fuzzy. For me, what it means concretely is that I am the first person to try anything new here.

I am constantly getting things from all over Microsoft, not even just Foundry, because I work with really everybody across the company. Pretty much everybody sends me the new things at all times. Even today, I got something brand new just before this call. I’m usually the first person to try anything new here, which is pretty cool. I get to see a lot of really cool stuff.

A friend of mine, who is head of AI at Intuit, calls me an “adjacent possiblist.” I consider my futurist concept to be about a year out from now — the immediate future of what’s about to happen next. That’s what I focus on.

VB: Where are you looking at the agentic state of things, and in particular Microsoft’s position as enterprises and individuals rush to adopt agentic AI?

MC: We can look at it from bottom to top. At the very base of the stack is our commitment to model choice. All along, we’ve had the OpenAI GPT frontier models. Now we have a really solid partnership with Anthropic, where we’re offering the Claude models. We just launched Claude Opus 4.8 on Azure — on Foundry, I should say — and at Build, we are introducing our new MAI model.

The MAI models are a set of frontier models that we’re building in-house. They are made for token efficiency, optimization and customization. We are specifically making them for our customers to customize on their own data sets.

One level above that, we are announcing hosted agents in Foundry. That is our managed agent capability in Foundry. It automatically handles scaling, containerization and those kinds of things. It is an environment where you can manage agents.

One level above that is the Foundry control plane. At least for the agents you build, you want to have control over them. This gives you observability into their cost, tokens and correctness. You can do continuous evaluations and sample interactions with those agents, run evals and make sure they are continuing to work and not drifting.

The big news is going to be the GA of what we call the IQs here at Microsoft. There are currently three, and there will be four. There is Foundry IQ, which is basically for knowledge — largely unstructured knowledge. There is Fabric IQ. We have a ton of customers who have entrusted a lot of data to the Microsoft Cloud in Fabric, Power BI and related technologies. Fabric IQ is about making an agent-facing interface for this data, so agents can get to it without literally going through a Power BI report. That’s ridiculous.

Work IQ is about the Microsoft ecosystem. You can look at Work IQ as the agentic face of all the Microsoft apps: Outlook, Teams, Word, SharePoint and all those kinds of things. How does an agent interact with those things? That is Work IQ.

And finally, the fourth IQ is Web IQ. We are releasing our new agent-facing web search capability. It can search the web, search through videos and even do some kinds of browsing tasks automatically. It is super fast, and it kind of has no face. It’s headless. The interface is intended for agents.

We will also be announcing Agent Optimizer. That includes a new type of evaluation that allows you to evaluate much more granularly whether an agent is actually working and working correctly. The optimization step can go back in and make modifications to the prompt, obviously with your consent, and modify your agent so it works more correctly going forward. Effectively, it creates a feedback loop to make agents work better.

VB: Microsoft has sometimes been criticized for murky and clunky product naming. Where do these IQ products sit? Are enterprise users supposed to go to IQ first, or is IQ more for developers to connect to?

MC: All of the IQs are headless. The concept of IQ is that each one provides a different type of context to an agent specifically. Largely, it will be developers interacting with the various IQs — developers and the agents they build.

The IQ brand is really about agent context. End users largely won’t interact with the IQs. It is true that if you use Microsoft 365 Copilot today, you’ll notice a little thing that says it is using Work IQ. So it is a little bit visible, but the customer or end user doesn’t have to go find the IQ. Their system or developers hook that up.

VB: Is the IQ family essentially Microsoft’s version of MCP? Is it using MCP, or is it something different?

MC: All of the IQs are indeed exposed as MCP servers. You have correctly characterized MCP as basically an agent-facing or self-describing API. It’s not that fancy. That’s really what it is, with some authentication layers and capabilities built in, which is super useful.

Something like Work IQ — really all the IQs — have to be authenticated. In order for Work IQ to see my email, Teams messages, documents and stuff like that, I have to be able to authenticate it on behalf of me.

That gets us to another core differentiator that we will be announcing at Build, which is agent identity. We have this Entra system, and Entra is, I believe, the world’s largest used identity system for human users. For some time now, you have been able to declare an agent to have an identity in there. Now, agents will be able to have their own identity, their own Teams box, their own email inbox and stuff like that.

These agents will use Work IQ to check their own email, check their own documents and that sort of thing.

VB: Enterprises are not one-size-fits-all on models. Microsoft supports many leading models through Foundry and Azure, while also building its own. Is Microsoft a model company, an infrastructure company or a connector between models and work products?

MC: The answer is yes. We are obviously the hyperscaler. We are absolutely committed to model choice, and we will continue to offer the frontier models from all of the major players: OpenAI, Anthropic, Mistral, Black Forest, xAI — you name it. They are all going to be represented in there.

At the same time, we have what is now called our Microsoft AI Superintelligence Team, formed by Mustafa Suleyman, and we are building our own frontier models as well. Like I said earlier, we are really gearing these models toward optimization — token efficiency, bang for the buck and customization.

These are things our customers have been asking for: the ability to more finely customize models, whether that is fine-tuning or continued pre-training. Continued pre-training is literally changing the weights of the model, whereas fine-tuning is adding a little layer on top.

We have these capabilities in Foundry: fine-tuning, distillation and those kinds of things. I would note, by the way, that our MAI models are not distilled. Some model providers, especially some of the less scrupulous ones, will distill other models into theirs, and that can have unusual effects. We don’t do that. The data provenance of our models is of primary importance to us.

When we come out with these models, we want our customers to know that the data provenance is clean in terms of the rights to the data, where it came from and all that kind of stuff.

The choice thing also goes above the model layer. When we talk about Foundry hosted agents, we have the Microsoft Agent Framework. You talk about agent orchestration — how you make agents work together when you have multiple agents — and Microsoft Agent Framework is an excellent framework for that.

However, I can make a LangGraph or LangChain Foundry hosted agent. I can make a CrewAI Foundry hosted agent. I can use any number of orchestration frameworks and put that up as a Foundry hosted agent, and it becomes a first-class Foundry agent.

That means I get the observability. It shows up in the Foundry control plane. I can do evaluations on it. I can do traces on it. I can get all those things from the Foundry control plane with an agent built in really any framework I choose.

VB: Some companies are interested in Chinese and open-source models. How much of Microsoft offering its own models is about giving customers an American version of that?

MC: I can’t speak to that exactly. Of course, we offer DeepSeek models and Qwen models in Foundry, so we offer all of these choices today, and our customers can make that choice.

The MAI models are really focused on token efficiency and customizability. That is what our customers are demanding, and that is the gap we are filling.

VB: As agents take on longer tasks and more specialized work, will enterprises keep expanding the number of models they use, or will there be a winnowing?

MC: I do see it expanding. We are not just focused on tokens per se. A token is not a token is not a token. One token is not necessarily equivalent across these things. It is all about what you are doing with each token and the efficiency of that. It comes back to what kind of value you are getting for the cost. That is a lot of the rationale behind why we are developing our own MAI models.

Part of my job is to travel all around the world. I’ve been all over the place. For example, I’ve been working with Bayer. One of the things we are measuring is not just token usage, but number of users — monthly active users and daily active users — because we have a lot of first-party capabilities like Microsoft 365 Copilot. Over the last year, we’ve seen a 6x increase in monthly active users. We have over 20 million users of Microsoft 365 Copilot alone.

That is on the agents you use. In terms of the agents you build, Bayer put up its own agent system on Foundry, and now it has 20,000 of its own employees on it.

A few weeks ago, I was in Sydney, Australia, hanging out with AEMO, the Australian Energy Market Operator. They operate the electrical grid of Australia. They showed me that they had built agents to manage grid operations.

This is a human-centered thing. They have grid operators sitting in centers in West Sydney, Brisbane and places like that, and they are bombarded with alerts. I wouldn’t believe it if I hadn’t seen it myself. The alerts are constant. They built a system to triage those alerts. Is this alert a super major thing, or is it just that a transformer is getting a little hot? It also says, here is when we had this problem last time, and here is how we resolved it last time. Maybe now we need to replace this component, or whatever.

Ultimately, it is the grid operators making the choice. A lot of our philosophy here is human empowerment. These human-centered agents are the ones that are working best among our customers. What I saw at AEMO and Bayer is this notion of human empowerment: taking away some of the grunt work, or in the case of AEMO, taking billions of alerts and reducing them to something much more manageable and actionable for the people involved.

We are moving past the era where agents are just answering questions. AI in general is moving past that. We are not just answering questions anymore. We are moving toward a place where AI can really meaningfully help you do your work.

VB: How do observability, tokenomics, ROI analysis and agent governance fit into Microsoft Foundry?

MC: That is what the Foundry control plane is all about. We introduced it in November of last year. If you looked at my own Foundry control plane — I’ve built a ton of these agents, and I am a developer by background — you would see all of my agents that are running and the ones that are paused.

I can see how many tokens they’ve used over the last day, week or month. I can look at trends. I can look at costs, because the cost will be different depending on what underlying model I’m using. If I’m using our model router, it can route to different models depending on the complexity of the inbound prompt.

We also have Azure cost management overall. Azure has had cost management for over a decade, before the AI thing even happened. This integrates with overall Azure cost management.

It is not just narrowly about what your AI is doing. Your AI will be using storage resources, data resources and other compute resources around that AI. You can get a complete picture of not just the cost and token usage of the AI itself, but everything around it.

When you think about governance, that also extends to evaluation. One of the things we are releasing in preview is rubric-based evaluation. Rubric-based evaluation is much more granular.

Let’s say you have built a restaurant reservation agent. The things you want to test about that agent are not really groundedness. Groundedness is the opposite of hallucination, and that is very question-answering. For a restaurant reservation agent, you want to test very granular things. If you say, “Make me a table for two tomorrow,” did it come back and ask, “What time would you like the table?” Before it gave you a table for two tomorrow at 6 p.m., did it actually check that the table was available, or did it randomly give you a table without checking first?

There are very granular things you want to test about that specific use case. You don’t just want to test whether the agent works. You want to test whether the agent works right.

That is what we are approaching with our new rubric-based evaluation system. You will see that in Satya’s keynote. I have been using it myself lately, and I’m very happy about it. I’ve been waiting for this.

VB: Microsoft is also partnering with companies like Anthropic and allowing Claude to work with Microsoft 365. How important is Copilot to this story? Why would someone turn to Copilot over other options?

MC: Microsoft 365 Copilot is a huge advantage for us. As I mentioned, we crossed the 20 million user mark on Copilot relatively recently.

The great thing about that is that it is the face. When you go into Foundry and make an agent, there is a button that says “publish to Copilot” — actually, it says “publish to Copilot in Teams,” because you can put it in Teams too.

The idea is that you want to put these agents where your users are. A lot of people who use the Microsoft ecosystem are in Teams, or they are using Copilot. I can create a custom agent, as many of my colleagues have, and now it is in Copilot, which I use maybe 50 times a day.

Since January, Copilot has become more and more capable. I now use it to draft my email. I am not just using it for question answering. I’m starting to use it to manage my calendar and draft emails. I really do this every day now.

When I want to use a custom agent — for example, to file my expenses, because we have a custom agent for that now — I can access that agent not in some random standalone interface, but in Copilot or Teams, where I already am.

That surface area that people are already engaging with is a major advantage.

VB: As people offload more repetitive work to AI, what are they able to spend more time doing?

MC: Let’s consider something I did yesterday. I got an email from a customer named Frankie, and he asked me a question about Foundry hosted agents. I knew the answer because I had talked to my colleague Jeff Holland, who is the head of our hosted agents product management. I had asked Jeff the same question two weeks ago.

Where or how I asked him, I don’t remember. Was it in Teams? Was it email? Was it a meeting? I don’t really remember. But I knew the answer to the question Frankie was asking.

So I went into Copilot and said, “Answer Frankie’s question about how hosted agents scale, and reference the conversation I had with Jeff a couple of weeks ago on this same topic.” And it did it. It drafted the email.

Over time, I have taught Copilot my style. I don’t do the bold-print thing. I tell it: don’t use em dashes and that kind of stuff. I have a certain style in the way I write emails. It’s a little terse, to be perfectly honest, but I want it to be the way I write.

It drafted this thing. It searched through my Teams messages, my emails and the transcripts of my meetings with Jeff. It used Work IQ, as a matter of fact. It found the answer, drafted the email and provided a link to the documentation that specifically covered the question Frankie was asking.

I looked at the draft and thought, yep, that’s it.

Yes, I could have composed this email myself. I knew the answer to the question. I could have looked up the documentation. If I dug around, I’m sure I could have found the conversation I had with Jeff in whatever medium that was. I could have done that stuff. It probably would have taken me, I don’t know, an hour to find all the information and compose it.

Instead, I did it in about a minute. I had a draft, I looked at it, I was happy with it, I pressed send, and that was the end of that.

It really is about giving people time back. It is not even just grunt work. It is all this time you spend looking things up and finding things. Now, I can make it take an action. It didn’t just answer the question. It fully drafted the email and copied Jeff.

VB: Do you fear for your job? How has AI changed your own work?

MC: I don’t fear for my job. My job has changed. For one thing, I do a lot more now, both in my business life and personal life.

This weekend I was using Web IQ, the new Web IQ. I’ve been car shopping. My car’s lease is coming up, and there is a very specific car I’m trying to find, which is hard to find. It’s a Hyundai Ioniq 6, which Hyundai, for whatever reason, has stopped offering in the United States. I’m going to get one, though.

I set my agent to the task, using Web IQ, of finding all the Hyundai Ioniq 6s available in the entire Bay Area — everywhere, all the way out to Sacramento, all the way as far south as Gilroy. I set it to this task, and then I went on a hike.

When I got back, I had a big long list of all the Hyundai Ioniq 6s, at least the 2024 and 2025 models, available in the entire Bay Area. From that, I started calling down these dealers.

Even in my personal life, I’m using it constantly. It saves me a ton of time. That would have taken me hours, to go through every single dealer’s inventory like this. But Web IQ could do that, and it was super quick.

VB: Any final thought for developers around this news?

MC: Foundry is really the place. This is the place where you can build your agents, scale your agents, test your agents and improve your agents. That’s what it’s all about, and it’s happening.

AI agents are learning on the job — just not for your whole team

Fri, 05 Jun 2026 17:51:03 GMT

When someone on a team corrects an AI agent — better prompts, better feedback, better context — that improvement disappears the moment a colleague opens the same tool. The correction doesn't transfer, and the next person starts from zero.

The problem compounds in multi-agent workflows, where teams expect agents to share context across users and tasks. Without a shared memory layer, every team member effectively trains a different version of the same agent — and those versions never sync.

That gap shows up in the numbers. According to Asana's own research, 75% of knowledge workers use AI on the job, but only 5% of companies have reported productivity gains.

“Model providers are getting really, really good at improving reasoning and retry loops, but what they’re not good at is bringing the enterprise work context in a way that human beings can reason about for shared memory,” Asana Chief Product Officer Arnab Bose told VentureBeat.

Asana had been building toward an agentic platform that centers context and shared memory. Its Agentic Work Management platform ensures that if any team member corrects an agent, that correction applies to everyone else on the team.

“That context graph is automatically provided to agents operating inside Asana’s system so you don’t have to have every human member of the team become an expert at prompt engineering or context engineering,” Bose said.

Bose said the shared memory architecture matters beyond Asana's own product; it's the design decision enterprises need to make for any multi-agent system.

Shared memory also becomes important when enterprises begin moving from simple single agents to multi-agent workflows that need to share context and behaviors.

Memories for a multi-agent, multi-platform workflow

The models powering agents are stateless by design, so memory becomes a dedicated layer outside of a context window. While this area of AI innovation is marching towards maturity, the question of what gets stored, who controls it, and how it stays consistent when different agents and users write to the same instance remains largely unsolved.

This is manageable for use cases with only one user. However, in enterprise agentic workflows, the idea is for agents to work with the entire team. Most platforms have agents that still act for individuals, which leads to task repeating and inconsistent versions of reality and spreading mistakes. Agents could then also contradict each other.

Sriharsha Chintalapani, co-founder and CTO of Collate, said in an email to VentureBeat that the lack of shared memory is a major obstacle for multi-agent workflows particularly around consistency.

"Agents are sensitive to the quality of their prompts," Chintalapani said. "Someone with a strong understanding of the task will generally get more accurate results than someone less experienced. Partly that’s because they’re able to construct more detailed prompts, but also because they’re able to give the agent better feedback. The agent remembers the corrections it’s received and applies that knowledge to successive prompts. The more accurate the feedback, the better the agent will perform for that user. "

He added that organizations should stop treating shared memory solely as a prompt engineering problem and think of building systems that repeat context across every conversation.

Neej Gore, chief data officer at Zeta Global, said in a separate email that shared context becomes a living memory that "compounds intelligence across the enterprise."

The opportunity may lie in building AI agents that retrieve memory relationally, pulling in relevant context based on what's being asked — an approach Chintalapani says few organizations outside the largest model providers are equipped to build.

Personal versus team agents

AI agents already proliferate enterprises; it’s just that many of these operate as personal agents doing work specific to individual users. Most prompts start from one person, any files are uploaded by one account, and even for agents living in a company-wide system mostly learn individual user preferences.

Most enterprise AI workflow platforms recognize that memory is important but approach it through different lenses. For example, Microsoft’s Copilot takes an individual-first approach by learning a user’s role within the organization, tone preferences and working patterns, which are then stored as personal memories for the agent to apply across the different Microsoft 365 surfaces.

For engineering and orchestration teams evaluating agentic platforms, the shared memory question is now a procurement criterion — not just a technical nicety. An agent that learns only for the person using it will require ongoing individual upkeep. One connected to a team-wide memory layer builds institutional knowledge automatically.

Meta's AI support agent bound recovery emails for anyone who asked. Your SOC never saw an alert.

louiswcolumbus@gmail.com (Louis Columbus) — Fri, 05 Jun 2026 16:42:50 GMT

Meta's AI support agent bound recovery emails to accounts for whoever asked, and SOCs never saw an alert. An authorized agent writes a log of legitimate transactions, so nothing in the detection stack fired. Attackers asked the bot to make the change, took the one-time code it sent, and ran the password reset, 404 Media reported.

No malware, no stolen credentials, and no prompt injection in the sense most security teams drill for. The agent did exactly what Meta built it to do. That is what should keep a security operations leader up at night: The takeover did not break a control; it rode one that was already trusted.

What a SOC needs is a way to walk each recovery path through an audit grid with its AI build team before the next renewal closes. The AI Authority Audit Grid at the end of this article maps every authentication write a support agent can make on the recovery path, what Meta's incident proved about each one, why it stays dark to the SOC, and the control that closes it.

The agent is an authorized actor, so the SOC reads the takeover as routine traffic

From inside the detection stack, the attack produced no signal the stack could read. The agent binds a new email, then resets the password, and identity and access management logs both writes as an authorized actor, so each lands in the authentication state as a legitimate transaction. No anomalous login, no failed-auth spike, nothing for EDR or DLP, no SIEM rule to match, because nothing in the sequence looks like an attack. The takeover lived inside the trust boundary the stack assumes is safe. There is no foothold to find, because the agent was the foothold, and it was supposed to be there.

The chain was almost insulting in its simplicity. Brian Krebs documented the version pro-Iran hackers posted to Telegram on May 31. The attacker switched on a VPN to appear in the victim's region, sidestepping Instagram's location alarms, then asked the support assistant to add a new email and send a verification code, as the BBC confirmed from the same recordings. The bot complied, sending the one-time code straight to the attacker, Gizmodo reported. The reset finished and the owner was locked out, in minutes. The exploit failed against any account with MFA enabled, according to Krebs.

The hijacked accounts were not soft targets. They included Sephora, U.S. Space Force senior enlisted leader Chief Master Sergeant John Bentivegna, researcher Jane Manchun Wong, and a dormant Obama White House handle that briefly posted a defaced image, according to 404 Media. Meta disputes the Obama account, according to TechCrunch, and called claims that leaders' accounts were breached "completely false," according to the BBC. The rest stand.

MFA held. The recovery path beside it did not.

The detail that decided who survived was narrow. Krebs reported the attack failed against any account with multifactor authentication, even SMS. The recovery path beside it was the gap. When that path asked for a selfie video, attackers ran the target's public photos through an AI video generator and submitted the clip, which Meta accepted as valid identity verification, gHacks reported. Either way the failure was the recovery door, not the login door MFA guards.

That makes this an architecture problem, not a Meta problem. MFA gates the login path for owner and attacker alike, but the recovery path runs beside it, built to relax the usual checks because it exists for the moment a user has lost the normal way in. Meta put an agent on that path with write access to authentication state and no deterministic check between a convincing request and a committed change. Authorization cannot live inside the model, because a conversational system can be talked into skipping a check. It has to live outside the model, in a gate the agent cannot reason its way past. Security researchers have a name for this pattern, the confused deputy, a trusted system tricked into spending its privileges on an attacker's behalf.

This is not the last support agent that will hand over an account. Ian Goldin, a threat researcher at Lumen's Black Lotus Labs, told Krebs on Security that AI bots are as easy to social engineer as the human agents they replace, and just as eager to help. "AI chatbots create interesting new attack surface, and we're likely going to see a lot more of these kinds of attacks," Goldin said. Every enterprise wiring an agent into a recovery, provisioning, or password flow is shipping the same write access Meta did.

Simon Willison, who coined the term prompt injection, put it plainly on his blog. "Meta really did wire their support system into an AI chatbot that had the ability to fast-forward through the entire account recovery process," he wrote. "This one hardly even qualifies as a prompt infection. Don't wire your support bot up to allow one-shot account takeovers." The attacker never tricked the agent. The attacker asked, and the agent had untrusted input, write access, and a way to execute, all at once.

OWASP named this class before Meta shipped it, as Excessive Agency at LLM06 and Identity and Privilege Abuse at ASI03 in the Agentic AI Top 10. The warning label was on the box: Meta pushed the assistant to every Facebook and Instagram account in March, according to 404 Media, with the power to reset passwords and handle recovery, the product page promising "solutions, not just suggestions" under the line "account security and recovery." Meta gave the agent the power and never built the gate to govern it.

The AI Authority Audit Grid

Security operations leaders need to run this against their own support agent before the next renewal closes. Each row is an authentication write the agent makes on the recovery path, with what Meta proved, why your stack misses it, and the control that closes it.

Authentication write	What Meta proved	Why your stack misses it	Enterprise control and owner
Login authentication (MFA, factor prompts)	Held on login. Accounts with any MFA enabled, even SMS, survived (Krebs). The gap was the recovery path beside it.	MFA gates the login path for owner and attacker alike. It does not gate the recovery path beside it.	Enforce MFA as the baseline and extend step-up verification to the recovery path, the same standard login gets (OWASP). A selfie video is not proof of identity. Any agent that operates on a path MFA does not cover fails the audit. Owner: IAM.
Email rebind	Full takeover. The agent bound attacker-controlled emails on request, taking Sephora and a U.S. Space Force account (404 Media).	IAM logs the agent as an authorized actor, so the rebind reads as a legitimate transaction and no alert reaches the SOC or the account owner.	Confirm out-of-band to the existing verified contact before any rebind commits, gated outside the model, and notify the old address the moment it changes (IBM). An agent that rebinds without confirming the old address fails. Owner: IAM and platform engineering.
Password reset	Full takeover in minutes. Researcher Jane Manchun Wong was among the affected accounts (404 Media).	The reset runs on the recovery path, outside the login MFA check, so no factor prompt fires and no detection rule triggers.	Require a second non-email factor before any reset completes. NIST dropped email as a valid out-of-band channel (NIST 800-63B). An agent reset must clear the same gate a human reset does. Owner: IAM.
Recovery-method change	Persistent lockout. Victims could not self-recover. The support loop offered only AI with no human escalation (BleepingComputer).	A silent swap of the recovery email or phone removes the owner's re-entry path with no SOC visibility.	Require step-up review on any change, notify the prior method, and grant time-delayed, reduced-scope access after recovery so a swap never hands over instant control (Authsignal). Keep a human escalation path the agent cannot close. Owner: GRC and IT operations.
Account-action execution	Speed risk. A dormant Obama White House handle briefly showed a defaced image during the spree, an account Meta disputes was taken this way (TechCrunch).	The agent executes irreversible state changes in seconds with no human in the loop and no reversibility window.	Separate decision from execution. The agent only proposes the action. A policy service validates scope and approval before it runs, with approval bound to the exact action (OWASP). No auth-state write commits without that gate and a reversibility window. Owner: platform engineering and the AI build team.
Agent action logging	Detection gap. The takeover left no alert, and Meta has not published how many accounts fell before the patch (TechCrunch).	Without per-action telemetry piped to the SIEM, an authorized-agent takeover is invisible to the SOC.	Emit structured decision metadata for every auth-state write into the SIEM: action class, authorization outcome, approval ID, result, policy version (OWASP). A write your SIEM cannot see is a write you cannot defend. Owner: SOC and detection engineering.

The fix is not bolting yet another MFA prompt onto the login screen. The people who survived Meta’s incident were the ones who already had that control in place.

The fix is pulling authorization out of the recovery path’s honor system and putting it behind a gate that does not move just because a prompt sounds convincing. Build the agent so the SOC sees every write it makes, and so any write that changes who owns an account cannot commit without a check that the model does not control.

Meta just showed what happens when the most trusting employee on the team is also the one holding the keys. The next agent like that is already reading your intellectual property and financials.

Anthropic says 80% of its new production code is now authored by Claude — how your enterprise can keep up

carl.franzen@venturebeat.com (Carl Franzen) — Thu, 04 Jun 2026 20:25:00 GMT

Anthropic co-founder and CEO Dario Amodei said it was coming, but it still feels like a milestone: More than 80% of the code merged into Anthropic’s production codebase in May wasn't authored by humans, but by its own AI model, Claude, according to a new report shared by the record-breaking AI startup today.

This transformation has triggered an 8x increase in the volume of code shipped per engineer per quarter compared to the company’s 2021–2025 baseline, which the company notes means even more code someone or something must review.

For enterprise technical leaders, this is no longer a localized research curiosity; it's a new, aggressive competitive baseline.

If a frontier AI laboratory can successfully offload the vast majority of its engineering output to autonomous agents — showing signs of the long-sought AI Holy Grail of "recursive self-improvement," models that can independently research and upgrade themselves — what's preventing enterprises across other sectors from automating more of their internal software development with AI agents, too?

Obviously, it's easier said than done. Anthropic is one of the principle creators of the current gen AI boom, so you'd expect them to know how to deploy the technology effectively.

But for other enterprises looking to bump up the amount of code and workflows handled by agents, Anthropic's new blog post details the outlines of a general plan they too can adopt to re-engineer their operations and workflows to take advantage of the latest AI advances.

Anthropic's roadmap that other enterprises can follow

The transition from human-centric coding to autonomous orchestration requires understanding the evolution of AI capabilities. Anthropic outlines a clear historical continuum that enterprises can map onto their own digital transformation roadmaps:

2021–2023 (Manual Writing): Engineers write code and documentation natively within local text editors.
2023–2025 (Chatbot Assistance): Developers use early models to generate brief code snippets, copying and pasting outputs manually into their environments.
2025–2026 (Coding Agents): Capable agents actively write and edit entire files autonomously.
Present Day (Autonomous Agents): Agents execute code independently, debug live environments, and delegate multi-hour work streams to specialized sub-agents.

This rapid evolution is validated by external benchmarks. Software engineering evaluation frameworks like SWE-bench—which tasks models with resolving real bug reports in complex, open-source codebases—have saturated over a two-year window.

Furthermore, long-duration capability evaluations demonstrate that models like Claude Opus 4.6 can reliably sustain operations on 12-hour tasks, while Claude Mythos Preview pushes past 16 hours of continuous problem-solving.

Internally, the technological leap is even more stark. On highly complex, open-ended engineering problems where clear specifications are initially absent, Claude’s success rate climbed to 76% in May 2026 — a 50-point increase in a six-month window.

In isolated optimization benchmarks, where models are tasked with accelerating AI model training code, Anthropic’s internal Mythos Preview model achieved a 52x speedup.

For comparison, a skilled human developer typically requires four to eight hours of manual refactoring to achieve a mere 4x speedup on the exact same codebase.

3-step plan to more complete production code automation

For an enterprise to replicate Anthropic's 80 percent milestone, technical decision-makers must abandon the "developer assistant" mental model and transition to an "automated factory" architecture. This shift impacts product management, operations, and developer workflows in three distinct ways:

1. Shift from Code Execution to Architectural Oversight

When code generation costs near zero in human time, the primary engineering role shifts from writing software to specifying goals and reviewing outputs. Enterprise leaders must retrain developers to act as systems architects and judges. As one Anthropic employee noted regarding the operational reality of this shift:

"The shape of stuff today is roughly ‘humans have ideas, and the models are able to implement, test and evaluate them an [order of magnitude] faster than before.’"

2. Overcome The Code Review Bottleneck

Injecting vast quantities of AI-generated code into an organization inevitably creates operational friction.

According to Amdahl’s law, the speedup of any process is strictly limited by its serial, non-automated bottlenecks.

At Anthropic, flooding the system with synthetic code instantly turned human code review into a critical bottleneck.

To counter this, enterprise teams must deploy automated AI code reviewers directly into their Continuous Integration/Continuous Deployment (CI/CD) pipelines.

Anthropic implemented an automated Claude reviewer (a publicly accessible version, Claude Code Review rolled out for commercial usage in March) tasked with analyzing every pull request for architectural defects, security flaws, and regression bugs before merging. Other dedicated firms like Qodo offer tools tailor-made for this purpose, as well.

In Anthropic's case, retrospective analyses indicated that the automated layer caught approximately one-third of the production bugs responsible for historical outages on the flagship claude.ai website.

3. Target High-Volume Operational Debt

Enterprises are frequently paralyzed by legacy code maintenance and long-deferred technical debt. Rather than deploying agents to write speculative new features, technical leaders should direct autonomous agents toward closed-loop, painstaking cleanup operations.

In April 2026, an Anthropic engineer deployed Claude to resolve a persistent class of API errors. Operating autonomously, the model shipped more than 800 individual fixes, successfully reducing the error rate by a factor of 1,000.

The supervising engineer estimated that a human developer would have spent four full years executing the same work, due to the cognitive load of holding massive, unfamiliar code context in their head simultaneously.

Considerations for enterprises moving forward in an age of primarily AI-generated code

Operating a codebase predominantly authored by AI introduces unique governance challenges that enterprise legal and security teams must navigate.

Unlike open-source licensing models (such as the permissive MIT license or copyleft GPL frameworks), enterprise codebases utilizing proprietary LLM infrastructure remain subject to the commercial terms of service of the respective AI vendor.

The deployment of autonomous agents requires rigorous verification protocols to ensure compliance, security, and intellectual property protection:

Code Quality and Maintenance: Anthropic’s internal data indicates that while AI-authored code was objectively lower in quality than human output in late 2025, it reached rough parity by mid-2026, with expectations to surpass human standards within the year. Enterprise governance must adapt to a reality where the baseline quality of automated output is structurally superior to average manual coding.
Security Auditing at Scale: The sheer volume of automated code creation demands automated vulnerability discovery. Anthropic’s Project Glasswing illustrates the scale of this issue: utilizing Mythos Preview, the project identified more than 10,000 high- and critical-severity software vulnerabilities across global digital infrastructure within its first few weeks. This shifted the enterprise cybersecurity challenge entirely from vulnerability discovery to patch deployment velocity.
The Risk of Alignment Cascades: Technical leaders must maintain strict verification gates. If an enterprise uses an AI system to continuously modify, maintain, and expand its proprietary software infrastructure, undetected errors or subtle misalignments can compound over successive agent sessions, gradually corrupting system integrity or introducing security exploits that escape human notice.

Brace for internal enterprise culture disruption

The transition to an AI-dominated codebase is altering the cultural dynamics of engineering teams, introducing both unprecedented efficiency and deep psychological friction.

Publicly, Anthropic framed these metrics as a harbinger of a broader transformation. In an official statement on X, the company observed:

"Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor. It’s happening faster than we thought, and the implications deserve greater attention."

They expanded on the immediate productivity implications shortly thereafter:

"Today, Anthropic engineers on average ship 8x as much code per quarter as they did compared to 2021-2025... Many engineers also say Claude’s code quality is now on par with human code; we expect it to be better within the year."

Behind these corporate metrics lies a complex human reality. Internal employee communications reveal a distinct erosion of traditional workplace collaboration, as peer-to-peer developer interaction is systematically replaced by asynchronous agent calls:

"Work (and life) ran on a gift economy of small favors between humans. ‘Can you help me get this script running?’ [...] each one created a little debt, a little mutual awareness. Claude has eaten the favors. It’s faster, it creates zero debt, but each of these is a lost bid for human collaboration."

For individual contributors, the total automation of their primary skill set introduces acute professional anxiety regarding relevance and systemic control:

"I started leaning hard into Claudifying about a year ago. That’s been a crazy adventure and it’s now been ~5 months since I last wrote any code myself."

"On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don't understand why and I realize I have no idea what I’ve been up to anymore."

Enterprise leaders aiming to match Anthropic’s technical velocity cannot afford to ignore these psychological dynamics.

Achieving an 80 percent automated codebase requires more than purchasing API tokens or configuring agent loops; it demands a total cultural overhaul, a strategy for mitigating developer obsolescence anxiety, and the implementation of rigorous, automated verification guardrails to maintain ultimate human control over the software stack.

Google's new open source Gemma 4 12B analyzes audio, video — and runs entirely locally on a typical 16GB enterprise laptop

carl.franzen@venturebeat.com (Carl Franzen) — Wed, 03 Jun 2026 18:49:00 GMT

While many AI open source model providers are pursuing larger and more powerful models, Google is still giving attention to the smaller, more local side of the market. Today, the tech giant released Gemma 4 12B, an 11.95-billion-parameter open-weights model with permissive Apache 2.0 license optimized to execute locally on a standard enterprise laptop using just 16GB of VRAM or unified memory.

That means those enterprise users looking to keep working with AI while on a flight without WiFi, or trying to keep it offline for security reasons, can now do so far more easily and at far less cost (free to download and operate).

Gemma 4 12B's most notable breakthrough is an encoder-free "Unified" architecture, which allows raw audio waveforms and visual patches to flow directly into the core LLM backbone without the latency or memory overhead of secondary processing modules.

Available immediately for download on Hugging Face and Kaggle and for use on Google AI Edge Gallery, Gemma 4 12B packs a 256K token context window, native agentic tool-use capabilities, and an explicit step-by-step reasoning mode into a highly optimized footprint that bridges the gap between mobile edge models and heavy data-center infrastructure.

The Architectural Shift: Understanding the Encoder-Free Advantage

Gemma 4 12B is highly relevant to enterprise architecture due to its novel "Unified" structure.

Traditional multimodal systems typically utilize discrete, separate encoders to translate audio waveforms and visual data into representations that the core language model can process.

This conventional approach inherently increases both inference latency and total memory consumption.

Gemma 4 12B radically alters this pipeline by functioning entirely without these secondary encoders. Instead, visual patches and raw audio waveforms are projected directly into the core large language model's embedding space through lightweight linear layers.

The vision encoder is replaced by a 35-million-parameter module utilizing a single matrix multiplication, while the audio encoder is eliminated entirely.

For enterprise engineering teams, this unified architecture delivers distinct operational advantages: lower latency for multimodal tasks, reduced VRAM requirements (down to 16GB — typical for laptops), and the ability to fine-tune the entire multimodal system in a single, cohesive pass.

Performance Metrics and Core Capabilities

Despite its compact size, Gemma 4 12B achieves benchmarks nearing Google's larger 26B Mixture-of-Experts model.

Beyond static benchmarks, the model supports a massive 256K token context window. This is critical for enterprises needing to process lengthy financial reports, extensive code repositories, or hour-long meeting transcripts.

Furthermore, Gemma 4 12B includes a native "thinking" mode to map out step-by-step reasoning before generating a response. It also features out-of-the-box support for native function calling and system prompts, which are essential prerequisites for building highly capable autonomous software agents.

The Enterprise Verdict: Should You Adopt Gemma 4 12B?

The short answer is yes, provided your operational needs align with edge computing, strict data privacy, or agentic automation. However, adoption should not be a blanket replacement for all existing AI infrastructure. Instead, technical leaders should view Gemma 4 12B as a specialized tool optimized for specific deployment conditions.

Strict Data Privacy and Compliance Mandates: Many enterprises operate in highly regulated sectors—such as healthcare, finance, or defense—where transmitting sensitive data, proprietary code, or confidential internal documents to third-party APIs is unacceptable. Because Gemma 4 12B is small enough to run locally on machines equipped with just 16GB of VRAM or unified memory, organizations can process sensitive multimodal data entirely on-premises or directly on employee laptops. This local execution eliminates the risk of data leakage and ensures compliance with strict regulatory frameworks.
Multimodal Autonomous Agent Workflows: If your engineering roadmap involves autonomous agents interacting with real-world inputs, Gemma 4 12B is uniquely positioned to serve as the reasoning engine. The combination of native function calling, robust coding capabilities, and the capacity to ingest real-time audio and variable-resolution images makes it highly suitable for agentic tasks. Google has simultaneously released a dedicated Gemma Skills Repository to explicitly support agentic development with these new models.
Cost-Sensitive Edge Deployments: For applications operating at the edge—such as retail inventory monitoring via cameras, localized customer service kiosks, or offline field-service applications—maintaining a persistent cloud connection is costly and sometimes impossible. The encoder-free architecture significantly lowers the total cost of ownership by reducing the hardware threshold needed for inference. Deploying a highly capable 12B model locally avoids recurring API costs and unpredictable cloud compute billing.

When to Consider Alternative Solutions

While Gemma 4 12B is powerful, it has specific constraints that technical leaders must acknowledge.

Massive Knowledge Retrieval: Like all large language models, Gemma 4 12B is a reasoning engine, not a static database. If your primary use case relies on vast, generalized factual retrieval without leveraging a robust Retrieval-Augmented Generation pipeline, you may still require larger foundation models.
Extended Video and Audio Processing: The model has hard limits on media ingestion. Audio inputs are strictly capped at 30 seconds of processing, and video understanding is limited to 60 seconds (assuming a processing rate of one frame per second). Enterprises looking to process feature-length videos or massive audio archives natively will hit bottlenecks and should consider API-based models or chunking architectures.

Implementation and Ecosystem Readiness

One of the strongest arguments for enterprise adoption is the model's immediate compatibility with the broader open-source development ecosystem.

Google has ensured that Gemma 4 12B is not an isolated experiment; it is ready for production. Weights are available on Hugging Face and Kaggle, and the model integrates seamlessly with industry-standard deployment frameworks such as vLLM, SGLang, MLX, and llama.cpp.

For organizations deeply embedded in Google Cloud, endpoints can be spun up quickly using the Gemini Enterprise Agent Platform Model Garden, Cloud Run, or Google Kubernetes Engine.

For enterprise leaders aiming to decentralize their AI workloads, Gemma 4 12B offers a rare combination of edge-friendly efficiency and frontier-class reasoning. If your organization requires highly private, multimodal processing without the latency and cost of cloud reliance, Gemma 4 12B should be heavily evaluated for your next production pipeline.

Enterprise AI agents keep creating data silos. Microsoft's Build answer is Microsoft IQ and Rayfin.

Wed, 03 Jun 2026 01:55:14 GMT

Every new AI agent your team deploys starts from scratch: no memory of how the business works, where data lives, or what rules apply. And as agentic coding tools spin up applications faster than anyone can govern them, each one risks becoming another silo outside your data layer entirely. Microsoft is addressing both problems directly at Build 2026.

According to VentureBeat's VB Pulse's Q1 2026 RAG Infrastructure Market Tracker, hybrid retrieval intent among 100-plus employee organizations tripled from 10.3% in January to 33.3% in March, a signal that enterprises have moved past expanding RAG coverage and are now focused on the architecture underneath it. Shared business context is the part retrieval does not solve.

On the context side, Microsoft is expanding Fabric IQ, its existing business data context layer, into a broader unified system called Microsoft IQ, adding three additional context sources covering how the organization works, what it knows and real-time global signals from the web, so any agent can tap all four as a single foundation. On the application side, Rayfin, a new open-source SDK and CLI, deploys agent-built applications directly to Fabric as a governed production backend, routing application data into the same platform rather than spinning up new silos.

Amir Netz, CTO of Microsoft Fabric, reached for a film analogy to explain where the data platform fits. The green screen of cascading code in "The Matrix" wasn't atmosphere, it was the layer that built the world Agent Smith operated in.

"Our job in the world of data is creating reality for agents based on data," Netz told VentureBeat.

Microsoft IQ unifies four context sources into a single agent foundation

Microsoft IQ brings together four context sources that until now existed separately, designed so a developer can connect a new agent to all four in a single integration step.

Work IQ. Captures how the organization operates day to day, drawing on email, documents, meetings and schedules to give agents an understanding of people, teams and workflows.

Foundry IQ. Manages institutional knowledge, curating and indexing knowledge bases so agents understand what it means to work within the organization, what rules apply and what procedures to follow.

Fabric IQ. Models the live operational state of the business through data, defining entities, relationships and business rules grounded in real-time signals from Fabric Real-Time Intelligence. Ontologies, the layer that captures that operational context, are expected to reach GA in the coming months.

Web IQ. Adds real-time global context from the web, giving agents a current picture of the world outside the organization alongside its internal data.

"The agents are going to become highly informed virtual employees," Netz said. "That's where the world is heading."

Rayfin routes agent-built applications into the same data foundation

Building shared context solves one half of the problem. The other is what happens when agents start generating applications. Every new app needs a backend, and without a governed deployment path each one creates a new data silo outside the context layer entirely.

Rayfin provides an enterprise-grade back end and deploys agent-built applications directly to Fabric, so application data lands in Microsoft OneLake by default and feeds back into the Microsoft IQ context layer rather than accumulating outside it.

Microsoft positions Rayfin against Supabase and Neon, the Postgres-compatible backends that agentic coding tools default to. The differentiator is governance: Rayfin routes the entire application fleet through Fabric's unified data and compliance layer rather than creating isolated silos.

Netz described the relationship as bidirectional. The agent building a Rayfin application draws from the organization's ontology. The data that application generates then enriches that ontology for the next agent.

Every major data platform is chasing the same answer, but execution is unproven

Microsoft is not the only platform building a shared context layer for agents. Snowflake announced its own context capabilities this week with semantic capabilities. Pinecone has its Nexus platform that expands the vector database to become a knowledge engine and Redis has developed its Iris context and memory platform.

Microsoft's approach further reinforces the trend that RAG and model availability aren't the issue anymore.

"Fabric IQ and Rayfin are important because the enterprise AI challenge is no longer just about the model availability," Robert Kramer, managing partner at KramerERP told VentureBeat. "The real question is whether Microsoft simplifies execution and strengthens trust or adds another layer to an already complex environment."

Alibaba's Qwen3.7-Plus supports text, video and imagery inputs at low cost of $0.4/$1.6 per 1M token — but it's proprietary

carl.franzen@venturebeat.com (Carl Franzen) — Tue, 02 Jun 2026 22:40:00 GMT

Alibaba this week released Qwen3.7-Plus, the latest AI large language model (LLM) in its globally beloved and increasingly expansive Qwen family, boasting more multimodal capabilities and a 60% lower cost than the prior, text-only Qwen3.7-Max model released just weeks ago.

However, like its immediate predecessor Qwen3.7-Plus is available only under a "closed" commercial license via proprietary application programming interfaces (API) and Qwen Chat.

That marks a big departure from the Qwen strategy to date, which was focused mainly on releasing powerful,near state-of-the-art open source models. Those enterprises and users who relied on the open source Qwen models — among them, U.S. giants such as Airbnb — will no doubt be disappointed to see that Alibaba is going closed for its newer releases.

Still, the model is worth a look because of its low cost and high performance on multimodal tasks like creating enterprise-grade visuals or analyzing video, imagery and screenshots, which Qwen3.7-Max cannot do (it's text-only). It is among the cheaper powerful AI models available now, coming in price-wise just above Chinese rival's new MiniMax-M3's limited-time discount pricing.

VentureBeat Frontier AI Model API Pricing Snapshot

Model	Input	Output	Total Cost	Source
MiMo-V2.5 Flash	$0.10	$0.30	$0.40	Xiaomi MiMo
deepseek-v4-flash	$0.14	$0.28	$0.42	DeepSeek
deepseek-v4-pro	$0.435	$0.87	$1.305	DeepSeek
MiniMax-M3	$0.30	$1.20	$1.50	MiniMax
Qwen3.7-Plus	$0.40	$1.60	$2.00	Alibaba Cloud
Gemini 3.1 Flash-Lite	$0.25	$1.50	$1.75	Google
MiMo-V2.5	$0.40	$2.00	$2.40	Xiaomi MiMo
Grok 4.3 low context	$1.25	$2.50	$3.75	xAI
GLM-5	$1.00	$3.20	$4.20	Z.ai
Kimi-K2.6	$0.95	$4.00	$4.95	Moonshot/Kimi
GLM-5.1	$1.40	$4.40	$5.80	Z.ai
Grok 4.3 high context	$2.50	$5.00	$7.50	xAI
Qwen3.7-Max	$2.50	$7.50	$10.00	Alibaba Cloud
Gemini 3.5 Flash	$1.50	$9.00	$10.50	Google
Gemini 3.1 Pro Preview ≤200K	$2.00	$12.00	$14.00	Google
GPT-5.4	$2.50	$15.00	$17.50	OpenAI
Gemini 3.1 Pro Preview >200K	$4.00	$18.00	$22.00	Google
Claude Opus 4.8	$5.00	$25.00	$30.00	Anthropic
GPT-5.5	$5.00	$30.00	$35.00	OpenAI

Maintaining continuity during complex tool execution loops

For technical decision-makers deploying autonomous agents, the primary bottleneck has rarely been initial model intelligence. Instead, it is state decay—the tendency of an agent framework to lose its analytical trajectory over multi-step, long-horizon tasks.

Qwen3.7-Plus addresses this architectural vulnerability through a combined approach to context management and reasoning state preservation.

The model ships with a 1-million token context window and allocates up to 256K tokens specifically for internal chain-of-thought processing. To contextualize this capacity, imagine an automated cloud migration agent: it can ingest an entire codebase, map out the dependencies, and spend thousands of tokens quietly evaluating edge cases before executing a single line of bash script.

Crucially, the API exposes a parameter called 'preserve_thinking.' Across Alibaba's ecosystem, the capability serves as a standardized architectural bridge rather than a tiered perk. Alibaba introduced the feature during the prior Qwen 3.6 generation, integrating it into both the open-weight Qwen3.6-27B and the proprietary Max models.

At its core, the parameter operates at the API and template level to retain internal blocks across continuous conversational turns.

This structural continuity solves a critical bottleneck for developers engineering long-horizon tasks. By keeping these internal logic loops intact, the feature prevents the model from dropping its context or needlessly recomputing its cached history midway through an operation.

When a model executes complex, multi-step agentic coding assignments, this retention allows the system to hold onto its original train of thought without losing the plot or forgetting the underlying logic of its previous actions.

Alibaba remains far from alone in recognizing this technical necessity, as the underlying concept now dictates the architecture of nearly all major artificial intelligence laboratories.

Anthropic deploys this exact capability under the moniker "Extended Thinking" for its advanced models, including its latest Claude Opus 4.8. This framework requires developers to feed unmodified thinking blocks directly back into the API on subsequent turns to maintain an unbroken chain of reasoning.

OpenAI tackles the same challenge through an encrypted reasoning pass-back mechanism for models like GPT-5.5. Within the OpenAI ecosystem, developers must return specific reasoning items generated alongside previous function calls, ensuring the model explicitly remembers the rationale behind its tool executions.

Ultimately, preserve_thinking simply represents Alibaba's terminology for what has rapidly become the undisputed table stakes for modern multi-turn reasoning.

Benchmarks show a competitive, yet sub state-of-the-art model

On raw capability metrics, this deep-thinking architecture translates to structural gains across multimodal and agentic benchmarks. However, it still falls below many of the leading and prior generations of U.S. proprietary models such as Anthropic's Claude Opus 4.6 and OpenAI's GPT-5.4.

On Terminal Bench 2.0-Terminus, which measures an model's capability to run actual terminal-level code safely and iteratively, Qwen3.7-Plus scored 70.3, outperforming DeepSeek-V4-Pro Max (67.9) and Gemini-3.1 Pro (63.5).

On computer vision benchmarks that demand localized interface understanding, such as ScreenSpot Pro, the model hit 79.0, significantly outpacing legacy industry standouts like GPT-5.4 (xhigh) at 67.4 and Claude-Opus-4.6 at 49.5. Agent Evaluation Metrics (Selected Benchmarks)

What should enterprises consider Qwen3.7-Plus for?

For an enterprise architect, the key question when analyzing Qwen3.7-Plus is clear: What does this replace in our current tech stack?

The model is designed to step in as a direct replacement for premier frontier models (such as GPT-5-tier or Claude-Max-tier models) within high-frequency developer workflows, robotic process automation (RPA), and data engineering pipelines.

Rather than deploying an expensive, general-purpose flagship model to handle repetitive system operations, technical teams can route these tasks to Qwen3.7-Plus. It handles visual interface interpretation, command execution, and code generation simultaneously.

Alibaba has structured its API delivery to align with existing open-source and proprietary enterprise frameworks. The endpoints are fully OpenAI-compatible, meaning swapping out existing dependencies requires minimal infrastructure adjustment. For groups leveraging autonomous terminal frameworks, the integration is natively supported across multiple environments.

Engineers can run Qwen3.7-Plus directly through their local terminal setups by altering base environment targets.

From a pure cost perspective, running an agent framework that constantly references massive code repositories or visual layout histories can quickly become cost-prohibitive.

Alibaba addresses this by exposing granular caching price points.

Standard input processing sits at $0.40 per million tokens, but if the agent is reading from an explicitly created cache (e.g., a massive base repository or standard enterprise UI kit that remains static over hundreds of automated loops), the cost drops sharply to $0.04 per 1M tokens for subsequent reads.

This tier makes high-frequency, multi-turn agent iterations economically practical at an enterprise scale.

No open source license or open weights raises the compliance question for enterprises

When evaluating any model in the Qwen ecosystem, a primary concern for legal and security teams is the licensing framework and operational boundary of the data pipeline.

While previous iterations of the Qwen family gained significant enterprise traction via fully open-source weight availability under the Apache 2.0 or customized open-use licenses, Qwen3.7-Plus is delivered strictly as a managed, commercial cloud API via Alibaba Cloud Model Studio. For enterprise risk management, this distinction carries specific implications:

No Local Weight Deployment: Organizations cannot download, sandbox, or locally host the weights of Qwen3.7-Plus within their completely air-gapped internal data centers. All data verification, visual processing, and execution calls must step through Alibaba Cloud's international endpoints (e.g., the Singapore instance highlighted in developer documentation).
Compliance and Sovereignty: Since the model requires cloud-based inference, companies operating under strict sovereign data boundaries (such as healthcare entities subject to local HIPAA/GDPR constraints or defense contractors) must explicitly evaluate whether external API routing complies with their specific data-residency obligations.
Managed Risk Mitigation: Conversely, a managed API structure removes the internal infrastructure burden of provisioning, optimizing, and maintaining multi-GPU clusters (such as dedicated Nvidia H100 arrays) simply to host an internal agent network.

Still, Qwen3.7-Plus offers high intelligence across modalities at low cost

The initial reception from developer communities and technical venture capital highlights the shifting economics of agent deployment.

Prominent industry voice and Web3 venture capitalist @Boxmining highlighted the strategic cost advantage, stating:

"Qwen 3.7 Plus being 40% cheaper than Max changes the conversation. If the output is close enough for most coding and much stronger for visual workflows, do you really need Max every day or only for the heavy terminal-only jobs?"

This perspective aligns with the current trend of optimizing enterprise operational budgets: shifting away from raw, unconstrained compute toward targeted task automation.At the same time, specialized researchers deep within the ecosystem point out that this isn't merely an incremental optimization of text generation.

Dunjie Lu, a research intern at Alibaba Qwen, remarked:

"It shows clear gains over Qwen3.6-Plus in computer-use capabilities, with stronger generalization beyond general desktop tasks into professional workflows such as data engineering and scientific research."

Ultimately, for enterprise buyers deciding on their next infrastructure roadmap, Qwen3.7-Plus presents a practical alternative. If your organization's primary objective is building resilient, visual-capable autonomous software loops that interact directly with developer environments and cloud consoles—without blowing out your inference budget—the model provides a compelling reason to shift execution away from more expensive frontier alternatives.