<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>VentureBeat</title>
        <link>https://venturebeat.com/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Wed, 01 Jul 2026 06:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[Morgan Stanley cut its riskiest reconciliation job in half — by making its agents less autonomous]]></title>
            <link>https://venturebeat.com/orchestration/morgan-stanley-cut-its-riskiest-reconciliation-job-in-half-by-making-its-agents-less-autonomous</link>
            <guid isPermaLink="false">4QJJlE7IuNRpMZbY1kYTkF</guid>
            <pubDate>Tue, 30 Jun 2026 22:23:41 GMT</pubDate>
            <description><![CDATA[<p>Most enterprise AI deployments so far have focused on coding assistants and customer service bots. Morgan Stanley has deployed agents in one of banking&#x27;s most accuracy-critical, deadline-driven workflows instead — profit and loss (P&amp;L) reconciliation — and cut the work in half. The counterintuitive part: it got there by making the system less autonomous, not more.</p><p>Humans stay tightly in the loop, and their decisions are iteratively turned into repeatable rules the system can apply on its own.</p><p>“It&#x27;s much more like a co-worker than a copilot,” Morgan Stanley Managing Director Todd Johnson said at a recent VB AI Impact event. The internal production agentic system, known as FIXR, goes beyond simple, straightforward &quot;gen AI 1.0&quot; tasks. “We think that&#x27;s where the opportunity is to really unlock more complex work in the organization.”</p><h2><b>FIXR behind the scenes</b></h2><p>Every trading day, Morgan Stanley’s trade desks handle the important work around transactions such as cash equities or debt investments. </p><p>And, at the end of each of those days, controllers must reconcile P&amp;L across the finance giant’s Finance, Risk, Operations, and Trade Capture systems. All that data must come together, and, perhaps not surprisingly, hundreds of thousands of attributes frequently fail to match. </p><p>Typically, this means controllers must manually investigate each mismatch (or “break”), make decisions on adjustments, then ideally sign off before the number goes to the desk. And all of this while working on a hard morning deadline. </p><p>Previously, this could take up to six hours for a single book. Now, FIXR performs the task in two to three hours, Johnson said. 
Across the roughly 100 controllers who do this work, that adds up to about 1,500 hours saved per week.</p><p>After nightly P&amp;L calculations complete, the system automatically analyzes “breaks” and proposes resolutions based on learned rules. Several agents work together: </p><ul><li><p>One interprets past guidance to develop start-of-day resolutions.</p></li><li><p>One learns from controller behavior and documents the rules they apply.</p></li><li><p>One converts repeated patterns into durable, automated logic.</p></li></ul><p>Over time, the system can auto-clear certain breaks it’s encountered before, suggest solutions for others that may be less familiar, ask for help when it’s unsure, and flag for human investigation. When items are repeatedly resolved through the same method, it can create firm rules. </p><p>Critically, humans don’t leave the loop, but stay fully in it, he said. They review, approve or correct every recommendation, then feed those decisions back to improve the next run. The agent learns daily from controllers what it gets right and wrong and codifies that knowledge as it iterates. </p><p>“You still preserve that element of human accountability even as you start to automate,” Johnson said. “Over time you&#x27;ll see more and more of those items resolved in an automatic way.”</p><p>He emphasized that autonomy requires a great deal of trust; enterprises will not see efficiency gains if everyone&#x27;s checking everything an agent does. </p><p>The human–agent feedback loop was critical to addressing the challenge of controlled, measured, and repeatable automation. “We recognized that all that intelligence that&#x27;s sitting in the mind of a controller is gonna be difficult to get all into an agent on day one,” Johnson said. </p><h2><b>Focus on process-first, extensibility</b></h2><p>It was critical to establish processes first, before getting any AI involved, Johnson said. 
His team ran a “very thorough” process intelligence assessment that mapped and mined workflows to identify where automation would be the most advantageous: Was the answer agents, traditional automation, or simple re-engineering of an inefficient step? </p><p>“If we can fix that first before we add agents to the problem, then we really will be transforming the opportunity,” he said. </p><p>The P&amp;L sign-off process was full of manual steps suitable for automation, and agents taking over some of these time-consuming tasks are freeing up controllers for “more value-added analysis” and “deeper risk consideration” work, he said. </p><p>Extensibility, though, was just as important as time savings. Johnson’s team chose this particular P&amp;L reconciliation use case because hundreds of controllers were doing this work globally across the business (in the Americas, Europe, Asia). </p><p>So start with a use case, prove it, extend it, “and then ultimately the transformation will be as we roll this out more and more across the organization,” Johnson said. </p><h2>Deterministic by design</h2><p>Johnson said the team also deliberately limited how much of the workflow depended on the model&#x27;s judgment at all. &quot;If you have an opportunity to make things very prescribed and repeatable, that&#x27;s cheaper in terms of token consumption, it&#x27;s more repeatable in terms of controls — and have the LLM do the stuff where you don&#x27;t need that kind of deterministic workflow,&quot; he said. 
</p><p>As the system sees more controller feedback on a given break type, Morgan Stanley converts that pattern into a fixed rule instead of leaving it to the model.</p><h2><b>Humans still own the behavior</b></h2><p>An interesting (and perhaps fundamental) question being raised at the dawn of the agentic era is: Are agents code or digital employees?</p><p>Johnson argues that “they&#x27;re probably a little bit of both,” and, as such, require nuance when it comes to governance and oversight. Technical teams must still be responsible for maintaining protections and guardrails such as firewalls or encryption. </p><p>But there’s a new dynamic around the “performance element”: Humans using agents are responsible for them because the agents are aiding their business work. For instance, if a senior controller is working with a junior controller, they don’t just relinquish responsibility because someone is helping them out, Johnson noted. </p><p>“One of our strong principles in our AI governance generally is that there always has to be human accountability, even if there&#x27;s a degree of automation,” he said. </p><p>But there typically isn’t “one single one person,” and the process is ultimately continuous. To this point, Johnson joked that one “depressing” thing about agentic AI is that it’s going to require ongoing training because models are ever-changing. </p><p>“You&#x27;re never gonna be able to say: ‘We&#x27;ve done all the evaluation and testing that we need to do. Let&#x27;s just let it go.’ You&#x27;re going to have to have a constant view as it evolves over time.”</p><h2>Morgan Stanley is aiming at real enterprise pain points</h2><p>Morgan Stanley&#x27;s experience mirrors patterns VentureBeat has uncovered across enterprise AI deployments. 
</p><p>In VentureBeat&#x27;s recent VB Pulse survey, nearly three-quarters of respondents reported seeing little to no ROI from custom model fine-tuning, describing a &quot;sandbox graveyard&quot; of AI projects that proved too costly to maintain. This suggests that Morgan Stanley&#x27;s process-first, buy-and-blend approach may be more sustainable than chasing bespoke models. The survey had 87 respondents and findings should be considered directional. </p><p>Governance emerged as another common challenge: 38% of respondents cited the lack of a single accountable owner as their biggest barrier to production AI, while only two of the 87 enterprises surveyed had active monitoring and alerting in place to detect model failures. </p>]]></description>
            <author>taryn.plumb@venturebeat.com (Taryn Plumb)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1TN1zV3BPgEf6CJmRBHkZH/a9b324e131a6b8f6fb7742f53baab3d8/u7277289442_A_data_engineer_is_sitting_at_a_dashboard_with_ch_68da02a0-1779-44b4-9d88-37fed2d17178_3.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Anthropic launches Claude Sonnet 5 at a steep discount to its top model as the company races toward a blockbuster IPO]]></title>
            <link>https://venturebeat.com/technology/anthropic-launches-claude-sonnet-5-at-a-steep-discount-to-its-top-model-as-the-company-races-toward-a-blockbuster-ipo</link>
            <guid isPermaLink="false">5OjOQkEgqHpBGx2O95LARP</guid>
            <pubDate>Tue, 30 Jun 2026 18:00:00 GMT</pubDate>
            <description><![CDATA[<p><a href="https://www.anthropic.com/">Anthropic</a> today released <a href="https://www.anthropic.com/news/claude-sonnet-5">Claude Sonnet 5</a>, a new AI model that the company says delivers near-flagship performance at mid-tier prices — a move designed to give cost-conscious enterprise developers access to powerful agentic capabilities just as the San Francisco-based AI lab barrels toward an initial public offering that will test whether the private market&#x27;s staggering AI valuations can survive public scrutiny.</p><p>The release, which Anthropic describes as &quot;<a href="https://www.anthropic.com/news/claude-sonnet-5">the most agentic Sonnet model yet</a>,&quot; makes Sonnet 5 the default model for users on Anthropic&#x27;s Free and Pro plans, while also making it available to Max, Team, and Enterprise customers. Introductory <a href="https://platform.claude.com/docs/en/about-claude/pricing">API pricing</a> is set at $2 per million input tokens and $10 per million output tokens through August 31, after which it rises to $3 and $15 respectively — still well below the $5 input and $25 output pricing of Anthropic&#x27;s top-of-the-line Opus 4.8.</p><p>The strategic logic is unmistakable: Anthropic is trying to democratize access to capabilities that until very recently only its most expensive models could deliver, while building the kind of broad-based developer adoption that will look attractive in an <a href="https://www.anthropic.com/news/confidential-draft-s1-sec">S-1 filing</a>.</p><h2><b>Sonnet 5 benchmarks show the mid-tier model closing in on Anthropic&#x27;s flagship Opus</b></h2><p><a href="https://www.anthropic.com/news/claude-sonnet-5">Sonnet 5</a> posts major gains over its predecessor, <a href="https://www.anthropic.com/news/claude-sonnet-4-6">Sonnet 4.6</a>, across every evaluation Anthropic disclosed. 
On <a href="https://www.swebench.com/">SWE-bench Pro</a>, an agentic coding benchmark, Sonnet 5 scores 63.2% compared with Sonnet 4.6&#x27;s 58.1% — a jump that brings it within striking distance of Opus 4.8&#x27;s 69.2%. On <a href="https://www.tbench.ai/">Terminal-Bench 2.1</a>, another coding evaluation, the gap narrows further: 80.4% for Sonnet 5 versus 67.0% for Sonnet 4.6 and 82.7% for Opus 4.8.</p><p>In multidisciplinary reasoning, as measured by <a href="https://agi.safe.ai/">Humanity&#x27;s Last Exam</a>, Sonnet 5 scores 43.2% without tools and 57.4% with tools — the latter figure essentially matching Opus 4.8&#x27;s 57.9%. On computer use tasks evaluated through OSWorld-Verified, Sonnet 5 reaches 81.2%, up from 78.5%. And on <a href="https://artificialanalysis.ai/evaluations/gdpval-aa">GDPval-AA v2</a>, a knowledge-work benchmark, it scores 1,618 — surpassing Opus 4.8&#x27;s 1,615 and far exceeding Sonnet 4.6&#x27;s 1,395.</p><p>The pattern across these evaluations tells a consistent story: <a href="https://www.anthropic.com/news/claude-sonnet-5">Sonnet 5</a> doesn&#x27;t merely inch forward from its predecessor. It vaults into a performance tier that overlaps substantially with Anthropic&#x27;s flagship model, while costing roughly 40% less per token at standard pricing and 60% less during the introductory period.</p><h2><b>Enterprise partners say Sonnet 5&#x27;s agentic AI capabilities finish jobs that previous models abandoned</b></h2><p>The emphasis on agentic capabilities — the ability to plan, use tools like browsers and terminals, and execute multi-step workflows autonomously — reflects where the AI industry&#x27;s center of gravity has shifted in 2026. 
Enterprises are no longer simply asking chatbots questions; they are deploying AI systems that can navigate complex software environments, execute multi-step coding tasks, and operate with minimal human supervision.</p><p>Early access partners painted a picture of a model that doesn&#x27;t just start tasks but finishes them. Sualeh Asif, co-founder of Cursor, the AI-powered code editor that has become a bellwether for developer tool adoption, said that &quot;with Claude Sonnet 5, agents stay on plan, follow our conventions, and ship clean multi-step changes, all at an efficient cost.&quot; Daniel Shepard, a senior engineer at Zapier, described handing the model a two-part automation job — updating Salesforce account tiers and sending a launch announcement — that &quot;used to stall halfway&quot; with previous models but now completes end to end.</p><p>These testimonials matter because they describe exactly the kind of reliability gap that has kept many enterprises from moving agentic AI from pilot programs to production deployments. A model that gets 80% of the way through a complex task before stalling creates more problems than it solves; one that reliably completes the full workflow changes the economics of automation. 
Anthropic also introduced cost-performance curves showing that developers can now adjust effort levels across <a href="https://www.anthropic.com/news/claude-sonnet-5">Sonnet 5</a> and <a href="https://www.anthropic.com/news/claude-opus-4-8">Opus 4.8</a> to find the optimal balance of cost and accuracy for their specific use case — a granularity that reflects growing sophistication in how enterprises consume AI services.</p><h2><b>An updated tokenizer boosts Sonnet 5 performance but could quietly raise costs for some workloads</b></h2><p>One technical detail <a href="https://www.anthropic.com/news/claude-sonnet-5">buried in the announcement&#x27;s footnotes</a> deserves attention: Sonnet 5 uses an updated tokenizer that changes how the model processes text, similar to the change Anthropic introduced with Opus 4.7.</p><p>The tradeoff is that the same input can map to roughly 1.0 to 1.35 times as many tokens depending on content type. Anthropic says the introductory pricing is calibrated to make the transition &quot;roughly cost-neutral,&quot; but enterprise customers running high-volume workloads will want to benchmark their specific use cases carefully before assuming their bills won&#x27;t change.</p><h2><b>Anthropic says Sonnet 5 is safer than its predecessor, but its most capable models still lead on alignment</b></h2><p>Anthropic&#x27;s safety disclosures reveal a nuanced picture. The company reports that <a href="https://www.anthropic.com/news/claude-sonnet-5">Sonnet 5</a> shows lower rates of hallucination and sycophancy than <a href="https://www.anthropic.com/news/claude-sonnet-4-6">Sonnet 4.6</a>, is better at refusing malicious requests, and is more resistant to prompt injection attacks in agentic contexts. 
On Anthropic&#x27;s automated behavioral audit — which tests for a wide range of misaligned behaviors including cooperation with misuse and deception — Sonnet 5 scored lower (meaning safer) overall than Sonnet 4.6.</p><p>However, Sonnet 5 showed &quot;somewhat higher rates of misaligned behavior&quot; compared with the more capable <a href="https://www.anthropic.com/news/claude-opus-4-8">Opus 4.8</a> and Anthropic&#x27;s <a href="https://www.anthropic.com/claude/mythos">Claude Mythos Preview</a>, the company&#x27;s powerful but tightly restricted cybersecurity-focused model. On a Firefox 147 exploit development evaluation created in collaboration with Mozilla, neither Sonnet model could develop a working exploit — both scored 0.0% — though Sonnet 5 showed a slightly higher partial success rate (13.2%) than Sonnet 4.6 (8.8%). Both remain far below Opus 4.8 (68.8% working exploits) and Mythos 5 (88.4%).</p><p>Because of these incremental gains in cyber-adjacent capabilities, Anthropic launched Sonnet 5 with cyber safeguards enabled by default — real-time systems that detect and block dangerous cybersecurity usage. 
The safeguards mirror those on Opus 4.7 and 4.8 but are less restrictive than those applied to <a href="https://www.anthropic.com/news/claude-fable-5-mythos-5">Fable 5</a>, the latest Mythos-class model that <a href="https://www.bloomberg.com/news/videos/2026-06-10/the-opening-trade-6-10-2026-video">Bloomberg reported</a> on June 10 is &quot;blocked from responding to queries related to cybersecurity and biology.&quot; Organizations enrolled in <a href="https://support.claude.com/en/articles/14604842-real-time-cyber-safeguards-on-claude">Anthropic&#x27;s Cyber Verification Program</a> automatically receive the same access on Sonnet 5 without needing to reapply.</p><h2><b>From $14 billion to $47 billion in revenue: Sonnet 5 arrives as Anthropic&#x27;s IPO narrative takes shape</b></h2><p>The <a href="https://www.anthropic.com/news/claude-sonnet-5">Sonnet 5</a> launch arrives at what may be the most consequential moment in Anthropic&#x27;s short history. The company confidentially filed its IPO prospectus with the SEC in early June, setting up what CNBC has described as &quot;<a href="https://www.cnbc.com/2026/06/05/tech-download-anthropic-ipo-ai-valuations.html">the most scrutinized public offering in tech history</a>.&quot;</p><p>The financial trajectory has been extraordinary. In February, Anthropic raised $30 billion at a <a href="https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation">$380 billion valuation</a>, with the company reporting $14 billion in annualized revenue that had &quot;grown more than tenfold in each of the past three years,&quot; as <a href="https://www.theguardian.com/technology/2026/feb/12/anthropic-funding-round">The Guardian reported</a>. 
</p><p>By late May, Anthropic had closed a <a href="https://www.anthropic.com/news/series-h">$65 billion Series H round at a $965 billion</a> post-money valuation — co-led by Altimeter Capital, Sequoia Capital, and others — with a revenue run rate that had crossed $47 billion. Harrison Rolfes, an analyst at PitchBook, <a href="https://www.cnbc.com/2026/06/05/tech-download-anthropic-ipo-ai-valuations.html">told CNBC</a> that the number that will &quot;either validate or collapse the entire narrative the private markets have been pricing for three years&quot; won&#x27;t be the valuation or revenue, but gross margin — a figure no outside observer has yet seen.</p><p>In this context, <a href="https://www.anthropic.com/news/claude-sonnet-5">Sonnet 5</a> serves a dual purpose. For developers, it offers genuine capability improvements at competitive prices. For Anthropic&#x27;s IPO narrative, it demonstrates the company can deliver a compelling product at a price tier that could drive the kind of broad adoption Wall Street rewards — high-volume, recurring API revenue from thousands of enterprise customers.</p><h2><b>Government deals and growing competition define the market Sonnet 5 enters</b></h2><p>The timing also aligns with Anthropic&#x27;s aggressive push into institutional contracts. 
Just yesterday, California Governor Gavin Newsom announced a first-of-its-kind partnership providing <a href="https://www.gov.ca.gov/2026/06/29/governor-newsom-announces-a-first-of-its-kind-partnership-providing-anthropic-tools-to-state-agencies-and-improving-services-for-californians/">Claude to all state agencies at a 50% discount</a>, with free workforce training.</p><p>Kate Jensen, Anthropic&#x27;s Head of Americas, called it an effort to &quot;put Claude to work for the people who keep this state running.&quot; The deal — which extends to California&#x27;s cities and counties — represents exactly the kind of durable, recurring adoption that could anchor revenue well beyond the developer community.</p><p>But Anthropic&#x27;s release lands in an increasingly crowded field. OpenAI, which <a href="https://openai.com/index/accelerating-the-next-phase-ai/">raised a $122 billion round in March</a> at an $852 billion valuation, is pursuing its own IPO. Elon Musk&#x27;s SpaceX, which merged with xAI, priced its IPO at <a href="https://www.cnbc.com/2026/06/03/spacex-ipo-stock-price-roadshow-musk.html">$135 per share with a $1.77 trillion valuation</a>. Google, Meta, and a growing wave of well-funded competitors — including Asian AI startups that, as the Wall Street Journal has reported, are developing Mythos-like cybersecurity capabilities — are all vying for the same enterprise market.</p><p>Gil Luria, head of technology research at D.A. 
Davidson, told CNBC that while Anthropic &quot;<a href="https://www.cnbc.com/2026/06/05/tech-download-anthropic-ipo-ai-valuations.html">appears to have the lead</a>&quot; in frontier AI models, &quot;much of their current usage is for trials and experimentation and that may not sustain.&quot; That observation cuts to the heart of the challenge facing every frontier AI lab: converting experimental developer usage into durable, production-grade revenue.</p><h2><b>The real test for Sonnet 5 isn&#x27;t benchmarks — it&#x27;s whether cheaper AI can sustain a trillion-dollar story</b></h2><p>Sonnet 5&#x27;s positioning — offering near-Opus performance at Sonnet prices — is a direct play for that conversion. Enterprise customers experimenting with expensive Opus-class models may find that Sonnet 5 delivers sufficient quality for production workloads at a price point that finance teams can approve at scale. If it works, it could accelerate the shift from experimentation to deployment that every AI company needs to justify its valuation.</p><p>Three things will determine whether <a href="https://www.anthropic.com/news/claude-sonnet-5">Sonnet 5</a> matters beyond the initial benchmark charts. Real-world agentic reliability is the first: benchmarks measure capability, but production deployments measure consistency, and the true test will come when thousands of developers push the model through messy, unpredictable workflows at scale.</p><p>The tokenizer economics are the second: the updated tokenizer&#x27;s 1.0 to 1.35x token expansion could quietly erode the pricing advantage for certain workloads, and enterprise customers should run their own cost analyses rather than relying on headline per-token prices. 
The third is the IPO narrative itself: when Anthropic&#x27;s S-1 eventually becomes public, investors will scrutinize whether the Sonnet tier — cheaper but high-volume — or the Opus tier — expensive but high-margin — drives the bulk of revenue and, critically, gross profit.</p><p>As <a href="https://www.cnbc.com/2026/06/05/tech-download-anthropic-ipo-ai-valuations.html">PitchBook&#x27;s Rolfes told CNBC</a>, the 2026 IPO window &quot;either becomes the most consequential IPO cycle since the dot-com era or the most expensive lesson in narrative-versus-fundamentals that public markets have ever taught.&quot;</p><p>Anthropic is betting that a model good enough to rival its flagship and cheap enough to run at scale is the product that closes the gap between those two outcomes. The public markets will soon decide whether they agree.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/45jFDirzHKYcyikqCrIaaK/9c51576d05d8b5faf3022e3329c9e371/Nuneybits_Matrix_code_on_a_retro_desktop_computer_screen_in_bur_6cdbf492-ae5e-4176-bc42-2d7618224d1a.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[Google's Gemini Omni Flash hits the API, turning enterprise video production into a conversation]]></title>
            <link>https://venturebeat.com/technology/googles-gemini-omni-flash-hits-the-api-turning-enterprise-video-production-into-a-conversation</link>
            <guid isPermaLink="false">4CzxDMIbp5kUTBOM1RpsZ9</guid>
            <pubDate>Tue, 30 Jun 2026 16:19:00 GMT</pubDate>
            <description><![CDATA[<p>For most enterprises, a 90-second training video or a product explainer has never been an easy ask. It means a well-planned brief, an internal film crew or an outside vendor, a shoot, an edit, and a round of revisions. Change one line of on-screen text due to a legal review and the whole chain runs again. The cost and the long timelines are why so much internal video never gets made.</p><p>That equation is what Google is aiming to rewrite with <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni-3-5-videos/">Gemini Omni Flash</a>, the first model in its new &quot;Omni&quot; family, now rolling out to developers and enterprise customers through an API after debuting to consumers at I/O 2026. Google frames the family&#x27;s ambition as creating anything &quot;from any input,&quot; starting with video. But the headline interaction isn&#x27;t just a sharper text-to-video prompt. It&#x27;s the ability to edit a finished clip through conversation.</p><div></div><p>When the model launched in May, <a href="https://venturebeat.com/technology/google-unveils-gemini-omni-any-to-any-ai-model-what-enterprises-should-know">VentureBeat&#x27;s enterprise analysis</a> flagged the catch: with no programmatic interface, Omni was a consumer and prosumer tool, not a production one. This API rollout changes that. It puts conversational editing in front of the marketing and learning-and-development teams that make the most videos in an organization.</p><h2><b>The pitch: a five-tool pipeline collapses into a single conversation</b></h2><p>Until now, many teams have been assembling AI videos the hard way, bolting together an LLM for a script, a text-to-image model, an image-to-video model, a separate lip-sync tool and a voice generator, each with its own contract, billing and data path. 
</p><p>Omni&#x27;s enterprise argument is unification: one model that takes text, images and video and returns a finished clip with synced audio.</p><p>That simplicity factor is the part decision-makers should weigh first. Collapsing several point tools into one model means fewer vendors and a single place to monitor output and enforce data-handling rules. For an organization that has avoided generative video because stitching the tools together wasn&#x27;t worth the overhead, the equation shifts.</p><p>With conversational editing, each instruction builds on the last, so a marketer can relight a product shot, reframe it, or change the wardrobe without regenerating from scratch and losing the parts that already worked. It is the difference between booking a reshoot and sending a note.</p><h2><b>Multimodal references and a physics engine for brand assets</b></h2><p>Omni accepts far more than a text prompt. Alongside the words describing what you want, you can feed it multiple reference images and existing video clips, and it carries those specifics into the result. Hand it a photograph of a particular object, ask the model to place that object into a scene, and it reproduces the real thing&#x27;s coloring and rough shape instead of inventing a generic stand-in. While the match might not be pixel-perfect, it is close enough to be recognizable. That reference-driven control is what makes the feature commercially interesting: a product photo, a brand logo, or a specific location can be dropped in as an ingredient rather than described in a prompt and hoped for.</p><p>Two of Google&#x27;s four highlighted strengths speak directly to enterprise work. The first is a world model, the system&#x27;s grasp of how physical scenes behave. Add light rain and puddles to an existing shot and it renders reflections of the people and objects in the wet pavement, the sort of physical consistency that separates real footage from obvious AI video. 
</p><p>The second is text and logo insertion. Point it at a scene full of signage and you can have it rewrite those signs in another language, or for a brand of your choosing, and even drop in a company&#x27;s logo. The results aren&#x27;t flawless: in testing, sign tracking in complex scenes wasn’t always perfect and some text slipped back to the original language between frames. For training videos that need on-screen labels, or ads that need a logo placed in-scene, it is a capability worth a close look, and a reminder that the output still needs a human review before it ships.</p><h2><b>The interactions API and where the limits still bite</b></h2><p>Under the hood, this runs on Google&#x27;s new interactions API, a stateful interface built for multi-turn tasks rather than open-ended chat. Each turn carries the previous video and its references forward, which is what lets edits accumulate coherently. Developers can chain generations. They can produce a clip, edit the cat into a puma kitten, restyle a video into 8-bit retro and then into a watercolor look, and store each version to branch from later.</p><p>The constraints are real and worth budgeting around. Clips currently cap at 10 seconds, per the model&#x27;s <a href="https://deepmind.google/models/model-cards/gemini-omni-flash/">published model card</a>. To make something longer, you generate chunks and edit them together. Uploaded footage can be edited too, as long as it runs 10 seconds or under and the user holds the rights to it. Google&#x27;s own model card is candid that holding consistency across edits and rendering accurate text remain open problems.</p><h2><b>Guardrails, watermarking and the line Google won&#x27;t cross</b></h2><p>For a CISO, the demos matter less than the provenance work shipping alongside the model. 
Every Omni clip carries Google&#x27;s SynthID watermark, Google is extending C2PA Content Credentials across its generative tools, and it has launched an AI Content Detection API that flags AI-generated media, both Google&#x27;s and other vendors&#x27;.</p><p>Google has also drawn a deliberate line. The model won&#x27;t take a still photo of a person plus an audio clip and lip-sync them into speech, an explicit move to limit deepfakes. It will, however, take a recording of someone talking and translate it into another language, a useful path for localizing global training content. For regulated enterprises, those constraints and the baked-in provenance are features rather than friction.</p><div></div><h2><b>The numbers: cheap, 720p-only, and (preliminarily) ranked first</b></h2><p>The pricing landed alongside the API, and it is aggressive. Omni Flash costs $0.10 per second of generated 720p video, which puts a ten-second clip at roughly a dollar. That matches Veo 3.1 Fast at the same resolution, runs double Veo 3.1 Lite, and undercuts standard Veo 3.1 by three-quarters.</p><table><tbody><tr><td><p><b>Per second (USD)</b></p></td><td><p><b>Gemini Omni Flash</b></p></td><td><p><b>Veo 3.1 Lite</b></p></td><td><p><b>Veo 3.1 Fast</b></p></td><td><p><b>Veo 3.1</b></p></td></tr><tr><td><p>720p</p></td><td><p>$0.10</p></td><td><p>$0.05</p></td><td><p>$0.10</p></td><td><p>$0.40</p></td></tr><tr><td><p>1080p</p></td><td><p>n/a</p></td><td><p>$0.08</p></td><td><p>$0.12</p></td><td><p>$0.40</p></td></tr><tr><td><p>4K</p></td><td><p>n/a</p></td><td><p>n/a</p></td><td><p>$0.30</p></td><td><p>$0.60</p></td></tr></tbody></table><p>
The table also exposes the catch, though. Omni Flash only generates 720p. There is no 1080p or 4K option, while the Veo tiers scale up to 4K. For internal training and most social video, 720p is fine. For premium brand work meant for a large screen, it is a real ceiling, and the reason Veo 3.1 still has a job.</p><p>Clips run 3 to 10 seconds at 720p native, in landscape (16:9) or portrait (9:16). As reference inputs the model accepts up to seven images and up to three video clips of three seconds or less. It does not take audio as an input yet, though it generates audio alongside the video it produces. Output is standard MP4, and every clip ships with SynthID watermarking and C2PA credentials baked in.</p><p>On quality, the early signal is strong. In LMArena&#x27;s Text-to-Video Arena, a leaderboard where people vote on head-to-head outputs from competing models, Omni Flash sat at number one with a score of 1527. </p><h2><b>What it means for budgets, and what&#x27;s still missing</b></h2><p>With real pricing in hand, the iteration story gets concrete. Every conversational edit is a fresh generation you pay for, so an edit-heavy session still adds up, roughly a dollar for each ten-second pass at 720p. What the stateful model changes isn&#x27;t the cost of an edit, it&#x27;s the number of wasted ones: because context carries across turns, those generations go toward refining a take that mostly works instead of restarting from a blank prompt and hoping the next attempt lands.</p><p>Omni isn&#x27;t alone in this field. Veo 3.1 remains Google&#x27;s production-grade option when you need higher resolution, and rivals from Bytedance, Alibaba and OpenAI are all chasing the same budgets. What Omni adds is the editing capability itself: the ability to treat a video as a living document instead of a one-shot render.</p>]]></description>
            <author>sam.witteveen@venturebeat.com (Sam Witteveen)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/kLyxawwfV6DdvwExOeUaG/4a28ab530ddd0df708b2da07890419ff/Gemini_Generated_Image_dl8h1pdl8h1pdl8h.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Google unveils Nano Banana 2 Lite aka Gemini 3.1 Flash-Lite for low cost, 4-second fast enterprise image generations]]></title>
            <link>https://venturebeat.com/technology/google-unveils-nano-banana-2-lite-aka-gemini-3-1-flash-lite-for-low-cost-4-second-fast-enterprise-image-generations</link>
            <guid isPermaLink="false">6W1ERqKiJBgc5bSoUtyt8P</guid>
            <pubDate>Tue, 30 Jun 2026 16:06:00 GMT</pubDate>
            <description><![CDATA[<p>Google is upgrading its AI image generation capabilities today with the <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni-flash-nano-banana-2-lite/">debut</a> of <a href="https://x.com/googleaidevs/status/2071988366925521075">Nano Banana 2 (NB2) Lite</a>, an optimized model built for rapid execution and tight infrastructure budgets. </p><p>Technically designated as Gemini 3.1 Flash-Lite Image on Google&#x27;s application programming interface (API), NB2 Lite is positioned as the fastest and most cost-effective option within Google&#x27;s creative model family, capable of generating images in 4 seconds at a flat rate of $0.034 per 1,000 images. </p><p>It&#x27;s available immediately to enterprise developers through Google AI Studio, the Gemini API, and the Gemini Enterprise Agent Platform (GEAP).</p><p>It&#x27;s not quite as fast or customizable as startup <a href="https://venturebeat.com/ai/enterprise-grade-ai-image-generation-in-2-seconds-is-here-krea-2-raw-and-turbo-available-as-open-weights-under-custom-license">Krea&#x27;s new, partially open licensed Krea 2 Turbo</a> (which allows for open modification and commercial usage by small enterprises), but the big selling point here is the low price and bundling with Google&#x27;s larger Workspace and AI offerings. </p><p>This release lands alongside the public preview of Gemini Omni Flash, a multimodal conversational video generation and editing model. </p><p>However, while Omni Flash represents Google&#x27;s long-term bet on agentic video manipulation, Nano Banana 2 Lite is the immediate infrastructure workhorse, tailored specifically for high-throughput commercial applications, rapid programmatic prototyping, and automated asset generation workflows. 
</p><h3><b>The technology of speed</b></h3><p>At its core, Nano Banana 2 Lite is built directly upon the Gemini 3.1 Flash Lite architecture, engineered to solve the persistent tension between computational latency and operational overhead. </p><p>In high-velocity enterprise frameworks, traditional large-scale image models introduce significant friction due to multi-second processing delays and high per-token costs. Google&#x27;s new lightweight model circumvents these bottlenecks by generating a standard 1k resolution image in under four seconds. </p><p>This represents a stark performance optimization over its legacy predecessor, Nano Banana (Gemini 2.5 Flash Image), achieved through targeted enhancements in core baseline capabilities. </p><p>According to internal documentation, the model features upgraded world knowledge for drafting rough data visualizations and contextual layouts, enhanced character consistency to preserve identity across continuous image streams, and localized typographic rendering capabilities. </p><p>The trade-offs inherent to this &quot;Lite&quot; designation are transparently outlined in Google’s technical data sheets. </p><p>Unlike the broader standard Nano Banana 2 (NB2) and Nano Banana Pro (NB Pro) lines, which support versatile multi-resolution scaling across 1k, 2k, and 4k outputs, Nano Banana 2 Lite restricts its resolution support exclusively to a 1k canvas. Yet, within this specialized operational boundary, the architectural tuning yields surprising competitive efficiencies. In standardized internal benchmarks, Nano Banana 2 Lite achieved a Text to Image arena Elo score of 1251. This score comfortably eclipses the legacy NB1 score of 1151 and remarkably edges out the bulkier, more expensive NB Pro, which sits at 1245 in the same text-to-image track. 
For specialized editing tasks, the model maintains a single-image editing Elo score of 1308 and a multiple-image editing score of 1294, providing a highly optimized sweet spot for real-time applications.</p><div></div><h2><b>A boost to rapid prototyping and marketing research</b></h2><p>From a product implementation perspective, Google is marketing Nano Banana 2 Lite not as an artistic engine, but as an invisible, high-throughput utility layer for automated workflows. </p><p>The target demographic spans software engineers, programmatic ad platforms, and digital commerce applications where rapid iteration is crucial. </p><p>Think real-time A/B testing for thousands of targeted advertising variations or immediate layout adjustments on localized storefronts. Google highlights three specific production environments where the model excels. </p><p>First, its world knowledge allows systems to instantly draft accurate contextual scenes or location-specific mockups. </p><p>Second, its character consistency handles the rigorous demands of storyboarding tools and digital fashion try-ons, where keeping object fidelity static across sequential generations is historically difficult. </p><p>Finally, its text rendering improvements mean legible copy can be embedded directly into rapid ad generations, allowing teams to verify layout compatibility across various languages on the fly. </p><p>Developers should note, however, that while native image generation operates with the lowest-latency profile, conditional image editing tasks may experience marginally higher response times due to the secondary processing layers required to rewrite existing pixels. </p><h2><b>Licensing and access</b></h2><p>The deployment mechanism of Nano Banana 2 Lite via proprietary APIs underscores an enterprise-first commercial licensing strategy. 
</p><p>Unlike open-weights models that developers can pull down to run locally under open-source frameworks like Apache 2.0 or modified OpenRAIL licenses, Google’s latest models remain tightly integrated into its managed cloud stack. </p><p>For enterprises, this eliminates the operational complexity of hosting hardware but binds usage strictly to Google’s metered pricing terms. </p><p>Financially, this commercial strategy is highly aggressive. At $0.034 per 1,000 images across both AI Studio and GEAP channels, the model undercuts the older, less capable NB1 model ($0.039) and slashes costs dramatically compared to standard NB2 ($0.067) and NB Pro ($0.134) tiers. Internal notes indicate that the model delivers roughly 60–70% of the general capability of NB2 and NB Pro while executing at significantly higher speeds and a fraction of the cost. </p><p>By lowering the fiscal barrier to high-frequency image generation, Google is making a direct play to lock enterprise developers into its commercial platform ecosystem.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1SspsnZP2DbcuyWnSv0NZE/b152191ee0294e13bfca282555bf3300/Gemini_Generated_Image_dcilhudcilhudcil.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AI agents need context everywhere they run, even where the cloud can't follow]]></title>
            <link>https://venturebeat.com/data/ai-agents-need-context-everywhere-they-run-even-where-the-cloud-cant-follow</link>
            <guid isPermaLink="false">5y0VlXF9SQA3R06edxcP8V</guid>
            <pubDate>Tue, 30 Jun 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[<p>The competitive edge in enterprise AI is shifting to context: which platform can give an agent the right memory, the right retrieval and the right data at the moment of decision.</p><p>Couchbase on Tuesday announced its AI Data Plane, combining persistent agent memory, real-time context retrieval and an enterprise-managed MCP server in a single operational platform. </p><p>Couchbase&#x27;s roots are in <a href="https://venturebeat.com/ai/enterprise-ai-gets-closer-to-data-with-couchbases-new-capella-ai-services">caching and high-transaction databases</a> — an architecture the company argues makes it better suited for agent memory than vendors that came to the problem from search or analytics. The AI Data Plane runs identically across cloud, on-premises and disconnected edge environments, extending agent memory and local vector search to devices with no network connection.</p><p>&quot;How do you make sure that the intelligence that you get out of these models are the ones that databases specialize in?&quot; Gopi Duddi, CTO at Couchbase, told VentureBeat. &quot;How can you get that value out of storage systems, which are still going to be databases?&quot;</p><h2>What the AI Data Plane delivers</h2><p>The AI Data Plane packages three components designed to replace the fragmented stacks most enterprises are currently running.</p><p><b>Agent memory:</b> A unified persistence layer for conversational context, structured operational data and vector embeddings. 
Couchbase says the guardrails are what distinguish it from standalone memory services: token constraints per session, time-to-live limits on stored memories and metering controls that cap compute consumption per agent session.</p><p><b>Enterprise MCP server:</b> An enterprise-supported self-managed server for standardized model-context protocol integration, shipping as part of the platform rather than requiring a separate service.</p><p><b>Agent catalog:</b> A function-level catalog of discoverable agent tooling built by Couchbase. Duddi distinguished it from metadata catalogs like Databricks Unity or AWS Glue — describing it, in his words, as closer to a glorified MCP that surfaces agent functions as callable tools within the platform.</p><h2>Memory-first architecture takes agent context to the disconnected edge</h2><p>The lineage of Couchbase and its core architectural foundation is what Duddi says gives it an edge when it comes to context.</p><p>&quot;We were a cache before we became a database,&quot; Duddi said.</p><p>Writing to memory is 10x faster than writing to disk, Duddi said — a speed advantage he argues separates Couchbase from NoSQL databases that layer memory workloads on top of disk-based storage.</p><p>Couchbase isn&#x27;t the only data technology that has its roots in a caching layer. Redis is similarly rooted in cache and also <a href="https://venturebeat.com/data/context-architecture-is-replacing-rag-as-agentic-ai-pushes-enterprise-retrieval-to-its-limits">recently announced</a> an agentic AI context layer. Duddi argued that Couchbase is different in that it maintains an ACID-compliant (Atomicity, Consistency, Isolation, and Durability) database, which matters for transactional workloads. Couchbase also has a long history across multiple deployment modalities.</p><p>That architecture extends to the edge through Couchbase Lite, the platform&#x27;s on-device runtime. 
It runs SQL, full-text search and vector search locally without a network connection, using a proprietary sync mechanism to replicate bidirectionally back to cloud or between edge nodes when connectivity returns. The target environments are retail floor operations, field service, industrial deployments and regulated settings where agent data cannot leave the device.</p><p>Duddi cited hotel reservations as an early example: multiple agents serving customers concurrently, each pulling local context and running vector search on-device, with shared session memory synchronizing centrally. The practical benefit is token efficiency. Rather than every agent independently retrieving and processing the same data, the platform caches shared context so concurrent sessions draw on it without burning tokens repeatedly.</p><div></div><h2>Agora&#x27;s view from production</h2><p>Agora, a platform that helps developers embed real-time voice, video and conversational AI into enterprise applications, has run Couchbase in production since February 2024.</p><p>The initial use case was its Signaling product, managing channel setup and state synchronization for live calls. Expanding into conversational AI agents brought stricter requirements: memory-first architecture, full JSON support for storage and query, cross-datacenter replication for high availability and enterprise-grade vendor support.</p><p>&quot;Couchbase was the best fit based on these criteria,&quot; Patrick Ferriter, SVP of Product at Agora, told VentureBeat.</p><p>Agora is now extending that relationship to support context retrieval for conversational AI agents.</p><p>&quot;This will simplify the architecture and deliver enterprise grade RAG with predictable lower latency required for conversational AI use cases,&quot; Ferriter said.</p><p>For data professionals trying to figure out the best approach to context, there is no one answer. 
On platform selection, Ferriter was direct.</p><p>&quot;It depends on the preference and goals of the organization, including timing,&quot; Ferriter said. &quot;If they want something enterprise grade and optimal for immediate production and scale vs. having to optimize and maintain an open-source solution with community support. We wanted the former and that is why we looked at an expanded partnership with Couchbase.&quot;</p><h2>Competitive context: following the right trend</h2><p>The context layer has become a crowded space in 2025.</p><p>Oracle put a <a href="https://venturebeat.com/data/oracle-converges-the-ai-data-stack-to-give-enterprise-agents-a-single">memory core</a> in its database back in March, providing a context layer. Redis added a <a href="https://venturebeat.com/data/context-architecture-is-replacing-rag-as-agentic-ai-pushes-enterprise-retrieval-to-its-limits">context layer</a> in May, as did vector-native database vendor <a href="https://venturebeat.com/data/the-rag-era-is-ending-for-agentic-ai-a-new-compilation-stage-knowledge-layer-is-what-comes-next">Pinecone</a>.</p><p>&quot;Couchbase is following this trend, not setting it, but it&#x27;s the right one to follow,&quot; Devin Pratt, Research Director for AI, Automation, Data and Analytics at IDC, told VentureBeat. &quot;Its real edge is reach, running the same platform from cloud to edge to mobile, which is how enterprises actually operate. The test now is to scale against bigger names.&quot;</p><p>For teams navigating the vendor landscape, Pratt&#x27;s framing is direct. &quot;Match the tool to the workload. Consolidate where it makes sense, use a specialized engine like a graph database where relationship-heavy reasoning earns it, and let governance drive the call rather than treating memory as plumbing,&quot; Pratt said.</p>]]></description>
            <category>Data</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/76hubg1hFFLGeInYbGk6iZ/a0e50e1e6ab26b2991d23ab5f70b06ad/couchbase-context-smk1.jpg?w=300&amp;q=30" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Meituan open sources LongCat-2.0, the 1.6T, near-frontier agentic coding model that's been leading OpenRouter — trained entirely on Chinese chips]]></title>
            <link>https://venturebeat.com/technology/meituan-open-sources-longcat-2-0-the-1-6t-near-frontier-agentic-coding-model-thats-been-leading-openrouter-trained-entirely-on-chinese-chips</link>
            <guid isPermaLink="false">55Xus2lo15LrfBb2h3HoCq</guid>
            <pubDate>Tue, 30 Jun 2026 05:39:00 GMT</pubDate>
            <description><![CDATA[<p>A few hours ago, Chinese delivery app company <a href="https://longcat.chat/blog/longcat-2.0/">Meituan officially unveiled LongCat-2.0</a> on <a href="https://github.com/meituan-longcat/LongCat-2.0">GitHub</a>, <a href="https://huggingface.co/meituan-longcat/LongCat-2.0">Hugging Face</a>, and its native platform, unmasking the model as the computational engine behind &quot;Owl Alpha,&quot; the anonymous stealth model that has spent the last two months commanding global developer charts on OpenRouter. </p><p>Developed to fundamentally disrupt closed-source enterprise dominance in autonomous software engineering, the 1.6-trillion-parameter Mixture-of-Experts (MoE) system brings a native 1-million-token context window to the public domain under a highly permissive, enterprise-grade, commercially viable MIT license. However, the company has yet to post the full weights — both the GitHub and Hugging Face pages say &quot;Model weights coming soon — stay tuned!&quot;</p><p>Commercial access to the architecture introduces a highly aggressive pricing tier, deploying a mechanism where all context-cache hits are processed completely <i>free of charge</i>, running alongside a time-limited &quot;<a href="https://longcat.chat/platform/docs/TokenPack.html">Token Pack</a>&quot; flash-sale paradigm. There&#x27;s also a typical <a href="https://longcat.chat/platform/docs/APIPayAsYouGo.html">&quot;pay-as-you-go&quot; API</a> for non-cache hits, with standard pricing of $0.75/$2.95 per million tokens in/out.</p><p>However, a limited-time promotional discount aggressively slashes these operational expenditures down to $0.30 per million tokens for uncached input and $1.20 per million tokens for output, both on the cheaper end among top-performing models globally. 
</p><table><tbody><tr><td><p><b>Model</b></p></td><td><p><b>Input ($/1M)</b></p></td><td><p><b>Output ($/1M)</b></p></td><td><p><b>Total ($/1M)</b></p></td><td><p><b>Source</b></p></td></tr><tr><td><p>MiMo-V2.5 Flash</p></td><td><p>$0.10</p></td><td><p>$0.30</p></td><td><p>$0.40</p></td><td><p><a href="https://platform.xiaomimimo.com/docs/en-US/pricing">Xiaomi</a></p></td></tr><tr><td><p>deepseek-v4-flash</p></td><td><p>$0.14</p></td><td><p>$0.28</p></td><td><p>$0.42</p></td><td><p><a href="https://api-docs.deepseek.com/quick_start/pricing">DeepSeek</a></p></td></tr><tr><td><p>deepseek-v4-pro</p></td><td><p>$0.435</p></td><td><p>$0.87</p></td><td><p>$1.305</p></td><td><p><a href="https://api-docs.deepseek.com/quick_start/pricing">DeepSeek</a></p></td></tr><tr><td><p>MiniMax-M3</p></td><td><p>$0.30</p></td><td><p>$1.20</p></td><td><p>$1.50</p></td><td><p><a href="https://platform.minimax.io/subscribe/token-plan?tab=api-enterprise">MiniMax</a></p></td></tr><tr><td><p><b>LongCat-2.0 — limited-time promo</b></p></td><td><p><b>$0.30</b></p></td><td><p><b>$1.20</b></p></td><td><p><b>$1.50</b></p></td><td><p><b></b><a href="https://longcat.chat/platform/docs/APIPayAsYouGo.html"><b>LongCat</b></a><b></b></p></td></tr><tr><td><p>Gemini 3.1 Flash-Lite</p></td><td><p>$0.25</p></td><td><p>$1.50</p></td><td><p>$1.75</p></td><td><p><a href="https://ai.google.dev/gemini-api/docs/pricing">Google</a></p></td></tr><tr><td><p>Qwen3.7-Plus</p></td><td><p>$0.40</p></td><td><p>$1.60</p></td><td><p>$2.00</p></td><td><p><a href="https://modelstudio.console.alibabacloud.com/ap-southeast-1?tab=doc#/doc/?type=model&amp;url=2840914_2&amp;modelId=qwen3.7-plus&amp;serviceSite=international">Alibaba Cloud</a></p></td></tr><tr><td><p>MiMo-V2.5</p></td><td><p>$0.40</p></td><td><p>$2.00</p></td><td><p>$2.40</p></td><td><p><a href="https://platform.xiaomimimo.com/docs/en-US/pricing">Xiaomi</a></p></td></tr><tr><td><p><b>LongCat-2.0 — 
standard</b></p></td><td><p><b>$0.75</b></p></td><td><p><b>$2.95</b></p></td><td><p><b>$3.70</b></p></td><td><p><b></b><a href="https://longcat.chat/platform/docs/APIPayAsYouGo.html"><b>LongCat</b></a></p></td></tr><tr><td><p>Grok 4.3 (low context)</p></td><td><p>$1.25</p></td><td><p>$2.50</p></td><td><p>$3.75</p></td><td><p><a href="https://docs.x.ai/developers/models/grok-4.3">xAI</a></p></td></tr><tr><td><p>MiMo-V2.5 Pro (≤256K)</p></td><td><p>$1.00</p></td><td><p>$3.00</p></td><td><p>$4.00</p></td><td><p><a href="https://platform.xiaomimimo.com/docs/en-US/pricing">Xiaomi</a></p></td></tr><tr><td><p>Kimi-K2.6</p></td><td><p>$0.95</p></td><td><p>$4.00</p></td><td><p>$4.95</p></td><td><p><a href="https://platform.kimi.ai/docs/pricing/chat-k26">Moonshot AI</a></p></td></tr><tr><td><p>GLM-5.2</p></td><td><p>$1.40</p></td><td><p>$4.40</p></td><td><p>$5.80</p></td><td><p><a href="https://docs.z.ai/guides/overview/pricing">Z.ai</a></p></td></tr><tr><td><p>GPT-5.6 Luna</p></td><td><p>$1.00</p></td><td><p>$6.00</p></td><td><p>$7.00</p></td><td><p><a href="https://openai.com/index/previewing-gpt-5-6-sol/">OpenAI</a></p></td></tr><tr><td><p>Grok 4.3 (high context)</p></td><td><p>$2.50</p></td><td><p>$5.00</p></td><td><p>$7.50</p></td><td><p><a href="https://docs.x.ai/developers/models/grok-4.3">xAI</a></p></td></tr><tr><td><p>MiMo-V2.5 Pro (&gt;256K)</p></td><td><p>$2.00</p></td><td><p>$6.00</p></td><td><p>$8.00</p></td><td><p><a href="https://platform.xiaomimimo.com/docs/en-US/pricing">Xiaomi</a></p></td></tr><tr><td><p>Qwen3.7-Max</p></td><td><p>$2.50</p></td><td><p>$7.50</p></td><td><p>$10.00</p></td><td><p><a href="https://modelstudio.console.alibabacloud.com/ap-southeast-1?tab=doc#/doc/?type=model&amp;url=2840914_2&amp;modelId=qwen3.7-max&amp;serviceSite=international">Alibaba Cloud</a></p></td></tr><tr><td><p>Gemini 3.5 Flash</p></td><td><p>$1.50</p></td><td><p>$9.00</p></td><td><p>$10.50</p></td><td><p><a 
href="https://ai.google.dev/gemini-api/docs/pricing">Google</a></p></td></tr><tr><td><p>Gemini 3.1 Pro Preview (≤200K)</p></td><td><p>$2.00</p></td><td><p>$12.00</p></td><td><p>$14.00</p></td><td><p><a href="https://ai.google.dev/gemini-api/docs/pricing">Google</a></p></td></tr><tr><td><p>GPT-5.6 Terra</p></td><td><p>$2.50</p></td><td><p>$15.00</p></td><td><p>$17.50</p></td><td><p><a href="https://openai.com/index/previewing-gpt-5-6-sol/">OpenAI</a></p></td></tr><tr><td><p>GPT-5.4</p></td><td><p>$2.50</p></td><td><p>$15.00</p></td><td><p>$17.50</p></td><td><p><a href="https://openai.com/api/pricing/">OpenAI</a></p></td></tr><tr><td><p>Gemini 3.1 Pro Preview (&gt;200K)</p></td><td><p>$4.00</p></td><td><p>$18.00</p></td><td><p>$22.00</p></td><td><p><a href="https://ai.google.dev/gemini-api/docs/pricing">Google</a></p></td></tr><tr><td><p>Claude Opus 4.8</p></td><td><p>$5.00</p></td><td><p>$25.00</p></td><td><p>$30.00</p></td><td><p><a href="https://platform.claude.com/docs/en/about-claude/pricing">Anthropic</a></p></td></tr><tr><td><p>GPT-5.5</p></td><td><p>$5.00</p></td><td><p>$30.00</p></td><td><p>$35.00</p></td><td><p><a href="https://openai.com/api/pricing/">OpenAI</a></p></td></tr><tr><td><p>GPT-5.5 Instant (chat-latest)</p></td><td><p>$5.00</p></td><td><p>$30.00</p></td><td><p>$35.00</p></td><td><p><a href="https://developers.openai.com/api/docs/models/chat-latest">OpenAI</a></p></td></tr><tr><td><p>Sakana Fugu Ultra (≤272K)</p></td><td><p>$5.00</p></td><td><p>$30.00</p></td><td><p>$35.00</p></td><td><p><a href="https://console.sakana.ai/pricing#subscription-plan">Sakana AI</a></p></td></tr><tr><td><p>GPT-5.6 Sol</p></td><td><p>$5.00</p></td><td><p>$30.00</p></td><td><p>$35.00</p></td><td><p><a href="https://openai.com/index/previewing-gpt-5-6-sol/">OpenAI</a></p></td></tr><tr><td><p>Claude Fable 5 / Claude Mythos 5</p></td><td><p>$10.00</p></td><td><p>$50.00</p></td><td><p>$60.00</p></td><td><p><a 
href="https://platform.claude.com/docs/en/about-claude/models/overview">Anthropic</a></p></td></tr></tbody></table><p>What makes the release a definitive inflection point for global tech infrastructure is its operational independence: the massive model was trained entirely on a cluster of over 50,000 domestic Chinese Application-Specific Integrated Circuits (ASICs), proving that near-frontier AI models can be scaled successfully without relying on the typical U.S. Nvidia GPUs that have, to date, powered much of the global generative AI frontier model training effort. </p><p>This successful deployment of alternative silicon signals a profound structural shift. If Chinese conglomerates can consistently iterate trillion-parameter architectures using homegrown ASICs rather than general-purpose GPUs, it would seem to threaten Nvidia&#x27;s dominance in this sector. </p><p>Crucially, this technological pivot arrives precisely as Washington pressures top-tier American labs to restrict access to their latest models. Following a U.S. governmental request, <a href="https://venturebeat.com/technology/openai-unveils-gpt-5-6-sol-terra-and-luna-models-but-only-accessible-to-limited-preview-partners-for-now-per-us-gov">OpenAI was forced to limit access to its new GPT-5.6 models</a>, while Anthropic was previously also <a href="https://venturebeat.com/technology/anthropic-blocks-all-public-access-to-claude-fable-5-mythos-5-following-us-government-order-what-enterprises-should-do">ordered by the U.S.</a> to restrict access to its latest Claude Fable 5 / Mythos 5 models, which it took entirely offline in response. At the same time, a growing chorus of <a href="https://www.axios.com/2026/06/29/trump-ai-model-release-delays-tech-backlash">technologists</a>, <a href="https://thehill.com/policy/technology/5925364-ai-regulation-anthropic-trump-administration/">activists</a>, and industry experts warn that these defensive regulatory maneuvers have inadvertently backfired. 
By locking down Western closed-source models and driving up API costs, the U.S. government has left a wide operational window for global developers seeking affordable, high-performance alternatives like those found in Chinese open source models such as Meituan LongCat-2.0.</p><p>The raw operational metrics backed up the developer enthusiasm: during its unbranded residency on <a href="https://openrouter.ai/openrouter/owl-alpha">OpenRouter, Owl Alpha</a> accounted for approximately 10.1 trillion monthly tokens—averaging 559 billion tokens per day—representing a 242% month-over-month explosion in volume that propelled it into the platform&#x27;s global top three.</p><p>By the time Meituan stepped forward to claim the architecture, the model had already secured the top ranking on the Hermes Agent workspace, second place on Claude Code deployments, and third place across international OpenClaw environments.</p><div></div><h2><b>Technology: Engineering the 1M-Token Sparse Context</b></h2><p>At the core of LongCat-2.0 lies an aggressive optimization of Mixture-of-Experts (MoE) sparsity, scaling total parameters to 1.6 trillion while limiting active computation to an average of 48 billion parameters per token.</p><p>Depending on the structural complexity of a query, the model’s dynamic activation ranges from 33 billion to 56 billion parameters. This design implements a &quot;Zero-Compute Experts&quot; framework, ensuring that routine execution elements pass through lighter subnetworks, entirely eliminating the idle computational overhead that typically penalizes ultra-dense models.</p><p>To sustain a functional 1-million-token context window without incurring catastrophic hardware bottlenecks, Meituan introduced LongCat Sparse Attention (LSA). 
Designed as an evolutionary iteration of DeepSeek Sparse Attention, LSA resolves the quadratic scoring costs and memory fragmentation that typically plague fine-grained sparse mechanisms through three distinct, orthogonal vectors:</p><ul><li><p><b>Streaming-aware Indexing (SI):</b> This system restructures the token selection pipeline by blending hardware-aligned contiguous data reads with dynamic random selection. By converting fragmented memory access into highly predictable, sequential blocks, the system achieves coalesced High Bandwidth Memory (HBM) utilization and elevated effective bandwidth.</p></li><li><p><b>Cross-Layer Indexing (CLI):</b> Leveraging the empirical reality that attention saliency remains highly stable across adjacent hidden layers, CLI amortizes calculation costs. A single indexing pass successfully guides multiple consecutive layers during inference, a capability reinforced by cross-layer distillation throughout the training phase.</p></li><li><p><b>Hierarchical Indexing (HI):</b> This approach applies a coarse-to-fine, two-stage scoring layout. The indexer performs a rapid, approximate block-level recall to filter candidates, before running fine-grained token selection exclusively on the remaining population.</p></li></ul><p>Furthermore, Meituan integrated an N-gram Embedding module inherited from its lighter model lines. By expanding parameter allocation in sparse dimensions completely orthogonal to the MoE expert layout, the architecture appends 135 billion parameters to a 5-gram token combination framework. 
</p><p>This expands the core embedding space by roughly 100-fold, allowing the model to capture dense local token relationships and accelerate large-batch inference operations by reducing memory Input/Output (I/O) bottlenecks.</p><h2><b>Product: Post-Training, MOPD Framework and Benchmark Performance</b></h2><p>While generalist large language models prioritize fluid, conversational interfaces, LongCat-2.0 focuses explicitly on multi-step engineering tasks, tool integration, and automated repository manipulation — agentic tasks, in other words. </p><p>In standardized assessments, LongCat-2.0 registers an empirical 59.5 on SWE-bench Pro, surpassing GPT-5.5&#x27;s benchmark of 58.6. The model further establishes its agentic specialization by marking a 70.8 on Terminal-Bench 2.1, a 77.3 on SWE-bench Multilingual, and a 73.2 on the general corporate workflow simulator FORTE.</p><p>This precise operational behavior is achieved through a structural post-training layer called Multi-Teacher Optimization via Mixture of Specialized Experts (MOPD). 
Rather than blending raw human feedback into a singular reward function, the MOPD architecture segregates post-training optimization into three independent, highly focused expert clusters.</p><ul><li><p>The <b>Agent Experts</b> are fine-tuned strictly for structural execution, specializing in precise tool invocation, multi-turn API parameter parsing, and self-correcting loop mechanisms to avoid execution stagnation.</p></li><li><p>The <b>Reasoning Experts</b> are optimized in isolation to advance multi-hop logic, complex chain-of-thought engineering, mathematics, and high-level STEM problem-solving.</p></li><li><p>The <b>Interaction Experts</b> focus entirely on human alignment, instruction-following nuances, factual grounding to suppress hallucinations, and maintaining rigid safety guardrails without diminishing the model&#x27;s overall utility.</p></li></ul><p>By segregating these vectors during post-training, LongCat-2.0 prevents functional degradation. A dynamic gate-routing mechanism then seamlessly fuses these specialized behaviors at runtime, allowing the final model to coordinate deep reasoning, stable tool execution, and safe user interaction simultaneously.</p><p>While LongCat-2.0 generally trails premium frontier systems like Claude Opus 4.8 across broad general-agent benchmarks such as FORTE and BrowseComp, it explicitly punches above its weight in software engineering. </p><p>What makes this open-weight architecture special is its hyper-focus on autonomous development; it manages to narrowly exceed OpenAI&#x27;s proprietary GPT-5.5 on the rigorous software engineering benchmark SWE-bench Pro (scoring 59.5 against 58.6), proving it is highly capable and fiercely competitive for complex coding tasks despite a leaner computational footprint.</p><h2><b>Commercial Framework: Pay-As-You-Go vs. 
Flash-Sale Token Packs</b></h2><p>Meituan&#x27;s deployment strategy introduces a specialized commercial model that splits network access between conventional real-time API billing and structured &quot;Token Packs.&quot; </p><p>For traditional enterprise integration, standard top-up accounts are available, with funds deducted in real time based on input and output token counts.</p><p>However, to accommodate the unpredictable compute bursts characteristic of autonomous development agents, Meituan launched a structured Token Pack framework. Purchased as fixed, one-time allocations valid for a strict 30-day window, these packages stack directly on top of an organization&#x27;s existing baseline API account. </p><p>To manage network load across its ASIC clusters, Meituan releases these high-volume packages via limited flash sales four times daily, at 10:00, 16:00, 21:00, and 23:00 Beijing Time, on a first-come, first-served basis.</p><p>The economic standout of this framework is the zero-charge processing of context cache hits. </p><p>In massive agentic environments where a coding assistant must repeatedly read, reference, and modify the same multi-million-token code repository over an extended session, conventional pricing penalizes developers by charging full price for repeated input context. </p><p>Under Meituan&#x27;s infrastructure, only cache-miss inputs and generated output tokens consume the package quota. This substantially changes the cost economics of large-scale agentic software development, enabling deep iterative context exploration without compounding costs.</p><h2><b>Licensing: Open-Source Structural Freedom</b></h2><p>By registering the LongCat-2.0 repository under the open-source MIT License, Meituan positions the architecture with maximum legal flexibility for enterprise integration. 
</p><p>In contrast to copyleft licenses like the GNU General Public License (GPL), which requires anyone distributing derivative works to release them under the same terms, the MIT license permits near-unrestricted use as long as the copyright and license notice are retained.</p><p>For corporate engineering teams, this means LongCat-2.0 can be deeply modified and built directly into closed-source commercial applications, proprietary dev tools, and internal automation backends. </p><p>Corporations can fork the repository, optimize the internal LSA mechanisms for private databases, and sell the resulting software stack to end users without any obligation to disclose their proprietary intellectual property or structural enhancements.</p><h2><b>Meituan&#x27;s Evolution: From Delivery Super App to AI Powerhouse</b></h2><p>Founded in March 2010 by serial entrepreneur <a href="https://www.howtheybegan.com/founders/wang-xing">Wang Xing</a>, Meituan initially launched as a Groupon-style daily deals website before rapidly evolving into one of China’s dominant “super apps.” </p><p>Following a massive 2015 merger with Dianping, the Beijing-based tech giant solidified a dominant market share over the country&#x27;s urban delivery corridors, bridging local consumer reviews, instant retail, hotel bookings, and food delivery. Operating as a publicly traded company on the Hong Kong Stock Exchange, Meituan claims over 770 million annual transacting users and supports a network of more than 14.5 million merchants. </p><p>However, faced with intense domestic competition and severe margin compression, the company aggressively pivoted its strategy beyond logistics. Meituan publicly committed to investing &quot;billions&quot; into artificial intelligence and domestic chip capabilities to revitalize its technology-driven offerings. 
</p><p>This strategic shift into the global AI race began materializing in late 2025 with the release of LongCat-Flash, a 560-billion-parameter Mixture-of-Experts foundation model, followed quickly by the advanced reasoning model LongCat-Flash-Thinking. By open-sourcing these frontier-class models under enterprise-friendly licenses, Meituan signaled its ambition to become a foundational player in global AI infrastructure rather than remaining strictly a regional e-commerce and delivery giant. </p><h2><b>Enterprise Implications: Autonomous Operational Workflows</b></h2><p>For modern enterprises, the release of LongCat-2.0 unlocks clear operational strategies across software engineering, system operations, and long-form data interpretation. </p><p>The combination of an open-weight, MIT-licensed model with an expansive 1-million-token context window means organizations can bypass the data privacy concerns and recurring overhead associated with relying on proprietary third-party APIs.</p><p>In large-scale enterprise development environments, teams can leverage the model&#x27;s specialized Agent Experts to orchestrate autonomous codebase migrations. </p><p>Instead of dedicating hundreds of developer hours to manually rewriting legacy application frameworks, engineers can pass an entire enterprise repository along with modern SDK documentation directly into the 1-million-token context window. LongCat-2.0 can map the dependencies, execute the repository-level structural updates, compile the new codebase, and catch compilation and execution bugs autonomously within local sandbox environments before generating a final pull request.</p><p>The model&#x27;s architectural separation via the MOPD gate-routing mechanism yields significant advantages for strict enterprise compliance. 
By routing specific operational queries through isolated expert clusters, a financial institution or healthcare firm can deploy deep logic and mathematical reasoning passes without risking factual hallucination or violating strict safety bounds. </p><p>The Interaction Experts function as an implicit guardrail layer, suppressing errors and enforcing instruction-following protocols without degrading the raw processing power of the internal Reasoning Experts. Combined with the zero-cost caching model, enterprises can maintain hyper-focused autonomous software networks that can repeatedly inspect corporate data pools, continuously maintaining and optimizing internal infrastructure at a fraction of standard operational costs.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/3JhPWvGZL0ITKfOW59a6dD/6b4f45b2aba49b72a7a254507dc69a9d/ChatGPT_Image_Jun_30__2026__01_18_58_AM.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[DeepSeek open sources DSpark, a new framework to speed up LLM inference by up to 85%]]></title>
            <link>https://venturebeat.com/orchestration/deepseek-open-sources-dspark-a-new-framework-to-speed-up-llm-inference-by-up-to-85</link>
            <guid isPermaLink="false">7EOYObdZ8lrdxIkdKSDh9L</guid>
            <pubDate>Mon, 29 Jun 2026 20:36:15 GMT</pubDate>
            <description><![CDATA[<p>Even as the geopolitical conversation around AI continues to grow more fraught following the <a href="https://venturebeat.com/technology/anthropic-blocks-all-public-access-to-claude-fable-5-mythos-5-following-us-government-order-what-enterprises-should-do">U.S. government&#x27;s actions to limit the new models from Anthropic</a> and <a href="https://venturebeat.com/technology/openai-unveils-gpt-5-6-sol-terra-and-luna-models-but-only-accessible-to-limited-preview-partners-for-now-per-us-gov">OpenAI</a>, Chinese open source darling DeepSeek is back with yet another open release that could once again change AI development around the globe. </p><p>Over the weekend, the firm released <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark">DSpark</a>, a new, MIT-licensed system designed to make large language models answer faster without changing what the underlying model is trying to say. </p><p>The easiest way to think about it is this: most AI chatbots write like someone crossing a river one stepping stone at a time. They choose one small chunk of text, then the next, then the next. </p><p>DSpark gives the system a scout that runs a few steps ahead, guesses the likely path, and lets the larger model quickly check which steps are safe. When the guesses are good, the model moves faster. When the guesses are weak, DSpark tries not to waste time checking them.</p><p>DeepSeek published the work with a <a href="https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf">technical paper</a>, model checkpoints and <a href="https://github.com/deepseek-ai/DeepSpec">DeepSpec</a>, a codebase for training and evaluating speculative decoding systems. 
The release is available through DeepSeek’s public <a href="https://github.com/deepseek-ai">GitHub</a> and <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark">Hugging Face</a> pages, both under the permissive MIT license, making the new technique broadly usable by developers, researchers and commercial enterprise operations that want to study or adapt the approach.</p><p>The system is aimed at one of the most expensive problems in AI deployment: serving large models quickly enough for real users, while using hardware efficiently enough to make the economics work. That matters for consumer chatbots, coding assistants, agentic workflows and enterprise AI systems where users expect long answers to stream quickly rather than crawl out word by word.</p><p>DeepSeek is applying DSpark to its own latest frontier open model, <a href="https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5">DeepSeek-V4</a>. </p><p>Specifically, DeepSeek used its new DSpark framework on DeepSeek-V4-Flash, its already speed-optimized 284-billion-parameter mixture-of-experts model with 13 billion active parameters, and DeepSeek-V4-Pro, its more thoughtful and powerful 1.6-trillion-parameter model with 49 billion active parameters. Both support context windows up to one million tokens. </p><p>But the broader significance is that <i>DSpark is not conceptually limited to DeepSeek-V4.</i> DeepSeek’s own tests and released checkpoints cover other open model families, including Alibaba&#x27;s open-weight <i>Qwen</i> and Google&#x27;s open-weight <i>Gemma</i>.</p><p>That means enterprise teams running open-weight models could, in principle, train or fine-tune DSpark-style draft modules for their own target models. 
It is not a switch that any API customer can flip from the outside, but it is a method that can travel to other models when the operator controls the weights and serving stack.</p><h2><b>Staggering speed increases for generating tokens during inference</b></h2><p>In DeepSeek’s live production tests, DSpark improved aggregate throughput by 51% for DeepSeek-V4-Flash at an 80-token-per-second-per-user service target, and by 52% for DeepSeek-V4-Pro at a 35-token-per-second-per-user target. At matched system capacity, DeepSeek reports per-user generation speedups of 60% to 85% for V4-Flash and 57% to 78% for V4-Pro over its prior MTP-1 production baseline.</p><p>The different speed claims measure different things. The 60% to 85% figure for V4-Flash, and the 57% to 78% figure for V4-Pro, describe how much faster individual users receive generated tokens when DeepSeek compares DSpark with MTP-1 at matched practical system capacity. </p><p>Those are the cleaner “generation speed” numbers. DeepSeek also reports much larger 661% and 406% increases, but these measure aggregate throughput under very strict speed targets: 120 tokens per second per user for V4-Flash and 50 tokens per second per user for V4-Pro. </p><p>At those targets, DeepSeek says its older MTP-1 baseline approaches an operational cliff, meaning it can keep only a small number of concurrent requests running while preserving that level of responsiveness. </p><p>DSpark avoids more of that collapse, so the percentage difference in total system output becomes much larger. Put simply: the 85% number is closer to “how much faster the ride feels for a user” under comparable conditions, while the 661% and 406% figures are closer to “how much more traffic the road can still carry” when the old system is already bottlenecking. </p><h2><b>Why speculative decoding matters</b></h2><p>LLMs usually generate text one token at a time. 
A token can be a word, part of a word, punctuation mark or other small piece of text. Every new token depends on the text already produced, so the model has to keep pausing, checking the full context and choosing the next piece.</p><p>That is accurate, but slow. It is like having a senior editor approve every word before a writer can move to the next one. The editor may be excellent, but the process creates a bottleneck.</p><p>Speculative decoding, developed in the early Transformer era, tries to fix that bottleneck. Instead of asking the large model to produce every token one by one, the system uses a smaller or lighter draft component to suggest several likely next tokens. The large model then checks that batch of guesses in parallel. If the draft guessed correctly, the system moves ahead several tokens at once. If the draft made a bad guess, the system rejects the bad token and anything after it, adds a corrected token, and tries again.</p><p>The point is speed without changing the larger model’s intended output. In the standard speculative decoding setup, the draft model is not replacing the target model. It is acting more like an assistant who prepares a rough next sentence for the senior editor to approve or reject.</p><p>The idea did not appear out of nowhere with today’s large language models. A <a href="https://arxiv.org/abs/1811.03115">key precursor came in 2018</a>, when Mitchell Stern, Noam Shazeer and Jakob Uszkoreit proposed blockwise parallel decoding for deep autoregressive models. Their method predicted multiple future steps in parallel, then kept the longest prefix validated by the main model. That paper established much of the draft-and-check intuition behind later speculative decoding work.</p><p>The research line became more explicit in 2022. <a href="https://arxiv.org/abs/2203.16487">Heming Xia, Tao Ge and co-authors introduced SpecDec</a>, a draft-and-verify approach for sequence-to-sequence generation. 
Later that year, Yaniv Leviathan, Matan Kalman and Yossi Matias posted “<a href="https://arxiv.org/abs/2211.17192">Fast Inference from Transformers via Speculative Decoding</a>,” which helped define the modern version of the technique for transformer-based language models. DeepMind researchers followed in 2023 with a closely related method called <a href="https://arxiv.org/abs/2302.01318">speculative sampling</a>.</p><p>Those 2022 and 2023 papers are the clearest ancestors of how speculative decoding is discussed in current LLM inference work: a faster draft process proposes tokens, and the larger target model verifies them in a way designed to preserve the target model’s output distribution. </p><p>Since then, the field has moved quickly through several variants, including separate draft models, multi-token prediction heads, tree-based verification, feature-level methods such as <a href="https://arxiv.org/abs/2401.15077">EAGLE</a>, self-speculation, Medusa-style extra heads and parallel/blockwise drafters such as DFlash.</p><p>The key metric is not how many tokens a draft model can guess. It is how many of those guesses the larger model actually accepts. Long speculative blocks help only if enough of the proposed tokens survive verification. Otherwise, the system spends compute checking guesses that it throws away.</p><p>That is the context for DSpark. Speculative decoding was already an established inference technique before DeepSeek’s release, with support in major serving stacks and multiple competing research approaches. But it is still not a solved problem. Speedups depend heavily on the draft model, the workload, the serving setup and the current traffic level. 
DSpark’s contribution is to improve both sides of the trade-off: it tries to draft more coherent token blocks and then verify only the parts of those blocks that are likely to pay off under real serving conditions.</p><h2><b>What DSpark changes</b></h2><p>DSpark tackles two related problems: bad guesses and wasted checking.</p><p>First, the system uses what DeepSeek calls semi-autoregressive generation. In plain English, that means DSpark tries to combine speed with a bit more awareness of sequence. </p><p>A fully parallel drafter can guess several tokens at once, which is fast, but its later guesses can become less coherent because each position is predicted too independently. A purely step-by-step drafter can keep better track of how one token leads to the next, but it loses much of the speed advantage.</p><p>DSpark tries to keep the best of both. It uses a parallel backbone for most of the drafting work, then adds a lightweight sequential head that lets the draft take nearby token relationships into account. In the paper’s example, a parallel drafter might confuse likely phrase endings such as “of course” and “no problem,” producing awkward combinations because it is guessing positions too separately. DSpark’s sequential component helps the system make the later tokens fit the earlier ones.</p><p>Second, DSpark adds confidence-scheduled verification. Rather than always asking the target model to check the same number of draft tokens, DSpark estimates which prefix of the draft is likely to survive. A hardware-aware scheduler then adjusts how much of each draft should be verified based on both model confidence and current serving load.</p><p>A simple analogy: when a restaurant is quiet, the head chef can inspect more of the prep cook’s work. When the kitchen is slammed, the chef spends attention only on the dishes most likely to be ready. DSpark applies a similar idea to AI serving. Under lighter traffic, the system can afford to check longer draft prefixes. 
Under heavier traffic, it trims low-confidence trailing guesses before they consume batch capacity that could be used for other users.</p><p>DeepSeek frames this as an answer to a common production trade-off. Static multi-token drafting can look attractive in isolation, but can hurt throughput under high concurrency because the system keeps checking tokens that are likely to be rejected. DSpark’s scheduler makes the verification budget flexible instead of fixed.</p><h2><b>Offline results: better draft acceptance across Qwen and Gemma</b></h2><p>DeepSeek tested DSpark offline on Qwen3-4B, Qwen3-8B, Qwen3-14B and Gemma4-12B target models across math, coding and chat benchmarks. </p><p>In those tests, the team compared DSpark with DFlash, a parallel drafter, and Eagle3, an autoregressive drafter. The paper reports accepted length per decoding round, a measure of how many tokens survive verification on average.</p><p>Across the three Qwen3 model sizes, DSpark improved macro-average accepted length over Eagle3 by 30.9%, 26.7% and 30.0%, respectively. Compared with DFlash, it improved accepted length by 16.3%, 18.4% and 18.3%. The paper also says the gains generalized to Gemma4-12B.</p><p>That supports a point raised by developer Daniel Han, who highlighted on X that DeepSeek showed DSpark working beyond DeepSeek’s own V4 models, including Gemma and Qwen. Han’s post is community reaction rather than the sole evidence for the claim; the stronger support comes from DeepSeek’s own benchmarks and released checkpoints.</p><p>The offline results also show why workload matters. Structured tasks such as math and code tend to have higher accepted lengths than open-ended chat. That makes intuitive sense: a code completion or math step often has fewer reasonable next moves than a free-form conversation. 
</p><p><b>For enterprises,</b> this means <b>DSpark-style methods may be especially attractive for coding assistants, data analysis agents, structured workflow automation</b> and other settings where outputs follow more predictable patterns.</p><h2><b>How enterprises could use DSpark without DeepSeek-V4</b></h2><p>One of the most important questions is whether DSpark is a DeepSeek-only optimization or a broader method that can be applied to other models. The answer is: a broader method, but not an automatic plug-in.</p><p>For open-weight models, the path is relatively clear. An enterprise running Qwen, Gemma, Llama, Mistral, Granite, Command-style open weights or another model it hosts itself could train or fine-tune a DSpark-style draft module against that target model. </p><p>The team would then measure acceptance on its own workloads and integrate the verification scheduler into its inference stack.</p><p>That is different from simply downloading DeepSeek’s DSpark module and attaching it to any model. Speculative decoding depends on alignment between the draft module and the target model. The draft has to learn what the target model is likely to accept. A drafter trained for DeepSeek-V4 will not automatically be the right drafter for a different model, especially one fine-tuned on a company’s internal data or configured for different reasoning behavior.</p><p>DeepSpec’s workflow reflects this. The process involves preparing data, regenerating target-model answers, building a target cache, training the draft model and evaluating speculative-decoding acceptance. For domain-specific use, the draft model may need additional fine-tuning, especially if the target model runs in a thinking or reasoning mode.</p><p>For proprietary models, the answer depends on what the enterprise controls. If a company owns or fully hosts the model weights and serving stack, it could theoretically train and deploy a DSpark-style drafter. 
If the model is available only through a hosted API from a vendor, the customer cannot directly add DSpark from the outside. The API provider could implement a similar optimization internally, but the customer generally cannot access the token verification loop, logits, batching behavior or serving scheduler needed to make DSpark work.</p><p>That distinction matters for enterprise buyers. DSpark strengthens the case for open or self-hosted AI infrastructure because it gives advanced teams another lever to improve speed and cost. But it also shows why model serving is becoming a specialized discipline. The value is not just in picking a model, but in how intelligently that model is run.</p><h2><b>What developers get from DeepSpec</b></h2><p>For developers, DeepSpec gives a concrete implementation path for training and evaluating speculative decoding draft models. It includes data preparation, training and benchmark evaluation steps, along with released checkpoints for several open model families. That makes the release useful not only for running DeepSeek-V4 with DSpark, but also for researchers and infrastructure teams studying how to add faster decoding to other open models.</p><p>There are real deployment caveats. DeepSpec’s own README says the default Qwen3-4B data preparation setup can require roughly 38 TB of target cache storage, and the default scripts assume a single node with eight GPUs. That makes the release more immediately relevant to AI labs, cloud teams and sophisticated enterprise AI infrastructure groups than to ordinary application developers.</p><p>Still, releasing the training pipeline matters. Many inference optimizations appear only as papers, vague benchmarks or closed production claims. DeepSpec gives developers something closer to a set of blueprints: not a finished enterprise product, but a way to reproduce, adapt and evaluate the method.</p><h2><b>Early community testing</b></h2><p>The release has already drawn fast developer attention. 
Developer <a href="https://github.com/rafaelcaricio/spark_vllm_docker/pull/1">Rafael Caricio published a GitHub pull request</a> documenting single-stream DeepSeek-V4-Flash DSpark work, reporting warmed benchmark anchors of 26.33 tokens per second without speculative decoding, 39.88 tokens per second with MTP-1, and roughly 60 tokens per second with DSpark, or about 1.5x over MTP-1 and 2.3x over no-spec decoding.</p><p>A later commit in the same thread recorded a five-run mean of 60.31 tokens per second, with a 1.51x gain over MTP-1 and 2.29x over non-speculative decoding. </p><p>The same work also points to an important practical limit: in realistic multi-turn coding sessions, performance can degrade as draft acceptance falls with growing context. In other words, DSpark can make decoding faster, but acceptance quality still determines how much speed the system actually realizes.</p><p>That is a useful reality check. DSpark is not magic. It still depends on how predictable the next tokens are and how well the drafter stays aligned with the target model. But the early implementation work suggests DeepSeek’s claims are not purely academic. Developers are already testing the method in practical serving environments and reporting gains close to the paper’s single-stream expectations.</p><h2><b>The bottom line</b></h2><p>DSpark shows how much performance remains available in the inference layer, even when the underlying model architecture stays the same. As AI companies compete on model quality, context length and pricing, decoding efficiency is becoming another major battleground. </p><p>Faster generation means lower latency for users, higher throughput for providers and better economics for teams serving open models at scale.</p><p>DeepSeek’s release is notable because it combines a production-tested method, open code, public checkpoints and a detailed paper. The main innovation is not just drafting more tokens. 
It is making the system more selective about which speculative work is worth verifying.</p><p>For enterprise teams, the broader lesson is that the next wave of AI performance gains will not come only from larger models. It will also come from smarter ways to run the models companies already have — especially when those companies control enough of the stack to tune the model, train a compatible draft module and optimize the serving engine around real workloads.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1LKLzOBxewK4b4E04bGEIY/be7d3b2bd4874c5b8a3001b8498b5e02/ChatGPT_Image_Jun_29__2026__04_35_22_PM.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
    </channel>
</rss>