<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>VentureBeat</title>
        <link>https://venturebeat.com/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Tue, 23 Jun 2026 12:18:29 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[Alibaba's AI video model rises to No. 2 in global rankings, as OpenAI's Sora and ByteDance's Seedance fall away]]></title>
            <link>https://venturebeat.com/technology/alibabas-ai-video-model-rises-to-no-2-in-global-rankings-as-openais-sora-and-bytedances-seedance-fall-away</link>
            <guid isPermaLink="false">Ff198lZmN6ZRGaii3Qog8</guid>
            <pubDate>Mon, 22 Jun 2026 20:22:56 GMT</pubDate>
            <description><![CDATA[<p><a href="https://www.alibabacloud.com/en?_p_lc=1">Alibaba Cloud</a> on Sunday released <a href="https://www.happyhorse.com/">HappyHorse 1.1</a>, a major upgrade to its AI video generation model that the company says delivers production-ready video synthesis across core content creation scenarios. The model is now live on <a href="https://modelstudio.alibabacloud.com/">Alibaba Cloud Model Studio</a> with full API access for enterprise customers and developers, accompanied by a 40% sitewide launch discount for the first two weeks.</p><p>The release arrives at a moment of remarkable upheaval in the AI video generation market — and Alibaba appears keenly aware of the timing. OpenAI <a href="https://help.openai.com/en/articles/20001152-what-to-know-about-the-sora-discontinuation">discontinued Sora</a> after it proved financially unsustainable. ByteDance <a href="https://www.cnbc.com/2026/03/17/bytedance-seedance-shut-down-tiktok-marsha-blackburn-peter-welch.html">indefinitely shelved</a> the international rollout of Seedance 2.0 following a barrage of copyright complaints from Hollywood studios. For enterprise procurement teams that had been evaluating or integrating those tools into marketing, advertising, and content production workflows, the competitive landscape has contracted sharply in a matter of months.</p><p>That contraction creates both an opportunity and a test for Alibaba. HappyHorse 1.1 is not a research demo or a consumer toy — it is an API-first product built for integration into enterprise software stacks, priced for volume, and backed by a $52.7 billion global infrastructure buildout. 
Whether it can convert technical capability into enterprise adoption, particularly in Western markets navigating intensifying U.S.-China tech tensions, will determine whether Alibaba can establish itself as a serious player in the generative video market that analysts expect to reach tens of billions of dollars by the end of the decade.</p><h2><b>How HappyHorse climbed from anonymous benchmark entry to top-ranked video model</b></h2><p><a href="https://www.happyhorse.com/">HappyHorse</a> first appeared in early April as an anonymous submission on the <a href="https://x.com/arena/status/2044977389185482998">Artificial Analysis Video Arena</a>, an independent benchmarking platform where real users compare model outputs in blind, side-by-side evaluations. The model immediately claimed the top position in both text-to-video and image-to-video rankings. Alibaba was subsequently confirmed as the creator, revealing it was built by the company&#x27;s ATH (Alibaba Token Hub) AI Innovation Unit — a team previously part of the Future Life Lab under the Taobao and Tmall Group before a strategic organizational restructuring.</p><p>According to <a href="http://arena.ai">Arena.ai</a>, HappyHorse 1.0 now holds the No. 2 position across all three Video Arena leaderboards. The platform noted the model scores 1,444 in both text-to-video and image-to-video categories, leading Google&#x27;s Veo-3.1 (with audio) by 69 points in text-to-video and xAI&#x27;s Grok-Imagine-Video by 23 points in image-to-video. In Elo-based ranking systems like Arena&#x27;s, models gain or lose points based on whether users prefer their outputs in head-to-head comparisons, meaning persistent double-digit leads reflect a consistent quality gap as perceived by human evaluators — not a statistical fluke.</p><p>The model&#x27;s architecture helps explain why. 
According to community-compiled technical documentation, HappyHorse is built around a 15-billion-parameter unified self-attention Transformer that processes text, image, video, and audio tokens within a single token sequence. Unlike many competitors that stitch together separate models for video and audio, HappyHorse operates as a unified system that handles all modalities in a single generation pass, eliminating the need for third-party dubbing or post-processing audio tools. For enterprise buyers evaluating total cost of ownership, that architectural simplicity translates directly into fewer integration points, fewer vendor dependencies, and faster time to production.</p><h2><b>What the 1.1 upgrade fixes — and why it matters for commercial video production</b></h2><p>The 1.1 upgrade targets a set of pain points that enterprise video production teams know intimately. <a href="https://www.alibabacloud.com/en?_p_lc=1">Alibaba Cloud</a> described the release as &quot;systematically optimized across core content generation scenarios,&quot; and the specific improvements reveal a model that has been tuned for commercial deployment rather than viral social media demos.</p><p>The most consequential upgrade is multi-image reference capability, which Alibaba calls R2V (Reference-to-Video). The feature allows users to upload multiple character reference images and maintain consistent identity across generated video — directly addressing one of the hardest problems in AI video production, where subjects tend to drift in appearance between frames or shots. For brands producing advertising campaigns, product videos, or serialized marketing content, identity consistency is not a nice-to-have; it is a requirement that has historically forced teams back to traditional production methods.</p><p>Motion quality receives a significant overhaul, with what Alibaba describes as &quot;strengthened motion modeling&quot; that addresses prior limitations in speed and fluidity. 
The company also made targeted improvements to visual texture, specifically calling out the elimination of &quot;facial oiliness,&quot; &quot;over-sharpening,&quot; and &quot;unnatural textures&quot; — artifacts that have plagued commercial AI video since the technology emerged and that immediately signal to viewers that content is machine-generated.</p><p>Two additional upgrades round out the release. <a href="https://www.happyhorse.com/">HappyHorse 1.1</a> improves audio-visual synchronization, including what Alibaba claims is &quot;zero-drift lip sync&quot; for dialogue scenes and context-aware speech pacing — building on the 1.0 version&#x27;s already notable ability to generate up to 15 seconds of 1080p video with synchronized audio output. The model also improves instruction-following for long and complex prompts, a critical differentiator for enterprise users who need to specify precise camera movements, lighting conditions, and narrative beats in a single generation pass rather than iterating through dozens of attempts.</p><h2><b>Sora&#x27;s collapse and Seedance&#x27;s freeze leave enterprise buyers with fewer choices than ever</b></h2><p>The competitive context surrounding this launch is unusually favorable for Alibaba, and it is worth understanding why.</p><p>OpenAI&#x27;s Sora web and app experiences were <a href="https://help.openai.com/en/articles/20001152-what-to-know-about-the-sora-discontinuation">discontinued on April 26</a>, with the Sora API set to follow on September 24. The shutdown came after the product proved financially untenable: Sora cost roughly $1 million per day to operate but generated only about $2.1 million in total revenue, while active users dropped from a peak near 1 million to under 500,000. 
For enterprise teams that had integrated Sora into production pipelines, the abrupt withdrawal underscored the risks of depending on AI products that lack a sustainable business model — a cautionary tale that procurement officers are unlikely to forget quickly.</p><p>ByteDance&#x27;s <a href="https://seed.bytedance.com/en/seedance2_0">Seedance 2.0</a>, which many considered Sora&#x27;s most formidable successor, ran into a different kind of wall. Netflix, Warner Bros., Disney, Paramount, and Sony sent legal threats to ByteDance over allegations of systematic copyright infringement after users generated viral clips featuring Hollywood intellectual property. <a href="https://techcrunch.com/2026/03/15/bytedance-reportedly-pauses-global-launch-of-its-seedance-2-0-video-generator/">ByteDance indefinitely postponed</a> the international launch, and the global rollout remains suspended.</p><p>That leaves <a href="https://blog.google/innovation-and-ai/technology/ai/veo-3-1-lite/">Google&#x27;s Veo 3.1</a> as the primary Western competitor in the enterprise video generation space. But Alibaba&#x27;s Arena rankings suggest HappyHorse is outperforming Veo on user-perceived quality, and the 40% launch discount on Alibaba Cloud Model Studio could make HappyHorse significantly cheaper at scale. At the 1.0 level, pricing through third-party API platforms ran roughly $1.82 per 10-second clip at 720p and $3.12 at 1080p. With the promotional pricing, HappyHorse 1.1 could bring production-quality AI video generation within reach of mid-market companies and agencies that previously considered the technology too expensive for anything beyond experimentation.</p><h2><b>Alibaba&#x27;s $52.7 billion infrastructure bet gives HappyHorse a distribution advantage rivals can&#x27;t match</b></h2><p><a href="https://www.happyhorse.com/">HappyHorse 1.1</a> does not exist in isolation. 
It sits atop a global infrastructure offensive that distinguishes Alibaba from pure-play AI model companies that build impressive technology but lack the physical and commercial machinery to serve regulated enterprise customers at scale.</p><p>Just five days before the HappyHorse 1.1 launch, <a href="https://www.alibabacloud.com/en?_p_lc=1">Alibaba Cloud</a> opened its first data centers in France, establishing its third European hub after Germany and the United Kingdom. The Paris region features two availability zones, bringing the company&#x27;s global footprint to 105 availability zones across 32 regions. &quot;The expansion of our cloud infrastructure into France reinforces our ongoing commitment to empowering European businesses with sovereign, secure, and intelligent solutions,&quot; said Dr. Feifei Li, Alibaba Cloud&#x27;s CTO and president of international business, in the company&#x27;s announcement. In Japan, the company opened its fifth data center in Tokyo on June 19.</p><p>As reported by <a href="https://www.datacenterdynamics.com/en/news/alibaba-cloud-launches-france-region/">Data Center Dynamics</a>, CEO Eddie Wu has committed to investing $52.7 billion in building a &quot;unified global cloud network,&quot; with the company later considering increasing this to $69 billion. This year alone, Alibaba has launched new regions in Mexico, Thailand, Malaysia&#x27;s Johor, and France. 
The France deployment is also part of Alibaba Cloud&#x27;s plan to roll out enterprise-grade agentic AI services across Europe in the second half of the year, including <a href="https://help.aliyun.com/en/functioncompute/fc/what-is-agentrun">AgentRun</a> (a development platform for AI agents), <a href="https://help.aliyun.com/en/starops/product-overview/introduction-of-starops">STAROps</a> (an intelligent operations platform), and <a href="https://www.alibabacloud.com/blog/one-click-openclaw-deployment-building-enterprise-grade-ai-agent-applications-with-acs-agent-sandbox_602980">ACS Agent Sandbox</a> (which provides hardware-level security isolation for agent workloads).</p><p>The infrastructure buildout serves a dual purpose for a product like <a href="https://www.happyhorse.com/">HappyHorse</a>. Running a 15-billion-parameter video generation model with integrated audio is extraordinarily compute-intensive, and having local infrastructure reduces latency for enterprise API calls while keeping customer data within regulatory boundaries. For European buyers operating under the European Commission&#x27;s new tech sovereignty framework — published June 3 with the explicit goal of protecting the bloc&#x27;s &quot;digital independence&quot; — the ability to run AI video generation workloads on locally hosted infrastructure is not a luxury. 
It is increasingly a compliance requirement.</p><h2><b>The Pentagon listing and geopolitical risk loom over Alibaba&#x27;s Western ambitions</b></h2><p>Alibaba&#x27;s global push is unfolding under significant geopolitical headwinds that enterprise buyers cannot afford to ignore. The <a href="https://www.cnbc.com/2026/06/09/alibaba-baidu-byd-named-on-pentagons-china-military-list-.html">Pentagon added Alibaba</a>, along with BYD and Baidu, to its list of Chinese military companies on June 8, preventing them from securing U.S. defense contracts. Alibaba rejected the designation, saying it is &quot;not a Chinese military company nor part of any military-civil fusion strategy.&quot;</p><p>The listing does not automatically trigger sanctions, and it does not directly restrict commercial transactions between private U.S. companies and Alibaba. But it adds a layer of reputational and regulatory complexity to procurement decisions, particularly for companies with U.S. government exposure, defense supply chain connections, or transatlantic operations. Enterprise technology purchases are rarely evaluated on technical merit alone — vendor risk assessments, board-level compliance reviews, and geopolitical scenario planning all factor into buying decisions for cloud infrastructure and AI tooling.</p><p>For European customers specifically, the calculus is layered in a different way. The continent&#x27;s growing emphasis on digital sovereignty cuts in two directions simultaneously: it creates demand for alternatives to the dominant U.S. hyperscalers (<a href="https://aws.amazon.com/">Amazon Web Services</a>, <a href="https://azure.microsoft.com/en-us">Microsoft Azure</a>, and <a href="https://cloud.google.com/">Google Cloud</a> control roughly 70 percent of European cloud infrastructure revenue, according to Synergy Research Group), but it also raises questions about whether a Chinese provider represents a meaningful improvement in strategic autonomy. 
Alibaba&#x27;s strategy of building sovereignty-compliant infrastructure in-market is a direct attempt to answer that question — but the Pentagon listing ensures it will be asked repeatedly.</p><h2><b>What enterprise teams should watch as the AI video market consolidates</b></h2><p>The practical implications of <a href="https://www.happyhorse.com/">HappyHorse 1.1</a> for enterprise teams are substantial. HappyHorse supports four modes of generation — text-to-video, image-to-video, subject-to-video, and the newly added video editing — covering the full spectrum of commercial video needs from ideation through production to post-production, all with integrated audio at no additional cost. That breadth of capability, delivered through a single API endpoint, simplifies what has historically been a fragmented and expensive production pipeline.</p><p>The question going forward is whether Alibaba can convert benchmark dominance and competitive timing into durable enterprise relationships. The company plans to release HappyHorse through Alibaba Cloud Model Studio with full enterprise SLAs, security certifications, and regional compliance — the table stakes that separate research breakthroughs from production-grade services. Watch for customer disclosures, usage metrics, and whether third-party platforms like fal.ai and Atlas Cloud (which already host HappyHorse 1.0) update to the 1.1 version quickly, which would signal genuine developer demand beyond Alibaba&#x27;s own ecosystem.</p><p>The AI video generation market entered 2026 with three credible enterprise contenders. One is dead. One is frozen. And the one still standing is a Chinese company backed by $52.7 billion in infrastructure spending, ranked No. 2 across all three Video Arena leaderboards, and offering a 40% discount to anyone willing to place the bet. In enterprise technology, the best product does not always win — but it rarely loses when the competition has already left the field.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Technology</category>
            <category>Business</category>
            <category>Infrastructure</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/5KFpqkXqsJ1UadPksN3wpB/437fe886256a70c820f5e152f0512430/Nuneybits_Vector_art_of_cheerful_horse_trotting_across_computer_f02e9dc0-d6b4-4a8c-b0f8-de134058b9c8.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[No Claude Fable 5? No problem: Sakana achieves frontier performance with new Fugu multi-model, auto synthesis system]]></title>
            <link>https://venturebeat.com/orchestration/no-claude-fable-5-no-problem-sakana-achieves-frontier-performance-with-new-fugu-multi-model-auto-synthesis-system</link>
            <guid isPermaLink="false">5CzhsFGdeqZF7g6AWNwWC9</guid>
            <pubDate>Mon, 22 Jun 2026 16:13:00 GMT</pubDate>
            <description><![CDATA[<p>Last night, the increasingly enterprise-focused AI startup <a href="https://sakana.ai/fugu/">Sakana launched Fugu</a>, a multi-agent orchestration system that delivers frontier-level AI performance through a single, OpenAI-compatible API. </p><p>Designed for developers, enterprises, and nations seeking resilience against vendor lock-in and geopolitical export controls, Fugu (Japanese for &quot;pufferfish&quot;) bypasses the traditional monolithic model structure by dynamically routing queries to a swappable pool of specialized AI agents. </p><p>Sakana CEO and co-founder David Ha, formerly of Google Brain, positioned Fugu as a more reliable option for enterprise workflows than any single AI model provider following <a href="https://venturebeat.com/technology/anthropic-blocks-all-public-access-to-claude-fable-5-mythos-5-following-us-government-order-what-enterprises-should-do">Anthropic&#x27;s move on June 12 to revoke public access</a> to its most powerful models, Claude Mythos 5 and Claude Fable 5, in the wake of a U.S. government export control order. As <a href="https://x.com/hardmaru/status/2068884466056225025">Ha wrote in a post today on X:</a></p><blockquote><p>&quot;Fugu dynamically orchestrates the world’s best models to tackle complex tasks. We are proving that a well-orchestrated pool of swappable agents can match restricted frontier models like Fable and Mythos.

But Fugu is about more than just performance. I believe that Orchestration Models are the next frontier, beyond bigger models.

Relying on a single company’s model for national infrastructure is a massive risk. As recent export controls have shown, access to top models can disappear overnight.

Collective intelligence is the practical hedge against this concentration of power. Fugu simply routes around vendor restrictions by relying on an entirely swappable agent pool.&quot;</p></blockquote><p>Sakana AI explicitly states that the specific models Fugu selects and how it coordinates them are proprietary, meaning this routing information is hidden from the user by design. The documentation only refers generally to a &quot;diverse pool of powerful models,&quot; &quot;multiple LLMs,&quot; or &quot;specialized models&quot; without providing a specific count.</p><p>By acting as a sophisticated coordinator rather than a standalone foundation model, Fugu matches the output quality of top-tier models like Fable and Mythos on third-party benchmarks of agentic tasks, while fundamentally altering how developers deploy critical AI infrastructure.</p><h2><b>How Sakana Fugu works and where it beats Anthropic&#x27;s Claude Fable 5</b></h2><p>At its core, Sakana Fugu operates like a master general contractor. When presented with a complex request, Fugu does not attempt to execute every step itself. </p><p>Instead, it breaks the problem down, delegates sub-tasks to a pool of expert foundation models, verifies their work, and synthesizes the final output.</p><p>&quot;Fugu is itself an LLM, trained to call various LLMs in an agent pool, including instances of itself recursively,&quot; the Sakana AI team noted in their technical release. </p><p>Grounded in two of Sakana&#x27;s 2026 research papers, <a href="https://sakana.ai/trinity/">TRINITY</a> and the <a href="https://sakana.ai/learning-to-orchestrate/">Conductor</a>, the system autonomously manages the entire lifecycle of model selection and verification using learned coordination strategies rather than hand-designed workflows. 
To the end user, this multi-agent swarm is entirely abstracted behind a standard API endpoint.</p><p>Sakana AI is offering two variants of the system to cater to different operational workloads:</p><ul><li><p><b>Fugu:</b> A high-speed, low-latency model optimized for everyday tasks. It is designed to act as the default engine for interactive chatbots and integrates directly into coding environments like Codex.</p></li><li><p><b>Fugu Ultra:</b> The flagship tier engineered for complex, high-stakes tasks such as AI research, cybersecurity analysis, and multi-step patent investigations. According to Sakana, Fugu Ultra coordinates a deeper pool of experts and matches industry-leading monolithic models across rigorous scientific and reasoning benchmarks.</p></li></ul><p>On the pay-as-you-go plan, standard Fugu charges a dynamic rate based on the specific underlying models activated, whereas Fugu Ultra utilizes a fixed pricing structure starting at $5 per million input tokens and $30 per million output tokens.</p><p>As indicated by benchmark charts shared by Sakana, Fugu exceeds the performance of Anthropic&#x27;s Claude Fable 5 on <a href="https://huggingface.co/blog/leaderboard-livecodebench">LiveCodeBench</a>, an open-source benchmark that tests coding performance on regularly refreshed software problem-solving tasks (Fugu Ultra: 93.2, Fugu: 92.9, Fable: 89.8), and beats the prior Claude Mythos Preview model on <a href="https://epoch.ai/benchmarks/gpqa-diamond">GPQA-D (Diamond)</a>, a test of 198 graduate-level multiple-choice questions in biology, physics, and chemistry (Fugu Ultra: 95.5, Fugu: 95.5, Mythos Preview: 94.6).</p><p>By orchestrating multiple models from different providers, Fugu essentially builds native redundancy into the AI stack. 
If one provider suffers an outage or faces sudden regulatory restrictions, Fugu routes around the disruption to maintain uptime.</p><h2><b>Licensing and availability</b></h2><p>Fugu is offered as a commercial, proprietary API service, not an open-source framework. </p><p>Because Sakana’s core intellectual property lies in its non-obvious collaboration patterns, the specific routing information—meaning exactly which underlying models Fugu selects for a given query—remains proprietary and is intentionally hidden from the user.</p><p>However, Sakana offers critical controls for enterprise data compliance. Developers can explicitly opt specific models or providers out of their Fugu routing pool to maintain strict corporate privacy standards. </p><p>Additionally, users can opt out of having their prompts used as future training data. Geographically, Fugu is restricted from operating within the European Union (EU) and European Economic Area (EEA) while Sakana works to align its black-box data routing architecture with the GDPR.</p><h2><b>Pricing is fairly steep</b></h2><p>Fugu is available immediately in most regions, with the temporary exception of the EU and EEA, through subscription tiers and pay-as-you-go pricing.</p><p>Teams can opt for monthly <a href="https://sakana.ai/fugu/">subscription allowances</a> designed for individual or hands-on use: a Standard tier at $20/month for lightweight workflows, a Pro tier at $100/month providing 10x standard usage, and a Max tier at $200/month offering 20x usage for continuous, long-running tasks. I wasn&#x27;t able to find the actual number of tokens covered under these plans, but I&#x27;ve reached out to Ha on X for more information.</p><p>As part of the initial rollout, Sakana is offering a free second month for users who subscribe to any tier by July 31, 2026.</p><p>For enterprise scaling and production deployments, Sakana offers an elastic pay-as-you-go plan. 
Crucially for high-stakes environments, requests made under this consumption-based model are served at a higher priority than those from monthly subscription plans. </p><p>Under this framework, the standard Fugu engine charges the single rate of the highest-tier underlying model involved in a query, without ever stacking multi-agent fees. The flagship Fugu Ultra tier (fugu-ultra-20260615) utilizes a fixed pricing structure per one million tokens: $5 for input, $30 for output, and $0.50 for cached input. These rates increase to $10, $45, and $1.00 respectively for extreme workloads utilizing context windows above 272K tokens. That puts it among the more expensive options compared to single AI models via provider APIs:</p><h1><b>VentureBeat Frontier AI Model API Pricing Snapshot</b></h1><table><tbody><tr><td><p><b>Model</b></p></td><td><p><b>Input (per 1M tokens)</b></p></td><td><p><b>Output (per 1M tokens)</b></p></td><td><p><b>Total Cost (1M input + 1M output)</b></p></td><td><p><b>Source</b></p></td></tr><tr><td><p>MiMo-V2.5 Flash</p></td><td><p>$0.10</p></td><td><p>$0.30</p></td><td><p>$0.40</p></td><td><p>Xiaomi MiMo</p></td></tr><tr><td><p>deepseek-v4-flash</p></td><td><p>$0.14</p></td><td><p>$0.28</p></td><td><p>$0.42</p></td><td><p>DeepSeek</p></td></tr><tr><td><p>deepseek-v4-pro</p></td><td><p>$0.435</p></td><td><p>$0.87</p></td><td><p>$1.305</p></td><td><p>DeepSeek</p></td></tr><tr><td><p>MiniMax-M3</p></td><td><p>$0.30</p></td><td><p>$1.20</p></td><td><p>$1.50</p></td><td><p>MiniMax</p></td></tr><tr><td><p>Gemini 3.1 Flash-Lite</p></td><td><p>$0.25</p></td><td><p>$1.50</p></td><td><p>$1.75</p></td><td><p>Google</p></td></tr><tr><td><p>Qwen3.7-Plus</p></td><td><p>$0.40</p></td><td><p>$1.60</p></td><td><p>$2.00</p></td><td><p>Alibaba Cloud</p></td></tr><tr><td><p>MiMo-V2.5</p></td><td><p>$0.40</p></td><td><p>$2.00</p></td><td><p>$2.40</p></td><td><p>Xiaomi MiMo</p></td></tr><tr><td><p>Grok 4.3 (low 
context)</p></td><td><p>$1.25</p></td><td><p>$2.50</p></td><td><p>$3.75</p></td><td><p>xAI</p></td></tr><tr><td><p>MiMo-V2.5 Pro (≤256K)</p></td><td><p>$1.00</p></td><td><p>$3.00</p></td><td><p>$4.00</p></td><td><p>Xiaomi MiMo</p></td></tr><tr><td><p>Kimi-K2.6</p></td><td><p>$0.95</p></td><td><p>$4.00</p></td><td><p>$4.95</p></td><td><p>Moonshot</p></td></tr><tr><td><p>GLM-5.2</p></td><td><p>$1.40</p></td><td><p>$4.40</p></td><td><p>$5.80</p></td><td><p>Z.ai</p></td></tr><tr><td><p>Grok 4.3 (high context)</p></td><td><p>$2.50</p></td><td><p>$5.00</p></td><td><p>$7.50</p></td><td><p>xAI</p></td></tr><tr><td><p>MiMo-V2.5 Pro (&gt;256K)</p></td><td><p>$2.00</p></td><td><p>$6.00</p></td><td><p>$8.00</p></td><td><p>Xiaomi MiMo</p></td></tr><tr><td><p>Qwen3.7-Max</p></td><td><p>$2.50</p></td><td><p>$7.50</p></td><td><p>$10.00</p></td><td><p>Alibaba Cloud</p></td></tr><tr><td><p>Gemini 3.5 Flash</p></td><td><p>$1.50</p></td><td><p>$9.00</p></td><td><p>$10.50</p></td><td><p>Google</p></td></tr><tr><td><p>Gemini 3.1 Pro Preview (≤200K)</p></td><td><p>$2.00</p></td><td><p>$12.00</p></td><td><p>$14.00</p></td><td><p>Google</p></td></tr><tr><td><p>GPT-5.4</p></td><td><p>$2.50</p></td><td><p>$15.00</p></td><td><p>$17.50</p></td><td><p>OpenAI</p></td></tr><tr><td><p>Gemini 3.1 Pro Preview (&gt;200K)</p></td><td><p>$4.00</p></td><td><p>$18.00</p></td><td><p>$22.00</p></td><td><p>Google</p></td></tr><tr><td><p>Claude Opus 4.8</p></td><td><p>$5.00</p></td><td><p>$25.00</p></td><td><p>$30.00</p></td><td><p>Anthropic</p></td></tr><tr><td><p>GPT-5.5</p></td><td><p>$5.00</p></td><td><p>$30.00</p></td><td><p>$35.00</p></td><td><p>OpenAI</p></td></tr><tr><td><p><b>Sakana Fugu Ultra</b></p></td><td><p><b>$5.00</b></p></td><td><p><b>$30.00</b></p></td><td><p><b>$35.00</b></p></td><td><p><b>Sakana AI</b></p></td></tr><tr><td><p>Claude Fable 5 / Claude Mythos 
5</p></td><td><p>$10.00</p></td><td><p>$50.00</p></td><td><p>$60.00</p></td><td><p>Anthropic</p></td></tr></tbody></table><p>Developers modeling operational costs should also note a significant architectural caveat in how Fugu bills for its multi-agent capabilities. According to the developer documentation, Fugu Ultra’s API responses include detailed usage fields that separate user-visible token generation from internal orchestration work. The background tokens consumed and generated when Fugu delegates sub-tasks, verifies code, or routes between underlying agents are not absorbed by the provider; they represent real token usage and are counted toward the final price of the request at standard rates.</p><h2><b>The orchestration landscape: Fugu vs. the field and notable benchmark performance</b></h2><p>To understand Fugu’s position in the mid-2026 AI ecosystem, it is critical to distinguish between <i>model routing</i> and <i>multi-agent orchestration</i>. </p><p>Over the past year, enterprise adoption of standard routing platforms—such as Not Diamond, Martian, and the open-source RouteLLM framework—has skyrocketed. These systems act as intelligent air traffic controllers; using semantic classifiers or meta-models, they analyze an incoming prompt and predict which single foundation model will yield the highest-quality or most cost-effective response, dispatching the query accordingly.</p><p>Fugu operates on a fundamentally different paradigm. Rather than making a one-shot routing decision, Fugu aligns more closely with complex multi-round systems like Router-R1 (a framework introduced at NeurIPS 2025). 
It breaks a query down, interleaves reasoning with delegation, and dynamically assigns sub-tasks to multiple models in parallel or in sequence before synthesizing a final output.</p><p>While frameworks like LangGraph, CrewAI, and Microsoft AutoGen offer developers the tools to build similar multi-agent systems, they require immense manual configuration—defining roles, setting up conditional edges, and managing state across long-running loops. </p><p>Fugu abstracts this operational overhead entirely. It is essentially a LangGraph-style workflow packaged as a single, black-box API endpoint.</p><p>An orchestration system is ultimately bounded by the raw capabilities of the underlying models in its pool, a reality reflected in Sakana’s own benchmark testing against standalone frontier models.</p><p>On rigorous coding and agentic tasks, collective intelligence shows a distinct advantage over standard models. Fugu Ultra posted a <b>73.7 on SWE-Bench Pro</b>, significantly outperforming Anthropic&#x27;s Claude Opus 4.8 (69.2) and OpenAI&#x27;s GPT-5.5 (58.6). </p><p>However, Fugu is not a silver bullet, and its performance is not a clean sweep across the board. When compared to highly specialized or restricted-access monolithic models, Fugu occasionally trails:</p><ul><li><p><b>SWE-Bench Pro:</b> While Fugu Ultra (73.7) beat most accessible models, it was comfortably eclipsed by Anthropic’s limited-access Fable 5 (80.0), which is currently absent from Fugu&#x27;s swappable pool because of the U.S. government&#x27;s export control order and Anthropic&#x27;s subsequent decision to withdraw the model from use worldwide. 
</p></li><li><p><b>Humanity&#x27;s Last Exam:</b> Fugu Ultra (50.0) narrowly edged out Opus 4.8 (49.8), but again fell short of Fable 5 (53.3).</p></li><li><p><b>Long-Context and Security:</b> On the MRCRv2 long-context-recall test, OpenAI&#x27;s GPT-5.5 maintained the lead (94.8 vs Fugu Ultra&#x27;s 93.6), and Opus 4.8 remained the top performer on the CTI-REALM cybersecurity benchmark (69.6 vs Fugu Ultra&#x27;s 69.4).</p></li></ul><p>The quantitative data points to a clear conclusion: Fugu is highly effective at boosting performance on messy, multi-step tasks (like writing a complex HTML5 game from scratch) by leaning on the combined strengths of multiple mid-tier and high-tier models. </p><p>However, for sheer brute-force reasoning within a single, highly constrained domain, the industry&#x27;s largest standalone models still hold the edge—provided an enterprise can maintain uninterrupted access to them.</p><h2><b>Background on Sakana&#x27;s formation and noteworthy achievements to date</b></h2><p><a href="https://venturebeat.com/ai/what-you-need-to-know-about-sakana-ai-the-new-startup-from-a-transformer-paper-co-author">Sakana AI was formed in Tokyo in 2023 </a>by Llion Jones, a co-author of Google’s foundational 2017 &quot;Attention Is All You Need&quot; paper, and David Ha, the former head of research at Stability AI. </p><p>Disillusioned by large tech company bureaucracy and the industry&#x27;s hyper-fixation on scaling single, massive foundational models, the founders built Sakana around principles of biomimicry and evolutionary computing.</p><p>The company&#x27;s name, derived from the Japanese word for fish, reflects its core technical thesis: utilizing collective &quot;swarm&quot; intelligence rather than brute-force compute. 
Following a $2.6 billion Series B valuation in late 2025 and <a href="https://venturebeat.com/technology/when-deep-research-isnt-enough-for-your-business-sakana-ai-launches-ultra-deep-research-agent-for-100-page-reports-in-8-hours">the recent June 2026 launch of Marlin</a>—an autonomous, eight-hour research agent for the B2B sector—Fugu represents the commercialization of Sakana&#x27;s multi-agent routing technology for everyday developers.</p><h2><b>A mixed reception among the broader AI community online</b></h2><p>The developer community has responded to Fugu by rigorously testing its practical tradeoffs, weighing its routing efficiencies against the sheer power of monolithic foundation models.</p><p>AI observer, developer and influencer <a href="https://x.com/ChrissGPT/status/2068904825685787083?s=20">Chris (@ChrissGPT on X)</a> highlighted the specific utility of Fugu over raw foundational AI. </p><p>&quot;For a single clean prompt, you probably would [use Fable 5, Mythos, or GPT-5.5 directly],&quot; he noted, but argued that Fugu&#x27;s true value emerges in messy, multi-step environments. &quot;...whether it involves delegation, verification, synthesis, code review, research loops, security analysis... the more it would make sense to use this,&quot; he wrote.</p><p>Chris also pointed out the strategic geopolitical advantage of Fugu&#x27;s architecture, noting that if frontier AI access is abruptly revoked due to regulation or export controls, an orchestrator can dynamically swap models to prevent a total system failure.</p><p>Creative agency owner <a href="https://x.com/markksantos/status/2068962823007285628?s=20">Mark Santos (@markksantos) </a>of Mark Studios provided a direct, real-world comparison by tasking both Fugu Ultra and Claude Opus 4.8 with building a &quot;Crossy Road&quot; game clone using Three.js. 
The results underscored the operational differences between an orchestrator and a monolithic giant:</p><ul><li><p><b>Sakana Fugu Ultra:</b> Completed the task in 22 minutes using ~89,000 tokens for roughly $7.32. However, the final game suffered from minor logic errors, such as inverted directional turns and wonky camera angles.</p></li><li><p><b>Claude Opus 4.8:</b> Took 79 minutes, burned ~940,000 tokens for nearly $37.85, and got stuck in a retry loop requiring human intervention. Despite the inefficiency, it ultimately produced superior application design and functionality.</p></li></ul><p>Santos concluded the experiment by stating, &quot;In terms of application functionality, quality, and design, Opus won. In terms of model speed and performance, Fugu... won&quot;.</p><p>Elie Bakouch, a research engineer at cloud-based, open AI infrastructure and systems provider <a href="https://www.primeintellect.ai/">Prime Intellect</a>, <a href="https://x.com/eliebakouch/status/2068939729811468503">pointed out on X</a> that &quot;to be clear, this is a closed source orchestrator on top of closed source models. if before you didn&#x27;t control the models, now you don&#x27;t even control which ones are used or how much. 
this is not &#x27;AI sovereignty&#x27;...&quot;</p><p>These early tests and reactions mirror the sentiment summarized by <a href="https://www.reddit.com/r/LLMDevs/comments/1uca8e3/comment/ot2k0kx/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button">Reddit user GreedyWorking1499</a> in initial platform discussions: &quot;<i>Until proven otherwise, this is just a highly advanced router/wrapper, not a fundamental leap in intelligence like Mythos/Fable was.</i>&quot;</p><p>Yet, as enterprises increasingly demand fail-safes against single-vendor reliance, Sakana is proving that packaging collective intelligence into a single API endpoint is a highly viable commercial path.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/4kHEA49GjoZBqBlllf4GIv/0fd14ca57187e6d633ab33ceded01f69/ChatGPT_Image_Jun_22__2026__11_26_12_AM.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Why agentic enterprises need to become learning systems]]></title>
            <link>https://venturebeat.com/orchestration/why-agentic-enterprises-need-to-become-learning-systems</link>
            <guid isPermaLink="false">6rPcvxuYi8OaBqUrNifhS7</guid>
            <pubDate>Mon, 22 Jun 2026 15:00:00 GMT</pubDate>
            <description><![CDATA[<p><i>Presented by Splunk</i></p><hr/><p>Every day, organizations learn things their AI systems never get to use.</p><p>A security analyst corrects an AI-generated investigation. A network engineer identifies the root cause of a recurring outage. An observability team discovers that a pattern of latency, logs and infrastructure changes predicts service degradation. A customer operations team learns which signals indicate an escalation is likely.</p><p>Each moment contains valuable organizational knowledge. But in most enterprises, that knowledge disappears into tickets, dashboards, chat threads, post-incident reviews and the minds of individual experts. It may help solve the immediate problem, but it rarely becomes part of a reusable system that improves future AI-driven decisions.</p><p>That is the next challenge for the agentic enterprise.</p><p>The future will not be defined simply by who has the most capable model or the most autonomous agents. Many organizations will have access to similar frontier models. Many will deploy agents across security, IT, engineering, customer service, and business operations.</p><p>The real differentiator will be whether those agents can learn from the organization around them.</p><p>Not by constantly retraining the underlying model, but by capturing operational experience, converting it into institutional knowledge and making that knowledge available to future agents, workflows, and decisions.</p><p>The agentic enterprise is not just an enterprise that uses AI. It is an enterprise that learns through AI.</p><h2>Agentic enterprises allow AI systems to learn from them</h2><p>The AI conversation has been dominated by model capability: larger context windows, better reasoning, faster inference, stronger tool use, and more sophisticated agentic behavior.</p><p>Those advances matter. 
But in the enterprise, a model is only one part of the system.</p><p>A model does not automatically know how a specific organization operates. It does not inherently know which remediation step solved last month’s outage, which analyst correction improved a threat investigation, which network signal preceded a service disruption, or which internal policy should override an otherwise plausible recommendation.</p><p>That knowledge belongs to the enterprise.</p><p>For agentic systems to improve, organizations need a way to capture that knowledge and make it reusable. In many cases, that does not require changing the model itself. It requires changing the ecosystem around the model: the knowledge base, retrieval layer, prompts, policies, guardrails, routing logic and workflows that shape how agents behave.</p><p>The model may remain the same. The learning system around it becomes smarter.</p><h2>Feedback loops turn every outcome into a teachable moment for agents</h2><p>Every agentic workflow creates signals.</p><p>An agent receives a request. It retrieves context, reasons through possible actions, calls tools, and generates answers. A human accepts, rejects, or modifies that answer. Downstream systems reveal whether the action worked.</p><p>That entire chain is valuable.</p><p>AI observability gives organizations visibility into what happened: the prompt, response, reasoning path, tool calls, data sources, intermediate steps, failure modes and outcomes. Without that visibility, organizations cannot understand why an agent behaved the way it did, let alone improve it.</p><p>But observability alone is not enough.</p><p>The larger opportunity is to turn observed behavior into institutional knowledge. A trace should not only help developers and operators debug an agent. 
It should help the enterprise understand what the agent learned, what the human corrected, what outcome followed, and what should change before the next similar event.</p><p>That is the shift from monitoring AI to teaching AI.</p><p>In the agentic enterprise, feedback loops connect action to outcome, outcome to knowledge and knowledge back to future action.</p><h2>A learning system in practice across security, observability and the network</h2><p>Consider a service experiencing intermittent degradation.</p><p>An observability agent detects unusual latency and error rates. A network agent identifies packet loss across a specific path. A security agent notices that the same time window includes suspicious authentication behavior and unusual traffic from a previously unseen source.</p><p>Individually, each agent has only a partial view. Together, they create a richer operational picture.</p><p>The first time this incident occurs, human experts may need to intervene. A network engineer confirms that packet loss was caused by a misconfigured routing change. A security analyst determines that the suspicious traffic was not an attack, but a side effect of a misrouted internal service. An SRE connects the network event to the application degradation.</p><p>That resolution contains knowledge the organization should not have to relearn.</p><p>A mature agentic learning system would capture the traces, human corrections, topology context, security findings, observability signals and final remediation steps. It would preserve the relationship between those signals: latency pattern, network path, identity behavior, routing change and remediation.</p><p>The next time a similar pattern appears, agents would not start from zero. 
They could retrieve the prior case, compare current conditions, recommend the proven diagnostic path and escalate with better context.</p><p>The underlying frontier model did not need to be retrained.</p><p>The enterprise learned.</p><h2>The architecture of the learning agentic enterprise</h2><p>A learning-oriented agentic enterprise needs more than a model or chatbot. It needs an architecture that can capture experience, turn it into usable knowledge, connect that knowledge to operational context, and govern how it changes future agent behavior.</p><p><b>Memory </b>preserves what happened: what the agent saw, what it did, where humans intervened, and what outcomes followed.</p><p><b>Knowledge bases</b> turn that experience into reusable guidance, including playbooks, examples, policies, procedures, and evidence.</p><p>A <b>data fabric </b>connects the operational environment. The signals agents need live across logs, metrics, traces, tickets, identity systems, security tools, network telemetry, collaboration platforms, and business applications. A data fabric makes those signals discoverable, correlated, governed, and usable in context.</p><p><b>AI observability </b>explains how agents behave by capturing prompts, tool calls, intermediate steps, responses, feedback, and outcomes. That visibility helps organizations understand where agents succeed, where they fail, and what should improve.</p><p>The <b>control plane</b> governs how learning becomes change: what knowledge is promoted, which prompts or policies are updated, which agents can use new information, what approvals are required, and how changes are audited.</p><p>Together, these capabilities allow AI systems to improve over time in a controlled, trustworthy way that allows the enterprise to learn from its own operations.</p><h2>The organizations that learn fastest will win </h2><p>The next era of AI will not be won by models alone. 
It will be won by organizations that can capture what they learn from every workflow, expert correction, incident, investigation, and outcome.</p><p>The most advanced agentic enterprises will not simply deploy more agents. They will build systems that allow every agent to benefit from the collective knowledge of the organization.</p><p>That means connecting operational data through a data fabric. It means observing agent behavior deeply enough to understand it. It means preserving experience in memory and institutionalizing it in knowledge bases. It means using a control plane to govern how learning changes agent behavior.</p><p>The future of AI is not a single autonomous agent acting alone. It is an ecosystem of agents, humans, data and controls that learns over time.</p><p>The organizations that build that ecosystem will create AI systems that get better with every interaction. Not because the model is constantly changing, but because the enterprise itself is becoming more intelligent.</p><p><i>Learn more about how </i><a href="https://www.splunk.com/ciscodatafabric"><i>Cisco Data Fabric powered by the Splunk Platform</i></a><i> is accelerating agentic operations.</i></p><p><i>Hao Yang is Vice President AI at Splunk, a Cisco Company.</i></p><hr/><p><i>Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact </i><a href="mailto:sales@venturebeat.com"><i><u>sales@venturebeat.com</u></i></a><i>.</i></p>]]></description>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/79J1LCPmW0K5edwrClgk73/f504c7a9a7e6de04e5654e27f17e30a1/Image.jpeg?w=300&amp;q=30" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Researchers introduce Self-Harness, a framework that lets AI agents rewrite their own rules, boosting performance up to 60%]]></title>
            <link>https://venturebeat.com/orchestration/researchers-introduce-self-harness-a-framework-that-lets-ai-agents-rewrite-their-own-rules-boosting-performance-up-to-60</link>
            <guid isPermaLink="false">34BZWRl0WoShIvXTDpjvrF</guid>
            <pubDate>Mon, 22 Jun 2026 14:23:00 GMT</pubDate>
            <description><![CDATA[<p>Not every company can or should build its own frontier AI language model. However, the <i>harness</i> controlling the model is something that most enterprises can and <i>should</i> customize for their specific purposes.</p><p>Of course, this is easier said than done. Agent harnesses are still largely tuned through manual, ad hoc debugging, a process that relies heavily on intuition rather than systematic feedback loops, making it difficult to keep pace with rapidly evolving LLMs.</p><p>To solve this challenge, researchers at the Shanghai Artificial Intelligence Laboratory have introduced “<a href="https://arxiv.org/abs/2606.09498">Self-Harness</a>,” a new paradigm in which an LLM-based agent systematically improves its own operating rules. By examining its own execution traces to apply edits, the system trades manual guesswork for empirical evidence.</p><p>Self-improving harnesses can enable development teams to deploy robust custom agents that continually adapt their own execution protocols to overcome model-specific weaknesses.</p><h2><b>The challenge of harness engineering</b></h2><p>An LLM-based agent&#x27;s performance is not determined solely by its underlying base model, but also by its harness: the surrounding system that provides context and enables the model to interact with the environment. A harness includes components like system prompts, tools, memory, verification rules, runtime policies, orchestration logic, and failure-recovery procedures.</p><p>This layer is crucial because many common agent failures stem from the harness rather than the model. For example, an agent may report success without checking the model’s response (e.g., running the code to see if it passes the tests), or it might retry a failed action repeatedly. 
The harness is also responsible for preventing <a href="https://venturebeat.com/ai/mits-new-recursive-framework-lets-llms-process-10-million-tokens-without">context rot or overload</a> when the agent’s interaction history grows very large. Examples of popular harnesses include SWE-agent, Claude Code, Codex, and OpenHands.</p><p>Harness engineering remains a significant challenge, but the bottleneck isn&#x27;t necessarily that humans are too slow or incapable. </p><p>In fact, Hangfan Zhang, lead author of the Self-Harness paper, told VentureBeat that &quot;in many cases, an experienced engineer with deep domain knowledge can still propose better changes than an LLM can today.&quot;</p><p>Instead, the true bottleneck of manual engineering is that it relies heavily on ad hoc debugging rather than a verifiable, empirical feedback loop. &quot;The deeper issue is that the current harness-engineering paradigm often lacks a systematic feedback loop,&quot; Zhang explained. &quot;Many edits are made based on intuition, a few observed failures, or ad hoc debugging.&quot;</p><p>With new models being released at a rapid pace, depending on human intuition to manually tune model-specific harnesses becomes increasingly costly and untenable. While some approaches use stronger models to improve the harnesses of weaker target agents, this dependence on external guidance has its own challenges, as these models may be costly, unavailable for frontier models, or mismatched to the target model&#x27;s failure modes.</p><h2><b>How Self-Harness works</b></h2><p>The Self-Harness paradigm enables an LLM-based agent to improve its own harness without relying on human engineers or stronger external models.</p><p>This continuous self-evolution is driven by a three-stage iterative loop that turns behavioral evidence into harness updates:</p><ul><li><p><b>Weakness mining:</b> Starting from an initial harness, the agent runs a set of tasks, producing execution traces with verifiable outcomes. 
The agent categorizes failed traces and tries to detect model-specific failure patterns.</p></li><li><p><b>Harness proposal:</b> Based on these failure patterns, the agent uses a “proposer” role to generate a set of diverse yet minimal harness modifications, each tied to a specific failure mechanism to avoid overly general corrections.</p></li><li><p><b>Proposal validation:</b> The system evaluates candidate modifications through regression tests. An edit is promoted only if it improves performance without causing measurable degradation on held-out tasks. If multiple candidate modifications pass the regression tests, they are merged into the next version of the harness, which then serves as the starting point for the next iteration.</p></li></ul><p>To visualize why an enterprise would need this, imagine an automated issue-fixing agent that reads internal documentation, writes patches, and opens pull requests. If the company updates its documentation style, the agent might suddenly fail, pulling the wrong context or writing bad patches. </p><p>On the surface, the agent simply looks broken. But Self-Harness turns this ambiguous failure into a solvable problem. &quot;The failure traces expose where the agent is misusing the new documentation format; the proposer can generate a targeted harness edit... and the evaluator can decide whether that edit improves the failing cases without regressing other cases,&quot; Zhang said.</p><h2><b>Self-Harness in action</b></h2><p>The researchers evaluated Self-Harness on <a href="https://www.tbench.ai/">Terminal-Bench-2.0</a>, a benchmark that tests general tool-based execution, including artifact management, command use, verification behavior, and recovery from execution errors. 
They applied Self-Harness with MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5.</p><p>To isolate the impact of the self-evolving harness, they started with a minimal harness built upon the DeepAgent SDK, containing only the benchmark-facing system prompt, and the default filesystem and shell tools. The model backend, tool set, benchmark environment, and evaluator were kept unchanged while only the harness was allowed to vary.</p><p>The quantitative results show that <b>agents improved their performance through automated harness edits. </b>On held-out tasks, <b>performance jumped significantly across the board, ranging from 33 to 60 percent </b>relative improvements for different models.</p><p>Importantly, an explicit acceptance rule promotes only those edits that improve performance without introducing unacceptable regressions. What makes Self-Harness powerful for enterprise applications is that it doesn’t simply make the prompt longer or add generic instructions. Instead, it introduces targeted changes that reflect the recurring problems each model encounters during execution.</p><p>For example, under the baseline harness, MiniMax M2.5 would get stuck endlessly exploring dataset configurations until the execution environment timed out, failing to produce any deliverables. Through Self-Harness, the system identified this specific flaw and wrote a &quot;loop breaker&quot; into its runtime policy, forcing the agent to stop and redirect its approach after 50 tool calls. It also added a rule to create an initial version of required artifacts as early as possible.</p><p>On the other hand, Qwen-3.5 had a habit of hitting a file overwrite error and then blindly retrying the same command repeatedly, eventually deleting necessary files out of confusion before stopping. 
The self-harness fixed this by introducing a strict command-retry discipline (forbidding exact duplicate commands) and a mechanism that forced the agent to immediately recreate any missing artifacts if a file error occurred.</p><p>GLM-5 struggled to preserve environment changes across different commands, and would often waste time on massive downloads or finalize tasks even when sanity checks were failing. Its self-generated harness introduced rules instructing the agent to persist PATH variables across shell sessions, limit external compute, and repair any failed sanity checks before concluding its run.</p><h2><b>The hidden costs of automated harnesses</b></h2><p>While Self-Harness automates the tedious work of tracking down idiosyncratic model failures, decision-makers must be realistic about the trade-offs. Replacing human engineering with automated trial-and-error requires significant computational overhead.</p><p>&quot;Self-Harness replaces part of the human engineering burden with repeated proposal generation, parallel candidate evaluation, and regression testing,&quot; Zhang said. &quot;That can mean more API tokens, more latency during optimization, and more infrastructure for running evaluation tasks.&quot;</p><p>Also, this system relies on the accuracy of its evaluation pipeline. During their experiments on Terminal-Bench-2.0, the researchers relied on strict, deterministic verifiers to ensure the agent&#x27;s edits were actually helpful. Without this rigorous ground truth, an automated system risks promoting bad updates. &quot;[The] evaluation system is not an optional component; it is what lets us trade human intuition for empirical evidence,&quot; Zhang said.</p><p>This reliance on strict verifiers also dictates where Self-Harness should be deployed. 
&quot;The best deployment targets today are environments where failures can be measured and where trial-and-error is relatively safe,&quot; Zhang said, pointing to coding, internal workflow automation, and DevOps data pipelines as ideal use cases.</p><p>Conversely, enterprises should avoid fully automating harnesses in high-stakes or subjective fields. &quot;The clearest red flags are domains where evaluation is subjective, delayed, non-deterministic, or costly to get wrong, such as medical decision-making, safety-critical infrastructure, or legal decisions.&quot;</p><h2><b>From prompt tweakers to feedback architects</b></h2><p>The introduction of self-improving agents does not mean coding or enterprise workflows will suddenly become human-free. The quality of collaboration between the human engineer and the AI is still paramount and difficult to capture with automated benchmarks. </p><p>Instead, the engineering profession is moving up the abstraction layer. &quot;The role of enterprise engineers will shift from manually patching individual prompts or tool calls toward designing the feedback systems that make agent improvement possible,&quot; Zhang predicted. Moving forward, &quot;the engineer becomes less of a prompt tweaker and more of a feedback architect.&quot;</p><p>As foundational models grow more capable, they will naturally absorb many capabilities that currently require manual harness engineering. &quot;But once that happens, the harness will not disappear; its scope will move outward to connect the model to richer external environments,&quot; Zhang said. &quot;Until that boundary moves beyond what humans can evaluate, humans will remain critical providers of feedback.&quot;</p>]]></description>
            <author>bendee983@gmail.com (Ben Dickson)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/7evVer22ufMtmOwoZ1Cfuq/6222b39130f1015e2fc42f84df11c42c/self-improving_harness.jpg?w=300&amp;q=30" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[AI hit the memory wall — now it needs a new context tier]]></title>
            <link>https://venturebeat.com/orchestration/ai-hit-the-memory-wall-now-it-needs-a-new-context-tier</link>
            <guid isPermaLink="false">4QXpEiEaZco5CgBerd5t4h</guid>
            <pubDate>Mon, 22 Jun 2026 07:00:00 GMT</pubDate>
            <description><![CDATA[<p><i>Presented by Solidigm</i></p><hr/><p>As inference workloads evolve from discrete question-and-answer exchanges into persistent, multi-step agentic systems, GPU availability is no longer the most critical AI bottleneck. Instead, the bottleneck has migrated from compute to context, says Jeff Harthorn, AI applied research lead at Solidigm.</p><p>&quot;Why context management has become a primary bottleneck, more than GPU availability or compute efficiency, is the question of 2026,&quot; says Harthorn. &quot;GPUs have gotten dramatically cheaper per FLOP. Model architectures and inference serving engines have all gotten much more efficient. But the thing that&#x27;s grown faster than both of those is context. The persistent state that has to live between sessions has grown even faster than context itself.&quot;</p><p>Three trends are driving that growth. Context windows are growing dramatically, making individual inputs far larger than before. Agentic AI systems chain dozens or hundreds of model calls together, each generating state that must be tracked. And enterprises are requiring that inference state persist across sessions for audit, governance, and reuse. These trends compound each other, pushing context volumes beyond what any existing memory tier was designed to handle.</p><p>&quot;Those three things are all happening at the same time, all of which are pushing context data and context memory into the stratosphere much more quickly than we&#x27;re used to seeing,&quot; adds Ace Stryker, director of AI and ecosystem marketing at Solidigm.</p><p>The solution is a dedicated context tier emerging between GPU memory and bulk network storage: a layer of high-performance, high-density flash designed specifically to hold and serve key-value (KV) cache, the inference data that allows models to retain and reuse context, and retrieval data at inference speed. Nvidia has formalized this architecture under the term CMX. 
Storage companies including Solidigm are building SSD products optimized for this workload.</p><p>&quot;Storage has not been the first thing folks have thought about when they&#x27;ve been planning their enterprise infrastructure buildout,&quot; Stryker says. &quot;In a lot of ways, it was a relatively small cost compared to compute, and it was a commodity. You just shopped around for the lowest dollar per gigabyte and called it good. But now, if your storage is not up to snuff, your ROI suffers, and it directly impacts your bottom line.&quot;</p><h2>Why AI inference requires a different storage architecture than training</h2><p>The storage architecture that AI systems rely on today was largely inherited from training workflows. Training is sequential and write-dominated, with data moving in large blocks to and from bulk object storage. The tier structure, with high-bandwidth memory on the GPU, fast NVMe in the server, and bulk storage over the network, serves that use case reasonably well.</p><p>However, inference is a different animal. Its I/O signature is fine-grained, latency-sensitive, and increasingly stateful. KV cache data and retrieval data each have distinct access patterns, but both need to be served quickly and reused across interactions. Neither fits cleanly within GPU high-bandwidth memory, which is expensive and physically constrained, nor within traditional bulk storage, which was never designed for active inference workloads.</p><p>&quot;The architectural gap that&#x27;s interesting to me right now isn&#x27;t at the top of the stack or the bottom, it&#x27;s right in the middle,&quot; Harthorn says. &quot;A lot of what sits below the GPU HBM is being asked to do things it wasn&#x27;t really designed for, which is where the most interesting systems work today is happening.&quot;</p><p>One of the most visible symptoms of this gap is recomputation. 
In inference, the pre-fill stage processes all of the context relevant to a given session before token generation can begin. When KV cache state isn&#x27;t available in a fast, accessible tier, the system recomputes it, burning GPU cycles that produce no new value.</p><p>&quot;A meaningful share of GPU cycles end up going to re-pre-filling,&quot; Harthorn explains. &quot;During all of that calculated context, that&#x27;s potentially compute that&#x27;s being spent reproducing state, rather than doing new work. When you start looking at the problem that way, GPU utilization starts looking like it&#x27;s partly a storage problem.&quot;</p><p>This reframing is driving renewed interest in a metric borrowed from networking: goodput, or useful tokens per dollar, rather than raw tokens per dollar.</p><h2>The AI context memory tier and how it works</h2><p>The industry&#x27;s response is taking structural form. A new tier is emerging between GPU memory and traditional network storage, designed specifically to hold and serve inference context, a layer distinct from drives inside GPU servers (G3) and storage servers over the network (G4), engineered to serve context data back to accelerators as rapidly as possible.</p><p>&quot;If you&#x27;re building a data center starting in the second half of this year, or the beginning of next year, you can&#x27;t think about storage only living in two places,&quot; Stryker says. &quot;Storage has to live in at least three places to handle the context memory tier, and that&#x27;s likely to be a permanent fixture in how the infrastructure gets built going forward.&quot;</p><p>It&#x27;s analogous to the emergence of object storage as a category, which didn&#x27;t exist until enough workloads needed it. And once it did, it developed its own primitives, SLAs, cost models, and an ecosystem of vendors. </p><p>&quot;The context tier looks like it might be on a similar arc,&quot; Harthorn says. 
&quot;That volumetric pressure is causing the category to form, rather than any one vendor&#x27;s road map.&quot;</p><p>For infrastructure leaders, this means actively planning for the new tier rather than treating it as optional. Deploying additional NAND at this layer reduces dependency on DRAM, which is orders of magnitude more expensive per gigabyte and constrained in both availability and thermal headroom.</p><p>&quot;In terms of your investment effectiveness, you&#x27;re laying out less cash to do it if you rely on the SSD layer in the way that Nvidia is now recommending and prescribing for a lot of use cases,&quot; Stryker adds.</p><h2>What flash needs to deliver to support AI inference</h2><p>Participating meaningfully in the inference stack places new demands on SSD technology. Tail latency, the worst-case performance of a drive, must be predictable, not just fast on average. An orchestration system that allocates GPU resources based on expected storage response times cannot tolerate unexpected multi-second delays. Consistent, observable performance matters more here than peak throughput.</p><p>Beyond latency, density becomes a critical concern, especially at hyperscale. In data centers where power, not cost, is the binding constraint, watts per petabyte becomes the operative metric. Floating gate NAND, the manufacturing approach at the core of Solidigm&#x27;s products, is suited to that calculation. Network integration via NVMe over Fabrics, RDMA, and eventual CXL support is also essential, given the tight latency budgets of active inference pipelines.</p><p>&quot;The drives have to have reliable performance characteristics, beyond the throughput side and being able to transfer as much data as possible as fast as possible, the way that training needed,&quot; Harthorn says. 
&quot;Now it&#x27;s about being able to do it very consistently, in a way that&#x27;s very observable to the people operating and orchestrating these systems.&quot;</p><h2>How enterprise AI leaders should plan for the context tier </h2><p>The standards, software primitives, and best practices being established now will define how AI inference infrastructure operates for years to come. Solidigm is engaged in that process through standards bodies, partner lab collaborations, and published research, which is critical precisely because the category is still forming.</p><p>&quot;The interesting question for the next couple of years isn&#x27;t whether AI infrastructure needs more compute,&quot; Harthorn says. &quot;It&#x27;s whether it can use what it has more efficiently. A lot of that answer runs through this tier that is being built today.&quot;</p><hr/><p><i>Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact </i><a href="mailto:sales@venturebeat.com"><i><u>sales@venturebeat.com</u></i></a><i>.</i></p>]]></description>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/5GJSLwxzfZtlmGbw0irLnU/fc9fab29b482054ff058a97a10e51be1/FastCo_HeroImage.jpg?w=300&amp;q=30" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[7,000 Langflow servers are under attack. LangGraph and LangChain have the same holes]]></title>
            <link>https://venturebeat.com/security/7000-langflow-servers-under-attack-langgraph-langchain-same-holes</link>
            <guid isPermaLink="false">4v4X3T0BLpo0EzJ33GND2V</guid>
            <pubDate>Fri, 19 Jun 2026 21:14:19 GMT</pubDate>
            <description><![CDATA[<p>Your AI agent did exactly what it was designed to do. The framework underneath it just handed an attacker a shell on the box that holds your OpenAI key, your database credentials, and your CRM tokens.</p><p>That is not a hypothetical. In a few months, three of the most widely deployed AI agent frameworks each turned a known, ordinary bug class into a way through. <a href="https://research.checkpoint.com/2026/from-sqli-to-rce-exploiting-langgraphs-checkpointer/">Check Point Research</a> chained a SQL injection in LangGraph’s SQLite checkpointer to full remote code execution. Tenable and VulnCheck tracked a path traversal in Langflow’s file upload endpoint to active, in-the-wild RCE. <a href="https://www.cyera.com/research/langdrained-3-paths-to-your-data-through-the-worlds-most-popular-ai-framework">Cyera</a> documented a path traversal in LangChain-core’s prompt loader that reads your secrets off disk. Two paths to a shell, one to your keys. They are the same bug, wearing three frameworks.</p><p>These frameworks became production infrastructure faster than anyone secured them. They store agent state, take file uploads, load prompt configs, and hold the credentials to databases, CRMs, and internal APIs. The edge tools watch traffic. The endpoint tools watch processes. Neither was built to treat an imported framework as a boundary worth guarding, and that blind spot is exactly where all three chains live, widening every week as these frameworks ship to production.</p><h2><b>The LangGraph chain, SQL injection to a Python shell</b></h2><p>Start with the one most teams pulled into production this quarter. LangGraph gives AI agents memory through checkpointers, the persistence layer that stores execution state. It has cleared over <a href="https://pypistats.org/packages/langgraph">50 million downloads a month</a>. Yarden Porat of Check Point Research took that layer apart and found three vulnerabilities. 
Two of them chain to RCE.</p><p><a href="https://advisories.gitlab.com/pypi/langgraph-checkpoint-sqlite/CVE-2025-67644/">CVE-2025-67644</a>, rated CVSS 7.3, is a SQL injection in the SQLite checkpointer. The function that builds the WHERE clause for checkpoint lookups drops user-controlled filter keys straight into the query with no parameterization and no escaping. This does not hit everyone, but where it hits, it is serious. A deployment is exposed when it self-hosts LangGraph on the SQLite or Redis checkpointer and lets untrusted input reach get_state_history() or a similar history endpoint. Meet those conditions, and an attacker who controls the filter writes a fabricated row straight into the checkpoint table. Run LangChain’s managed LangSmith platform on PostgreSQL, and the exposure is gone.</p><p>Then <a href="https://advisories.gitlab.com/pypi/langgraph/CVE-2026-28277/">CVE-2026-28277</a>, CVSS 6.8, finishes the job. LangGraph’s msgpack checkpoint decoder rebuilds Python objects from the stored data, which lets it import a module and call a named function with attacker-supplied arguments. That step needs write access to the checkpoint store; the SQL injection is what grants it remotely. LangGraph loads the forged row as a legitimate checkpoint, the decoder runs the specified function, including os.system, and code executes under the identity of the agent server. A third issue, CVE-2026-27022, CVSS 6.5, reaches the same place through the Redis checkpointer.</p><p>There has been no confirmed exploitation in the wild yet. A working proof-of-concept is public in Check Point’s disclosure. The fixes are version bumps: langgraph-checkpoint-sqlite to 3.0.1, langgraph to 1.0.10, and langgraph-checkpoint-redis to 1.0.2.</p><h2><b>The Langflow chain, one unauthenticated request to RCE</b></h2><p>Langflow is the one already under attack. 
CVE-2026-5027, CVSS 8.8, is a path traversal in the POST /api/v2/files endpoint, which takes the filename straight from the form data and writes it to disk unsanitized. An attacker packs that filename with traversal sequences and drops a file anywhere, such as a cron job in /etc/cron.d/. Because Langflow ships with auto-login enabled in its default configuration, an exposed instance needs no credentials at all. A single unauthenticated request reaches the endpoint, and the next cron run hands over a shell.</p><p>VulnCheck’s Caitlin Condon confirmed exploitation on June 9: “Our Canaries observed exploitation of CVE-2026-5027 that successfully leveraged the path traversal to write what appear to be test files on victim systems.” Censys counted roughly 7,000 exposed instances on the internet, most of them in North America. This is the third Langflow flaw to draw active exploitation this year, after <a href="https://www.probablypwned.com/article/langflow-cve-2025-34291-muddywater-account-takeover-rce">CVE-2025-34291</a>, which the Iranian state-sponsored group MuddyWater weaponized and which CISA added to its <a href="https://thehackernews.com/2026/05/cisa-adds-exploited-langflow-and-trend.html">Known Exploited Vulnerabilities catalog</a> in May. CVE-2026-5027 itself was patched in version 1.9.0, released April 15.</p><p>The timeline is what sets the clock. The patch shipped April 15. Attacks started in June, and <a href="https://www.thestack.technology/langflow-instances-are-getting-exploited-again/">VulnCheck added CVE-2026-5027 to its exploited-vulnerabilities list June 8</a> once its sensors caught the first in-the-wild hits. Every instance left unpatched between those two dates has been sitting in the open for almost two months. 
The lesson for security teams is to start the patch clock at disclosure, not at a federal catalog entry.</p><h2><b>The LangChain-core gap, arbitrary file reads through the prompt loader</b></h2><p>LangChain-core, the foundation under both, disclosed <a href="https://thehackernews.com/2026/03/langchain-langgraph-flaws-expose-files.html">CVE-2026-34070</a>, CVSS 7.5, a path traversal in its legacy prompt-loading API. The load_prompt() functions read a file path out of a config dict with no check against traversal sequences or absolute paths, so an attacker who influences that path reads arbitrary files the process can reach, including the .env file holding OPENAI_API_KEY and ANTHROPIC_API_KEY. Cyera paired it with CVE-2025-68664, CVSS 9.3, a deserialization flaw that resolves environment secrets through a crafted object. The fix versions differ, which matters when you patch: CVE-2026-34070 lands in <a href="https://security.snyk.io/vuln/SNYK-PYTHON-LANGCHAINCORE-15809257">langchain-core 1.2.22 and 0.3.86</a>; CVE-2025-68664 lands earlier in <a href="https://nvd.nist.gov/vuln/detail/CVE-2025-68664">1.2.5 and 0.3.81</a>. Clear both, or the higher-severity flaw stays live behind a patched one.</p><p>Three frameworks, three classic AppSec bugs. Path traversal. SQL injection. Unsafe deserialization. Nothing exotic, nothing AI-specific, just old vulnerabilities living inside new infrastructure. None of this is a frontier-model problem. It is plumbing, sitting in the layer where AI meets the enterprise.</p><h2><b>Why the scanner cannot see it</b></h2><p>Merritt Baer, CSO at <a href="https://www.enkryptai.com/">Enkrypt AI</a> and former deputy CISO at AWS, has named what makes this kind of failure hard to see coming. It does not announce itself as an AI problem. 
&quot;CISOs will experience MCP insecurity not in the abstract, but when an employee pastes sensitive data into a tool, or when an attacker finds an unauthenticated MCP server in your cloud,&quot; Baer told VentureBeat. &quot;It won&#x27;t feel like &#x27;AI risk.&#x27; It will feel like your traditional security program failing.&quot; The framework chains here are the same shape. An exposed Langflow instance is an unauthenticated server in your cloud, and the alert, if one fires, reads like an ordinary incident.</p><p>That is the gap in one sentence. The exploit lives in the framework your code imports. The WAF never sees a msgpack decoder running three layers down. The EDR watches the agent server make the same process calls it makes a thousand times a day and waves it through. Both tools are doing their job. Nobody scoped the framework itself as the thing that could turn on you. </p><p>The root cause is older than AI, and Baer names it. “MCP is shipping with the same mistake we’ve seen in every major protocol rollout: insecure defaults,” she told VentureBeat. “If we don’t build authentication and least privilege in from day one, we’ll be cleaning up breaches for the next decade.” Langflow’s auto-login is that mistake shipped. LangChain-core’s unguarded prompt loader is that mistake shipped. The convenient default is the vulnerability. And the moment an agent connects to anything, that risk compounds. “You’re not just trusting your own security, you’re inheriting the hygiene of every tool, every credential, every developer in that chain,” Baer said. “That’s a supply chain risk in real time.”</p><p>There is a governance failure layered on top of the technical one, and it is the same miscategorization Assaf Keren, chief security officer at Qualtrics and former CISO at PayPal, has flagged in adjacent tooling. 
“Most security teams still classify experience management platforms as ‘survey tools,’ which sit in the same risk tier as a project management app,” Keren told VentureBeat. “This is a massive miscategorization.” Swap in AI agent frameworks, and it still holds. Teams file LangGraph, Langflow, and LangChain under developer convenience, then wire them into databases, CRMs, and provider keys. “Security has to be an enabler,” Keren said, “or teams route around it.” These frameworks are what routing around it looks like.</p><p>Follow the money and it points at the same layer. On its <a href="https://www.fool.com/earnings/call-transcripts/2026/06/03/crowdstrike-crwd-q1-2027-earnings-transcript/">Q1 fiscal 2027 earnings call</a>, CrowdStrike reported its AI detection and response line up more than 250% sequentially, and on June 17 it <a href="https://www.crowdstrike.com/en-us/press-releases/crowdstrike-advances-ai-and-cloud-security-operations-on-aws/">extended that runtime coverage</a> to agent, LLM, and MCP traffic on AWS. George Kurtz, the company’s co-founder and CEO, named the reason in plain terms: “Agents run on the endpoint. They make tool calls, access files, invoke APIs, and move data at the process level.” That is the exact plumbing these chains abuse, and real money is now moving to the layer your AppSec scan skips.</p><h2><b>What to put in front of the board</b></h2><p>The board does not need the CVE numbers. It needs the consequence, and Keren draws the line the board cares about. Most teams have mapped the technical blast radius. “But not the business blast radius,” Keren told VentureBeat. “When an AI engine triggers a compensation adjustment based on poisoned data, the damage is not a security incident. It is a wrong business decision executed at machine speed.” A framework RCE is the same problem one layer earlier. 
The agent does not just leak a credential; it acts on production systems with it, and the business sees an outcome no one can explain.</p><p>So frame it the way a board frames it: we run AI agent frameworks in production that can be turned into remote shells through bugs our scanners are not built to find, all three are patched, one is under active attack, and here is the date every instance is verified and closed. None of this required custom malware or a zero-day.</p><h2><b>The six-question checklist</b></h2><p>Six trust boundaries, one per row, each with the question, the proof point, the command, the fix, and the board line. Run it tonight.</p><table><tbody><tr><td><p><b>Trust-Boundary Question</b></p></td><td><p><b>Proof Point</b></p></td><td><p><b>What Broke</b></p></td><td><p><b>Verify Before You Install</b></p></td><td><p><b>The Fix</b></p></td><td><p><b>Board Language</b></p></td></tr><tr><td><p><b>1. Can the agent&#x27;s state store be poisoned with code?</b></p></td><td><p>LangGraph SQLi-to-RCE chain. CVE-2025-67644 (CVSS 7.3) chains into CVE-2026-28277 (CVSS 6.8). PoC public, no in-the-wild use yet.</p></td><td><p>Filter keys interpolated into SQL with an f-string. Forged checkpoint row hits the msgpack decoder, which imports and runs an attacker-named callable.</p></td><td><p>pip show langgraph-checkpoint-sqlite. Below 3.0.1 = vulnerable. Confirm get_state_history() is not exposed to network input.</p></td><td><p>Upgrade langgraph-checkpoint-sqlite to 3.0.1, langgraph to 1.0.10, langgraph-checkpoint-redis to 1.0.2.</p></td><td><p>“Our agent memory layer can be tricked into running attacker code. Vendor has patched it. We are upgrading and confirming the endpoint is not exposed.”</p></td></tr><tr><td><p><b>2. Can an unauthenticated request write a file to our agent server?</b></p></td><td><p>Langflow CVE-2026-5027 (CVSS 8.8). On VulnCheck KEV (June 8). Active exploitation confirmed June 9. 
~7,000 exposed instances (Censys).</p></td><td><p>Path traversal in POST /api/v2/files. Filename unsanitized. Auto-login on by default. Two HTTP calls drop a cron job and earn a shell.</p></td><td><p>Query Censys or Shodan for your Langflow, Flowise, n8n, and Dify instances on the perimeter. Check whether auto-login is enabled.</p></td><td><p>Upgrade Langflow to 1.9.0+. Disable auto-login. Pull AI dev tools behind VPN or zero-trust. Isolate port 7860.</p></td><td><p>“Our AI dev tools are reachable from the internet with login off. This exact flaw is under active attack now. We are pulling them behind access controls today.”</p></td></tr><tr><td><p><b>3. Can our prompt loader read files it should never touch?</b></p></td><td><p>LangChain-core CVE-2026-34070 (CVSS 7.5), path traversal in the prompt-loading API. Paired with deserialization CVE-2025-68664 (CVSS 9.3).</p></td><td><p>load_prompt() reads a config-supplied path with no traversal check, returning files such as the .env holding OPENAI_API_KEY and ANTHROPIC_API_KEY.</p></td><td><p>pip show langchain-core. Below 1.2.22 (1.x) or 0.3.86 (0.x) = vulnerable. Audit any code passing user-influenced paths to load_prompt().</p></td><td><p>Upgrade langchain-core past both fixes: 1.2.22 / 0.3.86 (CVE-2026-34070) and 1.2.5 / 0.3.81 (CVE-2025-68664). Replace load_prompt() with an allowlisted directory. Run as non-root.</p></td><td><p>“Our prompt system could be steered to read our API keys off disk. We are patching and removing the legacy loader.”</p></td></tr><tr><td><p><b>4. Does a compromised framework hand over every credential at once?</b></p></td><td><p>These frameworks are often deployed with provider keys, database credentials, and integration tokens available to the process environment. Cyera documents the credential-exfiltration path.</p></td><td><p>One RCE on the agent server exposes every secret the process can read. 
Blast radius is the full credential set, not one app.</p></td><td><p>Inventory which secrets each framework process can reach. Confirm keys come from a secrets manager, not static .env files.</p></td><td><p>Move provider keys to ephemeral injection. Rotate any key a vulnerable instance could have read. Scope each key to least privilege.</p></td><td><p>“A single break in one AI framework exposes the keys to every model and data store it touches. We are rotating and scoping them now.”</p></td></tr><tr><td><p><b>5. Are these frameworks running outside security governance?</b></p></td><td><p>A prior Langflow flaw, CVE-2025-34291, was weaponized by Iranian-linked MuddyWater and added to CISA KEV in May. Shadow AI is the new shadow IT.</p></td><td><p>Teams stand frameworks up for speed, give them credentials, and never bring them under review. The security team cannot see what it does not know exists.</p></td><td><p>Run a discovery sweep for AI frameworks outside change management. Map each to an owner and an approval record.</p></td><td><p>Assign every framework a documented owner and a place in the approval process. Offer a sanctioned alternative so teams do not route around you.</p></td><td><p>“We have AI frameworks in production that no one formally approved. We are bringing them under governance, not banning them.”</p></td></tr><tr><td><p><b>6. Can our scanners even see inside the framework at runtime?</b></p></td><td><p>Runtime detection is forming around this layer: CrowdStrike Falcon AIDR expanded to AWS June 17 (Bedrock, Kiro, Strands); its <a href="https://www.crowdstrike.com/en-us/press-releases/crowdstrike-expands-project-quiltworks-with-aws-hardening-the-cloud-attack-surface-against-frontier-ai-risk/">QuiltWorks coalition</a> now covers cloud workloads.</p></td><td><p>WAF reads HTTP at the edge. EDR watches the endpoint. 
By default, neither reliably models a msgpack decoder or a prompt loader three layers down in an imported framework as a separate trust boundary.</p></td><td><p>Test whether your AppSec scan covers third-party framework internals. Track CVEs by dependency, not just by what your edge tools can parse.</p></td><td><p>Add framework dependencies to vuln management. Treat agent output and stored state as untrusted. Patch on disclosure, not on KEV listing.</p></td><td><p>“Our scanners check our code, not the frameworks our code imports. We are closing that blind spot and patching on disclosure, not waiting for the federal catalog.”</p></td></tr></tbody></table><p><i>How to read this table: each row is one trust boundary, left to right, from the question to ask to the line to read your board.</i></p><h2><b>Give the board the deadline, not the technology</b></h2><p>The fixes are not a re-architecture. They are version bumps and config changes you can land this week. The exposure is the gap between the day the patch shipped and the day your team runs the checks, and right now that gap is measured in months. The frameworks did exactly what they were built to do. </p>]]></description>
            <author>louiswcolumbus@gmail.com (Louis Columbus)</author>
            <category>Security</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/5CFo8mBoW1WjItcZvYyHpg/3172659c88b4856fe7137de54672ab16/hero.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Fine-tuning forgets. RAG leaks context. Hypernetworks build the model your agent needs on demand.]]></title>
            <link>https://venturebeat.com/orchestration/fine-tuning-forgets-rag-leaks-context-hypernetworks-build-the-model-your-agent-needs-on-demand</link>
            <guid isPermaLink="false">6O4fBfCpDsG1PifGNWI4Vg</guid>
            <pubDate>Fri, 19 Jun 2026 16:30:50 GMT</pubDate>
            <description><![CDATA[<p>Enterprise teams keep watching the same thing happen. An AI agent demos beautifully, goes to production, and stalls: it runs for a short stretch, then needs a human to top up its context and check its output, and the promised efficiency drains into supervision. The agent did the work; you did the watching. It’s one reason so many agent pilots never turn into production systems.</p><p>The pitch on the other side of that wall is the one every team wants to believe: an agent that runs a long job on its own, overnight if it has to, and leaves a person to validate only the last 10%. Whether that is achievable turns on a problem the orchestration conversation mostly skips. When AI firm Chroma tested 18 leading models, <a href="https://www.morphllm.com/context-rot">every one lost accuracy as its input grew</a>, a property of how attention works, not a gap a stronger model closes. An agent fed more and more of your business as it runs does not get steadier. It gets shakier.</p><p>This is the layer beneath the orchestration race. Routing, durable execution and observability all assume each agent is already competent enough to coordinate in the first place. The deeper question is how long an agent can run before a human has to step in, and that comes down to where your company&#x27;s knowledge lives relative to the model. Both standard fixes leave a human in the loop.</p><h2>Why teaching a model your business keeps you in the loop</h2><p>Frontier models keep getting more capable, and the gap does not close, because it is not a capability problem. It is about where your knowledge sits relative to the model, and enterprises have had <a href="https://venturebeat.com/ai/fine-tuning-vs-in-context-learning-new-research-guides-better-llm-customization-for-real-world-tasks">two ways</a> to place it there. </p><p>The first is fine-tuning, which bakes knowledge into the weights. 
It remains subject to catastrophic forgetting, a problem identified in the 1980s and <a href="https://www.emergentmind.com/topics/catastrophic-forgetting-in-language-models">still unresolved in 2026</a>: teaching a model something new tends to erode what it already knew. Teams work around it by isolating each task in its own fine-tuned model or adapter, which produces a sprawling estate of models that <a href="https://www.infoworld.com/article/4131242/researchers-propose-a-self-distillation-fix-for-catastrophic-forgetting-in-llms.html">raises cost and governance overhead</a>. And a fine-tuned model is a snapshot, stale the day a policy changes, when the expensive, slow retraining cycle starts over.</p><p>The second is in-context learning, which skips retraining by placing the relevant policies in the prompt at run time. This is where context rot bites. Retrieval narrows what goes into the prompt, but a retrieval miss looks identical to a confident answer, and both cost and latency climb with every token added.</p><p>The two failures rhyme. With fine-tuning, the model can be confidently working from last quarter&#x27;s policy. With in-context learning, it can be confidently working from a detail it lost in the middle of a long prompt. Either way the output looks equally assured, so you cannot tell which parts are wrong without checking all of them. That is why the human never gets to leave. Some teams run both at once, fine-tuning the stable knowledge and retrieving the rest. That softens each failure but removes neither: on any given output you still cannot be sure the model is both current and working from the right context, so you still check it.</p><h2>A third path: generate the specialist model on demand</h2><p>A third approach is moving from research into early product. Instead of retraining one model or stuffing its prompt, a generator builds a small, task-specific model on demand from your policies, at inference time. 
The generator is a hypernetwork: a network whose output is the weights of another network. </p><p>The idea was <a href="https://arxiv.org/abs/1609.09106">named in 2016</a>; applying it to produce specialist language models from text or documents is recent and active. Sakana AI&#x27;s <a href="https://arxiv.org/abs/2506.06105">Text-to-LoRA</a>, presented at ICML 2025, generates a model adapter from a plain-language description in a single pass, and a 2026 system called SHINE calls hypernetwork adaptation <a href="https://arxiv.org/pdf/2602.06358">a promising new frontier</a>, precisely because it sidesteps both the retraining cost of fine-tuning and the context limits of prompting.</p><p>The point of generating adapters rather than training and storing them is to collapse a sprawling library of per-task LoRAs into one network that can produce them on demand, including for tasks it has not seen.</p><p>The elegant part is how this closes the loop on the problem above: the per-task adapter teams hand-build to dodge catastrophic forgetting is the same object a hypernetwork produces automatically. The model zoo stops being a governance headache and becomes a generated output.</p><p>The case for going small underneath all this was put most directly in a 2025 paper by <a href="https://arxiv.org/abs/2506.02153">Nvidia researchers</a>: for the narrow, repetitive tasks that fill agent workflows, small models are capable enough and 10 to 30 times cheaper to run than frontier generalists. Nace.AI, a Palo Alto company that raised a <a href="https://www.businesswire.com/news/home/20260505315897/en/">$21.5 million seed round in May</a>, is the clearest commercial instance. Its core technology, a generator it calls a MetaModel, <a href="https://nace.ai/research/enterprise-policy-injection-with-metamodels">produces parameter adaptations for a model at inference time</a> from a company&#x27;s policies, pointed at regulated work: audit, compliance, risk assessment. 
The company says its agents handle the bulk of a workflow while human experts validate the result, a split it markets as 90/10.</p><h2><b>How the three approaches compare</b></h2><table><tbody><tr><td><p>
</p></td><td><p><b>Fine-tuning</b></p></td><td><p><b>In-context / RAG</b></p></td><td><p><b>Hypernetwork-generated model</b></p></td></tr><tr><td><p><b>Where business knowledge lives</b></p></td><td><p>In the model&#x27;s weights</p></td><td><p>In the prompt, re-supplied each run</p></td><td><p>In on-demand generated weights</p></td></tr><tr><td><p><b>Cost to update on a policy change</b></p></td><td><p>High: retrain</p></td><td><p>Low: edit the source</p></td><td><p>Low: regenerate</p></td></tr><tr><td><p><b>Staleness</b></p></td><td><p>High: a snapshot</p></td><td><p>Low</p></td><td><p>Low: regenerated from current policy</p></td></tr><tr><td><p><b>Per-call cost and latency</b></p></td><td><p>Low</p></td><td><p>High, grows with context</p></td><td><p>Low at run time</p></td></tr><tr><td><p><b>Dominant failure mode</b></p></td><td><p>Forgetting; model-zoo sprawl</p></td><td><p>Context rot; silent retrieval misses</p></td><td><p>Generator quality; calibration</p></td></tr><tr><td><p><b>Who owns the improving asset</b></p></td><td><p>Whoever trains the model</p></td><td><p>Whoever holds the data store</p></td><td><p>Depends where generator and feedback live</p></td></tr></tbody></table><h2>Why a hypernetwork-built model raises the autonomy ceiling</h2><p>A model that is narrow, current and small has a smaller surface on which to be wrong. Fewer errors, confined to a known domain, mean fewer outputs an agent has to escalate to a person, which is the real basis for any high-autonomy claim. It is also where a number like 90/10 comes from: not a dial set in advance, but an outcome of how little the system needs to hand back. Reported autonomy shares are best read as measurements of an architecture, not as settings.</p><p>Two design choices decide whether that autonomy is trustworthy or merely fast. The first is grounding: tying every output to its source so a reviewer can verify rather than redo. 
Research models built for exactly this, such as <a href="https://arxiv.org/pdf/2510.00880">HalluGuard</a>, label each claim as supported or not and cite the passage they relied on. Nace ships its agents with grounding models and reasoning traces for the same reason. A 10% review only means something if the human can confirm provenance in seconds.</p><p>The second is the feedback loop, and it forces a question every buyer should ask: when your experts validate the output, whose model improves, and where does it live? That decides whether the compounding asset belongs to the vendor or to you. Arrangements differ. Nace, for instance, uses an external network of certified experts for some engagements and, for direct enterprise deployments, the customer&#x27;s own staff, with the resulting model kept inside the customer&#x27;s cloud. Each choice routes the learning, and the ownership, somewhere different.</p><h2>Where the third path breaks</h2><p>The approach is still early, and a few questions will decide how far it goes. Calibration is the linchpin: the value rests on the model knowing when it is unsure. And it is genuinely unsettled: recent work generating these adapters found they do not automatically improve calibration over ordinary fine-tuning, with gains appearing only under specific constraints.</p><p>The quality of the generated model also depends heavily on the policy data it is built from, which puts a premium on data curation. And scale is the open research frontier: the hypernetworks shown in published work so far have been small. This is where Nace&#x27;s own work gets interesting: in our interview, the company said it has scaled its generator well beyond those published sizes and derived a scaling law for how performance grows, results it has begun to share publicly and is now putting through peer review. 
If it holds up, it would help answer one of the central open questions in the field, and it is the paper worth watching.</p><p>Whichever approach wins, the work still ends at a human, and that handoff is its own design problem. When Deloitte Australia delivered a roughly A$440,000 government report, it <a href="https://www.theregister.com/2025/10/06/deloitte_ai_report_australia/">shipped with fabricated citations and an invented court quote</a> after passing senior review, because the reviewers checked the conclusions, which were sound, and not the provenance, which was not. Controlled research suggests the pattern is general: experts <a href="https://academic.oup.com/pnasnexus/article/5/6/pgag146/8703788">corrected an identical flawed recommendation less often when it was labeled AI-generated</a>.</p><p>The EU AI Act&#x27;s <a href="https://artificialintelligenceact.eu/article/14/">Article 14</a> now names this automation bias. The lesson is not about any one vendor: a high autonomy share concentrates human attention into a thin, late slice of the work, so the value of that review depends entirely on whether the human can check provenance fast, which loops back to grounding.</p><h2>What to build, and what to ask before you buy</h2><p>The honest takeaway: what holds your agents back is usually not orchestration or model size, but whether the model knows your business well enough to be left alone, and the right fix depends on the job. To automate a long, repetitive, high-volume process end to end, such as running most of your internal audit overnight and having your own experts check the final slice, a hypernetwork-generated model is the approach most likely to do it cheaply and run long enough to matter. 
For a short task that finishes in a few steps and never needs to run unattended, the gap between this approach and a well-prompted frontier model shrinks to almost nothing, and closing it is not worth the integration cost.</p><p>When a vendor pitches autonomous or specialist agents, four questions cut through it.</p><ol><li><p>Where does the business knowledge live: in the weights, the prompt, or generated on demand?</p></li><li><p>What does each output come with, so a reviewer can verify it instead of redoing it?</p></li><li><p>What decides which work gets escalated to a human?</p></li><li><p>And whose model improves from that feedback, and where does it run?</p></li></ol><p>The answers, not the headline ratio, tell you what you are buying.</p><p>The hypernetwork approach is the most credible attempt yet at making a small model know a specific business without forgetting it and without re-explaining it on every run. It is also the least proven, and the parts that matter most, calibration and scale, are still in peer review. For the right job, pilot it now. For the wrong one, the integration cost buys you little that a well-prompted frontier model wouldn&#x27;t.</p>]]></description>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/5i664fldp1fQIBiyfnykGq/ac7e785b49b229db8798fd5a95b3f895/Gemini_Generated_Image_rwbmyvrwbmyvrwbm.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
    </channel>
</rss>