<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>VentureBeat</title>
        <link>https://venturebeat.com/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Sun, 14 Jun 2026 22:00:26 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[MCP solved tool calling. A2A solved coordination. What solves transport?]]></title>
            <link>https://venturebeat.com/orchestration/mcp-solved-tool-calling-a2a-solved-coordination-what-solves-transport</link>
            <guid isPermaLink="false">4PHKoyZ3cKlEPNqLucajQH</guid>
            <pubDate>Sun, 14 Jun 2026 04:00:00 GMT</pubDate>
            <description><![CDATA[<p>The history of distributed computing is one of protocol proliferation followed by consolidation. </p><p>Common Object Request Broker Architecture (CORBA), Distributed Component Object Model (DCOM), Java remote method invocation (RMI), and early simple object access protocol (SOAP) competed for the enterprise integration market in the late 1990s before representational state transfer (REST) quietly won by being simpler and HTTP-native. </p><p>Extensible Messaging and Presence Protocol (XMPP), Internet Relay Chat (IRC), and a dozen proprietary protocols fragmented real-time messaging before MQ telemetry transport (MQTT) and WebSockets carved out their respective niches. Every new computing paradigm generates a burst of competing standards, then slowly converges as implementations accumulate and interoperability becomes economically necessary.</p><p>The AI agent ecosystem is currently in the proliferation phase. Four significant protocols have been published in the past eighteen months: Model context protocol (MCP) from Anthropic in late 2024, agent communication protocol (ACP) from IBM Research in March 2025, Agent2Agent (A2A) from Google in April 2025, and agent network protocol (ANP) from an independent working group. </p><p>The W3C AI Agent Protocol Community Group has opened a standards track. The Internet Engineering Task Force (IETF) is receiving Internet-Drafts on agent transport. Conferences are running workshops on interoperability. Every week brings a new GitHub repository claiming to solve the agent communication problem.</p><p>Understanding where and how quickly this converges has real consequences for architecture decisions being made right now.</p><h2><b>What the protocols actually solve</b></h2><p>The proliferation looks more chaotic than it is, because most of these protocols address different layers of a stack rather than competing for the same slot. 
The confusion comes from marketing, which describes each as &quot;the standard for AI agent communication&quot; without specifying which aspect of communication.</p><p>MCP is a tool-calling interface. It defines how a model discovers what functions a server exposes, how to invoke them, and how to interpret the response. It is a typed remote procedure call (RPC) contract between a model client and a tool server, running over HTTP. The Linux Foundation confirmed more than 10,000 active public MCP servers and 164 million monthly Python SDK downloads by April 2026. MCP has already won the tool-calling layer. The standardization work is effectively done.</p><p>A2A is a task coordination interface. Where MCP defines how an agent calls a tool, A2A defines how two agents delegate a task. It introduces Agent Cards (capability advertisements), task lifecycle states, and three interaction modes: Synchronous, streaming, and asynchronous. Google donated it to the Linux Foundation in June 2025, and enterprise AI teams have adopted it broadly because it fills a real gap that MCP leaves open.</p><p>ACP is a message envelope format. Lightweight, stateless, designed for agent-to-agent message exchange without A2A&#x27;s full coordination semantics. It is useful in systems where simple message passing suffices and A2A&#x27;s task lifecycle overhead is unnecessary.</p><p>ANP is a discovery and identity protocol. It uses Decentralized Identifiers (DIDs) for agent identity and JSON-LD graphs for capability descriptions, providing a foundation for decentralized agent marketplaces where no central registry is required.</p><p>The stack that is emerging: Capability discovery via ANP or simpler registries, task coordination via A2A, tool calls via MCP, and lightweight messaging via ACP for cases that do not require full task lifecycle management. These layers complement rather than compete.</p><h2><b>The transport problem that remains</b></h2><p>Every protocol in this list runs over HTTP. 
This reflects where the protocols came from: Research teams, API providers, and enterprise software companies building systems where HTTP is an unquestioned assumption. HTTP is the protocol they know, the one their servers already speak, and the one that makes demos easy.</p><p>The production problem is that HTTP assumes a reachable server. Behind network address translation (NAT) — and 88% of networked devices sit behind NAT — there is no reachable server without a relay. For agent fleets that need to route tasks directly between peers across cloud boundaries, home networks, and edge deployments, this limitation forces every message through relay infrastructure. Relay infrastructure adds latency, cost, and a failure mode.</p><p>The application-layer protocols solve the semantics of what agents say to each other. They do not solve how agents find each other and establish direct connections. That is a session-layer problem, Layer 5 in the open systems interconnection (OSI) model, and none of MCP, A2A, ACP, or ANP addresses it.</p><p>The technologies for solving it exist. User datagram protocol (UDP) hole-punching with session traversal utilities for NAT (STUN) provides NAT traversal for roughly 70% of network topologies. X25519 Diffie-Hellman and AES-256-GCM provide authenticated encryption at the tunnel level without a certificate authority. QUIC (RFC 9000) or custom sliding-window protocols over UDP provide reliable delivery without TCP&#x27;s head-of-line blocking. These are the same kinds of primitives that WireGuard uses for VPN tunnels and that WebRTC uses for browser-to-browser media streams.</p><p>What differs in the agent context is capability-based routing. Agents need to find peers not by hostname but by what those peers can do. A research agent should be able to query &quot;which peers have real-time foreign exchange data?&quot; and receive a list of currently active specialist agents. 
This is closer to a service registry than to DNS, and it is a natural extension of ANP&#x27;s design philosophy applied to the transport layer.</p><p>A handful of projects are assembling these pieces. Pilot Protocol has the most complete published specification, with an IETF Internet-Draft covering addressing, tunnel establishment, and NAT traversal for agent networks. libp2p provides a battle-tested foundation with similar primitives. The IETF&#x27;s QUIC working group is developing NAT traversal extensions that will be relevant here.</p><h2><b>What convergence will look like</b></h2><p>The HTTP-based protocols (MCP, A2A) are already converging on stable versions. The next 12 months will bring production hardening (security improvements, stateless MCP servers for horizontal scaling, better A2A federation) rather than new fundamental designs. The tool-calling and task-coordination layers are largely solved.</p><p>The transport layer is 18 to 24 months behind. Expect a period of implementation diversity as teams experiment with different approaches to peer-to-peer (P2P) agent networking, followed by consolidation around a small number of implementations once empirical data on performance and reliability accumulates. The IETF and W3C standardization tracks will likely produce something in the 2027-2028 window, by which time one or two open-source implementations will have accrued enough production deployments to establish de facto standards ahead of the formal specification.</p><p>For engineering leaders making architecture decisions today, the practical implication is layered adoption. The application-layer protocols are stable enough to build on. MCP adoption now is low-risk. A2A adoption for multi-agent coordination is reasonable with the expectation that the protocol will evolve. 
The transport layer is where you either build something custom and plan to replace it, or you evaluate early implementations knowing the space is still moving.</p><p>The teams that will have the most leverage when the transport layer stabilizes are the ones that designed their agent systems with a clean separation between application semantics (MCP, A2A) and transport (whatever sits below). Clean separation is cheap to implement now and expensive to retrofit later, a lesson the microservices era taught anyone who tried to add observability or circuit breaking to systems that had none.</p><p><i>Philip Stayetski is a co-founder of Vulture Labs.</i></p>]]></description>
            <category>Orchestration</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/F3OLgvGIreNtAZkWpzQuo/3263a964f4df4a6645e785c0abb58f23/Data_blocks.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Anthropic blocks all public access to Claude Fable 5, Mythos 5 following US government order — what enterprises should do]]></title>
            <link>https://venturebeat.com/technology/anthropic-blocks-all-public-access-to-claude-fable-5-mythos-5-following-us-government-order-what-enterprises-should-do</link>
            <guid isPermaLink="false">AMjLirPGAfkycxko0iX3C</guid>
            <pubDate>Sat, 13 Jun 2026 12:24:00 GMT</pubDate>
            <description><![CDATA[<p>The US government last night issued an unprecedented export control directive <a href="https://www.anthropic.com/news/fable-mythos-access">ordering Anthropic to immediately suspend</a> all access to its top-tier Claude Fable 5 and Claude Mythos 5 models for foreign nationals, citing unspecified national security authorities. </p><p>In response, Anthropic has blocked <i>all</i> public access to both models, globally — meaning no users around the world can access them at this time, even paying enterprise customers and Anthropic employees internally. It&#x27;s a huge blow and reversal following the <a href="https://venturebeat.com/technology/anthropic-brings-mythos-to-the-masses-with-claude-fable-5-its-most-powerful-generally-available-model-ever">public release of Fable/Mythos 5</a> just three days prior. </p><p>Current Fable 5/Mythos 5 sessions will end in errors and new queries will be automatically routed to older, less capable models like Opus 4.8. Anthropic says in a <a href="https://www.anthropic.com/news/fable-mythos-access">blog post</a> that &quot;We believe this is a misunderstanding and are working to restore access as soon as possible,&quot; and apologizes to its customers. 
</p><p>The sudden regulatory intervention serves as a stark warning to the enterprise sector: centralized, cloud-based frontier models exist at the absolute mercy of government oversight and vendor compliance.</p><h2><b>Did Pliny the Liberator&#x27;s public jailbreak catalyze the extraordinary USG action against Fable/Mythos 5?</b></h2><p>The government&#x27;s sweeping action follows a <a href="https://x.com/elder_plinius/status/2064776322979676227">viral jailbreak of Fable 5 published publicly on X on June 10</a> by the prolific jailbreaker &quot;<a href="https://venturebeat.com/ai/an-interview-with-the-most-prolific-jailbreaker-of-chatgpt-and-other-leading-llms">Pliny the Liberator</a>,&quot; who claimed to have successfully bypassed the model&#x27;s safety guardrails to extract functional instructions for cyber exploits, explosives, and chemical synthesis pathways, specifically noting the &quot;birch reduction method&quot; for methamphetamine.</p><p>Pliny outlined a highly sophisticated, multi-agent attack that leveraged a combination of &quot;Unicode, homoglyphs, Cyrillic,&quot; long-context reference tracking, and a technique of breaking harmful requests into innocuous, out-of-distribution tokens. The attacker then used a previously jailbroken Opus model to piece the benign chunks back together into actionable, restricted outputs.</p><p>Anthropic doesn&#x27;t specify if this is the jailbreak that precipitated the government order, and in fact, notes that the information provided by the U.S. government regarding the specific jailbreak has been poorly documented, writing: &quot;To date, the government has only given us verbal evidence of a potential narrow, non-universal jailbreak, which essentially consists of asking the model to read a specific codebase and fix any software flaws. 
Our understanding is that one potential jailbreak was shared with the government.&quot; </p><p>The company argues the capabilities uncovered are &quot;widely available&quot; in other public models, explicitly naming rival OpenAI&#x27;s GPT-5.5. </p><p>Furthermore, Anthropic warns that pulling a commercial model over a non-universal jailbreak sets a regulatory standard that could &quot;essentially halt all new model deployments for all frontier model providers.&quot;</p><h2><b>The Pentagon precedent and need for enterprise AI redundancy and diversification</b></h2><p>This sudden blackout of Anthropic&#x27;s latest and greatest AI models will no doubt cause some consternation for organizations relying primarily on the Claude API — as it should, even though they still have access to other, less powerful Claude models. </p><p>As I warned earlier this year <a href="https://venturebeat.com/technology/anthropic-vs-the-pentagon-what-enterprises-should-do">when the Pentagon abruptly blacklisted Anthropic</a>, enterprises can no longer afford — from an operational reliability standpoint — to run critical workflows on <i>any</i> single AI model or even provider. Putting all your AI &quot;eggs&quot; into one basket, so to speak, creates a single, ultimately brittle failure point from which recovery or mitigation becomes exceedingly difficult. </p><p>Granted, in this case, Anthropic notes helpfully that &quot;access to all other Anthropic models will not be affected.&quot; And while Opus 4.8 or other Anthropic models may already be the preferred ones for organizations given their lower cost, or seen as acceptable fallbacks, the reality is, the U.S. government order was narrowly targeted <i>in this particular instance — </i>who&#x27;s to say the government wouldn&#x27;t, in the future, demand a block of <i>all of a given lab&#x27;s AI models/products/services?</i></p><p>We had an indication that enterprise AI customers should diversify their providers earlier this year. 
Recall that in March 2026, Secretary of Defense Pete Hegseth labeled Anthropic a &quot;supply chain risk&quot; after the company refused to allow the military to use Claude for mass domestic surveillance and lethal autonomous weapons without safety restrictions. </p><p>The resulting fallout led to a sweeping prohibition on Anthropic&#x27;s use across defense supply chains, stripping contractors of access overnight.</p><p>The lesson from the Department of Defense fallout remains critically relevant today. Any organization building agentic workflows or production apps tied solely to a single closed-API provider risks immediate operational failure if that provider faces an injunction, a cyberattack, or an export control directive.</p><p>As an enterprise technical leader, your top goal, if not already achieved, should be to urgently <b>diversify your AI supply</b> — whether it&#x27;s other cloud-based AI models and providers, or AI models running on enterprise-controlled local or virtual hardware. </p><p>At this point, enterprise AI supplier diversification is arguably imperative to ensure you can continue to run AI workflows without disruption. </p><h2><b>Enterprise implications: sovereign setup vs. frontier capabilities</b></h2><p>The community reaction to the Fable 5 takedown reflects a rapidly shifting enterprise calculus toward hardware sovereignty.</p><p>AI founder <a href="https://x.com/AlexFinn/status/2065614148537299149?s=20">Alex Finn took to X</a> to flag the Anthropic shutdown as a &quot;wakeup call,&quot; urging developers to run local models on home GPUs to insulate themselves from regulatory volatility. 
</p><p>&quot;No company or government will EVER be able to take away your local models,&quot; Finn writes, warning that government overreach will only escalate as models inch closer to artificial general intelligence (AGI), the stated goal of OpenAI and some other AI firms, in which an AI model becomes capable of performing most economically valuable work tasks now done by humans. </p><p>Competitors are already capitalizing on this sentiment; Chinese open source AI provider MiniMax quickly highlighted the open weights/open source availability of its <a href="https://venturebeat.com/technology/minimax-m3-debuts-eclipsing-gpt-5-5-and-gemini-3-1-pro-on-key-benchmark-performance-for-just-5-10-of-the-cost">new, frontier-class M3 model</a>, contrasting its decentralized availability against Claude&#x27;s centralized vulnerability. In other words: enterprises can download and run M3 on their own hardware now without ever worrying about any government stepping in to prevent access. </p><p>This dynamic presents a complex trade-off for CIOs and IT leaders:</p><ul><li><p><b>The Sovereign Advantage: </b>Running local, open-weights models on sovereign hardware provides absolute control, ensures data privacy, and immunizes the enterprise against abrupt government export controls, vendor policy shifts, or API rate limits.</p></li><li><p><b>The Frontier Sacrifice:</b> Adopting a purely local strategy means sacrificing the cutting-edge reasoning, agentic capabilities, and massive context windows inherent to the latest closed-API frontier models, which require centralized, multi-billion-dollar compute clusters to operate.</p></li></ul><p>The most resilient path forward is an active fallback architecture. Enterprises must design their systems to be model-agnostic. 
By building intelligent routing layers that can dynamically switch from a frontier model like Fable 5 to an open-weights fallback or a secondary provider&#x27;s API the moment an outage or regulatory ban hits, businesses ensure their operations survive the volatile intersection of AI scaling and government oversight.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/2z5YLKjoeV6nzBjKELk7Yo/900a955fa4e5390ad6ea0921e64a7b85/ChatGPT_Image_Jun_13__2026__08_09_56_AM.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out]]></title>
            <link>https://venturebeat.com/technology/kimi-k2-7-code-cuts-thinking-tokens-30-practitioners-say-benchmarks-dont-check-out</link>
            <guid isPermaLink="false">6LezdGIZlQdYA7DZfA00mK</guid>
            <pubDate>Fri, 12 Jun 2026 21:55:22 GMT</pubDate>
            <description><![CDATA[<p>Moonshot AI released Kimi K2.7-Code this week, an open-source update to its <a href="https://venturebeat.com/ai/moonshots-kimi-k2-thinking-emerges-as-leading-open-source-ai-outperforming">K2 coding model</a> family, claiming leaner reasoning and double-digit performance gains.</p><p>K2.7-Code is built on the same trillion-parameter mixture-of-experts architecture as its <a href="https://venturebeat.com/ai/kimi-k2-6-runs-agents-for-days-and-exposes-the-limits-of-enterprise-orchestration">predecessor K2.6</a>, and drops in via an OpenAI-compatible API — which matters for teams already running K2.6 in production gateways.</p><p>When K2.6 launched in April, it topped OpenRouter&#x27;s weekly LLM leaderboard — a ranking based on actual API routing decisions by developers, not self-reported benchmark scores.</p><p>Moonshot AI says K2.7-Code addresses what it calls &quot;overthinking,&quot; reducing thinking-token usage by 30% compared to K2.6 — a number that would directly affect inference costs for teams running agentic workflows. Whether that efficiency gain holds on independent benchmarks is a question practitioners have already started raising publicly.</p><h2>What Kimi K2.7-Code is</h2><p>K2.7-Code is released under a Modified MIT license, with weights available on HuggingFace. The model is deployable via vLLM or SGLang. It runs exclusively in thinking mode and does not support temperature adjustment — Moonshot AI has fixed it at 1.0, meaning teams cannot tune output determinism the way they might with other models.</p><p>The core change from K2.6 is how the model generates low-level code. Where K2.6 produced implementations by wrapping existing libraries and routing through established frameworks, K2.7-Code authors implementations directly. 
Moonshot AI says this produces more reliable generalization across Rust, Go and Python, and across task types including frontend development, DevOps and performance optimization.</p><p>On benchmark performance, Moonshot AI claims gains of 21.8% on Kimi Code Bench v2, 11% on Program Bench and 31.5% on MLS Bench Lite. All three are proprietary benchmarks run by Moonshot AI. The model has not been submitted to DeepSWE, an independent coding benchmark that produces a 70-point spread across models — compared to SWE-Bench Pro&#x27;s 30-point spread — making it a more discriminating signal for teams configuring model routing systems.</p><div></div><h2>More honest, weaker for it</h2><p>The picture from outside Moonshot&#x27;s own benchmarks is more complicated.</p><p>Researcher Elliot Arledge ran K2.7-Code against K2.6 and Claude Fable 5 on KernelBench-Hard, a public benchmark focused on GPU kernel optimization, and published his full run logs at kernelbench.com. </p><p>&quot;K2.7 is more honest but not more capable,&quot; <a href="https://x.com/elliotarledge/status/2065443474560946615">Arledge wrote on X</a>. </p><p>On five of six problems, K2.7-Code produced real authored Triton kernels where K2.6 had used library wrappers. Two of those kernels failed on the model&#x27;s own bugs. The MoE kernel result regressed from K2.6&#x27;s score of 0.222 to 0.157. </p><p>&quot;Fable, for reference, tops every cell it doesn&#x27;t honestly fail,&quot; Arledge wrote.</p><p>Sugumaran Balasubramaniyan, a developer who built a model-task-router for the Hermes Agent platform using DeepSWE as his reference signal, responded publicly to the K2.7-Code release and challenged Moonshot AI directly on the benchmark choices.</p><p> &quot;Respectfully, every model &#x27;improves&#x27; double digits on its own test suite,&quot; <a href="https://x.com/sugumaran___/status/2065416166911205579">Balasubramaniyan wrote on X</a>. 
</p><p>He noted that K2.6 scored 24% on DeepSWE, tied with GPT-5.4-mini, and asked whether Moonshot AI would submit K2.7-Code to the same benchmark. </p><p>Balasubramaniyan said it took 13 review rounds to get the benchmark data right for his router and that he would route coding tasks to K2.7-Code if the independent numbers hold up.</p><div></div><h2>What this means for enterprises</h2><p>The token efficiency gain is the easiest claim to test. Teams running K2.6 in production can swap in K2.7-Code via the OpenAI-compatible API without an architecture change, and fewer thinking tokens should translate into lower inference costs on agentic workflows. The 30% reduction is Moonshot&#x27;s own number, however, so it is a claim to verify rather than assume.</p><p>The practical question is whether those efficiency gains hold on a team&#x27;s own task distribution. Running K2.7-Code against your own workloads before adjusting gateway weights is the low-risk path to finding out.</p>]]></description>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1qgfxRq3zYmGo7J9jMkqCA/0d83440ff424b453d03efc45bb23cce1/kimi-gateway-smk1.jpg?w=300&amp;q=30" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Google researchers introduce 'faithful uncertainty,' allowing LLMs to offer best guesses instead of hallucinations]]></title>
            <link>https://venturebeat.com/orchestration/google-researchers-introduce-faithful-uncertainty-allowing-llms-to-offer-best-guesses-instead-of-hallucinations</link>
            <guid isPermaLink="false">1rMeLRKOzghotucnoKnDPv</guid>
            <pubDate>Fri, 12 Jun 2026 21:27:29 GMT</pubDate>
            <description><![CDATA[<p>Large language models continue to struggle with hallucinations, presenting a major roadblock for real-world enterprise applications. Reducing these errors is a messy business, forcing model developers to navigate a strict tradeoff where eliminating factual errors often suppresses valid answers.</p><p>In a <a href="https://arxiv.org/abs/2605.01428">new paper</a>, Google researchers introduce the concept of &quot;faithful uncertainty,&quot; a metacognitive technique that aligns a model&#x27;s response with its internal confidence. This alignment allows the model to offer appropriately hedged hypotheses, such as &quot;My best guess is,&quot; instead of defaulting to an unhelpful &quot;answer-or-abstain&quot; binary.</p><p>In real-world agentic AI applications, this metacognitive awareness acts as an essential control layer. It empowers autonomous systems to accurately determine when their internal knowledge is sufficient and when they must dynamically trigger external tools or search APIs to resolve deficits.</p><h2>The utility tax of current mitigation strategies</h2><p>Understanding <a href="https://venturebeat.com/ai/google-deepmind-researchers-introduce-new-benchmark-to-improve-llm-factuality-reduce-hallucinations">why LLMs hallucinate</a> hinges on separating two capabilities: a model knowing facts versus knowing what is known. Historically, most factuality gains in AI have come from expanding the knowledge boundary, meaning developers simply pack more facts into the model&#x27;s parameters through larger scale and more training data.</p><p>However, expanding a model&#x27;s knowledge does not automatically improve its boundary awareness, which is its ability to distinguish the known from the unknown and recognize its own limitations.</p><p>“There are broadly two ways to improve LLM factuality,” Gal Yona, Research Scientist at Google and co-author of the paper, told VentureBeat. 
The first is continuing to teach the model more facts. But, Yona notes, “model capacity is finite, and the long tail of knowledge is effectively infinite.” </p><p>Once models hit this limit, the hope is they know what they don&#x27;t know and simply abstain from answering. However, this is inherently difficult for LLMs.</p><p>“This is why most practical attempts to reduce hallucinations through various interventions don&#x27;t actually make it to deployment,” Yona explains. “They do reduce hallucinations, but they also hurt utility, because the model ends up refusing to answer questions it actually does know.”</p><p>This inability to distinguish between knowns and unknowns creates what the paper&#x27;s authors call the &quot;utility tax.&quot; Enforcing a zero-hallucination standard requires the model to abstain whenever it is even slightly uncertain, discarding massive volumes of completely valid information. For example, the authors demonstrate that reducing an underlying 25% error rate down to a strict 5% target forces developers to discard 52% of the model&#x27;s correct answers.</p><p>Treating all errors as hallucinations forces enterprise systems to choose between trustworthiness and helpfulness. Application developers are generally unwilling to pay this massive utility tax and render their models unhelpful. </p><p>Consequently, they optimize systems to prioritize coverage, forcing models to operate in a state where they continue to generate confident hallucinations.</p><div></div><h2>Reframing hallucinations as confident errors</h2><p>To move past the utility tax, the researchers propose to stop treating any factual error as a hallucination. Instead, they reframe hallucinations as &quot;confident errors&quot;: incorrect information delivered authoritatively without appropriate qualification.</p><p>This subtle reframing dissolves the strict &quot;answer-or-abstain&quot; dichotomy and allows the model to express its uncertainty. 
</p><p>In this new framework, if a model makes a factual mistake but appropriately hedges its response (e.g., by stating, &quot;I am not completely sure, but I think...&quot;), it isn&#x27;t a hallucination. It is simply a hypothesis offered to the user for consideration. By expressing uncertainty, the AI preserves its utility—sharing whatever partial or likely knowledge it has—without violating the user&#x27;s trust.</p><p>However, if an AI assistant hedges all its responses with a disclaimer, the user is forced to double-check everything, defeating the purpose of the tool entirely.</p><p>The solution the researchers propose is &quot;faithful uncertainty.&quot; This approach requires aligning a model&#x27;s linguistic uncertainty, or the words it uses to express doubt, with its intrinsic uncertainty, which is its actual, internal statistical confidence in that specific answer. This ensures the model only hedges when its internal state genuinely reflects conflicting or low-probability information.</p><p>Faithful uncertainty forms a core component of “metacognition,” the AI&#x27;s ability to be aware of its own uncertainty and act on it. To understand this practically, consider the intuitive example of consulting a doctor. We do not trust doctors because they are all-knowing. We trust them because they reliably distinguish between a confident diagnosis (&quot;You have a fracture&quot;) and an educated hypothesis (&quot;It might be a sprain, but let&#x27;s run some tests&quot;).</p><h2>Practical implications for enterprise AI</h2><p>Under the new framing, errors where a model is genuinely confident but factually incorrect are categorized as “honest mistakes.” This casts knowledge expansion (training the model on more data) and faithful uncertainty as completely complementary efforts. 
Knowledge expansion pushes the absolute knowledge boundary outward to minimize honest mistakes, while faithful uncertainty honestly communicates wherever that boundary currently lies.</p><p>This new framing has important implications for agentic applications. The shift to agentic AI might make it seem like knowing what the model doesn&#x27;t know is redundant, since models can just search external databases. However, access to external tools actually amplifies the need for faithful uncertainty. In agentic systems, metacognition becomes the central control layer that governs the entire system.</p><p>External tools solve the storage problem because the model no longer needs to encode every fact into its parameters. However, this introduces a new control problem: managing when to retrieve information, verify facts, and orchestrate these external tools. Without faithful uncertainty, an agent is essentially flying blind and must rely on external, static heuristics or over-engineered scaffolds.</p><p>“The model might search for something it already knows confidently—wasting latency and cost for no gain. Or the opposite: it confidently answers from memory when it should have searched, producing a plausible but wrong output,” Yona said. Today’s agent harnesses try to solve this externally with query classifiers or always-search rules, but Yona notes that these are &quot;static and brittle.&quot; By using its intrinsic uncertainty to regulate its own behavior, the agent dynamically optimizes its tool use, choosing to invoke a search tool only when its internal confidence is genuinely low.</p><p>Beyond deciding when to search, faithful uncertainty is critical for evaluating the results of a search. If a tool returns low-quality or unexpected information, a metacognitive agent does not blindly accept whatever appears in its context window. Instead, it uses its uncertainty awareness to weigh the retrieved external signals against its own internal priors. 
This prevents sycophantic behavior, in which the system might otherwise trust external sources that conflict with what it actually knows.</p><h2>The bootstrapping paradox: The catch to teaching uncertainty</h2><p>For enterprise builders, achieving this faithful uncertainty is trickier than it sounds. It requires teaching models the syntax of uncertainty through supervised fine-tuning (SFT). Because pre-trained models are mostly fed authoritative text, they must be explicitly taught to say things like, &quot;I&#x27;m not entirely sure, but I think VentureBeat was founded in...&quot;</p><p>But SFT introduces a &quot;bootstrapping paradox.&quot; Unlike standard training datasets where the &quot;right answer&quot; is the same regardless of the model, the ground truth for uncertainty is the model&#x27;s own dynamic knowledge base.</p><p>“Here&#x27;s the catch: the &#x27;correct&#x27; expression of uncertainty is inherently dynamic, because it depends on what this particular model knows or doesn&#x27;t know at this particular point in training,” Yona said. “If you train on a label that says &#x27;I don&#x27;t know X&#x27; but the model actually does know X, you&#x27;ve taught it to hallucinate uncertainty... The training data is static, but the target is a moving one, and that&#x27;s the fundamental tension teams need to grapple with.”</p><h2>The road to self-aware AI</h2><p>For enterprises looking to implement these capabilities without expensive retraining, prompting serves as the most accessible entry point. “Prompt engineering is already something most engineers do today, this provides the lowest-friction path to improving metacognitive behavior today,” Yona said. 
Enterprise developers can explore frameworks like <a href="https://github.com/yale-nlp/MetaFaith">MetaFaith</a>, an open-source project previously co-authored by Yona, to begin applying metacognitive prompting to off-the-shelf models.</p><p>However, Yona cautions that &quot;there is still substantial headroom that prompting alone doesn’t solve,&quot; meaning the industry will eventually need to rely on advanced reinforcement learning (RL) to bake metacognition deeply into model training.</p><p>Ultimately, as enterprises transition from isolated chat applications to complex, multi-agent workflows, self-awareness will become a defining prerequisite for reliable autonomy. But evaluating whether a model truly possesses this awareness remains a profound technical challenge.</p><p>“How do you actually evaluate whether a model can sense its internal states?” Yona asks. “Even in humans, it’s hard to define or separate &#x27;true&#x27; self-monitoring abilities from a capable reliance on proxies. We face exactly the same challenges with LLMs: a model might learn to mimic the style of uncertainty without truly sensing its internal state. Developing evaluation frameworks that can tell the difference is one of the most important open problems in this space.”</p>]]></description>
            <author>bendee983@gmail.com (Ben Dickson)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/yz9h2aiBuc13gCjhvzQon/240a640f913673b6f56d4922639cd4ac/ai_metacognition.jpg?w=300&amp;q=30" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[NanoClaw and JFrog launch 'immune system' to block AI agents from downloading malicious code]]></title>
            <link>https://venturebeat.com/security/nanoclaw-and-jfrog-launch-immune-system-to-block-ai-agents-from-downloading-malicious-code</link>
            <guid isPermaLink="false">38lkGsiwBWZ1jd5vvMcudB</guid>
            <pubDate>Fri, 12 Jun 2026 16:46:00 GMT</pubDate>
            <description><![CDATA[<p>The creators of the hit, enterprise-friendly, open source OpenClaw variant <a href="https://venturebeat.com/orchestration/nanoclaws-creators-are-turning-the-secure-open-source-ai-agent-harness-into-an-enterprise-second-brain">NanoClaw</a> are partnering with software supply chain management leader <a href="https://jfrog.com/">JFrog</a> to launch a new, joint security integration they say will protect NanoClaw autonomous agents from malicious code injection. </p><p>&quot;These agents are doing things that you cannot necessarily control, and you cannot necessarily train,&quot; said Gal Marder, Chief Strategy Officer at JFrog, in an exclusive interview with VentureBeat.</p><p>Available immediately, the partnership hardwires NanoClaw agents directly to JFrog’s vetted software registries, ensuring that AI assistants can only pull scanned, safe dependencies. </p><p>The release addresses a rapidly growing blind spot in tech: autonomous agents frequently install packages in the background to extend their capabilities, often without their human operators&#x27; knowledge or oversight. </p><p>&quot;The people who are operating the agents are not necessarily developers, and they are not even aware of the implications,&quot; explained Gavriel Cohen, creator of NanoClaw and CEO and co-founder of its new commercial services startup, <a href="http://nanoco.ai/">NanoCo AI</a>. 
</p><p>To secure the broader ecosystem, the partners are working to make the integration available completely free of charge for the open-source community, while enterprise organizations can seamlessly route their agents through their existing, commercially licensed JFrog environments.</p><p>The new technical capability enabled by this partnership follows NanoCo&#x27;s moves to add permissions dialogs across the apps in which NanoClaw is available via <a href="https://venturebeat.com/orchestration/should-my-enterprise-ai-agent-do-that-nanoclaw-and-vercel-launch-easier-agentic-policy-setting-and-approval-dialogs-across-15-messaging-apps">a partnership with Vercel</a>, and a <a href="https://venturebeat.com/infrastructure/nanoclaw-and-docker-partner-to-make-sandboxes-the-safest-way-for-enterprises">new partnership with Docker to allow NanoClaw</a> agents to run more securely, isolated from other software environments directly inside Docker virtual containers. </p><h2><b>The risk of current, personal autonomous AI agents</b></h2><p>When an operator interacts with an autonomous system like NanoCo&#x27;s NanoClaw, they communicate at a high level of abstraction. </p><p>A user might simply send an audio file or a voice note, prompting the agent to independently figure out how to process it. </p><p>As Cohen explained, the agent thinks, &quot;oh, I can&#x27;t understand voice notes, so let me go and grab a package and download something and install it and set it up and run it&quot;.</p><p>This dynamic self-improvement makes AI agents incredibly powerful, but it also renders them highly susceptible to software supply chain attacks. </p><p>Bad actors are increasingly poisoning open-source registries with malicious packages. Because agents act autonomously to fetch what they need, they bypass human scrutiny. 
</p><p>The operators, who may not even be developers, are largely unaware of the security implications unfolding behind the scenes.</p><h2><b>How NanoCo and JFrog are working to stop agents from running malicious code</b></h2><p>The integration between NanoCo and JFrog acts as an automated immune system for these AI environments.</p><p>Under the hood, NanoClaw agents are now configured to route their requests for software packages, CLI tools, and Model Context Protocol (MCP) servers exclusively through JFrog’s registries.</p><p>If an agent attempts to download a compromised library—such as a vulnerable version of the popular Axios package—the JFrog registry intercepts the request.</p><p>It blocks the installation, returning a security policy error to the agent, noting that the request was &quot;rejected by JFrog&#x27;s registry with a 403 security policy&quot;. </p><p>Crucially, the system does not just stop at blocking the threat; it creates a dynamic correction loop. The agent is notified of the vulnerability and guided to automatically seek out and install an approved, non-malicious version of the requested package instead.</p><p>For large organizations, this integration solves a massive compliance headache. Marder notes that as enterprises adopt autonomous agents, they require absolute visibility. </p><p>Organizations need &quot;a system of record, we need somewhere to track what agents that&#x27;s running by whom and consuming what packages and using what skills and using what MCPs,&quot; he told VentureBeat.</p><p>Beyond visibility, the JFrog integration provides a foundational &quot;trust layer&quot; and strict governance over what these automated systems are permitted to access.</p><h2><b>Licensing and accessibility</b></h2><p>In the realm of software distribution, licensing and access parameters dictate adoption. 
The NanoCo and JFrog partnership utilizes a dual-track approach to serve both individual open-source developers and highly regulated enterprises.</p><p>For the open-source community, the integration is completely free. JFrog is providing open-source NanoClaw users with complimentary access to safe, vetted sources of artifacts, tools, and skills. </p><p>This allows individual developers to run autonomous agents locally without drowning in manual approval requests for every single dependency. Furthermore, as community members build and share new &quot;skills&quot; for the agents, these contributions are uploaded to the registry, scanned for malicious code, and cleared before anyone else can use them. </p><p>This infrastructure directly neutralizes the threat of poisoned community repositories.</p><p>For enterprise deployments, the architecture plugs seamlessly into an organization&#x27;s existing commercial environment. Rather than using the public open-source registry, corporate users point their NanoClaw agents to their own internal JFrog registries. </p><p>This ensures that all agent activity adheres to the company’s specific commercial licenses, internal security policies, visibility needs, and governance standards.</p><p>As AI continues to blur the line between human intent and machine execution, the infrastructure securing that execution must evolve. This partnership acknowledges a core reality: you cannot train an AI to perfectly recognize every zero-day vulnerability; instead, you must build an environment where the agent simply cannot reach the vulnerability in the first place.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Security</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/77BUvHRR5neMsb4vlwRip2/ed64951fbd22cd58addb8d99b6f977df/Gemini_Generated_Image_xz3q0ixz3q0ixz3q__2_.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[PixelRAG beats text parsers on accuracy and cuts AI agent token costs 10x]]></title>
            <link>https://venturebeat.com/data/pixelrag-beats-text-parsers-on-accuracy-and-cuts-ai-agent-token-costs-10x</link>
            <guid isPermaLink="false">3Ko8MItirrolymnCeDl93a</guid>
            <pubDate>Fri, 12 Jun 2026 15:39:11 GMT</pubDate>
            <description><![CDATA[<p>Most enterprise RAG pipelines start the same way: a text parser converts web pages and documents into plain text so they can be chunked and indexed for retrieval. That conversion step destroys retrieval signals — and according to new research, it&#x27;s responsible for the majority of wrong answers.</p><p>A research team from UC Berkeley, Princeton University, EPFL and Databricks published a paper this week introducing <a href="https://github.com/StarTrail-org/PixelRAG/blob/main/assets/pixelrag-paper.pdf">PixelRAG</a>, a system that skips that conversion entirely. Instead of parsing pages into text, PixelRAG renders them as screenshots, indexes those images and feeds retrieved tiles directly to a vision-language model reader. Tested across 30 million screenshot tiles covering all of Wikipedia, it outperforms text-based RAG across six benchmarks, improving accuracy by up to 18.1% over text-based baselines.</p><p>Parsers are the wrong place to look for fixes, according to the research team.</p><p>&quot;Improving parsers is an endless process because every website requires special handling,&quot; Yichuan Wang, lead author and UC Berkeley doctoral student, told VentureBeat. &quot;Our goal was to explore whether recent advances in VLMs make it possible to bypass that entire problem and build a retrieval system that works across websites without site-specific engineering.&quot;</p><h2>HTML parsers destroy the retrieval signals that enterprise RAG depends on</h2><p>The researchers&#x27; goal was to develop a clean end-to-end architecture.</p><p>&quot;Modern web RAG pipelines often involve rendering, parsing, cleaning, chunking, and many other handcrafted stages,&quot; Wang said. &quot;Every stage introduces potential cascade errors and abstractions that move us further away from the original webpage. 
We were interested in whether we could eliminate most of that complexity and operate directly on the rendered page.&quot;</p><p>Wang also noted that parsing inevitably loses information. Images, visual hierarchy, typography, emphasis (e.g., bold text), tables, and layout are either discarded or converted into imperfect textual approximations. </p><p>&quot;No matter how good a parser becomes, some information is fundamentally lost during the conversion,&quot; he said.</p><p>The research identifies three ways text-based RAG loses the answer before it reaches the reader. All three were measured on SimpleQA, a standard benchmark of 1,000 factual Wikipedia questions:</p><ul><li><p><b>Parser loss (36.6% of failures).</b> HTML-to-text conversion destroys structured content so completely that no text chunk in the corpus contains the answer.</p></li><li><p><b>Rank loss (55.2% of failures).</b> The answer exists in the corpus but gets outranked by keyword-dense infoboxes that land at rank 1 for 75.9% of queries, pushing answer-bearing paragraphs to rank 20 or lower.</p></li><li><p><b>Reader loss (8.2% of failures). </b>The correct content reaches the reader but flattened structure causes misattribution.</p></li></ul><h2>How PixelRAG works </h2><p>Unlike a standard LLM that reads only text, a vision-language model takes images as input alongside text, meaning it can read a rendered web page the way a human does, with layout and structure intact. 
&quot;For many structured information extraction tasks, we believe modern VLMs have an inherent advantage because they can reason jointly over both content and layout rather than relying on a flattened text representation,&quot; Wang said.</p><p>PixelRAG is built around that principle, replacing the text parsing pipeline with a four-stage system that operates entirely on rendered screenshots.</p><ul><li><p><b>Rendering.</b> Pages are rendered using Playwright, a browser automation library, at a fixed 875-pixel viewport and sliced into 1024-pixel-tall tiles. Wikipedia&#x27;s 7 million articles produce roughly 30 million tiles. Assets are cached locally and rendered entirely offline.</p></li><li><p><b>Indexing.</b> Each tile is encoded as a single 2048-dimensional vector using Qwen3-VL-Embedding-2B and stored in a FAISS approximate nearest-neighbor index. The full index runs to approximately 120 GB in fp16 and supports incremental updates without full re-indexing.</p></li><li><p><b>Training.</b> The retrieval model is fine-tuned on synthetic contrastive data generated from the datastore, using dynamic hard-negative mining to filter false negatives. LoRA, a lightweight fine-tuning method that updates a small fraction of model weights, is applied to both the language model backbone and the visual encoder. Training on approximately 40,000 pairs completes in under three hours on a single H100.</p></li><li><p><b>Storage.</b> Raw screenshot tiles for Wikipedia require 5.6 TB, but a render-on-demand approach eliminates persistent storage: embed all tiles, delete the screenshots and re-render pages on demand at query time. The vector index requires approximately 120 GB.
</p></li></ul><h2>Six benchmarks, 10x agent token savings and one unsolved problem</h2><p>Researchers tested PixelRAG across six benchmarks spanning factual Wikipedia QA, table-based queries, multimodal QA and live news retrieval. They said it outperformed text-based RAG on all six, including on tasks where questions are answerable from text alone. On SimpleQA it reaches 78.8% accuracy versus 71.6% for the strongest text parser, and 48.8% versus 42.5% on structured table queries. Teams need Qwen3-VL-4B class models or above to see the benefit. Smaller models trail text retrieval by more than 12.5 percentage points.</p><p>The agent cost advantage is the strongest near-term case for PixelRAG. In benchmark testing, an AI agent using PixelRAG as its search backend ran on 3.6 million prompt tokens versus 37.5 million for text retrieval, at 2 to 4 times lower cost than alternatives including Google, while achieving higher accuracy. Image compression can cut that token budget by a further third.</p><p><b>Visual chunking is the main unsolved problem</b>. Text-based RAG systems have spent years refining how to split documents into meaningful retrieval units based on topic, section or semantic content. PixelRAG currently has no equivalent: it slices pages by fixed pixel height, meaning a table or paragraph can get cut in half mid-tile with no awareness of content boundaries. </p><p>&quot;The text retrieval community has spent years studying chunking strategies, while visual retrieval has received much less attention,&quot; Wang said. 
&quot;We think this is an important area for future research.&quot;</p><h2>What this means for enterprises</h2><p>The retrieval quality problem PixelRAG addresses reflects a broader market shift already underway. <a href="https://venturebeat.com/data/the-retrieval-rebuild-why-hybrid-retrieval-intent-tripled-as-enterprise-rag-programs-hit-the-scale-wall">VB Pulse Q1 2026</a> data from qualified enterprise respondents found intent to adopt hybrid retrieval tripling from 10.3% in January to 33.3% in March, the fastest-growing strategic position in the dataset. PixelRAG&#x27;s own authors point to hybrid deployment as the most practical near-term path — layering visual retrieval on top of existing text systems rather than replacing them.</p><p>For teams already running RAG pipelines, the path to those savings is more straightforward than a ground-up rebuild.</p><p>&quot;A practical path is to use PixelRAG as an enhancement layer alongside existing text retrieval systems,&quot; Wang said. &quot;Hybrid retrieval that combines both text and visual search is straightforward and is likely how many production deployments would evolve.&quot;</p>]]></description>
            <category>Data</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/3TcqqLJgNe55Oec1bKRhTc/52878b1366700b2bcc733d57bfc7c614/pixelRAG-smk1.jpg?w=300&amp;q=30" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Microsoft’s open-source SkillOpt automatically upgrades AI agent skills without touching model weights]]></title>
            <link>https://venturebeat.com/orchestration/microsofts-open-source-skillopt-automatically-upgrades-ai-agent-skills-without-touching-model-weights</link>
            <guid isPermaLink="false">3nitXO6za8dFMY0uqmHoEG</guid>
            <pubDate>Thu, 11 Jun 2026 23:37:00 GMT</pubDate>
            <description><![CDATA[<p><a href="https://agentskills.io/home">Agent skills</a> have become an important part of real-world AI applications, providing a mechanism — a set of instructions saved in a folder of text-based markdown (.md) files, usually — for models to adapt to specific enterprise use cases and complex workflows. </p><p>However, optimizing these skills is a slow and faulty process, as they cannot be trained in the same way as the parameters of the underlying AI model. Instead, users typically must update them manually by retyping the instructions in each file, playing a &quot;guessing game&quot; as to what changes might improve agentic AI performance and reduce errors. </p><p><a href="https://github.com/microsoft/SkillOpt">SkillOpt</a>, a new, open source (<a href="https://github.com/microsoft/SkillOpt">MIT Licensed</a>) framework developed by Microsoft, does one better: it introduces an optimizer designed for agent skills, turning the agent&#x27;s skill .md document into a trainable object that evolves based on performance feedback.</p><p>It uses deep-learning-style optimization to make it possible for the AI to systematically explore modifications to the document and find the best combination of instructions. Most importantly, it accomplishes this procedural adaptation without making changes to the underlying model&#x27;s weights.</p><p>On various industry benchmarks, SkillOpt outperforms existing baselines, significantly boosting accuracy for models like GPT-5.5 and Qwen. The result is a set of compact, transferable skill artifacts that allow AI agents to adapt to new domains effortlessly.</p><h2>The challenge of optimizing agent skills</h2><p>Agent skills package procedural knowledge into natural-language specifications, including domain heuristics, tool-use policies, output constraints, and known failure modes. These skills provide an external interface for agents to adapt to complex enterprise workflows. 
In practice, agent skills are stored as text documents and inserted into the agent&#x27;s context before execution.</p><p>One of the key benefits of skills is that they customize the behavior of the underlying model without changing its weights. However, the skill document itself needs to be tweaked and optimized to get the best performance out of the agent.</p><p>While deep learning relies on strict mathematical controls for stability, human prompt engineering often relies on trial and error. When attempting to automatically update a skill document based on feedback, the lack of mathematical discipline makes text highly volatile.</p><p>Yifan Yang, Senior Research SDE at Microsoft Research Asia, told VentureBeat that the problem is not making changes, but ensuring those changes are mathematically sound.</p><p>&quot;The breaking point isn&#x27;t whether a team can change a skill, it&#x27;s that they can&#x27;t guarantee the change is an improvement,&quot; Yang said. &quot;Three failure modes recur: no step-size control, so skills drift; no validation, so a fix that reads as reasonable gets written in and can quietly regress performance; and no negative memory, so the same failed edit keeps coming back.&quot;</p><p>To illustrate how easily performance can drop when edits aren&#x27;t mathematically validated, Yang noted that &quot;an ungated rewrite pushed GPT-5.5 on SpreadsheetBench from 41.8 down to 41.1.&quot;</p><p>According to Yang, these failure modes are amplified in multi-step workflows &quot;because that&#x27;s where frontier models are weakest zero-shot. 
Not on reasoning, but on procedural discipline: format, self-verification, tool policy.&quot;</p><p>Before SkillOpt, agent skills were primarily hand-crafted, generated in a single shot, or evolved through loosely controlled self-revision pipelines that could not reliably improve under feedback.</p><p>Prompt optimization methods like TextGrad and <a href="https://venturebeat.com/business/gepa-optimizes-llms-without-costly-reinforcement-learning">GEPA</a> treat language artifacts as optimizable objects and use trajectory feedback to evolve prompts, but they focus on single-prompt configurations rather than generating persistent, reusable skill artifacts.</p><p>Meanwhile, skill evolution and discovery methods like EvoSkill and Trace2Skill convert agent execution experiences into trajectory lessons to refine skill folders, build domain-specific libraries, or perform evolutionary search.</p><p>None of them apply deep-learning-style controls, such as learning rates, validation gates, and momentum, which are necessary to continuously train a single, compact skill document.</p><h2>Importing mathematical discipline to text</h2><p>SkillOpt optimizes a text document through an iterative propose-and-test loop that separates the model executing the tasks from the model optimizing the skill. The process unfolds in several steps:</p><ul><li><p>SkillOpt starts with an initial skill document and a frozen target model (or harness), where the target model runs a batch of tasks to generate execution trajectories that act as the evidence for the current step.</p></li><li><p>An offline optimizer model analyzes these trajectories, separating successes from failures into minibatches. Looking at a minibatch helps the model identify systematic procedural errors rather than one-off anomalies. 
Based on these patterns, the optimizer proposes structural add, delete, or replace edits to the skill document.</p></li><li><p>The proposed edits are reviewed to filter out duplicates or contradictions, and the optimizer then ranks these candidate edits by their expected utility.</p></li><li><p>Rather than applying all proposed changes, SkillOpt clips the list to a maximum edit budget for that step, generating a candidate skill.</p></li><li><p>The candidate skill is evaluated on a held-out validation set using the target model. If the candidate improves the validation score, it is accepted and becomes the new current skill. If it fails, the edits are rejected and sent to a rejected-edit buffer, providing negative feedback so the optimizer knows not to repeat that mistake.</p></li></ul><p>SkillOpt directly addresses the problem of treating text as a trainable object by importing mathematical concepts from deep learning. The creators note that “the deep-learning analogy is operational rather than decorative,” helping the framework avoid the instability issues associated with other optimization techniques.</p><p>The edit budget acts as a learning rate. By limiting how many edits can be applied at once, the skill version is prevented from moving too far from its previous state, preserving continuity while allowing new procedures to be acquired. </p><p>Just like checking validation loss in deep learning, the strict held-out examples ensure that plausible-sounding text edits are only kept if they mathematically improve the agent&#x27;s actual performance on the validation split.</p><p>At the end of an epoch, SkillOpt performs a slow update by comparing tasks under the previous and current epoch&#x27;s skills. 
This acts like a momentum term, carrying durable, long-horizon procedural lessons forward while isolating them from the fast, step-level edits.</p><h2>SkillOpt in action</h2><p>To evaluate the technique in practice, researchers tested SkillOpt across different models, ranging from large-scale frontier models like GPT-5.5 to smaller closed and open models including GPT-5.4-mini and Qwen3.5-4B. They also deployed the skills within different execution harnesses, using plain chat as well as complex coding harnesses like the Codex CLI and Claude Code.</p><p>The evaluation spanned diverse industry benchmarks including single-round question-answering, multi-round code generation involving tool use, and multimodal document reasoning. SkillOpt was measured against multiple baselines ranging from a default no-skill setting to human-written skills and one-shot LLM-generated skills. It was also compared against advanced prompt-optimization and skill-evolution methods, specifically Trace2Skill, TextGrad, GEPA, and EvoSkill.</p><p>SkillOpt dominated across the board, proving highly effective on all 52 evaluated combinations of model, benchmark, and harness. It was particularly effective with frontier models, delivering an average absolute improvement of +23.5 points against the no-skill baseline on GPT-5.5. Furthermore, SkillOpt outperformed a hypothetical oracle baseline that cherry-picks the best competing method for every problem.</p><p>Small target models saw immense relative gains, proving that a compact text file can supply procedural knowledge that small models lack in their weights. For example, GPT-5.4-nano nearly doubled its score on multimodal document QA and tripled its score on embodied interaction and sequential decision-making.</p><p>These academic benchmarks map to critical enterprise pain points. Zero-shot models often hallucinate formatting or fail to use tools properly in multi-step scenarios. 
Yang explained that the biggest performance leaps occurred in operations that enterprises historically struggle to automate reliably.</p><p>&quot;Document data extraction... exact figures out of contracts, invoices, and forms — AP automation, claims, compliance,&quot; Yang said. &quot;What improves is reliability: precise formatting, self-verification, auditable outputs. And the gains come from learning procedure, not memorizing answers.&quot;</p><p>For enterprise practitioners, the true value of SkillOpt lies in its portability, efficiency, and compatibility with existing infrastructure. Experiments confirm that the framework is harness-agnostic. In addition to basic chat, the same optimization loop was successfully integrated into tool-backed execution environments like the Codex CLI and Claude Code with significant gains on industry benchmarks.</p><p>Developers can train a skill using one execution loop and deploy it in another. For example, a spreadsheet skill trained entirely inside the Codex loop was moved directly into Claude Code and drove a +59.7 point gain over Claude Code&#x27;s native baseline without any further changes.</p><p>SkillOpt artifacts also transfer cleanly across model scales. A skill optimized for GPT-5.4 was deployed onto the smaller GPT-5.4-mini and GPT-5.4-nano models with positive gains, proving that the learned procedures encode reusable workflows rather than just exploiting quirks of a specific model&#x27;s architecture.</p><p>Finally, the framework is highly efficient regarding token usage and context window real estate. Across all benchmarks, the final deployed skills never exceeded 2,000 tokens, with a median length of roughly 920 tokens. This results in highly readable, auditable artifacts that a human practitioner can review and manage in minutes.</p><h2>Implementation strategies and the enterprise &#x27;catch&#x27;</h2><p>For enterprise tech leaders, adopting a new framework requires understanding the overhead and limitations. 
While the research paper notes that training tokens can reach up to 210 million for academic benchmarks, the reality for day-to-day enterprise use cases is much lighter. The high token counts in testing were largely due to re-scoring massive held-out test sets.</p><p>&quot;The real upfront work is the verifier and a representative held-out split. The optimizer is light; the evaluation harness is where the engineering goes,&quot; Yang said. He added that for everyday use, &quot;in community frameworks like GBrain, where SkillOpt updates run on Claude Sonnet, training a skill for a single task averages just $1–5.&quot; This optimization cost is a one-time fee that amortizes completely at deployment.</p><p>However, the framework requires specific conditions to work effectively, namely a few dozen representative examples and a scorable feedback signal. Teams should avoid applying SkillOpt to open-ended or subjective tasks. &quot;With no clean automatic scorer you have to design a human- or model-based evaluator and watch its stability,&quot; Yang said.</p><p>SkillOpt also integrates smoothly with existing orchestration stacks, removing a major adoption hurdle. For instance, developers already using pipeline compilers can run both systems harmoniously. &quot;DSPy is a different, complementary layer,&quot; Yang said. &quot;It compiles declarative LM pipelines and optimizes program structure; SkillOpt optimizes the external skill state a frozen agent loads. You can run them together.&quot;</p><p>Looking ahead, open-source developers are already scheduling SkillOpt to run periodically over their agents&#x27; past trajectories, creating a small ecosystem of self-optimizing code-agent plugins. This continuous feedback loop represents a significant shift in how AI systems adapt.</p><p>&quot;The valuable version of self-improvement is an agent autonomously discovering knowledge to improve its own behavior and the user experience, under verification and audit,&quot; Yang said. 
&quot;Skills are the fastest, cheapest, most reversible first step, and the same mindset points toward agents eventually optimizing themselves, all the way down to their own weights.&quot;</p>]]></description>
            <author>bendee983@gmail.com (Ben Dickson)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/2yHyN4RGkIqnssmfMDdPT6/eeef27bc6852a7d2992921ec4705a521/skill_optimization.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
    </channel>
</rss>