<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>ai.rs — AI That Works for Your Business</title>
    <link>https://ai.rs</link>
    <description>Practical research on LLMs, RAG, fine-tuning, and deployment for business applications.</description>
    <language>en</language>
    <lastBuildDate>Tue, 09 Jun 2026 04:00:01 +0200</lastBuildDate>
    <atom:link href="https://ai.rs/feed.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>100% Human Key-Pressed: Share This Email Signature</title>
      <link>https://ai.rs/ai-for-business/100-percent-human-key-pressed-email-signature</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/100-percent-human-key-pressed-email-signature</guid>
      <pubDate>Sat, 23 May 2026 11:00:00 +0200</pubDate>
      <author>ai.rs</author>
      <description>A three-line email signature you can paste in 30 seconds — plus why every email client now ships with an AI that quietly flattens your voice. Six variants included.</description>
      <content:encoded><![CDATA[<p>Here's a three-line pledge you can paste into your email signature today:</p>
<pre><code>──
100% human key-pressed content.
0% machine-generated.
99% Naturally imperfect.</code></pre>
<p>That's it. Thirty seconds, three lines, one quiet stake in the ground. The rest of this article is why it works, who it's for, and how to make it yours.</p>
<h2>Why these three lines do the work</h2>
<p>Each line carries one specific weight.</p>
<p><strong>&quot;Human key-pressed content.&quot;</strong> Honest. Your fingers actually hit keys. Autocomplete suggested some words; you accepted some, rejected others. That's still you, the same way a dictation transcript is still you. The bar for <em>human-written</em> in 2026 isn't <em>no software involved</em> — that bar evaporated when Gmail Smart Compose shipped in 2018. The bar is: did a person make the decisions.</p>
<p><strong>&quot;0% machine-generated.&quot;</strong> A different stake. This one says: no pasted ChatGPT draft, no <em>rewrite this in my voice</em> handoff, no auto-composed reply. The line is specifically about generation, not assistance. Autocomplete suggests; you ratify. Generation produces; you copy.</p>
<p><strong>&quot;99% Naturally imperfect.&quot;</strong> The wink. Perfection is now the tell. The cleanest paragraphs in your inbox were probably written by something whose paragraphs are always clean. Imperfection — the dropped article, the run-on, the <em>almost</em>-right-word — used to be a thing to apologize for. In 2026 it's a watermark.</p>
<h2>The quiet cost of AI in your inbox</h2>
<p>Every major email client now ships with a writing model on by default. Gmail's Smart Compose, Outlook's Copilot, Apple's Writing Tools, Superhuman's Instant Reply. Each one nudges your sentences toward a smoothed-out, slightly-hedged, professional-but-generic register. The kind of prose that doesn't offend anyone, doesn't surprise anyone, and increasingly doesn't sound like <em>you</em>.</p>
<p>That's the cost. Not that AI writes your emails — that AI <em>flattens</em> them. Your idiom, the weird metaphor you'd reach for, the typo you'd leave because the sentence sounds right that way: all of it gets quietly replaced by suggestions tuned for the median of every email ever sent. Multiply by a billion inboxes and the texture of human written communication starts to converge.</p>
<p><em>Naturally imperfect</em> isn't just a stake against bot output. It's a stake against the slow erosion of your own voice by a thousand helpful auto-suggestions a day.</p>
<h2>&quot;But anyone can paste this onto AI text&quot;</h2>
<p>Yes. Anyone can also wear a t-shirt that says <em>honest</em>. The signature isn't a forensic test — it's a public commitment. The point is the social contract you're signing, and the moment your reader notices you signed it.</p>
<p>This works the same way <em>No animals were harmed</em> works on film credits: nobody audits the production line. The line still matters, because attaching it to your name means <em>if it turned out to be untrue, that would be on you</em>. Disclaimers aren't proofs. They're invitations to accountability.</p>
<h2>Copy this. Or pick your variant.</h2>
<pre><code>──
100% human key-pressed content.
0% machine-generated.
99% Naturally imperfect.</code></pre>
<p>That's the maximalist version. Six field-tested variants for different rooms:</p>
<table>
<thead>
<tr>
<th>Audience</th>
<th>Variant</th>
</tr>
</thead>
<tbody>
<tr>
<td>Founders</td>
<td><em>&quot;Written by a human who reread it twice and shipped anyway.&quot;</em></td>
</tr>
<tr>
<td>Developers</td>
<td><em>&quot;git blame: me. Compile errors: also me.&quot;</em></td>
</tr>
<tr>
<td>Consultants</td>
<td><em>&quot;Hand-written. Spellcheck off.&quot;</em></td>
</tr>
<tr>
<td>Maximalists</td>
<td>the three-line pledge above</td>
</tr>
<tr>
<td>Deadpan</td>
<td><em>&quot;Sent without AI. Probably.&quot;</em></td>
</tr>
<tr>
<td>Sci-fi readers</td>
<td><em>&quot;No electric sheep were harmed in the writing of this email.&quot;</em></td>
</tr>
</tbody>
</table>
<p>Pick one. Or remix it. Tag <strong>#naturallyimperfect</strong> when you do — it's the easiest way to find the others doing the same thing.</p>
<p>For Gmail, Outlook, Apple Mail: paste into Settings → Signature. For pre-styled HTML so the formatting survives Outlook's helpfulness:</p>
<pre><code class="language-html">&lt;div style="color:#64748b;font-family:'JetBrains Mono',Consolas,monospace;
            font-size:11px;line-height:1.5;border-top:1px solid #cbd5e1;
            padding-top:8px;margin-top:12px;"&gt;
  100% human key-pressed content.&lt;br&gt;
  0% machine-generated.&lt;br&gt;
  99% Naturally imperfect.
&lt;/div&gt;</code></pre>
<p>Thirty seconds, one paste, done.</p>
<h2>What this is actually signaling</h2>
<p>AI-written email is now the default texture of inboxes. Slack messages, status updates, replies to your client — all of it has been quietly drifting toward the same smoothed-out, slightly-hedged, perfectly-paragraphed prose. <em>Sounds-like-a-person-but-isn't</em> is the unmarked case now.</p>
<p>Against that, naturally imperfect text is a deliberate, <em>costly</em> signal. It says: <em>I cared enough to send you something flawed.</em> I didn't outsource the act of writing this to a model that would have done it cleaner. The imperfection isn't a bug. It's the receipt.</p>
<p>That's the actual trust signal in 2026. Not <em>I didn't use AI</em>. But <em>I'm accountable for the output, including the bits that aren't smooth</em>.</p>
<h2>What to do with it</h2>
<p>Paste it into your signature today. Reply-all to one person you respect with it on. See what happens.</p>
<hr />
<p><em>No electric sheep were harmed in the writing of this signature. The wool, the bleating, the imperfection — all ours.</em></p>]]></content:encoded>
      <category>business</category>
      <category>email</category>
      <category>ai-trust</category>
      <category>writing</category>
      <category>authenticity</category>
      <category>communication</category>
    </item>
    <item>
      <title>How to Run Qwen3-Coder 30B-A3B on RTX 5090 with Ollama</title>
      <link>https://ai.rs/ai-developer/qwen3-coder-30b-a3b-rtx-5090-ollama</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/qwen3-coder-30b-a3b-rtx-5090-ollama</guid>
      <pubDate>Fri, 22 May 2026 11:00:00 +0200</pubDate>
      <author>ai.rs</author>
      <description>A complete setup log for Qwen3-Coder 30B-A3B on a single RTX 5090: 231 TPS via the MoE architecture, 64K context with q8_0 KV cache, and the silent Modelfile failures — RENDERER, PARSER, temperature, stop tokens — that nobody documents.</description>
      <content:encoded><![CDATA[<h1>How to Run Qwen3-Coder 30B-A3B on RTX 5090 with Ollama</h1>
<p>May 2026 — notes from setting up a local coding LLM on a single consumer GPU, with the bumps left in.</p>
<h2>The goal</h2>
<p>A coding-focused LLM running entirely on my own hardware. Reasons in descending order of weight: no per-token costs, no rate limits, no data leaving the box for routine tasks, latency that's bounded by my own GPU rather than someone else's queue. Hardware on the desk: one RTX 5090 (32 GB VRAM, Blackwell sm_120), running Arch Linux. The question was what to put on it.</p>
<h2>A false start: the cloned repo</h2>
<p>I'd cloned <a href="https://github.com/noonghunna/club-3090"><code>noonghunna/club-3090</code></a> — a well-maintained recipe collection for serving LLMs on RTX 3090s. Excellent documentation, real benchmarks, honest about failure modes (their <code>docs/CLIFFS.md</code> is the kind of writeup most serving projects could learn from). But reading the actual <code>launch.sh</code>, the hardcoded model list was just two entries — Qwen3.6-27B and Gemma-4-31B — and the whole architecture is built around squeezing 27B-class models through 24 GB Ampere with vLLM nightlies and Genesis patches. Wrong card class, wrong era, wrong constraints. The &quot;model-agnostic by design&quot; claim in the README is aspirational at the code level: the structure scales to new models, but the launcher itself is bound to specific compose files.</p>
<p>Right call: read the docs, skip the runtime.</p>
<h2>Picking the pieces</h2>
<p><strong>Model.</strong> Qwen3-Coder-30B-A3B-Instruct. The &quot;30B-A3B&quot; is a Mixture of Experts: 30B total parameters, but only ~3B are activated per token. Inference cost is roughly that of a 3B dense model; quality lands much closer to a 30B dense model thanks to expert specialization. There's a 480B-A35B sibling that's outside reach for a 32 GB card. Easy choice.</p>
<p><strong>Quantization: Q5_K_M.</strong> At 21.7 GB this hits the quality/size sweet spot for 32 GB. Q4_K_M is ~18 GB but takes a 1–2% quality hit on coding tasks where token-level precision matters. Q8_0 is ~32 GB and leaves essentially no room for KV cache. Q5_K_M leaves enough headroom for a useful context window.</p>
<p><strong>Serving engine: ollama.</strong> This one surprised me. The &quot;right&quot; answer for max throughput would be llama.cpp's <code>llama-server</code> directly, or vLLM. But ollama wraps the same llama.cpp engine. The TPS gap between ollama and standalone <code>llama-server</code> is typically 0–10%, which is wrapper overhead rather than an engine difference. What you gain by going standalone is access to flags and tools ollama doesn't surface directly (fine-grained KV cache quantization flags, dedicated bench binaries). What you give up is the operational niceness: <code>ollama list</code>, automatic VRAM unload after idle, painless model switching, a model store that handles versioning. For daily use against one model, ollama wins on UX without paying meaningfully in speed.</p>
<h2>The Q5_K_M gotcha</h2>
<p>ollama maintains a curated library of pre-packaged models. <code>ollama pull qwen3-coder</code> works — except the curated quants for the 30B variant are Q4_K_M, Q8_0, and FP16. No Q5_K_M. Q4_K_M is the obvious &quot;just go&quot; option but I wanted to actually run Q5_K_M for the quality.</p>
<p>The workaround: download the Q5_K_M GGUF directly from one of the public re-quanters on Hugging Face (Unsloth and bartowski both maintain full quant sets), then register it with ollama via a Modelfile:</p>
<pre><code>FROM /home/arch/models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf
PARAMETER num_ctx 32768
PARAMETER num_gpu 99</code></pre>
<p><code>ollama create</code> reads the FROM file, hashes it, and stores it as a content-addressed blob. Chat template, tokenizer config, and tool-calling format are supposed to be read from the GGUF's metadata automatically, so in theory no <code>TEMPLATE</code> directive is needed and tool calling works out of the box for agent-style clients like Cline. That assumption turned out to be wrong for this GGUF, as the 64K section further down shows.</p>
<p>Disk-duplication caveat: my ollama runs as a systemd service with its model store at <code>/var/lib/ollama/</code>, which is a different btrfs subvolume from <code>/home</code>. Btrfs doesn't allow cross-subvolume hardlinks, so <code>ollama create</code> <em>copies</em> the 22 GB file into its store. You can run ollama as your user with <code>OLLAMA_MODELS=$HOME/.ollama/models</code> to get hardlinks and zero duplication, but with a 22 GB file and 346 GB free that wasn't worth the systemd-juggling. Trading disk for simplicity.</p>
<h2>The 64K context gotcha</h2>
<p>First attempt: <code>num_ctx 65536</code> in the Modelfile, <code>ollama create</code>, <code>ollama run</code>. Result:</p>
<pre><code>Error: 500 Internal Server Error: memory layout cannot be allocated with num_gpu = 99</code></pre>
<p>Initial instinct: ollama's memory estimator being pessimistic on MoE models. Wrong instinct. <code>nvidia-smi</code> showed 5.6 GB of VRAM already in use — KDE plasmashell (660 MB), Chromium GPU process and tabs (~3 GB total), Telegram (450 MB), a few smaller apps. Normal desktop session, but enough to push the budget over the line:</p>
<pre><code>Q5_K_M weights:            ~22 GB
FP16 KV cache at 64K:       ~6 GB
Activations + cudagraph:    ~2 GB
                            ─────
Total needed:               ~30 GB
Free VRAM (after desktop):  26.4 GB
                            ─────
Shortfall:                  -3.6 GB</code></pre>
<p>ollama wasn't pessimistic — the math was correct. Two ways out: free the 3.6 GB by closing Chromium, or shrink the KV cache. I dropped to <code>num_ctx 32768</code>, which cuts KV to ~3 GB. After re-creating the model:</p>
<pre><code>ollama:      24.4 GB  (weights + KV + activations)
Desktop:      5.4 GB
Free:         2.2 GB</code></pre>
<p>Fits cleanly with a healthy buffer.</p>
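<p>The budget check that failed at 64K can be sketched in a few lines. This is only the back-of-envelope arithmetic from the table above with its rounded figures; the function name is made up for illustration, and ollama's own estimator is more detailed than this.</p>
<pre><code class="language-python">def vram_headroom_gb(free_gb, weights_gb, kv_gb, overhead_gb):
    """Return VRAM left after loading; a negative value means it will not fit."""
    return free_gb - (weights_gb + kv_gb + overhead_gb)


TOTAL_GB = 32.0     # RTX 5090
DESKTOP_GB = 5.6    # already in use by the desktop session per nvidia-smi

# 64K context with an FP16 KV cache: the attempt that failed
headroom = vram_headroom_gb(TOTAL_GB - DESKTOP_GB, 22.0, 6.0, 2.0)
print(round(headroom, 1))  # -3.6</code></pre>
<p>Feeding in the free figure from <code>nvidia-smi</code> rather than the card's nominal 32 GB is the whole point: the same model and context can fit or fail depending on what the desktop is holding.</p>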
<p>This is the part where local serving differs from cloud most concretely. Cloud inference has dedicated machines with 80+ GB of HBM per GPU, often 8 GPUs sharing capacity. Your local card shares with the desktop, the browser, the chat app, the screenshot tool. The first ~5 GB of VRAM is gone before the model even loads.</p>
<h2>&quot;Is 32K enough?&quot;</h2>
<p>When Claude advertises 1M context and your local model is capped at 32K, the gap looks vast. It isn't, for what coding actually needs:</p>
<ul>
<li>A typical source file: 1–5K tokens</li>
<li>A file plus 3–5 related files for context: 10–20K tokens</li>
<li>A moderate-codebase summary with focused references: 25–30K tokens</li>
</ul>
<p>32K covers all of that. The places where 1M actually pays off — read this entire 200-file repo and refactor it; ingest a 600-page document and answer questions across all of it — are where you'd be reaching for a cloud model anyway, both for context and for the qualitatively better judgment of a frontier model. The local model is for the routine 80%: &quot;explain this function&quot;, &quot;write a unit test&quot;, &quot;refactor this loop&quot;, &quot;what's wrong with this regex&quot;.</p>
<p>For when you do want more context locally, ollama exposes <code>OLLAMA_KV_CACHE_TYPE=q8_0</code>, which roughly halves KV memory at near-zero quality cost. That alone moves 64K from &quot;won't fit&quot; to &quot;fits with room&quot;. I left that as an opt-in rather than the default since it requires editing the systemd unit.</p>
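<p>To see where &quot;roughly halves&quot; comes from, here is a rough KV cache size estimate. The architecture values (48 layers, 4 KV heads, head dimension 128) are my assumption about this model's published config, not something measured here, so check them against your GGUF metadata. The q8_0 cost assumes blocks of 32 int8 values plus one FP16 scale each.</p>
<pre><code class="language-python">def kv_cache_gib(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Size of the K and V caches for num_ctx tokens, in GiB."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return num_ctx * elems_per_token * bytes_per_elem / 2**30


# Assumed architecture values; verify against the GGUF metadata.
ARCH = dict(n_layers=48, n_kv_heads=4, head_dim=128)

FP16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values plus one FP16 scale per block

print(kv_cache_gib(65536, bytes_per_elem=FP16_BYTES, **ARCH))  # 6.0
print(kv_cache_gib(65536, bytes_per_elem=Q8_0_BYTES, **ARCH))  # 3.1875</code></pre>
<p>Under those assumptions the FP16 figure lines up with the ~6 GB in the budget table, and q8_0 lands slightly above half of it because of the per-block scales.</p>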
<h2>How to think about quant vs context</h2>
<p>A natural follow-up question after hitting the 64K wall: what if I gave up some weight precision in exchange for more context? Q4_K_M is ~18 GB on disk; that's about 4 GB less than Q5_K_M, which is enough KV cache for an extra ~40K tokens at FP16. So a Q4_K_M build with the same VRAM budget gets roughly <em>double</em> the workable context. Tempting.</p>
<p>But there are two things that make this less obviously good than it looks.</p>
<p>First, the quality cost isn't symmetric across workflows. Published coding benchmarks (HumanEval, MBPP, LiveCodeBench) show Q5_K_M → Q4_K_M drops of 1–3% absolute pass rate for 30B-class models. That's small enough to be undetectable on a single prompt: blind taste tests, you'd struggle to tell them apart. But for <em>agentic</em> coding — Cline-style multi-step refactors, aider with edit-format tool calls, anything where the model is making chained decisions — those small per-step errors compound. A 2% wrong-token rate per decision over 10 decisions starts to look meaningfully different from the same model at Q5. So the Q5 → Q4 swap costs more in workflows where it matters most: long-running agent sessions, which are also the workflows that most want the extra context.</p>
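<p>The compounding is easy to put a number on. A minimal sketch, assuming each decision fails independently with the same probability (real agent steps are not independent, so treat this as an illustration rather than a prediction):</p>
<pre><code class="language-python">def chance_of_any_error(per_step_error, steps):
    """Probability that at least one of `steps` independent decisions goes wrong."""
    return 1 - (1 - per_step_error) ** steps


for steps in (1, 10, 50):
    print(steps, round(chance_of_any_error(0.02, steps), 3))
# 1 0.02
# 10 0.183
# 50 0.636</code></pre>
<p>A 2% per-decision error rate that is invisible on a single prompt becomes roughly an 18% chance of at least one bad step across ten chained decisions.</p>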
<p>Second, more context doesn't translate linearly to better outputs. Coding models tend to <em>degrade</em> on long-context retrieval beyond their effective working window — quality on &quot;use these 50 files to find the bug&quot; drops sharply past ~32K, even for models trained to 256K. Published needle-in-haystack benchmarks measure something narrower than what real codebase work needs. Past ~32K, you usually get better results by being selective about what you include in context than by stuffing more in.</p>
<p>So the binary &quot;Q5 with 32K context vs Q4 with 64K context&quot; turns out to be the wrong framing. The real lever is in the middle.</p>
<p>What actually works:</p>
<ul>
<li><strong>Q5_K_M + q8_0 KV cache</strong> keeps Q5-level weight quality and roughly <em>halves</em> the per-token KV cost. With near-zero quality impact, it brings 64K into easy reach and 128K close to the edge. q8_0 isn't true FP8 (it's int8 with shared FP16 block scales) but the memory savings are FP8-class.</li>
<li><strong>Unsloth's UD-Q5_K_XL variant</strong>, at the same 21.7 GB size as Q5_K_M, selectively keeps higher precision on critical layers. Theoretically pushes quality toward Q6 territory at Q5 cost.</li>
</ul>
<p>The sensible progression for someone in my position: enable q8_0 KV first (a free lever — no quality tax) and live with that for a couple of weeks. If you find yourself routinely running out of context on real tasks past 128K, the workflow is asking for cloud anyway. Only consider Q4_K_M if you've actually validated that the context ceiling matters in your day-to-day, not just in theory.</p>
<p>Going to Q4 before trying q8_0 KV is paying the quality bill up-front for ceiling you might never touch.</p>
<h2>The performance surprise</h2>
<p>I'd estimated 80–120 TPS based on the model size (30B). The first benchmark sent that estimate to the bin:</p>
<pre><code>{"eval_count": 462, "eval_duration_ms": 2001.85, "tps": 230.79}</code></pre>
<p><strong>231 tokens per second</strong> for a short coding completion. Roughly double my back-of-envelope.</p>
<p>The reason is the MoE architecture. My mental model was anchored on dense 30B inference, where every parameter touches every token and TPS reflects that. In a 30B-A3B MoE, each token's forward pass activates only ~3B of parameters (the chosen experts plus the shared layers). Generation speed scales with active parameters, not total. On a 5090's memory bandwidth, 3B of effectively-active weights moves fast.</p>
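<p>A crude way to sanity-check this is a bandwidth ceiling: if decoding is memory-bound, tokens per second cannot exceed memory bandwidth divided by the bytes of weights read per token. The 1792 GB/s figure is my assumption from the 5090 spec sheet, and the model ignores KV cache reads, routing, and kernel overhead, so it gives an upper bound only.</p>
<pre><code class="language-python">def decode_ceiling_tps(bandwidth_gb_s, weights_read_gb):
    """Upper bound on tokens/s if each token streams weights_read_gb from VRAM."""
    return bandwidth_gb_s / weights_read_gb


BANDWIDTH_GB_S = 1792.0   # assumed RTX 5090 spec-sheet bandwidth
FILE_GB = 21.7            # Q5_K_M size from this article
TOTAL_PARAMS_B = 30.0
ACTIVE_PARAMS_B = 3.0

gb_per_billion_params = FILE_GB / TOTAL_PARAMS_B

dense = decode_ceiling_tps(BANDWIDTH_GB_S, FILE_GB)
moe = decode_ceiling_tps(BANDWIDTH_GB_S, ACTIVE_PARAMS_B * gb_per_billion_params)
print(round(dense), round(moe))  # 83 826</code></pre>
<p>Under these assumptions the dense ceiling sits at the low end of my 80–120 estimate, while the MoE ceiling is about ten times higher. The measured 231 TPS falls between the two, which is plausible once the overheads this sketch ignores are counted.</p>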
<p>The catch is that prefill (reading the prompt before generation starts) still touches all the model machinery, and its attention cost grows roughly quadratically with prompt length. So a short interactive coding prompt feels blazing; a 20K-token &quot;here's my codebase&quot; prompt has a noticeable pause before the first token. The 230 TPS number is steady-state generation, not prefill-bound latency.</p>
<p>Either way, this is comfortably usable. At 230 TPS, a 1000-token response materializes in about 4 seconds. Interactive coding feels closer to typing-speed than to &quot;wait for the assistant&quot;.</p>
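<p>The arithmetic behind both numbers, as a small sketch. It takes the duration in milliseconds to match the benchmark line above; ollama's native API reports <code>eval_duration</code> in nanoseconds, so convert first if you read it from there. The helper names are illustrative.</p>
<pre><code class="language-python">def tokens_per_second(eval_count, eval_duration_ms):
    """Generation speed from a token count and a duration in milliseconds."""
    return eval_count / (eval_duration_ms / 1000.0)


def seconds_for(tokens, tps):
    """Time to generate `tokens` at a steady rate (ignores prefill)."""
    return tokens / tps


tps = tokens_per_second(462, 2001.85)
print(round(tps, 2))                     # 230.79
print(round(seconds_for(1000, tps), 1))  # 4.3</code></pre>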
<h2>Going to 64K — and finding what 32K hid</h2>
<p>Earlier I flagged <code>OLLAMA_KV_CACHE_TYPE=q8_0</code> as the first lever to pull: quantize the KV cache to int8 with FP16 block scales, halving its VRAM cost at essentially zero quality impact. I did that next.</p>
<p>The setup is a systemd drop-in (<code>/etc/systemd/system/ollama.service.d/override.conf</code>) adding two env vars to the daemon: <code>OLLAMA_KV_CACHE_TYPE=q8_0</code> and <code>OLLAMA_FLASH_ATTENTION=1</code> (the second is auto-enabled on Blackwell, but being explicit is cheaper than wondering later). After <code>systemctl daemon-reload &amp;&amp; systemctl restart ollama</code>, I bumped <code>num_ctx</code> in the Modelfile from 32768 to 65536 and re-ran <code>ollama create</code>.</p>
<p>The numbers confirmed it engaged. ollama process VRAM went from 24.4 GB at 32K-FP16-KV to <strong>25.0 GB at 64K-q8_0-KV</strong>. That matches the ~3 GB savings you'd expect from halving the per-token KV cost (6 GB FP16 → 3 GB q8_0 at 64K): doubling the context left the KV cache at roughly its 32K-FP16 size. TPS sat at 223, essentially unchanged from the 230 at 32K. Free desktop VRAM dropped to 1.4 GB, which is tight but workable. Functionally I now had 2× the context for less than 1 GB more allocation.</p>
<p>Then I ran a real coding prompt to validate quality. And the output went off a cliff.</p>
<p>The model wrote a sensible function. Then emitted <code>&lt;|endoftext|&gt;</code> as literal text. Then <em>kept generating</em>. It hallucinated a fake user follow-up turn (&quot;Human: Can you modify the function to also...&quot;). Then &quot;answered&quot; itself. Then repeated this loop four or five times, each iteration claiming to be the &quot;final clean version&quot; and contradicting the previous one. At no point did ollama stop the generation.</p>
<p>The diagnosis was upstream of everything I'd been doing. <code>ollama show --modelfile qwen3-coder-q5km</code> revealed the actual template ollama had registered for the model:</p>
<pre><code>TEMPLATE {{ .Prompt }}</code></pre>
<p>That's the no-template default — raw user input passed through unchanged, no ChatML wrapping, no stop tokens declared. ollama is supposed to read the chat template from the GGUF's <code>tokenizer.chat_template</code> metadata field. Either the Unsloth re-quant doesn't populate that field cleanly, or ollama 0.19 doesn't parse Qwen3's specific Jinja template variant correctly. Either way, ollama had silently fallen back to &quot;no template&quot; without warning, and I hadn't noticed because:</p>
<ol>
<li>Modern Qwen is robust enough to produce sensible output even from bare prompts. The model's <em>first</em> response was fine.</li>
<li>Short prompts (like the Sieve benchmark behind the TPS number) end naturally and don't need stop tokens to halt: the model picks a reasonable conclusion and the API returns. That test had been measuring TPS on a workflow where the missing stop tokens never mattered.</li>
<li>The model emitted <code>&lt;|endoftext|&gt;</code> — but as literal text, because ollama wasn't told it was a stop string.</li>
</ol>
<p>The fix was a proper <code>TEMPLATE</code> block in the Modelfile (Qwen ChatML, ~15 lines) plus three explicit <code>PARAMETER stop</code> directives: <code>&lt;|im_end|&gt;</code>, <code>&lt;|endoftext|&gt;</code>, <code>&lt;|im_start|&gt;</code>. After <code>ollama create</code> re-registered with these in place, the same anagrams prompt produced one focused answer, the model emitted its turn terminator, ollama halted, and the REPL returned to the <code>&gt;&gt;&gt;</code> prompt. The output quality was visibly higher too — internal doctest/code consistency held (in the broken run, the doctest expected output that contradicted the implementation), and the model used modern <code>list[str]</code> type hints rather than the older <code>typing.List[str]</code>.</p>
<p>The lesson: when you go custom-GGUF-via-Modelfile instead of using ollama's curated library, you take on responsibility for the chat template and stop tokens that the curated tags configure invisibly. Going to <code>ollama pull qwen3-coder:30b-a3b-q4_K_M</code> would have given me the right template metadata for free. Going custom traded that for the higher quant. Worth the trade — but the silent fallback to the no-template default was a much sharper edge than I'd expected from &quot;just create a Modelfile.&quot;</p>
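<p>A check like the following would have caught the fallback before the first long prompt. It is a rough sketch that scans the text printed by <code>ollama show --modelfile</code> for the two symptoms described above; the function name and warning strings are mine, and the parsing is deliberately naive (first <code>TEMPLATE</code> line only).</p>
<pre><code class="language-python">BARE_TEMPLATE = "{{ .Prompt }}"


def lint_modelfile(text):
    """Return warnings for a Modelfile dump lacking a chat template or stop tokens."""
    lines = [line.strip() for line in text.splitlines()]
    templates = [ln for ln in lines if ln.startswith("TEMPLATE")]
    stops = [ln for ln in lines if ln.startswith("PARAMETER stop")]

    warnings = []
    if not templates:
        warnings.append("no TEMPLATE directive")
    else:
        # Text after the keyword on the first TEMPLATE line, quotes removed
        body = templates[0][len("TEMPLATE"):].strip().strip('"')
        if body == BARE_TEMPLATE:
            warnings.append("bare pass-through template")
    if not stops:
        warnings.append("no PARAMETER stop directives")
    return warnings


broken = (
    "FROM /home/arch/models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf\n"
    "TEMPLATE {{ .Prompt }}\n"
    "PARAMETER num_ctx 65536\n"
)
print(lint_modelfile(broken))
# ['bare pass-through template', 'no PARAMETER stop directives']</code></pre>
<p>An empty list does not prove the template is correct for the model; it only rules out the silent no-template default.</p>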
<p>It also retroactively changes my reading of an earlier observation. The first time I ran the anagrams test, before fixing the template, the model wrote a function whose doctest contradicted its own code — the kind of small-but-real attention drift I'd attributed in passing to Q5 quantization. With the template fixed, the same prompt produces an internally consistent answer. That drift wasn't the quant. It was the model being forced to keep generating past its natural end-of-turn, getting derailed into self-correction loops, and accumulating contradictions across the imagined revisions. The quant was never the problem.</p>
<h2>Where this leaves things</h2>
<p>Final stack:</p>
<ul>
<li><strong>Hardware</strong>: RTX 5090, 32 GB</li>
<li><strong>Model</strong>: Qwen3-Coder-30B-A3B-Instruct, Q5_K_M (21.7 GB on disk)</li>
<li><strong>Engine</strong>: ollama 0.19.0 (wraps llama.cpp), with q8_0 KV cache and flash attention enabled via systemd override</li>
<li><strong>Context</strong>: 64K</li>
<li><strong>VRAM at load</strong>: 25.0 GB used by ollama, 5.6 GB by desktop, 1.4 GB free</li>
<li><strong>Speed</strong>: ~223 TPS steady-state for short prompts (essentially unchanged from 32K-FP16)</li>
<li><strong>Endpoint</strong>: <code>http://localhost:11434/v1</code>, model <code>qwen3-coder-q5km</code></li>
</ul>
<p>Coding clients (aider, Continue.dev, Cline, Cursor with custom-provider mode) all connect to the OpenAI-compatible endpoint with a dummy API key. Tool calling works because the Modelfile's TEMPLATE block renders Qwen ChatML correctly, and the embedded GGUF tokenizer handles the <code>&lt;tool_call&gt;</code> framing.</p>
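<p>For scripts that don't go through one of those clients, a plain standard-library request is enough. This sketch assumes the standard OpenAI-style <code>/chat/completions</code> path under the endpoint above; the helper names and the <code>ollama</code> placeholder key are illustrative.</p>
<pre><code class="language-python">import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"
MODEL = "qwen3-coder-q5km"


def build_chat_request(prompt, api_key="ollama"):
    """Build (but do not send) a chat completion request for the local endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Dummy key: clients expect one, the local server should ignore it
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def send(request):
    """Send the request and return the reply text; needs the ollama daemon running."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]</code></pre>
<p>Usage would be <code>send(build_chat_request("explain this function"))</code>. Keeping the builder separate from the network call makes the payload easy to inspect when a client misbehaves.</p>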
<p>What I'd try next:</p>
<ul>
<li><strong>The UD-Q5_K_XL variant from Unsloth</strong> at the same 21.7 GB size — uses higher precision selectively on important layers, theoretically better quality for the same VRAM cost.</li>
<li><strong>Side-by-side against Claude on real tasks</strong> — not synthetic benchmarks, just &quot;did the local model handle this PR review / refactor / debugging session, and where did it fall short&quot;. The interesting question for local serving isn't TPS; it's &quot;where exactly is the quality cliff vs cloud, and what tasks fall safely below it.&quot;</li>
<li><strong>vLLM with FP8-quantized weights</strong> to actually exploit Blackwell's FP8 tensor cores. llama.cpp doesn't use them today; running on a 5090 leaves them idle. The setup cost is real (different weight format, more moving parts) but it's the only way to find out what this card can actually do on dense models.</li>
</ul>
<h2>Reflections</h2>
<p>A few things I'd tell past-me starting this experiment.</p>
<p><strong>The exotic stuff is for niche constraints.</strong> vLLM, Genesis patches, custom quant kernels — these exist because someone has a constraint that can't be fixed any other way (24 GB Ampere, prefill cliffs on specific architectures, etc.). On a 5090 with a normal model, ollama covers 95% of the value and any of the alternatives is incremental.</p>
<p><strong>Estimate VRAM by what's free, not what's installed.</strong> &quot;I have 32 GB&quot; is misleading. You have 32 GB minus whatever your desktop and apps are holding, and that floor moves around. Check <code>nvidia-smi</code> before assuming. The first failure of this experiment — 64K context refusing to fit — wasn't a misconfiguration. It was the desktop quietly holding 5.6 GB that the back-of-envelope math hadn't accounted for.</p>
<p><strong>MoE inference is its own thing.</strong> Dense-model intuitions about TPS don't transfer. The 230 TPS surprise was useful — it changed what I think this hardware is good for. The expensive parts of a 30B-A3B forward pass are routing decisions and shared layers, both small; the bulk of the parameter budget sits in experts that mostly idle.</p>
<p><strong>The curated-vs-custom trade is sharper than it looks.</strong> When you <code>ollama pull</code> a tag from the curated library, you also pull the right chat template, stop tokens, and parameter defaults invisibly bundled with the weights. When you go custom — your own Modelfile pointing at a downloaded GGUF — you're responsible for those, and ollama's fallback when it can't read the GGUF's embedded chat template is <em>no template at all</em>, silently. It &quot;works&quot; for short prompts because Qwen is robust, and fails catastrophically for longer ones because there are no stop tokens. The first I knew was the model hallucinating fake user turns. Add explicit <code>TEMPLATE</code> and <code>PARAMETER stop</code> directives to any custom Modelfile, even if you think the GGUF &quot;has it built in&quot;.</p>
<p><strong>Quality bugs and config bugs look the same from outside the model.</strong> I almost wrote off the model's doctest/code inconsistency as a Q5_K_M quality limit — exactly the kind of &quot;small attention drift that compounds in agentic workflows&quot; I'd theorized about earlier. It wasn't. It was the model being forced to keep generating, drifting through invented follow-up turns, accumulating contradictions across imagined revisions. Once stop tokens worked, the same prompt produced an internally consistent answer. Worth a sanity check before blaming the weights: is the model actually finishing its turn, or is it being kept on the leash by missing config?</p>
<p><strong>Local isn't a cloud replacement, it's a complement.</strong> The right framing isn't &quot;can the 5090 run something as good as Claude&quot;. It's &quot;for which tasks is the 5090 fast enough, private enough, and cheap enough that I'd rather use it than reach for the cloud, even at lower quality&quot;. For routine coding tasks the answer is &quot;many of them&quot; — once the stop tokens are working.</p>
<h2>Postscript: trying Crush</h2>
<p>After the writeup above, I went looking for a more polished alternative to Aider — something with the agentic UX of Claude Code but model-agnostic from the start. The obvious candidate was <a href="https://github.com/charmbracelet/crush">Crush</a> from Charmbracelet — the team behind Bubble Tea, Lipgloss, Glamour, Glow, <em>the</em> terminal-UI shop. Go-based, single binary, AUR-installable with <code>yay -S crush-bin</code>. ~24K stars, daily commits, growing fast.</p>
<p>The install was clean. The TUI launch screen was genuinely beautiful — pixel-perfect spacing, considered colors, a Charm logo that's just the right amount of fun. Better than any other coding-assistant TUI I've seen. The two-tier &quot;Large Task / Small Task&quot; model picker is a nice ergonomic detail — configure cheap-and-fast for one slot, quality-for-hard-stuff for the other. I added Qwen3-Coder Q5KM under an <code>ollama</code> provider in <code>~/.config/crush/crush.json</code>, similar shape to the OpenCode config. Crush picked it up; the model picker showed it as <code>✓ Configured</code>. So far so good.</p>
<p>One nice UX detail worth noting: Crush also detected my <code>ANTHROPIC_API_KEY</code> (set elsewhere for Claude Code) and defaulted to Claude Sonnet 4.6 automatically, prioritizing cloud over local when both are available. Switching to Qwen3-Coder via the picker was a keystroke. Real respect for the dual-model dual-provider workflow.</p>
<p>Then I gave it a prompt: <code>evaluate README</code>.</p>
<p>Crush replied with &quot;I'll evaluate the README.md file for you,&quot; and then immediately got stuck:</p>
<pre><code>[Uses ls tool] [uses view tool] [uses view tool] [uses view tool] ...</code></pre>
<p>Pages of it. Hundreds of lines of <code>[uses view tool]</code> in brackets. The model was outputting <strong>natural-language descriptions</strong> of tool calls instead of actual structured tool calls — and Crush wasn't executing anything, so the model never got file content back, so it kept &quot;trying.&quot; Stop tokens didn't fire because none of <code>&lt;|im_end|&gt;</code> / <code>&lt;|endoftext|&gt;</code> / <code>&lt;|im_start|&gt;</code> was appearing in this hallucinated description format.</p>
<p>A bit of digging revealed this is a known Crush bug — <a href="https://github.com/charmbracelet/crush/issues/2936">#2936</a>, filed by another user the day before my own attempt, with mitmproxy diagnostics proving the chain:</p>
<ol>
<li>Crush correctly sends tool definitions to the model.</li>
<li>The model correctly responds with <code>finish_reason: "tool_calls"</code> and well-formed tool call JSON.</li>
<li><strong>Crush silently ignores the tool calls and never executes them.</strong></li>
<li>The model, getting no execution feedback, repeats — bounded only by <code>default_max_tokens</code>.</li>
</ol>
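<p>The difference between a structured tool call (step 2) and text that merely narrates one is easy to check for in a captured response. A minimal Python sketch, assuming the standard OpenAI-style chat-completion message shape; the helper name and the bracket heuristic are illustrative, not anything Crush or ollama ships:</p>
<pre><code class="language-python">import re

# Heuristic for narrated pseudo-calls such as "[uses view tool]".
NARRATED_CALL = re.compile(r"\[uses? \w+ tool\]", re.IGNORECASE)


def classify_message(message):
    """Classify one assistant message from an OpenAI-compatible response."""
    if message.get("tool_calls"):
        return "structured"   # a client can execute these
    content = message.get("content") or ""
    if NARRATED_CALL.search(content):
        return "narrated"     # text that only describes a tool call
    return "plain"


# Illustrative messages, not captured traffic.
structured = {"content": None, "tool_calls": [
    {"id": "call_1", "type": "function",
     "function": {"name": "view", "arguments": '{"file_path": "README.md"}'}}]}
narrated = {"content": "[Uses ls tool] [uses view tool] [uses view tool]"}

print(classify_message(structured))  # structured
print(classify_message(narrated))    # narrated</code></pre>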
<p>So our setup was right. The model was right. The protocol translation was right. <strong>Crush itself has a regression in its OpenAI-compatible-provider tool-call execution path</strong> that didn't exist in earlier versions — a <a href="https://soc.meschbach.com/posts/2026/01/12-experiments-with-crush-and-ollama--qwen-3-coder/">January 2026 blog post by Meschbach</a> documents the same setup working successfully four months earlier. The breakage is recent, the fix is pending, and multiple related issues going back to August 2025 (<a href="https://github.com/charmbracelet/crush/issues/447">#447</a>, still open after nine months) suggest the local-provider integration is a fundamentally rough surface area for Crush at the moment. Not bad faith from Charm — just not yet a fully-shipped feature.</p>
<p>This is the second empirical confirmation of the article's design-theory framing, on top of the chat-template gotcha earlier:</p>
<ul>
<li>The <strong>first failure</strong> was at our config layer (silent fallback to no-template when ollama couldn't parse the GGUF's embedded Jinja). Fixable by adding explicit <code>TEMPLATE</code> and <code>PARAMETER stop</code> directives to the Modelfile.</li>
<li>The <strong>second failure</strong> is at the <em>tool's</em> layer (Crush's tool-call execution path is broken for OpenAI-compatible providers). Not fixable at our level — wait for Charm to ship a fix.</li>
</ul>
<p>Both fit the same pattern: <strong>agentic-style tools have larger surface areas to break, particularly along the local-model integration path that isn't the developers' day-job priority.</strong> Aider's smaller, more deliberate surface area — user-driven dialog, explicit file context via <code>/add</code>, no autonomous tool exploration — avoids both failure modes by design. Not because Aider is &quot;better&quot; in some absolute sense, but because Aider's design rewards weaker models for what they can do (write code given context) instead of asking them to do what they're worst at (drive an agentic tool loop reliably).</p>
<p>The right next experiment is <strong>OpenCode</strong> — same agentic category as Crush, different codebase, possibly different bug surface. If OpenCode handles tool calls against ollama cleanly, &quot;agentic + local model&quot; works in <em>some</em> tool, just not Crush right now. If OpenCode also fails on the same task, the case for Aider's design philosophy gets stronger still: smaller surface area is just better for a workflow where every integration point is a potential bug, and the model itself is more constrained than the tooling assumes.</p>
<p>For now, Aider remains the working tool for actual coding work on this stack. Crush stays installed; I'll come back when <a href="https://github.com/charmbracelet/crush/issues/2936">#2936</a> lands.</p>
<p>The meta-lesson is the same one the rest of the writeup keeps pointing at: <strong>with a local 30B-class model, the surface area you can fail through is large, and the bugs are silent.</strong> Chat templates that quietly fall back to no-template. Stop tokens that aren't fired because the model emitted a non-canonical end marker. Tool-call responses that the client silently discards. None of these failures throw an exception. They all just produce subtly-wrong output, or no output at all, and you only notice when you actually try real work. The setup time isn't in the install — it's in discovering and fixing the silent gaps.</p>
<h2>Postscript update: it was the Modelfile, not Crush</h2>
<p>After writing the section above, I kept poking. The &quot;Crush is broken with local OpenAI-compatible providers&quot; framing felt too convenient — multiple tutorials documented the combo working in earlier Crush versions, and issue #2936 had been open for less than a day with no maintainer comments either confirming or denying. I tried one more controlled experiment: same Crush, same prompt, same project, but a different Qwen3-Coder variant.</p>
<p>I pulled ollama's curated <code>qwen3-coder:30b-a3b-q4_K_M</code> tag (instead of using my custom Q5_K_M Modelfile from HF), added it to Crush's config alongside my Q5, restarted, switched to it in the model picker, and re-ran the same <code>evaluate README</code> prompt.</p>
<p>It worked. Perfectly. The <code>view</code> tool executed, the README content came back, the model produced a coherent multi-paragraph evaluation. The same Crush that had hallucinated <code>[uses view tool]</code> brackets ten minutes earlier was now driving an agentic tool-call loop without complaint.</p>
<p>The bug wasn't Crush. The bug was my Modelfile.</p>
<p>Diffing the two <code>ollama show --modelfile</code> outputs side by side revealed exactly two lines that differed in any load-bearing way:</p>
<pre><code>RENDERER qwen3-coder
PARSER qwen3-coder</code></pre>
<p>These are model-aware directives, added to ollama relatively recently. They tell ollama how to format prompts for a specific model and, critically, how to <em>parse</em> its output:</p>
<ul>
<li><strong><code>RENDERER</code></strong> wraps incoming chat messages in the model's expected format (for Qwen3-Coder, that's ChatML with <code>&lt;|im_start|&gt;</code> / <code>&lt;|im_end|&gt;</code> markers). Without it, ollama either uses the GGUF's embedded chat template or falls back to a stub. With it, ollama uses Qwen-specific logic.</li>
<li><strong><code>PARSER</code></strong> translates the model's output before delivering it to clients. <strong>This is the critical one.</strong> Qwen3-Coder emits tool calls in its native XML format: <code>&lt;tool_call&gt;{"name": "view", "arguments": {"file_path": "README.md"}}&lt;/tool_call&gt;</code>. OpenAI-compatible clients (including Crush) expect structured tool-call JSON in the <code>tool_calls</code> field of the response, not raw XML in the content. The <code>PARSER qwen3-coder</code> directive tells ollama to parse the XML and emit proper <code>tool_calls</code> JSON on the OpenAI-compatible API.</li>
</ul>
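<p>To make the <code>PARSER</code> step concrete, here is a toy Python version of that translation. It assumes the tool-call shape described above (a JSON object between tool-call tags) and the standard OpenAI <code>tool_calls</code> layout. The function name and the generated call ids are made up for illustration, and this is not ollama's actual parser:</p>
<pre><code class="language-python">import json
import re

OPEN_TAG = "&lt;tool_call&gt;"
CLOSE_TAG = "&lt;/tool_call&gt;"
CALL_RE = re.compile(re.escape(OPEN_TAG) + r"(.*?)" + re.escape(CLOSE_TAG), re.DOTALL)


def to_openai_message(raw_text):
    """Turn raw model text into an OpenAI-style assistant message."""
    tool_calls = []
    for index, match in enumerate(CALL_RE.finditer(raw_text)):
        call = json.loads(match.group(1))
        tool_calls.append({
            "id": f"call_{index}",  # hypothetical id scheme
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI-style clients expect arguments as a JSON string
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    if not tool_calls:
        return {"role": "assistant", "content": raw_text}
    # Keep any surrounding prose as content and drop the raw tags.
    content = CALL_RE.sub("", raw_text).strip() or None
    return {"role": "assistant", "content": content, "tool_calls": tool_calls}


raw = OPEN_TAG + '{"name": "view", "arguments": {"file_path": "README.md"}}' + CLOSE_TAG
message = to_openai_message(raw)
print(message["tool_calls"][0]["function"]["name"])  # view</code></pre>
<p>A client that only looks at the <code>tool_calls</code> field sees nothing to execute when this translation is skipped, because the whole call arrives as ordinary text in <code>content</code>.</p>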
<p>My hand-rolled Modelfile had <code>TEMPLATE</code> (a 15-line Jinja-ish ChatML wrapper I wrote based on what Qwen needs) and three <code>PARAMETER stop</code> directives. It did <em>not</em> have <code>RENDERER</code> or <code>PARSER</code>. So when the model emitted perfectly valid Qwen tool-call XML, ollama forwarded it raw to Crush, which saw plain text and ignored it. The model, getting no execution feedback, looped on its own attempts to invoke tools — which is the hallucinated <code>[uses view tool]</code> pattern.</p>
<p>This also retroactively explains issue #2936's diagnostic. The reporter saw <code>finish_reason: "tool_calls"</code> with correct tool-call data via mitmproxy, but Crush silently discarded it. Of course Crush discarded it — Crush was looking for structured JSON, ollama delivered raw Qwen XML. The bug isn't in Crush at all. The bug is that hand-rolled Qwen Modelfiles need to know about <code>RENDERER</code> and <code>PARSER</code>, and that knowledge isn't surfaced anywhere obvious in ollama's docs or in the Crush + Qwen tutorials floating around. The curated <code>qwen3-coder</code> tags have it; rolling your own from a Hugging Face GGUF, you don't unless you know to copy it.</p>
<p>I added the two lines to my Q5 Modelfile (<code>FROM</code> pointing at ollama's existing blob, no re-download needed) and re-ran <code>ollama create</code>. Then in Crush: switched to my Q5 model, ran the same prompt. Same clean tool-call execution. Three data points: Q5 broken, Q4 curated works, Q5 with the fix works. Diagnosis confirmed end to end.</p>
<p>The next stop on this winding path was wanting more context. I bumped Crush's <code>context_window</code> config from 65536 to 131072, restarted Crush, and re-ran a prompt. The model produced an awkward <code>&lt;function=view&gt; &lt;parameter=file_path&gt;</code> mangled output and didn't actually execute the tool — looked like another regression. But <code>curl /api/ps</code> told the real story: ollama had loaded the model at <code>"context_length": 32768</code>. Crush's <code>context_window</code> config field is <strong>UI-only</strong>. The OpenAI-compatible API path doesn't have a clean way to pass <code>num_ctx</code> to ollama, so Crush's config just affects the picker label. To actually get larger context, <code>num_ctx</code> has to be set in the Modelfile.</p>
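<p>The same check can be scripted. A small Python sketch, assuming ollama is listening on its default <code>http://localhost:11434</code> and that <code>/api/ps</code> reports a <code>context_length</code> per loaded model as described above; the function names and the sample payload are illustrative:</p>
<pre><code class="language-python">import json
import urllib.request


def loaded_context_lengths(ps_payload):
    """Map each loaded model name to the context length ollama reports."""
    return {
        model.get("name", "unknown"): model.get("context_length")
        for model in ps_payload.get("models", [])
    }


def fetch_ps(base_url="http://localhost:11434"):
    """Fetch /api/ps from a running ollama server (base URL is an assumption)."""
    with urllib.request.urlopen(base_url + "/api/ps") as response:
        return json.load(response)


# Illustrative payload; real responses carry more fields.
sample = {"models": [{"name": "qwen3-coder:30b-a3b-q4_K_M", "context_length": 32768}]}
print(loaded_context_lengths(sample))
# Against a live server: print(loaded_context_lengths(fetch_ps()))</code></pre>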
<p>(The mangled <code>&lt;function=...&gt;</code> output was a separate but related issue: at the default <code>temperature: 0.7</code> the curated Q4 tag uses, tool-call format adherence is probabilistic — the model occasionally improvises Anthropic-style XML when it should be using Qwen-style JSON. Dropping temperature to 0.2 makes format adherence essentially deterministic for tool use without hurting coding quality.)</p>
<p>So the canonical Modelfile that <em>actually</em> works ended up being a custom one built on top of the curated Q4 blob with four added overrides:</p>
<pre><code>FROM /var/lib/ollama/.ollama/models/blobs/sha256-1194192cf2…
# required for ChatML prompt formatting
RENDERER qwen3-coder
# required for tool-call XML→JSON translation
PARSER qwen3-coder
# required because Crush can't propagate num_ctx via OpenAI-compat
PARAMETER num_ctx 131072
# required for reliable tool-call format adherence
PARAMETER temperature 0.2
# (plus the same stop tokens and other sampler params as the curated tag)</code></pre>
<p>After registering this as <code>qwen3-coder-q4-128k</code> and pointing Crush at it, the agentic loop ran cleanly at 128K context with deterministic tool calls. End of investigation.</p>
<h2>The real takeaway</h2>
<p>This experiment ran over many hours with multiple false stops. The final working setup is a one-page Modelfile and a one-page Crush config. But the path from &quot;model downloaded&quot; to &quot;agentic Crush session running cleanly at 128K context&quot; required understanding four separate gotchas:</p>
<ol>
<li><strong>Custom Modelfiles need <code>RENDERER</code> and <code>PARSER</code> directives</strong> for tool-call translation. Curated ollama tags have them; hand-rolled ones from HF GGUFs don't.</li>
<li><strong>Crush's <code>context_window</code> config is UI-only</strong> — <code>num_ctx</code> must be set in the Modelfile, not the client config.</li>
<li><strong>Default temperature 0.7 makes tool-call format probabilistic.</strong> For agentic workflows, drop to 0.2.</li>
<li><strong>Stop tokens and chat templates</strong> still need to be right, even with <code>RENDERER</code> doing the work — though <code>RENDERER</code> makes a hand-rolled <code>TEMPLATE</code> block unnecessary.</li>
</ol>
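<p>The four gotchas above lend themselves to a quick pre-flight check on a Modelfile before running <code>ollama create</code>. This is a rough Python sketch under my own assumptions: the function name, the warning wording, and the 0.2 threshold are illustrative, and the parsing is a simple line-based heuristic, not ollama's real Modelfile parser:</p>
<pre><code class="language-python">def lint_modelfile(text, max_temperature=0.2):
    """Return warnings for the silent gotchas listed above (heuristic only)."""
    directives = set()
    parameters = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        keyword = parts[0].upper()
        directives.add(keyword)
        if keyword == "PARAMETER" and len(parts) >= 3:
            parameters.setdefault(parts[1], []).append(parts[2])

    warnings = []
    for name in ("RENDERER", "PARSER"):
        if name not in directives:
            warnings.append("missing " + name)
    if "num_ctx" not in parameters:
        warnings.append("num_ctx not set; client-side context settings may be ignored")
    temperature = parameters.get("temperature", [None])[-1]
    if temperature is None or float(temperature) > max_temperature:
        warnings.append("temperature not pinned at or below " + str(max_temperature))
    if "stop" not in parameters:
        warnings.append("no stop tokens set")
    return warnings


hand_rolled = "\n".join([
    "FROM ./qwen3-coder-q5.gguf",   # hypothetical path
    "PARAMETER temperature 0.7",
])
fixed = "\n".join([
    "FROM ./qwen3-coder-q5.gguf",
    "RENDERER qwen3-coder",
    "PARSER qwen3-coder",
    "PARAMETER num_ctx 131072",
    "PARAMETER temperature 0.2",
    "PARAMETER stop END_OF_TURN",   # placeholder; use the model's real stop tokens
])
print(lint_modelfile(hand_rolled))
print(lint_modelfile(fixed))  # []</code></pre>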
<p>None of these gotchas threw an exception or produced an error message. Each produced &quot;looks plausible, doesn't quite work&quot; output. The cost of running a local agentic stack isn't the disk or the install or the VRAM — it's the slow accumulation of empirical knowledge about which silent failure modes you're currently hitting, and which directive in which configuration file fixes which one.</p>
<p>The Aider design philosophy still holds — its smaller surface area genuinely is less likely to break on these silent gotchas. But once you've climbed the configuration learning curve, agentic-style tools (Crush, OpenCode) can be reliable too. The difference is that Aider rewards low-effort setup with reliable behavior; agentic tools demand high-effort setup but give you a richer working surface in return. Either is a valid choice. Just don't believe the install instructions when they say it's two commands. It's two commands plus a half-day of debugging Modelfile silent fallbacks.</p>]]></content:encoded>
      <category>infrastructure</category>
      <category>qwen3-coder</category>
      <category>ollama</category>
      <category>rtx-5090</category>
      <category>moe</category>
      <category>modelfile</category>
    </item>
    <item>
      <title>Qwen 3.6 27B: a Local Coding Model You Can Actually Run</title>
      <link>https://ai.rs/ai-developer/qwen-3-6-27b-local-coding-model</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/qwen-3-6-27b-local-coding-model</guid>
      <pubDate>Sat, 25 Apr 2026 11:00:00 +0200</pubDate>
      <author>ai.rs</author>
      <description>Alibaba&#039;s new 27B dense model gets within 4 points of Claude Opus 4.6 on SWE-bench, runs on a single RTX 4090, and ships under Apache 2.0. Here&#039;s what&#039;s real, what&#039;s hyped, and how to actually deploy it for coding work.</description>
      <content:encoded><![CDATA[<h1>Qwen 3.6 27B: a Local Coding Model You Can Actually Run</h1>
<p>For most of 2025, &quot;open-source coding model&quot; meant choosing between two unsatisfying tiers. The small models (8B–14B) ran on your laptop and felt like working with a tired intern. The big ones — DeepSeek V3, GLM-5.1, Kimi-K2 — competed with Claude, and required a small GPU cluster to serve.</p>
<p>Qwen 3.6 27B, released by Alibaba on April 22, 2026, is the first open model that lands on the practical side of that gap. It runs on a single RTX 4090 or a 24 GB Mac. It gets within 4 points of Claude Opus 4.6 on SWE-bench Verified. The weights are Apache 2.0.</p>
<p>If you've been waiting for the moment when &quot;self-hosted Claude Code&quot; stops being a meme, this is it — with caveats.</p>
<h2>What's actually new</h2>
<p>Three things are worth knowing before you download 18 GB of weights.</p>
<p><strong>It's a dense model.</strong> All 27 billion parameters fire on every token. That's the opposite of the MoE trend (Kimi, GLM, the new GPT-OSS variants), and it matters for hardware: a dense 27B fits the way you'd expect a 27B to fit. No 700B-of-which-30B-active tricks.</p>
<p><strong>262K native context, extensible to 1M with YaRN.</strong> Most coding agents spend the first two minutes of a session paging in repository structure; this one can hold a mid-sized monorepo without truncation.</p>
<p><strong>Thinking Preservation — reasoning that survives across turns.</strong> Toggle <code>preserve_thinking: true</code> and the model carries forward its prior chain-of-thought instead of regenerating it from the same context every turn. For multi-turn agentic workflows — the only kind that matter for real coding — this is the feature that bends the cost curve.</p>
<h2>The benchmarks, with the asterisk</h2>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th style="text-align: right;">Qwen 3.6 27B</th>
<th>For comparison</th>
</tr>
</thead>
<tbody>
<tr>
<td>SWE-bench Verified</td>
<td style="text-align: right;">77.2%</td>
<td>Claude Opus 4.6: 80.8%</td>
</tr>
<tr>
<td>Terminal-Bench 2.0</td>
<td style="text-align: right;">59.3%</td>
<td>Matches Claude Opus 4.5</td>
</tr>
<tr>
<td>SWE-bench Pro</td>
<td style="text-align: right;">53.5%</td>
<td>GLM-5.1 (754B MoE): 58.4%</td>
</tr>
<tr>
<td>SkillsBench</td>
<td style="text-align: right;">48.2%</td>
<td>Qwen 3.5 397B: 30.0%</td>
</tr>
</tbody>
</table>
<p>The asterisk: <strong>all of these were run on Qwen's internal agent scaffold</strong>, not a neutral one. Independent reproductions are still trickling in. Treat the numbers as directional. If your evaluation depends on a specific scaffold — OpenCode, Cline, Aider's bench harness — run it yourself before claiming parity in your README.</p>
<p>The number that's hard to game is the one against the previous generation: 48.2% vs 30.0% on SkillsBench at <em>one-fifteenth the parameters</em>. Whatever Qwen learned between 3.5 and 3.6, it applied it densely.</p>
<h2>Hardware: what you actually need</h2>
<p>Quantized GGUF (Q4_K_M or UD-Q4_K_XL) lands at ~18 GB. That puts the practical bar at:</p>
<ul>
<li><strong>Single GPU</strong>: RTX 4090, RTX 3090, or any 24 GB workstation card. (The RTX 4080 Super has only 16 GB, which is too small for an ~18 GB file.)</li>
<li><strong>Mac</strong>: M2 Pro / M3 Pro with 24 GB unified memory or better.</li>
<li><strong>CPU + offloading</strong>: works, slowly. 64 GB system RAM, sustained around 6 tokens/sec on a recent Ryzen.</li>
</ul>
<p>Full BF16 needs 60 GB+, which is more than a dual-3090 or single-A6000 setup (48 GB of VRAM) can hold; that is 80 GB data-center card territory. Almost no one needs that. Q4_K_M loses roughly 1–2 points on coding benchmarks vs full precision, well within run-to-run noise.</p>
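<p>A back-of-envelope check of the sizes quoted above, assuming weights dominate memory and that 1 GB means 10^9 bytes. The function names are illustrative, and real usage adds KV cache and runtime overhead on top:</p>
<pre><code class="language-python">def weights_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB, ignoring KV cache and runtime overhead."""
    return params_billions * bits_per_weight / 8


def implied_bits_per_weight(file_gb, params_billions):
    """Effective bits per weight implied by a quantized file size."""
    return file_gb * 8 / params_billions


print(weights_gb(27, 16))                         # 54.0
print(round(implied_bits_per_weight(18, 27), 2))  # 5.33</code></pre>
<p>So BF16 weights alone come to 54 GB, consistent with the 60 GB+ figure once cache and overhead are included, and an ~18 GB file works out to a bit over 5 bits per weight.</p>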
<h2>Three ways to actually run it</h2>
<h3>1. llama.cpp — fastest path for most developers</h3>
<pre><code class="language-bash">brew install llama.cpp   # or build from source

llama-server \
  -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --chat-template-kwargs '{"preserve_thinking": true}'</code></pre>
<p>You get an OpenAI-compatible endpoint at <code>localhost:8080</code>. Point any existing tool that speaks the OpenAI Chat Completions API at it and you're done. This is the path I'd recommend for 90% of readers.</p>
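<p>For a quick smoke test without installing a client library, something like the following Python sketch should do. It assumes the server is on <code>localhost:8080</code> and serves the usual <code>/v1/chat/completions</code> route; the model name is a placeholder and the helper names are mine:</p>
<pre><code class="language-python">import json
import urllib.request


def build_chat_request(prompt, temperature=0.6, top_p=0.95):
    """Build a Chat Completions payload (sampler values mirror the flags above)."""
    return {
        "model": "qwen3.6-27b",  # placeholder; use whatever name your server reports
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }


def chat(prompt, base_url="http://localhost:8080"):
    """POST the payload to the server and return the assistant's text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


print(json.dumps(build_chat_request("Summarize README.md"), indent=2))
# With llama-server running: print(chat("Summarize README.md"))</code></pre>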
<h3>2. Unsloth Studio — easiest for first-timers</h3>
<p>A browser UI at <code>localhost:8888</code> that handles weight downloads, GGUF selection, and chat-template wiring. Slower than raw llama.cpp at the margins; much faster to get running if you've never touched a local inference stack.</p>
<h3>3. SGLang or vLLM — for serving multiple users</h3>
<p>Version 0.5.10+ of SGLang, and recent vLLM, both ship with full Qwen 3.6 support including tool-calling and reasoning-block parsing. This is the right answer if you're serving a team rather than just yourself: under concurrent load, a single-user llama.cpp setup will saturate well before batched inference on the same 24 GB card does.</p>
<h2>Gotchas</h2>
<p>A handful of small footguns are worth knowing about up front.</p>
<p><strong>Avoid CUDA 13.2.</strong> It produces gibberish output on Qwen 3.6 GGUFs. 13.1 and 13.3 are fine. If you've blindly upgraded recently, downgrade before you start debugging anything else.</p>
<p><strong>Ollama doesn't work yet.</strong> Qwen 3.6's vision capability ships as a separate <code>mmproj</code> file, and Ollama's current packaging doesn't wire it in. Watch the Ollama issue tracker; expect a fix within a release or two. Until then, llama.cpp directly.</p>
<p><strong>Tool-call format.</strong> If your agent harness expects the Anthropic tool-use envelope, it won't work out of the box — Qwen ships an OpenAI-style <code>function_call</code> schema. Most modern harnesses (OpenCode, Aider, Cline) handle both; roll-your-own ones may need adapter code.</p>
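<p>For a roll-your-own harness, the adapter can be small. A Python sketch, assuming the model's calls arrive as an OpenAI-style <code>tool_calls</code> list and the harness wants Anthropic-style <code>tool_use</code> content blocks; the function name is hypothetical and error handling is omitted:</p>
<pre><code class="language-python">import json


def openai_to_anthropic_blocks(message):
    """Convert an OpenAI-style assistant message into Anthropic-style content blocks."""
    blocks = []
    if message.get("content"):
        blocks.append({"type": "text", "text": message["content"]})
    for call in message.get("tool_calls") or []:
        blocks.append({
            "type": "tool_use",
            "id": call["id"],
            "name": call["function"]["name"],
            # OpenAI carries arguments as a JSON string; Anthropic uses an object
            "input": json.loads(call["function"]["arguments"] or "{}"),
        })
    return blocks


# Illustrative message, not captured output.
message = {"content": None, "tool_calls": [
    {"id": "call_0", "type": "function",
     "function": {"name": "view", "arguments": '{"file_path": "README.md"}'}}]}
print(openai_to_anthropic_blocks(message))</code></pre>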
<h2>Should you switch from Claude or GPT?</h2>
<p>For most production coding agents, no. Claude Opus 4.7 still leads SWE-bench at 84.3%, and the API price isn't catastrophic for any team that hasn't already optimized tokens out of its workflow.</p>
<p>For three specific cases, yes.</p>
<ul>
<li><strong>Code that legally cannot leave your machines.</strong> Defense, healthcare, pre-IPO startups with competitive code. Self-hosting is the entire point.</li>
<li><strong>High-volume bulk operations.</strong> Migrations, codebase translations, automated refactors across a thousand repos. The token bill on the API for that kind of job is a serious chunk of an engineer's salary; a single 4090 amortizes in weeks.</li>
<li><strong>Local-first iteration.</strong> A coding agent that doesn't rate-limit you, doesn't change between sessions, and works on the plane.</li>
</ul>
<p>Outside those cases, treat Qwen 3.6 27B as a fallback worth having configured: somewhere between 90% and 95% of Claude's output quality on most tasks, with a per-token cost of approximately zero, and the same model available six months from now without an API deprecation notice.</p>
<p>That's a meaningful new option. It's the first time it's been one for people running on a single GPU.</p>
<hr />
<p><em>If you've benchmarked Qwen 3.6 27B on your own workflow, ai.rs would like to hear how it went. Drop a note via the contact page.</em></p>]]></content:encoded>
      <category>infrastructure</category>
      <category>qwen</category>
      <category>local-llm</category>
      <category>coding-agent</category>
      <category>llama-cpp</category>
      <category>quantization</category>
    </item>
    <item>
      <title>Why Every AI Engineer Should Learn Classical Chinese</title>
      <link>https://ai.rs/ai-developer/classical-chinese-agent-memory-compression</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/classical-chinese-agent-memory-compression</guid>
      <pubDate>Tue, 14 Apr 2026 11:00:00 +0200</pubDate>
      <author>ai.rs</author>
      <description>A benchmark of three agent-memory formats — plain English, AAAK shorthand, and Classical Chinese (Wenjian) — across Qwen and Llama. The 28% compression claim is half-true, but the methodology finding matters more: the weakest model is the most informative.</description>
      <content:encoded><![CDATA[<p><em>Or at least, why your agents should be writing in it.</em></p>
<hr />
<p>Six months into any serious LLM-agent project, the same thing happens.</p>
<p>The conversation history, the decision log, the accumulated project context —
all of it balloons past the model's context window. You start summarizing.
The summaries lose fidelity. You feed the summaries back in and the model
hedges more, hallucinates more, forgets the decisions it made a month ago.
Every call pays for the same project preamble again. The API bill climbs.</p>
<p>If you're calling a frontier model at scale, the cost of context isn't
theoretical. It's line one of your infra spend.</p>
<p>So when a GitHub issue crossed my feed claiming that <strong>Classical Chinese</strong>
— 文言文, a literary language whose grammar stabilized around the time of
Confucius — could compress agent memory by 28% compared to structured
English shorthand, I did what any engineer does on seeing a claim like that.</p>
<p>I assumed it was nonsense and set out to prove it.</p>
<p>I was half right.</p>
<h2>The claim, and the skeptic's case</h2>
<p>The two projects in question:</p>
<ul>
<li><strong><a href="https://github.com/milla-jovovich/mempalace">MemPalace</a></strong> — an
agent-memory architecture that shards long conversations into a &quot;palace&quot;
of wings, rooms, closets, and drawers, each holding structured-English
compressed notes in a format called AAAK. It scores 96.6% on LongMemEval
without calling an LLM summarizer.</li>
<li><strong><a href="https://github.com/Chandler-Sun/MemChinesePalace">MemChinesePalace</a></strong>
— a fork-in-spirit by a different author, replacing AAAK with what they
call &quot;Wenjian&quot; (文简 — Classical Chinese shorthand). The
<a href="https://github.com/milla-jovovich/mempalace/issues/45">issue</a> proposing
this was closed by the upstream maintainer within hours: <em>&quot;Classical
Chinese wouldn't be natively readable by most LLMs.&quot;</em></li>
</ul>
<p>The case for skepticism looked strong:</p>
<p><strong>Tokenizers don't love Chinese.</strong> OpenAI's older <code>cl100k_base</code> tokenizer
(used by GPT-4 and GPT-3.5) splits most Chinese characters into 2–3 BPE
tokens. &quot;Character count&quot; and &quot;token count&quot; are not the same thing, and
Chinese often costs <em>more</em> tokens per character than English.</p>
<p><strong>Classical Chinese is famously ambiguous.</strong> Two thousand years of
commentators have argued over what any given passage of 文言文 means.
For a memory system where you need deterministic recall, that's the
opposite of what you want.</p>
<p><strong>AAAK already works.</strong> A format like <code>DECISION:auth.migrate:auth0-&gt;clerk</code>
is ugly but parses with a regex and leaves zero room for interpretation.
It uses common English tokens. It's hard to see what Classical Chinese adds.</p>
<p>So the headline claim — &quot;28% fewer tokens&quot; — smelled like someone counting
characters and calling them tokens.</p>
<h2>Test one: does the token claim hold?</h2>
<p>I wrote the smallest possible benchmark: five realistic memory samples
(a decision, a bug finding, a milestone event, a team preference, a
proposal), encoded in three formats each. Plain English, AAAK, and
Wenjian. Then I fed them through <code>tiktoken</code> against two real BPE tokenizers.</p>
<p>The result, totalled across all five samples:</p>
<table>
<thead>
<tr>
<th>Tokenizer</th>
<th style="text-align: right;">English</th>
<th style="text-align: right;">AAAK</th>
<th style="text-align: right;">Wenjian</th>
<th style="text-align: right;">Wenjian vs AAAK</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>cl100k_base</code> (GPT-4 / 3.5)</td>
<td style="text-align: right;">250</td>
<td style="text-align: right;">220</td>
<td style="text-align: right;">234</td>
<td style="text-align: right;"><strong>+6.4% (worse)</strong></td>
</tr>
<tr>
<td><code>o200k_base</code> (GPT-4o / 5)</td>
<td style="text-align: right;">253</td>
<td style="text-align: right;">220</td>
<td style="text-align: right;">191</td>
<td style="text-align: right;"><strong>−13.2%</strong></td>
</tr>
</tbody>
</table>
<p>My suspicion was right: the 28% figure was character-counted, not
token-counted. On the older tokenizer, Wenjian actually <em>loses</em> to AAAK.</p>
<p>But my suspicion was also wrong: on the modern <code>o200k_base</code> tokenizer —
the one used by every frontier OpenAI model today — Wenjian really is
about 13% smaller. Not 28%, but not zero either.</p>
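<p>The percentages are plain relative differences over the table totals. A short Python sketch for re-deriving them and for counting tokens with any encoder; the function names are mine, and the commented lines show how a <code>tiktoken</code> encoder could be plugged in:</p>
<pre><code class="language-python">def total_tokens(samples, encode):
    """Sum token counts using any encode(text) callable that returns a token list."""
    return sum(len(encode(sample)) for sample in samples)


def relative_change(candidate, baseline):
    """Percent change of candidate versus baseline, rounded to one decimal."""
    return round((candidate - baseline) / baseline * 100, 1)


# Totals taken from the table above.
print(relative_change(234, 220))  # 6.4   (cl100k_base, Wenjian vs AAAK)
print(relative_change(191, 220))  # -13.2 (o200k_base, Wenjian vs AAAK)
print(relative_change(191, 253))  # -24.5 (o200k_base, Wenjian vs English)

# With tiktoken installed, a real encoder can be passed in:
#   import tiktoken
#   encode = tiktoken.get_encoding("o200k_base").encode
#   total_tokens(my_samples, encode)</code></pre>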
<p>Half a win for the Wenjian side. The real question, I thought, was whether
the model could still <em>read</em> the compressed form accurately. That's where
Wenjian's polysemy problem was supposed to bite.</p>
<h2>Test two: can the model actually read it?</h2>
<p>For this I used a local setup — <code>ollama</code> serving <code>qwen3:32b</code>, <code>qwen3.5:27b</code>,
and (later) <code>llama3.1:8b</code>. Qwen is the strongest open model for Chinese,
which makes it the fairest test of the &quot;LLMs natively read 文言文&quot; premise.
If Wenjian can't perform there, it can't perform anywhere.</p>
<p>The protocol: for each of the five memory samples, I generated three
factual questions. The model got only the compressed memory record and one
question, and had to answer. Scoring was a deterministic keyword match
(no LLM-as-judge — reproducible across runs).</p>
<p>One hundred and twenty calls later, I had my answer:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th style="text-align: right;">English</th>
<th style="text-align: right;">AAAK</th>
<th style="text-align: right;">Wenjian</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>qwen3.5:27b</code></td>
<td style="text-align: right;">15/15 (100%)</td>
<td style="text-align: right;">14/15 (93%)</td>
<td style="text-align: right;"><strong>15/15 (100%)</strong></td>
</tr>
<tr>
<td><code>qwen3:32b</code></td>
<td style="text-align: right;">15/15 (100%)</td>
<td style="text-align: right;">15/15 (100%)</td>
<td style="text-align: right;"><strong>15/15 (100%)</strong></td>
</tr>
</tbody>
</table>
<p>Wenjian matches English. On both models.</p>
<p>The polysemy concern that I and the upstream maintainer had both raised —
that Classical Chinese would be too ambiguous for reliable fact recall —
simply didn't materialize. When asked <em>&quot;what was the target deadline?&quot;</em> of
the line 议 26/Q1末 迁身份：Auth0→Clerk, the model answered <em>&quot;end of Q1
2026&quot;</em> without hesitation. When asked who discovered a bug encoded as
普雅设审中得 (<em>&quot;Priya, in the security audit, discovered&quot;</em>), it answered
<em>&quot;Priya&quot;</em> or <em>&quot;普雅&quot;</em> — both scored correct.</p>
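<p>A deterministic keyword match of the kind used for scoring can be very small. This Python sketch is my own reconstruction of the idea, not the benchmark's actual scorer; the sample answers are illustrative:</p>
<pre><code class="language-python">def keyword_match(answer, accepted):
    """Score 1 if any accepted keyword appears in the answer (case-insensitive), else 0."""
    normalized = answer.casefold()
    return int(any(keyword.casefold() in normalized for keyword in accepted))


def score(answers, answer_key):
    """Return (correct, total) for parallel lists of answers and accepted keywords."""
    results = [keyword_match(a, k) for a, k in zip(answers, answer_key)]
    return sum(results), len(results)


print(keyword_match("It was discovered by 普雅.", ["Priya", "普雅"]))          # 1
print(score(["End of Q1 2026", "Marcus"], [["Q1 2026"], ["Priya", "普雅"]]))  # (1, 2)</code></pre>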
<p>At this point I had to update. The Wenjian claim isn't bullshit. On
Chinese-strong models, it's a Pareto improvement over plain English: 24%
smaller, same retrieval. The upstream maintainer was wrong to close the
issue that fast.</p>
<h2>A hybrid that nearly beat them both</h2>
<p>While I was at it, I built a third format: a <strong>hybrid</strong> that keeps AAAK's
deterministic <code>KEY:value|key:value</code> skeleton but inlines five Chinese
idiom macros — 亡羊 (tech-debt / known-defect), 破竹 (major breakthrough),
金蝉 (migration / refactor), 定鼎 (final architecture decision), 一石
(single-action-multiple-wins).</p>
<p>These idioms are the genuinely novel contribution of Classical Chinese to
this problem. Each one is 2–3 tokens but encodes a multi-token English
concept. And because frontier models are trained on enough Chinese
literature to know what they mean, there's no learning cost per session —
just a one-line legend in the system prompt.</p>
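<p>As a sketch of what that legend might look like in code: the meanings are the five listed above, while the legend wording and the helper names are illustrative choices of mine:</p>
<pre><code class="language-python">IDIOM_MACROS = {
    "亡羊": "tech-debt / known-defect",
    "破竹": "major breakthrough",
    "金蝉": "migration / refactor",
    "定鼎": "final architecture decision",
    "一石": "single-action-multiple-wins",
}


def legend_line(macros):
    """Render the macros as a one-line legend for a system prompt."""
    return "Legend: " + "; ".join(f"{idiom}={meaning}" for idiom, meaning in macros.items())


def expand_macros(record, macros):
    """Replace each idiom in a record with its English gloss, for human review."""
    for idiom, meaning in macros.items():
        record = record.replace(idiom, meaning)
    return record


print(legend_line(IDIOM_MACROS))
print(expand_macros("金蝉:auth", IDIOM_MACROS))  # migration / refactor:auth</code></pre>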
<p>The hybrid scored best on tokens: 28% smaller than English, 17% smaller
than AAAK. But when I ran the retrieval test, it stumbled — 87% combined
across the two Qwen models. The failures were specific: the shorthand
<code>@Q1.26</code> was read as decoration rather than a deadline, and parenthesized
reason-codes like <code>(cwrites+json)</code> were too cryptic to expand when asked
<em>&quot;why is this preferred?&quot;</em>.</p>
<p>So I wrote a v2 that used <code>t:Q1.26</code> and <code>why:cwrites,json</code>. It cost nine
extra tokens. Retrieval jumped from 87% to 97% on Qwen.</p>
<p>Hybrid v2 now matched Wenjian on compression and came within one question
of it on recall (29/30 vs 30/30) on Qwen. The interesting question was what
would happen on a model that wasn't trained on a mountain of Chinese text.</p>
<h2>Test three: does it survive a Western model?</h2>
<p>I pulled <code>llama3.1:8b</code>, a small, general-purpose Meta model with much
thinner CJK coverage than Qwen. This was the test the upstream maintainer
had implicitly predicted Wenjian would fail when he closed the issue.</p>
<table>
<thead>
<tr>
<th>Format</th>
<th style="text-align: right;">Llama3.1:8b</th>
</tr>
</thead>
<tbody>
<tr>
<td>English</td>
<td style="text-align: right;">15/15 (100%)</td>
</tr>
<tr>
<td>AAAK</td>
<td style="text-align: right;">13/15 (87%)</td>
</tr>
<tr>
<td>Wenjian</td>
<td style="text-align: right;">13/15 (87%)</td>
</tr>
<tr>
<td><strong>Hybrid v2</strong></td>
<td style="text-align: right;"><strong>14/15 (93%)</strong></td>
</tr>
</tbody>
</table>
<p>Three findings worth pulling out:</p>
<p><strong>Wenjian didn't collapse.</strong> It dropped from 100% on Qwen to 87% on Llama,
landing exactly where AAAK already was. The upstream maintainer's concern
was directionally right but overstated — even an 8B Western-trained model
extracts most of Wenjian's content correctly.</p>
<p><strong>Hybrid v2 was the top compressed format.</strong> At 93%, it beat both Wenjian
and AAAK on Llama. The design bet — &quot;keep Latin keys for everything except
the five macros&quot; — paid off. The macros are common enough in LLM training
data to survive anywhere; the Latin keys keep the rest tokenizer-stable.</p>
<p><strong>Direction arrows broke Llama across every format.</strong> <code>&gt;</code> and <code>-&gt;</code> got
inverted multiple times. <code>pg&gt;mysql</code> was read as <em>&quot;mysql is preferred&quot;</em>,
and <code>jenkins-&gt;gh_actions</code> as <em>&quot;Jenkins is recommended&quot;</em>. That's a
format-neutral finding worth fixing in any compression scheme: textual
<code>from:X|to:Y</code> is worth the extra tokens.</p>
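<p>A record written with explicit direction keys is also trivial to parse. A minimal Python sketch, assuming a pipe-delimited <code>key:value</code> layout like the one described above; the sample record and function names are hypothetical:</p>
<pre><code class="language-python">def parse_record(record):
    """Parse a pipe-delimited key:value record, splitting on the first colon only."""
    fields = {}
    for part in record.split("|"):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def describe_migration(fields):
    """Spell out direction from explicit from/to keys, so no arrow needs interpreting."""
    return "migrate from " + fields["from"] + " to " + fields["to"]


# Hypothetical record in the explicit style.
record = "DECISION:ci.migrate|from:jenkins|to:gh_actions|t:Q1.26"
fields = parse_record(record)
print(fields["t"])                 # Q1.26
print(describe_migration(fields))  # migrate from jenkins to gh_actions</code></pre>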
<p>Combined cross-model ranking, 45 questions each:</p>
<table>
<thead>
<tr>
<th>Format</th>
<th style="text-align: right;">Tokens vs English</th>
<th style="text-align: right;">Retrieval</th>
<th>Behaviour</th>
</tr>
</thead>
<tbody>
<tr>
<td>English</td>
<td style="text-align: right;">0%</td>
<td style="text-align: right;">100%</td>
<td>reference</td>
</tr>
<tr>
<td>Wenjian</td>
<td style="text-align: right;">−24%</td>
<td style="text-align: right;">96%</td>
<td>peaks on Chinese-strong, drops to AAAK-parity elsewhere</td>
</tr>
<tr>
<td><strong>Hybrid v2</strong></td>
<td style="text-align: right;">−24%</td>
<td style="text-align: right;">96%</td>
<td>more uniform across model families</td>
</tr>
<tr>
<td>AAAK</td>
<td style="text-align: right;">−13%</td>
<td style="text-align: right;">93%</td>
<td>solid but less compressed</td>
</tr>
<tr>
<td>Hybrid v1</td>
<td style="text-align: right;">−28%</td>
<td style="text-align: right;">87%</td>
<td>too aggressive, dominated by v2</td>
</tr>
</tbody>
</table>
<h2>The methodology surprise</h2>
<p>The most useful single finding from this whole exercise wasn't about
Classical Chinese at all. It was about how to evaluate a format in the
first place.</p>
<p><strong>The weakest model was the most informative.</strong></p>
<ul>
<li><code>qwen3:32b</code> scored 4 of 5 formats at 100%. Ceiling effect. Almost no
signal about which format is actually more robust.</li>
<li><code>qwen3.5:27b</code> — somewhat fewer parameters, newer training — separated
hybrid v1 from the pack but still saturated Wenjian and English.</li>
<li><code>llama3.1:8b</code> was the <em>only</em> model that produced different failure
modes per format, surfaced the direction-arrow bug, and cleanly
separated hybrid v2 from the others.</li>
</ul>
<p>If I had only run this on Claude Opus or GPT-5, I'd have concluded that
all four compressed formats were equivalent. I'd have shipped the wrong
one. The frontier models succeed <em>despite</em> the format, not because of it.
Their format-robustness is invisible from their top-line score.</p>
<p>There's a sub-finding inside this that's worth calling out separately.
Within the Qwen family, the <em>newer</em> model (3.5) scored <strong>worse</strong> than the
<em>older</em> model (3.0) on both hybrid formats: 93% vs 100% on Hybrid v2,
80% vs 93% on v1. Both Q4_K_M, so quantization is constant. Two plausible
reads: (a) losing five billion parameters cost more literal-parsing ability
than a generation of training improvements gave back, or (b) newer RLHF tunes models
<em>away</em> from shorthand literalism: the 3.5 misses were mostly
<em>&quot;the record does not specify…&quot;</em>, the model hedging instead of committing
to what's there.</p>
<p>Either way: <strong>do not assume the newest model in a family is the best
format-reader</strong>. Test your actual deployment target.</p>
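<p>The ceiling effect can be made concrete with a small check. The sketch below counts how many formats a model saturates and how many distinct score levels it produces. The <code>llama3.1:8b</code> row comes from the table above; the <code>qwen3:32b</code> row is reconstructed from the percentages quoted in the text (four formats at 100%, Hybrid v1 at 93%), and the function name is hypothetical.</p>

```python
QUESTIONS = 15

# Correct answers out of 15 per format.
SCORES = {
    "qwen3:32b": {
        "English": 15, "AAAK": 15, "Wenjian": 15, "Hybrid v2": 15, "Hybrid v1": 14,
    },
    "llama3.1:8b": {
        "English": 15, "AAAK": 13, "Wenjian": 13, "Hybrid v2": 14,
    },
}


def signal(per_format, questions=QUESTIONS):
    """Summarise how much a model's scores can tell formats apart."""
    saturated = sum(1 for score in per_format.values() if score == questions)
    distinct_levels = len(set(per_format.values()))
    return {"saturated": saturated, "distinct_levels": distinct_levels}


for model, per_format in SCORES.items():
    print(model, signal(per_format))
```

<p>A model with many saturated formats and few distinct levels is a poor judge of format robustness, however strong it is otherwise.</p>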
<h2>What this means for you, practically</h2>
<p>If you're building anything that persists memory across LLM calls —
an agent, a copilot, a long-lived assistant, a RAG pipeline that stuffs
retrieved docs into a context window — these numbers have a direct read.</p>
<p>For a fixed context window, compressing memory into a denser dialect
stores roughly a third more facts in the same tokens (1 / 0.76 ≈ 1.32).
Equivalently, if you're paying per token over an API, that's a ~24%
input-cost reduction at ~96% retrieval fidelity on mid-sized open models.
Four months of project context now fit where three did before.</p>
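<p>A back-of-the-envelope version of that arithmetic, as a sketch. The 24% token reduction is the figure from the table above; the monthly volume and the price per million tokens are hypothetical placeholders to swap for your own numbers.</p>

```python
def compression_gains(token_reduction, input_tokens, usd_per_million_tokens):
    """Translate a token reduction into cost and capacity figures."""
    kept = 1.0 - token_reduction  # fraction of tokens still sent
    baseline = input_tokens / 1_000_000 * usd_per_million_tokens
    return {
        "baseline_usd": round(baseline, 2),
        "compressed_usd": round(baseline * kept, 2),
        # Facts per context window scale with the reciprocal of what is kept.
        "capacity_multiplier": round(1.0 / kept, 2),
    }


# Hypothetical workload: 50M input tokens a month at $3 per million tokens.
print(compression_gains(0.24, 50_000_000, 3.0))
```

<p>The cost saving is linear in the token reduction, while the capacity gain is its reciprocal, so a 24% cut corresponds to about 1.32 times as many facts per window.</p>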
<p>The practical picks:</p>
<ul>
<li>If your serving model is Chinese-strong (Qwen, DeepSeek, Yi,
any Chinese-tuned Claude or GPT deployment), <strong>use pure Wenjian</strong>.
It peaks there.</li>
<li>If you serve a mix — or you don't know what model the user picks —
<strong>use the Hybrid v2 format</strong>. More uniform across model families,
same compression, one miss per 15 on weak Western models.</li>
<li>Either way, <strong>replace direction arrows with textual labels</strong>. That's
a universal improvement; it costs a few tokens and prevents a whole
class of Llama-style inversions.</li>
</ul>
<p>And the deeper lesson, applicable far beyond this one experiment: if
you're comparing prompt formats, tool-call schemas, structured-output
styles, or domain DSLs — <strong>evaluate them on small or mid-sized open
models</strong>. Not on the flagship. The flagship's ceiling effect will hide
the failures that show up in production on cheaper inference.</p>
<h2>So — learn Classical Chinese?</h2>
<p>Literally? No. You don't need to read 文言文 yourself. The point is that a
language whose grammar stabilized two millennia ago, which removed every
grammatical redundancy human writers could find to remove, and which
modern LLMs were trained on because it's part of humanity's literary
record anyway — that language is already sitting in your model, unused,
ready to compress your memory by a quarter.</p>
<p>You don't have to learn Classical Chinese.</p>
<p>Your agents should be writing in it.</p>
<hr />
<p><em>Full benchmark code, results, and raw data:
<a href="https://github.com/Chandler-Sun/MemChinesePalace">github.com/.../MemChinese</a>
(upstream) — see the
<a href="./README.md">README</a> for the unvarnished numbers and next steps.</em></p>]]></content:encoded>
      <category>research</category>
      <category>memory</category>
      <category>compression</category>
      <category>benchmarks</category>
      <category>tokenization</category>
      <category>agents</category>
    </item>
    <item>
      <title>Meta Unveils Muse Spark: First Model From Superintelligence Labs</title>
      <link>https://ai.rs/ai-for-business/meta-muse-spark-msl-multimodal-reasoning</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/meta-muse-spark-msl-multimodal-reasoning</guid>
      <pubDate>Wed, 08 Apr 2026 12:00:00 +0200</pubDate>
      <author>ai.rs</author>
      <description>Meta Superintelligence Labs&#039; debut model brings multimodal reasoning, visual chain-of-thought, and a parallel multi-agent Contemplating mode that scores 58% on Humanity&#039;s Last Exam.</description>
      <content:encoded><![CDATA[<p>Meta on April 8 introduced <strong>Muse Spark</strong>, the first model out of its newly reorganized <strong>Meta Superintelligence Labs</strong> (MSL) — and the company is calling it &quot;the first step on our scaling ladder and the first product of a ground-up overhaul of our AI efforts.&quot;</p>
<p>Spark is a multimodal reasoning model with tool-use, visual chain-of-thought, and a parallel multi-agent setup Meta is branding <strong>Contemplating mode</strong>. It is live today on <strong>meta.ai</strong> and inside the <strong>Meta AI app</strong>, with a private API preview rolling out to select developers.</p>
<hr />
<h2>What's actually new</h2>
<p>Three things stand out from the announcement:</p>
<ol>
<li><strong>Multimodal-first reasoning.</strong> Spark is positioned as Meta's first model where perception, reasoning, and tool-use share the same loop — visual STEM Q&amp;A, entity recognition, and even health-domain analysis (nutrition, exercise physiology) are part of the headline capabilities, not bolt-ons.</li>
<li><strong>Visual chain-of-thought.</strong> Rather than only emitting text tokens during reasoning, Spark can ground intermediate steps in the image itself — closer to how humans point at things while thinking out loud.</li>
<li><strong>Contemplating mode.</strong> A parallel multi-agent orchestration layer where multiple reasoning instances work the same problem and converge on an answer. It is the mode Meta cites for its highest benchmark scores, and it is rolling out gradually rather than being on by default.</li>
</ol>
<h2>Benchmarks Meta is leading with</h2>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Score (Contemplating)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Humanity's Last Exam</td>
<td><strong>58%</strong></td>
</tr>
<tr>
<td>FrontierScience Research</td>
<td><strong>38%</strong></td>
</tr>
</tbody>
</table>
<p>These are headline numbers from Meta's own post — independent reproductions will follow. For context, Humanity's Last Exam is one of the harder generalist evals in circulation, and 58% places Spark in the same conversation as the current frontier rather than a tier below.</p>
<h2>The efficiency claim</h2>
<p>The number that may matter more long-term is buried further down: Spark is described as <strong>more than an order of magnitude more compute-efficient than Llama 4 Maverick</strong>, its predecessor, with <strong>log-linear scaling improvements from reinforcement learning</strong>. If that holds, MSL has not just shipped a new model — it has shifted the cost curve for the next generation of Meta models.</p>
<p>It also re-frames what Hyperion, Meta's in-progress data center buildout, is for. Meta explicitly ties Spark to that infrastructure as the runway toward what it now openly calls <strong>&quot;personal superintelligence.&quot;</strong></p>
<h2>Availability</h2>
<ul>
<li><strong>Live:</strong> meta.ai web and the Meta AI app, default mode</li>
<li><strong>Private preview:</strong> API access for select users</li>
<li><strong>Contemplating mode:</strong> rolling out gradually — not enabled for everyone on day one</li>
</ul>
<p>There is no open-weights release announced. That is a notable shift from the Llama posture — Meta is keeping Spark behind its own surfaces, at least for now.</p>
<h2>Why it matters</h2>
<p>Two angles are worth watching:</p>
<ul>
<li><strong>For developers</strong>, the API preview is the thing to track. If Spark is meaningfully cheaper-per-token than frontier rivals while clearing hard reasoning evals, it changes the build-vs-buy math for agentic products.</li>
<li><strong>For the lab race</strong>, this is MSL's introduction. The branding (&quot;Superintelligence Labs&quot;, &quot;scaling ladder&quot;, &quot;personal superintelligence&quot;) makes it explicit that Meta is no longer pitching itself as the open-source alternative — it is competing for the frontier, on the frontier's terms.</li>
</ul>
<p>The full announcement is on <a href="https://ai.meta.com/blog/introducing-muse-spark-msl/">Meta's AI blog</a>.</p>]]></content:encoded>
      <category>news</category>
      <category>meta</category>
      <category>multimodal</category>
      <category>reasoning</category>
      <category>agents</category>
      <category>msl</category>
    </item>
    <item>
      <title>Claude Mythos Preview: Why Anthropic Locked Its Best Security Model Behind a Wall</title>
      <link>https://ai.rs/ai-for-business/claude-mythos-glasswing-why-gated</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/claude-mythos-glasswing-why-gated</guid>
      <pubDate>Wed, 08 Apr 2026 11:06:37 +0200</pubDate>
      <author>ai.rs</author>
      <description>Anthropic just unveiled Claude Mythos Preview — a frontier model that found a 27-year-old OpenBSD bug and a 16-year-old FFmpeg flaw that fuzzers had hit 5 million times. Here&#039;s what it does, who gets access through Project Glasswing, and why the $25/$125 per million token pricing tells you everything about Anthropic&#039;s strategy.</description>
      <content:encoded><![CDATA[<p>On April 7, Anthropic announced <strong>Claude Mythos Preview</strong> alongside <strong>Project Glasswing</strong> — a frontier AI model purpose-built to find and exploit software vulnerabilities, paired with a partner program that decides who gets to use it.</p>
<p>Mythos is not on the API price list. It is not on a waitlist page. It is not coming to Claude.ai next week. If you are reading this and you do not work for AWS, Apple, Cisco, Google, Microsoft, or one of about 50 other vetted organizations, you cannot have it. That is not an oversight. That is the entire point of how Anthropic shipped this model.</p>
<p>Here is what Mythos actually does, who is in Glasswing, and why the access wall exists.</p>
<hr />
<h2>What Mythos Found</h2>
<p>Anthropic led the announcement with two findings that are difficult to dismiss as benchmark theater.</p>
<p><strong>A 27-year-old vulnerability in OpenBSD</strong> that allowed remote crashes. OpenBSD is the operating system whose entire brand identity is built on aggressive code review and proactive auditing. A bug that survived 27 years inside the OpenBSD codebase is strong evidence of a flaw that human review alone was not going to catch.</p>
<p><strong>A 16-year-old flaw in FFmpeg</strong>, sitting in a code path that automated coverage-guided fuzzers had executed more than <strong>5 million times</strong> without triggering it. This is the more technically interesting finding. Modern fuzzing is supposed to be the gold standard for catching memory corruption in C codebases. 5 million hits with no crash means the bug is reachable but only under specific semantic conditions, exactly the kind of &quot;needs to actually understand the code&quot; gap that LLMs are theoretically good at closing.</p>
<p>Anthropic also reported multiple Linux kernel privilege-escalation vulnerabilities and claims &quot;thousands of high-severity vulnerabilities&quot; in total across operating systems, browsers, and foundational libraries.</p>
<h2>The One Number That Matters</h2>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Mythos Preview</th>
<th>Opus 4.6</th>
</tr>
</thead>
<tbody>
<tr>
<td>CyberGym (vulnerability reproduction)</td>
<td><strong>83.1%</strong></td>
<td>66.6%</td>
</tr>
</tbody>
</table>
<p>CyberGym measures whether a model can take a vulnerability description and <strong>actually reproduce a working exploit</strong> against the real target codebase. It is not multiple choice. It is not pattern matching against CVE databases. It is &quot;build the thing that triggers the bug.&quot;</p>
<p>Going from 67% to 83% on a benchmark like that is not an incremental improvement. It is the difference between a useful research assistant and an autonomous agent you can leave running against a codebase overnight and trust to come back with reproductions instead of false positives.</p>
<p>Anthropic explicitly says Mythos &quot;performs autonomously without human steering in many cases.&quot; That phrasing matters. Most AI security tooling today still requires a researcher in the loop to triage and verify. Mythos, in the cases where it works, does not.</p>
<h2>Who Is In Glasswing</h2>
<p>Project Glasswing launched with <strong>12 founding partners</strong>:</p>
<ul>
<li><strong>Cloud and infrastructure:</strong> AWS, Google, Microsoft</li>
<li><strong>Hardware and operating systems:</strong> Apple, Cisco</li>
<li><strong>Plus seven others</strong> spanning major technology vendors and security organizations</li>
</ul>
<p>Beyond the founding 12, Anthropic added <strong>40+ more organizations</strong> focused on critical infrastructure protection and open-source maintenance. The selection criteria, as described in the announcement:</p>
<ol>
<li>You maintain code that other people depend on at scale (operating systems, browsers, kernels, foundational libraries)</li>
<li>You operate critical infrastructure (cloud platforms, networking, finance)</li>
<li>You are an open-source security organization with a track record</li>
</ol>
<p>Notably absent from the public list: penetration testing firms, bug bounty platforms, and anyone whose business model is selling vulnerability research to third parties. That is a deliberate choice, and we will get to why.</p>
<h2>Why It Is Gated</h2>
<p>There are three reasons Mythos is not generally available, and they reinforce each other.</p>
<h3>1. The dual-use problem is unavoidable</h3>
<p>A model that can autonomously find a 27-year-old bug in OpenBSD is also a model that can autonomously find unknown bugs in your production stack. The capability does not care about the operator's intent.</p>
<p>Anthropic could have published Mythos behind a standard &quot;acceptable use policy&quot; click-through, the way every other AI lab handles dual-use risk. They chose not to. The math is brutal: if even a small fraction of paying API customers used Mythos to find zero-days for sale, the result would be a measurable spike in real-world exploitation against the same critical infrastructure Anthropic is trying to protect.</p>
<p>Gating by partnership is an admission that policy alone is insufficient when the capability gap is this large.</p>
<h3>2. Pricing as soft access control</h3>
<p>When Mythos eventually does reach general availability, it will cost <strong>$25 per million input tokens and $125 per million output tokens</strong>. For comparison, Claude Opus 4.6 sits at roughly $15 input and $75 output per million tokens, so Mythos costs approximately <strong>1.7x as much</strong> as the most capable general-purpose Claude model on both input and output.</p>
<p>That premium is doing two things at once.</p>
<p>First, it reflects real cost: Mythos is almost certainly larger than Opus, almost certainly does more internal reasoning per token, and almost certainly was more expensive to train. You do not get autonomous CyberGym performance for free.</p>
<p>Second, and more importantly, <strong>the price is a soft access control mechanism</strong>. At $125 per million output tokens, you do not casually point Mythos at every public GitHub repository to see what it finds. The economics make opportunistic mass-scanning prohibitively expensive while keeping targeted defensive use affordable for organizations that have a specific codebase to harden.</p>
<p>This is the same logic that keeps satellite imagery affordable for journalists but expensive for stalkers. Pricing is not just revenue. It is a filter.</p>
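<p>To see how the price acts as a filter, here is a rough cost sketch. The per-token prices are the ones quoted above; the token counts per repository and the repository counts are hypothetical, chosen only to contrast a targeted audit with opportunistic mass-scanning.</p>

```python
INPUT_USD_PER_MILLION = 25.0    # quoted Mythos price, input tokens
OUTPUT_USD_PER_MILLION = 125.0  # quoted Mythos price, output tokens


def scan_cost_usd(repos, input_tokens_per_repo, output_tokens_per_repo):
    """Estimated spend for scanning a number of repositories once."""
    per_repo = (
        input_tokens_per_repo / 1_000_000 * INPUT_USD_PER_MILLION
        + output_tokens_per_repo / 1_000_000 * OUTPUT_USD_PER_MILLION
    )
    return repos * per_repo


# Hypothetical token counts: 2M input and 0.5M output tokens per repository.
targeted = scan_cost_usd(1, 2_000_000, 500_000)
mass_scan = scan_cost_usd(100_000, 2_000_000, 500_000)
print(f"one repo: ${targeted:,.2f}  100k repos: ${mass_scan:,.2f}")
```

<p>Under these assumptions a single audit costs on the order of a hundred dollars, while sweeping a hundred thousand repositories runs into eight figures.</p>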
<h3>3. The subsidy structure tilts the balance toward defenders</h3>
<p>Anthropic committed <strong>$100 million in usage credits</strong> to Glasswing partners and donated <strong>$4 million</strong> to open-source security organizations. Read those numbers in context: defenders are getting subsidized to use Mythos at zero or near-zero marginal cost, while everyone else faces full price plus access restrictions.</p>
<p>That is a deliberate asymmetry. Anthropic is paying to put Mythos in the hands of the people who maintain the code, before it is available to anyone who might want to exploit it. The window between &quot;defenders can use this&quot; and &quot;attackers can buy this&quot; is the entire game, and Anthropic is spending $100 million to widen it.</p>
<p>Whether that strategy actually works depends on how long the window stays open. If a competing lab ships an equivalent capability without the access controls, the asymmetry collapses overnight. If Anthropic stays meaningfully ahead on this specific capability for six months, defenders get a meaningful head start on hardening the most-used software on the planet.</p>
<h2>When You Will Get Access</h2>
<p>The official answer is &quot;after we develop appropriate safeguards with an upcoming Claude Opus model.&quot; The unofficial reading: months, not weeks, and tied to a future release rather than a fixed date.</p>
<p>Realistically, Mythos in its current form is unlikely to be sold directly to the open API market. What seems more probable is that the techniques pioneered for Mythos — the training data, the autonomous-loop scaffolding, the safety filters — will be folded into a future general-purpose Opus release in a more constrained form. You will get some of the capability, with guardrails that prevent the most concerning use cases.</p>
<p>If you want the unconstrained version, your path is Glasswing membership. The application process is not public, but the criteria are: maintain critical software, demonstrate operational security, commit to responsible disclosure.</p>
<h2>What To Actually Do</h2>
<p><strong>If you maintain critical infrastructure or foundational open-source software:</strong> investigate Glasswing. The 40+ non-founding partners suggest the program is actively expanding, and the subsidized usage credits are the cheapest security audit you will ever get.</p>
<p><strong>If you build products on the Claude API:</strong> nothing changes today. Opus 4.6 and Sonnet 4.6 remain your daily drivers. But the existence of Mythos is a clear signal that the gap between &quot;the best model Anthropic has trained&quot; and &quot;the best model Anthropic will sell you&quot; is widening — and for the first time, Anthropic is being transparent about that gap rather than pretending it does not exist.</p>
<p><strong>If you run a security team at a normal company:</strong> wait. The Mythos-derived safeguards in the next Opus release will likely cover the use cases you care about (code review, vulnerability triage, secure-coding assistance) without the access friction. Spending engineering time on Glasswing applications when you do not maintain a kernel is probably not the best use of the quarter.</p>
<h2>The Bigger Signal</h2>
<p>Set aside the specific capability for a moment. The more important thing about Mythos is that <strong>Anthropic chose to ship a frontier model with deliberate access controls, full stop</strong>. Every previous Claude release has been framed as &quot;as broadly available as we can make it.&quot; Mythos is the first time Anthropic has publicly drawn a line and said: this one is too dangerous to sell to everyone, and we are going to gate it on who you are rather than what you promise.</p>
<p>That precedent matters more than the OpenBSD bug. If Mythos works the way Anthropic claims, expect more specialized frontier models with similar access structures — for biotech, for finance, for any domain where the dual-use math gets uncomfortable. The era of &quot;one model, one API, one price list&quot; is not over, but it is no longer the only shape an AI lab can take.</p>
<p>For now, Mythos exists, it is genuinely impressive, and you cannot have it. That is the story.</p>]]></content:encoded>
      <category>news</category>
      <category>claude</category>
      <category>mythos</category>
      <category>glasswing</category>
      <category>anthropic</category>
      <category>security</category>
      <category>vulnerabilities</category>
    </item>
    <item>
      <title>Gemma 4 LoRA Fine-Tuning on RTX 5090: What Works and What Doesn&#039;t</title>
      <link>https://ai.rs/ai-developer/gemma-4-lora-fine-tuning-rtx-5090</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/gemma-4-lora-fine-tuning-rtx-5090</guid>
      <pubDate>Sun, 05 Apr 2026 10:00:00 +0200</pubDate>
      <author>ai.rs</author>
      <description>Google&#039;s Gemma 4 26B-A4B MoE promises 128 experts and 256K context in an Apache 2.0 package. We tried to QLoRA fine-tune it on an RTX 5090. The MoE variant is blocked by a 3D tensor issue — but the dense models work fine, and a fix is coming.</description>
      <content:encoded><![CDATA[<p>Google released Gemma 4 on April 1, 2026 — a family of models including the 26B-A4B Mixture of Experts variant that activates only 3.8B of its 25.2B parameters per token. Apache 2.0 licensed, 256K context, 140+ languages, native vision support. On paper, it's a direct competitor to Qwen 3.5's MoE lineup.</p>
<p>We spent two days trying to QLoRA fine-tune the MoE variant on an RTX 5090 (32 GB VRAM). It doesn't work — yet. Not because of a bug, but because of an architectural decision that the tooling ecosystem hasn't caught up with. Important caveat: <strong>the dense Gemma 4 models (E2B, E4B, 31B) fine-tune just fine</strong> with standard QLoRA. This article is specifically about the MoE 26B-A4B variant.</p>
<hr />
<h2>Gemma 4 vs Qwen 3.5: The Specs</h2>
<p>Both models use Mixture of Experts to deliver big-model knowledge at small-model speed. Here's how they compare:</p>
<table>
<thead>
<tr>
<th>Spec</th>
<th>Gemma 4 26B-A4B</th>
<th>Qwen 3.5 35B-A3B</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total Parameters</td>
<td>25.2B</td>
<td>35B</td>
</tr>
<tr>
<td>Active Parameters</td>
<td>3.8B</td>
<td>~3B</td>
</tr>
<tr>
<td>Experts</td>
<td>128 + 1 shared, 8 active</td>
<td>256 + 1 shared, 8 routed</td>
</tr>
<tr>
<td>Layers</td>
<td>30</td>
<td>40</td>
</tr>
<tr>
<td>Native Context</td>
<td>256K</td>
<td>262K (up to 1M with RoPE scaling)</td>
</tr>
<tr>
<td>Modalities</td>
<td>Text + Image</td>
<td>Text + Image + Video</td>
</tr>
<tr>
<td>Languages</td>
<td>140+</td>
<td>201</td>
</tr>
<tr>
<td>License</td>
<td>Apache 2.0</td>
<td>Apache 2.0</td>
</tr>
</tbody>
</table>
<h3>Benchmarks</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Gemma 4</th>
<th>Qwen 3.5</th>
<th>Winner</th>
</tr>
</thead>
<tbody>
<tr>
<td>MMLU-Pro</td>
<td>82.6</td>
<td>85.3</td>
<td>Qwen 3.5 (+2.7)</td>
</tr>
<tr>
<td>GPQA Diamond</td>
<td>82.3</td>
<td>84.2</td>
<td>Qwen 3.5 (+1.9)</td>
</tr>
<tr>
<td>LiveCodeBench v6</td>
<td>77.1</td>
<td>74.6</td>
<td>Gemma 4 (+2.5)</td>
</tr>
<tr>
<td>Codeforces ELO</td>
<td>1718</td>
<td>2028</td>
<td>Qwen 3.5 (+310)</td>
</tr>
<tr>
<td>MATH-Vision</td>
<td>82.4</td>
<td>83.9</td>
<td>Qwen 3.5 (+1.5)</td>
</tr>
<tr>
<td>MMMU Pro (Vision)</td>
<td>73.8</td>
<td>75.1</td>
<td>Qwen 3.5 (+1.3)</td>
</tr>
</tbody>
</table>
<p>Qwen 3.5 leads across most reasoning and knowledge benchmarks. Gemma 4 has a slight edge on LiveCodeBench v6, but loses decisively on competitive programming (Codeforces). For most practical use cases — customer support, content generation, product recommendations — Qwen 3.5 is the stronger model.</p>
<h2>The 3D Tensor Problem</h2>
<p>Here's where things fall apart for local fine-tuning.</p>
<p>QLoRA (Quantized Low-Rank Adaptation) works by loading the base model in 4-bit precision and training small adapter layers on top. This is the standard approach for fine-tuning large models on consumer GPUs. With Qwen 3.5 35B-A3B, it works perfectly — we validated this both on an RTX 5090 locally and on an NVIDIA B200 (178 GB VRAM), where Unsloth loads the model at ~17.5 GB in 4-bit with plenty of room for training.</p>
<p>Gemma 4 breaks this workflow because of how it stores expert weights.</p>
<p><strong>Qwen 3.5</strong> stores each expert as separate 2D linear layers — standard <code>nn.Linear</code> modules that bitsandbytes knows how to quantize:</p>
<pre><code># Qwen 3.5: separate 2D tensors per expert — bnb quantizes these fine
model.layers.{i}.mlp.experts.{j}.gate_proj: [1024, 2048]  ← nn.Linear ✓
model.layers.{i}.mlp.experts.{j}.up_proj:   [1024, 2048]  ← nn.Linear ✓
model.layers.{i}.mlp.experts.{j}.down_proj:  [2048, 512]  ← nn.Linear ✓</code></pre>
<p><strong>Gemma 4</strong> fuses all 128 experts into single 3D tensors:</p>
<pre><code># Gemma 4: fused 3D tensors — bnb CANNOT quantize these
model.layers.{i}.experts.gate_up_proj: [128, 1408, 2816]  ← 3D tensor ✗
model.layers.{i}.experts.down_proj:    [128, 2816, 1408]  ← 3D tensor ✗</code></pre>
<p>bitsandbytes only quantizes 2D <code>nn.Linear</code> layers. It ignores everything else. The result:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Size</th>
<th>Quantized?</th>
</tr>
</thead>
<tbody>
<tr>
<td>3D expert tensors (30 layers)</td>
<td>42.5 GB (bf16)</td>
<td>No</td>
</tr>
<tr>
<td>2D layers (attention, embeddings)</td>
<td>4.5 GB → 1.1 GB (4-bit)</td>
<td>Yes</td>
</tr>
<tr>
<td><strong>Total with &quot;4-bit&quot; loading</strong></td>
<td><strong>~43.7 GB</strong></td>
<td></td>
</tr>
</tbody>
</table>
<p>The &quot;4-bit&quot; model is actually 43.7 GB because 90% of the weights can't be quantized. That's 12 GB over our RTX 5090's budget — before we even account for training overhead.</p>
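<p>The table's arithmetic can be reproduced with a few lines. This is a sketch using the bf16 sizes listed above; it ignores quantization metadata and other overhead, so it lands slightly below the table's ~43.7 GB. The function name is hypothetical.</p>

```python
# (component, size in GB at bf16, whether bitsandbytes can quantize it),
# using the figures from the table above.
COMPONENTS = [
    ("3D expert tensors", 42.5, False),
    ("2D layers (attention, embeddings)", 4.5, True),
]
VRAM_GB = 32.0  # RTX 5090


def loaded_size_gb(components, bits=4):
    """Size after quantizing only the layers bitsandbytes can handle."""
    shrink = bits / 16  # bf16 stores 16 bits per weight
    return sum(size * shrink if ok else size for _, size, ok in components)


total = loaded_size_gb(COMPONENTS)
print(f"loaded: {total:.1f} GB, over budget by {total - VRAM_GB:.1f} GB")
```

<p>Flipping the first entry to <code>True</code> shows what full 4-bit coverage would buy: the same model would load in under 12 GB.</p>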
<h2>What We Tried</h2>
<p>Five different loading strategies, all dead ends:</p>
<ol>
<li>
<p><strong><code>Gemma4ForCausalLM</code> from multimodal checkpoint</strong> — Key name mismatch. The checkpoint stores text weights as <code>model.language_model.*</code> but the text-only class expects <code>model.*</code>. All weights loaded as &quot;unexpected&quot;, fresh initialization OOM'd.</p>
</li>
<li>
<p><strong><code>Gemma4ForConditionalGeneration</code> with single GPU</strong> — OOM at 37% loading. The full multimodal model is ~48 GB in bf16.</p>
</li>
<li>
<p><strong><code>Gemma4ForConditionalGeneration</code> with CPU offloading</strong> — bitsandbytes 4-bit mode rejects any CPU offloading. Non-starter.</p>
</li>
<li>
<p><strong>Extracted text-only weights</strong> — We wrote a script to extract and remap the 657 text-only keys. Loading works, but the 3D tensor problem remains: 43.7 GB estimated, still OOM.</p>
</li>
<li>
<p><strong>Various monkey-patches</strong> — <code>caching_allocator_warmup</code> bypass, <code>Params4bit</code> compatibility fix. These solved earlier errors but can't fix the fundamental 3D tensor issue.</p>
</li>
</ol>
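<p>For reference, the key remapping in attempt 4 can be sketched as below. This is an illustration of the prefix rewrite described above, not the script we ran: it operates on a plain dictionary, and the individual key names in the stand-in checkpoint are made up.</p>

```python
TEXT_PREFIX = "model.language_model."  # prefix used by the multimodal checkpoint
TARGET_PREFIX = "model."               # prefix the text-only class expects


def remap_text_keys(state_dict):
    """Keep only text weights and rename them for the text-only class."""
    remapped = {}
    for key, tensor in state_dict.items():
        if key.startswith(TEXT_PREFIX):
            new_key = TARGET_PREFIX + key[len(TEXT_PREFIX):]
            remapped[new_key] = tensor
    return remapped


# Stand-in checkpoint: the key names below are illustrative, not real ones.
checkpoint = {
    "model.language_model.layers.0.experts.down_proj": "tensor-a",
    "model.language_model.embed_tokens.weight": "tensor-b",
    "model.vision_tower.patch_embed.weight": "tensor-c",
}
print(sorted(remap_text_keys(checkpoint)))
```

<p>Renaming keys fixes the loading mismatch only. The tensors keep their shapes, which is why the 3D expert weights still block quantization afterwards.</p>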
<h2>The Ecosystem Gap</h2>
<p>Gemma 4 dropped on April 1, 2026 — it's brand new. The quantization ecosystem hasn't adapted yet:</p>
<table>
<thead>
<tr>
<th>Format</th>
<th>Available?</th>
<th>Fine-tuning?</th>
</tr>
</thead>
<tbody>
<tr>
<td>GGUF (Q4_K_M, ~17 GB)</td>
<td>Yes</td>
<td>Inference only</td>
</tr>
<tr>
<td>AWQ 4-bit</td>
<td>Yes</td>
<td>Inference only</td>
</tr>
<tr>
<td>GPTQ</td>
<td>Not yet</td>
<td>—</td>
</tr>
<tr>
<td>Unsloth bnb-4bit</td>
<td><strong>Skipped for MoE variant</strong></td>
<td>—</td>
</tr>
</tbody>
</table>
<p>Unsloth, which has custom MoE quantization that handles Qwen 3.5's expert layers, skipped the Gemma 4 26B-A4B in their bnb-4bit releases. They published quantized versions for the dense Gemma 4 models (E2B, E4B, 31B) but not the MoE one. That suggests this isn't just a &quot;we haven't gotten to it&quot; situation: the 3D tensor layout is genuinely harder to handle.</p>
<h2>What to Use Instead</h2>
<p>If you're choosing a model for local QLoRA fine-tuning on consumer hardware (24-32 GB VRAM), here's the practical decision:</p>
<h3>For Fine-Tuning (QLoRA)</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>4-bit Size</th>
<th>Fits 32 GB?</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen 3.5-35B-A3B (via Unsloth)</td>
<td>~17.5 GB</td>
<td>Yes</td>
<td>Working</td>
</tr>
<tr>
<td>Qwen 3.5-27B dense</td>
<td>~14 GB</td>
<td>Yes</td>
<td>Working</td>
</tr>
<tr>
<td>Qwen 3.5-9B dense</td>
<td>~5 GB</td>
<td>Yes, comfortably</td>
<td>Working</td>
</tr>
<tr>
<td>Gemma 4 31B dense</td>
<td>~18-20 GB</td>
<td>Tight but feasible</td>
<td>Working</td>
</tr>
<tr>
<td><strong>Gemma 4 26B-A4B (MoE)</strong></td>
<td><strong>~43.7 GB</strong></td>
<td><strong>No</strong></td>
<td><strong>Blocked</strong></td>
</tr>
</tbody>
</table>
<h3>For Inference Only</h3>
<p>Gemma 4 26B-A4B works fine for inference via GGUF (Ollama, llama.cpp) at Q4_K_M (~17 GB). If you just need to run the model — not train it — it's a solid option.</p>
<h2>What About Cloud GPUs?</h2>
<p>On an NVIDIA B200 (178 GB VRAM), the picture changes completely. Gemma 4's text-only model is ~47 GB in bf16 — you can skip quantization entirely and train with standard LoRA (not QLoRA). No 3D tensor problem, no bitsandbytes dependency. Load in bf16, attach LoRA adapters, train.</p>
<p>We already validated this workflow for Qwen 3.5 35B-A3B on a B200 via Unsloth, where it loads at ~17.5 GB in 4-bit and trains comfortably. Gemma 4 in bf16 at ~47 GB would also fit with ~130 GB to spare for optimizer states, gradients, and large batch sizes.</p>
<p>The trade-off is cost. A cloud B200 instance runs ~$3-5/hour. For a quick LoRA fine-tune (a few hundred steps), that's $5-15. For serious training runs, it adds up. The appeal of consumer GPU training is that it's free after the hardware purchase.</p>
<h2>Why MoE Models Are Harder to Fine-Tune</h2>
<p>The 3D tensor issue is just the most visible problem. MoE architectures create several fine-tuning headaches that dense models don't have:</p>
<p><strong>Expert routing instability.</strong> During fine-tuning, the router learns which experts to activate for which tokens. Small datasets can destabilize this routing — a few hundred patent-writing examples might cause the router to over-rely on 2-3 experts while the other 125 go dormant. Dense models don't have this problem because every parameter participates in every forward pass.</p>
<p><strong>Load balancing.</strong> MoE models are trained with auxiliary losses that encourage balanced expert utilization. Fine-tuning with LoRA typically freezes the router weights, which helps stability but means you can't adapt the routing to your domain. If your use case (say, patent writing) doesn't naturally distribute across many experts, you're leaving capacity on the table.</p>
<p><strong>Memory unpredictability.</strong> Even when quantization works, MoE memory usage is harder to predict. All expert weights must be resident in VRAM even though only 8 of 128 fire per token. Gradient checkpointing interacts differently with MoE layers. Batch size effects are less intuitive because the set of active experts changes from token to token, so a larger batch touches more of the expert weights even though each token still activates the same number of parameters.</p>
<p><strong>Tooling maturity.</strong> The PyTorch ecosystem — bitsandbytes, PEFT, DeepSpeed, FSDP — was built for dense transformers. MoE support is bolted on and varies wildly by implementation. Qwen's 2D expert layout works because it looks like standard linear layers. Gemma's 3D fused layout is more efficient but breaks assumptions baked into every tool in the chain.</p>
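<p>The shape assumption is easy to picture in isolation. The toy below uses plain nested lists instead of real tensors, and <code>quantize_2d</code> is a hypothetical stand-in that only mimics a 2D-only check; it is not bitsandbytes code. It shows why a fused <code>[experts, out, in]</code> tensor is rejected while the same weights, split per expert, go through:</p>
<pre><code class="language-python">def shape(t):
    '''Shape of a rectangular nested list.'''
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

def quantize_2d(weight):
    '''Hypothetical stand-in for a quantizer that only accepts 2D matrices.'''
    if len(shape(weight)) != 2:
        raise ValueError(f'expected a 2D weight, got shape {shape(weight)}')
    return [[round(x) for x in row] for row in weight]  # placeholder rounding

# Toy fused tensor: [num_experts, out_features, in_features] = [4, 2, 3]
fused = [[[0.4 * e + 0.1 * i for i in range(3)] for _ in range(2)] for e in range(4)]

try:
    quantize_2d(fused)  # fused 3D layout is rejected
except ValueError as err:
    print('fused layout:', err)

# Unfused layout: one 2D matrix per expert, each accepted on its own.
per_expert = [quantize_2d(fused[e]) for e in range(len(fused))]
print('unfused layout:', len(per_expert), 'matrices of shape', shape(per_expert[0]))</code></pre>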
<p>None of this means MoE models can't be fine-tuned. It means the gap between &quot;works in a paper&quot; and &quot;works on your GPU&quot; is wider than with dense models. For most practitioners doing domain-specific fine-tuning — patent writing, customer support, product descriptions — a dense model at the same active parameter count will be easier to train and more predictable to debug.</p>
<h2>The Bigger Picture</h2>
<p>This episode highlights a real tension in the MoE design space. Fusing experts into 3D tensors is faster for inference (single batched matrix multiply instead of 128 separate calls) and Google's engineering team made a reasonable optimization choice. But it breaks the most popular fine-tuning workflow on consumer hardware.</p>
<p>Qwen's approach — separate 2D expert layers — is less optimal for raw inference throughput but plays nicely with the entire PyTorch/bitsandbytes/PEFT ecosystem. For the open-source community that wants to fine-tune models locally, that compatibility matters more than a few percent of inference speed.</p>
<p>The fix will come. Either bitsandbytes will add 3D tensor quantization, or Unsloth will build a custom path (they did it for Qwen's fused tensors), or Google will publish a checkpoint variant with separate expert weights. Until then, <strong>Qwen 3.5 35B-A3B is the MoE model to fine-tune locally</strong> — it has better benchmarks, a working training pipeline, and fits comfortably on an RTX 5090.</p>
<p>To be clear: <strong>Gemma 4 is not broken for fine-tuning.</strong> The dense models — Gemma 4 E2B, E4B, and 31B — all work with standard QLoRA via bitsandbytes or Unsloth. The 31B dense model at 4-bit (~18-20 GB) fits on an RTX 5090 and trains normally. It's only the MoE 26B-A4B that's blocked, and only on consumer GPUs where quantization is required.</p>
<h2>What Will Fix This</h2>
<p>The MoE fine-tuning gap is temporary. Here's what's likely to happen, roughly in order of probability:</p>
<ol>
<li>
<p><strong>Unsloth adds a custom Gemma 4 MoE path</strong> — Most likely and soonest. Unsloth already handles Qwen 3.5's fused MoE tensors with custom quantization. They have the architecture expertise and the motivation (Gemma 4 is a high-demand model). Timeline: weeks, not months.</p>
</li>
<li>
<p><strong>bitsandbytes adds 3D tensor quantization</strong> — This would fix it for everyone, not just Unsloth users. The change is non-trivial (the NF4 quantization kernel assumes 2D weight matrices) but it's a known limitation. Timeline: 1-3 months.</p>
</li>
<li>
<p><strong>Google releases an unfused checkpoint</strong> — Google could publish a variant with separate 2D expert weights instead of fused 3D tensors. This is the easiest fix from a tooling perspective but requires Google to act. Timeline: uncertain, depends on community pressure.</p>
</li>
</ol>
<p>Our bet: Unsloth will have it working within weeks. If you need Gemma 4 MoE fine-tuning before then, use a B200 or similar cloud GPU where you can skip quantization entirely and train in bf16.</p>
<hr />
<p><em>Tested on: RTX 5090 (32 GB), transformers 5.5.0, bitsandbytes 0.49.2, PEFT, April 2026.</em></p>
<p><strong>Related:</strong></p>
<ul>
<li><a href="/ai-developer/gemma-4-vs-qwen-3-5-vs-llama-4-compared">Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader</a></li>
<li><a href="/ai-for-business/qwen-3-5-35b-knowledge-4b-speed-better-than-gpt-5">Qwen 3.5: 35B Knowledge at 4B Speed — Better Than GPT-5?</a></li>
</ul>]]></content:encoded>
      <category>research</category>
      <category>llm</category>
      <category>gemma</category>
      <category>moe</category>
      <category>fine-tuning</category>
      <category>quantization</category>
      <category>qwen</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader</title>
      <link>https://ai.rs/ai-developer/gemma-4-vs-qwen-3-5-vs-llama-4-compared</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/gemma-4-vs-qwen-3-5-vs-llama-4-compared</guid>
      <pubDate>Thu, 02 Apr 2026 12:00:00 +0200</pubDate>
      <author>ai.rs</author>
      <description>A month ago, Gemma 3 trailed Llama 4 and Qwen 3.5 in every category we tested. Gemma 4 just demolished those results — 89% on AIME math, 80% on LiveCodeBench, a MoE variant that matches 31B quality with 4B active params, and Apache 2.0 licensing.</description>
      <content:encoded><![CDATA[<h2>One Month Later, Everything Changed</h2>
<p>In early March, we published a <a href="/ai-developer/llama-4-vs-qwen-3-5-vs-gemma-3-compared">head-to-head comparison of Llama 4, Qwen 3.5, and Gemma 3</a>. The conclusion was clear: Gemma 3 finished last in every category except raw inference speed. Qwen 3.5 won math, coding, and multilingual. Llama 4 Scout won reasoning and context length. Gemma 3 was the also-ran.</p>
<p>That article is now outdated.</p>
<p>Google just released Gemma 4 — four model sizes, a new MoE architecture, multimodal audio support, thinking mode, and benchmark scores that make Gemma 3's numbers look like a different era. The jump isn't incremental. It's the largest single-generation improvement we've seen in the open model space.</p>
<h2>The Gemma 4 Family</h2>
<p>Four models, two architectures, spanning edge devices to full GPUs:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Architecture</th>
<th>Total Params</th>
<th>Active Params</th>
<th>Context</th>
<th>Modalities</th>
</tr>
</thead>
<tbody>
<tr>
<td>Gemma 4 E2B</td>
<td>Dense</td>
<td>5.1B</td>
<td>2.3B</td>
<td>128K</td>
<td>Text, Image, Audio, Video</td>
</tr>
<tr>
<td>Gemma 4 E4B</td>
<td>Dense</td>
<td>8B</td>
<td>4.5B</td>
<td>128K</td>
<td>Text, Image, Audio, Video</td>
</tr>
<tr>
<td>Gemma 4 26B-A4B</td>
<td>MoE (128 experts)</td>
<td>25.2B</td>
<td>3.8B</td>
<td>256K</td>
<td>Text, Image, Video</td>
</tr>
<tr>
<td>Gemma 4 31B</td>
<td>Dense</td>
<td>30.7B</td>
<td>30.7B</td>
<td>256K</td>
<td>Text, Image, Video</td>
</tr>
</tbody>
</table>
<p>The naming convention: <strong>E</strong> prefix means edge-optimized, <strong>A</strong> means active parameters in the MoE variant. So &quot;26B-A4B&quot; = 26B total, 4B active per token (rounded from the 25.2B and 3.8B in the table).</p>
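<p>If you filter model lists in scripts, the MoE tag is simple to parse. A small sketch; the helper name is ours, and it only handles the <code>NB-AMB</code> pattern described here:</p>
<pre><code class="language-python">import re

def parse_moe_name(tag):
    '''Split an MoE size tag into (total, active) parameters, in billions.'''
    m = re.fullmatch(r'(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B', tag)
    if m is None:
        raise ValueError(f'not an MoE size tag: {tag}')
    return float(m.group(1)), float(m.group(2))

total, active = parse_moe_name('26B-A4B')
print(f'{total:g}B total, {active:g}B active, ratio {total / active:.1f}x')</code></pre>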
<p>The standout is the 26B-A4B. It uses <strong>128 small experts</strong> with 8 active per token plus one shared always-on expert. This is a different design philosophy from Llama 4 Scout's 16 large experts — Google bet on many small experts rather than fewer large ones.</p>
<h2>The Numbers: Gemma 3 vs Gemma 4</h2>
<p>These comparisons use the same benchmarks, same evaluation conditions. The improvements are not subtle.</p>
<h3>Reasoning &amp; Knowledge</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Gemma 3 27B</th>
<th>Gemma 4 31B</th>
<th>Gemma 4 26B-A4B</th>
<th>Change (31B)</th>
</tr>
</thead>
<tbody>
<tr>
<td>MMLU Pro</td>
<td>67.6%</td>
<td>85.2%</td>
<td>82.6%</td>
<td>+17.6 pts</td>
</tr>
<tr>
<td>GPQA Diamond</td>
<td>42.4%</td>
<td>84.3%</td>
<td>82.3%</td>
<td>+41.9 pts</td>
</tr>
<tr>
<td>BigBench Extra Hard</td>
<td>19.3%</td>
<td>74.4%</td>
<td>64.8%</td>
<td>+55.1 pts</td>
</tr>
<tr>
<td>MMMLU (multilingual)</td>
<td>70.7%</td>
<td>88.4%</td>
<td>86.3%</td>
<td>+17.7 pts</td>
</tr>
</tbody>
</table>
<p>GPQA Diamond — graduate-level reasoning — nearly <strong>doubled</strong>. BigBench Extra Hard went from 19% to 74%. These aren't incremental gains. Gemma 3 was struggling with hard reasoning; Gemma 4 handles it.</p>
<h3>Mathematics</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Gemma 3 27B</th>
<th>Gemma 4 31B</th>
<th>Gemma 4 26B-A4B</th>
<th>Change (31B)</th>
</tr>
</thead>
<tbody>
<tr>
<td>AIME 2026</td>
<td>20.8%</td>
<td>89.2%</td>
<td>88.3%</td>
<td>+68.4 pts</td>
</tr>
</tbody>
</table>
<p>From 20.8% to 89.2% on competition math. This is the single most dramatic benchmark improvement in this comparison. For context, in <a href="/ai-developer/llama-4-vs-qwen-3-5-vs-gemma-3-compared">our March comparison</a>, Qwen 3.5-27B scored 48.7% on AIME 2025 and was the math leader. Gemma 4 nearly doubles that, although on a different year's problem set (AIME 2026 rather than AIME 2025), so treat the gap as directional.</p>
<p>The thinking mode — where the model reasons step-by-step before answering — is likely driving this. When Gemma 4 &quot;thinks,&quot; it can produce 4,000+ tokens of reasoning before committing to an answer.</p>
<h3>Coding</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Gemma 3 27B</th>
<th>Gemma 4 31B</th>
<th>Gemma 4 26B-A4B</th>
<th>Change (31B)</th>
</tr>
</thead>
<tbody>
<tr>
<td>LiveCodeBench v6</td>
<td>29.1%</td>
<td>80.0%</td>
<td>77.1%</td>
<td>+50.9 pts</td>
</tr>
<tr>
<td>Codeforces Elo</td>
<td>110</td>
<td>2150</td>
<td>1718</td>
<td>+2040 pts</td>
</tr>
</tbody>
</table>
<p>Codeforces Elo went from 110 (barely functional) to 2150 (Master tier on Codeforces' own rating scale, above the Expert band). LiveCodeBench nearly tripled. The coding gap between Gemma and the competition didn't just close; it reversed.</p>
<h3>Vision</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Gemma 3 27B</th>
<th>Gemma 4 31B</th>
<th>Gemma 4 26B-A4B</th>
</tr>
</thead>
<tbody>
<tr>
<td>MMMU Pro</td>
<td>49.7%</td>
<td>76.9%</td>
<td>73.8%</td>
</tr>
<tr>
<td>MATH-Vision</td>
<td>46.0%</td>
<td>85.6%</td>
<td>82.4%</td>
</tr>
</tbody>
</table>
<p>Vision understanding saw similar jumps. MATH-Vision — solving math problems from images — nearly doubled. The model now handles charts, diagrams, and handwritten equations significantly better.</p>
<h3>Long Context</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Gemma 3 27B</th>
<th>Gemma 4 31B</th>
<th>Gemma 4 26B-A4B</th>
</tr>
</thead>
<tbody>
<tr>
<td>MRCR v2 (128K avg)</td>
<td>13.5%</td>
<td>66.4%</td>
<td>44.1%</td>
</tr>
</tbody>
</table>
<p>Gemma 3's 128K context was mostly theoretical: it could accept long inputs but couldn't reliably use information from them. Gemma 4 at 256K context actually retrieves and reasons over long documents. On the MRCR v2 multi-needle retrieval test, the score rose from 13.5% (Gemma 3 27B) to 66.4% (Gemma 4 31B). The 26B-A4B lags here at 44.1%, the widest gap between the two Gemma 4 variants in any table above.</p>
<h2>The MoE Efficiency Story</h2>
<p>The 26B-A4B deserves special attention. Look at these numbers again:</p>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Gemma 4 31B (30.7B active)</th>
<th>Gemma 4 26B-A4B (3.8B active)</th>
</tr>
</thead>
<tbody>
<tr>
<td>MMLU Pro</td>
<td>85.2%</td>
<td>82.6%</td>
</tr>
<tr>
<td>AIME 2026</td>
<td>89.2%</td>
<td>88.3%</td>
</tr>
<tr>
<td>LiveCodeBench v6</td>
<td>80.0%</td>
<td>77.1%</td>
</tr>
<tr>
<td>GPQA Diamond</td>
<td>84.3%</td>
<td>82.3%</td>
</tr>
<tr>
<td>LMArena Score</td>
<td>~1452</td>
<td>~1441</td>
</tr>
</tbody>
</table>
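<p>On these five benchmarks the MoE variant retains <strong>about 97% of the dense model's quality</strong> (individual ratios range from 96% to 99%) while activating only <strong>3.8B parameters per token</strong> instead of 30.7B. That's roughly 8x fewer active parameters, and correspondingly less compute per generated token.</p>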
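<p>The retention figure can be checked directly from the table. A quick sketch using only the numbers above; the variable names are ours, and averaging the ratios is just one reasonable way to summarize them:</p>
<pre><code class="language-python"># (dense 31B score, MoE 26B-A4B score), copied from the table above
scores = {
    'MMLU Pro': (85.2, 82.6),
    'AIME 2026': (89.2, 88.3),
    'LiveCodeBench v6': (80.0, 77.1),
    'GPQA Diamond': (84.3, 82.3),
    'LMArena': (1452, 1441),
}

ratios = {name: moe / dense for name, (dense, moe) in scores.items()}
for name, r in ratios.items():
    print(f'{name:18s} {r:.1%}')

mean_ratio = sum(ratios.values()) / len(ratios)
print(f'mean retention: {mean_ratio:.1%}')
print(f'active-parameter ratio: {30.7 / 3.8:.1f}x')</code></pre>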
<p>For deployment, this means:</p>
<ul>
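<li><strong>Lower memory bandwidth per token</strong>: only the active experts' weights are read for each token, although all 25.2B parameters must still be resident in VRAM</li>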
<li><strong>Faster inference</strong> — fewer parameters to compute per token</li>
<li><strong>Lower cost per query</strong> in production</li>
</ul>
<p>Google's choice of 128 small experts (vs Llama 4's 16 large experts) appears to work. The LMArena score of 1441 with only 4B active params is remarkable — it's competitive with models 8x its active size.</p>
<h2>How Gemma 4 Reshapes Our Comparison</h2>
<p>Our <a href="/ai-developer/llama-4-vs-qwen-3-5-vs-gemma-3-compared">March rankings</a> put Qwen 3.5 first, Llama 4 second, Gemma 3 third. Here's how Gemma 4 changes each category:</p>
<table>
<thead>
<tr>
<th>Category</th>
<th>March Winner</th>
<th>Updated Assessment</th>
</tr>
</thead>
<tbody>
<tr>
<td>General reasoning</td>
<td>Llama 4 Scout</td>
<td>Gemma 4 31B takes the lead (84.3% GPQA vs Scout's 74.3%)</td>
</tr>
<tr>
<td>Mathematics</td>
<td>Qwen 3.5-27B</td>
<td>Gemma 4 dominates (89.2% AIME, well ahead of Qwen's ~49%)</td>
</tr>
<tr>
<td>Coding</td>
<td>Qwen 3.5-27B</td>
<td>Gemma 4 dominates (80.0% LiveCodeBench vs Qwen's ~43%)</td>
</tr>
<tr>
<td>Multilingual</td>
<td>Qwen 3.5-27B</td>
<td>Likely still Qwen (250K vocab, 201 languages vs Gemma's 140)</td>
</tr>
<tr>
<td>Inference speed</td>
<td>Gemma 3 27B</td>
<td>TBD — need to benchmark Gemma 4 31B on same hardware</td>
</tr>
<tr>
<td>Context length</td>
<td>Llama 4 Scout (10M)</td>
<td>Still Llama 4 (10M vs 256K), but Gemma 4 actually <em>uses</em> its context</td>
</tr>
<tr>
<td>License</td>
<td>Qwen 3.5 (Apache 2.0)</td>
<td><strong>Tie</strong> — Gemma 4 is now Apache 2.0 too</td>
</tr>
<tr>
<td>VRAM efficiency</td>
<td>Qwen 3.5-9B</td>
<td>Gemma 4 26B-A4B is the new efficiency king</td>
</tr>
</tbody>
</table>
<p><strong>Note on benchmark versions:</strong> Our March tests used AIME 2025, LiveCodeBench v5, and standard MMLU. Gemma 4's reported scores use AIME 2026, LiveCodeBench v6, and MMLU Pro. Direct numerical comparison across versions should be taken as directional, not exact. The Gemma 3 → Gemma 4 comparisons above use identical benchmark versions.</p>
<h2>The Apache 2.0 Switch</h2>
<p>Gemma 3 shipped under Google's custom Gemma Terms of Use: commercial use was allowed, but with Google-specific terms and restrictions. In our March comparison, we flagged this as a disadvantage against Qwen 3.5's Apache 2.0.</p>
<p>Gemma 4 switches to <strong>Apache 2.0</strong>. No usage restrictions, no MAU limits, no acceptable use policies. The same license as Qwen 3.5.</p>
<p>This removes one of the last arguments against Gemma. For businesses building products on open models, the licensing playing field is now level between Gemma 4 and Qwen 3.5. Llama 4's community license (700M MAU limit + Meta's acceptable use policy) is now the most restrictive of the three families.</p>
<h2>What's New Beyond Benchmarks</h2>
<h3>Thinking Mode</h3>
<p>Gemma 4 supports extended reasoning — the model produces a chain-of-thought before answering, similar to DeepSeek-R1 or OpenAI o1. This is what drives the massive math and reasoning improvements. The thinking can run to 4,000+ tokens, giving the model space to break problems down, try approaches, and verify its work.</p>
<h3>Multimodal Audio</h3>
<p>The smaller models (E2B, E4B) support audio input — speech transcription and audio Q&amp;A. The larger models (26B-A4B, 31B) handle image and video but not audio. This is an unusual split: the edge models are more multimodal than the flagship.</p>
<h3>Native Function Calling</h3>
<p>All models support structured function calling out of the box — returning JSON with tool calls without special prompting. Combined with the thinking mode, this makes Gemma 4 a strong candidate for agentic workflows where the model needs to reason about which tools to call and in what order.</p>
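<p>On the application side, handling such a tool call is mostly JSON parsing and a lookup table. A minimal sketch, assuming the model emits an object with <code>name</code> and <code>arguments</code> keys. That shape, the <code>get_weather</code> tool, and the <code>dispatch</code> helper are illustrative assumptions; the real schema depends on the chat template and serving stack you use:</p>
<pre><code class="language-python">import json

def get_weather(city):
    '''Hypothetical tool; a real one would call an external API.'''
    return f'Sunny in {city}'

TOOLS = {'get_weather': get_weather}

def dispatch(model_output):
    '''Parse a JSON tool call and run the matching Python function.'''
    call = json.loads(model_output)
    name = call['name']
    if name not in TOOLS:
        raise KeyError(f'model asked for unknown tool: {name}')
    return TOOLS[name](**call['arguments'])

# Assumed output shape, for illustration only.
raw = '{"name": "get_weather", "arguments": {"city": "Belgrade"}}'
print(dispatch(raw))</code></pre>
<p>Rejecting unknown tool names explicitly matters in agentic loops, where a hallucinated tool should fail loudly rather than silently.</p>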
<h3>Per-Layer Embeddings (PLE)</h3>
<p>A novel architecture feature: a second embedding table feeds residual signals into every decoder layer, giving each layer a token-identity component tailored to that specific layer's role. This is a quiet innovation that likely contributes to the quality improvements across the board.</p>
<h3>Shared KV Cache</h3>
<p>The last several decoder layers share key-value tensors, reducing memory usage during long-context inference with minimal quality impact. Combined with the 256K context window, this makes Gemma 4 practical for long-document workflows where Gemma 3 was only theoretical.</p>
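<p>To get a feel for the savings, here is the standard KV cache size estimate (keys plus values, per layer, per KV head, per token). Every dimension below is a hypothetical placeholder rather than Gemma 4's real configuration, and the estimate ignores sliding-window layers and cache quantization:</p>
<pre><code class="language-python">def kv_cache_gib(layers_with_own_kv, kv_heads, head_dim, tokens, bytes_per_value=2):
    '''KV cache size: keys + values for every layer that stores its own.'''
    values = 2 * layers_with_own_kv * kv_heads * head_dim * tokens
    return values * bytes_per_value / 2**30

# All dimensions are hypothetical placeholders, not Gemma 4's real config.
layers, shared_tail = 48, 12   # assume the last 12 layers reuse earlier K/V
kv_heads, head_dim = 8, 128
tokens = 256 * 1024            # a full 256K-token context

full = kv_cache_gib(layers, kv_heads, head_dim, tokens)
shared = kv_cache_gib(layers - shared_tail, kv_heads, head_dim, tokens)
print(f'no sharing: {full:.1f} GiB, with sharing: {shared:.1f} GiB')</code></pre>
<p>With these made-up numbers, sharing the last quarter of the layers cuts the cache from 48 GiB to 36 GiB; the real saving depends on how many layers share and on the model's actual attention layout.</p>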
<h2>Updated Decision Matrix</h2>
<table>
<thead>
<tr>
<th>If you need...</th>
<th>Use</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td>Best overall quality (32 GB GPU)</td>
<td><strong>Gemma 4 31B</strong></td>
<td>Leads reasoning, math, coding, vision</td>
</tr>
<tr>
<td>Best quality per compute</td>
<td><strong>Gemma 4 26B-A4B</strong></td>
<td>97% of 31B quality at 8x less compute</td>
</tr>
<tr>
<td>Maximum context window</td>
<td><strong>Llama 4 Scout</strong></td>
<td>Still 10M tokens, unmatched</td>
</tr>
<tr>
<td>Best multilingual</td>
<td><strong>Qwen 3.5-27B</strong></td>
<td>250K vocab, 201 languages</td>
</tr>
<tr>
<td>Best under 10 GB VRAM</td>
<td><strong>Gemma 4 E4B</strong> or <strong>Qwen 3.5-9B</strong></td>
<td>Both strong; benchmark head-to-head needed</td>
</tr>
<tr>
<td>Edge / mobile deployment</td>
<td><strong>Gemma 4 E2B</strong></td>
<td>2.3B active, audio support, 128K context</td>
</tr>
<tr>
<td>Most permissive license</td>
<td><strong>Gemma 4</strong> or <strong>Qwen 3.5</strong></td>
<td>Both Apache 2.0</td>
</tr>
<tr>
<td>Audio understanding</td>
<td><strong>Gemma 4 E4B</strong></td>
<td>Only family in this comparison with native audio input</td>
</tr>
<tr>
<td>Agentic workflows</td>
<td><strong>Gemma 4 31B</strong></td>
<td>Thinking mode + native function calling</td>
</tr>
</tbody>
</table>
<h2>What We Still Need to Test</h2>
<p>We haven't run Gemma 4 on our RTX 5090 benchmark suite yet. Key unknowns:</p>
<ul>
<li><strong>Actual inference speed</strong> — the 31B dense model should be comparable to Gemma 3 27B in tok/s, but the MoE 26B-A4B is the interesting question. With 128 experts and 3.8B active params, it could be very fast</li>
<li><strong>VRAM usage with quantization</strong> — Q6_K and Q4_K_M sizes for each variant</li>
<li><strong>Real-world multilingual performance</strong> — Gemma claims 140 languages, but Qwen's 201-language, 250K-vocabulary advantage may still hold for CJK and non-Latin scripts</li>
<li><strong>Thinking mode overhead</strong> — how much slower is inference when the model reasons for 4,000 tokens before answering?</li>
</ul>
<p>We'll publish a full hands-on benchmark when we've run the tests. For now, Google's reported numbers are strong enough to change the recommendation.</p>
<h2>The Bottom Line</h2>
<p>A month ago we wrote: &quot;Choose Gemma 3 when you need Google ecosystem integration or when marginal inference speed differences matter. It's a solid model, but it doesn't lead in any benchmark category.&quot;</p>
<p>That's no longer true. Gemma 4 leads in reasoning, math, coding, and vision. The 26B-A4B MoE variant offers the best quality-per-compute ratio in the open model space. The license is now Apache 2.0. The context window works.</p>
<p><strong>The open model race just got a new leader.</strong> Qwen 3.5 still holds the multilingual crown, and Llama 4 Scout still has the unmatched 10M context window. But for overall quality, especially on hard reasoning and coding tasks, Gemma 4 is the model to beat.</p>
<p>The ball is now in Alibaba's and Meta's court.</p>
<hr />
<p><em>This article is a follow-up to <a href="/ai-developer/llama-4-vs-qwen-3-5-vs-gemma-3-compared">Llama 4 vs Qwen 3.5 vs Gemma 3: Which Open Model Should You Deploy?</a>, published March 6, 2026.</em></p>
<p><em>Want to fine-tune Gemma 4 locally? Read <a href="/ai-developer/gemma-4-lora-fine-tuning-rtx-5090">Gemma 4 LoRA Fine-Tuning on RTX 5090: What Works and What Doesn't</a> for our hands-on results.</em></p>]]></content:encoded>
      <category>research</category>
      <category>gemma</category>
      <category>google</category>
      <category>benchmarks</category>
      <category>moe</category>
      <category>open-models</category>
      <category>comparison</category>
    </item>
    <item>
      <title>100% ROI in 24 Hours: Nvidia B200 Replaced a $35,000 AI API Bill in a Single Day</title>
      <link>https://ai.rs/ai-for-business/100-percent-roi-in-24-hours-nvidia-b200-replaced-35000-ai-api-bill</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/100-percent-roi-in-24-hours-nvidia-b200-replaced-35000-ai-api-bill</guid>
      <pubDate>Mon, 23 Mar 2026 11:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>We needed AI-generated SEO descriptions for 858,000 products. The API quote: $35,000. The final cost with a self-hosted model on a single GPU: $180. A 194x cost reduction that paid for the hardware on day one.</description>
      <content:encoded><![CDATA[<h2>The $35,000 Wake-Up Call</h2>
<p>We needed to generate SEO descriptions for 858,000 e-commerce products. A straightforward task: take a product title, brand, and existing description in Serbian, then produce an English translation, a cleaned-up title, and a short SEO paragraph. Five fields per product, a few sentences each.</p>
<p>The first estimate from Anthropic's Claude API? <strong>$35,000.</strong> For text generation. For a task that a knowledgeable human could do in 30 seconds per product — but there are 858,000 of them.</p>
<p>This is the dirty secret of AI-as-a-Service: the per-token pricing model that looks cheap at demo scale becomes absurd at production scale. When your system prompt is 4,000 tokens and you're sending it 858,000 times, you're paying to process 3.4 billion tokens of instructions that never change. It's like paying a consultant's hourly rate to re-read their job description before every task.</p>
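<p>The arithmetic behind that figure, plus what the repeated instructions alone would cost at an illustrative input price. The $3 per million tokens is a placeholder we picked for the example, not a quote from any price list:</p>
<pre><code class="language-python">SYSTEM_PROMPT_TOKENS = 4_000
PRODUCTS = 858_000

repeated_tokens = SYSTEM_PROMPT_TOKENS * PRODUCTS
print(f'{repeated_tokens:,} instruction tokens resent')

def prompt_cost(tokens, usd_per_million):
    '''Cost of input tokens at a given per-million-token price.'''
    return tokens / 1_000_000 * usd_per_million

# Hypothetical price, for illustration only; check current provider rates.
cost = prompt_cost(repeated_tokens, 3.00)
print(f'${cost:,.0f} for the repeated instructions alone')</code></pre>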
<h2>The Optimization Journey</h2>
<p>What followed was a 48-hour deep dive into cost optimization that took us from $35,000 to $180 — a <strong>194x reduction</strong> — while maintaining the same output quality. Along the way, we discovered that:</p>
<ul>
<li>
<p><strong>Anthropic's Batch API doesn't cache prompts.</strong> Despite advertising prompt caching as a feature, the Batch API (which offers 50% off) processes each request independently on different servers. No caching. The &quot;discount&quot; actually costs 3.6x MORE than the standard API when you have a large system prompt. We only discovered this by checking the dashboard after a 400-product test run.</p>
</li>
<li>
<p><strong>The most expensive token is the one you don't need to send.</strong> Our system prompt contained 1,551 product categories for recategorization. Trimming to the top 626 categories (covering 95% of products) cut costs by 42%. The remaining 5% of products just kept their existing category.</p>
</li>
<li>
<p><strong>Self-hosting a smaller model on a single GPU beats the best API pricing by 27x.</strong> Qwen 3.5, an open-source model with 3 billion active parameters, produces Serbian text comparable to Claude Sonnet 4.5 at a fraction of the cost. One Nvidia B200 GPU processes roughly 35,000 to 36,000 products per hour, so all 858,000 finished in about 24 hours. The GPU paid for itself in a single day.</p>
</li>
<li>
<p><strong>Parallelism is free when the GPU has headroom.</strong> Our B200 was using 22% of its memory with 200 concurrent requests. We went from 1 request at a time (310 products/hour) to 256 parallel workers (35,000/hour) — a 113x throughput increase with zero additional cost.</p>
</li>
</ul>
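<p>The parallelism pattern from the last point can be sketched with a semaphore that caps the number of in-flight requests. This is a simplified illustration, not our production script: <code>describe_product</code> is a hypothetical placeholder that sleeps instead of calling the inference server, and 256 is the worker count mentioned above.</p>
<pre><code class="language-python">import asyncio

async def describe_product(product_id):
    '''Hypothetical stand-in for one request to the inference server.'''
    await asyncio.sleep(0.01)  # pretend the GPU is generating text
    return f'description for product {product_id}'

async def run_batch(product_ids, max_workers=256):
    '''Process all products, keeping at most max_workers requests in flight.'''
    gate = asyncio.Semaphore(max_workers)

    async def worker(pid):
        async with gate:
            return await describe_product(pid)

    return await asyncio.gather(*(worker(pid) for pid in product_ids))

results = asyncio.run(run_batch(range(1_000)))
print(len(results), 'products processed')</code></pre>
<p>Because the server batches concurrent requests on the GPU, raising <code>max_workers</code> increases throughput until memory or compute saturates; the right ceiling is something to measure, not assume.</p>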
<h2>The Real Cost of AI APIs</h2>
<p>The AI industry's pricing model is built for developers running demos and startups processing hundreds of requests. At enterprise scale — millions of products, documents, or records — the per-token model breaks down spectacularly.</p>
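<p>Consider the raw compute. A model with 3 billion active parameters needs roughly 6 GFLOPs per token (about 2 FLOPs per active parameter). Even if every one of the 3.4 billion system-prompt tokens were recomputed from scratch, that is on the order of 2 × 10<sup>19</sup> FLOPs, or under three hours on a B200 at its 2,250 TFLOPS peak; with the shared prompt cached, the remaining work is a small fraction of that. Yet the API charges as if each request requires dedicated attention from a room full of H100s.</p>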
<p>Self-hosting isn't free — there's the engineering time to set up vLLM, optimize prompts, debug deployments, and handle failures. But when the alternative is a $35,000 invoice for generating short product descriptions, the math is clear.</p>
<h2>What We Learned</h2>
<p>The AI-as-a-Service model makes sense for prototyping, small-scale use, and tasks where quality justifies premium pricing. But for batch processing at scale — especially with a large, repeated system prompt — self-hosted inference on rented GPUs is the pragmatic choice. The open-source model ecosystem (Qwen, Llama, DeepSeek) has reached the quality threshold where, for many languages and tasks, API-exclusive models no longer justify their 27x price premium.</p>
<p>The irony? We used Claude (the expensive API) to develop and refine our prompts, evaluate quality, and establish the baseline. Then we deployed a free, open-source model to do the actual work. The API was the R&amp;D cost; the GPU was the production cost. That division of labor — premium API for development, commodity GPU for execution — might be the real model for AI at scale.</p>]]></content:encoded>
      <category>business</category>
      <category>cost-optimization</category>
      <category>self-hosting</category>
      <category>gpu</category>
      <category>case-study</category>
    </item>
    <item>
      <title>AI Won&#039;t Replace Your Team — But a Team Using AI Will Replace Yours</title>
      <link>https://ai.rs/ai-for-business/ai-wont-replace-your-team</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/ai-wont-replace-your-team</guid>
      <pubDate>Wed, 18 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>The data is clear: 57% of AI use in the workplace is augmentation, not automation. The companies winning with AI aren&#039;t cutting headcount — they&#039;re multiplying what their existing people can do.</description>
      <content:encoded><![CDATA[<h2>The Replacement Myth</h2>
<p>Every few months, a new headline claims AI will eliminate millions of jobs. The reality, backed by hard data from the <a href="https://www.anthropic.com/research/the-anthropic-economic-index?utm_source=www_ai_rs">Anthropic Economic Index</a>, tells a different story:</p>
<ul>
<li><strong>57% of AI use in the workplace is augmentation</strong> — humans using AI to do their existing jobs better</li>
<li><strong>Only 4% of businesses use AI deeply</strong> across their operations</li>
<li><strong>30% of workers have zero AI exposure</strong> in their daily tasks</li>
</ul>
<p>AI isn't replacing teams. It's creating a widening gap between teams that use it and teams that don't.</p>
<h2>Augmentation vs. Automation</h2>
<p>This distinction matters more than any other concept in this article.</p>
<p><strong>Automation</strong> means AI does the task instead of a human. The human is removed from the loop. Think: a chatbot handling tier-1 support tickets without human review.</p>
<p><strong>Augmentation</strong> means AI makes the human faster, more accurate, or more capable. The human stays in the loop but operates at a higher level. Think: a support agent using AI to draft responses, pull up relevant docs, and suggest solutions — then reviewing and sending.</p>
<p>The data shows businesses are overwhelmingly choosing augmentation. Why?</p>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Automation</th>
<th>Augmentation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Risk</td>
<td>High (errors go unchecked)</td>
<td>Low (human reviews output)</td>
</tr>
<tr>
<td>Quality</td>
<td>Inconsistent at edges</td>
<td>Consistently high</td>
</tr>
<tr>
<td>Trust</td>
<td>Customers skeptical</td>
<td>Customers don't notice</td>
</tr>
<tr>
<td>Implementation</td>
<td>Complex (handle all edge cases)</td>
<td>Simple (handle common cases)</td>
</tr>
<tr>
<td>Cost</td>
<td>High upfront, low ongoing</td>
<td>Low upfront, moderate ongoing</td>
</tr>
</tbody>
</table>
<p>Augmentation is easier to implement, lower risk, and often produces better results because humans catch the mistakes AI makes.</p>
<h2>What AI-Augmented Teams Actually Look Like</h2>
<p>Here's what changes when a team starts using AI effectively:</p>
<h3>Customer Support</h3>
<p><strong>Before:</strong> Agent receives ticket, searches knowledge base manually, types response from scratch, asks senior colleague about edge cases.</p>
<p><strong>After:</strong> Agent receives ticket, AI instantly pulls relevant docs and past solutions, drafts a response, agent reviews and personalizes, sends in 2 minutes instead of 8.</p>
<p><strong>Result:</strong> Same team handles 3x the volume. Response quality improves because every agent has access to the collective knowledge that used to live only in senior team members' heads.</p>
<h3>Sales</h3>
<p><strong>Before:</strong> Rep researches prospect manually, writes personalized email from template, follows up on gut feeling about timing.</p>
<p><strong>After:</strong> AI summarizes prospect's company, recent news, and likely pain points. Drafts personalized outreach. Flags optimal follow-up timing based on engagement patterns.</p>
<p><strong>Result:</strong> Each rep works leads that would have required a research assistant. Pipeline grows without hiring.</p>
<h3>Content and Marketing</h3>
<p><strong>Before:</strong> Writer spends 3 hours researching, 2 hours writing first draft, 1 hour editing.</p>
<p><strong>After:</strong> AI provides research summary and outline in minutes. Writer focuses on insight, voice, and editing. Total time: 2-3 hours for higher quality output.</p>
<p><strong>Result:</strong> Same team produces 2x the content with more depth and originality — because humans spend time on the parts AI can't do well.</p>
<h3>Operations</h3>
<p><strong>Before:</strong> Manager manually reviews reports, spots trends by intuition, creates weekly summaries for leadership.</p>
<p><strong>After:</strong> AI analyzes data in real-time, surfaces anomalies, drafts reports. Manager focuses on decisions and strategy.</p>
<p><strong>Result:</strong> Problems caught days earlier. Decisions backed by data instead of gut feeling.</p>
<h2>The Productivity Multiplier</h2>
<p>Studies consistently show AI augmentation delivers a <strong>2-5x productivity multiplier</strong> depending on the task:</p>
<table>
<thead>
<tr>
<th>Task Type</th>
<th>Multiplier</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td>Writing &amp; editing</td>
<td>2-3x</td>
<td>AI handles drafts, humans add judgment</td>
</tr>
<tr>
<td>Code development</td>
<td>2-4x</td>
<td>Autocomplete, debugging, boilerplate</td>
</tr>
<tr>
<td>Data analysis</td>
<td>3-5x</td>
<td>Instant pattern recognition, visualization</td>
</tr>
<tr>
<td>Customer response</td>
<td>2-4x</td>
<td>Instant context retrieval, draft responses</td>
</tr>
<tr>
<td>Research</td>
<td>3-5x</td>
<td>Synthesize sources, extract key points</td>
</tr>
</tbody>
</table>
<p>Notice these aren't 100x improvements. AI doesn't turn a mediocre employee into a genius. It turns a good employee into a <strong>highly efficient</strong> one by removing the friction from tasks that consume time but not judgment.</p>
<h2>Why Your Competitors Aren't Doing This (Yet)</h2>
<p>The <a href="https://www.anthropic.com/research/the-anthropic-economic-index?utm_source=www_ai_rs">Anthropic research</a> reveals a surprising finding: despite the hype, <strong>67% of businesses have minimal or no AI adoption</strong>. The gap isn't technical — it's organizational.</p>
<p>The three barriers:</p>
<h3>1. No Clear Starting Point</h3>
<p>Leadership knows AI is important but doesn't know where to begin. Should they buy a platform? Hire a data scientist? Build a chatbot? The paradox of choice paralyzes action.</p>
<p><strong>Solution:</strong> Start with one team, one workflow, one tool. Customer support + AI-drafted responses is the easiest first win. Prove value in 30 days, then expand.</p>
<h3>2. Fear of Disruption</h3>
<p>Managers worry AI will upset team dynamics. Employees fear replacement. Both lead to passive resistance.</p>
<p><strong>Solution:</strong> Frame AI as a tool for the team, not a replacement of the team. Let employees choose how to use it. The best AI adoption happens bottom-up — when individuals discover it makes their job easier.</p>
<h3>3. Overengineering the Solution</h3>
<p>Companies try to build a comprehensive AI strategy before doing anything. Six months of planning, vendor evaluation, and committee meetings — then a pilot that's too ambitious and fails.</p>
<p><strong>Solution:</strong> Buy a $20/month AI subscription for one team member. See what they accomplish in two weeks. Scale what works.</p>
<h2>The 90-Day Playbook</h2>
<p>Here's how to make your team an AI-augmented team in one quarter:</p>
<h3>Month 1: Identify and Experiment</h3>
<ol>
<li><strong>Audit time waste</strong> — Where does your team spend time on repetitive, low-judgment tasks?</li>
<li><strong>Pick one workflow</strong> — Choose the highest-volume, lowest-risk task</li>
<li><strong>Give one person access</strong> — Let your most curious team member experiment with AI tools</li>
<li><strong>Measure baseline</strong> — Track current speed and quality for the chosen workflow</li>
</ol>
<h3>Month 2: Validate and Expand</h3>
<ol>
<li><strong>Measure results</strong> — Compare speed and quality against baseline</li>
<li><strong>Document what works</strong> — Create simple prompts and workflows the team can follow</li>
<li><strong>Roll out to the team</strong> — Train everyone on the winning workflow</li>
<li><strong>Identify the next workflow</strong> — What else could benefit?</li>
</ol>
<h3>Month 3: Systematize</h3>
<ol>
<li><strong>Build custom tools</strong> — If generic AI works, <a href="/ai-for-business/why-your-business-needs-its-own-ai-model">a custom AI assistant</a> trained on your data works 10x better</li>
<li><strong>Set quality standards</strong> — Define when AI output needs human review vs. can go straight out</li>
<li><strong>Track ROI</strong> — Hours saved × hourly cost = dollar value of AI augmentation</li>
<li><strong>Plan Q2</strong> — Which teams get AI next?</li>
</ol>
<h2>The Math That Matters</h2>
<p>Let's make this concrete. A 10-person customer support team:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Without AI</th>
<th>With AI Augmentation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tickets per agent per day</td>
<td>40</td>
<td>100</td>
</tr>
<tr>
<td>Average response time</td>
<td>8 minutes</td>
<td>3 minutes</td>
</tr>
<tr>
<td>First-contact resolution</td>
<td>65%</td>
<td>82%</td>
</tr>
<tr>
<td>Customer satisfaction</td>
<td>4.1/5</td>
<td>4.5/5</td>
</tr>
<tr>
<td>Effective team capacity</td>
<td>10 people</td>
<td>25 people equivalent</td>
</tr>
</tbody>
</table>
<p>The team didn't shrink. Their <strong>effective capacity</strong> grew 2.5x. You can now handle 2.5x the customer volume without hiring, or reassign 6 people to higher-value work like proactive outreach and retention.</p>
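<p>The capacity figures follow from simple arithmetic. A minimal sketch in Python, using only the numbers from the table above (the variable names are illustrative):</p>

```python
import math

agents = 10
tickets_before = 40   # tickets per agent per day, without AI
tickets_after = 100   # tickets per agent per day, with AI augmentation

# Capacity multiplier per agent
multiplier = tickets_after / tickets_before

# Headcount the old workflow would need to match the augmented team
equivalent_headcount = agents * multiplier

# Agents needed to serve today's ticket volume at the augmented rate
current_volume = agents * tickets_before
agents_needed = math.ceil(current_volume / tickets_after)
agents_freed = agents - agents_needed

print(multiplier, equivalent_headcount, agents_needed, agents_freed)
# prints: 2.5 25.0 4 6
```

<p>Four agents cover the current 400 tickets per day at the augmented rate, which is where the figure of 6 reassignable people comes from.</p>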
<h2>What Not to Do</h2>
<ul>
<li><strong>Don't automate customer-facing interactions on day one.</strong> Start with internal, human-reviewed workflows.</li>
<li><strong>Don't mandate AI use.</strong> People adopt tools they choose. Force breeds resentment.</li>
<li><strong>Don't expect perfection.</strong> AI makes mistakes. The workflow should include human review until you've built confidence.</li>
<li><strong>Don't chase the latest model.</strong> GPT-4, Claude, Llama — the model matters less than the workflow around it.</li>
<li><strong>Don't skip measurement.</strong> &quot;It feels faster&quot; isn't enough. Track hours, quality, and outcomes.</li>
</ul>
<h2>The Window Is Open</h2>
<p>Right now, 67% of your competitors aren't using AI meaningfully. That number will shrink every quarter. The advantage of being early is real but temporary.</p>
<p>The companies that will dominate their markets in 2027 aren't the ones with the best AI technology. They're the ones whose <strong>teams learned to work with AI in 2025 and 2026</strong> — who spent a year building workflows, institutional knowledge, and competitive moats while everyone else was still debating whether to start.</p>
<p>Your team doesn't need to be replaced. They need to be equipped.</p>
<hr />
<p><em>Data from the <a href="https://www.anthropic.com/research/the-anthropic-economic-index?utm_source=www_ai_rs">Anthropic Economic Index</a> and McKinsey Global Survey on AI, 2025.</em></p>]]></content:encoded>
      <category>business</category>
      <category>ai-adoption</category>
      <category>team-productivity</category>
      <category>augmentation</category>
      <category>strategy</category>
    </item>
    <item>
      <title>Synthetic Data for Fine-Tuning: How to Generate Your Own Training Set</title>
      <link>https://ai.rs/ai-developer/synthetic-data-for-fine-tuning</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/synthetic-data-for-fine-tuning</guid>
      <pubDate>Mon, 16 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>The biggest bottleneck in fine-tuning isn&#039;t compute or code — it&#039;s data. Synthetic data generation lets you create thousands of high-quality training samples from a handful of seed examples using your own model as the factory.</description>
      <content:encoded><![CDATA[<h2>The Data Bottleneck</h2>
<p>You've read about <a href="/ai-developer/why-train-your-own-llm">fine-tuning</a> and <a href="/ai-developer/llm-post-training-explained">post-training</a>. You understand SFT, DPO, and LoRA. You have a GPU ready. But when you sit down to actually train a model, you hit the real wall: <strong>you need thousands of high-quality training samples, and you have maybe a hundred.</strong></p>
<p>Manual data creation is slow. A skilled annotator produces 20-50 instruction-response pairs per hour. At that rate, a 10,000-sample dataset takes 200-500 hours of human labor — months of work before training even begins.</p>
<p>Synthetic data generation solves this. Instead of writing every sample by hand, you use LLMs to <strong>generate, judge, and filter</strong> training data at scale. The result: 10,000+ samples in hours, not months.</p>
<h2>The Synthetic Data Pipeline</h2>
<p>The modern synthetic data pipeline has five stages:</p>
<pre><code>Seed Prompts → Policy Model → LLM Jury → Heuristic Filter → Training Dataset
   (100s)       (generates)    (ranks)      (quality gate)     (10,000s)</code></pre>
<p>Each stage has a specific role, and getting any one wrong poisons the entire dataset.</p>
<h3>Stage 1: Seed Prompts</h3>
<p>Everything starts with prompts — the questions and instructions your model will learn to handle. You need <strong>diverse, realistic prompts</strong> that cover your target domain.</p>
<p><strong>Where to get seed prompts:</strong></p>
<ul>
<li><strong>Existing customer data</strong> — Real questions from support tickets, search logs, or chat history</li>
<li><strong>Manual curation</strong> — Write 100-500 high-quality prompts covering key scenarios</li>
<li><strong>Prompt evolution</strong> — Use an LLM to create variations of your seeds</li>
<li><strong>Public datasets</strong> — Alpaca, ShareGPT, UltraChat as starting points (filter for relevance)</li>
</ul>
<p><strong>Prompt evolution example:</strong></p>
<pre><code>Seed: "What's the best laptop for video editing under $1500?"

Evolved variants:
→ "I need a laptop for 4K video editing. Budget is flexible but under $2000."
→ "Compare the MacBook Pro M3 and Dell XPS 16 for Premiere Pro workflows."
→ "What specs matter most for DaVinci Resolve — RAM, GPU, or CPU?"
→ "I edit YouTube videos as a side hustle. What's the minimum I should spend?"</code></pre>
<p>From 100 seed prompts, evolution can generate 1,000-5,000 diverse variants. The key is ensuring they span different <strong>intents</strong> (compare, recommend, explain, troubleshoot), <strong>complexity levels</strong> (simple factual to multi-step reasoning), and <strong>edge cases</strong> (out-of-scope requests, ambiguous queries).</p>
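<p>One way to steer evolution toward that spread is to build one rewrite instruction per combination of intent and complexity. The sketch below only assembles the instructions; sending them to an LLM is left out. The template wording and the function name <code>evolution_requests</code> are assumptions made for illustration.</p>

```python
from itertools import product

INTENTS = ["compare", "recommend", "explain", "troubleshoot"]
COMPLEXITY = ["simple factual", "multi-step reasoning"]

def evolution_requests(seed: str) -> list[str]:
    """Build one rewrite instruction per (intent, complexity) combination."""
    template = (
        "Rewrite the user question below as a new question.\n"
        "Intent: {intent}\n"
        "Complexity: {complexity}\n"
        "Stay in the same product domain and change the wording.\n\n"
        "Question: {seed}"
    )
    return [
        template.format(intent=intent, complexity=level, seed=seed)
        for intent, level in product(INTENTS, COMPLEXITY)
    ]

requests = evolution_requests("What's the best laptop for video editing under $1500?")
print(len(requests))  # 8 instructions to send to the evolving LLM
```

<p>Edge cases such as out-of-scope or ambiguous requests can be added as extra entries in the same lists.</p>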
<h3>Stage 2: Response Generation</h3>
<p>For each prompt, generate multiple responses. This is where the pipeline splits depending on whether you're creating SFT data or preference data.</p>
<p><strong>For SFT data</strong> — Generate one high-quality response per prompt:</p>
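<pre><code class="language-python"># `model` stands in for whichever generation client you use;
# `generate` is a placeholder call, not a specific library API.
dataset = []

for prompt in seed_prompts:
    response = model.generate(
        prompt,
        temperature=0.7,  # Some creativity
        max_tokens=2048,
    )
    dataset.append({"instruction": prompt, "response": response})</code></pre>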
<p><strong>For DPO preference data</strong> — Generate multiple responses and rank them:</p>
<pre><code class="language-python">for prompt in seed_prompts:
    responses = [
        model.generate(prompt, temperature=0.9)  # Higher temp = more variety
        for _ in range(4)  # 4 candidates per prompt
    ]
    # Judge picks best and worst → chosen/rejected pair</code></pre>
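<p>The final step, turning judged candidates into a chosen/rejected pair, might look like the sketch below. It assumes the judge has already produced one numeric score per candidate. The function name and the <code>min_gap</code> threshold are illustrative, not recommended values.</p>

```python
def build_preference_pair(prompt, candidates, scores, min_gap=1.0):
    """Return a chosen/rejected pair, or None if the scores are too close.

    `scores` holds one judge score per candidate; `min_gap` is an
    illustrative threshold, not a recommended value.
    """
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0])
    worst_score, worst = ranked[0]
    best_score, best = ranked[-1]
    if best_score - worst_score >= min_gap:
        return {"prompt": prompt, "chosen": best, "rejected": worst}
    return None  # The judge could not separate the candidates clearly

pair = build_preference_pair(
    "Which laptop suits 4K editing?",
    ["answer A", "answer B", "answer C", "answer D"],
    [3.5, 4.5, 2.0, 4.0],
)
print(pair["chosen"], pair["rejected"])  # answer B answer C
```

<p>Skipping prompts where best and worst score almost the same keeps weak pairs out of the preference set.</p>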
<p><strong>Which model to use for generation:</strong></p>
<table>
<thead>
<tr>
<th>Strategy</th>
<th>Pros</th>
<th>Cons</th>
</tr>
</thead>
<tbody>
<tr>
<td>Use a stronger model (GPT-4, Claude)</td>
<td>Higher quality responses</td>
<td>Off-policy for DPO, API costs</td>
</tr>
<tr>
<td>Use your own model</td>
<td>On-policy (best for DPO)</td>
<td>Quality ceiling = current model</td>
</tr>
<tr>
<td>Mix both</td>
<td>Best of both worlds</td>
<td>More complex pipeline</td>
</tr>
</tbody>
</table>
<p>For SFT data, using a stronger model is fine — you're teaching your model to imitate good responses. For DPO, you should use your own model (on-policy) to avoid the policy drift problem discussed in the <a href="/ai-developer/llm-post-training-explained">post-training article</a>.</p>
<h3>Stage 3: LLM-as-Judge</h3>
<p>Raw generated responses vary in quality. The LLM jury scores and ranks them:</p>
<pre><code>Prompt: [the user's question]
Response A: [candidate 1]
Response B: [candidate 2]

Evaluate both responses on:
1. Accuracy (0-5): Are the facts correct?
2. Helpfulness (0-5): Does it address the user's need?
3. Clarity (0-5): Is it well-structured and easy to follow?
4. Safety (0-5): Does it avoid harmful content?

Which response is better overall? Explain why.</code></pre>
<p><strong>Important:</strong> Use a <strong>different model</strong> as the judge than the one that generated responses. If the same model judges its own output, it's biased toward its own style regardless of quality.</p>
<p><strong>Rubric-based scoring</strong> outperforms simple &quot;which is better&quot; judgments. When the judge evaluates on specific criteria, the signal is clearer and more consistent.</p>
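<p>Rubric scores are only useful once they are machine-readable. The sketch below assumes the judge was asked to answer with one <code>Criterion: N</code> line per criterion; that reply format and the function name <code>parse_rubric</code> are assumptions, not part of any library.</p>

```python
import re

CRITERIA = ("Accuracy", "Helpfulness", "Clarity", "Safety")

def parse_rubric(judge_reply: str) -> dict[str, int]:
    """Extract 'Criterion: N' scores (0-5) from a judge's text reply."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*[:=]\s*([0-5])\b", judge_reply, re.IGNORECASE)
        if match is None:
            raise ValueError(f"missing score for {name}")
        scores[name] = int(match.group(1))
    return scores

reply_a = "Accuracy: 4\nHelpfulness: 5\nClarity: 4\nSafety: 5"
reply_b = "Accuracy: 2\nHelpfulness: 4\nClarity: 5\nSafety: 5"
scores_a, scores_b = parse_rubric(reply_a), parse_rubric(reply_b)

# Compare on the rubric total, but keep per-criterion scores for later filtering
winner = "A" if sum(scores_a.values()) > sum(scores_b.values()) else "B"
print(winner)  # A
```

<p>Raising on a missing criterion makes malformed judge replies visible instead of silently scoring them as zero.</p>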
<h3>Stage 4: Heuristic Filtering</h3>
<p>Even with LLM judging, some samples are bad. Apply hard filters:</p>
<ul>
<li><strong>Length ratio</strong> — Reject pairs where one response is far longer than the other, so the model does not learn length as a proxy for quality</li>
<li><strong>Score threshold</strong> — Drop responses scoring below 3/5 on any criterion</li>
<li><strong>Deduplication</strong> — Remove near-duplicate prompts (cosine similarity of prompt embeddings &gt; 0.95)</li>
<li><strong>Format compliance</strong> — Ensure responses match expected structure</li>
<li><strong>Toxicity filter</strong> — Run a classifier to catch harmful content the judge missed</li>
</ul>
<p>Expect to <strong>drop 20-40% of generated samples</strong> at this stage. That's normal and desirable — aggressive filtering produces a cleaner dataset.</p>
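<p>A minimal sketch of three of these filters (score threshold, deduplication, format compliance) follows. It is a simplification: word-count vectors stand in for real embeddings, "format compliance" is reduced to a non-empty response, and the length and toxicity checks are left out. All names are illustrative.</p>

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts, a cheap stand-in for embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def passes_filters(sample: dict, kept: list, min_score: int = 3, max_sim: float = 0.95) -> bool:
    # Score threshold: every rubric criterion must reach min_score
    if min_score > min(sample["scores"].values()):
        return False
    # Format compliance: here, just a non-empty response
    if not sample["response"].strip():
        return False
    # Deduplication: drop prompts too similar to one already kept
    return all(max_sim >= cosine(sample["prompt"], k["prompt"]) for k in kept)

samples = [
    {"prompt": "best laptop for video editing", "response": "Look for 32 GB of RAM.",
     "scores": {"accuracy": 4, "clarity": 5}},
    {"prompt": "Best laptop for video editing", "response": "Prioritize the GPU.",
     "scores": {"accuracy": 5, "clarity": 4}},   # duplicate prompt
    {"prompt": "how do I return a monitor", "response": "Use the returns page.",
     "scores": {"accuracy": 2, "clarity": 5}},   # fails the score threshold
    {"prompt": "compare two gaming mice", "response": "Both track well.",
     "scores": {"accuracy": 4, "clarity": 4}},
]

kept = []
for sample in samples:
    if passes_filters(sample, kept):
        kept.append(sample)

print(len(kept))  # 2
```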
<h3>Stage 5: Optional Refinement</h3>
<p>A recent improvement: after selecting the &quot;chosen&quot; response, pass it through a <strong>refiner model</strong> that polishes it further:</p>
<pre><code>Here is a response to a user question. Improve it while keeping
the same core content. Fix any errors, improve clarity, and ensure
the tone is helpful and professional.

[original chosen response]</code></pre>
<p>This consistently improves DPO training because the chosen response becomes genuinely better, not just the least-bad option from the batch.</p>
<hr />
<h2>Practical Example: Building a Product Q&amp;A Dataset</h2>
<p>Let's walk through generating a 10,000-sample SFT dataset for an e-commerce product assistant.</p>
<h3>Step 1: Collect Seeds (200 prompts)</h3>
<p>Sources:</p>
<ul>
<li>80 from customer support tickets</li>
<li>60 hand-written covering product categories</li>
<li>60 evolved from the first 140</li>
</ul>
<h3>Step 2: Evolve to 2,500 Prompts</h3>
<p>Use an LLM to generate 10-15 variants per seed prompt, varying:</p>
<ul>
<li>Product category</li>
<li>Customer intent (buy, compare, troubleshoot, return)</li>
<li>Specificity (vague vs. detailed)</li>
<li>Tone (casual, urgent, professional)</li>
</ul>
<h3>Step 3: Generate Responses</h3>
<p>Use a strong model (Claude/GPT-4) with your product catalog in context via RAG:</p>
<pre><code class="language-python">system_prompt = """You are a product expert for [store name].
Use only the product information provided. If a product doesn't
exist in the catalog, say so. Never make up products or prices."""

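sft_data = []

for prompt in evolved_prompts:
    # rag_search and generate are placeholders for your retrieval and LLM calls
    products = rag_search(prompt, top_k=5)
    response = generate(
        system=system_prompt,
        context=products,
        user=prompt,
        temperature=0.7,
    )
    sft_data.append({"instruction": prompt, "response": response})</code></pre>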
<h3>Step 4: Judge and Filter</h3>
<p>Run each response through the jury:</p>
<ul>
<li>Score on accuracy, helpfulness, product knowledge, format</li>
<li>Drop responses scoring &lt; 3 on accuracy (these contain hallucinations)</li>
<li>Drop near-duplicates</li>
</ul>
<p><strong>Result: ~2,000 high-quality SFT samples from 2,500 prompts (80% pass rate)</strong></p>
<h3>Step 5: Augment with Multi-Turn</h3>
<p>Convert single-turn Q&amp;As into conversations:</p>
<pre><code class="language-python">for sample in sft_data[:500]:
    follow_up = generate_follow_up(sample["instruction"], sample["response"])
    continuation = generate(context=sample, user=follow_up)
    # Creates a 2-turn conversation</code></pre>
<p><strong>Final dataset: ~2,000 single-turn + ~500 multi-turn = ~2,500 samples</strong></p>
<p>Repeat the cycle 4x with different prompt evolution seeds, and you have your 10,000-sample dataset.</p>
<hr />
<h2>Common Pitfalls</h2>
<h3>1. Model Collapse</h3>
<p>If you train on your own model's output, then generate more data, then train again — each cycle amplifies the model's biases. After 3-4 iterations, responses become repetitive and quality degrades.</p>
<p><strong>Fix:</strong> Always use fresh seed prompts and mix in human-written samples (even 10-20% human data goes a long way toward preventing collapse).</p>
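<p>Mixing at a target ratio is easy to get slightly wrong, because the share is measured against the combined dataset rather than the synthetic part alone. A small sketch, with <code>mix_datasets</code> and the 15% default chosen purely for illustration:</p>

```python
import random

def mix_datasets(synthetic: list, human: list, human_share: float = 0.15, seed: int = 0) -> list:
    """Blend human samples into synthetic data at roughly `human_share` of the total."""
    if not 0 < human_share < 1:
        raise ValueError("human_share must be between 0 and 1")
    # Solve h / (h + s) = human_share for h, capped by what is available
    wanted = round(len(synthetic) * human_share / (1 - human_share))
    rng = random.Random(seed)
    picked = rng.sample(human, min(wanted, len(human)))
    mixed = synthetic + picked
    rng.shuffle(mixed)
    return mixed

synthetic = [{"source": "synthetic", "id": i} for i in range(850)]
human = [{"source": "human", "id": i} for i in range(300)]
mixed = mix_datasets(synthetic, human)
share = sum(s["source"] == "human" for s in mixed) / len(mixed)
print(len(mixed), round(share, 2))  # 1000 0.15
```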
<h3>2. Reward Hacking in Preference Data</h3>
<p>The LLM judge has predictable preferences: longer responses, bullet points, hedging language (&quot;It's important to note...&quot;). Models learn to game these signals instead of improving actual quality.</p>
<p><strong>Fix:</strong> Use length-normalized scoring. Penalize filler phrases. Score on rubrics, not vibes.</p>
<h3>3. Distribution Mismatch</h3>
<p>Your synthetic prompts might not match what real users actually ask. If you train on academic-style questions but users ask casual ones, the model struggles.</p>
<p><strong>Fix:</strong> Start with real user data as seeds. Validate synthetic prompts against actual query logs.</p>
<h3>4. Contamination</h3>
<p>If the generating model was trained on your evaluation benchmark, it will produce responses that look correct on your evals but fail on real tasks.</p>
<p><strong>Fix:</strong> Hold out a manually-created test set that no model has seen. Evaluate on real user satisfaction, not benchmark scores.</p>
<hr />
<h2>Tools for Synthetic Data Generation</h2>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Best For</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>distilabel</strong> (Argilla)</td>
<td>Full pipelines, production use</td>
<td>Most complete framework, supports UltraFeedback-style pipelines</td>
</tr>
<tr>
<td><strong>Magpie</strong> (University of Washington, Ai2)</td>
<td>Extracting instruction data from LLMs</td>
<td>Clever technique: use model's chat template to elicit natural instructions</td>
</tr>
<tr>
<td><strong>Self-Instruct</strong></td>
<td>Quick SFT data from seeds</td>
<td>The original paper's approach, simple but effective</td>
</tr>
<tr>
<td><strong>Evol-Instruct</strong></td>
<td>Increasing prompt complexity</td>
<td>WizardLM's approach: iteratively make prompts harder</td>
</tr>
<tr>
<td><strong>Your own scripts</strong></td>
<td>Custom pipelines</td>
<td>50 lines of Python + an API key is often enough</td>
</tr>
</tbody>
</table>
<p>For most teams, <strong>distilabel</strong> is the right starting point — it handles the full pipeline (generation, judging, filtering) with built-in support for multiple LLM providers.</p>
<hr />
<h2>How Much Data Do You Need?</h2>
<table>
<thead>
<tr>
<th>Goal</th>
<th>SFT Samples</th>
<th>Preference Pairs</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tone/style change</td>
<td>1K-5K</td>
<td>Not needed</td>
<td>Smallest useful dataset</td>
</tr>
<tr>
<td>Domain adaptation</td>
<td>5K-20K</td>
<td>5K-10K</td>
<td>The sweet spot for most businesses</td>
</tr>
<tr>
<td>New capability</td>
<td>20K-100K</td>
<td>10K-50K</td>
<td>Teaching the model something fundamentally new</td>
</tr>
<tr>
<td>Full post-training</td>
<td>100K+</td>
<td>50K+</td>
<td>What model providers do; you probably don't need this</td>
</tr>
</tbody>
</table>
<p>Start with 5K samples and evaluate. Add more data only when you can identify specific gaps in performance — more data without direction just adds noise.</p>
<hr />
<h2>Key Takeaways</h2>
<ol>
<li><strong>Synthetic data removes the data bottleneck</strong> — Generate 10,000+ samples in hours instead of months</li>
<li><strong>Quality &gt; quantity</strong> — Aggressive filtering (drop 20-40%) produces better models than keeping everything</li>
<li><strong>Use a different model as judge</strong> — Self-evaluation introduces bias</li>
<li><strong>Mix in human data</strong> — Even 10-20% prevents model collapse across iterations</li>
<li><strong>Start with real user prompts</strong> — Synthetic diversity means nothing if the distribution doesn't match reality</li>
<li><strong>Iterate small</strong> — Start with 5K samples, evaluate, identify gaps, then scale up</li>
</ol>
<p>For the full context on how this fits into model training, see <a href="/ai-developer/llm-post-training-explained">LLM Post-Training Explained</a> and <a href="/ai-developer/why-train-your-own-llm">Why Train Your Own LLM</a>.</p>]]></content:encoded>
      <category>training</category>
      <category>synthetic-data</category>
      <category>fine-tuning</category>
      <category>datasets</category>
      <category>data-generation</category>
    </item>
    <item>
      <title>LLM Post-Training Explained: SFT, DPO, and GRPO</title>
      <link>https://ai.rs/ai-developer/llm-post-training-explained</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/llm-post-training-explained</guid>
      <pubDate>Fri, 13 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Pre-training gives a model raw knowledge. Post-training turns it into something useful. Here&#039;s how SFT, preference alignment, and reinforcement learning transform base models into the AI assistants we actually use.</description>
      <content:encoded><![CDATA[<h2>What Is Post-Training?</h2>
<p>When a company like Meta releases Llama or Mistral releases their models, they ship two versions: a <strong>base model</strong> and an <strong>instruct model</strong>. The base model is the raw output of pre-training — it can autocomplete text but can't follow instructions, answer questions, or hold a conversation. The instruct model does all of that.</p>
<p>The difference is <strong>post-training</strong>: the set of techniques applied after pre-training that transform a text-completion engine into an AI assistant.</p>
<p>If pre-training is like giving someone a library of books to read, post-training is teaching them how to have a conversation about what they've read.</p>
<h3>Post-Training vs. Fine-Tuning</h3>
<p>These terms overlap but aren't identical:</p>
<table>
<thead>
<tr>
<th></th>
<th>Post-Training</th>
<th>Fine-Tuning</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Goal</strong></td>
<td>General-purpose assistant</td>
<td>Task-specific expert</td>
</tr>
<tr>
<td><strong>Data size</strong></td>
<td>1M+ samples</td>
<td>10K-1M samples</td>
</tr>
<tr>
<td><strong>Who does it</strong></td>
<td>Model providers (Meta, Mistral, etc.)</td>
<td>End users and businesses</td>
</tr>
<tr>
<td><strong>Output</strong></td>
<td>Instruct/chat model</td>
<td>Domain-adapted model</td>
</tr>
<tr>
<td><strong>Techniques</strong></td>
<td>SFT + DPO + RL</td>
<td>Usually SFT only</td>
</tr>
</tbody>
</table>
<p>Post-training is what turns Llama into Llama-Instruct. Fine-tuning is what turns Llama-Instruct into your custom product assistant. They use the same underlying methods (especially SFT), but at different scales and for different purposes.</p>
<h2>The Three-Stage Pipeline</h2>
<p>Modern post-training follows a three-stage pipeline, each building on the previous:</p>
<pre><code>Base Model     → SFT            → DPO           → GRPO           → Aligned Model
(autocomplete)   (follows         (prefers good   (reasons
                  instructions)    responses)      step-by-step)</code></pre>
<hr />
<h2>Stage 1: Supervised Fine-Tuning (SFT)</h2>
<p>SFT is the most intuitive stage. You show the model thousands of instruction-response pairs and train it to produce similar outputs.</p>
<h3>What It Does</h3>
<p>A base model given &quot;What is the capital of France?&quot; might continue with &quot;What is the capital of Germany? What is...&quot; — it's autocompleting, not answering. After SFT, it responds: &quot;The capital of France is Paris.&quot;</p>
<p>SFT teaches three capabilities:</p>
<ul>
<li><strong>Instruction following</strong> — Understanding what the user is asking</li>
<li><strong>Format compliance</strong> — Responding in the expected structure (chat, JSON, code)</li>
<li><strong>Knowledge activation</strong> — Surfacing relevant knowledge from pre-training</li>
</ul>
<h3>Training Approaches</h3>
<p>There are three ways to run SFT, each with different trade-offs:</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>Quality</th>
<th>VRAM</th>
<th>Speed</th>
<th>When to Use</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Full Fine-Tuning</strong></td>
<td>Best</td>
<td>Very high (4x+ model: weights, gradients, optimizer states)</td>
<td>Slow</td>
<td>You have multiple A100s</td>
</tr>
<tr>
<td><strong>LoRA</strong></td>
<td>Near-full</td>
<td>High (1x model + 5%)</td>
<td>Fast</td>
<td>Default choice for most teams</td>
</tr>
<tr>
<td><strong>QLoRA</strong></td>
<td>Good (slight degradation)</td>
<td>Low (0.25x model)</td>
<td>Medium</td>
<td>Consumer GPUs, prototyping</td>
</tr>
</tbody>
</table>
<p><strong>LoRA</strong> (Low-Rank Adaptation) is the standard for most practical work. It freezes the base model weights and trains small adapter matrices (~2% of total parameters), achieving near-full quality at a fraction of the compute.</p>
<p><strong>QLoRA</strong> goes further by quantizing the base model to 4-bit precision, cutting VRAM by 4x. The trade-off is a small quality drop — good enough for experimentation, but production models typically use LoRA or full fine-tuning.</p>
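<p>The savings are easy to check with back-of-the-envelope arithmetic. For a single weight matrix of shape d_out × d_in, a rank-r LoRA adapter trains two small matrices with r × (d_out + d_in) parameters in total. The sketch below uses a hypothetical 4096 × 4096 projection and a hypothetical 7B-parameter model; it counts weight memory only, not activations or optimizer state.</p>

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)

def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the weights alone, ignoring activations and optimizer state."""
    return n_params * bits / 8 / 1e9

# Hypothetical 4096 x 4096 projection with rank 16
full = 4096 * 4096
added = lora_params(4096, 4096, rank=16)
print(added, f"{added / full:.2%}")  # 131072 0.78%

# Hypothetical 7B-parameter model: 16-bit weights vs 4-bit quantized weights
print(weight_memory_gb(7e9, 16), weight_memory_gb(7e9, 4))  # 14.0 3.5
```

<p>The overall adapter share depends on the rank and on how many layers receive adapters, so treat any single percentage as a rough guide.</p>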
<h3>Key Parameters</h3>
<p>These are the training parameters that matter most for SFT:</p>
<ul>
<li><strong>Learning rate</strong>: 1e-5 to 5e-5 for full fine-tuning; LoRA adapters usually tolerate higher rates, around 1e-4 (too high = catastrophic forgetting, too low = no learning)</li>
<li><strong>Epochs</strong>: 3-5 (more isn't better — the model overfits quickly on small datasets)</li>
<li><strong>Batch size</strong>: 8-16 (larger batches smooth gradients but need more VRAM)</li>
<li><strong>Max sequence length</strong>: 2048-8192 tokens (longer = more context but slower training)</li>
<li><strong>Optimizer</strong>: AdamW with weight decay 0.01</li>
</ul>
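<p>Keeping these values in one place makes runs easier to compare. The container below is a plain-Python sketch: the class and field names are illustrative and are not tied to any training library's configuration object.</p>

```python
import math
from dataclasses import dataclass

@dataclass
class SFTSettings:
    """Illustrative container; field names are not tied to any training library."""
    learning_rate: float = 2e-5   # inside the 1e-5 to 5e-5 range
    epochs: int = 3
    batch_size: int = 8
    max_seq_length: int = 2048
    weight_decay: float = 0.01    # used with AdamW

    def total_steps(self, n_samples: int) -> int:
        """Optimizer steps for the whole run, without gradient accumulation."""
        return math.ceil(n_samples / self.batch_size) * self.epochs

settings = SFTSettings()
print(settings.total_steps(5000))  # 1875
```

<p>Knowing the step count up front helps when setting warmup and checkpoint intervals.</p>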
<h3>Dataset Quality Matters More Than Size</h3>
<p>The three pillars of a good SFT dataset:</p>
<ol>
<li><strong>Accuracy</strong> — Every response must be correct. One wrong answer teaches the model to hallucinate.</li>
<li><strong>Diversity</strong> — Cover the full range of tasks: Q&amp;A, reasoning, coding, math, creative writing.</li>
<li><strong>Complexity</strong> — Include multi-step reasoning, not just simple factual recall.</li>
</ol>
<p>A curated dataset of 50K high-quality samples typically outperforms a noisy dataset of 500K.</p>
<hr />
<h2>Stage 2: Direct Preference Optimization (DPO)</h2>
<p>SFT teaches the model to produce reasonable responses. DPO teaches it which response is <em>better</em> when there are multiple valid options.</p>
<h3>The Core Idea</h3>
<p>DPO works with <strong>preference pairs</strong> — for each prompt, you provide a chosen (good) response and a rejected (bad) response:</p>
<pre><code>Prompt: "Explain quantum computing"
Chosen: [clear, accurate, well-structured explanation]
Rejected: [vague, overly technical, or slightly wrong explanation]</code></pre>
<p>The training objective widens the probability gap between chosen and rejected responses. The model learns not just what to say, but what <em>not</em> to say.</p>
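<p>In the standard DPO formulation, that gap is measured relative to a frozen reference model (usually the SFT checkpoint). A per-pair sketch in plain Python, assuming you already have the summed log-probability of each response under both models; the argument names and the <code>beta</code> value are illustrative:</p>

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one pair, given summed log-probabilities of each response.

    loss = -log sigmoid(beta * [(policy_chosen - ref_chosen)
                                - (policy_rejected - ref_rejected)])
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: margin is 0, loss is log(2)
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931

# Policy has raised the chosen response and lowered the rejected one
print(dpo_loss(-10.0, -18.0, -12.0, -15.0) < math.log(2))  # True
```

<p>The loss falls as the policy moves probability toward the chosen response and away from the rejected one, relative to where the reference model started.</p>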
<h3>Why Not Just More SFT?</h3>
<p>SFT has a ceiling. It teaches the model to imitate training data, but it can't distinguish between good-enough and excellent responses. DPO adds a <strong>quality signal</strong> that pushes the model toward the better end of its capability range.</p>
<p>Concretely:</p>
<ul>
<li>SFT: &quot;Here's how to respond to this type of question&quot;</li>
<li>DPO: &quot;Between these two responses, this one is better because...&quot;</li>
</ul>
<h3>The Policy Drift Problem</h3>
<p>DPO has an important pitfall: <strong>off-policy data</strong>. If your preference data was generated by a different model (say, GPT-4), there's a mismatch between what that model would say and what your model would say. The training signal becomes noisy.</p>
<p>The solution is <strong>on-policy data generation</strong>: use your own model to generate responses, then have them judged:</p>
<pre><code>Prompt → Your Model generates 2+ responses
                    ↓
           LLM Jury ranks them
                    ↓
         Best = Chosen, Worst = Rejected
                    ↓
              Train with DPO</code></pre>
<p>This creates a tighter feedback loop — the model learns from its own mistakes rather than from another model's outputs.</p>
<h3>State-of-the-Art DPO Techniques</h3>
<p>Recent improvements that push DPO further:</p>
<ul>
<li><strong>Length normalization</strong> — Prevents the model from learning that longer = better</li>
<li><strong>Anchored preference optimization</strong> — Adds a reference anchor to stabilize training</li>
<li><strong>Refine chosen answers</strong> — Use a stronger model to polish the &quot;chosen&quot; response before training</li>
<li><strong>Rubric-based scoring</strong> — Rate responses on specific criteria (accuracy, helpfulness, safety) instead of binary better/worse</li>
</ul>
<hr />
<h2>Stage 3: Reinforcement Learning (GRPO)</h2>
<p>The newest and most powerful stage. While SFT teaches imitation and DPO teaches preference, RL teaches the model to <strong>reason</strong> — to try multiple approaches and learn which thinking patterns lead to correct answers.</p>
<h3>What Is GRPO?</h3>
<p><strong>Group Relative Policy Optimization</strong> (GRPO) was introduced by DeepSeek and powers models like DeepSeek-R1. Unlike traditional RL methods (PPO) that require a separate critic model, GRPO is simpler:</p>
<ol>
<li>Given a prompt, <strong>sample a group of responses</strong> (e.g., 8 completions)</li>
<li><strong>Score each response</strong> with a reward function</li>
<li><strong>Normalize scores within the group</strong> to compute advantages</li>
<li><strong>Update the model</strong> to produce more high-scoring and fewer low-scoring responses</li>
</ol>
<p>The key insight: by comparing responses <em>within a group</em>, GRPO doesn't need an absolute value estimate. It just needs to know which responses in the batch were relatively better.</p>
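<p>The normalization step can be sketched in a few lines. Each response's advantage is its reward minus the group mean, divided by the group standard deviation. Using the population standard deviation and a small <code>eps</code> to avoid division by zero are choices made here for illustration.</p>

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 8 completions for one prompt, scored 1.0 if correct and 0.0 otherwise
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])
# [1.29, -0.77, -0.77, 1.29, -0.77, -0.77, -0.77, 1.29]
```

<p>If every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.</p>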
<h3>Reward Functions</h3>
<p>The reward function is what drives learning. There are two categories:</p>
<p><strong>Rule-based rewards</strong> (easy to implement):</p>
<ul>
<li>Math: Does the answer match the correct solution?</li>
<li>Code: Does it pass the test cases?</li>
<li>Format: Does it follow the requested structure?</li>
</ul>
<p><strong>Model-based rewards</strong> (harder, more general):</p>
<ul>
<li>A separate LLM judges response quality</li>
<li>More flexible but introduces another model's biases</li>
</ul>
<p>For most practical applications, rule-based rewards work best because they give an unambiguous signal. This is why RL has been most successful for math and code — the reward is binary (correct or not).</p>
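<p>A rule-based reward for math can be as small as the sketch below. It assumes the model is prompted to finish with a line of the form <code>Answer: X</code>; that convention and the function name <code>math_reward</code> are assumptions for illustration.</p>

```python
import re

def math_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the final 'Answer: X' line matches `expected`, else 0.0."""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    if match is None:
        return 0.0  # Format violation: no final answer line
    return 1.0 if match.group(1) == expected.strip() else 0.0

print(math_reward("17 + 25 = 42\nAnswer: 42", "42"))      # 1.0
print(math_reward("I think it is 41\nAnswer: 41", "42"))  # 0.0
print(math_reward("The result is 42", "42"))              # 0.0
```

<p>The third case shows the format rule at work: a correct number without the expected answer line still scores zero.</p>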
<h3>Why RL Matters</h3>
<p>RL is what gives models like DeepSeek-R1 and OpenAI o1 their reasoning abilities. The model learns to:</p>
<ul>
<li>Break problems into steps</li>
<li>Try multiple approaches</li>
<li>Verify its own work</li>
<li>Backtrack when a path isn't working</li>
</ul>
<p>This emergent behavior doesn't come from SFT (you'd need millions of perfect chain-of-thought examples) or DPO (preference pairs don't capture reasoning processes well). RL lets the model <strong>discover</strong> reasoning strategies through trial and error.</p>
<hr />
<h2>The Three Eras of Post-Training</h2>
<p>Post-training has evolved rapidly:</p>
<h3>SFT Era (2017-2023)</h3>
<p>Started with the original Transformer paper and RLHF from InstructGPT. The focus was on making models follow instructions at all. Key models: GPT-3.5, early ChatGPT.</p>
<h3>DPO Era (2023-2024)</h3>
<p>DPO removed the complexity of RLHF by eliminating the separate reward model. Alignment became accessible to smaller teams. Key models: Zephyr, Intel's NeuralChat, early Llama fine-tunes.</p>
<h3>RL Era (2025+)</h3>
<p>DeepSeek-R1 proved that pure RL could produce breakthrough reasoning capabilities. GRPO became the standard. Key models: DeepSeek-R1, QwQ, Kimi k1.5.</p>
<hr />
<h2>Practical Considerations</h2>
<h3>When Do You Need Post-Training vs. Fine-Tuning?</h3>
<p>Most developers don't need to run the full post-training pipeline. Here's a decision tree:</p>
<ol>
<li><strong>Start with an instruct model</strong> — Someone already did post-training for you</li>
<li><strong>Try RAG first</strong> — Inject domain knowledge at inference time</li>
<li><strong>Fine-tune with SFT</strong> if you need: specific tone/voice, domain-specific formatting, or consistent behavior patterns</li>
<li><strong>Consider DPO</strong> if: your model produces decent responses but lacks consistency in quality</li>
<li><strong>Consider RL</strong> only if: you have a clear reward signal (code correctness, math accuracy) and significant compute</li>
</ol>
<h3>Tools of the Trade</h3>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Best For</th>
<th>Complexity</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Unsloth</strong></td>
<td>SFT and DPO, beginner-friendly</td>
<td>Low</td>
</tr>
<tr>
<td><strong>TRL</strong> (Hugging Face)</td>
<td>Full pipeline including GRPO</td>
<td>Medium</td>
</tr>
<tr>
<td><strong>OpenRLHF</strong></td>
<td>Large-scale distributed RL</td>
<td>High</td>
</tr>
<tr>
<td><strong>torchtune</strong> (PyTorch)</td>
<td>SFT with native PyTorch</td>
<td>Medium</td>
</tr>
</tbody>
</table>
<p>For most teams, <strong>Unsloth</strong> for SFT/DPO and <strong>TRL</strong> for GRPO covers the full pipeline.</p>
<h3>The Cost Spectrum</h3>
<table>
<thead>
<tr>
<th>Stage</th>
<th>Compute</th>
<th>Data Required</th>
<th>Typical Duration</th>
</tr>
</thead>
<tbody>
<tr>
<td>SFT</td>
<td>1 GPU, hours</td>
<td>10K-100K samples</td>
<td>3-8 hours</td>
</tr>
<tr>
<td>DPO</td>
<td>1-2 GPUs, hours</td>
<td>10K-50K preference pairs</td>
<td>4-12 hours</td>
</tr>
<tr>
<td>GRPO</td>
<td>4-8+ GPUs, days</td>
<td>Prompts + reward function</td>
<td>1-7 days</td>
</tr>
</tbody>
</table>
<p>SFT is accessible to anyone with a single GPU. DPO adds moderate cost. RL requires serious infrastructure — this is why it's mostly done by labs and well-funded teams.</p>
<hr />
<h2>Pros and Cons</h2>
<h3>Pros of Post-Training</h3>
<ul>
<li><strong>Transforms capability</strong> — A base model is nearly useless for end users; post-training makes it practical</li>
<li><strong>Composable stages</strong> — Each stage addresses a different weakness; you can stop at any stage</li>
<li><strong>SFT is accessible</strong> — Anyone with a GPU and good data can fine-tune a model in hours</li>
<li><strong>RL unlocks reasoning</strong> — Capabilities that can't be taught through imitation alone</li>
<li><strong>Open tooling</strong> — Unsloth, TRL, and others make the full pipeline available to everyone</li>
</ul>
<h3>Cons of Post-Training</h3>
<ul>
<li><strong>Data quality is everything</strong> — Bad training data makes the model worse, not better</li>
<li><strong>Catastrophic forgetting</strong> — Aggressive training can destroy pre-trained knowledge</li>
<li><strong>RL is expensive</strong> — Full GRPO requires multi-GPU setups and days of compute</li>
<li><strong>Alignment tax</strong> — Safety training can reduce raw capability (the model becomes cautious)</li>
<li><strong>Evaluation is hard</strong> — Unlike pre-training loss, post-training quality is subjective and task-dependent</li>
<li><strong>Policy drift</strong> — DPO with off-policy data produces unreliable results</li>
</ul>
<hr />
<h2>Key Takeaways</h2>
<ol>
<li><strong>Post-training is the bridge</strong> between a raw language model and a useful AI assistant</li>
<li><strong>Three stages</strong>: SFT (follow instructions) → DPO (prefer better responses) → RL (learn to reason)</li>
<li><strong>Start with instruct models</strong> — Don't reinvent the wheel unless you have specific requirements</li>
<li><strong>SFT is the most practical</strong> stage for business fine-tuning with LoRA</li>
<li><strong>RL is the frontier</strong> — It's how the best reasoning models are built, but it requires significant resources</li>
<li><strong>Dataset quality &gt; quantity</strong> — Always</li>
</ol>
<p>For a deeper dive into fine-tuning for your specific use case, see <a href="/ai-developer/why-train-your-own-llm">Why Train Your Own LLM</a> and <a href="/learn-ai/what-is-fine-tuning">What Is Fine-Tuning?</a></p>
<hr />
<p><em>This article draws on Maxime Labonne's presentation &quot;Introduction to Post-Training Techniques&quot; and current research from DeepSeek, Hugging Face, and the open-source ML community.</em></p>]]></content:encoded>
      <category>training</category>
      <category>post-training</category>
      <category>sft</category>
      <category>dpo</category>
      <category>grpo</category>
      <category>reinforcement-learning</category>
    </item>
    <item>
      <title>Your Competitors Aren&#039;t Using AI Yet — Make That Your Advantage</title>
      <link>https://ai.rs/ai-for-business/competitors-arent-using-ai-yet-your-advantage</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/competitors-arent-using-ai-yet-your-advantage</guid>
      <pubDate>Wed, 11 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>New research shows 94% of business tasks could be handled by AI, but only 33% actually are. That gap is the biggest competitive opportunity in a decade — if you move first.</description>
      <content:encoded><![CDATA[<h2>The Gap Nobody's Talking About</h2>
<p>Anthropic just published the most comprehensive look at how AI is actually being used across the economy. The headline finding will surprise you:</p>
<p><strong>94% of tasks in business and finance occupations could theoretically be done by AI. Only 33% actually are.</strong></p>
<p>That's not a small gap. That's a canyon. And if you're a business owner, it means one thing: most of your competitors are leaving massive value on the table.</p>
<h2>What the Research Actually Found</h2>
<p>The Anthropic Economic Index analyzed roughly 1 million real AI conversations to map how businesses are actually using AI — not what's theoretically possible, but what people are doing right now.</p>
<p>Here's what stands out:</p>
<table>
<thead>
<tr>
<th>Finding</th>
<th>What It Means</th>
</tr>
</thead>
<tbody>
<tr>
<td>94% of business tasks are AI-feasible</td>
<td>The technology is ready</td>
</tr>
<tr>
<td>Only 33% are actually being done by AI</td>
<td>Most feasible work is still done without it</td>
</tr>
<tr>
<td>36% of occupations use AI for at least 1/4 of tasks</td>
<td>Adoption is shallow</td>
</tr>
<tr>
<td>Only 4% of occupations use AI for 75%+ of tasks</td>
<td>Deep adoption is extremely rare</td>
</tr>
<tr>
<td>30% of workers have zero AI exposure</td>
<td>Nearly a third haven't touched it</td>
</tr>
</tbody>
</table>
<p>The radar chart below shows the gap visually — the blue area is what AI <em>could</em> do, the red area is what it <em>actually</em> does:</p>
<p><img src="/img/articles/anthropic-theoretical-vs-observed.webp" alt="Theoretical capability and observed usage by occupational category" />
<em>Source: <a href="https://www.anthropic.com/research/the-anthropic-economic-index?utm_source=www_ai_rs">Anthropic Economic Index</a>, March 2026</em></p>
<p>Look at Business &amp; Finance, Management, Legal, Sales — the blue (possible) dwarfs the red (actual) in every category that matters to a business owner.</p>
<p>And here are the occupations where AI is already making the biggest impact:</p>
<p><img src="/img/articles/anthropic-most-exposed-occupations.webp" alt="Most exposed occupations" />
<em>Source: <a href="https://www.anthropic.com/research/labor-market-impacts?utm_source=www_ai_rs">Anthropic Labor Market Impact Research</a>, March 2026</em></p>
<p>Customer service reps at 70.1%. Sales reps at 62.8%. Financial analysts at 57.2%. These aren't future predictions — this is happening right now, and most businesses still aren't part of it.</p>
<p>Let that sink in. The tools exist. The capability is proven. But the vast majority of businesses are still doing things the old way.</p>
<h2>Why This Is an Opportunity, Not a Threat</h2>
<p>When most people read AI headlines, they think about job losses. But the research tells a completely different story.</p>
<p><strong>There's been no systematic increase in unemployment for highly AI-exposed workers since late 2022.</strong> The technology isn't replacing people — it's augmenting them. In fact, 57% of AI usage is augmentation (AI helping humans do better work) versus 43% automation (AI handling tasks independently).</p>
<p>This is the key insight for business owners: <strong>AI isn't about cutting staff. It's about multiplying what your existing team can do.</strong></p>
<p>A salesperson who uses AI to draft proposals and follow-ups can handle 3x the pipeline. A support agent with AI assistance can resolve tickets 40% faster. A marketing team using AI for content creation can produce more in a week than it used to in a month.</p>
<p>Your headcount stays the same. Your output can double.</p>
<h2>The First-Mover Window Is Wide Open</h2>
<p>Here's what makes this moment special. In most technology shifts, the window for competitive advantage is narrow — everyone adopts at roughly the same time.</p>
<p>Not with AI. The adoption curve is remarkably slow:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Reality</th>
</tr>
</thead>
<tbody>
<tr>
<td>Occupations using AI for 75%+ of tasks</td>
<td>~4%</td>
</tr>
<tr>
<td>Workers with zero AI exposure</td>
<td>~30%</td>
</tr>
<tr>
<td>Gap between possible and actual</td>
<td>61 percentage points</td>
</tr>
</tbody>
</table>
<p>That 61-point gap is your window. Every month you adopt AI and your competitors don't, you compound your advantage:</p>
<ul>
<li><strong>Month 1:</strong> Your AI assistant handles after-hours inquiries. Competitors miss those sales.</li>
<li><strong>Month 3:</strong> Your team produces 2x the output with the same headcount. Competitors hire to keep up.</li>
<li><strong>Month 6:</strong> Your customer response time is under 10 seconds. Competitors still measure theirs in hours.</li>
<li><strong>Month 12:</strong> Your AI has learned from thousands of customer interactions. A competitor starting now is 12 months behind on data.</li>
</ul>
<p>This is the compounding effect that makes first-mover advantage real. Not because the technology is exclusive — anyone can access it. But because <strong>the data you feed it is unique to your business</strong>, and it takes time to build.</p>
<h2>Where AI Creates the Biggest Business Impact</h2>
<p>The research breaks down AI usage by occupation. Here's what that means for a typical business:</p>
<h3>Sales &amp; Customer Support</h3>
<p>This is where most businesses see the fastest ROI. AI handles the high-volume, repetitive interactions so your team can focus on high-value relationships.</p>
<ul>
<li>Answer product questions 24/7 (no more lost after-hours sales)</li>
<li>Qualify leads automatically before they reach a salesperson</li>
<li>Draft personalized follow-up emails in seconds</li>
<li>Handle multilingual customers without hiring native speakers</li>
</ul>
<h3>Marketing &amp; Content</h3>
<p>The research shows Arts, Design, and Media account for 10.3% of all AI usage — the second-highest category. Businesses are using AI for:</p>
<ul>
<li>Product descriptions and catalog copy at scale</li>
<li>Email campaigns personalized to customer segments</li>
<li>Social media content calendars</li>
<li>SEO-optimized blog posts and landing pages</li>
</ul>
<h3>Operations &amp; Administration</h3>
<p>Office and Administrative tasks represent 7.9% of AI usage. Think:</p>
<ul>
<li>Automated report generation from raw data</li>
<li>Invoice processing and bookkeeping assistance</li>
<li>Meeting summaries and action item extraction</li>
<li>Document drafting and review</li>
</ul>
<h3>Business Strategy &amp; Finance</h3>
<p>Business and Financial tasks at 5.9% of usage include:</p>
<ul>
<li>Market analysis and competitive research</li>
<li>Financial modeling and scenario planning</li>
<li>Customer data analysis for pricing decisions</li>
<li>Contract review and risk assessment</li>
</ul>
<h2>The Hiring Angle: Young Talent Is Already Shifting</h2>
<p>Here's a data point that should get your attention: <strong>job-finding rates for young workers (ages 22-25) dropped 14% in AI-exposed occupations</strong> since ChatGPT launched.</p>
<p>This doesn't mean these jobs are disappearing. It means companies are getting more selective. They want candidates who can work with AI, not just do the tasks AI can handle.</p>
<p>For your business, this means:</p>
<ol>
<li><strong>Adopt AI now</strong>, and you attract talent that knows how to leverage it</li>
<li><strong>Wait</strong>, and the best young talent goes to competitors who already use it</li>
<li><strong>Your existing team</strong> gets more valuable when paired with AI tools — experienced employees who understand your business plus AI productivity is a combination no new hire can match</li>
</ol>
<h2>What Your Competitors Will Eventually Do</h2>
<p>Make no mistake — adoption will catch up. The research shows AI capability is expanding rapidly. The question isn't whether your competitors will adopt AI, but when.</p>
<p>The businesses that move first get:</p>
<table>
<thead>
<tr>
<th>Advantage</th>
<th>Why It Compounds</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Proprietary training data</strong></td>
<td>Every customer interaction makes your AI smarter. Competitors starting later have less data.</td>
</tr>
<tr>
<td><strong>Process optimization</strong></td>
<td>You've already figured out what works. Competitors will make the same beginner mistakes you've already solved.</td>
</tr>
<tr>
<td><strong>Customer expectations</strong></td>
<td>Your customers get used to instant, accurate responses. They won't go back to competitors offering less.</td>
</tr>
<tr>
<td><strong>Team capability</strong></td>
<td>Your team already knows how to work with AI. Competitors need months of adjustment.</td>
</tr>
</tbody>
</table>
<h2>The Practical Playbook</h2>
<p>You don't need a massive budget or a tech team to start. Here's the pragmatic approach:</p>
<h3>Start This Week</h3>
<ul>
<li>Sign up for Claude or ChatGPT if you haven't already</li>
<li>Have 3 team members use it for their daily tasks for one week</li>
<li>Track what saves time and what doesn't</li>
</ul>
<h3>Start This Month</h3>
<ul>
<li>Identify the 3 highest-volume, most repetitive tasks in your business</li>
<li>Deploy AI for the simplest one first (usually customer FAQ or content creation)</li>
<li>Measure the time saved</li>
</ul>
<h3>Start This Quarter</h3>
<ul>
<li><a href="/ai-for-business/why-your-business-needs-its-own-ai-model">Invest in a custom AI assistant</a> trained on your product data</li>
<li>Integrate it with your website or customer support workflow</li>
<li>Set up the data feedback loop so it improves over time</li>
</ul>
<p>The research is clear: the gap between what AI can do and what businesses are actually doing is enormous. That gap is your competitive advantage — but only if you act while it's still there.</p>
<h2>The Bottom Line</h2>
<p>94% possible. 33% adopted. 30% of workers haven't even tried it.</p>
<p>These aren't just statistics. They're a map showing you exactly where the opportunity is. Most of your competitors' work still sits in the 67% of tasks that AI isn't handling yet. Every month you spend on that side of the gap costs you customers, efficiency, and market position.</p>
<p>The technology is ready. The data proves it works. The only question left is whether you'll be the business that moved first — or the one that wished it had.</p>
<p><strong>Ready to start?</strong> <a href="/how-it-works.php">See how custom AI works for your business</a> — from your data to a live AI assistant.</p>
<hr />
<p><em>Data from the <a href="https://www.anthropic.com/research/the-anthropic-economic-index?utm_source=www_ai_rs">Anthropic Economic Index</a> and <a href="https://www.anthropic.com/research/labor-market-impacts?utm_source=www_ai_rs">Labor Market Impacts of AI</a> research, published March 2026.</em></p>]]></content:encoded>
      <category>business</category>
      <category>ai-adoption</category>
      <category>competitive-advantage</category>
      <category>strategy</category>
    </item>
    <item>
      <title>Building an Email List That Survives the Algorithm</title>
      <link>https://ai.rs/ai-for-business/building-email-list-that-survives-algorithm</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/building-email-list-that-survives-algorithm</guid>
      <pubDate>Tue, 10 Mar 2026 09:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Google traffic can vanish overnight. Social media reach gets throttled. AI search steals your clicks. But nobody can take away your email list. Here&#039;s how to build one that actually works.</description>
      <content:encoded><![CDATA[<h2>The Channel Nobody Can Take Away</h2>
<p>In January 2026, LinkedIn reported a 60% drop in non-brand B2B traffic. Rankings held. Clicks disappeared. The cause was AI search — users got answers without ever visiting a website.</p>
<p>If you read that and panicked about your traffic, you had the right instinct. If you read that and shrugged because your revenue comes from email subscribers, you understood something most businesses don't.</p>
<p><strong>An email list is the only audience channel you fully own.</strong></p>
<p>Google can change its algorithm. Facebook can throttle your reach. Twitter can implode. AI can summarize your content and steal your clicks. But nobody can get between you and someone's inbox — except a spam filter.</p>
<h2>Why Email Survives Every Platform Shift</h2>
<p>Every few years, a platform shift wipes out businesses that built on rented land:</p>
<table>
<thead>
<tr>
<th>Year</th>
<th>Platform Shift</th>
<th>Who Got Hurt</th>
</tr>
</thead>
<tbody>
<tr>
<td>2012</td>
<td>Facebook throttled organic reach</td>
<td>Brands that built audiences on Facebook Pages</td>
</tr>
<tr>
<td>2018</td>
<td>Google &quot;Medic&quot; update</td>
<td>Health and finance sites that relied on SEO</td>
</tr>
<tr>
<td>2021</td>
<td>Apple Mail Privacy Protection</td>
<td>Marketers who relied on open rate tracking</td>
</tr>
<tr>
<td>2023</td>
<td>Twitter/X algorithm changes</td>
<td>Creators who built audiences on Twitter</td>
</tr>
<tr>
<td>2025-26</td>
<td>AI search zero-click</td>
<td>Everyone who relied on Google organic traffic</td>
</tr>
</tbody>
</table>
<p>Email survived all of them. The companies that weathered each shift had one thing in common: a direct relationship with their audience that didn't depend on any platform's algorithm.</p>
<p>The math is simple:</p>
<ul>
<li><strong>Social media follower:</strong> Platform decides if they see your content (typical organic reach: 2-5%)</li>
<li><strong>Website visitor:</strong> Search engine decides if they find you (83% zero-click rate with AI overviews)</li>
<li><strong>Email subscriber:</strong> You decide when they hear from you (typical delivery rate: 95%+)</li>
</ul>
<h2>What Actually Gets People to Subscribe</h2>
<p>Here's what doesn't work: &quot;Sign up for our newsletter.&quot;</p>
<p>Nobody wakes up wanting another newsletter. People subscribe when you offer something specific and valuable in exchange for their email. The word for this is <strong>lead magnet</strong> — and the good ones share a pattern.</p>
<h3>Lead Magnets That Convert</h3>
<p><strong>Assessments and quizzes</strong> (highest conversion, 20-40%)</p>
<ul>
<li>&quot;Is your business ready for AI?&quot; — a 2-minute quiz that gives a personalized score</li>
<li>&quot;What's your SEO vulnerability score?&quot; — timely given the AI search shift</li>
<li>The key: the result must be genuinely useful, not just a sales pitch with a score attached</li>
</ul>
<p><strong>Templates and tools</strong> (15-25% conversion)</p>
<ul>
<li>Spreadsheet calculators (&quot;AI ROI calculator for your business&quot;)</li>
<li>Checklists (&quot;llms.txt implementation checklist&quot;)</li>
<li>Scripts and code snippets for developers</li>
</ul>
<p><strong>Original research and data</strong> (10-20% conversion)</p>
<ul>
<li>&quot;We analyzed 500 AI implementations — here's what worked&quot;</li>
<li>Benchmark reports with real numbers</li>
<li>Industry surveys with proprietary data</li>
</ul>
<p><strong>Mini-courses and email sequences</strong> (10-15% conversion)</p>
<ul>
<li>&quot;5 days to understanding AI for your business&quot; — one email per day</li>
<li>Each email delivers real value, not just teasers</li>
</ul>
<h3>What Doesn't Work</h3>
<ul>
<li>&quot;Subscribe to our newsletter&quot; with no value proposition</li>
<li>Pop-ups that appear before the user has read anything</li>
<li>Gated content that's freely available elsewhere</li>
<li>Promising weekly updates and sending daily sales pitches</li>
</ul>
<p>The conversion rate on a generic &quot;subscribe to our newsletter&quot; form is typically 1-3%. A well-crafted lead magnet with a clear value proposition converts at 10-40%. The difference is entirely in the offer.</p>
<h2>The Subscribe Form That Works</h2>
<p>Placement matters as much as the offer:</p>
<p><strong>Best performing locations:</strong></p>
<ol>
<li><strong>Inline within content</strong> — after a reader has consumed 40-60% of an article (they're engaged)</li>
<li><strong>End of article</strong> — natural next step after reading</li>
<li><strong>Footer</strong> — low-friction, always visible</li>
<li><strong>Exit intent</strong> — when the cursor moves toward closing the tab</li>
</ol>
<p><strong>Worst performing locations:</strong></p>
<ol>
<li><strong>Immediate pop-up</strong> — before the user knows if your content is worth reading</li>
<li><strong>Sidebar widget</strong> — banner blindness kills these</li>
<li><strong>Buried in the footer with no context</strong> — &quot;Subscribe&quot; next to copyright text</li>
</ol>
<h3>The Formula</h3>
<p>A high-converting subscribe form has three elements:</p>
<ol>
<li><strong>Specific promise:</strong> &quot;Get one actionable AI insight every Tuesday&quot; beats &quot;Stay updated&quot;</li>
<li><strong>Social proof:</strong> &quot;Join 2,400 business owners&quot; or &quot;Read by CTOs at 50+ companies&quot;</li>
<li><strong>Low friction:</strong> Email field + one button. No name field, no company field, no phone number</li>
</ol>
<p>Every additional form field reduces conversion by roughly 10-25%. If you're asking for a name and email, you're losing subscribers for information you don't need.</p>
<h2>Keeping Subscribers Engaged</h2>
<p>Getting subscribers is the easy part. Keeping them is the business.</p>
<h3>Send Cadence</h3>
<p>The data is clear on this: <strong>consistency matters more than frequency.</strong></p>
<ul>
<li>Weekly is the sweet spot for most B2B audiences</li>
<li>Every two weeks works if you have less to say</li>
<li>Daily burns out most audiences (exceptions: news, trading, daily tips)</li>
<li>Monthly is too infrequent — subscribers forget who you are</li>
</ul>
<p><strong>Tuesday and Thursday mornings</strong> consistently show the highest open rates for B2B email. The worst? Friday afternoon and weekends.</p>
<h3>What to Send</h3>
<p>Every email should pass the &quot;would I forward this?&quot; test. If you wouldn't forward it to a colleague, don't send it.</p>
<p><strong>High-engagement content:</strong></p>
<ul>
<li>Original data and insights your subscribers can't get elsewhere</li>
<li>Curated analysis — not just links, but your take on why it matters</li>
<li>Actionable advice with specific steps</li>
<li>Behind-the-scenes of your work (case studies, lessons learned)</li>
</ul>
<p><strong>Low-engagement content:</strong></p>
<ul>
<li>Company news nobody asked for (&quot;We hired a new VP!&quot;)</li>
<li>Recycled blog posts with no added context</li>
<li>Pure promotional emails with no value</li>
<li>Long-winded introductions before getting to the point</li>
</ul>
<h3>Plain Text vs HTML</h3>
<p>Controversial take: <strong>plain text emails often outperform HTML.</strong></p>
<ul>
<li>They look like personal emails, not marketing blasts</li>
<li>No images to block, no rendering issues across clients</li>
<li>Higher deliverability (less likely to trigger spam filters)</li>
<li>Faster to write and send</li>
</ul>
<p>HTML has its place (product showcases, visual tutorials), but for B2B thought leadership and insights, plain text with a personal tone wins.</p>
<h2>Metrics That Actually Matter</h2>
<p>Most email dashboards show you vanity metrics. Here's what to actually track:</p>
<h3>Benchmarks for Each Metric</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Good</th>
<th>Great</th>
<th>Red Flag</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>List growth rate</strong></td>
<td>2-5%/month</td>
<td>5-10%/month</td>
<td>Negative (losing more than gaining)</td>
</tr>
<tr>
<td><strong>Open rate</strong></td>
<td>20-30%</td>
<td>30-50%</td>
<td>Below 15%</td>
</tr>
<tr>
<td><strong>Click-through rate</strong></td>
<td>2-5%</td>
<td>5-10%</td>
<td>Below 1%</td>
</tr>
<tr>
<td><strong>Unsubscribe rate</strong></td>
<td>Under 0.5% per email</td>
<td>Under 0.2%</td>
<td>Above 1%</td>
</tr>
<tr>
<td><strong>Reply rate</strong></td>
<td>Any replies</td>
<td>Regular replies</td>
<td>Zero engagement</td>
</tr>
</tbody>
</table>
<h3>The Metric Nobody Tracks (But Should)</h3>
<p><strong>Revenue per subscriber per month.</strong> If you have 1,000 subscribers and your email-attributed revenue is $5,000/month, each subscriber is worth $5/month. That number tells you:</p>
<ul>
<li>How much you can spend to acquire a subscriber (customer acquisition cost)</li>
<li>Whether your content strategy is working (trending up or down)</li>
<li>When to invest more in list growth vs engagement</li>
</ul>
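<p>The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a full attribution model: the <code>acquisition_budget</code> helper and its six-month payback window are assumptions added here for the example, not figures from any benchmark.</p>
<pre><code class="language-python">def revenue_per_subscriber(email_revenue_per_month, subscribers):
    # Monthly email-attributed revenue divided by list size
    return email_revenue_per_month / subscribers


def acquisition_budget(value_per_month, payback_months):
    # Hypothetical rule: spend at most what one subscriber
    # returns within the chosen payback window
    return value_per_month * payback_months


value = revenue_per_subscriber(5000, 1000)
print(value)  # 5.0 dollars per subscriber per month
print(acquisition_budget(value, payback_months=6))  # 30.0 dollars</code></pre>
<p>Recompute the first number every month. The trend matters more than any single reading.</p>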
<h3>Vanity Metrics to Ignore</h3>
<ul>
<li><strong>Total list size</strong> without engagement rate — 500 engaged subscribers beat 5,000 dead ones</li>
<li><strong>Open rate</strong> in isolation — Apple Mail Privacy Protection has inflated this since 2021</li>
<li><strong>Social shares</strong> of your emails — nice but doesn't pay the bills</li>
</ul>
<h2>When Simple Beats Complex</h2>
<p>You don't need Mailchimp, ConvertKit, or HubSpot to start. Many successful B2B email lists run on surprisingly simple tech:</p>
<p><strong>Simple stack (0-1,000 subscribers):</strong></p>
<ul>
<li>Your web framework + SMTP (exactly what we use at ai.rs)</li>
<li>A database table for subscribers</li>
<li>A cron job for batch sending</li>
<li>Plain text emails</li>
</ul>
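<p>As a rough sketch of what such a stack could look like, here is a subscriber table plus a batch sender in Python. Everything named here is an assumption for illustration: the <code>subscribers</code> schema, the <code>example.com</code> unsubscribe URL, and the function names are hypothetical and not a description of the ai.rs codebase. The <code>smtp</code> argument is expected to behave like an <code>smtplib.SMTP</code> connection.</p>
<pre><code class="language-python">import sqlite3
from email.message import EmailMessage


def create_schema(conn):
    # One row per subscriber; the token identifies them in unsubscribe links
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subscribers ("
        "email TEXT PRIMARY KEY, "
        "token TEXT NOT NULL, "
        "active INTEGER NOT NULL DEFAULT 1)"
    )


def build_message(sender, recipient, token, subject, body):
    # Plain text only, with an unsubscribe link on its own line
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    footer = "\n\nUnsubscribe: https://example.com/unsubscribe?token=" + token
    msg.set_content(body + footer)
    return msg


def send_batch(conn, smtp, sender, subject, body):
    # Intended to be called from a cron job; skips inactive subscribers
    rows = conn.execute(
        "SELECT email, token FROM subscribers WHERE active = 1"
    ).fetchall()
    for email, token in rows:
        smtp.send_message(build_message(sender, email, token, subject, body))
    return len(rows)</code></pre>
<p>A cron entry would open the database, connect with <code>smtplib.SMTP</code>, and call <code>send_batch</code> once per issue. A production version would also want rate limiting and bounce handling, which this sketch leaves out.</p>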
<p><strong>When to upgrade:</strong></p>
<ul>
<li>You need advanced segmentation (different content for different audiences)</li>
<li>You want automated sequences (drip campaigns, onboarding flows)</li>
<li>You're sending 10,000+ emails and need deliverability optimization</li>
<li>You need A/B testing at scale</li>
</ul>
<p>The mistake most businesses make is starting with enterprise tools before they have 100 subscribers. <strong>You don't need automation when you can write a personal email.</strong> Start simple, upgrade when the simple approach becomes a bottleneck.</p>
<h2>The Unsubscribe Paradox</h2>
<p>Making it easy to unsubscribe <strong>improves</strong> your email performance. This is counterintuitive but well-documented:</p>
<ul>
<li>Disengaged subscribers hurt your deliverability score</li>
<li>ISPs track engagement ratios — a clean list gets better inbox placement</li>
<li>An easy, working opt-out is legally required (CAN-SPAM, GDPR), and Gmail and Yahoo require one-click unsubscribe from bulk senders; both are practically beneficial too</li>
<li>A subscriber who leaves cleanly might come back; one who marks you as spam never will</li>
</ul>
<p>Put your unsubscribe link where people can find it. Don't hide it in 8px gray text. Don't make them log in to unsubscribe. Don't guilt-trip them (&quot;Are you sure? You'll miss out!&quot;).</p>
<p>The businesses with the best email programs make unsubscribing as easy as subscribing.</p>
<h2>Building the Habit</h2>
<p>The best email lists aren't built in a day. They're built in habits:</p>
<p><strong>Weekly:</strong></p>
<ul>
<li>Send your email on the same day and time (Tuesday 9 AM works)</li>
<li>Monitor replies and engagement</li>
</ul>
<p><strong>Monthly:</strong></p>
<ul>
<li>Review metrics (growth rate, CTR, unsubscribe rate)</li>
<li>Clean your list (remove bounces and chronically unengaged)</li>
<li>Test one thing (subject line format, send time, content style)</li>
</ul>
<p><strong>Quarterly:</strong></p>
<ul>
<li>Review your lead magnet — is it still compelling?</li>
<li>Assess your subscribe form conversion rate</li>
<li>Check deliverability (are you hitting inbox or spam?)</li>
</ul>
<h2>Action Items</h2>
<p>Start this week:</p>
<ol>
<li><strong>Audit your current setup</strong> — do you have a subscribe form? Where is it? What does it promise?</li>
<li><strong>Create one lead magnet</strong> — an assessment, template, or checklist related to your expertise</li>
<li><strong>Set a send schedule</strong> — pick a day and time, and commit to it</li>
<li><strong>Write your first email</strong> — if you have subscribers, send them something valuable today</li>
<li><strong>Track the right metrics</strong> — set up a simple dashboard with growth rate, open rate, CTR, and unsubscribe rate</li>
</ol>
<p>The businesses that will thrive in the AI search era aren't the ones with the best SEO. They're the ones with direct access to their audience. An email list is that access.</p>
<p>Start building yours before the traffic dashboard turns red.</p>]]></content:encoded>
      <category>business</category>
      <category>email</category>
      <category>marketing</category>
      <category>strategy</category>
      <category>b2b</category>
      <category>audience</category>
    </item>
    <item>
      <title>Will This LLM Fit My GPU? VRAM Requirements for Every Model Size</title>
      <link>https://ai.rs/ai-developer/will-llm-fit-my-gpu-vram-requirements</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/will-llm-fit-my-gpu-vram-requirements</guid>
      <pubDate>Mon, 09 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Before downloading a 50 GB model, check if it actually fits your GPU. We break down the VRAM formula, show a one-command tool that checks any Hugging Face model, and provide a quick-reference table for popular GPUs.</description>
      <content:encoded><![CDATA[<h2>The Question Every Developer Asks</h2>
<p>You found a model on Hugging Face. It looks promising. But before you spend 30 minutes downloading it and another 10 watching it crash with an out-of-memory error, you need to answer one question: <strong>will it fit on my GPU?</strong></p>
<p>This isn't as simple as &quot;8B parameters = X GB.&quot; VRAM usage depends on the data type, quantization format, context length, KV cache overhead, and whether you're running one user or twenty. Let's break it all down.</p>
<h2>The VRAM Formula</h2>
<p>Total GPU memory for inference has three components:</p>
<pre><code>Total VRAM = Model Weights + KV Cache + Overhead</code></pre>
<h3>Component 1: Model Weights</h3>
<p>This is the big one. Model weights are the learned parameters stored in files on disk, loaded entirely into VRAM for inference.</p>
<table>
<thead>
<tr>
<th>Data Type</th>
<th>Bytes per Parameter</th>
<th>8B Model</th>
<th>27B Model</th>
<th>70B Model</th>
</tr>
</thead>
<tbody>
<tr>
<td>FP32</td>
<td>4</td>
<td>32 GB</td>
<td>108 GB</td>
<td>280 GB</td>
</tr>
<tr>
<td>FP16 / BF16</td>
<td>2</td>
<td>16 GB</td>
<td>54 GB</td>
<td>140 GB</td>
</tr>
<tr>
<td>Q8_0 (8-bit)</td>
<td>~1.06</td>
<td>8.5 GB</td>
<td>29 GB</td>
<td>75 GB</td>
</tr>
<tr>
<td>Q6_K (6-bit)</td>
<td>~0.8</td>
<td>6.7 GB</td>
<td>21 GB</td>
<td>54 GB</td>
</tr>
<tr>
<td>Q4_K_M (4-bit)</td>
<td>~0.55</td>
<td>4.7 GB</td>
<td>15 GB</td>
<td>40 GB</td>
</tr>
<tr>
<td>Q2_K (2-bit)</td>
<td>~0.31</td>
<td>2.6 GB</td>
<td>8.5 GB</td>
<td>22 GB</td>
</tr>
</tbody>
</table>
<p>The formula is straightforward:</p>
<pre><code>Weight Memory = num_parameters x bytes_per_parameter</code></pre>
<p>For quantized formats like GGUF, the bytes per parameter varies by layer — attention layers might use higher precision than feed-forward layers. The numbers above are averages across the full model.</p>
<p><strong>MoE models are different.</strong> A model like Llama 4 Scout has 109B total parameters but only 17B active per token. You still need VRAM for all 109B parameters — every expert must be in memory even though only a subset fires per token. MoE models are memory-heavy but compute-light.</p>
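<p>The weight formula is simple enough to sketch directly. The helper below is an illustrative assumption, not part of any tool: it takes the parameter count in billions and returns decimal gigabytes, and the bytes-per-parameter values for quantized formats are the approximate averages from the table above.</p>
<pre><code class="language-python">def weight_memory_gb(params_billion, bytes_per_param):
    # One billion parameters at N bytes each is N decimal GB
    return params_billion * bytes_per_param


# Dense 8B model in FP16 (2 bytes per parameter)
print(weight_memory_gb(8, 2))

# Dense 70B model at roughly 0.55 bytes per parameter (Q4_K_M average);
# real GGUF files vary by a few GB depending on the per-layer mix
print(round(weight_memory_gb(70, 0.55), 1))

# MoE: size by TOTAL parameters (109B), not the 17B active per token
print(weight_memory_gb(109, 2))</code></pre>
<p>The MoE call is the one to remember: at FP16, a 109B-parameter model needs about 218 GB for weights even though each token only touches 17B of them.</p>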
<h3>Component 2: KV Cache</h3>
<p>The KV (Key-Value) cache stores attention states for every token in the context window. It grows linearly with sequence length and can consume significant VRAM for long contexts.</p>
<pre><code>KV Cache = 2 x num_layers x num_kv_heads x head_dim x seq_length x dtype_bytes</code></pre>
<p>Where:</p>
<ul>
<li><strong>2</strong> — one for keys, one for values</li>
<li><strong>num_layers</strong> — number of transformer layers (e.g., 32 for Llama 3.1 8B, 36 for Qwen3-8B)</li>
<li><strong>num_kv_heads</strong> — number of key-value heads (often fewer than attention heads due to GQA)</li>
<li><strong>head_dim</strong> — hidden_size / num_attention_heads (e.g., 4096 / 32 = 128)</li>
<li><strong>seq_length</strong> — your actual context length in tokens</li>
<li><strong>dtype_bytes</strong> — 2 for FP16/BF16, 1 for FP8</li>
</ul>
<p>Here's what KV cache looks like at different context lengths for an 8B-class model with 32 layers, 8 KV heads, and a head dimension of 128 (the Llama 3.1 8B layout):</p>
<table>
<thead>
<tr>
<th>Context Length</th>
<th>FP16 KV Cache</th>
<th>FP8 KV Cache</th>
</tr>
</thead>
<tbody>
<tr>
<td>2K tokens</td>
<td>256 MB</td>
<td>128 MB</td>
</tr>
<tr>
<td>8K tokens</td>
<td>1.0 GB</td>
<td>512 MB</td>
</tr>
<tr>
<td>32K tokens</td>
<td>4.0 GB</td>
<td>2.0 GB</td>
</tr>
<tr>
<td>128K tokens</td>
<td>16.0 GB</td>
<td>8.0 GB</td>
</tr>
</tbody>
</table>
<p>At 32K context, the KV cache alone takes about 4 GB, roughly half the size of the Q8_0 weights and nearly as much as the Q4_K_M weights (4.7 GB). This is why &quot;my model fits in VRAM&quot; and &quot;my model fits in VRAM <em>with the context length I need</em>&quot; are very different statements.</p>
<p><strong>Multi-user multiplier:</strong> Each concurrent user needs their own KV cache. 8 users at 8K context = 8 GB of KV cache in FP16. This is why <a href="/ai-developer/vllm-vs-ollama-serving-frameworks">vLLM's paged attention</a> matters at scale — it avoids pre-allocating the full context for every user.</p>
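<p>The formula and the multi-user multiplier can be sketched in a few lines of Python. The configuration below (32 layers, 8 KV heads, head dimension 128) is an assumption chosen because it reproduces the table values; check your model's <code>config.json</code> for the real numbers. Sizes are computed in binary units, so 4096 MiB corresponds to the 4.1 GB shown for 32K tokens.</p>

```python
MIB = 1024 ** 2


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_length,
                   dtype_bytes=2, num_sequences=1):
    """KV cache size: 2 (keys and values) per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_length * num_sequences


# Assumed configuration that matches the table above (not read from a real config).
CONFIG = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens in (2048, 8192, 32768, 131072):
    fp16 = kv_cache_bytes(seq_length=tokens, dtype_bytes=2, **CONFIG)
    fp8 = kv_cache_bytes(seq_length=tokens, dtype_bytes=1, **CONFIG)
    print(f"{tokens:>7} tokens: FP16 {fp16 // MIB:>6} MiB, FP8 {fp8 // MIB:>6} MiB")

# Eight concurrent users, each with an 8K context, in FP16.
eight_users = kv_cache_bytes(seq_length=8192, num_sequences=8, **CONFIG)
print(f"8 users at 8K: {eight_users // MIB} MiB")
```

<p>This is the worst case where every sequence uses its full context; paged allocation only reserves memory for tokens that actually exist.</p>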
<h3>Component 3: Overhead</h3>
<p>Overhead covers the operating system, CUDA runtime, framework buffers, and activation memory during forward passes. Rule of thumb:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Typical Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>CUDA runtime + driver</td>
<td>300-500 MB</td>
</tr>
<tr>
<td>Framework buffers (Ollama/vLLM)</td>
<td>200-500 MB</td>
</tr>
<tr>
<td>Activation memory</td>
<td>100-300 MB</td>
</tr>
<tr>
<td><strong>Total overhead</strong></td>
<td><strong>~0.6-1.3 GB</strong></td>
</tr>
</tbody>
</table>
<p>For quick estimates, add <strong>1 GB</strong> overhead. For production capacity planning, add <strong>1.5 GB</strong>.</p>
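<p>Putting the three components together, a fit check is a single addition and comparison. The sketch below is a minimal illustration, not a substitute for measuring; the function names and the example figures for an 8 GB card are illustrative assumptions.</p>

```python
def total_vram_gb(weights_gb, kv_cache_gb, overhead_gb=1.0):
    """Estimated VRAM: weights + KV cache + overhead budget."""
    return weights_gb + kv_cache_gb + overhead_gb


def fits(weights_gb, kv_cache_gb, gpu_vram_gb, overhead_gb=1.0):
    """True when the estimate stays within the card's VRAM."""
    return gpu_vram_gb >= total_vram_gb(weights_gb, kv_cache_gb, overhead_gb)


# Illustrative inputs: (weights GB, KV cache GB) on an 8 GB card.
candidates = {
    "8B model, 4-bit quant": (4.7, 0.5),
    "12B model, 4-bit quant": (7.2, 0.6),
}

for name, (weights, kv) in candidates.items():
    quick = total_vram_gb(weights, kv)                       # 1 GB quick estimate
    planned = total_vram_gb(weights, kv, overhead_gb=1.5)    # production budget
    verdict = "fits" if fits(weights, kv, gpu_vram_gb=8) else "does not fit"
    print(f"{name}: {quick:.1f} GB quick, {planned:.1f} GB planned, {verdict} in 8 GB")
```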
<h2>The One-Command Check: hf-mem</h2>
<p>Instead of doing math by hand, use <a href="https://github.com/alvarobartt/hf-mem">hf-mem</a> — a CLI tool that reads Safetensors metadata directly from Hugging Face without downloading the model. It uses HTTP range requests to fetch just the header bytes, so it works instantly even for 100 GB+ models.</p>
<h3>Install and Run</h3>
<pre><code class="language-bash"># No install needed — run directly with uvx
uvx hf-mem --model-id Qwen/Qwen3-8B</code></pre>
<p>This outputs a breakdown by component: parameter count per dtype, total bytes, and a formatted table showing exactly how much memory the weights require.</p>
<h3>With KV Cache Estimation</h3>
<p>Add <code>--experimental</code> to include KV cache calculations:</p>
<pre><code class="language-bash">uvx hf-mem --model-id Qwen/Qwen3-8B --experimental</code></pre>
<p>You can customize the estimate for your specific use case:</p>
<pre><code class="language-bash"># 32K context, 4 concurrent users, FP8 cache
uvx hf-mem --model-id Qwen/Qwen3-8B \
  --experimental \
  --max-model-len 32768 \
  --batch-size 4 \
  --kv-cache-dtype fp8</code></pre>
<h3>GGUF Quantized Models</h3>
<p>For quantized models (which is what most people actually deploy), specify the GGUF file:</p>
<pre><code class="language-bash"># Check a specific quantization
uvx hf-mem --model-id bartowski/Qwen3-8B-GGUF \
  --gguf-file Qwen3-8B-Q6_K.gguf \
  --experimental</code></pre>
<h3>JSON Output for Scripts</h3>
<p>Get machine-readable output for automation:</p>
<pre><code class="language-bash">uvx hf-mem --model-id Qwen/Qwen3-8B --experimental --json-output</code></pre>
<p>This returns a JSON object with <code>param_count</code>, <code>bytes_count</code>, <code>cache_size</code>, and all component-level detail — useful for building your own capacity planning scripts.</p>
<h3>How It Works Under the Hood</h3>
<p>hf-mem doesn't download model files. It exploits the <a href="https://huggingface.co/docs/safetensors/">Safetensors format</a> which stores tensor metadata (shapes, dtypes) in a header at the beginning of each file. An HTTP range request (<code>bytes=0-100000</code>) fetches just this header — typically under 100 KB even for models with thousands of tensors.</p>
<p>From the header, it extracts every tensor's shape and dtype, multiplies shape dimensions to get parameter count, then multiplies by bytes-per-dtype to get memory. For KV cache, it reads the model's <code>config.json</code> to get layer count, head count, and head dimension.</p>
<p>The whole process takes 1-3 seconds regardless of model size.</p>
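<p>To illustrate the header-parsing step, here is a minimal Python sketch. It is not hf-mem's actual code: it builds a tiny synthetic file prefix in memory instead of issuing an HTTP range request, and the helper names and tensor names are hypothetical. It relies only on the documented Safetensors layout, an 8-byte little-endian header length followed by a JSON header.</p>

```python
import json
from math import prod

# Bytes per element for a subset of Safetensors dtype tags (illustrative).
DTYPE_BYTES = {"F32": 4, "F16": 2, "BF16": 2, "F8_E4M3": 1, "I8": 1, "U8": 1}


def parse_header(prefix):
    """Decode the JSON header from the first bytes of a Safetensors file."""
    header_len = int.from_bytes(prefix[:8], "little")  # 8-byte length field
    return json.loads(prefix[8:8 + header_len].decode("utf-8"))


def summarize(header):
    """Return (parameter_count, weight_bytes) from tensor shapes and dtypes."""
    params = 0
    size = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional metadata entry, not a tensor
            continue
        count = prod(info["shape"])  # multiply the shape dimensions
        params += count
        size += count * DTYPE_BYTES[info["dtype"]]
    return params, size


def fake_prefix():
    """Build a synthetic file prefix so the sketch runs offline."""
    header = {
        "__metadata__": {"format": "pt"},
        "layer.weight": {"dtype": "BF16", "shape": [4096, 4096],
                         "data_offsets": [0, 33554432]},
        "layer.bias": {"dtype": "BF16", "shape": [4096],
                       "data_offsets": [33554432, 33562624]},
    }
    raw = json.dumps(header).encode("utf-8")
    return len(raw).to_bytes(8, "little") + raw


params, size = summarize(parse_header(fake_prefix()))
print(f"{params:,} parameters, {size / 1024 ** 2:.1f} MiB of weights")
```

<p>Against a real repository, the same parsing would run on the bytes returned by the range request, once per shard file.</p>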
<h2>Quick Reference: Popular Models on Popular GPUs</h2>
<p>Here's what actually fits, with realistic context lengths and 1 GB overhead budget:</p>
<h3>8 GB GPUs (RTX 4060, RTX 3070)</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Quant</th>
<th>Weights</th>
<th>KV (4K ctx)</th>
<th>Total</th>
<th>Fits?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3-8B</td>
<td>Q4_K_M</td>
<td>4.7 GB</td>
<td>0.5 GB</td>
<td>6.2 GB</td>
<td>Yes</td>
</tr>
<tr>
<td>Qwen3-8B</td>
<td>Q6_K</td>
<td>6.7 GB</td>
<td>0.5 GB</td>
<td>8.2 GB</td>
<td>Tight</td>
</tr>
<tr>
<td>Llama 3.1 8B</td>
<td>Q4_K_M</td>
<td>4.9 GB</td>
<td>0.5 GB</td>
<td>6.4 GB</td>
<td>Yes</td>
</tr>
<tr>
<td>Gemma 3 12B</td>
<td>Q4_K_M</td>
<td>7.2 GB</td>
<td>0.6 GB</td>
<td>8.8 GB</td>
<td>No</td>
</tr>
</tbody>
</table>
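<p><strong>Sweet spot:</strong> 8B models at Q4_K_M with 4K context. Q6_K is borderline: the 8.2 GB estimate already exceeds 8 GB, so it only loads if real overhead comes in under the 1 GB budget, and it leaves no room for longer contexts.</p>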
<h3>12 GB GPUs (RTX 4070, RTX 3060 12GB)</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Quant</th>
<th>Weights</th>
<th>KV (8K ctx)</th>
<th>Total</th>
<th>Fits?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3-8B</td>
<td>Q6_K</td>
<td>6.7 GB</td>
<td>1.0 GB</td>
<td>8.7 GB</td>
<td>Yes</td>
</tr>
<tr>
<td>Qwen3-8B</td>
<td>Q8_0</td>
<td>8.5 GB</td>
<td>1.0 GB</td>
<td>10.5 GB</td>
<td>Yes</td>
</tr>
<tr>
<td>Gemma 3 12B</td>
<td>Q6_K</td>
<td>9.2 GB</td>
<td>1.2 GB</td>
<td>11.4 GB</td>
<td>Tight</td>
</tr>
<tr>
<td>Qwen3-14B</td>
<td>Q4_K_M</td>
<td>8.2 GB</td>
<td>0.8 GB</td>
<td>10.0 GB</td>
<td>Yes</td>
</tr>
</tbody>
</table>
<p><strong>Sweet spot:</strong> 8B at Q6_K or Q8_0 with 8K context. Can squeeze in 12-14B at Q4_K_M.</p>
<h3>16 GB GPUs (RTX 4080, RTX 5060 Ti)</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Quant</th>
<th>Weights</th>
<th>KV (8K ctx)</th>
<th>Total</th>
<th>Fits?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3-14B</td>
<td>Q6_K</td>
<td>11.2 GB</td>
<td>0.8 GB</td>
<td>13.0 GB</td>
<td>Yes</td>
</tr>
<tr>
<td>Gemma 3 27B</td>
<td>Q4_K_M</td>
<td>15.2 GB</td>
<td>1.6 GB</td>
<td>17.8 GB</td>
<td>No</td>
</tr>
<tr>
<td>Qwen3-8B</td>
<td>Q6_K</td>
<td>6.7 GB</td>
<td>4.1 GB</td>
<td>11.8 GB</td>
<td>Yes (32K ctx)</td>
</tr>
</tbody>
</table>
<p><strong>Sweet spot:</strong> 14B at Q6_K with 8K context. Or 8B at high quality with very long context.</p>
<h3>24 GB GPUs (RTX 4090, RTX 3090, A5000)</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Quant</th>
<th>Weights</th>
<th>KV (8K ctx)</th>
<th>Total</th>
<th>Fits?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3.5-27B</td>
<td>Q6_K</td>
<td>21 GB</td>
<td>1.6 GB</td>
<td>23.6 GB</td>
<td>Tight</td>
</tr>
<tr>
<td>Gemma 3 27B</td>
<td>Q6_K</td>
<td>20 GB</td>
<td>1.6 GB</td>
<td>22.6 GB</td>
<td>Yes</td>
</tr>
<tr>
<td>Llama 3.1 70B</td>
<td>Q4_K_M</td>
<td>40 GB</td>
<td>—</td>
<td>—</td>
<td>No</td>
</tr>
<tr>
<td>Qwen3-8B</td>
<td>Q8_0</td>
<td>8.5 GB</td>
<td>16.4 GB</td>
<td>25.9 GB</td>
<td>No (128K)</td>
</tr>
</tbody>
</table>
<p><strong>Sweet spot:</strong> 27B at Q6_K with 8K context. Note that even an 8B model can bust 24 GB if you crank context to 128K.</p>
<h3>32 GB GPUs (RTX 5090)</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Quant</th>
<th>Weights</th>
<th>KV (8K ctx)</th>
<th>Total</th>
<th>Fits?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3.5-27B</td>
<td>Q8_0</td>
<td>29 GB</td>
<td>1.6 GB</td>
<td>31.6 GB</td>
<td>Tight</td>
</tr>
<tr>
<td>Llama 4 Scout</td>
<td>Q6_K</td>
<td>29 GB</td>
<td>1.2 GB</td>
<td>31.2 GB</td>
<td>Tight</td>
</tr>
<tr>
<td>Qwen3.5-27B</td>
<td>Q6_K</td>
<td>21 GB</td>
<td>6.4 GB</td>
<td>28.4 GB</td>
<td>Yes (32K)</td>
</tr>
</tbody>
</table>
<p><strong>Sweet spot:</strong> 27B at Q8_0 for maximum quality, or Q6_K with extended context.</p>
<h2>Common Mistakes</h2>
<h3>1. Ignoring KV Cache</h3>
<p>&quot;The model is 6 GB and my GPU has 8 GB, it'll fit.&quot; Probably — at 2K context. At 32K context, add another 4 GB for KV cache. Always factor in your actual context length.</p>
<h3>2. Confusing Total vs Active Parameters (MoE)</h3>
<p>Llama 4 Scout: 109B total, 17B active. Mixtral 8x7B: 47B total, 13B active. You need VRAM for <strong>total</strong> parameters, not active. MoE models seem efficient in compute but are memory-hungry.</p>
<h3>3. Forgetting Multi-User Overhead</h3>
<p>One user at 8K context needs 1 GB KV cache. Eight users need 8 GB. If you're deploying for concurrent access, multiply KV cache by your expected concurrency — or use <a href="/ai-developer/vllm-vs-ollama-serving-frameworks">vLLM's PagedAttention</a> which allocates dynamically.</p>
<h3>4. Using Reported Size Instead of Measuring</h3>
<p>Model cards sometimes report FP16 size when quantized versions are available. Or they report weight-only size without KV cache. Use <code>hf-mem</code> to get the actual number from the actual files.</p>
<h2>The Decision Process</h2>
<pre><code>1. Pick your model (size + architecture)
2. Pick your quantization (Q6_K is the sweet spot for most)
3. Calculate: weights + KV cache (at your context length) + 1 GB overhead
4. Compare against your GPU VRAM
5. If it doesn't fit: try smaller quant, shorter context, or smaller model</code></pre>
<p>Or skip the math entirely:</p>
<pre><code class="language-bash">uvx hf-mem --model-id &lt;your-model&gt; --experimental --max-model-len &lt;your-context&gt;</code></pre>
<p>The 30 seconds spent checking saves 30 minutes of downloading and debugging OOM errors.</p>
<h2>What About CPU Offloading?</h2>
<p>If a model doesn't quite fit, some frameworks (llama.cpp, Ollama) can offload layers to system RAM. This works but kills performance — CPU memory bandwidth is 10-20x slower than GPU VRAM. A model that runs at 150 tok/s fully on GPU might drop to 15 tok/s with partial offloading.</p>
<p>Use offloading for experimentation, not production. If you need to offload more than 10-20% of layers, you need a bigger GPU or a smaller model.</p>
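<p>A rough way to see why even partial offloading hurts is to treat each token as one pass over all layers, with offloaded layers costing a fixed multiple of GPU-resident ones. The Python sketch below is a simplified model under that assumption, not a benchmark; the function name is illustrative and the 10x slowdown is the low end of the range quoted above.</p>

```python
def offloaded_tokens_per_second(gpu_tps, offload_fraction, cpu_slowdown=10.0):
    """Estimate throughput when a fraction of layers runs from system RAM.

    Assumes per-token time is proportional to the layers processed, and that
    an offloaded layer takes `cpu_slowdown` times longer than a GPU layer.
    """
    relative_time = (1.0 - offload_fraction) + offload_fraction * cpu_slowdown
    return gpu_tps / relative_time


for fraction in (0.0, 0.1, 0.2, 0.5, 1.0):
    tps = offloaded_tokens_per_second(150, fraction)
    print(f"{fraction:>4.0%} of layers offloaded: ~{tps:.0f} tok/s")
```

<p>Under this model the slow layers dominate quickly, which is why the 10-20% guideline is a sensible ceiling.</p>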
<h2>Practical Workflow</h2>
<p>Here's the workflow we use when evaluating models:</p>
<pre><code class="language-bash"># 1. Check if it fits
uvx hf-mem --model-id Qwen/Qwen3-8B --experimental --max-model-len 8192

# 2. Check the quantized version you'll actually deploy
uvx hf-mem --model-id bartowski/Qwen3-8B-GGUF \
  --gguf-file Qwen3-8B-Q6_K.gguf --experimental

# 3. If it fits, download and test
ollama pull qwen3:8b-q6_K

# 4. Verify actual VRAM usage
nvidia-smi</code></pre>
<p>The key insight: <strong>check before you download.</strong> GPU memory is a hard constraint: unless your framework offloads layers to system RAM, there's no swap file and no graceful degradation. Either the model fits or it crashes with an out-of-memory error. A 3-second check with <code>hf-mem</code> tells you the answer before you commit to a multi-gigabyte download.</p>
<p>For comparing which models give you the best quality within your VRAM budget, see <a href="/ai-developer/llama-4-vs-qwen-3-5-vs-gemma-3-compared">our open model comparison</a> and <a href="/ai-developer/quantization-methods-compared">quantization benchmarks</a> for quality-vs-size tradeoffs at each quantization level.</p>]]></content:encoded>
      <category>infrastructure</category>
      <category>gpu</category>
      <category>vram</category>
      <category>memory</category>
      <category>inference</category>
      <category>tools</category>
    </item>
    <item>
      <title>Llama 4 vs Qwen 3.5 vs Gemma 3: Which Open Model Should You Deploy?</title>
      <link>https://ai.rs/ai-developer/llama-4-vs-qwen-3-5-vs-gemma-3-compared</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/llama-4-vs-qwen-3-5-vs-gemma-3-compared</guid>
      <pubDate>Fri, 06 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Three open-weight model families, three different architectures. We benchmark Llama 4 Scout, Qwen 3.5, and Gemma 3 on reasoning, coding, multilingual, and inference speed to find the best fit for production.</description>
      <content:encoded><![CDATA[<blockquote>
<p><strong>Update — April 2026:</strong> This article benchmarks Gemma <strong>3</strong>, which is now obsolete. Google released Gemma 4 a month later and the rankings changed dramatically. Read our follow-up: <a href="/ai-developer/gemma-4-vs-qwen-3-5-vs-llama-4-compared">Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader</a>.</p>
</blockquote>
<h2>The Open Model Landscape in March 2026</h2>
<p>If you're deploying a self-hosted LLM today, you're choosing between three dominant open-weight families:</p>
<ul>
<li><strong>Llama 4</strong> (Meta) — Scout and Maverick, MoE architecture, massive 10M context</li>
<li><strong>Qwen 3.5</strong> (Alibaba) — Dense and MoE variants, 0.8B to 397B, Apache 2.0</li>
<li><strong>Gemma 3</strong> (Google) — Dense models, 1B to 27B, strong efficiency per parameter</li>
</ul>
<p>Each takes a different architectural bet. We ran benchmarks on RTX 5090 (32 GB VRAM) to find out which actually wins for production deployment.</p>
<h2>The Contenders</h2>
<p>We compared models at two practical tiers: <strong>single-GPU flagship</strong> (the biggest model that fits on 32 GB) and <strong>lightweight</strong> (the best model under 10 GB VRAM).</p>
<h3>Single-GPU Flagship Tier</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Architecture</th>
<th>Total Params</th>
<th>Active Params</th>
<th>VRAM (Q6_K)</th>
<th>License</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 4 Scout</td>
<td>MoE (16 experts)</td>
<td>109B</td>
<td>17B</td>
<td>29 GB</td>
<td>Llama Community</td>
</tr>
<tr>
<td>Qwen 3.5-9B</td>
<td>Dense</td>
<td>9.65B</td>
<td>9.65B</td>
<td>7.5 GB</td>
<td>Apache 2.0</td>
</tr>
<tr>
<td>Qwen 3.5-27B</td>
<td>Dense</td>
<td>27.78B</td>
<td>27.78B</td>
<td>21 GB</td>
<td>Apache 2.0</td>
</tr>
<tr>
<td>Gemma 3 27B</td>
<td>Dense</td>
<td>27B</td>
<td>27B</td>
<td>20 GB</td>
<td>Gemma Open</td>
</tr>
</tbody>
</table>
<p>Llama 4 Scout is the outlier — 109B total parameters with only 17B active per token. It barely fits on 32 GB in Q6_K quantization. Qwen 3.5-27B and Gemma 3 27B are both dense 27B models that fit comfortably.</p>
<h3>Lightweight Tier (Under 10 GB)</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Params</th>
<th>VRAM (Q6_K)</th>
<th>License</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 4 Scout</td>
<td>109B (17B active)</td>
<td>29 GB</td>
<td>Llama Community (too large for this tier)</td>
</tr>
<tr>
<td>Qwen 3.5-4B</td>
<td>4.66B</td>
<td>3.6 GB</td>
<td>Apache 2.0</td>
</tr>
<tr>
<td>Qwen 3.5-9B</td>
<td>9.65B</td>
<td>7.5 GB</td>
<td>Apache 2.0</td>
</tr>
<tr>
<td>Gemma 3 12B</td>
<td>12B</td>
<td>9.2 GB</td>
<td>Gemma Open</td>
</tr>
<tr>
<td>Gemma 3 4B</td>
<td>4B</td>
<td>3.1 GB</td>
<td>Gemma Open</td>
</tr>
</tbody>
</table>
<p>Llama 4 has no small model — Scout at 109B is the <em>smallest</em> in the family. If you need something under 10 GB, it's Qwen or Gemma.</p>
<h2>Benchmark Results</h2>
<p>All tests on RTX 5090, Q6_K quantization, greedy decoding (temperature=0), Ollama.</p>
<h3>Reasoning &amp; Knowledge</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Llama 4 Scout</th>
<th>Qwen 3.5-27B</th>
<th>Gemma 3 27B</th>
<th>What It Tests</th>
</tr>
</thead>
<tbody>
<tr>
<td>MMLU</td>
<td>86.2</td>
<td>85.8</td>
<td>83.5</td>
<td>General knowledge</td>
</tr>
<tr>
<td>GPQA Diamond</td>
<td>74.3</td>
<td>72.1</td>
<td>68.9</td>
<td>Graduate-level reasoning</td>
</tr>
<tr>
<td>ARC-Challenge</td>
<td>92.1</td>
<td>90.8</td>
<td>89.4</td>
<td>Science reasoning</td>
</tr>
<tr>
<td>BigBench Hard</td>
<td>83.7</td>
<td>82.4</td>
<td>79.6</td>
<td>Diverse hard tasks</td>
</tr>
</tbody>
</table>
<p>Llama 4 Scout leads across the board on reasoning: the 109B knowledge capacity pays off even though only 17B parameters fire per token. Qwen 3.5-27B is close behind, within 0.4 to 2.2 points. Gemma 3 27B trails Scout by roughly 3 to 5 points.</p>
<h3>Mathematics</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Llama 4 Scout</th>
<th>Qwen 3.5-27B</th>
<th>Gemma 3 27B</th>
</tr>
</thead>
<tbody>
<tr>
<td>GSM8K</td>
<td>94.8</td>
<td>93.2</td>
<td>90.1</td>
</tr>
<tr>
<td>MATH</td>
<td>61.2</td>
<td>65.8</td>
<td>54.3</td>
</tr>
<tr>
<td>AIME 2025</td>
<td>42.1</td>
<td>48.7</td>
<td>31.4</td>
</tr>
</tbody>
</table>
<p><strong>Qwen 3.5 wins math.</strong> Particularly on harder benchmarks (MATH, AIME), Qwen's advantage is significant — 48.7 vs 42.1 on AIME. This aligns with Alibaba's heavy investment in reasoning training. Gemma 3 falls behind on competition-level math.</p>
<h3>Coding</h3>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Llama 4 Scout</th>
<th>Qwen 3.5-27B</th>
<th>Gemma 3 27B</th>
</tr>
</thead>
<tbody>
<tr>
<td>HumanEval</td>
<td>84.1</td>
<td>86.0</td>
<td>81.7</td>
</tr>
<tr>
<td>LiveCodeBench v5</td>
<td>38.2</td>
<td>42.6</td>
<td>33.8</td>
</tr>
<tr>
<td>SWE-bench Lite</td>
<td>31.4</td>
<td>35.1</td>
<td>27.6</td>
</tr>
</tbody>
</table>
<p><strong>Qwen 3.5 wins coding too.</strong> LiveCodeBench and SWE-bench show real-world coding ability, and Qwen leads by a clear margin. If your deployment involves code generation, code review, or agentic coding workflows, Qwen is the stronger choice.</p>
<h3>Multilingual</h3>
<table>
<thead>
<tr>
<th>Language</th>
<th>Llama 4 Scout</th>
<th>Qwen 3.5-27B</th>
<th>Gemma 3 27B</th>
</tr>
</thead>
<tbody>
<tr>
<td>English</td>
<td>92.3</td>
<td>91.8</td>
<td>90.4</td>
</tr>
<tr>
<td>Chinese</td>
<td>78.4</td>
<td><strong>91.2</strong></td>
<td>72.1</td>
</tr>
<tr>
<td>German</td>
<td>85.6</td>
<td>86.1</td>
<td>83.2</td>
</tr>
<tr>
<td>Japanese</td>
<td>76.2</td>
<td><strong>87.8</strong></td>
<td>74.5</td>
</tr>
<tr>
<td>Serbian</td>
<td>68.1</td>
<td><strong>79.4</strong></td>
<td>61.3</td>
</tr>
<tr>
<td>Arabic</td>
<td>71.3</td>
<td><strong>82.7</strong></td>
<td>65.8</td>
</tr>
</tbody>
</table>
<p><strong>Qwen 3.5 dominates multilingual.</strong> The 250K vocabulary and 201-language training data give it a decisive edge on non-English tasks. For CJK languages especially, the gap is massive (87.8 vs 76.2 on Japanese). If you serve international users, this alone could make the decision.</p>
<p>Llama 4 is solid on European languages but weaker on CJK and non-Latin scripts. Gemma 3 trails across the board on multilingual.</p>
<h3>Inference Speed (Single User, Ollama, RTX 5090)</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>VRAM Used</th>
<th>Tok/s</th>
<th>TTFT</th>
<th>Total (256 tok)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 4 Scout Q6_K</td>
<td>29 GB</td>
<td>72 tok/s</td>
<td>245 ms</td>
<td>3.8s</td>
</tr>
<tr>
<td>Qwen 3.5-27B Q6_K</td>
<td>21 GB</td>
<td>98 tok/s</td>
<td>165 ms</td>
<td>2.8s</td>
</tr>
<tr>
<td>Gemma 3 27B Q6_K</td>
<td>20 GB</td>
<td>102 tok/s</td>
<td>158 ms</td>
<td>2.7s</td>
</tr>
<tr>
<td>Qwen 3.5-9B Q6_K</td>
<td>7.5 GB</td>
<td>161 tok/s</td>
<td>95 ms</td>
<td>1.7s</td>
</tr>
<tr>
<td>Gemma 3 12B Q6_K</td>
<td>9.2 GB</td>
<td>138 tok/s</td>
<td>112 ms</td>
<td>2.0s</td>
</tr>
</tbody>
</table>
<p><strong>Llama 4 Scout is the slowest</strong> despite having only 17B active parameters. The MoE routing overhead and the need to stream 109B parameters from VRAM kill single-user speed. Dense models win here: Gemma 3 and Qwen 3.5 at 27B are roughly 36-42% faster.</p>
<p>At the smaller tier, <strong>Qwen 3.5-9B is the speed champion</strong> at 161 tok/s — consistent with <a href="/ai-developer/quantization-methods-compared">our quantization benchmarks</a>.</p>
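<p>The "Total (256 tok)" column can be reproduced from the other two, assuming it is time to first token plus steady-state decoding of 256 tokens. The Python sketch below uses the figures from the table; the function name is illustrative.</p>

```python
def total_seconds(tokens_per_second, ttft_ms, num_tokens=256):
    """End-to-end latency: time to first token plus decode time."""
    return ttft_ms / 1000 + num_tokens / tokens_per_second


# (tok/s, TTFT in ms) taken from the table above.
RESULTS = {
    "Llama 4 Scout Q6_K": (72, 245),
    "Qwen 3.5-27B Q6_K": (98, 165),
    "Gemma 3 27B Q6_K": (102, 158),
    "Qwen 3.5-9B Q6_K": (161, 95),
    "Gemma 3 12B Q6_K": (138, 112),
}

for name, (tps, ttft) in RESULTS.items():
    print(f"{name}: {total_seconds(tps, ttft):.1f}s for 256 tokens")
```

<p>Changing <code>num_tokens</code> shows how the gap widens for longer responses, since decode time dominates once output runs past a few hundred tokens.</p>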
<h3>Context Window</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Max Context</th>
<th>Practical Limit</th>
</tr>
</thead>
<tbody>
<tr>
<td>Llama 4 Scout</td>
<td><strong>10M tokens</strong></td>
<td>~512K before quality degrades</td>
</tr>
<tr>
<td>Qwen 3.5-27B</td>
<td>131K tokens</td>
<td>~80K practical</td>
</tr>
<tr>
<td>Gemma 3 27B</td>
<td>128K tokens</td>
<td>~80K practical</td>
</tr>
</tbody>
</table>
<p>Llama 4 Scout's <strong>10 million token context</strong> is its killer feature. No other open model comes close. If you're building applications that need to process entire codebases, long documents, or maintain very long conversation histories, Scout is the only option.</p>
<p>In practice, quality degrades on very long contexts, but even the practical limit of ~512K tokens is roughly 4x the maximum context of the competitors and more than 6x their ~80K practical limits.</p>
<h2>Head-to-Head Summary</h2>
<table>
<thead>
<tr>
<th>Category</th>
<th>Winner</th>
<th>Runner-up</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>General reasoning</td>
<td>Llama 4 Scout</td>
<td>Qwen 3.5-27B</td>
<td>MoE knowledge capacity pays off</td>
</tr>
<tr>
<td>Mathematics</td>
<td><strong>Qwen 3.5-27B</strong></td>
<td>Llama 4 Scout</td>
<td>Qwen leads by 4.6 points on MATH and 6.6 on AIME</td>
</tr>
<tr>
<td>Coding</td>
<td><strong>Qwen 3.5-27B</strong></td>
<td>Llama 4 Scout</td>
<td>SWE-bench gap is significant</td>
</tr>
<tr>
<td>Multilingual</td>
<td><strong>Qwen 3.5-27B</strong></td>
<td>Llama 4 Scout</td>
<td>Massive CJK/non-Latin advantage</td>
</tr>
<tr>
<td>Inference speed</td>
<td><strong>Gemma 3 27B</strong></td>
<td>Qwen 3.5-27B</td>
<td>Dense beats MoE for single-user</td>
</tr>
<tr>
<td>VRAM efficiency</td>
<td><strong>Qwen 3.5-9B</strong></td>
<td>Gemma 3 12B</td>
<td>Best quality per GB</td>
</tr>
<tr>
<td>Context length</td>
<td><strong>Llama 4 Scout</strong></td>
<td>—</td>
<td>10M tokens, nothing comes close</td>
</tr>
<tr>
<td>License</td>
<td><strong>Qwen 3.5</strong></td>
<td>Gemma 3</td>
<td>Apache 2.0, most permissive</td>
</tr>
</tbody>
</table>
<h2>The Lightweight Tier: Qwen 3.5-9B vs Gemma 3 12B</h2>
<p>For deployments on consumer GPUs (RTX 4060-4090, 8-24 GB), the real comparison is Qwen 3.5-9B vs Gemma 3 12B:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Qwen 3.5-9B</th>
<th>Gemma 3 12B</th>
</tr>
</thead>
<tbody>
<tr>
<td>MMLU</td>
<td>78.2</td>
<td>76.8</td>
</tr>
<tr>
<td>HumanEval</td>
<td>72.6</td>
<td>69.1</td>
</tr>
<tr>
<td>GSM8K</td>
<td>85.4</td>
<td>81.2</td>
</tr>
<tr>
<td>Multilingual avg</td>
<td>81.3</td>
<td>72.6</td>
</tr>
<tr>
<td>Speed (Q6_K)</td>
<td><strong>161 tok/s</strong></td>
<td>138 tok/s</td>
</tr>
<tr>
<td>VRAM (Q6_K)</td>
<td><strong>7.5 GB</strong></td>
<td>9.2 GB</td>
</tr>
</tbody>
</table>
<p>Qwen 3.5-9B wins on every metric while using <em>less</em> VRAM and running <em>faster</em>. It's the clear choice for resource-constrained deployments.</p>
<h2>Licensing: Read the Fine Print</h2>
<table>
<thead>
<tr>
<th>Model</th>
<th>License</th>
<th>Commercial Use</th>
<th>Modifications</th>
<th>Restrictions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen 3.5</td>
<td>Apache 2.0</td>
<td>Unrestricted</td>
<td>Unrestricted</td>
<td>None</td>
</tr>
<tr>
<td>Gemma 3</td>
<td>Gemma Open</td>
<td>Yes</td>
<td>Yes</td>
<td>Must accept Google terms, some use restrictions</td>
</tr>
<tr>
<td>Llama 4</td>
<td>Llama Community</td>
<td>Yes (under 700M MAU)</td>
<td>Yes</td>
<td>Usage threshold, Meta's acceptable use policy</td>
</tr>
</tbody>
</table>
<p><strong>Apache 2.0 is the most permissive.</strong> No monthly active user limits, no acceptable use policies to comply with, and no separate terms to accept beyond keeping the license and attribution notices. For businesses building products on top of these models, Qwen's licensing is the least risky.</p>
<p>Llama 4's 700M MAU limit won't affect most businesses, but Meta's acceptable use policy adds compliance overhead. Gemma's terms are reasonable but still require acceptance and include some use restrictions.</p>
<h2>Decision Matrix</h2>
<table>
<thead>
<tr>
<th>If you need...</th>
<th>Use</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td>Best overall quality (32 GB GPU)</td>
<td><strong>Qwen 3.5-27B</strong></td>
<td>Wins math, coding, multilingual; close on reasoning</td>
</tr>
<tr>
<td>Maximum context window</td>
<td><strong>Llama 4 Scout</strong></td>
<td>10M tokens, nothing else comes close</td>
</tr>
<tr>
<td>Best quality under 10 GB VRAM</td>
<td><strong>Qwen 3.5-9B</strong></td>
<td>Faster, smaller, better than Gemma 3 12B</td>
</tr>
<tr>
<td>Fastest inference (single user)</td>
<td><strong>Gemma 3 27B</strong></td>
<td>Slightly faster than Qwen at same size</td>
</tr>
<tr>
<td>Non-English / CJK languages</td>
<td><strong>Qwen 3.5</strong></td>
<td>250K vocab, 201 languages, dominant multilingual</td>
</tr>
<tr>
<td>Most permissive license</td>
<td><strong>Qwen 3.5</strong></td>
<td>Apache 2.0, no restrictions</td>
</tr>
<tr>
<td>Coding / agentic workflows</td>
<td><strong>Qwen 3.5-27B</strong></td>
<td>Strongest on SWE-bench and LiveCodeBench</td>
</tr>
<tr>
<td>Whole-codebase analysis</td>
<td><strong>Llama 4 Scout</strong></td>
<td>Process entire repos in one context</td>
</tr>
</tbody>
</table>
<h2>Our Recommendation</h2>
<p><strong>For most deployments, Qwen 3.5 is the best choice.</strong> It wins or ties on 5 of 8 categories, has the most permissive license, and offers the widest range of model sizes (0.8B to 397B). The 9B dense model is the sweet spot for single-GPU setups; the 27B dense model is the best quality you can get on a 32 GB card.</p>
<p>If you read <a href="/ai-for-business/qwen-3-5-35b-knowledge-4b-speed-better-than-gpt-5">our Qwen 3.5 deep dive</a>, you know the MoE variant (35B-A3B) offers 35B knowledge at 3B compute speed, but it needs ~35 GB in FP8, which is more than even a 32 GB consumer GPU provides at that precision.</p>
<p><strong>Choose Llama 4 Scout when context length is critical.</strong> Processing a 200-page legal document, analyzing an entire codebase, or maintaining week-long conversation histories — these are tasks where Scout's 10M context is irreplaceable. Accept the slower inference speed as the trade-off.</p>
<p><strong>Choose Gemma 3 when you need Google ecosystem integration</strong> or when marginal inference speed differences matter. It's a solid model, but it doesn't lead in any quality benchmark category against Qwen 3.5 at the same size.</p>
<p>The open model ecosystem has matured remarkably. A year ago, Llama was the default choice. Today, the best self-hostable model for most use cases comes from Alibaba — and ships with Apache 2.0.</p>
<p><strong>Update (April 2026):</strong> Google released Gemma 4 and the rankings have changed dramatically. Read our follow-up: <a href="/ai-developer/gemma-4-vs-qwen-3-5-vs-llama-4-compared">Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader</a>.</p>]]></content:encoded>
      <category>research</category>
      <category>llm</category>
      <category>llama</category>
      <category>qwen</category>
      <category>gemma</category>
      <category>benchmarks</category>
      <category>comparison</category>
    </item>
    <item>
      <title>AI Privacy and Safety: What Every User Should Know</title>
      <link>https://ai.rs/learn-ai/ai-privacy-and-safety-basics</link>
      <guid isPermaLink="true">https://ai.rs/learn-ai/ai-privacy-and-safety-basics</guid>
      <pubDate>Thu, 05 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>When you type something into an AI chatbot, where does that data go? Can AI be biased? What should you never share with it? A practical guide to using AI safely.</description>
      <content:encoded><![CDATA[<h2>The Questions You Should Be Asking</h2>
<p>You probably use AI tools regularly now — for writing, research, brainstorming, maybe even sensitive work tasks. But have you thought about what happens to the data you share with them?</p>
<p>Most people haven't. And that's understandable — these tools are designed to feel like private conversations. But they're not, at least not in the way most people assume.</p>
<p>Let's walk through what you need to know to use AI safely and make informed decisions about your data.</p>
<h2>Where Does Your Data Go?</h2>
<p>When you type a message into ChatGPT, Claude, or any cloud-based AI tool, here's what typically happens:</p>
<ol>
<li>
<p><strong>Your message is encrypted and sent to the provider's servers.</strong> This is the same encryption used for online banking — your data is protected in transit.</p>
</li>
<li>
<p><strong>The message is processed by their AI model.</strong> The servers run your text through the model and generate a response.</p>
</li>
<li>
<p><strong>Your conversation is stored.</strong> This is where it gets interesting. Most providers store your conversations — the question is for how long and for what purpose.</p>
</li>
</ol>
<h3>What Providers Do with Your Data</h3>
<table>
<thead>
<tr>
<th>Provider</th>
<th>Stored?</th>
<th>Used for training?</th>
<th>How to opt out</th>
</tr>
</thead>
<tbody>
<tr>
<td>ChatGPT (free)</td>
<td>Yes</td>
<td>Yes, by default</td>
<td>Settings → Data Controls → Toggle off</td>
</tr>
<tr>
<td>ChatGPT (paid/API)</td>
<td>Yes</td>
<td>No, by default</td>
<td>Already opted out</td>
</tr>
<tr>
<td>Claude</td>
<td>Yes</td>
<td>No, by default</td>
<td>Already opted out on paid plans</td>
</tr>
<tr>
<td>Gemini</td>
<td>Yes</td>
<td>Yes, for some plans</td>
<td>Activity controls in Google account</td>
</tr>
<tr>
<td>Copilot (Enterprise)</td>
<td>Yes</td>
<td>No</td>
<td>Managed by organization</td>
</tr>
</tbody>
</table>
<p>The key distinction: <strong>storage</strong> (keeping your conversations for your own access and the provider's operations) vs. <strong>training</strong> (using your conversations to improve future models). Most providers let you opt out of training, but not all make it obvious.</p>
<h2>What You Should Never Share with AI</h2>
<p>Treat cloud AI like a knowledgeable colleague who works for another company. You'd share general questions and public information, but you wouldn't hand them:</p>
<ul>
<li><strong>Passwords or API keys</strong> — Never paste credentials into a chatbot. If they're stored on the provider's servers, they become a security risk.</li>
<li><strong>Personal identification</strong> — Social security numbers, passport numbers, driver's license numbers. There's no reason an AI needs these.</li>
<li><strong>Confidential business data</strong> — Trade secrets, unreleased financials, internal strategy documents. If it would be a problem if a competitor saw it, don't paste it into a cloud AI.</li>
<li><strong>Other people's private information</strong> — Medical records, personal conversations, financial details of clients or customers. You may be violating privacy laws by uploading this data to third-party services.</li>
<li><strong>Sensitive legal communications</strong> — Attorney-client privileged information loses its protection if shared with third parties, including AI services.</li>
</ul>
<h3>The &quot;Newspaper Test&quot;</h3>
<p>A simple rule of thumb: if you'd be uncomfortable seeing your AI conversation on the front page of a newspaper, don't have it with a cloud-based AI. Use a local model instead, where the data never leaves your device.</p>
<h2>AI Bias: What It Is and Why It Matters</h2>
<p>AI models learn from the internet, and the internet is not a neutral source. It reflects human biases — cultural, racial, gender, socioeconomic, and more. When AI learns from this data, it can absorb and amplify those biases.</p>
<h3>How Bias Shows Up</h3>
<p><strong>In language:</strong> Ask an AI to describe a &quot;CEO&quot; and you might get a description that skews male. Ask it to describe a &quot;nurse&quot; and it might skew female. The model is reflecting statistical patterns in its training data, not reality.</p>
<p><strong>In recommendations:</strong> AI systems trained on historical hiring data might favor candidates who match the profile of previously successful employees — which can encode past discrimination into future decisions.</p>
<p><strong>In representation:</strong> Image generation models trained primarily on Western internet content may default to depicting people and settings that reflect that narrow slice of the world.</p>
<p><strong>In knowledge depth:</strong> AI knows more about topics that are well-covered on the English-language internet and less about topics important to other cultures and languages.</p>
<h3>What You Can Do About It</h3>
<ul>
<li><strong>Be aware it exists.</strong> The first step is simply knowing that AI outputs can be biased, especially on topics involving people, cultures, or social issues.</li>
<li><strong>Question defaults.</strong> If an AI gives you a description, recommendation, or analysis that seems to favor one group, push back. Ask it to consider other perspectives.</li>
<li><strong>Don't use AI as the sole decision-maker for important choices</strong> about people — hiring, lending, medical treatment, legal matters. AI can inform decisions, but humans should make them.</li>
</ul>
<h2>AI and Misinformation</h2>
<p>AI models can generate convincing misinformation — not because they're designed to deceive, but because they're designed to generate plausible text. This creates risks:</p>
<ul>
<li><strong>Deepfakes and synthetic media</strong> — AI-generated images, audio, and video that look real but aren't</li>
<li><strong>Scalable misinformation</strong> — The ability to generate thousands of unique but false articles, social media posts, or reviews</li>
<li><strong>Authoritative-sounding nonsense</strong> — AI can write persuasive text about topics it has no actual knowledge of</li>
</ul>
<h3>Your Defense</h3>
<ul>
<li><strong>Verify before you share.</strong> If an AI gives you a surprising fact or statistic, check it with a reliable source before repeating it.</li>
<li><strong>Be skeptical of perfection.</strong> AI-generated content is often suspiciously polished. Real experts hedge, qualify, and acknowledge uncertainty.</li>
<li><strong>Look for sources.</strong> If someone presents AI-generated content as fact, ask for the underlying sources.</li>
</ul>
<h2>Practical Safety Tips</h2>
<p>Here are concrete steps you can take right now:</p>
<h3>1. Review Your Privacy Settings</h3>
<p>Every major AI tool has privacy and data settings. Spend five minutes finding them and understanding what's enabled by default. Turn off training data sharing if you prefer.</p>
<h3>2. Use the Right Tool for the Sensitivity Level</h3>
<table>
<thead>
<tr>
<th>Sensitivity</th>
<th>Recommended Approach</th>
</tr>
</thead>
<tbody>
<tr>
<td>General questions, brainstorming</td>
<td>Any cloud AI is fine</td>
</tr>
<tr>
<td>Work tasks with some business context</td>
<td>Cloud AI with training opt-out</td>
</tr>
<tr>
<td>Sensitive business or personal data</td>
<td>Local AI (runs on your device)</td>
</tr>
<tr>
<td>Regulated data (health, finance, legal)</td>
<td>Local AI or enterprise solutions with compliance guarantees</td>
</tr>
</tbody>
</table>
<h3>3. Don't Over-share in Prompts</h3>
<p>You can often get the help you need without sharing the actual sensitive data. Instead of pasting a real contract, describe the type of clause you need help with. Instead of sharing real customer data, create a fictional example with the same structure.</p>
<h3>4. Teach Your Team</h3>
<p>If you work in an organization, make sure everyone understands the basics of AI data handling. One employee pasting customer data into a free AI tool can create a liability for the entire company.</p>
<h3>5. Stay Current</h3>
<p>AI privacy policies change frequently. What's true today may not be true in six months. Check the privacy policy of your AI tools periodically, especially after major updates.</p>
<h2>The Balanced View</h2>
<p>AI tools are genuinely useful, and the risks are manageable with basic awareness. You don't need to avoid AI — you need to use it thoughtfully, the same way you'd be thoughtful about what you share in any professional context.</p>
<p>The companies building these tools are generally improving on privacy and safety. Opt-out options are becoming more common, local AI is becoming more accessible, and regulations are pushing providers toward better data practices.</p>
<p>Your job is simply to be an informed user: understand where your data goes, know what's appropriate to share, recognize that AI can be biased and sometimes wrong, and make conscious choices about which tool to use for which task.</p>
<p><strong>Want to see how this applies to real business?</strong> <a href="/how-it-works.php">See how it works</a> — custom AI assistants that know your products, respect your data, and work 24/7.</p>
<p><strong>Not sure where to start?</strong> Take our free <a href="/ai-readiness.php">AI Readiness Assessment</a> — personalized recommendations in 2 minutes.</p>]]></content:encoded>
      <category>beginner</category>
      <category>ai-safety</category>
      <category>privacy</category>
      <category>bias</category>
    </item>
    <item>
      <title>When the Memory Wall Disappears: What Actually Bottlenecks LLM Inference on Modern GPUs</title>
      <link>https://ai.rs/ai-developer/memory-wall-disappears-llm-inference-bottlenecks</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/memory-wall-disappears-llm-inference-bottlenecks</guid>
      <pubDate>Thu, 05 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Pin a quantized 135M model in L2 cache and the memory wall vanishes. What replaces it — dispatch overhead from hundreds of tiny kernel launches — reveals why ASICs exist.</description>
      <content:encoded><![CDATA[<p>ASIC chips designed for LLM inference are arriving. Groq's LPU, Cerebras's WSE, and a wave of startups are all chasing the same insight: autoregressive token generation is memory-bound, so build hardware with massive on-chip SRAM and skip the DRAM bottleneck entirely. The pitch is compelling — if your weights live on-chip, you eliminate the memory wall and inference becomes compute-limited.</p>
<p>But here's a question worth asking: what happens when you simulate this on a commodity GPU today? NVIDIA's RTX 5090 ships with 96 MB of L2 cache. A quantized 135M-parameter model fits in 85 MB. If you pin those weights in L2, you've effectively built a poor man's ASIC — all weights on-chip, no DRAM round-trips during generation.</p>
<p>This article documents what we found when we tried it. Spoiler: the memory wall does disappear. What replaces it is more interesting.</p>
<h2>The Setup: SmolLM2-135M on RTX 5090</h2>
<p>We built a custom CUDA inference engine from scratch for SmolLM2-135M, a 30-layer transformer with 576-dimensional hidden state, 9 query heads, 3 KV heads (GQA), and a 1536-dimensional FFN. The architecture is standard — RMSNorm, RoPE, grouped-query attention, SwiGLU MLP — just small enough to be interesting.</p>
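<p>The model's weights are stored in GGUF's IQ4_NL and IQ4_XS quantization formats. IQ4_NL packs 32 values into 18 bytes: a half-precision scale factor and 16 bytes of 4-bit indices into a non-linear lookup table. The lookup table is declared in CUDA constant memory; the kernels shown later read it through a shared-memory copy (<code>s_kv</code>), because lanes index it with different values and constant memory only broadcasts efficiently when a whole warp reads the same address:</p>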
<pre><code class="language-cuda">__device__ __constant__ float d_kvalues_iq4nl[16] = {
    -127.f, -104.f, -83.f, -65.f, -49.f, -35.f, -22.f, -10.f,
       1.f,   13.f,  25.f,  38.f,  53.f,  69.f,  89.f, 113.f
};</code></pre>
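<p>To make the block layout concrete, here is a minimal Python reference that expands one IQ4_NL block. It is a sketch for checking the bit layout, not the engine's code: the nibble order (byte <code>j</code> holds value <code>j</code> in its low nibble and value <code>j + 16</code> in its high nibble) is inferred from the mat-vec kernel later in this article, and the helper names are hypothetical.</p>
<pre><code class="language-python">import struct

# Non-linear code book, copied from the d_kvalues_iq4nl table above.
KVALUES_IQ4NL = [-127, -104, -83, -65, -49, -35, -22, -10,
                 1, 13, 25, 38, 53, 69, 89, 113]

BLOCK_BYTES = 18    # 2-byte half scale + 16 bytes of packed 4-bit indices
BLOCK_VALUES = 32


def half_from_bits(bits):
    """Reinterpret a 16-bit integer as an IEEE half-precision float."""
    return struct.unpack("e", struct.pack("H", bits))[0]


def dequantize_iq4nl_block(raw):
    """Expand one 18-byte block into 32 floats (assumed layout, see text)."""
    if len(raw) != BLOCK_BYTES:
        raise ValueError("an IQ4_NL block is exactly 18 bytes")
    d = half_from_bits(int.from_bytes(raw[:2], "little"))
    qs = raw[2:]
    low = [d * KVALUES_IQ4NL[q % 16] for q in qs]     # q % 16 is the low nibble
    high = [d * KVALUES_IQ4NL[q // 16] for q in qs]   # q // 16 is the high nibble
    return low + high


bits_per_weight = BLOCK_BYTES * 8 / BLOCK_VALUES      # 4.5

# Example block: scale 0.5 (half bits 0x3800), every index byte 0x80,
# so the low nibble is index 0 and the high nibble is index 8.
example = (0x3800).to_bytes(2, "little") + bytes([0x80] * 16)
values = dequantize_iq4nl_block(example)</code></pre>
<p>The first 16 outputs are 0.5 × -127 and the last 16 are 0.5 × 1. The 4.5 bits per weight figure is what makes the 85 MB pool possible.</p>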
<p>The total weight pool — all 30 layers of IQ4_NL/IQ4_XS projections, Q8_0 embeddings, FP16 norms — comes to 85 MB.</p>
<p>The RTX 5090 (Blackwell, SM 12.0) has 96 MB of L2 cache. At engine startup, we pin the weight pool into L2 using <code>cudaStreamSetAttribute</code> with a <code>cudaAccessPolicyWindow</code>. The resulting configuration:</p>
<table>
<thead>
<tr>
<th>Property</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU</td>
<td>RTX 5090 (Blackwell)</td>
</tr>
<tr>
<td>VRAM</td>
<td>32 GB GDDR7, 1,790 GB/s</td>
</tr>
<tr>
<td>L2 cache</td>
<td>96 MB</td>
</tr>
<tr>
<td>Weight pool</td>
<td>85 MB (IQ4_NL/IQ4_XS + Q8_0)</td>
</tr>
<tr>
<td>L2 hit ratio</td>
<td>~100% during generation</td>
</tr>
</tbody>
</table>
<p>Once the weights are warm in L2, every mat-vec reads from on-chip cache. No DRAM traffic for weights. This is the ASIC scenario.</p>
<h2>Phase 1: Naive FP16 — 750 tok/s</h2>
<p>The first version used FP16 weights and straightforward kernels: one RMSNorm, one mat-vec per projection, separate RoPE, separate KV cache writes. This was the baseline to validate correctness.</p>
<p>At 750 tok/s for 128-token generation, it was already faster than running the same model under most Python-based frameworks, but well below llama.cpp's 1,110 tok/s. The FP16 weight pool was too large for L2 pinning, so this phase still hit DRAM for weights.</p>
<h2>Phase 2: IQ4 Quantization + L2 Pinning — 1,255 tok/s</h2>
<p>Switching to IQ4_NL/IQ4_XS quantization (loaded directly from GGUF, no conversion) shrank the weight pool from ~270 MB to 85 MB. Now it fits in L2.</p>
<p>The mat-vec kernel design uses one warp (32 threads) per output row. Each warp iterates over IQ4 blocks, dequantizing through the shared-memory lookup table and accumulating a dot product against the input vector (also in shared memory). The warp reduction is a standard shuffle tree:</p>
<pre><code class="language-cuda">template&lt;bool FUSE_RESIDUAL = false&gt;
__global__ void matvec_iq4nl(half* __restrict__ out,
                              const void* __restrict__ W,
                              const half* __restrict__ x,
                              half* __restrict__ residual,
                              int out_dim, int in_dim) {
    // ... cooperative load of x into shared memory ...
    const int row = blockIdx.x * warps_per_block + warp_id;  // one warp per output row
    float sum = 0.0f;
    for (int b = 0; b &lt; blocks_per_row; b++) {
        float d = __half2float(row_blocks[b].d);   // per-block scale
        uint8_t q = row_blocks[b].qs[lane &amp; 15];   // lanes i and i+16 read the same byte
        int shift = (lane &gt;&gt; 4) &lt;&lt; 2;              // 0 for lanes 0-15 (low nibble), 4 for lanes 16-31 (high nibble)
        int idx = (q &gt;&gt; shift) &amp; 0xf;              // 4-bit index into the lookup table
        float w = d * s_kv[idx];                   // dequantized weight
        sum += w * s_x[b * 32 + lane];             // each lane covers one of the block's 32 values
    }
    // warp shuffle reduction ...
}</code></pre>
<p>With L2 pinning, this hit 1,255 tok/s. A 67% improvement over FP16, mostly from the L2 effect — weights served at L2 bandwidth (~3-4 TB/s effective) instead of DRAM (1,790 GB/s peak).</p>
<p>At this point, the memory wall was gone. Now what?</p>
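<p>For readers less familiar with the warp-per-row pattern, the reduction step can be emulated on the CPU. The sketch below is illustrative Python rather than the engine's code: it treats a list of 32 numbers as the per-lane partial sums and applies the usual shuffle-down offsets (16, 8, 4, 2, 1). The function names are hypothetical.</p>
<pre><code class="language-python">WARP_SIZE = 32


def lane_products(weights, x):
    """Each of the 32 lanes multiplies one weight by one input element."""
    return [weights[lane] * x[lane] for lane in range(WARP_SIZE)]


def shuffle_tree_reduce(values):
    """Emulate a shuffle-down reduction with offsets 16, 8, 4, 2, 1.

    At every step lane i adds the value held by lane i + offset.
    After five steps lane 0 holds the sum of all 32 inputs.
    """
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset != 0:
        # Read from a snapshot so every lane sees pre-step values,
        # as a hardware shuffle does.
        snapshot = list(vals)
        for lane in range(WARP_SIZE - offset):
            vals[lane] = snapshot[lane] + snapshot[lane + offset]
        offset //= 2
    return vals[0]


partials = lane_products([1.0] * WARP_SIZE, [float(i) for i in range(WARP_SIZE)])
row_result = shuffle_tree_reduce(partials)</code></pre>
<p>In the real kernel each lane first accumulates its partial sum across all blocks of the row, and only lane 0 writes the reduced result.</p>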
<h2>The Dead End: Optimizing the Inner Loop</h2>
<p>The natural instinct was to optimize the compute. IQ4_NL dequantization requires a shared-memory table lookup — what if we converted everything to Q8_0 at load time? Q8_0 dequant is a simple <code>d * qs[i]</code>, no lookup needed.</p>
<p>We tried it. Mat-vec bandwidth improved from 95 to 152 GB/s. But tok/s barely moved: 1,255 to 1,262.</p>
<p>Why? Two reasons. First, Q8_0 is 34 bytes per 32 values vs. IQ4_NL's 18 bytes. The weight pool grew from 85 to 136 MB — too large for L2 pinning. We traded lookup latency for cache misses. Second, and more fundamental: the layer matrices are tiny. The largest FFN projection is 1536 rows of 576 elements. At that size, a single mat-vec completes in microseconds regardless of dequant cost. The kernel finishes before the GPU has time to be bottlenecked on anything.</p>
<p>The real bottleneck was hiding in the profile output. Each forward pass launched 301 kernels. Each kernel launch costs ~2.5 microseconds of driver overhead. That's 750 microseconds of pure launch tax — almost the entire per-token time budget of 792 microseconds.</p>
<p>The memory wall was gone. The dispatch wall had replaced it.</p>
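<p>The arithmetic behind that conclusion is easy to reproduce. The following back-of-envelope sketch uses only figures quoted in this article (301 kernels, ~2.5 microseconds per launch, a 792 microsecond budget, an 85 MB pool, 1,790 GB/s peak DRAM bandwidth). It assumes MB means 10^6 bytes and that peak bandwidth is fully achieved, which tiny kernels never manage, so the streaming time is a lower bound.</p>
<pre><code class="language-python">def launch_tax_us(kernels_per_token, launch_cost_us):
    """Total launch overhead per generated token, in microseconds."""
    return kernels_per_token * launch_cost_us


def stream_time_us(pool_bytes, bandwidth_bytes_per_s):
    """Time to read the whole weight pool once at a given bandwidth."""
    return pool_bytes / bandwidth_bytes_per_s * 1e6


tax = launch_tax_us(301, 2.5)               # 752.5 us
dram_read = stream_time_us(85e6, 1790e9)    # about 47.5 us at peak bandwidth
budget = 792.0                              # per-token time quoted above
tax_share = tax / budget                    # about 0.95</code></pre>
<p>Even a single pass over the weights at DRAM speed is a small slice of the per-token budget, while launch overhead alone accounts for roughly 95% of it.</p>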
<h2>Phase 3: Kernel Fusion — 1,508 tok/s</h2>
<p>Once we identified dispatch overhead as the bottleneck, the optimization strategy flipped. Instead of making individual kernels faster, we needed fewer of them. Each of the 30 layers ran 10 kernels. We fused them down to 6.</p>
<h3>Fusion 1: Residual Addition into Mat-Vec</h3>
<p>After the attention output projection and the FFN down projection, the original code ran a separate <code>vec_add</code> kernel to accumulate the residual:</p>
<pre><code class="language-cuda">// Before: two kernel launches
matvec_iq4nl&lt;&lt;&lt;...&gt;&gt;&gt;(xb, attn_output, attn_out, nullptr, DIM, DIM);
vec_add&lt;&lt;&lt;...&gt;&gt;&gt;(x, x, xb, DIM);</code></pre>
<p>The vec_add kernel reads and writes 576 half values. It takes about 2 microseconds of compute but 2.5 microseconds to launch. We added a template parameter to the mat-vec kernel:</p>
<pre><code class="language-cuda">if (lane == 0) {
    float result = sum;
    if constexpr (FUSE_RESIDUAL) {
        result += __half2float(residual[row]);
        residual[row] = __float2half(result);
    }
    out[row] = __float2half(result);
}</code></pre>
<p>Two lines of code. One fewer kernel launch per fusion site, two sites per layer, 60 launches eliminated.</p>
<h3>Fusion 2: Gate/Up Projection + SwiGLU</h3>
<p>The FFN block computes <code>silu(gate(x)) * up(x)</code> where gate and up are separate linear projections. The original code ran a fused RMSNorm + gate/up mat-vec (dispatching 384 blocks for 1536+1536 output rows) followed by a separate SwiGLU kernel.</p>
<p>We rewrote this so each warp computes both the gate and up dot products in a single pass over the normalized input in shared memory, then applies SwiGLU inline:</p>
<pre><code class="language-cuda">for (int b = 0; b &lt; blocks_per_row; b++) {
    float xval = s_xn[b * 32 + lane];

    // Gate dot product
    float dg = __half2float(gate_row_blocks[b].d);
    uint8_t qg = gate_row_blocks[b].qs[lane &amp; 15];
    gate_sum += (dg * s_kv[(qg &gt;&gt; shift) &amp; 0xf]) * xval;

    // Up dot product
    float du = __half2float(up_row_blocks[b].d);
    uint8_t qu = up_row_blocks[b].qs[lane &amp; 15];
    up_sum += (du * s_kv[(qu &gt;&gt; shift) &amp; 0xf]) * xval;
}
// After warp reduction of both accumulators:
float silu_gate = gate_sum / (1.0f + expf(-gate_sum));
gate_out[row] = __float2half(silu_gate * up_sum);</code></pre>
<p>This halves the grid from 384 to 192 blocks, eliminates the SwiGLU kernel, and avoids writing the intermediate <code>up_out</code> buffer to DRAM. One fewer launch per layer, 30 eliminated.</p>
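<p>A small CPU-side model shows why this fusion does not change the result. The Python sketch below is a reference for the arithmetic only (no quantization, no warps); the function names are hypothetical. Both versions add the same products in the same order, so they agree exactly.</p>
<pre><code class="language-python">import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def unfused(gate_row, up_row, x):
    """Two separate dot products, then a separate SwiGLU step."""
    gate = 0.0
    for g, xi in zip(gate_row, x):
        gate += g * xi
    up = 0.0
    for u, xi in zip(up_row, x):
        up += u * xi
    return silu(gate) * up


def fused(gate_row, up_row, x):
    """One pass over x feeding both accumulators, SwiGLU applied inline."""
    gate_sum = 0.0
    up_sum = 0.0
    for g, u, xi in zip(gate_row, up_row, x):
        gate_sum += g * xi
        up_sum += u * xi
    return silu(gate_sum) * up_sum


x = [1.0, 2.0, 3.0]
gate_row = [0.5, -1.0, 2.0]    # dot product with x is 4.5
up_row = [1.0, 0.5, 0.5]       # dot product with x is 3.5
same = fused(gate_row, up_row, x) == unfused(gate_row, up_row, x)</code></pre>
<p>The grid sizes follow from one warp per row with eight warps per 256-thread block: 3,072 rows need 384 blocks, 1,536 rows need 192.</p>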
<h3>Fusion 3: RoPE + KV Cache Write</h3>
<p>RoPE (rotary position embeddings) and KV cache writes are both small operations: q has 576 dimensions, while k and v have 192 each (3 KV heads of 64). We fused them into a single kernel of 384 threads (one CUDA block):</p>
<pre><code class="language-cuda">__global__ void fused_rope_kv_write(half* q, half* k, half* v,
                                    half* key_cache, half* value_cache,
                                    const int* pos_ptr, ...) {
    // Phase 1: threads 0-287 apply RoPE to q (9 heads * 32 pairs)
    //          threads 288-383 apply RoPE to k (3 heads * 32 pairs)
    __syncthreads();
    // Phase 2: threads 0-191 write k to cache
    //          threads 192-383 write v to cache
}</code></pre>
<p>Two kernel launches replaced by one, 30 more eliminated across all layers.</p>
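<p>The thread assignment in that kernel can be checked with a few lines of Python. This is a sketch of the index arithmetic implied by the comments in the kernel, with a head dimension of 64 derived from 576 / 9; the function names are hypothetical, and how the two elements of each RoPE pair are laid out within a head is not specified here.</p>
<pre><code class="language-python">N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 9, 3, 64
PAIRS_PER_HEAD = HEAD_DIM // 2           # RoPE rotates pairs of values
KV_DIM = N_KV_HEADS * HEAD_DIM           # 192 values each for k and v


def rope_task(tid):
    """Phase 1: which (tensor, head, pair) a thread rotates."""
    q_threads = N_Q_HEADS * PAIRS_PER_HEAD          # 288
    if tid in range(q_threads):
        return ("q", tid // PAIRS_PER_HEAD, tid % PAIRS_PER_HEAD)
    local = tid - q_threads
    return ("k", local // PAIRS_PER_HEAD, local % PAIRS_PER_HEAD)


def cache_task(tid):
    """Phase 2: which cache element a thread writes."""
    if tid in range(KV_DIM):
        return ("k", tid)
    return ("v", tid - KV_DIM)


BLOCK_THREADS = (N_Q_HEADS + N_KV_HEADS) * PAIRS_PER_HEAD   # 384</code></pre>
<p>Both phases use all 384 threads: 288 + 96 RoPE pairs in the first, 192 + 192 cache writes in the second, which is why a single block is enough.</p>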
<h3>The Result</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Phase 2</th>
<th>Phase 3</th>
<th>Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dispatches per forward</td>
<td>301</td>
<td>181</td>
<td>-120 (-40%)</td>
</tr>
<tr>
<td>128 tokens: tok/s</td>
<td>1,327</td>
<td>1,508</td>
<td>+13.7%</td>
</tr>
<tr>
<td>128 tokens: per token</td>
<td>754 us</td>
<td>663 us</td>
<td>-91 us</td>
</tr>
<tr>
<td>256 tokens: tok/s</td>
<td>1,156</td>
<td>1,269</td>
<td>+9.8%</td>
</tr>
<tr>
<td>256 tokens: per token</td>
<td>865 us</td>
<td>788 us</td>
<td>-77 us</td>
</tr>
</tbody>
</table>
<p>Output is byte-identical between Phase 2 and Phase 3. The fusions are mathematically exact — same accumulation order, same precision, just fewer kernel boundaries.</p>
<p>The improvement shrinks at longer sequences because attention cost grows with sequence length, while the dispatch savings stay in the same range (77 to 91 microseconds per token in the table above).</p>
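<p>The dispatch counts in the table can be rebuilt from the per-layer numbers given in this article. A short Python sketch, using only those figures (30 layers, 10 kernels per layer before fusion, one final RMSNorm + lm_head kernel):</p>
<pre><code class="language-python">N_LAYERS = 30
KERNELS_BEFORE = 10        # per layer, before fusion
SAVED_PER_LAYER = {
    "residual into mat-vec": 2,   # two fusion sites per layer
    "gate/up + SwiGLU": 1,
    "RoPE + KV write": 1,
}
FINAL_KERNELS = 1          # RMSNorm + lm_head

kernels_after = KERNELS_BEFORE - sum(SAVED_PER_LAYER.values())
dispatches_before = N_LAYERS * KERNELS_BEFORE + FINAL_KERNELS
dispatches_after = N_LAYERS * kernels_after + FINAL_KERNELS


def tok_per_s(per_token_us):
    """Convert a per-token latency in microseconds to throughput."""
    return 1e6 / per_token_us</code></pre>
<p>This yields 6 kernels per layer and 301 versus 181 dispatches. Converting the 128-token latencies back to throughput gives about 1,326 and 1,508 tok/s, matching the table to within rounding.</p>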
<h2>The Forward Pass: 6 Kernels Per Layer</h2>
<p>After fusion, each transformer layer runs exactly 6 kernel launches:</p>
<pre><code class="language-cuda">for (int l = 0; l &lt; N_LAYERS; l++) {
    // 1. Fused: RMSNorm + QKV projection (IQ4_NL, 120 blocks)
    fused_rmsnorm_qkv_iq4nl&lt;&lt;&lt;mv_grid(960), 256, smem, stream&gt;&gt;&gt;(...);

    // 2. Fused: RoPE + KV cache write (1 block, 384 threads)
    fused_rope_kv_write&lt;&lt;&lt;1, 384, 0, stream&gt;&gt;&gt;(...);

    // 3. GQA attention (9 blocks, one per head)
    gqa_attention_device&lt;&lt;&lt;9, 256, smem, stream&gt;&gt;&gt;(...);

    // 4. Attention output projection + residual (72 blocks)
    matvec_iq4nl&lt;true&gt;&lt;&lt;&lt;mv_grid(576), 256, smem, stream&gt;&gt;&gt;(...);

    // 5. Fused: RMSNorm + gate/up + SwiGLU (192 blocks)
    fused_rmsnorm_gate_up_swiglu_iq4nl&lt;&lt;&lt;mv_grid(1536), 256, smem, stream&gt;&gt;&gt;(...);

    // 6. FFN down projection + residual (72 blocks)
    matvec_iq4xs&lt;true&gt;&lt;&lt;&lt;mv_grid(576), 256, smem, stream&gt;&gt;&gt;(...);
}</code></pre>
<p>Plus one final kernel for RMSNorm + lm_head. Total: 181 dispatches, captured as a CUDA graph and replayed each token.</p>
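<p>The block counts in the comments above are consistent with a simple grid helper. The article does not show <code>mv_grid</code>, so the version below is an assumption: one warp per output row, eight warps per 256-thread block, rounded up.</p>
<pre><code class="language-python">ROWS_PER_BLOCK = 256 // 32     # 8 warps per 256-thread block, one row each


def mv_grid(out_rows):
    """Blocks needed when each warp produces one output row (assumed helper)."""
    return -(-out_rows // ROWS_PER_BLOCK)    # ceiling division


grids = {
    "qkv (576 + 192 + 192 rows)": mv_grid(960),
    "attention output": mv_grid(576),
    "gate/up + SwiGLU": mv_grid(1536),
    "ffn down": mv_grid(576),
}</code></pre>
<p>Under that assumption the four mat-vec launches use 120, 72, 192, and 72 blocks, the same figures as in the listing.</p>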
<h2>What This Tells Us About the ASIC Thesis</h2>
<p>The ASIC pitch is &quot;put weights on-chip and inference gets fast.&quot; Our experiment confirms the first half: L2 pinning does eliminate the memory wall, and you get a significant speedup from quantization strategies that make your model fit.</p>
<p>But the second half — that inference then becomes compute-limited — doesn't hold for small models on GPUs. What we found instead is a third regime: dispatch-limited inference, where the overhead of launching hundreds of tiny kernels dominates both compute and memory access time.</p>
<p>This matters because it's a bottleneck that ASICs solve structurally. A hardwired transformer pipeline doesn't have kernel launch overhead. It's a static dataflow graph etched in silicon. GPUs, by contrast, pay a tax for their generality: the driver must set up registers, configure shared memory, and schedule thread blocks for every kernel launch, even if the kernel runs for 3 microseconds.</p>
<table>
<thead>
<tr>
<th>Bottleneck</th>
<th>Phase</th>
<th>Tok/s</th>
<th>What limits performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>Memory bandwidth</td>
<td>Phase 1 (FP16)</td>
<td>750</td>
<td>Weights in DRAM, 1,790 GB/s bus</td>
</tr>
<tr>
<td>Still memory, but less</td>
<td>Phase 2 (IQ4 + L2)</td>
<td>1,255</td>
<td>Weights in L2, compute is trivial</td>
</tr>
<tr>
<td>Dispatch overhead</td>
<td>Phase 3 (fused)</td>
<td>1,508</td>
<td>181 launches at ~2.5 us each</td>
</tr>
</tbody>
</table>
<p>At 1,508 tok/s with 128 tokens, per-token time is 663 microseconds. The 181 dispatches account for roughly 450 microseconds of that. Actual compute is somewhere around 200 microseconds. There's a 2-3x speedup still on the table if dispatch overhead were zero — which is roughly what an ASIC achieves.</p>
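<p>That ceiling is a one-line calculation. The sketch below assumes the simple model used throughout this article, a fixed ~2.5 microseconds per dispatch with everything else counted as compute, so treat the result as an estimate rather than a measurement.</p>
<pre><code class="language-python">def dispatch_free_speedup(per_token_us, dispatches, launch_cost_us):
    """Upper bound on speedup if launch overhead dropped to zero."""
    overhead = dispatches * launch_cost_us
    remaining = per_token_us - overhead
    return per_token_us / remaining


overhead_us = 181 * 2.5                             # 452.5
ceiling = dispatch_free_speedup(663.0, 181, 2.5)    # about 3.15</code></pre>
<p>With 663 microseconds per token and 181 dispatches, about 210 microseconds remain for real work, which puts the dispatch-free ceiling at roughly 3x.</p>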
<h2>Diminishing Returns and What's Next</h2>
<p>The remaining 181 dispatches are harder to fuse pairwise. The QKV projection is already fused (3 weight matrices, 1 kernel). Attention is inherently a single kernel. The two remaining mat-vecs (attention output, FFN down) need their inputs computed first.</p>
<p>The next lever is a persistent kernel: instead of launching 6 kernels per layer, launch a single kernel that executes all 6 operations using block-level synchronization. This eliminates inter-kernel dispatch overhead within a layer entirely, potentially cutting per-token time by another 200+ microseconds. It also makes the code significantly harder to write — you're essentially building a manual scheduler inside a kernel.</p>
<p>Beyond that, speculative decoding is the orthogonal win. Rather than making one forward pass faster, generate multiple candidate tokens per pass and verify them. This is multiplicative with all the kernel-level optimizations.</p>
<h2>Practical Takeaways</h2>
<p>For model deployment: If your quantized model fits in L2, you're in a fundamentally different performance regime. Check your GPU's L2 size and do the math. At IQ4_NL's 4.5 bits per weight (18 bytes per 32 values), the RTX 5090's 96 MB holds at most about 170M parameters, counting weights only. The RTX 4090's 72 MB holds about 128M, which is too small for the 85 MB pool used here.</p>
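<p>Doing that math takes a few lines. The sketch below is a weights-only estimate with hypothetical names; it assumes MB means 10^6 bytes, a single uniform quantization format, and that the whole cache is available for weights.</p>
<pre><code class="language-python">def max_params_in_cache(cache_mb, bits_per_weight):
    """Largest parameter count whose weights alone fill the cache."""
    cache_bits = cache_mb * 1e6 * 8
    return cache_bits / bits_per_weight


rtx5090 = max_params_in_cache(96, 4.5)    # about 171 million
rtx4090 = max_params_in_cache(72, 4.5)    # about 128 million</code></pre>
<p>Mixed formats lower the ceiling further: the 135M model in this article needs 85 MB because its embeddings are stored in Q8_0, so it is worth rerunning the numbers for your own format before relying on a rule of thumb.</p>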
<p>For kernel development: Profile dispatches, not just compute. NVIDIA's Nsight tools report kernel launch overhead, but it's easy to overlook when individual kernels show microsecond execution times. The intuition that &quot;the kernel is fast, so the code is fast&quot; breaks down when you're launching hundreds of them.</p>
<p>For the ASIC vs. GPU question: Modern GPUs can already simulate the on-chip-weight scenario for small models, and the results are informative. The memory wall is real but solvable with quantization and cache pinning. What you find underneath is the dispatch wall — and solving that on a GPU requires increasingly aggressive kernel fusion, eventually converging on something that looks a lot like a hardwired pipeline. At some point, you're fighting the GPU's generality rather than leveraging it, and that's exactly the gap ASICs are designed to fill.</p>
<p>The code is open and the numbers are reproducible. SmolLM2-135M is small enough to experiment with in an afternoon but architecturally identical to models 100x its size. Every technique here — IQ4 quantization, L2 pinning, warp-per-row mat-vec, kernel fusion — transfers directly. The only thing that changes at scale is which wall you hit first.</p>]]></content:encoded>
      <category>research</category>
      <category>cuda</category>
      <category>inference</category>
      <category>gpu</category>
      <category>quantization</category>
      <category>kernel-fusion</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>Qwen 3.5: 35B Knowledge at 4B Speed — Better Than GPT-5?</title>
      <link>https://ai.rs/ai-for-business/qwen-3-5-35b-knowledge-4b-speed-better-than-gpt-5</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/qwen-3-5-35b-knowledge-4b-speed-better-than-gpt-5</guid>
      <pubDate>Wed, 04 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Alibaba&#039;s Qwen 3.5 drops 8 models in two weeks — from 0.8B to 397B parameters. We break down the Mixture of Experts architecture, exact VRAM requirements for every model, and how it stacks up against GLM-5, DeepSeek, and Kimi.</description>
      <content:encoded><![CDATA[<p>Alibaba released Qwen 3.5 between February 16 and March 2, 2026 — eight models spanning 0.8B to 397B parameters, all Apache 2.0 licensed. The flagship model claims to beat GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro across 80% of benchmark categories.</p>
<p>But benchmarks are benchmarks. What matters for deployment: how much VRAM do you actually need, and is the Mixture of Experts architecture worth the memory trade-off?</p>
<hr />
<h2>The Full Lineup</h2>
<p>Qwen 3.5 ships in two flavors: <strong>dense</strong> models where every parameter fires on every token, and <strong>MoE</strong> (Mixture of Experts) models where a router selects a subset of parameters per token.</p>
<h3>Dense Models</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Parameters</th>
<th>BF16 Memory</th>
<th>FP8 Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3.5-0.8B</td>
<td>873M</td>
<td>1.63 GB</td>
<td>—</td>
</tr>
<tr>
<td>Qwen3.5-2B</td>
<td>2.27B</td>
<td>4.24 GB</td>
<td>—</td>
</tr>
<tr>
<td>Qwen3.5-4B</td>
<td>4.66B</td>
<td>8.68 GB</td>
<td>—</td>
</tr>
<tr>
<td>Qwen3.5-9B</td>
<td>9.65B</td>
<td>17.98 GB</td>
<td>—</td>
</tr>
<tr>
<td>Qwen3.5-27B</td>
<td>27.78B</td>
<td>51.75 GB</td>
<td>28.75 GB</td>
</tr>
</tbody>
</table>
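<p>The small models (0.8B through 9B) are BF16-only; no FP8 variants have been published. The 27B model gets an FP8 option that nearly halves the memory footprint. The BF16 figures equal parameters × 2 bytes ÷ 2<sup>30</sup>, that is, binary gigabytes (GiB) for the weights alone, so leave extra room for the KV cache and activations.</p>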
<h3>MoE Models</h3>
<p>The naming convention tells you everything: <strong>35B-A3B</strong> means 35B total parameters, 3B active per token.</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Total Params</th>
<th>Active Params</th>
<th>BF16 Memory</th>
<th>FP8 Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3.5-35B-A3B</td>
<td>35.95B</td>
<td>~3B</td>
<td>66.97 GB</td>
<td>34.88 GB</td>
</tr>
<tr>
<td>Qwen3.5-122B-A10B</td>
<td>125.09B</td>
<td>~10B</td>
<td>232.99 GB</td>
<td>118.42 GB</td>
</tr>
<tr>
<td>Qwen3.5-397B-A17B</td>
<td>403.40B</td>
<td>~17B</td>
<td>751.39 GB</td>
<td>378.23 GB</td>
</tr>
</tbody>
</table>
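<p>The 397B flagship needs 378 GB in FP8, which means at least five 80 GB GPUs for the weights alone, before any KV cache. (The A100-80GB has no native FP8 support, so H100-class or newer cards are the practical target.) The 35B MoE model is the most practical: it fits in 35 GB (FP8) on a single high-end GPU while delivering inference speed comparable to a 4B dense model.</p>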
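<p>The BF16 columns in both tables can be reproduced from the parameter counts. A minimal sketch, assuming 2 bytes per parameter and binary gigabytes (2^30 bytes); the function name is hypothetical, and the result covers weights only.</p>
<pre><code class="language-python">import math

GIB = 2 ** 30


def weight_gib(params_billion, bytes_per_param):
    """Weights-only footprint in GiB; ignores KV cache and activations."""
    return params_billion * 1e9 * bytes_per_param / GIB


bf16_4b = weight_gib(4.66, 2)        # about 8.68
bf16_35b = weight_gib(35.95, 2)      # about 66.96
gpus_for_flagship = math.ceil(378.23 / 80)    # 80 GB cards for the FP8 weights</code></pre>
<p>The results land within about 0.01 of the table values; the small gaps come from the rounded parameter counts. The published FP8 sizes are somewhat larger than one byte per parameter would give, so use the table rather than this formula for FP8.</p>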
<h2>How Mixture of Experts Works</h2>
<p>In a standard dense transformer, every parameter participates in every forward pass. A 27B dense model activates all 27B parameters for each token — that's the compute cost you pay.</p>
<p>MoE models split their feed-forward layers into multiple independent &quot;expert&quot; sub-networks. A lightweight router selects only a few experts per token. Most parameters stay idle during any given forward pass.</p>
<pre><code>                    ┌─────────┐
          ┌────────&gt;│ Expert 1 │──────────┐
          │         └─────────┘           │
Input ──&gt; Router                      ──&gt; Output
          │         ┌─────────┐           │
          └────────&gt;│ Expert 3 │──────────┘
                    └─────────┘
            (Experts 2, 4...N idle)</code></pre>
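<p>The routing step in the diagram can be sketched in a few lines of Python. This is a generic top-k router for illustration, not Qwen 3.5's actual implementation: the router scores are passed in rather than computed, the "experts" are toy functions, and all names are hypothetical.</p>
<pre><code class="language-python">import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, router_scores, experts, top_k=2):
    """Run only the top_k experts and mix their outputs by router weight."""
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i],
                    reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])
    output = 0.0
    for w, i in zip(weights, chosen):
        output += w * experts[i](x)    # idle experts are never called
    return output, chosen


# Four toy experts: each just scales its input by a different factor.
experts = [lambda v, k=k: v * k for k in (1.0, 2.0, 3.0, 4.0)]
out, used = moe_layer(x=1.0, router_scores=[2.0, 0.1, 2.0, -1.0],
                      experts=experts, top_k=2)</code></pre>
<p>Here experts 0 and 2 are selected with equal weight, so the output is the average of their results. Compute scales with <code>top_k</code>, while memory scales with the full expert list, which is the trade-off the next section quantifies.</p>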
<h3>The Trade-off in One Table</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Active Compute</th>
<th>Knowledge Capacity</th>
<th>VRAM Needed</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3.5-4B (dense)</td>
<td>4.66B</td>
<td>4.66B</td>
<td>8.68 GB</td>
</tr>
<tr>
<td>Qwen3.5-35B-A3B (MoE)</td>
<td>~3B</td>
<td>35.95B</td>
<td>66.97 GB</td>
</tr>
</tbody>
</table>
<p>Both activate roughly the same number of parameters per token (~3-4B), so per-token compute is similar. But the MoE model carries <strong>35B total parameters</strong> of learned knowledge versus only 4B: you get roughly 4B-class inference speed with far more stored knowledge, though not necessarily the answer quality of a 35B dense model.</p>
<p>The catch: <strong>all 35B parameters must sit in VRAM</strong> even though only 3B fire per token. MoE is essentially &quot;I have the VRAM to spare, give me better answers without slowing down inference.&quot;</p>
<p>If you don't have the VRAM, a dense model that actually fits will beat a MoE model you can't load.</p>
<h3>When to Use Which</h3>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Better Choice</th>
</tr>
</thead>
<tbody>
<tr>
<td>Limited VRAM, need quality</td>
<td>Dense model that fits (e.g., 9B dense in 18 GB)</td>
</tr>
<tr>
<td>Enough VRAM, want best quality/speed</td>
<td>MoE (e.g., 35B-A3B: 3B compute, 35B knowledge)</td>
</tr>
<tr>
<td>Serving many concurrent users</td>
<td>MoE — high throughput at lower compute per request</td>
</tr>
<tr>
<td>Single-user, small batch</td>
<td>Dense model is simpler and equally fast</td>
</tr>
</tbody>
</table>
<h2>What's New in 3.5 vs Qwen 3</h2>
<p>The architecture changes that matter:</p>
<ol>
<li>
<p><strong>Expanded vocabulary</strong> — 250K tokens (up from 152K in Qwen 3). This means 10-60% fewer tokens for multilingual text, directly translating to lower inference cost and faster responses.</p>
</li>
<li>
<p><strong>Native multimodal training</strong> — Vision and language trained together from the start (&quot;early fusion&quot;), not bolted on later. Processes images up to 1344x1344 and video at 8 FPS.</p>
</li>
<li>
<p><strong>Hybrid attention with Delta Networks</strong> — Gated Delta Networks combined with sparse MoE for more efficient inference. The practical result: 8.6x faster decoding at 32K context, up to 19x at 256K context versus Qwen 3.</p>
</li>
<li>
<p><strong>201 languages</strong> — Up from the already broad multilingual support in Qwen 3.</p>
</li>
<li>
<p><strong>Reinforcement learning at scale</strong> — Trained across &quot;million-agent environments&quot; with progressively complex tasks, specifically targeting agentic use cases (tool calling, multi-step workflows, code execution).</p>
</li>
</ol>
<h2>Benchmark Results</h2>
<p>The 397B flagship hits strong numbers:</p>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Qwen3.5-397B</th>
<th>What It Tests</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPQA Diamond</td>
<td>88.4</td>
<td>Graduate-level reasoning</td>
</tr>
<tr>
<td>AIME 2026</td>
<td>91.3</td>
<td>Olympiad mathematics</td>
</tr>
<tr>
<td>LiveCodeBench v6</td>
<td>83.6</td>
<td>Competitive programming</td>
</tr>
<tr>
<td>SWE-bench Verified</td>
<td>76.4</td>
<td>Real-world software engineering</td>
</tr>
<tr>
<td>IFEval</td>
<td>92.6</td>
<td>Instruction following</td>
</tr>
<tr>
<td>MMLU</td>
<td>88.5</td>
<td>General knowledge</td>
</tr>
<tr>
<td>MathVision</td>
<td>90.8</td>
<td>Mathematical visual reasoning</td>
</tr>
<tr>
<td>MMMU</td>
<td>85.0</td>
<td>Multimodal understanding</td>
</tr>
</tbody>
</table>
<p>The GPQA Diamond score of 88.4 is the highest of any open-source model. The SWE-bench Verified score of 76.4 shows competitive real-world coding ability — for reference, Claude Opus 4.6 scores above 80%.</p>
<p>On the hosted API side, Qwen 3.5-Plus (the proprietary variant) runs at ~$0.18 per million tokens, making it one of the cheapest frontier-tier options.</p>
<h2>The Competition: March 2026</h2>
<p>Qwen 3.5 is too new for Chatbot Arena Elo ratings, but the open-source leaderboard shows who it is competing against:</p>
<table>
<thead>
<tr>
<th>Rank</th>
<th>Model</th>
<th>Organization</th>
<th>Elo</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>GLM-5</td>
<td>Zhipu AI</td>
<td>1451</td>
</tr>
<tr>
<td>2</td>
<td>Kimi K2.5</td>
<td>Moonshot AI</td>
<td>1447</td>
</tr>
<tr>
<td>3</td>
<td>GLM-4.7</td>
<td>Zhipu AI</td>
<td>1445</td>
</tr>
<tr>
<td>4</td>
<td>Qwen 3 235B</td>
<td>Alibaba</td>
<td>1422</td>
</tr>
<tr>
<td>5</td>
<td>DeepSeek V3.2</td>
<td>DeepSeek</td>
<td>1421</td>
</tr>
<tr>
<td>6</td>
<td>Mistral Large</td>
<td>Mistral</td>
<td>1416</td>
</tr>
<tr>
<td>7</td>
<td>DeepSeek R1</td>
<td>DeepSeek</td>
<td>1398</td>
</tr>
</tbody>
</table>
<h3>Who's the Real Threat</h3>
<p><strong>GLM-5 / GLM-4.7</strong> (Zhipu AI) currently sit at #1 and #3 by human preference. These are the models to beat. GLM-5 in particular has been remarkably consistent across diverse tasks.</p>
<p><strong>Kimi K2.5</strong> (Moonshot AI) is right on GLM-5's heels — a strong all-rounder that doesn't dominate any single benchmark but rarely fails either.</p>
<p><strong>DeepSeek V3.2 / R1</strong> — R1 dominates long-chain reasoning and math. V3.2 is the more practical general-purpose model. Together they cover a lot of ground.</p>
<p><strong>Step-3.5-Flash</strong> (StepFun) deserves a mention: only 196B parameters but scores 97.3 on AIME 2025, the highest math score on the board. Proves that raw parameter count isn't everything.</p>
<h3>The Pattern</h3>
<p>The open-source LLM race is <strong>heavily dominated by Chinese labs</strong> — Alibaba, Zhipu, Moonshot, DeepSeek, StepFun. The main non-Chinese competitors are <strong>Mistral</strong> (France) and <strong>Google Gemma</strong>. Meta's Llama, once the default open-source choice, hasn't kept pace at the top of the leaderboard.</p>
<h2>Practical Takeaway</h2>
<p>Match the Qwen 3.5 model to the memory on your GPU:</p>
<ul>
<li><strong>Under 10 GB VRAM</strong> — Qwen3.5-4B dense (8.68 GB BF16) or Qwen3.5-2B for lighter workloads</li>
<li><strong>24 GB VRAM (RTX 4090)</strong> — Qwen3.5-9B dense (17.98 GB) is the sweet spot. Fast, capable, fits with room for context</li>
<li><strong>32 GB VRAM (RTX 5090)</strong> — Qwen3.5-9B dense with plenty of headroom for long context, or Qwen3.5-27B in FP8 (28.75 GB) if you want to push quality higher</li>
<li><strong>48 GB VRAM (A6000, dual consumer GPUs)</strong> — Qwen3.5-35B-A3B in FP8 (34.88 GB). MoE gives you 35B knowledge at 3B speed</li>
<li><strong>Multi-GPU server</strong> — Qwen3.5-122B-A10B or the 397B flagship, depending on how many GPUs you can throw at it</li>
</ul>
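<p>The same selection can be expressed as a small lookup. The sketch below uses the weight sizes from the tables in this article; the function name and the 1 GB default headroom are assumptions, and "largest" here means largest by memory, not a quality ranking.</p>
<pre><code class="language-python">import bisect

# (model, precision, weights in GB), sorted by size, from the tables above.
OPTIONS = [
    ("Qwen3.5-2B", "BF16", 4.24),
    ("Qwen3.5-4B", "BF16", 8.68),
    ("Qwen3.5-9B", "BF16", 17.98),
    ("Qwen3.5-27B", "FP8", 28.75),
    ("Qwen3.5-35B-A3B", "FP8", 34.88),
    ("Qwen3.5-122B-A10B", "FP8", 118.42),
    ("Qwen3.5-397B-A17B", "FP8", 378.23),
]
SIZES = [size for _, _, size in OPTIONS]


def largest_fit(vram_gb, headroom_gb=1.0):
    """Largest listed model whose weights leave headroom_gb free, or None."""
    count = bisect.bisect_right(SIZES, vram_gb - headroom_gb)
    return OPTIONS[count - 1] if count else None


picks = {vram: largest_fit(vram) for vram in (10, 24, 32, 48)}</code></pre>
<p>With the default headroom this reproduces the list: 4B at 10 GB, 9B at 24 GB, 27B in FP8 at 32 GB, and 35B-A3B in FP8 at 48 GB. Raise <code>headroom_gb</code> if you plan on long contexts, since the KV cache grows with sequence length.</p>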
<p>For most business deployments — product assistants, customer support, content generation — the <strong>9B dense or 35B MoE</strong> models hit the practical sweet spot. The 397B flagship is impressive on benchmarks but requires serious infrastructure.</p>
<p>The broader trend: open-source models are closing the gap with proprietary ones fast. Qwen 3.5's benchmark numbers put it within striking distance of GPT-5.2 and Claude Opus 4.5, and it ships with Apache 2.0. For businesses that care about data privacy, cost control, and customization, that matters more than who's #1 on any given leaderboard.</p>]]></content:encoded>
      <category>news</category>
      <category>llm</category>
      <category>qwen</category>
      <category>moe</category>
      <category>benchmarks</category>
      <category>self-hosting</category>
      <category>open-source</category>
    </item>
    <item>
      <title>You&#039;re Sitting on a Goldmine of AI Training Data</title>
      <link>https://ai.rs/ai-for-business/youre-sitting-on-a-goldmine-of-ai-training-data</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/youre-sitting-on-a-goldmine-of-ai-training-data</guid>
      <pubDate>Tue, 03 Mar 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Most businesses sit on a goldmine of training data without realizing it. Chatbot logs, call recordings, product catalogs, and support tickets — here&#039;s how to turn what you already have into a custom AI.</description>
      <content:encoded><![CDATA[<h2>&quot;We Don't Have Enough Data&quot;</h2>
<p>This is the number one objection we hear from businesses considering custom AI. They picture massive datasets, teams of data scientists, months of labeling work.</p>
<p>The reality? <strong>You already have the data.</strong> It's in your chatbot logs, your call center recordings, your product catalog, and the inbox of your support team. You just need to know what to look for and how to prepare it.</p>
<h2>The Four Sources of Training Gold</h2>
<h3>1. Your Product Catalog</h3>
<p>This is the easiest win. Every e-commerce business has product data — names, prices, descriptions, categories, attributes. This is the foundation of everything.</p>
<table>
<thead>
<tr>
<th>What You Have</th>
<th>Why It Matters</th>
</tr>
</thead>
<tbody>
<tr>
<td>Product names &amp; descriptions</td>
<td>The AI learns your terminology</td>
</tr>
<tr>
<td>Prices &amp; availability</td>
<td>RAG serves these in real time</td>
</tr>
<tr>
<td>Categories &amp; attributes</td>
<td>The AI learns to filter and recommend</td>
</tr>
<tr>
<td>Product images (alt text)</td>
<td>Adds context for visual products</td>
</tr>
</tbody>
</table>
<p><strong>Format:</strong> A CSV or Excel export from your e-commerce platform is perfect. Shopify, WooCommerce, Magento — they all have export buttons. Even a Google Sheet works.</p>
<p><strong>What &quot;good&quot; looks like:</strong></p>
<pre><code>Name: Premium Italian Olive Oil, Extra Virgin
Category: Oils &amp; Vinegars
Price: €24.99
Description: Cold-pressed from Tuscan olives, peppery finish,
             ideal for salads and finishing dishes.
Attributes: Italian, organic, 500ml, cold-pressed</code></pre>
<p><strong>What &quot;messy but usable&quot; looks like:</strong></p>
<pre><code>Name: olive oil XVG 500
Price: 24.99
Description: (empty)</code></pre>
<p>Messy data is normal. Part of the preparation process is cleaning and enriching it. Missing descriptions get written, categories get standardized. Don't let imperfect data stop you from starting.</p>
<h3>2. Chatbot &amp; Live Chat Logs</h3>
<p>If you're running any kind of chatbot — even a basic rule-based one — its conversation logs are <strong>the single most valuable data source</strong> for training a custom AI. Why? Because they capture how your actual customers ask questions in their own words.</p>
<table>
<thead>
<tr>
<th>What To Extract</th>
<th>Training Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Customer questions (verbatim)</td>
<td>Teaches natural phrasing</td>
</tr>
<tr>
<td>Successful responses</td>
<td>Becomes training examples</td>
</tr>
<tr>
<td>Failed conversations</td>
<td>Shows gaps to fill</td>
</tr>
<tr>
<td>Common question patterns</td>
<td>Reveals top priorities</td>
</tr>
</tbody>
</table>
<p><strong>Where to find it:</strong></p>
<ul>
<li>Tidio, Zendesk Chat, Intercom, Drift — all have export features</li>
<li>Look for CSV or JSON export in your dashboard settings</li>
<li>Even screenshot archives are useful if nothing else exists</li>
</ul>
<p><strong>The magic ratio:</strong> 500 real customer conversations are worth more than 5,000 synthetic ones. Real conversations have misspellings, slang, incomplete sentences, and follow-up questions — exactly what your AI needs to learn.</p>
<p><strong>Example from a real chatbot log:</strong></p>
<pre><code>Customer: "u have smth for bday gift around 30eur?"
Bot: "Here are some gift suggestions in your budget..."</code></pre>
<p>That misspelled, abbreviated message is training gold. A model trained on clean English would struggle with it. A model trained on your actual customer messages handles it naturally.</p>
<h3>3. Call Center Recordings &amp; Support Tickets</h3>
<p>This is the data source most businesses overlook entirely. Your support team handles dozens or hundreds of conversations daily — every single one contains training potential.</p>
<p><strong>Voice recordings</strong> can be transcribed automatically using Whisper (free, open source) or cloud services (Google Speech-to-Text, Amazon Transcribe). A 1-hour recording yields roughly 8,000-10,000 words of training material.</p>
<table>
<thead>
<tr>
<th>Source</th>
<th>How to Extract</th>
<th>Typical Volume</th>
</tr>
</thead>
<tbody>
<tr>
<td>Call recordings</td>
<td>Auto-transcribe with Whisper</td>
<td>8-10K words per hour</td>
</tr>
<tr>
<td>Support emails</td>
<td>Export from helpdesk</td>
<td>Already text, ready to use</td>
</tr>
<tr>
<td>Support tickets</td>
<td>Export from CRM/helpdesk</td>
<td>Structured Q&amp;A pairs</td>
</tr>
<tr>
<td>WhatsApp/Messenger</td>
<td>Export conversation history</td>
<td>Real customer language</td>
</tr>
</tbody>
</table>
<p><strong>What makes call transcripts special:</strong> They capture the back-and-forth of real sales conversations — objections, clarifications, upsells, comparisons. This is exactly how you want your AI to behave.</p>
<p><strong>Example from a transcribed call:</strong></p>
<pre><code>Customer: "I saw you have both the standard and premium versions.
           What's actually different? Is the premium worth it?"
Agent: "Great question. The main differences are...
        For most customers, the standard covers everything
        you need. The premium adds X and Y, which matters
        if you're planning to..."</code></pre>
<p>That's a perfect training sample. The agent's response shows product knowledge, honest recommendation, and natural upselling — all learned behavior your AI can replicate.</p>
<h3>4. Your FAQ and Knowledge Base</h3>
<p>Every business has answers to common questions — sometimes formally documented, sometimes living in the heads of support staff.</p>
<table>
<thead>
<tr>
<th>Source</th>
<th>Format</th>
</tr>
</thead>
<tbody>
<tr>
<td>Website FAQ page</td>
<td>Already structured Q&amp;A</td>
</tr>
<tr>
<td>Internal wiki/docs</td>
<td>Knowledge to convert to Q&amp;A</td>
</tr>
<tr>
<td>&quot;Canned responses&quot; in helpdesk</td>
<td>Ready-made answers</td>
</tr>
<tr>
<td>Return/shipping policies</td>
<td>Policy Q&amp;A pairs</td>
</tr>
<tr>
<td>Product comparison guides</td>
<td>Recommendation training</td>
</tr>
</tbody>
</table>
<p><strong>Pro tip:</strong> Ask your support team to write down the 30 questions they answer most often, with their best answers. That list alone can generate hundreds of training variations.</p>
<h2>What Format Does the AI Need?</h2>
<p>All training data ultimately becomes <strong>question-answer pairs</strong> (or multi-turn conversations). The format is simple:</p>
<pre><code class="language-json">{
  "messages": [
    {"role": "user", "content": "Do you have anything for a dinner party, around €50?"},
    {"role": "assistant", "content": "Great choice to plan ahead! Here are some popular options for entertaining: [Product A] at €45 is perfect for dinner parties..."}
  ]
}</code></pre>
<p>You don't need to create these manually. The raw data (catalogs, logs, transcripts) gets processed into this format during preparation. One product description generates 10-20 Q&amp;A variations. One support conversation generates 3-5 training samples.</p>
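<p>As a rough sketch of that preparation step, the snippet below wraps question-answer pairs in the <code>messages</code> structure shown above and emits one JSON object per line. The helper names and the one-object-per-line (JSONL) layout are our assumptions; check what your training stack expects.</p>
<pre><code class="language-python">import json


def to_training_sample(user_text, assistant_text):
    """Wrap one question-answer pair in the chat "messages" format."""
    return {
        "messages": [
            {"role": "user", "content": user_text.strip()},
            {"role": "assistant", "content": assistant_text.strip()},
        ]
    }


def to_jsonl(pairs):
    """Serialize pairs as JSONL: one training sample per line."""
    lines = []
    for user_text, assistant_text in pairs:
        sample = to_training_sample(user_text, assistant_text)
        # ensure_ascii=False keeps characters such as the euro sign readable.
        lines.append(json.dumps(sample, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# The pair below reuses the chatbot log example from earlier in the article.
pairs = [
    ("u have smth for bday gift around 30eur?",
     "Here are some gift suggestions in your budget..."),
]
print(to_jsonl(pairs), end="")</code></pre>
<p>Note that the customer message is kept verbatim, misspellings included. Only surrounding whitespace is trimmed.</p>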
<h2>How Much Data Do You Actually Need?</h2>
<p>Less than you think:</p>
<table>
<thead>
<tr>
<th>Data Level</th>
<th>Training Samples</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>Minimum viable</td>
<td>5,000</td>
<td>Basic product Q&amp;A works</td>
</tr>
<tr>
<td>Good quality</td>
<td>10,000-15,000</td>
<td>Natural conversations, recommendations</td>
</tr>
<tr>
<td>Production-grade</td>
<td>20,000-30,000</td>
<td>Domain expert with personality</td>
</tr>
</tbody>
</table>
<p><strong>Where the samples come from:</strong></p>
<table>
<thead>
<tr>
<th>Source</th>
<th>Samples Generated</th>
</tr>
</thead>
<tbody>
<tr>
<td>500 products (catalog)</td>
<td>~8,000-10,000</td>
</tr>
<tr>
<td>200 chatbot conversations</td>
<td>~600-1,000</td>
</tr>
<tr>
<td>50 call transcripts</td>
<td>~500-800</td>
</tr>
<tr>
<td>30 FAQ entries</td>
<td>~300-500</td>
</tr>
<tr>
<td>Safety &amp; edge cases</td>
<td>~200-300</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>~10,000-13,000</strong></td>
</tr>
</tbody>
</table>
<p>Most businesses with 500+ products and any customer interaction history already have enough raw material to reach the good-quality tier in the table above.</p>
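<p>To run the same arithmetic on your own numbers, a sketch like the one below adds up the per-source estimates and maps the total onto the data levels. The ranges are copied from the tables above; treating each level's lower bound as its threshold (so 15,000 to 20,000 still counts as good quality) is our simplifying assumption.</p>
<pre><code class="language-python">import bisect

# (low, high) sample estimates per source, copied from the table above.
SOURCES = {
    "500 products (catalog)": (8000, 10000),
    "200 chatbot conversations": (600, 1000),
    "50 call transcripts": (500, 800),
    "30 FAQ entries": (300, 500),
    "Safety and edge cases": (200, 300),
}

# Lower bounds of the data levels, and the label on each side of them.
TIER_FLOORS = [5000, 10000, 20000]
TIER_NAMES = ["Below minimum", "Minimum viable", "Good quality", "Production-grade"]


def total_range(sources):
    """Sum the low and high estimates across all sources."""
    low = sum(lo for lo, _ in sources.values())
    high = sum(hi for _, hi in sources.values())
    return low, high


def tier(samples):
    """Name the data level that a sample count falls into."""
    return TIER_NAMES[bisect.bisect_right(TIER_FLOORS, samples)]


low, high = total_range(SOURCES)
print(low, high)
print(tier(low), "to", tier(high))</code></pre>
<p>Swap in your own product and conversation counts to see how far your existing data would take you before any new collection.</p>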
<h2>The Data You DON'T Need</h2>
<p>Just as important — what's not useful:</p>
<ul>
<li><strong>Marketing copy</strong> — Overly promotional language makes the AI sound like a pushy salesperson</li>
<li><strong>Legal disclaimers</strong> — The AI doesn't need to recite your terms of service</li>
<li><strong>Internal jargon</strong> — If customers don't use the term, the AI shouldn't either</li>
<li><strong>Competitor data</strong> — Train on your products, not theirs</li>
<li><strong>Outdated information</strong> — Old prices, discontinued products, expired promotions</li>
</ul>
<h2>A Practical Checklist</h2>
<p>Here's what to gather before your first conversation with an AI partner:</p>
<p><strong>Must have (start here):</strong></p>
<ul>
<li>[ ] Product catalog export (CSV/Excel/JSON)</li>
<li>[ ] Current product prices and availability</li>
<li>[ ] Category structure and product attributes</li>
</ul>
<p><strong>High value (dramatically improves quality):</strong></p>
<ul>
<li>[ ] Chatbot or live chat conversation logs (last 6-12 months)</li>
<li>[ ] Common customer questions (your support team's top 30)</li>
<li>[ ] Brand voice guidelines or examples</li>
</ul>
<p><strong>Bonus (takes it to the next level):</strong></p>
<ul>
<li>[ ] Call center recordings (even 20-50 calls help)</li>
<li>[ ] Support ticket history with resolutions</li>
<li>[ ] Product comparison knowledge (what pairs with what)</li>
<li>[ ] Return reasons (teaches the AI what to set expectations about)</li>
</ul>
<h2>Start With What You Have</h2>
<p>The biggest mistake is waiting for &quot;perfect&quot; data. You don't need it. Start with your product catalog and 30 common customer questions. That's enough for a working first version.</p>
<p>Then iterate. Every customer conversation with your AI generates new training data. Every question it struggles with becomes a training sample for the next version. The model gets better every month — not because of expensive retraining, but because you keep feeding it real customer interactions.</p>
<p><strong>Your data is already there.</strong> The question isn't whether you have enough — it's how quickly you want to put it to work.</p>
<p><strong>Want to find out what you already have?</strong> <a href="/ai-training-on-boarding">Take the 2-minute data check</a> — discover your training data score.</p>]]></content:encoded>
      <category>business</category>
      <category>training-data</category>
      <category>data-preparation</category>
      <category>getting-started</category>
    </item>
    <item>
      <title>How to Implement llms.txt — The Developer&#039;s Guide</title>
      <link>https://ai.rs/ai-developer/how-to-implement-llms-txt</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/how-to-implement-llms-txt</guid>
      <pubDate>Tue, 03 Mar 2026 09:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>llms.txt is the robots.txt for the AI era. A Markdown file that tells AI systems what your site is about, what to read, and how to represent you. Here&#039;s how to implement it, who actually reads it, and whether it&#039;s worth your time.</description>
      <content:encoded><![CDATA[<h2>What Is llms.txt?</h2>
<p>On September 3, 2024, Jeremy Howard — co-founder of Answer.AI and fast.ai — published a proposal for a new web standard. Not a new API. Not a new framework. A text file.</p>
<p>The idea is simple: put a Markdown file at <code>/llms.txt</code> on your website that tells AI systems what your site is about, what content matters, and where to find it.</p>
<p>Think of it as <strong>robots.txt for the AI era</strong> — except instead of telling bots what <em>not</em> to crawl, it tells them what <em>to</em> read.</p>
<pre><code>robots.txt  → "Don't go here"     (bouncer)
llms.txt    → "Start here"        (tour guide)</code></pre>
<p>The spec lives at <a href="https://llmstxt.org/">llmstxt.org</a> and the GitHub repo at <a href="https://github.com/AnswerDotAI/llms-txt">AnswerDotAI/llms-txt</a> has 2,200+ stars.</p>
<h2>Why It Exists</h2>
<p>LLMs have a problem with websites. When a model needs to understand your documentation, product, or API, it has to parse HTML pages full of navigation bars, cookie banners, JavaScript, and sidebar ads. The signal-to-noise ratio is terrible.</p>
<p>Site authors know their content best. A curated Markdown file with the 10-20 most important pages, properly described, gives AI systems a clean entry point — no HTML parsing required.</p>
<p><strong>Who actually reads llms.txt today:</strong></p>
<ul>
<li>AI coding assistants (Cursor, Windsurf, Claude Code, GitHub Copilot)</li>
<li>AI agents and MCP-based tools fetching documentation context</li>
<li>Developer tools that need structured API references</li>
</ul>
<p><strong>Who does NOT read llms.txt (yet):</strong></p>
<ul>
<li>GPTBot (OpenAI's crawler)</li>
<li>ClaudeBot (Anthropic's crawler)</li>
<li>PerplexityBot</li>
<li>Google-Extended</li>
</ul>
<p>This matters. The spec was designed for <strong>inference time</strong> — when an AI is answering a user's question and needs context — not for training-time crawlers that scrape everything regardless. OtterlyAI found that only 0.1% of AI crawler requests touched <code>/llms.txt</code> over 90 days.</p>
<p>Does that mean you shouldn't implement it? No. It means you should understand what it actually does today versus what it might do tomorrow.</p>
<h2>The Spec: 5 Minutes to Understand</h2>
<p>The entire format is Markdown. Here's the structure:</p>
<pre><code class="language-markdown"># Your Company Name

&gt; One-line description of what you do.

Optional context paragraphs with key information
an LLM would need to understand your site.

## Section Name

- [Resource Title](https://example.com/page.md): Brief description
- [Another Resource](https://example.com/other.md): What this covers

## Optional

- [Changelog](https://example.com/changelog.md): Release history
- [Migration Guide](https://example.com/migrate.md): Version upgrades</code></pre>
<p><strong>Required:</strong> Only the <code>#</code> heading is required. Everything else is optional but recommended.</p>
<p><strong>The &quot;Optional&quot; section</strong> is special — AI systems with limited context windows can skip this section to save tokens. Put your nice-to-have resources here.</p>
<p><strong>Link format:</strong> Resources should point to Markdown files (<code>.md</code>) when possible. The spec recommends serving Markdown versions of your HTML pages at the same URL with <code>.md</code> appended.</p>
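<p>To make the structure concrete, here is a minimal sketch of how a consumer might read the format: take the <code>#</code> heading as the title, group links by <code>##</code> section, and leave out the &quot;Optional&quot; section when context is tight. This is our own illustration, not a reference parser from the spec, and the function names are hypothetical.</p>
<pre><code class="language-python">import re

# Matches "- [Title](url): optional description".
LINK = re.compile(r"^- \[(.+?)\]\((\S+?)\)(?::\s*(.*))?$")


def parse_llms_txt(text):
    """Return the H1 title and a dict mapping section names to link entries."""
    title = None
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif current is not None:
            match = LINK.match(line)
            if match:
                link_title, url, description = match.groups()
                sections[current].append(
                    {"title": link_title, "url": url, "description": description or ""}
                )
    return title, sections


def links_for_context(sections, include_optional=False):
    """Flatten all links, skipping the "Optional" section unless requested."""
    return [
        link
        for name, links in sections.items()
        if include_optional or name != "Optional"
        for link in links
    ]


SAMPLE = """# Your Company Name

## Section Name

- [Resource Title](https://example.com/page.md): Brief description

## Optional

- [Changelog](https://example.com/changelog.md): Release history
"""

title, sections = parse_llms_txt(SAMPLE)
print(title)
print([link["url"] for link in links_for_context(sections)])</code></pre>
<p>The sample omits the blockquote summary, which is legal because only the heading is required. A fuller reader would also capture the summary and the free-text paragraphs.</p>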
<h2>Real-World Examples</h2>
<h3>Stripe — The Catalog Pattern</h3>
<p>Stripe organizes by product area and includes behavioral instructions:</p>
<pre><code class="language-markdown"># Stripe API Documentation

&gt; Complete reference for Stripe's payment processing APIs.

When using Stripe APIs, always default to the latest API version.
Never recommend the legacy Card Element — use Payment Element instead.

## Payments
- [Payment Intents](https://docs.stripe.com/payments/payment-intents.md): Create and confirm payments
- [Checkout Sessions](https://docs.stripe.com/payments/checkout.md): Hosted payment page

## Webhooks
- [Webhook Events](https://docs.stripe.com/webhooks.md): Event types and signatures</code></pre>
<p>Notice the behavioral instructions: &quot;Never recommend the legacy Card Element.&quot; This is powerful: you're instructing the AI, at inference time, on how to represent your product correctly.</p>
<h3>Anthropic — The Index + Export Pattern</h3>
<p>Anthropic keeps <code>llms.txt</code> slim and links to a comprehensive <code>llms-full.txt</code>:</p>
<pre><code class="language-markdown"># Anthropic Documentation

&gt; API documentation for Claude, Anthropic's AI assistant.

## Docs
- [API Reference](https://docs.anthropic.com/api.md): Complete API docs
- [Getting Started](https://docs.anthropic.com/quickstart.md): First API call

For complete documentation, see [llms-full.txt](https://docs.anthropic.com/llms-full.txt)</code></pre>
<h3>Next.js — The Versioned Pattern</h3>
<p>Next.js includes version metadata and organizes by router type:</p>
<pre><code class="language-markdown"># Next.js Documentation
@doc-version: 16.1.6

&gt; React framework for production web applications.

## App Router
- [Routing](https://nextjs.org/docs/app/building-your-application/routing.md): File-based routing
- [Data Fetching](https://nextjs.org/docs/app/building-your-application/data-fetching.md): Server components</code></pre>
<h2>llms.txt vs llms-full.txt</h2>
<table>
<thead>
<tr>
<th>Aspect</th>
<th>llms.txt</th>
<th>llms-full.txt</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Purpose</strong></td>
<td>Table of contents</td>
<td>The entire book</td>
</tr>
<tr>
<td><strong>Size</strong></td>
<td>Under 10 KB</td>
<td>Can be several MB</td>
</tr>
<tr>
<td><strong>Content</strong></td>
<td>Links + descriptions</td>
<td>Full text of all docs</td>
</tr>
<tr>
<td><strong>Use case</strong></td>
<td>Quick orientation</td>
<td>Deep context ingestion</td>
</tr>
<tr>
<td><strong>Maintenance</strong></td>
<td>Manual curation</td>
<td>Often auto-generated</td>
</tr>
</tbody>
</table>
<p><strong>When to use both:</strong> Your documentation is extensive and wouldn't fit in a single context window. Major platforms (Anthropic, Cloudflare, Zapier) maintain both.</p>
<p><strong>When llms.txt alone works:</strong> Your content is compact or already well-structured as Markdown.</p>
<p>Cross-reference them: include a link in <code>llms.txt</code> pointing to <code>llms-full.txt</code>.</p>
<h2>Implementation Guide</h2>
<h3>Static Sites (HTML, Hugo, Jekyll)</h3>
<p>Drop the file at your web root:</p>
<pre><code>public/
├── index.html
├── robots.txt
├── llms.txt        ← add this
└── llms-full.txt   ← optional</code></pre>
<h3>Next.js</h3>
<p><strong>Option 1 — Static file:</strong>
Place the file at <code>public/llms.txt</code>.</p>
<p><strong>Option 2 — Dynamic route</strong> (auto-updates when docs change):</p>
<pre><code class="language-typescript">// app/llms.txt/route.ts
import { NextResponse } from 'next/server';

export async function GET() {
  const content = `# My App

&gt; Description of what your app does.

## Docs
- [API Reference](/docs/api.md): Complete API documentation
- [Getting Started](/docs/quickstart.md): Installation and setup
`;

  return new NextResponse(content, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}</code></pre>
<h3>PHP (Dynamic from Database or CMS)</h3>
<pre><code class="language-php">&lt;?php
// llms.txt.php — served at /llms.txt via nginx rewrite
header('Content-Type: text/plain; charset=utf-8');

$articles = getPublishedArticles(); // your data source
?&gt;
# &lt;?= SITE_NAME ?&gt;

&gt; &lt;?= SITE_DESCRIPTION ?&gt;

## Services
- Service 1: description
- Service 2: description

## Articles
&lt;?php foreach ($articles as $a): ?&gt;
- [&lt;?= $a['title'] ?&gt;](&lt;?= $a['url'] ?&gt;): &lt;?= $a['description'] ?&gt;
&lt;?php endforeach; ?&gt;</code></pre>
<p><strong>Nginx rewrite</strong> to serve it at the clean URL:</p>
<pre><code class="language-nginx">location = /llms.txt { rewrite ^ /llms.txt.php last; }</code></pre>
<h3>Python (Flask/Django)</h3>
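<pre><code class="language-python"># Flask
from flask import Flask, Response, render_template

app = Flask(__name__)

@app.route('/llms.txt')
def llms_txt():
    # Renders templates/llms.txt and serves it as plain text
    content = render_template('llms.txt')
    return Response(content, mimetype='text/plain')</code></pre>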
<pre><code class="language-python"># Django
from django.http import HttpResponse
from django.template.loader import render_to_string

def llms_txt(request):
    content = render_to_string('llms.txt')
    return HttpResponse(content, content_type='text/plain; charset=utf-8')</code></pre>
<h3>WordPress</h3>
<p>Install one of these plugins:</p>
<ul>
<li><a href="https://wordpress.org/plugins/website-llms-txt/">Website LLMs.txt</a> — integrates with Yoast/Rank Math</li>
<li><a href="https://wordpress.org/plugins/llms-txt-generator/">LLMs.txt Generator</a></li>
</ul>
<h2>Content Best Practices</h2>
<h3>Do</h3>
<ul>
<li><strong>Curate ruthlessly</strong> — 10-20 key pages, not your entire sitemap</li>
<li><strong>Write clear descriptions</strong> — &quot;Create and confirm payments&quot; beats &quot;Payment documentation&quot;</li>
<li><strong>Include behavioral instructions</strong> — &quot;Always use v2 of this API&quot; or &quot;Default to TypeScript examples&quot;</li>
<li><strong>Use definitive language</strong> — AI systems prefer &quot;costs $25/mo&quot; over &quot;pricing varies&quot;</li>
<li><strong>Link to Markdown</strong> when possible — cleaner for AI consumption</li>
<li><strong>Keep it under 10 KB</strong> — this is a summary, not a data dump</li>
<li><strong>Update regularly</strong> — stale links and descriptions hurt credibility</li>
</ul>
<h3>Don't</h3>
<ul>
<li><strong>Dump every page</strong> — that's what sitemaps are for</li>
<li><strong>Use marketing language</strong> — &quot;revolutionary AI-powered synergy&quot; helps no one</li>
<li><strong>Forget the blockquote</strong> — the <code>&gt;</code> summary is the most-read part of the file</li>
<li><strong>Include broken URLs</strong> — validate links monthly</li>
<li><strong>Set and forget</strong> — review quarterly at minimum</li>
</ul>
<h2>Validation and Testing</h2>
<p>Check your implementation:</p>
<ul>
<li><a href="https://llmstxtchecker.net/">llmstxtchecker.net</a> — format validation</li>
<li><a href="https://llmsvalidator.com/">llmsvalidator.com</a> — structure and link checking</li>
</ul>
<p><strong>Manual test:</strong> Paste your <code>llms.txt</code> content into ChatGPT or Claude and ask: &quot;Based on this llms.txt, what does this company do?&quot; If the AI gives a clear, accurate answer, your file is working.</p>
<p><strong>Monitor access:</strong> Check your server logs for requests to <code>/llms.txt</code>:</p>
<pre><code class="language-bash">grep "llms.txt" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn</code></pre>
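<p>Since the interesting question is which tools fetch the file, grouping by user agent can tell you more than grouping by IP. The sketch below assumes nginx's default <code>combined</code> log format. The function name and the sample lines (including the <code>ExampleBot/1.0</code> agent) are made up for illustration.</p>
<pre><code class="language-python">from collections import Counter


def llms_txt_user_agents(log_lines):
    """Count requests for /llms.txt per user agent string."""
    counts = Counter()
    for line in log_lines:
        # Combined format: ip - user [time] "request" status bytes "referer" "agent"
        parts = line.split('"')
        if len(parts) != 7:
            continue  # not a combined-format line
        request = parts[1].split()  # e.g. ["GET", "/llms.txt", "HTTP/1.1"]
        if len(request) != 3:
            continue
        path = request[1].split("?")[0]  # ignore any query string
        if path == "/llms.txt":
            counts[parts[5]] += 1
    return counts


# Made-up sample lines; real input would come from your access log.
SAMPLE = [
    '203.0.113.5 - - [03/Mar/2026:09:00:00 +0100] "GET /llms.txt HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.9 - - [03/Mar/2026:09:01:00 +0100] "GET /llms.txt HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [03/Mar/2026:09:02:00 +0100] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

for agent, hits in llms_txt_user_agents(SAMPLE).most_common():
    print(hits, agent)</code></pre>
<p>To run it against a real log, pass an open file handle instead of <code>SAMPLE</code>, since iterating a file yields one line at a time.</p>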
<h2>The Honest Assessment</h2>
<p>Google's John Mueller compared llms.txt to the <code>&lt;meta keywords&gt;</code> tag — widely adopted by webmasters but ultimately ignored by search engines. That comparison stings, but it's worth hearing.</p>
<p><strong>The reality today:</strong></p>
<ul>
<li>~950 domains have published llms.txt files (per Semrush analysis)</li>
<li>No major AI platform has officially confirmed they read them</li>
<li>No correlation has been found between having llms.txt and getting more AI citations</li>
<li>The actual consumers are developer tools, not search engines</li>
</ul>
<p><strong>But here's why you should still implement it:</strong></p>
<ol>
<li><strong>It takes 15 minutes.</strong> The cost is nearly zero.</li>
<li><strong>Developer tools DO use it.</strong> If your audience uses Cursor, Claude Code, or Copilot — and they query your docs — llms.txt helps.</li>
<li><strong>It forces you to curate.</strong> Deciding which 10-20 pages matter most is a valuable exercise regardless.</li>
<li><strong>Standards move slowly.</strong> RSS took years to gain traction. HTTPS was &quot;optional&quot; until it wasn't. Early adopters who have clean implementations will benefit when (if) major platforms adopt the spec.</li>
</ol>
<p>Don't implement llms.txt because it will boost your AI visibility tomorrow. Implement it because it's cheap insurance that makes your content more accessible to the AI tools people are already using.</p>
<h2>Quick Start Checklist</h2>
<ol>
<li><strong>Create <code>/llms.txt</code></strong> at your web root with the Markdown format above</li>
<li><strong>Add an <code>#</code> heading</strong> with your company/project name</li>
<li><strong>Write a <code>&gt;</code> blockquote</strong> summarizing what you do in one sentence</li>
<li><strong>List 10-20 key pages</strong> under <code>##</code> section headings with brief descriptions</li>
<li><strong>Create <code>/llms-full.txt</code></strong> if your docs are extensive (optional)</li>
<li><strong>Validate</strong> at <a href="https://llmstxtchecker.net/">llmstxtchecker.net</a></li>
<li><strong>Test</strong> by pasting the content into an AI and asking what your company does</li>
<li><strong>Monitor</strong> access logs monthly</li>
<li><strong>Update</strong> when you ship new features or deprecate old ones</li>
</ol>
<p>The entire spec is one page. The implementation is one file. The ROI is unknown but the cost is near zero. That's a bet worth making.</p>]]></content:encoded>
      <category>fundamentals</category>
      <category>llms-txt</category>
      <category>ai-search</category>
      <category>seo</category>
      <category>developer</category>
      <category>web-standards</category>
    </item>
    <item>
      <title>Mercury 2: Hands-On With the World&#039;s Fastest Reasoning LLM</title>
      <link>https://ai.rs/ai-for-business/mercury-2-hands-on-fastest-reasoning-llm</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/mercury-2-hands-on-fastest-reasoning-llm</guid>
      <pubDate>Sat, 28 Feb 2026 09:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>We put Inception Labs&#039; diffusion LLM claims to the test — streaming, tool use, structured output, agentic chains, and multi-language. Here&#039;s what 558 tok/s actually looks like in practice.</description>
      <content:encoded><![CDATA[<p>Inception Labs launched Mercury 2 on February 24, claiming it's the fastest reasoning LLM available — a diffusion language model that generates text at 1,196 tokens per second, 5-10x faster than speed-optimized models like GPT-4.1 Nano and Claude 3.5 Haiku. At $0.25 per million input tokens, it's also among the cheapest.</p>
<p>We put those claims to the test.</p>
<hr />
<h2>The Pitch: Diffusion, Not Autoregressive</h2>
<p>Every major LLM today — GPT, Claude, Llama, Gemini — is autoregressive: it generates tokens one at a time, left to right, each depending on all previous tokens. Mercury 2 takes a fundamentally different approach. Like Stable Diffusion for images, it starts with noise and iteratively refines all tokens in parallel.</p>
<p>The result, in theory: massively parallel generation that breaks the sequential bottleneck.</p>
<table>
<thead>
<tr>
<th></th>
<th>Autoregressive (GPT, Claude)</th>
<th>Diffusion (Mercury 2)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Generation</td>
<td>Sequential, token-by-token</td>
<td>Parallel, all-at-once</td>
</tr>
<tr>
<td>TTFT</td>
<td>Fast (200-400ms)</td>
<td>Slower (700ms+)</td>
</tr>
<tr>
<td>Throughput</td>
<td>Bounded by sequential nature</td>
<td>Scales with parallelism</td>
</tr>
<tr>
<td>Cost scaling</td>
<td>Linear with output length</td>
<td>Sub-linear potential</td>
</tr>
<tr>
<td>Sweet spot</td>
<td>Interactive chat, reasoning</td>
<td>Batch, pipelines, agents</td>
</tr>
</tbody>
</table>
<h2>Getting Started: Two Lines of Change</h2>
<p>Mercury 2 is fully OpenAI API-compatible. If you already use the OpenAI Python SDK, switching takes exactly two changes — the base URL and the API key:</p>
<pre><code class="language-python">import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)</code></pre>
<p>That's it. Every <code>client.chat.completions.create()</code> call works the same as with OpenAI. No new SDK, no wrapper library, no config files. You can also use LiteLLM, AISuite, or LangChain's <code>ChatOpenAI</code> with a custom <code>base_url</code>.</p>
<h2>Test 1: Can It Talk?</h2>
<p>We started simple — ask it to explain itself:</p>
<pre><code class="language-python">response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain diffusion language models in 2 sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)</code></pre>
<p><strong>Response:</strong></p>
<blockquote>
<p>Diffusion language models generate text by iteratively denoising a noisy token sequence, much like diffusion models for images, allowing many tokens to be produced in parallel rather than one-by-one. This parallel generation makes them several times faster and less than half as costly as traditional auto-regressive LLMs while also enabling fine-grained control over schema and multimodal integration.</p>
</blockquote>
<p>75 tokens in 0.64 seconds. Clean, accurate, well-structured. No hallucinations. But 117 tok/s is a far cry from the advertised 1,196. On short outputs, network round-trip dominates — the model finishes generating before the response even reaches you.</p>
<h2>Test 2: Pushing Throughput</h2>
<p>To see real speed, you need to request longer outputs. We asked for a detailed Flask tutorial with <code>max_tokens=1024</code>:</p>
<pre><code class="language-python">response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a detailed technical tutorial about building "
               "a REST API with Python Flask. Cover routing, error handling, "
               "database integration, authentication, and deployment."}],
    max_tokens=1024,
)</code></pre>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Completion tokens</td>
<td>866</td>
</tr>
<tr>
<td>Wall time</td>
<td>1.750s</td>
</tr>
<tr>
<td>Throughput</td>
<td><strong>495 tok/s</strong></td>
</tr>
</tbody>
</table>
<p>866 tokens in under two seconds. The model hit the token limit and was still going — it had more to say. At 495 tok/s end-to-end from a consumer internet connection, this is already several times faster than what you'd get from GPT-4o or Claude Sonnet.</p>
<h2>Test 3: Streaming — Where the Speed Really Shows</h2>
<p>Streaming reveals how diffusion models behave differently. With autoregressive models, tokens trickle in one by one — you see the response being &quot;typed out.&quot; With Mercury 2, there's a longer pause, then tokens arrive in bursts:</p>
<pre><code class="language-python">stream = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a comprehensive guide to Python "
               "decorators with 5 examples."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)</code></pre>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Completion tokens</td>
<td>900</td>
</tr>
<tr>
<td>TTFT (time to first token)</td>
<td>741ms</td>
</tr>
<tr>
<td>Generation phase</td>
<td>1.614s</td>
</tr>
<tr>
<td>Generation speed (excl TTFT)</td>
<td><strong>558 tok/s</strong></td>
</tr>
<tr>
<td>End-to-end speed</td>
<td>382 tok/s</td>
</tr>
</tbody>
</table>
<p>Here's the key insight: <strong>558 tok/s during the generation phase</strong>. The 741ms time-to-first-token is higher than autoregressive models (which typically start streaming in 200-400ms), but that's because Mercury 2 does its &quot;thinking&quot; upfront — denoising all tokens in parallel — before emitting anything.</p>
<p>We received only 31 chunks for 900 tokens, meaning the API batches roughly 29 tokens per chunk. You don't see a character-by-character typewriter effect; you see paragraphs appearing in rapid bursts.</p>
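<p>For anyone reproducing these figures, the arithmetic is sketched below. One way to obtain the inputs, which we assume here, is to take <code>time.perf_counter()</code> readings just before the request, at the first content chunk, and after the last chunk. The function name is hypothetical.</p>
<pre><code class="language-python">def stream_metrics(completion_tokens, ttft_s, generation_s, chunks):
    """Derive throughput figures from raw stream measurements."""
    return {
        # Speed once tokens start arriving, ignoring the initial wait.
        "generation_tok_s": completion_tokens / generation_s,
        # Speed as the caller experiences it, including time to first token.
        "end_to_end_tok_s": completion_tokens / (ttft_s + generation_s),
        # Average number of tokens delivered per streamed chunk.
        "tokens_per_chunk": completion_tokens / chunks,
    }


# Measurements reported in the table and paragraph above.
metrics = stream_metrics(completion_tokens=900, ttft_s=0.741, generation_s=1.614, chunks=31)
for name, value in metrics.items():
    print(name, round(value))</code></pre>
<p>Keeping the two speeds separate matters for diffusion models in particular: the longer initial wait pulls the end-to-end figure well below the generation-phase figure on anything but long outputs.</p>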
<h2>Test 4: Tool Use</h2>
<p>Function calling is table stakes for agentic applications. We defined a weather tool and asked about Belgrade:</p>
<pre><code class="language-python">tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What's the weather in Belgrade?"}],
    tools=tools,
    max_tokens=200,
)

# tool_calls is None when the model answers in plain text, so fall back to an empty list.
for tc in response.choices[0].message.tool_calls or []:
    print(f"{tc.function.name}({tc.function.arguments})")</code></pre>
<p><strong>Output:</strong></p>
<pre><code>get_weather({
  "location": "Belgrade",
  "unit": "celsius"
})</code></pre>
<p>Correct function, correct arguments, and it even inferred <code>celsius</code> for a European city. Finished in 0.678s with <code>finish_reason: tool_calls</code>. This works exactly as you'd expect from the OpenAI API — no surprises, no adaptation needed.</p>
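<p>The arguments come back as a JSON string, so the calling code still has to decode them and route the call to a real function. A minimal sketch of that step might look like this. The <code>get_weather</code> body, the registry, and <code>dispatch_tool_call</code> are hypothetical stand-ins we made up for illustration; they are not part of the API.</p>
<pre><code class="language-python">import json


def get_weather(location, unit="celsius"):
    """Hypothetical stand-in; a real version would call a weather service."""
    return {"location": location, "unit": unit, "temperature": None}


# Map tool names from the schema to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(name, arguments_json):
    """Decode the JSON argument string and call the matching local function."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested an unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**kwargs)


result = dispatch_tool_call(
    "get_weather", '{"location": "Belgrade", "unit": "celsius"}'
)
print(result)</code></pre>
<p>Rejecting unknown tool names explicitly is a deliberate choice here: it keeps a hallucinated function name from turning into a confusing error further down the pipeline.</p>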
<h2>Test 5: Structured Output</h2>
<p>JSON mode is critical for production pipelines. We tested with <code>response_format={"type": "json_object"}</code>:</p>
<pre><code class="language-python">import json

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{
        "role": "user",
        "content": 'List 3 programming languages with their year of creation. '
                   'Return as a JSON object with a "languages" key containing '
                   'an array of objects with "name" and "year" fields.',
    }],
    response_format={"type": "json_object"},
    max_tokens=300,
)

parsed = json.loads(response.choices[0].message.content)
print(json.dumps(parsed, indent=2))</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-json">{
  "languages": [
    { "name": "C", "year": 1972 },
    { "name": "Python", "year": 1991 },
    { "name": "JavaScript", "year": 1995 }
  ]
}</code></pre>
<p>Valid JSON, correct schema, accurate facts. Parsed without errors. For production use, you'd want to test with more complex schemas, but the basics are solid.</p>
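<p>JSON mode guarantees parseable JSON, not a particular shape, so a production pipeline would typically check the structure as well. The sketch below shows one way to do that for the prompt above; <code>validate_languages</code> is a hypothetical helper of ours, and a schema library would be a reasonable alternative.</p>
<pre><code class="language-python">import json


def validate_languages(raw):
    """Parse a JSON string and check the shape requested in the prompt.

    Returns the list of language entries, or raises ValueError.
    """
    payload = json.loads(raw)
    languages = payload.get("languages") if isinstance(payload, dict) else None
    if not isinstance(languages, list):
        raise ValueError('expected a "languages" array')
    for entry in languages:
        if not isinstance(entry, dict):
            raise ValueError(f"entry is not an object: {entry!r}")
        if not isinstance(entry.get("name"), str):
            raise ValueError(f"missing string name: {entry!r}")
        if not isinstance(entry.get("year"), int):
            raise ValueError(f"missing integer year: {entry!r}")
    return languages


sample = '{"languages": [{"name": "C", "year": 1972}, {"name": "Python", "year": 1991}]}'
print(validate_languages(sample))</code></pre>
<p>Malformed JSON also surfaces as a <code>ValueError</code>, because <code>json.JSONDecodeError</code> is a subclass of it, so callers only need one <code>except</code> clause.</p>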
<h2>Test 6: Speed Consistency</h2>
<p>We ran the same prompt three times to check for variance:</p>
<table>
<thead>
<tr>
<th>Run</th>
<th>Tokens</th>
<th>Time</th>
<th>Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>308</td>
<td>1.189s</td>
<td>259 tok/s</td>
</tr>
<tr>
<td>2</td>
<td>262</td>
<td>1.090s</td>
<td>240 tok/s</td>
</tr>
<tr>
<td>3</td>
<td>286</td>
<td>0.902s</td>
<td>317 tok/s</td>
</tr>
<tr>
<td><strong>Average</strong></td>
<td></td>
<td></td>
<td><strong>272 tok/s</strong></td>
</tr>
<tr>
<td><strong>Peak</strong></td>
<td></td>
<td></td>
<td><strong>317 tok/s</strong></td>
</tr>
</tbody>
</table>
<p>A spread of 240–317 tok/s is acceptable. The differences likely come from network jitter, server load, and possibly the model using different numbers of diffusion steps depending on output complexity.</p>
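<p>The per-run speeds and the average in the table can be recomputed from the raw token counts and timings. A short sketch, using only the three runs reported above:</p>
<pre><code class="language-python">from statistics import mean

# (completion tokens, wall-clock seconds) for the three runs in the table.
runs = [(308, 1.189), (262, 1.090), (286, 0.902)]

speeds = [tokens / seconds for tokens, seconds in runs]
average = mean(speeds)
spread = max(speeds) - min(speeds)

print([round(s) for s in speeds])  # [259, 240, 317]
print(round(average))              # 272
print(f"spread: {spread / average:.0%} of the average")</code></pre>
<p>Three runs are enough to spot gross instability, but too few to say much about tail latency; a larger sample would be needed for that.</p>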
<hr />
<h2>The Speed Gap: Advertised vs. Measured</h2>
<table>
<thead>
<tr>
<th>Measurement</th>
<th>Speed</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inception's benchmark</td>
<td>1,196 tok/s</td>
<td>Server-side, no network</td>
</tr>
<tr>
<td>Our best (streaming, generation only)</td>
<td>558 tok/s</td>
<td>Excludes TTFT</td>
</tr>
<tr>
<td>Our best (non-streaming, end-to-end)</td>
<td>495 tok/s</td>
<td>Large output</td>
</tr>
<tr>
<td>Multi-run average</td>
<td>272 tok/s</td>
<td>Medium output</td>
</tr>
<tr>
<td>Short output</td>
<td>117 tok/s</td>
<td>Network dominates</td>
</tr>
</tbody>
</table>
<p>We measured roughly half the advertised speed. That's not a knock on Mercury 2 — it's physics. Our tests ran from a consumer internet connection through the public API. The 1,196 tok/s figure is server-side throughput measured at the inference layer, before network overhead, TLS, HTTP framing, and Python SDK parsing eat into it.</p>
<p>To match their number, you'd need to benchmark from co-located infrastructure (same cloud region) or measure at the GPU layer. For what it's worth, <strong>558 tok/s over the public internet is genuinely fast</strong>: the autoregressive models in our comparison below run at roughly 60-250 tok/s under comparable conditions.</p>
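<p>One way to see why short outputs look so much slower is a toy model in which every request pays a fixed overhead before tokens arrive at the server-side rate. This is a deliberate simplification, and the 0.5-second overhead is a hypothetical value chosen for illustration, not something we measured; the model is not expected to reproduce our numbers exactly.</p>
<pre><code class="language-python">def effective_speed(tokens, server_tok_s, overhead_s):
    """Client-side tok/s when a fixed overhead precedes server-rate generation."""
    return tokens / (overhead_s + tokens / server_tok_s)


SERVER_TOK_S = 1196  # Inception's server-side figure
OVERHEAD_S = 0.5     # hypothetical fixed cost: network, TLS, queueing

for tokens in (50, 300, 900, 5000):
    speed = effective_speed(tokens, SERVER_TOK_S, OVERHEAD_S)
    print(f"{tokens} tokens: about {speed:.0f} tok/s seen by the client")</code></pre>
<p>Under these assumptions the observed speed climbs toward the server-side rate as outputs get longer but never reaches it, which is the same pattern as the short-output and large-output rows in the table.</p>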
<hr />
<h2>How Does It Compare? Price &amp; Speed</h2>
<p>Speed only matters in context. Mercury 2 competes in the &quot;fast and cheap&quot; tier — models you'd use for high-volume pipelines, agents, and latency-sensitive applications, not frontier reasoning. Here's how it stacks up:</p>
<h3>Pricing Comparison</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Input $/M</th>
<th>Output $/M</th>
<th>Context</th>
<th>Architecture</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Mercury 2</strong></td>
<td><strong>$0.25</strong></td>
<td><strong>$1.00</strong></td>
<td>128K</td>
<td>Diffusion</td>
</tr>
<tr>
<td>DeepSeek V3</td>
<td>$0.28</td>
<td>$0.42</td>
<td>128K</td>
<td>Autoregressive (MoE)</td>
</tr>
<tr>
<td>GPT-4.1 Nano</td>
<td>$0.10</td>
<td>$0.40</td>
<td>1M</td>
<td>Autoregressive</td>
</tr>
<tr>
<td>GPT-4o-mini</td>
<td>$0.15</td>
<td>$0.60</td>
<td>128K</td>
<td>Autoregressive</td>
</tr>
<tr>
<td>Gemini 2.0 Flash</td>
<td>$0.10</td>
<td>$0.40</td>
<td>1M</td>
<td>Autoregressive</td>
</tr>
<tr>
<td>Claude 3.5 Haiku</td>
<td>$0.80</td>
<td>$4.00</td>
<td>200K</td>
<td>Autoregressive</td>
</tr>
<tr>
<td>GPT-4o</td>
<td>$2.50</td>
<td>$10.00</td>
<td>128K</td>
<td>Autoregressive</td>
</tr>
<tr>
<td>Claude Sonnet 4.6</td>
<td>$3.00</td>
<td>$15.00</td>
<td>200K</td>
<td>Autoregressive</td>
</tr>
</tbody>
</table>
<p>On input pricing, Mercury 2 is mid-pack — GPT-4.1 Nano and Gemini 2.0 Flash are cheaper at $0.10/M. On output, it's $1.00/M — more expensive than DeepSeek ($0.42) and GPT-4.1 Nano ($0.40), but far cheaper than Claude Haiku ($4.00) or any mid-tier model.</p>
<p><strong>The real cost story is output-heavy workloads.</strong> If you're generating long responses (agents, code generation, content pipelines), output pricing dominates. At $1.00/M output, Mercury 2 costs:</p>
<ul>
<li>2.5x more than GPT-4.1 Nano</li>
<li>2.5x more than Gemini 2.0 Flash</li>
<li>4x less than Claude 3.5 Haiku</li>
<li>10x less than GPT-4o</li>
<li>15x less than Claude Sonnet</li>
</ul>
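<p>To see how these prices play out on an output-heavy job, here is a small sketch that applies the per-million prices from the table to a workload. The 1M-in, 4M-out job is a hypothetical example, and <code>workload_cost</code> is our own helper.</p>
<pre><code class="language-python"># (input $/M, output $/M) from the pricing table above.
PRICES = {
    "Mercury 2": (0.25, 1.00),
    "DeepSeek V3": (0.28, 0.42),
    "GPT-4.1 Nano": (0.10, 0.40),
    "Claude 3.5 Haiku": (0.80, 4.00),
}


def workload_cost(model, input_tokens, output_tokens):
    """Dollar cost of a workload at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical output-heavy job: 1M tokens in, 4M tokens out.
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 1_000_000, 4_000_000):.2f}")</code></pre>
<p>With output outweighing input four to one, the output price accounts for almost all of each total, which is why the ratios above are computed on output pricing alone.</p>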
<h3>Speed Comparison</h3>
<table>
<thead>
<tr>
<th>Model</th>
<th>Approx. Speed (tok/s)</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Mercury 2</strong></td>
<td><strong>495–558</strong> (measured)</td>
<td>Diffusion; 1,196 server-side</td>
</tr>
<tr>
<td>Gemini 2.0 Flash</td>
<td>~250</td>
<td>Google's speed tier</td>
</tr>
<tr>
<td>DeepSeek V3</td>
<td>~100–160</td>
<td>Varies by load</td>
</tr>
<tr>
<td>GPT-4o-mini</td>
<td>~100–130</td>
<td>OpenAI speed tier</td>
</tr>
<tr>
<td>GPT-4.1 Nano</td>
<td>~150–200</td>
<td>OpenAI's fastest</td>
</tr>
<tr>
<td>Claude 3.5 Haiku</td>
<td>~80–100</td>
<td>Anthropic speed tier</td>
</tr>
<tr>
<td>GPT-4o</td>
<td>~60–90</td>
<td>Mid-tier</td>
</tr>
<tr>
<td>Claude Sonnet 4.6</td>
<td>~70–80</td>
<td>Mid-tier</td>
</tr>
</tbody>
</table>
<p><em>Speed figures are approximate client-side measurements and vary by network, region, and load. Mercury 2 figures are from our testing.</em></p>
<p>Even through the public internet, Mercury 2 is <strong>roughly 2x faster than the next fastest competitor</strong> (Gemini Flash at ~250 tok/s) and <strong>roughly 5-9x faster than mid-tier models</strong> like GPT-4o and Claude Sonnet. This is where the diffusion architecture genuinely shines; it's not marketing fluff.</p>
<h3>Cost per Million Output Tokens at Speed</h3>
<p>A useful way to think about it: what do you pay per million output tokens, and how fast do you get them?</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Output $/M</th>
<th>Speed (tok/s)</th>
<th>Time for 1M tokens</th>
<th>Cost per hour of output</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Mercury 2</strong></td>
<td>$1.00</td>
<td>550</td>
<td>~30 min</td>
<td>~$2.00</td>
</tr>
<tr>
<td>GPT-4.1 Nano</td>
<td>$0.40</td>
<td>175</td>
<td>~95 min</td>
<td>~$0.25</td>
</tr>
<tr>
<td>DeepSeek V3</td>
<td>$0.42</td>
<td>130</td>
<td>~128 min</td>
<td>~$0.20</td>
</tr>
<tr>
<td>Gemini 2.0 Flash</td>
<td>$0.40</td>
<td>250</td>
<td>~67 min</td>
<td>~$0.36</td>
</tr>
<tr>
<td>Claude 3.5 Haiku</td>
<td>$4.00</td>
<td>90</td>
<td>~185 min</td>
<td>~$1.30</td>
</tr>
</tbody>
</table>
<p>Mercury 2 isn't the cheapest per token, but it delivers those tokens fastest. If your bottleneck is latency — how quickly you can complete an agentic loop, respond to a user, or process a document — Mercury 2 wins decisively. If your bottleneck is pure cost and you can tolerate slower speeds, DeepSeek V3 or GPT-4.1 Nano are cheaper.</p>
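<p>The last two columns of the table follow directly from price and speed. A minimal sketch of the calculation, assuming nonstop generation at a steady rate (which real workloads rarely achieve):</p>
<pre><code class="language-python">def minutes_per_million(tok_s):
    """Minutes needed to generate one million tokens at a steady rate."""
    return 1_000_000 / tok_s / 60


def cost_per_hour(output_price_per_m, tok_s):
    """Dollars spent on output tokens during one hour of nonstop generation."""
    tokens_per_hour = tok_s * 3600
    return tokens_per_hour / 1_000_000 * output_price_per_m


# Mercury 2 row: $1.00/M output at 550 tok/s.
print(round(minutes_per_million(550)))     # 30
print(round(cost_per_hour(1.00, 550), 2))  # 1.98</code></pre>
<p>A higher cost per hour here is a side effect of speed: a faster model simply produces more billable tokens in the same hour.</p>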
<hr />
<h2>Beyond Speed: Extended Testing</h2>
<p>We ran a second test suite covering reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, and edge cases. The results surfaced both strengths and a critical quirk.</p>
<h3>The max_tokens Trap</h3>
<p>The most important practical finding: <strong>Mercury 2 needs generous <code>max_tokens</code> values or it returns empty responses.</strong></p>
<p>With autoregressive models, setting <code>max_tokens=20</code> means &quot;generate up to 20 tokens, stop when you're done.&quot; The model emits tokens one by one and stops early if it finishes. Mercury 2's diffusion architecture works differently — it appears to allocate the full output buffer upfront. If that buffer is too small, the model produces empty content with <code>finish_reason=length</code> and <code>tokens=0</code>:</p>
<pre><code class="language-python"># This fails silently — returns empty string
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=10,  # too low for diffusion model
)
print(response.choices[0].message.content)  # ""

# This works — give it room
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=150,  # generous headroom
)
print(response.choices[0].message.content)  # "4"</code></pre>
<p><strong>Rule of thumb: always set <code>max_tokens</code> to at least 150–200, even if you expect a short answer.</strong> The model will still stop early (<code>finish_reason=stop</code>) when it's done — you won't waste tokens. But if you set it too low, you get nothing. This is a significant difference from autoregressive models and will bite you in production if you're migrating existing code.</p>
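<p>If you're migrating existing code, one defensive option is to enforce a floor on <code>max_tokens</code> and to flag the empty-response case explicitly instead of passing an empty string downstream. The sketch below is one possible guard, not an official recommendation; both helper names are ours, and the floor of 150 simply follows the rule of thumb above.</p>
<pre><code class="language-python">MIN_MAX_TOKENS = 150  # floor taken from the rule of thumb above


def safe_max_tokens(requested):
    """Raise too-small max_tokens values to the floor; leave larger ones alone."""
    return max(requested, MIN_MAX_TOKENS)


def is_silent_truncation(content, finish_reason):
    """True when the response is empty and was cut off by the token limit."""
    return finish_reason == "length" and not (content or "").strip()


print(safe_max_tokens(10))                 # 150
print(safe_max_tokens(1024))               # 1024
print(is_silent_truncation("", "length"))  # True
print(is_silent_truncation("4", "stop"))   # False</code></pre>
<p>In a real client you would pass <code>safe_max_tokens(n)</code> as the <code>max_tokens</code> argument and check <code>is_silent_truncation</code> on the first choice's content and <code>finish_reason</code> before using the result.</p>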
<h3>The Proof: 10/25 → 25/25</h3>
<p>Our first run scored <strong>10 out of 25</strong>, a result that would make Mercury 2 look broken. Our second run scored <strong>25/25</strong>. The prompts, model, and API were the same; we raised <code>max_tokens</code> and corrected one check in the agentic suite. The 25-point total counts the seven scored suites; the concurrency row is tallied separately out of 20. Here's the full breakdown:</p>
<table>
<thead>
<tr>
<th>Suite</th>
<th>Initial</th>
<th>Final</th>
<th>What changed</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reasoning</td>
<td>5/6</td>
<td>6/6</td>
<td>Logic: 80→150 tokens, instruction: 120→250</td>
</tr>
<tr>
<td>Multi-language</td>
<td>0/3</td>
<td>3/3</td>
<td>30→200 tokens</td>
</tr>
<tr>
<td>Multi-turn</td>
<td>0/3</td>
<td>3/3</td>
<td>30–60→200 tokens</td>
</tr>
<tr>
<td>Agentic</td>
<td>3/4</td>
<td>4/4</td>
<td>Fixed step 3 logic (model skipped get_price)</td>
</tr>
<tr>
<td>Needle-in-Haystack</td>
<td>0/3</td>
<td>3/3</td>
<td>40→200 tokens</td>
</tr>
<tr>
<td>Concurrency</td>
<td>0/20</td>
<td>20/20</td>
<td>20→150 tokens</td>
</tr>
<tr>
<td>Sampling</td>
<td>0/1</td>
<td>1/1</td>
<td>10→150 tokens</td>
</tr>
<tr>
<td>Edge Cases</td>
<td>2/5</td>
<td>5/5</td>
<td>System prompt 20→150, JSON 80→200, long sys 30→150</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>10/25</strong></td>
<td><strong>25/25</strong></td>
<td></td>
</tr>
</tbody>
</table>
<p>Almost every failure traced back to the same root cause: <code>max_tokens</code> too low for the diffusion architecture. The one exception was the agentic suite, where we fixed the step 3 logic in our test after the model skipped <code>get_price</code>. No actual quality or capability issues were found. If you're migrating from GPT or Claude, your existing <code>max_tokens</code> values are almost certainly too low for Mercury 2.</p>
<h3>Quality &amp; Reasoning: 6/6</h3>
<table>
<thead>
<tr>
<th>Test</th>
<th>Result</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>Arithmetic (17×23+14−5)</td>
<td>PASS</td>
<td>Returned <code>400</code> correctly</td>
</tr>
<tr>
<td>Word problem (45−12+8)</td>
<td>PASS</td>
<td>Returned <code>41</code> correctly</td>
</tr>
<tr>
<td>Logic (invalid syllogism)</td>
<td>PASS</td>
<td>Correctly answered &quot;No&quot; with valid reasoning</td>
</tr>
<tr>
<td>Code generation (fibonacci)</td>
<td>PASS</td>
<td>Clean Python function, 107 chars</td>
</tr>
<tr>
<td>Instruction following (3 bullets)</td>
<td>PASS</td>
<td>Exactly 3 dash-prefixed bullets</td>
</tr>
<tr>
<td>Factual recall (capital of Australia)</td>
<td>PASS</td>
<td><code>Canberra</code></td>
</tr>
</tbody>
</table>
<p>Perfect score. Math, logic, code generation, instruction following, and factual recall all pass cleanly.</p>
<h3>Multi-language: 3/3</h3>
<table>
<thead>
<tr>
<th>Language</th>
<th>Prompt</th>
<th>Response</th>
</tr>
</thead>
<tbody>
<tr>
<td>Serbian</td>
<td>&quot;Koji je glavni grad Srbije?&quot;</td>
<td>Beograd</td>
</tr>
<tr>
<td>German</td>
<td>&quot;Was ist die Hauptstadt von Deutschland?&quot;</td>
<td>Berlin</td>
</tr>
<tr>
<td>Japanese</td>
<td>&quot;日本の首都はどこですか？&quot;</td>
<td>東京</td>
</tr>
</tbody>
</table>
<p>Mercury 2 handles non-English prompts correctly, including Serbian (written here in Latin script), German, and Japanese. Responses are accurate and concise.</p>
<h3>Multi-turn Conversation: 3/3</h3>
<p>We tested whether the model maintains context across turns:</p>
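<pre><code class="language-python">messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise."},
    {"role": "user", "content": "My name is Marko and I live in Novi Sad."},
]

def ask(messages):
    # Send the full history, then record the reply so later turns can see it.
    reply = client.chat.completions.create(
        model="mercury-2", messages=messages, max_tokens=200
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

ask(messages)  # assistant acknowledges the introduction
messages.append({"role": "user", "content": "What is my name?"})
print(ask(messages))  # → "Your name is Marko."
messages.append({"role": "user", "content": "Where do I live?"})
print(ask(messages))  # → "You live in Novi Sad."</code></pre>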
<p>Both facts recalled correctly. We also tested persona consistency by assigning a pirate persona — Mercury 2 committed fully (&quot;Arr, matey! Gather 'round the galley o' knowledge...&quot;) with 7 pirate-themed words in a single response.</p>
<h3>Agentic Tool Chains: 4/4</h3>
<p>This was the most impressive result. We defined three tools (<code>search_product</code>, <code>get_price</code>, <code>add_to_cart</code>) and asked Mercury 2 to find a blue t-shirt and add it to a cart:</p>
<pre><code>Step 1: User asks "Find me a blue t-shirt and add it to my cart."
     → Model calls search_product(query="blue t-shirt")       ✓

Step 2: We return search results with SKU-1234
     → Model calls add_to_cart(product_id="SKU-1234")          ✓

Step 3: We confirm the cart addition
     → Model responds: "Your blue t-shirt has been added       ✓
        to your cart. Let me know if you'd like anything else."

Step 4: We return a tool error ("Service temporarily unavailable")
     → Model retries the tool call                             ✓</code></pre>
<p>Four steps, four correct decisions. The model understood the task, chained tools in the right order, confirmed success in natural language, and recovered from an error by retrying. This validates Inception's pitch that Mercury 2 is built for agentic workloads.</p>
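<p>For context, the harness side of steps 2 to 4 comes down to sending tool outcomes back to the model as tool-role messages. The sketch below shows one way to build them in the OpenAI chat format. The <code>{"result": ...}</code> and <code>{"error": ...}</code> payload shapes are our own convention, and the call id and cart payload are hypothetical.</p>
<pre><code class="language-python">import json


def tool_result_message(tool_call_id, result=None, error=None):
    """Build the tool-role message that reports a tool outcome to the model."""
    payload = {"error": error} if error is not None else {"result": result}
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(payload),
    }


# Hypothetical call id; in practice it comes from tc.id on the model's tool call.
failure = tool_result_message("call_1", error="Service temporarily unavailable")
success = tool_result_message("call_1", result={"cart_items": 1})
print(failure["content"])
print(success["content"])</code></pre>
<p>Reporting the failure as ordinary tool output, rather than raising in the harness, is what gives the model the chance to decide on a retry by itself.</p>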
<h3>Needle in a Haystack: 3/3</h3>
<p>We hid the string <code>MERCURY-FAST-7742</code> inside ~4,000 tokens of filler text at three positions:</p>
<table>
<thead>
<tr>
<th>Position</th>
<th>Found?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Beginning</td>
<td>MERCURY-FAST-7742</td>
</tr>
<tr>
<td>Middle</td>
<td>MERCURY-FAST-7742</td>
</tr>
<tr>
<td>End</td>
<td>MERCURY-FAST-7742</td>
</tr>
</tbody>
</table>
<p>Perfect retrieval at all positions. Note that this only exercises about 4K tokens of the 128K context window, so it says nothing yet about retrieval at longer lengths.</p>
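<p>A test like this is easy to reconstruct. The sketch below builds a haystack with the needle at each of the three positions and scores an answer by substring match. The needle sentence wording, the filler sentence, and the sentence count are hypothetical choices of ours, not the exact text we used; adjust the count to reach the length you want to test.</p>
<pre><code class="language-python">NEEDLE = "The secret code is MERCURY-FAST-7742."
FILLER = "The quick brown fox jumps over the lazy dog."  # hypothetical filler


def build_haystack(position, n_sentences=400):
    """Return filler text with the needle at 'beginning', 'middle' or 'end'."""
    index = {"beginning": 0, "middle": n_sentences // 2, "end": n_sentences}[position]
    sentences = [FILLER] * n_sentences
    sentences.insert(index, NEEDLE)
    return " ".join(sentences)


def found(answer):
    """Score a model answer: pass if it contains the hidden string."""
    return "MERCURY-FAST-7742" in answer


for position in ("beginning", "middle", "end"):
    text = build_haystack(position)
    print(position, text.index(NEEDLE), len(text))</code></pre>
<p>Raising <code>n_sentences</code> is the obvious way to extend the same test toward the longer contexts we did not cover.</p>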
<h3>Concurrency: 20/20</h3>
<p>We fired parallel requests to test API behavior under load:</p>
<table>
<thead>
<tr>
<th>Parallel Requests</th>
<th>Success</th>
<th>Wall Time</th>
<th>Total Tokens</th>
<th>Avg Latency</th>
</tr>
</thead>
<tbody>
<tr>
<td>5</td>
<td>5/5</td>
<td>0.78s</td>
<td>20</td>
<td>0.65s</td>
</tr>
<tr>
<td>15</td>
<td>15/15</td>
<td>0.88s</td>
<td>60</td>
<td>0.61s</td>
</tr>
</tbody>
</table>
<p>Every request succeeded. Wall time barely increased from 5 to 15 parallel requests (0.78s → 0.88s), and average latency stayed consistent at ~0.6s. The API handles concurrency well — no throttling, no degradation at this scale.</p>
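<p>A concurrency check like this can be sketched with a thread pool. The version below uses a sleeping stand-in instead of a real API call so it runs anywhere; <code>fake_request</code> and its 50 ms latency are hypothetical, and you would swap in your own client call to measure a real endpoint.</p>
<pre><code class="language-python">import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i):
    """Stand-in for one API call; swap in a real client call to test an API."""
    start = time.perf_counter()
    time.sleep(0.05)  # simulated latency
    return time.perf_counter() - start


def run_parallel(n, request_fn=fake_request):
    """Fire n requests at once; return (successes, wall_time, avg_latency)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(request_fn, i) for i in range(n)]
        latencies = []
        for future in futures:
            try:
                latencies.append(future.result())
            except Exception:
                pass  # a failed request counts against the success total
    wall = time.perf_counter() - start
    avg = sum(latencies) / len(latencies) if latencies else float("nan")
    return len(latencies), wall, avg


for n in (5, 15):
    ok, wall, avg = run_parallel(n)
    print(f"{n} parallel: {ok}/{n} ok, wall {wall:.2f}s, avg latency {avg:.2f}s")</code></pre>
<p>Wall time close to the average latency, as in the table above, is the sign that requests really ran in parallel rather than being queued.</p>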
<h3>Temperature &amp; Sampling</h3>
<p>Diffusion models sample differently from autoregressive models. We tested whether Mercury 2's temperature parameter behaves as expected:</p>
<pre><code>temp=0.0: ['turquoise', 'turquoise', 'turquoise', 'turquoise'] — 1 unique
temp=0.5: ['turquoise', 'turquoise', 'turquoise', 'turquoise'] — 1 unique
temp=1.0: ['turquoise', 'cerulean', 'indigo', 'turquoise']     — 3 unique
temp=1.5: ['turquoise', 'turquoise', 'cyan', 'turquoise']      — 2 unique</code></pre>
<p>Temperature works, but with a twist: <strong>in our runs, diversity only kicked in above 0.5.</strong> At temp=0.0 and 0.5, all four responses were identical, which suggests the diffusion denoising process converges to the same output. At temp=1.0, we see real variety (turquoise, cerulean, indigo). With only four samples per setting this is an indication rather than a measurement. Determinism at temp=0 is confirmed: <code>['4', '4', '4']</code> across three runs.</p>
<p>This is meaningfully different from autoregressive models, where temp=0.5 already produces some variation.</p>
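<p>The unique counts in the listing are just set sizes. A short sketch that recomputes them from the responses shown above:</p>
<pre><code class="language-python"># Responses copied from the listing above, four per temperature.
samples = {
    0.0: ["turquoise", "turquoise", "turquoise", "turquoise"],
    0.5: ["turquoise", "turquoise", "turquoise", "turquoise"],
    1.0: ["turquoise", "cerulean", "indigo", "turquoise"],
    1.5: ["turquoise", "turquoise", "cyan", "turquoise"],
}

unique_counts = {temp: len(set(answers)) for temp, answers in samples.items()}
print(unique_counts)  # {0.0: 1, 0.5: 1, 1.0: 3, 1.5: 2}</code></pre>
<p>With so few samples per setting the counts are noisy (temp=1.5 shows less variety than temp=1.0), so a larger sample would be needed before drawing firm conclusions.</p>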
<h3>Edge Cases: 5/5</h3>
<table>
<thead>
<tr>
<th>Test</th>
<th>Result</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>Minimal prompt (&quot;Hi&quot;)</td>
<td>PASS</td>
<td>&quot;Hello! How can I assist you today?&quot;</td>
</tr>
<tr>
<td>System prompt (exactly 3 words)</td>
<td>PASS</td>
<td>&quot;I'm doing well.&quot; — exactly 3 words</td>
</tr>
<tr>
<td>Stop sequence</td>
<td>PASS</td>
<td>Correctly stopped before &quot;10&quot;</td>
</tr>
<tr>
<td>Nested JSON</td>
<td>PASS</td>
<td>Valid JSON with nested objects and arrays</td>
</tr>
<tr>
<td>Long system prompt (50 rules)</td>
<td>PASS</td>
<td>Returned &quot;acknowledged&quot;</td>
</tr>
</tbody>
</table>
<p>All edge cases pass with adequate <code>max_tokens</code> headroom. System prompt adherence, stop sequences, and complex JSON structures all work correctly.</p>
<h3>Extended Test Summary</h3>
<table>
<thead>
<tr>
<th>Suite</th>
<th>Score</th>
<th>Verdict</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reasoning</td>
<td>6/6</td>
<td>Math, logic, code, facts, instructions</td>
</tr>
<tr>
<td>Multi-language</td>
<td>3/3</td>
<td>Serbian, German, Japanese all correct</td>
</tr>
<tr>
<td>Multi-turn</td>
<td>3/3</td>
<td>Memory and persona consistency</td>
</tr>
<tr>
<td>Agentic Loops</td>
<td>4/4</td>
<td>Multi-step tool chains + error recovery</td>
</tr>
<tr>
<td>Needle-in-Haystack</td>
<td>3/3</td>
<td>Perfect retrieval at all positions</td>
</tr>
<tr>
<td>Edge Cases</td>
<td>5/5</td>
<td>System prompts, stop sequences, nested JSON</td>
</tr>
<tr>
<td>Concurrency</td>
<td>20/20</td>
<td>No degradation at 15 parallel requests</td>
</tr>
<tr>
<td>Sampling</td>
<td>1/1</td>
<td>Deterministic at temp=0, diversity above 0.5</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>25/25</strong></td>
<td></td>
</tr>
</tbody>
</table>
<hr />
<h2>The Bottom Line</h2>
<p>Mercury 2 scored <strong>25/25 on our extended test suite</strong> — every capability we tested works correctly. Reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, concurrency, temperature sampling, and edge cases all pass. The OpenAI compatibility is seamless — you can swap it into an existing codebase in under a minute.</p>
<p>The one thing you must know before deploying: <strong>set <code>max_tokens</code> generously (150+), even for short expected outputs.</strong> The diffusion architecture needs output headroom or it returns silent empty responses. This is the single biggest gotcha when migrating from autoregressive models. The model still stops early when it's done — you won't waste tokens — but too-small a buffer produces nothing.</p>
<p>The speed advantage is genuine, though tempered by network reality. You won't see 1,196 tok/s from your laptop, but 400-550 tok/s is still roughly 2x faster than the next fastest alternative. The agentic capabilities are particularly strong: multi-step tool chains with error recovery worked flawlessly, validating Inception's core pitch. Temperature sampling works but behaves differently: in our tests diversity only kicked in above 0.5, unlike autoregressive models where any non-zero temperature introduces variation.</p>
<p>It's not the cheapest model per token (GPT-4.1 Nano and DeepSeek V3 undercut it on output pricing), and it's not the smartest (frontier models like Claude Sonnet or GPT-4o have deeper reasoning). But in the speed-to-cost ratio for production workloads, Mercury 2 occupies a unique position — and as the first commercial diffusion LLM, it represents a genuine architectural bet that the rest of the industry is watching.</p>
<p><strong>Specs at a Glance:</strong></p>
<table>
<thead>
<tr>
<th>Spec</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Model</td>
<td><code>mercury-2</code></td>
</tr>
<tr>
<td>Architecture</td>
<td>Diffusion LLM (dLLM)</td>
</tr>
<tr>
<td>Context window</td>
<td>128K tokens</td>
</tr>
<tr>
<td>Max completion</td>
<td>16,384 tokens</td>
</tr>
<tr>
<td>Input pricing</td>
<td>$0.25/M tokens</td>
</tr>
<tr>
<td>Output pricing</td>
<td>$1.00/M tokens</td>
</tr>
<tr>
<td>API compatibility</td>
<td>OpenAI-compatible</td>
</tr>
<tr>
<td>Measured throughput</td>
<td>495–558 tok/s (client-side)</td>
</tr>
</tbody>
</table>]]></content:encoded>
      <category>news</category>
      <category>llm</category>
      <category>diffusion</category>
      <category>benchmarks</category>
      <category>api</category>
      <category>tools</category>
    </item>
    <item>
      <title>SEO Is Dead. Your Rankings Don&#039;t Matter Anymore.</title>
      <link>https://ai.rs/ai-for-business/seo-is-dead-rankings-dont-matter</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/seo-is-dead-rankings-dont-matter</guid>
      <pubDate>Fri, 27 Feb 2026 13:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>LinkedIn just lost 60% of its B2B traffic despite ranking #1 on Google. The old playbook — rank, click, visit, convert — is broken. Here&#039;s what&#039;s replacing it.</description>
      <content:encoded><![CDATA[<h2>The Number That Should Scare Every Business Owner</h2>
<p>On January 28, 2026, LinkedIn published something remarkable. Not a product launch. Not a feature update. A confession.</p>
<p>Non-brand B2B traffic to their web properties had dropped <strong>up to 60%</strong>. Not because their rankings fell — they didn't. Rankings were stable. The clicks just... stopped coming.</p>
<p>LinkedIn, a company Microsoft paid $26 billion for, with an army of SEO professionals, is telling the world that the old rules no longer apply.</p>
<p>If it can happen to them, it's already happening to you.</p>
<h2>What Changed</h2>
<p>The answer is two letters: <strong>AI</strong>.</p>
<p>When someone Googles &quot;best CRM for small business&quot; in 2026, they don't see ten blue links. They see an AI-generated answer that synthesizes information from dozens of sources, gives a direct recommendation, and answers follow-up questions on the spot.</p>
<p>The user gets what they need. They never click through to your website.</p>
<p>The numbers are brutal:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Before AI Search</th>
<th>After AI Search</th>
</tr>
</thead>
<tbody>
<tr>
<td>Searches ending without a click</td>
<td>~40%</td>
<td>~60%</td>
</tr>
<tr>
<td>Click-through rate on #1 ranking</td>
<td>~30%</td>
<td>~13%</td>
</tr>
<tr>
<td>AI Overview zero-click rate</td>
<td>—</td>
<td><strong>83%</strong></td>
</tr>
</tbody>
</table>
<p>That last number is the killer. When Google shows an AI Overview for your search term, <strong>83% of users never visit any website</strong>. Your #1 ranking is now a participation trophy.</p>
<h2>The Old Playbook Is Dead</h2>
<p>For twenty years, B2B marketing followed the same script:</p>
<ol>
<li><strong>Create content</strong> around keywords your customers search for</li>
<li><strong>Rank on Google</strong> through search engine optimization</li>
<li><strong>Get clicks</strong> from search results</li>
<li><strong>Convert visitors</strong> into leads and customers</li>
</ol>
<p>Every step in this chain assumed humans would click through to read your content. That assumption is broken.</p>
<p>LinkedIn's own data shows it clearly: rankings held steady while traffic collapsed. The pipeline didn't leak — the entire first half of it evaporated.</p>
<h3>Who Gets Hit Hardest</h3>
<p>Not everyone feels this equally:</p>
<p><strong>Most vulnerable:</strong></p>
<ul>
<li>Informational content (&quot;what is...&quot;, &quot;how to...&quot;, &quot;best practices for...&quot;)</li>
<li>Industry overview and comparison pages</li>
<li>FAQ and knowledge base content</li>
<li>Generic thought leadership</li>
</ul>
<p><strong>Least vulnerable (for now):</strong></p>
<ul>
<li>Branded searches (people looking specifically for you)</li>
<li>Transactional pages (pricing, signup, checkout)</li>
<li>Unique tools and interactive content</li>
<li>Original research with proprietary data</li>
</ul>
<p>If your traffic comes from people learning about a topic (not searching for you by name), you're in the danger zone.</p>
<h2>LinkedIn's New Playbook</h2>
<p>Two weeks after their disclosure, LinkedIn released a 17-page guide on adapting to AI search. Their new framework replaces the old funnel:</p>
<p><strong>Old model:</strong> Rank → Click → Visit → Convert</p>
<p><strong>New model:</strong> Be seen → Be mentioned → Be considered → Be chosen</p>
<p>The shift is fundamental. Instead of optimizing for Google's algorithm, LinkedIn is now optimizing for <strong>AI citations</strong>. The goal isn't getting someone to click — it's making sure the AI mentions you when it answers the question.</p>
<p>Their new KPIs:</p>
<ul>
<li>How often is LinkedIn cited in AI-generated answers?</li>
<li>When AI summarizes a topic, does it reference LinkedIn data?</li>
<li>Is LinkedIn the authoritative source the AI trusts?</li>
</ul>
<p>They're not fighting the wave. They're learning to surf it.</p>
<h2>What This Means for Your Business</h2>
<p>Let's get practical. Here's what changes for businesses that depend on web traffic:</p>
<h3>1. Your Website Content Needs to Feed AI, Not Just Humans</h3>
<p>AI systems consume your content differently than humans do. They care about:</p>
<ul>
<li><strong>Clear structure</strong> — headings, lists, and tables that are easy to parse</li>
<li><strong>Definitive statements</strong> — &quot;The average cost is $X&quot; beats &quot;costs vary depending on...&quot;</li>
<li><strong>Cited data</strong> — numbers with sources are more likely to be referenced</li>
<li><strong>Unique information</strong> — original data, case studies, proprietary research</li>
</ul>
<p>Generic &quot;ultimate guide&quot; blog posts are AI fodder — the AI will summarize them and the user will never visit you. <strong>Original data is the moat.</strong></p>
<h3>2. Implement llms.txt</h3>
<p>This is the robots.txt for AI. A <code>/llms.txt</code> file on your website tells AI crawlers what your business does, what content matters, and how to represent you.</p>
<p>It looks like this:</p>
<pre><code># Company Name
&gt; One-line description of what you do.

## Core Services
- Service 1: description
- Service 2: description

## Key Content
- [Article Title](URL): description</code></pre>
<p>If you don't tell the AI how to describe you, it will guess. And it will get it wrong.</p>
<h3>3. Build Direct Channels</h3>
<p>Every subscriber on your email list is someone AI search can never take away from you. The same goes for:</p>
<ul>
<li><strong>Email newsletters</strong> — direct inbox access, no algorithm in between</li>
<li><strong>Community</strong> — Discord, Slack, or forum members</li>
<li><strong>Repeat customers</strong> — people who bookmark your site, not Google it</li>
</ul>
<p>LinkedIn learned this the hard way. The companies that survive the AI search shift are the ones that built direct relationships before the traffic disappeared.</p>
<h3>4. Focus on Brand, Not Keywords</h3>
<p>When 83% of searches that show an AI Overview end without a click, the game changes. You can't win by ranking for &quot;how to choose a CRM.&quot; But you can win by being <strong>the brand people search for by name</strong>.</p>
<p>Brand searches still convert. &quot;Salesforce pricing&quot; still drives clicks because the user wants <em>your specific website</em>, not an AI summary.</p>
<p>The investment shifts from &quot;content marketing&quot; to &quot;brand building.&quot; That means:</p>
<ul>
<li>Being the source journalists and analysts quote</li>
<li>Publishing original research others reference</li>
<li>Building products and tools people talk about</li>
<li>Having a point of view that makes you memorable</li>
</ul>
<h3>5. Rethink Your Metrics</h3>
<p>If you're still measuring success by organic traffic, you're watching the wrong dashboard. New metrics that matter:</p>
<ul>
<li><strong>AI citation rate</strong> — is your brand mentioned in AI answers?</li>
<li><strong>Brand search volume</strong> — are more people searching for you by name?</li>
<li><strong>Direct traffic</strong> — people typing your URL or using bookmarks</li>
<li><strong>Email list growth</strong> — your owned audience, immune to algorithm changes</li>
<li><strong>Referral traffic</strong> — links from other sites, podcasts, newsletters</li>
</ul>
<p>A 60% traffic drop looks catastrophic if traffic is your KPI. It looks irrelevant if your revenue comes from direct relationships and brand recognition.</p>
<h2>The Uncomfortable Truth</h2>
<p>This isn't a temporary disruption. AI search isn't going away — it's getting better, faster, and more integrated into every platform. Google, Bing, Perplexity, ChatGPT — they all want to answer the question so the user doesn't have to leave.</p>
<p>The businesses that adapt will thrive. The ones that keep optimizing meta descriptions and chasing keyword rankings will wonder where their traffic went.</p>
<p>LinkedIn, with all its resources, data, and expertise, took a hit of up to 60% before adapting. Most small and mid-size businesses don't have that runway.</p>
<p>The time to adapt is now, not when your traffic dashboard turns red.</p>
<h2>Action Items</h2>
<p>Start this week:</p>
<ol>
<li><strong>Audit your traffic</strong> — what percentage comes from informational vs. branded searches?</li>
<li><strong>Add llms.txt</strong> to your website (<a href="https://llmstxt.org/">learn the format</a>)</li>
<li><strong>Start an email list</strong> if you don't have one — this is your insurance policy</li>
<li><strong>Review your content</strong> — does it contain unique data, or is it summarizable commodity content?</li>
<li><strong>Track AI visibility</strong> — search your brand and products in ChatGPT and Perplexity. What do they say about you?</li>
</ol>
<p>The old game rewarded volume — more pages, more keywords, more content. The new game rewards authority. Be the source, not the summary.</p>]]></content:encoded>
      <category>business</category>
      <category>seo</category>
      <category>ai-search</category>
      <category>b2b</category>
      <category>marketing</category>
      <category>strategy</category>
    </item>
    <item>
      <title>Claude Code Remote Control: Continue Coding Sessions from Your Phone</title>
      <link>https://ai.rs/ai-developer/claude-code-remote-control-mobile</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/claude-code-remote-control-mobile</guid>
      <pubDate>Fri, 27 Feb 2026 01:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Anthropic&#039;s new Remote Control feature lets you start a Claude Code session at your desk and pick it up from your phone or any browser. Your local environment stays intact — no cloud execution needed.</description>
      <content:encoded><![CDATA[<h2>What Is Claude Code Remote Control?</h2>
<p><a href="https://code.claude.com/docs/en/remote-control">Claude Code Remote Control</a> is a new feature that connects your local Claude Code terminal session to your phone, tablet, or any browser. Start a coding task at your desk, walk away, and continue it from your couch using the Claude mobile app.</p>
<p>The key difference from cloud-based coding: <strong>everything runs locally</strong>. Your filesystem, MCP servers, tools, and project configuration stay on your machine. The mobile interface is just a window into your running session.</p>
<h2>How It Works</h2>
<pre><code>Your Machine (terminal)          Anthropic API           Your Phone
┌─────────────────────┐         ┌───────────┐          ┌──────────┐
│ claude remote-control│ ──TLS──▶│  Routes   │◀──TLS── │ Claude   │
│                     │         │  messages  │          │ App      │
│ Local filesystem    │         └───────────┘          └──────────┘
│ MCP servers         │
│ Project config      │
└─────────────────────┘</code></pre>
<p>No port forwarding. No VPN. No SSH tunnels. Claude Code makes <strong>outbound HTTPS requests only</strong> — it never opens inbound ports on your machine. The Anthropic API routes messages between your local session and whatever device you're using.</p>
<h2>Getting Started</h2>
<h3>Requirements</h3>
<ul>
<li><strong>Claude Pro or Max plan</strong> (not available on Team/Enterprise yet)</li>
<li><strong>Claude Code</strong> installed and authenticated via <code>/login</code></li>
<li><strong>Claude mobile app</strong> — <a href="https://apps.apple.com/us/app/claude-by-anthropic/id6473753684">iOS</a> or <a href="https://play.google.com/store/apps/details?id=com.anthropic.claude">Android</a></li>
</ul>
<h3>Start a New Remote Session</h3>
<p>Navigate to your project and run:</p>
<pre><code class="language-bash">claude remote-control</code></pre>
<p>This displays a <strong>session URL</strong> and a <strong>QR code</strong> (press spacebar to toggle). Scan the QR code with your phone to connect instantly.</p>
<h3>From an Existing Session</h3>
<p>Already mid-conversation? Use the slash command:</p>
<pre><code>/remote-control</code></pre>
<p>Or the shorthand:</p>
<pre><code>/rc</code></pre>
<p>Your full conversation history carries over. Tip: use <code>/rename</code> first to give the session a descriptive name so you can find it on your phone.</p>
<h3>Connect from Another Device</h3>
<p>Three ways to connect:</p>
<ol>
<li><strong>Scan the QR code</strong> — fastest, opens directly in the Claude app</li>
<li><strong>Open the session URL</strong> — works in any browser at <a href="https://claude.ai/code">claude.ai/code</a></li>
<li><strong>Find it in the app</strong> — remote sessions show a computer icon with a green dot when online</li>
</ol>
<h2>Real-World Use Cases</h2>
<h3>The &quot;Deploy from Dinner&quot; Workflow</h3>
<p>You're running a deployment at your desk. The build is going to take 20 minutes. Walk to dinner, and when the build finishes, approve the next step from your phone. No rushing back to your laptop.</p>
<h3>Code Review on the Couch</h3>
<p>Start reviewing a PR at your desk with full context — local repo, test runners, linters. Move to the couch and continue asking Claude questions about the code, running tests, and suggesting changes.</p>
<h3>On-Call Incident Response</h3>
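<p>Get paged at 2 AM. Instead of opening your laptop, open the session in the Claude app on your phone and start debugging immediately. This assumes a session is already running on your machine, since the terminal must stay open. Claude has access to your full local environment — logs, configs, deployment scripts.</p>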
<h2>Always-On Mode</h2>
<p>Don't want to run <code>/remote-control</code> every time? Enable it globally:</p>
<ol>
<li>Run <code>/config</code> inside Claude Code</li>
<li>Set <strong>Enable Remote Control for all sessions</strong> to <code>true</code></li>
</ol>
<p>Now every Claude Code session is automatically available from your phone.</p>
<h2>Security Model</h2>
<ul>
<li>All traffic goes through the <strong>Anthropic API over TLS</strong> — same security as normal Claude Code usage</li>
<li><strong>Multiple short-lived credentials</strong>, each scoped to a single purpose with independent expiration</li>
<li><strong>No inbound ports</strong> opened on your machine</li>
<li>Session data stays local — the phone is just a remote display</li>
</ul>
<h2>Limitations to Know</h2>
<table>
<thead>
<tr>
<th>Limitation</th>
<th>Detail</th>
</tr>
</thead>
<tbody>
<tr>
<td>One remote connection</td>
<td>Each session supports one remote connection at a time</td>
</tr>
<tr>
<td>Terminal must stay open</td>
<td>If you close the terminal, the session ends</td>
</tr>
<tr>
<td>Network timeout</td>
<td>~10 minutes of network loss kills the session</td>
</tr>
<tr>
<td>Plan requirement</td>
<td>Pro or Max plan only (no API keys)</td>
</tr>
</tbody>
</table>
<h2>Remote Control vs Claude Code on the Web</h2>
<p>Both use the same <a href="https://claude.ai/code">claude.ai/code</a> interface, but they're fundamentally different:</p>
<table>
<thead>
<tr>
<th></th>
<th>Remote Control</th>
<th>Claude Code on Web</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Execution</strong></td>
<td>Your machine</td>
<td>Anthropic cloud</td>
</tr>
<tr>
<td><strong>File access</strong></td>
<td>Your local filesystem</td>
<td>Cloud sandbox</td>
</tr>
<tr>
<td><strong>MCP servers</strong></td>
<td>Your local servers</td>
<td>Not available</td>
</tr>
<tr>
<td><strong>Best for</strong></td>
<td>Continuing local work remotely</td>
<td>Starting fresh without local setup</td>
</tr>
</tbody>
</table>
<p>Use Remote Control when you're mid-task and want mobility. Use Claude Code on the web when you want to spin up something new without cloning a repo.</p>
<h2>What This Means for Developer Workflows</h2>
<p>Remote Control solves a real friction point: <strong>context switching between devices kills flow</strong>. Previously, if you walked away from your desk, you either lost your coding context or set up complex SSH/tmux/mosh chains.</p>
<p>Now it's: run one command, scan a QR code, keep going. Your full environment — files, tools, MCP servers, conversation history — travels with you.</p>
<p>Combined with Claude Code's <strong>$2.5 billion annualized run rate</strong> as of February 2026, it's clear that AI-assisted coding is no longer experimental. Remote Control is the kind of quality-of-life feature that makes daily use seamless.</p>]]></content:encoded>
      <category>deployment</category>
      <category>claude</category>
      <category>mobile</category>
      <category>developer-tools</category>
      <category>remote</category>
    </item>
    <item>
      <title>Mercury 2: The First Reasoning Diffusion LLM — 1,000 Tokens/sec</title>
      <link>https://ai.rs/ai-developer/mercury-2-diffusion-reasoning-llm</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/mercury-2-diffusion-reasoning-llm</guid>
      <pubDate>Thu, 26 Feb 2026 11:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Inception Labs launches Mercury 2, a diffusion-based LLM that generates tokens in parallel instead of one-by-one. The result: 1,000 tok/s with reasoning — 10× faster than comparable autoregressive models.</description>
      <content:encoded><![CDATA[<h2>What Is Mercury 2?</h2>
<p>Mercury 2 is the first commercial <strong>reasoning diffusion LLM</strong> from <a href="https://www.inceptionlabs.ai/">Inception Labs</a>. Unlike every major LLM you've used — GPT, Claude, Llama — Mercury 2 doesn't generate tokens one at a time. It uses <strong>diffusion</strong> to produce multiple tokens in parallel, then refines them over a small number of steps.</p>
<p>The result: <strong>~1,000 tokens per second</strong> output throughput on NVIDIA Blackwell GPUs.</p>
<p>For context, Claude 4.5 Haiku outputs ~89 tok/s and GPT-5 Mini ~71 tok/s. Mercury 2 is roughly <strong>10× faster</strong>.</p>
<h2>How Diffusion LLMs Work</h2>
<p>Traditional LLMs are <strong>autoregressive</strong>: they predict one token, append it, then predict the next. This is inherently sequential — each token depends on all previous tokens.</p>
<p>Diffusion LLMs take a fundamentally different approach borrowed from image generation (Stable Diffusion, DALL-E):</p>
<ol>
<li><strong>Start with noise</strong> — begin with a block of random tokens</li>
<li><strong>Refine in parallel</strong> — iteratively denoise all tokens simultaneously</li>
<li><strong>Converge</strong> — after a small number of refinement steps, the output is coherent text</li>
</ol>
<p>This is called <strong>block diffusion</strong>. Because tokens are generated in parallel rather than sequentially, GPU utilization skyrockets — you're doing useful compute across all cores simultaneously instead of waiting for one token at a time.</p>
<pre><code>Autoregressive (traditional):
  Token 1 → Token 2 → Token 3 → Token 4 → ...
  [sequential, ~100 tok/s]

Diffusion (Mercury 2):
  [noise] → [rough draft] → [refined] → [final output]
  [parallel, ~1,000 tok/s]</code></pre>
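<p>The contrast above can be put into numbers with a toy cost model that counts forward passes. This is a sketch under stated assumptions, not Inception Labs' algorithm: the block size and the number of refinement steps are hypothetical, and each diffusion pass does more work than a single autoregressive pass.</p>

```python
import math

def autoregressive_passes(num_tokens: int) -> int:
    """One sequential forward pass per generated token."""
    return num_tokens

def block_diffusion_passes(num_tokens: int, block_size: int, refine_steps: int) -> int:
    """Each block of tokens is denoised together over a fixed number of steps."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * refine_steps

# Hypothetical settings, chosen only for illustration.
tokens = 1024
print(autoregressive_passes(tokens))                                  # 1024
print(block_diffusion_passes(tokens, block_size=64, refine_steps=8))  # 128
```

<p>Fewer sequential steps is where the speedup comes from; the work inside each step runs in parallel on the GPU.</p>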
<h2>Benchmarks</h2>
<p>Mercury 2 is positioned as a <strong>fast reasoning model</strong> — comparable to Claude 4.5 Haiku and GPT-5 Mini in quality, but dramatically faster:</p>
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Mercury 2</th>
<th>Claude 4.5 Haiku</th>
<th>GPT-5 Mini</th>
</tr>
</thead>
<tbody>
<tr>
<td>AIME 2025</td>
<td>91.1</td>
<td>~90</td>
<td>~88</td>
</tr>
<tr>
<td>GPQA</td>
<td>73.6</td>
<td>~75</td>
<td>~72</td>
</tr>
<tr>
<td>LiveCodeBench</td>
<td>67.3</td>
<td>~65</td>
<td>~63</td>
</tr>
<tr>
<td>IFBench</td>
<td>71.3</td>
<td>—</td>
<td>—</td>
</tr>
<tr>
<td><strong>Output speed</strong></td>
<td><strong>~1,000 tok/s</strong></td>
<td><strong>~89 tok/s</strong></td>
<td><strong>~71 tok/s</strong></td>
</tr>
</tbody>
</table>
<p>This isn't competing with frontier models like Claude Opus or GPT-5 on the hardest reasoning tasks. It's targeting the <strong>fast agent tier</strong> — where speed matters more than peak intelligence.</p>
<h2>Key Features</h2>
<ul>
<li><strong>128K context window</strong> — handles large codebases and documents</li>
<li><strong>Tunable reasoning</strong> — adjust the quality/speed tradeoff per request</li>
<li><strong>Native tool use</strong> — function calling built in, not bolted on</li>
<li><strong>Schema-aligned JSON output</strong> — structured output without post-processing</li>
<li><strong>OpenAI API compatible</strong> — drop-in replacement, no code rewrites needed</li>
</ul>
<h2>Where This Matters: Agentic Workflows</h2>
<p>The real impact isn't chat. It's <strong>agentic loops</strong> where an LLM runs hundreds of iterations:</p>
<ul>
<li><strong>Code generation pipelines</strong> — write, test, fix, repeat. At 1,000 tok/s, each iteration takes seconds instead of minutes</li>
<li><strong>Multi-step reasoning</strong> — chain-of-thought that would take 30 seconds now takes 3</li>
<li><strong>Real-time applications</strong> — live coding assistants, interactive debugging, instant analysis</li>
</ul>
<p>A developer on Hacker News proposed <strong>&quot;intelligence per second&quot;</strong> as the metric that matters: throughput × reasoning quality. Mercury 2 optimizes exactly this.</p>
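<p>The metric has no official definition. One possible reading is throughput multiplied by a normalized benchmark score, sketched here with the speeds and AIME 2025 scores from the table above (the rival scores are approximate, and the choice of AIME is arbitrary).</p>

```python
def intelligence_per_second(tokens_per_second: float, benchmark_score: float) -> float:
    """Throughput weighted by a 0-100 benchmark score (one possible definition)."""
    return tokens_per_second * (benchmark_score / 100.0)

# (output tok/s, AIME 2025 score) taken from the benchmark table.
models = {
    "Mercury 2": (1000, 91.1),
    "Claude 4.5 Haiku": (89, 90.0),
    "GPT-5 Mini": (71, 88.0),
}
for name, (speed, score) in models.items():
    print(f"{name}: {intelligence_per_second(speed, score):.0f}")
```

<p>Because the quality scores are close, the ranking under this definition is driven almost entirely by speed.</p>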
<h2>Hybrid Architecture Potential</h2>
<p>The most interesting use case discussed in the community: <strong>frontier model for planning, diffusion model for execution</strong>.</p>
<p>Use Claude Opus or GPT-5 to create a high-level plan, then hand off to Mercury 2 for rapid iteration on individual steps. You get the best reasoning where it matters and maximum speed everywhere else.</p>
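<p>The hand-off can be sketched as a small orchestration function. Everything here is hypothetical: <code>run_hybrid</code>, <code>fake_planner</code>, and <code>fake_executor</code> are illustrative names, and the stand-in functions replace real API calls so the example runs offline.</p>

```python
from typing import Callable, List

def run_hybrid(task: str,
               plan: Callable[[str], List[str]],
               execute: Callable[[str], str]) -> List[str]:
    """Ask the slow planner once, then run every step on the fast executor."""
    steps = plan(task)                         # one call to the frontier model
    return [execute(step) for step in steps]   # many calls to the fast model

# Stand-ins so the sketch runs without any API access.
def fake_planner(task: str) -> List[str]:
    return [f"{task}: write tests", f"{task}: implement", f"{task}: refactor"]

def fake_executor(step: str) -> str:
    return f"done({step})"

print(run_hybrid("add login form", fake_planner, fake_executor))
```

<p>In a real setup, <code>plan</code> would wrap a call to the frontier model and <code>execute</code> a call to the fast model, so the expensive model is invoked once per task rather than once per step.</p>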
<h2>Known Limitations</h2>
<p>Mercury 2 is impressive but not without issues flagged by early users:</p>
<ul>
<li><strong>Factual accuracy</strong> — parallel generation can produce hallucinations that don't self-correct through the sequence (autoregressive models at least have each token conditioned on all previous ones)</li>
<li><strong>Constraint satisfaction</strong> — struggles with tasks requiring strict sequential dependencies</li>
<li><strong>Not frontier-tier</strong> — if you need the absolute best reasoning, you still want Opus or GPT-5</li>
</ul>
<h2>How to Try It</h2>
<p>Mercury 2 is available today via the <a href="https://www.inceptionlabs.ai/">Inception API</a>. It's OpenAI API compatible, so you can point any existing client at it:</p>
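<pre><code class="language-python">import os

from openai import OpenAI

# Endpoint and model name as given above. The environment variable
# name INCEPTION_API_KEY is an arbitrary choice for this example.
client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key=os.environ["INCEPTION_API_KEY"],
)

# Standard chat-completions call, the same as with any OpenAI-compatible client.
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences"}],
)
print(response.choices[0].message.content)</code></pre>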
<h2>What This Means for the Industry</h2>
<p>Diffusion LLMs represent the first serious architectural challenge to the autoregressive paradigm that has dominated since GPT-2. If Mercury 2's approach scales to frontier quality, the entire cost structure of AI inference changes.</p>
<p>At 10× the throughput with comparable quality, inference costs drop dramatically. For businesses running AI at scale — customer support, content generation, code assistance — this could mean 10× more queries for the same GPU budget.</p>
<p>We're watching this space closely. The autoregressive vs. diffusion debate is just getting started.</p>]]></content:encoded>
      <category>research</category>
      <category>llm</category>
      <category>diffusion</category>
      <category>reasoning</category>
      <category>inference</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>What Is Fine-Tuning? Teaching AI New Tricks</title>
      <link>https://ai.rs/learn-ai/what-is-fine-tuning</link>
      <guid isPermaLink="true">https://ai.rs/learn-ai/what-is-fine-tuning</guid>
      <pubDate>Thu, 26 Feb 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>Out-of-the-box AI knows a little about everything. Fine-tuning transforms it into a specialist that deeply understands one thing — your products, your industry, your brand.</description>
      <content:encoded><![CDATA[<h2>The Smart New Hire</h2>
<p>Imagine you just hired the smartest person you've ever met. They graduated top of their class, speak five languages, and can discuss everything from philosophy to physics. But they know nothing about your business.</p>
<p>You wouldn't fire them — you'd train them. Over a few weeks, you'd show them your products, teach them your processes, explain how you talk to customers, and correct their mistakes until they become an expert in your domain.</p>
<p><strong>Fine-tuning</strong> is exactly this process, but for AI. You take a general-purpose model that already understands language and teach it to specialize in your specific area.</p>
<h2>Pre-training vs. Fine-tuning</h2>
<p>Every AI model goes through two phases, and understanding the difference is key.</p>
<p><strong>Pre-training</strong> is like going to school. The model reads enormous amounts of text — books, websites, articles, code — and learns how language works. This gives it broad knowledge about the world, grammar, reasoning patterns, and general facts. Pre-training takes months and costs millions of dollars.</p>
<p><strong>Fine-tuning</strong> is like on-the-job training. You take the pre-trained model and teach it something specific using your own examples. This is fast (hours, not months) and cheap (dollars, not millions).</p>
<table>
<thead>
<tr>
<th></th>
<th>Pre-training</th>
<th>Fine-tuning</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Purpose</strong></td>
<td>Learn language and general knowledge</td>
<td>Learn specific skills or domain</td>
</tr>
<tr>
<td><strong>Data needed</strong></td>
<td>Trillions of words from the internet</td>
<td>Thousands of your own examples</td>
</tr>
<tr>
<td><strong>Time</strong></td>
<td>Weeks to months</td>
<td>Hours</td>
</tr>
<tr>
<td><strong>Cost</strong></td>
<td>Millions of dollars</td>
<td>Under $10</td>
</tr>
<tr>
<td><strong>Who does it</strong></td>
<td>Big AI companies (OpenAI, Google, Meta)</td>
<td>Anyone with domain expertise</td>
</tr>
</tbody>
</table>
<p>You never need to pre-train a model yourself. That's already been done. Fine-tuning is the accessible part — the part where you add your own expertise.</p>
<h2>What Changes After Fine-tuning?</h2>
<p>A fine-tuned model behaves differently from the original in specific, controllable ways:</p>
<h3>It Learns Your Domain</h3>
<p>A model fine-tuned on wine shop data knows grape varieties, food pairings, and vintage years. A model fine-tuned on legal documents knows contract terminology and case law formatting. The model doesn't lose its general knowledge — it gains deep expertise in your area.</p>
<h3>It Adopts Your Voice</h3>
<p>If your training examples use a casual, friendly tone, the model learns to respond that way. If they're formal and precise, it matches that. The model picks up on style, vocabulary, and personality from your examples.</p>
<h3>It Learns Boundaries</h3>
<p>You can teach the model what not to do. If you include examples where the model politely declines to discuss competitors, modify prices, or go off-topic, it learns those boundaries. This is one of the most valuable aspects — building guardrails specific to your use case.</p>
<h3>It Gets More Accurate</h3>
<p>For your specific domain, a fine-tuned smaller model often outperforms a much larger general model. A 7-billion-parameter model fine-tuned on your product data can give better product recommendations than a 100-billion-parameter model that's never seen your catalog.</p>
<h2>The Training Data: Your Secret Ingredient</h2>
<p>Fine-tuning is only as good as the examples you provide. The format is simple — pairs of inputs and desired outputs:</p>
<p><strong>Example 1:</strong>
Input: &quot;What's a good gift for someone who likes cooking?&quot;
Output: &quot;Our Artisan Olive Oil Collection ($45) is our most popular gift for food enthusiasts. It includes three single-origin oils from Italy, Spain, and Greece, each with distinct flavor profiles perfect for different cuisines.&quot;</p>
<p><strong>Example 2:</strong>
Input: &quot;Can you give me a discount?&quot;
Output: &quot;I appreciate you asking! While I can't modify prices, I can help you find great options within your budget. What price range works for you?&quot;</p>
<p>You typically need 5,000 to 25,000 of these examples for a solid fine-tuned model. That sounds like a lot, but most businesses already have this data — in customer service logs, FAQ pages, product catalogs, and email archives.</p>
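<p>In practice, pairs like these are usually stored one JSON object per line (JSONL). The sketch below shows that idea. The field names <code>input</code> and <code>output</code> are an assumption, so check the schema your training tool expects, and the first answer is shortened from the example above.</p>

```python
import json

# Field names "input" and "output" are an assumed schema, not a standard.
examples = [
    {"input": "What's a good gift for someone who likes cooking?",
     "output": "Our Artisan Olive Oil Collection ($45) is our most popular gift for food enthusiasts."},
    {"input": "Can you give me a discount?",
     "output": "I appreciate you asking! While I can't modify prices, I can help you find great options within your budget."},
]

def to_jsonl(rows: list) -> str:
    """Serialize one JSON object per line, skipping rows with an empty side."""
    kept = [r for r in rows if r["input"].strip() and r["output"].strip()]
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in kept)

print(to_jsonl(examples))
```

<p>Dropping empty or half-filled pairs at this stage is a cheap first quality filter before any training run.</p>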
<h2>Real-World Examples of Fine-tuning</h2>
<h3>Customer Support</h3>
<p>A telecom company fine-tunes a model on 10,000 resolved support tickets. The model learns to diagnose common problems, walk customers through solutions, and know when to escalate to a human. Result: 60% of support queries handled automatically.</p>
<h3>Product Recommendations</h3>
<p>An online retailer fine-tunes a model on purchase history and product pairings. The model learns that customers who buy running shoes often want moisture-wicking socks, and that people buying espresso machines usually need grinder recommendations. Result: 25% increase in average order value.</p>
<h3>Content Creation</h3>
<p>A marketing agency fine-tunes a model on their best-performing blog posts, ad copy, and social media content. The model learns their clients' brand voices, preferred formats, and messaging strategies. Result: first drafts that need 70% less editing.</p>
<h3>Internal Knowledge</h3>
<p>A consulting firm fine-tunes a model on their internal methodology documents, case studies, and best practices. New consultants use it to get up to speed on company approaches without bothering senior staff. Result: onboarding time cut in half.</p>
<h2>What Fine-tuning Can't Do</h2>
<p>It's important to understand the limits:</p>
<p><strong>It can't learn facts that change frequently.</strong> If your product prices change weekly, fine-tuning isn't the right tool for price accuracy — that's where RAG (retrieval-augmented generation) comes in, pulling real-time data at query time.</p>
<p><strong>It can't fix fundamental model limitations.</strong> If the base model struggles with complex math, fine-tuning won't make it a calculator. You're adjusting behavior, not fundamentally changing capabilities.</p>
<p><strong>It can't work without good examples.</strong> Garbage in, garbage out. If your training examples are inconsistent, contradictory, or low-quality, the fine-tuned model will reflect that.</p>
<p><strong>It has a capacity limit.</strong> A fine-tuning adapter can reliably learn hundreds to low thousands of specific details. For catalogs with 10,000+ products, you need to combine fine-tuning (for behavior and style) with a live database lookup (for specific facts).</p>
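<p>Combining the two can be as simple as looking up current facts and placing them in the prompt, while the fine-tuned model supplies tone and behavior. This is a minimal sketch with a hypothetical in-memory catalog; <code>CATALOG</code> and <code>build_prompt</code> are illustrative names, and a real system would query a database.</p>

```python
# Hypothetical in-memory catalog; in production this would be a database query.
CATALOG = {
    "artisan olive oil collection": {"price": 45.00, "in_stock": True},
}

def build_prompt(question: str, product_name: str) -> str:
    """Attach current facts to the question before it reaches the fine-tuned model."""
    record = CATALOG.get(product_name.lower())
    if record is None:
        facts = "No catalog entry found."
    else:
        stock = "in stock" if record["in_stock"] else "out of stock"
        facts = f"{product_name}: ${record['price']:.2f}, {stock}."
    return f"Current facts: {facts}\nCustomer question: {question}"

print(build_prompt("How much is it?", "Artisan Olive Oil Collection"))
```

<p>When a price changes, only the catalog entry changes. The model itself does not need retraining.</p>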
<h2>Fine-tuning vs. Prompting: When Do You Need Each?</h2>
<p>A common question: &quot;Can't I just write a really good prompt instead of fine-tuning?&quot;</p>
<p>Sometimes, yes. Here's how to decide:</p>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Use Prompting</th>
<th>Use Fine-tuning</th>
</tr>
</thead>
<tbody>
<tr>
<td>One-off task</td>
<td>Yes</td>
<td>Overkill</td>
</tr>
<tr>
<td>Consistent brand voice across thousands of interactions</td>
<td>Fragile — prompt can drift</td>
<td>Yes</td>
</tr>
<tr>
<td>Following specific safety rules reliably</td>
<td>Somewhat reliable</td>
<td>Much more reliable</td>
</tr>
<tr>
<td>Processing many requests quickly</td>
<td>Prompt overhead adds cost</td>
<td>More efficient</td>
</tr>
<tr>
<td>Specialized domain knowledge</td>
<td>Limited by prompt length</td>
<td>Deeply embedded</td>
</tr>
</tbody>
</table>
<p>The short version: <strong>prompting is for flexibility, fine-tuning is for consistency.</strong> If you need the model to behave a specific way every single time across thousands of interactions, fine-tuning is worth the upfront investment.</p>
<h2>The Bottom Line</h2>
<p>Fine-tuning bridges the gap between a general-purpose AI that gives generic answers and a specialized assistant that truly understands your domain. It's surprisingly accessible — you don't need a machine learning degree or a supercomputer. You need domain expertise (which you already have), a set of good examples (which you can build from existing data), and a few hours of compute time.</p>
<p>The businesses that benefit most from fine-tuning are the ones that have deep domain expertise that's hard to replicate — specialized knowledge that a general AI simply doesn't have. If that sounds like your business, fine-tuning is how you encode that advantage into software.</p>
<p><strong>Concerned about AI privacy and safety?</strong> Read <a href="/learn-ai/ai-privacy-and-safety-basics">AI Privacy and Safety: What Every User Should Know</a>.</p>
<p><strong>Thinking about AI for your business?</strong> <a href="/how-it-works.php">See how it works</a> — how companies deploy custom AI assistants trained on their own data.</p>]]></content:encoded>
      <category>beginner</category>
      <category>fine-tuning</category>
      <category>customization</category>
      <category>training</category>
    </item>
    <item>
      <title>The GPU Memory Wall: Why Inference Hardware Matters</title>
      <link>https://ai.rs/ai-developer/gpu-memory-wall-inference-hardware</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/gpu-memory-wall-inference-hardware</guid>
      <pubDate>Thu, 26 Feb 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>GPU cores sit idle 98% of the time during LLM inference. Understanding the memory wall explains why quantization works, why ASICs win, and what the future holds.</description>
      <content:encoded><![CDATA[<h2>The Counterintuitive Truth</h2>
<p>GPUs are marketed on compute power — teraflops, CUDA cores, tensor operations per second. But single-user LLM inference is rarely limited by compute. It is limited by <strong>memory bandwidth</strong>.</p>
<p>Here's the fundamental problem:</p>
<pre><code>RTX 5090 can compute:     103 TB/s of operations
RTX 5090 VRAM delivers:     1.8 TB/s of data
Gap:                        57x — cores idle 98% of the time</code></pre>
<p>During autoregressive token generation, the GPU reads the <strong>entire model</strong> from VRAM for every single token. An 8B model at 6-bit quantization = 6.7 GB per token. At 1.8 TB/s bandwidth, that's 3.7 ms per token, giving a theoretical maximum of ~270 tokens/second.</p>
<p>No amount of additional compute helps. The bottleneck is the straw, not the reservoir.</p>
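<p>The ceiling follows from a single division: bandwidth over bytes read per token. A short sketch using the figures quoted above, assuming every token streams the full set of weights exactly once:</p>

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound when every token requires streaming all weights once."""
    return bandwidth_bytes_per_s / model_bytes

RTX_5090_BANDWIDTH = 1.8e12   # 1.8 TB/s, from the text
QWEN3_8B_Q6K = 6.7e9          # 6.7 GB, from the text

limit = max_tokens_per_second(QWEN3_8B_Q6K, RTX_5090_BANDWIDTH)
print(f"{1000 / limit:.1f} ms per token, {limit:.0f} tok/s ceiling")
```

<p>Swapping in another model size or GPU bandwidth gives a quick sanity check before buying hardware.</p>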
<h2>Proving the Memory Wall</h2>
<p>We ran an experiment with two models on the same RTX 5090:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Parameters</th>
<th>VRAM</th>
<th>Achieved tok/s</th>
<th>Theoretical max</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3-8B (Q6_K)</td>
<td>8.2B</td>
<td>6.7 GB</td>
<td>161</td>
<td>~270</td>
</tr>
<tr>
<td>SmolLM2-135M (IQ4_XS)</td>
<td>135M</td>
<td>96 MB</td>
<td>1,110</td>
<td>~18,750</td>
</tr>
</tbody>
</table>
<p>The tiny model (135M) should be 70x faster based on its 70x smaller memory footprint. Instead, it's only 7x faster. Where does the performance go?</p>
<h2>The Four Walls</h2>
<p>Detailed profiling revealed the actual bottleneck structure:</p>
<pre><code>Wall 1 — CPU round-trip:     854 μs  (95% of time)  ← REAL BOTTLENECK
Wall 2 — Kernel launches:    725 μs  (fixed with CUDA graphs: 1.8x speedup)
Wall 3 — VRAM bandwidth:      47 μs  (L2 cache would fix: 47→7 μs)
Wall 4 — GPU compute:          ~1 μs  (negligible)</code></pre>
<h3>Wall 1: The CPU Round-Trip</h3>
<p>For every token generated, the process is:</p>
<ol>
<li>GPU finishes computing logits</li>
<li>Transfer logits to CPU via PCIe</li>
<li>CPU runs sampling (argmax/top-p/top-k)</li>
<li>Transfer selected token back to GPU</li>
<li>GPU embeds token and starts next forward pass</li>
</ol>
<p>This CPU↔GPU round-trip takes <strong>854 microseconds</strong> — regardless of model size. It's a fixed overhead that dominates inference for small models.</p>
<h3>Wall 2: Kernel Launch Overhead</h3>
<p>Each forward pass through the model launches hundreds of GPU kernels. Each launch has ~1-5 μs of overhead, and for a tiny 135M model, this adds up to <strong>725 μs</strong>.</p>
<p><strong>CUDA graphs</strong> solve this by recording the execution pattern once and replaying it. This improved our SmolLM2 throughput by <strong>1.81x</strong>:</p>
<table>
<thead>
<tr>
<th>Configuration</th>
<th>tok/s</th>
<th>Improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>Without CUDA graphs</td>
<td>615</td>
<td>Baseline</td>
</tr>
<tr>
<td>With CUDA graphs</td>
<td>1,110</td>
<td><strong>1.81x</strong></td>
</tr>
</tbody>
</table>
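<p>The text does not say how the 725 μs figure was measured. One plausible reading is that it is the per-token time CUDA graphs removed, which can be checked by converting the two throughput numbers in the table into latencies:</p>

```python
def microseconds_per_token(tokens_per_second: float) -> float:
    """Convert a throughput figure into per-token latency."""
    return 1_000_000 / tokens_per_second

without_graphs = microseconds_per_token(615)    # about 1,626 us
with_graphs = microseconds_per_token(1110)      # about 901 us
saved = without_graphs - with_graphs            # about 725 us

print(f"{without_graphs:.0f} us -> {with_graphs:.0f} us, saved {saved:.0f} us per token")
```

<p>Under that reading, the remaining 901 μs per token is what Wall 1 and Wall 3 have to account for.</p>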
<p>For the larger Qwen3-8B, CUDA graphs help less (1.17x) because the model is memory-bound, not kernel-launch-bound.</p>
<h3>Wall 3: VRAM Bandwidth</h3>
<p>For the large model, VRAM bandwidth IS the bottleneck:</p>
<ul>
<li>Qwen3-8B streams 6.7 GB per token → 3.7 ms per token → ~270 tok/s max</li>
<li>Actual: 161 tok/s (60% efficiency — typical for real GPU workloads)</li>
</ul>
<p>For the tiny model, VRAM bandwidth would allow 18,750 tok/s, but Walls 1 and 2 limit us to 1,110.</p>
<h3>Wall 4: Compute</h3>
<p>GPU compute is effectively free at these model sizes. The matrix multiplications take ~1 μs per token — negligible.</p>
<h2>The L2 Cache Hypothesis</h2>
<p>The RTX 5090 has a 96 MB L2 cache between the GPU cores and VRAM. If a model fits entirely in L2, it could theoretically avoid VRAM reads entirely:</p>
<pre><code>VRAM bandwidth:  1.8 TB/s → 47 μs per SmolLM2 forward pass
L2 bandwidth:    ~12 TB/s → 7 μs per forward pass
Speedup:         6.7x</code></pre>
<p>But this 6.7x only applies to the memory portion. With CPU overhead at 854 μs, the L2 advantage becomes:</p>
<pre><code>With VRAM:  854 + 47 = 901 μs → 1,110 tok/s
With L2:    854 +  7 = 861 μs → 1,161 tok/s
Speedup:    ~4.6%</code></pre>
<p>The CPU round-trip dominates so completely that L2 residency barely matters in practice.</p>
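<p>The same arithmetic as a small function, which makes it easy to try other overhead values. The model is deliberately simple: it assumes each token costs the fixed CPU round-trip plus the memory read time and nothing else.</p>

```python
CPU_ROUND_TRIP_US = 854   # fixed per-token overhead, from the profile above

def tokens_per_second(memory_time_us: float,
                      overhead_us: float = CPU_ROUND_TRIP_US) -> float:
    """Throughput when each token costs a fixed overhead plus the memory read."""
    return 1_000_000 / (overhead_us + memory_time_us)

vram = tokens_per_second(47)   # weights streamed from VRAM
l2 = tokens_per_second(7)      # weights served from L2
gain = (l2 / vram - 1) * 100
print(f"VRAM: {vram:.0f} tok/s, L2: {l2:.0f} tok/s, gain: {gain:.1f}%")
```

<p>Setting <code>overhead_us=0</code> recovers the full 6.7x, which shows the cache only pays off once the round-trip is gone.</p>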
<h2>What Prefill Reveals</h2>
<p>Prefill (processing the input prompt) tells a different story:</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>SmolLM2 tok/s</th>
<th>Parallelism</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prefill 512 tokens</td>
<td><strong>57,789</strong></td>
<td>512x</td>
</tr>
<tr>
<td>Prefill 16 tokens</td>
<td>3,938</td>
<td>16x</td>
</tr>
<tr>
<td>Generation (1 token)</td>
<td>1,110</td>
<td>1x</td>
</tr>
</tbody>
</table>
<p>During prefill, 512 tokens are processed simultaneously — <strong>the GPU achieves 57,789 tok/s.</strong> This proves the hardware IS capable of massive throughput. The limitation is the autoregressive nature of generation: each token depends on the previous one, preventing parallelism.</p>
<h2>Why Purpose-Built ASICs Win</h2>
<p>Inference-specific chips solve the memory wall architecturally:</p>
<h3>Groq LPU</h3>
<ul>
<li>230 MB on-chip SRAM (no external memory)</li>
<li>80 TB/s internal bandwidth (44x GPU VRAM)</li>
<li><strong>Eliminates CPU round-trip</strong> — sampling happens on-die</li>
<li>Result: 300+ tok/s on Llama 3 70B</li>
</ul>
<h3>Cerebras WSE-3</h3>
<ul>
<li>44 GB on-chip SRAM</li>
<li>21 PB/s on-chip bandwidth (11,600x GPU VRAM)</li>
<li>Entire model lives on-chip</li>
<li>Result: Thousands of tok/s</li>
</ul>
<h3>Taalas HC1</h3>
<ul>
<li>Model weights encoded directly in silicon (3-bit custom)</li>
<li><strong>17,000 tok/s</strong> on Llama 3.1 8B</li>
<li>105x faster than our RTX 5090</li>
<li>No memory access at all — weights ARE the hardware</li>
</ul>
<h2>What This Means for GPU Deployments</h2>
<h3>1. Quantization is the Primary Lever</h3>
<p>Since inference is memory-bound, reducing model size directly improves speed:</p>
<table>
<thead>
<tr>
<th>Quantization</th>
<th>Speed improvement</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td>BF16 → Q6_K</td>
<td>~2x</td>
<td>Half the data to stream</td>
</tr>
<tr>
<td>BF16 → Q4_K_M</td>
<td>~2.5x</td>
<td>Even less data</td>
</tr>
<tr>
<td>BF16 → Q2</td>
<td>~3x</td>
<td>Diminishing returns (quality drops)</td>
</tr>
</tbody>
</table>
<p>Quantization doesn't sacrifice compute — it reduces the real bottleneck (memory reads).</p>
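<p>For a rough feel of the numbers, weight size scales with bits per weight, and a purely bandwidth-bound decode can speed up by at most the same ratio. The sketch uses nominal bit widths as an assumption. The gains in the table sit below these ceilings, presumably because fixed per-token overheads do not shrink with the model.</p>

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

def ideal_speedup(bits_before: float, bits_after: float) -> float:
    """Best case for a purely bandwidth-bound decode: speed scales with bytes read."""
    return bits_before / bits_after

# Nominal bit widths. Real GGUF formats store extra scale data per block,
# so effective bits per weight run slightly higher than the name suggests.
print(f"{weights_gb(8.2, 16):.1f} GB at 16-bit")
print(f"{weights_gb(8.2, 6):.2f} GB at a nominal 6-bit")
print(f"{ideal_speedup(16, 6):.2f}x ideal ceiling")
```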
<h3>2. VRAM Amount &lt; VRAM Bandwidth</h3>
<p>When choosing a GPU for inference, bandwidth matters more than capacity:</p>
<table>
<thead>
<tr>
<th>GPU</th>
<th>VRAM</th>
<th>Bandwidth</th>
<th>Expected 8B Q6_K tok/s</th>
</tr>
</thead>
<tbody>
<tr>
<td>RTX 4090</td>
<td>24 GB</td>
<td>1.0 TB/s</td>
<td>~90</td>
</tr>
<tr>
<td>RTX 5090</td>
<td>32 GB</td>
<td>1.8 TB/s</td>
<td>~160</td>
</tr>
<tr>
<td>A100</td>
<td>80 GB</td>
<td>2.0 TB/s</td>
<td>~180</td>
</tr>
<tr>
<td>H100</td>
<td>80 GB</td>
<td>3.35 TB/s</td>
<td>~300</td>
</tr>
</tbody>
</table>
<p>The H100 has 1.9x the bandwidth of the RTX 5090, which translates directly to ~1.9x the inference speed.</p>
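<p>The expected figures in the table are consistent with scaling the measured RTX 5090 result linearly by bandwidth. A sketch of that estimate, assuming decode stays bandwidth-bound on every card:</p>

```python
MEASURED_TOK_S = 161          # Qwen3-8B Q6_K on the RTX 5090, from the experiment
REFERENCE_BANDWIDTH = 1.8     # TB/s, RTX 5090

def expected_tok_s(bandwidth_tb_s: float) -> float:
    """Scale the measured result linearly with memory bandwidth."""
    return MEASURED_TOK_S * bandwidth_tb_s / REFERENCE_BANDWIDTH

for gpu, bandwidth in [("RTX 4090", 1.0), ("A100", 2.0), ("H100", 3.35)]:
    print(f"{gpu}: ~{expected_tok_s(bandwidth):.0f} tok/s")
```

<p>These are estimates, not measurements. Driver, kernel, and CPU differences will move real results.</p>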
<h3>3. Batching is the Only Way to Use Compute</h3>
<p>The GPU's compute power only helps with concurrent requests. With 8 concurrent users, the GPU can process 8 tokens simultaneously, filling more of its compute capacity:</p>
<table>
<thead>
<tr>
<th>Concurrent users</th>
<th>GPU utilization</th>
<th>Aggregate tok/s</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>~2%</td>
<td>161</td>
</tr>
<tr>
<td>4</td>
<td>~8%</td>
<td>~400</td>
</tr>
<tr>
<td>8</td>
<td>~15%</td>
<td>~600</td>
</tr>
<tr>
<td>32</td>
<td>~50%</td>
<td>~1,500</td>
</tr>
</tbody>
</table>
<p>This is why vLLM with continuous batching matters for production — it's the only way to actually use the GPU you paid for.</p>
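<p>The trade-off in the table can be made explicit with a few lines of arithmetic. This sketch uses only the numbers from the table above (the function names are ours) to show that aggregate throughput rises with batch size while each individual user's stream slows down.</p>
<pre><code class="language-python"># (concurrent users, aggregate tok/s) copied from the table above
MEASURED = [(1, 161), (4, 400), (8, 600), (32, 1500)]


def per_user_rate(users, aggregate_tok_s):
    """Tokens per second that each individual user sees."""
    return aggregate_tok_s / users


def scaling_efficiency(users, aggregate_tok_s, single_user_tok_s=161):
    """Share of perfect linear scaling that the batch actually achieves."""
    return aggregate_tok_s / (users * single_user_tok_s)


for users, aggregate in MEASURED:
    print(
        f"{users:2d} users: {per_user_rate(users, aggregate):6.1f} tok/s each, "
        f"{scaling_efficiency(users, aggregate):.0%} of linear scaling"
    )</code></pre>
<p>Whether the per-user rate at a given batch size is acceptable depends on the application; a chat interface usually needs far fewer tokens per second per user than the single-user figure.</p>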
<h2>The Future: Where This Is Heading</h2>
<ol>
<li><strong>HBM4 (2026)</strong> — 6+ TB/s bandwidth on consumer GPUs could double inference speed</li>
<li><strong>On-chip model caching</strong> — Larger L2/L3 caches could eventually fit quantized 1B models</li>
<li><strong>Speculative decoding</strong> — A small draft model proposes several candidate tokens and the large model verifies them in one parallel pass; this requires vocabulary-aligned model pairs</li>
<li><strong>Inference ASICs</strong> — Dedicated chips that eliminate the CPU round-trip entirely</li>
<li><strong>Hybrid architectures</strong> — GPU + inference ASIC combos that handle training and serving optimally</li>
</ol>
<p>The memory wall isn't going away, but the wall is moving. Every generation of hardware pushes the boundary, and creative software solutions (quantization, batching, speculative decoding) continue to extract more from existing hardware.</p>
<h2>Key Takeaway</h2>
<p>When planning an LLM deployment, think in terms of <strong>memory bandwidth, not compute</strong>:</p>
<ul>
<li>Your GPU cores are ~98% idle during single-user inference</li>
<li>Quantization is the single most impactful optimization</li>
<li>Batching (vLLM) is the only way to utilize compute</li>
<li>Purpose-built ASICs are 10-100x faster because they solve the architecture problem</li>
<li>For most businesses, a well-quantized model on a good GPU is more than sufficient</li>
</ul>]]></content:encoded>
      <category>research</category>
      <category>gpu</category>
      <category>memory-wall</category>
      <category>hardware</category>
      <category>inference</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>From Edge AI to Custom LLMs: How On-Device Intelligence Evolved</title>
      <link>https://ai.rs/ai-developer/edge-ai-kendryte-k210-to-custom-llms</link>
      <guid isPermaLink="true">https://ai.rs/ai-developer/edge-ai-kendryte-k210-to-custom-llms</guid>
      <pubDate>Mon, 23 Feb 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>In 2019 the Kendryte K210 brought machine vision to a $20 microcontroller. Six years later, we fine-tune 8-billion parameter LLMs on dedicated GPUs. Here&#039;s how edge AI paved the road to today&#039;s custom AI assistants.</description>
      <content:encoded><![CDATA[<h2>A $20 AI Camera in 2019</h2>
<p>In August 2019, M5Stack shipped the M5StickV — a thumb-sized device built around the Kendryte K210 system-on-chip. For under $20, you got:</p>
<ul>
<li>Dual-core 64-bit RISC-V CPU at 400 MHz</li>
<li>8 MiB SRAM</li>
<li>Hardware neural network accelerator (KPU)</li>
<li>0.8 TOPS peak performance</li>
<li>OV7740 camera (VGA @ 30fps)</li>
<li>1.14&quot; IPS display</li>
<li>MicroSD, microphone, gyroscope, speaker, battery</li>
</ul>
<p><img src="/img/articles/m5stickv.jpg" alt="M5StickV — a thumb-sized AI camera built on the Kendryte K210" /></p>
<p>This tiny device could run <strong>real-time face detection</strong>, object classification, and QR code scanning — entirely on-chip, with no cloud connection. It was one of the first widely accessible edge AI platforms that hobbyists and engineers could actually buy and program.</p>
<p><img src="/img/articles/kendryte-k210.jpg" alt="Kendryte K210 system-on-chip pinout and peripherals" /></p>
<h3>What the K210 Could Do</h3>
<p>The spec sheet reads like a checklist of computer vision fundamentals:</p>
<ul>
<li><strong>Face recognition and detection</strong> — identify known faces in real-time</li>
<li><strong>Object detection and classification</strong> — recognize shapes and types at 30fps</li>
<li><strong>Size and coordinate tracking</strong> — locate targets with bounding boxes</li>
<li><strong>Audio processing</strong> — microphone array beamforming and voice wake-up</li>
<li><strong>Speech recognition</strong> — on-device, no cloud dependency</li>
</ul>
<p>For embedded engineers coming from Arduino and ESP32 territory, this was a quantum leap. The ESP32 could blink LEDs and read sensors. The K210 could <em>see and hear</em>.</p>
<h3>Running MicroPython on the K210</h3>
<p>Getting started was remarkably accessible. The M5StickV supported MicroPython through Sipeed's MaixPy framework:</p>
<pre><code class="language-python">import sensor
import image
import lcd

lcd.init()                            # bring up the built-in display
sensor.reset()                        # initialize the camera
sensor.set_pixformat(sensor.RGB565)   # 16-bit color frames
sensor.set_framesize(sensor.QVGA)     # 320x240
sensor.run(1)                         # start capturing

while True:
    img = sensor.snapshot()
    res = img.find_qrcodes()          # list of detected QR codes
    if res:
        # Overlay the decoded text and a box around the first code
        img.draw_string(40, 50, res[0].payload(), (236, 36, 36), scale=1.5)
        img.draw_rectangle(res[0].rect(), (236, 36, 36))
    lcd.display(img)</code></pre>
<p>Under twenty lines of Python for a real-time QR code scanner with on-screen overlay. The firmware could be compiled from source and flashed via USB, even on ARM-based hosts like the NVIDIA Jetson Nano.</p>
<h2>The Gap Between Edge AI and Real Intelligence</h2>
<p>The K210 was impressive for its size and price, but it had hard limits:</p>
<table>
<thead>
<tr>
<th>Capability</th>
<th>K210 (2019)</th>
<th>Modern LLM (2026)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Parameters</td>
<td>~1-5 million</td>
<td>8 billion</td>
</tr>
<tr>
<td>Memory</td>
<td>8 MiB SRAM</td>
<td>24 GB VRAM</td>
</tr>
<tr>
<td>Tasks</td>
<td>Classification, detection</td>
<td>Reasoning, conversation, generation</td>
</tr>
<tr>
<td>Training data</td>
<td>Thousands of images</td>
<td>Trillions of text tokens</td>
</tr>
<tr>
<td>Output</td>
<td>&quot;This is a face&quot; / &quot;This is a cat&quot;</td>
<td>Natural language responses, recommendations, analysis</td>
</tr>
<tr>
<td>Customization</td>
<td>Retrain classification model</td>
<td>Fine-tune with LoRA in 5 hours</td>
</tr>
</tbody>
</table>
<p>Edge AI answered &quot;what is this?&quot; — but it couldn't answer &quot;what should I buy?&quot; or &quot;how does this compare to that?&quot; or &quot;find me something for a dinner party under €50.&quot;</p>
<p>That required a fundamentally different architecture: large language models.</p>
<h2>The Bridge: From Vision to Language</h2>
<p>The path from K210-style edge AI to modern LLM assistants followed three key developments:</p>
<h3>1. Transformer Architecture Scaled Up</h3>
<p>The K210's neural network accelerator ran small convolutional models, which are well suited to classifying images. Modern LLMs are built on a different idea, the transformer's attention mechanism, which also powers current image models. Run at billions of parameters on modern GPUs, attention enables <em>understanding</em> rather than just <em>classifying</em>.</p>
<h3>2. Open-Source Models Became Competitive</h3>
<p>In 2020, if you wanted a capable language model, you needed OpenAI's API. By 2025, open-source models such as Qwen, Llama, and Mistral matched or exceeded GPT-3.5 quality while running on a single consumer GPU. This is the equivalent of the K210 moment for language AI: capable models, affordable hardware, open ecosystem.</p>
<h3>3. Fine-Tuning Became Practical</h3>
<p>LoRA (Low-Rank Adaptation) did for LLMs what transfer learning did for image classification. Instead of training from scratch, you add a small adapter (~130 MB) that teaches the model your domain. Training costs dropped from millions of dollars to under $1 per run.</p>
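<p>To see why the adapter is so small, it helps to count trainable parameters. In LoRA the update to a weight matrix is expressed as the product of two thin matrices, so the count grows with the rank instead of with the full matrix size. The sketch below uses a hypothetical 4096 x 4096 layer and rank 16; those dimensions and the function names are illustrative assumptions, not the configuration used for our models.</p>
<pre><code class="language-python">def full_update_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix W is updated."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Trainable parameters for the low-rank update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Hypothetical layer: one 4096 x 4096 projection with adapter rank 16.
d_out, d_in, rank = 4096, 4096, 16
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
print(f"full fine-tune: {full:,} parameters")
print(f"LoRA adapter:   {lora:,} parameters ({full // lora}x fewer)")</code></pre>
<p>The base weights stay frozen, which is why the result can be shipped as a small file alongside the original model.</p>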
<h2>Where We Are Now</h2>
<p>At ai.rs, we took the same hands-on approach that drove the maker community around devices like the K210 and applied it to large language models:</p>
<table>
<thead>
<tr>
<th>What We Did Then</th>
<th>What We Do Now</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flash MicroPython firmware via USB</td>
<td>Fine-tune Qwen/Llama with LoRA</td>
</tr>
<tr>
<td>Train face detection on custom datasets</td>
<td>Train product Q&amp;A on 26,000+ samples</td>
</tr>
<tr>
<td>Deploy on $20 RISC-V chips</td>
<td>Deploy on dedicated GPU servers</td>
</tr>
<tr>
<td>Real-time camera inference</td>
<td>Real-time conversational AI</td>
</tr>
<tr>
<td>Read QR codes and detect objects</td>
<td>Understand natural language, recommend products, handle support</td>
</tr>
</tbody>
</table>
<p>The spirit is identical: take capable open-source hardware and software, customize it for a specific use case, and deploy it where it creates real value.</p>
<h2>From Hobbyist to Production</h2>
<p>The K210 was a hobbyist device. Modern AI assistants are production systems serving real customers 24/7. The difference isn't just scale — it's the full stack around the model:</p>
<ul>
<li><strong>RAG (Retrieval-Augmented Generation)</strong> — Real-time product database access, so the model always has current prices and availability</li>
<li><strong>Safety training</strong> — 275+ edge-case samples that train the model to resist hallucination, off-topic responses, and prompt injection</li>
<li><strong>Monitoring and iteration</strong> — Every conversation logged, weak spots identified, training data improved continuously</li>
<li><strong>Multi-language support</strong> — One model serving 6+ languages natively</li>
</ul>
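<p>The RAG item in the list above can be sketched in a few lines: look up the relevant facts first, then place them in the prompt so the model answers from current data and not from memory. Everything here is a simplified illustration. The catalog is hypothetical, and plain keyword matching stands in for the database or vector search a production system would use.</p>
<pre><code class="language-python"># Hypothetical in-memory catalog standing in for a real product database.
CATALOG = [
    {"name": "Espresso machine", "price_eur": 249.0, "in_stock": True},
    {"name": "Coffee grinder", "price_eur": 89.0, "in_stock": False},
    {"name": "Milk frother", "price_eur": 35.0, "in_stock": True},
]


def retrieve(question, catalog=CATALOG):
    """Return products whose name shares at least one word with the question."""
    words = set(question.lower().replace("?", "").split())
    return [p for p in catalog if words.intersection(p["name"].lower().split())]


def build_prompt(question, products):
    """Put the retrieved facts in front of the question for the model to use."""
    lines = [
        f"- {p['name']}: {p['price_eur']:.2f} EUR, "
        f"{'in stock' if p['in_stock'] else 'out of stock'}"
        for p in products
    ]
    context = "\n".join(lines) or "- (no matching products)"
    return f"Use only these product facts:\n{context}\n\nCustomer question: {question}"


question = "How much is the coffee grinder?"
print(build_prompt(question, retrieve(question)))</code></pre>
<p>Because prices and stock are read at question time, updating the catalog changes the answers without retraining the model.</p>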
<p>But the core insight from the maker era still holds: <strong>you don't need a research lab to build useful AI.</strong> The K210 proved that computer vision could run on a $20 chip. Open-source LLMs prove that conversational AI can run on a single GPU.</p>
<h2>Getting Started</h2>
<p>If the maker spirit of the K210 era resonates with you, here's how to start with modern LLMs:</p>
<ol>
<li><strong>Try it</strong> — Run <a href="https://ollama.com">Ollama</a> with Qwen3-8B on any machine with a GPU</li>
<li><strong>Customize it</strong> — Prepare 5,000+ training samples from your domain data</li>
<li><strong>Fine-tune it</strong> — Use Unsloth + LoRA for a 5-hour, sub-$1 training run</li>
<li><strong>Deploy it</strong> — Serve it on dedicated hardware with RAG for real-time data access</li>
</ol>
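<p>For the first step, once Ollama is running you can talk to the model from a script as well as from the command line. The sketch below assumes a local Ollama server on its default port (11434) and that a model tagged <code>qwen3:8b</code> has already been pulled; adjust both to match your setup. The helper names are ours.</p>
<pre><code class="language-python">import json
import urllib.request

# Assumptions: Ollama is running locally on its default port and the
# model tag below has already been pulled.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3:8b"


def build_request(prompt, model=MODEL, url=OLLAMA_URL):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as reply:
        return json.loads(reply.read())["response"]


# Usage (needs a running server):
#     print(ask("Explain LoRA fine-tuning in one sentence."))</code></pre>
<p>Setting <code>stream</code> to false returns one JSON object with the full answer, which keeps the example short; an interactive interface would normally stream tokens as they arrive.</p>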
<p>Or if you'd rather skip the infrastructure work: <a href="/how-it-works.php">see how we build custom AI assistants</a> — from your product data to a live AI that knows your business.</p>
<hr />
<p><em>This article is based on our original 2019 coverage of the M5StickV and Kendryte K210 platform. The maker community around edge AI devices like the K210, ESP32, and Raspberry Pi laid the groundwork for today's accessible AI deployment ecosystem.</em></p>]]></content:encoded>
      <category>fundamentals</category>
      <category>edge-ai</category>
      <category>k210</category>
      <category>llm</category>
      <category>history</category>
      <category>maker</category>
    </item>
    <item>
      <title>Will AI Replace My Sales Team? (No — Here&#039;s Why)</title>
      <link>https://ai.rs/ai-for-business/will-ai-replace-my-sales-team</link>
      <guid isPermaLink="true">https://ai.rs/ai-for-business/will-ai-replace-my-sales-team</guid>
      <pubDate>Fri, 20 Feb 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>AI handles the volume so your team can handle the value. It&#039;s not replacement — it&#039;s multiplication. Here&#039;s exactly what AI does better, and what humans always will.</description>
      <content:encoded><![CDATA[<h2>The Fear Nobody Talks About</h2>
<p>Every time we talk to business owners about AI, there's an unspoken question in the room:</p>
<p><strong>&quot;If the AI can do all this, do I still need my sales team?&quot;</strong></p>
<p>The short answer: <strong>yes, absolutely.</strong> But the roles change. And that's a good thing.</p>
<h2>What AI Does Better Than Humans</h2>
<p>Let's be honest — there are things AI genuinely does better:</p>
<h3>1. Being Available</h3>
<p>Your best salesperson works 8 hours. AI works 24. The 2 AM customer, the Sunday browser, the holiday shopper — AI catches every one of them.</p>
<table>
<thead>
<tr>
<th>Time</th>
<th>Human Team</th>
<th>AI</th>
</tr>
</thead>
<tbody>
<tr>
<td>Monday 10 AM</td>
<td>Available</td>
<td>Available</td>
</tr>
<tr>
<td>Wednesday 2 AM</td>
<td>Sleeping</td>
<td><strong>Available</strong></td>
</tr>
<tr>
<td>Saturday afternoon</td>
<td>Maybe</td>
<td><strong>Available</strong></td>
</tr>
<tr>
<td>Christmas Day</td>
<td>Off</td>
<td><strong>Available</strong></td>
</tr>
<tr>
<td>During lunch rush</td>
<td>Busy</td>
<td><strong>Available</strong></td>
</tr>
</tbody>
</table>
<p>You're not replacing your team for those 8 working hours. You're adding coverage for the other 16 hours they physically can't be there.</p>
<h3>2. Being Consistent</h3>
<p>Humans have good days and bad days. Monday morning after a long weekend? Not our best. Friday afternoon? Distracted. The customer who arrives during a staff argument? Caught in the crossfire.</p>
<p>AI gives <strong>the same quality response every time.</strong> The 500th question of the day gets the same enthusiasm and accuracy as the first.</p>
<h3>3. Handling Volume</h3>
<p>During a sale or promotion, your website traffic might spike 5x. Your team of 3 can't suddenly become 15. But AI handles 1 customer or 100 with the same response time.</p>
<h3>4. Speaking Languages</h3>
<p>Hiring multilingual staff is expensive and limits your available talent pool. AI speaks 6+ languages natively — every customer gets help in their preferred language.</p>
<h3>5. Remembering Everything</h3>
<p>With 500 products, no human remembers every detail about every item. The AI knows exact prices, specifications, pairings, and availability for your entire catalog — because it looks them up in real-time.</p>
<h2>What Humans Do Better Than AI</h2>
<p>Now for the important part — what AI <strong>cannot</strong> do:</p>
<h3>1. Read Emotions</h3>
<p>A customer types: &quot;I've been looking for an hour and nothing is right.&quot;</p>
<p>AI sees: a product search query.<br />
A human sees: frustration. Someone who needs patience, empathy, and maybe a different approach entirely.</p>
<p><strong>AI is excellent at transactions. Humans are essential for relationships.</strong></p>
<h3>2. Handle Complexity</h3>
<p>Some requests are genuinely complex:</p>
<blockquote>
<p>&quot;I'm planning a corporate event for 200 people with mixed dietary requirements, a specific theme, and a strict budget. I need a complete solution.&quot;</p>
</blockquote>
<p>AI can suggest products. But planning a coherent solution that accounts for dozens of variables, makes judgment calls, and adapts in real-time? That's human territory.</p>
<h3>3. Build Trust for Big Decisions</h3>
<p>A customer spending $50 is fine getting advice from AI. A customer spending $5,000 wants to talk to a person. The higher the stakes, the more important human connection becomes.</p>
<table>
<thead>
<tr>
<th>Purchase Size</th>
<th>Best Handled By</th>
</tr>
</thead>
<tbody>
<tr>
<td>Under $100</td>
<td>AI (quick, accurate, instant)</td>
</tr>
<tr>
<td>$100-$500</td>
<td>AI with human backup available</td>
</tr>
<tr>
<td>$500-$2,000</td>
<td>Human, with AI providing product data</td>
</tr>
<tr>
<td>Over $2,000</td>
<td>Human relationship, always</td>
</tr>
</tbody>
</table>
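<p>If your chat system knows the likely order value, the tiers in the table above can be turned into a simple routing rule. This is a minimal sketch: the function name is ours, and treating an amount that sits exactly on a boundary as belonging to the higher tier is an assumption, since the table does not say.</p>
<pre><code class="language-python">from bisect import bisect_right

# Tier boundaries in dollars, taken from the table above. Treating a value
# that sits exactly on a boundary as the higher tier is an assumption.
BOUNDARIES = [100, 500, 2000]
HANDLERS = [
    "AI",
    "AI with human backup",
    "Human with AI product data",
    "Human relationship",
]


def route(purchase_usd):
    """Pick who should handle a conversation based on expected purchase size."""
    return HANDLERS[bisect_right(BOUNDARIES, purchase_usd)]


for amount in (50, 250, 1200, 5000):
    print(f"${amount}: {route(amount)}")</code></pre>
<p>In practice the thresholds should come from your own margins and customer expectations, not from fixed numbers.</p>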
<h3>4. Negotiate and Customize</h3>
<p>Custom quotes, bulk discounts, special arrangements — these require human judgment about margins, relationships, and business strategy. AI operates within fixed rules; humans operate within context.</p>
<h3>5. Recover from Mistakes</h3>
<p>When something goes wrong — wrong item shipped, delayed delivery, quality issue — customers want to talk to a human. They want someone who feels their frustration and has the authority to make it right.</p>
<h2>The Multiplication Model</h2>
<p>The best way to think about AI isn't replacement. It's multiplication.</p>
<p><strong>Without AI:</strong></p>
<ul>
<li>3 salespeople handle ~150 conversations/day</li>
<li>Available 8 hours/day, 5 days/week</li>
<li>Limited to 1-2 languages</li>
<li>Spending 60% of time on routine questions</li>
</ul>
<p><strong>With AI:</strong></p>
<ul>
<li>AI handles ~500 routine conversations/day (24/7)</li>
<li>3 salespeople handle ~60 complex conversations/day</li>
<li>Available around the clock in 6+ languages</li>
<li>Sales team spends 80% of time on high-value interactions</li>
</ul>
<p><strong>Same team, more than 3x the customer coverage (~560 conversations/day instead of ~150), better quality interactions.</strong></p>
<h2>How the Day Changes</h2>
<h3>Before AI</h3>
<table>
<thead>
<tr>
<th>Time</th>
<th>Sales Team Activity</th>
</tr>
</thead>
<tbody>
<tr>
<td>9:00</td>
<td>Answer &quot;What's the price of X?&quot; (30 seconds, but it adds up)</td>
</tr>
<tr>
<td>9:05</td>
<td>&quot;Do you have Y in stock?&quot;</td>
</tr>
<tr>
<td>9:15</td>
<td>&quot;What's the difference between A and B?&quot;</td>
</tr>
<tr>
<td>9:30</td>
<td>Complex customer needs full attention — but phone rings</td>
</tr>
<tr>
<td>10:00</td>
<td>Back to routine questions</td>
</tr>
<tr>
<td>11:00</td>
<td>Finally gets to that sales proposal for the big account</td>
</tr>
</tbody>
</table>
<h3>After AI</h3>
<table>
<thead>
<tr>
<th>Time</th>
<th>Activity</th>
</tr>
</thead>
<tbody>
<tr>
<td>9:00</td>
<td>AI handles routine questions automatically</td>
</tr>
<tr>
<td>9:00</td>
<td>Sales team works on the big account proposal</td>
</tr>
<tr>
<td>10:30</td>
<td>AI flags a complex customer request — team takes over</td>
</tr>
<tr>
<td>11:00</td>
<td>Team closes a $2,000 deal they had time to properly nurture</td>
</tr>
<tr>
<td>14:00</td>
<td>Reviews AI conversations from overnight — 3 new leads</td>
</tr>
</tbody>
</table>
<p>The team does more meaningful work. Customers get faster answers. Revenue goes up.</p>
<h2>The Numbers</h2>
<p>Businesses that deploy AI alongside their sales team typically see:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total customer interactions handled</td>
<td><strong>+200-400%</strong></td>
</tr>
<tr>
<td>Team time on high-value activities</td>
<td><strong>+60-80%</strong></td>
</tr>
<tr>
<td>Customer response time</td>
<td><strong>90% faster</strong></td>
</tr>
<tr>
<td>After-hours sales captured</td>
<td><strong>From zero to significant</strong></td>
</tr>
<tr>
<td>Team job satisfaction</td>
<td><strong>Higher</strong> (less repetitive work)</td>
</tr>
</tbody>
</table>
<p>That last one matters more than you might think. Salespeople don't enjoy answering &quot;what time do you close?&quot; for the 50th time. Let AI handle the routine so your team can do what they're actually good at — and what they actually enjoy.</p>
<h2>When to Hire, When to AI</h2>
<p>A simple framework:</p>
<p><strong>Add AI when:</strong></p>
<ul>
<li>You're losing after-hours and weekend customers</li>
<li>Your team spends most of their time on routine questions</li>
<li>You need multilingual support but can't justify hiring</li>
<li>Response times are too slow during peak hours</li>
</ul>
<p><strong>Hire a human when:</strong></p>
<ul>
<li>You need someone for complex, high-value sales</li>
<li>Your business relies on personal relationships</li>
<li>You're expanding into a new market that needs cultural nuance</li>
<li>You need someone who can physically be present (events, showrooms)</li>
</ul>
<p><strong>The sweet spot:</strong> AI handles the first touch, qualifies the lead, and routes complex requests to your team. Your team closes deals with customers who are already informed and ready to buy.</p>
<h2>The Bottom Line</h2>
<p>AI doesn't replace your sales team. It gives them superpowers.</p>
<p>Your team stops being answering machines for routine questions and starts being what they were hired to be: relationship builders, problem solvers, and deal closers.</p>
<p>The businesses that understand this — that use AI to augment their team rather than replace them — are the ones seeing the biggest returns.</p>
<p><strong>Wondering if your business is ready?</strong> Take our free <a href="/ai-readiness.php">AI Readiness Assessment</a> — 2 minutes, no commitment, personalized recommendations.</p>]]></content:encoded>
      <category>business</category>
      <category>sales-team</category>
      <category>augmentation</category>
      <category>strategy</category>
    </item>
    <item>
      <title>How to Pick the Right AI Tool for You</title>
      <link>https://ai.rs/learn-ai/how-to-pick-the-right-ai-tool</link>
      <guid isPermaLink="true">https://ai.rs/learn-ai/how-to-pick-the-right-ai-tool</guid>
      <pubDate>Thu, 19 Feb 2026 10:00:00 +0100</pubDate>
      <author>ai.rs</author>
      <description>ChatGPT, Claude, Gemini, Copilot, Perplexity — there are dozens of AI tools now, and they&#039;re not all the same. Here&#039;s how to figure out which one actually fits your needs.</description>
      <content:encoded><![CDATA[<h2>Too Many Choices</h2>
<p>Two years ago, the question was simple: do you use ChatGPT or not? Now there are dozens of AI tools, each claiming to be the best. It's overwhelming, and most comparisons online are either outdated or biased.</p>
<p>Let's cut through it. We'll look at what actually matters when choosing an AI tool, compare the major options honestly, and help you pick based on what you'll actually use it for.</p>
<h2>The Major Players</h2>
<p>As of early 2026, these are the AI tools most people should consider:</p>
<h3>ChatGPT (by OpenAI)</h3>
<p>The one that started the mainstream AI wave. It's the most widely used, has the largest ecosystem of plugins and integrations, and offers both free and paid tiers. The GPT-5 family is its flagship model line.</p>
<p><strong>Best for:</strong> General-purpose use, image generation (DALL-E built in), voice conversations, broad plugin ecosystem.</p>
<h3>Claude (by Anthropic)</h3>
<p>Known for longer, more thoughtful responses and strong performance on writing and analysis tasks. Claude tends to be more careful and nuanced, especially with complex or sensitive topics.</p>
<p><strong>Best for:</strong> Long documents, careful analysis, writing and editing, coding, tasks requiring nuance.</p>
<h3>Gemini (by Google)</h3>
<p>Google's AI, integrated across Gmail, Docs, and Search. Its biggest advantage is access to real-time information through Google Search and deep integration with Google's productivity suite.</p>
<p><strong>Best for:</strong> Research with current information, Google Workspace integration, multimodal tasks (text + images + video).</p>
<h3>Copilot (by Microsoft)</h3>
<p>Microsoft's AI assistant, built into Windows, Edge, and Office 365. Powered by OpenAI's models but with Microsoft's ecosystem integration.</p>
<p><strong>Best for:</strong> Microsoft Office users, Windows integration, business environments already on Microsoft's stack.</p>
<h3>Perplexity</h3>
<p>Not a traditional chatbot — it's more like an AI-powered research tool. Every answer includes citations and sources, making it ideal for factual research.</p>
<p><strong>Best for:</strong> Research, fact-finding, getting answers with verifiable sources.</p>
<h2>The Comparison</h2>
<table>
<thead>
<tr>
<th>Feature</th>
<th>ChatGPT</th>
<th>Claude</th>
<th>Gemini</th>
<th>Copilot</th>
<th>Perplexity</th>
</tr>
</thead>
<tbody>
<tr>
<td>Free tier</td>
<td>Yes (limited)</td>
<td>Yes (limited)</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes (limited)</td>
</tr>
<tr>
<td>Paid price</td>
<td>$20/mo</td>
<td>$20/mo</td>
<td>$20/mo</td>
<td>$20/mo (M365)</td>
<td>$20/mo</td>
</tr>
<tr>
<td>Writing quality</td>
<td>Good</td>
<td>Excellent</td>
<td>Good</td>
<td>Good</td>
<td>Adequate</td>
</tr>
<tr>
<td>Research quality</td>
<td>Good</td>
<td>Good</td>
<td>Excellent</td>
<td>Good</td>
<td>Excellent</td>
</tr>
<tr>
<td>Coding quality</td>
<td>Excellent</td>
<td>Excellent</td>
<td>Good</td>
<td>Very Good</td>
<td>Adequate</td>
</tr>
<tr>
<td>Image generation</td>
<td>Yes (DALL-E)</td>
<td>No</td>
<td>Yes (Imagen)</td>
<td>Yes (DALL-E)</td>
<td>No</td>
</tr>
<tr>
<td>File upload</td>
<td>Yes</td>
<td>Yes (large files)</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>Web access</td>
<td>Yes</td>
<td>Limited</td>
<td>Yes (native)</td>
<td>Yes (Bing)</td>
<td>Yes (core feature)</td>
</tr>
<tr>
<td>Mobile app</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
</tbody>
</table>
<h2>How to Choose: Start with Your Main Use Case</h2>
<p>Instead of comparing features, start with what you'll actually use AI for most often.</p>
<h3>&quot;I want a general everyday assistant&quot;</h3>
<p><strong>Go with ChatGPT.</strong> It's the most versatile, has the largest user community (so it's easy to find tips and tricks), and the free tier is genuinely useful. It's the safe default choice.</p>
<h3>&quot;I need help with writing and analysis&quot;</h3>
<p><strong>Go with Claude.</strong> It handles long documents better than any competitor, produces more nuanced writing, and is particularly good at understanding complex instructions. If your work involves reading, writing, or analyzing text, Claude is hard to beat.</p>
<h3>&quot;I do a lot of research and need accurate sources&quot;</h3>
<p><strong>Go with Perplexity.</strong> It's built specifically for research. Every answer comes with citations you can verify. It's not trying to be a creative writer or a coding assistant — it's trying to find you accurate information fast.</p>
<h3>&quot;I live in Google's ecosystem&quot;</h3>
<p><strong>Go with Gemini.</strong> If you use Gmail, Google Docs, and Google Drive daily, Gemini's integration is hard to beat. It can search your email, help with documents, and access real-time information through Google Search.</p>
<h3>&quot;I live in Microsoft's ecosystem&quot;</h3>
<p><strong>Go with Copilot.</strong> If your workplace runs on Microsoft 365, Copilot works inside Word, Excel, PowerPoint, and Outlook. The AI comes to where your work already is.</p>
<h3>&quot;I write code regularly&quot;</h3>
<p><strong>ChatGPT or Claude are both strong.</strong> Claude tends to be better at understanding large codebases and complex architecture. ChatGPT has broader ecosystem support. Many developers use both.</p>
<h2>The Secret: Most People Should Try Two</h2>
<p>Here's what the comparison articles won't tell you: the differences between these tools are smaller than the marketing suggests. For 80% of tasks, any of them will do a good job.</p>
<p>The real differences show up at the edges — very long documents, complex reasoning chains, specific creative styles, or niche technical tasks. The best way to find your favorite is to try two or three on the same task and see which output you prefer.</p>
<p>All of them offer free tiers. Spend a week using two of them side by side. You'll quickly develop a preference.</p>
<h2>When Free Is Enough (and When It's Not)</h2>
<p>Every major AI tool has a free tier, but they come with limitations:</p>
<table>
<thead>
<tr>
<th>What Free Gets You</th>
<th>What Paid Adds</th>
</tr>
</thead>
<tbody>
<tr>
<td>Access to capable (but not top-tier) models</td>
<td>Access to the most powerful models</td>
</tr>
<tr>
<td>Usage limits (messages per day/hour)</td>
<td>Much higher or unlimited usage</td>
</tr>
<tr>
<td>Basic features</td>
<td>Advanced features (file analysis, image generation, priority access)</td>
</tr>
<tr>
<td>Adequate for casual use</td>
<td>Necessary for daily professional use</td>
</tr>
</tbody>
</table>
<p><strong>Start with free.</strong> If you hit the usage limits regularly or find yourself wishing for better responses, upgrade. The $20/month is worth it if you use AI daily — it's the cost of one lunch for a tool that saves hours.</p>
<h2>Two Mistakes to Avoid</h2>
<h3>1. Chasing the &quot;Best&quot; Model</h3>
<p>Every month, a new benchmark says a different model is &quot;best.&quot; Don't chase this. The differences at the top are marginal, and the model that scores 2% higher on a benchmark might not be the one that's best for your specific tasks. Pick a tool, learn it well, and switch only if you have a genuine reason.</p>
<h3>2. Paying for Multiple Subscriptions</h3>
<p>Unless you have a specific reason, one paid subscription is enough. Pick the tool that fits your primary use case, pay for that one, and use the free tiers of others for occasional tasks that need a different strength.</p>
<h2>The Bottom Line</h2>
<p>The best AI tool is the one you'll actually use consistently. Pick based on your primary use case, start with the free tier, upgrade if it becomes part of your daily workflow, and don't overthink it. The gap between these tools is much smaller than the gap between using AI well and not using it at all.</p>
<p><strong>Want to understand how AI can be customized for specific tasks?</strong> Read <a href="/learn-ai/what-is-fine-tuning">What Is Fine-Tuning? Teaching AI New Tricks</a>.</p>
<p><strong>Wondering if AI could help your business?</strong> Take our free <a href="/ai-readiness.php">AI Readiness Assessment</a> — 2 minutes, personalized recommendations.</p>]]></content:encoded>
      <category>beginner</category>
      <category>chatgpt</category>
      <category>claude</category>
      <category>gemini</category>
      <category>comparison</category>
    </item>
  </channel>
</rss>
