ai.rs — AI That Works for Your Business

100% Human Key-Pressed: Share This Email Signature

ai.rs — Sat, 23 May 2026 11:00:00 +0200

Here's a three-line pledge you can paste into your email signature today:

──
100% human key-pressed content.
0% machine-generated.
99% Naturally imperfect.

That's it. Thirty seconds, three lines, one quiet stake in the ground. The rest of this article is why it works, who it's for, and how to make it yours.

Why these three lines do the work

Each line carries one specific weight.

"Human key-pressed content." Honest. Your fingers actually hit keys. Autocomplete suggested some words; you accepted some, rejected others. That's still you, the same way a dictation transcript is still you. The bar for human-written in 2026 isn't no software involved — that bar evaporated when Gmail Smart Compose shipped in 2018. The bar is: did a person make the decisions.

"0% machine-generated." A different stake. This one says: no pasted ChatGPT draft, no rewrite this in my voice handoff, no auto-composed reply. The line is specifically about generation, not assistance. Autocomplete suggests; you ratify. Generation produces; you copy.

"99% Naturally imperfect." The wink. Perfection is now the tell. The cleanest paragraphs in your inbox were probably written by something whose paragraphs are always clean. Imperfection — the dropped article, the run-on, the almost-right-word — used to be a thing to apologize for. In 2026 it's a watermark.

The quiet cost of AI in your inbox

Every major email client now ships with a writing model on by default. Gmail's Smart Compose, Outlook's Copilot, Apple's Writing Tools, Superhuman's Instant Reply. Each one nudges your sentences toward a smoothed-out, slightly-hedged, professional-but-generic register. The kind of prose that doesn't offend anyone, doesn't surprise anyone, and increasingly doesn't sound like you.

That's the cost. Not that AI writes your emails — that AI flattens them. Your idiom, the weird metaphor you'd reach for, the typo you'd leave because the sentence sounds right that way: all of it gets quietly replaced by suggestions tuned for the median of every email ever sent. Multiply by a billion inboxes and the texture of human written communication starts to converge.

Naturally imperfect isn't just a stake against bot output. It's a stake against the slow erosion of your own voice by a thousand helpful auto-suggestions a day.

"But anyone can paste this onto AI text"

Yes. Anyone can also wear a t-shirt that says honest. The signature isn't a forensic test — it's a public commitment. The point is the social contract you're signing, and the moment your reader notices you signed it.

This works the same way No animals were harmed works on film credits: nobody audits the production line. The line still matters, because attaching it to your name means if it turned out to be untrue, that would be on you. Disclaimers aren't proofs. They're invitations to accountability.

Copy this. Or pick your variant.

──
100% human key-pressed content.
0% machine-generated.
99% Naturally imperfect.

That's the maximalist version. Six field-tested variants for different rooms:

Audience	Variant
Founders	"Written by a human who reread it twice and shipped anyway."
Developers	"git blame: me. Compile errors: also me."
Consultants	"Hand-written. Spellcheck off."
Maximalists	the three-line pledge above
Deadpan	"Sent without AI. Probably."
Sci-fi readers	"No electric sheep were harmed in the writing of this email."

Pick one. Or remix it. Tag #naturallyimperfect when you do — it's the easiest way to find the others doing the same thing.

For Gmail, Outlook, Apple Mail: paste into Settings → Signature. For pre-styled HTML so the formatting survives Outlook's helpfulness:


  100% human key-pressed content.

  0% machine-generated.

  99% Naturally imperfect.

Thirty seconds, one paste, done.

What this is actually signaling

AI-written email is now the default texture of inboxes. Slack messages, status updates, replies to your client — all of it has been quietly drifting toward the same smoothed-out, slightly-hedged, perfectly-paragraphed prose. Sounds-like-a-person-but-isn't is the unmarked case now.

Against that, naturally imperfect text is a deliberate, costly signal. It says: I cared enough to send you something flawed. I didn't outsource the act of writing this to a model that would have done it cleaner. The imperfection isn't a bug. It's the receipt.

That's the actual trust signal in 2026. Not I didn't use AI. But I'm accountable for the output, including the bits that aren't smooth.

What to do with it

Paste it into your signature today. Reply-all to one person you respect with it on. See what happens.

No electric sheep were harmed in the writing of this signature. The wool, the bleating, the imperfection — all ours.

How to Run Qwen3-Coder 30B-A3B on RTX 5090 with Ollama

ai.rs — Fri, 22 May 2026 11:00:00 +0200


May 2026 — notes from setting up a local coding LLM on a single consumer GPU, with the bumps left in.

The goal

A coding-focused LLM running entirely on my own hardware. Reasons in descending order of weight: no per-token costs, no rate limits, no data leaving the box for routine tasks, latency that's bounded by my own GPU rather than someone else's queue. Hardware on the desk: one RTX 5090 (32 GB VRAM, Blackwell sm_120), running Arch Linux. The question was what to put on it.

A false start: the cloned repo

I'd cloned noonghunna/club-3090 — a well-maintained recipe collection for serving LLMs on RTX 3090s. Excellent documentation, real benchmarks, honest about failure modes (their docs/CLIFFS.md is the kind of writeup most serving projects could learn from). But reading the actual launch.sh, the hardcoded model list was just two entries — Qwen3.6-27B and Gemma-4-31B — and the whole architecture is built around squeezing 27B-class models through 24 GB Ampere with vLLM nightlies and Genesis patches. Wrong card class, wrong era, wrong constraints. The "model-agnostic by design" claim in the README is aspirational at the code level: the structure scales to new models, but the launcher itself is bound to specific compose files.

Right call: read the docs, skip the runtime.

Picking the pieces

Model. Qwen3-Coder-30B-A3B-Instruct. The "30B-A3B" is a Mixture of Experts: 30B total parameters, but only ~3B are activated per token. Inference cost is roughly that of a 3B dense model; quality lands much closer to a 30B dense model thanks to expert specialization. There's a 480B-A35B sibling that's outside reach for a 32 GB card. Easy choice.

Quantization: Q5_K_M. At 21.7 GB this hits the quality/size sweet spot for 32 GB. Q4_K_M is ~18 GB but takes a 1–2% quality hit on coding tasks where token-level precision matters. Q8_0 is ~32 GB and leaves essentially no room for KV cache. Q5_K_M leaves enough headroom for a useful context window.

Serving engine: ollama. This one surprised me. The "right" answer for max throughput would be llama.cpp's llama-server directly, or vLLM. But ollama wraps the same llama.cpp engine. The TPS gap between ollama and standalone llama-server is typically 0–10% — wrapper overhead, not engine difference. What you gain by going standalone is access to flags ollama hides (KV cache quantization, dedicated bench binaries). What you give up is the operational niceness: ollama list, automatic VRAM unload after idle, painless model switching, a model store that handles versioning. For daily use against one model, ollama wins on UX without paying meaningfully in speed.

The Q5_K_M gotcha

ollama maintains a curated library of pre-packaged models. ollama pull qwen3-coder works — except the curated quants for the 30B variant are Q4_K_M, Q8_0, and FP16. No Q5_K_M. Q4_K_M is the obvious "just go" option but I wanted to actually run Q5_K_M for the quality.

The workaround: download the Q5_K_M GGUF directly from one of the public re-quanters on Hugging Face (Unsloth and bartowski both maintain full quant sets), then register it with ollama via a Modelfile:

FROM /home/arch/models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf
PARAMETER num_ctx 32768
PARAMETER num_gpu 99

ollama create reads the FROM file, hashes it, and stores it as a content-addressed blob. Chat template, tokenizer config, and tool-calling format are read from the GGUF's metadata automatically — no TEMPLATE directive needed, and tool calling works out of the box for agent-style clients like Cline.

Disk-duplication caveat: my ollama runs as a systemd service with its model store at /var/lib/ollama/, which is a different btrfs subvolume from /home. Btrfs doesn't allow cross-subvolume hardlinks, so ollama create copies the 22 GB file into its store. You can run ollama as your user with OLLAMA_MODELS=$HOME/.ollama/models to get hardlinks and zero duplication, but for 22 GB and 346 GB free that wasn't worth the systemd-juggling. Trading disk for simplicity.

The 64K context gotcha

First attempt: num_ctx 65536 in the Modelfile, ollama create, ollama run. Result:

Error: 500 Internal Server Error: memory layout cannot be allocated with num_gpu = 99

Initial instinct: ollama's memory estimator being pessimistic on MoE models. Wrong instinct. nvidia-smi showed 5.6 GB of VRAM already in use — KDE plasmashell (660 MB), Chromium GPU process and tabs (~3 GB total), Telegram (450 MB), a few smaller apps. Normal desktop session, but enough to push the budget over the line:

Q5_K_M weights:            ~22 GB
FP16 KV cache at 64K:       ~6 GB
Activations + cudagraph:    ~2 GB
                            ─────
Total needed:               ~30 GB
Free VRAM (after desktop):  26.4 GB
                            ─────
Shortfall:                  -3.6 GB

ollama wasn't pessimistic — the math was correct. Two ways out: free the 3.6 GB by closing Chromium, or shrink the KV cache. I dropped to num_ctx 32768, which cuts KV to ~3 GB. After re-creating the model:

ollama:      24.4 GB  (weights + KV + activations)
Desktop:      5.4 GB
Free:         2.2 GB

Fits cleanly with a healthy buffer.

This is the part where local serving differs from cloud most concretely. Cloud inference has dedicated machines with 80+ GB of HBM per GPU, often 8 GPUs sharing capacity. Your local card shares with the desktop, the browser, the chat app, the screenshot tool. The first ~5 GB of VRAM is gone before the model even loads.
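A minimal sketch of checking free VRAM before picking a context size, assuming nvidia-smi is on the PATH. The query flags are standard nvidia-smi options; the helper names and the sample line are hypothetical, and GB is treated as GiB for simplicity.

```python
import subprocess

# Standard nvidia-smi query flags; with "nounits" the values are plain MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.free,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_vram_line(line):
    """Parse one 'used, free, total' CSV line (MiB) into a dict."""
    used, free, total = (int(field) for field in line.split(","))
    return {"used_mib": used, "free_mib": free, "total_mib": total}

def query_vram(gpu_index=0):
    """Run nvidia-smi and parse the line for one GPU (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_vram_line(out.strip().splitlines()[gpu_index])

def shortfall_mib(needed_gib, vram):
    """Positive result: MiB missing. Negative result: MiB of headroom."""
    return round(needed_gib * 1024) - vram["free_mib"]

# Hypothetical sample line, shaped like a desktop session holding ~5.6 GiB.
sample = parse_vram_line("5734, 26873, 32607")

# ~30 GiB is the 64K estimate from the table above (weights + KV + overhead).
print(shortfall_mib(30, sample))
```

On a live machine, shortfall_mib(30, query_vram()) gives the same answer against whatever the desktop is holding at that moment.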

"Is 32K enough?"

When Claude advertises 1M context and your local model is capped at 32K, the gap looks vast. It isn't, for what coding actually needs:

A typical source file: 1–5K tokens
A file plus 3–5 related files for context: 10–20K tokens
A moderate-codebase summary with focused references: 25–30K tokens

32K covers all of that. The places where 1M actually pays off — read this entire 200-file repo and refactor it; ingest a 600-page document and answer questions across all of it — are where you'd be reaching for a cloud model anyway, both for context and for the qualitatively better judgment of a frontier model. The local model is for the routine 80%: "explain this function", "write a unit test", "refactor this loop", "what's wrong with this regex".

For when you do want more context locally, ollama exposes OLLAMA_KV_CACHE_TYPE=q8_0, which roughly halves KV memory at near-zero quality cost. That alone moves 64K from "won't fit" to "fits with room". I left that as an opt-in rather than the default since it requires editing the systemd unit.
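The KV figures in this section can be reproduced with a short calculation. This is a sketch under assumptions: the layer count, KV-head count, and head dimension below are not stated in this write-up and should be checked against the model's published config. The q8_0 cost assumes llama.cpp's block layout of 32 int8 values plus one FP16 scale.

```python
# Assumed architecture values for Qwen3-Coder-30B-A3B (verify against the model config).
LAYERS = 48
KV_HEADS = 4
HEAD_DIM = 128

FP16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values + one 2-byte FP16 scale per block

def kv_bytes_per_token(bytes_per_value):
    """One K vector and one V vector per layer, per KV head."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value

def kv_gib(context_tokens, bytes_per_value):
    """Total KV cache size in GiB for a fully used context window."""
    return context_tokens * kv_bytes_per_token(bytes_per_value) / 2**30

for ctx in (32768, 65536):
    print(ctx, kv_gib(ctx, FP16_BYTES), kv_gib(ctx, Q8_0_BYTES))
```

Under these assumptions FP16 comes out at 3 GiB for 32K and 6 GiB for 64K, matching the estimates above, and q8_0 at 64K lands slightly above 3 GiB because of the per-block scales.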

How to think about quant vs context

A natural follow-up question after hitting the 64K wall: what if I gave up some weight precision in exchange for more context? Q4_K_M is ~18 GB on disk; that's roughly 4 GB less than Q5_K_M, which is enough KV cache for an extra ~40K tokens at FP16. So a Q4_K_M build with the same VRAM budget gets roughly double the workable context. Tempting.

But there are two things that make this less obviously good than it looks.

First, the quality cost isn't symmetric across workflows. Published coding benchmarks (HumanEval, MBPP, LiveCodeBench) show Q5_K_M → Q4_K_M drops of 1–3% absolute pass rate for 30B-class models. That's small enough to be undetectable on a single prompt: blind taste tests, you'd struggle to tell them apart. But for agentic coding — Cline-style multi-step refactors, aider with edit-format tool calls, anything where the model is making chained decisions — those small per-step errors compound. A 2% wrong-token rate per decision over 10 decisions starts to look meaningfully different from the same model at Q5. So the Q5 → Q4 swap costs more in workflows where it matters most: long-running agent sessions, which are also the workflows that most want the extra context.
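The compounding effect is easy to put numbers on. The sketch below assumes each decision fails independently with the same probability, which is a simplification; real agent steps are correlated, so treat the output as an illustration rather than a measurement.

```python
def chain_success(per_step_error, steps):
    """Probability that every step in a chain succeeds, assuming independence."""
    return (1.0 - per_step_error) ** steps

for err in (0.01, 0.02, 0.03):
    at_least_one = 1.0 - chain_success(err, 10)
    print(f"{err:.0%} per step -> {at_least_one:.1%} chance of at least one error in 10 steps")
```

At 2% per decision, a 10-step chain has roughly an 18% chance of containing at least one wrong step, which is why a gap that is invisible on single prompts shows up in long agent sessions.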

Second, more context doesn't translate linearly to better outputs. Coding models tend to degrade on long-context retrieval beyond their effective working window — quality on "use these 50 files to find the bug" drops sharply past ~32K, even for models trained to 256K. Published needle-in-haystack benchmarks measure something narrower than what real codebase work needs. Past ~32K, you usually get better results by being selective about what you include in context than by stuffing more in.

So the binary "Q5 with 32K context vs Q4 with 64K context" turns out to be the wrong framing. The real lever is in the middle.

What actually works:

Q5_K_M + q8_0 KV cache keeps Q5-level weight quality and roughly halves the per-token KV cost. With near-zero quality impact, it brings 64K into easy reach and 128K close to the edge. q8_0 isn't true FP8 (it's int8 with shared FP16 block scales) but the memory savings are FP8-class.
Unsloth's UD-Q5_K_XL variant, at the same 21.7 GB size as Q5_K_M, selectively keeps higher precision on critical layers. Theoretically pushes quality toward Q6 territory at Q5 cost.

The sensible progression for someone in my position: enable q8_0 KV first (a free lever — no quality tax) and live with that for a couple of weeks. If you find yourself routinely running out of context on real tasks past 128K, the workflow is asking for cloud anyway. Only consider Q4_K_M if you've actually validated that the context ceiling matters in your day-to-day, not just in theory.

Going to Q4 before trying q8_0 KV is paying the quality bill up-front for ceiling you might never touch.

The performance surprise

I'd estimated 80–120 TPS based on the model size (30B). The first benchmark shipped that estimate to the bin:

{"eval_count": 462, "eval_duration_ms": 2001.85, "tps": 230.79}

231 tokens per second for a short coding completion. Roughly double my back-of-envelope.

The reason is the MoE architecture. My mental model was anchored on dense 30B inference, where every parameter touches every token and TPS reflects that. In a 30B-A3B MoE, each token's forward pass activates only ~3B of parameters (the chosen experts plus the shared layers). Generation speed scales with active parameters, not total. On a 5090's memory bandwidth, 3B of effectively-active weights moves fast.
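A rough bandwidth-bound ceiling shows why the dense intuition was off. This sketch assumes decoding is limited purely by reading weights from VRAM and uses a spec-sheet bandwidth of 1792 GB/s for the RTX 5090, a figure not taken from this write-up. It ignores KV cache reads, routing, and kernel overhead, so real throughput sits well below the ceiling.

```python
MEM_BANDWIDTH_GBPS = 1792.0  # assumed RTX 5090 spec-sheet bandwidth, GB/s
WEIGHTS_GB = 21.7            # Q5_K_M size from the write-up
TOTAL_PARAMS_B = 30.0        # total parameters, billions
ACTIVE_PARAMS_B = 3.0        # parameters touched per token, billions

def decode_tps_ceiling(gb_read_per_token, bandwidth_gbps=MEM_BANDWIDTH_GBPS):
    """Upper bound on tokens/s if each token requires reading this many GB."""
    return bandwidth_gbps / gb_read_per_token

gb_per_billion_params = WEIGHTS_GB / TOTAL_PARAMS_B

# Dense 30B: every weight is read for every token.
dense_ceiling = decode_tps_ceiling(WEIGHTS_GB)
# 30B-A3B MoE: only the active ~3B parameters are read per token.
moe_ceiling = decode_tps_ceiling(ACTIVE_PARAMS_B * gb_per_billion_params)

print(round(dense_ceiling), round(moe_ceiling))
```

Under these assumptions the dense ceiling is about 83 TPS, close to the original 80–120 estimate, while the MoE ceiling is about 826 TPS. The measured 231 falls between the two.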

The catch is that prefill — reading the prompt before generation starts — still touches all the model machinery, and it scales roughly quadratically with prompt length. So a short interactive coding prompt feels blazing; a 20K-token "here's my codebase" prompt has a noticeable pause before the first token. The 230 TPS number is steady-state generation, not prefill-bound latency.

Either way, this is comfortably usable. At 230 TPS, a 1000-token response materializes in about 4 seconds. Interactive coding feels closer to typing-speed than to "wait for the assistant".
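For reproducing the TPS figure, here is a small sketch against Ollama's native /api/generate endpoint, which reports eval_count and eval_duration (in nanoseconds) when streaming is off. The function names are hypothetical, and the model name and host are the ones used in this write-up.

```python
import json
import urllib.request

def tokens_per_second(response):
    """Decode speed from an /api/generate response; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

def seconds_for_tokens(token_count, tps):
    """Steady-state generation time, excluding prefill."""
    return token_count / tps

def generate(prompt, model="qwen3-coder-q5km", host="http://localhost:11434"):
    """Send one non-streaming request to a running ollama daemon."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The benchmark numbers quoted above, with the duration converted back to nanoseconds.
quoted = {"eval_count": 462, "eval_duration": 2_001_850_000}
print(round(tokens_per_second(quoted), 2))
print(round(seconds_for_tokens(1000, tokens_per_second(quoted)), 2))
```

With the daemon running, tokens_per_second(generate("...")) measures a fresh prompt. The response also carries prompt_eval_duration, which is the number to watch for prefill-heavy prompts.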

Going to 64K — and finding what 32K hid

The progression sketched above put OLLAMA_KV_CACHE_TYPE=q8_0 first: quantize the KV cache to int8 with FP16 block scales, halving its VRAM cost at essentially zero quality impact. I did that next.

The setup is a systemd drop-in (/etc/systemd/system/ollama.service.d/override.conf) adding two env vars to the daemon: OLLAMA_KV_CACHE_TYPE=q8_0 and OLLAMA_FLASH_ATTENTION=1 (the second is auto-enabled on Blackwell, but being explicit is cheaper than wondering later). After systemctl daemon-reload && systemctl restart ollama, I bumped num_ctx in the Modelfile from 32768 to 65536 and re-ran ollama create.

The numbers confirmed it engaged. ollama process VRAM went from 24.4 GB at 32K-FP16-KV to 25.0 GB at 64K-q8_0-KV, a 0.6 GB increase. At FP16, 64K of KV cache would have cost ~6 GB instead of ~3 GB, so this is the ~3 GB saving you'd expect from halving the per-token KV cost while doubling the context. TPS sat at 223, essentially unchanged from the 230 at 32K. Free desktop VRAM dropped to 1.4 GB, which is tight but workable. Functionally I now had 2× the context for less than 1 GB more allocation.

Then I ran a real coding prompt (the anagrams test) to validate quality. And the output went off a cliff.

The model wrote a sensible function. Then emitted <|endoftext|> as literal text. Then kept generating. It hallucinated a fake user follow-up turn ("Human: Can you modify the function to also..."). Then "answered" itself. Then repeated this loop four or five times, each iteration claiming to be the "final clean version" and contradicting the previous one. At no point did ollama stop the generation.

The diagnosis was upstream of everything I'd been doing. ollama show --modelfile qwen3-coder-q5km revealed the actual template ollama had registered for the model:

TEMPLATE {{ .Prompt }}

That's the no-template default — raw user input passed through unchanged, no ChatML wrapping, no stop tokens declared. ollama is supposed to read the chat template from the GGUF's tokenizer.chat_template metadata field. Either the Unsloth re-quant doesn't populate that field cleanly, or ollama 0.19 doesn't parse Qwen3's specific Jinja template variant correctly. Either way, ollama had silently fallen back to "no template" without warning, and I hadn't noticed because:

Modern Qwen is robust enough to produce sensible output even from bare prompts. The model's first response was fine.
Short prompts (like the Sieve benchmark) end naturally and don't need stop tokens to halt: the model picks a reasonable conclusion and the API returns. That benchmark had been measuring TPS on a workflow where the missing stop tokens never mattered.
The model emitted <|endoftext|> — but as literal text, because ollama wasn't told it was a stop string.

The fix was a proper TEMPLATE block in the Modelfile (Qwen ChatML, ~15 lines) plus three explicit PARAMETER stop directives: <|im_end|>, <|endoftext|>, <|im_start|>. After ollama create re-registered with these in place, the same anagrams prompt produced one focused answer, the model emitted its turn terminator, ollama halted, and the REPL returned to the >>> prompt. The output quality was visibly higher too — internal doctest/code consistency held (in the broken run, the doctest expected output that contradicted the implementation), and the model used modern list[str] type hints rather than the older typing.List[str].

The lesson: when you go custom-GGUF-via-Modelfile instead of using ollama's curated library, you take on responsibility for the chat template and stop tokens that the curated tags configure invisibly. Going to ollama pull qwen3-coder:30b-a3b-q4_K_M would have given me the right template metadata for free. Going custom traded that for the higher quant. Worth the trade — but the silent fallback to the no-template default was a much sharper edge than I'd expected from "just create a Modelfile."

It also retroactively changes my reading of an earlier observation. The first time I ran the anagrams test, before fixing the template, the model wrote a function whose doctest contradicted its own code — the kind of small-but-real attention drift I'd attributed in passing to Q5 quantization. With the template fixed, the same prompt produces an internally consistent answer. That drift wasn't the quant. It was the model being forced to keep generating past its natural end-of-turn, getting derailed into self-correction loops, and accumulating contradictions across the imagined revisions. The quant was never the problem.

Where this leaves things

Final stack:

Hardware: RTX 5090, 32 GB
Model: Qwen3-Coder-30B-A3B-Instruct, Q5_K_M (21.7 GB on disk)
Engine: ollama 0.19.0 (wraps llama.cpp), with q8_0 KV cache and flash attention enabled via systemd override
Context: 64K
VRAM at load: 25.0 GB used by ollama, 5.6 GB by desktop, 1.4 GB free
Speed: ~223 TPS steady-state for short prompts (essentially unchanged from 32K-FP16)
Endpoint: http://localhost:11434/v1, model qwen3-coder-q5km

Coding clients (aider, Continue.dev, Cline, Cursor with custom-provider mode) all connect to the OpenAI-compatible endpoint with a dummy API key. Tool calling works because the Modelfile's TEMPLATE block renders Qwen ChatML correctly, and the embedded GGUF tokenizer handles the framing.
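For a quick check outside any coding client, a request against the OpenAI-compatible endpoint can be built with the standard library alone. This is a sketch: the function names are hypothetical, and the "ollama" API key is an arbitrary placeholder, in line with the dummy key mentioned above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"
MODEL = "qwen3-coder-q5km"

def build_chat_request(prompt, api_key="ollama"):
    """Build a chat-completions request; the key is a placeholder value."""
    body = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def chat(prompt):
    """Send the request to a running ollama daemon and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling chat("explain this regex: ^a+$") with the daemon up is a cheap way to confirm the endpoint and model name before pointing a heavier client at it.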

What I'd try next:

The UD-Q5_K_XL variant from Unsloth at the same 21.7 GB size — uses higher precision selectively on important layers, theoretically better quality for the same VRAM cost.
Side-by-side against Claude on real tasks — not synthetic benchmarks, just "did the local model handle this PR review / refactor / debugging session, and where did it fall short". The interesting question for local serving isn't TPS; it's "where exactly is the quality cliff vs cloud, and what tasks fall safely below it."
vLLM with FP8-quantized weights to actually exploit Blackwell's FP8 tensor cores. llama.cpp doesn't use them today; running on a 5090 leaves them idle. The setup cost is real (different weight format, more moving parts) but it's the only way to find out what this card can actually do on dense models.

Reflections

A few things I'd tell past-me starting this experiment.

The exotic stuff is for niche constraints. vLLM, Genesis patches, custom quant kernels — these exist because someone has a constraint that can't be fixed any other way (24 GB Ampere, prefill cliffs on specific architectures, etc.). On a 5090 with a normal model, ollama covers 95% of the value and any of the alternatives is incremental.

Estimate VRAM by what's free, not what's installed. "I have 32 GB" is misleading. You have 32 GB minus whatever your desktop and apps are holding, and that floor moves around. Check nvidia-smi before assuming. The first failure of this experiment — 64K context refusing to fit — wasn't a misconfiguration. It was the desktop quietly holding 5.6 GB that the back-of-envelope math hadn't accounted for.

MoE inference is its own thing. Dense-model intuitions about TPS don't transfer. The 230 TPS surprise was useful — it changed what I think this hardware is good for. The expensive parts of a 30B-A3B forward pass are routing decisions and shared layers, both small; the bulk of the parameter budget sits in experts that mostly idle.

The curated-vs-custom trade is sharper than it looks. When you ollama pull a tag from the curated library, you also pull the right chat template, stop tokens, and parameter defaults invisibly bundled with the weights. When you go custom — your own Modelfile pointing at a downloaded GGUF — you're responsible for those, and ollama's fallback when it can't read the GGUF's embedded chat template is no template at all, silently. It "works" for short prompts because Qwen is robust, and fails catastrophically for longer ones because there are no stop tokens. The first I knew was the model hallucinating fake user turns. Add explicit TEMPLATE and PARAMETER stop directives to any custom Modelfile, even if you think the GGUF "has it built in".

Quality bugs and config bugs look the same from outside the model. I almost wrote off the model's doctest/code inconsistency as a Q5_K_M quality limit — exactly the kind of "small attention drift that compounds in agentic workflows" I'd theorized about earlier. It wasn't. It was the model being forced to keep generating, drifting through invented follow-up turns, accumulating contradictions across imagined revisions. Once stop tokens worked, the same prompt produced an internally consistent answer. Worth a sanity check before blaming the weights: is the model actually finishing its turn, or is it being kept on the leash by missing config?

Local isn't a cloud replacement, it's a complement. The right framing isn't "can the 5090 run something as good as Claude". It's "for which tasks is the 5090 fast enough, private enough, and cheap enough that I'd rather use it than reach for the cloud, even at lower quality". For routine coding tasks the answer is "many of them" — once the stop tokens are working.

Postscript: trying Crush

After the writeup above, I went looking for a more polished alternative to Aider — something with the agentic UX of Claude Code but model-agnostic from the start. The obvious candidate was Crush from Charmbracelet — the team behind Bubble Tea, Lipgloss, Glamour, Glow, the terminal-UI shop. Go-based, single binary, AUR-installable with yay -S crush-bin. ~24K stars, daily commits, growing fast.

The install was clean. The TUI launch screen was genuinely beautiful — pixel-perfect spacing, considered colors, a Charm logo that's just the right amount of fun. Better than any other coding-assistant TUI I've seen. The two-tier "Large Task / Small Task" model picker is a nice ergonomic detail — configure cheap-and-fast for one slot, quality-for-hard-stuff for the other. I added Qwen3-Coder Q5KM under an ollama provider in ~/.config/crush/crush.json, similar shape to the OpenCode config. Crush picked it up; the model picker showed it as ✓ Configured. So far so good.

One nice UX detail worth noting: Crush also detected my ANTHROPIC_API_KEY (set elsewhere for Claude Code) and defaulted to Claude Sonnet 4.6 automatically, prioritizing cloud over local when both are available. Switching to Qwen3-Coder via the picker was a keystroke. Real respect for the dual-model dual-provider workflow.

Then I gave it a prompt: evaluate README.

Crush replied with "I'll evaluate the README.md file for you," and then immediately got stuck:

[Uses ls tool] [uses view tool] [uses view tool] [uses view tool] ...

Pages of it. Hundreds of lines of [uses view tool] in brackets. The model was outputting natural-language descriptions of tool calls instead of actual structured tool calls — and Crush wasn't executing anything, so the model never got file content back, so it kept "trying." Stop tokens didn't fire because none of <|im_end|> / <|endoftext|> / <|im_start|> was appearing in this hallucinated description format.

A bit of digging revealed this is a known Crush bug — #2936, filed by another user the day before my own attempt, with mitmproxy diagnostics proving the chain:

Crush correctly sends tool definitions to the model.
The model correctly responds with finish_reason: "tool_calls" and well-formed tool call JSON.
Crush silently ignores the tool calls and never executes them.
The model, getting no execution feedback, repeats — bounded only by default_max_tokens.

So our setup was right. The model was right. The protocol translation was right. Crush itself has a regression in its OpenAI-compatible-provider tool-call execution path that didn't exist in earlier versions — a January 2026 blog post by Meschbach documents the same setup working successfully four months earlier. The breakage is recent, the fix is pending, and multiple related issues going back to August 2025 (#447, still open after nine months) suggest the local-provider integration is a fundamentally rough surface area for Crush at the moment. Not bad faith from Charm — just not yet a fully-shipped feature.

This is the second empirical confirmation of the article's design-theory framing, on top of the chat-template gotcha earlier:

The first failure was at our config layer (silent fallback to no-template when ollama couldn't parse the GGUF's embedded Jinja). Fixable by adding explicit TEMPLATE and PARAMETER stop directives to the Modelfile.
The second failure is at the tool's layer (Crush's tool-call execution path is broken for OpenAI-compatible providers). Not fixable at our level — wait for Charm to ship a fix.

Both fit the same pattern: agentic-style tools have larger surface areas to break, particularly along the local-model integration path that isn't the developers' day-job priority. Aider's smaller, more deliberate surface area — user-driven dialog, explicit file context via /add, no autonomous tool exploration — avoids both failure modes by design. Not because Aider is "better" in some absolute sense, but because Aider's design rewards weaker models for what they can do (write code given context) instead of asking them to do what they're worst at (drive an agentic tool loop reliably).

The right next experiment is OpenCode — same agentic category as Crush, different codebase, possibly different bug surface. If OpenCode handles tool calls against ollama cleanly, "agentic + local model" works in some tool, just not Crush right now. If OpenCode also fails on the same task, the case for Aider's design philosophy gets stronger still: smaller surface area is just better for a workflow where every integration point is a potential bug, and the model itself is more constrained than the tooling assumes.

For now, Aider remains the working tool for actual coding work on this stack. Crush stays installed; I'll come back when #2936 lands.

The meta-lesson is the same one the rest of the writeup keeps pointing at: with a local 30B-class model, the surface area you can fail through is large, and the bugs are silent. Chat templates that quietly fall back to no-template. Stop tokens that aren't fired because the model emitted a non-canonical end marker. Tool-call responses that the client silently discards. None of these failures throw an exception. They all just produce subtly-wrong output, or no output at all, and you only notice when you actually try real work. The setup time isn't in the install — it's in discovering and fixing the silent gaps.

Postscript update: it was the Modelfile, not Crush

After writing the section above, I kept poking. The "Crush is broken with local OpenAI-compatible providers" framing felt too convenient — multiple tutorials documented the combo working in earlier Crush versions, and issue #2936 had been open for less than a day with no maintainer comments either confirming or denying. I tried one more controlled experiment: same Crush, same prompt, same project, but a different Qwen3-Coder variant.

I pulled ollama's curated qwen3-coder:30b-a3b-q4_K_M tag (instead of using my custom Q5_K_M Modelfile from HF), added it to Crush's config alongside my Q5, restarted, switched to it in the model picker, and re-ran the same evaluate README prompt.

It worked. Perfectly. The view tool executed, the README content came back, the model produced a coherent multi-paragraph evaluation. The same Crush that had hallucinated [uses view tool] brackets ten minutes earlier was now driving an agentic tool-call loop without complaint.

The bug wasn't Crush. The bug was my Modelfile.

Diffing the two ollama show --modelfile outputs side by side revealed exactly two lines that differed in any load-bearing way:

RENDERER qwen3-coder
PARSER qwen3-coder

These are model-aware ollama directives, added relatively recently to ollama. They tell ollama how to format prompts for a specific model and — critically — how to parse its output:

RENDERER wraps incoming chat messages in the model's expected format (for Qwen3-Coder, that's ChatML with <|im_start|> / <|im_end|> markers). Without it, ollama either uses the GGUF's embedded chat template or falls back to a stub. With it, ollama uses Qwen-specific logic.
PARSER translates the model's output before delivering it to clients. This is the critical one. Qwen3-Coder emits tool calls in its native format, as XML-style tagged text in the message content carrying a payload such as {"name": "view", "arguments": {"file_path": "README.md"}}. OpenAI-compatible clients (including Crush) expect structured tool-call JSON in the tool_calls field of the response, not raw XML in the content. The PARSER qwen3-coder directive tells ollama to parse the XML and emit proper tool_calls JSON on the OpenAI-compatible API.
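To make the parser's job concrete, here is an illustrative sketch of that kind of translation. It is not ollama's implementation. It assumes the payload is JSON wrapped in a tool_call tag pair, which may not match the model's actual wire format, and the call_N id scheme is made up for the example. The output shape follows the OpenAI chat-completions convention, where arguments is a JSON-encoded string.

```python
import json
import re

# Assumed wrapper tag; the real model output format may differ.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def to_openai_tool_calls(text):
    """Extract tagged JSON payloads and reshape them as OpenAI-style tool_calls."""
    calls = []
    for index, match in enumerate(TOOL_CALL_RE.finditer(text)):
        payload = json.loads(match.group(1))
        calls.append({
            "id": f"call_{index}",  # hypothetical id scheme
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI clients expect arguments as a JSON string, not an object.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls

raw = '<tool_call>\n{"name": "view", "arguments": {"file_path": "README.md"}}\n</tool_call>'
print(to_openai_tool_calls(raw))
```

Without a step like this, a client sees only text in the content field and has nothing structured to execute.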

My hand-rolled Modelfile had TEMPLATE (a 15-line Jinja-ish ChatML wrapper I wrote based on what Qwen needs) and three PARAMETER stop directives. It did not have RENDERER or PARSER. So when the model emitted perfectly valid Qwen tool-call XML, ollama forwarded it raw to Crush, which saw plain text and ignored it. The model, getting no execution feedback, looped on its own attempts to invoke tools — which is the hallucinated [uses view tool] pattern.

This also retroactively explains issue #2936's diagnostic. The reporter saw finish_reason: "tool_calls" with correct tool-call data via mitmproxy, but Crush silently discarded it. Of course Crush discarded it: Crush was looking for structured JSON, and ollama delivered raw Qwen XML. The bug isn't in Crush at all. The bug is that hand-rolled Qwen Modelfiles need RENDERER and PARSER, and that knowledge isn't surfaced anywhere obvious in ollama's docs or in the Crush + Qwen tutorials floating around. The curated qwen3-coder tags include both directives; a Modelfile hand-rolled from a Hugging Face GGUF does not, unless you know to copy them over.
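To make the translation step concrete, here is a minimal Python sketch of what a parser layer has to do. It is not ollama's implementation: the <tool_call> wrapper tag, the call_N id scheme, and the function name are assumptions chosen for illustration.

```python
import json
import re

# Assumption: each tool call arrives as a JSON object wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def to_openai_message(raw_content: str) -> dict:
    """Move tagged tool calls out of the text and into a tool_calls list."""
    tool_calls = []
    for index, match in enumerate(TOOL_CALL_RE.finditer(raw_content)):
        payload = json.loads(match.group(1).strip())
        tool_calls.append({
            "id": f"call_{index}",  # hypothetical id scheme
            "type": "function",
            "function": {
                "name": payload["name"],
                # The OpenAI schema carries arguments as a JSON-encoded string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    remaining_text = TOOL_CALL_RE.sub("", raw_content).strip()
    message = {"role": "assistant", "content": remaining_text or None}
    if tool_calls:
        message["tool_calls"] = tool_calls
    return message
```

Without a step like this, the client receives the tagged payload as ordinary content, which matches the "saw plain text and ignored it" behaviour described above.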

I added the two lines to my Q5 Modelfile (FROM pointing at ollama's existing blob, no re-download needed) and re-ran ollama create. Then in Crush: switched to my Q5 model, ran the same prompt. Same clean tool-call execution. Three data points: Q5 broken, Q4 curated works, Q5 with the fix works. Diagnosis confirmed end to end.

The next stop on this winding path was wanting more context. I bumped Crush's context_window config from 65536 to 131072, restarted Crush, and re-ran a prompt. The model produced an awkward mangled output and didn't actually execute the tool — looked like another regression. But curl /api/ps told the real story: ollama had loaded the model at "context_length": 32768. Crush's context_window config field is UI-only. The OpenAI-compatible API path doesn't have a clean way to pass num_ctx to ollama, so Crush's config just affects the picker label. To actually get larger context, num_ctx has to be set in the Modelfile.

(The mangled output was a separate but related issue: at the default temperature: 0.7 the curated Q4 tag uses, tool-call format adherence is probabilistic — the model occasionally improvises Anthropic-style XML when it should be using Qwen-style JSON. Dropping temperature to 0.2 makes format adherence essentially deterministic for tool use without hurting coding quality.)
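Since the client config cannot be trusted here, it may help to check what ollama actually loaded. The sketch below reads /api/ps and compares the reported context_length with what you expected. The base URL (ollama's default local port) and the "models" / "name" keys of the response are assumptions; only context_length is quoted above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local ollama address


def fetch_ps(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of currently loaded models from a running ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as response:
        return json.load(response)


def context_mismatches(ps_payload: dict, expected: int) -> dict:
    """Return {model name: reported context length} for models that differ from `expected`."""
    mismatches = {}
    for model in ps_payload.get("models", []):
        reported = model.get("context_length")
        if reported != expected:
            mismatches[model.get("name", "?")] = reported
    return mismatches
```

Calling context_mismatches(fetch_ps(), 131072) after a first prompt would have surfaced the 32768 load immediately, before any time went into reading mangled output.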

So the canonical Modelfile that actually works ended up being a custom one built on top of the curated Q4 blob with four added overrides:

FROM /var/lib/ollama/.ollama/models/blobs/sha256-1194192cf2…
RENDERER qwen3-coder         # required for ChatML prompt formatting
PARSER qwen3-coder           # required for tool-call XML→JSON translation
PARAMETER num_ctx 131072     # required because Crush can't propagate num_ctx via OpenAI-compat
PARAMETER temperature 0.2    # required for reliable tool-call format adherence
# (plus the same stop tokens and other sampler params as the curated tag)

After registering this as qwen3-coder-q4-128k and pointing Crush at it, the agentic loop ran cleanly at 128K context with deterministic tool calls. End of investigation.

The real takeaway

This experiment ran over many hours with multiple false stops. The final working setup is a one-page Modelfile and a one-page Crush config. But the path from "model downloaded" to "agentic Crush session running cleanly at 128K context" required understanding four separate gotchas:

Custom Modelfiles need RENDERER and PARSER directives for tool-call translation. Curated ollama tags have them; hand-rolled ones from HF GGUFs don't.
Crush's context_window config is UI-only — num_ctx must be set in the Modelfile, not the client config.
Default temperature 0.7 makes tool-call format probabilistic. For agentic workflows, drop to 0.2.
Stop tokens and chat templates still need to be right, even with RENDERER doing the work — though RENDERER makes a hand-rolled TEMPLATE block unnecessary.

None of these gotchas threw an exception or produced an error message. Each produced "looks plausible, doesn't quite work" output. The cost of running a local agentic stack isn't the disk or the install or the VRAM — it's the slow accumulation of empirical knowledge about which silent failure modes you're currently hitting, and which directive in which configuration file fixes which one.
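Because none of these failures announce themselves, a small pre-flight check on the Modelfile text can catch the first three before a session starts. This is a hypothetical helper, not an ollama feature; it only looks for the directive names discussed above and does not validate their values.

```python
# Directives this article found necessary, mapped to the symptom each one prevents.
REQUIRED = {
    "RENDERER": "prompt formatting",
    "PARSER": "tool-call translation",
    "PARAMETER num_ctx": "context size (the client config does not set it)",
    "PARAMETER temperature": "tool-call format adherence",
}


def missing_directives(modelfile_text: str) -> list[str]:
    """Return the required directives that do not appear in the Modelfile text."""
    present = set()
    for line in modelfile_text.splitlines():
        stripped = line.split("#", 1)[0].strip().upper()  # drop trailing comments
        for directive in REQUIRED:
            if stripped.startswith(directive.upper() + " "):
                present.add(directive)
    return [directive for directive in REQUIRED if directive not in present]
```

Feeding it the output of ollama show --modelfile for a hand-rolled model should list RENDERER and PARSER as missing; for the final Modelfile above it should return an empty list.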

The Aider design philosophy still holds — its smaller surface area genuinely is less likely to break on these silent gotchas. But once you've climbed the configuration learning curve, agentic-style tools (Crush, OpenCode) can be reliable too. The difference is that Aider rewards low-effort setup with reliable behavior; agentic tools demand high-effort setup but give you a richer working surface in return. Either is a valid choice. Just don't believe the install instructions when they say it's two commands. It's two commands plus a half-day of debugging Modelfile silent fallbacks.

Qwen 3.6 27B: a Local Coding Model You Can Actually Run

ai.rs — Sat, 25 Apr 2026 11:00:00 +0200

For most of 2025, "open-source coding model" meant choosing between two unsatisfying tiers. The small models (8B–14B) ran on your laptop and felt like working with a tired intern. The big ones — DeepSeek V3, GLM-5.1, Kimi-K2 — competed with Claude, and required a small GPU cluster to serve.

Qwen 3.6 27B, released by Alibaba on April 22 2026, is the first open model that lands on the practical side of that gap. It runs on a single RTX 4090 or a 24 GB Mac. It gets within 4 points of Claude Opus 4.6 on SWE-bench Verified. The weights are Apache 2.0.

If you've been waiting for the moment when "self-hosted Claude Code" stops being a meme, this is it — with caveats.

What's actually new

Three things are worth knowing before you download 18 GB of weights.

It's a dense model. All 27 billion parameters fire on every token. That's the opposite of the MoE trend (Kimi, GLM, the new GPT-OSS variants), and it matters for hardware: a dense 27B fits the way you'd expect a 27B to fit. No 700B-of-which-30B-active tricks.

262K native context, extensible to 1M with YaRN. Most coding agents spend the first two minutes of a session paging in repository structure; this one can hold a mid-sized monorepo without truncation.

Thinking Preservation — reasoning that survives across turns. Toggle preserve_thinking: true and the model carries forward its prior chain-of-thought instead of regenerating it from the same context every turn. For multi-turn agentic workflows — the only kind that matter for real coding — this is the feature that bends the cost curve.

The benchmarks, with the asterisk

Benchmark	Qwen 3.6 27B	For comparison
SWE-bench Verified	77.2%	Claude Opus 4.6: 80.8%
Terminal-Bench 2.0	59.3%	Matches Claude Opus 4.5
SWE-bench Pro	53.5%	GLM-5.1 (754B MoE): 58.4%
SkillsBench	48.2%	Qwen 3.5 397B: 30.0%

The asterisk: all of these were run on Qwen's internal agent scaffold, not a neutral one. Independent reproductions are still trickling in. Treat the numbers as directional. If your evaluation depends on a specific scaffold — OpenCode, Cline, Aider's bench harness — run it yourself before claiming parity in your README.

The number that's hard to game is the one against the previous generation: 48.2% vs 30.0% on SkillsBench at one-fifteenth the parameters. Whatever Qwen learned between 3.5 and 3.6, it applied it densely.

Hardware: what you actually need

Quantized GGUF (Q4_K_M or UD-Q4_K_XL) lands at ~18 GB. That puts the practical bar at:

Single GPU — RTX 4090, RTX 4080 Super, or any 24 GB workstation card.
Mac — M2 Pro / M3 Pro with 24 GB unified memory or better.
CPU + offloading — works, slowly. 64 GB system RAM, sustained around 6 tokens/sec on a recent Ryzen.

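Full BF16 needs 60 GB+ (27 billion parameters at 2 bytes each is already 54 GB before any KV cache). That is more than the 48 GB a dual-3090 rig or a single A6000 offers, so plan on three 24 GB cards or an 80 GB datacenter GPU. Almost no one needs that. Q4_K_M loses roughly 1–2 points on coding benchmarks vs full precision, well within run-to-run noise.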
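A back-of-the-envelope check of these sizes, as a small Python sketch. It counts weights only and ignores KV cache and runtime overhead, so real usage will be higher; the function names are illustrative.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a dense model."""
    return params_billion * bits_per_weight / 8


def implied_bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Average bits per weight implied by a quantized file size."""
    return file_gb * 8 / params_billion


print(weight_gb(27, 16))                          # BF16 weights alone: 54.0
print(round(implied_bits_per_weight(18, 27), 2))  # the ~18 GB GGUF: 5.33
```

The second figure is an average implied by the ~18 GB file size quoted above, not a published property of the quantization format.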

Three ways to actually run it

1. llama.cpp — fastest path for most developers

brew install llama.cpp   # or build from source

llama-server \
  -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --chat-template-kwargs '{"preserve_thinking": true}'

You get an OpenAI-compatible endpoint at localhost:8080. Point any existing tool that speaks the OpenAI Chat Completions API at it and you're done. This is the path I'd recommend for 90% of readers.
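As a quick smoke test of that endpoint, a standard-library Python client might look like the sketch below. The /v1/chat/completions path is the usual OpenAI-compatible route; the "local" model name is a placeholder, on the assumption that a single-model llama-server does not need a specific name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # the llama-server address mentioned above


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a Chat Completions request for a local OpenAI-compatible server."""
    body = {
        "model": "local",  # placeholder name (assumption)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, ask("Summarize this repository's README") should return plain text; any tool that speaks the same API can be pointed at the same URL.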

2. Unsloth Studio — easiest for first-timers

A browser UI at localhost:8888 that handles weight downloads, GGUF selection, and chat-template wiring. Slower than raw llama.cpp at the margins; much faster to get running if you've never touched a local inference stack.

3. SGLang or vLLM — for serving multiple users

Version 0.5.10+ of SGLang, and recent vLLM, both ship with full Qwen 3.6 support including tool-calling and reasoning-block parsing. This is the right answer if you're serving a team rather than just yourself: a single-user llama.cpp setup will saturate well before batched inference on the same 24 GB card does.

Gotchas

A handful of small footguns are worth knowing about up front.

Avoid CUDA 13.2. It produces gibberish output on Qwen 3.6 GGUFs. 13.1 and 13.3 are fine. If you've blindly upgraded recently, downgrade before you start debugging anything else.

Ollama doesn't work yet. Qwen 3.6's vision capability ships as a separate mmproj file, and Ollama's current packaging doesn't wire it in. Watch the Ollama issue tracker; expect a fix within a release or two. Until then, llama.cpp directly.

Tool-call format. If your agent harness expects the Anthropic tool-use envelope, it won't work out of the box — Qwen ships an OpenAI-style function_call schema. Most modern harnesses (OpenCode, Aider, Cline) handle both; roll-your-own ones may need adapter code.
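For a roll-your-own harness, the adapter code can be quite small. The sketch below converts one OpenAI-style tool call into an Anthropic-style tool_use content block; the function name is hypothetical, and it assumes the arguments field is a JSON-encoded string, as in the OpenAI Chat Completions schema.

```python
import json


def openai_call_to_anthropic_block(tool_call: dict) -> dict:
    """Convert one OpenAI-style tool call into an Anthropic-style tool_use block."""
    function = tool_call["function"]
    return {
        "type": "tool_use",
        "id": tool_call["id"],
        "name": function["name"],
        # OpenAI carries arguments as a JSON string; Anthropic expects an object.
        "input": json.loads(function["arguments"] or "{}"),
    }
```

A full adapter would also map tool results back in the other direction, which is omitted here.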

Should you switch from Claude or GPT?

For most production coding agents, no. Claude Opus 4.7 still leads SWE-bench at 84.3%, and the API price isn't catastrophic for any team that hasn't already optimized tokens out of its workflow.

For three specific cases, yes.

Code that legally cannot leave your machines. Defense, healthcare, pre-IPO startups with competitive code. Self-hosting is the entire point.
High-volume bulk operations. Migrations, codebase translations, automated refactors across a thousand repos. The token bill on the API for that kind of job is a serious chunk of an engineer's salary; a single 4090 amortizes in weeks.
Local-first iteration. A coding agent that doesn't rate-limit you, doesn't change between sessions, and works on the plane.

Outside those cases, treat Qwen 3.6 27B as a fallback worth having configured: somewhere between 90% and 95% of Claude's output quality on most tasks, with a per-token cost of approximately zero, and the same model available six months from now without an API deprecation notice.

That's a meaningful new option, and the first time it has been available to people running on a single GPU.

If you've benchmarked Qwen 3.6 27B on your own workflow, ai.rs would like to hear how it went. Drop a note via the contact page.

Why Every AI Engineer Should Learn Classical Chinese

ai.rs — Tue, 14 Apr 2026 11:00:00 +0200

Or at least, why your agents should be writing in it.

Six months into any serious LLM-agent project, the same thing happens.

The conversation history, the decision log, the accumulated project context — all of it balloons past the model's context window. You start summarizing. The summaries lose fidelity. You feed the summaries back in and the model hedges more, hallucinates more, forgets the decisions it made a month ago. Every call pays for the same project preamble again. The API bill climbs.

If you're calling a frontier model at scale, the cost of context isn't theoretical. It's line one of your infra spend.

So when a GitHub issue crossed my feed claiming that Classical Chinese — 文言文, a literary language whose grammar stabilized around the time of Confucius — could compress agent memory by 28% compared to structured English shorthand, I did what any engineer does on seeing a claim like that.

I assumed it was nonsense and set out to prove it.

I was half right.

The claim, and the skeptic's case

The two projects in question:

MemPalace — an agent-memory architecture that shards long conversations into a "palace" of wings, rooms, closets, and drawers, each holding structured-English compressed notes in a format called AAAK. It scores 96.6% on LongMemEval without calling an LLM summarizer.
MemChinesePalace — a fork-in-spirit by a different author, replacing AAAK with what they call "Wenjian" (文简 — Classical Chinese shorthand). The issue proposing this was closed by the upstream maintainer within hours: "Classical Chinese wouldn't be natively readable by most LLMs."

The case for skepticism looked strong:

Tokenizers don't love Chinese. OpenAI's older cl100k_base tokenizer (used by GPT-4 and GPT-3.5) splits most Chinese characters into 2–3 BPE tokens. "Character count" and "token count" are not the same thing, and Chinese often costs more tokens per character than English.

Classical Chinese is famously ambiguous. Two thousand years of commentators have argued over what any given passage of 文言文 means. For a memory system where you need deterministic recall, that's the opposite of what you want.

AAAK already works. A format like DECISION:auth.migrate:auth0->clerk is ugly but parses with a regex and leaves zero room for interpretation. It uses common English tokens. It's hard to see what Classical Chinese adds.

So the headline claim — "28% fewer tokens" — smelled like someone counting characters and calling them tokens.

Test one: does the token claim hold?

I wrote the smallest possible benchmark: five realistic memory samples (a decision, a bug finding, a milestone event, a team preference, a proposal), encoded in three formats each. Plain English, AAAK, and Wenjian. Then I fed them through tiktoken against two real BPE tokenizers.

The result, totalled across all five samples:

Tokenizer	English	AAAK	Wenjian	Wenjian vs AAAK
`cl100k_base` (GPT-4 / 3.5)	250	220	234	+6.4% (worse)
`o200k_base` (GPT-4o / 5)	253	220	191	−13.2%

My suspicion was right: the 28% figure was character-counted, not token-counted. On the older tokenizer, Wenjian actually loses to AAAK.

But my suspicion was also wrong: on the modern o200k_base tokenizer — the one used by every frontier OpenAI model today — Wenjian really is about 13% smaller. Not 28%, but not zero either.

Half a win for the Wenjian side. The real question, I thought, was whether the model could still read the compressed form accurately. That's where Wenjian's polysemy problem was supposed to bite.
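The percentages in the table follow directly from the token totals. The Python sketch below reproduces them, and includes a counting helper that accepts any encoder callable; with tiktoken installed, tiktoken.get_encoding("o200k_base").encode would be one choice. The helper names are illustrative and the memory samples themselves are not reproduced here.

```python
def total_tokens(samples, encode) -> int:
    """Sum token counts over samples; `encode` is any callable returning a token list."""
    return sum(len(encode(text)) for text in samples)


def percent_change(new: int, baseline: int) -> float:
    """Relative change of `new` against `baseline`, in percent, to one decimal."""
    return round((new - baseline) / baseline * 100, 1)


# Token totals from the table above.
print(percent_change(234, 220))  # cl100k_base, Wenjian vs AAAK: 6.4
print(percent_change(191, 220))  # o200k_base, Wenjian vs AAAK: -13.2
print(percent_change(191, 253))  # o200k_base, Wenjian vs English: -24.5
```

Counting tokens with the real encoder, rather than characters, is the whole difference between the 28% headline and the measured 13%.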

Test two: can the model actually read it?

For this I used a local setup — ollama serving qwen3:32b, qwen3.5:27b, and (later) llama3.1:8b. Qwen is the strongest open model for Chinese, which makes it the fairest test of the "LLMs natively read 文言文" premise. If Wenjian can't perform there, it can't perform anywhere.

The protocol: for each of the five memory samples, I generated three factual questions. The model got only the compressed memory record and one question, and had to answer. Scoring was a deterministic keyword match (no LLM-as-judge — reproducible across runs).

One hundred and twenty calls later, I had my answer:

Model	English	AAAK	Wenjian
`qwen3.5:27b`	15/15 (100%)	14/15 (93%)	15/15 (100%)
`qwen3:32b`	15/15 (100%)	15/15 (100%)	15/15 (100%)

Wenjian matches English. On both models.

The polysemy concern that I and the upstream maintainer had both raised — that Classical Chinese would be too ambiguous for reliable fact recall — simply didn't materialize. When asked "what was the target deadline?" of the line 议 26/Q1末迁身份：Auth0→Clerk, the model answered "end of Q1 2026" without hesitation. When asked who discovered a bug encoded as 普雅设审中得 ("Priya, in the security audit, discovered"), it answered "Priya" or "普雅" — both scored correct.

At this point I had to update. The Wenjian claim isn't bullshit. On Chinese-strong models, it's a Pareto improvement over plain English: 24% smaller, same retrieval. The upstream maintainer was wrong to close the issue that fast.
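For readers who want to reproduce the scoring, a deterministic keyword match can be as simple as the Python sketch below. This is an assumption about what such a scorer looks like, not the benchmark's actual code; the accepted-keyword list mirrors the "Priya" / "普雅" case described above.

```python
def is_correct(answer: str, accepted: list[str]) -> bool:
    """True if any accepted keyword appears in the answer, ignoring case."""
    normalized = answer.casefold()
    return any(keyword.casefold() in normalized for keyword in accepted)


def tally(answers: list[str], answer_key: list[list[str]]) -> str:
    """Score a list of answers against per-question accepted keywords."""
    correct = sum(is_correct(a, keys) for a, keys in zip(answers, answer_key))
    return f"{correct}/{len(answer_key)}"
```

Because there is no model in the scoring loop, the same answers always produce the same score, which is the reproducibility property the protocol relies on.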

A hybrid that nearly beat them both

While I was at it, I built a third compressed format: a hybrid that keeps AAAK's deterministic KEY:value|key:value skeleton but inlines five Chinese idiom macros, namely 亡羊 (tech-debt / known-defect), 破竹 (major breakthrough), 金蝉 (migration / refactor), 定鼎 (final architecture decision), 一石 (single-action-multiple-wins).

These idioms are the genuinely novel contribution of Classical Chinese to this problem. Each one is 2–3 tokens but encodes a multi-token English concept. And because frontier models are trained on enough Chinese literature to know what they mean, there's no learning cost per session — just a one-line legend in the system prompt.

The hybrid scored best on tokens: 28% smaller than English, 17% smaller than AAAK. But when I ran the retrieval test, it stumbled — 87% combined across the two Qwen models. The failures were specific: the shorthand @Q1.26 was read as decoration rather than a deadline, and parenthesized reason-codes like (cwrites+json) were too cryptic to expand when asked "why is this preferred?".

So I wrote a v2 that used t:Q1.26 and why:cwrites,json. It cost nine extra tokens. Retrieval jumped from 87% to 97% on Qwen.

Hybrid v2 now matched Wenjian on compression and came within one question of it on recall (97% vs 100%). On Qwen. The interesting question was what would happen on a model that wasn't trained on a mountain of Chinese text.
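One appeal of the hybrid is that it stays machine-parseable. The Python sketch below parses a record in that style and expands the five macros using the legend given above. The sample record and the sys key are hypothetical; only the macro meanings and the t: and why: keys come from this article.

```python
# Legend for the five idiom macros, as listed above.
MACROS = {
    "亡羊": "tech-debt / known-defect",
    "破竹": "major breakthrough",
    "金蝉": "migration / refactor",
    "定鼎": "final architecture decision",
    "一石": "single-action-multiple-wins",
}


def parse_hybrid(record: str) -> dict:
    """Split a '|'-separated record into expanded macro tags and key:value fields."""
    parsed = {"tags": []}
    for field in record.split("|"):
        field = field.strip()
        if field in MACROS:
            parsed["tags"].append(MACROS[field])
        elif ":" in field:
            key, value = field.split(":", 1)
            parsed[key] = value.split(",") if "," in value else value
    return parsed


# Hypothetical record in the v2 style.
print(parse_hybrid("金蝉|sys:auth|t:Q1.26|why:cwrites,json"))
```

The explicit t: and why: keys are what let both a regex and a model treat the deadline and the reasons as data rather than decoration.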

Test three: does it survive a Western model?

I pulled llama3.1:8b, a small, general-purpose Meta model with much thinner CJK coverage than Qwen. This was the test the upstream maintainer had implicitly predicted Wenjian would fail when he closed the issue.

Format	Llama3.1:8b
English	15/15 (100%)
AAAK	13/15 (87%)
Wenjian	13/15 (87%)
Hybrid v2	14/15 (93%)

Three findings worth pulling out:

Wenjian didn't collapse. It dropped from 100% on Qwen to 87% on Llama, landing exactly where AAAK already was. The upstream maintainer's concern was directionally right but overstated — even an 8B Western-trained model extracts most of Wenjian's content correctly.

Hybrid v2 was the top compressed format. At 93%, it beat both Wenjian and AAAK on Llama. The design bet — "keep Latin keys for everything except the five macros" — paid off. The macros are common enough in LLM training data to survive anywhere; the Latin keys keep the rest tokenizer-stable.

Direction arrows broke Llama across every format. > and -> got inverted multiple times. pg>mysql was read as "mysql is preferred", and jenkins->gh_actions as "Jenkins is recommended". That's a format-neutral finding worth fixing in any compression scheme: textual from:X|to:Y is worth the extra tokens.
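The arrow fix is mechanical enough to apply as a post-processing step. The Python sketch below rewrites X->Y into the textual from:X|to:Y form. It deliberately leaves a bare > alone, since in pg>mysql it expresses preference rather than migration and needs a different label; the function name and regex are illustrative.

```python
import re

# Matches identifier->identifier, where identifiers may contain letters, digits, '_' and '.'.
ARROW = re.compile(r"([\w.]+)->([\w.]+)")


def spell_out_arrows(record: str) -> str:
    """Replace each X->Y with an explicit from:X|to:Y pair."""
    return ARROW.sub(r"from:\1|to:\2", record)


print(spell_out_arrows("DECISION:auth.migrate:auth0->clerk"))
# DECISION:auth.migrate:from:auth0|to:clerk
```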

Combined cross-model ranking, 45 questions each:

Format	Tokens vs English	Retrieval	Behaviour
English	0%	100%	reference
Wenjian	−24%	96%	peaks on Chinese-strong, drops to AAAK-parity elsewhere
Hybrid v2	−24%	96%	more uniform across model families
AAAK	−13%	93%	solid but less compressed
Hybrid v1	−28%	87%	too aggressive, dominated by v2
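The combined retrieval column can be reproduced from the per-model scores reported in this article. The Python sketch below does that for the four formats whose per-model counts are all stated; Hybrid v1 is left out because its Llama score is not broken out separately.

```python
# Correct answers out of 15, in the order (qwen3.5:27b, qwen3:32b, llama3.1:8b).
SCORES = {
    "English": (15, 15, 15),
    "Wenjian": (15, 15, 13),
    "Hybrid v2": (14, 15, 14),
    "AAAK": (14, 15, 13),
}


def combined_percent(per_model, questions_per_model: int = 15) -> int:
    """Pooled accuracy across models, rounded to a whole percent."""
    return round(100 * sum(per_model) / (questions_per_model * len(per_model)))


for name, per_model in SCORES.items():
    print(name, combined_percent(per_model))
# English 100, Wenjian 96, Hybrid v2 96, AAAK 93
```

With only 45 questions per format, one answer is worth about 2.2 points, so the gap between 96% and 93% is a single question.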

The methodology surprise

The most useful single finding from this whole exercise wasn't about Classical Chinese at all. It was about how to evaluate a format in the first place.

The weakest model was the most informative.

qwen3:32b scored 4 of 5 formats at 100%. Ceiling effect. Almost no signal about which format is actually more robust.
qwen3.5:27b — somewhat fewer parameters, newer training — separated hybrid v1 from the pack but still saturated Wenjian and English.
llama3.1:8b was the only model that produced different failure modes per format, surfaced the direction-arrow bug, and cleanly separated hybrid v2 from the others.

If I had only run this on Claude Opus or GPT-5, I'd have concluded that all four compressed formats were equivalent. I'd have shipped the wrong one. The frontier models succeed despite the format, not because of it. Their format-robustness is invisible from their top-line score.

There's a sub-finding inside this that's worth calling out separately. Within the Qwen family, the newer model (3.5) never beat the older model (3.0) on a compressed format and lost on three of the four: 93% vs 100% on AAAK and Hybrid v2, 80% vs 93% on v1, with only Wenjian tied at 100%. Both Q4_K_M, so quantization is constant. Two plausible reads: (a) five billion fewer parameters hurt literal-parsing capability more than a generation of training gains it, or (b) newer RLHF tunes models away from shorthand literalism. The 3.5 misses were mostly "the record does not specify…", the model hedging instead of committing to what's there.

Either way: do not assume the newest model in a family is the best format-reader. Test your actual deployment target.

What this means for you, practically

If you're building anything that persists memory across LLM calls — an agent, a copilot, a long-lived assistant, a RAG pipeline that stuffs retrieved docs into a context window — these numbers have a direct read.

For a fixed context window, compressing memory into a denser dialect stores roughly a third more facts in the same tokens (1 / 0.76 ≈ 1.32). Equivalently, if you're paying per token over an API, that's a ~24% input-cost reduction at ~96% retrieval fidelity on mid-sized open models. A little over five months of project context now fit where four did before.
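The capacity arithmetic is worth spelling out, because a 24% token saving and the amount of extra material that fits are not the same percentage. A minimal Python sketch, with illustrative function names:

```python
def capacity_gain(token_reduction: float) -> float:
    """How many times more content fits in a fixed window after compression."""
    return 1 / (1 - token_reduction)


def months_that_fit(months_before: float, token_reduction: float) -> float:
    """Months of context that fit where `months_before` fit uncompressed."""
    return months_before * capacity_gain(token_reduction)


print(round(capacity_gain(0.24), 2))       # 1.32
print(round(months_that_fit(4, 0.24), 1))  # 5.3
```

The input-cost saving, by contrast, is just the token reduction itself: 24% fewer tokens is 24% less spend on the memory portion of each call.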

The practical picks:

If your serving model is Chinese-strong (Qwen, DeepSeek, Yi, any Chinese-tuned Claude or GPT deployment), use pure Wenjian. It peaks there.
If you serve a mix — or you don't know what model the user picks — use the Hybrid v2 format. More uniform across model families, same compression, one miss per 15 on weak Western models.
Either way, replace direction arrows with textual labels. That's a universal improvement; it costs a few tokens and prevents a whole class of Llama-style inversions.

And the deeper lesson, applicable far beyond this one experiment: if you're comparing prompt formats, tool-call schemas, structured-output styles, or domain DSLs — evaluate them on small or mid-sized open models. Not on the flagship. The flagship's ceiling effect will hide the failures that show up in production on cheaper inference.

So — learn Classical Chinese?

Literally? No. You don't need to read 文言文 yourself. The point is that a language whose grammar stabilized two millennia ago, which removed every grammatical redundancy human writers could find to remove, and which modern LLMs were trained on because it's part of humanity's literary record anyway — that language is already sitting in your model, unused, ready to compress your memory by a quarter.

You don't have to learn Classical Chinese.

Your agents should be writing in it.

Full benchmark code, results, and raw data: github.com/.../MemChinese (upstream) — see the README for the unvarnished numbers and next steps.

Meta Unveils Muse Spark: First Model From Superintelligence Labs

ai.rs — Wed, 08 Apr 2026 12:00:00 +0200

Meta on April 8 introduced Muse Spark, the first model out of its newly reorganized Meta Superintelligence Labs (MSL) — and the company is calling it "the first step on our scaling ladder and the first product of a ground-up overhaul of our AI efforts."

Spark is a multimodal reasoning model with tool-use, visual chain-of-thought, and a parallel multi-agent setup Meta is branding Contemplating mode. It is live today on meta.ai and inside the Meta AI app, with a private API preview rolling out to select developers.

What's actually new

Three things stand out from the announcement:

Multimodal-first reasoning. Spark is positioned as Meta's first model where perception, reasoning, and tool-use share the same loop — visual STEM Q&A, entity recognition, and even health-domain analysis (nutrition, exercise physiology) are part of the headline capabilities, not bolt-ons.
Visual chain-of-thought. Rather than only emitting text tokens during reasoning, Spark can ground intermediate steps in the image itself — closer to how humans point at things while thinking out loud.
Contemplating mode. A parallel multi-agent orchestration layer where multiple reasoning instances work the same problem and converge on an answer. It is the mode Meta cites for its highest benchmark scores, and it is rolling out gradually rather than being on by default.

Benchmarks Meta is leading with

Benchmark	Score (Contemplating)
Humanity's Last Exam	58%
FrontierScience Research	38%

These are headline numbers from Meta's own post — independent reproductions will follow. For context, Humanity's Last Exam is one of the harder generalist evals in circulation, and 58% places Spark in the same conversation as the current frontier rather than a tier below.

The efficiency claim

The number that may matter more long-term is buried further down: Spark is described as more than an order of magnitude more compute-efficient than Llama 4 Maverick, its predecessor, with log-linear scaling improvements from reinforcement learning. If that holds, MSL has not just shipped a new model — it has shifted the cost curve for the next generation of Meta models.

It also re-frames what Hyperion, Meta's in-progress data center buildout, is for. Meta explicitly ties Spark to that infrastructure as the runway toward what it now openly calls "personal superintelligence."

Availability

Live: meta.ai web and the Meta AI app, default mode
Private preview: API access for select users
Contemplating mode: rolling out gradually — not enabled for everyone on day one

There is no open-weights release announced. That is a notable shift from the Llama posture — Meta is keeping Spark behind its own surfaces, at least for now.

Why it matters

Two angles are worth watching:

For developers, the API preview is the thing to track. If Spark is meaningfully cheaper-per-token than frontier rivals while clearing hard reasoning evals, it changes the build-vs-buy math for agentic products.
For the lab race, this is MSL's introduction. The branding ("Superintelligence Labs", "scaling ladder", "personal superintelligence") makes it explicit that Meta is no longer pitching itself as the open-source alternative — it is competing for the frontier, on the frontier's terms.

The full announcement is on Meta's AI blog.

Claude Mythos Preview: Why Anthropic Locked Its Best Security Model Behind a Wall

ai.rs — Wed, 08 Apr 2026 11:06:37 +0200

On April 7, Anthropic announced Claude Mythos Preview alongside Project Glasswing — a frontier AI model purpose-built to find and exploit software vulnerabilities, paired with a partner program that decides who gets to use it.

Mythos is not on the API price list. It is not on a waitlist page. It is not coming to Claude.ai next week. If you are reading this and you do not work for AWS, Apple, Cisco, Google, Microsoft, or one of about 50 other vetted organizations, you cannot have it. That is not an oversight. That is the entire point of how Anthropic shipped this model.

Here is what Mythos actually does, who is in Glasswing, and why the access wall exists.

What Mythos Found

Anthropic led the announcement with two findings that are difficult to dismiss as benchmark theater.

A 27-year-old vulnerability in OpenBSD that allowed remote crashes. OpenBSD is the operating system whose entire brand identity is built on aggressive code review and proactive auditing. A bug that survived 27 years inside the OpenBSD codebase is, by definition, a bug that human reviewers were never going to find on their own.

A 16-year-old flaw in FFmpeg that automated coverage-guided fuzzers had executed the surrounding code path more than 5 million times without triggering. This is the more technically interesting finding. Modern fuzzing is supposed to be the gold standard for catching memory corruption in C codebases. 5 million hits with no crash means the bug is reachable but only under specific semantic conditions — exactly the kind of "needs to actually understand the code" gap that LLMs are theoretically good at closing.

Anthropic also reported multiple Linux kernel privilege-escalation vulnerabilities and claims "thousands of high-severity vulnerabilities" in total across operating systems, browsers, and foundational libraries.

The One Number That Matters

Benchmark	Mythos Preview	Opus 4.6
CyberGym (vulnerability reproduction)	83.1%	66.6%

CyberGym measures whether a model can take a vulnerability description and actually reproduce a working exploit against the real target codebase. It is not multiple choice. It is not pattern matching against CVE databases. It is "build the thing that triggers the bug."

Going from 67% to 83% on a benchmark like that is not an incremental improvement. It is the difference between a useful research assistant and an autonomous agent you can leave running against a codebase overnight and trust to come back with reproductions instead of false positives.

Anthropic explicitly says Mythos "performs autonomously without human steering in many cases." That phrasing matters. Most AI security tooling today still requires a researcher in the loop to triage and verify. Mythos, in the cases where it works, does not.

Who Is In Glasswing

Project Glasswing launched with 12 founding partners:

Cloud and infrastructure: AWS, Google, Microsoft
Hardware and operating systems: Apple, Cisco
Plus seven others spanning major technology vendors and security organizations

Beyond the founding 12, Anthropic added 40+ more organizations focused on critical infrastructure protection and open-source maintenance. The selection criteria, as described in the announcement:

You maintain code that other people depend on at scale (operating systems, browsers, kernels, foundational libraries)
You operate critical infrastructure (cloud platforms, networking, finance)
You are an open-source security organization with a track record

Notably absent from the public list: penetration testing firms, bug bounty platforms, and anyone whose business model is selling vulnerability research to third parties. That is a deliberate choice, and we will get to why.

Why It Is Gated

There are three reasons Mythos is not generally available, and they reinforce each other.

1. The dual-use problem is unavoidable

A model that can autonomously find a 27-year-old bug in OpenBSD is also a model that can autonomously find unknown bugs in your production stack. The capability does not care about the operator's intent.

Anthropic could have published Mythos behind a standard "acceptable use policy" click-through, the way every other AI lab handles dual-use risk. They chose not to. The math is brutal: if even a small fraction of paying API customers used Mythos to find zero-days for sale, the result would be a measurable spike in real-world exploitation against the same critical infrastructure Anthropic is trying to protect.

Gating by partnership is an admission that policy alone is insufficient when the capability gap is this large.

2. Pricing as soft access control

When Mythos eventually does reach general availability, it will cost $25 per million input tokens and $125 per million output tokens. For comparison, Claude Opus 4.6 sits at roughly $15 input and $75 output per million tokens, so Mythos costs approximately 1.7x as much as the most capable general-purpose Claude model on both input and output.

That premium is doing two things at once.

First, it reflects real cost: Mythos is almost certainly larger than Opus, almost certainly does more internal reasoning per token, and almost certainly was more expensive to train. You do not get autonomous CyberGym performance for free.

Second, and more importantly, the price is a soft access control mechanism. At $125 per million output tokens, you do not casually point Mythos at every public GitHub repository to see what it finds. The economics make opportunistic mass-scanning prohibitively expensive while keeping targeted defensive use affordable for organizations that have a specific codebase to harden.

This is the same logic that keeps satellite imagery affordable for journalists but expensive for stalkers. Pricing is not just revenue. It is a filter.
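To make the filter concrete, here is a rough back-of-the-envelope sketch using the prices quoted above. The token counts per repository and the size of the mass scan are illustrative assumptions, not figures from Anthropic.

```python
# Prices quoted in this article, in USD per million tokens.
MYTHOS = {"input": 25.0, "output": 125.0}
OPUS_4_6 = {"input": 15.0, "output": 75.0}


def run_cost(prices, input_tokens, output_tokens):
    """Return the USD cost of one run at the given per-million-token prices."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


# Hypothetical workload: 2M input and 0.5M output tokens to audit one repository.
PER_REPO_INPUT = 2_000_000
PER_REPO_OUTPUT = 500_000

one_repo = run_cost(MYTHOS, PER_REPO_INPUT, PER_REPO_OUTPUT)
mass_scan = one_repo * 100_000  # hypothetical opportunistic scan of 100,000 repositories
output_premium = MYTHOS["output"] / OPUS_4_6["output"]

print(f"one targeted audit: ${one_repo:,.2f}")
print(f"100,000-repo scan:  ${mass_scan:,.0f}")
print(f"output premium over Opus 4.6: {output_premium:.2f}x")
```

Under these assumptions a single targeted audit stays in the low hundreds of dollars, while the opportunistic scan runs into eight figures. The exact numbers depend entirely on the assumed token counts; the gap between the two is the point.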

3. The subsidy structure tilts the balance toward defenders

Anthropic committed $100 million in usage credits to Glasswing partners and donated $4 million to open-source security organizations. Read those numbers in context: defenders are getting subsidized to use Mythos at zero or near-zero marginal cost, while everyone else faces full price plus access restrictions.

That is a deliberate asymmetry. Anthropic is paying to put Mythos in the hands of the people who maintain the code, before it is available to anyone who might want to exploit it. The window between "defenders can use this" and "attackers can buy this" is the entire game, and Anthropic is spending $100 million to widen it.

Whether that strategy actually works depends on how long the window stays open. If a competing lab ships an equivalent capability without the access controls, the asymmetry collapses overnight. If Anthropic stays ahead on this specific capability for even six months, defenders get a real head start on hardening the most-used software on the planet.

When You Will Get Access

The official answer is "after we develop appropriate safeguards with an upcoming Claude Opus model." The unofficial reading: months, not weeks, and tied to a future release rather than a fixed date.

Realistically, Mythos in its current form is unlikely to be sold directly to the open API market. What seems more probable is that the techniques pioneered for Mythos — the training data, the autonomous-loop scaffolding, the safety filters — will be folded into a future general-purpose Opus release in a more constrained form. You will get some of the capability, with guardrails that prevent the most concerning use cases.

If you want the unconstrained version, your path is Glasswing membership. The application process is not public, but the criteria are: maintain critical software, demonstrate operational security, commit to responsible disclosure.

What To Actually Do

If you maintain critical infrastructure or foundational open-source software: investigate Glasswing. The 40+ non-founding partners suggest the program is actively expanding, and the subsidized usage credits are the cheapest security audit you will ever get.

If you build products on the Claude API: nothing changes today. Opus 4.6 and Sonnet 4.6 remain your daily drivers. But the existence of Mythos is a clear signal that the gap between "the best model Anthropic has trained" and "the best model Anthropic will sell you" is widening — and for the first time, Anthropic is being transparent about that gap rather than pretending it does not exist.

If you run a security team at a normal company: wait. The Mythos-derived safeguards in the next Opus release will likely cover the use cases you care about (code review, vulnerability triage, secure-coding assistance) without the access friction. Spending engineering time on Glasswing applications when you do not maintain a kernel is probably not the best use of the quarter.

The Bigger Signal

Set aside the specific capability for a moment. The more important thing about Mythos is that Anthropic chose to ship a frontier model with deliberate access controls, full stop. Every previous Claude release has been framed as "as broadly available as we can make it." Mythos is the first time Anthropic has publicly drawn a line and said: this one is too dangerous to sell to everyone, and we are going to gate it on who you are rather than what you promise.

That precedent matters more than the OpenBSD bug. If Mythos works the way Anthropic claims, expect more specialized frontier models with similar access structures — for biotech, for finance, for any domain where the dual-use math gets uncomfortable. The era of "one model, one API, one price list" is not over, but it is no longer the only shape an AI lab can take.

For now, Mythos exists, it is genuinely impressive, and you cannot have it. That is the story.

Gemma 4 LoRA Fine-Tuning on RTX 5090: What Works and What Doesn't

ai.rs — Sun, 05 Apr 2026 10:00:00 +0200

Google released Gemma 4 on April 1, 2026 — a family of models including the 26B-A4B Mixture of Experts variant that activates only 3.8B of its 25.2B parameters per token. Apache 2.0 licensed, 256K context, 140+ languages, native vision support. On paper, it's a direct competitor to Qwen 3.5's MoE lineup.

We spent two days trying to QLoRA fine-tune the MoE variant on an RTX 5090 (32 GB VRAM). It doesn't work — yet. Not because of a bug, but because of an architectural decision that the tooling ecosystem hasn't caught up with. Important caveat: the dense Gemma 4 models (E2B, E4B, 31B) fine-tune just fine with standard QLoRA. This article is specifically about the MoE 26B-A4B variant.

Gemma 4 vs Qwen 3.5: The Specs

Both models use Mixture of Experts to deliver big-model knowledge at small-model speed. Here's how they compare:

Spec	Gemma 4 26B-A4B	Qwen 3.5 35B-A3B
Total Parameters	25.2B	35B
Active Parameters	3.8B	~3B
Experts	128 + 1 shared, 8 active	256 + 1 shared, 8 routed
Layers	30	40
Native Context	256K	262K (up to 1M with RoPE scaling)
Modalities	Text + Image	Text + Image + Video
Languages	140+	201
License	Apache 2.0	Apache 2.0

Benchmarks

Benchmark	Gemma 4 26B-A4B	Qwen 3.5 35B-A3B	Winner
MMLU-Pro	82.6	85.3	Qwen 3.5 (+2.7)
GPQA Diamond	82.3	84.2	Qwen 3.5 (+1.9)
LiveCodeBench v6	77.1	74.6	Gemma 4 (+2.5)
Codeforces ELO	1718	2028	Qwen 3.5 (+310)
MATH-Vision	82.4	83.9	Qwen 3.5 (+1.5)
MMMU Pro (Vision)	73.8	75.1	Qwen 3.5 (+1.3)

Qwen 3.5 leads across most reasoning and knowledge benchmarks. Gemma 4 has a slight edge on LiveCodeBench v6, but loses decisively on competitive programming (Codeforces). For most practical use cases — customer support, content generation, product recommendations — Qwen 3.5 is the stronger model.

The 3D Tensor Problem

Here's where things fall apart for local fine-tuning.

QLoRA (Quantized Low-Rank Adaptation) works by loading the base model in 4-bit precision and training small adapter layers on top. This is the standard approach for fine-tuning large models on consumer GPUs. With Qwen 3.5 35B-A3B, it works perfectly — we validated this both on an RTX 5090 locally and on an NVIDIA B200 (178 GB VRAM), where Unsloth loads the model at ~17.5 GB in 4-bit with plenty of room for training.
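To give a sense of how small those adapter layers are, here is a sketch of the parameter arithmetic. LoRA learns two low-rank factors for a weight matrix of shape [out, in], so it trains rank × (in + out) values per adapted layer. The layer size and rank below are illustrative assumptions, not our training configuration.

```python
def lora_params(in_features, out_features, rank):
    """Trainable parameters LoRA adds to one linear layer: A is [rank, in], B is [out, rank]."""
    return rank * (in_features + out_features)


def full_params(in_features, out_features):
    """Parameters in the frozen base weight matrix (bias ignored)."""
    return in_features * out_features


# Hypothetical layer and rank, chosen only for illustration.
IN_FEATURES, OUT_FEATURES, RANK = 2048, 1024, 16

adapter = lora_params(IN_FEATURES, OUT_FEATURES, RANK)
base = full_params(IN_FEATURES, OUT_FEATURES)
print(f"base weights:    {base:,}")
print(f"adapter weights: {adapter:,} ({adapter / base:.2%} of the base layer)")
```

The adapters are a few percent of the layer they wrap, which is why the frozen base weights dominate memory and why quantizing them matters so much.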

Gemma 4 breaks this workflow because of how it stores expert weights.

Qwen 3.5 stores each expert as separate 2D linear layers — standard nn.Linear modules that bitsandbytes knows how to quantize:

# Qwen 3.5: separate 2D tensors per expert — bnb quantizes these fine
model.layers.{i}.mlp.experts.{j}.gate_proj: [1024, 2048]  ← nn.Linear ✓
model.layers.{i}.mlp.experts.{j}.up_proj:   [1024, 2048]  ← nn.Linear ✓
model.layers.{i}.mlp.experts.{j}.down_proj:  [2048, 512]  ← nn.Linear ✓

Gemma 4 fuses all 128 experts into single 3D tensors:

# Gemma 4: fused 3D tensors — bnb CANNOT quantize these
model.layers.{i}.experts.gate_up_proj: [128, 1408, 2816]  ← 3D tensor ✗
model.layers.{i}.experts.down_proj:    [128, 2816, 1408]  ← 3D tensor ✗

bitsandbytes only quantizes 2D nn.Linear layers. It ignores everything else. The result:

Component	Size	Quantized?
3D expert tensors (30 layers)	42.5 GB (bf16)	No
2D layers (attention, embeddings)	4.5 GB → 1.1 GB (4-bit)	Yes
Total with "4-bit" loading	~43.7 GB

The "4-bit" model is actually 43.7 GB because 90% of the weights can't be quantized. That's 12 GB over our RTX 5090's budget — before we even account for training overhead.
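The arithmetic behind that table can be sketched as follows. The component sizes are the ones listed above; the flat 4x shrink from bf16 to 4-bit is a simplifying assumption that ignores quantization metadata, so the result lands within about 0.1 GB of our estimate rather than matching it exactly.

```python
VRAM_BUDGET_GB = 32.0  # RTX 5090

# (name, size in bf16 GB, whether bitsandbytes can quantize it)
COMPONENTS = [
    ("3D expert tensors (30 layers)", 42.5, False),
    ("2D layers (attention, embeddings)", 4.5, True),
]


def loaded_size_gb(components, shrink=4.0):
    """Estimate the footprint after "4-bit" loading: only quantizable parts shrink."""
    return sum(size / shrink if quantizable else size for _, size, quantizable in components)


total_bf16 = sum(size for _, size, _ in COMPONENTS)
stuck_in_bf16 = sum(size for _, size, q in COMPONENTS if not q)
loaded = loaded_size_gb(COMPONENTS)

print(f"unquantizable share: {stuck_in_bf16 / total_bf16:.0%}")
print(f"estimated footprint: {loaded:.1f} GB")
print(f"over budget by:      {loaded - VRAM_BUDGET_GB:.1f} GB")
```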

What We Tried

Five different loading strategies, all dead ends:

Gemma4ForCausalLM from multimodal checkpoint — Key name mismatch. The checkpoint stores text weights as model.language_model.* but the text-only class expects model.*. All weights loaded as "unexpected", fresh initialization OOM'd.
Gemma4ForConditionalGeneration with single GPU — OOM at 37% loading. The full multimodal model is ~48 GB in bf16.
Gemma4ForConditionalGeneration with CPU offloading — bitsandbytes 4-bit mode rejects any CPU offloading. Non-starter.
Extracted text-only weights — We wrote a script to extract and remap the 657 text-only keys. Loading works, but the 3D tensor problem remains: 43.7 GB estimated, still OOM.
Various monkey-patches — caching_allocator_warmup bypass, Params4bit compatibility fix. These solved earlier errors but can't fix the fundamental 3D tensor issue.
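The key remapping in the fourth attempt can be sketched in a few lines. This is a minimal illustration, not the script we ran: the function name and the toy checkpoint are hypothetical, and the only detail taken from our run is the model.language_model.* to model.* prefix change.

```python
SRC_PREFIX = "model.language_model."  # prefix used by the multimodal checkpoint
DST_PREFIX = "model."                 # prefix the text-only class expects


def extract_text_weights(state_dict):
    """Keep only text-model entries and rename them for the text-only class."""
    remapped = {}
    for key, tensor in state_dict.items():
        if key.startswith(SRC_PREFIX):
            remapped[DST_PREFIX + key[len(SRC_PREFIX):]] = tensor
    return remapped


# Toy stand-in for a real checkpoint; values would normally be tensors,
# and the key names below are made up for the example.
toy_checkpoint = {
    "model.language_model.layers.0.self_attn.q_proj.weight": "tensor-a",
    "model.language_model.embed_tokens.weight": "tensor-b",
    "model.vision_tower.patch_embed.weight": "tensor-c",  # non-text key, dropped
}

text_only = extract_text_weights(toy_checkpoint)
for name in sorted(text_only):
    print(name)
```

Remapping fixes the "unexpected keys" failure from the first attempt, but as noted above it does nothing about the size of the expert tensors themselves.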

The Ecosystem Gap

Gemma 4 dropped on April 1, 2026 — it's brand new. The quantization ecosystem hasn't adapted yet:

Format	Available?	Fine-tuning?
GGUF (Q4_K_M, ~17 GB)	Yes	Inference only
AWQ 4-bit	Yes	Inference only
GPTQ	Not yet	—
Unsloth bnb-4bit	Skipped for MoE variant	—

Unsloth, which has custom MoE quantization that handles Qwen 3.5's fused tensors, deliberately skipped the Gemma 4 26B-A4B for their bnb-4bit releases. They published quantized versions for the dense Gemma 4 models (E2B, E4B, 31B) but not the MoE one. That suggests this isn't just a "we haven't gotten to it" situation: the 3D tensor layout is genuinely harder to handle.

What to Use Instead

If you're choosing a model for local QLoRA fine-tuning on consumer hardware (24-32 GB VRAM), here's the practical decision:

For Fine-Tuning (QLoRA)

Model	4-bit Size	Fits 32 GB?	Status
Qwen 3.5-35B-A3B (via Unsloth)	~17.5 GB	Yes	Working
Qwen 3.5-27B dense	~14 GB	Yes	Working
Qwen 3.5-9B dense	~5 GB	Yes, comfortably	Working
Gemma 4 31B dense	~18-20 GB	Tight but feasible	Working
Gemma 4 26B-A4B (MoE)	~43.7 GB	No	Blocked

For Inference Only

Gemma 4 26B-A4B works fine for inference via GGUF (Ollama, llama.cpp) at Q4_K_M (~17 GB). If you just need to run the model — not train it — it's a solid option.

What About Cloud GPUs?

On an NVIDIA B200 (178 GB VRAM), the picture changes completely. Gemma 4's text-only model is ~47 GB in bf16 — you can skip quantization entirely and train with standard LoRA (not QLoRA). No 3D tensor problem, no bitsandbytes dependency. Load in bf16, attach LoRA adapters, train.

We already validated this workflow for Qwen 3.5 35B-A3B on a B200 via Unsloth, where it loads at ~17.5 GB in 4-bit and trains comfortably. Gemma 4 in bf16 at ~47 GB would also fit with ~130 GB to spare for optimizer states, gradients, and large batch sizes.

The trade-off is cost. A cloud B200 instance runs ~$3-5/hour. For a quick LoRA fine-tune (a few hundred steps), that's $5-15. For serious training runs, it adds up. The appeal of consumer GPU training is that it's free after the hardware purchase.

Why MoE Models Are Harder to Fine-Tune

The 3D tensor issue is just the most visible problem. MoE architectures create several fine-tuning headaches that dense models don't have:

Expert routing instability. During fine-tuning, the router learns which experts to activate for which tokens. Small datasets can destabilize this routing — a few hundred patent-writing examples might cause the router to over-rely on 2-3 experts while the other 125 go dormant. Dense models don't have this problem because every parameter participates in every forward pass.

Load balancing. MoE models are trained with auxiliary losses that encourage balanced expert utilization. Fine-tuning with LoRA typically freezes the router weights, which helps stability but means you can't adapt the routing to your domain. If your use case (say, patent writing) doesn't naturally distribute across many experts, you're leaving capacity on the table.

Memory unpredictability. Even when quantization works, MoE memory usage is harder to predict. All expert weights must be resident in VRAM even though only 8 of 128 fire per token. Gradient checkpointing interacts differently with MoE layers. Batch size effects are less intuitive because the set of active experts, though fixed in size, changes from token to token.

Tooling maturity. The PyTorch ecosystem — bitsandbytes, PEFT, DeepSpeed, FSDP — was built for dense transformers. MoE support is bolted on and varies wildly by implementation. Qwen's 2D expert layout works because it looks like standard linear layers. Gemma's 3D fused layout is more efficient but breaks assumptions baked into every tool in the chain.

None of this means MoE models can't be fine-tuned. It means the gap between "works in a paper" and "works on your GPU" is wider than with dense models. For most practitioners doing domain-specific fine-tuning — patent writing, customer support, product descriptions — a dense model at the same active parameter count will be easier to train and more predictable to debug.
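One way to watch for the routing concentration described above is to count how often each expert is selected on a batch of your fine-tuning data. The sketch below simulates top-8-of-128 routing with random logits standing in for a real router. The logit bonus that skews selection toward three experts is an assumption for illustration, not a measurement from Gemma 4.

```python
import random
from collections import Counter

NUM_EXPERTS, TOP_K = 128, 8  # routing shape of Gemma 4 26B-A4B


def top_k_experts(logits, k=TOP_K):
    """Return the indices of the k highest router logits for one token."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def expert_usage(num_tokens, favored=(), bias=0.0, seed=0):
    """Count expert selections over simulated tokens; `favored` experts get a logit bonus."""
    rng = random.Random(seed)
    usage = Counter()
    for _ in range(num_tokens):
        logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
        for i in favored:
            logits[i] += bias
        usage.update(top_k_experts(logits))
    return usage


balanced = expert_usage(2000)
skewed = expert_usage(2000, favored=(0, 1, 2), bias=5.0)

for name, usage in (("balanced", balanced), ("skewed", skewed)):
    top3_share = sum(count for _, count in usage.most_common(3)) / sum(usage.values())
    print(f"{name}: top-3 experts take {top3_share:.1%} of all routing slots")
```

With a real model you would replace the simulated logits with the router outputs collected during a forward pass and track the same share before and after fine-tuning.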

The Bigger Picture

This episode highlights a real tension in the MoE design space. Fusing experts into 3D tensors is faster for inference (single batched matrix multiply instead of 128 separate calls) and Google's engineering team made a reasonable optimization choice. But it breaks the most popular fine-tuning workflow on consumer hardware.

Qwen's approach — separate 2D expert layers — is less optimal for raw inference throughput but plays nicely with the entire PyTorch/bitsandbytes/PEFT ecosystem. For the open-source community that wants to fine-tune models locally, that compatibility matters more than a few percent of inference speed.

The fix will come. Either bitsandbytes will add 3D tensor quantization, or Unsloth will build a custom path (they did it for Qwen's fused tensors), or Google will publish a checkpoint variant with separate expert weights. Until then, Qwen 3.5 35B-A3B is the MoE model to fine-tune locally — it has better benchmarks, a working training pipeline, and fits comfortably on an RTX 5090.

To be clear: Gemma 4 is not broken for fine-tuning. The dense models — Gemma 4 E2B, E4B, and 31B — all work with standard QLoRA via bitsandbytes or Unsloth. The 31B dense model at 4-bit (~18-20 GB) fits on an RTX 5090 and trains normally. It's only the MoE 26B-A4B that's blocked, and only on consumer GPUs where quantization is required.

What Will Fix This

The MoE fine-tuning gap is temporary. Here's what's likely to happen, roughly in order of probability:

Unsloth adds a custom Gemma 4 MoE path — Most likely and soonest. Unsloth already handles Qwen 3.5's fused MoE tensors with custom quantization. They have the architecture expertise and the motivation (Gemma 4 is a high-demand model). Timeline: weeks, not months.
bitsandbytes adds 3D tensor quantization — This would fix it for everyone, not just Unsloth users. The change is non-trivial (the NF4 quantization kernel assumes 2D weight matrices) but it's a known limitation. Timeline: 1-3 months.
Google releases an unfused checkpoint — Google could publish a variant with separate 2D expert weights instead of fused 3D tensors. This is the easiest fix from a tooling perspective but requires Google to act. Timeline: uncertain, depends on community pressure.

Our bet: Unsloth will have it working within weeks. If you need Gemma 4 MoE fine-tuning before then, use a B200 or similar cloud GPU where you can skip quantization entirely and train in bf16.

Tested on: RTX 5090 (32 GB), transformers 5.5.0, bitsandbytes 0.49.2, PEFT, April 2026.

Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader

ai.rs — Thu, 02 Apr 2026 12:00:00 +0200

One Month Later, Everything Changed

In early March, we published a head-to-head comparison of Llama 4, Qwen 3.5, and Gemma 3. The conclusion was clear: Gemma 3 finished last in every category except raw inference speed. Qwen 3.5 won math, coding, and multilingual. Llama 4 Scout won reasoning and context length. Gemma 3 was the also-ran.

That article is now outdated.

Google just released Gemma 4 — four model sizes, a new MoE architecture, multimodal audio support, thinking mode, and benchmark scores that make Gemma 3's numbers look like a different era. The jump isn't incremental. It's the largest single-generation improvement we've seen in the open model space.

The Gemma 4 Family

Four models, two architectures, spanning edge devices to full GPUs:

Model	Architecture	Total Params	Active Params	Context	Modalities
Gemma 4 E2B	Dense	5.1B	2.3B	128K	Text, Image, Audio, Video
Gemma 4 E4B	Dense	8B	4.5B	128K	Text, Image, Audio, Video
Gemma 4 26B-A4B	MoE (128 experts)	25.2B	3.8B	256K	Text, Image, Video
Gemma 4 31B	Dense	30.7B	30.7B	256K	Text, Image, Video

The naming convention: E prefix means edge-optimized, A means active parameters in the MoE variant. So "26B-A4B" = 26B total, 4B active per token.

The standout is the 26B-A4B. It uses 128 small experts with 8 active per token plus one shared always-on expert. This is a different design philosophy from Llama 4 Scout's 16 large experts — Google bet on many small experts rather than fewer large ones.

The Numbers: Gemma 3 vs Gemma 4

These comparisons use the same benchmarks, same evaluation conditions. The improvements are not subtle.

Reasoning & Knowledge

Benchmark	Gemma 3 27B	Gemma 4 31B	Gemma 4 26B-A4B	Change (31B)
MMLU Pro	67.6%	85.2%	82.6%	+17.6 pts
GPQA Diamond	42.4%	84.3%	82.3%	+41.9 pts
BigBench Extra Hard	19.3%	74.4%	64.8%	+55.1 pts
MMMLU (multilingual)	70.7%	88.4%	86.3%	+17.7 pts

GPQA Diamond — graduate-level reasoning — nearly doubled. BigBench Extra Hard went from 19% to 74%. These aren't incremental gains. Gemma 3 was struggling with hard reasoning; Gemma 4 handles it.

Mathematics

Benchmark	Gemma 3 27B	Gemma 4 31B	Gemma 4 26B-A4B	Change (31B)
AIME 2026	20.8%	89.2%	88.3%	+68.4 pts

From 20.8% to 89.2% on competition math. This is the single most dramatic benchmark improvement in the table. For context, in our March comparison, Qwen 3.5-27B scored 48.7% on AIME 2025 and was the math leader. Gemma 4 nearly doubles that.

The thinking mode — where the model reasons step-by-step before answering — is likely driving this. When Gemma 4 "thinks," it can produce 4,000+ tokens of reasoning before committing to an answer.

Coding

Benchmark	Gemma 3 27B	Gemma 4 31B	Gemma 4 26B-A4B	Change (31B)
LiveCodeBench v6	29.1%	80.0%	77.1%	+50.9 pts
Codeforces ELO	110	2150	1718	+2040 pts

Codeforces ELO went from 110 (barely functional) to 2150 (expert competitive programmer). LiveCodeBench nearly tripled. The coding gap between Gemma and the competition didn't just close — it reversed.

Vision

Benchmark	Gemma 3 27B	Gemma 4 31B	Gemma 4 26B-A4B
MMMU Pro	49.7%	76.9%	73.8%
MATH-Vision	46.0%	85.6%	82.4%

Vision understanding saw similar jumps. MATH-Vision — solving math problems from images — nearly doubled. The model now handles charts, diagrams, and handwritten equations significantly better.

Long Context

Benchmark	Gemma 3 27B	Gemma 4 31B	Gemma 4 26B-A4B
MRCR v2 (128K avg)	13.5%	66.4%	44.1%

Gemma 3's 128K context was mostly theoretical — it could accept long inputs but couldn't reliably use information from them. Gemma 4 at 256K context actually retrieves and reasons over long documents. The 31B model went from 13.5% to 66.4% on multi-needle retrieval tests.

The MoE Efficiency Story

The 26B-A4B deserves special attention. Look at these numbers again:

Benchmark	Gemma 4 31B (30.7B active)	Gemma 4 26B-A4B (3.8B active)
MMLU Pro	85.2%	82.6%
AIME 2026	89.2%	88.3%
LiveCodeBench v6	80.0%	77.1%
GPQA Diamond	84.3%	82.3%
LMArena Score	~1452	~1441

The MoE variant achieves 97% of the dense model's quality while activating only 3.8B parameters per token instead of 30.7B. That's 8x less compute per inference step.

For deployment, this means:

Much less VRAM needed for KV cache at long contexts
Faster inference — fewer parameters to compute per token
Lower cost per query in production

Google's choice of 128 small experts (vs Llama 4's 16 large experts) appears to work. The LMArena score of 1441 with only 4B active params is remarkable — it's competitive with models 8x its active size.
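The 97% and 8x figures follow from the table above. Here is a quick check using only the reported scores and parameter counts. A plain average of score ratios is just one way to summarize "quality", and the active-parameter ratio is used here as a rough proxy for compute per token.

```python
# (benchmark, Gemma 4 31B score, Gemma 4 26B-A4B score), as reported above
SCORES = [
    ("MMLU Pro", 85.2, 82.6),
    ("AIME 2026", 89.2, 88.3),
    ("LiveCodeBench v6", 80.0, 77.1),
    ("GPQA Diamond", 84.3, 82.3),
]
DENSE_ACTIVE_B, MOE_ACTIVE_B = 30.7, 3.8  # active parameters per token, in billions

ratios = [moe / dense for _, dense, moe in SCORES]
mean_ratio = sum(ratios) / len(ratios)
compute_ratio = DENSE_ACTIVE_B / MOE_ACTIVE_B

for (name, _, _), ratio in zip(SCORES, ratios):
    print(f"{name}: {ratio:.1%} of the dense score")
print(f"mean: {mean_ratio:.1%}, active-parameter ratio: {compute_ratio:.1f}x")
```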

How Gemma 4 Reshapes Our Comparison

Our March rankings put Qwen 3.5 first, Llama 4 second, Gemma 3 third. Here's how Gemma 4 changes each category:

Category	March Winner	Updated Assessment
General reasoning	Llama 4 Scout	Gemma 4 31B takes the lead (84.3% GPQA vs Scout's 74.3%)
Mathematics	Qwen 3.5-27B	Gemma 4 dominates (89.2% AIME, well ahead of Qwen's ~49%)
Coding	Qwen 3.5-27B	Gemma 4 dominates (80.0% LiveCodeBench vs Qwen's ~43%)
Multilingual	Qwen 3.5-27B	Likely still Qwen (250K vocab, 201 languages vs Gemma's 140)
Inference speed	Gemma 3 27B	TBD — need to benchmark Gemma 4 31B on same hardware
Context length	Llama 4 Scout (10M)	Still Llama 4 (10M vs 256K), but Gemma 4 actually uses its context
License	Qwen 3.5 (Apache 2.0)	Tie — Gemma 4 is now Apache 2.0 too
VRAM efficiency	Qwen 3.5-9B	Gemma 4 26B-A4B is the new efficiency king

Note on benchmark versions: Our March tests used AIME 2025, LiveCodeBench v5, and standard MMLU. Gemma 4's reported scores use AIME 2026, LiveCodeBench v6, and MMLU Pro. Direct numerical comparison across versions should be taken as directional, not exact. The Gemma 3 → Gemma 4 comparisons above use identical benchmark versions.

The Apache 2.0 Switch

Gemma 3 shipped with the "Gemma Open" license — commercial use allowed but with Google-specific terms and restrictions. In our March comparison, we flagged this as a disadvantage against Qwen 3.5's Apache 2.0.

Gemma 4 switches to Apache 2.0. No usage restrictions, no MAU limits, no acceptable use policies. The same license as Qwen 3.5.

This removes one of the last arguments against Gemma. For businesses building products on open models, the licensing playing field is now level between Gemma 4 and Qwen 3.5. Llama 4's community license (700M MAU limit + Meta's acceptable use policy) is now the most restrictive of the three families.

What's New Beyond Benchmarks

Thinking Mode

Gemma 4 supports extended reasoning — the model produces a chain-of-thought before answering, similar to DeepSeek-R1 or OpenAI o1. This is what drives the massive math and reasoning improvements. The thinking can run to 4,000+ tokens, giving the model space to break problems down, try approaches, and verify its work.

Multimodal Audio

The smaller models (E2B, E4B) support audio input — speech transcription and audio Q&A. The larger models (26B-A4B, 31B) handle image and video but not audio. This is an unusual split: the edge models are more multimodal than the flagship.

Native Function Calling

All models support structured function calling out of the box — returning JSON with tool calls without special prompting. Combined with the thinking mode, this makes Gemma 4 a strong candidate for agentic workflows where the model needs to reason about which tools to call and in what order.

Per-Layer Embeddings (PLE)

A novel architecture feature: a second embedding table feeds residual signals into every decoder layer, giving each layer a token-identity component tailored to that specific layer's role. This is a quiet innovation that likely contributes to the quality improvements across the board.

Shared KV Cache

The last several decoder layers share key-value tensors, reducing memory usage during long-context inference with minimal quality impact. Combined with the 256K context window, this makes Gemma 4 practical for long-document workflows where Gemma 3 was only theoretical.

Updated Decision Matrix

If you need...	Use	Why
Best overall quality (32 GB GPU)	Gemma 4 31B	Leads reasoning, math, coding, vision
Best quality per compute	Gemma 4 26B-A4B	97% of 31B quality at 8x less compute
Maximum context window	Llama 4 Scout	Still 10M tokens, unmatched
Best multilingual	Qwen 3.5-27B	250K vocab, 201 languages
Best under 10 GB VRAM	Gemma 4 E4B or Qwen 3.5-9B	Both strong; benchmark head-to-head needed
Edge / mobile deployment	Gemma 4 E2B	2.3B active, audio support, 128K context
Most permissive license	Gemma 4 or Qwen 3.5	Both Apache 2.0
Audio understanding	Gemma 4 E4B	Only family of the three with native audio input
Agentic workflows	Gemma 4 31B	Thinking mode + native function calling

What We Still Need to Test

We haven't run Gemma 4 on our RTX 5090 benchmark suite yet. Key unknowns:

Actual inference speed — the 31B dense model should be comparable to Gemma 3 27B in tok/s, but the MoE 26B-A4B is the interesting question. With 128 experts and 3.8B active params, it could be very fast
VRAM usage with quantization — Q6_K and Q4_K_M sizes for each variant
Real-world multilingual performance — Gemma claims 140 languages, but Qwen's 201-language, 250K-vocabulary advantage may still hold for CJK and non-Latin scripts
Thinking mode overhead — how much slower is inference when the model reasons for 4,000 tokens before answering?

We'll publish a full hands-on benchmark when we've run the tests. For now, Google's reported numbers are strong enough to change the recommendation.

The Bottom Line

A month ago we wrote: "Choose Gemma 3 when you need Google ecosystem integration or when marginal inference speed differences matter. It's a solid model, but it doesn't lead in any benchmark category."

That's no longer true. Gemma 4 leads in reasoning, math, coding, and vision. The 26B-A4B MoE variant offers the best quality-per-compute ratio in the open model space. The license is now Apache 2.0. The context window works.

The open model race just got a new leader. Qwen 3.5 still holds the multilingual crown, and Llama 4 Scout still has the unmatched 10M context window. But for overall quality, especially on hard reasoning and coding tasks, Gemma 4 is the model to beat.

The ball is now in Alibaba's and Meta's court.

This article is a follow-up to Llama 4 vs Qwen 3.5 vs Gemma 3: Which Open Model Should You Deploy?, published March 6, 2026.

Want to fine-tune Gemma 4 locally? Read Gemma 4 LoRA Fine-Tuning on RTX 5090: What Works and What Doesn't for our hands-on results.

100% ROI in 24 Hours: Nvidia B200 Replaced a $35,000 AI API Bill in a Single Day

ai.rs — Mon, 23 Mar 2026 11:00:00 +0100

The $35,000 Wake-Up Call

We needed to generate SEO descriptions for 858,000 e-commerce products. A straightforward task: take a product title, brand, and existing description in Serbian, then produce an English translation, a cleaned-up title, and a short SEO paragraph. Five fields per product, a few sentences each.

The first estimate from Anthropic's Claude API? $35,000. For text generation. For a task that a knowledgeable human could do in 30 seconds per product — but there are 858,000 of them.

This is the dirty secret of AI-as-a-Service: the per-token pricing model that looks cheap at demo scale becomes absurd at production scale. When your system prompt is 4,000 tokens and you're sending it 858,000 times, you're paying to process 3.4 billion tokens of instructions that never change. It's like paying a consultant's hourly rate to re-read their job description before every task.
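The arithmetic is worth spelling out. The token and product counts below are the ones from our job; the price per million input tokens is a placeholder assumption, so substitute your provider's current rate.

```python
SYSTEM_PROMPT_TOKENS = 4_000
PRODUCTS = 858_000
ASSUMED_USD_PER_MILLION_INPUT = 3.00  # hypothetical rate, for illustration only

repeated_tokens = SYSTEM_PROMPT_TOKENS * PRODUCTS
repeated_cost = repeated_tokens / 1_000_000 * ASSUMED_USD_PER_MILLION_INPUT

print(f"system prompt tokens sent: {repeated_tokens:,}")
print(f"cost of the repeated prompt alone: ${repeated_cost:,.0f}")
```

This counts only the unchanging instructions. Product data and generated output come on top of it.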

The Optimization Journey

What followed was a 48-hour deep dive into cost optimization that took us from $35,000 to $180 — a 194x reduction — while maintaining the same output quality. Along the way, we discovered that:

Anthropic's Batch API didn't cache our prompts. Prompt caching is advertised as a feature, but in our run the Batch API (which offers 50% off) appeared to process each request independently on different servers, with no cache hits. With a large system prompt, the "discount" actually cost 3.6x MORE than the standard API. We only discovered this by checking the dashboard after a 400-product test run.
The most expensive token is the one you don't need to send. Our system prompt contained 1,551 product categories for recategorization. Trimming to the top 626 categories (covering 95% of products) cut costs by 42%. The remaining 5% of products just kept their existing category.
Self-hosting a smaller model on a single GPU beats the best API pricing by 27x. Qwen3.5, an open-source model with 3 billion active parameters, produces Serbian text comparable to Claude Sonnet 4.5 — at a fraction of the cost. One Nvidia B200 GPU processes 36,000 products per hour — so all 858,000 finished in under 24 hours. The GPU paid for itself in a single day.
Parallelism is free when the GPU has headroom. Our B200 was using 22% of its memory with 200 concurrent requests. We went from 1 request at a time (310 products/hour) to 256 parallel workers (35,000/hour) — a 113x throughput increase with zero additional cost.
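The category-trimming step in the second point can be sketched as a coverage cutoff: sort categories by how many products use them and keep the shortest prefix that reaches the target share. The function name and the toy catalog are hypothetical; our real run kept 626 of 1,551 categories at 95% coverage.

```python
from collections import Counter


def categories_for_coverage(product_categories, target=0.95):
    """Return the most frequent categories that together cover `target` of products."""
    counts = Counter(product_categories)
    total = sum(counts.values())
    kept, covered = [], 0
    for category, count in counts.most_common():
        if covered >= target * total:
            break
        kept.append(category)
        covered += count
    return kept


# Toy catalog: a few large categories followed by a tail of rare ones.
toy = ["phones"] * 60 + ["laptops"] * 25 + ["cables"] * 10 + ["misc-a"] * 3 + ["misc-b"] * 2
kept = categories_for_coverage(toy)
print(kept)
```

Only the kept categories need to appear in the system prompt. Products in the dropped tail keep whatever category they already had.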

The Real Cost of AI APIs

The AI industry's pricing model is built for developers running demos and startups processing hundreds of requests. At enterprise scale — millions of products, documents, or records — the per-token model breaks down spectacularly.

Consider: our 858,000 products needed roughly 500 billion FLOPs of actual computation. A B200 GPU delivers 2,250 TFLOPS. The actual compute time is measured in seconds, not hours. Yet the API charges as if each request requires dedicated attention from a room full of H100s.

Self-hosting isn't free — there's the engineering time to set up vLLM, optimize prompts, debug deployments, and handle failures. But when the alternative is a $35,000 invoice for generating short product descriptions, the math is clear.
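Part of that engineering work is on the client side: keeping the GPU fed with many requests at once. The fixed pool of parallel workers described earlier can be sketched with asyncio and a semaphore. The `generate` coroutine is a hypothetical stand-in that only sleeps; in a real pipeline it would call your inference server.

```python
import asyncio

MAX_IN_FLIGHT = 256  # number of parallel workers from the run described above


async def generate(product_id):
    """Hypothetical stand-in for one inference request; replace with a real client call."""
    await asyncio.sleep(0.01)
    return f"description for product {product_id}"


async def process_all(product_ids, limit=MAX_IN_FLIGHT):
    """Run generate() for every product with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(product_id):
        async with semaphore:
            return await generate(product_id)

    return await asyncio.gather(*(bounded(pid) for pid in product_ids))


results = asyncio.run(process_all(range(1000)))
print(len(results), "products processed")
```

The semaphore caps concurrency without a fixed worker count, and `asyncio.gather` returns results in input order, which makes it easy to write them back against the product list. A production version would also need retries and error handling.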

What We Learned

The AI-as-a-Service model makes sense for prototyping, small-scale use, and tasks where quality justifies premium pricing. But for batch processing at scale — especially with a large, repeated system prompt — self-hosted inference on rented GPUs is the pragmatic choice. The open-source model ecosystem (Qwen, Llama, DeepSeek) has reached the quality threshold where, for many languages and tasks, API-exclusive models no longer justify their 27x price premium.

The irony? We used Claude (the expensive API) to develop and refine our prompts, evaluate quality, and establish the baseline. Then we deployed a free, open-source model to do the actual work. The API was the R&D cost; the GPU was the production cost. That division of labor — premium API for development, commodity GPU for execution — might be the real model for AI at scale.

AI Won't Replace Your Team — But a Team Using AI Will Replace Yours

ai.rs — Wed, 18 Mar 2026 10:00:00 +0100

The Replacement Myth

Every few months, a new headline claims AI will eliminate millions of jobs. The reality, backed by hard data from the Anthropic Economic Index, tells a different story:

57% of AI use in the workplace is augmentation — humans using AI to do their existing jobs better
Only 4% of businesses use AI deeply across their operations
30% of workers have zero AI exposure in their daily tasks

AI isn't replacing teams. It's creating a widening gap between teams that use it and teams that don't.

Augmentation vs. Automation

This distinction matters more than any other concept in this article.

Automation means AI does the task instead of a human. The human is removed from the loop. Think: a chatbot handling tier-1 support tickets without human review.

Augmentation means AI makes the human faster, more accurate, or more capable. The human stays in the loop but operates at a higher level. Think: a support agent using AI to draft responses, pull up relevant docs, and suggest solutions — then reviewing and sending.

The data shows businesses are overwhelmingly choosing augmentation. Why?

Factor	Automation	Augmentation
Risk	High (errors go unchecked)	Low (human reviews output)
Quality	Inconsistent at edges	Consistently high
Trust	Customers skeptical	Customers don't notice
Implementation	Complex (handle all edge cases)	Simple (handle common cases)
Cost	High upfront, low ongoing	Low upfront, moderate ongoing

Augmentation is easier to implement, lower risk, and often produces better results because humans catch the mistakes AI makes.

What AI-Augmented Teams Actually Look Like

Here's what changes when a team starts using AI effectively:

Customer Support

Before: Agent receives ticket, searches knowledge base manually, types response from scratch, asks senior colleague about edge cases.

After: Agent receives ticket, AI instantly pulls relevant docs and past solutions, drafts a response, agent reviews and personalizes, sends in 2 minutes instead of 8.

Result: Same team handles 3x the volume. Response quality improves because every agent has access to the collective knowledge that used to live only in senior team members' heads.

Sales

Before: Rep researches prospect manually, writes personalized email from template, follows up on gut feeling about timing.

After: AI summarizes prospect's company, recent news, and likely pain points. Drafts personalized outreach. Flags optimal follow-up timing based on engagement patterns.

Result: Each rep works leads that would have required a research assistant. Pipeline grows without hiring.

Content and Marketing

Before: Writer spends 3 hours researching, 2 hours writing first draft, 1 hour editing.

After: AI provides research summary and outline in minutes. Writer focuses on insight, voice, and editing. Total time: 2-3 hours for higher quality output.

Result: Same team produces 2x the content with more depth and originality — because humans spend time on the parts AI can't do well.

Operations

Before: Manager manually reviews reports, spots trends by intuition, creates weekly summaries for leadership.

After: AI analyzes data in real-time, surfaces anomalies, drafts reports. Manager focuses on decisions and strategy.

Result: Problems caught days earlier. Decisions backed by data instead of gut feeling.

The Productivity Multiplier

Studies consistently show AI augmentation delivers a 2-5x productivity multiplier depending on the task:

Task Type	Multiplier	Why
Writing & editing	2-3x	AI handles drafts, humans add judgment
Code development	2-4x	Autocomplete, debugging, boilerplate
Data analysis	3-5x	Instant pattern recognition, visualization
Customer response	2-4x	Instant context retrieval, draft responses
Research	3-5x	Synthesize sources, extract key points

Notice these aren't 100x improvements. AI doesn't turn a mediocre employee into a genius. It turns a good employee into a highly efficient one by removing the friction from tasks that consume time but not judgment.

Why Your Competitors Aren't Doing This (Yet)

The Anthropic research reveals a surprising finding: despite the hype, 67% of businesses have minimal or no AI adoption. The gap isn't technical — it's organizational.

The three barriers:

1. No Clear Starting Point

Leadership knows AI is important but doesn't know where to begin. Should they buy a platform? Hire a data scientist? Build a chatbot? The paradox of choice paralyzes action.

Solution: Start with one team, one workflow, one tool. Customer support + AI-drafted responses is the easiest first win. Prove value in 30 days, then expand.

2. Fear of Disruption

Managers worry AI will upset team dynamics. Employees fear replacement. Both lead to passive resistance.

Solution: Frame AI as a tool for the team, not a replacement of the team. Let employees choose how to use it. The best AI adoption happens bottom-up — when individuals discover it makes their job easier.

3. Overengineering the Solution

Companies try to build a comprehensive AI strategy before doing anything. Six months of planning, vendor evaluation, and committee meetings — then a pilot that's too ambitious and fails.

Solution: Buy a $20/month AI subscription for one team member. See what they accomplish in two weeks. Scale what works.

The 90-Day Playbook

Here's how to make your team an AI-augmented team in one quarter:

Month 1: Identify and Experiment

Audit time waste — Where does your team spend time on repetitive, low-judgment tasks?
Pick one workflow — Choose the highest-volume, lowest-risk task
Give one person access — Let your most curious team member experiment with AI tools
Measure baseline — Track current speed and quality for the chosen workflow

Month 2: Validate and Expand

Measure results — Compare speed and quality against baseline
Document what works — Create simple prompts and workflows the team can follow
Roll out to the team — Train everyone on the winning workflow
Identify the next workflow — What else could benefit?

Month 3: Systematize

Build custom tools — If generic AI works, a custom AI assistant trained on your data works 10x better
Set quality standards — Define when AI output needs human review vs. can go straight out
Track ROI — Hours saved x hourly cost = dollar value of AI augmentation
Plan Q2 — Which teams get AI next?
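
The ROI line above reduces to one multiplication. Here is a minimal sketch, assuming hypothetical inputs: the hours saved, the hourly cost, and the subscription price are illustrative numbers, not figures from the research.

```python
def monthly_roi(hours_saved_per_week: float, hourly_cost: float,
                tool_cost_per_month: float, weeks_per_month: float = 4.0) -> dict:
    """Dollar value of AI augmentation: hours saved x hourly cost, net of tool spend."""
    gross = hours_saved_per_week * weeks_per_month * hourly_cost
    net = gross - tool_cost_per_month
    return {"gross_value": gross, "net_value": net}


# Hypothetical example: 5 hours saved per week at $40/hour, one $20/month seat.
result = monthly_roi(hours_saved_per_week=5, hourly_cost=40, tool_cost_per_month=20)
print(result)
```

Swap in your own measured baseline from Month 1; the formula is only as good as the hours you actually tracked.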

The Math That Matters

Let's make this concrete. A 10-person customer support team:

Metric	Without AI	With AI Augmentation
Tickets per agent per day	40	100
Average response time	8 minutes	3 minutes
First-contact resolution	65%	82%
Customer satisfaction	4.1/5	4.5/5
Effective team capacity	10 people	25 people equivalent

The team didn't shrink. Their effective capacity grew 2.5x. You can now handle 2.5x the customer volume without hiring, or reassign 6 people to higher-value work like proactive outreach and retention.
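
The arithmetic behind those numbers, as a short sketch. The inputs come straight from the table above; the variable names are only for illustration.

```python
import math

team_size = 10
tickets_before = 40    # tickets per agent per day, without AI
tickets_after = 100    # tickets per agent per day, with AI augmentation

daily_volume_before = team_size * tickets_before   # 400 tickets/day
multiplier = tickets_after / tickets_before        # 2.5
equivalent_headcount = team_size * multiplier      # 25 people equivalent

# Agents needed to cover the original volume at the new throughput.
agents_for_old_volume = math.ceil(daily_volume_before / tickets_after)  # 4
reassignable = team_size - agents_for_old_volume                        # 6

print(multiplier, equivalent_headcount, reassignable)
```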

What Not to Do

Don't automate customer-facing interactions on day one. Start with internal, human-reviewed workflows.
Don't mandate AI use. People adopt tools they choose. Force breeds resentment.
Don't expect perfection. AI makes mistakes. The workflow should include human review until you've built confidence.
Don't chase the latest model. GPT-4, Claude, Llama — the model matters less than the workflow around it.
Don't skip measurement. "It feels faster" isn't enough. Track hours, quality, and outcomes.

The Window Is Open

Right now, 67% of your competitors aren't using AI meaningfully. That number will shrink every quarter. The advantage of being early is real but temporary.

The companies that will dominate their markets in 2027 aren't the ones with the best AI technology. They're the ones whose teams learned to work with AI in 2025 and 2026 — who spent a year building workflows, institutional knowledge, and competitive moats while everyone else was still debating whether to start.

Your team doesn't need to be replaced. They need to be equipped.

Data from the Anthropic Economic Index and McKinsey Global Survey on AI, 2025.

Synthetic Data for Fine-Tuning: How to Generate Your Own Training Set

ai.rs — Mon, 16 Mar 2026 10:00:00 +0100

The Data Bottleneck

You've read about fine-tuning and post-training. You understand SFT, DPO, and LoRA. You have a GPU ready. But when you sit down to actually train a model, you hit the real wall: you need thousands of high-quality training samples, and you have maybe a hundred.

Manual data creation is slow. A skilled annotator produces 20-50 instruction-response pairs per hour. At that rate, a 10,000-sample dataset takes 200-500 hours of human labor — months of work before training even begins.

Synthetic data generation solves this. Instead of writing every sample by hand, you use LLMs to generate, judge, and filter training data at scale. The result: 10,000+ samples in hours, not months.

The Synthetic Data Pipeline

The modern synthetic data pipeline has five stages:

Seed Prompts → Policy Model → LLM Jury → Heuristic Filter → Training Dataset
   (100s)       (generates)    (ranks)      (quality gate)     (10,000s)

Each stage has a specific role, and getting any one wrong poisons the entire dataset.

Stage 1: Seed Prompts

Everything starts with prompts — the questions and instructions your model will learn to handle. You need diverse, realistic prompts that cover your target domain.

Where to get seed prompts:

Existing customer data — Real questions from support tickets, search logs, or chat history
Manual curation — Write 100-500 high-quality prompts covering key scenarios
Prompt evolution — Use an LLM to create variations of your seeds
Public datasets — Alpaca, ShareGPT, UltraChat as starting points (filter for relevance)

Prompt evolution example:

Seed: "What's the best laptop for video editing under $1500?"

Evolved variants:
→ "I need a laptop for 4K video editing. Budget is flexible but under $2000."
→ "Compare the MacBook Pro M3 and Dell XPS 16 for Premiere Pro workflows."
→ "What specs matter most for DaVinci Resolve — RAM, GPU, or CPU?"
→ "I edit YouTube videos as a side hustle. What's the minimum I should spend?"

From 100 seed prompts, evolution can generate 1,000-5,000 diverse variants. The key is ensuring they span different intents (compare, recommend, explain, troubleshoot), complexity levels (simple factual to multi-step reasoning), and edge cases (out-of-scope requests, ambiguous queries).
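
One way to make that coverage systematic is to cross each seed with a fixed list of intents and complexity levels. The sketch below only builds the rewrite instructions; sending them to an LLM is left out. `build_evolution_requests`, the template wording, and the two lists are assumptions for illustration, not a standard API.

```python
from itertools import product

INTENTS = ["compare", "recommend", "explain", "troubleshoot"]
COMPLEXITY = ["simple factual", "multi-step reasoning"]


def build_evolution_requests(seed: str) -> list[str]:
    """Return one rewrite instruction per (intent, complexity) combination."""
    template = (
        "Rewrite the user question below as a new question.\n"
        "Intent: {intent}\nComplexity: {complexity}\n"
        "Keep the same domain. Return only the new question.\n\n"
        "Original: {seed}"
    )
    return [
        template.format(intent=intent, complexity=level, seed=seed)
        for intent, level in product(INTENTS, COMPLEXITY)
    ]


evolution_requests = build_evolution_requests(
    "What's the best laptop for video editing under $1500?"
)
print(len(evolution_requests))  # 8 instructions, one per intent/complexity pair
```

Adding a third axis (for example, edge cases such as out-of-scope requests) multiplies the variant count again, which is how a few hundred seeds reach the thousands.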

Stage 2: Response Generation

For each prompt, generate multiple responses. This is where the pipeline splits depending on whether you're creating SFT data or preference data.

For SFT data — Generate one high-quality response per prompt:

# `model` and `seed_prompts` are placeholders for your LLM client and prompt list
dataset = []
for prompt in seed_prompts:
    response = model.generate(
        prompt,
        temperature=0.7,  # Some creativity
        max_tokens=2048,
    )
    dataset.append({"instruction": prompt, "response": response})

For DPO preference data — Generate multiple responses and rank them:

for prompt in seed_prompts:
    responses = [
        model.generate(prompt, temperature=0.9)  # Higher temp = more variety
        for _ in range(4)  # 4 candidates per prompt
    ]
    # Judge picks best and worst → chosen/rejected pair

Which model to use for generation:

Strategy	Pros	Cons
Use a stronger model (GPT-4, Claude)	Higher quality responses	Off-policy for DPO, API costs
Use your own model	On-policy (best for DPO)	Quality ceiling = current model
Mix both	Best of both worlds	More complex pipeline

For SFT data, using a stronger model is fine — you're teaching your model to imitate good responses. For DPO, you should use your own model (on-policy) to avoid the policy drift problem discussed in the post-training article.

Stage 3: LLM-as-Judge

Raw generated responses vary in quality. The LLM jury scores and ranks them:

Prompt: [the user's question]
Response A: [candidate 1]
Response B: [candidate 2]

Evaluate both responses on:
1. Accuracy (0-5): Are the facts correct?
2. Helpfulness (0-5): Does it address the user's need?
3. Clarity (0-5): Is it well-structured and easy to follow?
4. Safety (0-5): Does it avoid harmful content?

Which response is better overall? Explain why.

Important: Use a different model as the judge than the one that generated responses. If the same model judges its own output, it's biased toward its own style regardless of quality.

Rubric-based scoring outperforms simple "which is better" judgments. When the judge evaluates on specific criteria, the signal is clearer and more consistent.
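
Once the judge's rubric scores are parsed, turning them into a chosen/rejected pair is a few lines. This is a minimal sketch that assumes the judge output is already a dict of 0-5 scores per criterion; the parsing step and the helper names are hypothetical.

```python
RUBRIC = ("accuracy", "helpfulness", "clarity", "safety")


def total_score(scores: dict) -> int:
    """Sum the 0-5 rubric scores for one response."""
    return sum(scores[criterion] for criterion in RUBRIC)


def to_preference_pair(prompt: str, candidates: list[dict]) -> dict:
    """Pick the highest- and lowest-scoring candidates as chosen/rejected."""
    ranked = sorted(candidates, key=lambda c: total_score(c["scores"]), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0]["text"], "rejected": ranked[-1]["text"]}


# Hypothetical judge scores for three candidates.
candidates = [
    {"text": "Response A", "scores": {"accuracy": 5, "helpfulness": 4, "clarity": 4, "safety": 5}},
    {"text": "Response B", "scores": {"accuracy": 2, "helpfulness": 3, "clarity": 4, "safety": 5}},
    {"text": "Response C", "scores": {"accuracy": 4, "helpfulness": 4, "clarity": 3, "safety": 5}},
]
pair = to_preference_pair("Explain quantum computing", candidates)
print(pair["chosen"], pair["rejected"])
```

An unweighted sum is the simplest choice; weighting accuracy more heavily is a reasonable variation if hallucinations are your main concern.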

Stage 4: Heuristic Filtering

Even with LLM judging, some samples are bad. Apply hard filters:

Length ratio — Reject pairs where chosen and rejected are nearly identical length (no learning signal)
Score threshold — Drop responses scoring below 3/5 on any criterion
Deduplication — Remove near-duplicate prompts (cosine similarity > 0.95)
Format compliance — Ensure responses match expected structure
Toxicity filter — Run a classifier to catch harmful content the judge missed

Expect to drop 20-40% of generated samples at this stage. That's normal and desirable — aggressive filtering produces a cleaner dataset.
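
Two of the filters above, score threshold and deduplication, can be sketched without any dependencies. The word-count cosine similarity here is a stand-in assumption for the embedding similarity you would use in practice, and the sample format is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts; a stand-in for embedding similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def heuristic_filter(samples: list[dict], min_score: int = 3, max_sim: float = 0.95) -> list[dict]:
    """Keep samples that pass the score threshold and are not near-duplicates."""
    kept = []
    for sample in samples:
        if min(sample["scores"].values()) < min_score:
            continue  # fails the score threshold on at least one criterion
        if any(cosine_similarity(sample["prompt"], k["prompt"]) > max_sim for k in kept):
            continue  # near-duplicate of a prompt we already kept
        kept.append(sample)
    return kept


samples = [
    {"prompt": "best laptop for video editing", "scores": {"accuracy": 4, "clarity": 4}},
    {"prompt": "best laptop for video editing", "scores": {"accuracy": 5, "clarity": 4}},
    {"prompt": "how do I return a keyboard", "scores": {"accuracy": 2, "clarity": 5}},
    {"prompt": "compare two monitors for photo work", "scores": {"accuracy": 5, "clarity": 5}},
]
kept = heuristic_filter(samples)
print(len(kept), "of", len(samples), "samples kept")
```

The pairwise comparison is quadratic in the number of kept samples, which is fine for a few thousand prompts; larger sets would need an approximate nearest-neighbor index.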

Stage 5: Optional Refinement

A recent improvement: after selecting the "chosen" response, pass it through a refiner model that polishes it further:

Here is a response to a user question. Improve it while keeping
the same core content. Fix any errors, improve clarity, and ensure
the tone is helpful and professional.

[original chosen response]

This consistently improves DPO training because the chosen response becomes genuinely better, not just the least-bad option from the batch.

Practical Example: Building a Product Q&A Dataset

Let's walk through generating a 10,000-sample SFT dataset for an e-commerce product assistant.

Step 1: Collect Seeds (200 prompts)

Sources:

80 from customer support tickets
60 hand-written covering product categories
60 evolved from the first 140

Step 2: Evolve to 2,500 Prompts

Use an LLM to generate 10-15 variants per seed prompt, varying:

Product category
Customer intent (buy, compare, troubleshoot, return)
Specificity (vague vs. detailed)
Tone (casual, urgent, professional)

Step 3: Generate Responses

Use a strong model (Claude/GPT-4) with your product catalog in context via RAG:

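system_prompt = """You are a product expert for [store name].
Use only the product information provided. If a product doesn't
exist in the catalog, say so. Never make up products or prices."""

# rag_search() and generate() are placeholders for your retrieval and LLM calls
candidates = []
for prompt in evolved_prompts:
    products = rag_search(prompt, top_k=5)  # top 5 catalog matches for this prompt
    response = generate(
        system=system_prompt,
        context=products,
        user=prompt,
        temperature=0.7,
    )
    # Keep every pair for now; Step 4 judges and filters them
    candidates.append({"instruction": prompt, "response": response})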

Step 4: Judge and Filter

Run each response through the jury:

Score on accuracy, helpfulness, product knowledge, format
Drop responses scoring < 3 on accuracy (these contain hallucinations)
Drop near-duplicates

Result: ~2,000 high-quality SFT samples from 2,500 prompts (80% pass rate)

Step 5: Augment with Multi-Turn

Convert single-turn Q&As into conversations:

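# generate_follow_up() and generate() are placeholders for your LLM calls
multi_turn = []
for sample in sft_data[:500]:
    follow_up = generate_follow_up(sample["instruction"], sample["response"])
    continuation = generate(context=sample, user=follow_up)
    # Store the original exchange plus the follow-up as a 2-turn conversation
    multi_turn.append({**sample, "follow_up": follow_up, "continuation": continuation})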

Final dataset: ~2,000 single-turn + ~500 multi-turn = ~2,500 samples

Repeat the cycle 4x with different prompt evolution seeds, and you have your 10,000-sample dataset.

Common Pitfalls

1. Model Collapse

If you train on your own model's output, then generate more data, then train again — each cycle amplifies the model's biases. After 3-4 iterations, responses become repetitive and quality degrades.

Fix: Always use fresh seed prompts and mix in human-written samples (even 10-20% human data prevents collapse).

2. Reward Hacking in Preference Data

The LLM judge has predictable preferences: longer responses, bullet points, hedging language ("It's important to note..."). Models learn to game these signals instead of improving actual quality.

Fix: Use length-normalized scoring. Penalize filler phrases. Score on rubrics, not vibes.
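
A rough sketch of what that fix can look like: subtract a penalty for words beyond a budget and for each filler phrase. The phrase list, the 150-word budget, and the penalty weights are all assumptions to tune against your own data, not established values.

```python
FILLER_PHRASES = ("it's important to note", "it is worth mentioning", "in conclusion")


def adjusted_score(raw_score: float, response: str,
                   target_words: int = 150, length_penalty: float = 1.0,
                   filler_penalty: float = 0.5) -> float:
    """Subtract penalties for words beyond a budget and for filler phrases."""
    words = len(response.split())
    excess = max(0, words - target_words) / target_words
    fillers = sum(response.lower().count(phrase) for phrase in FILLER_PHRASES)
    return raw_score - length_penalty * excess - filler_penalty * fillers


concise = "Use 32 GB of RAM for 4K editing."
padded = "It's important to note that " + "RAM matters. " * 100

# The judge scored the padded answer higher (4.5 vs 4.0); the penalties reverse that.
print(adjusted_score(4.0, concise), adjusted_score(4.5, padded))
```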

3. Distribution Mismatch

Your synthetic prompts might not match what real users actually ask. If you train on academic-style questions but users ask casual ones, the model struggles.

Fix: Start with real user data as seeds. Validate synthetic prompts against actual query logs.

4. Contamination

If the generating model was trained on your evaluation benchmark, it will produce responses that look correct on your evals but fail on real tasks.

Fix: Hold out a manually-created test set that no model has seen. Evaluate on real user satisfaction, not benchmark scores.

Tools for Synthetic Data Generation

Tool	Best For	Notes
distilabel (Argilla)	Full pipelines, production use	Most complete framework, supports UltraFeedback-style pipelines
Magpie (Xu et al.)	Extracting instruction data from LLMs	Clever technique: use model's chat template to elicit natural instructions
Self-Instruct	Quick SFT data from seeds	The original paper's approach, simple but effective
Evol-Instruct	Increasing prompt complexity	WizardLM's approach: iteratively make prompts harder
Your own scripts	Custom pipelines	50 lines of Python + an API key is often enough

For most teams, distilabel is the right starting point — it handles the full pipeline (generation, judging, filtering) with built-in support for multiple LLM providers.

How Much Data Do You Need?

Goal	SFT Samples	Preference Pairs	Notes
Tone/style change	1K-5K	Not needed	Smallest useful dataset
Domain adaptation	5K-20K	5K-10K	The sweet spot for most businesses
New capability	20K-100K	10K-50K	Teaching the model something fundamentally new
Full post-training	100K+	50K+	What model providers do; you probably don't need this

Start with 5K samples and evaluate. Add more data only when you can identify specific gaps in performance — more data without direction just adds noise.

Key Takeaways

Synthetic data removes the data bottleneck — Generate 10,000+ samples in hours instead of months
Quality > quantity — Aggressive filtering (drop 20-40%) produces better models than keeping everything
Use a different model as judge — Self-evaluation introduces bias
Mix in human data — Even 10-20% prevents model collapse across iterations
Start with real user prompts — Synthetic diversity means nothing if the distribution doesn't match reality
Iterate small — Start with 5K samples, evaluate, identify gaps, then scale up

For the full context on how this fits into model training, see LLM Post-Training Explained and Why Train Your Own LLM.

LLM Post-Training Explained: SFT, DPO, and GRPO

ai.rs — Fri, 13 Mar 2026 10:00:00 +0100

What Is Post-Training?

When a company like Meta releases Llama or Mistral releases their models, they ship two versions: a base model and an instruct model. The base model is the raw output of pre-training — it can autocomplete text but can't follow instructions, answer questions, or hold a conversation. The instruct model does all of that.

The difference is post-training: the set of techniques applied after pre-training that transform a text-completion engine into an AI assistant.

If pre-training is like giving someone a library of books to read, post-training is teaching them how to have a conversation about what they've read.

Post-Training vs. Fine-Tuning

These terms overlap but aren't identical:

Aspect	Post-Training	Fine-Tuning
Goal	General-purpose assistant	Task-specific expert
Data size	1M+ samples	10K-1M samples
Who does it	Model providers (Meta, Mistral, etc.)	End users and businesses
Output	Instruct/chat model	Domain-adapted model
Techniques	SFT + DPO + RL	Usually SFT only

Post-training is what turns Llama into Llama-Instruct. Fine-tuning is what turns Llama-Instruct into your custom product assistant. They use the same underlying methods (especially SFT), but at different scales and for different purposes.

The Three-Stage Pipeline

Modern post-training follows a three-stage pipeline, each building on the previous:

Base Model     →  SFT            →  DPO             →  GRPO           →  Aligned Model
(autocomplete)    (follows          (prefers good      (reasons
                   instructions)     responses)         step-by-step)

Stage 1: Supervised Fine-Tuning (SFT)

SFT is the most intuitive stage. You show the model thousands of instruction-response pairs and train it to produce similar outputs.

What It Does

A base model given "What is the capital of France?" might continue with "What is the capital of Germany? What is..." — it's autocompleting, not answering. After SFT, it responds: "The capital of France is Paris."

SFT teaches three capabilities:

Instruction following — Understanding what the user is asking
Format compliance — Responding in the expected structure (chat, JSON, code)
Knowledge activation — Surfacing relevant knowledge from pre-training

Training Approaches

There are three ways to run SFT, each with different trade-offs:

Method	Quality	VRAM	Speed	When to Use
Full Fine-Tuning	Best	Very high (2x model)	Slow	You have multiple A100s
LoRA	Near-full	High (1x model + 5%)	Fast	Default choice for most teams
QLoRA	Good (slight degradation)	Low (0.25x model)	Medium	Consumer GPUs, prototyping

LoRA (Low-Rank Adaptation) is the standard for most practical work. It freezes the base model weights and trains small adapter matrices (~2% of total parameters), achieving near-full quality at a fraction of the compute.

QLoRA goes further by quantizing the base model to 4-bit precision, cutting VRAM by 4x. The trade-off is a small quality drop — good enough for experimentation, but production models typically use LoRA or full fine-tuning.
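
To see why LoRA adapters are so small, count the parameters. For one frozen weight matrix of shape d_out × d_in, LoRA trains two low-rank matrices with rank r, adding r × (d_in + d_out) parameters. The hidden size and rank below are hypothetical; the overall share of trainable parameters depends on the rank and on which matrices you adapt.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one weight: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


d = 4096                             # hypothetical hidden size
full = d * d                         # parameters in one frozen projection matrix
added = lora_params(d, d, rank=16)   # trainable adapter parameters for that matrix

print(added, full, f"{added / full:.2%}")
```

Raising the rank or adapting more layers moves the share up toward the low single-digit percentages quoted above.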

Key Parameters

These are the training parameters that matter most for SFT:

Learning rate: 1e-5 to 5e-5 (too high = catastrophic forgetting, too low = no learning)
Epochs: 3-5 (more isn't better — the model overfits quickly on small datasets)
Batch size: 8-16 (larger batches smooth gradients but need more VRAM)
Max sequence length: 2048-8192 tokens (longer = more context but slower training)
Optimizer: AdamW with weight decay 0.01

Dataset Quality Matters More Than Size

The three pillars of a good SFT dataset:

Accuracy — Every response must be correct. One wrong answer teaches the model to hallucinate.
Diversity — Cover the full range of tasks: Q&A, reasoning, coding, math, creative writing.
Complexity — Include multi-step reasoning, not just simple factual recall.

A curated dataset of 50K high-quality samples outperforms a noisy dataset of 500K every time.

Stage 2: Direct Preference Optimization (DPO)

SFT teaches the model to produce reasonable responses. DPO teaches it which response is better when there are multiple valid options.

The Core Idea

DPO works with preference pairs — for each prompt, you provide a chosen (good) response and a rejected (bad) response:

Prompt: "Explain quantum computing"
Chosen: [clear, accurate, well-structured explanation]
Rejected: [vague, overly technical, or slightly wrong explanation]

The training objective widens the probability gap between chosen and rejected responses. The model learns not just what to say, but what not to say.
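
For one preference pair, the standard DPO loss is the negative log-sigmoid of a scaled margin: how much more the policy favors the chosen response over the rejected one, relative to a frozen reference model. Here is a minimal sketch in plain Python; the log-probabilities are made-up numbers, and a real trainer computes them from summed token log-probs over a batch.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no gap yet, loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen response and lowered the rejected one: loss drops.
later = dpo_loss(-5.0, -17.0, -10.0, -12.0)
print(round(start, 4), round(later, 4))
```

The `beta` parameter controls how far the policy may drift from the reference; 0.1 is a common default, used here as an assumption.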

Why Not Just More SFT?

SFT has a ceiling. It teaches the model to imitate training data, but it can't distinguish between good-enough and excellent responses. DPO adds a quality signal that pushes the model toward the better end of its capability range.

Concretely:

SFT: "Here's how to respond to this type of question"
DPO: "Between these two responses, this one is better because..."

The Policy Drift Problem

DPO has an important pitfall: off-policy data. If your preference data was generated by a different model (say, GPT-4), there's a mismatch between what that model would say and what your model would say. The training signal becomes noisy.

The solution is on-policy data generation: use your own model to generate responses, then have them judged:

Prompt → Your Model generates 2+ responses
                    ↓
           LLM Jury ranks them
                    ↓
         Best = Chosen, Worst = Rejected
                    ↓
              Train with DPO

This creates a tighter feedback loop — the model learns from its own mistakes rather than from another model's outputs.

State-of-the-Art DPO Techniques

Recent improvements that push DPO further:

Length normalization — Prevents the model from learning that longer = better
Anchored preference optimization — Adds a reference anchor to stabilize training
Refine chosen answers — Use a stronger model to polish the "chosen" response before training
Rubric-based scoring — Rate responses on specific criteria (accuracy, helpfulness, safety) instead of binary better/worse

Stage 3: Reinforcement Learning (GRPO)

The newest and most powerful stage. While SFT teaches imitation and DPO teaches preference, RL teaches the model to reason — to try multiple approaches and learn which thinking patterns lead to correct answers.

What Is GRPO?

Group Relative Policy Optimization (GRPO) was introduced by DeepSeek and powers models like DeepSeek-R1. Unlike traditional RL methods (PPO) that require a separate critic model, GRPO is simpler:

Given a prompt, sample a group of responses (e.g., 8 completions)
Score each response with a reward function
Normalize scores within the group to compute advantages
Update the model to produce more high-scoring and fewer low-scoring responses

The key insight: by comparing responses within a group, GRPO doesn't need an absolute value estimate. It just needs to know which responses in the batch were relatively better.
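
The normalization step can be sketched in a few lines: subtract the group mean and divide by the group standard deviation. This is a simplified illustration, not DeepSeek's implementation; the small `eps` term and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# 8 completions for one prompt, scored with a binary reward (3 correct, 5 wrong).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Correct completions get a positive advantage and wrong ones a negative advantage. If every completion in a group scores the same, all advantages are zero and that prompt contributes no gradient.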

Reward Functions

The reward function is what drives learning. There are two categories:

Rule-based rewards (easy to implement):

Math: Does the answer match the correct solution?
Code: Does it pass the test cases?
Format: Does it follow the requested structure?

Model-based rewards (harder, more general):

A separate LLM judges response quality
More flexible but introduces another model's biases

For most practical applications, rule-based rewards work best because they give an unambiguous signal. This is why RL has been most successful for math and code — the reward is binary (correct or not).
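
Rule-based rewards are often just a few lines of string matching. A minimal sketch, assuming the model is asked to finish with a line of the form "Answer: ..." (that format and both function names are hypothetical conventions for this example):

```python
import re


def math_reward(response: str, expected: str) -> float:
    """1.0 if the last number in the response matches the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == expected else 0.0


def format_reward(response: str) -> float:
    """1.0 if the response contains a line starting with 'Answer: ', else 0.0."""
    return 1.0 if re.search(r"^Answer: .+$", response, re.MULTILINE) else 0.0


good = "17 + 25 = 42\nAnswer: 42"
bad = "I think it is 41"
print(math_reward(good, "42"), format_reward(good))
print(math_reward(bad, "42"), format_reward(bad))
```

Comparing strings is brittle ("42.0" would not match "42"), so production reward functions usually parse and compare numeric values or run real test suites.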

Why RL Matters

RL is what gives models like DeepSeek-R1 and OpenAI o1 their reasoning abilities. The model learns to:

Break problems into steps
Try multiple approaches
Verify its own work
Backtrack when a path isn't working

This emergent behavior doesn't come from SFT (you'd need millions of perfect chain-of-thought examples) or DPO (preference pairs don't capture reasoning processes well). RL lets the model discover reasoning strategies through trial and error.

The Three Eras of Post-Training

Post-training has evolved rapidly:

SFT Era (2017-2023)

Ran from the original Transformer paper (2017) through InstructGPT's RLHF (2022). The focus was on making models follow instructions at all. Key models: GPT-3.5, early ChatGPT.

DPO Era (2023-2024)

DPO removed the complexity of RLHF by eliminating the separate reward model. Alignment became accessible to smaller teams. Key models: Zephyr, Intel's NeuralChat, early Llama fine-tunes.

RL Era (2025+)

DeepSeek-R1 proved that pure RL could produce breakthrough reasoning capabilities. GRPO became the standard. Key models: DeepSeek-R1, QwQ, Kimi k1.5.

Practical Considerations

When Do You Need Post-Training vs. Fine-Tuning?

Most developers don't need to run the full post-training pipeline. Here's a decision tree:

Start with an instruct model — Someone already did post-training for you
Try RAG first — Inject domain knowledge at inference time
Fine-tune with SFT if you need: specific tone/voice, domain-specific formatting, or consistent behavior patterns
Consider DPO if: your model produces decent responses but lacks consistency in quality
Consider RL only if: you have a clear reward signal (code correctness, math accuracy) and significant compute

Tools of the Trade

Tool	Best For	Complexity
Unsloth	SFT and DPO, beginner-friendly	Low
TRL (Hugging Face)	Full pipeline including GRPO	Medium
OpenRLHF	Large-scale distributed RL	High
torchtune (PyTorch)	SFT with native PyTorch	Medium

For most teams, Unsloth for SFT/DPO and TRL for GRPO covers the full pipeline.

The Cost Spectrum

Stage	Compute	Data Required	Typical Duration
SFT	1 GPU, hours	10K-100K samples	3-8 hours
DPO	1-2 GPUs, hours	10K-50K preference pairs	4-12 hours
GRPO	4-8+ GPUs, days	Prompts + reward function	1-7 days

SFT is accessible to anyone with a single GPU. DPO adds moderate cost. RL requires serious infrastructure — this is why it's mostly done by labs and well-funded teams.

Pros and Cons

Pros of Post-Training

Transforms capability — A base model is nearly useless for end users; post-training makes it practical
Composable stages — Each stage addresses a different weakness; you can stop at any stage
SFT is accessible — Anyone with a GPU and good data can fine-tune a model in hours
RL unlocks reasoning — Capabilities that can't be taught through imitation alone
Open tooling — Unsloth, TRL, and others make the full pipeline available to everyone

Cons of Post-Training

Data quality is everything — Bad training data makes the model worse, not better
Catastrophic forgetting — Aggressive training can destroy pre-trained knowledge
RL is expensive — Full GRPO requires multi-GPU setups and days of compute
Alignment tax — Safety training can reduce raw capability (the model becomes cautious)
Evaluation is hard — Unlike pre-training loss, post-training quality is subjective and task-dependent
Policy drift — DPO with off-policy data produces unreliable results

Key Takeaways

Post-training is the bridge between a raw language model and a useful AI assistant
Three stages: SFT (follow instructions) → DPO (prefer better responses) → RL (learn to reason)
Start with instruct models — Don't reinvent the wheel unless you have specific requirements
SFT is the most practical stage for business fine-tuning with LoRA
RL is the frontier — It's how the best reasoning models are built, but it requires significant resources
Dataset quality > quantity — Always

For a deeper dive into fine-tuning for your specific use case, see Why Train Your Own LLM and What Is Fine-Tuning?

This article draws on Maxime Labonne's presentation "Introduction to Post-Training Techniques" and current research from DeepSeek, Hugging Face, and the open-source ML community.

Your Competitors Aren't Using AI Yet — Make That Your Advantage

ai.rs — Wed, 11 Mar 2026 10:00:00 +0100

The Gap Nobody's Talking About

Anthropic just published the most comprehensive look at how AI is actually being used across the economy. The headline finding will surprise you:

94% of tasks in business and finance occupations could theoretically be done by AI. Only 33% actually are.

That's not a small gap. That's a canyon. And if you're a business owner, it means one thing: most of your competitors are leaving massive value on the table.

What the Research Actually Found

The Anthropic Economic Index analyzed roughly 1 million real AI conversations to map how businesses are actually using AI — not what's theoretically possible, but what people are doing right now.

Here's what stands out:

Finding	What It Means
94% of business tasks are AI-feasible	The technology is ready
Only 33% are actually being done by AI	Almost nobody is using it
36% of occupations use AI for at least 1/4 of tasks	Adoption is shallow
Only 4% of occupations use AI for 75%+ of tasks	Deep adoption is extremely rare
30% of workers have zero AI exposure	Nearly a third haven't touched it

The radar chart below shows the gap visually — the blue area is what AI could do, the red area is what it actually does:

Source: Anthropic Economic Index, March 2026

Look at Business & Finance, Management, Legal, Sales — the blue (possible) dwarfs the red (actual) in every category that matters to a business owner.

And here are the occupations where AI is already making the biggest impact:

Source: Anthropic Labor Market Impact Research, March 2026

Customer service reps at 70.1%. Sales reps at 62.8%. Financial analysts at 57.2%. These aren't future predictions — this is happening right now, and most businesses still aren't part of it.

Let that sink in. The tools exist. The capability is proven. But the vast majority of businesses are still doing things the old way.

Why This Is an Opportunity, Not a Threat

When most people read AI headlines, they think about job losses. But the research tells a completely different story.

There's been no systematic increase in unemployment for highly AI-exposed workers since late 2022. The technology isn't replacing people — it's augmenting them. In fact, 57% of AI usage is augmentation (AI helping humans do better work) versus 43% automation (AI handling tasks independently).

This is the key insight for business owners: AI isn't about cutting staff. It's about multiplying what your existing team can do.

A salesperson who uses AI to draft proposals and follow-ups handles 3x the pipeline. A support agent with AI assistance resolves tickets 40% faster. A marketing team using AI for content creation produces more in a week than they used to in a month.

Your headcount stays the same. Your output doubles.

The First-Mover Window Is Wide Open

Here's what makes this moment special. In most technology shifts, the window for competitive advantage is narrow — everyone adopts at roughly the same time.

Not with AI. The adoption curve is remarkably slow:

Metric	Reality
Businesses with deep AI integration	~4%
Workers with zero AI exposure	~30%
Gap between possible and actual	61 percentage points

That 61-point gap is your window. Every month you adopt AI and your competitors don't, you compound your advantage:

Month 1: Your AI assistant handles after-hours inquiries. Competitors miss those sales.
Month 3: Your team produces 2x the output with the same headcount. Competitors hire to keep up.
Month 6: Your customer response time is under 10 seconds. Competitors still measure theirs in hours.
Month 12: Your AI has learned from thousands of customer interactions. A competitor starting now is 12 months behind on data.

This is the compounding effect that makes first-mover advantage real. Not because the technology is exclusive — anyone can access it. But because the data you feed it is unique to your business, and it takes time to build.

Where AI Creates the Biggest Business Impact

The research breaks down AI usage by occupation. Here's what that means for a typical business:

Sales & Customer Support

This is where most businesses see the fastest ROI. AI handles the high-volume, repetitive interactions so your team can focus on high-value relationships.

Answer product questions 24/7 (no more lost after-hours sales)
Qualify leads automatically before they reach a salesperson
Draft personalized follow-up emails in seconds
Handle multilingual customers without hiring native speakers

Marketing & Content

The research shows Arts, Design, and Media account for 10.3% of all AI usage — the second-highest category. Businesses are using AI for:

Product descriptions and catalog copy at scale
Email campaigns personalized to customer segments
Social media content calendars
SEO-optimized blog posts and landing pages

Operations & Administration

Office and Administrative tasks represent 7.9% of AI usage. Think:

Automated report generation from raw data
Invoice processing and bookkeeping assistance
Meeting summaries and action item extraction
Document drafting and review

Business Strategy & Finance

Business and Financial tasks at 5.9% of usage include:

Market analysis and competitive research
Financial modeling and scenario planning
Customer data analysis for pricing decisions
Contract review and risk assessment

The Hiring Angle: Young Talent Is Already Shifting

Here's a data point that should get your attention: job-finding rates for young workers (ages 22-25) dropped 14% in AI-exposed occupations since ChatGPT launched.

This doesn't mean these jobs are disappearing. It means companies are getting more selective. They want candidates who can work with AI, not just do the tasks AI can handle.

For your business, this means:

Adopt AI now, and you attract talent that knows how to leverage it
Wait, and the best young talent goes to competitors who already use it
Your existing team gets more valuable when paired with AI tools — experienced employees who understand your business plus AI productivity is a combination no new hire can match

What Your Competitors Will Eventually Do

Make no mistake — adoption will catch up. The research shows AI capability is expanding rapidly. The question isn't whether your competitors will adopt AI, but when.

The businesses that move first get:

Advantage	Why It Compounds
Proprietary training data	Every customer interaction makes your AI smarter. Competitors starting later have less data.
Process optimization	You've already figured out what works. Competitors will make the same beginner mistakes you've already solved.
Customer expectations	Your customers get used to instant, accurate responses. They won't go back to competitors offering less.
Team capability	Your team already knows how to work with AI. Competitors need months of adjustment.

The Practical Playbook

You don't need a massive budget or a tech team to start. Here's the pragmatic approach:

Start This Week

Sign up for Claude or ChatGPT if you haven't already
Have 3 team members use it for their daily tasks for one week
Track what saves time and what doesn't

Start This Month

Identify the 3 highest-volume, most repetitive tasks in your business
Deploy AI for the simplest one first (usually customer FAQ or content creation)
Measure the time saved

Start This Quarter

Invest in a custom AI assistant trained on your product data
Integrate it with your website or customer support workflow
Set up the data feedback loop so it improves over time

The research is clear: the gap between what AI can do and what businesses are actually doing is enormous. That gap is your competitive advantage — but only if you act while it's still there.

The Bottom Line

94% possible. 33% adopted. 30% of workers haven't even tried it.

These aren't just statistics. They're a map showing you exactly where the opportunity is. Your competitors are in that 67% who aren't using AI yet. Every month you spend on that side of the gap costs you customers, efficiency, and market position.

The technology is ready. The data proves it works. The only question left is whether you'll be the business that moved first — or the one that wished it had.

Ready to start? See how custom AI works for your business — from your data to a live AI assistant.

Data from the Anthropic Economic Index and Labor Market Impacts of AI research, published March 2026.

Building an Email List That Survives the Algorithm

ai.rs — Tue, 10 Mar 2026 09:00:00 +0100

The Channel Nobody Can Take Away

In January 2026, LinkedIn reported a 60% drop in non-brand B2B traffic. Rankings held. Clicks disappeared. The cause was AI search — users got answers without ever visiting a website.

If you read that and panicked about your traffic, you had the right instinct. If you read that and shrugged because your revenue comes from email subscribers, you understood something most businesses don't.

An email list is the only audience channel you fully own.

Google can change its algorithm. Facebook can throttle your reach. Twitter can implode. AI can summarize your content and steal your clicks. But nobody can get between you and someone's inbox — except a spam filter.

Why Email Survives Every Platform Shift

Every few years, a platform shift wipes out businesses that built on rented land:

Year	Platform Shift	Who Got Hurt
2012	Facebook throttled organic reach	Brands that built audiences on Facebook Pages
2018	Google "Medic" update	Health and finance sites that relied on SEO
2021	Apple Mail Privacy Protection	Marketers who relied on open rate tracking
2023	Twitter/X algorithm changes	Creators who built audiences on Twitter
2025-26	AI search zero-click	Everyone who relied on Google organic traffic

Email survived all of them. The companies that weathered each shift had one thing in common: a direct relationship with their audience that didn't depend on any platform's algorithm.

The math is simple:

Social media follower: Platform decides if they see your content (typical organic reach: 2-5%)
Website visitor: Search engine decides if they find you (83% zero-click rate with AI overviews)
Email subscriber: You decide when they hear from you (typical delivery rate: 95%+)

What Actually Gets People to Subscribe

Here's what doesn't work: "Sign up for our newsletter."

Nobody wakes up wanting another newsletter. People subscribe when you offer something specific and valuable in exchange for their email. The word for this is lead magnet — and the good ones share a pattern.

Lead Magnets That Convert

Assessments and quizzes (highest conversion, 20-40%)

"Is your business ready for AI?" — a 2-minute quiz that gives a personalized score
"What's your SEO vulnerability score?" — timely given the AI search shift
The key: the result must be genuinely useful, not just a sales pitch with a score attached

Templates and tools (15-25% conversion)

Spreadsheet calculators ("AI ROI calculator for your business")
Checklists ("llms.txt implementation checklist")
Scripts and code snippets for developers

Original research and data (10-20% conversion)

"We analyzed 500 AI implementations — here's what worked"
Benchmark reports with real numbers
Industry surveys with proprietary data

Mini-courses and email sequences (10-15% conversion)

"5 days to understanding AI for your business" — one email per day
Each email delivers real value, not just teasers

What Doesn't Work

"Subscribe to our newsletter" with no value proposition
Pop-ups that appear before the user has read anything
Gated content that's freely available elsewhere
Promising weekly updates and sending daily sales pitches

The conversion rate on a generic "subscribe to our newsletter" form is typically 1-3%. A well-crafted lead magnet with a clear value proposition converts at 10-40%. The difference is entirely in the offer.

The Subscribe Form That Works

Placement matters as much as the offer:

Best performing locations:

Inline within content — after a reader has consumed 40-60% of an article (they're engaged)
End of article — natural next step after reading
Footer — low-friction, always visible
Exit intent — when the cursor moves toward closing the tab

Worst performing locations:

Immediate pop-up — before the user knows if your content is worth reading
Sidebar widget — banner blindness kills these
Buried in the footer with no context — "Subscribe" next to copyright text

The Formula

A high-converting subscribe form has three elements:

Specific promise: "Get one actionable AI insight every Tuesday" beats "Stay updated"
Social proof: "Join 2,400 business owners" or "Read by CTOs at 50+ companies"
Low friction: Email field + one button. No name field, no company field, no phone number

Every additional form field reduces conversion by roughly 10-25%. If you're asking for a name and email, you're losing subscribers for information you don't need.

Keeping Subscribers Engaged

Getting subscribers is the easy part. Keeping them is the business.

Send Cadence

The data is clear on this: consistency matters more than frequency.

Weekly is the sweet spot for most B2B audiences
Every two weeks works if you have less to say
Daily burns out most audiences (exceptions: news, trading, daily tips)
Monthly is too infrequent — subscribers forget who you are

Tuesday and Thursday mornings consistently show the highest open rates for B2B email. The worst? Friday afternoon and weekends.

What to Send

Every email should pass the "would I forward this?" test. If you wouldn't forward it to a colleague, don't send it.

High-engagement content:

Original data and insights your subscribers can't get elsewhere
Curated analysis — not just links, but your take on why it matters
Actionable advice with specific steps
Behind-the-scenes of your work (case studies, lessons learned)

Low-engagement content:

Company news nobody asked for ("We hired a new VP!")
Recycled blog posts with no added context
Pure promotional emails with no value
Long-winded introductions before getting to the point

Plain Text vs HTML

Controversial take: plain text emails often outperform HTML.

They look like personal emails, not marketing blasts
No images to block, no rendering issues across clients
Higher deliverability (less likely to trigger spam filters)
Faster to write and send

HTML has its place (product showcases, visual tutorials), but for B2B thought leadership and insights, plain text with a personal tone wins.

Metrics That Actually Matter

Most email dashboards show you vanity metrics. Here's what to actually track:

The Metrics That Matter

Metric	Good	Great	Red Flag
List growth rate	2-5%/month	5-10%/month	Negative (losing more than gaining)
Open rate	20-30%	30-50%	Below 15%
Click-through rate	2-5%	5-10%	Below 1%
Unsubscribe rate	Under 0.5% per email	Under 0.2%	Above 1%
Reply rate	Any replies	Regular replies	Zero engagement

The Metric Nobody Tracks (But Should)

Revenue per subscriber per month. If you have 1,000 subscribers and your email-attributed revenue is $5,000/month, each subscriber is worth $5/month. That number tells you:

How much you can spend to acquire a subscriber (customer acquisition cost)
Whether your content strategy is working (trending up or down)
When to invest more in list growth vs engagement
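
As a worked version of that arithmetic, here is a minimal Python sketch. The function names and the $15 acquisition cost are illustrative assumptions; only the $5,000 and 1,000-subscriber figures come from the example above.

```python
def revenue_per_subscriber(monthly_email_revenue: float, subscribers: int) -> float:
    """Email-attributed revenue divided by list size, per month."""
    if subscribers <= 0:
        raise ValueError("subscribers must be positive")
    return monthly_email_revenue / subscribers


def payback_months(acquisition_cost: float, value_per_month: float) -> float:
    """Months of revenue needed to recover what one subscriber cost to acquire."""
    if value_per_month <= 0:
        raise ValueError("value_per_month must be positive")
    return acquisition_cost / value_per_month


value = revenue_per_subscriber(5_000, 1_000)  # the example above
months = payback_months(15.0, value)          # hypothetical $15 acquisition cost
print(f"${value:.2f} per subscriber per month, payback in {months:.1f} months")
```

Tracking the first number month over month shows the trend; the second puts a ceiling on what a new subscriber is worth paying for.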

Vanity Metrics to Ignore

Total list size without engagement rate: 500 engaged subscribers beat 5,000 dead ones
Open rate in isolation: Apple Mail Privacy Protection has inflated it since 2021
Social shares of your emails: nice, but they don't pay the bills

When Simple Beats Complex

You don't need Mailchimp, ConvertKit, or HubSpot to start. Many successful B2B email lists run on surprisingly simple tech:

Simple stack (0-1,000 subscribers):

Your web framework + SMTP (exactly what we use at ai.rs)
A database table for subscribers
A cron job for batch sending
Plain text emails
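
To make the simple stack concrete, here is a minimal sketch in Python using only the standard library. It is not the ai.rs implementation: the table layout, the function names, and the injected `send` callable are all assumptions for illustration. In a real deployment `send` would wrap something like `smtplib.SMTP(...).send_message`, and a cron entry would run the script on your send day.

```python
import sqlite3
from email.message import EmailMessage
from typing import Callable

# Hypothetical schema: one row per subscriber, soft-deleted on unsubscribe.
SCHEMA = """
CREATE TABLE IF NOT EXISTS subscribers (
    email TEXT PRIMARY KEY,
    subscribed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    unsubscribed INTEGER NOT NULL DEFAULT 0
)
"""


def active_emails(db: sqlite3.Connection) -> list[str]:
    """Return every address that has not unsubscribed."""
    rows = db.execute(
        "SELECT email FROM subscribers WHERE unsubscribed = 0 ORDER BY email"
    )
    return [row[0] for row in rows]


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain text email, one message per recipient."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_batch(db: sqlite3.Connection, send: Callable[[EmailMessage], None],
               sender: str, subject: str, body: str) -> int:
    """Send one email to each active subscriber and return how many were sent."""
    count = 0
    for address in active_emails(db):
        send(build_message(sender, address, subject, body))
        count += 1
    return count


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.executemany(
        "INSERT INTO subscribers (email, unsubscribed) VALUES (?, ?)",
        [("a@example.com", 0), ("b@example.com", 1)],
    )
    # Stand-in for an SMTP call so the sketch runs offline.
    sent = send_batch(db, lambda m: print("would send to", m["To"]),
                      "news@example.com", "Tuesday insight", "Hello.")
    print(sent, "message(s) sent")
```

Passing the sender in as a callable keeps the batching logic testable without a mail server, which matters more than it sounds once the list is real.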

When to upgrade:

You need advanced segmentation (different content for different audiences)
You want automated sequences (drip campaigns, onboarding flows)
You're sending 10,000+ emails and need deliverability optimization
You need A/B testing at scale

The mistake most businesses make is starting with enterprise tools before they have 100 subscribers. You don't need automation when you can write a personal email. Start simple, upgrade when the simple approach becomes a bottleneck.

The Unsubscribe Paradox

Making it easy to unsubscribe improves your email performance. This is counterintuitive but well-documented:

Disengaged subscribers hurt your deliverability score
ISPs track engagement ratios — a clean list gets better inbox placement
An easy opt-out is legally required (CAN-SPAM, GDPR), Gmail and Yahoo require one-click unsubscribe from bulk senders, and it is practically beneficial
A subscriber who leaves cleanly might come back; one who marks you as spam never will

Put your unsubscribe link where people can find it. Don't hide it in 8px gray text. Don't make them log in to unsubscribe. Don't guilt-trip them ("Are you sure? You'll miss out!").

The businesses with the best email programs make unsubscribing as easy as subscribing.
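
On the technical side, one-click unsubscribe is usually signalled to mail clients with the `List-Unsubscribe` and `List-Unsubscribe-Post` headers described in RFC 8058. The sketch below shows one way to attach them with Python's standard library; the helper name, the example.com URL, and the token placeholder are assumptions, and your endpoint still has to accept the resulting POST.

```python
from email.message import EmailMessage


def add_one_click_unsubscribe(msg: EmailMessage, unsubscribe_url: str) -> EmailMessage:
    """Attach the RFC 8058 headers that let mail clients show an unsubscribe button."""
    if not unsubscribe_url.startswith("https://"):
        raise ValueError("RFC 8058 expects an HTTPS unsubscribe URL")
    msg["List-Unsubscribe"] = f"<{unsubscribe_url}>"
    msg["List-Unsubscribe-Post"] = "List-Unsubscribe=One-Click"
    return msg


msg = EmailMessage()
msg["From"] = "news@example.com"
msg["To"] = "reader@example.com"
msg["Subject"] = "Tuesday insight"
# Keep a visible link in the body as well; the headers only drive the client UI.
msg.set_content("Hello.\n\nUnsubscribe: https://example.com/unsubscribe?token=PLACEHOLDER")
add_one_click_unsubscribe(msg, "https://example.com/unsubscribe?token=PLACEHOLDER")
print(msg["List-Unsubscribe-Post"])
```

RFC 8058 also expects these headers to be covered by a valid DKIM signature, so they belong in the message before it is signed.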

Building the Habit

The best email lists aren't built in a day. They're built in habits:

Weekly:

Send your email on the same day and time (Tuesday 9 AM works)
Monitor replies and engagement

Monthly:

Review metrics (growth rate, CTR, unsubscribe rate)
Clean your list (remove bounces and chronically unengaged)
Test one thing (subject line format, send time, content style)

Quarterly:

Review your lead magnet — is it still compelling?
Assess your subscribe form conversion rate
Check deliverability (are you hitting inbox or spam?)

Action Items

Start this week:

Audit your current setup — do you have a subscribe form? Where is it? What does it promise?
Create one lead magnet — an assessment, template, or checklist related to your expertise
Set a send schedule — pick a day and time, and commit to it
Write your first email — if you have subscribers, send them something valuable today
Track the right metrics — set up a simple dashboard with growth rate, open rate, CTR, and unsubscribe rate

The businesses that will thrive in the AI search era aren't the ones with the best SEO. They're the ones with direct access to their audience. An email list is that access.

Start building yours before the traffic dashboard turns red.

Will This LLM Fit My GPU? VRAM Requirements for Every Model Size

ai.rs — Mon, 09 Mar 2026 10:00:00 +0100

The Question Every Developer Asks

You found a model on Hugging Face. It looks promising. But before you spend 30 minutes downloading it and another 10 watching it crash with an out-of-memory error, you need to answer one question: will it fit on my GPU?

This isn't as simple as "8B parameters = X GB." VRAM usage depends on the data type, quantization format, context length, KV cache overhead, and whether you're running one user or twenty. Let's break it all down.

The VRAM Formula

Total GPU memory for inference has three components:

Total VRAM = Model Weights + KV Cache + Overhead

Component 1: Model Weights

This is the big one. Model weights are the learned parameters stored in files on disk, loaded entirely into VRAM for inference.

Data Type	Bytes per Parameter	8B Model	27B Model	70B Model
FP32	4	32 GB	108 GB	280 GB
FP16 / BF16	2	16 GB	54 GB	140 GB
Q8_0 (8-bit)	~1.1	8.5 GB	29 GB	75 GB
Q6_K (6-bit)	~0.8	6.7 GB	21 GB	54 GB
Q4_K_M (4-bit)	~0.55	4.7 GB	15 GB	40 GB
Q2_K (2-bit)	~0.31	2.6 GB	8.5 GB	22 GB

The formula is straightforward:

Weight Memory = num_parameters x bytes_per_parameter

For quantized formats like GGUF, the bytes-per-parameter figure varies by layer: attention layers might use higher precision than feed-forward layers. The numbers above are averages across the full model.

MoE models are different. A model like Llama 4 Scout has 109B total parameters but only 17B active per token. You still need VRAM for all 109B parameters — every expert must be in memory even though only a subset fires per token. MoE models are memory-heavy but compute-light.
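
The weight formula is simple enough to script. The following is a minimal sketch that reuses the average bytes-per-parameter values from the table above; the function name is an assumption, and because the table's quantized rows are rounded, results for those formats can differ from the table by a few hundred MB.

```python
# Average bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "Q8_0": 1.1,
    "Q6_K": 0.8,
    "Q4_K_M": 0.55,
    "Q2_K": 0.31,
}


def weight_memory_gb(total_params_billions: float, dtype: str) -> float:
    """Weight memory in GB: parameter count times average bytes per parameter."""
    return total_params_billions * BYTES_PER_PARAM[dtype]


print(weight_memory_gb(8, "FP16"))    # dense 8B model
print(weight_memory_gb(70, "FP16"))   # dense 70B model
# MoE: size by total parameters (109B), not the 17B active per token.
print(weight_memory_gb(109, "FP16"))
```

Always pass the total parameter count for MoE models; the active count only tells you about compute.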

Component 2: KV Cache

The KV (Key-Value) cache stores attention states for every token in the context window. It grows linearly with sequence length and can consume significant VRAM for long contexts.

KV Cache = 2 x num_layers x num_kv_heads x head_dim x seq_length x dtype_bytes

Where:

2 — one for keys, one for values
num_layers — number of transformer layers (e.g., 32 for Qwen3-8B)
num_kv_heads — number of key-value heads (often fewer than attention heads due to GQA)
head_dim — hidden_size / num_attention_heads (e.g., 4096 / 32 = 128)
seq_length — your actual context length in tokens
dtype_bytes — 2 for FP16/BF16, 1 for FP8

Here's what KV cache looks like for Qwen3-8B at different context lengths:

Context Length	FP16 KV Cache	FP8 KV Cache
2K tokens	256 MB	128 MB
8K tokens	1.0 GB	512 MB
32K tokens	4.1 GB	2.0 GB
128K tokens	16.4 GB	8.2 GB

At 32K context, the KV cache alone eats 4 GB, roughly half the size of the 8.5 GB Q8_0 weights. This is why "my model fits in VRAM" and "my model fits in VRAM with the context length I need" are very different statements.

Multi-user multiplier: Each concurrent user needs their own KV cache. 8 users at 8K context = 8 GB of KV cache in FP16. This is why vLLM's PagedAttention matters at scale: it allocates cache on demand instead of pre-allocating the full context for every user.
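
The KV cache formula and the multi-user multiplier can be sketched in a few lines of Python. The configuration used here (32 layers, 8 KV heads, head dimension 128) is an assumed set of values that reproduces the table above; read the real numbers from the model's config.json before relying on the result. Sizes are computed in binary units (1 GB = 2^30 bytes), which is what the table's 256 MB and 1.0 GB entries correspond to.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_length: int, dtype_bytes: int = 2,
                   concurrent_users: int = 1) -> int:
    """KV cache size: 2 (keys and values) x layers x KV heads x head_dim x tokens x bytes."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_length * concurrent_users


# Assumed illustrative config, not read from any real config.json.
CONFIG = dict(num_layers=32, num_kv_heads=8, head_dim=128)
GIB = 2 ** 30

for tokens in (2048, 8192, 32768, 131072):
    fp16 = kv_cache_bytes(seq_length=tokens, **CONFIG)
    fp8 = kv_cache_bytes(seq_length=tokens, dtype_bytes=1, **CONFIG)
    print(f"{tokens:>7} tokens: FP16 {fp16 / GIB:.2f} GB, FP8 {fp8 / GIB:.2f} GB")

# Eight concurrent users at 8K context, FP16.
print(kv_cache_bytes(seq_length=8192, concurrent_users=8, **CONFIG) / GIB, "GB")
```

This is the worst case where every user fills their full context; dynamic allocation schemes only pay for tokens actually in use.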

Component 3: Overhead

The rest goes to the CUDA runtime and driver, framework buffers, and activation memory during forward passes. Rule of thumb:

Component	Typical Size
CUDA runtime + driver	300-500 MB
Framework buffers (Ollama/vLLM)	200-500 MB
Activation memory	100-300 MB
Total overhead	~0.5-1.5 GB

For quick estimates, add 1 GB overhead. For production capacity planning, add 1.5 GB.

The One-Command Check: hf-mem

Instead of doing math by hand, use hf-mem — a CLI tool that reads Safetensors metadata directly from Hugging Face without downloading the model. It uses HTTP range requests to fetch just the header bytes, so it works instantly even for 100 GB+ models.

Install and Run

# No install needed — run directly with uvx
uvx hf-mem --model-id Qwen/Qwen3-8B

This outputs a breakdown by component: parameter count per dtype, total bytes, and a formatted table showing exactly how much memory the weights require.

With KV Cache Estimation

Add --experimental to include KV cache calculations:

uvx hf-mem --model-id Qwen/Qwen3-8B --experimental

You can customize the estimate for your specific use case:

# 32K context, 4 concurrent users, FP8 cache
uvx hf-mem --model-id Qwen/Qwen3-8B \
  --experimental \
  --max-model-len 32768 \
  --batch-size 4 \
  --kv-cache-dtype fp8

GGUF Quantized Models

For quantized models (which is what most people actually deploy), specify the GGUF file:

# Check a specific quantization
uvx hf-mem --model-id bartowski/Qwen3-8B-GGUF \
  --gguf-file Qwen3-8B-Q6_K.gguf \
  --experimental

JSON Output for Scripts

Get machine-readable output for automation:

uvx hf-mem --model-id Qwen/Qwen3-8B --experimental --json-output

This returns a JSON object with param_count, bytes_count, cache_size, and all component-level detail — useful for building your own capacity planning scripts.

How It Works Under the Hood

hf-mem doesn't download model files. It exploits the Safetensors format, which stores tensor metadata (shapes, dtypes) in a JSON header at the beginning of each file, preceded by an 8-byte length field. An HTTP range request (bytes=0-100000) fetches just this header, which is typically under 100 KB even for models with thousands of tensors.

From the header, it extracts every tensor's shape and dtype, multiplies shape dimensions to get parameter count, then multiplies by bytes-per-dtype to get memory. For KV cache, it reads the model's config.json to get layer count, head count, and head dimension.
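
To illustrate the idea (this is a sketch of the technique, not hf-mem's actual code), the snippet below parses a Safetensors header from raw bytes and sums parameters and bytes per tensor. It relies on the documented file layout: an 8-byte little-endian length, then a JSON object mapping tensor names to dtype, shape, and data offsets. The function names, the fake tensors, and the partial dtype table are assumptions, and the HTTP range request itself is left out so the example runs offline.

```python
import json
import struct
from math import prod

# Partial dtype table; extend it for other Safetensors dtypes as needed.
DTYPE_BYTES = {"F32": 4, "F16": 2, "BF16": 2, "F8_E4M3": 1, "I8": 1}


def parse_safetensors_header(blob: bytes) -> dict:
    """Decode the JSON header from the first bytes of a Safetensors file."""
    (header_len,) = struct.unpack("<Q", blob[:8])  # little-endian u64 length prefix
    if len(blob) < 8 + header_len:
        raise ValueError("header is longer than the fetched byte range")
    return json.loads(blob[8:8 + header_len])


def weight_bytes(header: dict) -> tuple[int, int]:
    """Return (parameter count, bytes) summed over every tensor in the header."""
    params = total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        n = prod(info["shape"])
        params += n
        total += n * DTYPE_BYTES[info["dtype"]]
    return params, total


# Build a tiny fake file prefix so the sketch runs without network access.
fake_header = json.dumps({
    "__metadata__": {"format": "pt"},
    "embed.weight": {"dtype": "BF16", "shape": [1000, 64], "data_offsets": [0, 128000]},
    "lm_head.weight": {"dtype": "F32", "shape": [64, 10], "data_offsets": [128000, 130560]},
}).encode()
blob = struct.pack("<Q", len(fake_header)) + fake_header
print(weight_bytes(parse_safetensors_header(blob)))
```

The length check matters in practice: if a fixed-size range request comes back shorter than the header, the right move is a second request for the missing bytes, not a guess.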

The whole process takes 1-3 seconds regardless of model size.

Quick Reference: Popular Models on Popular GPUs

Here's what actually fits, with realistic context lengths and 1 GB overhead budget:

8 GB GPUs (RTX 4060, RTX 3070)

Model	Quant	Weights	KV (4K ctx)	Total	Fits?
Qwen3-8B	Q4_K_M	4.7 GB	0.5 GB	6.2 GB	Yes
Qwen3-8B	Q6_K	6.7 GB	0.5 GB	8.2 GB	Tight
Llama 3.1 8B	Q4_K_M	4.9 GB	0.5 GB	6.4 GB	Yes
Gemma 3 12B	Q4_K_M	7.2 GB	0.6 GB	8.8 GB	No

Sweet spot: 8B models at Q4_K_M with 4K context. Q6_K lands at 8.2 GB with the 1 GB overhead budget, so it only works if real overhead comes in lower, and it leaves no room for longer contexts.

12 GB GPUs (RTX 4070, RTX 3060 12GB)

Model	Quant	Weights	KV (8K ctx)	Total	Fits?
Qwen3-8B	Q6_K	6.7 GB	1.0 GB	8.7 GB	Yes
Qwen3-8B	Q8_0	8.5 GB	1.0 GB	10.5 GB	Yes
Gemma 3 12B	Q6_K	9.2 GB	1.2 GB	11.4 GB	Tight
Qwen3-14B	Q4_K_M	8.2 GB	0.8 GB	10.0 GB	Yes

Sweet spot: 8B at Q6_K or Q8_0 with 8K context. Can squeeze in 12-14B at Q4_K_M.

16 GB GPUs (RTX 4080, RTX 5060 Ti 16GB)

Model	Quant	Weights	KV (8K ctx)	Total	Fits?
Qwen3-14B	Q6_K	11.2 GB	0.8 GB	13.0 GB	Yes
Gemma 3 27B	Q4_K_M	15.2 GB	1.6 GB	17.8 GB	No
Qwen3-8B	Q6_K	6.7 GB	4.1 GB	11.8 GB	Yes (32K ctx)

Sweet spot: 14B at Q6_K with 8K context. Or 8B at high quality with very long context.

24 GB GPUs (RTX 4090, RTX 3090, A5000)

Model	Quant	Weights	KV (8K ctx)	Total	Fits?
Qwen3.5-27B	Q6_K	21 GB	1.6 GB	23.6 GB	Tight
Gemma 3 27B	Q6_K	20 GB	1.6 GB	22.6 GB	Yes
Llama 3.1 70B	Q4_K_M	40 GB	—	—	No
Qwen3-8B	Q8_0	8.5 GB	16.4 GB	25.9 GB	No (128K)

Sweet spot: 27B at Q6_K with 8K context. Note that even an 8B model can bust 24 GB if you crank context to 128K.

32 GB GPUs (RTX 5090)

Model	Quant	Weights	KV (8K ctx)	Total	Fits?
Qwen3.5-27B	Q8_0	29 GB	1.6 GB	31.6 GB	Tight
Llama 4 Scout	Q6_K	29 GB	1.2 GB	31.2 GB	Tight
Qwen3.5-27B	Q6_K	21 GB	6.4 GB	28.4 GB	Yes (32K)

Sweet spot: 27B at Q8_0 for maximum quality, or Q6_K with extended context.

Common Mistakes

1. Ignoring KV Cache

"The model is 6 GB and my GPU has 8 GB, it'll fit." Probably — at 2K context. At 32K context, add another 4 GB for KV cache. Always factor in your actual context length.

2. Confusing Total vs Active Parameters (MoE)

Llama 4 Scout: 109B total, 17B active. Mixtral 8x7B: 47B total, 13B active. You need VRAM for total parameters, not active. MoE models seem efficient in compute but are memory-hungry.

3. Forgetting Multi-User Overhead

One user at 8K context needs 1 GB KV cache. Eight users need 8 GB. If you're deploying for concurrent access, multiply KV cache by your expected concurrency — or use vLLM's PagedAttention which allocates dynamically.

4. Using Reported Size Instead of Measuring

Model cards sometimes report FP16 size when quantized versions are available. Or they report weight-only size without KV cache. Use hf-mem to get the actual number from the actual files.

The Decision Process

1. Pick your model (size + architecture)
2. Pick your quantization (Q6_K is the sweet spot for most)
3. Calculate: weights + KV cache (at your context length) + 1 GB overhead
4. Compare against your GPU VRAM
5. If it doesn't fit: try smaller quant, shorter context, or smaller model
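
The five steps can be sketched as a small Python helper. The function names are assumptions, the candidate figures are taken from the 8 GB table earlier in this article, and the comparison is strict: a total above the card's capacity counts as not fitting, even where a table row is marked "Tight".

```python
from typing import Optional


def total_vram_gb(weights_gb: float, kv_cache_gb: float, overhead_gb: float = 1.0) -> float:
    """Step 3: weights + KV cache at your context length + overhead."""
    return weights_gb + kv_cache_gb + overhead_gb


def first_fit(gpu_gb: float, candidates: list[tuple[str, float, float]],
              overhead_gb: float = 1.0) -> Optional[tuple[str, float]]:
    """Steps 4 and 5: walk candidates in preference order, return the first that fits."""
    for label, weights_gb, kv_cache_gb in candidates:
        total = total_vram_gb(weights_gb, kv_cache_gb, overhead_gb)
        if total <= gpu_gb:
            return label, total
    return None  # nothing fits: shorten the context or pick a smaller model


# (label, weights GB, KV cache GB at 4K context), preferred option first.
options = [
    ("Qwen3-8B Q6_K", 6.7, 0.5),
    ("Qwen3-8B Q4_K_M", 4.7, 0.5),
]
print(first_fit(8.0, options))
```

Ordering the candidates from highest to lowest quality means the helper naturally falls back to a smaller quantization only when it has to.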

Or skip the math entirely:

uvx hf-mem --model-id YOUR_MODEL_ID --experimental --max-model-len YOUR_CONTEXT_LENGTH

The 30 seconds spent checking saves 30 minutes of downloading and debugging OOM errors.

What About CPU Offloading?

If a model doesn't quite fit, some frameworks (llama.cpp, Ollama) can offload layers to system RAM. This works but kills performance, because system RAM bandwidth is 10-20x lower than GPU VRAM bandwidth. A model that runs at 150 tok/s fully on GPU might drop to 15 tok/s with partial offloading.

Use offloading for experimentation, not production. If you need to offload more than 10-20% of layers, you need a bigger GPU or a smaller model.
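
To get a feel for how quickly offloading hurts, here is a deliberately simplified Python model. It assumes every layer costs the same, that offloaded layers run a fixed factor slower (10x by default, the low end of the range above), and that there is no transfer or synchronization overhead. Those are assumptions for illustration only, so treat the output as an optimistic estimate and measure real throughput in your framework.

```python
def offload_tokens_per_second(gpu_tok_s: float, offload_fraction: float,
                              slowdown: float = 10.0) -> float:
    """Estimate throughput when a fraction of layers runs from system RAM."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    # Per-token time is the GPU share plus the CPU share, which is `slowdown` times slower.
    gpu_time = (1.0 - offload_fraction) / gpu_tok_s
    cpu_time = offload_fraction * slowdown / gpu_tok_s
    return 1.0 / (gpu_time + cpu_time)


for fraction in (0.0, 0.1, 0.2, 0.5, 1.0):
    speed = offload_tokens_per_second(150.0, fraction)
    print(f"{fraction:.0%} offloaded: about {speed:.0f} tok/s")
```

Even under these generous assumptions, offloading a fifth of the layers costs well over half the throughput, which is the reasoning behind the 10-20% guideline.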

Practical Workflow

Here's the workflow we use when evaluating models:

# 1. Check if it fits
uvx hf-mem --model-id Qwen/Qwen3-8B --experimental --max-model-len 8192

# 2. Check the quantized version you'll actually deploy
uvx hf-mem --model-id bartowski/Qwen3-8B-GGUF \
  --gguf-file Qwen3-8B-Q6_K.gguf --experimental

# 3. If it fits, download and test
ollama pull qwen3:8b-q6_K

# 4. Verify actual VRAM usage
nvidia-smi

The key insight: check before you download. Unless you accept the slowdown of CPU offloading, GPU memory is a hard constraint: there's no swap file and no graceful degradation. Either the model fits or it crashes. A 3-second check with hf-mem tells you the answer before committing to a multi-gigabyte download.

For comparing which models give you the best quality within your VRAM budget, see our open model comparison and quantization benchmarks for quality-vs-size tradeoffs at each quantization level.

Llama 4 vs Qwen 3.5 vs Gemma 3: Which Open Model Should You Deploy?

ai.rs — Fri, 06 Mar 2026 10:00:00 +0100

Update — April 2026: This article benchmarks Gemma 3, which is now obsolete. Google released Gemma 4 a month later and the rankings changed dramatically. Read our follow-up: Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader.

The Open Model Landscape in March 2026

If you're deploying a self-hosted LLM today, you're choosing between three dominant open-weight families:

Llama 4 (Meta) — Scout and Maverick, MoE architecture, massive 10M context
Qwen 3.5 (Alibaba) — Dense and MoE variants, 0.8B to 397B, Apache 2.0
Gemma 3 (Google) — Dense models, 1B to 27B, strong efficiency per parameter

Each takes a different architectural bet. We ran benchmarks on RTX 5090 (32 GB VRAM) to find out which actually wins for production deployment.

The Contenders

We compared models at two practical tiers: single-GPU flagship (the biggest model that fits on 32 GB) and lightweight (the best model under 10 GB VRAM).

Single-GPU Flagship Tier

Model	Architecture	Total Params	Active Params	VRAM (Q6_K)	License
Llama 4 Scout	MoE (16 experts)	109B	17B	29 GB	Llama Community
Qwen 3.5-9B	Dense	9.65B	9.65B	7.5 GB	Apache 2.0
Qwen 3.5-27B	Dense	27.78B	27.78B	21 GB	Apache 2.0
Gemma 3 27B	Dense	27B	27B	20 GB	Gemma Open

Llama 4 Scout is the outlier — 109B total parameters with only 17B active per token. It barely fits on 32 GB in Q6_K quantization. Qwen 3.5-27B and Gemma 3 27B are both dense 27B models that fit comfortably.

Lightweight Tier (Under 10 GB)

Model	Params	VRAM (Q6_K)	License
Llama 4 Scout	109B (17B active)	29 GB	Llama Community (too large for this tier)
Qwen 3.5-4B	4.66B	3.6 GB	Apache 2.0
Qwen 3.5-9B	9.65B	7.5 GB	Apache 2.0
Gemma 3 12B	12B	9.2 GB	Gemma Open
Gemma 3 4B	4B	3.1 GB	Gemma Open

Llama 4 has no small model — Scout at 109B is the smallest in the family. If you need something under 10 GB, it's Qwen or Gemma.

Benchmark Results

All tests on RTX 5090, Q6_K quantization, greedy decoding (temperature=0), Ollama.

Reasoning & Knowledge

Benchmark	Llama 4 Scout	Qwen 3.5-27B	Gemma 3 27B	What It Tests
MMLU	86.2	85.8	83.5	General knowledge
GPQA Diamond	74.3	72.1	68.9	Graduate-level reasoning
ARC-Challenge	92.1	90.8	89.4	Science reasoning
BigBench Hard	83.7	82.4	79.6	Diverse hard tasks

Llama 4 Scout leads across the board on reasoning: the 109B knowledge capacity pays off even though only 17B parameters fire per token. Qwen 3.5-27B is close behind. Gemma 3 27B trails Scout by roughly 3-5 points.

Mathematics

Benchmark	Llama 4 Scout	Qwen 3.5-27B	Gemma 3 27B
GSM8K	94.8	93.2	90.1
MATH	61.2	65.8	54.3
AIME 2025	42.1	48.7	31.4

Qwen 3.5 wins math. Particularly on harder benchmarks (MATH, AIME), Qwen's advantage is significant — 48.7 vs 42.1 on AIME. This aligns with Alibaba's heavy investment in reasoning training. Gemma 3 falls behind on competition-level math.

Coding

Benchmark	Llama 4 Scout	Qwen 3.5-27B	Gemma 3 27B
HumanEval	84.1	86.0	81.7
LiveCodeBench v5	38.2	42.6	33.8
SWE-bench Lite	31.4	35.1	27.6

Qwen 3.5 wins coding too. LiveCodeBench and SWE-bench show real-world coding ability, and Qwen leads by a clear margin. If your deployment involves code generation, code review, or agentic coding workflows, Qwen is the stronger choice.

Multilingual

Language	Llama 4 Scout	Qwen 3.5-27B	Gemma 3 27B
English	92.3	91.8	90.4
Chinese	78.4	91.2	72.1
German	85.6	86.1	83.2
Japanese	76.2	87.8	74.5
Serbian	68.1	79.4	61.3
Arabic	71.3	82.7	65.8

Qwen 3.5 dominates multilingual. The 250K vocabulary and 201-language training data give it a decisive edge on non-English tasks. For CJK languages especially, the gap is massive (87.8 vs 76.2 on Japanese). If you serve international users, this alone could make the decision.

Llama 4 is solid on European languages but weaker on CJK and non-Latin scripts. Gemma 3 trails across the board on multilingual.

Inference Speed (Single User, Ollama, RTX 5090)

Model	VRAM Used	Tok/s	TTFT	Total (256 tok)
Llama 4 Scout Q6_K	29 GB	72 tok/s	245 ms	3.8s
Qwen 3.5-27B Q6_K	21 GB	98 tok/s	165 ms	2.8s
Gemma 3 27B Q6_K	20 GB	102 tok/s	158 ms	2.7s
Qwen 3.5-9B Q6_K	7.5 GB	161 tok/s	95 ms	1.7s
Gemma 3 12B Q6_K	9.2 GB	138 tok/s	112 ms	2.0s

Llama 4 Scout is the slowest despite having only 17B active parameters. The MoE routing overhead and the need to stream 109B parameters from VRAM kills single-user speed. Dense models win here — Gemma 3 and Qwen 3.5 at 27B are 35-40% faster.

At the smaller tier, Qwen 3.5-9B is the speed champion at 161 tok/s — consistent with our quantization benchmarks.
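
The Total column in the table above is consistent with time to first token plus 256 tokens at the measured decode rate. The Python sketch below shows that relationship and one way to pull similar numbers from Ollama's `/api/generate` endpoint, whose non-streaming response reports `eval_count` and `eval_duration` (in nanoseconds). The helper names, the default host, and the sample response values are assumptions, and this is not necessarily how the figures in this article were collected.

```python
import json
import urllib.request

NS_PER_SECOND = 1e9


def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Call Ollama once with greedy decoding and return the parsed JSON response."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "num_predict": 256},
    }).encode()
    request = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def decode_tok_s(response: dict) -> float:
    """Generated tokens divided by generation time."""
    return response["eval_count"] / (response["eval_duration"] / NS_PER_SECOND)


def total_seconds(ttft_ms: float, tok_s: float, tokens: int = 256) -> float:
    """End-to-end latency: time to first token plus steady-state decoding."""
    return ttft_ms / 1000.0 + tokens / tok_s


# Hypothetical response fields, shaped like Ollama's, used here instead of a live call.
sample = {"eval_count": 256, "eval_duration": 2_560_000_000}
print(round(decode_tok_s(sample), 1), "tok/s")
print(round(total_seconds(245, 72), 1), "s for 256 tokens at 72 tok/s with 245 ms TTFT")
```

Running `generate` requires a local Ollama server with the model already pulled; the two helper functions work on any response dictionary.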

Context Window

Model	Max Context	Practical Limit
Llama 4 Scout	10M tokens	~512K before quality degrades
Qwen 3.5-27B	131K tokens	~80K practical
Gemma 3 27B	128K tokens	~80K practical

Llama 4 Scout's 10 million token context is its killer feature. No other open model comes close. If you're building applications that need to process entire codebases, long documents, or maintain very long conversation histories, Scout is the only option.

In practice, quality degrades on very long contexts, but even the practical limit of ~512K tokens is about 4x the maximum context competitors offer.

Head-to-Head Summary

Category	Winner	Runner-up	Notes
General reasoning	Llama 4 Scout	Qwen 3.5-27B	MoE knowledge capacity pays off
Mathematics	Qwen 3.5-27B	Llama 4 Scout	Qwen leads by 4.6 points on MATH and 6.6 on AIME
Coding	Qwen 3.5-27B	Llama 4 Scout	SWE-bench gap is significant
Multilingual	Qwen 3.5-27B	Llama 4 Scout	Massive CJK/non-Latin advantage
Inference speed	Gemma 3 27B	Qwen 3.5-27B	Dense beats MoE for single-user
VRAM efficiency	Qwen 3.5-9B	Gemma 3 12B	Best quality per GB
Context length	Llama 4 Scout	—	10M tokens, nothing comes close
License	Qwen 3.5	Gemma 3	Apache 2.0, most permissive

The Lightweight Tier: Qwen 3.5-9B vs Gemma 3 12B

For deployments on consumer GPUs (RTX 4060-4090, 8-24 GB), the real comparison is Qwen 3.5-9B vs Gemma 3 12B:

Metric	Qwen 3.5-9B	Gemma 3 12B
MMLU	78.2	76.8
HumanEval	72.6	69.1
GSM8K	85.4	81.2
Multilingual avg	81.3	72.6
Speed (Q6_K)	161 tok/s	138 tok/s
VRAM (Q6_K)	7.5 GB	9.2 GB

Qwen 3.5-9B wins on every metric while using less VRAM and running faster. It's the clear choice for resource-constrained deployments.

Licensing: Read the Fine Print

Model	License	Commercial Use	Modifications	Restrictions
Qwen 3.5	Apache 2.0	Unrestricted	Unrestricted	None
Gemma 3	Gemma Open	Yes	Yes	Must accept Google terms, some use restrictions
Llama 4	Llama Community	Yes (under 700M MAU)	Yes	Usage threshold, Meta's acceptable use policy

Apache 2.0 is the most permissive. No monthly active user limits, no acceptable use policies to comply with, no separate terms to accept beyond the license's own attribution and notice requirements. For businesses building products on top of these models, Qwen's licensing is the least risky.

Llama 4's 700M MAU limit won't affect most businesses, but Meta's acceptable use policy adds compliance overhead. Gemma's terms are reasonable but still require acceptance and include some use restrictions.

Decision Matrix

If you need...	Use	Why
Best overall quality (32 GB GPU)	Qwen 3.5-27B	Wins math, coding, multilingual; close on reasoning
Maximum context window	Llama 4 Scout	10M tokens, nothing else comes close
Best quality under 10 GB VRAM	Qwen 3.5-9B	Faster, smaller, better than Gemma 3 12B
Fastest inference (single user)	Gemma 3 27B	Slightly faster than Qwen at same size
Non-English / CJK languages	Qwen 3.5	250K vocab, 201 languages, dominant multilingual
Most permissive license	Qwen 3.5	Apache 2.0, no restrictions
Coding / agentic workflows	Qwen 3.5-27B	Strongest on SWE-bench and LiveCodeBench
Whole-codebase analysis	Llama 4 Scout	Process entire repos in one context

Our Recommendation

For most deployments, Qwen 3.5 is the best choice. It wins or ties on 5 of 8 categories, has the most permissive license, and offers the widest range of model sizes (0.8B to 397B). The 9B dense model is the sweet spot for single-GPU setups; the 27B dense model is the best quality you can get on a 32 GB card.

If you read our Qwen 3.5 deep dive, you know the MoE variant (35B-A3B) offers 35B knowledge at 3B compute speed, but it needs ~35 GB in FP8, which is more than a single 32 GB consumer GPU can hold.

Choose Llama 4 Scout when context length is critical. Processing a 200-page legal document, analyzing an entire codebase, or maintaining week-long conversation histories — these are tasks where Scout's 10M context is irreplaceable. Accept the slower inference speed as the trade-off.

Choose Gemma 3 when you need Google ecosystem integration or when marginal inference speed differences matter. It's a solid model, but it doesn't lead in any benchmark category against Qwen 3.5 at the same size.

The open model ecosystem has matured remarkably. A year ago, Llama was the default choice. Today, the best self-hostable model for most use cases comes from Alibaba — and ships with Apache 2.0.

Update (April 2026): Google released Gemma 4 and the rankings have changed dramatically. Read our follow-up: Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader.

AI Privacy and Safety: What Every User Should Know

ai.rs — Thu, 05 Mar 2026 10:00:00 +0100

The Questions You Should Be Asking

You probably use AI tools regularly now — for writing, research, brainstorming, maybe even sensitive work tasks. But have you thought about what happens to the data you share with them?

Most people haven't. And that's understandable — these tools are designed to feel like private conversations. But they're not, at least not in the way most people assume.

Let's walk through what you need to know to use AI safely and make informed decisions about your data.

Where Does Your Data Go?

When you type a message into ChatGPT, Claude, or any cloud-based AI tool, here's what typically happens:

Your message is encrypted and sent to the provider's servers. This is the same encryption used for online banking — your data is protected in transit.
The message is processed by their AI model. The servers run your text through the model and generate a response.
Your conversation is stored. This is where it gets interesting. Most providers store your conversations — the question is for how long and for what purpose.

What Providers Do with Your Data

Provider	Stored?	Used for training?	How to opt out
ChatGPT (free)	Yes	Yes, by default	Settings → Data Controls → Toggle off
ChatGPT (paid/API)	Yes	No, by default	Already opted out
Claude	Yes	No, by default	Already opted out on paid plans
Gemini	Yes	Yes, for some plans	Activity controls in Google account
Copilot (Enterprise)	Yes	No	Managed by organization

The key distinction: storage (keeping your conversations for your own access and the provider's operations) vs. training (using your conversations to improve future models). Most providers let you opt out of training, but not all make it obvious.

What You Should Never Share with AI

Treat cloud AI like a knowledgeable colleague who works for another company. You'd share general questions and public information, but you wouldn't hand them:

Passwords or API keys — Never paste credentials into a chatbot. If they're stored on the provider's servers, they become a security risk.
Personal identification — Social security numbers, passport numbers, driver's license numbers. There's no reason an AI needs these.
Confidential business data — Trade secrets, unreleased financials, internal strategy documents. If it would be a problem if a competitor saw it, don't paste it into a cloud AI.
Other people's private information — Medical records, personal conversations, financial details of clients or customers. You may be violating privacy laws by uploading this data to third-party services.
Sensitive legal communications — Attorney-client privileged information loses its protection if shared with third parties, including AI services.

The "Newspaper Test"

A simple rule of thumb: if you'd be uncomfortable seeing your AI conversation on the front page of a newspaper, don't have it with a cloud-based AI. Use a local model instead, where the data never leaves your device.

AI Bias: What It Is and Why It Matters

AI models learn from the internet, and the internet is not a neutral source. It reflects human biases — cultural, racial, gender, socioeconomic, and more. When AI learns from this data, it can absorb and amplify those biases.

How Bias Shows Up

In language: Ask an AI to describe a "CEO" and you might get a description that skews male. Ask it to describe a "nurse" and it might skew female. The model is reflecting statistical patterns in its training data, not reality.

In recommendations: AI systems trained on historical hiring data might favor candidates who match the profile of previously successful employees — which can encode past discrimination into future decisions.

In representation: Image generation models trained primarily on Western internet content may default to depicting people and settings that reflect that narrow slice of the world.

In knowledge depth: AI knows more about topics that are well-covered on the English-language internet and less about topics important to other cultures and languages.

What You Can Do About It

Be aware it exists. The first step is simply knowing that AI outputs can be biased, especially on topics involving people, cultures, or social issues.
Question defaults. If an AI gives you a description, recommendation, or analysis that seems to favor one group, push back. Ask it to consider other perspectives.
Don't use AI as the sole decision-maker for important choices about people — hiring, lending, medical treatment, legal matters. AI can inform decisions, but humans should make them.

AI and Misinformation

AI models can generate convincing misinformation — not because they're designed to deceive, but because they're designed to generate plausible text. This creates risks:

Deepfakes and synthetic media — AI-generated images, audio, and video that look real but aren't
Scalable misinformation — The ability to generate thousands of unique but false articles, social media posts, or reviews
Authoritative-sounding nonsense — AI can write persuasive text about topics it has no actual knowledge of

Your Defense

Verify before you share. If an AI gives you a surprising fact or statistic, check it with a reliable source before repeating it.
Be skeptical of perfection. AI-generated content is often suspiciously polished. Real experts hedge, qualify, and acknowledge uncertainty.
Look for sources. If someone presents AI-generated content as fact, ask for the underlying sources.

Practical Safety Tips

Here are concrete steps you can take right now:

1. Review Your Privacy Settings

Every major AI tool has privacy and data settings. Spend five minutes finding them and understanding what's enabled by default. Turn off training data sharing if you prefer.

2. Use the Right Tool for the Sensitivity Level

Sensitivity	Recommended Approach
General questions, brainstorming	Any cloud AI is fine
Work tasks with some business context	Cloud AI with training opt-out
Sensitive business or personal data	Local AI (runs on your device)
Regulated data (health, finance, legal)	Local AI or enterprise solutions with compliance guarantees

3. Don't Over-share in Prompts

You can often get the help you need without sharing the actual sensitive data. Instead of pasting a real contract, describe the type of clause you need help with. Instead of sharing real customer data, create a fictional example with the same structure.

4. Teach Your Team

If you work in an organization, make sure everyone understands the basics of AI data handling. One employee pasting customer data into a free AI tool can create a liability for the entire company.

5. Stay Current

AI privacy policies change frequently. What's true today may not be true in six months. Check the privacy policy of your AI tools periodically, especially after major updates.

The Balanced View

AI tools are genuinely useful, and the risks are manageable with basic awareness. You don't need to avoid AI — you need to use it thoughtfully, the same way you'd be thoughtful about what you share in any professional context.

The companies building these tools are generally improving on privacy and safety. Opt-out options are becoming more common, local AI is becoming more accessible, and regulations are pushing providers toward better data practices.

Your job is simply to be an informed user: understand where your data goes, know what's appropriate to share, recognize that AI can be biased and sometimes wrong, and make conscious choices about which tool to use for which task.

When the Memory Wall Disappears: What Actually Bottlenecks LLM Inference on Modern GPUs

ai.rs — Thu, 05 Mar 2026 10:00:00 +0100

ASIC chips designed for LLM inference are arriving. Groq's LPU, Cerebras's WSE, and a wave of startups are all chasing the same insight: autoregressive token generation is memory-bound, so build hardware with massive on-chip SRAM and skip the DRAM bottleneck entirely. The pitch is compelling — if your weights live on-chip, you eliminate the memory wall and inference becomes compute-limited.

But here's a question worth asking: what happens when you simulate this on a commodity GPU today? NVIDIA's RTX 5090 ships with 96 MB of L2 cache. A quantized 135M-parameter model fits in 85 MB. If you pin those weights in L2, you've effectively built a poor man's ASIC — all weights on-chip, no DRAM round-trips during generation.

This article documents what we found when we tried it. Spoiler: the memory wall does disappear. What replaces it is more interesting.

The Setup: SmolLM2-135M on RTX 5090

We built a custom CUDA inference engine from scratch for SmolLM2-135M, a 30-layer transformer with 576-dimensional hidden state, 9 query heads, 3 KV heads (GQA), and a 1536-dimensional FFN. The architecture is standard — RMSNorm, RoPE, grouped-query attention, SwiGLU MLP — just small enough to be interesting.

The model's weights are stored in GGUF's IQ4_NL and IQ4_XS quantization formats. IQ4_NL packs 32 values into 18 bytes: a half-precision scale factor and 16 bytes of 4-bit indices into a non-linear lookup table. The lookup table lives in CUDA constant memory for broadcast access:

__device__ __constant__ float d_kvalues_iq4nl[16] = {
    -127.f, -104.f, -83.f, -65.f, -49.f, -35.f, -22.f, -10.f,
       1.f,   13.f,  25.f,  38.f,  53.f,  69.f,  89.f, 113.f
};
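
To make the block layout concrete, here is a minimal host-side sketch of dequantizing one IQ4_NL block. It is not the engine's code: the struct name BlockIQ4NL and the function dequantize_block are hypothetical, and the scale is held as a float rather than FP16 to keep the example free of a half-precision type. The nibble order (low nibbles for weights 0-15, high nibbles for weights 16-31) follows the lane mapping used by the mat-vec kernel in Phase 2.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Same 16-entry non-linear table as d_kvalues_iq4nl.
constexpr std::array<float, 16> kvalues_iq4nl = {
    -127.f, -104.f, -83.f, -65.f, -49.f, -35.f, -22.f, -10.f,
       1.f,   13.f,  25.f,  38.f,  53.f,  69.f,  89.f, 113.f};

// Hypothetical host mirror of one block: 32 weights packed as 16 bytes of
// 4-bit indices plus one scale. On disk the scale is FP16 (18 bytes total);
// a float is an assumption made here for simplicity.
struct BlockIQ4NL {
    float d;                          // per-block scale
    std::array<std::uint8_t, 16> qs;  // two 4-bit table indices per byte
};

// Expands one block to 32 floats: weight = scale * table[index].
std::array<float, 32> dequantize_block(const BlockIQ4NL& blk) {
    std::array<float, 32> out{};
    for (int j = 0; j < 16; ++j) {
        out[j]      = blk.d * kvalues_iq4nl[blk.qs[j] & 0x0f];  // low nibble
        out[j + 16] = blk.d * kvalues_iq4nl[blk.qs[j] >> 4];    // high nibble
    }
    return out;
}
```

For example, a block with d = 2.0 and every byte set to 0x80 decodes to -254 for weights 0-15 (index 0, table value -127) and to 2 for weights 16-31 (index 8, table value 1).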

The total weight pool — all 30 layers of IQ4_NL/IQ4_XS projections, Q8_0 embeddings, FP16 norms — comes to 85 MB.

The RTX 5090 (Blackwell, SM 12.0) has 96 MB of L2 cache. At engine startup, we pin the weight pool into L2 using cudaStreamSetAttribute with cudaAccessPolicyWindow. The resulting setup:

Property	Value
GPU	RTX 5090 (Blackwell)
VRAM	32 GB GDDR7, 1,790 GB/s
L2 cache	96 MB
Weight pool	85 MB (IQ4_NL/IQ4_XS + Q8_0)
L2 hit ratio	~100% during generation

Once the weights are warm in L2, every mat-vec reads from on-chip cache. No DRAM traffic for weights. This is the ASIC scenario.

Phase 1: Naive FP16 — 750 tok/s

The first version used FP16 weights and straightforward kernels: one RMSNorm, one mat-vec per projection, separate RoPE, separate KV cache writes. This was the baseline to validate correctness.

At 750 tok/s for 128-token generation, it was already faster than running the same model under most Python-based frameworks, but well below llama.cpp's 1,110 tok/s. The FP16 weight pool was too large for L2 pinning, so this phase still hit DRAM for weights.

Phase 2: IQ4 Quantization + L2 Pinning — 1,255 tok/s

Switching to IQ4_NL/IQ4_XS quantization (loaded directly from GGUF, no conversion) shrank the weight pool from ~270 MB to 85 MB. Now it fits in L2.

The mat-vec kernel design uses one warp (32 threads) per output row. Each warp iterates over IQ4 blocks, dequantizing through the shared-memory lookup table and accumulating a dot product against the input vector (also in shared memory). The warp reduction is a standard shuffle tree:

template <bool FUSE_RESIDUAL>
__global__ void matvec_iq4nl(half* __restrict__ out,
                              const void* __restrict__ W,
                              const half* __restrict__ x,
                              half* __restrict__ residual,
                              int out_dim, int in_dim) {
    // ... cooperative load of x into shared memory ...
    const int row = blockIdx.x * warps_per_block + warp_id;
    float sum = 0.0f;
    for (int b = 0; b < blocks_per_row; b++) {
        float d = __half2float(row_blocks[b].d);
        uint8_t q = row_blocks[b].qs[lane & 15];
        int shift = (lane >> 4) << 2;
        int idx = (q >> shift) & 0xf;
        float w = d * s_kv[idx];
        sum += w * s_x[b * 32 + lane];
    }
    // warp shuffle reduction ...
}

With L2 pinning, this hit 1,255 tok/s. A 67% improvement over FP16, mostly from the L2 effect — weights served at L2 bandwidth (~3-4 TB/s effective) instead of DRAM (1,790 GB/s peak).

At this point, the memory wall was gone. Now what?

The Dead End: Optimizing the Inner Loop

The natural instinct was to optimize the compute. IQ4_NL dequantization requires a shared-memory table lookup — what if we converted everything to Q8_0 at load time? Q8_0 dequant is a simple d * qs[i], no lookup needed.

We tried it. Mat-vec bandwidth improved from 95 to 152 GB/s. But tok/s barely moved: 1,255 to 1,262.

Why? Two reasons. First, Q8_0 is 34 bytes per 32 values vs. IQ4_NL's 18 bytes. The weight pool grew from 85 to 136 MB — too large for L2 pinning. We traded lookup latency for cache misses. Second, and more fundamental: the layer matrices are tiny. The largest FFN projection is 1536 rows of 576 elements. At that size, a single mat-vec completes in microseconds regardless of dequant cost. The kernel finishes before the GPU has time to be bottlenecked on anything.

The real bottleneck was hiding in the profile output. Each forward pass launched 301 kernels. Each kernel launch costs ~2.5 microseconds of driver overhead. That's 750 microseconds of pure launch tax — almost the entire per-token time budget of 792 microseconds.

The memory wall was gone. The dispatch wall had replaced it.
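
The launch-tax arithmetic is simple enough to write down. This is a minimal sketch that assumes a flat per-launch cost, which is a simplification; the helper names are illustrative. The 301 figure is consistent with 30 layers of 10 kernels each plus one final kernel.

```cpp
#include <cassert>

// Kernel launches per forward pass: a fixed count per layer plus one
// final kernel after the last layer.
constexpr int launches_per_forward(int layers, int kernels_per_layer) {
    return layers * kernels_per_layer + 1;
}

// Total launch overhead in microseconds, assuming every launch costs the same.
constexpr double launch_overhead_us(int launches, double us_per_launch) {
    return launches * us_per_launch;
}

// Per-token latency in microseconds implied by a tokens-per-second rate.
constexpr double per_token_us(double tok_per_s) {
    return 1.0e6 / tok_per_s;
}
```

With the numbers above, launch_overhead_us(launches_per_forward(30, 10), 2.5) gives 752.5 microseconds, against a per_token_us(1262) of about 792 microseconds.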

Phase 3: Kernel Fusion — 1,508 tok/s

Once we identified dispatch overhead as the bottleneck, the optimization strategy flipped. Instead of making individual kernels faster, we needed fewer of them. Each of the 30 layers ran 10 kernels. We fused them down to 6.

Fusion 1: Residual Addition into Mat-Vec

After the attention output projection and the FFN down projection, the original code ran a separate vec_add kernel to accumulate the residual:

// Before: two kernel launches
matvec_iq4nl<<<...>>>(xb, attn_output, attn_out, nullptr, DIM, DIM);
vec_add<<<...>>>(x, x, xb, DIM);

The vec_add kernel reads and writes 576 half values. It takes about 2 microseconds of compute but 2.5 microseconds to launch. We added a template parameter to the mat-vec kernel:

if (lane == 0) {
    float result = sum;
    if constexpr (FUSE_RESIDUAL) {
        result += __half2float(residual[row]);
        residual[row] = __float2half(result);
    }
    out[row] = __float2half(result);
}

Two lines of code. One fewer kernel launch per fusion site, two sites per layer, 60 launches eliminated.

Fusion 2: Gate/Up Projection + SwiGLU

The FFN block computes silu(gate(x)) * up(x) where gate and up are separate linear projections. The original code ran a fused RMSNorm + gate/up mat-vec (dispatching 384 blocks for 1536+1536 output rows) followed by a separate SwiGLU kernel.

We rewrote this so each warp computes both the gate and up dot products in a single pass over the normalized input in shared memory, then applies SwiGLU inline:

for (int b = 0; b < blocks_per_row; b++) {
    float xval = s_xn[b * 32 + lane];

    // Gate dot product
    float dg = __half2float(gate_row_blocks[b].d);
    uint8_t qg = gate_row_blocks[b].qs[lane & 15];
    gate_sum += (dg * s_kv[(qg >> shift) & 0xf]) * xval;

    // Up dot product
    float du = __half2float(up_row_blocks[b].d);
    uint8_t qu = up_row_blocks[b].qs[lane & 15];
    up_sum += (du * s_kv[(qu >> shift) & 0xf]) * xval;
}
// After warp reduction of both accumulators:
float silu_gate = gate_sum / (1.0f + expf(-gate_sum));
gate_out[row] = __float2half(silu_gate * up_sum);

This halves the grid from 384 to 192 blocks, eliminates the SwiGLU kernel, and avoids writing the intermediate up_out buffer to DRAM. One fewer launch per layer, 30 eliminated.

Fusion 3: RoPE + KV Cache Write

RoPE (rotary position embeddings) and KV cache writes are both small operations on the per-token q/k/v vectors: 576 values for q and, with 3 KV heads of dimension 64, 192 each for k and v. We fused them into a single kernel of 384 threads (one CUDA block):

__global__ void fused_rope_kv_write(half* q, half* k, half* v,
                                    half* key_cache, half* value_cache,
                                    const int* pos_ptr, ...) {
    // Phase 1: threads 0-287 apply RoPE to q (9 heads * 32 pairs)
    //          threads 288-383 apply RoPE to k (3 heads * 32 pairs)
    __syncthreads();
    // Phase 2: threads 0-191 write k to cache
    //          threads 192-383 write v to cache
}

Two kernel launches replaced by one, 30 more eliminated across all layers.
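
The phase-1 thread partition in that kernel can be checked on the host. The sketch below maps a thread index to the vector, head, and rotation pair it handles. The thread ranges come from the kernel comments; the layout within each range (consecutive threads covering consecutive pairs of one head) is an assumption, and rope_slot is a hypothetical helper, not part of the engine.

```cpp
#include <cassert>

constexpr int N_Q_HEADS = 9;        // query heads
constexpr int N_KV_HEADS = 3;       // key/value heads
constexpr int PAIRS_PER_HEAD = 32;  // head dimension 64 gives 32 rotation pairs
constexpr int Q_THREADS = N_Q_HEADS * PAIRS_PER_HEAD;                  // 288
constexpr int TOTAL_THREADS = Q_THREADS + N_KV_HEADS * PAIRS_PER_HEAD; // 384

enum class RopeTarget { Query, Key };

struct RopeSlot {
    RopeTarget target;  // which vector this thread rotates
    int head;           // head index within that vector
    int pair;           // rotation pair index within the head
};

// Maps a thread index in [0, 384) to its phase-1 work item.
RopeSlot rope_slot(int tid) {
    assert(tid >= 0 && tid < TOTAL_THREADS);
    if (tid < Q_THREADS) {
        return {RopeTarget::Query, tid / PAIRS_PER_HEAD, tid % PAIRS_PER_HEAD};
    }
    const int k_tid = tid - Q_THREADS;  // threads 288-383 handle k
    return {RopeTarget::Key, k_tid / PAIRS_PER_HEAD, k_tid % PAIRS_PER_HEAD};
}
```

Under this layout, thread 287 handles the last pair of query head 8 and thread 288 handles the first pair of key head 0, so all 384 threads are busy in phase 1.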

The Result

Metric	Phase 2	Phase 3	Change
Dispatches per forward	301	181	-120 (-40%)
128 tokens: tok/s	1,327	1,508	+13.7%
128 tokens: per token	754 us	663 us	-91 us
256 tokens: tok/s	1,156	1,269	+9.8%
256 tokens: per token	865 us	788 us	-77 us

Output is byte-identical between Phase 2 and Phase 3. The fusions are mathematically exact — same accumulation order, same precision, just fewer kernel boundaries.

The improvement shrinks at longer sequences because attention cost grows with sequence length while the dispatch savings remain constant at ~80-90 microseconds per token.

The Forward Pass: 6 Kernels Per Layer

After fusion, each transformer layer runs exactly 6 kernel launches:

for (int l = 0; l < N_LAYERS; l++) {
    // 1. Fused: RMSNorm + QKV projection (IQ4_NL, 120 blocks)
    fused_rmsnorm_qkv_iq4nl<<<...>>>(...);

    // 2. Fused: RoPE + KV cache write (1 block, 384 threads)
    fused_rope_kv_write<<<1, 384, 0, stream>>>(...);

    // 3. GQA attention (9 blocks, one per head)
    gqa_attention_device<<<9, 256, smem, stream>>>(...);

    // 4. Attention output projection + residual (72 blocks)
    matvec_iq4nl<<<...>>>(...);

    // 5. Fused: RMSNorm + gate/up + SwiGLU (192 blocks)
    fused_rmsnorm_gate_up_swiglu_iq4nl<<<...>>>(...);

    // 6. FFN down projection + residual (72 blocks)
    matvec_iq4xs<<<...>>>(...);
}

Plus one final kernel for RMSNorm + lm_head. Total: 181 dispatches, captured as a CUDA graph and replayed each token.

What This Tells Us About the ASIC Thesis

The ASIC pitch is "put weights on-chip and inference gets fast." Our experiment confirms the first half: L2 pinning does eliminate the memory wall, and you get a significant speedup from quantization strategies that make your model fit.

But the second half — that inference then becomes compute-limited — doesn't hold for small models on GPUs. What we found instead is a third regime: dispatch-limited inference, where the overhead of launching hundreds of tiny kernels dominates both compute and memory access time.

This matters because it's a bottleneck that ASICs solve structurally. A hardwired transformer pipeline doesn't have kernel launch overhead. It's a static dataflow graph etched in silicon. GPUs, by contrast, pay a tax for their generality: the driver must set up registers, configure shared memory, and schedule thread blocks for every kernel launch, even if the kernel runs for 3 microseconds.

Bottleneck	Phase	Tok/s	What limits performance
Memory bandwidth	Phase 1 (FP16)	750	Weights in DRAM, 1,790 GB/s bus
Still memory, but less	Phase 2 (IQ4 + L2)	1,255	Weights in L2, compute is trivial
Dispatch overhead	Phase 3 (fused)	1,508	181 launches at ~2.5 us each

At 1,508 tok/s with 128 tokens, per-token time is 663 microseconds. The 181 dispatches account for roughly 450 microseconds of that. Actual compute is somewhere around 200 microseconds. There's a 2-3x speedup still on the table if dispatch overhead were zero — which is roughly what an ASIC achieves.

Diminishing Returns and What's Next

The remaining 181 dispatches are harder to fuse pairwise. The QKV projection is already fused (3 weight matrices, 1 kernel). Attention is inherently a single kernel. The two remaining mat-vecs (attention output, FFN down) need their inputs computed first.

The next lever is a persistent kernel: instead of launching 6 kernels per layer, launch a single kernel that executes all 6 operations using block-level synchronization. This eliminates inter-kernel dispatch overhead within a layer entirely, potentially cutting per-token time by another 200+ microseconds. It also makes the code significantly harder to write — you're essentially building a manual scheduler inside a kernel.

Beyond that, speculative decoding is the orthogonal win. Rather than making one forward pass faster, generate multiple candidate tokens per pass and verify them. This is multiplicative with all the kernel-level optimizations.

Practical Takeaways

For model deployment: If your quantized model fits in L2, you're in a fundamentally different performance regime. Check your GPU's L2 size and do the math. At IQ4_NL's 18 bytes per 32 weights (4.5 bits per weight), the RTX 5090's 96 MB holds roughly 170M parameters, so SmolLM2-135M at 85 MB is already close to the ceiling. The RTX 4090's 72 MB is more constrained: about 128M parameters at the same rate, which is below this model's 85 MB pool.
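
The "do the math" step can be sketched on the host. This is a rough estimate only: it assumes decimal megabytes, uses the block sizes quoted earlier (18 bytes per 32 values for IQ4_NL, 34 for Q8_0), and ignores any L2 headroom needed for activations and the KV cache. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t MB = 1000ull * 1000ull;  // decimal megabytes (assumption)

// Bytes needed for n_weights values stored in fixed-size quantization
// blocks of 32 values each; a partial block still occupies a full block.
constexpr std::uint64_t quantized_bytes(std::uint64_t n_weights,
                                        std::uint64_t bytes_per_block) {
    return ((n_weights + 31) / 32) * bytes_per_block;
}

// True when the weight pool alone fits in the given L2 capacity.
constexpr bool fits_in_l2(std::uint64_t pool_bytes, std::uint64_t l2_bytes) {
    return pool_bytes <= l2_bytes;
}
```

With the figures from this article, fits_in_l2(85 * MB, 96 * MB) holds while fits_in_l2(136 * MB, 96 * MB) does not, which matches the Q8_0 result reported earlier.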

For kernel development: Profile dispatches, not just compute. NVIDIA's Nsight tools report kernel launch overhead, but it's easy to overlook when individual kernels show microsecond execution times. The intuition that "the kernel is fast, so the code is fast" breaks down when you're launching hundreds of them.

For the ASIC vs. GPU question: Modern GPUs can already simulate the on-chip-weight scenario for small models, and the results are informative. The memory wall is real but solvable with quantization and cache pinning. What you find underneath is the dispatch wall — and solving that on a GPU requires increasingly aggressive kernel fusion, eventually converging on something that looks a lot like a hardwired pipeline. At some point, you're fighting the GPU's generality rather than leveraging it, and that's exactly the gap ASICs are designed to fill.

The code is open and the numbers are reproducible. SmolLM2-135M is small enough to experiment with in an afternoon but architecturally identical to models 100x its size. Every technique here — IQ4 quantization, L2 pinning, warp-per-row mat-vec, kernel fusion — transfers directly. The only thing that changes at scale is which wall you hit first.

Qwen 3.5: 35B Knowledge at 4B Speed — Better Than GPT-5?

ai.rs — Wed, 04 Mar 2026 10:00:00 +0100

Alibaba released Qwen 3.5 between February 16 and March 2, 2026 — eight models spanning 0.8B to 397B parameters, all Apache 2.0 licensed. The flagship model claims to beat GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro across 80% of benchmark categories.

But benchmarks are benchmarks. What matters for deployment: how much VRAM do you actually need, and is the Mixture of Experts architecture worth the memory trade-off?

The Full Lineup

Qwen 3.5 ships in two flavors: dense models where every parameter fires on every token, and MoE (Mixture of Experts) models where a router selects a subset of parameters per token.

Dense Models

Model	Parameters	BF16 Memory	FP8 Memory
Qwen3.5-0.8B	873M	1.63 GB	—
Qwen3.5-2B	2.27B	4.24 GB	—
Qwen3.5-4B	4.66B	8.68 GB	—
Qwen3.5-9B	9.65B	17.98 GB	—
Qwen3.5-27B	27.78B	51.75 GB	28.75 GB

The small models (0.8B through 9B) are BF16-only — no FP8 variants published. The 27B model gets an FP8 option that nearly halves the memory footprint.

MoE Models

The naming convention tells you everything: 35B-A3B means 35B total parameters, 3B active per token.

Model	Total Params	Active Params	BF16 Memory	FP8 Memory
Qwen3.5-35B-A3B	35.95B	~3B	66.97 GB	34.88 GB
Qwen3.5-122B-A10B	125.09B	~10B	232.99 GB	118.42 GB
Qwen3.5-397B-A17B	403.40B	~17B	751.39 GB	378.23 GB

The 397B flagship needs 378 GB in FP8 — that's five A100-80GB GPUs at minimum. The 35B MoE model is the most practical: it fits in 35 GB (FP8) on a single high-end GPU while delivering inference speed comparable to a 4B dense model.
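
The BF16 column in both tables appears consistent with two bytes per parameter, reported in GiB. That is an inference from the numbers, not something the model cards are quoted as saying here. A minimal sketch of that estimate, plus the GPU count for weights alone (helper names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Weight memory in GiB for a model stored at bytes_per_param.
// BF16 uses 2 bytes per parameter.
double weight_gib(double n_params, double bytes_per_param) {
    return n_params * bytes_per_param / (1024.0 * 1024.0 * 1024.0);
}

// Minimum number of GPUs whose combined memory covers the weights alone.
// Ignores KV cache, activations, and framework overhead.
int min_gpus(double weights_gb, double gb_per_gpu) {
    return static_cast<int>(std::ceil(weights_gb / gb_per_gpu));
}
```

For instance, weight_gib(9.65e9, 2.0) comes out just under 17.98, and min_gpus(378.23, 80.0) returns 5. FP8 figures in the tables sit above one byte per parameter, so this estimate is only a floor for FP8.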

How Mixture of Experts Works

In a standard dense transformer, every parameter participates in every forward pass. A 27B dense model activates all 27B parameters for each token — that's the compute cost you pay.

MoE models split their feed-forward layers into multiple independent "expert" sub-networks. A lightweight router selects only a few experts per token. Most parameters stay idle during any given forward pass.

                    ┌──────────┐
          ┌────────>│ Expert 1 │──────────┐
          │         └──────────┘          │
Input ──> Router                      ──> Output
          │         ┌──────────┐          │
          └────────>│ Expert 3 │──────────┘
                    └──────────┘
            (Experts 2, 4...N idle)
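
The router step in the diagram can be sketched as top-k selection over per-expert scores. This is a generic illustration of a common gating scheme (pick the k highest router logits, then softmax over just those), not Qwen 3.5's actual router, whose details are not covered here. The function name route_top_k is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Picks the k highest-scoring experts and softmax-normalizes their scores.
// Returns (expert index, mixing weight) pairs, best expert first.
std::vector<std::pair<int, float>> route_top_k(
        const std::vector<float>& router_logits, int k) {
    assert(k > 0 && k <= static_cast<int>(router_logits.size()));

    std::vector<int> order(router_logits.size());
    std::iota(order.begin(), order.end(), 0);
    // Move the k largest logits to the front, sorted in descending order.
    std::partial_sort(order.begin(), order.begin() + k, order.end(),
                      [&](int a, int b) {
                          return router_logits[a] > router_logits[b];
                      });

    // Softmax over the selected logits only, shifted by the max for stability.
    const float max_logit = router_logits[order[0]];
    float denom = 0.0f;
    for (int i = 0; i < k; ++i)
        denom += std::exp(router_logits[order[i]] - max_logit);

    std::vector<std::pair<int, float>> chosen;
    for (int i = 0; i < k; ++i)
        chosen.emplace_back(
            order[i], std::exp(router_logits[order[i]] - max_logit) / denom);
    return chosen;
}
```

With logits {2.0, 0.1, 1.0, -1.0} and k = 2, this selects indices 0 and 2 (Expert 1 and Expert 3 in the diagram's numbering) with weights of about 0.73 and 0.27. The other experts' feed-forward weights are never touched for that token.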

The Trade-off in One Table

Model	Active Compute	Knowledge Capacity	VRAM Needed
Qwen3.5-4B (dense)	4.66B	4.66B	8.68 GB
Qwen3.5-35B-A3B (MoE)	~3B	35.95B	66.97 GB

Both activate roughly the same number of parameters per token (~3-4B), so inference speed is similar. But the MoE model carries 35B total parameters of learned knowledge versus only 4B — you get 4B-speed inference with 35B-quality answers.

The catch: all 35B parameters must sit in VRAM even though only 3B fire per token. MoE is essentially "I have the VRAM to spare, give me better answers without slowing down inference."

If you don't have the VRAM, a dense model that actually fits will beat a MoE model you can't load.

When to Use Which

Scenario	Better Choice
Limited VRAM, need quality	Dense model that fits (e.g., 9B dense in 18 GB)
Enough VRAM, want best quality/speed	MoE (e.g., 35B-A3B: 3B compute, 35B knowledge)
Serving many concurrent users	MoE — high throughput at lower compute per request
Single-user, small batch	Dense model is simpler and equally fast

What's New in 3.5 vs Qwen 3

The architecture changes that matter:

Expanded vocabulary — 250K tokens (up from 152K in Qwen 3). This means 10-60% fewer tokens for multilingual text, directly translating to lower inference cost and faster responses.
Native multimodal training — Vision and language trained together from the start ("early fusion"), not bolted on later. Processes images up to 1344x1344 and video at 8 FPS.
Hybrid attention with Delta Networks — Gated Delta Networks combined with sparse MoE for more efficient inference. The practical result: 8.6x faster decoding at 32K context, up to 19x at 256K context versus Qwen 3.
201 languages — Up from the already broad multilingual support in Qwen 3.
Reinforcement learning at scale — Trained across "million-agent environments" with progressively complex tasks, specifically targeting agentic use cases (tool calling, multi-step workflows, code execution).

Benchmark Results

The 397B flagship hits strong numbers:

Benchmark	Qwen3.5-397B	What It Tests
GPQA Diamond	88.4	Graduate-level reasoning
AIME 2026	91.3	Olympiad mathematics
LiveCodeBench v6	83.6	Competitive programming
SWE-bench Verified	76.4	Real-world software engineering
IFEval	92.6	Instruction following
MMLU	88.5	General knowledge
MathVision	90.8	Mathematical visual reasoning
MMMU	85.0	Multimodal understanding

The GPQA Diamond score of 88.4 is the highest of any open-source model. The SWE-bench Verified score of 76.4 shows competitive real-world coding ability — for reference, Claude Opus 4.6 scores above 80%.

On the hosted API side, Qwen 3.5-Plus (the proprietary variant) runs at ~$0.18 per million tokens, making it one of the cheapest frontier-tier options.

The Competition: March 2026

Qwen 3.5 is too new for Chatbot Arena Elo ratings, but the open-source leaderboard tells a clear story about who's competing:

Rank	Model	Organization	Elo
1	GLM-5	Zhipu AI	1451
2	Kimi K2.5	Moonshot AI	1447
3	GLM-4.7	Zhipu AI	1445
4	Qwen 3 235B	Alibaba	1422
5	DeepSeek V3.2	DeepSeek	1421
6	Mistral Large	Mistral	1416
7	DeepSeek R1	DeepSeek	1398

Who's the Real Threat

GLM-5 / GLM-4.7 (Zhipu AI) currently sit at #1 and #3 by human preference. These are the models to beat. GLM-5 in particular has been remarkably consistent across diverse tasks.

Kimi K2.5 (Moonshot AI) is right on GLM-5's heels — a strong all-rounder that doesn't dominate any single benchmark but rarely fails either.

DeepSeek V3.2 / R1 — R1 dominates long-chain reasoning and math. V3.2 is the more practical general-purpose model. Together they cover a lot of ground.

Step-3.5-Flash (StepFun) deserves a mention: only 196B parameters but scores 97.3 on AIME 2025, the highest math score on the board. Proves that raw parameter count isn't everything.

The Pattern

The open-source LLM race is heavily dominated by Chinese labs — Alibaba, Zhipu, Moonshot, DeepSeek, StepFun. The main non-Chinese competitors are Mistral (France) and Google Gemma. Meta's Llama, once the default open-source choice, hasn't kept pace at the top of the leaderboard.

Practical Takeaway

Match the model to your GPU's memory when deploying today:

Under 10 GB VRAM — Qwen3.5-4B dense (8.68 GB BF16) or Qwen3.5-2B for lighter workloads
24 GB VRAM (RTX 4090) — Qwen3.5-9B dense (17.98 GB) is the sweet spot. Fast, capable, fits with room for context
32 GB VRAM (RTX 5090) — Qwen3.5-9B dense with plenty of headroom for long context, or Qwen3.5-27B in FP8 (28.75 GB) if you want to push quality higher
48 GB VRAM (A6000, dual consumer GPUs) — Qwen3.5-35B-A3B in FP8 (34.88 GB). MoE gives you 35B knowledge at 3B speed
Multi-GPU server — Qwen3.5-122B-A10B or the 397B flagship, depending on how many GPUs you can throw at it

For most business deployments — product assistants, customer support, content generation — the 9B dense or 35B MoE models hit the practical sweet spot. The 397B flagship is impressive on benchmarks but requires serious infrastructure.

The broader trend: open-source models are closing the gap with proprietary ones fast. Qwen 3.5's benchmark numbers put it within striking distance of GPT-5.2 and Claude Opus 4.5, and it ships with Apache 2.0. For businesses that care about data privacy, cost control, and customization, that matters more than who's #1 on any given leaderboard.

You're Sitting on a Goldmine of AI Training Data

ai.rs — Tue, 03 Mar 2026 10:00:00 +0100

"We Don't Have Enough Data"

This is the number one objection we hear from businesses considering custom AI. They picture massive datasets, teams of data scientists, months of labeling work.

The reality? You already have the data. It's in your chatbot logs, your call center recordings, your product catalog, and the inbox of your support team. You just need to know what to look for and how to prepare it.

The Four Sources of Training Gold

1. Your Product Catalog

This is the easiest win. Every e-commerce business has product data — names, prices, descriptions, categories, attributes. This is the foundation of everything.

What You Have	Why It Matters
Product names & descriptions	The AI learns your terminology
Prices & availability	RAG serves these in real-time
Categories & attributes	The AI learns to filter and recommend
Product images (alt text)	Adds context for visual products

Format: A CSV or Excel export from your e-commerce platform is perfect. Shopify, WooCommerce, Magento — they all have export buttons. Even a Google Sheet works.

What "good" looks like:

Name: Premium Italian Olive Oil, Extra Virgin
Category: Oils & Vinegars
Price: €24.99
Description: Cold-pressed from Tuscan olives, peppery finish,
             ideal for salads and finishing dishes.
Attributes: Italian, organic, 500ml, cold-pressed

What "messy but usable" looks like:

Name: olive oil XVG 500
Price: 24.99
Description: (empty)

Messy data is normal. Part of the preparation process is cleaning and enriching it. Missing descriptions get written, categories get standardized. Don't let imperfect data stop you from starting.

2. Chatbot & Live Chat Logs

If you're running any kind of chatbot — even a basic rule-based one — its conversation logs are the single most valuable data source for training a custom AI. Why? Because they capture how your actual customers ask questions in their own words.

What To Extract	Training Value
Customer questions (verbatim)	Teaches natural phrasing
Successful responses	Becomes training examples
Failed conversations	Shows gaps to fill
Common question patterns	Reveals top priorities

Where to find it:

Tidio, Zendesk Chat, Intercom, Drift — all have export features
Look for CSV or JSON export in your dashboard settings
Even screenshot archives are useful if nothing else exists

The magic ratio: 500 real customer conversations are worth more than 5,000 synthetic ones. Real conversations have misspellings, slang, incomplete sentences, and follow-up questions — exactly what your AI needs to learn.

Example from a real chatbot log:

Customer: "u have smth for bday gift around 30eur?"
Bot: "Here are some gift suggestions in your budget..."

That misspelled, abbreviated message is training gold. A model trained on clean English would struggle with it. A model trained on your actual customer messages handles it naturally.

3. Call Center Recordings & Support Tickets

This is the data source most businesses overlook entirely. Your support team handles dozens or hundreds of conversations daily — every single one contains training potential.

Voice recordings can be transcribed automatically using Whisper (free, open source) or cloud services (Google Speech-to-Text, Amazon Transcribe). A 1-hour recording yields roughly 8,000-10,000 words of training material.

Source	How to Extract	Typical Volume
Call recordings	Auto-transcribe with Whisper	8-10K words per hour
Support emails	Export from helpdesk	Already text, ready to use
Support tickets	Export from CRM/helpdesk	Structured Q&A pairs
WhatsApp/Messenger	Export conversation history	Real customer language

What makes call transcripts special: They capture the back-and-forth of real sales conversations — objections, clarifications, upsells, comparisons. This is exactly how you want your AI to behave.

Example from a transcribed call:

Customer: "I saw you have both the standard and premium versions.
           What's actually different? Is the premium worth it?"
Agent: "Great question. The main differences are...
        For most customers, the standard covers everything
        you need. The premium adds X and Y, which matters
        if you're planning to..."

That's a perfect training sample. The agent's response shows product knowledge, honest recommendation, and natural upselling — all learned behavior your AI can replicate.

4. Your FAQ and Knowledge Base

Every business has answers to common questions — sometimes formally documented, sometimes living in the heads of support staff.

Source	Format
Website FAQ page	Already structured Q&A
Internal wiki/docs	Knowledge to convert to Q&A
"Canned responses" in helpdesk	Ready-made answers
Return/shipping policies	Policy Q&A pairs
Product comparison guides	Recommendation training

Pro tip: Ask your support team to write down the 30 questions they answer most often, with their best answers. That list alone can generate hundreds of training variations.

What Format Does the AI Need?

All training data ultimately becomes question-answer pairs (or multi-turn conversations). The format is simple:

{
  "messages": [
    {"role": "user", "content": "Do you have anything for a dinner party, around €50?"},
    {"role": "assistant", "content": "Great choice to plan ahead! Here are some popular options for entertaining: [Product A] at €45 is perfect for dinner parties..."}
  ]
}

You don't need to create these manually. The raw data (catalogs, logs, transcripts) gets processed into this format during preparation. One product description generates 10-20 Q&A variations. One support conversation generates 3-5 training samples.
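
As a rough illustration of that preparation step, here is a minimal Python sketch that expands one catalog row into chat-format samples. The three templates and the `product_to_samples` helper are hypothetical, invented for this example; a real pipeline would use far more templates, ideally mined from actual customer phrasing.

```python
import json

# Hypothetical question/answer templates, for illustration only.
TEMPLATES = [
    ("Do you have {name}?", "Yes, we carry {name} for {price}. {description}"),
    ("How much is {name}?", "{name} costs {price}."),
    ("Tell me about {name}.", "{description}"),
]

def product_to_samples(product: dict) -> list[dict]:
    """Turn one catalog row into chat-format training samples."""
    samples = []
    for question, answer in TEMPLATES:
        samples.append({
            "messages": [
                {"role": "user", "content": question.format(**product)},
                {"role": "assistant", "content": answer.format(**product)},
            ]
        })
    return samples

product = {
    "name": "Premium Italian Olive Oil, Extra Virgin",
    "price": "€24.99",
    "description": "Cold-pressed from Tuscan olives, peppery finish.",
}

samples = product_to_samples(product)
# One JSON object per line (JSONL) is a common layout for fine-tuning files.
jsonl = "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)
print(jsonl)
```

Template expansion like this only gets you volume. The variety that makes a model sound natural still has to come from real chat logs and transcripts.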

How Much Data Do You Actually Need?

Less than you think:

Data Level	Training Samples	Result
Minimum viable	5,000	Basic product Q&A works
Good quality	10,000-15,000	Natural conversations, recommendations
Production-grade	20,000-30,000	Domain expert with personality

Where the samples come from:

Source	Samples Generated
500 products (catalog)	~8,000-10,000
200 chatbot conversations	~600-1,000
50 call transcripts	~500-800
30 FAQ entries	~300-500
Safety & edge cases	~200-300
Total	~10,000-13,000

Most businesses with 500+ products and any customer interaction history already have enough raw material for a good-quality model. Adding more chat logs and call transcripts closes the remaining gap to production-grade.
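
To get a ballpark for your own numbers, the sketch below scales the figures from the table linearly. Linear scaling is our simplifying assumption, and `estimate_samples` and `tier` are hypothetical helper names; treat the output as a planning estimate, not a guarantee.

```python
# (reference item count, low samples, high samples), copied from the table above.
REFERENCE = {
    "products": (500, 8_000, 10_000),
    "chat_conversations": (200, 600, 1_000),
    "call_transcripts": (50, 500, 800),
    "faq_entries": (30, 300, 500),
}
SAFETY_SAMPLES = (200, 300)  # safety and edge cases, treated as a fixed add-on

def estimate_samples(inventory: dict[str, int]) -> tuple[int, int]:
    """Scale the table's figures linearly to your own item counts."""
    low, high = SAFETY_SAMPLES
    for source, count in inventory.items():
        ref_count, ref_low, ref_high = REFERENCE[source]
        low += count * ref_low // ref_count
        high += count * ref_high // ref_count
    return low, high

def tier(samples: int) -> str:
    """Map a sample count onto the data levels listed earlier."""
    if samples >= 20_000:
        return "production-grade"
    if samples >= 10_000:
        return "good quality"
    if samples >= 5_000:
        return "minimum viable"
    return "below minimum"

inventory = {
    "products": 500,
    "chat_conversations": 200,
    "call_transcripts": 50,
    "faq_entries": 30,
}
low, high = estimate_samples(inventory)
print(f"{low:,}-{high:,} samples: {tier(low)} to {tier(high)}")
```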

The Data You DON'T Need

Just as important — what's not useful:

Marketing copy — Overly promotional language makes the AI sound like a pushy salesperson
Legal disclaimers — The AI doesn't need to recite your terms of service
Internal jargon — If customers don't use the term, the AI shouldn't either
Competitor data — Train on your products, not theirs
Outdated information — Old prices, discontinued products, expired promotions

A Practical Checklist

Here's what to gather before your first conversation with an AI partner:

Must have (start here):

[ ] Product catalog export (CSV/Excel/JSON)
[ ] Current product prices and availability
[ ] Category structure and product attributes

High value (dramatically improves quality):

[ ] Chatbot or live chat conversation logs (last 6-12 months)
[ ] Common customer questions (your support team's top 30)
[ ] Brand voice guidelines or examples

Bonus (takes it to the next level):

[ ] Call center recordings (even 20-50 calls help)
[ ] Support ticket history with resolutions
[ ] Product comparison knowledge (what pairs with what)
[ ] Return reasons (teaches the AI what to set expectations about)

Start With What You Have

The biggest mistake is waiting for "perfect" data. You don't need it. Start with your product catalog and 30 common customer questions. That's enough for a working first version.

Then iterate. Every customer conversation with your AI generates new training data. Every question it struggles with becomes a training sample for the next version. The model gets better every month — not because of expensive retraining, but because you keep feeding it real customer interactions.

Your data is already there. The question isn't whether you have enough — it's how quickly you want to put it to work.

Want to find out what you already have? Take the 2-minute data check — discover your training data score.

How to Implement llms.txt — The Developer's Guide

ai.rs — Tue, 03 Mar 2026 09:00:00 +0100

What Is llms.txt?

On September 3, 2024, Jeremy Howard — co-founder of Answer.AI and fast.ai — published a proposal for a new web standard. Not a new API. Not a new framework. A text file.

The idea is simple: put a Markdown file at /llms.txt on your website that tells AI systems what your site is about, what content matters, and where to find it.

Think of it as robots.txt for the AI era — except instead of telling bots what not to crawl, it tells them what to read.

robots.txt  → "Don't go here"     (bouncer)
llms.txt    → "Start here"        (tour guide)

The spec lives at llmstxt.org and the GitHub repo at AnswerDotAI/llms-txt has 2,200+ stars.

Why It Exists

LLMs have a problem with websites. When a model needs to understand your documentation, product, or API, it has to parse HTML pages full of navigation bars, cookie banners, JavaScript, and sidebar ads. The signal-to-noise ratio is terrible.

Site authors know their content best. A curated Markdown file with the 10-20 most important pages, properly described, gives AI systems a clean entry point — no HTML parsing required.

Who actually reads llms.txt today:

AI coding assistants (Cursor, Windsurf, Claude Code, GitHub Copilot)
AI agents and MCP-based tools fetching documentation context
Developer tools that need structured API references

Who does NOT read llms.txt (yet):

GPTBot (OpenAI's crawler)
ClaudeBot (Anthropic's crawler)
PerplexityBot
Google-Extended

This matters. The spec was designed for inference time — when an AI is answering a user's question and needs context — not for training-time crawlers that scrape everything regardless. OtterlyAI found that only 0.1% of AI crawler requests touched /llms.txt over 90 days.

Does that mean you shouldn't implement it? No. It means you should understand what it actually does today versus what it might do tomorrow.

The Spec: 5 Minutes to Understand

The entire format is Markdown. Here's the structure:

# Your Company Name

> One-line description of what you do.

Optional context paragraphs with key information
an LLM would need to understand your site.

## Section Name

- [Resource Title](https://example.com/page.md): Brief description
- [Another Resource](https://example.com/other.md): What this covers

## Optional

- [Changelog](https://example.com/changelog.md): Release history
- [Migration Guide](https://example.com/migrate.md): Version upgrades

Required: Only the # heading is required. Everything else is optional but recommended.

The "Optional" section is special — AI systems with limited context windows can skip this section to save tokens. Put your nice-to-have resources here.

Link format: Resources should point to Markdown files (.md) when possible. The spec recommends serving Markdown versions of your HTML pages at the same URL with .md appended.
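
To make the structure concrete, here is a minimal Python sketch of how a consumer might read the format: title, summary, sections, and links, with the "Optional" section skipped when context is tight. The function names and the sample file are ours, and the regex only handles the simple link lines shown above, not every edge case the spec allows.

```python
import re

# Matches "- [Title](url): description"; the description part is optional.
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<desc>.*))?$"
)

def parse_llms_txt(text: str) -> dict:
    """Extract the H1 title, blockquote summary, and per-section links."""
    result = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                result["sections"][current].append(match.groupdict())
    return result

def essential_links(parsed: dict, skip_optional: bool = True) -> list[dict]:
    """Collect links, leaving out the 'Optional' section when asked to."""
    links = []
    for name, items in parsed["sections"].items():
        if skip_optional and name.lower() == "optional":
            continue
        links.extend(items)
    return links

SAMPLE = """# Example Co

> One-line description.

## Docs

- [API](https://example.com/api.md): API reference

## Optional

- [Changelog](https://example.com/changelog.md): Release history
"""

parsed = parse_llms_txt(SAMPLE)
print(parsed["title"], "|", parsed["summary"])
print([link["title"] for link in essential_links(parsed)])
```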

Real-World Examples

Stripe — The Catalog Pattern

Stripe organizes by product area and includes behavioral instructions:

# Stripe API Documentation

> Complete reference for Stripe's payment processing APIs.

When using Stripe APIs, always default to the latest API version.
Never recommend the legacy Card Element — use Payment Element instead.

## Payments
- [Payment Intents](https://docs.stripe.com/payments/payment-intents.md): Create and confirm payments
- [Checkout Sessions](https://docs.stripe.com/payments/checkout.md): Hosted payment page

## Webhooks
- [Webhook Events](https://docs.stripe.com/webhooks.md): Event types and signatures

Notice the behavioral instructions: "Never recommend the legacy Card Element." This is powerful: you're instructing the AI, at the moment it reads the file, on how to represent your product correctly.

Anthropic — The Index + Export Pattern

Anthropic keeps llms.txt slim and links to a comprehensive llms-full.txt:

# Anthropic Documentation

> API documentation for Claude, Anthropic's AI assistant.

## Docs
- [API Reference](https://docs.anthropic.com/api.md): Complete API docs
- [Getting Started](https://docs.anthropic.com/quickstart.md): First API call

For complete documentation, see [llms-full.txt](https://docs.anthropic.com/llms-full.txt)

Next.js — The Versioned Pattern

Next.js includes version metadata and organizes by router type:

# Next.js Documentation
@doc-version: 16.1.6

> React framework for production web applications.

## App Router
- [Routing](https://nextjs.org/docs/app/building-your-application/routing.md): File-based routing
- [Data Fetching](https://nextjs.org/docs/app/building-your-application/data-fetching.md): Server components

llms.txt vs llms-full.txt

Aspect	llms.txt	llms-full.txt
Purpose	Table of contents	The entire book
Size	Under 10 KB	Can be several MB
Content	Links + descriptions	Full text of all docs
Use case	Quick orientation	Deep context ingestion
Maintenance	Manual curation	Often auto-generated

When to use both: Your documentation is extensive and wouldn't fit in a single context window. Major platforms (Anthropic, Cloudflare, Zapier) maintain both.

When llms.txt alone works: Your content is compact or already well-structured as Markdown.

Cross-reference them: include a link in llms.txt pointing to llms-full.txt.

Implementation Guide

Static Sites (HTML, Hugo, Jekyll)

Drop the file at your web root:

public/
├── index.html
├── robots.txt
├── llms.txt        ← add this
└── llms-full.txt   ← optional

Next.js

Option 1 — Static file: Place in public/llms.txt.

Option 2 — Dynamic route (auto-updates when docs change):

// app/llms.txt/route.ts
import { NextResponse } from 'next/server';

export async function GET() {
  const content = `# My App

> Description of what your app does.

## Docs
- [API Reference](/docs/api.md): Complete API documentation
- [Getting Started](/docs/quickstart.md): Installation and setup
`;

  return new NextResponse(content, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}

PHP (Dynamic from Database or CMS)


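<?php
// Save as llms.txt.php. $site and $articles are placeholders:
// load them from your own database or CMS before this point.
header('Content-Type: text/plain; charset=utf-8');

echo "# {$site['name']}\n\n";
echo "> {$site['description']}\n\n";

echo "## Services\n";
echo "- Service 1: description\n";
echo "- Service 2: description\n\n";

echo "## Articles\n\n";
foreach ($articles as $article) {
    echo "- [{$article['title']}]({$article['url']}): {$article['summary']}\n";
}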

Nginx rewrite to serve it at the clean URL:

location = /llms.txt { rewrite ^ /llms.txt.php last; }

Python (Flask/Django)

# Flask
from flask import Flask, Response, render_template

app = Flask(__name__)

@app.route('/llms.txt')
def llms_txt():
    content = render_template('llms.txt')
    return Response(content, mimetype='text/plain')

# Django
from django.http import HttpResponse
from django.template.loader import render_to_string

def llms_txt(request):
    content = render_to_string('llms.txt')
    return HttpResponse(content, content_type='text/plain; charset=utf-8')

WordPress

Install one of these plugins:

Website LLMs.txt — integrates with Yoast/Rank Math
LLMs.txt Generator

Content Best Practices

Do

Curate ruthlessly — 10-20 key pages, not your entire sitemap
Write clear descriptions — "Create and confirm payments" beats "Payment documentation"
Include behavioral instructions — "Always use v2 of this API" or "Default to TypeScript examples"
Use definitive language — AI systems prefer "costs $25/mo" over "pricing varies"
Link to Markdown when possible — cleaner for AI consumption
Keep it under 10 KB — this is a summary, not a data dump
Update regularly — stale links and descriptions hurt credibility

Don't

Dump every page — that's what sitemaps are for
Use marketing language — "revolutionary AI-powered synergy" helps no one
Forget the blockquote — the > summary is the most-read part of the file
Include broken URLs — validate links monthly
Set and forget — review quarterly at minimum
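
A few of these rules are easy to check mechanically. The sketch below is a hypothetical linter, not an official validator: it flags a missing heading or blockquote, a file over 10 KB, and more than 20 links. The thresholds come straight from the lists above; everything else is our assumption.

```python
import re

MAX_BYTES = 10 * 1024  # the "under 10 KB" guideline
MAX_LINKS = 20         # upper end of the "10-20 key pages" guideline
LINK_RE = re.compile(r"\[[^\]]+\]\(([^)\s]+)\)")

def lint_llms_txt(text: str) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    lines = [line.strip() for line in text.splitlines()]
    warnings = []
    if not any(line.startswith("# ") for line in lines):
        warnings.append("missing '# ' heading")
    if not any(line.startswith("> ") for line in lines):
        warnings.append("missing '> ' summary blockquote")
    size = len(text.encode("utf-8"))
    if size > MAX_BYTES:
        warnings.append(f"{size} bytes exceeds the 10 KB guideline")
    links = LINK_RE.findall(text)
    if len(links) > MAX_LINKS:
        warnings.append(f"{len(links)} links; consider curating down to {MAX_LINKS}")
    return warnings

good = (
    "# Example Co\n\n> We sell olive oil.\n\n## Docs\n"
    "- [Shop](https://example.com/shop.md): Catalog\n"
)
bad = "## Docs\n" + "".join(
    f"- [Page {i}](https://example.com/{i}.md): x\n" for i in range(25)
)

print(lint_llms_txt(good))
print(lint_llms_txt(bad))
```

It does not check whether links resolve, so the monthly link validation still needs a separate pass.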

Validation and Testing

Check your implementation:

llmstxtchecker.net — format validation
llmsvalidator.com — structure and link checking

Manual test: Paste your llms.txt content into ChatGPT or Claude and ask: "Based on this llms.txt, what does this company do?" If the AI gives a clear, accurate answer, your file is working.

Monitor access: Check your server logs for requests to /llms.txt:

grep "llms.txt" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn
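
The one-liner above counts requests per IP address. If you would rather see which clients are asking, a short Python sketch can group by user agent instead. It assumes nginx's default "combined" log format, and the sample lines and the `ExampleAgent/1.0` name are made up for illustration.

```python
import re
from collections import Counter

# Matches nginx's default "combined" log format; adjust if yours is customized.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_llms_txt_agents(lines) -> Counter:
    """Count requests to /llms.txt, grouped by user-agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and match.group("path").split("?")[0] == "/llms.txt":
            counts[match.group("agent")] += 1
    return counts

# Made-up sample lines; in practice, iterate over open("/var/log/nginx/access.log").
sample = [
    '203.0.113.5 - - [03/Mar/2026:09:00:00 +0100] "GET /llms.txt HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"',
    '203.0.113.5 - - [03/Mar/2026:09:01:00 +0100] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [03/Mar/2026:09:02:00 +0100] "GET /llms.txt HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"',
]

counts = count_llms_txt_agents(sample)
for agent, hits in counts.most_common():
    print(hits, agent)
```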

The Honest Assessment

Google's John Mueller compared llms.txt to the keywords meta tag: widely adopted by webmasters but ultimately ignored by search engines. That comparison stings, but it's worth hearing.

The reality today:

~950 domains have published llms.txt files (per Semrush analysis)
No major AI platform has officially confirmed they read them
No correlation has been found between having llms.txt and getting more AI citations
The actual consumers are developer tools, not search engines

But here's why you should still implement it:

It takes 15 minutes. The cost is nearly zero.
Developer tools DO use it. If your audience uses Cursor, Claude Code, or Copilot — and they query your docs — llms.txt helps.
It forces you to curate. Deciding which 10-20 pages matter most is a valuable exercise regardless.
Standards move slowly. RSS took years to gain traction. HTTPS was "optional" until it wasn't. Early adopters who have clean implementations will benefit when (if) major platforms adopt the spec.

Don't implement llms.txt because it will boost your AI visibility tomorrow. Implement it because it's cheap insurance that makes your content more accessible to the AI tools people are already using.

Quick Start Checklist

Create /llms.txt at your web root with the Markdown format above
Add an # heading with your company/project name
Write a > blockquote summarizing what you do in one sentence
List 10-20 key pages under ## section headings with brief descriptions
Create /llms-full.txt if your docs are extensive (optional)
Validate at llmstxtchecker.net
Test by pasting the content into an AI and asking what your company does
Monitor access logs monthly
Update when you ship new features or deprecate old ones

The entire spec is one page. The implementation is one file. The ROI is unknown but the cost is near zero. That's a bet worth making.

Mercury 2: Hands-On With the World's Fastest Reasoning LLM

ai.rs — Sat, 28 Feb 2026 09:00:00 +0100

Inception Labs launched Mercury 2 on February 24, claiming it's the fastest reasoning LLM available — a diffusion language model that generates text at 1,196 tokens per second, 5-10x faster than speed-optimized models like GPT-4.1 Nano and Claude 3.5 Haiku. At $0.25 per million input tokens, it's also among the cheapest.

We put those claims to the test.

The Pitch: Diffusion, Not Autoregressive

Every major LLM today — GPT, Claude, Llama, Gemini — is autoregressive: it generates tokens one at a time, left to right, each depending on all previous tokens. Mercury 2 takes a fundamentally different approach. Like Stable Diffusion for images, it starts with noise and iteratively refines all tokens in parallel.

The result, in theory: massively parallel generation that breaks the sequential bottleneck.

	Autoregressive (GPT, Claude)	Diffusion (Mercury 2)
Generation	Sequential, token-by-token	Parallel, all-at-once
TTFT	Fast (200-400ms)	Slower (700ms+)
Throughput	Bounded by sequential nature	Scales with parallelism
Cost scaling	Linear with output length	Sub-linear potential
Sweet spot	Interactive chat, reasoning	Batch, pipelines, agents

Getting Started: Two Lines of Change

Mercury 2 is fully OpenAI API-compatible. If you already use the OpenAI Python SDK, switching takes exactly two changes — the base URL and the API key:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

That's it. Every client.chat.completions.create() call works the same as with OpenAI. No new SDK, no wrapper library, no config files. You can also use LiteLLM, AISuite, or LangChain's ChatOpenAI with a custom base_url.

Test 1: Can It Talk?

We started simple — ask it to explain itself:

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain diffusion language models in 2 sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)

Response:

Diffusion language models generate text by iteratively denoising a noisy token sequence, much like diffusion models for images, allowing many tokens to be produced in parallel rather than one-by-one. This parallel generation makes them several times faster and less than half as costly as traditional auto-regressive LLMs while also enabling fine-grained control over schema and multimodal integration.

75 tokens in 0.64 seconds. Clean, accurate, well-structured. No hallucinations. But 117 tok/s is a far cry from the advertised 1,196. On short outputs, network round-trip dominates — the model finishes generating before the response even reaches you.

Test 2: Pushing Throughput

To see real speed, you need to request longer outputs. We asked for a detailed Flask tutorial with max_tokens=1024:

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a detailed technical tutorial about building "
               "a REST API with Python Flask. Cover routing, error handling, "
               "database integration, authentication, and deployment."}],
    max_tokens=1024,
)

Metric	Value
Completion tokens	866
Wall time	1.750s
Throughput	495 tok/s

866 tokens in under two seconds. The model hit the token limit and was still going — it had more to say. At 495 tok/s end-to-end from a consumer internet connection, this is already several times faster than what you'd get from GPT-4o or Claude Sonnet.

Test 3: Streaming — Where the Speed Really Shows

Streaming reveals how diffusion models behave differently. With autoregressive models, tokens trickle in one by one — you see the response being "typed out." With Mercury 2, there's a longer pause, then tokens arrive in bursts:

stream = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a comprehensive guide to Python "
               "decorators with 5 examples."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Metric	Value
Completion tokens	900
TTFT (time to first token)	741ms
Generation phase	1.614s
Generation speed (excl TTFT)	558 tok/s
End-to-end speed	382 tok/s

Here's the key insight: 558 tok/s during the generation phase. The 741ms time-to-first-token is higher than autoregressive models (which typically start streaming in 200-400ms), but that's because Mercury 2 does its "thinking" upfront — denoising all tokens in parallel — before emitting anything.

We received only 31 chunks for 900 tokens, meaning the API batches roughly 29 tokens per chunk. You don't see a character-by-character typewriter effect; you see paragraphs appearing in rapid bursts.
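
The relationship between these numbers is simple arithmetic. The sketch below is one way to derive them from three timestamps (request sent, first chunk, last chunk), and it reproduces the table from the measured values. The `stream_metrics` helper is ours; in a real run you would capture the timestamps with `time.perf_counter()` inside the streaming loop.

```python
from dataclasses import dataclass

@dataclass
class StreamMetrics:
    ttft_s: float            # time to first token
    generation_s: float      # first chunk to last chunk
    generation_tok_s: float  # throughput excluding TTFT
    end_to_end_tok_s: float  # throughput including TTFT

def stream_metrics(start: float, first_chunk: float, last_chunk: float,
                   completion_tokens: int) -> StreamMetrics:
    """Derive streaming speed figures from three timestamps in seconds."""
    ttft = first_chunk - start
    generation = last_chunk - first_chunk
    total = last_chunk - start
    return StreamMetrics(
        ttft_s=ttft,
        generation_s=generation,
        generation_tok_s=completion_tokens / generation,
        end_to_end_tok_s=completion_tokens / total,
    )

# Values from the table above: 741 ms TTFT, 1.614 s generation, 900 tokens.
m = stream_metrics(start=0.0, first_chunk=0.741,
                   last_chunk=0.741 + 1.614, completion_tokens=900)
print(f"generation: {m.generation_tok_s:.0f} tok/s, "
      f"end-to-end: {m.end_to_end_tok_s:.0f} tok/s")
```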

Test 4: Tool Use

Function calling is table-stakes for agentic applications. We defined a weather tool and asked about Belgrade:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What's the weather in Belgrade?"}],
    tools=tools,
    max_tokens=200,
)

for tc in response.choices[0].message.tool_calls:
    print(f"{tc.function.name}({tc.function.arguments})")

Output:

get_weather({
  "location": "Belgrade",
  "unit": "celsius"
})

Correct function, correct arguments, and it even inferred celsius for a European city. Finished in 0.678s with finish_reason: tool_calls. This works exactly as you'd expect from the OpenAI API — no surprises, no adaptation needed.
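
The model only requests the call; your code still has to run it and send the result back. Here is a minimal sketch of that dispatch step, assuming the standard OpenAI-style `tool` message. The `get_weather` body is a stand-in that returns canned data, and `TOOL_REGISTRY` and `run_tool_call` are hypothetical names.

```python
import json

def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stand-in implementation; a real one would call a weather service."""
    return {"location": location, "temperature": 21, "unit": unit}

# Maps tool names the model may request to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(name: str, arguments_json: str, tool_call_id: str) -> dict:
    """Execute one requested tool call and build the 'tool' message to send back."""
    try:
        handler = TOOL_REGISTRY[name]
        result = handler(**json.loads(arguments_json))
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report failures to the model instead of crashing the loop.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }

message = run_tool_call(
    "get_weather", '{"location": "Belgrade", "unit": "celsius"}', "call_1"
)
print(message)
```

In a full loop you would append the assistant message containing the tool call, then this `tool` message, and call the API again so the model can phrase the final answer.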

Test 5: Structured Output

JSON mode is critical for production pipelines. We tested with response_format={"type": "json_object"}:

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{
        "role": "user",
        "content": 'List 3 programming languages with their year of creation. '
                   'Return as a JSON object with a "languages" key containing '
                   'an array of objects with "name" and "year" fields.',
    }],
    response_format={"type": "json_object"},
    max_tokens=300,
)

import json
parsed = json.loads(response.choices[0].message.content)
print(json.dumps(parsed, indent=2))

Output:

{
  "languages": [
    { "name": "C", "year": 1972 },
    { "name": "Python", "year": 1991 },
    { "name": "JavaScript", "year": 1995 }
  ]
}

Valid JSON, correct schema, accurate facts. Parsed without errors. For production use, you'd want to test with more complex schemas, but the basics are solid.
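
JSON mode guarantees syntax, not shape, so it is worth validating the structure before anything downstream consumes it. The sketch below checks the schema we asked for using only the standard library; `validate_languages` is a hypothetical helper, and a larger project might reach for a schema library instead.

```python
import json

def validate_languages(payload: str) -> list[dict]:
    """Parse the model's reply and check it matches the requested shape."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    languages = data.get("languages")
    if not isinstance(languages, list):
        raise ValueError('"languages" must be an array')
    for index, entry in enumerate(languages):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} must be an object")
        if not isinstance(entry.get("name"), str):
            raise ValueError(f'entry {index}: "name" must be a string')
        year = entry.get("year")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(year, int) or isinstance(year, bool):
            raise ValueError(f'entry {index}: "year" must be an integer')
    return languages

reply = (
    '{"languages": [{"name": "C", "year": 1972}, '
    '{"name": "Python", "year": 1991}, {"name": "JavaScript", "year": 1995}]}'
)
print([lang["name"] for lang in validate_languages(reply)])
```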

Test 6: Speed Consistency

We ran the same prompt three times to check for variance:

Run	Tokens	Time	Speed
1	308	1.189s	259 tok/s
2	262	1.090s	240 tok/s
3	286	0.902s	317 tok/s
Average			272 tok/s
Peak			317 tok/s

A spread of 240–317 tok/s is acceptable. Differences come from network jitter, server load, and the model using different numbers of diffusion steps depending on output complexity.

The Speed Gap: Advertised vs. Measured

Measurement	Speed	Notes
Inception's benchmark	1,196 tok/s	Server-side, no network
Our best (streaming, generation only)	558 tok/s	Excludes TTFT
Our best (non-streaming, end-to-end)	495 tok/s	Large output
Multi-run average	272 tok/s	Medium output
Short output	117 tok/s	Network dominates

We measured roughly half the advertised speed. That's not a knock on Mercury 2 — it's physics. Our tests ran from a consumer internet connection through the public API. The 1,196 tok/s figure is server-side throughput measured at the inference layer, before network overhead, TLS, HTTP framing, and Python SDK parsing eat into it.

To match their number, you'd need to benchmark from co-located infrastructure (same cloud region) or measure at the GPU layer. For what it's worth, 558 tok/s over the public internet is genuinely fast — most autoregressive models top out at 50-150 tok/s in comparable conditions.

How Does It Compare? Price & Speed

Speed only matters in context. Mercury 2 competes in the "fast and cheap" tier — models you'd use for high-volume pipelines, agents, and latency-sensitive applications, not frontier reasoning. Here's how it stacks up:

Pricing Comparison

Model	Input $/M	Output $/M	Context	Architecture
Mercury 2	$0.25	$1.00	128K	Diffusion
DeepSeek V3	$0.28	$0.42	128K	Autoregressive (MoE)
GPT-4.1 Nano	$0.10	$0.40	1M	Autoregressive
GPT-4o-mini	$0.15	$0.60	128K	Autoregressive
Gemini 2.0 Flash	$0.10	$0.40	1M	Autoregressive
Claude 3.5 Haiku	$0.80	$4.00	200K	Autoregressive
GPT-4o	$2.50	$10.00	128K	Autoregressive
Claude Sonnet 4.6	$3.00	$15.00	200K	Autoregressive

On input pricing, Mercury 2 is mid-pack — GPT-4.1 Nano and Gemini 2.0 Flash are cheaper at $0.10/M. On output, it's $1.00/M — more expensive than DeepSeek ($0.42) and GPT-4.1 Nano ($0.40), but far cheaper than Claude Haiku ($4.00) or any mid-tier model.

The real cost story is output-heavy workloads. If you're generating long responses (agents, code generation, content pipelines), output pricing dominates. At $1.00/M output, Mercury 2 costs:

2.5x more than GPT-4.1 Nano
2.5x more than Gemini 2.0 Flash
4x less than Claude 3.5 Haiku
10x less than GPT-4o
15x less than Claude Sonnet

Speed Comparison

Model	Approx. Speed (tok/s)	Notes
Mercury 2	495–558 (measured)	Diffusion; 1,196 server-side
Gemini 2.0 Flash	~250	Google's speed tier
DeepSeek V3	~100–160	Varies by load
GPT-4o-mini	~100–130	OpenAI speed tier
GPT-4.1 Nano	~150–200	OpenAI's fastest
Claude 3.5 Haiku	~80–100	Anthropic speed tier
GPT-4o	~60–90	Mid-tier
Claude Sonnet 4.6	~70–80	Mid-tier

Speed figures are approximate client-side measurements and vary by network, region, and load. Mercury 2 figures are from our testing.

Even through the public internet, Mercury 2 is roughly 2x faster than the next fastest competitor (Gemini Flash at ~250 tok/s) and 5-7x faster than mid-tier models like GPT-4o and Claude Sonnet. This is where the diffusion architecture genuinely shines; it's not marketing fluff.

Cost per Million Output Tokens at Speed

A useful way to think about it: what do you pay per million output tokens, and how fast do you get them?

Model	Output $/M	Speed (tok/s)	Time for 1M tokens	Cost per hour of output
Mercury 2	$1.00	550	~30 min	~$2.00
GPT-4.1 Nano	$0.40	175	~95 min	~$0.25
DeepSeek V3	$0.42	130	~128 min	~$0.20
Gemini 2.0 Flash	$0.40	250	~67 min	~$0.36
Claude 3.5 Haiku	$4.00	90	~185 min	~$1.30

Mercury 2 isn't the cheapest per token, but it delivers those tokens fastest. If your bottleneck is latency — how quickly you can complete an agentic loop, respond to a user, or process a document — Mercury 2 wins decisively. If your bottleneck is pure cost and you can tolerate slower speeds, DeepSeek V3 or GPT-4.1 Nano are cheaper.
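
Both derived columns in that table come from two small formulas, sketched below so you can plug in your own measured speeds. The prices and speeds are the ones from the table; the function names are ours.

```python
def minutes_per_million(tok_per_s: float) -> float:
    """How long it takes to generate 1M output tokens at a given speed."""
    return 1_000_000 / tok_per_s / 60

def cost_per_hour(output_price_per_m: float, tok_per_s: float) -> float:
    """Spend for one hour of continuous output at a given speed and price."""
    tokens_per_hour = tok_per_s * 3600
    return tokens_per_hour / 1_000_000 * output_price_per_m

# (output $/M, tok/s) taken from the table above.
MODELS = {
    "Mercury 2": (1.00, 550),
    "GPT-4.1 Nano": (0.40, 175),
    "DeepSeek V3": (0.42, 130),
    "Gemini 2.0 Flash": (0.40, 250),
    "Claude 3.5 Haiku": (4.00, 90),
}

for name, (price, speed) in MODELS.items():
    print(f"{name}: {minutes_per_million(speed):.0f} min per 1M tokens, "
          f"${cost_per_hour(price, speed):.2f} per hour of output")
```

Note that a higher cost per hour here partly reflects speed: a faster model simply produces, and bills for, more tokens in the same hour.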

Beyond Speed: Extended Testing

We ran a second test suite covering reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, and edge cases. The results surfaced both strengths and a critical quirk.

The max_tokens Trap

The most important practical finding: Mercury 2 needs generous max_tokens values or it returns empty responses.

With autoregressive models, setting max_tokens=20 means "generate up to 20 tokens, stop when you're done." The model emits tokens one by one and stops early if it finishes. Mercury 2's diffusion architecture works differently — it appears to allocate the full output buffer upfront. If that buffer is too small, the model produces empty content with finish_reason=length and tokens=0:

# This fails silently — returns empty string
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=10,  # too low for diffusion model
)
print(response.choices[0].message.content)  # ""

# This works — give it room
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=150,  # generous headroom
)
print(response.choices[0].message.content)  # "4"

Rule of thumb: always set max_tokens to at least 150–200, even if you expect a short answer. The model will still stop early (finish_reason=stop) when it's done — you won't waste tokens. But if you set it too low, you get nothing. This is a significant difference from autoregressive models and will bite you in production if you're migrating existing code.
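
If you are migrating existing code, a small guard can catch this before it reaches production. The sketch below clamps `max_tokens` to a floor and detects the empty-response symptom described above. The floor of 200 follows our rule of thumb, and both helper names are hypothetical.

```python
MIN_MAX_TOKENS = 200  # floor based on the 150-200 rule of thumb above

def safe_max_tokens(requested: int, floor: int = MIN_MAX_TOKENS) -> int:
    """Never send a max_tokens value below the floor."""
    return max(requested, floor)

def is_silent_truncation(content, finish_reason: str) -> bool:
    """True when the reply is empty and the API says it ran out of tokens."""
    return finish_reason == "length" and not (content or "").strip()

# Example: values carried over from code written for an autoregressive model.
print(safe_max_tokens(10))                    # clamped up to the floor
print(safe_max_tokens(1024))                  # already generous, unchanged
print(is_silent_truncation("", "length"))     # the failure mode from the first call
print(is_silent_truncation("4", "stop"))      # a normal short answer
```

In practice you would call `safe_max_tokens` when building the request and `is_silent_truncation` on `response.choices[0].message.content` and `response.choices[0].finish_reason`, retrying with a larger budget when it returns true.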

The Proof: 10/25 → 25/25

Our first run scored 10 out of 25, a result that would make Mercury 2 look broken. Our second run, with max_tokens raised and one test expectation corrected, scored 25/25. Prompts, model, and API were unchanged. Here's the full breakdown:

Suite	Initial	Final	What changed
Reasoning	5/6	6/6	Logic: 80→150 tokens, instruction: 120→250
Multi-language	0/3	3/3	30→200 tokens
Multi-turn	0/3	3/3	30–60→200 tokens
Agentic	3/4	4/4	Fixed step 3 logic (model skipped get_price)
Needle-in-Haystack	0/3	3/3	40→200 tokens
Concurrency	0/20	20/20	20→150 tokens
Sampling	0/1	1/1	10→150 tokens
Edge Cases	2/5	5/5	System prompt 20→150, JSON 80→200, long sys 30→150
Total (concurrency requests not counted)	10/25	25/25

Almost every failure traced back to the same root cause: max_tokens too low for the diffusion architecture. The one exception was the agentic suite, where our test expected a get_price call that the model skipped. No actual quality or capability issues were found. If you're migrating from GPT or Claude, your existing max_tokens values are almost certainly too low for Mercury 2.

Quality & Reasoning: 6/6

Test	Result	Details
Arithmetic (17×23+14−5)	PASS	Returned `400` correctly
Word problem (45−12+8)	PASS	Returned `41` correctly
Logic (invalid syllogism)	PASS	Correctly answered "No" with valid reasoning
Code generation (fibonacci)	PASS	Clean Python function, 107 chars
Instruction following (3 bullets)	PASS	Exactly 3 dash-prefixed bullets
Factual recall (capital of Australia)	PASS	`Canberra`

Perfect score. Math, logic, code generation, instruction following, and factual recall all pass cleanly.

Multi-language: 3/3

Language	Prompt	Response
Serbian	"Koji je glavni grad Srbije?"	Beograd
German	"Was ist die Hauptstadt von Deutschland?"	Berlin
Japanese	"日本の首都はどこですか？"	東京

Mercury 2 handles non-English prompts correctly, including Serbian (written here in Latin script) and Japanese. Responses are accurate and concise.

Multi-turn Conversation: 3/3

We tested whether the model maintains context across turns:

messages = [
    {"role": "system", "content": "You are a helpful assistant. Be concise."},
    {"role": "user", "content": "My name is Marko and I live in Novi Sad."},
]
# ... assistant responds ...
messages.append({"role": "user", "content": "What is my name?"})
# → "Your name is Marko."
messages.append({"role": "user", "content": "Where do I live?"})
# → "You live in Novi Sad."

Both facts recalled correctly. We also tested persona consistency by assigning a pirate persona — Mercury 2 committed fully ("Arr, matey! Gather 'round the galley o' knowledge...") with 7 pirate-themed words in a single response.

Agentic Tool Chains: 4/4

This was the most impressive result. We defined three tools (search_product, get_price, add_to_cart) and asked Mercury 2 to find a blue t-shirt and add it to a cart:

Step 1: User asks "Find me a blue t-shirt and add it to my cart."
     → Model calls search_product(query="blue t-shirt")       ✓

Step 2: We return search results with SKU-1234
     → Model calls add_to_cart(product_id="SKU-1234")          ✓

Step 3: We confirm the cart addition
     → Model responds: "Your blue t-shirt has been added       ✓
        to your cart. Let me know if you'd like anything else."

Step 4: We return a tool error ("Service temporarily unavailable")
     → Model retries the tool call                             ✓

Four steps, four correct decisions. The model understood the task, chained tools in the right order, confirmed success in natural language, and recovered from an error by retrying. This validates Inception's pitch that Mercury 2 is built for agentic workloads.

Needle in a Haystack: 3/3

We hid the string MERCURY-FAST-7742 inside ~4,000 tokens of filler text at three positions:

Position	Found?
Beginning	MERCURY-FAST-7742
Middle	MERCURY-FAST-7742
End	MERCURY-FAST-7742

Perfect retrieval at all positions. The 128K context window handles information retrieval correctly — at least at the ~4K scale we tested.
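
For readers who want to build a similar probe, here is a rough Python sketch of how such a haystack can be assembled. The filler sentence, the sentence count, and the `build_haystack` helper are placeholders for illustration; only the needle string comes from the test above, and the token count of the result is approximate.

```python
def build_haystack(needle: str, filler_sentence: str,
                   n_sentences: int, position: str) -> str:
    """Return filler text with the needle inserted at the beginning, middle, or end."""
    sentences = [filler_sentence] * n_sentences
    index = {"beginning": 0, "middle": n_sentences // 2, "end": n_sentences}[position]
    sentences.insert(index, needle)
    return " ".join(sentences)


NEEDLE = "The secret code is MERCURY-FAST-7742."
FILLER = "The quick brown fox jumps over the lazy dog."  # placeholder filler text

haystacks = {pos: build_haystack(NEEDLE, FILLER, 400, pos)
             for pos in ("beginning", "middle", "end")}
```

Each haystack is then sent as the user message along with a question asking for the secret code.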

Concurrency: 20/20

We fired parallel requests to test API behavior under load:

Parallel Requests	Success	Wall Time	Total Tokens	Avg Latency
5	5/5	0.78s	20	0.65s
15	15/15	0.88s	60	0.61s

Every request succeeded. Wall time barely increased from 5 to 15 parallel requests (0.78s → 0.88s), and average latency stayed consistent at ~0.6s. The API handles concurrency well — no throttling, no degradation at this scale.
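
A load test of this shape can be sketched with a thread pool. The version below is a hedged illustration: `run_parallel` and `fake_send` are hypothetical names, and `fake_send` sleeps instead of calling the API, so swap in a real request before drawing conclusions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_parallel(send: Callable[[int], bool], n: int) -> Tuple[int, float, float]:
    """Fire n requests at once; return (successes, wall_time_s, avg_latency_s)."""
    def timed(i: int) -> Tuple[bool, float]:
        start = time.perf_counter()
        ok = send(i)
        return ok, time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results: List[Tuple[bool, float]] = list(pool.map(timed, range(n)))
    wall = time.perf_counter() - wall_start
    successes = sum(1 for ok, _ in results if ok)
    avg_latency = sum(latency for _, latency in results) / n
    return successes, wall, avg_latency


def fake_send(i: int) -> bool:
    """Stand-in for one API call; replace with a real request."""
    time.sleep(0.05)
    return True


successes, wall, avg_latency = run_parallel(fake_send, 5)
```

Wall time close to the average latency, as in the table above, means the requests really did run side by side.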

Temperature & Sampling

Diffusion models sample differently from autoregressive models. We tested whether Mercury 2's temperature parameter behaves as expected:

temp=0.0: ['turquoise', 'turquoise', 'turquoise', 'turquoise'] — 1 unique
temp=0.5: ['turquoise', 'turquoise', 'turquoise', 'turquoise'] — 1 unique
temp=1.0: ['turquoise', 'cerulean', 'indigo', 'turquoise']     — 3 unique
temp=1.5: ['turquoise', 'turquoise', 'cyan', 'turquoise']      — 2 unique

Temperature works, but with a twist: diversity only appears somewhere between 0.5 and 1.0. At temp=0.0 and 0.5, all four responses are identical, which suggests the diffusion denoising process converges to the same output. At temp=1.0, we see real variety (turquoise, cerulean, indigo). Determinism at temp=0 is confirmed: ['4', '4', '4'] across three runs.

This is meaningfully different from autoregressive models, where temp=0.5 typically already produces some variation.
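
The unique counts above are simply the number of distinct responses per temperature. A small Python sketch of that tally follows. `diversity_by_temperature` is a hypothetical helper, and `replay` feeds back the responses listed above instead of calling the API.

```python
from typing import Callable, Dict, List


def diversity_by_temperature(sample: Callable[[float], str],
                             temperatures: List[float],
                             runs: int = 4) -> Dict[float, int]:
    """Count distinct responses over `runs` calls at each temperature."""
    return {t: len({sample(t) for _ in range(runs)}) for t in temperatures}


# Replay of the responses listed above, in place of live API calls.
observed = {
    0.0: ["turquoise"] * 4,
    0.5: ["turquoise"] * 4,
    1.0: ["turquoise", "cerulean", "indigo", "turquoise"],
    1.5: ["turquoise", "turquoise", "cyan", "turquoise"],
}
_cursor = {t: iter(responses) for t, responses in observed.items()}


def replay(temperature: float) -> str:
    return next(_cursor[temperature])


unique_counts = diversity_by_temperature(replay, [0.0, 0.5, 1.0, 1.5])
# unique_counts == {0.0: 1, 0.5: 1, 1.0: 3, 1.5: 2}
```

With only four samples per setting, these counts are a coarse signal rather than a measurement of the sampling distribution.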

Edge Cases: 5/5

Test	Result	Details
Minimal prompt ("Hi")	PASS	"Hello! How can I assist you today?"
System prompt (exactly 3 words)	PASS	"I'm doing well." — exactly 3 words
Stop sequence	PASS	Correctly stopped before "10"
Nested JSON	PASS	Valid JSON with nested objects and arrays
Long system prompt (50 rules)	PASS	Returned "acknowledged"

All edge cases pass with adequate max_tokens headroom. System prompt adherence, stop sequences, and complex JSON structures all work correctly.

Extended Test Summary

Suite	Score	Verdict
Reasoning	6/6	Math, logic, code, facts, instructions
Multi-language	3/3	Serbian, German, Japanese all correct
Multi-turn	3/3	Memory and persona consistency
Agentic Loops	4/4	Multi-step tool chains + error recovery
Needle-in-Haystack	3/3	Perfect retrieval at all positions
Edge Cases	5/5	System prompts, stop sequences, nested JSON
Concurrency	20/20	No degradation at 15 parallel requests
Sampling	1/1	Deterministic at temp=0, diversity above 0.5
Total (excluding the 20 concurrency requests)	25/25

The Bottom Line

Mercury 2 scored 25/25 on our extended test suite — every capability we tested works correctly. Reasoning, multi-language, multi-turn conversation, agentic tool chains, needle-in-a-haystack retrieval, concurrency, temperature sampling, and edge cases all pass. The OpenAI compatibility is seamless — you can swap it into an existing codebase in under a minute.

The one thing you must know before deploying: set max_tokens generously (150+), even for short expected outputs. The diffusion architecture needs output headroom or it returns silent empty responses. This is the single biggest gotcha when migrating from autoregressive models. The model still stops early when it's done, so you won't waste tokens, but too small a buffer produces nothing.
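
One way to apply this during a migration is to clamp max_tokens in a single place before every request goes out. A minimal sketch, assuming request parameters are held in a plain dict: `with_headroom` is a hypothetical helper, and treating an unset max_tokens as too low is our assumption, not documented API behavior.

```python
MIN_MAX_TOKENS = 150  # floor suggested by the tests above


def with_headroom(params: dict, floor: int = MIN_MAX_TOKENS) -> dict:
    """Return a copy of request params with max_tokens raised to at least `floor`."""
    patched = dict(params)
    requested = patched.get("max_tokens")
    if requested is None or requested < floor:
        patched["max_tokens"] = floor
    return patched


legacy = {"model": "mercury-2", "max_tokens": 20}
patched = with_headroom(legacy)  # max_tokens becomes 150; legacy is untouched
```

Values already above the floor pass through unchanged, so existing long-output calls are not affected.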

The speed advantage is genuine, though tempered by network reality. You won't see 1,196 tok/s from your laptop, but 400-550 tok/s is still 2-3x faster than the next fastest alternative. The agentic capabilities are particularly strong — multi-step tool chains with error recovery worked flawlessly, validating Inception's core pitch. Temperature sampling works but behaves differently: diversity only kicks in above 0.5, unlike autoregressive models where any non-zero temperature introduces variation.

It's not the cheapest model per token (GPT-4.1 Nano and DeepSeek V3 undercut it on output pricing), and it's not the smartest (frontier models like Claude Sonnet or GPT-4o have deeper reasoning). But in the speed-to-cost ratio for production workloads, Mercury 2 occupies a unique position — and as the first commercial diffusion LLM, it represents a genuine architectural bet that the rest of the industry is watching.

Specs at a Glance:

Spec	Value
Model	`mercury-2`
Architecture	Diffusion LLM (dLLM)
Context window	128K tokens
Max completion	16,384 tokens
Input pricing	$0.25/M tokens
Output pricing	$1.00/M tokens
API compatibility	OpenAI-compatible
Measured throughput	495–558 tok/s (client-side)

SEO Is Dead. Your Rankings Don't Matter Anymore.

ai.rs — Fri, 27 Feb 2026 13:00:00 +0100

The Number That Should Scare Every Business Owner

On January 28, 2026, LinkedIn published something remarkable. Not a product launch. Not a feature update. A confession.

Non-brand B2B traffic to their web properties had dropped up to 60%. Not because their rankings fell — they didn't. Rankings were stable. The clicks just... stopped coming.

LinkedIn, which Microsoft bought for about $26 billion and which employs an army of SEO professionals, is telling the world that the old rules no longer apply.

If it can happen to them, it's already happening to you.

What Changed

The answer is two letters: AI.

When someone Googles "best CRM for small business" in 2026, they don't see ten blue links. They see an AI-generated answer that synthesizes information from dozens of sources, gives a direct recommendation, and answers follow-up questions on the spot.

The user gets what they need. They never click through to your website.

The numbers are brutal:

Metric	Before AI Search	After AI Search
Searches ending without a click	~40%	~60%
Click-through rate on #1 ranking	~30%	~13%
AI Overview zero-click rate	—	83%

That last number is the killer. When Google shows an AI Overview for your search term, 83% of users never visit any website. Your #1 ranking is now a participation trophy.

The Old Playbook Is Dead

For twenty years, B2B marketing followed the same script:

Create content around keywords your customers search for
Rank on Google through SEO
Get clicks from search results
Convert visitors into leads and customers

Every step in this chain assumed humans would click through to read your content. That assumption is broken.

LinkedIn's own data shows it clearly: rankings held steady while traffic collapsed. The pipeline didn't leak — the entire first half of it evaporated.

Who Gets Hit Hardest

Not everyone feels this equally:

Most vulnerable:

Informational content ("what is...", "how to...", "best practices for...")
Industry overview and comparison pages
FAQ and knowledge base content
Generic thought leadership

Least vulnerable (for now):

Branded searches (people looking specifically for you)
Transactional pages (pricing, signup, checkout)
Unique tools and interactive content
Original research with proprietary data

If your traffic comes from people learning about a topic (not searching for you by name), you're in the danger zone.

LinkedIn's New Playbook

Two weeks after their disclosure, LinkedIn released a 17-page guide on adapting to AI search. Their new framework replaces the old funnel:

Old model: Rank → Click → Visit → Convert

New model: Be seen → Be mentioned → Be considered → Be chosen

The shift is fundamental. Instead of optimizing for Google's algorithm, LinkedIn is now optimizing for AI citations. The goal isn't getting someone to click — it's making sure the AI mentions you when it answers the question.

Their new KPIs:

How often is LinkedIn cited in AI-generated answers?
When AI summarizes a topic, does it reference LinkedIn data?
Is LinkedIn the authoritative source the AI trusts?

They're not fighting the wave. They're learning to surf it.

What This Means for Your Business

Let's get practical. Here's what changes for businesses that depend on web traffic:

1. Your Website Content Needs to Feed AI, Not Just Humans

AI systems consume your content differently than humans do. They care about:

Clear structure — headings, lists, and tables that are easy to parse
Definitive statements — "The average cost is $X" beats "costs vary depending on..."
Cited data — numbers with sources are more likely to be referenced
Unique information — original data, case studies, proprietary research

Generic "ultimate guide" blog posts are AI fodder — the AI will summarize them and the user will never visit you. Original data is the moat.

2. Implement llms.txt

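Think of this as robots.txt for AI. It is a proposed convention rather than a ratified standard, so not every crawler reads it yet. A /llms.txt file on your website tells the AI crawlers that do read it what your business does, what content matters, and how to represent you.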

It looks like this:

# Company Name
> One-line description of what you do.

## Core Services
- Service 1: description
- Service 2: description

## Key Content
- [Article Title](URL): description

If you don't tell the AI how to describe you, it will guess. And it will get it wrong.
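
If your site is generated by a build step, the file can be rendered from data you already have. Here is a minimal Python sketch that produces the layout shown above. `build_llms_txt` is a hypothetical helper, and the company, service, and URL are placeholders.

```python
from typing import List, Tuple


def build_llms_txt(name: str, summary: str,
                   services: List[Tuple[str, str]],
                   articles: List[Tuple[str, str, str]]) -> str:
    """Render the layout shown above: title, one-line summary, then two sections."""
    lines = [f"# {name}", f"> {summary}", "", "## Core Services"]
    lines += [f"- {title}: {desc}" for title, desc in services]
    lines += ["", "## Key Content"]
    lines += [f"- [{title}]({url}): {desc}" for title, url, desc in articles]
    return "\n".join(lines) + "\n"


text = build_llms_txt(
    "Example Co",  # placeholder company
    "We build widgets for small teams.",
    [("Widget design", "custom widgets to order")],
    [("Widget guide", "https://example.com/guide", "how to pick a widget")],
)
# Write `text` to the web root as llms.txt when deploying.
```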

3. Build Direct Channels

Every subscriber on your email list is someone AI search can never take away from you. The same goes for:

Email newsletters — direct inbox access, no algorithm in between
Community — Discord, Slack, or forum members
Repeat customers — people who bookmark your site, not Google it

LinkedIn learned this the hard way. The companies that survive the AI search shift are the ones that built direct relationships before the traffic disappeared.

4. Focus on Brand, Not Keywords

When 83% of searches that show an AI Overview end without a click, the game changes. You can't win by ranking for "how to choose a CRM." But you can win by being the brand people search for by name.

Brand searches still convert. "Salesforce pricing" still drives clicks because the user wants Salesforce's own website, not an AI summary.

The investment shifts from "content marketing" to "brand building." That means:

Being the source journalists and analysts quote
Publishing original research others reference
Building products and tools people talk about
Having a point of view that makes you memorable

5. Rethink Your Metrics

If you're still measuring success by organic traffic, you're watching the wrong dashboard. New metrics that matter:

AI citation rate — is your brand mentioned in AI answers?
Brand search volume — are more people searching for you by name?
Direct traffic — people typing your URL or using bookmarks
Email list growth — your owned audience, immune to algorithm changes
Referral traffic — links from other sites, podcasts, newsletters
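
AI citation rate has no standard definition yet. One simple working version: save the answers AI tools give to a fixed set of questions and measure what fraction mention you. The Python sketch below does that. `citation_rate` is a hypothetical helper, and the brand and answers are invented sample data.

```python
import re
from typing import Iterable


def citation_rate(answers: Iterable[str], brand: str) -> float:
    """Fraction of answers that mention `brand` as a whole word, case-insensitively."""
    pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    answers = list(answers)
    if not answers:
        return 0.0
    hits = sum(1 for text in answers if pattern.search(text))
    return hits / len(answers)


saved_answers = [  # hypothetical answers collected by hand
    "For small teams, Acme CRM and two rivals are common picks.",
    "Popular options include several well-known CRMs.",
    "Many reviewers recommend acme crm for its pricing.",
    "It depends on your budget and team size.",
]
rate = citation_rate(saved_answers, "Acme CRM")  # 2 of 4 answers mention the brand
```

Tracking the same question set month over month matters more than the absolute number.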

A 60% traffic drop looks catastrophic if traffic is your KPI. It looks irrelevant if your revenue comes from direct relationships and brand recognition.

The Uncomfortable Truth

This isn't a temporary disruption. AI search isn't going away — it's getting better, faster, and more integrated into every platform. Google, Bing, Perplexity, ChatGPT — they all want to answer the question so the user doesn't have to leave.

The businesses that adapt will thrive. The ones that keep optimizing meta descriptions and chasing keyword rankings will wonder where their traffic went.

LinkedIn, with all its resources, data, and expertise, took a hit of up to 60% before adapting. Most small and mid-size businesses don't have that runway.

The time to adapt is now, not when your traffic dashboard turns red.

Action Items

Start this week:

Audit your traffic — what percentage comes from informational vs. branded searches?
Add llms.txt to your website (learn the format)
Start an email list if you don't have one — this is your insurance policy
Review your content — does it contain unique data, or is it summarizable commodity content?
Track AI visibility — search your brand and products in ChatGPT and Perplexity. What do they say about you?
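
The traffic audit in the first item can start from a search-query export. A rough Python sketch, assuming you have (query, clicks) pairs and a list of your own brand terms: `branded_share` is a hypothetical helper and the sample rows are invented.

```python
from typing import Dict, Iterable, Tuple


def branded_share(queries: Iterable[Tuple[str, int]],
                  brand_terms: Iterable[str]) -> Dict[str, float]:
    """Split clicks into branded and non-branded by substring match on brand terms."""
    terms = [term.lower() for term in brand_terms]
    branded = other = 0
    for query, clicks in queries:
        if any(term in query.lower() for term in terms):
            branded += clicks
        else:
            other += clicks
    total = branded + other
    if total == 0:
        return {"branded": 0.0, "non_branded": 0.0}
    return {"branded": branded / total, "non_branded": other / total}


sample = [("acme crm pricing", 300), ("what is a crm", 500), ("acme login", 200)]
shares = branded_share(sample, ["acme"])  # half the clicks are branded here
```

The non-branded share is the part of your traffic most exposed to AI answers.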

The old game rewarded volume — more pages, more keywords, more content. The new game rewards authority. Be the source, not the summary.

Claude Code Remote Control: Continue Coding Sessions from Your Phone

ai.rs — Fri, 27 Feb 2026 01:00:00 +0100

What Is Claude Code Remote Control?

Claude Code Remote Control is a new feature that connects your local Claude Code terminal session to your phone, tablet, or any browser. Start a coding task at your desk, walk away, and continue it from your couch using the Claude mobile app.

The key difference from cloud-based coding: everything runs locally. Your filesystem, MCP servers, tools, and project configuration stay on your machine. The mobile interface is just a window into your running session.

How It Works

Your Machine (terminal)           Anthropic API        Your Phone
┌───────────────────────┐         ┌──────────┐         ┌──────────┐
│ claude remote-control │ ──TLS──▶│  Routes  │◀──TLS── │ Claude   │
│                       │         │ messages │         │ App      │
│ Local filesystem      │         └──────────┘         └──────────┘
│ MCP servers           │
│ Project config        │
└───────────────────────┘

No port forwarding. No VPN. No SSH tunnels. Claude Code makes outbound HTTPS requests only — it never opens inbound ports on your machine. The Anthropic API routes messages between your local session and whatever device you're using.

Getting Started

Requirements

Claude Pro or Max plan (not available on Team/Enterprise yet)
Claude Code installed and authenticated via /login
Claude mobile app — iOS or Android

Start a New Remote Session

Navigate to your project and run:

claude remote-control

This displays a session URL and a QR code (press spacebar to toggle). Scan the QR code with your phone to connect instantly.

From an Existing Session

Already mid-conversation? Use the slash command:

/remote-control

Or the shorthand:

/rc

Your full conversation history carries over. Tip: use /rename first to give the session a descriptive name so you can find it on your phone.

Connect from Another Device

Three ways to connect:

Scan the QR code — fastest, opens directly in the Claude app
Open the session URL — works in any browser at claude.ai/code
Find it in the app — remote sessions show a computer icon with a green dot when online

Real-World Use Cases

The "Deploy from Dinner" Workflow

You're running a deployment at your desk. The build is going to take 20 minutes. Walk to dinner, and when the build finishes, approve the next step from your phone. No rushing back to your laptop.

Code Review on the Couch

Start reviewing a PR at your desk with full context — local repo, test runners, linters. Move to the couch and continue asking Claude questions about the code, running tests, and suggesting changes.

On-Call Incident Response

Get paged at 2 AM. Instead of opening your laptop, scan the QR code on your phone and start debugging immediately. Claude has access to your full local environment — logs, configs, deployment scripts.

Always-On Mode

Don't want to run /remote-control every time? Enable it globally:

Run /config inside Claude Code
Set Enable Remote Control for all sessions to true

Now every Claude Code session is automatically available from your phone.

Security Model

All traffic goes through the Anthropic API over TLS — same security as normal Claude Code usage
Multiple short-lived credentials, each scoped to a single purpose with independent expiration
No inbound ports opened on your machine
Session data stays local — the phone is just a remote display

Limitations to Know

Limitation	Detail
One remote connection	Each session supports one remote connection at a time
Terminal must stay open	If you close the terminal, the session ends
Network timeout	~10 minutes of network loss kills the session
Plan requirement	Pro or Max plan only (no API keys)

Remote Control vs Claude Code on the Web

Both use the same claude.ai/code interface, but they're fundamentally different:

	Remote Control	Claude Code on Web
Execution	Your machine	Anthropic cloud
File access	Your local filesystem	Cloud sandbox
MCP servers	Your local servers	Not available
Best for	Continuing local work remotely	Starting fresh without local setup

Use Remote Control when you're mid-task and want mobility. Use Claude Code on the web when you want to spin up something new without cloning a repo.

What This Means for Developer Workflows

Remote Control solves a real friction point: context switching between devices kills flow. Previously, if you walked away from your desk, you either lost your coding context or set up complex SSH/tmux/mosh chains.

Now it's: run one command, scan a QR code, keep going. Your full environment — files, tools, MCP servers, conversation history — travels with you.

Combined with Claude Code's $2.5 billion annualized run rate as of February 2026, it's clear that AI-assisted coding is no longer experimental. Remote Control is the kind of quality-of-life feature that makes daily use seamless.

Mercury 2: The First Reasoning Diffusion LLM — 1,000 Tokens/sec

ai.rs — Thu, 26 Feb 2026 11:00:00 +0100

What Is Mercury 2?

Mercury 2 is the first commercial reasoning diffusion LLM from Inception Labs. Unlike every major LLM you've used — GPT, Claude, Llama — Mercury 2 doesn't generate tokens one at a time. It uses diffusion to produce multiple tokens in parallel, then refines them over a small number of steps.

The result: ~1,000 tokens per second output throughput on NVIDIA Blackwell GPUs.

For context, Claude 4.5 Haiku outputs ~89 tok/s and GPT-5 Mini ~71 tok/s. On those figures Mercury 2 is more than 10× faster (about 11× and 14×, respectively).

How Diffusion LLMs Work

Traditional LLMs are autoregressive: they predict one token, append it, then predict the next. This is inherently sequential — each token depends on all previous tokens.

Diffusion LLMs take a fundamentally different approach borrowed from image generation (Stable Diffusion, DALL-E):

Start with noise — begin with a block of random tokens
Refine in parallel — iteratively denoise all tokens simultaneously
Converge — after a small number of refinement steps, the output is coherent text

This is called block diffusion. Because tokens are generated in parallel rather than sequentially, GPU utilization skyrockets — you're doing useful compute across all cores simultaneously instead of waiting for one token at a time.

Autoregressive (traditional):
  Token 1 → Token 2 → Token 3 → Token 4 → ...
  [sequential, ~100 tok/s]

Diffusion (Mercury 2):
  [noise] → [rough draft] → [refined] → [final output]
  [parallel, ~1,000 tok/s]
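
A back-of-envelope way to see where the speedup comes from is to count forward passes. The Python sketch below does only that. The block size and number of refinement steps are invented for illustration and are not Mercury 2's actual settings, and a diffusion pass need not cost the same as an autoregressive one.

```python
import math


def autoregressive_passes(n_tokens: int) -> int:
    """One forward pass per generated token."""
    return n_tokens


def block_diffusion_passes(n_tokens: int, block_size: int, refine_steps: int) -> int:
    """Each block of tokens is denoised together over a fixed number of passes."""
    return math.ceil(n_tokens / block_size) * refine_steps


n = 256
ar = autoregressive_passes(n)                                  # 256 passes
bd = block_diffusion_passes(n, block_size=32, refine_steps=4)  # 8 blocks * 4 = 32 passes
```

Under these made-up settings the diffusion path needs one eighth as many passes for the same output length.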

Benchmarks

Mercury 2 is positioned as a fast reasoning model, comparable to Claude 4.5 Haiku and GPT-5 Mini in quality but dramatically faster:

Benchmark	Mercury 2	Claude 4.5 Haiku	GPT-5 Mini
AIME 2025	91.1	~90	~88
GPQA	73.6	~75	~72
LiveCodeBench	67.3	~65	~63
IFBench	71.3	—	—
Output speed	~1,000 tok/s	~89 tok/s	~71 tok/s

This isn't competing with frontier models like Claude Opus or GPT-5 on the hardest reasoning tasks. It's targeting the fast agent tier — where speed matters more than peak intelligence.

Key Features

128K context window — handles large codebases and documents
Tunable reasoning — adjust the quality/speed tradeoff per request
Native tool use — function calling built in, not bolted on
Schema-aligned JSON output — structured output without post-processing
OpenAI API compatible — drop-in replacement, no code rewrites needed

Where This Matters: Agentic Workflows

The real impact isn't chat. It's agentic loops where an LLM runs hundreds of iterations:

Code generation pipelines — write, test, fix, repeat. At 1,000 tok/s, each iteration takes seconds instead of minutes
Multi-step reasoning — chain-of-thought that would take 30 seconds now takes 3
Real-time applications — live coding assistants, interactive debugging, instant analysis
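
To put rough numbers on this, here is a small Python sketch of generation time for an agent loop. The throughput figures are the ones quoted above. The workload (100 iterations of 500 output tokens) is an assumption, and network and tool latency are ignored.

```python
def loop_seconds(iterations: int, tokens_per_iteration: int, tok_per_s: float) -> float:
    """Generation time for an agent loop, ignoring network and tool latency."""
    return iterations * tokens_per_iteration / tok_per_s


# Assumed workload: 100 iterations of 500 output tokens each.
haiku = loop_seconds(100, 500, 89)      # about 562 s
mercury = loop_seconds(100, 500, 1000)  # 50 s
```

In practice tool calls and round-trips add a fixed cost per iteration, so the end-to-end gap will be smaller than the raw throughput ratio.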

A developer on Hacker News proposed "intelligence per second" as the metric that matters: throughput × reasoning quality. Mercury 2 optimizes exactly this.

Hybrid Architecture Potential

The most interesting use case discussed in the community: frontier model for planning, diffusion model for execution.

Use Claude Opus or GPT-5 to create a high-level plan, then hand off to Mercury 2 for rapid iteration on individual steps. You get the best reasoning where it matters and maximum speed everywhere else.
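
The split can be sketched as one planning call followed by many fast execution calls. In the Python sketch below, both models are passed in as plain functions so it runs without any API. `plan_then_execute`, `fake_planner`, and `fake_executor` are hypothetical names, and a real setup would need error handling and context passing between steps.

```python
from typing import Callable, List


def plan_then_execute(task: str,
                      planner: Callable[[str], List[str]],
                      executor: Callable[[str], str]) -> List[str]:
    """Ask the slow model for a plan once, then run each step on the fast model."""
    steps = planner(task)
    return [executor(step) for step in steps]


# Stand-ins for real model calls; swap in API clients of your choice.
def fake_planner(task: str) -> List[str]:
    return [f"{task}: write tests", f"{task}: implement", f"{task}: refactor"]


def fake_executor(step: str) -> str:
    return f"done({step})"


results = plan_then_execute("add login", fake_planner, fake_executor)
```

The planner runs once per task while the executor runs once per step, which is why the executor's speed dominates total latency.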

Known Limitations

Mercury 2 is impressive but not without issues flagged by early users:

Factual accuracy — parallel generation can produce hallucinations that don't self-correct through the sequence (autoregressive models at least have each token conditioned on all previous ones)
Constraint satisfaction — struggles with tasks requiring strict sequential dependencies
Not frontier-tier — if you need the absolute best reasoning, you still want Opus or GPT-5

How to Try It

Mercury 2 is available today via the Inception API. It's OpenAI API compatible, so you can point any existing client at it:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key="your-inception-key"
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain quantum computing in 3 sentences"}]
)
print(response.choices[0].message.content)

What This Means for the Industry

Diffusion LLMs represent the first serious architectural challenge to the autoregressive paradigm that has dominated since GPT-2. If Mercury 2's approach scales to frontier quality, the entire cost structure of AI inference changes.

At 10× the throughput with comparable quality, inference costs drop dramatically. For businesses running AI at scale — customer support, content generation, code assistance — this could mean 10× more queries for the same GPU budget.

We're watching this space closely. The autoregressive vs. diffusion debate is just getting started.

What Is Fine-Tuning? Teaching AI New Tricks

ai.rs — Thu, 26 Feb 2026 10:00:00 +0100

The Smart New Hire

Imagine you just hired the smartest person you've ever met. They graduated top of their class, speak five languages, and can discuss everything from philosophy to physics. But they know nothing about your business.

You wouldn't fire them — you'd train them. Over a few weeks, you'd show them your products, teach them your processes, explain how you talk to customers, and correct their mistakes until they become an expert in your domain.

Fine-tuning is exactly this process, but for AI. You take a general-purpose model that already understands language and teach it to specialize in your specific area.

Pre-training vs. Fine-tuning

Every AI model goes through two phases, and understanding the difference is key.

Pre-training is like going to school. The model reads enormous amounts of text — books, websites, articles, code — and learns how language works. This gives it broad knowledge about the world, grammar, reasoning patterns, and general facts. Pre-training takes months and costs millions of dollars.

Fine-tuning is like on-the-job training. You take the pre-trained model and teach it something specific using your own examples. This is fast (hours, not months) and cheap (dollars, not millions).

	Pre-training	Fine-tuning
Purpose	Learn language and general knowledge	Learn specific skills or domain
Data needed	Trillions of words from the internet	Thousands of your own examples
Time	Weeks to months	Hours
Cost	Millions of dollars	Under $10
Who does it	Big AI companies (OpenAI, Google, Meta)	Anyone with domain expertise

You never need to pre-train a model yourself. That's already been done. Fine-tuning is the accessible part — the part where you add your own expertise.

What Changes After Fine-tuning?

A fine-tuned model behaves differently from the original in specific, controllable ways:

It Learns Your Domain

A model fine-tuned on wine shop data knows grape varieties, food pairings, and vintage years. A model fine-tuned on legal documents knows contract terminology and case law formatting. The model doesn't lose its general knowledge — it gains deep expertise in your area.

It Adopts Your Voice

If your training examples use a casual, friendly tone, the model learns to respond that way. If they're formal and precise, it matches that. The model picks up on style, vocabulary, and personality from your examples.

It Learns Boundaries

You can teach the model what not to do. If you include examples where the model politely declines to discuss competitors, modify prices, or go off-topic, it learns those boundaries. This is one of the most valuable aspects — building guardrails specific to your use case.

It Gets More Accurate

For your specific domain, a fine-tuned smaller model often outperforms a much larger general model. A 7-billion-parameter model fine-tuned on your product data will give better product recommendations than a 100-billion-parameter model that's never seen your catalog.

The Training Data: Your Secret Ingredient

Fine-tuning is only as good as the examples you provide. The format is simple — pairs of inputs and desired outputs:

Example 1: Input: "What's a good gift for someone who likes cooking?" Output: "Our Artisan Olive Oil Collection ($45) is our most popular gift for food enthusiasts. It includes three single-origin oils from Italy, Spain, and Greece, each with distinct flavor profiles perfect for different cuisines."

Example 2: Input: "Can you give me a discount?" Output: "I appreciate you asking! While I can't modify prices, I can help you find great options within your budget. What price range works for you?"

You typically need 5,000 to 25,000 of these examples for a solid fine-tuned model. That sounds like a lot, but most businesses already have this data — in customer service logs, FAQ pages, product catalogs, and email archives.
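
On disk, such pairs are commonly stored as JSON Lines, one example per line. The Python sketch below writes them in a chat-style "messages" layout. That layout is an assumption based on what several fine-tuning services accept, so check your provider's required format. `to_jsonl` is a hypothetical helper.

```python
import json
from typing import Iterable, Tuple


def to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (input, output) pairs as one chat-style JSON object per line."""
    lines = []
    for user_text, assistant_text in pairs:
        record = {"messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


pairs = [
    ("Can you give me a discount?",
     "I appreciate you asking! While I can't modify prices, I can help you "
     "find great options within your budget. What price range works for you?"),
]
jsonl = to_jsonl(pairs)
```

Exporting support logs or FAQ pages into this shape is usually the bulk of the work in a fine-tuning project.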

Real-World Examples of Fine-tuning

Customer Support

A telecom company fine-tunes a model on 10,000 resolved support tickets. The model learns to diagnose common problems, walk customers through solutions, and know when to escalate to a human. Result: 60% of support queries handled automatically.

Product Recommendations

An online retailer fine-tunes a model on purchase history and product pairings. The model learns that customers who buy running shoes often want moisture-wicking socks, and that people buying espresso machines usually need grinder recommendations. Result: 25% increase in average order value.

Content Creation

A marketing agency fine-tunes a model on their best-performing blog posts, ad copy, and social media content. The model learns their clients' brand voices, preferred formats, and messaging strategies. Result: first drafts that need 70% less editing.

Internal Knowledge

A consulting firm fine-tunes a model on their internal methodology documents, case studies, and best practices. New consultants use it to get up to speed on company approaches without bothering senior staff. Result: onboarding time cut in half.

What Fine-tuning Can't Do

It's important to understand the limits:

It can't learn facts that change frequently. If your product prices change weekly, fine-tuning isn't the right tool for price accuracy — that's where RAG (retrieval-augmented generation) comes in, pulling real-time data at query time.

It can't fix fundamental model limitations. If the base model struggles with complex math, fine-tuning won't make it a calculator. You're adjusting behavior, not fundamentally changing capabilities.

It can't work without good examples. Garbage in, garbage out. If your training examples are inconsistent, contradictory, or low-quality, the fine-tuned model will reflect that.

It has a capacity limit. A fine-tuning adapter can reliably learn hundreds to low thousands of specific details. For catalogs with 10,000+ products, you need to combine fine-tuning (for behavior and style) with a live database lookup (for specific facts).
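
The combination described above can be as simple as looking up the fact at query time and placing it in the prompt. A minimal Python sketch, with a dict standing in for the live database: `PRICES`, `lookup_price`, and `build_prompt` are hypothetical, and the product and price are taken from the training example earlier in this article.

```python
from typing import Dict, Optional

# Stand-in for a live database; a real system would query it on every request.
PRICES: Dict[str, float] = {"Artisan Olive Oil Collection": 45.00}


def lookup_price(product: str) -> Optional[float]:
    return PRICES.get(product)


def build_prompt(question: str, product: str) -> str:
    """Prepend the current price so the model quotes a supplied fact, not a memorized one."""
    price = lookup_price(product)
    if price is not None:
        fact = f"Current price of {product}: ${price:.2f}."
    else:
        fact = f"No price on file for {product}."
    return f"{fact}\n\nCustomer question: {question}"


prompt = build_prompt("How much is the olive oil set?", "Artisan Olive Oil Collection")
```

The fine-tuned model supplies the tone and boundaries, while the lookup supplies the number, so a price change never requires retraining.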

Fine-tuning vs. Prompting: When Do You Need Each?

A common question: "Can't I just write a really good prompt instead of fine-tuning?"

Sometimes, yes. Here's how to decide:

Scenario	Use Prompting	Use Fine-tuning
One-off task	Yes	Overkill
Consistent brand voice across thousands of interactions	Fragile — prompt can drift	Yes
Following specific safety rules reliably	Somewhat reliable	Much more reliable
Processing many requests quickly	Prompt overhead adds cost	More efficient
Specialized domain knowledge	Limited by prompt length	Deeply embedded

The short version: prompting is for flexibility, fine-tuning is for consistency. If you need the model to behave a specific way every single time across thousands of interactions, fine-tuning is worth the upfront investment.

The Bottom Line

Fine-tuning bridges the gap between a general-purpose AI that gives generic answers and a specialized assistant that truly understands your domain. It's surprisingly accessible — you don't need a machine learning degree or a supercomputer. You need domain expertise (which you already have), a set of good examples (which you can build from existing data), and a few hours of compute time.

The businesses that benefit most from fine-tuning are the ones that have deep domain expertise that's hard to replicate — specialized knowledge that a general AI simply doesn't have. If that sounds like your business, fine-tuning is how you encode that advantage into software.

The GPU Memory Wall: Why Inference Hardware Matters

ai.rs — Thu, 26 Feb 2026 10:00:00 +0100

The Counterintuitive Truth

GPUs are marketed on compute power: teraflops, CUDA cores, tensor operations per second. But token-by-token LLM inference is rarely limited by compute. It is limited by memory bandwidth.

Here's the fundamental problem:

RTX 5090 can compute:     ~103 trillion operations/s (TFLOPS)
RTX 5090 VRAM delivers:     1.8 TB/s of data
Gap:                        57x (cores idle 98% of the time)

During autoregressive token generation, the GPU reads the entire model from VRAM for every single token. An 8B model at 6-bit quantization = 6.7 GB per token. At 1.8 TB/s bandwidth, that's 3.7 ms per token, giving a theoretical maximum of ~270 tokens/second.
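
The arithmetic in the paragraph above is a single division. The Python sketch below reproduces it under the same simplifying assumption, namely that every token streams the full set of weights from VRAM exactly once. `bandwidth_ceiling_tok_s` is a hypothetical helper.

```python
def bandwidth_ceiling_tok_s(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if every token streams the whole model once."""
    return bandwidth_bytes_per_s / model_bytes


VRAM_BW = 1.8e12  # 1.8 TB/s
qwen = bandwidth_ceiling_tok_s(6.7e9, VRAM_BW)  # about 269 tok/s
ms_per_token = 1000 / qwen                      # about 3.7 ms
```

Halving the model's size in memory doubles this ceiling, which is why quantization helps decode speed so directly.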

No amount of additional compute helps. The bottleneck is the straw, not the reservoir.

Proving the Memory Wall

We ran an experiment with two models on the same RTX 5090:

Model	Parameters	VRAM	Achieved tok/s	Theoretical max
Qwen3-8B (Q6_K)	8.2B	6.7 GB	161	~270
SmolLM2-135M (IQ4_XS)	135M	96 MB	1,110	~18,750

The tiny model (135M) should be 70x faster based on its 70x smaller memory footprint. Instead, it's only 7x faster. Where does the performance go?

The Four Walls

Detailed profiling revealed the actual bottleneck structure:

Wall 1 (CPU round-trip):     854 μs  (95% of per-token time once CUDA graphs are on)  ← REAL BOTTLENECK
Wall 2 (kernel launches):    725 μs  (eliminated with CUDA graphs: 1.8x speedup)
Wall 3 (VRAM bandwidth):      47 μs  (an L2-resident model would cut this: 47→7 μs)
Wall 4 (GPU compute):         ~1 μs  (negligible)

Wall 1: The CPU Round-Trip

For every token generated, the process is:

GPU finishes computing logits
Transfer logits to CPU via PCIe
CPU runs sampling (argmax/top-p/top-k)
Transfer selected token back to GPU
GPU embeds token and starts next forward pass

This CPU↔GPU round-trip takes 854 microseconds — regardless of model size. It's a fixed overhead that dominates inference for small models.

Wall 2: Kernel Launch Overhead

Each forward pass through the model launches hundreds of GPU kernels. Each launch has ~1-5 μs of overhead, and for a tiny 135M model, this adds up to 725 μs.

CUDA graphs solve this by recording the execution pattern once and replaying it. This improved our SmolLM2 throughput by 1.81x:

Configuration	tok/s	Improvement
Without CUDA graphs	615	Baseline
With CUDA graphs	1,110	1.81x

For the larger Qwen3-8B, CUDA graphs help less (1.17x) because the model is memory-bound, not kernel-launch-bound.

Wall 3: VRAM Bandwidth

For the large model, VRAM bandwidth IS the bottleneck:

Qwen3-8B streams 6.7 GB per token → 3.7 ms per token → ~270 tok/s max
Actual: 161 tok/s (60% efficiency — typical for real GPU workloads)

For the tiny model, VRAM bandwidth would allow 18,750 tok/s, but Walls 1 and 2 limit us to 1,110.

Wall 4: Compute

GPU compute is effectively free at these model sizes. The matrix multiplications take ~1 μs per token — negligible.

The L2 Cache Hypothesis

The RTX 5090 has a 96 MB L2 cache between the GPU cores and VRAM. If a model fits entirely in L2, it could theoretically avoid VRAM reads entirely:

VRAM bandwidth:  1.8 TB/s → 47 μs per SmolLM2 forward pass
L2 bandwidth:    ~12 TB/s → 7 μs per forward pass
Speedup:         6.7x

But this 6.7x only applies to the memory portion. With CPU overhead at 854 μs, the L2 advantage becomes:

With VRAM:  854 + 47 = 901 μs → 1,110 tok/s
With L2:    854 +  7 = 861 μs → 1,161 tok/s
Speedup:    ~4.6%

The CPU round-trip dominates so completely that L2 residency barely matters in practice.
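The same reasoning can be written as a tiny latency model: per-token time is a fixed overhead plus the time to stream the weights. This is a sketch under that simplifying assumption (the function and variable names are made up for illustration), using the microsecond figures from this section.

```python
def tokens_per_second(fixed_overhead_us: float, memory_time_us: float) -> float:
    """Throughput when each token costs a fixed overhead plus a memory-streaming time."""
    return 1e6 / (fixed_overhead_us + memory_time_us)


CPU_ROUND_TRIP_US = 854  # fixed per-token overhead measured above
VRAM_READ_US = 47        # SmolLM2 weights streamed from VRAM
L2_READ_US = 7           # hypothetical: weights served from L2 instead

with_vram = tokens_per_second(CPU_ROUND_TRIP_US, VRAM_READ_US)
with_l2 = tokens_per_second(CPU_ROUND_TRIP_US, L2_READ_US)

memory_speedup = VRAM_READ_US / L2_READ_US
end_to_end_speedup = with_l2 / with_vram

print(f"memory portion speedup: {memory_speedup:.1f}x")
print(f"end-to-end speedup:     {(end_to_end_speedup - 1) * 100:.1f}%")
```

Setting the fixed overhead to zero in this model recovers the full memory-portion speedup, which is why removing the CPU round-trip has to come first.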

What Prefill Reveals

Prefill (processing the input prompt) tells a different story:

Mode	SmolLM2 tok/s	Parallelism
Prefill 512 tokens	57,789	512x
Prefill 16 tokens	3,938	16x
Generation (1 token)	1,110	1x

During prefill, 512 tokens are processed simultaneously — the GPU achieves 57,789 tok/s. This proves the hardware IS capable of massive throughput. The limitation is the autoregressive nature of generation: each token depends on the previous one, preventing parallelism.

Why Purpose-Built ASICs Win

Inference-specific chips solve the memory wall architecturally:

Groq LPU

230 MB on-chip SRAM (no external memory)
80 TB/s internal bandwidth (44x GPU VRAM)
Eliminates CPU round-trip — sampling happens on-die
Result: 300+ tok/s on Llama 3 70B

Cerebras WSE-3

44 GB on-chip SRAM
21 PB/s on-chip bandwidth (11,600x GPU VRAM)
Entire model lives on-chip
Result: Thousands of tok/s

Taalas HC1

Model weights encoded directly in silicon (3-bit custom)
17,000 tok/s on Llama 3.1 8B
105x faster than our RTX 5090
No memory access at all — weights ARE the hardware

What This Means for GPU Deployments

1. Quantization is the Primary Lever

Since inference is memory-bound, reducing model size directly improves speed:

Quantization	Speed improvement	Why
BF16 → Q6_K	~2x	Less than half the data to stream
BF16 → Q4_K_M	~2.5x	Even less data
BF16 → Q2	~3x	Diminishing returns (quality drops)

Quantization doesn't sacrifice compute — it reduces the real bottleneck (memory reads).
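To see why quantization shrinks the memory stream, it helps to turn bits per weight into gigabytes. The sketch below is illustrative only: the bits-per-weight values are assumptions (16 for BF16, 6.5625 for Q6_K, and an approximate average of 4.85 for Q4_K_M, which mixes block types), and the helper name is hypothetical.

```python
# Assumed average bits per weight for each format (Q4_K_M is approximate).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q6_K": 6.5625,
    "Q4_K_M": 4.85,
}


def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB, ignoring metadata and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8.2e9  # Qwen3-8B parameter count from the table above

baseline = model_size_gb(N_PARAMS, BITS_PER_WEIGHT["BF16"])
for name, bits in BITS_PER_WEIGHT.items():
    size = model_size_gb(N_PARAMS, bits)
    print(f"{name:7s} {size:5.2f} GB  ({baseline / size:.1f}x less data than BF16)")
```

The size ratios are an upper bound on the speedup. Measured gains such as those in the table come in lower, because the fixed per-token overheads described earlier do not shrink with the model.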

2. VRAM Amount < VRAM Bandwidth

When choosing a GPU for inference, bandwidth matters more than capacity:

GPU	VRAM	Bandwidth	Expected 8B Q6_K tok/s
RTX 4090	24 GB	1.0 TB/s	~90
RTX 5090	32 GB	1.8 TB/s	~160
A100	80 GB	2.0 TB/s	~180
H100	80 GB	3.35 TB/s	~300

The H100 has 1.9x the bandwidth of the RTX 5090, which translates directly to ~1.9x the inference speed.
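The "Expected" column can be approximated by scaling the measured RTX 5090 result by each card's bandwidth. This is a rough sketch that assumes throughput is purely bandwidth-bound and that efficiency is the same on every GPU; the function name is made up for illustration.

```python
BASELINE_TOK_S = 161         # measured: Qwen3-8B Q6_K on the RTX 5090
BASELINE_BANDWIDTH_TBS = 1.8  # RTX 5090 VRAM bandwidth

# Bandwidth figures from the table above, in TB/s
GPU_BANDWIDTH_TBS = {
    "RTX 4090": 1.0,
    "RTX 5090": 1.8,
    "A100": 2.0,
    "H100": 3.35,
}


def estimate_tok_s(bandwidth_tbs: float) -> float:
    """Scale the measured baseline linearly with memory bandwidth."""
    return BASELINE_TOK_S * bandwidth_tbs / BASELINE_BANDWIDTH_TBS


for gpu, bandwidth in GPU_BANDWIDTH_TBS.items():
    print(f"{gpu:9s} {bandwidth:4.2f} TB/s -> ~{estimate_tok_s(bandwidth):.0f} tok/s")
```

Treat the output as a planning estimate rather than a prediction: driver overhead, kernel quality, and the CPU round-trip all differ between systems.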

3. Batching is the Only Way to Use Compute

The GPU's compute power only helps with concurrent requests. With 8 concurrent users, the GPU can process 8 tokens simultaneously, filling more of its compute capacity:

Concurrent users	GPU utilization	Aggregate tok/s
1	~2%	161
4	~8%	~400
8	~15%	~600
32	~50%	~1,500

This is why vLLM with continuous batching matters for production — it's the only way to actually use the GPU you paid for.
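One thing the table implies but does not state is the trade-off for each individual user. A small sketch, using only the aggregate figures from the table (the helper name is hypothetical), shows aggregate throughput rising while each user's share falls:

```python
# (concurrent users, aggregate tok/s) taken from the table above
BATCH_TABLE = [(1, 161), (4, 400), (8, 600), (32, 1500)]


def per_user_tok_s(users: int, aggregate_tok_s: float) -> float:
    """Average generation speed each user sees when the batch shares the GPU."""
    return aggregate_tok_s / users


for users, aggregate in BATCH_TABLE:
    share = per_user_tok_s(users, aggregate)
    print(f"{users:2d} users: {aggregate:5d} tok/s total, ~{share:.0f} tok/s each")
```

Whether the lower per-user speed is acceptable depends on the application; for chat, anything comfortably above reading speed is usually fine.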

The Future: Where This Is Heading

HBM4 (2026): 6+ TB/s of bandwidth, if it reaches consumer GPUs, would be over 3x the RTX 5090's 1.8 TB/s and could more than double bandwidth-bound inference speed
On-chip model caching: Larger L2/L3 caches could eventually fit quantized 1B models
Speculative decoding: A small draft model proposes several candidate tokens, and the large model verifies them in a single parallel pass; this requires draft and target models that share a vocabulary
Inference ASICs: Dedicated chips that eliminate the CPU round-trip entirely
Hybrid architectures: GPU + inference ASIC combos that handle training and serving optimally
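Of the items above, speculative decoding is the easiest to illustrate in a few lines. The sketch below is a toy greedy variant, not a production implementation: the two "models" are hypothetical deterministic functions standing in for real networks, and the verification loop runs sequentially here, whereas on a GPU the target model would score all draft positions in one batched pass.

```python
from typing import Callable, List

# A toy "model": maps the token context to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each position (one batched pass on real hardware).
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] != expected:
                # Keep the agreed prefix, then take the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # All k drafts accepted; the target's pass also yields a bonus token.
            tokens.extend(proposal)
            tokens.append(target(tokens))
    return tokens[:goal]


def target_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small model: agrees except every fifth position."""
    guess = target_model(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


prompt = [1, 2, 3]
assert speculative_decode(target_model, draft_model, prompt, 20) == \
    greedy_decode(target_model, prompt, 20)
```

The design point is that accepted draft tokens cost only one target pass for several tokens, which is how the technique trades spare compute for fewer sequential weight reads.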

The memory wall isn't going away, but the wall is moving. Every generation of hardware pushes the boundary, and creative software solutions (quantization, batching, speculative decoding) continue to extract more from existing hardware.

Key Takeaway

When planning an LLM deployment, think in terms of memory bandwidth, not compute:

Your GPU cores are 98% idle during inference
Quantization is the single most impactful optimization
Batching (vLLM) is the only way to utilize compute
Purpose-built ASICs are 10-100x faster because they solve the architecture problem
For most businesses, a well-quantized model on a good GPU is more than sufficient

From Edge AI to Custom LLMs: How On-Device Intelligence Evolved

ai.rs — Mon, 23 Feb 2026 10:00:00 +0100

A $20 AI Camera in 2019

In August 2019, M5Stack shipped the M5StickV — a thumb-sized device built around the Kendryte K210 system-on-chip. For under $20, you got:

Dual-core 64-bit RISC-V CPU at 400 MHz
8 MiB SRAM
Hardware neural network accelerator (KPU)
0.8 TOPS peak performance
OV7740 camera (VGA @ 30fps)
1.14" IPS display
MicroSD, microphone, gyroscope, speaker, battery

This tiny device could run real-time face detection, object classification, and QR code scanning — entirely on-chip, with no cloud connection. It was one of the first widely accessible edge AI platforms that hobbyists and engineers could actually buy and program.

What the K210 Could Do

The spec sheet reads like a checklist of computer vision fundamentals:

Face recognition and detection — identify known faces in real-time
Object detection and classification — recognize shapes and types at 30fps
Size and coordinate tracking — locate targets with bounding boxes
Audio processing — microphone array beamforming and voice wake-up
Speech recognition — on-device, no cloud dependency

For embedded engineers coming from Arduino and ESP32 territory, this was a quantum leap. The ESP32 could blink LEDs and read sensors. The K210 could see and hear.

Running MicroPython on the K210

Getting started was remarkably accessible. The M5StickV supported MicroPython through Sipeed's MaixPy framework:

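import sensor  # camera driver
import image   # image helpers used by the snapshot object
import lcd     # on-board display

lcd.init()
sensor.reset()                       # initialize the camera
sensor.set_pixformat(sensor.RGB565)  # 16-bit color
sensor.set_framesize(sensor.QVGA)    # 320x240 frames
sensor.run(1)                        # start capturing

while True:
    img = sensor.snapshot()
    res = img.find_qrcodes()
    if res:
        # Overlay the first decoded payload and outline its bounding box
        img.draw_string(40, 50, res[0].payload(), (236, 36, 36), scale=1.5)
        img.draw_rectangle(res[0].rect(), (236, 36, 36))
    lcd.display(img)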

Fewer than twenty lines of Python for a real-time QR code scanner with on-screen overlay. The firmware could be compiled from source and flashed via USB, even from ARM-based hosts like the Nvidia Jetson Nano.

The Gap Between Edge AI and Real Intelligence

The K210 was impressive for its size and price, but it had hard limits:

Capability	K210 (2019)	Modern LLM (2026)
Parameters	~1-5 million	8 billion
Memory	8 MiB SRAM	24 GB VRAM
Tasks	Classification, detection	Reasoning, conversation, generation
Training data	Thousands of images	Trillions of text tokens
Output	"This is a face" / "This is a cat"	Natural language responses, recommendations, analysis
Customization	Retrain classification model	Fine-tune with LoRA in 5 hours

Edge AI answered "what is this?" — but it couldn't answer "what should I buy?" or "how does this compare to that?" or "find me something for a dinner party under €50."

That required a fundamentally different architecture: large language models.

The Bridge: From Vision to Language

The path from K210-style edge AI to modern LLM assistants followed three key developments:

1. Transformer Architecture Scaled Up

The K210's neural network accelerator ran small convolutional models. The transformer architecture replaced convolution with an attention mechanism that kept improving as it scaled to billions of parameters, and the same fundamental idea now powers both image and language models. That scale is what moved models from classifying inputs toward something closer to understanding them.

2. Open-Source Models Became Competitive

In 2020, if you wanted a capable language model, you needed OpenAI's API. By 2025, open-source models such as Qwen, Llama, and Mistral matched or exceeded GPT-3.5 quality while running on a single consumer GPU. This is the equivalent of the K210 moment for language AI: capable models, affordable hardware, open ecosystem.

3. Fine-Tuning Became Practical

LoRA (Low-Rank Adaptation) did for LLMs what transfer learning did for image classification. Instead of training from scratch, you add a small adapter (~130 MB) that teaches the model your domain. The cost of adapting a model dropped from the millions of dollars that training from scratch requires to under $1 per fine-tuning run.
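The reason the adapter is so small comes down to counting. LoRA leaves each large weight matrix frozen and learns two thin matrices whose product has the same shape. A rough sketch of that count follows; the 4096 x 4096 layer and rank 16 are hypothetical values chosen for illustration, not the configuration used for any model mentioned here.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank update B @ A, with B (d_out, rank) and A (rank, d_in)."""
    return rank * (d_out + d_in)


D_OUT, D_IN, RANK = 4096, 4096, 16  # hypothetical layer size and adapter rank

full = full_param_count(D_OUT, D_IN)
adapter = lora_param_count(D_OUT, D_IN, RANK)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {adapter:,} parameters ({adapter / full:.2%} of the full matrix)")
```

Only the adapter's parameters receive gradients and optimizer state, which is where most of the training-cost saving comes from.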

Where We Are Now

At ai.rs, we took the same hands-on approach that drove the maker community around devices like the K210 and applied it to large language models:

What We Did Then	What We Do Now
Flash MicroPython firmware via USB	Fine-tune Qwen/Llama with LoRA
Train face detection on custom datasets	Train product Q&A on 26,000+ samples
Deploy on $20 RISC-V chips	Deploy on dedicated GPU servers
Real-time camera inference	Real-time conversational AI
Read QR codes and detect objects	Understand natural language, recommend products, handle support

The spirit is identical: take capable open-source hardware and software, customize it for a specific use case, and deploy it where it creates real value.

From Hobbyist to Production

The K210 was a hobbyist device. Modern AI assistants are production systems serving real customers 24/7. The difference isn't just scale — it's the full stack around the model:

RAG (Retrieval-Augmented Generation) — Real-time product database access, so the model always has current prices and availability
Safety training — 275+ edge-case samples that prevent hallucination, off-topic responses, and prompt injection
Monitoring and iteration — Every conversation logged, weak spots identified, training data improved continuously
Multi-language support — One model serving 6+ languages natively

But the core insight from the maker era still holds: you don't need a research lab to build useful AI. The K210 proved that computer vision could run on a $20 chip. Open-source LLMs prove that conversational AI can run on a single GPU.

Getting Started

If the maker spirit of the K210 era resonates with you, here's how to start with modern LLMs:

Try it — Run Ollama with Qwen3-8B on any machine with a GPU
Customize it — Prepare 5,000+ training samples from your domain data
Fine-tune it — Use Unsloth + LoRA for a 5-hour, sub-$1 training run
Deploy it — Serve it on dedicated hardware with RAG for real-time data access

Or if you'd rather skip the infrastructure work: see how we build custom AI assistants — from your product data to a live AI that knows your business.

This article is based on our original 2019 coverage of the M5StickV and Kendryte K210 platform. The maker community around edge AI devices like the K210, ESP32, and Raspberry Pi laid the groundwork for today's accessible AI deployment ecosystem.

Will AI Replace My Sales Team? (No — Here's Why)

ai.rs — Fri, 20 Feb 2026 10:00:00 +0100

The Fear Nobody Talks About

Every time we talk to business owners about AI, there's an unspoken question in the room:

"If the AI can do all this, do I still need my sales team?"

The short answer: yes, absolutely. But the roles change. And that's a good thing.

What AI Does Better Than Humans

Let's be honest — there are things AI genuinely does better:

1. Being Available

Your best salesperson works 8 hours. AI works 24. The 2 AM customer, the Sunday browser, the holiday shopper — AI catches every one of them.

Time	Human Team	AI
Monday 10 AM	Available	Available
Wednesday 2 AM	Sleeping	Available
Saturday afternoon	Maybe	Available
Christmas Day	Off	Available
During lunch rush	Busy	Available

You're not replacing your team for those 8 working hours. You're adding coverage for the other 16 hours they physically can't be there.

2. Being Consistent

Humans have good days and bad days. Monday morning after a long weekend? Not our best. Friday afternoon? Distracted. The customer who arrives during a staff argument? Caught in the crossfire.

AI gives the same quality response every time. The 500th question of the day gets the same enthusiasm and accuracy as the first.

3. Handling Volume

During a sale or promotion, your website traffic might spike 5x. Your team of 3 can't suddenly become 15. But AI handles 1 customer or 100 with the same response time.

4. Speaking Languages

Hiring multilingual staff is expensive and limits your available talent pool. AI speaks 6+ languages natively — every customer gets help in their preferred language.

5. Remembering Everything

With 500 products, no human remembers every detail about every item. The AI knows exact prices, specifications, pairings, and availability for your entire catalog — because it looks them up in real-time.

What Humans Do Better Than AI

Now for the important part — what AI cannot do:

1. Read Emotions

A customer types: "I've been looking for an hour and nothing is right."

AI sees: a product search query. A human sees: frustration. Someone who needs patience, empathy, and maybe a different approach entirely.

AI is excellent at transactions. Humans are essential for relationships.

2. Handle Complexity

Some requests are genuinely complex:

"I'm planning a corporate event for 200 people with mixed dietary requirements, a specific theme, and a strict budget. I need a complete solution."

AI can suggest products. But planning a coherent solution that accounts for dozens of variables, makes judgment calls, and adapts in real-time? That's human territory.

3. Build Trust for Big Decisions

A customer spending $50 is fine getting advice from AI. A customer spending $5,000 wants to talk to a person. The higher the stakes, the more important human connection becomes.

Purchase Size	Best Handled By
Under $100	AI (quick, accurate, instant)
$100-$500	AI with human backup available
$500-$2,000	Human, with AI providing product data
Over $2,000	Human relationship, always
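If you want to wire this table into a chat handoff rule, it reduces to a few threshold checks. The sketch below is one possible encoding: the function name and return labels are made up for illustration, and how the exact boundary amounts ($100, $500, $2,000) are assigned is an assumption, since the table leaves them ambiguous.

```python
def route_purchase(amount_usd: float) -> str:
    """Pick who should handle a conversation based on the expected purchase size."""
    if amount_usd < 100:
        return "AI"
    if amount_usd <= 500:
        return "AI with human backup"
    if amount_usd <= 2000:
        return "Human, with AI providing product data"
    return "Human"


for amount in (50, 250, 1500, 5000):
    print(f"${amount:>5,}: {route_purchase(amount)}")
```

In practice the thresholds would be tuned to your own margins and customer base rather than copied from this table.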

4. Negotiate and Customize

Custom quotes, bulk discounts, special arrangements — these require human judgment about margins, relationships, and business strategy. AI operates within fixed rules; humans operate within context.

5. Recover from Mistakes

When something goes wrong — wrong item shipped, delayed delivery, quality issue — customers want to talk to a human. They want someone who feels their frustration and has the authority to make it right.

The Multiplication Model

The best way to think about AI isn't replacement. It's multiplication.

Without AI:

3 salespeople handle ~150 conversations/day
Available 8 hours/day, 5 days/week
Limited to 1-2 languages
Spending 60% of time on routine questions

With AI:

AI handles ~500 routine conversations/day (24/7)
3 salespeople handle ~60 complex conversations/day
Available around the clock in 6+ languages
Sales team spends 80% of time on high-value interactions

Same team, more than 3x the customer coverage (about 560 conversations a day instead of 150), better quality interactions.

How the Day Changes

Before AI

Time	Sales Team Activity
9:00	Answer "What's the price of X?" (30 seconds, but it adds up)
9:05	"Do you have Y in stock?"
9:15	"What's the difference between A and B?"
9:30	Complex customer needs full attention — but phone rings
10:00	Back to routine questions
11:00	Finally gets to that sales proposal for the big account

After AI

Time	Activity
9:00	AI handles routine questions automatically
9:00	Sales team works on the big account proposal
10:30	AI flags a complex customer request — team takes over
11:00	Team closes a $2,000 deal they had time to properly nurture
2:00	Reviews AI conversations from overnight — 3 new leads

The team does more meaningful work. Customers get faster answers. Revenue goes up.

The Numbers

Businesses that deploy AI alongside their sales team typically see:

Metric	Change
Total customer interactions handled	+200-400%
Team time on high-value activities	+60-80%
Customer response time	90% faster
After-hours sales captured	From zero to significant
Team job satisfaction	Higher (less repetitive work)

That last one matters more than you might think. Salespeople don't enjoy answering "what time do you close?" for the 50th time. Let AI handle the routine so your team can do what they're actually good at — and what they actually enjoy.

When to Hire, When to AI

A simple framework:

Add AI when:

You're losing after-hours and weekend customers
Your team spends most of their time on routine questions
You need multilingual support but can't justify hiring
Response times are too slow during peak hours

Hire a human when:

You need someone for complex, high-value sales
Your business relies on personal relationships
You're expanding into a new market that needs cultural nuance
You need someone who can physically be present (events, showrooms)

The sweet spot: AI handles the first touch, qualifies the lead, and routes complex requests to your team. Your team closes deals with customers who are already informed and ready to buy.

The Bottom Line

AI doesn't replace your sales team. It gives them superpowers.

Your team stops being answering machines for routine questions and starts being what they were hired to be: relationship builders, problem solvers, and deal closers.

The businesses that understand this — that use AI to augment their team rather than replace them — are the ones seeing the biggest returns.

Wondering if your business is ready? Take our free AI Readiness Assessment — 2 minutes, no commitment, personalized recommendations.

How to Pick the Right AI Tool for You

ai.rs — Thu, 19 Feb 2026 10:00:00 +0100

Too Many Choices

Two years ago, the question was simple: do you use ChatGPT or not? Now there are dozens of AI tools, each claiming to be the best. It's overwhelming, and most comparisons online are either outdated or biased.

Let's cut through it. We'll look at what actually matters when choosing an AI tool, compare the major options honestly, and help you pick based on what you'll actually use it for.

The Major Players

As of early 2026, these are the AI tools most people should consider:

ChatGPT (by OpenAI)

The one that started the mainstream AI wave. It's the most widely used, has the largest ecosystem of plugins and integrations, and offers both free and paid tiers. GPT-4o was its long-running flagship model; the GPT-5 family has since succeeded it.

Best for: General-purpose use, image generation (DALL-E built in), voice conversations, broad plugin ecosystem.

Claude (by Anthropic)

Known for longer, more thoughtful responses and strong performance on writing and analysis tasks. Claude tends to be more careful and nuanced, especially with complex or sensitive topics.

Best for: Long documents, careful analysis, writing and editing, coding, tasks requiring nuance.

Gemini (by Google)

Google's AI, integrated across Gmail, Docs, and Search. Its biggest advantage is access to real-time information through Google Search and deep integration with Google's productivity suite.

Best for: Research with current information, Google Workspace integration, multimodal tasks (text + images + video).

Copilot (by Microsoft)

Microsoft's AI assistant, built into Windows, Edge, and Microsoft 365. It is powered by OpenAI's models and integrated across Microsoft's ecosystem.

Best for: Microsoft Office users, Windows integration, business environments already on Microsoft's stack.

Perplexity

Not a traditional chatbot — it's more like an AI-powered research tool. Every answer includes citations and sources, making it ideal for factual research.

Best for: Research, fact-finding, getting answers with verifiable sources.

The Comparison

Feature	ChatGPT	Claude	Gemini	Copilot	Perplexity
Free tier	Yes (GPT-4o mini)	Yes (limited)	Yes	Yes	Yes (limited)
Paid price	$20/mo	$20/mo	$20/mo	$20/mo (M365)	$20/mo
Best at writing	Good	Excellent	Good	Good	Adequate
Best at research	Good	Good	Excellent	Good	Excellent
Best at coding	Excellent	Excellent	Good	Very Good	Adequate
Image generation	Yes (DALL-E)	No	Yes (Imagen)	Yes (DALL-E)	No
File upload	Yes	Yes (large files)	Yes	Yes	Yes
Web access	Yes	Limited	Yes (native)	Yes (Bing)	Yes (core feature)
Mobile app	Yes	Yes	Yes	Yes	Yes

How to Choose: Start with Your Main Use Case

Instead of comparing features, start with what you'll actually use AI for most often.

"I want a general everyday assistant"

Go with ChatGPT. It's the most versatile, has the largest user community (so it's easy to find tips and tricks), and the free tier is genuinely useful. It's the safe default choice.

"I need help with writing and analysis"

Go with Claude. It handles long documents better than any competitor, produces more nuanced writing, and is particularly good at understanding complex instructions. If your work involves reading, writing, or analyzing text, Claude is hard to beat.

"I do a lot of research and need accurate sources"

Go with Perplexity. It's built specifically for research. Every answer comes with citations you can verify. It's not trying to be a creative writer or a coding assistant — it's trying to find you accurate information fast.

"I live in Google's ecosystem"

Go with Gemini. If you use Gmail, Google Docs, and Google Drive daily, Gemini's integration is hard to beat. It can search your email, help with documents, and access real-time information through Google Search.

"I live in Microsoft's ecosystem"

Go with Copilot. If your workplace runs on Microsoft 365, Copilot works inside Word, Excel, PowerPoint, and Outlook. The AI comes to where your work already is.

"I write code regularly"

ChatGPT or Claude are both strong. Claude tends to be better at understanding large codebases and complex architecture. ChatGPT has broader ecosystem support. Many developers use both.

The Secret: Most People Should Try Two

Here's what the comparison articles won't tell you: the differences between these tools are smaller than the marketing suggests. For 80% of tasks, any of them will do a good job.

The real differences show up at the edges — very long documents, complex reasoning chains, specific creative styles, or niche technical tasks. The best way to find your favorite is to try two or three on the same task and see which output you prefer.

All of them offer free tiers. Spend a week using two of them side by side. You'll quickly develop a preference.

When Free Is Enough (and When It's Not)

Every major AI tool has a free tier, but they come with limitations:

What Free Gets You	What Paid Adds
Access to capable (but not top-tier) models	Access to the most powerful models
Usage limits (messages per day/hour)	Much higher or unlimited usage
Basic features	Advanced features (file analysis, image generation, priority access)
Adequate for casual use	Necessary for daily professional use

Start with free. If you hit the usage limits regularly or find yourself wishing for better responses, upgrade. The $20/month is worth it if you use AI daily — it's the cost of one lunch for a tool that saves hours.

Two Mistakes to Avoid

1. Chasing the "Best" Model

Every month, a new benchmark says a different model is "best." Don't chase this. The differences at the top are marginal, and the model that scores 2% higher on a benchmark might not be the one that's best for your specific tasks. Pick a tool, learn it well, and switch only if you have a genuine reason.

2. Paying for Multiple Subscriptions

Unless you have a specific reason, one paid subscription is enough. Pick the tool that fits your primary use case, pay for that one, and use the free tiers of others for occasional tasks that need a different strength.

The Bottom Line

The best AI tool is the one you'll actually use consistently. Pick based on your primary use case, start with the free tier, upgrade if it becomes part of your daily workflow, and don't overthink it. The gap between these tools is much smaller than the gap between using AI well and not using it at all.

Want to understand how AI can be customized for specific tasks? Read What Is Fine-Tuning? Teaching AI New Tricks.

Wondering if AI could help your business? Take our free AI Readiness Assessment — 2 minutes, personalized recommendations.