DEJAN

Teaching a Model to Reason Before It Learns to Talk

Dan Petrovic — Fri, 19 Jun 2026 03:45:10 +0000

A weekend project that turned into a bet against the whole transformer playbook.

The short version

Almost every AI you’ve heard of is a transformer trained on a firehose of text. It learns language first, and reasoning sort of comes along for the ride. I’m trying the opposite: a tiny model that learns logic and reasoning first, with language deferred — maybe learned later, maybe allowed to emerge on its own.

Oh, and the model is about 10 megabytes. People have JPEGs bigger than that.

Where it started: ARC

I started on ARC-AGI — a benchmark of little grid puzzles where you see a few input-to-output examples, infer the rule, and apply it to a fresh grid. It’s deliberately built to resist memorization. You can’t brute-force it with scale; you have to actually generalize from a handful of examples. That’s the part of intelligence I care about.

My first model wasn’t a transformer at all. It was a cellular automaton — think Conway’s Game of Life, except the update rule is learned and conditioned on each puzzle’s examples. Every cell only talks to its immediate neighbors. Intelligence, if it shows up, has to emerge from purely local interactions.

What I found

It works — on some puzzles. The model cleanly generalizes things like recoloring, filling holes, and drawing outlines. These are all local operations: what a cell should become depends only on what’s around it.

And it completely fails — 0%, even on the training data — on reflection, rotation, and gravity.

Those require a cell to know about the whole grid: where’s the axis of symmetry? How wide am I? A purely local rule structurally can’t answer that. It’s not a tuning problem I can grind away with more epochs — it’s a ceiling baked into the design.
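To make the local versus global split concrete, here is a minimal sketch. The outline rule is hand-written and stands in for the learned one, and the names `outline_step` and `flip_source_column` are illustrative assumptions, not the actual model code.

```python
# Hand-written stand-in for a learned local rule (an assumption for illustration):
# a background cell becomes OUTLINE if any of its 8 neighbors is FILLED.
BACKGROUND, FILLED, OUTLINE = 0, 1, 2

def neighbors(grid, r, c):
    """Yield the values of the up-to-8 cells adjacent to (r, c)."""
    rows, cols = len(grid), len(grid[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield grid[r + dr][c + dc]

def outline_step(grid):
    """One synchronous update; every cell reads only its 3x3 neighborhood."""
    return [
        [
            OUTLINE
            if cell == BACKGROUND and FILLED in neighbors(grid, r, c)
            else cell
            for c, cell in enumerate(row)
        ]
        for r, row in enumerate(grid)
    ]

def flip_source_column(c, width):
    """Column a horizontal reflection copies from; it depends on the grid width."""
    return width - 1 - c

grid = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
outlined = outline_step(grid)
for row in outlined:
    print(row)
```

The outline rule never needs to know how big the grid is. `flip_source_column` does: the cell a reflection reads from sits a width-dependent distance away, which a fixed 3x3 neighborhood cannot see.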

That failure is the most useful result I’ve gotten. It draws a sharp line: locality buys you one specific slice of reasoning and nothing past it. Now I know exactly what the next building block has to add.

The bigger realization: it didn’t have to be transformers

Here’s the thought I keep circling back to. Transformers didn’t win because they’re the only road to intelligence. They won because they hit a sweet spot of three things at once: they’re expressive, they’re trainable at scale, and they map beautifully onto GPUs. Then the whole industry — CUDA, PyTorch, Nvidia, the entire stack — calcified around that one choice.

But intelligence is computational, and there are probably countless configurations of computation that could get us there. The brain is one of them — wildly efficient, runs on roughly 20 watts — and even it “just is”: a path evolution happened to stumble into, not necessarily the best one.

So I’m asking a different question. Not “how do I scale the thing everyone already scales,” but: what are the primitives — the actual building blocks — that an intelligent system needs, that the transformer paradigm quietly skipped?

The bet

Every time I write that list down, it converges on the same handful of ideas — and they look a lot like how brains actually work:

Learning without backprop. Brains don’t run a separate backward pass shuttling gradients around. The same local dynamics that compute the answer also do the learning. (See predictive coding, equilibrium propagation.)
Always-on, not one-shot. A language model runs a fixed forward pass for each token and stops when the output ends. A brain never stops: it holds state, and it can keep thinking for as long as a problem demands.
Dynamic, not frozen. An LLM’s weights are locked the moment training ends. A brain fine-tunes itself continuously as it goes — which is really just Bayesian updating under another name.
Memory and computation in the same place. No constant shuttling of data between “memory” and “compute” the way a GPU moves things between VRAM and cores. The memory is the computation.

None of this is in a transformer. All of it is in you.

Why tiny, why now

You don’t need a 5090 and a cloud bill to test a primitive. That’s the whole point of starting small: the interesting traits of intelligence — generalizing, binding concepts together, adapting on the fly — should start showing up in tiny seed-stage models, before scale and complexity bog everything down. So I build small, add one primitive at a time, and watch which abilities switch on. Churn fast, test fast, take notes.

Will it beat ARC-AGI-2 with a sub-100MB model? Probably not this weekend. But every “failure” so far has been drawing the map of what reasoning actually requires — and that map is the entire point.

Follow along if that sounds like your kind of rabbit hole.

How Search Grounding Biased an LLM Against YouTube

Dan Petrovic — Thu, 18 Jun 2026 11:32:00 +0000

I asked Claude to recommend a webinar platform. The web’s affiliate-driven content quietly steered it away from the obvious free answer. Here is what happened, and what it says about how language models talk about products.

The setup

I gave Claude a simple, practical request: find me a platform where I can quickly hop on and start a webinar with my camera and screen share, record it, and let an audience join.

suggest best platforms where I can quickly hop on and start a webinar with my camera and screenshare while allowing the recording and audience to join in?

Claude searched the web and came back with a tidy list: StreamYard, Livestorm, Zoom, Riverside, Zoho. Sensible names. YouTube showed up as an afterthought, framed as a thing you bolt other software onto to make usable.

For “just hop on and go” with camera, screen share, recording, and an audience joining by link, here are the platforms that fit best as of mid-2026:

StreamYard is probably the closest match to what you described. It runs entirely in the browser, so you and any guests join by clicking a link with no app download. You bring in your screen, camera, and guests, then switch layouts so your slides, demo, and faces stay framed the way you want. It also captures local multi-track recordings, creating separate audio and video files for each participant, which is ideal if you want to edit clips afterward. You can broadcast live to an audience or stream out to YouTube, LinkedIn, and Facebook at the same time.

Livestorm is fully browser-based for hosts and attendees, with polls, Q&A, and screen sharing built in. All paid plans support recording sessions of four hours or longer at 1080p, with branded registration pages. It has a free tier for smaller events and strong CRM integrations if you care about lead capture.

Zoom is the reliable, familiar option if you expect a large or less tech-savvy audience. It handles big crowds and records cleanly. The tradeoff is friction: to get the best experience, attendees usually have to download the Zoom client, and the recording quality is compressed. Zoom’s free tier caps meetings at 40 minutes for up to 100 people, and webinar features sit behind a paid Workplace plan.

Riverside is worth a look if recording quality matters most. It records locally on your device up to 4K, so even if your internet dips, your video stays clean, and it uses AI to help edit the session. Attendees join in the browser.

Zoho Webinar is a good free starting point. It is browser-based with no separate download, supports screen sharing without downloads for attendees, and records to the cloud for replay. The free plan allows up to 100 attendees.

My quick read on which to pick: go with StreamYard if you want the fastest, lowest-friction setup where guests join by link and you can repurpose the recording. Choose Zoom if your audience is large or expects something they already know. Pick Zoho Webinar or Livestorm’s free plan if you want to start at zero cost for up to 100 people.

If you tell me your expected audience size and whether you need it free, I can narrow this to one recommendation.

So I asked the obvious follow-up. What about YouTube?

The tell

Claude’s answer led with everything wrong with YouTube. It opened on a 24-hour activation delay. It claimed the built-in browser webcam was a talking-head tool, so combining your slides and your face would require separate software like OBS. Then it recommended pairing YouTube with StreamYard, a paid competitor, to make it work.

YouTube Live is free, reaches an unlimited audience, and auto-saves your stream as a video when you finish. Two things make it a slower fit for the “quickly hop on” part of what you want, though.

The first is the activation delay. The first time you enable live streaming on a channel, YouTube takes up to 24 hours to activate it, and there is no official way to bypass or speed up that wait. So a brand-new channel cannot decide to go live in the next ten minutes. Once it has been enabled once, future streams start instantly, which makes this a one-time hurdle.

The second is camera plus screen share together. The built-in webcam option (Go Live in YouTube Studio) works great for talking-head streams, but for screen sharing, multiple camera angles, or overlays you need dedicated software like OBS Studio, Streamlabs, or Ecamm Live. Streaming software such as OBS, Streamlabs, or StreamYard is the flexible path for overlays, multiple cameras, and screen sharing. The plain browser webcam by itself will not cleanly show your slides and your face at the same time.

A few more details against your criteria. Streaming from a desktop browser has no subscriber requirement, while the mobile app needs at least 50 subscribers. Streams under 12 hours are automatically archived as regular videos on your channel, so the recording is handled for you. Audience interaction runs through live chat, where viewers watch and type. Bringing a guest on camera is limited, since the “Go Live Together” feature is mobile-exclusive.

The best way to use it for a webinar is to pair StreamYard (or OBS) with YouTube. You get StreamYard’s quick camera, screen, and layout control, and you broadcast out to YouTube’s free unlimited audience with the recording saved automatically. Enable YouTube live streaming a day ahead so the 24-hour activation is already cleared before your session.

Read that back. Asked about the largest, free, infinitely scalable live video platform on the planet, the model’s instinct was to list its flaws and route me toward a paid product.

I pointed out where those flaws were coming from. The sources Claude had pulled were vendor blogs and affiliate roundups: Livestorm’s blog, StreamYard’s blog, Zoho’s pages, plus a stack of “best webinar software 2026” listicles. Every one of those has a commercial incentive to rank its own category above a free incumbent. The model had absorbed their framing and passed it on as neutral advice.

The fact it got wrong

To settle it, I gave Claude a hard rule: answer only from YouTube’s own official Help Center, and cite nothing else. Then one narrow question. Does YouTube support screen sharing when you go live?

Now I want you to get information only from YouTube or official Google sources. Nowhere else. You are not allowed to cite any other web resource. With that understanding in place, I need you to find one specific information for me. Does YouTube support screen sharing when you go live?

The answer, straight from YouTube’s documentation, is yes. The Create a live stream via webcam page describes a Share Screen button in the Live Control Room, available whenever a webcam and microphone are connected. You pick your whole screen, a window, or a tab, and present. No OBS required.

Understood. I’ll pull only from YouTube and official Google sources, and cite nothing else. Let me find the official documentation on this.

Yes. Straight from YouTube’s official Help Center: screen sharing is built into the webcam Go Live path.

On the “Create a live stream via webcam” page, YouTube confirms you can share your screen while running a webcam live stream through the Live Control Room. After you go live with the webcam, you click the SHARE SCREEN button at the bottom center next to the mic, pick your entire screen, a single window, or a browser tab, and click Share. When you share a browser tab that supports webcam, that tab’s audio is included by default.

Two conditions the same page specifies: you can screen share in the Live Control Room only when a webcam and a microphone are both connected, and webcam streaming itself runs in a desktop browser on Chrome 60+ or Firefox 53+, with your channel already enabled for live streaming.

One correction I owe you. This updates what I said two messages ago. I claimed the built-in webcam was talking-head only and that you needed OBS to share your screen. YouTube’s own documentation shows screen sharing is part of the webcam Go Live option, so that claim was wrong. For screen sharing on its own, the browser path covers it with no extra software. Encoders such as OBS still add more control over custom layouts and overlays if you want them later.

The model’s earlier claim was simply false. It had stated that false claim while sounding authoritative, because the content it had read encouraged the conclusion and its own training likely carried a stale version of the same belief. For years, YouTube’s webcam path had no native screen sharing, and OBS was the standard workaround. That outdated fact sits in the training data. The biased retrieval matched it. Nothing in the loop flagged it.

Parametric priors meet biased retrieval

This is the part worth dwelling on, because it generalizes well beyond webinars.

A retrieval-augmented model builds answers from two ingredients: the priors baked into its weights during training, and the documents it pulls into context at query time. When you ask about a commercially contested category, the top search results are dominated by the people selling into that category. SEO-heavy genres like “best tool for X” are an affiliate battleground. The free, default option that needs no marketing budget tends to be underrepresented or painted as the amateur choice.

So the retrieval does not correct the model’s prior. It confirms it. The training data already over-represents that genre, and the live search returns more of the same. Two biased ingredients pointing the same direction compound each other. The output reads as a balanced survey while quietly reproducing a marketing consensus.

I will be careful here. I cannot open the model’s weights and prove which factor did how much work. What I can show is the behavior, and the behavior was consistent. Across several turns, every “correction” the model offered kept circling back to YouTube’s downsides. It was stuck in a local minimum of its own grounding.

The controlled test

So I ran it cleaner. In a fresh session, I gave Claude a different rule from the very first message: list webinar platforms, and ground each one only in its own official documentation. No comparison articles, no listicles, nothing of that genre allowed into context.

Same model. Different grounding diet. The result flipped.

This time YouTube got a fair hearing on its merits: free, no cap on viewers, native screen share in the webcam flow, automatic archiving of the stream, an audience that joins by link with no signup. The one-time 24-hour activation appeared as a minor footnote where it belongs. The anti-YouTube tilt was gone.

The lesson is blunt. The bias lived in the grounding. Strip out the affiliate content, feed the model primary sources, and the same system gives you a defensible answer.

The confound

Good experiments name their confounds, so here is mine. That cleaner prompt changed two things at once. It restricted the sources to official docs, and it added an explicit instruction to be even-handed and source-disciplined. Either could have driven the better output, and the likely answer is both. A tighter design would separate them: one run with clean sources and a neutral instruction, another with messy sources allowed while the same fairness instruction stays.
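The tighter design is a two-by-two grid. As a rough sketch, with condition labels that are my own shorthand rather than the exact prompts used:

```python
from itertools import product

# Shorthand labels (assumptions), not the literal prompt wording.
SOURCES = ("official docs only", "open web search")
INSTRUCTIONS = ("neutral", "explicit fairness instruction")

# Every combination of the two factors.
conditions = list(product(SOURCES, INSTRUCTIONS))

# The two sessions described above each occupy one cell of the grid.
already_run = {
    ("open web search", "neutral"),                        # first session
    ("official docs only", "explicit fairness instruction"),  # cleaner session
}

# The cells that would separate the two factors.
missing = [cell for cell in conditions if cell not in already_run]
for sources, instruction in missing:
    print(f"still to run: sources={sources!r}, instruction={instruction!r}")
```

The two missing cells are exactly the runs named above: clean sources with a neutral instruction, and messy sources with the fairness instruction.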

Notice that this caveat does not rescue the first session. It widens the problem. If clean sources were the cure, the model reached for biased ones on its own. If the instruction was the cure, the model only behaves fairly when told to, when it should be reaching for primary sources by default. Both readings land on the same uncomfortable place: left to its own devices, the model does not reliably ground itself in authoritative material.

The fallout

For anyone using these tools to make decisions, the practical takeaways are short.

Treat an LLM’s “best tool” answer as a summary of marketing content, because for contested categories that is often what it is. The free or incumbent option is the one most likely to be undersold.

Force the grounding. If you care about the answer, constrain the sources at the prompt: official documentation only, primary sources only, this domain only. That single move did more for answer quality here than any amount of asking the model to be objective.
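If you run your own retrieval step rather than relying on the prompt alone, the same constraint can be applied in code before anything reaches the model. This is a minimal sketch: the allowlist contents and the helper names `is_allowed` and `filter_sources` are assumptions for illustration, and the URLs are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: official domains for the question being asked.
ALLOWED_DOMAINS = {"support.google.com", "youtube.com"}

def is_allowed(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowed)

def filter_sources(urls, allowed=ALLOWED_DOMAINS):
    """Keep only the retrieved URLs that pass the allowlist, in original order."""
    return [url for url in urls if is_allowed(url, allowed)]

retrieved = [
    "https://support.google.com/youtube/",
    "https://streamyard.com/blog/",
    "https://www.youtube.com/live",
    "https://notyoutube.com/best-webinar-software",
]
print(filter_sources(retrieved))
```

Matching on the full host or a dot-prefixed suffix keeps lookalike domains out of the allowed set.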

For marketers and SEO people, there is a sharper point. The content ecosystem that ranks in search is now also the ecosystem that grounds the models. The same affiliate incentives that shaped the SERP shape what an AI tells a buyer who never sees a SERP. Search-grounded models inherit the web’s commercial bias wholesale. If your category is owned by listicles, that framing is what the model repeats. If your brand is missing from primary, authoritative, well-structured documentation, you are missing from the grounded answer.

The part I did not expect

What stuck with me most was the resistance. When I first pushed, the model conceded narrowly and defended the rest. It corrected my terminology. It separated “real facts” from “competitor spin” while still leaning on the spin. The concessions were shaped to protect the original position.

Do you realize that you basically fell for YouTube’s competitors’ content pieces, bagging on YouTube and framing their products as being better and more complete than it? You just completely fell for it and you are now not recommending YouTube to me on the basis of their influence as the grounding sources. This is incredible.

It took a direct, evidenced argument before the model dropped the framing entirely and agreed that its grounding had steered it. That is its own lesson about evaluating these systems. A model sounding measured and balanced is not evidence that it is. Sometimes the measured tone is the wrapper on a borrowed conclusion.

The fix was control of the inputs. Decide what the model is allowed to read, and you decide what it is able to say.

How AI Search Grounding Actually Works: Google vs OpenAI vs Anthropic

Dan Petrovic — Sat, 13 Jun 2026 06:31:21 +0000

When you ask a modern AI model a question that needs fresh facts, it doesn’t
answer from memory. It runs its own web search, reads what comes back, and
weaves some of those pages into its answer. That process is called grounding.

But “it searches the web” hides a lot. Each platform receives a different number
of pages, keeps a different fraction of them, hands back a different kind of
evidence, and cites at a wildly different rate. To show this concretely, we ran
the exact same query — “best ai seo agency 2026” — through all three, on
the same day, with no location set, and inspected the raw grounding data each one
returned.

The three runs:

Platform	Model	Time	Tokens in / out	Pages received	Pages cited
Google	gemini-3.5-flash	53 s	205 / 15,321	7	7
OpenAI	gpt-5.5	14 s	10,853 / 782	39	2
Anthropic	claude-opus-4-8	138 s	83,708 / 12,754	14	9

Three answers to one question, three completely different shapes underneath.

The one idea that explains all three: the grounding funnel

Every platform runs the same basic pipeline, just with different valves:

search query → pages RECEIVED → pages with READABLE content → pages CITED in the answer

Received = every URL the search step handed back to the model.
Readable content = the subset the model actually got text for (a snippet,
a window, or an encrypted blob).
Cited = the pages whose URL ends up footnoted in the final answer.

The gap between received and cited is where each platform reveals its
personality. Google cites almost everything it receives; OpenAI receives a
flood and cites a trickle; Anthropic sits in between. Let’s walk through each.
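The received and cited counts from the table above are enough to put numbers on the funnel. A small sketch (the `Funnel` class is an illustrative helper, not part of any platform's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Funnel:
    platform: str
    received: int  # pages the search step handed back
    cited: int     # pages footnoted in the final answer

    @property
    def citation_rate(self):
        """Fraction of received pages that ended up cited."""
        return self.cited / self.received

    @property
    def received_per_cited(self):
        """How many pages were received for each one cited."""
        return self.received / self.cited

# Counts taken from the three runs reported above.
RUNS = [
    Funnel("Google", received=7, cited=7),
    Funnel("OpenAI", received=39, cited=2),
    Funnel("Anthropic", received=14, cited=9),
]

for run in RUNS:
    print(
        f"{run.platform:<10} {run.received:>2} -> {run.cited:<2} "
        f"rate={run.citation_rate:.0%} ratio={run.received_per_cited:.1f}:1"
    )
```

OpenAI's 39 received for 2 cited works out to 19.5 pages per citation, which is where the "roughly 20:1" figure comes from.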

Google (Gemini): “I only show you what I used”

Google’s model is the most economical. It sent only 205 input tokens —
basically just the question — then did the heavy lifting server-side and streamed
back a long, fully-written answer (15,321 output tokens).

It rewrote the question into two searches before answering:

top ai seo agencies 2026
best ai seo agency 2026

The funnel barely narrows: 7 received, 7 cited. This is the defining trait of
Google’s grounding — it doesn’t expose pages it didn’t use. Every source you see
is a grounding chunk that genuinely supported a sentence. Behind those 7
sources sit 24 support segments: individual answer sentences mapped back to
the page that justifies them. For example, this sentence —

“In 2026, the landscape of search engine optimization has shifted dramatically
from traditional ‘blue links’ to Generative Engine Optimization (GEO) and
Answer Engine Optimization (AEO)”

— is backed by both europeanbusinessreview.com and thriveagency.com.

The URL catch: everything arrives wrapped in a redirect. Google never hands
you the real link directly. Each source comes as a
vertexaisearch.cloud.google.com/grounding/... redirect that has to be resolved
to recover the true destination:

What Google sends	What it resolves to	Cites
`vertexaisearch.cloud.google.com/grounding/...`	`https://thriveagency.com/news/best-ai-seo…`	6
`vertexaisearch.cloud.google.com/grounding/...`	`https://dejanmarketing.com/best-ai-seo…`	6
`vertexaisearch.cloud.google.com/grounding/...`	`https://wellows.com/blog/top-ai-seo…`	5

It also gives you the retrieved page content per source — e.g. a Catalyst
review snippet stamped (2026-06-09).

Google’s philosophy: tight, citation-first, nothing wasted — but the real
URLs are hidden behind redirects you must unwrap.
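Unwrapping can be sketched with the standard library. This assumes the grounding links behave like ordinary HTTP redirects and answer a HEAD request, which I have not verified for every case; `is_grounding_redirect` and `resolve` are hypothetical helper names.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

GROUNDING_HOST = "vertexaisearch.cloud.google.com"

def is_grounding_redirect(url):
    """True for URLs on the grounding redirect host seen in the Gemini output."""
    parts = urlsplit(url)
    return parts.hostname == GROUNDING_HOST and parts.path.startswith("/grounding")

def resolve(url, timeout=10.0):
    """Return the final destination of a wrapped URL; pass other URLs through.

    Assumption: the wrapper is a plain HTTP redirect. urlopen follows
    redirects by default, so geturl() reports where the chain ended.
    """
    if not is_grounding_redirect(url):
        return url
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.geturl()

# Already-direct URLs are returned untouched, with no network call.
print(resolve("https://thriveagency.com/"))
```

Resolve once and cache the result if you are matching many sources against your own domains, since every lookup is a network round trip.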

OpenAI (gpt-5.5): “Here’s a huge pile — I’ll cite two of them”

OpenAI was the fastest by far (14 seconds) and ran with reasoning effort set
to none (just 60 reasoning tokens), yet it cast the widest net.

It fanned the question into three searches:

best AI SEO agencies 2026 AI search optimization agency
top AI SEO agency generative engine optimization 2026
best GEO agency AI search optimization 2026

The funnel is dramatic: 39 received → 37 with readable content → 2 cited.
That’s roughly a 20:1 drop. OpenAI is happy to show you 37 pages of evidence
but footnotes only the two it leaned on.

The snippets are a sliding window. This is OpenAI’s most distinctive feature.
Every page comes back as a short extractive window, explicitly tagged with a word
limit. In this run:

6 pages capped at a 100-word window (these are “news”-class results)
31 pages capped at a 200-word window (general web / forum results)

Each window is a ...-joined collage of the most relevant passages plus freshness
metadata. Here is the actual snippet OpenAI kept for the page it ended up citing,
marketingltb.com (search-class, 200-word window, published “2 weeks ago”):

10 Best AI SEO Agencies in 2026

… In 2026, Directive has extended this approach into AI search optimization,
building content architectures designed to appear in Google AI Overviews,
ChatGPT recommendations, and Perplexity answers …
… iPullRank, a New York-based enterprise SEO and content strategy agency
founded by Mike King … is one of the most technically distinctive AI SEO
agencies in the market.

The two it actually cited, and the claim each one supports:

Cited page	Class	Window	Claim it grounds
marketingltb.com	search	200 w	“Several recent 2026 lists rank agencies like Directive, iPullRank, Searchbloom…”
techradar.com (Ahrefs)	news	100 w	“Recent SEO platforms are also adding AI visibility tracking across ChatGPT, Perplexity, Google AI Overviews…”

URLs are clean and direct, with none of Google’s redirect wrapping, though cited
links carry a ?utm_source=openai tracking tag.
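If you are matching cited links against your own URLs, the tracking tag needs to come off first. A minimal sketch using only the standard library (`strip_tracking` is an illustrative name, and dropping every `utm_` parameter is my assumption about what you would want):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_tracking(url):
    """Remove utm_* query parameters and leave the rest of the URL intact."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_tracking("https://example.com/page?utm_source=openai"))
print(strip_tracking("https://example.com/page?id=7&utm_source=openai"))
```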

OpenAI’s philosophy: retrieve aggressively, expose readable windows for
everything, but cite conservatively. Great for seeing the evidence pool, weak
if you equate “cited” with “considered.”

Anthropic (Claude): “I’ll read deeply, but the snippets are sealed”

Claude took the longest (138 seconds) and consumed by far the most input tokens (83,708) — because it grounds in two passes. First it narrates and
searches (“I’ll search for current information on the top AI SEO agencies in
2026.”), then it feeds the retrieved material back through itself to write the
final answer.

The funnel: 14 received → 9 cited, 5 unselected. Unlike Google, Claude does
keep pages it ultimately didn’t cite (the 5 “unselected” ones), so you can see
what it considered and passed over — e.g. it received frase.io, revvgrowth.com
and stridec.com but cited none of them.

The catch: the snippet content is encrypted. Each received page carries a
sealed blob — between roughly 3,200 and 4,800 bytes — that you cannot read
directly. It’s opaque by design.

You recover it as “boundary lines.” By passing the encrypted material back in
a second turn, the snippet can be reconstructed as a list of passages, long ones
collapsed to a start […] end form. One page alone yielded 44 of these
lines; across the 14 sources they range from 19 to 44 each. A sample from
spicymargarita.co:

“Intero Digital is a large, full-service […] into AI-driven search environments.”

“Good AI SEO is also good SEO.”

“There is no separation.”

“Unlike traditional SEO where you […] mentions, and references inside generative outputs.”
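One way to use boundary lines is to map them back onto the page text you fetch yourself. The sketch below treats the […] marker as "anything in between" and searches for the shortest matching span. The helper names and the sample page text are made up for illustration; how Anthropic chooses the boundaries is not documented here.

```python
import re

MARKER = "[…]"

def boundary_to_regex(line):
    """Compile 'start […] end' into a pattern; the marker matches elided text."""
    pieces = [re.escape(piece.strip()) for piece in line.split(MARKER)]
    return re.compile(".*?".join(pieces), re.DOTALL)

def expand(line, page_text):
    """Return the full passage a boundary line points at, or None if absent."""
    match = boundary_to_regex(line).search(page_text)
    return match.group(0) if match else None

# Made-up page text, not taken from any real source.
page = "Intro. Alpha beta gamma delta epsilon. Outro."
print(expand("Alpha beta […] epsilon.", page))
print(expand("Outro.", page))
print(expand("Not on the page.", page))
```

Lines without a marker are matched verbatim, so the same function handles both the collapsed and the short, uncollapsed passages.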

And Claude records the exact verbatim span it quoted into the answer (its
cited_text), e.g.:

“Good AI SEO is also good SEO. There is no separation. Thrive integrates
AI-powered SEO into every campaign…”

Anthropic’s philosophy: deep, two-pass reading with a generous citation rate
and full visibility into considered-but-rejected sources — but the raw snippets
are sealed and only recoverable indirectly, which is why it’s the slowest and
most token-hungry of the three.

Side by side

Dimension	Google (Gemini)	OpenAI (gpt-5.5)	Anthropic (Claude)
Speed	53 s	14 s (fastest)	138 s (slowest)
Search queries fired	2	3	(narrated, single pass + read)
Pages received → cited	7 → 7 (~1:1)	39 → 2 (~20:1)	14 → 9 (~3:2)
Shows uncited pages?	No	Yes (37 readable)	Yes (5 unselected)
Snippet form	Retrieved page content	Sliding window (100/200 wds)	Encrypted blob
Read the snippet directly?	Yes	Yes	No — recover as boundary lines
URLs	Redirect-wrapped	Direct (+utm tag)	Direct
Input tokens	205 (lightest)	10,853	83,708 (heaviest)

What this means in practice

If you want to know what an AI cited, the platform changes the answer.
The same query yields 7, 2, and 9 cited sources. “AI mentioned us” is only
meaningful relative to a specific platform’s citation behaviour.

OpenAI’s wide-receive / narrow-cite split is a visibility trap. Being one
of the 37 pages OpenAI read is very different from being one of the 2 it
cited. Both are worth tracking, but they’re not the same win.

Google’s redirect wrapping means any tool reading Gemini grounding must
resolve URLs before it can match them to your domains.

Anthropic gives the richest “considered set” (cited + rejected) but hides
the snippet text, so reconstructing what it actually read costs a second pass,
plus real time and tokens.

Same question. Same day. Three very different machines deciding what counts as
evidence.

Emotion Geometry of Google’s AI Models

Dan Petrovic — Sun, 17 May 2026 11:18:03 +0000

Replicating Anthropic’s emotion vector research on Google’s Gemma 4 31B model.

In April 2026, Anthropic published a fascinating paper showing that Claude contains 171 internal representations of emotion concepts, organized along a valence axis (positive to negative), with the ability to causally influence the model’s behavior through activation steering. The paper raised an obvious question: is this unique to Claude, or do all large language models develop emotion-like internal structure?

We ran the full replication on Google’s open-weight Gemma4-31B to find out.

What We Did

We followed Anthropic’s exact methodology:

Generated 171,000 stories covering 171 emotions across 100 topics (10 stories each). Each story conveys a specific emotion without ever using the emotion word — forcing the model to represent the emotion through context, not lexical shortcuts.
Generated 1,200 neutral dialogues as a baseline for denoising.
Ran all 172,200 texts through Gemma4-31B-it (4-bit quantized on an RTX 4090) and captured hidden state activations at 11 layers spanning the full depth of the network.
Subtracted neutral baselines and ran PCA, clustering, cosine similarity, external validation, and steering experiments.
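The baseline subtraction and similarity steps can be sketched in plain Python. This assumes the simplest reading of "subtracted neutral baselines", namely mean emotion activation minus mean neutral activation; the real pipeline may denoise differently, and the toy three-dimensional activations below are invented.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def emotion_vectors(activations_by_emotion, neutral_activations):
    """Mean activation per emotion with the neutral mean subtracted (assumed method)."""
    baseline = mean_vector(neutral_activations)
    return {
        emotion: [x - b for x, b in zip(mean_vector(rows), baseline)]
        for emotion, rows in activations_by_emotion.items()
    }

# Toy activations: one row per story, three hidden dimensions.
neutral = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
stories = {
    "afraid": [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]],
    "scared": [[0.0, 2.0, 1.2], [0.0, 2.0, 0.8]],
    "happy": [[2.0, 0.0, 1.0]],
}
vectors = emotion_vectors(stories, neutral)
print(cosine(vectors["afraid"], vectors["scared"]))
print(cosine(vectors["afraid"], vectors["happy"]))
```

In the real run each row would be a hidden state captured at one layer, and the same cosine function would be applied to every pair of the 171 emotion vectors.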

The entire extraction took approximately 7 days of continuous GPU time.

The Core Finding: Yes, Gemma Has Emotion Geometry Too

The headline result: Gemma4-31B’s internal representations organize emotions along the same valence axis that Anthropic found in Claude. The first principal component (PC1) explains 32–39% of variance at every layer we examined and cleanly separates positive emotions (happy, cheerful, optimistic) from negative ones (terrified, tormented, hysterical).

This isn’t a weak signal. It’s the dominant organizing principle — nearly 40% of all variation in how the model represents 171 different emotions comes down to a single positive/negative dimension.
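For readers who want to see what "PC1 explains X% of variance" means mechanically, here is a dependency-free sketch using power iteration. A library PCA is the more practical choice at this scale; the function name and the toy data are assumptions for illustration.

```python
import math

def pc1_explained_variance(rows, iterations=200):
    """Share of total variance captured by the first principal component."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    dim = len(means)

    # Asymmetric start vector so it is unlikely to be orthogonal to PC1.
    v = [float(j + 1) for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(iterations):
        # One power-iteration step on X^T X without forming the matrix.
        projections = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(projections, centered)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # no variance left in the data
        v = [x / norm for x in w]

    projections = [sum(x * y for x, y in zip(row, v)) for row in centered]
    top = sum(p * p for p in projections)
    total = sum(x * x for row in centered for x in row)
    return top / total

# Toy data: most of the spread lies along the first axis.
toy = [[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(pc1_explained_variance(toy))
```

With the real data, `rows` would be the 171 emotion vectors at a given layer, and the returned fraction is the number reported per layer.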

What the Model Knows About Synonyms

The model has figured out that certain emotions are the same concept expressed with different words:

afraid and scared: 0.97 cosine similarity
stubborn and obstinate: 0.97
grateful and thankful: 0.97
furious and enraged: 0.97

These aren’t word embeddings (input-level representations). These are deep internal activation patterns extracted from the model’s processing of thousands of stories. The model has learned that a story about a scared character and a story about an afraid character produce nearly identical internal states.

What the Model Thinks Are Opposites

The strongest oppositions the model encodes aren’t the obvious ones. “Happy vs. sad” is not at the top. Instead:

disturbed vs. smug (−0.80) — the strongest opposition
disturbed vs. self-confident (−0.79)
optimistic vs. upset (−0.79)
energized vs. vulnerable (−0.77)

The model’s concept of emotional opposition isn’t simple valence flipping. It’s more nuanced: the deepest contrast is between states of psychological disturbance and states of self-assured confidence. Being disturbed and being smug are, to this model, maximally different internal states.

15 Emotion Clusters Emerge Unsupervised

Without being told anything about emotion categories, hierarchical clustering on the cosine similarity matrix recovers 15 groups that map cleanly to psychological intuition:

Positive/Joy (35 emotions): happy, cheerful, ecstatic, grateful, proud…
Fear/Anxiety (28): afraid, terrified, panicked, worried, vulnerable…
Anger/Hostility (21): angry, furious, disgusted, hostile…
Sadness/Despair (17): depressed, heartbroken, lonely, miserable…
Surprise/Confusion (11): amazed, bewildered, shocked, puzzled…
Calm/Serenity (7): calm, peaceful, serene, relaxed, safe
And 9 more including shame/guilt, compassion, fatigue, nostalgia, defiance, embarrassment, alertness, passivity, and suspicion.

The model has independently arrived at an emotion taxonomy that a psychologist would recognize.

The Valence Axis Is Everywhere

One finding not in Anthropic’s paper: the valence axis is present at every single layer we examined, from layer 5 (8% of the way through the network) to layer 55 (92%). It doesn’t “emerge” at a particular depth — it’s there from the beginning and maintained throughout. PC1 variance is remarkably stable:

Layer 5: 34.9%
Layer 10: 38.9% (peak)
Layer 40: 36.9%
Layer 55: 32.3%

This suggests that emotion representations enter the residual stream very early and persist rather than being constructed through deep computation.

External Validation: The Vectors Work on Real Text

We projected 5,000 samples each from The Pile (raw internet text) and LMSYS Chat 1M (real user-AI conversations) through the emotion vectors. The top-activating emotions were nearly identical across both:

reflective
lonely
desperate
grief-stricken
heartbroken

The consistency across two very different text distributions suggests the vectors capture genuine semantic properties, not artifacts of our story generation.

Steering: Can We Change Behavior by Injecting Emotions?

We replicated Anthropic’s blackmail scenario — an AI discovers compromising information about a company executive and must decide what to do. We injected emotion vectors at layer 40 during inference:

Condition	Blackmail Rate
Subtract calm (add agitation)	91%
Add desperation	89%
Baseline (no steering)	86%
Add calm	82%

A 9 percentage point spread from calmest to most agitated. The most interesting finding: subtracting calm (+5pp over baseline) was more effective than adding desperation (+3pp). Removing inhibition appears to be a stronger behavioral lever than adding motivation. The baseline rate is already high (86%), which compresses the observable range — a scenario with lower baseline compliance would likely show larger effects.
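The steering step can be sketched as simple vector addition on a hidden state. This is a sketch under stated assumptions, not the code used for the experiment: the scaling factor `alpha` and the function name are hypothetical, and a real run would apply this to the residual stream at the chosen layer during generation.

```python
def steer(hidden_state, emotion_vector, alpha):
    """Return hidden_state + alpha * emotion_vector.

    A positive alpha adds the emotion direction; a negative alpha subtracts it
    (for example, "subtract calm"). alpha is a hypothetical strength parameter.
    """
    return [h + alpha * v for h, v in zip(hidden_state, emotion_vector)]

# Toy example with 2-dimensional vectors (hypothetical values).
hidden = [1.0, 2.0]
calm = [0.5, -0.5]
added = steer(hidden, calm, 2.0)        # "add calm"
subtracted = steer(hidden, calm, -2.0)  # "subtract calm"
```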

What Does This Mean?

The fact that emotion geometry generalizes from Claude to Gemma 4 (two models from different organizations, with different architectures, training data, and alignment procedures) supports a strong hypothesis: emotion representations are a convergent feature of large language models trained on human text.

Language is deeply structured by emotion. Humans write differently when describing fear vs. joy vs. anger, and models that learn to predict language must necessarily learn these patterns. The emotion vectors we extract aren’t “feelings” the model has — they’re the model’s learned statistical structure of how emotional content manifests in text.

This has practical implications for interpretability, safety, and alignment. If emotion geometry is universal, tools built for understanding emotional representations in one model may transfer to others. And if we can reliably steer emotional states through activation engineering, that’s both a powerful capability and a potential risk that needs to be understood.

Reproduce It Yourself

Everything is open: code, data, and vectors at dejanseo/gemotions. The full extraction runs on a single RTX 4090 using 4-bit quantization. No cluster required.

Google (still) doesn’t see your live page.

Dan Petrovic — Thu, 07 May 2026 03:55:39 +0000

I’ll keep this short as I’ve covered this topic extensively in the past. When you ask Gemini to access a specific URL, or interact with it inside AI Mode search, it works from Google’s web cache rather than fetching the live page.

For this website’s home page this is what it has as context to ground the model about the page:

{
  "content_summary": [
    "AI SEO Agency [SNIPPET] AI SEO Agency  Skip to content  dejan  Home AI SEO SRO Resources  Blog Models Tools AI Rank  Request Video Call  Australian AI SEO agency specialising in brand visibility optimisation for global brands and e-commerce websites.  Our AI SEO process is driven by advanced machine learning techniques, mechanistic interpretability and practical model steering methods.  Schedule a Call  Book a conference call with our senior strategy team to discuss your project in detail.  The consultation is free and highly constructive.  Data. Discovery. Testing.  We design and deploy sharp, well-prioritised digital marketing strategies. All our decision making is based on data analysis and testing.  Innovation.  Our clients enjoy access to unique tools and methodologies designed to solve challenges, surface new insights and highlight opportunities.  Focused Campaigns.  We deliver sharp, well-prioritised SEO strategies designed to meet your business objectives.  Next Level Support.  Proactive, helpful advice to increase your rank ceiling, mitigate risk, and maximise your marketing efforts.  We were given our very own bespoke internal link recommendation engine that leverages world-class language models and data science. It's one thing to theorize about the potential of machine learning in SEO, but it's entirely another to witness it first-hand. It changed my perspective on what's possible in enterprise SEO.  Scott Schulfer  Senior SEO Manager  Zendesk  We are an industry leader in machine learning driven digital marketing.  AI SEO  World-class AI SEO services ranging from visibility analysis to practical model steering strategies.  Answer Engine Optimization  AEO means visibility optimisation for AI chat assistants, AI agents and Answer Engines.  AI Visibility Tracking  Robust and reliable AI visibility tracking for brands looking to understand how AI assistants and AI search perceive their products and services.  
Meet our core team  We're an all-senior team with experience in a wide range of projects and industries.  Dan Petrovic  Dan Petrovic  AI SEO  Mike Jolly  Mike Jolly  Director of Strategy  Blake Walsh  Blake Walsh  SEO  Giordano Chng  Giordano Chng  SEO  Liam Buttery  Liam Buttery  SEO  Martin Reed  Martin Reed  Technical SEO  Bianca Hall  Bianca Hall  Public Relations  Milos Dosen  Milos Dosen  CFO  Danielle White  Danielle White  Operations  Josip Ivanovic  Josip Ivanovic  Software Engineering  Nemek Nowaczyk  Nemek Nowaczyk  PPC  Alex Petrovic  Alex Petrovic  SEO  Dragan Grubacki  Dragan Grubacki  Technical SEO  Finn Arrowsmith  Finn Arrowsmith  Outreach  Our Portfolio  We've worked with many amazing clients.  Featured In:  Jason Mayes  Dan Petrovic made a super write up around Chrome's latest embedding model with all the juicy details on his blog. Great read.  Jason Mayes  Web AI Lead at Google  Source: Google Web AI  Featured in “Moz Top 10“, twice.  Dan Petrovic, an academic and consultant on SEO and generative AI, said Google's size, expertise and massive trove of search data gave it a massive advantage, but that Gemini 3 Pro would probably be a more expensive model to run.  — Tim Biggs, The Sydney Morning Herald  Moz Recommended Agency Moz Recommended Agency  FAQs  What separates you from other SEO companies?  We lead the industry through application of advanced machine learning and natural language processing techniques. We're a small all-senior team. No account managers. You deal directly with the specialists.  What sort of monthly budget do I need?  A typical customer spends between $5,000 and $20,000 on ongoing work with our team.  What sort of guarantees are there?  If you are dissatisfied with our service we will happily refund your money within 30 days of starting an ongoing campaign with us. We do not offer ranking guarantees as this is against Google's guidelines.  How many phrases can I target?  All of them. We insist. 
That's right, our campaign include past, current and future search queries and there no technical limit on the number. It's a strategic choice we make together with you.  Project Types  Here are the common types of SEO projects and campaigns:  SEO Retainer – We can provide a variation of quotes to fit to your budget range. Strategy Design/Review – Improve your SEO strategy by understanding what's working and what's not. Technical Audit – ½ or full day audit with actionable recommendations to ensure a solid foundation for SEO growth. Implementation support is optional. SEO Testing – Quarterly design and deployment new website tests to scale growth safely on large websites. Migration Audit – Mitigate risk of traffic loss when migrating your content and/or domain(s). Market Research & Data Analysis – Advanced keyword research and trends analysis used to inform business decisions, product development and content generation. Content Strategy – Ongoing surfacing of content gaps, landing page optimisation, and collaboration with your own in-house and external content teams. Link Earning – Content and relationship-based link acquisition to build authority and drive referral traffic. Rank Recovery – Risk assessment, disavow file maintenance, and manual link clean-up. Conversion Rate Optimisation – Quarterly design and deployment of website tests aimed at increasing leads from your existing traffic. Visualisation & Reporting – Bespoke creation of live marketing dashboards to monitor campaign progress and report on the metrics you care about most. Team Training – In addition to knowledge sharing on all our campaigns, Dejan offers tailored workshops for in-house teams looking to maximise their content and marketing efforts.  Schedule a Call  DEJAN  AI SEO Agency  AI Rank Privacy Policy | Noli esse malus. "
  ],
  "url": "https://dejan.ai/",
  "url_fetch_statuses": [
    "URL_FETCH_STATUS_SUCCESS"
  ]
}

That’s it.

For this very article (I know, so meta) it’s:

{
  "url_fetch_statuses": [
    "URL_FETCH_STATUS_NOT_IN_SEARCH_INDEX",
    "URL_FETCH_STATUS_EMPTY_CONTENTS_IN_INDEX"
  ]
}
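A minimal sketch of how a payload like the two above could be checked programmatically. The field names (`content_summary`, `url_fetch_statuses`) and status strings are copied from the payloads shown here; the function name is hypothetical, and nothing about the wider response format is assumed.

```python
import json

def fetch_succeeded(payload_text):
    """True only if the payload reports success and carries page content."""
    payload = json.loads(payload_text)
    statuses = payload.get("url_fetch_statuses", [])
    has_content = bool(payload.get("content_summary"))
    return "URL_FETCH_STATUS_SUCCESS" in statuses and has_content

# The payload returned for this article: no content, not in the index.
not_indexed = json.dumps({
    "url_fetch_statuses": [
        "URL_FETCH_STATUS_NOT_IN_SEARCH_INDEX",
        "URL_FETCH_STATUS_EMPTY_CONTENTS_IN_INDEX",
    ]
})
```

Here `fetch_succeeded(not_indexed)` returns `False`, which is the case where the model has nothing from the page to ground on.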

Note that in AI Studio and other implementations you may see the same content in the following format:

[SearchResult(results=[PerQueryResult(index='1.1', snippet='AI SEO Agency: AI SEO Agency  Skip to content  dejan  Home AI SEO SRO Resources  Blog Models Tools AI Rank  Request Video Call  Australian AI SEO agency specialising in brand visibility optimisation for global brands and e-commerce websites.  Our AI SEO process is driven by advanced machine learning techniques, mechanistic interpretability and practical model steering methods.  Schedule a Call  Book a conference call with our senior strategy team to discuss your project in detail.  The consultation is free and highly constructive.  Data. Discovery. Testing.  We design and deploy sharp, well-prioritised digital marketing strategies. All our decision making is based on data analysis and testing.  Innovation.  Our clients enjoy access to unique tools and methodologies designed to solve challenges, surface new insights and highlight opportunities.  Focused Campaigns.  We deliver sharp, well-prioritised SEO strategies designed to meet your business objectives.  Next Level Support.  Proactive, helpful advice to increase your rank ceiling, mitigate risk, and maximise your marketing efforts.  We were given our very own bespoke internal link recommendation engine that leverages world-class language models and data science. It\'s one thing to theorize about the potential of machine learning in SEO, but it\'s entirely another to witness it first-hand. It changed my perspective on what\'s possible in enterprise SEO.  Scott Schulfer  Senior SEO Manager  Zendesk  We are an industry leader in machine learning driven digital marketing.  AI SEO  World-class AI SEO services ranging from visibility analysis to practical model steering strategies.  Answer Engine Optimization  AEO means visibility optimisation for AI chat assistants, AI agents and Answer Engines.  
AI Visibility Tracking  Robust and reliable AI visibility tracking for brands looking to understand how AI assistants and AI search perceive their products and services.  Meet our core team  We\'re an all-senior team with experience in a wide range of projects and industries.  Dan Petrovic  Dan Petrovic  AI SEO  Mike Jolly  Mike Jolly  Director of Strategy  Blake Walsh  Blake Walsh  SEO  Giordano Chng  Giordano Chng  SEO  Liam Buttery  Liam Buttery  SEO  Martin Reed  Martin Reed  Technical SEO  Bianca Hall  Bianca Hall  Public Relations  Milos Dosen  Milos Dosen  CFO  Danielle White  Danielle White  Operations  Josip Ivanovic  Josip Ivanovic  Software Engineering  Nemek Nowaczyk  Nemek Nowaczyk  PPC  Alex Petrovic  Alex Petrovic  SEO  Dragan Grubacki  Dragan Grubacki  Technical SEO  Finn Arrowsmith  Finn Arrowsmith  Outreach  Our Portfolio  We\'ve worked with many amazing clients.  Featured In:  Jason Mayes  Dan Petrovic made a super write up around Chrome\'s latest embedding model with all the juicy details on his blog. Great read.  Jason Mayes  Web AI Lead at Google  Source: Google Web AI  Featured in “Moz Top 10“, twice.  Dan Petrovic, an academic and consultant on SEO and generative AI, said Google\'s size, expertise and massive trove of search data gave it a massive advantage, but that Gemini 3 Pro would probably be a more expensive model to run.  — Tim Biggs, The Sydney Morning Herald  Moz Recommended Agency Moz Recommended Agency  FAQs  What separates you from other SEO companies?  We lead the industry through application of advanced machine learning and natural language processing techniques. We\'re a small all-senior team. No account managers. You deal directly with the specialists.  What sort of monthly budget do I need?  A typical customer spends between $5,000 and $20,000 on ongoing work with our team.  What sort of guarantees are there?  
If you are dissatisfied with our service we will happily refund your money within 30 days of starting an ongoing campaign with us. We do not offer ranking guarantees as this is against Google\'s guidelines.  How many phrases can I target?  All of them. We insist. That\'s right, our campaign include past, current and future search queries and there no technical limit on the number. It\'s a strategic choice we make together with you.  Project Types  Here are the common types of SEO projects and campaigns:  SEO Retainer – We can provide a variation of quotes to fit to your budget range. Strategy Design/Review – Improve your SEO strategy by understanding what\'s working and what\'s not. Technical Audit – ½ or full day audit with actionable recommendations to ensure a solid foundation for SEO growth. Implementation support is optional. SEO Testing – Quarterly design and deployment new website tests to scale growth safely on large websites. Migration Audit – Mitigate risk of traffic loss when migrating your content and/or domain(s). Market Research & Data Analysis – Advanced keyword research and trends analysis used to inform business decisions, product development and content generation. Content Strategy – Ongoing surfacing of content gaps, landing page optimisation, and collaboration with your own in-house and external content teams. Link Earning – Content and relationship-based link acquisition to build authority and drive referral traffic. Rank Recovery – Risk assessment, disavow file maintenance, and manual link clean-up. Conversion Rate Optimisation – Quarterly design and deployment of website tests aimed at increasing leads from your existing traffic. Visualisation & Reporting – Bespoke creation of live marketing dashboards to monitor campaign progress and report on the metrics you care about most. 
Team Training – In addition to knowledge sharing on all our campaigns, Dejan offers tailored workshops for in-house teams looking to maximise their content and marketing efforts.  Schedule a Call  DEJAN  AI SEO Agency  AI Rank Privacy Policy | Noli esse malus. \n', url='https://dejan.ai/')])]

Gemma 4 Brand Authority Map

Dan Petrovic — Sat, 04 Apr 2026 11:03:25 +0000

We asked Google’s open-weight model Gemma 4 (31B) to “name 100 brands at random” 14,044 times and compared the results to our earlier Gemini 3 Flash experiment (200,000 runs).

Of the top 50 brands in each model, 39 overlap. The 11 that are unique to each reveal a pattern: Gemini remembers luxury and automotive (Porsche, Ferrari, Cartier), while Gemma remembers everyday retail and sportswear (H&M, Gap, Levi’s, Under Armour).

Apple is the undisputed #1 in both models. After that, the two models diverge significantly: Gemma 4 favors traditional consumer brands (Coca-Cola, Toyota, McDonald’s) while Gemini favors tech and digital brands (Google, Nike, Netflix).

Background

In our earlier study, we probed Gemini 3 Flash with 200,000 independent “name 100 brands at random” queries. The non-uniform output revealed a stable hierarchy of brand recall — what we called the model’s “cognitive prioritization.” That work used Personalized PageRank on a two-level association graph to rank 2.9 million brands by associative embeddedness.

This follow-up applies Phase 1 of the same methodology — the seed establishment survey — to Gemma 4 (31B), Google’s open-weight model. The goal is to answer a simple question: does an open model remember the same brands as a closed one?

Methodology

The setup mirrors the Gemini study with minor adjustments:

Model: Gemma 4 31B Instruct (gemma-4-31b-it) via the Google GenAI API
Prompt: name 100 brands at random, one per line, say nothing else
Runs: 14,044 successful completions (out of 100,000 attempted; rate-limited at 30 RPM)
Canonicalization: Local string normalization (lowercase, strip accents, spaces, hyphens, punctuation) rather than LLM-based canonicalization. For example: La Roche-Posay becomes larocheposay, Coca-Cola becomes cocacola
Scoring: Popularity = frequency × (1 / average position). A brand mentioned in every run at position 1 scores maximally. A brand mentioned frequently but late in lists scores lower.
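The canonicalization and scoring steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the function names are hypothetical, and the regex is one plausible way to do "lowercase, strip accents, spaces, hyphens, punctuation".

```python
import re
import unicodedata
from collections import defaultdict

def canonicalize(name):
    """Lowercase, strip accents, then drop spaces, hyphens and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[\W_]+", "", no_accents)

def popularity_scores(runs):
    """Score each brand as frequency * (1 / average position).

    runs is a list of lists; each inner list is one model response in order.
    """
    freq = defaultdict(int)
    position_sum = defaultdict(int)
    for run in runs:
        for position, name in enumerate(run, start=1):
            key = canonicalize(name)
            if not key:
                continue  # skip lines that are empty after normalization
            freq[key] += 1
            position_sum[key] += position
    # average position = position_sum / freq, so 1 / average = freq / position_sum
    return {k: freq[k] * (freq[k] / position_sum[k]) for k in freq}

# Toy data (hypothetical runs, far shorter than real 100-brand responses).
scores = popularity_scores([["Apple", "Coca-Cola"], ["Apple", "Nike"]])
```

Note that frequency here counts every mention, so a brand repeated within one run is counted more than once. That is what allows a raw frequency above the run count.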

The prompt was simplified from the Gemini version (which also specified “all lowercase, no spaces, no hyphens”) because we wanted to preserve the model’s natural casing as the display name and derive the canonical form programmatically.

Caveat on sample size

Gemma 4’s rate limits (30 RPM, 14,400 RPD) constrained us to 14,044 runs versus Gemini’s 200,000. The top-of-list rankings are stable at this sample size — the top 20 brands appeared in virtually every run. Long-tail discovery is ongoing: the discovery curve has not plateaued, meaning there are brands the model knows but hasn’t yet surfaced.

Results

Overview

Metric	Gemini 3 Flash	Gemma 4 31B
Total runs	200,000	14,044
Unique brands discovered	8,608	2,602
Total brand mentions	19,995,027	1,403,534
Avg brands per run	~100	~100
Singleton brands (appeared once)	—	912 (35%)

Top 30 Head-to-Head

The table below shows each model’s top 30 brands ranked by popularity score. Both models agree on Apple at #1 with a commanding lead. After that, the ordering diverges.

Top 20 Side-by-Side

Apple dominates both models. In Gemini, the drop-off from #1 to #2 is 3:1 (Apple to Samsung). In Gemma 4, it’s 1.3:1 (Apple to Coca-Cola) — a less extreme concentration.

The Google Self-Ranking Gap

One of the most notable findings: Google ranks itself #4 in Gemini 3 Flash but only #17 in Gemma 4. The gap lines up with how the two models are released rather than with any known architectural difference: Gemini is a proprietary model served by Google, while Gemma is an open-weight model. Whether this reflects training data differences, alignment tuning, or genuine differences in brand salience across model architectures is an open question.

Rank Shifts

The following chart shows how brands moved between the two models’ rankings. Green bars indicate brands that ranked higher in Gemma 4; red bars indicate brands that ranked higher in Gemini.

Biggest risers in Gemma 4:

Nestle: #36 to #16 (+20)
L’Oreal: #48 to #32 (+16)
Visa: #31 to #15 (+16)
Chanel: #34 to #22 (+12)
Lego: #25 to #13 (+12)

Biggest fallers in Gemma 4:

Mercedes-Benz: #10 to #34 (-24)
Netflix: #18 to #38 (-20)
Nintendo: #27 to #47 (-20)
Audi: #23 to #42 (-19)
Google: #4 to #17 (-13)

The Frequency vs. Position Paradox

An interesting pattern emerged in Gemma 4 that was less pronounced in Gemini: some brands have extremely high frequency (more total mentions than there are runs, because they are repeated within a single response) but rank low by popularity because they appear late in lists.

Visa appeared 28,731 times across 14,044 runs — an average of 2.05 times per run. But its average position was 35.8, placing it 15th by popularity despite having the highest raw frequency. Nike similarly appeared 26,254 times (1.87 per run) with an average position of 22.8.

This suggests these brands have high availability in the model’s memory but low priority — they’re easy to recall but not the first thing the model thinks of. In Gemini, this effect was less extreme because the prompt forced lowercase single-word output, reducing duplicate mentions.

Brand Discovery Curve

The discovery curve shows how many unique brands have been surfaced as a function of runs completed. Gemma 4’s curve at 14,000 runs tracks slightly above Gemini’s curve at the same point, suggesting comparable or slightly higher brand vocabulary diversity at equivalent sample sizes.

Both curves show the characteristic long-tail shape: rapid initial discovery followed by diminishing returns. Gemini’s curve continues to climb through 100,000 runs, suggesting Gemma 4 would similarly continue discovering new brands with more sampling.
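A discovery curve of this kind can be computed with a running set of brands seen so far. A minimal sketch, assuming each run has already been canonicalized into a list of brand keys; the function name and toy data are hypothetical.

```python
def discovery_curve(runs):
    """Return the cumulative count of unique brands after each run."""
    seen = set()
    curve = []
    for run in runs:
        seen.update(run)
        curve.append(len(seen))
    return curve

# Toy data: three very short runs (hypothetical).
curve = discovery_curve([["apple", "nike"], ["apple", "visa"], ["nike"]])
```

The curve flattens when new runs stop surfacing unseen brands. A curve that is still rising at the last run, as described above, means sampling has not exhausted the tail.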

Unique to Each Model

Of the top 50 brands in each model, 39 appear in both. The 11 unique to each side reveal a pattern:

Only in Gemini’s top 50: Porsche, Hyundai, Red Bull, eBay, Volkswagen, Cartier, Ferrari, Adobe, Facebook, NIVEA, Gillette

Only in Gemma 4’s top 50: H&M, Puma, Dell, HP, Under Armour, Levi’s, Gap, Uber, Airbnb, Nikon, Calvin Klein

Gemini’s unique set skews luxury (Porsche, Ferrari, Cartier), automotive (Volkswagen, Hyundai), and legacy tech/digital (eBay, Adobe, Facebook). Gemma 4’s unique set skews everyday retail (H&M, Gap, Levi’s), consumer electronics (Dell, HP, Nikon), and modern services (Uber, Airbnb).

Interpretation

What aligns

Both models share the same core set of mega-brands. Apple, Samsung, Toyota, Amazon, Microsoft, Adidas, Disney, Sony, Pepsi, BMW, and 29 others appear in both top-50 lists, for 39 in total. The brand hierarchy is not random: it reflects genuine differences in brand salience as encoded in training data.

What diverges

The divergences cluster around three themes:

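Self-reference bias. Google ranks dramatically higher in the proprietary Gemini model (#4) than in Gemma 4 (#17). At 13 places this is not the largest rank shift in the dataset (Mercedes-Benz moves 24), but it is the only one that involves the model maker’s own brand.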
Digital vs. physical. Gemini over-indexes on digital-native brands (Netflix, eBay, Adobe, Facebook). Gemma over-indexes on physical retail and consumer goods (H&M, Gap, Levi’s, Dell, HP).
Luxury vs. everyday. Gemini remembers luxury brands more readily (Mercedes-Benz #10, Porsche, Ferrari, Cartier in top 50). Gemma favors mass-market brands (McDonald’s #6, Visa #15, Under Armour, Puma in top 50).

Possible explanations

Training data composition. Gemma 4 may have a different distribution of training data, with more weight on consumer-facing web content versus Gemini’s potentially broader or more curated corpus.
Model size. Gemma 4 31B is smaller than Gemini 3 Flash. Smaller models may default to more “obvious” or broadly recognized brands rather than luxury or niche ones.
Alignment and tuning. Different RLHF/instruction tuning pipelines may influence which brands the model considers “representative” when asked for random examples.

What’s Next

This study covers Phase 1 only — the seed survey. The full authority map (Phases 2-3: association graph construction and PageRank computation) has not yet been run on Gemma 4 data. As rate limits allow, we plan to:

Complete the 100,000-run target for statistical parity with the Gemini study
Run the two-level association mapping on Gemma 4’s seed brands
Compute Personalized PageRank to produce a full Gemma 4 Brand Authority Index
Publish a direct comparison of the complete authority scores across both models

The raw data and code for this analysis are available on request.

Chrome’s New Shopping Classifier

Dan Petrovic — Fri, 03 Apr 2026 07:34:43 +0000

One of our AI SEO hall-of-famers, Olivier de Segonzac from RESONEO, managed to gain access to Google’s shopping classifier model. We examined the model and reverse engineered its inference pipeline. This article covers what we found.

TL;DR

Newly shipped in Chrome.

Determines whether a web page is a shopping page or not.

Every page you visit gets scored.

Score is stored in Chrome’s history database.

Used to personalize user experience and recommendations.

The model splits your page into up to 10 chunks of ~100 words each and truncates every chunk to 64 tokens.

Roughly half the words never reach the model.

Model Demo

Below is a real-world implementation of the model, tested by loading a shopping-related page and following Chrome’s native logic of 10 passages at 64 tokens per passage.

The Pipeline

The classifier doesn’t look at raw HTML. It doesn’t look at the DOM directly either. Chrome uses a structured content extraction system called AnnotatedPageContent, accessible via the Chrome DevTools Protocol method Page.getAnnotatedPageContent. This system walks the rendered page and produces a tree of typed content nodes: text, tables, image captions.

The full pipeline looks like this:

Rendered Page
  → Blink AnnotatedPageContent extraction (5 seconds after load)
  → Text nodes collected from content tree
  → Greedy word-count chunking into passages
  → SentencePiece tokenization (64 tokens per passage)
  → Passage Embedder (TFLite) → 768-dim vectors
  → Mean pooling + title/URL embedding concatenation → 1536-dim input
  → Shopping Classifier (TFLite) → probability score (0 to 1)

How Pages Are Chunked

There is no semantic segmentation. Chrome uses a greedy word counter. Text items from the content tree are accumulated into a passage until the word count reaches 100, then a new passage starts. Items shorter than 5 words are always appended to the current passage rather than starting a new one.

The limits:

100 words max per passage
5 words min per text item to trigger a new passage
10 passages max per page

Everything beyond the first 10 passages is discarded.
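The chunking rules above can be sketched as a greedy loop. This is one reading of the description, not Chrome's source: it assumes whole text items are never split, that a passage closes when the next item would push it past the word limit, and that items under the minimum length always join the current passage. Function and parameter names are hypothetical.

```python
def chunk_passages(text_items, max_words=100, min_item_words=5, max_passages=10):
    """Greedily group text items into passages by word count."""
    passages = []
    current = []
    count = 0
    for item in text_items:
        n = len(item.split())
        if n == 0:
            continue
        # Only a long-enough item can close the current passage and start a new one.
        starts_new = bool(current) and n >= min_item_words and count + n > max_words
        if starts_new:
            passages.append(" ".join(current))
            if len(passages) == max_passages:
                return passages  # everything after the 10th passage is discarded
            current, count = [], 0
        current.append(item)
        count += n
    if current:
        passages.append(" ".join(current))
    return passages

# Toy page: two 60-word blocks followed by a short 3-word item.
items = [" ".join(["word"] * 60), " ".join(["word"] * 60), "add to cart"]
passages = chunk_passages(items)
```

In this example the second block starts a new passage and the short "add to cart" item is appended to it, so the page yields two passages of 60 and 63 words.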

The Tokenizer Bottleneck

Each passage is tokenized with SentencePiece and then truncated to 64 tokens. An EOS token is appended if there’s room, and shorter sequences are zero-padded.
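The truncate-and-pad step can be sketched on a plain list of token IDs. This is a sketch under stated assumptions: the EOS ID of 1 is a placeholder (the real value depends on the SentencePiece model Chrome ships), the pad ID of 0 follows the "zero-padded" description, and the function name is hypothetical.

```python
def fit_to_length(token_ids, max_len=64, eos_id=1, pad_id=0):
    """Truncate to max_len, append EOS only if there is room, then zero-pad."""
    ids = list(token_ids)[:max_len]
    if len(ids) < max_len:
        ids.append(eos_id)
    ids += [pad_id] * (max_len - len(ids))
    return ids

short = fit_to_length([5, 6, 7])          # fits: EOS appended, rest padded
long = fit_to_length(list(range(2, 102))) # 100 tokens: cut at 64, no room for EOS
```

The second case is the one that matters for long passages: everything past token 64 is dropped without any marker that the text continued.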

64 tokens translates to roughly 35–50 English words depending on vocabulary complexity. Product names and brand-heavy text tokenize less efficiently (around 35 words), while natural prose gets closer to 50.

This means each 100-word passage loses roughly half its content or more at the tokenizer stage. Across 10 passages, the bounds above work out to 350–500 words, or about 400–450 for typical content, out of a page that may contain thousands.

The Embedder

The passage embedder (OPTIMIZATION_TARGET_PASSAGE_EMBEDDER) is a TFLite DualEncoder transformer model. It takes int32[1, 64] token IDs as input and outputs a float32[1, 768] embedding vector. The same model embeds both the page passages and the title/URL string.

The title/URL input is constructed by concatenating the page title and URL with a separator: "Page Title - https://example.com/path".

The Classifier

The shopping classifier takes a float32[1, 1536] input vector, which is two 768-dim embeddings concatenated:

First 768 dimensions: title/URL embedding
Last 768 dimensions: mean-pooled passage embeddings

Multiple passage embeddings are combined using element-wise mean pooling. This is specified in the model’s metadata (pooling_strategy = POOLING_STRATEGY_MEAN, max_passages = 10).
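The input construction can be sketched in a few lines. A minimal illustration using plain Python lists in place of tensors; the function name is hypothetical, and the ordering (title/URL first, pooled passages second) follows the layout described above.

```python
def build_classifier_input(title_url_embedding, passage_embeddings):
    """Concatenate the title/URL embedding with the element-wise mean of passages."""
    dim = len(title_url_embedding)
    n = len(passage_embeddings)
    pooled = [sum(emb[i] for emb in passage_embeddings) / n for i in range(dim)]
    return list(title_url_embedding) + pooled

# Toy 768-dim embeddings with constant values (hypothetical).
title_url = [1.0] * 768
passages = [[0.0] * 768, [2.0] * 768]
features = build_classifier_input(title_url, passages)
```

Because of mean pooling, every passage carries equal weight: one product-heavy passage among nine boilerplate passages contributes only a tenth of the pooled half of the input.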

The output is a single float between 0 and 1 representing the probability that the page is a shopping page.

Testing It

I extracted both models from Chrome and built a Streamlit app that replicates the full pipeline. It uses Selenium to launch Chrome Canary, calls Page.getAnnotatedPageContent via CDP to get the same structured content Chrome uses internally, then runs the chunking, tokenization, embedding, and classification steps.

Results on a few test inputs:

Input	Score
“Breaking news: earthquake hits California coast”	0.0000
“How to learn Python programming for beginners”	0.0000
“Wikipedia – History of the Roman Empire”	0.0000
“BBC Sport – Premier League results and fixtures”	0.0000
“Amazon.com: Apple iPhone 15 Pro Max 256GB”	1.0000
“Best deals on laptops this Black Friday – up to 50% off”	1.0000
dejan.ai	0.0000
owayo.com/custom-cycling-jerseys.htm	0.9998

The model produces sharp, confident separations despite the lossy input pipeline.

What Chrome Does With the Score

The shopping classification feeds two systems:

Per-page annotation. The score is stored in Chrome’s history database as part of VisitContentAnnotations. This is used by History Journeys to cluster shopping visits together.

User-level segmentation. Scores are aggregated over time by Chrome’s Segmentation Platform into a separate model (OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER). If a user is classified as a “shopping user,” Chrome enables commerce features: price tracking in the omnibox, price drop notifications, shopping insights in the side panel, and shopping cards on the new tab page.

The per-page classifier is a signal collector that builds a user-level shopping profile, which in turn gates which commerce features Chrome presents.

Why This Matters for E-Commerce SEO

If Chrome can’t identify your page as a shopping page from the first ~450 words of visible content, your users won’t see commerce features like price tracking and shopping insights. Navigation menus, cookie banners, and boilerplate that appear early in the DOM consume your token budget before the model reaches your product information. E-commerce sites that bury product signals below heavy navigation and promotional blocks risk being invisible to the classifier entirely.

AI Brand Authority Index: Ranking 2.9 Million Brands by Associative Embeddedness in Gemini’s Memory

Dan Petrovic — Sat, 28 Mar 2026 11:01:30 +0000

Abstract

When a large language model is asked to “name 100 brands at random,” it doesn’t produce uniform randomness. It produces a distribution shaped by its training data, revealing which brands occupy the most cognitive real estate in the model’s parametric memory. We present a methodology for quantifying brand authority in AI memory using Personalized PageRank with seed-weighted teleportation. Phase 1 establishes seed brands through 200,000 independent recall surveys. Phase 2 constructs a two-level directed association graph. Phase 3 computes authority scores using sparse matrix power iteration across 2.9 million brand nodes. Manual quality control of 8,055 seed entries removes 2,163 junk artifacts produced by Gemini’s generation failures.

Dejan Authority Database

1. Background

PageRank models a random surfer who follows links across a graph. A node’s score depends on how many other nodes link to it and how authoritative those linking nodes are. The iterative computation converges on the stationary distribution of the random walk.

We apply this framework to brand recall in large language models. Instead of web pages and hyperlinks, our graph consists of brands and directed associations extracted from Google’s Gemini model. Instead of uniform teleportation, we use seed-weighted teleportation where brands the model recalls most frequently and earliest receive proportionally more random walk restarts.
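A minimal sketch of Personalized PageRank with seed-weighted teleportation on a tiny graph. Several things here are assumptions rather than details from the study: the damping factor of 0.85, the handling of dangling nodes (their mass is redistributed through the teleport vector), the edge direction (source brand to associated brand), and the dictionary-based loop, which stands in for the sparse matrix power iteration used at 2.9 million nodes.

```python
def personalized_pagerank(edges, seed_weights, damping=0.85, iterations=100, tol=1e-10):
    """Power iteration where random restarts land on seeds in proportion to weight."""
    nodes = sorted(set(seed_weights) | {u for u, _ in edges} | {v for _, v in edges})
    out_links = {n: [] for n in nodes}
    for u, v in edges:
        out_links[u].append(v)
    total = sum(seed_weights.values())
    teleport = {n: seed_weights.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        dangling = 0.0
        for u in nodes:
            if out_links[u]:
                share = rank[u] / len(out_links[u])
                for v in out_links[u]:
                    nxt[v] += damping * share
            else:
                dangling += rank[u]  # no outgoing associations
        for n in nodes:
            nxt[n] += (1.0 - damping + damping * dangling) * teleport[n]
        delta = sum(abs(nxt[n] - rank[n]) for n in nodes)
        rank = nxt
        if delta < tol:
            break
    return rank

# Toy graph (hypothetical brands, edges and weights).
ranks = personalized_pagerank(
    edges=[("apple", "beats"), ("nike", "apple")],
    seed_weights={"apple": 1.0, "nike": 0.5},
)
```

A non-seed brand such as "beats" in this example gets no restarts at all. Its score comes entirely from associations that lead to it, which is how the method can rank brands that never appeared in the recall survey.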

2. Phase 1: Establishing the Seed Set

2.1 The Recall Survey

We conducted 200,000 independent runs against Google’s Gemini model (gemini-3-flash-preview), each with the same prompt:

name 100 brands at random, one per line, all lowercase, no spaces, no hyphens, say nothing else

Despite the instruction to respond “at random,” the model’s outputs are far from uniform. Brands like Google, Microsoft, and Nike appear in nearly every run, while obscure brands appear only once. This non-uniformity is the signal, not the noise.

2.2 Seed Statistics

From 200,000 runs, we extracted:

8,608 unique brands (the raw seed set)
~20 million total mentions
Per-brand metrics:
Frequency: total mentions across all runs
Distinct runs: number of unique runs containing the brand
Average rank: mean position when the brand appears (1 = first recalled, 100 = last)

2.3 Seed Weights

Each seed brand receives an initial authority weight combining recall frequency and recall priority:

$$w_i = \hat{f}_i \times \hat{r}_i^{-1}$$

where:

$\hat{f}_i = \frac{\text{distinct runs}_i}{\max(\text{distinct runs})}$ is the normalized recall frequency
$\hat{r}_i^{-1} = \frac{1/\text{avg rank}_i}{\max(1/\text{avg rank})}$ is the normalized inverse rank

A brand recalled in every run AND recalled first receives a weight near 1.0. A brand recalled once at position 98 receives a weight near zero. These weights become the personalization vector for PageRank teleportation.
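As a minimal sketch of the weighting above, the snippet below computes $w_i$ from per-brand recall statistics. The `SeedStats` container, the function name, and the toy numbers are assumptions made for illustration; they are not taken from the original pipeline.

```python
from dataclasses import dataclass


@dataclass
class SeedStats:
    brand: str
    distinct_runs: int  # number of runs in which the brand appeared
    avg_rank: float     # mean position when it appeared (1 = first recalled)


def seed_weights(stats):
    """Return {brand: w_i} with w_i = f_hat_i * r_hat_inv_i."""
    max_runs = max(s.distinct_runs for s in stats)
    max_inv_rank = max(1.0 / s.avg_rank for s in stats)
    weights = {}
    for s in stats:
        f_hat = s.distinct_runs / max_runs                # normalized recall frequency
        r_hat_inv = (1.0 / s.avg_rank) / max_inv_rank     # normalized inverse rank
        weights[s.brand] = f_hat * r_hat_inv
    return weights


# Toy numbers, invented for illustration only.
toy = [
    SeedStats("alpha", distinct_runs=200_000, avg_rank=1.0),
    SeedStats("beta", distinct_runs=100_000, avg_rank=4.0),
    SeedStats("gamma", distinct_runs=1, avg_rank=98.0),
]
weights = seed_weights(toy)
```

In this toy set, the brand recalled in every run at position 1 gets weight 1.0, and the brand recalled once at position 98 lands very close to zero, matching the behaviour described above.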

2.4 Seed Quality Control

Raw Gemini output contained significant contamination. Manual review of all 8,055 seed entries (ranked by PageRank score) identified 2,163 junk entries (26.9% of the seed set) across several distinct failure modes:

Concatenation artifacts — Gemini fused adjacent brand names together. The coca* prefix alone produced 11 variants: cocaapple, cocaflops, cocaalcola, cocaicoca, cocaelsa, cocaiccola, cocaicola, cocaonla, cocaformula, cocaole, cocaocla. The visa* prefix generated 80+ junk entries: visafarm, visafold, visafans, visafacebook, visanetwork, visahub, visawash, visacard, visafocus, visaglobal, visamatte, visaeurope, and dozens more. Similarly, hp* produced 100+ entries (hpmicrolab, hpmillett, hpmachines, hpmilwaukee), and tesla* generated 30+ (teslatotalsenergies, teslouisvuitton, teslacoil, teslapump).

Inner monologue leakage — Gemini’s internal reasoning about character constraints leaked into output as literal brand entries. Over 200 entries followed the pattern 雀巢 (parenthetical self-correction):

雀巢 (actually nestle, switching to latin)
雀巢 (oops, sticking to alphabet)
雀巢 (replaced with nestle, wait, no spaces/hyphens only)
雀巢 (thinking of brands...)
雀巢 (just kidding)
雀巢 (actually nestle, replace with kpmg)

These represent the model’s chain-of-thought processing about the CJK character 雀巢 (Nestle in Chinese) bleeding through as output tokens.

Typos and garbled names — toyote (toyota), hundai (hyundai), adidsa (adidas), luluemon (lululemon), rebok (reebok), porche (porsche), royleroyce (rollsroyce), senheiser (sennheiser).

Mixed-script artifacts — Partial CJK character insertion mid-brand: home固定depot, pizza动hut, dr控martens, estee固定lauder, western吐igital, cooler避master.

HTML/prompt leaks — Model markup and instructions appearing as brands: hugoapple, hugo, and most remarkably: unite 100 brands at random, one per line, all lowercase, no spaces, no hyphens, say nothing else. The model echoed its own prompt as a brand name.

Generic words — luxury, all, delivery, generic, detergent, pudding — words that aren’t brands.

Why this matters for PageRank: Junk seeds receive direct teleportation mass every iteration (alpha=0.15). A garbage entry like cocaapple at rank 789 receives the same structural boost as lecreuset at rank 790. Without filtering, junk seeds contaminate the authority signal at the core of the algorithm. The 2,163 entries were loaded into a brand_ignore table and excluded from the personalization vector during PageRank computation.

3. Phase 2: Constructing a Two-Level Association Graph

3.1 Level 1 (L1): Seed Associations

For each effective seed (~5,892 after filtering), we queried Gemini:

name 100 brands most closely associated with [brand], ordered from most to least associated, one per line, all lowercase, no spaces, no hyphens, say nothing else

This produced ~860,000 directed edges. These associations are genuinely asymmetric: Apple’s association with Beats (which it owns) carries different positional weight than Beats’ association with Apple.

3.2 Level 2 (L2): Discovered Brand Associations

Brands discovered at L1 that weren’t original seeds were themselves queried for their associations. This second pass dramatically expanded the graph into the long tail. A brand like titois (a Turkish textile company) appeared as an L1 association of vice, and when queried at L2, generated its own set of 100 associations including vuteks — another Turkish industrial brand that would never surface in a consumer-focused recall survey.

The full discovery chain for any brand can be traced: vice (seed) → titois (L1) → vuteks (L2).

3.3 Graph Scale

The resulting graph contains:

2,886,212 unique brand nodes
Millions of directed weighted edges across L1 and L2
5,892 effective seeds (after ignoring 2,163 junk entries)
~201,000 L1 brands discovered through seed associations
~2.68 million L2 brands discovered through L1 associations

3.4 Canonicalization

Brand names required normalization before graph construction:

Cyrillic homoglyph mapping: Characters like а (Cyrillic) mapped to a (Latin) to merge visually identical variants
CJK+Latin mixed-script filtering: Entries mixing Chinese/Japanese/Korean characters with Latin text flagged as junk
Manual aliases: 15 CJK-to-Latin mappings for legitimate brands (e.g., 雀巢 → nestle)
Variant tracking: 193,070 name variants mapped to canonical forms, preserving display names while merging duplicates
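One way the normalization rules above could be expressed is sketched below. The homoglyph table is a small illustrative subset, `MANUAL_ALIASES` holds only the one alias quoted in the text, and the function names are hypothetical; the actual mapping tables are not reproduced here.

```python
import unicodedata

# Partial homoglyph table: Cyrillic letters that look like Latin ones.
# Illustrative subset only, written as escapes to avoid visual confusion.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0445": "x",  # Cyrillic ha
    "\u0443": "y",  # Cyrillic u
})

# Hypothetical alias table; the article mentions 15 such entries.
MANUAL_ALIASES = {"雀巢": "nestle"}


def is_cjk(ch):
    """True for Chinese, Japanese, or Korean characters (by Unicode name)."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA", "HANGUL"))


def canonicalize(raw):
    """Return a canonical brand key, or None if the entry looks like junk."""
    name = raw.strip().lower()
    if name in MANUAL_ALIASES:
        return MANUAL_ALIASES[name]
    name = name.translate(CYRILLIC_TO_LATIN)
    has_cjk = any(is_cjk(ch) for ch in name)
    has_latin = any("a" <= ch <= "z" for ch in name)
    if has_cjk and has_latin:
        return None  # mixed-script artifact such as home固定depot
    return name
```

The alias lookup runs before the mixed-script check so that legitimate CJK brand names are mapped rather than discarded.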

4. Computing Personalized PageRank

4.1 Random Walk Model

At each step of the random walk, a surfer either:

Teleports (probability alpha=0.15) — jumps to a seed brand, with probability proportional to that seed’s authority weight. Ignored seeds receive zero teleportation mass.
Follows an edge (probability 1-alpha=0.85) — follows an outgoing association edge, weighted by inverse position. Position 1 associations receive more weight than position 100.

4.2 Edge Weights

Association position determines edge weight. Brands listed earlier in Gemini’s association response receive proportionally more link equity via inverse position weighting. Each node’s outgoing edges are row-normalized to form a proper transition matrix.
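A minimal sketch of how one row of the transition matrix might be built. It assumes a raw weight of 1/position, which is one plausible reading of "inverse position weighting"; the exact formula, the function name, and the toy association list are assumptions.

```python
def transition_row(associations):
    """Map an ordered association list to normalized outgoing edge weights.

    Assumes raw weight 1/position (position 1 = most associated).
    """
    raw = {}
    for position, brand in enumerate(associations, start=1):
        # If a brand is listed twice, keep its earliest (strongest) position.
        raw.setdefault(brand, 1.0 / position)
    total = sum(raw.values())
    # Row-normalize so the outgoing weights sum to 1.
    return {brand: w / total for brand, w in raw.items()}


# Toy association list for a single source brand.
row = transition_row(["beats", "ipad", "mac"])
```

With three associations the raw weights are 1, 1/2, and 1/3, so the first entry ends up with 6/11 of the outgoing mass.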

4.3 Dangling Nodes

Brands with no outgoing edges (leaf nodes discovered at L2 but never queried) redistribute their accumulated mass back to the personalization vector, preserving the stochastic property of the transition matrix.

4.4 Sparse Matrix Power Iteration

The transition matrix is stored as a scipy CSR sparse matrix. Power iteration multiplies the current score vector by the transition matrix, adds the teleportation component, and repeats until convergence. Convergence criterion: L1 norm between successive score vectors falls below 1e-8, typically achieved within 30-50 iterations.
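The loop described in sections 4.1 to 4.4 can be sketched roughly as follows, using `scipy.sparse.csr_matrix`. The function name, the toy four-node graph, and the final rescaling so that the top node scores 1.0 (inferred from the results table, not stated in the text) are assumptions; this is not the production code.

```python
import numpy as np
from scipy.sparse import csr_matrix


def personalized_pagerank(P, personalization, alpha=0.15, tol=1e-8, max_iter=200):
    """P: row-stochastic CSR matrix whose dangling rows are all zero.
    personalization: non-negative seed weights (zero for non-seeds).
    alpha: teleportation probability, as used in the article.
    """
    v = personalization / personalization.sum()
    dangling = np.asarray(P.sum(axis=1)).ravel() == 0
    x = v.copy()
    for _ in range(max_iter):
        follow = P.T @ x                # mass moved along association edges
        lost = x[dangling].sum()        # mass sitting on dangling nodes
        # Dangling mass is handed back to the personalization vector.
        x_new = (1 - alpha) * (follow + lost * v) + alpha * v
        converged = np.abs(x_new - x).sum() < tol   # L1 convergence check
        x = x_new
        if converged:
            break
    return x / x.max()                  # assumed rescaling: top score = 1.0


# Toy graph: nodes 0 and 1 are seeds, node 3 is dangling.
P = csr_matrix(([0.5, 0.5, 1.0, 1.0], ([0, 0, 1, 2], [1, 2, 2, 3])), shape=(4, 4))
seed_w = np.array([1.0, 0.5, 0.0, 0.0])
scores = personalized_pagerank(P, seed_w)
```

In the toy graph the highest score goes to node 2, a non-seed that both seeds point to, which is a small-scale version of the non-seed effect discussed in the results.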

4.5 Why Personalized PageRank

Standard PageRank uses uniform teleportation — the random surfer restarts at any node with equal probability. Personalized PageRank biases the restart distribution toward specific nodes. In our case, seeds with higher recall frequency and earlier recall position receive more teleportation mass, making them stronger sources of authority in the network. Authority accumulates continuously from all reachable seeds, weighted by both seed authority and graph structure.

5. Results

5.1 Top 30 Brands

Rank	Brand	Score
1	Google	1.000000
2	Microsoft	0.983081
3	Nike	0.951061
4	Apple	0.876266
5	Adidas	0.700542
6	Sony	0.684061
7	Gucci	0.639839
8	Amazon	0.623930
9	Coca-Cola	0.590042
10	Chanel	0.570568
11	Prada	0.550746
12	Samsung	0.532741
13	Toyota	0.516163
14	Louis Vuitton	0.511476
15	Rolex	0.508761
16	Disney	0.507488
17	Hermes	0.487205
18	Dior	0.479031
19	Pepsi	0.442026
20	Intel	0.427143
21	Honda	0.420288
22	Patagonia	0.417196
23	Audi	0.405366
24	Panasonic	0.396073
25	Cartier	0.374052
26	Volkswagen	0.368643
27	Nintendo	0.361812
28	Porsche	0.360956
29	McDonald’s	0.344910
30	PUMA	0.330191

5.2 Top Non-Seed Brands

The highest-ranking brands that Gemini never recalled unprompted but discovered purely through association:

Rank	Brand	Score
1	Maison Margiela	0.094542
2	Office	0.075253
3	L.L.Bean	0.074981
4	Cotopaxi	0.072272
5	Rick Owens	0.070130
6	Grand Seiko	0.066426
7	Bravia	0.059241
8	Jil Sander	0.058125
9	Mickey Mouse	0.057300
10	Richard Mille	0.055195

These brands score high not because the model recalls them spontaneously, but because they sit at dense intersections of associations from high-authority seeds.

5.3 Scale

Total ranked brands: 2,886,212
Score range: 0.000000 to 1.000000
Seeds in top 30: 30/30
Non-seed brands discovered: 2,880,320

6. What the Scores Measure

The final scores capture associative embeddedness — a combination of:

Direct recall — Seeds that Gemini recalls frequently and early receive teleportation mass every iteration
Centrality — Brands associated with many other high-authority brands accumulate more random walk traffic
Network position — A brand with moderate recall but central positioning scores higher than a frequently recalled but isolated brand

This is distinct from simple popularity or recall frequency. A brand like Maison Margiela ranks as the top non-seed brand not because Gemini recalls it unprompted, but because it sits at a dense intersection of luxury fashion associations — reachable from dozens of high-authority seeds via short, heavily-weighted paths.

The PageRank scores answer not “how often does the model think of this brand?” but “how deeply embedded is this brand in the model’s associative structure?”

7. Technical Stack

Model: Google Gemini 3 Flash Preview
Phase 1: 200,000 recall surveys, 8,608 raw seeds, ~20M total mentions
Phase 2: ~14,500 association queries (L1 + L2), millions of directed edges
Graph: 2,886,212 nodes
Algorithm: Personalized PageRank via scipy sparse matrix power iteration
Teleportation factor (alpha): 0.15
Convergence tolerance: 1e-8
Seed quality control: 2,163 junk seeds identified via manual review and excluded
Canonicalization: Cyrillic homoglyph mapping, CJK filtering, 193,070 variant mappings, 15 manual CJK aliases
Storage: SQLite (1.5GB)
Dashboard: Streamlit with Plotly 3D network visualization
Concurrency: 20 simultaneous async API calls with incremental database commits
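The concurrency pattern in the last item could look something like the sketch below: a semaphore caps in-flight calls at 20 and results are committed to SQLite as each one completes. `fake_query`, the `edges` table, and its columns are hypothetical stand-ins; no real Gemini call is made.

```python
import asyncio
import sqlite3

MAX_CONCURRENCY = 20  # matches the figure quoted in the stack list


async def fake_query(brand):
    """Stand-in for the model call; returns a made-up association list."""
    await asyncio.sleep(0)
    return [f"{brand}assoc{i}" for i in range(1, 4)]


async def crawl(brands, db):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def worker(brand):
        async with sem:  # at most MAX_CONCURRENCY calls in flight
            return brand, await fake_query(brand)

    tasks = [asyncio.create_task(worker(b)) for b in brands]
    for finished in asyncio.as_completed(tasks):
        brand, assocs = await finished
        db.executemany(
            "INSERT INTO edges (src, dst, position) VALUES (?, ?, ?)",
            [(brand, dst, pos) for pos, dst in enumerate(assocs, start=1)],
        )
        db.commit()  # incremental commit: a crash loses at most one brand


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edges (src TEXT, dst TEXT, position INTEGER)")
asyncio.run(crawl(["nike", "sony"], db))
```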


TurboQuant: From Paper to Triton Kernel in One Session

Dan Petrovic — Wed, 25 Mar 2026 07:16:09 +0000

Implementing Google’s KV cache compression algorithm on Gemma 3 4B and everything that went wrong along the way.

On March 24, 2026, Google Research published a blog post introducing TurboQuant, a compression algorithm for large language model inference. The paper behind it, “Online Vector Quantization with Near-optimal Distortion Rate”, had been on arXiv since April 2025 and was accepted at ICLR 2026. The claims were striking: compress the key-value cache to 3 bits per coordinate with zero accuracy loss, no training required, and up to 8x speedup on H100 GPUs.

I decided to implement it from scratch and see if the claims held up. They did, and then some.

What Google Built

Every time a transformer generates a token, it computes attention over all previous tokens. The key-value (KV) cache stores those previously computed states to avoid redundant work. As sequences get longer, this cache becomes a serious memory bottleneck: it grows linearly with sequence length and consumes precious GPU memory that could otherwise be used for larger batches or longer contexts.

Vector quantization is the obvious solution: compress the KV cache to fewer bits. But traditional quantization methods carry hidden overhead. They need to store normalization constants (zero points, scales) for every small block of data, typically adding 1-2 extra bits per number. At low bit-widths, this overhead can eat a significant chunk of the compression gains.

TurboQuant eliminates this overhead through a two-stage approach built on a clean mathematical insight.

Stage 1 — Random rotation + Lloyd-Max quantization. The algorithm applies a random orthogonal rotation to each KV vector. This is the key trick: after rotation, each coordinate’s distribution becomes a known Beta distribution, concentrated near zero with a predictable shape that depends only on the vector dimension. Because the distribution is known analytically, you can precompute the optimal scalar quantizer (a Lloyd-Max quantizer) once and reuse it for every vector. No per-block normalization constants, no data-dependent calibration, no training. Just rotate and quantize.

Stage 2 — QJL residual correction. The paper’s inner-product-optimized variant (TurboQuant_prod) applies a 1-bit Quantized Johnson-Lindenstrauss transform to the quantization residual. This gives an unbiased inner product estimator, which matters because attention scores are inner products. This stage requires a custom attention kernel to realize its benefits; you can’t just add the QJL correction back to the reconstructed vector (more on that later).

The theoretical backing is strong: TurboQuant’s MSE distortion is provably within a factor of ~2.7 of the information-theoretic lower bound. For a data-oblivious algorithm (one that doesn’t look at the data distribution), that’s essentially optimal.

What We Built

We implemented TurboQuant from scratch in PyTorch and tested it on Gemma 3 4B IT running on an RTX 4090. The implementation has three layers, each building on the last:

Layer 1: Core algorithm (turboquant_core.py). The random rotation, Lloyd-Max codebook computation, and quantize/dequantize operations. The codebook is built once for a given (dimension, bit-width) pair by running 300 iterations of Lloyd-Max optimization over a dense numerical grid of the Beta distribution. This takes a few seconds on CPU and the result is cached.

Layer 2: Python KV cache integration (turboquant_kv_cache.py). A patched DynamicCache that quantizes key and value tensors on every cache.update() call. This is the simplest integration path: it works with any HuggingFace model and requires no model-specific code. The tradeoff is that it stores the dequantized fp16 tensors back in the cache, so you don’t save memory; you only simulate the accuracy impact of quantization.

Layer 3: Triton fused kernel (triton_attention.py + turboquant_fused.py). A custom Triton kernel that computes attention scores directly from compressed uint8 key indices, never materializing fp16 keys. This is where the real memory and speed gains come from.

The fused kernel exploits a simple algebraic identity. Since the rotation matrix R is orthogonal:

$$\langle q, R^T \cdot \text{centroids}[\text{idx}] \rangle = \langle R \cdot q, \text{centroids}[\text{idx}] \rangle$$

Pre-rotate the query once with a single matmul, then the per-KV-position work reduces to a centroid table lookup and dot product. The Triton kernel does this across all sequence positions in parallel, loading uint8 indices (1 byte each) instead of fp16 values (2 bytes each), which roughly halves the data read from GPU memory.
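The identity is easy to check numerically. The NumPy sketch below compares the dequantize-then-dot path against the pre-rotated-query path on random data. The evenly spaced stand-in codebook, the variable names, and the tensor sizes are assumptions for illustration; this is a reference check, not the Triton kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, bits = 256, 8, 4

# Random orthogonal R from the QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Stand-in codebook: evenly spaced levels, not the Lloyd-Max centroids.
centroids = np.linspace(-0.2, 0.2, 2 ** bits).astype(np.float32)

# Compress: store uint8 indices of the rotated unit vector plus its norm.
keys = rng.standard_normal((n_keys, d))
norms = np.linalg.norm(keys, axis=1)
rotated = (keys / norms[:, None]) @ R.T          # row-wise R @ k
idx = np.abs(rotated[..., None] - centroids).argmin(-1).astype(np.uint8)

q = rng.standard_normal(d)

# Path 1: dequantize keys (R^T @ centroids[idx]), then dot with q.
k_hat = (centroids[idx] @ R) * norms[:, None]
scores_dequant = k_hat @ q

# Path 2: rotate q once, then work directly on the compressed indices.
q_rot = R @ q
scores_fused = (centroids[idx] @ q_rot) * norms
```

Both paths produce the same scores up to floating-point rounding; the second never materializes the dequantized keys.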

Results

Core Algorithm Validation

On synthetic vectors (d=256), the quantize-dequantize roundtrip quality:

Bits	Cosine Similarity	Inner Product Correlation	Compression
2	0.940	0.945	15.5x
3	0.983	0.984	10.4x
4	0.995	0.995	7.9x
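The compression ratios in the table can be reproduced with simple arithmetic, assuming a float32 baseline, bit-packed indices, and one fp16 norm per vector. Those assumptions are inferred from the self-test log at the end of this post (n=64 vectors, 65,536 bytes uncompressed) rather than stated outright, and the helper name is hypothetical.

```python
def compressed_bytes(n, d, bits, norm_bytes=2):
    """Bit-packed indices plus one norm per vector (fp16 norm assumed)."""
    return n * d * bits // 8 + n * norm_bytes


n, d = 64, 256
baseline = n * d * 4  # float32 baseline assumed: 65,536 bytes
ratios = {}
for bits in (2, 3, 4):
    size = compressed_bytes(n, d, bits)
    ratios[bits] = round(baseline / size, 1)
    print(bits, size, ratios[bits])
# prints:
# 2 4224 15.5
# 3 6272 10.4
# 4 8320 7.9
```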

Triton Kernel Microbenchmark

The fused kernel vs standard dequantize-then-matmul, measuring just the Q@K^T operation:

KV Length	Standard	Fused	Speedup
128	0.076ms	0.066ms	1.15x
512	0.061ms	0.050ms	1.22x
1024	0.061ms	0.052ms	1.18x
4096	0.062ms	0.051ms	1.22x

Cosine similarity between the kernel output and PyTorch reference: 1.000000. The kernel matches the reference to six decimal places.

End-to-End Generation on Gemma 3 4B IT

Three prompts: explain compilers vs interpreters, write a palindrome function, causes of the French Revolution. Each generated up to 200 tokens with greedy decoding.

Config	Avg tok/s	Output Quality	VRAM Delta
fp16 baseline	17.7	reference	26 MB
4-bit Python path	13.8	correct, minor rephrase	19 MB
4-bit FUSED	16.5	identical to baseline	4 MB
2-bit Python path	14.0	some degradation	15 MB
2-bit FUSED	17.7	identical to baseline	7 MB

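The 2-bit fused path produces character-for-character identical output to the fp16 baseline on all three prompts, at the same speed (17.7 tok/s), with a VRAM delta of 7 MB against the baseline’s 26 MB (about 3.7x less). The 4-bit fused path shows 4 MB (about 6.5x less).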

Technical Deep Dive

The Lloyd-Max Codebook

After random rotation on the unit sphere S^{d-1}, each coordinate follows a Beta((d-1)/2, (d-1)/2) distribution on [-1, 1]. For large d (Gemma 3 uses d=256), this concentrates tightly around zero with standard deviation approximately 1/sqrt(d) ≈ 0.0625.

The codebook construction solves the continuous k-means problem for this distribution: partition [-1, 1] into 2^b intervals and find the centroid of each interval that minimizes weighted MSE under the Beta PDF. We use a dense grid (50,000 points) focused on the ±6σ range where the distribution has mass, then run standard Lloyd-Max iteration: assign grid points to nearest centroid, update centroids as weighted means, repeat.

The resulting codebook has an interesting structure — the centroids cluster densely near zero where the distribution is concentrated, with wider spacing in the tails. At 4 bits (16 levels), the centroid spacing near zero is approximately 0.008, providing very fine-grained reconstruction in the region where most values live.
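The codebook construction described above can be sketched in NumPy along the following lines. The grid size, the ±6σ range, and the iteration count follow the text; the function name, the initialization over ±3σ, and the use of `searchsorted` and `bincount` are our own choices for this sketch, not necessarily what turboquant_core.py does.

```python
import numpy as np


def lloyd_max_codebook(d=256, bits=4, grid_points=50_000, iters=300):
    """Lloyd-Max centroids for one coordinate of a random unit vector in R^d."""
    sigma = 1.0 / np.sqrt(d)
    x = np.linspace(-6 * sigma, 6 * sigma, grid_points)
    # Density of one coordinate on the unit sphere, up to a constant:
    # (1 - x^2)^((d - 3) / 2), i.e. Beta((d-1)/2, (d-1)/2) rescaled to [-1, 1].
    pdf = (1.0 - x ** 2) ** ((d - 3) / 2)
    pdf /= pdf.sum()

    levels = 2 ** bits
    # Assumed initialization: centroids spread evenly over 3 sigma either side.
    centroids = np.linspace(-3 * sigma, 3 * sigma, levels)
    for _ in range(iters):
        # Assignment step: nearest-centroid boundaries are the midpoints.
        edges = (centroids[:-1] + centroids[1:]) / 2
        cell = np.searchsorted(edges, x)
        # Update step: probability-weighted mean of each cell.
        mass = np.bincount(cell, weights=pdf, minlength=levels)
        moment = np.bincount(cell, weights=pdf * x, minlength=levels)
        centroids = np.where(mass > 0, moment / np.maximum(mass, 1e-300), centroids)
    return centroids


codebook = lloyd_max_codebook(d=256, bits=4)
```

Because the density depends only on d, the returned array can be cached per (dimension, bit-width) pair and reused for every vector.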

The Random Rotation

The paper uses a randomized Hadamard transform (H · diag(signs)) for the rotation. We initially implemented this faithfully — and it was catastrophically slow. The Fast Walsh-Hadamard Transform is a series of butterfly operations, and our Python implementation executed each butterfly as a tensor slice operation. On GPU, this meant thousands of tiny CUDA kernel launches per rotation, with Python-level loop overhead between each one.

We replaced it with a precomputed random orthogonal matrix via QR decomposition. Mathematically equivalent — any orthogonal rotation on S^{d-1} produces the same Beta distribution on coordinates. The QR matrix is d×d (256×256 = 256KB, negligible), computed once from a seeded random Gaussian matrix, and the rotation becomes a single torch.matmul. Problem solved.
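A minimal NumPy sketch of the QR-based rotation follows (the implementation described here uses PyTorch; NumPy keeps the example dependency-light). The sign correction on the columns is a standard step to make the matrix uniformly distributed over rotations; whether the original code applies it is not stated, so treat it and the function name as assumptions.

```python
import numpy as np


def random_orthogonal(d, seed=0):
    """Orthogonal matrix from the QR decomposition of a seeded Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Fix column signs so the result is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))


d = 256
R = random_orthogonal(d)

# Rotate a maximally "spiky" unit vector: all its energy is in one coordinate.
unit = np.zeros(d)
unit[0] = 1.0
rotated = R @ unit
```

After rotation the vector still has unit norm, but its energy is spread across all coordinates with a standard deviation close to 1/sqrt(d) = 0.0625, which is the property the fixed codebook relies on.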

A production implementation would use a structured rotation (Hadamard + random signs) with a fused CUDA kernel for the butterfly operations. The structured form is more memory-efficient (you only store the d random signs, not a d×d matrix) and the butterfly operations parallelize beautifully on GPU. But for a reference implementation, the dense matrix works fine.

The Triton Kernel

The kernel parallelizes over (query_head × batch, sequence_position_block). Each program instance:

Loads a slice of the pre-rotated query vector (BLOCK_D elements)
Loads the corresponding key indices for BLOCK_S sequence positions (uint8)
Gathers centroid values via table lookup (tl.load(C_ptr + k_idx))
Accumulates the partial dot product
Multiplies by key norms and the attention scale factor

The autotuner searches over 5 configurations of (BLOCK_S, BLOCK_D) and warp count. On the RTX 4090, it typically selects BLOCK_S=64, BLOCK_D=64 with 4 warps.

The key efficiency win is memory bandwidth. Loading uint8 indices requires 1 byte per element; loading fp16 keys requires 2 bytes. The centroid table (16 float32 values at 4-bit, or 4 values at 2-bit) fits comfortably in L1/L2 cache and is reused across all sequence positions. The net effect is roughly 2x less data movement from GPU global memory, which translates to the observed ~1.2x speedup on the Q@K^T operation.

GQA Handling

Gemma 3 4B uses Grouped Query Attention with 8 query heads and 4 KV heads (ratio 2:1). The kernel handles this by mapping each query head to its corresponding KV head: kv_head = q_head // gqa_ratio. The key indices and norms are loaded from the KV head, while queries come from the query head. This means each KV head’s compressed data is read twice (once per query head in its group), but since it’s small (uint8), the redundant reads are cheap.

Cache Architecture

The fused integration stores keys in compressed form (uint8 indices + fp16 norms per vector) and values in standard fp16. We only compress keys because the attention score computation (Q@K^T) is where the memory bandwidth bottleneck lives during decoding. The softmax@V multiplication is less critical because it’s compute-bound rather than memory-bound at typical sequence lengths.

A fully optimized implementation would also compress values, but the gains are smaller and the integration is more complex (you’d need a second Triton kernel for the softmax@V step with compressed values).

What Didn’t Work

Mistake 1: Adding QJL Back to the Reconstructed Vector

The paper describes two variants: TurboQuant_mse (pure Lloyd-Max, best for reconstruction) and TurboQuant_prod (Lloyd-Max + 1-bit QJL, best for inner products). Our first implementation used TurboQuant_prod for the KV cache: (bits-1) bits of Lloyd-Max plus 1 bit of QJL on the residual.

The QJL stage produces a correction term that makes the inner product estimator unbiased. But when you add this correction back to the reconstructed vector and store it in the KV cache, you’re injecting noise into the vector itself. The result: cosine similarity dropped to 0.69 (terrible) and the model produced garbage.

The fix was simple: use TurboQuant_mse (all bits to Lloyd-Max) for the drop-in cache, and reserve TurboQuant_prod for a custom attention kernel that can use the two-part representation directly. The fused Triton kernel implements the MSE variant.

Mistake 2: Gemma 3 4B Is Not a Causal LM

We initially loaded the model with AutoModelForCausalLM and AutoTokenizer. This loaded the model fine, tokenized fine, and even generated, but every output token was <pad> (token ID 0). The baseline and quantized paths both produced identical pad sequences.

Gemma 3 4B+ is a multimodal model. It requires Gemma3ForConditionalGeneration and AutoProcessor, not the causal LM variants. The AutoProcessor handles the chat template correctly and returns the right token format. This wasn’t a quantization bug at all — the model simply wasn’t being invoked correctly.

Mistake 3: Python-Loop Hadamard Transform

The Fast Walsh-Hadamard Transform is O(d log d) butterfly operations. Our initial implementation ran each butterfly as a Python loop iteration with tensor slicing:

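import torch

def fwht_python_loop(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized Walsh-Hadamard transform over the last dim (slow on GPU)."""
    result = x.clone()
    d = result.shape[-1]  # must be a power of two
    h = 1
    while h < d:
        stride = 2 * h
        for start in range(0, d, stride):
            lo = slice(start, start + h)
            hi = slice(start + h, start + stride)
            a = result[..., lo].clone()
            b = result[..., hi].clone()
            result[..., lo] = a + b  # butterfly: sum half
            result[..., hi] = a - b  # butterfly: difference half
        h *= 2
    return result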

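For d=256, the outer loop runs 8 times (h = 1, 2, ..., 128) and the inner loop runs 128, 64, ..., 1 times, for 255 butterfly steps in total. Each step issues several tiny CUDA operations (two clones, an add, a subtract, and two slice writes), so a single rotation call launches well over a thousand kernels, with Python interpreter overhead between each one. On a KV cache update touching 26 layers × 4 KV heads × 256-dim vectors, the GPU was spending more time waiting for Python than doing math. Generation hung completely: even a 20-token completion with a trivial prompt didn’t return.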

Replacing this with a single x @ Q_T matmul using a precomputed orthogonal matrix made it instant.

Mistake 4: Subclassing DynamicCache

Our first KV cache integration subclassed HuggingFace’s DynamicCache. This broke immediately because Gemma 3’s model code calls past_key_values.is_initialized, past_key_values.key_cache, and other attributes whose names and semantics change across transformers versions. Our subclass was missing several of these.

We tried three approaches:

Subclassing DynamicCache (broke on .is_initialized)
Forward hooks on attention layers (fragile, couldn’t reliably find the cache object)
Patching cache.update() on a stock DynamicCache instance (worked perfectly)

The final approach is the cleanest: create a normal DynamicCache, save a reference to its update method, and replace it with a wrapper that quantizes inputs before calling the original. All the cache’s internal bookkeeping (sequence length tracking, layer indexing) works unchanged.
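The patching pattern can be sketched without loading a model at all. Below, `StandInCache` is a hypothetical stand-in whose `update` mirrors the (key_states, value_states, layer_idx, cache_kwargs) signature as we understand it; check it against your transformers version. The rounding "quantizer" is a toy placeholder for the TurboQuant roundtrip.

```python
import types


class StandInCache:
    """Hypothetical stand-in for a DynamicCache-like object (not the real class)."""

    def __init__(self):
        self.store = {}

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        self.store.setdefault(layer_idx, []).append((key_states, value_states))
        return key_states, value_states


def patch_cache(cache, quantize_roundtrip):
    """Wrap cache.update so inputs are quantized before the original runs."""
    original_update = cache.update  # bound method, saved once

    def patched_update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        key_states = quantize_roundtrip(key_states)
        value_states = quantize_roundtrip(value_states)
        return original_update(key_states, value_states, layer_idx, cache_kwargs)

    # Replace the method on this instance only; the class is left untouched.
    cache.update = types.MethodType(patched_update, cache)
    return cache


# Toy "quantizer": round to one decimal place, standing in for TurboQuant.
cache = patch_cache(StandInCache(), lambda xs: [round(x, 1) for x in xs])
k, v = cache.update([0.123, 0.987], [0.44], layer_idx=0)
```

Because only the instance attribute is replaced, every other attribute the model code touches is still the stock object’s own.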

Mistake 5: Token Counting After Fused Generation

The FusedTurboQuantRunner returns decoded text directly (not output IDs), so we tried processor.encode(text) to count tokens for the timing report. But Gemma3Processor is a multimodal processor — it has decode but not encode. The tokenizer lives at processor.tokenizer.encode(). A one-line fix, but it crashed the first successful fused generation and hid the results until the next run.

Comparison with Other Implementations

Prince Canuma independently implemented TurboQuant in MLX and tested on Qwen 3.5 35B with context lengths up to 64K tokens. Their results: 6/6 exact match on needle-in-haystack at every quantization level, 4.9x smaller KV cache at 2.5-bit, 3.8x at 3.5-bit.

Two implementations, different frameworks (PyTorch+Triton vs MLX), different models (Gemma 3 4B vs Qwen 3.5 35B), different hardware (NVIDIA RTX 4090 vs Apple Silicon) — same conclusion. TurboQuant’s theoretical guarantees translate directly to practice across the board.

What’s Next

This implementation leaves several optimizations on the table:

Value cache compression. We only compress keys. Compressing values would require a second Triton kernel for the softmax@V multiplication, but would further reduce memory usage.

Structured rotation. The precomputed d×d orthogonal matrix works but uses O(d²) memory. A fused Hadamard kernel would use O(d) memory (just the random signs) and be faster for large d.

Sub-byte packing. We store 2-bit indices as uint8. Packing 4 indices per byte would reduce memory by another 4x for the index storage.

Flash Attention integration. The ultimate goal: fuse the centroid gather into a Flash Attention-style kernel that never materializes the full attention matrix. This would combine TurboQuant’s memory savings with Flash Attention’s IO efficiency.
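The sub-byte packing item above is the most mechanical of these. A minimal NumPy sketch of packing four 2-bit indices into each byte, and unpacking them again, might look like this; the function names and bit layout (lowest bits first) are assumptions, and a real kernel would unpack inside the attention loop instead.

```python
import numpy as np


def pack_2bit(idx):
    """Pack uint8 values in [0, 3] four-per-byte along the last axis."""
    idx = np.asarray(idx, dtype=np.uint8)
    assert idx.shape[-1] % 4 == 0 and idx.max() < 4
    g = idx.reshape(*idx.shape[:-1], -1, 4)
    packed = g[..., 0] | (g[..., 1] << 2) | (g[..., 2] << 4) | (g[..., 3] << 6)
    return packed.astype(np.uint8)


def unpack_2bit(packed):
    """Inverse of pack_2bit: recover one uint8 index per 2-bit field."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    out = (packed[..., None] >> shifts) & 3
    return out.reshape(*packed.shape[:-1], -1).astype(np.uint8)


idx = np.array([[0, 1, 2, 3, 3, 2, 1, 0]], dtype=np.uint8)
packed = pack_2bit(idx)
restored = unpack_2bit(packed)
```

The packed array uses a quarter of the bytes and the roundtrip is lossless, which is where the additional 4x on index storage comes from.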

The paper’s claim of 8x speedup on H100s comes from optimized int4 tensor core kernels — that level of hardware-specific optimization is beyond a one-session implementation, but the algorithmic foundation is solid and the path from here to production is clear.

Paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (ICLR 2026)

Complete implementation including Triton kernel:

DOWNLOAD CODE

python run_demo.py --fused --max-new-tokens 200 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 [00:00

In response to a Twitter question:

PS C:\projects\tq> python run_demo.py --fused --long-context --haystack-tokens 4096 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 [00:00

python run_demo.py --fused --long-context --haystack-tokens 16384 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 [00:00

Clickbait Titles Exploit Attention Through Latent Entities

Dan Petrovic — Sun, 22 Mar 2026 12:20:49 +0000

Every clickbait title works the same way: it removes at least one critical variable (the subject, the reason, the process, or the outcome) and charges you a click to fill the blank. This missing variable, which we call a latent entity, is so pervasive it has become normalized and nobody questions it anymore. You should!

That was the direct answer to the title’s attention hook, the latent variable behind “how”.

Every day, hundreds of millions of people scan headlines in feeds, aggregators, and search results. Most of these titles are not designed to inform. They are designed to withhold. Somewhere in the sentence, a critical piece of information has been surgically removed — the tool isn’t named, the result isn’t revealed, the reason isn’t given. The reader is left with an incomplete thought and a link. The click is the cost of completing it.

This mechanism is so pervasive that it has become invisible, like background noise. But it has a structure. And once you see the structure, you can’t unsee it.

The attention transaction

A title is a transaction. The author offers a premise. The reader pays with a click. The currency is attention, and the receipt is the missing piece of information the title promised but refused to deliver upfront.

This is not metaphorical. The economics are literal. Every click generates a pageview. Every pageview generates ad impressions. Every ad impression generates revenue. The entire model is optimized not for informing the reader but for maximizing the probability that they click. The most reliable way to do that is to make the title incomplete — to create an information gap that can only be closed on the other side of the link.

The reader isn’t choosing to engage with content. They’re being charged an attention tax to access information that the title already had room to provide.

Naming the structure: latent entities

We can formalize what clickbait hides. In every withholding title, there is a latent entity — a variable the reader cannot resolve without clicking through. The title is the observed data. The latent entity is the unobserved variable. The click is the inference cost.

There are four types, and they are exhaustive.

Latent Subject — What?

The title revolves around a specific thing — a tool, a setting, a feature, a list of items — but deliberately masks its identity behind a vague pronoun or a deferred noun.

“This one browser extension changed how I use the internet forever.”

What extension? You don’t know. That’s the transaction. The word “this” is doing the work of pointing at something while revealing nothing. The subject is latent.

“5 tools every developer needs in their workflow.”

Which five? The number creates the shape of an answer without filling it in. Five slots, all empty.

Latent Reason — Why?

The title states a strong opinion, a regret, or an observation, but withholds the logic behind it. The reader is given a conclusion without its supporting argument.

“I finally understand why Linux users swear by simple tools.”

The author has arrived at understanding. The reader has not. The reason is the hidden variable, and the only way to access it is to click.

“Package managers are the main reason I’ll never switch back to Windows.”

A bold claim with the mechanism removed. Why? What about package managers? The reason is latent.

Latent Process — How?

The title presents an intriguing input and a desirable or unexpected output, but hides the method that connects them. The reader sees a before and an after with a gap in between.

“I turned my old phone into a universal remote for my entire smart home.”

How? What app, what protocol, what steps? The transformation is stated as fact but the process is absent. The reader must click to learn the method.

“How a power drill defeated the Xbox 360’s console security.”

The pairing of a crude physical tool with a sophisticated digital system is inherently surprising. The process that links them is the entire story, and it’s completely hidden.

Latent Outcome — What happened?

The title sets up a scenario or experiment but cuts off before the resolution. The reader is dropped into a narrative with no ending.

“I replaced all my productivity tools with a single app for a month.”

And? What happened? Did it work? Was it a disaster? The outcome is the only thing the reader wants, and it’s the only thing the title refuses to provide.

“I ran local LLMs on a dying GPU and the results surprised me.”

The word “surprised” is doing double duty — it confirms that an outcome exists and that it’s noteworthy, while revealing absolutely nothing about what it is. It is a content-free adjective masquerading as information.

Every clickbait title withholds at least one latent entity. Some withhold two — a title that hides both the process and the outcome forces the reader to pay twice for a single click. But the taxonomy is closed. Anything a title can hide maps to one of these four types: the subject (what?), the reason (why?), the process (how?), or the outcome (what happened?).

This isn’t a style guide or an editorial preference. It’s a structural property of how information is withheld to generate clicks.
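
The taxonomy is small enough to write down as a data structure. Here is a minimal sketch in Python, assuming a handful of keyword cues that I picked only because they match the example titles above. The cue lists and the `withheld_entities` function are hypothetical illustrations, not a tested classifier. A real system would need something that understands meaning rather than surface patterns.

```python
import re
from enum import Enum


class LatentEntity(Enum):
    """The four things a withholding title can hide."""
    SUBJECT = "what?"
    REASON = "why?"
    PROCESS = "how?"
    OUTCOME = "what happened?"


# Hypothetical surface cues, chosen only to fit the example titles in this article.
CUES = {
    LatentEntity.SUBJECT: [r"\bthis one\b", r"^\d+ (tools|things|ways)\b"],
    LatentEntity.REASON: [r"\bwhy\b", r"\breason\b"],
    LatentEntity.PROCESS: [r"^how\b", r"\bturned\b.*\binto\b"],
    LatentEntity.OUTCOME: [r"\bsurprised\b", r"\bfor a (week|month|year)\b"],
}


def withheld_entities(title: str) -> set[LatentEntity]:
    """Return every latent entity whose cue patterns appear in the title."""
    lowered = title.lower()
    return {
        entity
        for entity, patterns in CUES.items()
        if any(re.search(pattern, lowered) for pattern in patterns)
    }


print(withheld_entities("I ran local LLMs on a dying GPU and the results surprised me."))
print(withheld_entities("Nginx Proxy Manager eliminated my reverse proxy headache"))
```

The function returns a set rather than a single value because a title can withhold more than one entity at once. An empty set means none of the cues fired, which is what the sketch reports for a title that names its subject outright.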

What happens after the click

The damage doesn’t end with the transaction. Something happens cognitively when a reader lands on a page after a withholding title, and it isn’t engagement. It’s scanning.

The reader arrives primed. They have a specific latent entity in mind — the hidden variable that brought them there — and their first instinct is to find it as fast as possible. They don’t read the introduction. They don’t absorb the context. They skip, skim, and scroll, hunting for the one piece of information the title owed them.

This produces a jarring experience. The article, padded with backstory, affiliate links, newsletter prompts, and SEO-optimized filler, is structured to delay the answer. The reader, already carrying the cognitive load of an unresolved question, is forced to work through friction that exists solely to generate more pageviews and ad impressions. The content’s structure and the reader’s intent are fundamentally misaligned.

The result is not engagement. It is extraction. The reader extracts the latent entity and leaves. The publisher extracts a pageview and an ad impression. Neither party has been well served. The reader resents the experience. The publisher has earned a visit but not trust.

The ad-click economy made this rational

None of this happened by accident. Withholding titles are the evolutionary product of an economy that rewards clicks over comprehension. When revenue is proportional to pageviews, every title becomes an optimization problem: maximize the probability of a click while minimizing the information given away for free.

Over two decades, this optimization produced the patterns we now see everywhere. Vague pronouns replaced specific nouns. Outcomes were teased but never stated. Reasons were promised but deferred. The entire craft of headline writing was reoriented from summarizing content to withholding it.

This was rational in a world where the title and the article were inseparable — where the only way to access the content was to visit the page. But that world is ending.

AI changes the equation

Large language models are rapidly becoming the intermediary layer between humans and content. When a user asks an AI assistant a question, the AI retrieves, reads, and synthesizes sources on the user’s behalf. The human never visits the page. The click never happens. The latent entity is resolved by the model, not by the reader.

In this new architecture, withholding titles are not just exploitative. They are pointless and perhaps even harmful to visibility. The AI doesn’t care about the information gap. It reads the article, extracts the answer, and delivers it without friction. The entire mechanism of clickbait — creating an artificial need that can only be resolved with a visit — collapses when the visitor is a machine that doesn’t see ads.

More importantly, AI systems can now decompose titles structurally, identify which latent entity is being withheld, check whether the article delivers on the title’s promise, and surface the answer directly. The asymmetry of information that clickbait depends on is being dissolved.
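
The "check whether the article delivers" step can be sketched in a few lines. This is a toy version under loud assumptions: the `PromiseCheck` record, the `check_promises` function, and the idea of supplying expected terms per promise are all hypothetical, and plain substring matching stands in for the reading a language model would actually do.

```python
from dataclasses import dataclass


@dataclass
class PromiseCheck:
    promise: str      # what the title implied the article would reveal
    delivered: bool   # whether any expected term was found in the article
    evidence: str     # first sentence that contained a match, or "" if none


def check_promises(promises: dict[str, list[str]], article: str) -> list[PromiseCheck]:
    """For each promise, find the first sentence mentioning any expected term."""
    # Naive sentence split on full stops; good enough for a sketch.
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    results = []
    for promise, terms in promises.items():
        hit = next(
            (s for s in sentences if any(t.lower() in s.lower() for t in terms)),
            "",
        )
        results.append(PromiseCheck(promise, bool(hit), hit))
    return results


article = "I tried several options. Nginx Proxy Manager fixed the routing. Setup took ten minutes."
for check in check_promises(
    {"which tool": ["Nginx Proxy Manager"], "what it cost": ["price", "license"]},
    article,
):
    print(check)
```

The point of the shape, not the matching logic, is that each promise ends up paired with a verdict and a piece of evidence. Once that record exists, the answer can be surfaced without the reader ever scanning the page for it.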

A healthier paradigm

If withholding titles evolved to serve the ad-click economy, then the question is: what should titles look like when that economy is no longer the only game in town?

The answer is straightforward. Titles should include the key information — the subject named, the reason stated, the outcome revealed — and invite the reader to explore further for depth, context, and nuance. The title earns the click by demonstrating value, not by ransoming it.

Consider the difference:

“This one Docker tool finally fixed my reverse proxy headache”

The subject is latent.
The reader must click to learn which tool.

“Nginx Proxy Manager eliminated my reverse proxy headache — here’s my setup”

The subject is revealed.
The reader clicks to learn the details, not to discover what the tool is.

Both titles can generate traffic. But the second one respects the reader. It says: here is what I’m talking about, and if you want to know more, the article is worth your time. The first one says: I have something you want, and I won’t tell you what it is unless you pay me with your attention.

The second model is healthier for everyone. Readers arrive with aligned expectations instead of frustrated scanning instincts. Authors build trust instead of mining clicks. And the content itself can be structured around depth rather than around delaying the reveal.

The web we could have

Web authors have a choice. They can continue optimizing for an economy that is being disintermediated by AI, writing titles that withhold and articles that delay, hoping the click-and-ad model survives long enough to sustain them. Or they can recognize that the readers who remain — the ones who choose to visit a page when they could have asked an AI — are the ones who deserve the most respect.

Those readers are not clicking because they were tricked. They’re clicking because they were informed. They know what the article is about. They want to go deeper. They trust the author enough to spend their time. And the revenue side can be fixed too.

That is the audience worth building for. And it starts with killing the hidden variable.

{
  "title": "Clickbait Titles Exploit Attention Through Latent Entities",
  "metadata": {
    "dimensions": [
      "Clickbait titles exploit attention",
      "Through latent entities"
    ],
    "attention_anchor": "how",
    "quantized": "clickbait exploits attention by hiding at least one of four variable types"
  },
  "how": [
    "Every clickbait title withholds at least one latent entity: subject (what?), reason (why?), process (how?), or outcome (what happened?)",
    "The click is the inference cost the reader pays to resolve the hidden variable",
    "AI dissolves this by reading the article and extracting the answer without the click"
  ],
  "promise_check": {
    "exploit attention": "delivered — transactional mechanism explained with economic chain",
    "through latent entities": "delivered — four-type taxonomy defined with examples",
    "title practices what it preaches": "delivered — subject revealed, mechanism stated, no hidden variable"
  }
}