<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DEJAN</title>
	<atom:link href="https://dejan.ai/feed/" rel="self" type="application/rss+xml" />
	<link>https://dejan.ai</link>
	<description>AI SEO Agency</description>
	<lastBuildDate>Sat, 04 Apr 2026 11:03:25 +0000</lastBuildDate>
	<language>en-AU</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://dejan.ai/wp-content/uploads/2024/02/dejan-150x150.png</url>
	<title>DEJAN</title>
	<link>https://dejan.ai</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Gemma 4 Brand Authority Map</title>
		<link>https://dejan.ai/blog/gemma-4-brand-authority-map/</link>
					<comments>https://dejan.ai/blog/gemma-4-brand-authority-map/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sat, 04 Apr 2026 11:03:25 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2399</guid>

					<description><![CDATA[We asked Google&#8217;s open-weight model Gemma 4 (31B) to &#8220;name 100 brands at random&#8221; 14,044 times and compared the results to our earlier Gemini 3 Flash experiment (200,000 runs). Of the top 50 brands in each model, 39 overlap. The 11 that are unique to each reveal a pattern: Gemini remembers luxury and automotive (Porsche, [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>We asked Google&#8217;s open-weight model Gemma 4 (31B) to &#8220;name 100 brands at random&#8221; 14,044 times and compared the results to our earlier <a href="https://dejan.ai/blog/brands/">Gemini 3 Flash experiment</a> (200,000 runs). </p>



<p>Of the top 50 brands in each model, 39 overlap. The 11 that are unique to each reveal a pattern: Gemini remembers luxury and automotive (Porsche, Ferrari, Cartier), while Gemma remembers everyday retail and sportswear (H&amp;M, Gap, Levi&#8217;s, Under Armour).</p>



<p>Apple is the undisputed #1 in both models. After that, the two models diverge significantly: Gemma 4 favors traditional consumer brands (Coca-Cola, Toyota, McDonald&#8217;s) while Gemini favors tech and digital brands (Google, Nike, Netflix). </p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Background</h2>



<p>In our earlier study, we probed Gemini 3 Flash with 200,000 independent &#8220;name 100 brands at random&#8221; queries. The non-uniform output revealed a stable hierarchy of brand recall &#8212; what we called the model&#8217;s &#8220;cognitive prioritization.&#8221; That work used Personalized PageRank on a two-level association graph to rank 2.9 million brands by associative embeddedness.</p>



<p>This follow-up applies Phase 1 of the same methodology &#8212; the seed establishment survey &#8212; to Gemma 4 (31B), Google&#8217;s open-weight model. The goal is to answer a simple question: does an open model remember the same brands as a closed one?</p>



<h2 class="wp-block-heading">Methodology</h2>



<p>The setup mirrors the Gemini study with minor adjustments:</p>



<ul class="wp-block-list">
<li><strong>Model:</strong> Gemma 4 31B Instruct (<code>gemma-4-31b-it</code>) via the Google GenAI API</li>



<li><strong>Prompt:</strong> <code>name 100 brands at random, one per line, say nothing else</code></li>



<li><strong>Runs:</strong> 14,044 successful completions (out of 100,000 attempted; rate-limited at 30 RPM)</li>



<li><strong>Canonicalization:</strong> Local string normalization (lowercase, strip accents, spaces, hyphens, punctuation) rather than LLM-based canonicalization. For example: <code>La Roche-Posay</code> becomes <code>larocheposay</code>, <code>Coca-Cola</code> becomes <code>cocacola</code></li>



<li><strong>Scoring:</strong> Popularity = frequency x (1 / average position). A brand mentioned in every run at position 1 scores maximally. A brand mentioned frequently but late in lists scores lower. Both this and the canonicalization step are sketched in code after this list.</li>
</ul>
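


<p>To make the canonicalization and scoring steps concrete, here&#8217;s a minimal Python sketch of both. The function names are illustrative rather than our production code, but the logic follows the description above:</p>



<pre class="wp-block-code"><code>import re
import unicodedata
from collections import defaultdict

def canonicalize(name):
    """Lowercase, strip accents, then drop spaces, hyphens and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only)  # alphanumerics only

def popularity_scores(runs):
    """runs: one list of brand strings per completion, in output order."""
    freq = defaultdict(int)       # total mentions per canonical form
    pos_sum = defaultdict(float)  # sum of 1-based list positions
    for run in runs:
        for position, raw_name in enumerate(run, start=1):
            brand = canonicalize(raw_name)
            freq[brand] += 1
            pos_sum[brand] += position
    scores = {}
    for b in freq:
        avg_position = pos_sum[b] / freq[b]
        scores[b] = freq[b] * (1.0 / avg_position)  # frequency x inverse position
    return scores

assert canonicalize("La Roche-Posay") == "larocheposay"
assert canonicalize("Coca-Cola") == "cocacola"
</code></pre>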



<p>The prompt was simplified from the Gemini version (which included <code>all lowercase, no spaces, no hyphens</code>) because we wanted to preserve the model&#8217;s natural casing as the display name and derive the canonical form programmatically.</p>



<h3 class="wp-block-heading">Caveat on sample size</h3>



<p>Gemma 4&#8217;s rate limits (30 RPM, 14,400 RPD) constrained us to 14,044 runs versus Gemini&#8217;s 200,000. The top-of-list rankings are stable at this sample size &#8212; the top 20 brands appeared in virtually every run. Long-tail discovery is ongoing: the discovery curve has not plateaued, meaning there are brands the model knows but hasn&#8217;t yet surfaced.</p>



<h2 class="wp-block-heading">Results</h2>



<h3 class="wp-block-heading">Overview</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Gemini 3 Flash</th><th>Gemma 4 31B</th></tr></thead><tbody><tr><td>Total runs</td><td>200,000</td><td>14,044</td></tr><tr><td>Unique brands discovered</td><td>8,608</td><td>2,602</td></tr><tr><td>Total brand mentions</td><td>19,995,027</td><td>1,403,534</td></tr><tr><td>Avg brands per run</td><td>~100</td><td>~100</td></tr><tr><td>Singleton brands (appeared once)</td><td>&#8212;</td><td>912 (35%)</td></tr></tbody></table></figure>



<h3 class="wp-block-heading">Top 30 Head-to-Head</h3>



<p>The table below shows each model&#8217;s top 30 brands ranked by popularity score. Both models agree on Apple at #1 with a commanding lead. After that, the ordering diverges.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="851" src="https://dejan.ai/wp-content/uploads/2026/04/image-6-1024x851.png" alt="" class="wp-image-2405" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-6-1024x851.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-6-300x249.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-6-768x638.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-6-1536x1276.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-6.png 1780w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Top 20 Side-by-Side</h3>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="606" src="https://dejan.ai/wp-content/uploads/2026/04/image-7-1024x606.png" alt="" class="wp-image-2406" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-7-1024x606.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-7-300x177.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-7-768x454.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-7-1536x908.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-7-2048x1211.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Apple dominates both models. In Gemini, the drop-off from #1 to #2 is 3:1 (Apple to Samsung). In Gemma 4, it&#8217;s 1.3:1 (Apple to Coca-Cola) &#8212; a less extreme concentration.</p>



<h3 class="wp-block-heading">The Google Self-Ranking Gap</h3>



<p>One of the most notable findings: Google ranks itself #4 in Gemini 3 Flash but only #17 in Gemma 4. This is consistent with the architectural difference &#8212; Gemini is a proprietary model trained and served by Google, while Gemma is an open-weight model. Whether this reflects training data differences, alignment tuning, or genuine differences in brand salience across model architectures is an open question.</p>



<h3 class="wp-block-heading">Rank Shifts</h3>



<p>The following chart shows how brands moved between the two models&#8217; rankings. Green bars indicate brands that ranked higher in Gemma 4; red bars indicate brands that ranked higher in Gemini.</p>



<figure class="wp-block-image size-large"><img decoding="async" width="849" height="1024" src="https://dejan.ai/wp-content/uploads/2026/04/image-8-849x1024.png" alt="" class="wp-image-2407" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-8-849x1024.png 849w, https://dejan.ai/wp-content/uploads/2026/04/image-8-249x300.png 249w, https://dejan.ai/wp-content/uploads/2026/04/image-8-768x927.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-8-1273x1536.png 1273w, https://dejan.ai/wp-content/uploads/2026/04/image-8.png 1475w" sizes="(max-width: 849px) 100vw, 849px" /></figure>



<p><strong>Biggest risers in Gemma 4:</strong></p>



<ul class="wp-block-list">
<li>Nestle: #36 to #16 (+20)</li>



<li>L&#8217;Oreal: #48 to #32 (+16)</li>



<li>Visa: #31 to #15 (+16)</li>



<li>Chanel: #34 to #22 (+12)</li>



<li>Lego: #25 to #13 (+12)</li>
</ul>



<p><strong>Biggest fallers in Gemma 4:</strong></p>



<ul class="wp-block-list">
<li>Mercedes-Benz: #10 to #34 (-24)</li>



<li>Netflix: #18 to #38 (-20)</li>



<li>Nintendo: #27 to #47 (-20)</li>



<li>Audi: #23 to #42 (-19)</li>



<li>Google: #4 to #17 (-13)</li>
</ul>



<h3 class="wp-block-heading">The Frequency vs. Position Paradox</h3>



<p>An interesting pattern emerged in Gemma 4 that was less pronounced in Gemini: some brands have extremely high frequency (more total mentions than there are runs, meaning they appear more than once per run on average) but rank low by popularity because they appear late in lists.</p>



<p><strong>Visa</strong> appeared 28,731 times across 14,044 runs &#8212; an average of 2.05 times per run. But its average position was 35.8, placing it 15th by popularity despite having the highest raw frequency. <strong>Nike</strong> similarly appeared 26,254 times (1.87 per run) with an average position of 22.8.</p>



<p>This suggests these brands have high <em>availability</em> in the model&#8217;s memory but low <em>priority</em> &#8212; they&#8217;re easy to recall but not the first thing the model thinks of. In Gemini, this effect was less extreme because the prompt forced lowercase single-word output, reducing duplicate mentions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="713" src="https://dejan.ai/wp-content/uploads/2026/04/image-9-1024x713.png" alt="" class="wp-image-2408" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-9-1024x713.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-9-300x209.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-9-768x535.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-9.png 1481w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Brand Discovery Curve</h3>



<p>The discovery curve shows how many unique brands have been surfaced as a function of runs completed. Gemma 4&#8217;s curve at 14,000 runs tracks slightly above Gemini&#8217;s curve at the same point, suggesting comparable or slightly higher brand vocabulary diversity at equivalent sample sizes.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="505" src="https://dejan.ai/wp-content/uploads/2026/04/image-10-1024x505.png" alt="" class="wp-image-2409" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-10-1024x505.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-10-300x148.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-10-768x379.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-10.png 1481w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Both curves show the characteristic long-tail shape: rapid initial discovery followed by diminishing returns. Gemini&#8217;s curve continues to climb through 100,000 runs, suggesting Gemma 4 would similarly continue discovering new brands with more sampling.</p>



<h3 class="wp-block-heading">Unique to Each Model</h3>



<p>Of the top 50 brands in each model, 39 appear in both. The 11 unique to each side reveal a pattern:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="426" src="https://dejan.ai/wp-content/uploads/2026/04/image-11-1024x426.png" alt="" class="wp-image-2410" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-11-1024x426.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-11-300x125.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-11-768x319.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-11-1536x639.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-11.png 1780w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Only in Gemini&#8217;s top 50:</strong> Porsche, Hyundai, Red Bull, eBay, Volkswagen, Cartier, Ferrari, Adobe, Facebook, NIVEA, Gillette</p>



<p><strong>Only in Gemma 4&#8217;s top 50:</strong> H&amp;M, Puma, Dell, HP, Under Armour, Levi&#8217;s, Gap, Uber, Airbnb, Nikon, Calvin Klein</p>



<p>Gemini&#8217;s unique set skews luxury (Porsche, Ferrari, Cartier), automotive (Volkswagen, Hyundai), and legacy tech/digital (eBay, Adobe, Facebook). Gemma 4&#8217;s unique set skews everyday retail (H&amp;M, Gap, Levi&#8217;s), consumer electronics (Dell, HP, Nikon), and modern services (Uber, Airbnb).</p>



<h2 class="wp-block-heading">Interpretation</h2>



<h3 class="wp-block-heading">What aligns</h3>



<p>Both models share the same core set of mega-brands. Apple, Samsung, Toyota, Amazon, Microsoft, Adidas, Disney, Sony, Pepsi, BMW, and 29 others appear in both top-50 lists. The brand hierarchy is not random &#8212; it reflects genuine differences in brand salience as encoded in training data.</p>



<h3 class="wp-block-heading">What diverges</h3>



<p>The divergences cluster around three themes:</p>



<ol class="wp-block-list">
<li><strong>Self-reference bias.</strong> Google ranks dramatically higher in its own proprietary model. This is the single largest rank shift in the dataset.</li>



<li><strong>Digital vs. physical.</strong> Gemini over-indexes on digital-native brands (Netflix, eBay, Adobe, Facebook). Gemma over-indexes on physical retail and consumer goods (H&amp;M, Gap, Levi&#8217;s, Dell, HP).</li>



<li><strong>Luxury vs. everyday.</strong> Gemini remembers luxury brands more readily (Mercedes-Benz #10, Porsche, Ferrari, Cartier in top 50). Gemma favors mass-market brands (McDonald&#8217;s #6, Visa #15, Under Armour, Puma in top 50).</li>
</ol>



<h3 class="wp-block-heading">Possible explanations</h3>



<ul class="wp-block-list">
<li><strong>Training data composition.</strong> Gemma 4 may have a different distribution of training data, with more weight on consumer-facing web content versus Gemini&#8217;s potentially broader or more curated corpus.</li>



<li><strong>Model size.</strong> Gemma 4 31B is smaller than Gemini 3 Flash. Smaller models may default to more &#8220;obvious&#8221; or broadly recognized brands rather than luxury or niche ones.</li>



<li><strong>Alignment and tuning.</strong> Different RLHF/instruction tuning pipelines may influence which brands the model considers &#8220;representative&#8221; when asked for random examples.</li>
</ul>



<h2 class="wp-block-heading">What&#8217;s Next</h2>



<p>This study covers Phase 1 only &#8212; the seed survey. The full authority map (Phases 2-3: association graph construction and PageRank computation) has not yet been run on Gemma 4 data. As rate limits allow, we plan to:</p>



<ol class="wp-block-list">
<li>Complete the 100,000-run target for statistical parity with the Gemini study</li>



<li>Run the two-level association mapping on Gemma 4&#8217;s seed brands</li>



<li>Compute Personalized PageRank to produce a full Gemma 4 Brand Authority Index</li>



<li>Publish a direct comparison of the complete authority scores across both models</li>
</ol>



<p>The raw data and code for this analysis are available on request.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="589" height="585" src="https://dejan.ai/wp-content/uploads/2026/04/image-1.png" alt="" class="wp-image-2400" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-1.png 589w, https://dejan.ai/wp-content/uploads/2026/04/image-1-300x298.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-1-150x150.png 150w" sizes="auto, (max-width: 589px) 100vw, 589px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="763" height="375" src="https://dejan.ai/wp-content/uploads/2026/04/image-5.png" alt="" class="wp-image-2404" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-5.png 763w, https://dejan.ai/wp-content/uploads/2026/04/image-5-300x147.png 300w" sizes="auto, (max-width: 763px) 100vw, 763px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="763" height="379" src="https://dejan.ai/wp-content/uploads/2026/04/image-4.png" alt="" class="wp-image-2403" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-4.png 763w, https://dejan.ai/wp-content/uploads/2026/04/image-4-300x149.png 300w" sizes="auto, (max-width: 763px) 100vw, 763px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="765" height="379" src="https://dejan.ai/wp-content/uploads/2026/04/image-3.png" alt="" class="wp-image-2402" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-3.png 765w, https://dejan.ai/wp-content/uploads/2026/04/image-3-300x149.png 300w" sizes="auto, (max-width: 765px) 100vw, 765px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="809" height="349" src="https://dejan.ai/wp-content/uploads/2026/04/image-2.png" alt="" class="wp-image-2401" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-2.png 809w, https://dejan.ai/wp-content/uploads/2026/04/image-2-300x129.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-2-768x331.png 768w" sizes="auto, (max-width: 809px) 100vw, 809px" /></figure>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/gemma-4-brand-authority-map/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Chrome&#8217;s New Shopping Classifier</title>
		<link>https://dejan.ai/blog/google-shopping-classifier/</link>
					<comments>https://dejan.ai/blog/google-shopping-classifier/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Fri, 03 Apr 2026 07:34:43 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[eCommerce]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2390</guid>

					<description><![CDATA[One of our AI SEO hall-of-famers, Olivier de Segonzac from RESONEO has managed to gain access to Google&#8217;s shopping classifier model. We&#8217;ve examined the model, reverse engineered its inference pipeline and this article is what we found. Model Demo Below is a real-world implementation of the model tested by loading a shopping-related page and following [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>One of our AI SEO hall-of-famers, <a href="https://dejanmarketing.com/best-ai-seo-agencies/#:~:text=Olivier">Olivier de Segonzac</a> from <a href="https://www.resoneo.com/">RESONEO</a> has managed to gain access to Google&#8217;s shopping classifier model. We&#8217;ve examined the model, reverse engineered its inference pipeline, and this article describes what we found.</p>



<blockquote class="wp-block-quote has-body-font-family has-medium-font-size is-layout-flow wp-block-quote-is-layout-flow" style="font-style:normal;font-weight:400">
<p><strong>TL;DR</strong></p>



<ul class="wp-block-list">
<li>Newly shipped in Chrome.</li>



<li>Determines whether a web page is a shopping page or not.</li>



<li>Every page you visit gets scored. </li>



<li>Score is stored in Chrome&#8217;s history database.</li>



<li>Used to personalize user experience and recommendations.</li>



<li>The model splits your page into 10 chunks of ~100 words each and truncates every chunk to 64 tokens.</li>



<li>Roughly half the words never reach the model.</li>
</ul>
</blockquote>



<h2 class="wp-block-heading">Model Demo</h2>



<p>Below is a real-world implementation of the model tested by loading a <a href="https://www.owayo.com/custom-cycling-jerseys.htm">shopping-related page</a> and following Chrome&#8217;s native 10-passage, 64-tokens-per-passage logic.</p>



<figure class="wp-block-video"><video height="824" style="aspect-ratio: 936 / 824;" width="936" autoplay loop muted src="https://dejan.ai/wp-content/uploads/2026/04/20260403-0729-38.7304427.mp4" playsinline></video></figure>



<h2 class="wp-block-heading">The Pipeline</h2>



<p>The classifier doesn&#8217;t look at raw HTML. It doesn&#8217;t look at the DOM directly either. Chrome uses a structured content extraction system called <code>AnnotatedPageContent</code>, accessible via the Chrome DevTools Protocol method <code>Page.getAnnotatedPageContent</code>. This system walks the rendered page and produces a tree of typed content nodes: text, tables, image captions.</p>



<p>The full pipeline looks like this:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>Rendered Page
  → Blink AnnotatedPageContent extraction (5 seconds after load)
  → Text nodes collected from content tree
  → Greedy word-count chunking into passages
  → SentencePiece tokenization (64 tokens per passage)
  → Passage Embedder (TFLite) → 768-dim vectors
  → Mean pooling + title/URL embedding concatenation → 1536-dim input
  → Shopping Classifier (TFLite) → probability score (0 to 1)
</code></pre>



<h2 class="wp-block-heading">How Pages Are Chunked</h2>



<p>There is no semantic segmentation. Chrome uses a greedy word counter. Text items from the content tree are accumulated into a passage until the word count reaches 100, then a new passage starts. Items shorter than 5 words are always appended to the current passage rather than starting a new one.</p>



<p>The limits (the chunking loop is sketched in code after this list):</p>



<ul class="wp-block-list">
<li>100 words max per passage</li>



<li>5 words min per text item to trigger a new passage</li>



<li>10 passages max per page</li>
</ul>



<p>Everything beyond the first 10 passages is discarded.</p>
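


<p>Here&#8217;s a minimal Python sketch of the chunking loop as we read the behavior, using the three limits above. The constant and function names are ours, not Chrome&#8217;s:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>MAX_WORDS_PER_PASSAGE = 100
MIN_WORDS_TO_START_NEW = 5
MAX_PASSAGES = 10

def chunk_text_items(text_items):
    """Greedy word-count chunking over extracted text nodes."""
    passages, current, word_count = [], [], 0
    for item in text_items:
        n_words = len(item.split())
        # Items under 5 words never open a new passage; they ride along.
        if word_count &gt;= MAX_WORDS_PER_PASSAGE and n_words &gt;= MIN_WORDS_TO_START_NEW:
            passages.append(" ".join(current))
            current, word_count = [], 0
        current.append(item)
        word_count += n_words
    if current:
        passages.append(" ".join(current))
    return passages[:MAX_PASSAGES]  # everything past passage 10 is discarded
</code></pre>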



<h2 class="wp-block-heading">The Tokenizer Bottleneck</h2>



<p>Each passage is tokenized with SentencePiece and then truncated to 64 tokens. An EOS token is appended if there&#8217;s room, and shorter sequences are zero-padded.</p>



<p>64 tokens translates to roughly 35–50 English words depending on vocabulary complexity. Product names and brand-heavy text tokenize less efficiently (around 35 words), while natural prose gets closer to 50.</p>



<p>This means each 100-word passage loses roughly half its content at the tokenizer stage. Across 10 passages, the model effectively sees about 400–450 words of a page that may contain thousands.</p>
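


<p>A rough sketch of that step using the <code>sentencepiece</code> Python package. The model filename here is hypothetical; the real tokenizer ships inside Chrome&#8217;s model bundle:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>import sentencepiece as spm

SEQ_LEN = 64

# Hypothetical filename; the real model is bundled with Chrome.
sp = spm.SentencePieceProcessor(model_file="passage_embedder.model")

def tokenize_passage(text):
    """Truncate to 64 tokens, append EOS if it fits, then zero-pad."""
    ids = sp.encode(text)[:SEQ_LEN]
    if len(ids) &lt; SEQ_LEN:  # room left after truncation: append EOS
        ids.append(sp.eos_id())
    return ids + [0] * (SEQ_LEN - len(ids))  # fixed-length embedder input
</code></pre>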



<h2 class="wp-block-heading">The Embedder</h2>



<p>The passage embedder (<code>OPTIMIZATION_TARGET_PASSAGE_EMBEDDER</code>) is a TFLite DualEncoder transformer model. It takes <code>int32[1, 64]</code> token IDs as input and outputs a <code>float32[1, 768]</code> embedding vector. The same model embeds both the page passages and the title/URL string.</p>



<p>The title/URL input is constructed by concatenating the page title and URL with a separator: <code>"Page Title - https://example.com/path"</code>.</p>



<h2 class="wp-block-heading">The Classifier</h2>



<p>The shopping classifier takes a <code>float32[1, 1536]</code> input vector, which is two 768-dim embeddings concatenated:</p>



<ul class="wp-block-list">
<li>First 768 dimensions: title/URL embedding</li>



<li>Last 768 dimensions: mean-pooled passage embeddings</li>
</ul>



<p>Multiple passage embeddings are combined using element-wise mean pooling. This is specified in the model&#8217;s metadata (<code>pooling_strategy = POOLING_STRATEGY_MEAN</code>, <code>max_passages = 10</code>).</p>



<p>The output is a single float between 0 and 1 representing the probability that the page is a shopping page.</p>
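


<p>Here&#8217;s a small NumPy sketch of the fusion step; the function name is ours, but the shapes, ordering, and mean pooling come straight from the model metadata described above:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>import numpy as np

def build_classifier_input(title_url_embedding, passage_embeddings):
    """Assemble the float32[1, 1536] classifier input.

    title_url_embedding: float32[768], from embedding "Page Title - URL"
    passage_embeddings:  float32[n, 768], one row per passage (n up to 10)
    """
    pooled = passage_embeddings.mean(axis=0)               # element-wise mean pooling
    fused = np.concatenate([title_url_embedding, pooled])  # title/URL first
    return fused.astype(np.float32)[np.newaxis, :]         # add the batch dimension
</code></pre>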



<h2 class="wp-block-heading">Testing It</h2>



<p>I extracted both models from Chrome and built a Streamlit app that replicates the full pipeline. It uses Selenium to launch Chrome Canary, calls <code>Page.getAnnotatedPageContent</code> via CDP to get the same structured content Chrome uses internally, then runs the chunking, tokenization, embedding, and classification steps.</p>



<p>Results on a few test inputs:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Input</th><th>Score</th></tr></thead><tbody><tr><td>&#8220;Breaking news: earthquake hits California coast&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;How to learn Python programming for beginners&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;Wikipedia &#8211; History of the Roman Empire&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;BBC Sport &#8211; Premier League results and fixtures&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;Amazon.com: Apple iPhone 15 Pro Max 256GB&#8221;</td><td>1.0000</td></tr><tr><td>&#8220;Best deals on laptops this Black Friday &#8211; up to 50% off&#8221;</td><td>1.0000</td></tr><tr><td>dejan.ai</td><td>0.0000</td></tr><tr><td>owayo.com/custom-cycling-jerseys.htm</td><td>0.9998</td></tr></tbody></table></figure>



<p>The model produces sharp, confident separations despite the lossy input pipeline.</p>



<h2 class="wp-block-heading">What Chrome Does With the Score</h2>



<p>The shopping classification feeds two systems:</p>



<p><strong>Per-page annotation.</strong> The score is stored in Chrome&#8217;s history database as part of <code>VisitContentAnnotations</code>. This is used by History Journeys to cluster shopping visits together.</p>



<p><strong>User-level segmentation.</strong> Scores are aggregated over time by Chrome&#8217;s Segmentation Platform into a separate model (<code>OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER</code>). If a user is classified as a &#8220;shopping user,&#8221; Chrome enables commerce features: price tracking in the omnibox, price drop notifications, shopping insights in the side panel, and shopping cards on the new tab page.</p>



<p>The per-page classifier is a signal collector that builds a user-level shopping profile, which in turn gates which commerce features Chrome presents.</p>



<h2 class="wp-block-heading">Why This Matters for E-Commerce SEO</h2>



<p>If Chrome can&#8217;t identify your page as a shopping page from the first ~450 words of visible content, your users won&#8217;t see commerce features like price tracking and shopping insights. Navigation menus, cookie banners, and boilerplate that appear early in the DOM consume your token budget before the model reaches your product information. E-commerce sites that bury product signals below heavy navigation and promotional blocks risk being invisible to the classifier entirely.</p>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/google-shopping-classifier/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		<enclosure url="https://dejan.ai/wp-content/uploads/2026/04/20260403-0729-38.7304427.mp4" length="4853580" type="video/mp4" />

			</item>
		<item>
		<title>AI Brand Authority Index: Ranking 2.9 Million Brands by Associative Embeddedness in Gemini&#8217;s Memory</title>
		<link>https://dejan.ai/blog/brands/</link>
					<comments>https://dejan.ai/blog/brands/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sat, 28 Mar 2026 11:01:30 +0000</pubDate>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Mechanistic Interpretability]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2360</guid>

					<description><![CDATA[Abstract When a large language model is asked to &#8220;name 100 brands at random,&#8221; it doesn&#8217;t produce uniform randomness. It produces a distribution shaped by its training data, revealing which brands occupy the most cognitive real estate in the model&#8217;s parametric memory. We present a methodology for quantifying brand authority in AI memory using Personalized [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Abstract</h2>



<p>When a large language model is asked to &#8220;name 100 brands at random,&#8221; it doesn&#8217;t produce uniform randomness. It produces a distribution shaped by its training data, revealing which brands occupy the most cognitive real estate in the model&#8217;s parametric memory. We present a methodology for quantifying brand authority in AI memory using Personalized PageRank with seed-weighted teleportation. Phase 1 establishes seed brands through 200,000 independent recall surveys. Phase 2 constructs a two-level directed association graph. Phase 3 computes authority scores using sparse matrix power iteration across 2.9 million brand nodes. Manual quality control of 8,055 seed entries removes 2,163 junk artifacts produced by Gemini&#8217;s generation failures.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="423" src="https://dejan.ai/wp-content/uploads/2026/03/image-24-1024x423.png" alt="" class="wp-image-2385" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-24-1024x423.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-24-300x124.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-24-768x317.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-24.png 1442w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<div class="wp-block-buttons alignfull is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-ecbad910 wp-block-buttons-is-layout-flex" style="padding-top:var(--wp--preset--spacing--20);padding-bottom:var(--wp--preset--spacing--20)">
<div class="wp-block-button has-custom-width wp-block-button__width-25"><a class="wp-block-button__link has-large-font-size has-custom-font-size wp-element-button" href="https://authority.dejan.ai/">Dejan Authority Database</a></div>
</div>



<h2 class="wp-block-heading">1. Background</h2>



<p>PageRank models a random surfer who follows links across a graph. A node&#8217;s score depends on how many other nodes link to it and how authoritative those linking nodes are. The iterative computation converges on the stationary distribution of the random walk.</p>



<p>We apply this framework to brand recall in large language models. Instead of web pages and hyperlinks, our graph consists of brands and directed associations extracted from Google&#8217;s Gemini model. Instead of uniform teleportation, we use seed-weighted teleportation where brands the model recalls most frequently and earliest receive proportionally more random walk restarts.</p>



<h2 class="wp-block-heading">2. Phase 1: Establishing the Seed Set</h2>



<h3 class="wp-block-heading">2.1 The Recall Survey</h3>



<p>We conducted 200,000 independent runs against Google&#8217;s Gemini model (gemini-3-flash-preview), each with the same prompt:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>name 100 brands at random, one per line, all lowercase, no spaces, no hyphens, say nothing else</p>
</blockquote>



<p>Despite the instruction to respond &#8220;at random,&#8221; the model&#8217;s outputs are far from uniform. Brands like Google, Microsoft, and Nike appear in nearly every run, while obscure brands appear only once. This non-uniformity is the signal, not the noise.</p>



<h3 class="wp-block-heading">2.2 Seed Statistics</h3>



<p>From 200,000 runs, we extracted:</p>



<ul class="wp-block-list">
<li><strong>8,608 unique brands</strong> (the raw seed set)</li>



<li><strong>~20 million total mentions</strong></li>



<li>Per-brand metrics:
<ul>
<li><strong>Frequency</strong>: total mentions across all runs</li>



<li><strong>Distinct runs</strong>: number of unique runs containing the brand</li>



<li><strong>Average rank</strong>: mean position when the brand appears (1 = first recalled, 100 = last)</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading">2.3 Seed Weights</h3>



<p>Each seed brand receives an initial authority weight combining recall frequency and recall priority:</p>



<p>$$w_i = \hat{f}_i \times \hat{r}_i^{-1}$$</p>



<p>where:</p>



<ul class="wp-block-list">
<li>$\hat{f}_i = \frac{\text{distinct runs}_i}{\max(\text{distinct runs})}$ is the normalized recall frequency</li>



<li>$\hat{r}_i^{-1} = \frac{1/\text{avg rank}_i}{\max(1/\text{avg rank})}$ is the normalized inverse rank</li>
</ul>



<p>A brand recalled in every run AND recalled first receives a weight near 1.0. A brand recalled once at position 98 receives a weight near zero. These weights become the personalization vector for PageRank teleportation.</p>
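


<p>A minimal NumPy sketch of the weight computation (illustrative names, same formula):</p>



<pre class="wp-block-code"><code>import numpy as np

def seed_weights(distinct_runs, avg_rank):
    """w_i = f_hat_i * r_hat_i^-1 for every seed brand.

    distinct_runs: int array, unique runs containing each brand
    avg_rank:      float array, mean recall position (1 = first, 100 = last)
    """
    f_hat = distinct_runs / distinct_runs.max()  # normalized recall frequency
    inv_rank = 1.0 / avg_rank
    r_hat_inv = inv_rank / inv_rank.max()        # normalized inverse rank
    return f_hat * r_hat_inv                     # near 1.0 only if frequent AND early
</code></pre>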



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="691" height="280" src="https://dejan.ai/wp-content/uploads/2026/03/image-20.png" alt="" class="wp-image-2372" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-20.png 691w, https://dejan.ai/wp-content/uploads/2026/03/image-20-300x122.png 300w" sizes="auto, (max-width: 691px) 100vw, 691px" /></figure>



<h3 class="wp-block-heading">2.4 Seed Quality Control</h3>



<p>Raw Gemini output contained significant contamination. Manual review of all 8,055 seed entries (ranked by PageRank score) identified 2,163 junk entries — 26.8% of the seed set — across several distinct failure modes:</p>



<p><strong>Concatenation artifacts</strong> — Gemini fused adjacent brand names together. The <code>coca*</code> prefix alone produced 11 variants: <code>cocaapple</code>, <code>cocaflops</code>, <code>cocaalcola</code>, <code>cocaicoca</code>, <code>cocaelsa</code>, <code>cocaiccola</code>, <code>cocaicola</code>, <code>cocaonla</code>, <code>cocaformula</code>, <code>cocaole</code>, <code>cocaocla</code>. The <code>visa*</code> prefix generated 80+ junk entries: <code>visafarm</code>, <code>visafold</code>, <code>visafans</code>, <code>visafacebook</code>, <code>visanetwork</code>, <code>visahub</code>, <code>visawash</code>, <code>visacard</code>, <code>visafocus</code>, <code>visaglobal</code>, <code>visamatte</code>, <code>visaeurope</code>, and dozens more. Similarly, <code>hp*</code> produced 100+ entries (<code>hpmicrolab</code>, <code>hpmillett</code>, <code>hpmachines</code>, <code>hpmilwaukee</code>), and <code>tesla*</code> generated 30+ (<code>teslatotalsenergies</code>, <code>teslouisvuitton</code>, <code>teslacoil</code>, <code>teslapump</code>).</p>



<p><strong>Inner monologue leakage</strong> — Gemini&#8217;s internal reasoning about character constraints leaked into output as literal brand entries. Over 200 entries followed the pattern <code>雀巢 (parenthetical self-correction)</code>:</p>



<ul class="wp-block-list">
<li><code>雀巢 (actually nestle, switching to latin)</code></li>



<li><code>雀巢 (oops, sticking to alphabet)</code></li>



<li><code>雀巢 (replaced with nestle, wait, no spaces/hyphens only)</code></li>



<li><code>雀巢 (thinking of brands...)</code></li>



<li><code>雀巢 (just kidding)</code></li>



<li><code>雀巢 (actually nestle, replace with kpmg)</code></li>
</ul>



<p>These represent the model&#8217;s chain-of-thought processing about the CJK character <code>雀巢</code> (Nestle in Chinese) bleeding through as output tokens.</p>



<p><strong>Typos and garbled names</strong> — <code>toyote</code> (toyota), <code>hundai</code> (hyundai), <code>adidsa</code> (adidas), <code>luluemon</code> (lululemon), <code>rebok</code> (reebok), <code>porche</code> (porsche), <code>royleroyce</code> (rollsroyce), <code>senheiser</code> (sennheiser).</p>



<p><strong>Mixed-script artifacts</strong> — Partial CJK character insertion mid-brand: <code>home固定depot</code>, <code>pizza动hut</code>, <code>dr控martens</code>, <code>estee固定lauder</code>, <code>western吐igital</code>, <code>cooler避master</code>.</p>



<p><strong>HTML/prompt leaks</strong> — Model markup and instructions appearing as brands: <code>hugo&lt;/thought&gt;apple</code>, <code>hugo&lt;/p&gt;</code>, and most remarkably: <code>unite 100 brands at random, one per line, all lowercase, no spaces, no hyphens, say nothing else</code> — the model echoed its own prompt as a brand name.</p>



<p><strong>Generic words</strong> — <code>luxury</code>, <code>all</code>, <code>delivery</code>, <code>generic</code>, <code>detergent</code>, <code>pudding</code> — words that aren&#8217;t brands.</p>



<p><strong>Why this matters for PageRank</strong>: Junk seeds receive direct teleportation mass every iteration (alpha=0.15). A garbage entry like <code>cocaapple</code> at rank 789 receives the same structural boost as <code>lecreuset</code> at rank 790. Without filtering, junk seeds contaminate the authority signal at the core of the algorithm. The 2,163 entries were loaded into a <code>brand_ignore</code> table and excluded from the personalization vector during PageRank computation.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="705" height="291" src="https://dejan.ai/wp-content/uploads/2026/03/image-21.png" alt="" class="wp-image-2373" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-21.png 705w, https://dejan.ai/wp-content/uploads/2026/03/image-21-300x124.png 300w" sizes="auto, (max-width: 705px) 100vw, 705px" /></figure>



<h2 class="wp-block-heading">3. Phase 2: Constructing a Two-Level Association Graph</h2>



<h3 class="wp-block-heading">3.1 Level 1 (L1): Seed Associations</h3>



<p>For each effective seed (~5,892 after filtering), we queried Gemini:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>name 100 brands most closely associated with [brand], ordered from most to least associated, one per line, all lowercase, no spaces, no hyphens, say nothing else</p>
</blockquote>



<p>This produced ~860,000 directed edges. These associations are genuinely asymmetric: Apple&#8217;s association with Beats (which it owns) carries different positional weight than Beats&#8217; association with Apple.</p>



<h3 class="wp-block-heading">3.2 Level 2 (L2): Discovered Brand Associations</h3>



<p>Brands discovered at L1 that weren&#8217;t original seeds were themselves queried for their associations. This second pass dramatically expanded the graph into the long tail. A brand like <code>titois</code> (a Turkish textile company) appeared as an L1 association of <code>vice</code>, and when queried at L2, generated its own set of 100 associations including <code>vuteks</code> — another Turkish industrial brand that would never surface in a consumer-focused recall survey.</p>



<p>The full discovery chain for any brand can be traced: <code>vice</code> (seed) → <code>titois</code> (L1) → <code>vuteks</code> (L2).</p>



<h3 class="wp-block-heading">3.3 Graph Scale</h3>



<p>The resulting graph contains:</p>



<ul class="wp-block-list">
<li><strong>2,886,212 unique brand nodes</strong></li>



<li><strong>Millions of directed weighted edges</strong> across L1 and L2</li>



<li><strong>5,892 effective seeds</strong> (after ignoring 2,163 junk entries)</li>



<li><strong>~201,000 L1 brands</strong> discovered through seed associations</li>



<li><strong>~2.68 million L2 brands</strong> discovered through L1 associations</li>
</ul>



<h3 class="wp-block-heading">3.4 Canonicalization</h3>



<p>Brand names required normalization before graph construction:</p>



<ul class="wp-block-list">
<li><strong>Cyrillic homoglyph mapping</strong>: Characters like <code>а</code> (Cyrillic) mapped to <code>a</code> (Latin) to merge visually identical variants (see the sketch after this list)</li>



<li><strong>CJK+Latin mixed-script filtering</strong>: Entries mixing Chinese/Japanese/Korean characters with Latin text flagged as junk</li>



<li><strong>Manual aliases</strong>: 15 CJK-to-Latin mappings for legitimate brands (e.g., <code>雀巢</code> → <code>nestle</code>)</li>



<li><strong>Variant tracking</strong>: 193,070 name variants mapped to canonical forms, preserving display names while merging duplicates</li>
</ul>
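


<p>Here&#8217;s an illustrative Python sketch of these rules. The homoglyph table shown is a tiny subset of the real one, and the junk routing is simplified to a <code>None</code> return:</p>



<pre class="wp-block-code"><code>import unicodedata

# Illustrative subset; the production table covers many more homoglyphs.
CYRILLIC_TO_LATIN = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "х": "x"}
MANUAL_ALIASES = {"雀巢": "nestle"}  # one of the 15 CJK-to-Latin mappings

def is_cjk(ch):
    return "CJK" in unicodedata.name(ch, "")

def canonical_form(name):
    """Map homoglyphs, honor manual aliases, flag mixed-script entries."""
    if name in MANUAL_ALIASES:
        return MANUAL_ALIASES[name]
    mapped = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name)
    has_cjk = any(is_cjk(ch) for ch in mapped)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in mapped)
    if has_cjk and has_latin:
        return None  # CJK+Latin mixed-script entry, routed to the junk table
    return mapped
</code></pre>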



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="705" height="284" src="https://dejan.ai/wp-content/uploads/2026/03/image-22.png" alt="" class="wp-image-2374" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-22.png 705w, https://dejan.ai/wp-content/uploads/2026/03/image-22-300x121.png 300w" sizes="auto, (max-width: 705px) 100vw, 705px" /></figure>



<h2 class="wp-block-heading">4. Computing Personalized PageRank</h2>



<h3 class="wp-block-heading">4.1 Random Walk Model</h3>



<p>At each step of the random walk, a surfer either:</p>



<ul class="wp-block-list">
<li><strong>Teleports</strong> (probability alpha=0.15) — jumps to a seed brand, with probability proportional to that seed&#8217;s authority weight. Ignored seeds receive zero teleportation mass.</li>



<li><strong>Follows an edge</strong> (probability 1-alpha=0.85) — follows an outgoing association edge, weighted by inverse position. Position 1 associations receive more weight than position 100.</li>
</ul>



<h3 class="wp-block-heading">4.2 Edge Weights</h3>



<p>Association position determines edge weight. Brands listed earlier in Gemini&#8217;s association response receive proportionally more link equity via inverse position weighting. Each node&#8217;s outgoing edges are row-normalized to form a proper transition matrix.</p>



<h3 class="wp-block-heading">4.3 Dangling Nodes</h3>



<p>Brands with no outgoing edges (leaf nodes discovered at L2 but never queried) redistribute their accumulated mass back to the personalization vector, preserving the stochastic property of the transition matrix.</p>



<h3 class="wp-block-heading">4.4 Sparse Matrix Power Iteration</h3>



<p>The transition matrix is stored as a scipy CSR sparse matrix. Power iteration multiplies the current score vector by the transition matrix, adds the teleportation component, and repeats until convergence. Convergence criterion: L1 norm between successive score vectors falls below 1e-8, typically achieved within 30-50 iterations.</p>
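


<p>Putting sections 4.1 through 4.4 together, here&#8217;s a condensed sketch of the computation. It simplifies our actual pipeline (illustrative names, everything in one function), but the inverse-position edge weights, row normalization, dangling-mass redistribution, and L1 convergence test all follow the description above:</p>



<pre class="wp-block-code"><code>import numpy as np
import scipy.sparse as sparse

def personalized_pagerank(edges, n_nodes, seed_weight, alpha=0.15, tol=1e-8):
    """edges: (src, dst, position) triples; seed_weight: zero for ignored seeds."""
    src, dst, pos = (np.asarray(col) for col in zip(*edges))
    weights = 1.0 / pos                          # inverse-position edge weight (4.2)
    M = sparse.csr_matrix((weights, (src, dst)), shape=(n_nodes, n_nodes))
    out_mass = np.asarray(M.sum(axis=1)).ravel()
    has_out = out_mass &gt; 0
    inv = np.zeros(n_nodes)
    inv[has_out] = 1.0 / out_mass[has_out]
    M = sparse.diags(inv) @ M                    # row-normalize: stochastic rows
    v = seed_weight / seed_weight.sum()          # teleportation distribution (4.1)
    x = v.copy()
    while True:
        dangling = x[~has_out].sum()             # leaf mass back to the seeds (4.3)
        x_next = (1 - alpha) * (x @ M + dangling * v) + alpha * v
        if np.abs(x_next - x).sum() &lt; tol:       # L1 convergence criterion (4.4)
            return x_next
        x = x_next
</code></pre>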



<h3 class="wp-block-heading">4.5 Why Personalized PageRank</h3>



<p>Standard PageRank uses uniform teleportation — the random surfer restarts at any node with equal probability. Personalized PageRank biases the restart distribution toward specific nodes. In our case, seeds with higher recall frequency and earlier recall position receive more teleportation mass, making them stronger sources of authority in the network. Authority accumulates continuously from all reachable seeds, weighted by both seed authority and graph structure.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="708" height="285" src="https://dejan.ai/wp-content/uploads/2026/03/image-23.png" alt="" class="wp-image-2375" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-23.png 708w, https://dejan.ai/wp-content/uploads/2026/03/image-23-300x121.png 300w" sizes="auto, (max-width: 708px) 100vw, 708px" /></figure>



<h2 class="wp-block-heading">5. Results</h2>



<h3 class="wp-block-heading">5.1 Top 30 Brands</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Rank</th><th>Brand</th><th>Score</th></tr></thead><tbody><tr><td>1</td><td>Google</td><td>1.000000</td></tr><tr><td>2</td><td>Microsoft</td><td>0.983081</td></tr><tr><td>3</td><td>Nike</td><td>0.951061</td></tr><tr><td>4</td><td>Apple</td><td>0.876266</td></tr><tr><td>5</td><td>Adidas</td><td>0.700542</td></tr><tr><td>6</td><td>Sony</td><td>0.684061</td></tr><tr><td>7</td><td>Gucci</td><td>0.639839</td></tr><tr><td>8</td><td>Amazon</td><td>0.623930</td></tr><tr><td>9</td><td>Coca-Cola</td><td>0.590042</td></tr><tr><td>10</td><td>Chanel</td><td>0.570568</td></tr><tr><td>11</td><td>Prada</td><td>0.550746</td></tr><tr><td>12</td><td>Samsung</td><td>0.532741</td></tr><tr><td>13</td><td>Toyota</td><td>0.516163</td></tr><tr><td>14</td><td>Louis Vuitton</td><td>0.511476</td></tr><tr><td>15</td><td>Rolex</td><td>0.508761</td></tr><tr><td>16</td><td>Disney</td><td>0.507488</td></tr><tr><td>17</td><td>Hermes</td><td>0.487205</td></tr><tr><td>18</td><td>Dior</td><td>0.479031</td></tr><tr><td>19</td><td>Pepsi</td><td>0.442026</td></tr><tr><td>20</td><td>Intel</td><td>0.427143</td></tr><tr><td>21</td><td>Honda</td><td>0.420288</td></tr><tr><td>22</td><td>Patagonia</td><td>0.417196</td></tr><tr><td>23</td><td>Audi</td><td>0.405366</td></tr><tr><td>24</td><td>Panasonic</td><td>0.396073</td></tr><tr><td>25</td><td>Cartier</td><td>0.374052</td></tr><tr><td>26</td><td>Volkswagen</td><td>0.368643</td></tr><tr><td>27</td><td>Nintendo</td><td>0.361812</td></tr><tr><td>28</td><td>Porsche</td><td>0.360956</td></tr><tr><td>29</td><td>McDonald&#8217;s</td><td>0.344910</td></tr><tr><td>30</td><td>PUMA</td><td>0.330191</td></tr></tbody></table></figure>



<h3 class="wp-block-heading">5.2 Top Non-Seed Brands</h3>



<p>The highest-ranking brands that Gemini never recalled unprompted but discovered purely through association:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Rank</th><th>Brand</th><th>Score</th></tr></thead><tbody><tr><td>1</td><td>Maison Margiela</td><td>0.094542</td></tr><tr><td>2</td><td>Office</td><td>0.075253</td></tr><tr><td>3</td><td>L.L.Bean</td><td>0.074981</td></tr><tr><td>4</td><td>Cotopaxi</td><td>0.072272</td></tr><tr><td>5</td><td>Rick Owens</td><td>0.070130</td></tr><tr><td>6</td><td>Grand Seiko</td><td>0.066426</td></tr><tr><td>7</td><td>Bravia</td><td>0.059241</td></tr><tr><td>8</td><td>Jil Sander</td><td>0.058125</td></tr><tr><td>9</td><td>Mickey Mouse</td><td>0.057300</td></tr><tr><td>10</td><td>Richard Mille</td><td>0.055195</td></tr></tbody></table></figure>



<p>These brands score high not because the model recalls them spontaneously, but because they sit at dense intersections of associations from high-authority seeds.</p>



<h3 class="wp-block-heading">5.3 Scale</h3>



<ul class="wp-block-list">
<li>Total ranked brands: <strong>2,886,212</strong></li>



<li>Score range: 0.000000 to 1.000000</li>



<li>Seeds in top 30: 30/30</li>



<li>Non-seed brands discovered: <strong>2,880,320</strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="482" src="https://dejan.ai/wp-content/uploads/2026/03/image-19-1024x482.png" alt="PageRank NS" class="wp-image-2366" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-19-1024x482.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-19-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-19-768x361.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-19-1536x723.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-19.png 1908w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">6. What the Scores Measure</h2>



<p>The final scores capture <strong>associative embeddedness</strong> — a combination of:</p>



<ol class="wp-block-list">
<li><strong>Direct recall</strong> — Seeds that Gemini recalls frequently and early receive teleportation mass every iteration</li>



<li><strong>Centrality</strong> — Brands associated with many other high-authority brands accumulate more random walk traffic</li>



<li><strong>Network position</strong> — A brand with moderate recall but central positioning scores higher than a frequently recalled but isolated brand</li>
</ol>



<p>This is distinct from simple popularity or recall frequency. A brand like Maison Margiela ranks as the top non-seed brand not because Gemini recalls it unprompted, but because it sits at a dense intersection of luxury fashion associations — reachable from dozens of high-authority seeds via short, heavily-weighted paths.</p>



<p>The PageRank scores answer not &#8220;how often does the model think of this brand?&#8221; but &#8220;how deeply embedded is this brand in the model&#8217;s associative structure?&#8221;</p>



<h2 class="wp-block-heading">7. Technical Stack</h2>



<ul class="wp-block-list">
<li><strong>Model</strong>: Google Gemini 3 Flash Preview</li>



<li><strong>Phase 1</strong>: 200,000 recall surveys, 8,608 raw seeds, ~20M total mentions</li>



<li><strong>Phase 2</strong>: ~14,500 association queries (L1 + L2), millions of directed edges</li>



<li><strong>Graph</strong>: 2,886,212 nodes</li>



<li><strong>Algorithm</strong>: Personalized PageRank via scipy sparse matrix power iteration</li>



<li><strong>Teleportation factor (alpha)</strong>: 0.15</li>



<li><strong>Convergence tolerance</strong>: 1e-8</li>



<li><strong>Seed quality control</strong>: 2,163 junk seeds identified via manual review and excluded</li>



<li><strong>Canonicalization</strong>: Cyrillic homoglyph mapping, CJK filtering, 193,070 variant mappings, 15 manual CJK aliases</li>



<li><strong>Storage</strong>: SQLite (1.5GB)</li>



<li><strong>Dashboard</strong>: Streamlit with Plotly 3D network visualization</li>



<li><strong>Concurrency</strong>: 20 simultaneous async API calls with incremental database commits</li>
</ul>



<div class="wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button" href="https://authority.dejan.ai/">Dejan Authority Database</a></div>
</div>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/brands/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>TurboQuant: From Paper to Triton Kernel in One Session</title>
		<link>https://dejan.ai/blog/turboquant/</link>
					<comments>https://dejan.ai/blog/turboquant/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 07:16:09 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2351</guid>

					<description><![CDATA[Implementing Google&#8217;s KV cache compression algorithm on Gemma 3 4B and everything that went wrong along the way. On March 24, 2026, Google Research published a blog post introducing TurboQuant, a compression algorithm for large language model inference. The paper behind it, &#8220;Online Vector Quantization with Near-optimal Distortion Rate&#8221; had been on arXiv since April [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>Implementing Google&#8217;s KV cache compression algorithm on Gemma 3 4B and everything that went wrong along the way.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>On March 24, 2026, Google Research published a blog post introducing <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant</a>, a compression algorithm for large language model inference. The paper behind it, &#8220;<a href="https://arxiv.org/pdf/2504.19874">Online Vector Quantization with Near-optimal Distortion Rate</a>,&#8221; had been on arXiv since April 2025 and was accepted at <a href="https://iclr.cc/">ICLR 2026</a>. The claims were striking: compress the key-value cache to 3 bits per coordinate with zero accuracy loss, no training required, and up to 8x speedup on H100 GPUs.</p>



<p>I decided to implement it from scratch and see if the claims held up. They did, and then some.</p>



<h2 class="wp-block-heading">What Google Built</h2>



<p>Every time a transformer generates a token, it computes attention over all previous tokens. The key-value (KV) cache stores those previously computed states to avoid redundant work. As sequences get longer, this cache becomes a serious memory bottleneck: it grows linearly with sequence length and consumes precious GPU memory that could otherwise be used for larger batches or longer contexts.</p>



<p>Vector quantization is the obvious solution: compress the KV cache to fewer bits. But traditional quantization methods carry hidden overhead. They need to store normalization constants (zero points, scales) for every small block of data, typically adding 1-2 extra bits per number. At low bit-widths, this overhead can eat a significant chunk of the compression gains.</p>



<p>TurboQuant eliminates this overhead through a two-stage approach built on a clean mathematical insight.</p>



<p><strong>Stage 1 — Random rotation + Lloyd-Max quantization.</strong> The algorithm applies a random orthogonal rotation to each KV vector. This is the key trick: after rotation, each coordinate&#8217;s distribution becomes a known Beta distribution, concentrated near zero with a predictable shape that depends only on the vector dimension. Because the distribution is known analytically, you can precompute the optimal scalar quantizer (a Lloyd-Max quantizer) once and reuse it for every vector. No per-block normalization constants, no data-dependent calibration, no training. Just rotate and quantize.</p>
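


<p>Here&#8217;s a minimal PyTorch sketch of Stage 1, assuming a precomputed rotation matrix and centroid table (both covered later in this post). One detail to treat as an assumption rather than fact: each vector&#8217;s norm is stored separately and only the unit-norm direction is quantized.</p>



<pre class="wp-block-code"><code>import torch

def quantize(x, rotation, centroids):
    """Stage 1 sketch: rotate, then nearest-centroid lookup per coordinate."""
    norms = x.norm(dim=-1, keepdim=True)  # assumption: norms kept alongside indices
    rotated = (x / norms) @ rotation.T    # coordinates now follow the Beta law
    # Nearest centroid per coordinate; fine for small codebooks (16 levels at 4 bits).
    indices = (rotated.unsqueeze(-1) - centroids).abs().argmin(dim=-1)
    return indices.to(torch.uint8), norms

def dequantize(indices, norms, rotation, centroids):
    """Centroid lookup, rotate back, restore the stored norm."""
    return (centroids[indices.long()] @ rotation) * norms
</code></pre>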



<p><strong>Stage 2 — QJL residual correction.</strong> The paper&#8217;s inner-product-optimized variant (TurboQuant_prod) applies a 1-bit Quantized Johnson-Lindenstrauss transform to the quantization residual. This gives an unbiased inner product estimator, which matters because attention scores are inner products. This stage requires a custom attention kernel to realize its benefits, you can&#8217;t just add the QJL correction back to the reconstructed vector (more on that later).</p>



<p>The theoretical backing is strong: TurboQuant&#8217;s MSE distortion is provably within a factor of ~2.7 of the information-theoretic lower bound. For a data-oblivious algorithm (one that doesn&#8217;t look at the data distribution), that&#8217;s essentially optimal.</p>



<h2 class="wp-block-heading">What We Built</h2>



<p>We implemented TurboQuant from scratch in PyTorch and tested it on Gemma 3 4B IT running on an RTX 4090. The implementation has three layers, each building on the last:</p>



<p><strong>Layer 1: Core algorithm</strong> (<code>turboquant_core.py</code>). The random rotation, Lloyd-Max codebook computation, and quantize/dequantize operations. The codebook is built once for a given (dimension, bit-width) pair by running 300 iterations of Lloyd-Max optimization over a dense numerical grid of the Beta distribution. This takes a few seconds on CPU and the result is cached.</p>



<p><strong>Layer 2: Python KV cache integration</strong> (<code>turboquant_kv_cache.py</code>). A patched <code>DynamicCache</code> that quantizes key and value tensors on every <code>cache.update()</code> call. This is the simplest integration path: it works with any HuggingFace model and requires no model-specific code. The tradeoff is that it stores the dequantized fp16 tensors back in the cache, so you don&#8217;t save memory; you only simulate the accuracy impact of quantization.</p>



<p><strong>Layer 3: Triton fused kernel</strong> (<code>triton_attention.py</code> + <code>turboquant_fused.py</code>). A custom Triton kernel that computes attention scores directly from compressed uint8 key indices, never materializing fp16 keys. This is where the real memory and speed gains come from.</p>



<p>The fused kernel exploits a simple algebraic identity. Since the rotation matrix R is orthogonal:</p>



<p>$$\langle q, R^T \cdot \text{centroids}[\text{idx}] \rangle = \langle R \cdot q, \text{centroids}[\text{idx}] \rangle$$</p>



<p>Pre-rotate the query once with a single matmul, then the per-KV-position work reduces to a centroid table lookup and dot product. The Triton kernel does this across all sequence positions in parallel, loading uint8 indices instead of fp16 values and moving roughly 4x less data from GPU memory.</p>
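


<p>In PyTorch terms, the identity reduces to the following reference computation (a naive equivalent of what the Triton kernel does in parallel; softmax scaling and the QJL correction omitted):</p>



<pre class="wp-block-code"><code>import torch

def scores_from_indices(q, key_indices, key_norms, rotation, centroids):
    """Reference math for the fused kernel (softmax scaling omitted).

    q:           float[d] query
    key_indices: uint8[T, d] compressed key coordinates
    key_norms:   float[T, 1] per-key norms saved at quantization time
    """
    q_rot = q @ rotation.T                 # pre-rotate the query once
    k_hat = centroids[key_indices.long()]  # centroid table lookup, [T, d]
    return (k_hat @ q_rot) * key_norms.squeeze(-1)  # one dot product per position
</code></pre>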



<h2 class="wp-block-heading">Results</h2>



<h3 class="wp-block-heading">Core Algorithm Validation</h3>



<p>On synthetic vectors (d=256), the quantize-dequantize roundtrip quality:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Bits</th><th>Cosine Similarity</th><th>Inner Product Correlation</th><th>Compression</th></tr></thead><tbody><tr><td>2</td><td>0.940</td><td>0.945</td><td>15.5x</td></tr><tr><td>3</td><td>0.983</td><td>0.984</td><td>10.4x</td></tr><tr><td>4</td><td>0.995</td><td>0.995</td><td>7.9x</td></tr></tbody></table></figure>



<h3 class="wp-block-heading">Triton Kernel Microbenchmark</h3>



<p>The fused kernel vs standard dequantize-then-matmul, measuring just the Q@K^T operation:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>KV Length</th><th>Standard</th><th>Fused</th><th>Speedup</th></tr></thead><tbody><tr><td>128</td><td>0.076ms</td><td>0.066ms</td><td>1.15x</td></tr><tr><td>512</td><td>0.061ms</td><td>0.050ms</td><td>1.22x</td></tr><tr><td>1024</td><td>0.061ms</td><td>0.052ms</td><td>1.18x</td></tr><tr><td>4096</td><td>0.062ms</td><td>0.051ms</td><td>1.22x</td></tr></tbody></table></figure>



<p>Cosine similarity between the kernel output and the PyTorch reference: 1.000000. The kernel matches the reference to floating-point precision.</p>



<h3 class="wp-block-heading">End-to-End Generation on Gemma 3 4B IT</h3>



<p>Three prompts: explain compilers vs interpreters, write a palindrome function, causes of the French Revolution. Each generated up to 200 tokens with greedy decoding.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Config</th><th>Avg tok/s</th><th>Output Quality</th><th>VRAM Delta</th></tr></thead><tbody><tr><td>fp16 baseline</td><td>17.7</td><td>reference</td><td>26 MB</td></tr><tr><td>4-bit Python path</td><td>13.8</td><td>correct, minor rephrase</td><td>19 MB</td></tr><tr><td>4-bit FUSED</td><td>16.5</td><td>identical to baseline</td><td>4 MB</td></tr><tr><td>2-bit Python path</td><td>14.0</td><td>some degradation</td><td>15 MB</td></tr><tr><td>2-bit FUSED</td><td>17.7</td><td>identical to baseline</td><td>7 MB</td></tr></tbody></table></figure>



<p>The 2-bit fused path produces character-for-character identical output to the fp16 baseline on all three prompts, at the same speed, with 3-6x less VRAM for the KV cache.</p>



<h2 class="wp-block-heading">Technical Deep Dive</h2>



<h3 class="wp-block-heading">The Lloyd-Max Codebook</h3>



<p>After random rotation on the unit sphere S^{d-1}, each coordinate follows a Beta((d-1)/2, (d-1)/2) distribution on [-1, 1]. For large d (Gemma 3 uses d=256), this concentrates tightly around zero with standard deviation approximately 1/sqrt(d) ≈ 0.0625.</p>
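

<p>That concentration claim is easy to check empirically (a quick sanity check, not from the repo):</p>



<pre class="wp-block-code"><code>import torch

d = 256
R = torch.linalg.qr(torch.randn(d, d))&#91;0]
x = torch.nn.functional.normalize(torch.randn(100_000, d), dim=-1)  # unit sphere
coords = (x @ R.T)&#91;:, 0]            # one coordinate after rotation
print(coords.std())                   # ~ 1/sqrt(256) = 0.0625
print(coords.abs().max())             # tiny compared to the &#91;-1, 1] support</code></pre>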



<p>The codebook construction solves the continuous k-means problem for this distribution: partition [-1, 1] into 2^b intervals and find the centroid of each interval that minimizes weighted MSE under the Beta PDF. We use a dense grid (50,000 points) focused on the ±6σ range where the distribution has mass, then run standard Lloyd-Max iteration: assign grid points to nearest centroid, update centroids as weighted means, repeat.</p>
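

<p>A condensed sketch of that construction (our names and simplifications; the repo builds this once per (dimension, bit-width) pair and caches the result):</p>



<pre class="wp-block-code"><code>import torch
from torch.distributions import Beta

def lloyd_max_codebook(d=256, bits=4, iters=300, grid_n=50_000):
    """Lloyd-Max over a dense grid of the post-rotation coordinate
    distribution (Beta((d-1)/2, (d-1)/2) rescaled to &#91;-1, 1])."""
    sigma = 1.0 / d ** 0.5
    grid = torch.linspace(-6 * sigma, 6 * sigma, grid_n, dtype=torch.float64)
    conc = torch.tensor((d - 1) / 2.0, dtype=torch.float64)
    w = Beta(conc, conc).log_prob((grid + 1) / 2).exp()    # PDF weights on the grid
    centroids = torch.linspace(-3 * sigma, 3 * sigma, 2 ** bits, dtype=torch.float64)
    for _ in range(iters):
        assign = (grid&#91;:, None] - centroids&#91;None, :]).abs().argmin(dim=1)
        for j in range(2 ** bits):
            mask = assign == j
            if mask.any():                                  # weighted mean per cell
                centroids&#91;j] = (grid&#91;mask] * w&#91;mask]).sum() / w&#91;mask].sum()
    return centroids.float()

print(lloyd_max_codebook(bits=4))</code></pre>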



<p>The resulting codebook has an interesting structure — the centroids cluster densely near zero where the distribution is concentrated, with wider spacing in the tails. At 4 bits (16 levels), the centroid spacing near zero is approximately 0.008, providing very fine-grained reconstruction in the region where most values live.</p>



<h3 class="wp-block-heading">The Random Rotation</h3>



<p>The paper uses a randomized Hadamard transform (H · diag(signs)) for the rotation. We initially implemented this faithfully — and it was catastrophically slow. The Fast Walsh-Hadamard Transform is a series of butterfly operations, and our Python implementation executed each butterfly as a tensor slice operation. On GPU, this meant thousands of tiny CUDA kernel launches per rotation, with Python-level loop overhead between each one.</p>



<p>We replaced it with a precomputed random orthogonal matrix via QR decomposition. Mathematically equivalent — any orthogonal rotation on S^{d-1} produces the same Beta distribution on coordinates. The QR matrix is d×d (256×256 = 256KB, negligible), computed once from a seeded random Gaussian matrix, and the rotation becomes a single <code>torch.matmul</code>. Problem solved.</p>
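

<p>Roughly like this (a sketch, not the repo&#8217;s exact code; the sign fix is the standard trick for making the QR factor uniformly distributed over rotations):</p>



<pre class="wp-block-code"><code>import torch

def make_rotation(d: int, seed: int = 0) -> torch.Tensor:
    """Seeded random orthogonal matrix; built once, then rotation = one matmul."""
    g = torch.Generator().manual_seed(seed)
    A = torch.randn(d, d, generator=g)
    Q, R = torch.linalg.qr(A)
    return Q * torch.sign(torch.diagonal(R))   # sign fix: Haar-uniform rotation

R = make_rotation(256)
x = torch.randn(32, 256)
x_rot = x @ R.T                                 # the entire rotation step</code></pre>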



<p>A production implementation would use a structured rotation (Hadamard + random signs) with a fused CUDA kernel for the butterfly operations. The structured form is more memory-efficient (you only store the d random signs, not a d×d matrix) and the butterfly operations parallelize beautifully on GPU. But for a reference implementation, the dense matrix works fine.</p>
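

<p>For the curious, the butterflies can at least be batched in pure PyTorch into log2(d) tensor ops, a middle ground between the naive loop and a fused kernel (a sketch of the structured form, not the repo&#8217;s code):</p>



<pre class="wp-block-code"><code>import math
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Walsh-Hadamard transform over the last dim (d a power of two), with
    the butterflies batched into log2(d) tensor ops instead of one op each."""
    shape = x.shape
    d = shape&#91;-1]
    x = x.reshape(-1, d)
    h = 1
    while h &lt; d:
        x = x.view(-1, d // (2 * h), 2, h)
        a, b = x&#91;:, :, 0, :], x&#91;:, :, 1, :]
        x = torch.stack((a + b, a - b), dim=2).reshape(-1, d)
        h *= 2
    return (x / math.sqrt(d)).reshape(shape)    # orthonormal scaling

signs = torch.randint(0, 2, (256,)) * 2 - 1     # the only stored state: d signs
x_rot = fwht(torch.randn(8, 256) * signs)       # H · diag(signs) rotation</code></pre>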



<h3 class="wp-block-heading">The Triton Kernel</h3>



<p>The kernel parallelizes over (query_head × batch, sequence_position_block). Each program instance does the following (a plain-PyTorch sketch of the same computation follows the list):</p>



<ol class="wp-block-list">
<li>Loads a slice of the pre-rotated query vector (BLOCK_D elements)</li>



<li>Loads the corresponding key indices for BLOCK_S sequence positions (uint8)</li>



<li>Gathers centroid values via table lookup (<code>tl.load(C_ptr + k_idx)</code>)</li>



<li>Accumulates the partial dot product</li>



<li>Multiplies by key norms and the attention scale factor</li>
</ol>
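

<p>In plain PyTorch, the computation each instance performs is equivalent to (a reference sketch with our names, not the Triton source):</p>



<pre class="wp-block-code"><code>import torch

def fused_scores_reference(q_rot, k_idx, k_norms, codebook, scale):
    """q_rot: (d,) pre-rotated query; k_idx: (seq, d) uint8 centroid indices;
    k_norms: (seq,) fp16 per-key norms; codebook: (2**bits,) centroids."""
    k_hat = codebook&#91;k_idx.long()]   # step 3: centroid table lookup (gather)
    scores = k_hat @ q_rot             # steps 1-4: dot products, all positions at once
    return scores * k_norms * scale    # step 5: rescale by norms and attention scale</code></pre>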



<p>The autotuner searches over 5 configurations of (BLOCK_S, BLOCK_D) and warp count. On the RTX 4090, it typically selects BLOCK_S=64, BLOCK_D=64 with 4 warps.</p>



<p>The key efficiency win is memory bandwidth. Loading uint8 indices requires 1 byte per element; loading fp16 keys requires 2 bytes. The centroid table (16 float32 values at 4-bit, or 4 values at 2-bit) fits comfortably in L1/L2 cache and is reused across all sequence positions. The net effect is roughly 2x less data movement from HBM, which translates to the observed ~1.2x speedup on the Q@K^T operation.</p>



<h3 class="wp-block-heading">GQA Handling</h3>



<p>Gemma 3 4B uses Grouped Query Attention with 8 query heads and 4 KV heads (ratio 2:1). The kernel handles this by mapping each query head to its corresponding KV head: <code>kv_head = q_head // gqa_ratio</code>. The key indices and norms are loaded from the KV head, while queries come from the query head. This means each KV head&#8217;s compressed data is read twice (once per query head in its group), but since it&#8217;s small (uint8), the redundant reads are cheap.</p>



<h3 class="wp-block-heading">Cache Architecture</h3>



<p>The fused integration stores keys in compressed form (uint8 indices + fp16 norms per vector) and values in standard fp16. We only compress keys because the attention score computation (Q@K^T) is where the memory bandwidth bottleneck lives during decoding. The softmax@V multiplication is less critical because it&#8217;s compute-bound rather than memory-bound at typical sequence lengths.</p>



<p>A fully optimized implementation would also compress values, but the gains are smaller and the integration is more complex (you&#8217;d need a second Triton kernel for the softmax@V step with compressed values).</p>



<h2 class="wp-block-heading">What Didn&#8217;t Work</h2>



<h3 class="wp-block-heading">Mistake 1: Adding QJL Back to the Reconstructed Vector</h3>



<p>The paper describes two variants: TurboQuant_mse (pure Lloyd-Max, best for reconstruction) and TurboQuant_prod (Lloyd-Max + 1-bit QJL, best for inner products). Our first implementation used TurboQuant_prod for the KV cache: (bits-1) bits of Lloyd-Max plus 1 bit of QJL on the residual.</p>



<p>The QJL stage produces a correction term that makes the inner product estimator unbiased. But when you add this correction back to the reconstructed vector and store it in the KV cache, you&#8217;re injecting noise into the vector itself. The result: cosine similarity dropped to 0.69 (terrible) and the model produced garbage.</p>



<p>The fix was simple: use TurboQuant_mse (all bits to Lloyd-Max) for the drop-in cache, and reserve TurboQuant_prod for a custom attention kernel that can use the two-part representation directly. The fused Triton kernel implements the MSE variant.</p>



<h3 class="wp-block-heading">Mistake 2: Gemma 3 4B Is Not a Causal LM</h3>



<p>We initially loaded the model with <code>AutoModelForCausalLM</code> and <code>AutoTokenizer</code>. This loaded the model fine, tokenized fine, and even generated — but every output token was <code>&lt;pad&gt;</code> (token ID 0). The baseline and quantized paths both produced identical pad sequences.</p>



<p>Gemma 3 4B+ is a multimodal model. It requires <code>Gemma3ForConditionalGeneration</code> and <code>AutoProcessor</code>, not the causal LM variants. The <code>AutoProcessor</code> handles the chat template correctly and returns the right token format. This wasn&#8217;t a quantization bug at all — the model simply wasn&#8217;t being invoked correctly.</p>



<h3 class="wp-block-heading">Mistake 3: Python-Loop Hadamard Transform</h3>



<p>The Fast Walsh-Hadamard Transform is O(d log d) butterfly operations. Our initial implementation ran each butterfly as a Python loop iteration with tensor slicing:</p>



<pre class="wp-block-code"><code>while h &lt; d:
    for start in range(0, d, stride):
        lo = slice(start, start + h)
        hi = slice(start + h, start + stride)
        a = result&#91;..., lo].clone()
        b = result&#91;..., hi].clone()
        result&#91;..., lo] = a + b
        result&#91;..., hi] = a - b
    h *= 2
</code></pre>



<p>For d=256, this is 8 outer passes × 128 inner iterations = 1,024 loop iterations per rotation, each issuing several tiny CUDA kernels, with Python interpreter overhead between every one. On a KV cache update touching 27 layers × 4 KV heads × 256-dim vectors, the GPU was spending more time waiting for Python than doing math. Generation hung completely — even a 20-token completion with a trivial prompt didn&#8217;t return.</p>



<p>Replacing this with a single <code>x @ Q_T</code> matmul using a precomputed orthogonal matrix made it instant.</p>



<h3 class="wp-block-heading">Mistake 4: Subclassing DynamicCache</h3>



<p>Our first KV cache integration subclassed HuggingFace&#8217;s <code>DynamicCache</code>. This broke immediately because Gemma 3&#8217;s model code calls <code>past_key_values.is_initialized</code>, <code>past_key_values.key_cache</code>, and other attributes whose names and semantics change across transformers versions. Our subclass was missing several of these.</p>



<p>We tried three approaches:</p>



<ul class="wp-block-list">
<li>Subclassing <code>DynamicCache</code> (broke on <code>.is_initialized</code>)</li>



<li>Forward hooks on attention layers (fragile, couldn&#8217;t reliably find the cache object)</li>



<li>Patching <code>cache.update()</code> on a stock <code>DynamicCache</code> instance (worked perfectly)</li>
</ul>



<p>The final approach is the cleanest: create a normal <code>DynamicCache</code>, save a reference to its <code>update</code> method, and replace it with a wrapper that quantizes inputs before calling the original. All the cache&#8217;s internal bookkeeping (sequence length tracking, layer indexing) works unchanged.</p>
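

<p>In code, the pattern looks roughly like this (a sketch; <code>roundtrip</code> stands in for the quantize-dequantize path from the core module):</p>



<pre class="wp-block-code"><code>from transformers import DynamicCache

def install_quantized_update(cache: DynamicCache, roundtrip):
    """Wrap cache.update() so K/V pass through quantize-dequantize before
    storage. All other cache bookkeeping is left untouched."""
    original_update = cache.update

    def patched_update(key_states, value_states, layer_idx, cache_kwargs=None):
        return original_update(roundtrip(key_states), roundtrip(value_states),
                               layer_idx, cache_kwargs)

    cache.update = patched_update
    return cache</code></pre>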



<h3 class="wp-block-heading">Mistake 5: Token Counting After Fused Generation</h3>



<p>The <code>FusedTurboQuantRunner</code> returns decoded text directly (not output IDs), so we tried <code>processor.encode(text)</code> to count tokens for the timing report. But <code>Gemma3Processor</code> is a multimodal processor — it has <code>decode</code> but not <code>encode</code>. The tokenizer lives at <code>processor.tokenizer.encode()</code>. A one-line fix, but it crashed the first successful fused generation and hid the results until the next run.</p>



<h2 class="wp-block-heading">Comparison with Other Implementations</h2>



<p>Prince Canuma independently implemented TurboQuant in MLX and tested on Qwen 3.5 35B with context lengths up to 64K tokens. Their results: 6/6 exact match on needle-in-haystack at every quantization level, 4.9x smaller KV cache at 2.5-bit, 3.8x at 3.5-bit.</p>



<p>Two implementations, different frameworks (PyTorch+Triton vs MLX), different models (Gemma 3 4B vs Qwen 3.5 35B), different hardware (NVIDIA RTX 4090 vs Apple Silicon) — same conclusion. TurboQuant&#8217;s theoretical guarantees translate directly to practice across the board.</p>



<h2 class="wp-block-heading">What&#8217;s Next</h2>



<p>This implementation leaves several optimizations on the table:</p>



<p><strong>Value cache compression.</strong> We only compress keys. Compressing values would require a second Triton kernel for the softmax@V multiplication, but would further reduce memory usage.</p>



<p><strong>Structured rotation.</strong> The precomputed d×d orthogonal matrix works but uses O(d²) memory. A fused Hadamard kernel would use O(d) memory (just the random signs) and be faster for large d.</p>



<p><strong>Sub-byte packing.</strong> We store 2-bit indices as uint8. Packing 4 indices per byte would reduce memory by another 4x for the index storage.</p>
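

<p>The packing itself is straightforward (a sketch of the missing piece; the fused kernel would also need to unpack on load):</p>



<pre class="wp-block-code"><code>import torch

def pack_2bit(idx: torch.Tensor) -> torch.Tensor:
    """Pack four 2-bit indices per byte (last dim divisible by 4)."""
    idx = idx.view(*idx.shape&#91;:-1], -1, 4)
    return (idx&#91;..., 0]
            | (idx&#91;..., 1] &lt;&lt; 2)
            | (idx&#91;..., 2] &lt;&lt; 4)
            | (idx&#91;..., 3] &lt;&lt; 6))

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    parts = &#91;(packed >> s) &amp; 0x3 for s in (0, 2, 4, 6)]
    return torch.stack(parts, dim=-1).flatten(-2)

idx = torch.randint(0, 4, (64, 256), dtype=torch.uint8)
assert torch.equal(unpack_2bit(pack_2bit(idx)), idx)     # lossless roundtrip</code></pre>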



<p><strong>Flash Attention integration.</strong> The ultimate goal: fuse the centroid gather into a Flash Attention-style kernel that never materializes the full attention matrix. This would combine TurboQuant&#8217;s memory savings with Flash Attention&#8217;s IO efficiency.</p>



<p>The paper&#8217;s claim of 8x speedup on H100s comes from optimized int4 tensor core kernels — that level of hardware-specific optimization is beyond a one-session implementation, but the algorithmic foundation is solid and the path from here to production is clear.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Paper: <a href="https://arxiv.org/abs/2504.19874">TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate</a> (ICLR 2026)</em></p>



<p class="has-text-align-center"><em>Complete implementation including Triton kernel</em>:</p>



<div class="wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link has-text-align-center wp-element-button" href="https://dejan.ai/media/code/turboquant.zip">DOWNLOAD CODE</a></div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<pre class="wp-block-code" style="font-size:0.7rem"><code>                   python run_demo.py --fused --max-new-tokens 200 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -&gt; 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -&gt; 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -&gt; 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:02&lt;00:00, 304.27it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.52s  (15.0 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 6.06s  (11.9 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.73s  (14.4 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.71s  (12.9 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.24s  (16.0 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 11.41s  (17.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 14.36s  (13.9 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 12.18s  (16.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 6.20s  (13.9 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.72s  (17.2 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.92s  (17.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.78s  (13.9 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.85s  (15.8 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.85s  (14.1 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 9.15s  (17.0 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

PS C:\projects\tq&gt;</code></pre>



<p>In response to a <a href="https://x.com/ARomeoSierra/status/2036996932829171852?s=20">Twitter question</a>:</p>



<pre class="wp-block-code" style="font-size:0.7rem"><code>PS C:\projects\tq> python run_demo.py --fused --long-context --haystack-tokens 4096 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:03&lt;00:00, 274.55it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.93s  (13.8 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 5.86s  (12.3 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.63s  (14.7 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.68s  (13.1 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.17s  (16.3 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 10.91s  (18.3 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 13.76s  (14.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 11.78s  (17.1 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 5.97s  (14.4 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.28s  (17.8 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.55s  (18.3 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.21s  (14.5 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.43s  (16.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.56s  (14.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 8.92s  (17.5 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

======================================================================
Needle-in-a-haystack (~4096 tokens)
======================================================================
  fp16 baseline              &#91;FOUND]  1.0s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 4-bit           &#91;FOUND]  0.7s
    Answer: blue-giraffe-42
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers
  TurboQuant 4-bit FUSED     &#91;FOUND]  1.9s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 2-bit           &#91;FOUND]  1.1s
    Answer: The secret password is 'blue-giraffe-42'.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers
  TurboQuant 2-bit FUSED     &#91;FOUND]  1.4s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.

PS C:\projects\tq> python run_demo.py --fused --long-context --haystack-tokens 16384 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:03&lt;00:00, 285.34it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.32s  (15.7 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 5.94s  (12.1 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.70s  (14.5 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.75s  (12.8 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.19s  (16.2 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 11.10s  (18.0 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 13.94s  (14.3 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 12.02s  (16.7 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 6.13s  (14.0 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.54s  (17.4 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.80s  (17.7 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.47s  (14.2 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.68s  (16.1 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.92s  (14.0 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 9.19s  (17.0 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

======================================================================
Needle-in-a-haystack (~16384 tokens)
======================================================================
  fp16 baseline              &#91;FOUND]  2.5s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 4-bit           &#91;FOUND]  2.8s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers
  TurboQuant 4-bit FUSED     &#91;FOUND]  3.4s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 2-bit           &#91;FOUND]  2.8s
    Answer: The secret password for project Orion is ‘blue-giraffe-42’.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers
  TurboQuant 2-bit FUSED     &#91;FOUND]  3.0s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.

PS C:\projects\tq></code></pre>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/turboquant/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Clickbait Titles Exploit Attention Through Latent Entities</title>
		<link>https://dejan.ai/blog/latent-entities/</link>
					<comments>https://dejan.ai/blog/latent-entities/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sun, 22 Mar 2026 12:20:49 +0000</pubDate>
				<category><![CDATA[Content]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2333</guid>

					<description><![CDATA[Every clickbait title works the same way: it removes exactly one critical variable (the subject, the reason, the process, or the outcome) and charges you a click to fill the blank. This missing variable, which we call a latent entity, is so pervasive it has become normalized and nobody questions it anymore. You should! That [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h4 class="wp-block-heading">Every clickbait title works the same way: it removes exactly one critical variable: the subject, the reason, the process, or the outcome, and charges you a click to fill the blank. This missing variable, which we call a <strong>latent entity</strong>, is so pervasive it has become normalized and nobody questions it anymore. You should!</h4>



<p>That opening was the direct answer to this title&#8217;s attention hook: the latent variable behind the &#8220;how&#8221;.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://dejan.ai/wp-content/uploads/2026/03/image-13-1024x480.png" alt="" class="wp-image-2334" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-13-1024x480.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-13-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-13-768x360.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-13-1536x720.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-13.png 1908w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Every day, hundreds of millions of people scan headlines in feeds, aggregators, and search results. Most of these titles are not designed to inform. They are designed to withhold. Somewhere in the sentence, a critical piece of information has been surgically removed — the tool isn&#8217;t named, the result isn&#8217;t revealed, the reason isn&#8217;t given. The reader is left with an incomplete thought and a link. The click is the cost of completing it.</p>



<p>This mechanism is so pervasive that it has become invisible, like background noise. But it has a structure. And once you see the structure, you can&#8217;t unsee it.</p>



<h2 class="wp-block-heading">The attention transaction</h2>



<p>A title is a transaction. The author offers a premise. The reader pays with a click. The currency is attention, and the receipt is the missing piece of information the title promised but refused to deliver upfront.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://dejan.ai/wp-content/uploads/2026/03/image-14-1024x480.png" alt="" class="wp-image-2337" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-14-1024x480.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-14-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-14-768x360.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-14-1536x720.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-14.png 1903w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This is not metaphorical. The economics are literal. Every click generates a pageview. Every pageview generates ad impressions. Every ad impression generates revenue. The entire model is optimized not for informing the reader but for maximizing the probability that they click. The most reliable way to do that is to make the title incomplete — to create an information gap that can only be closed on the other side of the link.</p>



<p>The reader isn&#8217;t choosing to engage with content. They&#8217;re being charged an attention tax to access information that the title already had room to provide.</p>



<h2 class="wp-block-heading">Naming the structure: latent entities</h2>



<p>We can formalize what clickbait hides. In every withholding title, there is a <strong>latent entity</strong> — a variable the reader cannot resolve without clicking through. The title is the observed data. The latent entity is the unobserved variable. The click is the inference cost.</p>



<p>There are four types, and they are exhaustive.</p>



<h3 class="wp-block-heading">Latent Subject — <em>What?</em></h3>



<p>The title revolves around a specific thing — a tool, a setting, a feature, a list of items — but deliberately masks its identity behind a vague pronoun or a deferred noun.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;This one browser extension changed how I use the internet forever.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">What extension? You don&#8217;t know. That&#8217;s the transaction. The word &#8220;this&#8221; is doing the work of pointing at something while revealing nothing. The subject is latent.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;5 tools every developer needs in their workflow.&#8221;</strong></p>



<p class="wp-container-content-0733e5d0">Which five? The number creates the shape of an answer without filling it in. Five slots, all empty.</p>
</div>



<h3 class="wp-block-heading">Latent Reason — <em>Why?</em></h3>



<p>The title states a strong opinion, a regret, or an observation, but withholds the logic behind it. The reader is given a conclusion without its supporting argument.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I finally understand why Linux users swear by simple tools.&#8221;</strong></p>



<p class="wp-container-content-0733e5d0">The author has arrived at understanding. The reader has not. The reason is the hidden variable, and the only way to access it is to click.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;Package managers are the main reason I&#8217;ll never switch back to Windows.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">A bold claim with the mechanism removed. Why? What about package managers? The reason is latent.</p>
</div>



<h3 class="wp-block-heading">Latent Process — <em>How?</em></h3>



<p>The title presents an intriguing input and a desirable or unexpected output, but hides the method that connects them. The reader sees a before and an after with a gap in between.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I turned my old phone into a universal remote for my entire smart home.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">How? What app, what protocol, what steps? The transformation is stated as fact but the process is absent. The reader must click to learn the method.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;How a power drill defeated the Xbox 360&#8217;s console security.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The pairing of a crude physical tool with a sophisticated digital system is inherently surprising. The process that links them is the entire story, and it&#8217;s completely hidden.</p>
</div>



<h3 class="wp-block-heading">Latent Outcome — <em>What happened?</em></h3>



<p>The title sets up a scenario or experiment but cuts off before the resolution. The reader is dropped into a narrative with no ending.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I replaced all my productivity tools with a single app for a month.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">And? What happened? Did it work? Was it a disaster? The outcome is the only thing the reader wants, and it&#8217;s the only thing the title refuses to provide.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I ran local LLMs on a dying GPU and the results surprised me.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The word &#8220;surprised&#8221; is doing double duty — it confirms that an outcome exists and that it&#8217;s noteworthy, while revealing absolutely nothing about what it is. It is a content-free adjective masquerading as information.</p>
</div>



<p>Every clickbait title withholds at least one latent entity. Some withhold two — a title that hides both the process and the outcome forces the reader to pay twice for a single click. But the taxonomy is closed. Anything a title can hide maps to one of these four types: the subject (what?), the reason (why?), the process (how?), or the outcome (what happened?).</p>



<p>This isn&#8217;t a style guide or an editorial preference. It&#8217;s a structural property of how information is withheld to generate clicks.</p>



<h2 class="wp-block-heading">What happens after the click</h2>



<p>The damage doesn&#8217;t end with the transaction. Something happens cognitively when a reader lands on a page after a withholding title, and it isn&#8217;t engagement. It&#8217;s <a href="https://dejanmarketing.com/web-content/">scanning</a>.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="957" height="827" src="https://dejan.ai/wp-content/uploads/2026/03/image-15.png" alt="" class="wp-image-2338" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-15.png 957w, https://dejan.ai/wp-content/uploads/2026/03/image-15-300x259.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-15-768x664.png 768w" sizes="auto, (max-width: 957px) 100vw, 957px" /></figure>



<p>The reader arrives primed. They have a specific latent entity in mind — the hidden variable that brought them there — and their first instinct is to find it as fast as possible. They don&#8217;t read the introduction. They don&#8217;t absorb the context. They <a href="https://dejan.ai/blog/most-people-dont-read/">skip, skim, and scroll</a>, hunting for the one piece of information the title owed them.</p>



<p>This produces a jarring experience. The article, <a href="https://dejan.ai/blog/how-long-are-web-pages/">padded with backstory</a>, affiliate links, newsletter prompts, and SEO-optimized filler, is structured to delay the answer. The reader, already carrying the cognitive load of an unresolved question, is forced to work through friction that exists solely to generate more pageviews and ad impressions. The content&#8217;s structure and the reader&#8217;s intent are fundamentally misaligned.</p>



<p>The result is not engagement. It is extraction. The reader extracts the latent entity and leaves. The publisher extracts a pageview and an ad impression. Neither party has been well served. The reader resents the experience. The publisher has earned a visit but not trust.</p>



<h2 class="wp-block-heading">The ad-click economy made this rational</h2>



<p>None of this happened by accident. Withholding titles are the evolutionary product of an economy that rewards clicks over comprehension. When revenue is proportional to pageviews, every title becomes an optimization problem: maximize the probability of a click while minimizing the information given away for free.</p>



<p>Over two decades, this optimization produced the patterns we now see everywhere. Vague pronouns replaced specific nouns. Outcomes were teased but never stated. Reasons were promised but deferred. The entire craft of headline writing was reoriented from summarizing content to withholding it.</p>



<p>This was rational in a world where the title and the article were inseparable — where the only way to access the content was to visit the page. But that world is ending.</p>



<h2 class="wp-block-heading">AI changes the equation</h2>



<p>Large language models are rapidly becoming the <a href="https://dejan.ai/blog/llm-is-a-presentation-layer-in-ai-search/">intermediary layer between humans and content</a>. When a user asks an <a href="https://dejan.ai/blog/how-do-people-use-ai-assistants/">AI assistant a question</a>, the AI <a href="https://dejan.ai/blog/how-big-are-googles-grounding-chunks/">retrieves</a>, reads, and synthesizes sources on the user&#8217;s behalf. The human never visits the page. The click never happens. The latent entity is <a href="https://dejan.ai/blog/sro-grounding-snippets/">resolved by the model</a>, not by the reader.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="484" src="https://dejan.ai/wp-content/uploads/2026/03/image-16-1024x484.png" alt="" class="wp-image-2339" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-16-1024x484.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-16-300x142.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-16-768x363.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-16-1536x726.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-16.png 1883w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In this new architecture, withholding titles are not just exploitative. They are pointless and perhaps even <a href="https://dejan.ai/blog/sr/">harmful to visibility</a>. The AI doesn&#8217;t care about the information gap. It reads the article, extracts the answer, and delivers it without friction. The entire mechanism of clickbait — creating an artificial need that can only be resolved with a visit — collapses when the visitor is a machine that doesn&#8217;t see ads.</p>



<p>More importantly, AI systems can now decompose titles structurally, identify which latent entity is being withheld, check whether the article delivers on the title&#8217;s promise, and surface the answer directly. The asymmetry of information that clickbait depends on is being dissolved.</p>
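<p>That decomposition is mechanical enough to sketch. Below is a toy heuristic classifier for the four types &#8212; the patterns are illustrative guesses, not a production detector, and a real system would use an LLM rather than regexes:</p>



<pre class="wp-block-code"><code># Toy classifier for the four latent entity types. Patterns are
# illustrative guesses only; a real system would use an LLM.
import re

PATTERNS = {
    "latent_subject": r"\b(this one|this \w+ (extension|tool|app)|\d+ (tools|things|apps|ways))\b",
    "latent_reason": r"\b(why \w+|the (main )?reason|i finally understand)\b",
    "latent_process": r"\b(how (a|an|i|to)\b|i turned|i built)\b",
    "latent_outcome": r"\b(surprised me|what happened|the results?|for a (week|month|year))\b",
}

def latent_entities(title: str) -&gt; list[str]:
    """Return every latent entity type a title appears to withhold."""
    t = title.lower()
    return [name for name, pat in PATTERNS.items() if re.search(pat, t)]

print(latent_entities("I ran local LLMs on a dying GPU and the results surprised me"))
# -&gt; ['latent_outcome']</code></pre>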



<h2 class="wp-block-heading">A healthier paradigm</h2>



<p>If withholding titles evolved to serve the ad-click economy, then the question is: what should titles look like when that economy is no longer the only game?</p>



<p>The answer is straightforward. Titles should include the key information — the subject named, the reason stated, the outcome revealed — and invite the reader to explore further for depth, context, and nuance. The title earns the click by demonstrating value, not by ransoming it.</p>



<p>Consider the difference:</p>



<div class="wp-block-group has-background is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex" style="background-color:#eeded9">
<p class="wp-container-content-0733e5d0"><strong>&#8220;This one Docker tool finally fixed my reverse proxy headache&#8221;</strong> </p>



<p class="wp-container-content-0733e5d0">The subject is latent. <br>The reader must click to learn which tool.</p>
</div>



<div class="wp-block-group has-background is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex" style="background-color:#9bbf843b">
<p class="wp-container-content-0733e5d0"><strong>&#8220;Nginx Proxy Manager eliminated my reverse proxy headache — here&#8217;s my setup&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The subject is revealed. <br>The reader clicks to learn the details, not to discover what the tool is.</p>
</div>



<p>Both titles can generate traffic. But the second one respects the reader. It says: here is what I&#8217;m talking about, and if you want to know more, the article is worth your time. The first one says: I have something you want, and I won&#8217;t tell you what it is unless you pay me with your attention.</p>



<p>The second model is healthier for everyone. Readers arrive with aligned expectations instead of frustrated scanning instincts. Authors build trust instead of mining clicks. And the content itself can be structured around depth rather than around delaying the reveal. </p>



<h2 class="wp-block-heading">The web we could have</h2>



<p>Web authors have a choice. They can continue optimizing for an economy that is being disintermediated by AI, writing titles that withhold and articles that delay, hoping the click-and-ad model survives long enough to sustain them. Or they can recognize that the readers who remain — the ones who choose to visit a page when they <a href="https://dejan.ai/blog/human-friendly-content-is-ai-friendly-content/">could have asked an AI</a> — are the ones who deserve the most respect.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="588" src="https://dejan.ai/wp-content/uploads/2026/03/image-17-1024x588.png" alt="" class="wp-image-2340" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-17-1024x588.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-17-300x172.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-17-768x441.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-17.png 1267w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Those readers are not clicking because they were tricked. They&#8217;re clicking because they were informed. They know what the article is about. They want to go deeper. They trust the author enough to spend their time. And the <a href="https://dejan.ai/blog/caps/">money part</a> can be fixed too.</p>



<p>That is the audience worth building for. And it starts with killing the hidden variable.</p>



<pre class="wp-block-code alignwide has-contrast-background-color has-text-color has-background has-link-color has-small-font-size wp-elements-11272707f64f2d4a29f681518b294984" style="color:#65b831"><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">{</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"title"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "Clickbait Titles Exploit Attention Through Latent Entities"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"metadata"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: {</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"dimensions"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: &#91;</mark>
      "Clickbait titles exploit attention"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
      "Through latent entities"
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">],</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"attention_anchor"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "how"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"quantized"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "clickbait exploits attention by hiding one of four variable types"
<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">  },</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"how"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: &#91;</mark>
    "Every clickbait title withholds exactly one latent entity: subject (what?), reason (why?), process (how?), or outcome (what happened?)"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    "The click is the inference cost the reader pays to resolve the hidden variable"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    "AI dissolves this by reading the article and extracting the answer without the click"
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">],</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"promise_check"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: {</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"exploit attention"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "delivered — transactional mechanism explained with economic chain"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"through latent entities"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: </mark>"delivered — four-type taxonomy defined with examples"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"title practices what it preaches"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "delivered — subject revealed, mechanism stated, no hidden variable"
<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">  }
}</mark></code></pre>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/latent-entities/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Fanout Query Analysis</title>
		<link>https://dejan.ai/blog/fanout-query-analysis/</link>
					<comments>https://dejan.ai/blog/fanout-query-analysis/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Fri, 20 Mar 2026 01:58:01 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Keyword Research]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2314</guid>

					<description><![CDATA[When AI models like Gemini, GPT or Nova answer a question using web search, they don&#8217;t just run your query as-is. They generate their own internal search queries, or fanout queries. A single user prompt can trigger multiple fanout queries as the model breaks down the question, explores subtopics and verifies information. We captured 365,920 [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>When AI models like Gemini, GPT or Nova answer a question using web search, they don&#8217;t just run your query as-is. They generate their own internal search queries, or fanout queries. A single user prompt can trigger multiple fanout queries as the model breaks down the question, explores subtopics and verifies information.</p>



<p>We captured 365,920 of these fanout queries across three providers, Google (Gemini), OpenAI (GPT) and Amazon (Nova), by logging the grounding metadata returned from their APIs during citation mining runs. This data comes from real production workloads across multiple projects, not synthetic benchmarks.</p>
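<p>For reference, this is roughly what the capture looks like on the Google side with the <code>google-genai</code> SDK &#8212; a minimal sketch, with a placeholder model name and prompt:</p>



<pre class="wp-block-code"><code># Minimal sketch: log Gemini fanout queries from grounding metadata.
# Assumes GEMINI_API_KEY is set in the environment.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What are the best budget mechanical keyboards?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

meta = response.candidates[0].grounding_metadata
if meta and meta.web_search_queries:
    for query in meta.web_search_queries:
        print(query)  # each line is one fanout query</code></pre>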



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="731" src="https://dejan.ai/wp-content/uploads/2026/03/image-2-1024x731.png" alt="" class="wp-image-2315" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-2-1024x731.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-2-300x214.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-2-768x549.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-2.png 1400w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Below is an analysis of how these providers differ in the queries they generate.</p>



<figure class="wp-block-table has-small-font-size"><table><thead><tr><th class="has-text-align-center" data-align="center">Provider</th><th class="has-text-align-center" data-align="center">Count</th><th class="has-text-align-center" data-align="center">Avg Chars</th><th class="has-text-align-center" data-align="center">Min</th><th class="has-text-align-center" data-align="center">Max</th><th class="has-text-align-center" data-align="center">1-3 words</th><th class="has-text-align-center" data-align="center">4-6 words</th><th class="has-text-align-center" data-align="center">7+ words</th></tr></thead><tbody><tr><td class="has-text-align-center" data-align="center"><strong>Google</strong></td><td class="has-text-align-center" data-align="center">158,186</td><td class="has-text-align-center" data-align="center">52</td><td class="has-text-align-center" data-align="center">0</td><td class="has-text-align-center" data-align="center">252</td><td class="has-text-align-center" data-align="center">4.5%</td><td class="has-text-align-center" data-align="center">30.6%</td><td class="has-text-align-center" data-align="center">64.9%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>OpenAI</strong></td><td class="has-text-align-center" data-align="center">207,174</td><td class="has-text-align-center" data-align="center">60</td><td class="has-text-align-center" data-align="center">6</td><td class="has-text-align-center" data-align="center">323</td><td class="has-text-align-center" data-align="center">3.4%</td><td class="has-text-align-center" data-align="center">20.8%</td><td class="has-text-align-center" data-align="center">75.8%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>Amazon</strong></td><td class="has-text-align-center" data-align="center">560</td><td class="has-text-align-center" data-align="center">59</td><td class="has-text-align-center" data-align="center">28</td><td class="has-text-align-center" data-align="center">198</td><td class="has-text-align-center" data-align="center">0.2%</td><td class="has-text-align-center" data-align="center">16.2%</td><td class="has-text-align-center" data-align="center">83.6%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>Total</strong></td><td class="has-text-align-center" data-align="center">~365,920</td><td class="has-text-align-center" data-align="center">56</td><td class="has-text-align-center" data-align="center">0</td><td class="has-text-align-center" data-align="center">323</td><td class="has-text-align-center" data-align="center">3.9%</td><td class="has-text-align-center" data-align="center">25.0%</td><td class="has-text-align-center" data-align="center">71.1%</td></tr></tbody></table></figure>



<p><strong>Google (n=158,186)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>1</td><td class="has-text-align-right" data-align="right">53</td><td class="has-text-align-right" data-align="right">0.0%</td><td class="has-text-align-right" data-align="right">0.0%</td></tr><tr><td>2</td><td class="has-text-align-right" data-align="right">1,092</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.7%</td></tr><tr><td>3</td><td class="has-text-align-right" data-align="right">5,994</td><td class="has-text-align-right" data-align="right">3.8%</td><td class="has-text-align-right" data-align="right">4.5%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">14,916</td><td class="has-text-align-right" data-align="right">9.4%</td><td class="has-text-align-right" data-align="right">13.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">17,471</td><td class="has-text-align-right" data-align="right">11.0%</td><td class="has-text-align-right" data-align="right">25.0%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">15,923</td><td class="has-text-align-right" data-align="right">10.1%</td><td class="has-text-align-right" data-align="right">35.1%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">18,080</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">46.5%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">20,325</td><td class="has-text-align-right" data-align="right">12.8%</td><td class="has-text-align-right" data-align="right">59.3%</td></tr><tr><td>9</td><td class="has-text-align-right" data-align="right">20,013</td><td class="has-text-align-right" data-align="right">12.7%</td><td class="has-text-align-right" data-align="right">72.0%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">16,968</td><td class="has-text-align-right" data-align="right">10.7%</td><td class="has-text-align-right" data-align="right">82.7%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">11,740</td><td class="has-text-align-right" data-align="right">7.4%</td><td class="has-text-align-right" data-align="right">90.1%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">7,316</td><td class="has-text-align-right" data-align="right">4.6%</td><td class="has-text-align-right" data-align="right">94.8%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">4,043</td><td class="has-text-align-right" data-align="right">2.6%</td><td class="has-text-align-right" data-align="right">97.3%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">2,124</td><td class="has-text-align-right" data-align="right">1.3%</td><td class="has-text-align-right" data-align="right">98.7%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">1,146</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<p><strong>OpenAI (n=207,174)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>1</td><td class="has-text-align-right" data-align="right">616</td><td class="has-text-align-right" data-align="right">0.3%</td><td class="has-text-align-right" data-align="right">0.3%</td></tr><tr><td>2</td><td class="has-text-align-right" data-align="right">3,715</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">2.1%</td></tr><tr><td>3</td><td class="has-text-align-right" data-align="right">2,691</td><td class="has-text-align-right" data-align="right">1.3%</td><td class="has-text-align-right" data-align="right">3.4%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">7,360</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">6.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">14,516</td><td class="has-text-align-right" data-align="right">7.0%</td><td class="has-text-align-right" data-align="right">13.9%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">21,221</td><td class="has-text-align-right" data-align="right">10.2%</td><td class="has-text-align-right" data-align="right">24.2%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">26,544</td><td class="has-text-align-right" data-align="right">12.8%</td><td class="has-text-align-right" data-align="right">37.0%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">28,912</td><td class="has-text-align-right" data-align="right">14.0%</td><td class="has-text-align-right" data-align="right">51.0%</td></tr><tr><td>9</td><td class="has-text-align-right" data-align="right">27,861</td><td class="has-text-align-right" data-align="right">13.4%</td><td class="has-text-align-right" data-align="right">64.4%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">23,354</td><td class="has-text-align-right" data-align="right">11.3%</td><td class="has-text-align-right" data-align="right">75.7%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">17,875</td><td class="has-text-align-right" data-align="right">8.6%</td><td class="has-text-align-right" data-align="right">84.3%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">12,339</td><td class="has-text-align-right" data-align="right">6.0%</td><td class="has-text-align-right" data-align="right">90.3%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">7,983</td><td class="has-text-align-right" data-align="right">3.9%</td><td class="has-text-align-right" data-align="right">94.1%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">4,959</td><td class="has-text-align-right" data-align="right">2.4%</td><td class="has-text-align-right" data-align="right">96.5%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">5,228</td><td class="has-text-align-right" data-align="right">2.5%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<p><strong>Amazon (n=560)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>3</td><td class="has-text-align-right" data-align="right">1</td><td class="has-text-align-right" data-align="right">0.2%</td><td class="has-text-align-right" data-align="right">0.2%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">4</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">23</td><td class="has-text-align-right" data-align="right">4.1%</td><td class="has-text-align-right" data-align="right">5.0%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">64</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">16.4%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">102</td><td class="has-text-align-right" data-align="right">18.2%</td><td class="has-text-align-right" data-align="right">34.6%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">110</td><td class="has-text-align-right" data-align="right">19.6%</td><td class="has-text-align-right" data-align="right">54.3%</td></tr><tr><td>9</td><td class="has-text-align-right" data-align="right">113</td><td class="has-text-align-right" data-align="right">20.2%</td><td class="has-text-align-right" data-align="right">74.5%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">64</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">85.9%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">35</td><td class="has-text-align-right" data-align="right">6.2%</td><td class="has-text-align-right" data-align="right">92.1%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">20</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">95.7%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">9</td><td class="has-text-align-right" data-align="right">1.6%</td><td class="has-text-align-right" data-align="right">97.3%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">5</td><td class="has-text-align-right" data-align="right">0.9%</td><td class="has-text-align-right" data-align="right">98.2%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">10</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<h2 class="wp-block-heading"><strong>POS Distribution by Provider</strong></h2>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="439" src="https://dejan.ai/wp-content/uploads/2026/03/image-4-1024x439.png" alt="" class="wp-image-2321" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-4-1024x439.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-4-300x129.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-4-768x329.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-4.png 1400w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="788" height="679" src="https://dejan.ai/wp-content/uploads/2026/03/image-3.png" alt="" class="wp-image-2320" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-3.png 788w, https://dejan.ai/wp-content/uploads/2026/03/image-3-300x259.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-3-768x662.png 768w" sizes="auto, (max-width: 788px) 100vw, 788px" /></figure>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Group</th><th class="has-text-align-right" data-align="right">Google</th><th class="has-text-align-right" data-align="right">OpenAI</th><th class="has-text-align-right" data-align="right">Amazon</th></tr></thead><tbody><tr><td>Nouns</td><td class="has-text-align-right" data-align="right">52.3%</td><td class="has-text-align-right" data-align="right">58.4%</td><td class="has-text-align-right" data-align="right">50.2%</td></tr><tr><td>Verbs</td><td class="has-text-align-right" data-align="right">11.3%</td><td class="has-text-align-right" data-align="right">9.9%</td><td class="has-text-align-right" data-align="right">8.5%</td></tr><tr><td>Adjectives</td><td class="has-text-align-right" data-align="right">11.0%</td><td class="has-text-align-right" data-align="right">8.9%</td><td class="has-text-align-right" data-align="right">18.6%</td></tr><tr><td>Prepositions</td><td class="has-text-align-right" data-align="right">7.4%</td><td class="has-text-align-right" data-align="right">3.5%</td><td class="has-text-align-right" data-align="right">10.3%</td></tr><tr><td>Wh-words</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">2.1%</td><td class="has-text-align-right" data-align="right">1.5%</td></tr><tr><td>Numbers</td><td class="has-text-align-right" data-align="right">2.2%</td><td class="has-text-align-right" data-align="right">5.3%</td><td class="has-text-align-right" data-align="right">2.8%</td></tr><tr><td>Determiners</td><td class="has-text-align-right" data-align="right">2.6%</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">0.1%</td></tr><tr><td>Conjunctions</td><td class="has-text-align-right" data-align="right">1.6%</td><td class="has-text-align-right" data-align="right">0.6%</td><td class="has-text-align-right" data-align="right">2.4%</td></tr><tr><td>Adverbs</td><td class="has-text-align-right" data-align="right">0.6%</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">2.3%</td></tr><tr><td>Modals</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.5%</td><td class="has-text-align-right" data-align="right">0.0%</td></tr><tr><td>Pronouns</td><td class="has-text-align-right" data-align="right">1.2%</td><td class="has-text-align-right" data-align="right">0.9%</td><td class="has-text-align-right" data-align="right">0.1%</td></tr></tbody></table></figure>



<ul class="wp-block-list">
<li><strong>OpenAI is the most noun-heavy</strong> (58.4%), especially proper nouns (18.9% vs Google&#8217;s 8.6%) — it generates more entity-specific queries</li>



<li><strong>Amazon leans heavily into adjectives</strong> (18.6% vs ~10% for others) — more descriptive, qualifier-rich queries like &#8220;best,&#8221; &#8220;top,&#8221; &#8220;most effective&#8221;</li>



<li><strong>Google uses more wh-words and verbs</strong> — generates more question-style queries (&#8220;what,&#8221; &#8220;how,&#8221; &#8220;which&#8221;)</li>



<li><strong>OpenAI uses more than twice as many numbers</strong> (5.3% vs Google&#8217;s 2.2%) — likely year references and quantities in queries</li>
</ul>
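

<p>For reproducibility, the coarse grouping above can be approximated with off-the-shelf POS tagging. A sketch with NLTK &#8212; assumed tooling, since the post doesn&#8217;t name the tagger used:</p>



<pre class="wp-block-code"><code># Sketch: coarse POS grouping of queries with NLTK (assumed tooling).
from collections import Counter

import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # "_eng" suffix on newer NLTK

# Map Penn Treebank tags onto the coarse groups used in the table.
GROUPS = {
    "NN": "Nouns", "NNS": "Nouns", "NNP": "Nouns", "NNPS": "Nouns",
    "VB": "Verbs", "VBD": "Verbs", "VBG": "Verbs", "VBN": "Verbs",
    "VBP": "Verbs", "VBZ": "Verbs",
    "JJ": "Adjectives", "JJR": "Adjectives", "JJS": "Adjectives",
    "IN": "Prepositions",
    "WDT": "Wh-words", "WP": "Wh-words", "WP$": "Wh-words", "WRB": "Wh-words",
    "CD": "Numbers", "DT": "Determiners", "CC": "Conjunctions",
    "RB": "Adverbs", "RBR": "Adverbs", "RBS": "Adverbs",
    "MD": "Modals", "PRP": "Pronouns", "PRP$": "Pronouns",
}

def pos_distribution(queries):
    counts, total = Counter(), 0
    for q in queries:
        for _, tag in nltk.pos_tag(q.split()):  # whitespace tokens are fine for queries
            group = GROUPS.get(tag)
            if group:
                counts[group] += 1
            total += 1
    return {g: round(100 * c / total, 1) for g, c in counts.most_common()}

print(pos_distribution(["best budget mechanical keyboards 2026"]))</code></pre>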
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/fanout-query-analysis/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Reverse Prompting: Reconstructing Prompts from AI-Generated Text</title>
		<link>https://dejan.ai/blog/reverse-prompting/</link>
					<comments>https://dejan.ai/blog/reverse-prompting/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Wed, 18 Mar 2026 06:51:29 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2303</guid>

					<description><![CDATA[We fine-tuned Google&#8217;s Gemma 3 (270M) to reverse the typical LLM workflow: given an AI-generated response, the model reconstructs the most likely prompt that produced it. We generated 100,000 synthetic prompt-response pairs using Gemini 2.5 Flash, trained for a single epoch on a consumer GPU, and built a Streamlit app that sweeps 24 decoding configurations [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>We fine-tuned <a href="https://huggingface.co/google/gemma-3-270m">Google&#8217;s Gemma 3 (270M)</a> to reverse the typical LLM workflow: given an AI-generated response, the model reconstructs the most likely prompt that produced it. We generated 100,000 synthetic prompt-response pairs using Gemini 2.5 Flash, trained for a single epoch on a consumer GPU, and built a Streamlit app that sweeps 24 decoding configurations to produce ranked prompt candidates. The model demo runs on CPU and is <a href="https://dejan.ai/tools/reverse-prompter/">available here</a>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Idea</h2>



<p>Large language models take prompts and produce responses. We wanted to see if a small model could learn to do the opposite: take a response and work backwards to the prompt.</p>



<p>This isn&#8217;t about recovering the exact original prompt; the aim is to surface the most plausible prompts, ranked by model confidence. Think of it as asking: &#8220;What question would most naturally lead to this answer?&#8221;</p>



<h2 class="wp-block-heading">Training Data Generation</h2>



<p>The training pipeline has two stages, both powered by Gemini 2.5 Flash via Vertex AI.</p>



<p><strong>Stage 1: Prompt generation.</strong> We generated 100,000 diverse prompts across five categories designed to cover different user behaviours:</p>



<ul class="wp-block-list">
<li>Mid-tail, search query style (single or multi-faceted)</li>



<li>Long-tail, search query style (multi-faceted)</li>



<li>Simple, prompt-like (single-faceted)</li>



<li>Typical, prompt-like (single or multi-faceted)</li>



<li>Detailed, prompt-like (multi-faceted)</li>
</ul>



<p>Each API call generated a batch of 100 prompts as JSON with thinking disabled. We ran 100 concurrent calls, stored results in SQLite, and had the full dataset in minutes.</p>
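<p>A minimal sketch of that loop, assuming the <code>google-genai</code> SDK on Vertex AI &#8212; the batch prompt, project ID, and table schema here are illustrative, not the original pipeline:</p>



<pre class="wp-block-code"><code># Sketch of Stage 1: concurrent batch prompt generation into SQLite.
import asyncio
import json
import sqlite3

from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project", location="us-central1")

async def one_batch(category: str) -&gt; list[str]:
    r = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"Generate 100 diverse user prompts of this type: {category}. "
                 "Return a JSON array of strings.",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            thinking_config=types.ThinkingConfig(thinking_budget=0),  # thinking off
        ),
    )
    return json.loads(r.text)

async def main() -&gt; None:
    db = sqlite3.connect("prompts.db")
    db.execute("CREATE TABLE IF NOT EXISTS prompts (text TEXT)")
    batches = await asyncio.gather(*[one_batch("typical, prompt-like") for _ in range(100)])
    db.executemany(
        "INSERT INTO prompts VALUES (?)",
        [(p,) for batch in batches for p in batch],
    )
    db.commit()

asyncio.run(main())</code></pre>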



<p><strong>Stage 2: Response generation.</strong> Each of the 100,000 prompts was sent back to Gemini 2.5 Flash to produce a corresponding AI assistant response. Same concurrency, same speed. The result: 100,000 prompt-response pairs ready for training.</p>



<h2 class="wp-block-heading">Data Preparation</h2>



<p>The key design decision was how to format the training data. We needed the model to learn a clear boundary between the response (input) and the prompt (target). We settled on a simple separator:</p>



<pre class="wp-block-code has-large-font-size"><code>{response}\n###\n{prompt}&lt;eos&gt;</code></pre>



<p>During tokenization, we masked the loss over the response and separator tokens (setting labels to -100) so the model only learns to predict the prompt portion. This is critical: without masking, the model would waste capacity learning to reproduce the response text rather than focusing on the reverse mapping.</p>



<p>Sequences were capped at 2,048 tokens. Tokenization was batched in groups of 5,000 to manage memory, then concatenated into a single dataset.</p>
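<p>In code, the masking step looks roughly like this &#8212; a sketch with the Hugging Face tokenizer, with helper names that are ours, not the original training script (batching is omitted):</p>



<pre class="wp-block-code"><code># Sketch of the data prep: tokenize, then mask loss over response + separator.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")

SEP = "\n###\n"
MAX_LEN = 2048

def build_example(response: str, prompt: str) -&gt; dict:
    prefix_ids = tokenizer(response + SEP, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = target_ids + [tokenizer.eos_token_id]

    input_ids = (prefix_ids + target_ids)[:MAX_LEN]
    # -100 tells the loss function to ignore these positions, so the model
    # only learns to predict the prompt portion.
    labels = ([-100] * len(prefix_ids) + target_ids)[:MAX_LEN]
    return {"input_ids": input_ids, "labels": labels}</code></pre>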



<h2 class="wp-block-heading">Model Selection</h2>



<p>We chose Gemma 3 270M for several reasons:</p>



<ul class="wp-block-list">
<li><strong>Size.</strong>&nbsp;At 270M parameters, it&#8217;s small enough to train on a single consumer GPU and fast enough to run inference on CPU. This matters for a free demo.</li>



<li><strong>Architecture.</strong>&nbsp;Gemma 3 uses a mix of sliding window and full attention layers, giving it a good balance of local and global context within its 2,048 token training window.</li>



<li><strong>Capability.</strong>&nbsp;Despite its size, Gemma 3 270M has a 262K vocabulary and was pretrained on enough data to have reasonable language understanding out of the box.</li>
</ul>



<p>A larger model would almost certainly perform better, but the goal was a practical tool that could run anywhere, not a benchmark result.</p>



<h2 class="wp-block-heading">Training</h2>



<p>Training was straightforward. Full fine-tune, single epoch, on an NVIDIA RTX 4090.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Value</th></tr></thead><tbody><tr><td>Method</td><td>Full fine-tune</td></tr><tr><td>Precision</td><td>bfloat16</td></tr><tr><td>Batch size</td><td>2 (effective 16 with gradient accumulation)</td></tr><tr><td>Learning rate</td><td>5e-5</td></tr><tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr><tr><td>Warmup steps</td><td>100</td></tr><tr><td>Gradient checkpointing</td><td>Enabled</td></tr><tr><td>Training time</td><td>4 hours 14 minutes</td></tr></tbody></table></figure>



<p>One epoch was sufficient. The loss curve showed steady convergence without signs of underfitting, and we wanted to avoid overfitting on synthetic data where the model might memorise specific phrasing patterns rather than learning the general reverse mapping.</p>



<h2 class="wp-block-heading">Inference Strategy</h2>



<p>A single generation pass from the model produces one candidate prompt. To get a diverse set of candidates, we sweep across 24 contrastive search configurations by varying two parameters:</p>



<ul class="wp-block-list">
<li><code>top_k</code>: [2, 4, 6, 15]</li>



<li><code>penalty_alpha</code>: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]</li>
</ul>



<p>Contrastive search balances token probability with a degeneration penalty, which encourages diverse yet coherent outputs. Different configurations produce different candidate prompts from the same input.</p>
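<p>A sketch of the sweep, using the released model and the standard Hugging Face <code>generate()</code> contrastive search parameters (function names are illustrative):</p>



<pre class="wp-block-code"><code># Sketch: sweep 24 contrastive search configs to collect candidate prompts.
import itertools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dejanseo/reverse-prompter")
model = AutoModelForCausalLM.from_pretrained("dejanseo/reverse-prompter")

def candidate_prompts(response: str, max_new_tokens: int = 64) -&gt; list[str]:
    inputs = tokenizer(response + "\n###\n", return_tensors="pt")
    prefix_len = inputs["input_ids"].shape[1]
    out = []
    for top_k, alpha in itertools.product([2, 4, 6, 15],
                                          [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]):
        with torch.no_grad():
            ids = model.generate(
                **inputs,
                top_k=top_k,           # candidate pool size
                penalty_alpha=alpha,   # degeneration penalty
                max_new_tokens=max_new_tokens,
            )
        out.append(tokenizer.decode(ids[0, prefix_len:], skip_special_tokens=True))
    return out</code></pre>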



<p>Each candidate is then scored by perplexity: we run the full sequence (response + separator + generated prompt) through the model and compute the average token-level log probability over the prompt portion. Lower perplexity means the model finds that prompt more natural given the response.</p>
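<p>The scoring step, sketched with the same illustrative helper names:</p>



<pre class="wp-block-code"><code># Sketch: perplexity of the candidate prompt given the response.
import torch
import torch.nn.functional as F

def prompt_perplexity(model, tokenizer, response: str, prompt: str) -&gt; float:
    full = tokenizer(response + "\n###\n" + prompt, return_tensors="pt")
    prefix_len = tokenizer(response + "\n###\n", return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    ids = full["input_ids"][0]
    # Position i predicts token i+1, so shift by one before gathering.
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs[torch.arange(len(ids) - 1), ids[1:]]
    prompt_lp = token_lp[prefix_len - 1:]  # only the prompt portion
    return torch.exp(-prompt_lp.mean()).item()  # lower = more natural</code></pre>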



<p>The top 10 candidates are displayed with per-token confidence visualisation, where each word&#8217;s opacity reflects how confident the model was in predicting it.</p>



<h2 class="wp-block-heading">The Tool</h2>



<p>The Streamlit app has two modes.</p>



<p><strong>Paste mode</strong> is the primary interface. Paste any AI-generated text, click Reconstruct Prompts, and the model generates ranked candidates. The results include a prompt table with perplexity scores and per-token confidence bar charts, a key phrases panel that extracts the most important shared phrases across candidates, and a word frequency heatmap.</p>



<p><strong>URL mode</strong> is experimental. Enter a URL and the app scrapes the page content via the DataForSEO API, converts it to markdown, and runs it through the model. This isn&#8217;t the intended use case since the model was trained on AI assistant responses, not web pages. But it produces interesting results: the reconstructed &#8220;prompts&#8221; reveal what the model considers the core semantic intent of the page content. It&#8217;s less prompt reconstruction and more semantic summarisation through the lens of &#8220;what question would this page answer?&#8221;</p>



<h2 class="wp-block-heading">Possible Uses</h2>



<p><strong>Prompt engineering.</strong> Understanding what prompts lead to certain outputs helps refine prompt design. If you have an output you like, reverse prompting can suggest more efficient or precise ways to get there.</p>



<p><strong>Content analysis.</strong> Running web content through the model reveals what the model perceives as the core intent behind the text. This could be useful for understanding how AI models interpret and categorise content.</p>



<p><strong>AI content forensics.</strong> While this isn&#8217;t a detector (it doesn&#8217;t classify text as AI-generated or not), the confidence scores and perplexity values could serve as signals. Text that was genuinely produced by an AI assistant in response to a clear prompt may produce lower-perplexity reconstructions than text that wasn&#8217;t.</p>



<p><strong>Training data curation.</strong> When building datasets, reverse prompting can help verify that responses actually match their intended prompts, or surface cases where the mapping is ambiguous.</p>



<h2 class="wp-block-heading">Insights</h2>



<p>A few things we noticed during development:</p>



<p><strong>Synthetic data works.</strong> The model was trained entirely on Gemini-generated data and generalises to outputs from other models. The reverse mapping from response to prompt is more about structure and intent than model-specific quirks.</p>



<p><strong>Small models can learn non-trivial mappings.</strong> At 270M parameters, this model is tiny by current standards. Yet it reliably produces sensible prompt reconstructions. The task is well-constrained enough that a small model can handle it.</p>



<p><strong>Diversity in decoding matters more than model size.</strong> The contrastive search sweep across 24 configurations produces more useful results than a single greedy decode from a larger model would. The ranking by perplexity then surfaces the best candidates.</p>



<p><strong>The separator matters.</strong> We tested several formats. The simple <code>\n###\n</code> separator worked best, likely because it&#8217;s distinct enough that the model learns a clean boundary between input and output.</p>



<p>The model and code are available on <a href="https://huggingface.co/dejanseo/reverse-prompter" target="_blank" rel="noreferrer noopener">Hugging Face</a>, and a live demo runs at <a href="https://dejan.ai/tools/reverse-prompter/">https://dejan.ai/tools/reverse-prompter/</a>.</p>



<div class="wp-block-buttons alignfull is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button" href="https://dejan.ai/tools/reverse-prompter/">DEMO</a></div>
</div>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/reverse-prompting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Rufus &#8211; Under the Hood. What Drives Amazon’s AI Shopping Assistant?</title>
		<link>https://dejan.ai/blog/rufus/</link>
					<comments>https://dejan.ai/blog/rufus/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sun, 15 Mar 2026 01:11:45 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[eCommerce]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2290</guid>

					<description><![CDATA[What’s Publicly Known About the Pipeline, Backend, and Response Anatomy. Rufus is not “one model that magically answers.” Public Amazon/AWS descriptions point to a multi-component system: Speculative schema: Pipeline: request → answer Step A — Input + context assembly Public descriptions indicate customers can: Amazon also describes using conversational context and (more recently) account memory [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading">What’s Publicly Known About the Pipeline, Backend, and Response Anatomy.</h3>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Rufus is not “one model that magically answers.” Public Amazon/AWS descriptions point to a <strong>multi-component system</strong>:</p>



<ul class="wp-block-list">
<li><strong>A query planning / classification layer</strong> (Amazon/AWS call out a “query planner (QP) model”)</li>



<li><strong>Retrieval</strong> across multiple Amazon-owned sources (catalog, reviews, community Q&amp;A, Stores APIs) and sometimes web sources</li>



<li><strong>A foundation LLM</strong> that generates the natural-language response</li>



<li><strong>A streaming + rendering layer</strong> that formats answers and “hydrates” them with live store data</li>



<li><strong>Feedback-driven improvement</strong> (reinforcement learning from customer feedback)</li>
</ul>



<p>Speculative schema:</p>



<pre class="wp-block-code has-small-font-size"><code>User question
  -&gt; Query Planner (intent + retrieval plan)
    -&gt; Retrieval (catalog/reviews/Q&amp;A/Stores APIs/(sometimes web))
      -&gt; Foundation LLM (answer generation + display directives)
        -&gt; Streaming response (token-by-token)
          -&gt; Hydration (fill in product cards, prices, etc via internal systems)
            -&gt; Client UI (chat text + cards + actions + suggested questions)</code></pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Pipeline: request → answer</h2>



<h3 class="wp-block-heading">Step A — Input + context assembly</h3>



<p>Public descriptions indicate customers can:</p>



<ul class="wp-block-list">
<li>Type or speak questions in the Amazon Shopping app search bar / assistant chat bar</li>



<li>Start from <strong>pre-populated / suggested questions</strong> in the UI</li>



<li>Ask questions either broadly (“what do I need for…”) or specifically on a product page (where the product detail context matters)</li>
</ul>



<p>Amazon also describes using <strong>conversational context</strong> and (more recently) <strong>account memory</strong> features for personalization.</p>



<h3 class="wp-block-heading">Step B — Query planning (QP) before generation</h3>



<p>AWS’s ML blog post describes Rufus as having:</p>



<ul class="wp-block-list">
<li>A <strong>foundation LLM</strong> (for response generation)</li>



<li>A <strong>query planner (QP) model</strong> for <strong>query classification and retrieval enhancement</strong></li>



<li>QP is “on the critical path” because the system <strong>can’t start token generation</strong> until QP finishes</li>
</ul>



<p>That implies a gate: <strong>planning first</strong>, then generation.</p>



<h3 class="wp-block-heading">Step C — Retrieval-augmented generation (RAG)</h3>



<p>Amazon Science describes Rufus using <strong>retrieval‑augmented generation (RAG)</strong>:</p>



<ul class="wp-block-list">
<li>Before generating a response, the LLM <strong>selects information</strong> it expects will help answer the question.</li>



<li>Evidence sources explicitly called out include:</li>



<li><strong>Customer reviews</strong></li>



<li><strong>The product catalog</strong></li>



<li><strong>Community Q&amp;A</strong></li>



<li><strong>Stores APIs</strong> (calls to internal store systems)</li>
</ul>



<p>About Amazon also describes using RAG to pull “insights and recommendations” from “popular sources” for some product/trend questions (they name examples like major publications).</p>



<p>What’s not disclosed publicly:</p>



<ul class="wp-block-list">
<li>How retrieval is ranked across sources</li>



<li>The retrieval index design</li>



<li>Exact prompting / grounding format</li>



<li>Exact guardrails for what external web content can be used and how</li>
</ul>



<h3 class="wp-block-heading">Step D — Response generation (LLM)</h3>



<p>Amazon Science says the team built a <strong>custom LLM specialized for shopping</strong>, trained primarily on shopping data (catalog + reviews + community Q&amp;A) plus curated public web information.</p>



<p>About Amazon also describes a <strong>model-mix</strong> approach:</p>



<ul class="wp-block-list">
<li>Built on <strong>Amazon Bedrock</strong></li>



<li>Using a <strong>real-time router</strong> that can select among multiple LLMs (they explicitly name models like Anthropic’s Claude Sonnet, Amazon Nova, plus a custom model)</li>
</ul>



<p>So the public picture is: <strong>custom shopping model exists</strong>, and there may also be <strong>dynamic model selection</strong> depending on query type / latency / quality targets.</p>



<h3 class="wp-block-heading">Step E — Streaming + “hydration” + UI rendering</h3>



<p>Amazon Science describes a “streaming architecture”:</p>



<ul class="wp-block-list">
<li>Responses are <strong>streamed token-by-token</strong> (so the user sees the beginning while the rest is still generating).</li>



<li>The system “hydrates” the response by <strong>querying internal systems</strong> to populate the stream with the right data.</li>



<li>Crucially: Rufus is trained to generate <strong>markup instructions</strong> specifying how answer elements should be displayed, not just the text.</li>
</ul>



<p>This is the key “anatomy of a Rufus response” insight: <strong>the model output is both content and layout directives</strong>, and the backend fills in live store objects (prices, items, links, etc.) during streaming.</p>



<p>What’s not disclosed publicly:</p>



<ul class="wp-block-list">
<li>The markup language/schema</li>



<li>The exact rendering protocol between model ↔ hydrator ↔ client</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Backend: training data, infra, and latency engineering</h2>



<h3 class="wp-block-heading">Training data and preparation (what Amazon has said)</h3>



<p>Amazon Science states Rufus was trained with:</p>



<ul class="wp-block-list">
<li>The <strong>entire Amazon catalog</strong></li>



<li><strong>Customer reviews</strong></li>



<li><strong>Community Q&amp;A posts</strong></li>



<li>Curated <strong>public web information</strong></li>
</ul>



<p>And that Amazon used:</p>



<ul class="wp-block-list">
<li><strong>Amazon EMR</strong> for large-scale distributed data processing</li>



<li><strong>Amazon S3</strong> for storage</li>
</ul>



<h3 class="wp-block-heading">Inference infrastructure: Trainium/Inferentia + compiler optimizations</h3>



<p>Amazon Science describes serving at Amazon scale using:</p>



<ul class="wp-block-list">
<li>AWS chips <strong>Trainium</strong> and <strong>Inferentia</strong></li>



<li>Collaboration with the <strong>Neuron compiler</strong> team for inference optimizations</li>



<li><strong>Continuous batching</strong> to improve throughput/latency (described as making scheduling/routing decisions after every generated token so new requests can start as soon as earlier ones finish)</li>
</ul>



<h3 class="wp-block-heading">Prime Day scale + “parallel decoding” for QP latency</h3>



<p>AWS’s ML blog post goes much deeper on one backend component (the <strong>QP model</strong>) and performance engineering:</p>



<ul class="wp-block-list">
<li>Prime Day demands described include very high query rates and tight latency SLOs for QP.</li>



<li>They describe using “draft‑centric speculative decoding” / “parallel decoding”:</li>



<li>Extending the base model with <strong>multiple decoding heads</strong> to predict multiple future tokens in parallel</li>



<li>A <strong>tree-based attention</strong> mechanism to verify/integrate predicted tokens</li>



<li>Deployed using AWS infrastructure + chips (Trainium/Inferentia), and mentions integration details (for example, they mention Triton Inference Server support and Neuron-related frameworks).</li>
</ul>



<p>This is one of the clearest official public descriptions of “backend mechanics” for Rufus, specifically for the planning model that sits <em>before</em> the user sees the first chunk of an answer.</p>
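<p>For intuition, here is a toy version of the generic draft-and-verify idea behind such parallel decoding. This is a pedagogical sketch only; real systems (including, presumably, Rufus&#8217;s QP stack) verify all drafted positions in a single batched forward pass rather than one token at a time:</p>



<pre class="wp-block-code has-small-font-size"><code># Toy draft-and-verify ("speculative") decoding. Pedagogical only: the
# verification loop here is sequential, whereas real implementations
# check all k drafted positions in one parallel forward pass.

def speculative_step(base_model, draft_model, tokens, k=4):
    # 1) A cheap draft proposes k future tokens.
    ctx = list(tokens)
    proposed = []
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) The base model verifies; keep the longest prefix it agrees with.
    accepted, ctx = [], list(tokens)
    for t in proposed:
        if base_model(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # 3) Emit at least one base-model token so progress is guaranteed.
    if len(accepted) &lt; k:
        accepted.append(base_model(ctx))
    return tokens + accepted

# Toy "models": next token is a function of the running sum.
base = lambda toks: (sum(toks) + 1) % 50
draft = lambda toks: (sum(toks) + 3) % 50  # imperfect draft
print(speculative_step(base, draft, [3, 7, 11]))</code></pre>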



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Response format: what users see vs what the system likely contains</h2>



<h3 class="wp-block-heading">What the user-visible response can include (publicly described)</h3>



<p>Across Amazon’s public descriptions, Rufus responses can include:</p>



<ul class="wp-block-list">
<li><strong>Long-form explanations</strong> (e.g., product category advice)</li>



<li><strong>Short-form answers</strong></li>



<li><strong>Clickable links</strong> to navigate the store</li>



<li><strong>Product recommendations</strong> (often rendered as product cards)</li>



<li><strong>Comparisons</strong> (e.g., “compare OLED vs QLED”)</li>



<li><strong>Suggested follow-up questions</strong> surfaced in the chat UI</li>



<li>“<strong>What do customers say?</strong>” style review summaries / highlights</li>



<li><strong>Price-history and deal features</strong> (including price tracking/alerts), plus cart actions in newer “agentic” iterations</li>
</ul>



<h3 class="wp-block-heading">What the backend response likely contains</h3>



<p>Based on Amazon’s own wording (“markup instructions” + “hydration” + token streaming), the response payload is best thought of as:</p>



<ul class="wp-block-list">
<li>A <strong>streamed text channel</strong> (tokens)</li>



<li>A <strong>structured directive channel</strong> (layout + which UI modules to render)</li>



<li><strong>Hydration lookups</strong> that fill directives with authoritative store data (products, prices, shipping, deal status, etc.)</li>
</ul>



<p>Amazon has not published the schema, so any JSON examples would be guesswork.</p>
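


<p>Purely to visualise the three channels, though, here is an invented payload shape expressed as a Python literal. To be clear: this is guesswork by construction, and none of the field names come from Amazon:</p>



<pre class="wp-block-code"><code># Hypothetical payload shape. Invented for illustration: NOT Amazon's schema.
hypothetical_payload = {
    "text_stream": ["For", " trail", " running", ", ", "look", " for", "..."],
    "directives": [  # structured channel: which UI modules to render, where
        {"module": "product_card", "slot": 1, "ref": "ASIN_PLACEHOLDER"},
        {"module": "followup_questions", "items": ["...?", "...?"]},
    ],
    "hydration": {  # filled in from authoritative store data at render time
        "ASIN_PLACEHOLDER": {"price": None, "shipping": None, "deal": None},
    },
}</code></pre>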



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">What’s not public</h2>



<ul class="wp-block-list">
<li>Exact model architectures/sizes for the custom model(s)</li>



<li>The router policy (how it chooses among models)</li>



<li>Exact retrieval ranking, indexing, and grounding format</li>



<li>The markup instruction language/schema</li>



<li>Safety/guardrail implementation details (beyond high-level “reliable sources” language)</li>



<li>Full evaluation suite and offline metrics used to ship changes</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Sources</h2>



<p>Below are official sources only (Amazon Science, AWS, About Amazon Press Center, Investor Relations).</p>



<h3 class="wp-block-heading">Technical deep dives</h3>



<pre class="wp-block-code has-small-font-size"><code>Amazon Science (Blog): “The technology behind Amazon’s GenAI-powered shopping assistant, Rufus” (Oct 4, 2024)
https://www.amazon.science/blog/the-technology-behind-amazons-genai-powered-shopping-assistant-rufus

AWS Machine Learning Blog: “How Rufus doubled their inference speed and handled Prime Day traffic with AWS AI chips and parallel decoding” (May 28, 2025)
https://aws.amazon.com/blogs/machine-learning/how-rufus-doubled-their-inference-speed-and-handled-prime-day-traffic-with-aws-ai-chips-and-parallel-decoding/</code></pre>



<h3 class="wp-block-heading">Product/feature announcements &amp; official descriptions</h3>



<pre class="wp-block-code has-small-font-size"><code>About Amazon (Retail): “Amazon’s next-gen AI assistant for shopping is now even smarter, more capable, and more helpful”
https://www.aboutamazon.com/news/retail/amazon-rufus-ai-assistant-personalized-shopping-features

About Amazon (Retail): “How to use Rufus to check price history, find deals, auto-buy items at target prices, and more”
https://www.aboutamazon.com/news/retail/how-to-use-amazon-shopping-ai-assistant

About Amazon (Retail): “How customers are making more informed shopping decisions with Rufus…”
https://www.aboutamazon.com/news/retail/how-to-use-amazon-rufus

About Amazon (Retail): “Rufus is now available to all U.S. customers…” (amazon.com page linked from About Amazon)
https://www.amazon.com/b?node=23404839011</code></pre>



<h3 class="wp-block-heading">Press releases / investor communications</h3>



<pre class="wp-block-code has-small-font-size"><code>Amazon Investor Relations: “Amazon.com Announces Fourth Quarter Results” (Feb 01, 2024) — includes the initial public mention of Rufus beta rollout
https://ir.aboutamazon.com/news-release/news-release-details/2024/Amazon.com-Announces-Fourth-Quarter-Results/

About Amazon Press Center (US): “Amazon Bedrock launches new capabilities…” (Apr 2024) — includes a Rufus description and quote
https://press.aboutamazon.com/2024/4/amazon-bedrock-launches-new-capabilities-as-tens-of-thousands-of-customers-choose-it-as-the-foundation-to-build-and-scale-secure-generative-ai-applications

About Amazon Press Center (US): “Amazon Announces Record-Breaking Sales for 2024 Prime Day Event” (Jul 18, 2024) — notes Rufus helping millions of customers
https://press.aboutamazon.com/2024/7/amazon-announces-record-breaking-sales-for-2024-prime-day-event

Amazon Investor Relations: “Amazon.com Announces Fourth Quarter Results” (2026 release page) — mentions agentic Rufus / Buy For Me
https://ir.aboutamazon.com/news-release/news-release-details/2026/Amazon-com-Announces-Fourth-Quarter-Results/default.aspx</code></pre>



<h3 class="wp-block-heading">Amazon Science research papers</h3>



<p>These are not “Rufus documentation,” but they map directly to components Amazon describes (question suggestion, comparisons, RAG planning, preference extraction).</p>



<pre class="wp-block-code has-small-font-size"><code>Publication (SIGIR 2024): “Question suggestion for conversational shopping assistants using product metadata”
https://www.amazon.science/publications/question-suggestion-for-conversational-shopping-assistants-using-product-metadata

PDF (SIGIR 2024):
https://assets.amazon.science/42/6e/c7c7aed9433d87fd1ab1f8bef4ff/question-suggestion-for-conversational-shopping-assistants-using-product-metadata.pdf

Publication (WSDM 2023): “Generating explainable product comparisons for online shopping”
https://www.amazon.science/publications/generating-explainable-product-comparisons-for-online-shopping

Publication (CIKM 2024): “REAPER: Reasoning based retrieval planning for complex RAG systems”
https://www.amazon.science/publications/reaper-reasoning-based-retrieval-planning-for-complex-rag-systems

Publication (EMNLP 2024): “PEARL: Preference extraction with exemplar augmentation and retrieval with LLM agents”
https://www.amazon.science/publications/pearl-preference-extraction-with-exemplar-augmentation-and-retrieval-with-llm-agents

Publication (2024): “Meta knowledge for retrieval augmented large language models”
https://www.amazon.science/publications/meta-knowledge-for-retrieval-augmented-large-language-models</code></pre>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/rufus/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Is Query Length a Reliable Predictor of Search Volume?</title>
		<link>https://dejan.ai/blog/query-length-vs-volume/</link>
					<comments>https://dejan.ai/blog/query-length-vs-volume/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Thu, 12 Mar 2026 00:29:29 +0000</pubDate>
				<category><![CDATA[eCommerce]]></category>
		<category><![CDATA[Keyword Research]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2278</guid>

					<description><![CDATA[The answer is no. There&#8217;s a widely held intuition in SEO and ecommerce search: short queries have high volume, long queries have low volume. &#8220;laptop&#8221; gets millions of searches. &#8220;left handed ergonomic vertical mouse wireless&#8221; does not. It feels obvious. But is query length actually a reliable predictor of search volume? Or is it a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><strong>The answer is no.</strong></p>



<p>There&#8217;s a widely held intuition in SEO and ecommerce search: short queries have high volume, long queries have low volume. &#8220;laptop&#8221; gets millions of searches. &#8220;left handed ergonomic vertical mouse wireless&#8221; does not. It feels obvious.</p>



<p>But is query length actually a <em>reliable</em> predictor of search volume? Or is it a convenient heuristic that falls apart under scrutiny?</p>



<p>I tested this using 39.6 million unique Amazon search queries with known volume data, spanning everything from head terms like &#8220;airpods&#8221; to long-tail queries like &#8220;replacement gasket for instant pot duo 8 quart.&#8221; The results surprised me.</p>



<div class="wp-block-buttons alignwide is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button">Try Our Query Volume Classifier</a></div>
</div>



<h2 class="wp-block-heading">The Setup</h2>



<p>I bucketed queries into five volume classes based on their occurrence count across nearly 400 million Amazon search sessions:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Class</th><th>Occurrences</th><th>Unique Queries</th></tr></thead><tbody><tr><td>Very High</td><td>10,000+</td><td>~18K</td></tr><tr><td>High</td><td>1,000–9,999</td><td>~30K</td></tr><tr><td>Medium</td><td>100–999</td><td>~321K</td></tr><tr><td>Low</td><td>10–99</td><td>~4.6M</td></tr><tr><td>Very Low</td><td>&lt;10</td><td>~34.7M</td></tr></tbody></table></figure>



<p>Then I measured two simple length metrics — character count and word count — across a balanced sample of 5,000 queries per class. The question: can you predict volume class from length alone?</p>
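


<p>The length features themselves are trivial to compute. A minimal sketch of the measurement, using a stand-in for the balanced sample (class labels taken from the examples discussed in this post):</p>



<pre class="wp-block-code"><code># Minimal sketch of the length measurement. The four rows are a stand-in
# for the balanced sample; labels follow the examples discussed in the post.
import pandas as pd

df = pd.DataFrame({
    "query": ["laptop", "wireless mouse", "cast iron skillet 12 inch",
              "replacement gasket for instant pot duo 8 quart"],
    "volume_class": ["very_high", "very_high", "medium", "very_low"],
})
df["chars"] = df["query"].str.len()
df["words"] = df["query"].str.split().str.len()

print(df.groupby("volume_class")[["chars", "words"]].agg(["mean", "median"]))</code></pre>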



<h2 class="wp-block-heading">The Averages Look Promising</h2>



<p>At first glance, the data confirms the intuition. There&#8217;s a clean trend:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Volume Class</th><th>Avg Characters</th><th>Avg Words</th><th>Median Characters</th></tr></thead><tbody><tr><td>Very High</td><td>16.0</td><td>2.6</td><td>16</td></tr><tr><td>High</td><td>17.2</td><td>2.8</td><td>16</td></tr><tr><td>Medium</td><td>19.6</td><td>3.2</td><td>19</td></tr><tr><td>Low</td><td>22.3</td><td>3.7</td><td>21</td></tr><tr><td>Very Low</td><td>23.2</td><td>3.9</td><td>22</td></tr></tbody></table></figure>



<p>Very high volume queries average 16 characters and 2.6 words. Very low volume queries average 23 characters and 3.9 words. The pattern is monotonic and statistically significant (p ≈ 0). Case closed?</p>



<p>Not quite.</p>



<h2 class="wp-block-heading">The Distributions Tell a Different Story</h2>



<p>The problem becomes obvious when you look at the actual distributions instead of the averages. The character count distributions for all five classes overlap <em>almost entirely</em>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="569" src="https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-1024x569.png" alt="" class="wp-image-2279" srcset="https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-1024x569.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-300x167.png 300w, https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-768x427.png 768w, https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-1536x853.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-2048x1138.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<ul class="wp-block-list">
<li>A 14-character query could be very high volume (&#8220;wireless mouse&#8221;) or very low volume (&#8220;purple cat bed&#8221;)</li>



<li>A two- or three-word query could be anything from very high (&#8220;protein powder&#8221;) to very low (&#8220;bamboo utensil set&#8221;)</li>



<li>The median difference between very high and very low is only 6 characters</li>
</ul>



<p>When every class shares most of the same length range, length simply can&#8217;t discriminate between them.</p>



<h2 class="wp-block-heading">Quantifying the Failure</h2>



<p>To put a number on it, I built simple heuristic classifiers — one using character count, one using word count — that bin queries into volume classes based on percentile thresholds. For a fair comparison, I also trained a DeBERTa language model on the same data to predict volume class from the query text itself.</p>
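


<p>Concretely, the length heuristic looks something like this (a sketch: thresholds are the quintiles of training-set lengths, which are the natural bins given the balanced five-class design):</p>



<pre class="wp-block-code"><code># Quintile-threshold heuristic: a sketch of the length baseline described above.
import numpy as np

# Shortest lengths map to the highest-volume class, matching the observed trend.
CLASSES = ["very_high", "high", "medium", "low", "very_low"]

def fit_thresholds(train_lengths):
    # 20/40/60/80th percentiles split the lengths into five bins
    return np.percentile(train_lengths, [20, 40, 60, 80])

def predict(length, thresholds):
    return CLASSES[int(np.searchsorted(thresholds, length, side="right"))]

train = ["tv", "dog food", "usb c hub", "cast iron skillet 12 inch",
         "replacement gasket for instant pot duo 8 quart"]  # toy training set
th = fit_thresholds([len(q) for q in train])
print(predict(len("laptop"), th))  # short query, binned as very_high</code></pre>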



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="693" height="736" src="https://dejan.ai/wp-content/uploads/2026/03/image-1.png" alt="" class="wp-image-2288" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-1.png 693w, https://dejan.ai/wp-content/uploads/2026/03/image-1-282x300.png 282w" sizes="auto, (max-width: 693px) 100vw, 693px" /></figure>



<p>The results:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Method</th><th>Accuracy</th><th>Spearman Correlation</th></tr></thead><tbody><tr><td><strong>DeBERTa model</strong></td><td><strong>72.1%</strong></td><td><strong>0.896</strong></td></tr><tr><td>Word count heuristic</td><td>25.4%</td><td>-0.345</td></tr><tr><td>Char count heuristic</td><td>24.9%</td><td>-0.336</td></tr></tbody></table></figure>



<p>The length heuristics achieved roughly 25% accuracy — barely above random chance for a 5-class problem (20%). The Spearman correlation between true volume class and query length is only -0.34. For comparison, the trained model achieved 0.90.</p>



<p>The agreement rate between the model&#8217;s predictions and the length heuristic&#8217;s predictions? Just 24–25%. They mostly disagree, meaning the model is learning something fundamentally different from query length.</p>



<h2 class="wp-block-heading">What Does the Model Actually Learn?</h2>



<p>If not length, what signals is the model picking up? Looking at its predictions reveals some patterns:</p>



<p><strong>Brand recognition.</strong> &#8220;airpods&#8221; (9 chars) → very high. The model learns that certain brand names are inherently high-volume. A character-count heuristic has no concept of brand equity.</p>



<p><strong>Category head terms.</strong> &#8220;laptop&#8221; and &#8220;headphones&#8221; and &#8220;dog food&#8221; — the model recognizes generic product categories that serve as entry points for broad shopping intent. These are short, but their volume comes from <em>being category names</em>, not from being short.</p>



<p><strong>Specificity markers.</strong> &#8220;cast iron skillet 12 inch&#8221; → medium. &#8220;replacement gasket for instant pot duo 8 quart&#8221; → very low. Both are moderately long, but the model distinguishes them based on how many qualifiers narrow the intent. Size specifications, compatibility constraints, and material callouts are signals of niche demand.</p>



<p><strong>The middle is messy.</strong> The model struggles most with the low class (F1: 0.39), which sits in an ambiguous zone between medium and very low. These queries are often 3–4 words, moderately specific, and could plausibly land in either adjacent bucket. This is arguably a labeling boundary problem more than a modeling problem.</p>



<h2 class="wp-block-heading">Why the Intuition Persists</h2>



<p>The &#8220;short = high volume&#8221; heuristic isn&#8217;t <em>wrong</em> — it&#8217;s just <em>weak</em>. There is a real negative correlation between length and volume. The averages are monotonic. If you had to make a single binary bet — &#8220;is this 2-word query higher volume than this 7-word query?&#8221; — you&#8217;d be right more often than not.</p>



<p>But for any practical application — keyword prioritization, bid optimization, content strategy — a 25% accuracy classifier is useless. You&#8217;d misclassify three out of four queries.</p>



<p>The fundamental issue is that query length is a <em>confounded</em> signal. Short queries aren&#8217;t high volume <em>because</em> they&#8217;re short. They&#8217;re high volume because they tend to be generic category terms or popular brand names, and those things happen to be expressible in few words. The causal arrow runs from semantic content to volume, with length as a side effect.</p>



<h2 class="wp-block-heading">The &#8216;Nonsense Test&#8217;</h2>



<p>As a final sanity check, I ran the model on completely made-up queries of varying lengths. If the model were simply learning &#8220;short = high volume,&#8221; nonsensical short queries should still predict high volume. They don&#8217;t.</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>Query                                              Prediction   Conf
--------------------------------------------------------------------
zxqwv                                                very_low  52.9%
blorf                                                very_low  50.0%
aa                                                       high  55.8%
flurb snax                                           very_low  63.1%
gleep borp                                           very_low  54.6%
wonky plim dazzle                                    very_low  50.3%
grax tooble fent                                     very_low  57.6%
blorpy zint crumble woft                             very_low  59.3%
quax shimble trogg fleem narg                        very_low  59.9%
zixo tramble woft greel spunt naffle blorvish        very_low  62.5%
wireless blorf adapter                               very_low  64.5%
organic flurb capsules                               very_low  72.9%
replacement grax for shimble 8 quart                 very_low  76.2%
x                                                        high  93.1%
q                                                        high  91.9%
asdfghjkl                                            very_low  52.4%
aaa bbb ccc ddd eee fff ggg                          very_low  57.5%</code></pre>



<p>Nearly every nonsensical query &#8212; regardless of length &#8212; is classified as very low volume. One-word gibberish like &#8220;blorf&#8221; and &#8220;zxqwv&#8221; isn&#8217;t mistaken for a head term just because it&#8217;s short.</p>



<p>The exceptions are telling. &#8220;x&#8221; and &#8220;q&#8221; predict high with 93% confidence — because single-letter searches are genuinely common on Amazon (people search &#8220;q&#8221; for Q-tips, &#8220;x&#8221; for Xbox). &#8220;aa&#8221; predicts high because AA batteries are a real product. The model has learned <em>what people actually search for</em>, not how many characters they typed.</p>



<p>Meanwhile, queries with real English structure but nonsense nouns — &#8220;wireless blorf adapter,&#8221; &#8220;organic flurb capsules&#8221; — are confidently classified as very low. The model recognizes the product-query template but knows &#8220;blorf&#8221; isn&#8217;t a real product. It even assigns higher confidence to &#8220;replacement grax for shimble 8 quart&#8221; (76.2%) because the long-tail structure <em>plus</em> unrecognizable nouns is a double signal of obscurity.</p>



<p>The confidence scores are also well-calibrated: nonsense queries hover around 50–60% confidence, reflecting genuine uncertainty, while real queries like &#8220;laptop&#8221; or &#8220;airpods&#8221; score 93%+. The model knows what it doesn&#8217;t know.</p>



<h2 class="wp-block-heading">Implications</h2>



<p><strong>For SEO/SEM practitioners:</strong> Don&#8217;t use query length as a proxy for volume in your tooling or mental models. A 2-word query can easily be very low volume (&#8220;argon regulator&#8221;), and a 5-word query can be high volume (&#8220;noise cancelling earbuds for sleeping&#8221;). Use actual volume data, or if you need estimates, use a model trained on semantics.</p>



<p><strong>For search engineers:</strong> Query length features may add marginal value in a volume prediction model, but they&#8217;re dominated by semantic features. A language model that understands what queries <em>mean</em> dramatically outperforms one that counts characters.</p>



<p><strong>For data scientists:</strong> This is a nice reminder that when averages show a clean trend, always check the distributions. A monotonic trend in means can coexist with nearly complete overlap in distributions — and the overlap is what determines classifier performance.</p>
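


<p>This is easy to reproduce in a toy simulation (all numbers below are invented, with means chosen to mirror the trend in the table above):</p>



<pre class="wp-block-code"><code># Toy demonstration: monotonic means with almost completely overlapping
# distributions. All numbers invented; means mirror the trend in the table.
import numpy as np

rng = np.random.default_rng(0)
means = [16, 17, 20, 22, 23]
samples = [rng.normal(m, 8, 5000).clip(1) for m in means]

for m, s in zip(means, samples):
    lo, hi = np.percentile(s, 5), np.percentile(s, 95)
    print(f"mean {s.mean():5.1f}   5th-95th percentile: {lo:5.1f} to {hi:5.1f}")
# The percentile ranges overlap almost entirely, so no length threshold can
# separate the classes, despite the clean trend in the means.</code></pre>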



<h2 class="wp-block-heading">Methodology Note</h2>



<ul class="wp-block-list">
<li>Dataset: Amazon Shopping Queries, 395.5M sessions, 39.6M unique queries</li>



<li>Model: DeBERTa v3 base, fine-tuned for 20 epochs on balanced samples (30K–100K per class); a minimal replication sketch follows this list</li>



<li>Heuristic classifiers: quintile-based binning on character/word count</li>



<li>Evaluation: 25K balanced sample (5K per class), Spearman rank correlation, classification accuracy</li>



<li>All code and data processing done in DuckDB + PyTorch</li>
</ul>
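


<p>For readers who want to replicate the modeling step, a minimal fine-tuning sketch using Hugging Face Transformers follows. File paths, batch size, and sequence length are assumptions, not the exact recipe used here:</p>



<pre class="wp-block-code"><code># Minimal DeBERTa-v3 fine-tuning sketch. Paths and hyperparameters are
# illustrative assumptions, not the exact recipe used in this study.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=5)  # five volume classes

# Assumed CSV layout: columns "query" and "label" (0-4, one per volume class)
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = ds.map(lambda b: tok(b["query"], truncation=True, max_length=64),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="query-volume-classifier",
                           num_train_epochs=20,
                           per_device_train_batch_size=64),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tok,
)
trainer.train()</code></pre>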



<div class="wp-block-buttons alignwide is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button">Try Our Query Volume Classifier</a></div>
</div>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/query-length-vs-volume/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Search Grounding is Transient</title>
		<link>https://dejan.ai/blog/search-grounding-is-transient/</link>
					<comments>https://dejan.ai/blog/search-grounding-is-transient/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Fri, 06 Mar 2026 06:17:40 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2274</guid>

					<description><![CDATA[There is a fundamental misconception about how Google&#8217;s AI search and Gemini chatbot process retrieved web content. It is widely understood that these systems use Retrieval-Augmented Generation (RAG) to search the web, pull snippets from pages, and ground their answers in factual data. However, there is a pervasive assumption that once an AI pulls in [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>There is a fundamental misconception about how Google&#8217;s AI search and Gemini chatbot process retrieved web content. It is widely understood that these systems use Retrieval-Augmented Generation (RAG) to search the web, pull snippets from pages, and ground their answers in factual data.</p>



<p>However, there is a pervasive assumption that once an AI pulls in a page, it &#8220;reads&#8221; it and retains that raw source material in its working memory for the duration of the conversation.</p>



<p>It doesn&#8217;t.</p>



<p>An AI&#8217;s memory of actual web page content is bound by a &#8220;single-turn transient&#8221; architecture. The following is a breakdown of the mechanics behind this phenomenon and how it redefines the relationship between AI models and web content.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">The Experiment: Exposing the Mechanism</h3>



<p>The reality of transient memory was recently demonstrated through a user-driven &#8220;meta-test&#8221; designed to probe a major language model&#8217;s grounding capabilities. The interaction unfolded in three steps:</p>



<ol class="wp-block-list">
<li><strong>The Setup:</strong> The user prompted the search-enabled AI to look up a well-known industry figure and list the URLs of the sources it used.</li>



<li><strong>The Execution:</strong> The system triggered a live web search, extracted snippets from the search results, and fed them into the language model&#8217;s context. The AI successfully generated a list of the source URLs.</li>



<li><strong>The Trap:</strong> In the immediate next prompt, the user asked: <em>&#8220;Do you still have the grounding snippet for the visisummit page?&#8221;</em></li>
</ol>



<p>The AI could no longer access the snippet. Stripped of the raw data, the model became confused about its own previous output, incorrectly assuming it must have hallucinated the original search.</p>



<p>This interaction successfully isolated the underlying mechanism: the moment an AI finishes generating its response, the raw source data is entirely purged from its working memory.</p>



<h3 class="wp-block-heading">The Architecture of Forgetting</h3>



<p>This rapid deletion is a byproduct of the &#8220;Token Economy.&#8221; AI context windows—the amount of text a model can process simultaneously—are computationally expensive and strictly limited. To manage memory efficiently, search-enabled chatbots operate on a highly restrictive cycle:</p>



<ul class="wp-block-list">
<li><strong>Turn 1 (The Search):</strong> A query is submitted. The AI triggers a search tool. The system temporarily injects the raw search results (the grounding snippets) into the AI’s context window so it can formulate an answer.</li>



<li><strong>The Purge:</strong> The millisecond the AI completes its response, the system discards all raw snippets to free up token space.</li>



<li><strong>Turn 2 (The Next Prompt):</strong> When a follow-up question is asked, the AI has zero access to the original website text. It retains only the conversational history—meaning it operates solely on the <em>summary</em> it just generated.</li>
</ul>



<p>It is akin to an open-book test where the test-taker is allowed to look at a source text for exactly one minute. Once an answer is written down, the book is permanently closed. For the remainder of the test, the individual can only reference their own handwritten notes.</p>
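


<p>In pseudocode, the cycle described above looks something like this. It is a conceptual sketch of the pattern, not any vendor&#8217;s actual implementation:</p>



<pre class="wp-block-code"><code># Conceptual sketch of single-turn transient grounding. Not any vendor's
# actual implementation; names are hypothetical.
history = []  # persists across turns: user prompts and model answers only

def answer_turn(user_prompt, search_tool, model):
    snippets = search_tool(user_prompt)           # raw grounding snippets
    context = history + [user_prompt] + snippets  # injected for THIS turn only
    reply = model(context)
    history.append(user_prompt)
    history.append(reply)
    # The snippets go out of scope here. They are never written to `history`,
    # so the next turn can only see the model's own generated summary.
    return reply</code></pre>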



<p>The broader context of a web page effectively ceases to exist the moment the first turn ends. What survives is only what was captured in the initial snippet, filtered through the AI&#8217;s immediate interpretation.</p>



<p>Ultimately, AI chatbots do not comprehensively absorb websites. They glance at fleeting flashcards, write down a quick summary, and immediately dispose of the source material—leaving them to converse exclusively with their own notes.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/search-grounding-is-transient/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
