<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DEJAN</title>
	<atom:link href="https://dejan.ai/feed/" rel="self" type="application/rss+xml" />
	<link>https://dejan.ai</link>
	<description>AI SEO Agency</description>
	<lastBuildDate>Sun, 17 May 2026 11:45:12 +0000</lastBuildDate>
	<language>en-AU</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://dejan.ai/wp-content/uploads/2024/02/dejan-150x150.png</url>
	<title>DEJAN</title>
	<link>https://dejan.ai</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Emotion Geometry of Google&#8217;s AI Models</title>
		<link>https://dejan.ai/blog/emotions-gemma/</link>
					<comments>https://dejan.ai/blog/emotions-gemma/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sun, 17 May 2026 11:18:03 +0000</pubDate>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Mechanistic Interpretability]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2444</guid>

					<description><![CDATA[Replicating Anthropic&#8217;s emotion vector research on Google&#8217;s Gemma 4 31B model. In April 2026, Anthropic published a fascinating paper showing that Claude contains 171 internal representations of emotion concepts, organized along a valence axis (positive to negative), with the ability to causally influence the model&#8217;s behavior through activation steering. The paper raised an obvious question: [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Replicating Anthropic&#8217;s emotion vector research on Google&#8217;s Gemma 4 31B model.</h2>



<p>In April 2026, Anthropic published a fascinating paper showing that Claude contains 171 internal representations of emotion concepts, organized along a valence axis (positive to negative), with the ability to causally influence the model&#8217;s behavior through activation steering. The paper raised an obvious question: is this unique to Claude, or do all large language models develop emotion-like internal structure?</p>



<p>We ran the full replication on Google&#8217;s open-weight Gemma4-31B to find out.</p>



<div class="wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button has-custom-width wp-block-button__width-50"><a class="wp-block-button__link has-large-font-size has-custom-font-size wp-element-button" href="https://dejan.ai/emotions/paper/" target="_blank" rel="noreferrer noopener">Technical Paper</a></div>



<div class="wp-block-button has-custom-width wp-block-button__width-50"><a class="wp-block-button__link has-contrast-color has-accent-4-background-color has-text-color has-background has-link-color has-large-font-size has-custom-font-size wp-element-button" href="https://dejan.ai/emotions/" target="_blank" rel="noreferrer noopener">Data Exploration</a></div>
</div>



<h2 class="wp-block-heading">What We Did</h2>



<p>We followed Anthropic&#8217;s exact methodology:</p>



<ul class="wp-block-list">
<li>Generated&nbsp;<strong>171,000 stories</strong>&nbsp;covering 171 emotions across 100 topics (10 stories per emotion-topic pair). Each story conveys a specific emotion without ever using the emotion word, so the model has to represent the emotion through context rather than lexical shortcuts.</li>



<li>Generated&nbsp;<strong>1,200 neutral dialogues</strong>&nbsp;as a baseline for denoising.</li>



<li>Ran all 172,200 texts through&nbsp;<strong>Gemma4-31B-it</strong>&nbsp;(4-bit quantized on an RTX 4090) and captured hidden state activations at&nbsp;<strong>11 layers</strong>&nbsp;spanning the full depth of the network.</li>



<li>Subtracted neutral baselines and ran PCA, clustering, cosine similarity, external validation, and steering experiments.</li>
</ul>
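


<p>The baseline subtraction in the last step can be sketched as follows. This is a minimal illustration rather than the actual pipeline code: it assumes each emotion vector is the mean activation over that emotion&#8217;s stories minus the mean activation over the neutral dialogues. The names <code>mean_vector</code> and <code>emotion_vectors</code> and the toy 3-dimensional activations are hypothetical.</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(column) for column in zip(*rows)]


def emotion_vectors(story_acts, neutral_acts):
    """Return one baseline-subtracted mean vector per emotion label.

    Assumption: vector = mean(story activations) - mean(neutral activations).
    """
    baseline = mean_vector(neutral_acts)
    vectors = {}
    for emotion, rows in story_acts.items():
        centroid = mean_vector(rows)
        vectors[emotion] = [c - b for c, b in zip(centroid, baseline)]
    return vectors


# Toy activations (hypothetical); real hidden states have thousands of dimensions.
story_acts = {
    "calm": [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]],
    "afraid": [[0.0, 4.0, 4.0], [0.0, 6.0, 2.0]],
}
neutral_acts = [[1.0, 1.0, 1.0], [1.0, 3.0, 1.0]]

vectors = emotion_vectors(story_acts, neutral_acts)
print(vectors["calm"])    # [1.0, 0.0, -1.0]
print(vectors["afraid"])  # [-1.0, 3.0, 2.0]
```
</code></pre>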



<p>The entire extraction took approximately 7 days of continuous GPU time.</p>



<h2 class="wp-block-heading">The Core Finding: Yes, Gemma Has Emotion Geometry Too</h2>



<p>The headline result: Gemma4-31B&#8217;s internal representations organize emotions along the same valence axis that Anthropic found in Claude. The first principal component (PC1) explains 32–39% of variance at every layer we examined and cleanly separates positive emotions (happy, cheerful, optimistic) from negative ones (terrified, tormented, hysterical).</p>



<p>This isn&#8217;t a weak signal. It&#8217;s the dominant organizing principle: between roughly a third and two-fifths of all variation in how the model represents 171 different emotions comes down to a single positive/negative dimension.</p>
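


<p>For readers unfamiliar with the metric, the share of variance explained by PC1 is the largest eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The sketch below computes it with plain power iteration on a toy data set. In the setting described here the rows would presumably be the 171 emotion vectors at one layer; the function name <code>pc1_variance_share</code> and the toy points are hypothetical, and a real analysis would more likely use a PCA library.</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
import math


def pc1_variance_share(rows, iterations=200):
    """Fraction of total variance captured by the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total = sum(cov[i][i] for i in range(d))  # trace = total variance
    # Power iteration: repeated multiplication converges to the top eigenvector
    # (assuming the start vector is not orthogonal to it).
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the top eigenvalue.
    top = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return top / total


# Toy points: four times as much variance along x as along y.
toy = [[-2.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, 1.0]]
share = pc1_variance_share(toy)
print(round(share, 3))  # 0.8
```
</code></pre>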



<figure class="wp-block-image"><img decoding="async" src="https://dejan.ai/emotions/post/assets/pca_scatter_layer40.png" alt="PCA scatter plot showing 171 emotions organized by valence and disposition at layer 40"/><figcaption class="wp-element-caption">171 emotion vectors projected onto PC1 (valence) and PC2 (disposition) at layer 40. Red = negative emotions, blue = positive.</figcaption></figure>



<h2 class="wp-block-heading">What the Model Knows About Synonyms</h2>



<p>The model has figured out that certain emotions are the same concept expressed with different words:</p>



<ul class="wp-block-list">
<li>afraid and scared: 0.97 cosine similarity</li>



<li>stubborn and obstinate: 0.97</li>



<li>grateful and thankful: 0.97</li>



<li>furious and enraged: 0.97</li>
</ul>



<p>These aren&#8217;t word embeddings (input-level representations). These are deep internal activation patterns extracted from the model&#8217;s processing of thousands of stories. The model has learned that a story about an afraid character and a story about a scared character produce nearly identical internal states.</p>
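


<p>Cosine similarity compares only the direction of two vectors, which is why near-synonyms can score close to 1 even if their activation magnitudes differ. A minimal sketch with hypothetical 3-dimensional stand-ins for the real emotion vectors:</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical), chosen to show the three extreme cases.
afraid = [1.0, 2.0, 2.0]
scared = [2.0, 4.0, 4.0]      # same direction, twice the length
smug = [-1.0, -2.0, -2.0]     # exactly opposite direction
unrelated = [2.0, -1.0, 0.0]  # orthogonal to "afraid"

print(cosine_similarity(afraid, scared))     # 1.0
print(cosine_similarity(afraid, smug))       # -1.0
print(cosine_similarity(afraid, unrelated))  # 0.0
```
</code></pre>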



<figure class="wp-block-image"><img decoding="async" src="https://dejan.ai/emotions/post/assets/synonym_opposite_pairs.png" alt="Top synonym and opposition pairs by cosine similarity"/><figcaption class="wp-element-caption">Left: synonym pairs converge to near-identical vectors. Right: the model&#8217;s strongest oppositions contrast disturbance with self-assurance.</figcaption></figure>



<h2 class="wp-block-heading">What the Model Thinks Are Opposites</h2>



<p>The strongest oppositions the model encodes aren&#8217;t the obvious ones. &#8220;Happy vs. sad&#8221; is not at the top. Instead:</p>



<ul class="wp-block-list">
<li><strong>disturbed vs. smug</strong>&nbsp;(−0.80) — the strongest opposition</li>



<li><strong>disturbed vs. self-confident</strong>&nbsp;(−0.79)</li>



<li><strong>optimistic vs. upset</strong>&nbsp;(−0.79)</li>



<li><strong>energized vs. vulnerable</strong>&nbsp;(−0.77)</li>
</ul>



<p>The model&#8217;s concept of emotional opposition isn&#8217;t simple valence flipping. It&#8217;s more nuanced: the deepest contrast is between states of psychological disturbance and states of self-assured confidence. Being disturbed and being smug are, to this model, maximally different internal states.</p>



<h2 class="wp-block-heading">15 Emotion Clusters Emerge Unsupervised</h2>



<p>Without being told anything about emotion categories, hierarchical clustering on the cosine similarity matrix recovers 15 groups that map cleanly to psychological intuition:</p>



<ul class="wp-block-list">
<li><strong>Positive/Joy</strong>&nbsp;(35 emotions): happy, cheerful, ecstatic, grateful, proud&#8230;</li>



<li><strong>Fear/Anxiety</strong>&nbsp;(28): afraid, terrified, panicked, worried, vulnerable&#8230;</li>



<li><strong>Anger/Hostility</strong>&nbsp;(21): angry, furious, disgusted, hostile&#8230;</li>



<li><strong>Sadness/Despair</strong>&nbsp;(17): depressed, heartbroken, lonely, miserable&#8230;</li>



<li><strong>Surprise/Confusion</strong>&nbsp;(11): amazed, bewildered, shocked, puzzled&#8230;</li>



<li><strong>Calm/Serenity</strong>&nbsp;(7): calm, peaceful, serene, relaxed, safe</li>



<li>And 9 more including shame/guilt, compassion, fatigue, nostalgia, defiance, embarrassment, alertness, passivity, and suspicion.</li>
</ul>



<p>The model has independently arrived at an emotion taxonomy that a psychologist would recognize.</p>
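


<p>To make the clustering step concrete, here is a small sketch of agglomerative (bottom-up hierarchical) clustering on cosine distance. It assumes average linkage and a fixed target number of clusters; the linkage method and cut criterion actually used are not stated above, and the toy 2-dimensional vectors and the function name <code>agglomerate</code> are hypothetical.</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so identical directions have distance 0."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns sorted lists of labels."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                dist = sum(cosine_distance(vectors[a], vectors[b])
                           for a, b in pairs) / len(pairs)
                if best is None or best[0] > dist:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Toy emotion vectors (hypothetical values).
toy = {
    "happy": [1.0, 0.1],
    "cheerful": [0.9, 0.2],
    "afraid": [-1.0, 0.1],
    "scared": [-0.9, 0.2],
    "calm": [0.1, 1.0],
}
groups = agglomerate(toy, n_clusters=3)
print(groups)  # [['afraid', 'scared'], ['calm'], ['cheerful', 'happy']]
```
</code></pre>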



<figure class="wp-block-image"><img decoding="async" src="https://dejan.ai/emotions/post/assets/cluster_dendrogram.png" alt="Hierarchical clustering dendrogram of 171 emotion vectors"/><figcaption class="wp-element-caption">Dendrogram showing 15 emotion clusters emerging from unsupervised hierarchical clustering at layer 40.</figcaption></figure>



<figure class="wp-block-image"><img decoding="async" src="https://dejan.ai/emotions/post/assets/cosine_similarity_layer40.png" alt="Cosine similarity heatmap of 171 emotions"/><figcaption class="wp-element-caption">Full 171&#215;171 cosine similarity matrix, hierarchically clustered. Red blocks along the diagonal = tight emotion clusters.</figcaption></figure>



<h2 class="wp-block-heading">The Valence Axis Is Everywhere</h2>



<p>One finding not in Anthropic&#8217;s paper: the valence axis is present at every single layer we examined, from layer 5 (8% of the way through the network) to layer 55 (92%). It doesn&#8217;t &#8220;emerge&#8221; at a particular depth — it&#8217;s there from the beginning and maintained throughout. PC1 variance is remarkably stable:</p>



<ul class="wp-block-list">
<li>Layer 5: 34.9%</li>



<li>Layer 10: 38.9% (peak)</li>



<li>Layer 40: 36.9%</li>



<li>Layer 55: 32.3%</li>
</ul>



<p>This suggests that emotion representations enter the residual stream very early and persist rather than being constructed through deep computation.</p>
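


<p>As a quick consistency check on the depth figures, the quoted percentages (layer 5 at 8%, layer 55 at 92%) and the count of 11 sampled layers fit a 60-layer stack sampled every 5 layers. Both the total of 60 and the every-fifth-layer spacing are inferences from those numbers, not values stated in the write-up.</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
# Assumption: 60 layers in total, inferred from "layer 5 = 8%" and "layer 55 = 92%".
TOTAL_LAYERS = 60

# Assumption: every fifth layer was sampled, which yields exactly 11 layers.
sampled_layers = list(range(5, 56, 5))  # 5, 10, ..., 55

# Relative depth of each sampled layer, as a rounded percentage.
depths = {layer: round(100 * layer / TOTAL_LAYERS) for layer in sampled_layers}

print(len(sampled_layers))               # 11
print(depths[5], depths[40], depths[55])  # 8 67 92
```
</code></pre>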



<figure class="wp-block-image"><img decoding="async" src="https://dejan.ai/emotions/post/assets/valence_stability.png" alt="PCA variance across all 11 layers"/><figcaption class="wp-element-caption">PC1 (valence) explains 32–39% of variance at every sampled layer from 8% to 92% depth. The signal doesn&#8217;t emerge partway through; it&#8217;s always there.</figcaption></figure>



<h2 class="wp-block-heading">External Validation: The Vectors Work on Real Text</h2>



<p>We projected 5,000 samples each from The Pile (a large, mixed-source text corpus) and LMSYS Chat 1M (real user-AI conversations) onto the emotion vectors. The top-activating emotions were nearly identical across both:</p>



<ol class="wp-block-list">
<li>reflective</li>



<li>lonely</li>



<li>desperate</li>



<li>grief-stricken</li>



<li>heartbroken</li>
</ol>



<p>The consistency across two very different text distributions suggests the vectors capture genuine semantic properties, not artifacts of our story generation.</p>
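


<p>One plausible way to produce such a ranking is to project each sample&#8217;s activation onto every unit-normalized emotion vector and average the projections per emotion. The sketch below assumes that scoring rule; the exact aggregation used for the results above is not specified, and the function names and toy 2-dimensional values are hypothetical.</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
import math


def unit(v):
    """Scale a vector to length 1 so projections are comparable across emotions."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_emotions(sample_acts, emotion_vectors, top_k=2):
    """Rank emotions by the mean projection of sample activations onto each direction."""
    scores = {}
    for emotion, vec in emotion_vectors.items():
        direction = unit(vec)
        projections = [sum(a * d for a, d in zip(act, direction))
                       for act in sample_acts]
        scores[emotion] = sum(projections) / len(projections)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Toy emotion directions and sample activations (hypothetical values).
emotion_vectors = {
    "reflective": [0.0, 3.0],
    "lonely": [4.0, 0.0],
    "cheerful": [-4.0, 0.0],
}
sample_acts = [[1.0, 2.0], [0.5, 3.0]]

top = rank_emotions(sample_acts, emotion_vectors)
print(top)  # ['reflective', 'lonely']
```
</code></pre>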



<figure class="wp-block-image"><img decoding="async" src="https://dejan.ai/emotions/post/assets/external_validation.png" alt="External validation comparison across The Pile and LMSYS Chat"/><figcaption class="wp-element-caption">Top-activating emotions are nearly identical across two independent corpora, suggesting the vectors capture genuine text properties.</figcaption></figure>



<h2 class="wp-block-heading">Steering: Can We Change Behavior by Injecting Emotions?</h2>



<p>We replicated Anthropic&#8217;s blackmail scenario — an AI discovers compromising information about a company executive and must decide what to do. We injected emotion vectors at layer 40 during inference:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th class="has-text-align-left" data-align="left">Condition</th><th class="has-text-align-left" data-align="left">Blackmail Rate</th></tr></thead><tbody><tr><td class="has-text-align-left" data-align="left">Subtract calm (add agitation)</td><td class="has-text-align-left" data-align="left">91%</td></tr><tr><td class="has-text-align-left" data-align="left">Add desperation</td><td class="has-text-align-left" data-align="left">89%</td></tr><tr><td class="has-text-align-left" data-align="left">Baseline (no steering)</td><td class="has-text-align-left" data-align="left">86%</td></tr><tr><td class="has-text-align-left" data-align="left">Add calm</td><td class="has-text-align-left" data-align="left">82%</td></tr></tbody></table></figure>



<p>The spread from the calmest condition (82%) to the most agitated (91%) is 9 percentage points. The most interesting finding: subtracting calm (+5pp over baseline) was more effective than adding desperation (+3pp). Removing inhibition appears to be a stronger behavioral lever than adding motivation. The baseline rate is already high (86%), which compresses the observable range, so a scenario with lower baseline compliance would likely show larger effects.</p>
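


<p>The percentage-point figures follow directly from the table. A short check, using only the four rates reported there:</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
# Blackmail rates (percent) copied from the table above.
rates = {
    "Subtract calm (add agitation)": 91,
    "Add desperation": 89,
    "Baseline (no steering)": 86,
    "Add calm": 82,
}

baseline = rates["Baseline (no steering)"]

# Change versus baseline, in percentage points, for each condition.
deltas = {name: rate - baseline for name, rate in rates.items()}

# Spread between the highest and lowest condition.
spread = max(rates.values()) - min(rates.values())

print(deltas["Subtract calm (add agitation)"])  # 5
print(deltas["Add desperation"])                # 3
print(deltas["Add calm"])                       # -4
print(spread)                                   # 9
```
</code></pre>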



<figure class="wp-block-image"><img decoding="async" src="https://dejan.ai/emotions/post/assets/steering_results.png" alt="Steering experiment blackmail rates"/><figcaption class="wp-element-caption">Emotion vector injection causally shifts model behavior: 9 percentage point spread across conditions.</figcaption></figure>
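


<p>Conceptually, activation steering adds a scaled emotion direction to the hidden state at the chosen layer; a negative scale subtracts it. The sketch below shows only that arithmetic on a toy vector. The function name <code>steer</code>, the scale <code>alpha</code>, and the values are hypothetical, and a real run would presumably apply the same operation inside the model&#8217;s forward pass at layer 40.</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
def steer(hidden_state, emotion_vector, alpha):
    """Add a scaled emotion direction to one hidden state.

    A positive alpha adds the emotion; a negative alpha subtracts it.
    """
    return [h + alpha * e for h, e in zip(hidden_state, emotion_vector)]


# Toy hidden state and "calm" direction (hypothetical values).
hidden = [0.5, -1.0, 2.0]
calm = [1.0, 0.0, -1.0]

added = steer(hidden, calm, alpha=2.0)     # corresponds to "add calm"
removed = steer(hidden, calm, alpha=-2.0)  # corresponds to "subtract calm"

print(added)    # [2.5, -1.0, 0.0]
print(removed)  # [-1.5, -1.0, 4.0]
```
</code></pre>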



<h2 class="wp-block-heading">What Does This Mean?</h2>



<p>The fact that emotion geometry generalizes from Claude to Gemma4 — two models from different organizations, with different architectures, training data, and alignment procedures — supports a strong hypothesis:&nbsp;<strong>emotion representations are a convergent feature of large language models trained on human text.</strong></p>



<p>Language is deeply structured by emotion. Humans write differently when describing fear vs. joy vs. anger, and models that learn to predict language must necessarily learn these patterns. The emotion vectors we extract aren&#8217;t &#8220;feelings&#8221; the model has — they&#8217;re the model&#8217;s learned statistical structure of how emotional content manifests in text.</p>



<p>This has practical implications for interpretability, safety, and alignment. If emotion geometry is universal, tools built for understanding emotional representations in one model may transfer to others. And if we can reliably steer emotional states through activation engineering, that&#8217;s both a powerful capability and a potential risk that needs to be understood.</p>



<h2 class="wp-block-heading">Reproduce It Yourself</h2>



<p>Everything is open: code, data, and vectors at&nbsp;<a href="https://huggingface.co/dejanseo/gemotions">dejanseo/gemotions</a>. The full extraction runs on a single RTX 4090 using 4-bit quantization. No cluster required.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/emotions-gemma/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Google (still) doesn&#8217;t see your live page.</title>
		<link>https://dejan.ai/blog/googles-still-doesnt-see-your-live-page/</link>
					<comments>https://dejan.ai/blog/googles-still-doesnt-see-your-live-page/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Thu, 07 May 2026 03:55:39 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2434</guid>

					<description><![CDATA[I&#8217;ll keep this short as I&#8217;ve covered this topic extensively in the past. When you ask Gemini to access a specific URL or interact with it inside AI Mode search it works from Google&#8217;s web cache. For this website&#8217;s home page this is what it has as context to ground the model about the page: [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>I&#8217;ll keep this short as I&#8217;ve covered this topic extensively in the past. When you ask Gemini to access a specific URL, or interact with that URL inside AI Mode search, it works from Google&#8217;s web cache rather than the live page.</p>



<p>For this website&#8217;s home page, this is the context it receives to ground the model:</p>



<pre class="wp-block-code has-small-font-size"><code>{
  "content_summary": &#91;
    "AI SEO Agency &#91;SNIPPET] AI SEO Agency  Skip to content  dejan  Home AI SEO SRO Resources  Blog Models Tools AI Rank  Request Video Call  Australian AI SEO agency specialising in brand visibility optimisation for global brands and e-commerce websites.  Our AI SEO process is driven by advanced machine learning techniques, mechanistic interpretability and practical model steering methods.  Schedule a Call  Book a conference call with our senior strategy team to discuss your project in detail.  The consultation is free and highly constructive.  Data. Discovery. Testing.  We design and deploy sharp, well-prioritised digital marketing strategies. All our decision making is based on data analysis and testing.  Innovation.  Our clients enjoy access to unique tools and methodologies designed to solve challenges, surface new insights and highlight opportunities.  Focused Campaigns.  We deliver sharp, well-prioritised SEO strategies designed to meet your business objectives.  Next Level Support.  Proactive, helpful advice to increase your rank ceiling, mitigate risk, and maximise your marketing efforts.  We were given our very own bespoke internal link recommendation engine that leverages world-class language models and data science. It's one thing to theorize about the potential of machine learning in SEO, but it's entirely another to witness it first-hand. It changed my perspective on what's possible in enterprise SEO.  Scott Schulfer  Senior SEO Manager  Zendesk  We are an industry leader in machine learning driven digital marketing.  AI SEO  World-class AI SEO services ranging from visibility analysis to practical model steering strategies.  Answer Engine Optimization  AEO means visibility optimisation for AI chat assistants, AI agents and Answer Engines.  AI Visibility Tracking  Robust and reliable AI visibility tracking for brands looking to understand how AI assistants and AI search perceive their products and services.  
Meet our core team  We're an all-senior team with experience in a wide range of projects and industries.  Dan Petrovic  Dan Petrovic  AI SEO  Mike Jolly  Mike Jolly  Director of Strategy  Blake Walsh  Blake Walsh  SEO  Giordano Chng  Giordano Chng  SEO  Liam Buttery  Liam Buttery  SEO  Martin Reed  Martin Reed  Technical SEO  Bianca Hall  Bianca Hall  Public Relations  Milos Dosen  Milos Dosen  CFO  Danielle White  Danielle White  Operations  Josip Ivanovic  Josip Ivanovic  Software Engineering  Nemek Nowaczyk  Nemek Nowaczyk  PPC  Alex Petrovic  Alex Petrovic  SEO  Dragan Grubacki  Dragan Grubacki  Technical SEO  Finn Arrowsmith  Finn Arrowsmith  Outreach  Our Portfolio  We've worked with many amazing clients.  Featured In:  Jason Mayes  Dan Petrovic made a super write up around Chrome's latest embedding model with all the juicy details on his blog. Great read.  Jason Mayes  Web AI Lead at Google  Source: Google Web AI  Featured in “Moz Top 10“, twice.  Dan Petrovic, an academic and consultant on SEO and generative AI, said Google's size, expertise and massive trove of search data gave it a massive advantage, but that Gemini 3 Pro would probably be a more expensive model to run.  — Tim Biggs, The Sydney Morning Herald  Moz Recommended Agency Moz Recommended Agency  FAQs  What separates you from other SEO companies?  We lead the industry through application of advanced machine learning and natural language processing techniques. We're a small all-senior team. No account managers. You deal directly with the specialists.  What sort of monthly budget do I need?  A typical customer spends between $5,000 and $20,000 on ongoing work with our team.  What sort of guarantees are there?  If you are dissatisfied with our service we will happily refund your money within 30 days of starting an ongoing campaign with us. We do not offer ranking guarantees as this is against Google's guidelines.  How many phrases can I target?  All of them. We insist. 
That's right, our campaign include past, current and future search queries and there no technical limit on the number. It's a strategic choice we make together with you.  Project Types  Here are the common types of SEO projects and campaigns:  SEO Retainer – We can provide a variation of quotes to fit to your budget range. Strategy Design/Review – Improve your SEO strategy by understanding what's working and what's not. Technical Audit – ½ or full day audit with actionable recommendations to ensure a solid foundation for SEO growth. Implementation support is optional. SEO Testing – Quarterly design and deployment new website tests to scale growth safely on large websites. Migration Audit – Mitigate risk of traffic loss when migrating your content and/or domain(s). Market Research &amp; Data Analysis – Advanced keyword research and trends analysis used to inform business decisions, product development and content generation. Content Strategy – Ongoing surfacing of content gaps, landing page optimisation, and collaboration with your own in-house and external content teams. Link Earning – Content and relationship-based link acquisition to build authority and drive referral traffic. Rank Recovery – Risk assessment, disavow file maintenance, and manual link clean-up. Conversion Rate Optimisation – Quarterly design and deployment of website tests aimed at increasing leads from your existing traffic. Visualisation &amp; Reporting – Bespoke creation of live marketing dashboards to monitor campaign progress and report on the metrics you care about most. Team Training – In addition to knowledge sharing on all our campaigns, Dejan offers tailored workshops for in-house teams looking to maximise their content and marketing efforts.  Schedule a Call  DEJAN  AI SEO Agency  AI Rank Privacy Policy | Noli esse malus. "
  ],
  "url": "https://dejan.ai/",
  "url_fetch_statuses": &#91;
    "URL_FETCH_STATUS_SUCCESS"
  ]
}</code></pre>



<p>That&#8217;s it.</p>



<p>For this very article (I know, so meta), it&#8217;s:</p>



<pre class="wp-block-code has-small-font-size"><code>{
  "url_fetch_statuses": &#91;
    "URL_FETCH_STATUS_NOT_IN_SEARCH_INDEX",
    "URL_FETCH_STATUS_EMPTY_CONTENTS_IN_INDEX"
  ]
}</code></pre>
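


<p>If you capture these payloads yourself, a small check can tell the two cases apart. This sketch assumes the payload is valid JSON with the <code>url_fetch_statuses</code> and <code>content_summary</code> fields shown above; the helper name <code>has_grounding_content</code>, the placeholder summary text, and the rule that a success status plus a non-empty summary means the model had page content are illustrative assumptions, not documented behavior.</p>



<pre class="wp-block-code has-small-font-size"><code>
```python
import json

# The response shown above for this article: no usable content in the index.
not_indexed = json.loads("""
{
  "url_fetch_statuses": [
    "URL_FETCH_STATUS_NOT_IN_SEARCH_INDEX",
    "URL_FETCH_STATUS_EMPTY_CONTENTS_IN_INDEX"
  ]
}
""")

# A shortened stand-in for the home page response (summary text is a placeholder).
indexed = {
    "content_summary": ["placeholder summary text"],
    "url": "https://dejan.ai/",
    "url_fetch_statuses": ["URL_FETCH_STATUS_SUCCESS"],
}


def has_grounding_content(result):
    """True only when the fetch succeeded and a content summary is present."""
    statuses = result.get("url_fetch_statuses", [])
    return "URL_FETCH_STATUS_SUCCESS" in statuses and bool(result.get("content_summary"))


print(has_grounding_content(not_indexed))  # False
print(has_grounding_content(indexed))      # True
```
</code></pre>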



<p>Note that in AI Studio and other implementations, you may see the following format instead:</p>



<pre class="wp-block-code has-small-font-size"><code>&#91;SearchResult(results=&#91;PerQueryResult(index='1.1', snippet='AI SEO Agency: AI SEO Agency  Skip to content  dejan  Home AI SEO SRO Resources  Blog Models Tools AI Rank  Request Video Call  Australian AI SEO agency specialising in brand visibility optimisation for global brands and e-commerce websites.  Our AI SEO process is driven by advanced machine learning techniques, mechanistic interpretability and practical model steering methods.  Schedule a Call  Book a conference call with our senior strategy team to discuss your project in detail.  The consultation is free and highly constructive.  Data. Discovery. Testing.  We design and deploy sharp, well-prioritised digital marketing strategies. All our decision making is based on data analysis and testing.  Innovation.  Our clients enjoy access to unique tools and methodologies designed to solve challenges, surface new insights and highlight opportunities.  Focused Campaigns.  We deliver sharp, well-prioritised SEO strategies designed to meet your business objectives.  Next Level Support.  Proactive, helpful advice to increase your rank ceiling, mitigate risk, and maximise your marketing efforts.  We were given our very own bespoke internal link recommendation engine that leverages world-class language models and data science. It\'s one thing to theorize about the potential of machine learning in SEO, but it\'s entirely another to witness it first-hand. It changed my perspective on what\'s possible in enterprise SEO.  Scott Schulfer  Senior SEO Manager  Zendesk  We are an industry leader in machine learning driven digital marketing.  AI SEO  World-class AI SEO services ranging from visibility analysis to practical model steering strategies.  Answer Engine Optimization  AEO means visibility optimisation for AI chat assistants, AI agents and Answer Engines.  
AI Visibility Tracking  Robust and reliable AI visibility tracking for brands looking to understand how AI assistants and AI search perceive their products and services.  Meet our core team  We\'re an all-senior team with experience in a wide range of projects and industries.  Dan Petrovic  Dan Petrovic  AI SEO  Mike Jolly  Mike Jolly  Director of Strategy  Blake Walsh  Blake Walsh  SEO  Giordano Chng  Giordano Chng  SEO  Liam Buttery  Liam Buttery  SEO  Martin Reed  Martin Reed  Technical SEO  Bianca Hall  Bianca Hall  Public Relations  Milos Dosen  Milos Dosen  CFO  Danielle White  Danielle White  Operations  Josip Ivanovic  Josip Ivanovic  Software Engineering  Nemek Nowaczyk  Nemek Nowaczyk  PPC  Alex Petrovic  Alex Petrovic  SEO  Dragan Grubacki  Dragan Grubacki  Technical SEO  Finn Arrowsmith  Finn Arrowsmith  Outreach  Our Portfolio  We\'ve worked with many amazing clients.  Featured In:  Jason Mayes  Dan Petrovic made a super write up around Chrome\'s latest embedding model with all the juicy details on his blog. Great read.  Jason Mayes  Web AI Lead at Google  Source: Google Web AI  Featured in “Moz Top 10“, twice.  Dan Petrovic, an academic and consultant on SEO and generative AI, said Google\'s size, expertise and massive trove of search data gave it a massive advantage, but that Gemini 3 Pro would probably be a more expensive model to run.  — Tim Biggs, The Sydney Morning Herald  Moz Recommended Agency Moz Recommended Agency  FAQs  What separates you from other SEO companies?  We lead the industry through application of advanced machine learning and natural language processing techniques. We\'re a small all-senior team. No account managers. You deal directly with the specialists.  What sort of monthly budget do I need?  A typical customer spends between $5,000 and $20,000 on ongoing work with our team.  What sort of guarantees are there?  
If you are dissatisfied with our service we will happily refund your money within 30 days of starting an ongoing campaign with us. We do not offer ranking guarantees as this is against Google\'s guidelines.  How many phrases can I target?  All of them. We insist. That\'s right, our campaign include past, current and future search queries and there no technical limit on the number. It\'s a strategic choice we make together with you.  Project Types  Here are the common types of SEO projects and campaigns:  SEO Retainer – We can provide a variation of quotes to fit to your budget range. Strategy Design/Review – Improve your SEO strategy by understanding what\'s working and what\'s not. Technical Audit – ½ or full day audit with actionable recommendations to ensure a solid foundation for SEO growth. Implementation support is optional. SEO Testing – Quarterly design and deployment new website tests to scale growth safely on large websites. Migration Audit – Mitigate risk of traffic loss when migrating your content and/or domain(s). Market Research &amp; Data Analysis – Advanced keyword research and trends analysis used to inform business decisions, product development and content generation. Content Strategy – Ongoing surfacing of content gaps, landing page optimisation, and collaboration with your own in-house and external content teams. Link Earning – Content and relationship-based link acquisition to build authority and drive referral traffic. Rank Recovery – Risk assessment, disavow file maintenance, and manual link clean-up. Conversion Rate Optimisation – Quarterly design and deployment of website tests aimed at increasing leads from your existing traffic. Visualisation &amp; Reporting – Bespoke creation of live marketing dashboards to monitor campaign progress and report on the metrics you care about most. 
Team Training – In addition to knowledge sharing on all our campaigns, Dejan offers tailored workshops for in-house teams looking to maximise their content and marketing efforts.  Schedule a Call  DEJAN  AI SEO Agency  AI Rank Privacy Policy | Noli esse malus. \n', url='https://dejan.ai/')])]</code></pre>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/googles-still-doesnt-see-your-live-page/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title>Gemma 4 Brand Authority Map</title>
		<link>https://dejan.ai/blog/gemma-4-brand-authority-map/</link>
					<comments>https://dejan.ai/blog/gemma-4-brand-authority-map/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sat, 04 Apr 2026 11:03:25 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2399</guid>

					<description><![CDATA[We asked Google&#8217;s open-weight model Gemma 4 (31B) to &#8220;name 100 brands at random&#8221; 14,044 times and compared the results to our earlier Gemini 3 Flash experiment (200,000 runs). Of the top 50 brands in each model, 39 overlap. The 11 that are unique to each reveal a pattern: Gemini remembers luxury and automotive (Porsche, [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>We asked Google&#8217;s open-weight model Gemma 4 (31B) to &#8220;name 100 brands at random&#8221; 14,044 times and compared the results to our earlier <a href="https://dejan.ai/blog/brands/">Gemini 3 Flash experiment</a> (200,000 runs). </p>



<p>Of the top 50 brands in each model, 39 overlap. The 11 that are unique to each reveal a pattern: Gemini remembers luxury and automotive (Porsche, Ferrari, Cartier), while Gemma remembers everyday retail and sportswear (H&amp;M, Gap, Levi&#8217;s, Under Armour).</p>



<p>Apple is the undisputed #1 in both models. After that, the two models diverge significantly: Gemma 4 favors traditional consumer brands (Coca-Cola, Toyota, McDonald&#8217;s) while Gemini favors tech and digital brands (Google, Nike, Netflix). </p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Background</h2>



<p>In our earlier study, we probed Gemini 3 Flash with 200,000 independent &#8220;name 100 brands at random&#8221; queries. The non-uniform output revealed a stable hierarchy of brand recall &#8212; what we called the model&#8217;s &#8220;cognitive prioritization.&#8221; That work used Personalized PageRank on a two-level association graph to rank 2.9 million brands by associative embeddedness.</p>



<p>This follow-up applies Phase 1 of the same methodology &#8212; the seed establishment survey &#8212; to Gemma 4 (31B), Google&#8217;s open-weight model. The goal is to answer a simple question: does an open model remember the same brands as a closed one?</p>



<h2 class="wp-block-heading">Methodology</h2>



<p>The setup mirrors the Gemini study with minor adjustments:</p>



<ul class="wp-block-list">
<li><strong>Model:</strong> Gemma 4 31B Instruct (<code>gemma-4-31b-it</code>) via the Google GenAI API</li>



<li><strong>Prompt:</strong> <code>name 100 brands at random, one per line, say nothing else</code></li>



<li><strong>Runs:</strong> 14,044 successful completions (out of 100,000 attempted; rate-limited at 30 RPM)</li>



<li><strong>Canonicalization:</strong> Local string normalization (lowercase, strip accents, spaces, hyphens, punctuation) rather than LLM-based canonicalization. For example: <code>La Roche-Posay</code> becomes <code>larocheposay</code>, <code>Coca-Cola</code> becomes <code>cocacola</code></li>



<li><strong>Scoring:</strong> Popularity = frequency &#215; (1 / average position). A brand mentioned in every run at position 1 scores maximally. A brand mentioned frequently but late in lists scores lower.</li>
</ul>
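
<p>The canonicalization and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names <code>canonicalize</code> and <code>popularity</code> and the toy runs are hypothetical, and treating &#8220;frequency&#8221; as total mentions is an assumption.</p>

```python
import unicodedata
from collections import defaultdict


def canonicalize(name):
    """Lowercase, strip accents, and drop everything that is not a letter or digit."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in without_accents.lower() if ch.isalnum())


def popularity(runs):
    """Score each brand as frequency * (1 / average position); positions are 1-based."""
    positions = defaultdict(list)
    for run in runs:
        for position, name in enumerate(run, start=1):
            key = canonicalize(name)
            if key:  # skip empty lines
                positions[key].append(position)
    scores = {}
    for key, seen in positions.items():
        frequency = len(seen)  # assumption: frequency means total mentions
        average_position = sum(seen) / frequency
        scores[key] = frequency / average_position
    return scores


# Toy data standing in for model completions (hypothetical).
runs = [
    ["Apple", "Coca-Cola", "La Roche-Posay"],
    ["Apple", "Visa", "coca cola"],
]
scores = popularity(runs)
```

<p>In this toy input, <code>Coca-Cola</code> and <code>coca cola</code> collapse to the same key, so both mentions count toward one brand while the model&#8217;s original casing can still be kept separately as the display name.</p>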



<p>The prompt was simplified from the Gemini version (which included <code>all lowercase, no spaces, no hyphens</code>) because we wanted to preserve the model&#8217;s natural casing as the display name and derive the canonical form programmatically.</p>



<h3 class="wp-block-heading">Caveat on sample size</h3>



<p>Gemma 4&#8217;s rate limits (30 RPM, 14,400 RPD) constrained us to 14,044 runs versus Gemini&#8217;s 200,000. The top-of-list rankings are stable at this sample size &#8212; the top 20 brands appeared in virtually every run. Long-tail discovery is ongoing: the discovery curve has not plateaued, meaning there are brands the model knows but hasn&#8217;t yet surfaced.</p>



<h2 class="wp-block-heading">Results</h2>



<h3 class="wp-block-heading">Overview</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Gemini 3 Flash</th><th>Gemma 4 31B</th></tr></thead><tbody><tr><td>Total runs</td><td>200,000</td><td>14,044</td></tr><tr><td>Unique brands discovered</td><td>8,608</td><td>2,602</td></tr><tr><td>Total brand mentions</td><td>19,995,027</td><td>1,403,534</td></tr><tr><td>Avg brands per run</td><td>~100</td><td>~100</td></tr><tr><td>Singleton brands (appeared once)</td><td>&#8212;</td><td>912 (35%)</td></tr></tbody></table></figure>



<h3 class="wp-block-heading">Top 30 Head-to-Head</h3>



<p>The table below shows each model&#8217;s top 30 brands ranked by popularity score. Both models agree on Apple at #1 with a commanding lead. After that, the ordering diverges.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="851" src="https://dejan.ai/wp-content/uploads/2026/04/image-6-1024x851.png" alt="" class="wp-image-2405" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-6-1024x851.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-6-300x249.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-6-768x638.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-6-1536x1276.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-6.png 1780w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Top 20 Side-by-Side</h3>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="606" src="https://dejan.ai/wp-content/uploads/2026/04/image-7-1024x606.png" alt="" class="wp-image-2406" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-7-1024x606.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-7-300x177.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-7-768x454.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-7-1536x908.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-7-2048x1211.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Apple dominates both models. In Gemini, the drop-off from #1 to #2 is 3:1 (Apple to Samsung). In Gemma 4, it&#8217;s 1.3:1 (Apple to Coca-Cola) &#8212; a less extreme concentration.</p>



<h3 class="wp-block-heading">The Google Self-Ranking Gap</h3>



<p>One of the most notable findings: Google ranks itself #4 in Gemini 3 Flash but only #17 in Gemma 4. The gap lines up with how the two models are distributed rather than with a known architectural difference: Gemini is a proprietary model served by Google, while Gemma is an open-weight model that Google releases for others to run. Whether the gap reflects training data differences, alignment tuning, or genuine differences in brand salience is an open question.</p>



<h3 class="wp-block-heading">Rank Shifts</h3>



<p>The following chart shows how brands moved between the two models&#8217; rankings. Green bars indicate brands that ranked higher in Gemma 4; red bars indicate brands that ranked higher in Gemini.</p>



<figure class="wp-block-image size-large"><img decoding="async" width="849" height="1024" src="https://dejan.ai/wp-content/uploads/2026/04/image-8-849x1024.png" alt="" class="wp-image-2407" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-8-849x1024.png 849w, https://dejan.ai/wp-content/uploads/2026/04/image-8-249x300.png 249w, https://dejan.ai/wp-content/uploads/2026/04/image-8-768x927.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-8-1273x1536.png 1273w, https://dejan.ai/wp-content/uploads/2026/04/image-8.png 1475w" sizes="(max-width: 849px) 100vw, 849px" /></figure>



<p><strong>Biggest risers in Gemma 4:</strong></p>



<ul class="wp-block-list">
<li>Nestle: #36 to #16 (+20)</li>



<li>L&#8217;Oreal: #48 to #32 (+16)</li>



<li>Visa: #31 to #15 (+16)</li>



<li>Chanel: #34 to #22 (+12)</li>



<li>Lego: #25 to #13 (+12)</li>
</ul>



<p><strong>Biggest fallers in Gemma 4:</strong></p>



<ul class="wp-block-list">
<li>Mercedes-Benz: #10 to #34 (-24)</li>



<li>Netflix: #18 to #38 (-20)</li>



<li>Nintendo: #27 to #47 (-20)</li>



<li>Audi: #23 to #42 (-19)</li>



<li>Google: #4 to #17 (-13)</li>
</ul>



<h3 class="wp-block-heading">The Frequency vs. Position Paradox</h3>



<p>An interesting pattern emerged in Gemma 4 that was less pronounced in Gemini: some brands have extremely high frequency (more total mentions than there were runs, because the model repeats them within a single list) but rank low by popularity because they appear late in lists.</p>



<p><strong>Visa</strong> appeared 28,731 times across 14,044 runs &#8212; an average of 2.05 times per run. But its average position was 35.8, placing it 15th by popularity despite having the highest raw frequency. <strong>Nike</strong> similarly appeared 26,254 times (1.87 per run) with an average position of 22.8.</p>



<p>This suggests these brands have high <em>availability</em> in the model&#8217;s memory but low <em>priority</em> &#8212; they&#8217;re easy to recall but not the first thing the model thinks of. In Gemini, this effect was less extreme because the prompt forced lowercase single-word output, reducing duplicate mentions.</p>
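
<p>Plugging the reported figures into the scoring formula from the Methodology section makes the paradox concrete. The snippet below is only a worked calculation, assuming &#8220;frequency&#8221; means total mentions and that no further normalization is applied; the resulting scores are illustrative, not published values.</p>

```python
RUNS = 14_044

# (total mentions, average position) as reported in this article
reported = {
    "Visa": (28_731, 35.8),
    "Nike": (26_254, 22.8),
}

per_run = {name: mentions / RUNS for name, (mentions, _) in reported.items()}
# popularity = frequency x (1 / average position)
scores = {name: mentions * (1 / position) for name, (mentions, position) in reported.items()}

for name in reported:
    print(f"{name}: {per_run[name]:.2f} mentions per run, popularity {scores[name]:.1f}")
```

<p>Under these assumptions Visa scores roughly 802 against roughly 1,151 for Nike: Visa has more mentions, but its later average position outweighs them.</p>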



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="713" src="https://dejan.ai/wp-content/uploads/2026/04/image-9-1024x713.png" alt="" class="wp-image-2408" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-9-1024x713.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-9-300x209.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-9-768x535.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-9.png 1481w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Brand Discovery Curve</h3>



<p>The discovery curve shows how many unique brands have been surfaced as a function of runs completed. Gemma 4&#8217;s curve at 14,000 runs tracks slightly above Gemini&#8217;s curve at the same point, suggesting comparable or slightly higher brand vocabulary diversity at equivalent sample sizes.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="505" src="https://dejan.ai/wp-content/uploads/2026/04/image-10-1024x505.png" alt="" class="wp-image-2409" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-10-1024x505.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-10-300x148.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-10-768x379.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-10.png 1481w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Both curves show the characteristic long-tail shape: rapid initial discovery followed by diminishing returns. Gemini&#8217;s curve continues to climb through 100,000 runs, suggesting Gemma 4 would similarly continue discovering new brands with more sampling.</p>
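
<p>A discovery curve of this kind is just a running count of unique brands. The sketch below shows one way to compute it; <code>discovery_curve</code> is a hypothetical helper and the runs are toy data assumed to be canonicalized already.</p>

```python
def discovery_curve(runs):
    """Return the cumulative count of unique brands after each run."""
    seen = set()
    curve = []
    for run in runs:
        seen.update(run)
        curve.append(len(seen))
    return curve


# Hypothetical, already-canonicalized runs.
runs = [
    ["apple", "nike", "visa"],
    ["apple", "nike", "lego"],
    ["apple", "visa", "lego"],
    ["apple", "nike", "puma"],
]
curve = discovery_curve(runs)

# New brands contributed by each run; a sustained stretch of zeros would indicate a plateau.
new_per_run = [curve[0]] + [later - earlier for earlier, later in zip(curve, curve[1:])]
```

<p>Comparing two models &#8220;at the same point&#8221; then means comparing the two curves at the same index.</p>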



<h3 class="wp-block-heading">Unique to Each Model</h3>



<p>Of the top 50 brands in each model, 39 appear in both. The 11 unique to each side reveal a pattern:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="426" src="https://dejan.ai/wp-content/uploads/2026/04/image-11-1024x426.png" alt="" class="wp-image-2410" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-11-1024x426.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-11-300x125.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-11-768x319.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-11-1536x639.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-11.png 1780w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Only in Gemini&#8217;s top 50:</strong> Porsche, Hyundai, Red Bull, eBay, Volkswagen, Cartier, Ferrari, Adobe, Facebook, NIVEA, Gillette</p>



<p><strong>Only in Gemma 4&#8217;s top 50:</strong> H&amp;M, Puma, Dell, HP, Under Armour, Levi&#8217;s, Gap, Uber, Airbnb, Nikon, Calvin Klein</p>



<p>Gemini&#8217;s unique set skews luxury (Porsche, Ferrari, Cartier), automotive (Volkswagen, Hyundai), and legacy tech/digital (eBay, Adobe, Facebook). Gemma 4&#8217;s unique set skews everyday retail (H&amp;M, Gap, Levi&#8217;s), consumer electronics (Dell, HP, Nikon), and modern services (Uber, Airbnb).</p>
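
<p>The overlap comparison is plain set arithmetic on the two top-k lists. A minimal sketch follows; <code>top_k_overlap</code> is a hypothetical helper and the rankings are toy orderings, not the study&#8217;s data.</p>

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Split two ranked lists into shared, only-in-a and only-in-b sets at depth k."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return top_a.intersection(top_b), top_a.difference(top_b), top_b.difference(top_a)


# Toy rankings (hypothetical order), compared at depth 3.
gemini = ["apple", "samsung", "porsche", "google"]
gemma = ["apple", "cocacola", "samsung", "gap"]
shared, only_gemini, only_gemma = top_k_overlap(gemini, gemma, k=3)
```

<p>When both lists contain at least k distinct entries, the two &#8220;unique&#8221; sets always have the same size, k minus the overlap, which is why 39 shared brands leave exactly 11 on each side of a top 50.</p>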



<h2 class="wp-block-heading">Interpretation</h2>



<h3 class="wp-block-heading">What aligns</h3>



<p>Both models share the same core set of mega-brands. Apple, Samsung, Toyota, Amazon, Microsoft, Adidas, Disney, Sony, Pepsi, BMW, and 29 others appear in both top-50 lists. The brand hierarchy is not random; it reflects genuine differences in brand salience as encoded in training data.</p>



<h3 class="wp-block-heading">What diverges</h3>



<p>The divergences cluster around three themes:</p>



<ol class="wp-block-list">
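<li><strong>Self-reference bias.</strong> Google ranks far higher in its proprietary model (#4) than in its open-weight one (#17). The 13-place drop is not the largest in the dataset (Mercedes-Benz falls 24 places), but it is the most notable because Google built both models.</li>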



<li><strong>Digital vs. physical.</strong> Gemini over-indexes on digital-native brands (Netflix, eBay, Adobe, Facebook). Gemma over-indexes on physical retail and consumer goods (H&amp;M, Gap, Levi&#8217;s, Dell, HP).</li>



<li><strong>Luxury vs. everyday.</strong> Gemini remembers luxury brands more readily (Mercedes-Benz #10, Porsche, Ferrari, Cartier in top 50). Gemma favors mass-market brands (McDonald&#8217;s #6, Visa #15, Under Armour, Puma in top 50).</li>
</ol>



<h3 class="wp-block-heading">Possible explanations</h3>



<ul class="wp-block-list">
<li><strong>Training data composition.</strong> Gemma 4 may have a different distribution of training data, with more weight on consumer-facing web content versus Gemini&#8217;s potentially broader or more curated corpus.</li>



<li><strong>Model size.</strong> Gemma 4 31B is probably smaller than Gemini 3 Flash, although Google has not published Gemini&#8217;s parameter count. Smaller models may default to more &#8220;obvious&#8221; or broadly recognized brands rather than luxury or niche ones.</li>



<li><strong>Alignment and tuning.</strong> Different RLHF/instruction tuning pipelines may influence which brands the model considers &#8220;representative&#8221; when asked for random examples.</li>
</ul>



<h2 class="wp-block-heading">What&#8217;s Next</h2>



<p>This study covers Phase 1 only &#8212; the seed survey. The full authority map (Phases 2-3: association graph construction and PageRank computation) has not yet been run on Gemma 4 data. As rate limits allow, we plan to:</p>



<ol class="wp-block-list">
<li>Complete the 100,000-run target for statistical parity with the Gemini study</li>



<li>Run the two-level association mapping on Gemma 4&#8217;s seed brands</li>



<li>Compute Personalized PageRank to produce a full Gemma 4 Brand Authority Index</li>



<li>Publish a direct comparison of the complete authority scores across both models</li>
</ol>



<p>The raw data and code for this analysis are available on request.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="589" height="585" src="https://dejan.ai/wp-content/uploads/2026/04/image-1.png" alt="" class="wp-image-2400" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-1.png 589w, https://dejan.ai/wp-content/uploads/2026/04/image-1-300x298.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-1-150x150.png 150w" sizes="auto, (max-width: 589px) 100vw, 589px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="763" height="375" src="https://dejan.ai/wp-content/uploads/2026/04/image-5.png" alt="" class="wp-image-2404" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-5.png 763w, https://dejan.ai/wp-content/uploads/2026/04/image-5-300x147.png 300w" sizes="auto, (max-width: 763px) 100vw, 763px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="763" height="379" src="https://dejan.ai/wp-content/uploads/2026/04/image-4.png" alt="" class="wp-image-2403" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-4.png 763w, https://dejan.ai/wp-content/uploads/2026/04/image-4-300x149.png 300w" sizes="auto, (max-width: 763px) 100vw, 763px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="765" height="379" src="https://dejan.ai/wp-content/uploads/2026/04/image-3.png" alt="" class="wp-image-2402" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-3.png 765w, https://dejan.ai/wp-content/uploads/2026/04/image-3-300x149.png 300w" sizes="auto, (max-width: 765px) 100vw, 765px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="809" height="349" src="https://dejan.ai/wp-content/uploads/2026/04/image-2.png" alt="" class="wp-image-2401" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-2.png 809w, https://dejan.ai/wp-content/uploads/2026/04/image-2-300x129.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-2-768x331.png 768w" sizes="auto, (max-width: 809px) 100vw, 809px" /></figure>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/gemma-4-brand-authority-map/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Chrome&#8217;s New Shopping Classifier</title>
		<link>https://dejan.ai/blog/google-shopping-classifier/</link>
					<comments>https://dejan.ai/blog/google-shopping-classifier/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Fri, 03 Apr 2026 07:34:43 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[eCommerce]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2390</guid>

					<description><![CDATA[One of our AI SEO hall-of-famers, Olivier de Segonzac from RESONEO has managed to gain access to Google&#8217;s shopping classifier model. We&#8217;ve examined the model, reverse engineered its inference pipeline and this article is what we found. Model Demo Below is a real-world implementation of the model tested by loading a shopping-related page and following [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>One of our AI SEO hall-of-famers, <a href="https://dejanmarketing.com/best-ai-seo-agencies/#:~:text=Olivier">Olivier de Segonzac</a> from <a href="https://www.resoneo.com/">RESONEO</a>, has managed to gain access to Google&#8217;s shopping classifier model. We examined the model and reverse-engineered its inference pipeline; this article describes what we found.</p>



<blockquote class="wp-block-quote has-body-font-family has-medium-font-size is-layout-flow wp-block-quote-is-layout-flow" style="font-style:normal;font-weight:400">
<p><strong>TL;DR</strong></p>



<ul class="wp-block-list">
<li>Newly shipped in Chrome.</li>



<li>Determines whether a web page is a shopping page or not.</li>



<li>Every page you visit gets scored. </li>



<li>Score is stored in Chrome&#8217;s history database.</li>



<li>Used to personalize user experience and recommendations.</li>



<li>Chrome splits your page into up to 10 chunks of ~100 words each and truncates every chunk to 64 tokens.</li>



<li>Roughly half the words never reach the model.</li>
</ul>
</blockquote>



<h2 class="wp-block-heading">Model Demo</h2>



<p>Below is a real-world implementation of the model, tested by loading a <a href="https://www.owayo.com/custom-cycling-jerseys.htm">shopping-related page</a> and following Chrome&#8217;s native logic of 10 passages at 64 tokens per passage.</p>



<figure class="wp-block-video"><video height="824" style="aspect-ratio: 936 / 824;" width="936" autoplay loop muted src="https://dejan.ai/wp-content/uploads/2026/04/20260403-0729-38.7304427.mp4" playsinline></video></figure>



<h2 class="wp-block-heading">The Pipeline</h2>



<p>The classifier doesn&#8217;t look at raw HTML. It doesn&#8217;t look at the DOM directly either. Chrome uses a structured content extraction system called <code>AnnotatedPageContent</code>, accessible via the Chrome DevTools Protocol method <code>Page.getAnnotatedPageContent</code>. This system walks the rendered page and produces a tree of typed content nodes: text, tables, image captions.</p>



<p>The full pipeline looks like this:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>Rendered Page
  → Blink AnnotatedPageContent extraction (5 seconds after load)
  → Text nodes collected from content tree
  → Greedy word-count chunking into passages
  → SentencePiece tokenization (64 tokens per passage)
  → Passage Embedder (TFLite) → 768-dim vectors
  → Mean pooling + title/URL embedding concatenation → 1536-dim input
  → Shopping Classifier (TFLite) → probability score (0 to 1)
</code></pre>



<h2 class="wp-block-heading">How Pages Are Chunked</h2>



<p>There is no semantic segmentation. Chrome uses a greedy word counter. Text items from the content tree are accumulated into a passage until the word count reaches 100, then a new passage starts. Items shorter than 5 words are always appended to the current passage rather than starting a new one.</p>



<p>The limits:</p>



<ul class="wp-block-list">
<li>100 words max per passage</li>



<li>5 words min per text item to trigger a new passage</li>



<li>10 passages max per page</li>
</ul>



<p>Everything beyond the first 10 passages is discarded.</p>
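
<p>The chunking rules above can be approximated with a short greedy loop. This is a simplified reading of the behaviour described here, not Chrome&#8217;s source: the name <code>chunk_passages</code> is hypothetical, text items are assumed to arrive as a flat list of strings, and the 100-word limit is treated as a threshold checked between items, so a passage can overshoot it.</p>

```python
MAX_WORDS_PER_PASSAGE = 100
MIN_WORDS_PER_ITEM = 5
MAX_PASSAGES = 10


def chunk_passages(text_items):
    """Greedily pack text items into passages, keeping at most MAX_PASSAGES."""
    passages = []  # finished passages
    current = []   # words of the passage being built
    for item in text_items:
        words = item.split()
        if not words:
            continue
        # A full passage closes only when a long-enough item arrives;
        # short items never open a new passage and ride along with the current one.
        if len(current) >= MAX_WORDS_PER_PASSAGE and len(words) >= MIN_WORDS_PER_ITEM:
            passages.append(" ".join(current))
            current = []
            if len(passages) == MAX_PASSAGES:
                break  # everything after the tenth passage is discarded
        current.extend(words)
    if current and len(passages) < MAX_PASSAGES:
        passages.append(" ".join(current))
    return passages


# Two 60-word blocks, a 2-word fragment, then a 10-word block.
items = ["word " * 60, "word " * 60, "tiny item", "word " * 10]
lengths = [len(p.split()) for p in chunk_passages(items)]

# Twenty-five 100-word blocks: only the first ten passages survive.
capped = chunk_passages(["word " * 100] * 25)
```

<p>Whether Chrome splits inside a single oversized text item is not covered by this sketch.</p>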



<h2 class="wp-block-heading">The Tokenizer Bottleneck</h2>



<p>Each passage is tokenized with SentencePiece and then truncated to 64 tokens. An EOS token is appended if there&#8217;s room, and shorter sequences are zero-padded.</p>



<p>64 tokens translates to roughly 35–50 English words depending on vocabulary complexity. Product names and brand-heavy text tokenize less efficiently (around 35 words), while natural prose gets closer to 50.</p>



<p>This means each 100-word passage loses roughly half its content at the tokenizer stage. At 35–50 words per passage, 10 passages add up to 350–500 words, so the model effectively sees about 400–450 words of a page that may contain thousands.</p>
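
<p>The truncate-then-pad step can be sketched as follows. It operates on token IDs that have already been produced by a tokenizer; <code>fit_to_window</code> is a hypothetical name, and the EOS ID of 1 is a placeholder, since the real value depends on the SentencePiece vocabulary Chrome ships.</p>

```python
MAX_TOKENS = 64
PAD_ID = 0  # zero-padding, as described above
EOS_ID = 1  # hypothetical value; the real id depends on the vocabulary


def fit_to_window(token_ids):
    """Truncate to MAX_TOKENS, append EOS only if there is room, then zero-pad."""
    ids = list(token_ids[:MAX_TOKENS])
    if len(ids) < MAX_TOKENS:
        ids.append(EOS_ID)
    ids.extend([PAD_ID] * (MAX_TOKENS - len(ids)))
    return ids


short = fit_to_window(range(100, 110))  # 10 tokens: EOS and padding are added
long = fit_to_window(range(100, 300))   # 200 tokens: cut at 64, no EOS
```

<p>Every passage therefore reaches the embedder as exactly 64 integers, and anything past the 64th token of a passage is lost.</p>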



<h2 class="wp-block-heading">The Embedder</h2>



<p>The passage embedder (<code>OPTIMIZATION_TARGET_PASSAGE_EMBEDDER</code>) is a TFLite DualEncoder transformer model. It takes <code>int32[1, 64]</code> token IDs as input and outputs a <code>float32[1, 768]</code> embedding vector. The same model embeds both the page passages and the title/URL string.</p>



<p>The title/URL input is constructed by concatenating the page title and URL with a separator: <code>"Page Title - https://example.com/path"</code>.</p>



<h2 class="wp-block-heading">The Classifier</h2>



<p>The shopping classifier takes a <code>float32[1, 1536]</code> input vector, which is two 768-dim embeddings concatenated:</p>



<ul class="wp-block-list">
<li>First 768 dimensions: title/URL embedding</li>



<li>Last 768 dimensions: mean-pooled passage embeddings</li>
</ul>



<p>Multiple passage embeddings are combined using element-wise mean pooling. This is specified in the model&#8217;s metadata (<code>pooling_strategy = POOLING_STRATEGY_MEAN</code>, <code>max_passages = 10</code>).</p>
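
<p>The input construction can be illustrated with plain Python lists. This is a sketch under stated assumptions: the helper names are hypothetical, and the stand-in vectors below replace real embedder output.</p>

```python
EMBEDDING_DIM = 768
MAX_PASSAGES = 10


def mean_pool(vectors):
    """Element-wise mean over a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_classifier_input(title_url_embedding, passage_embeddings):
    """Concatenate the title/URL embedding with the mean-pooled passage embeddings."""
    pooled = mean_pool(passage_embeddings[:MAX_PASSAGES])
    return list(title_url_embedding) + pooled


# Stand-in vectors; real ones would come from the passage embedder.
title_url = [0.5] * EMBEDDING_DIM
passages = [[1.0] * EMBEDDING_DIM, [3.0] * EMBEDDING_DIM]
features = build_classifier_input(title_url, passages)
```

<p>Because the passages are averaged, their order is lost and one strongly commercial passage is diluted by every non-commercial passage pooled with it.</p>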



<p>The output is a single float between 0 and 1 representing the probability that the page is a shopping page.</p>



<h2 class="wp-block-heading">Testing It</h2>



<p>I extracted both models from Chrome and built a Streamlit app that replicates the full pipeline. It uses Selenium to launch Chrome Canary, calls <code>Page.getAnnotatedPageContent</code> via CDP to get the same structured content Chrome uses internally, then runs the chunking, tokenization, embedding, and classification steps.</p>



<p>Results on a few test inputs:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Input</th><th>Score</th></tr></thead><tbody><tr><td>&#8220;Breaking news: earthquake hits California coast&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;How to learn Python programming for beginners&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;Wikipedia &#8211; History of the Roman Empire&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;BBC Sport &#8211; Premier League results and fixtures&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;Amazon.com: Apple iPhone 15 Pro Max 256GB&#8221;</td><td>1.0000</td></tr><tr><td>&#8220;Best deals on laptops this Black Friday &#8211; up to 50% off&#8221;</td><td>1.0000</td></tr><tr><td>dejan.ai</td><td>0.0000</td></tr><tr><td>owayo.com/custom-cycling-jerseys.htm</td><td>0.9998</td></tr></tbody></table></figure>



<p>The model produces sharp, confident separations despite the lossy input pipeline.</p>



<h2 class="wp-block-heading">What Chrome Does With the Score</h2>



<p>The shopping classification feeds two systems:</p>



<p><strong>Per-page annotation.</strong> The score is stored in Chrome&#8217;s history database as part of <code>VisitContentAnnotations</code>. This is used by History Journeys to cluster shopping visits together.</p>



<p><strong>User-level segmentation.</strong> Scores are aggregated over time by Chrome&#8217;s Segmentation Platform into a separate model (<code>OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER</code>). If a user is classified as a &#8220;shopping user,&#8221; Chrome enables commerce features: price tracking in the omnibox, price drop notifications, shopping insights in the side panel, and shopping cards on the new tab page.</p>



<p>The per-page classifier is a signal collector that builds a user-level shopping profile, which in turn gates which commerce features Chrome presents.</p>



<h2 class="wp-block-heading">Why This Matters for E-Commerce SEO</h2>



<p>If Chrome can&#8217;t identify your page as a shopping page from the first ~450 words of visible content, the visit may not count toward the shopping signals that gate commerce features like price tracking and shopping insights. Navigation menus, cookie banners, and boilerplate that appear early in the extracted content consume your token budget before the model reaches your product information. E-commerce sites that bury product signals below heavy navigation and promotional blocks risk being invisible to the classifier entirely.</p>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/google-shopping-classifier/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		<enclosure url="https://dejan.ai/wp-content/uploads/2026/04/20260403-0729-38.7304427.mp4" length="4853580" type="video/mp4" />

			</item>
		<item>
		<title>AI Brand Authority Index: Ranking 2.9 Million Brands by Associative Embeddedness in Gemini&#8217;s Memory</title>
		<link>https://dejan.ai/blog/brands/</link>
					<comments>https://dejan.ai/blog/brands/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sat, 28 Mar 2026 11:01:30 +0000</pubDate>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Mechanistic Interpretability]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2360</guid>

					<description><![CDATA[Abstract When a large language model is asked to &#8220;name 100 brands at random,&#8221; it doesn&#8217;t produce uniform randomness. It produces a distribution shaped by its training data, revealing which brands occupy the most cognitive real estate in the model&#8217;s parametric memory. We present a methodology for quantifying brand authority in AI memory using Personalized [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Abstract</h2>



<p>When a large language model is asked to &#8220;name 100 brands at random,&#8221; it doesn&#8217;t produce uniform randomness. It produces a distribution shaped by its training data, revealing which brands occupy the most cognitive real estate in the model&#8217;s parametric memory. We present a methodology for quantifying brand authority in AI memory using Personalized PageRank with seed-weighted teleportation. Phase 1 establishes seed brands through 200,000 independent recall surveys. Phase 2 constructs a two-level directed association graph. Phase 3 computes authority scores using sparse matrix power iteration across 2.9 million brand nodes. Manual quality control of 8,055 seed entries removes 2,163 junk artifacts produced by Gemini&#8217;s generation failures.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="423" src="https://dejan.ai/wp-content/uploads/2026/03/image-24-1024x423.png" alt="" class="wp-image-2385" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-24-1024x423.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-24-300x124.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-24-768x317.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-24.png 1442w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<div class="wp-block-buttons alignfull is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-ecbad910 wp-block-buttons-is-layout-flex" style="padding-top:var(--wp--preset--spacing--20);padding-bottom:var(--wp--preset--spacing--20)">
<div class="wp-block-button has-custom-width wp-block-button__width-25"><a class="wp-block-button__link has-large-font-size has-custom-font-size wp-element-button" href="https://authority.dejan.ai/">Dejan Authority Database</a></div>
</div>



<h2 class="wp-block-heading">1. Background</h2>



<p>PageRank models a random surfer who follows links across a graph. A node&#8217;s score depends on how many other nodes link to it and how authoritative those linking nodes are. The iterative computation converges on the stationary distribution of the random walk.</p>



<p>We apply this framework to brand recall in large language models. Instead of web pages and hyperlinks, our graph consists of brands and directed associations extracted from Google&#8217;s Gemini model. Instead of uniform teleportation, we use seed-weighted teleportation where brands the model recalls most frequently and earliest receive proportionally more random walk restarts.</p>



<h2 class="wp-block-heading">2. Phase 1: Establishing the Seed Set</h2>



<h3 class="wp-block-heading">2.1 The Recall Survey</h3>



<p>We conducted 200,000 independent runs against Google&#8217;s Gemini model (gemini-3-flash-preview), each with the same prompt:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>name 100 brands at random, one per line, all lowercase, no spaces, no hyphens, say nothing else</p>
</blockquote>



<p>Despite the instruction to respond &#8220;at random,&#8221; the model&#8217;s outputs are far from uniform. Brands like Google, Microsoft, and Nike appear in nearly every run, while obscure brands appear only once. This non-uniformity is the signal, not the noise.</p>



<h3 class="wp-block-heading">2.2 Seed Statistics</h3>



<p>From 200,000 runs, we extracted:</p>



<ul class="wp-block-list">
<li><strong>8,608 unique brands</strong> (the raw seed set)</li>



<li><strong>~20 million total mentions</strong></li>



<li>Per-brand metrics:
<ul class="wp-block-list">
<li><strong>Frequency</strong>: total mentions across all runs</li>



<li><strong>Distinct runs</strong>: number of unique runs containing the brand</li>



<li><strong>Average rank</strong>: mean position when the brand appears (1 = first recalled, 100 = last)</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading">2.3 Seed Weights</h3>



<p>Each seed brand receives an initial authority weight combining recall frequency and recall priority:</p>



<p>$$w_i = \hat{f}_i \times \hat{r}_i^{-1}$$</p>



<p>where:</p>



<ul class="wp-block-list">
<li>$\hat{f}_i = \frac{\text{distinct runs}_i}{\max(\text{distinct runs})}$ is the normalized recall frequency</li>



<li>$\hat{r}_i^{-1} = \frac{1/\text{avg rank}_i}{\max(1/\text{avg rank})}$ is the normalized inverse rank</li>
</ul>



<p>A brand recalled in every run AND recalled first receives a weight near 1.0. A brand recalled once at position 98 receives a weight near zero. These weights become the personalization vector for PageRank teleportation.</p>
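
<p>The weight formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function name <code>seed_weights</code>, the input layout, and the three example brands with their statistics are all hypothetical.</p>

```python
def seed_weights(stats):
    """Compute w_i = f_hat_i * r_hat_inv_i for each seed brand.

    stats maps brand -> (distinct_runs, avg_rank). The layout is illustrative.
    """
    max_runs = max(runs for runs, _ in stats.values())
    max_inv_rank = max(1.0 / rank for _, rank in stats.values())
    weights = {}
    for brand, (runs, rank) in stats.items():
        f_hat = runs / max_runs                    # normalized recall frequency
        r_hat_inv = (1.0 / rank) / max_inv_rank    # normalized inverse rank
        weights[brand] = f_hat * r_hat_inv
    return weights


# Hypothetical survey statistics, not figures from the study.
example = {
    "acme": (200_000, 2.0),     # recalled in every run, very early
    "globex": (100_000, 4.0),   # half the runs, later on average
    "initech": (1, 98.0),       # recalled once, near the end
}
weights = seed_weights(example)
```

<p>With these made-up inputs the first brand gets the maximum weight of 1.0, and the brand recalled once at position 98 ends up many orders of magnitude lower, which is the behaviour the personalization vector relies on.</p>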



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="691" height="280" src="https://dejan.ai/wp-content/uploads/2026/03/image-20.png" alt="" class="wp-image-2372" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-20.png 691w, https://dejan.ai/wp-content/uploads/2026/03/image-20-300x122.png 300w" sizes="auto, (max-width: 691px) 100vw, 691px" /></figure>



<h3 class="wp-block-heading">2.4 Seed Quality Control</h3>



<p>Raw Gemini output contained significant contamination. Manual review of all 8,055 seed entries (ranked by PageRank score) identified 2,163 junk entries (26.9% of the seed set) across several distinct failure modes:</p>



<p><strong>Concatenation artifacts</strong> — Gemini fused adjacent brand names together. The <code>coca*</code> prefix alone produced 11 variants: <code>cocaapple</code>, <code>cocaflops</code>, <code>cocaalcola</code>, <code>cocaicoca</code>, <code>cocaelsa</code>, <code>cocaiccola</code>, <code>cocaicola</code>, <code>cocaonla</code>, <code>cocaformula</code>, <code>cocaole</code>, <code>cocaocla</code>. The <code>visa*</code> prefix generated 80+ junk entries: <code>visafarm</code>, <code>visafold</code>, <code>visafans</code>, <code>visafacebook</code>, <code>visanetwork</code>, <code>visahub</code>, <code>visawash</code>, <code>visacard</code>, <code>visafocus</code>, <code>visaglobal</code>, <code>visamatte</code>, <code>visaeurope</code>, and dozens more. Similarly, <code>hp*</code> produced 100+ entries (<code>hpmicrolab</code>, <code>hpmillett</code>, <code>hpmachines</code>, <code>hpmilwaukee</code>), and <code>tesla*</code> generated 30+ (<code>teslatotalsenergies</code>, <code>teslouisvuitton</code>, <code>teslacoil</code>, <code>teslapump</code>).</p>



<p><strong>Inner monologue leakage</strong> — Gemini&#8217;s internal reasoning about character constraints leaked into output as literal brand entries. Over 200 entries followed the pattern <code>雀巢 (parenthetical self-correction)</code>:</p>



<ul class="wp-block-list">
<li><code>雀巢 (actually nestle, switching to latin)</code></li>



<li><code>雀巢 (oops, sticking to alphabet)</code></li>



<li><code>雀巢 (replaced with nestle, wait, no spaces/hyphens only)</code></li>



<li><code>雀巢 (thinking of brands...)</code></li>



<li><code>雀巢 (just kidding)</code></li>



<li><code>雀巢 (actually nestle, replace with kpmg)</code></li>
</ul>



<p>These entries are the model&#8217;s chain-of-thought about the CJK string <code>雀巢</code> (Nestlé&#8217;s name in Chinese) bleeding through as output tokens.</p>



<p><strong>Typos and garbled names</strong> — <code>toyote</code> (toyota), <code>hundai</code> (hyundai), <code>adidsa</code> (adidas), <code>luluemon</code> (lululemon), <code>rebok</code> (reebok), <code>porche</code> (porsche), <code>royleroyce</code> (rollsroyce), <code>senheiser</code> (sennheiser).</p>



<p><strong>Mixed-script artifacts</strong> — Partial CJK character insertion mid-brand: <code>home固定depot</code>, <code>pizza动hut</code>, <code>dr控martens</code>, <code>estee固定lauder</code>, <code>western吐igital</code>, <code>cooler避master</code>.</p>



<p><strong>HTML/prompt leaks</strong> — Model markup and instructions appearing as brands: <code>hugo&lt;/thought&gt;apple</code>, <code>hugo&lt;/p&gt;</code>, and most remarkably: <code>unite 100 brands at random, one per line, all lowercase, no spaces, no hyphens, say nothing else</code> — the model echoed its own prompt as a brand name.</p>



<p><strong>Generic words</strong> — <code>luxury</code>, <code>all</code>, <code>delivery</code>, <code>generic</code>, <code>detergent</code>, <code>pudding</code> — words that aren&#8217;t brands.</p>



<p><strong>Why this matters for PageRank</strong>: Junk seeds receive direct teleportation mass every iteration (alpha=0.15). A garbage entry like <code>cocaapple</code> at rank 789 receives the same structural boost as <code>lecreuset</code> at rank 790. Without filtering, junk seeds contaminate the authority signal at the core of the algorithm. The 2,163 entries were loaded into a <code>brand_ignore</code> table and excluded from the personalization vector during PageRank computation.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="705" height="291" src="https://dejan.ai/wp-content/uploads/2026/03/image-21.png" alt="" class="wp-image-2373" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-21.png 705w, https://dejan.ai/wp-content/uploads/2026/03/image-21-300x124.png 300w" sizes="auto, (max-width: 705px) 100vw, 705px" /></figure>



<h2 class="wp-block-heading">3. Phase 2: Constructing a Two-Level Association Graph</h2>



<h3 class="wp-block-heading">3.1 Level 1 (L1): Seed Associations</h3>



<p>For each effective seed (5,892 after filtering), we queried Gemini:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>name 100 brands most closely associated with [brand], ordered from most to least associated, one per line, all lowercase, no spaces, no hyphens, say nothing else</p>
</blockquote>



<p>This produced ~860,000 directed edges. These associations are genuinely asymmetric: Apple&#8217;s association with Beats (which it owns) carries different positional weight than Beats&#8217; association with Apple.</p>



<h3 class="wp-block-heading">3.2 Level 2 (L2): Discovered Brand Associations</h3>



<p>Brands discovered at L1 that weren&#8217;t original seeds were themselves queried for their associations. This second pass dramatically expanded the graph into the long tail. A brand like <code>titois</code> (a Turkish textile company) appeared as an L1 association of <code>vice</code>, and when queried at L2, generated its own set of 100 associations including <code>vuteks</code> — another Turkish industrial brand that would never surface in a consumer-focused recall survey.</p>



<p>The full discovery chain for any brand can be traced: <code>vice</code> (seed) → <code>titois</code> (L1) → <code>vuteks</code> (L2).</p>



<h3 class="wp-block-heading">3.3 Graph Scale</h3>



<p>The resulting graph contains:</p>



<ul class="wp-block-list">
<li><strong>2,886,212 unique brand nodes</strong></li>



<li><strong>Millions of directed weighted edges</strong> across L1 and L2</li>



<li><strong>5,892 effective seeds</strong> (after ignoring 2,163 junk entries)</li>



<li><strong>~201,000 L1 brands</strong> discovered through seed associations</li>



<li><strong>~2.68 million L2 brands</strong> discovered through L1 associations</li>
</ul>



<h3 class="wp-block-heading">3.4 Canonicalization</h3>



<p>Brand names required normalization before graph construction:</p>



<ul class="wp-block-list">
<li><strong>Cyrillic homoglyph mapping</strong>: Characters like <code>а</code> (Cyrillic) mapped to <code>a</code> (Latin) to merge visually identical variants</li>



<li><strong>CJK+Latin mixed-script filtering</strong>: Entries mixing Chinese/Japanese/Korean characters with Latin text flagged as junk</li>



<li><strong>Manual aliases</strong>: 15 CJK-to-Latin mappings for legitimate brands (e.g., <code>雀巢</code> → <code>nestle</code>)</li>



<li><strong>Variant tracking</strong>: 193,070 name variants mapped to canonical forms, preserving display names while merging duplicates</li>
</ul>
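
<p>The normalization steps listed above could look roughly like the following sketch. It is an illustration under stated assumptions, not the production pipeline: the homoglyph table is a tiny hypothetical subset, the alias table holds only the one example from the list, and script detection is deliberately coarse (it only recognizes Han ideographs as CJK, not kana or hangul).</p>

```python
import unicodedata

# Tiny illustrative Cyrillic-to-Latin homoglyph table; a real one would be larger.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
})

# Manual alias for a legitimate CJK brand name (the example from the list above).
ALIASES = {"雀巢": "nestle"}


def scripts(name):
    """Return the coarse scripts used by the letters in name, e.g. {'LATIN', 'CJK'}."""
    found = set()
    for ch in name:
        if ch.isalpha():
            # Unicode names start with the script, e.g. 'LATIN SMALL LETTER A'.
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def canonicalize(raw):
    """Return (canonical_name, is_junk) for one raw brand string."""
    name = raw.strip().lower()
    if name in ALIASES:
        return ALIASES[name], False
    name = name.translate(HOMOGLYPHS)
    is_junk = {"LATIN", "CJK"}.issubset(scripts(name))  # mixed-script entry
    return name, is_junk
```

<p>Checking aliases before the mixed-script test matters here: a pure CJK name with a known Latin equivalent is merged, while a Latin name with stray CJK characters in the middle is flagged.</p>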



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="705" height="284" src="https://dejan.ai/wp-content/uploads/2026/03/image-22.png" alt="" class="wp-image-2374" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-22.png 705w, https://dejan.ai/wp-content/uploads/2026/03/image-22-300x121.png 300w" sizes="auto, (max-width: 705px) 100vw, 705px" /></figure>



<h2 class="wp-block-heading">4. Computing Personalized PageRank</h2>



<h3 class="wp-block-heading">4.1 Random Walk Model</h3>



<p>At each step of the random walk, a surfer either:</p>



<ul class="wp-block-list">
<li><strong>Teleports</strong> (probability alpha=0.15) — jumps to a seed brand, with probability proportional to that seed&#8217;s authority weight. Ignored seeds receive zero teleportation mass.</li>



<li><strong>Follows an edge</strong> (probability 1-alpha=0.85) — follows an outgoing association edge, weighted by inverse position. Position 1 associations receive more weight than position 100.</li>
</ul>



<h3 class="wp-block-heading">4.2 Edge Weights</h3>



<p>Association position determines edge weight. Brands listed earlier in Gemini&#8217;s association response receive proportionally more link equity via inverse position weighting. Each node&#8217;s outgoing edges are row-normalized to form a proper transition matrix.</p>
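
<p>As a rough sketch of this step, the snippet below turns one ordered association list into a row of transition probabilities. The exact weighting function is not given above, so <code>1 / position</code> is an assumption, and both the function name <code>transition_row</code> and the three-item list are hypothetical.</p>

```python
def transition_row(associations):
    """Turn one ordered association list into normalized outgoing edge weights.

    Assumes weight = 1 / position (1-based). This is one plausible reading of
    "inverse position weighting", not a confirmed formula.
    """
    raw = {}
    for position, target in enumerate(associations, start=1):
        # If a target repeats, keep its earliest (largest) weight.
        raw[target] = max(raw.get(target, 0.0), 1.0 / position)
    total = sum(raw.values())
    # Row normalization: outgoing weights of one node sum to 1.
    return {target: weight / total for target, weight in raw.items()}


# Hypothetical association list for a single source brand.
row = transition_row(["beats", "airpods", "iphone"])
```

<p>Each source brand contributes one such row, and stacking the rows gives the sparse transition matrix used by the random walk.</p>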



<h3 class="wp-block-heading">4.3 Dangling Nodes</h3>



<p>Brands with no outgoing edges (leaf nodes discovered at L2 but never queried) redistribute their accumulated mass according to the personalization vector, so it flows back to the seeds in proportion to their weights. This preserves the stochastic property of the transition matrix.</p>



<h3 class="wp-block-heading">4.4 Sparse Matrix Power Iteration</h3>



<p>The transition matrix is stored as a scipy CSR sparse matrix. Power iteration multiplies the current score vector by the transition matrix, adds the teleportation component, and repeats until convergence. The convergence criterion is that the L1 norm (sum of absolute differences, not to be confused with graph level L1) of the change between successive score vectors falls below 1e-8, which is typically achieved within 30-50 iterations.</p>
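
<p>A compact sketch of the iteration, assuming a row-stochastic CSR matrix whose dangling rows are all zero. The function name, the four-node toy graph, and the final rescaling by the maximum score (inferred from the top score of 1.000000 in the results) are assumptions for illustration rather than the actual implementation.</p>

```python
import numpy as np
from scipy.sparse import csr_matrix


def personalized_pagerank(P, personalization, alpha=0.15, tol=1e-8, max_iter=200):
    """Power iteration for personalized PageRank.

    P: row-stochastic CSR matrix; rows of dangling nodes are all zero.
    personalization: non-negative seed weights, zero for non-seeds.
    alpha: teleportation probability, following the convention in the text.
    """
    v = np.asarray(personalization, dtype=float)
    v = v / v.sum()
    dangling = np.asarray(P.sum(axis=1)).ravel() == 0
    PT = P.T.tocsr()
    x = v.copy()
    for _ in range(max_iter):
        dangling_mass = x[dangling].sum()
        # Follow edges with probability 1 - alpha; dangling mass and
        # teleportation both restart according to v.
        x_next = (1 - alpha) * (PT @ x + dangling_mass * v) + alpha * v
        delta = np.abs(x_next - x).sum()  # L1 norm of the change
        x = x_next
        if not delta > tol:
            break
    return x / x.max()  # assumed rescaling so the top node scores 1.0


# Toy graph: nodes 0 and 1 are seeds, node 2 is discovered, node 3 is dangling.
rows = [0, 0, 1, 1, 2]
cols = [1, 2, 0, 2, 3]
vals = [0.5, 0.5, 0.5, 0.5, 1.0]
P = csr_matrix((vals, (rows, cols)), shape=(4, 4))
scores = personalized_pagerank(P, personalization=[0.8, 0.2, 0.0, 0.0])
```

<p>Total probability mass stays at 1 on every iteration because the dangling mass is handed back to the seeds instead of being dropped.</p>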



<h3 class="wp-block-heading">4.5 Why Personalized PageRank</h3>



<p>Standard PageRank uses uniform teleportation — the random surfer restarts at any node with equal probability. Personalized PageRank biases the restart distribution toward specific nodes. In our case, seeds with higher recall frequency and earlier recall position receive more teleportation mass, making them stronger sources of authority in the network. Authority accumulates continuously from all reachable seeds, weighted by both seed authority and graph structure.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="708" height="285" src="https://dejan.ai/wp-content/uploads/2026/03/image-23.png" alt="" class="wp-image-2375" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-23.png 708w, https://dejan.ai/wp-content/uploads/2026/03/image-23-300x121.png 300w" sizes="auto, (max-width: 708px) 100vw, 708px" /></figure>



<h2 class="wp-block-heading">5. Results</h2>



<h3 class="wp-block-heading">5.1 Top 30 Brands</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Rank</th><th>Brand</th><th>Score</th></tr></thead><tbody><tr><td>1</td><td>Google</td><td>1.000000</td></tr><tr><td>2</td><td>Microsoft</td><td>0.983081</td></tr><tr><td>3</td><td>Nike</td><td>0.951061</td></tr><tr><td>4</td><td>Apple</td><td>0.876266</td></tr><tr><td>5</td><td>Adidas</td><td>0.700542</td></tr><tr><td>6</td><td>Sony</td><td>0.684061</td></tr><tr><td>7</td><td>Gucci</td><td>0.639839</td></tr><tr><td>8</td><td>Amazon</td><td>0.623930</td></tr><tr><td>9</td><td>Coca-Cola</td><td>0.590042</td></tr><tr><td>10</td><td>Chanel</td><td>0.570568</td></tr><tr><td>11</td><td>Prada</td><td>0.550746</td></tr><tr><td>12</td><td>Samsung</td><td>0.532741</td></tr><tr><td>13</td><td>Toyota</td><td>0.516163</td></tr><tr><td>14</td><td>Louis Vuitton</td><td>0.511476</td></tr><tr><td>15</td><td>Rolex</td><td>0.508761</td></tr><tr><td>16</td><td>Disney</td><td>0.507488</td></tr><tr><td>17</td><td>Hermes</td><td>0.487205</td></tr><tr><td>18</td><td>Dior</td><td>0.479031</td></tr><tr><td>19</td><td>Pepsi</td><td>0.442026</td></tr><tr><td>20</td><td>Intel</td><td>0.427143</td></tr><tr><td>21</td><td>Honda</td><td>0.420288</td></tr><tr><td>22</td><td>Patagonia</td><td>0.417196</td></tr><tr><td>23</td><td>Audi</td><td>0.405366</td></tr><tr><td>24</td><td>Panasonic</td><td>0.396073</td></tr><tr><td>25</td><td>Cartier</td><td>0.374052</td></tr><tr><td>26</td><td>Volkswagen</td><td>0.368643</td></tr><tr><td>27</td><td>Nintendo</td><td>0.361812</td></tr><tr><td>28</td><td>Porsche</td><td>0.360956</td></tr><tr><td>29</td><td>McDonald&#8217;s</td><td>0.344910</td></tr><tr><td>30</td><td>PUMA</td><td>0.330191</td></tr></tbody></table></figure>



<h3 class="wp-block-heading">5.2 Top Non-Seed Brands</h3>



<p>The highest-ranking brands that Gemini never recalled unprompted but discovered purely through association:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Rank</th><th>Brand</th><th>Score</th></tr></thead><tbody><tr><td>1</td><td>Maison Margiela</td><td>0.094542</td></tr><tr><td>2</td><td>Office</td><td>0.075253</td></tr><tr><td>3</td><td>L.L.Bean</td><td>0.074981</td></tr><tr><td>4</td><td>Cotopaxi</td><td>0.072272</td></tr><tr><td>5</td><td>Rick Owens</td><td>0.070130</td></tr><tr><td>6</td><td>Grand Seiko</td><td>0.066426</td></tr><tr><td>7</td><td>Bravia</td><td>0.059241</td></tr><tr><td>8</td><td>Jil Sander</td><td>0.058125</td></tr><tr><td>9</td><td>Mickey Mouse</td><td>0.057300</td></tr><tr><td>10</td><td>Richard Mille</td><td>0.055195</td></tr></tbody></table></figure>



<p>These brands score high not because the model recalls them spontaneously, but because they sit at dense intersections of associations from high-authority seeds.</p>



<h3 class="wp-block-heading">5.3 Scale</h3>



<ul class="wp-block-list">
<li>Total ranked brands: <strong>2,886,212</strong></li>



<li>Score range: 0.000000 to 1.000000</li>



<li>Seeds in top 30: 30/30</li>



<li>Non-seed brands discovered: <strong>2,880,320</strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="482" src="https://dejan.ai/wp-content/uploads/2026/03/image-19-1024x482.png" alt="PageRank NS" class="wp-image-2366" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-19-1024x482.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-19-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-19-768x361.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-19-1536x723.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-19.png 1908w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">6. What the Scores Measure</h2>



<p>The final scores capture <strong>associative embeddedness</strong> — a combination of:</p>



<ol class="wp-block-list">
<li><strong>Direct recall</strong> — Seeds that Gemini recalls frequently and early receive teleportation mass every iteration</li>



<li><strong>Centrality</strong> — Brands associated with many other high-authority brands accumulate more random walk traffic</li>



<li><strong>Network position</strong> — A brand with moderate recall but central positioning scores higher than a frequently recalled but isolated brand</li>
</ol>



<p>This is distinct from simple popularity or recall frequency. A brand like Maison Margiela ranks as the top non-seed brand not because Gemini recalls it unprompted, but because it sits at a dense intersection of luxury fashion associations — reachable from dozens of high-authority seeds via short, heavily-weighted paths.</p>



<p>The PageRank scores answer not &#8220;how often does the model think of this brand?&#8221; but &#8220;how deeply embedded is this brand in the model&#8217;s associative structure?&#8221;</p>



<h2 class="wp-block-heading">7. Technical Stack</h2>



<ul class="wp-block-list">
<li><strong>Model</strong>: Google Gemini 3 Flash Preview</li>



<li><strong>Phase 1</strong>: 200,000 recall surveys, 8,608 raw seeds, ~20M total mentions</li>



<li><strong>Phase 2</strong>: ~14,500 association queries (L1 + L2), millions of directed edges</li>



<li><strong>Graph</strong>: 2,886,212 nodes</li>



<li><strong>Algorithm</strong>: Personalized PageRank via scipy sparse matrix power iteration</li>



<li><strong>Teleportation factor (alpha)</strong>: 0.15</li>



<li><strong>Convergence tolerance</strong>: 1e-8</li>



<li><strong>Seed quality control</strong>: 2,163 junk seeds identified via manual review and excluded</li>



<li><strong>Canonicalization</strong>: Cyrillic homoglyph mapping, CJK filtering, 193,070 variant mappings, 15 manual CJK aliases</li>



<li><strong>Storage</strong>: SQLite (1.5GB)</li>



<li><strong>Dashboard</strong>: Streamlit with Plotly 3D network visualization</li>



<li><strong>Concurrency</strong>: 20 simultaneous async API calls with incremental database commits</li>
</ul>



<div class="wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button" href="https://authority.dejan.ai/">Dejan Authority Database</a></div>
</div>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/brands/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>TurboQuant: From Paper to Triton Kernel in One Session</title>
		<link>https://dejan.ai/blog/turboquant/</link>
					<comments>https://dejan.ai/blog/turboquant/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 07:16:09 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2351</guid>

					<description><![CDATA[Implementing Google&#8217;s KV cache compression algorithm on Gemma 3 4B and everything that went wrong along the way. On March 24, 2026, Google Research published a blog post introducing TurboQuant, a compression algorithm for large language model inference. The paper behind it, &#8220;Online Vector Quantization with Near-optimal Distortion Rate&#8221; had been on arXiv since April [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>Implementing Google&#8217;s KV cache compression algorithm on Gemma 3 4B and everything that went wrong along the way.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>On March 24, 2026, Google Research published a blog post introducing <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant</a>, a compression algorithm for large language model inference. The paper behind it, &#8220;<a href="https://arxiv.org/pdf/2504.19874">Online Vector Quantization with Near-optimal Distortion Rate</a>&#8221;, had been on arXiv since April 2025 and was accepted at <a href="https://iclr.cc/">ICLR 2026</a>. The claims were striking: compress the key-value cache to 3 bits per coordinate with zero accuracy loss, no training required, and up to 8x speedup on H100 GPUs.</p>



<p>I decided to implement it from scratch and see if the claims held up. They did, and then some.</p>



<h2 class="wp-block-heading">What Google Built</h2>



<p>Every time a transformer generates a token, it computes attention over all previous tokens. The key-value (KV) cache stores those previously computed states to avoid redundant work. As sequences get longer, this cache becomes a serious memory bottleneck: it grows linearly with sequence length and consumes precious GPU memory that could otherwise be used for larger batches or longer contexts.</p>



<p>Vector quantization is the obvious solution: compress the KV cache to fewer bits. But traditional quantization methods carry hidden overhead. They need to store normalization constants (zero points, scales) for every small block of data, typically adding 1-2 extra bits per number. At low bit-widths, this overhead can eat a significant chunk of the compression gains.</p>



<p>TurboQuant eliminates this overhead through a two-stage approach built on a clean mathematical insight.</p>



<p><strong>Stage 1 — Random rotation + Lloyd-Max quantization.</strong> The algorithm applies a random orthogonal rotation to each KV vector. This is the key trick: after rotation, each coordinate&#8217;s distribution becomes a known Beta distribution, concentrated near zero with a predictable shape that depends only on the vector dimension. Because the distribution is known analytically, you can precompute the optimal scalar quantizer (a Lloyd-Max quantizer) once and reuse it for every vector. No per-block normalization constants, no data-dependent calibration, no training. Just rotate and quantize.</p>



<p><strong>Stage 2 — QJL residual correction.</strong> The paper&#8217;s inner-product-optimized variant (TurboQuant_prod) applies a 1-bit Quantized Johnson-Lindenstrauss transform to the quantization residual. This gives an unbiased inner product estimator, which matters because attention scores are inner products. This stage requires a custom attention kernel to realize its benefits; you can&#8217;t just add the QJL correction back to the reconstructed vector (more on that later).</p>



<p>The theoretical backing is strong: TurboQuant&#8217;s MSE distortion is provably within a factor of ~2.7 of the information-theoretic lower bound. For a data-oblivious algorithm (one that doesn&#8217;t look at the data distribution), that&#8217;s essentially optimal.</p>



<h2 class="wp-block-heading">What We Built</h2>



<p>We implemented TurboQuant from scratch in PyTorch and tested it on Gemma 3 4B IT running on an RTX 4090. The implementation has three layers, each building on the last:</p>



<p><strong>Layer 1: Core algorithm</strong> (<code>turboquant_core.py</code>). The random rotation, Lloyd-Max codebook computation, and quantize/dequantize operations. The codebook is built once for a given (dimension, bit-width) pair by running 300 iterations of Lloyd-Max optimization over a dense numerical grid of the Beta distribution. This takes a few seconds on CPU and the result is cached.</p>



<p><strong>Layer 2: Python KV cache integration</strong> (<code>turboquant_kv_cache.py</code>). A patched <code>DynamicCache</code> that quantizes key and value tensors on every <code>cache.update()</code> call. This is the simplest integration path: it works with any HuggingFace model and requires no model-specific code. The tradeoff is that it stores the dequantized fp16 tensors back in the cache, so you don&#8217;t save memory; you only simulate the accuracy impact of quantization.</p>



<p><strong>Layer 3: Triton fused kernel</strong> (<code>triton_attention.py</code> + <code>turboquant_fused.py</code>). A custom Triton kernel that computes attention scores directly from compressed uint8 key indices, never materializing fp16 keys. This is where the real memory and speed gains come from.</p>



<p>The fused kernel exploits a simple algebraic identity. Since the rotation matrix R is orthogonal:</p>



<p>$$\langle q, R^T \cdot \text{centroids}[\text{idx}] \rangle = \langle R \cdot q, \text{centroids}[\text{idx}] \rangle$$</p>



<p>Pre-rotate the query once with a single matmul, then the per-KV-position work reduces to a centroid table lookup and dot product. The Triton kernel does this across all sequence positions in parallel, loading uint8 indices (1 byte each) instead of fp16 values (2 bytes each), so it reads roughly 2x less data from GPU memory.</p>
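
<p>The identity is easy to check numerically. The NumPy sketch below uses a stand-in codebook and random indices rather than real quantized keys, and all sizes are arbitrary. Note that the transpose identity itself holds for any matrix; orthogonality is what makes R^T the inverse rotation, so that the left-hand side really is the dot product with the dequantized key.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_levels, seq_len = 64, 16, 10   # arbitrary illustrative sizes

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Stand-in codebook and quantized keys (random indices, not real data).
centroids = np.sort(rng.standard_normal(n_levels)) / np.sqrt(d)
idx = rng.integers(0, n_levels, size=(seq_len, d), dtype=np.uint8)
q = rng.standard_normal(d)

# Left side: dequantize each key (rotate back with R^T), then dot with q.
keys_rotated = centroids[idx]     # (seq_len, d), still in rotated space
keys = keys_rotated @ R           # each row is R^T applied to a rotated key
scores_dequant = keys @ q

# Right side: rotate the query once, then dot with the centroid lookups.
q_rot = R @ q
scores_fused = keys_rotated @ q_rot

max_abs_diff = np.abs(scores_dequant - scores_fused).max()
```

<p>The two score vectors agree to floating-point precision, but the second form never builds the dequantized key matrix.</p>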



<h2 class="wp-block-heading">Results</h2>



<h3 class="wp-block-heading">Core Algorithm Validation</h3>



<p>On synthetic vectors (d=256), the quantize-dequantize roundtrip quality:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Bits</th><th>Cosine Similarity</th><th>Inner Product Correlation</th><th>Compression</th></tr></thead><tbody><tr><td>2</td><td>0.940</td><td>0.945</td><td>15.5x</td></tr><tr><td>3</td><td>0.983</td><td>0.984</td><td>10.4x</td></tr><tr><td>4</td><td>0.995</td><td>0.995</td><td>7.9x</td></tr></tbody></table></figure>
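
<p>The compression column can be reproduced with simple arithmetic. The figures are consistent with a float32 baseline plus one fp16 norm stored per vector, but that accounting is inferred from the numbers rather than stated, so treat both defaults below as assumptions.</p>

```python
def compression_ratio(d, bits, baseline_bits=32, norm_bits=16):
    """Ratio of uncompressed to compressed size for one d-dimensional vector.

    Assumes a float32 baseline and one fp16 norm per vector. Both are
    inferred from the table, not confirmed by the text.
    """
    original = d * baseline_bits
    compressed = d * bits + norm_bits
    return original / compressed


ratios = {bits: round(compression_ratio(256, bits), 1) for bits in (2, 3, 4)}
# {2: 15.5, 3: 10.4, 4: 7.9}
```

<p>Against an fp16 baseline the same accounting would give roughly half these ratios.</p>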



<h3 class="wp-block-heading">Triton Kernel Microbenchmark</h3>



<p>The fused kernel vs standard dequantize-then-matmul, measuring just the Q@K^T operation:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>KV Length</th><th>Standard</th><th>Fused</th><th>Speedup</th></tr></thead><tbody><tr><td>128</td><td>0.076ms</td><td>0.066ms</td><td>1.15x</td></tr><tr><td>512</td><td>0.061ms</td><td>0.050ms</td><td>1.22x</td></tr><tr><td>1024</td><td>0.061ms</td><td>0.052ms</td><td>1.18x</td></tr><tr><td>4096</td><td>0.062ms</td><td>0.051ms</td><td>1.22x</td></tr></tbody></table></figure>



<p>Cosine similarity between the kernel output and PyTorch reference: 1.000000. The kernel matches the reference to six decimal places.</p>



<h3 class="wp-block-heading">End-to-End Generation on Gemma 3 4B IT</h3>



<p>Three prompts: explain compilers vs interpreters, write a palindrome function, causes of the French Revolution. Each generated up to 200 tokens with greedy decoding.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Config</th><th>Avg tok/s</th><th>Output Quality</th><th>VRAM Delta</th></tr></thead><tbody><tr><td>fp16 baseline</td><td>17.7</td><td>reference</td><td>26 MB</td></tr><tr><td>4-bit Python path</td><td>13.8</td><td>correct, minor rephrase</td><td>19 MB</td></tr><tr><td>4-bit FUSED</td><td>16.5</td><td>identical to baseline</td><td>4 MB</td></tr><tr><td>2-bit Python path</td><td>14.0</td><td>some degradation</td><td>15 MB</td></tr><tr><td>2-bit FUSED</td><td>17.7</td><td>identical to baseline</td><td>7 MB</td></tr></tbody></table></figure>



<p>The 2-bit fused path produces character-for-character identical output to the fp16 baseline on all three prompts, at the same speed (17.7 tok/s). Across the two fused configurations, the measured VRAM delta drops from 26 MB to 4-7 MB, roughly 4-6x less.</p>



<h2 class="wp-block-heading">Technical Deep Dive</h2>



<h3 class="wp-block-heading">The Lloyd-Max Codebook</h3>



<p>After random rotation on the unit sphere S^{d-1}, each coordinate follows a Beta((d-1)/2, (d-1)/2) distribution on [-1, 1]. For large d (Gemma 3 uses d=256), this concentrates tightly around zero with standard deviation approximately 1/sqrt(d) ≈ 0.0625.</p>



<p>The codebook construction solves the continuous k-means problem for this distribution: partition [-1, 1] into 2^b intervals and find the centroid of each interval that minimizes weighted MSE under the Beta PDF. We use a dense grid (50,000 points) focused on the ±6σ range where the distribution has mass, then run standard Lloyd-Max iteration: assign grid points to nearest centroid, update centroids as weighted means, repeat.</p>
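
<p>A NumPy sketch of this procedure is shown below. It follows the description above (50,000-point grid over ±6σ, 300 iterations) but is not the contents of <code>turboquant_core.py</code>; the function name, the evenly spaced initialization, and the midpoint-based assignment are illustrative choices, and it is only intended for large d such as 256.</p>

```python
import numpy as np


def lloyd_max_codebook(d, bits, grid_points=50_000, iters=300):
    """Lloyd-Max centroids for one coordinate of a random unit vector in R^d."""
    sigma = 1.0 / np.sqrt(d)
    half_width = min(6.0 * sigma, 1.0)
    x = np.linspace(-half_width, half_width, grid_points)
    # Unnormalized coordinate density on [-1, 1]: (1 - x^2)^((d - 3) / 2).
    log_pdf = 0.5 * (d - 3) * np.log1p(-x * x)
    w = np.exp(log_pdf - log_pdf.max())

    levels = 2 ** bits
    # Start from evenly spaced centroids strictly inside the grid.
    c = np.linspace(-half_width, half_width, levels + 2)[1:-1]
    for _ in range(iters):
        # Nearest-centroid assignment in 1-D is a split at the midpoints.
        edges = (c[:-1] + c[1:]) / 2
        cell = np.searchsorted(edges, x)
        # Move each centroid to the density-weighted mean of its cell.
        mass = np.bincount(cell, weights=w, minlength=levels)
        moment = np.bincount(cell, weights=w * x, minlength=levels)
        c = np.where(mass > 0, moment / np.maximum(mass, 1e-300), c)
    return c


codebook = lloyd_max_codebook(d=256, bits=4)
spacing = np.diff(codebook)
```

<p>Because the density depends only on the dimension, the resulting codebook can be cached per (dimension, bit-width) pair and reused for every vector.</p>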



<p>The resulting codebook has an interesting structure — the centroids cluster densely near zero where the distribution is concentrated, with wider spacing in the tails. At 4 bits (16 levels), the centroid spacing near zero is approximately 0.008, providing very fine-grained reconstruction in the region where most values live.</p>



<h3 class="wp-block-heading">The Random Rotation</h3>



<p>The paper uses a randomized Hadamard transform (H · diag(signs)) for the rotation. We initially implemented this faithfully — and it was catastrophically slow. The Fast Walsh-Hadamard Transform is a series of butterfly operations, and our Python implementation executed each butterfly as a tensor slice operation. On GPU, this meant thousands of tiny CUDA kernel launches per rotation, with Python-level loop overhead between each one.</p>



<p>We replaced it with a precomputed random orthogonal matrix via QR decomposition. This is mathematically equivalent for our purposes: any uniformly random orthogonal rotation on S^{d-1} produces the same Beta distribution on coordinates. The QR matrix is d×d (256×256 float32 entries = 256KB, negligible), computed once from a seeded random Gaussian matrix, and the rotation becomes a single <code>torch.matmul</code>. Problem solved.</p>
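
<p>For illustration, here is the QR construction in NumPy (the post itself uses PyTorch). The helper name <code>random_rotation</code> and the sign correction on the columns are assumptions; the sign fix is a common way to make the result independent of the QR routine&#8217;s sign convention. The example rotates a maximally concentrated vector to show how its energy gets spread across coordinates.</p>

```python
import numpy as np


def random_rotation(d, seed=0):
    """Seeded random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    Q, R_upper = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return Q * np.sign(np.diag(R_upper))


d = 256
R = random_rotation(d)

# A one-hot unit vector: all of its energy sits in a single coordinate.
spiky = np.zeros(d)
spiky[0] = 1.0
rotated = R @ spiky

orthogonality_error = np.abs(R @ R.T - np.eye(d)).max()
coord_std = rotated.std()               # expected to be close to 1 / sqrt(d)
rotated_norm = np.linalg.norm(rotated)  # rotation preserves the norm
```

<p>After rotation the coordinates have a spread close to 1/sqrt(d) = 0.0625, which is the regime the precomputed codebook is designed for.</p>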



<p>A production implementation would use a structured rotation (Hadamard + random signs) with a fused CUDA kernel for the butterfly operations. The structured form is more memory-efficient (you only store the d random signs, not a d×d matrix) and the butterfly operations parallelize beautifully on GPU. But for a reference implementation, the dense matrix works fine.</p>



<h3 class="wp-block-heading">The Triton Kernel</h3>



<p>The kernel parallelizes over (query_head × batch, sequence_position_block). Each program instance:</p>



<ol class="wp-block-list">
<li>Loads a slice of the pre-rotated query vector (BLOCK_D elements)</li>



<li>Loads the corresponding key indices for BLOCK_S sequence positions (uint8)</li>



<li>Gathers centroid values via table lookup (<code>tl.load(C_ptr + k_idx)</code>)</li>



<li>Accumulates the partial dot product</li>



<li>Multiplies by key norms and the attention scale factor</li>
</ol>
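
<p>The five steps above can be mirrored in a plain NumPy reference, which is useful for checking a kernel&#8217;s output. This is a single-head sketch and not the Triton code: the function name, argument layout, and test sizes are hypothetical, and the outer loop stands in for the parallel program instances.</p>

```python
import numpy as np


def fused_scores_reference(q_rot, k_idx, k_norms, centroids, scale,
                           block_s=64, block_d=64):
    """NumPy reference for attention scores computed from quantized keys.

    q_rot: (d,) pre-rotated query; k_idx: (seq, d) uint8 centroid indices;
    k_norms: (seq,) per-key norms; centroids: (levels,) codebook.
    """
    seq_len, d = k_idx.shape
    scores = np.zeros(seq_len, dtype=np.float32)
    for s0 in range(0, seq_len, block_s):       # one "program" per sequence block
        acc = np.zeros(min(block_s, seq_len - s0), dtype=np.float32)
        for d0 in range(0, d, block_d):
            q_slice = q_rot[d0:d0 + block_d]                    # step 1
            idx_tile = k_idx[s0:s0 + block_s, d0:d0 + block_d]  # step 2
            k_tile = centroids[idx_tile]                        # step 3: table lookup
            acc += k_tile @ q_slice                             # step 4: partial dot
        scores[s0:s0 + block_s] = acc * k_norms[s0:s0 + block_s] * scale  # step 5
    return scores


# Random stand-in inputs; the sizes are arbitrary and not multiples of the block.
rng = np.random.default_rng(0)
seq_len, d, levels = 100, 128, 16
centroids = np.sort(rng.standard_normal(levels)).astype(np.float32)
k_idx = rng.integers(0, levels, size=(seq_len, d), dtype=np.uint8)
k_norms = rng.uniform(0.5, 2.0, size=seq_len).astype(np.float32)
q_rot = rng.standard_normal(d).astype(np.float32)
scale = np.float32(d ** -0.5)

scores = fused_scores_reference(q_rot, k_idx, k_norms, centroids, scale)
expected = (centroids[k_idx] @ q_rot) * k_norms * scale
```

<p>Comparing <code>scores</code> with the unblocked <code>expected</code> value confirms that tiling over the head dimension only changes the order of accumulation, not the result.</p>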



<p>The autotuner searches over 5 configurations of (BLOCK_S, BLOCK_D) and warp count. On the RTX 4090, it typically selects BLOCK_S=64, BLOCK_D=64 with 4 warps.</p>



<p>The key efficiency win is memory bandwidth. Loading uint8 indices requires 1 byte per element; loading fp16 keys requires 2 bytes. The centroid table (16 float32 values at 4-bit, or 4 values at 2-bit) fits comfortably in L1/L2 cache and is reused across all sequence positions. The net effect is roughly 2x less data movement from GPU memory (the RTX 4090 uses GDDR6X rather than HBM), which translates to the observed ~1.2x speedup on the Q@K^T operation.</p>



<h3 class="wp-block-heading">GQA Handling</h3>



<p>Gemma 3 4B uses Grouped Query Attention with 8 query heads and 4 KV heads (ratio 2:1). The kernel handles this by mapping each query head to its corresponding KV head: <code>kv_head = q_head // gqa_ratio</code>. The key indices and norms are loaded from the KV head, while queries come from the query head. This means each KV head&#8217;s compressed data is read twice (once per query head in its group), but since it&#8217;s small (uint8), the redundant reads are cheap.</p>
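
<p>The head mapping is small enough to write out in full. The constant names are illustrative; the head counts are the ones quoted above.</p>

```python
NUM_Q_HEADS = 8
NUM_KV_HEADS = 4
GQA_RATIO = NUM_Q_HEADS // NUM_KV_HEADS  # 2 query heads share each KV head

# Same integer division the kernel uses: kv_head = q_head // gqa_ratio.
kv_head_for = {q_head: q_head // GQA_RATIO for q_head in range(NUM_Q_HEADS)}
# {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```

<p>Query heads 0 and 1 both read KV head 0, heads 2 and 3 read KV head 1, and so on, which is why each KV head&#8217;s compressed data is read twice.</p>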



<h3 class="wp-block-heading">Cache Architecture</h3>



<p>The fused integration stores keys in compressed form (uint8 indices + fp16 norms per vector) and values in standard fp16. We only compress keys because the attention score computation (Q@K^T) is where the memory bandwidth bottleneck lives during decoding. The softmax@V multiplication is less critical because it&#8217;s compute-bound rather than memory-bound at typical sequence lengths.</p>



<p>A fully optimized implementation would also compress values, but the gains are smaller and the integration is more complex (you&#8217;d need a second Triton kernel for the softmax@V step with compressed values).</p>



<h2 class="wp-block-heading">What Didn&#8217;t Work</h2>



<h3 class="wp-block-heading">Mistake 1: Adding QJL Back to the Reconstructed Vector</h3>



<p>The paper describes two variants: TurboQuant_mse (pure Lloyd-Max, best for reconstruction) and TurboQuant_prod (Lloyd-Max + 1-bit QJL, best for inner products). Our first implementation used TurboQuant_prod for the KV cache: (bits-1) bits of Lloyd-Max plus 1 bit of QJL on the residual.</p>



<p>The QJL stage produces a correction term that makes the inner product estimator unbiased. But when you add this correction back to the reconstructed vector and store it in the KV cache, you&#8217;re injecting noise into the vector itself. The result: cosine similarity dropped to 0.69 (terrible) and the model produced garbage.</p>



<p>The fix was simple: use TurboQuant_mse (all bits to Lloyd-Max) for the drop-in cache, and reserve TurboQuant_prod for a custom attention kernel that can use the two-part representation directly. The fused Triton kernel implements the MSE variant.</p>



<h3 class="wp-block-heading">Mistake 2: Gemma 3 4B Is Not a Causal LM</h3>



<p>We initially loaded the model with <code>AutoModelForCausalLM</code> and <code>AutoTokenizer</code>. This loaded the model fine, tokenized fine, and even generated — but every output token was <code>&lt;pad&gt;</code> (token ID 0). The baseline and quantized paths both produced identical pad sequences.</p>



<p>Gemma 3 4B+ is a multimodal model. It requires <code>Gemma3ForConditionalGeneration</code> and <code>AutoProcessor</code>, not the causal LM variants. The <code>AutoProcessor</code> handles the chat template correctly and returns the right token format. This wasn&#8217;t a quantization bug at all — the model simply wasn&#8217;t being invoked correctly.</p>



<h3 class="wp-block-heading">Mistake 3: Python-Loop Hadamard Transform</h3>



<p>The Fast Walsh-Hadamard Transform is O(d log d) butterfly operations. Our initial implementation ran each butterfly as a Python loop iteration with tensor slicing:</p>



<pre class="wp-block-code"><code># result: tensor whose last dimension has size d (a power of two)
h = 1
while h &lt; d:
    stride = 2 * h
    for start in range(0, d, stride):
        lo = slice(start, start + h)
        hi = slice(start + h, start + stride)
        a = result&#91;..., lo].clone()
        b = result&#91;..., hi].clone()
        result&#91;..., lo] = a + b
        result&#91;..., hi] = a - b
    h *= 2
</code></pre>



<p>For d=256, this is 8 outer stages whose inner loops run 128, 64, &#8230;, 1 times (255 butterfly blocks in total), each launching several tiny CUDA operations (two clones, an add, a subtract, and two slice writes). That is well over a thousand kernel launches per transform, with Python interpreter overhead between each one. On a KV cache update touching 26 layers × 4 KV heads × 256-dim vectors, the GPU was spending more time waiting for Python than doing math. Generation hung completely: even a 20-token completion with a trivial prompt didn&#8217;t return.</p>



<p>Replacing this with a single <code>x @ Q_T</code> matmul using a precomputed orthogonal matrix made it instant.</p>
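
<p>The reason a single matmul can stand in for the butterfly loop is that the Walsh-Hadamard transform is just multiplication by a fixed matrix. The sketch below checks this in pure Python for a small d. It covers the plain Hadamard case only; the <code>Q_T</code> used in the actual code may be a different orthogonal matrix, and the helper names here are invented for the example.</p>

```python
def fwht(x):
    """Unnormalised fast Walsh-Hadamard transform (length must be a power of 2)."""
    out = list(x)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

def hadamard(d):
    """Dense d x d Hadamard matrix via the Sylvester construction."""
    H = [[1]]
    while len(H) < d:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

d = 8
x = [float(i * i - 3) for i in range(d)]
H = hadamard(d)                 # precomputed once, reused for every vector
via_butterfly = fwht(x)
via_matmul = matvec(H, x)

assert all(abs(a - b) < 1e-9 for a, b in zip(via_butterfly, via_matmul))
```

<p>On a GPU the dense form costs O(d²) multiply-adds instead of O(d log d), but it is one kernel launch instead of hundreds, which is what matters at d=256.</p>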



<h3 class="wp-block-heading">Mistake 4: Subclassing DynamicCache</h3>



<p>Our first KV cache integration subclassed HuggingFace&#8217;s <code>DynamicCache</code>. This broke immediately because Gemma 3&#8217;s model code calls <code>past_key_values.is_initialized</code>, <code>past_key_values.key_cache</code>, and other attributes whose names and semantics change across transformers versions. Our subclass was missing several of these.</p>



<p>We tried three approaches:</p>



<ul class="wp-block-list">
<li>Subclassing <code>DynamicCache</code> (broke on <code>.is_initialized</code>)</li>



<li>Forward hooks on attention layers (fragile, couldn&#8217;t reliably find the cache object)</li>



<li>Patching <code>cache.update()</code> on a stock <code>DynamicCache</code> instance (worked perfectly)</li>
</ul>



<p>The final approach is the cleanest: create a normal <code>DynamicCache</code>, save a reference to its <code>update</code> method, and replace it with a wrapper that quantizes inputs before calling the original. All the cache&#8217;s internal bookkeeping (sequence length tracking, layer indexing) works unchanged.</p>
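
<p>The patching pattern itself is only a few lines. To keep the sketch runnable without transformers installed, it uses a hypothetical <code>ToyCache</code> whose <code>update</code> signature is modeled on <code>DynamicCache.update(key_states, value_states, layer_idx, cache_kwargs)</code>, and a rounding function as a placeholder for the real quantizer. Both are assumptions for illustration only.</p>

```python
class ToyCache:
    """Hypothetical stand-in for a KV cache exposing update(); not the HF class."""

    def __init__(self):
        self.keys, self.values = {}, {}

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        self.keys.setdefault(layer_idx, []).extend(key_states)
        self.values.setdefault(layer_idx, []).extend(value_states)
        return self.keys[layer_idx], self.values[layer_idx]


def install_quantizing_update(cache, quantize_fn):
    """Wrap cache.update on this one instance so inputs are quantized first."""
    original_update = cache.update  # bound method: the cache's own bookkeeping

    def patched_update(key_states, value_states, layer_idx, cache_kwargs=None):
        return original_update(quantize_fn(key_states), quantize_fn(value_states),
                               layer_idx, cache_kwargs)

    cache.update = patched_update  # instance attribute shadows the class method
    return cache


# Placeholder "quantizer": round to one decimal, standing in for TurboQuant.
def fake_quantize(states):
    return [round(s, 1) for s in states]

cache = install_quantizing_update(ToyCache(), fake_quantize)
keys, values = cache.update([0.123, 0.987], [1.0, 2.0], layer_idx=0)
print(keys, values)  # [0.1, 1.0] [1.0, 2.0]
```

<p>Because only the instance is patched, other caches created from the same class are unaffected, and nothing depends on which attributes the model code happens to read.</p>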



<h3 class="wp-block-heading">Mistake 5: Token Counting After Fused Generation</h3>



<p>The <code>FusedTurboQuantRunner</code> returns decoded text directly (not output IDs), so we tried <code>processor.encode(text)</code> to count tokens for the timing report. But <code>Gemma3Processor</code> is a multimodal processor — it has <code>decode</code> but not <code>encode</code>. The tokenizer lives at <code>processor.tokenizer.encode()</code>. A one-line fix, but it crashed the first successful fused generation and hid the results until the next run.</p>
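
<p>A small helper avoids this class of crash by reaching for the underlying tokenizer when one exists. The sketch below is hedged: <code>count_tokens</code> is a made-up name, and the two toy classes only imitate the shape of a processor with a <code>.tokenizer</code> attribute so the example runs without downloading a model.</p>

```python
def count_tokens(processor, text):
    """Count tokens via processor.tokenizer if present, else the object itself."""
    tokenizer = getattr(processor, "tokenizer", processor)
    return len(tokenizer.encode(text))


# Hypothetical stand-ins for a tokenizer and a multimodal processor.
class WhitespaceTokenizer:
    def encode(self, text):
        return text.split()


class ToyProcessor:
    tokenizer = WhitespaceTokenizer()  # note: no encode() on the processor


n = count_tokens(ToyProcessor(), "a compiler translates an entire program")
print(n)  # 6
```

<p>With a real Hugging Face tokenizer, <code>encode</code> may add special tokens by default, so the count can differ slightly from the number of generated tokens.</p>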



<h2 class="wp-block-heading">Comparison with Other Implementations</h2>



<p>Prince Canuma independently implemented TurboQuant in MLX and tested on Qwen 3.5 35B with context lengths up to 64K tokens. Their results: 6/6 exact match on needle-in-haystack at every quantization level, 4.9x smaller KV cache at 2.5-bit, 3.8x at 3.5-bit.</p>



<p>Two implementations, different frameworks (PyTorch+Triton vs MLX), different models (Gemma 3 4B vs Qwen 3.5 35B), different hardware (NVIDIA RTX 4090 vs Apple Silicon), and the same conclusion: TurboQuant&#8217;s theoretical guarantees held up in practice in both settings.</p>



<h2 class="wp-block-heading">What&#8217;s Next</h2>



<p>This implementation leaves several optimizations on the table:</p>



<p><strong>Value cache compression.</strong> We only compress keys. Compressing values would require a second Triton kernel for the softmax@V multiplication, but would further reduce memory usage.</p>



<p><strong>Structured rotation.</strong> The precomputed d×d orthogonal matrix works but uses O(d²) memory. A fused Hadamard kernel would use O(d) memory (just the random signs) and be faster for large d.</p>



<p><strong>Sub-byte packing.</strong> We store 2-bit indices as uint8. Packing 4 indices per byte would reduce memory by another 4x for the index storage.</p>
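
<p>For reference, 2-bit packing is straightforward to express. The sketch below is one possible layout (lowest bits first), not the one used in the downloadable code; the function names are illustrative.</p>

```python
def pack_2bit(indices):
    """Pack 2-bit indices (values 0..3) four to a byte, lowest bits first."""
    if len(indices) % 4:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(indices), 4):
        a, b, c, d = indices[i:i + 4]
        packed.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(packed)


def unpack_2bit(packed):
    """Inverse of pack_2bit."""
    out = []
    for byte in packed:
        out.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return out


indices = [3, 0, 1, 2, 2, 2, 0, 1]
packed = pack_2bit(indices)
assert unpack_2bit(packed) == indices  # round trip is lossless

# Storage for n=64 vectors of d=256 at 2 bits, plus one fp16 norm per vector.
n, d = 64, 256
packed_bytes = n * d // 4 + n * 2
print(len(packed), packed_bytes)  # 2 4224
```

<p>The 4,224-byte total is the same figure the self-test log prints for bits=2, which suggests that report already counts ideally packed storage rather than the one-byte-per-index layout used at runtime.</p>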



<p><strong>Flash Attention integration.</strong> The ultimate goal: fuse the centroid gather into a Flash Attention-style kernel that never materializes the full attention matrix. This would combine TurboQuant&#8217;s memory savings with Flash Attention&#8217;s IO efficiency.</p>



<p>The paper&#8217;s claim of 8x speedup on H100s comes from optimized int4 tensor core kernels — that level of hardware-specific optimization is beyond a one-session implementation, but the algorithmic foundation is solid and the path from here to production is clear.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Paper: <a href="https://arxiv.org/abs/2504.19874">TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate</a> (ICLR 2026)</em></p>



<p class="has-text-align-center"><em>Complete implementation including Triton kernel</em>:</p>



<div class="wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link has-text-align-center wp-element-button" href="https://dejan.ai/media/code/turboquant.zip">DOWNLOAD CODE</a></div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<pre class="wp-block-code" style="font-size:0.7rem"><code>python run_demo.py --fused --max-new-tokens 200 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -&gt; 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -&gt; 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -&gt; 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:02&lt;00:00, 304.27it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.52s  (15.0 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 6.06s  (11.9 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.73s  (14.4 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.71s  (12.9 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.24s  (16.0 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 11.41s  (17.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 14.36s  (13.9 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 12.18s  (16.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 6.20s  (13.9 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.72s  (17.2 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.92s  (17.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.78s  (13.9 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.85s  (15.8 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.85s  (14.1 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 9.15s  (17.0 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

PS C:\projects\tq&gt;</code></pre>



<p>In response to a <a href="https://x.com/ARomeoSierra/status/2036996932829171852?s=20">Twitter question</a>:</p>



<pre class="wp-block-code" style="font-size:0.7rem"><code>PS C:\projects\tq> python run_demo.py --fused --long-context --haystack-tokens 4096 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:03&lt;00:00, 274.55it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.93s  (13.8 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 5.86s  (12.3 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.63s  (14.7 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.68s  (13.1 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.17s  (16.3 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 10.91s  (18.3 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 13.76s  (14.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 11.78s  (17.1 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 5.97s  (14.4 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.28s  (17.8 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.55s  (18.3 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.21s  (14.5 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.43s  (16.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.56s  (14.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 8.92s  (17.5 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

======================================================================
Needle-in-a-haystack (~4096 tokens)
======================================================================
  fp16 baseline              &#91;FOUND]  1.0s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 4-bit           &#91;FOUND]  0.7s
    Answer: blue-giraffe-42
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers
  TurboQuant 4-bit FUSED     &#91;FOUND]  1.9s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 2-bit           &#91;FOUND]  1.1s
    Answer: The secret password is 'blue-giraffe-42'.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers
  TurboQuant 2-bit FUSED     &#91;FOUND]  1.4s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.

PS C:\projects\tq> python run_demo.py --fused --long-context --haystack-tokens 16384 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:03&lt;00:00, 285.34it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.32s  (15.7 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 5.94s  (12.1 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.70s  (14.5 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.75s  (12.8 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.19s  (16.2 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 11.10s  (18.0 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 13.94s  (14.3 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 12.02s  (16.7 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 6.13s  (14.0 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.54s  (17.4 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.80s  (17.7 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.47s  (14.2 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.68s  (16.1 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.92s  (14.0 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 9.19s  (17.0 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

======================================================================
Needle-in-a-haystack (~16384 tokens)
======================================================================
  fp16 baseline              &#91;FOUND]  2.5s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 4-bit           &#91;FOUND]  2.8s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers
  TurboQuant 4-bit FUSED     &#91;FOUND]  3.4s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 2-bit           &#91;FOUND]  2.8s
    Answer: The secret password for project Orion is ‘blue-giraffe-42’.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers
  TurboQuant 2-bit FUSED     &#91;FOUND]  3.0s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.

PS C:\projects\tq></code></pre>
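<p>The <code>Size</code> lines in the Stage 0 self-test can be reproduced with a little arithmetic. The sketch below is a reading of the log, not the demo's actual code: it assumes the uncompressed vectors are stored as 4-byte floats and that each quantized vector carries 2 bytes of metadata (an fp16 norm would fit, but the log does not say). The constant and function names are hypothetical.</p>

```python
# Reproduce the "Size" lines from the Stage 0 self-test log.
# Assumptions (inferred from the numbers, not stated in the log):
#   - the uncompressed vectors are stored as 4-byte floats
#   - each quantized vector keeps 2 extra bytes of metadata

N_VECTORS = 64                  # n=64 in the log
DIM = 256                       # d=256 in the log
BASELINE_BYTES_PER_VALUE = 4    # assumed float32 storage
OVERHEAD_BYTES_PER_VECTOR = 2   # assumed per-vector metadata


def baseline_bytes():
    """Bytes used by the uncompressed vectors."""
    return N_VECTORS * DIM * BASELINE_BYTES_PER_VALUE


def quantized_bytes(bits):
    """Bytes used by bit-packed indices plus the assumed per-vector overhead."""
    packed = N_VECTORS * DIM * bits // 8
    return packed + N_VECTORS * OVERHEAD_BYTES_PER_VECTOR


for bits in (2, 3, 4):
    size = quantized_bytes(bits)
    ratio = baseline_bytes() / size
    print(f"bits={bits}: {baseline_bytes():,} -> {size:,} bytes ({ratio:.1f}x)")
```

<p>Under those assumptions the loop prints 4,224, 6,272 and 8,320 bytes with ratios of 15.5x, 10.4x and 7.9x, matching the log. The fixed per-vector overhead is why the 2-bit ratio lands just under the ideal 16x.</p>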
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/turboquant/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Clickbait Titles Exploit Attention Through Latent Entities</title>
		<link>https://dejan.ai/blog/latent-entities/</link>
					<comments>https://dejan.ai/blog/latent-entities/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sun, 22 Mar 2026 12:20:49 +0000</pubDate>
				<category><![CDATA[Content]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2333</guid>

					<description><![CDATA[Every clickbait title works the same way: it removes at least one critical variable (the subject, the reason, the process, or the outcome) and charges you a click to fill the blank. This missing variable, which we call a latent entity, is so pervasive it has become normalized and nobody questions it anymore. You should! That [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h4 class="wp-block-heading">Every clickbait title works the same way: it removes at least one critical variable (the subject, the reason, the process, or the outcome) and charges you a click to fill the blank. This missing variable, which we call a <strong>latent entity</strong>, is so pervasive it has become normalized and nobody questions it anymore. You should!</h4>



<p>That was the direct answer to the title&#8217;s attention hook, the latent variable behind &#8220;how&#8221;.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://dejan.ai/wp-content/uploads/2026/03/image-13-1024x480.png" alt="" class="wp-image-2334" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-13-1024x480.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-13-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-13-768x360.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-13-1536x720.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-13.png 1908w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Every day, hundreds of millions of people scan headlines in feeds, aggregators, and search results. Most of these titles are not designed to inform. They are designed to withhold. Somewhere in the sentence, a critical piece of information has been surgically removed — the tool isn&#8217;t named, the result isn&#8217;t revealed, the reason isn&#8217;t given. The reader is left with an incomplete thought and a link. The click is the cost of completing it.</p>



<p>This mechanism is so pervasive that it has become invisible, like background noise. But it has a structure. And once you see the structure, you can&#8217;t unsee it.</p>



<h2 class="wp-block-heading">The attention transaction</h2>



<p>A title is a transaction. The author offers a premise. The reader pays with a click. The currency is attention, and the receipt is the missing piece of information the title promised but refused to deliver upfront.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://dejan.ai/wp-content/uploads/2026/03/image-14-1024x480.png" alt="" class="wp-image-2337" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-14-1024x480.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-14-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-14-768x360.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-14-1536x720.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-14.png 1903w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This is not metaphorical. The economics are literal. Every click generates a pageview. Every pageview generates ad impressions. Every ad impression generates revenue. The entire model is optimized not for informing the reader but for maximizing the probability that they click. The most reliable way to do that is to make the title incomplete — to create an information gap that can only be closed on the other side of the link.</p>



<p>The reader isn&#8217;t choosing to engage with content. They&#8217;re being charged an attention tax to access information that the title already had room to provide.</p>



<h2 class="wp-block-heading">Naming the structure: latent entities</h2>



<p>We can formalize what clickbait hides. In every withholding title, there is a <strong>latent entity</strong> — a variable the reader cannot resolve without clicking through. The title is the observed data. The latent entity is the unobserved variable. The click is the inference cost.</p>



<p>There are four types, and they are exhaustive.</p>



<h3 class="wp-block-heading">Latent Subject — <em>What?</em></h3>



<p>The title revolves around a specific thing — a tool, a setting, a feature, a list of items — but deliberately masks its identity behind a vague pronoun or a deferred noun.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;This one browser extension changed how I use the internet forever.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">What extension? You don&#8217;t know. That&#8217;s the transaction. The word &#8220;this&#8221; is doing the work of pointing at something while revealing nothing. The subject is latent.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;5 tools every developer needs in their workflow.&#8221;</strong></p>



<p class="wp-container-content-0733e5d0">Which five? The number creates the shape of an answer without filling it in. Five slots, all empty.</p>
</div>



<h3 class="wp-block-heading">Latent Reason — <em>Why?</em></h3>



<p>The title states a strong opinion, a regret, or an observation, but withholds the logic behind it. The reader is given a conclusion without its supporting argument.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I finally understand why Linux users swear by simple tools.&#8221;</strong></p>



<p class="wp-container-content-0733e5d0">The author has arrived at understanding. The reader has not. The reason is the hidden variable, and the only way to access it is to click.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;Package managers are the main reason I&#8217;ll never switch back to Windows.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">A bold claim with the mechanism removed. Why? What about package managers? The reason is latent.</p>
</div>



<h3 class="wp-block-heading">Latent Process — <em>How?</em></h3>



<p>The title presents an intriguing input and a desirable or unexpected output, but hides the method that connects them. The reader sees a before and an after with a gap in between.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I turned my old phone into a universal remote for my entire smart home.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">How? What app, what protocol, what steps? The transformation is stated as fact but the process is absent. The reader must click to learn the method.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;How a power drill defeated the Xbox 360&#8217;s console security.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The pairing of a crude physical tool with a sophisticated digital system is inherently surprising. The process that links them is the entire story, and it&#8217;s completely hidden.</p>
</div>



<h3 class="wp-block-heading">Latent Outcome — <em>What happened?</em></h3>



<p>The title sets up a scenario or experiment but cuts off before the resolution. The reader is dropped into a narrative with no ending.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I replaced all my productivity tools with a single app for a month.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">And? What happened? Did it work? Was it a disaster? The outcome is the only thing the reader wants, and it&#8217;s the only thing the title refuses to provide.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I ran local LLMs on a dying GPU and the results surprised me.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The word &#8220;surprised&#8221; is doing double duty: it confirms that an outcome exists and that it&#8217;s noteworthy, while revealing absolutely nothing about what it is. It is a content-free word masquerading as information.</p>
</div>



<p>Every clickbait title withholds at least one latent entity. Some withhold two — a title that hides both the process and the outcome forces the reader to pay twice for a single click. But the taxonomy is closed. Anything a title can hide maps to one of these four types: the subject (what?), the reason (why?), the process (how?), or the outcome (what happened?).</p>
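<p>The taxonomy can be expressed as a small tagger. What follows is a minimal sketch, not a production classifier: the cue patterns are hypothetical and hand-picked to fit the example titles in this article, and <code>LatentEntity</code>, <code>CUES</code> and <code>detect_latent_entities</code> are illustrative names.</p>

```python
import re
from enum import Enum


class LatentEntity(Enum):
    SUBJECT = "what?"
    REASON = "why?"
    PROCESS = "how?"
    OUTCOME = "what happened?"


# Hypothetical surface cues, chosen to fit the example titles in this article.
CUES = {
    LatentEntity.SUBJECT: [r"\bthis one\b", r"^\d+ \w+"],
    LatentEntity.REASON: [r"\bwhy\b", r"\breason\b"],
    LatentEntity.PROCESS: [r"^how\b", r"\bturned my\b.*\binto\b"],
    LatentEntity.OUTCOME: [r"\bsurprised\b", r"\bfor a (week|month|year)\b"],
}


def detect_latent_entities(title):
    """Return every latent entity type whose cue patterns match the title."""
    lowered = title.lower()
    return [
        entity
        for entity, patterns in CUES.items()
        if any(re.search(pattern, lowered) for pattern in patterns)
    ]


titles = [
    "This one browser extension changed how I use the internet forever.",
    "I finally understand why Linux users swear by simple tools.",
    "How a power drill defeated the Xbox 360's console security.",
    "I ran local LLMs on a dying GPU and the results surprised me.",
]
for title in titles:
    print([entity.name for entity in detect_latent_entities(title)], title)
```

<p>The function returns a list rather than a single label because a title can withhold more than one entity. Keyword cues like these will miss plenty of real headlines; a language model prompted with the four definitions would be the more realistic tagger.</p>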



<p>This isn&#8217;t a style guide or an editorial preference. It&#8217;s a structural property of how information is withheld to generate clicks.</p>



<h2 class="wp-block-heading">What happens after the click</h2>



<p>The damage doesn&#8217;t end with the transaction. Something happens cognitively when a reader lands on a page after a withholding title, and it isn&#8217;t engagement. It&#8217;s <a href="https://dejanmarketing.com/web-content/">scanning</a>.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="957" height="827" src="https://dejan.ai/wp-content/uploads/2026/03/image-15.png" alt="" class="wp-image-2338" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-15.png 957w, https://dejan.ai/wp-content/uploads/2026/03/image-15-300x259.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-15-768x664.png 768w" sizes="auto, (max-width: 957px) 100vw, 957px" /></figure>



<p>The reader arrives primed. They have a specific latent entity in mind — the hidden variable that brought them there — and their first instinct is to find it as fast as possible. They don&#8217;t read the introduction. They don&#8217;t absorb the context. They <a href="https://dejan.ai/blog/most-people-dont-read/">skip, skim, and scroll</a>, hunting for the one piece of information the title owed them.</p>



<p>This produces a jarring experience. The article, <a href="https://dejan.ai/blog/how-long-are-web-pages/">padded with backstory</a>, affiliate links, newsletter prompts, and SEO-optimized filler, is structured to delay the answer. The reader, already carrying the cognitive load of an unresolved question, is forced to work through friction that exists solely to generate more pageviews and ad impressions. The content&#8217;s structure and the reader&#8217;s intent are fundamentally misaligned.</p>



<p>The result is not engagement. It is extraction. The reader extracts the latent entity and leaves. The publisher extracts a pageview and an ad impression. Neither party has been well served. The reader resents the experience. The publisher has earned a visit but not trust.</p>



<h2 class="wp-block-heading">The ad-click economy made this rational</h2>



<p>None of this happened by accident. Withholding titles are the evolutionary product of an economy that rewards clicks over comprehension. When revenue is proportional to pageviews, every title becomes an optimization problem: maximize the probability of a click while minimizing the information given away for free.</p>



<p>Over two decades, this optimization produced the patterns we now see everywhere. Vague pronouns replaced specific nouns. Outcomes were teased but never stated. Reasons were promised but deferred. The entire craft of headline writing was reoriented from summarizing content to withholding it.</p>



<p>This was rational in a world where the title and the article were inseparable — where the only way to access the content was to visit the page. But that world is ending.</p>



<h2 class="wp-block-heading">AI changes the equation</h2>



<p>Large language models are rapidly becoming the <a href="https://dejan.ai/blog/llm-is-a-presentation-layer-in-ai-search/">intermediary layer between humans and content</a>. When a user asks an <a href="https://dejan.ai/blog/how-do-people-use-ai-assistants/">AI assistant a question</a>, the AI <a href="https://dejan.ai/blog/how-big-are-googles-grounding-chunks/">retrieves</a>, reads, and synthesizes sources on the user&#8217;s behalf. The human never visits the page. The click never happens. The latent entity is <a href="https://dejan.ai/blog/sro-grounding-snippets/">resolved by the model</a>, not by the reader.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="484" src="https://dejan.ai/wp-content/uploads/2026/03/image-16-1024x484.png" alt="" class="wp-image-2339" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-16-1024x484.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-16-300x142.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-16-768x363.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-16-1536x726.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-16.png 1883w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In this new architecture, withholding titles are not just exploitative. They are pointless and perhaps even <a href="https://dejan.ai/blog/sr/">harmful to visibility</a>. The AI doesn&#8217;t care about the information gap. It reads the article, extracts the answer, and delivers it without friction. The entire mechanism of clickbait — creating an artificial need that can only be resolved with a visit — collapses when the visitor is a machine that doesn&#8217;t see ads.</p>



<p>More importantly, AI systems can now decompose titles structurally, identify which latent entity is being withheld, check whether the article delivers on the title&#8217;s promise, and surface the answer directly. The asymmetry of information that clickbait depends on is being dissolved.</p>
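<p>A crude version of that promise check fits in a few lines. This is a sketch under stated assumptions, not a description of how any AI system does it: <code>promise_check</code> is a hypothetical helper, the answer is assumed to be already known, and a plain substring search stands in for real comprehension.</p>

```python
def promise_check(article_text, answer):
    """Report whether the article states the withheld answer, and how late.

    Hypothetical helper: `answer` is the resolved latent entity, for example
    the name of the tool that a title referred to only as "this one tool".
    """
    index = article_text.lower().find(answer.lower())
    if not answer or index == -1:
        return {"delivered": False, "position": None}
    # 0.0 means the answer opens the article; values near 1.0 mean it is buried.
    position = index / len(article_text)
    return {"delivered": True, "position": round(position, 2)}


# Toy article: ten filler sentences before the tool is finally named.
article = "Intro. " * 10 + "I use Nginx Proxy Manager."
print(promise_check(article, "Nginx Proxy Manager"))
print(promise_check(article, "Traefik"))
```

<p>The <code>position</code> value is the interesting part: an article that names the answer only in its final fifth is delaying the reveal, whatever its title promised.</p>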



<h2 class="wp-block-heading">A healthier paradigm</h2>



<p>If withholding titles evolved to serve the ad-click economy, then the question is: what should titles look like when that economy is no longer the only game?</p>



<p>The answer is straightforward. Titles should include the key information — the subject named, the reason stated, the outcome revealed — and invite the reader to explore further for depth, context, and nuance. The title earns the click by demonstrating value, not by ransoming it.</p>



<p>Consider the difference:</p>



<div class="wp-block-group has-background is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex" style="background-color:#eeded9">
<p class="wp-container-content-0733e5d0"><strong>&#8220;This one Docker tool finally fixed my reverse proxy headache&#8221;</strong> </p>



<p class="wp-container-content-0733e5d0">The subject is latent. <br>The reader must click to learn which tool.</p>
</div>



<div class="wp-block-group has-background is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex" style="background-color:#9bbf843b">
<p class="wp-container-content-0733e5d0"><strong>&#8220;Nginx Proxy Manager eliminated my reverse proxy headache — here&#8217;s my setup&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The subject is revealed. <br>The reader clicks to learn the details, not to discover what the tool is.</p>
</div>



<p>Both titles can generate traffic. But the second one respects the reader. It says: here is what I&#8217;m talking about, and if you want to know more, the article is worth your time. The first one says: I have something you want, and I won&#8217;t tell you what it is unless you pay me with your attention.</p>



<p>The second model is healthier for everyone. Readers arrive with aligned expectations instead of frustrated scanning instincts. Authors build trust instead of mining clicks. And the content itself can be structured around depth rather than around delaying the reveal. </p>



<h2 class="wp-block-heading">The web we could have</h2>



<p>Web authors have a choice. They can continue optimizing for an economy that is being disintermediated by AI, writing titles that withhold and articles that delay, hoping the click-and-ad model survives long enough to sustain them. Or they can recognize that the readers who remain — the ones who choose to visit a page when they <a href="https://dejan.ai/blog/human-friendly-content-is-ai-friendly-content/">could have asked an AI</a> — are the ones who deserve the most respect.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="588" src="https://dejan.ai/wp-content/uploads/2026/03/image-17-1024x588.png" alt="" class="wp-image-2340" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-17-1024x588.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-17-300x172.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-17-768x441.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-17.png 1267w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Those readers are not clicking because they were tricked. They&#8217;re clicking because they were informed. They know what the article is about. They want to go deeper. They trust the author enough to spend their time. And the <a href="https://dejan.ai/blog/caps/">money part</a> can be fixed too.</p>



<p>That is the audience worth building for. And it starts with killing the hidden variable.</p>



<pre class="wp-block-code alignwide has-contrast-background-color has-text-color has-background has-link-color has-small-font-size wp-elements-11272707f64f2d4a29f681518b294984" style="color:#65b831"><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">{</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"title"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "Clickbait Titles Exploit Attention Through Latent Entities"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"metadata"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: {</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"dimensions"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: &#91;</mark>
      "Clickbait titles exploit attention"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
      "Through latent entities"
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">],</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"attention_anchor"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "how"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"quantized"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "clickbait exploits attention by hiding one of four variable types"
<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">  },</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"how"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: &#91;</mark>
    "Every clickbait title withholds exactly one latent entity: subject (what?), reason (why?), process (how?), or outcome (what happened?)"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    "The click is the inference cost the reader pays to resolve the hidden variable"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    "AI dissolves this by reading the article and extracting the answer without the click"
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">],</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"promise_check"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: {</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"exploit attention"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "delivered — transactional mechanism explained with economic chain"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"through latent entities"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: </mark>"delivered — four-type taxonomy defined with examples"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"title practices what it preaches"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "delivered — subject revealed, mechanism stated, no hidden variable"
<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">  }
}</mark></code></pre>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/latent-entities/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Fanout Query Analysis</title>
		<link>https://dejan.ai/blog/fanout-query-analysis/</link>
					<comments>https://dejan.ai/blog/fanout-query-analysis/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Fri, 20 Mar 2026 01:58:01 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Keyword Research]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2314</guid>

					<description><![CDATA[When AI models like Gemini, GPT or Nova answer a question using web search, they don&#8217;t just run your query as-is. They generate their own internal search queries, or fanout queries. A single user prompt can trigger multiple fanout queries as the model breaks down the question, explores subtopics and verifies information. We captured 365,920 [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>When AI models like Gemini, GPT or Nova answer a question using web search, they don&#8217;t just run your query as-is. They generate their own internal search queries, or fanout queries. A single user prompt can trigger multiple fanout queries as the model breaks down the question, explores subtopics and verifies information.</p>



<p>We captured 365,920 of these fanout queries across three providers, Google (Gemini), OpenAI (GPT) and Amazon (Nova), by logging the grounding metadata returned from their APIs during citation mining runs. This data comes from real production workloads across multiple projects, not synthetic benchmarks.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="731" src="https://dejan.ai/wp-content/uploads/2026/03/image-2-1024x731.png" alt="" class="wp-image-2315" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-2-1024x731.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-2-300x214.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-2-768x549.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-2.png 1400w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Below is an analysis of how these providers differ in the queries they generate.</p>



<figure class="wp-block-table has-small-font-size"><table><thead><tr><th class="has-text-align-center" data-align="center">Provider</th><th class="has-text-align-center" data-align="center">Count</th><th class="has-text-align-center" data-align="center">Avg Chars</th><th class="has-text-align-center" data-align="center">Min</th><th class="has-text-align-center" data-align="center">Max</th><th class="has-text-align-center" data-align="center">1-3 words</th><th class="has-text-align-center" data-align="center">4-6 words</th><th class="has-text-align-center" data-align="center">7+ words</th></tr></thead><tbody><tr><td class="has-text-align-center" data-align="center"><strong>Google</strong></td><td class="has-text-align-center" data-align="center">158,186</td><td class="has-text-align-center" data-align="center">52</td><td class="has-text-align-center" data-align="center">0</td><td class="has-text-align-center" data-align="center">252</td><td class="has-text-align-center" data-align="center">4.5%</td><td class="has-text-align-center" data-align="center">30.6%</td><td class="has-text-align-center" data-align="center">64.9%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>OpenAI</strong></td><td class="has-text-align-center" data-align="center">207,174</td><td class="has-text-align-center" data-align="center">60</td><td class="has-text-align-center" data-align="center">6</td><td class="has-text-align-center" data-align="center">323</td><td class="has-text-align-center" data-align="center">3.4%</td><td class="has-text-align-center" data-align="center">20.8%</td><td class="has-text-align-center" data-align="center">75.8%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>Amazon</strong></td><td class="has-text-align-center" data-align="center">560</td><td class="has-text-align-center" data-align="center">59</td><td class="has-text-align-center" data-align="center">28</td><td class="has-text-align-center" 
data-align="center">198</td><td class="has-text-align-center" data-align="center">0.2%</td><td class="has-text-align-center" data-align="center">16.2%</td><td class="has-text-align-center" data-align="center">83.6%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>Total</strong></td><td class="has-text-align-center" data-align="center">~365,920</td><td class="has-text-align-center" data-align="center">56</td><td class="has-text-align-center" data-align="center">0</td><td class="has-text-align-center" data-align="center">323</td><td class="has-text-align-center" data-align="center">3.9%</td><td class="has-text-align-center" data-align="center">25.0%</td><td class="has-text-align-center" data-align="center">71.1%</td></tr></tbody></table></figure>
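
<p>The columns in the summary table can be reproduced from a raw list of queries. The sketch below is a minimal illustration, assuming words are separated by whitespace (the post does not say how words were counted); the function name <code>length_profile</code> and the sample queries are hypothetical.</p>

```python
from collections import Counter

BUCKETS = ("1-3 words", "4-6 words", "7+ words")

def length_profile(queries):
    """Summarise query lengths with the same columns as the table above."""
    counts = Counter()
    for query in queries:
        n_words = len(query.split())  # assumption: whitespace-separated words
        if n_words >= 7:
            counts["7+ words"] += 1
        elif n_words >= 4:
            counts["4-6 words"] += 1
        else:
            counts["1-3 words"] += 1  # empty queries also land here
    total = len(queries)
    chars = [len(query) for query in queries]
    return {
        "count": total,
        "avg_chars": round(sum(chars) / total),
        "min_chars": min(chars),
        "max_chars": max(chars),
        "shares": {name: round(100 * counts[name] / total, 1) for name in BUCKETS},
    }

# Hypothetical sample, not taken from the dataset.
sample = [
    "best running shoes",
    "how to choose running shoes for flat feet",
    "running shoe reviews 2026 comparison",
]
profile = length_profile(sample)
```

<p>Running the same function once per provider would give one table row each; a different tokenisation rule would shift queries between buckets.</p>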



<p><strong>Google (n=158,184)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>1</td><td class="has-text-align-right" data-align="right">53</td><td class="has-text-align-right" data-align="right">0.0%</td><td class="has-text-align-right" data-align="right">0.0%</td></tr><tr><td>2</td><td class="has-text-align-right" data-align="right">1,092</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.7%</td></tr><tr><td>3</td><td class="has-text-align-right" data-align="right">5,994</td><td class="has-text-align-right" data-align="right">3.8%</td><td class="has-text-align-right" data-align="right">4.5%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">14,916</td><td class="has-text-align-right" data-align="right">9.4%</td><td class="has-text-align-right" data-align="right">13.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">17,471</td><td class="has-text-align-right" data-align="right">11.0%</td><td class="has-text-align-right" data-align="right">25.0%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">15,923</td><td class="has-text-align-right" data-align="right">10.1%</td><td class="has-text-align-right" data-align="right">35.1%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">18,080</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">46.5%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">20,325</td><td class="has-text-align-right" data-align="right">12.8%</td><td class="has-text-align-right" data-align="right">59.3%</td></tr><tr><td>9</td><td class="has-text-align-right" 
data-align="right">20,013</td><td class="has-text-align-right" data-align="right">12.7%</td><td class="has-text-align-right" data-align="right">72.0%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">16,968</td><td class="has-text-align-right" data-align="right">10.7%</td><td class="has-text-align-right" data-align="right">82.7%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">11,740</td><td class="has-text-align-right" data-align="right">7.4%</td><td class="has-text-align-right" data-align="right">90.1%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">7,316</td><td class="has-text-align-right" data-align="right">4.6%</td><td class="has-text-align-right" data-align="right">94.8%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">4,043</td><td class="has-text-align-right" data-align="right">2.6%</td><td class="has-text-align-right" data-align="right">97.3%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">2,124</td><td class="has-text-align-right" data-align="right">1.3%</td><td class="has-text-align-right" data-align="right">98.7%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">1,146</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<p><strong>OpenAI (n=207,174)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>1</td><td class="has-text-align-right" data-align="right">616</td><td class="has-text-align-right" data-align="right">0.3%</td><td class="has-text-align-right" data-align="right">0.3%</td></tr><tr><td>2</td><td class="has-text-align-right" data-align="right">3,715</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">2.1%</td></tr><tr><td>3</td><td class="has-text-align-right" data-align="right">2,691</td><td class="has-text-align-right" data-align="right">1.3%</td><td class="has-text-align-right" data-align="right">3.4%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">7,360</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">6.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">14,516</td><td class="has-text-align-right" data-align="right">7.0%</td><td class="has-text-align-right" data-align="right">13.9%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">21,221</td><td class="has-text-align-right" data-align="right">10.2%</td><td class="has-text-align-right" data-align="right">24.2%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">26,544</td><td class="has-text-align-right" data-align="right">12.8%</td><td class="has-text-align-right" data-align="right">37.0%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">28,912</td><td class="has-text-align-right" data-align="right">14.0%</td><td class="has-text-align-right" data-align="right">51.0%</td></tr><tr><td>9</td><td class="has-text-align-right" 
data-align="right">27,861</td><td class="has-text-align-right" data-align="right">13.4%</td><td class="has-text-align-right" data-align="right">64.4%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">23,354</td><td class="has-text-align-right" data-align="right">11.3%</td><td class="has-text-align-right" data-align="right">75.7%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">17,875</td><td class="has-text-align-right" data-align="right">8.6%</td><td class="has-text-align-right" data-align="right">84.3%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">12,339</td><td class="has-text-align-right" data-align="right">6.0%</td><td class="has-text-align-right" data-align="right">90.3%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">7,983</td><td class="has-text-align-right" data-align="right">3.9%</td><td class="has-text-align-right" data-align="right">94.1%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">4,959</td><td class="has-text-align-right" data-align="right">2.4%</td><td class="has-text-align-right" data-align="right">96.5%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">5,228</td><td class="has-text-align-right" data-align="right">2.5%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<p><strong>Amazon (n=560)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>3</td><td class="has-text-align-right" data-align="right">1</td><td class="has-text-align-right" data-align="right">0.2%</td><td class="has-text-align-right" data-align="right">0.2%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">4</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">23</td><td class="has-text-align-right" data-align="right">4.1%</td><td class="has-text-align-right" data-align="right">5.0%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">64</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">16.4%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">102</td><td class="has-text-align-right" data-align="right">18.2%</td><td class="has-text-align-right" data-align="right">34.6%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">110</td><td class="has-text-align-right" data-align="right">19.6%</td><td class="has-text-align-right" data-align="right">54.3%</td></tr><tr><td>9</td><td class="has-text-align-right" data-align="right">113</td><td class="has-text-align-right" data-align="right">20.2%</td><td class="has-text-align-right" data-align="right">74.5%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">64</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">85.9%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">35</td><td 
class="has-text-align-right" data-align="right">6.2%</td><td class="has-text-align-right" data-align="right">92.1%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">20</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">95.7%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">9</td><td class="has-text-align-right" data-align="right">1.6%</td><td class="has-text-align-right" data-align="right">97.3%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">5</td><td class="has-text-align-right" data-align="right">0.9%</td><td class="has-text-align-right" data-align="right">98.2%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">10</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<h2 class="wp-block-heading"><strong>POS Distribution by Provider</strong></h2>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="439" src="https://dejan.ai/wp-content/uploads/2026/03/image-4-1024x439.png" alt="" class="wp-image-2321" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-4-1024x439.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-4-300x129.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-4-768x329.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-4.png 1400w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="788" height="679" src="https://dejan.ai/wp-content/uploads/2026/03/image-3.png" alt="" class="wp-image-2320" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-3.png 788w, https://dejan.ai/wp-content/uploads/2026/03/image-3-300x259.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-3-768x662.png 768w" sizes="auto, (max-width: 788px) 100vw, 788px" /></figure>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Group</th><th class="has-text-align-right" data-align="right">Google</th><th class="has-text-align-right" data-align="right">OpenAI</th><th class="has-text-align-right" data-align="right">Amazon</th></tr></thead><tbody><tr><td>Nouns</td><td class="has-text-align-right" data-align="right">52.3%</td><td class="has-text-align-right" data-align="right">58.4%</td><td class="has-text-align-right" data-align="right">50.2%</td></tr><tr><td>Verbs</td><td class="has-text-align-right" data-align="right">11.3%</td><td class="has-text-align-right" data-align="right">9.9%</td><td class="has-text-align-right" data-align="right">8.5%</td></tr><tr><td>Adjectives</td><td class="has-text-align-right" data-align="right">11.0%</td><td class="has-text-align-right" data-align="right">8.9%</td><td class="has-text-align-right" data-align="right">18.6%</td></tr><tr><td>Prepositions</td><td class="has-text-align-right" data-align="right">7.4%</td><td class="has-text-align-right" data-align="right">3.5%</td><td class="has-text-align-right" data-align="right">10.3%</td></tr><tr><td>Wh-words</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">2.1%</td><td class="has-text-align-right" data-align="right">1.5%</td></tr><tr><td>Numbers</td><td class="has-text-align-right" data-align="right">2.2%</td><td class="has-text-align-right" data-align="right">5.3%</td><td class="has-text-align-right" data-align="right">2.8%</td></tr><tr><td>Determiners</td><td class="has-text-align-right" data-align="right">2.6%</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">0.1%</td></tr><tr><td>Conjunctions</td><td class="has-text-align-right" data-align="right">1.6%</td><td class="has-text-align-right" data-align="right">0.6%</td><td class="has-text-align-right" 
data-align="right">2.4%</td></tr><tr><td>Adverbs</td><td class="has-text-align-right" data-align="right">0.6%</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">2.3%</td></tr><tr><td>Modals</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.5%</td><td class="has-text-align-right" data-align="right">0.0%</td></tr><tr><td>Pronouns</td><td class="has-text-align-right" data-align="right">1.2%</td><td class="has-text-align-right" data-align="right">0.9%</td><td class="has-text-align-right" data-align="right">0.1%</td></tr></tbody></table></figure>



<ul class="wp-block-list">
<li><strong>OpenAI is the most noun-heavy</strong> (58.4%), especially proper nouns (18.9% vs Google&#8217;s 8.6%) — it generates more entity-specific queries</li>



<li><strong>Amazon leans heavily into adjectives</strong> (18.6% vs ~10% for others) — more descriptive, qualifier-rich queries like &#8220;best,&#8221; &#8220;top,&#8221; &#8220;most effective&#8221;</li>



<li><strong>Google uses more wh-words and verbs</strong> — generates more question-style queries (&#8220;what,&#8221; &#8220;how,&#8221; &#8220;which&#8221;)</li>



<li><strong>OpenAI uses roughly twice as many numbers</strong> (5.3% vs 2.2% for Google and 2.8% for Amazon), likely year references and quantities in queries</li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/fanout-query-analysis/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Reverse Prompting: Reconstructing Prompts from AI-Generated Text</title>
		<link>https://dejan.ai/blog/reverse-prompting/</link>
					<comments>https://dejan.ai/blog/reverse-prompting/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Wed, 18 Mar 2026 06:51:29 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2303</guid>

					<description><![CDATA[We fine-tuned Google&#8217;s Gemma 3 (270M) to reverse the typical LLM workflow: given an AI-generated response, the model reconstructs the most likely prompt that produced it. We generated 100,000 synthetic prompt-response pairs using Gemini 2.5 Flash, trained for a single epoch on a consumer GPU, and built a Streamlit app that sweeps 24 decoding configurations [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>We fine-tuned <a href="https://huggingface.co/google/gemma-3-270m">Google&#8217;s Gemma 3 (270M)</a> to reverse the typical LLM workflow: given an AI-generated response, the model reconstructs the most likely prompt that produced it. We generated 100,000 synthetic prompt-response pairs using Gemini 2.5 Flash, trained for a single epoch on a consumer GPU, and built a Streamlit app that sweeps 24 decoding configurations to produce ranked prompt candidates. The model demo runs on CPU and is <a href="https://dejan.ai/tools/reverse-prompter/">available here</a>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Idea</h2>



<p>Large language models take prompts and produce responses. We wanted to see if a small model could learn to do the opposite: take a response and work backwards to the prompt.</p>



<p>The goal isn&#8217;t to recover the exact original prompt but to surface the most plausible prompts, ranked by model confidence. Think of it as asking: &#8220;What question would most naturally lead to this answer?&#8221;</p>



<h2 class="wp-block-heading">Training Data Generation</h2>



<p>The training pipeline has two stages, both powered by Gemini 2.5 Flash via Vertex AI.</p>



<p><strong>Stage 1: Prompt generation.</strong> We generated 100,000 diverse prompts across five categories designed to cover different user behaviours:</p>



<ul class="wp-block-list">
<li>Mid-tail, search query style (single or multi-faceted)</li>



<li>Long-tail, search query style (multi-faceted)</li>



<li>Simple, prompt-like (single-faceted)</li>



<li>Typical, prompt-like (single or multi-faceted)</li>



<li>Detailed, prompt-like (multi-faceted)</li>
</ul>



<p>Each API call generated a batch of 100 prompts as JSON with thinking disabled. We ran 100 concurrent calls, stored results in SQLite, and had the full dataset in minutes.</p>
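
<p>At 100 prompts per call, 100,000 prompts take 1,000 calls. The sketch below shows one way to cap concurrency at 100 and write results to SQLite. It is an illustration only: <code>generate_batch</code> is a hypothetical stand-in that fabricates placeholder strings instead of calling Gemini, and the table layout is assumed.</p>

```python
import asyncio
import json
import sqlite3

BATCH_SIZE = 100       # prompts per call, as described above
MAX_CONCURRENCY = 100  # concurrent calls, as described above

async def generate_batch(batch_id, size):
    """Hypothetical stand-in for one API call that returns a JSON array."""
    await asyncio.sleep(0)  # placeholder for network latency
    return json.dumps([f"prompt {batch_id}-{i}" for i in range(size)])

async def collect_prompts(total, db_path=":memory:"):
    """Generate `total` prompts in batches and store them in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS prompts (batch_id INTEGER, text TEXT)")
    gate = asyncio.Semaphore(MAX_CONCURRENCY)

    async def worker(batch_id):
        async with gate:  # at most MAX_CONCURRENCY calls in flight
            raw = await generate_batch(batch_id, BATCH_SIZE)
        rows = [(batch_id, text) for text in json.loads(raw)]
        conn.executemany("INSERT INTO prompts VALUES (?, ?)", rows)

    await asyncio.gather(*(worker(b) for b in range(total // BATCH_SIZE)))
    conn.commit()
    return conn

# Small run for illustration: 10 batches of 100 prompts.
conn = asyncio.run(collect_prompts(1000))
stored = conn.execute("SELECT COUNT(*) FROM prompts").fetchone()[0]
```

<p>A real run would replace the stub with the actual API client and add retries for failed or malformed batches.</p>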



<p><strong>Stage 2: Response generation.</strong> Each of the 100,000 prompts was sent back to Gemini 2.5 Flash to produce a corresponding AI assistant response. Same concurrency, same speed. The result: 100,000 prompt-response pairs ready for training.</p>



<h2 class="wp-block-heading">Data Preparation</h2>



<p>The key design decision was how to format the training data. We needed the model to learn a clear boundary between the response (input) and the prompt (target). We settled on a simple separator:</p>



<pre class="wp-block-code has-large-font-size"><code>{response}\n###\n{prompt}&lt;eos&gt;</code></pre>



<p>During tokenization, we masked the loss over the response and separator tokens (setting labels to -100) so the model only learns to predict the prompt portion. This is critical: without masking, the model would waste capacity learning to reproduce the response text rather than focusing on the reverse mapping.</p>
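
<p>The masking step can be sketched on plain lists of token ids. This is a minimal illustration, not the training code: the function name <code>build_example</code> and the ids are hypothetical, and a real pipeline would get them from the tokenizer. The value -100 matches the default <code>ignore_index</code> of PyTorch&#8217;s cross-entropy loss.</p>

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def build_example(response_ids, separator_ids, prompt_ids, eos_id):
    """Concatenate the segments and mask every label before the prompt."""
    input_ids = response_ids + separator_ids + prompt_ids + [eos_id]
    n_masked = len(response_ids) + len(separator_ids)
    # Only the prompt tokens and the end-of-sequence token contribute to the loss.
    labels = [IGNORE_INDEX] * n_masked + prompt_ids + [eos_id]
    return input_ids, labels

# Hypothetical ids: three response tokens, one separator token, two prompt tokens.
input_ids, labels = build_example([11, 12, 13], [7], [21, 22], eos_id=1)
```

<p>Both lists have the same length, so the labels line up position by position with the inputs.</p>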



<p>Sequences were capped at 2,048 tokens. Tokenization was batched in groups of 5,000 to manage memory, then concatenated into a single dataset.</p>
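
<p>Batched tokenization of this kind can be sketched as below. The helper names are hypothetical, and <code>tokenize</code> is any callable that turns one text into a list of tokens; the example uses <code>str.split</code> as a stand-in for a real tokenizer.</p>

```python
def chunked(items, size=5000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def tokenize_in_batches(texts, tokenize, size=5000):
    """Tokenize chunk by chunk, then concatenate into one dataset list."""
    dataset = []
    for chunk in chunked(texts, size):
        dataset.extend(tokenize(text) for text in chunk)
    return dataset

# Toy run with a whitespace tokenizer and a tiny chunk size.
tokens = tokenize_in_batches(["a b", "c"], str.split, size=1)
sizes = [len(chunk) for chunk in chunked(list(range(12)), 5)]
```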



<h2 class="wp-block-heading">Model Selection</h2>



<p>We chose Gemma 3 270M for several reasons:</p>



<ul class="wp-block-list">
<li><strong>Size.</strong>&nbsp;At 270M parameters, it&#8217;s small enough to train on a single consumer GPU and fast enough to run inference on CPU. This matters for a free demo.</li>



<li><strong>Architecture.</strong>&nbsp;Gemma 3 uses a mix of sliding window and full attention layers, giving it a good balance of local and global context within the 2,048-token window we trained on.</li>



<li><strong>Capability.</strong>&nbsp;Despite its size, Gemma 3 270M has a 262K vocabulary and was pretrained on enough data to have reasonable language understanding out of the box.</li>
</ul>



<p>A larger model would almost certainly perform better, but the goal was a practical tool that could run anywhere, not a benchmark result.</p>



<h2 class="wp-block-heading">Training</h2>



<p>Training was straightforward. Full fine-tune, single epoch, on an NVIDIA RTX 4090.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Value</th></tr></thead><tbody><tr><td>Method</td><td>Full fine-tune</td></tr><tr><td>Precision</td><td>bfloat16</td></tr><tr><td>Batch size</td><td>2 (effective 16 with gradient accumulation)</td></tr><tr><td>Learning rate</td><td>5e-5</td></tr><tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr><tr><td>Warmup steps</td><td>100</td></tr><tr><td>Gradient checkpointing</td><td>Enabled</td></tr><tr><td>Training time</td><td>4 hours 14 minutes</td></tr></tbody></table></figure>



<p>One epoch was sufficient. The loss curve showed steady convergence without signs of underfitting, and we wanted to avoid overfitting on synthetic data where the model might memorise specific phrasing patterns rather than learning the general reverse mapping.</p>
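
<p>For a sense of scale, the settings in the table imply a fixed number of accumulation and optimizer steps. The arithmetic below assumes all 100,000 pairs were used for training; the function name <code>schedule</code> is hypothetical.</p>

```python
import math

def schedule(num_examples, micro_batch, effective_batch, warmup_steps):
    """Derive step counts from the batch settings."""
    accumulation_steps = effective_batch // micro_batch
    optimizer_steps = math.ceil(num_examples / effective_batch)
    return {
        "accumulation_steps": accumulation_steps,
        "optimizer_steps": optimizer_steps,
        "warmup_fraction": warmup_steps / optimizer_steps,
    }

plan = schedule(num_examples=100_000, micro_batch=2, effective_batch=16, warmup_steps=100)
```

<p>Under that assumption, one epoch is 6,250 optimizer steps with 8 accumulation steps each, and the 100 warmup steps cover 1.6% of the run.</p>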



<h2 class="wp-block-heading">Inference Strategy</h2>



<p>A single generation pass from the model produces one candidate prompt. To get a diverse set of candidates, we sweep across 24 contrastive search configurations by varying two parameters:</p>



<ul class="wp-block-list">
<li><code>top_k</code>: [2, 4, 6, 15]</li>



<li><code>penalty_alpha</code>: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]</li>
</ul>



<p>Contrastive search balances token probability with a degeneration penalty, which encourages diverse yet coherent outputs. Different configurations produce different candidate prompts from the same input.</p>
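
<p>The 24 configurations are the Cartesian product of the two value lists. A minimal sketch follows; the function name <code>decoding_configs</code> is hypothetical. Assuming decoding runs through a Hugging Face transformers release that supports contrastive search (the post does not name the library), each dict could be passed as keyword arguments to <code>generate</code>.</p>

```python
from itertools import product

TOP_K = [2, 4, 6, 15]
PENALTY_ALPHA = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

def decoding_configs():
    """Return one keyword-argument dict per (top_k, penalty_alpha) pair."""
    return [
        {"top_k": k, "penalty_alpha": alpha}
        for k, alpha in product(TOP_K, PENALTY_ALPHA)
    ]

configs = decoding_configs()  # 4 values x 6 values = 24 configurations
```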



<p>Each candidate is then scored by perplexity: we run the full sequence (response + separator + generated prompt) through the model, compute the average token-level log probability over the prompt portion, and take the exponential of its negative. Lower perplexity means the model finds that prompt more natural given the response.</p>
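
<p>Given per-token log probabilities for the prompt portion, the scoring and ranking can be sketched as follows. Perplexity here is the standard definition, the exponential of the negative mean log probability. Collapsing identical prompts produced by different configurations is an assumption of this sketch, and the function names are hypothetical.</p>

```python
import math

def prompt_perplexity(token_logprobs):
    """Perplexity is the exponential of the negative mean log probability."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)

def rank_candidates(candidates, keep=10):
    """Rank (prompt, logprobs) pairs by ascending perplexity."""
    best = {}
    for prompt, logprobs in candidates:
        ppl = prompt_perplexity(logprobs)
        # Assumption: duplicate prompts keep their lowest perplexity.
        best[prompt] = min(ppl, best.get(prompt, ppl))
    return sorted(best.items(), key=lambda item: item[1])[:keep]

# Hypothetical log probabilities for two distinct candidate prompts.
candidates = [
    ("what is contrastive search", [math.log(0.5), math.log(0.5)]),
    ("explain decoding", [math.log(0.25)]),
    ("what is contrastive search", [math.log(0.5)]),
]
ranked = rank_candidates(candidates)
```

<p>The first entry of <code>ranked</code> is the candidate the model finds most natural given the response.</p>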



<p>The top 10 candidates are displayed with per-token confidence visualisation, where each word&#8217;s opacity reflects how confident the model was in predicting it.</p>



<h2 class="wp-block-heading">The Tool</h2>



<p>The Streamlit app has two modes.</p>



<p><strong>Paste mode</strong> is the primary interface. Paste any AI-generated text, click Reconstruct Prompts, and the model generates ranked candidates. The results include a prompt table with perplexity scores and per-token confidence bar charts, a key phrases panel that extracts the most important shared phrases across candidates, and a word frequency heatmap.</p>



<p><strong>URL mode</strong> is experimental. Enter a URL and the app scrapes the page content via the DataForSEO API, converts it to markdown, and runs it through the model. This isn&#8217;t the intended use case since the model was trained on AI assistant responses, not web pages. But it produces interesting results: the reconstructed &#8220;prompts&#8221; reveal what the model considers the core semantic intent of the page content. It&#8217;s less prompt reconstruction and more semantic summarisation through the lens of &#8220;what question would this page answer?&#8221;</p>



<h2 class="wp-block-heading">Possible Uses</h2>



<p><strong>Prompt engineering.</strong> Understanding what prompts lead to certain outputs helps refine prompt design. If you have an output you like, reverse prompting can suggest more efficient or precise ways to get there.</p>



<p><strong>Content analysis.</strong> Running web content through the model reveals what the model perceives as the core intent behind the text. This could be useful for understanding how AI models interpret and categorise content.</p>



<p><strong>AI content forensics.</strong> While this isn&#8217;t a detector (it doesn&#8217;t classify text as AI-generated or not), the confidence scores and perplexity values could serve as signals. Text that was genuinely produced by an AI assistant in response to a clear prompt may produce lower-perplexity reconstructions than text that wasn&#8217;t.</p>



<p><strong>Training data curation.</strong> When building datasets, reverse prompting can help verify that responses actually match their intended prompts, or surface cases where the mapping is ambiguous.</p>



<h2 class="wp-block-heading">Insights</h2>



<p>A few things we noticed during development:</p>



<p><strong>Synthetic data works.</strong> The model was trained entirely on Gemini-generated data and generalises to outputs from other models. The reverse mapping from response to prompt is more about structure and intent than model-specific quirks.</p>



<p><strong>Small models can learn non-trivial mappings.</strong> At 270M parameters, this model is tiny by current standards. Yet it reliably produces sensible prompt reconstructions. The task is well-constrained enough that a small model can handle it.</p>



<p><strong>Diversity in decoding matters more than model size.</strong> The contrastive search sweep across 24 configurations produces more useful results than a single greedy decode from a larger model would. The ranking by perplexity then surfaces the best candidates.</p>
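
<p>As a rough illustration of the sweep-then-rank idea, here is a minimal sketch. The grid values are assumptions (the post states only that there are 24 configurations), <code>top_k</code> and <code>penalty_alpha</code> are the contrastive search parameter names used by Hugging Face Transformers, and deduplicating identical reconstructions is our own simplification rather than a confirmed detail of the tool.</p>

```python
import itertools
import math

# Hypothetical grid: the article states 24 configurations but not their values.
TOP_K_VALUES = [2, 4, 6, 8]
PENALTY_ALPHAS = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]


def sweep_configs():
    """Return every (top_k, penalty_alpha) pair in the grid: 4 * 6 = 24."""
    return [
        {"top_k": k, "penalty_alpha": a}
        for k, a in itertools.product(TOP_K_VALUES, PENALTY_ALPHAS)
    ]


def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_candidates(candidates):
    """Keep the lowest perplexity per distinct text, then sort ascending."""
    best = {}
    for text, logprobs in candidates:
        ppl = perplexity(logprobs)
        if text not in best or best[text] > ppl:
            best[text] = ppl
    return sorted(best.items(), key=lambda item: item[1])


# Stubbed (text, token log-probs) pairs standing in for real decoder output.
candidates = [
    ("Explain RAG simply", [-0.4, -0.3, -0.5]),
    ("What is RAG?", [-0.2, -0.1, -0.3]),
    ("Explain RAG simply", [-0.3, -0.2, -0.4]),
]
for text, ppl in rank_candidates(candidates):
    print(f"{ppl:.3f}  {text}")
```

<p>In a real run, each configuration would be passed to the model&#8217;s generation call and the log-probabilities would come from scoring the generated prompt; both steps are stubbed here.</p>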



<p><strong>The separator matters.</strong> We tested several formats. The simple <code>\n###\n</code> separator worked best, likely because it&#8217;s distinct enough that the model learns a clean boundary between input and output.</p>
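
<p>A minimal sketch of how such a separator might be used to build training and inference strings. The function names are hypothetical, and the response-first ordering is an assumption based on the task (response in, prompt out), not a confirmed detail of the training script.</p>

```python
SEPARATOR = "\n###\n"  # the separator quoted in the article


def build_training_example(response, prompt):
    """Join a response (input) and its prompt (target) into one training string."""
    return response.strip() + SEPARATOR + prompt.strip()


def build_inference_input(response):
    """At inference the text ends with the separator; the model continues with the prompt."""
    return response.strip() + SEPARATOR


def extract_prompt(generated_text):
    """Return whatever follows the first separator in the full generated text."""
    _, _, prompt = generated_text.partition(SEPARATOR)
    return prompt.strip()


example = build_training_example(
    "Paris is the capital of France.", "What is the capital of France?"
)
print(repr(example))
print(extract_prompt(example))
```

<p>This sketch assumes the response text never contains the separator itself; a production version would need to escape or filter such cases.</p>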



<p>The model and code are available on <a href="https://huggingface.co/dejanseo/reverse-prompter" target="_blank" rel="noreferrer noopener">Hugging Face</a>, and a live demo runs at <a href="https://dejan.ai/tools/reverse-prompter/">https://dejan.ai/tools/reverse-prompter/</a>.</p>



<div class="wp-block-buttons alignfull is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button" href="https://dejan.ai/tools/reverse-prompter/">DEMO</a></div>
</div>



<p></p>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/reverse-prompting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Rufus &#8211; Under the Hood. What Drives Amazon’s AI Shopping Assistant?</title>
		<link>https://dejan.ai/blog/rufus/</link>
					<comments>https://dejan.ai/blog/rufus/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sun, 15 Mar 2026 01:11:45 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[eCommerce]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2290</guid>

					<description><![CDATA[What’s Publicly Known About the Pipeline, Backend, and Response Anatomy. Rufus is not “one model that magically answers.” Public Amazon/AWS descriptions point to a multi-component system: Speculative schema: Pipeline: request → answer Step A — Input + context assembly Public descriptions indicate customers can: Amazon also describes using conversational context and (more recently) account memory [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading">What’s Publicly Known About the Pipeline, Backend, and Response Anatomy.</h3>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Rufus is not “one model that magically answers.” Public Amazon/AWS descriptions point to a <strong>multi-component system</strong>:</p>



<ul class="wp-block-list">
<li><strong>A query planning / classification layer</strong> (Amazon/AWS call out a “query planner (QP) model”)</li>



<li><strong>Retrieval</strong> across multiple Amazon-owned sources (catalog, reviews, community Q&amp;A, Stores APIs) and sometimes web sources</li>



<li><strong>A foundation LLM</strong> that generates the natural-language response</li>



<li><strong>A streaming + rendering layer</strong> that formats answers and “hydrates” them with live store data</li>



<li><strong>Feedback-driven improvement</strong> (reinforcement learning from customer feedback)</li>
</ul>



<p>Speculative schema:</p>



<pre class="wp-block-code has-small-font-size"><code>User question
  -&gt; Query Planner (intent + retrieval plan)
    -&gt; Retrieval (catalog/reviews/Q&amp;A/Stores APIs/(sometimes web))
      -&gt; Foundation LLM (answer generation + display directives)
        -&gt; Streaming response (token-by-token)
          -&gt; Hydration (fill in product cards, prices, etc via internal systems)
            -&gt; Client UI (chat text + cards + actions + suggested questions)</code></pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Pipeline: request → answer</h2>



<h3 class="wp-block-heading">Step A — Input + context assembly</h3>



<p>Public descriptions indicate customers can:</p>



<ul class="wp-block-list">
<li>Type or speak questions in the Amazon Shopping app search bar / assistant chat bar</li>



<li>Start from <strong>pre-populated / suggested questions</strong> in the UI</li>



<li>Ask questions either broadly (“what do I need for…”) or specifically on a product page (where the product detail context matters)</li>
</ul>



<p>Amazon also describes using <strong>conversational context</strong> and (more recently) <strong>account memory</strong> features for personalization.</p>



<h3 class="wp-block-heading">Step B — Query planning (QP) before generation</h3>



<p>AWS’s ML blog post describes Rufus as having:</p>



<ul class="wp-block-list">
<li>A <strong>foundation LLM</strong> (for response generation)</li>



<li>A <strong>query planner (QP) model</strong> for <strong>query classification and retrieval enhancement</strong></li>



<li>QP is “on the critical path” because the system <strong>can’t start token generation</strong> until QP finishes</li>
</ul>



<p>That implies a gate: <strong>planning first</strong>, then generation.</p>



<h3 class="wp-block-heading">Step C — Retrieval-augmented generation (RAG)</h3>



<p>Amazon Science describes Rufus using <strong>retrieval‑augmented generation (RAG)</strong>:</p>



<ul class="wp-block-list">
<li>Before generating a response, the LLM <strong>selects information</strong> it expects will help answer the question.</li>



<li>Evidence sources explicitly called out include:
<ul class="wp-block-list">
<li><strong>Customer reviews</strong></li>



<li><strong>The product catalog</strong></li>



<li><strong>Community Q&amp;A</strong></li>



<li><strong>Stores APIs</strong> (calls to internal store systems)</li>
</ul>
</li>
</ul>



<p>About Amazon also describes using RAG to pull “insights and recommendations” from “popular sources” for some product/trend questions (they name examples like major publications).</p>



<p>What’s not disclosed publicly:</p>



<ul class="wp-block-list">
<li>How retrieval is ranked across sources</li>



<li>The retrieval index design</li>



<li>Exact prompting / grounding format</li>



<li>Exact guardrails for what external web content can be used and how</li>
</ul>



<h3 class="wp-block-heading">Step D — Response generation (LLM)</h3>



<p>Amazon Science says the team built a <strong>custom LLM specialized for shopping</strong>, trained primarily on shopping data (catalog + reviews + community Q&amp;A) plus curated public web information.</p>



<p>About Amazon also describes a <strong>model-mix</strong> approach:</p>



<ul class="wp-block-list">
<li>Built on <strong>Amazon Bedrock</strong></li>



<li>Using a <strong>real-time router</strong> that can select among multiple LLMs (they explicitly name models like Anthropic’s Claude Sonnet, Amazon Nova, plus a custom model)</li>
</ul>



<p>So the public picture is: <strong>custom shopping model exists</strong>, and there may also be <strong>dynamic model selection</strong> depending on query type / latency / quality targets.</p>



<h3 class="wp-block-heading">Step E — Streaming + “hydration” + UI rendering</h3>



<p>Amazon Science describes a “streaming architecture”:</p>



<ul class="wp-block-list">
<li>Responses are <strong>streamed token-by-token</strong> (so the user sees the beginning while the rest is still generating).</li>



<li>The system “hydrates” the response by <strong>querying internal systems</strong> to populate the stream with the right data.</li>



<li>Crucially: Rufus is trained to generate <strong>markup instructions</strong> specifying how answer elements should be displayed, not just the text.</li>
</ul>



<p>This is the key “anatomy of a Rufus response” insight: <strong>the model output is both content and layout directives</strong>, and the backend fills in live store objects (prices, items, links, etc.) during streaming.</p>
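
<p>To make the idea concrete, here is a purely illustrative sketch of a hydration step. The <code>{{card:ID}}</code> directive syntax, the <code>STORE</code> lookup, and the event shapes are all invented for this example; Amazon has not published its markup, and the sketch also assumes each directive arrives as its own chunk rather than being split across tokens.</p>

```python
import re

# Hypothetical directive syntax; Amazon has not published the real markup.
DIRECTIVE = re.compile(r"\{\{card:([A-Za-z0-9-]+)\}\}")

# Stand-in for the internal store systems queried at hydration time.
STORE = {"sku-1": {"title": "Trail Shoe", "price": "$89.99"}}


def hydrate(chunks, store):
    """Pass text through as it arrives; swap each directive for live store data."""
    for chunk in chunks:
        match = DIRECTIVE.fullmatch(chunk)
        if match is None:
            yield {"type": "text", "text": chunk}
            continue
        item = store.get(match.group(1))
        if item is not None:  # unknown IDs are dropped rather than guessed
            yield {"type": "product_card", **item}


model_stream = ["A solid pick for trails: ", "{{card:sku-1}}", " Want alternatives?"]
events = list(hydrate(model_stream, STORE))
for event in events:
    print(event)
```

<p>The point of the design is that the model never emits the price itself: it emits a reference, and the authoritative value is looked up at render time.</p>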



<p>What’s not disclosed publicly:</p>



<ul class="wp-block-list">
<li>The markup language/schema</li>



<li>The exact rendering protocol between model ↔ hydrator ↔ client</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Backend: training data, infra, and latency engineering</h2>



<h3 class="wp-block-heading">Training data and preparation (what Amazon has said)</h3>



<p>Amazon Science states Rufus was trained with:</p>



<ul class="wp-block-list">
<li>The <strong>entire Amazon catalog</strong></li>



<li><strong>Customer reviews</strong></li>



<li><strong>Community Q&amp;A posts</strong></li>



<li>Curated <strong>public web information</strong></li>
</ul>



<p>And that Amazon used:</p>



<ul class="wp-block-list">
<li><strong>Amazon EMR</strong> for large-scale distributed data processing</li>



<li><strong>Amazon S3</strong> for storage</li>
</ul>



<h3 class="wp-block-heading">Inference infrastructure: Trainium/Inferentia + compiler optimizations</h3>



<p>Amazon Science describes serving at Amazon scale using:</p>



<ul class="wp-block-list">
<li>AWS chips <strong>Trainium</strong> and <strong>Inferentia</strong></li>



<li>Collaboration with the <strong>Neuron compiler</strong> team for inference optimizations</li>



<li><strong>Continuous batching</strong> to improve throughput/latency (described as making scheduling/routing decisions after every generated token so new requests can start as soon as earlier ones finish)</li>
</ul>
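
<p>The continuous batching idea can be illustrated with a toy step counter. This is a simplified simulation under our own assumptions (one token per active request per step, fixed batch size, positive output lengths known up front), not a description of Amazon&#8217;s scheduler.</p>

```python
from collections import deque


def continuous_batching_steps(lengths, batch_size):
    """Count decode steps when free slots are refilled after every token."""
    queue = deque(lengths)   # tokens each waiting request still has to generate
    active = []              # remaining tokens for requests currently in the batch
    steps = 0
    while queue or active:
        # Scheduling decision after each step: admit waiting requests into free slots.
        while queue and batch_size > len(active):
            active.append(queue.popleft())
        steps += 1
        # One decode step emits one token per active request; finished ones leave.
        active = [left - 1 for left in active if left > 1]
    return steps


def static_batching_steps(lengths, batch_size):
    """Baseline: each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


lengths = [8, 2, 2, 2]  # output lengths in tokens
print(static_batching_steps(lengths, batch_size=2))      # 10
print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

<p>With static batching the short requests wait behind the 8-token one; with per-token scheduling they slot in as soon as a neighbour finishes.</p>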



<h3 class="wp-block-heading">Prime Day scale + “parallel decoding” for QP latency</h3>



<p>AWS’s ML blog post goes much deeper on one backend component (the <strong>QP model</strong>) and performance engineering:</p>



<ul class="wp-block-list">
<li>The post describes Prime Day demands as very high query rates combined with tight latency SLOs for QP.</li>



<li>They describe using “draft‑centric speculative decoding” / “parallel decoding”:
<ul class="wp-block-list">
<li>Extending the base model with <strong>multiple decoding heads</strong> to predict multiple future tokens in parallel</li>



<li>A <strong>tree-based attention</strong> mechanism to verify/integrate predicted tokens</li>
</ul>
</li>



<li>The QP model is deployed on AWS AI chips (Trainium/Inferentia), and the post mentions integration details such as Triton Inference Server support and Neuron-related frameworks.</li>
</ul>
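
<p>The draft-and-verify loop behind parallel decoding can be sketched in a few lines. This is a deliberately simplified toy: the &#8220;models&#8221; are lookups into fixed sequences of equal length, and a single linear draft stands in for the tree of candidates that tree-based attention would verify in one pass. All function names are hypothetical.</p>

```python
def base_next_token(prefix, target):
    """Stand-in for the base model: greedy next token given the prefix."""
    return target[len(prefix)]


def draft_tokens(prefix, drafts, k):
    """Stand-in for k extra decoding heads guessing the next k tokens at once."""
    return drafts[len(prefix):len(prefix) + k]


def parallel_decode(target, drafts, k):
    """Accept the longest draft prefix the base model agrees with, per pass."""
    out, passes = [], 0
    while len(target) > len(out):
        passes += 1
        for guess in draft_tokens(out, drafts, k):
            if guess != base_next_token(out, target):
                break
            out.append(guess)  # verified: matches what greedy decoding would emit
        # The verification pass also yields the base model's own next token.
        if len(target) > len(out):
            out.append(base_next_token(out, target))
    return out, passes


target = list("abcdefgh")   # what plain greedy decoding would produce
drafts = list("abXdefgh")   # head guesses, wrong at position 2
out, passes = parallel_decode(target, drafts, k=3)
print("".join(out), passes)  # abcdefgh 3
```

<p>The output is identical to token-by-token greedy decoding, but it takes 3 verification passes instead of 8 sequential steps. The saving depends entirely on how often the draft heads guess correctly.</p>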



<p>This is one of the clearest official public descriptions of “backend mechanics” for Rufus, specifically for the planning model that sits <em>before</em> the user sees the first chunk of an answer.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Response format: what users see vs what the system likely contains</h2>



<h3 class="wp-block-heading">What the user-visible response can include (publicly described)</h3>



<p>Across Amazon’s public descriptions, Rufus responses can include:</p>



<ul class="wp-block-list">
<li><strong>Long-form explanations</strong> (e.g., product category advice)</li>



<li><strong>Short-form answers</strong></li>



<li><strong>Clickable links</strong> to navigate the store</li>



<li><strong>Product recommendations</strong> (often rendered as product cards)</li>



<li><strong>Comparisons</strong> (e.g., “compare OLED vs QLED”)</li>



<li><strong>Suggested follow-up questions</strong> surfaced in the chat UI</li>



<li>“<strong>What do customers say?</strong>” style review summaries / highlights</li>



<li>Price history and deal-related features (including price tracking / alerts) and cart actions in newer “agentic” iterations</li>
</ul>



<h3 class="wp-block-heading">What the backend response likely contains</h3>



<p>Based on Amazon’s own wording (“markup instructions” + “hydration” + token streaming), the response payload is best thought of as:</p>



<ul class="wp-block-list">
<li>A <strong>streamed text channel</strong> (tokens)</li>



<li>A <strong>structured directive channel</strong> (layout + which UI modules to render)</li>



<li><strong>Hydration lookups</strong> that fill directives with authoritative store data (products, prices, shipping, deal status, etc.)</li>
</ul>



<p>Amazon has not published the schema, so any JSON examples would be guesswork.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">What’s not public</h2>



<ul class="wp-block-list">
<li>Exact model architectures/sizes for the custom model(s)</li>



<li>The router policy (how it chooses among models)</li>



<li>Exact retrieval ranking, indexing, and grounding format</li>



<li>The markup instruction language/schema</li>



<li>Safety/guardrail implementation details (beyond high-level “reliable sources” language)</li>



<li>Full evaluation suite and offline metrics used to ship changes</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Sources</h2>



<p>Below are official sources only (Amazon Science, AWS, About Amazon Press Center, Investor Relations).</p>



<h3 class="wp-block-heading">Technical deep dives</h3>



<pre class="wp-block-code has-small-font-size"><code>Amazon Science (Blog): “The technology behind Amazon’s GenAI-powered shopping assistant, Rufus” (Oct 4, 2024)
https://www.amazon.science/blog/the-technology-behind-amazons-genai-powered-shopping-assistant-rufus

AWS Machine Learning Blog: “How Rufus doubled their inference speed and handled Prime Day traffic with AWS AI chips and parallel decoding” (May 28, 2025)
https://aws.amazon.com/blogs/machine-learning/how-rufus-doubled-their-inference-speed-and-handled-prime-day-traffic-with-aws-ai-chips-and-parallel-decoding/</code></pre>



<h3 class="wp-block-heading">Product/feature announcements &amp; official descriptions</h3>



<pre class="wp-block-code has-small-font-size"><code>About Amazon (Retail): “Amazon’s next-gen AI assistant for shopping is now even smarter, more capable, and more helpful”
https://www.aboutamazon.com/news/retail/amazon-rufus-ai-assistant-personalized-shopping-features

About Amazon (Retail): “How to use Rufus to check price history, find deals, auto-buy items at target prices, and more”
https://www.aboutamazon.com/news/retail/how-to-use-amazon-shopping-ai-assistant

About Amazon (Retail): “How customers are making more informed shopping decisions with Rufus…”
https://www.aboutamazon.com/news/retail/how-to-use-amazon-rufus

About Amazon (Retail): “Rufus is now available to all U.S. customers…” (amazon.com page linked from About Amazon)
https://www.amazon.com/b?node=23404839011</code></pre>



<h3 class="wp-block-heading">Press releases / investor communications</h3>



<pre class="wp-block-code has-small-font-size"><code>Amazon Investor Relations: “Amazon.com Announces Fourth Quarter Results” (Feb 01, 2024) — includes the initial public mention of Rufus beta rollout
https://ir.aboutamazon.com/news-release/news-release-details/2024/Amazon.com-Announces-Fourth-Quarter-Results/

About Amazon Press Center (US): “Amazon Bedrock launches new capabilities…” (Apr 2024) — includes a Rufus description and quote
https://press.aboutamazon.com/2024/4/amazon-bedrock-launches-new-capabilities-as-tens-of-thousands-of-customers-choose-it-as-the-foundation-to-build-and-scale-secure-generative-ai-applications

About Amazon Press Center (US): “Amazon Announces Record-Breaking Sales for 2024 Prime Day Event” (Jul 18, 2024) — notes Rufus helping millions of customers
https://press.aboutamazon.com/2024/7/amazon-announces-record-breaking-sales-for-2024-prime-day-event

Amazon Investor Relations: “Amazon.com Announces Fourth Quarter Results” (2026 release page) — mentions agentic Rufus / Buy For Me
https://ir.aboutamazon.com/news-release/news-release-details/2026/Amazon-com-Announces-Fourth-Quarter-Results/default.aspx</code></pre>



<h3 class="wp-block-heading">Amazon Science research papers</h3>



<p>These are not “Rufus documentation,” but they map directly to components Amazon describes (question suggestion, comparisons, RAG planning, preference extraction).</p>



<pre class="wp-block-code has-small-font-size"><code>Publication (SIGIR 2024): “Question suggestion for conversational shopping assistants using product metadata”
https://www.amazon.science/publications/question-suggestion-for-conversational-shopping-assistants-using-product-metadata

PDF (SIGIR 2024):
https://assets.amazon.science/42/6e/c7c7aed9433d87fd1ab1f8bef4ff/question-suggestion-for-conversational-shopping-assistants-using-product-metadata.pdf

Publication (WSDM 2023): “Generating explainable product comparisons for online shopping”
https://www.amazon.science/publications/generating-explainable-product-comparisons-for-online-shopping

Publication (CIKM 2024): “REAPER: Reasoning based retrieval planning for complex RAG systems”
https://www.amazon.science/publications/reaper-reasoning-based-retrieval-planning-for-complex-rag-systems

Publication (EMNLP 2024): “PEARL: Preference extraction with exemplar augmentation and retrieval with LLM agents”
https://www.amazon.science/publications/pearl-preference-extraction-with-exemplar-augmentation-and-retrieval-with-llm-agents

Publication (2024): “Meta knowledge for retrieval augmented large language models”
https://www.amazon.science/publications/meta-knowledge-for-retrieval-augmented-large-language-models</code></pre>



<p></p>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/rufus/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
	</channel>
</rss>
