<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DEJAN</title>
	<atom:link href="https://dejan.ai/feed/" rel="self" type="application/rss+xml" />
	<link>https://dejan.ai</link>
	<description>AI SEO Agency</description>
	<lastBuildDate>Sat, 04 Apr 2026 11:03:25 +0000</lastBuildDate>
	<language>en-AU</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://dejan.ai/wp-content/uploads/2024/02/dejan-150x150.png</url>
	<title>DEJAN</title>
	<link>https://dejan.ai</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Gemma 4 Brand Authority Map</title>
		<link>https://dejan.ai/blog/gemma-4-brand-authority-map/</link>
					<comments>https://dejan.ai/blog/gemma-4-brand-authority-map/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sat, 04 Apr 2026 11:03:25 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2399</guid>

					<description><![CDATA[We asked Google&#8217;s open-weight model Gemma 4 (31B) to &#8220;name 100 brands at random&#8221; 14,044 times and compared the results to our earlier Gemini 3 Flash experiment (200,000 runs). Of the top 50 brands in each model, 39 overlap. The 11 that are unique to each reveal a pattern: Gemini remembers luxury and automotive (Porsche, [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>We asked Google&#8217;s open-weight model Gemma 4 (31B) to &#8220;name 100 brands at random&#8221; 14,044 times and compared the results to our earlier <a href="https://dejan.ai/blog/brands/">Gemini 3 Flash experiment</a> (200,000 runs). </p>



<p>Of the top 50 brands in each model, 39 overlap. The 11 that are unique to each reveal a pattern: Gemini remembers luxury and automotive (Porsche, Ferrari, Cartier), while Gemma remembers everyday retail and sportswear (H&amp;M, Gap, Levi&#8217;s, Under Armour).</p>



<p>Apple is the undisputed #1 in both models. After that, the two models diverge significantly: Gemma 4 favors traditional consumer brands (Coca-Cola, Toyota, McDonald&#8217;s) while Gemini favors tech and digital brands (Google, Nike, Netflix). </p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Background</h2>



<p>In our earlier study, we probed Gemini 3 Flash with 200,000 independent &#8220;name 100 brands at random&#8221; queries. The non-uniform output revealed a stable hierarchy of brand recall &#8212; what we called the model&#8217;s &#8220;cognitive prioritization.&#8221; That work used Personalized PageRank on a two-level association graph to rank 2.9 million brands by associative embeddedness.</p>



<p>This follow-up applies Phase 1 of the same methodology &#8212; the seed establishment survey &#8212; to Gemma 4 (31B), Google&#8217;s open-weight model. The goal is to answer a simple question: does an open model remember the same brands as a closed one?</p>



<h2 class="wp-block-heading">Methodology</h2>



<p>The setup mirrors the Gemini study with minor adjustments:</p>



<ul class="wp-block-list">
<li><strong>Model:</strong> Gemma 4 31B Instruct (<code>gemma-4-31b-it</code>) via the Google GenAI API</li>



<li><strong>Prompt:</strong> <code>name 100 brands at random, one per line, say nothing else</code></li>



<li><strong>Runs:</strong> 14,044 successful completions (out of 100,000 attempted; rate-limited at 30 RPM)</li>



<li><strong>Canonicalization:</strong> Local string normalization (lowercase, strip accents, spaces, hyphens, punctuation) rather than LLM-based canonicalization. For example: <code>La Roche-Posay</code> becomes <code>larocheposay</code>, <code>Coca-Cola</code> becomes <code>cocacola</code></li>



<li><strong>Scoring:</strong> Popularity = frequency x (1 / average position). A brand mentioned in every run at position 1 scores maximally. A brand mentioned frequently but late in lists scores lower. Both this and the canonicalization step are sketched in code after this list.</li>
</ul>
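


<p>To make the canonicalization and scoring steps concrete, here&#8217;s a minimal Python sketch of both. The function names are illustrative rather than our production code, but the logic follows the description above:</p>



<pre class="wp-block-code"><code>import re
import unicodedata
from collections import defaultdict

def canonicalize(name):
    """Lowercase, strip accents, then drop spaces, hyphens and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only)  # alphanumerics only

def popularity_scores(runs):
    """runs: one list of brand strings per completion, in output order."""
    freq = defaultdict(int)       # total mentions per canonical form
    pos_sum = defaultdict(float)  # sum of 1-based list positions
    for run in runs:
        for position, raw_name in enumerate(run, start=1):
            brand = canonicalize(raw_name)
            freq[brand] += 1
            pos_sum[brand] += position
    scores = {}
    for b in freq:
        avg_position = pos_sum[b] / freq[b]
        scores[b] = freq[b] * (1.0 / avg_position)  # frequency x inverse position
    return scores

assert canonicalize("La Roche-Posay") == "larocheposay"
assert canonicalize("Coca-Cola") == "cocacola"
</code></pre>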



<p>The prompt was simplified from the Gemini version (which included <code>all lowercase, no spaces, no hyphens</code>) because we wanted to preserve the model&#8217;s natural casing as the display name and derive the canonical form programmatically.</p>



<h3 class="wp-block-heading">Caveat on sample size</h3>



<p>Gemma 4&#8217;s rate limits (30 RPM, 14,400 RPD) constrained us to 14,044 runs versus Gemini&#8217;s 200,000. The top-of-list rankings are stable at this sample size &#8212; the top 20 brands appeared in virtually every run. Long-tail discovery is ongoing: the discovery curve has not plateaued, meaning there are brands the model knows but hasn&#8217;t yet surfaced.</p>



<h2 class="wp-block-heading">Results</h2>



<h3 class="wp-block-heading">Overview</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>Gemini 3 Flash</th><th>Gemma 4 31B</th></tr></thead><tbody><tr><td>Total runs</td><td>200,000</td><td>14,044</td></tr><tr><td>Unique brands discovered</td><td>8,608</td><td>2,602</td></tr><tr><td>Total brand mentions</td><td>19,995,027</td><td>1,403,534</td></tr><tr><td>Avg brands per run</td><td>~100</td><td>~100</td></tr><tr><td>Singleton brands (appeared once)</td><td>&#8212;</td><td>912 (35%)</td></tr></tbody></table></figure>



<h3 class="wp-block-heading">Top 30 Head-to-Head</h3>



<p>The table below shows each model&#8217;s top 30 brands ranked by popularity score. Both models agree on Apple at #1 with a commanding lead. After that, the ordering diverges.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="851" src="https://dejan.ai/wp-content/uploads/2026/04/image-6-1024x851.png" alt="" class="wp-image-2405" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-6-1024x851.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-6-300x249.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-6-768x638.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-6-1536x1276.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-6.png 1780w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Top 20 Side-by-Side</h3>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="606" src="https://dejan.ai/wp-content/uploads/2026/04/image-7-1024x606.png" alt="" class="wp-image-2406" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-7-1024x606.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-7-300x177.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-7-768x454.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-7-1536x908.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-7-2048x1211.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Apple dominates both models. In Gemini, the drop-off from #1 to #2 is 3:1 (Apple to Samsung). In Gemma 4, it&#8217;s 1.3:1 (Apple to Coca-Cola) &#8212; a less extreme concentration.</p>



<h3 class="wp-block-heading">The Google Self-Ranking Gap</h3>



<p>One of the most notable findings: Google ranks itself #4 in Gemini 3 Flash but only #17 in Gemma 4. This is consistent with the architectural difference &#8212; Gemini is a proprietary model trained and served by Google, while Gemma is an open-weight model. Whether this reflects training data differences, alignment tuning, or genuine differences in brand salience across model architectures is an open question.</p>



<h3 class="wp-block-heading">Rank Shifts</h3>



<p>The following chart shows how brands moved between the two models&#8217; rankings. Green bars indicate brands that ranked higher in Gemma 4; red bars indicate brands that ranked higher in Gemini.</p>



<figure class="wp-block-image size-large"><img decoding="async" width="849" height="1024" src="https://dejan.ai/wp-content/uploads/2026/04/image-8-849x1024.png" alt="" class="wp-image-2407" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-8-849x1024.png 849w, https://dejan.ai/wp-content/uploads/2026/04/image-8-249x300.png 249w, https://dejan.ai/wp-content/uploads/2026/04/image-8-768x927.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-8-1273x1536.png 1273w, https://dejan.ai/wp-content/uploads/2026/04/image-8.png 1475w" sizes="(max-width: 849px) 100vw, 849px" /></figure>



<p><strong>Biggest risers in Gemma 4:</strong></p>



<ul class="wp-block-list">
<li>Nestle: #36 to #16 (+20)</li>



<li>L&#8217;Oreal: #48 to #32 (+16)</li>



<li>Visa: #31 to #15 (+16)</li>



<li>Chanel: #34 to #22 (+12)</li>



<li>Lego: #25 to #13 (+12)</li>
</ul>



<p><strong>Biggest fallers in Gemma 4:</strong></p>



<ul class="wp-block-list">
<li>Mercedes-Benz: #10 to #34 (-24)</li>



<li>Netflix: #18 to #38 (-20)</li>



<li>Nintendo: #27 to #47 (-20)</li>



<li>Audi: #23 to #42 (-19)</li>



<li>Google: #4 to #17 (-13)</li>
</ul>



<h3 class="wp-block-heading">The Frequency vs. Position Paradox</h3>



<p>An interesting pattern emerged in Gemma 4 that was less pronounced in Gemini: some brands have extremely high frequency (more total mentions than there are runs, meaning they appear more than once per run on average) but rank low by popularity because they appear late in lists.</p>



<p><strong>Visa</strong> appeared 28,731 times across 14,044 runs &#8212; an average of 2.05 times per run. But its average position was 35.8, placing it 15th by popularity despite having the highest raw frequency. <strong>Nike</strong> similarly appeared 26,254 times (1.87 per run) with an average position of 22.8.</p>



<p>This suggests these brands have high <em>availability</em> in the model&#8217;s memory but low <em>priority</em> &#8212; they&#8217;re easy to recall but not the first thing the model thinks of. In Gemini, this effect was less extreme because the prompt forced lowercase single-word output, reducing duplicate mentions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="713" src="https://dejan.ai/wp-content/uploads/2026/04/image-9-1024x713.png" alt="" class="wp-image-2408" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-9-1024x713.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-9-300x209.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-9-768x535.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-9.png 1481w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Brand Discovery Curve</h3>



<p>The discovery curve shows how many unique brands have been surfaced as a function of runs completed. Gemma 4&#8217;s curve at 14,000 runs tracks slightly above Gemini&#8217;s curve at the same point, suggesting comparable or slightly higher brand vocabulary diversity at equivalent sample sizes.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="505" src="https://dejan.ai/wp-content/uploads/2026/04/image-10-1024x505.png" alt="" class="wp-image-2409" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-10-1024x505.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-10-300x148.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-10-768x379.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-10.png 1481w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Both curves show the characteristic long-tail shape: rapid initial discovery followed by diminishing returns. Gemini&#8217;s curve continues to climb through 100,000 runs, suggesting Gemma 4 would similarly continue discovering new brands with more sampling.</p>



<h3 class="wp-block-heading">Unique to Each Model</h3>



<p>Of the top 50 brands in each model, 39 appear in both. The 11 unique to each side reveal a pattern:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="426" src="https://dejan.ai/wp-content/uploads/2026/04/image-11-1024x426.png" alt="" class="wp-image-2410" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-11-1024x426.png 1024w, https://dejan.ai/wp-content/uploads/2026/04/image-11-300x125.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-11-768x319.png 768w, https://dejan.ai/wp-content/uploads/2026/04/image-11-1536x639.png 1536w, https://dejan.ai/wp-content/uploads/2026/04/image-11.png 1780w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Only in Gemini&#8217;s top 50:</strong> Porsche, Hyundai, Red Bull, eBay, Volkswagen, Cartier, Ferrari, Adobe, Facebook, NIVEA, Gillette</p>



<p><strong>Only in Gemma 4&#8217;s top 50:</strong> H&amp;M, Puma, Dell, HP, Under Armour, Levi&#8217;s, Gap, Uber, Airbnb, Nikon, Calvin Klein</p>



<p>Gemini&#8217;s unique set skews luxury (Porsche, Ferrari, Cartier), automotive (Volkswagen, Hyundai), and legacy tech/digital (eBay, Adobe, Facebook). Gemma 4&#8217;s unique set skews everyday retail (H&amp;M, Gap, Levi&#8217;s), consumer electronics (Dell, HP, Nikon), and modern services (Uber, Airbnb).</p>



<h2 class="wp-block-heading">Interpretation</h2>



<h3 class="wp-block-heading">What aligns</h3>



<p>Both models share the same core set of mega-brands. Apple, Samsung, Toyota, Amazon, Microsoft, Adidas, Disney, Sony, Pepsi, BMW, and 29 others appear in both top-50 lists. The brand hierarchy is not random &#8212; it reflects genuine differences in brand salience as encoded in training data.</p>



<h3 class="wp-block-heading">What diverges</h3>



<p>The divergences cluster around three themes:</p>



<ol class="wp-block-list">
<li><strong>Self-reference bias.</strong> Google ranks dramatically higher in its own proprietary model. This is the single largest rank shift in the dataset.</li>



<li><strong>Digital vs. physical.</strong> Gemini over-indexes on digital-native brands (Netflix, eBay, Adobe, Facebook). Gemma over-indexes on physical retail and consumer goods (H&amp;M, Gap, Levi&#8217;s, Dell, HP).</li>



<li><strong>Luxury vs. everyday.</strong> Gemini remembers luxury brands more readily (Mercedes-Benz #10, Porsche, Ferrari, Cartier in top 50). Gemma favors mass-market brands (McDonald&#8217;s #6, Visa #15, Under Armour, Puma in top 50).</li>
</ol>



<h3 class="wp-block-heading">Possible explanations</h3>



<ul class="wp-block-list">
<li><strong>Training data composition.</strong> Gemma 4 may have a different distribution of training data, with more weight on consumer-facing web content versus Gemini&#8217;s potentially broader or more curated corpus.</li>



<li><strong>Model size.</strong> Gemma 4 31B is smaller than Gemini 3 Flash. Smaller models may default to more &#8220;obvious&#8221; or broadly recognized brands rather than luxury or niche ones.</li>



<li><strong>Alignment and tuning.</strong> Different RLHF/instruction tuning pipelines may influence which brands the model considers &#8220;representative&#8221; when asked for random examples.</li>
</ul>



<h2 class="wp-block-heading">What&#8217;s Next</h2>



<p>This study covers Phase 1 only &#8212; the seed survey. The full authority map (Phases 2-3: association graph construction and PageRank computation) has not yet been run on Gemma 4 data. As rate limits allow, we plan to:</p>



<ol class="wp-block-list">
<li>Complete the 100,000-run target for statistical parity with the Gemini study</li>



<li>Run the two-level association mapping on Gemma 4&#8217;s seed brands</li>



<li>Compute Personalized PageRank to produce a full Gemma 4 Brand Authority Index</li>



<li>Publish a direct comparison of the complete authority scores across both models</li>
</ol>



<p>The raw data and code for this analysis are available on request.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="589" height="585" src="https://dejan.ai/wp-content/uploads/2026/04/image-1.png" alt="" class="wp-image-2400" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-1.png 589w, https://dejan.ai/wp-content/uploads/2026/04/image-1-300x298.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-1-150x150.png 150w" sizes="auto, (max-width: 589px) 100vw, 589px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="763" height="375" src="https://dejan.ai/wp-content/uploads/2026/04/image-5.png" alt="" class="wp-image-2404" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-5.png 763w, https://dejan.ai/wp-content/uploads/2026/04/image-5-300x147.png 300w" sizes="auto, (max-width: 763px) 100vw, 763px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="763" height="379" src="https://dejan.ai/wp-content/uploads/2026/04/image-4.png" alt="" class="wp-image-2403" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-4.png 763w, https://dejan.ai/wp-content/uploads/2026/04/image-4-300x149.png 300w" sizes="auto, (max-width: 763px) 100vw, 763px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="765" height="379" src="https://dejan.ai/wp-content/uploads/2026/04/image-3.png" alt="" class="wp-image-2402" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-3.png 765w, https://dejan.ai/wp-content/uploads/2026/04/image-3-300x149.png 300w" sizes="auto, (max-width: 765px) 100vw, 765px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="809" height="349" src="https://dejan.ai/wp-content/uploads/2026/04/image-2.png" alt="" class="wp-image-2401" srcset="https://dejan.ai/wp-content/uploads/2026/04/image-2.png 809w, https://dejan.ai/wp-content/uploads/2026/04/image-2-300x129.png 300w, https://dejan.ai/wp-content/uploads/2026/04/image-2-768x331.png 768w" sizes="auto, (max-width: 809px) 100vw, 809px" /></figure>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/gemma-4-brand-authority-map/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Chrome&#8217;s New Shopping Classifier</title>
		<link>https://dejan.ai/blog/google-shopping-classifier/</link>
					<comments>https://dejan.ai/blog/google-shopping-classifier/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Fri, 03 Apr 2026 07:34:43 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[eCommerce]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2390</guid>

					<description><![CDATA[One of our AI SEO hall-of-famers, Olivier de Segonzac from RESONEO has managed to gain access to Google&#8217;s shopping classifier model. We&#8217;ve examined the model, reverse engineered its inference pipeline and this article is what we found. Model Demo Below is a real-world implementation of the model tested by loading a shopping-related page and following [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>One of our AI SEO hall-of-famers, <a href="https://dejanmarketing.com/best-ai-seo-agencies/#:~:text=Olivier">Olivier de Segonzac</a> from <a href="https://www.resoneo.com/">RESONEO</a> has managed to gain access to Google&#8217;s shopping classifier model. We&#8217;ve examined the model, reverse engineered its inference pipeline, and this article describes what we found.</p>



<blockquote class="wp-block-quote has-body-font-family has-medium-font-size is-layout-flow wp-block-quote-is-layout-flow" style="font-style:normal;font-weight:400">
<p><strong>TL;DR</strong></p>



<ul class="wp-block-list">
<li>Newly shipped in Chrome.</li>



<li>Determines whether a web page is a shopping page or not.</li>



<li>Every page you visit gets scored. </li>



<li>Score is stored in Chrome&#8217;s history database.</li>



<li>Used to personalize user experience and recommendations.</li>



<li>The model splits your page into 10 chunks of ~100 words each and truncates every chunk to 64 tokens.</li>



<li>Roughly half the words never reach the model.</li>
</ul>
</blockquote>



<h2 class="wp-block-heading">Model Demo</h2>



<p>Below is a real-world implementation of the model tested by loading a <a href="https://www.owayo.com/custom-cycling-jerseys.htm">shopping-related page</a> and following Chrome&#8217;s native 10-passage, 64-tokens-per-passage logic.</p>



<figure class="wp-block-video"><video height="824" style="aspect-ratio: 936 / 824;" width="936" autoplay loop muted src="https://dejan.ai/wp-content/uploads/2026/04/20260403-0729-38.7304427.mp4" playsinline></video></figure>



<h2 class="wp-block-heading">The Pipeline</h2>



<p>The classifier doesn&#8217;t look at raw HTML. It doesn&#8217;t look at the DOM directly either. Chrome uses a structured content extraction system called <code>AnnotatedPageContent</code>, accessible via the Chrome DevTools Protocol method <code>Page.getAnnotatedPageContent</code>. This system walks the rendered page and produces a tree of typed content nodes: text, tables, image captions.</p>



<p>The full pipeline looks like this:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>Rendered Page
  → Blink AnnotatedPageContent extraction (5 seconds after load)
  → Text nodes collected from content tree
  → Greedy word-count chunking into passages
  → SentencePiece tokenization (64 tokens per passage)
  → Passage Embedder (TFLite) → 768-dim vectors
  → Mean pooling + title/URL embedding concatenation → 1536-dim input
  → Shopping Classifier (TFLite) → probability score (0 to 1)
</code></pre>



<h2 class="wp-block-heading">How Pages Are Chunked</h2>



<p>There is no semantic segmentation. Chrome uses a greedy word counter. Text items from the content tree are accumulated into a passage until the word count reaches 100, then a new passage starts. Items shorter than 5 words are always appended to the current passage rather than starting a new one.</p>



<p>The limits (the chunking loop is sketched in code after this list):</p>



<ul class="wp-block-list">
<li>100 words max per passage</li>



<li>5 words min per text item to trigger a new passage</li>



<li>10 passages max per page</li>
</ul>



<p>Everything beyond the first 10 passages is discarded.</p>
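


<p>Here&#8217;s a minimal Python sketch of the chunking loop as we read the behavior, using the three limits above. The constant and function names are ours, not Chrome&#8217;s:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>MAX_WORDS_PER_PASSAGE = 100
MIN_WORDS_TO_START_NEW = 5
MAX_PASSAGES = 10

def chunk_text_items(text_items):
    """Greedy word-count chunking over extracted text nodes."""
    passages, current, word_count = [], [], 0
    for item in text_items:
        n_words = len(item.split())
        # Items under 5 words never open a new passage; they ride along.
        if word_count &gt;= MAX_WORDS_PER_PASSAGE and n_words &gt;= MIN_WORDS_TO_START_NEW:
            passages.append(" ".join(current))
            current, word_count = [], 0
        current.append(item)
        word_count += n_words
    if current:
        passages.append(" ".join(current))
    return passages[:MAX_PASSAGES]  # everything past passage 10 is discarded
</code></pre>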



<h2 class="wp-block-heading">The Tokenizer Bottleneck</h2>



<p>Each passage is tokenized with SentencePiece and then truncated to 64 tokens. An EOS token is appended if there&#8217;s room, and shorter sequences are zero-padded.</p>



<p>64 tokens translates to roughly 35–50 English words depending on vocabulary complexity. Product names and brand-heavy text tokenize less efficiently (around 35 words), while natural prose gets closer to 50.</p>



<p>This means each 100-word passage loses roughly half its content at the tokenizer stage. Across 10 passages, the model effectively sees about 400–450 words of a page that may contain thousands.</p>
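


<p>A rough sketch of that step using the <code>sentencepiece</code> Python package. The model filename here is hypothetical; the real tokenizer ships inside Chrome&#8217;s model bundle:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>import sentencepiece as spm

SEQ_LEN = 64

# Hypothetical filename; the real model is bundled with Chrome.
sp = spm.SentencePieceProcessor(model_file="passage_embedder.model")

def tokenize_passage(text):
    """Truncate to 64 tokens, append EOS if it fits, then zero-pad."""
    ids = sp.encode(text)[:SEQ_LEN]
    if len(ids) &lt; SEQ_LEN:  # room left after truncation: append EOS
        ids.append(sp.eos_id())
    return ids + [0] * (SEQ_LEN - len(ids))  # fixed-length embedder input
</code></pre>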



<h2 class="wp-block-heading">The Embedder</h2>



<p>The passage embedder (<code>OPTIMIZATION_TARGET_PASSAGE_EMBEDDER</code>) is a TFLite DualEncoder transformer model. It takes <code>int32[1, 64]</code> token IDs as input and outputs a <code>float32[1, 768]</code> embedding vector. The same model embeds both the page passages and the title/URL string.</p>



<p>The title/URL input is constructed by concatenating the page title and URL with a separator: <code>"Page Title - https://example.com/path"</code>.</p>



<h2 class="wp-block-heading">The Classifier</h2>



<p>The shopping classifier takes a <code>float32[1, 1536]</code> input vector, which is two 768-dim embeddings concatenated:</p>



<ul class="wp-block-list">
<li>First 768 dimensions: title/URL embedding</li>



<li>Last 768 dimensions: mean-pooled passage embeddings</li>
</ul>



<p>Multiple passage embeddings are combined using element-wise mean pooling. This is specified in the model&#8217;s metadata (<code>pooling_strategy = POOLING_STRATEGY_MEAN</code>, <code>max_passages = 10</code>).</p>



<p>The output is a single float between 0 and 1 representing the probability that the page is a shopping page.</p>
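


<p>Here&#8217;s a small NumPy sketch of the fusion step; the function name is ours, but the shapes, ordering, and mean pooling come straight from the model metadata described above:</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>import numpy as np

def build_classifier_input(title_url_embedding, passage_embeddings):
    """Assemble the float32[1, 1536] classifier input.

    title_url_embedding: float32[768], from embedding "Page Title - URL"
    passage_embeddings:  float32[n, 768], one row per passage (n up to 10)
    """
    pooled = passage_embeddings.mean(axis=0)               # element-wise mean pooling
    fused = np.concatenate([title_url_embedding, pooled])  # title/URL first
    return fused.astype(np.float32)[np.newaxis, :]         # add the batch dimension
</code></pre>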



<h2 class="wp-block-heading">Testing It</h2>



<p>I extracted both models from Chrome and built a Streamlit app that replicates the full pipeline. It uses Selenium to launch Chrome Canary, calls <code>Page.getAnnotatedPageContent</code> via CDP to get the same structured content Chrome uses internally, then runs the chunking, tokenization, embedding, and classification steps.</p>



<p>Results on a few test inputs:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Input</th><th>Score</th></tr></thead><tbody><tr><td>&#8220;Breaking news: earthquake hits California coast&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;How to learn Python programming for beginners&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;Wikipedia &#8211; History of the Roman Empire&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;BBC Sport &#8211; Premier League results and fixtures&#8221;</td><td>0.0000</td></tr><tr><td>&#8220;Amazon.com: Apple iPhone 15 Pro Max 256GB&#8221;</td><td>1.0000</td></tr><tr><td>&#8220;Best deals on laptops this Black Friday &#8211; up to 50% off&#8221;</td><td>1.0000</td></tr><tr><td>dejan.ai</td><td>0.0000</td></tr><tr><td>owayo.com/custom-cycling-jerseys.htm</td><td>0.9998</td></tr></tbody></table></figure>



<p>The model produces sharp, confident separations despite the lossy input pipeline.</p>



<h2 class="wp-block-heading">What Chrome Does With the Score</h2>



<p>The shopping classification feeds two systems:</p>



<p><strong>Per-page annotation.</strong> The score is stored in Chrome&#8217;s history database as part of <code>VisitContentAnnotations</code>. This is used by History Journeys to cluster shopping visits together.</p>



<p><strong>User-level segmentation.</strong> Scores are aggregated over time by Chrome&#8217;s Segmentation Platform into a separate model (<code>OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER</code>). If a user is classified as a &#8220;shopping user,&#8221; Chrome enables commerce features: price tracking in the omnibox, price drop notifications, shopping insights in the side panel, and shopping cards on the new tab page.</p>



<p>The per-page classifier is a signal collector that builds a user-level shopping profile, which in turn gates which commerce features Chrome presents.</p>



<h2 class="wp-block-heading">Why This Matters for E-Commerce SEO</h2>



<p>If Chrome can&#8217;t identify your page as a shopping page from the first ~450 words of visible content, your users won&#8217;t see commerce features like price tracking and shopping insights. Navigation menus, cookie banners, and boilerplate that appear early in the DOM consume your token budget before the model reaches your product information. E-commerce sites that bury product signals below heavy navigation and promotional blocks risk being invisible to the classifier entirely.</p>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/google-shopping-classifier/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		<enclosure url="https://dejan.ai/wp-content/uploads/2026/04/20260403-0729-38.7304427.mp4" length="4853580" type="video/mp4" />

			</item>
		<item>
		<title>AI Brand Authority Index: Ranking 2.9 Million Brands by Associative Embeddedness in Gemini&#8217;s Memory</title>
		<link>https://dejan.ai/blog/brands/</link>
					<comments>https://dejan.ai/blog/brands/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sat, 28 Mar 2026 11:01:30 +0000</pubDate>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Mechanistic Interpretability]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2360</guid>

					<description><![CDATA[Abstract When a large language model is asked to &#8220;name 100 brands at random,&#8221; it doesn&#8217;t produce uniform randomness. It produces a distribution shaped by its training data, revealing which brands occupy the most cognitive real estate in the model&#8217;s parametric memory. We present a methodology for quantifying brand authority in AI memory using Personalized [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Abstract</h2>



<p>When a large language model is asked to &#8220;name 100 brands at random,&#8221; it doesn&#8217;t produce uniform randomness. It produces a distribution shaped by its training data, revealing which brands occupy the most cognitive real estate in the model&#8217;s parametric memory. We present a methodology for quantifying brand authority in AI memory using Personalized PageRank with seed-weighted teleportation. Phase 1 establishes seed brands through 200,000 independent recall surveys. Phase 2 constructs a two-level directed association graph. Phase 3 computes authority scores using sparse matrix power iteration across 2.9 million brand nodes. Manual quality control of 8,055 seed entries removes 2,163 junk artifacts produced by Gemini&#8217;s generation failures.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="423" src="https://dejan.ai/wp-content/uploads/2026/03/image-24-1024x423.png" alt="" class="wp-image-2385" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-24-1024x423.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-24-300x124.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-24-768x317.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-24.png 1442w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<div class="wp-block-buttons alignfull is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-ecbad910 wp-block-buttons-is-layout-flex" style="padding-top:var(--wp--preset--spacing--20);padding-bottom:var(--wp--preset--spacing--20)">
<div class="wp-block-button has-custom-width wp-block-button__width-25"><a class="wp-block-button__link has-large-font-size has-custom-font-size wp-element-button" href="https://authority.dejan.ai/">Dejan Authority Database</a></div>
</div>



<h2 class="wp-block-heading">1. Background</h2>



<p>PageRank models a random surfer who follows links across a graph. A node&#8217;s score depends on how many other nodes link to it and how authoritative those linking nodes are. The iterative computation converges on the stationary distribution of the random walk.</p>



<p>We apply this framework to brand recall in large language models. Instead of web pages and hyperlinks, our graph consists of brands and directed associations extracted from Google&#8217;s Gemini model. Instead of uniform teleportation, we use seed-weighted teleportation where brands the model recalls most frequently and earliest receive proportionally more random walk restarts.</p>



<h2 class="wp-block-heading">2. Phase 1: Establishing the Seed Set</h2>



<h3 class="wp-block-heading">2.1 The Recall Survey</h3>



<p>We conducted 200,000 independent runs against Google&#8217;s Gemini model (gemini-3-flash-preview), each with the same prompt:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>name 100 brands at random, one per line, all lowercase, no spaces, no hyphens, say nothing else</p>
</blockquote>



<p>Despite the instruction to respond &#8220;at random,&#8221; the model&#8217;s outputs are far from uniform. Brands like Google, Microsoft, and Nike appear in nearly every run, while obscure brands appear only once. This non-uniformity is the signal, not the noise.</p>



<h3 class="wp-block-heading">2.2 Seed Statistics</h3>



<p>From 200,000 runs, we extracted:</p>



<ul class="wp-block-list">
<li><strong>8,608 unique brands</strong> (the raw seed set)</li>



<li><strong>~20 million total mentions</strong></li>



<li>Per-brand metrics:
<ul>
<li><strong>Frequency</strong>: total mentions across all runs</li>



<li><strong>Distinct runs</strong>: number of unique runs containing the brand</li>



<li><strong>Average rank</strong>: mean position when the brand appears (1 = first recalled, 100 = last)</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading">2.3 Seed Weights</h3>



<p>Each seed brand receives an initial authority weight combining recall frequency and recall priority:</p>



<p>$$w_i = \hat{f}_i \times \hat{r}_i^{-1}$$</p>



<p>where:</p>



<ul class="wp-block-list">
<li>$\hat{f}_i = \frac{\text{distinct runs}_i}{\max(\text{distinct runs})}$ is the normalized recall frequency</li>



<li>$\hat{r}_i^{-1} = \frac{1/\text{avg rank}_i}{\max(1/\text{avg rank})}$ is the normalized inverse rank</li>
</ul>



<p>A brand recalled in every run AND recalled first receives a weight near 1.0. A brand recalled once at position 98 receives a weight near zero. These weights become the personalization vector for PageRank teleportation.</p>
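


<p>A minimal NumPy sketch of the weight computation (illustrative names, same formula):</p>



<pre class="wp-block-code"><code>import numpy as np

def seed_weights(distinct_runs, avg_rank):
    """w_i = f_hat_i * r_hat_i^-1 for every seed brand.

    distinct_runs: int array, unique runs containing each brand
    avg_rank:      float array, mean recall position (1 = first, 100 = last)
    """
    f_hat = distinct_runs / distinct_runs.max()  # normalized recall frequency
    inv_rank = 1.0 / avg_rank
    r_hat_inv = inv_rank / inv_rank.max()        # normalized inverse rank
    return f_hat * r_hat_inv                     # near 1.0 only if frequent AND early
</code></pre>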



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="691" height="280" src="https://dejan.ai/wp-content/uploads/2026/03/image-20.png" alt="" class="wp-image-2372" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-20.png 691w, https://dejan.ai/wp-content/uploads/2026/03/image-20-300x122.png 300w" sizes="auto, (max-width: 691px) 100vw, 691px" /></figure>



<h3 class="wp-block-heading">2.4 Seed Quality Control</h3>



<p>Raw Gemini output contained significant contamination. Manual review of all 8,055 seed entries (ranked by PageRank score) identified 2,163 junk entries — 26.8% of the seed set — across several distinct failure modes:</p>



<p><strong>Concatenation artifacts</strong> — Gemini fused adjacent brand names together. The <code>coca*</code> prefix alone produced 11 variants: <code>cocaapple</code>, <code>cocaflops</code>, <code>cocaalcola</code>, <code>cocaicoca</code>, <code>cocaelsa</code>, <code>cocaiccola</code>, <code>cocaicola</code>, <code>cocaonla</code>, <code>cocaformula</code>, <code>cocaole</code>, <code>cocaocla</code>. The <code>visa*</code> prefix generated 80+ junk entries: <code>visafarm</code>, <code>visafold</code>, <code>visafans</code>, <code>visafacebook</code>, <code>visanetwork</code>, <code>visahub</code>, <code>visawash</code>, <code>visacard</code>, <code>visafocus</code>, <code>visaglobal</code>, <code>visamatte</code>, <code>visaeurope</code>, and dozens more. Similarly, <code>hp*</code> produced 100+ entries (<code>hpmicrolab</code>, <code>hpmillett</code>, <code>hpmachines</code>, <code>hpmilwaukee</code>), and <code>tesla*</code> generated 30+ (<code>teslatotalsenergies</code>, <code>teslouisvuitton</code>, <code>teslacoil</code>, <code>teslapump</code>).</p>



<p><strong>Inner monologue leakage</strong> — Gemini&#8217;s internal reasoning about character constraints leaked into output as literal brand entries. Over 200 entries followed the pattern <code>雀巢 (parenthetical self-correction)</code>:</p>



<ul class="wp-block-list">
<li><code>雀巢 (actually nestle, switching to latin)</code></li>



<li><code>雀巢 (oops, sticking to alphabet)</code></li>



<li><code>雀巢 (replaced with nestle, wait, no spaces/hyphens only)</code></li>



<li><code>雀巢 (thinking of brands...)</code></li>



<li><code>雀巢 (just kidding)</code></li>



<li><code>雀巢 (actually nestle, replace with kpmg)</code></li>
</ul>



<p>These represent the model&#8217;s chain-of-thought processing about the CJK character <code>雀巢</code> (Nestle in Chinese) bleeding through as output tokens.</p>



<p><strong>Typos and garbled names</strong> — <code>toyote</code> (toyota), <code>hundai</code> (hyundai), <code>adidsa</code> (adidas), <code>luluemon</code> (lululemon), <code>rebok</code> (reebok), <code>porche</code> (porsche), <code>royleroyce</code> (rollsroyce), <code>senheiser</code> (sennheiser).</p>



<p><strong>Mixed-script artifacts</strong> — Partial CJK character insertion mid-brand: <code>home固定depot</code>, <code>pizza动hut</code>, <code>dr控martens</code>, <code>estee固定lauder</code>, <code>western吐igital</code>, <code>cooler避master</code>.</p>



<p><strong>HTML/prompt leaks</strong> — Model markup and instructions appearing as brands: <code>hugo&lt;/thought&gt;apple</code>, <code>hugo&lt;/p&gt;</code>, and most remarkably: <code>unite 100 brands at random, one per line, all lowercase, no spaces, no hyphens, say nothing else</code> — the model echoed its own prompt as a brand name.</p>



<p><strong>Generic words</strong> — <code>luxury</code>, <code>all</code>, <code>delivery</code>, <code>generic</code>, <code>detergent</code>, <code>pudding</code> — words that aren&#8217;t brands.</p>



<p><strong>Why this matters for PageRank</strong>: Junk seeds receive direct teleportation mass every iteration (alpha=0.15). A garbage entry like <code>cocaapple</code> at rank 789 receives the same structural boost as <code>lecreuset</code> at rank 790. Without filtering, junk seeds contaminate the authority signal at the core of the algorithm. The 2,163 entries were loaded into a <code>brand_ignore</code> table and excluded from the personalization vector during PageRank computation.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="705" height="291" src="https://dejan.ai/wp-content/uploads/2026/03/image-21.png" alt="" class="wp-image-2373" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-21.png 705w, https://dejan.ai/wp-content/uploads/2026/03/image-21-300x124.png 300w" sizes="auto, (max-width: 705px) 100vw, 705px" /></figure>



<h2 class="wp-block-heading">3. Phase 2: Constructing a Two-Level Association Graph</h2>



<h3 class="wp-block-heading">3.1 Level 1 (L1): Seed Associations</h3>



<p>For each effective seed (~5,892 after filtering), we queried Gemini:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>name 100 brands most closely associated with [brand], ordered from most to least associated, one per line, all lowercase, no spaces, no hyphens, say nothing else</p>
</blockquote>



<p>This produced ~860,000 directed edges. These associations are genuinely asymmetric: Apple&#8217;s association with Beats (which it owns) carries different positional weight than Beats&#8217; association with Apple.</p>



<h3 class="wp-block-heading">3.2 Level 2 (L2): Discovered Brand Associations</h3>



<p>Brands discovered at L1 that weren&#8217;t original seeds were themselves queried for their associations. This second pass dramatically expanded the graph into the long tail. A brand like <code>titois</code> (a Turkish textile company) appeared as an L1 association of <code>vice</code>, and when queried at L2, generated its own set of 100 associations including <code>vuteks</code> — another Turkish industrial brand that would never surface in a consumer-focused recall survey.</p>



<p>The full discovery chain for any brand can be traced: <code>vice</code> (seed) → <code>titois</code> (L1) → <code>vuteks</code> (L2).</p>



<h3 class="wp-block-heading">3.3 Graph Scale</h3>



<p>The resulting graph contains:</p>



<ul class="wp-block-list">
<li><strong>2,886,212 unique brand nodes</strong></li>



<li><strong>Millions of directed weighted edges</strong> across L1 and L2</li>



<li><strong>5,892 effective seeds</strong> (after ignoring 2,163 junk entries)</li>



<li><strong>~201,000 L1 brands</strong> discovered through seed associations</li>



<li><strong>~2.68 million L2 brands</strong> discovered through L1 associations</li>
</ul>



<h3 class="wp-block-heading">3.4 Canonicalization</h3>



<p>Brand names required normalization before graph construction:</p>



<ul class="wp-block-list">
<li><strong>Cyrillic homoglyph mapping</strong>: Characters like <code>а</code> (Cyrillic) mapped to <code>a</code> (Latin) to merge visually identical variants (see the sketch after this list)</li>



<li><strong>CJK+Latin mixed-script filtering</strong>: Entries mixing Chinese/Japanese/Korean characters with Latin text flagged as junk</li>



<li><strong>Manual aliases</strong>: 15 CJK-to-Latin mappings for legitimate brands (e.g., <code>雀巢</code> → <code>nestle</code>)</li>



<li><strong>Variant tracking</strong>: 193,070 name variants mapped to canonical forms, preserving display names while merging duplicates</li>
</ul>
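


<p>Here&#8217;s an illustrative Python sketch of these rules. The homoglyph table shown is a tiny subset of the real one, and the junk routing is simplified to a <code>None</code> return:</p>



<pre class="wp-block-code"><code>import unicodedata

# Illustrative subset; the production table covers many more homoglyphs.
CYRILLIC_TO_LATIN = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "х": "x"}
MANUAL_ALIASES = {"雀巢": "nestle"}  # one of the 15 CJK-to-Latin mappings

def is_cjk(ch):
    return "CJK" in unicodedata.name(ch, "")

def canonical_form(name):
    """Map homoglyphs, honor manual aliases, flag mixed-script entries."""
    if name in MANUAL_ALIASES:
        return MANUAL_ALIASES[name]
    mapped = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name)
    has_cjk = any(is_cjk(ch) for ch in mapped)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in mapped)
    if has_cjk and has_latin:
        return None  # CJK+Latin mixed-script entry, routed to the junk table
    return mapped
</code></pre>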



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="705" height="284" src="https://dejan.ai/wp-content/uploads/2026/03/image-22.png" alt="" class="wp-image-2374" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-22.png 705w, https://dejan.ai/wp-content/uploads/2026/03/image-22-300x121.png 300w" sizes="auto, (max-width: 705px) 100vw, 705px" /></figure>



<h2 class="wp-block-heading">4. Computing Personalized PageRank</h2>



<h3 class="wp-block-heading">4.1 Random Walk Model</h3>



<p>At each step of the random walk, a surfer either:</p>



<ul class="wp-block-list">
<li><strong>Teleports</strong> (probability alpha=0.15) — jumps to a seed brand, with probability proportional to that seed&#8217;s authority weight. Ignored seeds receive zero teleportation mass.</li>



<li><strong>Follows an edge</strong> (probability 1-alpha=0.85) — follows an outgoing association edge, weighted by inverse position. Position 1 associations receive more weight than position 100.</li>
</ul>



<h3 class="wp-block-heading">4.2 Edge Weights</h3>



<p>Association position determines edge weight. Brands listed earlier in Gemini&#8217;s association response receive proportionally more link equity via inverse position weighting. Each node&#8217;s outgoing edges are row-normalized to form a proper transition matrix.</p>



<h3 class="wp-block-heading">4.3 Dangling Nodes</h3>



<p>Brands with no outgoing edges (leaf nodes discovered at L2 but never queried) redistribute their accumulated mass back to the personalization vector, preserving the stochastic property of the transition matrix.</p>



<h3 class="wp-block-heading">4.4 Sparse Matrix Power Iteration</h3>



<p>The transition matrix is stored as a scipy CSR sparse matrix. Power iteration multiplies the current score vector by the transition matrix, adds the teleportation component, and repeats until convergence. Convergence criterion: L1 norm between successive score vectors falls below 1e-8, typically achieved within 30-50 iterations.</p>
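


<p>Putting sections 4.1 through 4.4 together, here&#8217;s a condensed sketch of the computation. It simplifies our actual pipeline (illustrative names, everything in one function), but the inverse-position edge weights, row normalization, dangling-mass redistribution, and L1 convergence test all follow the description above:</p>



<pre class="wp-block-code"><code>import numpy as np
import scipy.sparse as sparse

def personalized_pagerank(edges, n_nodes, seed_weight, alpha=0.15, tol=1e-8):
    """edges: (src, dst, position) triples; seed_weight: zero for ignored seeds."""
    src, dst, pos = (np.asarray(col) for col in zip(*edges))
    weights = 1.0 / pos                          # inverse-position edge weight (4.2)
    M = sparse.csr_matrix((weights, (src, dst)), shape=(n_nodes, n_nodes))
    out_mass = np.asarray(M.sum(axis=1)).ravel()
    has_out = out_mass &gt; 0
    inv = np.zeros(n_nodes)
    inv[has_out] = 1.0 / out_mass[has_out]
    M = sparse.diags(inv) @ M                    # row-normalize: stochastic rows
    v = seed_weight / seed_weight.sum()          # teleportation distribution (4.1)
    x = v.copy()
    while True:
        dangling = x[~has_out].sum()             # leaf mass back to the seeds (4.3)
        x_next = (1 - alpha) * (x @ M + dangling * v) + alpha * v
        if np.abs(x_next - x).sum() &lt; tol:       # L1 convergence criterion (4.4)
            return x_next
        x = x_next
</code></pre>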



<h3 class="wp-block-heading">4.5 Why Personalized PageRank</h3>



<p>Standard PageRank uses uniform teleportation — the random surfer restarts at any node with equal probability. Personalized PageRank biases the restart distribution toward specific nodes. In our case, seeds with higher recall frequency and earlier recall position receive more teleportation mass, making them stronger sources of authority in the network. Authority accumulates continuously from all reachable seeds, weighted by both seed authority and graph structure.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="708" height="285" src="https://dejan.ai/wp-content/uploads/2026/03/image-23.png" alt="" class="wp-image-2375" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-23.png 708w, https://dejan.ai/wp-content/uploads/2026/03/image-23-300x121.png 300w" sizes="auto, (max-width: 708px) 100vw, 708px" /></figure>



<h2 class="wp-block-heading">5. Results</h2>



<h3 class="wp-block-heading">5.1 Top 30 Brands</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Rank</th><th>Brand</th><th>Score</th></tr></thead><tbody><tr><td>1</td><td>Google</td><td>1.000000</td></tr><tr><td>2</td><td>Microsoft</td><td>0.983081</td></tr><tr><td>3</td><td>Nike</td><td>0.951061</td></tr><tr><td>4</td><td>Apple</td><td>0.876266</td></tr><tr><td>5</td><td>Adidas</td><td>0.700542</td></tr><tr><td>6</td><td>Sony</td><td>0.684061</td></tr><tr><td>7</td><td>Gucci</td><td>0.639839</td></tr><tr><td>8</td><td>Amazon</td><td>0.623930</td></tr><tr><td>9</td><td>Coca-Cola</td><td>0.590042</td></tr><tr><td>10</td><td>Chanel</td><td>0.570568</td></tr><tr><td>11</td><td>Prada</td><td>0.550746</td></tr><tr><td>12</td><td>Samsung</td><td>0.532741</td></tr><tr><td>13</td><td>Toyota</td><td>0.516163</td></tr><tr><td>14</td><td>Louis Vuitton</td><td>0.511476</td></tr><tr><td>15</td><td>Rolex</td><td>0.508761</td></tr><tr><td>16</td><td>Disney</td><td>0.507488</td></tr><tr><td>17</td><td>Hermes</td><td>0.487205</td></tr><tr><td>18</td><td>Dior</td><td>0.479031</td></tr><tr><td>19</td><td>Pepsi</td><td>0.442026</td></tr><tr><td>20</td><td>Intel</td><td>0.427143</td></tr><tr><td>21</td><td>Honda</td><td>0.420288</td></tr><tr><td>22</td><td>Patagonia</td><td>0.417196</td></tr><tr><td>23</td><td>Audi</td><td>0.405366</td></tr><tr><td>24</td><td>Panasonic</td><td>0.396073</td></tr><tr><td>25</td><td>Cartier</td><td>0.374052</td></tr><tr><td>26</td><td>Volkswagen</td><td>0.368643</td></tr><tr><td>27</td><td>Nintendo</td><td>0.361812</td></tr><tr><td>28</td><td>Porsche</td><td>0.360956</td></tr><tr><td>29</td><td>McDonald&#8217;s</td><td>0.344910</td></tr><tr><td>30</td><td>PUMA</td><td>0.330191</td></tr></tbody></table></figure>



<h3 class="wp-block-heading">5.2 Top Non-Seed Brands</h3>



<p>The highest-ranking brands that Gemini never recalled unprompted but discovered purely through association:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Rank</th><th>Brand</th><th>Score</th></tr></thead><tbody><tr><td>1</td><td>Maison Margiela</td><td>0.094542</td></tr><tr><td>2</td><td>Office</td><td>0.075253</td></tr><tr><td>3</td><td>L.L.Bean</td><td>0.074981</td></tr><tr><td>4</td><td>Cotopaxi</td><td>0.072272</td></tr><tr><td>5</td><td>Rick Owens</td><td>0.070130</td></tr><tr><td>6</td><td>Grand Seiko</td><td>0.066426</td></tr><tr><td>7</td><td>Bravia</td><td>0.059241</td></tr><tr><td>8</td><td>Jil Sander</td><td>0.058125</td></tr><tr><td>9</td><td>Mickey Mouse</td><td>0.057300</td></tr><tr><td>10</td><td>Richard Mille</td><td>0.055195</td></tr></tbody></table></figure>



<p>These brands score high not because the model recalls them spontaneously, but because they sit at dense intersections of associations from high-authority seeds.</p>



<h3 class="wp-block-heading">5.3 Scale</h3>



<ul class="wp-block-list">
<li>Total ranked brands: <strong>2,886,212</strong></li>



<li>Score range: 0.000000 to 1.000000</li>



<li>Seeds in top 30: 30/30</li>



<li>Non-seed brands discovered: <strong>2,880,320</strong></li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="482" src="https://dejan.ai/wp-content/uploads/2026/03/image-19-1024x482.png" alt="PageRank NS" class="wp-image-2366" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-19-1024x482.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-19-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-19-768x361.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-19-1536x723.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-19.png 1908w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">6. What the Scores Measure</h2>



<p>The final scores capture <strong>associative embeddedness</strong> — a combination of:</p>



<ol class="wp-block-list">
<li><strong>Direct recall</strong> — Seeds that Gemini recalls frequently and early receive teleportation mass every iteration</li>



<li><strong>Centrality</strong> — Brands associated with many other high-authority brands accumulate more random walk traffic</li>



<li><strong>Network position</strong> — A brand with moderate recall but central positioning scores higher than a frequently recalled but isolated brand</li>
</ol>



<p>This is distinct from simple popularity or recall frequency. A brand like Maison Margiela ranks as the top non-seed brand not because Gemini recalls it unprompted, but because it sits at a dense intersection of luxury fashion associations — reachable from dozens of high-authority seeds via short, heavily-weighted paths.</p>



<p>The PageRank scores answer not &#8220;how often does the model think of this brand?&#8221; but &#8220;how deeply embedded is this brand in the model&#8217;s associative structure?&#8221;</p>



<h2 class="wp-block-heading">7. Technical Stack</h2>



<ul class="wp-block-list">
<li><strong>Model</strong>: Google Gemini 3 Flash Preview</li>



<li><strong>Phase 1</strong>: 200,000 recall surveys, 8,608 raw seeds, ~20M total mentions</li>



<li><strong>Phase 2</strong>: ~14,500 association queries (L1 + L2), millions of directed edges</li>



<li><strong>Graph</strong>: 2,886,212 nodes</li>



<li><strong>Algorithm</strong>: Personalized PageRank via scipy sparse matrix power iteration</li>



<li><strong>Teleportation factor (alpha)</strong>: 0.15</li>



<li><strong>Convergence tolerance</strong>: 1e-8</li>



<li><strong>Seed quality control</strong>: 2,163 junk seeds identified via manual review and excluded</li>



<li><strong>Canonicalization</strong>: Cyrillic homoglyph mapping, CJK filtering, 193,070 variant mappings, 15 manual CJK aliases</li>



<li><strong>Storage</strong>: SQLite (1.5GB)</li>



<li><strong>Dashboard</strong>: Streamlit with Plotly 3D network visualization</li>



<li><strong>Concurrency</strong>: 20 simultaneous async API calls with incremental database commits</li>
</ul>



<div class="wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button" href="https://authority.dejan.ai/">Dejan Authority Database</a></div>
</div>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/brands/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>TurboQuant: From Paper to Triton Kernel in One Session</title>
		<link>https://dejan.ai/blog/turboquant/</link>
					<comments>https://dejan.ai/blog/turboquant/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 07:16:09 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2351</guid>

					<description><![CDATA[Implementing Google&#8217;s KV cache compression algorithm on Gemma 3 4B and everything that went wrong along the way. On March 24, 2026, Google Research published a blog post introducing TurboQuant, a compression algorithm for large language model inference. The paper behind it, &#8220;Online Vector Quantization with Near-optimal Distortion Rate&#8221; had been on arXiv since April [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>Implementing Google&#8217;s KV cache compression algorithm on Gemma 3 4B and everything that went wrong along the way.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>On March 24, 2026, Google Research published a blog post introducing <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant</a>, a compression algorithm for large language model inference. The paper behind it, &#8220;<a href="https://arxiv.org/pdf/2504.19874">Online Vector Quantization with Near-optimal Distortion Rate</a>,&#8221; had been on arXiv since April 2025 and was accepted at <a href="https://iclr.cc/">ICLR 2026</a>. The claims were striking: compress the key-value cache to 3 bits per coordinate with zero accuracy loss, no training required, and up to 8x speedup on H100 GPUs.</p>



<p>I decided to implement it from scratch and see if the claims held up. They did, and then some.</p>



<h2 class="wp-block-heading">What Google Built</h2>



<p>Every time a transformer generates a token, it computes attention over all previous tokens. The key-value (KV) cache stores those previously computed states to avoid redundant work. As sequences get longer, this cache becomes a serious memory bottleneck: it grows linearly with sequence length and consumes precious GPU memory that could otherwise be used for larger batches or longer contexts.</p>



<p>Vector quantization is the obvious solution: compress the KV cache to fewer bits. But traditional quantization methods carry hidden overhead. They need to store normalization constants (zero points, scales) for every small block of data, typically adding 1-2 extra bits per number. At low bit-widths, this overhead can eat a significant chunk of the compression gains.</p>



<p>TurboQuant eliminates this overhead through a two-stage approach built on a clean mathematical insight.</p>



<p><strong>Stage 1 — Random rotation + Lloyd-Max quantization.</strong> The algorithm applies a random orthogonal rotation to each KV vector. This is the key trick: after rotation, each coordinate&#8217;s distribution becomes a known Beta distribution, concentrated near zero with a predictable shape that depends only on the vector dimension. Because the distribution is known analytically, you can precompute the optimal scalar quantizer (a Lloyd-Max quantizer) once and reuse it for every vector. No per-block normalization constants, no data-dependent calibration, no training. Just rotate and quantize.</p>
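


<p>Here&#8217;s a minimal PyTorch sketch of Stage 1, assuming a precomputed rotation matrix and centroid table (both covered later in this post). One detail to treat as an assumption rather than fact: each vector&#8217;s norm is stored separately and only the unit-norm direction is quantized.</p>



<pre class="wp-block-code"><code>import torch

def quantize(x, rotation, centroids):
    """Stage 1 sketch: rotate, then nearest-centroid lookup per coordinate."""
    norms = x.norm(dim=-1, keepdim=True)  # assumption: norms kept alongside indices
    rotated = (x / norms) @ rotation.T    # coordinates now follow the Beta law
    # Nearest centroid per coordinate; fine for small codebooks (16 levels at 4 bits).
    indices = (rotated.unsqueeze(-1) - centroids).abs().argmin(dim=-1)
    return indices.to(torch.uint8), norms

def dequantize(indices, norms, rotation, centroids):
    """Centroid lookup, rotate back, restore the stored norm."""
    return (centroids[indices.long()] @ rotation) * norms
</code></pre>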



<p><strong>Stage 2 — QJL residual correction.</strong> The paper&#8217;s inner-product-optimized variant (TurboQuant_prod) applies a 1-bit Quantized Johnson-Lindenstrauss transform to the quantization residual. This gives an unbiased inner product estimator, which matters because attention scores are inner products. This stage requires a custom attention kernel to realize its benefits, you can&#8217;t just add the QJL correction back to the reconstructed vector (more on that later).</p>



<p>The theoretical backing is strong: TurboQuant&#8217;s MSE distortion is provably within a factor of ~2.7 of the information-theoretic lower bound. For a data-oblivious algorithm (one that doesn&#8217;t look at the data distribution), that&#8217;s essentially optimal.</p>



<h2 class="wp-block-heading">What We Built</h2>



<p>We implemented TurboQuant from scratch in PyTorch and tested it on Gemma 3 4B IT running on an RTX 4090. The implementation has three layers, each building on the last:</p>



<p><strong>Layer 1: Core algorithm</strong> (<code>turboquant_core.py</code>). The random rotation, Lloyd-Max codebook computation, and quantize/dequantize operations. The codebook is built once for a given (dimension, bit-width) pair by running 300 iterations of Lloyd-Max optimization over a dense numerical grid of the Beta distribution. This takes a few seconds on CPU and the result is cached.</p>



<p><strong>Layer 2: Python KV cache integration</strong> (<code>turboquant_kv_cache.py</code>). A patched <code>DynamicCache</code> that quantizes key and value tensors on every <code>cache.update()</code> call. This is the simplest integration path: it works with any HuggingFace model and requires no model-specific code. The tradeoff is that it stores the dequantized fp16 tensors back in the cache, so you don&#8217;t save memory; you only simulate the accuracy impact of quantization.</p>



<p><strong>Layer 3: Triton fused kernel</strong> (<code>triton_attention.py</code> + <code>turboquant_fused.py</code>). A custom Triton kernel that computes attention scores directly from compressed uint8 key indices, never materializing fp16 keys. This is where the real memory and speed gains come from.</p>



<p>The fused kernel exploits a simple algebraic identity. Since the rotation matrix R is orthogonal:</p>



<p>$$\langle q, R^T \cdot \text{centroids}[\text{idx}] \rangle = \langle R \cdot q, \text{centroids}[\text{idx}] \rangle$$</p>



<p>Pre-rotate the query once with a single matmul, then the per-KV-position work reduces to a centroid table lookup and dot product. The Triton kernel does this across all sequence positions in parallel, loading uint8 indices instead of fp16 values and moving roughly 4x less data from GPU memory.</p>
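


<p>In PyTorch terms, the identity reduces to the following reference computation (a naive equivalent of what the Triton kernel does in parallel; softmax scaling and the QJL correction omitted):</p>



<pre class="wp-block-code"><code>import torch

def scores_from_indices(q, key_indices, key_norms, rotation, centroids):
    """Reference math for the fused kernel (softmax scaling omitted).

    q:           float[d] query
    key_indices: uint8[T, d] compressed key coordinates
    key_norms:   float[T, 1] per-key norms saved at quantization time
    """
    q_rot = q @ rotation.T                 # pre-rotate the query once
    k_hat = centroids[key_indices.long()]  # centroid table lookup, [T, d]
    return (k_hat @ q_rot) * key_norms.squeeze(-1)  # one dot product per position
</code></pre>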



<h2 class="wp-block-heading">Results</h2>



<h3 class="wp-block-heading">Core Algorithm Validation</h3>



<p>On synthetic vectors (d=256), the quantize-dequantize roundtrip quality:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Bits</th><th>Cosine Similarity</th><th>Inner Product Correlation</th><th>Compression</th></tr></thead><tbody><tr><td>2</td><td>0.940</td><td>0.945</td><td>15.5x</td></tr><tr><td>3</td><td>0.983</td><td>0.984</td><td>10.4x</td></tr><tr><td>4</td><td>0.995</td><td>0.995</td><td>7.9x</td></tr></tbody></table></figure>



<h3 class="wp-block-heading">Triton Kernel Microbenchmark</h3>



<p>The fused kernel vs standard dequantize-then-matmul, measuring just the Q@K^T operation:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>KV Length</th><th>Standard</th><th>Fused</th><th>Speedup</th></tr></thead><tbody><tr><td>128</td><td>0.076ms</td><td>0.066ms</td><td>1.15x</td></tr><tr><td>512</td><td>0.061ms</td><td>0.050ms</td><td>1.22x</td></tr><tr><td>1024</td><td>0.061ms</td><td>0.052ms</td><td>1.18x</td></tr><tr><td>4096</td><td>0.062ms</td><td>0.051ms</td><td>1.22x</td></tr></tbody></table></figure>



<p>Cosine similarity between the kernel output and the PyTorch reference: 1.000000. The kernel matches the reference to floating-point precision.</p>



<h3 class="wp-block-heading">End-to-End Generation on Gemma 3 4B IT</h3>



<p>Three prompts: explain compilers vs interpreters, write a palindrome function, causes of the French Revolution. Each generated up to 200 tokens with greedy decoding.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Config</th><th>Avg tok/s</th><th>Output Quality</th><th>VRAM Delta</th></tr></thead><tbody><tr><td>fp16 baseline</td><td>17.7</td><td>reference</td><td>26 MB</td></tr><tr><td>4-bit Python path</td><td>13.8</td><td>correct, minor rephrase</td><td>19 MB</td></tr><tr><td>4-bit FUSED</td><td>16.5</td><td>identical to baseline</td><td>4 MB</td></tr><tr><td>2-bit Python path</td><td>14.0</td><td>some degradation</td><td>15 MB</td></tr><tr><td>2-bit FUSED</td><td>17.7</td><td>identical to baseline</td><td>7 MB</td></tr></tbody></table></figure>



<p>The 2-bit fused path produces character-for-character identical output to the fp16 baseline on all three prompts, at the same speed, with 3-6x less VRAM for the KV cache.</p>



<h2 class="wp-block-heading">Technical Deep Dive</h2>



<h3 class="wp-block-heading">The Lloyd-Max Codebook</h3>



<p>After random rotation on the unit sphere S^{d-1}, each coordinate follows a Beta((d-1)/2, (d-1)/2) distribution on [-1, 1]. For large d (Gemma 3 uses d=256), this concentrates tightly around zero with standard deviation approximately 1/sqrt(d) ≈ 0.0625.</p>
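

<p>That concentration claim is easy to check empirically (a quick sanity check, not from the repo):</p>



<pre class="wp-block-code"><code>import torch

d = 256
R = torch.linalg.qr(torch.randn(d, d))&#91;0]
x = torch.nn.functional.normalize(torch.randn(100_000, d), dim=-1)  # unit sphere
coords = (x @ R.T)&#91;:, 0]            # one coordinate after rotation
print(coords.std())                   # ~ 1/sqrt(256) = 0.0625
print(coords.abs().max())             # tiny compared to the &#91;-1, 1] support</code></pre>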



<p>The codebook construction solves the continuous k-means problem for this distribution: partition [-1, 1] into 2^b intervals and find the centroid of each interval that minimizes weighted MSE under the Beta PDF. We use a dense grid (50,000 points) focused on the ±6σ range where the distribution has mass, then run standard Lloyd-Max iteration: assign grid points to nearest centroid, update centroids as weighted means, repeat.</p>
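

<p>A condensed sketch of that construction (our names and simplifications; the repo builds this once per (dimension, bit-width) pair and caches the result):</p>



<pre class="wp-block-code"><code>import torch
from torch.distributions import Beta

def lloyd_max_codebook(d=256, bits=4, iters=300, grid_n=50_000):
    """Lloyd-Max over a dense grid of the post-rotation coordinate
    distribution (Beta((d-1)/2, (d-1)/2) rescaled to &#91;-1, 1])."""
    sigma = 1.0 / d ** 0.5
    grid = torch.linspace(-6 * sigma, 6 * sigma, grid_n, dtype=torch.float64)
    conc = torch.tensor((d - 1) / 2.0, dtype=torch.float64)
    w = Beta(conc, conc).log_prob((grid + 1) / 2).exp()    # PDF weights on the grid
    centroids = torch.linspace(-3 * sigma, 3 * sigma, 2 ** bits, dtype=torch.float64)
    for _ in range(iters):
        assign = (grid&#91;:, None] - centroids&#91;None, :]).abs().argmin(dim=1)
        for j in range(2 ** bits):
            mask = assign == j
            if mask.any():                                  # weighted mean per cell
                centroids&#91;j] = (grid&#91;mask] * w&#91;mask]).sum() / w&#91;mask].sum()
    return centroids.float()

print(lloyd_max_codebook(bits=4))</code></pre>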



<p>The resulting codebook has an interesting structure — the centroids cluster densely near zero where the distribution is concentrated, with wider spacing in the tails. At 4 bits (16 levels), the centroid spacing near zero is approximately 0.008, providing very fine-grained reconstruction in the region where most values live.</p>



<h3 class="wp-block-heading">The Random Rotation</h3>



<p>The paper uses a randomized Hadamard transform (H · diag(signs)) for the rotation. We initially implemented this faithfully — and it was catastrophically slow. The Fast Walsh-Hadamard Transform is a series of butterfly operations, and our Python implementation executed each butterfly as a tensor slice operation. On GPU, this meant thousands of tiny CUDA kernel launches per rotation, with Python-level loop overhead between each one.</p>



<p>We replaced it with a precomputed random orthogonal matrix via QR decomposition. Mathematically equivalent — any orthogonal rotation on S^{d-1} produces the same Beta distribution on coordinates. The QR matrix is d×d (256×256 = 256KB, negligible), computed once from a seeded random Gaussian matrix, and the rotation becomes a single <code>torch.matmul</code>. Problem solved.</p>
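

<p>Roughly like this (a sketch, not the repo&#8217;s exact code; the sign fix is the standard trick for making the QR factor uniformly distributed over rotations):</p>



<pre class="wp-block-code"><code>import torch

def make_rotation(d: int, seed: int = 0) -> torch.Tensor:
    """Seeded random orthogonal matrix; built once, then rotation = one matmul."""
    g = torch.Generator().manual_seed(seed)
    A = torch.randn(d, d, generator=g)
    Q, R = torch.linalg.qr(A)
    return Q * torch.sign(torch.diagonal(R))   # sign fix: Haar-uniform rotation

R = make_rotation(256)
x = torch.randn(32, 256)
x_rot = x @ R.T                                 # the entire rotation step</code></pre>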



<p>A production implementation would use a structured rotation (Hadamard + random signs) with a fused CUDA kernel for the butterfly operations. The structured form is more memory-efficient (you only store the d random signs, not a d×d matrix) and the butterfly operations parallelize beautifully on GPU. But for a reference implementation, the dense matrix works fine.</p>
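

<p>For the curious, the butterflies can at least be batched in pure PyTorch into log2(d) tensor ops, a middle ground between the naive loop and a fused kernel (a sketch of the structured form, not the repo&#8217;s code):</p>



<pre class="wp-block-code"><code>import math
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Walsh-Hadamard transform over the last dim (d a power of two), with
    the butterflies batched into log2(d) tensor ops instead of one op each."""
    shape = x.shape
    d = shape&#91;-1]
    x = x.reshape(-1, d)
    h = 1
    while h &lt; d:
        x = x.view(-1, d // (2 * h), 2, h)
        a, b = x&#91;:, :, 0, :], x&#91;:, :, 1, :]
        x = torch.stack((a + b, a - b), dim=2).reshape(-1, d)
        h *= 2
    return (x / math.sqrt(d)).reshape(shape)    # orthonormal scaling

signs = torch.randint(0, 2, (256,)) * 2 - 1     # the only stored state: d signs
x_rot = fwht(torch.randn(8, 256) * signs)       # H · diag(signs) rotation</code></pre>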



<h3 class="wp-block-heading">The Triton Kernel</h3>



<p>The kernel parallelizes over (query_head × batch, sequence_position_block). Each program instance does the following (a plain-PyTorch sketch of the same computation follows the list):</p>



<ol class="wp-block-list">
<li>Loads a slice of the pre-rotated query vector (BLOCK_D elements)</li>



<li>Loads the corresponding key indices for BLOCK_S sequence positions (uint8)</li>



<li>Gathers centroid values via table lookup (<code>tl.load(C_ptr + k_idx)</code>)</li>



<li>Accumulates the partial dot product</li>



<li>Multiplies by key norms and the attention scale factor</li>
</ol>
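

<p>In plain PyTorch, the computation each instance performs is equivalent to (a reference sketch with our names, not the Triton source):</p>



<pre class="wp-block-code"><code>import torch

def fused_scores_reference(q_rot, k_idx, k_norms, codebook, scale):
    """q_rot: (d,) pre-rotated query; k_idx: (seq, d) uint8 centroid indices;
    k_norms: (seq,) fp16 per-key norms; codebook: (2**bits,) centroids."""
    k_hat = codebook&#91;k_idx.long()]   # step 3: centroid table lookup (gather)
    scores = k_hat @ q_rot             # steps 1-4: dot products, all positions at once
    return scores * k_norms * scale    # step 5: rescale by norms and attention scale</code></pre>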



<p>The autotuner searches over 5 configurations of (BLOCK_S, BLOCK_D) and warp count. On the RTX 4090, it typically selects BLOCK_S=64, BLOCK_D=64 with 4 warps.</p>



<p>The key efficiency win is memory bandwidth. Loading uint8 indices requires 1 byte per element; loading fp16 keys requires 2 bytes. The centroid table (16 float32 values at 4-bit, or 4 values at 2-bit) fits comfortably in L1/L2 cache and is reused across all sequence positions. The net effect is roughly 2x less data movement from HBM, which translates to the observed ~1.2x speedup on the Q@K^T operation.</p>



<h3 class="wp-block-heading">GQA Handling</h3>



<p>Gemma 3 4B uses Grouped Query Attention with 8 query heads and 4 KV heads (ratio 2:1). The kernel handles this by mapping each query head to its corresponding KV head: <code>kv_head = q_head // gqa_ratio</code>. The key indices and norms are loaded from the KV head, while queries come from the query head. This means each KV head&#8217;s compressed data is read twice (once per query head in its group), but since it&#8217;s small (uint8), the redundant reads are cheap.</p>



<h3 class="wp-block-heading">Cache Architecture</h3>



<p>The fused integration stores keys in compressed form (uint8 indices + fp16 norms per vector) and values in standard fp16. We only compress keys because the attention score computation (Q@K^T) is where the memory bandwidth bottleneck lives during decoding. The softmax@V multiplication is less critical because it&#8217;s compute-bound rather than memory-bound at typical sequence lengths.</p>



<p>A fully optimized implementation would also compress values, but the gains are smaller and the integration is more complex (you&#8217;d need a second Triton kernel for the softmax@V step with compressed values).</p>



<h2 class="wp-block-heading">What Didn&#8217;t Work</h2>



<h3 class="wp-block-heading">Mistake 1: Adding QJL Back to the Reconstructed Vector</h3>



<p>The paper describes two variants: TurboQuant_mse (pure Lloyd-Max, best for reconstruction) and TurboQuant_prod (Lloyd-Max + 1-bit QJL, best for inner products). Our first implementation used TurboQuant_prod for the KV cache: (bits-1) bits of Lloyd-Max plus 1 bit of QJL on the residual.</p>



<p>The QJL stage produces a correction term that makes the inner product estimator unbiased. But when you add this correction back to the reconstructed vector and store it in the KV cache, you&#8217;re injecting noise into the vector itself. The result: cosine similarity dropped to 0.69 (terrible) and the model produced garbage.</p>



<p>The fix was simple: use TurboQuant_mse (all bits to Lloyd-Max) for the drop-in cache, and reserve TurboQuant_prod for a custom attention kernel that can use the two-part representation directly. The fused Triton kernel implements the MSE variant.</p>



<h3 class="wp-block-heading">Mistake 2: Gemma 3 4B Is Not a Causal LM</h3>



<p>We initially loaded the model with <code>AutoModelForCausalLM</code> and <code>AutoTokenizer</code>. This loaded the model fine, tokenized fine, and even generated — but every output token was <code>&lt;pad&gt;</code> (token ID 0). The baseline and quantized paths both produced identical pad sequences.</p>



<p>Gemma 3 4B+ is a multimodal model. It requires <code>Gemma3ForConditionalGeneration</code> and <code>AutoProcessor</code>, not the causal LM variants. The <code>AutoProcessor</code> handles the chat template correctly and returns the right token format. This wasn&#8217;t a quantization bug at all — the model simply wasn&#8217;t being invoked correctly.</p>



<h3 class="wp-block-heading">Mistake 3: Python-Loop Hadamard Transform</h3>



<p>The Fast Walsh-Hadamard Transform is O(d log d) butterfly operations. Our initial implementation ran each butterfly as a Python loop iteration with tensor slicing:</p>



<pre class="wp-block-code"><code>while h &lt; d:
    for start in range(0, d, stride):
        lo = slice(start, start + h)
        hi = slice(start + h, start + stride)
        a = result&#91;..., lo].clone()
        b = result&#91;..., hi].clone()
        result&#91;..., lo] = a + b
        result&#91;..., hi] = a - b
    h *= 2
</code></pre>



<p>For d=256, this is 8 outer passes × 128 inner iterations = 1,024 loop iterations per rotation, each issuing several tiny CUDA kernels, with Python interpreter overhead between every one. On a KV cache update touching 27 layers × 4 KV heads × 256-dim vectors, the GPU was spending more time waiting for Python than doing math. Generation hung completely — even a 20-token completion with a trivial prompt didn&#8217;t return.</p>



<p>Replacing this with a single <code>x @ Q_T</code> matmul using a precomputed orthogonal matrix made it instant.</p>



<h3 class="wp-block-heading">Mistake 4: Subclassing DynamicCache</h3>



<p>Our first KV cache integration subclassed HuggingFace&#8217;s <code>DynamicCache</code>. This broke immediately because Gemma 3&#8217;s model code calls <code>past_key_values.is_initialized</code>, <code>past_key_values.key_cache</code>, and other attributes whose names and semantics change across transformers versions. Our subclass was missing several of these.</p>



<p>We tried three approaches:</p>



<ul class="wp-block-list">
<li>Subclassing <code>DynamicCache</code> (broke on <code>.is_initialized</code>)</li>



<li>Forward hooks on attention layers (fragile, couldn&#8217;t reliably find the cache object)</li>



<li>Patching <code>cache.update()</code> on a stock <code>DynamicCache</code> instance (worked perfectly)</li>
</ul>



<p>The final approach is the cleanest: create a normal <code>DynamicCache</code>, save a reference to its <code>update</code> method, and replace it with a wrapper that quantizes inputs before calling the original. All the cache&#8217;s internal bookkeeping (sequence length tracking, layer indexing) works unchanged.</p>
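

<p>In code, the pattern looks roughly like this (a sketch; <code>roundtrip</code> stands in for the quantize-dequantize path from the core module):</p>



<pre class="wp-block-code"><code>from transformers import DynamicCache

def install_quantized_update(cache: DynamicCache, roundtrip):
    """Wrap cache.update() so K/V pass through quantize-dequantize before
    storage. All other cache bookkeeping is left untouched."""
    original_update = cache.update

    def patched_update(key_states, value_states, layer_idx, cache_kwargs=None):
        return original_update(roundtrip(key_states), roundtrip(value_states),
                               layer_idx, cache_kwargs)

    cache.update = patched_update
    return cache</code></pre>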



<h3 class="wp-block-heading">Mistake 5: Token Counting After Fused Generation</h3>



<p>The <code>FusedTurboQuantRunner</code> returns decoded text directly (not output IDs), so we tried <code>processor.encode(text)</code> to count tokens for the timing report. But <code>Gemma3Processor</code> is a multimodal processor — it has <code>decode</code> but not <code>encode</code>. The tokenizer lives at <code>processor.tokenizer.encode()</code>. A one-line fix, but it crashed the first successful fused generation and hid the results until the next run.</p>



<h2 class="wp-block-heading">Comparison with Other Implementations</h2>



<p>Prince Canuma independently implemented TurboQuant in MLX and tested on Qwen 3.5 35B with context lengths up to 64K tokens. Their results: 6/6 exact match on needle-in-haystack at every quantization level, 4.9x smaller KV cache at 2.5-bit, 3.8x at 3.5-bit.</p>



<p>Two implementations, different frameworks (PyTorch+Triton vs MLX), different models (Gemma 3 4B vs Qwen 3.5 35B), different hardware (NVIDIA RTX 4090 vs Apple Silicon) — same conclusion. TurboQuant&#8217;s theoretical guarantees translate directly to practice across the board.</p>



<h2 class="wp-block-heading">What&#8217;s Next</h2>



<p>This implementation leaves several optimizations on the table:</p>



<p><strong>Value cache compression.</strong> We only compress keys. Compressing values would require a second Triton kernel for the softmax@V multiplication, but would further reduce memory usage.</p>



<p><strong>Structured rotation.</strong> The precomputed d×d orthogonal matrix works but uses O(d²) memory. A fused Hadamard kernel would use O(d) memory (just the random signs) and be faster for large d.</p>



<p><strong>Sub-byte packing.</strong> We store 2-bit indices as uint8. Packing 4 indices per byte would reduce memory by another 4x for the index storage.</p>
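

<p>The packing itself is straightforward (a sketch of the missing piece; the fused kernel would also need to unpack on load):</p>



<pre class="wp-block-code"><code>import torch

def pack_2bit(idx: torch.Tensor) -> torch.Tensor:
    """Pack four 2-bit indices per byte (last dim divisible by 4)."""
    idx = idx.view(*idx.shape&#91;:-1], -1, 4)
    return (idx&#91;..., 0]
            | (idx&#91;..., 1] &lt;&lt; 2)
            | (idx&#91;..., 2] &lt;&lt; 4)
            | (idx&#91;..., 3] &lt;&lt; 6))

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    parts = &#91;(packed >> s) &amp; 0x3 for s in (0, 2, 4, 6)]
    return torch.stack(parts, dim=-1).flatten(-2)

idx = torch.randint(0, 4, (64, 256), dtype=torch.uint8)
assert torch.equal(unpack_2bit(pack_2bit(idx)), idx)     # lossless roundtrip</code></pre>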



<p><strong>Flash Attention integration.</strong> The ultimate goal: fuse the centroid gather into a Flash Attention-style kernel that never materializes the full attention matrix. This would combine TurboQuant&#8217;s memory savings with Flash Attention&#8217;s IO efficiency.</p>



<p>The paper&#8217;s claim of 8x speedup on H100s comes from optimized int4 tensor core kernels — that level of hardware-specific optimization is beyond a one-session implementation, but the algorithmic foundation is solid and the path from here to production is clear.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><em>Paper: <a href="https://arxiv.org/abs/2504.19874">TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate</a> (ICLR 2026)</em></p>



<p class="has-text-align-center"><em>Complete implementation including Triton kernel</em>:</p>



<div class="wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link has-text-align-center wp-element-button" href="https://dejan.ai/media/code/turboquant.zip">DOWNLOAD CODE</a></div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<pre class="wp-block-code" style="font-size:0.7rem"><code>                   python run_demo.py --fused --max-new-tokens 200 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -&gt; 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -&gt; 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -&gt; 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:02&lt;00:00, 304.27it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.52s  (15.0 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 6.06s  (11.9 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.73s  (14.4 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.71s  (12.9 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.24s  (16.0 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 11.41s  (17.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 14.36s  (13.9 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 12.18s  (16.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 6.20s  (13.9 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.72s  (17.2 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.92s  (17.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.78s  (13.9 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.85s  (15.8 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.85s  (14.1 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 9.15s  (17.0 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

PS C:\projects\tq&gt;</code></pre>



<p>In response to a <a href="https://x.com/ARomeoSierra/status/2036996932829171852?s=20">Twitter question</a>:</p>



<pre class="wp-block-code" style="font-size:0.7rem"><code>PS C:\projects\tq> python run_demo.py --fused --long-context --haystack-tokens 4096 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:03&lt;00:00, 274.55it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.93s  (13.8 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 5.86s  (12.3 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.63s  (14.7 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.68s  (13.1 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.17s  (16.3 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 10.91s  (18.3 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 13.76s  (14.5 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 11.78s  (17.1 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 5.97s  (14.4 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.28s  (17.8 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.55s  (18.3 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.21s  (14.5 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.43s  (16.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.56s  (14.5 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 8.92s  (17.5 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

======================================================================
Needle-in-a-haystack (~4096 tokens)
======================================================================
  fp16 baseline              &#91;FOUND]  1.0s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 4-bit           &#91;FOUND]  0.7s
    Answer: blue-giraffe-42
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers
  TurboQuant 4-bit FUSED     &#91;FOUND]  1.9s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 2-bit           &#91;FOUND]  1.1s
    Answer: The secret password is 'blue-giraffe-42'.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers
  TurboQuant 2-bit FUSED     &#91;FOUND]  1.4s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.

PS C:\projects\tq> python run_demo.py --fused --long-context --haystack-tokens 16384 --bits 4 2
======================================================================
Stage 0: TurboQuant core algorithm self-test
======================================================================
  Building Lloyd-Max codebook (d=256, bits=2)... done.
TurboQuant_mse  d=256  bits=2  n=64
  MSE:                0.118044
  Mean cosine sim:    0.9396
  Inner-product corr: 0.9451
  Size: 65,536 -> 4,224 bytes  (15.5x)

  Building Lloyd-Max codebook (d=256, bits=3)... done.
TurboQuant_mse  d=256  bits=3  n=64
  MSE:                0.034799
  Mean cosine sim:    0.9826
  Inner-product corr: 0.9836
  Size: 65,536 -> 6,272 bytes  (10.4x)

  Building Lloyd-Max codebook (d=256, bits=4)... done.
TurboQuant_mse  d=256  bits=4  n=64
  MSE:                0.009740
  Mean cosine sim:    0.9952
  Inner-product corr: 0.9949
  Size: 65,536 -> 8,320 bytes  (7.9x)

Loading google/gemma-3-4b-it ...
Fetching 2 files: 100%|████████████████████████████████████████████████████████| 2/2 &#91;00:00&lt;?, ?it/s]
Download complete: : 0.00B &#91;00:00, ?B/s]                                       | 0/2 &#91;00:00&lt;?, ?it/s]
Loading weights: 100%|████████████████████████████████████████████| 883/883 &#91;00:03&lt;00:00, 285.34it/s]
The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
Model loaded on cuda:0

======================================================================
Prompt: Explain the difference between a compiler and an interpreter in three sentences.
======================================================================
The following generation flags are not valid and may be ignored: &#91;'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

--- fp16 baseline ---
  Tokens: 68  Time: 4.32s  (15.7 tok/s)  VRAM delta: 26 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=4)... done.

--- TurboQuant 4-bit ---
  Tokens: 72  Time: 5.94s  (12.1 tok/s)  VRAM delta: 19 MB
  Output: A compiler translates an entire program into machine code at once, creating a standalone executable file that can be run directly.  An interpreter, on the other hand, reads and executes the code line by line, without first creating a separate file.  Essentially, a compiler performs a one-time conversion, while an interpreter performs a continuous translation and execution process.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 68  Time: 4.70s  (14.5 tok/s)  VRAM delta: 4 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.
  Building Lloyd-Max codebook (d=256, bits=2)... done.

--- TurboQuant 2-bit ---
  Tokens: 48  Time: 3.75s  (12.8 tok/s)  VRAM delta: 15 MB
  Output: A compiler translates an entire program into machine code, creating a separate executable file. An interpreter, on the other hand, translates and executes code line by line. Essentially, a compiler translates everything at once, while an interpreter executes sequentially.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 68  Time: 4.19s  (16.2 tok/s)  VRAM delta: 7 MB
  Output: A compiler translates an entire program into machine code all at once, creating a standalone executable file that can be run directly by the computer.  In contrast, an interpreter reads and executes the program line by line, without first creating a separate executable.  Therefore, compilers offer faster execution speeds, while interpreters provide more flexibility and easier debugging.

======================================================================
Prompt: Write a short Python function that checks if a string is a palindrome.
======================================================================

--- fp16 baseline ---
  Tokens: 200  Time: 11.10s  (18.0 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 4-bit ---
  Tokens: 200  Time: 13.94s  (14.3 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama")) # Output: True
print(is_palindrome("he
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 201  Time: 12.02s  (16.7 tok/s)  VRAM delta: 41 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

--- TurboQuant 2-bit ---
  Tokens: 86  Time: 6.13s  (14.0 tok/s)  VRAM delta: 21 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same forwards and backward).
  Ignores case and non-alphanumeric characters.
  """
  processed_string = ''.join(char.lower() for char in text if char.isalnum(char)

  return processed_string == processed_string
```

  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 201  Time: 11.54s  (17.4 tok/s)  VRAM delta: 25 MB
  Output: ```python
def is_palindrome(text):
  """
  Checks if a string is a palindrome (reads the same backward as forward).

  Args:
    text: The string to check.

  Returns:
    True if the string is a palindrome, False otherwise.
  """
  processed_text = ''.join(filter(str.isalnum, text)).lower()
  return processed_text == processed_text&#91;::-1]

# Example usage:
print(is_palindrome("racecar"))  # Output: True
print(is_palindrome("A man, a plan, a canal: Panama"))  # Output: True
print(is_palindrome("h

======================================================================
Prompt: What are the main causes of the French Revolution? Be concise.
======================================================================

--- fp16 baseline ---
  Tokens: 156  Time: 8.80s  (17.7 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 4-bit ---
  Tokens: 177  Time: 12.47s  (14.2 tok/s)  VRAM delta: 37 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Hardship:** Widespread poverty, famine, and high taxes, exacerbated by royal extravagance and costly wars.
*   **Enlightenment Ideas:** Philosophers like Locke and Rousseau promoted concepts of liberty, equality, and popular sovereignty, challengi
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers

--- TurboQuant 4-bit FUSED ---
  Tokens: 156  Time: 9.68s  (16.1 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

--- TurboQuant 2-bit ---
  Tokens: 153  Time: 10.92s  (14.0 tok/s)  VRAM delta: 33 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** Rigid social hierarchy (Three Estates) with vast inequality and privileges for the wealthy.
*   **Economic Crisis:**  Heavy debt from wars, poor harvests, and inflation.
*   **Enlightenment Ideas:**  New ideas about liberty, equality, and popular sovereignty challenged the monarchy.
*   **Weak Leadership:** King Louis XVI was seen as indecisive and out of touch.
*   **Financial Crisis:**  Go
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers

--- TurboQuant 2-bit FUSED ---
  Tokens: 156  Time: 9.19s  (17.0 tok/s)  VRAM delta: 8 MB
  Output: Okay, here’s a concise breakdown of the main causes of the French Revolution:

*   **Social Inequality:** A rigid class system (Estates) with vast privileges for the nobility and clergy, and heavy burdens on the Third Estate (commoners).
*   **Economic Crisis:** Massive debt from wars, extravagant royal spending, and poor harvests led to widespread poverty and famine.
*   **Enlightenment Ideas:** Philosophers promoted concepts of liberty, equality, and popular sovereignty, challenging the legiti

======================================================================
Needle-in-a-haystack (~16384 tokens)
======================================================================
  fp16 baseline              &#91;FOUND]  2.5s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 4-bit           &#91;FOUND]  2.8s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  Building Lloyd-Max codebook (d=256, bits=4)... done.
  Installed fused TurboQuant (4-bit) on 27 attention layers
  TurboQuant 4-bit FUSED     &#91;FOUND]  3.4s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.
  TurboQuant 2-bit           &#91;FOUND]  2.8s
    Answer: The secret password for project Orion is ‘blue-giraffe-42’.
  Building Lloyd-Max codebook (d=256, bits=2)... done.
  Installed fused TurboQuant (2-bit) on 27 attention layers
  TurboQuant 2-bit FUSED     &#91;FOUND]  3.0s
    Answer: The secret password for project Orion is 'blue-giraffe-42'.

PS C:\projects\tq></code></pre>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/turboquant/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Clickbait Titles Exploit Attention Through Latent Entities</title>
		<link>https://dejan.ai/blog/latent-entities/</link>
					<comments>https://dejan.ai/blog/latent-entities/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sun, 22 Mar 2026 12:20:49 +0000</pubDate>
				<category><![CDATA[Content]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2333</guid>

					<description><![CDATA[Every clickbait title works the same way: it removes exactly one critical variable (the subject, the reason, the process, or the outcome) and charges you a click to fill the blank. This missing variable, which we call a latent entity, is so pervasive it has become normalized and nobody questions it anymore. You should! That [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h4 class="wp-block-heading">Every clickbait title works the same way: it removes exactly one critical variable: the subject, the reason, the process, or the outcome, and charges you a click to fill the blank. This missing variable, which we call a <strong>latent entity</strong>, is so pervasive it has become normalized and nobody questions it anymore. You should!</h4>



<p>That opening was the direct answer to this title&#8217;s attention hook: the latent variable behind the &#8220;how&#8221;.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://dejan.ai/wp-content/uploads/2026/03/image-13-1024x480.png" alt="" class="wp-image-2334" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-13-1024x480.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-13-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-13-768x360.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-13-1536x720.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-13.png 1908w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Every day, hundreds of millions of people scan headlines in feeds, aggregators, and search results. Most of these titles are not designed to inform. They are designed to withhold. Somewhere in the sentence, a critical piece of information has been surgically removed — the tool isn&#8217;t named, the result isn&#8217;t revealed, the reason isn&#8217;t given. The reader is left with an incomplete thought and a link. The click is the cost of completing it.</p>



<p>This mechanism is so pervasive that it has become invisible, like background noise. But it has a structure. And once you see the structure, you can&#8217;t unsee it.</p>



<h2 class="wp-block-heading">The attention transaction</h2>



<p>A title is a transaction. The author offers a premise. The reader pays with a click. The currency is attention, and the receipt is the missing piece of information the title promised but refused to deliver upfront.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://dejan.ai/wp-content/uploads/2026/03/image-14-1024x480.png" alt="" class="wp-image-2337" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-14-1024x480.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-14-300x141.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-14-768x360.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-14-1536x720.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-14.png 1903w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This is not metaphorical. The economics are literal. Every click generates a pageview. Every pageview generates ad impressions. Every ad impression generates revenue. The entire model is optimized not for informing the reader but for maximizing the probability that they click. The most reliable way to do that is to make the title incomplete — to create an information gap that can only be closed on the other side of the link.</p>



<p>The reader isn&#8217;t choosing to engage with content. They&#8217;re being charged an attention tax to access information that the title already had room to provide.</p>



<h2 class="wp-block-heading">Naming the structure: latent entities</h2>



<p>We can formalize what clickbait hides. In every withholding title, there is a <strong>latent entity</strong> — a variable the reader cannot resolve without clicking through. The title is the observed data. The latent entity is the unobserved variable. The click is the inference cost.</p>



<p>There are four types, and they are exhaustive.</p>



<h3 class="wp-block-heading">Latent Subject — <em>What?</em></h3>



<p>The title revolves around a specific thing — a tool, a setting, a feature, a list of items — but deliberately masks its identity behind a vague pronoun or a deferred noun.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;This one browser extension changed how I use the internet forever.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">What extension? You don&#8217;t know. That&#8217;s the transaction. The word &#8220;this&#8221; is doing the work of pointing at something while revealing nothing. The subject is latent.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;5 tools every developer needs in their workflow.&#8221;</strong></p>



<p class="wp-container-content-0733e5d0">Which five? The number creates the shape of an answer without filling it in. Five slots, all empty.</p>
</div>



<h3 class="wp-block-heading">Latent Reason — <em>Why?</em></h3>



<p>The title states a strong opinion, a regret, or an observation, but withholds the logic behind it. The reader is given a conclusion without its supporting argument.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I finally understand why Linux users swear by simple tools.&#8221;</strong></p>



<p class="wp-container-content-0733e5d0">The author has arrived at understanding. The reader has not. The reason is the hidden variable, and the only way to access it is to click.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;Package managers are the main reason I&#8217;ll never switch back to Windows.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">A bold claim with the mechanism removed. Why? What about package managers? The reason is latent.</p>
</div>



<h3 class="wp-block-heading">Latent Process — <em>How?</em></h3>



<p>The title presents an intriguing input and a desirable or unexpected output, but hides the method that connects them. The reader sees a before and an after with a gap in between.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I turned my old phone into a universal remote for my entire smart home.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">How? What app, what protocol, what steps? The transformation is stated as fact but the process is absent. The reader must click to learn the method.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;How a power drill defeated the Xbox 360&#8217;s console security.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The pairing of a crude physical tool with a sophisticated digital system is inherently surprising. The process that links them is the entire story, and it&#8217;s completely hidden.</p>
</div>



<h3 class="wp-block-heading">Latent Outcome — <em>What happened?</em></h3>



<p>The title sets up a scenario or experiment but cuts off before the resolution. The reader is dropped into a narrative with no ending.</p>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I replaced all my productivity tools with a single app for a month.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">And? What happened? Did it work? Was it a disaster? The outcome is the only thing the reader wants, and it&#8217;s the only thing the title refuses to provide.</p>
</div>



<div class="wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex">
<p class="wp-container-content-0733e5d0"><strong>&#8220;I ran local LLMs on a dying GPU and the results surprised me.&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The word &#8220;surprised&#8221; is doing double duty — it confirms that an outcome exists and that it&#8217;s noteworthy, while revealing absolutely nothing about what it is. It is a content-free adjective masquerading as information.</p>
</div>



<p>Every clickbait title withholds at least one latent entity. Some withhold two — a title that hides both the process and the outcome forces the reader to pay twice for a single click. But the taxonomy is closed. Anything a title can hide maps to one of these four types: the subject (what?), the reason (why?), the process (how?), or the outcome (what happened?).</p>



<p>This isn&#8217;t a style guide or an editorial preference. It&#8217;s a structural property of how information is withheld to generate clicks.</p>



<h2 class="wp-block-heading">What happens after the click</h2>



<p>The damage doesn&#8217;t end with the transaction. Something happens cognitively when a reader lands on a page after a withholding title, and it isn&#8217;t engagement. It&#8217;s <a href="https://dejanmarketing.com/web-content/">scanning</a>.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="957" height="827" src="https://dejan.ai/wp-content/uploads/2026/03/image-15.png" alt="" class="wp-image-2338" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-15.png 957w, https://dejan.ai/wp-content/uploads/2026/03/image-15-300x259.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-15-768x664.png 768w" sizes="auto, (max-width: 957px) 100vw, 957px" /></figure>



<p>The reader arrives primed. They have a specific latent entity in mind — the hidden variable that brought them there — and their first instinct is to find it as fast as possible. They don&#8217;t read the introduction. They don&#8217;t absorb the context. They <a href="https://dejan.ai/blog/most-people-dont-read/">skip, skim, and scroll</a>, hunting for the one piece of information the title owed them.</p>



<p>This produces a jarring experience. The article, <a href="https://dejan.ai/blog/how-long-are-web-pages/">padded with backstory</a>, affiliate links, newsletter prompts, and SEO-optimized filler, is structured to delay the answer. The reader, already carrying the cognitive load of an unresolved question, is forced to work through friction that exists solely to generate more pageviews and ad impressions. The content&#8217;s structure and the reader&#8217;s intent are fundamentally misaligned.</p>



<p>The result is not engagement. It is extraction. The reader extracts the latent entity and leaves. The publisher extracts a pageview and an ad impression. Neither party has been well served. The reader resents the experience. The publisher has earned a visit but not trust.</p>



<h2 class="wp-block-heading">The ad-click economy made this rational</h2>



<p>None of this happened by accident. Withholding titles are the evolutionary product of an economy that rewards clicks over comprehension. When revenue is proportional to pageviews, every title becomes an optimization problem: maximize the probability of a click while minimizing the information given away for free.</p>



<p>Over two decades, this optimization produced the patterns we now see everywhere. Vague pronouns replaced specific nouns. Outcomes were teased but never stated. Reasons were promised but deferred. The entire craft of headline writing was reoriented from summarizing content to withholding it.</p>



<p>This was rational in a world where the title and the article were inseparable — where the only way to access the content was to visit the page. But that world is ending.</p>



<h2 class="wp-block-heading">AI changes the equation</h2>



<p>Large language models are rapidly becoming the <a href="https://dejan.ai/blog/llm-is-a-presentation-layer-in-ai-search/">intermediary layer between humans and content</a>. When a user asks an <a href="https://dejan.ai/blog/how-do-people-use-ai-assistants/">AI assistant a question</a>, the AI <a href="https://dejan.ai/blog/how-big-are-googles-grounding-chunks/">retrieves</a>, reads, and synthesizes sources on the user&#8217;s behalf. The human never visits the page. The click never happens. The latent entity is <a href="https://dejan.ai/blog/sro-grounding-snippets/">resolved by the model</a>, not by the reader.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="484" src="https://dejan.ai/wp-content/uploads/2026/03/image-16-1024x484.png" alt="" class="wp-image-2339" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-16-1024x484.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-16-300x142.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-16-768x363.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-16-1536x726.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/image-16.png 1883w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In this new architecture, withholding titles are not just exploitative. They are pointless and perhaps even <a href="https://dejan.ai/blog/sr/">harmful to visibility</a>. The AI doesn&#8217;t care about the information gap. It reads the article, extracts the answer, and delivers it without friction. The entire mechanism of clickbait — creating an artificial need that can only be resolved with a visit — collapses when the visitor is a machine that doesn&#8217;t see ads.</p>



<p>More importantly, AI systems can now decompose titles structurally, identify which latent entity is being withheld, check whether the article delivers on the title&#8217;s promise, and surface the answer directly. The asymmetry of information that clickbait depends on is being dissolved.</p>
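<p>That decomposition is mechanical enough to sketch. Below is a toy heuristic classifier for the four types &#8212; the patterns are illustrative guesses, not a production detector, and a real system would use an LLM rather than regexes:</p>



<pre class="wp-block-code"><code># Toy classifier for the four latent entity types. Patterns are
# illustrative guesses only; a real system would use an LLM.
import re

PATTERNS = {
    "latent_subject": r"\b(this one|this \w+ (extension|tool|app)|\d+ (tools|things|apps|ways))\b",
    "latent_reason": r"\b(why \w+|the (main )?reason|i finally understand)\b",
    "latent_process": r"\b(how (a|an|i|to)\b|i turned|i built)\b",
    "latent_outcome": r"\b(surprised me|what happened|the results?|for a (week|month|year))\b",
}

def latent_entities(title: str) -&gt; list[str]:
    """Return every latent entity type a title appears to withhold."""
    t = title.lower()
    return [name for name, pat in PATTERNS.items() if re.search(pat, t)]

print(latent_entities("I ran local LLMs on a dying GPU and the results surprised me"))
# -&gt; ['latent_outcome']</code></pre>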



<h2 class="wp-block-heading">A healthier paradigm</h2>



<p>If withholding titles evolved to serve the ad-click economy, then the question is: what should titles look like when that economy is no longer the only game?</p>



<p>The answer is straightforward. Titles should include the key information — the subject named, the reason stated, the outcome revealed — and invite the reader to explore further for depth, context, and nuance. The title earns the click by demonstrating value, not by ransoming it.</p>



<p>Consider the difference:</p>



<div class="wp-block-group has-background is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex" style="background-color:#eeded9">
<p class="wp-container-content-0733e5d0"><strong>&#8220;This one Docker tool finally fixed my reverse proxy headache&#8221;</strong> </p>



<p class="wp-container-content-0733e5d0">The subject is latent. <br>The reader must click to learn which tool.</p>
</div>



<div class="wp-block-group has-background is-nowrap is-layout-flex wp-container-core-group-is-layout-6c531013 wp-block-group-is-layout-flex" style="background-color:#9bbf843b">
<p class="wp-container-content-0733e5d0"><strong>&#8220;Nginx Proxy Manager eliminated my reverse proxy headache — here&#8217;s my setup&#8221; </strong></p>



<p class="wp-container-content-0733e5d0">The subject is revealed. <br>The reader clicks to learn the details, not to discover what the tool is.</p>
</div>



<p>Both titles can generate traffic. But the second one respects the reader. It says: here is what I&#8217;m talking about, and if you want to know more, the article is worth your time. The first one says: I have something you want, and I won&#8217;t tell you what it is unless you pay me with your attention.</p>



<p>The second model is healthier for everyone. Readers arrive with aligned expectations instead of frustrated scanning instincts. Authors build trust instead of mining clicks. And the content itself can be structured around depth rather than around delaying the reveal. </p>



<h2 class="wp-block-heading">The web we could have</h2>



<p>Web authors have a choice. They can continue optimizing for an economy that is being disintermediated by AI, writing titles that withhold and articles that delay, hoping the click-and-ad model survives long enough to sustain them. Or they can recognize that the readers who remain — the ones who choose to visit a page when they <a href="https://dejan.ai/blog/human-friendly-content-is-ai-friendly-content/">could have asked an AI</a> — are the ones who deserve the most respect.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="588" src="https://dejan.ai/wp-content/uploads/2026/03/image-17-1024x588.png" alt="" class="wp-image-2340" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-17-1024x588.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-17-300x172.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-17-768x441.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-17.png 1267w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Those readers are not clicking because they were tricked. They&#8217;re clicking because they were informed. They know what the article is about. They want to go deeper. They trust the author enough to spend their time. And the <a href="https://dejan.ai/blog/caps/">money part</a> can be fixed too.</p>



<p>That is the audience worth building for. And it starts with killing the hidden variable.</p>



<pre class="wp-block-code alignwide has-contrast-background-color has-text-color has-background has-link-color has-small-font-size wp-elements-11272707f64f2d4a29f681518b294984" style="color:#65b831"><code><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">{</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"title"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "Clickbait Titles Exploit Attention Through Latent Entities"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"metadata"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: {</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"dimensions"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: &#91;</mark>
      "Clickbait titles exploit attention"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
      "Through latent entities"
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">],</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"attention_anchor"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "how"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"quantized"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "clickbait exploits attention by hiding one of four variable types"
<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">  },</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"how"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: &#91;</mark>
    "Every clickbait title withholds exactly one latent entity: subject (what?), reason (why?), process (how?), or outcome (what happened?)"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    "The click is the inference cost the reader pays to resolve the hidden variable"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    "AI dissolves this by reading the article and extracting the answer without the click"
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">],</mark>
  <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"promise_check"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: {</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"exploit attention"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "delivered — transactional mechanism explained with economic chain"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"through latent entities"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">: </mark>"delivered — four-type taxonomy defined with examples"<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">,</mark>
    <mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-accent-3-color">"title practices what it preaches"</mark><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">:</mark> "delivered — subject revealed, mechanism stated, no hidden variable"
<mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-base-color">  }
}</mark></code></pre>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/latent-entities/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Fanout Query Analysis</title>
		<link>https://dejan.ai/blog/fanout-query-analysis/</link>
					<comments>https://dejan.ai/blog/fanout-query-analysis/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Fri, 20 Mar 2026 01:58:01 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Keyword Research]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2314</guid>

					<description><![CDATA[When AI models like Gemini, GPT or Nova answer a question using web search, they don&#8217;t just run your query as-is. They generate their own internal search queries, or fanout queries. A single user prompt can trigger multiple fanout queries as the model breaks down the question, explores subtopics and verifies information. We captured 365,920 [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>When AI models like Gemini, GPT or Nova answer a question using web search, they don&#8217;t just run your query as-is. They generate their own internal search queries, or fanout queries. A single user prompt can trigger multiple fanout queries as the model breaks down the question, explores subtopics and verifies information.</p>



<p>We captured 365,920 of these fanout queries across three providers, Google (Gemini), OpenAI (GPT) and Amazon (Nova), by logging the grounding metadata returned from their APIs during citation mining runs. This data comes from real production workloads across multiple projects, not synthetic benchmarks.</p>
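<p>For reference, this is roughly what the capture looks like on the Google side with the <code>google-genai</code> SDK &#8212; a minimal sketch, with a placeholder model name and prompt:</p>



<pre class="wp-block-code"><code># Minimal sketch: log Gemini fanout queries from grounding metadata.
# Assumes GEMINI_API_KEY is set in the environment.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What are the best budget mechanical keyboards?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

meta = response.candidates[0].grounding_metadata
if meta and meta.web_search_queries:
    for query in meta.web_search_queries:
        print(query)  # each line is one fanout query</code></pre>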



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="731" src="https://dejan.ai/wp-content/uploads/2026/03/image-2-1024x731.png" alt="" class="wp-image-2315" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-2-1024x731.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-2-300x214.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-2-768x549.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-2.png 1400w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Below is an analysis of how these providers differ in the queries they generate.</p>



<figure class="wp-block-table has-small-font-size"><table><thead><tr><th class="has-text-align-center" data-align="center">Provider</th><th class="has-text-align-center" data-align="center">Count</th><th class="has-text-align-center" data-align="center">Avg Chars</th><th class="has-text-align-center" data-align="center">Min</th><th class="has-text-align-center" data-align="center">Max</th><th class="has-text-align-center" data-align="center">1-3 words</th><th class="has-text-align-center" data-align="center">4-6 words</th><th class="has-text-align-center" data-align="center">7+ words</th></tr></thead><tbody><tr><td class="has-text-align-center" data-align="center"><strong>Google</strong></td><td class="has-text-align-center" data-align="center">158,186</td><td class="has-text-align-center" data-align="center">52</td><td class="has-text-align-center" data-align="center">0</td><td class="has-text-align-center" data-align="center">252</td><td class="has-text-align-center" data-align="center">4.5%</td><td class="has-text-align-center" data-align="center">30.6%</td><td class="has-text-align-center" data-align="center">64.9%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>OpenAI</strong></td><td class="has-text-align-center" data-align="center">207,174</td><td class="has-text-align-center" data-align="center">60</td><td class="has-text-align-center" data-align="center">6</td><td class="has-text-align-center" data-align="center">323</td><td class="has-text-align-center" data-align="center">3.4%</td><td class="has-text-align-center" data-align="center">20.8%</td><td class="has-text-align-center" data-align="center">75.8%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>Amazon</strong></td><td class="has-text-align-center" data-align="center">560</td><td class="has-text-align-center" data-align="center">59</td><td class="has-text-align-center" data-align="center">28</td><td class="has-text-align-center" data-align="center">198</td><td class="has-text-align-center" data-align="center">0.2%</td><td class="has-text-align-center" data-align="center">16.2%</td><td class="has-text-align-center" data-align="center">83.6%</td></tr><tr><td class="has-text-align-center" data-align="center"><strong>Total</strong></td><td class="has-text-align-center" data-align="center">~365,920</td><td class="has-text-align-center" data-align="center">56</td><td class="has-text-align-center" data-align="center">0</td><td class="has-text-align-center" data-align="center">323</td><td class="has-text-align-center" data-align="center">3.9%</td><td class="has-text-align-center" data-align="center">25.0%</td><td class="has-text-align-center" data-align="center">71.1%</td></tr></tbody></table></figure>



<p><strong>Google (n=158,186)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>1</td><td class="has-text-align-right" data-align="right">53</td><td class="has-text-align-right" data-align="right">0.0%</td><td class="has-text-align-right" data-align="right">0.0%</td></tr><tr><td>2</td><td class="has-text-align-right" data-align="right">1,092</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.7%</td></tr><tr><td>3</td><td class="has-text-align-right" data-align="right">5,994</td><td class="has-text-align-right" data-align="right">3.8%</td><td class="has-text-align-right" data-align="right">4.5%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">14,916</td><td class="has-text-align-right" data-align="right">9.4%</td><td class="has-text-align-right" data-align="right">13.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">17,471</td><td class="has-text-align-right" data-align="right">11.0%</td><td class="has-text-align-right" data-align="right">25.0%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">15,923</td><td class="has-text-align-right" data-align="right">10.1%</td><td class="has-text-align-right" data-align="right">35.1%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">18,080</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">46.5%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">20,325</td><td class="has-text-align-right" data-align="right">12.8%</td><td class="has-text-align-right" data-align="right">59.3%</td></tr><tr><td>9</td><td class="has-text-align-right" data-align="right">20,013</td><td class="has-text-align-right" data-align="right">12.7%</td><td class="has-text-align-right" data-align="right">72.0%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">16,968</td><td class="has-text-align-right" data-align="right">10.7%</td><td class="has-text-align-right" data-align="right">82.7%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">11,740</td><td class="has-text-align-right" data-align="right">7.4%</td><td class="has-text-align-right" data-align="right">90.1%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">7,316</td><td class="has-text-align-right" data-align="right">4.6%</td><td class="has-text-align-right" data-align="right">94.8%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">4,043</td><td class="has-text-align-right" data-align="right">2.6%</td><td class="has-text-align-right" data-align="right">97.3%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">2,124</td><td class="has-text-align-right" data-align="right">1.3%</td><td class="has-text-align-right" data-align="right">98.7%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">1,146</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<p><strong>OpenAI (n=207,174)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>1</td><td class="has-text-align-right" data-align="right">616</td><td class="has-text-align-right" data-align="right">0.3%</td><td class="has-text-align-right" data-align="right">0.3%</td></tr><tr><td>2</td><td class="has-text-align-right" data-align="right">3,715</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">2.1%</td></tr><tr><td>3</td><td class="has-text-align-right" data-align="right">2,691</td><td class="has-text-align-right" data-align="right">1.3%</td><td class="has-text-align-right" data-align="right">3.4%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">7,360</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">6.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">14,516</td><td class="has-text-align-right" data-align="right">7.0%</td><td class="has-text-align-right" data-align="right">13.9%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">21,221</td><td class="has-text-align-right" data-align="right">10.2%</td><td class="has-text-align-right" data-align="right">24.2%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">26,544</td><td class="has-text-align-right" data-align="right">12.8%</td><td class="has-text-align-right" data-align="right">37.0%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">28,912</td><td class="has-text-align-right" data-align="right">14.0%</td><td class="has-text-align-right" data-align="right">51.0%</td></tr><tr><td>9</td><td class="has-text-align-right" data-align="right">27,861</td><td class="has-text-align-right" data-align="right">13.4%</td><td class="has-text-align-right" data-align="right">64.4%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">23,354</td><td class="has-text-align-right" data-align="right">11.3%</td><td class="has-text-align-right" data-align="right">75.7%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">17,875</td><td class="has-text-align-right" data-align="right">8.6%</td><td class="has-text-align-right" data-align="right">84.3%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">12,339</td><td class="has-text-align-right" data-align="right">6.0%</td><td class="has-text-align-right" data-align="right">90.3%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">7,983</td><td class="has-text-align-right" data-align="right">3.9%</td><td class="has-text-align-right" data-align="right">94.1%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">4,959</td><td class="has-text-align-right" data-align="right">2.4%</td><td class="has-text-align-right" data-align="right">96.5%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">5,228</td><td class="has-text-align-right" data-align="right">2.5%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<p><strong>Amazon (n=560)</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Words</th><th class="has-text-align-right" data-align="right">Count</th><th class="has-text-align-right" data-align="right">%</th><th class="has-text-align-right" data-align="right">Cumul%</th></tr></thead><tbody><tr><td>3</td><td class="has-text-align-right" data-align="right">1</td><td class="has-text-align-right" data-align="right">0.2%</td><td class="has-text-align-right" data-align="right">0.2%</td></tr><tr><td>4</td><td class="has-text-align-right" data-align="right">4</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.9%</td></tr><tr><td>5</td><td class="has-text-align-right" data-align="right">23</td><td class="has-text-align-right" data-align="right">4.1%</td><td class="has-text-align-right" data-align="right">5.0%</td></tr><tr><td>6</td><td class="has-text-align-right" data-align="right">64</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">16.4%</td></tr><tr><td>7</td><td class="has-text-align-right" data-align="right">102</td><td class="has-text-align-right" data-align="right">18.2%</td><td class="has-text-align-right" data-align="right">34.6%</td></tr><tr><td>8</td><td class="has-text-align-right" data-align="right">110</td><td class="has-text-align-right" data-align="right">19.6%</td><td class="has-text-align-right" data-align="right">54.3%</td></tr><tr><td>9</td><td class="has-text-align-right" data-align="right">113</td><td class="has-text-align-right" data-align="right">20.2%</td><td class="has-text-align-right" data-align="right">74.5%</td></tr><tr><td>10</td><td class="has-text-align-right" data-align="right">64</td><td class="has-text-align-right" data-align="right">11.4%</td><td class="has-text-align-right" data-align="right">85.9%</td></tr><tr><td>11</td><td class="has-text-align-right" data-align="right">35</td><td class="has-text-align-right" data-align="right">6.2%</td><td class="has-text-align-right" data-align="right">92.1%</td></tr><tr><td>12</td><td class="has-text-align-right" data-align="right">20</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">95.7%</td></tr><tr><td>13</td><td class="has-text-align-right" data-align="right">9</td><td class="has-text-align-right" data-align="right">1.6%</td><td class="has-text-align-right" data-align="right">97.3%</td></tr><tr><td>14</td><td class="has-text-align-right" data-align="right">5</td><td class="has-text-align-right" data-align="right">0.9%</td><td class="has-text-align-right" data-align="right">98.2%</td></tr><tr><td>15+</td><td class="has-text-align-right" data-align="right">10</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">100.0%</td></tr></tbody></table></figure>



<h2 class="wp-block-heading"><strong>POS Distribution by Provider</strong></h2>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="439" src="https://dejan.ai/wp-content/uploads/2026/03/image-4-1024x439.png" alt="" class="wp-image-2321" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-4-1024x439.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/image-4-300x129.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-4-768x329.png 768w, https://dejan.ai/wp-content/uploads/2026/03/image-4.png 1400w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="788" height="679" src="https://dejan.ai/wp-content/uploads/2026/03/image-3.png" alt="" class="wp-image-2320" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-3.png 788w, https://dejan.ai/wp-content/uploads/2026/03/image-3-300x259.png 300w, https://dejan.ai/wp-content/uploads/2026/03/image-3-768x662.png 768w" sizes="auto, (max-width: 788px) 100vw, 788px" /></figure>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Group</th><th class="has-text-align-right" data-align="right">Google</th><th class="has-text-align-right" data-align="right">OpenAI</th><th class="has-text-align-right" data-align="right">Amazon</th></tr></thead><tbody><tr><td>Nouns</td><td class="has-text-align-right" data-align="right">52.3%</td><td class="has-text-align-right" data-align="right">58.4%</td><td class="has-text-align-right" data-align="right">50.2%</td></tr><tr><td>Verbs</td><td class="has-text-align-right" data-align="right">11.3%</td><td class="has-text-align-right" data-align="right">9.9%</td><td class="has-text-align-right" data-align="right">8.5%</td></tr><tr><td>Adjectives</td><td class="has-text-align-right" data-align="right">11.0%</td><td class="has-text-align-right" data-align="right">8.9%</td><td class="has-text-align-right" data-align="right">18.6%</td></tr><tr><td>Prepositions</td><td class="has-text-align-right" data-align="right">7.4%</td><td class="has-text-align-right" data-align="right">3.5%</td><td class="has-text-align-right" data-align="right">10.3%</td></tr><tr><td>Wh-words</td><td class="has-text-align-right" data-align="right">3.6%</td><td class="has-text-align-right" data-align="right">2.1%</td><td class="has-text-align-right" data-align="right">1.5%</td></tr><tr><td>Numbers</td><td class="has-text-align-right" data-align="right">2.2%</td><td class="has-text-align-right" data-align="right">5.3%</td><td class="has-text-align-right" data-align="right">2.8%</td></tr><tr><td>Determiners</td><td class="has-text-align-right" data-align="right">2.6%</td><td class="has-text-align-right" data-align="right">1.8%</td><td class="has-text-align-right" data-align="right">0.1%</td></tr><tr><td>Conjunctions</td><td class="has-text-align-right" data-align="right">1.6%</td><td class="has-text-align-right" data-align="right">0.6%</td><td class="has-text-align-right" data-align="right">2.4%</td></tr><tr><td>Adverbs</td><td class="has-text-align-right" data-align="right">0.6%</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">2.3%</td></tr><tr><td>Modals</td><td class="has-text-align-right" data-align="right">0.7%</td><td class="has-text-align-right" data-align="right">0.5%</td><td class="has-text-align-right" data-align="right">0.0%</td></tr><tr><td>Pronouns</td><td class="has-text-align-right" data-align="right">1.2%</td><td class="has-text-align-right" data-align="right">0.9%</td><td class="has-text-align-right" data-align="right">0.1%</td></tr></tbody></table></figure>



<ul class="wp-block-list">
<li><strong>OpenAI is the most noun-heavy</strong> (58.4%), especially proper nouns (18.9% vs Google&#8217;s 8.6%) — it generates more entity-specific queries</li>



<li><strong>Amazon leans heavily into adjectives</strong> (18.6% vs ~10% for others) — more descriptive, qualifier-rich queries like &#8220;best,&#8221; &#8220;top,&#8221; &#8220;most effective&#8221;</li>



<li><strong>Google uses more wh-words and verbs</strong> — generates more question-style queries (&#8220;what,&#8221; &#8220;how,&#8221; &#8220;which&#8221;)</li>



<li><strong>OpenAI uses more than twice as many numbers</strong> (5.3% vs Google&#8217;s 2.2%) — likely year references and quantities in queries</li>
</ul>
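

<p>For reproducibility, the coarse grouping above can be approximated with off-the-shelf POS tagging. A sketch with NLTK &#8212; assumed tooling, since the post doesn&#8217;t name the tagger used:</p>



<pre class="wp-block-code"><code># Sketch: coarse POS grouping of queries with NLTK (assumed tooling).
from collections import Counter

import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # "_eng" suffix on newer NLTK

# Map Penn Treebank tags onto the coarse groups used in the table.
GROUPS = {
    "NN": "Nouns", "NNS": "Nouns", "NNP": "Nouns", "NNPS": "Nouns",
    "VB": "Verbs", "VBD": "Verbs", "VBG": "Verbs", "VBN": "Verbs",
    "VBP": "Verbs", "VBZ": "Verbs",
    "JJ": "Adjectives", "JJR": "Adjectives", "JJS": "Adjectives",
    "IN": "Prepositions",
    "WDT": "Wh-words", "WP": "Wh-words", "WP$": "Wh-words", "WRB": "Wh-words",
    "CD": "Numbers", "DT": "Determiners", "CC": "Conjunctions",
    "RB": "Adverbs", "RBR": "Adverbs", "RBS": "Adverbs",
    "MD": "Modals", "PRP": "Pronouns", "PRP$": "Pronouns",
}

def pos_distribution(queries):
    counts, total = Counter(), 0
    for q in queries:
        for _, tag in nltk.pos_tag(q.split()):  # whitespace tokens are fine for queries
            group = GROUPS.get(tag)
            if group:
                counts[group] += 1
            total += 1
    return {g: round(100 * c / total, 1) for g, c in counts.most_common()}

print(pos_distribution(["best budget mechanical keyboards 2026"]))</code></pre>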
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/fanout-query-analysis/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Reverse Prompting: Reconstructing Prompts from AI-Generated Text</title>
		<link>https://dejan.ai/blog/reverse-prompting/</link>
					<comments>https://dejan.ai/blog/reverse-prompting/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Wed, 18 Mar 2026 06:51:29 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2303</guid>

					<description><![CDATA[We fine-tuned Google&#8217;s Gemma 3 (270M) to reverse the typical LLM workflow: given an AI-generated response, the model reconstructs the most likely prompt that produced it. We generated 100,000 synthetic prompt-response pairs using Gemini 2.5 Flash, trained for a single epoch on a consumer GPU, and built a Streamlit app that sweeps 24 decoding configurations [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>We fine-tuned <a href="https://huggingface.co/google/gemma-3-270m">Google&#8217;s Gemma 3 (270M)</a> to reverse the typical LLM workflow: given an AI-generated response, the model reconstructs the most likely prompt that produced it. We generated 100,000 synthetic prompt-response pairs using Gemini 2.5 Flash, trained for a single epoch on a consumer GPU, and built a Streamlit app that sweeps 24 decoding configurations to produce ranked prompt candidates. The model demo runs on CPU and is <a href="https://dejan.ai/tools/reverse-prompter/">available here</a>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">The Idea</h2>



<p>Large language models take prompts and produce responses. We wanted to see if a small model could learn to do the opposite: take a response and work backwards to the prompt.</p>



<p>This isn&#8217;t about recovering the exact original prompt; the aim is to surface the most plausible prompts, ranked by model confidence. Think of it as asking: &#8220;What question would most naturally lead to this answer?&#8221;</p>



<h2 class="wp-block-heading">Training Data Generation</h2>



<p>The training pipeline has two stages, both powered by Gemini 2.5 Flash via Vertex AI.</p>



<p><strong>Stage 1: Prompt generation.</strong> We generated 100,000 diverse prompts across five categories designed to cover different user behaviours:</p>



<ul class="wp-block-list">
<li>Mid-tail, search query style (single or multi-faceted)</li>



<li>Long-tail, search query style (multi-faceted)</li>



<li>Simple, prompt-like (single-faceted)</li>



<li>Typical, prompt-like (single or multi-faceted)</li>



<li>Detailed, prompt-like (multi-faceted)</li>
</ul>



<p>Each API call generated a batch of 100 prompts as JSON with thinking disabled. We ran 100 concurrent calls, stored results in SQLite, and had the full dataset in minutes.</p>
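<p>A minimal sketch of that loop, assuming the <code>google-genai</code> SDK on Vertex AI &#8212; the batch prompt, project ID, and table schema here are illustrative, not the original pipeline:</p>



<pre class="wp-block-code"><code># Sketch of Stage 1: concurrent batch prompt generation into SQLite.
import asyncio
import json
import sqlite3

from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project", location="us-central1")

async def one_batch(category: str) -&gt; list[str]:
    r = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"Generate 100 diverse user prompts of this type: {category}. "
                 "Return a JSON array of strings.",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            thinking_config=types.ThinkingConfig(thinking_budget=0),  # thinking off
        ),
    )
    return json.loads(r.text)

async def main() -&gt; None:
    db = sqlite3.connect("prompts.db")
    db.execute("CREATE TABLE IF NOT EXISTS prompts (text TEXT)")
    batches = await asyncio.gather(*[one_batch("typical, prompt-like") for _ in range(100)])
    db.executemany(
        "INSERT INTO prompts VALUES (?)",
        [(p,) for batch in batches for p in batch],
    )
    db.commit()

asyncio.run(main())</code></pre>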



<p><strong>Stage 2: Response generation.</strong> Each of the 100,000 prompts was sent back to Gemini 2.5 Flash to produce a corresponding AI assistant response. Same concurrency, same speed. The result: 100,000 prompt-response pairs ready for training.</p>



<h2 class="wp-block-heading">Data Preparation</h2>



<p>The key design decision was how to format the training data. We needed the model to learn a clear boundary between the response (input) and the prompt (target). We settled on a simple separator:</p>



<pre class="wp-block-code has-large-font-size"><code>{response}\n###\n{prompt}&lt;eos&gt;</code></pre>



<p>During tokenization, we masked the loss over the response and separator tokens (setting labels to -100) so the model only learns to predict the prompt portion. This is critical: without masking, the model would waste capacity learning to reproduce the response text rather than focusing on the reverse mapping.</p>



<p>Sequences were capped at 2,048 tokens. Tokenization was batched in groups of 5,000 to manage memory, then concatenated into a single dataset.</p>
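<p>In code, the masking step looks roughly like this &#8212; a sketch with the Hugging Face tokenizer, with helper names that are ours, not the original training script (batching is omitted):</p>



<pre class="wp-block-code"><code># Sketch of the data prep: tokenize, then mask loss over response + separator.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")

SEP = "\n###\n"
MAX_LEN = 2048

def build_example(response: str, prompt: str) -&gt; dict:
    prefix_ids = tokenizer(response + SEP, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = target_ids + [tokenizer.eos_token_id]

    input_ids = (prefix_ids + target_ids)[:MAX_LEN]
    # -100 tells the loss function to ignore these positions, so the model
    # only learns to predict the prompt portion.
    labels = ([-100] * len(prefix_ids) + target_ids)[:MAX_LEN]
    return {"input_ids": input_ids, "labels": labels}</code></pre>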



<h2 class="wp-block-heading">Model Selection</h2>



<p>We chose Gemma 3 270M for several reasons:</p>



<ul class="wp-block-list">
<li><strong>Size.</strong>&nbsp;At 270M parameters, it&#8217;s small enough to train on a single consumer GPU and fast enough to run inference on CPU. This matters for a free demo.</li>



<li><strong>Architecture.</strong>&nbsp;Gemma 3 uses a mix of sliding window and full attention layers, giving it a good balance of local and global context within its 2,048 token training window.</li>



<li><strong>Capability.</strong>&nbsp;Despite its size, Gemma 3 270M has a 262K vocabulary and was pretrained on enough data to have reasonable language understanding out of the box.</li>
</ul>



<p>A larger model would almost certainly perform better, but the goal was a practical tool that could run anywhere, not a benchmark result.</p>



<h2 class="wp-block-heading">Training</h2>



<p>Training was straightforward. Full fine-tune, single epoch, on an NVIDIA RTX 4090.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Value</th></tr></thead><tbody><tr><td>Method</td><td>Full fine-tune</td></tr><tr><td>Precision</td><td>bfloat16</td></tr><tr><td>Batch size</td><td>2 (effective 16 with gradient accumulation)</td></tr><tr><td>Learning rate</td><td>5e-5</td></tr><tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr><tr><td>Warmup steps</td><td>100</td></tr><tr><td>Gradient checkpointing</td><td>Enabled</td></tr><tr><td>Training time</td><td>4 hours 14 minutes</td></tr></tbody></table></figure>



<p>One epoch was sufficient. The loss curve showed steady convergence without signs of underfitting, and we wanted to avoid overfitting on synthetic data where the model might memorise specific phrasing patterns rather than learning the general reverse mapping.</p>



<h2 class="wp-block-heading">Inference Strategy</h2>



<p>A single generation pass from the model produces one candidate prompt. To get a diverse set of candidates, we sweep across 24 contrastive search configurations by varying two parameters:</p>



<ul class="wp-block-list">
<li><code>top_k</code>: [2, 4, 6, 15]</li>



<li><code>penalty_alpha</code>: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]</li>
</ul>



<p>Contrastive search balances token probability with a degeneration penalty, which encourages diverse yet coherent outputs. Different configurations produce different candidate prompts from the same input.</p>
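<p>A sketch of the sweep, using the released model and the standard Hugging Face <code>generate()</code> contrastive search parameters (function names are illustrative):</p>



<pre class="wp-block-code"><code># Sketch: sweep 24 contrastive search configs to collect candidate prompts.
import itertools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dejanseo/reverse-prompter")
model = AutoModelForCausalLM.from_pretrained("dejanseo/reverse-prompter")

def candidate_prompts(response: str, max_new_tokens: int = 64) -&gt; list[str]:
    inputs = tokenizer(response + "\n###\n", return_tensors="pt")
    prefix_len = inputs["input_ids"].shape[1]
    out = []
    for top_k, alpha in itertools.product([2, 4, 6, 15],
                                          [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]):
        with torch.no_grad():
            ids = model.generate(
                **inputs,
                top_k=top_k,           # candidate pool size
                penalty_alpha=alpha,   # degeneration penalty
                max_new_tokens=max_new_tokens,
            )
        out.append(tokenizer.decode(ids[0, prefix_len:], skip_special_tokens=True))
    return out</code></pre>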



<p>Each candidate is then scored by perplexity: we run the full sequence (response + separator + generated prompt) through the model and compute the average token-level log probability over the prompt portion. Lower perplexity means the model finds that prompt more natural given the response.</p>
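<p>The scoring step, sketched with the same illustrative helper names:</p>



<pre class="wp-block-code"><code># Sketch: perplexity of the candidate prompt given the response.
import torch
import torch.nn.functional as F

def prompt_perplexity(model, tokenizer, response: str, prompt: str) -&gt; float:
    full = tokenizer(response + "\n###\n" + prompt, return_tensors="pt")
    prefix_len = tokenizer(response + "\n###\n", return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    ids = full["input_ids"][0]
    # Position i predicts token i+1, so shift by one before gathering.
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs[torch.arange(len(ids) - 1), ids[1:]]
    prompt_lp = token_lp[prefix_len - 1:]  # only the prompt portion
    return torch.exp(-prompt_lp.mean()).item()  # lower = more natural</code></pre>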



<p>The top 10 candidates are displayed with per-token confidence visualisation, where each word&#8217;s opacity reflects how confident the model was in predicting it.</p>



<h2 class="wp-block-heading">The Tool</h2>



<p>The Streamlit app has two modes.</p>



<p><strong>Paste mode</strong> is the primary interface. Paste any AI-generated text, click Reconstruct Prompts, and the model generates ranked candidates. The results include a prompt table with perplexity scores and per-token confidence bar charts, a key phrases panel that extracts the most important shared phrases across candidates, and a word frequency heatmap.</p>



<p><strong>URL mode</strong> is experimental. Enter a URL and the app scrapes the page content via the DataForSEO API, converts it to markdown, and runs it through the model. This isn&#8217;t the intended use case since the model was trained on AI assistant responses, not web pages. But it produces interesting results: the reconstructed &#8220;prompts&#8221; reveal what the model considers the core semantic intent of the page content. It&#8217;s less prompt reconstruction and more semantic summarisation through the lens of &#8220;what question would this page answer?&#8221;</p>



<h2 class="wp-block-heading">Possible Uses</h2>



<p><strong>Prompt engineering.</strong> Understanding what prompts lead to certain outputs helps refine prompt design. If you have an output you like, reverse prompting can suggest more efficient or precise ways to get there.</p>



<p><strong>Content analysis.</strong> Running web content through the model reveals what the model perceives as the core intent behind the text. This could be useful for understanding how AI models interpret and categorise content.</p>



<p><strong>AI content forensics.</strong> While this isn&#8217;t a detector (it doesn&#8217;t classify text as AI-generated or not), the confidence scores and perplexity values could serve as signals. Text that was genuinely produced by an AI assistant in response to a clear prompt may produce lower-perplexity reconstructions than text that wasn&#8217;t.</p>



<p><strong>Training data curation.</strong> When building datasets, reverse prompting can help verify that responses actually match their intended prompts, or surface cases where the mapping is ambiguous.</p>



<h2 class="wp-block-heading">Insights</h2>



<p>A few things we noticed during development:</p>



<p><strong>Synthetic data works.</strong> The model was trained entirely on Gemini-generated data and generalises to outputs from other models. The reverse mapping from response to prompt is more about structure and intent than model-specific quirks.</p>



<p><strong>Small models can learn non-trivial mappings.</strong> At 270M parameters, this model is tiny by current standards. Yet it reliably produces sensible prompt reconstructions. The task is well-constrained enough that a small model can handle it.</p>



<p><strong>Diversity in decoding matters more than model size.</strong> The contrastive search sweep across 24 configurations produces more useful results than a single greedy decode from a larger model would. The ranking by perplexity then surfaces the best candidates.</p>



<p><strong>The separator matters.</strong> We tested several formats. The simple <code>\n###\n</code> separator worked best, likely because it&#8217;s distinct enough that the model learns a clean boundary between input and output.</p>



<p>The model and code are available on <a href="https://huggingface.co/dejanseo/reverse-prompter" target="_blank" rel="noreferrer noopener">Hugging Face</a>, and a live demo runs at <a href="https://dejan.ai/tools/reverse-prompter/">https://dejan.ai/tools/reverse-prompter/</a>.</p>



<div class="wp-block-buttons alignfull is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button" href="https://dejan.ai/tools/reverse-prompter/">DEMO</a></div>
</div>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/reverse-prompting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Rufus &#8211; Under the Hood. What Drives Amazon’s AI Shopping Assistant?</title>
		<link>https://dejan.ai/blog/rufus/</link>
					<comments>https://dejan.ai/blog/rufus/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Sun, 15 Mar 2026 01:11:45 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[eCommerce]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2290</guid>

					<description><![CDATA[What’s Publicly Known About the Pipeline, Backend, and Response Anatomy. Rufus is not “one model that magically answers.” Public Amazon/AWS descriptions point to a multi-component system: Speculative schema: Pipeline: request → answer Step A — Input + context assembly Public descriptions indicate customers can: Amazon also describes using conversational context and (more recently) account memory [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading">What’s Publicly Known About the Pipeline, Backend, and Response Anatomy.</h3>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Rufus is not “one model that magically answers.” Public Amazon/AWS descriptions point to a <strong>multi-component system</strong>:</p>



<ul class="wp-block-list">
<li><strong>A query planning / classification layer</strong> (Amazon/AWS call out a “query planner (QP) model”)</li>



<li><strong>Retrieval</strong> across multiple Amazon-owned sources (catalog, reviews, community Q&amp;A, Stores APIs) and sometimes web sources</li>



<li><strong>A foundation LLM</strong> that generates the natural-language response</li>



<li><strong>A streaming + rendering layer</strong> that formats answers and “hydrates” them with live store data</li>



<li><strong>Feedback-driven improvement</strong> (reinforcement learning from customer feedback)</li>
</ul>



<p>Speculative schema:</p>



<pre class="wp-block-code has-small-font-size"><code>User question
  -&gt; Query Planner (intent + retrieval plan)
    -&gt; Retrieval (catalog/reviews/Q&amp;A/Stores APIs/(sometimes web))
      -&gt; Foundation LLM (answer generation + display directives)
        -&gt; Streaming response (token-by-token)
          -&gt; Hydration (fill in product cards, prices, etc via internal systems)
            -&gt; Client UI (chat text + cards + actions + suggested questions)</code></pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Pipeline: request → answer</h2>



<h3 class="wp-block-heading">Step A — Input + context assembly</h3>



<p>Public descriptions indicate customers can:</p>



<ul class="wp-block-list">
<li>Type or speak questions in the Amazon Shopping app search bar / assistant chat bar</li>



<li>Start from <strong>pre-populated / suggested questions</strong> in the UI</li>



<li>Ask questions either broadly (“what do I need for…”) or specifically on a product page (where the product detail context matters)</li>
</ul>



<p>Amazon also describes using <strong>conversational context</strong> and (more recently) <strong>account memory</strong> features for personalization.</p>



<h3 class="wp-block-heading">Step B — Query planning (QP) before generation</h3>



<p>AWS’s ML blog post describes Rufus as having:</p>



<ul class="wp-block-list">
<li>A <strong>foundation LLM</strong> (for response generation)</li>



<li>A <strong>query planner (QP) model</strong> for <strong>query classification and retrieval enhancement</strong></li>



<li>QP is “on the critical path” because the system <strong>can’t start token generation</strong> until QP finishes</li>
</ul>



<p>That implies a gate: <strong>planning first</strong>, then generation.</p>



<h3 class="wp-block-heading">Step C — Retrieval-augmented generation (RAG)</h3>



<p>Amazon Science describes Rufus using <strong>retrieval‑augmented generation (RAG)</strong>:</p>



<ul class="wp-block-list">
<li>Before generating a response, the LLM <strong>selects information</strong> it expects will help answer the question.</li>



<li>Evidence sources explicitly called out include:</li>



<li><strong>Customer reviews</strong></li>



<li><strong>The product catalog</strong></li>



<li><strong>Community Q&amp;A</strong></li>



<li><strong>Stores APIs</strong> (calls to internal store systems)</li>
</ul>



<p>About Amazon also describes using RAG to pull “insights and recommendations” from “popular sources” for some product/trend questions (they name examples like major publications).</p>



<p>What’s not disclosed publicly:</p>



<ul class="wp-block-list">
<li>How retrieval is ranked across sources</li>



<li>The retrieval index design</li>



<li>Exact prompting / grounding format</li>



<li>Exact guardrails for what external web content can be used and how</li>
</ul>



<h3 class="wp-block-heading">Step D — Response generation (LLM)</h3>



<p>Amazon Science says the team built a <strong>custom LLM specialized for shopping</strong>, trained primarily on shopping data (catalog + reviews + community Q&amp;A) plus curated public web information.</p>



<p>About Amazon also describes a <strong>model-mix</strong> approach:</p>



<ul class="wp-block-list">
<li>Built on <strong>Amazon Bedrock</strong></li>



<li>Using a <strong>real-time router</strong> that can select among multiple LLMs (they explicitly name models like Anthropic’s Claude Sonnet, Amazon Nova, plus a custom model)</li>
</ul>



<p>So the public picture is: <strong>custom shopping model exists</strong>, and there may also be <strong>dynamic model selection</strong> depending on query type / latency / quality targets.</p>



<h3 class="wp-block-heading">Step E — Streaming + “hydration” + UI rendering</h3>



<p>Amazon Science describes a “streaming architecture”:</p>



<ul class="wp-block-list">
<li>Responses are <strong>streamed token-by-token</strong> (so the user sees the beginning while the rest is still generating).</li>



<li>The system “hydrates” the response by <strong>querying internal systems</strong> to populate the stream with the right data.</li>



<li>Crucially: Rufus is trained to generate <strong>markup instructions</strong> specifying how answer elements should be displayed, not just the text.</li>
</ul>



<p>This is the key “anatomy of a Rufus response” insight: <strong>the model output is both content and layout directives</strong>, and the backend fills in live store objects (prices, items, links, etc.) during streaming.</p>



<p>What’s not disclosed publicly:</p>



<ul class="wp-block-list">
<li>The markup language/schema</li>



<li>The exact rendering protocol between model ↔ hydrator ↔ client</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Backend: training data, infra, and latency engineering</h2>



<h3 class="wp-block-heading">Training data and preparation (what Amazon has said)</h3>



<p>Amazon Science states Rufus was trained with:</p>



<ul class="wp-block-list">
<li>The <strong>entire Amazon catalog</strong></li>



<li><strong>Customer reviews</strong></li>



<li><strong>Community Q&amp;A posts</strong></li>



<li>Curated <strong>public web information</strong></li>
</ul>



<p>And that Amazon used:</p>



<ul class="wp-block-list">
<li><strong>Amazon EMR</strong> for large-scale distributed data processing</li>



<li><strong>Amazon S3</strong> for storage</li>
</ul>



<h3 class="wp-block-heading">Inference infrastructure: Trainium/Inferentia + compiler optimizations</h3>



<p>Amazon Science describes serving at Amazon scale using:</p>



<ul class="wp-block-list">
<li>AWS chips <strong>Trainium</strong> and <strong>Inferentia</strong></li>



<li>Collaboration with the <strong>Neuron compiler</strong> team for inference optimizations</li>



<li><strong>Continuous batching</strong> to improve throughput/latency (described as making scheduling/routing decisions after every generated token so new requests can start as soon as earlier ones finish)</li>
</ul>



<h3 class="wp-block-heading">Prime Day scale + “parallel decoding” for QP latency</h3>



<p>AWS’s ML blog post goes much deeper on one backend component (the <strong>QP model</strong>) and performance engineering:</p>



<ul class="wp-block-list">
<li>Prime Day demands described include very high query rates and tight latency SLOs for QP.</li>



<li>They describe using “draft‑centric speculative decoding” / “parallel decoding”:</li>



<li>Extending the base model with <strong>multiple decoding heads</strong> to predict multiple future tokens in parallel</li>



<li>A <strong>tree-based attention</strong> mechanism to verify/integrate predicted tokens</li>



<li>Deployed using AWS infrastructure + chips (Trainium/Inferentia), and mentions integration details (for example, they mention Triton Inference Server support and Neuron-related frameworks).</li>
</ul>



<p>This is one of the clearest official public descriptions of “backend mechanics” for Rufus, specifically for the planning model that sits <em>before</em> the user sees the first chunk of an answer.</p>
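<p>For intuition, here is a toy version of the generic draft-and-verify idea behind such parallel decoding. This is a pedagogical sketch only; real systems (including, presumably, Rufus&#8217;s QP stack) verify all drafted positions in a single batched forward pass rather than one token at a time:</p>



<pre class="wp-block-code has-small-font-size"><code># Toy draft-and-verify ("speculative") decoding. Pedagogical only: the
# verification loop here is sequential, whereas real implementations
# check all k drafted positions in one parallel forward pass.

def speculative_step(base_model, draft_model, tokens, k=4):
    # 1) A cheap draft proposes k future tokens.
    ctx = list(tokens)
    proposed = []
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) The base model verifies; keep the longest prefix it agrees with.
    accepted, ctx = [], list(tokens)
    for t in proposed:
        if base_model(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # 3) Emit at least one base-model token so progress is guaranteed.
    if len(accepted) &lt; k:
        accepted.append(base_model(ctx))
    return tokens + accepted

# Toy "models": next token is a function of the running sum.
base = lambda toks: (sum(toks) + 1) % 50
draft = lambda toks: (sum(toks) + 3) % 50  # imperfect draft
print(speculative_step(base, draft, [3, 7, 11]))</code></pre>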



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Response format: what users see vs what the system likely contains</h2>



<h3 class="wp-block-heading">What the user-visible response can include (publicly described)</h3>



<p>Across Amazon’s public descriptions, Rufus responses can include:</p>



<ul class="wp-block-list">
<li><strong>Long-form explanations</strong> (e.g., product category advice)</li>



<li><strong>Short-form answers</strong></li>



<li><strong>Clickable links</strong> to navigate the store</li>



<li><strong>Product recommendations</strong> (often rendered as product cards)</li>



<li><strong>Comparisons</strong> (e.g., “compare OLED vs QLED”)</li>



<li><strong>Suggested follow-up questions</strong> surfaced in the chat UI</li>



<li>“<strong>What do customers say?</strong>” style review summaries / highlights</li>



<li><strong>Price-history and deal features</strong> (including price tracking/alerts), plus cart actions in newer “agentic” iterations</li>
</ul>



<h3 class="wp-block-heading">What the backend response likely contains</h3>



<p>Based on Amazon’s own wording (“markup instructions” + “hydration” + token streaming), the response payload is best thought of as:</p>



<ul class="wp-block-list">
<li>A <strong>streamed text channel</strong> (tokens)</li>



<li>A <strong>structured directive channel</strong> (layout + which UI modules to render)</li>



<li><strong>Hydration lookups</strong> that fill directives with authoritative store data (products, prices, shipping, deal status, etc.)</li>
</ul>



<p>Amazon has not published the schema, so any JSON examples would be guesswork.</p>
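


<p>Purely to visualise the three channels, though, here is an invented payload shape expressed as a Python literal. To be clear: this is guesswork by construction, and none of the field names come from Amazon:</p>



<pre class="wp-block-code"><code># Hypothetical payload shape. Invented for illustration: NOT Amazon's schema.
hypothetical_payload = {
    "text_stream": ["For", " trail", " running", ", ", "look", " for", "..."],
    "directives": [  # structured channel: which UI modules to render, where
        {"module": "product_card", "slot": 1, "ref": "ASIN_PLACEHOLDER"},
        {"module": "followup_questions", "items": ["...?", "...?"]},
    ],
    "hydration": {  # filled in from authoritative store data at render time
        "ASIN_PLACEHOLDER": {"price": None, "shipping": None, "deal": None},
    },
}</code></pre>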



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">What’s not public</h2>



<ul class="wp-block-list">
<li>Exact model architectures/sizes for the custom model(s)</li>



<li>The router policy (how it chooses among models)</li>



<li>Exact retrieval ranking, indexing, and grounding format</li>



<li>The markup instruction language/schema</li>



<li>Safety/guardrail implementation details (beyond high-level “reliable sources” language)</li>



<li>Full evaluation suite and offline metrics used to ship changes</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Sources</h2>



<p>Below are official sources only (Amazon Science, AWS, About Amazon Press Center, Investor Relations).</p>



<h3 class="wp-block-heading">Technical deep dives</h3>



<pre class="wp-block-code has-small-font-size"><code>Amazon Science (Blog): “The technology behind Amazon’s GenAI-powered shopping assistant, Rufus” (Oct 4, 2024)
https://www.amazon.science/blog/the-technology-behind-amazons-genai-powered-shopping-assistant-rufus

AWS Machine Learning Blog: “How Rufus doubled their inference speed and handled Prime Day traffic with AWS AI chips and parallel decoding” (May 28, 2025)
https://aws.amazon.com/blogs/machine-learning/how-rufus-doubled-their-inference-speed-and-handled-prime-day-traffic-with-aws-ai-chips-and-parallel-decoding/</code></pre>



<h3 class="wp-block-heading">Product/feature announcements &amp; official descriptions</h3>



<pre class="wp-block-code has-small-font-size"><code>About Amazon (Retail): “Amazon’s next-gen AI assistant for shopping is now even smarter, more capable, and more helpful”
https://www.aboutamazon.com/news/retail/amazon-rufus-ai-assistant-personalized-shopping-features

About Amazon (Retail): “How to use Rufus to check price history, find deals, auto-buy items at target prices, and more”
https://www.aboutamazon.com/news/retail/how-to-use-amazon-shopping-ai-assistant

About Amazon (Retail): “How customers are making more informed shopping decisions with Rufus…”
https://www.aboutamazon.com/news/retail/how-to-use-amazon-rufus

About Amazon (Retail): “Rufus is now available to all U.S. customers…” (amazon.com page linked from About Amazon)
https://www.amazon.com/b?node=23404839011</code></pre>



<h3 class="wp-block-heading">Press releases / investor communications</h3>



<pre class="wp-block-code has-small-font-size"><code>Amazon Investor Relations: “Amazon.com Announces Fourth Quarter Results” (Feb 01, 2024) — includes the initial public mention of Rufus beta rollout
https://ir.aboutamazon.com/news-release/news-release-details/2024/Amazon.com-Announces-Fourth-Quarter-Results/

About Amazon Press Center (US): “Amazon Bedrock launches new capabilities…” (Apr 2024) — includes a Rufus description and quote
https://press.aboutamazon.com/2024/4/amazon-bedrock-launches-new-capabilities-as-tens-of-thousands-of-customers-choose-it-as-the-foundation-to-build-and-scale-secure-generative-ai-applications

About Amazon Press Center (US): “Amazon Announces Record-Breaking Sales for 2024 Prime Day Event” (Jul 18, 2024) — notes Rufus helping millions of customers
https://press.aboutamazon.com/2024/7/amazon-announces-record-breaking-sales-for-2024-prime-day-event

Amazon Investor Relations: “Amazon.com Announces Fourth Quarter Results” (2026 release page) — mentions agentic Rufus / Buy For Me
https://ir.aboutamazon.com/news-release/news-release-details/2026/Amazon-com-Announces-Fourth-Quarter-Results/default.aspx</code></pre>



<h3 class="wp-block-heading">Amazon Science research papers</h3>



<p>These are not “Rufus documentation,” but they map directly to components Amazon describes (question suggestion, comparisons, RAG planning, preference extraction).</p>



<pre class="wp-block-code has-small-font-size"><code>Publication (SIGIR 2024): “Question suggestion for conversational shopping assistants using product metadata”
https://www.amazon.science/publications/question-suggestion-for-conversational-shopping-assistants-using-product-metadata

PDF (SIGIR 2024):
https://assets.amazon.science/42/6e/c7c7aed9433d87fd1ab1f8bef4ff/question-suggestion-for-conversational-shopping-assistants-using-product-metadata.pdf

Publication (WSDM 2023): “Generating explainable product comparisons for online shopping”
https://www.amazon.science/publications/generating-explainable-product-comparisons-for-online-shopping

Publication (CIKM 2024): “REAPER: Reasoning based retrieval planning for complex RAG systems”
https://www.amazon.science/publications/reaper-reasoning-based-retrieval-planning-for-complex-rag-systems

Publication (EMNLP 2024): “PEARL: Preference extraction with exemplar augmentation and retrieval with LLM agents”
https://www.amazon.science/publications/pearl-preference-extraction-with-exemplar-augmentation-and-retrieval-with-llm-agents

Publication (2024): “Meta knowledge for retrieval augmented large language models”
https://www.amazon.science/publications/meta-knowledge-for-retrieval-augmented-large-language-models</code></pre>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/rufus/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Is Query Length a Reliable Predictor of Search Volume?</title>
		<link>https://dejan.ai/blog/query-length-vs-volume/</link>
					<comments>https://dejan.ai/blog/query-length-vs-volume/#comments</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Thu, 12 Mar 2026 00:29:29 +0000</pubDate>
				<category><![CDATA[eCommerce]]></category>
		<category><![CDATA[Keyword Research]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2278</guid>

					<description><![CDATA[The answer is no. There&#8217;s a widely held intuition in SEO and ecommerce search: short queries have high volume, long queries have low volume. &#8220;laptop&#8221; gets millions of searches. &#8220;left handed ergonomic vertical mouse wireless&#8221; does not. It feels obvious. But is query length actually a reliable predictor of search volume? Or is it a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><strong>The answer is no.</strong></p>



<p>There&#8217;s a widely held intuition in SEO and ecommerce search: short queries have high volume, long queries have low volume. &#8220;laptop&#8221; gets millions of searches. &#8220;left handed ergonomic vertical mouse wireless&#8221; does not. It feels obvious.</p>



<p>But is query length actually a <em>reliable</em> predictor of search volume? Or is it a convenient heuristic that falls apart under scrutiny?</p>



<p>I tested this using 39.6 million unique Amazon search queries with known volume data, spanning everything from head terms like &#8220;airpods&#8221; to long-tail queries like &#8220;replacement gasket for instant pot duo 8 quart.&#8221; The results surprised me.</p>



<div class="wp-block-buttons alignwide is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button">Try Our Query Volume Classifier</a></div>
</div>



<h2 class="wp-block-heading">The Setup</h2>



<p>I bucketed queries into five volume classes based on their occurrence count across nearly 400 million Amazon search sessions:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Class</th><th>Occurrences</th><th>Unique Queries</th></tr></thead><tbody><tr><td>Very High</td><td>10,000+</td><td>~18K</td></tr><tr><td>High</td><td>1,000–9,999</td><td>~30K</td></tr><tr><td>Medium</td><td>100–999</td><td>~321K</td></tr><tr><td>Low</td><td>10–99</td><td>~4.6M</td></tr><tr><td>Very Low</td><td>&lt;10</td><td>~34.7M</td></tr></tbody></table></figure>



<p>Then I measured two simple length metrics — character count and word count — across a balanced sample of 5,000 queries per class. The question: can you predict volume class from length alone?</p>
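


<p>The length features themselves are trivial to compute. A minimal sketch of the measurement, using a stand-in for the balanced sample (class labels taken from the examples discussed in this post):</p>



<pre class="wp-block-code"><code># Minimal sketch of the length measurement. The four rows are a stand-in
# for the balanced sample; labels follow the examples discussed in the post.
import pandas as pd

df = pd.DataFrame({
    "query": ["laptop", "wireless mouse", "cast iron skillet 12 inch",
              "replacement gasket for instant pot duo 8 quart"],
    "volume_class": ["very_high", "very_high", "medium", "very_low"],
})
df["chars"] = df["query"].str.len()
df["words"] = df["query"].str.split().str.len()

print(df.groupby("volume_class")[["chars", "words"]].agg(["mean", "median"]))</code></pre>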



<h2 class="wp-block-heading">The Averages Look Promising</h2>



<p>At first glance, the data confirms the intuition. There&#8217;s a clean trend:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Volume Class</th><th>Avg Characters</th><th>Avg Words</th><th>Median Characters</th></tr></thead><tbody><tr><td>Very High</td><td>16.0</td><td>2.6</td><td>16</td></tr><tr><td>High</td><td>17.2</td><td>2.8</td><td>16</td></tr><tr><td>Medium</td><td>19.6</td><td>3.2</td><td>19</td></tr><tr><td>Low</td><td>22.3</td><td>3.7</td><td>21</td></tr><tr><td>Very Low</td><td>23.2</td><td>3.9</td><td>22</td></tr></tbody></table></figure>



<p>Very high volume queries average 16 characters and 2.6 words. Very low volume queries average 23 characters and 3.9 words. The pattern is monotonic and statistically significant (p ≈ 0). Case closed?</p>



<p>Not quite.</p>



<h2 class="wp-block-heading">The Distributions Tell a Different Story</h2>



<p>The problem becomes obvious when you look at the actual distributions instead of the averages. The character count distributions for all five classes overlap <em>almost entirely</em>:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="569" src="https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-1024x569.png" alt="" class="wp-image-2279" srcset="https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-1024x569.png 1024w, https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-300x167.png 300w, https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-768x427.png 768w, https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-1536x853.png 1536w, https://dejan.ai/wp-content/uploads/2026/03/heuristic_analysis-2048x1138.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<ul class="wp-block-list">
<li>A 14-character query could be very high volume (&#8220;wireless mouse&#8221;) or very low volume (&#8220;purple cat bed&#8221;)</li>



<li>A two- or three-word query could be anything from very high (&#8220;protein powder&#8221;) to very low (&#8220;bamboo utensil set&#8221;)</li>



<li>The median difference between very high and very low is only 6 characters</li>
</ul>



<p>When every class shares most of the same length range, length simply can&#8217;t discriminate between them.</p>



<h2 class="wp-block-heading">Quantifying the Failure</h2>



<p>To put a number on it, I built simple heuristic classifiers — one using character count, one using word count — that bin queries into volume classes based on percentile thresholds. For a fair comparison, I also trained a DeBERTa language model on the same data to predict volume class from the query text itself.</p>
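


<p>Concretely, the length heuristic looks something like this (a sketch: thresholds are the quintiles of training-set lengths, which are the natural bins given the balanced five-class design):</p>



<pre class="wp-block-code"><code># Quintile-threshold heuristic: a sketch of the length baseline described above.
import numpy as np

# Shortest lengths map to the highest-volume class, matching the observed trend.
CLASSES = ["very_high", "high", "medium", "low", "very_low"]

def fit_thresholds(train_lengths):
    # 20/40/60/80th percentiles split the lengths into five bins
    return np.percentile(train_lengths, [20, 40, 60, 80])

def predict(length, thresholds):
    return CLASSES[int(np.searchsorted(thresholds, length, side="right"))]

train = ["tv", "dog food", "usb c hub", "cast iron skillet 12 inch",
         "replacement gasket for instant pot duo 8 quart"]  # toy training set
th = fit_thresholds([len(q) for q in train])
print(predict(len("laptop"), th))  # short query, binned as very_high</code></pre>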



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="693" height="736" src="https://dejan.ai/wp-content/uploads/2026/03/image-1.png" alt="" class="wp-image-2288" srcset="https://dejan.ai/wp-content/uploads/2026/03/image-1.png 693w, https://dejan.ai/wp-content/uploads/2026/03/image-1-282x300.png 282w" sizes="auto, (max-width: 693px) 100vw, 693px" /></figure>



<p>The results:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Method</th><th>Accuracy</th><th>Spearman Correlation</th></tr></thead><tbody><tr><td><strong>DeBERTa model</strong></td><td><strong>72.1%</strong></td><td><strong>0.896</strong></td></tr><tr><td>Word count heuristic</td><td>25.4%</td><td>-0.345</td></tr><tr><td>Char count heuristic</td><td>24.9%</td><td>-0.336</td></tr></tbody></table></figure>



<p>The length heuristics achieved roughly 25% accuracy — barely above random chance for a 5-class problem (20%). The Spearman correlation between true volume class and query length is only -0.34. For comparison, the trained model achieved 0.90.</p>



<p>The agreement rate between the model&#8217;s predictions and the length heuristic&#8217;s predictions? Just 24–25%. They mostly disagree, meaning the model is learning something fundamentally different from query length.</p>



<h2 class="wp-block-heading">What Does the Model Actually Learn?</h2>



<p>If not length, what signals is the model picking up? Looking at its predictions reveals some patterns:</p>



<p><strong>Brand recognition.</strong> &#8220;airpods&#8221; (9 chars) → very high. The model learns that certain brand names are inherently high-volume. A character-count heuristic has no concept of brand equity.</p>



<p><strong>Category head terms.</strong> &#8220;laptop&#8221; and &#8220;headphones&#8221; and &#8220;dog food&#8221; — the model recognizes generic product categories that serve as entry points for broad shopping intent. These are short, but their volume comes from <em>being category names</em>, not from being short.</p>



<p><strong>Specificity markers.</strong> &#8220;cast iron skillet 12 inch&#8221; → medium. &#8220;replacement gasket for instant pot duo 8 quart&#8221; → very low. Both are moderately long, but the model distinguishes them based on how many qualifiers narrow the intent. Size specifications, compatibility constraints, and material callouts are signals of niche demand.</p>



<p><strong>The middle is messy.</strong> The model struggles most with the low class (F1: 0.39), which sits in an ambiguous zone between medium and very low. These queries are often 3–4 words, moderately specific, and could plausibly land in either adjacent bucket. This is arguably a labeling boundary problem more than a modeling problem.</p>



<h2 class="wp-block-heading">Why the Intuition Persists</h2>



<p>The &#8220;short = high volume&#8221; heuristic isn&#8217;t <em>wrong</em> — it&#8217;s just <em>weak</em>. There is a real negative correlation between length and volume. The averages are monotonic. If you had to make a single binary bet — &#8220;is this 2-word query higher volume than this 7-word query?&#8221; — you&#8217;d be right more often than not.</p>



<p>But for any practical application — keyword prioritization, bid optimization, content strategy — a 25% accuracy classifier is useless. You&#8217;d misclassify three out of four queries.</p>



<p>The fundamental issue is that query length is a <em>confounded</em> signal. Short queries aren&#8217;t high volume <em>because</em> they&#8217;re short. They&#8217;re high volume because they tend to be generic category terms or popular brand names, and those things happen to be expressible in few words. The causal arrow runs from semantic content to volume, with length as a side effect.</p>



<h2 class="wp-block-heading">The &#8216;Nonsense Test&#8217;</h2>



<p>As a final sanity check, I ran the model on completely made-up queries of varying lengths. If the model were simply learning &#8220;short = high volume,&#8221; nonsensical short queries should still predict high volume. They don&#8217;t.</p>



<pre class="wp-block-code" style="font-size:0.8rem"><code>Query                                              Prediction   Conf
--------------------------------------------------------------------
zxqwv                                                very_low  52.9%
blorf                                                very_low  50.0%
aa                                                       high  55.8%
flurb snax                                           very_low  63.1%
gleep borp                                           very_low  54.6%
wonky plim dazzle                                    very_low  50.3%
grax tooble fent                                     very_low  57.6%
blorpy zint crumble woft                             very_low  59.3%
quax shimble trogg fleem narg                        very_low  59.9%
zixo tramble woft greel spunt naffle blorvish        very_low  62.5%
wireless blorf adapter                               very_low  64.5%
organic flurb capsules                               very_low  72.9%
replacement grax for shimble 8 quart                 very_low  76.2%
x                                                        high  93.1%
q                                                        high  91.9%
asdfghjkl                                            very_low  52.4%
aaa bbb ccc ddd eee fff ggg                          very_low  57.5%</code></pre>



<p>Nearly every nonsensical query &#8212; regardless of length &#8212; is classified as very low volume. One-word gibberish like &#8220;blorf&#8221; and &#8220;zxqwv&#8221; isn&#8217;t mistaken for a head term just because it&#8217;s short.</p>



<p>The exceptions are telling. &#8220;x&#8221; and &#8220;q&#8221; predict high with 93% confidence — because single-letter searches are genuinely common on Amazon (people search &#8220;q&#8221; for Q-tips, &#8220;x&#8221; for Xbox). &#8220;aa&#8221; predicts high because AA batteries are a real product. The model has learned <em>what people actually search for</em>, not how many characters they typed.</p>



<p>Meanwhile, queries with real English structure but nonsense nouns — &#8220;wireless blorf adapter,&#8221; &#8220;organic flurb capsules&#8221; — are confidently classified as very low. The model recognizes the product-query template but knows &#8220;blorf&#8221; isn&#8217;t a real product. It even assigns higher confidence to &#8220;replacement grax for shimble 8 quart&#8221; (76.2%) because the long-tail structure <em>plus</em> unrecognizable nouns is a double signal of obscurity.</p>



<p>The confidence scores are also well-calibrated: nonsense queries hover around 50–60% confidence, reflecting genuine uncertainty, while real queries like &#8220;laptop&#8221; or &#8220;airpods&#8221; score 93%+. The model knows what it doesn&#8217;t know.</p>



<h2 class="wp-block-heading">Implications</h2>



<p><strong>For SEO/SEM practitioners:</strong> Don&#8217;t use query length as a proxy for volume in your tooling or mental models. A 2-word query can easily be very low volume (&#8220;argon regulator&#8221;), and a 5-word query can be high volume (&#8220;noise cancelling earbuds for sleeping&#8221;). Use actual volume data, or if you need estimates, use a model trained on semantics.</p>



<p><strong>For search engineers:</strong> Query length features may add marginal value in a volume prediction model, but they&#8217;re dominated by semantic features. A language model that understands what queries <em>mean</em> dramatically outperforms one that counts characters.</p>



<p><strong>For data scientists:</strong> This is a nice reminder that when averages show a clean trend, always check the distributions. A monotonic trend in means can coexist with nearly complete overlap in distributions — and the overlap is what determines classifier performance.</p>
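


<p>This is easy to reproduce in a toy simulation (all numbers below are invented, with means chosen to mirror the trend in the table above):</p>



<pre class="wp-block-code"><code># Toy demonstration: monotonic means with almost completely overlapping
# distributions. All numbers invented; means mirror the trend in the table.
import numpy as np

rng = np.random.default_rng(0)
means = [16, 17, 20, 22, 23]
samples = [rng.normal(m, 8, 5000).clip(1) for m in means]

for m, s in zip(means, samples):
    lo, hi = np.percentile(s, 5), np.percentile(s, 95)
    print(f"mean {s.mean():5.1f}   5th-95th percentile: {lo:5.1f} to {hi:5.1f}")
# The percentile ranges overlap almost entirely, so no length threshold can
# separate the classes, despite the clean trend in the means.</code></pre>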



<h2 class="wp-block-heading">Methodology Note</h2>



<ul class="wp-block-list">
<li>Dataset: Amazon Shopping Queries, 395.5M sessions, 39.6M unique queries</li>



<li>Model: DeBERTa v3 base, fine-tuned for 20 epochs on balanced samples (30K–100K per class); a minimal replication sketch follows this list</li>



<li>Heuristic classifiers: quintile-based binning on character/word count</li>



<li>Evaluation: 25K balanced sample (5K per class), Spearman rank correlation, classification accuracy</li>



<li>All code and data processing done in DuckDB + PyTorch</li>
</ul>
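


<p>For readers who want to replicate the modeling step, a minimal fine-tuning sketch using Hugging Face Transformers follows. File paths, batch size, and sequence length are assumptions, not the exact recipe used here:</p>



<pre class="wp-block-code"><code># Minimal DeBERTa-v3 fine-tuning sketch. Paths and hyperparameters are
# illustrative assumptions, not the exact recipe used in this study.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=5)  # five volume classes

# Assumed CSV layout: columns "query" and "label" (0-4, one per volume class)
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = ds.map(lambda b: tok(b["query"], truncation=True, max_length=64),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="query-volume-classifier",
                           num_train_epochs=20,
                           per_device_train_batch_size=64),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tok,
)
trainer.train()</code></pre>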



<div class="wp-block-buttons alignwide is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex">
<div class="wp-block-button"><a class="wp-block-button__link wp-element-button">Try Our Query Volume Classifier</a></div>
</div>



]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/query-length-vs-volume/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Search Grounding is Transient</title>
		<link>https://dejan.ai/blog/search-grounding-is-transient/</link>
					<comments>https://dejan.ai/blog/search-grounding-is-transient/#respond</comments>
		
		<dc:creator><![CDATA[Dan Petrovic]]></dc:creator>
		<pubDate>Fri, 06 Mar 2026 06:17:40 +0000</pubDate>
				<category><![CDATA[AI SEO]]></category>
		<category><![CDATA[Google]]></category>
		<guid isPermaLink="false">https://dejan.ai/?p=2274</guid>

					<description><![CDATA[There is a fundamental misconception about how Google&#8217;s AI search and Gemini chatbot process retrieved web content. It is widely understood that these systems use Retrieval-Augmented Generation (RAG) to search the web, pull snippets from pages, and ground their answers in factual data. However, there is a pervasive assumption that once an AI pulls in [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>There is a fundamental misconception about how Google&#8217;s AI search and Gemini chatbot process retrieved web content. It is widely understood that these systems use Retrieval-Augmented Generation (RAG) to search the web, pull snippets from pages, and ground their answers in factual data.</p>



<p>However, there is a pervasive assumption that once an AI pulls in a page, it &#8220;reads&#8221; it and retains that raw source material in its working memory for the duration of the conversation.</p>



<p>It doesn&#8217;t.</p>



<p>An AI&#8217;s memory of actual web page content is bound by a &#8220;single-turn transient&#8221; architecture. The following is a breakdown of the mechanics behind this phenomenon and how it redefines the relationship between AI models and web content.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">The Experiment: Exposing the Mechanism</h3>



<p>The reality of transient memory was recently demonstrated through a user-driven &#8220;meta-test&#8221; designed to probe a major language model&#8217;s grounding capabilities. The interaction unfolded in three steps:</p>



<ol class="wp-block-list">
<li><strong>The Setup:</strong> The user prompted the search-enabled AI to look up a well-known industry figure and list the URLs of the sources it used.</li>



<li><strong>The Execution:</strong> The system triggered a live web search, extracted snippets from the search results, and fed them into the language model&#8217;s context. The AI successfully generated a list of the source URLs.</li>



<li><strong>The Trap:</strong> In the immediate next prompt, the user asked: <em>&#8220;Do you still have the grounding snippet for the visisummit page?&#8221;</em></li>
</ol>



<p>The AI could no longer access the snippet. Stripped of the raw data, the model became confused about its own previous output, incorrectly assuming it must have hallucinated the original search.</p>



<p>This interaction successfully isolated the underlying mechanism: the moment an AI finishes generating its response, the raw source data is entirely purged from its working memory.</p>



<h3 class="wp-block-heading">The Architecture of Forgetting</h3>



<p>This rapid deletion is a byproduct of the &#8220;Token Economy.&#8221; AI context windows—the amount of text a model can process simultaneously—are computationally expensive and strictly limited. To manage memory efficiently, search-enabled chatbots operate on a highly restrictive cycle:</p>



<ul class="wp-block-list">
<li><strong>Turn 1 (The Search):</strong> A query is submitted. The AI triggers a search tool. The system temporarily injects the raw search results (the grounding snippets) into the AI’s context window so it can formulate an answer.</li>



<li><strong>The Purge:</strong> The millisecond the AI completes its response, the system discards all raw snippets to free up token space.</li>



<li><strong>Turn 2 (The Next Prompt):</strong> When a follow-up question is asked, the AI has zero access to the original website text. It retains only the conversational history—meaning it operates solely on the <em>summary</em> it just generated.</li>
</ul>



<p>It is akin to an open-book test where the test-taker is allowed to look at a source text for exactly one minute. Once an answer is written down, the book is permanently closed. For the remainder of the test, the individual can only reference their own handwritten notes.</p>
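


<p>In pseudocode, the cycle described above looks something like this. It is a conceptual sketch of the pattern, not any vendor&#8217;s actual implementation:</p>



<pre class="wp-block-code"><code># Conceptual sketch of single-turn transient grounding. Not any vendor's
# actual implementation; names are hypothetical.
history = []  # persists across turns: user prompts and model answers only

def answer_turn(user_prompt, search_tool, model):
    snippets = search_tool(user_prompt)           # raw grounding snippets
    context = history + [user_prompt] + snippets  # injected for THIS turn only
    reply = model(context)
    history.append(user_prompt)
    history.append(reply)
    # The snippets go out of scope here. They are never written to `history`,
    # so the next turn can only see the model's own generated summary.
    return reply</code></pre>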



<p>The broader context of a web page effectively ceases to exist the moment the first turn ends. What survives is only what was captured in the initial snippet, filtered through the AI&#8217;s immediate interpretation.</p>



<p>Ultimately, AI chatbots do not comprehensively absorb websites. They glance at fleeting flashcards, write down a quick summary, and immediately dispose of the source material—leaving them to converse exclusively with their own notes.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dejan.ai/blog/search-grounding-is-transient/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
